modules.statistics

Classes

class modules.statistics.TableDescribe

Apply the pandas DataFrame describe() method to TableData and provide the described table together with selected statistics.

This module generates descriptive statistics for the input table, supporting various data types through the include parameter. For numeric columns, it provides statistics like mean, std, min, max, etc. For non-numeric columns (e.g., object/string), it provides count, unique, top, and frequency.
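As a rough illustration of how the include parameter changes the output, the sketch below calls pandas describe() directly on a small DataFrame. It assumes TableDescribe forwards include to DataFrame.describe() unchanged; the DataFrame and its column names are made up for the example.

```python
import pandas as pd

# Hypothetical sample data: one numeric column and one object (string) column.
df = pd.DataFrame({
    "depth": [1.0, 2.0, 3.0, 4.0],
    "rock": pd.Series(["granite", "granite", "shale", "granite"], dtype="object"),
})

# Default: only numeric columns are described (count, mean, std, min, quartiles, max).
numeric = df.describe()

# include="all": every column is described; statistics that do not apply are NaN.
everything = df.describe(include="all")

# include=["object"]: only object columns (count, unique, top, freq).
text_only = df.describe(include=["object"])

print(numeric)
print(text_only)
```

With include="all" the result contains the union of both sets of statistics, so numeric columns show NaN for unique/top/freq and object columns show NaN for mean/std and the quantiles.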

Examples

Describe only numeric columns (default)
>>> describe_module = TableDescribe(table=my_table)

Describe all columns including strings
>>> describe_module = TableDescribe(table=my_table, include='all')

Describe only string/object columns
>>> describe_module = TableDescribe(table=my_table, include=['object'])

Describe specific numeric types
>>> describe_module = TableDescribe(table=my_table, include=['float64', 'int64'])

Select specific statistics for ResultModel output
>>> describe_module = TableDescribe(
...     table=my_table,
...     selected_pairs={"columnA": ["mean", "std"], "columnB": ["min", "max"]}
... )
Inherits from:

PipeModule

Methods:

__init__(mname: str = 'TableDescribe', auto_run: bool = True, table: PortTypeHint.TableData | None = None, selected_pairs: dict[str, list[str]] | None = None, describe_columns: list[str] | None = None, include: Literal['all'] | list[str] | None = None) None

Initialize the TableDescribe object.

Parameters

table : PortTypeHint.TableData | None, default: None
    The table to describe.

selected_pairs : dict[str, list[str]] | None, default: None
    Dictionary mapping column names to lists of statistics to select for ResultModel output. Key is column name, value is list of statistic names.
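One way to picture what selected_pairs picks out is a lookup into the described table by statistic name and column name. The sketch below uses plain pandas and a plain dictionary as a stand-in; how the module actually builds its ResultModel is not documented here, so the `selected` structure is an assumption for illustration only.

```python
import pandas as pd

# Hypothetical input data with the column names used in the class example.
df = pd.DataFrame({"columnA": [1.0, 2.0, 3.0], "columnB": [10, 20, 60]})
described = df.describe()

selected_pairs = {"columnA": ["mean", "std"], "columnB": ["min", "max"]}

# For each column, read the requested statistics from the described table.
# Rows of `described` are statistic names, columns are the input columns.
selected = {
    column: {stat: float(described.loc[stat, column]) for stat in stats}
    for column, stats in selected_pairs.items()
}

print(selected)
```

Statistic names must match the row labels produced by describe() (for example "mean", "std", "min", "max", "50%"), otherwise the lookup raises a KeyError.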

Examples

- include='all': Describe all columns including strings
- include=['object']: Describe only string/object columns
- include=['number']: Describe all numeric columns (int + float)
- include=['float64', 'int64']: Describe only float64 and int64 columns
- include=['object', 'bool']: Describe string and boolean columns

Notes

(e.g., count, unique, top, freq for object types)
execute() PortTypeHint.TableData | None

Attributes:

InputTable: PortReference[PortTypeHint.TableData]
OutputStatTable: PortReference[PortTypeHint.TableData]
OutputResultModel: PortReference[PortTypeHint.ResultModel]
class modules.statistics.ValuesCount

Count unique combinations of values in TableData columns.

This module wraps the pandas DataFrame value_counts() method for TableData, supporting single and multiple column counting, with options for normalization, sorting, and handling NA values.

Inherits from:

PipeModule

Methods:

__init__(mname: str = 'ValuesCount', auto_run: bool = True, table: PortTypeHint.TableData | None = None, subset: str | list[str] | None = None, normalize: bool = False, sort: bool = True, ascending: bool = False, dropna: bool = True) None

Initialize the ValuesCount object.

Parameters

table : PortTypeHint.TableData | None, default: None
    The table to count values from.

subset : str, list of str, or None, default: None
    Column name(s) or field title(s) to use when counting unique combinations. Can be a single string for one column or a list for multiple columns. If None, all columns are used.

normalize : bool, default: False
    If True, the returned counts will be normalized to represent proportions.

sort : bool, default: True
    Sort by frequencies. When True, the result is sorted by count values.

ascending : bool, default: False
    Sort in ascending order. Only applies when sort=True.
    If False (default), sorts in descending order (most common first).

dropna : bool, default: True
    Don't include counts of rows containing NA values.
    If False, rows with NA values will be included in the counts.

Notes

- Supports field titles in addition to column names for the subset parameter
- For single column counts, the output will have a simple structure
- For multiple column counts, the output will have one column per counted field
- The count/proportion column is named 'count' or 'proportion' depending on normalize parameter
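The parameters and notes above mirror pandas DataFrame.value_counts(). The sketch below shows the underlying pandas behaviour on a small made-up DataFrame, assuming the module passes subset, normalize, sort, ascending and dropna through unchanged; the reset_index call is only one plausible way to obtain a flat table with a 'count' or 'proportion' column.

```python
import pandas as pd

# Hypothetical data with one missing value in the "rock" column.
df = pd.DataFrame({
    "rock": ["granite", "shale", "granite", "granite", None],
    "zone": ["A", "A", "B", "A", "B"],
})

# Multiple columns: one output column per counted field plus a 'count' column.
counts_table = df.value_counts(subset=["rock", "zone"]).reset_index(name="count")

# normalize=True: proportions instead of absolute counts.
proportions = df.value_counts(subset=["rock"], normalize=True).reset_index(name="proportion")

# dropna=False: rows containing NA values are counted as their own group.
with_na = df.value_counts(subset=["rock"], dropna=False)

print(counts_table)
print(proportions)
```

With the default dropna=True the row whose rock value is missing is left out, so the counts sum to four rather than five.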

Ports
InputTable: PortTypeHint.TableData
    The input TableData to count values from.

OutputCountTable: PortTypeHint.TableData
    The output TableData containing the value counts with columns for the counted values and their frequencies.
execute() PortTypeHint.TableData | None

Attributes:

InputTable: PortReference[PortTypeHint.TableData]
OutputCountTable: PortReference[PortTypeHint.TableData]
class modules.statistics.TimeSeriesRegression

Generate regression line data for time series data.

Inherits from:

PipeModule

Methods:

__init__(mname: str = 'TimeSeriesRegression', auto_run: bool = True, table: PortTypeHint.TableData | None = None, time_column: str | None = None, value_column: str | None = None, point_column: str | None = None, model_type: Literal['hyperbolic', 'exponential'] = 'exponential', prediction_time: int = 365, time_column_name: FieldMetadata | dict[str, str | None | Units] | None = None, value_column_name: FieldMetadata | dict[str, str | None | Units] | None = None) None

Initialize the TimeSeriesRegression object.

Parameters

table : PortTypeHint.TableData | None, default: None

The time series data to be used for regression.

time_column : str | None, default: None

The field name or title of the time column, which is the x column in the graph. If None, the time column will be the second column of the input table.

value_column : str | None, default: None

The field name or title of the value column, which is the y column in the graph. If None, the value column will be the third column of the input table.

point_column : str | None, default: None

The field name or title of the point column, which holds the point name. If None, the point column will be filled with NaN.

model_type : Literal['hyperbolic', 'exponential'], default: 'exponential'

The type of regression model.

prediction_time : int, default: 365

The relative time to predict. In the current version, the unit of time must be days.

time_column_name : FieldMetadata | dict[str, str | None | Units] | None, default: None

The name of the time column in the output table. If None, the column will have the same field metadata as the input table.

value_column_name : FieldMetadata | dict[str, str | None | Units] | None, default: None

The name of the value column in the output table. If None, the column will have the same field metadata as the input table.

Ports

InputTable : PortTypeHint.TableData

The time series data to be used for regression.

OutputTable : PortTypeHint.TableData

The output TableData containing the regression line data.

OutputSingleResult : PortTypeHint.SingleResult

The single result containing the regression information.
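This reference does not state the functional form behind model_type. Purely as an illustration of a 'hyperbolic' regression evaluated at prediction_time days, the sketch below assumes the widely used form y = t / (a + b*t), which becomes linear when rewritten as t/y = a + b*t. The function names fit_hyperbolic and predict_hyperbolic are hypothetical and are not part of the module; the module's own model may differ.

```python
import numpy as np


def fit_hyperbolic(t_days, values):
    """Fit y = t / (a + b*t) by least squares on the linearised form t/y = a + b*t."""
    t = np.asarray(t_days, dtype=float)
    y = np.asarray(values, dtype=float)
    # Skip points where t/y is undefined or carries no information.
    mask = (t > 0) & (y != 0)
    # np.polyfit returns coefficients from highest degree down: slope b, intercept a.
    b, a = np.polyfit(t[mask], t[mask] / y[mask], deg=1)
    return float(a), float(b)


def predict_hyperbolic(a, b, t_days):
    """Evaluate the assumed hyperbolic model at the given relative times (days)."""
    t = np.asarray(t_days, dtype=float)
    return t / (a + b * t)


# Synthetic, noise-free observations generated from a = 2.0, b = 0.05.
t_obs = np.array([10, 20, 40, 80, 160])
y_obs = t_obs / (2.0 + 0.05 * t_obs)

a, b = fit_hyperbolic(t_obs, y_obs)
y_at_365 = float(predict_hyperbolic(a, b, 365))  # analogous to prediction_time=365
print(a, b, y_at_365)
```

Under this assumed form the curve levels off towards 1/b as t grows, which is why a hyperbolic model is often chosen for quantities expected to converge to a final value.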

update_ui_schema(reset: bool = False) dict[str, UIAttributeSchema]
execute() PortTypeHint.TableData | None

Properties:

time_column_name
value_column_name

Attributes:

InputTable: PortReference[PortTypeHint.TableData]
OutputTable: PortReference[PortTypeHint.TableData]
OutputSingleResult: PortReference[PortTypeHint.SingleResult]
class modules.statistics.SphericalKMeans

Cluster directional or unit-normalised data using Spherical K-Means.

Spherical K-Means is a variant of K-Means that optimises cosine similarity instead of Euclidean distance, making it suitable for directional data such as rock joint orientations, text embeddings, or any features that live on the unit hypersphere.

The algorithm normalises each input row to unit length (L2-norm = 1) before fitting unless ``normalize=False`` is passed (use ``False`` only when the data is already on the unit sphere).

The integer in ``cluster_label`` (``OutputLabelsTable``, per sample) matches ``cluster_id`` (``OutputCentersTable``, per cluster): both use the same 0-indexed cluster index from the fitted model. Rows with ``cluster_label == k`` belong to the centroid row with ``cluster_id == k``.
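The steps described above (normalise rows, assign by cosine similarity, re-normalise the cluster means) can be sketched in a few lines of NumPy. This is a minimal single-run version with random initialisation, written only to illustrate the idea; it is not the module's implementation, and the function name spherical_kmeans and its seed argument are assumptions.

```python
import numpy as np


def spherical_kmeans(X, n_clusters, max_iter=300, tol=1e-4, seed=0):
    """Single-run spherical k-means with 'random' initialisation (illustrative only)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # L2-normalise each row so every sample lies on the unit hypersphere.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Pick n_clusters distinct rows as the initial centers.
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)]
    for _ in range(max_iter):
        # For unit vectors the dot product equals the cosine similarity.
        labels = np.argmax(X @ centers.T, axis=1)
        new_centers = centers.copy()
        for k in range(n_clusters):
            total = X[labels == k].sum(axis=0)
            norm = np.linalg.norm(total)
            if norm > 0:  # keep the old center if the cluster is empty
                new_centers[k] = total / norm
        # Frobenius norm of the center shift, compared against tol.
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift < tol:
            break
    labels = np.argmax(X @ centers.T, axis=1)
    return labels, centers


# Two directional groups with different magnitudes but similar directions.
X = np.array([
    [1.0, 0.05, 0.0], [2.0, 0.0, 0.1], [0.9, -0.05, 0.0],
    [0.0, 1.0, 0.05], [0.1, 3.0, 0.0], [-0.05, 0.8, 0.0],
])
labels, centers = spherical_kmeans(X, n_clusters=2)
print(labels)
print(centers)
```

Because rows are normalised first, the vectors [1.0, 0.05, 0.0] and [2.0, 0.0, 0.1] land in the same cluster even though their lengths differ: only direction matters.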

Examples

>>> skm = SphericalKMeans(table=my_table, feature_columns=["fx", "fy", "fz"], n_clusters=3)

>>> # Let the module auto-detect all numeric columns
>>> skm = SphericalKMeans(table=my_table, n_clusters=5)
Inherits from:

PipeModule

Methods:

__init__(mname: str = 'SphericalKMeans', auto_run: bool = True, table: PortTypeHint.TableData | None = None, feature_columns: list[str] | None = None, n_clusters: int = 5, init: Literal['k-means++', 'random'] = 'k-means++', n_init: int = 10, max_iter: int = 300, tol: float = 0.0001, random_state: int | None = None, normalize: bool = True, centers_in_original_scale: bool = True) None

Initialize the SphericalKMeans module.

Parameters

table : PortTypeHint.TableData | None, default: None

The input data table.

feature_columns : list[str] | None, default: None

Column names or titles to use as clustering features. Accepts field names, field titles, or a mix of both. If None, all numeric columns in the table are used automatically.

n_clusters : int, default: 5

Number of clusters to form.

init : ‘k-means++’ or ‘random’, default: ‘k-means++’

Initialisation strategy for cluster centers. ‘k-means++’ selects centers probabilistically proportional to their distance from already-chosen centers, which dramatically reduces the chance of poor initialisation and is almost always the better choice. ‘random’ picks n_clusters rows at random; it is faster but more sensitive to bad luck, so consider raising n_init when using this option.

n_init : int, default: 10

Number of times the algorithm is run with different centroid seeds. The result with the lowest inertia is kept. Valid for both init methods. With ‘k-means++’, n_init=1 is often sufficient for well-separated data. With ‘random’, raise this to 20–50 for noisy or high-dimensional data to compensate for the higher variance in random initialisations.

max_iter : int, default: 300

Maximum number of iterations per run. Valid for both init methods. The algorithm stops early if convergence (tol) is reached first.

tol : float, default: 1e-4

Convergence tolerance on the Frobenius norm of the center shift. Valid for both init methods.

random_state : int | None, default: None

Seed for the random number generator. Valid for both init methods. With ‘k-means++’, it seeds the probabilistic center selection. With ‘random’, it seeds which data rows are picked as initial centers. Pass an integer for fully reproducible results; None uses the global numpy random state.

normalize : bool, default: True

Whether to L2-normalise each input row to unit norm before fitting. Set to True whenever the features are not already unit-normalised. Set to False only if you have already normalised the data yourself, to avoid redundant computation.

centers_in_original_scale : bool, default: True

If False, ``OutputCentersTable`` rows are ``cluster_centers_`` from spherical K-means (unit vectors on the hypersphere in feature space). If True, each row is the arithmetic mean of the input feature values over samples assigned to that cluster, in the same units as the input table (e.g. degrees for orientation columns). Clustering is unchanged; only how centers are reported differs.

Ports

InputTable : PortReference[PortTypeHint.TableData]

The input data table.

OutputLabelsTable : PortReference[PortTypeHint.TableData]

All columns from the input table, with an appended integer ‘cluster_label’ column. Rows that were excluded from fitting (any NaN in the feature columns) get a missing cluster label.

OutputCentersTable : PortReference[PortTypeHint.TableData]

The cluster centroids as a table with one row per cluster. Columns match the feature columns of the input table. An additional ‘cluster_id’ column (0-indexed) identifies each centroid.
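To make the relationship between the two output tables concrete, the sketch below rebuilds original-scale centers from a labels table with pandas, following the description of centers_in_original_scale=True (arithmetic mean of the input features per cluster). The column names dip and dip_direction, the sample values, and the use of the nullable Int64 dtype for the missing label are assumptions made for the example.

```python
import pandas as pd

# Hypothetical labels table: the last row has NaN in a feature column,
# so it was excluded from fitting and carries a missing cluster label.
labels_table = pd.DataFrame({
    "dip": [30.0, 32.0, 80.0, 78.0, None],
    "dip_direction": [120.0, 118.0, 250.0, 254.0, 90.0],
    "cluster_label": pd.array([0, 0, 1, 1, pd.NA], dtype="Int64"),
})
feature_columns = ["dip", "dip_direction"]

# One row per cluster: mean of the input features over the assigned samples,
# with cluster_id using the same 0-indexed value as cluster_label.
centers_table = (
    labels_table.dropna(subset=["cluster_label"])
    .groupby("cluster_label")[feature_columns]
    .mean()
    .reset_index()
    .rename(columns={"cluster_label": "cluster_id"})
)

print(centers_table)
```

Rows with cluster_label == k in the labels table then correspond to the row with cluster_id == k in the centers table, and the unlabelled row contributes to no center.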

update_ui_schema(reset: bool = False) dict[str, UIAttributeSchema]
execute() PortTypeHint.TableData | None

Attributes:

InputTable: PortReference[PortTypeHint.TableData]
OutputLabelsTable: PortReference[PortTypeHint.TableData]
OutputCentersTable: PortReference[PortTypeHint.TableData]