modules.statistics

Classes

class modules.statistics.TableDescribe

Apply the pandas DataFrame describe() method to TableData and provide the described table and selected statistics.

This module generates descriptive statistics for the input table, supporting various data types through the include parameter. For numeric columns, it provides statistics like mean, std, min, max, etc. For non-numeric columns (e.g., object/string), it provides count, unique, top, and frequency.

Examples

Describe only numeric columns (default)
>>> describe_module = TableDescribe(table=my_table)

Describe all columns including strings
>>> describe_module = TableDescribe(table=my_table, include='all')

Describe only string/object columns
>>> describe_module = TableDescribe(table=my_table, include=['object'])

Describe specific numeric types
>>> describe_module = TableDescribe(table=my_table, include=['float64', 'int64'])

Select specific statistics for ResultModel output
>>> describe_module = TableDescribe(
...     table=my_table,
...     selected_pairs={"columnA": ["mean", "std"], "columnB": ["min", "max"]}
... )
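Because the module applies pandas `DataFrame.describe()`, the effect of `include` can be previewed on a plain DataFrame. The sketch below is a minimal illustration under that assumption. The sample frame and the `pick_stats` helper are hypothetical and not part of modules.statistics; how TableDescribe actually builds its ResultModel is not shown in this reference.

```python
import pandas as pd

# Hypothetical sample data standing in for a TableData payload.
df = pd.DataFrame(
    {
        "columnA": [1.0, 2.0, 3.0, 4.0],
        "columnB": [10, 20, 30, 40],
        # Force object dtype so include=["object"] selects this column.
        "label": pd.Series(["a", "b", "a", "a"], dtype=object),
    }
)

numeric_only = df.describe()                     # default: numeric columns only
everything = df.describe(include="all")          # numeric and non-numeric columns
strings_only = df.describe(include=["object"])   # count, unique, top, freq


def pick_stats(described: pd.DataFrame, selected_pairs: dict[str, list[str]]) -> dict:
    """Hypothetical helper: look up the chosen statistics for each column."""
    return {
        column: {stat: described.loc[stat, column] for stat in stats}
        for column, stats in selected_pairs.items()
    }


selected = pick_stats(
    numeric_only,
    {"columnA": ["mean", "std"], "columnB": ["min", "max"]},
)
```

The statistic names used in `selected_pairs` here are the row labels of the described table, which is why a name such as "mean" is only available for numeric columns.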

Inherits from:: PipeModule

Methods:

__init__(mname: str = 'TableDescribe', auto_run: bool = True, table: PortTypeHint.TableData | None = None, selected_pairs: dict[str, list[str]] | None = None, describe_columns: list[str] | None = None, include: Literal['all'] | list[str] | None = None) → None

Initialize the TableDescribe object.

Parameters

table : PortTypeHint.TableData | None, default: None
    The table to describe.

selected_pairs : dict[str, list[str]] | None, default: None
    Dictionary mapping column names to lists of statistics to select for ResultModel output. Each key is a column name and each value is a list of statistic names.

Examples

- include='all': Describe all columns including strings
- include=['object']: Describe only string/object columns
- include=['number']: Describe all numeric columns (int + float)
- include=['float64', 'int64']: Describe only float64 and int64 columns
- include=['object', 'bool']: Describe string and boolean columns

Notes

Non-numeric columns report a different set of statistics from numeric columns (e.g., count, unique, top, freq for object types).

execute() → PortTypeHint.TableData | None

Attributes:

InputTable: PortReference[PortTypeHint.TableData]

OutputStatTable: PortReference[PortTypeHint.TableData]

OutputResultModel: PortReference[PortTypeHint.ResultModel]

class modules.statistics.ValuesCount

Count unique combinations of values in TableData columns.

This module wraps the pandas DataFrame value_counts() method for TableData. It supports counting over a single column or multiple columns, with options for normalization, sorting, and handling NA values.

Inherits from:: PipeModule

Methods:

__init__(mname: str = 'ValuesCount', auto_run: bool = True, table: PortTypeHint.TableData | None = None, subset: str | list[str] | None = None, normalize: bool = False, sort: bool = True, ascending: bool = False, dropna: bool = True) → None

Initialize the ValuesCount object.

Parameters

table : PortTypeHint.TableData | None, default: None
    The table to count values from.

subset : str, list of str, or None, default: None
    Column name(s) or field title(s) to use when counting unique combinations. Can be a single string for one column or a list for multiple columns. If None, all columns are used.

normalize : bool, default: False
    If True, the returned counts will be normalized to represent proportions.

sort : bool, default: True
    Sort by frequencies. When True, the result is sorted by count values.

ascending : bool, default: False
    Sort in ascending order. Only applies when sort=True.
    If False (default), sorts in descending order (most common first).

dropna : bool, default: True
    Don't include counts of rows containing NA values.
    If False, rows with NA values will be included in the counts.

Notes

- Supports field titles in addition to column names for the subset parameter
- For single column counts, the output will have a simple structure
- For multiple column counts, the output will have one column per counted field
- The count/proportion column is named 'count' or 'proportion' depending on normalize parameter

Ports
InputTable: PortTypeHint.TableData
    The input TableData to count values from.

OutputCountTable: PortTypeHint.TableData
    The output TableData containing the value counts with columns for the counted values and their frequencies.
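
Since the module wraps pandas `DataFrame.value_counts()`, the parameters above can be tried directly on a DataFrame. The following is a minimal sketch with hypothetical sample data. The final `reset_index(name="count")` step is only an assumption about how a flat count table with one column per counted field might be produced, not a description of the module's internals.

```python
import pandas as pd

# Hypothetical sample data; the last row has a missing value in "rock".
df = pd.DataFrame(
    {
        "rock": ["granite", "granite", "shale", "granite", None],
        "grade": ["I", "I", "II", "II", "I"],
    }
)

# Count unique (rock, grade) combinations, most common first.
# dropna=True leaves out the row whose "rock" value is missing.
counts = df.value_counts(
    subset=["rock", "grade"], sort=True, ascending=False, dropna=True
)

# Proportions instead of raw counts.
proportions = df.value_counts(subset=["rock", "grade"], normalize=True)

# Flatten the result: one column per counted field plus a frequency column.
count_table = counts.reset_index(name="count")
```

With `dropna=True`, the proportions are computed over the four rows that remain after the row with the missing value is excluded.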

execute() → PortTypeHint.TableData | None

Attributes:

InputTable: PortReference[PortTypeHint.TableData]

OutputCountTable: PortReference[PortTypeHint.TableData]

class modules.statistics.TimeSeriesRegression

Generate regression line data for time series data.

Inherits from:: PipeModule

Methods:

Initialize the TimeSeriesRegression object.

Parameters

table : PortTypeHint.TableData | None, default: None
    The time series data to be used for regression.

time_column : str | None, default: None
    The field name or title of the time column, which is the x column in the graph. If None, the second column of the input table is used.

value_column : str | None, default: None
    The field name or title of the value column, which is the y column in the graph. If None, the third column of the input table is used.

point_column : str | None, default: None
    The field name or title of the point column, which holds the point name. If None, the point column is filled with NaN.

model_type : Literal["hyperbolic", "exponential"], default: 'exponential'
    The type of regression model.

prediction_time : int, default: 365
    The relative time to predict. In the current version, the unit of time must be days.

time_column_name : FieldMetadata | dict[str, str] | None, default: None
    The name of the time column in the output table. If None, the column keeps the same field metadata as in the input table.

value_column_name : FieldMetadata | dict[str, str] | None, default: None
    The name of the value column in the output table. If None, the column keeps the same field metadata as in the input table.

Ports
InputTable: PortTypeHint.TableData
    The time series data to be used for regression.

OutputTable: PortTypeHint.TableData
    The output TableData containing the regression line data.

OutputSingleResult: PortTypeHint.SingleResult
    The single result containing the regression information.
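
This reference does not state the equations behind the "hyperbolic" and "exponential" model types. As a rough illustration of what a hyperbolic regression over a time series can look like, the sketch below assumes the common form s(t) = t / (a + b*t) and fits it by linear regression of t/s on t. The function names, the model form, and the sample data are all assumptions made for illustration; they are not taken from TimeSeriesRegression.

```python
import numpy as np


def fit_hyperbolic(t: np.ndarray, s: np.ndarray) -> tuple[float, float]:
    """Fit s = t / (a + b*t) (assumed form) via the linearisation t/s = a + b*t."""
    mask = (t > 0) & (s != 0)  # t/s is undefined where s is zero
    slope, intercept = np.polyfit(t[mask], t[mask] / s[mask], deg=1)
    return float(intercept), float(slope)  # a, b


def predict_hyperbolic(t, a: float, b: float):
    """Evaluate the assumed hyperbolic model at relative time t (in days)."""
    return t / (a + b * t)


# Hypothetical observations: relative time in days and the measured value.
days = np.array([0.0, 10.0, 30.0, 60.0, 120.0, 240.0])
true_a, true_b = 2.0, 0.05
values = predict_hyperbolic(days, true_a, true_b)

a, b = fit_hyperbolic(days, values)
predicted = predict_hyperbolic(365.0, a, b)  # analogous to prediction_time=365
```

Under this assumed form, 1/b is the value the curve approaches as time grows, which is one reason the linearised fit is convenient for long-term prediction.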

update_ui_schema(reset: bool = False) → dict[str, UIAttributeSchema]

execute() → PortTypeHint.TableData | None

Properties:

time_column_name

value_column_name

Attributes:

InputTable: PortReference[PortTypeHint.TableData]

OutputTable: PortReference[PortTypeHint.TableData]

OutputSingleResult: PortReference[PortTypeHint.SingleResult]

class modules.statistics.SphericalKMeans

Cluster directional or unit-normalised data using Spherical K-Means.

Spherical K-Means is a variant of K-Means that optimises cosine similarity instead of Euclidean distance, making it suitable for directional data such as rock joint orientations, text embeddings, or any features that live on the unit hypersphere.

The algorithm normalises each input row to unit length (L2-norm = 1) before fitting unless ``normalize=False`` is passed (use ``False`` only when the data is already on the unit sphere).

The integer in ``cluster_label`` (``OutputLabelsTable``, per sample) matches ``cluster_id`` (``OutputCentersTable``, per cluster): both use the same 0-indexed cluster index from the fitted model. Rows with ``cluster_label == k`` belong to the centroid row with ``cluster_id == k``.
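
The core loop described above (normalise rows, assign by cosine similarity, re-project centers onto the sphere) can be sketched in a few lines of NumPy. This is an illustrative implementation only: it uses random initialisation and a single run, assumes no all-zero rows, and is not the code used by this module.

```python
import numpy as np


def spherical_kmeans(X, n_clusters, max_iter=300, tol=1e-4, seed=0):
    """Illustrative spherical k-means with random initialisation (single run)."""
    rng = np.random.default_rng(seed)
    # L2-normalise each row so every sample lies on the unit hypersphere.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)]
    for _ in range(max_iter):
        # On unit vectors the dot product is the cosine similarity.
        labels = np.argmax(X @ centers.T, axis=1)
        new_centers = centers.copy()
        for k in range(n_clusters):
            members = X[labels == k]
            if len(members):
                total = members.sum(axis=0)
                norm = np.linalg.norm(total)
                if norm > 0:
                    new_centers[k] = total / norm  # re-project onto the sphere
        shift = np.linalg.norm(new_centers - centers)  # Frobenius norm of the shift
        centers = new_centers
        if shift < tol:
            break
    labels = np.argmax(X @ centers.T, axis=1)
    return labels, centers


# Hypothetical data: two directions, with rows of different lengths.
X = np.array([[2.0, 0.0, 0.0], [0.99, 0.1, 0.0], [0.0, 3.0, 0.0], [0.1, 0.99, 0.0]])
labels, centers = spherical_kmeans(X, n_clusters=2)
```

Row `k` of `centers` corresponds to the samples with `labels == k`, mirroring the `cluster_id` / `cluster_label` relationship described above.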

Examples

>>> skm = SphericalKMeans(table=my_table, feature_columns=["fx", "fy", "fz"], n_clusters=3)

>>> # Let the module auto-detect all numeric columns
>>> skm = SphericalKMeans(table=my_table, n_clusters=5)

Inherits from:: PipeModule

Methods:

__init__(mname: str = 'SphericalKMeans', auto_run: bool = True, table: PortTypeHint.TableData | None = None, feature_columns: list[str] | None = None, n_clusters: int = 5, init: Literal['k-means++', 'random'] = 'k-means++', n_init: int = 10, max_iter: int = 300, tol: float = 0.0001, random_state: int | None = None, normalize: bool = True, centers_in_original_scale: bool = True) → None

Initialize the SphericalKMeans module.

Parameters

table : PortTypeHint.TableData | None, default: None
    The input data table.

feature_columns : list[str] | None, default: None
    Column names or titles to use as clustering features. Accepts field names, field titles, or a mix of both. If None, all numeric columns in the table are used automatically.

n_clusters : int, default: 5
    Number of clusters to form.

init : 'k-means++' or 'random', default: 'k-means++'
    Initialisation strategy for cluster centers. 'k-means++' selects centers probabilistically proportional to their distance from already-chosen centers, which dramatically reduces the chance of poor initialisation and is almost always the better choice. 'random' picks n_clusters rows at random, which is faster but more sensitive to bad luck; consider raising n_init when using this option.

n_init : int, default: 10
    Number of times the algorithm is run with different centroid seeds. The result with the lowest inertia is kept. Valid for both init methods. With 'k-means++', n_init=1 is often sufficient for well-separated data. With 'random', raise this to 20–50 for noisy or high-dimensional data to compensate for the higher variance in random initialisations.

max_iter : int, default: 300
    Maximum number of iterations per run. Valid for both init methods. The algorithm stops early if convergence (tol) is reached first.

tol : float, default: 1e-4
    Convergence tolerance on the Frobenius norm of the center shift. Valid for both init methods.

random_state : int | None, default: None
    Seed for the random number generator. Valid for both init methods. With 'k-means++', it seeds the probabilistic center selection. With 'random', it seeds which data rows are picked as initial centers. Pass an integer for fully reproducible results; None uses the global numpy random state.

normalize : bool, default: True
    Whether to L2-normalise each input row to unit norm before fitting. Set to True whenever the features are not already unit-normalised. Set to False only if you have already normalised the data yourself, to avoid redundant computation.

centers_in_original_scale : bool, default: True
    If False, ``OutputCentersTable`` rows are ``cluster_centers_`` from spherical K-means (unit vectors on the hypersphere in feature space). If True, each row is the arithmetic mean of the input feature values over samples assigned to that cluster, in the same units as the input table (e.g. degrees for orientation columns). Clustering is unchanged; only how centers are reported differs.

Ports
InputTable: PortReference[PortTypeHint.TableData]
    The input data table.

OutputLabelsTable: PortReference[PortTypeHint.TableData]
    All columns from the input table, with an appended integer 'cluster_label' column. Rows that were excluded from fitting (any NaN in the feature columns) get a missing cluster label.

OutputCentersTable: PortReference[PortTypeHint.TableData]
    The cluster centroids as a table with one row per cluster. Columns match the feature columns of the input table. An additional 'cluster_id' column (0-indexed) identifies each centroid.
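
The relationship between the two output tables when `centers_in_original_scale=True` can be illustrated with pandas. The sketch below uses hypothetical data and hand-written labels in place of a fitted model. Representing the missing label with a nullable integer column is an assumption, since the reference only says such rows get a missing cluster label.

```python
import numpy as np
import pandas as pd

# Hypothetical input table; row 2 has a NaN in a feature column.
table = pd.DataFrame(
    {
        "dip": [30.0, 32.0, np.nan, 75.0, 77.0],
        "dip_direction": [120.0, 124.0, 200.0, 300.0, 304.0],
    }
)
feature_columns = ["dip", "dip_direction"]

# Rows with any NaN in the feature columns are excluded from fitting.
valid = table[feature_columns].notna().all(axis=1)

# Hypothetical labels for the four valid rows, standing in for a fitted model.
fitted_labels = np.array([0, 0, 1, 1])

# Labels table: all input columns plus 'cluster_label' (missing for excluded rows).
labels_table = table.copy()
labels_table["cluster_label"] = pd.Series(pd.NA, index=table.index, dtype="Int64")
labels_table.loc[valid, "cluster_label"] = fitted_labels

# Centers in original scale: arithmetic mean of the raw features per cluster.
centers_table = (
    labels_table.loc[valid]
    .groupby("cluster_label")[feature_columns]
    .mean()
    .reset_index()
    .rename(columns={"cluster_label": "cluster_id"})
)
```

Rows of `labels_table` with `cluster_label == k` are exactly the rows averaged into the `centers_table` row with `cluster_id == k`.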

update_ui_schema(reset: bool = False) → dict[str, UIAttributeSchema]

execute() → PortTypeHint.TableData | None

Attributes:

InputTable: PortReference[PortTypeHint.TableData]

OutputLabelsTable: PortReference[PortTypeHint.TableData]

OutputCentersTable: PortReference[PortTypeHint.TableData]