modules.statistics
Classes
- class modules.statistics.TableDescribe
Apply the pandas DataFrame describe() method to TableData and provide the described table and selected statistics.
This module generates descriptive statistics for the input table, supporting various data types through the include parameter. For numeric columns, it provides statistics such as mean, std, min, and max. For non-numeric columns (e.g., object/string), it provides count, unique, top, and freq.
Examples
Describe only numeric columns (default)
>>> describe_module = TableDescribe(table=my_table)
Describe all columns including strings
>>> describe_module = TableDescribe(table=my_table, include='all')
Describe only string/object columns
>>> describe_module = TableDescribe(table=my_table, include=['object'])
Describe specific numeric types
>>> describe_module = TableDescribe(table=my_table, include=['float64', 'int64'])
Select specific statistics for ResultModel output
>>> describe_module = TableDescribe(
...     table=my_table,
...     selected_pairs={"columnA": ["mean", "std"], "columnB": ["min", "max"]}
... )
- Inherits from:
PipeModule
Methods:
- __init__(mname: str = 'TableDescribe', auto_run: bool = True, table: PortTypeHint.TableData | None = None, selected_pairs: dict[str, list[str]] | None = None, describe_columns: list[str] | None = None, include: Literal['all'] | list[str] | None = None) None
Initialize the TableDescribe object.
Parameters
- table : PortTypeHint.TableData | None, default: None
The table to describe.
- selected_pairs : dict[str, list[str]] | None, default: None
Dictionary mapping column names to lists of statistics to select for ResultModel output. Each key is a column name; each value is a list of statistic names.
Examples
- include='all': Describe all columns including strings
- include=['object']: Describe only string/object columns
- include=['number']: Describe all numeric columns (int + float)
- include=['float64', 'int64']: Describe only float64 and int64 columns
- include=['object', 'bool']: Describe string and boolean columns
Notes
(e.g., count, unique, top, freq for object types)
Attributes:
- InputTable: PortReference[PortTypeHint.TableData]
- OutputStatTable: PortReference[PortTypeHint.TableData]
- OutputResultModel: PortReference[PortTypeHint.ResultModel]
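The effect of the include options and of selected_pairs can be sketched with plain pandas. This is only an illustration of the underlying describe() call, not the module's implementation; the sample data and the way statistics are picked out of the described table are assumptions.

```python
import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
df = pd.DataFrame({
    "columnA": [1.0, 2.0, 3.0, 4.0],
    "columnB": [10, 20, 30, 40],
    "label": pd.Series(["x", "y", "x", "x"], dtype=object),
})

numeric_stats = df.describe()                   # default: numeric columns only
all_stats = df.describe(include="all")          # numeric and object columns
object_stats = df.describe(include=["object"])  # count, unique, top, freq

# Pick individual statistics, mirroring the shape of `selected_pairs`
# (assumed behaviour: look up each statistic row for each column).
selected_pairs = {"columnA": ["mean", "std"], "columnB": ["min", "max"]}
selected = {
    column: {stat: numeric_stats.loc[stat, column] for stat in stats}
    for column, stats in selected_pairs.items()
}
print(selected)
```

Rows of the described table are statistic names and columns are the described fields, which is why each selected value is looked up as (statistic, column).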
- class modules.statistics.ValuesCount
Count unique combinations of values in TableData columns.
This module wraps the pandas DataFrame value_counts() method for TableData, supporting single and multiple column counting, with options for normalization, sorting, and handling NA values.
- Inherits from:
PipeModule
Methods:
- __init__(mname: str = 'ValuesCount', auto_run: bool = True, table: PortTypeHint.TableData | None = None, subset: str | list[str] | None = None, normalize: bool = False, sort: bool = True, ascending: bool = False, dropna: bool = True) None
Initialize the ValuesCount object.
Parameters
- table : PortTypeHint.TableData | None, default: None
The table to count values from.
- subset : str, list of str, or None, default: None
Column name(s) or field title(s) to use when counting unique combinations. Can be a single string for one column or a list for multiple columns. If None, all columns are used.
- normalize : bool, default: False
If True, the returned counts will be normalized to represent proportions.
- sort : bool, default: True
Sort by frequencies. When True, the result is sorted by count values.
- ascending : bool, default: False
Sort in ascending order. Only applies when sort=True. If False (default), sorts in descending order (most common first).
- dropna : bool, default: True
Don't include counts of rows containing NA values. If False, rows with NA values will be included in the counts.
Notes
- Supports field titles in addition to column names for the subset parameter
- For single column counts, the output will have a simple structure
- For multiple column counts, the output will have one column per counted field
- The count/proportion column is named 'count' or 'proportion' depending on the normalize parameter
Ports
- InputTable : PortTypeHint.TableData
The input TableData to count values from.
- OutputCountTable : PortTypeHint.TableData
The output TableData containing the value counts with columns for the counted values and their frequencies.
Attributes:
- InputTable: PortReference[PortTypeHint.TableData]
- OutputCountTable: PortReference[PortTypeHint.TableData]
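A minimal sketch of the underlying pandas call may help in reading the parameters above. It uses DataFrame.value_counts() directly; the sample data is hypothetical, and turning the result into a flat table with reset_index is an assumption about how the output table could be shaped, not a statement about the module's internals.

```python
import pandas as pd

# Hypothetical sample data with one missing value.
df = pd.DataFrame({
    "rock_type": ["granite", "basalt", "granite", "granite", None],
    "grade": ["I", "II", "I", "II", "I"],
})

# Single column, raw counts, most common first (sort=True, ascending=False).
single = df.value_counts(subset=["rock_type"]).reset_index(name="count")

# Two columns, proportions instead of counts.
multi = (
    df.value_counts(subset=["rock_type", "grade"], normalize=True)
    .reset_index(name="proportion")
)

# Keep rows whose counted value is NA.
with_na = (
    df.value_counts(subset=["rock_type"], dropna=False)
    .reset_index(name="count")
)

print(single)
print(multi)
print(with_na)
```

With dropna=True the row with a missing rock_type is ignored, so the proportions in the two-column result are computed over the four remaining rows.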
- class modules.statistics.TimeSeriesRegression
Generate regression line data for time series data
- Inherits from:
PipeModule
Methods:
- __init__(mname: str = 'TimeSeriesRegression', auto_run: bool = True, table: PortTypeHint.TableData | None = None, time_column: str | None = None, value_column: str | None = None, point_column: str | None = None, model_type: Literal['hyperbolic', 'exponential'] = 'exponential', prediction_time: int = 365, time_column_name: FieldMetadata | dict[str, str | None | Units] | None = None, value_column_name: FieldMetadata | dict[str, str | None | Units] | None = None) None
Initialize the TimeSeriesRegression object.
Parameters
- table : PortTypeHint.TableData | None, default: None
The time series data to be used for regression.
- time_column : str | None, default: None
The field name or title of the time column, which is the x column in the graph. If None, the second column of the input table is used.
- value_column : str | None, default: None
The field name or title of the value column, which is the y column in the graph. If None, the third column of the input table is used.
- point_column : str | None, default: None
The field name or title of the point column, which holds the point name. If None, the point column will be filled with NaN.
- model_type : Literal["hyperbolic", "exponential"], default: 'exponential'
The type of regression model.
- prediction_time : int, default: 365
The relative time to predict. In the current version, the time unit must be days.
- time_column_name : FieldMetadata | dict[str, str] | None, default: None
The name of the time column in the output table. If None, the column will have the same field metadata as the input table.
- value_column_name : FieldMetadata | dict[str, str] | None, default: None
The name of the value column in the output table. If None, the column will have the same field metadata as the input table.
Ports
- InputTable : PortTypeHint.TableData
The time series data to be used for regression.
- OutputTable : PortTypeHint.TableData
The output TableData containing the regression line data.
- OutputSingleResult : PortTypeHint.SingleResult
The single result containing the regression information.
Properties:
- time_column_name
- value_column_name
Attributes:
- InputTable: PortReference[PortTypeHint.TableData]
- OutputTable: PortReference[PortTypeHint.TableData]
- OutputSingleResult: PortReference[PortTypeHint.SingleResult]
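The documentation above does not state the functional form behind either model_type. As a rough illustration of what fitting a regression line and predicting at a relative time can look like, the sketch below assumes one common hyperbolic form, y = t / (a + b*t), fitted through its linearised version t/y = a + b*t. The form, the helper names fit_hyperbolic and predict_hyperbolic, and the synthetic data are all assumptions, not the module's actual method.

```python
import numpy as np


def fit_hyperbolic(t, y):
    """Fit y = t / (a + b*t) via the linearised form t/y = a + b*t.

    Assumes t > 0 and y != 0 for every sample.
    """
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    b, a = np.polyfit(t, t / y, deg=1)  # polyfit returns slope, then intercept
    return a, b


def predict_hyperbolic(a, b, t):
    """Evaluate the fitted curve at time t (same unit as the fit, here days)."""
    t = np.asarray(t, dtype=float)
    return t / (a + b * t)


# Synthetic observations generated from known parameters a=2.0, b=0.05.
days = np.array([10.0, 30.0, 60.0, 120.0, 240.0])
values = days / (2.0 + 0.05 * days)

a, b = fit_hyperbolic(days, values)
prediction = predict_hyperbolic(a, b, 365)  # analogous to prediction_time=365
print(a, b, prediction)
```

Under this assumed form, 1/b is the asymptotic value the curve approaches as time grows, which is one reason the hyperbolic form is often used for long-term extrapolation.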
- class modules.statistics.SphericalKMeans
Cluster directional or unit-normalised data using Spherical K-Means.
Spherical K-Means is a variant of K-Means that optimises cosine similarity instead of Euclidean distance, making it suitable for directional data such as rock joint orientations, text embeddings, or any features that live on the unit hypersphere.
The algorithm normalises each input row to unit length (L2-norm = 1) before fitting unless ``normalize=False`` is passed (use ``False`` only when the data is already on the unit sphere).
The integer in ``cluster_label`` (``OutputLabelsTable``, per sample) matches ``cluster_id`` (``OutputCentersTable``, per cluster): both use the same 0-indexed cluster index from the fitted model. Rows with ``cluster_label == k`` belong to the centroid row with ``cluster_id == k``.
Examples
>>> skm = SphericalKMeans(table=my_table, feature_columns=["fx", "fy", "fz"], n_clusters=3)
>>> # Let the module auto-detect all numeric columns
>>> skm = SphericalKMeans(table=my_table, n_clusters=5)
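The description above (row-wise L2 normalisation, assignment by cosine similarity, unit-length centroids) can be sketched in a few lines of NumPy. This is a simplified single-run illustration with random initialisation, not the module's implementation; the function name spherical_kmeans and the sample data are hypothetical.

```python
import numpy as np


def spherical_kmeans(X, n_clusters, max_iter=300, tol=1e-4, seed=0):
    """Illustrative spherical k-means: one run, 'random' initialisation."""
    rng = np.random.default_rng(seed)
    # L2-normalise each row so every sample lies on the unit hypersphere.
    U = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Pick n_clusters distinct rows as the starting centers.
    centers = U[rng.choice(len(U), size=n_clusters, replace=False)]
    for _ in range(max_iter):
        # For unit vectors, cosine similarity is just the dot product.
        labels = np.argmax(U @ centers.T, axis=1)
        new_centers = centers.copy()
        for k in range(n_clusters):
            members = U[labels == k]
            if len(members):  # an empty cluster keeps its previous center
                total = members.sum(axis=0)
                new_centers[k] = total / np.linalg.norm(total)
        shift = np.linalg.norm(new_centers - centers)  # Frobenius norm
        centers = new_centers
        if shift < tol:
            break
    # Final assignment against the converged centers.
    labels = np.argmax(U @ centers.T, axis=1)
    return labels, centers


# Hypothetical 2-D data: three rows near the x-axis, three near the y-axis.
X = np.array([[1.0, 0.1], [2.0, 0.1], [3.0, -0.2],
              [0.1, 1.0], [-0.1, 2.0], [0.2, 3.0]])
labels, centers = spherical_kmeans(X, n_clusters=2)

# Centers in the input's own scale: the per-cluster arithmetic mean of the
# raw rows, as described for centers_in_original_scale=True.
original_scale_centers = np.array(
    [X[labels == k].mean(axis=0) for k in range(2)]
)
print(labels, centers, original_scale_centers)
```

Row magnitudes differ widely in the sample data, yet only direction affects the assignment, which is the practical difference from Euclidean K-Means.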
- Inherits from:
PipeModule
Methods:
- __init__(mname: str = 'SphericalKMeans', auto_run: bool = True, table: PortTypeHint.TableData | None = None, feature_columns: list[str] | None = None, n_clusters: int = 5, init: Literal['k-means++', 'random'] = 'k-means++', n_init: int = 10, max_iter: int = 300, tol: float = 0.0001, random_state: int | None = None, normalize: bool = True, centers_in_original_scale: bool = True) None
Initialize the SphericalKMeans module.
Parameters
- table : PortTypeHint.TableData | None, default: None
The input data table.
- feature_columns : list[str] | None, default: None
Column names or titles to use as clustering features. Accepts field names, field titles, or a mix of both. If None, all numeric columns in the table are used automatically.
- n_clusters : int, default: 5
Number of clusters to form.
- init : ‘k-means++’ or ‘random’, default: ‘k-means++’
Initialisation strategy for cluster centers. ‘k-means++’ selects centers probabilistically proportional to their distance from already-chosen centers, which dramatically reduces the chance of poor initialisation and is almost always the better choice. ‘random’ picks n_clusters rows at random; it is faster but more sensitive to bad luck, so consider raising n_init when using this option.
- n_init : int, default: 10
Number of times the algorithm is run with different centroid seeds. The result with the lowest inertia is kept. Valid for both init methods. With ‘k-means++’, n_init=1 is often sufficient for well-separated data. With ‘random’, raise this to 20–50 for noisy or high-dimensional data to compensate for the higher variance in random initialisations.
- max_iter : int, default: 300
Maximum number of iterations per run. Valid for both init methods. The algorithm stops early if convergence (tol) is reached first.
- tol : float, default: 1e-4
Convergence tolerance on the Frobenius norm of the center shift. Valid for both init methods.
- random_state : int | None, default: None
Seed for the random number generator. Valid for both init methods. With ‘k-means++’, it seeds the probabilistic center selection. With ‘random’, it seeds which data rows are picked as initial centers. Pass an integer for fully reproducible results; None uses the global numpy random state.
- normalize : bool, default: True
Whether to L2-normalise each input row to unit norm before fitting. Set to True whenever the features are not already unit-normalised. Set to False only if you have already normalised the data yourself, to avoid redundant computation.
- centers_in_original_scale : bool, default: True
If False, ``OutputCentersTable`` rows are ``cluster_centers_`` from spherical K-means (unit vectors on the hypersphere in feature space). If True, each row is the arithmetic mean of the input feature values over samples assigned to that cluster, in the same units as the input table (e.g. degrees for orientation columns). Clustering is unchanged; only how centers are reported differs.
Ports
- InputTable : PortReference[PortTypeHint.TableData]
The input data table.
- OutputLabelsTable : PortReference[PortTypeHint.TableData]
All columns from the input table, with an appended integer ‘cluster_label’ column. Rows that were excluded from fitting (any NaN in the feature columns) get a missing cluster label.
- OutputCentersTable : PortReference[PortTypeHint.TableData]
The cluster centroids as a table with one row per cluster. Columns match the feature columns of the input table. An additional ‘cluster_id’ column (0-indexed) identifies each centroid.
Attributes:
- InputTable: PortReference[PortTypeHint.TableData]
- OutputLabelsTable: PortReference[PortTypeHint.TableData]
- OutputCentersTable: PortReference[PortTypeHint.TableData]
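Because cluster_label and cluster_id share the same 0-indexed cluster index, the two output tables can be joined on those columns. The sketch below uses plain pandas DataFrames as hypothetical stand-ins for the two tables; how the real TableData objects convert to DataFrames, and the use of NaN to represent a missing label, are assumptions.

```python
import pandas as pd

# Hypothetical stand-in for OutputLabelsTable; the last row had a NaN feature
# and therefore carries a missing cluster label (assumed here to be NaN).
labels_df = pd.DataFrame({
    "fx": [0.9, 0.1, 0.8, None],
    "fy": [0.1, 0.9, 0.2, 0.5],
    "cluster_label": [0.0, 1.0, 0.0, None],
})

# Hypothetical stand-in for OutputCentersTable.
centers_df = pd.DataFrame({
    "fx": [0.85, 0.1],
    "fy": [0.15, 0.9],
    "cluster_id": [0, 1],
})

# Drop unlabelled rows, restore an integer label, then attach each centroid.
labelled = (
    labels_df.dropna(subset=["cluster_label"])
    .astype({"cluster_label": "int64"})
)
joined = labelled.merge(
    centers_df,
    left_on="cluster_label",
    right_on="cluster_id",
    suffixes=("", "_center"),
)
print(joined)
```

Each labelled sample ends up next to the coordinates of its own centroid in the fx_center and fy_center columns.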