modules.dataClean

Base classes and common functionality for data cleaning modules.

This module provides the foundational infrastructure for creating data cleaning modules that integrate with the GDI pipeline system. All data cleaning modules should inherit from the base classes defined here.

Classes

class modules.dataClean.BaseDataCleaningModule

Base class for all data cleaning modules in the GDI pipeline system.

This class provides the foundational infrastructure for data cleaning operations, including rule execution, logging, error handling, and integration with the pipeline system. All data cleaning modules should inherit from this class.

Features:

  • Rule-based cleaning: Define and execute cleaning rules with different actions

  • Comprehensive logging: Detailed logging with the GDI logging system

  • Error handling: Configurable error handling (warnings, errors, continue)

  • UI integration: Full UI schema support for configuration

  • Pipeline integration: Seamless integration with existing pipeline system

  • Flexible data handling: Works with both TableData and TableCollection

Attributes

cleaning_config : CleaningConfiguration

Configuration object containing all cleaning rules and settings

strict_mode : bool

If True, stop execution on first error

generate_report : bool

Whether to generate detailed cleaning reports

report_level : str

Level of detail for reports (“summary”, “detailed”, “verbose”)

max_errors : int | None

Maximum number of errors before stopping (None = unlimited)
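The interaction between strict_mode and max_errors can be sketched as follows. This is a minimal illustration, not the module's actual implementation: should_stop and run_checks are hypothetical names, and the sketch assumes execution halts as soon as the error count reaches max_errors.

```python
from typing import Optional


def should_stop(error_count: int, strict_mode: bool, max_errors: Optional[int]) -> bool:
    """Hypothetical helper: decide whether rule execution should halt."""
    if error_count == 0:
        return False
    if strict_mode:
        # strict_mode=True: the first error ends the run
        return True
    # max_errors=None means there is no limit (assumption: stop once the limit is reached)
    return max_errors is not None and error_count >= max_errors


def run_checks(values, strict_mode=False, max_errors=None):
    """Apply a toy check (values must be non-negative) and collect error messages."""
    errors = []
    for value in values:
        if value < 0:
            errors.append(f"negative value: {value}")
            if should_stop(len(errors), strict_mode, max_errors):
                break
    return errors
```

With the input [1, -1, -2, -3], the default settings collect all three errors, strict_mode=True stops after the first, and max_errors=2 stops after the second.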

Inherits from:

PipeModule

Methods:

__init__(mname: str | None = None, auto_run: bool = True, cleaning_config: CleaningConfiguration | None = None, strict_mode: bool = False, generate_report: bool = True, report_level: Literal['summary', 'detailed', 'verbose'] = 'detailed', max_errors: int | None = None)

Initialize the base data cleaning module.

Parameters

mname : str | None

Module name (auto-generated if None)

auto_run : bool

Whether to execute automatically in pipeline

cleaning_config : CleaningConfiguration | None

Configuration object with rules and settings

strict_mode : bool

Stop execution on first error

generate_report : bool

Generate detailed cleaning reports

report_level : str

Report detail level (“summary”, “detailed”, “verbose”)

max_errors : int | None

Maximum errors before stopping (None = unlimited)

**kwargs

Additional parameters passed to parent class

update_ui_schema(reset: bool = False) -> dict[str, UIAttributeSchema]

Define UI schema for module configuration.

Returns

dict[str, UIAttributeSchema]
    UI schema for the module, including global settings and rules configuration.

execute() -> TableData | TableCollection | None

Execute the data cleaning module.

This is the main entry point for module execution. It handles the complete cleaning workflow including initialization, rule execution, logging, and error handling.
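The workflow described above can be sketched roughly as follows. This is an illustrative stand-in, not the real method: execute_sketch takes the data and rules as arguments (the actual execute() takes none and presumably obtains its input from the pipeline), rules are modelled as plain callables, and the logger name is an assumption.

```python
import logging

logger = logging.getLogger("dataclean.sketch")  # hypothetical logger name


def execute_sketch(data, rules):
    """Run each rule in order; return the cleaned data, or None on failure."""
    logger.info("starting cleaning with %d rule(s)", len(rules))
    cleaned = data
    try:
        for rule in rules:
            cleaned = rule(cleaned)
            logger.info("applied rule %s", getattr(rule, "__name__", repr(rule)))
    except Exception:
        # Any failure is logged and reported to the caller as None
        logger.exception("cleaning failed")
        return None
    return cleaned


def strip_whitespace(rows):
    """Toy rule: trim surrounding whitespace from every string."""
    return [s.strip() for s in rows]


def drop_empty(rows):
    """Toy rule: remove empty strings."""
    return [s for s in rows if s]
```

Callers should check for None before using the result, since a failed run returns no data rather than raising.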

Returns

TableData | TableCollection | None
    Cleaned data, or None if execution failed

class modules.dataClean.SimpleDataCleaningModule

A simple, concrete implementation of BaseDataCleaningModule.

This class provides a basic implementation that does nothing but log the cleaning rules. It’s useful for testing and as a template.

Inherits from:

BaseDataCleaningModule
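As a template, a log-only subclass might look like the sketch below. StandInBaseModule is a simplified stand-in written for this example, not the real BaseDataCleaningModule, and the apply_rule hook is a hypothetical name; check the actual base class for the methods a subclass is expected to override.

```python
import logging

logger = logging.getLogger("dataclean.sketch")  # hypothetical logger name


class StandInBaseModule:
    """Simplified stand-in for the base class (NOT the real API)."""

    def __init__(self, mname=None, rules=None):
        self.mname = mname or "cleaning_module"
        self.rules = list(rules or [])

    def apply_rule(self, rule, data):
        """Hypothetical hook that subclasses override."""
        raise NotImplementedError

    def execute(self, data):
        # Simplification: data is passed in directly instead of coming from a pipeline
        for rule in self.rules:
            data = self.apply_rule(rule, data)
        return data


class LogOnlyModule(StandInBaseModule):
    """Logs each rule and returns the data unchanged."""

    def apply_rule(self, rule, data):
        logger.info("%s: would apply rule %r", self.mname, rule)
        return data
```

Because every rule is only logged, execute returns the very object it was given, which makes the class convenient for checking pipeline wiring before real cleaning logic exists.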

Functions

modules.dataClean.create_basic_cleaning_module(module_name: str, rules: list[CleaningRule] | None = None) -> SimpleDataCleaningModule

Create a basic cleaning module with default configuration.

This is a convenience function for creating simple cleaning modules that log rules but don’t modify data.

Parameters

module_name : str

Name for the module

rules : list[CleaningRule] | None

List of cleaning rules to apply

**kwargs

Additional parameters for module configuration

Returns

SimpleDataCleaningModule
    Configured cleaning module instance
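The shape of such a convenience function can be sketched as below. StandInRule, StandInModule, and create_basic_module_sketch are hypothetical stand-ins for CleaningRule, SimpleDataCleaningModule, and the real factory; the sketch assumes that rules=None means "no rules" and that extra keyword arguments are forwarded to the module constructor.

```python
from dataclasses import dataclass, field


@dataclass
class StandInRule:
    """Stand-in for CleaningRule (fields are assumptions)."""
    name: str
    action: str = "log"


@dataclass
class StandInModule:
    """Stand-in for SimpleDataCleaningModule (fields are assumptions)."""
    mname: str
    rules: list = field(default_factory=list)
    strict_mode: bool = False
    report_level: str = "detailed"


def create_basic_module_sketch(module_name, rules=None, **kwargs):
    """Build a module with default settings; None becomes a fresh empty rule list."""
    # Copy the list so the module never shares state with the caller
    rule_list = list(rules) if rules is not None else []
    return StandInModule(mname=module_name, rules=rule_list, **kwargs)
```

Defaulting rules to None rather than to an empty list avoids the shared-mutable-default pitfall, so each module created without rules gets its own list.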