modules.dataClean
Base classes and common functionality for data cleaning modules.
This module provides the foundational infrastructure for creating data cleaning modules that integrate with the GDI pipeline system. All data cleaning modules should inherit from the base classes defined here.
Classes
- class modules.dataClean.BaseDataCleaningModule
Base class for all data cleaning modules in the GDI pipeline system.
This class provides the foundational infrastructure for data cleaning operations, including rule execution, logging, error handling, and integration with the pipeline system. All data cleaning modules should inherit from this class.
Features:
- Rule-based cleaning: Define and execute cleaning rules with different actions
- Comprehensive logging: Detailed logging with the GDI logging system
- Error handling: Configurable error handling (warnings, errors, continue)
- UI integration: Full UI schema support for configuration
- Pipeline integration: Seamless integration with the existing pipeline system
- Flexible data handling: Works with both TableData and TableCollection
Attributes
- cleaning_config: CleaningConfiguration
Configuration object containing all cleaning rules and settings
- strict_mode: bool
If True, stop execution on first error
- generate_report: bool
Whether to generate detailed cleaning reports
- report_level: str
Level of detail for reports (“summary”, “detailed”, “verbose”)
- max_errors: int | None
Maximum number of errors before stopping (None = unlimited)
- Inherits from:
PipeModule
Methods:
- __init__(mname: str | None = None, auto_run: bool = True, cleaning_config: CleaningConfiguration | None = None, strict_mode: bool = False, generate_report: bool = True, report_level: Literal['summary', 'detailed', 'verbose'] = 'detailed', max_errors: int | None = None)
Initialize the base data cleaning module.
Parameters
- mname: str | None
Module name (auto-generated if None)
- auto_run: bool
Whether to execute automatically in pipeline
- cleaning_config: CleaningConfiguration | None
Configuration object with rules and settings
- strict_mode: bool
Stop execution on first error
- generate_report: bool
Generate detailed cleaning reports
- report_level: str
Report detail level (“summary”, “detailed”, “verbose”)
- max_errors: int | None
Maximum errors before stopping (None = unlimited)
- **kwargs
Additional parameters passed to parent class
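The parameter list above does not spell out how strict_mode and max_errors interact. The following is a minimal sketch of one plausible reading, assuming strict_mode halts on the first error and max_errors otherwise caps the total. `should_stop` is a hypothetical helper written for illustration, not part of modules.dataClean, and the real stopping logic may differ.

```python
from typing import Optional


def should_stop(error_count: int, strict_mode: bool = False,
                max_errors: Optional[int] = None) -> bool:
    """Hypothetical helper: decide whether cleaning should halt.

    Assumed semantics (not confirmed by the module's source):
    - strict_mode=True halts as soon as one error has occurred;
    - otherwise, halt once error_count reaches max_errors;
    - max_errors=None means there is no limit.
    """
    if error_count <= 0:
        return False
    if strict_mode:
        return True
    if max_errors is None:
        return False
    return error_count >= max_errors


# Documented defaults: strict_mode=False, max_errors=None
print(should_stop(3))                    # no cap configured
print(should_stop(1, strict_mode=True))  # strict mode, one error seen
print(should_stop(2, max_errors=5))      # below the cap
print(should_stop(5, max_errors=5))      # cap reached
```

Under this reading, strict_mode takes precedence over max_errors, so setting both has the same effect as setting strict_mode alone.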
- update_ui_schema(reset: bool = False) -> dict[str, UIAttributeSchema]
Define UI schema for module configuration.
Returns
- dict[str, UIAttributeSchema]
UI schema, including global settings and rules configuration.
- execute() -> TableData | TableCollection | None
Execute the data cleaning module.
This is the main entry point for module execution. It handles the complete cleaning workflow including initialization, rule execution, logging, and error handling.
Returns
- TableData | TableCollection | None
Cleaned data, or None if execution failed
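Because execute() signals failure by returning None rather than raising, callers need an explicit check. The sketch below shows one way to do that. `StubCleaningModule` and `run_cleaning` are hypothetical stand-ins so the example runs on its own, and a plain list stands in for TableData / TableCollection.

```python
from typing import Optional


class StubCleaningModule:
    """Hypothetical stand-in for a BaseDataCleaningModule subclass."""

    def __init__(self, fail: bool = False):
        self.fail = fail

    def execute(self) -> Optional[list]:
        # Mirrors the documented contract: cleaned data, or None on failure.
        # A plain list stands in for TableData / TableCollection.
        if self.fail:
            return None
        return [{"id": 1, "name": "alice"}]


def run_cleaning(module) -> list:
    """Run a module and turn a None result into an explicit error."""
    result = module.execute()
    # Compare against None explicitly: an empty but valid result
    # could be falsy and must not be mistaken for a failure.
    if result is None:
        raise RuntimeError("data cleaning failed; check the module log")
    return result


cleaned = run_cleaning(StubCleaningModule())
print(len(cleaned))
```

Raising in the caller is a design choice for this sketch. A pipeline that tolerates failed steps could instead log the failure and continue with the uncleaned input.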
- class modules.dataClean.SimpleDataCleaningModule
A simple, concrete implementation of BaseDataCleaningModule.
This class provides a basic implementation that does nothing but log the cleaning rules. It’s useful for testing and as a template.
- Inherits from:
BaseDataCleaningModule
Functions
- modules.dataClean.create_basic_cleaning_module(module_name: str, rules: list[CleaningRule] | None = None) -> SimpleDataCleaningModule
Create a basic cleaning module with default configuration.
This is a convenience function for creating simple cleaning modules that log rules but don’t modify data.
Parameters
- module_name: str
Name for the module
- rules: list[CleaningRule] | None
List of cleaning rules to apply
- **kwargs
Additional parameters for module configuration
Returns
- SimpleDataCleaningModule
Configured cleaning module instance
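The "log rules but don't modify data" behaviour described above can be illustrated with stand-ins. This is a sketch only: `StubRule`, `StubSimpleModule`, and `create_basic_stub` are hypothetical names, the rule fields `name` and `action` are assumptions about what a CleaningRule might carry, and the real execute() takes no arguments (it reads its input from the pipeline), whereas data is passed explicitly here to keep the example self-contained.

```python
import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger("dataClean.sketch")


@dataclass
class StubRule:
    """Hypothetical stand-in for CleaningRule; the real fields may differ."""
    name: str
    action: str


class StubSimpleModule:
    """Stand-in for SimpleDataCleaningModule: logs rules, leaves data untouched."""

    def __init__(self, mname: str, rules: Optional[list] = None):
        self.mname = mname
        self.rules = list(rules or [])

    def execute(self, data):
        # Assumption: data is passed in directly; the real module
        # receives its input through the pipeline instead.
        for rule in self.rules:
            logger.info("%s: rule %r (action=%s)",
                        self.mname, rule.name, rule.action)
        return data  # pass-through: no rule is actually applied


def create_basic_stub(module_name: str,
                      rules: Optional[list] = None) -> StubSimpleModule:
    """Mirrors the documented factory signature using the stand-ins above."""
    return StubSimpleModule(module_name, rules)


rows = [{"id": 1, "name": " alice "}]
module = create_basic_stub("demo_clean", [StubRule("trim_names", "log")])
result = module.execute(rows)
print(result is rows)
```

A pass-through module like this is mainly useful for checking that rules are wired into a pipeline correctly before any real cleaning logic is written.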