modules.filters Module Help

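This chapter covers usage notes and examples for the common data selection and filtering modules in the modules.filters package, for example:

Filtering rows by condition (range filters, conditional filters, and so on)
Filtering by field value, deduplication, and similar operations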

TableSeriesSelector

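Overview and use cases

TableSeriesSelector selects a single column from a TableData or TableCollection and outputs it as a TableSeries (port: OutputTableSeries).
Typical use cases:
- A downstream module or plot needs only one field (for example depth, borehole number, or time);
- Extracting one column from a multi-table collection by "table + field" as the basis for statistics or filtering.

Ports

Input port - InputTable: the input table data (TableData) or table collection (TableCollection).
Output port - OutputTableSeries: the selected column (TableSeries); None when the input is empty, the table is empty, or the field does not exist.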

Quick start: select a column from a single table

from gdisdk.modules.filters import TableSeriesSelector

selector = TableSeriesSelector(mname="SelectSeries")
selector.select_field = "depth"   # field name (a field title also works)
# selector.InputTable = table      # TableData
series = selector.execute()

Quick start: select a column from a multi-table collection

from gdisdk.modules.filters import TableSeriesSelector

selector = TableSeriesSelector(mname="SelectSeriesFromCollection")
selector.select_field = ("剖面数据表", "x_coordinate")  # (table name/title, field name/title)
# selector.InputTable = tables     # TableCollection
series = selector.execute()

Parameters

TableSeriesSelector parameters
Parameter	Type	Default	Description
`select_field`	`tuple[str, str] | str | None`	`None`	Field selection: use a `str` (field name or field title) when the input is a `TableData`; use a `(table, field)` 2-tuple when the input is a `TableCollection`.

Usage in a pipeline

from gdisdk.pipeline import PipeLine
from gdisdk.modules.filters import TableSeriesSelector

pipe = PipeLine(app_name="SelectSeriesDemo", app_title="选择一列示例")

selector = TableSeriesSelector("SelectSeries")
selector.select_field = "depth"
# links = upstream.OutputTable >> selector.InputTable
# pipe.add_links(links)
# pipe.run()
# series = selector.OutputTableSeries.data


TableSelector

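Overview and use cases

TableSelector selects one table from a TableCollection and outputs it as a TableData (port: OutputTable).
Typical use cases:
- After reading multiple tables, only one of them needs further filtering, statistics, or plotting;
- Choosing the table to process dynamically in the UI (the module builds the list of options from the input table collection).

Ports

Input port - InputTables: the input table collection (TableCollection).
Output port - OutputTable: the selected table (TableData); None when the input is empty, or when table_name matches nothing and table_idx is not set.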

Quick start: select by table name or table title

from gdisdk.modules.filters import TableSelector

selector = TableSelector(mname="SelectTable")
selector.table_name = "剖面数据表"  # table name or table title; table_idx is ignored once this is set
# selector.InputTables = tables     # TableCollection
table = selector.execute()

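Parameters

TableSelector parameters
Parameter	Type	Default	Description
`table_name`	`str | None`	`None`	Name or title of the table to select; if it is not `None`, `table_idx` is ignored.
`table_idx`	`int | None`	`None`	Select by index (0-based); only takes effect when `table_name` is `None`.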

Usage in a pipeline

from gdisdk.pipeline import PipeLine
from gdisdk.modules.filters import TableSelector

pipe = PipeLine(app_name="SelectTableDemo", app_title="选择表示例")

selector = TableSelector("SelectTable")
selector.table_name = "剖面数据表"
# links = upstream.OutputTables >> selector.InputTables
# pipe.add_links(links)
# pipe.run()
# table = selector.OutputTable.data


TableCollectionSelector

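Overview and use cases

TableCollectionSelector performs a "select" or "remove" operation on a TableCollection and outputs a new TableCollection (port: OutputTables).
Typical use cases:
- Keeping only the tables of interest from a multi-table collection (operation="select");
- Removing unwanted tables from a multi-table collection (operation="remove") so that downstream modules do not process them by mistake.

Ports

Input port - InputTables: the input table collection (TableCollection).
Output port - OutputTables: the resulting table collection (TableCollection); existing main-table/sub-table relationships are preserved.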

Quick start: keep only the specified tables

from gdisdk.modules.filters import TableCollectionSelector

selector = TableCollectionSelector(mname="PickTables")
selector.operation = "select"
selector.table_names = ["剖面数据表", "钻孔表"]  # table names or table titles are both accepted
# selector.InputTables = tables               # TableCollection
out_tables = selector.execute()

Quick start: remove the specified tables

from gdisdk.modules.filters import TableCollectionSelector

selector = TableCollectionSelector(mname="RemoveTables")
selector.operation = "remove"
selector.table_names = ["不需要的表"]
# selector.InputTables = tables
out_tables = selector.execute()

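Parameters

TableCollectionSelector parameters
Parameter	Type	Default	Description
`table_names`	`list[str] | None`	`None`	List of table names or table titles to select or remove; if it is not `None`, `table_idxs` is ignored; tables that do not exist are ignored.
`table_idxs`	`list[int] | None`	`None`	List of table indexes to select or remove; out-of-range indexes are ignored.
`operation`	`Literal["select","remove"]`	`"select"`	`"select"` keeps only the specified tables; `"remove"` removes the specified tables from the collection.

Note

When both table_names and table_idxs are None:
- operation="select": the output is None;
- operation="remove": the input TableCollection is returned directly, as the same reference.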
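
The selection rules above can be sketched in plain Python. This is only an illustration of the documented behaviour, not the gdisdk implementation: a dict stands in for TableCollection, `pick_tables` is a hypothetical helper, and two details are assumptions (the output keeps the input order, and negative indexes are treated as out of range).

```python
from typing import Optional


def pick_tables(
    tables: dict,
    table_names: Optional[list] = None,
    table_idxs: Optional[list] = None,
    operation: str = "select",
) -> Optional[dict]:
    """Mimic the documented select/remove rules on a plain dict of tables."""
    if table_names is None and table_idxs is None:
        # Nothing specified: "select" yields None, "remove" returns the input itself.
        return None if operation == "select" else tables

    all_names = list(tables)
    if table_names is not None:
        # table_names wins over table_idxs; unknown names are ignored.
        chosen = [n for n in table_names if n in tables]
    else:
        # Out-of-range indexes are ignored (assumption: negatives count as out of range).
        chosen = [all_names[i] for i in table_idxs if 0 <= i < len(all_names)]

    if operation == "select":
        return {n: tables[n] for n in all_names if n in chosen}
    return {n: tables[n] for n in all_names if n not in chosen}


tables = {"boreholes": [1, 2], "sections": [3], "scratch": []}
kept = pick_tables(tables, table_names=["sections", "missing"])
trimmed = pick_tables(tables, table_idxs=[2, 99], operation="remove")
same = pick_tables(tables, operation="remove")
```

Here `kept` contains only "sections", `trimmed` drops "scratch", and `same` is the very object that was passed in, which is why mutating it downstream would also affect the upstream collection.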

Usage in a pipeline

from gdisdk.pipeline import PipeLine
from gdisdk.modules.filters import TableCollectionSelector

pipe = PipeLine(app_name="SelectTablesDemo", app_title="选择/移除表集合示例")

selector = TableCollectionSelector("SelectTables")
selector.operation = "select"
selector.table_names = ["剖面数据表"]
# links = upstream.OutputTables >> selector.InputTables
# pipe.add_links(links)
# pipe.run()
# tables = selector.OutputTables.data


TablesQuery

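Overview and use cases

TablesQuery filters table data with expressions in the style of pandas DataFrame.query. It supports:
- TableData input: outputs the filtered TableData;
- TableCollection input: outputs the filtered TableCollection;
- A query template, query_template, with template variables (so values can be changed from the UI);
- Cascading: when the TableCollection has a main-table/sub-table relationship (main_table/sub_tables/primary_key) and cascade_to_children=True, the main table is filtered first and the sub-tables are then filtered by the surviving primary-key values.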
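
The cascading behaviour in the last item can be illustrated with plain pandas. This is a sketch of the idea under stated assumptions, not the module's code: the two DataFrames, the `bore_number` key, and the variable names are made up for the example.

```python
import pandas as pd

# Hypothetical main table and sub-table sharing the key column "bore_number".
main = pd.DataFrame(
    {"bore_number": ["ZK1", "ZK2", "ZK3"], "depth": [12.0, 35.0, 8.0]}
)
layers = pd.DataFrame(
    {
        "bore_number": ["ZK1", "ZK1", "ZK2", "ZK3"],
        "layer": ["fill", "clay", "sand", "fill"],
    }
)

# Step 1: filter the main table with a DataFrame.query expression.
main_filtered = main.query("`depth` >= 10")

# Step 2: keep only the sub-table rows whose key survived in the main table.
surviving_keys = main_filtered["bore_number"]
layers_filtered = layers[layers["bore_number"].isin(surviving_keys)]
```

Note that the query is evaluated against the main table only; the sub-table never needs a `depth` column, because it is filtered purely by key membership.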

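Ports

Input port - InputTables: the input table (TableData) or table collection (TableCollection).
Output port - OutputTables: the filtered table (TableData) or table collection (TableCollection); None when the input is empty, query_template is empty, or template variable validation fails (for example a variable value is None and none_error_type is not "ignore").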

Quick start: no template variables (hand-written constant conditions)

from gdisdk.modules.filters import TablesQuery

q = TablesQuery(mname="QueryTables")
q.query_template = "`年份` == 2007 and `国家` == 'US'"  # string constants must be quoted by hand
# q.InputTables = table_or_tables
out = q.execute()

Quick start: template variables (values editable from the UI)

from gdisdk.modules.filters import TablesQuery
from gdisdk.pipeline.pipeData import TemplateVariableConfig

q = TablesQuery(mname="QueryWithTpl")
q.query_template = "`年份` == {tpl_year} and `国家` == {tpl_country}"
q.template_variables = {
    "tpl_year": TemplateVariableConfig(
        title="年份",
        default=2007,
        value_type="int",
        schema_type="auto_select",
    ),
    "tpl_country": TemplateVariableConfig(
        title="国家",
        default="US",
        value_type="str",       # strings are quoted automatically; no need to write '{tpl_country}'
        schema_type="auto_select",
    ),
}
# q.InputTables = table_or_tables
q.tpl_country = "CN"  # template variables can be changed through attributes before running
out = q.execute()

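Query template syntax (quick reference)

query_template uses the pandas DataFrame.query expression syntax (logical operators: and/or/not; comparisons: == != > >= < <=).
Wrapping field names and field titles in backticks (for example `年份`) is recommended; it avoids most parse errors caused by a field name that is not a valid identifier.
String constants must be quoted (for example 'US'). When a template variable has value_type="str", however, the string is quoted automatically, so there is no need to write '{tpl_xxx}'.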
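
To make the automatic quoting concrete, here is a minimal sketch of how a template could be rendered into a final query string. `render_query` is a hypothetical helper, not part of gdisdk, and using `repr()` to quote strings is an assumption about one reasonable way to do it; the `tpl_` prefix check mirrors the naming rule for template variables.

```python
def render_query(template: str, values: dict) -> str:
    """Fill {tpl_xxx} placeholders, quoting string values like Python literals."""
    rendered = {}
    for name, value in values.items():
        if not name.startswith("tpl_"):
            raise ValueError(f"template variable {name!r} must start with 'tpl_'")
        # repr() adds quotes (and escapes) for str; numbers are inserted bare.
        rendered[name] = repr(value) if isinstance(value, str) else str(value)
    return template.format(**rendered)


query = render_query(
    "`年份` == {tpl_year} and `国家` == {tpl_country}",
    {"tpl_year": 2007, "tpl_country": "US"},
)
# query is now: `年份` == 2007 and `国家` == 'US'
```

Quoting at render time is what lets the same template accept either numbers or strings without the template author having to remember which placeholders need quotes.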

Common template examples

# 1) Numeric equality / range
"`年份` == 2007"
"`深度` >= 10 and `深度` < 30"

# 2) String equality (hand-written constants need quotes)
"`国家` == 'US'"

# 3) in / not in (list constants)
"`类型` in ['A', 'B', 'C']"
"`类型` not in ['废弃', '删除']"

# 4) Empty and null checks (None/NaN)
"`备注` == ''"                # empty string
"`备注`.isnull()"             # null (None/NaN)
"`备注`.notnull()"            # not null

# 5) String contains / prefix / suffix (pandas string methods)
"`名称`.str.contains('岩', na=False)"
"`名称`.str.startswith('ZK', na=False)"
"`名称`.str.endswith('号', na=False)"

# 6) Combining conditions (parentheses control precedence)
"(`状态` == '有效' and `深度` > 10) or (`状态` == '复核')"

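When backticks `…` are needed (very important)

pandas query requires field names that parse like Python variables. A field name that does not meet this requirement must be wrapped in backticks (that is, `column name`). Backticks are recommended or required in the following cases:

The field name contains Chinese characters, spaces, hyphens, or other special characters (for example 工程名称, x coordinate, x-coordinate).
The field name starts with a digit or contains a dot (for example 2020年, a.b).
The field name clashes with a Python keyword (for example class, lambda) or contains characters that are not allowed in identifiers.
The template uses a field title rather than a field name (titles often contain Chinese characters or spaces, so backticks are strongly recommended).

To avoid these pitfalls, the examples in this document consistently use the `field name/field title` form, which is also the recommended style.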
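
The effect of backticks can be checked directly with pandas, independently of gdisdk. The DataFrame and column names below are invented for the illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "depth": [5, 15, 25],
        "x coordinate": [100.0, 200.0, 300.0],
        "sample-count": [3, 1, 4],
    }
)

# A name that is already a valid identifier works with or without backticks.
plain = df.query("depth >= 10")
quoted = df.query("`depth` >= 10")

# Names containing a space or a hyphen only parse when backtick-quoted.
special = df.query("`x coordinate` > 150 and `sample-count` >= 2")
```

Since quoting a valid identifier is harmless, always using backticks costs nothing and removes the need to decide case by case.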

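Parameters

TablesQuery parameters
Parameter	Type	Default	Description
`query_template`	`str | None`	`None`	Query expression template (pandas query syntax); supports `{tpl_xxx}` placeholders; field names containing Chinese characters or spaces should be wrapped in backticks (for example `年份`).
`template_variables`	`dict[str, TemplateVariableConfig | UIAttributeSchema] | None`	`None`	Dictionary of template variable configurations; every variable name must start with `tpl_`, and values can be assigned before running via `q.tpl_xxx = ...`.
`main_table`	`str | None`	`None`	Main table name or title in cascading mode; when omitted, the module tries to take it from the input `TableCollection.main_table`.
`related_tables`	`list[str] | None`	`None`	Sub-tables that follow the main-table filter in cascading mode; when omitted, the module tries to take them from `TableCollection.sub_tables`.
`join_key`	`str | None`	`None`	Name or title of the key field that links the main table and the sub-tables; when omitted, the module tries to take it from `TableCollection.primary_key`.
`cascade_to_children`	`bool`	`True`	Whether to enable the cascading logic "filter the main table, then filter the sub-tables by key value"; when `False`, the same query is run independently on every table in the collection.
`debug_mode`	`bool`	`False`	When `True`, outputs more detailed query evaluation errors (useful for troubleshooting expressions).
`none_error_type`	`Literal["error","warning","gdi_warning","ignore"]`	`"warning"`	How to handle template variables whose value is `None`: raise an error, warn, issue a GDIM warning, or skip the check.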

Usage in a pipeline

from gdisdk.pipeline import PipeLine
from gdisdk.modules.filters import TablesQuery

pipe = PipeLine(app_name="TablesQueryDemo", app_title="表格查询示例")

q = TablesQuery("Query")
q.query_template = "`年份` >= 2020"
# links = upstream.OutputTables >> q.InputTables
# pipe.add_links(links)
# pipe.run()
# out = q.OutputTables.data


DropDuplicateRows

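Overview and use cases

DropDuplicateRows deduplicates an input table (TableData) by the specified fields and outputs the deduplicated TableData (port: OutputTable).
Typical use cases:
- Deduplicating business data such as boreholes, sections, or survey points by their identifier;
- Keeping only unique records, or merging certain string fields (for example remarks or sources) across duplicate records.

Ports

Input port - InputTable: the input table (TableData).
Output port - OutputTable: the deduplicated table (TableData); None when the input is empty.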

Quick start: deduplicate by field

from gdisdk.modules.filters import DropDuplicateRows

dropdup = DropDuplicateRows(mname="DropDuplicate")
dropdup.subset = ["bore_number"]   # field names or field titles are both accepted
# dropdup.InputTable = table        # TableData
out_table = dropdup.execute()

Quick start: merge string columns across duplicate rows (requires subset)

from gdisdk.modules.filters import DropDuplicateRows

dropdup = DropDuplicateRows(mname="DropDuplicateMergeText")
dropdup.subset = ["bore_number"]
dropdup.join_string_columns = ["remark"]  # string columns to merge
dropdup.join_separator = "\n"
# dropdup.InputTable = table
out_table = dropdup.execute()

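Parameters

DropDuplicateRows parameters
Parameter	Type	Default	Description
`subset`	`list[str] | None`	`None`	Columns used to identify duplicate rows; all columns are used when `None`; field names or field titles are both accepted.
`keep_empty_strings`	`bool`	`True`	Whether to keep rows that contain an empty string `""`; when `False`, rows containing empty strings are dropped after deduplication.
`keep_null_values`	`bool`	`True`	Whether to keep rows that contain `None/NaN`; when `False`, rows containing null values are dropped after deduplication.
`join_string_columns`	`list[str] | None`	`None`	When duplicate rows are found (based on `subset`), the string values of these columns are concatenated; only takes effect when `subset` is not empty.
`join_separator`	`str`	`"\n"`	Separator used when merging string columns.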
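
The interaction between subset, join_string_columns, and join_separator can be sketched in plain Python. This is an illustration only: `drop_duplicates_joining` is a hypothetical helper working on a list of dicts, and two details are assumptions rather than documented behaviour (the first row of each group supplies the other columns, and empty strings are skipped when joining).

```python
def drop_duplicates_joining(rows, subset, join_columns, separator="\n"):
    """Keep the first row per key and join the text of later duplicates into it."""
    merged = {}  # dicts preserve insertion order, so first-seen keys come first
    for row in rows:
        key = tuple(row[col] for col in subset)
        if key not in merged:
            merged[key] = dict(row)  # copy so the input rows stay untouched
            continue
        kept = merged[key]
        for col in join_columns:
            if row[col]:
                # filter(None, ...) drops an empty kept value before joining.
                kept[col] = separator.join(filter(None, [kept[col], row[col]]))
    return list(merged.values())


rows = [
    {"bore_number": "ZK1", "depth": 12.0, "remark": "first survey"},
    {"bore_number": "ZK2", "depth": 30.0, "remark": ""},
    {"bore_number": "ZK1", "depth": 12.5, "remark": "re-measured"},
]
out = drop_duplicates_joining(rows, ["bore_number"], ["remark"], separator="; ")
```

The two ZK1 rows collapse into one whose remark is "first survey; re-measured", while ZK2 passes through unchanged.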

Usage in a pipeline

from gdisdk.pipeline import PipeLine
from gdisdk.modules.filters import DropDuplicateRows

pipe = PipeLine(app_name="DropDuplicateDemo", app_title="去重示例")

dropdup = DropDuplicateRows("DropDuplicate")
dropdup.subset = ["bore_number"]
# links = upstream.OutputTable >> dropdup.InputTable
# pipe.add_links(links)
# pipe.run()
# out_table = dropdup.OutputTable.data


GdimAppDataSelector

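Overview and use cases

GdimAppDataSelector selects one field from a ResultModel (usually the OutputResultModel of gdisdk.modules.readers.GdimAppDataReader) and outputs it on the OutputData port (type General; the actual value may be a TableData, a scalar, and so on, depending on what the upstream saved).
The module builds the lookup key from the configured name and module_name, following the same rules that save_data_to_db() uses for key names when persisting data:
- name starts with "Output": treated as an output port name; module_name must also be provided, and the key is "{module_name}@{name}";
- name does not start with "Output" and module_name is provided: treated as a module parameter, and the key is "{module_name}#{name}";
- name does not start with "Output" and module_name is None: treated as an attribute of the Pipeline itself, and the key is "pipeline@{name}" (the name must match the attribute name used with save_data_to_db(..., data_type="pipeline")).
If the key does not exist, execution raises a KeyError that lists the keys available in the current ResultModel, which makes it easier to check the upstream configuration.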
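
The key rules above translate directly into a small function. The sketch below is a restatement of those rules for illustration, not the module's source: `build_lookup_key` and `pick` are hypothetical names, a dict stands in for the ResultModel, and raising ValueError for a missing module_name is an assumption (the text only says module_name is required in that case).

```python
from typing import Optional


def build_lookup_key(name: str, module_name: Optional[str] = None) -> str:
    """Build the lookup key from name and module_name using the three rules."""
    if name.startswith("Output"):
        if module_name is None:
            # Assumption: how a missing module_name is reported is not documented.
            raise ValueError("module_name is required when name is an output port")
        return f"{module_name}@{name}"
    if module_name is not None:
        return f"{module_name}#{name}"
    return f"pipeline@{name}"


def pick(result: dict, name: str, module_name: Optional[str] = None):
    """Look the key up and list the available keys when it is missing."""
    key = build_lookup_key(name, module_name)
    if key not in result:
        raise KeyError(f"{key!r} not found; available keys: {sorted(result)}")
    return result[key]


saved = {
    "CorrosionCompute@OutputTable": "table",
    "CorrosionCompute#threshold": 0.5,
    "pipeline@workspace": "/data/ws",
}
port_value = pick(saved, "OutputTable", "CorrosionCompute")
param_value = pick(saved, "threshold", "CorrosionCompute")
pipeline_value = pick(saved, "workspace")
```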
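Typical use case: in a report Pipeline, attach one GdimAppDataSelector per downstream module after the GdimAppDataReader, so that tables, images, parameters, and so on are split into separate links.

Ports

Input port - InputResultModel: the ResultModel produced by GdimAppDataReader.
Output port - OutputData: the value of the selected field; None when the input or name is empty.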

Quick start: select an upstream module's output table

from gdisdk.modules.filters import GdimAppDataSelector

sel = GdimAppDataSelector(mname="PickTable")
sel.module_name = "CorrosionCompute"
sel.name = "OutputTable"  # a port name must start with Output
# sel.InputResultModel = reader.OutputResultModel
table = sel.execute()

Quick start: select an attribute of the Pipeline itself (for example workspace)

from gdisdk.modules.filters import GdimAppDataSelector

sel = GdimAppDataSelector(mname="PickWorkspace")
sel.module_name = None
sel.name = "workspace"
value = sel.execute()

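Parameters

GdimAppDataSelector parameters
Parameter	Type	Default	Description
`name`	`str | None`	`None`	A port name (must start with `Output`), a module parameter name, or a Pipeline attribute name; when `None`, the output is `None`.
`module_name`	`str | None`	`None`	Required when selecting a port or a module parameter; leave as `None` when selecting `pipeline@...`.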

Usage in a pipeline

from gdisdk.pipeline import PipeLine
from gdisdk.modules import GdimAppDataReader, GdimAppDataSelector

pipe = PipeLine(app_name="ReportFromSavedApp", app_title="读取上游结果写报告")

reader = GdimAppDataReader("ReadApp")
reader.app_title = "水腐蚀性分析"

pick = GdimAppDataSelector("PickResult")
pick.module_name = "AnalysisModule"
pick.name = "OutputTable"

links = reader.OutputResultModel >> pick.InputResultModel
pipe.add_links(links)
# pipe.run()
# table = pick.OutputData.data

More information

GdimAppDataReader in the modules.readers module help.
Runtime mechanism (Runtime)
API reference - modules package
Full class definitions and parameter descriptions (filters)

MarkdownSectionFilter

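Overview and use cases

MarkdownSectionFilter parses a Markdown document into a section tree by heading level (# / ## / ### …) and selects the required sections according to rules. Its main outputs are the filtered Markdown text (OutputMarkdown) and a structured ResultModel (OutputResultModel, with metadata and statistics).
To write the result to a file, use TextWriter (see the writers help) downstream.
Typical use cases:
- Trimming a long report down to the sections of interest for later LLM/RAG processing;
- Filtering out noise such as tables of contents and lists of figures (drop_toc=True);
- Applying different strategies to tables (keep, drop, caption only, truncate rows) to control context length.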
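
As an illustration of the row-truncation strategy mentioned in the last item (keep the table header plus the first N data rows), here is a minimal sketch for Markdown pipe tables. `truncate_table_rows` is a hypothetical helper; it assumes the table is given as a list of lines whose first line is the header and whose second line is the `---` separator, and it does not cover HTML tables.

```python
def truncate_table_rows(table_lines: list, keep_rows: int) -> list:
    """Keep the header, the separator line and the first keep_rows data rows."""
    header, separator, *data_rows = table_lines
    return [header, separator, *data_rows[:keep_rows]]


table = [
    "| layer | depth |",
    "| --- | --- |",
    "| fill | 1.2 |",
    "| clay | 4.8 |",
    "| sand | 9.5 |",
]
short = truncate_table_rows(table, keep_rows=2)
```

With keep_rows=2 the "sand" row is dropped while the header and separator are always retained, so the result is still a valid table.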

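Ports

InputMarkdown: input Markdown text or a path to a Markdown file; if this port has a value, it overrides the markdown parameter/attribute.
OutputMarkdown: the filtered Markdown text (a string).
OutputResultModel: the dynamic type MarkdownSectionFilterResult (a ResultModel subclass). Besides the field named by filtered_markdown_key (filtered_markdown by default, holding the filtered Markdown string), it generally also includes:
- selected_sections / dropped_sections: lists of metadata for the selected and dropped sections (dicts, not JSON strings);
- Statistics fields such as original_chars, filtered_chars, reduction_ratio, original_tokens_estimate, filtered_tokens_estimate, original_sections_count, original_markdown_tables, original_html_tables, filtered_markdown_tables, filtered_html_tables;
- selected_sections_count / dropped_sections_count;
- filter_rules: a summary of the filter rules applied in this run.
When several filters run in parallel, give each module a different filtered_markdown_key so that field names do not clash when the ResultModels are merged downstream.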

Quick start: filter by section number and shrink tables

from gdisdk.modules.filters import MarkdownSectionFilter

f = MarkdownSectionFilter(mname="FilterMarkdown")
f.markdown = "report.md"  # a Markdown string can also be passed directly
f.include_number_prefixes = ["2", "4", "5"]  # for example: site conditions / evaluation / conclusions
f.drop_toc = True
f.tables_mode = "truncate_rows"
f.table_truncate_rows = 5
out_md = f.execute()

Quick start: customize the Markdown field name in the ResultModel (to avoid clashes)

from gdisdk.modules.filters import MarkdownSectionFilter

f = MarkdownSectionFilter(
    mname="FilterMarkdown",
    filtered_markdown_key="site_conditions_md",
)
f.markdown = "report.md"
out_md = f.execute()
# out_md is the same as f.OutputMarkdown.data (the filtered Markdown string)
# f.OutputResultModel.data.site_conditions_md also holds the filtered Markdown
# Other metadata: f.OutputResultModel.data.selected_sections and so on

Quick start: filter by title pattern (regex) and exclude the table of contents and figures

from gdisdk.modules.filters import MarkdownSectionFilter

f = MarkdownSectionFilter(mname="FilterByTitle")
f.markdown = "report.md"
f.include_title_patterns = ["场地.*地质条件", "地层岩性", "结论"]
f.exclude_title_patterns = ["图件", "目.*录"]
out_md = f.execute()

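Parameters

MarkdownSectionFilter parameters
Parameter	Type	Default	Description
`markdown`	`str | Path | None`	`None`	Input Markdown text or file path; if the `InputMarkdown` port has a value, it overrides this attribute at execution time.
`filtered_markdown_key`	`str`	`"filtered_markdown"`	Name of the field (a `ResultModel` attribute) on `OutputResultModel` that holds the filtered Markdown text; when several `MarkdownSectionFilter` modules run in parallel or several results are merged downstream, give each module a different field name to avoid clashes.
`include_number_prefixes`	`list[str]`	`[]`	Section number prefixes to include (for example `["2", "4.6"]`); all subsections under a prefix are included (for example `"2"` matches `2.1`, `2.2.1`, and so on).
`include_title_patterns`	`list[str]`	`[]`	Title patterns to include (regex or substring); a section is included when its title matches.
`exclude_number_prefixes`	`list[str]`	`[]`	Section number prefixes to exclude (takes precedence over include).
`exclude_title_patterns`	`list[str]`	`[]`	Title patterns to exclude (regex).
`keep_parent_headings`	`bool`	`True`	When `True`, a matched subsection also keeps its parent headings as context (headings only, without the parents' body text).
`drop_preamble`	`bool`	`True`	Whether to drop non-section content before the first heading, such as a preface or cover page.
`drop_toc`	`bool`	`True`	Whether to try to remove the table of contents (TOC) block.
`tables_mode`	`Literal["keep","drop","caption_only","truncate_rows"]`	`"keep"`	Table handling strategy: keep, drop, keep the caption only, or truncate rows (keep the header plus the first N data rows).
`table_truncate_rows`	`int`	`10`	Number of data rows to keep (excluding the header) when `tables_mode="truncate_rows"`.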
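
The number-prefix rules (a prefix covers its subsections, and exclude takes precedence over include) can be sketched as follows. The helper names are hypothetical, and matching on whole dotted components, so that "2" does not match "20", is an assumption about the intended behaviour rather than something this page states.

```python
def matches_prefix(section_number: str, prefixes: list) -> bool:
    """True when section_number equals a prefix or is nested below it."""
    return any(
        section_number == prefix or section_number.startswith(prefix + ".")
        for prefix in prefixes
    )


def is_selected(section_number: str, include: list, exclude: list) -> bool:
    """Exclude wins over include, as described for exclude_number_prefixes."""
    if matches_prefix(section_number, exclude):
        return False
    return matches_prefix(section_number, include)


numbers = ["1", "2", "2.1", "2.2.1", "20", "4.6", "4.6.3", "4.7"]
selected = [n for n in numbers if is_selected(n, ["2", "4.6"], ["2.2"])]
```

In this run "2.2.1" is dropped by the exclude rule even though it sits under the included prefix "2", and "20" and "4.7" are never matched.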

Usage in a pipeline

from gdisdk.pipeline import PipeLine
from gdisdk.modules.filters import MarkdownSectionFilter

pipe = PipeLine(app_name="MarkdownFilterDemo", app_title="Markdown 章节过滤示例")

f = MarkdownSectionFilter("MarkdownFilter")
f.include_number_prefixes = ["2", "4"]
f.drop_toc = True
# links = upstream.OutputMarkdown >> f.InputMarkdown
# pipe.add_links(links)
# pipe.run()
# out_md = f.OutputMarkdown.data
# meta = f.OutputResultModel.data
# To write a file, connect OutputMarkdown to a module such as TextWriter
