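# DOCX Operations Manual

This manual describes only the script interfaces and data conventions.
`SKILL.md` tells the AI when to use these scripts; read this manual only when a script actually needs to be run.
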
## How to run

Always use the virtual environment bundled with this skill:

```powershell
.venv\Scripts\python.exe scripts\docx_index.py --docx <absolute-path> --out <absolute-path>
.venv\Scripts\python.exe scripts\docx_query.py --docx <absolute-path> --query-file <absolute-path> --out <absolute-path>
.venv\Scripts\python.exe scripts\docx_create.py --spec-file <absolute-path> --report <absolute-path>
.venv\Scripts\python.exe scripts\outline_check.py --outline-file <absolute-path> --report <absolute-path>
.venv\Scripts\python.exe scripts\outline_export.py --spec-file <absolute-path> --report <absolute-path>
.venv\Scripts\python.exe scripts\docx_patch.py --patch-file <absolute-path> --report <absolute-path>
.venv\Scripts\python.exe scripts\render_docx.py --docx <absolute-path> --out-dir <absolute-path> --report <absolute-path>
```

Alternatively, route everything through the unified CLI:

```powershell
.venv\Scripts\python.exe scripts\docx_cli.py index ...
.venv\Scripts\python.exe scripts\docx_cli.py query ...
.venv\Scripts\python.exe scripts\docx_cli.py create ...
.venv\Scripts\python.exe scripts\docx_cli.py outline-check ...
.venv\Scripts\python.exe scripts\docx_cli.py outline-export ...
.venv\Scripts\python.exe scripts\docx_cli.py patch ...
.venv\Scripts\python.exe scripts\docx_cli.py render ...
```

## 0. Create a new DOCX

### Command

```powershell
.venv\Scripts\python.exe scripts\docx_create.py --spec-file D:\work\create.json --report D:\work\create.report.json
```

Or through the unified CLI:

```powershell
.venv\Scripts\python.exe scripts\docx_cli.py create --spec-file D:\work\create.json --report D:\work\create.report.json
```

### spec JSON

```json
{
  "output_docx": "D:/work/generated-outline.docx",
  "docx_style_profile": "default_bid",
  "numbering_mode": "explicit_text",
  "template_docx": null,
  "title": "Outline test",
  "blocks": [
    {"type": "heading", "level": 1, "text": "Technical Bid Outline"},
    {"type": "heading", "level": 2, "text": "Overall Project Plan"},
    {"type": "paragraph", "text": "This is explanatory text"},
    {"type": "list", "items": ["System architecture design", "Implementation and deployment plan"]},
    {"type": "table", "rows": [["Section", "Description"], ["5.1", "Overall design"]]},
    {"type": "page_break"}
  ]
}
```

### Supported block types

- `heading`
  - Required: `text`
  - Optional: `level`, range `1-9`
- `paragraph`
  - Required: `text`
  - Optional: `style`
- `list`
  - Required: `items`
  - Optional: `style`, defaults to `List Bullet`
- `table`
  - Required: `rows`, a 2D array with a consistent column count
  - Optional: `style`
- `page_break`

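The block rules above can be checked before `docx_create.py` is called. The following is a minimal sketch, not the script's own validation: `validate_blocks` is a hypothetical helper, and accepting a heading without `level` is an assumption based on `level` being listed as optional.

```python
def validate_blocks(blocks):
    """Return a list of error strings for a spec's `blocks` array.

    Hypothetical pre-flight check that mirrors the documented rules;
    docx_create.py itself may validate differently.
    """
    # Required field per block type (None means no required field).
    required = {
        "heading": "text",
        "paragraph": "text",
        "list": "items",
        "table": "rows",
        "page_break": None,
    }
    errors = []
    for i, block in enumerate(blocks):
        if not isinstance(block, dict):
            errors.append(f"blocks[{i}]: block must be an object")
            continue
        kind = block.get("type")
        if kind not in required:
            errors.append(f"blocks[{i}]: unsupported type {kind!r}")
            continue
        field = required[kind]
        if field is not None and field not in block:
            errors.append(f"blocks[{i}]: {kind} requires `{field}`")
            continue
        if kind == "heading" and "level" in block:
            level = block["level"]
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(level, bool) or not isinstance(level, int) or not 1 <= level <= 9:
                errors.append(f"blocks[{i}]: level must be an integer in 1-9")
        if kind == "table":
            rows = block["rows"]
            is_2d = isinstance(rows, list) and all(isinstance(r, list) for r in rows)
            if not is_2d or len({len(r) for r in rows}) > 1:
                errors.append(f"blocks[{i}]: rows must be a 2D array with equal column counts")
    return errors
```

An empty result means the blocks satisfy the rules listed in this section; it says nothing about style names or other fields the script may check.
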
### Output

The report JSON contains:

- `status`
- `output_docx`
- `block_count`
- `blocks`
- `final_summary`
- `format_profile`
- `numbering_validation`
- `caption_validation`
- `toc_validation`
- `acceptance_checks`

## 0.1 Outline gate check

### Command

```powershell
.venv\Scripts\python.exe scripts\outline_check.py --outline-file D:\work\outline.json --report D:\work\outline.check.json
```

Or through the unified CLI:

```powershell
.venv\Scripts\python.exe scripts\docx_cli.py outline-check --outline-file D:\work\outline.json --report D:\work\outline.check.json
```

### Input conventions

- The top level is `blocks`
- Outline nodes use `type=heading`
- Outline depth is given by `level`
- Child nodes go in `children`
- The top level may optionally carry `outline_policy`
  - It acts as the default policy; enabling an exception for the whole outline here is not recommended
- An individual outline node may optionally carry `policy`
  - `allow_service_facets: true|false`
  - `respect_fixed_structure: true|false`
  - It applies only to that node and its subtree

Minimal example:

```json
{
  "outline_policy": {
    "allow_service_facets": false,
    "respect_fixed_structure": false
  },
  "blocks": [
    {
      "type": "heading",
      "level": 1,
      "text": "Technical Bid Outline",
      "children": [
        {
          "type": "heading",
          "level": 2,
          "text": "Overall Design Plan",
          "children": [
            {"type": "heading", "level": 3, "text": "Construction Goals and Principles"}
          ]
        },
        {
          "type": "heading",
          "level": 2,
          "text": "Operations and Maintenance Service Plan",
          "policy": {
            "allow_service_facets": true
          },
          "children": [
            {"type": "heading", "level": 3, "text": "Service Organization and Division of Responsibilities"}
          ]
        }
      ]
    }
  ]
}
```

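To illustrate how a node-level `policy` could combine with the top-level `outline_policy`, here is a minimal sketch. It rests on two assumptions that the manual does not spell out: a node's `policy` overrides only the keys it sets, and both flags default to `false` when omitted. `effective_policies` is a hypothetical helper, not part of the scripts.

```python
def effective_policies(outline):
    """Yield (heading text, effective policy) for every node, depth first."""
    # Assumption: both flags are off unless a policy turns them on.
    default = {"allow_service_facets": False, "respect_fixed_structure": False}
    default.update(outline.get("outline_policy", {}))

    def walk(nodes, inherited):
        for node in nodes:
            # A node's own policy overrides the inherited one key by key,
            # and the merged result is passed down to its subtree only.
            policy = {**inherited, **node.get("policy", {})}
            yield node.get("text"), policy
            yield from walk(node.get("children", []), policy)

    yield from walk(outline.get("blocks", []), default)
```

Under these assumptions, a node that sets `allow_service_facets: true` passes that value to its children, while its siblings keep the top-level default.
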
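### What is currently checked

- Outline depth and drill-down of abstract headings
- Object-based child nodes and repeated facets
- Whether technical placeholders in the "Commercial and Other" outline have been expanded by mistake
- Whether heading `level` values are valid from one level to the next
- Whether the service-project and fixed-structure exceptions take effect only on the designated nodes
- Whether `children` has a valid type
- Whether each block is an object
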
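As an illustration of the structural checks in this list (block is an object, `children` is an array, `level` is valid from one level to the next), here is a minimal sketch. It assumes that a child must sit exactly one level below its parent, which matches the minimal example but is not stated as a rule; `check_levels` is a hypothetical name and `outline_check.py` may apply different logic.

```python
def check_levels(blocks, parent_level=None, path="blocks"):
    """Return error strings for structural problems in an outline tree."""
    errors = []
    for i, node in enumerate(blocks):
        here = f"{path}[{i}]"
        if not isinstance(node, dict):
            errors.append(f"{here}: block is not an object")
            continue
        level = node.get("level")
        if isinstance(level, bool) or not isinstance(level, int):
            errors.append(f"{here}: level must be an integer")
            continue
        # Assumption: a child sits exactly one level below its parent.
        if parent_level is not None and level != parent_level + 1:
            errors.append(f"{here}: level {level} under parent level {parent_level}")
        children = node.get("children", [])
        if not isinstance(children, list):
            errors.append(f"{here}: children must be an array")
            continue
        errors.extend(check_levels(children, level, f"{here}.children"))
    return errors
```

The semantic checks in the list, such as abstract headings or repeated facets, depend on rules that live in the script and are not reproduced here.
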
## 0.2 Final export for the outline stage

This section describes only the interface of `outline_export.py`; it does not define the outline-stage workflow.
`references/outline-stage.md` is the sole authority for outline-stage rules.
Call `outline_export.py` only after the outline has passed its final check; it generates the final deliverables.

### Command

```powershell
.venv\Scripts\python.exe scripts\outline_export.py --spec-file D:\work\outline-export.json --report D:\work\outline-export.report.json
```

Or through the unified CLI:

```powershell
.venv\Scripts\python.exe scripts\docx_cli.py outline-export --spec-file D:\work\outline-export.json --report D:\work\outline-export.report.json
```

### Input conventions

```json
{
  "technical_outline": {
    "title": "Technical Bid Outline",
    "blocks": []
  },
  "business_outline": {
    "title": "Commercial and Other Outline",
    "blocks": []
  },
  "docx_style_profile": "default_bid",
  "numbering_mode": "explicit_text",
  "template_docx": null,
  "technical_outline_json": "D:/work/final_outline_technical.json",
  "business_outline_json": "D:/work/final_outline_business_other.json",
  "technical_docx": "D:/final/技术标_目录版.docx",
  "business_docx": "D:/final/商务及其他_目录版.docx"
}
```

### Output

- Writes the final `work/final_outline_technical.json`
- Writes the final `work/final_outline_business_other.json`
- Writes the two outline-version DOCX files
- Returns two export reports

## 1. Index

### Command

```powershell
.venv\Scripts\python.exe scripts\docx_index.py --docx D:\work\bid.docx --out D:\work\bid.index.json
```

### Output

Top-level fields of the output JSON:

- `status`
- `docx`
- `summary`
- `nodes`

Each node in `nodes` contains at least:

- `node_id`
- `node_type`
- `text`
- `style_name`
- `heading_level`
- `path`
- `ordinal`
- `parent_id`
- `anchor`

Currently supported `node_type` values:

- `heading`
- `paragraph`
- `list_item`
- `table`
- `table_row`
- `table_cell`
- `image_placeholder`

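### When to use it

- Building a searchable structure for an existing template bid document
- Determining whether a given chapter exists
- Providing stable anchors for later query / patch operations
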
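For the second use case, checking whether a chapter exists, the index JSON can be scanned directly. This is a minimal sketch that assumes `nodes` is a list of objects, that heading nodes carry the chapter title in `text`, and that the file is UTF-8 encoded; `load_index` and `find_headings` are hypothetical helper names.

```python
import json


def load_index(path):
    """Read an index file written by docx_index.py (UTF-8 is assumed)."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def find_headings(index, title):
    """Return the heading nodes whose text equals `title` exactly."""
    return [
        node
        for node in index.get("nodes", [])
        if node.get("node_type") == "heading"
        and (node.get("text") or "").strip() == title
    ]
```

A call such as `find_headings(load_index(r"D:\work\bid.index.json"), "Project Implementation Plan")` returns an empty list when the chapter is missing, and the `anchor` of a returned node can then be reused as a query or patch target.
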
## 2. Query

### Command

```powershell
.venv\Scripts\python.exe scripts\docx_query.py --docx D:\work\bid.docx --query-file D:\work\query.json --out D:\work\query.result.json
```

### Query JSON

```json
{
  "match_mode": "heading_text",
  "value": "Project Implementation Plan"
}
```

### Supported `match_mode` values

- `exact_text`
- `contains_text`
- `regex`
- `heading_path`
- `heading_text`
- `table_title`
- `style_name`
- `node_type`
- `anchor`
- `node_id`

### Common additional fields

- `node_type`
- `style_name`
- `heading_level`
- `occurrence`
- `allow_multiple`
- `context_window`

### Query result

The result JSON contains:

- `matches`
- `match_count`
- `ambiguous`
- `best_match`
- `candidate_anchors`
- `errors`
- `warnings`

Default principles:

- Only a single match is suitable for patching directly
- Multiple matches are treated as ambiguous by default
- To use the Nth match, `occurrence` must be passed explicitly

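The default principles above can be expressed as a small gate that runs before any patch is generated. This is a minimal sketch, assuming `matches` and `errors` are lists in the result JSON; `patch_decision` is a hypothetical helper and the reason strings are illustrative.

```python
def patch_decision(result, occurrence=None):
    """Return (ok, reason): whether a query result may drive a patch."""
    if result.get("errors"):
        return False, "query reported errors"
    matches = result.get("matches") or []
    if not matches:
        return False, "no match"
    if len(matches) == 1:
        return True, "single match"
    # Several candidates: only proceed when the caller chose one explicitly.
    if occurrence is None:
        return False, "ambiguous: pass `occurrence` explicitly"
    return True, f"occurrence {occurrence} chosen explicitly"
```

The function only decides whether to proceed; selecting the Nth match is left to the scripts, since the manual does not say how `occurrence` is indexed.
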
## 3. Patch

### Command

```powershell
.venv\Scripts\python.exe scripts\docx_patch.py --patch-file D:\work\patch.json --report D:\work\patch.report.json --render-check
```

### Top-level structure of the patch JSON

```json
{
  "source_docx": "D:/work/source.docx",
  "output_docx": "D:/work/output.docx",
  "docx_style_profile": "default_bid",
  "numbering_mode": "explicit_text",
  "template_docx": null,
  "operations": []
}
```

By default the result is written to a new file.
Set the following only when an in-place modification is explicitly intended:

```json
{
  "in_place": true
}
```

### Operation fields

- `op`
- `target`
- `content`
- `content_type`
- `on_ambiguous`
- `on_missing`

Supported `op` values:

- `insert_before`
- `insert_after`
- `replace_node`
- `replace_text`
- `delete_node`

Supported `content_type` values:

- `paragraphs`
- `heading`
- `table`
- `list`

### Example 1: insert body text after a chapter

```json
{
  "source_docx": "D:/work/source.docx",
  "output_docx": "D:/work/output.docx",
  "operations": [
    {
      "op": "insert_after",
      "target": {
        "match_mode": "heading_text",
        "value": "Project Implementation Plan"
      },
      "content_type": "paragraphs",
      "content": [
        "The overall implementation goal of this project is to ensure the system goes live smoothly and meets the acceptance requirements.",
        "Implementation proceeds in five steps: survey, deployment, integration testing, trial operation, and acceptance."
      ]
    }
  ]
}
```

### Example 2: replace specific text

```json
{
  "source_docx": "D:/work/source.docx",
  "output_docx": "D:/work/output.docx",
  "operations": [
    {
      "op": "replace_text",
      "target": {
        "match_mode": "contains_text",
        "value": "warranty period"
      },
      "old_text": "one year",
      "new_text": "three years"
    }
  ]
}
```

### Example 3: replace an entire node

```json
{
  "source_docx": "D:/work/source.docx",
  "output_docx": "D:/work/output.docx",
  "operations": [
    {
      "op": "replace_node",
      "target": {
        "match_mode": "heading_text",
        "value": "After-Sales Service Plan"
      },
      "content_type": "heading",
      "content": {
        "text": "After-Sales Service and O&M Support",
        "level": 2
      }
    }
  ]
}
```

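When the patch JSON is produced by code rather than written by hand, a small builder keeps its shape consistent with the examples above. This is a minimal sketch for the `insert_after` case only; `build_insert_after_patch` and `write_patch` are hypothetical helpers, and writing the file as UTF-8 is an assumption about what `docx_patch.py` expects.

```python
import json


def build_insert_after_patch(source_docx, output_docx, heading, paragraphs):
    """Build a patch document that inserts paragraphs after a heading."""
    return {
        "source_docx": source_docx,
        "output_docx": output_docx,
        "operations": [
            {
                "op": "insert_after",
                "target": {"match_mode": "heading_text", "value": heading},
                "content_type": "paragraphs",
                "content": list(paragraphs),
            }
        ],
    }


def write_patch(patch, path):
    """Write the patch as JSON; UTF-8 output is an assumption."""
    # ensure_ascii=False keeps non-ASCII document text readable in the file.
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(patch, fh, ensure_ascii=False, indent=2)
```

The builder leaves `in_place` unset, so the result is written to `output_docx`, which matches the default described in this section.
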
## 4. Render check

### Command

```powershell
.venv\Scripts\python.exe scripts\render_docx.py --docx D:\work\output.docx --out-dir D:\work\render --report D:\work\render.report.json
```

### Behavior

The script attempts to:

1. Convert the DOCX to PDF
2. Render the PDF pages to images
3. Write the render report

### Report fields

- `status`
- `docx`
- `pdf`
- `page_count`
- `images`
- `errors`
- `warnings`
- `format_profile`
- `numbering_validation`
- `caption_validation`
- `toc_validation`
- `acceptance_checks`

If the system lacks `soffice` or the image-rendering dependencies, the report returns `render_skipped` or carries a warning, rather than marking the patch result as failed outright.

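A caller reading the render report should therefore keep a skipped render separate from a failed one. The sketch below shows one way to do that, assuming `render_skipped` is reported through the `status` field and that `errors` and `warnings` are lists; `summarize_render` and its four return labels are hypothetical.

```python
def summarize_render(report):
    """Classify a render report as 'skipped', 'failed', 'warnings' or 'clean'."""
    # Assumption: `render_skipped` is a value of the `status` field.
    if report.get("status") == "render_skipped":
        return "skipped"
    if report.get("errors"):
        return "failed"
    if report.get("warnings"):
        return "warnings"
    return "clean"
```

A `skipped` result means the visual check did not run, so the patched DOCX is neither confirmed nor rejected by it.
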
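## 5. Recommended usage strategy for AI

When the AI writes a bid document, it should preferably work in this order:

1. Run `index` first
2. Then run `query`
3. Confirm that the match is unique
4. Generate the patch JSON
5. Run `patch`
6. Run `render`

Do not patch directly in any of the following situations:

- The query result is empty
- The query returned multiple candidates and none has been explicitly chosen
- It has not yet been confirmed whether the current chapter belongs to the commercial bid or the technical bid
- The long body text to be inserted has not yet been fact-checked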
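
The script invocations in this order can be assembled as argument lists. This is a minimal sketch that uses only the standalone scripts and flags shown earlier in this manual; `workflow_commands` is a hypothetical helper, and the file names it places in the work directory (`index.json`, `query.json`, and so on) are illustrative assumptions.

```python
PYTHON = r".venv\Scripts\python.exe"


def workflow_commands(docx, work_dir, output_docx):
    """Return argv lists for index, query, patch and render, in that order.

    Steps 3 and 4 (confirming a unique match and generating the patch JSON)
    happen between the query and patch commands and are not automated here.
    """
    w = work_dir.rstrip("\\/")
    return [
        [PYTHON, r"scripts\docx_index.py", "--docx", docx, "--out", rf"{w}\index.json"],
        [PYTHON, r"scripts\docx_query.py", "--docx", docx,
         "--query-file", rf"{w}\query.json", "--out", rf"{w}\query.result.json"],
        [PYTHON, r"scripts\docx_patch.py",
         "--patch-file", rf"{w}\patch.json", "--report", rf"{w}\patch.report.json"],
        # Render the patched file, i.e. the `output_docx` named in the patch JSON.
        [PYTHON, r"scripts\render_docx.py", "--docx", output_docx,
         "--out-dir", rf"{w}\render", "--report", rf"{w}\render.report.json"],
    ]
```

Each list can be passed to `subprocess.run` one at a time, stopping before the patch command whenever one of the situations listed above applies.
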