Skill · expert-wiki-ingest · Incremental ingest pipeline

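1. When to invoke · When not to

Not for every run: this skill covers only one narrow stretch of the pipeline

It does exactly one job: a new batch of docx/pptx/pdf files has landed under raw/ and needs to be loaded into an already-built wiki/. Out of scope: audio and video, xlsx data tables, and rebuilding an existing wiki (that is the initial 7-agent chain in PRD §2).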

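✅ Trigger scenarios

"Extract the new content under expert/raw into the wiki"
"Update the expert/wiki knowledge base"
"Another batch of material has landed under raw, run the wiki ingest"
An explicit /expert-wiki-ingest or /wiki-incremental-ingest
Commands containing "增量入库" (incremental ingest) / "新增 X 抽到 wiki" (extract new X into the wiki)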

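❌ Do not activate

The wiki does not exist yet → use the initial 7-agent chain in expert/wiki/PRD.md §2
New files are mp4 / m4a / mp3 → whisper transcription is handled separately (PRD §5 P0/P1, not implemented)
New files are xlsx / xls → data-table ingest (PRD §5 P2, not implemented)
The user only wants to browse the existing wiki → open expert/wiki/index.html
The user asks "what is the wiki" → link to PRD.md / readme.html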

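2. Five-phase pipeline

From raw files to a browsable wiki · work split into 5 steps

Each phase has its own script or sub-agent. Every artifact lands in the repository, so it shows up in git diff and can be rolled back whenever something goes wrong.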

Phase 1 · Pure Python

Extract text

docx (python-docx) / pptx (python-pptx) / pdf (pdftotext, with pdfplumber as fallback) → one md file per source. Incremental: files already extracted are skipped by default.

python3 expert/wiki/_scripts/extract.py → wiki/_extracted/*.md
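
The incremental skip described above could be sketched roughly as follows. This is not the actual extract.py: the ledger layout (`ok` / `fail` as lists of output file names) and the `<folder>__<stem>__<ext>.md` naming in `output_name` are assumptions inferred from the file names mentioned in this document.

```python
import json
from pathlib import Path


def output_name(rel: Path) -> str:
    """Hypothetical naming scheme: '<folder>__<stem>__<ext>.md'."""
    parts = list(rel.parent.parts) + [rel.stem, rel.suffix.lstrip(".").lower()]
    return "__".join(parts) + ".md"


def pending_sources(raw_dir: Path, ledger_path: Path,
                    exts=(".docx", ".pptx", ".pdf")) -> list[Path]:
    """Return raw files whose output is not yet recorded in the ledger."""
    done: set[str] = set()
    if ledger_path.exists():
        ledger = json.loads(ledger_path.read_text(encoding="utf-8"))
        # Assumed layout: {"ok": [...output names...], "fail": [...]}
        done = set(ledger.get("ok", [])) | set(ledger.get("fail", []))
    todo = []
    for p in sorted(raw_dir.rglob("*")):
        if p.is_file() and p.suffix.lower() in exts:
            if output_name(p.relative_to(raw_dir)) not in done:
                todo.append(p)
    return todo
```

Counting `fail` entries as done keeps unreadable scans from being retried on every run, which is one plausible reason for recording failures in the ledger at all.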

Phase 2 · LLM Sub-agents

Extract entities

N sub-agents run in parallel (N Agent calls sent in a single message). Each one reads ~270KB of text and produces drafts for entities/people/concepts/reports/timeline.

Agent × N (parallel) → _drafts/agent-N-roundX/

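Phase 3 · Pure Python

Merge

When several agents write the same slug: frontmatter lists are unioned; on a scalar conflict the first value is kept and a note is added; bodies are appended in agent order under "## 补充视角" (supplementary perspective). The timeline is deduplicated and sorted newest first.

python3 _scripts/merge.py → wiki/{entities..}/*.md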
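
As an illustration of the merge rules above, here is a minimal sketch working on already-parsed frontmatter dicts. It is not the real merge.py: the membership of `LIST_KEYS`, the `merge_notes` key, and the way the agent label is attached to the heading are all assumptions.

```python
# Assumed membership; the real LIST_KEYS in merge.py may differ.
LIST_KEYS = {"aliases", "sources", "source_files"}


def merge_frontmatter(first: dict, later: dict) -> dict:
    """Union list keys; on scalar conflicts keep the first value and note it."""
    merged = dict(first)
    notes = list(merged.get("merge_notes", []))  # hypothetical key
    for key, value in later.items():
        if key in LIST_KEYS:
            seen = list(merged.get(key, []))
            for item in value:
                if item not in seen:  # order-preserving union
                    seen.append(item)
            merged[key] = seen
        elif key not in merged or merged[key] in (None, ""):
            merged[key] = value
        elif merged[key] != value:
            notes.append(f"{key}: kept {merged[key]!r}, dropped {value!r}")
    if notes:
        merged["merge_notes"] = notes
    return merged


def merge_bodies(bodies: list[tuple[str, str]]) -> str:
    """bodies is [(agent_label, markdown_body), ...] in agent order."""
    parts = [bodies[0][1].rstrip()]
    for label, body in bodies[1:]:
        parts.append(f"## 补充视角 ({label})\n\n{body.strip()}")
    return "\n\n".join(parts) + "\n"
```

Keeping the first scalar and recording the dropped value means a conflict never silently overwrites earlier work and stays visible in git diff.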

Phase 4 · Pure Python

Build indexes

One index.md per category (multiple views by type / country / role / category / publish_year) plus a top-level README.md overview table.

python3 _scripts/build_index.py → wiki/<sub>/index.md
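
A multi-view index of this kind amounts to grouping the same entries by different frontmatter keys. The sketch below shows one way to do that; the function names and the rendered layout are hypothetical, not taken from build_index.py.

```python
from collections import defaultdict


def group_by(entries: dict[str, dict], key: str) -> dict[str, list[str]]:
    """Group slugs by one frontmatter key; missing values go to 'unknown'."""
    groups = defaultdict(list)
    for slug, frontmatter in entries.items():
        groups[str(frontmatter.get(key) or "unknown")].append(slug)
    return {value: sorted(slugs) for value, slugs in sorted(groups.items())}


def render_view(title: str, groups: dict[str, list[str]]) -> str:
    """Render one view as a markdown section of wikilinks."""
    lines = [f"## By {title}", ""]
    for value, slugs in groups.items():
        lines.append(f"### {value} ({len(slugs)})")
        lines.extend(f"- [[{slug}]]" for slug in slugs)
        lines.append("")
    return "\n".join(lines)
```

Calling `render_view("country", group_by(entries, "country"))` once per key (type, country, and so on) would yield the separate views of a single index.md.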

Phase 5 · Pure Python

Generate HTML

A self-contained ~1MB index.html (full-text search / wikilink navigation) + processing_status.html (raw ↔ extracted comparison) + docs.html (navigation bar kept in sync).

python3 _scripts/build_html.py → wiki/*.html

expert/raw/<new>/
   ├─ 2科普视频文案/*.docx        # 60 new files
   ├─ 3其余素材/*.pdf             # 7 new files (4 are image scans with no extractable text)
   └─ 1商业航天材料/...           # already ingested in Round 1
       │
       │  Phase 1 · extract.py (incremental, tracked in _extract_report.json)
       ▼
expert/wiki/_extracted/
   └─ 67 new *__<ext>.md   # plain text, each with its own source header
       │
       │  Phase 2 · N sub-agents in parallel (following the _TEMPLATE.md schema)
       ▼
expert/wiki/_drafts/agent-{N}-roundX/
   ├─ entities/*.md      # companies / satellites / constellations / institutions
   ├─ people/*.md        # founders / CEOs / chief designers / officials
   ├─ concepts/*.md      # technologies / business models / policies / frequency bands
   ├─ reports/*.md       # one source .md = one reports entry
   └─ timeline.md        # - YYYY-MM-DD: event, source `xxx.md`
       │
       │  Phase 3 · merge.py (LIST_KEYS union / body append / timeline dedupe, newest first)
       ▼
expert/wiki/{entities,people,concepts,reports}/<slug>.md
expert/wiki/timeline/all.md
       │
       │  Phase 4 · build_index.py
       ▼
expert/wiki/{README.md, <sub>/index.md}
       │
       │  Phase 5 · build_html.py / build_processing_status.py / build_docs.py
       ▼
expert/wiki/index.html     # ~1MB self-contained browser
expert/wiki/processing_status.html
expert/wiki/{readme,template,timeline,next_steps,prd,redact}.html
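
The timeline step in the diagram (deduplicate, then newest first) could look roughly like the sketch below. It assumes each entry is a single line starting with `- YYYY-MM[-DD]:`; the regex and the choice to place month-only entries after dated entries of the same month are assumptions, not the behavior of the real merge.py.

```python
import re

# Accepts "- 2024-03: ..." and "- 2024-03-14: ...", ASCII or fullwidth colon.
LINE_RE = re.compile(r"^-\s*(\d{4})-(\d{2})(?:-(\d{2}))?\s*[::]\s*(.+)$")


def merge_timelines(chunks: list[str]) -> list[str]:
    """Merge several timeline.md texts: drop duplicates, sort newest first."""
    seen = set()
    rows = []
    for chunk in chunks:
        for raw in chunk.splitlines():
            match = LINE_RE.match(raw.strip())
            if not match:
                continue  # headings, blank lines, free text
            year, month, day, text = match.groups()
            # Whitespace-normalized text is part of the duplicate key.
            key = (year, month, day or "00", " ".join(text.split()))
            if key in seen:
                continue
            seen.add(key)
            rows.append((key[:3], raw.strip()))
    rows.sort(key=lambda row: row[0], reverse=True)
    return [line for _, line in rows]
```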

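3. Sub-agent prompt template (core of Phase 2)

4-way parallel entity extraction · only truly parallel when sent in one message

This is the skill's most important artifact. The prompt below is parameterized (only the "Your batch" section and the agent number change) and is sent N ways at once. One Agent per message → serial, wall time × N; N Agents in one message → parallel, wall time ≈ the slowest one.

⚠️ Biggest pitfall of this run (first run, 2026-06-07)

On the first run, agent-8 was sent alone, and agent-9/10/11 were sent only after it finished (24 minutes), which serialized the work for nothing. **Correct approach**: put N Agent tool calls in a single message, each with run_in_background: true. Reuse the template next time and do not repeat the mistake.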

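Batching: greedy bin-packing by file size

# ≥ 30 new files → 4 agents; 15-30 → 2-3 agents; < 15 → a single agent
# Once the agent count is fixed, the greedy algorithm spreads the files evenly across N_BATCHES batches (4 here):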
cd /Users/john/InvesResearch && python3 -c "
import json
from pathlib import Path
ext = Path('expert/wiki/_extracted')
PREFIXES = ('2科普视频文案__', '3其余素材__')   # edit this to match this round's new files
new = []
for prefix in PREFIXES:
    for p in ext.glob(prefix + '*.md'):
        new.append((p.stat().st_size, p.name))
new.sort(reverse=True)
N_BATCHES = 4
batches = [[] for _ in range(N_BATCHES)]
sizes = [0]*N_BATCHES
for sz, name in new:
    i = sizes.index(min(sizes))
    batches[i].append(name); sizes[i] += sz
ROUND = 2; START = 8  # round 1 = agents 1..7; round 2 continues the numbering at 8..11
out = {f'agent-{i+START}-round{ROUND}': batches[i] for i in range(N_BATCHES)}
Path('/tmp/wiki_round_batches.json').write_text(json.dumps(out, ensure_ascii=False, indent=2), encoding='utf-8')
for i, b in enumerate(batches):
    print(f'BATCH {i+1}: {len(b)} files, {sizes[i]/1024:.1f} KB')
"
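
The sizing rule and the greedy loop from the script above can also be written as two small testable functions. This is a sketch only: the cut-off of 23 files inside the "15-30 → 2-3 agents" band is a hypothetical choice, since the original rule does not say where 2 becomes 3.

```python
def choose_n_batches(n_files: int) -> int:
    """Map the number of new files to a number of parallel sub-agents."""
    if n_files >= 30:
        return 4
    if n_files >= 15:
        return 3 if n_files >= 23 else 2  # hypothetical split of the 2-3 band
    return 1


def greedy_pack(files: list[tuple[int, str]], n_batches: int):
    """Largest-first greedy packing: each file goes to the lightest batch."""
    batches = [[] for _ in range(n_batches)]
    sizes = [0] * n_batches
    for size, name in sorted(files, reverse=True):
        i = sizes.index(min(sizes))  # lightest batch so far
        batches[i].append(name)
        sizes[i] += size
    return batches, sizes
```

Sorting largest first matters: placing big files early lets the small ones even out the remaining gaps, so no single sub-agent ends up with a much heavier batch and wall time stays close to the average.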

Prompt skeleton for each sub-agent

You are round-{ROUND} sub-agent #{N} for incremental ingest into the InvesResearch commercial-space wiki.
Extract entities/people/concepts/reports/timeline from the new material; merge.py will merge them into wiki/.

## Working directory
- Repo root: /Users/john/InvesResearch
- Source: expert/wiki/_extracted/*.md (read only the N files in your batch)
- Output (directories already created):
  - expert/wiki/_drafts/agent-{N}-round{ROUND}/{entities,people,concepts,reports}/
  - expert/wiki/_drafts/agent-{N}-round{ROUND}/timeline.md

## Your batch (N files, ~270KB)
{one file name per line + a one-sentence topic hint}

## Output contract
See expert/wiki/_drafts/_TEMPLATE.md (already exists, not repeated here)
frontmatter:
- entities: type / country / aliases / status / sources
- people:   role / affiliation / country / aliases / sources
- concepts: category / aliases / sources
- reports:  title / publish_year / publisher / report_type / source_files

## Slug rules: critical (they decide whether merge.py can merge)
First Read the 4 existing slug lists to deduplicate:
- expert/wiki/_drafts/_round{ROUND}_existing_entities.txt
- expert/wiki/_drafts/_round{ROUND}_existing_people.txt
- expert/wiki/_drafts/_round{ROUND}_existing_concepts.txt
- expert/wiki/_drafts/_round{ROUND}_existing_reports.txt

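- Existing slug → reuse the same name; merge.py will union the frontmatter and append your body under a "## 补充视角" (supplementary perspective) heading tagged agent-{N}-round{ROUND}
- New slug → use latin / pinyin (spacex / starlink / elon-musk / direct-to-cell); no Chinese slugs (inconsistent unicode normalization causes collisions)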

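## Quality rules
1. No fabrication: every fact must be traceable to the source text
2. Concise (30-80 lines of markdown per entry)
3. sources lists only file names that appear in your batch
4. One entry per entity; do not split the same thing across entities + concepts
5. Link 1-2 [[...]] wikilinks
6. reports must yield N entries (one reports entry per source .md)
   - Popular-science video scripts (科普视频文案) → report_type: 历史/科普, leave publish_year empty
   - Long industry-report PDFs → report_type: 行业研究, fill in publish_year if known
   - Company due-diligence PDFs → report_type: 公司尽调, downgrade sensitive information (give valuations as ranges)
7. Prioritize what matters; skip anything mentioned only in passing
8. timeline.md in reverse chronological order: - YYYY-MM[-DD]: short description, source `filename.md`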

## When finished
Give a short report: "produced N entities / M people / K concepts / R reports / T timeline entries". Do not commit.

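📋 The right way to dispatch 4 parallel agents

Put all 4 Agent tool calls in the same message
Set run_in_background: true on every one
The main thread keeps working on other things (writing the skill / editing docs.html / preparing the merge command ahead of time)
Completion notifications arrive one by one; do not sleep / poll
Run Phase 3 merge.py only after every sub-agent has finished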

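4. Full pipeline command list

From zero to a working wiki run · steps 0-10, copy-pasteable

Note that step 4 is an Agent tool call, not bash: switch to the tool for that step; everything else is shell.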

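# 0 · See what is new first
ls expert/raw/<new-folder>/ | head -20

# 1 · Phase 1 extraction (incremental, already-extracted files are skipped automatically)
python3 expert/wiki/_scripts/extract.py

# 2 · Delete empty PDF outputs (image scans yield no text) + patch them into fail
find expert/wiki/_extracted -name "*.md" -size -500c -delete
# See the JSON patch in §Pitfalls #2

# 3 · Split into batches + prepare the dedup helper lists + create empty draft directories
# Assumes ROUND, START and N are set in the shell, e.g. ROUND=2 START=8 N=4
# See the greedy bin-packing script in §3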
ls expert/wiki/entities/  > expert/wiki/_drafts/_round${ROUND}_existing_entities.txt
ls expert/wiki/people/    > expert/wiki/_drafts/_round${ROUND}_existing_people.txt
ls expert/wiki/concepts/  > expert/wiki/_drafts/_round${ROUND}_existing_concepts.txt
ls expert/wiki/reports/   > expert/wiki/_drafts/_round${ROUND}_existing_reports.txt
for n in $(seq $START $((START+N-1))); do
  mkdir -p "expert/wiki/_drafts/agent-${n}-round${ROUND}"/{entities,people,concepts,reports}
done

# 4 · Dispatch N sub-agents (N Agent calls in one message, see §3)
# ⚠️ Not bash! Switch to the Claude Code Agent tool, run_in_background: true

# 5 · After all sub-agents finish, check the draft counts
for n in $(seq $START $((START+N-1))); do
  echo "== agent-${n}-round${ROUND} =="
  for sub in entities people concepts reports; do
    echo "  ${sub}: $(ls expert/wiki/_drafts/agent-${n}-round${ROUND}/${sub}/ 2>/dev/null | wc -l)"
  done
  echo "  timeline: $(wc -l < expert/wiki/_drafts/agent-${n}-round${ROUND}/timeline.md)"
done

# 6 · Phase 3 merge
python3 expert/wiki/_scripts/merge.py

# 7 · Phase 4 indexes
python3 expert/wiki/_scripts/build_index.py

# 8 · Phase 5 HTML (self-contained browser + processing status + docs navigation)
python3 expert/wiki/_scripts/build_html.py
python3 expert/wiki/_scripts/build_processing_status.py
python3 expert/wiki/_scripts/build_docs.py
# Optional:
python3 expert/wiki/_scripts/redact_check.py          # PII audit
python3 expert/wiki/_scripts/build_redact_html.py     # PII audit visualization

# 9 · Acceptance check
echo "entities=$(ls expert/wiki/entities/ | wc -l)"
echo "people=$(ls expert/wiki/people/ | wc -l)"
echo "concepts=$(ls expert/wiki/concepts/ | wc -l)"
echo "reports=$(ls expert/wiki/reports/ | wc -l)"
echo "timeline lines=$(wc -l < expert/wiki/timeline/all.md)"
echo "index.html size=$(wc -c < expert/wiki/index.html)"

# 10 · Local spot check
cd expert/wiki && python3 -m http.server 8765
# → http://localhost:8765/index.html
# → http://localhost:8765/processing_status.html

5. Pitfall notes · a lookup table for next time

5 pitfalls already hit · do not hit them again

Scripts hard-code the /Users/john/lichao/ path

Symptom: on the first run of the wiki scripts in the InvesResearch repo, every build_*.py fails with "file not found".
Cause: all 8 scripts (extract.py merge.py build_index.py build_html.py build_processing_status.py build_docs.py redact_check.py build_redact_html.py) hard-code WIKI = Path("/Users/john/lichao/wiki").
Fix (done 2026-06-07): paths are centralized in expert/wiki/_scripts/_paths.py and resolved in the order env > repo-relative > legacy. Each script adds sys.path.insert + from _paths import WIKI at the top.
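
The env > repo-relative > legacy order could be implemented along these lines. This is a sketch rather than the real _paths.py: the environment variable name `EXPERT_WIKI_DIR` and the `_scripts` directory check are hypothetical.

```python
import os
from pathlib import Path

LEGACY_WIKI = Path("/Users/john/lichao/wiki")  # the old hard-coded location


def resolve_wiki(env=os.environ, script_file=None) -> Path:
    """Resolve the wiki root: env override, then repo-relative, then legacy."""
    override = env.get("EXPERT_WIKI_DIR")  # hypothetical variable name
    if override:
        return Path(override).expanduser()
    # _paths.py lives in expert/wiki/_scripts/, so the wiki root is one level up.
    here = Path(script_file or __file__).resolve()
    candidate = here.parent.parent
    if (candidate / "_scripts").is_dir():
        return candidate
    return LEGACY_WIKI
```

In a real _paths.py the module would end with something like `WIKI = resolve_wiki()`, so that `from _paths import WIKI` keeps working unchanged in every script.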

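PDF extraction yields 0 characters

Symptom: _extracted/foo__pdf.md contains only the header and no body (~100 bytes).
Cause: image-scanned PDF / broken font encoding / encryption; neither pdftotext nor pdfplumber can recover any text.
Fix: delete them with find expert/wiki/_extracted -name "*.md" -size -500c -delete, then use a Python patch to move them from ok to fail in _extract_report.json (SKILL.md §6.2 has the full script).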
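
For orientation, the ok → fail patch might look like the sketch below. It is not the script from SKILL.md §6.2; it assumes that `ok` and `fail` in _extract_report.json are plain lists of output file names, which may not match the real ledger layout.

```python
import json
from pathlib import Path


def mark_missing_as_failed(ledger_path: Path, extracted_dir: Path) -> list[str]:
    """Move ledger entries whose output file was deleted from ok to fail."""
    ledger = json.loads(ledger_path.read_text(encoding="utf-8"))
    fail = ledger.setdefault("fail", [])
    still_ok, moved = [], []
    for name in ledger.get("ok", []):
        if (extracted_dir / name).exists():
            still_ok.append(name)
        else:
            moved.append(name)
            if name not in fail:
                fail.append(name)
    ledger["ok"] = still_ok
    ledger_path.write_text(
        json.dumps(ledger, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return moved
```

Run it right after the find ... -delete step, so the ledger and the contents of _extracted/ agree before batching starts.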

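Multi-agent parallelism: always send the calls in the same message

Symptom: wall time is single-agent duration × N (linear rather than parallel).
Cause: each message invokes only one Agent → serial.
Fix: put N Agent tool calls in one message with run_in_background: true for real parallelism.
I hit this myself on 2026-06-07: agent-8 went out first and the other 3 were sent only after it finished, wasting 70 minutes.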

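Sub-agent wrote a Chinese slug

Symptom: merge.py errors out, or entries with the same slug are not merged.
Cause: merge.py keys on the file name stem, and a Chinese slug may be unicode-normalized differently (NFC vs NFD) across agent outputs.
Fix: **stress repeatedly** in the sub-agent prompt: "slugs use latin/pinyin, no Chinese slugs". If some were already written, rename them manually with mv before running merge.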
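
A quick pre-merge check can catch such slugs before merge.py runs. The helper below is a hypothetical addition, not part of the existing scripts; it flags any draft whose file name stem contains non-ASCII characters.

```python
from pathlib import Path

SUBDIRS = ("entities", "people", "concepts", "reports")


def non_ascii_slugs(drafts_root: Path) -> list[Path]:
    """List draft files under agent-*/<sub>/ whose slug is not pure ASCII."""
    bad = []
    for sub in SUBDIRS:
        for path in sorted(drafts_root.glob(f"agent-*/{sub}/*.md")):
            if not path.stem.isascii():
                bad.append(path)
    return bad
```

Running it on expert/wiki/_drafts and renaming whatever it returns gives a concrete checklist for the manual mv step.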

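Sensitive content (valuations / internal passwords / internal names)

Symptom: a PDF is a company due-diligence report or investment deck containing specific funding amounts / internal valuations / private contact details.
Fix:

Add a "downgrade sensitive information" section to the sub-agent prompt (valuations as ranges; names/phones/emails omitted)
Run redact_check.py after merge for a PII audit
Review _redact_report.md and manually verify high-severity hits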

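6. Status inventory

Rounds run so far

Add one row here for each ingest round. The round number also sets the offset in the sub-agent numbering space (round 1 takes agent-1..7, round 2 continues at agent-8..).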

Round	Date	Input	Output	Status
Round 1	2026-06-04	117 source docx/pptx/pdf files (`商业航天材料/`)	248 entities / 33 people / 231 concepts / 35 reports / 483 timeline entries. 7 agents in parallel (agent-1..7), covering macro / US early warning / China industry / LEO communications / sub-sectors / China Mobile M&A / asset management	✅ Delivered
Round 2	2026-06-07	67 new files (`2科普视频文案/` 60 docx + `3其余素材/` 7 PDF)	4 sub-agents (agent-8..11) extracting in parallel. Known so far: agent-8 produced 29 ent / 3 ppl / 36 cpt / 13 rep / 36 timeline; agent-9/10/11 running concurrently	🟡 In progress
Round N+	-	Remaining in raw/3其余素材/: 22 mp4 + 32 jpg + 4 xlsx	Not covered by this skill. mp4/m4a → whisper (PRD §5 P0/P1) · xlsx → pandas + LLM interpretation (PRD §5 P2) · jpg → OCR (unscheduled)	🔵 Not started

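7. Boundaries with other skills

Process skills and analyst know-how skills are two separate tracks

This repository already has two skill systems. This skill lives in skills/ at the project root and captures the process of "how to move data from one place to another"; agent/skills/ captures the analysts' judgment know-how. The two do not interfere with each other.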

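Skill	Location	What it does	Boundary
expert-wiki-ingest (this one)	`skills/`	raw/ → wiki/ incremental ingest pipeline	Does not touch agent/ business logic
satellite_internet_research	`agent/skills/`	Overview of analyst know-how for the agent decision layer	Static document, does not touch data
thesis_impact_judgment	`agent/skills/`	Single event → impact propagation to the main thesis	Internal classifier rules
strategy_recommendation_5_levels	`agent/skills/`	5-level strategy recommendations (blind-spot fix back-derived from D V1)	Internal decision rules
wyhtb_writing_guide	`agent/skills/`	Bull-case / bear-case writing guide	thesis configuration layer
trigger_design_patterns	`agent/skills/`	Falsification trigger design	triggers configuration layer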

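Relationship to the expert/X scraping pipeline

This skill covers only expert/raw/ (offline material packs) → wiki/. expert/X/ (daily X post scraping) → the agent/ events table is a separate pipeline (EH-1 x-ingest action, launchd cron at 03:00 daily; see agent/docs/x-ingest-cron-runbook.md), and the two do not overlap. They are two collection chains under the same directory with fully separate responsibilities: raw/ is the "offline knowledge base" (searchable and browsable), X/ is the "online event stream" (feeding the agent decision loop).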