---
name: expert-wiki-ingest
description: Incrementally ingest new docx/pptx/pdf files under expert/raw/ into the expert/wiki/ knowledge base (five dimensions, entities / people / concepts / reports / timeline). Phase 1 extracts text with pure Python; Phase 2 dispatches parallel LLM sub-agents to extract entities; Phase 3 merges with merge.py; Phases 4-5 regenerate the indexes and HTML.
type: process
trigger:
  - "把 expert/raw 下新增内容抽到 wiki"
  - "更新 expert/wiki 知识库"
  - "/expert-wiki-ingest"
  - "/wiki-incremental-ingest"
not_when:
  - The wiki has not been built yet (use the initial build pipeline in expert/wiki/PRD.md §2, not the incremental one)
  - You only want to browse the existing wiki (open expert/wiki/index.html)
  - The new material is audio / video (this skill has no whisper transcription, see PRD §5 P0/P1)
---

# Skill · expert-wiki-ingest

> **Intent**: a new batch of commercial-space material (docx/pptx/pdf) has landed in expert/raw/ and needs to be ingested incrementally into the five-dimension knowledge base under expert/wiki/ (entities / people / concepts / reports / timeline), without rebuilding what already exists.

> **Out of scope**: audio/video (left to the whisper.cpp + faster-whisper pipeline; see expert/wiki/PRD.md §5). Numeric xlsx files (pure data needs an LLM to interpret the schema, deferred to P2; see the same PRD §5). The agent/ business logic (this skill only touches expert/wiki/).

---

## 1. When to invoke this skill

| Trigger scenario | Example |
|---|---|
| User says "把 expert/raw 下新增的内容抽到 wiki" | / |
| User says "更新一下 expert/wiki" | / |
| User says "raw 下又来了一批材料,跑一遍 wiki 入库" | / |
| Explicit `/expert-wiki-ingest` or `/wiki-incremental-ingest` | / |
| Command contains "增量入库" / "新增 X 抽到 wiki" | / |

**Do not activate when**:
- expert/wiki/ does not exist yet (use the initial 7-agent build pipeline in expert/wiki/PRD.md §2 + `_TEMPLATE.md`)
- The new files are mp4/m4a/mp3: this skill does not extract audio/video (see PRD §5 P0/P1)
- The new files are xlsx / xls numeric tables: this skill does not extract data tables (see PRD §5 P2)
- The user only wants to browse the wiki: open `expert/wiki/index.html`

---

## 2. Data flow (5 phases)

```
expert/raw//
│ Phase 1 · extract.py (pure Python · incremental, automatically skips files already extracted)
▼
expert/wiki/_extracted/__.md ← plain-text intermediate output
│ Phase 2 · N LLM sub-agents extract entities in parallel (the core of this skill)
▼
expert/wiki/_drafts/agent-N-/{entities,people,concepts,reports}/.md
timeline.md
│ Phase 3 · merge.py (union frontmatter / concat bodies / sort timeline)
▼
expert/wiki/{entities,people,concepts,reports}/.md
timeline/all.md
│ Phase 4 · build_index.py (index.md per category + top-level README.md)
▼
expert/wiki/{README.md, _{/index.md, timeline.html, processing_status.html}
│ Phase 5 · build_html.py + build_processing_status.py + build_docs.py
▼
expert/wiki/index.html (self-contained, ~1MB) + docs.html + processing_status.html
```

---

## 3. Phase 1 · Extraction (pure Python, no LLM)

### 3.1 Setup

Dependencies:
- `python-docx` (`pip install python-docx`)
- `python-pptx` (`pip install python-pptx`)
- `pdftotext` (macOS: `brew install poppler`), with `pdfplumber` as a fallback (`pip install pdfplumber`)

### 3.2 Run

```bash
# Default: expert/raw/ → expert/wiki/_extracted/ (repo-relative, no env needed)
python3 expert/wiki/_scripts/extract.py

# Multiple custom sources (comma-separated):
RAW_ROOT=expert/raw/2科普视频文案,expert/raw/3其余素材 \
python3 expert/wiki/_scripts/extract.py

# Force a full re-extraction (the default is incremental):
EXTRACT_FORCE=1 python3 expert/wiki/_scripts/extract.py

# Custom output location:
OUT_ROOT=/tmp/test_extracted python3 expert/wiki/_scripts/extract.py
```

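### 3.3 Notes

- **Incremental**: the script reads `_extract_report.json` and automatically skips slugs already marked ok. Only `EXTRACT_FORCE=1` forces a re-extraction
- **Slug rules**: `extract.py` automatically drops the `1商业航天材料/` prefix (compatibility with old data) and keeps the `2科普视频文案/` / `3其余素材/` prefixes (namespace for new data)
- **Empty-text PDFs**: when neither `pdftotext` nor `pdfplumber` can recover any text, this step produces a 0-character md; delete it manually with `rm` and patch `_extract_report.json` to mark it as fail (see §6.2)
- **Skip list**: `SKIP_BASENAMES` hard-codes 3 password-protected or data-type PDFs (OPEC / WOO / 垣信BP). Do not replace it with a generic mechanism; manual judgment is safer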
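
The incremental rule above can be sketched as follows. This is only an illustration, not the code in `extract.py`: it assumes the report has the `ok` / `fail` lists of `{"src": ..., "out": ...}` dicts seen in the §6.2 patch, that skipping is keyed on `src`, and that failed sources are retried; the function names are hypothetical.

```python
def ok_sources(report: dict) -> set[str]:
    """Sources already recorded under "ok" (assumed report layout, see §6.2)."""
    return {
        entry["src"]
        for entry in report.get("ok", [])
        if isinstance(entry, dict) and "src" in entry
    }


def pending(sources: list[str], report: dict, force: bool = False) -> list[str]:
    """Sources that still need extraction; force=True mirrors EXTRACT_FORCE=1."""
    if force:
        return list(sources)
    done = ok_sources(report)
    # Anything not marked ok (new files and earlier failures) is processed again.
    return [s for s in sources if s not in done]
```

With an empty report every source is pending, which corresponds to a first run.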

### 3.4 Output check

```bash
# How many entries were added?
python3 -c "
import json
r = json.loads(open('expert/wiki/_extracted/_extract_report.json').read())
print(f\"ok={len(r['ok'])} fail={len(r['fail'])}\")
"

# Which files were added in this round?
ls -t expert/wiki/_extracted/*.md | head -20
```

---

## 4. Phase 2 · Parallel LLM sub-agents extract entities (the core of this skill)

### 4.1 Batching = greedy bin-packing by file size

With N new files: for N ≥ 30, split into **4 sub-agents**; for 15 ≤ N < 30, split into **2-3**; for N < 15, use a single agent.

```bash
# Bin-packing script (not committed to the repo; supplied inline):
cd /Users/john/InvesResearch && python3 -c "
import json
from pathlib import Path
ext = Path('expert/wiki/_extracted')
# Adjust PREFIXES to match the files added in this round:
PREFIXES = ('2科普视频文案__', '3其余素材__')
ROUND = 2      # set to the next round number
N_BATCHES = 4
START = 8      # round 1 in wiki/ used agent-1..7, so this round continues at 8..(8+N-1)
new = []
for prefix in PREFIXES:
    for p in ext.glob(prefix + '*.md'):
        new.append((p.stat().st_size, p.name))
new.sort(reverse=True)  # largest files first
batches = [[] for _ in range(N_BATCHES)]
sizes = [0] * N_BATCHES
for sz, name in new:
    i = sizes.index(min(sizes))  # the currently lightest batch
    batches[i].append(name)
    sizes[i] += sz
for i, b in enumerate(batches):
    print(f'BATCH {i+1}: {len(b)} files, {sizes[i]/1024:.1f} KB')
out = {f'agent-{i+START}-round{ROUND}': batches[i] for i in range(N_BATCHES)}
Path('/tmp/wiki_round_batches.json').write_text(json.dumps(out, ensure_ascii=False, indent=2), encoding='utf-8')
"
```

> Set `ROUND` to the next round number (round 1 = the initial 7-agent build; round 2 = the first run of this skill; and so on).
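
For reference, the batching rule and the greedy packing can also be written as two small testable functions. This is a sketch under assumptions: the function names are hypothetical, and the cut-off between 2 and 3 batches inside the 15-29 range (23 here) is an arbitrary choice, since the rule above only says "2-3".

```python
def pick_n_batches(n_files: int) -> int:
    """Number of sub-agents for n_files new files, following the §4.1 thresholds."""
    if n_files >= 30:
        return 4
    if n_files >= 15:
        return 3 if n_files >= 23 else 2  # assumption: split the 2-3 range at 23
    return 1


def greedy_pack(files: list[tuple[int, str]], n_batches: int):
    """Greedy bin-packing: largest file first, always into the lightest batch."""
    batches = [[] for _ in range(n_batches)]
    sizes = [0] * n_batches
    for size, name in sorted(files, reverse=True):
        i = sizes.index(min(sizes))  # first batch with the smallest total so far
        batches[i].append(name)
        sizes[i] += size
    return batches, sizes
```

Sorting largest-first before assigning keeps the batch totals close to each other, which matters because the wall time of a parallel round is set by the heaviest batch.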

### 4.2 Prepare dedup helpers (for the sub-agents to Read)

```bash
ls expert/wiki/entities/ > expert/wiki/_drafts/_round${ROUND}_existing_entities.txt
ls expert/wiki/people/ > expert/wiki/_drafts/_round${ROUND}_existing_people.txt
ls expert/wiki/concepts/ > expert/wiki/_drafts/_round${ROUND}_existing_concepts.txt
ls expert/wiki/reports/ > expert/wiki/_drafts/_round${ROUND}_existing_reports.txt
cp /tmp/wiki_round_batches.json expert/wiki/_drafts/_round${ROUND}_batches.json
```

### 4.3 Create an empty draft directory for each sub-agent

```bash
for n in 8 9 10 11; do
  mkdir -p "expert/wiki/_drafts/agent-${n}-round${ROUND}"/{entities,people,concepts,reports}
done
```

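### 4.4 Dispatch N sub-agents (send them in **one message** so they truly run in parallel; 4 parallel ≠ 4 serial!)

> ⚠️ **Lesson learned** (2026-06-07): on the first run I dispatched agent-8 alone, waited for it to finish (24 minutes), and only then dispatched the other 3, wasting 70 minutes. **The right approach is to put N Agent tool calls in a single message**, so that they actually run in parallel.

**Sub-agent prompt template** (parameterized; across the N agents only the "Your batch" section and the agent number change):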

```
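You are **round-${ROUND} sub-agent #${N}** (`agent-${N}-round${ROUND}`) for the incremental ingest of the InvesResearch commercial-space wiki.
Extract entities/people/concepts/reports/timeline from the new material; Phase 3 merge.py will merge them into wiki/.

## Working directory
- Repo root: `/Users/john/InvesResearch`
- Source text (already extracted): `expert/wiki/_extracted/*.md`
- Read only the N files in your batch
- Output locations (already created):
  - `expert/wiki/_drafts/agent-${N}-round${ROUND}/{entities,people,concepts,reports}/`
  - `expert/wiki/_drafts/agent-${N}-round${ROUND}/timeline.md`

## Your batch (N files, ~270KB)
{one file name per line plus a one-sentence topic hint, so the sub-agent knows what to expect}

## Output contract
See `expert/wiki/_drafts/_TEMPLATE.md` (it already exists; do not paste it again).
Simplified frontmatter:
- entities: type / country / aliases / status / sources
- people: role / affiliation / country / aliases / sources
- concepts: category / aliases / sources
- reports: title / publish_year / publisher / report_type / source_files

## Slug rules (critical)
**First Read** the 4 existing slug lists (for dedup):
- `expert/wiki/_drafts/_round${ROUND}_existing_entities.txt`
- `expert/wiki/_drafts/_round${ROUND}_existing_people.txt`
- `expert/wiki/_drafts/_round${ROUND}_existing_concepts.txt`
- `expert/wiki/_drafts/_round${ROUND}_existing_reports.txt`

- Existing slug → reuse the same name; merge.py automatically unions the frontmatter and appends your body as `## 补充视角 — agent-${N}-round${ROUND}`
- New slugs use **latin / pinyin**, never Chinese:
  - Companies / satellite models / programs: `spacex` `starlink` `oneweb` `artemis-ii` `golden-dome` `adras-j`
  - People: `elon-musk` `jared-isaacman`
  - Concepts: `direct-to-cell` `on-orbit-refueling` `mars-colonization`
- merge.py keys on the file stem, so Chinese slugs cause trouble

## Quality rules
1. No fabrication: every fact must be traceable to the source text. If it is not, **leave it out** or mark it "(来源未明)"
2. Be concise (30-80 lines of markdown per entry)
3. The sources field lists **only** file names that appear in your batch; never list files you have not read
4. Deduplicate: one entry per entity; do not split the same thing across entities and concepts
5. Link 1-2 `[[...]]` wikilinks
6. **Produce exactly N reports** (one per source .md):
   - Popular-science video script → report_type: `历史/科普`, leave publish_year empty
   - Long PDF industry report → report_type: `行业研究`, fill in publish_year if available
   - Company due-diligence PDF → report_type: `公司尽调`, **downgrade sensitive information** (give valuations as ranges)
7. Prioritize what matters: entries that recur, carry concrete data, or are strongly representative of the industry come first; skip small companies mentioned only in passing
8. timeline.md is in reverse chronological order, one entry per line: `- YYYY-MM[-DD]: short description, source \`filename.md\``

## Special handling for data-heavy PDFs
- Industry ecosystem maps, "state of AI" style reports, and the like: pick only 5-10 representative entities plus the core concepts; do not extract everything
- Long PDFs (>100KB): extract only the core arguments and the entities/concepts involved; do not copy-paste the body

## When finished
Give a short report: "Produced N entities / M people / K concepts / R reports / T timeline entries". Do not commit.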
```

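### 4.5 Run in parallel and wait

After dispatch:
- Expect 15-30 minutes per agent (depending on batch size)
- 4 agents in parallel → wall time ≈ that of the slowest single agent
- Use `run_in_background: true` so they run in the background while the main thread does other work (writing skill docs, editing docs.html, etc.)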

### 4.6 Acceptance check after completion

```bash
for n in 8 9 10 11; do
  echo "== agent-${n}-round${ROUND} =="
  for sub in entities people concepts reports; do
    echo " ${sub}: $(ls expert/wiki/_drafts/agent-${n}-round${ROUND}/${sub}/ 2>/dev/null | wc -l)"
  done
  echo " timeline lines: $(wc -l < expert/wiki/_drafts/agent-${n}-round${ROUND}/timeline.md 2>/dev/null)"
done
```

---

## 5. Phases 3-5 · Merge, then rebuild the indexes and HTML

### Phase 3 — merge.py

```bash
python3 expert/wiki/_scripts/merge.py
```

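Strategy (merge.py §1):
- Globs every `_drafts/agent-*/sub/*.md`, where `sub` is one of entities / people / concepts / reports (covers round 1, round 2, and so on)
- Same slug from multiple agents: LIST_KEYS (`aliases / sources / source_files`) are **unioned**; on a scalar conflict the first value is kept and the others are noted
- Body: the first agent's body becomes the main text; the other agents' bodies are appended as `## 补充视角 — agent-N-roundX` sections
- Timeline: all `- YYYY-MM-DD: ...` lines are deduplicated, sorted in descending order by sort key, and written to `wiki/timeline/all.md`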
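
The strategy above can be sketched roughly as follows. This is an illustration of the described rules, not the actual `merge.py`: the function names, the `_conflicts` key used to note losing scalar values, and the exact timeline regex (which also accepts the `YYYY-MM` form allowed in the sub-agent prompt) are assumptions.

```python
import re

LIST_KEYS = ("aliases", "sources", "source_files")


def merge_frontmatter(drafts: list[dict]) -> dict:
    """Union list keys in first-seen order; keep the first scalar, note the rest."""
    merged: dict = {}
    conflicts: dict[str, list] = {}
    for fm in drafts:
        for key, value in fm.items():
            if key in LIST_KEYS:
                seen = merged.setdefault(key, [])
                seen.extend(item for item in value if item not in seen)
            elif key not in merged:
                merged[key] = value
            elif merged[key] != value:
                conflicts.setdefault(key, []).append(value)
    if conflicts:
        merged["_conflicts"] = conflicts  # hypothetical place to note the others
    return merged


TIMELINE_RE = re.compile(r"^- (\d{4})-(\d{2})(?:-(\d{2}))?: ")


def merge_timeline(lines: list[str]) -> list[str]:
    """Drop duplicate and malformed lines, then sort newest first."""
    keyed: dict[str, tuple[int, int, int]] = {}
    for line in lines:
        m = TIMELINE_RE.match(line)
        if m:
            # A missing day sorts as day 0 within its month.
            keyed.setdefault(line, (int(m[1]), int(m[2]), int(m[3] or 0)))
    return sorted(keyed, key=keyed.get, reverse=True)
```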

### Phase 4 — build_index.py

```bash
python3 expert/wiki/_scripts/build_index.py
# → wiki/README.md + wiki/{entities,people,concepts,reports}/index.md
```

### Phase 5 — build_html.py + build_processing_status.py

```bash
python3 expert/wiki/_scripts/build_html.py               # → wiki/index.html (~1MB, self-contained)
python3 expert/wiki/_scripts/build_processing_status.py  # → wiki/processing_status.html (raw/ ↔ _extracted/ comparison)
python3 expert/wiki/_scripts/build_docs.py               # → wiki/{readme,template,timeline,next_steps,prd,redact}.html (aligned nav bar)
python3 expert/wiki/_scripts/redact_check.py             # optional: rerun the PII audit
python3 expert/wiki/_scripts/build_redact_html.py        # optional: regenerate redact.html
```

> **Dependency**: `build_docs.py` needs the `markdown` package (`pip install markdown`); the other Phase 3-5 scripts use no third-party packages.

### Acceptance check

```bash
# Check how the wiki has grown:
echo "entities=$(ls expert/wiki/entities/ | wc -l)"
echo "people=$(ls expert/wiki/people/ | wc -l)"
echo "concepts=$(ls expert/wiki/concepts/ | wc -l)"
echo "reports=$(ls expert/wiki/reports/ | wc -l)"
echo "timeline lines=$(wc -l < expert/wiki/timeline/all.md)"
echo "index.html size=$(wc -c < expert/wiki/index.html)"

# Start a local server for a spot check:
cd expert/wiki && python3 -m http.server 8765
# Browse http://localhost:8765/index.html and processing_status.html
```

---

## 6. Troubleshooting notes (come back here when something breaks)

### 6.1 Hard-coded `/Users/john/lichao/` paths in the scripts

**Symptom**: the first time the wiki scripts run in the InvesResearch repo, every build_*.py reports "file not found".

**Cause**: all 8 scripts (`extract.py` `merge.py` `build_index.py` `build_html.py` `build_processing_status.py` `build_docs.py` `redact_check.py` `build_redact_html.py`) hard-code `WIKI = Path("/Users/john/lichao/wiki")`.

**Fix** (done on 2026-06-07): the paths are centralized in `expert/wiki/_scripts/_paths.py` and resolved in the order env > repo-relative > legacy:
```python
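import sys
from pathlib import Path

# Make the sibling _paths.py importable regardless of the current directory.
sys.path.insert(0, str(Path(__file__).resolve().parent))
from _paths import WIKI, RAW, DRAFTS, EXTRACTED, EXTRACT_REPORT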
```
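
A minimal sketch of what an env > repo-relative > legacy resolution order could look like. It is not the contents of `_paths.py`: the env var name `WIKI_ROOT` and the function name are hypothetical, and the layout assumes the script lives in `expert/wiki/_scripts/`.

```python
from pathlib import Path


def resolve_wiki(env: dict, script_file: Path, legacy: Path) -> Path:
    """Pick the wiki root: explicit env override, then repo-relative, then legacy."""
    override = env.get("WIKI_ROOT")  # hypothetical variable name
    if override:
        return Path(override)
    # _scripts/_paths.py → parent is _scripts/, parent.parent is wiki/
    repo_relative = script_file.resolve().parent.parent
    if repo_relative.is_dir():
        return repo_relative
    return legacy  # last resort: the old hard-coded location


# Inside a real script this might be called as:
#   WIKI = resolve_wiki(os.environ, Path(__file__), Path("/Users/john/lichao/wiki"))
```

Passing the environment and the script path as arguments keeps the function testable without touching the real file system layout.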

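### 6.2 A PDF extracts to 0 characters

**Symptom**: after extraction, `_extracted/foo__pdf.md` contains only the header and no body (~100 bytes).

**Cause**: scanned-image PDF, broken font encoding, or encryption; neither `pdftotext` nor `pdfplumber` can recover any text.

**Fix**:
1. Run `find expert/wiki/_extracted -name "*.md" -size -500c` to list the suspect files
2. Delete them manually with `rm`
3. Use the patch below to move them from `ok` to `fail` in `_extract_report.json`: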
```python
import json
from pathlib import Path

p = Path("expert/wiki/_extracted/_extract_report.json")
report = json.loads(p.read_text(encoding="utf-8"))
empty_outs = {"3其余素材__TR32-1265.I__pdf.md", ...}  # replace with your own file names
kept, dropped = [], []
for e in report["ok"]:
    if isinstance(e, dict) and e.get("out") in empty_outs:
        # Move the entry to "fail" with an explanatory error string.
        dropped.append({"src": e.get("src"), "err": "pdf-text-extraction-failed (likely image-scan)"})
    else:
        kept.append(e)
report["ok"] = kept
report.setdefault("fail", []).extend(dropped)
p.write_text(json.dumps(report, ensure_ascii=False, indent=2), encoding="utf-8")
```

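### 6.3 Parallel agents: always dispatch them in the same message

**Symptom**: wall time is the single-agent duration × N (linear rather than parallel).

**Cause**: each message contains only one Agent call, so the agents run serially.

**Fix**: **put N Agent tool calls in one message** and set `run_in_background: true`, so they truly run in parallel.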

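### 6.4 A sub-agent wrote Chinese slugs

**Symptom**: merge.py fails, or entries with the same slug are not merged.

**Cause**: `merge.py` uses the file stem as the key, and Chinese slugs may be normalized differently across agent outputs (Unicode NFC/NFD).

**Fix**: **stress repeatedly** in the sub-agent prompt that slugs must be latin/pinyin, never Chinese. If Chinese slugs have already been written, rename them manually with `mv` before running the merge.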
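
A pre-merge check along these lines could catch offending slugs early. It is a sketch, not part of `merge.py`: the helper names are hypothetical, and the lowercase-ASCII-with-hyphens pattern is an assumption inferred from the example slugs in the prompt template (`spacex`, `artemis-ii`, `direct-to-cell`).

```python
import re
import unicodedata

# Assumed slug shape: lowercase ASCII letters/digits separated by single hyphens.
SLUG_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def is_latin_slug(stem: str) -> bool:
    """True when the file stem matches the assumed latin/pinyin slug shape."""
    return SLUG_RE.fullmatch(stem) is not None


def bad_slugs(stems: list[str]) -> list[str]:
    """Stems that should be renamed before running the merge."""
    return [s for s in stems if not is_latin_slug(s)]


# Why non-ASCII stems are fragile as keys: the same visible text can have two
# different code point sequences, which compare unequal until normalized.
composed = "caf\u00e9"      # é as one code point (NFC form)
decomposed = "cafe\u0301"   # e followed by a combining accent (NFD form)
assert composed != decomposed
assert unicodedata.normalize("NFC", decomposed) == composed
```

Running `bad_slugs` over the stems in each `_drafts/agent-*/` subdirectory gives the list of files to `mv`.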

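### 6.5 Sensitive content (valuations / internal passwords / internal names)

**Symptom**: a PDF is a company due-diligence or investment deck containing specific funding amounts, internal valuations, or private contact details.

**Fix**:
- Add a "downgrade sensitive information" section to the sub-agent prompt (valuations as ranges; omit names, phone numbers, and email addresses)
- After the merge, run `redact_check.py` for a PII audit and review `_redact_report.md`
- Verify high-severity hits by hand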

---

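## 7. Boundaries with other skills in this repo

| Skill | What it does | Boundary |
|---|---|---|
| **expert-wiki-ingest** (this one) | Incremental ingest raw/ → wiki/ | Does not touch agent/ business logic |
| `agent/skills/satellite_internet_research.md` | Analyst know-how for the agent decision layer | Static document, does not touch data |
| `agent/skills/thesis_impact_judgment.md` | Single event → impact propagation to the main thesis | Internal classifier rules, does not touch the wiki |
| `agent/skills/strategy_recommendation_5_levels.md` | 5-level strategy recommendations (blind-spot fix back-derived from D V1) | Internal decision rules |
| `agent/skills/wyhtb_writing_guide.md` | Writing guide for bull/bear cases | Thesis configuration layer |
| `agent/skills/trigger_design_patterns.md` | Falsification trigger design | Triggers configuration layer |

**Relation to the expert/X scraping pipeline**: this skill covers only expert/raw/ (offline material bundles) → wiki/. expert/X/ (daily X post scraping) → the agent/ events table is a separate pipeline (EH-1 `x-ingest` action, see `agent/docs/x-ingest-cron-runbook.md`); the two do not overlap.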

---

## 8. Run checklist (the whole pipeline)

```bash
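# 0 · Snapshot the current extraction outputs so the new files can be listed after Phase 1
ls expert/wiki/_extracted/ | sort > /tmp/wiki_extracted_before.txt

# 1 · Phase 1 extraction (incremental), then list what this run added
python3 expert/wiki/_scripts/extract.py
comm -13 /tmp/wiki_extracted_before.txt <(ls expert/wiki/_extracted/ | sort)

# 2 · Delete empty PDF outputs (printing their names) and apply the fail patch, if any
find expert/wiki/_extracted -name "*.md" -size -500c -print -delete
# Then run the JSON patch from §6.2 with the printed names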

# 3 · Batch, prepare the dedup helpers, create empty draft directories (see §4.1-4.3)

# 4 · Dispatch N sub-agents in a single message (§4.4-4.5)

# 5 · Once all sub-agents have finished, check the draft counts (§4.6)

# 6 · Phase 3 merge
python3 expert/wiki/_scripts/merge.py

# 7 · Phase 4 indexes
python3 expert/wiki/_scripts/build_index.py

# 8 · Phase 5 HTML
python3 expert/wiki/_scripts/build_html.py
python3 expert/wiki/_scripts/build_processing_status.py
python3 expert/wiki/_scripts/build_docs.py

# 9 · Acceptance check and local spot check
cd expert/wiki && python3 -m http.server 8765
# → http://localhost:8765/index.html

# 10 · Commit (optional; wiki content is also tracked in git)
git add expert/wiki/{entities,people,concepts,reports,timeline,README.md,index.html,...}
git add expert/wiki/_scripts/ # if the scripts were changed
git commit -m "feat(wiki): roundX ingest, N entities / M concepts / R reports"
```

---

## 9. Status (2026-06-07, first run of this skill)

| Stage | Input | Output | Status |
|---|---|---|---|
| Round 1 (2026-06-04) | 117 raw docx/pptx/pdf files | 248 entities / 33 people / 231 concepts / 35 reports / 483 timeline | ✅ Delivered (7 agents in parallel) |
| Round 2 (this run) | 67 new files (60 popular-science video scripts + 7 miscellaneous PDFs) | _TBD_ (pending reports from the 4 sub-agents) | 🟡 In progress |

Next inventory: **raw/3其余素材/** still holds 22 mp4 + 32 jpg + 4 xlsx files, which need a whisper / OCR / pandas pipeline before they can be ingested (not covered by this skill; see expert/wiki/PRD.md §5 P0/P1/P2).

---

> **Trigger keywords for this skill**: `/expert-wiki-ingest` · `/wiki-incremental-ingest` · "把 expert/raw 下新增内容抽到 wiki" · "更新 expert/wiki 知识库" · "raw 下又来了一批材料跑一遍 wiki 入库"