LLM Wiki
A pattern for building personal knowledge bases using LLMs.
This is an idea file: it is designed to be copy-pasted into your own LLM agent (e.g. OpenAI Codex, Claude Code, OpenCode / Pi, etc.). Its goal is to communicate the high-level idea; your agent will build out the specifics in collaboration with you.
The core idea
Most people’s experience with LLMs and documents looks like RAG: you upload a collection of files, the LLM retrieves relevant chunks at query time, and generates an answer. This works, but the LLM is rediscovering knowledge from scratch on every question. There’s no accumulation. Ask a subtle question that requires synthesizing five documents, and the LLM has to find and piece together the relevant fragments every time. Nothing is built up. NotebookLM, ChatGPT file uploads, and most RAG systems work this way.
The idea here is different. Instead of just retrieving from raw documents at query time, the LLM incrementally builds and maintains a persistent wiki — a structured, interlinked collection of markdown files that sits between you and the raw sources. When you add a new source, the LLM doesn’t just index it for later retrieval. It reads it, extracts the key information, and integrates it into the existing wiki — updating entity pages, revising topic summaries, noting where new data contradicts old claims, strengthening or challenging the evolving synthesis. The knowledge is compiled once and then kept current, not re-derived on every query.
This is the key difference: the wiki is a persistent, compounding artifact. The cross-references are already there. The contradictions have already been flagged. The synthesis already reflects everything you’ve read. The wiki keeps getting richer with every source you add and every question you ask.
You never (or rarely) write the wiki yourself — the LLM writes and maintains all of it. You’re in charge of sourcing, exploration, and asking the right questions. The LLM does all the grunt work — the summarizing, cross-referencing, filing, and bookkeeping that makes a knowledge base actually useful over time. In practice, I have the LLM agent open on one side and Obsidian open on the other. The LLM makes edits based on our conversation, and I browse the results in real time — following links, checking the graph view, reading the updated pages. Obsidian is the IDE; the LLM is the programmer; the wiki is the codebase.
Application examples
This can apply to a lot of different contexts. A few examples:
- Personal: tracking your own goals, health, psychology, self-improvement — filing journal entries, articles, podcast notes, and building up a structured picture of yourself over time.
- Research: going deep on a topic over weeks or months — reading papers, articles, reports, and incrementally building a comprehensive wiki with an evolving thesis.
- Reading a book: filing each chapter as you go, building out pages for characters, themes, plot threads, and how they connect. By the end you have a rich companion wiki. Think of fan wikis like Tolkien Gateway — thousands of interlinked pages covering characters, places, events, languages, built by a community of volunteers over years. You could build something like that personally as you read, with the LLM doing all the cross-referencing and maintenance.
- Business/team: an internal wiki maintained by LLMs, fed by Slack threads, meeting transcripts, project documents, customer calls. Possibly with humans in the loop reviewing updates. The wiki stays current because the LLM does the maintenance that no one on the team wants to do.
- Competitive analysis, due diligence, trip planning, course notes, hobby deep-dives — anything where you’re accumulating knowledge over time and want it organized rather than scattered.
Architecture
There are three layers:
Raw sources — your curated collection of source documents. Articles, papers, images, data files. These are immutable — the LLM reads from them but never modifies them. This is your source of truth.
The wiki — a directory of LLM-generated markdown files. Summaries, entity pages, concept pages, comparisons, an overview, a synthesis. The LLM owns this layer entirely. It creates pages, updates them when new sources arrive, maintains cross-references, and keeps everything consistent. You read it; the LLM writes it.
The schema — a document (e.g. CLAUDE.md for Claude Code or AGENTS.md for Codex) that tells the LLM how the wiki is structured, what the conventions are, and what workflows to follow when ingesting sources, answering questions, or maintaining the wiki. This is the key configuration file — it’s what makes the LLM a disciplined wiki maintainer rather than a generic chatbot. You and the LLM co-evolve this over time as you figure out what works for your domain.
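As a concrete sketch, a minimal schema file might look something like this. The layout, filenames, and conventions below are invented placeholders, not a prescribed format — you and your agent will settle on your own:

```markdown
# Wiki schema (sketch)

## Layout
- raw/          : immutable source documents. Read, never modify.
- wiki/         : LLM-maintained pages (entities/, concepts/, sources/).
- wiki/index.md : catalog of every page, grouped by category.
- wiki/log.md   : append-only activity log.

## Conventions
- One page per entity or concept; cross-link with [[wikilinks]].
- Every claim cites a source file, e.g. (source: raw/2026-04-02-article.md).
- Flag contradictions inline rather than silently overwriting old claims.

## Workflows
- Ingest: read the source, discuss takeaways, write a summary page,
  update index.md and all touched pages, append a log entry.
- Lint: scan for contradictions, stale claims, orphan pages, missing links.
```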
Operations
Ingest. You drop a new source into the raw collection and tell the LLM to process it. An example flow: the LLM reads the source, discusses key takeaways with you, writes a summary page in the wiki, updates the index, updates relevant entity and concept pages across the wiki, and appends an entry to the log. A single source might touch 10-15 wiki pages. Personally I prefer to ingest sources one at a time and stay involved — I read the summaries, check the updates, and guide the LLM on what to emphasize. But you could also batch-ingest many sources at once with less supervision. It’s up to you to develop the workflow that fits your style and document it in the schema for future sessions.
Query. You ask questions against the wiki. The LLM searches for relevant pages, reads them, and synthesizes an answer with citations. Answers can take different forms depending on the question — a markdown page, a comparison table, a slide deck (Marp), a chart (matplotlib), a canvas. The important insight: good answers can be filed back into the wiki as new pages. A comparison you asked for, an analysis, a connection you discovered — these are valuable and shouldn’t disappear into chat history. This way your explorations compound in the knowledge base just like ingested sources do.
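When an answer is requested as slides, for instance, a Marp deck is just another markdown file that can be filed back into the wiki. The `marp: true` frontmatter key and `---` slide separators are standard Marp; the content is an invented example:

```markdown
---
marp: true
---

# Scaling claims across sources

- Source A: performance scales with data (raw/paper-a.md)
- Source B: disputes this at larger scale (raw/paper-b.md)

---

# Where the wiki stands

- Synthesis page updated: [[concepts/scaling]]
```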
Lint. Periodically, ask the LLM to health-check the wiki. Look for: contradictions between pages, stale claims that newer sources have superseded, orphan pages with no inbound links, important concepts mentioned but lacking their own page, missing cross-references, data gaps that could be filled with a web search. The LLM is good at suggesting new questions to investigate and new sources to look for. This keeps the wiki healthy as it grows.
Indexing and logging
Two special files help the LLM (and you) navigate the wiki as it grows. They serve different purposes:
index.md is content-oriented. It’s a catalog of everything in the wiki — each page listed with a link, a one-line summary, and optionally metadata like date or source count. Organized by category (entities, concepts, sources, etc.). The LLM updates it on every ingest. When answering a query, the LLM reads the index first to find relevant pages, then drills into them. This works surprisingly well at moderate scale (on the order of a hundred sources and a few hundred pages) and avoids the need for embedding-based RAG infrastructure.
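A sketch of what such an index might look like — all page names, summaries, and counts below are invented:

```markdown
# Index

## Entities
- [[entities/ada-lovelace]]: mathematician, collaborator of Babbage (3 sources)

## Concepts
- [[concepts/analytical-engine]]: Babbage's proposed general-purpose machine (2 sources)

## Sources
- [[sources/2026-04-01-menabrea-notes]]: Menabrea's paper with Lovelace's notes (ingested 2026-04-01)
```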
log.md is chronological. It’s an append-only record of what happened and when — ingests, queries, lint passes. A useful tip: if each entry starts with a consistent prefix (e.g. ## [2026-04-02] ingest | Article Title), the log becomes parseable with simple unix tools — grep "^## \[" log.md | tail -5 gives you the last 5 entries. The log gives you a timeline of the wiki’s evolution and helps the LLM understand what’s been done recently.
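For example, a log following that prefix convention might read (entries invented). Because every header matches the same `## [` pattern, the grep one-liner above picks out the most recent entries reliably:

```markdown
## [2026-04-01] ingest | Menabrea Notes
Created sources/2026-04-01-menabrea-notes.md; updated 11 pages.

## [2026-04-02] query | Where do the sources disagree on scaling?
Filed answer as analysis/scaling-disagreements.md.

## [2026-04-03] lint
Flagged 2 contradictions, 1 orphan page.
```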
Optional: CLI tools
At some point you may want to build small tools that help the LLM operate on the wiki more efficiently. A search engine over the wiki pages is the most obvious one — at small scale the index file is enough, but as the wiki grows you want proper search. qmd is a good option: it’s a local search engine for markdown files with hybrid BM25/vector search and LLM re-ranking, all on-device. It has both a CLI (so the LLM can shell out to it) and an MCP server (so the LLM can use it as a native tool). You could also build something simpler yourself — the LLM can help you vibe-code a naive search script as the need arises.
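If qmd is more than you need, a naive search script really is small. A minimal sketch in Python — it scores each page by raw query-term frequency, with no BM25 or embeddings; the paths and ranking are illustrative, not a recommended design:

```python
#!/usr/bin/env python3
"""Naive keyword search over a directory of markdown wiki pages."""
import re
import sys
from pathlib import Path

def search(wiki_dir, query, top_k=5):
    """Return up to top_k (score, path) pairs, highest score first."""
    terms = [t.lower() for t in re.findall(r"\w+", query)]
    results = []
    for page in Path(wiki_dir).rglob("*.md"):
        text = page.read_text(encoding="utf-8", errors="ignore").lower()
        # Score a page by how often each query term occurs in it.
        score = sum(text.count(term) for term in terms)
        if score > 0:
            results.append((score, str(page)))
    return sorted(results, reverse=True)[:top_k]

if __name__ == "__main__" and len(sys.argv) > 2:
    # Usage: search.py <wiki_dir> <query terms...>
    for score, path in search(sys.argv[1], " ".join(sys.argv[2:])):
        print(f"{score:4d}  {path}")
```

The LLM can shell out to this the same way it would to qmd, and extend it (stemming, title boosts, frontmatter filters) as your wiki outgrows it.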
Tips and tricks
技巧与窍门
- Obsidian Web Clipper is a browser extension that converts web articles to markdown. Very useful for quickly getting sources into your raw collection.
- Download images locally. In Obsidian Settings → Files and links, set “Attachment folder path” to a fixed directory (e.g. raw/assets/). Then in Settings → Hotkeys, search for “Download” to find “Download attachments for current file” and bind it to a hotkey (e.g. Ctrl+Shift+D). After clipping an article, hit the hotkey and all images get downloaded to local disk. This is optional but useful — it lets the LLM view and reference images directly instead of relying on URLs that may break.
- Obsidian’s graph view is the best way to see the shape of your wiki — what’s connected to what, which pages are hubs, which are orphans.
- Marp is a markdown-based slide deck format. Obsidian has a plugin for it. Useful for generating presentations directly from wiki content.
- Dataview is an Obsidian plugin that runs queries over page frontmatter. If your LLM adds YAML frontmatter to wiki pages (tags, dates, source counts), Dataview can generate dynamic tables and lists.
- The wiki is just a git repo of markdown files. You get version history, branching, and collaboration for free.
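To make the Dataview point concrete: if the LLM writes frontmatter fields such as date and sources on each page, a query in Dataview’s DQL can build a live table over them. The field names and folder are invented; in Obsidian the query goes inside a fenced dataview code block on any page:

```markdown
TABLE date, sources
FROM "wiki/concepts"
SORT sources DESC
```

This gives you dashboards (most-sourced concepts, recently updated pages) that stay current without the LLM maintaining them by hand.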
Why this works
The tedious part of maintaining a knowledge base is not the reading or the thinking — it’s the bookkeeping. Updating cross-references, keeping summaries current, noting when new data contradicts old claims, maintaining consistency across dozens of pages. Humans abandon wikis because the maintenance burden grows faster than the value. LLMs don’t get bored, don’t forget to update a cross-reference, and can touch 15 files in one pass. The wiki stays maintained because the cost of maintenance is near zero.
The human’s job is to curate sources, direct the analysis, ask good questions, and think about what it all means. The LLM’s job is everything else.
The idea is related in spirit to Vannevar Bush’s Memex (1945) — a personal, curated knowledge store with associative trails between documents. Bush’s vision was closer to this than to what the web became: private, actively curated, with the connections between documents as valuable as the documents themselves. The part he couldn’t solve was who does the maintenance. The LLM handles that.
Note
This document is intentionally abstract. It describes the idea, not a specific implementation. The exact directory structure, the schema conventions, the page formats, the tooling — all of that will depend on your domain, your preferences, and your LLM of choice. Everything mentioned above is optional and modular — pick what’s useful, ignore what isn’t. For example: your sources might be text-only, so you don’t need image handling at all. Your wiki might be small enough that the index file is all you need, no search engine required. You might not care about slide decks and just want markdown pages. You might want a completely different set of output formats. The right way to use this is to share it with your LLM agent and work together to instantiate a version that fits your needs. The document’s only job is to communicate the pattern. Your LLM can figure out the rest.