LLM Knowledge Bases (Karpathy)

This article was generated by AI analysis.

Created: 2026-04-03 Source: https://x.com/karpathy/status/2039805659525644595

Summary

Andrej Karpathy describes a personal workflow where LLMs act as compilers: raw source documents (articles, papers, repos, images) are ingested into a raw/ directory, and an LLM incrementally “compiles” them into a structured markdown wiki — writing summaries, backlinks, concept articles, and inter-document links. The wiki lives in Obsidian for human browsing, and once large enough (~100 articles, ~400K words), the same LLM can answer complex questions by reading its own index files. Outputs (Q&A results, slides, visualizations) are filed back into the wiki, so every query compounds the knowledge base.

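Karpathy describes a personal knowledge-management workflow that uses an LLM as a “compiler”: raw documents go into a raw/ directory, and the LLM automatically compiles them into a structured Markdown wiki (with summaries, backlinks, and concept articles). Once the wiki is large enough (about 100 articles, 400,000 words), the LLM can answer complex questions by reading its own index files. The output of each query is also filed back into the wiki, so knowledge accumulates like compound interest.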
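The "incremental compile" step can be sketched roughly as follows. This is a minimal illustration, not Karpathy's actual scripts: the hash manifest, the function names, and the `summarize` callback (a stand-in for the LLM call) are all assumptions introduced here.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def pending_sources(raw_dir: Path, manifest_path: Path) -> list[Path]:
    """Return raw files whose content hash is new or changed since the last run."""
    seen = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
    pending = []
    for path in sorted(p for p in raw_dir.rglob("*") if p.is_file()):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if seen.get(str(path.relative_to(raw_dir))) != digest:
            pending.append(path)
    return pending


def compile_incrementally(raw_dir: Path, wiki_dir: Path, manifest_path: Path,
                          summarize: Callable[[str], str]) -> list[Path]:
    """Compile only new or changed raw files into wiki pages, then save the manifest.

    `summarize` is a hypothetical placeholder for the LLM call.
    """
    seen = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
    wiki_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for path in pending_sources(raw_dir, manifest_path):
        text = path.read_text(encoding="utf-8", errors="replace")
        page = wiki_dir / (path.stem + ".md")
        page.write_text(summarize(text), encoding="utf-8")
        # Record the hash so this file is skipped on the next run unless it changes.
        seen[str(path.relative_to(raw_dir))] = hashlib.sha256(path.read_bytes()).hexdigest()
        written.append(page)
    manifest_path.write_text(json.dumps(seen, indent=2))
    return written
```

Keeping the manifest outside raw/ avoids it being picked up as a source document. A second run with no changes in raw/ writes nothing, which is what makes the compile incremental.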

Key Points

Three-layer architecture: raw/ (source docs) → wiki (LLM-compiled markdown) → outputs (Q&A, slides, charts)
LLM maintains the wiki autonomously; the human rarely edits it directly
At ~100 articles scale, a standard LLM can handle Q&A without fancy RAG — auto-maintained index files + brief summaries are sufficient for retrieval
“Linting” pass: LLM health-checks the wiki for inconsistencies, imputes missing data via web search, and proposes new article candidates
Obsidian serves as the read-only frontend; Marp for slides; matplotlib for chart outputs
Long-term direction: synthetic data generation + fine-tuning so the LLM “knows” the wiki in weights, not just context
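One mechanical part of the "linting" pass above can be sketched without an LLM: scanning for Obsidian-style `[[wikilinks]]` that point at pages which do not exist yet. Such targets are both inconsistencies and natural candidates for new articles. The function name is hypothetical, and the sketch assumes links use bare page titles (no folder paths) and that every page is a `.md` file.

```python
import re
from pathlib import Path

# Matches [[Page]], [[Page|alias]] and [[Page#heading]]; group 1 is the page title.
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")


def missing_link_targets(wiki_dir: Path) -> dict[str, list[str]]:
    """Map each linked-but-missing page title to the pages that link to it."""
    pages = {p.stem for p in wiki_dir.rglob("*.md")}
    missing: dict[str, list[str]] = {}
    for page in sorted(wiki_dir.rglob("*.md")):
        for match in WIKILINK.finditer(page.read_text(encoding="utf-8")):
            target = match.group(1).strip()
            if target not in pages:
                missing.setdefault(target, []).append(page.stem)
    return missing
```

The resulting map could be handed to the LLM as a worklist. Checking factual consistency or filling in missing data via web search would still need the model itself.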

Insights

The key architectural insight is that at small-to-medium scale (~400K words), a capable LLM with good index files doesn’t need vector embeddings or a retrieval pipeline — it can navigate a well-organized markdown directory structure directly. This inverts the common assumption that RAG is necessary for any non-trivial knowledge base.
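To make the "index files instead of RAG" idea concrete, here is a minimal sketch of an auto-maintained index: one line per article with a short summary, which the LLM could read first to decide which full articles to open. The `index.md` name, the line format, and the "first non-heading line is the summary" rule are assumptions for illustration, not details from the post.

```python
from pathlib import Path


def first_paragraph(markdown: str, limit: int = 200) -> str:
    """Return the first non-empty, non-heading line, truncated to `limit` characters."""
    for line in markdown.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            return line[:limit]
    return ""


def build_index(wiki_dir: Path, index_name: str = "index.md") -> str:
    """Write one '- [[title]]: summary' line per article and return the index text."""
    lines = []
    for page in sorted(wiki_dir.rglob("*.md")):
        if page.name == index_name:
            continue  # do not index the index itself
        summary = first_paragraph(page.read_text(encoding="utf-8"))
        lines.append(f"- [[{page.stem}]]: {summary}")
    text = "\n".join(lines) + "\n"
    (wiki_dir / index_name).write_text(text, encoding="utf-8")
    return text
```

At around 100 articles, an index like this stays at a few thousand words at most, a small fraction of the ~400K-word wiki, so it fits comfortably in context alongside the handful of articles it points to.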

The compounding loop is the underrated part: outputs filed back into the wiki mean every query permanently enriches the base rather than disappearing into chat history. This transforms Q&A from a stateless interaction into an investment.
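The compounding loop amounts to persisting each answer as a wiki page instead of leaving it in chat history. A minimal sketch, assuming an `outputs/` subfolder inside the wiki and a date-plus-slug filename (both choices are illustrative, not from the post):

```python
import re
from datetime import date
from pathlib import Path
from typing import Optional


def file_output(wiki_dir: Path, question: str, answer: str,
                today: Optional[date] = None) -> Path:
    """Save a Q&A result as a markdown page under wiki_dir/outputs and return its path."""
    # Lowercase, collapse non-alphanumeric runs to '-', cap the length.
    slug = re.sub(r"[^a-z0-9]+", "-", question.lower())[:60].strip("-") or "untitled"
    day = (today or date.today()).isoformat()
    out_dir = wiki_dir / "outputs"
    out_dir.mkdir(parents=True, exist_ok=True)
    page = out_dir / f"{day}-{slug}.md"
    page.write_text(f"# {question}\n\n{answer}\n", encoding="utf-8")
    return page
```

Once filed, the page is picked up by whatever indexing and linking the wiki already does, so later queries can build on it.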

The “vibe coded search engine” comment suggests that even simple keyword search over the wiki, exposed as a CLI tool, is valuable enough to build — which implies the value is in organization and persistence, not retrieval sophistication.
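A keyword search of the kind described here can be very small. The sketch below is an assumption about what such a tool might look like, not Karpathy's implementation: it ranks pages by raw substring counts of the query terms, with no stemming or weighting.

```python
import argparse
from pathlib import Path
from typing import Optional


def search(wiki_dir: Path, query: str, top_k: int = 5) -> list[tuple[int, str]]:
    """Rank pages by total case-insensitive substring occurrences of the query terms."""
    terms = query.lower().split()
    hits = []
    for page in wiki_dir.rglob("*.md"):
        text = page.read_text(encoding="utf-8").lower()
        score = sum(text.count(term) for term in terms)
        if score:
            hits.append((score, str(page.relative_to(wiki_dir))))
    # Highest score first; ties broken alphabetically by path.
    return sorted(hits, key=lambda h: (-h[0], h[1]))[:top_k]


def main(argv: Optional[list[str]] = None) -> None:
    """CLI entry point: print 'score  path' for the top matches."""
    parser = argparse.ArgumentParser(description="Keyword search over a markdown wiki")
    parser.add_argument("wiki_dir", type=Path)
    parser.add_argument("query")
    args = parser.parse_args(argv)
    for score, name in search(args.wiki_dir, args.query):
        print(f"{score:4d}  {name}")
```

Calling `main()` from a script entry point turns this into a command-line tool that either a human or the LLM can invoke.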

Connections

karpathy is showing one of the simplest AI architectures that actually works… — JUMPERZ’s thread directly comments on this post and extrapolates to multi-agent architectures
The NotebookLM Workflow That Changed How I Learn Any Technology — NotebookLM implements a similar triangulated-source + AI synthesis approach, but as a hosted product rather than a local workflow
knowledge-management
obsidian
personal-wiki

Raw Excerpt

TLDR: raw data from a given number of sources is collected, then compiled by an LLM into a .md wiki, then operated on by various CLIs by the LLM to do Q&A and to incrementally enhance the wiki, and all of it viewable in Obsidian. You rarely ever write or edit the wiki manually, it’s the domain of the LLM. I think there is room here for an incredible new product instead of a hacky collection of scripts.
