AI Innovations #20: HtmlRAG, AFLOW, ChunkRAG, and MarkItDown

This article was generated by AI analysis.

Created: 2026-03-27 Source: https://levelup.gitconnected.com/ai-innovations-and-insights-20-htmlrag-aflow-chunkrag-and-markitdown-fe102693315e

Summary

A survey of four RAG-adjacent items, three research techniques and one tool: HtmlRAG (keep retrieved content as HTML in RAG to preserve document structure), AFLOW (automated LLM workflow search via Monte Carlo tree search), ChunkRAG (chunk-level relevance filtering for more precise retrieval), and MarkItDown (Microsoft’s file-to-Markdown converter, which lets LLMs process Office and PDF documents).

Key Points

HtmlRAG: keeps retrieved documents as HTML instead of flattening them to plain text, which preserves the headers, tables, lists, and semantic structure that plain-text extraction strips. Best suited to documents where structure carries meaning (tables, forms, hierarchical docs).
AFLOW: frames LLM workflow construction as a search problem and solves it with Monte Carlo tree search (MCTS). Instead of hand-designing agentic workflows (chain-of-thought, self-consistency, ReAct), AFLOW searches the workflow space automatically.
ChunkRAG: adds a filtering step between retrieval and generation. It scores each retrieved chunk for relevance rather than treating all retrieved content equally, which reduces the noise sent to the LLM.
MarkItDown: Microsoft’s open-source tool for converting PDF, Office (Word/Excel/PowerPoint), image, and audio files to Markdown, so LLMs can work with file formats they can’t read natively.
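
The ChunkRAG idea above, scoring each retrieved chunk on its own and dropping the weak ones before generation, can be sketched in a few lines. This is a minimal illustration, not the paper's method: the lexical-overlap `relevance` function stands in for whatever relevance model a real system would use, and the names `filter_chunks`, `ScoredChunk`, and the `threshold` default are all hypothetical.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass
class ScoredChunk:
    text: str
    score: float


def tokenize(text: str) -> set[str]:
    """Lowercased word set; a crude stand-in for a real relevance model."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def relevance(query: str, chunk: str) -> float:
    """Fraction of query terms that also appear in the chunk (0.0 to 1.0)."""
    query_terms = tokenize(query)
    if not query_terms:
        return 0.0
    return len(query_terms & tokenize(chunk)) / len(query_terms)


def filter_chunks(query: str, chunks: list[str], threshold: float = 0.5) -> list[ScoredChunk]:
    """Score every chunk individually and keep those at or above the threshold."""
    scored = [ScoredChunk(c, relevance(query, c)) for c in chunks]
    kept = [s for s in scored if s.score >= threshold]
    return sorted(kept, key=lambda s: s.score, reverse=True)


# Pretend these came back from a retriever (assumed sample data).
retrieved = [
    "MarkItDown converts Office files to Markdown.",
    "Chunk filtering removes irrelevant chunks before generation.",
    "The weather in Lisbon is mild in spring.",
]
kept = filter_chunks("chunk filtering before generation", retrieved)
for item in kept:
    print(f"{item.score:.2f}  {item.text}")
```

Only the chunks that survive the filter would be placed in the prompt; swapping `relevance` for an embedding similarity or an LLM-based judge keeps the rest of the flow unchanged.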

Insights

MarkItDown is the most practically useful of the four: it solves a concrete integration problem (LLMs can’t read .docx, .xlsx, or .pdf files natively) with a simple CLI and API. The other three are research-stage techniques that are worth knowing about. HtmlRAG and ChunkRAG address complementary problems: HtmlRAG improves how document content is represented, while ChunkRAG improves which chunks are selected after retrieval.
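
To make the integration point concrete, here is a small sketch of batch conversion through MarkItDown's Python API. It assumes the `markitdown` package is installed and relies on its basic usage pattern, `MarkItDown().convert(path).text_content`. The helper `convert_files` and the file names in the usage note are hypothetical; the helper accepts any object with a compatible `convert` method so it can be exercised without the package.

```python
from __future__ import annotations

from typing import Any

try:
    from markitdown import MarkItDown  # pip install markitdown
except ImportError:
    MarkItDown = None  # the helper below still works with any compatible converter


def convert_files(paths: list[str], converter: Any) -> dict[str, str]:
    """Map each input path to its Markdown text.

    `converter` is expected to behave like markitdown.MarkItDown: it has a
    convert(path) method whose result exposes a text_content attribute.
    """
    return {path: converter.convert(path).text_content for path in paths}
```

With the package installed, `convert_files(["report.docx", "budget.xlsx"], MarkItDown())` would return a dictionary of Markdown strings ready to be chunked or placed in a prompt.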

Connections

Raw Excerpt

HtmlRAG uses HTML format in RAG systems to preserve document structure rather than converting to plain text — retaining tables, headers, and semantic hierarchy that plain-text extraction discards.
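
Keeping HTML in the prompt only pays off if the markup is slimmed down first. The sketch below shows one way such a cleaning pass might look, using only Python's standard `html.parser`: scripts, styles, comments, and attributes are dropped, while structural tags and text are kept. It is an illustrative simplification rather than the HtmlRAG pipeline; the class name and the choice to discard void tags such as `<br>` and `<img>` are assumptions.

```python
from __future__ import annotations

import html
import re
from html.parser import HTMLParser

SKIP_TAGS = {"script", "style"}  # tags whose content is dropped entirely
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}  # dropped for simplicity


class StructureKeepingCleaner(HTMLParser):
    """Drop scripts, styles, comments, and attributes; keep tags and text."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif self._skip_depth == 0 and tag not in VOID_TAGS:
            self.parts.append(f"<{tag}>")  # attributes are intentionally discarded

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif self._skip_depth == 0 and tag not in VOID_TAGS:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Skip whitespace-only runs; collapse internal whitespace elsewhere.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(html.escape(re.sub(r"\s+", " ", data), quote=False))

    # handle_comment keeps its default no-op behaviour, so comments disappear.


raw = """
<div class="wrap" style="color:red">
  <!-- tracking -->
  <script>var x = 1;</script>
  <h2 id="t">Prices</h2>
  <table><tr><th>Item</th><th>Cost</th></tr><tr><td>Tea</td><td>3</td></tr></table>
</div>
"""
cleaner = StructureKeepingCleaner()
cleaner.feed(raw)
cleaner.close()
cleaned = "".join(cleaner.parts)
print(cleaned)
```

The table and heading survive as markup, so a model reading `cleaned` can still tell which cell belongs to which column, which is the information plain-text extraction would lose.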
