Summary

A survey of four RAG-adjacent contributions: HtmlRAG (keeping HTML rather than plain text in RAG to preserve document structure), AFLOW (automated LLM workflow search via Monte Carlo tree search), ChunkRAG (chunk-level relevance filtering for more precise retrieval), and MarkItDown (Microsoft’s file-to-Markdown converter, which lets LLMs process Office and PDF documents).

Key Points

  • HtmlRAG: keeps retrieved documents in HTML instead of flattening them to plain text, preserving headers, tables, lists, and other semantic structure that plain-text extraction strips. Most useful for documents where structure carries meaning (tables, forms, hierarchical docs).
  • AFLOW: frames LLM workflow construction as a search problem solved with Monte Carlo tree search (MCTS). Instead of hand-designing agentic workflows (chain-of-thought, self-consistency, ReAct), AFLOW searches the workflow space automatically.
  • ChunkRAG: adds a filtering step between retrieval and generation. It scores each retrieved chunk for relevance instead of treating all retrieved content equally, which reduces the noise sent to the LLM.
  • MarkItDown: Microsoft’s open-source tool that converts PDF, Office (Word/Excel/PowerPoint), image, and audio files to Markdown, letting LLMs process file formats they can’t natively read.
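
For the MarkItDown point, a minimal sketch of batch conversion through the Python API, assuming the `markitdown` package is installed. `MarkItDown().convert(path)` and `result.text_content` follow the project's README; `convert_files` and its injectable `converter` parameter are hypothetical names added here so the helper can be exercised without the package.

```python
from pathlib import Path
from typing import Dict, Iterable, Optional


def convert_files(paths: Iterable[str], converter: Optional[object] = None) -> Dict[str, str]:
    """Return {file name: Markdown text} for each input path.

    `converter` is any object with a .convert(path) method whose result
    exposes .text_content. When omitted, a real MarkItDown instance is used.
    """
    if converter is None:
        # Imported lazily so this module loads even without the package.
        from markitdown import MarkItDown
        converter = MarkItDown()

    markdown_by_file: Dict[str, str] = {}
    for path in paths:
        result = converter.convert(str(path))
        markdown_by_file[Path(path).name] = result.text_content
    return markdown_by_file
```

With the package installed, `convert_files(["report.docx", "budget.xlsx"])` would return the Markdown for each file (the file names are placeholders). The README also documents a CLI form, `markitdown path-to-file.pdf > document.md`.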

Insights

MarkItDown is the most immediately practical of the four: it solves a concrete integration problem (LLMs can’t read .docx, .xlsx, or .pdf files natively) with a simple CLI and Python API. The other three are research-stage techniques worth knowing about. HtmlRAG and ChunkRAG address complementary problems: HtmlRAG improves how retrieved documents are represented to the LLM, while ChunkRAG improves which retrieved chunks are passed on.
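
The chunk-selection step can be sketched as a filter between the retriever and the generator. This is an illustration under stated assumptions, not ChunkRAG's actual method: the note does not say how chunks are scored, so the `term_overlap` scorer below is a simple lexical stand-in, and `filter_chunks`, `scorer`, and `threshold` are hypothetical names. A real system would presumably plug in a stronger relevance model as the `scorer`.

```python
import re
from typing import Callable, List


def term_overlap(query: str, chunk: str) -> float:
    """Stand-in relevance score: fraction of query terms found in the chunk."""
    query_terms = set(re.findall(r"\w+", query.lower()))
    chunk_terms = set(re.findall(r"\w+", chunk.lower()))
    if not query_terms:
        return 0.0
    return len(query_terms & chunk_terms) / len(query_terms)


def filter_chunks(
    query: str,
    chunks: List[str],
    scorer: Callable[[str, str], float] = term_overlap,
    threshold: float = 0.5,
) -> List[str]:
    """Keep only chunks scoring at or above the threshold, best first."""
    scored = [(scorer(query, chunk), chunk) for chunk in chunks]
    kept = [pair for pair in scored if pair[0] >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)  # stable for ties
    return [chunk for _, chunk in kept]
```

Only the chunks returned by `filter_chunks` would be placed in the prompt. Keeping the scorer pluggable means the filtering logic stays the same when the scoring method changes.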

Connections

Raw Excerpt

HtmlRAG uses HTML format in RAG systems to preserve document structure rather than converting to plain text — retaining tables, headers, and semantic hierarchy that plain-text extraction discards.
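
To make the contrast concrete, the sketch below renders the same HTML fragment two ways using only Python's standard library: as flattened plain text, and as structure-preserving HTML. It is an illustration, not HtmlRAG's pipeline. Dropping `<script>`/`<style>` content and tag attributes is an assumption made here to keep the output compact, and `Extractor`, `render`, and `SAMPLE` are hypothetical names.

```python
from html.parser import HTMLParser

SKIP_TAGS = {"script", "style"}  # non-content elements dropped in both modes


class Extractor(HTMLParser):
    """Collects text, optionally keeping bare structural tags."""

    def __init__(self, keep_tags: bool):
        super().__init__(convert_charrefs=True)
        self.keep_tags = keep_tags
        self.parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif self.keep_tags and not self._skip_depth:
            self.parts.append(f"<{tag}>")  # attributes are dropped

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif self.keep_tags and not self._skip_depth:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.parts.append(text)


def render(html: str, keep_tags: bool) -> str:
    """Return plain text (keep_tags=False) or simplified HTML (keep_tags=True)."""
    parser = Extractor(keep_tags)
    parser.feed(html)
    parser.close()
    return ("" if keep_tags else " ").join(parser.parts)


SAMPLE = """<h2 class="t">Pricing</h2>
<style>.t{color:red}</style>
<table><tr><th>Plan</th><th>Price</th></tr>
<tr><td>Basic</td><td>$5</td></tr></table>"""
```

Here `render(SAMPLE, keep_tags=False)` yields a flat run of words in which the heading, column headers, and cell values are indistinguishable, while `render(SAMPLE, keep_tags=True)` keeps the `<h2>` and the table's row and cell boundaries.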