Docling Library: Structured Documents for RAG and AI Agents
Docling is an open‑source Python library that turns messy PDFs and office documents into structured data ready for RAG pipelines and AI agents. Instead of just extracting plain text, it understands layout, tables, formulas, and more.
Here is a high‑level look at what Docling does and where it fits in an AI stack.
What Formats Can Docling Parse?
Docling supports PDFs, DOCX, PPTX, XLSX, HTML, LaTeX, and more. It builds a unified DoclingDocument representation that includes:
- Page layout and reading order.
- Tables as structured data, not just text.
- Code blocks, formulas (LaTeX), and images.
- OCR for scanned pages when needed.
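As a sketch, producing that unified representation takes only a few lines. The input path below is a placeholder, and the exact API surface can shift between Docling releases, so verify against the docs for your installed version:

```python
from docling.document_converter import DocumentConverter

# Convert a source file into a unified DoclingDocument.
converter = DocumentConverter()
result = converter.convert("report.pdf")  # placeholder path
doc = result.document

# Tables come back as structured items rather than flat text;
# Docling can export each one as a DataFrame for downstream use.
for table in doc.tables:
    df = table.export_to_dataframe()
    print(df.head())
```

The same converter handles DOCX, PPTX, HTML, and the other supported formats; OCR kicks in automatically for scanned pages when the relevant backend is configured.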
Exporting for RAG: Chunks, Markdown, and JSON
After parsing, you can export documents as:
- Markdown for LLM‑friendly text.
- JSON for lossless storage and custom processing.
- DocTags or structured chunks where each chunk has position, type, and content.
This makes it easy to build retrieval‑augmented generation systems that respect document structure when answering questions.
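A minimal sketch of those export paths, assuming Docling is installed (the input path is a placeholder, and HybridChunker's parameters and chunk metadata fields vary by release, so check the chunking docs):

```python
from docling.document_converter import DocumentConverter
from docling.chunking import HybridChunker

doc = DocumentConverter().convert("handbook.pdf").document  # placeholder path

markdown_text = doc.export_to_markdown()  # LLM-friendly text
lossless_dict = doc.export_to_dict()      # JSON-serializable, for storage

# Structure-aware chunks: each chunk carries text plus provenance
# metadata such as the headings it appeared under.
chunker = HybridChunker()
for chunk in chunker.chunk(doc):
    print(chunk.text[:80], chunk.meta.headings)
```

Storing the lossless JSON alongside the chunks lets you re-chunk later with different settings without re-parsing the source files.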
Integrations with LangChain and LlamaIndex
Docling ships integrations with major AI frameworks, including:
- LangChain DoclingLoader with export modes like DOC_CHUNKS and MARKDOWN.
- Connectors for LlamaIndex, Crew AI, and Haystack.
- An MCP server to expose Docling as a tool to compatible AI agents.
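The LangChain path, for example, looks roughly like this. This sketch assumes the `langchain-docling` integration package; the loader and export-type names follow its documentation, but verify them against the version you install:

```python
from langchain_docling import DoclingLoader
from langchain_docling.loader import ExportType

loader = DoclingLoader(
    file_path="policy.pdf",             # placeholder path
    export_type=ExportType.DOC_CHUNKS,  # or ExportType.MARKDOWN
)
documents = loader.load()

for d in documents[:3]:
    # Each chunk arrives as a LangChain Document whose metadata
    # carries Docling's structural information.
    print(d.page_content[:80], d.metadata.get("dl_meta"))
```

With DOC_CHUNKS, the documents drop straight into a standard LangChain vector-store indexing pipeline while keeping Docling's structure in the metadata.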
Follow the Docling Project and Ecosystem Docs
The Docling project releases frequently across parsers, exporters, and integrations, so point stakeholders to primary sources when estimating timelines or compliance requirements.
- The main repository and issue tracker live under the docling-project organization on GitHub.
- Hugging Face indexes many associated models and discussion threads; start from huggingface.co/docs when you connect Docling outputs to training or retrieval workflows.
- LangChain documents third-party loaders, including Docling, in its Python documentation.
Evaluation Ideas for Structured RAG
Structured exports are only valuable if retrieval improves measurably. Build a small golden set of questions with expected evidence spans before you tune chunk sizes.
- Test table questions separately from paragraph questions; layout parsers tend to regress on different classes of documents, and a single aggregate score hides that.
- Log which export mode (Markdown vs JSON chunks) produces better citation overlap for your domain.
- When you add agents, read tool-calling patterns in OpenAI platform docs and multimodal notes in Gemini API docs.
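One concrete way to make "citation overlap" measurable is to score how much of each expected evidence span your retrieved chunks cover. A minimal, framework-agnostic sketch; the metric and function names here are illustrative, not part of Docling:

```python
def span_overlap(a: tuple[int, int], b: tuple[int, int]) -> int:
    """Length of the intersection of two [start, end) character spans."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def evidence_recall(expected: list[tuple[int, int]],
                    retrieved: list[tuple[int, int]]) -> float:
    """Fraction of expected-evidence characters covered by retrieval.

    For each expected span, credit the single best-covering retrieved
    chunk (a deliberate simplification; it underestimates when several
    chunks jointly cover one span).
    """
    total = sum(end - start for start, end in expected)
    if total == 0:
        return 0.0
    covered = sum(
        max((span_overlap(exp, r) for r in retrieved), default=0)
        for exp in expected
    )
    return covered / total


# One golden question: the evidence lives at characters 100-200 of the
# source document, and the retriever returned chunks at 150-400 and 0-50.
gold = [(100, 200)]
chunks = [(150, 400), (0, 50)]
print(evidence_recall(gold, chunks))  # 0.5
```

Running this per question, split by question type (table vs paragraph), gives you the regression signal the bullets above describe without committing to any one RAG framework.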
Docling vs MarkItDown in Your AI Projects
Compared to MarkItDown, which focuses on fast conversion to Markdown, Docling provides a richer structured representation of documents. For many educational and enterprise apps, a good pattern is:
- Use Docling when you care about tables, formulas, and layout.
- Use MarkItDown when you just need quick Markdown for text‑heavy content.
In our advanced AI track at Paath.online, we show students how to build RAG pipelines that combine Docling, Markdown, and modern vector / hybrid search.