Docling in 2026: getting documents ready for generative AI
Docling is an open-source toolkit that parses real-world files (PDFs, Office documents, HTML, audio, and more) into a structured representation you can feed into RAG, agents, and evaluation pipelines. The project lives under the docling-project organisation on GitHub, is hosted by the LF AI & Data Foundation, is MIT-licensed, and is described in a technical report (arXiv:2408.09869).
What Docling parses (primary source)
According to the project README, Docling handles multiple formats—including PDF, DOCX, PPTX, XLSX, HTML, WAV, MP3, WebVTT, images, LaTeX, and plain text—with advanced PDF understanding (layout, reading order, tables, code, formulas, image classification). It also supports domain XML such as USPTO patents, JATS articles, and XBRL financial reports.
Full format tables and concepts are maintained in the official documentation: docling-project.github.io/docling.
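As a minimal sketch of the basic conversion flow, the snippet below follows the DocumentConverter usage shown in the project README. The LISTED_EXTENSIONS set and the looks_listed helper are hypothetical: they are hand-written from the format list above, not taken from Docling's own format registry, so treat them as an illustrative pre-filter only.

```python
from pathlib import Path

# Hypothetical pre-filter: extensions hand-picked from the formats named above.
# This is NOT Docling's own format registry; check the official docs for that.
LISTED_EXTENSIONS = {
    ".pdf", ".docx", ".pptx", ".xlsx", ".html",
    ".wav", ".mp3", ".vtt", ".tex",
}


def looks_listed(path: str) -> bool:
    """Return True if the file extension appears in the hand-written set."""
    return Path(path).suffix.lower() in LISTED_EXTENSIONS


def convert_to_markdown(source: str) -> str:
    """Convert a local path or URL with Docling and return Markdown text."""
    # Imported lazily so the helper above works without Docling installed.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    result = converter.convert(source)
    return result.document.export_to_markdown()
```

With Docling installed, calling convert_to_markdown("https://arxiv.org/pdf/2408.09869") should return the cited technical report as Markdown. The first run may download model weights, so expect it to be slower than later runs.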
What's new in recent releases (README "What's new")
- Heron — a new default layout model for faster PDF parsing.
- MCP server — connect Docling to MCP-compatible agents (see the usage page linked from the README).
- GraniteDocling — optional visual language model path documented on Hugging Face at ibm-granite/granite-docling-258M.
- Beta structured information extraction, WebVTT and LaTeX parsing, chart understanding, and more. Check the changelog in the repo to confirm which of these features the version you install actually includes.
Python 3.10+ is required as of Docling 2.70.0 (per README note on dropped Python 3.9 support).
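A small pre-flight check can catch an unsupported interpreter before installation fails in a less obvious way. The sketch below uses only the standard library. The function names are illustrative, and the minimum version is copied from the README note quoted above, so re-check it against the release you actually install.

```python
import sys
from importlib import metadata

# Minimum interpreter version, per the README note quoted above.
MIN_PYTHON = (3, 10)


def python_ok(version_info=sys.version_info) -> bool:
    """Return True if the given (major, minor, ...) version meets the minimum."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def installed_docling_version():
    """Return the installed docling version string, or None if not installed."""
    try:
        return metadata.version("docling")
    except metadata.PackageNotFoundError:
        return None


def describe_environment() -> str:
    """Build a one-line summary suitable for logs or bug reports."""
    py = ".".join(str(part) for part in sys.version_info[:3])
    docling = installed_docling_version() or "not installed"
    status = "ok" if python_ok() else "too old"
    return f"python {py} ({status}), docling {docling}"
```

Printing describe_environment() at the start of an ingestion job makes version mismatches visible in logs.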
Exports and integrations for RAG
Docling's unified DoclingDocument can be exported to Markdown, HTML, WebVTT, DocTags (see arXiv:2503.11576), and lossless JSON—useful when you need citations, tables, or downstream chunkers that respect structure.
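To illustrate the export step, here is a hedged sketch that writes a Markdown and a JSON copy of one converted document. It assumes the object passed in exposes export_to_markdown() and export_to_dict(), as DoclingDocument does in the releases we have used. The write_exports helper and its file naming are hypothetical.

```python
import json
from pathlib import Path


def write_exports(document, out_dir, stem):
    """Write Markdown and JSON exports of a DoclingDocument-like object.

    `document` is assumed to provide export_to_markdown() and export_to_dict().
    Returns the two paths that were written.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    md_path = out / f"{stem}.md"
    json_path = out / f"{stem}.json"

    # Markdown is convenient for prompts and quick inspection.
    md_path.write_text(document.export_to_markdown(), encoding="utf-8")

    # The dict export keeps the document structure for downstream tooling.
    json_path.write_text(
        json.dumps(document.export_to_dict(), ensure_ascii=False),
        encoding="utf-8",
    )
    return md_path, json_path
```

Keeping both files side by side lets a pipeline embed the Markdown while resolving citations or table cells against the JSON.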
Native integrations listed in the README include LangChain, LlamaIndex, Crew AI, and Haystack. For LangChain specifically, follow the upstream LangChain documentation for the Docling integration (DoclingLoader).
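For the LangChain route, a minimal sketch might look like the following. It assumes the separate langchain-docling package is installed and that DoclingLoader accepts a file_path argument, as in the upstream integration docs at the time of writing. The preview helper is hypothetical and only relies on each loaded item having a page_content attribute.

```python
def load_with_docling(file_path: str):
    """Load a file or URL into LangChain documents via DoclingLoader."""
    # Imported lazily; requires the langchain-docling package (assumption).
    from langchain_docling import DoclingLoader

    loader = DoclingLoader(file_path=file_path)
    return loader.load()


def preview(docs, width: int = 80) -> list[str]:
    """Return the first `width` characters of each document on a single line."""
    return [doc.page_content[:width].replace("\n", " ") for doc in docs]
```

Running preview(load_with_docling(path)) is a quick way to sanity-check what the loader produced before wiring it into a vector store.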
How this pairs with other Paath.online guides
If you are comparing ingestion tools, read our earlier Docling library overview and MarkItDown article. Docling leans toward structure-preserving pipelines, while MarkItDown leans toward fast Markdown conversion. For retrieval design, see our guide on hybrid search + RRF.