Docling in 2026: getting documents ready for generative AI

By Mohit Agarwal, Paath.online · 12 min read

Docling is an open-source toolkit that parses real-world files (PDFs, Office documents, HTML, audio, and more) into a structured representation you can feed into RAG, agents, and evaluation pipelines. The project lives under the docling-project organisation on GitHub, is hosted by the LF AI & Data Foundation, and ships an MIT-licensed codebase. Its technical report is available as arXiv:2408.09869.

What Docling parses (primary source)

According to the project README, Docling handles multiple formats—including PDF, DOCX, PPTX, XLSX, HTML, WAV, MP3, WebVTT, images, LaTeX, and plain text—with advanced PDF understanding (layout, reading order, tables, code, formulas, image classification). It also supports domain XML such as USPTO patents, JATS articles, and XBRL financial reports.

Full format tables and concepts are maintained in the official documentation: docling-project.github.io/docling.
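As a minimal sketch of the basic conversion flow, the snippet below wraps Docling's DocumentConverter in a small function and adds a cheap extension pre-check. The LIKELY_SUPPORTED set and the looks_supported helper are illustrative assumptions based on the formats named above, not Docling's authoritative list; the file name in the usage comment is hypothetical.

```python
from pathlib import Path

# Illustrative subset of extensions based on the formats named in this article.
# This set is an assumption, not Docling's authoritative format list.
LIKELY_SUPPORTED = {
    ".pdf", ".docx", ".pptx", ".xlsx", ".html",
    ".wav", ".mp3", ".vtt", ".png", ".jpg", ".tex", ".txt",
}


def looks_supported(path: str) -> bool:
    """Cheap pre-check on the file extension before invoking the converter."""
    return Path(path).suffix.lower() in LIKELY_SUPPORTED


def convert_to_markdown(source: str) -> str:
    """Convert a local path or URL to Markdown with Docling."""
    # Imported lazily so looks_supported() works without Docling installed.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    result = converter.convert(source)
    return result.document.export_to_markdown()


# Usage (requires `pip install docling`; "report.pdf" is a hypothetical file):
# if looks_supported("report.pdf"):
#     print(convert_to_markdown("report.pdf")[:500])
```

The pre-check only filters obvious mismatches; the documentation's format tables remain the reference for what your installed version actually accepts.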

What's new in recent releases (README "What's new")

  • Heron — a new default layout model for faster PDF parsing.
  • MCP server — connect Docling to MCP-compatible agents (see the usage page linked from the README).
  • GraniteDocling — optional visual language model path documented on Hugging Face at ibm-granite/granite-docling-258M.
  • Also listed: beta structured information extraction, WebVTT and LaTeX parsing, chart understanding, and more. Always verify the exact version you install against the changelog in the repo.

Python 3.10+ is required as of Docling 2.70.0 (per README note on dropped Python 3.9 support).
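If you want a script to fail early on an unsupported interpreter, a small guard like the following sketch can help. The python_ok helper and MIN_PYTHON constant are hypothetical names introduced here; the (3, 10) threshold comes from the README note cited above.

```python
import sys

MIN_PYTHON = (3, 10)  # minimum version per the README note cited above


def python_ok(version=None) -> bool:
    """Return True if the given (or running) interpreter meets the minimum."""
    version = sys.version_info if version is None else version
    # Compare only (major, minor); patch level does not matter here.
    return tuple(version[:2]) >= MIN_PYTHON


# Usage: call near the top of an ingestion script.
# if not python_ok():
#     raise SystemExit("This pipeline needs Python 3.10 or newer.")
```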

Exports and integrations for RAG

Docling's unified DoclingDocument can be exported to Markdown, HTML, WebVTT, DocTags (see arXiv:2503.11576), and lossless JSON—useful when you need citations, tables, or downstream chunkers that respect structure.
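The export step can be sketched as one helper that writes several formats side by side. It assumes the document object exposes export_to_markdown(), export_to_html(), and export_to_dict(), which is how I understand the DoclingDocument API; check these names against the version you install. The export_all function and its parameters are hypothetical.

```python
import json
from pathlib import Path


def export_all(doc, out_dir: str, stem: str = "document") -> dict:
    """Write Markdown, HTML, and JSON exports of a converted document.

    `doc` is assumed to be a DoclingDocument (e.g. `result.document`)
    or any object offering the same three export methods.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = {
        "markdown": out / f"{stem}.md",
        "html": out / f"{stem}.html",
        "json": out / f"{stem}.json",
    }
    paths["markdown"].write_text(doc.export_to_markdown(), encoding="utf-8")
    paths["html"].write_text(doc.export_to_html(), encoding="utf-8")
    # The dict export is serialized here so the JSON file keeps full structure.
    paths["json"].write_text(
        json.dumps(doc.export_to_dict(), ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    return paths
```

Keeping the JSON export next to the Markdown is a reasonable default for RAG work: Markdown is convenient for prompts, while the JSON retains the structure needed for citations and table-aware chunking.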

Native integrations listed in the README include LangChain, LlamaIndex, Crew AI, and Haystack. For LangChain specifically, follow the upstream LangChain integration docs for Docling, which cover DoclingLoader.
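For the LangChain route, a minimal sketch might look like the following. It assumes the langchain-docling package is installed and that DoclingLoader accepts a file_path argument, as in the upstream integration docs. The preview helper and the "source" metadata key lookup are illustrative assumptions, so the helper falls back to "unknown" when the key is absent.

```python
def load_with_docling(file_paths):
    """Load files into LangChain Document objects via Docling."""
    # Lazy import: requires `pip install langchain-docling`.
    from langchain_docling import DoclingLoader

    loader = DoclingLoader(file_path=file_paths)
    return loader.load()


def preview(docs, width: int = 80):
    """Return (source, snippet) pairs for a quick look at loaded documents."""
    rows = []
    for doc in docs:
        source = doc.metadata.get("source", "unknown")
        # Collapse whitespace so each snippet fits on one line.
        snippet = " ".join(doc.page_content.split())[:width]
        rows.append((source, snippet))
    return rows


# Usage ("report.pdf" is a hypothetical file):
# for source, snippet in preview(load_with_docling(["report.pdf"])):
#     print(source, "|", snippet)
```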

How this pairs with other Paath.online guides

If you are comparing ingestion tools, read our earlier Docling library overview and our MarkItDown article. Docling leans toward structure-preserving pipelines, while MarkItDown leans toward fast Markdown conversion. For retrieval design, see our guide to hybrid search + RRF.

About the instructor

Mohit Agarwal teaches live Python and AI classes at Paath.online. Sessions focus on beginners and students: clear explanations, debugging practice, and project-based learning for school, university, and career goals.

Instruction is available in English or Hindi. Topics include Python fundamentals, NumPy & Pandas, machine learning basics, RAG, and applied AI workflows.
