Dataset Generation using Docling and LlamaIndex

Ahmed6 · 05-05-2026

Introduction

As large language models (LLMs) continue to evolve, the differentiating factor for most enterprise AI workloads is no longer the model architecture itself, but the quality and structure of the data used to train or ground these systems. Whether the goal is finetuning, retrieval-augmented generation (RAG), or any other AI-driven workflow, a well-prepared dataset built from reliable source documents is foundational.

In this post, we walk through an end-to-end approach for extracting, processing, and preparing finetuning datasets, using Docling for document extraction and LlamaIndex for orchestration and dataset generation. The pipeline calls a remote model endpoint to generate JSONL files suitable for finetuning, which can then be consumed by Llama Factory, Unsloth, or other downstream AI pipelines.

Note: In a follow-up post, we will explore scaling this pipeline using Celery job queues with chain and chord patterns for distributed processing at enterprise scale.

Why Data Preparation Is Critical for AI Workflows

Most AI initiatives struggle not because of model limitations, but because of:

Unstructured or noisy source documents
Poor semantic chunking
Inconsistent labeling or formatting
Lack of traceability between source and generated data

For workflows such as:

Model finetuning
Instruction tuning
RAG pipelines
Evaluation dataset creation

the training or grounding data must be:

Clean and contextually coherent
Machine readable
Consistently structured (e.g., JSONL)

This is where Docling and LlamaIndex complement each other. Together they address data quality at the extraction, processing, and generation stages, so that the final training dataset is consistently structured and traceable back to its source documents.

Docling: Structured Extraction from Source Documents

What Is Docling?

Docling is a document processing framework designed to extract structured content from a variety of source formats such as:

PDF
Word documents
HTML
Markdown

Unlike basic text extraction tools, Docling focuses on preserving document semantics, including:

Headings and hierarchy
Tables
Paragraph boundaries
Metadata

This makes it particularly well suited to creating high-quality AI datasets.

Using Docling for Data Extraction

In our pipeline, Docling serves as the first transformation layer:

Ingest raw documents from enterprise repositories or data sources
Extract structured representations (sections, headings, tables, content blocks)
Normalize content into a format that downstream tools can reason about

The output from Docling is not just plain text, but a structured Markdown artifact that retains the logical layout of the original document, which is a key requirement for high-quality finetuning and RAG.
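As a minimal sketch of this step, the snippet below converts one document with Docling's `DocumentConverter` and exports it to Markdown, then splits that Markdown into sections by heading. The `split_sections` helper and the `report.pdf` path are hypothetical names introduced for illustration, not part of Docling or of the pipeline described here.

```python
import re
from typing import Dict, List


def extract_markdown(source: str) -> str:
    """Convert one document with Docling and return it as Markdown."""
    # Imported lazily so the helper below can be used without Docling installed.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    result = converter.convert(source)  # local path or URL
    return result.document.export_to_markdown()


HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_sections(markdown: str) -> List[Dict]:
    """Hypothetical helper: group Markdown lines under their nearest heading."""
    sections: List[Dict] = []
    current = {"level": 0, "heading": "", "lines": []}

    def flush() -> None:
        body = "\n".join(current["lines"]).strip()
        # Skip the implicit leading section if it has neither heading nor text.
        if current["heading"] or body:
            sections.append(
                {"level": current["level"], "heading": current["heading"], "body": body}
            )

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            current = {
                "level": len(match.group(1)),
                "heading": match.group(2).strip(),
                "lines": [],
            }
        else:
            current["lines"].append(line)
    flush()
    return sections


if __name__ == "__main__":
    # "report.pdf" is a placeholder path.
    for section in split_sections(extract_markdown("report.pdf")):
        print(section["level"], section["heading"])
```

The heading split is deliberately naive: a line starting with `#` inside a fenced code block would be treated as a heading, so a real pipeline would likely rely on Docling's own document structure or a proper Markdown parser instead.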

LlamaIndex: Orchestrating Processing and Dataset Creation

Why LlamaIndex?

LlamaIndex provides a powerful abstraction layer for:

Document chunking
Metadata enrichment
Prompt-driven content generation
Integration with local or remote LLMs

When paired with Docling, it acts as the bridge between extracted content and model-ready datasets.

Processing Flow with LlamaIndex

Once documents are extracted by Docling, LlamaIndex is used to:

Ingest structured document nodes
Apply intelligent chunking strategies
- Size-based
- Semantic
- Section-aware
Attach metadata
- Source document ID
- Section headers
- Page numbers

This structured ingestion ensures that the resulting dataset maintains a strong link between the original source and the generated training examples.
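To make the chunking and metadata steps concrete, here is a library-free sketch of the size-based strategy with overlap, attaching the source document ID, section header, and page number to every chunk. In the pipeline itself this work is done by LlamaIndex; the function name, the word-based sizing, and the metadata keys below are assumptions chosen for illustration.

```python
from typing import Dict, List, Optional


def chunk_text(
    text: str,
    doc_id: str,
    section: str,
    page: Optional[int] = None,
    chunk_size: int = 200,
    overlap: int = 40,
) -> List[Dict]:
    """Split text into overlapping word windows, each carrying source metadata."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap
    chunks: List[Dict] = []
    for index, start in enumerate(range(0, len(words), step)):
        window = words[start:start + chunk_size]
        chunks.append(
            {
                "text": " ".join(window),
                # Metadata keys are illustrative; they mirror the list above.
                "metadata": {
                    "doc_id": doc_id,
                    "section": section,
                    "page": page,
                    "chunk_index": index,
                },
            }
        )
        # Stop once this window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


if __name__ == "__main__":
    sample = "The system is responsible for handling document ingestion and validation"
    for chunk in chunk_text(sample, doc_id="doc-001", section="Overview",
                            page=1, chunk_size=6, overlap=2):
        print(chunk["metadata"]["chunk_index"], chunk["text"])
```

Because every chunk carries its own metadata, any record generated from it later can be traced back to a document, section, and page without re-running extraction.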

Generating JSONL with a Remote Model Endpoint

Why JSONL?

JSON Lines (JSONL) is a de facto standard format for:

Finetuning datasets
Instruction-response pairs
Model evaluation inputs

Each line represents a self-contained training sample, making it scalable and easy to stream.
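The streaming property is easy to see in code. The sketch below reads a JSONL stream one record at a time with a generator, so memory use does not grow with the size of the file; `iter_jsonl` is a hypothetical helper name, and the sample records are made up for the example.

```python
import io
import json
from typing import Dict, Iterator, TextIO


def iter_jsonl(stream: TextIO) -> Iterator[Dict]:
    """Yield one parsed record per non-empty line, without loading the whole file."""
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_number} is not valid JSON: {exc}") from exc


if __name__ == "__main__":
    sample = io.StringIO(
        '{"instruction": "Summarize.", "input": "Text A", "output": "Summary A"}\n'
        '{"instruction": "Summarize.", "input": "Text B", "output": "Summary B"}\n'
    )
    for record in iter_jsonl(sample):
        print(record["input"], "->", record["output"])
```

The same function works on a file opened with `open(path, encoding="utf-8")`, since it only needs an iterable of lines.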

Remote Model Integration

The pipeline can also run against local models, but this setup uses a remote LLM endpoint, which allows:

Centralized model governance
Versioned model access
Easier scaling and experimentation

Using LlamaIndex's integration capabilities, prompts are sent to the remote endpoint to generate structured outputs such as:

{ "instruction": "Summarize the key responsibilities described in the section.", "input": "The system is responsible for handling document ingestion and validation...", "output": "The system manages document ingestion, validation, and preprocessing for downstream AI workflows." }

Each generated response is appended as a single JSON object per line, resulting in a model-ready JSONL dataset.
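A minimal sketch of this generate-and-append loop follows. The remote endpoint is represented by a plain `complete` callable that takes a prompt and returns text, standing in for whichever LlamaIndex LLM client the pipeline is configured with. The prompt wording, the function names, and the extra `source` field used for traceability are all assumptions, not the pipeline's actual implementation.

```python
import json
from pathlib import Path
from typing import Callable, Dict, Optional

# Illustrative prompt; the real pipeline's prompt is not shown in this post.
PROMPT_TEMPLATE = (
    "Read the passage and write one training example as a JSON object with "
    'the keys "instruction", "input", and "output".\n\nPassage:\n{passage}'
)
REQUIRED_KEYS = ("instruction", "input", "output")


def generate_record(
    passage: str, metadata: Dict, complete: Callable[[str], str]
) -> Optional[Dict]:
    """Ask the model for one example; return None if the reply is unusable."""
    raw = complete(PROMPT_TEMPLATE.format(passage=passage))
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict) or any(key not in record for key in REQUIRED_KEYS):
        return None
    # Assumption: keep chunk metadata on the record so it can be traced back.
    # Drop this field if the training tool expects only the three keys above.
    record["source"] = metadata
    return record


def append_jsonl(record: Dict, path: str) -> None:
    """Append one record as a single line of JSON."""
    with Path(path).open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    def fake_complete(prompt: str) -> str:
        # Stand-in for the remote endpoint so the example runs offline.
        return json.dumps(
            {"instruction": "Summarize.", "input": "Some passage.", "output": "A summary."}
        )

    example = generate_record("Some passage.", {"doc_id": "doc-001"}, fake_complete)
    if example is not None:
        append_jsonl(example, "dataset.jsonl")
```

Validating the reply before writing keeps malformed model output out of the dataset, which matters because a single bad line can break a downstream JSONL loader.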

End-to-End Workflow Summary

Putting it all together, the finetuning data preparation pipeline looks like this:

Source documents ingestion
Structured extraction using Docling
Document chunking and enrichment with LlamaIndex
Prompt-driven content generation using a remote LLM
Export of structured JSONL datasets

This modular approach ensures the pipeline is reproducible, auditable, scalable, and model-agnostic.
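One way to picture the modularity is a driver that takes each stage as a plain callable, so that the extractor, chunker, or model can be swapped without touching the others. This is a sketch under that assumption; `run_pipeline` and the stage signatures are hypothetical, and the toy stages in the usage example only stand in for Docling, LlamaIndex, and the remote endpoint.

```python
from typing import Callable, Dict, Iterable, Iterator, Optional


def run_pipeline(
    sources: Iterable[str],
    extract: Callable[[str], str],
    chunk: Callable[[str, str], Iterable[Dict]],
    generate: Callable[[Dict], Optional[Dict]],
) -> Iterator[Dict]:
    """Stream records through extract -> chunk -> generate, one source at a time."""
    for source in sources:
        markdown = extract(source)            # stage 2: structured extraction
        for piece in chunk(source, markdown): # stage 3: chunking and enrichment
            record = generate(piece)          # stage 4: prompt-driven generation
            if record is not None:            # skip chunks the model failed on
                yield record                  # stage 5: caller writes JSONL


if __name__ == "__main__":
    # Toy stages so the example runs without any external service.
    records = run_pipeline(
        ["policy.pdf"],
        extract=lambda path: "# Overview\nThe system validates documents.",
        chunk=lambda path, md: [{"doc_id": path, "text": md}],
        generate=lambda piece: {
            "instruction": "Summarize.",
            "input": piece["text"],
            "output": "(model output)",
            "doc_id": piece["doc_id"],
        },
    )
    for record in records:
        print(record["doc_id"], record["instruction"])
```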

Key Benefits of This Approach

Higher-quality finetuning data, because extraction preserves document structure
Fewer hallucinations in RAG and downstream workflows, because chunks stay coherent and attributable to a source
Clear traceability from JSONL records back to source documents
Flexibility to swap models or prompts without re-extracting data

Conclusion

High-performing AI systems start with high-quality data pipelines. By combining Docling's structured document extraction with LlamaIndex's orchestration and a remote model endpoint, we can build a robust, scalable, and reusable process for generating finetuning and RAG datasets in JSONL format.

This approach can improve model performance, and it reduces operational friction when iterating on AI workflows because models and prompts can change without repeating extraction.

The backend extraction runs on a Celery-based job queue, which uses Celery's chain and chord primitives to coordinate extraction and processing.

Resources

Docling https://github.com/DS4SD/docling

LlamaIndex https://www.llamaindex.ai/

Celery https://docs.celeryq.dev/