Introduction
As large language models (LLMs) continue to evolve, the differentiating factor for most enterprise AI workloads is no longer the model architecture itself, but the quality and structure of the data used to train or ground these systems. Whether the goal is finetuning, retrieval-augmented generation (RAG), or any other AI-driven workflow, a well-prepared dataset built from reliable source documents is foundational.
In this post, we walk through an end-to-end approach for extracting, processing, and preparing finetuning datasets, using Docling for document extraction and LlamaIndex for orchestration and dataset generation. The pipeline calls a remote model endpoint to produce JSONL files suitable for finetuning, which can be consumed by Llama Factory, Unsloth, or other downstream AI pipelines.
Note: In a follow-up post, we will explore scaling this pipeline using Celery job queues with chain and chord patterns for distributed processing at enterprise scale.
Why Data Preparation Is Critical for AI Workflows
Most AI initiatives struggle not because of model limitations, but because of:
- Unstructured or noisy source documents
- Poor semantic chunking
- Inconsistent labeling or formatting
- Lack of traceability between source and generated data
For workflows such as:
- Model finetuning
- Instruction tuning
- RAG pipelines
- Evaluation dataset creation
the training or grounding data must be:
- Clean and contextually coherent
- Machine readable
- Consistently structured (e.g., JSONL)
This is where Docling and LlamaIndex complement each other effectively. The combination addresses data quality at the extraction, processing, and generation stages, so that the final training dataset is production-ready and traceable back to its sources.
Docling: Structured Extraction from Source Documents
What Is Docling?
Docling is a document processing framework designed to extract structured content from a variety of source formats such as:
- Word documents
- HTML
- Markdown
Unlike basic text extraction tools, Docling focuses on preserving document semantics, including:
- Headings and hierarchy
- Tables
- Paragraph boundaries
- Metadata
This makes it particularly well suited to creating high-quality AI datasets.
Using Docling for Data Extraction
In our pipeline, Docling serves as the first transformation layer:
- Ingest raw documents from enterprise repositories or data sources
- Extract structured representations (sections, headings, tables, content blocks)
- Normalize content into a format that downstream tools can reason about
The output from Docling is not just plain text but a structured, Markdown-style artifact that retains the logical layout of the original document, which is a key requirement for high-quality finetuning and RAG.
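As a rough illustration of this extraction step, the sketch below converts one source file to Markdown with Docling's `DocumentConverter` and writes the result to disk. The function names, the `extracted` output directory, and the one-file-in, one-file-out layout are assumptions made for this example, not details of the pipeline described here.

```python
from pathlib import Path


def markdown_path_for(source: str, out_dir: str) -> Path:
    """Map a source document path to the Markdown file that will hold its extraction."""
    return Path(out_dir) / (Path(source).stem + ".md")


def extract_to_markdown(source: str, out_dir: str = "extracted") -> Path:
    """Convert one document with Docling and save the Markdown export."""
    # Imported lazily so the path helper above works without Docling installed.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    result = converter.convert(source)
    markdown = result.document.export_to_markdown()

    target = markdown_path_for(source, out_dir)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(markdown, encoding="utf-8")
    return target
```

Calling `extract_to_markdown("docs/policy.docx")` would, under these assumptions, write `extracted/policy.md`, which later stages can then chunk.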
LlamaIndex: Orchestrating Processing and Dataset Creation
Why LlamaIndex?
LlamaIndex provides a powerful abstraction layer for:
- Document chunking
- Metadata enrichment
- Prompt-driven content generation
- Integration with local or remote LLMs
When paired with Docling, it acts as the bridge between extracted content and model-ready datasets.
Processing Flow with LlamaIndex
Once documents are extracted by Docling, LlamaIndex is used to:
- Ingest structured document nodes
- Apply intelligent chunking strategies
  - Size-based
  - Semantic
  - Section-aware
- Attach metadata
  - Source document ID
  - Section headers
  - Page numbers
This structured ingestion ensures that the resulting dataset maintains a strong link between the original source and the generated training examples.
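To make the section-aware strategy concrete, here is a minimal standard-library sketch that splits Markdown on headings, falls back to size-based splitting for long sections, and attaches source metadata to every chunk. It is a stand-in for illustration only: LlamaIndex ships its own node parsers (for example `MarkdownNodeParser` and `SentenceSplitter` in `llama_index.core.node_parser`) that would normally do this job. The `Chunk` class, the metadata keys, and the `max_chars` default are assumptions, and page numbers are left out because plain Markdown does not carry them.

```python
import re
from dataclasses import dataclass, field

# Matches Markdown ATX headings such as "# Title" or "### Subsection".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def chunk_by_section(markdown: str, source_id: str, max_chars: int = 1000) -> list[Chunk]:
    """Split Markdown into chunks that never cross a heading boundary."""
    chunks: list[Chunk] = []
    header = ""  # heading of the section currently being collected
    buffer: list[str] = []

    def flush() -> None:
        body = "\n".join(buffer).strip()
        buffer.clear()
        # Size-based fallback: cut long sections into max_chars pieces.
        for start in range(0, len(body), max_chars):
            chunks.append(
                Chunk(
                    text=body[start:start + max_chars],
                    metadata={"source_id": source_id, "section_header": header},
                )
            )

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # close the previous section before starting a new one
            header = match.group(2).strip()
        else:
            buffer.append(line)
    flush()
    return chunks
```

Because each chunk carries its `source_id` and `section_header`, any training example generated from it can be traced back to the section it came from.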
Generating JSONL with a Remote Model Endpoint
Why JSONL?
JSON Lines (JSONL) is the de facto standard format for:
- Finetuning datasets
- Instruction-response pairs
- Model evaluation inputs
Each line represents a self-contained training sample, making it scalable and easy to stream.
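The streaming property can be illustrated with a small reader that yields one record at a time instead of loading the whole file. This is a sketch using only the Python standard library; the function name `iter_jsonl` is hypothetical.

```python
import json
from typing import Iterator


def iter_jsonl(path: str) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                yield json.loads(line)
            except json.JSONDecodeError as error:
                # Report the offending line so bad samples are easy to locate.
                raise ValueError(f"{path}:{line_number}: invalid JSON") from error
```

Since the generator holds only one line in memory at a time, the same loop works for a few hundred samples or for a dataset that does not fit in RAM.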
Remote Model Integration
The pipeline can also run against local models, but this setup uses a remote LLM endpoint, which allows:
- Centralized model governance
- Versioned model access
- Easier scaling and experimentation
Using LlamaIndex's integration capabilities, prompts are sent to the remote endpoint to generate structured outputs such as:
{ "instruction": "Summarize the key responsibilities described in the section.", "input": "The system is responsible for handling document ingestion and validation...", "output": "The system manages document ingestion, validation, and preprocessing for downstream AI workflows." }
Each generated response is appended as a single JSON object per line, resulting in a model-ready JSONL dataset.
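A minimal sketch of this generate-and-append step follows. The model call is passed in as a plain `complete` callable that takes a prompt and returns text; with LlamaIndex this could be a thin wrapper around an LLM object's `complete` method that returns the response's `.text`, but that wiring is an assumption here. The prompt wording, the function names, and the extra `source_id` field are also hypothetical, and some trainers may expect that extra field to be stripped before finetuning.

```python
import json
from typing import Callable

# Hypothetical prompt; the real pipeline's prompt is not shown in this post.
PROMPT_TEMPLATE = (
    "Read the section below and write one training example as a JSON object "
    'with the keys "instruction", "input", and "output".\n\n'
    "Section:\n{section}"
)

REQUIRED_KEYS = {"instruction", "input", "output"}


def generate_record(section_text: str, complete: Callable[[str], str], source_id: str) -> dict:
    """Ask the model for one training example and validate its shape."""
    raw = complete(PROMPT_TEMPLATE.format(section=section_text))
    record = json.loads(raw)  # raises if the model did not return valid JSON
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        raise ValueError(f"model output is missing keys: {sorted(missing)}")
    record["source_id"] = source_id  # assumed traceability field
    return record


def append_jsonl(record: dict, path: str) -> None:
    """Append one record to the dataset as a single JSON line."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Validating the keys before writing keeps malformed model responses out of the dataset, and injecting `complete` makes it possible to swap endpoints or test with a stub.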
End-to-End Workflow Summary
Putting it all together, the finetuning data preparation pipeline looks like this:
- Source document ingestion
- Structured extraction using Docling
- Document chunking and enrichment with LlamaIndex
- Prompt-driven content generation using a remote LLM
- Export of structured JSONL datasets
This modular approach ensures the pipeline is reproducible, auditable, scalable, and model-agnostic.
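One way to picture this modularity is a driver that receives each stage as a function, so that the extractor, chunker, or model can be replaced independently. The sketch below is an assumption about how the stages might be composed, not the pipeline's actual code; `run_pipeline`, its parameters, and the `source` and `chunk_index` fields are hypothetical.

```python
import json
from typing import Callable, Iterable


def run_pipeline(
    sources: Iterable[str],
    extract: Callable[[str], str],           # e.g. a Docling-based extractor
    chunk: Callable[[str], Iterable[str]],   # e.g. a LlamaIndex node parser wrapper
    generate: Callable[[str], dict],         # e.g. a call to the remote LLM
    out_path: str,
) -> int:
    """Run extract -> chunk -> generate for every source and write JSONL."""
    written = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for source in sources:
            markdown = extract(source)
            for index, piece in enumerate(chunk(markdown)):
                record = dict(generate(piece))
                # Traceability fields linking the record back to its origin.
                record["source"] = source
                record["chunk_index"] = index
                out.write(json.dumps(record, ensure_ascii=False) + "\n")
                written += 1
    return written
```

Because the stages only meet at simple types (a path, a Markdown string, a dict), swapping the model or the prompt means passing a different `generate` function while the extracted Markdown is reused as is.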
Key Benefits of This Approach
- Higher-quality finetuning data due to semantic extraction
- Lower risk of hallucinations in RAG and downstream workflows, because the model is grounded in cleaner source text
- Clear traceability from JSONL records back to source documents
- Flexibility to swap models or prompts without re-extracting data
Conclusion
High-performing AI systems start with high-quality data pipelines. By combining Docling's structured document extraction with LlamaIndex's orchestration and a remote model endpoint, we can build a robust, scalable, and reusable process for generating finetuning and RAG datasets in JSONL format.
This approach can improve model performance and also reduces operational friction when iterating on AI workflows.
Backend extraction runs on a Celery-based job queue that uses Celery's chain and chord primitives to coordinate extraction and processing.
Resources
Docling https://github.com/DS4SD/docling
LlamaIndex https://www.llamaindex.ai/
Celery https://docs.celeryq.dev/