Executive Summary
Our comprehensive evaluation of Docling, Unstructured, and LlamaParse reveals Docling as the superior framework for extracting structured data from sustainability reports, with 97.9% accuracy in complex table extraction and excellent text fidelity. While LlamaParse offers impressive processing speed (consistently ~6 seconds regardless of document size) and Unstructured provides strong OCR capabilities (achieving 100% accuracy on simple tables but only 75% on complex structures), Docling's balanced performance makes it ideal for enterprise sustainability analytics pipelines.
Key Takeaways:
Docling: Best overall accuracy and structure preservation (97.9% table cell accuracy)
LlamaParse: Fastest processing (6 seconds per document regardless of size)
Unstructured: Strong OCR but slowest processing (51-141 seconds depending on page count)
Table of Contents
- Introduction
- Overview of PDF Processing Frameworks
- Methodology and Evaluation Criteria
- Report Selection & Justification
- Results & Discussion
- Conclusion
1. Introduction
In today’s data-driven sustainability landscape, organizations face a critical challenge: how to efficiently transform unstructured sustainability reports into structured, machine-readable data for analysis. At Procycons, where we bridge the gap between sustainability goals and digital transformation, we understand that accurate extraction of sustainability data is the foundation for effective ESG analytics, reporting automation, and climate strategy development.
PDF documents remain the standard format for sustainability reporting, but their unstructured nature creates a significant barrier to automation. Extracting structured information—from complex emissions tables to technical climate disclosures—requires sophisticated processing frameworks that can maintain both content accuracy and structural fidelity.
In this benchmark study, we evaluate three leading PDF processing frameworks: Docling, Unstructured, and LlamaParse. Our goal is to determine which solution best serves the unique challenges of sustainability document processing:
- Preserving the accuracy of critical numerical ESG data
- Maintaining hierarchical structure of sustainability frameworks
- Correctly extracting complex multi-level tables containing emissions, resource usage, and other metrics
- Delivering performance that scales to enterprise document volumes
This evaluation forms a crucial component of our work at Procycons, where we develop RAG (Retrieval-Augmented Generation) systems and knowledge graphs that transform sustainability reporting from a manual process into an automated, AI-enhanced workflow. By optimizing the document processing foundation, we enable more accurate downstream applications for sustainability benchmarking, automated ESG reporting, and climate strategy development.
2. Overview of PDF Processing Frameworks
2.1. Docling
Docling is an open-source document processing framework developed by DS4SD (IBM Research) to facilitate the extraction and transformation of text, tables, and structural elements from PDFs. It leverages advanced AI models, including DocLayNet for layout analysis and TableFormer for table structure recognition. Docling is widely used in AI-driven document analysis, enterprise data processing, and research applications, designed to run efficiently on local hardware while supporting integrations with generative AI ecosystems.
2.2. Unstructured
Unstructured is a document processing platform designed to extract and transform complex enterprise data from various formats, including PDFs, DOCX, and HTML. It applies OCR and Transformer-based NLP models for text and table extraction. Offering both open-source and API-based solutions, Unstructured is frequently used in AI-driven content enrichment, legal document parsing, and data pipeline automation. It is actively maintained by Unstructured.io, a company specializing in enterprise AI solutions.
2.3. LlamaParse
LlamaParse is a lightweight NLP-based framework designed for extracting structured data from documents, particularly PDFs. It integrates Llama-based NLP pipelines for text parsing and structural recognition. While it performs well on simple documents, it struggles with complex layouts, making it more suited for lightweight applications such as research and small-scale document processing tasks.
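The three frameworks expose broadly similar entry points. Below is a minimal sketch of how each might be invoked from Python, based on their documented APIs at the time of writing (exact signatures may differ across versions, and LlamaParse additionally requires a `LLAMA_CLOUD_API_KEY` environment variable). Imports are deferred into each function so that only the library actually used needs to be installed:

```python
def parse_with_docling(pdf_path: str) -> str:
    """Convert a PDF to Markdown with Docling's DocumentConverter."""
    from docling.document_converter import DocumentConverter
    result = DocumentConverter().convert(pdf_path)
    return result.document.export_to_markdown()


def parse_with_unstructured(pdf_path: str) -> str:
    """Partition a PDF into elements and join their text content."""
    from unstructured.partition.pdf import partition_pdf
    elements = partition_pdf(filename=pdf_path)
    return "\n\n".join(el.text for el in elements)


def parse_with_llamaparse(pdf_path: str) -> str:
    """Parse a PDF to Markdown via the LlamaParse cloud API
    (requires the LLAMA_CLOUD_API_KEY environment variable)."""
    from llama_parse import LlamaParse
    documents = LlamaParse(result_type="markdown").load_data(pdf_path)
    return "\n\n".join(doc.text for doc in documents)
```

Note the architectural difference this surfaces: Docling and Unstructured run locally, while LlamaParse sends documents to a hosted service, which is relevant when sustainability reports contain pre-publication data.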
3. Methodology and Evaluation Criteria
To conduct a fair and comprehensive evaluation of PDF processing for sustainability report extraction, we analyzed the following key metrics:
- Text Extraction Accuracy: Ensures extracted text is correct and formatted properly, as errors impact downstream data integrity.
- Table Detection & Extraction: Critical for sustainability reports with tabular data, assessing correct identification and extraction of tables.
- Section Structure Accuracy: Evaluates retention of document hierarchy for readability and usability.
- ToC Accuracy: Measures the ability to reconstruct a table of contents for improved navigation.
- Processing Speed Comparison: Assesses the time taken to process PDFs of varying lengths, providing insights into efficiency and scalability.
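The first two criteria can be scored mechanically; the sketch below illustrates the idea using only Python's standard library (our actual scoring also involved manual review, so these helpers are illustrative rather than the exact procedure used):

```python
import difflib


def text_accuracy(reference: str, extracted: str) -> float:
    """Similarity ratio (0..1) between ground-truth and extracted text."""
    return difflib.SequenceMatcher(None, reference, extracted).ratio()


def table_cell_accuracy(reference: list[list[str]],
                        extracted: list[list[str]]) -> float:
    """Fraction of ground-truth cells reproduced at the same position.

    Matching is positional, so both omissions and misplaced values
    count as errors, not just missing text.
    """
    total = sum(len(row) for row in reference)
    correct = 0
    for i, row in enumerate(reference):
        for j, cell in enumerate(row):
            if i < len(extracted) and j < len(extracted[i]) \
                    and extracted[i][j] == cell:
                correct += 1
    return correct / total if total else 0.0
```

Positional matching matters: a framework can extract every number in a table yet score poorly if columns are shifted, which is a failure mode this benchmark specifically probes.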
4. Report Selection & Justification
We selected five corporate reports for benchmarking to evaluate the performance of Docling, Unstructured, and LlamaParse across a diverse set of document characteristics.
| Report Name | Pages | Number of Words | Number of Tables | Complexity Characteristics |
|---|---|---|---|---|
| Bayer Sustainability Report 2023 (Short) | 52 | 34,104 | 32 | Multi-column text, embedded charts, detailed Table of Contents (ToC) |
| DHL 2023 | 13 | 5,955 | 5 | Single-column text, embedded charts |
| Pfizer 2023 | 11 | 3,293 | 6 | Not specified (assumed basic layout, possibly single-column) |
| Takeda 2023 | 14 | 4,356 | 8 | Multi-column text, embedded charts, detailed ToC |
| UPS 2023 | 9 | 4,486 | 3 | Detailed ToC |
These reports were chosen for their diversity in layout, text styles, and table structures. To ensure a fair comparison, we shortened the reports where necessary (e.g., selecting specific page ranges for Pfizer, Takeda, and UPS) to include different types of tables (simple, multi-row, merged-cell) and textual content (single-column, multi-column, dense paragraphs, bullet points). This selection allowed us to examine how each framework handles varying document complexities, from presentation-style slides (DHL) to dense corporate reports (Bayer) and scanned excerpts (UPS). The inclusion of diverse topics ensures relevance to multiple industries, while the range of word counts (~3,300 to ~34,000) and table counts (3 to 32) tests scalability and accuracy across document sizes.
5. Results & Discussion
5.1. Metric Summary Table
This comparison table highlights the key performance metrics across all frameworks, providing a quick reference for decision-making in deployment scenarios.
| Metric | Docling | Unstructured | LlamaParse |
|---|---|---|---|
| Text Extraction Accuracy | High accuracy, preserves formatting | Efficient, inconsistent line breaks | Struggles with multi-column, word merging |
| Table Detection & Extraction | Recognizes complex tables well | OCR-based, variable with multi-row | Good for simple, poor with complex |
| Section Structure Accuracy | Clear hierarchical structure | Mostly accurate, some misclassifications | Struggles with section differentiation |
| ToC Generation | Accurate with correct references | Partial, some misalignment | Accurate text, but loses table structure |
| Processing Speed | Moderate (6.28s for 1 page, 65.12s for 50 pages) | Slow (51.06s for 1 page, 141.02s for 50 pages) | Fast (~6s regardless of page count) |
5.2. Technology Behind Each Framework
The table below outlines the specific models and technologies powering each framework’s capabilities.
| Metric | Docling | Unstructured | LlamaParse |
|---|---|---|---|
| Text Extraction | DocLayNet | OCR + Transformer-based NLP | Llama-based NLP pipeline |
| Table Detection | TableFormer | Vision Transformer + OCR | Llama-based table parser |
| Section Structure | DocLayNet + NLP classifiers | Transformer-based classifier | Llama-based text structuring |
| ToC Generation | Layout-based parsing + NLP | OCR + heuristic parsing | Llama-based ToC recognition |
5.3. Detailed Performance Analysis
Below, we compare the outputs of each framework using excerpts from different reports as examples, focusing on text, tables, sections, and ToC.
5.3.1. Text Extraction Performance
The original text from the “Takeda 2023” PDF consists of two dense paragraphs with technical terms and clear paragraph breaks separating the content.

Observations on Text Extraction
Docling:
- Text Accuracy: Achieves 100% accuracy for core content, matching all sentences including the title and both paragraphs.
- Completeness: Captures all original text without omissions while preserving paragraph breaks and structure.
- Text Modifications: Retains original phrasing and technical terms without alteration.
- Formatting Preservation: Maintains paragraph breaks critical for readability and separates the title consistent with original heading style.
LlamaParse:
- Text Accuracy: Achieves high accuracy for original paragraphs but includes additional content not present in the source text.
- Completeness: Adds detailed technical information not part of the sampled section while losing the original paragraph break.
- Text Modifications: Introduces new sentences and data, suggesting over-extraction or hallucination.
- Formatting Preservation: Merges content into a continuous block, reducing readability, though title separation is maintained.
Unstructured:
- Text Accuracy: Extracts title and paragraphs correctly but includes significant additional content not present in the original section.
- Completeness: Adds substantial extra technical details likely pulled from elsewhere in the document.
- Text Modifications: Introduces new technical information without errors in the original content but alters the output scope.
- Formatting Preservation: Merges all content into a single block, losing paragraph breaks and diminishing structural fidelity despite correct title formatting.
5.3.2. Table Extraction Performance
We chose a table from the "bayer-sustainability-report-2023" PDF to analyze the table extraction performance of these platforms (see figure below).
The table provides a breakdown of employees by gender (Women and Men), region (Total, Europe/Middle East/Africa, North America, Asia/Pacific, Latin America), and age group (< 20, 20-29, 30-39, 40-49, 50-59, ≥ 60). The structure is hierarchical:
- Top Level: Gender (Women: 41,562 total; Men: 58,161 total).
- Second Level: Regions under each gender (e.g., Women in Europe/Middle East/Africa: 18,981).
- Third Level: Age groups under each region (e.g., Women in Europe/Middle East/Africa, < 20: 6).

Observations on Data Accuracy
Docling:
- Issue: Misses one data point (“5” for Men in Latin America, < 20) from 48 entries, achieving 97.9% accuracy.
- Impact: Error is isolated and doesn’t affect totals, though compromises age group completeness.
- Strength: All other data, including gender totals, is correctly placed.
LlamaParse:
- Issue: Misplaces “Total” column values, using Latin America totals instead of gender totals.
- Impact: Systemic column misalignment affects entire table interpretation, with 100% data extraction but 0% correct placement.
- Strength: Captures the “5” data point that Docling misses.
Unstructured:
- Issue: Severe column shift error with missing Europe/Middle East/Africa data and shifted regions.
- Impact: Table becomes uninterpretable with 75% cell accuracy (36/48 entries) and 0% accuracy for Latin America data.
- Strength: Some numerical data can be manually remapped to correct regions.
Structural Fidelity
Docling:
- Preserves original column order and hierarchical nesting, maintaining structural faithfulness.
- Correctly handles blank “Total” column for age groups.
LlamaParse:
- Inverts column order with “Total” misplacement, distorting table meaning.
- Lacks hierarchical nesting indicators, secondary to column errors.
Unstructured:
- Suffers from severe column shifts, making regional hierarchy meaningless.
- Partially preserves gender and age group separation but lacks clear nesting indicators.
- Correctly leaves “Total” column blank for age groups, though irrelevant given data misalignment.
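The cell-accuracy figures above follow directly from the 48-entry ground truth in the Bayer table. As a quick check of the arithmetic:

```python
TOTAL_CELLS = 48  # gender totals, regions, and age groups in the Bayer table

# Docling: one missing data point ("5" for Men in Latin America, < 20)
docling_acc = (TOTAL_CELLS - 1) / TOTAL_CELLS * 100
print(round(docling_acc, 1))  # 97.9

# Unstructured: 36 of 48 entries survive the column-shift errors
unstructured_acc = 36 / TOTAL_CELLS * 100
print(round(unstructured_acc, 1))  # 75.0
```

LlamaParse is the edge case these percentages hide: it extracts 100% of the values but misplaces the "Total" column, so under positional scoring its structural accuracy collapses despite complete data capture.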
5.3.3. Section Structure
The section sample from the “UPS 2023” PDF demonstrates how frameworks handle hierarchical document structures, a critical aspect of maintaining document organization. The sample includes a main heading followed by a subheading, with a clear hierarchical relationship indicated by formatting differences in the original document.
Observations on Section Structure
Docling:
- Hierarchy Representation: Uses the same Markdown level (##) for both headings, failing to reflect hierarchical relationship.
- Text Accuracy: Captures exact text of both headings, including capitalization and punctuation.
- Formatting Preservation: Retains original text elements but loses styling differences that distinguish heading levels.
LlamaParse:
- Hierarchy Representation: Uses identical Markdown level (#) for both headings, missing the parent-child structure.
- Text Accuracy: Perfectly captures text of both headings, preserving all textual elements.
- Formatting Preservation: Maintains capitalization and punctuation but cannot reflect PDF-specific styling differences.
Unstructured:
- Hierarchy Representation: Correctly uses different Markdown levels (# for main heading, ## for subheading), properly reflecting the hierarchical relationship.
- Text Accuracy: Perfectly captures text of both headings with all original elements intact.
- Formatting Preservation: Cannot reproduce PDF styling but compensates with appropriate Markdown hierarchy, outperforming other frameworks in structural fidelity.
5.3.4. Table of Contents
The original TOC from the “UPS 2023” PDF features a “Contents” header followed by section entries with page numbers, using a two-column layout with dotted line separators between titles and page numbers.
Observations on Table of Contents
Docling:
- Text Accuracy: Captures all content with 100% accuracy, including titles, page numbers, and punctuation.
- Structural Representation: Uses a Markdown table with two columns, maintaining the separation of titles and page numbers.
- Formatting Preservation: Retains dotted line separators within table cells but marks “Contents” as a subheading (##) rather than a main heading.
LlamaParse:
- Text Accuracy: Achieves 100% accuracy for all text elements, including titles, page numbers, and dotted lines.
- Structural Representation: Implements a bulleted list format with titles and page numbers on same line, preserving logical flow.
- Formatting Preservation: Maintains dotted line separators and correctly marks “Contents” as a main heading (#), aligning with its prominence.
Unstructured:
- Text Accuracy: Severely deficient, capturing only the “Contents” title while omitting all entries and page numbers.
- Structural Representation: Includes an empty Markdown table that fails to represent the original structure or content.
- Formatting Preservation: Marks “Contents” as a subheading (##) and provides no content preservation, resulting in complete structural loss.
5.4. Processing Speed Comparison
One of the most critical factors in evaluating a PDF processing tool for automated document extraction is processing speed—how quickly a tool can extract and structure content from a document. A slow tool can significantly impact workflow efficiency, especially when dealing with large-scale document processing.
To benchmark speed, we used a series of test PDFs generated from a single extracted page. We measured the average elapsed time for LlamaParse, Docling, and Unstructured while processing PDFs with increasing numbers of pages; comparing how each tool copes with growing document length shows which is best suited to structured document extraction at scale. The results reveal significant differences in how each tool handles scalability and performance (see figure below).
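A minimal timing harness along these lines can be written with the standard library (a sketch: `parse_fn` stands in for whichever framework's parse call is under test, and `pdf_paths` for the generated test files):

```python
import time


def benchmark(parse_fn, pdf_paths, runs: int = 3) -> dict:
    """Average wall-clock seconds parse_fn takes per document."""
    results = {}
    for path in pdf_paths:
        elapsed = []
        for _ in range(runs):
            start = time.perf_counter()
            parse_fn(path)
            elapsed.append(time.perf_counter() - start)
        results[path] = sum(elapsed) / len(elapsed)
    return results
```

`time.perf_counter()` is used rather than `time.time()` because it is monotonic and high-resolution, which matters when individual runs (as with LlamaParse) finish in a few seconds.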

Key Observations
- LlamaParse is the fastest
  - LlamaParse consistently processes documents in around 6 seconds, even as the page count increases.
  - This suggests that it efficiently handles document scaling without significant slowdowns.
- Docling scales linearly with increasing pages
  - Processing 1 page takes 6.28 seconds, while 50 pages take 65.12 seconds: a nearly linear increase in processing time.
  - This indicates that Docling's performance is stable but scales proportionally with document size.
- Unstructured struggles with speed
  - Unstructured is significantly slower, taking 51 seconds for a single page and over 140 seconds for large files.
  - It exhibits inconsistent scaling: 15 pages take slightly less time than 5 pages, likely due to caching or internal optimizations.
  - While its accuracy may be higher in some areas, its speed makes it less practical for high-volume processing.
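Docling's two measured points imply a simple linear cost model that can be used to budget large batches. This is an extrapolation under the assumption that the observed linear scaling continues; actual times will vary with hardware and page content:

```python
# Fit time(pages) = fixed + per_page * pages from the two measured points.
t1, t50 = 6.28, 65.12              # Docling: seconds for 1 and 50 pages
per_page = (t50 - t1) / (50 - 1)   # ~1.20 s per page
fixed = t1 - per_page              # ~5.08 s of fixed overhead


def docling_estimate(pages: int) -> float:
    """Projected Docling processing time in seconds."""
    return fixed + per_page * pages


print(round(per_page, 2))            # 1.2
print(round(docling_estimate(100)))  # 125
```

By contrast, LlamaParse's flat ~6-second profile suggests its cost is dominated by fixed per-request overhead rather than per-page work, which is why it pulls ahead as documents grow.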
5.5. Analysis Results
The outputs and metrics reveal distinct strengths and weaknesses across the frameworks, analyzed below:
Text Extraction Accuracy:
- Docling: Demonstrates high accuracy with 100% text fidelity in dense paragraphs (e.g., Takeda 2023), preserving original phrasing, technical terms, and paragraph breaks. This consistency makes it reliable for maintaining data integrity in documents with dense content.
- Unstructured: Offers efficient text extraction with high accuracy for core content, but introduces inconsistencies such as merging paragraph breaks and adding extraneous details. This over-extraction suggests potential overreach from other document sections, impacting precision.
- LlamaParse: Struggles with multi-column layouts and word merging, achieving high accuracy only for simple text but adding irrelevant content. This indicates a limitation in handling complex textual structures, reducing its suitability for diverse document formats.
Table Detection & Extraction:
- Docling: Excels in recognizing complex tables, preserving hierarchical nesting and column order (e.g., Bayer 2023 complicated table), with a single omission (“5” for Men in Latin America, < 20) resulting in 97.9% cell accuracy. Its use of TableFormer ensures robust structure retention, making it ideal for detailed tabular data.
- Unstructured: Performs variably, with OCR-based extraction succeeding numerically (e.g., 100% accuracy in simple tables) but failing structurally in multi-row tables (e.g., missing data due to column shifts in Bayer 2023). This limits its reliability for complex layouts.
- LlamaParse: Handles simple tables well (e.g., 100% numerical accuracy in simple tables) but falters with complex tables, misplacing “Total” columns (e.g., Bayer 2023). Its performance drops significantly with intricate structures, restricting its applicability.
Section Structure Accuracy:
- Docling: Captures heading text perfectly but uses uniform Markdown levels (##), missing nesting cues (e.g., UPS 2023 section). This formatting flaw is offset by perfect text accuracy, making it effective for readability despite the lost hierarchy.
- Unstructured: Mostly accurate, combining correct text extraction with proper Markdown hierarchy (# for the main heading, ## for the subheading in UPS 2023). It is the only framework that reflects the parent-child relationship between headings, giving it the strongest structural fidelity here.
- LlamaParse: Struggles with section differentiation, using uniform levels (#) and lacking hierarchical clarity (e.g., UPS 2023), much like Docling. Its text accuracy is high, but structural weaknesses reduce usability for organized navigation.
Table of Contents (ToC) Generation:
- Docling: Achieves accurate ToC reconstruction with 100% text fidelity, using a table format with dotted lines, though it underestimates “Contents” prominence with ##. This makes it highly effective for navigation despite minor formatting issues.
- Unstructured: Fails dramatically, capturing only “Contents” with an empty table, missing all entries and page numbers (e.g., UPS 2023 ToC). This indicates a significant weakness in handling two-column layouts and dotted line separators.
- LlamaParse: Reproduces all ToC text accurately, using a bulleted list with dotted lines and correctly marking "Contents" as a main heading (#). However, it does not reconstruct the original two-column table structure, which limits its structural fidelity compared to Docling.
Performance Metric (Processing Speed):
- Docling: Offers moderate speed (6.28s for 1 page, 65.12s for 50 pages) with linear scaling, balancing accuracy and efficiency. This makes it well-suited for enterprise-scale processing where predictable performance is key.
- Unstructured: Struggles significantly with speed (51.06s for 1 page, 141.02s for 50 pages), showing inconsistent scaling. This inefficiency undermines its otherwise decent accuracy, making it less practical for high-volume workflows.
- LlamaParse: Excels in speed (~6s consistently, even for 50 pages), demonstrating remarkable scalability. This efficiency positions it as a strong contender for rapid processing, though its accuracy trade-offs restrict its use to simpler documents.
6. Conclusion
Based on our benchmarking results, including processing speed insights, Docling emerges as the most robust framework for processing complex business documents. It offers high text extraction accuracy, superior table structure preservation, and effective ToC reconstruction, supported by moderate and predictable processing speeds (e.g., 6.28s for 1 page, scaling linearly to 65.12s for 50 pages). Its use of advanced models like DocLayNet and TableFormer ensures reliable handling of diverse document elements, with only minor omissions (e.g., “5” in the Bayer table). This balance of precision, structural fidelity, and efficient performance makes Docling the recommended choice for applications requiring scalability and accuracy, such as enterprise data processing and business intelligence.
Unstructured performs well in text and simple table extraction, achieving 100% numerical accuracy in straightforward cases, but its inconsistencies, such as column shifts in complex tables and incomplete ToC generation, limit its reliability. Its significantly slower speed (e.g., 51.06s for 1 page, 141.02s for 50 pages) further hinders its practicality, suggesting it is best suited for less complex documents or scenarios where speed and resource constraints are not critical. Addressing its speed inefficiencies and structural parsing could enhance its competitiveness.
LlamaParse stands out for its exceptional processing speed (~6s consistently across all page counts), offering unparalleled efficiency and scalability. It performs adequately for basic extractions, with strong numerical accuracy in simple tables and text, but struggles with complex formatting (e.g., multi-column text, intricate tables) and ToC reconstruction. This speed advantage makes it ideal for lightweight, straightforward tasks, but its structural weaknesses and accuracy trade-offs render it less suitable for comprehensive document processing compared to Docling.
For applications prioritizing precision, efficiency, and structural fidelity—key for business analytics—Docling remains the optimal framework. Its linear speed scaling ensures it can handle large documents effectively, while LlamaParse’s rapid processing offers a niche for quick, simple extractions. Unstructured, despite its potential, requires significant optimization in speed and structural handling to compete. Future improvements for Unstructured could focus on reducing processing times and enhancing table parsing, while LlamaParse could benefit from better structural recognition to leverage its speed advantage in broader applications.