{"id":16660,"date":"2025-07-06T12:41:03","date_gmt":"2025-07-06T11:41:03","guid":{"rendered":"https:\/\/procycons.com\/?p=16660"},"modified":"2025-10-04T19:52:55","modified_gmt":"2025-10-04T18:52:55","slug":"long-document-classification-benchmark-2025","status":"publish","type":"post","link":"https:\/\/procycons.com\/en\/blogs\/long-document-classification-benchmark-2025\/","title":{"rendered":"Long Document Classification Benchmark 2025: XGBoost vs BERT vs Traditional ML [Complete Guide]"},"content":{"rendered":"<h2>What is Long Document Classification?<\/h2>\n<p style=\"text-align: justify;\">Long document classification is a specialized subfield of document classification within Natural Language Processing (NLP) that focuses on categorizing documents containing 1,000+ words (2+ pages), such as academic papers, legal contracts, and technical reports. Unlike short-text classification, long documents present challenges like input length constraints (e.g., 512 tokens in BERT), loss of contextual coherence when splitting the document, high computational cost, and the need for complex label structures like multi-label or hierarchical classification.<\/p>\n<h2>Executive Summary<\/h2>\n<p style=\"text-align: justify;\">This benchmark study evaluates various approaches for classifying lengthy documents (7,000-14,000 words \u2248 14-28 pages \u2248 short to medium academic papers) across 11 academic categories. <strong>XGBoost<\/strong><strong>\u00a0emerged as the most versatile solution<\/strong>, achieving F1-scores (balanced measure combining precision and recall) of 75-86 with reasonable computational requirements (Chen and Guestrin, 2016). <strong>Logistic Regression <\/strong><strong>provides the best efficiency-performance trade-off<\/strong> for resource-constrained environments, training in under 20 seconds with competitive accuracy (Genkin, Lewis and Madigan, 2005). 
Surprisingly, <strong>RoBERTa-base<\/strong> <strong>significantly underperformed<\/strong> despite its general reputation, while traditional machine learning approaches proved highly competitive against advanced transformer models (Liu et al., 2019).<\/p>\n<p style=\"text-align: justify;\">Our experiments analyzed 27,000+ documents across four complexity categories from simple keyword matching to large language models, revealing that <strong>traditional ML methods often outperform sophisticated transformers while using 10x fewer computational resources<\/strong>. These counter-intuitive results challenge common assumptions about the necessity of complex models for long document classification tasks.<\/p>\n<h3>Quick Recommendations<\/h3>\n<ul>\n<li style=\"text-align: justify;\"><strong>Best Overall<\/strong>: XGBoost (F1: 86, fast training)<\/li>\n<li style=\"text-align: justify;\"><strong>Most Efficient<\/strong>: Logistic Regression (trains in &lt;20s)<\/li>\n<li style=\"text-align: justify;\"><strong>GPU Available<\/strong>: BERT-base (Devlin et al., 2019)\u00a0(F1: 82, but slower)<\/li>\n<li style=\"text-align: justify;\"><strong>Avoid<\/strong>: Keyword-based methods, RoBERTa-base<\/li>\n<\/ul>\n<h3>Study Methodology &amp; Credibility<\/h3>\n<ul>\n<li style=\"text-align: justify;\"><strong>Dataset Size<\/strong>: 27,000+ documents across 11 academic categories [<a href=\"#ref1\">Download<\/a>]<\/li>\n<li style=\"text-align: justify;\"><strong>Hardware Specification<\/strong>: 15 vCPUs, 45GB RAM, NVIDIA Tesla V100S 32GB<\/li>\n<li style=\"text-align: justify;\"><strong>Reproducibility<\/strong>: All code and configurations are available on <a href=\"#ref19\">GitHub<\/a><\/li>\n<\/ul>\n<h3>Key Research Findings (Verified Results)<\/h3>\n<ul>\n<li style=\"text-align: justify;\">XGBoost achieved an 86% F1-score on 27,000 academic documents<\/li>\n<li style=\"text-align: justify;\">Traditional ML methods train 10x faster than transformer models<\/li>\n<li style=\"text-align: justify;\">BERT requires 2GB+ GPU memory vs 100MB RAM for XGBoost<\/li>\n<li style=\"text-align: justify;\">RoBERTa-base achieved just a 57% F1-score, falling short of expectations in low-data settings<\/li>\n<li style=\"text-align: justify;\">Training transformer-based models on the full dataset isn\u2019t justifiable due to the extremely long training time (over 4 hours). 
Notably, as the data volume grows, model complexity increases and training time rises sharply<\/li>\n<\/ul>\n<h3>How to Choose the Right Document Classification Method for Long Documents with a Small Number of Samples (~100 to 150 Samples)<\/h3>\n<table style=\"border: 2px solid #333; border-collapse: collapse; width: 100%;\">\n<thead>\n<tr style=\"background-color: #4caf50; color: white;\">\n<th style=\"border: 1px solid #333; padding: 12px; text-align: left;\">Aspect<\/th>\n<th style=\"border: 1px solid #333; padding: 12px; text-align: left;\">Logistic Regression<\/th>\n<th style=\"border: 1px solid #333; padding: 12px; text-align: left;\">XGBoost<\/th>\n<th style=\"border: 1px solid #333; padding: 12px; text-align: left;\">BERT-base<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td style=\"border: 1px solid #333; padding: 8px;\"><strong>Best Use Case<\/strong><\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">Resource-constrained<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">Production systems<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">Research applications<\/td>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid #333; padding: 8px;\"><strong>Training Time<\/strong><\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">3 seconds<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">35 seconds<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">23 minutes<\/td>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid #333; padding: 8px;\"><strong>Accuracy (F1 %)<\/strong><\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">79<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">81<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">82<\/td>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid #333; padding: 8px;\"><strong>Memory Requirements<\/strong><\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">50MB RAM<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">100MB RAM<\/td>\n<td style=\"border: 
1px solid #333; padding: 8px;\">2GB GPU RAM<\/td>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid #333; padding: 8px;\"><strong>Implementation Difficulty<\/strong><\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">Low<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">Medium<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">High<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><strong>Table of Contents<\/strong><\/p>\n<ol>\n<li><a href=\"#intro\">Introduction<\/a><\/li>\n<li><a href=\"#classification-methods\">Classification Methods: Simple to Complex<\/a><\/li>\n<li><a href=\"#technical-specification\">Technical Specifications<\/a><\/li>\n<li><a href=\"#results\">Results and Analysis<\/a><\/li>\n<li><a href=\"#deployment\">Deployment Scenarios<\/a><\/li>\n<li><a href=\"#faq\">Frequently Asked Questions<\/a><\/li>\n<li><a href=\"#conclusion\">Conclusion<\/a><\/li>\n<\/ol>\n<h2 id=\"intro\">1. Introduction<\/h2>\n<p style=\"text-align: justify;\">Long document classification is a specialized subfield of document classification within Natural Language Processing (NLP). At its core, document classification involves assigning one or more predefined categories or labels to a given document based on its content. This is a fundamental task for organizing, managing, and retrieving information efficiently in various domains, from legal and healthcare to news and customer reviews.<\/p>\n<p style=\"text-align: justify;\">In long document classification, the term &#8220;long&#8221; refers to the substantial length of the documents involved. Unlike short texts\u2014such as tweets, headlines, or single sentences\u2014long documents can span multiple paragraphs, full articles, books, or even legal contracts. 
This extended length presents unique challenges that traditional text classification methods often struggle to address effectively.<\/p>\n<h3>Main Challenges in Long Document Classification<\/h3>\n<ul>\n<li style=\"text-align: justify;\"><strong>Contextual Information<\/strong>: Long documents contain much richer and more complex context. Accurately understanding and classifying them requires processing information spread across multiple sentences and paragraphs, not just a few keywords.<\/li>\n<li style=\"text-align: justify;\"><strong>Computational Complexity<\/strong>: Many advanced NLP models, especially Transformer-based ones like BERT, have limits on the maximum input length they can handle efficiently. Their self-attention mechanisms, while powerful for capturing word relationships, become computationally expensive (O(N\u00b2) complexity &#8211; cost grows quadratically with document length) and memory-intensive when dealing with very long texts.<\/li>\n<li style=\"text-align: justify;\"><strong>Information Density and Sparsity<\/strong>: Although long documents contain a lot of information, the most important features for classification are often sparsely distributed. This makes it challenging for models to detect and focus on these key signals amidst large amounts of less relevant content.<\/li>\n<li style=\"text-align: justify;\"><strong>Maintaining Coherence<\/strong>: A common approach is to split long documents into smaller chunks. However, this can break the flow and context, making it harder for models to grasp the overall meaning and make accurate classifications.<\/li>\n<\/ul>\n<h3>Study Objectives<\/h3>\n<p style=\"text-align: justify;\">In this benchmark study, we evaluate various long document classification methods from a practical and software development standpoint. 
Our objective is to identify which approach best addresses the unique challenges of long document processing, based on the following criteria:<\/p>\n<ol>\n<li style=\"text-align: justify;\"><strong>Efficiency<\/strong>: Models should be able to process long documents efficiently in terms of time and memory<\/li>\n<li style=\"text-align: justify;\"><strong>Accuracy<\/strong>: Models should be able to accurately classify documents, even with long lengths<\/li>\n<li style=\"text-align: justify;\"><strong>Robustness<\/strong>: Models should be robust to varying document lengths and different types of information organization<\/li>\n<\/ol>\n<h2 id=\"classification-methods\">2. Classification Methods: Simple to Complex<\/h2>\n<p style=\"text-align: justify;\">This section presents four categories of classification methods, ranging from simple keyword matching to sophisticated language models. Each method represents different trade-offs between accuracy, speed, and implementation complexity.<\/p>\n<h3>2.1 Simple Methods (No Training Required)<\/h3>\n<p style=\"text-align: justify;\">These methods are quick to implement and perform well when the documents are relatively simple and not structurally complex. Typically rule-based, pattern-based, or keyword-based, they require no training time, which makes them especially robust to changes in the number of labels.<\/p>\n<p><strong>When to use<\/strong>: Known document structures, rapid prototyping, or when training data is unavailable.<br \/>\n<strong>Main advantage<\/strong>: Zero training time and high interpretability.<br \/>\n<strong>Main limitation<\/strong>: Poor performance on complex or nuanced classification tasks.<\/p>\n<h4>Keyword-Based Classification<\/h4>\n<p>The process begins by extracting a set of representative keywords for each category from the document set. 
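<\/p>\n<p style=\"text-align: justify;\">In code, this keyword-matching idea can be sketched in a few lines of Python (the keyword sets below are illustrative placeholders, not the benchmark&#8217;s actual lists):<\/p>

```python
from collections import Counter
import re

# Illustrative keyword sets; in practice they would come from YAKE or manual curation.
CATEGORY_KEYWORDS = {
    "finance": {"revenue", "profit", "quarterly", "earnings"},
    "legal": {"contract", "clause", "liability", "plaintiff"},
}

def classify_by_keywords(text: str) -> str:
    tokens = re.findall(r"[a-z']+", text.lower())   # 1. tokenize the document
    counts = Counter(tokens)
    # 2. count keyword matches for each category
    scores = {cat: sum(counts[kw] for kw in kws)
              for cat, kws in CATEGORY_KEYWORDS.items()}
    # 3. assign the category with the highest match count
    return max(scores, key=scores.get)
```

<p>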
During testing (or prediction), the classification follows these basic steps:<\/p>\n<ol>\n<li style=\"text-align: justify;\">Tokenize the document<\/li>\n<li style=\"text-align: justify;\">Count the number of keyword matches for each category<\/li>\n<li style=\"text-align: justify;\">Assign the document to the category with the highest match count or keyword density<\/li>\n<\/ol>\n<p style=\"text-align: justify;\">More advanced tools like YAKE (Yet Another Keyword Extractor) can be used to automate keyword extraction (Campos et al., 2020). Additionally, if category names are known in advance, external keywords\u2014those not found within the documents\u2014can be added to the keyword sets with the help of intelligent models.<\/p>\n<div style=\"text-align: center;\">\n<h4>Keyword-Based Classification Diagram<\/h4>\n<p><img fetchpriority=\"high\" decoding=\"async\" class=\"alignnone size-full wp-image-16780\" src=\"https:\/\/procycons.com\/wp-content\/uploads\/2025\/07\/Keyword-2.jpg\" alt=\"Keyword-Based Classification\" width=\"2330\" height=\"114\" srcset=\"https:\/\/procycons.com\/wp-content\/uploads\/2025\/07\/Keyword-2.jpg 2330w, https:\/\/procycons.com\/wp-content\/uploads\/2025\/07\/Keyword-2-300x15.jpg 300w, https:\/\/procycons.com\/wp-content\/uploads\/2025\/07\/Keyword-2-1024x50.jpg 1024w, https:\/\/procycons.com\/wp-content\/uploads\/2025\/07\/Keyword-2-150x7.jpg 150w, https:\/\/procycons.com\/wp-content\/uploads\/2025\/07\/Keyword-2-768x38.jpg 768w, https:\/\/procycons.com\/wp-content\/uploads\/2025\/07\/Keyword-2-1536x75.jpg 1536w, https:\/\/procycons.com\/wp-content\/uploads\/2025\/07\/Keyword-2-2048x100.jpg 2048w\" sizes=\"(max-width: 2330px) 100vw, 2330px\" \/><\/p>\n<\/div>\n<h4>TF-IDF (Term Frequency-Inverse Document Frequency) + Similarity<\/h4>\n<p style=\"text-align: justify;\">While it uses TF-IDF vectors, it doesn&#8217;t require training a machine learning model. 
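<\/p>\n<p style=\"text-align: justify;\">A minimal scikit-learn sketch of this idea, using toy exemplar documents as placeholders:<\/p>

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Two or three exemplar documents per category are often enough (toy examples here).
exemplars = {
    "finance": ["quarterly revenue grew strongly", "profit margins and earnings improved"],
    "legal": ["the contract includes a liability clause", "the plaintiff disputed the agreement"],
}

vectorizer = TfidfVectorizer()
vectorizer.fit([doc for docs in exemplars.values() for doc in docs])

# A mean TF-IDF vector per category stands in for a "typical" document of that class.
centroids = {
    cat: np.asarray(vectorizer.transform(docs).mean(axis=0))
    for cat, docs in exemplars.items()
}

def classify_by_similarity(text: str) -> str:
    vec = vectorizer.transform([text]).toarray()
    sims = {cat: cosine_similarity(vec, c)[0, 0] for cat, c in centroids.items()}
    return max(sims, key=sims.get)
```

<p style=\"text-align: justify;\">With the centroids precomputed, prediction is a single vectorize-and-compare step, which keeps inference fast even for very long inputs.<\/p>\n<p style=\"text-align: justify;\">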
Instead, you select a few representative documents for each category\u2014often just 2 or 3 examples per category are sufficient\u2014and compute their TF-IDF vectors, which reflect the importance of each word within the document relative to the rest of the corpus.<\/p>\n<p style=\"text-align: justify;\">Next, for each category, you calculate a mean TF-IDF vector to represent a typical document in that class. During testing, you convert the new document into a TF-IDF vector and compute its cosine similarity with each category&#8217;s mean vector. The category with the highest similarity score is selected as the predicted label.<\/p>\n<p style=\"text-align: justify;\">This approach is particularly effective for long documents, as it takes into account the entire content rather than focusing on a limited set of keywords. It is also more robust than basic keyword matching and still avoids the need for supervised training.<\/p>\n<div style=\"text-align: center;\">\n<h4>TF-IDF-Based Classification Diagram<\/h4>\n<p><img decoding=\"async\" class=\"alignnone size-full wp-image-16711\" src=\"https:\/\/procycons.com\/wp-content\/uploads\/2025\/07\/TFIDF-1.jpg\" alt=\"TF-IDF-Based classification Diagram\" width=\"2162\" height=\"122\" srcset=\"https:\/\/procycons.com\/wp-content\/uploads\/2025\/07\/TFIDF-1.jpg 2162w, https:\/\/procycons.com\/wp-content\/uploads\/2025\/07\/TFIDF-1-300x17.jpg 300w, https:\/\/procycons.com\/wp-content\/uploads\/2025\/07\/TFIDF-1-1024x58.jpg 1024w, https:\/\/procycons.com\/wp-content\/uploads\/2025\/07\/TFIDF-1-150x8.jpg 150w, https:\/\/procycons.com\/wp-content\/uploads\/2025\/07\/TFIDF-1-768x43.jpg 768w, https:\/\/procycons.com\/wp-content\/uploads\/2025\/07\/TFIDF-1-1536x87.jpg 1536w, https:\/\/procycons.com\/wp-content\/uploads\/2025\/07\/TFIDF-1-2048x116.jpg 2048w\" sizes=\"(max-width: 2162px) 100vw, 2162px\" \/><\/p>\n<p>&nbsp;<\/p>\n<\/div>\n<p style=\"text-align: justify;\"><strong>Next steps<\/strong>: If simple methods meet your 
accuracy requirements, proceed with keyword extraction using YAKE or manual selection. Otherwise, consider intermediate methods for better performance.<\/p>\n<p style=\"text-align: justify;\"><strong>Key Takeaway<\/strong>: Simple methods offer rapid implementation and zero training time but suffer from poor accuracy on complex classification tasks. Best suited for well-structured documents with clear keyword patterns.<\/p>\n<h3>2.2 Intermediate Methods (Traditional ML)<\/h3>\n<p style=\"text-align: justify;\">Having covered simple methods, we now examine intermediate approaches that require training but offer significantly better performance.<\/p>\n<p><strong>When to use<\/strong>: When you have labeled training data and need reliable, fast classification.<br \/>\n<strong>Main advantage<\/strong>: Excellent balance of accuracy, speed, and resource requirements.<br \/>\n<strong>Main limitation<\/strong>: Requires feature engineering and training data.<\/p>\n<p style=\"text-align: justify;\">One of the most accessible and time-tested approaches for document classification \u2014 especially as a baseline \u2014 is the combination of TF-IDF vectorization with traditional machine learning classifiers such as Logistic Regression, Support Vector Machines (SVMs), or XGBoost. Despite its simplicity, this method continues to be a competitive option for many real-world applications, especially when interpretability, speed, and ease of deployment are prioritized.<\/p>\n<h4>Method Overview<\/h4>\n<p style=\"text-align: justify;\">The technique is straightforward: the document text is converted into a numerical representation using TF-IDF, which captures how important a word is relative to a corpus. 
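<\/p>\n<p style=\"text-align: justify;\">A baseline version of this pipeline can be sketched with scikit-learn (toy data stands in for real long documents):<\/p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy corpus; real inputs would be full 7,000-14,000-word documents.
texts = [
    "stock market earnings and quarterly revenue report",
    "banking profit margins rose this quarter",
    "court ruling on the disputed contract",
    "the plaintiff filed a liability claim",
]
labels = ["finance", "finance", "legal", "legal"]

# TF-IDF turns each document into a sparse weighted word-count vector,
# which a linear classifier then maps to a label.
clf = make_pipeline(
    TfidfVectorizer(sublinear_tf=True),
    LogisticRegression(max_iter=1000),
)
clf.fit(texts, labels)
prediction = clf.predict(["quarterly profit and revenue figures"])[0]
```

<p style=\"text-align: justify;\">Swapping <strong>LogisticRegression<\/strong> for an SVM or for XGBoost&#8217;s <em>XGBClassifier<\/em> (with labels encoded as integers) yields the other classifiers benchmarked here.<\/p>\n<p style=\"text-align: justify;\">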
This produces a sparse vector of weighted word counts.<\/p>\n<p>The resulting vector is then passed into a classical classifier, typically:<\/p>\n<ul style=\"text-align: justify;\">\n<li><strong>Logistic Regression<\/strong> for linear separability and fast training<\/li>\n<li><strong>SVM<\/strong> for more complex boundaries<\/li>\n<li><strong>XGBoost<\/strong> for high-performance, tree-based modeling<\/li>\n<\/ul>\n<p style=\"text-align: justify;\">The model learns to associate word presence and frequency patterns with the desired output labels (e.g., topic categories or document types).<\/p>\n<h4>Handling Long Documents<\/h4>\n<p style=\"text-align: justify;\">By default, TF-IDF can process the entire document at once, making it suitable for long texts without the need for complex chunking or truncation. However, if documents are extremely lengthy (e.g., exceeding 5,000\u201310,000 words), it may be beneficial to:<\/p>\n<ol>\n<li>Chunk the document into smaller segments (e.g., 1,000\u20132,000 words)<\/li>\n<li>Classify each chunk individually<\/li>\n<li>And then aggregate results using majority voting or average confidence scores<\/li>\n<\/ol>\n<p style=\"text-align: justify;\">This chunking strategy can improve stability and mitigate sparse vector issues while still remaining computationally efficient.<\/p>\n<div style=\"text-align: center;\">\n<h4>ML-Based Classification Diagram<\/h4>\n<p><img decoding=\"async\" class=\"alignnone size-full wp-image-16709\" src=\"https:\/\/procycons.com\/wp-content\/uploads\/2025\/07\/ML-2.jpg\" alt=\"ML-Based classification Diagram\" width=\"2266\" height=\"122\" srcset=\"https:\/\/procycons.com\/wp-content\/uploads\/2025\/07\/ML-2.jpg 2266w, https:\/\/procycons.com\/wp-content\/uploads\/2025\/07\/ML-2-300x16.jpg 300w, https:\/\/procycons.com\/wp-content\/uploads\/2025\/07\/ML-2-1024x55.jpg 1024w, https:\/\/procycons.com\/wp-content\/uploads\/2025\/07\/ML-2-150x8.jpg 150w, 
https:\/\/procycons.com\/wp-content\/uploads\/2025\/07\/ML-2-768x41.jpg 768w, https:\/\/procycons.com\/wp-content\/uploads\/2025\/07\/ML-2-1536x83.jpg 1536w, https:\/\/procycons.com\/wp-content\/uploads\/2025\/07\/ML-2-2048x110.jpg 2048w\" sizes=\"(max-width: 2266px) 100vw, 2266px\" \/><\/p>\n<p>&nbsp;<\/p>\n<\/div>\n<p style=\"text-align: justify;\"><strong>Next steps<\/strong>: Start with Logistic Regression for baseline performance, then try XGBoost for optimal accuracy. Use 5-fold cross-validation with stratified sampling for robust evaluation.<\/p>\n<p style=\"text-align: justify;\"><strong>Key Takeaway<\/strong>: Intermediate methods demonstrate the best balance of accuracy and efficiency. XGBoost consistently delivers top performance while Logistic Regression excels in resource-constrained environments.<\/p>\n<h3>2.3 Complex Methods (Transformer-Based)<\/h3>\n<p style=\"text-align: justify;\">Moving beyond traditional approaches, we explore transformer-based methods that leverage pre-trained language understanding.<\/p>\n<p><strong>When to use<\/strong>: When maximum accuracy is needed and GPU resources are available.<br \/>\n<strong>Main advantage<\/strong>: Deep language understanding and high accuracy potential.<br \/>\n<strong>Main limitation<\/strong>: Computational intensity and 512-token limit requiring chunking.<\/p>\n<p style=\"text-align: justify;\">For many classification tasks involving moderately long documents \u2014 typically in the range of 300 to 1,500 words \u2014 fine-tuned transformer models like BERT, DistilBERT (Sanh et al., 2019), and RoBERTa represent a highly effective and accessible middle-ground solution. 
These models strike a balance between traditional machine learning approaches and large-scale models like Longformer or GPT-4.<\/p>\n<h4>Architecture and Training<\/h4>\n<p style=\"text-align: justify;\">At their core, these models are pretrained language models that have learned general linguistic patterns from large corpora such as Wikipedia and BookCorpus. When fine-tuned for document classification, the architecture is extended by adding a simple classification head \u2014 usually a dense layer \u2014 on top of the transformer&#8217;s pooled output.<\/p>\n<p style=\"text-align: justify;\">Fine-tuning involves training this extended model on a labeled dataset for a specific task, such as classifying reports into categories like Finance, Sustainability, or Legal. During training, the model adjusts both the classification head and (optionally) the internal transformer weights based on task-specific examples.<\/p>\n<h4>Handling Length Limitations<\/h4>\n<p style=\"text-align: justify;\">One key limitation of standard transformers like BERT and DistilBERT is that they only support sequences up to 512 tokens. For long documents, this constraint must be addressed by:<\/p>\n<ul>\n<li style=\"text-align: justify;\"><strong>Truncation<\/strong>: Simply cutting off the text after the first 512 tokens. Fast, but may ignore critical information later in the document.<\/li>\n<li style=\"text-align: justify;\"><strong>Chunking<\/strong>: Splitting the document into overlapping or sequential segments, classifying each chunk individually, and then aggregating predictions using majority vote, average confidence, or attention-based weighting.<\/li>\n<li style=\"text-align: justify;\"><strong>Preprocessing and data preparation<\/strong>: In this approach, long documents are first broken down into shorter texts (up to 512 tokens) using preprocessing techniques like keyword extraction or summarization. 
While these methods may sacrifice some coherence between segments, they offer faster training and classification times.<\/li>\n<\/ul>\n<p style=\"text-align: justify;\">While chunking adds complexity, it enables these models to handle documents up to several thousand words while maintaining reasonable performance.<\/p>\n<div style=\"text-align: center;\">\n<h4>Transformer-Based Classification Diagram<\/h4>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-16719\" src=\"https:\/\/procycons.com\/wp-content\/uploads\/2025\/07\/Transformer-2.jpg\" alt=\"Transformer-Based classification\" width=\"2342\" height=\"262\" srcset=\"https:\/\/procycons.com\/wp-content\/uploads\/2025\/07\/Transformer-2.jpg 2342w, https:\/\/procycons.com\/wp-content\/uploads\/2025\/07\/Transformer-2-300x34.jpg 300w, https:\/\/procycons.com\/wp-content\/uploads\/2025\/07\/Transformer-2-1024x115.jpg 1024w, https:\/\/procycons.com\/wp-content\/uploads\/2025\/07\/Transformer-2-150x17.jpg 150w, https:\/\/procycons.com\/wp-content\/uploads\/2025\/07\/Transformer-2-768x86.jpg 768w, https:\/\/procycons.com\/wp-content\/uploads\/2025\/07\/Transformer-2-1536x172.jpg 1536w, https:\/\/procycons.com\/wp-content\/uploads\/2025\/07\/Transformer-2-2048x229.jpg 2048w\" sizes=\"(max-width: 2342px) 100vw, 2342px\" \/><\/p>\n<p>&nbsp;<\/p>\n<\/div>\n<p style=\"text-align: justify;\"><strong>Next steps<\/strong>: Begin with DistilBERT for faster training, then upgrade to BERT if accuracy gains justify the computational cost. Implement overlapping chunking strategies for documents exceeding 512 tokens.<\/p>\n<p style=\"text-align: justify;\"><strong>Key Takeaway<\/strong>: Transformer methods offer high accuracy but require significant computational resources. 
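<\/p>\n<p style=\"text-align: justify;\">Independent of the model used per chunk, the chunk-and-aggregate strategy can be sketched as follows (window sizes are illustrative, and <em>classify_chunk<\/em> stands in for any fine-tuned short-text classifier):<\/p>

```python
from collections import Counter

def chunk_tokens(tokens, size=400, overlap=50):
    """Split a token list into overlapping windows."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

def classify_long_document(tokens, classify_chunk):
    """Classify each chunk independently, then aggregate by majority vote."""
    votes = [classify_chunk(chunk) for chunk in chunk_tokens(tokens)]
    return Counter(votes).most_common(1)[0][0]
```

<p style=\"text-align: justify;\">Averaging per-chunk confidence scores instead of voting is a drop-in alternative when the classifier exposes probabilities.<\/p>\n<p style=\"text-align: justify;\">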
BERT-base performs well while RoBERTa-base surprisingly underperforms, highlighting the importance of empirical evaluation over reputation.<\/p>\n<h3>2.4 Most Complex Methods (Large Language Models)<\/h3>\n<p>Finally, we examine the most sophisticated approaches using large language models for instruction-based classification.<\/p>\n<p><strong>When to use<\/strong>: Zero-shot classification, extremely long documents, or when training data is limited.<br \/>\n<strong>Main advantage<\/strong>: No training required, handles very long contexts, high accuracy.<br \/>\n<strong>Main limitation<\/strong>: High API costs, slower inference, and reliance on internet connectivity.<\/p>\n<p style=\"text-align: justify;\">These methods use powerful models that can understand complex documents with minimal or no task-specific training. They are suitable for tasks such as instruction-based or zero-shot classification.<\/p>\n<h4>API-Based Classification<\/h4>\n<p style=\"text-align: justify;\"><strong>OpenAI GPT-4 \/ Claude \/ Gemini 1.5<\/strong>: This approach leverages the instruction-following capability of models like GPT-4, Claude, and Gemini through API calls. These models can handle long-context inputs\u2014up to 128,000 tokens in some cases (which is roughly equivalent to 300+ pages of text \u2248 multiple academic papers).<\/p>\n<p style=\"text-align: justify;\">The method is simple in concept: you provide the model with the document text (or a substantial chunk of it) along with a prompt like:<\/p>\n<p style=\"text-align: justify;\"><em>&#8220;You are a document classification assistant. 
Given the text below, classify the document into one of the following categories: [Finance, Legal, Sustainability].&#8221;<\/em><\/p>\n<p style=\"text-align: justify;\">Once prompted, the LLM analyzes the document in real time and returns a label or even a confidence score, often with an explanation.<\/p>\n<div style=\"text-align: center;\">\n<h4>LLM-Based Classification Diagram<\/h4>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-16717\" src=\"https:\/\/procycons.com\/wp-content\/uploads\/2025\/07\/LLM-2.jpg\" alt=\"LLM-Based Classification\" width=\"2080\" height=\"114\" srcset=\"https:\/\/procycons.com\/wp-content\/uploads\/2025\/07\/LLM-2.jpg 2080w, https:\/\/procycons.com\/wp-content\/uploads\/2025\/07\/LLM-2-300x16.jpg 300w, https:\/\/procycons.com\/wp-content\/uploads\/2025\/07\/LLM-2-1024x56.jpg 1024w, https:\/\/procycons.com\/wp-content\/uploads\/2025\/07\/LLM-2-150x8.jpg 150w, https:\/\/procycons.com\/wp-content\/uploads\/2025\/07\/LLM-2-768x42.jpg 768w, https:\/\/procycons.com\/wp-content\/uploads\/2025\/07\/LLM-2-1536x84.jpg 1536w, https:\/\/procycons.com\/wp-content\/uploads\/2025\/07\/LLM-2-2048x112.jpg 2048w\" sizes=\"(max-width: 2080px) 100vw, 2080px\" \/><\/p>\n<\/div>\n<p>&nbsp;<\/p>\n<h4>RAG-Enhanced Classification<\/h4>\n<p style=\"text-align: justify;\"><strong>LLMs Combined with RAG (Retrieval-Augmented Generation)<\/strong>: Retrieval-Augmented Generation (RAG) is a more advanced architectural pattern that combines a vector-based retrieval system with an LLM. 
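<\/p>\n<p style=\"text-align: justify;\">A compact sketch of the pattern (<em>embed<\/em> and <em>call_llm<\/em> are caller-supplied placeholders for any embedding model and any LLM API, not a specific vendor SDK):<\/p>

```python
import numpy as np

def classify_with_rag(chunks, query, categories, embed, call_llm, top_k=3):
    """chunks: pre-split document pieces; embed/call_llm: caller-supplied stand-ins."""
    # 1. Embed every chunk (normally stored once in a vector DB such as FAISS).
    vecs = np.stack([embed(c) for c in chunks])
    # 2. Retrieve the chunks most similar (by cosine) to the query/category description.
    q = embed(query)
    sims = vecs @ q / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(q))
    top = [chunks[i] for i in np.argsort(sims)[::-1][:top_k]]
    # 3. Ask the LLM to classify using only the retrieved context.
    prompt = (
        "You are a document classification assistant. Given the text below, "
        f"classify the document into one of the following categories: {categories}.\n\n"
        + "\n---\n".join(top)
    )
    return call_llm(prompt)
```

<p style=\"text-align: justify;\">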
Here&#8217;s how it works in a classification setting:<\/p>\n<ul>\n<li style=\"text-align: justify;\">First, the long document is split into smaller, semantically meaningful chunks (e.g., by section, heading, or paragraph)<\/li>\n<li style=\"text-align: justify;\">Each chunk is embedded into a dense vector using an embedding model (like OpenAI&#8217;s text-embedding or SentenceTransformers)<\/li>\n<li style=\"text-align: justify;\">These vectors are stored in a vector database (like FAISS or Pinecone)<\/li>\n<li style=\"text-align: justify;\">When classification is needed, the system retrieves only the most relevant document chunks and passes them to an LLM (like GPT-4) along with a classification instruction<\/li>\n<\/ul>\n<div style=\"text-align: center;\">\n<h4>LLM-Based + RAG Classification Diagram<\/h4>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-16777\" src=\"https:\/\/procycons.com\/wp-content\/uploads\/2025\/07\/RAG-3-1.jpg\" alt=\"LLM+RAG Classification\" width=\"2162\" height=\"546\" srcset=\"https:\/\/procycons.com\/wp-content\/uploads\/2025\/07\/RAG-3-1.jpg 2162w, https:\/\/procycons.com\/wp-content\/uploads\/2025\/07\/RAG-3-1-300x76.jpg 300w, https:\/\/procycons.com\/wp-content\/uploads\/2025\/07\/RAG-3-1-1024x259.jpg 1024w, https:\/\/procycons.com\/wp-content\/uploads\/2025\/07\/RAG-3-1-150x38.jpg 150w, https:\/\/procycons.com\/wp-content\/uploads\/2025\/07\/RAG-3-1-768x194.jpg 768w, https:\/\/procycons.com\/wp-content\/uploads\/2025\/07\/RAG-3-1-1536x388.jpg 1536w, https:\/\/procycons.com\/wp-content\/uploads\/2025\/07\/RAG-3-1-2048x517.jpg 2048w\" sizes=\"(max-width: 2162px) 100vw, 2162px\" \/><\/p>\n<\/div>\n<p>&nbsp;<\/p>\n<p style=\"text-align: justify;\">This method allows you to process long documents efficiently and scalably, while still benefiting from the power of large models.<\/p>\n<p style=\"text-align: justify;\"><strong>Next steps<\/strong>: Start with simpler prompting strategies before 
implementing RAG. Consider cost-effectiveness compared to fine-tuned models for your specific use case.<\/p>\n<p style=\"text-align: justify;\"><strong>Key Takeaway<\/strong>: LLM methods provide powerful zero-shot capabilities for long documents but come with high API costs and latency. Best suited for scenarios where training data is limited or extremely long context processing is required.<\/p>\n<h3>2.5 Model Comparison Summary<\/h3>\n<p style=\"text-align: justify;\">The following table provides a comprehensive overview of all classification methods, comparing their capabilities, resource requirements, and optimal use cases to help guide your selection process.<\/p>\n<table style=\"border: 2px solid #333; border-collapse: collapse; width: 100%;\">\n<thead>\n<tr style=\"background-color: #2196f3; color: white;\">\n<th style=\"border: 1px solid #333; padding: 12px; text-align: left;\">Methods<\/th>\n<th style=\"border: 1px solid #333; padding: 12px; text-align: left;\">Model\/Class<\/th>\n<th style=\"border: 1px solid #333; padding: 12px; text-align: left;\">Max Tokens<\/th>\n<th style=\"border: 1px solid #333; padding: 12px; text-align: left;\">Chunking Needed?<\/th>\n<th style=\"border: 1px solid #333; padding: 12px; text-align: left;\">Ease of Use (1-5)<\/th>\n<th style=\"border: 1px solid #333; padding: 12px; text-align: left;\">Accuracy (1-5)<\/th>\n<th style=\"border: 1px solid #333; padding: 12px; text-align: left;\">Resource Usage<\/th>\n<th style=\"border: 1px solid #333; padding: 12px; text-align: left;\">Best For<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td style=\"border: 1px solid #333; padding: 8px; vertical-align: top;\" rowspan=\"2\"><strong>Simple<\/strong><\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">Keyword\/Regex Rules<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">\u221e<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">No<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">1 (Easy)<\/td>\n<td 
style=\"border: 1px solid #333; padding: 8px;\">2 (Low)<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\"><strong>Minimal<\/strong> CPU &amp; RAM<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">Known structure\/formats (e.g., legal)<\/td>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid #333; padding: 8px;\">TF-IDF + Similarity<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">\u221e<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">No<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">2<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">2-3<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\"><strong>Low<\/strong> CPU, ~150MB RAM<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">Labeling based on few samples<\/td>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid #333; padding: 8px;\"><strong>Intermediate<\/strong><\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">TF-IDF + ML<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">\u221e (entire doc)<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">Optional<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">1 (Easy)<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">3 (Good)<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\"><strong>Low<\/strong> CPU, ~100MB RAM<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">Fast baselines, prototyping<\/td>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid #333; padding: 8px; vertical-align: top;\" rowspan=\"2\"><strong>Complex<\/strong><\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">BERT \/ DistilBERT \/ RoBERTa<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">512 tokens<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">Yes<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">3<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">4 (High)<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\"><strong>Needs GPU<\/strong> 
\/ ~1\u20132GB RAM<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">Short\/medium texts, fine-tuning possible<\/td>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid #333; padding: 8px;\">Longformer \/ BigBird<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">4,096\u201316,000<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">No<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">4<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">5 (Highest)<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\"><strong>GPU (8GB+)<\/strong>, ~3\u20138GB RAM<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">Long reports, deep accuracy needed<\/td>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid #333; padding: 8px;\"><strong>Most Complex<\/strong><\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">GPT-4 \/ Claude \/ Gemini<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">32k\u2013128k tokens<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">No or Light<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">4 (API-based)<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">5 (Highest)<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\"><strong>High cost<\/strong>, API limits<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">Zero-shot classification of big docs<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p style=\"text-align: justify;\"><strong>Key Insight<\/strong>: Traditional ML (XGBoost) often outperforms advanced transformers while using 10x fewer resources.<\/p>\n<h3>2.6 Referenced Datasets &amp; Standards<\/h3>\n<p>The following datasets provide excellent benchmarks for testing long document classification methods:<\/p>\n<table style=\"border: 2px solid #333; border-collapse: collapse; width: 100%;\">\n<thead>\n<tr style=\"background-color: #ff9800; color: white;\">\n<th style=\"border: 1px solid #333; padding: 12px; text-align: left;\">Dataset<\/th>\n<th style=\"border: 1px solid 
#333; padding: 12px; text-align: left;\">Avg Length<\/th>\n<th style=\"border: 1px solid #333; padding: 12px; text-align: left;\">Domain<\/th>\n<th style=\"border: 1px solid #333; padding: 12px; text-align: left;\">Page Length<\/th>\n<th style=\"border: 1px solid #333; padding: 12px; text-align: left;\">Categories<\/th>\n<th style=\"border: 1px solid #333; padding: 12px; text-align: left;\">Source<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td style=\"border: 1px solid #333; padding: 8px;\">S2ORC<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">3k\u201310k tokens<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">Academic<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">6\u201320<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">Dozens<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\"><a id=\"ref3link\" href=\"#ref3\">Semantic Scholar<\/a><\/td>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid #333; padding: 8px;\">ArXiv<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">4k\u201314k words<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">Academic<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">8\u201328<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">38+<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\"><a id=\"ref1link\" href=\"#ref1\">arXiv.org<\/a><\/td>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid #333; padding: 8px;\">BillSum<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">1.5k\u20136k tokens<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">Government<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">3\u201312<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">Policy categories<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\"><a id=\"ref4link\" href=\"#ref4\">FiscalNote<\/a><\/td>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid #333; padding: 8px;\">GOVREPORT<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">4k\u201310k 
tokens<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">Gov\/Finance<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">8\u201320<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">Various<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\"><a id=\"ref5link\" href=\"#ref5\">Government Agencies<\/a><\/td>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid #333; padding: 8px;\">CUAD<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">3k\u201310k tokens<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">Legal<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">6\u201320<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">Contract clauses<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\"><a id=\"ref6link\" href=\"#ref6\">Atticus Project<\/a><\/td>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid #333; padding: 8px;\">MIMIC-III<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">2k\u20135k tokens<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">Medical<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">3\u201310<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">Clinical notes<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\"><a id=\"ref7link\" href=\"#ref7\">PhysioNet<\/a><\/td>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid #333; padding: 8px;\">SEC 10-K\/Q<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">10k\u201350k words<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">Finance<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">20\u2013100<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">Company\/section<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\"><a id=\"ref8link\" href=\"#ref8\">SEC EDGAR<\/a><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p style=\"text-align: justify;\"><strong>Context<\/strong>: All datasets are publicly available with proper licensing agreements. 
Training times vary from 2 hours (small datasets) to 2 days (large datasets) on standard hardware.<\/p>\n<h2 id=\"technical specification\">3. Technical Specifications<\/h2>\n<h3>3.1 Evaluation Criteria<\/h3>\n<p><strong>Accuracy Assessment<\/strong>: Models are scored on accuracy, precision (true positives \/ predicted positives), recall (true positives \/ actual positives), and F1-score (the harmonic mean of precision and recall).<\/p>\n<p><strong>Resource and Time Assessment<\/strong>: We also record the time and hardware resources consumed during training and testing.<\/p>\n<h3>3.2 Experiment Settings<\/h3>\n<p><strong>Hardware Configuration<\/strong>: 15 vCPUs, 45GB RAM, NVIDIA Tesla V100S 32GB.<\/p>\n<p><strong>Evaluation Methodology<\/strong>: 5-fold cross-validation with stratified sampling was used to ensure robust statistical evaluation.<\/p>\n<p><strong>Software Libraries<\/strong>: <a id=\"ref9link\" href=\"#ref9\">scikit-learn 1.3.0<\/a>, <a id=\"ref10link\" href=\"#ref10\">transformers 4.38.0<\/a>, <a id=\"ref11link\" href=\"#ref11\">PyTorch 2.7.1<\/a>, <a id=\"ref12link\" href=\"#ref12\">XGBoost 3.0.2<\/a><\/p>\n<h4>3.2.1 Dataset Selection<\/h4>\n<p>We use the ArXiv Dataset, selecting the 11 labels with the largest length variation across academic domains.<\/p>\n<p>&nbsp;<\/p>\n<div style=\"text-align: center;\">\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-16700\" src=\"https:\/\/procycons.com\/wp-content\/uploads\/2025\/07\/dataset1.png\" alt=\"Number Of Examples Per Category\" width=\"711\" height=\"474\" srcset=\"https:\/\/procycons.com\/wp-content\/uploads\/2025\/07\/dataset1.png 1200w, https:\/\/procycons.com\/wp-content\/uploads\/2025\/07\/dataset1-300x200.png 300w, https:\/\/procycons.com\/wp-content\/uploads\/2025\/07\/dataset1-1024x683.png 1024w, https:\/\/procycons.com\/wp-content\/uploads\/2025\/07\/dataset1-150x100.png 150w, https:\/\/procycons.com\/wp-content\/uploads\/2025\/07\/dataset1-768x512.png 768w\" sizes=\"(max-width: 711px) 
100vw, 711px\" \/><\/p>\n<\/div>\n<p style=\"text-align: justify;\"><strong>Document Length Context<\/strong>: To better contextualize these word counts, we can convert them to number of pages, using the standard estimate of 500 words per page for double-spaced academic text (14,000 words \u2248 28 pages \u2248 short academic paper). By this measure:<\/p>\n<ul>\n<li>math.ST averages around 28 pages<\/li>\n<li>math.GR and cs.DS are about 25\u201326 pages<\/li>\n<li>cs.IT and math.AC average roughly 20\u201324 pages<\/li>\n<li>while cs.CV and cs.NE average only 14\u201315 pages<\/li>\n<\/ul>\n<p style=\"text-align: justify;\">This significant variation demonstrates differences in writing styles, document depth, or research reporting norms across fields. Fields like mathematics and theoretical computer science tend to produce more comprehensive or technically dense documents, whereas applied areas like computer vision may favor more concise communication.<\/p>\n<p>&nbsp;<\/p>\n<div style=\"text-align: center;\">\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-16700\" src=\"https:\/\/procycons.com\/wp-content\/uploads\/2025\/07\/dataset2.png\" alt=\"Number Of Examples Per Category\" width=\"711\" height=\"474\" \/><\/p>\n<p>&nbsp;<\/p>\n<\/div>\n<h4>3.2.2 Data Size and Training\/Test Division<\/h4>\n<p><strong>Expected training time on standard hardware<\/strong>: 30 minutes to 8 hours, depending on method complexity.<\/p>\n<p><strong>Minimum Training Data Requirements<\/strong>:<\/p>\n<ul>\n<li>Simple methods: 50+ samples per class<\/li>\n<li>Logistic Regression: 100+ samples per class<\/li>\n<li>XGBoost: 1,000+ samples for optimal performance<\/li>\n<li>BERT\/Transformer models: 2,000+ samples per class<\/li>\n<\/ul>\n<p style=\"text-align: justify;\">In all experiments, 30% of the data was reserved as the test set. 
To evaluate the robustness of the model, several variations of the dataset were used: the original class-distributed data, a balanced dataset based on the minimum class size (~2,505 samples), and additional balanced datasets with fixed sizes of 100, 140, and 1,000 samples per class.<\/p>\n<h2 id=\"results\">4. Results and Analysis<\/h2>\n<p style=\"text-align: justify;\">Our experiments reveal counter-intuitive results about the performance-efficiency trade-offs in long document classification.<\/p>\n<h3>Why Traditional ML Outperforms Transformers<\/h3>\n<p>Our benchmark demonstrates that traditional machine learning approaches offer several advantages:<\/p>\n<ol>\n<li><strong>Computational Efficiency<\/strong>: Process entire documents without token limits<\/li>\n<li><strong>Training Speed<\/strong>: 10x faster training times with comparable accuracy<\/li>\n<li><strong>Resource Requirements<\/strong>: Function effectively on standard CPU hardware<\/li>\n<li><strong>Scalability<\/strong>: Handle large document collections without GPU infrastructure<\/li>\n<\/ol>\n<h3>4.1 Performance Rankings<\/h3>\n<p style=\"text-align: justify;\">The comparative evaluation across four datasets\u2014Original, Balanced-2505, Balanced-140, and Balanced-100\u2014demonstrates clear performance hierarchies:<\/p>\n<h4>Top Performers by F1-Score:<\/h4>\n<p><strong>XGBoost<\/strong> <strong>achieved the highest F1-scores on three datasets:<\/strong><\/p>\n<ul>\n<li><strong>Original<\/strong>: F1 = 86<\/li>\n<li><strong>Balanced-2505<\/strong>: F1 = 85<\/li>\n<li><strong>Balanced-100<\/strong>: F1 = 75<\/li>\n<\/ul>\n<p><strong>BERT-base<\/strong> <strong>was the top performer on the Balanced-140 dataset:<\/strong><\/p>\n<ul>\n<li><strong>Balanced-140<\/strong>: F1 = <strong>82<\/strong> (vs. 
XGBoost: 81)<\/li>\n<\/ul>\n<p><strong>Logistic Regression and SVM also delivered competitive results:<\/strong><\/p>\n<ul>\n<li>F1 range:\u00a0<strong>71\u201383<\/strong><\/li>\n<\/ul>\n<p><strong>DistilBERT-base<\/strong> <strong>maintained decent performance across settings:<\/strong><\/p>\n<ul>\n<li>F1 \u2248\u00a0<strong>75\u201377<\/strong><\/li>\n<\/ul>\n<p><strong>RoBERTa-base<\/strong> <strong>consistently underperformed:<\/strong><\/p>\n<ul>\n<li>F1 as low as\u00a0<strong>57<\/strong>, especially in low-data settings<\/li>\n<\/ul>\n<p><strong>Keyword-based methods<\/strong> had the lowest F1-scores (53\u201362)<\/p>\n<p style=\"text-align: justify;\"><strong>Key Takeaway<\/strong>: Although XGBoost generally performs best across most dataset scenarios, BERT-base slightly outperforms it in mid-sized datasets such as Balanced-140. This suggests that transformer models can surpass traditional machine learning methods when a moderate amount of data and sufficient GPU resources are available. 
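For reference, the kind of traditional-ML baseline that dominated these rankings takes only a few lines of scikit-learn. The toy corpus below is purely illustrative and far shorter than the benchmark's 7,000-14,000-word documents; the vectorizer settings are reasonable defaults, not the benchmark's exact configuration:

```python
# Sketch of a TF-IDF + Logistic Regression baseline (scikit-learn).
# The corpus and labels below are toy stand-ins for arXiv papers.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

docs = [
    "convolutional networks for image recognition and vision",
    "object detection with deep vision transformers",
    "prime ideals in commutative algebra and ring theory",
    "homological methods for commutative rings",
] * 5                       # repeat so each class has a few samples
labels = ["cs.CV", "cs.CV", "math.AC", "math.AC"] * 5

# Unlike BERT's 512-token window, TF-IDF ingests the whole document.
clf = Pipeline([
    ("tfidf", TfidfVectorizer(sublinear_tf=True)),
    ("logreg", LogisticRegression(max_iter=1000)),
])
clf.fit(docs, labels)
print(clf.predict(["attention models for image classification"]))
```

Swapping `LogisticRegression` for an `xgboost.XGBClassifier` yields the XGBoost variant; the feature pipeline stays identical.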
However, the performance gap is not significant, and XGBoost remains the most well-rounded option, offering high accuracy, robustness, and computational efficiency across different dataset sizes.<\/p>\n<h3>4.2 Cost-Benefit Analysis of Each Method<\/h3>\n<p>An in-depth analysis of training and inference times reveals a wide gap in resource requirements between traditional ML methods and transformer-based models:<\/p>\n<h4>Training and Inference Times:<\/h4>\n<p><strong>Most Efficient<\/strong><\/p>\n<ul>\n<li><strong>Logistic Regression<\/strong>:\n<ul>\n<li><strong>Training<\/strong>: 2\u201319 seconds across datasets<\/li>\n<li><strong>Inference<\/strong>: ~0.01\u20130.06 seconds<\/li>\n<li><strong>Resource Use<\/strong>: Minimal CPU &amp; RAM (~50MB)<\/li>\n<li>Best suited for rapid deployment and resource-constrained environments.<\/li>\n<\/ul>\n<\/li>\n<li><strong>XGBoost<\/strong>:\n<ul>\n<li><strong>Training<\/strong>: Ranges from <strong>23s (Balanced-100)<\/strong> to <strong>369s (Balanced-2505)<\/strong><\/li>\n<li><strong>Inference<\/strong>: ~<strong>0.00\u20130.09 seconds<\/strong><\/li>\n<li><strong>Resource Use<\/strong>: Efficient on CPU (~100MB RAM)<\/li>\n<li>Excellent trade-off between speed and accuracy, particularly for large datasets.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p><strong>Resource Intensive<\/strong><\/p>\n<ul>\n<li><strong>SVM<\/strong>:\n<ul>\n<li><strong>Training<\/strong>: Up to <strong>2,480s<\/strong><\/li>\n<li><strong>Inference<\/strong>: As high as <strong>1,322s<\/strong><\/li>\n<li>High complexity and runtime make it unsuitable for real-time or production use.<\/li>\n<\/ul>\n<\/li>\n<li><strong>Transformer Models<\/strong>:\n<ul>\n<li><strong>DistilBERT-base<\/strong>: Training \u2248 900\u20131,400s; Inference \u2248 140s<\/li>\n<li><strong>BERT-base<\/strong>: Training \u2248 1,300\u20132,700s; Inference \u2248 127\u2013138s<\/li>\n<li><strong>RoBERTa-base<\/strong>: Worst performance and highest training time (up to 
2,718s)<\/li>\n<li>GPU-intensive (\u22652GB RAM) and slow inference make them impractical unless maximum accuracy is critical.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p><strong>Inefficient in Inference<\/strong><\/p>\n<ul>\n<li><strong>Keyword-based Methods<\/strong>:\n<ul>\n<li><strong>Training<\/strong>: Very fast (as low as 3\u2013135s)<\/li>\n<li><strong>Inference<\/strong>: Surprisingly <strong>slow<\/strong> \u2014 up to <strong>335s<\/strong><\/li>\n<li>Although easy to implement, the slow inference and poor accuracy make them unsuitable for large-scale or real-time use.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p style=\"text-align: justify;\"><strong>Key Takeaway<\/strong>: Traditional ML methods such as <strong>Logistic Regression and XGBoost<\/strong> offer the best cost-efficiency for real-world deployment, providing fast training, near-instant inference, and high accuracy without relying on GPUs. Transformer models provide improved performance only on certain datasets (e.g., BERT on Balanced-140) but incur significant resource and time costs, which may not be justified in many scenarios. 
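For completeness, the precision, recall, and F1 values reported throughout can be computed from raw predictions in plain Python. This is an illustrative sketch with toy labels; the averaging scheme used to produce a single headline score (macro vs. weighted) is a separate choice not restated here:

```python
# Plain-Python computation of per-class precision, recall, and F1.
from collections import Counter

def per_class_scores(y_true, y_pred):
    """Map each label to (precision, recall, f1) from paired label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for true, pred in zip(y_true, y_pred):
        if true == pred:
            tp[true] += 1
        else:
            fp[pred] += 1   # predicted this label wrongly
            fn[true] += 1   # missed a document of this label
    scores = {}
    for label in set(y_true) | set(y_pred):
        p = tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
        r = tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        scores[label] = (p, r, f1)
    return scores

y_true = ["cs.CV", "cs.CV", "math.ST", "math.ST"]
y_pred = ["cs.CV", "math.ST", "math.ST", "math.ST"]
# math.ST: precision 2/3, recall 1.0, F1 0.8
print(per_class_scores(y_true, y_pred)["math.ST"])
```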
It\u2019s important to note that the resource requirements of transformer models increase exponentially with growing complexity, such as larger data volumes.<\/p>\n<h3>4.3 Complete Model Evaluation Summary<\/h3>\n<table style=\"border: 2px solid #333; border-collapse: collapse; width: 100%;\">\n<thead>\n<tr style=\"background-color: #9c27b0; color: white;\">\n<th style=\"border: 1px solid #333; padding: 12px; text-align: left;\">Dataset<\/th>\n<th style=\"border: 1px solid #333; padding: 12px; text-align: left;\">Methods<\/th>\n<th style=\"border: 1px solid #333; padding: 12px; text-align: left;\">Model<\/th>\n<th style=\"border: 1px solid #333; padding: 12px; text-align: center;\">Accuracy (%)<\/th>\n<th style=\"border: 1px solid #333; padding: 12px; text-align: center;\">Precision (%)<\/th>\n<th style=\"border: 1px solid #333; padding: 12px; text-align: center;\">Recall (%)<\/th>\n<th style=\"border: 1px solid #333; padding: 12px; text-align: center;\">F1-Score (%)<\/th>\n<th style=\"border: 1px solid #333; padding: 12px; text-align: center;\">Train Time (s)<\/th>\n<th style=\"border: 1px solid #333; padding: 12px; text-align: center;\">Test Time (s)<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td style=\"border: 1px solid #333; padding: 8px; vertical-align: top;\" rowspan=\"5\"><strong>Original<\/strong><\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">Simple<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">Keyword-Based<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">56<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">57<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">56<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">55<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">135<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">335<\/td>\n<\/tr>\n<tr>\n<td style=\"border: 
1px solid #333; padding: 8px; vertical-align: top;\" rowspan=\"4\">Intermediate<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\"><strong>Logistic Regression<\/strong><\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center; background-color: #e8f5e8;\"><strong>84<\/strong><\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center; background-color: #e8f5e8;\"><strong>83<\/strong><\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center; background-color: #e8f5e8;\"><strong>84<\/strong><\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center; background-color: #e8f5e8;\"><strong>83<\/strong><\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center; background-color: #e8f5e8;\"><strong>19<\/strong><\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center; background-color: #e8f5e8;\"><strong>0.06<\/strong><\/td>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid #333; padding: 8px;\">SVM<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">84<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">83<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">84<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">83<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">2480<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">1322<\/td>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid #333; padding: 8px;\">MLP<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">80<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">80<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">80<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">80<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: 
center;\">426<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">0.53<\/td>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid #333; padding: 8px;\"><strong>XGBoost<\/strong><\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center; background-color: #ffd700;\"><strong>86<\/strong><\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center; background-color: #ffd700;\"><strong>86<\/strong><\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center; background-color: #ffd700;\"><strong>86<\/strong><\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center; background-color: #ffd700;\"><strong>86<\/strong><\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center; background-color: #ffd700;\"><strong>364<\/strong><\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center; background-color: #ffd700;\"><strong>0.08<\/strong><\/td>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid #333; padding: 8px; vertical-align: top;\" rowspan=\"5\"><strong>Balanced-2505<\/strong><\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">Simple<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">Keyword-Based<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">53<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">53<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">53<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">53<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">50<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">253<\/td>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid #333; padding: 8px; vertical-align: top;\" rowspan=\"4\">Intermediate<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">Logistic Regression<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; 
text-align: center;\">83<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">83<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">83<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">83<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">17<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">0.05<\/td>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid #333; padding: 8px;\">SVM<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">82<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">82<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">82<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">82<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">1681<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">839<\/td>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid #333; padding: 8px;\">MLP<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">78<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">79<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">78<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">78<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">301<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">0.41<\/td>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid #333; padding: 8px;\"><strong>XGBoost<\/strong><\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center; background-color: #ffd700;\"><strong>85<\/strong><\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center; background-color: #ffd700;\"><strong>85<\/strong><\/td>\n<td style=\"border: 1px solid #333; padding: 8px; 
text-align: center; background-color: #ffd700;\"><strong>85<\/strong><\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center; background-color: #ffd700;\"><strong>85<\/strong><\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center; background-color: #ffd700;\"><strong>369<\/strong><\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center; background-color: #ffd700;\"><strong>0.09<\/strong><\/td>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid #333; padding: 8px; vertical-align: top;\" rowspan=\"8\"><strong>Balanced-100<\/strong><\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">Simple<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">Keyword-Based<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">54<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">56<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">54<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">54<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">3<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">10<\/td>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid #333; padding: 8px; vertical-align: top;\" rowspan=\"4\">Intermediate<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">Logistic Regression<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">72<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">71<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">72<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">71<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">2<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">0.01<\/td>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid #333; padding: 
8px;\">SVM<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">72<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">73<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">72<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">72<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">7<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">2<\/td>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid #333; padding: 8px;\">MLP<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">73<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">73<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">73<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">73<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">15<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">0.02<\/td>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid #333; padding: 8px;\"><strong>XGBoost<\/strong><\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center; background-color: #ffd700;\"><strong>76<\/strong><\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center; background-color: #ffd700;\"><strong>76<\/strong><\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center; background-color: #ffd700;\"><strong>76<\/strong><\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center; background-color: #ffd700;\"><strong>75<\/strong><\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center; background-color: #ffd700;\"><strong>23<\/strong><\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center; background-color: #ffd700;\"><strong>0<\/strong><\/td>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid #333; padding: 
8px; vertical-align: top;\" rowspan=\"3\">Complex<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">DistilBERT-base<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">75<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">75<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">75<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">75<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">907<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">141<\/td>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid #333; padding: 8px;\">BERT-base<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">77<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">78<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">77<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">77<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">1357<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">127<\/td>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid #333; padding: 8px;\">RoBERTa-base<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">55<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">62<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">55<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">57<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">1402<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">124<\/td>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid #333; padding: 8px; vertical-align: top;\" rowspan=\"8\"><strong>Balanced-140<\/strong><\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">Simple<\/td>\n<td 
style=\"border: 1px solid #333; padding: 8px;\">Keyword-Based<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">62<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">63<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">62<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">62<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">3<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">14<\/td>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid #333; padding: 8px; vertical-align: top;\" rowspan=\"4\">Intermediate<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">Logistic Regression<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">79<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">79<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">79<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">79<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">3<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">0.01<\/td>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid #333; padding: 8px;\">SVM<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">78<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">79<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">78<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">78<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">14<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">4<\/td>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid #333; padding: 8px;\">MLP<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">78<\/td>\n<td style=\"border: 
1px solid #333; padding: 8px; text-align: center;\">79<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">78<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">78<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">19<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">0.02<\/td>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid #333; padding: 8px;\"><strong>XGBoost<\/strong><\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center; background-color: #ffd700;\"><strong>81<\/strong><\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center; background-color: #ffd700;\"><strong>80<\/strong><\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center; background-color: #ffd700;\"><strong>81<\/strong><\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center; background-color: #ffd700;\"><strong>80<\/strong><\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center; background-color: #ffd700;\"><strong>34<\/strong><\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center; background-color: #ffd700;\"><strong>0<\/strong><\/td>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid #333; padding: 8px; vertical-align: top;\" rowspan=\"3\">Complex<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">DistilBERT-base<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">77<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">77<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">77<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">77<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">1399<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">142<\/td>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid #333; padding: 
8px;\"><strong>BERT-base<\/strong><\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center; background-color: #e8f5e8;\"><strong>82<\/strong><\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center; background-color: #e8f5e8;\"><strong>82<\/strong><\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center; background-color: #e8f5e8;\"><strong>82<\/strong><\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center; background-color: #e8f5e8;\"><strong>82<\/strong><\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center; background-color: #e8f5e8;\"><strong>2685<\/strong><\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center; background-color: #e8f5e8;\"><strong>138<\/strong><\/td>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid #333; padding: 8px;\">RoBERTa-base<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">64<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">64<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">64<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">64<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">2718<\/td>\n<td style=\"border: 1px solid #333; padding: 8px; text-align: center;\">139<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>&nbsp;<\/p>\n<h3>4.4 Model Selection Decision Matrix<\/h3>\n<table style=\"border: 2px solid #333; border-collapse: collapse; width: 100%;\">\n<thead>\n<tr style=\"background-color: #607d8b; color: white;\">\n<th style=\"border: 1px solid #333; padding: 12px; text-align: left;\">Criteria<\/th>\n<th style=\"border: 1px solid #333; padding: 12px; text-align: left;\">Best Model<\/th>\n<th style=\"border: 1px solid #333; padding: 12px; text-align: left;\">Notes<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td style=\"border: 1px solid #333; padding: 8px;\"><strong>Highest 
Accuracy (All Data)<\/strong><\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">XGBoost<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">F1 = 86<\/td>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid #333; padding: 8px;\"><strong>Highest Accuracy (Small-Mid Data) &#8211; CPU access<\/strong><\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">XGBoost<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">F1 = 81<\/td>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid #333; padding: 8px;\"><strong>Highest Accuracy (Small-Mid Data) &#8211; GPU access<\/strong><\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">BERT-base<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">F1 = 82<\/td>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid #333; padding: 8px;\"><strong>Fastest Model<\/strong><\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">Logistic Regression<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">Training in &lt;20s<\/td>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid #333; padding: 8px;\"><strong>Best Efficiency (Speed\/Accuracy Trade-off)<\/strong><\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">Logistic Regression<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">Excellent balance between runtime, simplicity, and accuracy<\/td>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid #333; padding: 8px;\"><strong>Best Large-Scale Classifier<\/strong><\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">XGBoost<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">Scales well to large datasets, robust to imbalance<\/td>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid #333; padding: 8px;\"><strong>Best GPU Utilization<\/strong><\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">BERT-base<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">High accuracy when GPU is available; better than RoBERTa\/DistilBERT-base<\/td>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid 
<strong>Not">
#333; padding: 8px;\"><strong>Not Recommended<\/strong><\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">RoBERTa-base, Keyword-Based<\/td>\n<td style=\"border: 1px solid #333; padding: 8px;\">Poor accuracy, long inference times, no performance edge<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h3>4.5 Robustness Analysis<\/h3>\n<p style=\"text-align: justify;\">This section analyzes the robustness of different models across various dataset sizes and conditions, highlighting their strengths, limitations, and areas needing further investigation.<\/p>\n<p><strong>High Confidence Findings<\/strong>:<\/p>\n<ul>\n<li><strong>XGBoost demonstrates robust performance<\/strong> across varying dataset sizes, especially for large and small data regimes (Original, Balanced-100).<\/li>\n<li><strong>BERT-base shows strong performance<\/strong> in mid-sized datasets (Balanced-140), indicating that transformer models can outperform traditional ML under the right data and compute conditions.<\/li>\n<li><strong>Logistic Regression remains a consistently reliable baseline<\/strong>, delivering strong results with minimal computational cost.<\/li>\n<li><strong>Traditional ML models<\/strong>, particularly XGBoost and Logistic Regression, offer high efficiency with competitive accuracy, especially when computational resources are limited.<\/li>\n<\/ul>\n<p><strong>Areas Requiring Further Research<\/strong>:<\/p>\n<ul>\n<li><strong>RoBERTa-base&#8217;s underperformance<\/strong> across all settings is unexpected and may stem from task-specific limitations or suboptimal fine-tuning strategies.<\/li>\n<li><strong>Transformer chunking strategies<\/strong> require further domain adaptation \u2014 current performance may be limited by generic slicing or truncation techniques.<\/li>\n<\/ul>\n<p style=\"text-align: justify;\"><strong>Key Takeaway<\/strong>: While traditional ML methods like XGBoost and Logistic Regression are robust, <strong>transformer models such as BERT-base can 
outperform them under specific conditions<\/strong>. These results underscore the importance of <strong>matching model complexity to the data scale and deployment constraints<\/strong>, rather than assuming more sophisticated architectures will yield better results by default.<\/p>\n<h2 id=\"deployment\">5. Deployment Scenarios<\/h2>\n<p style=\"text-align: justify;\">In this section, we explore deployment scenarios for text classification models, highlighting the best-suited algorithms for different operational constraints\u2014ranging from production systems to rapid prototyping\u2014based on trade-offs between accuracy, efficiency, and resource availability.<\/p>\n<p><strong>Production Systems<\/strong><\/p>\n<ul>\n<li><strong>Recommended<\/strong>: <strong>XGBoost<\/strong><\/li>\n<li><strong>Why<\/strong>: Achieves the highest F1-score (86) on full datasets with fast inference (~0.08s) and moderate training time (~6 minutes).<\/li>\n<li><strong>Use Case<\/strong>: High-volume or batch processing pipelines where both accuracy and throughput matter.<\/li>\n<li><strong>Notes<\/strong>: Robust across dataset sizes; suitable for environments with standard CPU infrastructure.<\/li>\n<\/ul>\n<p><strong>Resource-Constrained Environments<\/strong><\/p>\n<ul>\n<li><strong>Recommended<\/strong>: <strong>Logistic Regression<\/strong><\/li>\n<li><strong>Why<\/strong>: Extremely lightweight (training &lt;20s, inference ~0.01s), with competitive F1-scores (up to 83).<\/li>\n<li><strong>Use Case<\/strong>: Edge devices, embedded systems, and low-budget deployments.<\/li>\n<li><strong>Notes<\/strong>: Also ideal for rapid explainability and debugging.<\/li>\n<\/ul>\n<p><strong>Maximum Accuracy with GPU Access<\/strong><\/p>\n<ul>\n<li><strong>Recommended<\/strong>: <strong>BERT-base<\/strong><\/li>\n<li><strong>Why<\/strong>: Outperforms XGBoost on moderately sized datasets (F1 = 82 vs. 
80 on Balanced-140).<\/li>\n<li><strong>Use Case<\/strong>: Research, compliance\/legal document classification, and applications where marginal accuracy improvements are mission-critical.<\/li>\n<li><strong>Notes<\/strong>: Requires GPU infrastructure (~2GB VRAM); longer training and inference times.<\/li>\n<\/ul>\n<p><strong>Rapid Prototyping<\/strong><\/p>\n<ul>\n<li><strong>Recommended Pipeline<\/strong>: <strong>Logistic Regression \u2192 XGBoost \u2192 BERT-base<\/strong><\/li>\n<li><strong>Why<\/strong>: Enables iterative refinement \u2014 start simple and scale complexity only if needed.<\/li>\n<li><strong>Use Case<\/strong>: Early-stage experimentation, category testing, or resource-phased projects.<\/li>\n<\/ul>\n<p><strong>Not Recommended<\/strong><\/p>\n<ul>\n<li><strong>RoBERTa-base<\/strong>: Poor F1-scores (as low as 57), long training\/inference time, no performance edge.<\/li>\n<li><strong>Keyword-Based Methods<\/strong>: Fast to implement but low accuracy (F1 \u2248 53\u201362) and surprisingly slow inference.<\/li>\n<\/ul>\n<p style=\"text-align: justify;\"><strong>Key Takeaway<\/strong>: The best model for deployment depends on data size, infrastructure constraints, and accuracy needs. XGBoost is optimal for general production, Logistic Regression excels under limited resources, and BERT-base is preferred when accuracy is paramount and GPU compute is available. 
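<\/p>\n<p style=\"text-align: justify;\">As an illustration of the recommended starting point, the sketch below trains a TF-IDF + Logistic Regression baseline with scikit-learn. The documents, labels, and parameter values are hypothetical placeholders, not the benchmark setup:<\/p>

```python
# Hypothetical mini-corpus for illustration only; the real benchmark
# uses 7,000-14,000-word documents across 11 academic categories.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

docs = [
    "gradient descent trains deep neural networks on image data",
    "convolutional neural networks achieve strong image recognition accuracy",
    "transformer models dominate natural language processing benchmarks",
    "the court upheld the contract clause in the appeal ruling",
    "the judge dismissed the lawsuit over the licensing contract",
    "legal counsel reviewed the liability clauses of the agreement",
]
labels = ["cs", "cs", "cs", "law", "law", "law"]

# TF-IDF over uni- and bigrams feeds a linear classifier: the cheap
# baseline to beat before escalating to XGBoost or BERT.
clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)),
    ("lr", LogisticRegression(max_iter=1000)),
])
clf.fit(docs, labels)
print(clf.predict(["the judge ruled the contract clause invalid"]))
```

<p style=\"text-align: justify;\">Swapping the final step for <code>xgboost.XGBClassifier<\/code> (with integer-encoded labels) follows the same pattern when more accuracy is needed.<\/p>\n<p style=\"text-align: justify;\">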
Defaulting to complexity is discouraged \u2014 empirical evidence supports traditional ML for many real-world use cases.<\/p>\n<h2 id=\"faq\">6. 
Frequently Asked Questions<\/h2>\n<h4>What is the best method for long document classification?<\/h4>\n<p>It depends on your data size and deployment environment:<\/p>\n<ul>\n<li><strong>XGBoost<\/strong> delivers the highest accuracy overall (F1 = 86 on full dataset) and is the best general-purpose choice.<\/li>\n<li><strong>BERT-base<\/strong> outperforms other models on mid-sized datasets (F1 = 82 on Balanced-140) when GPU infrastructure is available.<\/li>\n<li><strong>Logistic Regression<\/strong> is ideal for resource-constrained environments, offering fast training and strong performance with minimal compute.<\/li>\n<\/ul>\n<h4>How does BERT compare to traditional machine learning for document classification?<\/h4>\n<p>BERT-base offers higher accuracy on moderately sized datasets (like Balanced-140), but requires 10\u201350\u00d7 more computational resources than traditional models.<\/p>\n<ul>\n<li><strong>BERT<\/strong>: F1 up to 82, but needs GPUs and longer training time (~45 minutes).<\/li>\n<li><strong>XGBoost<\/strong>: Similar or better performance on larger or smaller datasets with much faster runtime and CPU-only operation.<\/li>\n<\/ul>\n<p>Use BERT when maximum accuracy is required and you have GPU capacity. 
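<\/p>\n<p>Part of BERT&#8217;s extra cost comes from its 512-token input limit: long documents must first be split into overlapping windows and the per-window predictions aggregated. A generic sketch of that windowing step (illustrative only; the window and stride values are assumptions, not the exact preprocessing used in this benchmark):<\/p>

```python
def chunk_tokens(tokens, window=512, stride=256):
    """Split a token sequence into overlapping fixed-size windows.

    Each window holds at most `window` tokens; consecutive windows
    overlap by `window - stride` tokens so context is preserved at
    the boundaries. Per-window predictions can then be aggregated,
    e.g. by majority vote or by averaging class probabilities.
    """
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # the last window already reaches the document's end
        start += stride
    return chunks

# A 1,200-token document yields 4 overlapping 512-token windows
# (the final one is shorter).
windows = chunk_tokens(list(range(1200)))
print(len(windows), len(windows[0]), len(windows[-1]))
```

<p>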
Use traditional ML for speed, simplicity, and cost-efficiency.<\/p>\n<h4>What document length is considered &#8220;long&#8221; in NLP?<\/h4>\n<p>Any document over 1,000 words (\u22482+ pages) is considered &#8220;long.&#8221; Many benchmark datasets include documents ranging from 7,000 to 14,000 words (\u224814\u201328 pages).<\/p>\n<p>Long documents pose special challenges:<\/p>\n<ul>\n<li>Token length limitations (e.g., BERT: 512 tokens)<\/li>\n<li>Computational cost<\/li>\n<li>Maintaining contextual coherence over multiple pages<\/li>\n<\/ul>\n<h4>How much training data do you need for document classification?<\/h4>\n<p style=\"text-align: justify;\">Minimum requirements: 100 samples per class for basic models, 1,000+ for optimal XGBoost performance, 2,000+ for transformer models.<\/p>\n<h4>Can you classify documents in real-time?<\/h4>\n<p>Yes \u2014 with Logistic Regression and XGBoost, which offer sub-second inference times even on large documents.<\/p>\n<ul>\n<li><strong>XGBoost<\/strong>: Inference \u2248 0.08s<\/li>\n<li><strong>Logistic Regression<\/strong>: Inference \u2248 0.01s<\/li>\n<li><strong>BERT-base<\/strong>: Inference \u2248 2\u20133 minutes (not suitable for real-time unless optimized)<\/li>\n<\/ul>\n<h4>What&#8217;s the accuracy difference between simple and complex methods?<\/h4>\n<ul>\n<li><strong>Simple methods (e.g., keyword-based)<\/strong>: F1 \u2248 53\u201362<\/li>\n<li><strong>Traditional ML (e.g., XGBoost, Logistic Regression)<\/strong>: F1 \u2248 71\u201386<\/li>\n<li><strong>Transformer-based (e.g., BERT, DistilBERT)<\/strong>: F1 \u2248 75\u201382<\/li>\n<\/ul>\n<p style=\"text-align: justify;\">Performance gaps are narrower than expected. 
Complex models do not always outperform simpler ones, especially in real-world scenarios with limited resources.<\/p>\n<h4>When should I choose BERT over XGBoost or Logistic Regression?<\/h4>\n<p>Choose BERT-base when:<\/p>\n<ul>\n<li>You have access to GPU hardware (\u22652GB VRAM)<\/li>\n<li>Your dataset is moderately sized (e.g., 100\u20131,000 samples\/class)<\/li>\n<li>You need maximum accuracy, such as for legal, compliance, or critical academic categorization<\/li>\n<li>You&#8217;re working in a research setting where computational cost is acceptable<\/li>\n<\/ul>\n<p>Avoid BERT if:<\/p>\n<ul>\n<li>You lack GPU resources<\/li>\n<li>You require real-time classification<\/li>\n<li>You&#8217;re building resource-constrained applications (mobile, embedded, edge)<\/li>\n<\/ul>\n<p>In those cases, XGBoost or Logistic Regression is more practical.<\/p>\n<h4>Why does RoBERTa-base perform poorly despite its strong reputation?<\/h4>\n<p>RoBERTa-base surprisingly underperformed in this benchmark (F1 \u2248 57\u201364), likely due to:<\/p>\n<ul>\n<li>Poor generalization to long-document chunking without specialized preprocessing<\/li>\n<li>Lack of task-specific fine-tuning or inadequate learning under low-resource conditions<\/li>\n<li>Sensitivity to document structure or sequence distribution in academic corpora<\/li>\n<\/ul>\n<p style=\"text-align: justify;\">This result highlights the need for empirical validation \u2014 model reputation doesn&#8217;t guarantee performance in all domains. RoBERTa may still perform well with custom fine-tuning, but in this study, it was neither efficient nor accurate.<\/p>\n<h2 id=\"conclusion\">7. Conclusion<\/h2>\n<p style=\"text-align: justify;\">This benchmark study presents a comprehensive evaluation of traditional and modern approaches for <strong>long document classification<\/strong> across a range of dataset sizes and resource constraints. 
Contrary to common assumptions, our findings reveal that <strong>complex transformer models do not always outperform simpler machine learning methods<\/strong>, particularly in real-world deployment conditions.<\/p>\n<h3>Key Findings Summary<\/h3>\n<ol>\n<li style=\"text-align: justify;\"><strong>XGBoost stands out as the most robust and scalable solution overall<\/strong>, achieving the highest F1 score (86) on full-size datasets and demonstrating consistent performance across varying sample sizes. It offers excellent computational efficiency, making it well-suited for production environments that handle large document collections. Nonetheless, it also performs acceptably on smaller datasets\u2014for instance, achieving an F1 score of 81 on Balanced-140.<\/li>\n<li style=\"text-align: justify;\"><strong>BERT-base delivers the highest accuracy on mid-sized datasets<\/strong> (e.g., F1 = 82 on Balanced-140), outperforming XGBoost in that setting. However, it requires <strong>GPU infrastructure<\/strong> and incurs significant training and inference time, making it ideal for <strong>research or high-stakes applications<\/strong> where resource availability is not a limiting factor.<\/li>\n<li style=\"text-align: justify;\"><strong>Logistic Regression remains an outstanding choice for resource-constrained environments<\/strong>. 
It trains in under 20 seconds, infers almost instantly, and achieves competitive F1-scores (up to 83), making it ideal for rapid prototyping, embedded systems, and edge deployment.<\/li>\n<li style=\"text-align: justify;\"><strong>RoBERTa-base consistently underperformed<\/strong>, despite its reputation, with F1-scores as low as 57. This underscores the need for <strong>empirical benchmarking<\/strong> rather than relying solely on perceived model strength.<\/li>\n<li style=\"text-align: justify;\"><strong>Keyword-based and similarity-based methods are inadequate for complex, multi-class long document classification<\/strong>, despite their simplicity and fast setup. Their low accuracy and unexpectedly long inference times make them unsuitable for serious deployment.<\/li>\n<\/ol>\n<h3>Strategic Recommendations<\/h3>\n<ul>\n<li style=\"text-align: justify;\"><strong>Start with traditional ML models<\/strong> like Logistic Regression or XGBoost. 
They offer strong performance with minimal overhead and allow for fast iteration.<\/li>\n<li style=\"text-align: justify;\"><strong>Use BERT-base<\/strong> only when marginal accuracy improvements are mission-critical <strong>and GPU resources are available<\/strong>.<\/li>\n<li style=\"text-align: justify;\"><strong>Avoid overcomplicating early stages<\/strong> of model selection \u2014 the results show that simple models often deliver surprisingly competitive results for long-text classification.<\/li>\n<li style=\"text-align: justify;\">Carefully match your model to your <strong>specific deployment scenario<\/strong>, considering the balance between <strong>accuracy, runtime, memory requirements, and data availability<\/strong>.<\/li>\n<\/ul>\n<h3>Future Research Directions<\/h3>\n<p>Several areas merit deeper investigation:<\/p>\n<ul>\n<li style=\"text-align: justify;\"><strong>Domain-adaptive fine-tuning and chunking strategies<\/strong> for transformer models<\/li>\n<li style=\"text-align: justify;\">Exploration of <strong>hybrid pipelines<\/strong> that combine fast traditional ML backbones with transformer-based reranking or refinement<\/li>\n<li style=\"text-align: justify;\">Investigation into <strong>why RoBERTa underperforms<\/strong> and whether task-specific adaptations could recover its potential<\/li>\n<li style=\"text-align: justify;\">Evaluation of <strong>new long-context transformers (e.g., Longformer, BigBird)<\/strong> on this 
benchmark<\/li>\n<\/ul>\n<h3>Final Takeaway<\/h3>\n<p style=\"text-align: justify;\">This benchmark challenges the belief that model complexity is always justified. In reality, <strong>traditional ML models can deliver excellent performance for long document classification<\/strong> \u2014 often matching or surpassing transformers in both accuracy and speed, with 10\u00d7 less computational cost.<\/p>\n<p style=\"text-align: justify;\">The key to success lies not in chasing the most powerful model, but in <strong>choosing the right model for your specific data, constraints, and goals<\/strong>.<\/p>\n<h3>References<\/h3>\n<p>Campos, R., Mangaravite, V., Pasquali, A., Jorge, A., Nunes, C. and Jatowt, A. (2020) &#8216;YAKE! Keyword Extraction from Single Documents Using Multiple Local Features&#8217;, <em>Information Sciences<\/em>, 509, pp. 257-289.<\/p>\n<p>Chen, T. and Guestrin, C. (2016) &#8216;XGBoost: A scalable tree boosting system&#8217;, in <em>Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining<\/em>. New York: ACM, pp. 785-794.<\/p>\n<p>Devlin, J., Chang, M.W., Lee, K. and Toutanova, K. (2019) &#8216;BERT: Pre-training of deep bidirectional transformers for language understanding&#8217;, in <em>Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)<\/em>. Minneapolis: Association for Computational Linguistics, pp. 4171-4186.<\/p>\n<p>Genkin, A., Lewis, D.D. and Madigan, D. (2005) <em>Sparse logistic regression for text categorization<\/em>. DIMACS Working Group on Monitoring Message Streams Project Report. New Brunswick: Rutgers University.<\/p>\n<p>Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L. and Stoyanov, V. 
(2019) &#8216;RoBERTa: A robustly optimized BERT pretraining approach&#8217;, <em>arXiv preprint arXiv:1907.11692<\/em> [Preprint]. Available at: <a href=\"https:\/\/arxiv.org\/abs\/1907.11692\" target=\"_blank\" rel=\"noopener\">https:\/\/arxiv.org\/abs\/1907.11692<\/a>.<\/p>\n<p>Sanh, V., Debut, L., Chaumond, J. and Wolf, T. (2019) &#8216;DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter&#8217;, <em>arXiv preprint arXiv:1910.01108<\/em> [Preprint]. Available at: <a href=\"https:\/\/arxiv.org\/abs\/1910.01108\" target=\"_blank\" rel=\"noopener\">https:\/\/arxiv.org\/abs\/1910.01108<\/a>.<\/p>\n<h3>Download Resources and Libraries<\/h3>\n<ul>\n<li id=\"ref1\"><a href=\"https:\/\/github.com\/LiqunW\/Long-document-dataset\" target=\"_blank\" rel=\"noopener\"> Liqun W. (2021). Long Document Dataset \u2013 GitHub Repository<br \/>\n<\/a><\/li>\n<li id=\"ref3\"><a href=\"https:\/\/paperswithcode.com\/paper\/gorc-a-large-contextual-citation-graph-of\" target=\"_blank\" rel=\"noopener\"> S2ORC (Semantic Scholar Open Research Corpus)<br \/>\n<\/a><\/li>\n<li id=\"ref4\"><a href=\"https:\/\/github.com\/FiscalNote\/BillSum\" target=\"_blank\" rel=\"noopener\"> Congressional and California state bills \u2013 GitHub Repository<br \/>\n<\/a><\/li>\n<li id=\"ref5\"><a href=\"https:\/\/paperswithcode.com\/dataset\/govreport\" target=\"_blank\" rel=\"noopener\"> GOVREPORT &#8211; long document summarization dataset<br \/>\n<\/a><\/li>\n<li id=\"ref6\"><a href=\"https:\/\/www.kaggle.com\/datasets\/theatticusproject\/atticus-open-contract-dataset-aok-beta\" target=\"_blank\" rel=\"noopener\"> CUAD (Contract Understanding Atticus Dataset)<br \/>\n<\/a><\/li>\n<li id=\"ref7\"><a href=\"https:\/\/registry.opendata.aws\/mimiciii\/\" target=\"_blank\" rel=\"noopener\"> MIMIC-III (Medical Information Mart for Intensive Care)<br \/>\n<\/a><\/li>\n<li id=\"ref8\"><a href=\"https:\/\/www.kaggle.com\/datasets\/jamesglang\/sec-edgar-company-facts-september2023\" 
target=\"_blank\" rel=\"noopener\">SEC 10 K\/Q Filings<br \/>\n<\/a><\/li>\n<li id=\"ref9\"><a href=\"https:\/\/pypi.org\/project\/scikit-learn\/1.3.0\/\" target=\"_blank\" rel=\"noopener\">scikit-learn 1.3.0<br \/>\n<\/a><\/li>\n<li id=\"ref10\"><a href=\"https:\/\/pypi.org\/project\/transformers\/4.38.0\/\" target=\"_blank\" rel=\"noopener\">transformers 4.38.0<br \/>\n<\/a><\/li>\n<li id=\"ref11\"><a href=\"https:\/\/pypi.org\/project\/torch\/\" target=\"_blank\" rel=\"noopener\">torch 2.7.1<br \/>\n<\/a><\/li>\n<li id=\"ref12\"><a href=\"https:\/\/pypi.org\/project\/xgboost\/\" target=\"_blank\" rel=\"noopener\">xgboost 3.0.2<br \/>\n<\/a><\/li>\n<li id=\"ref19\"><a href=\"https:\/\/github.com\/Procycons\/Long-Document-Classification-Benchmark.git\" target=\"_blank\" rel=\"noopener\">All code and configurations related to the long document classification Benchmark 2025 \u2013 GitHub Repository<br \/>\n<\/a><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>Complete evaluation of document classification approaches from keywords to GPT-4. 
Find the optimal balance between accuracy, speed, and computational requirements.<\/p>\n","protected":false},"author":2,"featured_media":16787,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[8],"tags":[58,86,89,91,87,90,88],"class_list":["post-16660","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-digitization","tag-ai","tag-document-classification","tag-large-language-model","tag-llm","tag-nlp","tag-transformer","tag-xgboost"],"acf":[],"_links":{"self":[{"href":"https:\/\/procycons.com\/en\/wp-json\/wp\/v2\/posts\/16660","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/procycons.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/procycons.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/procycons.com\/en\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/procycons.com\/en\/wp-json\/wp\/v2\/comments?post=16660"}],"version-history":[{"count":53,"href":"https:\/\/procycons.com\/en\/wp-json\/wp\/v2\/posts\/16660\/revisions"}],"predecessor-version":[{"id":17770,"href":"https:\/\/procycons.com\/en\/wp-json\/wp\/v2\/posts\/16660\/revisions\/17770"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/procycons.com\/en\/wp-json\/wp\/v2\/media\/16787"}],"wp:attachment":[{"href":"https:\/\/procycons.com\/en\/wp-json\/wp\/v2\/media?parent=16660"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/procycons.com\/en\/wp-json\/wp\/v2\/categories?post=16660"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/procycons.com\/en\/wp-json\/wp\/v2\/tags?post=16660"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}