The bottleneck nobody wanted to admit: document ingestion
When we talk about optimizing RAG systems, the first instinct is to tweak embeddings, swap out the reranking model, or adjust the prompt sent to the LLM. That makes sense, since those are the most visible parts of the pipeline. But there is an earlier stage that determines the fate of everything that comes after, and it tends to be neglected: document ingestion. This is the phase where the original document (a PDF, a spreadsheet, a financial report packed with tables, headers, and footnotes) gets transformed into chunks of plain text. And this is exactly where most implementations lose context, merge sections that should not be together, and slice tables in ways no language model can reconstruct afterward.
POMA AI, a Berlin-based startup focused on document intelligence, decided to tackle this problem head-on. Instead of treating chunking as a generic step that slices text by character or token count, the company developed an approach that preserves the structural hierarchy of a document throughout the entire chunking process. As Dr. Alexander Kihm, founder and CEO of POMA AI, put it: every RAG system in production today loses information before the model even sees it. According to him, the industry has been optimizing embeddings, rerankers, and prompt engineering, but the ingestion layer is where most retrieval failures actually originate.
POMA-OfficeQA: the benchmark that puts numbers on the problem
The results of this approach just got concrete numbers with the release of POMA-OfficeQA, an open-source benchmark available on GitHub that evaluates RAG chunking quality on real documents from the United States Treasury. It covers approximately 2,150 pages spread across 14 official U.S. Treasury financial bulletins, carrying all the complexity that type of material brings: dense tables, heading hierarchies, cross-references between sections, and formatting that varies from page to page.
The benchmark does not compare different language models or swap out the vector search engine. It keeps everything the same — same embeddings, same retrieval logic, same 20 table-query questions — and only changes the chunking method. This isolates the real impact of how the document is sliced, eliminating variables that could muddy the analysis. All three tested approaches used OpenAI’s text-embedding-3-large model for embeddings and cosine similarity for retrieval ranking.
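To make the fixed part of that setup concrete, here is a minimal sketch of ranking chunks by cosine similarity. It is not the benchmark's code: the function names are hypothetical, and the three-dimensional toy vectors stand in for real embeddings, which would come from an embedding model such as text-embedding-3-large.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def rank_chunks(query_vec, chunk_vecs):
    """Return chunk indices ordered from most to least similar to the query."""
    scores = [cosine_similarity(query_vec, v) for v in chunk_vecs]
    return sorted(range(len(chunk_vecs)), key=lambda i: scores[i], reverse=True)


# Toy vectors (assumed for illustration, not real embeddings).
toy_chunks = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]]
toy_query = [0.9, 0.1, 0.0]
ranking = rank_chunks(toy_query, toy_chunks)
```

Because this ranking step is identical for every method, any difference in the results has to come from what the chunks themselves contain.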
The central metric of the benchmark is context recall, which measures the minimum token budget a retrieval system needs to guarantee that all required evidence is available in the retrieved context. The ground truth was established using exact chunk indices verified against the original documents, eliminating false positives from accidental numerical matches. On top of that, only questions that all three approaches successfully answered were included in the comparison, and questions where any method showed extraction failures — such as OCR errors or missing values — were excluded to ensure a fair comparison.
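One way to read that metric is sketched below, under the assumption that the budget is the running token total at the point where the last required chunk appears in the ranked list. The function name and the toy data are hypothetical, and the benchmark's actual implementation may differ.

```python
def min_token_budget(ranked_chunks, required_ids):
    """Tokens that must be retrieved, in ranked order, to cover all evidence.

    ranked_chunks: list of (chunk_id, token_count) pairs, best match first.
    required_ids:  chunk ids that the ground truth marks as necessary.
    Returns None if the ranking never surfaces every required chunk.
    """
    remaining = set(required_ids)
    if not remaining:
        return 0
    total = 0
    for chunk_id, token_count in ranked_chunks:
        total += token_count
        remaining.discard(chunk_id)
        if not remaining:
            return total
    return None


# Toy ranking (assumed values): the evidence sits in chunks 3 and 9.
ranked = [(7, 120), (3, 200), (9, 80), (1, 150)]
budget = min_token_budget(ranked, {3, 9})
```

Under this reading, a chunking method lowers the budget either by ranking the evidence earlier or by packing it into fewer, denser chunks.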
And the numbers are pretty revealing. POMA AI’s hierarchical, structure-aware approach matched the answer quality of the other two methods while needing a fraction of the tokens:
- Baseline (naive chunking, 500-token chunks with a 100-token overlap): 1.45 million tokens
- Unstructured.io (element-based extraction): 1.48 million tokens
- POMA AI (structure-aware chunking): 340 thousand tokens
That is roughly a 77% token reduction against both comparison methods in the default configuration, with no loss in answer accuracy. The reduction climbs to 83% when custom configurations are applied to POMA PrimeCut, the company’s processing tool.
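The 77% figure can be checked directly from the three totals listed above. This is plain arithmetic on the published numbers; the helper name is hypothetical.

```python
def reduction_pct(baseline_tokens, candidate_tokens):
    """Percentage of tokens saved relative to a baseline."""
    return 100.0 * (1 - candidate_tokens / baseline_tokens)


naive_tokens = 1_450_000
unstructured_tokens = 1_480_000
poma_tokens = 340_000

vs_naive = reduction_pct(naive_tokens, poma_tokens)
vs_unstructured = reduction_pct(unstructured_tokens, poma_tokens)

print(f"vs naive baseline:  {vs_naive:.1f}%")
print(f"vs Unstructured.io: {vs_unstructured:.1f}%")
```

Both comparisons round to 77%, so the headline number holds against either reference method.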
What structure-aware chunking is and why it changes the game
The concept of structure-aware chunking starts from a simple but powerful premise: documents are not homogeneous blocks of text. They have an internal architecture — titles, subtitles, paragraphs, tables, lists, notes — and that architecture carries semantic information that traditional chunking simply ignores.
When you slice a 50-page PDF into 512-token blocks without considering where sections begin and end, you are destroying contextual relationships that the document author built intentionally. A table showing quarterly revenue data might end up split across two different chunks, and neither one will make sense on its own. A paragraph explaining a regulatory exception might be separated from the section title that gives it context, making the information ambiguous or even useless for the language model generating the final answer.
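The effect is easy to reproduce with a minimal sketch of fixed-size chunking. The function names are hypothetical, and the table's position in the token stream is an assumed example. Chunk boundaries are computed from offsets alone, so nothing stops them from landing inside a table.

```python
def naive_chunk_bounds(n_tokens, size=500, overlap=100):
    """Return (start, end) token offsets for fixed-size chunks, end exclusive."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    bounds = []
    for start in range(0, n_tokens, step):
        end = min(start + size, n_tokens)
        bounds.append((start, end))
        if end == n_tokens:
            break  # the document is fully covered
    return bounds


def fits_in_one_chunk(bounds, span_start, span_end):
    """True if some chunk contains the whole span [span_start, span_end)."""
    return any(s <= span_start and span_end <= e for s, e in bounds)


bounds = naive_chunk_bounds(1200)
# Hypothetical table occupying tokens 350 to 550 of the document.
table_intact = fits_in_one_chunk(bounds, 350, 550)
```

In this toy case the chunks cover offsets 0 to 500, 400 to 900, and 800 to 1200, and no single chunk holds the whole table. Overlap only rescues spans that happen to be short enough and well placed.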
How POMA AI treats the document before slicing
POMA AI addresses this problem by treating the document as a tree structure before any slicing takes place. First, the system identifies the content hierarchy — which text belongs to which section, which tables are associated with which paragraphs, where the logical blocks of information begin and end. Only after this structural analysis does chunking happen, respecting the natural boundaries of the document.
In practice, this means a chunk will never cut a table in half, never separate a title from the content it introduces, and never mix information from different sections into the same block. The result is a smaller, more cohesive, and semantically richer set of chunks. And that is exactly why the pipeline can operate with far fewer tokens: each chunk carries more useful information and less noise, so the system needs to retrieve fewer blocks to answer the same question at the same quality level.
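The general idea can be sketched as a tree walk. This is an illustration only, not POMA AI's algorithm: the `Section` class, the function name, and the toy document are all assumptions. Each chunk holds the blocks of exactly one section, tables are kept as whole blocks, and the heading path travels with the text.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    title: str
    blocks: list = field(default_factory=list)    # paragraphs or whole tables
    children: list = field(default_factory=list)  # nested Section objects


def structure_aware_chunks(section, path=()):
    """Yield one chunk per section, labelled with its full heading path."""
    current = path + (section.title,)
    if section.blocks:
        yield {
            "heading_path": " > ".join(current),
            "text": "\n".join(section.blocks),
        }
    for child in section.children:
        yield from structure_aware_chunks(child, current)


# Toy document with placeholder content.
doc = Section("Bulletin", children=[
    Section("Federal Debt", blocks=[
        "Intro paragraph.",
        "| Item | Amount |\n| A | 1 |",  # a table kept as one block
    ]),
    Section("Receipts", blocks=["Receipts paragraph."]),
])
chunks = list(structure_aware_chunks(doc))
```

A production system would also need to handle sections larger than the embedding model's input limit, which this sketch leaves out.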
The company itself describes its approach as smart hierarchical chunking, emphasizing that this data preparation is the ideal way to feed embeddings into vector databases. Instead of pushing disconnected fragments into the embedding model and hoping vector similarity sorts things out, the strategy ensures each vector represents a coherent, self-contained unit of information.
Open and reproducible validation
This idea is not entirely new in the document processing literature, but POMA AI’s execution at production scale with validation through an open benchmark is a meaningful differentiator. The RAG community had been discussing for some time that fixed-size chunking was a serious limitation, but there was a lack of public, reproducible data to quantify the real impact of more sophisticated alternatives.
POMA-OfficeQA fills that gap and offers a concrete foundation for other teams and companies to compare their own document ingestion strategies against a structured baseline. Anyone can download the benchmark from GitHub, run it with their own pipeline, and identify where their system’s bottlenecks are. This level of openness is rare when it comes to corporate benchmarks and tends to build trust within the technical community.
The practical impact of token reduction at enterprise scale
Cutting 77% of the tokens in a RAG pipeline is not just a nice metric in a paper. It is money, latency, and operational viability. Anyone working with RAG in a corporate environment knows that API costs for language models scale with the number of tokens processed. If you are running thousands of queries per day against extensive document bases like contracts, regulatory reports, or technical manuals, the difference between sending 1.45 million tokens and 340 thousand tokens to the model is massive on the monthly bill.
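A back-of-the-envelope sketch shows how the token totals translate into spend. The price per million input tokens is a made-up placeholder, not a quote from any provider, and the function name is hypothetical.

```python
def input_cost_usd(tokens, usd_per_million_tokens):
    """Cost of sending `tokens` input tokens at a flat per-million price."""
    return tokens / 1_000_000 * usd_per_million_tokens


ASSUMED_PRICE = 2.50  # hypothetical USD per million input tokens

baseline_cost = input_cost_usd(1_450_000, ASSUMED_PRICE)
poma_cost = input_cost_usd(340_000, ASSUMED_PRICE)
savings_share = 1 - poma_cost / baseline_cost

print(f"baseline: ${baseline_cost:.2f}, structure-aware: ${poma_cost:.2f}")
print(f"share of input spend saved: {savings_share:.0%}")
```

The assumed price cancels out of the savings share, so the relative saving on input tokens stays the same whatever the provider charges. Only the absolute amount changes with price and query volume.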
We are talking about a reduction that can make a use case economically viable when it previously just did not pencil out. Beyond the financial cost, there is the latency gain: fewer tokens means less processing time in both the retrieval stage and the response generation, which translates into a significantly smoother user experience.
The long context problem and the indirect advantage
There is also a technical dimension that often goes unnoticed. Language models have limited context windows, and even the most recent models with 128K or 200K token windows show quality degradation when the context gets too long. This is the well-known "lost in the middle" problem, where information positioned in the middle of a lengthy context tends to be ignored or underutilized by the model.
By drastically reducing the volume of tokens sent, POMA AI’s structural RAG chunking not only saves resources but also increases the likelihood that relevant information lands in a favorable position within the context window. In other words, the model receives less content, but at a higher quality, and can leverage it more effectively when generating the response. The benefits can compound: lower cost, lower latency, and potentially higher output quality, all at the same time.
The investor perspective on the structural advantage
Till Faida, co-founder of Adblock Plus and an investor in and advisor to POMA AI, reinforced this point when commenting on the benchmark. According to him, what was convincing about POMA was the engineering rigor behind a seemingly simple insight. Faida highlighted that the company went after the ingestion layer, the exact part of the pipeline everyone assumes is a solved problem. In his view, a 77% reduction in tokens changes the economics of running RAG at enterprise scale, and that is the kind of structural advantage you look for when investing.
Who directly benefits from this technology
For companies scaling their document intelligence operations — banks processing contracts, insurers analyzing policies, law firms reviewing case law, compliance departments navigating regulations — this combination of benefits can be transformative. Any organization dealing with large volumes of structured documents that needs to extract accurate answers at a controlled cost fits squarely in the profile of who wins with this evolution.
POMA AI positions its solution in exactly this niche, offering a document ingestion layer that understands structure before slicing. The fact that the benchmark is open source also signals a smart strategy for building community and technical credibility. Instead of asking the market to trust internal metrics, the company invites anyone to reproduce the results and challenge the methodology.
This level of transparency tends to accelerate adoption, especially among engineering teams that need to justify technical decisions with concrete data to their leadership. This is not an abstract promise of improvement — there is a public repository with code, data, and results that any team can audit before making a decision.
What this means for the future of RAG
The launch of POMA-OfficeQA spotlights a discussion that was mature but lacked quantitative validation: the document ingestion layer needs just as much attention as the language models and retrieval strategies. For a long time, the community focused on the more sophisticated pipeline components (embedding fine-tuning, reranking algorithms, advanced prompt engineering) while chunking was treated almost as a formality. Cutting text every 500 tokens with a 100-token overlap was considered good enough.
The benchmark data shows that good enough can mean spending more than four times the tokens actually needed, so that over three-quarters of the retrieved context is waste. And that waste is not just financial: it brings higher latency and, as contexts grow longer, a greater risk of less accurate answers and an inferior user experience. The message is clear: before investing in more expensive models or more complex retrieval techniques, it is worth looking at how documents are being prepared to enter the pipeline.
POMA AI’s approach with POMA PrimeCut demonstrates that treating documents as hierarchical structures — rather than monolithic blocks of text — generates gains that propagate through the entire chain. It is the kind of optimization that works in favor of every other component instead of competing with them. Structure-aware chunking applied to document processing has moved beyond a theoretical concept and earned practical validation with numbers any engineer can verify on their own 🚀
