There were two key bottlenecks: visual elements make the text a mess, which leads to (1) poor retrieval and (2) poor understanding by the LLM. Instead of handling each corner case separately, we've developed a RAG pipeline that treats each document as both image and text. This yields a dramatic reduction in model size (an 8B model outperforms a 70B one) and a moderate improvement in quality over the current SOTA.
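To make the "both image and text" idea concrete, here is a minimal sketch of what the retrieval side of such a pipeline could look like. It is not the actual pipeline: the `Page` structure, the `retrieve` function, the weighted-sum fusion, and the `alpha` weight are all assumptions made for illustration, and the embeddings are assumed to come from some text encoder and some page-image encoder that are not shown.

```python
from dataclasses import dataclass
from math import sqrt
from typing import List, Sequence


@dataclass
class Page:
    """Hypothetical record: one document page kept in both modalities."""
    page_id: str
    text: str                 # extracted text (may be garbled by visual elements)
    text_vec: List[float]     # embedding of the extracted text (assumed precomputed)
    image_vec: List[float]    # embedding of the rendered page image (assumed precomputed)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(
    pages: Sequence[Page],
    query_text_vec: Sequence[float],
    query_image_vec: Sequence[float],
    k: int = 2,
    alpha: float = 0.5,
) -> List[Page]:
    """Rank pages by a weighted sum of text and image similarity.

    alpha is an illustrative fusion weight (an assumption, not a value
    from the source): 1.0 means text only, 0.0 means image only.
    """
    scored = [
        (
            alpha * cosine(p.text_vec, query_text_vec)
            + (1 - alpha) * cosine(p.image_vec, query_image_vec),
            p,
        )
        for p in pages
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in scored[:k]]


if __name__ == "__main__":
    # Toy 2-D vectors standing in for real embeddings.
    pages = [
        Page("clean-text", "plain paragraph", [1.0, 0.0], [0.0, 1.0]),
        Page("chart-page", "g4rbl3d t3xt", [0.0, 1.0], [1.0, 0.0]),
    ]
    top = retrieve(pages, [1.0, 0.0], [1.0, 0.0], k=1, alpha=0.3)
    print(top[0].page_id)
```

In this toy setup, a page whose extracted text is garbled can still be retrieved through its image embedding, which is the failure mode the first bottleneck describes. In a full pipeline, the retrieved page's image and text would presumably both be passed to a multimodal LLM, addressing the second bottleneck; that step is omitted here.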