DeepSeek on Monday released a new multimodal artificial intelligence model that can handle large and complex documents with significantly fewer tokens – the smallest units of text that a model processes – by using visual perception as a compression medium for information.
The open-source DeepSeek-OCR (optical character recognition) model, available via the online developer platforms Hugging Face and GitHub, grew out of an “investigation into the role of vision encoders” in compressing text for large language models (LLMs), the Hangzhou-based AI start-up said in a blog post.
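For readers who want to experiment, the model can be pulled from Hugging Face using the library’s standard custom-code loading pattern. The sketch below assumes that pattern: the repository id is the one DeepSeek published, but the infer() call and prompt string are assumptions about the model’s custom API, so the model card should be treated as the authority.

```python
# Minimal loading sketch for DeepSeek-OCR via Hugging Face transformers.
# trust_remote_code=True is required because the model ships its own
# modelling code. The infer() call and prompt below are assumptions
# about the custom API - consult the model card for the real signature.
import torch
from transformers import AutoModel, AutoTokenizer

repo = "deepseek-ai/DeepSeek-OCR"
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModel.from_pretrained(repo, trust_remote_code=True)
model = model.eval().cuda().to(torch.bfloat16)

# Hypothetical usage: transcribe a single document page.
result = model.infer(
    tokenizer,
    prompt="<image>\nFree OCR.",  # prompt format assumed, not verified
    image_file="page.png",        # path to a rendered document page
)
print(result)
```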
That approach would let LLMs process massive amounts of text without incurring a proportional increase in computing cost.
“Through DeepSeek-OCR, we demonstrated that vision-text compression can achieve significant token reduction – seven to 20 times – for different historical context stages, offering a promising direction” to address long-context challenges in LLMs, the company said.
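As a back-of-the-envelope illustration of that range – only the seven-to-20-times ratio comes from DeepSeek, while the document length here is hypothetical – a document worth 10,000 text tokens would shrink to somewhere between 500 and roughly 1,400 vision tokens:

```python
# Rough arithmetic on DeepSeek's reported 7x-20x vision-text compression.
# The 10,000-token document is a hypothetical example.
text_tokens = 10_000

for ratio in (7, 20):
    vision_tokens = text_tokens / ratio
    print(f"{ratio}x compression: ~{vision_tokens:,.0f} vision tokens "
          f"instead of {text_tokens:,} text tokens")
```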
The release reflected DeepSeek’s steadfast efforts to raise the efficiency of AI models while driving down the costs of building and using them – a principle the company followed in developing its breakthrough open-source models V3 and R1, released in December and January, respectively.
According to the company’s blog post, DeepSeek-OCR consists of two main components: DeepEncoder, a vision encoder that compresses document images into a small number of vision tokens, and DeepSeek3B-MoE-A570M, a mixture-of-experts decoder that reconstructs the text from those tokens.
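In outline, a rendered page goes through the encoder to produce a short sequence of vision tokens, which the decoder then expands back into text. The skeleton below is purely illustrative – the function names, token counts and return values are placeholders, not DeepSeek’s code:

```python
# Illustrative two-stage skeleton of the encoder -> decoder pipeline
# described in the blog post. Every name and value is a placeholder.

def deep_encoder(page_image: bytes) -> list[str]:
    """Stand-in for DeepEncoder: compress a rendered document page
    into a short sequence of vision tokens."""
    return ["<vision_token>"] * 100  # e.g. ~100 tokens for a full page

def moe_decoder(vision_tokens: list[str]) -> str:
    """Stand-in for DeepSeek3B-MoE-A570M: autoregressively reconstruct
    the page's text from the compressed vision tokens."""
    return "recovered page text ..."

page_image = b""  # a real rendered page image would go here
print(moe_decoder(deep_encoder(page_image)))
```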