
Multimodal in Practice: OCR, Vision, and Real-World Limitations of LLMs

Opinions on Multimodal LLMs in Practice:

Multimodal LLMs are being street-tested, not just hyped. From Qwen3-VL’s OCR and captioning to fully offline assistants on consumer hardware, real-world constraints keep surprising us. [1]

OCR & End-to-End Vision — Vision-based systems for long documents push end-to-end understanding, sometimes skipping explicit OCR steps. DeepSeek-OCR and similar setups show how visual and textual cues can be merged in one pass, with GPT-4.1 acting as the multimodal brain in some demos. [1]
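
The OCR-free pattern is simple in code: hand the page image straight to the multimodal model and ask the question, rather than running OCR first and feeding it text. A minimal sketch, assuming the OpenAI Python SDK with GPT-4.1; the file name and prompt are placeholders, not from the discussion:

```python
# OCR-free document QA sketch: send the page image directly to a multimodal
# model and ask the question. File name and prompt are illustrative.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("page_017.png", "rb") as f:
    page_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "What is the invoice total on this page? Answer only from the image."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{page_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```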

Bounding-Box Realities — Even strong visual LLMs stumble on precise coordinates. A thread on Qwen3-VL notes that bounding boxes can be off unless you calibrate the input resolution (downscaling to 1000x1000 helps). The same discussion points to llama.cpp-based paths for more reliable bounding-box alignment. [3]
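
A hedged sketch of that calibration trick, using Pillow: downscale the screenshot before prompting the model, then map any returned boxes back to the original resolution. The (x1, y1, x2, y2) pixel-box format is an assumption; some Qwen variants return coordinates normalized to a 0-1000 grid, in which case only the rescaling step changes:

```python
# Downscale a screenshot to ~1000x1000 before sending it to the VLM, then map
# any returned bounding boxes back to the original resolution.
from PIL import Image

TARGET = 1000  # side length the model sees

def prepare(path: str):
    """Downscale an image to TARGET x TARGET and remember the scale factors."""
    img = Image.open(path)
    orig_w, orig_h = img.size
    small = img.resize((TARGET, TARGET), Image.LANCZOS)
    return small, (orig_w / TARGET, orig_h / TARGET)

def to_original(box, scale):
    """Map an (x1, y1, x2, y2) box from downscaled to original coordinates."""
    sx, sy = scale
    x1, y1, x2, y2 = box
    return (x1 * sx, y1 * sy, x2 * sx, y2 * sy)

small, scale = prepare("screenshot.png")
small.save("screenshot_1000.png")      # send this file to the model
model_box = (412, 96, 568, 140)        # example box returned by the VLM
print(to_original(model_box, scale))   # box in original screenshot pixels
```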

Native Multimodal Momentum — The Emu3.5 work argues native multimodal models are world learners, aiming to integrate vision and language without a separate adapter stack. The promise is in-model grounding across real tasks, not just benchmarks. [2]

Offline Voice on Hardware — A fully offline setup—Mistral via Ollama with Whisper for STT and Piper for TTS—shows practical latency on a GTX 1650 system: about 10 seconds per cycle with RAG in the loop. It’s a pragmatic, cloud-free path for home setups. [4]
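
A minimal sketch of one such cycle, assuming the openai-whisper and ollama Python packages plus the Piper CLI; model tags, the voice file, and audio file names are placeholders, and the post's RAG retrieval step is omitted for brevity:

```python
# One cycle of an offline voice loop: Whisper for STT, Mistral via the local
# Ollama server for the reply, Piper (CLI) for TTS. All names are placeholders.
import subprocess
import whisper
import ollama

stt = whisper.load_model("base")  # a small model keeps latency down on a GTX 1650

def assistant_cycle(wav_in: str, wav_out: str) -> str:
    # 1. Speech to text, fully local
    text = stt.transcribe(wav_in)["text"].strip()

    # 2. LLM reply from the local Ollama server
    response = ollama.chat(
        model="mistral",
        messages=[{"role": "user", "content": text}],
    )
    reply = response["message"]["content"]

    # 3. Text to speech with the Piper CLI (voice file is an assumption)
    subprocess.run(
        ["piper", "--model", "en_US-lessac-medium.onnx", "--output_file", wav_out],
        input=reply.encode("utf-8"),
        check=True,
    )
    return reply

print(assistant_cycle("question.wav", "answer.wav"))
```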

Sizing & Vision Primitives — Sizing for vision-enabled models is opaque: one listing shows Qwen_Qwen3-VL-2B-Thinking at ~1.83 GB, with the vision stack also needing an ~800 MB mmproj file alongside the main GGUF. It highlights the friction between model size, file formats, and practical deployment. [5]
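
For context, in the llama.cpp ecosystem a local vision model is typically two GGUF files: the language model plus a multimodal projector (mmproj) that encodes images. A hedged sketch using llama-cpp-python's LLaVA-style chat handler; whether Qwen3-VL works with this handler depends on your library version, and both file names are placeholders:

```python
# Load a two-file vision stack (language GGUF + mmproj GGUF) with
# llama-cpp-python and run a single image-plus-text chat turn.
import base64
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

chat_handler = Llava15ChatHandler(clip_model_path="mmproj-model-f16.gguf")  # ~800 MB projector
llm = Llama(
    model_path="Qwen_Qwen3-VL-2B-Thinking-Q4_K_M.gguf",                     # ~1.83 GB language model
    chat_handler=chat_handler,
    n_ctx=4096,
)

with open("screenshot.png", "rb") as f:
    img_uri = "data:image/png;base64," + base64.b64encode(f.read()).decode()

result = llm.create_chat_completion(messages=[{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": img_uri}},
        {"type": "text", "text": "Describe this screenshot."},
    ],
}])
print(result["choices"][0]["message"]["content"])
```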

Reality check: the hype runs ahead of practice, but real-world constraints (coordinate accuracy, offline latency, and file-size friction) shape what multimodal LLMs can do today. [1][2][3][4][5]

References

[1] HackerNews. Discusses OCR-free, vision-based question answering for long documents, using VLMs with multimodal GPT-4.1 as the end-to-end retrieval and QA layer, challenging traditional OCR-first pipelines.

[2] HackerNews. "Emu3.5: Native Multimodal Models Are World Learners." ArXiv paper on the Emu3.5 native multimodal models; discusses world learning, with project and GitHub links.

[3] Reddit. "While Qwen3-vl has very good OCR/image caption abilities, it still doesn't seem to generate accurate coordinates nor bounding boxes of objects in the screen. I just take a screenshot and send as-is and its accuracy is off. Tried resizing, no dice neither. Anyone else have this problem?" Discussion of Qwen3-VL's OCR and coordinate limits, with comparisons to Ollama and llama.cpp and resolution/downscaling strategies for bounding boxes across model variants.

[4] Reddit. "Built a fully offline voice assistant with Mistral + RAG - runs on consumer hardware (GTX 1650)." Offline LLM-powered voice assistant with Mistral 7B, RAG memory, and local STT/TTS; prompt design and parallelization ideas are discussed.

[5] Reddit. "Guys can someone help me understand this difference." User compares vision-enabled model sizes across Hugging Face and LM Studio, notes that an mmproj file is needed, and asks why vision works in one setup but only chat in another.
