The Future Of Vision-Language Models
Vision-language models are improving at reading charts and answering image questions. But this progress comes with a cost: more visual detail means more compute, more memory, and slower inference.
A new method, "Reroute, Don't Remove," proposes a change in how we cut this cost. Instead of deleting low-priority visual tokens, it defers them so they can re-enter the candidate pool later. This shifts the trade-off between model performance and compute cost.
Here's how it works:
- The model turns an image into patch-level embeddings and feeds them into the decoder alongside text tokens.
- This creates a scaling problem: higher-resolution images create more tokens, and attention cost grows with sequence length.
- The new method treats reduction as a routing problem, not a deletion problem. Tokens that are selected pass through the current decoder block, while deferred tokens wait for the next routing decision.
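The routing idea above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the method's actual implementation: the names `route` and `run_blocks`, the top-k selection rule, the toy `halve` block, and the coordinate-based scorers are all hypothetical stand-ins, since the post does not describe how the real router scores tokens.

```python
from typing import Callable, List, Sequence, Tuple

Token = List[float]


def route(scores: Sequence[float], k: int) -> Tuple[List[int], List[int]]:
    """Split token indices into (selected, deferred) using top-k scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k]), sorted(order[k:])


def run_blocks(
    tokens: Sequence[Token],
    blocks: Sequence[Callable[[Token], Token]],
    scorers: Sequence[Callable[[Token], float]],
    k: int,
) -> Tuple[List[Token], List[List[int]]]:
    """Run each block on its selected tokens only; deferred tokens wait."""
    pool = [list(t) for t in tokens]  # the candidate pool is never shrunk
    history: List[List[int]] = []
    for block, scorer in zip(blocks, scorers):
        selected, _deferred = route([scorer(t) for t in pool], k)
        for i in selected:
            pool[i] = block(pool[i])  # only the active set is processed
        history.append(selected)  # deferred tokens are re-scored next round
    return pool, history


def halve(token: Token) -> Token:
    """Toy stand-in for a decoder block (assumption, not a real layer)."""
    return [x * 0.5 for x in token]


# Hypothetical routers: each one reads a single coordinate as the priority.
tokens = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.95], [0.2, 0.3]]
out, history = run_blocks(
    tokens,
    blocks=[halve, halve],
    scorers=[lambda t: t[0], lambda t: t[1]],
    k=2,
)
print(history)  # prints [[0, 1], [2, 3]]
```

In this toy run, tokens 2 and 3 are deferred at the first block and selected at the second, which is the behavior a delete-based scheme would rule out.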
The benefits of this approach include:
- A smaller active set of tokens at each stage
- Preserving the chance to recover visually important tokens later
- Reducing the risk of losing grounding information too early
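To get a feel for what a smaller active set buys, here is a rough back-of-the-envelope calculation. It assumes standard dense self-attention, where the number of query-key pairs grows with the square of the sequence length, and it assumes deferred tokens are left out of a block's attention entirely. The patch size, image sizes, text length, and 25% keep ratio are hypothetical values chosen for illustration, not figures from the method.

```python
def num_patches(height: int, width: int, patch: int) -> int:
    """Number of patch tokens for an image split into square patches."""
    return (height // patch) * (width // patch)


def attention_pairs(visual_tokens: int, text_tokens: int) -> int:
    """Query-key pairs in one dense self-attention layer."""
    n = visual_tokens + text_tokens
    return n * n


PATCH = 14        # hypothetical patch size
TEXT_TOKENS = 64  # hypothetical prompt length

low_res = num_patches(336, 336, PATCH)    # 576 visual tokens
high_res = num_patches(672, 672, PATCH)   # 2304 visual tokens
active = high_res // 4                    # 576 tokens if 25% are selected

full_cost = attention_pairs(high_res, TEXT_TOKENS)   # 5607424 pairs
routed_cost = attention_pairs(active, TEXT_TOKENS)   # 409600 pairs

print(low_res, high_res, active)
print(full_cost, routed_cost)
```

Under these assumptions, doubling the image side length quadruples the visual tokens, and routing a quarter of them per block brings the per-block attention cost back to roughly the low-resolution level while the other tokens remain available for later blocks.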
This approach is part of a wider shift in multimodal efficiency research, moving toward methods that are more adaptive and task-aware.
If you work on multimodal systems, consider the following rules of thumb:
- A faster model that misses the relevant object is not better for many real workflows.
- Preserve room for later recovery if your task needs iterative reasoning.
- A layered routing scheme can be easier to reason about than a one-shot keep-or-delete choice.
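The contrast between a one-shot keep-or-delete choice and layered routing can be made concrete with a small comparison. This is a sketch under assumptions: the per-layer scores are fixed, made-up numbers (a real router would compute them from hidden states), and `one_shot_delete` and `layered_reroute` are hypothetical helper names.

```python
from typing import List, Sequence


def top_k(scores: Sequence[float], candidates: Sequence[int], k: int) -> List[int]:
    """Indices of the k highest-scoring candidates, in ascending index order."""
    ranked = sorted(candidates, key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def one_shot_delete(per_layer_scores: Sequence[Sequence[float]], k: int) -> List[List[int]]:
    """Prune once at the first layer; every later layer sees only the survivors."""
    everyone = list(range(len(per_layer_scores[0])))
    survivors = top_k(per_layer_scores[0], everyone, k)
    return [list(survivors) for _ in per_layer_scores]


def layered_reroute(per_layer_scores: Sequence[Sequence[float]], k: int) -> List[List[int]]:
    """Re-select from the full pool at every layer, so deferred tokens can return."""
    everyone = list(range(len(per_layer_scores[0])))
    return [top_k(scores, everyone, k) for scores in per_layer_scores]


# Hypothetical scores: token 2 looks unimportant at layer 0 but matters at layer 1.
scores = [
    [0.9, 0.7, 0.2, 0.1],
    [0.3, 0.4, 0.8, 0.1],
]
deleted = one_shot_delete(scores, k=2)
rerouted = layered_reroute(scores, k=2)
print(deleted)   # prints [[0, 1], [0, 1]]
print(rerouted)  # prints [[0, 1], [1, 2]]
```

Both schemes process two tokens per layer, but only the rerouting variant can pick up token 2 once it becomes relevant.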
Source: https://dev.to/prabhakar_chaudhary_7afe4/why-vision-language-models-should-reroute-not-remove-visual-tokens-3020
Optional learning community: https://t.me/GyaanSetuAi