𝗧𝗵𝗲 𝗙𝘂𝘁𝘂𝗿𝗲 𝗢𝗳 𝗩𝗶𝘀𝗶𝗼𝗻-𝗟𝗮𝗻𝗴𝘂𝗮𝗴𝗲 𝗠𝗼𝗱𝗲𝗹𝘀

📅4 days ago⏱1 min read

Vision-language models are getting better at reading charts and answering questions about images. But this progress comes at a cost: more visual detail means more compute, more memory, and slower inference.

A new method, Reroute, Don't Remove, proposes a different way to cut this cost. Instead of permanently deleting low-priority visual tokens, it defers them: they skip the current stage but stay available and can re-enter the candidate pool later. The goal is to save compute without committing too early to discarding information the model may still need.

Here's how it works:

The model turns an image into patch-level embeddings and feeds them into the decoder alongside text tokens.
This creates a scaling problem: higher-resolution images produce more tokens, and the cost of standard self-attention grows quadratically with sequence length.
The new method treats reduction as a routing problem, not a deletion problem. Tokens that are selected pass through the current decoder block, while deferred tokens wait for the next routing decision.
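
To make the steps above concrete, here is a minimal sketch in plain Python. It is not the paper's implementation. The patch size of 14, the token scores, and the names num_patch_tokens, attention_pairs, and route are all assumptions chosen for illustration, and the pair count ignores causal masking.

```python
from typing import Dict, List, Tuple


def num_patch_tokens(height: int, width: int, patch: int = 14) -> int:
    """Count non-overlapping patch x patch tiles; each tile becomes one visual token.

    The default patch size of 14 is an assumption, not a value from the source.
    """
    return (height // patch) * (width // patch)


def attention_pairs(num_tokens: int) -> int:
    """Query-key pairs scored by self-attention (simplified: no causal mask)."""
    return num_tokens * num_tokens


def route(scores: Dict[int, float], budget: int) -> Tuple[List[int], List[int]]:
    """Split token ids into (selected, deferred). Nothing is deleted.

    `scores` is a hypothetical per-token importance score; ties break by id.
    """
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return sorted(ranked[:budget]), sorted(ranked[budget:])


# Scaling: doubling each image side quadruples the token count,
# and the pairwise attention work grows with the square of that count.
low = num_patch_tokens(224, 224)
high = num_patch_tokens(448, 448)
print(low, high)                                      # 256 1024
print(attention_pairs(high) // attention_pairs(low))  # 16

# Routing: the two highest-scoring tokens go through the current block,
# the rest are held back for the next routing decision.
selected, deferred = route({0: 0.9, 1: 0.1, 2: 0.8, 3: 0.3}, budget=2)
print(selected, deferred)                             # [0, 2] [1, 3]
```

The point of returning both lists is that the deferred ids remain part of the state, so a later stage can still pick them.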

The benefits of this approach include:

A smaller active set of tokens at each stage
A chance to recover visually important tokens later
Less risk of losing grounding information too early
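
A toy comparison can show why these benefits matter. The sketch below contrasts a one-shot delete policy with a reroute policy over two stages. The scores are made up, the function names prune and reroute are hypothetical, and the real method's scoring and scheduling may differ.

```python
from typing import Dict, List, Set


def top_k(scores: Dict[int, float], candidates: Set[int], k: int) -> Set[int]:
    """Pick the k highest-scoring candidate ids; ties break by id."""
    ranked = sorted(candidates, key=lambda t: (-scores[t], t))
    return set(ranked[:k])


def prune(stage_scores: List[Dict[int, float]], k: int) -> List[Set[int]]:
    """Remove: tokens not selected are deleted and never considered again."""
    pool = set(stage_scores[0])
    history = []
    for scores in stage_scores:
        pool = top_k(scores, pool, k)  # the pool shrinks permanently
        history.append(set(pool))
    return history


def reroute(stage_scores: List[Dict[int, float]], k: int) -> List[Set[int]]:
    """Reroute: unselected tokens are deferred and rejoin the pool next stage."""
    pool = set(stage_scores[0])        # the pool never shrinks
    return [top_k(scores, pool, k) for scores in stage_scores]


# Hypothetical importance scores for 4 tokens at 2 stages.
# Token 1 looks unimportant at stage 0 but becomes relevant at stage 1.
STAGE_SCORES = [
    {0: 0.9, 1: 0.1, 2: 0.8, 3: 0.3},
    {0: 0.2, 1: 0.7, 2: 0.9, 3: 0.1},
]

print(prune(STAGE_SCORES, k=2))    # [{0, 2}, {0, 2}]
print(reroute(STAGE_SCORES, k=2))  # [{0, 2}, {1, 2}]
```

Both policies keep the same active set size per stage, so the per-stage cost is the same in this toy setup. Only the reroute policy can pick up token 1 once it becomes relevant.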

This approach is part of a wider shift in multimodal efficiency research, moving toward methods that are more adaptive and task-aware. If you work on multimodal systems, consider the following rules of thumb:

A faster model that misses the relevant object is not better for many real workflows.
Preserve room for later recovery if your task needs iterative reasoning.
A layered routing scheme can be easier to reason about than a one-shot keep-or-delete choice.

Source: https://dev.to/prabhakar_chaudhary_7afe4/why-vision-language-models-should-reroute-not-remove-visual-tokens-3020
Optional learning community: https://t.me/GyaanSetuAi
