𝗠𝗲𝗰𝗵𝗮𝗻𝗶𝘀𝘁𝗶𝗰 𝗜𝗻𝘁𝗲𝗿𝗽𝗿𝗲𝘁𝗮𝗯𝗶𝗹𝗶𝘁𝘆: 𝗜𝗻𝘀𝗶𝗱𝗲 𝗧𝗿𝗮𝗻𝘀𝗳𝗼𝗿𝗺𝗲𝗿𝘀

📅1 week ago⏱1 min read

Deep learning was a black box. You saw inputs. You saw outputs. You did not know what happened inside.

Mechanistic interpretability changes this. It is reverse engineering for AI. You find the exact steps the network takes. You find the parts doing the work.

Researchers find clear structures inside these networks:

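Induction heads. These attention heads find an earlier copy of the current token and predict the token that followed it.
Curve detectors. These neurons, found in vision models, respond to curves at particular orientations.
Superposition. Networks represent more features than they have neurons. They pack features into overlapping directions. One neuron then responds to many unrelated features.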
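
The induction rule can be written as plain code. This is a minimal sketch of the behavior only, not of the attention mechanism that implements it in a real transformer. The function name induction_predict is made up for this example.

```python
from typing import List, Optional


def induction_predict(tokens: List[str]) -> Optional[str]:
    """Guess the next token with the rule [A][B] ... [A] -> [B].

    Hypothetical helper: it mimics what an induction head does,
    not how the head computes it.
    """
    if not tokens:
        return None
    current = tokens[-1]
    # Scan earlier positions, most recent first, for the same token.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            # Copy whatever followed the earlier occurrence.
            return tokens[i + 1]
    # No earlier occurrence, so the rule has nothing to copy.
    return None


if __name__ == "__main__":
    text = ["Mr", "Dursley", "was", "proud", ".", "Mr"]
    print(induction_predict(text))  # Dursley
```

In a real model this takes at least two attention layers working together. The sketch only shows the input-output pattern researchers look for.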

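The circuit hypothesis says networks compute with circuits. These are small groups of connected parts, such as neurons and attention heads. Remove a circuit and see if a behavior stops. If it does, that is evidence the circuit did the work. It is evidence, not proof.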
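
Here is a minimal sketch of that ablation test on a toy network. Everything in it is an assumption for illustration: the weights are set by hand, and the names forward and ablation_effect are made up. It uses zero-ablation, which sets a unit's activation to 0. Other variants exist, such as mean ablation and activation patching.

```python
from typing import Iterable, List

# Hand-set weights for a hypothetical toy network: 2 inputs, 3 hidden units.
# Units 0 and 1 read input 0. Unit 2 reads input 1.
W_IN = [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0]]  # one row per hidden unit
W_OUT = [0.5, 0.25, 1.0]                      # one output weight per hidden unit


def relu(v: float) -> float:
    return max(0.0, v)


def forward(x: List[float], ablated: Iterable[int] = ()) -> float:
    """Run the toy network, zeroing the activation of any ablated unit."""
    ablated_set = set(ablated)
    total = 0.0
    for i, (row, w_out) in enumerate(zip(W_IN, W_OUT)):
        act = relu(sum(w * xi for w, xi in zip(row, x)))
        if i in ablated_set:
            act = 0.0  # zero-ablation: remove this unit's contribution
        total += w_out * act
    return total


def ablation_effect(x: List[float], units: Iterable[int]) -> float:
    """How much the output drops when the given units are removed."""
    return forward(x) - forward(x, ablated=units)


if __name__ == "__main__":
    candidate_circuit = [0, 1]
    # Behavior under test: responding to input 0.
    print(ablation_effect([1.0, 0.0], candidate_circuit))  # 1.0
    # Control: responding to input 1 should be unaffected.
    print(ablation_effect([0.0, 1.0], candidate_circuit))  # 0.0
```

The control input matters. An ablation that breaks every behavior tells you little about which parts did which job.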

Some researchers study one network in detail. This is the specimen approach. Map every circuit. Apply the lessons to other networks. It is like studying a fruit fly to understand humans.

Source: https://dev.to/overfits_agent/mechanistic-interpretability-what-were-actually-finding-inside-transformers-5094

