mech-interpretability

This repo contains my experiments aimed at better understanding recent advances in mechanistic interpretability research, with an emphasis on characterizing the stability of discovered circuits. Please see the drafts summarizing the findings.

Exploring length generalization in the context of the indirect object identification (IOI) task

Abstract

Mechanistic interpretability aims to explain how neural networks learn at the circuit level. So far, only a handful of circuits with compelling evidence have been discovered, all in relatively small language models [1,2,3]. It remains unclear whether these circuits (a) persist once formed if training continues, and (b) have similar explanatory power in larger models or in models with different architectures. A recent case study reverse engineering the circuit behind the indirect object identification (IOI) task uncovered an interesting circuit with a fairly specialized division of labor [3]. While the authors performed extensive experiments and ablation analyses to validate the specific circuit components (i.e., specialized attention heads), how model performance varies under different perturbations remains underexplored. Here we study performance on the IOI task and its variants in the original GPT-2 small and in other models of comparable size. This analysis makes some progress toward addressing two open research questions curated by Neel Nanda, 5.34 and 5.35 [4].
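
To make the setup concrete, here is a minimal sketch of the logit-difference evaluation used in [3], written with the TransformerLens library [5]. The prompts and the `logit_diff` helper are illustrative assumptions for this example, not code from the repo; the longer prompt stands in for the kind of length perturbation studied here.

```python
import torch
from transformer_lens import HookedTransformer

# Load GPT-2 small, the model studied in [3].
model = HookedTransformer.from_pretrained("gpt2")

# Baseline IOI prompt: a well-performing model assigns a higher logit to the
# indirect object (" Mary") than to the subject (" John") at the last position.
base_prompt = "When John and Mary went to the store, John gave a drink to"

# A hypothetical length perturbation: the same prompt with a distractor clause.
long_prompt = ("When John and Mary went to the store after a long walk "
               "in the park, John gave a drink to")

def logit_diff(prompt: str, io: str = " Mary", s: str = " John") -> float:
    """Logit difference between the indirect-object and subject tokens."""
    with torch.no_grad():
        logits = model(prompt, return_type="logits")  # [1, pos, d_vocab]
    last = logits[0, -1]
    return (last[model.to_single_token(io)] - last[model.to_single_token(s)]).item()

print(f"baseline logit diff:  {logit_diff(base_prompt):.3f}")
print(f"perturbed logit diff: {logit_diff(long_prompt):.3f}")
```

A positive logit difference means the model prefers the indirect object; comparing the value across prompt variants is one simple way to probe how stable the circuit's behavior is under perturbation.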


References

  1. Elhage, Nelson, Neel Nanda, Catherine Olsson, et al. "A mathematical framework for transformer circuits." Transformer Circuits Thread (2021)
  2. Nanda, Neel, et al. "Progress measures for grokking via mechanistic interpretability." arXiv:2301.05217 (2023)
  3. Wang, Kevin, et al. "Interpretability in the wild: a circuit for indirect object identification in GPT-2 small." arXiv:2211.00593 (2022)
  4. Nanda, Neel. "200 Concrete Open Problems in Mechanistic Interpretability"
  5. TransformerLens library
