Dan Wilhelm

Recent Papers
- [New!] "Model Organisms Are Leaky: Perplexity Differencing Often Reveals Finetuning Objectives" [arXiv (preprint)]
- "Tokenized SAEs: Disentangling SAE Reconstructions" [arXiv] | [LessWrong]
  - paper/poster @ ICML 2024 Mechanistic Interpretability Workshop
Reverse-engineering writeups
- [New!] Max of List puzzle (Bao Lab challenge)
- Cumulative Sum Sign puzzle (ARENA challenge)
LLM Foundations (in progress)

In-progress mech-interp intros
YouTube
ezinterp [GitHub]: a minimal transformer interpretability library for interactive exploration.

[New!] I am an advisor for TARA, an APAC-region research accelerator based on the ARENA curriculum.
This Site's GitHub