In this episode I'll be speaking with Stephen Casper. Stephen was previously an intern working with me at UC Berkeley, but he's now a Ph.D student at MIT working with Dylan Hadfield-Menell on adversaries and interpretability in machine learning. We'll be talking about his 'Engineer's Interpretability Sequence' of blog posts, as well as his paper on benchmarking whether interpretability tools can find Trojan horses inside neural networks. For links to what we're discussing, you can check the description of this episode, and you can review the transcript at.

Topics we cover: Why interpretability? · Interpretability for engineers · Benchmarking Interpretability Tools (for Deep Neural Networks) (Using Trojan Discovery) · Deceptive alignment and interpretability.

So from your published work, it seems like you're really interested in neural network interpretability.