Recent posts
The FIX Benchmark: Extracting Features Interpretable to eXperts
We present the FIX benchmark for evaluating how interpretable features are to real-world experts, ranging from gallbladder surgeons to supernova cosmologists.
Logicbreaks: A Framework for Understanding Subversion of Rule-based Inference
We study jailbreak attacks through propositional Horn inference.
Towards Compositionality in Concept Learning
A method for learning compositional concepts from pre-trained foundation models.
Data-Efficient Learning with Neural Programs
Combining neural perception with symbolic or GPT-based reasoning.
Sum-of-Parts Models: Faithful Attributions for Groups of Features
Overcoming fundamental barriers in feature attribution methods with grouped attributions.
SmoothLLM: Defending LLMs Against Jailbreaking Attacks
LLMs, jailbreaking, and generative AI’s ‘biggest security flaw’
Stable Explanations with Multiplicative Smoothing
Achieving stability guarantees for feature attribution methods.
Do Machine Learning Models Learn Statistical Rules Inferred from Data?
Understanding and improving model predictions using rules learned from data.