Recent posts
Probabilistic Stability Guarantees for Feature Attributions
We scale certified explanations to provide practical guarantees in high-dimensional settings.
The FIX Benchmark: Extracting Features Interpretable to eXperts
We present the FIX benchmark for evaluating how interpretable features are to real-world experts, ranging from gallbladder surgeons to supernova cosmologists.
Logicbreaks: A Framework for Understanding Subversion of Rule-based Inference
We study jailbreak attacks through the lens of propositional Horn inference.
Towards Compositionality in Concept Learning
A method for learning compositional concepts from pre-trained foundation models.
Data-Efficient Learning with Neural Programs
Combining neural perception with symbolic or GPT-based reasoning.
Sum-of-Parts Models: Faithful Attributions for Groups of Features
Overcoming fundamental barriers in feature attribution methods with grouped attributions.
SmoothLLM: Defending LLMs Against Jailbreaking Attacks
LLMs, jailbreaking, and generative AI’s ‘biggest security flaw’.
Stable Explanations with Multiplicative Smoothing
Achieving stability guarantees for feature attribution methods.