The FIX Benchmark: Extracting Features Interpretable to eXperts
We present the FIX benchmark for evaluating how interpretable features are to real-world experts, ranging from gallbladder surgeons to supernova cosmologists.
We study jailbreak attacks through the lens of propositional Horn inference (see the sketch after this list).
A method for learning compositional concepts from pre-trained foundation models.
Combining neural perception with symbolic or GPT-based reasoning.
Overcoming fundamental barriers in feature attribution methods with grouped attributions.
LLMs, jailbreaking, and generative AI’s ‘biggest security flaw’.
Achieving stability guarantees for feature attribution methods.
Understanding and improving model predictions using rules learned from data.
A novel prompting method that provides faithful explanations and improves performance on complex reasoning tasks.
An influence framework for studying in-context learning examples.
Finding phrases that bias image and text generation towards unexpected outputs.
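For readers unfamiliar with the formalism behind the Horn-inference item above, here is a minimal, self-contained sketch of propositional Horn-clause forward chaining in Python. The propositions and rules are hypothetical placeholders chosen for illustration; this shows only the general inference technique, not the jailbreak analysis from the work itself.

```python
# Minimal propositional Horn-clause forward chaining (illustrative only).
# A Horn rule is a pair (body, head): once every proposition in `body`
# has been derived, `head` may be derived. Facts seed the derivation.

def forward_chain(facts, rules):
    """Return the closure of `facts` under the Horn `rules`."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            if head not in derived and body <= derived:
                derived.add(head)
                changed = True
    return derived

# Hypothetical propositions standing in for prompt properties.
rules = [
    (frozenset({"harmful_intent", "obfuscated_phrasing"}), "filter_bypassed"),
    (frozenset({"filter_bypassed"}), "unsafe_output"),
]
facts = {"harmful_intent", "obfuscated_phrasing"}

print(sorted(forward_chain(facts, rules)))
# ['filter_bypassed', 'harmful_intent', 'obfuscated_phrasing', 'unsafe_output']
```

Satisfiability of propositional Horn clauses is decidable in linear time, which is part of what makes Horn logic a tractable setting for this kind of formal analysis.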