DebugML

Instruction Following by Boosting Attention of Large Language Models

Vitoria Guardieiro, Adam Stein, Avishree Khare, Eric Wong | July 10, 2025 | 11 minute read

We improve instruction-following in large language models by boosting attention, a simple technique that outperforms existing steering methods.

Probabilistic Stability Guarantees for Feature Attributions

Helen Jin, Anton Xue, Weiqiu You, Surbhi Goel, Eric Wong | April 24, 2025 | 8 minute read

We scale certified explanations to provide practical guarantees in high-dimensional settings.

The FIX Benchmark: Extracting Features Interpretable to eXperts

Helen Jin, Shreya Havaldar, Chaehyeon Kim, Anton Xue, Weiqiu You, Helen Qu, Marco Gatti, Daniel A. Hashimoto, Bhuvnesh Jain, Amin Madani, Masao Sako, Lyle Ungar, Eric Wong | October 15, 2024 | 6 minute read

We present the FIX benchmark for evaluating how interpretable features are to real-world experts, ranging from gallbladder surgeons to supernova cosmologists.

Logicbreaks: A Framework for Understanding Subversion of Rule-based Inference

Anton Xue, Avishree Khare, Rajeev Alur, Surbhi Goel, Eric Wong | July 9, 2024 | 12 minute read

We study jailbreak attacks through propositional Horn inference.

Towards Compositionality in Concept Learning

Adam Stein, Aaditya Naik, Yinjun Wu, Mayur Naik, Eric Wong | July 5, 2024 | 9 minute read

A method for learning compositional concepts from pre-trained foundation models.

Data-Efficient Learning with Neural Programs

Alaia Solko-Breslin, Seewon Choi, Ziyang Li, Neelay Velingker, Rajeev Alur, Mayur Naik, Eric Wong | June 11, 2024 | 13 minute read

Combining neural perception with symbolic or GPT-based reasoning.

Sum-of-Parts Models: Faithful Attributions for Groups of Features

Weiqiu You, Helen Qu, Marco Gatti, Bhuvnesh Jain, Eric Wong | October 26, 2023 | 12 minute read

Overcoming fundamental barriers in feature attribution methods with grouped attributions.

SmoothLLM: Defending LLMs Against Jailbreaking Attacks

Alex Robey, Eric Wong, Hamed Hassani, George J. Pappas | October 17, 2023 | 17 minute read

LLMs, jailbreaking, and generative AI’s ‘biggest security flaw’
