Overcoming fundamental barriers in feature attribution methods with grouped attributions
LLMs, jailbreaking, and generative AI’s ‘biggest security flaw’
Achieving stability guarantees for feature attribution methods
Understanding and improving model predictions using rules learned from data
A novel prompting method that provides faithful explanations and improves performance on complex reasoning tasks
An influence framework for studying in-context learning examples
Finding phrases that bias image and text generation towards unexpected outputs