Alg-ML is a weekly machine learning theory seminar primarily attended by the research groups of
Prof. Sanjeev Arora, Prof. Elad Hazan, and Prof. Boris Hanin.
We discuss recent advances in algorithm design and theoretical machine learning.
Time: Tuesdays, 12:15–1:15 pm ET
Lunch: Usually at 12:00 pm
Location: CS 402
Open to all members of the Princeton community!
For spring 2026, the seminar is organized by Gon Buzaglo and Anand Brahmbhatt.
Subscribe to the alg-ml mailing list and the Google calendar.
Abstract: Training capable small language models is a central challenge, yet existing distillation methods treat teachers as static supervision sources. I argue that effective learning depends on how and when a small model learns from a larger one. I show that intermediate teacher checkpoints reveal implicit learning trajectories, and that aligning students to these trajectories yields provable sample-complexity benefits.
Abstract: Modern Large Language Models (LLMs) are typically based on Transformers and/or Structured State Space Models (SSMs), and tend to generalize well even under a distribution shift between training and test data. Conventional wisdom attributes this generalization to implicit biases induced by architectures and the gradient-based algorithms that train them. This talk will describe a series of works theoretically analyzing and empirically evaluating implicit biases in Transformers and SSMs. Beginning with Transformers, I will consider Reinforcement Learning with outcome-based supervision (as in, e.g., DeepSeek-R1), and show that on a graph traversal task, if training data includes simple examples then an implicit bias admits generalization via step-by-step reasoning (Chain-of-Thought), whereas if training data does not include simple examples then learning is intractable. Continuing to SSMs, I will consider a teacher-student setting, and show that if training data is generic then an implicit bias admits generalization, yet there are cleanly labeled examples whose inclusion in training entirely disrupts generalization. These findings carry a counterintuitive message: for both Transformers and SSMs, it is sometimes beneficial to deliberately introduce a distribution shift to training data. Further research into the potential benefits of distribution shifts for Transformers and SSMs may pave the way to more effective curricula for training modern LLMs.
Abstract: TBD
Abstract: TBD
Abstract: TBD
Abstract: TBD
Abstract: TBD
Abstract: TBD
Abstract: TBD
Abstract: TBD