Speaker: Christopher Potts (Stanford University)
Title: Inducing Interpretable Causal Structures in Neural Networks
Abstract: Early symbolic NLP models were designed to leverage valuable insights about language and cognition. These insights were expressed directly in hand-designed structures, and this ensured that model behaviors were systematic and interpretable. Unfortunately, these models tended also to be brittle and specialized. By contrast, present-day models are data-driven and can flexibly acquire complex behaviors, which has opened up many new avenues. However, the trade-offs are now evident: these models often find opaque, unsystematic solutions. In this talk, I’ll report on our ongoing efforts to combine the best aspects of the old and new using techniques from causal abstraction analysis. In this method, we define high-level causal models, usually in symbolic terms, and then train neural networks to conform to the structure of those models while also learning specific tasks. The central technical piece is interchange intervention training (IIT), in which we swap internal representations in the target neural model in a way that is guided by the input–output behavior of the causal model. Where the IIT objective is minimized, the high-level model is an interpretable, faithful proxy for the underlying neural model. My talk will focus on how and why IIT works, since I am hoping this will help people identify new application areas for it, and I will also briefly review case studies applying IIT to natural language inference, grounded language understanding, and language model distillation.
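
Illustrative sketch (not from the talk): the interchange intervention described in the abstract can be shown on a toy problem. Everything below is an assumption made for illustration: the task (a + b + c), the high-level variable S = a + b, the two hand-written linear "networks", and all function names are hypothetical. The sketch only evaluates an IIT-style squared-error objective for one base/source pair; in actual IIT this kind of loss would be minimized by gradient descent alongside the task loss.

```python
# Toy interchange intervention. All names and the task itself are hypothetical.

def high_level_s(a, b, c):
    """Intermediate variable of the high-level causal model: S = a + b."""
    return a + b

def high_level(a, b, c, s_override=None):
    """High-level causal model: output = S + c. s_override intervenes on S."""
    s = high_level_s(a, b, c) if s_override is None else s_override
    return s + c

def low_level(x, weights, h0_override=None):
    """Tiny linear 'network': hidden = W x, output = sum(hidden).
    h0_override replaces hidden[0], the unit assumed to be aligned with S."""
    hidden = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    if h0_override is not None:
        hidden[0] = h0_override
    return sum(hidden), hidden

def iit_loss(base, source, weights):
    """Squared error between the network's output under the swap and the
    causal model's counterfactual output for the same base/source pair."""
    # High-level counterfactual: base input, but S takes its value from source.
    target = high_level(*base, s_override=high_level_s(*source))
    # Low-level swap: hidden[0] from the source run is placed into the base run.
    _, source_hidden = low_level(source, weights)
    prediction, _ = low_level(base, weights, h0_override=source_hidden[0])
    return (prediction - target) ** 2

# Both weight matrices compute a + b + c on ordinary inputs.
aligned = [[1, 1, 0], [0, 0, 1]]     # hidden[0] = a + b, hidden[1] = c
misaligned = [[1, 0, 0], [0, 1, 1]]  # hidden[0] = a,     hidden[1] = b + c

base, source = (1, 2, 3), (4, 5, 6)
print(iit_loss(base, source, aligned))     # 0: hidden[0] behaves like S
print(iit_loss(base, source, misaligned))  # 9: hidden[0] does not behave like S
```

The two networks agree on every ordinary input, so input-output behavior alone cannot tell them apart; only the swap exposes that the second one does not realize the structure of the high-level model.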