Next week, we’ll have an invited talk by Canaan Breiss (University of Southern California) on Monday, March 10 at 3:00 PM, in room 461 at 2001 McGill College. Details of the talk are given below.
Title: How to make the most of what you have: building phonological theory from sparse data
Abstract: One recent trend in phonological research has been to use probabilistic grammar formalisms not only to yield analytical insights into linguistic patterns, but also to confront questions of learnability, language acquisition, and processing. This approach can provide valuable new evidence for adjudicating between existing theoretical proposals. However, it typically relies on large quantitative datasets (spoken or written corpora, lexical statistics, high-powered behavioral experiments) that are available only for high-resource languages, putting the field at risk of arriving at skewed conclusions about universals of linguistic patterning or cognition. In this talk, I present two case studies that use computational modeling to redress this imbalance, allowing sparse data to inform contemporary phonological theory.
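For readers unfamiliar with probabilistic grammar formalisms, the sketch below is our own minimal illustration (not material from the talk) of one widely used such formalism, the Maximum Entropy grammar: each output candidate is assigned a probability proportional to the exponential of its negated weighted constraint violations. The candidates, constraints, and weights are invented for the example.

```python
# Minimal MaxEnt grammar sketch (illustrative only, not code from the talk).
# Each candidate's probability is proportional to exp(-H), where H is the
# weighted sum of that candidate's constraint violations.
import math

def maxent_probs(candidates, weights):
    """candidates: {form: violation counts, one per constraint}."""
    harmony = {f: sum(w * v for w, v in zip(weights, viols))
               for f, viols in candidates.items()}
    z = sum(math.exp(-h) for h in harmony.values())
    return {f: math.exp(-harmony[f]) / z for f in candidates}

# Hypothetical two-candidate tableau with two constraints; the violation
# vectors and weights are made up for illustration.
tableau = {"kata": [0, 1], "kada": [1, 0]}
print(maxent_probs(tableau, weights=[2.0, 1.0]))
# -> kata gets ~0.73, kada ~0.27: lower weighted violations, higher probability
```

Because such a grammar assigns gradient probabilities rather than categorical verdicts, quantitative data like corpus frequencies or experimental response rates can bear directly on the analysis, which is what makes the data-sparsity problem above bite.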
First, drawing on data from the endangered Tōhoku dialect of Japanese, I use a dense-sampling experimental design with eight participants to quantitatively probe a case of optional paradigm uniformity discussed in Ito & Mester (1996). I find that the variable process is influenced both by lexical characteristics (resting activation) and by phonological markedness, and I model the data using the Voting Theory of Bases, proposed in Breiss (2024) to account for Lexical Conservatism data (Steriade 1997). The success of this model suggests that paradigm uniformity and Lexical Conservatism are two special cases of a general theory of how a dynamic lexicon interacts with a probabilistic grammar.
Second, I present an interactive approach to learning a grammar from scratch, which combines linguistic acceptability judgments from a native-speaker consultant with a learning model that actively selects its own training data. The model maintains explicit uncertainty over the range of grammars compatible with the information it has seen so far, and uses that uncertainty to select the future queries that will be most useful in reducing it. I apply the model to the domain of phonotactics, and find that it selects queries that achieve sample efficiency comparable to or greater than that of fully supervised approaches with access to a large corpus of “ground truth” lexical data.
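To give a concrete feel for the query-selection idea in the second case study, here is our own toy reconstruction (not the speaker's model): the learner tracks every grammar still consistent with the consultant's judgments and always asks about the form on which the surviving grammars disagree most, so that either answer rules out close to half of them. The segment inventory, the bigram-ban hypothesis space, and the "true" grammar standing in for the consultant are all invented for the example.

```python
# Toy uncertainty-driven active learner for phonotactics (illustrative only).
import itertools

SEGMENTS = "pta"
BIGRAMS = [a + b for a in SEGMENTS for b in SEGMENTS]
# Invented hypothesis space: grammars that ban at most one bigram.
grammars = [frozenset()] + [frozenset([bg]) for bg in BIGRAMS]

def accepts(grammar, form):
    """A form is acceptable iff it contains no banned bigram."""
    return not any(form[i:i + 2] in grammar for i in range(len(form) - 1))

def best_query(pool, forms):
    """Pick the form that splits the surviving grammars most evenly,
    so either judgment eliminates close to half of them."""
    return min(forms,
               key=lambda f: abs(2 * sum(accepts(g, f) for g in pool) - len(pool)))

true_grammar = frozenset(["tp"])  # stand-in for the consultant's knowledge
forms = ["".join(s) for s in itertools.product(SEGMENTS, repeat=3)]

pool = list(grammars)
while len(pool) > 1:
    q = best_query(pool, forms)
    judgment = accepts(true_grammar, q)               # elicit one judgment
    pool = [g for g in pool if accepts(g, q) == judgment]
    print(f"{q} judged {'good' if judgment else 'bad'}; {len(pool)} grammars remain")
print("learned bans:", sorted(pool[0]) or "none")
```

Run as written, the learner converges in four queries, whereas naively eliciting a judgment for every length-3 form would take 27; that gap is the kind of sample efficiency the active approach aims for.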