Please join us Friday, November 26th at 3:30pm for our next talk of the 2021-2022 McGill Linguistics Colloquium Series. If you are planning to attend talks and have not yet registered, you can do so here (you only need to register once for the 2021-2022 year). After registering, you will receive a confirmation email containing information about joining the meeting.

Speaker: Ewan Dunbar (University of Toronto)

Title: Probing state-of-the-art speech representation models using experimental speech perception data from human listeners


The strong performance of neural network natural language processing has led to an explosion of research probing systems’ linguistic knowledge (whether language models implicitly learn syntactic hierarchy, whether word embeddings understand quantifiers, and so on), in order to understand whether the data-crunching power of these models can be harnessed as the basis for serious, theoretically grounded models of grammatical learning and processing. Much of this “(psycho)linguistics for robots” work has focussed on textual models. Here, I show how we have applied this same approach to phonetics. In particular, we probe state-of-the-art unsupervised speech processing models and compare their behaviour to humans’ in order to shed light on the traditionally hazy and ad hoc construct of “acoustic distance.”

On the basis of a series of simple, broad-coverage speech perception experiments run with English- and French-speaking participants, I compare human listeners’ behaviour (how well they discriminate sounds in the experiment) to the “behaviour” of representations (how well they separate those same stimuli). These representations come from models trained with the express purpose of building better representations for use in automatic speech recognition. For example, Facebook AI’s recent wav2vec 2.0 model takes large amounts of unlabelled speech as training data and learns to extract a representation of the audio that is highly predictive of the surrounding context. It has proven extraordinarily useful as a replacement for off-the-shelf audio features, to the point that some of the best-performing speech recognition systems today have switched to using these representations, which has substantially reduced the amount of labelled data needed to train high-quality speech recognizers.
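To make the human-versus-model comparison above a little more concrete, here is a minimal sketch of one way such a comparison could be set up. It assumes an ABX-style discrimination trial (the abstract does not name the task), cosine distance between representation vectors, and a Pearson correlation with per-trial human accuracy. All function names, field names, and numbers are hypothetical toy values for illustration, not data or code from the talk.

```python
import math


def cosine_distance(u, v):
    """Cosine distance between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def abx_delta(a, b, x):
    """Model 'discriminability' for one trial (assumed ABX-style setup).

    X is assumed to belong to the same category as A. A positive value
    means the representation places X closer to A than to B.
    """
    return cosine_distance(b, x) - cosine_distance(a, x)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up toy trials: 2-d "representations" and invented human accuracies.
# In practice the vectors would come from a speech model and would first be
# pooled or aligned over time; that step is omitted here.
trials = [
    {"a": [1.0, 0.0], "b": [0.0, 1.0], "x": [0.9, 0.1], "human_accuracy": 0.95},
    {"a": [1.0, 0.0], "b": [0.8, 0.6], "x": [0.95, 0.3], "human_accuracy": 0.60},
    {"a": [1.0, 0.0], "b": [0.0, 1.0], "x": [0.7, 0.7], "human_accuracy": 0.50},
]

deltas = [abx_delta(t["a"], t["b"], t["x"]) for t in trials]
accuracies = [t["human_accuracy"] for t in trials]

# Higher correlation = the representation's distances track human behaviour.
r = pearson(deltas, accuracies)
print(f"correlation between model deltas and human accuracy: {r:.3f}")
```

Under these assumptions, a representation is a better model of human perception the more closely its per-trial separations track how well listeners discriminate the same stimuli.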

We use the comparison with human behaviour to show that, contrary to what many researchers may have *thought* this and related systems are doing, they are not really “learning representations of the sound inventory” of the training language so much as learning good representations of the acoustics of speech. These representations are so good that they are very good models of “auditory distance” in human speech processing. Notably, however, they lack the categorical effects on speech perception which are pervasive in human listening experiments, and, unlike our human listeners, they show only very weak effects of the language on which they are trained. As well, I present new evidence that “speech is special” in human auditory processing, by comparing learned representations trained on speech data to those of the same models trained on non-speech data. We show that representations trained on non-speech are very (very) poor predictors of human speech perception behaviour in experiments.