At this week’s MCQLL meeting, Austin Kraft will present “Motivating morphology in semi-supervised tokenization.” We will meet this Wednesday, April 2, at 10 AM. Meetings are held in person in room 117 of the McGill Linguistics department and on Zoom at https://mcgill.zoom.us/j/89609376104.
Abstract: Tokenization is fundamental to computational linguistics and NLP. An input text is segmented into smaller pieces such as words or sub-words, which then serve as the basic units for language modeling and downstream tasks. However, this process remains understudied with respect to how it can be formalized (Gastaldi et al. 2024), implemented, and potentially related to languages’ morphological patterns. Many contemporary NLP applications tokenize using Byte-Pair Encoding (BPE; Sennrich et al. 2016), an unsupervised algorithm that iteratively merges the most frequent pair of adjacent symbols (initially characters) in an input text. In the first part of this talk, I will give an overview of BPE, its tendency to misalign with morphology to the detriment of downstream performance, and Bauwens and Delobelle’s (2024) proposed solution in their “BPE-knockout” method for Germanic languages. I will spend the second part of the talk motivating morphological semi-supervision for tokenization, particularly for languages less often represented in computational linguistics and NLP.
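
For readers unfamiliar with BPE, the merge procedure mentioned in the abstract can be sketched roughly as follows. This is a simplified illustration and not the speaker's code or the reference implementation of Sennrich et al.: the function name `learn_bpe` is hypothetical, words are treated as plain character sequences with no end-of-word marker, and ties between equally frequent pairs are broken by first occurrence.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merges from a list of words (toy sketch)."""
    # Each word starts as a tuple of single characters; counts are word frequencies.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every pair of adjacent symbols, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pair_counts[(left, right)] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Assumption: ties are resolved by whichever pair was counted first.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the chosen pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab
```

Because each merge is chosen by frequency alone, nothing in this procedure refers to morpheme boundaries, which is the source of the misalignment the talk addresses.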