Norwegian AI Directory

Atomic Units for Language Universal representation of Speech


Description:

Traditional speech recognition systems are based on a top-down approach in which the sub-word units are pre-defined, usually on the basis of linguistic theory. Building robust statistical models of these units requires massive amounts of data, and the approach remains sensitive to mismatch between the imposed model and real-world data at all levels. The recognition problem is framed as finding the most likely sequence of units that matches a legal sequence of words, as defined by the lexicon and the language model.

Instead of relying on top-down decoding, we propose a paradigm based on bottom-up detection and information extraction. Rather than learning statistical models of pre-defined units, we aim to develop an approach to ASR based on learning the 'optimal' set of units for mapping from variable acoustic data to invariant meaningful symbols in a bottom-up information extraction procedure. These units must capture the structure in the speech signal that is imposed by the constraints of the articulatory system, i.e., the structure that encodes the linguistic information. At the same time, the units must be flexible and adaptive, so that they can be used for understanding unknown speakers in arbitrary acoustic backgrounds. Last but not least, it must be possible to learn the units from limited amounts of annotated speech.

The core paradigm will be investigated by exploring and verifying five supporting hypotheses:

- The salient information of the speech signal can be represented by detecting a small number of acoustic-phonetic events.
- The set of sub-word units can be discovered from the detected events by machine learning approaches.
- The relationship between sub-word units and linguistic units can be learnt from (possibly labelled) data.
- The dependence of the sub-word units on language and speaker will be explored by employing them for automatic language identification and for speaker recognition.
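As a minimal illustration of the second hypothesis, unit discovery can be sketched as unsupervised clustering of frame-level acoustic features, where each cluster plays the role of a discovered sub-word unit. The sketch below is not the project's actual method: it uses plain k-means on synthetic 13-dimensional vectors standing in for acoustic features (e.g. MFCC-like frames), purely to show the shape of a bottom-up labelling pipeline.

```python
import numpy as np

def kmeans(frames, k, iters=20, seed=0):
    """Plain k-means: cluster feature frames into k pseudo sub-word units."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct frames picked at random.
    centroids = frames[rng.choice(len(frames), size=k, replace=False)]
    for _ in range(iters):
        # Assign each frame to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(frames[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned frames.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = frames[labels == j].mean(axis=0)
    return centroids, labels

# Synthetic stand-in for acoustic feature frames: three well-separated
# Gaussian clusters playing the role of three distinct acoustic events.
rng = np.random.default_rng(1)
frames = np.vstack(
    [rng.normal(loc=m, scale=0.3, size=(100, 13)) for m in (-3.0, 0.0, 3.0)]
)

centroids, labels = kmeans(frames, k=3)
# Each frame now carries a discrete unit label; a sequence of such labels
# is a data-driven, bottom-up transcription in terms of discovered units.
print(labels.shape, centroids.shape)
```

In the actual project setting the input would of course be event detections from real speech rather than synthetic vectors, and the mapping from discovered units to linguistic units would be learnt separately, as the hypotheses above describe.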


Project leader: Torbjørn Svendsen

Started: 2015

Ends: 2020

Category: Universities

Sector: University and college sector (UoH-sektor)

Budget: 8,411,000

Institution: Institutt for elektroniske systemer

Address: Trondheim