« OTELO » project
Language, whether written or spoken, is intrinsically ambiguous and polysemous. Linguists aspire to account for this ambiguity in order to understand how it works.
Researchers in information science and technology are also concerned with formalizing linguistic variation for application purposes. Work on an exhaustive description of language is rare, as it requires approaches from several scientific communities. Winner of the call for excellence projects launched by the DATAIA Institute and MSH Paris-Saclay in 2020, OTELO offers a multi-level analysis of spoken language based on large oral corpora, segmented and annotated automatically.
Segmented into phones and words, this data will then be enriched with knowledge concerning the grammatical status of words and their syntactic and semantic relationships in context. Expected results include:
- The role of phonetic information in the disambiguation of contextual homophonies involving entities;
- The impact of “high-level” linguistic knowledge (grammatical, syntactic, semantic) on the diffusion of patterns of phonetic variation within the words of a language.
OTELO is led by Ioana Vasilescu, a researcher in linguistics at LIMSI, and Fabian Suchanek, a researcher in computer science at Télécom Paris. F. Suchanek is internationally renowned for creating the YAGO knowledge base, which is used in the IBM Watson system, among others. His research focuses on extracting entities and facts from natural-language text and structuring this data in a knowledge base. It also addresses the analysis of such knowledge bases, including rule mining and completeness determination. His work is supported by an ANR-funded AI Chair.
At LIMSI, the analysis of written and spoken language is at the heart of the Language Science and Technology Department. Within this department, I. Vasilescu and her colleagues in the “Traitement du Langage Parlé” (Spoken Language Processing) group are behind numerous humanities and social sciences (SHS) initiatives involving the analysis of sound variation using large multilingual corpora. These analyses are based on massive data explored with automatic tools. The work of I. Vasilescu, supported by MSH Paris-Saclay, has highlighted the value of this methodology and of large corpora for the study of synchronic variation in relation to language history. LIMSI researchers are also behind a first joint approach involving multi-level analysis of oral data in connection with the errors of automatic systems, as part of the ANR VERA (adVanced ERror Analysis) project (Santiago et al., 2015; Goryainova et al., 2014).
Contacts: Fabian Suchanek (Télécom Paris) | Ioana Vasilescu (LIMSI)