Heterogeneous language data in CLARIN

Track: Track 2 (Auditorium 2)
Description: Large volumes of language-related data exist, but the collections are widely scattered and heterogeneous. Primary language materials comprise speech, video and text in various encodings, transcription codes and formats. Furthermore, these materials may carry multiple levels of annotation (such as morphological, syntactic and semantic annotation, as well as layers of normalized editions). Secondary (derived) language materials such as lexicons, wordnets and grammars adhere to various encodings as well as to different theoretical paradigms and formalisms. Experimental data (such as reaction times and fMRI data) and geographical mappings of language must also be considered. Making sense of the thousands of encodings and converting formats to standards is an enormous challenge. The European CLARIN ERIC and associated national projects such as CLARINO in Norway aim to establish a common language resources and technology infrastructure in which language resources become more interoperable and available for exploration and modeling.

