Text2TCS Documentation

Documentation of the Text2TCS-Application

A first version of the tool can be tried out and downloaded from the European Language Grid: live.european-language-grid.eu/catalogue/tool-service/23661/try%20out/

Text2TCS is an application that automatically extracts terminological concept systems from texts encoded in UTF-8. The application extracts terms, groups them into synonym groups, and extracts the relations that hold between these groups.
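
The description above implies a simple data model: terms, synonym groups, and typed relations between groups. The following is a minimal sketch of such a model, not the tool's internal representation; all class and field names (SynonymGroup, Relation, ConceptSystem) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class SynonymGroup:
    """Terms that are treated as naming the same concept (hypothetical)."""
    group_id: str
    terms: Tuple[str, ...]


@dataclass(frozen=True)
class Relation:
    """A directed, typed link between two synonym groups (hypothetical)."""
    source_id: str
    target_id: str
    relation_type: str  # e.g. "generic relation"


@dataclass
class ConceptSystem:
    """Container holding synonym groups and the relations between them."""
    groups: Dict[str, SynonymGroup] = field(default_factory=dict)
    relations: List[Relation] = field(default_factory=list)

    def add_group(self, group: SynonymGroup) -> None:
        self.groups[group.group_id] = group

    def add_relation(self, relation: Relation) -> None:
        # Reject relations whose endpoints are not known synonym groups.
        for gid in (relation.source_id, relation.target_id):
            if gid not in self.groups:
                raise KeyError(f"unknown synonym group: {gid}")
        self.relations.append(relation)
```

Relations point at groups rather than at individual terms, which mirrors the statement that relations hold between synonym groups.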

The following types of terms are considered:

Single-word terms
Multi-word terms (prepositional phrases, noun compounds, etc.)
Named entities
Short forms, including acronyms
Synonyms

The following types of relations are extracted. Relations are always labeled in English, irrespective of the input language, according to the following typology:

hierarchical relation:
- generic relation (specific -> broader)
- partitive relation (part -> whole)
non-hierarchical relation:
- activity relation (activity -> actor; activity -> entity)
- causal relation (cause -> effect)
- spatial relation (entity -> space)
- instrumental relation (instrument -> purpose)
- origination relation (entity -> origin)
- property relation (entity -> property)
- associative relation (loose thematic connection)
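
The typology above can be written down as an enumeration, which makes the direction of each relation explicit. This is an illustrative sketch only; the names RelationType, label, hierarchical, and roles are assumptions and are not taken from the Text2TCS code.

```python
from enum import Enum


class RelationType(Enum):
    # Each value is (label, is_hierarchical, (source role, target role) pairs).
    GENERIC = ("generic relation", True, (("specific", "broader"),))
    PARTITIVE = ("partitive relation", True, (("part", "whole"),))
    ACTIVITY = ("activity relation", False,
                (("activity", "actor"), ("activity", "entity")))
    CAUSAL = ("causal relation", False, (("cause", "effect"),))
    SPATIAL = ("spatial relation", False, (("entity", "space"),))
    INSTRUMENTAL = ("instrumental relation", False, (("instrument", "purpose"),))
    ORIGINATION = ("origination relation", False, (("entity", "origin"),))
    PROPERTY = ("property relation", False, (("entity", "property"),))
    # The typology gives no role pair for the associative relation.
    ASSOCIATIVE = ("associative relation", False, ())

    def __init__(self, label, hierarchical, roles):
        self.label = label
        self.hierarchical = hierarchical
        self.roles = roles


def hierarchical_types():
    """Return the relation types listed under 'hierarchical relation'."""
    return [rt for rt in RelationType if rt.hierarchical]
```

For example, RelationType.PARTITIVE.roles holds the single pair ("part", "whole"), so a relation of that type reads from the part to the whole.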

This information is output in two formats: our own text format, which is intended to increase readability, and TBX/XML.
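
To give an impression of what a TBX/XML representation of synonym groups can look like, the sketch below builds a bare TBX-style skeleton with the Python standard library. It is not the output schema of Text2TCS: the function name build_tbx and the entry ids are hypothetical, relations are not encoded, and the header that a complete TBX file requires is omitted.

```python
import xml.etree.ElementTree as ET

# Qualified name of the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tbx(groups, lang="en"):
    """Build a minimal TBX-style tree.

    groups: mapping of entry id -> list of synonymous terms (assumed shape).
    """
    martif = ET.Element("martif", {"type": "TBX", XML_LANG: lang})
    body = ET.SubElement(ET.SubElement(martif, "text"), "body")
    for entry_id, terms in groups.items():
        # One termEntry per synonym group, one tig/term per synonym.
        entry = ET.SubElement(body, "termEntry", {"id": entry_id})
        lang_set = ET.SubElement(entry, "langSet", {XML_LANG: lang})
        for term in terms:
            tig = ET.SubElement(lang_set, "tig")
            ET.SubElement(tig, "term").text = term
    return martif


xml_text = ET.tostring(
    build_tbx({"c1": ["GPU", "graphics processing unit"]}),
    encoding="unicode",
)
```

Placing all synonyms of a group inside one termEntry reflects the concept-oriented layout of TBX, in which an entry stands for a concept rather than for a single term.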

The application builds on three components: XLM-R, a language recognizer, and a tokenizer. Together they support the following languages:

Amharic
Arabic
Armenian
Bulgarian
Burmese
Chinese
Danish
Dutch
English
French
German
Greek
Hindi
Italian
Japanese
Kazakh
Marathi
Persian
Polish
Russian
Spanish
Urdu

XLM-R supports more languages than the language recognition tool and the tokenizer do. The tool may therefore also work for the following languages, but support cannot be guaranteed because of the other two components: Afrikaans, Albanian, Assamese, Azerbaijani, Basque, Belarusian, Bengali, Bengali Romanized, Bosnian, Breton, Catalan, Chinese (Simplified), Chinese (Traditional), Croatian, Czech, Esperanto, Estonian, Filipino, Finnish, French, Galician, Georgian, Gujarati, Hausa, Hebrew, Hindi Romanized, Hungarian, Icelandic, Indonesian, Irish, Javanese, Kannada, Khmer, Korean, Kurdish (Kurmanji), Kyrgyz, Lao, Latin, Latvian, Lithuanian, Macedonian, Malagasy, Malay, Malayalam, Mongolian, Nepali, Norwegian, Oriya, Oromo, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Sanskrit, Scottish Gaelic, Serbian, Sindhi, Sinhala, Slovak, Slovenian, Somali, Sundanese, Swahili, Swedish, Tamil, Tamil Romanized, Telugu, Telugu Romanized, Thai, Turkish, Ukrainian, Urdu Romanized, Uyghur, Uzbek, Vietnamese, Welsh, Western Frisian, Xhosa, Yiddish.
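
The two tiers of language support can be expressed as a simple lookup. The sketch below is illustrative and not part of the tool: the function support_level and its return values are assumptions, and POSSIBLE_EXCERPT deliberately contains only a few of the languages from the longer list.

```python
# Languages supported jointly by XLM-R, the language recognizer, and the tokenizer.
GUARANTEED = frozenset({
    "Amharic", "Arabic", "Armenian", "Bulgarian", "Burmese", "Chinese",
    "Danish", "Dutch", "English", "French", "German", "Greek", "Hindi",
    "Italian", "Japanese", "Kazakh", "Marathi", "Persian", "Polish",
    "Russian", "Spanish", "Urdu",
})

# Excerpt only: languages covered by XLM-R for which support is not guaranteed.
POSSIBLE_EXCERPT = frozenset({
    "Czech", "Finnish", "Korean", "Portuguese", "Swedish", "Turkish",
})


def support_level(language, guaranteed=GUARANTEED, possible=POSSIBLE_EXCERPT):
    """Classify a language name as 'guaranteed', 'possible', or 'unlisted'."""
    name = language.strip().casefold()
    # The guaranteed tier takes precedence if a name appears in both sets.
    if name in {g.casefold() for g in guaranteed}:
        return "guaranteed"
    if name in {p.casefold() for p in possible}:
        return "possible"
    return "unlisted"
```

Checking the guaranteed set first means that a language named in both lists is reported as guaranteed.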

The following software libraries were used in the development process of this application, in the versions and under the licenses listed:

Name	Developer	Software license
Sacremoses (Version 0.0.45)	Liling Tan (https://alvations.bitbucket.io/)	MIT License
Seaborn (Version 0.11.1)	Michael Waskom (https://mwaskom.github.io/)	BSD 3-Clause "New" or "Revised" License
Torch (Version 1.8.1+cu101) - component of PyTorch	Facebook and others (see license)	Facebook and others (see link in notes)
NLTK	NLTK Project	Apache 2.0
Numpy (Version 1.19.5)	NumPy	BSD 3-Clause "New" or "Revised" License
Pandas (Version 1.1.5)	Pandas Development Team	BSD 3-Clause "New" or "Revised" License
SentencePiece (Version 0.1.95)	Google	Apache License 2.0
Spacy (Version 2.2.4)	Explosion	MIT License
Transformers (Version 4.6.1)	Wolf et al. (https://aclanthology.org/2020.emnlp-demos.6/)	Apache License 2.0
MatPlotLib (Version 3.2.2)	Python Software Foundation	Python Software Foundation License (PSF)
scikit_learn (Version 0.24.2)	scikit-learn developers	BSD License
nvidia_ml_py3 (Version 7.352.0)	NVIDIA Corporation	BSD License
Seqeval (Version 1.2.2)	Hironsan	MIT License
Pynvml (Version 8.0.4)	NVIDIA Corporation	BSD License
graphviz (Version 0.10.1)	Sebastian Bank	MIT License
lxml (Version 4.2.6)	Stefan Behnel et al (https://lxml.de/4.2/credits.html)	BSD License
XLM-R	The Hugging Face Team	Apache License 2.0
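
To compare a local Python environment with the versions in the table, a check along the following lines may help. It is a sketch under assumptions: the distribution names in PINNED are guesses at the corresponding PyPI names, find_mismatches is a hypothetical helper, and NLTK and XLM-R are left out because the table gives no version for them.

```python
from importlib import metadata

# Versions from the table above; the distribution names are assumptions.
PINNED = {
    "sacremoses": "0.0.45",
    "seaborn": "0.11.1",
    "torch": "1.8.1+cu101",
    "numpy": "1.19.5",
    "pandas": "1.1.5",
    "sentencepiece": "0.1.95",
    "spacy": "2.2.4",
    "transformers": "4.6.1",
    "matplotlib": "3.2.2",
    "scikit-learn": "0.24.2",
    "nvidia-ml-py3": "7.352.0",
    "seqeval": "1.2.2",
    "pynvml": "8.0.4",
    "graphviz": "0.10.1",
    "lxml": "4.2.6",
}


def find_mismatches(pins=PINNED, get_version=metadata.version):
    """Return {name: (expected, installed)} for every deviating package.

    installed is None when the package is not installed at all.
    """
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches
```

Calling find_mismatches() without arguments inspects the current environment; an empty result means that every listed package is installed in exactly the listed version.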