Documentation of the Text2TCS-Application
A first version of the tool can be tried out and downloaded from the European Language Grid: live.european-language-grid.eu/catalogue/tool-service/23661/try%20out/
Text2TCS is an application that automatically extracts terminological concept systems from UTF-8 encoded texts. The application extracts terms, groups them into synonym groups, and extracts the relations that hold between these groups.
The following types of terms are considered:
- Single-word terms
- Multi-word terms (prepositional phrases, noun compounds, etc.)
- Named entities
- Short forms, including acronyms
- Synonyms
The following types of relations are extracted. Relation labels are always given in English, irrespective of the input language, according to the following typology:
- hierarchical relation:
  - generic relation (specific -> broader)
  - partitive relation (part -> whole)
- non-hierarchical relation:
  - activity relation (activity -> actor; activity -> entity)
  - causal relation (cause -> effect)
  - spatial relation (entity -> space)
  - instrumental relation (instrument -> purpose)
  - origination relation (entity -> origin)
  - property relation (entity -> property)
  - associative relation (loose thematic connection)
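The typology above can be modeled as a small data structure. The following is a minimal sketch only: the names `RelationType` and `Relation` and the printed notation are hypothetical and do not reflect the actual internal representation or output format of Text2TCS.

```python
from dataclasses import dataclass
from enum import Enum


class RelationType(Enum):
    """Relation typology; each value is (English label, is_hierarchical)."""

    GENERIC = ("generic relation", True)            # specific -> broader
    PARTITIVE = ("partitive relation", True)        # part -> whole
    ACTIVITY = ("activity relation", False)         # activity -> actor / entity
    CAUSAL = ("causal relation", False)             # cause -> effect
    SPATIAL = ("spatial relation", False)           # entity -> space
    INSTRUMENTAL = ("instrumental relation", False) # instrument -> purpose
    ORIGINATION = ("origination relation", False)   # entity -> origin
    PROPERTY = ("property relation", False)         # entity -> property
    ASSOCIATIVE = ("associative relation", False)   # loose thematic connection

    @property
    def label(self) -> str:
        return self.value[0]

    @property
    def is_hierarchical(self) -> bool:
        return self.value[1]


@dataclass(frozen=True)
class Relation:
    """A directed relation between two synonym groups (hypothetical model)."""

    source: str
    relation_type: RelationType
    target: str

    def __str__(self) -> str:
        return f"{self.source} -[{self.relation_type.label}]-> {self.target}"


# Illustrative data, not real tool output.
example = Relation("wheel", RelationType.PARTITIVE, "car")
hierarchical = [t.label for t in RelationType if t.is_hierarchical]
```

Storing the English label on the enum mirrors the statement that relations are labeled in English regardless of the input language.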
This information is output in two forms: our own text format, which is intended to be easy to read, and TBX/XML.
The application builds on three components: XLM-R, a language recognizer, and a tokenizer. Together they support the following languages:
- Amharic
- Arabic
- Armenian
- Bulgarian
- Burmese
- Chinese
- Danish
- English
- German
- Dutch
- French
- Greek
- Hindi
- Italian
- Japanese
- Kazakh
- Marathi
- Persian
- Polish
- Russian
- Spanish
- Urdu
XLM-R supports more languages than the language recognition tool and the tokenizer do. The application may therefore also work for the following languages, but this cannot be guaranteed because of the other two tools: Afrikaans, Albanian, Assamese, Azerbaijani, Basque, Belarusian, Bengali, Bengali Romanized, Bosnian, Breton, Catalan, Chinese (Simplified), Chinese (Traditional), Croatian, Czech, Esperanto, Estonian, Filipino, Finnish, French, Galician, Georgian, Gujarati, Hausa, Hebrew, Hindi Romanized, Hungarian, Icelandic, Indonesian, Irish, Javanese, Kannada, Khmer, Korean, Kurdish (Kurmanji), Kyrgyz, Lao, Latin, Latvian, Lithuanian, Macedonian, Malagasy, Malay, Malayalam, Mongolian, Nepali, Norwegian, Oriya, Oromo, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Sanskrit, Scottish Gaelic, Serbian, Sindhi, Sinhala, Slovak, Slovenian, Somali, Sundanese, Swahili, Swedish, Tamil, Tamil Romanized, Telugu, Telugu Romanized, Thai, Turkish, Ukrainian, Urdu Romanized, Uyghur, Uzbek, Vietnamese, Welsh, Western Frisian, Xhosa, Yiddish.
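The two tiers of language support described above can be expressed as a simple lookup. This is a hedged sketch: the function name `support_level` and the tier labels are hypothetical, and `POSSIBLE` holds only a short excerpt of the longer list, so the full list would need to be filled in for real use.

```python
# Languages supported jointly by XLM-R, the language recognizer, and the tokenizer.
GUARANTEED = frozenset({
    "Amharic", "Arabic", "Armenian", "Bulgarian", "Burmese", "Chinese",
    "Danish", "Dutch", "English", "French", "German", "Greek", "Hindi",
    "Italian", "Japanese", "Kazakh", "Marathi", "Persian", "Polish",
    "Russian", "Spanish", "Urdu",
})

# Excerpt only: languages covered by XLM-R but not guaranteed by the other tools.
POSSIBLE = frozenset({"Czech", "Finnish", "Korean", "Portuguese", "Swedish", "Turkish"})


def support_level(language: str) -> str:
    """Return 'guaranteed', 'possible', or 'unknown' for a language name."""
    name = language.strip().title()
    if name in GUARANTEED:
        return "guaranteed"
    if name in POSSIBLE:
        return "possible"
    return "unknown"
```

For example, `support_level("german")` falls into the guaranteed tier, while `support_level("Korean")` is only in the possible tier.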
The following software libraries, listed with their versions and licenses, were used in the development process of this application:
Name | Developer | Software license |
Sacremoses (Version 0.0.45) | Liling Tan | MIT License |
Seaborn (Version 0.11.1) | Michael Waskom (https://mwaskom.github.io/) | BSD 3-Clause "New" or "Revised" License |
Torch (Version 1.8.1+cu101) - component of PyTorch | Facebook and others (see license) | Facebook and others (see link in notes) |
NLTK | NLTK Project | Apache 2.0 |
Numpy (Version 1.19.5) | NumPy | BSD 3-Clause "New" or "Revised" License |
Pandas (Version 1.1.5) | Pandas Development Team | BSD 3-Clause "New" or "Revised" License |
SentencePiece (Version 0.1.95) | Google | Apache License 2.0 |
Spacy (Version 2.2.4) | Explosion | MIT License |
Transformers (Version 4.6.1) | Wolf et al. (https://aclanthology.org/2020.emnlp-demos.6/) | Apache License 2.0 |
MatPlotLib (Version 3.2.2) | Python Software Foundation | Python Software Foundation License (PSF) |
scikit_learn (Version 0.24.2) | scikit-learn developers | BSD License |
nvidia_ml_py3 (Version 7.352.0) | NVIDIA Corporation | BSD License |
Seqeval (Version 1.2.2) | Hironsan | MIT License |
Pynvml (Version 8.0.4) | NVIDIA Corporation | BSD License |
graphviz (Version 0.10.1) | Sebastian Bank | MIT License |
lxml (Version 4.2.6) | Stefan Behnel et al (https://lxml.de/4.2/credits.html) | BSD License |
XLM-R | The Hugging Face Team | Apache License 2.0 |
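To compare a local Python environment against the versions in the table, a check along the following lines may help. It is a sketch under assumptions: the distribution names in `PINNED` are the PyPI names I assume correspond to the table entries, `check_versions` is a hypothetical helper, and NLTK and XLM-R are omitted because the table gives no version for them.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from the table; distribution names are assumed PyPI names.
PINNED = {
    "sacremoses": "0.0.45",
    "seaborn": "0.11.1",
    "torch": "1.8.1+cu101",
    "numpy": "1.19.5",
    "pandas": "1.1.5",
    "sentencepiece": "0.1.95",
    "spacy": "2.2.4",
    "transformers": "4.6.1",
    "matplotlib": "3.2.2",
    "scikit-learn": "0.24.2",
    "nvidia-ml-py3": "7.352.0",
    "seqeval": "1.2.2",
    "pynvml": "8.0.4",
    "graphviz": "0.10.1",
    "lxml": "4.2.6",
}


def check_versions(pinned=PINNED):
    """Map each package to 'ok', 'missing', or a 'mismatch (...)' message."""
    report = {}
    for name, wanted in pinned.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            report[name] = "missing"
            continue
        if installed == wanted:
            report[name] = "ok"
        else:
            report[name] = f"mismatch (installed {installed})"
    return report


if __name__ == "__main__":
    for package, status in check_versions().items():
        print(f"{package}: {status}")
```

A mismatch does not necessarily mean the application will fail; the table only records the versions used during development.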