Documentation of the Text2TCS-Application

A first version of the tool can be tried out and downloaded from the European Language Grid: https://live.european-language-grid.eu/catalogue/tool-service/7375

Text2TCS is an application that automatically extracts terminological concept systems from texts encoded in UTF-8. The application extracts terms, groups them into synonym groups, and extracts the relations that hold between these groups.
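Since the input has to be UTF-8 encoded, it can help to validate a file before submitting it. The following is a minimal sketch using only the Python standard library; the function name `read_utf8_text` is hypothetical and not part of Text2TCS.

```python
from pathlib import Path


def read_utf8_text(path):
    """Return the file's text, or raise ValueError if it is not valid UTF-8."""
    raw = Path(path).read_bytes()
    try:
        # "utf-8-sig" decodes UTF-8 and also drops a leading byte order mark.
        return raw.decode("utf-8-sig")
    except UnicodeDecodeError as err:
        raise ValueError(
            f"{path} is not valid UTF-8 (first bad byte at offset {err.start})"
        ) from err
```

A file that fails this check would need to be converted to UTF-8 first, for example by decoding it with its actual encoding and writing it back with `encoding="utf-8"`.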

The following types of terms are considered: 

  • Single-word terms
  • Multi-word terms (prepositional phrases, noun compounds, etc.)
  • Named entities
  • Short forms, including acronyms
  • Synonyms

The following types of relations are extracted. Irrespective of the input language, the relations are always labeled in English according to the following typology:

  • hierarchical relation:
    • generic relation (specific -> broader)
    • partitive relation (part -> whole)
  • non-hierarchical relation:
    • activity relation (activity -> actor; activity -> entity)
    • causal relation (cause -> effect)
    • spatial relation (entity -> space)
    • instrumental relation (instrument -> purpose)
    • origination relation (entity -> origin)
    • property relation (entity -> property)
    • associative relation (loose thematic connection)
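The typology above can be sketched as a small data model. This is an illustration only: the class names `RelationType`, `Concept`, and `Relation` are hypothetical and do not reflect the internal representation used by Text2TCS. Each relation type stores its English label and the roles of its source and target, in the direction given in the list.

```python
from dataclasses import dataclass
from enum import Enum


class RelationType(Enum):
    # value = (English label, source role, target role)
    GENERIC = ("generic relation", "specific", "broader")
    PARTITIVE = ("partitive relation", "part", "whole")
    ACTIVITY = ("activity relation", "activity", "actor or entity")
    CAUSAL = ("causal relation", "cause", "effect")
    SPATIAL = ("spatial relation", "entity", "space")
    INSTRUMENTAL = ("instrumental relation", "instrument", "purpose")
    ORIGINATION = ("origination relation", "entity", "origin")
    PROPERTY = ("property relation", "entity", "property")
    ASSOCIATIVE = ("associative relation", None, None)  # no fixed direction


# Only these two types are hierarchical in the typology.
HIERARCHICAL = frozenset({RelationType.GENERIC, RelationType.PARTITIVE})


@dataclass(frozen=True)
class Concept:
    terms: frozenset  # one synonym group, e.g. {"car", "automobile"}


@dataclass(frozen=True)
class Relation:
    source: Concept
    target: Concept
    type: RelationType

    @property
    def is_hierarchical(self):
        return self.type in HIERARCHICAL


# Usage: "engine" is part of "car", so the relation points part -> whole.
engine = Concept(frozenset({"engine", "motor"}))
car = Concept(frozenset({"car", "automobile"}))
rel = Relation(source=engine, target=car, type=RelationType.PARTITIVE)
print(rel.type.value[0], rel.is_hierarchical)
```

Modeling a concept as a set of synonymous terms mirrors the description above, where relations hold between synonym groups rather than between individual terms.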

The extracted information is output both in our own text format, which is intended to increase readability, and in TBX (XML).
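As an illustration of how a TBX export could be consumed, the sketch below reads synonym groups from a small TBX-style sample with the Python standard library. The sample document, its element layout (`termEntry`, `langSet`, `tig`, `term`), and the function name `synonym_groups` are assumptions made for this example; the files produced by Text2TCS may use a different TBX dialect, namespaces, or additional elements for relations.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, not actual Text2TCS output.
SAMPLE_TBX = """<?xml version="1.0" encoding="UTF-8"?>
<martif type="TBX" xml:lang="en">
  <text><body>
    <termEntry id="c1">
      <langSet xml:lang="en">
        <tig><term>car</term></tig>
        <tig><term>automobile</term></tig>
      </langSet>
    </termEntry>
    <termEntry id="c2">
      <langSet xml:lang="en">
        <tig><term>engine</term></tig>
      </langSet>
    </termEntry>
  </body></text>
</martif>"""


def synonym_groups(tbx_bytes):
    """Map each entry id to the list of terms found inside that entry."""
    root = ET.fromstring(tbx_bytes)
    groups = {}
    for entry in root.iter("termEntry"):
        groups[entry.get("id")] = [term.text for term in entry.iter("term")]
    return groups


groups = synonym_groups(SAMPLE_TBX.encode("utf-8"))
print(groups)
```

If the real export declares an XML namespace, the element names passed to `iter` would need the corresponding namespace prefix.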

The application builds on three components: XLM-R, a language recognizer, and a tokenizer. Together, they support the following languages:

  • Amharic
  • Arabic
  • Armenian
  • Bulgarian
  • Burmese
  • Chinese
  • Danish
  • English
  • Dutch
  • German
  • French
  • Greek
  • Hindi
  • Italian
  • Japanese
  • Kazakh
  • Marathi
  • Persian
  • Polish
  • Russian
  • Spanish
  • Urdu

XLM-R supports more languages than the language recognition tool and the tokenizer do. The XLM-R-based tool may therefore also work for the following languages, but because of the other two tools this cannot be guaranteed: Afrikaans, Albanian, Assamese, Azerbaijani, Basque, Belarusian, Bengali, Bengali Romanized, Bosnian, Breton, Catalan, Chinese (Simplified), Chinese (Traditional), Croatian, Czech, Esperanto, Estonian, Filipino, Finnish, French, Galician, Georgian, Gujarati, Hausa, Hebrew, Hindi Romanized, Hungarian, Icelandic, Indonesian, Irish, Javanese, Kannada, Khmer, Korean, Kurdish (Kurmanji), Kyrgyz, Lao, Latin, Latvian, Lithuanian, Macedonian, Malagasy, Malay, Malayalam, Mongolian, Nepali, Norwegian, Oriya, Oromo, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Sanskrit, Scottish Gaelic, Serbian, Sindhi, Sinhala, Slovak, Slovenian, Somali, Sundanese, Swahili, Swedish, Tamil, Tamil Romanized, Telugu, Telugu Romanized, Thai, Turkish, Ukrainian, Urdu Romanized, Uyghur, Uzbek, Vietnamese, Welsh, Western Frisian, Xhosa, Yiddish.
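The distinction between guaranteed and merely possible language support can be expressed as a simple lookup. This is a sketch, not part of the application: the function name `support_level` and its return labels are hypothetical, and the second set is abbreviated to a few entries from the list above for brevity.

```python
# Languages supported jointly by XLM-R, the language recognizer, and the tokenizer.
GUARANTEED = {
    "amharic", "arabic", "armenian", "bulgarian", "burmese", "chinese",
    "danish", "dutch", "english", "french", "german", "greek", "hindi",
    "italian", "japanese", "kazakh", "marathi", "persian", "polish",
    "russian", "spanish", "urdu",
}

# Abbreviated excerpt of the languages covered by XLM-R only.
XLMR_ONLY_EXCERPT = {"afrikaans", "czech", "finnish", "korean", "swedish", "turkish"}


def support_level(language):
    """Classify a language name as 'guaranteed', 'possible', or 'unlisted'."""
    key = language.strip().casefold()
    if key in GUARANTEED:
        return "guaranteed"
    if key in XLMR_ONLY_EXCERPT:
        return "possible"
    return "unlisted"


for name in ("German", "Czech", "Klingon"):
    print(name, "->", support_level(name))
```

The guaranteed set is checked first, so a language that appears in both lists is reported as guaranteed.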

The following software libraries, listed with their versions and licenses, were used in the development process of this application:

Name | Developer | Software license
-----|-----------|-----------------
Sacremoses (Version 0.0.45) | Liling Tan (https://alvations.bitbucket.io/) | MIT License
Seaborn (Version 0.11.1) | Michael Waskom (https://mwaskom.github.io/) | BSD 3-Clause "New" or "Revised" License
Torch (Version 1.8.1+cu101) - component of PyTorch | Facebook and others (see license) | Facebook and others (see link in notes)
NLTK | NLTK Project | Apache 2.0
NumPy (Version 1.19.5) | NumPy | BSD 3-Clause "New" or "Revised" License
Pandas (Version 1.1.5) | Pandas Development Team | BSD 3-Clause "New" or "Revised" License
SentencePiece (Version 0.1.95) | Google | Apache License 2.0
spaCy (Version 2.2.4) | Explosion | MIT License
Transformers (Version 4.6.1) | Wolf et al. (https://aclanthology.org/2020.emnlp-demos.6/) | Apache License 2.0
Matplotlib (Version 3.2.2) | Python Software Foundation | Python Software Foundation License (PSF)
scikit_learn (Version 0.24.2) | scikit-learn developers | BSD License
nvidia_ml_py3 (Version 7.352.0) | NVIDIA Corporation | BSD License
Seqeval (Version 1.2.2) | Hironsan | MIT License
Pynvml (Version 8.0.4) | NVIDIA Corporation | BSD License
graphviz (Version 0.10.1) | Sebastian Bank | MIT License
lxml (Version 4.2.6) | Stefan Behnel et al. (https://lxml.de/4.2/credits.html) | BSD License
XLM-R | The Hugging Face Team | Apache License 2.0
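To compare a local environment against the versions listed above, a check along the following lines could be used. This is a minimal sketch: the helper `check_versions` is hypothetical, the distribution names are assumed to match the usual PyPI names of these libraries, and only a subset of the listed versions is included.

```python
from importlib import metadata

# Subset of the versions listed above; keys are assumed PyPI distribution names.
PINNED = {
    "sacremoses": "0.0.45",
    "seaborn": "0.11.1",
    "numpy": "1.19.5",
    "pandas": "1.1.5",
    "sentencepiece": "0.1.95",
    "spacy": "2.2.4",
    "transformers": "4.6.1",
    "scikit-learn": "0.24.2",
    "seqeval": "1.2.2",
    "lxml": "4.2.6",
}


def check_versions(pinned):
    """Return {name: (wanted, installed, matches)}; installed is None if missing."""
    report = {}
    for name, wanted in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        report[name] = (wanted, installed, installed == wanted)
    return report


if __name__ == "__main__":
    for name, (wanted, installed, ok) in check_versions(PINNED).items():
        status = "ok" if ok else "differs"
        print(f"{name}: wanted {wanted}, installed {installed} ({status})")
```

A differing version does not necessarily mean the application will fail; the check only reports where an environment deviates from the versions used during development.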