Unsupervised Classification of Work-Related Texts Using Sentence-Level Embeddings and K-Means Clustering
2025 66th International Scientific Conference on Information Technology and Management Science of Riga Technical University (ITMS 2025): Proceedings 2025
Henrihs Gorskis, Vitālijs Zabiņako, Jūlija Strebko, Jurijs Korņijenko, Andrejs Romānovs

This paper presents a method for unsupervised classification of work-related texts by leveraging sentence-level embeddings and k-means clustering. The proposed approach is applied to a corpus of heterogeneous work documents and a user-content database with associated texts. Documents are preprocessed, translated to English, and embedded using a transformer-based sentence encoder. The k-means algorithm is trained on these embeddings, and classification is performed based on similarity to learned centroids. Results demonstrate competitive overall accuracy and show that the method is capable of detecting a portion of non-work-related cases, an outcome that highlights both its potential and the need for further refinement.
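
The pipeline summarized above (embed, cluster with k-means, classify by similarity to the learned centroids) can be illustrated with a minimal sketch. It is not the authors' implementation. Several things here are assumptions made only for illustration: the tiny 3-dimensional vectors stand in for real sentence embeddings, cosine similarity is used as the similarity measure, the centroids are initialized from the first k points, and a hypothetical threshold of 0.8 is used to flag texts that match no cluster well. The abstract does not specify any of these details.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def kmeans(points, k, iters=50):
    """Plain Lloyd's k-means; the first k points seed the centroids (a simplification)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest centroid (Euclidean distance).
            idx = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        # Recompute centroids; an empty cluster keeps its previous centroid.
        new = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if new == centroids:
            break
        centroids = new
    return centroids

def classify(vec, centroids, threshold=0.8):
    """Return (cluster index or None, best similarity).

    None means no centroid is similar enough; the threshold is an assumed
    illustration of how poorly matching texts might be flagged.
    """
    sims = [cosine(vec, c) for c in centroids]
    best = max(range(len(centroids)), key=lambda i: sims[i])
    return (best if sims[best] >= threshold else None), sims[best]

# Toy stand-ins for sentence embeddings (real encoders produce hundreds of dimensions).
embeddings = [
    [0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [1.0, 0.0, 0.1],   # one topic
    [0.1, 0.9, 0.0], [0.0, 1.0, 0.1], [0.2, 0.8, 0.1],   # another topic
]
centroids = kmeans(embeddings, k=2)

for vec in ([0.85, 0.15, 0.05], [0.05, 0.95, 0.05], [0.0, 0.0, 1.0]):
    label, sim = classify(vec, centroids)
    print(vec, "->", label, round(sim, 3))
```

In a realistic setting the toy vectors would be replaced by the output of a transformer-based sentence encoder, and both the number of clusters and the threshold would need to be tuned on the actual corpus.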


Keywords
Work-related text classification, Sentence embeddings, Unsupervised learning, K-means clustering, Multilingual text analysis
DOI
10.1109/ITMS67030.2025.11236545
Hyperlink
https://ieeexplore.ieee.org/document/11236545

Gorskis, H., Zabiņako, V., Strebko, J., Korņijenko, J., Romānovs, A. Unsupervised Classification of Work-Related Texts Using Sentence-Level Embeddings and K-Means Clustering. In: 2025 66th International Scientific Conference on Information Technology and Management Science of Riga Technical University (ITMS 2025): Proceedings, Latvia, Riga, 9-10 October, 2025. Piscataway: IEEE, 2025, Article number 11236545. ISBN 979-8-3315-4529-1. e-ISBN 979-8-3315-4528-4. ISSN 2771-6953. e-ISSN 2771-6937. Available from: doi:10.1109/ITMS67030.2025.11236545

Publication language
English (en)
RTU Scientific Library.
E-mail: uzzinas@rtu.lv; Phone: +371 28399196