BERT for Classification of Russian Functional Styles

Valery D. Solovyev, Andrey M. Ten, Marina I. Solnyshkina, Mariia I. Andreeva

Abstract


This paper tests the hypothesis that texts belonging to different functional styles possess distinct quantitative and linguistic parameters specific to each style, and that these parameters allow for quantitative classification using BERT. The research aims to develop a BERT classification model based on linguistic features of texts in five main functional styles: scientific, literary, official-business, journalistic, and colloquial. This approach addresses the problem of automatic classification of Russian functional styles based on statistical and morphological characteristics of texts. The hyperparameters selected for training the neural network include batch size, number of epochs, and initial learning rate. The study corpus comprises texts in the five above-mentioned styles, totaling 163,421,783 tokens, sourced from the Russian National Corpus. The methods include quantitative text analysis, morphological annotation, exhaustive analysis, and machine learning algorithms. The developed approach demonstrated high classification accuracy, indicating the promise of the proposed method. The results can be applied to tasks in automatic text processing, authorship attribution, and stylistic analysis. Future work includes developing classification models for various genres and domains, testing alternative transformer architectures (such as RoBERTa and GPT), using larger datasets, and studying the impact of different fine-tuning strategies on classification quality.


Keywords


BERT, functional style, text classification, corpus linguistics, stylometry, automatic text analysis, statistical parameters, morphological annotation
