Development of a prototype system for recognizing and classifying corporate documents
https://doi.org/10.31432/1994-2443.2025.11
Abstract
Relevance. In today’s environment, improving the accuracy and speed of document processing is becoming increasingly important.
Objective. To develop a system for converting, recognizing, and classifying corporate documents stored in non-editable formats.
Materials and Methods. The system was developed in Python 3.10 using the scikit-learn 1.6 library, joblib, poppler, the Razdel module, PyTorch 2.2, and Hugging Face Transformers 4.39. Text was extracted with the PyPDF2, pdfminer.six, and pdfplumber packages and with the Tesseract OCR 5 engine via pytesseract. The opencv-python package was used to eliminate line breaks and reduce noise. The web interface was built with Vite and React using Bootstrap 5.
Results. A prototype system was developed that efficiently converts a given document from a non-editable format to an editable one.
Conclusions. The use of artificial intelligence technologies accelerates workflows and reduces the room for error. The solution integrates into existing workflows, but training the classifier requires a large amount of data.
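The convert, recognize, and classify stages named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function `process_document`, the `min_chars` threshold, and the rule "fall back to OCR when the embedded text layer is nearly empty" are all assumptions made for illustration. The three stages are passed in as plain callables, so a real setup could plug in, for example, a pdfplumber-based extractor, a pytesseract-based recognizer, and a scikit-learn model, while the sketch itself runs on the standard library alone.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProcessedDocument:
    text: str    # editable text recovered from the document
    source: str  # "text_layer" or "ocr": which stage produced the text
    label: str   # class assigned by the classifier


def process_document(
    path: str,
    extract_text_layer: Callable[[str], str],
    run_ocr: Callable[[str], str],
    classify: Callable[[str], str],
    min_chars: int = 50,  # hypothetical threshold, not taken from the article
) -> ProcessedDocument:
    """Try the embedded text layer first; fall back to OCR if it is too short."""
    text = extract_text_layer(path).strip()
    source = "text_layer"
    if len(text) < min_chars:
        # Assumption: a near-empty text layer means the file is a scan.
        text = run_ocr(path).strip()
        source = "ocr"
    return ProcessedDocument(text=text, source=source, label=classify(text))


def keyword_classifier(text: str) -> str:
    """Toy stand-in for a trained classifier (illustrative only)."""
    return "invoice" if "invoice" in text.lower() else "other"


# Stub stages stand in for real extractors so the sketch runs anywhere.
scanned = process_document(
    "scan.pdf",
    extract_text_layer=lambda p: "",
    run_ocr=lambda p: "INVOICE No. 17 for consulting services",
    classify=keyword_classifier,
    min_chars=5,
)
digital = process_document(
    "contract.pdf",
    extract_text_layer=lambda p: "Employment contract between the parties",
    run_ocr=lambda p: "",
    classify=keyword_classifier,
    min_chars=5,
)
print(scanned.source, scanned.label)
print(digital.source, digital.label)
```

Keeping the stages as injected callables lets each one be swapped or tested separately, which is one possible way to accommodate the several extraction packages listed under Materials and Methods.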
About the Authors
I. V. Perlov
Russian Federation
Ivan V. Perlov
78, Vernadsky Avenue, Moscow, 119454
S. A. Selivanov
Russian Federation
Sergey A. Selivanov, Cand. Sci. (Eng.), Associate Professor
78, Vernadsky Avenue, Moscow, 119454
A. V. Sinitsyn
Russian Federation
Alexander V. Sinitsyn, PhD of Physico-Mathematical Sciences, Associate Professor
78, Vernadsky Avenue, Moscow, 119454
Sh. M. Shakhguseynov
Russian Federation
Shamhal M. Shakhguseynov
78, Vernadsky Avenue, Moscow, 119454
For citations:
Perlov I.V., Selivanov S.A., Sinitsyn A.V., Shakhguseynov Sh.M. Development of a prototype system for recognizing and classifying corporate documents. Information and Innovations. 2025;20(2):41-57. (In Russ.) https://doi.org/10.31432/1994-2443.2025.11