Information and Innovations

Development of a prototype system for recognizing and classifying corporate documents

https://doi.org/10.31432/1994-2443.2025.11

Abstract

Relevance. In today’s environment, improving the accuracy and speed of document processing is becoming increasingly important.
Objective. To develop a system for converting, recognizing, and classifying corporate documents stored in non-editable formats.
Materials and Methods. The development used Python 3.10, scikit-learn 1.6, joblib, poppler, the Razdel module, PyTorch 2.2, and Hugging Face Transformers 4.39, together with the PyPDF2, pdfminer.six, and pdfplumber packages and the Tesseract OCR 5 engine accessed through pytesseract. The opencv-python package was used to eliminate line breaks and reduce noise. The web interface was built with Vite and React using Bootstrap 5.
Results. A prototype system was developed that converts the contents of a given document from a non-editable format to an editable one.
Conclusions. The use of artificial intelligence technologies accelerates workflows and reduces the scope for errors. The solution integrates into existing workflows, but training the classifier requires a large amount of data.
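
The classification step named in the abstract can be illustrated with a minimal sketch built on scikit-learn and joblib, two of the listed tools. The abstract does not specify the features or the model, so the TF-IDF representation, the logistic regression classifier, the toy corpus, and the names `TRAIN_TEXTS`, `TRAIN_LABELS`, `build_classifier`, and `round_trip` are all assumptions made for illustration, not the authors' implementation.

```python
import io

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: text as it might look after extraction or OCR.
TRAIN_TEXTS = [
    "invoice number total amount due payment terms",
    "invoice for services rendered amount payable vat",
    "invoice payment due date bank account total",
    "employment contract parties obligations term termination",
    "supply contract parties agree obligations liability",
    "contract between the parties subject and term of agreement",
]
TRAIN_LABELS = ["invoice", "invoice", "invoice",
                "contract", "contract", "contract"]


def build_classifier():
    """TF-IDF features followed by a linear classifier (an assumed design)."""
    return make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),
        LogisticRegression(max_iter=1000),
    )


def round_trip(model):
    """Serialize a fitted model with joblib and load it back from memory."""
    buffer = io.BytesIO()
    joblib.dump(model, buffer)
    buffer.seek(0)
    return joblib.load(buffer)


model = build_classifier()
model.fit(TRAIN_TEXTS, TRAIN_LABELS)

# A restored model should behave exactly like the one that was trained.
restored = round_trip(model)
new_documents = ["invoice total amount due", "contract parties obligations"]
predictions = list(restored.predict(new_documents))
```

A corpus this small only demonstrates the mechanics. As the conclusions note, a usable classifier needs a large amount of labelled data.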

About the Authors

I. V. Perlov
Federal State Budgetary Educational Institution of Higher Education “MIREA — Russian Technological University”
Russian Federation

Ivan V. Perlov

78, Vernadsky Avenue, Moscow, 119454



S. A. Selivanov
Federal State Budgetary Educational Institution of Higher Education “MIREA — Russian Technological University”
Russian Federation

Sergey A. Selivanov, Cand. Sci. (Eng.), Associate Professor

78, Vernadsky Avenue, Moscow, 119454



A. V. Sinitsyn
Federal State Budgetary Educational Institution of Higher Education “MIREA — Russian Technological University”
Russian Federation

Alexander V. Sinitsyn, PhD in Physical and Mathematical Sciences, Associate Professor

78, Vernadsky Avenue, Moscow, 119454



Sh. M. Shakhguseynov
Federal State Budgetary Educational Institution of Higher Education “MIREA — Russian Technological University”
Russian Federation

Shamhal M. Shakhguseynov

78, Vernadsky Avenue, Moscow, 119454



For citations:


Perlov I.V., Selivanov S.A., Sinitsyn A.V., Shakhguseynov Sh.M. Development of a prototype system for recognizing and classifying corporate documents. Information and Innovations. 2025;20(2):41-57. (In Russ.) https://doi.org/10.31432/1994-2443.2025.11

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 1994-2443 (Print)
ISSN 2949-2157 (Online)