Training Tesseract for Low-Resource Languages

By: Hackernoon
2025/08/20 15:00

Abstract and 1. Introduction

1.1 Printing Press in Iraq and Iraqi Kurdistan

1.2 Challenges in Historical Documents

1.3 Kurdish Language

2. Related work and 2.1 Arabic/Persian

2.2 Chinese/Japanese and 2.3 Coptic

2.4 Greek

2.5 Latin

2.6 Tamizhi

3. Method and 3.1 Data Collection

3.2 Data Preparation and 3.3 Preprocessing

3.4 Environment Setup, 3.5 Dataset Preparation, and 3.6 Evaluation

4. Experiments, Results, and Discussion and 4.1 Processed Data

4.2 Dataset and 4.3 Experiments

4.4 Results and Evaluation

4.5 Discussion

5. Conclusion

5.1 Challenges and Limitations

Online Resources, Acknowledgments, and References

5 Conclusion

The primary motivation for this study is the large volume of historical documents held in libraries that have not yet been processed. This gap prompted an exploration of OCR technology for Kurdish, a low-resource language. An OCR system that can extract text from historical Kurdish documents would substantially expand the resources available for the language.

Extensive research was conducted to assess existing OCR systems for Kurdish and other languages worldwide. The investigation focused on previous work, accuracy, and underlying technology. It was determined that Tesseract was a suitable option for this research.

Figure 18: A sample page from the book titled 'Awreky Pashawa', published in 1930 (Zheen Center for Documentation and Research)

Figure 19: Manual transcription of the page

Figure 20: The transcription generated by our model

Table 1: Summary of the dataset

Table 2: Ocreval result

Once the technology was identified, efforts were made to collect digital copies of historical documents printed before 1950. This proved challenging, because the documents first had to be located and then converted into digital format. Fortunately, the Zheen Center for Documentation and Research in Sulaymaniyah, which specializes in archiving historical documents, provided digital copies of some books.

Once the digitized copies were received, a dataset was created to train the Tesseract model. Text lines were extracted from the pages, transcribed individually, and preprocessed to form the dataset.
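The conclusion does not spell out how the line images and their transcriptions were stored. A common layout for Tesseract LSTM training, used by the tesstrain tooling, pairs each line image with a same-named `.gt.txt` transcription file. Assuming that layout, a minimal sketch of a sanity check that every line image has a transcription might look like this (the function name `split_pairs` and the example file names are hypothetical):

```python
from pathlib import PurePath

# Assumption: tesstrain-style naming, e.g. line_001.png + line_001.gt.txt
IMAGE_SUFFIXES = {".png", ".tif"}
GT_SUFFIX = ".gt.txt"


def split_pairs(filenames):
    """Group file names into complete pairs and orphans.

    Returns (paired, missing_text, missing_image), each a sorted list of
    base names without their suffix.
    """
    images = {}
    texts = {}
    for name in filenames:
        if name.endswith(GT_SUFFIX):
            # Transcription file: strip the full ".gt.txt" suffix.
            texts[name[: -len(GT_SUFFIX)]] = name
        else:
            suffix = PurePath(name).suffix
            if suffix.lower() in IMAGE_SUFFIXES:
                # Line image: strip the image extension.
                images[name[: -len(suffix)]] = name
    paired = sorted(images.keys() & texts.keys())
    missing_text = sorted(images.keys() - texts.keys())
    missing_image = sorted(texts.keys() - images.keys())
    return paired, missing_text, missing_image


# Hypothetical file names for illustration only.
paired, missing_text, missing_image = split_pairs(
    ["line_001.png", "line_001.gt.txt", "line_002.tif", "line_003.gt.txt"]
)
print(paired, missing_text, missing_image)
```

A check like this is cheap to run before training and catches lines that were extracted but never transcribed, or transcriptions whose image went missing.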

With a dataset of 1233 lines, the model was trained using the Arabic model as its base. After training, the model's performance was evaluated using several methods. Tesseract's built-in evaluator, lstmeval, reported a character error rate (CER) of 0.755%, while Ocreval reported an average character accuracy of 84.02%. Finally, an in-house web application was developed to give end users an easy-to-use interface: they submit an image of a page and receive the extracted text.
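For readers unfamiliar with these metrics, CER is conventionally the character-level edit distance between the recognized text and the reference transcription, divided by the reference length, and character accuracy is its complement. The sketch below illustrates that definition only. It is not the implementation used by lstmeval or Ocreval, whose exact counting rules may differ, and the function names are hypothetical.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions, and substitutions
    needed to turn hyp into ref (computed row by row)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return levenshtein(ref, hyp) / len(ref)


def character_accuracy(ref: str, hyp: str) -> float:
    """Complement of CER; can be negative if hyp is much longer than ref."""
    return 1.0 - cer(ref, hyp)


if __name__ == "__main__":
    print(cer("abcd", "abxd"))
    print(character_accuracy("abcd", "abxd"))
```

For example, `cer("abcd", "abxd")` gives 0.25, since one of the four reference characters is substituted. Python strings are sequences of Unicode code points, so each Arabic-script letter in a Kurdish transcription counts as one character under this definition.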

This model could be a valuable tool for libraries and centers, enabling them to extract text from historical documents and process it further.


:::info Authors:

(1) Blnd Yaseen, University of Kurdistan Hewlêr, Kurdistan Region - Iraq ([email protected]);

(2) Hossein Hassani, University of Kurdistan Hewlêr, Kurdistan Region - Iraq ([email protected]).

:::

:::info This paper is available on arXiv under the CC BY-NC-ND 4.0 (Attribution-NonCommercial-NoDerivs 4.0 International) license.

:::
