This article explores the creation of an OCR system for Kurdish, a low-resource language with vast unprocessed historical archives. Using Tesseract, researchers built and trained a model on digitized pre-1950 texts from the Zheen Center, achieving notable accuracy rates. The study highlights both the technical challenges of dataset preparation and the cultural significance of preserving Kurdish heritage through digital accessibility.

Training Tesseract for Low-Resource Languages

Abstract and 1. Introduction

1.1 Printing Press in Iraq and Iraqi Kurdistan

1.2 Challenges in Historical Documents

1.3 Kurdish Language

  2. Related work and 2.1 Arabic/Persian

    2.2 Chinese/Japanese and 2.3 Coptic

    2.4 Greek

    2.5 Latin

    2.6 Tamizhi

  3. Method and 3.1 Data Collection

    3.2 Data Preparation and 3.3 Preprocessing

    3.4 Environment Setup, 3.5 Dataset Preparation, and 3.6 Evaluation

  4. Experiments, Results, and Discussion and 4.1 Processed Data

    4.2 Dataset and 4.3 Experiments

    4.4 Results and Evaluation

    4.5 Discussion

  5. Conclusion

    5.1 Challenges and Limitations

    Online Resources, Acknowledgments, and References

5 Conclusion

The primary motivation for this study is the large number of historical documents held in libraries that have yet to be processed. The absence of tools for processing them led us to explore OCR technology for Kurdish, a low-resource language. An OCR system that extracts text from historical Kurdish documents would greatly expand the resources available for the language.

\ Extensive research was conducted to assess existing OCR systems for Kurdish and other languages worldwide. The investigation focused on previous work, accuracy, and underlying technology. It was determined that Tesseract was a suitable option for this research.

\ Figure 18: A sample page from the book titled ‘Awreky Pashawa’, published in 1930 (Zheen Center for Documentation and Research)

\ Figure 19: Manual transcription of the page

\ Figure 20: The transcription generated by our model

\ Table 1: Summary of the dataset

\ Table 2: Ocreval result

\ Once the technology was identified, efforts were made to collect digital copies of historical documents printed before 1950. This task proved challenging, as locating documents and converting them into digital format presented additional hurdles. Fortunately, the Zheen Center for Documentation and Research in Sulaymaniyah, which specializes in archiving historical documents, provided some books in the form of digital copies.

\ Upon receiving the digitized copies, a dataset was created to train the Tesseract model. Text lines were extracted from the pages, transcribed individually, and subjected to preprocessing to prepare the dataset.
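\ As an illustration of the dataset step, the sketch below checks that every extracted line image has a matching transcription. It assumes a layout like the one used by the tesstrain tooling, where each line image sits beside a `.gt.txt` file with the same base name; the conclusion does not state the exact layout used in the study, and the function name `find_unpaired_lines` and the accepted image suffixes are hypothetical.

```python
from pathlib import Path

# Assumed image formats for line images; adjust to the actual dataset.
IMAGE_SUFFIXES = {".png", ".tif", ".tiff"}
# Assumed transcription suffix (tesstrain-style convention).
GROUND_TRUTH_SUFFIX = ".gt.txt"


def find_unpaired_lines(ground_truth_dir):
    """Return (images_without_text, texts_without_image) as sorted name lists."""
    root = Path(ground_truth_dir)
    image_stems = set()
    text_stems = set()
    for path in root.iterdir():
        if path.name.endswith(GROUND_TRUTH_SUFFIX):
            # Strip the full ".gt.txt" suffix to recover the shared base name.
            text_stems.add(path.name[: -len(GROUND_TRUTH_SUFFIX)])
        elif path.suffix.lower() in IMAGE_SUFFIXES:
            image_stems.add(path.stem)
    return sorted(image_stems - text_stems), sorted(text_stems - image_stems)
```

Running a check like this before training helps catch lines that were cropped but never transcribed, or transcriptions whose image was renamed or lost.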

\ Using a dataset of 1,233 lines, the model was trained with the Arabic model as its starting point. After training, the model’s performance was evaluated in several ways. Tesseract’s built-in evaluator, lstmeval, reported a character error rate (CER) of 0.755%, while Ocreval reported an average character accuracy of 84.02%. Finally, an in-house web application was developed to give end users an easy-to-use interface: they upload an image of a page and the model extracts the text.
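\ For readers unfamiliar with the metric, the sketch below shows one common definition of character error rate: the Levenshtein edit distance between the reference transcription and the OCR output, divided by the reference length. It is a generic illustration and does not attempt to reproduce the exact counting rules of lstmeval or Ocreval; the function names are hypothetical, and characters are counted as Unicode code points.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    # previous[j] holds the distance between reference[:i-1] and hypothesis[:j].
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """CER as a fraction of the reference length (multiply by 100 for percent)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Under this definition, character accuracy is often taken as one minus the CER, although different tools may count errors differently and may be run on different inputs.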

\ This model could be a valuable tool for libraries and centers, enabling them to extract text from historical documents and perform further processing effectively.


:::info Authors:

(1) Blnd Yaseen, University of Kurdistan Hewlêr, Kurdistan Region - Iraq (blnd.yaseen@ukh.edu.krd);

(2) Hossein Hassani, University of Kurdistan Hewlêr, Kurdistan Region - Iraq (hosseinh@ukh.edu.krd).

:::


:::info This paper is available on arXiv under the Attribution-NonCommercial-NoDerivs 4.0 International (CC BY-NC-ND 4.0) license.

:::
