This article documents the process of digitizing Kurdish historical publications and training Tesseract OCR to recognize the language. The team sourced rare archives from the Zheen Center, processed fragile scans into clean line-by-line images, and created a ground-truth dataset of over 1,200 files. Using Ubuntu and tesstrain, they set up a training environment, corrected image skew, applied cropping, and built transcription pairs to teach the model Kurdish text recognition. The results showcase how open-source OCR tools can help preserve cultural heritage through machine learning.

Training Tesseract OCR on Kurdish Historical Documents

Abstract and 1. Introduction

1.1 Printing Press in Iraq and Iraqi Kurdistan

1.2 Challenges in Historical Documents

1.3 Kurdish Language

  2. Related work and 2.1 Arabic/Persian

    2.2 Chinese/Japanese and 2.3 Coptic

    2.4 Greek

    2.5 Latin

    2.6 Tamizhi

  3. Method and 3.1 Data Collection

    3.2 Data Preparation and 3.3 Preprocessing

    3.4 Environment Setup, 3.5 Dataset Preparation, and 3.6 Evaluation

  4. Experiments, Results, and Discussion and 4.1 Processed Data

    4.2 Dataset and 4.3 Experiments

    4.4 Results and Evaluation

    4.5 Discussion

  5. Conclusion

    5.1 Challenges and Limitations

    Online Resources, Acknowledgments, and References

4 Experiments, Results, and Discussion

Initially, we collected some historical publications from the Zaytoon Public Library in Erbil. However, because the documents were fragile, digitizing them was difficult. We then found, through an online search, the Zheen Center for Documentation and Research in Sulaymaniyah (https://zheen.org), a facility that specializes in scanning and archiving historical documents with equipment designed for that purpose. After we visited the center and explained our project, its staff agreed to provide us with digital copies of the earliest Kurdish publications in their collection.

4.1 Processed Data

To process the images, we used a freely available batch-processing tool. With it, we loaded the images, de-skewed them to correct any tilt, cropped them automatically, converted them to binary format, and saved the results in the specified destination directory.
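The paper does not name the tool or the algorithms it applies, so the following is only a minimal sketch of two of the steps described above, binarization and automatic cropping. It assumes Otsu's method for choosing the threshold and a bounding-box crop around the dark pixels; both choices, and the function names `otsu_threshold`, `binarize`, and `crop_to_ink`, are illustrative assumptions rather than the authors' implementation. De-skewing is left out to keep the example short.

```python
from typing import List

Gray = List[List[int]]  # image as rows of grayscale values in 0..255


def otsu_threshold(img: Gray) -> int:
    """Return the Otsu threshold t; pixels with value <= t are treated as ink."""
    hist = [0] * 256
    for row in img:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels at or below the candidate threshold
    sum0 = 0  # sum of their intensities
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        # Between-class variance (up to a constant factor).
        var = w0 * w1 * (m0 - m1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(img: Gray) -> Gray:
    """Map dark pixels (ink) to 1 and light pixels (paper) to 0."""
    t = otsu_threshold(img)
    return [[1 if v <= t else 0 for v in row] for row in img]


def crop_to_ink(binary: Gray, margin: int = 1) -> Gray:
    """Crop to the bounding box of ink pixels, keeping a small margin."""
    rows = [y for y, row in enumerate(binary) if any(row)]
    if not rows:
        return binary  # blank image: nothing to crop
    width = len(binary[0])
    cols = [x for x in range(width) if any(row[x] for row in binary)]
    top = max(rows[0] - margin, 0)
    bottom = min(rows[-1] + margin, len(binary) - 1)
    left = max(cols[0] - margin, 0)
    right = min(cols[-1] + margin, width - 1)
    return [row[left:right + 1] for row in binary[top:bottom + 1]]


if __name__ == "__main__":
    # Hypothetical 5x6 "scan": light paper with two dark pixels.
    page = [[230] * 6 for _ in range(5)]
    page[2][2], page[2][3] = 20, 25
    for row in crop_to_ink(binarize(page)):
        print(row)
```

In practice a scan would be loaded with an imaging library and the same logic applied to its pixel array; plain lists are used here only so the sketch has no dependencies.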

4.2 Dataset

After receiving the historical documents from the Zheen Center for Documentation and Research in digital format, we converted the pages into single-line images, each with its respective transcription. We used an image-processing application to crop the lines and saved them in TIFF format.

After converting the pages into line images (see Figure 16), we created a transcription file for each line image by manually typing its content in a text editor.

Figure 15: Sample page from the book titled 'Awat', published in 1938 (Zheen Center for Documentation and Research)

We gave each transcription file the same name as its line image, with the .gt.txt suffix (see Figure 17).

This produced the dataset for training Tesseract: 1233 files in total, about half of them line images and the rest their transcription files (see Table 1).
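Because training depends on every line image having a matching transcription, it can help to verify the pairing before starting. The sketch below is one possible check, not part of the paper: it assumes the line images carry a `.tif` extension and sit in the same directory as their `.gt.txt` files, and the function name `check_ground_truth` and the `awat_...` file names in the usage note are hypothetical.

```python
from pathlib import Path
from typing import Dict, List

GT_SUFFIX = ".gt.txt"


def check_ground_truth(gt_dir: str, image_ext: str = ".tif") -> Dict[str, List[str]]:
    """Group base names by whether both the line image and its transcription exist."""
    root = Path(gt_dir)
    # Strip the extension to get the shared base name, e.g. "awat_001".
    images = {p.name[: -len(image_ext)] for p in root.glob(f"*{image_ext}")}
    texts = {p.name[: -len(GT_SUFFIX)] for p in root.glob(f"*{GT_SUFFIX}")}
    return {
        "paired": sorted(images & texts),
        "missing_transcription": sorted(images - texts),
        "missing_image": sorted(texts - images),
    }
```

For example, a directory containing `awat_001.tif`, `awat_001.gt.txt`, and `awat_002.tif` would report `awat_001` as paired and `awat_002` as missing its transcription.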

4.3 Experiments

In this section, we provide details of the steps taken to prepare our environment, the training process of the model, and other relevant aspects.

4.3.1 Environment Setup

For the training environment, we used Ubuntu 22.04.2 LTS (Jammy Jellyfish). We cloned the tesstrain repository from https://github.com/tesseract-ocr/tesstrain and trained the model using our prepared dataset.
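The section does not list the exact commands used, so the following is only a sketch of how a tesstrain run is typically launched. It builds the `make training` invocation with the `MODEL_NAME`, `GROUND_TRUTH_DIR`, `START_MODEL`, and `MAX_ITERATIONS` variables that tesstrain's Makefile accepts. The helper name `tesstrain_command`, the model name `kurdish_hist`, and the directory path in the demo are hypothetical placeholders, not values reported by the authors.

```python
from typing import List, Optional

TESSTRAIN_REPO = "https://github.com/tesseract-ocr/tesstrain"


def clone_command(target_dir: str = "tesstrain") -> List[str]:
    """Command that clones the tesstrain repository into target_dir."""
    return ["git", "clone", TESSTRAIN_REPO, target_dir]


def tesstrain_command(
    model_name: str,
    ground_truth_dir: str,
    start_model: Optional[str] = None,
    max_iterations: Optional[int] = None,
) -> List[str]:
    """Build the `make training` command to run inside the tesstrain checkout."""
    cmd = [
        "make",
        "training",
        f"MODEL_NAME={model_name}",
        f"GROUND_TRUTH_DIR={ground_truth_dir}",
    ]
    if start_model:
        # Fine-tune from an existing traineddata file instead of training from scratch.
        cmd.append(f"START_MODEL={start_model}")
    if max_iterations is not None:
        cmd.append(f"MAX_ITERATIONS={max_iterations}")
    return cmd


if __name__ == "__main__":
    # Placeholder values; the paper does not state the ones actually used.
    print(" ".join(clone_command()))
    print(" ".join(tesstrain_command("kurdish_hist", "data/kurdish_hist-ground-truth")))
    # To execute for real (requires git, make, and Tesseract training tools):
    #   import subprocess
    #   subprocess.run(tesstrain_command(...), cwd="tesstrain", check=True)
```

Returning the command as a list rather than running it keeps the sketch safe to execute on a machine without Tesseract installed.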


:::info Authors:

(1) Blnd Yaseen, University of Kurdistan Hewlêr, Kurdistan Region - Iraq (blnd.yaseen@ukh.edu.krd);

(2) Hossein Hassani, University of Kurdistan Hewlêr, Kurdistan Region - Iraq (hosseinh@ukh.edu.krd).

:::


:::info This paper is available on arXiv under the Attribution-NonCommercial-NoDerivs 4.0 International license.

:::
