5.1 Challenges and Limitations

The following are the main challenges and limitations we faced during this research:

• The limited availability of resources posed significant challenges during our data collection process. Converting the collected data into a digital format proved an additional obstacle. Manual transcription of the documents was difficult due to unclear text, non-standard spacing, and unique vocabulary influenced by Arabic letters and terminologies. We attempted to create the dataset synthetically, crafting a small tool that assembled letters from a given collection of character images. Regrettably, the outcomes were unsatisfactory, and given our time constraints, we discontinued this approach.

• The non-standard spacing between words and characters made the documents difficult to transcribe and was not distinct enough for the model to learn reliably. In some cases, the model interpreted excessive gaps between characters or words as space characters. In other cases, where there should have been a space character, the minimal spacing went unnoticed by the model.

• Extracting text from multi-column pages was another limitation of the model.

• Recognizing mathematical equations was another limitation of our model.
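The synthetic-data attempt mentioned in the first item above can be pictured with a minimal sketch. The tool itself is not described here, so everything below is an assumption made for illustration: the function name `assemble_line`, the `glyphs` mapping from a character to a small bitmap (a list of pixel rows), and the fixed `gap` between glyphs are all hypothetical. The sketch treats each character as an isolated image and does not model the contextual letter forms or the joining behaviour of Arabic-script text.

```python
def assemble_line(text, glyphs, gap=1, background=0):
    """Build one synthetic line image by pasting a glyph bitmap per character.

    glyphs: hypothetical mapping from a character to a bitmap, given as a
            list of equally long pixel rows (lists of ints).
    gap:    assumed fixed number of background columns between glyphs.
    """
    if not text:
        return []
    bitmaps = [glyphs[ch] for ch in text]
    height = max(len(bmp) for bmp in bitmaps)
    rows = [[] for _ in range(height)]
    # Sorani is written right to left, so the first character of the
    # string must end up as the rightmost glyph of the image.
    for i, bmp in enumerate(reversed(bitmaps)):
        pad_top = height - len(bmp)  # bottom-align glyphs of different heights
        width = len(bmp[0])
        for y in range(height):
            if y >= pad_top:
                rows[y].extend(bmp[y - pad_top])
            else:
                rows[y].extend([background] * width)
            if i < len(bitmaps) - 1:
                rows[y].extend([background] * gap)
    return rows


# Toy glyphs: "a" is a 2x2 block, "b" is a single pixel.
toy_glyphs = {"a": [[1, 1], [1, 1]], "b": [[1]]}
line_image = assemble_line("ab", toy_glyphs)
```

In this toy call, the glyph for "a" (the first character) lands on the right-hand side of `line_image`, and the shorter "b" glyph is padded on top so that both sit on a common baseline.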

Considering these challenges and limitations, and based on the discussion of the results, we are interested in exploring the following areas in the future:

• Expanding the dataset is an aspect that requires further attention and effort.

• We observed that the model misplaced space characters between words and characters. To address this, a post-processing phase is suggested for correcting the misplaced space characters.

• Applying OCR properly to multi-column pages is another area requiring more effort.

• Extracting mathematical equations accurately is a further area to explore.
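As an illustration of what the suggested post-processing phase for space characters could look like, the sketch below re-segments an OCR output line against a word list. This is only one possible approach under stated assumptions, not a method proposed in this work: the function name `resegment`, the lexicon, and the ASCII stand-in words are hypothetical, and a usable lexicon for historical Sorani texts would have to be built separately. The function drops all spaces from a line and uses dynamic programming to find the split into lexicon words that uses the fewest words. Lines that cannot be fully covered by the lexicon are returned unchanged.

```python
def resegment(line, lexicon):
    """Re-insert spaces into an OCR line using a word list (hypothetical lexicon)."""
    chars = "".join(line.split())  # drop every space the OCR step produced
    n = len(chars)
    max_len = max((len(word) for word in lexicon), default=0)
    # best[i]: fewest lexicon words that exactly cover chars[:i], or None.
    # back[i]: start index of the last word in that best cover.
    best = [None] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            if best[start] is None or chars[start:end] not in lexicon:
                continue
            candidate = best[start] + 1
            if best[end] is None or candidate < best[end]:
                best[end] = candidate
                back[end] = start
    if best[n] is None:
        return line  # contains out-of-lexicon material: leave untouched
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(chars[start:end])
        end = start
    return " ".join(reversed(words))


# ASCII stand-ins for Sorani words, used only to keep the example readable.
toy_lexicon = {"zimani", "kurdi"}
split_word = resegment("zi mani kurdi", toy_lexicon)   # spurious space
merged_words = resegment("zimanikurdi", toy_lexicon)   # missing space
unknown = resegment("zimani xyz", toy_lexicon)         # not coverable
```

Returning uncoverable lines unchanged is a conservative design choice: the vocabulary of these documents is described above as unusual, so a lexicon is likely to be incomplete, and forcing a segmentation could introduce new errors.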

Online Resources

The dataset is partially publicly available for non-commercial use under the CC BY-NC-SA 4.0 license at https://github.com/KurdishBLARK/OCR4OldTextsInSorani.

Acknowledgments

We would like to extend our gratitude to the Zheen Center for Documentation and Research in Sulaymaniyah, Kurdistan Region, Iraq for their generous support in providing us with digital copies of certain historical publications.

