Volume 5, Issue 2, 2026

Abstract

Automated grading has become an important component of digital transformation in K-12 education, yet the structured recognition of handwritten responses on answer sheets remains a practical challenge. General-purpose vision-language models often show limited robustness when applied directly to school assessment materials, particularly in the presence of fixed answer regions, mixed Chinese-English content, and diverse handwriting styles. To address this issue, this study develops a task-oriented fine-tuning framework for automated recognition of handwritten answer sheets in K-12 educational settings. A multimodal dataset was constructed from Chinese and English answer sheets, with region-level annotations designed to support structured text extraction. Based on this dataset, the Qwen2.5-VL-7B-Instruct model was adapted through LoRA-based fine-tuning under a dual-A16 GPU environment to reduce computational cost while preserving practical deployment feasibility. An end-to-end workflow covering data preparation, model training, weight merging, and inference was then established for structured JSON output. Experimental results show that the fine-tuned model achieved stable convergence in both small-sample and medium-sample settings and improved the extraction quality of handwritten responses within predefined answer regions. The proposed framework provides a practical and reproducible solution for deploying vision-language models in school grading scenarios with limited computing resources. The study also offers an application-oriented reference for the integration of multimodal large models into educational assessment systems.
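
For readers who want a concrete picture of the adaptation pattern this abstract describes, the following minimal sketch shows how LoRA fine-tuning and subsequent weight merging are typically wired together with the Hugging Face transformers and peft libraries. The rank, alpha, target modules, and output path are illustrative assumptions, not the authors' actual configuration.

from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from peft import LoraConfig, get_peft_model

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto")
processor = AutoProcessor.from_pretrained(model_id)

# Attach low-rank adapters to the attention projections; rank and alpha are placeholders.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)  # only the adapter weights are trainable

# ... supervised fine-tuning on region-annotated answer-sheet images goes here ...

# Fold the learned LoRA deltas back into the base weights for single-artifact deployment.
merged = model.merge_and_unload()
merged.save_pretrained("qwen2.5-vl-answersheet-merged")
processor.save_pretrained("qwen2.5-vl-answersheet-merged")

At inference time, the merged model can then be prompted per answer region to emit a JSON object whose keys mirror the region-level annotations.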

Abstract

Small object detection in aerial imagery remains challenging due to limited spatial resolution, background clutter, and severe scale variation. Existing deep learning–based detectors often suffer from weakened shallow representations and insufficient cross-scale feature interaction, leading to missed detections and unstable localization in dense scenes. This work presents Dynamic Reconstruction and Fusion Network (DRF-Net), a frequency-guided feature reconstruction framework for small object detection. Built upon a one-stage detection paradigm, the proposed method introduces three key components: a frequency-guided channel–spatial augmentation (FCSA) module to enhance fine-grained representations, a multi-frequency reconstruction block (MFRB) to restore cross-scale structural information, and a Dynamic Reconstruction Fusion Neck (DRF-Neck) to adaptively regulate multi-scale feature aggregation. By jointly modeling high- and low-frequency components and integrating saliency-aware fusion mechanisms, the framework improves the preservation of small-object contours while suppressing redundant background responses. Extensive experiments conducted on the VisDrone2019 benchmark demonstrate that DRF-Net consistently outperforms the baseline detector in terms of detection accuracy, particularly for small and densely distributed objects, while maintaining real-time inference efficiency. Ablation studies further verify the complementary contributions of the proposed modules to feature representation and fusion stability. The results indicate that frequency-guided reconstruction and dynamic fusion provide an effective learning strategy for enhancing small-object detection performance in complex visual scenes.
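
As a rough intuition for the frequency-guided design sketched in this abstract, the toy PyTorch module below splits a feature map into low- and high-frequency parts with a 2-D FFT mask and reweights each with a learned channel gate. It is a generic illustration of high/low-frequency decomposition only and does not reproduce the paper's FCSA, MFRB, or DRF-Neck modules.

import torch
import torch.nn as nn

class FrequencySplitGate(nn.Module):
    # Toy frequency decomposition: low-pass the feature map in the Fourier
    # domain, treat the residual as the high-frequency (edge/detail) part,
    # and recombine with learnable per-channel gates.
    def __init__(self, channels, cutoff=0.25):
        super().__init__()
        self.cutoff = cutoff
        self.low_gate = nn.Parameter(torch.ones(1, channels, 1, 1))
        self.high_gate = nn.Parameter(torch.ones(1, channels, 1, 1))

    def forward(self, x):  # x: (B, C, H, W)
        _, _, H, W = x.shape
        freq = torch.fft.fftshift(torch.fft.fft2(x, norm="ortho"), dim=(-2, -1))
        yy, xx = torch.meshgrid(
            torch.linspace(-1, 1, H, device=x.device),
            torch.linspace(-1, 1, W, device=x.device),
            indexing="ij",
        )
        mask = ((yy.abs() <= self.cutoff) & (xx.abs() <= self.cutoff)).to(x.dtype)
        low = torch.fft.ifft2(torch.fft.ifftshift(freq * mask, dim=(-2, -1)), norm="ortho").real
        high = x - low  # fine detail that matters most for small objects
        return self.low_gate * low + self.high_gate * high

In a detector neck, a block like this would sit alongside saliency-aware fusion so that small-object contours carried by the high-frequency branch are preserved during multi-scale aggregation.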
Open Access
Research article
A Baseline Optical Character Recognition Framework for Printed Kashmiri Nastaliq Text Using Deep Learning
Sheikh Amir Fayaz,
Muzamil Majeed Khaja,
Abdul Saboor Bhat,
Danish Mansoor,
Anu Thapa,
Majid Zaman
Available online: 04-24-2026

Abstract


Optical Character Recognition (OCR) plays a crucial role in the digitization and preservation of textual information; however, for low-resource languages such as Kashmiri, reliable OCR solutions remain largely unavailable. Kashmiri, primarily written in the Perso-Arabic (Nastaliq) script, poses significant challenges due to its cursive structure, extensive use of ligatures, complex diacritical marks, and limited availability of annotated datasets. This research aims to address these challenges by developing a functional OCR system specifically tailored for Kashmiri text. The proposed system is built using the open-source Kraken OCR engine and leverages deep learning techniques with transfer learning from a pre-trained Arabic OCR model. A synthetic dataset was generated using Unicode Kashmiri text, enriched with Kashmiri-specific diacritics and exclusive characters, and rendered into images through automated text-to-image pipelines. Extensive preprocessing, augmentation, and iterative fine-tuning were performed to improve recognition accuracy. Model performance was evaluated using standard metrics such as Character Error Rate (CER) and Word Error Rate (WER) on both seen and unseen data. Experimental results demonstrate a substantial improvement over the initial model, with character accuracy increasing from 54.91% to 79.91% and word accuracy improving from 4.65% to 44.19%. The final model shows strong recognition capability for common and Arabic script characters, while Kashmiri-specific inherited diacritics remain a challenging area. In addition, a cross-platform user interface developed using Flutter enables users to upload or capture images and obtain digitized Kashmiri text through a simple and accessible workflow. Rather than proposing a new recognition architecture, this work contributes empirical insights, reproducible methodology, and error characterization for OCR in a previously unsupported low-resource Nastaliq language. This work is positioned as a baseline OCR system for printed Kashmiri Nastaliq text at the line level and does not claim state-of-the-art performance.
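
For context on the reported figures, Character Error Rate and Word Error Rate are edit distances normalized by the reference length at character and word level, respectively; character accuracy is commonly reported as 1 - CER. The short sketch below is a generic Python illustration of these metrics, not the evaluation code used in the study.

def edit_distance(ref, hyp):
    # Levenshtein distance between two sequences, single-row dynamic programming.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1]

def cer(ref, hyp):
    return edit_distance(list(ref), list(hyp)) / max(len(ref), 1)

def wer(ref, hyp):
    return edit_distance(ref.split(), hyp.split()) / max(len(ref.split()), 1)

# Example: cer("answer", "anser") == 1/6, wer("two words", "two word") == 1/2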
