Challenges of image processing before OCR/HWR

Identifying useful image processing operations pre-OCR

The first step in any form of computational text analysis is to create high-quality machine-readable text. Even the most advanced tools for handwriting or optical character recognition struggle with low-quality scans or with scans of materials that are damaged, low in contrast, or visually complex. Automated image processing can improve the final results, but there is also a danger of over-engineering the process.

For the BYODL workshop in February 2025, we prepared two scripts to test different image enhancement options with Python:

While both scripts make use of similar Python packages, the AI-recommended workflow manipulates the image more, including a conversion of the colour scan to grayscale. Both scripts for image manipulation are available in the notebooks folders of this repository to be run in Kaggle, Google Colab, or BinderHub. Please check the coding environments information page for detailed explanations.

Working with the OpenCV library for computer vision

The most important library in both scripts is OpenCV (Open Source Computer Vision), an open-source computer vision and machine learning software library. To better understand its possibilities for image manipulation, it is important to read the documentation. Not all options may be useful for preparing document scans for OCR.

GPT4-Vision recommended the following manipulations when asked to enhance a blurred scan of early modern German print:

It is important to first identify the specific visual features of the images to be processed and only then to choose a suitable workflow. Experiences shared by researchers working with comparable sources can help, but they may be difficult to find online.

Blogs and tutorials recommending different workflows

Several academic, private, and commercial blogs also discuss the use of OpenCV but recommend divergent steps for OCR preparation. Here are some examples:

More general image processing advice is also given in Mageshwaran R.’s Survey on Image Preprocessing Techniques to Improve OCR Accuracy and in other publications collected in the pre-OCR image processing folder of our Zotero group library.

Could (Gen)AI help in making an informed decision?

Rather than letting (Gen)AI decide on a suitable workflow based on sample images, it might be more efficient to first describe the problematic aspects of the scanned sources and then ask an AI chatbot to suggest the best approach for each of them. This may reveal that not all images collected for a project can be processed in the same way and that images may have to be processed in batches. Grouping the images should then depend on their visual aspects and not on their content. Projects with very diverse sources will most likely not be able to work with a one-size-fits-all image manipulation algorithm.