Challenges of image processing before OCR/HWR

Identifying useful image processing operations pre-OCR

The first step for any form of computational text analysis is to create high-quality machine-readable text, but even the most advanced tools for handwriting or optical character recognition struggle with low-quality scans or scans of materials that are damaged, low in contrast, or visually complex. Automated image processing can improve the final results, but there is also a danger of over-engineering the process.

For the BYODL workshop in February 2025, we prepared two scripts to test different image enhancement options with Python:

While both scripts use similar Python packages, the AI-recommended workflow manipulates the image more, including a conversion of the colour scan to grayscale. Both scripts for image manipulation are available in the notebooks folder of this repository to be run in Kaggle, Google Colab, or BinderHub. Please check the coding environments information page for detailed explanations.

Working with the OpenCV library for computer vision

The most important library in both scripts is OpenCV (Open Source Computer Vision), an open-source computer vision and machine learning software library. To understand its possibilities for image manipulation, it is important to read the documentation. Not all options may be useful for preparing document scans for OCR.

GPT4-Vision recommended the following manipulations when asked to enhance a blurred scan of early modern German print:

To find a suitable workflow, it is important to first identify the specific visual features of the images to be processed. Experiences shared by researchers working with comparable sources can help but may be difficult to find online.

Blogs and tutorials recommending different workflows

Several academic, private, and commercial blogs also discuss the use of OpenCV but recommend divergent steps for OCR preparation. Here are some examples:

More general image processing advice is also given in Mageshwaran R.’s Survey on Image Preprocessing Techniques to Improve OCR Accuracy and in other publications collected in the pre-OCR image processing folder of our Zotero group library.

In-built image processing in OCR4all

OCR4all offers some in-built image processing. For early modern colour scans with page backgrounds in tones of yellow and brown, the following settings are recommended:

Skew correction

Skew angle estimation (degrees) defines the maximum rotation range (± value) of your pages. Avoid unnecessarily high values, as they may distort segmentation.
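To illustrate why the maximum angle matters, here is a sketch of a projection-profile skew estimate in plain Python. It tries candidate angles only within the ± range and keeps the one that concentrates ink in the fewest rows. Whether OCR4all estimates skew in exactly this way is an assumption; the function name estimate_skew, the steps_per_degree parameter, and the synthetic data are hypothetical.

```python
import math


def estimate_skew(ink_pixels, max_angle=2.0, steps_per_degree=8):
    """Return the angle in degrees, within ±max_angle, that best aligns ink into rows."""
    best_angle, best_score = 0.0, -1
    steps = int(2 * max_angle * steps_per_degree)
    for i in range(steps + 1):
        angle = -max_angle + i / steps_per_degree
        rad = math.radians(angle)
        # Rotate every ink pixel by the candidate angle and count pixels per row.
        rows = {}
        for x, y in ink_pixels:
            row = round(y * math.cos(rad) - x * math.sin(rad))
            rows[row] = rows.get(row, 0) + 1
        # Well-aligned text concentrates ink in few rows, so squared counts peak.
        score = sum(count * count for count in rows.values())
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle


# Synthetic page: three text lines drawn with a slope of 1 degree.
slope = math.tan(math.radians(1.0))
ink = [(x, round(y0 + x * slope)) for y0 in (0, 40, 80) for x in range(400)]

print(estimate_skew(ink, max_angle=2.0))
```

A larger max_angle means more candidate angles to test and more room for a wrong angle to win on noisy pages, which is one reason to keep the range close to the skew actually present in the scans.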

Parallel processing

The number of parallel threads for processing is preset to 20. To check whether this matches the number of CPU threads available on your system, you can run nproc. You may want to reduce this number slightly if your system becomes slow or memory usage is high.
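The same check can be done from Python. The sketch below caps the preset at the number of logical CPUs reported by the operating system and keeps one in reserve; the helper name suggest_thread_count and the reserve of one are illustrative choices, not OCR4all behaviour.

```python
import os


def suggest_thread_count(preset=20, reserve=1):
    """Cap the preset at the logical CPUs reported by the OS, keeping some in reserve."""
    available = os.cpu_count() or 1  # cpu_count() may return None
    return max(1, min(preset, available - reserve))


print(suggest_thread_count())
```

Note that os.cpu_count() reports all logical CPUs of the machine, which may be more than a container or shared notebook environment actually lets you use.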

Thresholding and background handling

Threshold determines lightness and controls binarisation sensitivity. This is one of the most important steps.

Higher → more pixels turn black: bolder strokes, but more background noise
Lower → cleaner white background, risk of losing thin strokes

Zoom for page background estimation

This setting improves adaptation to uneven paper tone.

Ignore border for threshold estimation

This option can be useful if edges of your paper are darker.

Text mask estimation

This sets the scale for estimating a mask over the text regions, separating text areas from the background.

Noise filtering

The percentage for filters controls the aggressiveness of noise removal. Reduce slightly if fine serifs or diacritics vanish.

Range for filters

Black / white estimation

This is critical for yellowed paper. To define a percentile for black estimation, start with 7. Increase to 8–10 if ink appears faded. The percentile for white estimation should first be set to 93. Settings between 92 and 95 work well for most early modern papers. This setting helps normalise the yellow/brown background toward white.
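The effect of the two percentiles can be sketched as a simple level stretch: the grey value found at the black percentile is mapped to 0.0, the value at the white percentile to 1.0, and everything else is scaled in between. This is a simplified model under the assumption that the percentiles are taken over all pixels of a page; the actual implementation in OCR4all may restrict the estimate to text regions. The function names and the synthetic page are hypothetical.

```python
def percentile(values, pct):
    """Nearest-rank style percentile of a list of numbers (pct in 0..100)."""
    ordered = sorted(values)
    index = round((len(ordered) - 1) * pct / 100)
    return ordered[index]


def stretch_levels(pixels, black_pct=7, white_pct=93):
    """Map the black percentile to 0.0 and the white percentile to 1.0."""
    lo = percentile(pixels, black_pct)
    hi = percentile(pixels, white_pct)
    if hi <= lo:  # flat image, nothing to stretch
        return list(pixels)
    return [min(1.0, max(0.0, (v - lo) / (hi - lo))) for v in pixels]


# Synthetic page: 10% faded ink (0.25) on yellowed paper (0.78); pure white is 1.0.
page = [0.25] * 10 + [0.78] * 90
normalised = stretch_levels(page)
print(min(normalised), max(normalised))
```

In this model, the yellowed paper tone ends up at the white end of the scale and the faded ink at the black end, which is the normalisation the settings aim for.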

Grayscale processing

Forcing grayscale processing is recommended for colour scans. It ensures proper thresholding instead of treating the image as a strict binary.

Error checking

Disable error checking on inputs is set as the default. What is the advantage of turning it on?

Summary of default configuration for most early modern colour scans

For most early modern colour scans, start with the following settings:

Skew: 2
Threads: 20 (if system supports 20 logical cores)
Threshold: 0.6
Zoom background: 0.7
Ignore border: 0.15
Filter percentage: 80
Filter range: 20
Black percentile: 7
White percentile: 93
Force grayscale: enabled

Do not over-optimise for a perfectly white background. Preserving thin strokes, ligatures, long s (ſ), and marginalia is more important than eliminating all background tone. Always test on a small sample before batch processing.
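When testing on a small sample, it can help to record the starting values in one place so that deviations are easy to track. The following sketch stores the values listed above in a Python dataclass with a few basic sanity checks. The field names are hypothetical and do not correspond to OCR4all's internal parameter names, and the checks are illustrative, not exhaustive.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class PreprocessingSettings:
    """Starting values from the summary above; field names are illustrative."""
    max_skew: float = 2.0
    threads: int = 20
    threshold: float = 0.6
    zoom_background: float = 0.7
    ignore_border: float = 0.15
    filter_percentage: int = 80
    filter_range: int = 20
    black_percentile: int = 7
    white_percentile: int = 93
    force_grayscale: bool = True

    def problems(self):
        """Return a list of human-readable issues; empty if the values look sane."""
        issues = []
        if not 0.0 < self.threshold < 1.0:
            issues.append("threshold should lie between 0 and 1")
        if self.black_percentile >= self.white_percentile:
            issues.append("black percentile must be below white percentile")
        if self.threads < 1:
            issues.append("at least one thread is required")
        return issues


defaults = PreprocessingSettings()
print(asdict(defaults))
print(defaults.problems())
```

A variant for a batch with faded ink could then be created as PreprocessingSettings(black_percentile=9) and kept alongside the sample pages it was tested on.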

Could (Gen)AI help in making an informed decision?

Rather than letting (Gen)AI decide on a suitable workflow based on sample images, it might be more efficient to describe the problematic aspects of the scanned sources first and then ask an AI chatbot to suggest the best approach for each of them. This may reveal that not all images collected for a project can be processed in the same way and that images may have to be processed in batches. Grouping the images should then depend on visual aspects and not on their content. Projects with very diverse sources will most likely not be able to work with a one-size-fits-all image manipulation algorithm.
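Grouping by visual aspects can start from very simple measurements. The sketch below labels each page by average brightness (paper tone) and spread of grey values (contrast) and collects pages with the same label into one batch. The two cut-off values, the function names, and the synthetic pages are assumptions for illustration; real projects would need to choose features and limits that fit their own material.

```python
from collections import defaultdict
from statistics import mean, pstdev


def visual_profile(pixels, dark_below=0.6, flat_below=0.15):
    """Label a page by paper tone and contrast; both cut-offs are illustrative."""
    tone = "dark paper" if mean(pixels) < dark_below else "light paper"
    contrast = "low contrast" if pstdev(pixels) < flat_below else "normal contrast"
    return f"{tone}, {contrast}"


def group_pages(pages):
    """Group page names by visual profile so each batch can get its own settings."""
    batches = defaultdict(list)
    for name, pixels in pages.items():
        batches[visual_profile(pixels)].append(name)
    return dict(batches)


# Synthetic grayscale pages (0.0 = black, 1.0 = white), far smaller than real scans.
pages = {
    "clean_print": [0.1] * 10 + [0.95] * 90,
    "browned_page": [0.2] * 10 + [0.55] * 90,
    "faded_ink": [0.7] * 10 + [0.9] * 90,
}
print(group_pages(pages))
```

Each resulting batch can then be described to a chatbot, or tested in OCR4all, with its own set of preprocessing values.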