Automated OCR/HWR postprocessing
OCR results, especially for languages other than English, still contain many mistakes. One option for reducing these errors is to train better text recognition models for specific use cases, but this is not possible when we do not have access to the scanned sources or when the scan quality is poor. Automated postprocessing of the plain text generated by OCR tools is another option, and several researchers are actively exploring the role that AI / ML can play in this process.
AI / ML / LLMs can potentially help with correcting spelling mistakes and misidentified characters, as well as with categorisation tasks such as Named Entity Recognition (NER). However, most standard models perform poorly on historical texts or multilingual content, so training local, project-specific models currently seems to be the best solution.
Challenges of working with LLMs via (free-tier) coding environments
LLMs can be accessed via graphical user interfaces (GUIs) on the provider websites, via an application programming interface (API), or - in the case of open-source models - as downloads that users can integrate into their own applications. Most models accessed via web interfaces and APIs impose strict rate limits per request and/or charge users for advanced data processing. Many popular models also accept only around 2,000 tokens per request, so longer inputs have to be truncated. In addition, API calls (which use secret API keys) should only be carried out in stable and secure coding environments.
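To illustrate both points, here is a minimal Python sketch (assuming the tiktoken package for token counting and a hypothetical 2,000-token limit) that reads an API key from an environment variable instead of hard-coding it and truncates the OCR text before it is sent to a model:

```python
import os
import tiktoken  # tokenizer library used here only for counting tokens

# Never hard-code secret keys; read them from the environment instead.
api_key = os.environ.get("OPENAI_API_KEY")

MAX_TOKENS = 2000  # hypothetical per-request limit

def truncate_to_limit(text: str, limit: int = MAX_TOKENS) -> str:
    """Cut the input down to `limit` tokens before sending it to an API."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    if len(tokens) <= limit:
        return text
    return enc.decode(tokens[:limit])

# 'ocr_page.txt' is a placeholder for your own OCR output file.
with open("ocr_page.txt", encoding="utf-8") as f:
    payload = truncate_to_limit(f.read())
```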
When trying to access LLMs in free-tier coding environments like Google Colab, users often encounter installation issues. Some models are “gated” and require authentication (via the Hugging Face platform). There can also be challenges related to the formats in which models are distributed: GPT4All, for instance, only supports .gguf models, but many available models are provided as .bin files, which causes errors. Popular models may also be removed from a provider’s official repository, forcing users to fetch models manually or to find alternatives.
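If you nevertheless want to call GPT4All from Python, a minimal sketch looks like the following (assuming the gpt4all package is installed; the model file name is only an example and must correspond to a .gguf model that is actually available):

```python
from gpt4all import GPT4All

# GPT4All only loads .gguf model files; .bin files will raise errors.
# The file name below is an example; it is downloaded on first use if missing.
model = GPT4All("orca-mini-3b-gguf2-q4_0.gguf")

prompt = "Correct OCR errors in the following text without changing words: Tlie quick brawn fox."
with model.chat_session():
    print(model.generate(prompt, max_tokens=200))
```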
Another challenge is computational speed. Google Colab, for instance, grants free users only limited computational power (often no GPU acceleration), which is why accessing LLMs via code in Colab is considerably slower than working with an AI on the official cloud instance or locally. Python is also not always the best programming language for working with AI / LLMs: the GPT4All GUI uses optimised C++ inference, whereas calls from Python run without such optimisation. Python users should therefore start with smaller models.
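A quick way to see what you are working with in Colab is to check whether a GPU is available at all - a minimal sketch, assuming PyTorch is installed (it is pre-installed on Colab):

```python
import torch

# On a free Colab CPU runtime this usually prints "cpu"; with a GPU runtime, "cuda".
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Running on: {device}")

# On CPU-only runtimes, prefer small models (hundreds of millions of parameters)
# over 7B+ models, which can take many minutes per request.
```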
Comparisons between different AIs and modes of access
To test the performance differences between AI models and modes of access, here are links to the official web interfaces, online demos, or installers of some popular AI tools:
- GPT4All installation guide for different operating systems
- Hugging Face AI platform with access to different models
- Mistral-7B (Le Chat)
- LLaMA 2 (Meta AI)
Using the model directly via Hugging Face requires a Hugging Face account. The details are explained in the following Medium article: Chat with Llama-2 (7B) from HuggingFace (Llama-2-7b-chat-hf). A minimal sketch of authenticated access via Python follows this list.
- ChatGPT (GPT-3.5 / GPT-4 / free & paid plans available)
- Claude (Anthropic AI, free with Pro option)
- Google Gemini (formerly Bard; free & Pro)
- Perplexity AI (research-focused LLM)
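As referenced above, accessing gated models hosted on Hugging Face from Python requires an account and an access token. The following is a minimal sketch, assuming the huggingface_hub package, a token stored in an HF_TOKEN environment variable, and that the (example) gated model has been approved for your account:

```python
import os
from huggingface_hub import InferenceClient

# Gated models such as the Llama 2 family require an access token from a
# Hugging Face account that has been granted access to the model.
client = InferenceClient(
    model="meta-llama/Llama-2-7b-chat-hf",  # example model id
    token=os.environ.get("HF_TOKEN"),       # never hard-code the token
)

prompt = "Correct OCR errors in this text without changing words: Tlie quick brawn fox."
print(client.text_generation(prompt, max_new_tokens=100))
```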
The colab-notebooks folder in this repository also contains a script that illustrates the limitations of accessing an AI model via the Python API: processing just 700 tokens of text took circa 15 minutes. For quick tests on the spot, using the conversational chat options provided by several of the AI tools is therefore recommended. When it comes to writing good prompts for typical post-OCR tasks, recommendations generated with ChatGPT-4 can be a starting point.
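If you want to measure this kind of slowdown yourself, a minimal timing sketch with a deliberately small Hugging Face model (assuming the transformers package; the model name and prompt are only examples) looks like this:

```python
import time
from transformers import pipeline

# A deliberately small model; larger models are far slower on free-tier CPUs.
generator = pipeline("text-generation", model="distilgpt2")

start = time.time()
result = generator("Tlie quick brawn fox jumps over tlie lazy dog.", max_new_tokens=40)
print(result[0]["generated_text"])
print(f"Elapsed: {time.time() - start:.1f} seconds")
```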
Sample prompts to test error correction and NER with LLMs
user: What are prompts that humanities scholars could try to make a Large Language Model a) correct misinterpreted characters in OCR output without changing words or meaning and b) find named entities such as people and places in OCR output?
The prompts generated by GPT-4 for OCR error correction and Named Entity Recognition (NER) with LLMs are the following (a sketch showing how to send such a prompt via an API follows the last prompt):
Basic Prompt
Correct OCR errors in the following text without changing words, meaning, or formatting. Do not rephrase. Only fix characters that appear misrecognized.
Text:
[Insert OCR text here]
Detailed Prompt with Context
The following text is from an OCR scan of a historical document. Some characters were misrecognized due to printing artifacts. Correct only the misinterpreted characters while keeping all words unchanged. Do not modify punctuation or spacing.
Example input: 'Tlie quick brawn fox jumps over tlie lazy dog.'
Example output: 'The quick brown fox jumps over the lazy dog.'
Here is the OCR text:
[Insert OCR text here]
Structured Prompt for Batch Processing
You are an expert in text restoration. Given OCR-scanned text, your task is to fix character-level misrecognitions without altering word meaning, order, or structure. If unsure, leave the text unchanged.
Process the following OCR text:
[Insert OCR text here]
Basic NER Prompt
Identify and list all names of people and places in the following OCR text. Return results in a structured format.
Text:
[Insert OCR text here]
Advanced NER Prompt with JSON Output
Extract named entities (people and places) from the following OCR-scanned text. Output a structured list in JSON format with categories 'Person' and 'Location'.
Example output:
{
"Person": ["John Smith", "Mary Johnson"],
"Location": ["London", "Paris"]
}
Here is the OCR text:
[Insert OCR text here]
NER with Confidence Levels
Extract named entities (people and places) from the following OCR text and return them with a confidence level (high/medium/low). Do not infer names, only extract clear entities.
Text:
[Insert OCR text here]
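To use one of these prompts programmatically rather than in a chat window, a minimal sketch with the OpenAI Python client could look as follows (assuming the openai package, an OPENAI_API_KEY environment variable, and an example model name that may need adjusting; the JSON parsing is deliberately defensive because models do not always return valid JSON):

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

ner_prompt = (
    "Extract named entities (people and places) from the following OCR-scanned text. "
    "Output a structured list in JSON format with categories 'Person' and 'Location'.\n\n"
    "Here is the OCR text:\n"
    "Mr. John Smith arrived in London on the 3rd of May."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # example model name; adjust to your plan
    messages=[{"role": "user", "content": ner_prompt}],
)

raw = response.choices[0].message.content
try:
    entities = json.loads(raw)
    print(entities.get("Person"), entities.get("Location"))
except json.JSONDecodeError:
    print("Model did not return valid JSON:\n", raw)
```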
Short text samples in different languages to test AI capabilities in text correction and text mark-up are in the sample_data_txt folder of this repository and are explained on the samples txt page.
More traditional NLP approaches for post-OCR corrections
It is also highly recommended to try more traditional Natural Language Processing (NLP) approaches (e.g. with Python or R). If you are using Python, you could try the following packages (a small spell-checking sketch follows the list):
- (English as default language, can be trained for other languages)
- (English, requires a dictionary for other languages)
- (English, German, French, Spanish, Portuguese, Dutch, Italian, Chinese)
- (80+ languages, including historical texts like Latin and Sanskrit)
- (Mostly English, limited multilingual support)
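As a simple point of comparison with the LLM prompts above, the sketch below uses pyspellchecker (a dictionary-based option chosen here for illustration, not necessarily one of the packages listed above) for naive word-level correction:

```python
from spellchecker import SpellChecker  # pip install pyspellchecker

# English is the default dictionary; others (e.g. 'de', 'fr', 'es') can be loaded.
spell = SpellChecker(language="en")

ocr_tokens = "Tlie quick brawn fox jumps over tlie lazy dog".split()
# correction() may return None when no candidate is found; keep the original token then.
corrected = [spell.correction(tok) or tok for tok in ocr_tokens]
print(" ".join(corrected))
```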