Overview of AI models and (AI-powered) OCR/HTR software used for text-processing in humanities research
Commonly used OCR software
OCR Tool | Core Features | Multilingual Support | Handling of Noisy / Historical Texts | Best Use Case | Access Link |
---|---|---|---|---|---|
Tesseract OCR (Open-Source) | Uses LSTMs for text recognition; supports training custom models. | Supports many languages but struggles with non-Latin scripts and historical fonts. | Performs poorly on noisy, low-quality, or historical scans. | Free OCR solution, best for clean digital scans. | Tesseract |
Google Document AI | Cloud-based OCR with advanced AI-powered text extraction. | Strong multilingual support for modern languages, especially for Latin, Cyrillic, and Arabic scripts. | Good performer in benchmarks (English and Arabic), particularly on noisy documents. | Academic research, structured documents, complex layouts. | Google Document AI |
Amazon Textract | Cloud-based OCR, integrated with AWS; handles forms and tables. | Good multilingual support, but struggles with Arabic and right-to-left scripts. | Better than Tesseract, but not as strong as Google Document AI on low-quality scans. | Business automation, extracting structured text from forms. | Amazon Textract |
ABBYY FineReader | Proprietary OCR tool with strong historical document processing. | Excellent support for European languages and some Asian scripts. | One of the best commercial OCR tools for degraded texts. | Humanities research, digitizing archives. | ABBYY FineReader |
Transkribus | Models available for OCR and HTR. | Supports historical scripts and handwritten texts (17th–19th century). | Excellent for noisy, historical handwritten texts, but requires training. | Archives, historical manuscript processing. | Transkribus |
eScriptorium | Open-source, focuses on custom HTR models. | Good support for non-Latin manuscripts (e.g., Arabic, Hebrew). | Highly customisable, but installation is challenging and requires capable local hardware or a cloud environment. | Humanities research, especially non-Latin OCR/HTR. | eScriptorium |
OCR4all | Open-source, with a focus on humanities applications. | Supports multiple languages, but layout recognition can be challenging. | Requires model training; challenging to set up and best run in a cloud environment. | Humanities research on historical printed texts; note that the user community is considerably smaller than Transkribus’s. | OCR4all |
Kraken OCR | Open-source, focuses on custom OCR models. | Good support for historical and some non-Latin scripts. | Highly customisable, but requires training datasets for optimal performance. | Humanities research, including non-Latin OCR. | Kraken OCR |
Microsoft Azure OCR | AI-powered OCR with table and form detection. | Supports multiple languages but lacks customisation for historical documents. | Performs well on modern, clean text but struggles with historical/degraded materials. | Business and cloud-based document processing. | Azure AI Vision |
Meta’s Nougat | Transcribes (scientific) PDFs into Markdown; uses Vision Transformer models for text and formula recognition. | Trained mainly on English-language scientific papers. | Primarily focused on modern scientific documents; may require adaptation for historical texts. | Converting scientific PDFs into structured Markdown. | Nougat |
AI-powered OCR tools
Tool | Description |
---|---|
Transkribus | Makes use of “AI” (the exact form is not specified) for various analytical tasks and for preparing digital editions |
Azure AI Vision | Provides computer vision capabilities, including text extraction from images |
ABBYY FineReader | Offers an AI-powered OCR SDK that enables software developers to integrate text recognition into their applications |
Rossum | Cloud-native platform whose proprietary AI engine combines traditional OCR with AI for enterprise applications that require text processing and text generation |
OCCAM | OCCAM is a cloud tool that “allows for automatically transcribing and translating printed or handwritten documents, to compare transcriptions to ground truth (manually validated transcriptions), and (as an experimental feature) to automatically correct PageXML-formatted transcriptions” |
A Tesseract OCR workflow, combined with an automated Google translation of the text output, is presented in: Selvaganapathi, G. (2024, July 29). Optical Character Recognition (OCR) with Google Translate. Medium. https://medium.com/@gayathri.s.de/optical-character-recognition-ocr-with-google-translate-9c30bfb703d7
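Such a pipeline can be sketched in a few lines of Python. This is a minimal illustration, assuming the `pytesseract`, `Pillow`, and `deep-translator` packages are installed; the file name, chunk limit, and language pair are placeholders, not part of the cited workflow.

```python
# Sketch of an OCR + machine-translation pipeline (assumes the
# pytesseract, Pillow, and deep-translator packages are installed).

def chunk_text(text: str, limit: int = 4500) -> list[str]:
    """Split text into chunks below the translation service's
    character limit, breaking at line boundaries where possible."""
    chunks, current = [], ""
    for line in text.splitlines(keepends=True):
        if len(current) + len(line) > limit and current:
            chunks.append(current)
            current = ""
        current += line
    if current:
        chunks.append(current)
    return chunks

def ocr_and_translate(image_path: str, src: str = "de", dest: str = "en") -> str:
    # Imported lazily so the pure helper above works without these packages.
    from PIL import Image
    import pytesseract
    from deep_translator import GoogleTranslator

    raw = pytesseract.image_to_string(Image.open(image_path), lang="deu")
    translator = GoogleTranslator(source=src, target=dest)
    return "\n".join(translator.translate(chunk) for chunk in chunk_text(raw))
```

Chunking matters because free translation endpoints cap request size; splitting at line boundaries keeps the OCR layout roughly intact across chunks.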
OCCAM is integrated with the European crowdsourcing platform Transcribathon, which allows for manually editing transcriptions. The tool was created by Tom Vanallemeersch of CrossLang NV (Belgium) and is offered as a service at https://ai4culture.crosslang.dev/ui/. The GUI offers no options to manually improve image segmentation, however, so the tool may recognise only small parts of a scan as actual text. The strength of the tool therefore lies in evaluating and correcting OCR workflows rather than in from-scratch OCR. OCCAM’s support for PageXML correction (using language models, lexica, or validated transcriptions) can help refine existing OCR outputs in structured formats.
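Comparing transcriptions to ground truth, as OCCAM does, typically reduces to an edit-distance metric such as the character error rate (CER). The following is a standard-library illustration of the metric, not OCCAM's own implementation.

```python
# Character error rate (CER): Levenshtein distance between an OCR
# transcription and a validated ground truth, divided by the length
# of the ground truth.

def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(ocr_output: str, ground_truth: str) -> float:
    if not ground_truth:
        return 0.0 if not ocr_output else 1.0
    return levenshtein(ocr_output, ground_truth) / len(ground_truth)
```

A CER of 0.0 means a perfect match; values above roughly 0.1 usually indicate that manual correction will be faster than post-editing.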
Transkribus, especially in its browser-based Lite version, has an intuitive user interface and is very beginner-friendly, but it is also much more of a black box. Although the project started as an EU-funded research project, the tool is now run by READ-COOP, a for-profit business. Transkribus has therefore introduced a subscription model that gives paying users priority access to computational resources and better models. The tool is also marketed as an AI-powered, universal platform for text recognition, which raises high user expectations. In reality, community models often underperform on less common data types, such as multilingual materials. Unfortunately, there is also little transparency about the quality of the community models, their training data, or their performance measures. As mentioned before, we cannot say anything about the “super models” because we never paid to use them, but the overall promises of “AI” seem exaggerated as long as the models on offer are trained for rather narrow use cases.
AI models for OCR cleaning and analytical tasks (experimental)
AI Model | Text Processing | Image Processing | OCR/NLP Performance | Access Link |
---|---|---|---|---|
GPT-4o (OpenAI) | Advanced text generation and understanding; can be used for spelling correction and NER. | Accepts image inputs; suitable for multimodal tasks. | Excels in NLP tasks; capable of processing text within images. | GPT-4o |
CLIP (OpenAI) | Connects textual and visual concepts; not primarily for text generation. | Strong image classification and understanding. | Effective in linking text and images; can assist in OCR tasks. | CLIP |
DALL·E 2 (OpenAI) | Not designed for text processing or generation. | Creates images from textual descriptions. | Not designed for OCR; useful for generating visual data from text. | DALL·E 2 |
Llama 3.2 (Meta) | Proficient in text generation and understanding: can be used for spelling correction and NER | Capable of processing images; suitable for multimodal applications. | Effective in NLP tasks; image processing capabilities can aid OCR. | Llama 3.2 |
ChatGPT (OpenAI) | Advanced conversational AI for text-based interactions, but maximum output length is limited. | Can process images and recognise text in images, but has strict rate limits. | Can be used for exemplary NLP operations, e.g. thesaurus building based on selective text input. | ChatGPT |
Mistral AI | Less “creative” in text generation than OpenAI models and more focused on structured information processing. | Not designed for image processing. | Very efficient in helping with NLP programming, e.g. regex-building and code review. Direct language processing such as spelling correction and translation is often less reliable than with other models. | Mistral AI |
Claude AI | Advanced conversational AI with reasoning capabilities. | Not designed for image processing. | Not tested for NLP or OCR tasks yet. | Claude AI |
Cohere | Provides language models for text generation and understanding. | Not designed for image processing. | Not tested for NLP or OCR tasks yet. | Cohere |
Copilot (Microsoft) | Assists in code generation and understanding. | Not designed for image processing. | Can help with programming-related NLP tasks; not suitable for direct OCR. | Copilot |
Perplexity AI | Provides answers to complex questions using language models. | Not designed for image processing. | Effective in information retrieval (similar to a search engine); not suitable for NLP and OCR. | Perplexity AI |
Inflection Pi AI | Personal AI assistant for conversational interactions. | Not designed for image processing. | Excels in conversational NLP tasks; not applicable for OCR. | Inflection Pi AI |
BlackBox AI | Assists in code generation and debugging. | Not designed for image processing. | Effective in programming-related NLP tasks; not suitable for OCR. | BlackBox AI |
Gemini (Google) | Advanced AI model for text generation and understanding. | Capable of processing images; suitable for multimodal applications. | Can be used for basic NLP tasks similar to OpenAI models; image processing capabilities can aid OCR. | Gemini |
Phind | Search engine with AI capabilities for code and technical information. | Not designed for image processing. | Effective in information retrieval; not suitable for OCR. | Phind |
You.com | AI-powered search engine with conversational capabilities. | Not designed for image processing. | Effective in information retrieval; not suitable for OCR. | You |
Julius AI | AI data analyst for interpreting and visualizing data. | Not designed for image processing. | Suitability for text data has not been tested; not suitable for OCR. | Julius AI |
WormGPT | AI model with capabilities in various domains. | Not designed for image processing. | Performance in OCR/NLP tasks is not well-documented. | WormGPT |
Poe | Platform providing access to multiple AI chatbots. | Not designed for image processing. | Performance depends on the integrated models. | Poe |
T5 (Google) | Converts various NLP tasks into a text-to-text format, enabling unified text processing. | Not designed for image processing. | Versatile in NLP tasks like translation, summarization, and question answering. | Google T5 |
BERT (Google) | Provides contextualised word embeddings by processing text bi-directionally. | Not designed for image processing. | Strong performance in tasks like sentiment analysis, text classification, and question answering. | Google BERT |
RoBERTa (Facebook) | An optimized version of BERT with improved training methodology for better text understanding. | Not designed for image processing. | Outperforms BERT in various NLP benchmarks. | Facebook RoBERTa |
ALBERT (Google) | A lighter and faster version of BERT with parameter reduction techniques. | Not designed for image processing. | Maintains performance while being more efficient. | Google ALBERT |
ELMo (AllenNLP) | Generates contextualised word embeddings considering the entire sentence. | Not designed for image processing. | Effective in tasks like sentiment analysis and question answering. | AllenNLP ELMo |
DeepSeek AI | AI model for code generation and understanding. | Not designed for image processing. | Effective in programming-related NLP tasks; not suitable for OCR. | DeepSeek AI |
1min AI | All-in-one AI app for text, writing, image, audio, and video. | Capable of processing images; suitable for multimodal applications. | Performance in OCR/NLP tasks is not well-documented. | 1min AI |
Cody | AI assistant for code-related tasks. | Not designed for image processing. | Effective in programming-related NLP tasks; not suitable for OCR. | Cody |
Codeium | AI-powered code completion and generation tool. | Not designed for image processing. | Effective in programming-related NLP tasks; not suitable for OCR. | Codeium |
AI21 Jamba | Language model for text generation and understanding. | Not designed for image processing. | Effective in various NLP tasks; not suitable for OCR. | AI21 Jamba |
Hugging Face | Platform providing access to various AI models. | Offers models for both text and image processing. | Performance depends on the selected model. | Hugging Face |
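Several of the chat models above (e.g. GPT-4o) can be scripted for OCR post-correction. The sketch below shows how such a call could look, assuming the official `openai` Python package and an `OPENAI_API_KEY` in the environment; the prompt wording and the conservative-correction policy are our illustrative assumptions, not a tested recipe.

```python
# Sketch of LLM-assisted OCR post-correction via a chat-completion
# API. Model name, prompt, and correction policy are illustrative.

SYSTEM_PROMPT = (
    "You correct OCR errors in historical texts. Fix obvious "
    "character-level mistakes only; keep original spelling, "
    "punctuation, and line breaks. Do not modernise the language."
)

def build_messages(ocr_text: str) -> list[dict]:
    """Chat messages constraining the model to conservative fixes."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": ocr_text},
    ]

def correct_ocr(ocr_text: str, model: str = "gpt-4o") -> str:
    from openai import OpenAI  # requires OPENAI_API_KEY in the environment
    client = OpenAI()
    response = client.chat.completions.create(
        model=model, messages=build_messages(ocr_text), temperature=0
    )
    return response.choices[0].message.content
```

Whatever the model, the corrected output should be diffed against the original transcription and reviewed by a human before publication (see the ethical challenges below).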
> [!CAUTION]
> This overview of models was created in spring 2025 and we cannot promise to update it regularly, so the information may no longer hold true when you visit the repository. AI development progresses so quickly that it is difficult to keep track, and we are unable to test all AI models ourselves. For more current and reliable information on the different models and their recommended uses, consult recent contributions in tech forums.
Ethical challenges
Working with AI raises questions of research integrity and ethics, especially when historical and multilingual sources are concerned. AI-powered OCR often performs better for widely used languages such as English and Spanish. In addition, AI-based spelling correction trained on modern, more widespread language use may alter original meanings and introduce anachronisms into texts. AI models would need to be trained on more diverse sources (including historical and regional texts) to perform better. At the moment, AI use still requires a high level of human monitoring. When publishing AI-enhanced text, scholars should flag where AI interventions (e.g. spelling corrections) have altered the original and document what has been done to check the quality of the AI output. In many cases, direct AI manipulation of the text itself is not recommended, but AI can be used to develop controlled vocabularies or to compile lists of typical errors.
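The last suggestion can be implemented without any AI at all: aligning OCR output with human-corrected text yields candidate entries for a list of typical errors. A standard-library sketch; the sample texts are placeholders.

```python
# Compile a frequency list of typical OCR errors by aligning raw OCR
# output with its human-corrected counterpart.
import difflib
from collections import Counter

def typical_errors(pairs: list[tuple[str, str]]) -> Counter:
    """Count (ocr_reading, corrected_reading) substitution pairs
    across aligned (ocr_text, corrected_text) document pairs."""
    errors = Counter()
    for ocr_text, corrected in pairs:
        matcher = difflib.SequenceMatcher(None, ocr_text, corrected,
                                          autojunk=False)
        for op, i1, i2, j1, j2 in matcher.get_opcodes():
            if op == "replace":
                errors[(ocr_text[i1:i2], corrected[j1:j2])] += 1
    return errors
```

The most frequent pairs (e.g. long-s read as f, or rn read as m) can then feed a controlled correction list that is applied transparently, instead of letting a model rewrite the text silently.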