Overview of AI models and (AI-powered) OCR/HTR software used for text-processing in humanities research
Commonly used OCR software
OCR Tool | Core Features | Multilingual Support | Handling of Noisy / Historical Texts | Best Use Case | Access Link |
---|---|---|---|---|---|
Tesseract OCR (Open-Source) | Uses LSTMs for text recognition; supports training custom models. | Supports many languages but struggles with non-Latin scripts and historical fonts. | Performs poorly on noisy, low-quality, or historical scans. | Free OCR solution, best for clean digital scans. | Tesseract |
Google Document AI | Cloud-based OCR with advanced AI-powered text extraction. | Strong multilingual support for modern languages, especially for Latin, Cyrillic, and Arabic scripts. | Good performer in benchmarks (English and Arabic), particularly on noisy documents. | Academic research, structured documents, complex layouts. | Google Document AI |
Amazon Textract | Cloud-based OCR, integrated with AWS; handles forms and tables. | Good multilingual support, but struggles with Arabic and right-to-left scripts. | Better than Tesseract, but not as strong as Google Document AI on low-quality scans. | Business automation, extracting structured text from forms. | Amazon Textract |
ABBYY FineReader | Proprietary OCR tool with strong historical document processing. | Excellent support for European languages and some Asian scripts. | One of the best commercial OCR tools for degraded texts. | Humanities research, digitizing archives. | ABBYY FineReader |
Transkribus | Models available for OCR and HTR. | Supports historical scripts and handwritten texts (17th–19th century). | Excellent for noisy, historical handwritten texts, but requires training. | Archives, historical manuscript processing. | Transkribus |
eScriptorium | Open-source, focuses on custom HTR models. | Good support for non-Latin manuscripts (e.g., Arabic, Hebrew). | Highly customisable, but installation is challenging and requires capable local hardware or a cloud environment. | Humanities research, especially non-Latin OCR/HTR. | eScriptorium |
OCR4all | Open-source, with a focus on humanities applications. | Supports multiple languages, but layout recognition can be challenging. | Requires model training; challenging to set up and best run in a cloud environment. | Humanities research on historical printed texts; note that the user community is considerably smaller than Transkribus’s. | OCR4all |
Kraken OCR | Open-source, focuses on custom OCR models. | Good support for historical and some non-Latin scripts. | Highly customisable, but requires training datasets for optimal performance. | Humanities research, including non-Latin OCR. | Kraken OCR |
Microsoft Azure OCR | AI-powered OCR with table and form detection. | Supports multiple languages but lacks customisation for historical documents. | Performs well on modern, clean text but struggles with historical/degraded materials. | Business and cloud-based document processing. | Azure AI Vision |
Meta’s Nougat | Transcribes (scientific) PDFs into Markdown; uses Vision Transformer models for text and formula recognition. | Trained mainly on English-language scientific papers. | Primarily focused on modern scientific documents; may require adaptation for historical texts. | Converting scientific PDFs into structured Markdown. | Nougat |
AI-powered OCR tools
Tool | Description |
---|---|
Transkribus | Makes use of “AI” (the exact form is not specified) for various analytical tasks and for preparing digital editions |
Azure AI Vision | Provides computer vision capabilities, including text extraction from images |
ABBYY FineReader | Offers an AI-powered OCR SDK that enables software developers to integrate text recognition into their applications |
Rossum | Cloud-native platform whose proprietary AI engine combines traditional OCR with AI for enterprise applications that require text processing and text generation |
OCCAM | OCCAM is a cloud tool that “allows for automatically transcribing and translating printed or handwritten documents, to compare transcriptions to ground truth (manually validated transcriptions), and (as an experimental feature) to automatically correct PageXML-formatted transcriptions” |
A Tesseract OCR workflow, combined with an automated Google translation of the text output, is presented in: Selvaganapathi, G. (2024, July 29). Optical Character Recognition (OCR) with Google Translate. Medium. https://medium.com/@gayathri.s.de/optical-character-recognition-ocr-with-google-translate-9c30bfb703d7
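Such a pipeline can be sketched in a few lines of Python. This is a minimal illustration, assuming the `pytesseract`, `Pillow`, and `deep-translator` packages are installed; the file name, chunk limit, and language pair are placeholders, not part of the cited workflow.

```python
# Sketch of an OCR + machine-translation pipeline (assumes the
# pytesseract, Pillow, and deep-translator packages are installed).

def chunk_text(text: str, limit: int = 4500) -> list[str]:
    """Split text into chunks below the translation service's
    character limit, breaking at line boundaries where possible."""
    chunks, current = [], ""
    for line in text.splitlines(keepends=True):
        if len(current) + len(line) > limit and current:
            chunks.append(current)
            current = ""
        current += line
    if current:
        chunks.append(current)
    return chunks

def ocr_and_translate(image_path: str, src: str = "de", dest: str = "en") -> str:
    # Imported lazily so the pure helper above works without these packages.
    from PIL import Image
    import pytesseract
    from deep_translator import GoogleTranslator

    raw = pytesseract.image_to_string(Image.open(image_path), lang="deu")
    translator = GoogleTranslator(source=src, target=dest)
    return "\n".join(translator.translate(chunk) for chunk in chunk_text(raw))
```

Chunking matters because free translation endpoints cap request size; splitting at line boundaries keeps the OCR layout roughly intact across chunks.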
OCCAM is integrated with the European crowdsourcing platform Transcribathon, which allows for manually editing transcriptions. The tool was created by Tom Vanallemeersch of CrossLang NV (Belgium) and is offered as a service at https://ai4culture.crosslang.dev/ui/. The GUI offers no options to manually improve image segmentation, however, so the tool may recognise only small parts of a scan as actual text. The strength of the tool therefore lies in evaluating and correcting OCR workflows rather than in from-scratch OCR. OCCAM’s support for PageXML correction (using language models, lexica, or validated transcriptions) can help refine existing OCR outputs in structured formats.
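Comparing transcriptions to ground truth, as OCCAM does, typically reduces to an edit-distance metric such as the character error rate (CER). The following is a standard-library illustration of the metric, not OCCAM's own implementation.

```python
# Character error rate (CER): Levenshtein distance between an OCR
# transcription and a validated ground truth, divided by the length
# of the ground truth.

def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(ocr_output: str, ground_truth: str) -> float:
    if not ground_truth:
        return 0.0 if not ocr_output else 1.0
    return levenshtein(ocr_output, ground_truth) / len(ground_truth)
```

A CER of 0.0 means a perfect match; values above roughly 0.1 usually indicate that manual correction will be faster than post-editing.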
Transkribus, especially in its browser-based Lite version, has an intuitive user interface and is very beginner-friendly, but it is also much more of a black box. Although the project started as an EU-funded research project, the tool is now run by READ-COOP, a for-profit business. Transkribus has therefore introduced a subscription model that gives paying users priority access to computational resources and better models. The tool is also marketed as an AI-powered, universal platform for text recognition, which raises high user expectations. In reality, community models often underperform on less common data types, such as multilingual materials. Unfortunately, there is also little transparency about the quality of the community models, their training data, or their performance measures. As mentioned before, we cannot say anything about the “super models” because we never paid to use them, but the overall promises of “AI” seem exaggerated as long as the models on offer are trained for rather narrow use cases.
AI models for OCR cleaning and analytical tasks (experimental)
AI Model | Text Processing | Image Processing | OCR/NLP Performance | Access Link |
---|---|---|---|---|
GPT-4o (OpenAI) | Advanced text generation and understanding; can be used for spelling correction and NER. | Accepts image inputs; suitable for multimodal tasks. | Excels in NLP tasks; capable of processing text within images. | GPT-4o |
CLIP (OpenAI) | Connects textual and visual concepts; not primarily for text generation. | Strong image classification and understanding. | Effective in linking text and images; can assist in OCR tasks. | CLIP |
DALL·E 2 (OpenAI) | Not designed for text processing or generation. | Creates images from textual descriptions. | Not designed for OCR; useful for generating visual data from text. | DALL·E 2 |
Llama 3.2 (Meta) | Proficient in text generation and understanding: can be used for spelling correction and NER | Capable of processing images; suitable for multimodal applications. | Effective in NLP tasks; image processing capabilities can aid OCR. | Llama 3.2 |
ChatGPT (OpenAI) | Advanced conversational AI for text-based interactions, but maximum output length is limited. | Can process images and recognise text in images, but has strict rate limits. | Can be used for exemplary NLP operations, e.g. thesaurus building based on selective text input. | ChatGPT |
Mistral AI | Less “creative” in text generation than OpenAI models and more focused on structured information processing. | Not designed for image processing. | Very efficient in helping with NLP programming, e.g. regex-building and code review. Direct language processing such as spelling correction and translation is often less reliable than with other models. | Mistral AI |
Claude AI | Advanced conversational AI with reasoning capabilities. | Not designed for image processing. | Not tested for NLP or OCR tasks yet. | Claude AI |
Cohere | Provides language models for text generation and understanding. | Not designed for image processing. | Not tested for NLP or OCR tasks yet. | Cohere |
Copilot (Microsoft) | Assists in code generation and understanding. | Not designed for image processing. | Can help with programming-related NLP tasks; not suitable for direct OCR. | Copilot |
Perplexity AI | Provides answers to complex questions using language models. | Not designed for image processing. | Effective in information retrieval (similar to a search engine); not suitable for NLP and OCR. | Perplexity AI |
Inflection Pi AI | Personal AI assistant for conversational interactions. | Not designed for image processing. | Excels in conversational NLP tasks; not applicable for OCR. | Inflection Pi AI |
BlackBox AI | Assists in code generation and debugging. | Not designed for image processing. | Effective in programming-related NLP tasks; not suitable for OCR. | BlackBox AI |
Gemini (Google) | Advanced AI model for text generation and understanding. | Capable of processing images; suitable for multimodal applications. | Can be used for basic NLP tasks similar to OpenAI models; image processing capabilities can aid OCR. | Gemini |
Phind | Search engine with AI capabilities for code and technical information. | Not designed for image processing. | Effective in information retrieval; not suitable for OCR. | Phind |
You.com | AI-powered search engine with conversational capabilities. | Not designed for image processing. | Effective in information retrieval; not suitable for OCR. | You |
Julius AI | AI data analyst for interpreting and visualizing data. | Not designed for image processing. | Suitability for text data has not been tested; not suitable for OCR. | Julius AI |
WormGPT | AI model with capabilities in various domains. | Not designed for image processing. | Performance in OCR/NLP tasks is not well-documented. | WormGPT |
Poe | Platform providing access to multiple AI chatbots. | Not designed for image processing. | Performance depends on the integrated models. | Poe |
T5 (Google) | Converts various NLP tasks into a text-to-text format, enabling unified text processing. | Not designed for image processing. | Versatile in NLP tasks like translation, summarization, and question answering. | Google T5 |
BERT (Google) | Provides contextualised word embeddings by processing text bi-directionally. | Not designed for image processing. | Strong performance in tasks like sentiment analysis, text classification, and question answering. | Google BERT |
RoBERTa (Facebook) | An optimized version of BERT with improved training methodology for better text understanding. | Not designed for image processing. | Outperforms BERT in various NLP benchmarks. | Facebook RoBERTa |
ALBERT (Google) | A lighter and faster version of BERT with parameter reduction techniques. | Not designed for image processing. | Maintains performance while being more efficient. | Google ALBERT |
ELMo (AllenNLP) | Generates contextualised word embeddings considering the entire sentence. | Not designed for image processing. | Effective in tasks like sentiment analysis and question answering. | AllenNLP ELMo |
DeepSeek AI | AI model for code generation and understanding. | Not designed for image processing. | Effective in programming-related NLP tasks; not suitable for OCR. | DeepSeek AI |
1min AI | All-in-one AI app for text, writing, image, audio, and video. | Capable of processing images; suitable for multimodal applications. | Performance in OCR/NLP tasks is not well-documented. | 1min AI |
Cody | AI assistant for code-related tasks. | Not designed for image processing. | Effective in programming-related NLP tasks; not suitable for OCR. | Cody |
Codeium | AI-powered code completion and generation tool. | Not designed for image processing. | Effective in programming-related NLP tasks; not suitable for OCR. | Codeium |
AI21 Jamba | Language model for text generation and understanding. | Not designed for image processing. | Effective in various NLP tasks; not suitable for OCR. | AI21 Jamba |
Hugging Face | Platform providing access to various AI models. | Offers models for both text and image processing. | Performance depends on the selected model. | Hugging Face |
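Several of the chat models above (e.g. GPT-4o) can be scripted for OCR post-correction. The sketch below shows how such a call could look, assuming the official `openai` Python package and an `OPENAI_API_KEY` in the environment; the prompt wording and the conservative-correction policy are our illustrative assumptions, not a tested recipe.

```python
# Sketch of LLM-assisted OCR post-correction via a chat-completion
# API. Model name, prompt, and correction policy are illustrative.

SYSTEM_PROMPT = (
    "You correct OCR errors in historical texts. Fix obvious "
    "character-level mistakes only; keep original spelling, "
    "punctuation, and line breaks. Do not modernise the language."
)

def build_messages(ocr_text: str) -> list[dict]:
    """Chat messages constraining the model to conservative fixes."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": ocr_text},
    ]

def correct_ocr(ocr_text: str, model: str = "gpt-4o") -> str:
    from openai import OpenAI  # requires OPENAI_API_KEY in the environment
    client = OpenAI()
    response = client.chat.completions.create(
        model=model, messages=build_messages(ocr_text), temperature=0
    )
    return response.choices[0].message.content
```

Whatever the model, the corrected output should be diffed against the original transcription and reviewed by a human before publication (see the ethical challenges below).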
> [!CAUTION]
> This overview of models was created in spring 2025 and we cannot promise to update it regularly, so the information may no longer hold true when you visit the repository. AI development progresses so quickly that it is difficult to keep track, and we are unable to test all AI models ourselves. For more current and reliable information on the different models and their recommended uses, consult recent contributions in tech forums.
Ethical challenges
Working with AI raises questions of research integrity and ethics, especially when historical and multilingual sources are concerned. AI-powered OCR often performs better for widely used languages such as English and Spanish. In addition, AI-based spelling correction trained on modern, more widespread language use may alter original meanings and introduce anachronisms into texts. AI models would need to be trained on more diverse sources (including historical and regional texts) to perform better. At the moment, AI use still requires a high level of human monitoring. When publishing AI-enhanced text, scholars should flag where AI interventions (e.g. spelling corrections) have altered the original and document what has been done to check the quality of the AI output. In many cases, direct AI manipulation of the text itself is not recommended, but AI can be used to develop controlled vocabularies or to compile lists of typical errors.
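The last suggestion can be implemented without any AI at all: aligning OCR output with human-corrected text yields candidate entries for a list of typical errors. A standard-library sketch; the sample texts are placeholders.

```python
# Compile a frequency list of typical OCR errors by aligning raw OCR
# output with its human-corrected counterpart.
import difflib
from collections import Counter

def typical_errors(pairs: list[tuple[str, str]]) -> Counter:
    """Count (ocr_reading, corrected_reading) substitution pairs
    across aligned (ocr_text, corrected_text) document pairs."""
    errors = Counter()
    for ocr_text, corrected in pairs:
        matcher = difflib.SequenceMatcher(None, ocr_text, corrected,
                                          autojunk=False)
        for op, i1, i2, j1, j2 in matcher.get_opcodes():
            if op == "replace":
                errors[(ocr_text[i1:i2], corrected[j1:j2])] += 1
    return errors
```

The most frequent pairs (e.g. long-s read as f, or rn read as m) can then feed a controlled correction list that is applied transparently, instead of letting a model rewrite the text silently.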