The 26th ACM Symposium on Document Engineering

August 25, 2026 to August 28, 2026
HES-SO / University of Fribourg, Switzerland

OCRs for Corpus Extraction for the Maltese Language

Organizers

External competition page: https://www.um.edu.mt/projects/nomocrat/doceng26competition/

Abstract

Develop an OCR model for transcribing images of paragraphs taken from pages of Maltese PDFs. No training set is provided, so models must be trained on synthetic data. Language resources will be provided to assist participants who are unfamiliar with Maltese.

Motivation

A corpus is a large collection of texts used for studying language and for developing large language models, among other things. PDFs are an important source of text for corpora, but extracting text from them is challenging, mainly because their selectable text is unreliable: it suffers from font-based character substitutions (characters appearing as other characters), ligature substitutions, merged columns, and so on. One solution is to ignore the selectable text completely and read the apparent text instead, using OCR (Optical Character Recognition).

Such a system would allow high-quality corpora to be extracted from noisy sources of text, increasing the amount of data available for low-resource languages and helping to include more languages in the digital world. Higher-quality corpora also mean that less text is needed to compensate for the noise that would otherwise be present, allowing language models to be trained with less data.

Task Description

Maltese is the national language of Malta. It is a Semitic language written in a Latin script. Because Maltese is a low-resource language, this competition benchmarks OCR models for Maltese PDFs that are trained on synthetic data. The OCR is applied to rectangular images of single paragraphs only. A development set of gold-standard transcribed images of paragraphs from Maltese PDF pages is provided so that participants can evaluate their models. A test set is held out and used only by the competition organisers to compute the final evaluation metrics.

For instance, given an input image where the text is split across lines as:

0 – Għadha mhux fis-
seħħ

the expected output is the single string "0 – Għadha mhux fis-seħħ".

Note that the trailing - after fis is not a line-break hyphen to be removed, but a structural dash that must be preserved when joining the lines.

The main challenge of this competition is the lack of a training set, which must be generated synthetically. A second challenge is that the OCR output must be in paragraph form rather than a list of lines (which is what some OCRs output). A list of lines is not useful for producing a corpus, so the OCR model must either be trained to output the paragraph as a whole, or the lines must be joined in a post-processing step. Joining lines of text is not trivial: some lines require a space between them and others do not, and some lines end in a hyphenated word that must be made whole again. Maltese makes extensive use of dashes (e.g. "il-kelb", which means "the dog"), which makes it ambiguous whether a dash at the end of a line must be preserved or treated as a line-break hyphen and removed. A rule-based line joiner for Maltese has been developed and is referenced below.
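
To make the line-joining problem concrete, here is a minimal Python sketch of a post-processing joiner. It is not the organisers' rule-based joiner; it only illustrates the decision described above. The function name join_lines and the DASH_PREFIXES set are assumptions made for this example, and the prefix set is deliberately small and incomplete.

```python
# Hypothetical and incomplete set of Maltese forms that are written with a
# trailing dash (the article and some preposition+article forms). This is an
# illustrative assumption, not a resource supplied by the competition.
DASH_PREFIXES = frozenset({"il", "l", "fil", "fis", "tal", "bil", "mal"})


def join_lines(lines, dash_prefixes=DASH_PREFIXES):
    """Join OCR output lines into a single paragraph string."""
    paragraph = ""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip empty lines
        if not paragraph:
            paragraph = line
            continue
        last_token = paragraph.split()[-1]
        # Text before the trailing dash, ignoring opening punctuation.
        fragment = last_token[:-1].lstrip("(\"'").lower()
        if paragraph.endswith("-") and fragment:
            if fragment in dash_prefixes:
                # Structural dash (e.g. "fis-"): keep it, add no space.
                paragraph += line
            else:
                # Assumed line-break hyphen: drop it and rejoin the word.
                paragraph = paragraph[:-1] + line
        else:
            # Ordinary line boundary (or a free-standing dash): add a space.
            paragraph += " " + line
    return paragraph


print(join_lines(["0 – Għadha mhux fis-", "seħħ"]))
```

A fixed prefix list cannot resolve every case, since a dashed word may also be split after a fragment that is not in the list. A fuller joiner would likely need a lexicon or a learned model to decide.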

Winning Criteria

The generated transcriptions are compared with the gold-standard transcriptions and the CER (Character Error Rate) is measured. The model with the lowest CER wins. In the case of a tie, the model with the shortest runtime (on the organisers' computer) is given preference.
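
For participants who want to score their models on the development set, the following Python sketch computes CER under its usual definition: the character-level edit distance between a transcription and its gold standard, divided by the number of characters in the gold standard. How the organisers aggregate over paragraphs and whether they normalise Unicode is not stated here, so the corpus-level aggregation below is an assumption. The names levenshtein, cer, and rank are illustrative only.

```python
def levenshtein(ref, hyp):
    """Minimum number of character insertions, deletions and substitutions
    needed to turn hyp into ref."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # character missing from hyp
                            curr[j - 1] + 1,      # extra character in hyp
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(references, hypotheses):
    """Corpus-level CER: total edits divided by total reference characters.
    (Assumption: the official metric may aggregate differently.)"""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have equal length")
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    edits = sum(levenshtein(r, h) for r, h in zip(references, hypotheses))
    return edits / total_chars


def rank(entries):
    """Order (name, cer, runtime_seconds) tuples by lowest CER, breaking
    ties by shortest runtime."""
    return sorted(entries, key=lambda e: (e[1], e[2]))


# Two of the eight gold characters are wrong ("ħ" read as "h").
print(cer(["fis-seħħ"], ["fis-sehh"]))  # 0.25
```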

Provided Resources

Evaluation Computer Specifications

The computer that will evaluate the participants' models on the test set has the following specifications:

Rules

Important Dates

The submission deadline is June 30, 2026, 23:59 AoE (Anywhere on Earth).

Contact

For questions about this competition, please contact Marc Tanti.