Macedonian Corpus
A collection of linguistic data aimed at supporting NLP tasks for Macedonian.
Summary
- First consolidated Macedonian Corpus for NLP research.
- Includes three versions of the corpus:
  - Raw: 37.6 GB, 3.53 billion words.
  - Cleaned: 35.5 GB, 3.31 billion words (filtered for quality).
  - Cleaned + Deduplicated: 16.78 GB, 1.47 billion words (highest quality, recommended).
- Enables pretraining/fine-tuning LLMs, machine translation, and linguistic analysis.
- Built with state-of-the-art filtering and deduplication techniques.
- See our GitHub for implementation details.
Overview
Macedonian is widely recognized as a low-resource language in NLP. While roughly 40 GB of text might seem substantial, it pales in comparison to the resources available for other languages. For instance, the largest collection in the fineweb-2 dataset is for Russian, with 1.65 TB of data and 537 billion words. The scarcity of Macedonian textual data stems from several factors: a smaller population than countries such as Russia or the U.S., slower digitization of resources, and limited access to digital archives.
Recognizing this gap, we decided to create the Macedonian Corpus, aiming to consolidate and organize Macedonian textual resources. To the best of our knowledge, no consolidated resource encompassing all available public data exists for the Macedonian language. By creating this corpus, we hope to take a meaningful step toward addressing this deficiency and to inspire developers, researchers, linguists, and institutions to join the effort.
One of the catalysts for this project was MMORE, a library co-developed by one of our teammates (GitHub) and designed to extract text from diverse file formats such as PDF, DOCX, and PPTX. MMORE supports OCR (Optical Character Recognition), which enables text extraction from scanned PDFs. This was very useful in our case, since a significant portion of the books we processed were in scanned formats.
This corpus aggregates text data from a variety of sources, including books, academic papers, web content, and other textual materials. To enhance usability, we offer three distinct versions of the dataset:
- Raw: The unprocessed collection in its original state, as gathered from multiple sources.
- Cleaned: Data filtered to remove noise, artifacts, and unwanted content, ensuring higher quality.
- Cleaned + Deduplicated: Texts processed with MinHash deduplication to remove near-duplicate samples (documents). Deduplication is important given the overlap inherent in web-crawled data and has been shown to improve language model training outcomes (see paper).
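To illustrate the idea behind MinHash deduplication, here is a minimal sketch in pure Python. It is not the pipeline used to build this corpus: the function names, the 3-word shingles, the 128 hash functions, and the 0.8 similarity threshold are all assumptions chosen for illustration. It also compares every pair of documents, whereas large-scale pipelines usually add locality-sensitive hashing to avoid that cost.

```python
import hashlib
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of a document (illustrative choice of k)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(text, num_perm=128, k=3):
    """For each of num_perm salted hash functions, keep the minimum hash over all shingles."""
    doc_shingles = shingles(text, k)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt gives a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in doc_shingles
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates the Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def deduplicate(docs, threshold=0.8):
    """Keep the first document of each near-duplicate group (hypothetical threshold)."""
    kept, kept_signatures = [], []
    for doc in docs:
        sig = minhash_signature(doc)
        if all(estimated_jaccard(sig, other) < threshold for other in kept_signatures):
            kept.append(doc)
            kept_signatures.append(sig)
    return kept


docs = [
    "Скопје е главен град на Македонија и најголем град во државата.",
    "Скопје е главен град на Македонија и најголем град во државата.",
    "Охридското Езеро е едно од најстарите езера во Европа.",
]
unique_docs = deduplicate(docs)  # the exact repeat of the first document is dropped
```

Two documents that share most of their shingles produce signatures that agree in most positions, so they are flagged as near-duplicates even when they are not byte-identical.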
As this is only the first version of the Macedonian Corpus, we are committed to expanding and improving it with the help of the community. Future iterations will benefit from contributions such as books, textual resources, and other materials written in clean Macedonian. We welcome feedback, suggestions, and collaborations to make this resource as impactful as possible. One potential avenue for improving this corpus is to collect text data from movies with Macedonian subtitles. This type of data is typically of high quality; however, it is important to first verify the legal implications and ensure proper permissions are obtained before using such data.
Dataset Sources
The corpus is built by collecting and processing data from the following sources:
Source | Notes | Origin |
---|---|---|
UKIM | Books and dissertations from various topics | UKIM Digital Library, UKIM Repository |
Wikipedia (MK) | Macedonian Wikipedia dump | Wikipedia |
MANU | Various publications from MANU | MANU |
HuggingFace (fineweb-2) | Macedonian subset of FineWeb-2 (mkd_Cyrl) | Hugging Face |
Common Voice (MK) | Macedonian sentences from the Common Voice dataset | Common Voice |
CLARIN MaCoCu-mk 2.0 | Web-crawled Macedonian texts | CLARIN |
UKLO | Resources from UKLO | UKLO |
UGD | Resources from UGD | UGD |
SETimes Corpus (MK-EN) | Macedonian-English parallel corpus (only MK sentences used) | SETimes |
HPLT-2 (MK) | Macedonian subset of HPLT-2 | HPLT-2 |
Institute of Macedonian Language | Resources from the Institute of Macedonian Language "Krste Misirkov" | IMJ |
Official PE Gazette of North Macedonia | Official Gazette of North Macedonia | slvesnik |
Dataset Splits
The corpus is divided into the following categories based on the origin of the data:
Origin | Size (GB) | Words (B) | Percentage |
---|---|---|---|
HPLT-2 | 15.85 | 1.49 | 42.21% |
HuggingFace (fineweb-2) | 14.21 | 1.33 | 37.66% |
CLARIN (MaCoCu-mk 2.0) | 5.20 | 0.49 | 13.92% |
Other (MMORE) | 1.48 | 0.14 | 4.07% |
Wikipedia | 0.78 | 0.07 | 1.96% |
SETimes Corpus | 0.06 | 0.0044 | 0.13% |
Common Voice | 0.02 | 0.0018 | 0.05% |
Total | 37.60 | 3.53 | 100.00% |
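The totals in the table can be checked with a few lines of Python. This is only a sanity-check sketch over the rounded figures printed above; the variable names are illustrative. The Percentage column appears to track the word counts rather than the sizes in GB, although this is an inference from the numbers and small differences are expected because the inputs are rounded.

```python
# Figures copied from the table above: origin -> (size in GB, words in billions)
splits = {
    "HPLT-2": (15.85, 1.49),
    "HuggingFace (fineweb-2)": (14.21, 1.33),
    "CLARIN (MaCoCu-mk 2.0)": (5.20, 0.49),
    "Other (MMORE)": (1.48, 0.14),
    "Wikipedia": (0.78, 0.07),
    "SETimes Corpus": (0.06, 0.0044),
    "Common Voice": (0.02, 0.0018),
}

total_gb = sum(size for size, _ in splits.values())
total_words = sum(words for _, words in splits.values())

# Share of each origin, computed from the rounded word counts
word_share = {
    origin: 100 * words / total_words
    for origin, (_, words) in splits.items()
}

for origin, share in word_share.items():
    print(f"{origin}: {share:.2f}% of words")
print(f"Total: {total_gb:.2f} GB, {total_words:.2f} B words")
```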
Usage
This corpus is intended to support a variety of use cases, including but not limited to:
- Pretraining or Fine-tuning LLMs: The corpus can be used to pretrain LLMs specifically for the Macedonian language, enabling tasks like text generation, language understanding, and question answering.
- Linguistic Analysis: Researchers can use the corpus to study the morphology, syntax, and semantics of the Macedonian language, contributing to both academic studies and computational linguistic advancements.
- Machine Translation: The corpus can serve as a valuable resource for developing or improving machine translation systems between Macedonian and other languages.
- Document Retrieval and Search: It can be used to build and evaluate information retrieval systems, such as search engines.
The corpus is provided as a JSONL file, where each line contains two fields:
- text: The raw textual data.
- source: The source of the text.
{"text": "Пример текст.", "source": "fineweb-2"}
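A JSONL file in this shape can be read line by line with Python's standard library. The sketch below counts records per source. It is a minimal example under assumptions: the helper names are illustrative, the two sample lines stand in for the real file, and the "wikipedia" source label is hypothetical (only "fineweb-2" appears in the example above).

```python
import io
import json
from collections import Counter


def iter_records(lines):
    """Yield one dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_by_source(lines):
    """Count how many records come from each value of the `source` field."""
    counts = Counter()
    for record in iter_records(lines):
        counts[record["source"]] += 1
    return counts


# In-memory sample standing in for the corpus file
sample = io.StringIO(
    '{"text": "Пример текст.", "source": "fineweb-2"}\n'
    '{"text": "Втор пример.", "source": "wikipedia"}\n'
)
counts = count_by_source(sample)
```

For the real corpus, pass an open file handle instead of the sample, for example `open(path, encoding="utf-8")` with the path of the downloaded file. Iterating line by line keeps memory use low, which matters for a file of this size.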
How to Contribute?
You can contribute to the Macedonian Corpus by:
- Digitizing books and materials:
  - Contribute by digitizing books, documents, and other materials that are legally in the public domain. These digitized materials can be used to expand the datasets.
  - Ensure that the materials you contribute comply with copyright laws and are explicitly permitted for public use.
- Expanding data collection:
  - Share other forms of Macedonian-language text data, such as articles, essays, or transcripts, that can legally be used for training or evaluating language models.
- Encouraging institutional participation:
  - We hope this initiative inspires institutions in Macedonia, such as libraries, universities, and research centers, to take part in the digitization of Macedonian-language materials.
  - The availability of such materials will enable the development of specialized software tailored to the needs of Macedonian speakers and researchers.
Contact
For inquiries, feedback, or contributions, please feel free to reach out to the core team:
Legal
Notice and Takedown Policy
We adhere strictly to copyright and data ownership laws. If you identify any material within the corpus that infringes on your rights, please contact us following the detailed steps provided in this section to have it reviewed and potentially removed.