LVSTCK Logo

Macedonian LLM Eval

A benchmark for evaluating the capabilities of LLMs in Macedonian.

GitHub, HuggingFace

📋 Overview

The Macedonian LLM Eval is a benchmark designed to quantitatively measure how well large language models perform in the Macedonian language.

This benchmark covers tasks like reasoning, world knowledge, and reading comprehension to provide a complete assessment. It helps researchers and developers compare models, identify strengths and weaknesses, and improve Macedonian-specific AI tools.

You can find the Macedonian LLM Eval dataset on HuggingFace. The dataset was translated from Serbian to Macedonian using the Google Translate API. Serbian was chosen as the source language instead of English because it is linguistically much closer to Macedonian, which makes it a better starting point for translation. Additionally, the Serbian dataset itself had been refined with GPT-4, which, according to the original report, significantly improved translation quality.
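As an illustration of the translation step described above, the sketch below assembles a request for the Google Cloud Translation v2 REST endpoint with source `sr` (Serbian) and target `mk` (Macedonian). The endpoint URL and field names (`q`, `source`, `target`, `format`, `key`) follow the public v2 API, but the helper itself is illustrative, not the project's actual pipeline code, and an HTTP client plus a valid API key are still needed to execute the call.

```python
# Illustrative sketch only -- not the actual pipeline used to build the dataset.
TRANSLATE_URL = "https://translation.googleapis.com/language/translate/v2"

def build_translate_request(texts, api_key, source="sr", target="mk"):
    """Return (url, query_params, json_body) for a batch translation POST
    against the Google Cloud Translation v2 API."""
    params = {"key": api_key}  # v2 authenticates via an API key query parameter
    body = {
        "q": list(texts),      # one or more source strings
        "source": source,      # Serbian
        "target": target,      # Macedonian
        "format": "text",      # plain text, not HTML
    }
    return TRANSLATE_URL, params, body

url, params, body = build_translate_request(["Zdravo, svete!"], api_key="YOUR_KEY")
print(url, body["source"], "->", body["target"])
```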

🎯 What is currently covered:

  • Common sense reasoning: Hellaswag, Winogrande, PIQA, OpenbookQA, ARC-Easy, ARC-Challenge
  • World knowledge: NaturalQuestions
  • Reading comprehension: BoolQ
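Most of the tasks above (ARC, PIQA, OpenbookQA, HellaSwag, WinoGrande) are multiple-choice: the model scores each candidate answer, the highest-scoring choice is taken as its prediction, and accuracy is the fraction of items answered correctly. The sketch below illustrates that scoring loop with a toy stand-in for a real language model's log-likelihood; the item format and the `toy_score` heuristic are assumptions for illustration only.

```python
# Minimal sketch of multiple-choice evaluation. `toy_score` stands in for a
# real LM's log-likelihood of a choice given the question.

def toy_score(question: str, choice: str) -> float:
    """Toy heuristic: count words the choice shares with the question."""
    overlap = set(question.lower().split()) & set(choice.lower().split())
    return float(len(overlap))

def evaluate(items) -> float:
    """Accuracy = fraction of items where the best-scoring choice is the gold label."""
    correct = 0
    for item in items:
        scores = [toy_score(item["question"], c) for c in item["choices"]]
        pred = scores.index(max(scores))  # argmax over candidate answers
        correct += pred == item["label"]
    return correct / len(items)

# Two toy Macedonian items in an assumed ARC-like format.
items = [
    {"question": "Сонцето изгрева на исток или на запад?",
     "choices": ["на исток", "на запад"], "label": 0},
    {"question": "Водата врие на 100 степени",
     "choices": ["на 100 степени", "под нула"], "label": 0},
]
print(evaluate(items))  # → 1.0
```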

📊 Latest Results - January 16, 2025



| Model | Version | ARC Easy | ARC Challenge | BoolQ | HellaSwag | OpenbookQA | PIQA | NQ Open | WinoGrande |
|---|---|---|---|---|---|---|---|---|---|
| MKLLM-7B-Instruct | 7B | 0.5034 | 0.3003 | 0.7878 | 0.4328 | 0.2940 | 0.6420 | 0.0432 | 0.6148 |
| BLOOM | 7B | 0.2774 | 0.1800 | 0.5028 | 0.2664 | 0.1580 | 0.5316 | 0.0000 | 0.4964 |
| Phi-3.5-mini | 3.8B | 0.2887 | 0.1877 | 0.6028 | 0.2634 | 0.1640 | 0.5256 | 0.0025 | 0.5193 |
| Mistral | 7B | 0.4625 | 0.2867 | 0.7593 | 0.3722 | 0.2180 | 0.5783 | 0.0241 | 0.5612 |
| Mistral-Nemo | 12B | 0.4718 | 0.3191 | 0.8086 | 0.3997 | 0.2420 | 0.6066 | 0.0291 | 0.6062 |
| Qwen2.5 | 7B | 0.3906 | 0.2534 | 0.7789 | 0.3390 | 0.2160 | 0.5598 | 0.0042 | 0.5351 |
| LLaMA 3.1 | 8B | 0.4453 | 0.2824 | 0.7639 | 0.3740 | 0.2520 | 0.5865 | 0.0335 | 0.5683 |
| LLaMA 3.2 | 3B | 0.3224 | 0.2329 | 0.6624 | 0.2976 | 0.2060 | 0.5462 | 0.0044 | 0.5059 |
| 🏆 LLaMA 3.3 (8-bit) | 70B | 0.5808 | 0.3686 | 0.8511 | 0.4656 | 0.2820 | 0.6600 | 0.0878 | 0.6093 |
| domestic-yak-instruct | 8B | 0.5467 | 0.3362 | 0.7865 | 0.4480 | 0.3020 | 0.6910 | 0.0457 | 0.6267 |

See our GitHub for more details.