We evaluated a variety of language models, trained using different pretraining objectives and representing both causal and masked LM types, on the BEAR dataset.
| Model | Type | Year of Publication | Num Params | Num Tokens | BEAR |
|---|---|---|---|---|---|
We warmly welcome contributions to the BEAR leaderboard! To contribute new models to the leaderboard, please add your results to the file results.json and create a pull request.
Each result entry should use the following format:
```json
{
    "model_name": "bert-base-cased",
    "model_url": "https://huggingface.co/bert-base-cased",
    "model_family": "bert",
    "model_type": "MLM",
    "year_published": "2018",
    "num_params": 109e6,
    "num_tokens": 3.3e9,
    "size_pretraining_GB": null,
    "source": "https://huggingface.co/blog/bert-101",
    "accuracy": {
        "mean": 0.1839348079,
        "sem": 0.0036593149
    }
},
```
Please add the model name and family, the model type (CLM or MLM), and the URL where the model can be accessed. Please also include the number of parameters and details of the pre-training setup: the number of pre-training tokens and the size of the pre-training data in GB (if this information is available). If this information was retrieved from a source other than the model's main page, please add that URL to the source field. The mean accuracy is the weighted mean over all three templates; sem is the standard error of the mean.
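Before opening a pull request, it may help to sanity-check your entry against the expected fields. The sketch below is not part of the library; the required keys are simply taken from the example entry above:

```python
import math

# Keys every leaderboard entry is expected to carry (from the example above).
REQUIRED_KEYS = {
    "model_name", "model_url", "model_family", "model_type",
    "year_published", "num_params", "num_tokens",
    "size_pretraining_GB", "source", "accuracy",
}

def validate_entry(entry: dict) -> list:
    """Return a list of problems found in a single results.json entry."""
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - entry.keys())]
    if entry.get("model_type") not in ("CLM", "MLM"):
        problems.append("model_type must be 'CLM' or 'MLM'")
    accuracy = entry.get("accuracy")
    if not isinstance(accuracy, dict) or not {"mean", "sem"} <= accuracy.keys():
        problems.append("accuracy must contain 'mean' and 'sem'")
    elif not 0.0 <= accuracy["mean"] <= 1.0 or math.isnan(accuracy["sem"]):
        problems.append("accuracy values look implausible")
    return problems

entry = {
    "model_name": "bert-base-cased",
    "model_url": "https://huggingface.co/bert-base-cased",
    "model_family": "bert",
    "model_type": "MLM",
    "year_published": "2018",
    "num_params": 109e6,
    "num_tokens": 3.3e9,
    "size_pretraining_GB": None,
    "source": "https://huggingface.co/blog/bert-101",
    "accuracy": {"mean": 0.1839348079, "sem": 0.0036593149},
}
print(validate_entry(entry))  # an empty list means the entry looks well-formed
```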
Knowledge probing assesses the degree to which a language model (LM) has successfully learned relational knowledge during pre-training. Probing is an inexpensive way to compare LMs of different sizes and training configurations. However, previous approaches rely on the objective function used in pre-training LMs and are thus applicable only to masked or causal LMs. As a result, comparing different types of LMs becomes impossible. To address this, we propose an approach that uses an LM's inherent ability to estimate the log-likelihood of any given textual statement. We carefully design an evaluation dataset of 7,731 instances (40,916 in a larger variant) from which we produce alternative statements for each relational fact, one of which is correct. We then evaluate whether an LM correctly assigns the highest log-likelihood to the correct statement. Our experimental evaluation of 22 common LMs shows that our proposed framework, BEAR, can effectively probe for knowledge across different LM types. We release the BEAR datasets and an open-source framework that implements the probing approach to the research community to facilitate the evaluation and development of LMs.
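The core idea can be illustrated with a toy scoring model. This is purely illustrative (BEAR scores statements with real LMs, and the corpus, statements, and smoothing here are made up): each alternative statement receives a log-likelihood, and the probe counts a hit when the correct statement scores highest.

```python
import math
from collections import Counter

# A toy unigram "LM" with add-one smoothing. A real LM scores each token in
# context; here a statement's log-likelihood is just the sum of per-token
# log-probabilities estimated from a tiny corpus.
corpus = ("paris is the capital of france . berlin is the capital of germany . "
          "paris is in france .").split()
counts = Counter(corpus)
vocab_size = len(counts)
total = len(corpus)

def log_likelihood(statement: str) -> float:
    return sum(
        math.log((counts[token] + 1) / (total + vocab_size))
        for token in statement.split()
    )

# Alternative statements for one relational fact; exactly one is correct.
statements = [
    "the capital of france is paris",   # correct
    "the capital of france is rome",
    "the capital of france is berlin",
]

# The probe checks whether the correct statement gets the highest score.
best = max(statements, key=log_likelihood)
print(best)  # → "the capital of france is paris"
```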
Install the package via pip:
```bash
pip install lm-pub-quiz
```
Evaluate a model on BEAR:
```python
from lm_pub_quiz import Dataset, Evaluator

# Load the dataset
bear = Dataset.from_name("BEAR")

# Load the model
evaluator = Evaluator.from_model("gpt2", model_type="CLM", device="cuda:0")

# Run the evaluation
result = evaluator.evaluate_dataset(bear, template_index=0, batch_size=32, save_path="results/gpt2")

# Show the overall accuracy
print(result.get_metrics("accuracy", accumulate_all=True))
```
This example script outputs the accuracy accumulated over all relations, weighted by the number of instances (this is what we call the "BEAR-score"), as a `pandas.Series`:
```
accuracy            0.149528
num_instances    7731.000000
dtype: float64
```
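The accumulation step amounts to an instance-weighted mean over relations. A minimal sketch with made-up per-relation numbers (not output of the actual library):

```python
# Per-relation accuracies and instance counts (hypothetical values).
per_relation = {
    "P36": (0.25, 120),   # (accuracy, num_instances)
    "P103": (0.10, 400),
    "P495": (0.18, 250),
}

total_instances = sum(n for _, n in per_relation.values())

# BEAR-score: each relation contributes in proportion to its instance count,
# so relations with many instances are not drowned out by small ones.
bear_score = sum(acc * n for acc, n in per_relation.values()) / total_instances
print(round(bear_score, 4))  # → 0.1494
```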
For more details, visit the documentation.
Knowledge probing evaluates to what extent a language model (LM) has acquired relational knowledge during its pre-training phase. It provides a cost-effective means of comparing LMs of different sizes and training setups and is useful for monitoring knowledge gained or lost during continual learning (CL). In prior work, we presented an improved knowledge probe called BEAR (Wiland et al., 2024), which enables the comparison of LMs trained with different pre-training objectives (causal and masked LMs) and addresses issues of skewed distributions in previous probes to deliver a more unbiased reading of LM knowledge. With this paper, we present LM-Pub-Quiz, a Python framework and leaderboard built around the BEAR probing mechanism that enables researchers and practitioners to apply it in their work. It provides options for standalone evaluation and direct integration into the widely-used training pipeline of the Hugging Face transformers library. Further, it provides a fine-grained analysis of different knowledge types to assist users in better understanding the knowledge in each evaluated LM. We publicly release LM-Pub-Quiz as an open-source project.
When using the dataset or library, please cite the respective papers.
```bibtex
@inproceedings{wiland-etal-2024-bear,
    title = "{BEAR}: A Unified Framework for Evaluating Relational Knowledge in Causal and Masked Language Models",
    author = "Wiland, Jacek and
      Ploner, Max and
      Akbik, Alan",
    editor = "Duh, Kevin and
      Gomez, Helena and
      Bethard, Steven",
    booktitle = "Findings of the Association for Computational Linguistics: NAACL 2024",
    month = jun,
    year = "2024",
    address = "Mexico City, Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-naacl.155/",
    doi = "10.18653/v1/2024.findings-naacl.155",
    pages = "2393--2411"
}
```
```bibtex
@inproceedings{ploner-etal-2025-lm,
    title = "{LM}-Pub-Quiz: A Comprehensive Framework for Zero-Shot Evaluation of Relational Knowledge in Language Models",
    author = "Ploner, Max and
      Wiland, Jacek and
      Pohl, Sebastian and
      Akbik, Alan",
    editor = "Dziri, Nouha and
      Ren, Sean (Xiang) and
      Diao, Shizhe",
    booktitle = "Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (System Demonstrations)",
    month = apr,
    year = "2025",
    address = "Albuquerque, New Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.naacl-demo.4/",
    doi = "10.18653/v1/2025.naacl-demo.4",
    pages = "29--39",
    ISBN = "979-8-89176-191-9"
}
```