pub quiz

Evaluate language models using multiple choice items

Illustration of how LM Pub Quiz evaluates LMs.
Illustration of how LM Pub Quiz evaluates LMs: Answers are ranked by the (pseudo) log-likelihoods of the textual statements derived from all of the answer options.


We evaluated a variety of language models, trained using different pretraining objectives and representing both causal and masked LM types, on the BEAR dataset.

Model Type Num Params BEAR
Accepted at NAACL 2024

BEAR: A Unified Framework for Evaluating Relational Knowledge in Causal and Masked Language Models


Knowledge probing assesses to which degree a language model (LM) has successfully learned relational knowledge during pre-training. Probing is an inexpensive way to compare LMs of different sizes and training configurations. However, previous approaches rely on the objective function used in pre-training LMs and are thus applicable only to masked or causal LMs. As a result, comparing different types of LMs becomes impossible. To address this, we propose an approach that uses an LM's inherent ability to estimate the log-likelihood of any given textual statement. We carefully design an evaluation dataset of 7,731 instances (40,916 in a larger variant) from which we produce alternative statements for each relational fact, one of which is correct. We then evaluate whether an LM correctly assigns the highest log-likelihood to the correct statement. Our experimental evaluation of 22 common LMs shows that our proposed framework, BEAR, can effectively probe for knowledge across different LM types. We release the BEAR datasets and an open-source framework that implements the probing approach to the research community to facilitate the evaluation and development of LMs.

Accuracy of various models on the BEAR dataset.

Example Usage

Install the package via pip:

pip install lm-pub-quiz

Evaluate a model on BEAR:

from lm_pub_quiz import Dataset, Evaluator

# Load the dataset
bear = Dataset.from_name("BEAR")

# Load the model
evaluator = Evaluator.from_model("gpt2", model_type="CLM", device="cuda:0")

# Run the evaluation
result = evaluator.evaluate_dataset(bear, template_index=0, batch_size=32, save_path="results/gpt2")

# Show the overall accuracy
print(result.get_metrics("accuracy", accumulate_all=True))

This example script outputs the accuracy accumulated over all relations weighed by the number of instances (this is what we call the "BEAR-score") as a pandas.Series:

accuracy            0.149528
num_instances    7731.000000
dtype: float64

For more details, visit the documentation.


When using the dataset or library, please cite the following paper:

  title = {{{BEAR}}: {{A Unified Framework}} for {{Evaluating Relational Knowledge}} in {{Causal}} and {{Masked Language Models}}},
  shorttitle = {{{BEAR}}},
  author = {Wiland, Jacek and Ploner, Max and Akbik, Alan},
  year = {2024},
  number = {arXiv:2404.04113},
  eprint = {2404.04113},
  publisher = {arXiv},
  url = {http://arxiv.org/abs/2404.04113},

Meet the Contributors

Portrait of Jacek Wiland

Jacek Wiland

Core Contributor

Portrait of Max Ploner

Max Ploner

Core Contributor

Portrait of Alan Akbik

Alan Akbik

Core Contributor

Sebastian Pohl