# Getting started

LM Pub Quiz implements a knowledge probing approach that uses a language model's inherent ability to estimate the log-likelihood of any given textual statement. For more information, visit the LM Pub Quiz website.

The following sections give a quick overview of how to compute the BEAR score for a given model. For a more detailed look into the results, please take a look at the example workflow.
## Installing the Package

To install the package from PyPI, simply run:
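The install command itself appears to have been dropped during extraction. Assuming the package is published on PyPI under the name `lm-pub-quiz` (the hyphenated form of the import name used below), the install would be:

```shell
# Install LM Pub Quiz from PyPI (package name assumed from the import `lm_pub_quiz`)
pip install lm-pub-quiz
```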
> **Note:** For alternative setups (especially for contributing to the library), see the development section.
## Evaluating a Model

Models can be loaded and evaluated using the `Evaluator` class. First, create an evaluator for the model, then run `evaluate_dataset` with the loaded dataset.
```python
from lm_pub_quiz import Dataset, Evaluator

# Load the dataset
dataset = Dataset.from_name("BEAR")

# Load the model
evaluator = Evaluator.from_model(
    "gpt2",
    model_type="CLM",
)

# Run the evaluation and save the results
results = evaluator.evaluate_dataset(
    dataset,
    template_index=0,
    save_path="gpt2_results",
    batch_size=32,
)
```
## Assessing the Results
To load the results and compute the overall accuracy, you can use the following lines of code: