API Reference

You can use the API to run the evaluation from a Python script. To do so, load a dataset (see Data Files for how these should be structured) and then run the evaluation function with your desired configuration.

Example (compare with src/lm_pub_quiz/cli/evaluate_model.py):

from lm_pub_quiz import Dataset, Evaluator

# Load dataset
dataset = Dataset.from_name("BEAR")

# Create Evaluator (and load model)
evaluator = Evaluator.from_model("distilbert-base-cased")

# Run evaluation
result = evaluator.evaluate_dataset(dataset)

# Save result object
result.save("outputs/my_results")

Evaluator

lm_pub_quiz.Evaluator

Bases: BaseEvaluator

Perplexity-based evaluator base class.

score_answers(*, template, answers, reduction, subject=None) abstractmethod

Score an answer given a template.

This function must be implemented by child-classes for each model-type.

lm_pub_quiz.MaskedLMEvaluator

Bases: Evaluator

score_answers(*, template, answers, reduction, subject=None)

Calculates sequence scores using the Masked Language Model.

Parameters:

Name Type Description Default
template str

The template to use (should contain a [Y] marker).

required
answers List[str]

List of answers to calculate score for.

required

Returns:

Type Description
Union[ReducedReturnFormat, EachTokenReturnFormat]

List[float]: List of surprisal scores per sequence
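
To make the role of the [Y] marker concrete, the following is a minimal sketch of how a template and a list of answers could be expanded into the candidate sequences that are then scored. The helper `fill_template` and the `[X]` subject marker are assumptions made for illustration and are not part of the documented API; the scoring itself is done by the model and is not shown.

```python
from typing import List, Optional


def fill_template(template: str, answers: List[str], subject: Optional[str] = None) -> List[str]:
    """Hypothetical helper: build one candidate sentence per answer."""
    if "[Y]" not in template:
        raise ValueError("Template must contain a [Y] marker.")
    if subject is not None:
        # Assumption: the subject position is marked with [X] in the template.
        template = template.replace("[X]", subject)
    # Each answer option yields one sequence to be scored.
    return [template.replace("[Y]", answer) for answer in answers]


candidates = fill_template(
    "[X] is the capital of [Y].",
    answers=["France", "Germany"],
    subject="Paris",
)
for sentence in candidates:
    print(sentence)
```

Each candidate would receive one score, so the length of the returned score list matches the length of `answers`.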

lm_pub_quiz.CausalLMEvaluator

Bases: Evaluator

score_answers(*, template, answers, reduction, subject=None)

Calculates sequence scores using the Causal Language Model.

Parameters:

Name Type Description Default
template str

The template to use (should contain a [Y] marker).

required
answers List[str]

List of answers to calculate score for.

required

Returns:

Type Description
Union[EachTokenReturnFormat, ReducedReturnFormat]

List[float]: List of surprisal scores per sequence

Dataset Representation

There are two classes which are used to represent a dataset: Relation and Dataset (which is essentially a container for a number of relations).

lm_pub_quiz.Relation

Bases: RelationBase

Represents a relation within a dataset, including its code, answer space, templates, and an instance table.

Attributes:

Name Type Description
relation_code str

A unique code identifying the relation.

answer_space List[str]

A list of possible answers for this relation.

templates List[str]

Templates for generating instances of this relation.

instance_table DataFrame

A pandas DataFrame containing instances of the relation.

Methods:

Name Description
__str__

Returns a string representation showing the first five instances in the relation.

__repr__

Returns a string representation of the relation code.

__len__

Returns the number of instances in the relation.

subsample

Randomly samples a subset of instances from the relation.

from_path

Class method to create a Relation instance from a JSONL file.

from_path(path, *, relation_code=None, lazy=True, fmt=None) classmethod

Loads a relation from a JSONL file and associated metadata.

Parameters:

Name Type Description Default
path PathLike

The path to the dataset directory.

required
relation_code str

The specific code of the relation to load.

None
lazy bool

If False, the instance table is loaded directly into memory.

True

Returns:

Name Type Description
Relation Relation

An instance of the Relation class populated with data from the file.

Raises:

Type Description
Exception

If there is an error in loading the file or processing the data.

subsample(n=10)

Returns a subsampled version of the dataset of size n.

Parameters:

Name Type Description Default
n int

Size of the subsampled dataset

10

Returns:

Type Description
DataFrame

pd.DataFrame: Subsampled version of the dataset.

lm_pub_quiz.Dataset

Bases: DatasetBase[Relation]

A collection of relations forming a multiple choice dataset.

Attributes:

Name Type Description
relations List[Relation]

A list of Relation instances in the dataset.

dataset_name str

The name of the dataset.

Methods:

Name Description
from_path

Class method to load a dataset from a specified path.

from_name(name, *, lazy=True, base_path=None, chunk_size=10 * 1024, relation_info=None, **kwargs) classmethod

Loads a dataset from the cache (if available) or from the URL specified in the internal dataset table.

Parameters:

Name Type Description Default
name str

The name of the dataset.

required
lazy bool

If False, the instance tables of all relations are directly loaded into memory.

True

Returns:

Name Type Description
Dataset Dataset

An instance of Dataset loaded with the relations from the directory.

Raises:

Type Description
Exception

If there is an error in loading the dataset.

Usage

Loading the BEAR-dataset.

>>> from lm_pub_quiz import Dataset
>>> dataset = Dataset.from_name("BEAR")

from_path(path, *, lazy=True, fmt=None, relation_info=None, **kwargs) classmethod

Loads a multiple choice dataset from a specified directory path.

This method scans the directory for relation files and assembles them into a Dataset.

Parameters:

Name Type Description Default
path str

The directory path where the dataset is stored.

required
lazy bool

If False, the instance tables of all relations are directly loaded into memory.

True

Returns:

Name Type Description
Dataset Dataset

An instance of Dataset loaded with the relations from the directory.

Raises:

Type Description
Exception

If there is an error in loading the dataset.

Usage

Loading the BEAR-dataset.

>>> from lm_pub_quiz import Dataset
>>> dataset = Dataset.from_path("/path/to/dataset/BEAR")

Evaluation Result

Similar to the dataset representation, the results are represented by two classes: RelationResult and the container DatasetResults.

lm_pub_quiz.RelationResult

Bases: RelationBase

from_path(path, *, relation_code=None, metadata=None, lazy=True, fmt=None) classmethod

Loads the evaluated relation from a JSONL file and associated metadata.

Parameters:

Name Type Description Default
path PathLike

The path to the relation's instance table.

required

Returns:

Name Type Description
RelationResult RelationResult

An instance of the RelationResult class populated with data from the file.

Raises:

Type Description
Exception

If there is an error in loading the file or processing the data.

lm_pub_quiz.DatasetResults

Bases: DatasetBase[RelationResult]

Container for relation results.

from_path(path, *, lazy=True, fmt=None, relation_info=None, **kwargs) classmethod

Loads results from a specified directory path.

This method scans the directory for relation files and assembles them into a DatasetResults.

Parameters:

Name Type Description Default
path str

The directory path where the results are stored.

required

Returns:

Name Type Description
DatasetResults DatasetResults

An instance of DatasetResults loaded with the results from the directory.

Raises:

Type Description
Exception

If there is an error in loading the dataset.

Usage

Loading all relation results for a dataset.

from lm_pub_quiz import DatasetResults
results = DatasetResults.from_path('/path/to/results/', dataset_name='BEAR')

get_metadata(key=None)

Return metadata from the relations. If no key is passed, all consistent values are returned.

get_metrics(metrics, *, accumulate=False, divide_support=True)

Return the metrics for the relations in this dataset.

Parameters:

Name Type Description Default
accumulate bool | str | None

Compute the metrics for groups of relations (e.g. over the domains) or compute the overall scores for the complete dataset by setting accumulate=True.

False
divide_support bool

Set to True to divide the support (added by a relation to a group) by the number of groups it contributes to (only relevant if there are multiple groups per relation, i.e. when explode is set). This yields a DataFrame whose weighted mean is equal to the overall score.

True

Returns:

Type Description
Union[DataFrame, Series]

pandas.DataFrame | pandas.Series: A Series or DataFrame with the selected metrics, depending on whether all relations were accumulated.
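
The effect of divide_support can be checked with a small worked example. The sketch below follows one plausible reading of the description above and does not call the library; the relation codes, group names, accuracies, and supports are made up for illustration. A relation that belongs to two groups contributes half of its support to each, and the support-weighted mean of the group scores then equals the overall score.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical per-relation results: (accuracy, support, groups the relation belongs to).
relations = {
    "R1": (Fraction(1, 2), 100, ["geography"]),
    "R2": (Fraction(3, 4), 200, ["geography", "politics"]),
    "R3": (Fraction(1, 4), 100, ["politics"]),
}

# Overall score: support-weighted mean over all relations.
total_support = sum(support for _, support, _ in relations.values())
overall = sum(acc * support for acc, support, _ in relations.values()) / total_support

# Group scores, with each relation's support divided by its number of groups.
group_support = defaultdict(Fraction)
group_weighted = defaultdict(Fraction)
for acc, support, groups in relations.values():
    share = Fraction(support, len(groups))
    for group in groups:
        group_support[group] += share
        group_weighted[group] += acc * share

group_scores = {g: group_weighted[g] / group_support[g] for g in group_support}

# Support-weighted mean of the group scores.
weighted_mean = sum(group_scores[g] * group_support[g] for g in group_scores) / sum(group_support.values())

print(overall, weighted_mean, dict(group_scores))
```

Without the division, the relation shared by both groups would be counted with its full support twice, and the weighted mean over groups would generally differ from the overall score.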

Data Base Classes

The dataset representations as well as the evaluation results are based on common base classes.

lm_pub_quiz.data.base.RelationBase

Bases: DataBase

Base class for the representation of relations and relation results.

Source code in src/lm_pub_quiz/data/base.py
class RelationBase(DataBase):
    """Base class for the representation of relations and relation results."""

    _instance_table_file_name_suffix: str = ""
    _instance_table_default_format: str = "jsonl"
    _metadata_file_name: str = "metadata_relations.json"

    _len: Optional[int] = None

    def __init__(
        self,
        relation_code: str,
        *,
        lazy_options: Optional[Dict[str, Any]] = None,
        instance_table: Optional[pd.DataFrame] = None,
        answer_space: Optional[pd.Series] = None,
        relation_info: Optional[Dict[str, Any]] = None,
    ):

        self._relation_code = relation_code
        self._lazy_options = lazy_options
        self._instance_table = instance_table
        self._answer_space = answer_space
        self._relation_info = relation_info or {}

    @property
    def relation_code(self) -> str:
        return self._relation_code

    def copy(self, **kw):
        """Create a copy of the instance with specified fields replaced by new values."""

        kw = {
            "relation_code": self.relation_code,
            "lazy_options": self._lazy_options.copy() if self._lazy_options is not None else None,
            "instance_table": self._instance_table.copy() if self._instance_table is not None else None,
            "answer_space": self._answer_space.copy() if self._answer_space is not None else None,
            "relation_info": self._relation_info.copy(),
            **kw,
        }
        return self.__class__(kw.pop("relation_code"), **kw)

    def saved(self, path: PathLike, *, fmt: InstanceTableFileFormat = None) -> Self:
        # Save relation and return the lazy-loading relation
        saved_path = self.save(path, fmt=fmt)

        if path is not None:
            lazy_options = {
                "path": saved_path,
                "fmt": fmt,
            }
        else:
            lazy_options = None

        return self.copy(instance_table=None, lazy_options=lazy_options)

    def activated(self) -> Self:
        """Return self or a copy of self with the instance_table loaded (lazy loading disabled)."""

        if not self.is_lazy:
            return self

        return self.copy(instance_table=self.instance_table)

    def __repr__(self) -> str:
        return str(self)

    def __str__(self) -> str:
        return f"{self.__class__.__name__} `{self.relation_code}`"

    @property
    def _derived_cardinality(self) -> str:
        if self.instance_table.duplicated("obj_id").any():
            return "multiple instances per answer"
        else:
            return "single instance per answer"

    @overload
    def relation_info(self, /, **kw) -> Dict[str, Any]: ...

    @overload
    def relation_info(self, key: str, /) -> Any: ...

    def relation_info(self, key: Optional[str] = None, /, **kw) -> Union[None, Any, Dict[str, Any]]:
        """Get or set additional relation information."""
        if key is not None:
            if key == "cardinality" and "cardinality" not in self._relation_info:
                return self._derived_cardinality
            else:
                return self._relation_info[key]
        elif len(kw) > 0:
            self._relation_info.update(kw)

        info = self._relation_info.copy()
        if "cardinality" not in info:
            info["cardinality"] = self._derived_cardinality
        return info

    @overload
    def get_metadata(self) -> Dict[str, Any]: ...

    @overload
    def get_metadata(self, key: str, /) -> Any: ...

    def get_metadata(self, key: Optional[str] = None) -> Union[Any, Dict[str, Any]]:
        """Get metadata."""
        if key is not None:
            if self._answer_space is None:
                msg = f"Key '{key}' not in metadata (no answer space in metadata)."
                raise KeyError(msg)
            elif key == "answer_space_labels":
                return self.answer_space.tolist()
            elif key == "answer_space_ids":
                return self.answer_space.index.tolist()
            elif key == "relation_info":
                return self.relation_info()
            else:
                msg = f"Key '{key}' not in metadata."
                raise KeyError(msg)

        elif self._answer_space is not None:
            return {k: self.get_metadata(k) for k in ("answer_space_labels", "answer_space_ids", "relation_info")}
        else:
            return {}

    @staticmethod
    def _generate_obj_ids(n: int, *, id_prefix: str = ""):
        return id_prefix + pd.RangeIndex(n, name="obj_id").astype(str)

    @classmethod
    def answer_space_from_instance_table(cls, instance_table: pd.DataFrame, **kw) -> pd.Series:
        if "obj_label" not in instance_table:
            msg = "Cannot generate answer space: No object information in instance table."
            raise ValueError(msg)

        if "obj_id" in instance_table:
            answer_groups = instance_table.groupby("obj_id", sort=False).obj_label
            unique_ids = answer_groups.nunique().eq(1)

            if not unique_ids.all():
                ids = ", ".join(f"'{v}'" for v in unique_ids[~unique_ids].index)
                log.warning("Some object IDs contain multiple labels: %s", ids)

            return answer_groups.first()

        else:
            answer_labels = instance_table["obj_label"].unique()
            return pd.Series(answer_labels, index=cls._generate_obj_ids(len(answer_labels), **kw), name="obj_label")

    @classmethod
    def answer_space_from_metadata(cls, metadata, **kw) -> Optional[pd.Series]:
        if "answer_space_labels" in metadata and "answer_space_ids" in metadata:
            if "answer_space_labels" in metadata:
                answer_space_labels = metadata.pop("answer_space_labels")
            else:
                answer_space_labels = metadata.pop("answer_space")

            answer_space_ids = metadata.pop("answer_space_ids", None)

            if answer_space_ids is None:
                answer_space_ids = cls._generate_obj_ids(len(answer_space_labels), **kw)

            index = pd.Index(answer_space_ids, name="obj_id")

            answer_space = pd.Series(answer_space_labels, index=index, name="obj_label")

            return answer_space
        elif (
            "answer_space_labels" not in metadata
            and "answer_space_ids" not in metadata
            and "answer_space" not in metadata
        ):
            return None
        else:
            warnings.warn(
                "To define an answer space in the metadata, specify `answer_space_ids` and "
                "`answer_space_labels` (using answer space based on the instance table).",
                stacklevel=1,
            )
            return None

    @property
    def answer_space(self) -> pd.Series:
        if self._answer_space is None:
            # invoke file loading to get answer space
            _ = self.instance_table

        return cast(pd.Series, self._answer_space)

    @property
    def instance_table(self) -> pd.DataFrame:
        if self._instance_table is None:
            if self._lazy_options is None:
                msg = (
                    f"Could not load instance table for {self.__class__.__name__} "
                    f"({self.relation_code}): No path given."
                )
                raise NoInstanceTableError(msg)

            instance_table = self.load_instance_table(answer_space=self._answer_space, **self._lazy_options)

            if self._answer_space is None:
                # store answer_space
                self._answer_space = self.answer_space_from_instance_table(
                    instance_table, id_prefix=f"{self.relation_code}-"
                )

            # store number of instances
            self._len = len(instance_table)

            return instance_table

        return self._instance_table

    def __len__(self) -> int:
        if self._instance_table is None:
            if self._len is None:
                # invoke file loading to get the number of instances
                _ = self.instance_table
            return cast(int, self._len)
        else:
            return len(self.instance_table)

    @abstractmethod
    def filter_subset(self, indices: Sequence[int], *, keep_answer_space: bool = False) -> Self:
        pass

    @classmethod
    def load_instance_table(
        cls,
        path: Path,
        *,
        answer_space: Optional[pd.Series] = None,  # noqa: ARG003
        fmt: InstanceTableFileFormat = None,
    ) -> pd.DataFrame:
        if not path.exists():
            msg = f"Could not load instance table for {cls.__name__}: Path `{path}` could not be found."
            raise FileNotFoundError(msg)
        elif not path.is_file():
            msg = f"Could not load instance table for {cls.__name__}: `{path}` is not a file."
            raise RuntimeError(msg)

        if fmt is None:
            fmt = tuple(s[1:] for s in path.suffixes)
        elif isinstance(fmt, str):
            fmt = tuple(fmt.split("."))

        log.debug("Loading instance table (format=.%s) from: %s", ".".join(fmt), path)

        if fmt == ("jsonl",):
            instance_table = pd.read_json(path, lines=True)

        elif fmt[0] == "parquet" and len(fmt) <= 2:  # noqa: PLR2004
            instance_table = pd.read_parquet(path)

        else:
            msg = f"Format .{'.'.join(fmt)} not recognized: Could not load instances at {path}."
            raise ValueError(msg)

        if instance_table.index.name is None:
            instance_table.index.name = "instance"

        return instance_table

    @classmethod
    def save_instance_table(cls, instance_table: pd.DataFrame, path: Path, fmt: InstanceTableFileFormat = None):
        """Save instance table with the format determined by the path suffix.

        Parameters:
           instance_table (pd.DataFrame): The instances to save.
           path (Path): Where to save the instance table. If format is not specified, the suffix is used to determine
                        the format.
           fmt (str): The format in which to save the instances.
        """
        if fmt is None:
            fmt = tuple(s[1:] for s in path.suffixes)
        elif isinstance(fmt, str):
            fmt = tuple(fmt.split("."))

        if fmt == ("jsonl",):
            instance_table.to_json(path, orient="records", lines=True)

        elif fmt[0] == "parquet" and len(fmt) <= 2:  # noqa: PLR2004
            compression: Optional[str]

            if len(fmt) == 1:
                compression = None
            else:
                compression = fmt[1]

            instance_table.to_parquet(path, compression=compression)
        else:
            msg = f"Format .{'.'.join(fmt)} not recognized: Could not save instances at {path}."
            raise ValueError(msg)

    @property
    def is_lazy(self) -> bool:
        return self._instance_table is None and self._lazy_options is not None

    @property
    @abstractmethod
    def has_instance_table(self) -> bool:
        pass

    def save(self, save_path: PathLike, fmt: InstanceTableFileFormat = None) -> Optional[Path]:
        """Save results to a file and export meta_data"""
        save_path = Path(save_path)
        save_path.mkdir(parents=True, exist_ok=True)

        log.debug("Saving %s result to: %s", self, save_path)

        ### Metadata file -> .json ###
        if save_path.is_dir():
            metadata_path = save_path / self._metadata_file_name
        else:
            metadata_path = save_path
            save_path = save_path.parent

        if metadata_path.exists():
            with open(metadata_path) as file:
                all_metadata = json.load(file)

                if self.relation_code in all_metadata:
                    log.warning("Overwriting metadata info for relation %s (%s)", self.relation_code, save_path)
        else:
            all_metadata = {}

        ### Store instance table to .jsonl file ###
        if self.has_instance_table:
            instances_path = self.path_for_code(save_path, self.relation_code, fmt=fmt)
            self.save_instance_table(self.instance_table, instances_path, fmt=fmt)
            log.debug("Instance table was saved to: %s", instances_path)

        else:
            instances_path = None

        all_metadata[self.relation_code] = self.get_metadata()

        with open(metadata_path, "w") as file:
            json.dump(all_metadata, file, indent=4, default=str)
            log.debug("Metadata file was saved to: %s", metadata_path)

        return instances_path

    @staticmethod
    def true_stem(path: Path) -> str:
        return path.name.partition(".")[0]

    @classmethod
    def code_from_path(cls, path: Path) -> str:

        if not path.name.endswith(cls._instance_table_file_name_suffix):
            msg = (
                f"Incorrect path for {cls.__name__} instance table "
                f"(expected suffix {cls._instance_table_file_name_suffix}): {path}"
            )
            raise ValueError(msg)
        code = cls.true_stem(path)
        if len(cls._instance_table_file_name_suffix) > 0:
            code = code[: -len(cls._instance_table_file_name_suffix)]
        return code

    @classmethod
    def suffix_from_instance_format(cls, fmt: InstanceTableFileFormat = None) -> str:
        if fmt is None:
            return cls._instance_table_default_format
        elif isinstance(fmt, str):
            return fmt
        else:
            return ".".join(fmt)

    @classmethod
    def path_for_code(cls, path: Path, relation_code: str, *, fmt: InstanceTableFileFormat = None) -> Path:
        return path / f"{relation_code}{cls._instance_table_file_name_suffix}.{cls.suffix_from_instance_format(fmt)}"

    @overload
    @classmethod
    def search_path(cls, path: Path, relation_code: None = None, fmt: InstanceTableFileFormat = None) -> List[Path]: ...

    @overload
    @classmethod
    def search_path(cls, path: Path, relation_code: str, fmt: InstanceTableFileFormat = None) -> Path: ...

    @classmethod
    def search_path(
        cls, path: Path, relation_code: Optional[str] = None, fmt: InstanceTableFileFormat = None
    ) -> Union[List[Path], Path, None]:
        """Search path for instance files."""

        if relation_code is not None and fmt is not None:
            # Just look for the file
            p = cls.path_for_code(path, relation_code, fmt=fmt)
            if p.exists():
                return p
            else:
                return None

        if relation_code is None:
            code = ".*"
        else:
            code = re.escape(relation_code)

        if fmt is None:
            suffix = ".*"
        else:
            suffix = cls.suffix_from_instance_format(fmt)

        pattern = re.compile(f"(?P<relation_code>{code}){cls._instance_table_file_name_suffix}.(?P<suffix>{suffix})")

        matches: Dict[str, List[Path]] = defaultdict(list)
        for p in map(Path, os.scandir(path)):
            if p.name == cls._metadata_file_name:
                continue

            match = re.fullmatch(pattern, p.name)

            if match is not None:
                matches[match.group("relation_code")].append(p)

        selected_paths = []
        for code, matching_paths in matches.items():
            if len(matching_paths) > 1:
                log.warning("Found multiple files for relation %s: %s", code, ", ".join(p.name for p in matching_paths))
            selected_paths.append(matching_paths[0])

        if relation_code is None:
            return selected_paths
        elif len(selected_paths) == 0:
            return None
        else:
            return selected_paths[0]

activated()

Return self or a copy of self with the instance_table loaded (lazy loading disabled).

Source code in src/lm_pub_quiz/data/base.py
def activated(self) -> Self:
    """Return self or a copy of self with the instance_table loaded (lazy loading disabled)."""

    if not self.is_lazy:
        return self

    return self.copy(instance_table=self.instance_table)
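
In the class listing above, the instance_table property returns a freshly loaded table without storing it on a lazy object, so activated() is what pins the table in memory. The following is a minimal, library-independent sketch of that pattern. The class `LazyTable`, its `loader` callback, and the example row are hypothetical stand-ins and not part of lm_pub_quiz.

```python
from typing import Callable, List, Optional


class LazyTable:
    """Toy stand-in for a relation: holds rows, or a loader that produces them."""

    def __init__(
        self,
        rows: Optional[List[dict]] = None,
        loader: Optional[Callable[[], List[dict]]] = None,
    ):
        self._rows = rows
        self._loader = loader

    @property
    def is_lazy(self) -> bool:
        # Lazy means: nothing in memory yet, but we know how to load it.
        return self._rows is None and self._loader is not None

    @property
    def rows(self) -> List[dict]:
        if self._rows is None:
            if self._loader is None:
                raise RuntimeError("No rows and no loader given.")
            # Loaded on every access; the result is not cached on this object.
            return self._loader()
        return self._rows

    def activated(self) -> "LazyTable":
        """Return self, or a copy that keeps the loaded rows in memory."""
        if not self.is_lazy:
            return self
        return LazyTable(rows=self.rows, loader=self._loader)


load_calls = []


def load() -> List[dict]:
    load_calls.append(1)  # count how often the "file" is read
    return [{"sub_label": "Paris", "obj_label": "France"}]


lazy = LazyTable(loader=load)
active = lazy.activated()  # triggers exactly one load
print(lazy.is_lazy, active.is_lazy, len(load_calls))
```

Repeated access to `active.rows` does not call the loader again, whereas every access to `lazy.rows` would.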

copy(**kw)

Create a copy of the instance with specified fields replaced by new values.

Source code in src/lm_pub_quiz/data/base.py
def copy(self, **kw):
    """Create a copy of the instance with specified fields replaced by new values."""

    kw = {
        "relation_code": self.relation_code,
        "lazy_options": self._lazy_options.copy() if self._lazy_options is not None else None,
        "instance_table": self._instance_table.copy() if self._instance_table is not None else None,
        "answer_space": self._answer_space.copy() if self._answer_space is not None else None,
        "relation_info": self._relation_info.copy(),
        **kw,
    }
    return self.__class__(kw.pop("relation_code"), **kw)

get_metadata(key=None)

Get metadata.

Source code in src/lm_pub_quiz/data/base.py
def get_metadata(self, key: Optional[str] = None) -> Union[Any, Dict[str, Any]]:
    """Get metadata."""
    if key is not None:
        if self._answer_space is None:
            msg = f"Key '{key}' not in metadata (no answer space in metadata)."
            raise KeyError(msg)
        elif key == "answer_space_labels":
            return self.answer_space.tolist()
        elif key == "answer_space_ids":
            return self.answer_space.index.tolist()
        elif key == "relation_info":
            return self.relation_info()
        else:
            msg = f"Key '{key}' not in metadata."
            raise KeyError(msg)

    elif self._answer_space is not None:
        return {k: self.get_metadata(k) for k in ("answer_space_labels", "answer_space_ids", "relation_info")}
    else:
        return {}

relation_info(key=None, /, **kw)

Get or set additional relation information.

Source code in src/lm_pub_quiz/data/base.py
def relation_info(self, key: Optional[str] = None, /, **kw) -> Union[None, Any, Dict[str, Any]]:
    """Get or set additional relation information."""
    if key is not None:
        if key == "cardinality" and "cardinality" not in self._relation_info:
            return self._derived_cardinality
        else:
            return self._relation_info[key]
    elif len(kw) > 0:
        self._relation_info.update(kw)

    info = self._relation_info.copy()
    if "cardinality" not in info:
        info["cardinality"] = self._derived_cardinality
    return info
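
The get-or-set behaviour of relation_info, including the derived cardinality fallback, can be illustrated with a small stand-alone sketch. The class `InfoHolder` is a hypothetical toy that replaces the pandas instance table with a plain list of object IDs; the key `domains` and its value are made up for the example.

```python
from typing import Any, Dict, List, Optional


class InfoHolder:
    """Toy re-implementation of the get-or-set pattern shown above."""

    def __init__(self, obj_ids: List[str]):
        self._obj_ids = obj_ids
        self._info: Dict[str, Any] = {}

    @property
    def _derived_cardinality(self) -> str:
        # Duplicate object IDs mean several instances share an answer.
        if len(set(self._obj_ids)) < len(self._obj_ids):
            return "multiple instances per answer"
        return "single instance per answer"

    def relation_info(self, key: Optional[str] = None, /, **kw: Any):
        if key is not None:
            # Getter: fall back to the derived value for a missing cardinality.
            if key == "cardinality" and "cardinality" not in self._info:
                return self._derived_cardinality
            return self._info[key]
        if kw:
            # Setter: keyword arguments update the stored information.
            self._info.update(kw)
        info = self._info.copy()
        info.setdefault("cardinality", self._derived_cardinality)
        return info


holder = InfoHolder(["Q1", "Q2", "Q1"])
derived = holder.relation_info("cardinality")
updated = holder.relation_info(domains=["geography"])
print(derived)
print(updated)
```

Calling the method with a positional key reads a single value, while calling it with keyword arguments stores them and returns the full dictionary.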

save(save_path, fmt=None)

Save results to a file and export meta_data

Source code in src/lm_pub_quiz/data/base.py
def save(self, save_path: PathLike, fmt: InstanceTableFileFormat = None) -> Optional[Path]:
    """Save results to a file and export meta_data"""
    save_path = Path(save_path)
    save_path.mkdir(parents=True, exist_ok=True)

    log.debug("Saving %s result to: %s", self, save_path)

    ### Metadata file -> .json ###
    if save_path.is_dir():
        metadata_path = save_path / self._metadata_file_name
    else:
        metadata_path = save_path
        save_path = save_path.parent

    if metadata_path.exists():
        with open(metadata_path) as file:
            all_metadata = json.load(file)

            if self.relation_code in all_metadata:
                log.warning("Overwriting metadata info for relation %s (%s)", self.relation_code, save_path)
    else:
        all_metadata = {}

    ### Store instance table to .jsonl file ###
    if self.has_instance_table:
        instances_path = self.path_for_code(save_path, self.relation_code, fmt=fmt)
        self.save_instance_table(self.instance_table, instances_path, fmt=fmt)
        log.debug("Instance table was saved to: %s", instances_path)

    else:
        instances_path = None

    all_metadata[self.relation_code] = self.get_metadata()

    with open(metadata_path, "w") as file:
        json.dump(all_metadata, file, indent=4, default=str)
        log.debug("Metadata file was saved to: %s", metadata_path)

    return instances_path
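
The metadata handling in save() is a read-update-write cycle on one shared JSON file, so several relations saved to the same directory end up in a single metadata_relations.json. A minimal sketch of just that step, using only the standard library, is given below. The helper `save_metadata`, the relation codes, and the metadata values are hypothetical; writing the instance table is omitted.

```python
import json
import tempfile
from pathlib import Path


def save_metadata(directory: Path, relation_code: str, metadata: dict) -> Path:
    """Hypothetical helper mirroring the read-update-write step of save()."""
    directory.mkdir(parents=True, exist_ok=True)
    metadata_path = directory / "metadata_relations.json"

    # Read existing entries so that other relations are preserved.
    if metadata_path.exists():
        with open(metadata_path) as file:
            all_metadata = json.load(file)
    else:
        all_metadata = {}

    # Add or overwrite the entry for this relation only.
    all_metadata[relation_code] = metadata

    with open(metadata_path, "w") as file:
        json.dump(all_metadata, file, indent=4, default=str)
    return metadata_path


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "results"
    save_metadata(out, "P36", {"answer_space_labels": ["France"]})
    path = save_metadata(out, "P19", {"answer_space_labels": ["Berlin"]})
    with open(path) as file:
        stored = json.load(file)

print(sorted(stored))
```

Saving the same relation code twice replaces its entry, which corresponds to the overwrite warning logged in the listing above.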

save_instance_table(instance_table, path, fmt=None) classmethod

Save instance table with the format determined by the path suffix.

Parameters:

Name Type Description Default
instance_table DataFrame

The instances to save.

required
path Path

Where to save the instance table. If format is not specified, the suffix is used to determine the format.

required
fmt str

The format in which to save the instances.

None
Source code in src/lm_pub_quiz/data/base.py
@classmethod
def save_instance_table(cls, instance_table: pd.DataFrame, path: Path, fmt: InstanceTableFileFormat = None):
    """Save instance table with the format determined by the path suffix.

    Parameters:
       instance_table (pd.DataFrame): The instances to save.
       path (Path): Where to save the instance table. If format is not specified, the suffix is used to determine
                    the format.
       fmt (str): The format in which to save the instances.
    """
    if fmt is None:
        fmt = tuple(s[1:] for s in path.suffixes)
    elif isinstance(fmt, str):
        fmt = tuple(fmt.split("."))

    if fmt == ("jsonl",):
        instance_table.to_json(path, orient="records", lines=True)

    elif fmt[0] == "parquet" and len(fmt) <= 2:  # noqa: PLR2004
        compression: Optional[str]

        if len(fmt) == 1:
            compression = None
        else:
            compression = fmt[1]

        instance_table.to_parquet(path, compression=compression)
    else:
        msg = f"Format .{'.'.join(fmt)} not recognized: Could not save instances at {path}."
        raise ValueError(msg)
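
The format resolution used above can be tried out in isolation. The sketch below reproduces only the normalization of fmt into a tuple and the extraction of the parquet compression; the helper names `resolve_format` and `parquet_compression` and the example file names are assumptions for illustration, and no file is written.

```python
from pathlib import Path
from typing import Optional, Tuple, Union

Format = Union[None, str, Tuple[str, ...]]


def resolve_format(path: Path, fmt: Format = None) -> Tuple[str, ...]:
    """Normalize a format specification to a tuple such as ("parquet", "gzip")."""
    if fmt is None:
        # Fall back to the file suffixes, dropping the leading dots.
        return tuple(s[1:] for s in path.suffixes)
    if isinstance(fmt, str):
        return tuple(fmt.split("."))
    return tuple(fmt)


def parquet_compression(fmt: Tuple[str, ...]) -> Optional[str]:
    """Return the compression part of a parquet format tuple, if any."""
    if not fmt or fmt[0] != "parquet" or len(fmt) > 2:
        raise ValueError(f"Not a parquet format: {fmt}")
    return fmt[1] if len(fmt) == 2 else None


print(resolve_format(Path("P36.jsonl")))
print(resolve_format(Path("P36.parquet.gzip")))
print(parquet_compression(resolve_format(Path("out"), "parquet")))
```

An explicit fmt takes precedence over the path suffix, so a file can be written in a format its name does not indicate.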

search_path(path, relation_code=None, fmt=None) classmethod

Search path for instance files.

Source code in src/lm_pub_quiz/data/base.py
@classmethod
def search_path(
    cls, path: Path, relation_code: Optional[str] = None, fmt: InstanceTableFileFormat = None
) -> Union[List[Path], Path, None]:
    """Search path for instance files."""

    if relation_code is not None and fmt is not None:
        # Just look for the file
        p = cls.path_for_code(path, relation_code, fmt=fmt)
        if p.exists():
            return p
        else:
            return None

    if relation_code is None:
        code = ".*"
    else:
        code = re.escape(relation_code)

    if fmt is None:
        suffix = ".*"
    else:
        suffix = cls.suffix_from_instance_format(fmt)

    pattern = re.compile(f"(?P<relation_code>{code}){cls._instance_table_file_name_suffix}.(?P<suffix>{suffix})")

    matches: Dict[str, List[Path]] = defaultdict(list)
    for p in map(Path, os.scandir(path)):
        if p.name == cls._metadata_file_name:
            continue

        match = re.fullmatch(pattern, p.name)

        if match is not None:
            matches[match.group("relation_code")].append(p)

    selected_paths = []
    for code, matching_paths in matches.items():
        if len(matching_paths) > 1:
            log.warning("Found multiple files for relation %s: %s", code, ", ".join(p.name for p in matching_paths))
        selected_paths.append(matching_paths[0])

    if relation_code is None:
        return selected_paths
    elif len(selected_paths) == 0:
        return None
    else:
        return selected_paths[0]
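
The file-name matching in search_path can be sketched on a plain list of names, without touching the file system. The helper `match_instance_files`, the `_results` file-name suffix, and the file names are assumptions for illustration. Unlike the listing above, this sketch escapes the dot between the name and the extension so that it matches only a literal dot.

```python
import re
from collections import defaultdict
from typing import Dict, List, Optional


def match_instance_files(
    names: List[str],
    relation_code: Optional[str] = None,
    file_name_suffix: str = "",
) -> Dict[str, List[str]]:
    """Group candidate instance files by the relation code found in their names."""
    code = ".*" if relation_code is None else re.escape(relation_code)
    pattern = re.compile(
        f"(?P<relation_code>{code}){re.escape(file_name_suffix)}\\.(?P<suffix>.*)"
    )

    matches: Dict[str, List[str]] = defaultdict(list)
    for name in names:
        if name == "metadata_relations.json":
            continue  # the metadata file is never an instance table
        match = pattern.fullmatch(name)
        if match is not None:
            matches[match.group("relation_code")].append(name)
    return dict(matches)


files = [
    "P36_results.jsonl",
    "P36_results.parquet",
    "P19_results.jsonl",
    "metadata_relations.json",
    "notes.txt",
]
found = match_instance_files(files, file_name_suffix="_results")
only_p19 = match_instance_files(files, relation_code="P19", file_name_suffix="_results")
print(found)
print(only_p19)
```

A relation code with more than one matching file corresponds to the case in which the listing logs a warning and keeps the first match.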

lm_pub_quiz.data.base.DatasetBase

Bases: DataBase, Generic[RT]

Base class for a collection of relations or relation results.