Dataset Representation#

Two classes represent a dataset: Relation and Dataset, which is essentially a container for a number of relations.

Relation #

Relation(
    relation_code: str,
    *,
    templates: list[str],
    answer_space: Optional[Series],
    instance_table: Optional[DataFrame],
    lazy_options: Optional[dict[str, Any]],
    relation_info: Optional[dict[str, Any]] = None
)

Represents a relation within a dataset, including its code, answer space, templates, and an instance table.

Methods:

Name Description
activated

Return self or a copy of self with the instance_table loaded (lazy loading disabled).

from_path

Loads a relation from a JSONL file and associated metadata.

relation_info

Get or set additional relation information.

save

Save results to a file and export the metadata.

save_instance_table

Save instance table with the format determined by the path suffix.

search_path

Search path for instance files.

subsample

Returns a subsampled version of the dataset of size n.

Attributes:

Name Type Description
answer_space Series

The answer space of the relation.

instance_table DataFrame

A pandas.DataFrame containing all items in the relation.

relation_code str

The identifier of the relation.

answer_space property #

answer_space: Series

The answer space of the relation.

instance_table property #

instance_table: DataFrame

A pandas.DataFrame containing all items in the relation.

relation_code property #

relation_code: str

The identifier of the relation.

activated #

activated() -> Self

Return self or a copy of self with the instance_table loaded (lazy loading disabled).
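The "self or a copy" behaviour can be illustrated with a small stand-in class. This is a sketch of the general lazy-loading pattern only: LazyTable, its loader argument, and is_loaded are hypothetical names, not part of the library, and the real Relation may be implemented differently.

```python
import copy
from typing import Callable, Optional


class LazyTable:
    """Hypothetical stand-in for an object with a lazily loaded table."""

    def __init__(self, loader: Callable[[], list], rows: Optional[list] = None):
        self._loader = loader  # called only when the rows are needed
        self._rows = rows      # None means "not loaded yet"

    @property
    def is_loaded(self) -> bool:
        return self._rows is not None

    def activated(self) -> "LazyTable":
        # Already loaded: return the same object.
        if self.is_loaded:
            return self
        # Otherwise return a copy holding the loaded rows; the original stays lazy.
        clone = copy.copy(self)
        clone._rows = self._loader()
        return clone


lazy = LazyTable(loader=lambda: [{"sub": "Paris", "answer": "France"}])
loaded = lazy.activated()
```

Under these assumptions, calling activated() on an already loaded object is cheap, because it returns the object itself.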

from_path classmethod #

from_path(
    path: PathLike,
    *,
    relation_code: Optional[str] = None,
    lazy: bool = True,
    fmt: InstanceTableFileFormat = None
) -> Self

Loads a relation from a JSONL file and associated metadata.

Parameters:

Name Type Description Default

path #

PathLike

The path to the dataset directory.

required

relation_code #

str

The specific code of the relation to load.

None

lazy #

bool

If False, the instance table is loaded directly into memory.

True

Returns:

Name Type Description
Relation Self

An instance of the Relation class populated with data from the file.

Raises:

Type Description
Exception

If there is an error in loading the file or processing the data.

relation_info #

relation_info(**kw) -> dict[str, Any]
relation_info(key: str) -> Any
relation_info(
    key: Optional[str] = None, /, **kw
) -> Union[None, Any, dict[str, Any]]

Get or set additional relation information.

Use relation.relation_info(<field name>=<new value>) to set fields in the relation info dictionary. If a single field is selected, the respective value is returned. Otherwise the complete dictionary is returned.

Parameters:

Name Type Description Default

key #

Optional[str]

The field to retrieve.

None

**kw #

The fields to modify.

{}

Returns:

Type Description
Union[None, Any, dict[str, Any]]

If a field is selected, the respective value is returned; otherwise, the complete info dictionary is returned.
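The get-or-set semantics described above can be sketched with a stand-in class. InfoHolder is a hypothetical name used for illustration, not the library's Relation, and returning None for an unknown key is an assumption.

```python
from typing import Any, Optional


class InfoHolder:
    """Hypothetical stand-in; not the library's Relation class."""

    def __init__(self) -> None:
        self._info: dict[str, Any] = {}

    def relation_info(self, key: Optional[str] = None, /, **kw: Any) -> Any:
        # Keyword arguments update fields in the info dictionary.
        self._info.update(kw)
        # A positional key selects a single field (None if unknown: an assumption).
        if key is not None:
            return self._info.get(key)
        # Without a key, the complete dictionary is returned.
        return self._info


holder = InfoHolder()
holder.relation_info(domain="geography", cardinality="1:1")  # set two fields
domain = holder.relation_info("domain")                      # read one field
everything = holder.relation_info()                          # read all fields
```

The field names domain and cardinality are made up for the example.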

save #

save(
    path: PathLike, fmt: InstanceTableFileFormat = None
) -> Optional[Path]

Save results to a file and export the metadata.

save_instance_table classmethod #

save_instance_table(
    instance_table: DataFrame,
    path: Path,
    fmt: InstanceTableFileFormat = None,
)

Save instance table with the format determined by the path suffix.

Parameters:

Name Type Description Default

instance_table #

DataFrame

The instances to save.

required

path #

Path

Where to save the instance table. If fmt is not specified, the suffix is used to determine the format.

required

fmt #

str

The format in which to save the instances.

None
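Choosing the format from the path suffix when fmt is not given might look like the sketch below. It uses only the standard library and plain dictionaries instead of a DataFrame. save_rows is a hypothetical helper, and the two formats shown (jsonl, csv) are chosen for illustration; they are not a statement of what the library supports.

```python
import csv
import json
import tempfile
from pathlib import Path
from typing import Optional


def save_rows(rows: list[dict], path: Path, fmt: Optional[str] = None) -> str:
    """Write rows as JSONL or CSV and return the format that was used."""
    # Fall back to the file suffix (without the dot) when no format is given.
    fmt = fmt or path.suffix.lstrip(".")
    if fmt == "jsonl":
        with path.open("w", encoding="utf-8") as f:
            for row in rows:
                f.write(json.dumps(row) + "\n")
    elif fmt == "csv":
        with path.open("w", encoding="utf-8", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0]) if rows else [])
            writer.writeheader()
            writer.writerows(rows)
    else:
        raise ValueError(f"Unsupported format: {fmt!r}")
    return fmt


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "P17.jsonl"
    used = save_rows([{"sub": "Paris", "answer": "France"}], target)
    written = target.read_text(encoding="utf-8")
```

An explicit fmt takes precedence over the suffix in this sketch, which mirrors the parameter description above.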

search_path classmethod #

search_path(
    path: Path,
    relation_code: None = None,
    fmt: InstanceTableFileFormat = None,
) -> list[Path]
search_path(
    path: Path,
    relation_code: str,
    fmt: InstanceTableFileFormat = None,
) -> Path
search_path(
    path: Path,
    relation_code: Optional[str] = None,
    fmt: InstanceTableFileFormat = None,
) -> Union[list[Path], Path, None]

Search path for instance files.
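The overloads suggest that a search without a relation code yields a list of paths, while a search with a code yields a single path (or None). A minimal sketch of that return-type pattern follows. find_instance_files is a hypothetical helper, and the naming scheme <relation_code>.jsonl and the codes P17 and P36 are assumptions made for the example.

```python
import tempfile
from pathlib import Path
from typing import Optional, Union


def find_instance_files(
    path: Path, relation_code: Optional[str] = None, suffix: str = ".jsonl"
) -> Union[list[Path], Path, None]:
    """Hypothetical helper; the file naming scheme is an assumption."""
    if relation_code is None:
        # No code given: return every candidate instance file in the directory.
        return sorted(path.glob(f"*{suffix}"))
    # Code given: return the single matching file, or None if it is missing.
    candidate = path / f"{relation_code}{suffix}"
    return candidate if candidate.is_file() else None


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for code in ("P17", "P36"):
        (root / f"{code}.jsonl").touch()
    all_files = [p.name for p in find_instance_files(root)]
    one_file = find_instance_files(root, "P36").name
    missing = find_instance_files(root, "P999")
```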

subsample #

subsample(n: int = 10) -> DataFrame

Returns a subsampled version of the dataset of size n.

Parameters:

Name Type Description Default

n #

int

Size of the subsampled dataset

10

Returns:

Type Description
DataFrame

Subsampled version of the dataset.
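As a rough illustration of subsampling, the sketch below draws n rows at random from a list of dictionaries. It does not use pandas and makes no claim about how the library selects rows. subsample_rows, the seed parameter, and the capping of n at the number of available rows are all assumptions.

```python
import random


def subsample_rows(rows: list[dict], n: int = 10, seed: int = 0) -> list[dict]:
    """Return up to n randomly chosen rows (hypothetical helper)."""
    # Never ask for more rows than exist.
    k = min(n, len(rows))
    # A seeded generator keeps the sample reproducible.
    return random.Random(seed).sample(rows, k)


rows = [{"id": i} for i in range(100)]
sample = subsample_rows(rows)        # default size of 10
small = subsample_rows(rows[:3])     # fewer rows than n
```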

Dataset #

Dataset(
    relations: list[Relation],
    path: PathLike,
    name: Optional[str] = None,
)

A collection of relations forming a multiple choice dataset.

Usage

The preferred way to load the BEAR knowledge probe is to load it by name:

from lm_pub_quiz import Dataset
dataset = Dataset.from_name("BEAR")

Methods:

Name Description
from_name

Loads a dataset from the cache (if available) or the url which is specified in the internal dataset table.

from_path

Loads a multiple choice dataset from a specified directory path.

from_name classmethod #

from_name(
    name: str,
    *,
    lazy: bool = True,
    base_path: Optional[Path] = None,
    chunk_size: int = 10 * 1024,
    relation_info: Optional[PathLike] = None,
    **kwargs
) -> Self

Loads a dataset from the cache (if available) or the url which is specified in the internal dataset table.

Parameters:

Name Type Description Default

name #

str

The name of the dataset.

required

lazy #

bool

If False, the instance tables of all relations are directly loaded into memory.

True

Returns:

Name Type Description
Dataset Self

An instance of Dataset loaded with the relations from the directory.

Raises:

Type Description
Exception

If there is an error in loading the dataset.

Usage

Loading the BEAR-dataset.

>>> from lm_pub_quiz import Dataset
>>> dataset = Dataset.from_name("BEAR")

from_path classmethod #

from_path(
    path: PathLike,
    *,
    lazy: bool = True,
    fmt: InstanceTableFileFormat = None,
    relation_info: Optional[PathLike] = None,
    **kwargs
) -> Self

Loads a multiple choice dataset from a specified directory path.

This method scans the directory for relation files and assembles them into a multiple choice Dataset.

Parameters:

Name Type Description Default

path #

PathLike

The directory path where the dataset is stored.

required

lazy #

bool

If False, the instance tables of all relations are directly loaded into memory.

True

Returns:

Name Type Description
Dataset Self

An instance of Dataset loaded with the relations from the directory.

Raises:

Type Description
Exception

If there is an error in loading the dataset.

Usage

Loading the BEAR-dataset.

>>> from lm_pub_quiz import Dataset
>>> dataset = Dataset.from_path("/path/to/dataset/BEAR")