Dataset Representation
Two classes represent a dataset: Relation and Dataset (essentially a container for a number of relations).
Relation
Relation(relation_code: str, *, templates: list[str], answer_space: Optional[Series], instance_table: Optional[DataFrame], lazy_options: Optional[dict[str, Any]], relation_info: Optional[dict[str, Any]] = None)
Represents a relation within a dataset, including its code, answer space, templates, and an instance table.
Methods:
| Name | Description |
|---|---|
| activated | Return self or a copy of self with the instance_table loaded (lazy loading disabled). |
| from_path | Loads a relation from a JSONL file and associated metadata. |
| relation_info | Get or set additional relation information. |
| save | Save results to a file and export meta_data. |
| save_instance_table | Save instance table with the format determined by the path suffix. |
| search_path | Search path for instance files. |
| subsample | Returns a subsampled version of the dataset of size n. |
Attributes:
| Name | Type | Description |
|---|---|---|
| answer_space | Series | The answer space of the relation. |
| instance_table | DataFrame | A pandas.DataFrame containing all items in the relation. |
| relation_code | str | The identifier of the relation. |
instance_table
property
instance_table: DataFrame
A pandas.DataFrame containing all items in the relation.
activated
Return self or a copy of self with the instance_table loaded (lazy loading disabled).
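The "self or a copy" behaviour can be pictured with a small stand-in class. This is only a sketch under assumptions, not the library's implementation: `LazyRelation`, its `_rows` attribute, and the `is_lazy` property are hypothetical names, and plain lists of dictionaries replace the pandas DataFrame.

```python
from __future__ import annotations

import copy
import json
from pathlib import Path
from typing import Optional


class LazyRelation:
    """Hypothetical stand-in for a lazily loaded relation (not the library's class)."""

    def __init__(self, path: Path, rows: Optional[list[dict]] = None):
        self.path = path
        self._rows = rows  # None means "not loaded yet"

    @property
    def is_lazy(self) -> bool:
        return self._rows is None

    def activated(self) -> LazyRelation:
        # Already loaded: nothing to do, hand back the same object.
        if not self.is_lazy:
            return self
        # Otherwise return a copy that holds the rows in memory; self stays lazy.
        loaded = copy.copy(self)
        with self.path.open(encoding="utf-8") as f:
            loaded._rows = [json.loads(line) for line in f if line.strip()]
        return loaded
```

Returning a copy in the lazy case keeps the original object cheap to hold on to, while callers that need the data work with the loaded copy.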
from_path
classmethod
from_path(path: PathLike, *, relation_code: Optional[str] = None, lazy: bool = True, fmt: InstanceTableFileFormat = None) -> Self
Loads a relation from a JSONL file and associated metadata.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | PathLike | The path to the dataset directory. | required |
| relation_code | str | The specific code of the relation to load. | None |
| lazy | bool | If False, the instance table is loaded directly into memory. | True |
Returns:
| Name | Type | Description |
|---|---|---|
| Relation | Self | An instance of the Relation class populated with data from the file. |
Raises:
| Type | Description |
|---|---|
| Exception | If there is an error in loading the file or processing the data. |
relation_info
Get or set additional relation information.
Use relation.relation_info(<field name>=<new value>) to set fields in the relation info dictionary.
If a single field is selected, the respective value is returned. Otherwise, the complete dictionary is returned.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | Optional[str] | The field to retrieve. | None |
| | Any | The fields to modify. | {} |
Returns:
| Type | Description |
|---|---|
| Union[None, Any, dict[str, Any]] | If a field is selected, the respective value is returned; otherwise, the complete info dictionary is returned. |
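A combined get-or-set accessor of this kind might look roughly like the following sketch. The class `InfoHolder` and the parameter names `key` and `updates` are assumptions made for illustration, as is the choice to return None for a missing field; the actual method may differ.

```python
from typing import Any, Optional, Union


class InfoHolder:
    """Hypothetical container illustrating a combined getter/setter."""

    def __init__(self) -> None:
        self._info: dict[str, Any] = {}

    def relation_info(
        self, key: Optional[str] = None, **updates: Any
    ) -> Union[None, Any, dict[str, Any]]:
        # Apply any keyword updates first.
        self._info.update(updates)
        # A selected field returns its value (None if it is missing) ...
        if key is not None:
            return self._info.get(key)
        # ... otherwise the complete dictionary is returned.
        return self._info
```

With this sketch, `holder.relation_info(domain="geography")` sets a field, `holder.relation_info("domain")` reads it back, and `holder.relation_info()` returns everything.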
save
Save results to a file and export meta_data.
save_instance_table
classmethod
save_instance_table(instance_table: DataFrame, path: Path, fmt: InstanceTableFileFormat = None)
Save instance table with the format determined by the path suffix.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| instance_table | DataFrame | The instances to save. | required |
| path | Path | Where to save the instance table. If the format is not specified, the suffix is used to determine the format. | required |
| fmt | str | The format in which to save the instances. | None |
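The fallback from an explicit format to the path suffix can be sketched as follows. This is an illustration under assumptions rather than the library's code: `save_rows` is a hypothetical helper, the supported formats ("jsonl" and "csv") are chosen for the example, and lists of dictionaries stand in for the DataFrame.

```python
import csv
import json
from pathlib import Path
from typing import Optional


def save_rows(rows: list[dict], path: Path, fmt: Optional[str] = None) -> str:
    """Write rows to path; fall back to the suffix when fmt is not given."""
    # ".jsonl" -> "jsonl"; an explicit fmt always wins over the suffix.
    fmt = fmt or path.suffix.lstrip(".").lower()
    if fmt == "jsonl":
        with path.open("w", encoding="utf-8") as f:
            for row in rows:
                f.write(json.dumps(row) + "\n")
    elif fmt == "csv":
        with path.open("w", encoding="utf-8", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0]) if rows else [])
            writer.writeheader()
            writer.writerows(rows)
    else:
        raise ValueError(f"Unsupported instance table format: {fmt!r}")
    return fmt
```

Raising on an unknown format, instead of silently picking a default, makes a mistyped suffix visible at save time.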
search_path
classmethod
search_path(path: Path, relation_code: Optional[str] = None, fmt: InstanceTableFileFormat = None) -> Union[list[Path], Path, None]
Search path for instance files.
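The return type allows a list of paths, a single path, or None. One plausible reading, sketched here purely as an assumption, is that a search without a relation code lists all instance files, while a search for a specific code yields one path or None. The helper `find_instance_files`, the `suffix` parameter, and the `<relation_code>.jsonl` naming scheme are hypothetical.

```python
from pathlib import Path
from typing import Optional, Union


def find_instance_files(
    path: Path, relation_code: Optional[str] = None, suffix: str = ".jsonl"
) -> Union[list[Path], Path, None]:
    if relation_code is None:
        # No code given: list every candidate file in a stable order.
        return sorted(path.glob(f"*{suffix}"))
    candidate = path / f"{relation_code}{suffix}"
    # A specific code yields exactly one path, or None if it is missing.
    return candidate if candidate.is_file() else None
```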
Dataset
Dataset(relations: list[Relation], path: PathLike, name: Optional[str] = None)
A collection of relations forming a multiple choice dataset.
Usage
The preferred way to load the BEAR knowledge probe is to load it by name with Dataset.from_name.
Methods:
| Name | Description |
|---|---|
| from_name | Loads a dataset from the cache (if available) or the url which is specified in the internal dataset table. |
| from_path | Loads a multiple choice dataset from a specified directory path. |
from_name
classmethod
from_name(name: str, *, lazy: bool = True, base_path: Optional[Path] = None, chunk_size: int = 10 * 1024, relation_info: Optional[PathLike] = None, **kwargs) -> Self
Loads a dataset from the cache (if available) or the url which is specified in the internal dataset table.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The name of the dataset. | required |
| lazy | bool | If False, the instance tables of all relations are directly loaded into memory. | True |
Returns:
| Name | Type | Description |
|---|---|---|
| Dataset | Self | An instance of Dataset loaded with the relations from the directory. |
Raises:
| Type | Description |
|---|---|
| Exception | If there is an error in loading the dataset. |
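The cache-first lookup with a chunked download can be illustrated with a small standalone helper. This is a sketch under assumptions and not how the library is implemented: `cached_download`, `cache_dir`, and the `open_remote` callable (which stands in for opening the URL from the internal dataset table) are hypothetical; only the `chunk_size` default mirrors the signature above.

```python
from pathlib import Path
from typing import BinaryIO, Callable


def cached_download(
    name: str,
    cache_dir: Path,
    open_remote: Callable[[str], BinaryIO],
    chunk_size: int = 10 * 1024,
) -> Path:
    target = cache_dir / name
    # Cache hit: reuse the local copy and skip the download.
    if target.exists():
        return target
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Cache miss: stream the remote payload to disk chunk by chunk.
    with open_remote(name) as src, target.open("wb") as dst:
        while chunk := src.read(chunk_size):
            dst.write(chunk)
    return target
```

Reading in fixed-size chunks keeps memory use bounded regardless of the size of the downloaded archive.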
from_path
classmethod
from_path(path: PathLike, *, lazy: bool = True, fmt: InstanceTableFileFormat = None, relation_info: Optional[PathLike] = None, **kwargs) -> Self
Loads a multiple choice dataset from a specified directory path.
This method scans the directory for relation files and assembles them into a MultipleChoiceDataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | The directory path where the dataset is stored. | required |
| lazy | bool | If False, the instance tables of all relations are directly loaded into memory. | True |
Returns:
| Name | Type | Description |
|---|---|---|
| Dataset | Self | An instance of Dataset loaded with the relations from the directory. |
Raises:
| Type | Description |
|---|---|
| Exception | If there is an error in loading the dataset. |