Dataset Representation
There are two classes which are used to represent a dataset: Relation and Dataset (which is essentially a container for a number of relations).
Relation
Relation(
relation_code: str,
*,
templates: list[str],
answer_space: Optional[Series],
instance_table: Optional[DataFrame],
lazy_options: Optional[dict[str, Any]],
relation_info: Optional[dict[str, Any]] = None
)
Represents a relation within a dataset, including its code, answer space, templates, and an instance table.
Methods:
Name | Description |
---|---|
activated | Return self or a copy of self with the instance_table loaded (lazy loading disabled). |
from_path | Loads a relation from a JSONL file and associated metadata. |
relation_info | Get or set additional relation information. |
save | Save results to a file and export meta_data. |
save_instance_table | Save instance table with the format determined by the path suffix. |
search_path | Search path for instance files. |
subsample | Returns a subsampled version of the dataset of size n. |
Attributes:
Name | Type | Description |
---|---|---|
answer_space | Series | The answer space of the relation. |
instance_table | DataFrame | A pandas.DataFrame containing all items in the relation. |
relation_code | str | The identifier of the relation. |
instance_table
property
instance_table: DataFrame
A pandas.DataFrame containing all items in the relation.
activated
Return self or a copy of self with the instance_table loaded (lazy loading disabled).
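The "self or a copy" behaviour can be illustrated with a minimal sketch. It assumes a hypothetical stand-in class, LazyRelation, that stores rows in a plain list instead of a DataFrame; the loader callback and the is_lazy property are invented for illustration and are not part of the documented API.

```python
from __future__ import annotations

import copy
from typing import Callable, Optional


class LazyRelation:
    """Hypothetical stand-in for Relation; rows are a plain list, not a DataFrame."""

    def __init__(self, loader: Callable[[], list[dict]], table: Optional[list[dict]] = None):
        self._loader = loader  # called only when the table is actually needed
        self._table = table    # None while the relation is still lazy

    @property
    def is_lazy(self) -> bool:
        return self._table is None

    def activated(self) -> "LazyRelation":
        # Already loaded: nothing to do, so the same object is returned.
        if not self.is_lazy:
            return self
        # Otherwise return a copy that holds the loaded table; self stays lazy.
        loaded = copy.copy(self)
        loaded._table = self._loader()
        return loaded


calls = []


def load_rows() -> list[dict]:
    # Records each invocation so the number of loads can be inspected.
    calls.append("load")
    return [{"id": 1}]


lazy = LazyRelation(load_rows)
active = lazy.activated()
```

In this sketch, calling activated() on an already loaded object returns that same object, so the loader runs at most once per activated copy.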
from_path
classmethod
from_path(
path: PathLike,
*,
relation_code: Optional[str] = None,
lazy: bool = True,
fmt: InstanceTableFileFormat = None
) -> Self
Loads a relation from a JSONL file and associated metadata.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | PathLike | The path to the dataset directory. | required |
relation_code | str | The specific code of the relation to load. | None |
lazy | bool | If False, the instance table is loaded directly into memory. | True |
Returns:
Name | Type | Description |
---|---|---|
Relation | Self | An instance of the Relation class populated with data from the file. |
Raises:
Type | Description |
---|---|
Exception | If there is an error in loading the file or processing the data. |
relation_info
Get or set additional relation information.
Use relation.relation_info(<field name>=<new value>) to set fields in the relation info dictionary.
If a single field is selected, the respective value is returned. Otherwise, the complete dictionary is returned.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
| Optional[str] | The field to retrieve. | None |
| | The fields to modify. | {} |
Returns:
Type | Description |
---|---|
Union[None, Any, dict[str, Any]] | If a field is selected, the respective value is returned; otherwise, the complete info dictionary is returned. |
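The get-or-set convention described above can be sketched as follows. InfoHolder is a hypothetical stand-in, not the library's Relation class, and the parameter name key as well as the field names in the usage lines are assumptions made for illustration. Returning None for a missing field is also an assumption.

```python
from __future__ import annotations

from typing import Any, Optional, Union


class InfoHolder:
    """Hypothetical stand-in that mimics the documented relation_info behaviour."""

    def __init__(self, info: Optional[dict[str, Any]] = None):
        self._info: dict[str, Any] = dict(info or {})

    def relation_info(
        self, key: Optional[str] = None, /, **kw: Any
    ) -> Union[None, Any, dict[str, Any]]:
        # Keyword arguments update fields in the info dictionary.
        self._info.update(kw)
        # A selected field returns just that value (None if it is missing) ...
        if key is not None:
            return self._info.get(key)
        # ... otherwise the complete dictionary is returned.
        return self._info


holder = InfoHolder({"domains": ["geography"]})
holder.relation_info(cardinality="1:1")     # set a field
value = holder.relation_info("cardinality")  # read a single field
full = holder.relation_info()                # read everything
```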
save
Save results to a file and export meta_data
save_instance_table
classmethod
save_instance_table(
instance_table: DataFrame,
path: Path,
fmt: InstanceTableFileFormat = None,
)
Save instance table with the format determined by the path suffix.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
instance_table | DataFrame | The instances to save. | required |
path | Path | Where to save the instance table. If the format is not specified, the suffix is used to determine the format. | required |
fmt | str | Which format to save the instances in. | None |
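How a format might be resolved from the path suffix can be sketched as below. The helper resolve_format and the suffix-to-format mapping are assumptions for illustration only; the formats that InstanceTableFileFormat actually accepts are not listed in this section.

```python
from __future__ import annotations

from pathlib import Path
from typing import Optional

# Assumed mapping for illustration only; the formats the library accepts may differ.
SUFFIX_TO_FORMAT = {".jsonl": "jsonl", ".csv": "csv"}


def resolve_format(path: Path, fmt: Optional[str] = None) -> str:
    # An explicitly given format always wins over the suffix.
    if fmt is not None:
        return fmt
    try:
        return SUFFIX_TO_FORMAT[path.suffix.lower()]
    except KeyError:
        raise ValueError(f"Cannot infer format from suffix {path.suffix!r}") from None


inferred = resolve_format(Path("example.jsonl"))
explicit = resolve_format(Path("example.data"), fmt="csv")
```

Raising an error for an unknown suffix, rather than silently falling back to a default, is a design choice of this sketch and not a documented behaviour.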
search_path
classmethod
search_path(
path: Path,
relation_code: Optional[str] = None,
fmt: InstanceTableFileFormat = None,
) -> Union[list[Path], Path, None]
Search path for instance files.
Dataset
Dataset(
relations: list[Relation],
path: PathLike,
name: Optional[str] = None,
)
A collection of relations forming a multiple choice dataset.
Usage
The preferred way to load the BEAR knowledge probe is to load it by name:
from lm_pub_quiz import Dataset
dataset = Dataset.from_name("BEAR")
Methods:
Name | Description |
---|---|
from_name | Loads a dataset from the cache (if available) or the url which is specified in the internal dataset table. |
from_path | Loads a multiple choice dataset from a specified directory path. |
from_name
classmethod
from_name(
name: str,
*,
lazy: bool = True,
base_path: Optional[Path] = None,
chunk_size: int = 10 * 1024,
relation_info: Optional[PathLike] = None,
**kwargs
) -> Self
Loads a dataset from the cache (if available) or the url which is specified in the internal dataset table.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name | str | The name of the dataset. | required |
lazy | bool | If False, the instance tables of all relations are directly loaded into memory. | True |
Returns:
Name | Type | Description |
---|---|---|
Dataset | Self | An instance of Dataset loaded with the relations from the directory. |
Raises:
Type | Description |
---|---|
Exception | If there is an error in loading the dataset. |
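The cache-or-download pattern with a chunk_size can be sketched in general terms as follows. This is not the library's implementation: fetch_cached, its opener parameter, and the example URL are hypothetical, and the demo replaces the network call with an in-memory stream so that it runs offline.

```python
from __future__ import annotations

import io
import tempfile
from pathlib import Path
from typing import BinaryIO, Callable
from urllib.request import urlopen


def fetch_cached(
    url: str,
    target: Path,
    *,
    chunk_size: int = 10 * 1024,
    opener: Callable[[str], BinaryIO] = urlopen,
) -> Path:
    # Cache hit: reuse the file that is already on disk.
    if target.exists():
        return target
    target.parent.mkdir(parents=True, exist_ok=True)
    # Cache miss: stream the response to disk, chunk_size bytes at a time.
    with opener(url) as response, open(target, "wb") as out:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:
                break
            out.write(chunk)
    return target


# Demo with an in-memory "download" so that no network access is needed.
requested = []


def fake_opener(url: str) -> BinaryIO:
    requested.append(url)
    return io.BytesIO(b"x" * 25_000)


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "cache" / "dataset.bin"
    fetch_cached("https://example.invalid/dataset", target, opener=fake_opener)
    fetch_cached("https://example.invalid/dataset", target, opener=fake_opener)
    size = target.stat().st_size
```

The second call finds the cached file and never invokes the opener. Reading in fixed-size chunks keeps memory use bounded regardless of the size of the download.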
from_path
classmethod
from_path(
path: PathLike,
*,
lazy: bool = True,
fmt: InstanceTableFileFormat = None,
relation_info: Optional[PathLike] = None,
**kwargs
) -> Self
Loads a multiple choice dataset from a specified directory path.
This method scans the directory for relation files and assembles them into a Dataset.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | PathLike | The directory path where the dataset is stored. | required |
lazy | bool | If False, the instance tables of all relations are directly loaded into memory. | True |
Returns:
Name | Type | Description |
---|---|---|
Dataset | Self | An instance of Dataset loaded with the relations from the directory. |
Raises:
Type | Description |
---|---|
Exception | If there is an error in loading the dataset. |