zea.data.datasets¶

This module provides classes and utilities for loading, validating, and managing ultrasound datasets stored in HDF5 format. It supports both local and Hugging Face Hub datasets, and offers efficient file handle caching for large collections of files.

Main Classes¶

H5FileHandleCache: Caches open HDF5 file handles to optimize repeated access.
Folder: Represents a group of HDF5 files in a directory, with optional validation.
Dataset: Provides an iterable interface over multiple HDF5 files or folders, with support for directory-based splitting and validation.

Functions¶

split_files_by_directory: Splits files among directories according to specified ratios.
count_samples_per_directory: Counts the number of files per directory.

Features¶

Validation of dataset integrity with flag files and error logging.
Support for Hugging Face Hub datasets with local caching.
Utilities for dataset splitting and sample counting.
Example usage provided in the module’s main block.

Functions

`count_samples_per_directory`(file_names, ...)	Count number of samples per directory.
`split_files_by_directory`(file_names, ...)	Split files according to their parent directories and given split ratios.

Classes

`Dataset`(file_paths[, validate, directory_splits])	Iterate over File(s) and Folder(s).
`Folder`(folder_path[, validate, hf_cache_dir])	Group of HDF5 files in a folder that can be validated.
`H5FileHandleCache`([file_handle_cache_capacity])	Cache for HDF5 file handles.

class zea.data.datasets.Dataset(file_paths, validate=True, directory_splits=None, **kwargs)[source]¶

Bases: H5FileHandleCache

Iterate over File(s) and Folder(s).

Initializes the Dataset.

Parameters:

file_paths (Union[List[str], str]) – Path or list of paths to folders containing HDF5 files, or to individual HDF5 files. Can be a mixed list of folders and files.
validate (bool) – Whether to validate the dataset. Defaults to True.
directory_splits (list | None) – Split ratios, one per directory. A list of floats between 0 and 1, with the same length as file_paths. If None, all files in file_paths are used.

__call__()[source]¶: Call self as a function.

find_files(paths)[source]¶

Find files and optionally validate folders and files.

Return type: List[str]

classmethod from_config(dataset_folder, user=None, **kwargs)[source]¶: Creates a Dataset from a config file.

load_file_shapes(key)[source]¶: Load the shapes of the datasets in each file.

property n_files¶: Return number of files in dataset.

property total_frames¶: Return total number of frames in dataset.

class zea.data.datasets.Folder(folder_path, validate=True, hf_cache_dir=PosixPath('/home/docs/.cache/zea/huggingface/datasets'))[source]¶

Bases: object

Group of HDF5 files in a folder that can be validated. Mostly used internally; in most cases you will want to use the Dataset class instead.

copy(to_path, key, mode=None)[source]¶

Copy the data for all or a specific key to a new location.

Has the option to copy all keys or only a specific key. By default, it only copies if the destination file does not already contain the key. You can change the mode to ‘w’ to overwrite the destination file. Will always copy metadata such as dataset attributes and scan object.

Parameters:

to_path (str | Path) – The destination path where files will be copied.
key (str) – The key to copy from the source files. If ‘all’ or ‘*’, all keys will be copied.
mode (str | None) – The mode in which to open the destination files. If None, defaults to ‘a’ (append mode), or to ‘w’ (write mode) when key is ‘all’ or ‘*’. See: https://docs.h5py.org/en/stable/high/file.html#opening-creating-files
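The default-mode rule described for the mode parameter can be sketched as a small helper. This is an illustration of the documented behaviour only, not zea's code: the function name resolve_copy_mode and the example key "data/image" are hypothetical.

```python
def resolve_copy_mode(key, mode=None):
    """Pick the file mode for the destination file (hypothetical helper).

    Mirrors the documented default: an explicit mode always wins;
    otherwise 'w' is used when every key is copied and 'a' when a
    single key is copied.
    """
    if mode is not None:
        return mode
    return "w" if key in ("all", "*") else "a"


# "data/image" is a made-up key name used only for illustration.
default_for_single_key = resolve_copy_mode("data/image")
default_for_all_keys = resolve_copy_mode("all")
```

With append mode as the single-key default, repeated copies of different keys accumulate in the same destination file rather than replacing it.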

find_h5_files()[source]¶

Return type: List[str]

static get_data_types(file_path)[source]¶: Get data types from file.

load_file_shapes(key)[source]¶: Load the shapes of the datasets in each file.

property n_files¶: Return number of files in dataset.

validate_folder()[source]¶

Validate dataset contents.

If a validation file exists, checks whether the dataset was validated on the same date. Raises an error if the validation file is corrupted. If the validation file is intact and marks the dataset as validated, prints a message and returns.

class zea.data.datasets.H5FileHandleCache(file_handle_cache_capacity=128)[source]¶

Bases: object

Cache for HDF5 file handles.

This class manages a cache of HDF5 file handles to avoid reopening files multiple times. It uses an OrderedDict to maintain the order of file access and closes the least recently used file when the cache reaches its capacity.
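The least-recently-used eviction described above can be sketched with the standard library alone. The class below is a minimal illustration, not zea's implementation: the name LRUHandleCache and the opener argument are assumptions, and plain open() stands in for opening an HDF5 file so the example runs without h5py.

```python
from collections import OrderedDict


class LRUHandleCache:
    """Illustrative LRU cache for file handles (hypothetical, not zea's code)."""

    def __init__(self, capacity=128, opener=open):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        # Assumed hook: a real HDF5 cache would open h5py files here.
        self.opener = opener
        self._handles = OrderedDict()

    def get_file(self, file_path):
        key = str(file_path)
        if key in self._handles:
            # Cache hit: mark this handle as most recently used.
            self._handles.move_to_end(key)
            return self._handles[key]
        # Cache miss: close and drop the least recently used handle if full.
        if len(self._handles) >= self.capacity:
            _, oldest = self._handles.popitem(last=False)
            oldest.close()
        handle = self.opener(key)
        self._handles[key] = handle
        return handle

    def close(self):
        # Close every cached handle and empty the cache.
        for handle in self._handles.values():
            handle.close()
        self._handles.clear()
```

Keeping the handles in an OrderedDict makes both the "touch" on a hit and the eviction on a miss constant-time operations.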

close()[source]¶: Close all cached file handles.

get_file(file_path)[source]¶

Open an HDF5 file and cache it.

Return type: File

zea.data.datasets.count_samples_per_directory(file_names, directories)[source]¶

Count number of samples per directory.

Parameters:

file_names (list) – List of file paths
directories (str or list) – Directory or list of directories

Returns:

Dictionary with directory paths as keys and sample counts as values

Return type:

dict
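A minimal sketch of the counting behaviour described above, assuming that a file belongs to a directory when that directory is one of its ancestors. The function name count_files_per_directory and the example paths are hypothetical; zea's own function may match files differently.

```python
from pathlib import Path


def count_files_per_directory(file_names, directories):
    """Count how many of file_names live under each directory (hypothetical sketch)."""
    # Accept a single directory as well as a list, as the documented signature does.
    if isinstance(directories, (str, Path)):
        directories = [directories]
    counts = {str(d): 0 for d in directories}
    for name in file_names:
        parents = Path(name).parents
        for directory in directories:
            if Path(directory) in parents:
                counts[str(directory)] += 1
                # Assumption: each file is counted once, under the first match.
                break
    return counts


# Made-up paths used only for illustration.
example_files = ["data/train/a.hdf5", "data/train/sub/b.hdf5", "data/val/c.hdf5"]
example_counts = count_files_per_directory(example_files, ["data/train", "data/val"])
```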

zea.data.datasets.split_files_by_directory(file_names, directory_list, directory_splits)[source]¶

Split files according to their parent directories and given split ratios.

Parameters:

file_names (list) – List of file paths.
directory_list (list) – List of directory paths to split by.
directory_splits (list) – List of split ratios (0-1) for each directory.

Returns:

(split_file_names, split_file_shapes)

Return type:

tuple
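One plausible reading of the split ratios is "keep the first fraction of the files found under each directory". The sketch below implements only that reading and returns file names alone, without the shapes the real function also returns. The name split_by_directory, the truncating rounding, and the example paths are all assumptions.

```python
from pathlib import Path


def split_by_directory(file_names, directory_list, directory_splits):
    """Keep a leading fraction of the files under each directory (hypothetical sketch)."""
    if len(directory_list) != len(directory_splits):
        raise ValueError("directory_list and directory_splits must have equal length")
    selected = []
    for directory, ratio in zip(directory_list, directory_splits):
        if not 0 <= ratio <= 1:
            raise ValueError("split ratios must lie between 0 and 1")
        # Files whose ancestors include this directory, in their original order.
        in_dir = [f for f in file_names if Path(directory) in Path(f).parents]
        # Assumption: the count is truncated toward zero.
        n_keep = int(len(in_dir) * ratio)
        selected.extend(in_dir[:n_keep])
    return selected


# Made-up paths used only for illustration.
example_files = [f"a/{i}.hdf5" for i in range(4)] + ["b/0.hdf5", "b/1.hdf5"]
example_split = split_by_directory(example_files, ["a", "b"], [0.5, 1.0])
```

Taking a leading slice keeps the selection deterministic; a shuffled or seeded selection would be an equally valid design if reproducible ordering is not required.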