GenericDataset

class neuralhydrology.datasetzoo.genericdataset.GenericDataset(cfg: Config, is_train: bool, period: str, basin: str = None, additional_features: List[Dict[str, pandas.DataFrame]] = [], id_to_int: Dict[str, int] = {}, scaler: Dict[str, pandas.Series | xarray.DataArray] = {})

Bases: BaseDataset

Data set class for the generic dataset that reads data for any region based on common file layout conventions.

To use this dataset, data_dir must contain a folder ‘time_series’ and, if static attributes are used, a folder ‘attributes’. The folder ‘time_series’ contains one netCDF file (.nc or .nc4) per basin, named ‘<basin_id>.nc’ or ‘<basin_id>.nc4’. Each netCDF file must have one coordinate called date that contains the datetime index. The folder ‘attributes’ contains one or more comma-separated files (.csv) with static attributes, indexed by basin id. Attribute files can be split by groups of basins or by groups of features, but not both (see genericdataset.load_attributes for details).
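
The layout described above can be checked before a run is started. The following is a minimal sketch using only the standard library; the helper name check_generic_layout and its use_attributes flag are hypothetical and not part of neuralhydrology.

```python
from pathlib import Path
from typing import List


def check_generic_layout(data_dir: Path, use_attributes: bool = True) -> List[str]:
    """Return a list of layout problems; an empty list means none were found.

    Hypothetical helper for illustration, not a neuralhydrology function.
    """
    problems = []

    # 'time_series' must exist and hold at least one .nc or .nc4 file.
    ts_dir = data_dir / "time_series"
    if not ts_dir.is_dir():
        problems.append("missing folder 'time_series'")
    elif not [p for p in ts_dir.iterdir() if p.suffix in (".nc", ".nc4")]:
        problems.append("'time_series' contains no .nc or .nc4 files")

    # 'attributes' is only needed when static attributes are used.
    if use_attributes:
        attr_dir = data_dir / "attributes"
        if not attr_dir.is_dir():
            problems.append("missing folder 'attributes'")
        elif not list(attr_dir.glob("*.csv")):
            problems.append("'attributes' contains no .csv files")

    return problems
```

Returning a list of messages instead of raising on the first problem makes it possible to report every layout issue at once.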

Note: Invalid values must be marked as NaN (e.g. using NumPy’s np.nan) in the netCDF files, not with a sentinel such as -999 for invalid discharge measurements, which is common in hydrology datasets. If missing values are not marked as NaN, GenericDataset cannot identify them as missing data points.
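
As an illustration of the note above, sentinel values can be converted to NaN before the data is written to netCDF. This is a sketch under stated assumptions: the function name mask_sentinels, the sentinel values and the column names are made up for the example.

```python
import pandas as pd


def mask_sentinels(df: pd.DataFrame, sentinels=(-999, -9999)) -> pd.DataFrame:
    """Return a copy of df in which every sentinel value is replaced by NaN.

    Hypothetical helper; the sentinel values are assumptions for this example.
    """
    # DataFrame.mask sets entries to NaN wherever the condition is True.
    return df.mask(df.isin(list(sentinels)))


# Toy daily series with the index named 'date', matching the required coordinate name.
dates = pd.date_range("2000-01-01", periods=4, freq="D", name="date")
raw = pd.DataFrame(
    {"discharge": [1.2, -999.0, 0.8, -999.0], "precip": [0.0, 3.1, 0.4, 0.0]},
    index=dates,
)
clean = mask_sentinels(raw)
print(clean)
```

With the index named date, the cleaned frame could then be written with clean.to_xarray().to_netcdf(...), assuming xarray and a netCDF backend are installed.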

Parameters:
  • cfg (Config) – The run configuration.

  • is_train (bool) – Defines whether the dataset is used for training or evaluation. If True (training), means/stds for each feature are computed and stored to the run directory. If one-hot encoding is used, the mapping for the one-hot encoding is also created and stored to disk. If False, a scaler input is expected, as is the id_to_int input if one-hot encoding is used.

  • period ({'train', 'validation', 'test'}) – Defines the period for which the data will be loaded.

  • basin (str, optional) – If passed, the data for only this basin will be loaded. Otherwise the basin(s) are read from the appropriate basin file, corresponding to the period.

  • additional_features (List[Dict[str, pd.DataFrame]], optional) – List of dictionaries, each mapping from a basin id to a pandas DataFrame. This DataFrame will be added to the data loaded from the dataset, and all of its columns are available as ‘dynamic_inputs’, ‘evolving_attributes’ and ‘target_variables’.

  • id_to_int (Dict[str, int], optional) – If ‘use_basin_id_encoding’ is True in the config and period is either ‘validation’ or ‘test’, this input is required. It is a dictionary mapping from basin id to an integer (the one-hot encoding).

  • scaler (Dict[str, Union[pd.Series, xarray.DataArray]], optional) – If period is either ‘validation’ or ‘test’, this input is required. It contains the centering and scaling values for each feature and is stored to the run directory during training (train_data/train_data_scaler.yml).
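
The container types expected for additional_features and id_to_int can be sketched as follows. The basin ids, the soil_moisture column and the date-indexed frames are assumptions made for this example; in practice the id_to_int mapping created during training should be reused rather than rebuilt.

```python
from typing import Dict, List

import pandas as pd

basins = ["basin_01", "basin_02"]  # hypothetical basin ids

# Assumption: the extra data is indexed by date, like the netCDF time series.
dates = pd.date_range("2000-01-01", periods=3, freq="D", name="date")

# One dictionary per extra data source, each mapping basin id -> DataFrame.
soil_moisture: Dict[str, pd.DataFrame] = {
    basin: pd.DataFrame({"soil_moisture": [0.31, 0.29, 0.30]}, index=dates)
    for basin in basins
}
additional_features: List[Dict[str, pd.DataFrame]] = [soil_moisture]

# Basin id -> integer, the shape of mapping expected by id_to_int.
id_to_int: Dict[str, int] = {basin: i for i, basin in enumerate(basins)}
```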

neuralhydrology.datasetzoo.genericdataset.load_attributes(data_dir: Path, basins: List[str] = None) → pandas.DataFrame

Load static attributes.

Parameters:
  • data_dir (Path) – Path to the data directory. This folder must contain an ‘attributes’ folder with one or multiple csv files.

  • basins (List[str], optional) – If passed, return only attributes for the basins specified in this list. Otherwise, the attributes of all basins are returned.

Returns:

Basin-indexed DataFrame, containing the attributes as columns. If the attributes folder contains multiple files, they will be concatenated as follows:

  1. If the intersection of basins is non-empty, the files’ attributes are concatenated for the intersection of basins. The intersection of attributes must be empty in this case.

  2. If the intersection of basins is empty but the intersection of attributes is not, the files’ basins are concatenated for the intersection of attributes.

In all other cases, a ValueError is raised.
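
The two concatenation cases can be sketched with pandas as shown below. This illustrates the documented rules only and is not the library's implementation; the function name combine_attribute_frames is hypothetical.

```python
from typing import List

import pandas as pd


def combine_attribute_frames(frames: List[pd.DataFrame]) -> pd.DataFrame:
    """Combine basin-indexed attribute frames following the two documented cases.

    Hypothetical sketch, not the neuralhydrology implementation.
    """
    if len(frames) == 1:
        return frames[0]

    # Basins (index) and attributes (columns) shared by all frames.
    basin_overlap = set(frames[0].index)
    attr_overlap = set(frames[0].columns)
    for df in frames[1:]:
        basin_overlap &= set(df.index)
        attr_overlap &= set(df.columns)

    if basin_overlap and not attr_overlap:
        # Case 1: files split by feature groups. Join columns, keep shared basins.
        return pd.concat(frames, axis=1, join="inner")
    if attr_overlap and not basin_overlap:
        # Case 2: files split by basin groups. Stack rows, keep shared attributes.
        return pd.concat(frames, axis=0, join="inner")
    raise ValueError("Attribute files must overlap in basins or in attributes, but not both.")
```

In both cases join="inner" restricts the result to the intersection along the other axis, which matches the "for the intersection of" wording above.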

Return type:

pandas.DataFrame

Raises:
  • FileNotFoundError – If the attributes folder is not found or does not contain any csv files.

  • ValueError – If an attributes file contains duplicate basin or attribute names, multiple files are found that have no overlap, or there are no attributes for a basin specified in basins.

neuralhydrology.datasetzoo.genericdataset.load_timeseries(data_dir: Path, basin: str) → pandas.DataFrame

Load time series data from netCDF files into a pandas DataFrame.

Parameters:
  • data_dir (Path) – Path to the data directory. This folder must contain a folder called ‘time_series’ containing the time series data for each basin as a single time-indexed netCDF file called ‘<basin_id>.nc/nc4’.

  • basin (str) – The basin identifier.

Returns:

Time-indexed DataFrame containing the time series data as stored in the netCDF file.

Return type:

pandas.DataFrame

Raises:
  • FileNotFoundError – If no netCDF file exists for the specified basin.

  • ValueError – If more than one netCDF file is found for the specified basin.
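
The file lookup and the two error conditions listed above can be sketched with the standard library alone. The helper name find_basin_file is hypothetical, and the actual lookup in neuralhydrology may differ.

```python
from pathlib import Path


def find_basin_file(data_dir: Path, basin: str) -> Path:
    """Locate the single netCDF file of a basin inside data_dir/time_series.

    Hypothetical helper mirroring the documented error conditions.
    """
    ts_dir = data_dir / "time_series"
    # A basin file may carry either the .nc or the .nc4 extension.
    candidates = [p for p in (ts_dir / f"{basin}.nc", ts_dir / f"{basin}.nc4") if p.is_file()]

    if not candidates:
        raise FileNotFoundError(f"No netCDF file found for basin {basin} in {ts_dir}")
    if len(candidates) > 1:
        raise ValueError(f"More than one netCDF file found for basin {basin}: {candidates}")
    return candidates[0]
```

The returned path could then be read with xarray.open_dataset(path).to_dataframe(), assuming xarray is installed, to obtain a time-indexed DataFrame.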