Caravan

class neuralhydrology.datasetzoo.caravan.Caravan(cfg: Config, is_train: bool, period: str, basin: str = None, additional_features: List[Dict[str, pandas.DataFrame]] = [], id_to_int: Dict[str, int] = {}, scaler: Dict[str, pandas.Series | xarray.DataArray] = {})

Bases: BaseDataset

Data set class for the Caravan data set by [1].

Parameters:
  • cfg (Config) – The run configuration.

  • is_train (bool) – Defines if the dataset is used for training or evaluating. If True (training), means/stds for each feature are computed and stored to the run directory. If one-hot encoding is used, the mapping for the one-hot encoding is created and also stored to disk. If False, a scaler input is expected and similarly the id_to_int input if one-hot encoding is used.

  • period ({'train', 'validation', 'test'}) – Defines the period for which the data will be loaded.

  • basin (str, optional) – If passed, the data for only this basin will be loaded. Otherwise the basin(s) are read from the appropriate basin file, corresponding to the period.

  • additional_features (List[Dict[str, pd.DataFrame]], optional) – List of dictionaries, each mapping from a basin id to a pandas DataFrame. Each DataFrame is added to the data loaded from the dataset, and all of its columns are available as ‘dynamic_inputs’, ‘evolving_attributes’ and ‘target_variables’.

  • id_to_int (Dict[str, int], optional) – Required if the config argument ‘use_basin_id_encoding’ is True and period is either ‘validation’ or ‘test’. It is a dictionary, mapping from basin id to an integer (the one-hot encoding).

  • scaler (Dict[str, Union[pd.Series, xarray.DataArray]], optional) – If period is either ‘validation’ or ‘test’, this input is required. It contains the centering and scaling for each feature and is stored to the run directory during training (train_data/train_data_scaler.yml).
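
The rules above for when scaler and id_to_int are required can be summarised in a small check. This is a minimal sketch, not the library's code: the helper name missing_eval_inputs is hypothetical, and it simply encodes the conditions stated in the parameter descriptions.

```python
from typing import Dict, List, Optional


def missing_eval_inputs(period: str,
                        use_basin_id_encoding: bool,
                        scaler: Optional[Dict] = None,
                        id_to_int: Optional[Dict[str, int]] = None) -> List[str]:
    """Hypothetical helper: name the documented inputs that are required but absent."""
    if period not in ("train", "validation", "test"):
        raise ValueError(f"Unknown period: {period!r}")
    missing = []
    # During training the scaler and the basin-id mapping are created, not passed in.
    if period in ("validation", "test"):
        if not scaler:
            missing.append("scaler")
        # The mapping is only needed when one-hot basin encoding is enabled.
        if use_basin_id_encoding and not id_to_int:
            missing.append("id_to_int")
    return missing


print(missing_eval_inputs("train", use_basin_id_encoding=True))
print(missing_eval_inputs("test", use_basin_id_encoding=True))
```

With these rules, the training call reports nothing missing, while the test call reports both scaler and id_to_int.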

References

neuralhydrology.datasetzoo.caravan.load_caravan_attributes(data_dir: Path, basins: List[str] | None = None, subdataset: str | None = None) → pandas.DataFrame

Load the attributes of the Caravan dataset.

Parameters:
  • data_dir (Path) – Path to the root directory of Caravan, which has to include a sub-directory called ‘attributes’ that contains the attributes of all sub-datasets in separate folders.

  • basins (List[str], optional) – If passed, returns only attributes for the basins specified in this list. Otherwise, the attributes of all basins are returned.

  • subdataset (str, optional) – If passed, returns only the attributes of one sub-dataset. Otherwise, the attributes of all sub-datasets are loaded.

Raises:
  • FileNotFoundError – If the requested sub-dataset does not exist or any sub-dataset for the requested basins is missing.

  • ValueError – If any of the requested basins does not exist in the attribute files, or if both basins and subdataset are passed but at least one of the basins is not part of the corresponding sub-dataset.

Returns:

A basin-indexed DataFrame with all attributes as columns.

Return type:

pd.DataFrame
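
The consistency rule between basins and subdataset described under Raises can be illustrated as follows. This is a sketch under stated assumptions, not the function's implementation: check_basins_in_subdataset is a hypothetical helper, it assumes the sub-dataset name is the lower-cased part of the basin id before the first underscore, and the basin ids are illustrative.

```python
from typing import List, Optional


def check_basins_in_subdataset(basins: List[str],
                               subdataset: Optional[str] = None) -> List[str]:
    """Hypothetical check: return the sorted sub-dataset names the basins belong to."""
    # Assumption: ids look like {subdataset_name}_{gauge_id}.
    prefixes = {basin: basin.split("_")[0].lower() for basin in basins}
    if subdataset is not None:
        outsiders = [b for b, name in prefixes.items() if name != subdataset.lower()]
        if outsiders:
            raise ValueError(
                f"Basins not part of sub-dataset {subdataset!r}: {outsiders}")
    return sorted(set(prefixes.values()))


print(check_basins_in_subdataset(["camels_01013500", "hysets_01AD002"]))

try:
    check_basins_in_subdataset(["camels_01013500", "hysets_01AD002"],
                               subdataset="camels")
except ValueError as err:
    print(f"ValueError: {err}")
```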

neuralhydrology.datasetzoo.caravan.load_caravan_timeseries(data_dir: Path, basin: str, filetype: str = 'netcdf') → pandas.DataFrame

Loads the timeseries data of one basin from the Caravan dataset.

Parameters:
  • data_dir (Path) – Path to the root directory of Caravan that has to include a sub-directory called ‘timeseries’. This sub-directory has to contain another sub-directory called either ‘csv’ or ‘netcdf’, depending on the choice of the filetype argument. By default, netCDF files are loaded from the ‘netcdf’ subdirectory.

  • basin (str) – The Caravan gauge id string in the form of {subdataset_name}_{gauge_id}.

  • filetype (str, optional) – Either ‘csv’ or ‘netcdf’ (default). Determines whether the timeseries data is loaded from the CSV files or the netCDF files.

Raises:
  • ValueError – If filetype is not in [‘csv’, ‘netcdf’].

  • FileNotFoundError – If no timeseries file exists for the basin.
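
The two error conditions can be sketched with a small path resolver. This is an illustration only, not the library's implementation: resolve_timeseries_path is a hypothetical helper, and the layout timeseries/{filetype}/{subdataset_name}/{basin}.nc (or .csv) is an assumption beyond what is documented above, as is the illustrative basin id.

```python
import tempfile
from pathlib import Path


def resolve_timeseries_path(data_dir: Path, basin: str, filetype: str = "netcdf") -> Path:
    """Hypothetical helper: locate the timeseries file of one basin."""
    if filetype not in ("csv", "netcdf"):
        raise ValueError(f"filetype must be 'csv' or 'netcdf', got {filetype!r}")
    # Assumption: the sub-dataset name is the part of the id before the first underscore.
    subdataset_name = basin.split("_")[0].lower()
    suffix = ".nc" if filetype == "netcdf" else ".csv"
    path = data_dir / "timeseries" / filetype / subdataset_name / f"{basin}{suffix}"
    if not path.is_file():
        raise FileNotFoundError(f"No timeseries file for basin {basin!r} at {path}")
    return path


# Build a throwaway directory tree with one dummy CSV file to exercise the helper.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    dummy = root / "timeseries" / "csv" / "camels" / "camels_01013500.csv"
    dummy.parent.mkdir(parents=True)
    dummy.write_text("date\n")
    print(resolve_timeseries_path(root, "camels_01013500", filetype="csv").name)
```

Validating filetype before touching the file system means a misspelled file type surfaces as a ValueError rather than as a misleading FileNotFoundError.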