LamaH

class neuralhydrology.datasetzoo.lamah.LamaH(cfg: Config, is_train: bool, period: str, basin: str = None, additional_features: List[Dict[str, pandas.DataFrame]] = [], id_to_int: Dict[str, int] = {}, scaler: Dict[str, pandas.Series | xarray.DataArray] = {})

Bases: BaseDataset

Data set class for the LamaH-CE dataset by [1].

The LamaH-CE dataset consists of three different catchment delineations, each with dedicated forcing time series and catchment attributes. These sub-datasets are stored in the folders ‘A_basins_total_upstrm’, ‘B_basins_intermediate_all’, and ‘C_basins_intermediate_lowimp’. To select one of them, set the config argument dataset to lamah_a, lamah_b, or lamah_c, respectively.

If you download the full dataset, each of these sub-datasets, as well as the streamflow data, comes at hourly and daily resolution. Based on the config argument use_frequencies, this dataset class will load daily data (for daily resolutions or lower) or hourly data (for all temporal resolutions higher than daily). If nothing is specified in use_frequencies, daily data is loaded by default.

Note that discharge data in the LamaH dataset is provided in m³ s⁻¹. This dataset class converts discharge to mm d⁻¹ (for daily data) or mm h⁻¹ (for hourly data), using the ‘area_gov’ attribute provided in the attributes file.
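The unit conversion described above can be sketched as follows. This is a minimal illustration rather than the library's implementation: the function name `discharge_to_runoff_depth` is hypothetical, and it assumes that ‘area_gov’ is given in km².

```python
SECONDS_PER_DAY = 86_400
SECONDS_PER_HOUR = 3_600


def discharge_to_runoff_depth(q_m3s: float, area_km2: float, hourly: bool = False) -> float:
    """Convert discharge in m^3/s to runoff depth in mm per day (or mm per hour).

    Hypothetical helper; assumes the catchment area is given in km^2.
    """
    if area_km2 <= 0:
        raise ValueError("area_km2 must be positive")
    seconds = SECONDS_PER_HOUR if hourly else SECONDS_PER_DAY
    volume_m3 = q_m3s * seconds              # water volume passing the gauge per time step
    depth_m = volume_m3 / (area_km2 * 1e6)   # spread over the catchment area (km^2 -> m^2)
    return depth_m * 1000.0                  # metres -> millimetres


# Example with made-up numbers: 10 m^3/s from a 100 km^2 catchment.
daily_mm = discharge_to_runoff_depth(10.0, 100.0)
hourly_mm = discharge_to_runoff_depth(10.0, 100.0, hourly=True)
```

Under this assumption the conversion reduces to multiplying by 86.4 / area for daily data and by 3.6 / area for hourly data.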

Parameters:
  • cfg (Config) – The run configuration.

  • is_train (bool) – Defines if the dataset is used for training or evaluating. If True (training), means/stds for each feature are computed and stored to the run directory. If one-hot encoding is used, the mapping for the one-hot encoding is created and also stored to disk. If False, a scaler input is expected and similarly the id_to_int input if one-hot encoding is used.

  • period ({'train', 'validation', 'test'}) – Defines the period for which the data will be loaded.

  • basin (str, optional) – If passed, the data for only this basin will be loaded. Otherwise the basin(s) are read from the appropriate basin file, corresponding to the period.

  • additional_features (List[Dict[str, pd.DataFrame]], optional) – List of dictionaries, mapping from a basin id to a pandas DataFrame. This DataFrame will be added to the data loaded from the dataset, and all columns are available as ‘dynamic_inputs’, ‘evolving_attributes’ and ‘target_variables’

  • id_to_int (Dict[str, int], optional) – Required if the config argument ‘use_basin_id_encoding’ is True and period is either ‘validation’ or ‘test’. It is a dictionary, mapping from basin id to an integer (the one-hot encoding).

  • scaler (Dict[str, Union[pd.Series, xarray.DataArray]], optional) – If period is either ‘validation’ or ‘test’, this input is required. It contains the centering and scaling for each feature and is stored to the run directory during training (train_data/train_data_scaler.yml).
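The relationship between is_train, scaler, and id_to_int can be illustrated with a plain-Python sketch: statistics and the basin mapping are derived once from the training data and then passed in unchanged for validation and test. All function names here are hypothetical, the dictionary layout is an assumption, and the population standard deviation is an arbitrary choice for the illustration; the actual on-disk format in the run directory may differ.

```python
from statistics import fmean, pstdev


def fit_scaler(train_values):
    """Compute centering and scaling from training data only (hypothetical helper)."""
    return {"center": fmean(train_values), "scale": pstdev(train_values)}


def apply_scaler(values, scaler):
    """Normalize values with statistics computed during training.

    Note: this sketch does not guard against a zero scale (constant feature).
    """
    return [(v - scaler["center"]) / scaler["scale"] for v in values]


def build_id_to_int(basins):
    """Map each basin id to an integer index, as used for one-hot encoding."""
    return {basin: index for index, basin in enumerate(basins)}


# Training: derive the scaler and the basin mapping (made-up values and ids).
scaler = fit_scaler([1.0, 2.0, 3.0])
id_to_int = build_id_to_int(["399", "200"])

# Validation/test: reuse both instead of recomputing them.
normalized = apply_scaler([2.0], scaler)
```

Reusing the training statistics keeps evaluation inputs on the same scale the model saw during training.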

References

neuralhydrology.datasetzoo.lamah.load_lamah_attributes(data_dir: Path, sub_dataset: str, basins: List[str] = []) → pandas.DataFrame

Load LamaH catchment attributes.

Parameters:
  • data_dir (Path) – Path to the LamaH-CE directory.

  • sub_dataset (str) – One of {‘lamah_a’, ‘lamah_b’, ‘lamah_c’}, defining which of the three catchment delineations/sub-datasets (A_basins_total_upstrm, B_basins_intermediate_all, or C_basins_intermediate_lowimp) will be loaded.

  • basins (List[str], optional) – If passed, return only attributes for the basins specified in this list. Otherwise, the attributes of all basins are returned.

Returns:

Basin-indexed DataFrame, containing the attributes of the sub-dataset as well as the gauge attributes.

Return type:

pd.DataFrame

Raises:

ValueError – If any of the basin ids is not in the basin index.
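The documented behaviour of the basins argument (return everything when it is empty, raise ValueError for unknown ids) can be sketched with a plain dictionary standing in for the basin-indexed DataFrame. The helper `select_basins`, the basin ids, and the attribute values are made up for illustration.

```python
def select_basins(attributes, basins=None):
    """Return attributes for the requested basins, or for all basins if none are given.

    Hypothetical helper mirroring the documented ValueError for unknown basin ids.
    """
    if not basins:
        return dict(attributes)
    missing = [basin for basin in basins if basin not in attributes]
    if missing:
        raise ValueError(f"Basins not found in the attribute index: {missing}")
    return {basin: attributes[basin] for basin in basins}


# Made-up attribute table keyed by basin id.
attributes = {
    "1": {"area_gov": 100.0},
    "2": {"area_gov": 250.0},
}

subset = select_basins(attributes, ["2"])
everything = select_basins(attributes)
```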

neuralhydrology.datasetzoo.lamah.load_lamah_discharge(data_dir: Path, basin: str, temporal_resolution: str = '1D', normalize_discharge: bool = False) → pandas.DataFrame

Load discharge data of the LamaH data set.

Parameters:
  • data_dir (Path) – Path to the LamaH directory.

  • basin (str) – Basin identifier number as string.

  • temporal_resolution (str, optional) – Defines if either daily (‘1D’, default) or hourly (‘1H’) timeseries data will be loaded.

  • normalize_discharge (bool, optional) – If True, normalizes discharge data by basin area, using the ‘area_gov’ attribute from the attributes file.

Returns:

Time-indexed DataFrame, containing the discharge data.

Return type:

pd.DataFrame

Raises:

ValueError – If ‘temporal_resolution’ is not one of [‘1H’, ‘1D’].
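How a list of requested frequencies could be reduced to one of the two accepted temporal_resolution values is sketched below. This is one interpretation of the rule stated for the dataset class (hourly data as soon as any requested frequency is finer than daily, daily data otherwise or when nothing is specified), not the library's code. The helpers `to_seconds` and `pick_temporal_resolution` are hypothetical, and the parser only understands a small subset of frequency strings.

```python
import re

# Seconds per unit for the few units this sketch understands (assumption).
_UNIT_SECONDS = {"min": 60, "h": 3_600, "d": 86_400, "w": 604_800}


def to_seconds(freq: str) -> int:
    """Parse strings such as '1D', '6H' or '30min' into seconds."""
    match = re.fullmatch(r"(\d*)([A-Za-z]+)", freq.strip())
    if match is None:
        raise ValueError(f"Cannot parse frequency: {freq!r}")
    count, unit = match.groups()
    unit = unit.lower()
    if unit not in _UNIT_SECONDS:
        raise ValueError(f"Unsupported frequency unit in {freq!r}")
    return int(count or 1) * _UNIT_SECONDS[unit]


def pick_temporal_resolution(use_frequencies=None) -> str:
    """Return '1H' if any requested frequency is finer than daily, else '1D'."""
    if not use_frequencies:
        return "1D"  # daily data is the documented default
    finest = min(to_seconds(freq) for freq in use_frequencies)
    return "1H" if finest < _UNIT_SECONDS["d"] else "1D"
```

For example, `pick_temporal_resolution(["1D", "1H"])` selects hourly data, while `pick_temporal_resolution(["7D"])` selects daily data.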

neuralhydrology.datasetzoo.lamah.load_lamah_forcing(data_dir: Path, basin: str, sub_dataset: str, temporal_resolution: str = '1D') → pandas.DataFrame

Load forcing data of the LamaH data set.

Parameters:
  • data_dir (Path) – Path to the LamaH directory.

  • basin (str) – Basin identifier number as string.

  • sub_dataset (str) – One of {‘lamah_a’, ‘lamah_b’, ‘lamah_c’}, defining which of the three catchment delineations/sub-datasets (A_basins_total_upstrm, B_basins_intermediate_all, or C_basins_intermediate_lowimp) will be loaded.

  • temporal_resolution (str, optional) – Defines if either daily (‘1D’, default) or hourly (‘1H’) timeseries data will be loaded.

Returns:

Time-indexed DataFrame, containing the forcings data.

Return type:

pd.DataFrame

Raises:
  • ValueError – If ‘sub_dataset’ is not one of {‘lamah_a’, ‘lamah_b’, ‘lamah_c’}.

  • ValueError – If ‘temporal_resolution’ is not one of [‘1H’, ‘1D’].
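The mapping from the sub_dataset argument to the corresponding top-level folder, including the documented ValueError for unknown names, can be sketched as below. The helper `sub_dataset_dir` and the example path are hypothetical; only the folder names and the accepted keys come from the description above, and the layout beneath each folder is deliberately left out.

```python
from pathlib import Path

# Accepted sub-dataset names and their folders, as listed in the class description.
_SUB_DATASET_FOLDERS = {
    "lamah_a": "A_basins_total_upstrm",
    "lamah_b": "B_basins_intermediate_all",
    "lamah_c": "C_basins_intermediate_lowimp",
}


def sub_dataset_dir(data_dir: Path, sub_dataset: str) -> Path:
    """Resolve the folder of a LamaH-CE sub-dataset (hypothetical helper)."""
    if sub_dataset not in _SUB_DATASET_FOLDERS:
        raise ValueError(
            f"Unknown sub_dataset {sub_dataset!r}; expected one of {sorted(_SUB_DATASET_FOLDERS)}"
        )
    return Path(data_dir) / _SUB_DATASET_FOLDERS[sub_dataset]


# Example with a made-up data directory.
folder = sub_dataset_dir(Path("/data/LamaH-CE"), "lamah_b")
```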