Rainfall–Runoff Prediction at Multiple Timescales with a Single Long Short-Term Memory Network
Abstract
Long Short-Term Memory (LSTM) networks have been applied to daily discharge prediction with remarkable success.
Many practical scenarios, however, require predictions at more granular timescales.
For instance, accurate prediction of short but extreme flood peaks can make a life-saving difference, yet such peaks may escape the coarse temporal resolution of daily predictions.
Naively training an LSTM on hourly data, however, entails very long input sequences that make learning hard and computationally expensive.
In this study, we propose two Multi-Timescale LSTM (MTS-LSTM) architectures that jointly predict multiple timescales within one model: they process long-past inputs at a single temporal resolution and branch out into each individual timescale for more recent input steps.
We test these models on 516 basins across the continental United States and benchmark against the US National Water Model.
Compared to naive prediction with a distinct LSTM per timescale, the multi-timescale architectures are computationally more efficient with no loss in accuracy.
Beyond prediction quality, the multi-timescale LSTM can process different input variables at different timescales, which is especially relevant to operational applications where the lead time of meteorological forcings depends on their temporal resolution.
Martin Gauch¹,², Frederik Kratzert¹, Daniel Klotz¹, Grey Nearing³, Jimmy Lin², Sepp Hochreiter¹
¹ Institute for Machine Learning, Johannes Kepler University Linz, Linz, Austria
² David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, Canada
³ Google Research, Mountain View, CA, USA
Correspondence: Martin Gauch (gauch@ml.jku.at)
Introduction
Rainfall–runoff modeling approaches based on deep learning—particularly Long Short-Term Memory (LSTM) networks—have proven highly successful in a number of studies.
LSTMs can predict multiple catchments using a single model and yield more accurate predictions than state-of-the-art process-based models in a variety of benchmarks (Kratzert2019).
Different applications require hydrologic information at different timescales.
For example, hydropower operators might care about daily or weekly (or longer) inputs into their reservoirs, while flood forecasting requires subdaily predictions.
Yet, much of the work in applying deep learning to streamflow prediction has been at the daily timescale.
Daily forecasts make sense for medium- to long-range horizons; however, daily input resolution mutes diurnal variations in processes such as evapotranspiration and snowmelt that can cause important variations in discharge.
Thus, daily predictions are often too coarse to provide actionable information for short-range forecasts.
For example, in the event of flooding, the distinction between moderate discharge spread across the day and the same amount of water compressed into a few hours of flash flooding may pose a life-threatening difference.
Because of this, operational hydrologic models often operate at multiple timescales using several independent setups of a traditional, process-based rainfall–runoff model.
For instance, the US National Oceanic and Atmospheric Administration’s (NOAA) National Water Model (NWM) produces hourly short-range forecasts every hour, as well as three- and six-hourly medium- to long-range forecasts every six hours.
The issue of multiple input and output timescales is well-known in the field of machine learning (we refer readers with a machine learning or data science background to Gauch2020 for a general introduction to rainfall–runoff modeling).
The architectural flexibility of recurrent neural models allows for approaches that jointly process the different timescales in a hierarchical fashion.
Techniques to “divide and conquer” long sequences through hierarchical processing date back decades (e.g., Schmidhuber1991, Mozer1991).
More recently, Koutnik2014 proposed a “clockwork” architecture that partitions a recurrent neural network into layers with individual clock speeds, where each layer is updated at its own frequency.
This way, lower-frequency layers allow the network to learn longer-term dependencies.
Even in such a hierarchical approach, however, the high-frequency neurons must process the full time series, which makes training slow.
Neil2016 extended an LSTM to process irregularly sampled inputs by means of a time gate that only attends to the input at steps of a learned frequency.
This helps discriminate overlaid input signals, but is likely unsuited to rainfall–runoff prediction because it has no means of aggregating inputs while the time gate is closed.
Chung2016 demonstrated how hierarchical processing helps LSTMs to translate sentences and recognize handwriting, but the approach depends on a binary decision that is only differentiable through a workaround.
Most of these models were designed for tasks like natural language processing and other nonphysical applications.
Unlike these tasks, time series in rainfall–runoff modeling have regular frequencies with fixed translation factors (e.g., one day always has 24 hours), whereas words in natural language or strokes in handwriting vary in their length.
The application area of Araya2019 was closer in this respect—they predicted wind speed given input data at multiple timescales.
But, like the aforementioned language and handwriting applications, the objective was to predict a single time series—be it sentences, strokes, or wind speed.
Our objective encompasses multiple outputs, one for each target timescale.
Hence, multitimescale rainfall–runoff prediction has similarities to multiobjective optimization.
In our case, the different objectives are closely related, since aggregation of discharge across time steps should be conservative:
For instance, every 24 hourly prediction steps should average (or sum, depending on the units) to one daily prediction step.
Rather than viewing the problem from a multi-objective perspective, Lechner2020odelstm modeled time-continuous functions with ODE-LSTMs, a method that combines LSTMs with a mixture of ordinary differential equations and recurrent neural networks (ODE-RNN).
The resulting models can generate continuous predictions at arbitrary granularity.
Initially, this seems like a highly promising approach; however, it has several drawbacks in our application:
First, since one model generates predictions for all timescales, one cannot easily use different forcing products for different target timescales.
Second, ODELSTMs were originally intended for scenarios where the input data arrives in irregular intervals.
In our context, the opposite is true: the meteorological forcings have fixed frequencies and are therefore highly regular.
Also, for practical purposes we do not actually need predictions at arbitrary granularity—a fixed set of target timescales is sufficient.
Lastly, in our exploratory experiments, using ODE-LSTMs to predict at timescales that were not part of training was actually worse (and much slower) than (dis)aggregating the fixed-timescale predictions of our multi-timescale LSTM.
For these reasons, we excluded ODE-LSTMs from the main evaluation in this study (nevertheless, Appendix C details some of our exploratory ODE-LSTM experiments).
In this paper, we show how LSTM-based architectures can jointly predict discharge at multiple timescales in one model.
We make the following contributions:

First, we outline two LSTM architectures that predict discharge at multiple timescales (Sect. 1.3).
We capitalize on the fact that watersheds are damped systems: while the history of total mass and energy inputs is important, the impact of high-frequency variation becomes less important at long lead times.
Our approach to providing multiple output timescales processes long-past input data at coarser temporal resolution than recent time steps.
This shortens the input sequences, since highresolution inputs are only necessary for the last few time steps.
We benchmark the daily and hourly predictions of these architectures against (i) a naive solution that trains individual LSTMs for each timescale and (ii) a traditional hydrologic model, the US National Water Model (Sect. 2.1).
Our results show that all LSTM solutions predict at significantly higher Nash–Sutcliffe efficiency than NWM at all timescales.
While there is little difference in accuracy among the LSTMs, the naive model has much higher computational overhead than the multi-timescale LSTMs.
Second, we propose a regularization scheme that reduces inconsistencies across timescales as they arise from naive, per-timescale prediction (Sect. 1.3.2).
According to our results, the regularization not only reduces inconsistencies but also results in slightly improved predictions overall (Sect. 2.2). 
Third, we demonstrate that a multi-timescale LSTM can ingest individual and multiple sets of forcings for each target timescale, which closely resembles operational forecasting use cases where forcings with high temporal resolution often have shorter lead times than forcings with low resolution (Sect. 2.4).
1 Data and Methods
1.1 Data
To maintain continuity and comparability, we conducted our experiments as closely as possible to previous benchmarking studies (Newman2017; Kratzert2019) on the CAMELS dataset (Addor2017).
Out of the 531 CAMELS basins used by previous studies, 516 basins have hourly stream gauge data available from the USGS National Water Information System through the Instantaneous Values REST API.
Since our forcing data and benchmark model data use UTC timestamps, we converted the USGS streamflow timestamps to UTC.
While CAMELS provides only daily meteorological forcing data, we needed hourly forcings.
To maintain some congruence with previous CAMELS experiments, we used the hourly NLDAS-2 product, which contains meteorological data since 1979 (Xia2012).
We spatially averaged the forcing variables listed in Table 1 for each basin.
Additionally, we averaged these basinspecific hourly meteorological variables to daily values.
Variable  Units
Total precipitation  kg/m²
Air temperature  K
Surface pressure  Pa
Surface downward longwave radiation  W/m²
Surface downward shortwave radiation  W/m²
Specific humidity  kg/kg
Potential energy  J/kg
Potential evaporation  kg/m²
Convective fraction  –
U wind component  m/s
V wind component  m/s
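As a concrete sketch of the aggregation step described above (illustrative array shapes, not the authors' pipeline), basin-averaged hourly forcings can be averaged to daily values by grouping each day's 24 rows:

```python
import numpy as np

# Hypothetical example: one year of basin-averaged hourly forcings with
# 11 variables (one row per hour). A day spans 24 consecutive rows, so
# reshaping to (days, 24, variables) and averaging over the hour axis
# yields daily values.
rng = np.random.default_rng(0)
hourly = rng.random((365 * 24, 11))
daily = hourly.reshape(365, 24, 11).mean(axis=1)
```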
We trained our models on the period from October 1, 1990 to September 30, 2003.
The period from October 1, 2003 to September 30, 2008 was used as a validation period, where we evaluated different architectures and selected suitable model hyperparameters.
Finally, we tested our models on data from October 1, 2008 to September 30, 2018.
1.2 Benchmark Models
We used two groups of models as baselines for comparison with the proposed architectures: the LSTM proposed by Kratzert2019, naively adapted to hourly streamflow modeling, and the US National Water Model (NWM), a traditional hydrologic model used operationally by the National Oceanic and Atmospheric Administration.
Traditional Hydrologic Model
The US National Oceanic and Atmospheric Administration (NOAA) generates hourly streamflow predictions with the National Water Model, which is a process-based model based on WRF-Hydro (Gochis2020).
Specifically, we benchmarked against the NWM v2 reanalysis product, which includes hourly streamflow predictions for the years 1993 through 2018.
Naive LSTM
Long Short-Term Memory (LSTM) networks (Hochreiter1997) are a flavor of recurrent neural networks designed to model long-term dependencies between input and output data.
LSTMs maintain an internal memory state that is updated at each time step by a set of learned functions called gates.
These gates control the input–state relationship (through the input gate), the state–output relationship (through the output gate), and the memory timescales (through the forget gate).
Figure 1 illustrates this architecture. For a more detailed description of LSTMs, especially in the context of rainfall–runoff modeling, we refer to Kratzert2018.
LSTMs can cope with longer time series than classic recurrent neural networks because they are not susceptible to vanishing gradients during the training procedure (Bengio1994; Hochreiter1997).
Since LSTMs process input sequences sequentially, longer time series result in longer training and inference time.
For daily predictions, this is not a problem, since lookback windows of 365 days appear to be sufficient for most basins, at least in the CAMELS dataset.
Therefore, our baseline for daily predictions is the LSTM model from Kratzert2019, which was trained on daily data with an input sequence length of 365 days.
For hourly data, even half of a year corresponds to more than 4300 time steps, which results in very long training and inference runtime, as we will detail in Sect. 2.3.
In addition to the computational overhead, the LSTM forget gate makes it hard to learn long-term dependencies, because it effectively reintroduces vanishing gradients into the LSTM (Jozefowicz2015).
Yet, we cannot simply remove the forget gate—both empirical LSTM analyses (Jozefowicz2015; Greff2017) and our exploratory experiments showed that this deteriorates results.
To address this, Gers1999 proposed to initialize the bias of the forget gate to a small positive value (we used 3).
This starts training with an open gate and enables gradient flow across more time steps.
We used this bias initialization trick for all our LSTM models, and it allowed us to include the LSTM with hourly inputs as the naive hourly baseline for our proposed models.
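In PyTorch, for example, this bias initialization can be sketched as follows (an illustrative snippet, not the authors' code; PyTorch packs the gate parameters in the order input, forget, cell, output, and sums `bias_ih` and `bias_hh`):

```python
import torch
from torch import nn

def init_forget_gate_bias(lstm: nn.LSTM, value: float = 3.0) -> None:
    """Set the effective forget gate bias to `value` so training starts
    with an open gate and gradients flow across more time steps."""
    h = lstm.hidden_size
    for name in ("bias_ih_l0", "bias_hh_l0"):
        # The forget gate occupies the second of the four packed gate
        # slices; the two bias vectors are summed, so each gets half.
        getattr(lstm, name).data[h:2 * h] = value / 2.0

lstm = nn.LSTM(input_size=11, hidden_size=64, batch_first=True)
init_forget_gate_bias(lstm)
```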
The architecture for this naive benchmark is identical to the daily LSTM, except that we ingested input sequences of 4320 hours (180 days).
Further, we tuned the learning rate and batch size for the naive hourly LSTM, since it receives 24 times as many samples as the daily LSTM.
The extremely slow training impedes a more extensive hyperparameter search.
Appendix D details the grid of hyperparameters we evaluated to find a suitable configuration, as well as further details on the final hyperparameters.
1.3 Using LSTMs to Predict Multiple Timescales
We evaluated two different LSTM architectures that are capable of simultaneous predictions at multiple timescales.
For the sake of simplicity, the following explanations use the example of a twotimescale model that generates daily and hourly predictions.
Nevertheless, the architectures we describe here generalize to other timescales and to more than two timescales, as we will show in an experiment in Sect. 2.5.
The first model, the shared multi-timescale LSTM (sMTS-LSTM), is a simple extension of the naive approach: We generate a daily prediction as usual—the LSTM ingests an input sequence at daily resolution and outputs a prediction at the last time step (i.e., sequence-to-one prediction).
Next, we reset the hidden and cell states to their values from the time step where the hourly input sequence begins and ingest the hourly input sequence to generate 24 hourly predictions that correspond to the last daily prediction.
In other words, during each prediction step, we perform two forward passes through the same LSTM: one that generates a daily prediction and one that generates 24 corresponding hourly predictions.
We add a one-hot timescale encoding to the input sequence such that the LSTM can distinguish daily from hourly inputs.
The key insight with this model is that the hourly forward pass starts with LSTM states from the daily forward pass, which act as a summary of long-term information up to that point.
In effect, the LSTM has access to a large lookback window but, unlike the naive hourly LSTM, it does not suffer from the performance impact of extremely long input sequences.
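A minimal PyTorch sketch of this two-pass procedure (illustrative only: tensor shapes, the `hidden` size, and the split at the last 14 days are assumptions, not the authors' implementation):

```python
import torch
from torch import nn

class SharedMTSLSTM(nn.Module):
    """Sketch of the sMTS-LSTM: one shared LSTM processes the daily inputs
    and then, restarted from the states at the split step, the hourly inputs."""

    def __init__(self, n_inputs: int, hidden: int = 64):
        super().__init__()
        # +2 input features for the one-hot timescale encoding.
        self.lstm = nn.LSTM(n_inputs + 2, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x_daily, x_hourly):
        batch = x_daily.shape[0]
        split = x_hourly.shape[1] // 24  # daily steps covered by the hourly window
        tag = lambda x, hot: torch.cat(
            [x, torch.eye(2)[hot].expand(batch, x.shape[1], 2)], dim=-1)
        # Daily pass up to the split step; (h, c) summarize the long past.
        _, (h, c) = self.lstm(tag(x_daily[:, :-split], 0))
        out_d, _ = self.lstm(tag(x_daily[:, -split:], 0), (h, c))
        y_daily = self.head(out_d[:, -1]).squeeze(-1)     # sequence-to-one
        # Hourly pass restarts from the same states (shared weights).
        out_h, _ = self.lstm(tag(x_hourly, 1), (h, c))
        y_hourly = self.head(out_h[:, -24:]).squeeze(-1)  # last day's 24 hours
        return y_daily, y_hourly
```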
The second architecture, illustrated in Fig. 2, is a more general variant of the sMTS-LSTM that is specifically built for multi-timescale predictions; hence, we call it the multi-timescale LSTM (MTS-LSTM).
It works just like the shared version but splits the LSTM into two branches, one for each timescale.
We first generate a prediction with an LSTM acting at the coarsest timescale (here: daily) using a full input sequence (e.g., 365 days).
Next, we reuse the daily hidden and cell states from the split step as the initial states for an LSTM at a finer timescale (here: hourly), which generates the corresponding 24 hourly predictions.
Since the two LSTM branches may have different hidden sizes, we feed the states through a linear state transfer layer before reusing them as initial hourly states.
In this setup, each LSTM branch only receives inputs of its respective timescale; hence, we do not need to one-hot encode the timescale.
This architecture makes it clear why we call the other variant the “shared” MTS-LSTM: Effectively, the sMTS-LSTM is an ablation of the MTS-LSTM.
Both variants have the same architecture, but the weights of the sMTS-LSTM are shared across all per-timescale branches and its state transfer layers are identity operations.
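The branched forward pass with linear state transfer can be sketched in PyTorch like this (again illustrative: hidden sizes and shapes are assumptions, not the authors' code):

```python
import torch
from torch import nn

class MTSLSTM(nn.Module):
    """Sketch of the MTS-LSTM: a daily branch summarizes the long past;
    linear transfer layers map its states to an hourly branch, which may
    have a different hidden size and different input variables."""

    def __init__(self, n_daily: int, n_hourly: int,
                 hidden_d: int = 64, hidden_h: int = 32):
        super().__init__()
        self.lstm_d = nn.LSTM(n_daily, hidden_d, batch_first=True)
        self.lstm_h = nn.LSTM(n_hourly, hidden_h, batch_first=True)
        # Linear state transfer layers (hidden and cell states separately).
        self.transfer_h = nn.Linear(hidden_d, hidden_h)
        self.transfer_c = nn.Linear(hidden_d, hidden_h)
        self.head_d = nn.Linear(hidden_d, 1)
        self.head_h = nn.Linear(hidden_h, 1)

    def forward(self, x_daily, x_hourly):
        split = x_hourly.shape[1] // 24  # daily steps covered by the hourly window
        # Daily branch: long past up to the split step, then the recent days.
        _, (h, c) = self.lstm_d(x_daily[:, :-split])
        out_d, _ = self.lstm_d(x_daily[:, -split:], (h, c))
        y_daily = self.head_d(out_d[:, -1]).squeeze(-1)
        # Hourly branch starts from the transferred daily states.
        out_h, _ = self.lstm_h(x_hourly, (self.transfer_h(h), self.transfer_c(c)))
        y_hourly = self.head_h(out_h[:, -24:]).squeeze(-1)
        return y_daily, y_hourly
```

Because the two branches have separate input layers, `n_daily` and `n_hourly` need not match, which is what enables per-timescale forcings.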
1.3.1 Per-Timescale Input Variables
An important advantage of the MTS-LSTM over the sMTS-LSTM architecture arises from its more flexible input dimensionality.
As each timescale is processed in an individual LSTM branch, we can ingest different input variables to predict the different timescales.
This can be a key differentiator in operational applications when, for instance, there exist daily weather forecasts with a much longer lead time than the available hourly forecasts, or when using remote sensing data that is available only at certain overpass frequencies.
In these cases, the MTS-LSTM can process the daily forcings in its daily LSTM branch and the hourly forcings in its hourly LSTM branch.
As such, this per-timescale forcings strategy allows for using different inputs at different timescales.
To evaluate this capability, we used two sets of forcings as daily inputs: the Daymet and Maurer forcing sets that are included in the CAMELS dataset (Newman2014).
For lack of other hourly forcing products, we conducted two experiments: in one, we continued to use only the hourly NLDAS forcings.
In the other, we additionally ingested the corresponding day’s Daymet and Maurer forcings at each hour.
Since the Maurer forcings are only available through 2008, we conducted this experiment on the validation period from October 2003 to September 2008.
1.3.2 Cross-Timescale Consistency
Since the MTS-LSTM and sMTS-LSTM architectures generate predictions at multiple timescales simultaneously, we can incentivize predictions that are consistent across timescales.
Unlike in other domains (e.g., computer vision, Zamir2020), consistency is well-defined in our application: predictions are consistent if the mean of every day’s hourly predictions is the same as that day’s daily prediction.
Hence, we can explicitly state this constraint as a regularization term in our loss function.
Building on the basin-averaged NSE loss from Kratzert2019, our loss function averages the losses from each individual timescale and regularizes with the mean squared difference between daily and day-averaged hourly predictions.
Note that, although we describe the regularization with only two simultaneously predicted timescales, the approach generalizes to more timescales, as we can add the mean squared difference between each timescale and the next-finer timescale.
All taken together, we used the following loss for a two-timescale daily–hourly model:

\mathcal{L} = \frac{1}{2B} \sum_{\tau \in \{D,H\}} \sum_{b=1}^{B} \frac{1}{N_b^{\tau}} \sum_{n=1}^{N_b^{\tau}} \frac{\left(q_n^{b,\tau} - \hat{q}_n^{b,\tau}\right)^2}{\left(\sigma_b + \epsilon\right)^2} + \frac{1}{B} \sum_{b=1}^{B} \frac{1}{N_b^{D}} \sum_{n=1}^{N_b^{D}} \left(\hat{q}_n^{b,D} - \frac{1}{24} \sum_{h=1}^{24} \hat{q}_{n,h}^{b,H}\right)^2    (1)

Here, B is the number of basins, N_b^\tau is the number of samples for basin b at timescale \tau, and q_n^{b,\tau} and \hat{q}_n^{b,\tau} are observed and predicted discharge values for the n-th time step of basin b at timescale \tau. The term (\sigma_b + \epsilon)^2 normalizes by the observed discharge variance of basin b over the whole training period, where \epsilon is a small value that guarantees numeric stability.
We predicted discharge in the same unit at all timescales, so that the daily prediction is compared with an average across 24 hours.
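For a single basin, the loss in Eq. (1) can be sketched in NumPy as follows (illustrative, with hypothetical array shapes; `sigma` and `eps` play the roles of the per-basin normalization and stability constant described above):

```python
import numpy as np

def two_timescale_loss(q_obs_d, q_sim_d, q_obs_h, q_sim_h, sigma, eps=0.1):
    """Single-basin sketch of the regularized loss: the average of the
    normalized squared errors at the daily and hourly timescales, plus a
    penalty on the squared difference between each daily prediction and
    the mean of its 24 hourly predictions.
    q_obs_d, q_sim_d: shape (n_days,); q_obs_h, q_sim_h: shape (n_days, 24)."""
    w = (sigma + eps) ** 2  # per-basin normalization
    nse_d = np.mean((q_sim_d - q_obs_d) ** 2) / w
    nse_h = np.mean((q_sim_h - q_obs_h) ** 2) / w
    # Averaging each day's 24 hourly predictions recovers daily resolution.
    reg = np.mean((q_sim_h.mean(axis=1) - q_sim_d) ** 2)
    return 0.5 * (nse_d + nse_h) + reg
```

Perfect, cross-timescale-consistent predictions drive all three terms to zero.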
1.3.3 Predicting More Timescales
While all our experiments so far considered daily and hourly input and output data, the MTSLSTM architecture generalizes to other timescales.
We showed this capability in a setup similar to the operational National Water Model, jointly predicting hourly, three-hourly, and six-hourly streamflow.
Table 2 details the sizes of the input and output sequences for each timescale.
To achieve a sufficiently long lookback window without using exceedingly long input sequences, we additionally predicted one day of daily streamflow but did not evaluate these daily predictions.
In this setup, the quality of predictions remained high across all target timescales, as our results in Sect. 2.5 show.
Related to the selection of target timescales, a practical note: The state transfer from lower to higher timescales requires careful coordination of timescales and input sequence lengths.
To illustrate, consider a toy setup that predicts two- and three-hourly discharge.
Each timescale uses an input sequence length of 20 steps (i.e., 40 and 60 hours, respectively).
The two-hourly branch would then need to start from the three-hourly state 40 hours before present, a point that falls between two three-hourly steps—in other words, the step widths are asynchronous at the point in time where the LSTM branches split.
Of course, one could simply select the 6th or 7th three-hourly step, but then either two hours remain unprocessed or one hour is processed twice (once in the three- and once in the two-hourly LSTM branch).
Instead, we suggest selecting sequence lengths such that the LSTM splits at points where the timescales’ steps align.
In the above example, sequence lengths of 20 (three-hourly) and 21 (two-hourly) would align correctly: the two-hourly branch then covers 21 × 2 h = 42 h, a multiple of the three-hourly step width.
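This alignment condition can be expressed as a small helper (a sketch with hypothetical names, checking the toy setup from the text):

```python
def split_aligns(fine_step_h: int, fine_len: int, coarse_step_h: int) -> bool:
    """The finer branch covers fine_step_h * fine_len hours of recent input;
    the branches can only split cleanly if that window ends on a step
    boundary of the coarser timescale."""
    return (fine_step_h * fine_len) % coarse_step_h == 0

# The example from the text: two- and three-hourly branches.
assert not split_aligns(2, 20, 3)  # 40 h is not a multiple of 3 h: misaligned
assert split_aligns(2, 21, 3)      # 42 h = 14 three-hourly steps: aligned
```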
Timescale  Input sequence length  Prediction window

Hourly  168 hours (= 7 days)  18 hours
Three-hourly  168 steps (= 21 days)  80 steps (= 10 days)
Six-hourly  360 steps (= 90 days)  120 steps (= 30 days)
Daily  365 days  1 day
1.4 Evaluation Criteria
Following previous studies that used the CAMELS dataset (Kratzert2020; Addor2018), we benchmarked our models with respect to the metrics and signatures listed in Table 3.
Hydrologic signatures are statistics of hydrographs, and thus not natively a comparison between predicted and observed values.
For these values, we calculated the Pearson correlation coefficient between the signatures of observed and predicted discharge across the 516 CAMELS basins used in this study.
Metric/Signature  Description  Reference

NSE  Nash–Sutcliffe efficiency  Eq. 3 in Nash1970
KGE  Kling–Gupta efficiency  Eq. 9 in Gupta2009
Pearson r  Pearson correlation between observed and simulated flow
α-NSE  Ratio of standard deviations of observed and simulated flow  From Eq. 4 in Gupta2009
β-NSE  Ratio of the means of observed and simulated flow  From Eq. 10 in Gupta2009
FHV  Top 2% peak flow bias  Eq. A3 in Yilmaz2008
FLV  Bottom 30% low flow bias  Eq. A4 in Yilmaz2008
FMS  Bias of the slope of the flow duration curve between the 20% and 80% percentile  Eq. A2 in Yilmaz2008
Peak timing  Mean time lag between observed and simulated peaks  See Appendix A
Baseflow index  Ratio of mean daily/hourly baseflow to mean daily/hourly discharge  Ladson2013
HFD mean  Mean half-flow date (date on which the cumulative discharge since October first reaches half of the annual discharge)  Court1962
High flow dur.  Average duration of high-flow events (number of consecutive steps times the median daily/hourly flow)  Clausen2000, Table 2 in Westerberg2015
High flow freq.  Frequency of high-flow days/hours (times the median daily/hourly flow)  Clausen2000, Table 2 in Westerberg2015
Low flow dur.  Average duration of low-flow events (number of consecutive days/hours times the mean flow)  Olden2003, Table 2 in Westerberg2015
Low flow freq.  Frequency of low-flow days/hours (times the mean daily/hourly flow)  Olden2003, Table 2 in Westerberg2015
Q5  5% flow quantile (low flow)
Q95  95% flow quantile (high flow)
Q mean  Mean daily/hourly discharge
Runoff ratio  Runoff ratio (ratio of mean daily/hourly discharge to mean daily/hourly precipitation, using NLDAS precipitation)  Eq. 2 in Sawicz2011
Slope FDC  Slope of the flow duration curve (between the log-transformed 33rd and 66th streamflow percentiles)  Eq. 3 in Sawicz2011
Stream elasticity  Streamflow precipitation elasticity (sensitivity of streamflow to changes in precipitation at the annual time scale, using NLDAS precipitation)  Eq. 7 in Sankarasubramanian2001
Zero flow freq.  Frequency of days/hours with zero discharge
To quantify the cross-timescale consistency of the different models, we calculated the root mean squared deviation between daily predictions and hourly predictions when aggregated to daily values:

\mathrm{RMSD}_b = \sqrt{\frac{1}{N_b^{D}} \sum_{n=1}^{N_b^{D}} \left(\hat{q}_n^{b,D} - \frac{1}{24} \sum_{h=1}^{24} \hat{q}_{n,h}^{b,H}\right)^2}    (2)
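For one basin, this deviation metric can be sketched as (illustrative shapes, hypothetical function name):

```python
import numpy as np

def cross_timescale_rmsd(q_daily, q_hourly):
    """Sketch of Eq. (2): root mean squared deviation between the daily
    predictions and the hourly predictions averaged to daily resolution.
    q_daily: shape (n_days,); q_hourly: shape (n_days, 24)."""
    return float(np.sqrt(np.mean((q_daily - q_hourly.mean(axis=1)) ** 2)))
```

A value of zero means the model's daily and aggregated hourly predictions are perfectly consistent.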
2 Results
2.1 Benchmarking
Table 4 compares the median test period evaluation metrics of the MTS-LSTM and sMTS-LSTM architectures with the benchmark naive LSTM models and the process-based National Water Model (NWM).
Figure 3 illustrates the cumulative distributions of per-basin NSE values of the different models.
By this metric, all LSTM models, even the naive ones, outperform NWM at both hourly and daily time steps.
All models perform slightly worse on hourly predictions than on daily predictions.
Accordingly, the results in Table 4 show differences in median NSE values between NWM and the LSTM models at both daily and hourly timescales.
Among the LSTM models, the differences in all metrics are comparatively small.
The sMTS-LSTM achieves the best median NSE at both timescales.
The naive models produce NSE values that are slightly lower, but the difference is not statistically significant (Wilcoxon signed-rank test, hourly and daily).
The NSE distribution of the MTS-LSTM is significantly different from that of the sMTS-LSTM (hourly and daily), but the absolute difference is small.
We leave it to the judgment of the reader to decide whether or not this difference is negligible from a hydrologic standpoint.
Beyond NSE, all LSTM models exhibit lower peak-timing errors than NWM.
For hourly predictions, the median peak-timing error of the sMTS-LSTM is around three and a half hours, compared to more than six hours for NWM.
Naturally, the peak-timing errors for daily predictions are smaller, since the error is measured in days.
Here, too, the sMTS-LSTM yields a smaller median peak-timing error than NWM.
The process-based NWM, in turn, often produces results with better flow bias metrics, especially with respect to low flows (FLV).
This observation underscores conclusions from prior work that indicate there is improvement to be made in arid climate regimes (Kratzert2020; Frame2020).
Similarly, the lower section of Table 4 lists correlations between hydrologic signatures of observed and predicted discharge: NWM has the highest correlations for frequencies of high flows, low flows, and zero flows, as well as for flow duration curve slopes.
daily  hourly
sMTS-LSTM  MTS-LSTM  Naive  NWM  sMTS-LSTM  MTS-LSTM  Naive  NWM
NSE  0.762  0.755  0.752  0.751
MSE  0.002  0.002  0.003  0.003
RMSE  0.048  0.048  0.054  0.054
KGE  0.760  0.731  0.739
α-NSE  0.873  0.847  0.828  0.825  0.837  0.846
Pearson r  0.891  0.885
β-NSE  0.055  0.043  0.042  0.038  0.045  0.039  0.039  0.034
FHV  13.336  15.053  16.296  16.115  14.467  14.174
FMS  5.099  5.264
FLV  9.617  9.730  12.195  0.775  6.315
Peak timing  0.306  0.310  3.540
NSE (mean)
Number of NSEs
High-flow freq.  0.730  0.719
High-flow dur.  0.512  0.491  0.471  0.316
Low-flow freq.  0.796  0.789
Low-flow dur.  0.316  0.309
Zero-flow freq.  0.392  0.286  0.409  0.502  0.363  0.401  0.505
Q95  0.980  0.956  0.980  0.979  0.956
Q5  0.970  0.979  0.968  0.964
Q mean  0.986  0.972  0.984
HFD mean  0.943  0.945  0.944  0.948
Slope FDC  0.679  0.712
Stream elasticity  0.601  0.615  0.601  0.563  0.626
Runoff ratio  0.957  0.962  0.924  0.955  0.918
Baseflow index  0.897  0.904  0.935  0.932
Figure 4 visualizes the distributions of NSE and peak-timing error across space for the hourly predictions with the sMTS-LSTM.
As in previous studies, the NSE values are lowest in arid basins of the Great Plains and Southwestern United States.
The peak-timing error shows similar spatial patterns; additionally, the hourly peak-timing error shows higher values along the southeastern coastline.
2.2 Cross-Timescale Consistency
Since the (non-naive) LSTM-based models jointly predict discharge at multiple timescales, we can incentivize predictions that are consistent across timescales.
As described in Sect. 1.3.2, this happens through a regularized NSE loss function that penalizes inconsistencies.
To gauge the effectiveness of this regularization, we compare inconsistencies between timescales in the best model from the benchmarking section, the sMTS-LSTM, with and without regularization.
As a baseline, we also compare against the cross-timescale inconsistencies from two independent naive LSTMs.
Table 5 lists the mean, median, and maximum root mean squared deviation between the daily predictions and the hourly predictions when aggregated to daily values.
Without regularization, simultaneous prediction with the sMTS-LSTM yields smaller inconsistencies than the naive approach (i.e., separate LSTMs at each timescale).
Cross-timescale regularization further reduces the inconsistencies’ root mean squared deviation.
Besides reducing inconsistencies, the regularization term appears to have a small but beneficial influence on the overall skill of the daily predictions: with regularization, the median NSE increases slightly.
Judging from the hyperparameter tuning results, this appears to be a systematic improvement (rather than a fluke), because for both the sMTS-LSTM and the MTS-LSTM, at least the three best hyperparameter configurations use regularization.
Mean  Median  Maximum

sMTS-LSTM  –  –
sMTS-LSTM (no regularization)
Naive
2.3 Computational Efficiency
In addition to differences in accuracy, the different LSTM architectures have rather large differences in computational overhead, and therefore runtime.
The naive hourly model must iterate through 4320 input sequence steps for each prediction it makes, whereas the MTS-LSTM and sMTS-LSTM only require 365 daily and 336 hourly steps.
Consequently, where the naive hourly LSTM takes more than one day to train on one NVIDIA V100 GPU, the multi-timescale models take just over 6 hours (MTS-LSTM) and 8 hours (sMTS-LSTM).
Moreover, while training is a one-time effort, the runtime advantage is even larger during inference: the naive model requires around 9 hours to predict 10 years of hourly data for 516 basins on an NVIDIA V100 GPU.
This is about 40 times slower than the MTS-LSTM and sMTS-LSTM models, which both require around 13 minutes for the same task on the same hardware—and the multi-timescale models generate daily predictions in addition to hourly ones.
2.4 Per-Timescale Input Variables
While the MTSLSTM yields slightly worse predictions than the sMTSLSTM in our benchmark evaluation, it has the important ability to ingest different input variables at each timescale.
The following two experiments show how harnessing this feature can increase the accuracy of the MTSLSTM beyond its shared version.
In the first experiment, we used two daily input forcing sets (Daymet and Maurer) and one hourly forcing set (NLDAS).
In the second experiment, we additionally ingested the daily forcings into the hourly LSTM branch.
Table 6 compares the results of these two multiforcings experiments with the singleforcing models (MTSLSTM and sMTSLSTM) from the benchmarking section.
The second experiment—ingesting daily inputs into the hourly LSTM branch—yields the best results.
The additional daily forcings increase the median daily NSE by —from to .
Even though the hourly LSTM branch only obtains lowresolution additional values, the hourly NSE increases by —from to .
This multiinput version of the MTSLSTM is the best model in this study, significantly better than the best singleforcing model (sMTSLSTM).
An interesting observation is that it is daily inputs to the hourly LSTM branch that improve predictions. Using only hourly NLDAS forcings in the hourly branch, the hourly median NSE drops to .
Following Kratzert2020, we expect that additional hourly forcing datasets will further improve hourly accuracy (beyond the improvement we already see with additional daily forcings).
 | daily | hourly
 | multi-forcing | single-forcing | multi-forcing | single-forcing
 | A | B | sMTS-LSTM | MTS-LSTM | A | B | sMTS-LSTM | MTS-LSTM
NSE | 0.811 | 0.805 | 0.812
MSE | 0.002 | 0.002 | 0.002
RMSE | 0.040 | 0.039 | 0.045
KGE | 0.782 | 0.779 | 0.801
NSE | 0.879 | 0.905
Pearson r | 0.912 | 0.911 | 0.911
NSE | 0.007 | 0.002
FHV | 10.993 | 8.086
FMS | 9.336 | 10.606
FLV | 14.034 | 12.850 | 9.931 | 14.486 | 26.969 | 27.567 | 34.003
Peak-timing | 0.250 | 0.273 | 0.286 | 0.308 | 3.571
NSE (mean) |
Number of NSEs |
2.5 Predicting More Timescales
In the above experiments, we evaluated our models on daily and hourly predictions only.
The MTS-LSTM architectures, however, generalize to other timescales.
Table 7 lists the NSE values on the test period for each timescale.
To calculate metrics, we only consider the first eight three-hourly and four six-hourly predictions made on each day.
The hourly median NSE is barely different from that of the daily–hourly MTS-LSTM.
While the three-hourly predictions are roughly as accurate as the hourly predictions, the six-hourly predictions are slightly worse.
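For reference, the NSE values reported throughout can be computed in a few lines. The per-day slicing below is a hypothetical sketch of the evaluation step described above; the assumed array shapes are ours, not the implementation's:

```python
import numpy as np

def nse(sim: np.ndarray, obs: np.ndarray) -> float:
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 matches the mean of obs."""
    return 1.0 - ((sim - obs) ** 2).sum() / ((obs - obs.mean()) ** 2).sum()

# Hypothetical shapes: one row of three-hourly predictions per day; keep only
# the first eight values per day (exactly one day's coverage) before scoring.
rng = np.random.default_rng(0)
sim_blocks = rng.random((100, 12))   # assumed: 12 emitted values per daily step
obs = rng.random((100, 8))           # 8 three-hourly observations per day

score = nse(sim_blocks[:, :8].reshape(-1), obs.reshape(-1))
```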
Timescale | Model | Median NSE | Number of NSEs
Hourly | MTS-LSTM | |
 | NWM | |
Three-hourly | MTS-LSTM | |
 | NWM | |
Six-hourly | MTS-LSTM | |
 | NWM | |
Discussion and Conclusion
The purpose of this work was to generalize LSTM-based rainfall–runoff modeling to multiple timescales.
This task is not as trivial as simply running different deep learning models at different timescales, due to the long lookback periods involved, their memory demands, and the associated computational expense.
With the MTS-LSTM and sMTS-LSTM, we propose two LSTM-based rainfall–runoff models that exploit the specific (physical) nature of the simulation problem.
The results show that the advantage LSTMs have over process-based models on daily predictions extends to sub-daily predictions.
An architecturally simple approach, which we here call the sMTS-LSTM, can process long-term dependencies at much smaller computational overhead than a naive hourly LSTM. Nevertheless, the high accuracy of the naive hourly model shows that LSTMs, even with a forget gate, can cope with very long input sequences.
Additionally, the LSTMs produce hourly predictions that are almost as good as their daily predictions, whereas the NWM's accuracy drops significantly from daily to hourly predictions.
More extensive hyperparameter tuning might further increase the accuracy of the naive model; however, this is difficult to test because of the large computational expense of training LSTMs on high-resolution input sequences that are long enough to capture all hydrologically relevant history.
The high accuracy of the sMTS-LSTM model indicates that the "summary" state passed between the daily and hourly LSTM components contains as much information as the naive model extracts from the full hourly input sequence.
This is an intuitive assumption, since the informative content of high-resolution forcings diminishes the farther we look back.
The MTS-LSTM adds to that the ability to use distinct sets of input variables for each timescale.
This is an important operational use case, as forcings with high temporal resolution often have shorter lead times than low-resolution products.
In addition, per-timescale input variables allow the use of input data with lower temporal resolution, such as remote sensing products, without interpolation.
Beyond these conceptual considerations, this feature boosts model accuracy beyond the best single-forcing model, as we can ingest multiple forcing products at each timescale.
The results from ingesting mixed input resolutions into the hourly LSTM branch (hourly NLDAS and daily Daymet and Maurer) highlight the flexibility of machine learning models and show that daily forcing histories can contain enough information to support hourly predictions.
Yet, there remain a number of steps to be taken before these models can be truly operational:

First, like most academic rainfall–runoff models, our models operate in a reanalysis setting rather than performing actual leadtime forecasts.

Third, the implementation we use in this paper generates predictions at the more granular timescales only when it predicts a low-resolution step, emitting multiple fine-grained steps at once to offset the lower frequency.
For instance, with daily and hourly target timescales, the LSTM predicts 24 hourly steps once a day.
In a reanalysis setting this does not matter, but in a forecasting setting one needs to generate hourly predictions more often than daily predictions.
Note, however, that this is merely a restriction of our implementation, not of the architecture: by allowing variable-length input sequences, we could produce one hourly prediction each hour (rather than 24 each day).
This work represents one step toward developing operational hydrologic models based on deep learning.
Overall, we believe that the MTS-LSTM is the most promising model for future use.
It can integrate forcings of different temporal resolutions, generates accurate and consistent predictions at multiple timescales, and incurs far smaller computational overhead during training and inference than individual models per timescale.
Code and Data Availability
We trained all our machine learning models with the neuralhydrology Python library (https://github.com/neuralhydrology/neuralhydrology).
All code to reproduce our models and analyses is available at https://github.com/gauchm/mtslstm.
The trained models and their predictions are available at https://doi.org/10.5281/zenodo.4071886.
Hourly NLDAS forcings and observed streamflow are available at https://doi.org/10.5281/zenodo.4072701.
The CAMELS dataset with static basin attributes is accessible at https://ral.ucar.edu/solutions/products/camels.
Appendix A A Peak-Timing Error Metric
Especially for predictions at high temporal resolutions, it is important that a model not only captures the correct magnitude of a flood event, but also its timing.
We measure this with a “peaktiming” metric that quantifies the lag between observed and predicted peaks.
First, we heuristically extract the most important peaks from the observed time series: starting with all observed peaks, we discard all peaks with topographic prominence smaller than the observed standard deviation and subsequently remove the smallest remaining peak until all peaks have a distance of at least 100 steps.
Then, we search for the largest prediction within a window of one day (for hourly data) or three days (for daily data) around the observed peak.
Given the pairs of observed and predicted peak time, the peaktiming error is their mean absolute difference.
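The heuristic above maps closely onto SciPy's peak detection: `find_peaks` with a `prominence` threshold and a `distance` constraint implements exactly the "discard low-prominence peaks, then iteratively remove the smallest peak until all are far enough apart" procedure. The sketch below is ours; the window size (here ±12 steps) and the tie-breaking inside the window are assumptions:

```python
import numpy as np
from scipy.signal import find_peaks

def peak_timing_error(obs: np.ndarray, sim: np.ndarray, window: int = 12) -> float:
    """Mean absolute lag (in steps) between observed peaks and the largest
    simulated value in a window around each peak. Sketch of the metric in
    Appendix A; window size and tie-breaking are our assumptions."""
    # prominence + distance reproduce the extraction heuristic: drop peaks
    # with prominence below the observed standard deviation, then remove
    # smaller peaks until all remaining peaks are >= 100 steps apart.
    peaks, _ = find_peaks(obs, prominence=obs.std(), distance=100)
    lags = []
    for p in peaks:
        lo, hi = max(0, p - window), min(len(sim), p + window + 1)
        predicted_peak = lo + int(np.argmax(sim[lo:hi]))  # largest prediction nearby
        lags.append(abs(predicted_peak - int(p)))
    return float(np.mean(lags)) if lags else float("nan")
```

For hourly data a window of one day would correspond to a larger `window` value; the parameter is left explicit so it can match either resolution.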
Appendix B Negative Results
Throughout our experiments, we found LSTMs to be a highly resilient architecture: we tried many different approaches to multi-timescale prediction, and a large fraction of them worked reasonably well, albeit not quite as well as the MTS-LSTM models we present in this paper.
Nevertheless, we believe it makes sense to report some of these “negative” results—models that turned out not to work as well as the ones we finally settled on.
Note, however, that the following reports are based on exploratory experiments with a few seeds and no extensive hyperparameter tuning.
B.1 Delta Prediction
Extending our final MTS-LSTM, we tried to make hourly prediction easier for the LSTM by ingesting the corresponding day's prediction into the hourly LSTM branch and only predicting each hour's deviation from the daily mean.
If anything, however, this approach slightly degraded prediction accuracy (and made the architecture more complicated).
We then experimented with predicting 24 weights for each day that distribute the daily streamflow across the 24 hours.
This would have had the elegant side effect of guaranteed consistency across timescales: the mean hourly prediction would always equal the daily prediction.
Yet, the results were clearly worse, and, as we show in Sect. 2.2, we can achieve near-consistent results by incentive (regularization) rather than enforcement.
One possible reason for the reduced accuracy is that it may be harder for the LSTM to learn two different things—predicting hourly weights and daily streamflow—than to predict the same streamflow at two timescales.
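The weight-based variant can be sketched in a few lines. The use of a softmax to produce the 24 weights is our illustrative assumption; any mechanism producing non-negative weights that sum to one would give the same consistency guarantee:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max())   # subtract max for numerical stability
    return e / e.sum()

# Hypothetical sketch: the model predicts a daily value plus 24 logits; the
# softmax turns the logits into weights that distribute the day's streamflow
# across its hours, so hourly and daily predictions agree by construction.
daily_pred = 1.5                                   # assumed daily prediction
logits = np.random.default_rng(0).normal(size=24)  # stand-in for model outputs

weights = softmax(logits)                          # non-negative, sums to one
hourly_pred = 24 * weights * daily_pred            # mean over 24 hours == daily_pred
```

Because the weights sum to one, the mean of `hourly_pred` always equals `daily_pred`, which is exactly the cross-timescale consistency this variant enforced; in our experiments, however, this hard coupling came at a clear cost in accuracy.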
B.2 Cross-Timescale State Exchange
Inspired by residual neural networks (ResNets), which use so-called skip connections to bypass layers of computation and allow a better flow of gradients during training (He2016resnet), we devised a "ResNet multi-timescale LSTM" in which, after each day, we combine the hidden state of the hourly LSTM branch with that of the daily branch to form the initial daily and hourly hidden states for the next day.
This way, we hoped, the daily LSTM branch might obtain more fine-grained information about the last few hours than it could infer from its daily inputs.
While the daily NSE remained roughly the same, the hourly predictions in this approach became much worse.
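One way to picture the state exchange is the following sketch; the additive, ResNet-style combination and the projection matrices `W_dh`/`W_hd` are illustrative assumptions rather than the exact operator we used:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = 8
h_daily = rng.normal(size=hidden)              # daily branch state at day's end
h_hourly = rng.normal(size=hidden)             # hourly branch state at day's end
W_dh = 0.1 * rng.normal(size=(hidden, hidden)) # stand-ins for learned projections
W_hd = 0.1 * rng.normal(size=(hidden, hidden))

# ResNet-style skip: each branch starts the next day from its own state plus
# a projection of the other branch's state.
h_daily_next = h_daily + W_dh @ h_hourly       # daily branch gets hourly detail
h_hourly_next = h_hourly + W_hd @ h_daily      # hourly branch gets daily context
```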
B.3 Multi-Timescale Input, Single-Timescale Output
For both the sMTS-LSTM and the MTS-LSTM, we experimented with ingesting both daily and hourly data into the models but only training them to predict hourly discharge.
In this setup, the models could focus entirely on hourly predictions rather than trying to satisfy two possibly conflicting goals.
Interestingly, however, the hourly-only predictions were worse than the combined multi-timescale predictions.
One reason for this effect may be that the state summary the daily LSTM branch passes to the hourly branch degrades, as the model obtains no training signal for its daily outputs.
Appendix C Time-Continuous Prediction with ODE-LSTMs
As we alluded to in the introduction, the combination of ordinary differential equations (ODEs) and LSTMs presents another approach to multi-timescale prediction, one that is more accurately characterized as time-continuous prediction (as opposed to our multi-objective learning approach, which predicts at arbitrary but fixed timescales).
The ODE-LSTM passes each input time step through a normal LSTM but then post-processes the resulting hidden state with an ODE that has its own learned weights.
In effect, the ODE component can adjust the LSTM's hidden state to the time step size.
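The post-processing step can be sketched as evolving the hidden state under a learned dynamics function for the elapsed time. The dynamics `f` and the fixed-step Euler solver below are simplifying assumptions; actual ODE-LSTMs may use other (e.g., adaptive) solvers:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = 8
W = 0.1 * rng.normal(size=(hidden, hidden))  # stand-in for learned ODE weights

def f(h: np.ndarray) -> np.ndarray:
    return np.tanh(W @ h)                    # learned hidden-state dynamics

def ode_adjust(h: np.ndarray, dt: float, n_steps: int = 10) -> np.ndarray:
    """Evolve the LSTM hidden state h over elapsed time dt with Euler steps."""
    step = dt / n_steps
    for _ in range(n_steps):
        h = h + step * f(h)
    return h

h = rng.normal(size=hidden)                  # hidden state after an LSTM step
h_after_1h = ode_adjust(h, dt=1.0)           # state adjusted to a 1-hour gap
h_after_3h = ode_adjust(h, dt=3.0)           # state adjusted to a 3-hour gap
```

Because `dt` is a continuous input, the same trained model can in principle be queried at any temporal granularity, which is what makes the approach attractive despite its slower training.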
For operational streamflow prediction, we believe that our MTS-LSTM approach is better suited, since ODE-LSTMs cannot directly process different input variables for different target timescales.
That said, from a scientific standpoint, we think the idea of training a model that can then generalize to arbitrary granularity is of great interest (e.g., toward a more comprehensive and interpretable understanding of hydrologic processes).
Although the idea of time-continuous prediction seemed promising, in our exploratory experiments it was better to use an MTS-LSTM and aggregate (or disaggregate) its predictions to the desired temporal resolution.
Note that, due to the slow training of ODE-LSTMs, we carried out the following experiments on ten basins (training one model per basin; we used smaller hidden sizes, higher learning rates, and more training epochs to adapt the LSTMs to this setting).
Table 8 gives examples of predictions at untrained timescales: table section (A) shows the mean and median NSE values across the ten basins when we trained the models on daily and 12-hourly target data but then generated hourly predictions (for the MTS-LSTM, we obtained hourly predictions by uniformly spreading each 12-hourly prediction across 12 hours).
Table section (B) shows the results when we trained on hourly and three-hourly target data but then predicted daily values (for the MTS-LSTM, we aggregated eight three-hourly predictions into one daily time step).
These initial results show that, in almost all cases, it is better to (dis)aggregate MTS-LSTM predictions than to use an ODE-LSTM.
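The (dis)aggregation rules used for the MTS-LSTM comparisons are simple enough to state in code (the example values are ours):

```python
import numpy as np

# (A) Disaggregate: spread each 12-hourly value uniformly across its 12 hours.
twelve_hourly = np.array([2.4, 1.2])   # two 12-hourly predictions covering one day
hourly = np.repeat(twelve_hourly, 12)  # 24 hourly values, constant within each block

# (B) Aggregate: average eight three-hourly predictions into one daily value.
three_hourly = np.array([1.0, 2.0, 3.0, 2.0, 1.0, 1.0, 2.0, 4.0])
daily = three_hourly.reshape(-1, 8).mean(axis=1)  # one value per day
```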
Section A: training on daily and 12-hourly data, with additional evaluation of hourly predictions (hourly MTS-LSTM results obtained as 12-hourly predictions uniformly distributed across 12 hours).
Section B: training on hourly and three-hourly data, with additional evaluation of daily predictions (daily MTS-LSTM predictions obtained as averaged three-hourly values).
The mean and median NSE are aggregated across the results for ten basins; best values are highlighted in bold.
Timescale | Used in training loss | Median NSE (MTS-LSTM, ODE-LSTM) | Mean NSE (MTS-LSTM, ODE-LSTM)
(A) daily | yes | 0.726 | 0.664
12-hourly | yes | 0.734 | 0.672
hourly | no | 0.706 | 0.634
(B) daily | no | 0.746 | 0.718
3-hourly | yes | 0.728 | 0.672
hourly | yes | 0.700 | 0.633
Appendix D Hyperparameter Tuning
For the multi-timescale models, we performed a two-stage tuning.
In the first stage, we tuned architectural model parameters (regularization, hidden size, sequence length, dropout) for 30 epochs at a batch size of 512, with a learning rate that is reduced once after ten epochs and again after 20 epochs.
We selected the configuration with the best median NSE (averaging the daily and hourly median NSE) and, in the second stage, tuned its learning rate and batch size. Table 9 lists the parameter combinations we explored.
We did not tune the architectural parameters of the naive LSTM models, since this architecture has already been extensively tuned by Kratzert2019.
The 24 times larger training set of the naive hourly model did, however, require additional tuning of learning rate and batch size, and we trained it for only one epoch.
As the extremely long input sequences greatly increase training time, we could only evaluate a relatively small parameter grid for the naive hourly model.
 | Naive (daily) | Naive (hourly) | sMTS-LSTM | MTS-LSTM
Loss | NSE | NSE | NSE | NSE
Regularization (see Sect. 1.3.2) | – | – | yes, no | yes, no
Hidden size | 256 | 256 | 64, 128, 256, 512, 1024 |
Sequence length | 365 days | 4320 hours | 365 days + 72, 168, 336 hours | 365 days + 72, 168, 336 hours
Dropout | 0.4 | 0.4 | 0.2, 0.4, 0.6 | 0.2, 0.4, 0.6
Learning rate | (1: 5e-3, 10: 1e-3, 25: 5e-4) | 5e-4, 1e-4 | (1: 5e-3, 10: 1e-3, 25: 5e-4), (1: 1e-3, 10: 5e-4, 25: 1e-4), (1: 5e-4, 10: 1e-4, 25: 5e-5), (1: 1e-4, 10: 5e-5, 25: 1e-5) | (1: 5e-3, 10: 1e-3, 25: 5e-4), (1: 1e-3, 10: 5e-4, 25: 1e-4), (1: 5e-4, 10: 1e-4, 25: 5e-5), (1: 1e-4, 10: 5e-5, 25: 1e-5)
Batch size | 256 | 256, 512 | 128, 256, 512, 1024 | 128, 256, 512, 1024
Number of combinations | 1 | 4 | |
A schedule (1: r1, 5: r2, 10: r3, …) denotes a learning rate of r1 for epochs 1 to 4, of r2 for epochs 5 to 9, and so on.
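Such an epoch-indexed schedule can be resolved with a small lookup, assuming each key marks the first epoch of its phase (the rate values below are illustrative):

```python
# Epoch-indexed learning-rate schedule: each key is the first epoch of a phase.
schedule = {1: 5e-3, 10: 1e-3, 25: 5e-4}

def lr_for_epoch(epoch: int, schedule: dict) -> float:
    """Return the rate of the latest phase that has started by `epoch`."""
    return schedule[max(e for e in schedule if e <= epoch)]
```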
Author Contributions
MG, FK, and DK designed all experiments.
MG conducted all experiments and analyzed the results, together with FK and DK.
GN supervised the manuscript from the hydrologic perspective, JL and SH from the machine learning perspective.
All authors worked on the manuscript.
Competing Interests
The authors declare that they have no conflict of interest.
Acknowledgements.
This research was undertaken thanks in part to funding from the Canada First Research Excellence Fund and the Global Water Futures Program, and enabled by computational resources provided by Compute Ontario and Compute Canada.
The ELLIS Unit Linz, the LIT AI Lab, and the Institute for Machine Learning are supported by the Federal State Upper Austria.
We thank the projects AIMOTION (LIT20186YOU212), DeepToxGen (LIT20173YOU003), AISNN (LIT20186YOU214), DeepFlood (LIT20198YOU213), Medical Cognitive Computing Center (MC3), PRIMAL (FFG873979), S3AI (FFG872172), DL for granular flow (FFG871302), ELISE (H2020ICT20193 ID: 951847), and AIDD (MSCAITN2020 ID: 956832).
Further, we thank Janssen Pharmaceutica, UCB Biopharma SRL, Merck Healthcare KGaA, Audi.JKU Deep Learning Center, TGW LOGISTICS GROUP GMBH, Silicon Austria Labs (SAL), FILL Gesellschaft mbH, Anyline GmbH, Google (Faculty Research Award), ZF Friedrichshafen AG, Robert Bosch GmbH, Software Competence Center Hagenberg GmbH, TÜV Austria, and the NVIDIA Corporation.