Toto is a decoder-only transformer built for multivariate time series forecasting. It adapts innovations from large language models—like RMSNorm, SwiGLU, and rotary embeddings—while introducing a novel “Proportional Factorized Space-Time Attention” mechanism. This design balances time- and space-wise attention to handle complex, high-cardinality data efficiently. Combined with a robust probabilistic prediction head using Student-T mixture models, Toto delivers flexible, scalable, and uncertainty-aware forecasts suitable for real-world applications.

How Toto Reimagines Multi-Head Attention for Multivariate Forecasting

  1. Background
  2. Problem statement
  3. Model architecture
  4. Training data
  5. Results
  6. Conclusions
  7. Impact statement
  8. Future directions
  9. Contributions
  10. Acknowledgements and References

Appendix

3 Model architecture

Toto is a decoder-only forecasting model. It employs many of the latest techniques from the literature and introduces a novel method for adapting multi-head attention to multivariate time series data (Fig. 1).

3.1 Transformer design

Transformer models for time series forecasting have variously used encoder-decoder [12, 13, 21], encoder-only [14, 15, 17], and decoder-only architectures [19, 23]. For Toto, we employ a decoder-only architecture. Decoder architectures have been shown to scale well [25, 26], and allow for arbitrary prediction horizons. The causal next-patch prediction task also simplifies the pre-training process.

We use techniques from some of the latest large language model (LLM) architectures, including pre-normalization [27], RMSNorm [28], and SwiGLU feed-forward layers [29].
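As a concrete illustration, here is a minimal PyTorch sketch of these building blocks (pre-normalization, RMSNorm, and a SwiGLU feed-forward layer) assembled into a decoder block. The class names and dimensions are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square layer norm: no mean subtraction, no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class SwiGLU(nn.Module):
    """Gated feed-forward layer: W3(SiLU(x W1) * (x W2))."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden, bias=False)  # gate
        self.w2 = nn.Linear(dim, hidden, bias=False)  # value
        self.w3 = nn.Linear(hidden, dim, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w3(F.silu(self.w1(x)) * self.w2(x))

class DecoderBlock(nn.Module):
    """Pre-norm block: x + Attn(Norm(x)), then x + FFN(Norm(x))."""
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.norm1 = RMSNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = RMSNorm(dim)
        self.ffn = SwiGLU(dim, 4 * dim)

    def forward(self, x, attn_mask=None):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)[0]
        return x + self.ffn(self.norm2(x))
```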

3.2 Input embedding

Time series transformers in the literature have used various approaches for creating input embeddings. We use non-overlapping patch projections (Fig. 3), first introduced for Vision Transformers [30, 31] and popularized in the time series context by PatchTST [14]. Toto was trained using a fixed patch size of 32.
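Below is a minimal sketch of what such a non-overlapping patch projection can look like, using the fixed patch size of 32 mentioned above; the layer name and model dimension are assumptions for illustration.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, patch_size: int = 32, d_model: int = 512):
        super().__init__()
        self.patch_size = patch_size
        self.proj = nn.Linear(patch_size, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, variates, time); time must be divisible by patch_size
        b, v, t = x.shape
        x = x.reshape(b, v, t // self.patch_size, self.patch_size)
        return self.proj(x)  # (batch, variates, num_patches, d_model)

tokens = PatchEmbedding()(torch.randn(8, 4, 256))  # -> (8, 4, 8, 512)
```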


3.3 Attention mechanism

Observability metrics are often high-cardinality, multivariate time series. Therefore, an ideal model will natively handle multivariate forecasting. It should be able to analyze relationships both in the time dimension (what we refer to as “time-wise” interactions) and in the channel dimension (what we refer to as “space-wise” interactions, following the convention in the Datadog platform of describing different groups or tag sets of a metric as the “space” dimension).

In order to model both space- and time-wise interactions, we need to adapt the traditional multi-head attention architecture [11] from one to two dimensions. Several approaches have been proposed in the literature to do this, including:

• Assuming channel independence, and computing attention only in the time dimension [14]. This is efficient, but throws away all information about space-wise interactions.

• Computing attention only in the space dimension, and using a feed-forward network in the time dimension [17, 18].

• Concatenating variates along the time dimension and computing full cross-attention between every space/time location [15]. This can capture every possible space and time interaction, but it is computationally costly.

• Computing “factorized attention,” where each transformer block contains a separate space and time attention computation [16, 32, 33]. This allows both space and time mixing, and is more efficient than full cross-attention. However, it doubles the effective depth of the network.

In order to design our attention mechanism, we follow the intuition that for many time series, the time relationships are more important or predictive than the space relationships. As evidence, we observe that even models that completely ignore space-wise relationships (such as PatchTST [14] and TimesFM [19]) can still achieve competitive performance on multivariate datasets. However, other studies (e.g. Moirai [15]) have shown through ablations that there is some clear benefit to including space-wise relationships.

We therefore propose a novel variant of factorized attention, which we call “Proportional Factorized Space-Time Attention.” We use a mixture of alternating space-wise and time-wise attention blocks. As a configurable hyperparameter, we can change the ratio of time-wise to space-wise blocks, thus allowing us to devote more or less compute budget to each type of attention. For our base model, we selected a configuration with one space-wise attention block for every two time-wise blocks.
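A minimal sketch of how such a proportional schedule could be generated is shown below; the helper function is hypothetical, but the 2:1 time-to-space ratio matches the base configuration described above.

```python
def block_schedule(n_layers: int, time_per_space: int = 2) -> list:
    """Return an alternating layer schedule, e.g. for n_layers=6:
    ['time', 'time', 'space', 'time', 'time', 'space']."""
    schedule = []
    while len(schedule) < n_layers:
        schedule.extend(["time"] * time_per_space + ["space"])
    return schedule[:n_layers]
```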

In the time-wise attention blocks, we use causal masking and rotary positional embeddings [34] with XPOS [35] in order to autoregressively model time-dependent features. In the space-wise blocks, by contrast, we use full bidirectional attention in order to preserve permutation invariance of the covariates, with a block-diagonal ID mask to ensure that only related variates attend to each other. This masking allows us to pack multiple independent multivariate time series into the same batch, in order to improve training efficiency and reduce the amount of padding.
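The block-diagonal ID mask can be illustrated with a short sketch: each variate carries the integer ID of the series it belongs to, and attention is permitted only between variates that share an ID. The function below is a hypothetical illustration of that idea, not the paper's implementation.

```python
import torch

def space_attention_mask(series_ids: torch.Tensor) -> torch.Tensor:
    # series_ids: (num_variates,) integer ID of the packed series each
    # variate belongs to. Returns a boolean (num_variates, num_variates)
    # mask that is True where attention is allowed.
    return series_ids.unsqueeze(0) == series_ids.unsqueeze(1)

ids = torch.tensor([0, 0, 1, 1, 1])  # two packed series with 2 and 3 variates
mask = space_attention_mask(ids)
# mask is block-diagonal: variates attend bidirectionally within their own
# series and never across packed series.
```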

3.4 Probabilistic prediction head

In order to be useful for forecasting applications, a model should produce probabilistic predictions. A common practice in time series models is to use an output layer where the model regresses the parameters of a probability distribution. This allows for prediction intervals to be computed using Monte Carlo sampling [7].

Common choices for an output layer are Normal [7] and Student-T [23, 36], which can improve robustness to outliers. Moirai [15] allows for more flexible residual distributions by proposing a novel mixture model incorporating a weighted combination of Gaussian, Student-T, Log-Normal, and Negative-Binomial outputs.

However, real-world time series can often have complex distributions that are challenging to fit, with outliers, heavy tails, extreme skew, and multimodality. In order to accommodate these scenarios, we introduce an even more flexible output likelihood. To do this we employ a method based on Gaussian mixture models (GMMs), which can approximate any density function [37]. To avoid training instability in the presence of outliers, we use a Student-T mixture model (SMM), a robust generalization of GMMs [38] that has previously shown promise for modeling heavy-tailed financial time series [39, 40]. The model predicts k Student-T distributions (where k is a hyperparameter) for each time step, as well as a learned weighting.
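As an illustration, a Student-T mixture head of this kind can be sketched with standard PyTorch distributions; the projection layout and parameter constraints below are assumptions, not the paper's exact parameterization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import StudentT, Categorical, MixtureSameFamily

class StudentTMixtureHead(nn.Module):
    def __init__(self, d_model: int, k: int = 8):
        super().__init__()
        # One projection yields mixture logits plus (df, loc, scale)
        # for each of the k Student-T components.
        self.proj = nn.Linear(d_model, 4 * k)

    def forward(self, h: torch.Tensor) -> MixtureSameFamily:
        logits, df, loc, scale = self.proj(h).chunk(4, dim=-1)
        df = 2.0 + F.softplus(df)         # assumed constraint: df > 2
        scale = F.softplus(scale) + 1e-6  # positive scale
        return MixtureSameFamily(
            Categorical(logits=logits),
            StudentT(df, loc, scale),
        )
```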

Figure 4. Example metric query in the Datadog platform. The metric name (1) determines which metric is being queried. The filter clause (2) limits which contexts are queried, in this case restricting the query to the prod environment. The space aggregation (3) indicates that the average metric value should be returned for each unique combination of the group-by keys. The time aggregation (4) indicates that metric values should be aggregated to the average for each 60-second interval. The query results will be a multivariate time series with 1-minute time steps, and with separate individual variates for each unique (service, datacenter) tuple.

When we perform inference, we draw samples from the mixture distribution at each timestamp, then feed each sample back into the decoder for the next prediction. This allows us to produce prediction intervals at any quantile, limited only by the number of samples; for more precise tails, we can choose to spend more computation on sampling (Fig. 2).
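A minimal sketch of this sampling loop is shown below. For simplicity it samples one value per step rather than per patch, and `model` stands in for a hypothetical callable that maps a context window to a predictive distribution over the next value.

```python
import torch

@torch.no_grad()
def sample_forecast(model, context: torch.Tensor, horizon: int,
                    num_samples: int = 256):
    # Replicate the 1-D context once per sample path: (num_samples, t)
    paths = context.unsqueeze(0).repeat(num_samples, 1)
    for _ in range(horizon):
        dist = model(paths)              # distribution over the next value
        nxt = dist.sample()              # (num_samples,)
        paths = torch.cat([paths, nxt.unsqueeze(-1)], dim=-1)
    samples = paths[:, -horizon:]        # (num_samples, horizon)
    # Prediction intervals at any quantile come from the empirical samples,
    # e.g. a 90% interval:
    lo = samples.quantile(0.05, dim=0)
    hi = samples.quantile(0.95, dim=0)
    return samples, lo, hi
```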

3.5 Input/output scaling

As in other time series models, we perform instance normalization on input data before passing it through the patch embedding, in order to make the model generalize better to inputs of different scales [41]. We scale the inputs to have zero mean and unit standard deviation. The output predictions are then rescaled back to the original units.
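A minimal sketch of this instance scaling, under the assumption that statistics are computed per series over the time dimension:

```python
import torch

def scale_input(x: torch.Tensor, eps: float = 1e-8):
    # x: (..., time); mean/std are computed per series instance
    mean = x.mean(dim=-1, keepdim=True)
    std = x.std(dim=-1, keepdim=True).clamp_min(eps)
    return (x - mean) / std, mean, std

def rescale_output(y: torch.Tensor, mean: torch.Tensor, std: torch.Tensor):
    # Map predictions back to the original units
    return y * std + mean
```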

3.6 Training objective

As a decoder-only model, Toto is pre-trained on the next-patch prediction task. We minimize the negative log-likelihood of the next predicted patch with respect to the distribution output of the model. We train the model using the AdamW optimizer [42].
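Under these definitions, the objective is simply the mean negative log-likelihood of the observed values under the predicted distribution; a minimal sketch:

```python
import torch
from torch.distributions import Distribution

def next_patch_nll(dist: Distribution, target: torch.Tensor) -> torch.Tensor:
    # Mean negative log-likelihood of the observed next values under the
    # model's predicted distribution (e.g. from a Student-T mixture head).
    return -dist.log_prob(target).mean()

# Training would then use AdamW, e.g.:
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
```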

3.7 Hyperparameters

The hyperparameters used for Toto are detailed in Table A.1; the model has 103 million parameters in total.


:::info Authors:

(1) Ben Cohen (ben.cohen@datadoghq.com);

(2) Emaad Khwaja (emaad@datadoghq.com);

(3) Kan Wang (kan.wang@datadoghq.com);

(4) Charles Masson (charles.masson@datadoghq.com);

(5) Elise Rame (elise.rame@datadoghq.com);

(6) Youssef Doubli (youssef.doubli@datadoghq.com);

(7) Othmane Abou-Amal (othmane@datadoghq.com).

:::


:::info This paper is available on arXiv under the CC BY 4.0 license.

:::

