Datadog introduces Toto, a groundbreaking foundation model for time series forecasting, trained on over one trillion data points. Optimized specifically for observability metrics, Toto delivers state-of-the-art zero-shot performance across multiple domains. Its novel architecture includes factorized space-time attention and a Student-T mixture model, enabling more efficient, accurate, and scalable predictions for complex, high-frequency infrastructure data. Toto marks a major step forward in real-time system monitoring and predictive analytics.

The Time Series Optimized Transformer: Setting New Standards in Observability


:::info Authors:

(1) Ben Cohen (ben.cohen@datadoghq.com);

(2) Emaad Khwaja (emaad@datadoghq.com);

(3) Kan Wang (kan.wang@datadoghq.com);

(4) Charles Masson (charles.masson@datadoghq.com);

(5) Elise Rame (elise.rame@datadoghq.com);

(6) Youssef Doubli (youssef.doubli@datadoghq.com);

(7) Othmane Abou-Amal (othmane@datadoghq.com).

:::

  1. Background
  2. Problem statement
  3. Model architecture
  4. Training data
  5. Results
  6. Conclusions
  7. Impact statement
  8. Future directions
  9. Contributions
  10. Acknowledgements and References

Appendix

\ This technical report describes the Time Series Optimized Transformer for Observability (Toto), a new state-of-the-art foundation model for time series forecasting developed by Datadog. In addition to advancing the state of the art on generalized time series benchmarks in domains such as electricity and weather, this model is the first general-purpose time series forecasting foundation model to be specifically tuned for observability metrics.

\ Toto was trained on a dataset of one trillion time series data points – the largest among all currently published time series foundation models. Alongside publicly available time series datasets, 75% of the data used to train Toto consists of fully anonymized numerical metric data points from the Datadog platform.

\ In our experiments, Toto outperforms existing time series foundation models on observability data. It does this while also excelling at general-purpose forecasting tasks, achieving state-of-the-art zero-shot performance on multiple open benchmark datasets.

\ In this report, we detail the following key contributions:

\ • Proportional factorized space-time attention: We introduce an advanced attention mechanism that allows for efficient grouping of multivariate time series features, reducing computational overhead while maintaining high accuracy.

\ • Student-T mixture model head: This probabilistic head, which robustly generalizes Gaussian mixture models, enables Toto to capture the complex dynamics of time series data more accurately and provides superior performance over traditional approaches (a minimal sketch follows this list).

\ • Domain-specific training data: In addition to general multi-domain time series data, Toto is specifically pre-trained on a large-scale dataset of Datadog observability metrics, encompassing unique characteristics not present in open-source datasets. This targeted training ensures enhanced performance in observability metric forecasting.
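\ To make the Student-T mixture model head concrete, the snippet below is a minimal PyTorch sketch of how such a head might look. It illustrates the general idea only and is not Toto's actual implementation: the number of mixture components, the layer shapes, and the parameterization of the degrees of freedom are arbitrary assumptions.

```python
# Minimal, hypothetical sketch of a Student-T mixture model (SMM) output head.
# Not Toto's implementation: component count and parameterizations are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical, MixtureSameFamily, StudentT


class StudentTMixtureHead(nn.Module):
    def __init__(self, d_model: int, k_components: int = 8):
        super().__init__()
        # One linear projection per mixture parameter: weights, df, loc, scale.
        self.logits = nn.Linear(d_model, k_components)
        self.df = nn.Linear(d_model, k_components)
        self.loc = nn.Linear(d_model, k_components)
        self.scale = nn.Linear(d_model, k_components)

    def forward(self, h: torch.Tensor) -> MixtureSameFamily:
        # h: (..., d_model) transformer output for each forecast step.
        mixture_weights = Categorical(logits=self.logits(h))
        components = StudentT(
            df=2.0 + F.softplus(self.df(h)),          # df > 2 keeps the variance finite
            loc=self.loc(h),
            scale=F.softplus(self.scale(h)) + 1e-6,   # strictly positive scale
        )
        return MixtureSameFamily(mixture_weights, components)


# Training would minimize the negative log-likelihood of the observed values:
#   loss = -head(hidden_states).log_prob(targets).mean()
```

\ At inference time, probabilistic forecasts like those shown in Figure 2 can be obtained by sampling from the returned distribution and reading off the median and the 2.5th/97.5th percentiles of the samples.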

\ Figure 1. Toto architecture diagram. Input time series of T steps (univariate example used for simplicity here) are first embedded using the patch embedding layer. They then pass through the transformer stack, which contains L identical segments. Each segment of the transformer consists of one space-wise transformer block followed by N time-wise blocks. The flattened transformer outputs are projected to form the parameters of the Student-T mixture model (SMM) head. The final outputs are the forecasts for the input series, shifted P steps (the patch width) into the future.
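\ The stack described in the caption can be sketched as follows. This is a simplified, hypothetical PyTorch rendering of the patch embedding and the alternating space-wise and time-wise attention blocks; it omits the proportional grouping of space-wise blocks, any masking, and the final projection to the Student-T mixture parameters, and the block internals are generic rather than Toto's.

```python
# Simplified sketch of a factorized space-time transformer stack: each of the L
# segments applies one attention block across variates (space) followed by N blocks
# across patches (time). Hyperparameters and block internals are illustrative only.
import torch
import torch.nn as nn


class AttentionBlock(nn.Module):
    """Pre-norm multi-head self-attention followed by an MLP."""
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x):                          # x: (batch, sequence, d_model)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))


class FactorizedSpaceTimeStack(nn.Module):
    def __init__(self, patch_len: int, d_model: int, n_segments: int, n_time_blocks: int):
        super().__init__()
        self.patch_len = patch_len
        self.embed = nn.Linear(patch_len, d_model)  # patch embedding layer
        self.segments = nn.ModuleList(
            nn.ModuleDict({
                "space": AttentionBlock(d_model),   # attends across the M variates
                "time": nn.ModuleList(AttentionBlock(d_model)
                                      for _ in range(n_time_blocks)),
            })
            for _ in range(n_segments)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, variates, T), with T divisible by patch_len.
        b, m, t = x.shape
        p = t // self.patch_len
        x = self.embed(x.reshape(b, m, p, self.patch_len))          # (b, m, p, d)
        d = x.shape[-1]
        for segment in self.segments:
            # Space-wise: fold patches into the batch axis, attend over variates.
            x = segment["space"](x.transpose(1, 2).reshape(b * p, m, d))
            x = x.reshape(b, p, m, d).transpose(1, 2)
            # Time-wise: fold variates into the batch axis, attend over patches.
            x = x.reshape(b * m, p, d)
            for block in segment["time"]:
                x = block(x)
            x = x.reshape(b, m, p, d)
        return x  # in Toto, these outputs feed the Student-T mixture model head
```

\ In the full model, the outputs of this stack are flattened and projected to the parameters of the Student-T mixture head, and the forecast is the input series shifted P steps (one patch width) into the future, as described in the caption above.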

1 Background

We present Toto, a groundbreaking time series forecasting foundation model developed by Datadog. Toto is specifically designed to handle the complexities of observability data, leveraging a state-of-the-art transformer architecture to deliver unparalleled accuracy and performance. Toto is trained on a massive dataset of diverse time series data, enabling it to excel in zero-shot predictions. This model is tailored to meet the demanding requirements of real-time analysis as well as compute- and memory-efficient scalability to very large data volumes, providing robust solutions for high-frequency and high-dimensional data commonly encountered in observability metrics.

\ 1.1 Observability data

\ The Datadog observability platform collects a vast array of metrics across multiple subdomains, crucial for monitoring and optimizing modern infrastructure and applications. These metrics include infrastructure data such as memory usage, CPU load, disk I/O, and network throughput, as well as application performance indicators like hit counts, error rates, and latency [1]. Additionally, Datadog integrates specific metrics from numerous SaaS products, cloud services, open-source frameworks, and other third-party tools. The platform allows users to apply various time series models to proactively alert on anomalous behavior, leading to a reduction in time to detection (TTD) and time to resolution (TTR) of production incidents [2].

\ Figure 2. Example of Toto's 96-step zero-shot forecasts on the ETTh1 dataset, showing multivariate probabilistic predictions. Solid lines represent ground truth, dashed lines represent median point forecasts, and shaded regions represent 95% prediction intervals.

\ The complexity and diversity of these metrics present significant challenges for time series forecasting. Observability data often requires high time resolution, down to seconds or minutes, and is typically sparse with many zero-inflated metrics. Moreover, these metrics can display extreme dynamic ranges and right-skewed distributions. The dynamic and nonstationary nature of the systems being monitored further complicates the forecasting task, necessitating advanced models that can adapt and perform under these conditions.

\ 1.2 Traditional models

\ Historically, time series forecasting has relied on classical models such as ARIMA, exponential smoothing, and basic machine learning techniques [3]. While foundational, these models necessitate individual training for each metric, presenting several limitations [4]. The need to develop and maintain separate models for each metric impedes scalability, especially given the extensive range of metrics in observability data. Moreover, these models often fail to generalize across different types of metrics, leading to suboptimal performance on diverse datasets [5, 6]. Continuous retraining and tuning to adapt to evolving data patterns further increase the operational burden. This scaling limitation has hindered the adoption of deep learning–based methods for time series analysis, even as they show promise in terms of accuracy [7].

\ 1.3 Foundation models

\ Large neural network-based generative models, often referred to as “foundation models,” have revolutionized time series forecasting by enabling accurate predictions on new data not seen during training, known as zero-shot prediction [8]. This capability significantly reduces the need for constant retraining on each specific metric, thus saving considerable time and computational resources. Their architecture supports the parallel processing of vast data volumes, facilitating timely insights essential for maintaining system performance and reliability [9, 10].

\ Through pretraining on diverse datasets, generative models exhibit strong generalization across various types of time series data. This enhances their robustness and versatility, making them suitable for a wide range of applications. Zero-shot predictions are particularly attractive in the observability domain, where the limitations of traditional methods are felt very acutely. The most common use cases for time series models within an observability platform like Datadog include automated anomaly detection and predictive alerting. It is challenging to scale classical forecasting methods to handle cloud-based applications that can be composed of many ephemeral, dynamically scaling components such as containers, VMs, serverless functions, etc. These entities tend to be both high in cardinality and short-lived in time. This limits the practicality of traditional time series models in two ways:

\ • First, the high cardinality and volume of data can make fitting individual models to each time series computationally expensive or even intractable. The ability to train a single model and perform inference across a wide range of domains has the potential to dramatically improve the efficiency, and thus the coverage, of an autonomous monitoring system.

\ • Second, ephemeral infrastructure elements often lack enough historical data to confidently fit a model. In practice, algorithmic alerting systems often require an adaptation period of days or weeks before they can usefully monitor a new metric. However, if the object being monitored is a container with a lifespan measured in minutes or hours, these classical models are unable to adapt quickly enough to be useful. Real-world systems thus often fall back to crude heuristics, such as threshold-based alerts, which rely on the domain knowledge of users. Zero-shot foundation models can enable accurate predictions with much less historical context, by aggregating and interpolating prior information learned from a massive and diverse dataset.

\ The integration of transformer-based models [11] like Toto into observability data analysis thus promises significant improvements in forecasting accuracy and efficiency. These models offer a robust solution for managing diverse, high-frequency data and delivering zero-shot predictions. With their advanced capabilities, transformer-based models represent a significant leap forward in the field of observability and time series analysis [12–14].

1.4 Recent work

\ The past several years have seen the rise of transformer-based models as powerful tools for time series forecasting. These models leverage multi-head self-attention mechanisms to capture long-range dependencies and intricate patterns in data.

\ To address the unique challenges of time series data, recent advancements have introduced various modifications to the attention mechanism. For example, Moirai [15] uses “any-variate” attention to model dependencies across different series simultaneously. Factorized attention mechanisms [16] have been developed to separately capture temporal and spatial (cross-series) interactions, enhancing the ability to understand complex interdependencies. Other models [17, 18] have used cross-channel attention in conjunction with feed-forward networks for mixing in the time dimension. Additionally, causal masking [19] and hierarchical encoding [16] can improve the efficiency and accuracy of predictions in time series contexts.

\ These innovative transformer-based models have demonstrated state-of-the-art performance on benchmark datasets [14], frequently surpassing traditional models in both accuracy and robustness. Their capacity to process high-dimensional data efficiently [20] makes them ideal for applications involving numerous time series metrics with varying characteristics, such as observability.

\ Even more recently, a number of time series “foundation models” have been released [15, 19, 21–24]. By pre-training on extensive, multi-domain datasets, these large models achieve impressive zero-shot prediction capabilities, significantly reducing the need for constant retraining. This paradigm is appealing for the observability context, where we constantly have new time series to process and frequent retraining is impractical.

2 Problem statement

At Datadog, our time series data encompasses a variety of observability metrics from numerous subdomains. These metrics present several challenges for existing forecasting models (a toy sketch illustrating them follows the list below):

\ • High time resolution: Users often require data in increments of seconds or minutes, unlike many publicly available time series datasets that are at hourly frequency or above.

\ • Sparsity: Metrics such as error counts often track rare events, resulting in sparse and zero-inflated time series.

\ • Extreme right skew: Latency measurements in distributed systems exhibit positive, heavy-tailed distributions with extreme values at high percentiles.

\ • Dynamic, nonstationary systems: The behavior of monitored systems changes frequently due to code deployments, infrastructure scaling, feature flag management, and other configuration changes, as well as external factors like seasonality and user-behavior-driven trends. Some time series, such as those monitoring fleet deployments, can also have a very low variance, exhibiting a piecewise-constant shape.

\ • High-cardinality multivariate data: Monitoring large fleets of ephemeral cloud infrastructure such as virtual machines (VMs), containers, serverless functions, etc. leads to high cardinality data, with hundreds or thousands of individual time series variates, often with limited historical data for each group.

\ • Historical anomalies: Historical data often contains outliers and anomalies caused by performance regressions or production incidents.
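\ As a toy illustration of several of the characteristics above (high time resolution, zero inflation, heavy right tails, and piecewise-constant behavior), the snippet below generates synthetic series with those properties. The distributions and parameters are arbitrary choices for illustration and are not drawn from Datadog data.

```python
# Synthetic, illustrative examples of observability-like series; the parameters
# are arbitrary and the data is not sampled from the Datadog platform.
import numpy as np

rng = np.random.default_rng(0)
n = 1440  # one day at 1-minute resolution (high time resolution)

# Sparse, zero-inflated error counts: mostly zeros with occasional bursts.
error_count = np.where(rng.random(n) < 0.02, rng.poisson(5.0, n), 0)

# Extremely right-skewed request latency (log-normal gives a heavy upper tail).
latency_ms = rng.lognormal(mean=3.0, sigma=1.2, size=n)

# Low-variance, piecewise-constant metric (e.g., the size of a deployed fleet).
fleet_size = np.repeat(rng.integers(90, 110, size=n // 240), 240)[:n]

print(f"error counts: {np.mean(error_count == 0):.0%} zeros")
print(f"latency: p50 = {np.percentile(latency_ms, 50):.0f} ms, "
      f"p99 = {np.percentile(latency_ms, 99):.0f} ms")
```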

\ Foundation models pre-trained on other domains struggle to generalize effectively to observability data due to these characteristics. To overcome this, we developed Toto, a state-of-the-art foundation model that excels at observability forecasting while also achieving top performance on standard open benchmarks.

\

:::info This paper is available on arxiv under CC BY 4.0 license.

:::
