Multimodal AI workloads are breaking Spark and Ray. See how Daft’s streaming model runs 7× faster and more reliably across audio, video, and image pipelines.

Why Multimodal AI Broke the Data Pipeline — And How Daft Is Beating Ray and Spark to Fix It

2025/11/03 13:19

Multimodal AI workloads break traditional data engines. They need to embed documents, classify images, and transcribe audio, not just run aggregations and joins. These multimodal workloads are tough: memory usage balloons mid-pipeline, processing requires both CPU and GPU, and a single machine can't handle the data volume.

This post compares Daft and Ray Data for multimodal data processing, examining their architectures and performance, with Spark included as an additional baseline. Benchmarks across large-scale audio, video, document, and image workloads found Daft running 2-7x faster than Ray Data and 4-18x faster than Spark, while finishing jobs reliably.

The Multimodal Data Challenge

Multimodal data processing presents unique challenges:

  1. Memory Explosions: A compressed image like a JPEG inflates roughly 20x in memory once decoded (a minimal sketch follows this list). A single video file can decode into thousands of frames, each several megabytes in size.
  2. Heterogeneous Compute: These workloads stress CPU, GPU, and network simultaneously. Processing steps include resampling, feature extraction, transcription, downloading, decoding, resizing, normalizing, and classification.
  3. Data Volume: The benchmarked workloads included 113,800 audio files from Common Voice 17, 10,000 PDFs from Common Crawl, 803,580 images from ImageNet, and 1,000 videos from Hollywood2.
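To make the inflation in point 1 concrete, here is a minimal sketch (Pillow is assumed to be installed, and sample.jpg is a placeholder for any JPEG on disk) comparing a file's size on disk with the size of its decoded RGB pixel buffer:

```python
import os
from PIL import Image

# "sample.jpg" is a placeholder; any JPEG on disk works.
path = "sample.jpg"
bytes_on_disk = os.path.getsize(path)

with Image.open(path) as img:
    width, height = img.size

bytes_decoded = width * height * 3  # 8-bit RGB, ignoring alpha and row padding

print(
    f"{bytes_on_disk / 1e6:.2f} MB compressed -> "
    f"{bytes_decoded / 1e6:.2f} MB decoded (~{bytes_decoded / bytes_on_disk:.0f}x)"
)
```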

Introducing the Contenders

Daft

Daft is designed to handle petabyte-scale workloads with multimodal data (audio, video, images, text, embeddings) as first-class citizens.

Key features include:

  • Native multimodal operations: Built-in image decoding/encoding/cropping/resizing, text and image embedding/classification APIs, LLM APIs, text tokenization, cosine similarity, URL downloads/uploads, reading video to image frames
  • Declarative DataFrame/SQL API: Schema validation plus a query optimizer that automatically applies projection pushdown, filter pushdown, and join reordering, so users get these optimizations "for free" without manual tuning (a minimal sketch follows this list)
  • Comprehensive I/O support: Native readers and writers for Parquet, CSV, JSON, Lance, Iceberg, Delta Lake, and WARC formats, tightly integrated with the streaming execution model
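As a rough illustration of the declarative style, the sketch below builds a plan without executing it. Method names assume Daft's documented DataFrame API (read_parquet, where, select, explain); the dataset path and column names are placeholders:

```python
import daft
from daft import col

# Placeholder dataset: Parquet with "url", "label", and "split" columns.
df = daft.read_parquet("s3://my-bucket/image_urls.parquet")

# Declarative plan: the optimizer can push the filter and the column selection
# down into the Parquet scan, so only matching rows and columns are ever read.
df = df.where(col("split") == "train").select("url", "label")

df.explain(True)  # inspect the optimized plan without executing it
```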

Ray Data

Ray Data is a data processing library built on top of Ray, a framework for building distributed Python applications.

Key features include:

  • Low-level operators: Provides operations like map_batches that work directly on PyArrow record batches or pandas DataFrames (a minimal sketch follows this list)
  • Ray ecosystem integration: Tight integration with Ray Train for distributed training and Ray Serve for model serving
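For comparison, here is a minimal map_batches sketch; the Parquet path and column names are placeholders, and batch_format="pandas" hands each batch to the function as a pandas DataFrame:

```python
import pandas as pd
import ray

# Placeholder input: a Parquet file with a "text" column.
ds = ray.data.read_parquet("s3://my-bucket/metadata.parquet")

def add_text_length(batch: pd.DataFrame) -> pd.DataFrame:
    # Plain pandas logic; Ray Data treats the function as an opaque UDF.
    batch["text_len"] = batch["text"].str.len()
    return batch

ds = ds.map_batches(add_text_length, batch_format="pandas")
print(ds.take(3))
```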

Architecture Deep Dive

Daft's Streaming Execution Model

Daft's architecture revolves around its Swordfish streaming execution engine. Data is always "in flight": batches flow through the pipeline as soon as they are ready. For a partition of 100k images, the first 1000 can be fed into model inference while the next 1000 are being downloaded or decoded. The entire partition never has to be fully materialized in an intermediate buffer.
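In user code such a pipeline looks like the sketch below, ending in an inference-style UDF; under the streaming model, the download, decode, resize, and scoring steps run concurrently on different batches rather than one after another on a fully materialized partition. This is a hedged sketch assuming Daft's @daft.udf decorator and expression namespaces; the mean-pixel "model" is a stand-in for real GPU inference, and the path is a placeholder:

```python
import numpy as np
import daft
from daft import col

@daft.udf(return_dtype=daft.DataType.float32())
def score_images(images):
    # Stand-in for model inference: a real model would be loaded once and run on the GPU here.
    return [float(np.asarray(img).mean()) for img in images.to_pylist()]

df = (
    daft.read_parquet("s3://my-bucket/image_urls.parquet")          # placeholder path
    .with_column("image_bytes", col("url").url.download())          # fetch bytes from object storage
    .with_column("image", col("image_bytes").image.decode())        # decode compressed bytes
    .with_column("image", col("image").image.resize(224, 224))      # resize for the model
    .with_column("score", score_images(col("image")))               # inference-style step
)

df.show(3)  # batches stream download -> decode -> resize -> score as they become ready
```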

Backpressure mechanism: If GPU inference becomes the bottleneck, the upstream steps automatically slow down so memory usage remains bounded.

Adaptive batch sizing: Daft shrinks batch sizes on memory-heavy operations like url_download or image_decode, keeping throughput high without ballooning memory usage.

Flotilla distributed engine: Daft's distributed runner deploys one Swordfish worker per node, enabling the same streaming execution model to scale across clusters.

Ray Data's Execution Model

Ray Data streams data between heterogeneous operations (e.g., CPU → GPU) that users define via classes or resource requests. Within homogeneous stages, however, it fuses sequential operations into a single task and runs them back to back, which can blow up memory without careful tuning of block sizes. The workaround is to use classes instead of functions in map/map_batches, but that materializes intermediates in Ray's object store, adding serialization and memory-copy overhead. The object store defaults to only 30% of machine memory, so heavy materialization can trigger excessive disk spilling.
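A hedged sketch of that class-based pattern is shown below; keyword names such as concurrency and num_gpus follow recent Ray releases and may differ across versions, the dataset path is a placeholder, and the "model" is a trivial stand-in:

```python
import io

import numpy as np
import ray
from PIL import Image

def decode_batch(batch):
    # Stateless CPU stage: decode compressed bytes into fixed-size arrays.
    batch["image"] = np.stack([
        np.asarray(Image.open(io.BytesIO(b)).convert("RGB").resize((224, 224)))
        for b in batch["bytes"]
    ])
    return batch

class Classifier:
    def __init__(self):
        # Runs once per actor; a real model would be loaded onto the GPU here.
        self.model = lambda x: x.mean(axis=(1, 2, 3))  # trivial stand-in for inference

    def __call__(self, batch):
        batch["score"] = self.model(batch["image"].astype("float32"))
        return batch

ds = (
    ray.data.read_parquet("s3://my-bucket/images.parquet")   # placeholder: has a "bytes" column
    .map_batches(decode_batch, batch_format="numpy")         # stateless task stage
    .map_batches(Classifier, batch_format="numpy",
                 concurrency=2, num_gpus=1)                  # stateful actor pool on GPUs
)
```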

Performance Benchmarks

Based on recent benchmarks conducted on identical AWS clusters (8 x g6.xlarge instances with NVIDIA L4 GPUs, each with 4 vCPUs, 16 GB of memory, and a 100 GB EBS volume), here's how the three engines compare:

| Workload | Daft | Ray Data | Spark |
|----|----|----|----|
| Audio Transcription (113,800 files) | 6m 22s | 29m 20s (4.6x slower) | 25m 46s (4.0x slower) |
| Document Embedding (10,000 PDFs) | 1m 54s | 14m 32s (7.6x slower) | 8m 4s (4.2x slower) |
| Image Classification (803,580 images) | 4m 23s | 23m 30s (5.4x slower) | 45m 7s (10.3x slower) |
| Video Object Detection (1,000 videos) | 11m 46s | 25m 54s (2.2x slower) | 3h 36m (18.4x slower) |

Why Such Large Performance Differences?

Several architectural decisions contribute to Daft's performance advantages:

  1. Native Operations vs Python UDFs: Daft's multimodal expressions (image decoding/encoding/cropping/resizing, text and image embedding/classification, LLM APIs, text tokenization, cosine similarity, URL downloads/uploads, video-to-frame reading) are implemented and optimized inside the engine. In Ray Data you write your own Python UDFs on top of external libraries such as Pillow, NumPy, spaCy, and Hugging Face, which adds data movement because each library uses its own in-memory format.
  2. Memory Management - Streaming vs Materialization: Daft streams data through network, CPU, and GPU continuously without materializing entire partitions. Ray Data fuses sequential operations, which can cause memory issues; the class-based workaround materializes intermediates in the object store, adding serialization and memory-copy overhead.
  3. Resource Utilization: Daft pipelines everything inside a single Swordfish worker, which controls all of the machine's resources. Data streams asynchronously from cloud storage, into the CPUs for pre-processing, then into GPU memory for inference, and back out for results to be uploaded, keeping CPUs, GPUs, and the network saturated together for optimal throughput. In contrast, Ray Data by default reserves a full CPU core for I/O-heavy operations like downloading large videos, which can leave that core unavailable for CPU-bound processing work unless you manually tune fractional CPU requests (a hedged snippet follows this list).
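As an illustration of the fractional-CPU tuning mentioned in point 3, the request is made per stage. This is a hedged sketch with a placeholder dataset and a deliberately illustrative 0.25 value:

```python
import urllib.request

import ray

def download_videos(batch):
    # I/O-bound stage: fetch raw bytes for each URL in the batch.
    batch["video_bytes"] = [urllib.request.urlopen(u).read() for u in batch["url"]]
    return batch

ds = ray.data.read_parquet("s3://my-bucket/videos.parquet")  # placeholder path

# Request only a quarter of a core for this stage so the scheduler can overlap
# the downloads with CPU-bound stages instead of reserving a whole core for I/O.
ds = ds.map_batches(download_videos, num_cpus=0.25)
```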

When to Choose Which?

Based on the benchmark results and architectural differences:

Daft shows significant advantages for:

  • Multimodal data processing (images, documents, video, audio)
  • Workloads requiring reliable execution without extensive tuning
  • Complex queries with joins, aggregations, and multiple transformations
  • Teams preferring DataFrame/SQL semantics

Ray Data may be preferred when:

  • You have tight integration needs with the Ray ecosystem (Ray Train, Ray Serve)
  • You need fine-grained control over CPU/GPU allocation per operation

What Practitioners Are Saying

Is Daft battle-tested enough for production?

When Tim Romanski of Essential AI set out to taxonomize 23.6 billion web documents from Common Crawl (24 trillion tokens), his team pushed Daft to its limits - scaling from local development to 32,000 requests per second per VM. As he shared in a panel discussion: "We pushed Daft to the limit and it's battle tested… If we had to do the same thing in Spark, we would have to have the JVM installed, go through all of its nuts and bolts just to get something running. So the time to get something running in the first place was a lot shorter. And then once we got it running locally, we just scaled up to multiple machines."

What gap does Daft fill in the Ray ecosystem?

CloudKitchens rebuilt their entire ML infrastructure around what they call the "DREAM stack" (Daft, Ray, poEtry, Argo, Metaflow). When selecting their data processing layer, they identified specific limitations with Ray Data and chose Daft to complement Ray's compute capabilities. As their infrastructure team explained, "one issue with the Ray library for data processing, Ray Data, is that it doesn't cover the full range of DataFrame/ETL functions and its performance could be improved." They chose Daft because "it fills the gap of Ray Data by providing amazing DataFrame APIs" and noted that "in our tests, it's faster than Spark and uses fewer resources."

How does Daft perform on even larger datasets?

A data engineer from ByteDance commented on Daft's 300K image processing demonstration, sharing his own experience with an even larger image classification workload: "Not just 300,000 images - we ran image classification evaluations on the ImageNet dataset with approximately 1.28 million images, and Daft was about 20% faster than Ray Data." Additionally, in a separate technical analysis of Daft's architecture, he praised its "excellent execution performance and resource efficiency" and highlighted how it "effortlessly enables streaming processing of large-scale image datasets."

Resources

  • Benchmarks for Multimodal AI Workloads - Primary source for performance data and architectural comparisons
  • Benchmark Code Repository - Open-source code to reproduce all benchmarks
  • Distributed Data Community Slack - Join the community to discuss with Daft developers and users
