This study benchmarks a new stroke rendering method implemented on both CPU and GPU. The GPU version—written in WGSL and tested on devices from mobile to RTX 4090—demonstrates up to 14× faster performance than CPU stroking, especially in curve-heavy scenes. By comparing line, arc, and Bézier primitives, the results show how GPU pipelines can drastically reduce execution time and memory use, paving the way for real-time, scalable vector rendering across devices.

Evaluating Stroke Expansion Efficiency Across GPUs and CPUs

ABSTRACT

1 INTRODUCTION

2 FLATTENING AND ARC APPROXIMATION OF CURVES

3 EULER SPIRALS AND THEIR PARALLEL CURVES

4 FLATTENED PARALLEL CURVES

5 ERROR METRICS FOR APPROXIMATION BY ARCS

6 EVOLUTES

7 CONVERSION FROM CUBIC BÉZIERS TO EULER SPIRALS

8 GPU IMPLEMENTATION

9 RESULTS

CONCLUSIONS, FUTURE WORK AND REFERENCES

RESULTS

We present two versions of our stroking method: a sequential CPU stroker and the GPU compute shader outlined in Section 8. Both versions can generate stroked outlines with line and arc primitives. We focused our evaluation on the number of output primitives and execution time, using the timings dataset provided by Nehab [2020], in addition to our own curve-intensive scenes to stress the GPU implementation. We present our findings in the remainder of this section.

9.1 CPU: Primitive Count and Timing

We measured the timing of our CPU implementation, generating both line and arc primitives, against the Nehab and Skia strokers, both of which output a mix of straight lines and quadratic Bézier primitives. The results are shown in Figure 15. The time shown is the total for the test files from the timings dataset of Nehab [2020], and the tolerance was set to 0.25 wherever it was adjustable as a parameter. Consistent with the observations of Nehab [2020], Skia is the fastest stroke expansion algorithm we measured.

This speed is attained with some compromises, in particular an imprecise error estimate that can sometimes generate outlines exceeding the given tolerance. We confirmed that this behavior is still present. The Nehab paper proposes a more careful error metric, but a significant fraction of its total computation time is dedicated to evaluating it.
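Such violations can be checked mechanically by comparing a flattened outline against the geometry it approximates. The Rust sketch below is illustrative only (the names and the sampling density are ours, and this is not the error metric used by any of the measured implementations): it estimates the maximum deviation of a polyline from a source cubic by dense sampling, so a result above the requested tolerance flags a violation.

```rust
// Evaluate a cubic Bézier at parameter t using the Bernstein basis.
fn cubic_point(p: &[[f64; 2]; 4], t: f64) -> [f64; 2] {
    let mt = 1.0 - t;
    let w = [mt * mt * mt, 3.0 * mt * mt * t, 3.0 * mt * t * t, t * t * t];
    let mut out = [0.0; 2];
    for i in 0..4 {
        out[0] += w[i] * p[i][0];
        out[1] += w[i] * p[i][1];
    }
    out
}

// Distance from point q to the line segment (a, b).
fn dist_to_segment(q: [f64; 2], a: [f64; 2], b: [f64; 2]) -> f64 {
    let (dx, dy) = (b[0] - a[0], b[1] - a[1]);
    let len2 = dx * dx + dy * dy;
    let t = if len2 == 0.0 {
        0.0
    } else {
        (((q[0] - a[0]) * dx + (q[1] - a[1]) * dy) / len2).clamp(0.0, 1.0)
    };
    let (ex, ey) = (a[0] + t * dx - q[0], a[1] + t * dy - q[1]);
    (ex * ex + ey * ey).sqrt()
}

/// Maximum distance from dense samples of the cubic to the polyline.
fn max_deviation(p: &[[f64; 2]; 4], polyline: &[[f64; 2]]) -> f64 {
    (0..=1000)
        .map(|i| cubic_point(p, f64::from(i) / 1000.0))
        .map(|q| {
            polyline
                .windows(2)
                .map(|s| dist_to_segment(q, s[0], s[1]))
                .fold(f64::INFINITY, f64::min)
        })
        .fold(0.0, f64::max)
}
```

Dense sampling only estimates the one-sided distance from the curve to the polyline, but that is the direction in which a flattening tolerance is typically specified.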

We also compared the number of primitives generated for varying tolerance values from 1.0 to 0.001. The number of arcs generated is significantly lower than the number of line primitives. The number of arcs and quadratic Bézier primitives is comparable at practical tolerances, but the count of quadratic Béziers becomes more favorable at finer tolerances. The vertical axis of the accompanying figure shows the sum of output segment counts for test files from the timings dataset, plus mmark-70k (described in more detail in Section 9.2), omitting the waves example.

9.2 GPU: Execution Time

We evaluated the runtime performance of our compute shader on four GPU models spanning multiple form factors: the Google Pixel 6 with its Arm Mali-G78 MP20 mobile GPU, the Apple M1 Max integrated laptop GPU, and two discrete desktop GPUs, the mid-2015 NVIDIA GeForce GTX 980 Ti and the late-2022 NVIDIA GeForce RTX 4090. We authored our entire GPU pipeline in the WebGPU Shading Language (WGSL) and used the wgpu [6] framework to interoperate with the native graphics APIs (we tested the M1 Max on macOS via the Metal API; the mobile and desktop GPUs were tested via the Vulkan API on Android and Ubuntu systems).

We instrumented our pipeline to obtain fine-grained, per-stage execution times using the GPU timestamp query feature supported by both Metal and Vulkan. This allowed us to collect measurements for the curve flattening and input processing stages in isolation.
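As a rough illustration of the bracketing, the wgpu calls look like the Rust sketch below. Here `flatten_pipeline`, `bind_group`, and `num_workgroups` are placeholders for the pipeline objects of Section 8, and the exact feature flags and descriptor fields vary across wgpu versions.

```rust
// Sketch: time one pipeline stage with timestamp queries. Assumes the
// device was created with wgpu::Features::TIMESTAMP_QUERY enabled.
fn time_stage(
    device: &wgpu::Device,
    queue: &wgpu::Queue,
    flatten_pipeline: &wgpu::ComputePipeline,
    bind_group: &wgpu::BindGroup,
    num_workgroups: u32,
) -> f32 {
    let query_set = device.create_query_set(&wgpu::QuerySetDescriptor {
        label: Some("stage timestamps"),
        ty: wgpu::QueryType::Timestamp,
        count: 2,
    });
    // Receives the two resolved 64-bit tick values.
    let resolve_buf = device.create_buffer(&wgpu::BufferDescriptor {
        label: Some("timestamp resolve"),
        size: 2 * std::mem::size_of::<u64>() as u64,
        usage: wgpu::BufferUsages::QUERY_RESOLVE | wgpu::BufferUsages::COPY_SRC,
        mapped_at_creation: false,
    });
    let mut encoder = device.create_command_encoder(&Default::default());
    encoder.write_timestamp(&query_set, 0); // tick before the stage
    {
        let mut pass = encoder.begin_compute_pass(&Default::default());
        pass.set_pipeline(flatten_pipeline);
        pass.set_bind_group(0, bind_group, &[]);
        pass.dispatch_workgroups(num_workgroups, 1, 1);
    }
    encoder.write_timestamp(&query_set, 1); // tick after the stage
    encoder.resolve_query_set(&query_set, 0..2, &resolve_buf, 0);
    queue.submit([encoder.finish()]);
    // The caller copies `resolve_buf` into a mappable buffer, reads the
    // two ticks back, and multiplies the delta by this period (ns/tick).
    queue.get_timestamp_period()
}
```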

We ran the compute pipeline on the timings SVG files provided by Nehab [2020]. The largest of these SVGs is waves (see rendering (a) in Figure 20), which contains 42 stroked paths and a total of 13,308 path segments. We authored two additional test scenes in order to stress the GPU with a higher workload. The first is a very large stroked path with ~500,000 individually styled dashed segments, which we adapted from Skia's open-source test corpus; we used two variants of this scene with butt and round cap styles. For the second scene we adapted the Canvas Paths test case from MotionMark [1], which renders a large number of randomly generated stroked line, quadratic, and cubic Bézier paths. We made two versions of different sizes, mmark-70k and mmark-120k, with 70,000 and 120,000 path segments, respectively. Table 2 shows the payload sizes of our largest test scenes. The test scenes were rendered to a GPU texture with dimensions 2088×1600.

The contents of each scene were uniformly scaled up (by encoding a transform) to fit the size of the texture, in order to increase the required number of output primitives. As with the CPU timings, we used an error tolerance of 0.25 pixels in all cases. We recorded the execution times across 3,000 GPU submissions for each scene. Figures 17 and 18 show the execution times for the two data sets. The entire Nehab [2020] timings set adds up to less than 1.5 ms on all GPUs except the mobile unit. Our algorithm is ~14× slower on the Mali-G78 compared to the M1 Max, but it is still capable of processing waves in 3.48 ms on average. We observed that the performance of the kernel scales well with increasing GPU capability even at very large workloads, as demonstrated by the order-of-magnitude difference in run time between the mobile, laptop, and high-end desktop form factors we tested. We also confirmed that outputting arcs instead of lines can lead to a 2× decrease in execution time (particularly in scenes with high curve content, like waves and mmark), as predicted by the lower overall primitive count.

9.3 GPU: Input Processing

Figure 19 shows the run time of the tag monoid parallel prefix sums with increasing input size. The prefix sums scale better at large input sizes compared to our stroke expansion kernel. This can be attributed to the absence of expensive computations and to high control-flow uniformity in the tag monoid kernel. The CPU encoding times (prior to the GPU prefix sums) for some of the large scenes are shown in Table 3.
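For readers unfamiliar with the construction, the following Rust sketch is a sequential reference for such a scan; the fields and tag encoding are simplified, illustrative stand-ins for the richer encoding of Section 8. Because the combine operation is associative, the scan parallelizes directly on the GPU.

```rust
/// One element of a (hypothetical, simplified) tag monoid: running
/// counts that tell each path segment where its data begins in the
/// control-point and segment streams.
#[derive(Clone, Copy, Default)]
struct TagMonoid {
    points: u32,   // control points consumed so far
    segments: u32, // path segments seen so far
}

impl TagMonoid {
    // Associative combine; this is what makes a parallel scan valid.
    fn combine(self, other: Self) -> Self {
        Self {
            points: self.points + other.points,
            segments: self.segments + other.segments,
        }
    }

    // Hypothetical tag encoding: 1 = line, 2 = quadratic, 3 = cubic,
    // each consuming that many new control points.
    fn from_tag(tag: u8) -> Self {
        Self { points: u32::from(tag), segments: 1 }
    }
}

/// Exclusive prefix sum over the tag stream (sequential reference).
fn exclusive_scan(tags: &[u8]) -> Vec<TagMonoid> {
    let mut acc = TagMonoid::default();
    tags.iter()
        .map(|&t| {
            let out = acc;
            acc = acc.combine(TagMonoid::from_tag(t));
            out
        })
        .collect()
}
```

The kernel body is all additions with no divergent branches, which is consistent with the favorable scaling we observed.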

Our implementation applies dashing on the CPU. Dashing is conceptually straightforward to implement on the GPU: the core algorithm is a prefix sum applied to segment lengths, and the Euler spiral representation is particularly well suited to it. However, we believe the design of an efficient GPU pipeline architecture involves complex tradeoffs that we did not sufficiently explore for this paper. Therefore, the dash style applied to the long dash scene was processed on the CPU, with dashes sequentially encoded as individual strokes. Including CPU-side dashing, the time to encode long dash on the Apple M1 Max was approximately 25 ms on average, while the average time to encode the path after dashing was 4.14 ms.
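A minimal Rust sketch of the prefix-sum formulation just described follows, applied here to flattened segment lengths for simplicity (names are illustrative; a GPU version could operate on the Euler spiral segments directly, since they are naturally parameterized by arc length).

```rust
/// Return (segment index, parameter within segment) for each dash
/// boundary. `lengths[i]` is the arc length of flattened segment i;
/// `pattern` alternates on/off span lengths. Illustrative sketch only.
fn dash_boundaries(lengths: &[f64], pattern: &[f64]) -> Vec<(usize, f64)> {
    // Inclusive prefix sum: cumulative[i] = arc length through segment i.
    let mut cumulative = Vec::with_capacity(lengths.len());
    let mut total = 0.0;
    for &len in lengths {
        total += len;
        cumulative.push(total);
    }
    let mut boundaries = Vec::new();
    let mut s = 0.0; // arc length of the next dash boundary
    let mut i = 0; // index into the dash pattern
    while s < total {
        s += pattern[i % pattern.len()];
        i += 1;
        if s >= total {
            break;
        }
        // Binary search: first segment whose cumulative length reaches s.
        let seg = cumulative.partition_point(|&c| c < s);
        let seg_start = if seg == 0 { 0.0 } else { cumulative[seg - 1] };
        boundaries.push((seg, (s - seg_start) / lengths[seg]));
    }
    boundaries
}
```

On the GPU, the prefix sum runs in parallel and each boundary lookup is independent, which is why the approach maps well to compute shaders.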

Our GPU implementation targets graphics APIs that do not support dynamic memory allocation in shader code (in contrast to CUDA, which provides the equivalent of malloc). This makes it necessary to allocate the storage buffer in advance with enough memory to store the output; however, the precise memory requirement is unknown before the shader runs. We augmented our encoding logic to compute a conservative estimate of the required output buffer size, based on an upper bound on the number of line primitives per input path segment.

We found that Wang's formula [7] works well in practice, as it is relatively inexpensive to evaluate, though it yields a higher memory estimate than strictly required (see Table 4). For parallel curves, Wang's formula does not guarantee an accurate upper bound on the required subdivisions, which we worked around by inflating the estimate for the source segment by a scale factor derived from the linewidth. We observed that enabling estimation increases the CPU-side encoding time by 2–3× on the largest scenes, but overall the impact is negligible when the scene size is modest.
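As a hedged sketch, the estimate for a single cubic segment can look like the Rust below: the form of Wang's formula is standard, but the linewidth-derived inflation factor shown is illustrative rather than the exact factor used in our implementation.

```rust
/// Wang's formula: an upper bound on the number of lines needed to
/// flatten a cubic Bézier to within `tol`. For degree d the bound is
/// n >= sqrt(d * (d - 1) * m / (8 * tol)), where m is the largest
/// second difference of the control points; d = 3 here.
fn wang_cubic_segments(p: &[[f32; 2]; 4], tol: f32) -> u32 {
    let second_diff = |a: [f32; 2], b: [f32; 2], c: [f32; 2]| {
        let (x, y) = (a[0] - 2.0 * b[0] + c[0], a[1] - 2.0 * b[1] + c[1]);
        (x * x + y * y).sqrt()
    };
    let m = second_diff(p[0], p[1], p[2]).max(second_diff(p[1], p[2], p[3]));
    ((6.0 * m / (8.0 * tol)).sqrt().ceil() as u32).max(1)
}

/// Conservative per-segment output estimate for stroking. The
/// inflation factor is a placeholder for the linewidth-derived scale
/// factor described in the text.
fn stroke_output_estimate(p: &[[f32; 2]; 4], tol: f32, linewidth: f32) -> u32 {
    let inflation = 1.0 + (linewidth / tol).sqrt(); // illustrative
    (wang_cubic_segments(p, tol) as f32 * inflation).ceil() as u32
}
```

Summing such a bound over all input segments at encode time yields the storage buffer size to allocate before the shader runs.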

:::info Authors:

  1. Raph Levien
  2. Arman Uguray

:::

:::info This paper is available on arxiv under a CC BY 4.0 license.

:::

