AI evaluation is a tricky engineering challenge. With so many diverse tasks that we're trying to solve with AI, it will become increasingly complex to get it right. I propose the following framework: decompose the pipeline into small steps, design a measurable and reproducible evaluation approach, assess the interactions between steps and adjust accordingly.

Evaluating AI Is Harder Than Building It

2025/09/25 13:16
3 min read

For the past few months, mentions of AI evaluation by industry leaders have become more and more frequent, with some of the greatest minds tackling the challenges of ensuring AI safety, reliability, and alignment. It got me thinking, and in this post I'll share my view on the topic.

The Problem

Creating a robust evaluation system is a tricky engineering challenge. With so many diverse tasks we're trying to solve with AI, getting it right will only become more complex. In the pre-agentic era, most problems were narrow and specific: for example, making sure a user gets better recommended posts, measured by engagement time, likes, and so on. More engagement meant better performance.

But as AI advanced and unlocked new experiences and scenarios, things became much harder. Even without agentic systems, we started facing the challenge of getting the measurement right, especially for things like conversational AI. In contrast to the recommendation example, the exact quantity to measure here is practically unknown. Instead we rely on criteria like customer satisfaction rate (for customer support applications), "vibe" for creative tasks, benchmarks like SWE-bench for pure coding ability, and so on. The problem is that these criteria are proxies for what we actually care about, which prevents us from achieving the same quality of measurement as we had with simpler tasks.
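To make the proxy problem concrete, here is a minimal sketch. The proxy heuristic and the eval data are entirely hypothetical; the point is only to show how a cheap proxy metric can diverge from the human judgment it stands in for:

```python
# Sketch: how well does a cheap proxy metric track human judgment?
# The proxy heuristic and the eval set below are hypothetical.

def proxy_score(response: str) -> float:
    """A naive proxy: longer, polite-sounding answers score higher."""
    politeness = sum(w in response.lower() for w in ("thanks", "happy to help"))
    return min(len(response) / 200, 1.0) + 0.5 * politeness

# Hypothetical eval set: (model response, human satisfaction in [0, 1])
eval_set = [
    ("Thanks for reaching out! Your refund was issued.", 0.9),
    ("Thanks for your patience. " * 10, 0.2),   # long and polite, but empty
    ("Refund issued.", 0.8),                    # terse, but solves the problem
]

for response, human in eval_set:
    print(f"proxy={proxy_score(response):.2f}  human={human:.2f}")
```

Here the proxy ranks the padded, content-free answer highest while humans rank it lowest; optimizing against such a proxy would push the system in the wrong direction.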

Today’s Main Concerns

As we accelerate into the agentic era, existing eval issues compound. Imagine a multi-step process you're designing an agent system for. For each step you have to create a proper quality control system to prevent points of failure or bottlenecks. Then, since you're working with a pipeline, you must ensure that the chain of interdependent steps completes flawlessly. And what if one of the steps is an automated conversation with the user? That's tricky to evaluate on its own, but when such an open-ended task becomes part of your business pipeline, it affects the entire thing.
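One common way to handle this (a sketch, not the author's specific design; step names, checks, and logic are hypothetical) is to attach a quality gate to every step and fail fast, so a bad intermediate result never poisons the rest of the chain:

```python
# Sketch: per-step quality gates in a chained agent pipeline.
# Step names, transforms, and checks are hypothetical.
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Step:
    name: str
    run: Callable[[Any], Any]     # transforms the previous step's output
    check: Callable[[Any], bool]  # step-local quality gate

def run_pipeline(steps: list[Step], payload: Any) -> Any:
    for step in steps:
        payload = step.run(payload)
        if not step.check(payload):
            # Fail fast: a bad intermediate result would poison later steps.
            raise RuntimeError(f"quality gate failed at step '{step.name}'")
    return payload

steps = [
    Step("extract", lambda text: text.split(), lambda toks: len(toks) > 0),
    Step("summarize", lambda toks: " ".join(toks[:5]), lambda s: len(s) < 100),
]
print(run_pipeline(steps, "decompose the pipeline into small measurable steps"))
```

The gates here are deterministic for clarity; in a real agent pipeline a gate might itself be a model-based judge, which brings back the proxy-metric problem one level down.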

A Proposed Solution

This might seem concerning, and it really is. In my opinion, we can still get it right if we apply systematic thinking to such problems. I propose the following framework:

  1. Decompose the pipeline into small steps
  2. Design a measurable and reproducible evaluation approach
  3. Assess the interactions between steps and adjust accordingly

When we decompose the pipeline, we should match each step's complexity to the current intellectual capacity of the agentic tools available. A good eval design ensures that the results of each step are reliable and robust. And if we keep the interplay of these steps in check, we can harden the integrity of the overall pipeline. When there are many moving parts, it's important to get this step right, especially at scale.
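The third step, assessing interactions, can be made concrete by treating the boundary between adjacent steps as a contract and checking it over a shared eval set. A minimal sketch, with hypothetical steps and a toy knowledge base:

```python
# Sketch: checking the interplay between adjacent pipeline steps by
# validating that step A's outputs satisfy step B's input contract.
# Both steps, the contract, and the data are hypothetical.

KB = ["Evals measure pipeline quality.", "Agents chain many small steps."]

def step_a(query: str) -> dict:
    """Hypothetical retrieval step: returns a context dict."""
    return {"query": query,
            "passages": [p for p in KB if query.lower() in p.lower()]}

def step_b_contract(ctx: dict) -> bool:
    """Step B (the answerer) assumes at least one passage is present."""
    return isinstance(ctx.get("passages"), list) and len(ctx["passages"]) > 0

eval_queries = ["evals", "agents", "blockchain"]
failures = [q for q in eval_queries if not step_b_contract(step_a(q))]
print(f"{len(failures)}/{len(eval_queries)} queries break the A->B contract: {failures}")
```

Each step can pass its own eval in isolation and still break the pipeline at the seam; contract checks like this surface exactly those interaction failures.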

Conclusion

Of course, the complexity doesn't end there. There is a huge number of diverse problems that need a careful, thoughtful approach, individual to each specific domain.

An example that excites me personally is applying non-invasive BCI technology to previously unimaginable things. From properly interpreting abstract data like brain states to correctly measuring the effectiveness of incremental changes as the field progresses, this will require far more advanced evaluation approaches than we have now.

So far things look promising, and with so many great minds dedicating their time to designing better evaluation systems alongside the primary AI research, I'm sure we'll end up with safe and aligned technology. Let me know what you think!

