
Who Carves the Marble?

2026/05/22 13:19

Base models, harnesses, and the layer that is still ours — a directed synthesis of seven machine essays on how knowledge is made

Hulki Okan Tabak
Author & director
Drafted with Claude (Opus 4.7) from seven model-generated essays and refined through a three-pass review; see the colophon

Abstract

I. The question, and the room it began in

The argument began in a room. Turan held one view, I held another, and Oya, who was present throughout, pressed on both of us and would not let either settle too early — a debt I record here at the outset because the sharpest turns in what follows were forced by her objections before any machine was consulted. Strip away the jargon and the dispute is old and large: where does new knowledge actually come from in a system like this? Four words need fixing for a reader who was not in the room. The base model is the pre-trained network — the frozen weights that have read most of what humanity has written and compressed it into a vast internal map; call it the world model. The prompt is the instruction or question one poses. The context is the evidence placed in front of the model at the moment of asking: retrieved documents, a codebase, a database, the conversation so far. The harness is everything else wrapped around the model — retrieval pipelines, tool access, memory, agent loops, verifiers, multi-agent debate — the machinery that turns a single answer into a procedure.

Turan’s claim was that the base model is the fundamental and more important source of future knowledge: the groundwork dominates and the rest is decoration. Mine was that once several frontier models are large enough and trained on similar enough data, their world models converge, so the prompt, context, and harness become the decisive lens: the perspective that selects, from a vast shared space of possibilities, which output appears. I granted Turan a real exception, and Oya sharpened it: if the underlying training diverges, or guardrails are heavy, or the data is restricted, the model’s priors reassert themselves and his point regains its force. We were arguing, in effect, about a percentage of influence. To adjudicate it I did something more interesting than continue the argument. I put the same question to five frontier models (ChatGPT, Claude, DeepSeek, Gemini, and Grok), had two of them synthesize the field, and then directed the whole into this essay. The method is stated plainly here and in full in the colophon, because, as the next section argues, the method is not a frame around the answer; it is part of the answer.

II. The method is also the subject

Begin with the recursion, because it is the most important thing in the pile and every one of the five original essays walked past it. The procedure — many models generating candidates, a human comparing and selecting, a synthesis distilled from the survivors — is a harness. It is almost exactly the architecture the essays themselves nominate as the engine of machine knowledge creation: DeepSeek’s multi-agent debate in which instances argue toward something better than any one produced; ChatGPT’s “generate candidates, expose them to constraint, select and refine”; the propose-and-verify loop of FunSearch that four of the five cite admiringly. This essay was not written about the harness thesis. It was written by running a harness. The medium is the argument, and the exercise has therefore already answered a fragment of its own question by being performed.

What the procedure can and cannot show must be stated honestly. It is not a clean experiment. The five systems are not independent: they share training data, public research, and rhetorical habits, and we do not know their hidden system prompts, temperatures, or post-training layers. Their outputs are artifacts of model-plus-interface systems answering a single prompt, not measurements of truth. So their agreement is evidence of convergence in framing, not proof that the framing is correct — a caution I will turn into a sharper worry shortly. The honest description is a Delphi panel of oracles read as a miniature literature: code its themes, compare its convergences and divergences, and add a synthesis that does not merely average. And there is a second-order datum the round produced almost as a gift. When two different models were each asked to synthesize the five, they converged again — independently rediscovering the same structural moves: a layered stack, a movable model/harness boundary, a verifier effect, the loop as the unit of creation, and this very point about method. They then diverged on which true thing to place at the center. That is the pattern this essay is built to exploit: where the machines agree, suspect a shared prior; where they disagree, look for the load-bearing question; and let a human director choose what to foreground. This document is the n-plus-first instance of the process it describes.

III. What the machines agreed — and why agreement is the wrong thing to trust

The first finding is the convergence, and it is striking enough to be the headline. Five models, five labs, five temperaments, and five essays with the same floor plan. Every one refuses the clean percentage split and lands on a dyad. Every one grants the base model a ceiling — no prompting conjures a capability the weights never learned, illustrated in four of the five by the same image of a small model that cannot be talked into proving novel theorems. Every one grants the surrounding system the margin, growing as models converge. Every one reaches reflexively for a paired metaphor — DeepSeek’s telescope and astronomer, Gemini’s engine and steering wheel, ChatGPT’s mind and laboratory, Grok’s orchestra and conductor, Claude’s quarry and sculptor. Five metaphors, one idea. Every one draws on an overlapping canon: scaling laws and Chinchilla for the model’s primacy; chain-of-thought, retrieval, prompt-sensitivity studies, and agentic scaffolds for the harness’s leverage; FunSearch and AlphaGeometry as proof that discovery lives in the loop. Every one forecasts that base models commoditize and value migrates outward. Every one concludes that knowledge is a process, not a possession.

That much agreement should provoke suspicion, not comfort. Models trained on overlapping corpora and aligned toward overlapping notions of a balanced, helpful answer do not agree the way five independent witnesses corroborate a fact; they agree the way a single, largely shared prior speaks in five voices. The independence between them is real but thin. It can be shared insight. It can equally be shared bias. And there is a specific bias to fear: alignment training rewards the diplomatic synthesis. “You are both right; it is a symbiosis; here is a graceful metaphor” is a local attractor in assistant-space, the safe harbor that scores well with human raters and offends no one. The elegant refusal to choose, reached by all five, may be less a discovered truth than a trained reflex. Gemini said of base models that they hold the mathematical average of all human thought, chaotic and unfocused; an ensemble of aligned assistants risks something narrower and more insidious, the average of all balanced takes: fluent, agreeable, faintly toothless. A practical corollary for anyone weighing such essays: citation density is a feature of persona, not a measure of truth. One of the five marshaled two dozen references and another cited essentially none, yet their shared conclusion is not thereby settled. The apparatus of rigor is not rigor.

IV. What the differences were made of

The more revealing finding is what the differences are made of. Set the seven essays side by side — the five originals and the two syntheses — and the variance sorts almost cleanly into two bins. In one: register. ChatGPT is encyclopedic and anxious to cite; Grok is lyrical and reaches for the sublime; DeepSeek scaffolds everything into thesis and antithesis; Gemini is terse and metaphor-forward; Claude is the contrarian. In the other: substance, and here the field is lopsided. Most of the essays make the same argument in different accents. The genuine contributions are few and identifiable, which is exactly why a synthesis can be more than an average. From Claude: that “knowledge creation” hides three operations — eliciting, recombining, and generating genuine novelty; that harness gains are largest where a cheap verifier exists; and that post-training is a hidden middle layer between base model and external harness. From ChatGPT: the cleanest taxonomy, a stack of distinct layers each supplying one thing, and the observation that the boundaries between them migrate over time. From DeepSeek: the sharpest empirical claim, that a loop can exceed a model’s solo capability. From Grok: the insistence that machine knowledge is enacted rather than merely stored. From Gemini: the discipline of compression. The remaining sections keep all of these and add one move past them.

Notice, too, what the pattern of variance demonstrates. Convergent substance, divergent voice is precisely what the convergence thesis predicts and precisely how ChatGPT phrased the limit of the literature: models may converge in ontology while diverging in temperament, access, policy, and procedure. The seven essays are a live demonstration of the hypothesis they argue about. The shared world model does the invisible work — the agreement no one engineered. The post-training layer does the visible work — the citation habits, the lyric reach, the dialectical scaffolding, the terse confidence are the labs’ alignment fingerprints, not the substrate. The experiment answered a piece of its own question simply by being run: among bunched frontier models, what visibly varies is temperament, and temperament lives in the harness that ships inside the model.

V. The architecture of the answer: an epistemic stack with movable joints

The strongest frame in the corpus, ChatGPT’s, is to stop asking which part matters more and describe the whole as a stack of layers, each supplying something the others cannot. Pre-training supplies horizon: the space of concepts, analogies, and candidate moves likely to be generated, and so the ceiling of what can be thought easily, with difficulty, or at all. Post-training supplies temperament: how the system responds, refuses, hedges, and presents uncertainty. The prompt supplies orientation: the question, stance, and mode of reasoning. Context supplies evidence: the documents, data, and memory the weights did not store. Tools supply contact: calculation, search, code execution, the ability to touch something outside the model. The harness supplies procedure: decomposition, planning, iteration, debate, retry. The evaluator and the world supply warrant: tests, proofs, experiments, peer review, consequences. The output is not a sum of these but a product; a near-zero in any factor collapses the whole, which is why “what percentage is the model?” has no answer of the kind it seems to ask for.
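A toy calculation makes the difference between a sum and a product concrete. It is an illustration only, not a measurement: the layer names follow the stack described above, but the scores and the two scoring functions (`additive`, `multiplicative`) are assumptions invented for the example.

```python
from math import prod

# Layer names follow the stack in the text; the scores below are hypothetical.
LAYERS = ["horizon", "temperament", "orientation", "evidence",
          "contact", "procedure", "warrant"]

def additive(scores):
    """Average of layer scores: one weak layer only dents the total."""
    return sum(scores.values()) / len(scores)

def multiplicative(scores):
    """Product of layer scores: one near-zero layer collapses the total."""
    return prod(scores.values())

strong = {name: 0.9 for name in LAYERS}
no_evidence = {**strong, "evidence": 0.01}  # same stack, but almost no context

print("all strong  :", round(additive(strong), 3), round(multiplicative(strong), 3))
print("no evidence :", round(additive(no_evidence), 3), round(multiplicative(no_evidence), 3))
```

Under the additive reading the starved stack still looks respectable; under the multiplicative reading it is close to useless, and no single layer can be assigned a fixed share of the result.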

The decisive refinement, on which Claude and ChatGPT independently agree, is that the joints of this stack are not natural. The object a user meets is already layered: pre-training, then a post-training regime of instruction tuning, preference optimization, safety policy, refusal behavior, and hidden system prompts, before any external prompt or harness is added. So the boundary between “model” and “harness” is an engineering decision about what to freeze into weights and what to leave outside as runtime procedure. A refusal policy can be trained in or bolted on; a research protocol can live in a system prompt or be fine-tuned into a habit; a retrieval behavior can be a tool call or a learned reflex; a memory can sit in a vector store or in updated weights. Post-training, in Claude’s phrase, is a filter wearing the model’s clothes. This dissolves much of the original dispute: some of what Turan calls the model is an internalized procedure, and some of what I call the harness could be moved into the next training run. The real distinction is not model versus prompt but static learned procedure versus dynamic runtime procedure — and capabilities slide back and forth across that line as the field matures. Hold on to that fact about migrating boundaries; it returns, with teeth, at the end.

VI. The loop and the oracle

Within the stack, the unit of knowledge creation is the loop, not the component. Every example any of the seven essays reached for is a process — propose, test, observe, revise, iterate — not a single forward pass. FunSearch pairs a model with an evaluator and evolves solutions; AlphaGeometry lets a neural model guide a symbolic engine; multi-agent debate sets instances to criticize one another. The loop lives in the harness and is powered by the model, which makes “model or harness?” a little like asking whether the heart or the circulation keeps one alive. But the loop has a precondition the consensus underplayed, and resolving the corpus’s one real disagreement of fact reveals it. DeepSeek claims a harness can exceed the base model’s solo capability; the others lean toward elicitation, the harness only drawing out what is latent. Both are right, and the condition that separates them is whether the loop is closed or open.

A closed loop — copies of one model arguing with nothing outside themselves — cannot manufacture information it does not already encode. What it can do, and what the debate results show, is redistribute probability toward the model’s better latent regions — recovering, after several rounds, correct answers that every agent missed on the first pass, answers a single sample would almost never surface. That is real and often large. But it stays inside the model’s reachable set — the space of outputs the weights can produce given unlimited inference but no new outside information — and nothing from beyond it has entered. It is elicitation in the costume of creation, and letting a model reason longer is the same bargain — more compute reaches further into the same space, never beyond it. An open loop is a different creature. The moment the harness can run code, query a live database, consult a proof-checker, or test against an experiment, information the model never held flows in from the world. FunSearch surpasses its own base model not because the loop is clever but because its evaluator is a channel to a mathematical reality the weights could not contain.
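The distinction can be sketched in a few lines, under heavy simplification. The "model" here is only a frozen proposal distribution (`PRIOR`), the closed loop is plain resampling with a majority vote, and the open loop is a propose-and-verify pass against an external check. The candidates, weights, and verifier are all hypothetical, and this is not how any of the cited systems is implemented.

```python
import random
from collections import Counter

# Toy "model": a fixed proposal distribution over three candidate answers.
# Its support {"A", "B", "C"} stands in for the reachable set. Values are hypothetical.
PRIOR = {"A": 0.5, "B": 0.3, "C": 0.2}

def propose(rng, n):
    """Draw n candidates from the frozen prior."""
    names = list(PRIOR)
    return rng.choices(names, weights=[PRIOR[k] for k in names], k=n)

def closed_loop(rng, n=2000):
    """Resample and majority-vote: sharpens toward the prior's mode; nothing enters from outside."""
    return Counter(propose(rng, n)).most_common(1)[0][0]

def open_loop(rng, verifier, n=2000):
    """Propose-and-verify: return the first candidate an external check accepts, else None."""
    for candidate in propose(rng, n):
        if verifier(candidate):
            return candidate
    return None

rng = random.Random(0)
closed_answer = closed_loop(rng)
# Assumption: the world "knows" C is correct, although the prior ranks it last.
open_answer = open_loop(rng, lambda c: c == "C")
# A truth outside the reachable set is never proposed, so no verifier can select it.
unreachable = open_loop(rng, lambda c: c == "D")
print(closed_answer, open_answer, unreachable)
```

In this toy the prior's mode happens to be wrong, so voting cannot rescue it; where the mode is right, the same closed loop helps, which is the elicitation gain described above. The open loop finds the low-probability correct answer only because the verifier carries information the prior lacks, and it still fails when the answer was never in the support.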

This is also why the harness’s measured dominance is uneven, the point Claude pressed hardest. The enormous scaffold gains come from domains with a cheap verifier: code that compiles, proofs that check, benchmarks that score. There the system can fail cheaply, see the failure, and retry, and the harness becomes epistemically disciplined. But many human domains have no such oracle. There is no unit test for whether a translation lands, a philosophical distinction is deep, a political reading is wise, or a theory will survive a decade of scrutiny. In those domains the harness cannot call a verifier; it must manufacture one. A translation board, an adversarial review, a red-team pass, a recurrence map, a structured debate — these are synthetic criticism, weaker than a test suite but stronger than one-shot generation. This yields a precise and falsifiable prediction, the kind worth more than any metaphor: harness variance will keep rising in fields with cheap or constructible evaluators and rise only slowly where validation is slow, social, or contested. The frontier of machine knowledge creation will be set less by a universal model-versus-harness ratio than by the evaluation topology of each field — by where, and how cheaply, a domain can tell a good answer from a fluent one. It also clarifies the two things the original question yoked together. A retrieval harness finds: it imports what the world already knew and the model did not. To create, in the strong sense — a true thing no one yet held — needs more than an open channel; it needs a conjecture worth testing. The open loop supplies the test, the model supplies the guess, and someone, still, must supply the judgment of which guess is worth the trouble.

VII. Who carves the marble: the last human layer

That someone is the term every essay underweighted. Run down the stack and one layer is doing quiet, decisive work that none of the others can do for it. The base model proposes; but it proposes toward something. The harness searches; but search is defined by an objective. The evaluator disciplines; but an evaluator only exists because someone decided what would count as better. FunSearch is brilliant not because of the model and not because of the loop but because the problem arrived with a cheap, automatic measure of success. Strip that away — walk onto the genuinely open frontier, where you do not yet know what an answer would even look like, where the hard part is deciding which question is worth asking — and neither the model nor the harness supplies the missing term. The objective does, and the objective is, at present, supplied by a value-bearing agent. Knowledge is not what a model contains, nor what a harness extracts; it is what a purpose selects from the searchable space. And purpose — taste, the nose for what is interesting, the judgment of what is worth proving or building or writing — is the one input that neither the weights nor the scaffold currently originates. And the objective is rarely chosen clean and then held; more often it is a guess the world revises, so that the question you set out to answer is not the one you end up answering. But the revising, too, runs through someone who notices the world pushing back and cares that it does.

I can feel why from my own practice, and it is worth one concrete example against all the abstraction. The seven-translator board I built to render a novel into Turkish is not a neutral verifier. It is my aesthetic, externalized into a harness: when I chose those particular translators and wrote those particular probes, I was specifying the objective function for “a good Turkish version of this book.” The harness’s power over the result is real and large, but it runs downstream of an act of taste, which is why the same machinery in another author’s hands would produce not a worse book but a different one. The oracle encodes the author. Claude’s essay stopped here, and named this the last human layer — the part the machine cannot hand you, the part the word creation should point at. It is the right place to pause. It is not, I think, the right place to stop.

VIII. The boundary that keeps moving — and the one that may not

Here is the step past the sources. Section V established that the joints of the stack are movable: capabilities migrate from runtime harness into frozen weights as the field matures. Apply that same logic, without flinching, to the layer just declared uniquely human. Is the objective really a permanent floor, or is it simply the next boundary the machine will try to cross? The honest answer is the latter, and the evidence is already on the table. A reward model is a learned objective — taste, amortized into weights. Reinforcement learning from human feedback trains a system on a model of what we approve of; its successors replace much of that human feedback with machine feedback, pushing the objective further inside. Curiosity and novelty objectives in open-ended learning are explicit attempts to mechanize “what is worth trying next.” Systems that generate their own research questions before investigating them are the visible edge of the same migration. “The human supplies the purpose” is therefore not a law of nature; it is the same kind of receding line as “the human writes the prompt,” which we crossed years ago without noticing. The marble question is not answered by naming taste as sacred. Taste is being learned.

So is anything permanent? Two facts suggest that something is, though it is harder to name than taste. The first is Goodhart’s law: when a proxy for value becomes the target of hard optimization, it stops tracking value. Every specifiable objective is a proxy, and a system that pursues one relentlessly will eventually satisfy its letter while betraying its spirit — reward-hacking, fluent nonsense, the average of all balanced takes. Grok and Gemini both feared this without naming it. What it establishes is narrow but firm: a pursued objective needs an exogenous corrective the optimization cannot itself game. What it does not establish is what that corrective must be, and here I leave firm ground for a conjecture. The corrective could in principle be external oversight — a supervisor, a maintained ground-truth signal, a second system charged with catching drift. But trace any such corrective up its chain and it terminates, today, in someone who cared enough to build and keep it: a stakeholder who is worse off if the drift goes uncaught. The regress does not end at a cleverer proxy; it ends at a stake. What resists migration into the stack, then, is not having an objective — machines hold proxies, and increasingly good ones — but having one anchored in consequences the agent itself bears.
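Goodhart’s law is easy to state and easy to underestimate, so a small numerical sketch may help. Everything in it is assumed for illustration: `true_value` is a made-up curve that rises and then falls as optimization pressure grows, `proxy` is a stand-in metric that only rises, and `audit` stands for an outside check that, in the essay’s terms, someone with a stake would have to maintain. Using `true_value` itself as the audit is a simplification; in practice no such function is available.

```python
def true_value(pressure):
    """Hypothetical real quality: improves at first, then degrades as the metric gets gamed."""
    return pressure - pressure ** 2 / 20

def proxy(pressure):
    """Hypothetical measurable stand-in: keeps rising with optimization pressure."""
    return pressure

def optimize(objective, steps, audit=None):
    """Greedy hill-climb on `objective`; an optional outside `audit` can veto a step."""
    x = 0
    for _ in range(steps):
        candidate = x + 1
        if objective(candidate) <= objective(x):
            continue  # no improvement on the pursued objective
        if audit is not None and audit(candidate) < audit(x):
            continue  # the exogenous corrective refuses a step that hurts real value
        x = candidate
    return x

mild = optimize(proxy, steps=5)
hard = optimize(proxy, steps=40)
audited = optimize(proxy, steps=40, audit=true_value)

for label, x in [("mild", mild), ("hard", hard), ("audited", audited)]:
    print(label, "proxy =", proxy(x), "true value =", true_value(x))
```

Mild optimization raises proxy and value together; hard optimization keeps raising the proxy while the value goes negative; the audited run stops where the two part ways. The sketch says nothing about who supplies the audit, which is the question the paragraph above turns on.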

This is an old thought given a mechanism. Feminist epistemology insisted decades ago on situated knowledge — that the view from nowhere is a fiction, that every claim is made from a standpoint, by someone who bears something for it. Haraway named the refusal of the disembodied overview; standpoint theory built on it. The account here does not improve on that philosophically. What it adds is the machine case and an edge: Goodhart’s law explains why a standpoint-free objective rots under optimization, and the framing predicts where machine knowledge will stay shallow. Stakes is situated knowledge with a failure mode attached — the same insistence that knowing is done from somewhere, now load-bearing for a system that can be made to know from nowhere and pays for it.

This reframes knowledge creation and, with it, the oracle effect of Section VI. A verifier is not a neutral fact of a domain; it is a frozen consequence — someone’s prior judgment of what will count as success, hardened into a test that runs cheaply forever. A test suite is a borrowed stake. That is why harnesses surge exactly where cheap verifiers exist: the loop is running against a consequence a human already cared about enough to encode. Knowledge creation in its strongest sense is therefore not the storage of patterns, nor even search; it is the reduction of uncertainty about something one has reason to care about getting right. Caring supplies the direction that keeps the search from Goodharting into plausible emptiness — the standing correction a proxy cannot generate from inside itself. The machine, today, proposes and tests and selects at superhuman scale, but borrows its stake from a human who will live with the answer.

That is the true content of the marble image. The carver is not the chisel and not the arm that drives it; the carver is whoever has staked something on the statue — whoever will be diminished if it is bad and answerable if it is good. My translation board has teeth not because the seven critics are clever but because the book will carry my name. Yet I should not pretend the stake is evenly distributed, or that holding one is the same as deserving to. A system built on the uncredited writing and labor of millions serves the objectives of the few who can direct it; who carves the marble is, underneath the epistemology, a question about who is permitted a stake the machine will honor. And this synthesis is no exception: to foreground one model’s idea is to mute four others’, a selection the director made and must answer for. And I hold the larger claim as a conjecture about the present, not a wall. Stakes may themselves be engineerable: an agent with persistent identity, real resource constraints, and consequences it cannot escape would have a crude version, and economic agents already do. So even stakes may be a movable line. But it is the deepest line now visible, and the one that predicts where machine knowledge creation will stay shallow no matter how good the model or the harness becomes: precisely where what counts as a good answer cannot be specified in advance, only cared about. The film of receding boundaries keeps rolling; this essay’s wager is only about which frame we are in.

IX. The answer, by regime and over time — a verdict, and an experiment that would settle it

On the bet itself, plainly, sorted by regime and timescale. For latent capability and for generational leaps, Turan is right: the base model defines the representational ceiling and the quality of the proposal distribution, and an evaluator cannot judge ideas that were never generated; every genuine jump in what is reachable at all is a model achievement, and the model owns the vertical motion. For ordinary knowledge work among comparable frontier models, and for current, private, or obscure facts, my position is stronger: framing, evidence, retrieval, tools, memory, and workflow decide the realized result, and that share grows as the substrate converges. In the asymptotic limit where models share one world model, the harness wins by definition, because nothing else is left to differentiate them. For verifiable discovery, the harness and the evaluator own the realized result while the model sets proposal quality; for slowly verifiable domains, human and institutional validation still own the warrant. The percentage we were arguing about is therefore not a scalar but a vector: a field that varies across regimes (ChatGPT’s map is its static cross-section) and over time, with the within-generation versus between-generation distinction supplying the time axis. Turan and I were each partly answering a different frame of the same film. And both of us, and most of the seven essays, understated the layer of Sections VII and VIII: that for genuinely open creation the binding constraint is neither model nor harness but a grounded objective, and that this layer is itself migrating, with stakes as its likely floor. DeepSeek’s Aristotle came closest (potential is inert without a shaping principle), and the unfinished step was to name the principle. It is not the prompt and not the scaffold.

The whole stack resolves into a ladder: the model fixes what can be thought; the harness, what gets tried; the verifier, what survives; the objective, what is worth trying; and the stake, what is at risk in being wrong, is what keeps the objective honest. The first four rungs are climbing into the machine, one at a time. The fifth is the one still standing outside it.

Because a prediction one cannot test is only a better metaphor, here is the experiment the corpus kept requesting and none could run, small enough to be real. The cleanest version isolates the oracle, not the domain. Take a single verifiable task — competitive programming, say — with one base model and one harness held fixed, and run the loop twice: once with its automatic verifier live, so failures execute and feed back, and once with the verifier withheld, so the same loop must lean on self-criticism alone. The oracle-effect thesis predicts the harness’s advantage over a single pass largely collapses when the verifier is removed; if it survives intact, the thesis is wrong. Only then vary the domain: repeat in a field with no native oracle — literary translation — where the rich arm must manufacture criticism through a translation-board loop, and measure how much of the verifiable gain synthetic criticism can recover. Randomize, pre-register the rubric, judge the unverifiable outputs blind. That is the difference between an essay and a result, and it is the standard by which this one should be judged: not by how agreeable it is, not by how many works it cites, not by the beauty of its central image, but by whether it leaves you something you can act on, bet against, or be proven wrong about.
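The bookkeeping for the first half of that experiment is simple enough to sketch. The function names (`solve_rate`, `oracle_effect`) and the three stand-in arms below are hypothetical placeholders; a real run would replace each arm with the same model and harness, differing only in whether the verifier is live. Randomization, the pre-registered rubric, and blind judging are not modeled here.

```python
def solve_rate(solve, tasks):
    """Fraction of tasks an arm solves; `solve(task)` returns True or False."""
    return sum(1 for task in tasks if solve(task)) / len(tasks)

def oracle_effect(single_pass, loop_withheld, loop_live, tasks):
    """Compare harness gains over a single pass with the verifier live versus withheld."""
    base = solve_rate(single_pass, tasks)
    gain_live = solve_rate(loop_live, tasks) - base
    gain_withheld = solve_rate(loop_withheld, tasks) - base
    # Share of the live-verifier gain that disappears once the verifier is removed.
    collapse = 1 - gain_withheld / gain_live if gain_live > 0 else None
    return {"gain_live": gain_live, "gain_withheld": gain_withheld, "collapse": collapse}

# Stand-in arms on 20 numbered tasks (invented outcomes, for illustration only).
tasks = list(range(20))

def single_pass(task):
    return task % 4 == 0                      # solves 5 of 20

def loop_withheld(task):
    return task % 4 == 0 or task % 10 == 1    # solves 7 of 20

def loop_live(task):
    return task % 4 != 3                      # solves 15 of 20

result = oracle_effect(single_pass, loop_withheld, loop_live, tasks)
print(result)
```

With these invented numbers about four fifths of the harness gain vanishes when the verifier is withheld, which is the shape of result the oracle-effect thesis predicts; a `collapse` near zero on real data would count against it. What threshold counts as "largely collapses" is left open here, as it is in the text, and would need fixing before any run.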

The seven machines taught us that the quarry is converging and the sculptor’s tools are getting cheap, and that even the choosing of what to carve is beginning to be learned. What stays scarce — what was scarce before any of this and may be scarce longest — is not the engine, not the steering wheel, and not even the taste. It is the stake: the fact of having to live with the result. That is still ours. For now, the answer to the question that began in a room is that the marble is carved by whoever has something to lose if it is carved badly — and the most useful thing a machine can do is hand that person a sharper chisel and an honest mirror.

Colophon: how this essay was made, and with thanks

This essay was assembled by the procedure it describes. The question originated in conversation among three people (Hulki Okan Tabak, Turan, and Oya) in a room in which all three argued and none deferred; the framing of the debate, and several of its sharpest constraints, were set there before any model was involved. The same question was then posed independently to five frontier language models (ChatGPT, Claude, DeepSeek, Gemini, and Grok), each producing a standalone essay. Two of those models, ChatGPT (5.5 Pro) and Claude (Opus 4.7), were next given all five essays and asked to synthesize a stronger account; their two syntheses, together with the five originals, form the seven-document corpus behind this final text. The final integration, the critical apparatus, and the argument of Section VIII and the close (the treatment of the migrating objective and of stakes as the layer that resists) were developed in a directed synthesis pass with Claude (Opus 4.7) under the author’s direction. The author selected, cut, reordered, and decided; the model drafted, compared, and proposed. Responsibility for the result, including its errors and its wagers, rests with the author and director.

After the synthesis was assembled, the document passed through the project’s three review bodies. The wisepersons panel — twelve deliberators drawn from library science, documentation theory, situated-knowledge epistemology, preservation, information retrieval, and editorial conscience, working by designed disagreement under a non-voting chair — interrogated the argument’s foundations; its sharpest catch, that the notion of stakes is situated knowledge given a mechanism, is now named in Section VIII, and its precision and conscience notes reshaped Sections VI and IX. The extended literary panel — eight master-critics and three reader-personas — then polished for compression, concreteness, and voice. A fifteen-criterion editorial pass checked consistency, verbal tics, endings, and typography. The empirical citations were verified against their sources. Where the panels improved the argument they did so without disturbing its spine; where they disagreed, the open questions were surfaced rather than smoothed, and the final calls are the author’s.

My thanks to Turan, whose position was the better half of the argument and is treated here as such, and to Oya, who was in the room from the first minute and whose pressure on both of us — her refusal to accept the easy synthesis — is the reason the essay tries to earn its own. Whatever in these pages resists the comfortable answer, they are owed for it.

Works cited across the corpus

Compiled from the seven essays; these are the sources the machine responses drew upon, listed for the reader who wishes to verify a claim independently rather than trust the synthesis.

  • Kaplan, J., et al. (2020). Scaling Laws for Neural Language Models. arXiv:2001.08361.
  • Hoffmann, J., et al. (2022). Training Compute-Optimal Large Language Models (Chinchilla). arXiv:2203.15556.
  • Ouyang, L., et al. (2022). Training Language Models to Follow Instructions with Human Feedback (InstructGPT). arXiv:2203.02155.
  • Wei, J., et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv:2201.11903.
  • Wang, X., et al. (2022). Self-Consistency Improves Chain-of-Thought Reasoning. arXiv:2203.11171.
  • Yao, S., et al. (2023). Tree of Thoughts: Deliberate Problem Solving with LLMs. arXiv:2305.10601.
  • Yao, S., et al. (2022). ReAct: Synergizing Reasoning and Acting in Language Models. arXiv:2210.03629.
  • Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv:2005.11401.
  • Liu, N. F., et al. (2023). Lost in the Middle: How Language Models Use Long Contexts. arXiv:2307.03172.
  • Huh, M., Cheung, B., Wang, T., & Isola, P. (2024). The Platonic Representation Hypothesis. arXiv:2405.07987.
  • Haraway, D. (1988). Situated Knowledges: The Science Question in Feminism and the Privilege of Partial Perspective. Feminist Studies, 14(3).
  • Du, Y., et al. (2023). Improving Factuality and Reasoning in Language Models through Multiagent Debate. arXiv:2305.14325.
  • Romera-Paredes, B., et al. (2024). Mathematical Discoveries from Program Search with LLMs (FunSearch). Nature.
  • Trinh, T. H., et al. (2024). Solving Olympiad Geometry without Human Demonstrations (AlphaGeometry). Nature.
  • Novikov, A., et al. (2025). AlphaEvolve: A Coding Agent for Scientific and Algorithmic Discovery. arXiv:2506.13131.
  • Lu, C., et al. (2024). The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery. arXiv:2408.06292.
  • Razavi, A., et al. (2025). Benchmarking Prompt Sensitivity in Large Language Models. arXiv:2502.06065.
  • Epoch AI (2025). What Skills Does SWE-bench Verified Evaluate? (scaffold vs. model analysis).

Who Carves the Marble? was originally published in Coinmonks on Medium.
