Kategori: Uncategorized

  • Ship fast, optimize later: top AI engineers don’t care about cost — they’re prioritizing deployment

    Across industries, rising compute expenses are often cited as a barrier to AI adoption — but leading companies are finding that cost is no longer the real constraint.

    The tougher challenges (and the ones top of mind for many tech leaders)? Latency, flexibility and capacity.

    At Wonder, for instance, AI adds a mere few cents per order; the food delivery and takeout company is much more concerned with cloud capacity with skyrocketing demands. Recursion, for its part, has been focused on balancing small and larger-scale training and deployment via on-premises clusters and the cloud; this has afforded the biotech company flexibility for rapid experimentation.

    The companies’ true in-the-wild experiences highlight a broader industry trend: For enterprises operating AI at scale, economics aren't the key decisive factor — the conversation has shifted from how to pay for AI to how fast it can be deployed and sustained.

    AI leaders from the two companies recently sat down with Venturebeat’s CEO and editor-in-chief Matt Marshall as part of VB’s traveling AI Impact Series. Here’s what they shared.

    Wonder: Rethink what you assume about capacity

    Wonder uses AI to power everything from recommendations to logistics — yet, as of now, reported CTO James Chen, AI adds just a few cents per order.

    Chen explained that the technology component of a meal order costs 14 cents, the AI adds 2 to 3 cents, although that’s “going up really rapidly” to 5 to 8 cents. Still, that seems almost immaterial compared to total operating costs.

    Instead, the 100% cloud-native AI company’s main concern has been capacity with growing demand. Wonder was built with “the assumption” (which proved to be incorrect) that there would be “unlimited capacity” so they could move “super fast” and wouldn’t have to worry about managing infrastructure, Chen noted.

    But the company has grown quite a bit over the last few years, he said; as a result, about six months ago, “we started getting little signals from the cloud providers, ‘Hey, you might need to consider going to region two,’” because they were running out of capacity for CPU or data storage at their facilities as demand grew.

    It was “very shocking” that they had to move to plan B earlier than they anticipated. “Obviously it's good practice to be multi-region, but we were thinking maybe two more years down the road,” said Chen.

    What's not economically feasible (yet)

    Wonder built its own model to maximize its conversion rate, Chen noted; the goal is to surface new restaurants to relevant customers as much as possible. These are “isolated scenarios” where models are trained over time to be “very, very efficient and very fast.”

    Currently, the best bet for Wonder’s use case is large models, Chen noted. But in the long term, they’d like to move to small models that are hyper-customized to individuals (via AI agents or concierges) based on their purchase history and even their clickstream. “Having these micro models is definitely the best, but right now the cost is very expensive,” Chen noted. “If you try to create one for each person, it's just not economically feasible.”

    Budgeting is an art, not a science

    Wonder gives its devs and data scientists as much playroom as possible to experiment, and internal teams review the costs of use to make sure nobody turned on a model and “jacked up massive compute around a huge bill,” said Chen.

    The company is trying different things to offload to AI and operate within margins. “But then it's very hard to budget because you have no idea,” he said. One of the challenging things is the pace of development; when a new model comes out, “we can’t just sit there, right? We have to use it.”

    Budgeting for the unknown economics of a token-based system is “definitely art versus science.”

    A critical component in the software development lifecycle is preserving context when using large native models, he explained. When you find something that works, you can add it to your company’s “corpus of context” that can be sent with every request. That’s big and it costs money each time.

    “Over 50%, up to 80% of your costs is just resending the same information back into the same engine again on every request,” said Chen.

    In theory, the more they do should require less cost per unit. “I know when a transaction happens, I'll pay the X cent tax for each one, but I don't want to be limited to use the technology for all these other creative ideas."

    The 'vindication moment' for Recursion

    Recursion, for its part, has focused on meeting broad-ranging compute needs via a hybrid infrastructure of on-premise clusters and cloud inference.

    When initially looking to build out its AI infrastructure, the company had to go with its own setup, as “the cloud providers didn't have very many good offerings,” explained CTO Ben Mabey. “The vindication moment was that we needed more compute and we looked to the cloud providers and they were like, ‘Maybe in a year or so.’”

    The company’s first cluster in 2017 incorporated Nvidia gaming GPUs (1080s, launched in 2016); they have since added Nvidia H100s and A100s, and use a Kubernetes cluster that they run in the cloud or on-prem.

    Addressing the longevity question, Mabey noted: “These gaming GPUs are actually still being used today, which is crazy, right? The myth that a GPU's life span is only three years, that's definitely not the case. A100s are still top of the list, they're the workhorse of the industry.”

    Best use cases on-prem vs cloud; cost differences

    More recently, Mabey’s team has been training a foundation model on Recursion’s image repository (which consists of petabytes of data and more than 200 pictures). This and other types of big training jobs have required a “massive cluster” and connected, multi-node setups.

    “When we need that fully-connected network and access to a lot of our data in a high parallel file system, we go on-prem,” he explained. On the other hand, shorter workloads run in the cloud.

    Recursion’s method is to “pre-empt” GPUs and Google tensor processing units (TPUs), which is the process of interrupting running GPU tasks to work on higher-priority ones. “Because we don't care about the speed in some of these inference workloads where we're uploading biological data, whether that's an image or sequencing data, DNA data,” Mabey explained. “We can say, ‘Give this to us in an hour,’ and we're fine if it kills the job.”

    From a cost perspective, moving large workloads on-prem is “conservatively” 10 times cheaper, Mabey noted; for a five year TCO, it's half the cost. On the other hand, for smaller storage needs, the cloud can be “pretty competitive” cost-wise.

    Ultimately, Mabey urged tech leaders to step back and determine whether they’re truly willing to commit to AI; cost-effective solutions typically require multi-year buy-ins.

    “From a psychological perspective, I've seen peers of ours who will not invest in compute, and as a result they're always paying on demand," said Mabey. "Their teams use far less compute because they don't want to run up the cloud bill. Innovation really gets hampered by people not wanting to burn money.”

  • Google debuts AI chips with 4X performance boost, secures Anthropic megadeal worth billions

    Google Cloud is introducing what it calls its most powerful artificial intelligence infrastructure to date, unveiling a seventh-generation Tensor Processing Unit and expanded Arm-based computing options designed to meet surging demand for AI model deployment — what the company characterizes as a fundamental industry shift from training models to serving them to billions of users.

    The announcement, made Thursday, centers on Ironwood, Google's latest custom AI accelerator chip, which will become generally available in the coming weeks. In a striking validation of the technology, Anthropic, the AI safety company behind the Claude family of models, disclosed plans to access up to one million of these TPU chips — a commitment worth tens of billions of dollars and among the largest known AI infrastructure deals to date.

    The move underscores an intensifying competition among cloud providers to control the infrastructure layer powering artificial intelligence, even as questions mount about whether the industry can sustain its current pace of capital expenditure. Google's approach — building custom silicon rather than relying solely on Nvidia's dominant GPU chips — amounts to a long-term bet that vertical integration from chip design through software will deliver superior economics and performance.

    Why companies are racing to serve AI models, not just train them

    Google executives framed the announcements around what they call "the age of inference" — a transition point where companies shift resources from training frontier AI models to deploying them in production applications serving millions or billions of requests daily.

    "Today's frontier models, including Google's Gemini, Veo, and Imagen and Anthropic's Claude train and serve on Tensor Processing Units," said Amin Vahdat, vice president and general manager of AI and Infrastructure at Google Cloud. "For many organizations, the focus is shifting from training these models to powering useful, responsive interactions with them."

    This transition has profound implications for infrastructure requirements. Where training workloads can often tolerate batch processing and longer completion times, inference — the process of actually running a trained model to generate responses — demands consistently low latency, high throughput, and unwavering reliability. A chatbot that takes 30 seconds to respond, or a coding assistant that frequently times out, becomes unusable regardless of the underlying model's capabilities.

    Agentic workflows — where AI systems take autonomous actions rather than simply responding to prompts — create particularly complex infrastructure challenges, requiring tight coordination between specialized AI accelerators and general-purpose computing.

    Inside Ironwood's architecture: 9,216 chips working as one supercomputer

    Ironwood is more than incremental improvement over Google's sixth-generation TPUs. According to technical specifications shared by the company, it delivers more than four times better performance for both training and inference workloads compared to its predecessor — gains that Google attributes to a system-level co-design approach rather than simply increasing transistor counts.

    The architecture's most striking feature is its scale. A single Ironwood "pod" — a tightly integrated unit of TPU chips functioning as one supercomputer — can connect up to 9,216 individual chips through Google's proprietary Inter-Chip Interconnect network operating at 9.6 terabits per second. To put that bandwidth in perspective, it's roughly equivalent to downloading the entire Library of Congress in under two seconds.

    This massive interconnect fabric allows the 9,216 chips to share access to 1.77 petabytes of High Bandwidth Memory — memory fast enough to keep pace with the chips' processing speeds. That's approximately 40,000 high-definition Blu-ray movies' worth of working memory, instantly accessible by thousands of processors simultaneously. "For context, that means Ironwood Pods can deliver 118x more FP8 ExaFLOPS versus the next closest competitor," Google stated in technical documentation.

    The system employs Optical Circuit Switching technology that acts as a "dynamic, reconfigurable fabric." When individual components fail or require maintenance — inevitable at this scale — the OCS technology automatically reroutes data traffic around the interruption within milliseconds, allowing workloads to continue running without user-visible disruption.

    This reliability focus reflects lessons learned from deploying five previous TPU generations. Google reported that its fleet-wide uptime for liquid-cooled systems has maintained approximately 99.999% availability since 2020 — equivalent to less than six minutes of downtime per year.

    Anthropic's billion-dollar bet validates Google's custom silicon strategy

    Perhaps the most significant external validation of Ironwood's capabilities comes from Anthropic's commitment to access up to one million TPU chips — a staggering figure in an industry where even clusters of 10,000 to 50,000 accelerators are considered massive.

    "Anthropic and Google have a longstanding partnership and this latest expansion will help us continue to grow the compute we need to define the frontier of AI," said Krishna Rao, Anthropic's chief financial officer, in the official partnership agreement. "Our customers — from Fortune 500 companies to AI-native startups — depend on Claude for their most important work, and this expanded capacity ensures we can meet our exponentially growing demand."

    According to a separate statement, Anthropic will have access to "well over a gigawatt of capacity coming online in 2026" — enough electricity to power a small city. The company specifically cited TPUs' "price-performance and efficiency" as key factors in the decision, along with "existing experience in training and serving its models with TPUs."

    Industry analysts estimate that a commitment to access one million TPU chips, with associated infrastructure, networking, power, and cooling, likely represents a multi-year contract worth tens of billions of dollars — among the largest known cloud infrastructure commitments in history.

    James Bradbury, Anthropic's head of compute, elaborated on the inference focus: "Ironwood's improvements in both inference performance and training scalability will help us scale efficiently while maintaining the speed and reliability our customers expect."

    Google's Axion processors target the computing workloads that make AI possible

    Alongside Ironwood, Google introduced expanded options for its Axion processor family — custom Arm-based CPUs designed for general-purpose workloads that support AI applications but don't require specialized accelerators.

    The N4A instance type, now entering preview, targets what Google describes as "microservices, containerized applications, open-source databases, batch, data analytics, development environments, experimentation, data preparation and web serving jobs that make AI applications possible." The company claims N4A delivers up to 2X better price-performance than comparable current-generation x86-based virtual machines.

    Google is also previewing C4A metal, its first bare-metal Arm instance, which provides dedicated physical servers for specialized workloads such as Android development, automotive systems, and software with strict licensing requirements.

    The Axion strategy reflects a growing conviction that the future of computing infrastructure requires both specialized AI accelerators and highly efficient general-purpose processors. While a TPU handles the computationally intensive task of running an AI model, Axion-class processors manage data ingestion, preprocessing, application logic, API serving, and countless other tasks in a modern AI application stack.

    Early customer results suggest the approach delivers measurable economic benefits. Vimeo reported observing "a 30% improvement in performance for our core transcoding workload compared to comparable x86 VMs" in initial N4A tests. ZoomInfo measured "a 60% improvement in price-performance" for data processing pipelines running on Java services, according to Sergei Koren, the company's chief infrastructure architect.

    Software tools turn raw silicon performance into developer productivity

    Hardware performance means little if developers cannot easily harness it. Google emphasized that Ironwood and Axion are integrated into what it calls AI Hypercomputer — "an integrated supercomputing system that brings together compute, networking, storage, and software to improve system-level performance and efficiency."

    According to an October 2025 IDC Business Value Snapshot study, AI Hypercomputer customers achieved on average 353% three-year return on investment, 28% lower IT costs, and 55% more efficient IT teams.

    Google disclosed several software enhancements designed to maximize Ironwood utilization. Google Kubernetes Engine now offers advanced maintenance and topology awareness for TPU clusters, enabling intelligent scheduling and highly resilient deployments. The company's open-source MaxText framework now supports advanced training techniques including Supervised Fine-Tuning and Generative Reinforcement Policy Optimization.

    Perhaps most significant for production deployments, Google's Inference Gateway intelligently load-balances requests across model servers to optimize critical metrics. According to Google, it can reduce time-to-first-token latency by 96% and serving costs by up to 30% through techniques like prefix-cache-aware routing.

    The Inference Gateway monitors key metrics including KV cache hits, GPU or TPU utilization, and request queue length, then routes incoming requests to the optimal replica. For conversational AI applications where multiple requests might share context, routing requests with shared prefixes to the same server instance can dramatically reduce redundant computation.

    The hidden challenge: powering and cooling one-megawatt server racks

    Behind these announcements lies a massive physical infrastructure challenge that Google addressed at the recent Open Compute Project EMEA Summit. The company disclosed that it's implementing +/-400 volt direct current power delivery capable of supporting up to one megawatt per rack — a tenfold increase from typical deployments.

    "The AI era requires even greater power delivery capabilities," explained Madhusudan Iyengar and Amber Huffman, Google principal engineers, in an April 2025 blog post. "ML will require more than 500 kW per IT rack before 2030."

    Google is collaborating with Meta and Microsoft to standardize electrical and mechanical interfaces for high-voltage DC distribution. The company selected 400 VDC specifically to leverage the supply chain established by electric vehicles, "for greater economies of scale, more efficient manufacturing, and improved quality and scale."

    On cooling, Google revealed it will contribute its fifth-generation cooling distribution unit design to the Open Compute Project. The company has deployed liquid cooling "at GigaWatt scale across more than 2,000 TPU Pods in the past seven years" with fleet-wide availability of approximately 99.999%.

    Water can transport approximately 4,000 times more heat per unit volume than air for a given temperature change — critical as individual AI accelerator chips increasingly dissipate 1,000 watts or more.

    Custom silicon gambit challenges Nvidia's AI accelerator dominance

    Google's announcements come as the AI infrastructure market reaches an inflection point. While Nvidia maintains overwhelming dominance in AI accelerators — holding an estimated 80-95% market share — cloud providers are increasingly investing in custom silicon to differentiate their offerings and improve unit economics.

    Amazon Web Services pioneered this approach with Graviton Arm-based CPUs and Inferentia / Trainium AI chips. Microsoft has developed Cobalt processors and is reportedly working on AI accelerators. Google now offers the most comprehensive custom silicon portfolio among major cloud providers.

    The strategy faces inherent challenges. Custom chip development requires enormous upfront investment — often billions of dollars. The software ecosystem for specialized accelerators lags behind Nvidia's CUDA platform, which benefits from 15+ years of developer tools. And rapid AI model architecture evolution creates risk that custom silicon optimized for today's models becomes less relevant as new techniques emerge.

    Yet Google argues its approach delivers unique advantages. "This is how we built the first TPU ten years ago, which in turn unlocked the invention of the Transformer eight years ago — the very architecture that powers most of modern AI," the company noted, referring to the seminal "Attention Is All You Need" paper from Google researchers in 2017.

    The argument is that tight integration — "model research, software, and hardware development under one roof" — enables optimizations impossible with off-the-shelf components.

    Beyond Anthropic, several other customers provided early feedback. Lightricks, which develops creative AI tools, reported that early Ironwood testing "makes us highly enthusiastic" about creating "more nuanced, precise, and higher-fidelity image and video generation for our millions of global customers," said Yoav HaCohen, the company's research director.

    Google's announcements raise questions that will play out over coming quarters. Can the industry sustain current infrastructure spending, with major AI companies collectively committing hundreds of billions of dollars? Will custom silicon prove economically superior to Nvidia GPUs? How will model architectures evolve?

    For now, Google appears committed to a strategy that has defined the company for decades: building custom infrastructure to enable applications impossible on commodity hardware, then making that infrastructure available to customers who want similar capabilities without the capital investment.

    As the AI industry transitions from research labs to production deployments serving billions of users, that infrastructure layer — the silicon, software, networking, power, and cooling that make it all run — may prove as important as the models themselves.

    And if Anthropic's willingness to commit to accessing up to one million chips is any indication, Google's bet on custom silicon designed specifically for the age of inference may be paying off just as demand reaches its inflection point.

  • Moonshot’s Kimi K2 Thinking emerges as leading open source AI, outperforming GPT-5, Claude Sonnet 4.5 on key benchmarks

    Even as concern and skepticism grows over U.S. AI startup OpenAI's buildout strategy and high spending commitments, Chinese open source AI providers are escalating their competition and one has even caught up to OpenAI's flagship, paid proprietary model GPT-5 in key third-party performance benchmarks with a new, free model.

    The Chinese AI startup Moonshot AI’s new Kimi K2 Thinking model, released today, has vaulted past both proprietary and open-weight competitors to claim the top position in reasoning, coding, and agentic-tool benchmarks.

    Despite being fully open-source, the model now outperforms OpenAI’s GPT-5, Anthropic’s Claude Sonnet 4.5 (Thinking mode), and xAI's Grok-4 on several standard evaluations — an inflection point for the competitiveness of open AI systems.

    Developers can access the model via platform.moonshot.ai and kimi.com; weights and code are hosted on Hugging Face. The open release includes APIs for chat, reasoning, and multi-tool workflows.

    Users can try out Kimi K2 Thinking directly through its own ChatGPT-like website competitor and on a Hugging Face space as well.

    Modified Standard Open Source License

    Moonshot AI has formally released Kimi K2 Thinking under a Modified MIT License on Hugging Face.

    The license grants full commercial and derivative rights — meaning individual researchers and developers working on behalf of enterprise clients can access it freely and use it in commercial applications — but adds one restriction:

    "If the software or any derivative product serves over 100 million monthly active users or generates over $20 million USD per month in revenue, the deployer must prominently display 'Kimi K2' on the product’s user interface."

    For most research and enterprise applications, this clause functions as a light-touch attribution requirement while preserving the freedoms of standard MIT licensing.

    It makes K2 Thinking one of the most permissively licensed frontier-class models currently available.

    A New Benchmark Leader

    Kimi K2 Thinking is a Mixture-of-Experts (MoE) model built around one trillion parameters, of which 32 billion activate per inference.

    It combines long-horizon reasoning with structured tool use, executing up to 200–300 sequential tool calls without human intervention.

    According to Moonshot’s published test results, K2 Thinking achieved:

    • 44.9 % on Humanity’s Last Exam (HLE), a state-of-the-art score;

    • 60.2 % on BrowseComp, an agentic web-search and reasoning test;

    • 71.3 % on SWE-Bench Verified and 83.1 % on LiveCodeBench v6, key coding evaluations;

    • 56.3 % on Seal-0, a benchmark for real-world information retrieval.

    Across these tasks, K2 Thinking consistently outperforms GPT-5’s corresponding scores and surpasses the previous open-weight leader MiniMax-M2—released just weeks earlier by Chinese rival MiniMax AI.

    Open Model Outperforms Proprietary Systems

    GPT-5 and Claude Sonnet 4.5 Thinking remain the leading proprietary “thinking” models.

    Yet in the same benchmark suite, K2 Thinking’s agentic reasoning scores exceed both: for instance, on BrowseComp the open model’s 60.2 % decisively leads GPT-5’s 54.9 % and Claude 4.5’s 24.1 %.

    K2 Thinking also edges GPT-5 in GPQA Diamond (85.7 % vs 84.5 %) and matches it on mathematical reasoning tasks such as AIME 2025 and HMMT 2025.

    Only in certain heavy-mode configurations—where GPT-5 aggregates multiple trajectories—does the proprietary model regain parity.

    That Moonshot’s fully open-weight release can meet or exceed GPT-5’s scores marks a turning point. The gap between closed frontier systems and publicly available models has effectively collapsed for high-end reasoning and coding.

    Surpassing MiniMax-M2: The Previous Open-Source Benchmark

    When VentureBeat profiled MiniMax-M2 just a week and a half ago, it was hailed as the “new king of open-source LLMs,” achieving top scores among open-weight systems:

    • τ²-Bench 77.2

    • BrowseComp 44.0

    • FinSearchComp-global 65.5

    • SWE-Bench Verified 69.4

    Those results placed MiniMax-M2 near GPT-5-level capability in agentic tool use. Yet Kimi K2 Thinking now eclipses them by wide margins.

    Its BrowseComp result of 60.2 % exceeds M2’s 44.0 %, and its SWE-Bench Verified 71.3 % edges out M2’s 69.4 %. Even on financial-reasoning tasks such as FinSearchComp-T3 (47.4 %), K2 Thinking performs comparably while maintaining superior general-purpose reasoning.

    Technically, both models adopt sparse Mixture-of-Experts architectures for compute efficiency, but Moonshot’s network activates more experts and deploys advanced quantization-aware training (INT4 QAT).

    This design doubles inference speed relative to standard precision without degrading accuracy—critical for long “thinking-token” sessions reaching 256 k context windows.

    Agentic Reasoning and Tool Use

    K2 Thinking’s defining capability lies in its explicit reasoning trace. The model outputs an auxiliary field, reasoning_content, revealing intermediate logic before each final response. This transparency preserves coherence across long multi-turn tasks and multi-step tool calls.

    A reference implementation published by Moonshot demonstrates how the model autonomously conducts a “daily news report” workflow: invoking date and web-search tools, analyzing retrieved content, and composing structured output—all while maintaining internal reasoning state.

    This end-to-end autonomy enables the model to plan, search, execute, and synthesize evidence across hundreds of steps, mirroring the emerging class of “agentic AI” systems that operate with minimal supervision.

    Efficiency and Access

    Despite its trillion-parameter scale, K2 Thinking’s runtime cost remains modest. Moonshot lists usage at:

    • $0.15 / 1 M tokens (cache hit)

    • $0.60 / 1 M tokens (cache miss)

    • $2.50 / 1 M tokens output

    These rates are competitive even against MiniMax-M2’s $0.30 input / $1.20 output pricing—and an order of magnitude below GPT-5 ($1.25 input / $10 output).

    Comparative Context: Open-Weight Acceleration

    The rapid succession of M2 and K2 Thinking illustrates how quickly open-source research is catching frontier systems. MiniMax-M2 demonstrated that open models could approach GPT-5-class agentic capability at a fraction of the compute cost. Moonshot has now advanced that frontier further, pushing open weights beyond parity into outright leadership.

    Both models rely on sparse activation for efficiency, but K2 Thinking’s higher activation count (32 B vs 10 B active parameters) yields stronger reasoning fidelity across domains. Its test-time scaling—expanding “thinking tokens” and tool-calling turns—provides measurable performance gains without retraining, a feature not yet observed in MiniMax-M2.

    Technical Outlook

    Moonshot reports that K2 Thinking supports native INT4 inference and 256 k-token contexts with minimal performance degradation. Its architecture integrates quantization, parallel trajectory aggregation (“heavy mode”), and Mixture-of-Experts routing tuned for reasoning tasks.

    In practice, these optimizations allow K2 Thinking to sustain complex planning loops—code compile–test–fix, search–analyze–summarize—over hundreds of tool calls. This capability underpins its superior results on BrowseComp and SWE-Bench, where reasoning continuity is decisive.

    Enormous Implications for the AI Ecosystem

    The convergence of open and closed models at the high end signals a structural shift in the AI landscape. Enterprises that once relied exclusively on proprietary APIs can now deploy open alternatives matching GPT-5-level reasoning while retaining full control of weights, data, and compliance.

    Moonshot’s open publication strategy follows the precedent set by DeepSeek R1, Qwen3, GLM-4.6 and MiniMax-M2 but extends it to full agentic reasoning.

    For academic and enterprise developers, K2 Thinking provides both transparency and interoperability—the ability to inspect reasoning traces and fine-tune performance for domain-specific agents.

    The arrival of K2 Thinking signals that Moonshot — a young startup founded in 2023 with investment from some of China's biggest apps and tech companies — is here to play in an intensifying competition, and comes amid growing scrutiny of the financial sustainability of AI’s largest players.

    Just a day ago, OpenAI CFO Sarah Friar sparked controversy after suggesting at WSJ Tech Live event that the U.S. government might eventually need to provide a “backstop” for the company’s more than $1.4 trillion in compute and data-center commitments — a comment widely interpreted as a call for taxpayer-backed loan guarantees.

    Although Friar later clarified that OpenAI was not seeking direct federal support, the episode reignited debate about the scale and concentration of AI capital spending.

    With OpenAI, Microsoft, Meta, and Google all racing to secure long-term chip supply, critics warn of an unsustainable investment bubble and “AI arms race” driven more by strategic fear than commercial returns — one that could "blow up" and take down the entire global economy with it if there is hesitation or market uncertainty, as so many trades and valuations have now been made in anticipation of continued hefty AI investment and massive returns.

    Against that backdrop, Moonshot AI’s and MiniMax’s open-weight releases put more pressure on U.S. proprietary AI firms and their backers to justify the size of the investments and paths to profitability.

    If an enterprise customer can just as easily get comparable or better performance from a free, open source Chinese AI model than they do with paid, proprietary AI solutions like OpenAI's GPT-5, Anthropic's Claude Sonnet 4.5, or Google's Gemini 2.5 Pro — why would they continue paying to access the proprietary models? Already, Silicon Valley stalwarts like Airbnb have raised eyebrows for admitting to heavily using Chinese open source alternatives like Alibaba's Qwen over OpenAI's proprietary offerings.

    For investors and enterprises, these developments suggest that high-end AI capability is no longer synonymous with high-end capital expenditure. The most advanced reasoning systems may now come not from companies building gigascale data centers, but from research groups optimizing architectures and quantization for efficiency.

    In that sense, K2 Thinking’s benchmark dominance is not just a technical milestone—it’s a strategic one, arriving at a moment when the AI market’s biggest question has shifted from how powerful models can become to who can afford to sustain them.

    What It Means for Enterprises Going Forward

    Within weeks of MiniMax-M2’s ascent, Kimi K2 Thinking has overtaken it—along with GPT-5 and Claude 4.5—across nearly every reasoning and agentic benchmark.

    The model demonstrates that open-weight systems can now meet or surpass proprietary frontier models in both capability and efficiency.

    For the AI research community, K2 Thinking represents more than another open model: it is evidence that the frontier has become collaborative.

    The best-performing reasoning model available today is not a closed commercial product but an open-source system accessible to anyone.

  • From logs to insights: The AI breakthrough redefining observability

    Presented by Elastic


    Logs set to become the primary tool for finding the “why” in diagnosing network incidents

    Modern IT environments have a data problem: there’s too much of it. Organizations that need to manage a company’s environment are increasingly challenged to detect and diagnose issues in real-time, optimize performance, improve reliability, and ensure security and compliance — all within constrained budgets.

    The modern observability landscape has many tools that offer a solution. Most revolve around DevOps teams or Site Reliability Engineers (SREs) analyzing logs, metrics, and traces to uncover patterns and figure out what’s happening across the network, and diagnose why an issue or incident occurred. The problem is that the process creates information overload: A Kubernetes cluster alone can emit 30 to 50 gigabytes of logs a day, and suspicious behavior patterns can sneak past human eyes.

    "It’s so anachronistic now, in the world of AI, to think about humans alone observing infrastructure," says Ken Exner, chief product officer at Elastic. "I hate to break it to you, but machines are better than human beings at pattern matching.“

    An industry-wide focus on visualizing symptoms forces engineers to manually hunt for answers. The crucial "why" is buried in logs, but because they contain massive volumes of unstructured data, the industry tends to use them as a tool of last resort. This has forced teams into costly tradeoffs: either spend countless hours building complex data pipelines, drop valuable log data and risk critical visibility gaps, or log and forget.

    Elastic, the Search AI Company, recently released a new feature for observability called Streams, which aims to become the primary signal for investigations by taking noisy logs and turning them into patterns, context and meaning.

    Streams uses AI to automatically partition and parse raw logs to extract relevant fields, and greatly reduce the effort required of SREs to make logs usable. Streams also automatically surfaces significant events such as critical errors and anomalies from context-rich logs, giving SREs early warnings and a clear understanding of their workloads, enabling them to investigate and resolve issues faster. The ultimate goal is to show remediation steps.

    "From raw, voluminous, messy data, Streams automatically creates structure, putting it into a form that is usable, automatically alerts you to issues and helps you remediate them," Exner says. "That is the magic of Streams."

    A broken workflow

    Streams upends an observability process that some say is broken. Typically, SREs set up metrics, logs and traces. Then they set up alerts, and service level objectives (SLOs) — often hard-coded rules to show where a service or process has gone beyond a threshold, or a specific pattern has been detected.

    When an alert is triggered, it points to the metric that's showing an anomaly. From there, SREs look at a metrics dashboard, where they can visualize the issue and compare the alert to other metrics, or CPU to memory to I/O, and start looking for patterns.

    They may then need to look at a trace, and examine upstream and downstream dependencies across the application to dig into the root cause of the issue. Once they figure out what's causing the trouble, they jump into the logs for that database or service to try and debug the issue.

    Some companies simply seek to add more tools when current ones prove ineffective. That means SREs are hopping from tool to tool to keep on top of monitoring and troubleshooting across their infrastructure and applications.

    "You’re hopping across different tools. You’re relying on a human to interpret these things, visually look at the relationship between systems in a service map, visually look at graphs on a metrics dashboard, to figure out what and where the issue is, " Exner says. "But AI automates that workflow away."

    With AI-powered Streams, logs are not just used reactively to resolve issues, but also to proactively process potential issues and create information-rich alerts that help teams jump straight to problem-solving, offering a solution for remediation or even fixing the issue entirely, before automatically notifying the team that it's been taken care of.

    "I believe that logs, the richest set of information, the original signal type, will start driving a lot of the automation that a service reliability engineer typically does today, and does very manually," he adds. "A human should not be in that process, where they are doing this by digging into themselves, trying to figure out what is going on, where and what the issue is, and then once they find the root cause, they’re trying to figure out how to debug it."

    Observability’s future

    Large language models (LLMs) could be a key player in the future of observability. LLMs excel at recognizing patterns in vast quantities of repetitive data, which closely resembles log and telemetry data in complex, dynamic systems. And today’s LLMs can be trained for specific IT processes. With automation tooling, the LLM has the information and tools it needs to resolve database errors or Java heap issues, and more. Incorporating those into platforms that bring context and relevance will be essential.

    Automated remediation will still take some time, Exner says, but automated runbooks and playbooks generated by LLMs will become standard practice within the next couple of years. In other words, remediation steps will be driven by LLMs. The LLM will offer up fixes, and the human will verify and implement them, rather than calling in an expert.

    Addressing skill shortages

    Going all in on AI for observability would help address a major shortage in the talent needed to manage IT infrastructure. Hiring is slow because organizations need teams with a great deal of experience and understanding of potential issues, and how to resolve them fast. That experience can come from an LLM that is contextually grounded, Exner says.

    "We can help deal with the skill shortage by augmenting people with LLMs that make them all instantly experts," he explains. "I think this is going to make it much easier for us to take novice practitioners and make them expert practitioners in both security and observability, and it’s going to make it possible for a more novice practitioner to act like an expert.”

    Streams in Elastic Observability is available now. Get started by reading more on the Streams.


    Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.

  • Google Cloud updates its AI Agent Builder with new observability dashboard and faster build-and-deploy tools

    Google Cloud has introduced a big update in a bid to keep AI developers on its Vertex AI platform for concepting, designing, building, testing, deploying and modifying AI agents in enterprise use cases.

    The new features, announced today, include additional governance tools for enterprises and expanding the capabilities for creating agents with just a few lines of code, moving faster with state-of-the-art context management layers and one-click deployment, as well as managed services for scaling production and evaluation, and support for identifying agents.

    Agent Builder, released last year during its annual Cloud Next event, provides a no-code platform for enterprises to create agents and connect these to orchestration frameworks like LangChain.

    Google’s Agent Development Kit (ADK), which lets developers build agents “in under 100 lines of code,” can also be accessed through Agent Builder. 

    “These new capabilities underscore our commitment to Agent Builder, and simplify the agent development process to meet developers where they are, no matter which tech stack they choose,” said Mike Clark, director of Product Management, Vertex AI Agent Builder. 

    Build agents faster

    Part of Google’s pitch for Agent Builder’s new features is that enterprises can bake in-orchestration even as they construct their agents. 

    “Building an agent from a concept to a working product involves complex orchestration,” said Clark. 

    The new capabilities, which are shipped with the ADK, include:

    • SOTA context management layers including Static, Turn, User and Cache layers so enterprises have more control over the agents’ context

    • Prebuilt plugins with customizable logic. One of the new plugins allows agents to recognize failed tool calls and “self-heal” by retrying the task with a different approach

    • Additional language support in ADK, including Go, alongside Python and Java, that launched with ADK

    • One-click deployment through the ADK command line interface to move agents from a local environment to live testing with a single command

    Governance layer

    Enterprises require high accuracy; security; observability and auditability (what a program did and why); and steerability (control) in their production-grade AI agents.

    While Google had observability features in the local development environment at launch, developers can now access these tools through the Agent Engine managed runtime dashboard.

    The company said this brings cloud-based production monitoring to track token consumption, error rates and latency. Within this observability dashboard, enterprises can visualize the actions agents take and reproduce any issues. 

    Agent Engine will also have a new Evaluation Layer to help “simulate agent performance across a vast array of user interactions and situations.”

    This governance layer will also include:

    • Agent Identities that Google said give “agents their own unique, native identities within Google Cloud 

    • Model Armor, which would block prompt injections, screen tool calls and agent responses

    • Security Command Center, so admins can build an inventory of their agents to detect threats like unauthorized access

    “These native identities provide a deep, built-in layer of control and a clear audit trail for all agent actions. These certificate-backed identities further strengthen your security as they cannot be impersonated and are tied directly to the agent's lifecycle, eliminating the risk of dormant accounts,” Clark said. 

    The battle of agent builders 

    It’s no surprise that model providers create platforms to build agents and bring them to production. The competition lies in how fast new tools and features are added.

    Google’s Agent Builder competes with OpenAI’s open-source Agent Development Kit, which enables developers to create AI agents using non-OpenAI models.

    Additionally, there is the recently announced AgentKit, which features an Agent Builder that enables companies to integrate agents into their applications easily.

    Microsoft has its Azure AI Foundry, launched last year around this time for AI agent creation, and AWS also offers agent builders on its Bedrock platform, but Google is hoping is suite of new features will help give it a competitive edge.

    However, it isn’t just companies with their own models that court developers to build their AI agents within their platforms. Any enterprise service provider with an agent library also wants clients to make agents on their systems. 

    Capturing developer interest and keeping them within the ecosystem is the big battle between tech companies now, with features to make building and governing agents easier. 

  • Attention ISN’T all you need?! New Qwen3 variant Brumby-14B-Base leverages Power Retention technique

    When the transformer architecture was introduced in 2017 in the now seminal Google paper "Attention Is All You Need," it became an instant cornerstone of modern artificial intelligence.

    Every major large language model (LLM) — from OpenAI's GPT series to Anthropic's Claude, Google's Gemini, and Meta's Llama — has been built on some variation of its central mechanism: attention, the mathematical operation that allows a model to look back across its entire input and decide what information matters most.

    Eight years later, the same mechanism that defined AI’s golden age is now showing its limits. Attention is powerful, but it is also expensive — its computational and memory costs scale quadratically with context length, creating an increasingly unsustainable bottleneck for both research and industry. As models aim to reason across documents, codebases, or video streams lasting hours or days, attention becomes the architecture’s Achilles’ heel.

    On October 28, 2025, the little-known AI startup Manifest AI introduced a radical alternative. Their new model, Brumby-14B-Base, is a retrained variant of Qwen3-14B-Base, one of the leading open-source transformer models.

    But while many variants of Qwen have been trained already, Brumby-14B-Base is novel in that it abandons attention altogether.

    Instead, Brumby replaces those layers with a novel mechanism called Power Retention—a recurrent, hardware-efficient architecture that stores and updates information over arbitrarily long contexts without the exponential memory growth of attention.

    Trained at a stated cost of just $4,000, the 14-billion-parameter Brumby model performs on par with established transformer models like Qwen3-14B and GLM-4.5-Air, achieving near-state-of-the-art accuracy on a range of reasoning and comprehension benchmarks.

    From Attention to Retention: The Architectural Shift

    The core of Manifest AI’s innovation lies in what they call the Power Retention layer.

    In a traditional transformer, every token computes a set of queries (Q), keys (K), and values (V), then performs a matrix operation that measures the similarity between every token and every other token—essentially a full pairwise comparison across the sequence.

    This is what gives attention its flexibility, but also what makes it so costly: processing a sequence twice as long takes roughly four times the compute and memory.

    Power Retention keeps the same inputs (Q, K, V), but replaces the global similarity operation with a recurrent state update.

    Each layer maintains a memory matrix S, which is updated at each time step according to the incoming key, value, and a learned gating signal.

    The process looks more like an RNN (Recurrent Neural Network) than a transformer: instead of recomputing attention over the entire context, the model continuously compresses past information into a fixed-size latent state.

    This means the computational cost of Power Retention does not grow with context length. Whether the model is processing 1,000 or 1,000,000 tokens, the per-token cost remains constant.

    That property alone—constant-time per-token computation—marks a profound departure from transformer behavior.

    At the same time, Power Retention preserves the expressive power that made attention successful. Because the recurrence involves tensor powers of the input (hence the name “power retention”), it can represent higher-order dependencies between past and present tokens.

    The result is an architecture that can theoretically retain long-term dependencies indefinitely, while remaining as efficient as an RNN and as expressive as a transformer.

    Retraining, Not Rebuilding

    Perhaps the most striking aspect of Brumby-14B’s training process is its efficiency. Manifest AI trained the model for only 60 hours on 32 Nvidia H100 GPUs, at a cost of roughly $4,000 — less than 2% of what a conventional model of this scale would cost to train from scratch.

    However, since it relied on a transformer-based model, it's safe to say that this advance alone will not end the transformer AI-era.

    As Jacob Buckman, founder of Manifest AI, clarified in an email to VentureBeat: “The ability to train for $4,000 is indeed only possible when leveraging an existing transformer model,” he said. “Brumby could not be trained from scratch for that price.”

    Still, Buckman emphasized the significance of that result: “The reason this is important is that the ability to build on the weights of the previous generation of model architectures is a critical accelerant for the adoption of a new modeling paradigm.”

    He argues this demonstrates how attention-free systems can catch up to transformer performance “for orders-of-magnitude less” investment.

    In the loss curves released by Manifest AI, Brumby’s training loss quickly converges to that of the Qwen3 baseline within 3,000 training steps, even as the architecture diverges significantly from its transformer origins.

    Although Brumby-14B-Base began life as Qwen3-14B-Base, it did not remain identical for long. Manifest AI fundamentally altered Qwen3’s architecture by removing its attention layers—the mathematical engine that defines how a transformer model processes information—and replacing them with their new “power retention” mechanism. This change restructured the model’s internal wiring, effectively giving it a new brain while preserving much of its prior knowledge.

    Because of that architectural swap, the existing Qwen3 weights no longer fit perfectly. They were trained to operate within a transformer’s attention dynamics, not the new retention-based system. As a result, the Brumby model initially “forgot” how to apply some of its learned knowledge effectively. The retraining process—about 3,000 steps of additional learning—served to recalibrate those weights, aligning them with the power retention framework without having to start from zero.

    A helpful way to think about this is to imagine taking a world-class pianist and handing them a guitar. They already understand rhythm, harmony, and melody, but their hands must learn entirely new patterns to produce the same music. Similarly, Brumby had to relearn how to use its existing knowledge through a new computational instrument. Those 3,000 training steps were, in effect, its crash course in guitar lessons.

    By the end of this short retraining phase, Brumby had regained its full performance, reaching the same accuracy as the original Qwen3 model. That quick recovery is what makes the result so significant: it shows that an attention-free system can inherit and adapt the capabilities of a transformer model with only a fraction of the training time and cost.

    The benchmark progression plots show a similar trend: the model rapidly approaches its target accuracy on core evaluations like GSM8K, HellaSwag, and MMLU after only a few thousand steps, matching or even slightly surpassing Qwen3 on several tasks.

    Benchmarking the Brumby

    Across standard evaluation tasks, Brumby-14B-Base consistently performs at or near parity with transformer baselines of comparable scale.

    Task

    Brumby-14B

    Qwen3-14B

    GLM-4.5-Air

    Nemotron Nano (12B)

    ARC

    0.89

    0.94

    0.92

    0.93

    GSM8K

    0.88

    0.84

    0.83

    0.84

    GSM8K (Platinum)

    0.87

    0.88

    0.85

    0.87

    HellaSwag

    0.77

    0.81

    0.85

    0.82

    MATH

    0.62

    0.54

    0.47

    0.26

    MBPP

    0.57

    0.75

    0.73

    0.71

    MMLU

    0.71

    0.78

    0.77

    0.78

    MMLU (Pro)

    0.36

    0.55

    0.51

    0.53

    While it lags slightly behind transformers on knowledge-heavy evaluations like MMLU-Pro, it matches or outperforms them on mathematical reasoning and long-context reasoning tasks—precisely where attention architectures tend to falter. This pattern reinforces the idea that recurrent or retention-based systems may hold a structural advantage for reasoning over extended temporal or logical dependencies.

    Hardware Efficiency and Inference Performance

    Brumby’s power retention design offers another major advantage: hardware efficiency.

    Because the state update involves only local matrix operations, inference can be implemented with linear complexity in sequence length.

    Manifest AI reports that their fastest kernels, developed through their in-house CUDA framework Vidrial, can deliver hundreds-fold speedups over attention on very long contexts.

    Buckman said the alpha-stage Power Retention kernels “achieve typical hardware utilization of 80–85%, which is higher than FlashAttention2’s 70–75% or Mamba’s 50–60%.”

    (Mamba is another emerging “post-transformer” architecture developed by Carnegie Mellon scientists back in 2023 that, like Power Retention, seeks to eliminate the computational bottleneck of attention. It replaces attention with a state-space mechanism that processes sequences linearly — updating an internal state over time rather than comparing every token to every other one. This makes it far more efficient for long inputs, though it typically achieves lower hardware utilization than Power Retention in early tests.)

    Both Power Retention and Mamba, he added, “expend meaningfully fewer total FLOPs than FlashAttention2 on long contexts, as well as far less memory.”

    According to Buckman, the reported 100× speedup comes from this combined improvement in utilization and computational efficiency, though he noted that “we have not yet stress-tested it on production-scale workloads.”

    Training and Scaling Economics

    Perhaps no statistic in the Brumby release generated more attention than the training cost.

    A 14-billion-parameter model, trained for $4,000, represents a two-order-of-magnitude reduction in the cost of foundation model development.

    Buckman confirmed that the low cost reflects a broader scaling pattern. “Far from diminishing returns, we have found that ease of retraining improves with scale,” he said. “The number of steps required to successfully retrain a model decreases with its parameter count.”

    Manifest has not yet validated the cost of retraining models at 700B parameters, but Buckman projected a range of $10,000–$20,000 for models of that magnitude—still far below transformer training budgets.

    He also reiterated that this approach could democratize large-scale experimentation by allowing smaller research groups or companies to retrain or repurpose existing transformer checkpoints without prohibitive compute costs.

    Integration and Deployment

    According to Buckman, converting an existing transformer into a Power Retention model is designed to be simple.

    “It is straightforward for any company that is already retraining, post-training, or fine-tuning open-source models,” he said. “Simply pip install retention, change one line of your architecture code, and resume training where you left off.”

    He added that after only a small number of GPU-hours, the model typically recovers its original performance—at which point it gains the efficiency benefits of the attention-free design.

    “The resulting architecture will permit far faster long-context training and inference than previously,” Buckman noted.

    On infrastructure, Buckman said the main Brumby kernels are written in Triton, compatible with both NVIDIA and AMD accelerators. Specialized CUDA kernels are also available through the team’s in-house Vidrial framework. Integration with vLLM and other inference engines remains a work in progress: “We have not yet integrated Power Retention into inference engines, but doing so is a major ongoing initiative at Manifest.”

    As for distributed inference, Buckman dismissed concerns about instability: “We have not found this difficulty to be exacerbated in any way by our recurrent-state architecture. In fact, context-parallel training and GPU partitioning for multi-user inference both become significantly cleaner technically when using our approach.”

    Mission and Long-Term Vision

    Beyond the engineering details, Buckman also described Manifest’s broader mission. “Our mission is to train a neural network to model all human output,” he said.

    The team’s goal, he explained, is to move beyond modeling “artifacts of intelligence” toward modeling “the intelligent processes that generated them.” This shift, he argued, requires “fundamentally rethinking” how models are designed and trained—work that Power Retention represents only the beginning of.

    The Brumby-14B release, he said, is “one step forward in a long march” toward architectures that can model thought processes continuously and efficiently.

    Public Debate and Industry Reception

    The launch of Brumby-14B sparked immediate discussion on X (formerly Twitter), where researchers debated the framing of Manifest AI’s announcement.

    Some, including Meta researcher Ariel (@redtachyon), argued that the “$4,000 foundation model” tagline was misleading, since the training involved reusing pretrained transformer weights rather than training from scratch.

    “They shuffled around the weights of Qwen, fine-tuned it a bit, and called it ‘training a foundation model for $4k,’” Ariel wrote.

    Buckman responded publicly, clarifying that the initial tweet had been part of a longer thread explaining the retraining approach. “It’s not like I was being deceptive about it,” he wrote. “I broke it up into separate tweets, and now everyone is mad about the first one.”

    In a follow-up email, Buckman took a measured view of the controversy. “The end of the transformer era is not yet here,” he reiterated, “but the march has begun.”

    He also acknowledged that the $4,000 claim, though technically accurate in context, had drawn attention precisely because it challenged expectations about what it costs to experiment at frontier scale.

    Conclusion: A Crack in the Transformer’s Wall?

    The release of Brumby-14B-Base is more than an engineering milestone; it is a proof of concept that the transformer’s dominance may finally face credible competition.

    By replacing attention with power retention, Manifest AI has demonstrated that performance parity with state-of-the-art transformers is possible at a fraction of the computational cost—and that the long-context bottleneck can be broken without exotic hardware.

    The broader implications are twofold. First, the economics of training and serving large models could shift dramatically, lowering the barrier to entry for open research and smaller organizations.

    Second, the architectural diversity of AI models may expand again, reigniting theoretical and empirical exploration after half a decade of transformer monoculture.

    As Buckman put it: “The end of the transformer era is not yet here. Our release is just one step forward in a long march toward the future.”

  • Databricks research reveals that building better AI judges isn’t just a technical concern, it’s a people problem

    The intelligence of AI models isn't what's blocking enterprise deployments. It's the inability to define and measure quality in the first place.

    That's where AI judges are now playing an increasingly important role. In AI evaluation, a "judge" is an AI system that scores outputs from another AI system. 

    Judge Builder is Databricks' framework for creating judges and was first deployed as part of the company's Agent Bricks technology earlier this year. The framework has evolved significantly since its initial launch in response to direct user feedback and deployments.

    Early versions focused on technical implementation but customer feedback revealed the real bottleneck was organizational alignment. Databricks now offers a structured workshop process that guides teams through three core challenges: getting stakeholders to agree on quality criteria, capturing domain expertise from limited subject matter experts and deploying evaluation systems at scale.

    "The intelligence of the model is typically not the bottleneck, the models are really smart," Jonathan Frankle, Databricks' chief AI scientist, told VentureBeat in an exclusive briefing. "Instead, it's really about asking, how do we get the models to do what we want, and how do we know if they did what we wanted?"

    The 'Ouroboros problem' of AI evaluation

    Judge Builder addresses what Pallavi Koppol, a Databricks research scientist who led the development, calls the "Ouroboros problem."  An Ouroboros is an ancient symbol that depicts a snake eating its own tail. 

    Using AI systems to evaluate AI systems creates a circular validation challenge.

    "You want a judge to see if your system is good, if your AI system is good, but then your judge is also an AI system," Koppol explained. "And now you're saying like, well, how do I know this judge is good?"

    The solution is measuring "distance to human expert ground truth" as the primary scoring function. By minimizing the gap between how an AI judge scores outputs versus how domain experts would score them, organizations can trust these judges as scalable proxies for human evaluation.

    This approach differs fundamentally from traditional guardrail systems or single-metric evaluations. Rather than asking whether an AI output passed or failed on a generic quality check, Judge Builder creates highly specific evaluation criteria tailored to each organization's domain expertise and business requirements.

    The technical implementation also sets it apart. Judge Builder integrates with Databricks' MLflow and prompt optimization tools and can work with any underlying model. Teams can version control their judges, track performance over time and deploy multiple judges simultaneously across different quality dimensions.

    Lessons learned: Building judges that actually work

    Databricks' work with enterprise customers revealed three critical lessons that apply to anyone building AI judges.

    Lesson one: Your experts don't agree as much as you think. When quality is subjective, organizations discover that even their own subject matter experts disagree on what constitutes acceptable output. A customer service response might be factually correct but use an inappropriate tone. A financial summary might be comprehensive but too technical for the intended audience.

    "One of the biggest lessons of this whole process is that all problems become people problems," Frankle said. "The hardest part is getting an idea out of a person's brain and into something explicit. And the harder part is that companies are not one brain, but many brains."

    The fix is batched annotation with inter-rater reliability checks. Teams annotate examples in small groups, then measure agreement scores before proceeding. This catches misalignment early. In one case, three experts gave ratings of 1, 5 and neutral for the same output before discussion revealed they were interpreting the evaluation criteria differently.

    Companies using this approach achieve inter-rater reliability scores as high as 0.6 compared to typical scores of 0.3 from external annotation services. Higher agreement translates directly to better judge performance because the training data contains less noise.

    Lesson two: Break down vague criteria into specific judges. Instead of one judge evaluating whether a response is "relevant, factual and concise," create three separate judges. Each targets a specific quality aspect. This granularity matters because a failing "overall quality" score reveals something is wrong but not what to fix.

    The best results come from combining top-down requirements such as regulatory constraints, stakeholder priorities, with bottom-up discovery of observed failure patterns. One customer built a top-down judge for correctness but discovered through data analysis that correct responses almost always cited the top two retrieval results. This insight became a new production-friendly judge that could proxy for correctness without requiring ground-truth labels.

    Lesson three: You need fewer examples than you think. Teams can create robust judges from just 20-30 well-chosen examples. The key is selecting edge cases that expose disagreement rather than obvious examples where everyone agrees.

    "We're able to run this process with some teams in as little as three hours, so it doesn't really take that long to start getting a good judge," Koppol said.

    Production results: From pilots to seven-figure deployments

    Frankle shared three metrics Databricks uses to measure Judge Builder's success: whether customers want to use it again, whether they increase AI spending and whether they progress further in their AI journey.

    On the first metric, one customer created more than a dozen judges after their initial workshop. "This customer made more than a dozen judges after we walked them through doing this in a rigorous way for the first time with this framework," Frankle said. "They really went to town on judges and are now measuring everything."

    For the second metric, the business impact is clear. "There are multiple customers who have gone through this workshop and have become seven-figure spenders on GenAI at Databricks in a way that they weren't before," Frankle said.

    The third metric reveals Judge Builder's strategic value. Customers who previously hesitated to use advanced techniques like reinforcement learning now feel confident deploying them because they can measure whether improvements actually occurred.

    "There are customers who have gone and done very advanced things after having had these judges where they were reluctant to do so before," Frankle said. "They've moved from doing a little bit of prompt engineering to doing reinforcement learning with us. Why spend the money on reinforcement learning, and why spend the energy on reinforcement learning if you don't know whether it actually made a difference?"

    What enterprises should do now

    The teams successfully moving AI from pilot to production treat judges not as one-time artifacts but as evolving assets that grow with their systems.

    Databricks recommends three practical steps. First, focus on high-impact judges by identifying one critical regulatory requirement plus one observed failure mode. These become your initial judge portfolio.

    Second, create lightweight workflows with subject matter experts. A few hours reviewing 20-30 edge cases provides sufficient calibration for most judges. Use batched annotation and inter-rater reliability checks to denoise your data.

    Third, schedule regular judge reviews using production data. New failure modes will emerge as your system evolves. Your judge portfolio should evolve with them.

    "A judge is a way to evaluate a model, it's also a way to create guardrails, it's also a way to have a metric against which you can do prompt optimization and it's also a way to have a metric against which you can do reinforcement learning," Frankle said. "Once you have a judge that you know represents your human taste in an empirical form that you can query as much as you want, you can use it in 10,000 different ways to measure or improve your agents."

  • Developers beware: Google’s Gemma model controversy exposes model lifecycle risks

    The recent controversy surrounding Google’s Gemma model has once again highlighted the dangers of using developer test models and the fleeting nature of model availability. 

    Google pulled its Gemma 3 model from AI Studio following a statement from Senator Marsha Blackburn (R-Tenn.) that the Gemma model willfully hallucinated falsehoods about her. Blackburn said the model fabricated news stories about her that go beyond “harmless hallucination” and function as a defamatory act. 

    In response, Google posted on X on October 31 that it will remove Gemma from AI Studio, stating that this is “to prevent confusion.” Gemma remains available via API. 

    It is also available via AI Studio, which, the company described, is "a developer tool (in fact, to use it you need to attest you're a developer). We’ve now seen reports of non-developers trying to use Gemma in AI Studio and ask it factual questions. We never intended this to be a consumer tool or model, or to be used this way. To prevent this confusion, access to Gemma is no longer available on AI Studio."

    To be clear, Google has the right to remove its model from its platform, especially if people have found hallucinations and falsehoods that could proliferate. It also underscores the danger of relying mainly on experimental models and why enterprise developers need to save projects before AI models are sunsetted or removed. Technology companies like Google continue to face political controversies, which often influence their deployments. 

    VentureBeat reached out to Google for additional information and was pointed to their October 31 posts. We also contacted the office of Sen. Blackburn, who reiterated her stance outlined in a statement that AI companies should “shut [models] down until you can control it."

    Developer experiments

    The Gemma family of models, which includes a 270M parameter version, is best suited for small, quick apps and tasks that can run on devices such as smartphones and laptops. Google said the Gemma models were “built specifically for the developer and research community. They are not meant for factual assistance or for consumers to use.”

    Nevertheless, non-developers could still access Gemma because it is on the AI Studio platform, a more beginner-friendly space for developers to play around with Google AI models compared to Vertex AI. So even if Google never intended Gemma and AI Studio to be accessible to, say, Congressional staffers, these situations can still occur. 

    It also shows that as models continue to improve, these models still produce inaccurate and potentially harmful information. Enterprises must continually weigh the benefits of using models like Gemma against their potential inaccuracies. 

    Project continuity 

    Another concern is the control that AI companies have over their models. The adage “you don’t own anything on the internet” remains true. If you don’t own a physical or local copy of software, it’s easy for you to lose access to it if the company that owns it decides to take it away. Google did not clarify with VentureBeat if current projects on AI Studio powered by Gemma are saved. 

    Similarly, OpenAI users were disappointed when the company announced that it would remove popular older models on ChatGPT. Even after walking back his statement and reinstating GPT-4o back to ChatGPT, OpenAI CEO Sam Altman continues to field questions around keeping and supporting the model. 

    AI companies can, and should, remove their models if they create harmful outputs. AI models, no matter how mature, remain works in progress and are constantly evolving and improving. But, since they are experimental in nature, models can easily become tools that technology companies and lawmakers can wield as leverage. Enterprise developers must ensure that their work can be saved before models are removed from platforms. 

  • Meet Denario, the AI ‘research assistant’ that is already getting its own papers published

    An international team of researchers has released an artificial intelligence system capable of autonomously conducting scientific research across multiple disciplines — generating papers from initial concept to publication-ready manuscript in approximately 30 minutes for about $4 each.

    The system, called Denario, can formulate research ideas, review existing literature, develop methodologies, write and execute code, create visualizations, and draft complete academic papers. In a demonstration of its versatility, the team used Denario to generate papers spanning astrophysics, biology, chemistry, medicine, neuroscience, and other fields, with one AI-generated paper already accepted for publication at an academic conference.

    "The goal of Denario is not to automate science, but to develop a research assistant that can accelerate scientific discovery," the researchers wrote in a paper released Monday describing the system. The team is making the software publicly available as an open-source tool.

    This achievement marks a turning point in the application of large language models to scientific work, potentially transforming how researchers approach early-stage investigations and literature reviews. However, the research also highlights substantial limitations and raises pressing questions about validation, authorship, and the changing nature of scientific labor.

    From data to draft: how AI agents collaborate to conduct research

    At its core, Denario operates not as a single AI brain but as a digital research department where specialized AI agents collaborate to push a project from conception to completion. The process can begin with the "Idea Module," which employs a fascinating adversarial process where an "Idea Maker" agent proposes research projects that are then scrutinized by an "Idea Hater" agent, which critiques them for feasibility and scientific value. This iterative loop refines raw concepts into robust research directions.

    Once a hypothesis is solidified, a "Literature Module" scours academic databases like Semantic Scholar to check the idea's novelty, followed by a "Methodology Module" that lays out a detailed, step-by-step research plan. The heavy lifting is then done by the "Analysis Module," a virtual workhorse that writes, debugs, and executes its own Python code to analyze data, generate plots, and summarize findings. Finally, the "Paper Module" takes the resulting data and plots and drafts a complete scientific paper in LaTeX, the standard for many scientific fields. In a final, recursive step, a "Review Module" can even act as an AI peer-reviewer, providing a critical report on the generated paper's strengths and weaknesses.

    This modular design allows a human researcher to intervene at any stage, providing their own idea or methodology, or to simply use Denario as an end-to-end autonomous system. "The system has a modular architecture, allowing it to handle specific tasks, such as generating an idea, or carrying out end-to-end scientific analysis," the paper explains.

    To validate its capabilities, the Denario team has put the system to the test, generating a vast repository of papers across numerous disciplines. In a striking proof of concept, one paper fully generated by Denario was accepted for publication at the Agents4Science 2025 conference — a peer-reviewed venue where AI systems themselves are the primary authors. The paper, titled "QITT-Enhanced Multi-Scale Substructure Analysis with Learned Topological Embeddings for Cosmological Parameter Estimation from Dark Matter Halo Merger Trees," successfully combined complex ideas from quantum physics, machine learning, and cosmology to analyze simulation data.

    The ghost in the machine: AI’s ‘vacuous’ results and ethical alarms

    While the successes are notable, the research paper is refreshingly candid about Denario's significant limitations and failure modes. The authors stress that the system currently "behaves more like a good undergraduate or early graduate student rather than a full professor in terms of big picture, connecting results…etc." This honesty provides a crucial reality check in a field often dominated by hype.

    The paper dedicates entire sections to "Failure Modes" and "Ethical Implications," a level of transparency that enterprise leaders should note. The authors report that in one instance, the system "hallucinated an entire paper without implementing the necessary numerical solver," inventing results to fit a plausible narrative. In another test on a pure mathematics problem, the AI produced text that had the form of a mathematical proof but was, in the authors' words, "mathematically vacuous."

    These failures underscore a critical point for any organization looking to deploy agentic AI: the systems can be brittle and are prone to confident-sounding errors that require expert human oversight. The Denario paper serves as a vital case study in the importance of keeping a human in the loop for validation and critical assessment.

    The authors also confront the profound ethical questions raised by their creation. They warn that "AI agents could be used to quickly flood the scientific literature with claims driven by a particular political agenda or specific commercial or economic interests." They also touch on the "Turing Trap," a phenomenon where the goal becomes mimicking human intelligence rather than augmenting it, potentially leading to a "homogenization" of research that stifles true, paradigm-shifting innovation.

    An open-source co-pilot for the world's labs

    Denario is not just a theoretical exercise locked away in an academic lab. The entire system is open-source under a GPL-3.0 license and is accessible to the broader community. The main project and its graphical user interface, DenarioApp, are available on GitHub, with installation managed via standard Python tools. For enterprise environments focused on reproducibility and scalability, the project also provides official Docker images. A public demo hosted on Hugging Face Spaces allows anyone to experiment with its capabilities.

    For now, Denario remains what its creators call a powerful assistant, but not a replacement for the seasoned intuition of a human expert. This framing is deliberate. The Denario project is less about creating an automated scientist and more about building the ultimate co-pilot, one designed to handle the tedious and time-consuming aspects of modern research.

    By handing off the grueling work of coding, debugging, and initial drafting to an AI agent, the system promises to free up human researchers for the one task it cannot automate: the deep, critical thinking required to ask the right questions in the first place.

  • Strengthening Our Core: Welcoming Karyne Levy as VentureBeat’s New Managing Editor

    I’m thrilled to announce a fantastic new addition to our leadership team: Karyne Levy is joining VentureBeat as our new Managing Editor. Today is her first day.

    Many of you may know Karyne from her most recent role as Deputy Managing Editor at TechCrunch, but her career is a highlight reel of veteran tech journalism. Her resume includes pivotal roles at Protocol, NerdWallet, Business Insider, and CNET, giving her a deep understanding of this industry from every angle.

    Hiring Karyne is a significant step forward for VentureBeat. As we’ve sharpened our focus on serving you – the enterprise technical decision-maker navigating the complexities of AI and data – I’ve been looking for a very specific kind of leader.

    The "Organizer's Dopamine Hit"

    In the past, a managing editor was often the final backstop for copy. Today, at a modern, data-focused media company like ours, the role is infinitely more dynamic. It’s the central hub of the entire content operation.

    During my search, I found myself talking a lot about the two types of "dopamine hits" in our business. There’s the writer’s hit – seeing your name on a great story. And then there’s the organizer’s hit – the satisfaction that comes from building, tuning, and running the complex machine that allows a dozen different parts of the company to move in a single, powerful direction.

    We were looking for the organizer.

    When I spoke with Karyne, I explained this vision: a leader who thrives on creating workflows, who loves being the liaison between editorial, our data and survey team, our events, and our marketing operations.

    Her response confirmed she was the one: "Everything you said is exactly my dopamine hit."

    Karyne’s passion is making the entire operation hum. She has a proven track record of managing people, running newsrooms, and interfacing with all parts of a business to ensure everyone is aligned. That operational rigor is precisely what we need for our next chapter.

    Why This Matters for Our Strategy (and for You)

    As I’ve written about before, VentureBeat is on a mission to evolve. In an age where experts and companies can publish directly, it’s not enough to be a secondary source. Our goal is to become a primary source for you.

    How? By leveraging our relationship with our community of millions of technical leaders. We are increasingly surveying you directly to generate proprietary insights you can’t get anywhere else. We want to be the first to tell you which vector stores your peers are actually implementing, what governance challenges are most pressing for data scientists, or how your counterparts are budgeting for generative AI.

    This is an ambitious strategy. It requires a tight-knit team where our editorial content, our research surveys and reports, our newsletters, and our VB Transform events are all working from the same playbook.

    Karyne is the leader who will help us execute that vision. Her experience at Protocol, which was also dedicated to serving technical and business decision-makers, means she fundamentally understands our audience. She is ideally suited to manage our newsroom and ensure that every piece of content we produce helps you do your job better. She’ll be working alongside Carl Franzen, our executive editor, who continues to drive news decision-making.

    This is a fantastic hire for VentureBeat. It’s another sign of our commitment to building the most focused, expert team in enterprise AI and data.

    Please join me in welcoming Karyne to the team.