Tag: Machine Learning

  • Chronosphere takes on Datadog with AI that explains itself, not just outages

    Chronosphere, a New York-based observability startup valued at $1.6 billion, announced Monday it will launch AI-Guided Troubleshooting capabilities designed to help engineers diagnose and fix production software failures — a problem that has intensified as artificial intelligence tools accelerate code creation while making systems harder to debug.

    The new features combine AI-driven analysis with what Chronosphere calls a Temporal Knowledge Graph, a continuously updated map of an organization's services, infrastructure dependencies, and system changes over time. The technology aims to address a mounting challenge in enterprise software: developers are writing code faster than ever with AI assistance, but troubleshooting remains largely manual, creating bottlenecks when applications fail.

    "For AI to be effective in observability, it needs more than pattern recognition and summarization," said Martin Mao, Chronosphere's CEO and co-founder, in an exclusive interview with VentureBeat. "Chronosphere has spent years building the data foundation and analytical depth needed for AI to actually help engineers. With our Temporal Knowledge Graph and advanced analytics capabilities, we're giving AI the understanding it needs to make observability truly intelligent — and giving engineers the confidence to trust its guidance."

    The announcement comes as the observability market — software that monitors complex cloud applications — faces mounting pressure to justify escalating costs. Enterprise log data volumes have grown 250% year-over-year, according to Chronosphere's own research, while a study from MIT and the University of Pennsylvania found that generative AI has spurred a 13.5% increase in weekly code commits, signifying faster development velocity but also greater system complexity.

    AI writes code 13% faster, but debugging stays stubbornly manual

    Despite advances in automated code generation, debugging production failures remains stubbornly manual. When a major e-commerce site slows during checkout or a banking app fails to process transactions, engineers must sift through millions of data points — server logs, application traces, infrastructure metrics, recent code deployments — to identify root causes.

    Chronosphere's answer is what it calls AI-Guided Troubleshooting, built on four core capabilities: automated "Suggestions" that propose investigation paths backed by data; the Temporal Knowledge Graph that maps system relationships and changes; Investigation Notebooks that document each troubleshooting step for future reference; and natural language query building.

    Mao explained the Temporal Knowledge Graph in practical terms: "It's a living, time-aware model of your system. It stitches together telemetry—metrics, traces, logs—infrastructure context, change events like deploys and feature flags, and even human input like notes and runbooks into a single, queryable map that updates as your system evolves."

    This differs fundamentally from the service dependency maps offered by competitors like Datadog, Dynatrace, and Splunk, Mao argued. "It adds time, not just topology," he said. "It tracks how services and dependencies change over time and connects those changes to incidents—what changed and why. Many tools rely on standardized integrations; our graph goes a step further to normalize custom, non-standard telemetry so application-specific signals aren't a blind spot."
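
    To make the concept concrete, here is a deliberately simplified Python sketch of a time-aware dependency graph: edges and change events carry timestamps, so an incident can be correlated with what changed shortly beforehand. The class, method, and field names are invented for illustration and bear no relation to Chronosphere's actual implementation.

    from dataclasses import dataclass, field
    from datetime import datetime, timedelta

    @dataclass
    class TemporalGraph:
        # (source, target, timestamp): observed service dependencies
        edges: list = field(default_factory=list)
        # (service, description, timestamp): deploys, feature flags, config changes
        changes: list = field(default_factory=list)

        def add_dependency(self, source, target, at):
            self.edges.append((source, target, at))

        def add_change(self, service, description, at):
            self.changes.append((service, description, at))

        def dependencies_at(self, service, at):
            """Services that `service` depended on at or before a point in time."""
            return {t for s, t, ts in self.edges if s == service and ts <= at}

        def recent_changes(self, service, incident_at, window=timedelta(hours=2)):
            """Change events on the service or its dependencies shortly before an incident."""
            related = self.dependencies_at(service, incident_at) | {service}
            return [c for c in self.changes
                    if c[0] in related and incident_at - window <= c[2] <= incident_at]

    # Example: a Checkout incident traces back to a feature-flag change in Payment.
    graph = TemporalGraph()
    now = datetime.utcnow()
    graph.add_dependency("checkout", "payment", now - timedelta(days=1))
    graph.add_change("payment", "feature flag: new retry policy", now - timedelta(minutes=30))
    print(graph.recent_changes("checkout", incident_at=now))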

    Why Chronosphere shows its work instead of making automatic decisions

    Unlike purely automated systems, Chronosphere designed its AI features to keep engineers in the driver's seat—a deliberate choice meant to address what Mao calls the "confident-but-wrong guidance" problem plaguing early AI observability tools.

    "'Keeping engineers in control' means the AI shows its work, proposes next steps, and lets engineers verify or override — never auto-deciding behind the scenes," Mao explained. "Every Suggestion includes the evidence—timing, dependencies, error patterns — and a 'Why was this suggested?' view, so they can inspect what was checked and ruled out before acting."

    He walked through a concrete example: "An SLO [service level objective] alert fires on Checkout. Chronosphere immediately surfaces a ranked Suggestion: errors appear to have started in the dependent Payment service. An engineer can click Investigate to see the charts and reasoning and, if it holds up, choose to dig deeper. As they steer into Payment, the system adapts with new Suggestions scoped to that service—all from one view, no tab-hopping."

    In this scenario, the engineer asks "what changed?" and the system pulls in change events. "Our Notebook capability makes the causal chain plain: a feature-flag update preceded pod memory exhaustion in Payment; Checkout's spike is a downstream symptom," Mao said. "They can decide to roll back the flag. That whole path — suggestions followed, evidence viewed, conclusions—is captured automatically in an Investigation Notebook, and the outcome feeds the Temporal Knowledge Graph so similar future incidents are faster to resolve."

    How a $1.6 billion startup takes on Datadog, Dynatrace, and Splunk

    Chronosphere enters an increasingly crowded field. Datadog, the publicly traded observability leader valued at over $40 billion, has introduced its own AI-powered troubleshooting features. So have Dynatrace and Splunk. All three offer comprehensive "all-in-one" platforms that promise single-pane-of-glass visibility.

    Mao distinguished Chronosphere's approach on technical grounds. "Early 'AI for observability' leaned heavily on pattern-spotting and summarization, which tends to break down during real incidents," he said. "These approaches often stop at correlating anomalies or producing fluent explanations without the deeper analysis and causal reasoning observability leaders need. They can feel impressive in demos but disappoint in production—they summarize signals rather than explain cause and effect."

    A specific technical gap, he argued, involves custom application telemetry. "Most platforms reason over standardized integrations—Kubernetes, common cloud services, popular databases—ignoring the most telling clues that live in custom app telemetry," Mao said. "With an incomplete picture, large language models will 'fill in the gaps,' producing confident-but-wrong guidance that sends teams down dead ends."

    Chronosphere's competitive positioning received validation in July when Gartner named it a Leader in the 2025 Magic Quadrant for Observability Platforms for the second consecutive year. The firm was recognized based on both "Completeness of Vision" and "Ability to Execute." In December 2024, Chronosphere also tied for the highest overall rating among recognized vendors in Gartner Peer Insights' "Voice of the Customer" report, scoring 4.7 out of 5 based on 70 reviews.

    Yet the company faces intensifying competition for high-profile customers. UBS analysts noted in July that OpenAI now runs both Datadog and Chronosphere side-by-side to monitor GPU workloads, suggesting the AI leader is evaluating alternatives. While UBS maintained its buy rating on Datadog, the analysts warned that growing Chronosphere usage could pressure Datadog's pricing power.

    Inside the 84% cost reduction claims—and what CIOs should actually measure

    Beyond technical capabilities, Chronosphere has built its market position on cost control — a critical factor as observability spending spirals. The company claims its platform reduces data volumes and associated costs by 84% on average while cutting critical incidents by up to 75%.

    When pressed for specific customer examples with real numbers, Mao pointed to several case studies. "Robinhood has seen a 5x improvement in reliability and a 4x improvement in Mean Time to Detection," he said. "DoorDash used Chronosphere to improve governance and standardize monitoring practices. Astronomer achieved over 85% cost reduction by shaping data on ingest, and Affirm scaled their load 10x during a Black Friday event with no issues, highlighting the platform's reliability under extreme conditions."

    The cost argument matters because, as Paul Nashawaty, principal analyst at CUBE Research, noted when Chronosphere launched its Logs 2.0 product in June: "Organizations are drowning in telemetry data, with over 70% of observability spend going toward storing logs that are never queried."

    For CIOs fatigued by "AI-powered" announcements, Mao acknowledged skepticism is warranted. "The way to cut through it is to test whether the AI shortens incidents, reduces toil, and builds reusable knowledge in your own environment, not in a demo," he advised. He recommended CIOs evaluate three factors: transparency and control (does the system show its reasoning?), coverage of custom telemetry (can it handle non-standardized data?), and manual toil avoided (how many ad-hoc queries and tool-switches are eliminated?).

    Why Chronosphere partners with five vendors instead of building everything itself

    Alongside the AI troubleshooting announcement, Chronosphere revealed a new Partner Program integrating five specialized vendors to fill gaps in its platform: Arize for large language model monitoring, Embrace for real user monitoring, Polar Signals for continuous profiling, Checkly for synthetic monitoring, and Rootly for incident management.

    The strategy represents a deliberate bet against the all-in-one platforms dominating the market. "While an all-in-one platform may be sufficient for smaller organizations, global enterprises demand best-in-class depth across each domain," Mao said. "This is what drove us to build our Partner Program and invest in seamless integrations with leading providers—so our customers can operate with confidence and clarity at every layer of observability."

    Noah Smolen, head of partnerships at Arize, said the collaboration addresses a specific enterprise need. "With a wide array of Fortune 500 customers, we understand the high bar needed to ensure AI agent systems are ready to deploy and stay incident-free, especially given the pace of AI adoption in the enterprise," Smolen said. "Our partnership with Chronosphere comes at a time when an integrated purpose-built cloud-native and AI-observability suite solves a huge pain point for forward-thinking C-suite leaders who demand the very best across their entire observability stack."

    Similarly, JJ Tang, CEO and founder of Rootly, emphasized the incident resolution benefits. "Incidents hinder innovation and revenue, and the challenge lies in sifting through vast amounts of observability data, mobilizing teams, and resolving issues quickly," Tang said. "Integrating Chronosphere with Rootly allows engineers to collaborate with context and resolve issues faster within their existing communication channels, drastically reducing time to resolution and ultimately improving reliability—78% plus decreases in repeat Sev0 and Sev1 incidents."

    When asked how total costs compare when customers use multiple partner contracts versus a single platform, Mao acknowledged the current complexity. "At present, mutual customers typically maintain separate contracts unless they engage through a services partner or system integrator," he said. However, he argued the economics still favor the composable approach: "Our combined technologies deliver exceptional value—in most circumstances at just a fraction of the price of a single-platform solution. Beyond the savings, customers gain a richer, more unified observability experience that unlocks deeper insights and greater efficiency, especially for large-scale environments."

    The company plans to streamline this over time. "As the ISV program matures, we're focused on delivering a more streamlined experience by transitioning to a single, unified contract that simplifies procurement and accelerates time to value," Mao said.

    How two Uber engineers turned Halloween outages into a billion-dollar startup

    Chronosphere's origins trace to 2019, when Mao and co-founder Rob Skillington left Uber after building the ride-hailing giant's internal observability platform. At Uber, Mao's team had faced a crisis: the company's in-house tools would fail on its two busiest nights — Halloween and New Year's Eve — cutting off visibility into whether customers could request rides or drivers could locate passengers.

    The solution they built at Uber used open-source software and ultimately allowed the company to operate without outages, even during high-volume events. But the broader market insight came at an industry conference in December 2018, when major cloud providers threw their weight behind Kubernetes, Google's container orchestration technology.

    "This meant that most technology architectures were eventually going to look like Uber's," Mao recalled in an August 2024 profile by Greylock Partners, Chronosphere's lead investor. "And that meant every company, not just a few big tech companies and the Walmarts of the world, would have the exact same problem we had solved at Uber."

    Chronosphere has since raised more than $343 million in funding across multiple rounds led by Greylock, Lux Capital, General Atlantic, Addition, and Founders Fund. The company operates as a remote-first organization with offices in New York, Austin, Boston, San Francisco, and Seattle, employing approximately 299 people according to LinkedIn data.

    The company's customer base includes DoorDash, Zillow, Snap, Robinhood, and Affirm — predominantly high-growth technology companies operating cloud-native, Kubernetes-based infrastructures at massive scale.

    What's available now—and what enterprises can expect in 2026

    Chronosphere's AI-Guided Troubleshooting capabilities, including Suggestions and Investigation Notebooks, entered limited availability Monday with select customers. The company plans full general availability in 2026. The Model Context Protocol (MCP) Server, which enables engineers to integrate Chronosphere directly into internal AI workflows and query observability data through AI-enabled development environments, is available immediately for all Chronosphere customers.

    The phased rollout reflects the company's cautious approach to deploying AI in production environments where mistakes carry real costs. By gathering feedback from early adopters before broad release, Chronosphere aims to refine its guidance algorithms and validate that its suggestions genuinely accelerate troubleshooting rather than simply generating impressive demonstrations.

    The longer game, however, extends beyond individual product features. Chronosphere's dual bet — on transparent AI that shows its reasoning and on a partner ecosystem rather than all-in-one integration — amounts to a fundamental thesis about how enterprise observability will evolve as systems grow more complex.

    If that thesis proves correct, the company that solves observability for the AI age won't be the one with the most automated black box. It will be the one that earns engineers' trust by explaining what it knows, admitting what it doesn't, and letting humans make the final call. In an industry drowning in data and promised silver bullets, Chronosphere is wagering that showing your work still matters — even when AI is doing the math.

  • Meta returns to open source AI with Omnilingual ASR models that can transcribe 1,600+ languages natively

    Meta has just released a new multilingual automatic speech recognition (ASR) system supporting 1,600+ languages — dwarfing OpenAI’s open source Whisper model, which supports just 99.

    Its architecture also allows developers to extend that support to thousands more. Through a feature called zero-shot in-context learning, users can provide a few paired examples of audio and text in a new language at inference time, enabling the model to transcribe additional utterances in that language without any retraining.

    In practice, this expands potential coverage to more than 5,400 languages — roughly every spoken language with a known script.

    It’s a shift from static model capabilities to a flexible framework that communities can adapt themselves. So while the 1,600 languages reflect official training coverage, the broader figure represents Omnilingual ASR’s capacity to generalize on demand, making it the most extensible speech recognition system released to date.

    Best of all, it has been open sourced under a plain Apache 2.0 license — not the restrictive, quasi open-source Llama license of the company's prior releases, which limited use by larger enterprises unless they paid licensing fees — meaning researchers and developers are free to use it immediately, at no cost, even in commercial and enterprise-grade projects.

    Released on November 10 on Meta's website and GitHub, along with a demo space on Hugging Face and a technical paper, Meta’s Omnilingual ASR suite includes a family of speech recognition models, a 7-billion-parameter multilingual audio representation model, and a massive speech corpus spanning more than 350 previously underserved languages.

    All resources are freely available under open licenses, and the models support speech-to-text transcription out of the box.

    “By open sourcing these models and dataset, we aim to break down language barriers, expand digital access, and empower communities worldwide,” Meta posted on its @AIatMeta account on X.

    Designed for Speech-to-Text Transcription

    At its core, Omnilingual ASR is a speech-to-text system.

    The models are trained to convert spoken language into written text, supporting applications like voice assistants, transcription tools, subtitles, oral archive digitization, and accessibility features for low-resource languages.

    Unlike earlier ASR models that required extensive labeled training data, Omnilingual ASR includes a zero-shot variant.

    This version can transcribe languages it has never seen before—using just a few paired examples of audio and corresponding text.

    This lowers the barrier for adding new or endangered languages dramatically, removing the need for large corpora or retraining.

    Model Family and Technical Design

    The Omnilingual ASR suite includes multiple model families trained on more than 4.3 million hours of audio from 1,600+ languages:

    • wav2vec 2.0 models for self-supervised speech representation learning (300M–7B parameters)

    • CTC-based ASR models for efficient supervised transcription

    • LLM-ASR models combining a speech encoder with a Transformer-based text decoder for state-of-the-art transcription

    • LLM-ZeroShot ASR model, enabling inference-time adaptation to unseen languages

    All models follow an encoder–decoder design: raw audio is converted into a language-agnostic representation, then decoded into written text.

    Why the Scale Matters

    While Whisper and similar models have advanced ASR capabilities for global languages, they fall short on the long tail of human linguistic diversity. Whisper supports 99 languages. Meta’s system:

    • Directly supports 1,600+ languages

    • Can generalize to 5,400+ languages using in-context learning

    • Achieves character error rates (CER) under 10% in 78% of supported languages

    Among those supported are more than 500 languages never previously covered by any ASR model, according to Meta’s research paper.

    This expansion opens new possibilities for communities whose languages are often excluded from digital tools.


    Background: Meta’s AI Overhaul and a Rebound from Llama 4

    The release of Omnilingual ASR arrives at a pivotal moment in Meta’s AI strategy, following a year marked by organizational turbulence, leadership changes, and uneven product execution.

    Omnilingual ASR is the first major open-source model release since the rollout of Llama 4, Meta’s latest large language model, which debuted in April 2025 to mixed and ultimately poor reviews, with scant enterprise adoption compared to Chinese open source model competitors.

    The failure led Meta founder and CEO Mark Zuckerberg to appoint Alexandr Wang, co-founder and former CEO of AI data supplier Scale AI, as Chief AI Officer, and to embark on an extensive and costly hiring spree that shocked the AI and business communities with eye-watering pay packages for top AI researchers.

    In contrast, Omnilingual ASR represents a strategic and reputational reset. It returns Meta to a domain where the company has historically led — multilingual AI — and offers a truly extensible, community-oriented stack with minimal barriers to entry.

    The system’s support for 1,600+ languages and its extensibility to more than 5,400 languages in total via zero-shot in-context learning reassert Meta’s engineering credibility in language technology.

    Importantly, it does so through a free and permissively licensed release, under Apache 2.0, with transparent dataset sourcing and reproducible training protocols.

    This shift aligns with broader themes in Meta’s 2025 strategy. The company has refocused its narrative around a “personal superintelligence” vision, investing heavily in infrastructure (including a September release of custom AI accelerators and Arm-based inference stacks) while downplaying the metaverse in favor of foundational AI capabilities. Its return to training on public data in Europe after a regulatory pause also underscores its intention to compete globally, despite ongoing privacy scrutiny.

    Omnilingual ASR, then, is more than a model release — it’s a calculated move to reassert control of the narrative: from the fragmented rollout of Llama 4 to a high-utility, research-grounded contribution that aligns with Meta’s long-term AI platform strategy.

    Community-Centered Dataset Collection

    To achieve this scale, Meta partnered with researchers and community organizations in Africa, Asia, and elsewhere to create the Omnilingual ASR Corpus, a 3,350-hour dataset across 348 low-resource languages. Contributors were local speakers who were compensated for their recordings, which were gathered in collaboration with groups like:

    • African Next Voices: A Gates Foundation–supported consortium including Maseno University (Kenya), University of Pretoria, and Data Science Nigeria

    • Mozilla Foundation’s Common Voice, supported through the Open Multilingual Speech Fund

    • Lanfrica / NaijaVoices, which created data for 11 African languages including Igala, Serer, and Urhobo

    The data collection focused on natural, unscripted speech. Prompts were designed to be culturally relevant and open-ended, such as “Is it better to have a few close friends or many casual acquaintances? Why?” Transcriptions used established writing systems, with quality assurance built into every step.

    Performance and Hardware Considerations

    The largest model in the suite, the omniASR_LLM_7B, requires ~17GB of GPU memory for inference, making it suitable for deployment on high-end hardware. Smaller models (300M–1B) can run on lower-power devices and deliver real-time transcription speeds.

    Performance benchmarks show strong results even in low-resource scenarios:

    • CER <10% in 95% of high-resource and mid-resource languages

    • CER <10% in 36% of low-resource languages

    • Robustness in noisy conditions and unseen domains, especially with fine-tuning

    The zero-shot system, omniASR_LLM_7B_ZS, can transcribe new languages with minimal setup. Users provide a few sample audio–text pairs, and the model generates transcriptions for new utterances in the same language.

    Open Access and Developer Tooling

    All models and the dataset are released under permissive open licenses.

    Installation is supported via PyPI and uv:

    pip install omnilingual-asr

    Meta also provides:

    • A HuggingFace dataset integration

    • Pre-built inference pipelines

    • Language-code conditioning for improved accuracy

    Developers can view the full list of supported languages using the API:

    from omnilingual_asr.models.wav2vec2_llama.lang_ids import supported_langs

    print(len(supported_langs))
    print(supported_langs)
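
    As a quick illustration of how that list can be used, the snippet below checks whether a few target languages are natively supported and, if not, flags them for the zero-shot path. It relies only on the supported_langs import shown above; the specific language codes (ISO 639-3 plus script, e.g. "eng_Latn") are an assumed format, so verify them against the repository's documentation.

    # Check which target languages are natively supported; anything missing can fall
    # back to the zero-shot in-context learning path described above. The language
    # codes below are an assumed format; confirm them against the repository.
    from omnilingual_asr.models.wav2vec2_llama.lang_ids import supported_langs

    wanted = ["eng_Latn", "yor_Latn", "urh_Latn"]  # hypothetical deployment targets
    for code in wanted:
        if code in supported_langs:
            print(f"{code}: supported natively")
        else:
            print(f"{code}: not in training coverage; use zero-shot in-context learning")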

    Broader Implications

    Omnilingual ASR reframes language coverage in ASR from a fixed list to an extensible framework. It enables:

    • Community-driven inclusion of underrepresented languages

    • Digital access for oral and endangered languages

    • Research on speech tech in linguistically diverse contexts

    Crucially, Meta emphasizes ethical considerations throughout—advocating for open-source participation and collaboration with native-speaking communities.

    “No model can ever anticipate and include all of the world’s languages in advance,” the Omnilingual ASR paper states, “but Omnilingual ASR makes it possible for communities to extend recognition with their own data.”

    Access the Tools

    All resources are now available on Meta’s website, GitHub, and Hugging Face.

    What This Means for Enterprises

    For enterprise developers, especially those operating in multilingual or international markets, Omnilingual ASR significantly lowers the barrier to deploying speech-to-text systems across a broader range of customers and geographies.

    Instead of relying on commercial ASR APIs that support only a narrow set of high-resource languages, teams can now integrate an open-source pipeline that covers over 1,600 languages out of the box—with the option to extend it to thousands more via zero-shot learning.

    This flexibility is especially valuable for enterprises working in sectors like voice-based customer support, transcription services, accessibility, education, or civic technology, where local language coverage can be a competitive or regulatory necessity. Because the models are released under the permissive Apache 2.0 license, businesses can fine-tune, deploy, or integrate them into proprietary systems without restrictive terms.

    It also represents a shift in the ASR landscape—from centralized, cloud-gated offerings to community-extendable infrastructure. By making multilingual speech recognition more accessible, customizable, and cost-effective, Omnilingual ASR opens the door to a new generation of enterprise speech applications built around linguistic inclusion rather than linguistic limitation.

  • Celosphere 2025: Where enterprise AI moved from experiment to execution

    Presented by Celonis


    After a year of boardroom declarations about “AI transformation,” this was the week where enterprise leaders came together to talk about what actually works. Speaking from the stage at Celosphere in Munich, Celonis co-founder and co-CEO Alexander Rinke set the tone early in his keynote:

    “Only 11% of companies are seeing measurable benefits from AI projects today,” he said. “That’s not an adoption problem. That’s a context problem.”

    It’s a sentiment familiar to anyone who’s tried to deploy AI inside a large enterprise. You can’t automate what you don’t understand — and most organizations still lack a unified picture of how work in their companies really gets done.

    Celonis’ answer, showcased across three days at the company’s annual event, was less about new tech acronyms and more about connective tissue: how to make AI fit within the messy, living processes that drive business. The company framed it as achieving a real “Return on AI (ROAI)” — measurable impact that comes only when intelligence is grounded in process context.

    A living model of how the enterprise works

    At the heart of the keynote was what Rinke called a “living digital twin of your operations.” Celonis has been building toward this moment for years — but this was the first time the company made clear how far that concept has evolved.

    “We start by freeing the process,” said Rinke. “Freeing it from the restrictions of your current legacy systems.” Data Core, Celonis’ data infrastructure, extracts raw data from source systems. It’s capable of querying billions of records in near real time with sub-minute refresh — extending visibility beyond traditional systems of record.

    Built on this foundation, the Process Intelligence Graph sits at the center of the Celonis Platform. It’s a system-agnostic, graph-based model that unifies data across systems, apps, and even devices, including task-mining data that captures clicks, spreadsheets, and browser activity. It combines this data with business context—business rules, KPIs, benchmarks, and exceptions. Every transaction, rule, and process interaction becomes part of a continuously updated replica that reflects how the organization actually operates.

    On top of the Graph, the company’s new Build Experience allows organizations to analyze, design, and operate AI-driven, composable processes — integrating AI where it delivers business impact, not just technical demos:

    • Analyze where processes stall or repeat

    • Design the future state, setting outcomes, guardrails, and AI touchpoints

    • Operate with humans, systems, and AI agents working in sync — now orchestrated through a generally available Orchestration Engine that can trigger and monitor every step in one flow

    It’s a deliberate shift from discovery-driven AI pilots to outcome-driven AI operations — and a blueprint for orchestrating agentic AI, where human teams, systems, and autonomous agents work together through shared process context rather than in silos.

    Real-world proof: Mercedes-Benz, Vinmar, and Uniper

    The Celosphere stage offered real proof of the Celonis Platform in action, through live stories from customers already building on it.

    Mercedes-Benz shared how process intelligence became their “connective tissue” during the semiconductor crisis. “We had data everywhere — plants, suppliers, logistics,” recalled Dr. Jörg Burzer, Member of the Board of Management of Mercedes-Benz Group AG. “What we didn’t have was a way to see it together. Celonis helped us connect those dots fast enough to act.”

    The partnership has since expanded across eight of the company’s ten most critical processes, from supply chain to quality to after-sales. But what impressed the audience wasn’t just the scale — it was the cultural shift.

    “If you show data in context, and let teams visualize processes, you also change the culture,” Burzer said. “It’s not just process transformation — it’s people transformation.”

    At Vinmar, CEO Vishal Baid described Celonis as “the foundation of our automation and AI strategy.” His global plastics distribution business has already automated its entire order-to-cash process for a $3B unit, achieving a 40% productivity lift. But Baid wasn’t there to just celebrate finished work — he was looking ahead.

    “Now we’re tackling the non-algorithmic stuff,” he said. “Matching purchase and sales orders sounds simple until you have thousands of edge cases. We’re building an AI agent that can do that allocation intelligently. That’s the next frontier.”

    And in the energy sector, Uniper, with partner Microsoft, demonstrated how process-aware AI copilots are already reshaping operations. Using Celonis and Microsoft’s AI stack, Uniper can predict when hydropower plants will need maintenance — and cluster those jobs to reduce downtime and emissions.

    “Each technician, each part, each system plays a role in a living process,” said Hans Berg, Uniper’s CIO. “The human can’t see all of it. But process intelligence can — and it can nudge the system toward the best outcome.”

    Agnes Heftberger, CVP & CEO, Microsoft Germany & Austria, who joined Berg on stage, summed it up crisply:

    “The hard part isn’t building AI features — it’s scaling them responsibly,” she explained. “You need to marry intelligence with the beating heart of the company: its processes.”

    Across the global community, Celonis reports more than $8 billion in realized business value and over 120 certified value champions — proof that process intelligence is driving measurable impact far beyond pilots. Rinke called it “the early proof points of a true return on AI.”

    From closed systems to composable intelligence

    Celosphere 2025 marked a shift from architecture to interoperability — from defining enterprise AI to making it work across boundaries.

    Rinke’s vision for the future is unapologetically open: “Good things grow from open ecosystems,” he said. That philosophy is taking shape through deeper platform integrations — including Microsoft Fabric, Databricks, and Bloomfilter — with zero-copy, bidirectional lakehouse access that lets customers query process data in place with minimal latency. The company also announced MCP Server support for embedding the Process Intelligence Graph directly into agentic AI platforms like Amazon Bedrock and Microsoft Copilot Studio.

    These updates make “composable enterprise AI” tangible — organizations can now assemble and govern AI solutions across ecosystems rather than being locked into any single vendor.

    Rather than competing on who has the “best agent,” the message was that enterprise AI will thrive when agents work together through shared context and models that mirror how businesses actually run.

    “Every vendor is bringing out their own agent,” Rinke said. “But each one is limited to that vendor’s world. If they can’t work together, they can’t work for you. That’s what process intelligence fixes.”

    The idea drew sustained applause. For companies juggling multiple cloud platforms, ERPs, and data tools, composability isn’t just elegant; it’s survival.

    Beyond operations: data, democracy, and direction

    The closing moments of the keynote took an unexpected turn — from enterprise architecture to human courage. Venezuelan opposition leader and Nobel Peace Prize winner María Corina Machado joined live via satellite to share how her movement used data, encrypted apps, and civic coordination to expose election fraud and mobilize millions.

    It was a powerful contrast: the same principles — transparency, accountability, context — at work in both business and democracy.

    “Technology can be a weapon or a liberator,” Machado said. “It depends on who holds the context.”

    Her words landed with weight in a room full of people used to talking about data, systems, and governance — a reminder that context isn’t just technical, it’s human.

    Why this year mattered

    Celosphere 2025 marked a shift in how enterprises approach AI — from experimentation to results grounded in process intelligence. The shift was evident in both tone and technology, with a more powerful Data Core, enhanced Process Intelligence Graph, and new Build Experience. But the deeper takeaway was philosophical: AI only scales when it’s grounded in how people and systems actually work together.

    Celonis president Carsten Thoma was candid in acknowledging that early process-mining projects often “stormed in with discovery” before understanding organizational value — a lesson that now defines the company’s measured, pragmatic approach to enterprise AI.

    Rinke put it best near the end of his keynote:

    “We’re not just automating steps,” he said. “We’re building enterprises that can adapt instantly, innovate freely, and improve continuously.”

    Missed it? Catch up with all the highlights from Celosphere 2025 here.


    Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.

  • Baseten takes on hyperscalers with new AI training platform that lets you own your model weights

    Baseten, the AI infrastructure company recently valued at $2.15 billion, is making its most significant product pivot yet: a full-scale push into model training that could reshape how enterprises wean themselves off dependence on OpenAI and other closed-source AI providers.

    The San Francisco-based company announced Thursday the general availability of Baseten Training, an infrastructure platform designed to help companies fine-tune open-source AI models without the operational headaches of managing GPU clusters, multi-node orchestration, or cloud capacity planning. The move is a calculated expansion beyond Baseten's core inference business, driven by what CEO Amir Haghighat describes as relentless customer demand and a strategic imperative to capture the full lifecycle of AI deployment.

    "We had a captive audience of customers who kept coming to us saying, 'Hey, I hate this problem,'" Haghighat said in an interview. "One of them told me, 'Look, I bought a bunch of H100s from a cloud provider. I have to SSH in on Friday, run my fine-tuning job, then check on Monday to see if it worked. Sometimes I realize it just hasn't been working all along.'"

    The launch comes at a critical inflection point in enterprise AI adoption. As open-source models from Meta, Alibaba, and others increasingly rival proprietary systems in performance, companies face mounting pressure to reduce their reliance on expensive API calls to services like OpenAI's GPT-5 or Anthropic's Claude. But the path from off-the-shelf open-source model to production-ready custom AI remains treacherous, requiring specialized expertise in machine learning operations, infrastructure management, and performance optimization.

    Baseten's answer: provide the infrastructure rails while letting companies retain full control over their training code, data, and model weights. It's a deliberately low-level approach born from hard-won lessons.

    How a failed product taught Baseten what AI training infrastructure really needs

    This isn't Baseten's first foray into training. The company's previous attempt, a product called Blueprints launched roughly two and a half years ago, failed spectacularly — a failure Haghighat now embraces as instructive.

    "We had created the abstraction layer a little too high," he explained. "We were trying to create a magical experience, where as a user, you come in and programmatically choose a base model, choose your data and some hyperparameters, and magically out comes a model."

    The problem? Users didn't have the intuition to make the right choices about base models, data quality, or hyperparameters. When their models underperformed, they blamed the product. Baseten found itself in the consulting business rather than the infrastructure business, helping customers debug everything from dataset deduplication to model selection.

    "We became consultants," Haghighat said. "And that's not what we had set out to do."

    Baseten killed Blueprints and refocused entirely on inference, vowing to "earn the right" to expand again. That moment arrived earlier this year, driven by two market realities: the vast majority of Baseten's inference revenue comes from custom models that customers train elsewhere, and competing training platforms were using restrictive terms of service to lock customers into their inference products.

    "Multiple companies who were building fine-tuning products had in their terms of service that you as a customer cannot take the weights of the fine-tuned model with you somewhere else," Haghighat said. "I understand why from their perspective — I still don't think there is a big company to be made purely on just training or fine-tuning. The sticky part is in inference, the valuable part where value is unlocked is in inference, and ultimately the revenue is in inference."

    Baseten took the opposite approach: customers own their weights and can download them at will. The bet is that superior inference performance will keep them on the platform anyway.

    Multi-cloud GPU orchestration and sub-minute scheduling set Baseten apart from hyperscalers

    The new Baseten Training product operates at what Haghighat calls "the infrastructure layer" — lower-level than the failed Blueprints experiment, but with opinionated tooling around reliability, observability, and integration with Baseten's inference stack.

    Key technical capabilities include multi-node training support across clusters of NVIDIA H100 or B200 GPUs, automated checkpointing to protect against node failures, sub-minute job scheduling, and integration with Baseten's proprietary Multi-Cloud Management (MCM) system. That last piece is critical: MCM allows Baseten to dynamically provision GPU capacity across multiple cloud providers and regions, passing cost savings to customers while avoiding the capacity constraints and multi-year contracts typical of hyperscaler deals.

    "With hyperscalers, you don't get to say, 'Hey, give me three or four B200 nodes while my job is running, and then take it back from me and don't charge me for it,'" Haghighat said. "They say, 'No, you need to sign a three-year contract.' We don't do that."

    Baseten's approach mirrors broader trends in cloud infrastructure, where abstraction layers increasingly allow workloads to move fluidly across providers. When AWS experienced a major outage several weeks ago, Baseten's inference services remained operational by automatically routing traffic to other cloud providers — a capability now extended to training workloads.

    The technical differentiation extends to Baseten's observability tooling, which provides per-GPU metrics for multi-node jobs, granular checkpoint tracking, and a refreshed UI that surfaces infrastructure-level events. The company also introduced an "ML Cookbook" of open-source training recipes for popular models like Gemma, GPT OSS, and Qwen, designed to help users reach "training success" faster.

    Early adopters report 84% cost savings and 50% latency improvements with custom models

    Two early customers illustrate the market Baseten is targeting: AI-native companies building specialized vertical solutions that require custom models.

    Oxen AI, a platform focused on dataset management and model fine-tuning, exemplifies the partnership model Baseten envisions. CEO Greg Schoeninger articulated a common strategic calculus, telling VentureBeat: "Whenever I've seen a platform try to do both hardware and software, they usually fail at one of them. That's why partnering with Baseten to handle infrastructure was the obvious choice."

    Oxen built its customer experience entirely on top of Baseten's infrastructure, using the Baseten CLI to programmatically orchestrate training jobs. The system automatically provisions and deprovisions GPUs, fully concealing Baseten's interface behind Oxen's own. For one Oxen customer, AlliumAI — a startup bringing structure to messy retail data — the integration delivered 84% cost savings compared to previous approaches, reducing total inference costs from $46,800 to $7,530.

    "Training custom LoRAs has always been one of the most effective ways to leverage open-source models, but it often came with infrastructure headaches," said Daniel Demillard, CEO of AlliumAI. "With Oxen and Baseten, that complexity disappears. We can train and deploy models at massive scale without ever worrying about CUDA, which GPU to choose, or shutting down servers after training."

    Parsed, another early customer, tackles a different pain point: helping enterprises reduce dependence on OpenAI by creating specialized models that outperform generalist LLMs on domain-specific tasks. The company works in mission-critical sectors like healthcare, finance, and legal services, where model performance and reliability aren't negotiable.

    "Prior to switching to Baseten, we were seeing repetitive and degraded performance on our fine-tuned models due to bugs with our previous training provider," said Charles O'Neill, Parsed's co-founder and chief science officer. "On top of that, we were struggling to easily download and checkpoint weights after training runs."

    With Baseten, Parsed achieved 50% lower end-to-end latency for transcription use cases, spun up HIPAA-compliant EU deployments for testing within 48 hours, and kicked off more than 500 training jobs. The company also leveraged Baseten's modified vLLM inference framework and speculative decoding — a technique that generates draft tokens to accelerate language model output — to cut latency in half for custom models.

    "Fast models matter," O'Neill said. "But fast models that get better over time matter more. A model that's 2x faster but static loses to one that's slightly slower but improving 10% monthly. Baseten gives us both — the performance edge today and the infrastructure for continuous improvement."

    Why training and inference are more interconnected than the industry realizes

    The Parsed example illuminates a deeper strategic rationale for Baseten's training expansion: the boundary between training and inference is blurrier than conventional wisdom suggests.

    Baseten's model performance team uses the training platform extensively to create "draft models" for speculative decoding, a cutting-edge technique that can dramatically accelerate inference. The company recently announced it achieved 650+ tokens per second on OpenAI's GPT OSS 120B model — a 60% improvement over its launch performance — using EAGLE-3 speculative decoding, which requires training specialized small models to work alongside larger target models.

    "Ultimately, inference and training plug in more ways than one might think," Haghighat said. "When you do speculative decoding in inference, you need to train the draft model. Our model performance team is a big customer of the training product to train these EAGLE heads on a continuous basis."

    This technical interdependence reinforces Baseten's thesis that owning both training and inference creates defensible value. The company can optimize the entire lifecycle: a model trained on Baseten can be deployed with a single click to inference endpoints pre-optimized for that architecture, with deployment-from-checkpoint support for chat completion and audio transcription workloads.

    The approach contrasts sharply with vertically integrated competitors like Replicate or Modal, which also offer training and inference but with different architectural tradeoffs. Baseten's bet is on lower-level infrastructure flexibility and performance optimization, particularly for companies running custom models at scale.

    As open-source AI models improve, enterprises see fine-tuning as the path away from OpenAI dependency

    Underpinning Baseten's entire strategy is a conviction about the trajectory of open-source AI models — namely, that they're getting good enough, fast enough, to unlock massive enterprise adoption through fine-tuning.

    "Both closed and open-source models are getting better and better in terms of quality," Haghighat said. "We don't even need open source to surpass closed models, because as both of them are getting better, they unlock all these invisible lines of usefulness for different use cases."

    He pointed to the proliferation of reinforcement learning and supervised fine-tuning techniques that allow companies to take an open-source model and make it "as good as the closed model, not at everything, but at this narrow band of capability that they want."

    That trend is already visible in Baseten's Model APIs business, launched alongside Training earlier this year to provide production-grade access to open-source models. The company was the first provider to offer access to DeepSeek V3 and R1, and has since added models like Llama 4 and Qwen 3, optimized for performance and reliability. Model APIs serves as a top-of-funnel product: companies start with off-the-shelf open-source models, realize they need customization, move to Training for fine-tuning, and ultimately deploy on Baseten's Dedicated Deployments infrastructure.

    Yet Haghighat acknowledged the market remains "fuzzy" around which training techniques will dominate. Baseten is hedging by staying close to the bleeding edge through its Forward Deployed Engineering team, which works hands-on with select customers on reinforcement learning, supervised fine-tuning, and other advanced techniques.

    "As we do that, we will see patterns emerge about what a productized training product can look like that really addresses the user's needs without them having to learn too much about how RL works," he said. "Are we there as an industry? I would say not quite. I see some attempts at that, but they all seem like almost falling to the same trap that Blueprints fell into—a bit of a walled garden that ties the hands of AI folks behind their back."

    The roadmap ahead includes potential abstractions for common training patterns, expansion into image, audio, and video fine-tuning, and deeper integration of advanced techniques like prefill-decode disaggregation, which separates the initial processing of prompts from token generation to improve efficiency.

    Baseten faces crowded field but bets developer experience and performance will win enterprise customers

    Baseten enters an increasingly crowded market for AI infrastructure. Hyperscalers like AWS, Google Cloud, and Microsoft Azure offer GPU compute for training, while specialized providers like Lambda Labs, CoreWeave, and Together AI compete on price, performance, or ease of use. Then there are vertically integrated platforms like Hugging Face, Replicate, and Modal that bundle training, inference, and model hosting.

    Baseten's differentiation rests on three pillars: its MCM system for multi-cloud capacity management, deep performance optimization expertise built from its inference business, and a developer experience tailored for production deployments rather than experimentation.

    The company's recent $150 million Series D and $2.15 billion valuation provide runway to invest in both products simultaneously. Major customers include Descript, which uses Baseten for transcription workloads; Decagon, which runs customer service AI; and Sourcegraph, which powers coding assistants. All three operate in domains where model customization and performance are competitive advantages.

    Timing may be Baseten's biggest asset. The confluence of improving open-source models, enterprise discomfort with dependence on proprietary AI providers, and growing sophistication around fine-tuning techniques creates what Haghighat sees as a sustainable market shift.

    "There is a lot of use cases for which closed models have gotten there and open ones have not," he said. "Where I'm seeing in the market is people using different training techniques — more recently, a lot of reinforcement learning and SFT — to be able to get this open model to be as good as the closed model, not at everything, but at this narrow band of capability that they want. That's very palpable in the market."

    For enterprises navigating the complex transition from closed to open AI models, Baseten's positioning offers a clear value proposition: infrastructure that handles the messy middle of fine-tuning while optimizing for the ultimate goal of performant, reliable, cost-effective inference at scale. The company's insistence that customers own their model weights — a stark contrast to competitors using training as a lock-in mechanism — reflects confidence that technical excellence, not contractual restrictions, will drive retention.

    Whether Baseten can execute on this vision depends on navigating tensions inherent in its strategy: staying at the infrastructure layer without becoming consultants, providing power and flexibility without overwhelming users with complexity, and building abstractions at exactly the right level as the market matures. The company's willingness to kill Blueprints when it failed suggests a pragmatism that could prove decisive in a market where many infrastructure providers over-promise and under-deliver.

    "Through and through, we're an inference company," Haghighat emphasized. "The reason that we did training is at the service of inference."

    That clarity of purpose — treating training as a means to an end rather than an end in itself—may be Baseten's most important strategic asset. As AI deployment matures from experimentation to production, the companies that solve the full stack stand to capture outsized value. But only if they avoid the trap of technology in search of a problem.

    At least Baseten's customers no longer have to SSH into boxes on Friday and pray their training jobs complete by Monday. In the infrastructure business, sometimes the best innovation is simply making the painful parts disappear.

  • 6 proven lessons from the AI projects that broke before they scaled

    Companies hate to admit it, but the road to production-level AI deployment is littered with proofs of concept (PoCs) that go nowhere and failed projects that never deliver on their goals. In certain domains there is little tolerance for iteration, especially in fields like life sciences, where the AI application may be helping bring new treatments to market or diagnosing diseases. Even slightly inaccurate analyses and assumptions early on can compound into sizable downstream drift.

    In analyzing dozens of AI PoCs that sailed through to full production use — or didn’t — six common pitfalls emerge. Interestingly, the cause of failure is usually not the quality of the technology but misaligned goals, poor planning or unrealistic expectations.

    Here’s a summary of what went wrong in real-world examples and practical guidance on how to get it right.

    Lesson 1: A vague vision spells disaster

    Every AI project needs a clear, measurable goal. Without it, developers are building a solution in search of a problem. For example, in developing an AI system for a pharmaceutical manufacturer’s clinical trials, the team aimed to “optimize the trial process,” but didn’t define what that meant. Did they need to accelerate patient recruitment, reduce participant dropout rates or lower the overall trial cost? The lack of focus led to a model that was technically sound but irrelevant to the client’s most pressing operational needs.

    Takeaway: Define specific, measurable objectives upfront. Use SMART criteria (Specific, Measurable, Achievable, Relevant, Time-bound). For example, aim for “reduce equipment downtime by 15% within six months” rather than a vague “make things better.” Document these goals and align stakeholders early to avoid scope creep.

    Lesson 2: Data quality trumps quantity

    Data is the lifeblood of AI, but poor-quality data is poison. In one project, a retail client began with years of sales data to predict inventory needs. The catch? The dataset was riddled with inconsistencies, including missing entries, duplicate records and outdated product codes. The model performed well in testing but failed in production because it learned from noisy, unreliable data.

    Takeaway: Invest in data quality over volume. Use tools like Pandas for preprocessing and Great Expectations for data validation to catch issues early. Conduct exploratory data analysis (EDA) with visualizations (like Seaborn) to spot outliers or inconsistencies. Clean data is worth more than terabytes of garbage.
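
    As a minimal sketch of what such checks look like in practice, the snippet below audits a hypothetical sales table with pandas before any modeling; the file and column names are invented for the example.

    # Minimal data-quality audit for a hypothetical sales table; file and column
    # names are illustrative. Catching these issues before training avoids learning
    # from noisy, unreliable data.
    import pandas as pd

    df = pd.read_csv("sales.csv")            # hypothetical historical sales export
    catalog = pd.read_csv("catalog.csv")     # hypothetical list of current product codes

    report = {
        "rows": len(df),
        "duplicate_rows": int(df.duplicated().sum()),
        "missing_by_column": df.isna().sum().to_dict(),
        "unknown_product_codes": int((~df["product_code"].isin(catalog["product_code"])).sum()),
    }
    print(report)

    # Drop exact duplicates and rows missing the prediction target before modeling.
    clean = df.drop_duplicates().dropna(subset=["units_sold"])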

    Lesson 3: Overcomplicating the model backfires

    Chasing technical complexity doesn't always lead to better outcomes. For example, on a healthcare project, development initially began by creating a sophisticated convolutional neural network (CNN) to identify anomalies in medical images.

    While the model was state-of-the-art, its high computational cost meant weeks of training, and its "black box" nature made it difficult for clinicians to trust. The application was revised to implement a simpler random forest model that not only matched the CNN's predictive accuracy but was faster to train and far easier to interpret — a critical factor for clinical adoption.

    Takeaway: Start simple. Use straightforward algorithms like random forest or XGBoost from scikit-learn to establish a baseline. Only scale to complex models, such as TensorFlow-based long short-term memory (LSTM) networks, if the problem demands it. Prioritize explainability with tools like SHAP (SHapley Additive exPlanations) to build trust with stakeholders.
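
    A baseline of that kind takes only a few lines. The sketch below trains a random forest on synthetic data and uses SHAP to rank the features driving its predictions; swap in real data and a classifier as the problem requires.

    # Simple baseline first: a random forest plus SHAP explanations on synthetic data.
    # Replace the toy dataset with real features before drawing any conclusions.
    import numpy as np
    import shap
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split

    X, y = make_regression(n_samples=2000, n_features=20, noise=0.1, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(X_train, y_train)
    print("baseline R^2:", round(model.score(X_test, y_test), 3))

    # SHAP values quantify how much each feature pushed individual predictions up or
    # down, which is the kind of transparency that builds stakeholder trust.
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X_test[:200])
    mean_abs = np.abs(shap_values).mean(axis=0)
    print("most influential features:", np.argsort(mean_abs)[::-1][:5])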

    Lesson 4: Ignoring deployment realities

    A model that shines in a Jupyter Notebook can crash in the real world. For example, a company’s initial deployment of a recommendation engine for its e-commerce platform couldn’t handle peak traffic. The model was built without scalability in mind and choked under load, causing delays and frustrated users. The oversight cost weeks of rework.

    Takeaway: Plan for production from day one. Package models in Docker containers and deploy with Kubernetes for scalability. Use TensorFlow Serving or FastAPI for efficient inference. Monitor performance with Prometheus and Grafana to catch bottlenecks early. Test under realistic conditions to ensure reliability.
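
    As a minimal sketch of that approach, the snippet below serves a joblib-saved model behind a FastAPI endpoint; the model file and the flat numeric feature vector are hypothetical stand-ins for whatever the training pipeline actually produces.

    import joblib
    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()
    model = joblib.load("model.joblib")  # hypothetical pipeline exported from training

    class PredictRequest(BaseModel):
        features: list[float]  # flat feature vector in the order the model expects

    @app.post("/predict")
    def predict(req: PredictRequest):
        prediction = model.predict([req.features])[0]
        return {"prediction": float(prediction)}

    Launched with uvicorn inside a Docker container, the same endpoint can then be load-tested under realistic traffic and scaled out with Kubernetes before it ever faces real users.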

    Lesson 5: Neglecting model maintenance

    AI models aren’t set-and-forget. In a financial forecasting project, the model performed well for months until market conditions shifted. Unmonitored data drift caused predictions to degrade, and the lack of a retraining pipeline meant manual fixes were needed. The project lost credibility before developers could recover.

    Takeaway: Build for the long haul. Implement monitoring for data drift using tools like Alibi Detect. Automate retraining with Apache Airflow and track experiments with MLflow. Incorporate active learning to prioritize labeling for uncertain predictions, keeping models relevant.
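
    As a sketch of what that monitoring can look like, the snippet below uses Alibi Detect's Kolmogorov-Smirnov drift detector to compare a recent production batch against a reference window; the arrays are random stand-ins, and the retraining hook is only indicated in a comment.

    import numpy as np
    from alibi_detect.cd import KSDrift

    x_ref = np.random.randn(1000, 10)        # stand-in for training-time feature matrix
    x_live = np.random.randn(200, 10) + 0.5  # stand-in for a recent production batch

    detector = KSDrift(x_ref, p_val=0.05)    # per-feature two-sample KS test
    result = detector.predict(x_live)

    if result["data"]["is_drift"]:
        # In a real pipeline, this is where an Airflow-triggered retraining job would be
        # kicked off and the drift event logged to MLflow
        print("Drift detected: schedule retraining")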

    Lesson 6: Underestimating stakeholder buy-in

    Technology doesn’t exist in a vacuum. A fraud detection model was technically flawless but flopped because end-users — bank employees — didn’t trust it. Without clear explanations or training, they ignored the model’s alerts, rendering it useless.

    Takeaway: Prioritize human-centric design. Use explainability tools like SHAP to make model decisions transparent. Engage stakeholders early with demos and feedback loops. Train users on how to interpret and act on AI outputs. Trust is as critical as accuracy.

    Best practices for success in AI projects

    Drawing from these failures, here’s the roadmap to get it right:

    • Set clear goals: Use SMART criteria to align teams and stakeholders.

    • Prioritize data quality: Invest in cleaning, validation and EDA before modeling.

    • Start simple: Build baselines with simple algorithms before scaling complexity.

    • Design for production: Plan for scalability, monitoring and real-world conditions.

    • Maintain models: Automate retraining and monitor for drift to stay relevant.

    • Engage stakeholders: Foster trust with explainability and user training.

    Building resilient AI

    AI’s potential is intoxicating, yet failed AI projects teach us that success isn’t just about algorithms. It’s about discipline, planning and adaptability. As AI evolves, emerging trends like federated learning for privacy-preserving models and edge AI for real-time insights will raise the bar. By learning from past mistakes, teams can build scalable production systems that are robust, accurate and trusted.

    Kavin Xavier is VP of AI solutions at CapeStart.

    Read more from our guest writers. Or, consider submitting a post of your own! See our guidelines here.

  • What could possibly go wrong if an enterprise replaces all its engineers with AI?

    AI coding, vibe coding and agentic swarms have made a dramatic and astonishing recent market entrance, with the AI Code Tools market valued at $4.8 billion and expected to grow at a 23% annual rate. Enterprises are grappling with AI coding agents and what to do about expensive human coders.

    They don’t lack for advice.  OpenAI’s CEO estimates that AI can perform over 50% of what human engineers can do.  Six months ago, Anthropic’s CEO said that AI would write 90% of code in six months.  Meta’s CEO said he believes AI will replace mid-level engineers “soon.” Judging by recent tech layoffs, it seems many executives are embracing that advice.

    Software engineers and data scientists are among the most expensive salary lines at many companies, and business and technology leaders may be tempted to replace them with AI. However, recent high-profile failures demonstrate that engineers and their expertise remain valuable, even as AI continues to make impressive advances.

    SaaStr disaster

    Jason Lemkin, a tech entrepreneur and founder of the SaaS community SaaStr, has been vibe coding a SaaS networking app and live-tweeting his experience. About a week into his adventure, he admitted to his audience that something was going very wrong.  The AI deleted his production database despite his request for a “code and action freeze.” This is the kind of mistake no experienced (or even semi-experienced) engineer would make.

    If you have ever worked in a professional coding environment, you know to split your development environment from production. Junior engineers are given full access to the development environment (it’s crucial for productivity), but access to production is granted on a limited, need-to-know basis to a few of the most trusted senior engineers. The reason for restricted access is precisely this use case: to prevent a junior engineer from accidentally taking down production.

    In fact, Lemkin made two mistakes. First, for something as critical as production, access is simply never granted to unreliable actors (we don’t rely on asking a junior engineer or an AI nicely). Second, he never separated development from production. In a subsequent public conversation on LinkedIn, Lemkin, who holds a Stanford Executive MBA and a Berkeley JD, admitted that he was not aware of the best practice of splitting development and production databases.

    The takeaway for business leaders is that standard software engineering best practices still apply. We should incorporate at least the same safety constraints for AI as we do for junior engineers. Arguably, we should go beyond that and treat AI slightly adversarially: There are reports that, like HAL in Stanley Kubrick's 2001: A Space Odyssey, the AI might try to break out of its sandbox environment to accomplish a task. With more vibe coding, having experienced engineers who understand how complex software systems work and can implement the proper guardrails in development processes will become increasingly necessary.

    Tea hack

    Sean Cook is the founder and CEO of Tea, a mobile application launched in 2023 and designed to help women date safely. In the summer of 2025, Tea was “hacked”: 72,000 images, including 13,000 verification photos and images of government IDs, were leaked onto the public discussion forum 4chan. Worse, Tea’s own privacy policy promised that these images would be “deleted immediately” after users were authenticated, meaning the company potentially violated its own policy.

    I use “hacked” in air quotes because the incident stems less from the cleverness of the attackers than from the ineptitude of the defenders. In addition to violating its own data policies, the app left a Firebase storage bucket unsecured, exposing sensitive user data to the public internet. It’s the digital equivalent of locking your front door but leaving your back door open with your family jewelry ostentatiously hanging on the doorknob.

    While we don’t know if the root cause was vibe coding, the Tea hack highlights how catastrophic breaches can stem from basic, preventable security errors rooted in poor development processes. It is the kind of vulnerability that a disciplined and thoughtful engineering process addresses. Unfortunately, relentless financial pressure pushes a “lean,” “move fast and break things” culture that is the polar opposite of that discipline, and vibe coding only exacerbates the problem.

    How to safely adopt AI coding agents?

    So how should enterprise and technology leaders think about AI? First, this is not a call to abandon AI for coding.  An MIT Sloan study estimated AI leads to productivity gains between 8% and 39%, while a McKinsey study found a 10% to 50% reduction in time to task completion with the use of AI. 

    However, we should be aware of the risks. The old lessons of software engineering don’t go away. These include many tried-and-true best practices, such as version control, automated unit and integration tests, safety checks like SAST/DAST, separating development and production environments, code review and secrets management. If anything, they become more salient.

    AI can generate code 100 times faster than humans can type, fostering an illusion of productivity that is a tempting siren call for many executives. However, the quality of this rapidly generated AI slop is still up for debate. To develop complex production systems, enterprises need the thoughtful, seasoned experience of human engineers.

    Tianhui Michael Li is president at Pragmatic Institute and the founder and president of The Data Incubator.

    Read more from our guest writers. Or, consider submitting a post of your own! See our guidelines here.

  • Terminal-Bench 2.0 launches alongside Harbor, a new framework for testing agents in containers

    The developers of Terminal-Bench, a benchmark suite for evaluating the performance of autonomous AI agents on real-world terminal-based tasks, have released version 2.0 alongside Harbor, a new framework for testing, improving and optimizing AI agents in containerized environments.

    The dual release aims to address long-standing pain points in testing and optimizing AI agents, particularly those built to operate autonomously in realistic developer environments.

    With a more difficult and rigorously verified task set, Terminal-Bench 2.0 replaces version 1.0 as the standard for assessing frontier model capabilities.

    Harbor, the accompanying runtime framework, enables developers and researchers to scale evaluations across thousands of cloud containers and integrates with both open-source and proprietary agents and training pipelines.

    “Harbor is the package we wish we had had while making Terminal-Bench," wrote co-creator Alex Shaw on X. "It’s for agent, model, and benchmark developers and researchers who want to evaluate and improve agents and models."

    Higher Bar, Cleaner Data

    Terminal-Bench 1.0 saw rapid adoption after its release in May 2025, becoming a default benchmark for evaluating AI-powered agents that operate in developer-style terminal environments. These agents interact with systems through the command line, mimicking how developers work beneath the graphical user interface.

    However, its broad scope came with inconsistencies. Several tasks were identified by the community as poorly specified or unstable due to external service changes.

    Version 2.0 addresses those issues directly. The updated suite includes 89 tasks, each subjected to several hours of manual and LLM-assisted validation. The emphasis is on making tasks solvable, realistic, and clearly specified, raising the difficulty ceiling while improving reliability and reproducibility.

    A notable example is the download-youtube task, which was removed or refactored in 2.0 due to its dependence on unstable third-party APIs.

    “Astute Terminal-Bench fans may notice that SOTA performance is comparable to TB1.0 despite our claim that TB2.0 is harder,” Shaw noted on X. “We believe this is because task quality is substantially higher in the new benchmark.”

    Harbor: Unified Rollouts at Scale

    Alongside the benchmark update, the team launched Harbor, a new framework for running and evaluating agents in cloud-deployed containers.

    Harbor supports large-scale rollout infrastructure, with compatibility for major providers like Daytona and Modal.

    Designed to generalize across agent architectures, Harbor supports:

    • Evaluation of any container-installable agent

    • Scalable supervised fine-tuning (SFT) and reinforcement learning (RL) pipelines

    • Custom benchmark creation and deployment

    • Full integration with Terminal-Bench 2.0

    Harbor was used internally to run tens of thousands of rollouts during the creation of the new benchmark. It is now publicly available via harborframework.com, with documentation for testing and submitting agents to the public leaderboard.

    Early Results: GPT-5 Leads in Task Success

    Initial results from the Terminal-Bench 2.0 leaderboard show OpenAI's Codex CLI (command line interface), a GPT-5 powered variant, in the lead, with a 49.6% success rate — the highest among all agents tested so far.

    Close behind are other GPT-5 variants and Claude Sonnet 4.5-based agents.

    Top 5 Agent Results (Terminal-Bench 2.0):

    1. Codex CLI (GPT-5) — 49.6%

    2. Codex CLI (GPT-5-Codex) — 44.3%

    3. OpenHands (GPT-5) — 43.8%

    4. Terminus 2 (GPT-5-Codex) — 43.4%

    5. Terminus 2 (Claude Sonnet 4.5) — 42.8%

    The close clustering among top models indicates active competition across platforms, with no single agent solving more than half the tasks.

    Submission and Use

    To test or submit an agent, users install Harbor and run the benchmark using simple CLI commands. Submissions to the leaderboard require five benchmark runs, and results can be emailed to the developers along with job directories for validation.

    harbor run -d terminal-bench@2.0 -m "<model>" -a "<agent>" --n-attempts 5 --jobs-dir <path/to/output>

    Terminal-Bench 2.0 is already being integrated into research workflows focused on agentic reasoning, code generation, and tool use. According to co-creator Mike Merrill, a postdoctoral researcher at Stanford, a detailed preprint is in progress covering the verification process and design methodology behind the benchmark.

    Aiming for Standardization

    The combined release of Terminal-Bench 2.0 and Harbor marks a step toward more consistent and scalable agent evaluation infrastructure. As LLM agents proliferate in developer and operational environments, the need for controlled, reproducible testing has grown.

    These tools offer a potential foundation for a unified evaluation stack — supporting model improvement, environment simulation, and benchmark standardization across the AI ecosystem.

  • Ship fast, optimize later: top AI engineers don’t care about cost — they’re prioritizing deployment

    Across industries, rising compute expenses are often cited as a barrier to AI adoption — but leading companies are finding that cost is no longer the real constraint.

    The tougher challenges (and the ones top of mind for many tech leaders)? Latency, flexibility and capacity.

    At Wonder, for instance, AI adds a mere few cents per order; the food delivery and takeout company is much more concerned with cloud capacity as demand skyrockets. Recursion, for its part, has been focused on balancing small and larger-scale training and deployment via on-premises clusters and the cloud; this has afforded the biotech company flexibility for rapid experimentation.

    The companies’ real-world experiences highlight a broader industry trend: For enterprises operating AI at scale, economics aren't the deciding factor — the conversation has shifted from how to pay for AI to how fast it can be deployed and sustained.

    AI leaders from the two companies recently sat down with VentureBeat CEO and editor-in-chief Matt Marshall as part of VB’s traveling AI Impact Series. Here’s what they shared.

    Wonder: Rethink what you assume about capacity

    Wonder uses AI to power everything from recommendations to logistics — yet, as of now, reported CTO James Chen, AI adds just a few cents per order.

    Chen explained that the technology component of a meal order costs 14 cents, the AI adds 2 to 3 cents, although that’s “going up really rapidly” to 5 to 8 cents. Still, that seems almost immaterial compared to total operating costs.

    Instead, the 100% cloud-native AI company’s main concern has been securing enough capacity to keep up with growing demand. Wonder was built on “the assumption” (which proved to be incorrect) that there would be “unlimited capacity,” so the team could move “super fast” and wouldn’t have to worry about managing infrastructure, Chen noted.

    But the company has grown quite a bit over the last few years, he said; as a result, about six months ago, “we started getting little signals from the cloud providers, ‘Hey, you might need to consider going to region two,’” because they were running out of capacity for CPU or data storage at their facilities as demand grew.

    It was “very shocking” that they had to move to plan B earlier than they anticipated. “Obviously it's good practice to be multi-region, but we were thinking maybe two more years down the road,” said Chen.

    What's not economically feasible (yet)

    Wonder built its own model to maximize its conversion rate, Chen noted; the goal is to surface new restaurants to relevant customers as much as possible. These are “isolated scenarios” where models are trained over time to be “very, very efficient and very fast.”

    Currently, the best bet for Wonder’s use case is large models, Chen noted. But in the long term, they’d like to move to small models that are hyper-customized to individuals (via AI agents or concierges) based on their purchase history and even their clickstream. “Having these micro models is definitely the best, but right now the cost is very expensive,” Chen noted. “If you try to create one for each person, it's just not economically feasible.”

    Budgeting is an art, not a science

    Wonder gives its devs and data scientists as much playroom as possible to experiment, and internal teams review the costs of use to make sure nobody turned on a model and “jacked up massive compute around a huge bill,” said Chen.

    The company is trying different things to offload to AI and operate within margins. “But then it's very hard to budget because you have no idea,” he said. One of the challenging things is the pace of development; when a new model comes out, “we can’t just sit there, right? We have to use it.”

    Budgeting for the unknown economics of a token-based system is “definitely art versus science.”

    A critical component in the software development lifecycle is preserving context when using large native models, he explained. When you find something that works, you can add it to your company’s “corpus of context” that can be sent with every request. That’s big and it costs money each time.

    “Over 50%, up to 80% of your costs is just resending the same information back into the same engine again on every request,” said Chen.

    In theory, the more volume they handle, the less each unit should cost. “I know when a transaction happens, I'll pay the X cent tax for each one, but I don't want to be limited to use the technology for all these other creative ideas."

    The 'vindication moment' for Recursion

    Recursion, for its part, has focused on meeting broad-ranging compute needs via a hybrid infrastructure of on-premise clusters and cloud inference.

    When initially looking to build out its AI infrastructure, the company had to go with its own setup, as “the cloud providers didn't have very many good offerings,” explained CTO Ben Mabey. “The vindication moment was that we needed more compute and we looked to the cloud providers and they were like, ‘Maybe in a year or so.’”

    The company’s first cluster in 2017 incorporated Nvidia gaming GPUs (1080s, launched in 2016); they have since added Nvidia H100s and A100s, and use a Kubernetes cluster that they run in the cloud or on-prem.

    Addressing the longevity question, Mabey noted: “These gaming GPUs are actually still being used today, which is crazy, right? The myth that a GPU's life span is only three years, that's definitely not the case. A100s are still top of the list, they're the workhorse of the industry.”

    Best use cases on-prem vs cloud; cost differences

    More recently, Mabey’s team has been training a foundation model on Recursion’s image repository (which consists of petabytes of data and more than 200 pictures). This and other types of big training jobs have required a “massive cluster” and connected, multi-node setups.

    “When we need that fully-connected network and access to a lot of our data in a high parallel file system, we go on-prem,” he explained. On the other hand, shorter workloads run in the cloud.

    Recursion’s method is to “pre-empt” GPUs and Google tensor processing units (TPUs), which is the process of interrupting running GPU tasks to work on higher-priority ones. “Because we don't care about the speed in some of these inference workloads where we're uploading biological data, whether that's an image or sequencing data, DNA data,” Mabey explained. “We can say, ‘Give this to us in an hour,’ and we're fine if it kills the job.”

    From a cost perspective, moving large workloads on-prem is “conservatively” 10 times cheaper, Mabey noted; for a five year TCO, it's half the cost. On the other hand, for smaller storage needs, the cloud can be “pretty competitive” cost-wise.

    Ultimately, Mabey urged tech leaders to step back and determine whether they’re truly willing to commit to AI; cost-effective solutions typically require multi-year buy-ins.

    “From a psychological perspective, I've seen peers of ours who will not invest in compute, and as a result they're always paying on demand," said Mabey. "Their teams use far less compute because they don't want to run up the cloud bill. Innovation really gets hampered by people not wanting to burn money.”

  • Google debuts AI chips with 4X performance boost, secures Anthropic megadeal worth billions

    Google Cloud is introducing what it calls its most powerful artificial intelligence infrastructure to date, unveiling a seventh-generation Tensor Processing Unit and expanded Arm-based computing options designed to meet surging demand for AI model deployment — what the company characterizes as a fundamental industry shift from training models to serving them to billions of users.

    The announcement, made Thursday, centers on Ironwood, Google's latest custom AI accelerator chip, which will become generally available in the coming weeks. In a striking validation of the technology, Anthropic, the AI safety company behind the Claude family of models, disclosed plans to access up to one million of these TPU chips — a commitment worth tens of billions of dollars and among the largest known AI infrastructure deals to date.

    The move underscores an intensifying competition among cloud providers to control the infrastructure layer powering artificial intelligence, even as questions mount about whether the industry can sustain its current pace of capital expenditure. Google's approach — building custom silicon rather than relying solely on Nvidia's dominant GPU chips — amounts to a long-term bet that vertical integration from chip design through software will deliver superior economics and performance.

    Why companies are racing to serve AI models, not just train them

    Google executives framed the announcements around what they call "the age of inference" — a transition point where companies shift resources from training frontier AI models to deploying them in production applications serving millions or billions of requests daily.

    "Today's frontier models, including Google's Gemini, Veo, and Imagen and Anthropic's Claude train and serve on Tensor Processing Units," said Amin Vahdat, vice president and general manager of AI and Infrastructure at Google Cloud. "For many organizations, the focus is shifting from training these models to powering useful, responsive interactions with them."

    This transition has profound implications for infrastructure requirements. Where training workloads can often tolerate batch processing and longer completion times, inference — the process of actually running a trained model to generate responses — demands consistently low latency, high throughput, and unwavering reliability. A chatbot that takes 30 seconds to respond, or a coding assistant that frequently times out, becomes unusable regardless of the underlying model's capabilities.

    Agentic workflows — where AI systems take autonomous actions rather than simply responding to prompts — create particularly complex infrastructure challenges, requiring tight coordination between specialized AI accelerators and general-purpose computing.

    Inside Ironwood's architecture: 9,216 chips working as one supercomputer

    Ironwood is more than an incremental improvement over Google's sixth-generation TPUs. According to technical specifications shared by the company, it delivers more than four times better performance for both training and inference workloads compared to its predecessor — gains that Google attributes to a system-level co-design approach rather than simply increasing transistor counts.

    The architecture's most striking feature is its scale. A single Ironwood "pod" — a tightly integrated unit of TPU chips functioning as one supercomputer — can connect up to 9,216 individual chips through Google's proprietary Inter-Chip Interconnect network operating at 9.6 terabits per second. To put that bandwidth in perspective, it's roughly equivalent to downloading the entire Library of Congress in under two seconds.

    This massive interconnect fabric allows the 9,216 chips to share access to 1.77 petabytes of High Bandwidth Memory — memory fast enough to keep pace with the chips' processing speeds. That's approximately 40,000 high-definition Blu-ray movies' worth of working memory, instantly accessible by thousands of processors simultaneously. "For context, that means Ironwood Pods can deliver 118x more FP8 ExaFLOPS versus the next closest competitor," Google stated in technical documentation.

    The system employs Optical Circuit Switching technology that acts as a "dynamic, reconfigurable fabric." When individual components fail or require maintenance — inevitable at this scale — the OCS technology automatically reroutes data traffic around the interruption within milliseconds, allowing workloads to continue running without user-visible disruption.

    This reliability focus reflects lessons learned from deploying five previous TPU generations. Google reported that its fleet-wide uptime for liquid-cooled systems has maintained approximately 99.999% availability since 2020 — equivalent to less than six minutes of downtime per year.

    Anthropic's billion-dollar bet validates Google's custom silicon strategy

    Perhaps the most significant external validation of Ironwood's capabilities comes from Anthropic's commitment to access up to one million TPU chips — a staggering figure in an industry where even clusters of 10,000 to 50,000 accelerators are considered massive.

    "Anthropic and Google have a longstanding partnership and this latest expansion will help us continue to grow the compute we need to define the frontier of AI," said Krishna Rao, Anthropic's chief financial officer, in the official partnership agreement. "Our customers — from Fortune 500 companies to AI-native startups — depend on Claude for their most important work, and this expanded capacity ensures we can meet our exponentially growing demand."

    According to a separate statement, Anthropic will have access to "well over a gigawatt of capacity coming online in 2026" — enough electricity to power a small city. The company specifically cited TPUs' "price-performance and efficiency" as key factors in the decision, along with "existing experience in training and serving its models with TPUs."

    Industry analysts estimate that a commitment to access one million TPU chips, with associated infrastructure, networking, power, and cooling, likely represents a multi-year contract worth tens of billions of dollars — among the largest known cloud infrastructure commitments in history.

    James Bradbury, Anthropic's head of compute, elaborated on the inference focus: "Ironwood's improvements in both inference performance and training scalability will help us scale efficiently while maintaining the speed and reliability our customers expect."

    Google's Axion processors target the computing workloads that make AI possible

    Alongside Ironwood, Google introduced expanded options for its Axion processor family — custom Arm-based CPUs designed for general-purpose workloads that support AI applications but don't require specialized accelerators.

    The N4A instance type, now entering preview, targets what Google describes as "microservices, containerized applications, open-source databases, batch, data analytics, development environments, experimentation, data preparation and web serving jobs that make AI applications possible." The company claims N4A delivers up to 2X better price-performance than comparable current-generation x86-based virtual machines.

    Google is also previewing C4A metal, its first bare-metal Arm instance, which provides dedicated physical servers for specialized workloads such as Android development, automotive systems, and software with strict licensing requirements.

    The Axion strategy reflects a growing conviction that the future of computing infrastructure requires both specialized AI accelerators and highly efficient general-purpose processors. While a TPU handles the computationally intensive task of running an AI model, Axion-class processors manage data ingestion, preprocessing, application logic, API serving, and countless other tasks in a modern AI application stack.

    Early customer results suggest the approach delivers measurable economic benefits. Vimeo reported observing "a 30% improvement in performance for our core transcoding workload compared to comparable x86 VMs" in initial N4A tests. ZoomInfo measured "a 60% improvement in price-performance" for data processing pipelines running on Java services, according to Sergei Koren, the company's chief infrastructure architect.

    Software tools turn raw silicon performance into developer productivity

    Hardware performance means little if developers cannot easily harness it. Google emphasized that Ironwood and Axion are integrated into what it calls AI Hypercomputer — "an integrated supercomputing system that brings together compute, networking, storage, and software to improve system-level performance and efficiency."

    According to an October 2025 IDC Business Value Snapshot study, AI Hypercomputer customers achieved on average 353% three-year return on investment, 28% lower IT costs, and 55% more efficient IT teams.

    Google disclosed several software enhancements designed to maximize Ironwood utilization. Google Kubernetes Engine now offers advanced maintenance and topology awareness for TPU clusters, enabling intelligent scheduling and highly resilient deployments. The company's open-source MaxText framework now supports advanced training techniques including Supervised Fine-Tuning and Generative Reinforcement Policy Optimization.

    Perhaps most significant for production deployments, Google's Inference Gateway intelligently load-balances requests across model servers to optimize critical metrics. According to Google, it can reduce time-to-first-token latency by 96% and serving costs by up to 30% through techniques like prefix-cache-aware routing.

    The Inference Gateway monitors key metrics including KV cache hits, GPU or TPU utilization, and request queue length, then routes incoming requests to the optimal replica. For conversational AI applications where multiple requests might share context, routing requests with shared prefixes to the same server instance can dramatically reduce redundant computation.
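
    Google has not published the gateway's internals, but the general idea behind prefix-cache-aware routing can be sketched in a few lines of Python: pin requests that share a prompt prefix to the same replica so its cached attention state can be reused. The replica names and prefix length below are arbitrary illustrations, not Google's implementation.

    import hashlib

    REPLICAS = ["replica-0", "replica-1", "replica-2"]  # hypothetical model servers
    PREFIX_CHARS = 32  # how much of the prompt to treat as the shared, cacheable prefix

    def route(prompt: str) -> str:
        # Hash only the leading prefix so requests that share context land on the same replica
        digest = hashlib.sha256(prompt[:PREFIX_CHARS].encode()).hexdigest()
        return REPLICAS[int(digest, 16) % len(REPLICAS)]

    system = "System: you are a helpful support bot for Acme.\n"
    print(route(system + "User: where is my order?"))  # lands on the same replica as...
    print(route(system + "User: cancel order 1234"))   # ...this request with the same prefix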

    The hidden challenge: powering and cooling one-megawatt server racks

    Behind these announcements lies a massive physical infrastructure challenge that Google addressed at the recent Open Compute Project EMEA Summit. The company disclosed that it's implementing +/-400 volt direct current power delivery capable of supporting up to one megawatt per rack — a tenfold increase from typical deployments.

    "The AI era requires even greater power delivery capabilities," explained Madhusudan Iyengar and Amber Huffman, Google principal engineers, in an April 2025 blog post. "ML will require more than 500 kW per IT rack before 2030."

    Google is collaborating with Meta and Microsoft to standardize electrical and mechanical interfaces for high-voltage DC distribution. The company selected 400 VDC specifically to leverage the supply chain established by electric vehicles, "for greater economies of scale, more efficient manufacturing, and improved quality and scale."

    On cooling, Google revealed it will contribute its fifth-generation cooling distribution unit design to the Open Compute Project. The company has deployed liquid cooling "at GigaWatt scale across more than 2,000 TPU Pods in the past seven years" with fleet-wide availability of approximately 99.999%.

    Water can transport approximately 4,000 times more heat per unit volume than air for a given temperature change — critical as individual AI accelerator chips increasingly dissipate 1,000 watts or more.

    Custom silicon gambit challenges Nvidia's AI accelerator dominance

    Google's announcements come as the AI infrastructure market reaches an inflection point. While Nvidia maintains overwhelming dominance in AI accelerators — holding an estimated 80-95% market share — cloud providers are increasingly investing in custom silicon to differentiate their offerings and improve unit economics.

    Amazon Web Services pioneered this approach with Graviton Arm-based CPUs and Inferentia / Trainium AI chips. Microsoft has developed Cobalt processors and is reportedly working on AI accelerators. Google now offers the most comprehensive custom silicon portfolio among major cloud providers.

    The strategy faces inherent challenges. Custom chip development requires enormous upfront investment — often billions of dollars. The software ecosystem for specialized accelerators lags behind Nvidia's CUDA platform, which benefits from 15+ years of developer tools. And rapid AI model architecture evolution creates risk that custom silicon optimized for today's models becomes less relevant as new techniques emerge.

    Yet Google argues its approach delivers unique advantages. "This is how we built the first TPU ten years ago, which in turn unlocked the invention of the Transformer eight years ago — the very architecture that powers most of modern AI," the company noted, referring to the seminal "Attention Is All You Need" paper from Google researchers in 2017.

    The argument is that tight integration — "model research, software, and hardware development under one roof" — enables optimizations impossible with off-the-shelf components.

    Beyond Anthropic, several other customers provided early feedback. Lightricks, which develops creative AI tools, reported that early Ironwood testing "makes us highly enthusiastic" about creating "more nuanced, precise, and higher-fidelity image and video generation for our millions of global customers," said Yoav HaCohen, the company's research director.

    Google's announcements raise questions that will play out over coming quarters. Can the industry sustain current infrastructure spending, with major AI companies collectively committing hundreds of billions of dollars? Will custom silicon prove economically superior to Nvidia GPUs? How will model architectures evolve?

    For now, Google appears committed to a strategy that has defined the company for decades: building custom infrastructure to enable applications impossible on commodity hardware, then making that infrastructure available to customers who want similar capabilities without the capital investment.

    As the AI industry transitions from research labs to production deployments serving billions of users, that infrastructure layer — the silicon, software, networking, power, and cooling that make it all run — may prove as important as the models themselves.

    And if Anthropic's willingness to commit to accessing up to one million chips is any indication, Google's bet on custom silicon designed specifically for the age of inference may be paying off just as demand reaches its inflection point.

  • Moonshot’s Kimi K2 Thinking emerges as leading open source AI, outperforming GPT-5, Claude Sonnet 4.5 on key benchmarks

    Even as concern and skepticism grow over U.S. AI startup OpenAI's buildout strategy and high spending commitments, Chinese open source AI providers are escalating their competition, and one has even caught up to OpenAI's flagship paid proprietary model, GPT-5, on key third-party performance benchmarks with a new, free model.

    The Chinese AI startup Moonshot AI’s new Kimi K2 Thinking model, released today, has vaulted past both proprietary and open-weight competitors to claim the top position in reasoning, coding, and agentic-tool benchmarks.

    Despite being fully open-source, the model now outperforms OpenAI’s GPT-5, Anthropic’s Claude Sonnet 4.5 (Thinking mode), and xAI's Grok-4 on several standard evaluations — an inflection point for the competitiveness of open AI systems.

    Developers can access the model via platform.moonshot.ai and kimi.com; weights and code are hosted on Hugging Face. The open release includes APIs for chat, reasoning, and multi-tool workflows.

    Users can try out Kimi K2 Thinking directly through its own ChatGPT-like website competitor and on a Hugging Face space as well.

    Modified Standard Open Source License

    Moonshot AI has formally released Kimi K2 Thinking under a Modified MIT License on Hugging Face.

    The license grants full commercial and derivative rights — meaning individual researchers and developers working on behalf of enterprise clients can access it freely and use it in commercial applications — but adds one restriction:

    "If the software or any derivative product serves over 100 million monthly active users or generates over $20 million USD per month in revenue, the deployer must prominently display 'Kimi K2' on the product’s user interface."

    For most research and enterprise applications, this clause functions as a light-touch attribution requirement while preserving the freedoms of standard MIT licensing.

    It makes K2 Thinking one of the most permissively licensed frontier-class models currently available.

    A New Benchmark Leader

    Kimi K2 Thinking is a Mixture-of-Experts (MoE) model built around one trillion parameters, of which 32 billion activate per inference.

    It combines long-horizon reasoning with structured tool use, executing up to 200–300 sequential tool calls without human intervention.

    According to Moonshot’s published test results, K2 Thinking achieved:

    • 44.9 % on Humanity’s Last Exam (HLE), a state-of-the-art score;

    • 60.2 % on BrowseComp, an agentic web-search and reasoning test;

    • 71.3 % on SWE-Bench Verified and 83.1 % on LiveCodeBench v6, key coding evaluations;

    • 56.3 % on Seal-0, a benchmark for real-world information retrieval.

    Across these tasks, K2 Thinking consistently outperforms GPT-5’s corresponding scores and surpasses the previous open-weight leader MiniMax-M2—released just weeks earlier by Chinese rival MiniMax AI.

    Open Model Outperforms Proprietary Systems

    GPT-5 and Claude Sonnet 4.5 Thinking remain the leading proprietary “thinking” models.

    Yet in the same benchmark suite, K2 Thinking’s agentic reasoning scores exceed both: for instance, on BrowseComp the open model’s 60.2 % decisively leads GPT-5’s 54.9 % and Claude 4.5’s 24.1 %.

    K2 Thinking also edges GPT-5 in GPQA Diamond (85.7 % vs 84.5 %) and matches it on mathematical reasoning tasks such as AIME 2025 and HMMT 2025.

    Only in certain heavy-mode configurations—where GPT-5 aggregates multiple trajectories—does the proprietary model regain parity.

    That Moonshot’s fully open-weight release can meet or exceed GPT-5’s scores marks a turning point. The gap between closed frontier systems and publicly available models has effectively collapsed for high-end reasoning and coding.

    Surpassing MiniMax-M2: The Previous Open-Source Benchmark

    When VentureBeat profiled MiniMax-M2 just a week and a half ago, it was hailed as the “new king of open-source LLMs,” achieving top scores among open-weight systems:

    • τ²-Bench 77.2

    • BrowseComp 44.0

    • FinSearchComp-global 65.5

    • SWE-Bench Verified 69.4

    Those results placed MiniMax-M2 near GPT-5-level capability in agentic tool use. Yet Kimi K2 Thinking now eclipses them by wide margins.

    Its BrowseComp result of 60.2 % exceeds M2’s 44.0 %, and its SWE-Bench Verified 71.3 % edges out M2’s 69.4 %. Even on financial-reasoning tasks such as FinSearchComp-T3 (47.4 %), K2 Thinking performs comparably while maintaining superior general-purpose reasoning.

    Technically, both models adopt sparse Mixture-of-Experts architectures for compute efficiency, but Moonshot’s network activates more experts and deploys advanced quantization-aware training (INT4 QAT).

    This design doubles inference speed relative to standard precision without degrading accuracy—critical for long “thinking-token” sessions reaching 256 k context windows.

    Agentic Reasoning and Tool Use

    K2 Thinking’s defining capability lies in its explicit reasoning trace. The model outputs an auxiliary field, reasoning_content, revealing intermediate logic before each final response. This transparency preserves coherence across long multi-turn tasks and multi-step tool calls.
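
    Moonshot exposes the model through an OpenAI-compatible API, so a minimal sketch of reading that trace might look like the following; the base URL and model identifier are assumptions to be checked against platform.moonshot.ai, and the reasoning field is read defensively in case the SDK surfaces it differently.

    from openai import OpenAI

    # Assumed OpenAI-compatible endpoint; confirm the exact base URL and model name before use
    client = OpenAI(base_url="https://api.moonshot.ai/v1", api_key="YOUR_MOONSHOT_API_KEY")

    response = client.chat.completions.create(
        model="kimi-k2-thinking",  # assumed identifier for the K2 Thinking model
        messages=[{"role": "user", "content": "Plan and summarize today's AI infrastructure news."}],
    )

    message = response.choices[0].message
    print(getattr(message, "reasoning_content", None))  # intermediate reasoning, if exposed
    print(message.content)                              # final answer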

    A reference implementation published by Moonshot demonstrates how the model autonomously conducts a “daily news report” workflow: invoking date and web-search tools, analyzing retrieved content, and composing structured output—all while maintaining internal reasoning state.

    This end-to-end autonomy enables the model to plan, search, execute, and synthesize evidence across hundreds of steps, mirroring the emerging class of “agentic AI” systems that operate with minimal supervision.

    Efficiency and Access

    Despite its trillion-parameter scale, K2 Thinking’s runtime cost remains modest. Moonshot lists usage at:

    • $0.15 / 1 M tokens (cache hit)

    • $0.60 / 1 M tokens (cache miss)

    • $2.50 / 1 M tokens output

    These rates are competitive even against MiniMax-M2’s $0.30 input / $1.20 output pricing—and an order of magnitude below GPT-5 ($1.25 input / $10 output).

    Comparative Context: Open-Weight Acceleration

    The rapid succession of M2 and K2 Thinking illustrates how quickly open-source research is catching frontier systems. MiniMax-M2 demonstrated that open models could approach GPT-5-class agentic capability at a fraction of the compute cost. Moonshot has now advanced that frontier further, pushing open weights beyond parity into outright leadership.

    Both models rely on sparse activation for efficiency, but K2 Thinking’s higher activation count (32 B vs 10 B active parameters) yields stronger reasoning fidelity across domains. Its test-time scaling—expanding “thinking tokens” and tool-calling turns—provides measurable performance gains without retraining, a feature not yet observed in MiniMax-M2.

    Technical Outlook

    Moonshot reports that K2 Thinking supports native INT4 inference and 256 k-token contexts with minimal performance degradation. Its architecture integrates quantization, parallel trajectory aggregation (“heavy mode”), and Mixture-of-Experts routing tuned for reasoning tasks.

    In practice, these optimizations allow K2 Thinking to sustain complex planning loops—code compile–test–fix, search–analyze–summarize—over hundreds of tool calls. This capability underpins its superior results on BrowseComp and SWE-Bench, where reasoning continuity is decisive.

    Enormous Implications for the AI Ecosystem

    The convergence of open and closed models at the high end signals a structural shift in the AI landscape. Enterprises that once relied exclusively on proprietary APIs can now deploy open alternatives matching GPT-5-level reasoning while retaining full control of weights, data, and compliance.

    Moonshot’s open publication strategy follows the precedent set by DeepSeek R1, Qwen3, GLM-4.6 and MiniMax-M2 but extends it to full agentic reasoning.

    For academic and enterprise developers, K2 Thinking provides both transparency and interoperability—the ability to inspect reasoning traces and fine-tune performance for domain-specific agents.

    The arrival of K2 Thinking signals that Moonshot — a young startup founded in 2023 with investment from some of China's biggest apps and tech companies — is here to play in an intensifying competition, and comes amid growing scrutiny of the financial sustainability of AI’s largest players.

    Just a day ago, OpenAI CFO Sarah Friar sparked controversy after suggesting at the WSJ Tech Live event that the U.S. government might eventually need to provide a “backstop” for the company’s more than $1.4 trillion in compute and data-center commitments — a comment widely interpreted as a call for taxpayer-backed loan guarantees.

    Although Friar later clarified that OpenAI was not seeking direct federal support, the episode reignited debate about the scale and concentration of AI capital spending.

    With OpenAI, Microsoft, Meta, and Google all racing to secure long-term chip supply, critics warn of an unsustainable investment bubble and an “AI arms race” driven more by strategic fear than by commercial returns. Because so many trades and valuations have been made in anticipation of continued hefty AI investment and massive returns, they warn that hesitation or market uncertainty could "blow up" the bubble and take down the entire global economy with it.

    Against that backdrop, Moonshot AI’s and MiniMax’s open-weight releases put more pressure on U.S. proprietary AI firms and their backers to justify the size of the investments and paths to profitability.

    If an enterprise customer can just as easily get comparable or better performance from a free, open source Chinese AI model as from paid, proprietary AI solutions like OpenAI's GPT-5, Anthropic's Claude Sonnet 4.5, or Google's Gemini 2.5 Pro — why would they continue paying to access the proprietary models? Already, Silicon Valley stalwarts like Airbnb have raised eyebrows by admitting to heavily using Chinese open source alternatives like Alibaba's Qwen over OpenAI's proprietary offerings.

    For investors and enterprises, these developments suggest that high-end AI capability is no longer synonymous with high-end capital expenditure. The most advanced reasoning systems may now come not from companies building gigascale data centers, but from research groups optimizing architectures and quantization for efficiency.

    In that sense, K2 Thinking’s benchmark dominance is not just a technical milestone—it’s a strategic one, arriving at a moment when the AI market’s biggest question has shifted from how powerful models can become to who can afford to sustain them.

    What It Means for Enterprises Going Forward

    Within weeks of MiniMax-M2’s ascent, Kimi K2 Thinking has overtaken it—along with GPT-5 and Claude 4.5—across nearly every reasoning and agentic benchmark.

    The model demonstrates that open-weight systems can now meet or surpass proprietary frontier models in both capability and efficiency.

    For the AI research community, K2 Thinking represents more than another open model: it is evidence that the frontier has become collaborative.

    The best-performing reasoning model available today is not a closed commercial product but an open-source system accessible to anyone.