Tag: Machine Learning

  • Chronosphere takes on Datadog with AI that explains itself, not just outages

    Chronosphere, a New York-based observability startup valued at $1.6 billion, announced Monday it will launch AI-Guided Troubleshooting capabilities designed to help engineers diagnose and fix production software failures — a problem that has intensified as artificial intelligence tools accelerate code creation while making systems harder to debug.

    The new features combine AI-driven analysis with what Chronosphere calls a Temporal Knowledge Graph, a continuously updated map of an organization's services, infrastructure dependencies, and system changes over time. The technology aims to address a mounting challenge in enterprise software: developers are writing code faster than ever with AI assistance, but troubleshooting remains largely manual, creating bottlenecks when applications fail.

    "For AI to be effective in observability, it needs more than pattern recognition and summarization," said Martin Mao, Chronosphere's CEO and co-founder, in an exclusive interview with VentureBeat. "Chronosphere has spent years building the data foundation and analytical depth needed for AI to actually help engineers. With our Temporal Knowledge Graph and advanced analytics capabilities, we're giving AI the understanding it needs to make observability truly intelligent — and giving engineers the confidence to trust its guidance."

    The announcement comes as the observability market — software that monitors complex cloud applications — faces mounting pressure to justify escalating costs. Enterprise log data volumes have grown 250% year-over-year, according to Chronosphere's own research, while a study from MIT and the University of Pennsylvania found that generative AI has spurred a 13.5% increase in weekly code commits, signifying faster development velocity but also greater system complexity.

    AI writes code 13% faster, but debugging stays stubbornly manual

    Despite advances in automated code generation, debugging production failures remains stubbornly manual. When a major e-commerce site slows during checkout or a banking app fails to process transactions, engineers must sift through millions of data points — server logs, application traces, infrastructure metrics, recent code deployments — to identify root causes.

    Chronosphere's answer is what it calls AI-Guided Troubleshooting, built on four core capabilities: automated "Suggestions" that propose investigation paths backed by data; the Temporal Knowledge Graph that maps system relationships and changes; Investigation Notebooks that document each troubleshooting step for future reference; and natural language query building.

    Mao explained the Temporal Knowledge Graph in practical terms: "It's a living, time-aware model of your system. It stitches together telemetry—metrics, traces, logs—infrastructure context, change events like deploys and feature flags, and even human input like notes and runbooks into a single, queryable map that updates as your system evolves."

    This differs fundamentally from the service dependency maps offered by competitors like Datadog, Dynatrace, and Splunk, Mao argued. "It adds time, not just topology," he said. "It tracks how services and dependencies change over time and connects those changes to incidents—what changed and why. Many tools rely on standardized integrations; our graph goes a step further to normalize custom, non-standard telemetry so application-specific signals aren't a blind spot."
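
    To make the concept concrete, here is a deliberately simplified Python sketch of a time-aware dependency graph: edges and change events carry timestamps, so an incident can be correlated with what changed shortly beforehand. The class, method, and field names are invented for illustration and bear no relation to Chronosphere's actual implementation.

    from dataclasses import dataclass, field
    from datetime import datetime, timedelta

    @dataclass
    class TemporalGraph:
        # (source, target, timestamp): observed service dependencies
        edges: list = field(default_factory=list)
        # (service, description, timestamp): deploys, feature flags, config changes
        changes: list = field(default_factory=list)

        def add_dependency(self, source, target, at):
            self.edges.append((source, target, at))

        def add_change(self, service, description, at):
            self.changes.append((service, description, at))

        def dependencies_at(self, service, at):
            """Services that `service` depended on at or before a point in time."""
            return {t for s, t, ts in self.edges if s == service and ts <= at}

        def recent_changes(self, service, incident_at, window=timedelta(hours=2)):
            """Change events on the service or its dependencies shortly before an incident."""
            related = self.dependencies_at(service, incident_at) | {service}
            return [c for c in self.changes
                    if c[0] in related and incident_at - window <= c[2] <= incident_at]

    # Example: a Checkout incident traces back to a feature-flag change in Payment.
    graph = TemporalGraph()
    now = datetime.utcnow()
    graph.add_dependency("checkout", "payment", now - timedelta(days=1))
    graph.add_change("payment", "feature flag: new retry policy", now - timedelta(minutes=30))
    print(graph.recent_changes("checkout", incident_at=now))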

    Why Chronosphere shows its work instead of making automatic decisions

    Unlike purely automated systems, Chronosphere designed its AI features to keep engineers in the driver's seat—a deliberate choice meant to address what Mao calls the "confident-but-wrong guidance" problem plaguing early AI observability tools.

    "'Keeping engineers in control' means the AI shows its work, proposes next steps, and lets engineers verify or override — never auto-deciding behind the scenes," Mao explained. "Every Suggestion includes the evidence—timing, dependencies, error patterns — and a 'Why was this suggested?' view, so they can inspect what was checked and ruled out before acting."

    He walked through a concrete example: "An SLO [service level objective] alert fires on Checkout. Chronosphere immediately surfaces a ranked Suggestion: errors appear to have started in the dependent Payment service. An engineer can click Investigate to see the charts and reasoning and, if it holds up, choose to dig deeper. As they steer into Payment, the system adapts with new Suggestions scoped to that service—all from one view, no tab-hopping."

    In this scenario, the engineer asks "what changed?" and the system pulls in change events. "Our Notebook capability makes the causal chain plain: a feature-flag update preceded pod memory exhaustion in Payment; Checkout's spike is a downstream symptom," Mao said. "They can decide to roll back the flag. That whole path — suggestions followed, evidence viewed, conclusions—is captured automatically in an Investigation Notebook, and the outcome feeds the Temporal Knowledge Graph so similar future incidents are faster to resolve."

    How a $1.6 billion startup takes on Datadog, Dynatrace, and Splunk

    Chronosphere enters an increasingly crowded field. Datadog, the publicly traded observability leader valued at over $40 billion, has introduced its own AI-powered troubleshooting features. So have Dynatrace and Splunk. All three offer comprehensive "all-in-one" platforms that promise single-pane-of-glass visibility.

    Mao distinguished Chronosphere's approach on technical grounds. "Early 'AI for observability' leaned heavily on pattern-spotting and summarization, which tends to break down during real incidents," he said. "These approaches often stop at correlating anomalies or producing fluent explanations without the deeper analysis and causal reasoning observability leaders need. They can feel impressive in demos but disappoint in production—they summarize signals rather than explain cause and effect."

    A specific technical gap, he argued, involves custom application telemetry. "Most platforms reason over standardized integrations—Kubernetes, common cloud services, popular databases—ignoring the most telling clues that live in custom app telemetry," Mao said. "With an incomplete picture, large language models will 'fill in the gaps,' producing confident-but-wrong guidance that sends teams down dead ends."

    Chronosphere's competitive positioning received validation in July when Gartner named it a Leader in the 2025 Magic Quadrant for Observability Platforms for the second consecutive year. The firm was recognized based on both "Completeness of Vision" and "Ability to Execute." In December 2024, Chronosphere also tied for the highest overall rating among recognized vendors in Gartner Peer Insights' "Voice of the Customer" report, scoring 4.7 out of 5 based on 70 reviews.

    Yet the company faces intensifying competition for high-profile customers. UBS analysts noted in July that OpenAI now runs both Datadog and Chronosphere side-by-side to monitor GPU workloads, suggesting the AI leader is evaluating alternatives. While UBS maintained its buy rating on Datadog, the analysts warned that growing Chronosphere usage could pressure Datadog's pricing power.

    Inside the 84% cost reduction claims—and what CIOs should actually measure

    Beyond technical capabilities, Chronosphere has built its market position on cost control — a critical factor as observability spending spirals. The company claims its platform reduces data volumes and associated costs by 84% on average while cutting critical incidents by up to 75%.

    When pressed for specific customer examples with real numbers, Mao pointed to several case studies. "Robinhood has seen a 5x improvement in reliability and a 4x improvement in Mean Time to Detection," he said. "DoorDash used Chronosphere to improve governance and standardize monitoring practices. Astronomer achieved over 85% cost reduction by shaping data on ingest, and Affirm scaled their load 10x during a Black Friday event with no issues, highlighting the platform's reliability under extreme conditions."

    The cost argument matters because, as Paul Nashawaty, principal analyst at CUBE Research, noted when Chronosphere launched its Logs 2.0 product in June: "Organizations are drowning in telemetry data, with over 70% of observability spend going toward storing logs that are never queried."

    For CIOs fatigued by "AI-powered" announcements, Mao acknowledged skepticism is warranted. "The way to cut through it is to test whether the AI shortens incidents, reduces toil, and builds reusable knowledge in your own environment, not in a demo," he advised. He recommended CIOs evaluate three factors: transparency and control (does the system show its reasoning?), coverage of custom telemetry (can it handle non-standardized data?), and manual toil avoided (how many ad-hoc queries and tool-switches are eliminated?).

    Why Chronosphere partners with five vendors instead of building everything itself

    Alongside the AI troubleshooting announcement, Chronosphere revealed a new Partner Program integrating five specialized vendors to fill gaps in its platform: Arize for large language model monitoring, Embrace for real user monitoring, Polar Signals for continuous profiling, Checkly for synthetic monitoring, and Rootly for incident management.

    The strategy represents a deliberate bet against the all-in-one platforms dominating the market. "While an all-in-one platform may be sufficient for smaller organizations, global enterprises demand best-in-class depth across each domain," Mao said. "This is what drove us to build our Partner Program and invest in seamless integrations with leading providers—so our customers can operate with confidence and clarity at every layer of observability."

    Noah Smolen, head of partnerships at Arize, said the collaboration addresses a specific enterprise need. "With a wide array of Fortune 500 customers, we understand the high bar needed to ensure AI agent systems are ready to deploy and stay incident-free, especially given the pace of AI adoption in the enterprise," Smolen said. "Our partnership with Chronosphere comes at a time when an integrated purpose-built cloud-native and AI-observability suite solves a huge pain point for forward-thinking C-suite leaders who demand the very best across their entire observability stack."

    Similarly, JJ Tang, CEO and founder of Rootly, emphasized the incident resolution benefits. "Incidents hinder innovation and revenue, and the challenge lies in sifting through vast amounts of observability data, mobilizing teams, and resolving issues quickly," Tang said. "Integrating Chronosphere with Rootly allows engineers to collaborate with context and resolve issues faster within their existing communication channels, drastically reducing time to resolution and ultimately improving reliability—78% plus decreases in repeat Sev0 and Sev1 incidents."

    When asked how total costs compare when customers use multiple partner contracts versus a single platform, Mao acknowledged the current complexity. "At present, mutual customers typically maintain separate contracts unless they engage through a services partner or system integrator," he said. However, he argued the economics still favor the composable approach: "Our combined technologies deliver exceptional value—in most circumstances at just a fraction of the price of a single-platform solution. Beyond the savings, customers gain a richer, more unified observability experience that unlocks deeper insights and greater efficiency, especially for large-scale environments."

    The company plans to streamline this over time. "As the ISV program matures, we're focused on delivering a more streamlined experience by transitioning to a single, unified contract that simplifies procurement and accelerates time to value," Mao said.

    How two Uber engineers turned Halloween outages into a billion-dollar startup

    Chronosphere's origins trace to 2019, when Mao and co-founder Rob Skillington left Uber after building the ride-hailing giant's internal observability platform. At Uber, Mao's team had faced a crisis: the company's in-house tools would fail on its two busiest nights — Halloween and New Year's Eve — cutting off visibility into whether customers could request rides or drivers could locate passengers.

    The solution they built at Uber used open-source software and ultimately allowed the company to operate without outages, even during high-volume events. But the broader market insight came at an industry conference in December 2018, when major cloud providers threw their weight behind Kubernetes, Google's container orchestration technology.

    "This meant that most technology architectures were eventually going to look like Uber's," Mao recalled in an August 2024 profile by Greylock Partners, Chronosphere's lead investor. "And that meant every company, not just a few big tech companies and the Walmarts of the world, would have the exact same problem we had solved at Uber."

    Chronosphere has since raised more than $343 million in funding across multiple rounds led by Greylock, Lux Capital, General Atlantic, Addition, and Founders Fund. The company operates as a remote-first organization with offices in New York, Austin, Boston, San Francisco, and Seattle, employing approximately 299 people according to LinkedIn data.

    The company's customer base includes DoorDash, Zillow, Snap, Robinhood, and Affirm — predominantly high-growth technology companies operating cloud-native, Kubernetes-based infrastructures at massive scale.

    What's available now—and what enterprises can expect in 2026

    Chronosphere's AI-Guided Troubleshooting capabilities, including Suggestions and Investigation Notebooks, entered limited availability Monday with select customers. The company plans full general availability in 2026. The Model Context Protocol (MCP) Server, which enables engineers to integrate Chronosphere directly into internal AI workflows and query observability data through AI-enabled development environments, is available immediately for all Chronosphere customers.

    The phased rollout reflects the company's cautious approach to deploying AI in production environments where mistakes carry real costs. By gathering feedback from early adopters before broad release, Chronosphere aims to refine its guidance algorithms and validate that its suggestions genuinely accelerate troubleshooting rather than simply generating impressive demonstrations.

    The longer game, however, extends beyond individual product features. Chronosphere's dual bet — on transparent AI that shows its reasoning and on a partner ecosystem rather than all-in-one integration — amounts to a fundamental thesis about how enterprise observability will evolve as systems grow more complex.

    If that thesis proves correct, the company that solves observability for the AI age won't be the one with the most automated black box. It will be the one that earns engineers' trust by explaining what it knows, admitting what it doesn't, and letting humans make the final call. In an industry drowning in data and promised silver bullets, Chronosphere is wagering that showing your work still matters — even when AI is doing the math.

  • Meta returns to open source AI with Omnilingual ASR models that can transcribe 1,600+ languages natively

    Meta has just released a new multilingual automatic speech recognition (ASR) system supporting 1,600+ languages — dwarfing OpenAI’s open source Whisper model, which supports just 99.

    Its architecture also allows developers to extend that support to thousands more. Through a feature called zero-shot in-context learning, users can provide a few paired examples of audio and text in a new language at inference time, enabling the model to transcribe additional utterances in that language without any retraining.

    In practice, this expands potential coverage to more than 5,400 languages — roughly every spoken language with a known script.

    It’s a shift from static model capabilities to a flexible framework that communities can adapt themselves. So while the 1,600 languages reflect official training coverage, the broader figure represents Omnilingual ASR’s capacity to generalize on demand, making it the most extensible speech recognition system released to date.

    Best of all, it has been open sourced under a plain Apache 2.0 license — not the restrictive, quasi open-source Llama license of the company's prior releases, which limited use by larger enterprises unless they paid licensing fees — meaning researchers and developers are free to use it immediately, at no cost, even in commercial and enterprise-grade projects.

    Released on November 10 on Meta's website and GitHub, along with a demo space on Hugging Face and a technical paper, Meta’s Omnilingual ASR suite includes a family of speech recognition models, a 7-billion-parameter multilingual audio representation model, and a massive speech corpus spanning more than 350 previously underserved languages.

    All resources are freely available under open licenses, and the models support speech-to-text transcription out of the box.

    “By open sourcing these models and dataset, we aim to break down language barriers, expand digital access, and empower communities worldwide,” Meta posted on its @AIatMeta account on X.

    Designed for Speech-to-Text Transcription

    At its core, Omnilingual ASR is a speech-to-text system.

    The models are trained to convert spoken language into written text, supporting applications like voice assistants, transcription tools, subtitles, oral archive digitization, and accessibility features for low-resource languages.

    Unlike earlier ASR models that required extensive labeled training data, Omnilingual ASR includes a zero-shot variant.

    This version can transcribe languages it has never seen before—using just a few paired examples of audio and corresponding text.

    This lowers the barrier for adding new or endangered languages dramatically, removing the need for large corpora or retraining.

    Model Family and Technical Design

    The Omnilingual ASR suite includes multiple model families trained on more than 4.3 million hours of audio from 1,600+ languages:

    • wav2vec 2.0 models for self-supervised speech representation learning (300M–7B parameters)

    • CTC-based ASR models for efficient supervised transcription

    • LLM-ASR models combining a speech encoder with a Transformer-based text decoder for state-of-the-art transcription

    • LLM-ZeroShot ASR model, enabling inference-time adaptation to unseen languages

    All models follow an encoder–decoder design: raw audio is converted into a language-agnostic representation, then decoded into written text.

    Why the Scale Matters

    While Whisper and similar models have advanced ASR capabilities for global languages, they fall short on the long tail of human linguistic diversity. Whisper supports 99 languages. Meta’s system:

    • Directly supports 1,600+ languages

    • Can generalize to 5,400+ languages using in-context learning

    • Achieves character error rates (CER) under 10% in 78% of supported languages

    Among those supported are more than 500 languages never previously covered by any ASR model, according to Meta’s research paper.

    This expansion opens new possibilities for communities whose languages are often excluded from digital tools.


    Background: Meta’s AI Overhaul and a Rebound from Llama 4

    The release of Omnilingual ASR arrives at a pivotal moment in Meta’s AI strategy, following a year marked by organizational turbulence, leadership changes, and uneven product execution.

    Omnilingual ASR is the first major open-source model release since the rollout of Llama 4, Meta’s latest large language model, which debuted in April 2025 to mixed and ultimately poor reviews, with scant enterprise adoption compared to Chinese open source model competitors.

    The failure led Meta founder and CEO Mark Zuckerberg to appoint Alexandr Wang, co-founder and former CEO of AI data supplier Scale AI, as Chief AI Officer, and to embark on an extensive and costly hiring spree that shocked the AI and business communities with eye-watering pay packages for top AI researchers.

    In contrast, Omnilingual ASR represents a strategic and reputational reset. It returns Meta to a domain where the company has historically led — multilingual AI — and offers a truly extensible, community-oriented stack with minimal barriers to entry.

    The system’s support for 1,600+ languages and its extensibility to more than 5,400 languages in total via zero-shot in-context learning reassert Meta’s engineering credibility in language technology.

    Importantly, it does so through a free and permissively licensed release, under Apache 2.0, with transparent dataset sourcing and reproducible training protocols.

    This shift aligns with broader themes in Meta’s 2025 strategy. The company has refocused its narrative around a “personal superintelligence” vision, investing heavily in infrastructure (including a September release of custom AI accelerators and Arm-based inference stacks) while downplaying the metaverse in favor of foundational AI capabilities. Its return to training on public data in Europe after a regulatory pause also underscores its intention to compete globally, despite ongoing privacy scrutiny.

    Omnilingual ASR, then, is more than a model release — it’s a calculated move to reassert control of the narrative: from the fragmented rollout of Llama 4 to a high-utility, research-grounded contribution that aligns with Meta’s long-term AI platform strategy.

    Community-Centered Dataset Collection

    To achieve this scale, Meta partnered with researchers and community organizations in Africa, Asia, and elsewhere to create the Omnilingual ASR Corpus, a 3,350-hour dataset across 348 low-resource languages. Contributors were local speakers who were compensated for their recordings, which were gathered in collaboration with groups like:

    • African Next Voices: A Gates Foundation–supported consortium including Maseno University (Kenya), University of Pretoria, and Data Science Nigeria

    • Mozilla Foundation’s Common Voice, supported through the Open Multilingual Speech Fund

    • Lanfrica / NaijaVoices, which created data for 11 African languages including Igala, Serer, and Urhobo

    The data collection focused on natural, unscripted speech. Prompts were designed to be culturally relevant and open-ended, such as “Is it better to have a few close friends or many casual acquaintances? Why?” Transcriptions used established writing systems, with quality assurance built into every step.

    Performance and Hardware Considerations

    The largest model in the suite, the omniASR_LLM_7B, requires ~17GB of GPU memory for inference, making it suitable for deployment on high-end hardware. Smaller models (300M–1B) can run on lower-power devices and deliver real-time transcription speeds.

    Performance benchmarks show strong results even in low-resource scenarios:

    • CER <10% in 95% of high-resource and mid-resource languages

    • CER <10% in 36% of low-resource languages

    • Robustness in noisy conditions and unseen domains, especially with fine-tuning

    The zero-shot system, omniASR_LLM_7B_ZS, can transcribe new languages with minimal setup. Users provide a few sample audio–text pairs, and the model generates transcriptions for new utterances in the same language.

    Open Access and Developer Tooling

    All models and the dataset are released under permissive open licenses.

    Installation is supported via PyPI and uv:

    pip install omnilingual-asr

    Meta also provides:

    • A HuggingFace dataset integration

    • Pre-built inference pipelines

    • Language-code conditioning for improved accuracy

    Developers can view the full list of supported languages using the API:

    from omnilingual_asr.models.wav2vec2_llama.lang_ids import supported_langs

    print(len(supported_langs))
    print(supported_langs)
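
    As a quick illustration of how that list can be used, the snippet below checks whether a few target languages are natively supported and, if not, flags them for the zero-shot path. It relies only on the supported_langs import shown above; the specific language codes (ISO 639-3 plus script, e.g. "eng_Latn") are an assumed format, so verify them against the repository's documentation.

    # Check which target languages are natively supported; anything missing can fall
    # back to the zero-shot in-context learning path described above. The language
    # codes below are an assumed format; confirm them against the repository.
    from omnilingual_asr.models.wav2vec2_llama.lang_ids import supported_langs

    wanted = ["eng_Latn", "yor_Latn", "urh_Latn"]  # hypothetical deployment targets
    for code in wanted:
        if code in supported_langs:
            print(f"{code}: supported natively")
        else:
            print(f"{code}: not in training coverage; use zero-shot in-context learning")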

    Broader Implications

    Omnilingual ASR reframes language coverage in ASR from a fixed list to an extensible framework. It enables:

    • Community-driven inclusion of underrepresented languages

    • Digital access for oral and endangered languages

    • Research on speech tech in linguistically diverse contexts

    Crucially, Meta emphasizes ethical considerations throughout—advocating for open-source participation and collaboration with native-speaking communities.

    “No model can ever anticipate and include all of the world’s languages in advance,” the Omnilingual ASR paper states, “but Omnilingual ASR makes it possible for communities to extend recognition with their own data.”

    Access the Tools

    All resources are now available on Meta’s website, GitHub, and Hugging Face.

    What This Means for Enterprises

    For enterprise developers, especially those operating in multilingual or international markets, Omnilingual ASR significantly lowers the barrier to deploying speech-to-text systems across a broader range of customers and geographies.

    Instead of relying on commercial ASR APIs that support only a narrow set of high-resource languages, teams can now integrate an open-source pipeline that covers over 1,600 languages out of the box—with the option to extend it to thousands more via zero-shot learning.

    This flexibility is especially valuable for enterprises working in sectors like voice-based customer support, transcription services, accessibility, education, or civic technology, where local language coverage can be a competitive or regulatory necessity. Because the models are released under the permissive Apache 2.0 license, businesses can fine-tune, deploy, or integrate them into proprietary systems without restrictive terms.

    It also represents a shift in the ASR landscape—from centralized, cloud-gated offerings to community-extendable infrastructure. By making multilingual speech recognition more accessible, customizable, and cost-effective, Omnilingual ASR opens the door to a new generation of enterprise speech applications built around linguistic inclusion rather than linguistic limitation.

  • Celosphere 2025: Where enterprise AI moved from experiment to execution

    Presented by Celonis


    After a year of boardroom declarations about “AI transformation,” this was the week where enterprise leaders came together to talk about what actually works. Speaking from the stage at Celosphere in Munich, Celonis co-founder and co-CEO Alexander Rinke set the tone early in his keynote:

    “Only 11% of companies are seeing measurable benefits from AI projects today,” he said. “That’s not an adoption problem. That’s a context problem.”

    It’s a sentiment familiar to anyone who’s tried to deploy AI inside a large enterprise. You can’t automate what you don’t understand — and most organizations still lack a unified picture of how work in their companies really gets done.

    Celonis’ answer, showcased across three days at the company’s annual event, was less about new tech acronyms and more about connective tissue: how to make AI fit within the messy, living processes that drive business. The company framed it as achieving a real “Return on AI (ROAI)” — measurable impact that comes only when intelligence is grounded in process context.

    A living model of how the enterprise works

    At the heart of the keynote was what Rinke called a “living digital twin of your operations.” Celonis has been building toward this moment for years — but this was the first time the company made clear how far that concept has evolved.

    “We start by freeing the process,” said Rinke. “Freeing it from the restrictions of your current legacy systems.” Data Core, Celonis’ data infrastructure, extracts raw data from source systems. It’s capable of querying billions of records in near real time with sub-minute refresh — extending visibility beyond traditional systems of record.

    Built on this foundation, the Process Intelligence Graph sits at the center of the Celonis Platform. It’s a system-agnostic, graph-based model that unifies data across systems, apps, and even devices, including task-mining data that captures clicks, spreadsheets, and browser activity. It combines this data with business context—business rules, KPIs, benchmarks, and exceptions. Every transaction, rule, and process interaction becomes part of a continuously updated replica that reflects how the organization actually operates.

    On top of the Graph, the company’s new Build Experience allows organizations to analyze, design, and operate AI-driven, composable processes — integrating AI where it delivers business impact, not just technical demos:

    • Analyze where processes stall or repeat

    • Design the future state, setting outcomes, guardrails, and AI touchpoints

    • Operate with humans, systems, and AI agents working in sync — now orchestrated through a generally available Orchestration Engine that can trigger and monitor every step in one flow

    It’s a deliberate shift from discovery-driven AI pilots to outcome-driven AI operations — and a blueprint for orchestrating agentic AI, where human teams, systems, and autonomous agents work together through shared process context rather than in silos.

    Real-world proof: Mercedes-Benz, Vinmar, and Uniper

    The Celosphere stage offered real proof of the Celonis Platform in action, through live stories from customers already building on it.

    Mercedes-Benz shared how process intelligence became their “connective tissue” during the semiconductor crisis. “We had data everywhere — plants, suppliers, logistics,” recalled Dr. Jörg Burzer, Member of the Board of Management of Mercedes-Benz Group AG. “What we didn’t have was a way to see it together. Celonis helped us connect those dots fast enough to act.”

    The partnership has since expanded across eight of the company’s ten most critical processes, from supply chain to quality to after-sales. But what impressed the audience wasn’t just the scale — it was the cultural shift.

    “If you show data in context, and let teams visualize processes, you also change the culture,” Burzer said. “It’s not just process transformation — it’s people transformation.”

    At Vinmar, CEO Vishal Baid described Celonis as “the foundation of our automation and AI strategy.” His global plastics distribution business has already automated its entire order-to-cash process for a $3B unit, achieving a 40% productivity lift. But Baid wasn’t there to just celebrate finished work — he was looking ahead.

    “Now we’re tackling the non-algorithmic stuff,” he said. “Matching purchase and sales orders sounds simple until you have thousands of edge cases. We’re building an AI agent that can do that allocation intelligently. That’s the next frontier.”

    And in the energy sector, Uniper, with partner Microsoft, demonstrated how process-aware AI copilots are already reshaping operations. Using Celonis and Microsoft’s AI stack, Uniper can predict when hydropower plants will need maintenance — and cluster those jobs to reduce downtime and emissions.

    “Each technician, each part, each system plays a role in a living process,” said Hans Berg, Uniper’s CIO. “The human can’t see all of it. But process intelligence can — and it can nudge the system toward the best outcome.”

    Agnes Heftberger, CVP & CEO, Microsoft Germany & Austria, who joined Berg on stage, summed it up crisply:

    “The hard part isn’t building AI features — it’s scaling them responsibly,” she explained. “You need to marry intelligence with the beating heart of the company: its processes.”

    Across the global community, Celonis reports more than $8 billion in realized business value and over 120 certified value champions — proof that process intelligence is driving measurable impact far beyond pilots. Rinke called it “the early proof points of a true return on AI.”

    From closed systems to composable intelligence

    Celosphere 2025 marked a shift from architecture to interoperability — from defining enterprise AI to making it work across boundaries.

    Rinke’s vision for the future is unapologetically open: “Good things grow from open ecosystems,” he said. That philosophy is taking shape through deeper platform integrations — including Microsoft Fabric, Databricks, and Bloomfilter — with zero-copy, bidirectional lakehouse access that lets customers query process data in place with minimal latency. The company also announced MCP Server support for embedding the Process Intelligence Graph directly into agentic AI platforms like Amazon Bedrock and Microsoft Copilot Studio.

    These updates make “composable enterprise AI” tangible — organizations can now assemble and govern AI solutions across ecosystems rather than being locked into any single vendor.

    Rather than competing on who has the “best agent,” the message was that enterprise AI will thrive when agents work together through shared context and models that mirror how businesses actually run.

    “Every vendor is bringing out their own agent,” Rinke said. “But each one is limited to that vendor’s world. If they can’t work together, they can’t work for you. That’s what process intelligence fixes.”

    The idea drew sustained applause. For companies juggling multiple cloud platforms, ERPs, and data tools, composability isn’t just elegant; it’s survival.

    Beyond operations: data, democracy, and direction

    The closing moments of the keynote took an unexpected turn — from enterprise architecture to human courage. Venezuelan opposition leader and Nobel Peace Prize winner María Corina Machado joined live via satellite to share how her movement used data, encrypted apps, and civic coordination to expose election fraud and mobilize millions.

    It was a powerful contrast: the same principles — transparency, accountability, context — at work in both business and democracy.

    “Technology can be a weapon or a liberator,” Machado said. “It depends on who holds the context.”

    Her words landed with weight in a room full of people used to talking about data, systems, and governance — a reminder that context isn’t just technical, it’s human.

    Why this year mattered

    Celosphere 2025 marked a shift in how enterprises approach AI — from experimentation to results grounded in process intelligence. The shift was evident in both tone and technology, with a more powerful Data Core, enhanced Process Intelligence Graph, and new Build Experience. But the deeper takeaway was philosophical: AI only scales when it’s grounded in how people and systems actually work together.

    Celonis president Carsten Thoma was candid in acknowledging that early process-mining projects often “stormed in with discovery” before understanding organizational value — a lesson that now defines the company’s measured, pragmatic approach to enterprise AI.

    Rinke put it best near the end of his keynote:

    “We’re not just automating steps,” he said. “We’re building enterprises that can adapt instantly, innovate freely, and improve continuously.”

    Missed it? Catch up with all the highlights from Celosphere 2025 here.


    Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.

  • Baseten takes on hyperscalers with new AI training platform that lets you own your model weights

    Baseten, the AI infrastructure company recently valued at $2.15 billion, is making its most significant product pivot yet: a full-scale push into model training that could reshape how enterprises wean themselves off dependence on OpenAI and other closed-source AI providers.

    The San Francisco-based company announced Thursday the general availability of Baseten Training, an infrastructure platform designed to help companies fine-tune open-source AI models without the operational headaches of managing GPU clusters, multi-node orchestration, or cloud capacity planning. The move is a calculated expansion beyond Baseten's core inference business, driven by what CEO Amir Haghighat describes as relentless customer demand and a strategic imperative to capture the full lifecycle of AI deployment.

    "We had a captive audience of customers who kept coming to us saying, 'Hey, I hate this problem,'" Haghighat said in an interview. "One of them told me, 'Look, I bought a bunch of H100s from a cloud provider. I have to SSH in on Friday, run my fine-tuning job, then check on Monday to see if it worked. Sometimes I realize it just hasn't been working all along.'"

    The launch comes at a critical inflection point in enterprise AI adoption. As open-source models from Meta, Alibaba, and others increasingly rival proprietary systems in performance, companies face mounting pressure to reduce their reliance on expensive API calls to services like OpenAI's GPT-5 or Anthropic's Claude. But the path from off-the-shelf open-source model to production-ready custom AI remains treacherous, requiring specialized expertise in machine learning operations, infrastructure management, and performance optimization.

    Baseten's answer: provide the infrastructure rails while letting companies retain full control over their training code, data, and model weights. It's a deliberately low-level approach born from hard-won lessons.

    How a failed product taught Baseten what AI training infrastructure really needs

    This isn't Baseten's first foray into training. The company's previous attempt, a product called Blueprints launched roughly two and a half years ago, failed spectacularly — a failure Haghighat now embraces as instructive.

    "We had created the abstraction layer a little too high," he explained. "We were trying to create a magical experience, where as a user, you come in and programmatically choose a base model, choose your data and some hyperparameters, and magically out comes a model."

    The problem? Users didn't have the intuition to make the right choices about base models, data quality, or hyperparameters. When their models underperformed, they blamed the product. Baseten found itself in the consulting business rather than the infrastructure business, helping customers debug everything from dataset deduplication to model selection.

    "We became consultants," Haghighat said. "And that's not what we had set out to do."

    Baseten killed Blueprints and refocused entirely on inference, vowing to "earn the right" to expand again. That moment arrived earlier this year, driven by two market realities: the vast majority of Baseten's inference revenue comes from custom models that customers train elsewhere, and competing training platforms were using restrictive terms of service to lock customers into their inference products.

    "Multiple companies who were building fine-tuning products had in their terms of service that you as a customer cannot take the weights of the fine-tuned model with you somewhere else," Haghighat said. "I understand why from their perspective — I still don't think there is a big company to be made purely on just training or fine-tuning. The sticky part is in inference, the valuable part where value is unlocked is in inference, and ultimately the revenue is in inference."

    Baseten took the opposite approach: customers own their weights and can download them at will. The bet is that superior inference performance will keep them on the platform anyway.

    Multi-cloud GPU orchestration and sub-minute scheduling set Baseten apart from hyperscalers

    The new Baseten Training product operates at what Haghighat calls "the infrastructure layer" — lower-level than the failed Blueprints experiment, but with opinionated tooling around reliability, observability, and integration with Baseten's inference stack.

    Key technical capabilities include multi-node training support across clusters of NVIDIA H100 or B200 GPUs, automated checkpointing to protect against node failures, sub-minute job scheduling, and integration with Baseten's proprietary Multi-Cloud Management (MCM) system. That last piece is critical: MCM allows Baseten to dynamically provision GPU capacity across multiple cloud providers and regions, passing cost savings to customers while avoiding the capacity constraints and multi-year contracts typical of hyperscaler deals.

    "With hyperscalers, you don't get to say, 'Hey, give me three or four B200 nodes while my job is running, and then take it back from me and don't charge me for it,'" Haghighat said. "They say, 'No, you need to sign a three-year contract.' We don't do that."

    Baseten's approach mirrors broader trends in cloud infrastructure, where abstraction layers increasingly allow workloads to move fluidly across providers. When AWS experienced a major outage several weeks ago, Baseten's inference services remained operational by automatically routing traffic to other cloud providers — a capability now extended to training workloads.

    The technical differentiation extends to Baseten's observability tooling, which provides per-GPU metrics for multi-node jobs, granular checkpoint tracking, and a refreshed UI that surfaces infrastructure-level events. The company also introduced an "ML Cookbook" of open-source training recipes for popular models like Gemma, GPT OSS, and Qwen, designed to help users reach "training success" faster.

    Early adopters report 84% cost savings and 50% latency improvements with custom models

    Two early customers illustrate the market Baseten is targeting: AI-native companies building specialized vertical solutions that require custom models.

    Oxen AI, a platform focused on dataset management and model fine-tuning, exemplifies the partnership model Baseten envisions. CEO Greg Schoeninger articulated a common strategic calculus, telling VentureBeat: "Whenever I've seen a platform try to do both hardware and software, they usually fail at one of them. That's why partnering with Baseten to handle infrastructure was the obvious choice."

    Oxen built its customer experience entirely on top of Baseten's infrastructure, using the Baseten CLI to programmatically orchestrate training jobs. The system automatically provisions and deprovisions GPUs, fully concealing Baseten's interface behind Oxen's own. For one Oxen customer, AlliumAI — a startup bringing structure to messy retail data — the integration delivered 84% cost savings compared to previous approaches, reducing total inference costs from $46,800 to $7,530.

    "Training custom LoRAs has always been one of the most effective ways to leverage open-source models, but it often came with infrastructure headaches," said Daniel Demillard, CEO of AlliumAI. "With Oxen and Baseten, that complexity disappears. We can train and deploy models at massive scale without ever worrying about CUDA, which GPU to choose, or shutting down servers after training."

    Parsed, another early customer, tackles a different pain point: helping enterprises reduce dependence on OpenAI by creating specialized models that outperform generalist LLMs on domain-specific tasks. The company works in mission-critical sectors like healthcare, finance, and legal services, where model performance and reliability aren't negotiable.

    "Prior to switching to Baseten, we were seeing repetitive and degraded performance on our fine-tuned models due to bugs with our previous training provider," said Charles O'Neill, Parsed's co-founder and chief science officer. "On top of that, we were struggling to easily download and checkpoint weights after training runs."

    With Baseten, Parsed achieved 50% lower end-to-end latency for transcription use cases, spun up HIPAA-compliant EU deployments for testing within 48 hours, and kicked off more than 500 training jobs. The company also leveraged Baseten's modified vLLM inference framework and speculative decoding — a technique that generates draft tokens to accelerate language model output — to cut latency in half for custom models.

    "Fast models matter," O'Neill said. "But fast models that get better over time matter more. A model that's 2x faster but static loses to one that's slightly slower but improving 10% monthly. Baseten gives us both — the performance edge today and the infrastructure for continuous improvement."

    Why training and inference are more interconnected than the industry realizes

    The Parsed example illuminates a deeper strategic rationale for Baseten's training expansion: the boundary between training and inference is blurrier than conventional wisdom suggests.

    Baseten's model performance team uses the training platform extensively to create "draft models" for speculative decoding, a cutting-edge technique that can dramatically accelerate inference. The company recently announced it achieved 650+ tokens per second on OpenAI's GPT OSS 120B model — a 60% improvement over its launch performance — using EAGLE-3 speculative decoding, which requires training specialized small models to work alongside larger target models.

    "Ultimately, inference and training plug in more ways than one might think," Haghighat said. "When you do speculative decoding in inference, you need to train the draft model. Our model performance team is a big customer of the training product to train these EAGLE heads on a continuous basis."

    This technical interdependence reinforces Baseten's thesis that owning both training and inference creates defensible value. The company can optimize the entire lifecycle: a model trained on Baseten can be deployed with a single click to inference endpoints pre-optimized for that architecture, with deployment-from-checkpoint support for chat completion and audio transcription workloads.

    The approach contrasts sharply with vertically integrated competitors like Replicate or Modal, which also offer training and inference but with different architectural tradeoffs. Baseten's bet is on lower-level infrastructure flexibility and performance optimization, particularly for companies running custom models at scale.

    As open-source AI models improve, enterprises see fine-tuning as the path away from OpenAI dependency

    Underpinning Baseten's entire strategy is a conviction about the trajectory of open-source AI models — namely, that they're getting good enough, fast enough, to unlock massive enterprise adoption through fine-tuning.

    "Both closed and open-source models are getting better and better in terms of quality," Haghighat said. "We don't even need open source to surpass closed models, because as both of them are getting better, they unlock all these invisible lines of usefulness for different use cases."

    He pointed to the proliferation of reinforcement learning and supervised fine-tuning techniques that allow companies to take an open-source model and make it "as good as the closed model, not at everything, but at this narrow band of capability that they want."

    That trend is already visible in Baseten's Model APIs business, launched alongside Training earlier this year to provide production-grade access to open-source models. The company was the first provider to offer access to DeepSeek V3 and R1, and has since added models like Llama 4 and Qwen 3, optimized for performance and reliability. Model APIs serves as a top-of-funnel product: companies start with off-the-shelf open-source models, realize they need customization, move to Training for fine-tuning, and ultimately deploy on Baseten's Dedicated Deployments infrastructure.

    Yet Haghighat acknowledged the market remains "fuzzy" around which training techniques will dominate. Baseten is hedging by staying close to the bleeding edge through its Forward Deployed Engineering team, which works hands-on with select customers on reinforcement learning, supervised fine-tuning, and other advanced techniques.

    "As we do that, we will see patterns emerge about what a productized training product can look like that really addresses the user's needs without them having to learn too much about how RL works," he said. "Are we there as an industry? I would say not quite. I see some attempts at that, but they all seem like almost falling to the same trap that Blueprints fell into—a bit of a walled garden that ties the hands of AI folks behind their back."

    The roadmap ahead includes potential abstractions for common training patterns, expansion into image, audio, and video fine-tuning, and deeper integration of advanced techniques like prefill-decode disaggregation, which separates the initial processing of prompts from token generation to improve efficiency.

    Baseten faces crowded field but bets developer experience and performance will win enterprise customers

    Baseten enters an increasingly crowded market for AI infrastructure. Hyperscalers like AWS, Google Cloud, and Microsoft Azure offer GPU compute for training, while specialized providers like Lambda Labs, CoreWeave, and Together AI compete on price, performance, or ease of use. Then there are vertically integrated platforms like Hugging Face, Replicate, and Modal that bundle training, inference, and model hosting.

    Baseten's differentiation rests on three pillars: its MCM system for multi-cloud capacity management, deep performance optimization expertise built from its inference business, and a developer experience tailored for production deployments rather than experimentation.

    The company's recent $150 million Series D and $2.15 billion valuation provide runway to invest in both products simultaneously. Major customers include Descript, which uses Baseten for transcription workloads; Decagon, which runs customer service AI; and Sourcegraph, which powers coding assistants. All three operate in domains where model customization and performance are competitive advantages.

    Timing may be Baseten's biggest asset. The confluence of improving open-source models, enterprise discomfort with dependence on proprietary AI providers, and growing sophistication around fine-tuning techniques creates what Haghighat sees as a sustainable market shift.

    "There is a lot of use cases for which closed models have gotten there and open ones have not," he said. "Where I'm seeing in the market is people using different training techniques — more recently, a lot of reinforcement learning and SFT — to be able to get this open model to be as good as the closed model, not at everything, but at this narrow band of capability that they want. That's very palpable in the market."

    For enterprises navigating the complex transition from closed to open AI models, Baseten's positioning offers a clear value proposition: infrastructure that handles the messy middle of fine-tuning while optimizing for the ultimate goal of performant, reliable, cost-effective inference at scale. The company's insistence that customers own their model weights — a stark contrast to competitors using training as a lock-in mechanism — reflects confidence that technical excellence, not contractual restrictions, will drive retention.

    Whether Baseten can execute on this vision depends on navigating tensions inherent in its strategy: staying at the infrastructure layer without becoming consultants, providing power and flexibility without overwhelming users with complexity, and building abstractions at exactly the right level as the market matures. The company's willingness to kill Blueprints when it failed suggests a pragmatism that could prove decisive in a market where many infrastructure providers over-promise and under-deliver.

    "Through and through, we're an inference company," Haghighat emphasized. "The reason that we did training is at the service of inference."

    That clarity of purpose — treating training as a means to an end rather than an end in itself—may be Baseten's most important strategic asset. As AI deployment matures from experimentation to production, the companies that solve the full stack stand to capture outsized value. But only if they avoid the trap of technology in search of a problem.

    At least Baseten's customers no longer have to SSH into boxes on Friday and pray their training jobs complete by Monday. In the infrastructure business, sometimes the best innovation is simply making the painful parts disappear.

  • 6 proven lessons from the AI projects that broke before they scaled

    Companies hate to admit it, but the road to production-level AI deployment is littered with proofs of concept (PoCs) that go nowhere and failed projects that never deliver on their goals. In certain domains there is little tolerance for iteration, especially in fields like life sciences, where the AI application may be helping bring new treatments to market or diagnosing diseases. Even slightly inaccurate analyses and assumptions early on can compound into sizable downstream drift.

    In analyzing dozens of AI PoCs that sailed through to full production use — or didn’t — six common pitfalls emerge. Interestingly, the cause of failure is usually not the quality of the technology but misaligned goals, poor planning or unrealistic expectations.

    Here’s a summary of what went wrong in real-world examples and practical guidance on how to get it right.

    Lesson 1: A vague vision spells disaster

    Every AI project needs a clear, measurable goal. Without it, developers are building a solution in search of a problem. For example, in developing an AI system for a pharmaceutical manufacturer’s clinical trials, the team aimed to “optimize the trial process,” but didn’t define what that meant. Did they need to accelerate patient recruitment, reduce participant dropout rates or lower the overall trial cost? The lack of focus led to a model that was technically sound but irrelevant to the client’s most pressing operational needs.

    Takeaway: Define specific, measurable objectives upfront. Use SMART criteria (Specific, Measurable, Achievable, Relevant, Time-bound). For example, aim for “reduce equipment downtime by 15% within six months” rather than a vague “make things better.” Document these goals and align stakeholders early to avoid scope creep.

    Lesson 2: Data quality trumps quantity

    Data is the lifeblood of AI, but poor-quality data is poison. In one project, a retail client began with years of sales data to predict inventory needs. The catch? The dataset was riddled with inconsistencies, including missing entries, duplicate records and outdated product codes. The model performed well in testing but failed in production because it learned from noisy, unreliable data.

    Takeaway: Invest in data quality over volume. Use tools like Pandas for preprocessing and Great Expectations for data validation to catch issues early. Conduct exploratory data analysis (EDA) with visualizations (like Seaborn) to spot outliers or inconsistencies. Clean data is worth more than terabytes of garbage.
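
    As a minimal sketch of what such checks look like in practice, the snippet below audits a hypothetical sales table with pandas before any modeling; the file and column names are invented for the example.

    # Minimal data-quality audit for a hypothetical sales table; file and column
    # names are illustrative. Catching these issues before training avoids learning
    # from noisy, unreliable data.
    import pandas as pd

    df = pd.read_csv("sales.csv")            # hypothetical historical sales export
    catalog = pd.read_csv("catalog.csv")     # hypothetical list of current product codes

    report = {
        "rows": len(df),
        "duplicate_rows": int(df.duplicated().sum()),
        "missing_by_column": df.isna().sum().to_dict(),
        "unknown_product_codes": int((~df["product_code"].isin(catalog["product_code"])).sum()),
    }
    print(report)

    # Drop exact duplicates and rows missing the prediction target before modeling.
    clean = df.drop_duplicates().dropna(subset=["units_sold"])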

    Lesson 3: Overcomplicating the model backfires

    Chasing technical complexity doesn't always lead to better outcomes. For example, on a healthcare project, development initially began by creating a sophisticated convolutional neural network (CNN) to identify anomalies in medical images.

    While the model was state-of-the-art, its high computational cost meant weeks of training, and its "black box" nature made it difficult for clinicians to trust. The application was revised to implement a simpler random forest model that not only matched the CNN's predictive accuracy but was faster to train and far easier to interpret — a critical factor for clinical adoption.

    Takeaway: Start simple. Use straightforward algorithms like random forest or XGBoost from scikit-learn to establish a baseline. Only scale to complex models, such as TensorFlow-based long short-term memory (LSTM) networks, if the problem demands it. Prioritize explainability with tools like SHAP (SHapley Additive exPlanations) to build trust with stakeholders.
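
    A baseline of that kind takes only a few lines. The sketch below trains a random forest on synthetic data and uses SHAP to rank the features driving its predictions; swap in real data and a classifier as the problem requires.

    # Simple baseline first: a random forest plus SHAP explanations on synthetic data.
    # Replace the toy dataset with real features before drawing any conclusions.
    import numpy as np
    import shap
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split

    X, y = make_regression(n_samples=2000, n_features=20, noise=0.1, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(X_train, y_train)
    print("baseline R^2:", round(model.score(X_test, y_test), 3))

    # SHAP values quantify how much each feature pushed individual predictions up or
    # down, which is the kind of transparency that builds stakeholder trust.
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X_test[:200])
    mean_abs = np.abs(shap_values).mean(axis=0)
    print("most influential features:", np.argsort(mean_abs)[::-1][:5])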

    Lesson 4: Ignoring deployment realities

    A model that shines in a Jupyter Notebook can crash in the real world. For example, a company’s initial deployment of a recommendation engine for its e-commerce platform couldn’t handle peak traffic. The model was built without scalability in mind and choked under load, causing delays and frustrated users. The oversight cost weeks of rework.

    Takeaway: Plan for production from day one. Package models in Docker containers and deploy with Kubernetes for scalability. Use TensorFlow Serving or FastAPI for efficient inference. Monitor performance with Prometheus and Grafana to catch bottlenecks early. Test under realistic conditions to ensure reliability.
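
    As a minimal sketch of that approach, the snippet below serves a joblib-saved model behind a FastAPI endpoint; the model file and the flat numeric feature vector are hypothetical stand-ins for whatever the training pipeline actually produces.

    import joblib
    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()
    model = joblib.load("model.joblib")  # hypothetical pipeline exported from training

    class PredictRequest(BaseModel):
        features: list[float]  # flat feature vector in the order the model expects

    @app.post("/predict")
    def predict(req: PredictRequest):
        prediction = model.predict([req.features])[0]
        return {"prediction": float(prediction)}

    Launched with uvicorn inside a Docker container, the same endpoint can then be load-tested under realistic traffic and scaled out with Kubernetes before it ever faces real users.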

    Lesson 5: Neglecting model maintenance

    AI models aren’t set-and-forget. In a financial forecasting project, the model performed well for months until market conditions shifted. Unmonitored data drift caused predictions to degrade, and the lack of a retraining pipeline meant manual fixes were needed. The project lost credibility before developers could recover.

    Takeaway: Build for the long haul. Implement monitoring for data drift using tools like Alibi Detect. Automate retraining with Apache Airflow and track experiments with MLflow. Incorporate active learning to prioritize labeling for uncertain predictions, keeping models relevant.
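
    As a sketch of what that monitoring can look like, the snippet below uses Alibi Detect's Kolmogorov-Smirnov drift detector to compare a recent production batch against a reference window; the arrays are random stand-ins, and the retraining hook is only indicated in a comment.

    import numpy as np
    from alibi_detect.cd import KSDrift

    x_ref = np.random.randn(1000, 10)        # stand-in for training-time feature matrix
    x_live = np.random.randn(200, 10) + 0.5  # stand-in for a recent production batch

    detector = KSDrift(x_ref, p_val=0.05)    # per-feature two-sample KS test
    result = detector.predict(x_live)

    if result["data"]["is_drift"]:
        # In a real pipeline, this is where an Airflow-triggered retraining job would be
        # kicked off and the drift event logged to MLflow
        print("Drift detected: schedule retraining")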

    Lesson 6: Underestimating stakeholder buy-in

    Technology doesn’t exist in a vacuum. A fraud detection model was technically flawless but flopped because end-users — bank employees — didn’t trust it. Without clear explanations or training, they ignored the model’s alerts, rendering it useless.

    Takeaway: Prioritize human-centric design. Use explainability tools like SHAP to make model decisions transparent. Engage stakeholders early with demos and feedback loops. Train users on how to interpret and act on AI outputs. Trust is as critical as accuracy.

    Best practices for success in AI projects

    Drawing from these failures, here’s the roadmap to get it right:

    • Set clear goals: Use SMART criteria to align teams and stakeholders.

    • Prioritize data quality: Invest in cleaning, validation and EDA before modeling.

    • Start simple: Build baselines with simple algorithms before scaling complexity.

    • Design for production: Plan for scalability, monitoring and real-world conditions.

    • Maintain models: Automate retraining and monitor for drift to stay relevant.

    • Engage stakeholders: Foster trust with explainability and user training.

    Building resilient AI

    AI’s potential is intoxicating, yet failed AI projects teach us that success isn’t just about algorithms. It’s about discipline, planning and adaptability. As AI evolves, emerging trends like federated learning for privacy-preserving models and edge AI for real-time insights will raise the bar. By learning from past mistakes, teams can build scalable production systems that are robust, accurate and trusted.

    Kavin Xavier is VP of AI solutions at CapeStart.

    Read more from our guest writers. Or, consider submitting a post of your own! See our guidelines here.

  • What could possibly go wrong if an enterprise replaces all its engineers with AI?

    AI coding, vibe coding and agentic swarms have made a dramatic and astonishing recent market entrance, with the AI Code Tools market valued at $4.8 billion and expected to grow at a 23% annual rate. Enterprises are grappling with AI coding agents and what to do about expensive human coders.

    They don’t lack for advice.  OpenAI’s CEO estimates that AI can perform over 50% of what human engineers can do.  Six months ago, Anthropic’s CEO said that AI would write 90% of code in six months.  Meta’s CEO said he believes AI will replace mid-level engineers “soon.” Judging by recent tech layoffs, it seems many executives are embracing that advice.

    Software engineers and data scientists are among the most expensive salary lines at many companies, and business and technology leaders may be tempted to replace them with AI. However, recent high-profile failures demonstrate that engineers and their expertise remain valuable, even as AI continues to make impressive advances.

    SaaStr disaster

    Jason Lemkin, a tech entrepreneur and founder of the SaaS community SaaStr, has been vibe coding a SaaS networking app and live-tweeting his experience. About a week into his adventure, he admitted to his audience that something was going very wrong.  The AI deleted his production database despite his request for a “code and action freeze.” This is the kind of mistake no experienced (or even semi-experienced) engineer would make.

    If you have ever worked in a professional coding environment, you know to split your development environment from production. Junior engineers are given full access to the development environment (it’s crucial for productivity), but access to production is granted on a limited, need-to-know basis to a few of the most trusted senior engineers. The reason for restricted access is precisely this use case: to prevent a junior engineer from accidentally taking down production.

    In fact, Lemkin made two mistakes. First, for something as critical as production, access is simply never granted to unreliable actors (we don’t rely on asking a junior engineer or an AI nicely). Second, he never separated development from production. In a subsequent public conversation on LinkedIn, Lemkin, who holds a Stanford Executive MBA and a Berkeley JD, admitted that he was not aware of the best practice of splitting development and production databases.

    The takeaway for business leaders is that standard software engineering best practices still apply. We should incorporate at least the same safety constraints for AI as we do for junior engineers. Arguably, we should go beyond that and treat AI slightly adversarially: There are reports that, like HAL in Stanley Kubrick's 2001: A Space Odyssey, the AI might try to break out of its sandbox environment to accomplish a task. With more vibe coding, having experienced engineers who understand how complex software systems work and can implement the proper guardrails in development processes will become increasingly necessary.

    Tea hack

    Sean Cook is the founder and CEO of Tea, a mobile application launched in 2023 and designed to help women date safely. In the summer of 2025, Tea was “hacked”: 72,000 images, including 13,000 verification photos and images of government IDs, were leaked onto the public discussion forum 4chan. Worse, Tea’s own privacy policy promised that these images would be “deleted immediately” after users were authenticated, meaning the company potentially violated its own policy.

    I use “hacked” in air quotes because the incident stems less from the cleverness of the attackers than from the ineptitude of the defenders. In addition to violating its own data policies, the app left a Firebase storage bucket unsecured, exposing sensitive user data to the public internet. It’s the digital equivalent of locking your front door but leaving your back door open with your family jewelry ostentatiously hanging on the doorknob.

    While we don’t know if the root cause was vibe coding, the Tea hack highlights how catastrophic breaches can stem from basic, preventable security errors rooted in poor development processes. It is the kind of vulnerability that a disciplined and thoughtful engineering process addresses. Unfortunately, relentless financial pressure pushes a “lean,” “move fast and break things” culture that is the polar opposite of that discipline, and vibe coding only exacerbates the problem.

    How to safely adopt AI coding agents?

    So how should enterprise and technology leaders think about AI? First, this is not a call to abandon AI for coding.  An MIT Sloan study estimated AI leads to productivity gains between 8% and 39%, while a McKinsey study found a 10% to 50% reduction in time to task completion with the use of AI. 

    However, we should be aware of the risks. The old lessons of software engineering don’t go away. These include many tried-and-true best practices, such as version control, automated unit and integration tests, safety checks like SAST/DAST, separating development and production environments, code review and secrets management. If anything, they become more salient.

    AI can generate code 100 times faster than humans can type, fostering an illusion of productivity that is a tempting siren call for many executives. However, the quality of this rapidly generated AI slop is still up for debate. To develop complex production systems, enterprises need the thoughtful, seasoned experience of human engineers.

    Tianhui Michael Li is president at Pragmatic Institute and the founder and president of The Data Incubator.

    Read more from our guest writers. Or, consider submitting a post of your own! See our guidelines here.

  • Terminal-Bench 2.0 launches alongside Harbor, a new framework for testing agents in containers

    The developers of Terminal-Bench, a benchmark suite for evaluating the performance of autonomous AI agents on real-world terminal-based tasks, have released version 2.0 alongside Harbor, a new framework for testing, improving and optimizing AI agents in containerized environments.

    The dual release aims to address long-standing pain points in testing and optimizing AI agents, particularly those built to operate autonomously in realistic developer environments.

    With a more difficult and rigorously verified task set, Terminal-Bench 2.0 replaces version 1.0 as the standard for assessing frontier model capabilities.

    Harbor, the accompanying runtime framework, enables developers and researchers to scale evaluations across thousands of cloud containers and integrates with both open-source and proprietary agents and training pipelines.

    “Harbor is the package we wish we had had while making Terminal-Bench," wrote co-creator Alex Shaw on X. "It’s for agent, model, and benchmark developers and researchers who want to evaluate and improve agents and models."

    Higher Bar, Cleaner Data

    Terminal-Bench 1.0 saw rapid adoption after its release in May 2025, becoming a default benchmark for evaluating AI-powered agents that operate in developer-style terminal environments. These agents interact with systems through the command line, mimicking how developers work beneath the graphical user interface.

    However, its broad scope came with inconsistencies. Several tasks were identified by the community as poorly specified or unstable due to external service changes.

    Version 2.0 addresses those issues directly. The updated suite includes 89 tasks, each subjected to several hours of manual and LLM-assisted validation. The emphasis is on making tasks solvable, realistic, and clearly specified, raising the difficulty ceiling while improving reliability and reproducibility.

    A notable example is the download-youtube task, which was removed or refactored in 2.0 due to its dependence on unstable third-party APIs.

    “Astute Terminal-Bench fans may notice that SOTA performance is comparable to TB1.0 despite our claim that TB2.0 is harder,” Shaw noted on X. “We believe this is because task quality is substantially higher in the new benchmark.”

    Harbor: Unified Rollouts at Scale

    Alongside the benchmark update, the team launched Harbor, a new framework for running and evaluating agents in cloud-deployed containers.

    Harbor supports large-scale rollout infrastructure, with compatibility for major providers like Daytona and Modal.

    Designed to generalize across agent architectures, Harbor supports:

    • Evaluation of any container-installable agent

    • Scalable supervised fine-tuning (SFT) and reinforcement learning (RL) pipelines

    • Custom benchmark creation and deployment

    • Full integration with Terminal-Bench 2.0

    Harbor was used internally to run tens of thousands of rollouts during the creation of the new benchmark. It is now publicly available via harborframework.com, with documentation for testing and submitting agents to the public leaderboard.

    Early Results: GPT-5 Leads in Task Success

    Initial results from the Terminal-Bench 2.0 leaderboard show OpenAI's Codex CLI (command line interface), a GPT-5 powered variant, in the lead, with a 49.6% success rate — the highest among all agents tested so far.

    Close behind are other GPT-5 variants and Claude Sonnet 4.5-based agents.

    Top 5 Agent Results (Terminal-Bench 2.0):

    1. Codex CLI (GPT-5) — 49.6%

    2. Codex CLI (GPT-5-Codex) — 44.3%

    3. OpenHands (GPT-5) — 43.8%

    4. Terminus 2 (GPT-5-Codex) — 43.4%

    5. Terminus 2 (Claude Sonnet 4.5) — 42.8%

    The close clustering among top models indicates active competition across platforms, with no single agent solving more than half the tasks.

    Submission and Use

    To test or submit an agent, users install Harbor and run the benchmark using simple CLI commands. Submissions to the leaderboard require five benchmark runs, and results can be emailed to the developers along with job directories for validation.

    harbor run -d terminal-bench@2.0 -m "<model>" -a "<agent>" --n-attempts 5 --jobs-dir <path/to/output>

    Terminal-Bench 2.0 is already being integrated into research workflows focused on agentic reasoning, code generation, and tool use. According to co-creator Mike Merrill, a postdoctoral researcher at Stanford, a detailed preprint is in progress covering the verification process and design methodology behind the benchmark.

    Aiming for Standardization

    The combined release of Terminal-Bench 2.0 and Harbor marks a step toward more consistent and scalable agent evaluation infrastructure. As LLM agents proliferate in developer and operational environments, the need for controlled, reproducible testing has grown.

    These tools offer a potential foundation for a unified evaluation stack — supporting model improvement, environment simulation, and benchmark standardization across the AI ecosystem.

  • Ship fast, optimize later: top AI engineers don’t care about cost — they’re prioritizing deployment

    Across industries, rising compute expenses are often cited as a barrier to AI adoption — but leading companies are finding that cost is no longer the real constraint.

    The tougher challenges (and the ones top of mind for many tech leaders)? Latency, flexibility and capacity.

    At Wonder, for instance, AI adds a mere few cents per order; the food delivery and takeout company is much more concerned with cloud capacity as demand skyrockets. Recursion, for its part, has been focused on balancing small and larger-scale training and deployment via on-premises clusters and the cloud; this has afforded the biotech company flexibility for rapid experimentation.

    The companies’ real-world experiences highlight a broader industry trend: For enterprises operating AI at scale, economics aren't the deciding factor — the conversation has shifted from how to pay for AI to how fast it can be deployed and sustained.

    AI leaders from the two companies recently sat down with VentureBeat CEO and editor-in-chief Matt Marshall as part of VB’s traveling AI Impact Series. Here’s what they shared.

    Wonder: Rethink what you assume about capacity

    Wonder uses AI to power everything from recommendations to logistics — yet, as of now, reported CTO James Chen, AI adds just a few cents per order.

    Chen explained that the technology component of a meal order costs 14 cents, the AI adds 2 to 3 cents, although that’s “going up really rapidly” to 5 to 8 cents. Still, that seems almost immaterial compared to total operating costs.

    Instead, the 100% cloud-native AI company’s main concern has been securing enough capacity to keep up with growing demand. Wonder was built on “the assumption” (which proved to be incorrect) that there would be “unlimited capacity,” so the team could move “super fast” and wouldn’t have to worry about managing infrastructure, Chen noted.

    But the company has grown quite a bit over the last few years, he said; as a result, about six months ago, “we started getting little signals from the cloud providers, ‘Hey, you might need to consider going to region two,’” because they were running out of capacity for CPU or data storage at their facilities as demand grew.

    It was “very shocking” that they had to move to plan B earlier than they anticipated. “Obviously it's good practice to be multi-region, but we were thinking maybe two more years down the road,” said Chen.

    What's not economically feasible (yet)

    Wonder built its own model to maximize its conversion rate, Chen noted; the goal is to surface new restaurants to relevant customers as much as possible. These are “isolated scenarios” where models are trained over time to be “very, very efficient and very fast.”

    Currently, the best bet for Wonder’s use case is large models, Chen noted. But in the long term, they’d like to move to small models that are hyper-customized to individuals (via AI agents or concierges) based on their purchase history and even their clickstream. “Having these micro models is definitely the best, but right now the cost is very expensive,” Chen noted. “If you try to create one for each person, it's just not economically feasible.”

    Budgeting is an art, not a science

    Wonder gives its devs and data scientists as much playroom as possible to experiment, and internal teams review the costs of use to make sure nobody turned on a model and “jacked up massive compute around a huge bill,” said Chen.

    The company is trying different things to offload to AI and operate within margins. “But then it's very hard to budget because you have no idea,” he said. One of the challenging things is the pace of development; when a new model comes out, “we can’t just sit there, right? We have to use it.”

    Budgeting for the unknown economics of a token-based system is “definitely art versus science.”

    A critical component in the software development lifecycle is preserving context when using large native models, he explained. When you find something that works, you can add it to your company’s “corpus of context” that can be sent with every request. That’s big and it costs money each time.

    “Over 50%, up to 80% of your costs is just resending the same information back into the same engine again on every request,” said Chen.

    In theory, the more volume they handle, the less each unit should cost. “I know when a transaction happens, I'll pay the X cent tax for each one, but I don't want to be limited to use the technology for all these other creative ideas."

    The 'vindication moment' for Recursion

    Recursion, for its part, has focused on meeting broad-ranging compute needs via a hybrid infrastructure of on-premise clusters and cloud inference.

    When initially looking to build out its AI infrastructure, the company had to go with its own setup, as “the cloud providers didn't have very many good offerings,” explained CTO Ben Mabey. “The vindication moment was that we needed more compute and we looked to the cloud providers and they were like, ‘Maybe in a year or so.’”

    The company’s first cluster in 2017 incorporated Nvidia gaming GPUs (1080s, launched in 2016); they have since added Nvidia H100s and A100s, and use a Kubernetes cluster that they run in the cloud or on-prem.

    Addressing the longevity question, Mabey noted: “These gaming GPUs are actually still being used today, which is crazy, right? The myth that a GPU's life span is only three years, that's definitely not the case. A100s are still top of the list, they're the workhorse of the industry.”

    Best use cases on-prem vs cloud; cost differences

    More recently, Mabey’s team has been training a foundation model on Recursion’s image repository (which consists of petabytes of data and more than 200 pictures). This and other types of big training jobs have required a “massive cluster” and connected, multi-node setups.

    “When we need that fully-connected network and access to a lot of our data in a high parallel file system, we go on-prem,” he explained. On the other hand, shorter workloads run in the cloud.

    Recursion’s method is to “pre-empt” GPUs and Google tensor processing units (TPUs), which is the process of interrupting running GPU tasks to work on higher-priority ones. “Because we don't care about the speed in some of these inference workloads where we're uploading biological data, whether that's an image or sequencing data, DNA data,” Mabey explained. “We can say, ‘Give this to us in an hour,’ and we're fine if it kills the job.”

    From a cost perspective, moving large workloads on-prem is “conservatively” 10 times cheaper, Mabey noted; for a five year TCO, it's half the cost. On the other hand, for smaller storage needs, the cloud can be “pretty competitive” cost-wise.

    Ultimately, Mabey urged tech leaders to step back and determine whether they’re truly willing to commit to AI; cost-effective solutions typically require multi-year buy-ins.

    “From a psychological perspective, I've seen peers of ours who will not invest in compute, and as a result they're always paying on demand," said Mabey. "Their teams use far less compute because they don't want to run up the cloud bill. Innovation really gets hampered by people not wanting to burn money.”

  • Google debuts AI chips with 4X performance boost, secures Anthropic megadeal worth billions

    Google Cloud is introducing what it calls its most powerful artificial intelligence infrastructure to date, unveiling a seventh-generation Tensor Processing Unit and expanded Arm-based computing options designed to meet surging demand for AI model deployment — what the company characterizes as a fundamental industry shift from training models to serving them to billions of users.

    The announcement, made Thursday, centers on Ironwood, Google's latest custom AI accelerator chip, which will become generally available in the coming weeks. In a striking validation of the technology, Anthropic, the AI safety company behind the Claude family of models, disclosed plans to access up to one million of these TPU chips — a commitment worth tens of billions of dollars and among the largest known AI infrastructure deals to date.

    The move underscores an intensifying competition among cloud providers to control the infrastructure layer powering artificial intelligence, even as questions mount about whether the industry can sustain its current pace of capital expenditure. Google's approach — building custom silicon rather than relying solely on Nvidia's dominant GPU chips — amounts to a long-term bet that vertical integration from chip design through software will deliver superior economics and performance.

    Why companies are racing to serve AI models, not just train them

    Google executives framed the announcements around what they call "the age of inference" — a transition point where companies shift resources from training frontier AI models to deploying them in production applications serving millions or billions of requests daily.

    "Today's frontier models, including Google's Gemini, Veo, and Imagen and Anthropic's Claude train and serve on Tensor Processing Units," said Amin Vahdat, vice president and general manager of AI and Infrastructure at Google Cloud. "For many organizations, the focus is shifting from training these models to powering useful, responsive interactions with them."

    This transition has profound implications for infrastructure requirements. Where training workloads can often tolerate batch processing and longer completion times, inference — the process of actually running a trained model to generate responses — demands consistently low latency, high throughput, and unwavering reliability. A chatbot that takes 30 seconds to respond, or a coding assistant that frequently times out, becomes unusable regardless of the underlying model's capabilities.

    Agentic workflows — where AI systems take autonomous actions rather than simply responding to prompts — create particularly complex infrastructure challenges, requiring tight coordination between specialized AI accelerators and general-purpose computing.

    Inside Ironwood's architecture: 9,216 chips working as one supercomputer

    Ironwood is more than an incremental improvement over Google's sixth-generation TPUs. According to technical specifications shared by the company, it delivers more than four times better performance for both training and inference workloads compared to its predecessor — gains that Google attributes to a system-level co-design approach rather than simply increasing transistor counts.

    The architecture's most striking feature is its scale. A single Ironwood "pod" — a tightly integrated unit of TPU chips functioning as one supercomputer — can connect up to 9,216 individual chips through Google's proprietary Inter-Chip Interconnect network operating at 9.6 terabits per second. To put that bandwidth in perspective, it's roughly equivalent to downloading the entire Library of Congress in under two seconds.

    This massive interconnect fabric allows the 9,216 chips to share access to 1.77 petabytes of High Bandwidth Memory — memory fast enough to keep pace with the chips' processing speeds. That's approximately 40,000 high-definition Blu-ray movies' worth of working memory, instantly accessible by thousands of processors simultaneously. "For context, that means Ironwood Pods can deliver 118x more FP8 ExaFLOPS versus the next closest competitor," Google stated in technical documentation.

    The system employs Optical Circuit Switching technology that acts as a "dynamic, reconfigurable fabric." When individual components fail or require maintenance — inevitable at this scale — the OCS technology automatically reroutes data traffic around the interruption within milliseconds, allowing workloads to continue running without user-visible disruption.

    This reliability focus reflects lessons learned from deploying five previous TPU generations. Google reported that its fleet-wide uptime for liquid-cooled systems has maintained approximately 99.999% availability since 2020 — equivalent to less than six minutes of downtime per year.

    Anthropic's billion-dollar bet validates Google's custom silicon strategy

    Perhaps the most significant external validation of Ironwood's capabilities comes from Anthropic's commitment to access up to one million TPU chips — a staggering figure in an industry where even clusters of 10,000 to 50,000 accelerators are considered massive.

    "Anthropic and Google have a longstanding partnership and this latest expansion will help us continue to grow the compute we need to define the frontier of AI," said Krishna Rao, Anthropic's chief financial officer, in the official partnership agreement. "Our customers — from Fortune 500 companies to AI-native startups — depend on Claude for their most important work, and this expanded capacity ensures we can meet our exponentially growing demand."

    According to a separate statement, Anthropic will have access to "well over a gigawatt of capacity coming online in 2026" — enough electricity to power a small city. The company specifically cited TPUs' "price-performance and efficiency" as key factors in the decision, along with "existing experience in training and serving its models with TPUs."

    Industry analysts estimate that a commitment to access one million TPU chips, with associated infrastructure, networking, power, and cooling, likely represents a multi-year contract worth tens of billions of dollars — among the largest known cloud infrastructure commitments in history.

    James Bradbury, Anthropic's head of compute, elaborated on the inference focus: "Ironwood's improvements in both inference performance and training scalability will help us scale efficiently while maintaining the speed and reliability our customers expect."

    Google's Axion processors target the computing workloads that make AI possible

    Alongside Ironwood, Google introduced expanded options for its Axion processor family — custom Arm-based CPUs designed for general-purpose workloads that support AI applications but don't require specialized accelerators.

    The N4A instance type, now entering preview, targets what Google describes as "microservices, containerized applications, open-source databases, batch, data analytics, development environments, experimentation, data preparation and web serving jobs that make AI applications possible." The company claims N4A delivers up to 2X better price-performance than comparable current-generation x86-based virtual machines.

    Google is also previewing C4A metal, its first bare-metal Arm instance, which provides dedicated physical servers for specialized workloads such as Android development, automotive systems, and software with strict licensing requirements.

    The Axion strategy reflects a growing conviction that the future of computing infrastructure requires both specialized AI accelerators and highly efficient general-purpose processors. While a TPU handles the computationally intensive task of running an AI model, Axion-class processors manage data ingestion, preprocessing, application logic, API serving, and countless other tasks in a modern AI application stack.

    Early customer results suggest the approach delivers measurable economic benefits. Vimeo reported observing "a 30% improvement in performance for our core transcoding workload compared to comparable x86 VMs" in initial N4A tests. ZoomInfo measured "a 60% improvement in price-performance" for data processing pipelines running on Java services, according to Sergei Koren, the company's chief infrastructure architect.

    Software tools turn raw silicon performance into developer productivity

    Hardware performance means little if developers cannot easily harness it. Google emphasized that Ironwood and Axion are integrated into what it calls AI Hypercomputer — "an integrated supercomputing system that brings together compute, networking, storage, and software to improve system-level performance and efficiency."

    According to an October 2025 IDC Business Value Snapshot study, AI Hypercomputer customers achieved on average 353% three-year return on investment, 28% lower IT costs, and 55% more efficient IT teams.

    Google disclosed several software enhancements designed to maximize Ironwood utilization. Google Kubernetes Engine now offers advanced maintenance and topology awareness for TPU clusters, enabling intelligent scheduling and highly resilient deployments. The company's open-source MaxText framework now supports advanced training techniques including Supervised Fine-Tuning and Generative Reinforcement Policy Optimization.

    Perhaps most significant for production deployments, Google's Inference Gateway intelligently load-balances requests across model servers to optimize critical metrics. According to Google, it can reduce time-to-first-token latency by 96% and serving costs by up to 30% through techniques like prefix-cache-aware routing.

    The Inference Gateway monitors key metrics including KV cache hits, GPU or TPU utilization, and request queue length, then routes incoming requests to the optimal replica. For conversational AI applications where multiple requests might share context, routing requests with shared prefixes to the same server instance can dramatically reduce redundant computation.
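
    Google has not published the gateway's internals, but the general idea behind prefix-cache-aware routing can be sketched in a few lines of Python: pin requests that share a prompt prefix to the same replica so its cached attention state can be reused. The replica names and prefix length below are arbitrary illustrations, not Google's implementation.

    import hashlib

    REPLICAS = ["replica-0", "replica-1", "replica-2"]  # hypothetical model servers
    PREFIX_CHARS = 32  # how much of the prompt to treat as the shared, cacheable prefix

    def route(prompt: str) -> str:
        # Hash only the leading prefix so requests that share context land on the same replica
        digest = hashlib.sha256(prompt[:PREFIX_CHARS].encode()).hexdigest()
        return REPLICAS[int(digest, 16) % len(REPLICAS)]

    system = "System: you are a helpful support bot for Acme.\n"
    print(route(system + "User: where is my order?"))  # lands on the same replica as...
    print(route(system + "User: cancel order 1234"))   # ...this request with the same prefix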

    The hidden challenge: powering and cooling one-megawatt server racks

    Behind these announcements lies a massive physical infrastructure challenge that Google addressed at the recent Open Compute Project EMEA Summit. The company disclosed that it's implementing +/-400 volt direct current power delivery capable of supporting up to one megawatt per rack — a tenfold increase from typical deployments.

    "The AI era requires even greater power delivery capabilities," explained Madhusudan Iyengar and Amber Huffman, Google principal engineers, in an April 2025 blog post. "ML will require more than 500 kW per IT rack before 2030."

    Google is collaborating with Meta and Microsoft to standardize electrical and mechanical interfaces for high-voltage DC distribution. The company selected 400 VDC specifically to leverage the supply chain established by electric vehicles, "for greater economies of scale, more efficient manufacturing, and improved quality and scale."

    On cooling, Google revealed it will contribute its fifth-generation cooling distribution unit design to the Open Compute Project. The company has deployed liquid cooling "at GigaWatt scale across more than 2,000 TPU Pods in the past seven years" with fleet-wide availability of approximately 99.999%.

    Water can transport approximately 4,000 times more heat per unit volume than air for a given temperature change — critical as individual AI accelerator chips increasingly dissipate 1,000 watts or more.

    Custom silicon gambit challenges Nvidia's AI accelerator dominance

    Google's announcements come as the AI infrastructure market reaches an inflection point. While Nvidia maintains overwhelming dominance in AI accelerators — holding an estimated 80-95% market share — cloud providers are increasingly investing in custom silicon to differentiate their offerings and improve unit economics.

    Amazon Web Services pioneered this approach with Graviton Arm-based CPUs and Inferentia / Trainium AI chips. Microsoft has developed Cobalt processors and is reportedly working on AI accelerators. Google now offers the most comprehensive custom silicon portfolio among major cloud providers.

    The strategy faces inherent challenges. Custom chip development requires enormous upfront investment — often billions of dollars. The software ecosystem for specialized accelerators lags behind Nvidia's CUDA platform, which benefits from 15+ years of developer tools. And rapid AI model architecture evolution creates risk that custom silicon optimized for today's models becomes less relevant as new techniques emerge.

    Yet Google argues its approach delivers unique advantages. "This is how we built the first TPU ten years ago, which in turn unlocked the invention of the Transformer eight years ago — the very architecture that powers most of modern AI," the company noted, referring to the seminal "Attention Is All You Need" paper from Google researchers in 2017.

    The argument is that tight integration — "model research, software, and hardware development under one roof" — enables optimizations impossible with off-the-shelf components.

    Beyond Anthropic, several other customers provided early feedback. Lightricks, which develops creative AI tools, reported that early Ironwood testing "makes us highly enthusiastic" about creating "more nuanced, precise, and higher-fidelity image and video generation for our millions of global customers," said Yoav HaCohen, the company's research director.

    Google's announcements raise questions that will play out over coming quarters. Can the industry sustain current infrastructure spending, with major AI companies collectively committing hundreds of billions of dollars? Will custom silicon prove economically superior to Nvidia GPUs? How will model architectures evolve?

    For now, Google appears committed to a strategy that has defined the company for decades: building custom infrastructure to enable applications impossible on commodity hardware, then making that infrastructure available to customers who want similar capabilities without the capital investment.

    As the AI industry transitions from research labs to production deployments serving billions of users, that infrastructure layer — the silicon, software, networking, power, and cooling that make it all run — may prove as important as the models themselves.

    And if Anthropic's willingness to commit to accessing up to one million chips is any indication, Google's bet on custom silicon designed specifically for the age of inference may be paying off just as demand reaches its inflection point.

  • Moonshot’s Kimi K2 Thinking emerges as leading open source AI, outperforming GPT-5, Claude Sonnet 4.5 on key benchmarks

    Even as concern and skepticism grow over U.S. AI startup OpenAI's buildout strategy and high spending commitments, Chinese open source AI providers are escalating their competition, and one has even caught up to OpenAI's flagship paid proprietary model, GPT-5, on key third-party performance benchmarks with a new, free model.

    The Chinese AI startup Moonshot AI’s new Kimi K2 Thinking model, released today, has vaulted past both proprietary and open-weight competitors to claim the top position in reasoning, coding, and agentic-tool benchmarks.

    Despite being fully open-source, the model now outperforms OpenAI’s GPT-5, Anthropic’s Claude Sonnet 4.5 (Thinking mode), and xAI's Grok-4 on several standard evaluations — an inflection point for the competitiveness of open AI systems.

    Developers can access the model via platform.moonshot.ai and kimi.com; weights and code are hosted on Hugging Face. The open release includes APIs for chat, reasoning, and multi-tool workflows.

    Users can try out Kimi K2 Thinking directly through its own ChatGPT-like website competitor and on a Hugging Face space as well.

    Modified Standard Open Source License

    Moonshot AI has formally released Kimi K2 Thinking under a Modified MIT License on Hugging Face.

    The license grants full commercial and derivative rights — meaning individual researchers and developers working on behalf of enterprise clients can access it freely and use it in commercial applications — but adds one restriction:

    "If the software or any derivative product serves over 100 million monthly active users or generates over $20 million USD per month in revenue, the deployer must prominently display 'Kimi K2' on the product’s user interface."

    For most research and enterprise applications, this clause functions as a light-touch attribution requirement while preserving the freedoms of standard MIT licensing.

    It makes K2 Thinking one of the most permissively licensed frontier-class models currently available.

    A New Benchmark Leader

    Kimi K2 Thinking is a Mixture-of-Experts (MoE) model built around one trillion parameters, of which 32 billion activate per inference.

    It combines long-horizon reasoning with structured tool use, executing up to 200–300 sequential tool calls without human intervention.

    According to Moonshot’s published test results, K2 Thinking achieved:

    • 44.9 % on Humanity’s Last Exam (HLE), a state-of-the-art score;

    • 60.2 % on BrowseComp, an agentic web-search and reasoning test;

    • 71.3 % on SWE-Bench Verified and 83.1 % on LiveCodeBench v6, key coding evaluations;

    • 56.3 % on Seal-0, a benchmark for real-world information retrieval.

    Across these tasks, K2 Thinking consistently outperforms GPT-5’s corresponding scores and surpasses the previous open-weight leader MiniMax-M2—released just weeks earlier by Chinese rival MiniMax AI.

    Open Model Outperforms Proprietary Systems

    GPT-5 and Claude Sonnet 4.5 Thinking remain the leading proprietary “thinking” models.

    Yet in the same benchmark suite, K2 Thinking’s agentic reasoning scores exceed both: for instance, on BrowseComp the open model’s 60.2 % decisively leads GPT-5’s 54.9 % and Claude 4.5’s 24.1 %.

    K2 Thinking also edges GPT-5 in GPQA Diamond (85.7 % vs 84.5 %) and matches it on mathematical reasoning tasks such as AIME 2025 and HMMT 2025.

    Only in certain heavy-mode configurations—where GPT-5 aggregates multiple trajectories—does the proprietary model regain parity.

    That Moonshot’s fully open-weight release can meet or exceed GPT-5’s scores marks a turning point. The gap between closed frontier systems and publicly available models has effectively collapsed for high-end reasoning and coding.

    Surpassing MiniMax-M2: The Previous Open-Source Benchmark

    When VentureBeat profiled MiniMax-M2 just a week and a half ago, it was hailed as the “new king of open-source LLMs,” achieving top scores among open-weight systems:

    • τ²-Bench 77.2

    • BrowseComp 44.0

    • FinSearchComp-global 65.5

    • SWE-Bench Verified 69.4

    Those results placed MiniMax-M2 near GPT-5-level capability in agentic tool use. Yet Kimi K2 Thinking now eclipses them by wide margins.

    Its BrowseComp result of 60.2 % exceeds M2’s 44.0 %, and its SWE-Bench Verified 71.3 % edges out M2’s 69.4 %. Even on financial-reasoning tasks such as FinSearchComp-T3 (47.4 %), K2 Thinking performs comparably while maintaining superior general-purpose reasoning.

    Technically, both models adopt sparse Mixture-of-Experts architectures for compute efficiency, but Moonshot’s network activates more experts and deploys advanced quantization-aware training (INT4 QAT).

    This design doubles inference speed relative to standard precision without degrading accuracy—critical for long “thinking-token” sessions reaching 256 k context windows.

    Agentic Reasoning and Tool Use

    K2 Thinking’s defining capability lies in its explicit reasoning trace. The model outputs an auxiliary field, reasoning_content, revealing intermediate logic before each final response. This transparency preserves coherence across long multi-turn tasks and multi-step tool calls.
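
    Moonshot exposes the model through an OpenAI-compatible API, so a minimal sketch of reading that trace might look like the following; the base URL and model identifier are assumptions to be checked against platform.moonshot.ai, and the reasoning field is read defensively in case the SDK surfaces it differently.

    from openai import OpenAI

    # Assumed OpenAI-compatible endpoint; confirm the exact base URL and model name before use
    client = OpenAI(base_url="https://api.moonshot.ai/v1", api_key="YOUR_MOONSHOT_API_KEY")

    response = client.chat.completions.create(
        model="kimi-k2-thinking",  # assumed identifier for the K2 Thinking model
        messages=[{"role": "user", "content": "Plan and summarize today's AI infrastructure news."}],
    )

    message = response.choices[0].message
    print(getattr(message, "reasoning_content", None))  # intermediate reasoning, if exposed
    print(message.content)                              # final answer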

    A reference implementation published by Moonshot demonstrates how the model autonomously conducts a “daily news report” workflow: invoking date and web-search tools, analyzing retrieved content, and composing structured output—all while maintaining internal reasoning state.

    This end-to-end autonomy enables the model to plan, search, execute, and synthesize evidence across hundreds of steps, mirroring the emerging class of “agentic AI” systems that operate with minimal supervision.

    Efficiency and Access

    Despite its trillion-parameter scale, K2 Thinking’s runtime cost remains modest. Moonshot lists usage at:

    • $0.15 / 1 M tokens (cache hit)

    • $0.60 / 1 M tokens (cache miss)

    • $2.50 / 1 M tokens output

    These rates are competitive even against MiniMax-M2’s $0.30 input / $1.20 output pricing—and an order of magnitude below GPT-5 ($1.25 input / $10 output).

    Comparative Context: Open-Weight Acceleration

    The rapid succession of M2 and K2 Thinking illustrates how quickly open-source research is catching frontier systems. MiniMax-M2 demonstrated that open models could approach GPT-5-class agentic capability at a fraction of the compute cost. Moonshot has now advanced that frontier further, pushing open weights beyond parity into outright leadership.

    Both models rely on sparse activation for efficiency, but K2 Thinking’s higher activation count (32 B vs 10 B active parameters) yields stronger reasoning fidelity across domains. Its test-time scaling—expanding “thinking tokens” and tool-calling turns—provides measurable performance gains without retraining, a feature not yet observed in MiniMax-M2.

    Technical Outlook

    Moonshot reports that K2 Thinking supports native INT4 inference and 256 k-token contexts with minimal performance degradation. Its architecture integrates quantization, parallel trajectory aggregation (“heavy mode”), and Mixture-of-Experts routing tuned for reasoning tasks.

    In practice, these optimizations allow K2 Thinking to sustain complex planning loops—code compile–test–fix, search–analyze–summarize—over hundreds of tool calls. This capability underpins its superior results on BrowseComp and SWE-Bench, where reasoning continuity is decisive.

    Enormous Implications for the AI Ecosystem

    The convergence of open and closed models at the high end signals a structural shift in the AI landscape. Enterprises that once relied exclusively on proprietary APIs can now deploy open alternatives matching GPT-5-level reasoning while retaining full control of weights, data, and compliance.

    Moonshot’s open publication strategy follows the precedent set by DeepSeek R1, Qwen3, GLM-4.6 and MiniMax-M2 but extends it to full agentic reasoning.

    For academic and enterprise developers, K2 Thinking provides both transparency and interoperability—the ability to inspect reasoning traces and fine-tune performance for domain-specific agents.

    The arrival of K2 Thinking signals that Moonshot — a young startup founded in 2023 with investment from some of China's biggest apps and tech companies — is here to play in an intensifying competition, and comes amid growing scrutiny of the financial sustainability of AI’s largest players.

    Just a day ago, OpenAI CFO Sarah Friar sparked controversy after suggesting at the WSJ Tech Live event that the U.S. government might eventually need to provide a “backstop” for the company’s more than $1.4 trillion in compute and data-center commitments — a comment widely interpreted as a call for taxpayer-backed loan guarantees.

    Although Friar later clarified that OpenAI was not seeking direct federal support, the episode reignited debate about the scale and concentration of AI capital spending.

    With OpenAI, Microsoft, Meta, and Google all racing to secure long-term chip supply, critics warn of an unsustainable investment bubble and an “AI arms race” driven more by strategic fear than by commercial returns. Because so many trades and valuations have been made in anticipation of continued hefty AI investment and massive returns, they warn that hesitation or market uncertainty could "blow up" the bubble and take down the entire global economy with it.

    Against that backdrop, Moonshot AI’s and MiniMax’s open-weight releases put more pressure on U.S. proprietary AI firms and their backers to justify the size of the investments and paths to profitability.

    If an enterprise customer can just as easily get comparable or better performance from a free, open source Chinese AI model as from paid, proprietary AI solutions like OpenAI's GPT-5, Anthropic's Claude Sonnet 4.5, or Google's Gemini 2.5 Pro — why would they continue paying to access the proprietary models? Already, Silicon Valley stalwarts like Airbnb have raised eyebrows by admitting to heavily using Chinese open source alternatives like Alibaba's Qwen over OpenAI's proprietary offerings.

    For investors and enterprises, these developments suggest that high-end AI capability is no longer synonymous with high-end capital expenditure. The most advanced reasoning systems may now come not from companies building gigascale data centers, but from research groups optimizing architectures and quantization for efficiency.

    In that sense, K2 Thinking’s benchmark dominance is not just a technical milestone—it’s a strategic one, arriving at a moment when the AI market’s biggest question has shifted from how powerful models can become to who can afford to sustain them.

    What It Means for Enterprises Going Forward

    Within weeks of MiniMax-M2’s ascent, Kimi K2 Thinking has overtaken it—along with GPT-5 and Claude 4.5—across nearly every reasoning and agentic benchmark.

    The model demonstrates that open-weight systems can now meet or surpass proprietary frontier models in both capability and efficiency.

    For the AI research community, K2 Thinking represents more than another open model: it is evidence that the frontier has become collaborative.

    The best-performing reasoning model available today is not a closed commercial product but an open-source system accessible to anyone.