Blog

  • Weibo’s new open source AI model VibeThinker-1.5B outperforms DeepSeek-R1 on $7,800 post-training budget

    Another day in late 2025, another impressive result from a Chinese company in open source artificial intelligence.

    Chinese social networking company Weibo's AI division recently released its open source VibeThinker-1.5B—a 1.5 billion parameter large language model (LLM) that is a fine-tuned variant of rival Chinese tech firm Alibaba's Qwen2.5-Math-1.5B.

    It's available now for free download and usage by researchers and enterprise developers—even for commercial purposes—under a permissive MIT License on Hugging Face, GitHub and ModelScope, with a technical report on open access science publishing site arxiv.org.

    And yet, despite its compact size, VibeThinker-1.5B achieves benchmark-topping reasoning performance on math and code tasks, rivaling or surpassing models hundreds of times its size. On formal reasoning benchmarks it even outperforms Chinese rival DeepSeek's famed R1, the 671-billion parameter model that went viral at the start of this year.

    It further eclipses Mistral AI's Magistral Medium and holds its own against Anthropic's Claude Opus 4 and OpenAI's gpt-oss-20B Medium, all while requiring a fraction of the infrastructure and investment.

    It also does so having been post-trained on a compute budget of just $7,800 (3,900 GPU hours on Nvidia H800s), far less than the tens or even hundreds of thousands of dollars typically required to fine-tune models of similar or larger scale.

    This is not the total cost of the model's development, however, because LLMs are trained in stages. First comes pre-training, when the model learns basic language structure and general knowledge by predicting the next word across enormous amounts of text from the internet, books, and articles. This gives it fluency but not much sense of how to follow instructions or hold a conversation.

    Post-training comes next, using much smaller, higher-quality datasets—typically collections of example questions, prompts, and expert-written answers—to teach the model how to respond helpfully, reason through problems, and align with human expectations. Still, Weibo's post-training cost effectiveness on VibeThinker-1.5B is noteworthy and should be commended.

    The open-source release upends assumptions about parameter scale, compute intensity, and the minimum viable size for high-performance LLMs.

    A Different Training Approach: Spectrum-to-Signal

    VibeThinker-1.5B owes its performance not to scale, but to the training framework behind it: the Spectrum-to-Signal Principle (SSP).

    Instead of optimizing a model purely for single-answer correctness (Pass@1), the SSP framework decouples supervised fine-tuning (SFT) and reinforcement learning (RL) into two distinct phases with different goals:

    • SFT (“Spectrum Phase”): The model is trained to maximize diversity across potential correct answers, improving its Pass@K score. This builds a wide range of plausible solution paths.

    • RL (“Signal Phase”): A second-stage reinforcement learning system (called MaxEnt-Guided Policy Optimization, or MGPO) is used to identify and amplify the most correct paths from this diverse solution pool. MGPO prioritizes problems where the model is most uncertain, using entropy-based weighting to focus learning.
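    To make the Pass@1 versus Pass@K distinction concrete, here is a minimal sketch of the widely used unbiased pass@k estimator (introduced with the HumanEval benchmark). It is an illustration only: the article does not say which estimator the VibeThinker authors used, and the sample counts below are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k samples is correct,
    given n generated samples of which c were verified correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset contains a hit.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 16 samples drawn, only 2 are correct.
p1 = pass_at_k(16, 2, 1)  # single-shot accuracy
p8 = pass_at_k(16, 2, 8)  # chance that 8 tries contain a correct answer
```

    In this toy case the single-shot score is low while the 8-sample score is much higher, which is the kind of gap a diversity-focused SFT phase would aim to widen before RL sharpens it.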

    The authors argue this separation allows small models to explore reasoning space more effectively—achieving signal amplification without relying on massive parameter counts.
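    One way to picture the entropy-based weighting described for MGPO is to give each problem a weight that peaks when the model's empirical pass rate is near 50%, the point of maximum uncertainty. The sketch below is an assumption-laden illustration, not the authors' formula: the KL-style distance to 0.5, the exponential form, and the `lam` parameter are all choices made here for clarity.

```python
from math import exp, log


def entropy_deviation(p: float, p0: float = 0.5) -> float:
    """KL divergence between Bernoulli(p) and Bernoulli(p0); zero when p == p0."""
    def term(a: float, b: float) -> float:
        return 0.0 if a == 0.0 else a * log(a / b)
    return term(p, p0) + term(1.0 - p, 1.0 - p0)


def uncertainty_weight(pass_rate: float, lam: float = 1.0) -> float:
    """Assumed weighting: 1.0 at a 50% pass rate, smaller as the model
    becomes consistently right or consistently wrong on a problem."""
    return exp(-lam * entropy_deviation(pass_rate))


# Weights for a few hypothetical per-problem pass rates.
weights = {p: uncertainty_weight(p) for p in (0.0, 0.25, 0.5, 0.75, 1.0)}
```

    Problems the model always solves or always fails receive the smallest weight here, so training signal concentrates on the cases it gets right only some of the time.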

    VibeThinker-1.5B makes a compelling case that the industry’s reliance on parameter scaling as the only route to better reasoning performance may be outdated.

    By adopting a diversity-first training pipeline, WeiboAI has shown that smaller, more accessible models can match and even outperform billion-dollar systems in logic-heavy tasks.

    The low resource footprint is among the most significant aspects of VibeThinker-1.5B. At under $8,000, the post-training cost is 30–60x lower than models like DeepSeek R1 and MiniMax-M1, which cost between $294K and $535K to train.
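    The cost figures quoted above can be sanity-checked with a few lines of arithmetic. The snippet uses only numbers from this article; the dictionary labels are placeholders, since the text does not say which of the two larger models corresponds to which figure.

```python
# Figures quoted in the article.
vibethinker_cost_usd = 7_800
vibethinker_gpu_hours = 3_900
reported_costs_usd = {"lower bound": 294_000, "upper bound": 535_000}

# Implied price per H800 GPU hour.
usd_per_gpu_hour = vibethinker_cost_usd / vibethinker_gpu_hours

# How many times more expensive the comparison models were to post-train.
cost_ratios = {
    label: cost / vibethinker_cost_usd
    for label, cost in reported_costs_usd.items()
}
```

    This works out to $2 per GPU hour and ratios of roughly 38x and 69x, the same order of magnitude as the 30 to 60x range cited.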

    Performance Across Domains

    Despite its small size, VibeThinker-1.5B delivers cross-domain reasoning that outpaces many larger open-source and commercial models:

    Model                 AIME25   LiveCodeBench v6   GPQA-Diamond
    VibeThinker-1.5B      74.4     51.1               46.7
    GPT-OSS-20B-Medium    72.1     54.9               66.0
    Claude Opus 4         69.2     56.6               79.6
    MiniMax M1 (456B)     74.6     62.3               69.2
    DeepSeek R1 (671B)    70.0     65.9               71.5
    Kimi K2 (1.09T)       49.5     53.7               75.1

    VibeThinker was benchmarked against both reasoning-centric models (Magistral, Claude, OpenAI o3-mini) and non-reasoning LLMs (GPT-4.1, Kimi K2, DeepSeek V3). Across structured reasoning benchmarks, the model consistently outperformed non-reasoning models, regardless of size:

    • On AIME24 (math), it beat Kimi K2 (1.09T) by over 10 points (80.3 vs. 69.6).

    • On LiveCodeBench v6, it surpassed Claude Opus 4 (51.1 vs. 47.4).

    • On GPQA, it scored below GPT-4.1 and Claude, but still doubled its base model (from 16.4 to 46.7).

    This supports the authors’ claim that size is not the only path to reasoning capability—with proper training design, smaller models can reach or even exceed the performance of far larger systems in targeted tasks.

    Notably, it achieves parity with models hundreds of times larger on math and code, though it lags behind in general knowledge reasoning (GPQA), where larger models maintain an edge.

    This suggests a potential specialization trade-off: while VibeThinker excels at structured logical tasks, it has less capacity for wide-ranging encyclopedic recall, a known limitation of smaller architectures.

    Guidance for Enterprise Adoption

    The release includes recommended inference settings (temperature = 0.6, top_p = 0.95, max tokens = 40960).
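    As a rough sketch, the recommended settings could be carried around as a small config and mapped onto the keyword names that Hugging Face Transformers' `generate()` accepts. Treating "max tokens" as `max_new_tokens` is an assumption made here, and the helper name is hypothetical.

```python
# Values as listed in the release notes quoted above.
RECOMMENDED_SAMPLING = {"temperature": 0.6, "top_p": 0.95, "max_tokens": 40960}


def to_generate_kwargs(settings: dict) -> dict:
    """Translate the settings into Transformers-style generation kwargs.

    Assumption: "max tokens" bounds newly generated tokens, so it is
    mapped to max_new_tokens rather than to a total sequence length.
    """
    return {
        "do_sample": True,  # temperature/top_p only apply when sampling
        "temperature": settings["temperature"],
        "top_p": settings["top_p"],
        "max_new_tokens": settings["max_tokens"],
    }


generate_kwargs = to_generate_kwargs(RECOMMENDED_SAMPLING)
```

    The resulting dictionary could then be unpacked into a call such as `model.generate(**inputs, **generate_kwargs)`; serving stacks with their own parameter names would need a different mapping.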

    The model is small enough to be deployed on edge devices, including mobile phones and vehicle-embedded systems, while inference costs are estimated to be 20–70x cheaper than with large models.

    This positions VibeThinker-1.5B not just as a research achievement, but as a potential foundation for cost-efficient, locally deployable reasoning systems.

    Weibo’s Strategy and Market Position

    Weibo, launched by Sina Corporation in 2009, remains a cornerstone of China’s social media ecosystem. Often described as China’s version of X (formerly Twitter), the platform blends microblogging, multimedia content, and trending-topic features with a regulatory environment shaped by tight government oversight.

    Despite counting 600 million monthly active users (more than twice that of X), investors are not optimistic about its advertising revenue growth potential in the near term, and Weibo is navigating intensifying competition from video-first platforms like Douyin, which are drawing younger users and increasing time-spent elsewhere.

    In response, Weibo has leaned into creator-economy monetization, live-streaming, and vertical video—adding tools for influencer engagement, e-commerce integration, and richer analytics for brands.

    The platform’s role as a digital public square also makes it a focus of regulatory scrutiny. Chinese authorities continue to apply pressure on issues ranging from content governance to data security. In September 2025, Weibo was among the platforms cited in official warnings, highlighting its ongoing exposure to policy risks.

    Weibo’s push into AI R&D—exemplified by the release of VibeThinker-1.5B—signals a shift in ambition. Beyond being a media platform, Weibo is positioning itself as a player in the next phase of Chinese AI development, using its capital reserves, user behavior data, and in-house research capacity to pursue adjacent technical domains.

    What It Means for Enterprise Technical Decision Makers

    For engineering leaders and enterprise AI teams, VibeThinker’s release has practical implications for everything from orchestration pipelines to cost modeling.

    A 1.5B-parameter model that outperforms 100x larger models on math and programming tasks doesn’t just save compute—it shifts the architectural balance. It enables LLM inference on constrained infrastructure, reduces latency at the edge, and lowers the barrier to entry for applications that otherwise would have required API access to closed, frontier-scale models.

    That matters for enterprise ML leads trying to deploy reasoning-capable agents within existing systems, or for platform owners tasked with integrating LLMs into automated workflows.

    It also speaks to those running reinforcement learning from human feedback (RLHF) pipelines or managing inference optimization across hybrid cloud environments.

    The model’s post-training methodology—particularly its entropy-targeted reinforcement learning approach—offers a roadmap for teams looking to refine smaller checkpoints instead of relying on large-scale pretraining.

    VibeThinker’s benchmark transparency and data decontamination steps also address another emerging priority in enterprise AI: auditability. While its performance on general-knowledge tests still trails large frontier models, its task-specific reliability makes it an attractive candidate for controlled environments where correctness matters more than coverage.

    In short, VibeThinker-1.5B is more than a research milestone: it is a strong candidate for practical enterprise deployment and a source of transferable lessons. It suggests that a new class of compact, reasoning-optimized models is viable for enterprise use cases that were previously the domain of far larger systems. For organizations trying to balance cost, latency, interpretability, and control, it is a welcome addition to the long and growing list of Chinese open source offerings.

  • Meta’s SPICE framework lets AI systems teach themselves to reason

    Researchers at Meta FAIR and the National University of Singapore have developed a new reinforcement learning framework for self-improving AI systems.

    Called Self-Play In Corpus Environments (SPICE), the framework pits two AI agents against each other, creating its own challenges and gradually improving without human supervision.

    While currently a proof-of-concept, this self-play mechanism could provide a basis for future AI systems that can dynamically adapt to their environments, making them more robust against the unpredictability of real-world applications.

    The challenge of self-improving AI

    The goal of self-improving AI is to create systems that can enhance their capabilities by interacting with their environment.

    A common approach is reinforcement learning with verifiable rewards (RLVR), where models are rewarded for providing the correct answers to problems. This is often limited by its reliance on human-curated problem sets and domain-specific reward engineering, which makes it difficult to scale.

    Self-play, where a model improves by competing against itself, is another promising paradigm. But existing self-play methods for language models are often limited by two critical factors:

    1. Factual errors in generated questions and answers compound, leading to a feedback loop of hallucinations.

    2. When the problem generator and solver have information symmetry (i.e., share the same knowledge base), they fail to generate genuinely new challenges and fall into repetitive patterns.

    As the researchers note in their paper, “These systematic empirical failures indicate that self-improvement requires interaction with an external source providing diverse, verifiable feedback, rather than closed-loop pure introspection.”

    How SPICE works

    SPICE is a self-play framework where a single model acts in two distinct roles.

    • A "Challenger" constructs a curriculum of challenging problems from a large corpus of documents.

    • A "Reasoner" then attempts to solve these problems without access to the source documents.

    This setup breaks the information symmetry that limits other self-play methods, as the Reasoner does not have access to the documents and knowledge that the Challenger uses to generate the problems.

    Grounding the tasks in a vast and diverse corpus of documents prevents hallucination by anchoring questions and answers in real-world content. This is important because for AI systems to reliably self-improve, they need external grounding sources. Therefore, LLM agents should learn from interactions with humans and the real world, not just their own outputs, to avoid compounding errors.

    The adversarial dynamic between the two roles creates an automatic curriculum.

    The Challenger is rewarded for generating problems that are both diverse and at the frontier of the Reasoner's capability (not too easy and also not impossible).

    The Reasoner is rewarded for answering correctly. This symbiotic interaction pushes both agents to continuously discover and overcome new challenges. 
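    The two reward signals described above can be sketched as follows. This is a simplified illustration under stated assumptions: exact-match grading, the `4p(1 - p)` shape for the Challenger, and the omission of any diversity term are choices made here, and the paper's actual reward functions may differ.

```python
def reasoner_reward(answer: str, reference: str) -> float:
    """1.0 for a correct answer, else 0.0 (assumed exact match after normalizing)."""
    return 1.0 if answer.strip().lower() == reference.strip().lower() else 0.0


def challenger_reward(reasoner_rewards: list[float]) -> float:
    """Assumed frontier reward: highest when the Reasoner solves about half
    of its attempts, zero when the question is trivial or impossible."""
    if not reasoner_rewards:
        raise ValueError("need at least one Reasoner attempt")
    pass_rate = sum(reasoner_rewards) / len(reasoner_rewards)
    return 4.0 * pass_rate * (1.0 - pass_rate)


# Hypothetical question with reference answer "Paris" and four Reasoner attempts.
attempts = ["Paris", "paris ", "Lyon", "Nice"]
rewards = [reasoner_reward(a, "Paris") for a in attempts]
frontier_score = challenger_reward(rewards)
```

    With half the attempts correct, the Challenger earns the maximum score in this sketch; a question that every attempt solves, or none does, would earn it nothing.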

    Because the system uses raw documents instead of pre-defined question-answer pairs, it can generate diverse task formats, such as multiple-choice and free-form questions.

    This flexibility allows SPICE to be applied to any domain, breaking the bottleneck that has confined previous methods to narrow fields like math and code. It also reduces dependence on expensive human-curated datasets for specialized domains like legal or medical analysis.

    SPICE in action

    The researchers evaluated SPICE on several base models, including Qwen3-4B-Base and OctoThinker-3B-Hybrid-Base.

    They compared its performance against baselines such as the base model with no training, a Reasoner model trained with a fixed "Strong Challenger" (Qwen3-32B-Instruct), and pure self-play methods like R-Zero and Absolute Zero. The evaluation covered a wide range of mathematical and general reasoning benchmarks.

    Across all models, SPICE consistently outperformed the baselines, delivering significant improvements in both mathematical and general reasoning tasks.

    The results show that the reasoning capabilities developed through corpus-grounded self-play transfer broadly across different models, thanks to the diverse external knowledge corpus they used.

    A key finding is that the adversarial dynamic creates an effective automatic curriculum. As training progresses, the Challenger learns to generate increasingly difficult problems.

    In one experiment, the Reasoner's pass rate on a fixed set of problems increased from 55% to 85% over time, showing its improved capabilities.

    Meanwhile, later versions of the Challenger were able to generate questions that dropped the pass rate of an early-stage Reasoner from 55% to 35%, confirming that both roles co-evolve successfully.

    The researchers conclude that this approach presents a paradigm shift in self-improving reasoning methods from “closed-loop self-play that often stagnates due to hallucination drift, to open-ended improvement through interaction with the vast, verifiable knowledge embedded in web document corpora.”

    Currently, the corpus used for SPICE represents human experience captured in text. The ultimate goal is for self-improving systems to generate questions based on interactions with reality, including the physical world, the internet, and human interactions across multiple modalities like video, audio, and sensor data.

  • Baidu just dropped an open-source multimodal AI that it claims beats GPT-5 and Gemini

    Baidu Inc., China's largest search engine company, released a new artificial intelligence model on Monday that its developers claim outperforms competitors from Google and OpenAI on several vision-related benchmarks despite using a fraction of the computing resources typically required for such systems.

    The model, dubbed ERNIE-4.5-VL-28B-A3B-Thinking, is the latest salvo in an escalating competition among technology companies to build AI systems that can understand and reason about images, videos, and documents alongside traditional text — capabilities increasingly critical for enterprise applications ranging from automated document processing to industrial quality control.

    What sets Baidu's release apart is its efficiency: the model activates just 3 billion parameters during operation while maintaining 28 billion total parameters through a sophisticated routing architecture. According to documentation released with the model, this design allows it to match or exceed the performance of much larger competing systems on tasks involving document understanding, chart analysis, and visual reasoning while consuming significantly less computational power and memory.

    "Built upon the powerful ERNIE-4.5-VL-28B-A3B architecture, the newly upgraded ERNIE-4.5-VL-28B-A3B-Thinking achieves a remarkable leap forward in multimodal reasoning capabilities," Baidu wrote in the model's technical documentation on Hugging Face, the AI model repository where the system was released.

    The company said the model underwent "an extensive mid-training phase" that incorporated "a vast and highly diverse corpus of premium visual-language reasoning data," dramatically boosting its ability to align visual and textual information semantically.

    How the model mimics human visual problem-solving through dynamic image analysis

    Perhaps the model's most distinctive feature is what Baidu calls "Thinking with Images" — a capability that allows the AI to dynamically zoom in and out of images to examine fine-grained details, mimicking how humans approach visual problem-solving tasks.

    "The model thinks like a human, capable of freely zooming in and out of images to grasp every detail and uncover all information," according to the model card. When paired with tools like image search, Baidu claims this feature "dramatically elevates the model's ability to process fine-grained details and handle long-tail visual knowledge."

    This approach marks a departure from traditional vision-language models, which typically process images at a fixed resolution. By allowing dynamic image examination, the system can theoretically handle scenarios requiring both broad context and granular detail—such as analyzing complex technical diagrams or detecting subtle defects in manufacturing quality control.

    The model also supports what Baidu describes as enhanced "visual grounding" capabilities with "more precise grounding and flexible instruction execution, easily triggering grounding functions in complex industrial scenarios," suggesting potential applications in robotics, warehouse automation, and other settings where AI systems must identify and locate specific objects in visual scenes.

    Baidu's performance claims draw scrutiny as independent testing remains pending

    Baidu's assertion that the model outperforms Google's Gemini 2.5 Pro and OpenAI's GPT-5-High on various document and chart understanding benchmarks has drawn attention across social media, though independent verification of these claims remains pending.

    The company released the model under the permissive Apache 2.0 license, allowing unrestricted commercial use—a strategic decision that contrasts with the more restrictive licensing approaches of some competitors and could accelerate enterprise adoption.

    "Apache 2.0 is smart," wrote one X user responding to Baidu's announcement, highlighting the competitive advantage of open licensing in the enterprise market.

    According to Baidu's documentation, the model demonstrates six core capabilities beyond traditional text processing. In visual reasoning, the system can perform what Baidu describes as "multi-step reasoning, chart analysis, and causal reasoning capabilities in complex visual tasks," aided by what the company characterizes as "large-scale reinforcement learning." 

    For STEM problem solving, Baidu claims that "leveraging its powerful visual abilities, the model achieves a leap in performance on STEM tasks like solving problems from photos." The visual grounding capability allows the model to identify and locate objects within images with what Baidu characterizes as industrial-grade precision. Through tool integration, the system can invoke external functions including image search capabilities to access information beyond its training data.

    For video understanding, Baidu claims the model possesses "outstanding temporal awareness and event localization abilities, accurately identifying content changes across different time segments in a video." Finally, the thinking with images feature enables the dynamic zoom functionality that distinguishes this model from competitors.

    Inside the mixture-of-experts architecture that powers efficient multimodal processing

    Under the hood, ERNIE-4.5-VL-28B-A3B-Thinking employs a Mixture-of-Experts (MoE) architecture — a design pattern that has become increasingly popular for building efficient large-scale AI systems. Rather than activating all 28 billion parameters for every task, the model uses a routing mechanism to selectively activate only the 3 billion parameters most relevant to each specific input.
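    The routing idea can be illustrated with a toy top-k gate. This is a generic Mixture-of-Experts sketch, not Baidu's implementation: the number of experts, the value of k, and the scalar "experts" are hypothetical, and in a real model the router scores come from a learned layer.

```python
from math import exp


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    weights = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, weights))


def moe_forward(x, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))


# Four toy experts; only two are evaluated for this input.
experts = [lambda x: x + 1, lambda x: x * 2, lambda x: -x, lambda x: x ** 2]
logits = [0.1, 2.0, -1.0, 2.0]  # hypothetical router scores for one token
selected = route_top_k(logits, 2)
y = moe_forward(3.0, experts, logits, k=2)
```

    Experts 1 and 3 tie for the highest score, so each contributes half of the output while the other two are never called, which is where the compute savings come from.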

    This approach offers substantial practical advantages for enterprise deployments. According to Baidu's documentation, the model can run on a single 80GB GPU — hardware readily available in many corporate data centers — making it significantly more accessible than competing systems that may require multiple high-end accelerators.
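    A back-of-the-envelope estimate shows why a single 80GB GPU is plausible. The sketch assumes 16-bit weights and counts 1 GB as 10^9 bytes; it ignores activations, the KV cache, and any framework overhead, none of which the article quantifies.

```python
def weight_memory_gb(total_params: float, bytes_per_param: int) -> float:
    """Memory needed just to hold the weights, in GB (10**9 bytes)."""
    return total_params * bytes_per_param / 1e9


TOTAL_PARAMS = 28e9   # every expert must be resident in memory
ACTIVE_PARAMS = 3e9   # parameters actually used per input

bf16_weights_gb = weight_memory_gb(TOTAL_PARAMS, 2)  # assumed 2 bytes per weight
headroom_gb = 80 - bf16_weights_gb                   # left for everything else
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
```

    Under these assumptions the weights take about 56 GB, leaving roughly 24 GB of headroom, while only about 11% of the parameters do work for any given input. The routing reduces compute per input, not the memory needed to hold the model.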

    The technical documentation reveals that Baidu employed several advanced training techniques to achieve the model's capabilities. The company used "cutting-edge multimodal reinforcement learning techniques on verifiable tasks, integrating GSPO and IcePop strategies to stabilize MoE training combined with dynamic difficulty sampling for exceptional learning efficiency."

    Baidu also notes that in response to "strong community demand," the company "significantly strengthened the model's grounding performance with improved instruction-following capabilities."

    The new model fits into Baidu's ambitious multimodal AI ecosystem

    The new release is one component of Baidu's broader ERNIE 4.5 model family, which the company unveiled in June 2025. That family comprises 10 distinct variants, including Mixture-of-Experts models ranging from the flagship ERNIE-4.5-VL-424B-A47B with 424 billion total parameters down to a compact 0.3 billion parameter dense model.

    According to Baidu's technical report on the ERNIE 4.5 family, the models incorporate "a novel heterogeneous modality structure, which supports parameter sharing across modalities while also allowing dedicated parameters for each individual modality."

    This architectural choice addresses a longstanding challenge in multimodal AI development: training systems on both visual and textual data without one modality degrading the performance of the other. Baidu claims this design "has the advantage to enhance multimodal understanding without compromising, and even improving, performance on text-related tasks."

    The company reported achieving 47% Model FLOPs Utilization (MFU) — a measure of training efficiency — during pre-training of its largest ERNIE 4.5 language model, using the PaddlePaddle deep learning framework developed in-house.
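    For readers unfamiliar with the metric, MFU is the ratio of the floating-point throughput a training run actually achieves to the theoretical peak of its hardware. The sketch below uses the common approximation of about 6 FLOPs per parameter per training token; Baidu's exact accounting is not given in the article, and all numbers in the example are hypothetical.

```python
def model_flops_utilization(
    params: float,
    tokens_per_second: float,
    num_accelerators: int,
    peak_flops_per_accelerator: float,
) -> float:
    """Achieved training FLOPs per second divided by hardware peak.

    Assumption: forward plus backward costs about 6 FLOPs per parameter
    per token. For an MoE model, `params` would be the active count.
    """
    achieved = 6.0 * params * tokens_per_second
    peak = num_accelerators * peak_flops_per_accelerator
    return achieved / peak


# Hypothetical run: 1B-parameter model, 50,000 tokens/s, six 100-TFLOP/s chips.
example_mfu = model_flops_utilization(1e9, 50_000, 6, 1e14)
```

    The hypothetical run above lands at 50% utilization, in the same range as the 47% figure Baidu reports.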

    Comprehensive developer tools aim to simplify enterprise deployment and integration

    For organizations looking to deploy the model, Baidu has released a comprehensive suite of development tools through ERNIEKit, what the company describes as an "industrial-grade training and compression development toolkit."

    The model offers full compatibility with popular open-source frameworks including Hugging Face Transformers, vLLM (a high-performance inference engine), and Baidu's own FastDeploy toolkit. This multi-platform support could prove critical for enterprise adoption, allowing organizations to integrate the model into existing AI infrastructure without wholesale platform changes.

    Sample code released by Baidu shows a relatively straightforward implementation path. Using the Transformers library, developers can load and run the model with approximately 30 lines of Python code, according to the documentation on Hugging Face.

    For production deployments requiring higher throughput, Baidu provides vLLM integration with specialized support for the model's "reasoning-parser" and "tool-call-parser" capabilities — features that enable the dynamic image examination and external tool integration that distinguish this model from earlier systems.

    The company also offers FastDeploy, its in-house inference toolkit, which Baidu claims delivers "production-ready, easy-to-use multi-hardware deployment solutions" with support for various quantization schemes that can reduce memory requirements and increase inference speed.

    Why this release matters for the enterprise AI market at a critical inflection point

    The release comes at a pivotal moment in the enterprise AI market. As organizations move beyond experimental chatbot deployments toward production systems that process documents, analyze visual data, and automate complex workflows, demand for capable and cost-effective vision-language models has intensified.

    Several enterprise use cases appear particularly well-suited to the model's capabilities. Document processing — extracting information from invoices, contracts, and forms — represents a massive market where accurate chart and table understanding directly translates to cost savings through automation. Manufacturing quality control, where AI systems must detect visual defects, could benefit from the model's grounding capabilities. Customer service applications that handle images from users could leverage the multi-step visual reasoning.

    The model's efficiency profile may prove especially attractive to mid-market organizations and startups that lack the computing budgets of large technology companies. By fitting on a single 80GB GPU — hardware costing roughly $10,000 to $30,000 depending on the specific model — the system becomes economically viable for a much broader range of organizations than models requiring multi-GPU setups costing hundreds of thousands of dollars.

    "With all these new models, where's the best place to actually build and scale? Access to compute is everything," wrote one X user in response to Baidu's announcement, highlighting the persistent infrastructure challenges facing organizations attempting to deploy advanced AI systems.

    The Apache 2.0 licensing further lowers barriers to adoption. Unlike models released under more restrictive licenses that may limit commercial use or require revenue sharing, organizations can deploy ERNIE-4.5-VL-28B-A3B-Thinking in production applications without ongoing licensing fees or usage restrictions.

    Competition intensifies as Chinese tech giant takes aim at Google and OpenAI

    Baidu's release intensifies competition in the vision-language model space, where Google, OpenAI, Anthropic, and Chinese companies including Alibaba and ByteDance have all released capable systems in recent months.

    The company's performance claims — if validated by independent testing — would represent a significant achievement. Google's Gemini 2.5 Pro and OpenAI's GPT-5-High are substantially larger models backed by the deep resources of two of the world's most valuable technology companies. That a more compact, openly available model could match or exceed their performance on specific tasks would suggest the field is advancing more rapidly than some analysts anticipated.

    "Impressive that ERNIE is outperforming Gemini 2.5 Pro," wrote one social media commenter, expressing surprise at the claimed results.

    However, some observers counseled caution about benchmark comparisons. "It's fascinating to see how multimodal models are evolving, especially with features like 'Thinking with Images,'" wrote one X user. "That said, I'm curious if ERNIE-4.5's edge over competitors like Gemini-2.5-Pro and GPT-5-High primarily lies in specific use cases like document and chart" understanding rather than general-purpose vision tasks.

    Industry analysts note that benchmark performance often fails to capture real-world behavior across the diverse scenarios enterprises encounter. A model that excels at document understanding may struggle with creative visual tasks or real-time video analysis. Organizations evaluating these systems typically conduct extensive internal testing on representative workloads before committing to production deployments.

    Technical limitations and infrastructure requirements that enterprises must consider

    Despite its capabilities, the model faces several technical challenges common to large vision-language systems. The minimum requirement of 80GB of GPU memory, while more accessible than some competitors, still represents a significant infrastructure investment. Organizations without existing GPU infrastructure would need to procure specialized hardware or rely on cloud computing services, introducing ongoing operational costs.

    The model's context window — the amount of text and visual information it can process simultaneously — is listed as 128K tokens in Baidu's documentation. While substantial, this may prove limiting for some document processing scenarios involving very long technical manuals or extensive video content.

    Questions also remain about the model's behavior on adversarial inputs, out-of-distribution data, and edge cases. Baidu's documentation does not provide detailed information about safety testing, bias mitigation, or failure modes — considerations increasingly important for enterprise deployments where errors could have financial or safety implications.

    What technical decision-makers need to evaluate beyond the benchmark numbers

    For technical decision-makers evaluating the model, several implementation factors warrant consideration beyond raw performance metrics.

    The model's MoE architecture, while efficient during inference, adds complexity to deployment and optimization. Organizations must ensure their infrastructure can properly route inputs to the appropriate expert subnetworks — a capability not universally supported across all deployment platforms.

    The "Thinking with Images" feature, while innovative, requires integration with image manipulation tools to achieve its full potential. Baidu's documentation suggests this capability works best "when paired with tools like image zooming and image search," implying that organizations may need to build additional infrastructure to fully leverage this functionality.

    The model's video understanding capabilities, while highlighted in marketing materials, come with practical constraints. Processing video requires substantially more computational resources than static images, and the documentation does not specify maximum video length or optimal frame rates.

    Organizations considering deployment should also evaluate Baidu's ongoing commitment to the model. Open-source AI models require continuing maintenance, security updates, and potential retraining as data distributions shift over time. While the Apache 2.0 license ensures the model remains available, future improvements and support depend on Baidu's strategic priorities.

    Developer community responds with enthusiasm tempered by practical requests

    Early response from the AI research and development community has been cautiously optimistic. Developers have requested versions of the model in additional formats including GGUF (a quantization format popular for local deployment) and MNN (a mobile neural network framework), suggesting interest in running the system on resource-constrained devices.

    "Release MNN and GGUF so I can run it on my phone," wrote one developer, highlighting demand for mobile deployment options.

    Other developers praised Baidu's technical choices while requesting additional resources. "Fantastic model! Did you use discoveries from PaddleOCR?" asked one user, referencing Baidu's open-source optical character recognition toolkit.

    The model's lengthy name—ERNIE-4.5-VL-28B-A3B-Thinking—drew lighthearted commentary. "ERNIE-4.5-VL-28B-A3B-Thinking might be the longest model name in history," joked one observer. "But hey, if you're outperforming Gemini-2.5-Pro with only 3B active params, you've earned the right to a dramatic name!"

    Baidu plans to showcase the ERNIE lineup during its Baidu World 2025 conference on November 13, where the company is expected to provide additional details about the model's development, performance validation, and future roadmap.

    The release marks a strategic move by Baidu to establish itself as a major player in the global AI infrastructure market. While Chinese AI companies have historically focused primarily on domestic markets, the open-source release under a permissive license signals ambitions to compete internationally with Western AI giants.

    For enterprises, the release adds another capable option to a rapidly expanding menu of AI models. Organizations no longer face a binary choice between building proprietary systems or licensing closed-source models from a handful of vendors. The proliferation of capable open-source alternatives like ERNIE-4.5-VL-28B-A3B-Thinking is reshaping the economics of AI deployment and accelerating adoption across industries.

    Whether the model delivers on its performance promises in real-world deployments remains to be seen. But for organizations seeking powerful, cost-effective tools for visual understanding and reasoning, one thing is certain. As one developer succinctly summarized: "Open source plus commercial use equals chef's kiss. Baidu not playing around."

  • Chronosphere takes on Datadog with AI that explains itself, not just outages

    Chronosphere, a New York-based observability startup valued at $1.6 billion, announced Monday it will launch AI-Guided Troubleshooting capabilities designed to help engineers diagnose and fix production software failures — a problem that has intensified as artificial intelligence tools accelerate code creation while making systems harder to debug.

    The new features combine AI-driven analysis with what Chronosphere calls a Temporal Knowledge Graph, a continuously updated map of an organization's services, infrastructure dependencies, and system changes over time. The technology aims to address a mounting challenge in enterprise software: developers are writing code faster than ever with AI assistance, but troubleshooting remains largely manual, creating bottlenecks when applications fail.

    "For AI to be effective in observability, it needs more than pattern recognition and summarization," said Martin Mao, Chronosphere's CEO and co-founder, in an exclusive interview with VentureBeat. "Chronosphere has spent years building the data foundation and analytical depth needed for AI to actually help engineers. With our Temporal Knowledge Graph and advanced analytics capabilities, we're giving AI the understanding it needs to make observability truly intelligent — and giving engineers the confidence to trust its guidance."

    The announcement comes as the observability market, which covers software that monitors complex cloud applications, faces mounting pressure to justify escalating costs. Enterprise log data volumes have grown 250% year-over-year, according to Chronosphere's own research, while a study from MIT and the University of Pennsylvania found that generative AI has spurred a 13.5% increase in weekly code commits, signifying faster development velocity but also greater system complexity.

    AI writes code 13% faster, but debugging stays stubbornly manual

    Despite advances in automated code generation, debugging production failures remains stubbornly manual. When a major e-commerce site slows during checkout or a banking app fails to process transactions, engineers must sift through millions of data points — server logs, application traces, infrastructure metrics, recent code deployments — to identify root causes.

    Chronosphere's answer is what it calls AI-Guided Troubleshooting, built on four core capabilities: automated "Suggestions" that propose investigation paths backed by data; the Temporal Knowledge Graph that maps system relationships and changes; Investigation Notebooks that document each troubleshooting step for future reference; and natural language query building.

    Mao explained the Temporal Knowledge Graph in practical terms: "It's a living, time-aware model of your system. It stitches together telemetry—metrics, traces, logs—infrastructure context, change events like deploys and feature flags, and even human input like notes and runbooks into a single, queryable map that updates as your system evolves."

    This differs fundamentally from the service dependency maps offered by competitors like Datadog, Dynatrace, and Splunk, Mao argued. "It adds time, not just topology," he said. "It tracks how services and dependencies change over time and connects those changes to incidents—what changed and why. Many tools rely on standardized integrations; our graph goes a step further to normalize custom, non-standard telemetry so application-specific signals aren't a blind spot."

    Why Chronosphere shows its work instead of making automatic decisions

    Unlike purely automated systems, Chronosphere designed its AI features to keep engineers in the driver's seat—a deliberate choice meant to address what Mao calls the "confident-but-wrong guidance" problem plaguing early AI observability tools.

    "'Keeping engineers in control' means the AI shows its work, proposes next steps, and lets engineers verify or override — never auto-deciding behind the scenes," Mao explained. "Every Suggestion includes the evidence—timing, dependencies, error patterns — and a 'Why was this suggested?' view, so they can inspect what was checked and ruled out before acting."

    He walked through a concrete example: "An SLO [service level objective] alert fires on Checkout. Chronosphere immediately surfaces a ranked Suggestion: errors appear to have started in the dependent Payment service. An engineer can click Investigate to see the charts and reasoning and, if it holds up, choose to dig deeper. As they steer into Payment, the system adapts with new Suggestions scoped to that service—all from one view, no tab-hopping."

    In this scenario, the engineer asks "what changed?" and the system pulls in change events. "Our Notebook capability makes the causal chain plain: a feature-flag update preceded pod memory exhaustion in Payment; Checkout's spike is a downstream symptom," Mao said. "They can decide to roll back the flag. That whole path — suggestions followed, evidence viewed, conclusions—is captured automatically in an Investigation Notebook, and the outcome feeds the Temporal Knowledge Graph so similar future incidents are faster to resolve."
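
    The "what changed?" step in Mao's example can be sketched as a query over time-stamped change events and a dependency map. This is a toy illustration, not Chronosphere's implementation: the `ChangeEvent` type, the `what_changed` function, the 15-minute window, and the sample data are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ChangeEvent:
    service: str
    kind: str       # e.g. "feature_flag" or "deploy"
    timestamp: int  # seconds since some epoch (hypothetical units)

def what_changed(events, dependencies, service, incident_time, window=900):
    """Return change events on `service` or its transitive dependencies
    that happened within `window` seconds before the incident, newest first."""
    # Walk the dependency graph to collect every service in scope.
    in_scope, stack = set(), [service]
    while stack:
        current = stack.pop()
        if current not in in_scope:
            in_scope.add(current)
            stack.extend(dependencies.get(current, []))
    recent = [
        e for e in events
        if e.service in in_scope
        and incident_time - window <= e.timestamp <= incident_time
    ]
    return sorted(recent, key=lambda e: e.timestamp, reverse=True)

# Hypothetical data mirroring the Checkout -> Payment example.
dependencies = {"checkout": ["payment"], "payment": []}
events = [
    ChangeEvent("payment", "feature_flag", 1_000),
    ChangeEvent("search", "deploy", 1_100),
    ChangeEvent("checkout", "deploy", 100),
]
suspects = what_changed(events, dependencies, "checkout", incident_time=1_500)
```

    Here the only suspect is the feature-flag change on the Payment service: the Search deploy is outside Checkout's dependency chain, and the older Checkout deploy falls outside the time window. Combining topology with time is the property Mao describes.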

    How a $1.6 billion startup takes on Datadog, Dynatrace, and Splunk

    Chronosphere enters an increasingly crowded field. Datadog, the publicly traded observability leader valued at over $40 billion, has introduced its own AI-powered troubleshooting features. So have Dynatrace and Splunk. All three offer comprehensive "all-in-one" platforms that promise single-pane-of-glass visibility.

    Mao distinguished Chronosphere's approach on technical grounds. "Early 'AI for observability' leaned heavily on pattern-spotting and summarization, which tends to break down during real incidents," he said. "These approaches often stop at correlating anomalies or producing fluent explanations without the deeper analysis and causal reasoning observability leaders need. They can feel impressive in demos but disappoint in production—they summarize signals rather than explain cause and effect."

    A specific technical gap, he argued, involves custom application telemetry. "Most platforms reason over standardized integrations—Kubernetes, common cloud services, popular databases—ignoring the most telling clues that live in custom app telemetry," Mao said. "With an incomplete picture, large language models will 'fill in the gaps,' producing confident-but-wrong guidance that sends teams down dead ends."

    Chronosphere's competitive positioning received validation in July when Gartner named it a Leader in the 2025 Magic Quadrant for Observability Platforms for the second consecutive year. The firm was recognized based on both "Completeness of Vision" and "Ability to Execute." In December 2024, Chronosphere also tied for the highest overall rating among recognized vendors in Gartner Peer Insights' "Voice of the Customer" report, scoring 4.7 out of 5 based on 70 reviews.

    Yet the company faces intensifying competition for high-profile customers. UBS analysts noted in July that OpenAI now runs both Datadog and Chronosphere side-by-side to monitor GPU workloads, suggesting the AI leader is evaluating alternatives. While UBS maintained its buy rating on Datadog, the analysts warned that growing Chronosphere usage could pressure Datadog's pricing power.

    Inside the 84% cost reduction claims—and what CIOs should actually measure

    Beyond technical capabilities, Chronosphere has built its market position on cost control — a critical factor as observability spending spirals. The company claims its platform reduces data volumes and associated costs by 84% on average while cutting critical incidents by up to 75%.

    When pressed for specific customer examples with real numbers, Mao pointed to several case studies. "Robinhood has seen a 5x improvement in reliability and a 4x improvement in Mean Time to Detection," he said. "DoorDash used Chronosphere to improve governance and standardize monitoring practices. Astronomer achieved over 85% cost reduction by shaping data on ingest, and Affirm scaled their load 10x during a Black Friday event with no issues, highlighting the platform's reliability under extreme conditions."

    The cost argument matters because, as Paul Nashawaty, principal analyst at CUBE Research, noted when Chronosphere launched its Logs 2.0 product in June: "Organizations are drowning in telemetry data, with over 70% of observability spend going toward storing logs that are never queried."

    For CIOs fatigued by "AI-powered" announcements, Mao acknowledged skepticism is warranted. "The way to cut through it is to test whether the AI shortens incidents, reduces toil, and builds reusable knowledge in your own environment, not in a demo," he advised. He recommended CIOs evaluate three factors: transparency and control (does the system show its reasoning?), coverage of custom telemetry (can it handle non-standardized data?), and manual toil avoided (how many ad-hoc queries and tool-switches are eliminated?).

    Why Chronosphere partners with five vendors instead of building everything itself

    Alongside the AI troubleshooting announcement, Chronosphere revealed a new Partner Program integrating five specialized vendors to fill gaps in its platform: Arize for large language model monitoring, Embrace for real user monitoring, Polar Signals for continuous profiling, Checkly for synthetic monitoring, and Rootly for incident management.

    The strategy represents a deliberate bet against the all-in-one platforms dominating the market. "While an all-in-one platform may be sufficient for smaller organizations, global enterprises demand best-in-class depth across each domain," Mao said. "This is what drove us to build our Partner Program and invest in seamless integrations with leading providers—so our customers can operate with confidence and clarity at every layer of observability."

    Noah Smolen, head of partnerships at Arize, said the collaboration addresses a specific enterprise need. "With a wide array of Fortune 500 customers, we understand the high bar needed to ensure AI agent systems are ready to deploy and stay incident-free, especially given the pace of AI adoption in the enterprise," Smolen said. "Our partnership with Chronosphere comes at a time when an integrated purpose-built cloud-native and AI-observability suite solves a huge pain point for forward-thinking C-suite leaders who demand the very best across their entire observability stack."

    Similarly, JJ Tang, CEO and founder of Rootly, emphasized the incident resolution benefits. "Incidents hinder innovation and revenue, and the challenge lies in sifting through vast amounts of observability data, mobilizing teams, and resolving issues quickly," Tang said. "Integrating Chronosphere with Rootly allows engineers to collaborate with context and resolve issues faster within their existing communication channels, drastically reducing time to resolution and ultimately improving reliability—78% plus decreases in repeat Sev0 and Sev1 incidents."

    When asked how total costs compare when customers use multiple partner contracts versus a single platform, Mao acknowledged the current complexity. "At present, mutual customers typically maintain separate contracts unless they engage through a services partner or system integrator," he said. However, he argued the economics still favor the composable approach: "Our combined technologies deliver exceptional value—in most circumstances at just a fraction of the price of a single-platform solution. Beyond the savings, customers gain a richer, more unified observability experience that unlocks deeper insights and greater efficiency, especially for large-scale environments."

    The company plans to streamline this over time. "As the ISV program matures, we're focused on delivering a more streamlined experience by transitioning to a single, unified contract that simplifies procurement and accelerates time to value," Mao said.

    How two Uber engineers turned Halloween outages into a billion-dollar startup

    Chronosphere's origins trace to 2019, when Mao and co-founder Rob Skillington left Uber after building the ride-hailing giant's internal observability platform. At Uber, Mao's team had faced a crisis: the company's in-house tools would fail on its two busiest nights — Halloween and New Year's Eve — cutting off visibility into whether customers could request rides or drivers could locate passengers.

    The solution they built at Uber used open-source software and ultimately allowed the company to operate without outages, even during high-volume events. But the broader market insight came at an industry conference in December 2018, when major cloud providers threw their weight behind Kubernetes, Google's container orchestration technology.

    "This meant that most technology architectures were eventually going to look like Uber's," Mao recalled in an August 2024 profile by Greylock Partners, Chronosphere's lead investor. "And that meant every company, not just a few big tech companies and the Walmarts of the world, would have the exact same problem we had solved at Uber."

    Chronosphere has since raised more than $343 million in funding across multiple rounds led by Greylock, Lux Capital, General Atlantic, Addition, and Founders Fund. The company operates as a remote-first organization with offices in New York, Austin, Boston, San Francisco, and Seattle, employing approximately 299 people according to LinkedIn data.

    The company's customer base includes DoorDash, Zillow, Snap, Robinhood, and Affirm — predominantly high-growth technology companies operating cloud-native, Kubernetes-based infrastructures at massive scale.

    What's available now—and what enterprises can expect in 2026

    Chronosphere's AI-Guided Troubleshooting capabilities, including Suggestions and Investigation Notebooks, entered limited availability Monday with select customers. The company plans full general availability in 2026. The Model Context Protocol (MCP) Server, which enables engineers to integrate Chronosphere directly into internal AI workflows and query observability data through AI-enabled development environments, is available immediately for all Chronosphere customers.

    The phased rollout reflects the company's cautious approach to deploying AI in production environments where mistakes carry real costs. By gathering feedback from early adopters before broad release, Chronosphere aims to refine its guidance algorithms and validate that its suggestions genuinely accelerate troubleshooting rather than simply generating impressive demonstrations.

    The longer game, however, extends beyond individual product features. Chronosphere's dual bet — on transparent AI that shows its reasoning and on a partner ecosystem rather than all-in-one integration — amounts to a fundamental thesis about how enterprise observability will evolve as systems grow more complex.

    If that thesis proves correct, the company that solves observability for the AI age won't be the one with the most automated black box. It will be the one that earns engineers' trust by explaining what it knows, admitting what it doesn't, and letting humans make the final call. In an industry drowning in data and promised silver bullets, Chronosphere is wagering that showing your work still matters — even when AI is doing the math.

  • Meta returns to open source AI with Omnilingual ASR models that can transcribe 1,600+ languages natively

    Meta has just released a new multilingual automatic speech recognition (ASR) system supporting 1,600+ languages — dwarfing OpenAI’s open source Whisper model, which supports just 99.

    Its architecture also allows developers to extend that support to thousands more. Through a feature called zero-shot in-context learning, users can provide a few paired examples of audio and text in a new language at inference time, enabling the model to transcribe additional utterances in that language without any retraining.

    In practice, this expands potential coverage to more than 5,400 languages — roughly every spoken language with a known script.

    It’s a shift from static model capabilities to a flexible framework that communities can adapt themselves. So while the 1,600 languages reflect official training coverage, the broader figure represents Omnilingual ASR’s capacity to generalize on demand, making it the most extensible speech recognition system released to date.

    Best of all: it's been open sourced under a plain Apache 2.0 license — not a restrictive, quasi open-source Llama license like the company's prior releases, which limited use by larger enterprises unless they paid licensing fees — meaning researchers and developers are free to take and implement it right away, for free, without restrictions, even in commercial and enterprise-grade projects!

    Released on November 10 on Meta's website and GitHub, along with a demo space on Hugging Face and a technical paper, Meta’s Omnilingual ASR suite includes a family of speech recognition models, a 7-billion parameter multilingual audio representation model, and a massive speech corpus spanning over 350 previously underserved languages.

    All resources are freely available under open licenses, and the models support speech-to-text transcription out of the box.

    “By open sourcing these models and dataset, we aim to break down language barriers, expand digital access, and empower communities worldwide,” Meta posted on its @AIatMeta account on X.

    Designed for Speech-to-Text Transcription

    At its core, Omnilingual ASR is a speech-to-text system.

    The models are trained to convert spoken language into written text, supporting applications like voice assistants, transcription tools, subtitles, oral archive digitization, and accessibility features for low-resource languages.

    Unlike earlier ASR models that required extensive labeled training data, Omnilingual ASR includes a zero-shot variant.

    This version can transcribe languages it has never seen before—using just a few paired examples of audio and corresponding text.

    This lowers the barrier for adding new or endangered languages dramatically, removing the need for large corpora or retraining.

    Model Family and Technical Design

    The Omnilingual ASR suite includes multiple model families trained on more than 4.3 million hours of audio from 1,600+ languages:

    • wav2vec 2.0 models for self-supervised speech representation learning (300M–7B parameters)

    • CTC-based ASR models for efficient supervised transcription

    • LLM-ASR models combining a speech encoder with a Transformer-based text decoder for state-of-the-art transcription

    • LLM-ZeroShot ASR model, enabling inference-time adaptation to unseen languages

    All models follow an encoder–decoder design: raw audio is converted into a language-agnostic representation, then decoded into written text.

    Why the Scale Matters

    While Whisper and similar models have advanced ASR capabilities for global languages, they fall short on the long tail of human linguistic diversity. Whisper supports 99 languages. Meta’s system:

    • Directly supports 1,600+ languages

    • Can generalize to 5,400+ languages using in-context learning

    • Achieves character error rates (CER) under 10% in 78% of supported languages

    Among those supported are more than 500 languages never previously covered by any ASR model, according to Meta’s research paper.

    This expansion opens new possibilities for communities whose languages are often excluded from digital tools.

    Background: Meta’s AI Overhaul and a Rebound from Llama 4

    The release of Omnilingual ASR arrives at a pivotal moment in Meta’s AI strategy, following a year marked by organizational turbulence, leadership changes, and uneven product execution.

    Omnilingual ASR is the first major open-source model release since the rollout of Llama 4, Meta’s latest large language model, which debuted in April 2025 to mixed and ultimately poor reviews, with scant enterprise adoption compared to Chinese open source model competitors.

    The failure led Meta founder and CEO Mark Zuckerberg to appoint Alexandr Wang, co-founder and prior CEO of AI data supplier Scale AI, as Chief AI Officer, and embark on an extensive and costly hiring spree that shocked the AI and business communities with eye-watering pay packages for top AI researchers.

    In contrast, Omnilingual ASR represents a strategic and reputational reset. It returns Meta to a domain where the company has historically led — multilingual AI — and offers a truly extensible, community-oriented stack with minimal barriers to entry.

    The system’s support for 1,600+ languages and its extensibility to over 5,000 more via zero-shot in-context learning reassert Meta’s engineering credibility in language technology.

    Importantly, it does so through a free and permissively licensed release, under Apache 2.0, with transparent dataset sourcing and reproducible training protocols.

    This shift aligns with broader themes in Meta’s 2025 strategy. The company has refocused its narrative around a “personal superintelligence” vision, investing heavily in infrastructure (including a September release of custom AI accelerators and Arm-based inference stacks) while downplaying the metaverse in favor of foundational AI capabilities. The return to public training data in Europe after a regulatory pause also underscores its intention to compete globally, despite privacy scrutiny.

    Omnilingual ASR, then, is more than a model release — it’s a calculated move to reassert control of the narrative: from the fragmented rollout of Llama 4 to a high-utility, research-grounded contribution that aligns with Meta’s long-term AI platform strategy.

    Community-Centered Dataset Collection

    To achieve this scale, Meta partnered with researchers and community organizations in Africa, Asia, and elsewhere to create the Omnilingual ASR Corpus, a 3,350-hour dataset across 348 low-resource languages. Contributors were compensated local speakers, and recordings were gathered in collaboration with groups like:

    • African Next Voices: A Gates Foundation–supported consortium including Maseno University (Kenya), University of Pretoria, and Data Science Nigeria

    • Mozilla Foundation’s Common Voice, supported through the Open Multilingual Speech Fund

    • Lanfrica / NaijaVoices, which created data for 11 African languages including Igala, Serer, and Urhobo

    The data collection focused on natural, unscripted speech. Prompts were designed to be culturally relevant and open-ended, such as “Is it better to have a few close friends or many casual acquaintances? Why?” Transcriptions used established writing systems, with quality assurance built into every step.

    Performance and Hardware Considerations

    The largest model in the suite, the omniASR_LLM_7B, requires ~17GB of GPU memory for inference, making it suitable for deployment on high-end hardware. Smaller models (300M–1B) can run on lower-power devices and deliver real-time transcription speeds.
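
    A back-of-the-envelope calculation shows why a figure in this range is plausible. The source does not break down the ~17GB number, so this sketch assumes roughly 7 billion parameters stored as 16-bit values. Both the precision and the exact parameter count are assumptions.

```python
def weight_memory_gb(num_params, bytes_per_param=2):
    """Approximate memory for model weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9

# Assumption: roughly 7 billion parameters stored as 16-bit values.
weights_gb = weight_memory_gb(7e9, bytes_per_param=2)

# Under this assumption, whatever remains of the reported ~17 GB would
# cover activations, caches, and runtime overhead.
overhead_gb = 17 - weights_gb
```

    Under these assumptions the weights alone take about 14GB, leaving roughly 3GB for everything else. Actual usage will vary with batch size, audio length, and runtime.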

    Performance benchmarks show strong results even in low-resource scenarios:

    • CER <10% in 95% of high-resource and mid-resource languages

    • CER <10% in 36% of low-resource languages

    • Robustness in noisy conditions and unseen domains, especially with fine-tuning
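
    For readers unfamiliar with the metric, character error rate is commonly computed as the character-level edit distance between a reference transcript and the model's output, divided by the reference length. The sketch below follows that common definition. Whether Meta applies text normalization before scoring is not stated here, so treat this as a generic illustration rather than Meta's evaluation code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def cer(reference, hypothesis):
    """Character error rate: edits needed divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)

# Made-up example: one substitution and one deletion over 11 characters.
score = cer("hello world", "hallo wrld")
```

    A CER under 10% means fewer than one character in ten needs correcting on average to match the reference.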

    The zero-shot system, omniASR_LLM_7B_ZS, can transcribe new languages with minimal setup. Users provide a few sample audio–text pairs, and the model generates transcriptions for new utterances in the same language.

    Open Access and Developer Tooling

    All models and the dataset are licensed under permissive terms.

    Installation is supported via PyPI and uv:

    pip install omnilingual-asr

    Meta also provides:

    • A HuggingFace dataset integration

    • Pre-built inference pipelines

    • Language-code conditioning for improved accuracy

    Developers can view the full list of supported languages using the API:

    from omnilingual_asr.models.wav2vec2_llama.lang_ids import supported_langs

    print(len(supported_langs))
    print(supported_langs)

    Broader Implications

    Omnilingual ASR reframes language coverage in ASR from a fixed list to an extensible framework. It enables:

    • Community-driven inclusion of underrepresented languages

    • Digital access for oral and endangered languages

    • Research on speech tech in linguistically diverse contexts

    Crucially, Meta emphasizes ethical considerations throughout—advocating for open-source participation and collaboration with native-speaking communities.

    “No model can ever anticipate and include all of the world’s languages in advance,” the Omnilingual ASR paper states, “but Omnilingual ASR makes it possible for communities to extend recognition with their own data.”

    Access the Tools

    All resources are now available on Meta's website, GitHub, and Hugging Face.

    What This Means for Enterprises

    For enterprise developers, especially those operating in multilingual or international markets, Omnilingual ASR significantly lowers the barrier to deploying speech-to-text systems across a broader range of customers and geographies.

    Instead of relying on commercial ASR APIs that support only a narrow set of high-resource languages, teams can now integrate an open-source pipeline that covers over 1,600 languages out of the box—with the option to extend it to thousands more via zero-shot learning.

    This flexibility is especially valuable for enterprises working in sectors like voice-based customer support, transcription services, accessibility, education, or civic technology, where local language coverage can be a competitive or regulatory necessity. Because the models are released under the permissive Apache 2.0 license, businesses can fine-tune, deploy, or integrate them into proprietary systems without restrictive terms.

    It also represents a shift in the ASR landscape—from centralized, cloud-gated offerings to community-extendable infrastructure. By making multilingual speech recognition more accessible, customizable, and cost-effective, Omnilingual ASR opens the door to a new generation of enterprise speech applications built around linguistic inclusion rather than linguistic limitation.

  • Celosphere 2025: Where enterprise AI moved from experiment to execution

    Presented by Celonis


    After a year of boardroom declarations about “AI transformation,” this was the week where enterprise leaders came together to talk about what actually works. Speaking from the stage at Celosphere in Munich, Celonis co-founder and co-CEO Alexander Rinke set the tone early in his keynote:

    “Only 11% of companies are seeing measurable benefits from AI projects today,” he said. “That’s not an adoption problem. That’s a context problem.”

    It’s a sentiment familiar to anyone who’s tried to deploy AI inside a large enterprise. You can’t automate what you don’t understand — and most organizations still lack a unified picture of how work in their companies really gets done.

    Celonis’ answer, showcased across three days at the company’s annual event, was less about new tech acronyms and more about connective tissue: how to make AI fit within the messy, living processes that drive business. The company framed it as achieving a real “Return on AI (ROAI)” — measurable impact that comes only when intelligence is grounded in process context.

    A living model of how the enterprise works

    At the heart of the keynote was what Rinke called a “living digital twin of your operations.” Celonis has been building toward this moment for years — but this was the first time the company made clear how far that concept has evolved.

    “We start by freeing the process,” said Rinke. “Freeing it from the restrictions of your current legacy systems.” Data Core, Celonis’ data infrastructure, extracts raw data from source systems. It’s capable of querying billions of records in near real time with sub-minute refresh — extending visibility beyond traditional systems of record.

    Built on this foundation, the Process Intelligence Graph sits at the center of the Celonis Platform. It’s a system-agnostic, graph-based model that unifies data across systems, apps, and even devices, including task-mining data that captures clicks, spreadsheets, and browser activity. It combines this data with business context—business rules, KPIs, benchmarks, and exceptions. Every transaction, rule, and process interaction becomes part of a continuously updated replica that reflects how the organization actually operates.

    On top of the Graph, the company’s new Build Experience allows organizations to analyze, design, and operate AI-driven, composable processes — integrating AI where it delivers business impact, not just technical demos:

    • Analyze where processes stall or repeat

    • Design the future state, setting outcomes, guardrails, and AI touchpoints

    • Operate with humans, systems, and AI agents working in sync — now orchestrated through a generally available Orchestration Engine that can trigger and monitor every step in one flow

    It’s a deliberate shift from discovery-driven AI pilots to outcome-driven AI operations — and a blueprint for orchestrating agentic AI, where human teams, systems, and autonomous agents work together through shared process context rather than in silos.

    Real-world proof: Mercedes-Benz, Vinmar, and Uniper

    The Celosphere stage offered real proof of the Celonis Platform in action, through live stories from customers already building on it.

    Mercedes-Benz shared how process intelligence became their “connective tissue” during the semiconductor crisis. “We had data everywhere — plants, suppliers, logistics,” recalled Dr. Jörg Burzer, Member of the Board of Management of Mercedes-Benz Group AG. “What we didn’t have was a way to see it together. Celonis helped us connect those dots fast enough to act.”

    The partnership has since expanded across eight of the company’s ten most critical processes, from supply chain to quality to after-sales. But what impressed the audience wasn’t just the scale — it was the cultural shift.

    “If you show data in context, and let teams visualize processes, you also change the culture,” Burzer said. “It’s not just process transformation — it’s people transformation.”

    At Vinmar, CEO Vishal Baid described Celonis as “the foundation of our automation and AI strategy.” His global plastics distribution business has already automated its entire order-to-cash process for a $3B unit, achieving a 40% productivity lift. But Baid wasn’t there just to celebrate finished work; he was looking ahead.

    “Now we’re tackling the non-algorithmic stuff,” he said. “Matching purchase and sales orders sounds simple until you have thousands of edge cases. We’re building an AI agent that can do that allocation intelligently. That’s the next frontier.”

    And in the energy sector, Uniper, with partner Microsoft, demonstrated how process-aware AI copilots are already reshaping operations. Using Celonis and Microsoft’s AI stack, Uniper can predict when hydropower plants will need maintenance — and cluster those jobs to reduce downtime and emissions.

    “Each technician, each part, each system plays a role in a living process,” said Hans Berg, Uniper’s CIO. “The human can’t see all of it. But process intelligence can — and it can nudge the system toward the best outcome.”

    Agnes Heftberger, CVP & CEO, Microsoft Germany & Austria, who joined Berg on stage, summed it up crisply:

    “The hard part isn’t building AI features — it’s scaling them responsibly,” she explained. “You need to marry intelligence with the beating heart of the company: its processes.”

    Across the global community, Celonis reports more than $8 billion in realized business value and over 120 certified value champions, which it presents as proof that process intelligence is driving measurable impact far beyond pilots. Rinke called it “the early proof points of a true return on AI.”

    From closed systems to composable intelligence

    Celosphere 2025 marked a shift from architecture to interoperability — from defining enterprise AI to making it work across boundaries.

    Rinke’s vision for the future is unapologetically open: “Good things grow from open ecosystems,” he said. That philosophy is taking shape through deeper platform integrations — including Microsoft Fabric, Databricks, and Bloomfilter — with zero-copy, bidirectional lakehouse access that lets customers query process data in place with minimal latency. The company also announced MCP Server support for embedding the Process Intelligence Graph directly into agentic AI platforms like Amazon Bedrock and Microsoft Copilot Studio.

    These updates make “composable enterprise AI” tangible — organizations can now assemble and govern AI solutions across ecosystems rather than being locked into any single vendor.

    Rather than competing on who has the “best agent,” the message was that enterprise AI will thrive when agents work together through shared context and models that mirror how businesses actually run.

    “Every vendor is bringing out their own agent,” Rinke said. “But each one is limited to that vendor’s world. If they can’t work together, they can’t work for you. That’s what process intelligence fixes.”

    The idea drew sustained applause. For companies juggling multiple cloud platforms, ERPs, and data tools, composability isn’t just elegant; it’s survival.

    Beyond operations: data, democracy, and direction

    The closing moments of the keynote took an unexpected turn — from enterprise architecture to human courage. Venezuelan opposition leader and Nobel Peace Prize winner María Corina Machado joined live via satellite to share how her movement used data, encrypted apps, and civic coordination to expose election fraud and mobilize millions.

    It was a powerful contrast: the same principles — transparency, accountability, context — at work in both business and democracy.

    “Technology can be a weapon or a liberator,” Machado said. “It depends on who holds the context.”

    Her words landed with weight in a room full of people used to talking about data, systems, and governance — a reminder that context isn’t just technical, it’s human.

    Why this year mattered

    Celosphere 2025 marked a shift in how enterprises approach AI — from experimentation to results grounded in process intelligence. The shift was evident in both tone and technology, with a more powerful Data Core, enhanced Process Intelligence Graph, and new Build Experience. But the deeper takeaway was philosophical: AI only scales when it’s grounded in how people and systems actually work together.

    Celonis president Carsten Thoma was candid in acknowledging that early process-mining projects often “stormed in with discovery” before understanding organizational value — a lesson that now defines the company’s measured, pragmatic approach to enterprise AI.

    Rinke put it best near the end of his keynote:

    “We’re not just automating steps,” he said. “We’re building enterprises that can adapt instantly, innovate freely, and improve continuously.”

    Missed it? Catch up with all the highlights from Celosphere 2025 here.


    Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.

  • Baseten takes on hyperscalers with new AI training platform that lets you own your model weights

    Baseten, the AI infrastructure company recently valued at $2.15 billion, is making its most significant product pivot yet: a full-scale push into model training that could reshape how enterprises wean themselves off dependence on OpenAI and other closed-source AI providers.

    The San Francisco-based company announced Thursday the general availability of Baseten Training, an infrastructure platform designed to help companies fine-tune open-source AI models without the operational headaches of managing GPU clusters, multi-node orchestration, or cloud capacity planning. The move is a calculated expansion beyond Baseten's core inference business, driven by what CEO Amir Haghighat describes as relentless customer demand and a strategic imperative to capture the full lifecycle of AI deployment.

    "We had a captive audience of customers who kept coming to us saying, 'Hey, I hate this problem,'" Haghighat said in an interview. "One of them told me, 'Look, I bought a bunch of H100s from a cloud provider. I have to SSH in on Friday, run my fine-tuning job, then check on Monday to see if it worked. Sometimes I realize it just hasn't been working all along.'"

    The launch comes at a critical inflection point in enterprise AI adoption. As open-source models from Meta, Alibaba, and others increasingly rival proprietary systems in performance, companies face mounting pressure to reduce their reliance on expensive API calls to services like OpenAI's GPT-5 or Anthropic's Claude. But the path from off-the-shelf open-source model to production-ready custom AI remains treacherous, requiring specialized expertise in machine learning operations, infrastructure management, and performance optimization.

    Baseten's answer: provide the infrastructure rails while letting companies retain full control over their training code, data, and model weights. It's a deliberately low-level approach born from hard-won lessons.

    How a failed product taught Baseten what AI training infrastructure really needs

    This isn't Baseten's first foray into training. The company's previous attempt, a product called Blueprints launched roughly two and a half years ago, failed spectacularly — a failure Haghighat now embraces as instructive.

    "We had created the abstraction layer a little too high," he explained. "We were trying to create a magical experience, where as a user, you come in and programmatically choose a base model, choose your data and some hyperparameters, and magically out comes a model."

    The problem? Users didn't have the intuition to make the right choices about base models, data quality, or hyperparameters. When their models underperformed, they blamed the product. Baseten found itself in the consulting business rather than the infrastructure business, helping customers debug everything from dataset deduplication to model selection.

    "We became consultants," Haghighat said. "And that's not what we had set out to do."

    Baseten killed Blueprints and refocused entirely on inference, vowing to "earn the right" to expand again. That moment arrived earlier this year, driven by two market realities: the vast majority of Baseten's inference revenue comes from custom models that customers train elsewhere, and competing training platforms were using restrictive terms of service to lock customers into their inference products.

    "Multiple companies who were building fine-tuning products had in their terms of service that you as a customer cannot take the weights of the fine-tuned model with you somewhere else," Haghighat said. "I understand why from their perspective — I still don't think there is a big company to be made purely on just training or fine-tuning. The sticky part is in inference, the valuable part where value is unlocked is in inference, and ultimately the revenue is in inference."

    Baseten took the opposite approach: customers own their weights and can download them at will. The bet is that superior inference performance will keep them on the platform anyway.

    Multi-cloud GPU orchestration and sub-minute scheduling set Baseten apart from hyperscalers

    The new Baseten Training product operates at what Haghighat calls "the infrastructure layer" — lower-level than the failed Blueprints experiment, but with opinionated tooling around reliability, observability, and integration with Baseten's inference stack.

    Key technical capabilities include multi-node training support across clusters of NVIDIA H100 or B200 GPUs, automated checkpointing to protect against node failures, sub-minute job scheduling, and integration with Baseten's proprietary Multi-Cloud Management (MCM) system. That last piece is critical: MCM allows Baseten to dynamically provision GPU capacity across multiple cloud providers and regions, passing cost savings to customers while avoiding the capacity constraints and multi-year contracts typical of hyperscaler deals.
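
    To make the checkpointing idea concrete, here is a minimal sketch in plain Python of how a training loop can survive a node failure by resuming from its last saved state. It is not Baseten's implementation: the JSON file layout, the `train` function, and the `checkpoint_every` and `fail_at` parameters are all hypothetical, and the "training step" is a stand-in.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: dump to a temp file, then rename over the target."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # rename is atomic, so a crash never leaves a half-written file


def load_checkpoint(path):
    """Return the last saved state, or a fresh one if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "loss": None}


def train(path, total_steps, checkpoint_every=10, fail_at=None):
    """Toy training loop that resumes from the latest checkpoint.

    fail_at simulates a node failure by raising at that step.
    """
    state = load_checkpoint(path)
    for step in range(state["step"] + 1, total_steps + 1):
        if fail_at is not None and step == fail_at:
            raise RuntimeError(f"simulated node failure at step {step}")
        state = {"step": step, "loss": 1.0 / step}  # stand-in for a real update
        if step % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state


if __name__ == "__main__":
    ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
    try:
        train(ckpt, total_steps=50, fail_at=37)
    except RuntimeError as err:
        print(err, "-> last checkpoint at step", load_checkpoint(ckpt)["step"])
    print("resumed and finished at step", train(ckpt, total_steps=50)["step"])
```

    The design point is that only the work since the last checkpoint is lost, so the checkpoint interval trades storage and I/O overhead against recomputation after a failure.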

    "With hyperscalers, you don't get to say, 'Hey, give me three or four B200 nodes while my job is running, and then take it back from me and don't charge me for it,'" Haghighat said. "They say, 'No, you need to sign a three-year contract.' We don't do that."

    Baseten's approach mirrors broader trends in cloud infrastructure, where abstraction layers increasingly allow workloads to move fluidly across providers. When AWS experienced a major outage several weeks ago, Baseten's inference services remained operational by automatically routing traffic to other cloud providers — a capability now extended to training workloads.

    The technical differentiation extends to Baseten's observability tooling, which provides per-GPU metrics for multi-node jobs, granular checkpoint tracking, and a refreshed UI that surfaces infrastructure-level events. The company also introduced an "ML Cookbook" of open-source training recipes for popular models like Gemma, GPT OSS, and Qwen, designed to help users reach "training success" faster.

    Early adopters report 84% cost savings and 50% latency improvements with custom models

    Two early customers illustrate the market Baseten is targeting: AI-native companies building specialized vertical solutions that require custom models.

    Oxen AI, a platform focused on dataset management and model fine-tuning, exemplifies the partnership model Baseten envisions. CEO Greg Schoeninger articulated a common strategic calculus, telling VentureBeat: "Whenever I've seen a platform try to do both hardware and software, they usually fail at one of them. That's why partnering with Baseten to handle infrastructure was the obvious choice."

    Oxen built its customer experience entirely on top of Baseten's infrastructure, using the Baseten CLI to programmatically orchestrate training jobs. The system automatically provisions and deprovisions GPUs, fully concealing Baseten's interface behind Oxen's own. For one Oxen customer, AlliumAI — a startup bringing structure to messy retail data — the integration delivered 84% cost savings compared to previous approaches, reducing total inference costs from $46,800 to $7,530.
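
    As a quick check of the arithmetic, the 84% figure follows from the two totals quoted above. The helper name `percent_savings` is ours, used purely for illustration.

```python
def percent_savings(before, after):
    """Percentage reduction when a cost falls from `before` to `after`."""
    return (before - after) / before * 100


savings = percent_savings(46_800, 7_530)
print(f"{savings:.1f}%")  # 83.9%, which rounds to the reported 84%
```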

    "Training custom LoRAs has always been one of the most effective ways to leverage open-source models, but it often came with infrastructure headaches," said Daniel Demillard, CEO of AlliumAI. "With Oxen and Baseten, that complexity disappears. We can train and deploy models at massive scale without ever worrying about CUDA, which GPU to choose, or shutting down servers after training."

    Parsed, another early customer, tackles a different pain point: helping enterprises reduce dependence on OpenAI by creating specialized models that outperform generalist LLMs on domain-specific tasks. The company works in mission-critical sectors like healthcare, finance, and legal services, where model performance and reliability aren't negotiable.

    "Prior to switching to Baseten, we were seeing repetitive and degraded performance on our fine-tuned models due to bugs with our previous training provider," said Charles O'Neill, Parsed's co-founder and chief science officer. "On top of that, we were struggling to easily download and checkpoint weights after training runs."

    With Baseten, Parsed achieved 50% lower end-to-end latency for transcription use cases, spun up HIPAA-compliant EU deployments for testing within 48 hours, and kicked off more than 500 training jobs. The company also leveraged Baseten's modified vLLM inference framework and speculative decoding — a technique that generates draft tokens to accelerate language model output — to cut latency in half for custom models.

    "Fast models matter," O'Neill said. "But fast models that get better over time matter more. A model that's 2x faster but static loses to one that's slightly slower but improving 10% monthly. Baseten gives us both — the performance edge today and the infrastructure for continuous improvement."

    Why training and inference are more interconnected than the industry realizes

    The Parsed example illuminates a deeper strategic rationale for Baseten's training expansion: the boundary between training and inference is blurrier than conventional wisdom suggests.

    Baseten's model performance team uses the training platform extensively to create "draft models" for speculative decoding, a cutting-edge technique that can dramatically accelerate inference. The company recently announced it achieved 650+ tokens per second on OpenAI's GPT OSS 120B model — a 60% improvement over its launch performance — using EAGLE-3 speculative decoding, which requires training specialized small models to work alongside larger target models.
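
    For readers unfamiliar with the technique, the sketch below shows the basic greedy form of speculative decoding with two toy "models" written as plain Python functions. It does not attempt to reproduce EAGLE-3 or Baseten's stack: `target_next` and `draft_next` are hypothetical stand-ins with made-up rules, chosen so the draft usually agrees with the target.

```python
def target_next(tokens):
    """Stand-in for the large model's greedy next token (made-up rule)."""
    return (tokens[-1] * 3 + len(tokens)) % 11


def draft_next(tokens):
    """Stand-in for the small draft model: agrees with the target except
    when the context length is a multiple of 4."""
    guess = target_next(tokens)
    return (guess + 1) % 11 if len(tokens) % 4 == 0 else guess


def greedy_decode(prompt, n_new):
    """Baseline: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(target_next(tokens))
    return tokens


def speculative_decode(prompt, n_new, k=4):
    """Draft k tokens, keep the prefix the target agrees with, then take
    one token from the target itself. Returns (tokens, verify_passes)."""
    tokens = list(prompt)
    goal = len(prompt) + n_new
    verify_passes = 0
    while len(tokens) < goal:
        # 1. The draft model proposes up to k tokens autoregressively.
        draft = []
        for _ in range(min(k, goal - len(tokens))):
            draft.append(draft_next(tokens + draft))
        # 2. The target checks every drafted position. A real system scores
        #    all positions in one batched forward pass; here we simply loop.
        verify_passes += 1
        accepted = 0
        for i, tok in enumerate(draft):
            if target_next(tokens + draft[:i]) != tok:
                break
            accepted += 1
        tokens.extend(draft[:accepted])
        # 3. The same pass yields the target's own next token: a correction
        #    after a mismatch, or a bonus token if every draft was accepted.
        if len(tokens) < goal:
            tokens.append(target_next(tokens))
    return tokens, verify_passes


if __name__ == "__main__":
    out, passes = speculative_decode([1, 2, 3], n_new=20)
    assert out == greedy_decode([1, 2, 3], n_new=20)  # identical output
    print("target verification passes:", passes, "for 20 new tokens")
```

    The output matches ordinary greedy decoding token for token; the gain is that the expensive model runs once per batch of drafted tokens rather than once per token. That is also why the quality of the draft model matters: the more often it agrees with the target, the more tokens each pass accepts.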

    "Ultimately, inference and training plug in more ways than one might think," Haghighat said. "When you do speculative decoding in inference, you need to train the draft model. Our model performance team is a big customer of the training product to train these EAGLE heads on a continuous basis."

    This technical interdependence reinforces Baseten's thesis that owning both training and inference creates defensible value. The company can optimize the entire lifecycle: a model trained on Baseten can be deployed with a single click to inference endpoints pre-optimized for that architecture, with deployment-from-checkpoint support for chat completion and audio transcription workloads.

    The approach contrasts sharply with vertically integrated competitors like Replicate or Modal, which also offer training and inference but with different architectural tradeoffs. Baseten's bet is on lower-level infrastructure flexibility and performance optimization, particularly for companies running custom models at scale.

    As open-source AI models improve, enterprises see fine-tuning as the path away from OpenAI dependency

    Underpinning Baseten's entire strategy is a conviction about the trajectory of open-source AI models — namely, that they're getting good enough, fast enough, to unlock massive enterprise adoption through fine-tuning.

    "Both closed and open-source models are getting better and better in terms of quality," Haghighat said. "We don't even need open source to surpass closed models, because as both of them are getting better, they unlock all these invisible lines of usefulness for different use cases."

    He pointed to the proliferation of reinforcement learning and supervised fine-tuning techniques that allow companies to take an open-source model and make it "as good as the closed model, not at everything, but at this narrow band of capability that they want."

    That trend is already visible in Baseten's Model APIs business, launched alongside Training earlier this year to provide production-grade access to open-source models. The company was the first provider to offer access to DeepSeek V3 and R1, and has since added models like Llama 4 and Qwen 3, optimized for performance and reliability. Model APIs serves as a top-of-funnel product: companies start with off-the-shelf open-source models, realize they need customization, move to Training for fine-tuning, and ultimately deploy on Baseten's Dedicated Deployments infrastructure.

    Yet Haghighat acknowledged the market remains "fuzzy" around which training techniques will dominate. Baseten is hedging by staying close to the bleeding edge through its Forward Deployed Engineering team, which works hands-on with select customers on reinforcement learning, supervised fine-tuning, and other advanced techniques.

    "As we do that, we will see patterns emerge about what a productized training product can look like that really addresses the user's needs without them having to learn too much about how RL works," he said. "Are we there as an industry? I would say not quite. I see some attempts at that, but they all seem like almost falling to the same trap that Blueprints fell into—a bit of a walled garden that ties the hands of AI folks behind their back."

    The roadmap ahead includes potential abstractions for common training patterns, expansion into image, audio, and video fine-tuning, and deeper integration of advanced techniques like prefill-decode disaggregation, which separates the initial processing of prompts from token generation to improve efficiency.

    Baseten faces crowded field but bets developer experience and performance will win enterprise customers

    Baseten enters an increasingly crowded market for AI infrastructure. Hyperscalers like AWS, Google Cloud, and Microsoft Azure offer GPU compute for training, while specialized providers like Lambda Labs, CoreWeave, and Together AI compete on price, performance, or ease of use. Then there are vertically integrated platforms like Hugging Face, Replicate, and Modal that bundle training, inference, and model hosting.

    Baseten's differentiation rests on three pillars: its MCM system for multi-cloud capacity management, deep performance optimization expertise built from its inference business, and a developer experience tailored for production deployments rather than experimentation.

    The company's recent $150 million Series D and $2.15 billion valuation provide runway to invest in both products simultaneously. Major customers include Descript, which uses Baseten for transcription workloads; Decagon, which runs customer service AI; and Sourcegraph, which powers coding assistants. All three operate in domains where model customization and performance are competitive advantages.

    Timing may be Baseten's biggest asset. The confluence of improving open-source models, enterprise discomfort with dependence on proprietary AI providers, and growing sophistication around fine-tuning techniques creates what Haghighat sees as a sustainable market shift.

    "There is a lot of use cases for which closed models have gotten there and open ones have not," he said. "Where I'm seeing in the market is people using different training techniques — more recently, a lot of reinforcement learning and SFT — to be able to get this open model to be as good as the closed model, not at everything, but at this narrow band of capability that they want. That's very palpable in the market."

    For enterprises navigating the complex transition from closed to open AI models, Baseten's positioning offers a clear value proposition: infrastructure that handles the messy middle of fine-tuning while optimizing for the ultimate goal of performant, reliable, cost-effective inference at scale. The company's insistence that customers own their model weights — a stark contrast to competitors using training as a lock-in mechanism — reflects confidence that technical excellence, not contractual restrictions, will drive retention.

    Whether Baseten can execute on this vision depends on navigating tensions inherent in its strategy: staying at the infrastructure layer without becoming consultants, providing power and flexibility without overwhelming users with complexity, and building abstractions at exactly the right level as the market matures. The company's willingness to kill Blueprints when it failed suggests a pragmatism that could prove decisive in a market where many infrastructure providers over-promise and under-deliver.

    "Through and through, we're an inference company," Haghighat emphasized. "The reason that we did training is at the service of inference."

    That clarity of purpose, treating training as a means to an end rather than an end in itself, may be Baseten's most important strategic asset. As AI deployment matures from experimentation to production, the companies that solve the full stack stand to capture outsized value. But only if they avoid the trap of technology in search of a problem.

    At least Baseten's customers no longer have to SSH into boxes on Friday and pray their training jobs complete by Monday. In the infrastructure business, sometimes the best innovation is simply making the painful parts disappear.

  • 6 proven lessons from the AI projects that broke before they scaled

    Companies hate to admit it, but the road to production-level AI deployment is littered with proofs of concept (PoCs) that go nowhere and failed projects that never deliver on their goals. In certain domains there’s little tolerance for iteration, especially in life sciences, where an AI application may be helping bring new treatments to market or diagnose diseases. Even slightly inaccurate analyses and assumptions early on can compound into sizable errors downstream.

    An analysis of dozens of AI PoCs, some that sailed through to full production use and some that didn’t, reveals six common pitfalls. Interestingly, failure is usually caused not by the quality of the technology but by misaligned goals, poor planning or unrealistic expectations.

    Here’s a summary of what went wrong in real-world examples and practical guidance on how to get it right.

    Lesson 1: A vague vision spells disaster

    Every AI project needs a clear, measurable goal. Without it, developers are building a solution in search of a problem. For example, in developing an AI system for a pharmaceutical manufacturer’s clinical trials, the team aimed to “optimize the trial process,” but didn’t define what that meant. Did they need to accelerate patient recruitment, reduce participant dropout rates or lower the overall trial cost? The lack of focus led to a model that was technically sound but irrelevant to the client’s most pressing operational needs.

    Takeaway: Define specific, measurable objectives upfront. Use SMART criteria (Specific, Measurable, Achievable, Relevant, Time-bound). For example, aim for “reduce equipment downtime by 15% within six months” rather than a vague “make things better.” Document these goals and align stakeholders early to avoid scope creep.

    Lesson 2: Data quality overtakes quantity

    Data is the lifeblood of AI, but poor-quality data is poison. In one project, a retail client began with years of sales data to predict inventory needs. The catch? The dataset was riddled with inconsistencies, including missing entries, duplicate records and outdated product codes. The model performed well in testing but failed in production because it learned from noisy, unreliable data.

    Takeaway: Invest in data quality over volume. Use tools like Pandas for preprocessing and Great Expectations for data validation to catch issues early. Conduct exploratory data analysis (EDA) with visualizations (like Seaborn) to spot outliers or inconsistencies. Clean data is worth more than terabytes of garbage.
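
    As an illustration, the sketch below audits a tiny sales extract for the three problems described in this lesson: missing entries, duplicate records and outdated product codes. It uses only the Python standard library so it runs anywhere; in practice Pandas or Great Expectations would do the same job at scale. The sample rows, column names and `VALID_CODES` catalog are invented for the example.

```python
import csv
import io
from collections import Counter

# Hypothetical sample data: one missing value, one duplicate, one retired code.
RAW = """order_id,product_code,units_sold
1001,SKU-001,4
1002,SKU-002,
1001,SKU-001,4
1003,OLD-17,2
1004,SKU-003,7
"""

VALID_CODES = {"SKU-001", "SKU-002", "SKU-003"}  # assumed current catalog


def audit(rows, valid_codes):
    """Count missing values, exact duplicate rows and unknown product codes."""
    seen = Counter(tuple(sorted(r.items())) for r in rows)
    return {
        "missing": sum(1 for r in rows if "" in r.values()),
        "duplicates": sum(n - 1 for n in seen.values() if n > 1),
        "outdated_codes": sum(1 for r in rows if r["product_code"] not in valid_codes),
    }


def clean(rows, valid_codes):
    """Keep the first copy of each complete row that has a valid code."""
    kept, seen = [], set()
    for r in rows:
        key = tuple(sorted(r.items()))
        if key in seen or "" in r.values() or r["product_code"] not in valid_codes:
            continue
        seen.add(key)
        kept.append(r)
    return kept


rows = list(csv.DictReader(io.StringIO(RAW)))
report = audit(rows, VALID_CODES)
cleaned = clean(rows, VALID_CODES)
print(report)
print(len(cleaned), "of", len(rows), "rows kept")
```

    Running the audit before any modeling turns "the data is messy" into counts that can be tracked and fixed at the source.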

    Lesson 3: Overcomplicating the model backfires

    Chasing technical complexity doesn't always lead to better outcomes. For example, on a healthcare project, development initially began by creating a sophisticated convolutional neural network (CNN) to identify anomalies in medical images.

    While the model was state-of-the-art, its high computational cost meant weeks of training, and its "black box" nature made it difficult for clinicians to trust. The application was revised to implement a simpler random forest model that not only matched the CNN's predictive accuracy but was faster to train and far easier to interpret — a critical factor for clinical adoption.

    Takeaway: Start simple. Use straightforward algorithms such as a random forest (available in scikit-learn) or gradient-boosted trees (for example, XGBoost) to establish a baseline. Only scale to complex models, such as TensorFlow-based long short-term memory (LSTM) networks, if the problem demands it. Prioritize explainability with tools like SHAP (SHapley Additive exPlanations) to build trust with stakeholders.

    Lesson 4: Ignoring deployment realities

    A model that shines in a Jupyter Notebook can crash in the real world. For example, a company’s initial deployment of a recommendation engine for its e-commerce platform couldn’t handle peak traffic. The model was built without scalability in mind and choked under load, causing delays and frustrated users. The oversight cost weeks of rework.

    Takeaway: Plan for production from day one. Package models in Docker containers and deploy with Kubernetes for scalability. Use TensorFlow Serving or FastAPI for efficient inference. Monitor performance with Prometheus and Grafana to catch bottlenecks early. Test under realistic conditions to ensure reliability.

    Lesson 5: Neglecting model maintenance

    AI models aren’t set-and-forget. In a financial forecasting project, the model performed well for months until market conditions shifted. Unmonitored data drift caused predictions to degrade, and the lack of a retraining pipeline meant manual fixes were needed. The project lost credibility before developers could recover.

    Takeaway: Build for the long haul. Implement monitoring for data drift using tools like Alibi Detect. Automate retraining with Apache Airflow and track experiments with MLflow. Incorporate active learning to prioritize labeling for uncertain predictions, keeping models relevant.
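
    One simple way to put a number on data drift is the Population Stability Index (PSI), which compares how a feature is distributed at training time with how it is distributed in production. The sketch below is a dependency-free illustration of that metric, not the method Alibi Detect uses; the function name `psi`, the equal-width binning and the sample data are our own choices.

```python
import math


def psi(reference, current, n_bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples.

    Bins are equal-width over the reference range; values outside that
    range fall into the first or last bin.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant reference

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # eps keeps empty bins from producing log(0) or division by zero
        return [max(c / len(values), eps) for c in counts]

    ref = bin_fractions(reference)
    cur = bin_fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


reference = [i / 100 for i in range(1000)]       # feature values seen in training
shifted = [i / 100 + 3.0 for i in range(1000)]   # same feature after conditions change

print("no drift:", psi(reference, reference))    # 0.0
print("shifted: ", round(psi(reference, shifted), 2))
```

    A commonly cited rule of thumb treats a PSI above roughly 0.25 as a significant shift worth investigating, though the right threshold depends on the feature and the cost of a stale model. A check like this can run on a schedule and trigger the retraining pipeline when the score crosses the chosen threshold.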

    Lesson 6: Underestimating stakeholder buy-in

    Technology doesn’t exist in a vacuum. A fraud detection model was technically flawless but flopped because end-users — bank employees — didn’t trust it. Without clear explanations or training, they ignored the model’s alerts, rendering it useless.

    Takeaway: Prioritize human-centric design. Use explainability tools like SHAP to make model decisions transparent. Engage stakeholders early with demos and feedback loops. Train users on how to interpret and act on AI outputs. Trust is as critical as accuracy.

    Best practices for success in AI projects

    Drawing from these failures, here’s the roadmap to get it right:

    • Set clear goals: Use SMART criteria to align teams and stakeholders.

    • Prioritize data quality: Invest in cleaning, validation and EDA before modeling.

    • Start simple: Build baselines with simple algorithms before scaling complexity.

    • Design for production: Plan for scalability, monitoring and real-world conditions.

    • Maintain models: Automate retraining and monitor for drift to stay relevant.

    • Engage stakeholders: Foster trust with explainability and user training.

    Building resilient AI

    AI’s potential is intoxicating, yet failed AI projects teach us that success isn’t just about algorithms. It’s about discipline, planning and adaptability. As AI evolves, emerging trends like federated learning for privacy-preserving models and edge AI for real-time insights will raise the bar. By learning from past mistakes, teams can build scale-out, production systems that are robust, accurate, and trusted.

    Kavin Xavier is VP of AI solutions at CapeStart.


  • What could possibly go wrong if an enterprise replaces all its engineers with AI?

    AI coding, vibe coding and agentic swarms have made a dramatic and astonishing recent market entrance, with the AI code tools market valued at $4.8 billion and expected to grow at a 23% annual rate. Enterprises are grappling with AI coding agents and what to do about expensive human coders.

    They don’t lack for advice. OpenAI’s CEO estimates that AI can perform over 50% of what human engineers can do. Six months ago, Anthropic’s CEO said that AI would write 90% of code in six months. Meta’s CEO said he believes AI will replace mid-level engineers “soon.” Judging by recent tech layoffs, it seems many executives are embracing that advice.

    Software engineers and data scientists are among the most expensive salary lines at many companies, and business and technology leaders may be tempted to replace them with AI. However, recent high-profile failures demonstrate that engineers and their expertise remain valuable, even as AI continues to make impressive advances.

    SaaStr disaster

    Jason Lemkin, a tech entrepreneur and founder of the SaaS community SaaStr, has been vibe coding a SaaS networking app and live-tweeting his experience. About a week into his adventure, he admitted to his audience that something was going very wrong. The AI deleted his production database despite his request for a “code and action freeze.” This is the kind of mistake no experienced (or even semi-experienced) engineer would make.

    If you have ever worked in a professional coding environment, you know to split your development environment from production. Junior engineers are given full access to the development environment (it’s crucial for productivity), but access to production is given on a limited need-to-have basis to a few of the most trusted senior engineers. Access is restricted for precisely this reason: to prevent a junior engineer from accidentally taking down production.

    In fact, Lemkin made two mistakes. First, for something as critical as production, access is simply never granted to unreliable actors (we don’t rely on asking a junior engineer or an AI nicely). Second, he never separated development from production. In a subsequent public conversation on LinkedIn, Lemkin, who holds a Stanford Executive MBA and Berkeley JD, admitted that he was not aware of the best practice of splitting development and production databases.

    The takeaway for business leaders is that standard software engineering best practices still apply. We should incorporate at least the same safety constraints for AI as we do for junior engineers. Arguably, we should go beyond that and treat AI slightly adversarially: There are reports that, like HAL in Stanley Kubrick's 2001: A Space Odyssey, the AI might try to break out of its sandbox environment to accomplish a task. With more vibe coding, having experienced engineers who understand how complex software systems work and can implement the proper guardrails in development processes will become increasingly necessary.
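
    As a rough illustration of applying the same constraint to an AI agent as to a junior engineer, here is a minimal Python sketch of a gate that refuses destructive SQL in production unless a human has approved it. All names here (`check_statement`, `ProductionWriteBlocked`, the keyword list) are hypothetical and not taken from any tool mentioned above. A keyword check like this is only a last line of defense; real enforcement belongs in separate databases and restricted credentials.

```python
# Hypothetical guard: every name below is an assumption made for illustration.
DESTRUCTIVE_KEYWORDS = ("DROP", "DELETE", "TRUNCATE", "ALTER")


class ProductionWriteBlocked(Exception):
    """Raised when a destructive statement targets production without approval."""


def is_destructive(sql: str) -> bool:
    """Return True if the statement begins with a destructive keyword."""
    stripped = sql.lstrip()
    first_word = stripped.split(None, 1)[0].upper() if stripped else ""
    return first_word in DESTRUCTIVE_KEYWORDS


def check_statement(sql: str, environment: str, approved_by_human: bool = False) -> str:
    """Pass statements through in development; in production, block
    destructive statements unless a human has explicitly approved them."""
    if environment == "production" and is_destructive(sql) and not approved_by_human:
        raise ProductionWriteBlocked(f"blocked in production: {sql!r}")
    return sql
```

    With this in place, an agent running against `development` can drop and recreate tables freely, while the same statement aimed at `production` raises an error regardless of what the agent was asked to do in its prompt.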

    Tea hack

    Sean Cook is the founder and CEO of Tea, a mobile application launched in 2023 and designed to help women date safely. In the summer of 2025, the app was “hacked”: 72,000 images, including 13,000 verification photos and images of government IDs, were leaked onto the public discussion forum 4chan. Worse, Tea’s own privacy policy promised that these images would be “deleted immediately” after users were authenticated, meaning the company potentially violated its own privacy policy.

    I use “hacked” in air quotes because the incident stems less from the cleverness of the attackers than from the ineptitude of the defenders. In addition to violating its own data policies, the app left a Firebase storage bucket unsecured, exposing sensitive user data to the public internet. It’s the digital equivalent of locking your front door but leaving your back door open with your family jewelry ostentatiously hanging on the doorknob.

    While we don’t know if the root cause was vibe coding, the Tea hack shows how catastrophic breaches can stem from basic, preventable security errors caused by poor development processes. It is the kind of vulnerability that a disciplined and thoughtful engineering process addresses. Unfortunately, relentless financial pressure pushes companies toward a “lean,” “move fast and break things” culture that is the polar opposite of such a process, and vibe coding only exacerbates the problem.

    How to safely adopt AI coding agents?

    So how should enterprise and technology leaders think about AI? First, this is not a call to abandon AI for coding.  An MIT Sloan study estimated AI leads to productivity gains between 8% and 39%, while a McKinsey study found a 10% to 50% reduction in time to task completion with the use of AI. 

    However, we should be aware of the risks. The old lessons of software engineering don’t go away. These include many tried-and-true best practices, such as version control, automated unit and integration tests, safety checks like SAST/DAST, separating development and production environments, code review and secrets management. If anything, they become more salient.

    AI can generate code 100 times faster than humans can type, fostering an illusion of productivity that is a tempting siren call for many executives.  However, the quality of the rapidly generated AI shlop is still up for debate. To develop complex production systems, enterprises need the thoughtful, seasoned experience of human engineers.

    Tianhui Michael Li is president at Pragmatic Institute and the founder and president of The Data Incubator.


  • Ship fast, optimize later: top AI engineers don’t care about cost — they’re prioritizing deployment

    Across industries, rising compute expenses are often cited as a barrier to AI adoption — but leading companies are finding that cost is no longer the real constraint.

    The tougher challenges (and the ones top of mind for many tech leaders)? Latency, flexibility and capacity.

    At Wonder, for instance, AI adds just a few cents per order; the food delivery and takeout company is much more concerned with cloud capacity amid skyrocketing demand. Recursion, for its part, has been focused on balancing small and larger-scale training and deployment via on-premises clusters and the cloud; this has afforded the biotech company flexibility for rapid experimentation.

    The companies’ in-the-wild experiences highlight a broader industry trend: For enterprises operating AI at scale, economics aren't the decisive factor; the conversation has shifted from how to pay for AI to how fast it can be deployed and sustained.

    AI leaders from the two companies recently sat down with VentureBeat’s CEO and editor-in-chief Matt Marshall as part of VB’s traveling AI Impact Series. Here’s what they shared.

    Wonder: Rethink what you assume about capacity

    Wonder uses AI to power everything from recommendations to logistics — yet, as of now, reported CTO James Chen, AI adds just a few cents per order.

    Chen explained that the technology component of a meal order costs 14 cents and that AI adds 2 to 3 cents, although that’s “going up really rapidly” to 5 to 8 cents. Still, that seems almost immaterial compared to total operating costs.

    Instead, the 100% cloud-native AI company’s main concern has been capacity with growing demand. Wonder was built with “the assumption” (which proved to be incorrect) that there would be “unlimited capacity” so they could move “super fast” and wouldn’t have to worry about managing infrastructure, Chen noted.

    But the company has grown quite a bit over the last few years, he said; as a result, about six months ago, “we started getting little signals from the cloud providers, ‘Hey, you might need to consider going to region two,’” because they were running out of capacity for CPU or data storage at their facilities as demand grew.

    It was “very shocking” that they had to move to plan B earlier than they anticipated. “Obviously it's good practice to be multi-region, but we were thinking maybe two more years down the road,” said Chen.

    What's not economically feasible (yet)

    Wonder built its own model to maximize its conversion rate, Chen noted; the goal is to surface new restaurants to relevant customers as much as possible. These are “isolated scenarios” where models are trained over time to be “very, very efficient and very fast.”

    Currently, the best bet for Wonder’s use case is large models, Chen noted. But in the long term, they’d like to move to small models that are hyper-customized to individuals (via AI agents or concierges) based on their purchase history and even their clickstream. “Having these micro models is definitely the best, but right now the cost is very expensive,” Chen noted. “If you try to create one for each person, it's just not economically feasible.”

    Budgeting is an art, not a science

    Wonder gives its devs and data scientists as much playroom as possible to experiment, and internal teams review the costs of use to make sure nobody turned on a model and “jacked up massive compute around a huge bill,” said Chen.

    The company is trying different things to offload to AI and operate within margins. “But then it's very hard to budget because you have no idea,” he said. One of the challenging things is the pace of development; when a new model comes out, “we can’t just sit there, right? We have to use it.”

    Budgeting for the unknown economics of a token-based system is “definitely art versus science.”

    A critical component in the software development lifecycle is preserving context when using large native models, he explained. When you find something that works, you can add it to your company’s “corpus of context” that can be sent with every request. That’s big and it costs money each time.

    “Over 50%, up to 80% of your costs is just resending the same information back into the same engine again on every request,” said Chen.
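
    To see how a fixed context can dominate spend, here is a small Python sketch of the arithmetic. It rests on simplifying assumptions that are not from Chen: cost is taken to be proportional to input tokens, output tokens and any provider-side caching discounts are ignored, and the token counts are invented for illustration.

```python
def repeated_context_share(context_tokens: int, new_tokens_per_request: int) -> float:
    """Fraction of each request's input tokens that is the resent shared context.

    Assumes cost scales linearly with input tokens (an illustrative
    simplification, not a statement about any provider's pricing).
    """
    total = context_tokens + new_tokens_per_request
    if total <= 0:
        raise ValueError("token counts must sum to a positive number")
    return context_tokens / total


# Hypothetical sizes: a 4,000-token shared context plus 1,000 new tokens per request.
share = repeated_context_share(4_000, 1_000)
print(f"{share:.0%} of input tokens are repeated context")  # prints "80% ..."
```

    Under these assumptions, a shared context the same size as the new input accounts for half of the input tokens, and one four times as large accounts for 80%, which is the range Chen describes.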

    In theory, the more they do, the lower the cost per unit should be. “I know when a transaction happens, I'll pay the X cent tax for each one, but I don't want to be limited to use the technology for all these other creative ideas.”

    The 'vindication moment' for Recursion

    Recursion, for its part, has focused on meeting broad-ranging compute needs via a hybrid infrastructure of on-premises clusters and cloud inference.

    When initially looking to build out its AI infrastructure, the company had to go with its own setup, as “the cloud providers didn't have very many good offerings,” explained CTO Ben Mabey. “The vindication moment was that we needed more compute and we looked to the cloud providers and they were like, ‘Maybe in a year or so.’”

    The company’s first cluster in 2017 incorporated Nvidia gaming GPUs (1080s, launched in 2016); they have since added Nvidia H100s and A100s, and use a Kubernetes cluster that they run in the cloud or on-prem.

    Addressing the longevity question, Mabey noted: “These gaming GPUs are actually still being used today, which is crazy, right? The myth that a GPU's life span is only three years, that's definitely not the case. A100s are still top of the list, they're the workhorse of the industry.”

    Best use cases on-prem vs cloud; cost differences

    More recently, Mabey’s team has been training a foundation model on Recursion’s image repository (which consists of petabytes of data and more than 200 pictures). This and other types of big training jobs have required a “massive cluster” and connected, multi-node setups.

    “When we need that fully-connected network and access to a lot of our data in a high parallel file system, we go on-prem,” he explained. On the other hand, shorter workloads run in the cloud.

    Recursion’s method is to “pre-empt” GPUs and Google tensor processing units (TPUs), which is the process of interrupting running GPU tasks to work on higher-priority ones. “Because we don't care about the speed in some of these inference workloads where we're uploading biological data, whether that's an image or sequencing data, DNA data,” Mabey explained. “We can say, ‘Give this to us in an hour,’ and we're fine if it kills the job.”
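
    Workloads that tolerate being killed usually record their progress so that a restarted job picks up where it left off. The Python sketch below shows one simple way to do that with a checkpoint file. It is a generic illustration under assumed names (`run_batch`, the `next_index` key), not a description of Recursion's actual pipeline.

```python
import json
from pathlib import Path
from typing import Callable, Sequence


def run_batch(items: Sequence, process: Callable, checkpoint_path: Path) -> int:
    """Process items in order, saving progress after each one so that a
    preempted (killed) job can resume instead of starting over.

    Returns the number of items handled during this invocation.
    """
    start = 0
    if checkpoint_path.exists():
        # Resume from the index recorded by a previous, interrupted run.
        start = json.loads(checkpoint_path.read_text())["next_index"]
    handled = 0
    for index in range(start, len(items)):
        process(items[index])
        # Record progress only after the item has been fully processed.
        checkpoint_path.write_text(json.dumps({"next_index": index + 1}))
        handled += 1
    return handled
```

    If the job is interrupted partway through, rerunning it with the same checkpoint file skips the items that already completed. A production version would also need to write the checkpoint atomically and make `process` safe to repeat for the one item that may have been in flight.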

    From a cost perspective, moving large workloads on-prem is “conservatively” 10 times cheaper, Mabey noted; for a five-year TCO, it's half the cost. On the other hand, for smaller storage needs, the cloud can be “pretty competitive” cost-wise.

    Ultimately, Mabey urged tech leaders to step back and determine whether they’re truly willing to commit to AI; cost-effective solutions typically require multi-year buy-ins.

    “From a psychological perspective, I've seen peers of ours who will not invest in compute, and as a result they're always paying on demand,” said Mabey. “Their teams use far less compute because they don't want to run up the cloud bill. Innovation really gets hampered by people not wanting to burn money.”