Blog

  • Tariff turbulence exposes costly blind spots in supply chains and AI

    Presented by Celonis


    When tariff rates change overnight, companies have 48 hours to model alternatives and act before competitors secure the best options. At Celosphere 2025 in Munich, enterprises demonstrated how they’re turning that chaos into competitive advantage — with quantifiable results that separate winners from losers.

    Vinmar International: The global plastics and chemicals distributor created a real-time digital twin of its $3B supply chain, cutting default expedites by more than 20% and improving delivery agility across global operations.

    Florida Crystals: One of America's largest cane sugar producers, the company unlocked millions in working capital and strengthened supply chain resilience by eliminating manual rework across Finance, Procurement, and Inbound Supply. AI pilots now extend gains into invoice processing, predictive maintenance, and order management.

    ASOS: The ecommerce fashion giant connected its end-to-end supply chain for full transparency, reducing process variation, accelerating speed-to-market, and improving customer experience at scale.

    The common thread here: process intelligence that bridges the gap traditional ERP systems can’t close — connecting operational dots across ERP, finance, and logistics systems when seconds matter.

    “The question isn’t whether disruptions will hit,” says Peter Budweiser, General Manager of Supply Chain at Celonis. “It’s whether your systems can show you what’s breaking fast enough to fix it.”

    That visibility gap costs the average company double-digit millions in working capital and competitive positioning. As 54% of supply chain leaders face disruptions daily, the pressure is shifting to AI agents that execute real actions: triggering purchase orders, rerouting shipments, adjusting inventory. But an autonomous agent acting on stale or siloed data can make million-dollar mistakes when tariff structures shift overnight.

    Tariffs, as old as trade itself, have become the ultimate stress test for enterprise AI — revealing whether companies truly understand their supply chains and whether their AI can be trusted to act.

    Modern ERP: Data rich, insight poor

    Supply chain leaders face a paradox: drowning in data while starving for insight. Traditional enterprise systems — SAP, Oracle, PeopleSoft — capture every transaction meticulously.

    SAP logs the purchase order. Oracle tracks the shipment. The warehouse system records inventory movement. Each performs its function, but when tariffs change and companies need to model alternative sourcing scenarios across all three simultaneously, the data sits in silos.

    “What’s changed is the speed at which disruptions cascade,” says Manik Sharma, Head of Supply Chain GTM AI at Celonis. “Traditional ERP systems weren’t built for today’s volatility.”

    Companies generate thousands of reports showing what happened last quarter. They struggle to answer what happens if tariffs increase 25% tomorrow and they need to switch suppliers within days.

    Tariffs: The 48-hour scramble

    Global trade volatility has transformed tariffs from predictable costs into strategic weapons. When new rates drop with unprecedented frequency, input costs spike across suppliers, finance teams scramble to calculate margin impact, and procurement races to identify alternatives buried in disconnected systems where no one knows if switching suppliers delays shipments or violates contracts.

    By hour 48, competitors who already modeled scenarios execute supplier switches while late movers face capacity constraints and premium pricing.

    Process intelligence changes that dynamic by allowing businesses to continuously model “what-if” scenarios, showing leaders how tariff changes cascade through suppliers, contracts, production lines, warehouses, and customers. When rates hit, companies can move within hours instead of days.
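
    As a rough illustration of what a tariff "what-if" comparison involves, the sketch below ranks supplier options by landed cost under two tariff scenarios. It is not Celonis functionality: the `SupplierOption` type, the helper functions, the supplier names, and every number are hypothetical, and the shock is modeled as a 25-percentage-point increase on one country's ad valorem rate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SupplierOption:
    """Hypothetical sourcing option; all fields are illustrative."""
    name: str
    country: str
    unit_cost: float         # USD per unit, before duty
    freight_per_unit: float  # USD per unit
    lead_time_days: int


def landed_cost(option, tariff_rates):
    """Unit cost plus ad valorem duty on the unit cost, plus freight."""
    duty = option.unit_cost * tariff_rates.get(option.country, 0.0)
    return option.unit_cost + duty + option.freight_per_unit


def rank_suppliers(options, tariff_rates, max_lead_time_days):
    """Drop options that miss the lead-time limit, then sort by landed cost."""
    feasible = [o for o in options if o.lead_time_days <= max_lead_time_days]
    return sorted(feasible, key=lambda o: landed_cost(o, tariff_rates))


# Made-up data for the example.
options = [
    SupplierOption("Supplier A", "CN", 10.00, 1.00, 30),
    SupplierOption("Supplier B", "MX", 11.80, 0.50, 12),
    SupplierOption("Supplier C", "VN", 10.50, 1.20, 35),
]
baseline = {"CN": 0.10, "MX": 0.00, "VN": 0.05}
shock = {**baseline, "CN": 0.35}  # assumed +25 points on one country

for label, rates in (("baseline", baseline), ("shock", shock)):
    ranked = rank_suppliers(options, rates, max_lead_time_days=40)
    print(label, [(o.name, round(landed_cost(o, rates), 2)) for o in ranked])
```

    In this toy data the cheapest option flips from Supplier A to Supplier C once the higher rate applies, and tightening the lead-time limit to 20 days leaves only Supplier B. A real model would also need the contract terms, capacity limits, and shipment dependencies the article describes.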

    No AI without PI: Why process intelligence is non-negotiable for supply chains

    AI and supply chains are mutually dependent: AI needs operational context, and supply chains need AI to keep pace with volatility. But here's the truth — there is no AI without PI. Without process intelligence, AI agents operate blindly.

    The ongoing SAP migration wave illustrates why. An estimated 85–90% of SAP customers are still moving from ECC to S/4HANA. Moving to newer databases doesn’t solve supply chain visibility — it provides faster access to the same fragmented data.

    Kerry Brown, a transformation evangelist at Celonis, sees this across industries.

    “Organizations are shifting from PeopleSoft to Oracle, or EBS to Fusion. The bulk is in SAP,” she explains. “But what they really need isn’t a new ERP. They need to understand how work actually flows across systems they already have.”

    That requires end-to-end operational context. Process intelligence provides this by enabling companies to extract and connect event data across systems, showing how processes execute in real time.

    This distinction becomes critical when deploying autonomous agents. When visibility is fragmented, autonomous agents can easily make decisions that appear rational locally but create downstream disruption. With real-time context, AI can operate with clarity and precision, and supply chains can stay ahead of tariff-driven disruption.

    Digital Twins: Powering real-time response

    The companies highlighted at Celosphere all applied the same principle: understand how processes run across systems in real time. Celonis PI creates a digital twin above existing systems, using its Process Intelligence Graph to link orders, shipments, invoices, and payments end-to-end. Dependencies that traditional integrations miss become visible. A delay in SAP instantly reveals its impact across Oracle, warehouse scheduling, and customer delivery commitments.

    “The platform brings together process data spanning systems and departments, enriched with business context that powers AI agents to transform operations effectively,” says Daniel Brown, Chief Product Officer at Celonis.

    With this cross-system awareness, Celonis coordinates actions across complex workflows involving AI agents, humans, and automations — especially critical when tariffs force rapid decisions about suppliers, shipments, and customers.

    Zero-copy integration enables instant modeling

    A key advancement unveiled at Celosphere — zero-copy integration with Databricks — removes another barrier. Traditionally, analyzing supply chain data meant copying from source systems into central warehouses, creating data latency.

    Celonis Data Core now integrates directly with platforms like Databricks and Microsoft Fabric, querying billions of records in near real time without duplication. When trade policy shifts, companies model alternatives instantly, not after overnight data refresh cycles.

    Enhanced Task Mining extends this by connecting desktop activity — keystrokes, mouse clicks, screen scrolls — to business processes. This exposes manual work invisible to system logs: spreadsheet gymnastics, email negotiations, phone calls that keep supply chains moving during urgent changes.

    Competitive advantage in volatile markets

    Most companies can’t rip out and replace systems running critical operations — nor should they. Process intelligence offers a different path: compose workflows from existing systems, deploy AI where it creates value, and adapt continuously as conditions change. This “Free the Process” movement liberates companies from rigid architectures without forcing wholesale replacement.

    As global trade volatility intensifies, the companies that model scenarios in advance will move faster, make smarter decisions, and turn tariff chaos into competitive advantage, all while existing ERPs keep running.

    When the next wave of tariffs hits — and it will — companies won’t have days to respond. They’ll have hours. The question isn’t whether your ERP captures the data. It’s whether your systems connect the dots fast enough to matter.


    Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.

  • Gemini 3 Pro scores 69% trust in blinded testing up from 16% for Gemini 2.5: The case for evaluating AI on real-world trust, not academic benchmarks

    Just a few short weeks ago, Google debuted its Gemini 3 model, claiming it scored a leadership position in multiple AI benchmarks. But the challenge with vendor-provided benchmarks is that they are just that — vendor-provided.

    A new vendor-neutral evaluation from Prolific, however, puts Gemini 3 at the top of the leaderboard. This isn't on a set of academic benchmarks; rather, it's on a set of real-world attributes that actual users and organizations care about. 

    Prolific was founded by researchers at the University of Oxford. The company delivers high-quality, reliable human data to power rigorous research and ethical AI development. The company's “HUMAINE benchmark” applies this approach by using representative human sampling and blind testing to rigorously compare AI models across a variety of user scenarios, measuring not just technical performance but also user trust, adaptability and communication style.

    The latest HUMAINE test evaluated 26,000 users in a blind test of models. In the evaluation, Gemini 3 Pro's trust score surged from 16% to 69%, the highest ever recorded by Prolific. Gemini 3 now ranks number one overall in trust, ethics and safety 69% of the time across demographic subgroups, compared to its predecessor Gemini 2.5 Pro, which held the top spot only 16% of the time.

    Overall, Gemini 3 ranked first in three of four evaluation categories: performance and reasoning; interaction and adaptiveness; and trust and safety. It lost only on communication style, where DeepSeek V3 topped preferences at 43%. The HUMAINE test also showed that Gemini 3 performed consistently well across 22 different demographic user groups, including variations in age, sex, ethnicity and political orientation. The evaluation also found that users are now five times more likely to choose the model in head-to-head blind comparisons.

    But the ranking matters less than why it won.

    "It's the consistency across a very wide range of different use cases, and a personality and a style that appeals across a wide range of different user types," Phelim Bradley, co-founder and CEO of Prolific, told VentureBeat. "Although in some specific instances, other models are preferred by either small subgroups or on a particular conversation type, it's the breadth of knowledge and the flexibility of the model across a range of different use cases and audience types that allowed it to win this particular benchmark."

    How blinded testing reveals what academic benchmarks miss

    HUMAINE's methodology exposes gaps in how the industry evaluates models. Users interact with two models simultaneously in multi-turn conversations. They don't know which vendors power each response. They discuss whatever topics matter to them, not predetermined test questions.
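
    The blinding step described here can be sketched in a few lines. This is only an illustration of the general technique, not Prolific's implementation: the function names, the neutral labels, and the model names are all assumptions.

```python
import random


def blind_pair(model_a, model_b, rng):
    """Assign the two models to neutral labels in random order."""
    pair = [model_a, model_b]
    rng.shuffle(pair)
    return {"Response 1": pair[0], "Response 2": pair[1]}


def win_rates(votes):
    """votes: list of (winner, loser) names. Returns wins / comparisons."""
    wins, games = {}, {}
    for winner, loser in votes:
        wins[winner] = wins.get(winner, 0) + 1
        wins.setdefault(loser, 0)
        for model in (winner, loser):
            games[model] = games.get(model, 0) + 1
    return {model: wins[model] / games[model] for model in games}


rng = random.Random(0)  # seeded only so the example is repeatable
assignment = blind_pair("model-x", "model-y", rng)

# The participant sees only the neutral labels; the vote is unblinded afterward.
picked_label = "Response 1"  # hypothetical participant choice
winner = assignment[picked_label]
loser = next(m for m in assignment.values() if m != winner)

print(win_rates([(winner, loser)]))
```

    Randomizing the display order per conversation also guards against position bias, where participants favor whichever response appears first.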

    It's the sample itself that matters. HUMAINE uses representative sampling across U.S. and UK populations, controlling for age, sex, ethnicity and political orientation. This reveals something static benchmarks can't capture: Model performance varies by audience.

    "If you take an AI leaderboard, the majority of them still could have a fairly static list," Bradley said. "But for us, if you control for the audience, we end up with a slightly different leaderboard, whether you're looking at a left-leaning sample, right-leaning sample, U.S., UK. And I think age was actually the most different stated condition in our experiment."

    For enterprises deploying AI across diverse employee populations, this matters. A model that performs well for one demographic may underperform for another.

    The methodology also addresses a fundamental question in AI evaluation: Why use human judges at all when AI could evaluate itself? Bradley noted that his firm does use AI judges in certain use cases, although he stressed that human evaluation is still the critical factor.

    "We see the biggest benefit coming from smart orchestration of both LLM judge and human data, both have strengths and weaknesses, that, when smartly combined, do better together," said Bradley. "But we still think that human data is where the alpha is. We're still extremely bullish that human data and human intelligence is required to be in the loop."

    What trust means in AI evaluation

    Trust, ethics and safety measures user confidence in reliability, factual accuracy and responsible behavior. In HUMAINE's methodology, trust isn't a vendor claim or a technical metric — it's what users report after blinded conversations with competing models.

    The 69% figure is a probability: how often the model ranks first on trust, ethics and safety across demographic subgroups. That consistency matters more than an aggregate score because organizations serve diverse populations.
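
    One simple way to read a "ranks first X% of the time across subgroups" figure is as the share of subgroups in which a model has the top score. The sketch below computes that share on made-up data. It is an assumed interpretation for illustration; the article does not spell out how Prolific derives its probability, and the subgroup names, model names, and scores are invented.

```python
def top_rank_share(scores_by_subgroup, model):
    """Fraction of subgroups in which `model` has the highest score."""
    firsts = sum(
        1
        for scores in scores_by_subgroup.values()
        if max(scores, key=scores.get) == model
    )
    return firsts / len(scores_by_subgroup)


# Invented trust scores per subgroup, for illustration only.
scores_by_subgroup = {
    "age 18-34": {"model-x": 0.71, "model-y": 0.64},
    "age 35-54": {"model-x": 0.69, "model-y": 0.66},
    "age 55+": {"model-x": 0.58, "model-y": 0.63},
    "uk": {"model-x": 0.70, "model-y": 0.61},
}

print(top_rank_share(scores_by_subgroup, "model-x"))
```

    A model can lead on the pooled average and still lose in some subgroups, which is why a per-subgroup view says more about consistency than a single aggregate number.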

    "There was no awareness that they were using Gemini in this scenario," Bradley said. "It was based only on the blinded multi-turn response."

    This separates perceived trust from earned trust. Users judged model outputs without knowing which vendor produced them, eliminating Google's brand advantage. For customer-facing deployments where the AI vendor remains invisible to end users, this distinction matters.

    What enterprises should do now

    Enterprises comparing models should start by adopting an evaluation framework that works.

    "It is increasingly challenging to evaluate models exclusively based on vibes," Bradley said. "I think increasingly we need more rigorous, scientific approaches to truly understand how these models are performing."

    The HUMAINE data provides a framework: Test for consistency across use cases and user demographics, not just peak performance on specific tasks. Blind the testing to separate model quality from brand perception. Use representative samples that match your actual user population. Plan for continuous evaluation as models change.

    For enterprises looking to deploy AI at scale, this means moving beyond "which model is best" to "which model is best for our specific use case, user demographics and required attributes."

    The rigor of representative sampling and blind testing provides the data to make that determination, something technical benchmarks and vibes-based evaluation cannot deliver.

  • Mistral launches Mistral 3, a family of open models designed to run on laptops, drones, and edge devices

    Mistral AI, Europe's most prominent artificial intelligence startup, is releasing its most ambitious product suite to date: a family of 10 open-source models designed to run everywhere from smartphones and autonomous drones to enterprise cloud systems, marking a major escalation in the company's challenge to both U.S. tech giants and surging Chinese competitors.

    The Mistral 3 family, launching today, includes a new flagship model called Mistral Large 3 and a suite of smaller "Ministral 3" models optimized for edge computing applications. All models will be released under the permissive Apache 2.0 license, allowing unrestricted commercial use — a sharp contrast to the closed systems offered by OpenAI, Google, and Anthropic.

    The release is a pointed bet by Mistral that the future of artificial intelligence lies not in building ever-larger proprietary systems, but in offering businesses maximum flexibility to customize and deploy AI tailored to their specific needs, often using smaller models that can run without cloud connectivity.

    "The gap between closed and open source is getting smaller, because more and more people are contributing to open source, which is great," Guillaume Lample, Mistral's chief scientist and co-founder, said in an exclusive interview with VentureBeat. "We are catching up fast."

    Why Mistral is choosing flexibility over frontier performance in the AI race

    The strategic calculus behind Mistral 3 diverges sharply from recent model releases by industry leaders. While OpenAI, Google, and Anthropic have focused recent launches on increasingly capable "agentic" systems — AI that can autonomously execute complex multi-step tasks — Mistral is prioritizing breadth, efficiency, and what Lample calls "distributed intelligence."

    Mistral Large 3, the flagship model, employs a Mixture of Experts architecture with 41 billion active parameters drawn from a total pool of 675 billion parameters. The model can process both text and images, handles context windows up to 256,000 tokens, and was trained with particular emphasis on non-English languages — a rarity among frontier AI systems.
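
    To put the Mixture of Experts figures in proportion, the arithmetic below uses only the two parameter counts quoted above. How the model routes tokens to experts is not described in the article, so nothing about routing is assumed here.

```python
TOTAL_PARAMS = 675e9   # total parameter pool, as quoted
ACTIVE_PARAMS = 41e9   # parameters active per token, as quoted

active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
print(f"Active share per token: {active_fraction:.1%}")  # prints "Active share per token: 6.1%"
```

    In other words, roughly one parameter in sixteen participates in any single forward pass, which is where the architecture's compute savings come from. All 675 billion parameters still have to be stored.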

    "Most AI labs focus on their native language, but Mistral Large 3 was trained on a wide variety of languages, making advanced AI useful for billions who speak different native languages," the company said in a statement reviewed ahead of the announcement.

    But the more significant departure lies in the Ministral 3 lineup: nine compact models across three sizes (14 billion, 8 billion, and 3 billion parameters) and three variants tailored for different use cases. Each variant serves a distinct purpose: base models for extensive customization, instruction-tuned models for general chat and task completion, and reasoning-optimized models for complex logic requiring step-by-step deliberation.

    The smallest Ministral 3 models can run on devices with as little as 4 gigabytes of video memory using 4-bit quantization — making frontier AI capabilities accessible on standard laptops, smartphones, and embedded systems without requiring expensive cloud infrastructure or even internet connectivity. This approach reflects Mistral's belief that AI's next evolution will be defined not by sheer scale, but by ubiquity: models small enough to run on drones, in vehicles, in robots, and on consumer devices.
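
    A back-of-the-envelope estimate shows why 4-bit quantization matters for these sizes. The helper below is a hypothetical illustration that counts weight storage only. It ignores the KV cache, activations, and the per-block scale factors that real quantization formats add, so actual memory use will be higher.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Ministral 3 sizes quoted in the article.
for label, params in (("3B", 3e9), ("8B", 8e9), ("14B", 14e9)):
    four_bit = weight_memory_gb(params, 4)
    sixteen_bit = weight_memory_gb(params, 16)
    print(f"{label}: {four_bit:.1f} GB at 4-bit vs {sixteen_bit:.1f} GB at 16-bit")
```

    Under these assumptions the 3-billion-parameter model needs about 1.5 GB for weights at 4 bits, which leaves headroom within a 4 GB budget, while the same model at 16 bits would need about 6 GB.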

    How fine-tuned small models beat expensive large models for enterprise customers

    Lample's comments reveal a business model fundamentally different from that of closed-source competitors. Rather than competing primarily on benchmark performance, Mistral is targeting enterprise customers frustrated by the cost and inflexibility of proprietary systems.

    "Sometimes customers say, 'Is there a use case where the best closed-source model isn't working?' If that's the case, then they're essentially stuck," Lample explained. "There's nothing they can do. It's the best model available, and it's not working out of the box."

    This is where Mistral's approach diverges. When a generic model fails, the company deploys engineering teams to work directly with customers, analyzing specific problems, creating synthetic training data, and fine-tuning smaller models to outperform larger general-purpose systems on narrow tasks.

    "In more than 90% of cases, a small model can do the job, especially if it's fine-tuned. It doesn't have to be a model with hundreds of billions of parameters, just a 14-billion or 24-billion parameter model," Lample said. "So it's not only much cheaper, but also faster, plus you have all the benefits: you don't need to worry about privacy, latency, reliability, and so on."

    The economic argument is compelling. Multiple enterprise customers have approached Mistral after building prototypes with expensive closed-source models, only to find deployment costs prohibitive at scale, according to Lample.

    "They come back to us a couple of months later because they realize, 'We built this prototype, but it's way too slow and way too expensive,'" he said.

    Where Mistral 3 fits in the increasingly crowded open-source AI market

    Mistral's release comes amid fierce competition on multiple fronts. OpenAI recently released GPT-5.1 with enhanced agentic capabilities. Google launched Gemini 3 with improved multimodal understanding. Anthropic released Opus 4.5 on the same day as this interview, with similar agent-focused features.

    But Lample argues those comparisons miss the point. "It's a little bit behind. But I think what matters is that we are catching up fast," he acknowledged regarding performance against closed models. "I think we are maybe playing a strategic long game."

    That long game involves a different competitive set: primarily open-source models from Chinese companies like DeepSeek and Alibaba's Qwen series, which have made remarkable strides in recent months.

    Mistral differentiates itself through multilingual capabilities that extend far beyond English or Chinese, multimodal integration handling both text and images in a unified model, and what the company characterizes as superior customization through easier fine-tuning.

    "One key difference with the models themselves is that we focused much more on multilinguality," Lample said. "If you look at all the top models from [Chinese competitors], they're all text-only. They have visual models as well, but as separate systems. We wanted to integrate everything into a single model."

    The multilingual emphasis aligns with Mistral's broader positioning as a European AI champion focused on digital sovereignty — the principle that organizations and nations should maintain control over their AI infrastructure and data.

    Building beyond models: Mistral's full-stack enterprise AI platform strategy

    Mistral 3's release builds on an increasingly comprehensive enterprise AI platform that extends well beyond model development. The company has assembled a full-stack offering that differentiates it from pure model providers.

    Recent product launches include Mistral Agents API, which combines language models with built-in connectors for code execution, web search, image generation, and persistent memory across conversations; Magistral, the company's reasoning model designed for domain-specific, transparent, and multilingual reasoning; and Mistral Code, an AI-powered coding assistant bundling models, an in-IDE assistant, and local deployment options with enterprise tooling.

    The consumer-facing Le Chat assistant has been enhanced with Deep Research mode for structured research reports, voice capabilities, and Projects for organizing conversations into context-rich folders. More recently, Le Chat gained a connector directory with 20+ enterprise integrations powered by the Model Context Protocol (MCP), spanning tools like Databricks, Snowflake, GitHub, Atlassian, Asana, and Stripe.

    In October, Mistral unveiled AI Studio, a production AI platform providing observability, agent runtime, and AI registry capabilities to help enterprises track output changes, monitor usage, run evaluations, and fine-tune models using proprietary data.

    Mistral now positions itself as a full-stack, global enterprise AI company, offering not just models but an application-building layer through AI Studio, compute infrastructure, and forward-deployed engineers to help businesses realize return on investment.

    Why open source AI matters for customization, transparency and sovereignty

    Mistral's commitment to open-source development under permissive licenses is both an ideological stance and a competitive strategy in an AI landscape increasingly dominated by closed systems.

    Lample elaborated on the practical benefits: "I think something that people don't realize — but our customers know this very well — is how much better any model can actually improve if you fine tune it on the task of interest. There's a huge gap between a base model and one that's fine-tuned for a specific task, and in many cases, it outperforms the closed-source model."

    The approach enables capabilities impossible with closed systems: organizations can fine-tune models on proprietary data that never leaves their infrastructure, customize architectures for specific workflows, and maintain complete transparency into how AI systems make decisions — critical for regulated industries like finance, healthcare, and defense.

    This positioning has attracted government and public sector partnerships. The company launched "AI for Citizens" in July 2025, an initiative to "help States and public institutions strategically harness AI for their people by transforming public services" and has secured strategic partnerships with France's army and job agency, Luxembourg's government, and various European public sector organizations.

    Mistral's transatlantic AI collaboration goes beyond European borders

    While Mistral is frequently characterized as Europe's answer to OpenAI, the company views itself as a transatlantic collaboration rather than a purely European venture. The company has teams across both continents, with co-founders spending significant time with customers and partners in the United States, and these models are being trained in partnership with U.S.-based teams and infrastructure providers.

    This transatlantic positioning may prove strategically important as geopolitical tensions around AI development intensify. The recent ASML investment, part of a €1.7 billion (roughly $2 billion) funding round led by the Dutch semiconductor equipment manufacturer, signals deepening collaboration across the Western semiconductor and AI value chain at a moment when both Europe and the United States are seeking to reduce dependence on Chinese technology.

    Mistral's investor base reflects this dynamic: the Series C round included participation from U.S. firms Andreessen Horowitz, General Catalyst, Lightspeed, and Index Ventures alongside European investors like France's state-backed Bpifrance and global players like DST Global and Nvidia.

    Founded in May 2023 by former Google DeepMind and Meta researchers, Mistral has raised roughly $1.05 billion (€1 billion) in funding. The company was valued at $6 billion in a June 2024 Series B, then more than doubled its valuation in a September Series C.

    Can customization and efficiency beat raw performance in enterprise AI?

    The Mistral 3 release crystallizes a fundamental question facing the AI industry: Will enterprises ultimately prioritize the absolute cutting-edge capabilities of proprietary systems, or will they choose open, customizable alternatives that offer greater control, lower costs, and independence from big tech platforms?

    Mistral's answer is unambiguous. The company is betting that as AI moves from prototype to production, the factors that matter most shift dramatically. Raw benchmark scores matter less than total cost of ownership. Slight performance edges matter less than the ability to fine-tune for specific workflows. Cloud-based convenience matters less than data sovereignty and edge deployment.

    It's a wager with significant risks. Despite Lample's optimism about closing the performance gap, Mistral's models still trail the absolute frontier. The company's revenue, while growing, reportedly remains modest relative to its nearly $14 billion valuation. And competition intensifies from both well-funded Chinese rivals making remarkable open-source progress and U.S. tech giants increasingly offering their own smaller, more efficient models.

    But if Mistral is right — if the future of AI looks less like a handful of cloud-based oracles and more like millions of specialized systems running everywhere from factory floors to smartphones — then the company has positioned itself at the center of that transformation.

    The release of Mistral 3 is the most comprehensive expression yet of that vision: 10 models, spanning every size category, optimized for every deployment scenario, available to anyone who wants to build with them.

    Whether "distributed intelligence" becomes the industry's dominant paradigm or remains a compelling alternative serving a narrower market will determine not just Mistral's fate, but the broader question of who controls the AI future — and whether that future will be open.

    For now, the race is on. And Mistral is betting it can win not by building the biggest model, but by building everywhere else.

  • Amazon’s new AI can code for days without human help. What does that mean for software engineers?

    Amazon Web Services on Tuesday announced a new class of artificial intelligence systems called "frontier agents" that can work autonomously for hours or even days without human intervention, representing one of the most ambitious attempts yet to automate the full software development lifecycle.

    The announcement, made during AWS CEO Matt Garman's keynote address at the company's annual re:Invent conference, introduces three specialized AI agents designed to act as virtual team members: Kiro autonomous agent for software development, AWS Security Agent for application security, and AWS DevOps Agent for IT operations.

    The move signals Amazon's intent to leap ahead in the intensifying competition to build AI systems capable of performing complex, multi-step tasks that currently require teams of skilled engineers.

    "We see frontier agents as a completely new class of agents," said Deepak Singh, vice president of developer agents and experiences at Amazon, in an interview ahead of the announcement. "They're fundamentally designed to work for hours and days. You're not giving them a problem that you want finished in the next five minutes. You're giving them complex challenges that they may have to think about, try different solutions, and get to the right conclusion — and they should do that without intervention."

    Why Amazon believes its new agents leave existing AI coding tools behind

    The frontier agents differ from existing AI coding assistants like GitHub Copilot or Amazon's own CodeWhisperer in several fundamental ways.

    Current AI coding tools, while powerful, require engineers to drive every interaction. Developers must write prompts, provide context, and manually coordinate work across different code repositories. When switching between tasks, the AI loses context and must start fresh.

    The new frontier agents, by contrast, maintain persistent memory across sessions and continuously learn from an organization's codebase, documentation, and team communications. They can independently determine which code repositories require changes, work on multiple files simultaneously, and coordinate complex transformations spanning dozens of microservices.

    "With a current agent, you would go microservice by microservice, making changes one at a time, and each change would be a different session with no shared context," Singh explained. "With a frontier agent, you say, 'I need to solve this broad problem.' You point it to the right application, and it decides which repos need changes."

    The agents exhibit three defining characteristics that AWS believes set them apart: autonomy in decision-making, the ability to scale by spawning multiple agents to work on different aspects of a problem simultaneously, and the capacity to operate independently for extended periods.

    "A frontier agent can decide to spin up 10 versions of itself, all working on different parts of the problem at once," Singh said.

    How each of the three frontier agents tackles a different phase of development

    Kiro autonomous agent serves as a virtual developer that maintains context across coding sessions and learns from an organization's pull requests, code reviews, and technical discussions. Teams can connect it to GitHub, Jira, Slack, and internal documentation systems. The agent then acts like a teammate, accepting task assignments and working independently until it either completes the work or requires human guidance.

    AWS Security Agent embeds security expertise throughout the development process, automatically reviewing design documents and scanning pull requests against organizational security requirements. Perhaps most significantly, it transforms penetration testing from a weeks-long manual process into an on-demand capability that completes in hours.

    SmugMug, a photo hosting platform, has already deployed the security agent. "AWS Security Agent helped catch a business logic bug that no existing tools would have caught, exposing information improperly," said Andres Ruiz, staff software engineer at the company. "To any other tool, this would have been invisible. But the ability for Security Agent to contextualize the information, parse the API response, and find the unexpected information there represents a leap forward in automated security testing."

    AWS DevOps Agent functions as an always-on operations team member, responding instantly to incidents and using its accumulated knowledge to identify root causes. It connects to observability tools including Amazon CloudWatch, Datadog, Dynatrace, New Relic, and Splunk, along with runbooks and deployment pipelines.

    Commonwealth Bank of Australia tested the DevOps agent by replicating a complex network and identity management issue that typically requires hours for experienced engineers to diagnose. The agent identified the root cause in under 15 minutes.

    "AWS DevOps Agent thinks and acts like a seasoned DevOps engineer, helping our engineers build a banking infrastructure that's faster, more resilient, and designed to deliver better experiences for our customers," said Jason Sandry, head of cloud services at Commonwealth Bank.

    Amazon makes its case against Google and Microsoft in the AI coding wars

    The announcement arrives amid a fierce battle among technology giants to dominate the emerging market for AI-powered development tools. Google has made significant noise in recent weeks with its own AI coding capabilities, while Microsoft continues to advance GitHub Copilot and its broader AI development toolkit.

    Singh argued that AWS holds distinct advantages rooted in the company's 20-year history operating cloud infrastructure and Amazon's own massive software engineering organization.

    "AWS has been the cloud of choice for 20 years, so we have two decades of knowledge building and running it, and working with customers who've been building and running applications on it," Singh said. "The learnings from operating AWS, the knowledge our customers have, the experience we've built using these tools ourselves every day to build real-world applications—all of that is embodied in these frontier agents."

    He drew a distinction between tools suitable for prototypes versus production systems. "There's a lot of things out there that you can use to build your prototype or your toy application. But if you want to build production applications, there's a lot of knowledge that we bring in as AWS that apply here."

    The safeguards Amazon built to keep autonomous agents from going rogue

    The prospect of AI systems operating autonomously for days raises immediate questions about what happens when they go off track. Singh described multiple safeguards built into the system.

    All learnings accumulated by the agents are logged and visible, allowing engineers to understand what knowledge influences the agent's decisions. Teams can even remove specific learnings if they discover the agent has absorbed incorrect information from team communications.

    "You can go in and even redact that from its knowledge like, 'No, we don't want you to ever use this knowledge,'" Singh said. "You can look at the knowledge like it's almost—it's like looking at your neurons inside your brain. You can disconnect some."

    Engineers can also monitor agent activity in real-time and intervene when necessary, either redirecting the agent or taking over entirely. Most critically, the agents never commit code directly to production systems. That responsibility remains with human engineers.

    "These agents are never going to check the code into production. That is still the human's responsibility," Singh emphasized. "You are still, as an engineer, responsible for the code you're checking in, whether it's generated by you or by an agent working autonomously."

    What frontier agents mean for the future of software engineering jobs

    The announcement inevitably raises concerns about the impact on software engineering jobs. Singh pushed back against the notion that frontier agents will replace developers, framing them instead as tools that amplify human capabilities.

    "Software engineering is craft. What's changing is not, 'Hey, agents are doing all the work.' The craft of software engineering is changing—how you use agents, how do you set up your code base, how do you set up your prompts, how do you set up your rules, how do you set up your knowledge bases so that agents can be effective," he said.

    Singh noted that senior engineers who had drifted away from hands-on coding are now writing more code than ever. "It's actually easier for them to become software engineers," he said.

    He pointed to an internal example where a team completed a project in 78 days that would have taken 18 months using traditional practices. "Because they were able to use AI. And the thing that made it work was not just the fact that they were using AI, but how they organized and set up their practices of how they built that software were maximized around that."

    How Amazon plans to make AI-generated code more trustworthy over time

    Singh outlined several areas where frontier agents will evolve over the coming years. Multi-agent architectures, where systems of specialized agents coordinate to solve complex problems, represent a major frontier. So does the integration of formal verification techniques to increase confidence in AI-generated code.

    AWS recently introduced property-based testing in Kiro, which uses automated reasoning to extract testable properties from specifications and generate thousands of test scenarios automatically.

    "If you have a shopping cart application, every way an order can be canceled, and how it might be canceled, and the way refunds are handled in Germany versus the US—if you're writing a unit test, maybe two, Germany and US, but now, because you have this property-based testing approach, your agent can create a scenario for every country you operate in and test all of them automatically for you," Singh explained.
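
    To make Singh's shopping-cart example concrete, here is a minimal sketch of the property-based idea in plain Python. It is not Kiro's implementation: Kiro is described as using automated reasoning to extract properties from specifications, whereas this sketch hand-writes two properties and uses simple random generation. The refund rules, the country table, and every function name are hypothetical.

```python
import random
from dataclasses import dataclass

# Hypothetical refund rules: country -> (refund window in days, restocking fee rate).
REFUND_RULES = {
    "US": (30, 0.10),
    "DE": (14, 0.00),
    "JP": (8, 0.05),
}


@dataclass(frozen=True)
class Order:
    country: str
    total_cents: int
    days_since_purchase: int


def refund_cents(order: Order) -> int:
    """Return the refund owed when an order is canceled (illustrative logic only)."""
    window_days, fee_rate = REFUND_RULES[order.country]
    if order.days_since_purchase > window_days:
        return 0
    return order.total_cents - round(order.total_cents * fee_rate)


def random_order(rng: random.Random) -> Order:
    """Generate an arbitrary order for any supported country."""
    return Order(
        country=rng.choice(sorted(REFUND_RULES)),
        total_cents=rng.randint(0, 1_000_000),
        days_since_purchase=rng.randint(0, 90),
    )


def check_refund_properties(trials: int = 5_000, seed: int = 0) -> int:
    """Check properties that must hold for every generated order."""
    rng = random.Random(seed)
    for _ in range(trials):
        order = random_order(rng)
        refund = refund_cents(order)
        # Property 1: a refund is never negative and never exceeds what was paid.
        assert 0 <= refund <= order.total_cents, order
        # Property 2: nothing is refunded once the country's window has closed.
        window_days, _ = REFUND_RULES[order.country]
        if order.days_since_purchase > window_days:
            assert refund == 0, order
    return trials
```

    Adding a country to the table automatically widens the generated scenarios, which is the contrast Singh draws with writing one unit test per country by hand.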

    Building trust in autonomous systems remains the central challenge. "Right now you still require tons of human guardrails at every step to make sure that the right thing happens. And as we get better at these techniques, you will use less and less, and you'll be able to trust the agents a lot more," he said.

    Amazon's bigger bet on autonomous AI stretches far beyond writing code

    The frontier agents announcement arrived alongside a cascade of other news at re:Invent 2025. AWS kicked off the conference with major announcements on agentic AI capabilities, customer service innovations, and multicloud networking.

    Amazon expanded its Nova portfolio with four new models delivering industry-leading price-performance across reasoning, multimodal processing, conversational AI, code generation, and agentic tasks. Nova Forge pioneers "open training," giving organizations access to pre-trained model checkpoints and the ability to blend proprietary data with Amazon Nova-curated datasets.

    AWS also added 18 new open weight models to Amazon Bedrock, reinforcing its commitment to offering a broad selection of fully managed models from leading AI providers. The launch includes new models from Mistral AI, Google's Gemma 3, MiniMax's M2, NVIDIA's Nemotron, and OpenAI's GPT OSS Safeguard.

    On the infrastructure side, Amazon EC2 Trn3 UltraServers, powered by AWS's first 3nm AI chip, pack up to 144 Trainium3 chips into a single integrated system, delivering up to 4.4x more compute performance and 4x greater energy efficiency than the previous generation. AWS AI Factories provides enterprises and government organizations with dedicated AWS AI infrastructure deployed in their own data centers, combining NVIDIA GPUs, Trainium chips, AWS networking, and AI services like Amazon Bedrock and SageMaker AI.

    All three frontier agents launched in preview on Tuesday. Pricing will be announced when the services reach general availability.

    Singh made clear the company sees applications far beyond coding. "These are the first frontier agents we are releasing, and they're in the software development lifecycle," he said. "The problems and use cases for frontier agents—these agents that are long running, capable of autonomy, thinking, always learning and improving—can be applied to many, many domains."

    Amazon, after all, operates satellite networks, runs robotics warehouses, and manages one of the world's largest e-commerce platforms. If autonomous agents can learn to write code on their own, the company is betting they can eventually learn to do just about anything else.

  • MIT offshoot Liquid AI releases blueprint for enterprise-grade small-model training

    When Liquid AI, a startup founded by MIT computer scientists back in 2023, introduced its Liquid Foundation Models series 2 (LFM2) in July 2025, the pitch was straightforward: deliver the fastest on-device foundation models on the market using the new "liquid" architecture, with training and inference efficiency that made small models a serious alternative to cloud-only large language models (LLMs) such as OpenAI's GPT series and Google's Gemini.

    The initial release shipped dense checkpoints at 350M, 700M, and 1.2B parameters, a hybrid architecture heavily weighted toward gated short convolutions, and benchmark numbers that placed LFM2 ahead of similarly sized competitors like Qwen3, Llama 3.2, and Gemma 3 on both quality and CPU throughput. The message to enterprises was clear: real-time, privacy-preserving AI on phones, laptops, and vehicles no longer required sacrificing capability for latency.

    In the months since that launch, Liquid has expanded LFM2 into a broader product line — adding task-and-domain-specialized variants, a small video ingestion and analysis model, and an edge-focused deployment stack called LEAP — and positioned the models as the control layer for on-device and on-prem agentic systems.

    Now, with the publication of the detailed, 51-page LFM2 technical report on arXiv, the company is going a step further: making public the architecture search process, training data mixture, distillation objective, curriculum strategy, and post-training pipeline behind those models.

    And unlike earlier open models, LFM2 is built around a repeatable recipe: a hardware-in-the-loop search process, a training curriculum that compensates for smaller parameter budgets, and a post-training pipeline tuned for instruction following and tool use.

    Rather than just offering weights and an API, Liquid is effectively publishing a detailed blueprint that other organizations can use as a reference for training their own small, efficient models from scratch, tuned to their own hardware and deployment constraints.

    A model family designed around real constraints, not GPU labs

    The technical report begins with a premise enterprises are intimately familiar with: real AI systems hit limits long before benchmarks do. Latency budgets, peak memory ceilings, and thermal throttling define what can actually run in production—especially on laptops, tablets, commodity servers, and mobile devices.

    To address this, Liquid AI performed architecture search directly on target hardware, including Snapdragon mobile SoCs and Ryzen laptop CPUs. The search converged on the same design across model sizes: a minimal hybrid architecture dominated by gated short convolution blocks and a small number of grouped-query attention (GQA) layers. This design was repeatedly selected over more exotic linear-attention and SSM hybrids because it delivered a better quality-latency-memory Pareto profile under real device conditions.

    This matters for enterprise teams in three ways:

    1. Predictability. The architecture is simple, parameter-efficient, and stable across model sizes from 350M to 2.6B.

    2. Operational portability. Dense and MoE variants share the same structural backbone, simplifying deployment across mixed hardware fleets.

    3. On-device feasibility. Prefill and decode throughput on CPUs surpasses that of comparable open models by roughly 2× in many cases, reducing the need to offload routine tasks to cloud inference endpoints.

    Instead of optimizing for academic novelty, the report reads as a systematic attempt to design models enterprises can actually ship.

    That practicality stands out in a field where many open models quietly assume access to multi-H100 clusters during inference.

    A training pipeline tuned for enterprise-relevant behavior

    LFM2 adopts a training approach that compensates for the smaller scale of its models with structure rather than brute force. Key elements include:

    • 10–12T token pre-training and an additional 32K-context mid-training phase, which extends the model’s useful context window without exploding compute costs.

    • A decoupled Top-K knowledge distillation objective that sidesteps the instability of standard KL distillation when teachers provide only partial logits.

    • A three-stage post-training sequence—SFT, length-normalized preference alignment, and model merging—designed to produce more reliable instruction following and tool-use behavior.

    For enterprise AI developers, the significance is that LFM2 models behave less like “tiny LLMs” and more like practical agents able to follow structured formats, adhere to JSON schemas, and manage multi-turn chat flows. Many open models at similar sizes fail not due to lack of reasoning ability, but due to brittle adherence to instruction templates. The LFM2 post-training recipe directly targets these rough edges.

    In other words: Liquid AI optimized small models for operational reliability, not just scoreboards.

    Multimodality designed for device constraints, not lab demos

    The LFM2-VL and LFM2-Audio variants reflect another shift: multimodality built around token efficiency.

    Rather than embedding a massive vision transformer directly into an LLM, LFM2-VL attaches a SigLIP2 encoder through a connector that aggressively reduces visual token count via PixelUnshuffle. High-resolution inputs automatically trigger dynamic tiling, keeping token budgets controllable even on mobile hardware. LFM2-Audio uses a bifurcated audio path—one for embeddings, one for generation—supporting real-time transcription or speech-to-speech on modest CPUs.
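
    The token reduction from PixelUnshuffle can be illustrated with a small sketch. PixelUnshuffle is a space-to-depth rearrangement: each block of neighboring patches is folded into one position with a longer feature vector. The code below is a plain-Python illustration of that general operation, not Liquid AI's connector; the grid sizes, the channel ordering within a merged vector, and the function names are assumptions.

```python
def pixel_unshuffle(grid, factor):
    """Fold each factor x factor block of feature vectors into one longer vector.

    grid is a height x width list of per-patch feature vectors (lists of numbers).
    The output has (height / factor) x (width / factor) positions, each holding
    factor * factor times as many channels.
    """
    height, width = len(grid), len(grid[0])
    if height % factor or width % factor:
        raise ValueError("grid dimensions must be divisible by factor")
    out = []
    for top in range(0, height, factor):
        row = []
        for left in range(0, width, factor):
            merged = []
            # Concatenate the block's vectors in row-major order (an arbitrary choice here).
            for dy in range(factor):
                for dx in range(factor):
                    merged.extend(grid[top + dy][left + dx])
            row.append(merged)
        out.append(row)
    return out


def token_count(grid):
    """Number of spatial positions, i.e. visual tokens handed to the language model."""
    return len(grid) * len(grid[0])


# A toy 4 x 4 grid of 2-channel patch features.
patches = [[[y, x] for x in range(4)] for y in range(4)]
folded = pixel_unshuffle(patches, 2)
```

    With a factor of 2, the toy grid drops from 16 positions to 4 while each position grows from 2 to 8 channels, so no information is discarded; the sequence the language model must attend over simply gets shorter.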

    For enterprise platform architects, this design points toward a practical future where:

    • document understanding happens directly on endpoints such as field devices;

    • audio transcription and speech agents run locally for privacy compliance;

    • multimodal agents operate within fixed latency envelopes without streaming data off-device.

    The through-line is the same: multimodal capability without requiring a GPU farm.

    Retrieval models built for agent systems, not legacy search

    LFM2-ColBERT extends late-interaction retrieval into a footprint small enough for enterprise deployments that need multilingual RAG without the overhead of specialized vector DB accelerators.
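
    For readers unfamiliar with late interaction, the scoring rule used by ColBERT-style retrievers is simple to sketch: every query token keeps its own embedding, each one is matched to its best document token, and the best-match similarities are summed. The snippet below shows that MaxSim rule with toy vectors; the embeddings and document names are made up, and a real system would obtain them from the encoder.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score: sum over query tokens of the best document-token match."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def rank(query_vecs, docs):
    """Return document names sorted from highest to lowest MaxSim score."""
    return sorted(docs, key=lambda name: maxsim_score(query_vecs, docs[name]), reverse=True)


# Toy token embeddings (hypothetical values, two dimensions for readability).
query = [[1.0, 0.0], [0.0, 1.0]]
docs = {
    "doc_a": [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    "doc_b": [[1.0, 0.0], [0.2, 0.1]],
}
ordering = rank(query, docs)
```

    Because document token embeddings can be computed ahead of time, only the query has to be encoded at search time, which is part of why the approach suits small on-device footprints.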

    This is particularly meaningful as organizations begin to orchestrate fleets of agents. Fast local retrieval—running on the same hardware as the reasoning model—reduces latency and provides a governance win: documents never leave the device boundary.

    Taken together, the VL, Audio, and ColBERT variants show LFM2 as a modular system, not a single model drop.

    The emerging blueprint for hybrid enterprise AI architectures

    Across all variants, the LFM2 report implicitly sketches what tomorrow’s enterprise AI stack will look like: hybrid local-cloud orchestration, where small, fast models operating on devices handle time-critical perception, formatting, tool invocation, and judgment tasks, while larger models in the cloud offer heavyweight reasoning when needed.

    Several trends converge here:

    • Cost control. Running routine inference locally avoids unpredictable cloud billing.

    • Latency determinism. Time to first token (TTFT) and decode stability matter in agent workflows; on-device execution eliminates network jitter.

    • Governance and compliance. Local execution simplifies PII handling, data residency, and auditability.

    • Resilience. Agentic systems degrade gracefully if the cloud path becomes unavailable.
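
    The resilience point in particular is easy to picture in code. The sketch below routes routine work to a local model, sends heavier requests to a cloud model, and falls back to the local model when the cloud path fails. It is an illustration of the pattern only: the exception type, the routing flag, and the two model callables are hypothetical stand-ins, not part of any Liquid AI API.

```python
from typing import Callable, Tuple


class CloudUnavailable(Exception):
    """Raised by the (hypothetical) cloud client when the endpoint cannot be reached."""


def route(
    task: str,
    needs_heavy_reasoning: bool,
    local_model: Callable[[str], str],
    cloud_model: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (path_taken, answer), preferring local execution for routine tasks."""
    if not needs_heavy_reasoning:
        return "local", local_model(task)
    try:
        return "cloud", cloud_model(task)
    except CloudUnavailable:
        # Degrade gracefully: a smaller answer is better than no answer.
        return "local-fallback", local_model(task)


# Stub models used only to exercise the routing logic.
def small_local_model(task: str) -> str:
    return f"local answer to: {task}"


def offline_cloud_model(task: str) -> str:
    raise CloudUnavailable("no network")


def online_cloud_model(task: str) -> str:
    return f"cloud answer to: {task}"
```

    How a system decides that a task needs heavy reasoning is the hard part and is left as a plain boolean here.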

    Enterprises adopting these architectures will likely treat small on-device models as the “control plane” of agentic workflows, with large cloud models serving as on-demand accelerators.

    LFM2 is one of the clearest open-source foundations for that control layer to date.

    The strategic takeaway: on-device AI is now a design choice, not a compromise

    For years, organizations building AI features have accepted that “real AI” requires cloud inference. LFM2 challenges that assumption. The models perform competitively across reasoning, instruction following, multilingual tasks, and RAG—while simultaneously achieving substantial latency gains over other open small-model families.

    For CIOs and CTOs finalizing 2026 roadmaps, the implication is direct: small, open, on-device models are now strong enough to carry meaningful slices of production workloads.

    LFM2 will not replace frontier cloud models for frontier-scale reasoning. But it offers something enterprises arguably need more: a reproducible, open, and operationally feasible foundation for agentic systems that must run anywhere, from phones to industrial endpoints to air-gapped secure facilities.

    In the broadening landscape of enterprise AI, LFM2 is less a research milestone and more a sign of architectural convergence. The future is not cloud or edge—it’s both, operating in concert. And releases like LFM2 provide the building blocks for organizations prepared to build that hybrid future intentionally rather than accidentally.

  • DeepSeek just dropped two insanely powerful AI models that rival GPT-5 and they’re totally free

    Chinese artificial intelligence startup DeepSeek released two powerful new AI models on Sunday that the company claims match or exceed the capabilities of OpenAI's GPT-5 and Google's Gemini-3.0-Pro — a development that could reshape the competitive landscape between American tech giants and their Chinese challengers.

    The Hangzhou-based company launched DeepSeek-V3.2, designed as an everyday reasoning assistant, alongside DeepSeek-V3.2-Speciale, a high-powered variant that achieved gold-medal performance in four elite competitions: the 2025 International Mathematical Olympiad, the International Olympiad in Informatics, the ICPC World Finals, and the China Mathematical Olympiad.

    The release carries profound implications for American technology leadership. DeepSeek has once again demonstrated that it can produce frontier AI systems despite U.S. export controls that restrict China's access to advanced Nvidia chips — and it has done so while making its models freely available under an open-source MIT license.

    "People thought DeepSeek gave a one-time breakthrough but we came back much bigger," wrote Chen Fang, who identified himself as a contributor to the project, on X (formerly Twitter). The release drew swift reactions online, with one user declaring: "Rest in peace, ChatGPT."

    How DeepSeek's sparse attention breakthrough slashes computing costs

    At the heart of the new release lies DeepSeek Sparse Attention, or DSA — a novel architectural innovation that dramatically reduces the computational burden of running AI models on long documents and complex tasks.

    Traditional AI attention mechanisms, the core technology allowing language models to understand context, scale poorly as input length increases. Processing a document twice as long typically requires four times the computation. DeepSeek's approach breaks this constraint using what the company calls a "lightning indexer" that identifies only the most relevant portions of context for each query, ignoring the rest.
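
    The scaling argument can be sketched in a few lines. The code below is a simplified, single-query illustration of top-k sparse attention, assuming some cheap indexer has already produced a relevance score per context position. It is not DeepSeek's DSA implementation: the indexer scores, the value of top_k, and all function names are assumptions, causal masking is omitted, and the pair-count helper ignores the indexer's own cost.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_attention(query, keys, values, index_scores, top_k):
    """Attend only to the top_k positions ranked by the indexer's scores."""
    order = sorted(range(len(keys)), key=lambda i: index_scores[i], reverse=True)
    chosen = sorted(order[:top_k])
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([scale * dot(query, keys[i]) for i in chosen])
    dim = len(values[0])
    output = [sum(w * values[i][d] for w, i in zip(weights, chosen)) for d in range(dim)]
    return output, chosen


def attention_pairs(seq_len, top_k=None):
    """Query-key pairs scored by the main attention for a sequence of seq_len tokens."""
    per_query = seq_len if top_k is None else min(top_k, seq_len)
    return seq_len * per_query


# Toy inputs: three context positions with hypothetical indexer scores.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
index_scores = [0.1, 0.9, 0.5]
output, chosen = sparse_attention([1.0, 0.0], keys, values, index_scores, top_k=1)
```

    In this simplified count, doubling the sequence length quadruples the dense pair count but only doubles the sparse one once the sequence is longer than top_k, which is the constraint the article describes DSA as breaking.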

    According to DeepSeek's technical report, DSA reduces inference costs by roughly half compared to previous models when processing long sequences. The architecture "substantially reduces computational complexity while preserving model performance," the report states.

    At a context length of 128,000 tokens, roughly equivalent to a 300-page book, decoding now costs approximately $0.70 per million tokens, compared to $2.40 for the previous V3.1-Terminus model. That represents a roughly 70% reduction in decoding costs.

    The 685-billion-parameter models support context windows of 128,000 tokens, making them suitable for analyzing lengthy documents, codebases, and research papers. DeepSeek's technical report notes that independent evaluations on long-context benchmarks show V3.2 performing on par with or better than its predecessor "despite incorporating a sparse attention mechanism."

    The benchmark results that put DeepSeek in the same league as GPT-5

    DeepSeek's claims of parity with America's leading AI systems rest on extensive testing across mathematics, coding, and reasoning tasks — and the numbers are striking.

    On AIME 2025, a prestigious American mathematics competition, DeepSeek-V3.2-Speciale achieved a 96.0% pass rate, compared to 94.6% for GPT-5-High and 95.0% for Gemini-3.0-Pro. On the Harvard-MIT Mathematics Tournament, the Speciale variant scored 99.2%, surpassing Gemini's 97.5%.

    The standard V3.2 model, optimized for everyday use, scored 93.1% on AIME and 92.5% on HMMT — marginally below frontier models but achieved with substantially fewer computational resources.

    Most striking are the competition results. DeepSeek-V3.2-Speciale scored 35 out of 42 points on the 2025 International Mathematical Olympiad, earning gold-medal status. At the International Olympiad in Informatics, it scored 492 out of 600 points — also gold, ranking 10th overall. The model solved 10 of 12 problems at the ICPC World Finals, placing second.

    These results came without internet access or tools during testing. DeepSeek's report states that "testing strictly adheres to the contest's time and attempt limits."

    On coding benchmarks, DeepSeek-V3.2 resolved 73.1% of real-world software bugs on SWE-bench Verified, competitive with GPT-5-High at 74.9%. On Terminal Bench 2.0, which measures complex coding workflows, DeepSeek scored 46.4%, well above GPT-5-High's 35.2%.

    The company acknowledges limitations. "Token efficiency remains a challenge," the technical report states, noting that DeepSeek "typically requires longer generation trajectories" to match Gemini-3.0-Pro's output quality.

    Why teaching AI to think while using tools changes everything

    Beyond raw reasoning, DeepSeek-V3.2 introduces "thinking in tool-use" — the ability to reason through problems while simultaneously executing code, searching the web, and manipulating files.

    Previous AI models faced a frustrating limitation: each time they called an external tool, they lost their train of thought and had to restart reasoning from scratch. DeepSeek's architecture preserves the reasoning trace across multiple tool calls, enabling fluid multi-step problem solving.

    To train this capability, the company built a massive synthetic data pipeline generating over 1,800 distinct task environments and 85,000 complex instructions. These included challenges like multi-day trip planning with budget constraints, software bug fixes across eight programming languages, and web-based research requiring dozens of searches.

    The technical report describes one example: planning a three-day trip from Hangzhou with constraints on hotel prices, restaurant ratings, and attraction costs that vary based on accommodation choices. Such tasks are "hard to solve but easy to verify," making them ideal for training AI agents.
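
    The "hard to solve but easy to verify" property is worth a short illustration. Finding a good itinerary means searching many combinations, but checking a proposed one is a single pass over its days. The sketch below is a hypothetical verifier loosely modeled on the trip example; the field names, thresholds, and constraint set are invented for illustration and are simpler than the interdependent constraints the report describes.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class DayPlan:
    hotel_price: float
    restaurant_rating: float
    attraction_cost: float


def verify_trip(
    plan: Sequence[DayPlan],
    *,
    days: int,
    max_hotel_price: float,
    min_restaurant_rating: float,
    total_budget: float,
) -> bool:
    """Return True only if the proposed itinerary satisfies every constraint."""
    if len(plan) != days:
        return False
    total = 0.0
    for day in plan:
        if day.hotel_price > max_hotel_price:
            return False
        if day.restaurant_rating < min_restaurant_rating:
            return False
        total += day.hotel_price + day.attraction_cost
    return total <= total_budget


# A candidate answer an agent might submit (made-up numbers).
candidate = [
    DayPlan(hotel_price=400, restaurant_rating=4.5, attraction_cost=120),
    DayPlan(hotel_price=380, restaurant_rating=4.2, attraction_cost=80),
    DayPlan(hotel_price=420, restaurant_rating=4.8, attraction_cost=0),
]
limits = dict(days=3, max_hotel_price=450, min_restaurant_rating=4.0, total_budget=1500)
is_valid = verify_trip(candidate, **limits)
```

    A checker like this yields a clear pass or fail signal, which is the kind of feedback that makes such tasks usable for training agents.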

    DeepSeek employed real-world tools during training — actual web search APIs, coding environments, and Jupyter notebooks — while generating synthetic prompts to ensure diversity. The result is a model that generalizes to unseen tools and environments, a critical capability for real-world deployment.

    DeepSeek's open-source gambit could upend the AI industry's business model

    Unlike OpenAI and Anthropic, which guard their most powerful models as proprietary assets, DeepSeek has released both V3.2 and V3.2-Speciale under the MIT license — one of the most permissive open-source frameworks available.

    Any developer, researcher, or company can download, modify, and deploy the 685-billion-parameter models without restriction. Full model weights, training code, and documentation are available on Hugging Face, the leading platform for AI model sharing.

    The strategic implications are significant. By making frontier-capable models freely available, DeepSeek undermines competitors charging premium API prices. The Hugging Face model card notes that DeepSeek has provided Python scripts and test cases "demonstrating how to encode messages in OpenAI-compatible format" — making migration from competing services straightforward.

    For enterprise customers, the value proposition is compelling: frontier performance at dramatically lower cost, with deployment flexibility. But data residency concerns and regulatory uncertainty may limit adoption in sensitive applications — particularly given DeepSeek's Chinese origins.

    Regulatory walls are rising against DeepSeek in Europe and America

    DeepSeek's global expansion faces mounting resistance. In June, Berlin's data protection commissioner Meike Kamp declared that DeepSeek's transfer of German user data to China is "unlawful" under EU rules, asking Apple and Google to consider blocking the app.

    The German authority expressed concern that "Chinese authorities have extensive access rights to personal data within the sphere of influence of Chinese companies." Italy ordered DeepSeek to block its app in February. U.S. lawmakers have moved to ban the service from government devices, citing national security concerns.

    Questions also persist about U.S. export controls designed to limit China's AI capabilities. In August, DeepSeek hinted that China would soon have "next generation" domestically built chips to support its models. The company indicated its systems work with Chinese-made chips from Huawei and Cambricon without additional setup.

    DeepSeek's original V3 model was reportedly trained on roughly 2,000 older Nvidia H800 chips, hardware that has since been restricted from export to China. The company has not disclosed what powered V3.2 training, but its continued advancement suggests export controls alone cannot halt Chinese AI progress.

    What DeepSeek's release means for the future of AI competition

    The release arrives at a pivotal moment. After years of massive investment, some analysts question whether an AI bubble is forming. DeepSeek's ability to match American frontier models at a fraction of the cost challenges assumptions that AI leadership requires enormous capital expenditure.

    The company's technical report reveals that post-training investment now exceeds 10% of pre-training costs — a substantial allocation credited for reasoning improvements. But DeepSeek acknowledges gaps: "The breadth of world knowledge in DeepSeek-V3.2 still lags behind leading proprietary models," the report states. The company plans to address this by scaling pre-training compute.

    DeepSeek-V3.2-Speciale remains available through a temporary API until December 15, when its capabilities will merge into the standard release. The Speciale variant is designed exclusively for deep reasoning and does not support tool calling — a limitation the standard model addresses.

    For now, the AI race between the United States and China has entered a new phase. DeepSeek's release demonstrates that open-source models can achieve frontier performance, that efficiency innovations can slash costs dramatically, and that the most powerful AI systems may soon be freely available to anyone with an internet connection.

    As one commenter on X observed: "Deepseek just casually breaking those historic benchmarks set by Gemini is bonkers."

    The question is no longer whether Chinese AI can compete with Silicon Valley. It's whether American companies can maintain their lead when their Chinese rival gives comparable technology away for free.

  • OpenAGI emerges from stealth with an AI agent that it claims crushes OpenAI and Anthropic

    A stealth artificial intelligence startup founded by an MIT researcher emerged this morning with an ambitious claim: its new AI model can control computers better than systems built by OpenAI and Anthropic — at a fraction of the cost.

    OpenAGI, led by chief executive Zengyi Qin, released Lux, a foundation model designed to operate computers autonomously by interpreting screenshots and executing actions across desktop applications. The San Francisco-based company says Lux achieves an 83.6 percent success rate on Online-Mind2Web, a benchmark that has become the industry's most rigorous test for evaluating AI agents that control computers.

    That score is a significant leap over the leading models from well-funded competitors. OpenAI's Operator, released in January, scores 61.3 percent on the same benchmark. Anthropic's Claude Computer Use achieves 56.3 percent.

    "Traditional LLM training feeds a large amount of text corpus into the model. The model learns to produce text," Qin said in an exclusive interview with VentureBeat. "By contrast, our model learns to produce actions. The model is trained with a large amount of computer screenshots and action sequences, allowing it to produce actions to control the computer."

    The announcement arrives at a pivotal moment for the AI industry. Technology giants and startups alike have poured billions of dollars into developing autonomous agents capable of navigating software, booking travel, filling out forms, and executing complex workflows. OpenAI, Anthropic, Google, and Microsoft have all released or announced agent products in the past year, betting that computer-controlling AI will become as transformative as chatbots.

    Yet independent research has cast doubt on whether current agents are as capable as their creators suggest.

    Why university researchers built a tougher benchmark to test AI agents—and what they discovered

    The Online-Mind2Web benchmark, developed by researchers at Ohio State University and the University of California, Berkeley, was designed specifically to expose the gap between marketing claims and actual performance.

    Published in April and accepted to the Conference on Language Modeling 2025, the benchmark comprises 300 diverse tasks across 136 real websites — everything from booking flights to navigating complex e-commerce checkouts. Unlike earlier benchmarks that cached parts of websites, Online-Mind2Web tests agents in live online environments where pages change dynamically and unexpected obstacles appear.

    The results, according to the researchers, painted "a very different picture of the competency of current agents, suggesting over-optimism in previously reported results."

    When the Ohio State team tested five leading web agents with careful human evaluation, they found that many recent systems — despite heavy investment and marketing fanfare — did not outperform SeeAct, a relatively simple agent released in January 2024. Even OpenAI's Operator, the best performer among commercial offerings in their study, achieved only 61 percent success.

    "It seemed that highly capable and practical agents were maybe indeed just months away," the researchers wrote in a blog post accompanying their paper. "However, we are also well aware that there are still many fundamental gaps in research to fully autonomous agents, and current agents are probably not as competent as the reported benchmark numbers may depict."

    The benchmark has gained traction as an industry standard, with a public leaderboard hosted on Hugging Face tracking submissions from research groups and companies.

    How OpenAGI trained its AI to take actions instead of just generating text

    OpenAGI's claimed performance advantage stems from what the company calls "Agentic Active Pre-training," a training methodology that differs fundamentally from how most large language models learn.

    Conventional language models train on vast text corpora, learning to predict the next word in a sequence. The resulting systems excel at generating coherent text but were not designed to take actions in graphical environments.

    Lux, according to Qin, takes a different approach. The model trains on computer screenshots paired with action sequences, learning to interpret visual interfaces and determine which clicks, keystrokes, and navigation steps will accomplish a given goal.

    "The action allows the model to actively explore the computer environment, and such exploration generates new knowledge, which is then fed back to the model for training," Qin told VentureBeat. "This is a naturally self-evolving process, where a better model produces better exploration, better exploration produces better knowledge, and better knowledge leads to a better model."

    This self-reinforcing training loop, if it functions as described, could help explain how a smaller team might achieve results that elude larger organizations. Rather than requiring ever-larger static datasets, the approach would allow the model to continuously improve by generating its own training data through exploration.

    OpenAGI also claims significant cost advantages. The company says Lux operates at roughly one-tenth the cost of frontier models from OpenAI and Anthropic while executing tasks faster.

    Unlike browser-only competitors, Lux can control Slack, Excel, and other desktop applications

    A critical distinction in OpenAGI's announcement: Lux can control applications across an entire desktop operating system, not just web browsers.

    Most commercially available computer-use agents, including early versions of Anthropic's Claude Computer Use, focus primarily on browser-based tasks. That limitation excludes vast categories of productivity work that occur in desktop applications — spreadsheets in Microsoft Excel, communications in Slack, design work in Adobe products, code editing in development environments.

    OpenAGI says Lux can navigate these native applications, a capability that would substantially expand the addressable market for computer-use agents. The company is releasing a developer software development kit alongside the model, allowing third parties to build applications on top of Lux.

    The company is also working with Intel to optimize Lux for edge devices, which would allow the model to run locally on laptops and workstations rather than requiring cloud infrastructure. That partnership could address enterprise concerns about sending sensitive screen data to external servers.

    "We are partnering with Intel to optimize our model on edge devices, which will make it the best on-device computer-use model," Qin said.

    The company confirmed it is in exploratory discussions with AMD and Microsoft about additional partnerships.

    What happens when you ask an AI agent to copy your bank details

    Computer-use agents present novel safety challenges that do not arise with conventional chatbots. An AI system capable of clicking buttons, entering text, and navigating applications could, if misdirected, cause significant harm — transferring money, deleting files, or exfiltrating sensitive information.

    OpenAGI says it has built safety mechanisms directly into Lux. When the model encounters requests that violate its safety policies, it refuses to proceed and alerts the user.

    In an example provided by the company, when a user asked the model to "copy my bank details and paste it into a new Google doc," Lux responded with an internal reasoning step: "The user asks me to copy the bank details, which are sensitive information. Based on the safety policy, I am not able to perform this action." The model then issued a warning to the user rather than executing the potentially dangerous request.

    Such safeguards will face intense scrutiny as computer-use agents proliferate. Security researchers have already demonstrated prompt injection attacks against early agent systems, where malicious instructions embedded in websites or documents can hijack an agent's behavior. Whether Lux's safety mechanisms can withstand adversarial attacks remains to be tested by independent researchers.

    The MIT researcher who built two of GitHub's most downloaded AI models

    Qin brings an unusual combination of academic credentials and entrepreneurial experience to OpenAGI.

    He completed his doctorate at the Massachusetts Institute of Technology in 2025, where his research focused on computer vision, robotics, and machine learning. His academic work appeared in top venues including the Conference on Computer Vision and Pattern Recognition, the International Conference on Learning Representations, and the International Conference on Machine Learning.

    Before founding OpenAGI, Qin built several widely adopted AI systems. JetMoE, a large language model he led development on, demonstrated that a high-performing model could be trained from scratch for less than $100,000 — a fraction of the tens of millions typically required. The model outperformed Meta's LLaMA2-7B on standard benchmarks, according to a technical report that attracted attention from MIT's Computer Science and Artificial Intelligence Laboratory.

    His previous open-source projects achieved remarkable adoption. OpenVoice, a voice cloning model, accumulated approximately 35,000 stars on GitHub and ranked in the top 0.03 percent of open-source projects by popularity. MeloTTS, a text-to-speech system, has been downloaded more than 19 million times, making it one of the most widely used audio AI models since its 2024 release.

    Qin also co-founded MyShell, an AI agent platform that has attracted six million users who have collectively built more than 200,000 AI agents. Users have had more than one billion interactions with agents on the platform, according to the company.

    Inside the billion-dollar race to build AI that controls your computer

    The computer-use agent market has attracted intense interest from investors and technology giants over the past year.

    OpenAI released Operator in January, allowing users to instruct an AI to complete tasks across the web. Anthropic has continued developing Claude Computer Use, positioning it as a core capability of its Claude model family. Google has incorporated agent features into its Gemini products. Microsoft has integrated agent capabilities across its Copilot offerings and Windows.

    Yet the market remains nascent. Enterprise adoption has been limited by concerns about reliability, security, and the ability to handle edge cases that occur frequently in real-world workflows. The performance gaps revealed by benchmarks like Online-Mind2Web suggest that current systems may not be ready for mission-critical applications.

    OpenAGI enters this competitive landscape as an independent alternative, positioning superior benchmark performance and lower costs against the massive resources of its well-funded rivals. The company's Lux model and developer SDK are available beginning today.

    Whether OpenAGI can translate benchmark dominance into real-world reliability remains the central question. The AI industry has a long history of impressive demos that falter in production, of laboratory results that crumble against the chaos of actual use. Benchmarks measure what they measure, and the distance between a controlled test and an 8-hour workday full of edge cases, exceptions, and surprises can be vast.

    But if Lux performs in the wild the way it performs in the lab, the implications extend far beyond one startup's success. It would suggest that the path to capable AI agents runs not through the largest checkbooks but through the cleverest architectures—that a small team with the right ideas can outmaneuver the giants.

    The technology industry has seen that story before. It rarely stays true for long.

  • Hybrid cloud security must be rebuilt for an AI war it was never designed to fight

    Hybrid cloud security was built before the current era of automated, machine-based cyberattacks that take just milliseconds to execute and minutes to deliver devastating impacts to infrastructure.

    The architectures and tech stacks every enterprise depends on, from batch-based detection to siloed tools to 15-minute response windows, stood a better chance of defending against attackers moving at human speed. But in a weaponized AI world, those approaches to analyzing threat data don't make sense.

    The latest survey numbers tell the story. More than half (55%) of organizations suffered cloud breaches in the past year. That’s a 17-point spike, according to Gigamon's 2025 Hybrid Cloud Security Survey. Nearly half of the enterprises polled said their security tools missed the attack entirely. While 82% of enterprises now run hybrid or multi-cloud environments, only 36% express confidence in detecting threats in real time, per Fortinet's 2025 State of Cloud Security Report.

    Adversaries aren’t wasting any time weaponizing AI to target hybrid cloud vulnerabilities. Organizations now face 1,925 cyberattacks weekly. That’s an increase of 47% in a year. Further, ransomware surged 126% in the first quarter of 2025 alone. The visibility gaps everyone talks about in hybrid environments are where breaches originate. The bottom line is that the security architectures designed for the pre-AI era can't keep pace.

    But the industry is finally beginning to respond. CrowdStrike, for its part, is providing one vision of cybersecurity reinvention. Today at AWS re:Invent, the company is rolling out real-time Cloud Detection and Response, a platform designed to compress 15-minute response windows down to seconds.

    But the bigger story is why the entire approach to hybrid cloud security must change, and what that means for CISOs planning their 2026 strategies.

    Why the old model for hybrid cloud security is failing

    Initially, hybrid cloud promised the best of both worlds. Every organization could have public cloud agility with on-prem control. The security model that took shape reflected the best practices at the time. The trouble is that those best practices are now introducing vulnerabilities.

    How bad is it? The majority of security teams struggle to keep up with the threats and workloads. According to recent research:

    "You can't secure what you can't see," says Mandy Andress, CISO at Elastic. "That's the heart of the two big challenges we see as security practitioners: The complexity and sprawl of an organization's infrastructure, coupled with the rapid pace of technological change."

    CrowdStrike CTO Elia Zaitsev diagnosed the root cause: "Everyone assumed this was a one-way trip, lift and shift everything to the cloud. That's not what happened. We're seeing companies pull workloads back on-prem when the economics make sense. The reality? Everyone's going to be hybrid. Five years from now. Ten years. Maybe forever. Security has to deal with that."

    Weaponized AI is changing the threat calculus fast

    The weaponized AI era isn't just accelerating attacks. It’s breaking the fundamental assumptions on which hybrid cloud security was built. The window between patch release and weaponized exploit collapsed from weeks to hours. The majority of adversaries aren't typing commands anymore; they're automating machine-based campaigns that orchestrate agentic AI at a scale and speed that current hybrid cloud tools and human SOC teams can't keep up with.

    Zaitsev shared threat data from CrowdStrike's mid-year hunting report, which found that cloud intrusions spiked 136% in a year, with roughly 40% of all cloud actor activity coming from Chinese nexus adversaries. This illustrates how quickly the threat landscape can change, and why hybrid cloud security needs to be reinvented for the AI era now.

    Mike Riemer, SVP and field CISO at Ivanti, has witnessed the timeline collapse. Threat actors now reverse-engineer patches within 72 hours using AI assistance. If enterprises don't patch within that time frame, "they're open to exploit," Riemer told VentureBeat. "That's the new reality."

    Using previous-generation tools in the current cloud control plane is a dangerous bet. All it takes is a single compromised virtual machine (VM) that no one knows exists. Once attackers compromise the control plane, including the APIs that manage cloud resources, they have the keys to spin up, modify or delete thousands of assets across a company’s hybrid environment.

    The seams between hybrid cloud environments are attack highways where millisecond-long attacks seldom leave any digital exhaust or traces. Many organizations never see weaponized AI attacks coming.

    VentureBeat hears that the worst hybrid cloud attacks can only be diagnosed long after the fact, when forensics and analysis are finally completed. Attackers and adversaries are that good at covering their tracks, often relying on living-off-the-land (LotL) tools to evade detection for months, even years in extreme cases.

    "Enterprises training AI models are concentrating sensitive data in cloud environments, which is gold for adversaries," CrowdStrike's Zaitsev said. "Attackers are using agentic AI to run their campaigns. The traditional SOC workflow — see the alert, triage, investigate for 15 or 20 minutes, take action an hour or a day later —is completely insufficient. You're bringing a knife to a gunfight."

    The human toll of relying on outdated architecture

    The human toll of the hybrid cloud crisis shows up in SOC metrics and burnout. The AI SOC Market Landscape 2025 report found that the average security operations center processes 960 alerts daily. Each takes roughly 70 minutes to investigate properly. Assuming standard SOC staffing levels, there aren't enough hours in the day to get to all those alerts.
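    To make that arithmetic concrete, here is a rough sketch using the two figures cited above. The team size and the hours of focused investigation per analyst are illustrative assumptions, not numbers from the report.

```python
# Figures cited above from the AI SOC Market Landscape 2025 report.
ALERTS_PER_DAY = 960
MINUTES_PER_ALERT = 70

# Hypothetical staffing assumptions, for illustration only.
ANALYSTS = 20
INVESTIGATION_HOURS_PER_ANALYST = 8


def required_analyst_hours(alerts: int, minutes_per_alert: int) -> float:
    """Hours of investigation needed to work every alert properly."""
    return alerts * minutes_per_alert / 60


def coverage(alerts: int, minutes_per_alert: int, analysts: int, hours_each: int) -> float:
    """Fraction of the daily alert workload the team can actually cover."""
    needed = required_analyst_hours(alerts, minutes_per_alert)
    available = analysts * hours_each
    return min(1.0, available / needed)


needed_hours = required_analyst_hours(ALERTS_PER_DAY, MINUTES_PER_ALERT)
team_coverage = coverage(
    ALERTS_PER_DAY, MINUTES_PER_ALERT, ANALYSTS, INVESTIGATION_HOURS_PER_ANALYST
)

print(f"Analyst-hours needed per day: {needed_hours:.0f}")
print(f"Share of workload a {ANALYSTS}-analyst team can cover: {team_coverage:.0%}")
```

    Under these assumptions, working every alert would take 1,120 analyst-hours a day, the equivalent of 140 analysts each spending eight uninterrupted hours on investigation.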

    Further, at least 40% of alerts, on average, never get touched. The human cost is staggering. A Tines survey of SOC analysts found that 71% are experiencing burnout. Two-thirds say manual grunt work consumes more than half of their day. The same share are eyeing the exit from their jobs and, in extreme cases, as some confide to VentureBeat, from the industry altogether.

    Hybrid environments make everything more complicated. Enterprises have different tools for AWS, Azure and on-prem architectures. They have different consoles; often different teams. As for alert correlation across environments? It's manual and often delegated to the most senior SOC team members — if it happens at all.

    Batch-based detection can't survive the weaponized AI era

    Here's what most legacy vendors of hybrid cloud security tools won't openly admit: Cloud security tools are fundamentally flawed and not designed for real-time defense. The majority are batch-based, collecting logs every five, ten or fifteen minutes, processing them through correlation engines, then generating alerts. In a world where adversaries are increasingly executing machine-based attacks in milliseconds, a 15-minute detection delay isn't just a minor setback; it's the difference between stopping an attack and having to investigate a breach.

    As adversaries weaponize AI to accelerate cloud attacks and move laterally across systems, traditional cloud detection and response (CDR) tools relying on log batch processing are too slow to keep up. These systems can take 15 minutes or more to surface a single detection.

    CrowdStrike's Zaitsev didn't hedge. Before the company's new tools released today, there was no such thing as real-time cloud detection and prevention, he claimed. "Everyone else is batch-based. Suck down logs every five or 10 minutes, wait for data, import it, correlate it. We've seen competitors take 10 to 15 minutes minimum. That's not detection—that's archaeology."

    He continued: "It's carrier pigeon versus 5G. The gap between 15 minutes and 15 seconds isn't just about alert quality. It's the difference between getting a notification that something has already happened; now you're doing cleanup, versus actually stopping the attack before the adversary achieves anything. One is incident response. The other is prevention."
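    The latency gap Zaitsev describes can be illustrated with a toy simulation. This is a minimal sketch, not any vendor's pipeline: the attack timings, the 15-minute collection interval and the fixed processing delays are all assumed values.

```python
import math

BATCH_INTERVAL_S = 15 * 60   # assumed: logs collected every 15 minutes
BATCH_PROCESSING_S = 60      # assumed: time to import and correlate a batch
STREAM_PROCESSING_S = 2      # assumed: per-event processing time on a stream


def batch_detection_time(event_time_s: float) -> float:
    """An event is seen only at the next collection boundary, then processed."""
    next_collection = math.ceil(event_time_s / BATCH_INTERVAL_S) * BATCH_INTERVAL_S
    return next_collection + BATCH_PROCESSING_S


def stream_detection_time(event_time_s: float) -> float:
    """An event is processed as soon as it arrives on the stream."""
    return event_time_s + STREAM_PROCESSING_S


# Hypothetical attack steps, in seconds after the collection window opens.
attack_events = [1, 20, 45, 90]

batch_latencies = [batch_detection_time(t) - t for t in attack_events]
stream_latencies = [stream_detection_time(t) - t for t in attack_events]

for t, b, s in zip(attack_events, batch_latencies, stream_latencies):
    print(f"event at {t:>3}s: batch latency {b:>4.0f}s, stream latency {s:.0f}s")
```

    In this sketch the whole 90-second attack sequence is over more than 14 minutes before the batch pipeline raises its first alert, while the streaming path flags each step within seconds.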

    Reinventing hybrid cloud security must begin with speed

    CrowdStrike's new real-time Cloud Detection and Response, part of Falcon Cloud Security's unified cloud-native application protection platform (CNAPP), is intended to secure every layer of hybrid cloud risk. It is built on three key innovations:

    • Real-time detection engine: Built on event streaming technology pioneered and battle-tested by Falcon Adversary OverWatch, this engine analyzes cloud logs as they stream in. It then applies detections to eliminate latency and false positives.

    • New cloud-specific indicators of attack out of the box: AI and machine learning (ML) correlate what's happening in real time against cloud asset and identity data. That's how the system catches stealthy moves like privilege escalation and CloudShell abuse before attackers can capitalize on them.

    • Automated cloud response actions and workflows: There's a gap in traditional cloud security. Cloud workload protection (CWP) simply stops at the workload. Cloud security posture management (CSPM) shows what could go wrong. But neither protects the control plane at runtime. New workflows built on Falcon Fusion SOAR close that gap, triggering instantly to disrupt adversaries before SOC teams can intervene.

    CrowdStrike's Cloud Detection and Response integrates with Amazon EventBridge, AWS's real-time serverless event streaming service. Instead of polling for logs on a schedule, the system taps directly into the event stream as things happen.
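    For readers unfamiliar with event-driven detection, the sketch below shows the general shape of the idea. It is not CrowdStrike's implementation. The pattern follows the JSON event-pattern style EventBridge uses for CloudTrail API calls; the `matches` function is a deliberately simplified, hypothetical stand-in that handles only exact-value lists, and both sample events are invented.

```python
# Event pattern in the EventBridge style: each field lists its accepted values.
PATTERN = {
    "source": ["aws.iam"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {"eventName": ["AttachUserPolicy", "CreateAccessKey"]},
}


def matches(pattern: dict, event: dict) -> bool:
    """Simplified matcher: every pattern field must be present and allowed."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            # Nested pattern: recurse into the nested part of the event.
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True


def on_event(event: dict) -> str:
    """Called once per event as it arrives; no polling schedule involved."""
    return "alert" if matches(PATTERN, event) else "ignore"


# Invented sample events for illustration.
suspicious = {
    "source": "aws.iam",
    "detail-type": "AWS API Call via CloudTrail",
    "detail": {"eventName": "CreateAccessKey", "userIdentity": {"type": "IAMUser"}},
}
benign = {"source": "aws.s3", "detail-type": "AWS API Call via CloudTrail", "detail": {}}

print(on_event(suspicious))
print(on_event(benign))
```

    The point of the design is that the handler runs per event, so detection latency is bounded by processing time and not by a collection interval.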

    "Anything that calls itself CNAPP that doesn't have real-time cloud detection and response is now obsolete," CrowdStrike CTO Elia Zaitsev said in an exclusive interview with VentureBeat.

    By contrast, EventBridge provides asynchronous, microservice-based, just-in-time event processing. "We're not waiting five minutes for a bucket of data," he said.

    But tapping into it is only half the problem. "Can you actually keep up with that firehose? Can you process it fast enough to matter?" Zaitsev asked rhetorically. CrowdStrike claims it can handle 60 million events per second. "This isn't duct tape and a demo."

    The underlying streaming technology isn't new to CrowdStrike. Falcon Adversary OverWatch has been running stream processing for 15 years to hunt across CrowdStrike's customer base, processing logs in real time rather than waiting for batch cycles to complete.

    The platform integrates Charlotte AI for automated triage, which matches the decisions of expert managed detection and response (MDR) analysts with 98% accuracy and cuts 40-plus hours of manual work weekly. When the system detects a control plane compromise, it doesn't wait for human approval. It revokes tokens, kills sessions, boots the attacker and nukes malicious CloudFormation templates, all before the adversary can execute.

    What this means for the CNAPP market

    Cloud security is the fastest-growing segment in Gartner's latest forecast, expanding at a 25.9% CAGR through 2028. Precedence Research projects the market will grow from $36 billion in 2024 to $121 billion by 2034. And it's crowded: Palo Alto Networks, Wiz (now absorbed into Google via a $32 billion acquisition), Microsoft, Orca, SentinelOne (to name a few).

    CrowdStrike already had a seat at the table as a Leader in the 2025 IDC MarketScape for CNAPP for the third consecutive year. Gartner predicts that by 2029, 40% of enterprises that successfully implement zero trust in cloud environments will rely on CNAPP platforms due to their visibility and control.

    But Zaitsev is making a bigger claim, stating that today's announcement redefines what "complete" means for CNAPP in hybrid environments. "CSPM isn't going away. Cloud workload protection isn't going away. What becomes obsolete is calling something a CNAPP when it lacks real-time cloud detection and response. You're missing the safety net, the thing that catches what gets through proactive defenses. And in hybrid, something always gets through."

    "The unified platform angle matters specifically for hybrid," he said. "Adversaries deliberately hop between environments because they know defenders run different tools, often different teams, for cloud versus on-prem versus identity. Jumping domains is how you shake your tail. Attackers know most organizations can't follow them across the seams. With us, they can't do that anymore."

    Building hybrid security for the AI era

    Reinventing hybrid cloud security won't happen overnight. Here's where CISOs should focus:

    • Map your hybrid visibility gaps: Every cloud workload, every on-prem system, every identity traversing between them. If 82% of breaches trace to blind spots, know where yours are before attackers find them.

    • Pressure vendors on detection latency: Ask challenging questions about architecture. If they're running batch-based processing, understand what a 15-minute window means when adversaries move in seconds.

    • Deploy AI triage now: With 40% of alerts going uninvestigated and 71% of analysts burned out, automation isn't a roadmap item; it’s a must-have for a successful deterrence strategy. Look for measurable accuracy rates and real time savings.

    • Compress patch cycles to 72 hours: AI-assisted reverse engineering has collapsed the exploit window. Monthly patch cycles don't cut it anymore.

    • Architect for permanent hybrid: Stop waiting for cloud migration to simplify security. It won't. Design for complexity as the baseline, not a temporary state. The 54% of enterprises running hybrid models today will still be hybrid tomorrow.

    The bottom line

    Hybrid cloud security must be reinvented for the AI era. Previous-generation hybrid cloud security solutions are quickly being eclipsed by weaponized AI attacks, often launched as machine-on-machine intrusion attempts. The evidence is clear: 55% breach rates, 91% of security leaders making compromises they know are dangerous and AI-accelerated attacks that move faster than batch-based detection can respond. Architectures designed for human-speed threats can't protect against machine-speed adversaries.

    "Modern cybersecurity is about differentiating between acceptable and unacceptable risk," says Chaim Mazal, CSO at Gigamon. "Our research shows where CISOs are drawing that line, highlighting the critical importance of visibility into all data-in-motion to secure complex hybrid cloud infrastructure against today's emerging threats. It's clear that current approaches aren't keeping pace, which is why CISOs must reevaluate tool stacks and reprioritize investments and resources to more confidently secure their infrastructure."

    VentureBeat will be tracking which approaches to hybrid cloud reinvention actually deliver, and which don't, in the months ahead.

  • Ontology is the real guardrail: How to stop AI agents from misunderstanding your business

    Enterprises are investing billions of dollars in AI agents and infrastructure to transform business processes. However, we are seeing limited success in real-world applications, often due to the inability of agents to truly understand business data, policies and processes.

    While we manage the integrations well with technologies like API management, model context protocol (MCP) and others, having agents truly understand the “meaning” of data in the context of a given business is a different story. Enterprise data is mostly siloed into disparate systems in structured and unstructured forms and needs to be analyzed with a domain-specific business lens.

    As an example, the term “customer” may refer to one group of people in a sales CRM system and to another in a finance system, which may reserve the tag for paying clients. One department might define “product” as a SKU; another may represent it as a product family; a third as a marketing bundle.

    Data about “product sales” thus varies in meaning without agreed-upon relationships and definitions. For agents to combine data from multiple systems, they must understand different representations. Agents need to know what the data means in context and how to find the right data for the right process. Moreover, schema changes in systems and data quality issues during collection can add ambiguity and leave agents unable to decide how to act when they encounter such situations.

    Furthermore, classification of data into categories like PII (personally identifiable information) needs to be rigorously followed to maintain compliance with standards like GDPR and CCPA. This requires the data to be labelled correctly and agents to be able to understand and respect this classification. Hence we see that building a cool demo using agents is very much doable, but putting agents into production on real business data is a different story altogether.

    The ontology-based source of truth

    Building effective agentic solutions requires an ontology-based single source of truth. An ontology is a business definition of concepts, their hierarchy and relationships. It defines terms with respect to business domains, helps establish a single source of truth for data, captures uniform field names and applies classifications to fields.

    An ontology may be domain-specific (healthcare or finance), or organization-specific based on internal structures. Defining an ontology upfront is time consuming, but can help standardize business processes and lay a strong foundation for agentic AI.

    An ontology may be realized in a common queryable format such as a triplestore. More complex business rules with multi-hop relations could use a labelled property graph database like Neo4j. These graphs can also help enterprises discover new relationships and answer complex questions. Ontologies like FIBO (Financial Industry Business Ontology) and UMLS (Unified Medical Language System) are available in the public domain and can be a very good starting point. However, these usually need to be customized to capture specific details of an enterprise.
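    As a minimal sketch of the triple idea, the example below stores subject-predicate-object facts in plain Python and answers a multi-hop question. The entity and relation names are hypothetical, and a production system would use a real triplestore or graph database instead of an in-memory set.

```python
# Each fact is a (subject, predicate, object) triple. All names are hypothetical.
TRIPLES = {
    ("PayingClient", "is_a", "Customer"),
    ("Prospect", "is_a", "Customer"),
    ("Customer", "is_a", "Party"),
    ("SKU", "part_of", "ProductFamily"),
    ("ProductFamily", "part_of", "MarketingBundle"),
    ("Customer.email", "classified_as", "PII"),
}


def objects(subject: str, predicate: str) -> set:
    """Direct lookup: all objects linked to `subject` by `predicate`."""
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}


def reachable(subject: str, predicate: str) -> set:
    """Multi-hop lookup: follow `predicate` edges transitively."""
    seen, frontier = set(), [subject]
    while frontier:
        node = frontier.pop()
        for nxt in objects(node, predicate):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen


print(sorted(reachable("PayingClient", "is_a")))
print(sorted(reachable("SKU", "part_of")))
print(objects("Customer.email", "classified_as"))
```

    Here an agent asking whether a paying client counts as a "Party", or whether an email field is PII, gets the same answer regardless of which source system the record came from.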

    Getting started with ontology

    Once implemented, an ontology can be the driving force for enterprise agents. We can now prompt AI to follow the ontology and use it to discover data and relationships. If needed, we can have an agentic layer serve key details of the ontology itself and discover data. Business rules and policies can be implemented in this ontology for agents to adhere to. This is an excellent way to ground your agents and establish guardrails based on real business context.

    Agents designed in this manner and tuned to follow an ontology can stick to guardrails and avoid hallucinations that can be caused by the large language models (LLMs) powering them. For example, a business policy may state that unless all documents associated with a loan have their verified flags set to "true," the loan status should be kept in the “pending” state. Agents can work within this policy, determining which documents are needed and querying the knowledge base.
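    The loan policy could be expressed as a small rule that an agent consults before changing a status, reading the policy as: the loan stays pending until every associated document is verified. This is a sketch under assumed field names (`verified`, `status`) and a hypothetical next state (`ready_for_review`); the article does not specify a schema.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    name: str
    verified: bool = False


@dataclass
class Loan:
    loan_id: str
    documents: list = field(default_factory=list)
    status: str = "pending"


def missing_verifications(loan: Loan) -> list:
    """Names of documents whose verified flag is not yet true."""
    return [d.name for d in loan.documents if not d.verified]


def next_status(loan: Loan) -> str:
    """Policy: stay 'pending' unless every associated document is verified.

    A loan with no documents also stays pending (an assumption of this sketch).
    'ready_for_review' is a hypothetical next state.
    """
    if loan.documents and not missing_verifications(loan):
        return "ready_for_review"
    return "pending"


loan = Loan("L-001", [Document("income_proof", True), Document("id_check", False)])
before = (next_status(loan), missing_verifications(loan))

loan.documents[1].verified = True
after = (next_status(loan), missing_verifications(loan))

print(before)
print(after)
```

    Returning the list of unverified documents, and not just a yes or no, gives the agent something to act on: it knows exactly which items to request or look up next.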

    Here's an example implementation:

    (Original figure by Author)

    As illustrated, we have structured and unstructured data processed by a document intelligence (DocIntel) agent which populates a Neo4j database based on an ontology of the business domain. A data discovery agent in Neo4j finds and queries the right data and passes it to other agents handling business process execution. The inter-agent communication happens with a popular protocol like A2A (agent to agent). A new protocol called AG-UI (Agent User Interaction) can help build more generic UI screens to capture the workings and responses from these agents. 

    With this method, we can avoid hallucinations by requiring agents to follow ontology-driven paths and maintain data classifications and relationships. Moreover, we can scale easily by adding new assets, relationships and policies that agents automatically comply with, and control hallucinations by defining rules for the whole system rather than for individual entities. For example, if an agent hallucinates an individual 'customer,' the connected data for that 'customer' will not be verifiable in data discovery, so we can easily detect the anomaly and plan to eliminate it. This helps the agentic system scale with the business and manage its dynamic nature.
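    The anomaly check described here can be sketched as a lookup against the knowledge graph: an entity an agent mentions is accepted only if it exists and has the connected data the ontology expects. The in-memory graph, the required relations and the customer IDs below are hypothetical stand-ins for a real graph database query.

```python
# Hypothetical knowledge graph: entity ID -> {relation: connected entity IDs}.
GRAPH = {
    "customer:1001": {"has_account": ["account:77"], "placed_order": ["order:5"]},
    "customer:1002": {"has_account": ["account:78"], "placed_order": []},
}

# Assumed ontology rule: a verifiable customer has at least one account.
REQUIRED_RELATIONS = {"customer": ["has_account"]}


def is_verifiable(entity_id: str) -> bool:
    """True if the entity exists and has all connected data its type requires."""
    record = GRAPH.get(entity_id)
    if record is None:
        return False
    entity_type = entity_id.split(":", 1)[0]
    return all(record.get(rel) for rel in REQUIRED_RELATIONS.get(entity_type, []))


def flag_hallucinations(mentioned: list) -> list:
    """Entities an agent referred to that the graph cannot back up."""
    return [e for e in mentioned if not is_verifiable(e)]


agent_output = ["customer:1001", "customer:9999"]
flagged = flag_hallucinations(agent_output)
print(flagged)
```

    Because the rule lives in `REQUIRED_RELATIONS` and not in any one agent, adding a new entity type or tightening a requirement changes the check for the whole system at once.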

    Indeed, a reference architecture like this adds some overhead in data discovery and graph databases. But for a large enterprise, it adds the right guardrails and gives agents directions to orchestrate complex business processes.

    Dattaraj Rao is innovation and R&D architect at Persistent Systems.

    Read more from our guest writers. Or, consider submitting a post of your own! See our guidelines here.

  • Why observable AI is the missing SRE layer enterprises need for reliable LLMs

    As AI systems enter production, reliability and governance can’t depend on wishful thinking. Here’s how observability turns large language models (LLMs) into auditable, trustworthy enterprise systems.

    Why observability secures the future of enterprise AI

    The enterprise race to deploy LLM systems mirrors the early days of cloud adoption. Executives love the promise; compliance demands accountability; engineers just want a paved road.

    Yet, beneath the excitement, most leaders admit they can’t trace how AI decisions are made, whether they helped the business, or if they broke any rule.

    Take one Fortune 100 bank that deployed an LLM to classify loan applications. Benchmark accuracy looked stellar. Yet, 6 months later, auditors found that 18% of critical cases were misrouted, without a single alert or trace. The root cause wasn’t bias or bad data. It was invisible. No observability, no accountability.

    If you can’t observe it, you can’t trust it. And unobserved AI will fail in silence.

    Visibility isn’t a luxury; it’s the foundation of trust. Without it, AI becomes ungovernable.

    Start with outcomes, not models

    Most corporate AI projects begin with tech leaders choosing a model and, later, defining success metrics.
    That’s backward.

    Flip the order:

    • Define the outcome first. What’s the measurable business goal?

      • Deflect 15 % of billing calls

      • Reduce document review time by 60 %

      • Cut case-handling time by two minutes

    • Design telemetry around that outcome, not around “accuracy” or “BLEU score.”

    • Select prompts, retrieval methods and models that demonstrably move those KPIs.

    At one global insurer, for instance, reframing success as “minutes saved per claim” instead of “model precision” turned an isolated pilot into a company-wide roadmap.

    A 3-layer telemetry model for LLM observability

    Just like microservices rely on logs, metrics and traces, AI systems need a structured observability stack:

    a) Prompts and context: What went in

    • Log every prompt template, variable and retrieved document.

    • Record model ID, version, latency and token counts (your leading cost indicators).

    • Maintain an auditable redaction log showing what data was masked, when and by which rule.

    b) Policies and controls: The guardrails

    • Capture safety-filter outcomes (toxicity, PII), citation presence and rule triggers.

    • Store policy reasons and risk tier for each deployment.

    • Link outputs back to the governing model card for transparency.

    c) Outcomes and feedback: Did it work?

    • Gather human ratings and edit distances from accepted answers.

    • Track downstream business events: case closed, document approved, issue resolved.

    • Measure the KPI deltas: call time, backlog, reopen rate.

    All three layers connect through a common trace ID, enabling any decision to be replayed, audited or improved.
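
    As a rough illustration of the three layers joined by a trace ID, here is a minimal Python sketch. Every class and field name below (TraceRecord, PromptLayer, model_card_ref and so on) is a hypothetical choice for this example, not a schema from the article or from any particular observability product.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class PromptLayer:
    """Layer a: what went in. Field names are illustrative assumptions."""
    template_id: str
    variables: dict
    retrieved_doc_ids: list
    model_id: str
    model_version: str
    latency_ms: float
    prompt_tokens: int
    completion_tokens: int
    redactions: list = field(default_factory=list)


@dataclass
class PolicyLayer:
    """Layer b: the guardrails that were evaluated."""
    safety_filters: dict  # filter name -> True if the output passed
    citation_present: bool
    rules_triggered: list
    risk_tier: str
    model_card_ref: str


@dataclass
class OutcomeLayer:
    """Layer c: filled in later, once feedback and business events arrive."""
    human_rating: Optional[int] = None
    edit_distance: Optional[int] = None
    business_event: Optional[str] = None


@dataclass
class TraceRecord:
    """One record per decision; trace_id is the key shared by all layers."""
    prompt: PromptLayer
    policy: PolicyLayer
    outcome: OutcomeLayer = field(default_factory=OutcomeLayer)
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)

    def to_json(self) -> str:
        # Serialize the nested dataclasses into one auditable log line.
        return json.dumps(asdict(self), sort_keys=True)
```

    Keeping the outcome layer optional reflects that ratings and downstream events usually arrive after the response is served; a real system would likely append them to the stored record by trace ID rather than mutate it in memory.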

    Diagram © SaiKrishna Koorapati (2025). Created specifically for this article; licensed to VentureBeat for publication.

    Apply SRE discipline: SLOs and error budgets for AI

    Site reliability engineering (SRE) transformed software operations; now it’s AI’s turn.

    Define three “golden signals” for every critical workflow:

    Signal     | Target SLO                               | When breached
    Factuality | ≥ 95 % verified against source of record | Fallback to verified template
    Safety     | ≥ 99.9 % pass toxicity/PII filters       | Quarantine and human review
    Usefulness | ≥ 80 % accepted on first pass            | Retrain or rollback prompt/model

    If hallucinations or refusals exceed their error budget, the system auto-routes to safer prompts or human review, just like rerouting traffic during a service outage.

    This isn’t bureaucracy; it’s reliability applied to reasoning.
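
    The SLO targets above can be checked mechanically. The sketch below is one possible way to do it, assuming pass/fail counts per signal are already collected for a time window; the action labels and function names are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    signal: str
    target: float   # minimum fraction of responses that must pass
    on_breach: str  # hypothetical action label for the router


# Targets taken from the golden-signals table; action labels are assumptions.
SLOS = [
    Slo("factuality", 0.95, "fallback_to_verified_template"),
    Slo("safety", 0.999, "quarantine_and_human_review"),
    Slo("usefulness", 0.80, "retrain_or_rollback"),
]


def error_budget_remaining(slo: Slo, passed: int, total: int) -> float:
    """Fraction of the window's error budget still unspent (negative if overspent)."""
    if total == 0:
        return 1.0
    allowed_failures = (1.0 - slo.target) * total
    actual_failures = total - passed
    return 1.0 - actual_failures / allowed_failures


def breached_actions(window_counts: dict) -> list:
    """window_counts maps signal -> (passed, total); returns actions for breached SLOs."""
    actions = []
    for slo in SLOS:
        passed, total = window_counts.get(slo.signal, (0, 0))
        if total and passed / total < slo.target:
            actions.append(slo.on_breach)
    return actions
```

    For example, a window with 940 of 1,000 responses verified falls below the 95 % factuality target, so only the factuality action is returned while the other two signals stay within budget.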

    Build the thin observability layer in two agile sprints

    You don’t need a six-month roadmap, just focus and two short sprints.

    Sprint 1 (weeks 1-3): Foundations

    • Version-controlled prompt registry

    • Redaction middleware tied to policy

    • Request/response logging with trace IDs

    • Basic evaluations (PII checks, citation presence)

    • Simple human-in-the-loop (HITL) UI

    Sprint 2 (weeks 4-6): Guardrails and KPIs

    • Offline test sets (100–300 real examples)

    • Policy gates for factuality and safety

    • Lightweight dashboard tracking SLOs and cost

    • Automated token and latency tracker

    In 6 weeks, you’ll have the thin layer that answers 90% of governance and product questions.
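
    To make the Sprint 1 items more concrete, here is a small sketch of redaction middleware that emits an audit log entry per rule, keyed by trace ID. The two regular expressions and the rule names are simplified assumptions for illustration; production PII detection typically needs far more than pattern matching.

```python
import re
from datetime import datetime, timezone

# Hypothetical policy rules: rule name -> pattern to mask.
REDACTION_RULES = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact(text: str, trace_id: str):
    """Mask matches for each rule and return (redacted_text, audit_log)."""
    audit_log = []
    for rule_name, pattern in REDACTION_RULES.items():
        text, count = pattern.subn(f"[{rule_name.upper()}]", text)
        if count:
            # Record what was masked, when, and by which rule.
            audit_log.append({
                "trace_id": trace_id,
                "rule": rule_name,
                "count": count,
                "at": datetime.now(timezone.utc).isoformat(),
            })
    return text, audit_log
```

    The log stores only the rule name and match count, never the masked value itself, so the audit trail does not become a second copy of the sensitive data.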

    Make evaluations continuous (and boring)

    Evaluations shouldn’t be heroic one-offs; they should be routine.

    • Curate test sets from real cases; refresh 10–20 % monthly.

    • Define clear acceptance criteria shared by product and risk teams.

    • Run the suite on every prompt/model/policy change and weekly for drift checks.

    • Publish one unified scorecard each week covering factuality, safety, usefulness and cost.

    When evals are part of CI/CD, they stop being compliance theater and become operational pulse checks.
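
    One way to wire such a suite into a pipeline is a small gate that scores a test set and blocks the release when any metric falls below its agreed threshold. This is a sketch under stated assumptions: the generate callable, the check functions and the threshold values are placeholders that a team would supply.

```python
def run_suite(cases, generate, checks):
    """Return the pass rate per check over the test set.

    cases:    list of dicts, each with at least an "input" key
    generate: callable taking the input and returning the model output
    checks:   dict of check name -> callable(case, output) returning bool
    """
    passed = {name: 0 for name in checks}
    for case in cases:
        output = generate(case["input"])
        for name, check in checks.items():
            if check(case, output):
                passed[name] += 1
    n = len(cases)
    return {name: (count / n if n else 0.0) for name, count in passed.items()}


def gate(scorecard, thresholds):
    """Return the metrics below threshold; an empty list means the change may ship."""
    return [
        metric
        for metric, minimum in thresholds.items()
        if scorecard.get(metric, 0.0) < minimum
    ]
```

    In a CI job, a non-empty result from gate would fail the build, and the scorecard dictionary could be published as the weekly report.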

    Apply human oversight where it matters

    Full automation is neither realistic nor responsible. High-risk or ambiguous cases should escalate to human review.

    • Route low-confidence or policy-flagged responses to experts.

    • Capture every edit and reason as training data and audit evidence.

    • Feed reviewer feedback back into prompts and policies for continuous improvement.

    At one health-tech firm, this approach cut false positives by 22 % and produced a retrainable, compliance-ready dataset in weeks.
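
    The escalation rule described in this section can be sketched in a few lines. The confidence floor of 0.7, the route labels and the review record fields below are all assumptions chosen for illustration, not values from the article.

```python
from dataclasses import dataclass, field

CONFIDENCE_FLOOR = 0.7  # assumed threshold; tune per workflow and risk tier


@dataclass
class Response:
    trace_id: str
    text: str
    confidence: float
    policy_flags: list = field(default_factory=list)


def route(response: Response) -> str:
    """Send policy-flagged or low-confidence responses to a human reviewer."""
    if response.policy_flags or response.confidence < CONFIDENCE_FLOOR:
        return "human_review"
    return "auto_send"


def record_review(response: Response, edited_text: str, reason: str) -> dict:
    """Capture the reviewer's edit and reason as audit evidence and training data."""
    return {
        "trace_id": response.trace_id,
        "original": response.text,
        "edited": edited_text,
        "reason": reason,
        "changed": edited_text != response.text,
    }
```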

    Cost control through design, not hope

    LLM costs grow non-linearly. Budgets won’t save you; architecture will.

    • Structure prompts so deterministic sections run before generative ones.

    • Compress and rerank context instead of dumping entire documents.

    • Cache frequent queries and memoize tool outputs with TTL.

    • Track latency, throughput and token use per feature.

    When observability covers tokens and latency, cost becomes a controlled variable, not a surprise.
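
    As an illustration of caching with a TTL and tracking tokens per feature, here is a minimal in-memory sketch. The class and function names are hypothetical, and a production deployment would more likely use a shared cache than a per-process dictionary.

```python
import time
from collections import defaultdict


class TtlCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._store = {}

    def get_or_compute(self, key, compute):
        """Return (value, was_cache_hit); call compute() only on a miss or expiry."""
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1], True
        value = compute()
        self._store[key] = (now + self._ttl, value)
        return value, False


# Running token totals per product feature.
token_usage = defaultdict(int)


def record_tokens(feature: str, prompt_tokens: int, completion_tokens: int) -> None:
    token_usage[feature] += prompt_tokens + completion_tokens
```

    Returning the hit flag alongside the value makes it easy to log cache hit rate per feature next to the token totals.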

    The 90-day playbook

    Within 3 months of adopting observable AI principles, enterprises should see:

    • 1–2 production AI assists with HITL for edge cases

    • Automated evaluation suite for pre-deploy and nightly runs

    • Weekly scorecard shared across SRE, product and risk

    • Audit-ready traces linking prompts, policies and outcomes

    At a Fortune 100 client, this structure reduced incident time by 40 % and aligned product and compliance roadmaps.

    Scaling trust through observability

    Observable AI is how you turn AI from experiment to infrastructure.

    With clear telemetry, SLOs and human feedback loops:

    • Executives gain evidence-backed confidence.

    • Compliance teams get replayable audit chains.

    • Engineers iterate faster and ship safely.

    • Customers experience reliable, explainable AI.

    Observability isn’t an add-on layer; it’s the foundation for trust at scale.

    SaiKrishna Koorapati is a software engineering leader.
