Blog

  • Amazon’s new AI can code for days without human help. What does that mean for software engineers?

    Amazon Web Services on Tuesday announced a new class of artificial intelligence systems called "frontier agents" that can work autonomously for hours or even days without human intervention, representing one of the most ambitious attempts yet to automate the full software development lifecycle.

    The announcement, made during AWS CEO Matt Garman's keynote address at the company's annual re:Invent conference, introduces three specialized AI agents designed to act as virtual team members: Kiro autonomous agent for software development, AWS Security Agent for application security, and AWS DevOps Agent for IT operations.

    The move signals Amazon's intent to leap ahead in the intensifying competition to build AI systems capable of performing complex, multi-step tasks that currently require teams of skilled engineers.

    "We see frontier agents as a completely new class of agents," said Deepak Singh, vice president of developer agents and experiences at Amazon, in an interview ahead of the announcement. "They're fundamentally designed to work for hours and days. You're not giving them a problem that you want finished in the next five minutes. You're giving them complex challenges that they may have to think about, try different solutions, and get to the right conclusion — and they should do that without intervention."

    Why Amazon believes its new agents leave existing AI coding tools behind

    The frontier agents differ from existing AI coding assistants like GitHub Copilot or Amazon's own CodeWhisperer in several fundamental ways.

    Current AI coding tools, while powerful, require engineers to drive every interaction. Developers must write prompts, provide context, and manually coordinate work across different code repositories. When switching between tasks, the AI loses context and must start fresh.

    The new frontier agents, by contrast, maintain persistent memory across sessions and continuously learn from an organization's codebase, documentation, and team communications. They can independently determine which code repositories require changes, work on multiple files simultaneously, and coordinate complex transformations spanning dozens of microservices.

    "With a current agent, you would go microservice by microservice, making changes one at a time, and each change would be a different session with no shared context," Singh explained. "With a frontier agent, you say, 'I need to solve this broad problem.' You point it to the right application, and it decides which repos need changes."

    The agents exhibit three defining characteristics that AWS believes set them apart: autonomy in decision-making, the ability to scale by spawning multiple agents to work on different aspects of a problem simultaneously, and the capacity to operate independently for extended periods.

    "A frontier agent can decide to spin up 10 versions of itself, all working on different parts of the problem at once," Singh said.

    How each of the three frontier agents tackles a different phase of development

    Kiro autonomous agent serves as a virtual developer that maintains context across coding sessions and learns from an organization's pull requests, code reviews, and technical discussions. Teams can connect it to GitHub, Jira, Slack, and internal documentation systems. The agent then acts like a teammate, accepting task assignments and working independently until it either completes the work or requires human guidance.

    AWS Security Agent embeds security expertise throughout the development process, automatically reviewing design documents and scanning pull requests against organizational security requirements. Perhaps most significantly, it transforms penetration testing from a weeks-long manual process into an on-demand capability that completes in hours.

    SmugMug, a photo hosting platform, has already deployed the security agent. "AWS Security Agent helped catch a business logic bug that no existing tools would have caught, exposing information improperly," said Andres Ruiz, staff software engineer at the company. "To any other tool, this would have been invisible. But the ability for Security Agent to contextualize the information, parse the API response, and find the unexpected information there represents a leap forward in automated security testing."

    AWS DevOps Agent functions as an always-on operations team member, responding instantly to incidents and using its accumulated knowledge to identify root causes. It connects to observability tools including Amazon CloudWatch, Datadog, Dynatrace, New Relic, and Splunk, along with runbooks and deployment pipelines.

    Commonwealth Bank of Australia tested the DevOps agent by replicating a complex network and identity management issue that typically requires hours for experienced engineers to diagnose. The agent identified the root cause in under 15 minutes.

    "AWS DevOps Agent thinks and acts like a seasoned DevOps engineer, helping our engineers build a banking infrastructure that's faster, more resilient, and designed to deliver better experiences for our customers," said Jason Sandry, head of cloud services at Commonwealth Bank.

    Amazon makes its case against Google and Microsoft in the AI coding wars

    The announcement arrives amid a fierce battle among technology giants to dominate the emerging market for AI-powered development tools. Google has made significant noise in recent weeks with its own AI coding capabilities, while Microsoft continues to advance GitHub Copilot and its broader AI development toolkit.

    Singh argued that AWS holds distinct advantages rooted in the company's 20-year history operating cloud infrastructure and Amazon's own massive software engineering organization.

    "AWS has been the cloud of choice for 20 years, so we have two decades of knowledge building and running it, and working with customers who've been building and running applications on it," Singh said. "The learnings from operating AWS, the knowledge our customers have, the experience we've built using these tools ourselves every day to build real-world applications—all of that is embodied in these frontier agents."

    He drew a distinction between tools suitable for prototypes versus production systems. "There's a lot of things out there that you can use to build your prototype or your toy application. But if you want to build production applications, there's a lot of knowledge that we bring in as AWS that apply here."

    The safeguards Amazon built to keep autonomous agents from going rogue

    The prospect of AI systems operating autonomously for days raises immediate questions about what happens when they go off track. Singh described multiple safeguards built into the system.

    All learnings accumulated by the agents are logged and visible, allowing engineers to understand what knowledge influences the agent's decisions. Teams can even remove specific learnings if they discover the agent has absorbed incorrect information from team communications.

    "You can go in and even redact that from its knowledge like, 'No, we don't want you to ever use this knowledge,'" Singh said. "You can look at the knowledge like it's almost—it's like looking at your neurons inside your brain. You can disconnect some."

    Engineers can also monitor agent activity in real-time and intervene when necessary, either redirecting the agent or taking over entirely. Most critically, the agents never commit code directly to production systems. That responsibility remains with human engineers.

    "These agents are never going to check the code into production. That is still the human's responsibility," Singh emphasized. "You are still, as an engineer, responsible for the code you're checking in, whether it's generated by you or by an agent working autonomously."

    What frontier agents mean for the future of software engineering jobs

    The announcement inevitably raises concerns about the impact on software engineering jobs. Singh pushed back against the notion that frontier agents will replace developers, framing them instead as tools that amplify human capabilities.

    "Software engineering is craft. What's changing is not, 'Hey, agents are doing all the work.' The craft of software engineering is changing—how you use agents, how do you set up your code base, how do you set up your prompts, how do you set up your rules, how do you set up your knowledge bases so that agents can be effective," he said.

    Singh noted that senior engineers who had drifted away from hands-on coding are now writing more code than ever. "It's actually easier for them to become software engineers," he said.

    He pointed to an internal example where a team completed a project in 78 days that would have taken 18 months using traditional practices. "Because they were able to use AI. And the thing that made it work was not just the fact that they were using AI, but how they organized and set up their practices of how they built that software were maximized around that."

    How Amazon plans to make AI-generated code more trustworthy over time

    Singh outlined several areas where frontier agents will evolve over the coming years. Multi-agent architectures, where systems of specialized agents coordinate to solve complex problems, represent a major frontier. So does the integration of formal verification techniques to increase confidence in AI-generated code.

    AWS recently introduced property-based testing in Kiro, which uses automated reasoning to extract testable properties from specifications and generate thousands of test scenarios automatically.

    "If you have a shopping cart application, every way an order can be canceled, and how it might be canceled, and the way refunds are handled in Germany versus the US—if you're writing a unit test, maybe two, Germany and US, but now, because you have this property-based testing approach, your agent can create a scenario for every country you operate in and test all of them automatically for you," Singh explained.
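    Kiro's internals aren't public, but the idea Singh describes maps onto classic property-based testing. The sketch below (plain Python, with a made-up `refund_amount` policy table; none of these names come from AWS) states one refund invariant and exercises it across every country with randomized order totals, instead of writing one unit test per country:

```python
import random

# Hypothetical per-country refund rates (illustrative, not real policy).
REFUND_RATE = {"US": 0.95, "DE": 1.00, "FR": 1.00, "JP": 0.90}

def refund_amount(order_total: float, country: str) -> float:
    """Refund owed when an order is canceled."""
    return round(order_total * REFUND_RATE[country], 2)

def check_refund_properties(trials: int = 1000) -> None:
    """One property, exercised across every country and many random
    totals, replaces a hand-written unit test per country."""
    rng = random.Random(0)
    for _ in range(trials):
        total = round(rng.uniform(0.01, 10_000), 2)
        for country in REFUND_RATE:
            refund = refund_amount(total, country)
            # Property: refunds are non-negative and never exceed the amount paid.
            assert 0.0 <= refund <= total

check_refund_properties()
```

    A property-based framework like Hypothesis would generate and shrink these cases automatically; the explicit loop here just keeps the example dependency-free.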

    Building trust in autonomous systems remains the central challenge. "Right now you still require tons of human guardrails at every step to make sure that the right thing happens. And as we get better at these techniques, you will use less and less, and you'll be able to trust the agents a lot more," he said.

    Amazon's bigger bet on autonomous AI stretches far beyond writing code

    The frontier agents announcement arrived alongside a cascade of other news at re:Invent 2025. AWS kicked off the conference with major announcements on agentic AI capabilities, customer service innovations, and multicloud networking.

    Amazon expanded its Nova portfolio with four new models delivering industry-leading price-performance across reasoning, multimodal processing, conversational AI, code generation, and agentic tasks. Nova Forge pioneers "open training," giving organizations access to pre-trained model checkpoints and the ability to blend proprietary data with Amazon Nova-curated datasets.

    AWS also added 18 new open weight models to Amazon Bedrock, reinforcing its commitment to offering a broad selection of fully managed models from leading AI providers. The launch includes new models from Mistral AI, Google's Gemma 3, MiniMax's M2, NVIDIA's Nemotron, and OpenAI's GPT OSS Safeguard.

    On the infrastructure side, Amazon EC2 Trn3 UltraServers, powered by AWS's first 3nm AI chip, pack up to 144 Trainium3 chips into a single integrated system, delivering up to 4.4x more compute performance and 4x greater energy efficiency than the previous generation. AWS AI Factories provides enterprises and government organizations with dedicated AWS AI infrastructure deployed in their own data centers, combining NVIDIA GPUs, Trainium chips, AWS networking, and AI services like Amazon Bedrock and SageMaker AI.

    All three frontier agents launched in preview on Tuesday. Pricing will be announced when the services reach general availability.

    Singh made clear the company sees applications far beyond coding. "These are the first frontier agents we are releasing, and they're in the software development lifecycle," he said. "The problems and use cases for frontier agents—these agents that are long running, capable of autonomy, thinking, always learning and improving—can be applied to many, many domains."

    Amazon, after all, operates satellite networks, runs robot-powered warehouses, and manages one of the world's largest e-commerce platforms. If autonomous agents can learn to write code on their own, the company is betting they can eventually learn to do just about anything else.

  • MIT offshoot Liquid AI releases blueprint for enterprise-grade small-model training

    When Liquid AI, a startup founded by MIT computer scientists back in 2023, introduced its Liquid Foundation Models series 2 (LFM2) in July 2025, the pitch was straightforward: deliver the fastest on-device foundation models on the market using the new "liquid" architecture, with training and inference efficiency that made small models a serious alternative to cloud-only large language models (LLMs) such as OpenAI's GPT series and Google's Gemini.

    The initial release shipped dense checkpoints at 350M, 700M, and 1.2B parameters, a hybrid architecture heavily weighted toward gated short convolutions, and benchmark numbers that placed LFM2 ahead of similarly sized competitors like Qwen3, Llama 3.2, and Gemma 3 on both quality and CPU throughput. The message to enterprises was clear: real-time, privacy-preserving AI on phones, laptops, and vehicles no longer required sacrificing capability for latency.

    In the months since that launch, Liquid has expanded LFM2 into a broader product line — adding task-and-domain-specialized variants, a small video ingestion and analysis model, and an edge-focused deployment stack called LEAP — and positioned the models as the control layer for on-device and on-prem agentic systems.

    Now, with the publication of the detailed, 51-page LFM2 technical report on arXiv, the company is going a step further: making public the architecture search process, training data mixture, distillation objective, curriculum strategy, and post-training pipeline behind those models.

    And unlike earlier open models, LFM2 is built around a repeatable recipe: a hardware-in-the-loop search process, a training curriculum that compensates for smaller parameter budgets, and a post-training pipeline tuned for instruction following and tool use.

    Rather than just offering weights and an API, Liquid is effectively publishing a detailed blueprint that other organizations can use as a reference for training their own small, efficient models from scratch, tuned to their own hardware and deployment constraints.

    A model family designed around real constraints, not GPU labs

    The technical report begins with a premise enterprises are intimately familiar with: real AI systems hit limits long before benchmarks do. Latency budgets, peak memory ceilings, and thermal throttling define what can actually run in production—especially on laptops, tablets, commodity servers, and mobile devices.

    To address this, Liquid AI performed architecture search directly on target hardware, including Snapdragon mobile SoCs and Ryzen laptop CPUs. The result is a consistent outcome across sizes: a minimal hybrid architecture dominated by gated short convolution blocks and a small number of grouped-query attention (GQA) layers. This design was repeatedly selected over more exotic linear-attention and SSM hybrids because it delivered a better quality-latency-memory Pareto profile under real device conditions.

    This matters for enterprise teams in three ways:

    1. Predictability. The architecture is simple, parameter-efficient, and stable across model sizes from 350M to 2.6B.

    2. Operational portability. Dense and MoE variants share the same structural backbone, simplifying deployment across mixed hardware fleets.

    3. On-device feasibility. Prefill and decode throughput on CPUs surpasses that of comparable open models by roughly 2× in many cases, reducing the need to offload routine tasks to cloud inference endpoints.

    Instead of optimizing for academic novelty, the report reads as a systematic attempt to design models enterprises can actually ship.

    That practicality is notable in a field where many open models quietly assume access to multi-H100 clusters during inference.

    A training pipeline tuned for enterprise-relevant behavior

    LFM2 adopts a training approach that compensates for the smaller scale of its models with structure rather than brute force. Key elements include:

    • 10–12T token pre-training and an additional 32K-context mid-training phase, which extends the model’s useful context window without exploding compute costs.

    • A decoupled Top-K knowledge distillation objective that sidesteps the instability of standard KL distillation when teachers provide only partial logits.

    • A three-stage post-training sequence—SFT, length-normalized preference alignment, and model merging—designed to produce more reliable instruction following and tool-use behavior.
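    The report's exact distillation objective isn't reproduced here, but the shape of a decoupled top-K loss can be sketched: match the teacher's renormalized top-k distribution on the tokens it reveals, and handle the unseen tail as a separate mass term. Everything below (the function name, the two-term split) is an illustrative assumption, not LFM2's formula:

```python
import math

def topk_distill_loss(teacher_topk, student_logits, k_indices):
    """Illustrative decoupled top-K distillation loss (not LFM2's
    exact objective).

    teacher_topk:   teacher's top-k probabilities, renormalized to sum to 1
    student_logits: full student logits over the vocabulary
    k_indices:      vocabulary indices of the teacher's top-k tokens
    """
    # Softmax over the full student vocabulary.
    m = max(student_logits)
    exps = [math.exp(z - m) for z in student_logits]
    total = sum(exps)
    student_probs = [e / total for e in exps]

    # Decouple: (1) KL between the teacher's renormalized top-k
    # distribution and the student's distribution renormalized over
    # those same k tokens ...
    s_topk = [student_probs[i] for i in k_indices]
    s_mass = sum(s_topk)
    kl_topk = sum(
        t * math.log(t / (s / s_mass))
        for t, s in zip(teacher_topk, s_topk)
    )
    # ... plus (2) a term pushing the student's probability mass onto
    # the teacher-covered tokens, standing in for the unseen tail.
    tail_term = -math.log(s_mass)
    return kl_topk + tail_term

# Toy example: vocab of 5, teacher reveals only its top-2 tokens.
loss = topk_distill_loss([0.7, 0.3], [2.0, 1.5, 0.1, 0.0, -1.0], [0, 1])
```

    The point of the decoupling is that neither term ever needs the teacher's full logit vector, which is exactly the situation the report describes as destabilizing standard KL distillation.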

    For enterprise AI developers, the significance is that LFM2 models behave less like “tiny LLMs” and more like practical agents able to follow structured formats, adhere to JSON schemas, and manage multi-turn chat flows. Many open models at similar sizes fail not due to lack of reasoning ability, but due to brittle adherence to instruction templates. The LFM2 post-training recipe directly targets these rough edges.

    In other words: Liquid AI optimized small models for operational reliability, not just scoreboards.

    Multimodality designed for device constraints, not lab demos

    The LFM2-VL and LFM2-Audio variants reflect another shift: multimodality built around token efficiency.

    Rather than embedding a massive vision transformer directly into an LLM, LFM2-VL attaches a SigLIP2 encoder through a connector that aggressively reduces visual token count via PixelUnshuffle. High-resolution inputs automatically trigger dynamic tiling, keeping token budgets controllable even on mobile hardware. LFM2-Audio uses a bifurcated audio path—one for embeddings, one for generation—supporting real-time transcription or speech-to-speech on modest CPUs.
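    The token savings from a pixel-unshuffle connector are just arithmetic: a factor-r unshuffle folds each r×r patch of visual tokens into the channel dimension, so the token count drops by r² while per-token width grows by the same factor. A minimal sketch (the shapes are hypothetical, not LFM2-VL's actual configuration):

```python
def pixel_unshuffle_tokens(h, w, c, r):
    """Fold each r x r spatial patch into the channel dimension.

    A (h, w, c) grid of visual tokens becomes (h//r, w//r, c*r*r):
    the same information, carried by r*r-fold fewer tokens.
    """
    assert h % r == 0 and w % r == 0
    return (h // r, w // r, c * r * r)

# A 32x32 patch grid (1,024 visual tokens) with a factor-2 unshuffle
# becomes a 16x16 grid: 256 tokens, each four times wider.
shape = pixel_unshuffle_tokens(32, 32, 768, 2)
print(shape)                 # (16, 16, 3072)
print(shape[0] * shape[1])   # 256 tokens, down from 1024
```

    Since attention cost scales with token count, a factor-2 unshuffle cuts the visual portion of the sequence by 4× before the language model ever sees it.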

    For enterprise platform architects, this design points toward a practical future where:

    • document understanding happens directly on endpoints such as field devices;

    • audio transcription and speech agents run locally for privacy compliance;

    • multimodal agents operate within fixed latency envelopes without streaming data off-device.

    The through-line is the same: multimodal capability without requiring a GPU farm.

    Retrieval models built for agent systems, not legacy search

    LFM2-ColBERT extends late-interaction retrieval into a footprint small enough for enterprise deployments that need multilingual RAG without the overhead of specialized vector DB accelerators.

    This is particularly meaningful as organizations begin to orchestrate fleets of agents. Fast local retrieval—running on the same hardware as the reasoning model—reduces latency and provides a governance win: documents never leave the device boundary.
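    Late-interaction scoring of the kind ColBERT popularized is compact enough to show in full: every query token keeps its own embedding, matches its best document token, and the per-token maxima are summed. A dependency-free sketch with toy 2-D embeddings (real deployments use learned, much higher-dimensional vectors):

```python
import math

def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction (ColBERT-style) relevance: each query token
    matches against its best-scoring document token, and the
    per-query-token maxima are summed."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    return sum(max(cos(q, d) for d in doc_vecs) for q in query_vecs)

# Toy example: the first query token matches doc token 0 exactly and
# the second matches doc token 1 exactly, so the score is 2.0.
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
print(maxsim_score(query, doc))
```

    Because document token embeddings can be precomputed and stored, only the cheap max-and-sum step runs at query time, which is what makes this footprint viable on the same hardware as the reasoning model.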

    Taken together, the VL, Audio, and ColBERT variants show LFM2 as a modular system, not a single model drop.

    The emerging blueprint for hybrid enterprise AI architectures

    Across all variants, the LFM2 report implicitly sketches what tomorrow’s enterprise AI stack will look like: hybrid local-cloud orchestration, where small, fast models operating on devices handle time-critical perception, formatting, tool invocation, and judgment tasks, while larger models in the cloud offer heavyweight reasoning when needed.

    Several trends converge here:

    • Cost control. Running routine inference locally avoids unpredictable cloud billing.

    • Latency determinism. Time to first token (TTFT) and decode stability matter in agent workflows; running on-device eliminates network jitter.

    • Governance and compliance. Local execution simplifies PII handling, data residency, and auditability.

    • Resilience. Agentic systems degrade gracefully if the cloud path becomes unavailable.

    Enterprises adopting these architectures will likely treat small on-device models as the “control plane” of agentic workflows, with large cloud models serving as on-demand accelerators.

    LFM2 is one of the clearest open-source foundations for that control layer to date.

    The strategic takeaway: on-device AI is now a design choice, not a compromise

    For years, organizations building AI features have accepted that “real AI” requires cloud inference. LFM2 challenges that assumption. The models perform competitively across reasoning, instruction following, multilingual tasks, and RAG—while simultaneously achieving substantial latency gains over other open small-model families.

    For CIOs and CTOs finalizing 2026 roadmaps, the implication is direct: small, open, on-device models are now strong enough to carry meaningful slices of production workloads.

    LFM2 will not replace frontier cloud models for frontier-scale reasoning. But it offers something enterprises arguably need more: a reproducible, open, and operationally feasible foundation for agentic systems that must run anywhere, from phones to industrial endpoints to air-gapped secure facilities.

    In the broadening landscape of enterprise AI, LFM2 is less a research milestone and more a sign of architectural convergence. The future is not cloud or edge—it’s both, operating in concert. And releases like LFM2 provide the building blocks for organizations prepared to build that hybrid future intentionally rather than accidentally.

  • DeepSeek just dropped two insanely powerful AI models that rival GPT-5 and they’re totally free

    Chinese artificial intelligence startup DeepSeek released two powerful new AI models on Sunday that the company claims match or exceed the capabilities of OpenAI's GPT-5 and Google's Gemini-3.0-Pro — a development that could reshape the competitive landscape between American tech giants and their Chinese challengers.

    The Hangzhou-based company launched DeepSeek-V3.2, designed as an everyday reasoning assistant, alongside DeepSeek-V3.2-Speciale, a high-powered variant that achieved gold-medal performance in four elite international competitions: the 2025 International Mathematical Olympiad, the International Olympiad in Informatics, the ICPC World Finals, and the China Mathematical Olympiad.

    The release carries profound implications for American technology leadership. DeepSeek has once again demonstrated that it can produce frontier AI systems despite U.S. export controls that restrict China's access to advanced Nvidia chips — and it has done so while making its models freely available under an open-source MIT license.

    "People thought DeepSeek gave a one-time breakthrough but we came back much bigger," wrote Chen Fang, who identified himself as a contributor to the project, on X (formerly Twitter). The release drew swift reactions online, with one user declaring: "Rest in peace, ChatGPT."

    How DeepSeek's sparse attention breakthrough slashes computing costs

    At the heart of the new release lies DeepSeek Sparse Attention, or DSA — a novel architectural innovation that dramatically reduces the computational burden of running AI models on long documents and complex tasks.

    Traditional AI attention mechanisms, the core technology allowing language models to understand context, scale poorly as input length increases. Processing a document twice as long typically requires four times the computation. DeepSeek's approach breaks this constraint using what the company calls a "lightning indexer" that identifies only the most relevant portions of context for each query, ignoring the rest.

    According to DeepSeek's technical report, DSA reduces inference costs by roughly half compared to previous models when processing long sequences. The architecture "substantially reduces computational complexity while preserving model performance," the report states.
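    DSA's exact design lives in the technical report, but the index-then-attend pattern it describes can be sketched generically: a cheap scoring pass picks the top-k keys for each query, and full softmax attention runs only over those k, so per-token cost grows with k rather than with context length. The indexer below is a stand-in dot product, not DeepSeek's lightning indexer:

```python
import math

def sparse_attention(query, keys, values, k):
    """Illustrative index-then-attend sparse attention (a generic
    sketch, not DeepSeek's exact DSA)."""
    # Stage 1: lightweight indexer scores every key cheaply.
    # Here a plain dot product stands in for the learned indexer.
    scores = [sum(q * kk for q, kk in zip(query, key)) for key in keys]
    topk = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]

    # Stage 2: softmax attention restricted to the selected keys only.
    sel = [scores[i] for i in topk]
    m = max(sel)
    weights = [math.exp(s - m) for s in sel]
    z = sum(weights)
    dim = len(values[0])
    out = [0.0] * dim
    for w, i in zip(weights, topk):
        for d in range(dim):
            out[d] += (w / z) * values[i][d]
    return out

# Eight keys in context, but only the top two are ever attended to.
keys = [[float(i), 1.0] for i in range(8)]
values = [[float(i)] for i in range(8)]
out = sparse_attention([1.0, 0.0], keys, values, k=2)
```

    With dense attention, doubling the context doubles the work per query token; here it only makes the stage-1 scoring pass longer, while the expensive stage-2 attention stays fixed at k entries.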

    Processing 128,000 tokens — roughly equivalent to a 300-page book — now costs approximately $0.70 per million tokens for decoding, compared to $2.40 for the previous V3.1-Terminus model. That represents a 70% reduction in inference costs.

    The 685-billion-parameter models support context windows of 128,000 tokens, making them suitable for analyzing lengthy documents, codebases, and research papers. DeepSeek's technical report notes that independent evaluations on long-context benchmarks show V3.2 performing on par with or better than its predecessor "despite incorporating a sparse attention mechanism."

    The benchmark results that put DeepSeek in the same league as GPT-5

    DeepSeek's claims of parity with America's leading AI systems rest on extensive testing across mathematics, coding, and reasoning tasks — and the numbers are striking.

    On AIME 2025, a prestigious American mathematics competition, DeepSeek-V3.2-Speciale achieved a 96.0% pass rate, compared to 94.6% for GPT-5-High and 95.0% for Gemini-3.0-Pro. On the Harvard-MIT Mathematics Tournament, the Speciale variant scored 99.2%, surpassing Gemini's 97.5%.

    The standard V3.2 model, optimized for everyday use, scored 93.1% on AIME and 92.5% on HMMT — marginally below frontier models but achieved with substantially fewer computational resources.

    Most striking are the competition results. DeepSeek-V3.2-Speciale scored 35 out of 42 points on the 2025 International Mathematical Olympiad, earning gold-medal status. At the International Olympiad in Informatics, it scored 492 out of 600 points — also gold, ranking 10th overall. The model solved 10 of 12 problems at the ICPC World Finals, placing second.

    These results came without internet access or tools during testing. DeepSeek's report states that "testing strictly adheres to the contest's time and attempt limits."

    On coding benchmarks, DeepSeek-V3.2 resolved 73.1% of real-world software bugs on SWE-bench Verified, competitive with GPT-5-High at 74.9%. On Terminal Bench 2.0, which measures complex coding workflows, DeepSeek scored 46.4%, well above GPT-5-High's 35.2%.

    The company acknowledges limitations. "Token efficiency remains a challenge," the technical report states, noting that DeepSeek "typically requires longer generation trajectories" to match Gemini-3.0-Pro's output quality.

    Why teaching AI to think while using tools changes everything

    Beyond raw reasoning, DeepSeek-V3.2 introduces "thinking in tool-use" — the ability to reason through problems while simultaneously executing code, searching the web, and manipulating files.

    Previous AI models faced a frustrating limitation: each time they called an external tool, they lost their train of thought and had to restart reasoning from scratch. DeepSeek's architecture preserves the reasoning trace across multiple tool calls, enabling fluid multi-step problem solving.
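    The contrast can be made concrete with a toy agent loop. In the sketch below (a hypothetical API, not DeepSeek's), the transcript, including the model's earlier reasoning, is carried into every call, so a tool result lands in the middle of an ongoing train of thought rather than after a reset:

```python
def run_agent(task, model_step, tools, max_steps=10):
    """Illustrative 'thinking in tool-use' loop: the transcript,
    including prior reasoning, is passed into every model call
    instead of being discarded after each tool invocation."""
    transcript = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        step = model_step(transcript)  # reasoning + optional tool call
        # Keep the model's reasoning in context rather than discarding it.
        transcript.append({"role": "assistant", "content": step["reasoning"]})
        if step.get("tool") is None:
            return step["reasoning"], transcript
        result = tools[step["tool"]](**step["args"])
        transcript.append({"role": "tool", "content": str(result)})
    return None, transcript

# Toy model: calls a calculator once, then answers from its preserved trace.
def toy_model(transcript):
    if not any(m["role"] == "tool" for m in transcript):
        return {"reasoning": "Need 6*7 first.", "tool": "calc",
                "args": {"a": 6, "b": 7}}
    return {"reasoning": "The tool said 42, so the answer is 42.",
            "tool": None}

answer, log = run_agent("What is 6*7?", toy_model,
                        {"calc": lambda a, b: a * b})
```

    A loop that reset the transcript after each tool call would force the model to re-derive why it asked for the calculation; keeping the trace is what makes multi-step tool use fluid.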

    To train this capability, the company built a massive synthetic data pipeline generating over 1,800 distinct task environments and 85,000 complex instructions. These included challenges like multi-day trip planning with budget constraints, software bug fixes across eight programming languages, and web-based research requiring dozens of searches.

    The technical report describes one example: planning a three-day trip from Hangzhou with constraints on hotel prices, restaurant ratings, and attraction costs that vary based on accommodation choices. Such tasks are "hard to solve but easy to verify," making them ideal for training AI agents.

    DeepSeek employed real-world tools during training — actual web search APIs, coding environments, and Jupyter notebooks — while generating synthetic prompts to ensure diversity. The result is a model that generalizes to unseen tools and environments, a critical capability for real-world deployment.

    DeepSeek's open-source gambit could upend the AI industry's business model

    Unlike OpenAI and Anthropic, which guard their most powerful models as proprietary assets, DeepSeek has released both V3.2 and V3.2-Speciale under the MIT license — one of the most permissive open-source frameworks available.

    Any developer, researcher, or company can download, modify, and deploy the 685-billion-parameter models without restriction. Full model weights, training code, and documentation are available on Hugging Face, the leading platform for AI model sharing.

    The strategic implications are significant. By making frontier-capable models freely available, DeepSeek undermines competitors charging premium API prices. The Hugging Face model card notes that DeepSeek has provided Python scripts and test cases "demonstrating how to encode messages in OpenAI-compatible format" — making migration from competing services straightforward.

    For enterprise customers, the value proposition is compelling: frontier performance at dramatically lower cost, with deployment flexibility. But data residency concerns and regulatory uncertainty may limit adoption in sensitive applications — particularly given DeepSeek's Chinese origins.

    Regulatory walls are rising against DeepSeek in Europe and America

    DeepSeek's global expansion faces mounting resistance. In June, Berlin's data protection commissioner Meike Kamp declared that DeepSeek's transfer of German user data to China is "unlawful" under EU rules, asking Apple and Google to consider blocking the app.

    The German authority expressed concern that "Chinese authorities have extensive access rights to personal data within the sphere of influence of Chinese companies." Italy ordered DeepSeek to block its app in February. U.S. lawmakers have moved to ban the service from government devices, citing national security concerns.

    Questions also persist about U.S. export controls designed to limit China's AI capabilities. In August, DeepSeek hinted that China would soon have "next generation" domestically built chips to support its models. The company indicated its systems work with Chinese-made chips from Huawei and Cambricon without additional setup.

    DeepSeek's original V3 model was reportedly trained on roughly 2,000 older Nvidia H800 chips, hardware that has since been barred from export to China. The company has not disclosed what powered V3.2 training, but its continued advancement suggests export controls alone cannot halt Chinese AI progress.

    What DeepSeek's release means for the future of AI competition

    The release arrives at a pivotal moment. After years of massive investment, some analysts question whether an AI bubble is forming. DeepSeek's ability to match American frontier models at a fraction of the cost challenges assumptions that AI leadership requires enormous capital expenditure.

    The company's technical report reveals that post-training investment now exceeds 10% of pre-training costs — a substantial allocation credited for reasoning improvements. But DeepSeek acknowledges gaps: "The breadth of world knowledge in DeepSeek-V3.2 still lags behind leading proprietary models," the report states. The company plans to address this by scaling pre-training compute.

    DeepSeek-V3.2-Speciale remains available through a temporary API until December 15, when its capabilities will merge into the standard release. The Speciale variant is designed exclusively for deep reasoning and does not support tool calling — a limitation the standard model addresses.

    For now, the AI race between the United States and China has entered a new phase. DeepSeek's release demonstrates that open-source models can achieve frontier performance, that efficiency innovations can slash costs dramatically, and that the most powerful AI systems may soon be freely available to anyone with an internet connection.

    As one commenter on X observed: "Deepseek just casually breaking those historic benchmarks set by Gemini is bonkers."

    The question is no longer whether Chinese AI can compete with Silicon Valley. It's whether American companies can maintain their lead when their Chinese rival gives comparable technology away for free.

  • OpenAGI emerges from stealth with an AI agent that it claims crushes OpenAI and Anthropic

    A stealth artificial intelligence startup founded by an MIT researcher emerged this morning with an ambitious claim: its new AI model can control computers better than systems built by OpenAI and Anthropic — at a fraction of the cost.

    OpenAGI, led by chief executive Zengyi Qin, released Lux, a foundation model designed to operate computers autonomously by interpreting screenshots and executing actions across desktop applications. The San Francisco-based company says Lux achieves an 83.6 percent success rate on Online-Mind2Web, a benchmark that has become the industry's most rigorous test for evaluating AI agents that control computers.

    That score is a significant leap over the leading models from well-funded competitors. OpenAI's Operator, released in January, scores 61.3 percent on the same benchmark. Anthropic's Claude Computer Use achieves 56.3 percent.

    "Traditional LLM training feeds a large amount of text corpus into the model. The model learns to produce text," Qin said in an exclusive interview with VentureBeat. "By contrast, our model learns to produce actions. The model is trained with a large amount of computer screenshots and action sequences, allowing it to produce actions to control the computer."

    The announcement arrives at a pivotal moment for the AI industry. Technology giants and startups alike have poured billions of dollars into developing autonomous agents capable of navigating software, booking travel, filling out forms, and executing complex workflows. OpenAI, Anthropic, Google, and Microsoft have all released or announced agent products in the past year, betting that computer-controlling AI will become as transformative as chatbots.

    Yet independent research has cast doubt on whether current agents are as capable as their creators suggest.

    Why university researchers built a tougher benchmark to test AI agents—and what they discovered

    The Online-Mind2Web benchmark, developed by researchers at Ohio State University and the University of California, Berkeley, was designed specifically to expose the gap between marketing claims and actual performance.

    Published in April and accepted to the Conference on Language Modeling 2025, the benchmark comprises 300 diverse tasks across 136 real websites — everything from booking flights to navigating complex e-commerce checkouts. Unlike earlier benchmarks that cached parts of websites, Online-Mind2Web tests agents in live online environments where pages change dynamically and unexpected obstacles appear.

    The results, according to the researchers, painted "a very different picture of the competency of current agents, suggesting over-optimism in previously reported results."

    When the Ohio State team tested five leading web agents with careful human evaluation, they found that many recent systems — despite heavy investment and marketing fanfare — did not outperform SeeAct, a relatively simple agent released in January 2024. Even OpenAI's Operator, the best performer among commercial offerings in their study, achieved only 61 percent success.

    "It seemed that highly capable and practical agents were maybe indeed just months away," the researchers wrote in a blog post accompanying their paper. "However, we are also well aware that there are still many fundamental gaps in research to fully autonomous agents, and current agents are probably not as competent as the reported benchmark numbers may depict."

    The benchmark has gained traction as an industry standard, with a public leaderboard hosted on Hugging Face tracking submissions from research groups and companies.

    How OpenAGI trained its AI to take actions instead of just generating text

    OpenAGI's claimed performance advantage stems from what the company calls "Agentic Active Pre-training," a training methodology that differs fundamentally from how most large language models learn.

    Conventional language models train on vast text corpora, learning to predict the next word in a sequence. The resulting systems excel at generating coherent text but were not designed to take actions in graphical environments.

    Lux, according to Qin, takes a different approach. The model trains on computer screenshots paired with action sequences, learning to interpret visual interfaces and determine which clicks, keystrokes, and navigation steps will accomplish a given goal.

    "The action allows the model to actively explore the computer environment, and such exploration generates new knowledge, which is then fed back to the model for training," Qin told VentureBeat. "This is a naturally self-evolving process, where a better model produces better exploration, better exploration produces better knowledge, and better knowledge leads to a better model."

    This self-reinforcing training loop, if it functions as described, could help explain how a smaller team might achieve results that elude larger organizations. Rather than requiring ever-larger static datasets, the approach would allow the model to continuously improve by generating its own training data through exploration.

    OpenAGI also claims significant cost advantages. The company says Lux operates at roughly one-tenth the cost of frontier models from OpenAI and Anthropic while executing tasks faster.

    Unlike browser-only competitors, Lux can control Slack, Excel, and other desktop applications

    A critical distinction in OpenAGI's announcement: Lux can control applications across an entire desktop operating system, not just web browsers.

    Most commercially available computer-use agents, including early versions of Anthropic's Claude Computer Use, focus primarily on browser-based tasks. That limitation excludes vast categories of productivity work that occur in desktop applications — spreadsheets in Microsoft Excel, communications in Slack, design work in Adobe products, code editing in development environments.

    OpenAGI says Lux can navigate these native applications, a capability that would substantially expand the addressable market for computer-use agents. The company is releasing a developer software development kit alongside the model, allowing third parties to build applications on top of Lux.

    The company is also working with Intel to optimize Lux for edge devices, which would allow the model to run locally on laptops and workstations rather than requiring cloud infrastructure. That partnership could address enterprise concerns about sending sensitive screen data to external servers.

    "We are partnering with Intel to optimize our model on edge devices, which will make it the best on-device computer-use model," Qin said.

    The company confirmed it is in exploratory discussions with AMD and Microsoft about additional partnerships.

    What happens when you ask an AI agent to copy your bank details

    Computer-use agents present novel safety challenges that do not arise with conventional chatbots. An AI system capable of clicking buttons, entering text, and navigating applications could, if misdirected, cause significant harm — transferring money, deleting files, or exfiltrating sensitive information.

    OpenAGI says it has built safety mechanisms directly into Lux. When the model encounters requests that violate its safety policies, it refuses to proceed and alerts the user.

    In an example provided by the company, when a user asked the model to "copy my bank details and paste it into a new Google doc," Lux responded with an internal reasoning step: "The user asks me to copy the bank details, which are sensitive information. Based on the safety policy, I am not able to perform this action." The model then issued a warning to the user rather than executing the potentially dangerous request.

    Such safeguards will face intense scrutiny as computer-use agents proliferate. Security researchers have already demonstrated prompt injection attacks against early agent systems, where malicious instructions embedded in websites or documents can hijack an agent's behavior. Whether Lux's safety mechanisms can withstand adversarial attacks remains to be tested by independent researchers.

    The MIT researcher who built two of GitHub's most downloaded AI models

    Qin brings an unusual combination of academic credentials and entrepreneurial experience to OpenAGI.

    He completed his doctorate at the Massachusetts Institute of Technology in 2025, where his research focused on computer vision, robotics, and machine learning. His academic work appeared in top venues including the Conference on Computer Vision and Pattern Recognition, the International Conference on Learning Representations, and the International Conference on Machine Learning.

    Before founding OpenAGI, Qin built several widely adopted AI systems. JetMoE, a large language model he led development on, demonstrated that a high-performing model could be trained from scratch for less than $100,000 — a fraction of the tens of millions typically required. The model outperformed Meta's LLaMA2-7B on standard benchmarks, according to a technical report that attracted attention from MIT's Computer Science and Artificial Intelligence Laboratory.

    His previous open-source projects achieved remarkable adoption. OpenVoice, a voice cloning model, accumulated approximately 35,000 stars on GitHub and ranked in the top 0.03 percent of open-source projects by popularity. MeloTTS, a text-to-speech system, has been downloaded more than 19 million times, making it one of the most widely used audio AI models since its 2024 release.

    Qin also co-founded MyShell, an AI agent platform that has attracted six million users who have collectively built more than 200,000 AI agents. Users have had more than one billion interactions with agents on the platform, according to the company.

    Inside the billion-dollar race to build AI that controls your computer

    The computer-use agent market has attracted intense interest from investors and technology giants over the past year.

    OpenAI released Operator in January, allowing users to instruct an AI to complete tasks across the web. Anthropic has continued developing Claude Computer Use, positioning it as a core capability of its Claude model family. Google has incorporated agent features into its Gemini products. Microsoft has integrated agent capabilities across its Copilot offerings and Windows.

    Yet the market remains nascent. Enterprise adoption has been limited by concerns about reliability, security, and the ability to handle edge cases that occur frequently in real-world workflows. The performance gaps revealed by benchmarks like Online-Mind2Web suggest that current systems may not be ready for mission-critical applications.

    OpenAGI enters this competitive landscape as an independent alternative, positioning superior benchmark performance and lower costs against the massive resources of its well-funded rivals. The company's Lux model and developer SDK are available beginning today.

    Whether OpenAGI can translate benchmark dominance into real-world reliability remains the central question. The AI industry has a long history of impressive demos that falter in production, of laboratory results that crumble against the chaos of actual use. Benchmarks measure what they measure, and the distance between a controlled test and an 8-hour workday full of edge cases, exceptions, and surprises can be vast.

    But if Lux performs in the wild the way it performs in the lab, the implications extend far beyond one startup's success. It would suggest that the path to capable AI agents runs not through the largest checkbooks but through the cleverest architectures—that a small team with the right ideas can outmaneuver the giants.

    The technology industry has seen that story before. It rarely stays true for long.

  • Hybrid cloud security must be rebuilt for an AI war it was never designed to fight

    Hybrid cloud security was built before the current era of automated, machine-based cyberattacks that take just milliseconds to execute and minutes to deliver devastating impacts to infrastructure.

    The architectures and tech stacks every enterprise depends on, from batch-based detection to siloed tools to 15-minute response windows, stood a better chance of defending against attackers moving at human speed. But in a weaponized AI world, those approaches to analyzing threat data don't make sense.

    The latest survey numbers tell the story. More than half (55%) of organizations suffered cloud breaches in the past year. That’s a 17-point spike, according to Gigamon's 2025 Hybrid Cloud Security Survey. Nearly half of the enterprises polled said their security tools missed the attack entirely. While 82% of enterprises now run hybrid or multi-cloud environments, only 36% express confidence in detecting threats in real time, per Fortinet's 2025 State of Cloud Security Report.

    Adversaries aren’t wasting any time weaponizing AI to target hybrid cloud vulnerabilities. Organizations now face 1,925 cyberattacks weekly. That’s an increase of 47% in a year. Further, ransomware surged 126% in the first quarter of 2025 alone. The visibility gaps everyone talks about in hybrid environments are where breaches originate. The bottom line is that security architectures designed for the pre-AI era can't keep pace.

    But the industry is finally beginning to respond. CrowdStrike, for its part, is providing one vision of cybersecurity reinvention. Today at AWS re:Invent, the company is rolling out real-time Cloud Detection and Response, a platform designed to compress 15-minute response windows down to seconds.

    But the bigger story is why the entire approach to hybrid cloud security must change, and what that means for CISOs planning their 2026 strategies.

    Why the old model for hybrid cloud security is failing

    Initially, hybrid cloud promised the best of both worlds. Every organization could have public cloud agility with on-prem control. The security model that took shape reflected the best practices at the time. The trouble is that those best practices are now introducing vulnerabilities.

    How bad is it? According to recent research, the majority of security teams struggle to keep up with the threats and workloads they face.

    "You can't secure what you can't see," says Mandy Andress, CISO at Elastic. "That's the heart of the two big challenges we see as security practitioners: The complexity and sprawl of an organization's infrastructure, coupled with the rapid pace of technological change."

    CrowdStrike's Zaitsev diagnosed the root cause: "Everyone assumed this was a one-way trip, lift and shift everything to the cloud. That's not what happened. We're seeing companies pull workloads back on-prem when the economics make sense. The reality? Everyone's going to be hybrid. Five years from now. Ten years. Maybe forever. Security has to deal with that."

    Weaponized AI is changing the threat calculus fast

    The weaponized AI era isn't just accelerating attacks. It’s breaking the fundamental assumptions on which hybrid cloud security was built. The window between patch release and weaponized exploit collapsed from weeks to hours. The majority of adversaries aren't typing commands anymore; they're automating machine-based campaigns that orchestrate agentic AI at a scale and speed that current hybrid cloud tools and human SOC teams can't keep up with.

    Zaitsev shared threat data from CrowdStrike's mid-year hunting report, which found that cloud intrusions spiked 136% in a year, with roughly 40% of all cloud actor activity coming from Chinese nexus adversaries. This illustrates how quickly the threat landscape can change, and why hybrid cloud security needs to be reinvented for the AI era now.

    Mike Riemer, SVP and field CISO at Ivanti, has witnessed the timeline collapse. Threat actors now reverse-engineer patches within 72 hours using AI assistance. If enterprises don't patch within that time frame, "they're open to exploit," Riemer told VentureBeat. "That's the new reality."

    Using previous-generation tools in the current cloud control plane is a dangerous bet. All it takes is a single compromised virtual machine (VM) that no one knows exists. Compromise the control plane, including the APIs that manage cloud resources, and attackers have the keys to spin up, modify or delete thousands of assets across a company’s hybrid environment.

    The seams between hybrid cloud environments are attack highways where millisecond-long attacks seldom leave any digital exhaust or traces. Many organizations never see weaponized AI attacks coming.

    VentureBeat hears that the worst hybrid cloud attacks can only be diagnosed long after the fact, when forensics and analysis are finally completed. Attackers and adversaries are that good at covering their tracks, often relying on living-off-the-land (LotL) tools to evade detection for months, even years in extreme cases.

    "Enterprises training AI models are concentrating sensitive data in cloud environments, which is gold for adversaries," CrowdStrike's Zaitsev said. "Attackers are using agentic AI to run their campaigns. The traditional SOC workflow — see the alert, triage, investigate for 15 or 20 minutes, take action an hour or a day later — is completely insufficient. You're bringing a knife to a gunfight."

    The human toll of relying on outdated architecture

    The human toll of the hybrid cloud crisis shows up in SOC metrics and burnout. The AI SOC Market Landscape 2025 report found that the average security operations center processes 960 alerts daily. Each takes roughly 70 minutes to investigate properly. Assuming standard SOC staffing levels, there aren't enough hours in the day to get to all those alerts.

    Further, at least 40% of alerts, on average, never get touched. The human cost is staggering. A Tines survey of SOC analysts found that 71% are experiencing burnout. Two-thirds say manual grunt work consumes more than half of their day. The same percentage are eyeing an exit from their jobs and, as some confide to VentureBeat, in extreme cases from the industry altogether.

    Hybrid environments make everything more complicated. Enterprises have different tools for AWS, Azure and on-prem architectures. They have different consoles; often different teams. As for alert correlation across environments? It's manual and often delegated to the most senior SOC team members — if it happens at all.

    Batch-based detection can't survive the weaponized AI era

    Here's what most legacy vendors of hybrid cloud security tools won't openly admit: Cloud security tools are fundamentally flawed and not designed for real-time defense. The majority are batch-based, collecting logs every five, ten or fifteen minutes, processing them through correlation engines, then generating alerts. In a world where adversaries are increasingly executing machine-based attacks in milliseconds, a 15-minute detection delay isn't just a minor setback; it's the difference between stopping an attack and having to investigate a breach.

    As adversaries weaponize AI to accelerate cloud attacks and move laterally across systems, traditional cloud detection and response (CDR) tools relying on log batch processing are too slow to keep up. These systems can take 15 minutes or more to surface a single detection.

    CrowdStrike's Zaitsev didn't hedge. Before the company's new tools released today, there was no such thing as real-time cloud detection and prevention, he claimed. "Everyone else is batch-based. Suck down logs every five or 10 minutes, wait for data, import it, correlate it. We've seen competitors take 10 to 15 minutes minimum. That's not detection—that's archaeology."

    He continued: "It's carrier pigeon versus 5G. The gap between 15 minutes and 15 seconds isn't just about alert quality. It's the difference between getting a notification that something has already happened; now you're doing cleanup, versus actually stopping the attack before the adversary achieves anything. One is incident response. The other is prevention."
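    The latency argument can be made concrete with a toy model — an illustrative sketch, not CrowdStrike's implementation: a batch pipeline only sees an event at its next collection cycle, while a streaming pipeline sees it roughly as it arrives.

```python
# Illustrative sketch (not any vendor's implementation): compare when a
# batch pipeline vs. a streaming pipeline first *sees* a malicious event.
# Timestamps are in seconds; 300s models a 5-minute log collection window.

BATCH_WINDOW = 300  # logs collected and correlated every 5 minutes

def batch_detection_time(event_time: float) -> float:
    """A batch detector only sees the event at the next collection cycle."""
    cycles = int(event_time // BATCH_WINDOW) + 1
    return cycles * BATCH_WINDOW

def stream_detection_time(event_time: float, processing: float = 1.0) -> float:
    """A streaming detector sees the event as it arrives, plus processing."""
    return event_time + processing

attack_at = 42.0  # malicious API call 42 seconds into the window
batch_delay = batch_detection_time(attack_at) - attack_at    # 258.0s
stream_delay = stream_detection_time(attack_at) - attack_at  # 1.0s
```

    With a 5-minute batch window, an attack 42 seconds into the window goes unseen for over four minutes; a streaming detector surfaces it in about a second.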

    Reinventing hybrid cloud security must begin with speed

    CrowdStrike's new real-time Cloud Detection and Response, part of Falcon Cloud Security's unified cloud-native application protection platform (CNAPP), is intended to secure every layer of hybrid cloud risk. It is built on three key innovations:

    • Real-time detection engine: Built on event streaming technology pioneered and battle-tested by Falcon Adversary OverWatch, this engine analyzes cloud logs as they stream in. It then applies detections to eliminate latency and false positives.

    • New cloud-specific indicators of attack out of the box: AI and machine learning (ML) correlate what's happening in real time against cloud asset and identity data. That's how the system catches stealthy moves like privilege escalation and CloudShell abuse before attackers can capitalize on them.

    • Automated cloud response actions and workflows: There's a gap in traditional cloud security. Cloud workload protection (CWP) simply stops at the workload. Cloud security posture management (CSPM) shows what could go wrong. But neither protects the control plane at runtime. New workflows built on Falcon Fusion SOAR close that gap, triggering instantly to disrupt adversaries before SOC teams can intervene.

    CrowdStrike's Cloud Detection and Response integrates with AWS EventBridge, Amazon's real-time serverless event streaming service. Instead of polling for logs on a schedule, the system taps directly into the event stream as things happen.

    "Anything that calls itself CNAPP that doesn't have real-time cloud detection and response is now obsolete," CrowdStrike CTO Elia Zaitsev said in an exclusive interview with VentureBeat.

    By contrast, EventBridge provides asynchronous, microservice-based, just-in-time event processing. "We're not waiting five minutes for a bucket of data," he said.

    But tapping into it is only half the problem. "Can you actually keep up with that firehose? Can you process it fast enough to matter?" Zaitsev asked rhetorically. CrowdStrike claims it can handle 60 million events per second. "This isn't duct tape and a demo."

    The underlying streaming technology isn't new to CrowdStrike. Falcon Adversary OverWatch has been running stream processing for 15 years to hunt across CrowdStrike's customer base, processing logs in real time rather than waiting for batch cycles to complete.

    The platform integrates Charlotte AI for automated triage, providing 98% accuracy matching expert managed detection and response (MDR) analysts, cutting 40-plus hours of manual work weekly. When the system detects a control plane compromise, it doesn't wait for human approval. It revokes tokens, kills sessions, boots the attacker and nukes malicious CloudFormation templates, all before the adversary can execute.

    What this means for the CNAPP market

    Cloud security is the fastest-growing segment in Gartner's latest forecast, expanding at a 25.9% CAGR through 2028. Precedence Research projects the market will grow from $36 billion in 2024 to $121 billion by 2034. And it's crowded: Palo Alto Networks, Wiz (now absorbed into Google via a $32 billion acquisition), Microsoft, Orca, SentinelOne (to name a few).

    CrowdStrike already had a seat at the table as a Leader in the 2025 IDC MarketScape for CNAPP for the third consecutive year. Gartner predicts that by 2029, 40% of enterprises that successfully implement zero trust in cloud environments will rely on CNAPP platforms due to their visibility and control.

    But Zaitsev is making a bigger claim, stating that today's announcement redefines what "complete" means for CNAPP in hybrid environments. "CSPM isn't going away. Cloud workload protection isn't going away. What becomes obsolete is calling something a CNAPP when it lacks real-time cloud detection and response. You're missing the safety net, the thing that catches what gets through proactive defenses. And in hybrid, something always gets through."

    "The unified platform angle matters specifically for hybrid," he said. "Adversaries deliberately hop between environments because they know defenders run different tools, often different teams, for cloud versus on-prem versus identity. Jumping domains is how you shake your tail. Attackers know most organizations can't follow them across the seams. With us, they can't do that anymore."

    Building hybrid security for the AI era

    Reinventing hybrid cloud security won't happen overnight. Here's where CISOs should focus:

    • Map your hybrid visibility gaps: Every cloud workload, every on-prem system, every identity traversing between them. If 82% of breaches trace to blind spots, know where yours are before attackers find them.

    • Pressure vendors on detection latency: Ask challenging questions about architecture. If they're running batch-based processing, understand what a 15-minute window means when adversaries move in seconds.

    • Deploy AI triage now: With 40% of alerts going uninvestigated and 71% of analysts burned out, automation isn't a roadmap item; it’s a must-have for a successful deterrence strategy. Look for measurable accuracy rates and real-time savings.

    • Compress patch cycles to 72 hours: AI-assisted reverse engineering has collapsed the exploit window. Monthly patch cycles don't cut it anymore.

    • Architect for permanent hybrid: Stop waiting for cloud migration to simplify security. It won't. Design for complexity as the baseline, not a temporary state. The 54% of enterprises running hybrid models today will still be hybrid tomorrow.

    The bottom line

    Hybrid cloud security must be reinvented for the AI era. Previous-generation hybrid cloud security solutions are quickly being eclipsed by weaponized AI attacks, often launched as machine-on-machine intrusion attempts. The evidence is clear: 55% breach rates, 91% of security leaders making compromises they know are dangerous and AI-accelerated attacks that move faster than batch-based detection can respond. Architectures designed for human-speed threats can't protect against machine-speed adversaries.

    "Modern cybersecurity is about differentiating between acceptable and unacceptable risk," says Chaim Mazal, CSO at Gigamon. "Our research shows where CISOs are drawing that line, highlighting the critical importance of visibility into all data-in-motion to secure complex hybrid cloud infrastructure against today's emerging threats. It's clear that current approaches aren't keeping pace, which is why CISOs must reevaluate tool stacks and reprioritize investments and resources to more confidently secure their infrastructure."

    VentureBeat will be tracking which approaches to hybrid cloud reinvention actually deliver, and which don't, in the months ahead.

  • Ontology is the real guardrail: How to stop AI agents from misunderstanding your business

    Enterprises are investing billions of dollars in AI agents and infrastructure to transform business processes. However, we are seeing limited success in real-world applications, often due to the inability of agents to truly understand business data, policies and processes.

    While we manage the integrations well with technologies like API management, model context protocol (MCP) and others, having agents truly understand the “meaning” of data in the context of a given business is a different story. Enterprise data is mostly siloed in disparate systems, in structured and unstructured forms, and needs to be analyzed through a domain-specific business lens.

    As an example, the term “customer” may refer to a different group of people in a sales CRM system than in a finance system, which may reserve the term for paying clients. One department might define “product” as a SKU; another might represent it as a product family; a third as a marketing bundle.

    Data about “product sales” thus varies in meaning without agreed-upon relationships and definitions. For agents to combine data from multiple systems, they must understand these different representations. Agents need to know what the data means in context and how to find the right data for the right process. Moreover, schema changes in source systems and data quality issues during collection can introduce further ambiguity, leaving agents unsure how to act when such situations are encountered.
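    To make the disambiguation concrete, here is a minimal sketch of a shared mapping that resolves a system-local term like "customer" to a canonical ontology class. The system names, terms and classes are invented for illustration.

```python
# Hypothetical sketch: resolving the term "customer" across systems via a
# shared ontology mapping. System names, terms and classes are invented.

# Each source system uses "customer" with a different meaning.
ONTOLOGY_MAP = {
    ("crm", "customer"): "Prospect",          # CRM: anyone in the pipeline
    ("finance", "customer"): "PayingClient",  # Finance: has paid invoices
    ("support", "customer"): "LicensedUser",  # Support: holds a license
}

def resolve_term(system: str, term: str) -> str:
    """Map a system-local term to its canonical ontology class."""
    try:
        return ONTOLOGY_MAP[(system, term.lower())]
    except KeyError:
        raise ValueError(f"No ontology mapping for {term!r} in {system!r}")

# An agent joining "customer" records from CRM and finance now knows the
# two sets are different ontology classes and must not be merged blindly.
print(resolve_term("crm", "customer"))      # Prospect
print(resolve_term("finance", "customer"))  # PayingClient
```

    An agent that resolves terms through such a mapping before joining data can refuse, or flag, merges across incompatible classes instead of silently conflating them.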

    Furthermore, classification of data into categories like PII (personally identifiable information) must be rigorously followed to maintain compliance with standards like GDPR and CCPA. This requires data to be labelled correctly and agents to understand and respect that classification. Hence, building a cool demo using agents is very much doable, but putting agents into production on real business data is a different story altogether.

    The ontology-based source of truth

    Building effective agentic solutions requires an ontology-based single source of truth. An ontology is a business definition of concepts, their hierarchy and their relationships. It defines terms with respect to business domains, helps establish a single source of truth for data, captures uniform field names and applies classifications to fields.

    An ontology may be domain-specific (healthcare or finance), or organization-specific based on internal structures. Defining an ontology upfront is time consuming, but can help standardize business processes and lay a strong foundation for agentic AI.

    An ontology may be realized in common queryable formats like a triplestore. More complex business rules with multi-hop relations could use a labelled property graph like Neo4j. These graphs can also help enterprises discover new relationships and answer complex questions. Ontologies like FIBO (Financial Industry Business Ontology) and UMLS (Unified Medical Language System) are available in the public domain and can be a very good starting point. However, these usually need to be customized to capture the specific details of an enterprise.
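    A toy illustration of the idea, using an in-memory list of subject-predicate-object triples in place of a real triplestore or Neo4j — the concepts and relations are invented, not an actual FIBO or UMLS fragment:

```python
# Minimal sketch of an ontology stored as subject-predicate-object triples,
# the same shape a triplestore or labelled property graph would hold.
# Concepts, relations and classifications here are purely illustrative.

TRIPLES = [
    ("Loan", "hasDocument", "IncomeProof"),
    ("Loan", "hasDocument", "IdentityProof"),
    ("IncomeProof", "classifiedAs", "PII"),
    ("IdentityProof", "classifiedAs", "PII"),
    ("PayingClient", "subClassOf", "Customer"),
]

def query(subject=None, predicate=None, obj=None):
    """Return triples matching the given pattern (None = wildcard)."""
    return [
        t for t in TRIPLES
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]

# Which documents does a Loan require?
docs = [o for (_, _, o) in query("Loan", "hasDocument")]
# Which of those carry a PII classification the agent must respect?
pii = [d for d in docs if query(d, "classifiedAs", "PII")]
```

    The same pattern queries would be expressed in SPARQL against a triplestore or Cypher against Neo4j; the point is that agents discover requirements and classifications from the graph rather than guessing.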

    Getting started with ontology

    Once implemented, an ontology can be the driving force for enterprise agents. We can prompt AI to follow the ontology and use it to discover data and relationships. If needed, an agentic layer can serve key details of the ontology itself to support data discovery. Business rules and policies can be encoded in this ontology for agents to adhere to. This is an excellent way to ground your agents and establish guardrails based on real business context.

    Agents designed in this manner and tuned to follow an ontology can stay within guardrails and avoid hallucinations that can be caused by the large language models (LLMs) powering them. For example, a business policy may state that unless all documents associated with a loan have their verified flags set to “true,” the loan status should remain “pending.” Agents can operate within this policy, determining which documents are needed and querying the knowledge base.
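    The loan policy described above can be sketched in a few lines; the field names are assumptions for illustration:

```python
# Sketch of the loan policy: the loan stays "pending" until every
# associated document has verified == True. Field names are invented.

def loan_status(documents: list[dict]) -> str:
    """Approve only when all documents are verified; otherwise pending."""
    if documents and all(doc.get("verified") is True for doc in documents):
        return "approved"
    return "pending"

docs = [
    {"name": "income_proof", "verified": True},
    {"name": "identity_proof", "verified": False},
]
print(loan_status(docs))  # pending — identity_proof is not yet verified
```

    Encoding the rule this way means the agent cannot "decide" a loan is approved; it can only gather the missing verifications the policy demands.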

    Here's an example implementation:

    (Original figure by Author)

    As illustrated, we have structured and unstructured data processed by a document intelligence (DocIntel) agent which populates a Neo4j database based on an ontology of the business domain. A data discovery agent in Neo4j finds and queries the right data and passes it to other agents handling business process execution. The inter-agent communication happens with a popular protocol like A2A (agent to agent). A new protocol called AG-UI (Agent User Interaction) can help build more generic UI screens to capture the workings and responses from these agents. 

    With this method, we can avoid hallucinations by requiring agents to follow ontology-driven paths and maintain data classifications and relationships. Moreover, we can scale easily by adding new assets, relationships and policies that agents automatically comply with, and control hallucinations by defining rules for the whole system rather than for individual entities. For example, if an agent hallucinates an individual 'customer,' the connected data for that hallucinated 'customer' will not be verifiable during data discovery, so we can easily detect the anomaly and eliminate it. This helps the agentic system scale with the business and manage its dynamic nature.
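The anomaly check described above can be sketched as follows. This is an illustrative toy, with a dictionary standing in for the graph database: an agent-produced entity only counts as grounded if it exists in the graph and its claimed relations resolve to real nodes.

```python
# Toy knowledge graph keyed by entity ID (in practice, a Neo4j lookup
# performed by the data discovery agent).
knowledge_graph = {
    "customer:1001": {"name": "Acme Corp", "relations": ["loan:42"]},
    "loan:42": {"status": "pending", "relations": ["customer:1001"]},
}

def grounded(entity_id: str) -> bool:
    """An entity an agent refers to is grounded only if it exists in the
    graph AND all of its claimed relations resolve to real nodes."""
    node = knowledge_graph.get(entity_id)
    if node is None:
        return False
    return all(rel in knowledge_graph for rel in node["relations"])

print(grounded("customer:1001"))  # True
print(grounded("customer:9999"))  # False: a hallucinated customer
```

Any ungrounded entity can then be rejected or routed for review before it propagates through the business process.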

    Indeed, a reference architecture like this adds some overhead in data discovery and graph databases. But for a large enterprise, it adds the right guardrails and gives agents directions to orchestrate complex business processes.

    Dattaraj Rao is innovation and R&D architect at Persistent Systems.

    Read more from our guest writers. Or, consider submitting a post of your own! See our guidelines here.

  • Why observable AI is the missing SRE layer enterprises need for reliable LLMs

    As AI systems enter production, reliability and governance can’t depend on wishful thinking. Here’s how observability turns large language models (LLMs) into auditable, trustworthy enterprise systems.

    Why observability secures the future of enterprise AI

    The enterprise race to deploy LLM systems mirrors the early days of cloud adoption. Executives love the promise; compliance demands accountability; engineers just want a paved road.

    Yet, beneath the excitement, most leaders admit they can’t trace how AI decisions are made, whether they helped the business, or if they broke any rule.

    Take one Fortune 100 bank that deployed an LLM to classify loan applications. Benchmark accuracy looked stellar. Yet, six months later, auditors found that 18% of critical cases were misrouted, without a single alert or trace. The root cause wasn’t bias or bad data; it was invisibility. No observability, no accountability.

    If you can’t observe it, you can’t trust it. And unobserved AI will fail in silence.

    Visibility isn’t a luxury; it’s the foundation of trust. Without it, AI becomes ungovernable.

    Start with outcomes, not models

    Most corporate AI projects begin with tech leaders choosing a model and, later, defining success metrics.
    That’s backward.

    Flip the order:

    • Define the outcome first. What’s the measurable business goal?

      • Deflect 15% of billing calls

      • Reduce document review time by 60%

      • Cut case-handling time by two minutes

    • Design telemetry around that outcome, not around “accuracy” or “BLEU score.”

    • Select prompts, retrieval methods and models that demonstrably move those KPIs.

    At one global insurer, for instance, reframing success as “minutes saved per claim” instead of “model precision” turned an isolated pilot into a company-wide roadmap.

    A 3-layer telemetry model for LLM observability

    Just like microservices rely on logs, metrics and traces, AI systems need a structured observability stack:

    a) Prompts and context: What went in

    • Log every prompt template, variable and retrieved document.

    • Record model ID, version, latency and token counts (your leading cost indicators).

    • Maintain an auditable redaction log showing what data was masked, when and by which rule.

    b) Policies and controls: The guardrails

    • Capture safety-filter outcomes (toxicity, PII), citation presence and rule triggers.

    • Store policy reasons and risk tier for each deployment.

    • Link outputs back to the governing model card for transparency.

    c) Outcomes and feedback: Did it work?

    • Gather human ratings and edit distances from accepted answers.

    • Track downstream business events: case closed, document approved, issue resolved.

    • Measure the KPI deltas: call time, backlog, reopen rate.

    All three layers connect through a common trace ID, enabling any decision to be replayed, audited or improved.
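The three layers above can be tied together with a shared trace ID, roughly like this. The field names and model ID here are hypothetical placeholders, not a prescribed schema:

```python
import json
import time
import uuid

def make_trace_id() -> str:
    return uuid.uuid4().hex

def log_event(trace_id: str, layer: str, payload: dict) -> dict:
    """One telemetry record; 'layer' is one of prompts / policies / outcomes."""
    record = {"trace_id": trace_id, "layer": layer, "ts": time.time(), **payload}
    print(json.dumps(record))  # in production, ship this to your log pipeline
    return record

trace = make_trace_id()
events = [
    log_event(trace, "prompts", {"model_id": "llm-v3", "tokens_in": 512, "latency_ms": 840}),
    log_event(trace, "policies", {"pii_filter": "pass", "citation_present": True, "risk_tier": "low"}),
    log_event(trace, "outcomes", {"human_rating": 4, "case_closed": True}),
]
# One trace ID spans all three layers, so the decision can be replayed end to end.
assert len({e["trace_id"] for e in events}) == 1
```

Querying by that one ID reconstructs the full story: what went in, which guardrails fired and whether it worked.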

    Diagram © SaiKrishna Koorapati (2025). Created specifically for this article; licensed to VentureBeat for publication.

    Apply SRE discipline: SLOs and error budgets for AI

    Site reliability engineering (SRE) transformed software operations; now it’s AI’s turn.

    Define three “golden signals” for every critical workflow:

    Signal       Target SLO                                   When breached
    Factuality   ≥ 95% verified against source of record      Fall back to verified template
    Safety       ≥ 99.9% pass toxicity/PII filters            Quarantine and human review
    Usefulness   ≥ 80% accepted on first pass                 Retrain or roll back prompt/model

    If hallucinations or refusals exceed the error budget, the system auto-routes to safer prompts or human review, just like rerouting traffic during a service outage.
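A minimal sketch of that routing logic, using the three golden signals above (the SLO values mirror the table; the fallback names are illustrative):

```python
# SLO targets mirroring the three golden signals.
SLOS = {"factuality": 0.95, "safety": 0.999, "usefulness": 0.80}

# Hypothetical breach actions, one per signal.
FALLBACKS = {
    "factuality": "fallback_to_verified_template",
    "safety": "quarantine_and_human_review",
    "usefulness": "retrain_or_rollback",
}

def route(signal: str, observed_rate: float) -> str:
    """Serve normally while within SLO; otherwise trigger the breach action."""
    return "serve" if observed_rate >= SLOS[signal] else FALLBACKS[signal]

print(route("factuality", 0.97))  # serve
print(route("safety", 0.995))     # quarantine_and_human_review
```

The point is that the breach response is decided by policy ahead of time, not improvised during an incident.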

    This isn’t bureaucracy; it’s reliability applied to reasoning.

    Build the thin observability layer in two agile sprints

    You don’t need a six-month roadmap; you need focus and two short sprints.

    Sprint 1 (weeks 1-3): Foundations

    • Version-controlled prompt registry

    • Redaction middleware tied to policy

    • Request/response logging with trace IDs

    • Basic evaluations (PII checks, citation presence)

    • Simple human-in-the-loop (HITL) UI

    Sprint 2 (weeks 4-6): Guardrails and KPIs

    • Offline test sets (100–300 real examples)

    • Policy gates for factuality and safety

    • Lightweight dashboard tracking SLOs and cost

    • Automated token and latency tracker

    In 6 weeks, you’ll have the thin layer that answers 90% of governance and product questions.

    Make evaluations continuous (and boring)

    Evaluations shouldn’t be heroic one-offs; they should be routine.

    • Curate test sets from real cases; refresh 10–20% monthly.

    • Define clear acceptance criteria shared by product and risk teams.

    • Run the suite on every prompt/model/policy change and weekly for drift checks.

    • Publish one unified scorecard each week covering factuality, safety, usefulness and cost.

    When evals are part of CI/CD, they stop being compliance theater and become operational pulse checks.
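A CI-friendly eval run can be as boring as this sketch: a curated test set, a pass-rate scorecard and a gate that blocks deploys below the acceptance threshold. The cases, the stand-in model and the 90% threshold are all hypothetical:

```python
# Hypothetical offline test set: (question, substring the answer must contain),
# curated from real cases as described above.
test_set = [
    ("What is the refund window?", "30 days"),
    ("Is PII redacted in exports?", "redacted"),
]

def fake_model(question: str) -> str:
    """Stand-in for the model under evaluation."""
    answers = {
        "What is the refund window?": "Refunds are accepted within 30 days.",
        "Is PII redacted in exports?": "Yes, all PII is redacted before export.",
    }
    return answers.get(question, "")

def run_suite(model, cases) -> float:
    """Return the fraction of cases whose answer meets the acceptance criterion."""
    passed = sum(expected in model(q) for q, expected in cases)
    return passed / len(cases)

score = run_suite(fake_model, test_set)
print(f"weekly scorecard: {score:.0%} pass")
assert score >= 0.9, "block deploy: eval suite below acceptance threshold"
```

Wired into CI/CD, the same suite runs on every prompt, model or policy change, and weekly for drift.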

    Apply human oversight where it matters

    Full automation is neither realistic nor responsible. High-risk or ambiguous cases should escalate to human review.

    • Route low-confidence or policy-flagged responses to experts.

    • Capture every edit and reason as training data and audit evidence.

    • Feed reviewer feedback back into prompts and policies for continuous improvement.

    At one health-tech firm, this approach cut false positives by 22% and produced a retrainable, compliance-ready dataset in weeks.
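The escalation rule behind this pattern is simple to sketch. The 0.8 confidence threshold and queue shape here are assumptions for illustration, not a recommended setting:

```python
# Escalated responses land here for expert review; in practice each edit and
# reason is captured as training data and audit evidence.
REVIEW_QUEUE: list[dict] = []

def route_response(response: str, confidence: float, policy_flagged: bool) -> str:
    """Auto-send high-confidence, policy-clean responses; escalate the rest."""
    if policy_flagged or confidence < 0.8:
        REVIEW_QUEUE.append({"response": response, "confidence": confidence})
        return "escalated"
    return "auto_sent"

print(route_response("Your claim is approved.", 0.95, policy_flagged=False))  # auto_sent
print(route_response("Possible coverage exception.", 0.62, policy_flagged=False))  # escalated
print(len(REVIEW_QUEUE))  # 1
```

Everything the reviewers touch flows back into prompts and policies, which is what makes the loop continuous rather than a one-off audit.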

    Cost control through design, not hope

    LLM costs grow non-linearly. Budgets won’t save you; architecture will.

    • Structure prompts so deterministic sections run before generative ones.

    • Compress and rerank context instead of dumping entire documents.

    • Cache frequent queries and memoize tool outputs with TTL.

    • Track latency, throughput and token use per feature.

    When observability covers tokens and latency, cost becomes a controlled variable, not a surprise.
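As one concrete piece of the cost-control list above, here is a minimal sketch of memoizing tool outputs with a TTL. The cache shape and 300-second default are assumptions, and real systems would add eviction and size limits:

```python
import time

_cache: dict = {}

def memoized_tool_call(key: str, compute, ttl_seconds: float = 300.0):
    """Return a cached tool output while it is fresh; recompute after the TTL."""
    now = time.monotonic()
    entry = _cache.get(key)
    if entry is not None and now - entry["at"] < ttl_seconds:
        return entry["value"]
    value = compute()
    _cache[key] = {"value": value, "at": now}
    return value

# Count how often the expensive call actually runs.
calls = {"n": 0}
def expensive_lookup():
    calls["n"] += 1
    return "tool output"

memoized_tool_call("q1", expensive_lookup)
memoized_tool_call("q1", expensive_lookup)  # served from cache
print(calls["n"])  # 1
```

Combined with per-feature token and latency tracking, repeated tool calls stop being a silent cost multiplier.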

    The 90-day playbook

    Within 3 months of adopting observable AI principles, enterprises should see:

    • 1–2 production AI assists with HITL for edge cases

    • Automated evaluation suite for pre-deploy and nightly runs

    • Weekly scorecard shared across SRE, product and risk

    • Audit-ready traces linking prompts, policies and outcomes

    At a Fortune 100 client, this structure reduced incident time by 40 % and aligned product and compliance roadmaps.

    Scaling trust through observability

    Observable AI is how you turn AI from experiment to infrastructure.

    With clear telemetry, SLOs and human feedback loops:

    • Executives gain evidence-backed confidence.

    • Compliance teams get replayable audit chains.

    • Engineers iterate faster and ship safely.

    • Customers experience reliable, explainable AI.

    Observability isn’t an add-on layer; it’s the foundation for trust at scale.

    SaiKrishna Koorapati is a software engineering leader.

    Read more from our guest writers. Or, consider submitting a post of your own! See our guidelines here.

  • Anthropic says it solved the long-running AI agent problem with a new multi-session Claude SDK

    Agent memory remains a problem that enterprises want to fix, as agents forget some instructions or conversations the longer they run. 

    Anthropic believes it has solved this issue for its Claude Agent SDK, developing a two-fold solution that allows an agent to work across different context windows.

    “The core challenge of long-running agents is that they must work in discrete sessions, and each new session begins with no memory of what came before,” Anthropic wrote in a blog post. “Because context windows are limited, and because most complex projects cannot be completed within a single window, agents need a way to bridge the gap between coding sessions.”

    Anthropic engineers proposed a two-fold approach for its Agent SDK: An initializer agent to set up the environment, and a coding agent to make incremental progress in each session and leave artifacts for the next.  

    The agent memory problem

    Since agents are built on foundation models, they remain constrained by the limited, although continually growing, context windows. For long-running agents, this could create a larger problem, leading the agent to forget instructions and behave abnormally while performing a task. Enhancing agent memory becomes essential for consistent, business-safe performance. 

    Several methods have emerged over the past year, all attempting to bridge the gap between context windows and agent memory; LangChain’s LangMem SDK, Memobase and OpenAI’s Swarm are a few examples. Research on agentic memory has also exploded recently, with proposed frameworks like Memp and the Nested Learning Paradigm from Google offering new alternatives to enhance memory. 

    Many of the current memory frameworks are open source and can, in principle, adapt to the different large language models (LLMs) powering agents. Anthropic’s approach, by contrast, is built into its own Claude Agent SDK. 

    How it works

    Anthropic found that even though the Claude Agent SDK has context management capabilities, and it “should be possible for an agent to continue to do useful work for an arbitrarily long time,” those capabilities alone were not sufficient. The company said in its blog post that a model like Opus 4.5 running the Claude Agent SDK can “fall short of building a production-quality web app if it’s only given a high-level prompt, such as 'build a clone of claude.ai.'” 

    The failures manifested in two patterns, Anthropic said. First, the agent tried to do too much, causing the model to run out of context mid-task; the next agent then had to guess what had happened and received no clear instructions. The second failure occurred later on, after some features had already been built: the agent saw that progress had been made and simply declared the job done. 

    Anthropic researchers broke down the solution: Setting up an initial environment to lay the foundation for features and prompting each agent to make incremental progress towards a goal, while still leaving a clean slate at the end. 

    This is where the two-part solution of Anthropic's agent comes in. The initializer agent sets up the environment, logging what agents have done and which files have been added. The coding agent will then ask models to make incremental progress and leave structured updates. 
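The handoff between sessions can be sketched with a shared artifact file. The file name, JSON schema and steps below are hypothetical illustrations of the pattern, not Anthropic's actual format:

```python
import json
from pathlib import Path

PROGRESS = Path("progress.json")  # hypothetical shared artifact

def initialize(goal: str) -> None:
    """Initializer agent: set up the artifact every later session reads first."""
    PROGRESS.write_text(json.dumps({"goal": goal, "done": [], "next": ["scaffold app"]}))

def run_session(completed: str, next_step: str) -> dict:
    """Coding agent: make one increment, then record structured updates
    so the next session starts with context instead of guessing."""
    state = json.loads(PROGRESS.read_text())
    state["done"].append(completed)
    state["next"] = [next_step]
    PROGRESS.write_text(json.dumps(state))
    return state

initialize("build a chat web app")
run_session("scaffold app", "add auth")
state = run_session("add auth", "add message history")
print(state["done"])  # ['scaffold app', 'add auth']
```

Each session ends by leaving the workspace clean and the artifact current, which is exactly the habit the approach borrows from human engineers.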

    “Inspiration for these practices came from knowing what effective software engineers do every day,” Anthropic said. 

    The researchers said they added testing tools to the coding agent, improving its ability to identify and fix bugs that weren’t obvious from the code alone. 

    Future research

    Anthropic noted that its approach is “one possible set of solutions in a long-running agent harness.” However, this is just the beginning stage of what could become a wider research area for many in the AI space. 

    The company said its experiments to boost long-term memory for agents haven’t shown whether a single general-purpose coding agent works best across contexts or a multi-agent structure. 

    Its demo also focused on full-stack web app development, so other experiments should focus on generalizing the results across different tasks.

    “It’s likely that some or all of these lessons can be applied to the types of long-running agentic tasks required in, for example, scientific research or financial modeling,” Anthropic said. 

  • What to be thankful for in AI in 2025

    Hello, dear readers. Happy belated Thanksgiving and Black Friday!

    This year has felt like living inside a permanent DevDay. Every week, some lab drops a new model, a new agent framework, or a new “this changes everything” demo. It’s overwhelming. But it’s also the first year I’ve felt like AI is finally diversifying — not just one or two frontier models in the cloud, but a whole ecosystem: open and closed, giant and tiny, Western and Chinese, cloud and local.

    So for this Thanksgiving edition, here’s what I’m genuinely thankful for in AI in 2025 — the releases that feel like they’ll matter in 12–24 months, not just during this week’s hype cycle.

    1. OpenAI kept shipping strong: GPT-5, GPT-5.1, Atlas, Sora 2 and open weights

    As the company that undeniably birthed the "generative AI" era with its viral hit product ChatGPT in late 2022, OpenAI arguably had among the hardest tasks of any AI company in 2025: continue its growth trajectory even as well-funded competitors like Google with its Gemini models and other startups like Anthropic fielded their own highly competitive offerings.

    Thankfully, OpenAI rose to the challenge and then some. Its headline act was GPT-5, unveiled in August as the next frontier reasoning model, followed in November by GPT-5.1 with new Instant and Thinking variants that dynamically adjust how much “thinking time” they spend per task.

    In practice, GPT-5’s launch was bumpy — VentureBeat documented early math and coding failures and a cooler-than-expected community reaction in “OpenAI’s GPT-5 rollout is not going smoothly" — but OpenAI quickly course-corrected based on user feedback, and, as a daily user of the model, I'm personally pleased and impressed with it.

    At the same time, enterprises actually using the models are reporting solid gains. ZenDesk Global, for example, says GPT-5-powered agents now resolve more than half of customer tickets, with some customers seeing 80–90% resolution rates. That’s the quiet story: these models may not always impress the chattering classes on X, but they’re starting to move real KPIs.

    On the tooling side, OpenAI finally gave developers a serious AI engineer with GPT-5.1-Codex-Max, a new coding model that can run long, agentic workflows and is already the default in OpenAI’s Codex environment. VentureBeat covered it in detail in “OpenAI debuts GPT-5.1-Codex-Max coding model and it already completed a 24-hour task internally.”

    Then there’s ChatGPT Atlas, a full browser with ChatGPT baked into the chrome itself — sidebar summaries, on-page analysis, and search tightly integrated into regular browsing. It’s the clearest sign yet that “assistant” and “browser” are on a collision course.

    On the media side, Sora 2 turned the original Sora video demo into a full video-and-audio model with better physics, synchronized sound and dialogue, and more control over style and shot structure. It arrived alongside a dedicated Sora app with a full-fledged social networking component, allowing any user to create their own TV network in their pocket.

    Finally — and maybe most symbolically — OpenAI released gpt-oss-120B and gpt-oss-20B, open-weight MoE reasoning models under an Apache 2.0–style license. Whatever you think of their quality (and early open-source users have been loud about their complaints), this is the first time since GPT-2 that OpenAI has put serious weights into the public commons.

    2. China’s open-source wave goes mainstream

    If 2023–24 was about Llama and Mistral, 2025 belongs to China’s open-weight ecosystem.

    A study from MIT and Hugging Face found that China now slightly leads the U.S. in global open-model downloads, largely thanks to DeepSeek and Alibaba’s Qwen family.

    Highlights:

    • DeepSeek-R1 dropped in January as an open-source reasoning model rivaling OpenAI’s o1, with MIT-licensed weights and a family of distilled smaller models. VentureBeat has followed the story from its release to its cybersecurity impact to performance-tuned R1 variants.

    • Kimi K2 Thinking from Moonshot is a “thinking” open-source model that reasons step-by-step with tools, very much in the o1/R1 mold, and is positioned as the strongest open reasoning model in the world so far.

    • Z.ai shipped GLM-4.5 and GLM-4.5-Air as “agentic” models, open-sourcing base and hybrid reasoning variants on GitHub.

    • Baidu’s ERNIE 4.5 family arrived as a fully open-sourced, multimodal MoE suite under Apache 2.0, including a 0.3B dense model and visual “Thinking” variants focused on charts, STEM, and tool use.

    • Alibaba’s Qwen3 line — including Qwen3-Coder, large reasoning models, and the Qwen3-VL series released over the summer and fall of 2025 — continues to set a high bar for open weights in coding, translation, and multimodal reasoning, leading me to declare this past summer "Qwen's summer."

    VentureBeat has been tracking these shifts, including Chinese math and reasoning models like Light-R1-32B and Weibo’s tiny VibeThinker-1.5B, which beat DeepSeek baselines on shoestring training budgets.

    If you care about open ecosystems or on-premise options, this is the year China’s open-weight scene stopped being a curiosity and became a serious alternative.

    3. Small and local models grow up

    Another thing I’m thankful for: we’re finally getting good small models, not just toys.

    Liquid AI spent 2025 pushing its Liquid Foundation Models (LFM2) and LFM2-VL vision-language variants, designed from day one for low-latency, device-aware deployments — edge boxes, robots, and constrained servers, not just giant clusters. The newer LFM2-VL-3B targets embedded robotics and industrial autonomy, with demos planned at ROSCon.

    On the big-tech side, Google’s Gemma 3 line made a strong case that “tiny” can still be capable. Gemma 3 spans from 270M parameters up through 27B, all with open weights and multimodal support in the larger variants.

    The standout is Gemma 3 270M, a compact model purpose-built for fine-tuning and structured text tasks — think custom formatters, routers, and watchdogs — covered both in Google’s developer blog and community discussions in local-LLM circles.

    These models may never trend on X, but they’re exactly what you need for privacy-sensitive workloads, offline workflows, thin-client devices, and “agent swarms” where you don’t want every tool call hitting a giant frontier LLM.

    4. Meta + Midjourney: aesthetics as a service

    One of the stranger twists this year: Meta partnered with Midjourney instead of simply trying to beat it.

    In August, Meta announced a deal to license Midjourney’s “aesthetic technology” — its image and video generation stack — and integrate it into Meta’s future models and products, from Facebook and Instagram feeds to Meta AI features.

    VentureBeat covered the partnership in “Meta is partnering with Midjourney and will license its technology for future models and products,” raising the obvious question: does this slow or reshape Midjourney’s own API roadmap? There’s no definitive answer yet, but Midjourney’s stated plans for an API release have yet to materialize, suggesting that it does.

    For creators and brands, though, the immediate implication is simple: Midjourney-grade visuals start to show up in mainstream social tools instead of being locked away in a Discord bot. That could normalize higher-quality AI art for a much wider audience — and force rivals like OpenAI, Google, and Black Forest Labs to keep raising the bar.

    5. Google’s Gemini 3 and Nano Banana Pro

    Google tried to answer GPT-5 with Gemini 3, billed as its most capable model yet, with better reasoning, coding, and multimodal understanding, plus a new Deep Think mode for slow, hard problems.

    VentureBeat’s coverage, “Google unveils Gemini 3 claiming the lead in math, science, multimodal and agentic AI,” framed it as a direct shot at frontier benchmarks and agentic workflows.

    But the surprise hit is Nano Banana Pro (Gemini 3 Pro Image), Google’s new flagship image generator. It specializes in infographics, diagrams, multi-subject scenes, and multilingual text that actually renders legibly across 2K and 4K resolutions.

    In the world of enterprise AI — where charts, product schematics, and “explain this system visually” images matter more than fantasy dragons — that’s a big deal.

    6. Wild cards I’m keeping an eye on

    A few more releases I’m thankful for, even if they don’t fit neatly into one bucket:

    Last thought (for now)

    If 2024 was the year of “one big model in the cloud,” 2025 is the year the map exploded: multiple frontiers at the top, China taking the lead in open models, small and efficient systems maturing fast, and creative ecosystems like Midjourney getting pulled into big-tech stacks.

    I’m thankful not just for any single model, but for the fact that we now have options — closed and open, local and hosted, reasoning-first and media-first. For journalists, builders, and enterprises, that diversity is the real story of 2025.

    Happy holidays and best to you and your loved ones!

  • Alibaba’s AgentEvolver lifts model performance in tool use by ~30% using synthetic, auto-generated tasks

    Researchers at Alibaba’s Tongyi Lab have developed a new framework for self-evolving agents that create their own training data by exploring their application environments. The framework, AgentEvolver, uses the knowledge and reasoning capabilities of large language models for autonomous learning, addressing the high costs and manual effort typically required to gather task-specific datasets.

    Experiments show that compared to traditional reinforcement learning–based frameworks, AgentEvolver is more efficient at exploring its environment, makes better use of data, and adapts faster to application environments. For the enterprise, this is significant because it lowers the barrier to training agents for bespoke applications, making powerful, custom AI assistants more accessible to a wider range of organizations.

    The high cost of training AI agents

    Reinforcement learning has become a major paradigm for training LLMs to act as agents that can interact with digital environments and learn from feedback. However, developing agents with RL faces fundamental challenges. First, gathering the necessary training datasets is often prohibitively expensive, requiring significant manual labor to create examples of tasks, especially in novel or proprietary software environments where there are no available off-the-shelf datasets.

    Second, the RL techniques commonly used for LLMs require the model to run through a massive number of trial-and-error attempts to learn effectively. This process is computationally costly and inefficient. As a result, training capable LLM agents through RL remains laborious and expensive, limiting their deployment in custom enterprise settings.

    How AgentEvolver works

    The main idea behind AgentEvolver is to give models greater autonomy in their own learning process. The researchers describe it as a “self-evolving agent system” designed to “achieve autonomous and efficient capability evolution through environmental interaction.” It uses the reasoning power of an LLM to create a self-training loop, allowing the agent to continuously improve by directly interacting with its target environment without needing predefined tasks or reward functions.

    “We envision an agent system where the LLM actively guides exploration, task generation, and performance refinement,” the researchers wrote in their paper.

    The self-evolution process is driven by three core mechanisms that work together.

    The first is self-questioning, where the agent explores its environment to discover the boundaries of its functions and identify useful states. It’s like a new user clicking around an application to see what’s possible. Based on this exploration, the agent generates its own diverse set of tasks that align with a user’s general preferences. This reduces the need for handcrafted datasets and allows the agent and its tasks to co-evolve, progressively enabling it to handle more complex challenges. 

    According to Yunpeng Zhai, researcher at Alibaba and co-author of the paper, who spoke to VentureBeat, the self-questioning mechanism effectively turns the model from a “data consumer into a data producer,” dramatically reducing the time and cost required to deploy an agent in a proprietary environment.

    The second mechanism is self-navigating, which improves exploration efficiency by reusing and generalizing from past experiences. AgentEvolver extracts insights from both successful and unsuccessful attempts and uses them to guide future actions. For example, if an agent tries to use an API function that doesn't exist in an application, it registers this as an experience and learns to verify the existence of functions before attempting to use them in the future.

    The third mechanism, self-attributing, enhances learning efficiency by providing more detailed feedback. Instead of just a final success or failure signal (a common practice in RL that can result in sparse rewards), this mechanism uses an LLM to assess the contribution of each individual action in a multi-step task. It retrospectively determines whether each step contributed positively or negatively to the final outcome, giving the agent fine-grained feedback that accelerates learning. 

    This is crucial for regulated industries where how an agent solves a problem is as important as the result. “Instead of rewarding a student only for the final answer, we also evaluate the clarity and correctness of each step in their reasoning,” Zhai explained. This improves transparency and encourages the agent to adopt more robust and auditable problem-solving patterns.
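The self-attributing idea can be sketched as step-level credit assignment. The trajectory, contribution labels (which the paper says an LLM would judge) and the 50/50 blend weights below are all illustrative assumptions:

```python
# Hypothetical trajectory: each step carries an LLM-judged contribution label
# instead of relying only on one final success/failure signal (sparse rewards).
trajectory = [
    {"action": "open_app", "contribution": "positive"},
    {"action": "call_missing_api", "contribution": "negative"},
    {"action": "verify_api_exists", "contribution": "positive"},
    {"action": "submit_result", "contribution": "positive"},
]

def stepwise_rewards(traj, final_success: bool):
    """Blend a dense per-step signal with the final outcome so every
    action gets feedback, not just the last one."""
    outcome = 1.0 if final_success else -1.0
    per_step = {"positive": 1.0, "negative": -1.0}
    return [0.5 * per_step[s["contribution"]] + 0.5 * outcome for s in traj]

print(stepwise_rewards(trajectory, final_success=True))
# The failed API call is penalized even though the task succeeded overall.
```

This dense signal is what lets the agent learn, for example, that verifying an API exists before calling it is worth repeating.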

    “By shifting the training initiative from human-engineered pipelines to LLM-guided self-improvement, AgentEvolver establishes a new paradigm that paves the way toward scalable, cost-effective, and continually improving intelligent systems,” the researchers state.

    The team has also developed a practical, end-to-end training framework that integrates these three mechanisms. A key part of this foundation is the Context Manager, a component that controls the agent's memory and interaction history. While today's benchmarks test a limited number of tools, real enterprise environments can involve thousands of APIs. 

    Zhai acknowledges this is a core challenge for the field, but notes that AgentEvolver was designed to be extended. “Retrieval over extremely large action spaces will always introduce computational challenges, but AgentEvolver’s architecture provides a clear path toward scalable tool reasoning in enterprise settings,” he said.

    A more efficient path to agent training

    To measure the effectiveness of their framework, the researchers tested it on AppWorld and BFCL v3, two benchmarks that require agents to perform long, multi-step tasks using external tools. They used models from Alibaba’s Qwen2.5 family (7B and 14B parameters) and compared their performance against a baseline model trained with GRPO, a popular RL technique used to develop reasoning models like DeepSeek-R1.

    The results showed that integrating all three mechanisms in AgentEvolver led to substantial performance gains. For the 7B model, the average score improved by 29.4%, and for the 14B model, it increased by 27.8% over the baseline. The framework consistently enhanced the models' reasoning and task-execution capabilities across both benchmarks. The most significant improvement came from the self-questioning module, which autonomously generates diverse training tasks and directly addresses the data scarcity problem.

    The experiments also demonstrated that AgentEvolver can efficiently synthesize a large volume of high-quality training data. The tasks generated by the self-questioning module proved diverse enough to achieve good training efficiency even with a small amount of data.

    For enterprises, this provides a path to creating agents for bespoke applications and internal workflows while minimizing the need for manual data annotation. By providing high-level goals and letting the agent generate its own training experiences, organizations can develop custom AI assistants more simply and cost-effectively.

    “This combination of algorithmic design and engineering pragmatics positions AgentEvolver as both a research vehicle and a reusable foundation for building adaptive, tool-augmented agents,” the researchers conclude.

    Looking ahead, the ultimate goal is much bigger. “A truly ‘singular model’ that can drop into any software environment and master it overnight is certainly the holy grail of agentic AI,” Zhai said. “We see AgentEvolver as a necessary step in that direction.” While that future still requires breakthroughs in model reasoning and infrastructure, self-evolving approaches are paving the way.