Blog

  • We keep talking about AI agents, but do we ever know what they are?

    Imagine you do two things on a Monday morning.

    First, you ask a chatbot to summarize your new emails. Next, you ask an AI tool to figure out why your top competitor grew so fast last quarter. The AI silently gets to work. It scours financial reports, news articles and social media sentiment. It cross-references that data with your internal sales numbers, drafts a strategy outlining three potential reasons for the competitor's success and schedules a 30-minute meeting with your team to present its findings.

    We're calling both of these "AI agents," but they represent worlds of difference in intelligence, capability and the level of trust we place in them. This ambiguity creates a fog that makes it difficult to build, evaluate, and safely govern these powerful new tools. If we can't agree on what we're building, how can we know when we've succeeded?

    This post won't try to sell you on yet another definitive framework. Instead, think of it as a survey of the current landscape of agent autonomy, a map to help us all navigate the terrain together.

    What are we even talking about? Defining an "AI agent"

    Before we can measure an agent's autonomy, we need to agree on what an "agent" actually is. The most widely accepted starting point comes from the foundational textbook on AI, Stuart Russell and Peter Norvig’sArtificial Intelligence: A Modern Approach.” 

    They define an agent as anything that can be viewed as perceiving its environment through sensors and acting upon that environment through actuators. A thermostat is a simple agent: Its sensor perceives the room temperature, and its actuator acts by turning the heat on or off.

    ReAct Model for AI Agents (Credit: Confluent)

    That classic definition provides a solid mental model. For today's technology, we can translate it into four key components that make up a modern AI agent:

    1. Perception (the "senses"): This is how an agent takes in information about its digital or physical environment. It's the input stream that allows the agent to understand the current state of the world relevant to its task.

    2. Reasoning engine (the "brain"): This is the core logic that processes the perceptions and decides what to do next. For modern agents, this is typically powered by a large language model (LLM). The engine is responsible for planning, breaking down large goals into smaller steps, handling errors and choosing the right tools for the job.

    3. Action (the "hands"): This is how an agent affects its environment to move closer to its goal. The ability to take action via tools is what gives an agent its power.

    4. Goal/objective: This is the overarching task or purpose that guides all of the agent's actions. It is the "why" that turns a collection of tools into a purposeful system. The goal can be simple ("Find the best price for this book") or complex ("Launch the marketing campaign for our new product")

    Putting it all together, a true agent is a full-body system. The reasoning engine is the brain, but it’s useless without the senses (perception) to understand the world and the hands (actions) to change it. This complete system, all guided by a central goal, is what creates genuine agency.

    With these components in mind, the distinction we made earlier becomes clear. A standard chatbot isn't a true agent. It perceives your question and acts by providing an answer, but it lacks an overarching goal and the ability to use external tools to accomplish it.

    An agent, on the other hand, is software that has agency. 

    It has the capacity to act independently and dynamically toward a goal. And it's this capacity that makes a discussion about the levels of autonomy so important.

    Learning from the past: How we learned to classify autonomy

    The dizzying pace of AI can make it feel like we're navigating uncharted territory. But when it comes to classifying autonomy, we’re not starting from scratch. Other industries have been working on this problem for decades, and their playbooks offer powerful lessons for the world of AI agents.

    The core challenge is always the same: How do you create a clear, shared language for the gradual handover of responsibility from a human to a machine?

    SAE levels of driving automation

    Perhaps the most successful framework comes from the automotive industry. The SAE J3016 standard defines six levels of driving automation, from Level 0 (fully manual) to Level 5 (fully autonomous).

    The SAE J3016 Levels of Driving Automation (Credit: SAE International)

    What makes this model so effective isn't its technical detail, but its focus on two simple concepts:

    1. Dynamic driving task (DDT): This is everything involved in the real-time act of driving: steering, braking, accelerating and monitoring the road.

    2. Operational design domain (ODD): These are the specific conditions under which the system is designed to work. For example, "only on divided highways" or "only in clear weather during the daytime."

    The question for each level is simple: Who is doing the DDT, and what is the ODD? 

    At Level 2, the human must supervise at all times. At Level 3, the car handles the DDT within its ODD, but the human must be ready to take over. At Level 4, the car can handle everything within its ODD, and if it encounters a problem, it can safely pull over on its own.

    The key insight for AI agents: A robust framework isn't about the sophistication of the AI "brain." It's about clearly defining the division of responsibility between human and machine under specific, well-defined conditions.

    Aviation's 10 Levels of Automation

    While the SAE’s six levels are great for broad classification, aviation offers a more granular model for systems designed for close human-machine collaboration. The Parasuraman, Sheridan, and Wickens model proposes a detailed 10-level spectrum of automation.

    Levels of Automation of Decision and Action Selection for Aviation (Credit: The MITRE Corporation)

    This framework is less about full autonomy and more about the nuances of interaction. For example:

    • At Level 3, the computer "narrows the selection down to a few" for the human to choose from.

    • At Level 6, the computer "allows the human a restricted time to veto before it executes" an action.

    • At Level 9, the computer "informs the human only if it, the computer, decides to."

    The key insight for AI agents: This model is perfect for describing the collaborative "centaur" systems we're seeing today. Most AI agents won't be fully autonomous (Level 10) but will exist somewhere on this spectrum, acting as a co-pilot that suggests, executes with approval or acts with a veto window.

    Robotics and unmanned systems

    Finally, the world of robotics brings in another critical dimension: context. The National Institute of Standards and Technology's (NIST) Autonomy Levels for Unmanned Systems (ALFUS) framework was designed for systems like drones and industrial robots.

    The Three-Axis Model for ALFUS (Credit: NIST)

    Its main contribution is adding context to the definition of autonomy, assessing it along three axes:

    1. Human independence: How much human supervision is required?

    2. Mission complexity: How difficult or unstructured is the task?

    3. Environmental complexity: How predictable and stable is the environment in which the agent operates?

    The key insight for AI agents: This framework reminds us that autonomy isn't a single number. An agent performing a simple task in a stable, predictable digital environment (like sorting files in a single folder) is fundamentally less autonomous than an agent performing a complex task across the chaotic, unpredictable environment of the open internet, even if the level of human supervision is the same.

    The emerging frameworks for AI agents

    Having looked at the lessons from automotive, aviation and robotics, we can now examine the emerging frameworks designed for AI agents. While the field is still new and no single standard has won out, most proposals fall into three distinct, but often overlapping, categories based on the primary question they seek to answer.

    Category 1: The "What can it do?" frameworks (capability-focused)

    These frameworks classify agents based on their underlying technical architecture and what they are capable of achieving. They provide a roadmap for developers, outlining a progression of increasingly sophisticated technical milestones that often correspond directly to code patterns.

    A prime example of this developer-centric approach comes from Hugging Face. Their framework uses a star rating to show the gradual shift in control from human to AI:

    Five Levels of AI Agent Autonomy, as proposed by HuggingFace (Credit: Hugging Face)

    • Zero stars (simple processor): The AI has no impact on the program's flow. It simply processes information and its output is displayed, like a print statement. The human is in complete control.

    • One star (router): The AI makes a basic decision that directs program flow, like choosing between two predefined paths (if/else). The human still defines how everything is done.

    • Two stars (tool call): The AI chooses which predefined tool to use and what arguments to use with it. The human has defined the available tools, but the AI decides how to execute them.

    • Three stars (multi-step agent): The AI now controls the iteration loop. It decides which tool to use, when to use it and whether to continue working on the task.

    • Four stars (fully autonomous): The AI can generate and execute entirely new code to accomplish a goal, going beyond the predefined tools it was given.

    Strengths: This model is excellent for engineers. It's concrete, maps directly to code and clearly benchmarks the transfer of executive control to the AI. 

    Weaknesses: It is highly technical and less intuitive for non-developers trying to understand an agent's real-world impact.

    Category 2: The "How do we work together?" frameworks (interaction-focused)

    This second category defines autonomy not by the agent’s internal skills, but by the nature of its relationship with the human user. The central question is: Who is in control, and how do we collaborate?

    This approach often mirrors the nuance we saw in the aviation models. For instance, a framework detailed in the paper Levels of Autonomy for AI Agents defines levels based on the user's role:

    • L1 – user as an operator: The human is in direct control (like a person using Photoshop with AI-assist features).

    • L4 – user as an approver: The agent proposes a full plan or action, and the human must give a simple "yes" or "no" before it proceeds.

    • L5 – user as an observer: The agent has full autonomy to pursue a goal and simply reports its progress and results back to the human.

    Levels of Autonomy for AI Agents

    Strengths: These frameworks are highly intuitive and user-centric. They directly address the critical issues of control, trust, and oversight.

    Weaknesses: An agent with simple capabilities and one with highly advanced reasoning could both fall into the "Approver" level, so this approach can sometimes obscure the underlying technical sophistication.

    Category 3: The "Who is responsible?" frameworks (governance-focused)

    The final category is less concerned with how an agent works and more with what happens when it fails. These frameworks are designed to help answer crucial questions about law, safety and ethics.

    Think tanks like Germany's Stiftung Neue VTrantwortung have analyzed AI agents through the lens of legal liability. Their work aims to classify agents in a way that helps regulators determine who is responsible for an agent's actions: The user who deployed it, the developer who built it or the company that owns the platform it runs on?

    This perspective is essential for navigating complex regulations like the EU's Artificial Intelligence Act, which will treat AI systems differently based on the level of risk they pose.

    Strengths: This approach is absolutely essential for real-world deployment. It forces the difficult but necessary conversations about accountability that build public trust.

    Weaknesses: It's more of a legal or policy guide than a technical roadmap for developers.

    A comprehensive understanding requires looking at all three questions at once: An agent's capabilities, how we interact with it and who is responsible for the outcome..

    Identifying the gaps and challenges

    Looking at the landscape of autonomy frameworks shows us that no  single model is sufficient because the true challenges lie in the gaps between them, in areas that are incredibly difficult to define and measure.

    What is the "Road" for a digital agent?

    The SAE framework for self-driving cars gave us the powerful concept of an ODD, the specific conditions under which a system can operate safely. For a car, that might be "divided highways, in clear weather, during the day." This is a great solution for a physical environment, but what’s the ODD for a digital agent?

    The "road" for an agent is the entire internet. An infinite, chaotic and constantly changing environment. Websites get redesigned overnight, APIs are deprecated and social norms in online communities shift. 

    How do we define a "safe" operational boundary for an agent that can browse websites, access databases and interact with third-party services? Answering this is one of the biggest unsolved problems. Without a clear digital ODD, we can't make the same safety guarantees that are becoming standard in the automotive world.

    This is why, for now, the most effective and reliable agents operate within well-defined, closed-world scenarios. As I argued in a recent VentureBeat article, forgetting the open-world fantasies and focusing on "bounded problems" is the key to real-world success. This means defining a clear, limited set of tools, data sources and potential actions. 

    Beyond simple tool use

    Today's agents are getting very good at executing straightforward plans. If you tell one to "find the price of this item using Tool A, then book a meeting with Tool B," it can often succeed. But true autonomy requires much more. 

    Many systems today hit a technical wall when faced with tasks that require:

    • Long-term reasoning and planning: Agents struggle to create and adapt complex, multi-step plans in the face of uncertainty. They can follow a recipe, but they can't yet invent one from scratch when things go wrong.

    • Robust self-correction: What happens when an API call fails or a website returns an unexpected error? A truly autonomous agent needs the resilience to diagnose the problem, form a new hypothesis and try a different approach, all without a human stepping in.

    • Composability: The future likely involves not one agent, but a team of specialized agents working together. Getting them to collaborate reliably, to pass information back and forth, delegate tasks and resolve conflicts is a monumental software engineering challenge that we are just beginning to tackle.

    The elephant in the room: Alignment and control

    This is the most critical challenge of all, because it's not just technical, it's deeply human. Alignment is the problem of ensuring an agent's goals and actions are consistent with our intentions and values, even when those values are complex, unstated or nuanced.

    Imagine you give an agent the seemingly harmless goal of "maximizing customer engagement for our new product." The agent might correctly determine that the most effective strategy is to send a dozen notifications a day to every user. The agent has achieved its literal goal perfectly, but it has violated the unstated, common-sense goal of "don't be incredibly annoying."

    This is a failure of alignment.

    The core difficulty, which organizations like the AI Alignment Forum are dedicated to studying, is that it is incredibly hard to specify fuzzy, complex human preferences in the precise, literal language of code. As agents become more powerful, ensuring they are not just capable but also safe, predictable and aligned with our true intent becomes the most important challenge we face.

    The future is agentic (and collaborative)

    The path forward for AI agents is not a single leap to a god-like super-intelligence, but a more practical and collaborative journey. The immense challenges of open-world reasoning and perfect alignment mean that the future is a team effort.

    We will see less of the single, all-powerful agent and more of an "agentic mesh" — a network of specialized agents, each operating within a bounded domain, working together to tackle complex problems. 

    More importantly, they will work with us. The most valuable and safest applications will keep a human on the loop, casting them as a co-pilot or strategist to augment our intellect with the speed of machine execution. This "centaur" model will be the most effective and responsible path forward.

    The frameworks we've explored aren’t just theoretical. They’re practical tools for building trust, assigning responsibility and setting clear expectations. They help developers define limits and leaders shape vision, laying the groundwork for AI to become a dependable partner in our work and lives.

    Sean Falconer is Confluent's AI entrepreneur in residence.

  • When dirt meets data: ScottsMiracle-Gro saved $150M using AI

    How a semiconductor veteran turned over a century of horticultural wisdom into AI-led competitive advantage 

    For decades, a ritual played out across ScottsMiracle-Gro’s media facilities. Every few weeks, workers walked acres of towering compost and wood chip piles with nothing more than measuring sticks. They wrapped rulers around each mound, estimated height, and did what company President Nate Baxter now describes as “sixth-grade geometry to figure out volume.”

    Today, drones glide over those same plants with mechanical precision. Vision systems calculate volumes in real time. The move from measuring sticks to artificial intelligence signals more than efficiency. It is the visible proof of one of corporate America’s most unlikely technology stories.

    The AI revolution finds an unexpected leader

    Enterprise AI has been led by predictable players. Software companies with cloud-native architectures. Financial services firms with vast data lakes. Retailers with rich digital touchpoints. Consumer packaged goods companies that handle physical products like fertilizer and soil were not expected to lead.

    Yet ScottsMiracle-Gro has realized more than half of a targeted $150 million in supply chain savings. It reports a 90 percent improvement in customer service response times. Its predictive models enable weekly reallocation of marketing resources across regional markets.

    A Silicon Valley veteran bets on soil science

    Baxter’s path to ScottsMiracle-Gro (SMG) reads like a calculated pivot, not a corporate rescue. After two decades in semiconductor manufacturing at Intel and Tokyo Electron, he knew how to apply advanced technology to complex operations.

    “I sort of initially said, ‘Why would I do this? I’m running a tech company. It’s an industry I’ve been in for 25 years,’” Baxter recalls of his reaction when ScottsMiracle-Gro CEO Jim Hagedorn approached him in 2023. The company was reeling from a collapsed $1.2 billion hydroponics investment and facing what he describes as “pressure from a leverage standpoint.”

    His wife challenged him with a direct prompt. If you are not learning or putting yourself in uncomfortable situations, you should change that.

    Baxter saw clear parallels between semiconductor manufacturing and SMG’s operations. Both require precision, quality control, and the optimization of complex systems. He also saw untapped potential in SMG’s domain knowledge. One hundred fifty years of horticultural expertise, regulatory know-how, and customer insight had never been fully digitized.

    “It became apparent to me whether it was on the backend with data analytics, business process transformation, and obviously now with AI being front and center of the consumer experience, a lot of opportunities are there,” he explains.

    The declaration that changed everything

    The pivot began at an all-hands meeting. “I just said, you know, guys, we’re a tech company. You just don’t know it yet,” Baxter recalls. “There’s so much opportunity here to drive this company to where it needs to go.”

    The first challenge was organizational. SMG had evolved into functional silos. IT, supply chain, and brand teams ran independent systems with little coordination. Drawing on his experience with complex technology organizations, Baxter restructured the consumer business into three business units. General managers became accountable not just for financial results but also for technology implementation within their domains.

    “I came in and said, we’re going to create new business units,” he explains. “The buck stops with you and I’m holding you accountable not only for the business results, for the quality of the creative and marketing, but for the implementation of technology.”

    To support the new structure, SMG set up centers of excellence for digital capabilities, insights and analytics, and creative functions. The hybrid design placed centralized expertise behind distributed accountability.

    Mining corporate memory for AI gold

    Turning legacy knowledge into machine-ready intelligence required what Fausto Fleites, VP of Data Intelligence, calls “archaeological work.” The team excavated decades of business logic embedded in legacy SAP systems and converted filing cabinets of research into AI-ready datasets. Fleites, a Cuban immigrant with a doctorate from FIU who led Florida’s public hurricane loss model before roles at Sears and Cemex, understood the stakes.

    “The costly part of the migration was the business reporting layer we have in SAP Business Warehouse,” Fleites explains. “You need to uncover business logic created in many cases over decades.”

    SMG chose Databricks as its unified data platform. The team had Apache Spark expertise. Databricks offered strong SAP integration and aligned with a preference for open-source technologies that minimize vendor lock-in.

    The breakthrough came through systematic knowledge management. SMG built an AI bot using Google’s Gemini large language model to catalog and clean internal repositories. The system identified duplicates, grouped content by topic, and restructured information for AI consumption. The effort reduced knowledge articles by 30 percent while increasing their utility.

    “We used Gemini LLMs to actually categorize them into topics, find similar documents,” Fleites explains. A hybrid approach that combined modern AI with techniques like cosine similarity became the foundation for later applications.

    Building AI systems that actually understand fertilizer

    Early trials with off-the-shelf AI exposed a real risk. General-purpose models confused products designed for killing weeds with those for preventing them. That mistake can ruin a lawn.

    “Different products, if you use one in the wrong place, would actually have a very negative outcome,” Fleites notes. “But those are kind of synonyms in certain contexts to the LLM. So they were recommending the wrong products.”

    The solution was a new architecture. SMG created what Fleites calls a “hierarchy of agents.” A supervisor agent routes queries to specialized worker agents organized by brand. Each agent draws on deep product knowledge encoded from a 400-page internal training manual.

    The system also changes the conversation. When users ask for recommendations, the agents start with questions about location, goals, and lawn conditions. They narrow possibilities step by step before offering suggestions. The stack integrates with APIs for product availability and state-specific regulatory compliance.

    From drones to demand forecasting across the enterprise

    The transformation runs across the company. Drones measure inventory piles. Demand forecasting models analyze more than 60 factors, including weather patterns, consumer sentiment, and macroeconomic indicators.

    These predictions enable faster moves. When drought struck Texas, the models supported a shift in promotional spending to regions with favorable weather. The reallocation helped drive positive quarterly results.

    “We not only have the ability to move marketing and promotion dollars around, but we’ve even gotten to the point where if it’s going to be a big weekend in the Northeast, we’ll shift our field sales resources from other regions up there,” Baxter explains.

    Consumer Services changed as well. AI agents now process incoming emails through Salesforce, draft responses based on the knowledge base, and flag them for brief human review. Draft times dropped from ten minutes to seconds and response quality improved.

    The company emphasizes explainable AI. Using SHAP, SMG built dashboards that decompose each forecast and show how weather, promotions, or media spending contribute to predictions.

    “Typically, if you open a prediction to a business person and you don’t say why, they’ll say, ‘I don’t believe you,’” Fleites explains. Transparency made it possible to move resource allocation from quarterly to weekly cycles.

    Competing like a startup

    SMG’s results challenge assumptions about AI readiness in traditional industries. The advantage does not come from owning the most sophisticated models. It comes from combining general-purpose AI with unique, structured domain knowledge.

    “LLMs are going to be a commodity,” Fleites observes. “The strategic differentiator is what is the additional level of [internal] knowledge we can fit to them.”

    Partnerships are central. SMG works with Google Vertex AI for foundational models, Sierra.ai for production-ready conversational agents, and Kindwise for computer vision. The ecosystem approach lets a small internal team recruited from Meta, Google, and AI startups deliver outsized impact without building everything from scratch.

    Talent follows impact. Conventional wisdom says traditional companies cannot compete with Meta salaries or Google stock. SMG offered something different. It offered the chance to build transformative AI applications with immediate business impact.

    “When we have these interviews, what we propose to them is basically the ability to have real value with the latest knowledge in these spaces,” Fleites explains. “A lot of people feel motivated to come to us” because much of big tech AI work, despite the hype, “doesn’t really have an impact.”

    Team design mirrors that philosophy. “My direct reports are leaders and not only manage people, but are technically savvy,” Fleites notes. “We always are constantly switching hands between developing or maintaining a solution versus strategy versus managing people.” He still writes code weekly. The small team of 15 to 20 AI and engineering professionals stays lean by contracting out implementation while keeping “the know-how and the direction and the architecture” in-house.

    When innovation meets immovable objects

    Not every pilot succeeded. SMG tested semi-autonomous forklifts in a 1.3 million square foot distribution facility. Remote drivers in the Philippines controlled up to five vehicles at once with strong safety records.

    “The technology was actually really great,” Baxter acknowledges. The vehicles could not lift enough weight for SMG’s heavy products. The company paused implementation.

    “Not everything we’ve tried has gone smoothly,” Baxter admits. “But I think another important point is you have to focus on a few critical ones and you have to know when something isn’t going to work and readjust.”

    The lesson tracks with semiconductor discipline. Investments must show measurable returns within set timeframes. Regulatory complexity adds difficulty. Products must comply with EPA rules and a patchwork of state restrictions, which AI systems must navigate correctly.

    The gardening sommelier and agent-to-agent futures

    The roadmap reflects a long-term view. SMG plans a “gardening sommelier” mobile app in 2026 that identifies plants, weeds, and lawn problems from photos and provides instant guidance. A beta already helps field sales teams answer complex product questions by querying the 400-page knowledge base.

    The company is exploring agent-to-agent communication so its specialized AI can interface with retail partners’ systems. A customer who asks a Walmart chatbot for lawn advice could trigger an SMG query that returns accurate, regulation-compliant recommendations.

    SMG has launched AI-powered search on its website, replacing keyword systems with conversational engines based on the internal stack. The future vision pairs predictive models with conversational agents so the system can reach out when conditions suggest a customer may need help.

    What traditional industries can learn

    ScottsMiracle-Gro's transformation offers a clear playbook for enterprises. The advantage doesn't come from deploying the most sophisticated models. Instead, it comes from combining AI with proprietary domain knowledge that competitors can't easily replicate.

    By making general managers responsible for both business results and technology implementation, SMG ensured AI wasn't just an IT initiative but a business imperative. The 150 years of horticultural expertise only became valuable when it was digitized, structured, and made accessible to AI systems.

    Legacy companies competing for AI engineers can't match Silicon Valley compensation packages. But they can offer something tech giants often can't: immediate, measurable impact. When engineers see their weather forecasting models directly influence quarterly results or their agent architecture prevent customers from ruining their lawns, the work carries weight that another incremental improvement to an ad algorithm never will.

    “We have a right to win,” Baxter says. “We have 150 years of this experience.” That experience is now data, and data is the company’s competitive edge. ScottsMiracle-Gro didn’t outspend its rivals or chase the newest AI model. It turned knowledge into an operating system for growth. For a company built on soil, its biggest breakthrough might be cultivating data.

  • Is vibe coding ruining a generation of engineers?

    AI tools are revolutionizing software development by automating repetitive tasks, refactoring bloated code, and identifying bugs in real-time. Developers can now generate well-structured code from plain language prompts, saving hours of manual effort. These tools learn from vast codebases, offering context-aware recommendations that enhance productivity and reduce errors. Rather than starting from scratch, engineers can prototype quickly, iterate faster and focus on solving increasingly complex problems.

    As code generation tools grow in popularity, they raise questions about the future size and structure of engineering teams. Earlier this year, Garry Tan, CEO of startup accelerator Y Combinator, noted that about one-quarter of its current clients use AI to write 95% or more of their software. In an interview with CNBC, Tan said: “What that means for founders is that you don’t need a team of 50 or 100 engineers, you don’t have to raise as much. The capital goes much longer.”

    AI-powered coding may offer a fast solution for businesses under budget pressure — but its long-term effects on the field and labor pool cannot be ignored.

    As AI-powered coding rises, human expertise may diminish


    In the era of AI, the traditional journey to coding expertise that has long supported senior developers may be at risk. Easy access to large language models (LLMs) enables junior coders to quickly identify issues in code. While this speeds up software development, it can distance developers from their own work, delaying the growth of core problem-solving skills. As a result, they may avoid the focused, sometimes uncomfortable hours required to build expertise and progress on the path to becoming successful senior developers.

    Consider Anthropic’s Claude Code, a terminal-based assistant built on the Claude 3.7 Sonnet model, which automates bug detection and resolution, test creation and code refactoring. Using natural language commands, it reduces repetitive manual work and boosts productivity.

    Microsoft has also released two open-source frameworks — AutoGen and Semantic Kernel — to support the development of agentic AI systems. AutoGen enables asynchronous messaging, modular components, and distributed agent collaboration to build complex workflows with minimal human input. Semantic Kernel is an SDK that integrates LLMs with languages like C#, Python and Java, letting developers build AI agents to automate tasks and manage enterprise applications.

    The increasing availability of these tools from Anthropic, Microsoft and others may reduce opportunities for coders to refine and deepen their skills. Rather than “banging their heads against the wall” to debug a few lines or select a library to unlock new features, junior developers may simply turn to AI for an assist. This means senior coders with problem-solving skills honed over decades may become an endangered species.

    Overreliance on AI for writing code risks weakening developers’ hands-on experience and understanding of key programming concepts. Without regular practice, they may struggle to independently debug, optimize or design systems. Ultimately, this erosion of skill can undermine critical thinking, creativity and adaptability — qualities that are essential not just for coding, but for assessing the quality and logic of AI-generated solutions.

    AI as mentor: Turning code automation into hands-on learning

    While concerns about AI diminishing human developer skills are valid, businesses shouldn’t dismiss AI-supported coding. They just need to think carefully about when and how to deploy AI tools in development. These tools can be more than productivity boosters; they can act as interactive mentors, guiding coders in real time with explanations, alternatives and best practices.

    When used as a training tool, AI can reinforce learning by showing coders why code is broken and how to fix it—rather than simply applying a solution. For example, a junior developer using Claude Code might receive immediate feedback on inefficient syntax or logic errors, along with suggestions linked to detailed explanations. This enables active learning, not passive correction. It’s a win-win: Accelerating project timelines without doing all the work for junior coders.

    Additionally, coding frameworks can support experimentation by letting developers prototype agent workflows or integrate LLMs without needing expert-level knowledge upfront. By observing how AI builds and refines code, junior developers who actively engage with these tools can internalize patterns, architectural decisions and debugging strategies — mirroring the traditional learning process of trial and error, code reviews and mentorship.

    However, AI coding assistants shouldn’t replace real mentorship or pair programming. Pull requests and formal code reviews remain essential for guiding newer, less experienced team members. We are nowhere near the point at which AI can single-handedly upskill a junior developer.

    Companies and educators can build structured development programs around these tools that emphasize code comprehension to ensure AI is used as a training partner rather than a crutch. This encourages coders to question AI outputs and requires manual refactoring exercises. In this way, AI becomes less of a replacement for human ingenuity and more of a catalyst for accelerated, experiential learning.

    Bridging the gap between automation and education

    When utilized with intention, AI doesn’t just write code; it teaches coding, blending automation with education to prepare developers for a future where deep understanding and adaptability remain indispensable.

    By embracing AI as a mentor, as a programming partner and as a team of developers we can direct to the problem at hand, we can bridge the gap between effective automation and education. We can empower developers to grow alongside the tools they use. We can ensure that, as AI evolves, so too does the human skill set, fostering a generation of coders who are both efficient and deeply knowledgeable.

    Richard Sonnenblick is chief data scientist at Planview.

  • Will updating your AI agents help or hamper their performance? Raindrop’s new tool Experiments tells you

    It seems like almost every week for the last two years since ChatGPT launched, new large language models (LLMs) from rival labs or from OpenAI itself have been released. Enterprises are hard pressed to keep up with the massive pace of change, let alone understand how to adapt to it — which of these new models should they adopt, if any, to power their workflows and the custom AI agents they're building to carry them out?

    Help has arrived: AI applications observability startup Raindrop has launched Experiments, a new analytics feature that the company describes as the first A/B testing suite designed specifically for enterprise AI agents — allowing companies to see and compare how updating agents to new underlying models, or changing their instructions and tool access, will impact their performance with real end users.

    The release extends Raindrop’s existing observability tools, giving developers and teams a way to see how their agents behave and evolve in real-world conditions.

    With Experiments, teams can track how changes — such as a new tool, prompt, model update, or full pipeline refactor — affect AI performance across millions of user interactions. The new feature is available now for users on Raindrop’s Pro subscription plan ($350 monthly) at raindrop.ai.

    A Data-Driven Lens on Agent Development

    Raindrop co-founder and chief technology officer Ben Hylak noted in a product announcement video (above) that Experiments helps teams see “how literally anything changed,” including tool usage, user intents, and issue rates, and to explore differences by demographic factors such as language. The goal is to make model iteration more transparent and measurable.

    The Experiments interface presents results visually, showing when an experiment performs better or worse than its baseline. Increases in negative signals might indicate higher task failure or partial code output, while improvements in positive signals could reflect more complete responses or better user experiences.

    By making this data easy to interpret, Raindrop encourages AI teams to approach agent iteration with the same rigor as modern software deployment—tracking outcomes, sharing insights, and addressing regressions before they compound.

    Background: From AI Observability to Experimentation

    Raindrop’s launch of Experiments builds on the company’s foundation as one of the first AI-native observability platforms, designed to help enterprises monitor and understand how their generative AI systems behave in production.

    As VentureBeat reported earlier this year, the company — originally known as Dawn AI — emerged to address what Hylak, a former Apple human interface designer, called the “black box problem” of AI performance, helping teams catch failures “as they happen and explain to enterprises what went wrong and why."

    At the time, Hylak described how “AI products fail constantly—in ways both hilarious and terrifying,” noting that unlike traditional software, which throws clear exceptions, “AI products fail silently.” Raindrop’s original platform focused on detecting those silent failures by analyzing signals such as user feedback, task failures, refusals, and other conversational anomalies across millions of daily events.

    The company’s co-founders— Hylak, Alexis Gauba, and Zubin Singh Koticha — built Raindrop after encountering firsthand the difficulty of debugging AI systems in production.

    “We started by building AI products, not infrastructure,” Hylak told VentureBeat. “But pretty quickly, we saw that to grow anything serious, we needed tooling to understand AI behavior—and that tooling didn’t exist.”

    With Experiments, Raindrop extends that same mission from detecting failures to measuring improvements. The new tool transforms observability data into actionable comparisons, letting enterprises test whether changes to their models, prompts, or pipelines actually make their AI agents better—or just different.

    Solving the “Evals Pass, Agents Fail” Problem

    Traditional evaluation frameworks, while useful for benchmarking, rarely capture the unpredictable behavior of AI agents operating in dynamic environments.

    As Raindrop co-founder Alexis Gauba explained in her LinkedIn announcement, “Traditional evals don’t really answer this question. They’re great unit tests, but you can’t predict your user’s actions and your agent is running for hours, calling hundreds of tools.”

    Gauba said the company consistently heard a common frustration from teams: “Evals pass, agents fail.”

    Experiments is meant to close that gap by showing what actually changes when developers ship updates to their systems.

    The tool enables side-by-side comparisons of models, tools, intents, or properties, surfacing measurable differences in behavior and performance.

    Designed for Real-World AI Behavior

    In the announcement video, Raindrop described Experiments as a way to “compare anything and measure how your agent’s behavior actually changed in production across millions of real interactions.”

    The platform helps users spot issues such as task failure spikes, forgetting, or new tools that trigger unexpected errors.

    It can also be used in reverse — starting from a known problem, such as an “agent stuck in a loop,” and tracing back to which model, tool, or flag is driving it.

    From there, developers can dive into detailed traces to find the root cause and ship a fix quickly.

    Each experiment provides a visual breakdown of metrics like tool usage frequency, error rates, conversation duration, and response length.

    Users can click on any comparison to access the underlying event data, giving them a clear view of how agent behavior changed over time. Shared links make it easy to collaborate with teammates or report findings.

    Integration, Scalability, and Accuracy

    According to Hylak, Experiments integrates directly with “the feature flag platforms companies know and love (like Statsig!)” and is designed to work seamlessly with existing telemetry and analytics pipelines.

    For companies without those integrations, it can still compare performance over time—such as yesterday versus today—without additional setup.

    Hylak said teams typically need around 2,000 users per day to produce statistically meaningful results.

    To ensure the accuracy of comparisons, Experiments monitors for sample size adequacy and alerts users if a test lacks enough data to draw valid conclusions.

    “We obsess over making sure metrics like Task Failure and User Frustration are metrics that you’d wake up an on-call engineer for,” Hylak explained. He added that teams can drill into the specific conversations or events that drive those metrics, ensuring transparency behind every aggregate number.

    Security and Data Protection

    Raindrop operates as a cloud-hosted platform but also offers on-premise personally identifiable information (PII) redaction for enterprises that need additional control.

    Hylak said the company is SOC 2 compliant and has launched a PII Guard feature that uses AI to automatically remove sensitive information from stored data. “We take protecting customer data very seriously,” he emphasized.

    Pricing and Plans

    Experiments is part of Raindrop’s Pro plan, which costs $350 per month or $0.0007 per interaction. The Pro tier also includes deep research tools, topic clustering, custom issue tracking, and semantic search capabilities.

    Raindrop’s Starter plan — $65 per month or $0.001 per interaction — offers core analytics including issue detection, user feedback signals, Slack alerts, and user tracking. Both plans come with a 14-day free trial.

    Larger organizations can opt for an Enterprise plan with custom pricing and advanced features like SSO login, custom alerts, integrations, edge-PII redaction, and priority support.

    Continuous Improvement for AI Systems

    With Experiments, Raindrop positions itself at the intersection of AI analytics and software observability. Its focus on “measure truth,” as stated in the product video, reflects a broader push within the industry toward accountability and transparency in AI operations.

    Rather than relying solely on offline benchmarks, Raindrop’s approach emphasizes real user data and contextual understanding. The company hopes this will allow AI developers to move faster, identify root causes sooner, and ship better-performing models with confidence.

  • Together AI’s ATLAS adaptive speculator delivers 400% inference speedup by learning from workloads in real-time

    Enterprises expanding AI deployments are hitting an invisible performance wall. The culprit? Static speculators that can't keep up with shifting workloads.

    Speculators are smaller AI models that work alongside large language models during inference. They draft multiple tokens ahead, which the main model then verifies in parallel. This technique (called speculative decoding) has become essential for enterprises trying to reduce inference costs and latency. Instead of generating tokens one at a time, the system can accept multiple tokens at once, dramatically improving throughput.

    Together AI today announced research and a new system called ATLAS (AdapTive-LeArning Speculator System) that aims to help enterprises overcome the challenge of static speculators. The technique provides a self-learning inference optimization capability that can help to deliver up to 400% faster inference performance than a baseline level of performance available in existing inference technologies such as vLLM.. The system addresses a critical problem: as AI workloads evolve, inference speeds degrade, even with specialized speculators in place.

    The company which got its start in 2023, has been focused on optimizing inference on its enterprise AI platform. Earlier this year the company raised $305 million as customer adoption and demand has grown.

    "Companies we work with generally, as they scale up, they see shifting workloads, and then they don't see as much speedup from speculative execution as before," Tri Dao, chief scientist at Together AI, told VentureBeat in an exclusive interview. "These speculators generally don't work well when their workload domain starts to shift."

    The workload drift problem no one talks about

    Most speculators in production today are "static" models. They're trained once on a fixed dataset representing expected workloads, then deployed without any ability to adapt. Companies like Meta and Mistral ship pre-trained speculators alongside their main models. Inference platforms like vLLM use these static speculators to boost throughput without changing output quality.

    But there's a catch. When an enterprise's AI usage evolves the static speculator's accuracy plummets.

    "If you're a company producing coding agents, and most of your developers have been writing in Python, all of a sudden some of them switch to writing Rust or C, then you see the speed starts to go down," Dao explained. "The speculator has a mismatch between what it was trained on versus what the actual workload is."

    This workload drift represents a hidden tax on scaling AI. Enterprises either accept degraded performance or invest in retraining custom speculators. That process captures only a snapshot in time and quickly becomes outdated.

    How adaptive speculators work: A dual-model approach

    ATLAS uses a dual-speculator architecture that combines stability with adaptation:

    The static speculator – A heavyweight model trained on broad data provides consistent baseline performance. It serves as a "speed floor."

    The adaptive speculator – A lightweight model learns continuously from live traffic. It specializes on-the-fly to emerging domains and usage patterns.

    The confidence-aware controller – An orchestration layer dynamically chooses which speculator to use. It adjusts the speculation "lookahead" based on confidence scores.

    "Before the adaptive speculator learns anything, we still have the static speculator to help provide the speed boost in the beginning," Ben Athiwaratkun, staff AI scientist at Together AI explained to VentureBeat. "Once the adaptive speculator becomes more confident, then the speed grows over time."

    The technical innovation lies in balancing acceptance rate (how often the target model agrees with drafted tokens) and draft latency. As the adaptive model learns from traffic patterns, the controller relies more on the lightweight speculator and extends lookahead. This compounds performance gains.

    Users don't need to tune any parameters. "On the user side, users don't have to turn any knobs," Dao said. "On our side, we have turned these knobs for users to adjust in a configuration that gets good speedup."

    Performance that rivals custom silicon

    Together AI's testing shows ATLAS reaching 500 tokens per second on DeepSeek-V3.1 when fully adapted. More impressively, those numbers on Nvidia B200 GPUs match or exceed specialized inference chips like Groq's custom hardware.

    "The software and algorithmic improvement is able to close the gap with really specialized hardware," Dao said. "We were seeing 500 tokens per second on these huge models that are even faster than some of the customized chips."

    The 400% speedup that the company claims for inference represents the cumulative effect of Together's Turbo optimization suite. FP4 quantization delivers 80% speedup over FP8 baseline. The static Turbo Speculator adds another 80-100% gain. The adaptive system layers on top. Each optimization compounds the benefits of the others.

    Compared to standard inference engines like vLLM or Nvidia's TensorRT-LLM, the improvement is substantial. Together AI benchmarks against the stronger baseline between the two for each workload before applying speculative optimizations.

    The memory-compute tradeoff explained

    The performance gains stem from exploiting a fundamental inefficiency in modern inference: wasted compute capacity.

    Dao explained that typically during inference, much of the compute power is not fully utilized.

    "During inference, which is actually the dominant workload nowadays, you're mostly using the memory subsystem," he said.

    Speculative decoding trades idle compute for reduced memory access. When a model generates one token at a time, it's memory-bound. The GPU sits idle while waiting for memory. But when the speculator proposes five tokens and the target model verifies them simultaneously, compute utilization spikes while memory access remains roughly constant.

    "The total amount of compute to generate five tokens is the same, but you only had to access memory once, instead of five times," Dao said.

    Think of it as intelligent caching for AI

    For infrastructure teams familiar with traditional database optimization, adaptive speculators function like an intelligent caching layer, but with a crucial difference.

    Traditional caching systems like Redis or memcached require exact matches. You store the exact same query result and retrieve it when that specific query runs again. Adaptive speculators work differently.

    "You can view it as an intelligent way of caching, not storing exactly, but figuring out some patterns that you see," Dao explained. "Broadly, we're observing that you're working with similar code, or working with similar, you know, controlling compute in a similar way. We can then predict what the big model is going to say. We just get better and better at predicting that."

    Rather than storing exact responses, the system learns patterns in how the model generates tokens. It recognizes that if you're editing Python files in a specific codebase, certain token sequences become more likely. The speculator adapts to those patterns, improving its predictions over time without requiring identical inputs.

    Use cases: RL training and evolving workloads

    Two enterprise scenarios particularly benefit from adaptive speculators:

    Reinforcement learning training: Static speculators quickly fall out of alignment as the policy evolves during training. ATLAS adapts continuously to the shifting policy distribution.

    Evolving workloads: As enterprises discover new AI use cases, workload composition shifts. "Maybe they started using AI for chatbots, but then they realized, hey, it can write code, so they start shifting to code," Dao said. "Or they realize these AIs can actually call tools and control computers and do accounting and things like that."

    In a vibe-coding session, the adaptive system can specialize for the specific codebase being edited. These are files not seen during training. This further increases acceptance rates and decoding speed.

    What it means for enterprises and the inference ecosystem

    ATLAS is available now on Together AI's dedicated endpoints as part of the platform at no additional cost. The company's 800,000-plus developers (up from 450,000 in February) have access to the optimization.

    But the broader implications extend beyond one vendor's product. The shift from static to adaptive optimization represents a fundamental rethinking of how inference platforms should work. As enterprises deploy AI across multiple domains, the industry will need to move beyond one-time trained models toward systems that learn and improve continuously.

    Together AI has historically released some of its research techniques as open source and collaborated with projects like vLLM. While the fully integrated ATLAS system is proprietary, some of the underlying techniques may eventually influence the broader inference ecosystem. 

    For enterprises looking to lead in AI, the message is clear: adaptive algorithms on commodity hardware can match custom silicon at a fraction of the cost. As this approach matures across the industry, software optimization increasingly trumps specialized hardware.

  • The most important OpenAI announcement you probably missed at DevDay 2025

    OpenAI’s annual developer conference on Monday was a spectacle of ambitious AI product launches, from an app store for ChatGPT to a stunning video-generation API that brought creative concepts to life. But for the enterprises and technical leaders watching closely, the most consequential announcement was the quiet general availability of Codex, the company's AI software engineer. This release signals a profound shift in how software—and by extension, modern business—is built.

    While other announcements captured the public’s imagination, the production-ready release of Codex, supercharged by a new specialized model and a suite of enterprise-grade tools, is the engine behind OpenAI’s entire vision. It is the tool that builds the tools, the proven agent in a world buzzing with agentic potential, and the clearest articulation of the company's strategy to win the enterprise.

    The general availability of Codex moves it from a "research preview" to a fully supported product, complete with a new software development kit (SDK), a Slack integration, and administrative controls for security and monitoring.This transition declares that Codex is ready for mission-critical work inside the world’s largest companies.

    "We think this is the best time in history to be a builder; it has never been faster to go from idea to product," said OpenAI CEO Sam Altman during the opening keynote presentation. "Software used to take months or years to build. You saw that it can take minutes now to build with AI." 

    That acceleration is not theoretical. It's a reality born from OpenAI’s own internal use — a massive "dogfooding" effort that serves as the ultimate case study for enterprise customers.

    Inside GPT-5-Codex: The AI model that codes autonomously for hours and drives 70% productivity gains

    At the heart of the Codex upgrade is GPT-5-Codex, a version of OpenAI's latest flagship model that has been "purposely trained for Codex and agentic coding." The new model is designed to function as an autonomous teammate, moving far beyond simple code autocompletion.

    "I personally like to think about it as a little bit like a human teammate," explained Tibo Sottiaux, an OpenAI engineer, during a technical session on Codex. "You can pair a program with it on your computer, you can delegate to it, or as you'll see, you can give it a job without explicit prompting."

    This new model enables "adaptive thinking," allowing it to dynamically adjust the time and computational effort spent on a task based on its complexity.For simple requests, it's fast and efficient, but for complex refactoring projects, it can work for hours.

    One engineer during the technical session noted, "I've seen the GPT-5-Codex model work for over seven hours productively… on a marathon session." This capability to handle long-running, complex tasks is a significant leap beyond the simple, single-shot interactions that define most AI coding assistants.

    The results inside OpenAI have been dramatic. The company reported that 92% of its technical staff now uses Codex daily, and those engineers complete 70% more pull requests (a measure of code contribution) each week. Usage has surged tenfold since August. 

    "When we as a team see the stats, it feels great," Sottiaux shared. "But even better is being at lunch with someone who then goes 'Hey I use Codex all the time. Here's a cool thing that I do with it. Do you want to hear about it?'" 

    How OpenAI uses Codex to build its own AI products and catch hundreds of bugs daily

    Perhaps the most compelling argument for Codex’s importance is that it is the foundational layer upon which OpenAI’s other flashy announcements were built. During the DevDay event, the company showcased custom-built arcade games and a dynamic, AI-powered website for the conference itself, all developed using Codex.

    In one session, engineers demonstrated how they built "Storyboard," a custom creative tool for the film industry, in just 48 hours during an internal hackathon. "We decided to test Codex, our coding agent… we would send tasks to Codex in between meetings. We really easily reviewed and merged PRs into production, which Codex even allowed us to do from our phones," said Allison August, a solutions engineering leader at OpenAI. 

    This reveals a critical insight: the rapid innovation showcased at DevDay is a direct result of the productivity flywheel created by Codex. The AI is a core part of the manufacturing process for all other AI products.

    A key enterprise-focused feature is the new, more robust code review capability. OpenAI said it "purposely trained GPT-5-Codex to be great at ultra thorough code review," enabling it to explore dependencies and validate a programmer's intent against the actual implementation to find high-quality bugs.Internally, nearly every pull request at OpenAI is now reviewed by Codex, catching hundreds of issues daily before they reach a human reviewer.

    "It saves you time, you ship with more confidence," Sottiaux said. "There's nothing worse than finding a bug after we actually ship the feature." 

    Why enterprise software teams are choosing Codex over GitHub Copilot for mission-critical development

    The maturation of Codex is central to OpenAI’s broader strategy to conquer the enterprise market, a move essential to justifying its massive valuation and unprecedented compute expenditures. During a press conference, CEO Sam Altman confirmed the strategic shift.

    "The models are there now, and you should expect a huge focus from us on really winning enterprises with amazing products, starting here," Altman said during a private press conference. 

    OpenAI President and Co-founder Greg Brockman immediately added, "And you can see it already with Codex, which I think has been just an incredible success and has really grown super fast." 

    For technical decision-makers, the message is clear. While consumer-facing agents that book dinner reservations are still finding their footing, Codex is a proven enterprise agent delivering substantial ROI today. Companies like Cisco have already rolled out Codex to their engineering organizations, cutting code review times by 50% and reducing project timelines from weeks to days.

    With the new Codex SDK, companies can now embed this agentic power directly into their own custom workflows, such as automating fixes in a CI/CD pipeline or even creating self-evolving applications. During a live demo, an engineer showcased a mobile app that updated its own user interface in real-time based on a natural language prompt, all powered by the embedded Codex SDK. 

    While the launch of an app ecosystem in ChatGPT and the breathtaking visuals of the Sora 2 API rightfully generated headlines, the general availability of Codex marks a more fundamental and immediate transformation. It is the quiet but powerful engine driving the next era of software development, turning the abstract promise of AI-driven productivity into a tangible, deployable reality for businesses today.

  • Echelon’s AI agents take aim at Accenture and Deloitte consulting models

    Echelon, an artificial intelligence startup that automates enterprise software implementations, emerged from stealth mode today with $4.75 million in seed funding led by Bain Capital Ventures, targeting a fundamental shift in how companies deploy and maintain critical business systems.

    The San Francisco-based company has developed AI agents specifically trained to handle end-to-end ServiceNow implementations — complex enterprise software deployments that traditionally require months of work by offshore consulting teams and cost companies millions of dollars annually.

    "The biggest barrier to digital transformation isn't technology — it's the time it takes to implement it," said Rahul Kayala, Echelon's founder and CEO, who previously worked at AI-powered IT company Moveworks. "AI agents are eliminating that constraint entirely, allowing enterprises to experiment, iterate, and deploy platform changes with unprecedented speed."

    The announcement signals a potential disruption to the $1.5 trillion global IT services market, where companies like Accenture, Deloitte, and Capgemini have long dominated through labor-intensive consulting models that Echelon argues are becoming obsolete in the age of artificial intelligence.

    Why ServiceNow deployments take months and cost millions

    ServiceNow, a cloud-based platform used by enterprises to manage IT services, human resources, and business workflows, has become critical infrastructure for large organizations. However, implementing and customizing the platform typically requires specialized expertise that most companies lack internally.

    The complexity stems from ServiceNow's vast customization capabilities. Organizations often need hundreds of "catalog items" — digital forms and workflows for employee requests — each requiring specific configurations, approval processes, and integrations with existing systems. According to Echelon's research, these implementations frequently stretch far beyond planned timelines due to technical complexity and communication bottlenecks between business stakeholders and development teams.

    "What starts out simple often turns into weeks of effort once the actual work begins," the company noted in its analysis of common implementation challenges. "A basic request form turns out to be five requests stuffed into one. We had catalog items with 50+ variables, 10 or more UI policies, all connected. Update one field, and something else would break."

    The traditional solution involves hiring offshore development teams or expensive consultants, creating what Echelon describes as a problematic cycle: "One question here, one delay there, and suddenly you're weeks behind."

    How AI agents replace expensive offshore consulting teams

    Echelon's approach replaces human consultants with AI agents trained by elite ServiceNow experts from top consulting firms. These agents can analyze business requirements, ask clarifying questions in real-time, and automatically generate complete ServiceNow configurations including forms, workflows, testing scenarios, and documentation.

    The technology delivers a significant advancement from general-purpose AI tools. Rather than providing generic code suggestions, Echelon's agents understand ServiceNow's specific architecture, best practices, and common integration patterns. They can identify gaps in requirements and propose solutions that align with enterprise governance standards.

    "Instead of routing every piece of input through five people, the business process owner directly uploaded their requirements," Kayala explained, describing a recent customer implementation. "The AI developer analyzes it and asks follow-up questions like: 'I see a process flow with 3 branches, but only 2 triggers. Should there be a 3rd?' The kinds of things a seasoned developer would ask. With AI, these questions came instantly."

    Early customers report dramatic time savings. One financial services company saw a service catalog migration project that was projected to take six months completed in six weeks using Echelon's AI agents.

    What makes Echelon's AI different from coding assistants

    Echelon's technology addresses several technical challenges that have prevented broader AI adoption in enterprise software implementation. The agents are trained not just on ServiceNow's technical capabilities but on the accumulated expertise of senior consultants who understand complex enterprise requirements, governance frameworks, and integration patterns.

    This approach differs from general-purpose AI coding assistants like GitHub Copilot, which provide syntax suggestions but lack domain-specific expertise. Echelon's agents understand ServiceNow's data models, security frameworks, and upgrade considerations—knowledge typically acquired through years of consulting experience.

    The company's training methodology involves elite ServiceNow experts from consulting firms like Accenture and specialized ServiceNow partner Thirdera. This embedded expertise enables the AI to handle complex requirements and edge cases that typically require senior consultant intervention.

    The real challenge isn't teaching AI to write code — it's capturing the intuitive expertise that separates junior developers from seasoned architects. Senior ServiceNow consultants instinctively know which customizations will break during upgrades and how simple requests spiral into complex integration problems. This institutional knowledge creates a far more defensible moat than general-purpose coding assistants can offer.

    The $1.5 trillion consulting market faces disruption

    Echelon's emergence reflects broader trends reshaping the enterprise software market. As companies accelerate digital transformation initiatives, the traditional consulting model increasingly appears inadequate for the speed and scale required.

    ServiceNow itself has grown rapidly, reporting over $10.98 billion in annual revenue in 2024, and $12.06 billion for the trailing twelve months ending June 30, 2025, as organizations continue to digitize more business processes. However, this growth has created a persistent talent shortage, with demand for skilled ServiceNow professionals — particularly those with AI expertise — significantly outpacing supply.

    The startup's approach could fundamentally alter the economics of enterprise software implementation. Traditional consulting engagements often involve large teams working for months, with costs scaling linearly with project complexity. AI agents, by contrast, can handle multiple projects simultaneously and apply learned knowledge across customers.

    Rak Garg, the Bain Capital Ventures partner who led Echelon's funding round, sees this as part of a larger shift toward AI-powered professional services. "We see the same trend with other BCV companies like Prophet Security, which automates security operations, and Crosby, which automates legal services for startups. AI is quickly becoming the delivery layer across multiple functions."

    Scaling beyond ServiceNow while maintaining enterprise reliability

    Despite early success, Echelon faces significant challenges in scaling its approach. Enterprise customers prioritize reliability above speed, and any AI-generated configurations must meet strict security and compliance requirements.

    "Inertia is the biggest risk," Garg acknowledged. "IT systems shouldn't ever go down, and companies lose thousands of man-hours of productivity with every outage. Proving reliability at scale, and building on repeatable results will be critical for Echelon."

    The company plans to expand beyond ServiceNow to other enterprise platforms including SAP, Salesforce, and Workday — each creating substantial additional market opportunities. However, each platform requires developing new domain expertise and training models on platform-specific best practices.

    Echelon also faces potential competition from established consulting firms that are developing their own AI capabilities. However, Garg views these firms as potential partners rather than competitors, noting that many have already approached Echelon about collaboration opportunities.

    "They know that AI is shifting their business model in real-time," he said. "Customers are placing immense pricing pressure on larger firms and asking hard questions, and these firms can use Echelon agents to accelerate their projects."

    How AI agents could reshape all professional services

    Echelon's funding and emergence from stealth marks a significant milestone in the application of AI to professional services. Unlike consumer AI applications that primarily enhance individual productivity, enterprise AI agents like Echelon's directly replace skilled labor at scale.

    The company's approach — training AI systems on expert knowledge rather than just technical documentation — could serve as a model for automating other complex professional services. Legal research, financial analysis, and technical consulting all involve similar patterns of applying specialized expertise to unique customer requirements.

    For enterprise customers, the promise extends beyond cost savings to strategic agility. Organizations that can rapidly implement and modify business processes gain competitive advantages in markets where customer expectations and regulatory requirements change frequently.

    As Kayala noted, "This unlocks a completely different approach to business agility and competitive advantage."

    The implications extend far beyond ServiceNow implementations. If AI agents can master the intricacies of enterprise software deployment—one of the most complex and relationship-dependent areas of professional services — few knowledge work domains may remain immune to automation.

    The question isn't whether AI will transform professional services, but how quickly human expertise can be converted into autonomous digital workers that never sleep, never leave for competitors, and get smarter with every project they complete.

  • To scale agentic AI, Notion tore down its tech stack and started fresh

    Many organizations would be hesitant to overhaul their tech stack and start from scratch.

    Not Notion.

    For the 3.0 version of its productivity software (released in September), the company didn’t hesitate to rebuild from the ground up; they recognized that it was necessary, in fact, to support agentic AI at enterprise scale.

    Whereas traditional AI-powered workflows involve explicit, step-by-step instructions based on few-shot learning, AI agents powered by advanced reasoning models are thoughtful about tool definition, can identify and comprehend what tools they have at their disposal and plan next steps.

    “Rather than trying to retrofit into what we were building, we wanted to play to the strengths of reasoning models,” Sarah Sachs, Notion’s head of AI modeling, told VentureBeat. “We've rebuilt a new architecture because workflows are different from agents.”

    Re-orchestrating so models can work autonomously

    Notion has been adopted by 94% of Forbes AI 50 companies, has 100 million total users and counts among its customers OpenAI, Cursor, Figma, Ramp and Vercel.

    In a rapidly evolving AI landscape, the company identified the need to move beyond simpler, task-based workflows to goal-oriented reasoning systems that allow agents to autonomously select, orchestrate, and execute tools across connected environments.

    Very quickly, reasoning models have become “far better” at learning to use tools and follow chain-of-thought (CoT) instructions, Sachs noted. This allows them to be “far more independent” and make multiple decisions within one agentic workflow. “We rebuilt our AI system to play to that," she said.

    From an engineering perspective, this meant replacing rigid prompt-based flows with a unified orchestration model, Sachs explained. This core model is supported by modular sub-agents that search Notion and the web, query and add to databases and edit content.

    Each agent uses tools contextually; for instance, they can decide whether to search Notion itself, or another platform like Slack. The model will perform successive searches until the relevant information is found. It can then, for instance, convert notes into proposals, create follow-up messages, track tasks, and spot and make updates in knowledge bases.

    In Notion 2.0, the team focused on having AI perform specific tasks, which required them to “think exhaustively” about how to prompt the model, Sachs noted. However, with version 3.0, users can assign tasks to agents, and agents can actually take action and perform multiple tasks concurrently.

    “We reorchestrated it to be self-selecting on the tools, rather than few-shotting, which is explicitly prompting how to go through all these different scenarios,” Sachs explained. The aim is to ensure everything interfaces with AI and that “anything you can do, your Notion agent can do.”

    Bifurcating to isolate hallucinations

    Notion’s philosophy of “better, faster, cheaper,” drives a continuous iteration cycle that balances latency and accuracy through fine-tuned vector embeddings and elastic search optimization. Sachs’ team employs a rigorous evaluation framework that combines deterministic tests, vernacular optimization, human-annotated data and LLMs-as-a-judge, with model-based scoring identifying discrepancies and inaccuracies.

    “By bifurcating the evaluation, we're able to identify where the problems come from, and that helps us isolate unnecessary hallucinations,” Sachs explained. Further, making the architecture itself simpler means it’s easier to make changes as models and techniques evolve.

    “We optimize latency and parallel thinking as much as possible,” which leads to “way better accuracy,” Sachs noted. Models are grounded in data from the web and the Notion connected workspace.

    Ultimately, Sachs reported, the investment in rebuilding its architecture has already provided Notion returns in terms of capability and faster rate of change.

    She added, “We are fully open to rebuilding it again, when the next breakthrough happens, if we have to.”

    Understanding contextual latency

    When building and fine-tuning models, it’s important to understand that latency is subjective: AI must provide the most relevant information, not necessarily the most, at the cost of speed.

    “You'd be surprised at the different ways customers are willing to wait for things and not wait for things,” Sachs said. It makes for an interesting experiment: How slow can you go before people abandon the model?

    With pure navigational search, for instance, users may not be as patient; they want answers near-immediately. “If you ask, ‘What's two plus two,’ you don't want to wait for your agent to be searching everywhere in Slack and JIRA,” Sachs pointed out.

    But the longer the time it's given, the more exhaustive a reasoning agent can be. For instance, Notion can perform 20 minutes of autonomous work across hundreds of websites, files and other materials. In these instances, users are more willing to wait, Sachs explained; they allow the model to execute in the background while they attend to other tasks.

    “It's a product question,” said Sachs. “How do we set user expectations from the UI? How do we ascertain user expectations on latency?”

    Notion is its biggest user

    Notion understands the importance of using its own product — in fact, its employees are among its biggest power users.

    Sachs explained that teams have active sandboxes that generate training and evaluation data, as well as a “really active” thumbs-up-thumbs-down user feedback loop. Users aren’t shy about saying what they think should be improved or features they’d like to see.

    Sachs emphasized that when a user thumbs down an interaction, they are explicitly giving permission to a human annotator to analyze that interaction in a way that de-anonymizes them as much as possible.

    “We are using our own tool as a company all day, every day, and so we get really fast feedback loops,” said Sachs. “We’re really dogfooding our own product.”

    That said, it’s their own product they’re building, Sachs noted, so they understand that they may have goggles on when it comes to quality and functionality. To balance this out, Notion has trusted "very AI-savvy" design partners who are granted early access to new capabilities and provide important feedback.

    Sachs emphasized that this is just as important as internal prototyping.

    “We're all about experimenting in the open, I think you get much richer feedback,” said Sachs. “Because at the end of the day, if we just look at how Notion uses Notion, we're not really giving the best experience to our customers.”

    Just as importantly, continuous internal testing allows teams to evaluate progressions and make sure models aren't regressing (when accuracy and performance degrades over time). "Everything you're doing stays faithful," Sachs explained. "You know that your latency is within bounds."

    Many companies make the mistake of focusing too intensely on retroactively-focused evans; this makes it difficult for them to understand how or where they're improving, Sachs pointed out. Notion considers evals as a "litmus test" of development and forward-looking progression and evals of observability and regression proofing.

    “I think a big mistake a lot of companies make is conflating the two,” said Sachs. “We use them for both purposes; we think about them really differently.”

    Takeaways from Notion's journey

    For enterprises, Notion can serve as a blueprint for how to responsibly and dynamically operationalize agentic AI in a connected, permissioned enterprise workspace.

    Sach’s takeaways for other tech leaders:

    • Don’t be afraid to rebuild when foundational capabilities change; Notion fully re-engineered its architecture to align with reasoning-based models.

    • Treat latency as contextual: Optimize per use case, rather than universally.

    • Ground all outputs in trustworthy, curated enterprise data to ensure accuracy and trust.

      She advised: “Be willing to make the hard decisions. Be willing to sit at the top of the frontier, so to speak, on what you're developing to build the best product you can for your customers.”

  • New memory framework builds AI agents that can handle the real world’s unpredictability

    Researchers at the University of Illinois Urbana-Champaign and Google Cloud AI Research have developed a framework that enables large language model (LLM) agents to organize their experiences into a memory bank, helping them get better at complex tasks over time.

    The framework, called ReasoningBank, distills “generalizable reasoning strategies” from an agent’s successful and failed attempts to solve problems. The agent then uses this memory during inference to avoid repeating past mistakes and make better decisions as it faces new problems. The researchers show that when combined with test-time scaling techniques, where an agent makes multiple attempts at a problem, ReasoningBank significantly improves the performance and efficiency of LLM agents.

    Their findings show that ReasoningBank consistently outperforms classic memory mechanisms across web browsing and software engineering benchmarks, offering a practical path toward building more adaptive and reliable AI agents for enterprise applications.

    The challenge of LLM agent memory

    As LLM agents are deployed in applications that run for long periods, they encounter a continuous stream of tasks. One of the key limitations of current LLM agents is their failure to learn from this accumulated experience. By approaching each task in isolation, they inevitably repeat past mistakes, discard valuable insights from related problems, and fail to develop skills that would make them more capable over time.

    The solution to this limitation is to give agents some kind of memory. Previous efforts to give agents memory have focused on storing past interactions for reuse by organizing information in various forms from plain text to structured graphs. However, these approaches often fall short. Many use raw interaction logs or only store successful task examples. This means they can't distill higher-level, transferable reasoning patterns and, crucially, they don’t extract and use the valuable information from the agent’s failures. As the researchers note in their paper, “existing memory designs often remain limited to passive record-keeping rather than providing actionable, generalizable guidance for future decisions.”

    How ReasoningBank works

    ReasoningBank is a memory framework designed to overcome these limitations. Its central idea is to distill useful strategies and reasoning hints from past experiences into structured memory items that can be stored and reused.

    According to Jun Yan, a Research Scientist at Google and co-author of the paper, this marks a fundamental shift in how agents operate. "Traditional agents operate statically—each task is processed in isolation," Yan explained. "ReasoningBank changes this by turning every task experience (successful or failed) into structured, reusable reasoning memory. As a result, the agent doesn’t start from scratch with each customer; it recalls and adapts proven strategies from similar past cases."

    The framework processes both successful and failed experiences and turns them into a collection of useful strategies and preventive lessons. The agent judges success and failure through LLM-as-a-judge schemes to obviate the need for human labeling.

    Yan provides a practical example of this process in action. An agent tasked with finding Sony headphones might fail because its broad search query returns over 4,000 irrelevant products. "ReasoningBank will first try to figure out why this approach failed," Yan said. "It will then distill strategies such as ‘optimize search query’ and ‘confine products with category filtering.’ Those strategies will be extremely useful to get future similar tasks successfully done."

    The process operates in a closed loop. When an agent faces a new task, it uses an embedding-based search to retrieve relevant memories from ReasoningBank to guide its actions. These memories are inserted into the agent’s system prompt, providing context for its decision-making. Once the task is completed, the framework creates new memory items to extract insights from successes and failures. This new knowledge is then analyzed, distilled, and merged into the ReasoningBank, allowing the agent to continuously evolve and improve its capabilities.

    Supercharging memory with scaling

    The researchers found a powerful synergy between memory and test-time scaling. Classic test-time scaling involves generating multiple independent answers to the same question, but the researchers argue that this “vanilla form is suboptimal because it does not leverage inherent contrastive signal that arises from redundant exploration on the same problem.”

    To address this, they propose Memory-aware Test-Time Scaling (MaTTS), which integrates scaling with ReasoningBank. MaTTS comes in two forms. In “parallel scaling,” the system generates multiple trajectories for the same query, then compares and contrasts them to identify consistent reasoning patterns. In sequential scaling, the agent iteratively refines its reasoning within a single attempt, with the intermediate notes and corrections also serving as valuable memory signals.

    This creates a virtuous cycle: the existing memory in ReasoningBank steers the agent toward more promising solutions, while the diverse experiences generated through scaling enable the agent to create higher-quality memories to store in ReasoningBank. 

    “This positive feedback loop positions memory-driven experience scaling as a new scaling dimension for agents,” the researchers write.

    ReasoningBank in action

    The researchers tested their framework on WebArena (web browsing) and SWE-Bench-Verified (software engineering) benchmarks, using models like Google’s Gemini 2.5 Pro and Anthropic’s Claude 3.7 Sonnet. They compared ReasoningBank against baselines including memory-free agents and agents using trajectory-based or workflow-based memory frameworks.

    The results show that ReasoningBank consistently outperforms these baselines across all datasets and LLM backbones. On WebArena, it improved the overall success rate by up to 8.3 percentage points compared to a memory-free agent. It also generalized better on more difficult, cross-domain tasks, while reducing the number of interaction steps needed to complete tasks. When combined with MaTTS, both parallel and sequential scaling further boosted performance, consistently outperforming standard test-time scaling.

    This efficiency gain has a direct impact on operational costs. Yan points to a case where a memory-free agent took eight trial-and-error steps just to find the right product filter on a website. "Those trial and error costs could be avoided by leveraging relevant insights from ReasoningBank," he noted. "In this case, we save almost twice the operational costs," which also improves the user experience by resolving issues faster.

    For enterprises, ReasoningBank can help develop cost-effective agents that can learn from experience and adapt over time in complex workflows and areas like software development, customer support, and data analysis. As the paper concludes, “Our findings suggest a practical pathway toward building adaptive and lifelong-learning agents.”

    Yan confirmed that their findings point toward a future of truly compositional intelligence. For example, a coding agent could learn discrete skills like API integration and database management from separate tasks. "Over time, these modular skills… become building blocks the agent can flexibly recombine to solve more complex tasks," he said, suggesting a future where agents can autonomously assemble their knowledge to manage entire workflows with minimal human oversight.

  • Google’s AI can now surf the web for you, click on buttons, and fill out forms with Gemini 2.5 Computer Use

    Some of the largest providers of large language models (LLMs) have sought to move beyond multimodal chatbots — extending their models out into "agents" that can actually take more actions on behalf of the user across websites. Recall OpenAI's ChatGPT Agent (formerly known as "Operator") and Anthropic's Computer Use, both released over the last two years.

    Now, Google is getting into that same game as well. Today, the search giant's DeepMind AI lab subsidiary unveiled a new, fine-tuned and custom-trained version of its powerful Gemini 2.5 Pro LLM known as "Gemini 2.5 Pro Computer Use," which can use a virtual browser to surf the web on your behalf, retrieve information, fill out forms, and even take actions on websites — all from a user's single text prompt.

    "These are early days, but the model’s ability to interact with the web – like scrolling, filling forms + navigating dropdowns – is an important next step in building general-purpose agents," said Google CEO Sundar Pichai, as part of a longer statement on the social network, X.

    The model is not available for consumers directly from Google, though.

    Instead, Google partnered with another company, Browserbase, founded by former Twilio engineer Paul Klein in early 2024, which offers virtual "headless" web browser specifically for use by AI agents and applications. (A "headless" browser is one that doesn't require a graphical user interface, or GUI, to navigate the web, though in this case and others, Browserbase does show a graphical representation for the user).

    Users can demo the new Gemini 2.5 Computer Use model directly on Browserbase here and even compare it side-by-side with the older, rival offerings from OpenAI and Anthropic in a new "Browser Arena" launched by the startup (though only one additional model can be selected alongside Gemini at a time).

    For AI builders and developers, it's being made as a raw, albeit propreitary LLM through the Gemini API in Google AI Studio for rapid prototyping, and Google Cloud's Vertex AI model selector and applications building platform.

    The new offering builds on the capabilities of Gemini 2.5 Pro, released back in March 2025 but which has been updated significantly several times since then, with a specific focus on enabling AI agents to perform direct interactions with user interfaces, including browsers and mobile applications.

    Overall, it appears Gemini 2.5 Computer Use is designed to let developers create agents that can complete interface-driven tasks autonomously — such as clicking, typing, scrolling, filling out forms, and navigating behind login screens.

    Rather than relying solely on APIs or structured inputs, this model allows AI systems to interact with software visually and functionally, much like a human would.

    Brief User Hands-On Tests

    In my brief, unscientific initial hands-on tests on the Browserbase website, Gemini 2.5 Computer Use successfully navigate to Taylor Swift's official website as instructed and provided me a summary of what was being sold or promoted at the top — a special edition of her newest album, "The Life of A Showgirl."

    In another test, I asked Gemini 2.5 Computer Use to search Amazon for highly rated and well-reviewed solar lights I could stake into my back yard, and I was delighted to watch as it successfully completed a Google Search Captcha designed to weed out non-human users ("Select all the boxes with a motorcycle.") It did so in a matter of seconds.

    However, once it got through there, it stalled and was unable to complete the task, despite serving up a "task competed" message.

    I should also note here that while the ChatGPT agent from OpenAI and Anthropic's Claude can create and edit local files — such as PowerPoint presentations, spreadsheets, or text documents — on the user’s behalf, Gemini 2.5 Computer Use does not currently offer direct file system access or native file creation capabilities.

    Instead, it is designed to control and navigate web and mobile user interfaces through actions like clicking, typing, and scrolling. Its output is limited to suggested UI actions or chatbot-style text responses; any structured output like a document or file must be handled separately by the developer, often through custom code or third-party integrations.

    Performance Benchmarks

    Google says Gemini 2.5 Computer Use has demonstrated leading results in multiple interface control benchmarks, particularly when compared to other major AI systems including Claude Sonnet and OpenAI’s agent-based models.

    Evaluations were conducted via Browserbase and Google’s own testing.

    Some highlights include:

    • Online-Mind2Web (Browserbase): 65.7% for Gemini 2.5 vs. 61.0% (Claude Sonnet 4) and 44.3% (OpenAI Agent)

    • WebVoyager (Browserbase): 79.9% for Gemini 2.5 vs. 69.4% (Claude Sonnet 4) and 61.0% (OpenAI Agent)

    • AndroidWorld (DeepMind): 69.7% for Gemini 2.5 vs. 62.1% (Claude Sonnet 4); OpenAI's model could not be measured due to lack of access

    • OSWorld: Currently not supported by Gemini 2.5; top competitor result was 61.4%

    In addition to strong accuracy, Google reports that the model operates at lower latency than other browser control solutions — a key factor in production use cases like UI automation and testing.

    How It Works

    Agents powered by the Computer Use model operate within an interaction loop. They receive:

    • A user task prompt

    • A screenshot of the interface

    • A history of past actions

    The model analyzes this input and produces a recommended UI action, such as clicking a button or typing into a field.

    If needed, it can request confirmation from the end user for riskier tasks, such as making a purchase.

    Once the action is executed, the interface state is updated and a new screenshot is sent back to the model. The loop continues until the task is completed or halted due to an error or a safety decision.

    The model uses a specialized tool called computer_use, and it can be integrated into custom environments using tools like Playwright or via the Browserbase demo sandbox.

    Use Cases and Adoption

    According to Google, teams internally and externally have already started using the model across several domains:

    • Google’s payments platform team reports that Gemini 2.5 Computer Use successfully recovers over 60% of failed test executions, reducing a major source of engineering inefficiencies.

    • Autotab, a third-party AI agent platform, said the model outperformed others on complex data parsing tasks, boosting performance by up to 18% in their hardest evaluations.

    • Poke.com, a proactive AI assistant provider, noted that the Gemini model often operates 50% faster than competing solutions during interface interactions.

    The model is also being used in Google’s own product development efforts, including in Project Mariner, the Firebase Testing Agent, and AI Mode in Search.

    Safety Measures

    Because this model directly controls software interfaces, Google emphasizes a multi-layered approach to safety:

    • A per-step safety service inspects every proposed action before execution.

    • Developers can define system-level instructions to block or require confirmation for specific actions.

    • The model includes built-in safeguards to avoid actions that might compromise security or violate Google’s prohibited use policies.

    For example, if the model encounters a CAPTCHA, it will generate an action to click the checkbox but flag it as requiring user confirmation, ensuring the system does not proceed without human oversight.

    Technical Capabilities

    The model supports a wide array of built-in UI actions such as:

    • click_at, type_text_at, scroll_document, drag_and_drop, and more

    • User-defined functions can be added to extend its reach to mobile or custom environments

    • Screen coordinates are normalized (0–1000 scale) and translated back to pixel dimensions during execution

    It accepts image and text input and outputs text responses or function calls to perform tasks. The recommended screen resolution for optimal results is 1440×900, though it can work with other sizes.

    API Pricing Remains Almost Identical to Gemini 2.5 Pro

    The pricing for Gemini 2.5 Computer Use aligns closely with the standard Gemini 2.5 Pro model. Both follow the same per-token billing structure: input tokens are priced at $1.25 per one million tokens for prompts under 200,000 tokens, and $2.50 per million tokens for prompts longer than that.

    Output tokens follow a similar split, priced at $10.00 per million for smaller responses and $15.00 for larger ones.

    Where the models diverge is in availability and additional features.

    Gemini 2.5 Pro includes a free tier that allows developers to use the model at no cost, with no explicit token cap published, though usage may be subject to rate limits or quota constraints depending on the platform (e.g. Google AI Studio).

    This free access includes both input and output tokens. Once developers exceed their allotted quota or switch to the paid tier, standard per-token pricing applies.

    In contrast, Gemini 2.5 Computer Use is available exclusively through the paid tier. There is no free access currently offered for this model, and all usage incurs token-based charges from the outset.

    Feature-wise, Gemini 2.5 Pro supports optional capabilities like context caching (starting at $0.31 per million tokens) and grounding with Google Search (free for up to 1,500 requests per day, then $35 per 1,000 additional requests). These are not available for Computer Use at this time.

    Another distinction is in data handling: output from the Computer Use model is not used to improve Google products in the paid tier, while free-tier usage of Gemini 2.5 Pro contributes to model improvement unless explicitly opted out.

    Overall, developers can expect similar token-based costs across both models, but they should consider tier access, included capabilities, and data use policies when deciding which model fits their needs.