Tag: Machine Learning

  • Google releases new AI video model Veo 3.1 in Flow and API: what it means for enterprises

    As expected after days of leaks and rumors online, Google has unveiled Veo 3.1, its latest AI video generation model, bringing a suite of creative and technical upgrades aimed at improving narrative control, audio integration, and realism in AI-generated video.

    While the updates expand possibilities for hobbyists and content creators using Google’s online AI creation app, Flow, the release also signals a growing opportunity for enterprises, developers, and creative teams seeking scalable, customizable video tools.

    The quality is higher, the physics are better, the pricing is unchanged, and the control and editing features are more robust and varied.

    My initial tests showed it to be a powerful and performant model that immediately delights with each generation. However, its default look is more cinematic, polished, and a little more "artificial" than rivals such as OpenAI's new Sora 2, released late last month, which may or may not be what a particular user is after (Sora excels at handheld, "candid"-style videos).

    Expanded Control Over Narrative and Audio

    Veo 3.1 builds on its predecessor, Veo 3 (released back in May 2025), with enhanced support for dialogue, ambient sound, and other audio effects.

    Native audio generation is now available across several key features in Flow, including “Frames to Video,” “Ingredients to Video,” and “Extend,” which give users the ability to, respectively: turn still images into video; use items, characters, and objects from multiple images in a single video; and generate clips longer than the initial 8 seconds, stretching to more than 30 seconds or even a minute or more when continuing from a prior clip's final frame.

    Before, you had to add audio manually after using these features.

    This addition gives users greater command over tone, emotion, and storytelling — capabilities that have previously required post-production work.

    In enterprise contexts, this level of control may reduce the need for separate audio pipelines, offering an integrated way to create training content, marketing videos, or digital experiences with synchronized sound and visuals.

    Google noted in a blog post that the updates reflect user feedback calling for deeper artistic control and improved audio support. Gallegos emphasizes the importance of making edits and refinements possible directly in Flow, without reworking scenes from scratch.

    Richer Inputs and Editing Capabilities

    With Veo 3.1, Google introduces support for multiple input types and more granular control over generated outputs. The model accepts text prompts, images, and video clips as input, and also supports:

    • Reference images (up to three) to guide appearance and style in the final output

    • First and last frame interpolation to generate seamless scenes between fixed endpoints

    • Scene extension that continues a video’s action or motion beyond its current duration

    These tools aim to give enterprise users a way to fine-tune the look and feel of their content—useful for brand consistency or adherence to creative briefs.

    Additional capabilities like “Insert” (add objects to scenes) and “Remove” (delete elements or characters) are also being introduced, though not all are immediately available through the Gemini API.

    Deployment Across Platforms

    Veo 3.1 is accessible through several of Google’s existing AI services:

    • Flow, Google’s own interface for AI-assisted filmmaking

    • Gemini API, targeted at developers building video capabilities into applications

    • Vertex AI, where enterprise integration will soon support Veo’s “Scene Extension” and other key features

    Availability through these platforms allows enterprise customers to choose the right environment—GUI-based or programmatic—based on their teams and workflows.
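
    For developer teams taking the Gemini API route, a Veo generation request is a long-running operation that is polled until the clip is ready. The sketch below assumes the google-genai Python SDK and a hypothetical Veo 3.1 preview model id; check Google's current documentation for the exact model name and available options.

    ```python
    import time
    from google import genai

    client = genai.Client(api_key="YOUR_API_KEY")  # Veo access requires the paid tier

    # The model id here is a placeholder; preview model names change between releases.
    operation = client.models.generate_videos(
        model="veo-3.1-generate-preview",
        prompt="A slow dolly shot through a rain-soaked neon market at night",
    )

    # Video generation is asynchronous: poll the long-running operation until it completes.
    while not operation.done:
        time.sleep(10)
        operation = client.operations.get(operation)

    # Download the first generated clip and save it locally.
    video = operation.response.generated_videos[0]
    client.files.download(file=video.video)
    video.video.save("veo_clip.mp4")
    ```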

    Pricing and Access

    The Veo 3.1 model is currently in preview and available only on the paid tier of the Gemini API. The cost structure is the same as Veo 3, the preceding generation of AI video models from Google.

    • Standard model: $0.40 per second of video

    • Fast model: $0.15 per second

    There is no free tier, and users are charged only if a video is successfully generated. This pricing model is consistent with previous Veo versions and provides predictable costs for budget-conscious enterprise teams.
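
    At those per-second rates, budgeting is straightforward multiplication. The snippet below is only an illustration using the published preview rates, estimating, for example, a batch of one hundred 8-second clips at each tier.

    ```python
    # Published Veo 3.1 preview rates (USD per second of generated video)
    RATES = {"standard": 0.40, "fast": 0.15}

    def estimate_cost(seconds_per_clip: float, clips: int, tier: str = "standard") -> float:
        """Estimated spend for a batch of clips; charges apply only to successful generations."""
        return RATES[tier] * seconds_per_clip * clips

    print(estimate_cost(8, 100, "standard"))  # 320.0 -> $320 for 100 standard 8-second clips
    print(estimate_cost(8, 100, "fast"))      # 120.0 -> $120 for the same batch on the fast model
    ```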

    Technical Specs and Output Control

    Veo 3.1 outputs video at 720p or 1080p resolution, with a 24 fps frame rate.

    Duration options include 4, 6, or 8 seconds from a text prompt or uploaded images, with the ability to extend videos up to 148 seconds (nearly two and a half minutes) when using the “Extend” feature.

    New functionality also includes tighter control over subjects and environments. For example, enterprises can upload a product image or visual reference, and Veo 3.1 will generate scenes that preserve its appearance and stylistic cues across the video. This could streamline creative production pipelines for retail, advertising, and virtual content production teams.

    Initial Reactions

    The broader creator and developer community has responded to Veo 3.1’s launch with a mix of optimism and tempered critique—particularly when comparing it to rival models like OpenAI’s Sora 2.

    Matt Shumer, co-founder of AI startup Otherside AI (maker of HyperWrite) and an early adopter, described his initial reaction as “disappointment,” noting that Veo 3.1 is “noticeably worse than Sora 2” and also “quite a bit more expensive.”

    However, he acknowledged that Google’s tooling—such as support for references and scene extension—is a bright spot in the release.

    Travis Davids, a 3D digital artist and AI content creator, echoed some of that sentiment. While he noted improvements in audio quality, particularly in sound effects and dialogue, he raised concerns about limitations that remain in the system.

    These include the lack of custom voice support, an inability to select generated voices directly, and the continued cap at 8-second generations—despite some public claims about longer outputs.

    Davids also pointed out that character consistency across changing camera angles still requires careful prompting, whereas other models like Sora 2 handle this more automatically. He questioned the absence of 1080p resolution for users on paid tiers like Flow Pro and expressed skepticism over feature parity.

    On the more positive end, @kimmonismus, an AI newsletter writer, stated that “Veo 3.1 is amazing,” though still concluded that OpenAI’s latest model remains preferable overall.

    Collectively, these early impressions suggest that while Veo 3.1 delivers meaningful tooling enhancements and new creative control features, expectations have shifted as competitors raise the bar on both quality and usability.

    Adoption and Scale

    Since launching Flow five months ago, Google says over 275 million videos have been generated across various Veo models.

    The pace of adoption suggests significant interest not only from individuals but also from developers and businesses experimenting with automated content creation.

    Thomas Iljic, Director of Product Management at Google Labs, highlights that Veo 3.1’s release brings capabilities closer to how human filmmakers plan and shoot. These include scene composition, continuity across shots, and coordinated audio—all areas that enterprises increasingly look to automate or streamline.

    Safety and Responsible AI Use

    Videos generated with Veo 3.1 are watermarked using Google’s SynthID technology, which embeds an imperceptible identifier to signal that the content is AI-generated.

    Google applies safety filters and moderation across its APIs to help minimize privacy and copyright risks. Generated content is stored temporarily and deleted after two days unless downloaded.

    For developers and enterprises, these features provide reassurance around provenance and compliance—critical in regulated or brand-sensitive industries.

    Where Veo 3.1 Stands Among a Crowded AI Video Model Space

    Veo 3.1 is not just an iteration on prior models—it represents a deeper integration of multimodal inputs, storytelling control, and enterprise-level tooling. While creative professionals may see immediate benefits in editing workflows and fidelity, businesses exploring automation in training, advertising, or virtual experiences may find even greater value in the model’s composability and API support.

    The early user feedback highlights that while Veo 3.1 offers valuable tooling, expectations around realism, voice control, and generation length are evolving rapidly. As Google expands access through Vertex AI and continues refining Veo, its competitive positioning in enterprise video generation will hinge on how quickly these user pain points are addressed.

  • Anthropic is giving away its powerful Claude Haiku 4.5 AI for free to take on OpenAI

    Anthropic released Claude Haiku 4.5 on Wednesday, a smaller and significantly cheaper artificial intelligence model that matches the coding capabilities of systems that were considered cutting-edge just months ago, marking the latest salvo in an intensifying competition to dominate enterprise AI.

    The model costs $1 per million input tokens and $5 per million output tokens — roughly one-third the price of Anthropic's mid-sized Sonnet 4 model released in May, while operating more than twice as fast. In certain tasks, particularly operating computers autonomously, Haiku 4.5 actually surpasses its more expensive predecessor.

    "Haiku 4.5 is a clear leap in performance and is now largely as smart as Sonnet 4 while being significantly faster and one-third of the cost," an Anthropic spokesperson told VentureBeat, underscoring how rapidly AI capabilities are becoming commoditized as the technology matures.

    The launch comes just two weeks after Anthropic released Claude Sonnet 4.5, which the company bills as the world's best coding model, and two months after introducing Opus 4.1. The breakneck pace of releases reflects mounting pressure from OpenAI, whose $500 billion valuation dwarfs Anthropic's $183 billion, and which has inked a series of multibillion-dollar infrastructure deals while expanding its product lineup.

    How free access to advanced AI could reshape the enterprise market

    In an unusual move that could reshape competitive dynamics in the AI market, Anthropic is making Haiku 4.5 available for all free users of its Claude.ai platform. The decision effectively democratizes access to what the company characterizes as "near-frontier-level intelligence" — capabilities that would have been available only in expensive, premium models months ago.

    "The launch of Claude Haiku 4.5 means that near-frontier-level intelligence is available for free to all users through Claude.ai," the Anthropic spokesperson told VentureBeat. "It also offers significant advantages to our enterprise customers: Sonnet 4.5 can handle frontier planning while Haiku 4.5 powers sub-agents, enabling multi-agent systems that tackle complex refactors, migrations, and large features builds with speed and quality."

    This multi-agent architecture signals a significant shift in how AI systems are deployed. Rather than relying on a single, monolithic model, enterprises can now orchestrate teams of specialized AI agents: a more sophisticated Sonnet 4.5 model breaking down complex problems and delegating subtasks to multiple Haiku 4.5 agents working in parallel. For software development teams, this could mean Sonnet 4.5 plans a major code refactoring while Haiku 4.5 agents simultaneously execute changes across dozens of files.

    The approach mirrors how human organizations distribute work, and could prove particularly valuable for enterprises seeking to balance performance with cost efficiency — a critical consideration as AI deployment scales.
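
    As a concrete sketch of that planner/worker split, the snippet below uses the Anthropic Python SDK to have a Sonnet-class model break a refactor into subtasks and hand each one to a cheaper Haiku-class worker. The model id strings are assumptions and should be checked against Anthropic's current model list; the subtasks run sequentially here for brevity, though in practice they would be parallelized.

    ```python
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    PLANNER = "claude-sonnet-4-5"  # assumed model ids; confirm against Anthropic's docs
    WORKER = "claude-haiku-4-5"

    def ask(model: str, prompt: str) -> str:
        """Single-turn helper around the Messages API."""
        message = client.messages.create(
            model=model,
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        return message.content[0].text

    # 1) The larger model plans: one independent subtask per line.
    plan = ask(PLANNER, "Break this refactor into independent subtasks, one per line: "
                        "rename the `User` model to `Account` across the service.")

    # 2) Fast, cheap workers execute each subtask.
    results = [ask(WORKER, f"Carry out this subtask and describe the resulting diff:\n{subtask}")
               for subtask in plan.splitlines() if subtask.strip()]

    print(f"{len(results)} subtasks completed")
    ```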

    Inside Anthropic's path to $7 billion in annual revenue

    The model launch coincides with revelations that Anthropic's business is experiencing explosive growth. The company's annual revenue run rate is approaching $7 billion this month, Anthropic told Reuters, up from more than $5 billion reported in August. Internal projections obtained by Reuters suggest the company is targeting between $20 billion and $26 billion in annualized revenue for 2026, representing growth of more than 200% to nearly 300%.

    The company now serves more than 300,000 business customers, with enterprise products accounting for approximately 80% of revenue. Among Anthropic's most successful offerings is Claude Code, a code-generation tool that has reached nearly $1 billion in annualized revenue since launching earlier this year.

    Those numbers come as artificial intelligence enters what many in the industry characterize as a critical inflection point. After two years of what Anthropic Chief Product Officer Mike Krieger recently described as "AI FOMO" — where companies adopted AI tools without clear success metrics — enterprises are now demanding measurable returns on investment.

    "The best products can be grounded in some kind of success metric or evaluation," Krieger said on the "Superhuman AI" podcast. "I've seen that a lot in talking to companies that are deploying AI."

    For enterprises evaluating AI tools, the calculus increasingly centers on concrete productivity gains. Google CEO Sundar Pichai claimed in June that AI had generated a 10% boost in engineering velocity at his company — though measuring such improvements across different roles and use cases remains challenging, as Krieger acknowledged.

    Why AI safety testing matters more than ever for enterprise adoption

    Anthropic's launch comes amid heightened scrutiny of the company's approach to AI safety and regulation. On Tuesday, David Sacks, the White House's AI "czar" and a venture capitalist, accused Anthropic of "running a sophisticated regulatory capture strategy based on fear-mongering" that is "damaging the startup ecosystem."

    The attack targeted remarks by Jack Clark, Anthropic's British co-founder and head of policy, who had described being "deeply afraid" of AI's trajectory. Clark told Bloomberg he found Sacks' criticism "perplexing."

    Anthropic addressed such concerns head-on in its release materials, emphasizing that Haiku 4.5 underwent extensive safety testing. The company classified the model as ASL-2 — its AI Safety Level 2 standard — compared to the more restrictive ASL-3 designation for the more powerful Sonnet 4.5 and Opus 4.1 models.

    "Our teams have red-teamed and tested our agentic capabilities to the limits in order to assess whether it can be used to engage in harmful activity like generating misinformation or promoting fraudulent behavior like scams," the spokesperson told VentureBeat. "In our automated alignment assessment, it showed a statistically significantly lower overall rate of misaligned behaviors than both Claude Sonnet 4.5 and Claude Opus 4.1 — making it, by this metric, our safest model yet."

    The company said its safety testing showed Haiku 4.5 poses only limited risks regarding the production of chemical, biological, radiological and nuclear weapons. Anthropic has also implemented classifiers designed to detect and filter prompt injection attacks, a common method for attempting to manipulate AI systems into producing harmful content.

    The emphasis on safety reflects Anthropic's founding mission. The company was established in 2021 by former OpenAI executives, including siblings Dario and Daniela Amodei, who left amid concerns about OpenAI's direction following its partnership with Microsoft. Anthropic has positioned itself as taking a more cautious, research-oriented approach to AI development.

    Benchmark results show Haiku 4.5 competing with larger, more expensive models

    According to Anthropic's benchmarks, Haiku 4.5 performs competitively with or exceeds several larger models across multiple evaluation criteria. On SWE-bench Verified, a widely used test measuring AI systems' ability to solve real-world software engineering problems, Haiku 4.5 scored 73.3% — slightly ahead of Sonnet 4's 72.7% and close to GPT-5 Codex's 74.5%.

    The model demonstrated particular strength in computer use tasks, achieving 50.7% on the OSWorld benchmark compared to Sonnet 4's 42.2%. This capability allows the AI to interact directly with computer interfaces — clicking buttons, filling forms, navigating applications — which could prove transformative for automating routine digital tasks.

    In coding-specific benchmarks like Terminal-Bench, which tests AI agents' ability to complete complex software tasks using command-line tools, Haiku 4.5 scored 41.0%, trailing only Sonnet 4.5's 50.0% among Claude models.

    The model maintains a 200,000-token context window for standard users, with developers accessing the Claude Developer Platform able to use a 1-million-token context window. That expanded capacity means the model can process extremely large codebases or documents in a single request — roughly equivalent to a 1,500-page book.

    What three major AI model releases in two months say about the competition

    When asked about the rapid succession of model releases, the Anthropic spokesperson emphasized the company's focus on execution rather than competitive positioning.

    "We're focused on shipping the best possible products for our customers — and our shipping velocity speaks for itself," the spokesperson said. "What was state-of-the-art just five months ago is now faster, cheaper, and more accessible."

    That velocity stands in contrast to the company's earlier, more measured release schedule. Anthropic appeared to have paused development of its Haiku line after releasing version 3.5 at the end of last year, leading some observers to speculate the company had deprioritized smaller models.

    That rapid price-performance improvement validates a core promise of artificial intelligence: that capabilities will become dramatically cheaper over time as the technology matures and companies optimize their models. For enterprises, it suggests that today's budget constraints around AI deployment may ease considerably in coming years.

    From customer service to code: Real-world applications for faster, cheaper AI

    The practical applications of Haiku 4.5 span a wide range of enterprise functions, from customer service to financial analysis to software development. The model's combination of speed and intelligence makes it particularly suited for real-time, low-latency tasks like chatbot conversations and customer support interactions, where delays of even a few seconds can degrade user experience.

    In financial services, the multi-agent architecture enabled by pairing Sonnet 4.5 with Haiku 4.5 could transform how firms monitor markets and manage risk. Anthropic envisions Haiku 4.5 monitoring thousands of data streams simultaneously — tracking regulatory changes, market signals and portfolio risks — while Sonnet 4.5 handles complex predictive modeling and strategic analysis.

    For research organizations, the division of labor could compress timelines dramatically. Sonnet 4.5 might orchestrate a comprehensive analysis while multiple Haiku 4.5 agents parallelize literature reviews, data gathering and document synthesis across dozens of sources, potentially "compressing weeks of research into hours," according to Anthropic's use case descriptions.

    Several companies have already integrated Haiku 4.5 and reported positive results. Guy Gur-Ari, co-founder of coding startup Augment, said the model "hit a sweet spot we didn't think was possible: near-frontier coding quality with blazing speed and cost efficiency." In Augment's internal testing, Haiku 4.5 achieved 90% of Sonnet 4.5's performance while matching much larger models.

    Jeff Wang, CEO of Windsurf, another coding-focused startup, said Haiku 4.5 "is blurring the lines" on traditional trade-offs between speed, cost and quality. "It's a fast frontier model that keeps costs efficient and signals where this class of models is headed."

    Jon Noronha, co-founder of presentation software company Gamma, reported that Haiku 4.5 "outperformed our current models on instruction-following for slide text generation, achieving 65% accuracy versus 44% from our premium tier model — that's a game-changer for our unit economics."

    The price of progress: What plummeting AI costs mean for enterprise strategy

    For enterprises evaluating AI strategies, Haiku 4.5 presents both opportunity and challenge. The opportunity lies in accessing sophisticated AI capabilities at dramatically lower costs, potentially making viable entire categories of applications that were previously too expensive to deploy at scale.

    The challenge is keeping pace with a technology landscape that is evolving faster than most organizations can absorb. As Krieger noted in his recent podcast appearance, companies are moving beyond "AI FOMO" to demand concrete metrics and demonstrated value. But establishing those metrics and evaluation frameworks takes time — time that may be in short supply as competitors race ahead.

    The shift from single-model deployments to multi-agent architectures also requires new ways of thinking about AI systems. Rather than viewing AI as a monolithic assistant, enterprises must learn to orchestrate multiple specialized agents, each optimized for particular tasks — more akin to managing a team than operating a tool.

    The fundamental economics of AI are shifting with remarkable speed. Five months ago, Sonnet 4's capabilities commanded premium pricing and represented the cutting edge. Today, Haiku 4.5 delivers similar performance at a third of the cost. If that trajectory continues — and both Anthropic's release schedule and competitive pressure from OpenAI and Google suggest it will — the AI capabilities that seem remarkable today may be routine and inexpensive within a year.

    For Anthropic, the challenge will be translating technical achievements into sustainable business growth while maintaining the safety-focused approach that differentiates it from competitors. The company's projected revenue growth to as much as $26 billion by 2026 suggests strong market traction, but achieving those targets will require continued innovation and successful execution across an increasingly complex product portfolio.

    Whether enterprises will choose Claude over increasingly capable alternatives from OpenAI, Google and a growing field of competitors remains an open question. But Anthropic is making a clear bet: that the future of AI belongs not to whoever builds the single most powerful model, but to whoever can deliver the right intelligence, at the right speed, at the right price — and make it accessible to everyone.

    In an industry where the promise of artificial intelligence has long outpaced reality, Anthropic is betting that delivering on that promise, faster and cheaper than anyone expected, will be enough to win. And with pricing dropping by two-thirds in just five months while performance holds steady, that promise is starting to look like reality.

  • Visa just launched a protocol to secure the AI shopping boom — here’s what it means for merchants

    Visa is introducing a new security framework designed to solve one of the thorniest problems emerging in artificial intelligence-powered commerce: how retailers can tell the difference between legitimate AI shopping assistants and the malicious bots that plague their websites.

    The payments giant unveiled its Trusted Agent Protocol on Tuesday, establishing what it describes as foundational infrastructure for "agentic commerce" — a term for the rapidly growing practice of consumers delegating shopping tasks to AI agents that can search products, compare prices, and complete purchases autonomously.

    The protocol enables merchants to cryptographically verify that an AI agent browsing their site is authorized and trustworthy, rather than a bot designed to scrape pricing data, test stolen credit cards, or carry out other fraudulent activities.

    The launch comes as AI-driven traffic to U.S. retail websites has exploded by more than 4,700% over the past year, according to data from Adobe cited by Visa. That dramatic surge has created an acute challenge for merchants whose existing bot detection systems — designed to block automated traffic — now risk accidentally blocking legitimate AI shoppers along with bad actors.

    "Merchants need additional tools that provide them with greater insight and transparency into agentic commerce activities to ensure they can participate safely," said Rubail Birwadker, Visa's Global Head of Growth, in an exclusive interview with VentureBeat. "Without common standards, potential risks include ecosystem fragmentation and the proliferation of closed loop models."

    The stakes are substantial. While 85% of shoppers who have used AI to shop report improved experiences, merchants face the prospect of either turning away legitimate AI-powered customers or exposing themselves to sophisticated bot attacks. Visa's own data shows the company prevented $40 billion in fraudulent activity between October 2022 and September 2023, nearly double the previous year, much of it involving AI-powered enumeration attacks where bots systematically test combinations of card numbers until finding valid credentials.

    Inside the cryptographic handshake: How Visa verifies AI shopping agents

    Visa's Trusted Agent Protocol operates through what Birwadker describes as a "cryptographic trust handshake" between merchants and approved AI agents. The system works in three steps:

    First, AI agents must be approved and onboarded through Visa's Intelligent Commerce program, where they undergo vetting to meet trust and reliability standards. Each approved agent receives a unique digital signature key — essentially a cryptographic credential that proves its identity.

    When an approved agent visits a merchant's website, it creates a digital signature using its key and transmits three categories of information: Agent Intent (indicating the agent is trusted and intends to retrieve product details or make a purchase), Consumer Recognition (data showing whether the underlying consumer has an existing account with the merchant), and Payment Information (optional payment data to support checkout).

    Merchants or their infrastructure providers, such as content delivery networks, then validate these digital signatures against Visa's registry of approved agents. "Upon proper validation of these fields, the merchant can confirm the signature is a trusted agent," Birwadker explained.

    Crucially, Visa designed the protocol to require minimal changes to existing merchant infrastructure. Built on the HTTP Message Signatures standard and aligned with Web Bot Auth, the protocol works with existing web infrastructure without requiring merchants to overhaul their checkout pages. "This is no-code functionality," Birwadker emphasized, though merchants may need to integrate with Visa's Developer Center to access the verification system.
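
    To make the handshake concrete, here is an illustrative sketch of the sign-and-verify pattern that signature-based agent protocols rely on: the agent signs its declared intent with a private key issued at onboarding, and the merchant validates that signature against the public key held in the registry. This is not Visa's published specification; the field names and key handling are simplified for illustration.

    ```python
    import json
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
    from cryptography.exceptions import InvalidSignature

    # Onboarding (done once by the protocol operator): issue the agent a signing key.
    agent_key = Ed25519PrivateKey.generate()
    registry = {"agent-123": agent_key.public_key()}  # merchant-side view of the approved-agent registry

    # Agent side: sign the declared intent before calling the merchant.
    payload = json.dumps({
        "agent_id": "agent-123",
        "intent": "retrieve_product_details",   # illustrative fields, not the real spec
        "consumer_ref": "existing-account-hash",
    }, sort_keys=True).encode()
    signature = agent_key.sign(payload)

    # Merchant side: verify the signature against the registry before trusting the traffic.
    try:
        registry["agent-123"].verify(signature, payload)
        print("Trusted agent: serve product details or proceed to checkout")
    except InvalidSignature:
        print("Unverified traffic: fall back to standard bot management")
    ```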

    The race for AI commerce standards: Visa faces competition from Google, OpenAI, and Stripe

    Visa developed the protocol in collaboration with Cloudflare, the web infrastructure and security company that already provides bot management services to millions of websites. The partnership reflects Visa's recognition that solving bot verification requires cooperation across the entire web stack, not just the payments layer.

    "Trusted Agent Protocol supplements traditional bot management by providing merchants insights that enable agentic commerce," Birwadker said. "Agents are providing additional context they otherwise would not, including what it intends to do, who the underlying consumer is, and payment information."

    The protocol arrives as multiple technology giants race to establish competing standards for AI commerce. Google recently introduced its Agent Payments Protocol (AP2), while OpenAI and Stripe have discussed their own approaches to enabling AI agents to make purchases. Microsoft, Shopify, Adyen, Ant International, Checkout.com, Cybersource, Elavon, Fiserv, Nuvei, and Worldpay provided feedback during Trusted Agent Protocol's development, according to Visa.

    When asked how Visa's protocol relates to these competing efforts, Birwadker struck a collaborative tone. "Both Google's AP2 and Visa's Trusted Agent Protocol are working toward the same goal of building trust in agent-initiated payments," he said. "We are engaged with Google, OpenAI, and Stripe and are looking to create compatibility across the ecosystem."

    Visa says it is working with global standards bodies including the Internet Engineering Task Force (IETF), OpenID Foundation, and EMVCo to ensure the protocol can eventually become interoperable with other emerging standards. "While these specifications apply to the Visa network in this initial phase, enabling agents to safely and securely act on a consumer's behalf requires an open, ecosystem-wide approach," Birwadker noted.

    Who pays when AI agents go rogue? Unanswered questions about liability and authorization

    The protocol raises important questions about authorization and liability when AI agents make purchases on behalf of consumers. If an agent completes an unauthorized transaction — perhaps misunderstanding a user's intent or exceeding its delegated authority — who bears responsibility?

    Birwadker emphasized that the protocol helps merchants "leverage this information to enable experiences tied to existing consumer relationships and more secure checkout," but he did not provide specific details about how disputes would be handled when agents make unauthorized purchases. Visa's existing fraud protection and chargeback systems would presumably apply, though the company has not yet published detailed guidance on agent-initiated transaction disputes.

    The protocol also places Visa in the position of gatekeeper for the emerging agentic commerce ecosystem. Because Visa determines which AI agents get approved for the Intelligent Commerce program and receive cryptographic credentials, the company effectively controls which agents merchants can easily trust. "Agents are approved and onboarded through the Visa Intelligent Commerce program, ensuring they meet our standards for trust and reliability," Birwadker said, though he did not detail the specific criteria agents must meet or whether Visa charges fees for approval.

    This gatekeeping role could prove contentious, particularly if Visa's approval process favors large technology companies over startups, or if the company faces pressure to block agents from competitors or politically controversial entities. Visa declined to provide details about how many agents it has approved so far or how long the vetting process typically takes.

    Visa's legal battles and the long road to merchant adoption

    The protocol launch comes at a complex moment for Visa, which continues to navigate significant legal and regulatory challenges even as its core business remains robust. The company's latest earnings report for the third quarter of fiscal year 2025 showed a 10% increase in net revenues to $9.2 billion, driven by resilient consumer spending and strong growth in cross-border transaction volume. For the full fiscal year ending September 30, 2024, Visa processed 289 billion transactions, with a total payments volume of $15.2 trillion.

    However, the company's legal headwinds have intensified. In July 2025, a federal judge rejected a landmark $30 billion settlement that Visa and Mastercard had reached with merchants over long-disputed credit card swipe fees, sending the parties back to the negotiating table and extending the long-running legal battle.

    Simultaneously, Visa remains under investigation by the Department of Justice over its rules for routing debit card transactions, with regulators scrutinizing whether the company's practices unlawfully limit merchant choice and stifle competition. These domestic challenges are mirrored abroad, where European regulators have continued their own antitrust investigations into the fee structures of both Visa and its primary competitor, Mastercard.

    Against this backdrop of regulatory pressure, Birwadker acknowledged that adoption of the Trusted Agent Protocol will take time. "As agentic commerce continues to rise, we recognize that consumer trust is still in its early stages," he said. "That's why our focus through 2025 is on building foundational credibility and demonstrating real-world value."

    The protocol is available immediately in Visa's Developer Center and on GitHub, with agent onboarding already active and merchant integration resources available. But Birwadker declined to provide specific targets for how many merchants might adopt the protocol by the end of 2026. "Adoption is aligned with the momentum we're already seeing," he said. "The launch of our protocol marks another big step — it's not just a technical milestone, but a signal that the industry is beginning to unify."

    Industry analysts say merchant adoption will likely depend on how quickly agentic commerce grows as a percentage of overall e-commerce. While AI-driven traffic has surged dramatically, much of that consists of agents browsing and researching rather than completing purchases. If AI agents begin accounting for a significant share of completed transactions, merchants will face stronger incentives to adopt verification systems like Visa's protocol.

    From fraud detection to AI gatekeeping: Visa's $10 billion bet on artificial intelligence

    Visa's move reflects broader strategic bets on AI across the financial services industry. The company has invested $10 billion in technology over the past five years to reduce fraud and increase network security, with AI and machine learning central to those efforts. Visa's fraud detection system analyzes over 500 different attributes for each transaction, using AI models to assign real-time risk scores to the 300 billion annual transactions flowing through its network.

    "Every single one of those transactions has been processed by AI," James Mirfin, Visa's global head of risk and identity solutions, said in a July 2024 CNBC interview discussing the company's fraud prevention efforts. "If you see a new type of fraud happening, our model will see that, it will catch it, it will score those transactions as high risk and then our customers can decide not to approve those transactions."

    The company has also moved aggressively into new payment territories beyond its core card business. In January 2025, Visa partnered with Elon Musk's X (formerly Twitter) to provide the infrastructure for a digital wallet and peer-to-peer payment service called the X Money Account, competing with services like Venmo and Zelle. That deal marked Visa's first major partnership in the social media payments space and reflected the company's recognition that payment flows are increasingly happening outside traditional e-commerce channels.

    The agentic commerce protocol represents an extension of this strategy — an attempt to ensure Visa remains central to payment flows even as the mechanics of shopping shift from direct human interaction to AI intermediation. Jack Forestell, Visa's Chief Product & Strategy Officer, framed the protocol in expansive terms: "We believe the entire payments ecosystem has a responsibility to ensure sellers trust AI agents with the same confidence they place in their most valued customers and networks."

    The coming battle for control of AI shopping

    The real test for Visa's protocol won't be technical — it will be political. As AI agents become a larger force in retail, whoever controls the verification infrastructure controls access to hundreds of billions of dollars in commerce. Visa's position as gatekeeper gives it enormous leverage, but also makes it a target.

    Merchants chafing under Visa's existing fee structure and facing multiple antitrust investigations may resist ceding even more power to the payments giant. Competitors like Google and OpenAI, each with their own ambitions in commerce, have little incentive to let Visa dictate standards. Regulators already scrutinizing Visa's market dominance will surely examine whether its agent approval process unfairly advantages certain players.

    And there's a deeper question lurking beneath the technical specifications and corporate partnerships: In an economy increasingly mediated by AI, who decides which algorithms get to spend our money? Visa is making an aggressive bid to be that arbiter, wrapping its answer in the language of security and interoperability. Whether merchants, consumers, and regulators accept that proposition will determine not just the fate of the Trusted Agent Protocol, but the structure of AI-powered commerce itself.

    For now, Visa is moving forward with the confidence of a company that has weathered disruption before. But in the emerging world of agentic commerce, being too trusted might prove just as dangerous as not being trusted enough.

  • EAGLET boosts AI agent performance on longer-horizon tasks by generating custom plans

    2025 was supposed to be the year of "AI agents," according to Nvidia CEO Jensen Huang and other AI industry figures. And it has been, in many ways, with numerous leading AI model providers such as OpenAI, Google, and even Chinese competitors like Alibaba releasing fine-tuned AI models or applications designed to focus on a narrow set of tasks, such as web search and report writing.

    But one big hurdle to a future of highly performant, reliable AI agents remains: getting them to stay on task when the task extends over many steps. Third-party benchmark tests show that even the most powerful AI models experience higher failure rates the more steps they take to complete a task and the longer they spend on it (especially as tasks stretch into hours).

    A new academic framework called EAGLET proposes a practical and efficient method to improve long-horizon task performance in LLM-based agents — without the need for manual data labeling or retraining.

    Developed by researchers from Tsinghua University, Peking University, DeepLang AI, and the University of Illinois Urbana-Champaign, EAGLET offers a "global planner" that can be integrated into existing agent workflows to reduce hallucinations and improve task efficiency.

    EAGLET is a fine-tuned language model that interprets task instructions — typically provided as prompts by the user or the agent's operating environment — and generates a high-level plan for the agent (powered by its own LLM). It does not intervene during execution, but its up-front guidance helps reduce planning errors and improve task completion rates.

    Addressing the Planning Problem in Long-Horizon Agents

    Many LLM-based agents struggle with long-horizon tasks because they rely on reactive, step-by-step reasoning. This approach often leads to trial-and-error behavior, planning hallucinations, and inefficient trajectories.

    EAGLET tackles this limitation by introducing a global planning module that works alongside the executor agent.

    Instead of blending planning and action generation in a single model, EAGLET separates them, enabling more coherent, task-level strategies.
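
    In an agent loop, that separation amounts to one extra call: the planner is invoked once up front to produce a global plan, which is then included in the executor's context at every step. The sketch below is a generic illustration of the plan-then-execute pattern, with `call_planner`, `call_executor`, and `env` standing in for whatever LLM endpoints and environment a team already uses; it is not code from the EAGLET paper.

    ```python
    def run_agent(task: str, env, call_planner, call_executor, max_steps: int = 30):
        """Plan once with a global planner, then let the executor act step by step."""
        # 1) Global planner: produces a high-level plan, then stays out of the loop.
        plan = call_planner(f"Task: {task}\nWrite a concise high-level plan.")

        history = []
        for _ in range(max_steps):
            # 2) Executor: reasons step by step, conditioned on the fixed global plan.
            action = call_executor(
                f"Task: {task}\nGlobal plan:\n{plan}\n"
                f"History so far:\n{history}\nNext action:"
            )
            observation, done = env.step(action)
            history.append((action, observation))
            if done:
                break
        return history
    ```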

    A Two-Stage Training Pipeline with No Human Annotations

    EAGLET’s planner is trained using a two-stage process that requires no human-written plans or annotations.

    The first stage involves generating synthetic plans with high-capability LLMs, such as GPT-5 and DeepSeek-V3.1-Think.

    These plans are then filtered using a novel strategy called homologous consensus filtering, which retains only those that improve task performance for both expert and novice executor agents.

    In the second stage, a rule-based reinforcement learning process further refines the planner, using a custom-designed reward function to assess how much each plan helps multiple agents succeed.

    Introducing the Executor Capability Gain Reward (ECGR)

    One of EAGLET’s key innovations is the Executor Capability Gain Reward (ECGR).

    This reward measures the value of a generated plan by checking whether it helps both high- and low-capability agents complete tasks more successfully and with fewer steps.

    It also includes a decay factor to favor shorter, more efficient task trajectories. This approach avoids over-rewarding plans that are only useful to already-competent agents and promotes more generalizable planning guidance.
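
    The exact reward formula is not reproduced in this article, but the intuition can be approximated as follows: roll out each candidate plan with executors of different capability, reward the plan by how much it lifts their success, and discount longer trajectories. The sketch below is an illustrative approximation of that logic, not the authors' implementation.

    ```python
    def ecgr_like_reward(plan, executors, run_episode, gamma: float = 0.95) -> float:
        """Approximate Executor Capability Gain Reward: average success gain across
        executors of different capability, discounted by trajectory length."""
        total = 0.0
        for executor in executors:  # e.g. one expert and one novice executor
            # run_episode returns (success score, number of steps taken).
            success_with, steps = run_episode(executor, plan=plan)
            success_without, _ = run_episode(executor, plan=None)
            gain = success_with - success_without   # how much the plan actually helped
            total += gain * (gamma ** steps)        # decay favors shorter trajectories
        return total / len(executors)
    ```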

    Compatible with Existing Agents and Models

    The EAGLET planner is designed to be modular and "plug-and-play," meaning it can be inserted into existing agent pipelines without requiring executor retraining.

    In evaluations, the planner boosted performance across a variety of foundational models, including GPT-4.1, GPT-5, Llama-3.1, and Qwen2.5.

    It also proved effective regardless of prompting strategy, working well with standard ReAct-style prompts as well as approaches like Reflexion.

    State-of-the-Art Performance Across Benchmarks

    EAGLET was tested on three widely used benchmarks for long-horizon agent tasks: ScienceWorld, which simulates scientific experiments in a text-based lab environment; ALFWorld, which tasks agents with completing household activities through natural language in a simulated home setting; and WebShop, which evaluates goal-driven behavior in a realistic online shopping interface.

    Across all three, executor agents equipped with EAGLET outperformed their non-planning counterparts and other planning baselines, including MPO and KnowAgent.

    In experiments with the open source Llama-3.1-8B-Instruct model, EAGLET boosted average performance from 39.5 to 59.4, a +19.9 point gain across tasks.

    On ScienceWorld unseen scenarios, it raised performance from 42.2 to 61.6.

    In ALFWorld seen scenarios, EAGLET improved outcomes from 22.9 to 54.3, a more than 2.3× increase in performance.

    Even stronger gains were seen with more capable models.

    For instance, GPT-4.1 improved from 75.5 to 82.2 average score with EAGLET, and GPT-5 rose from 84.5 to 88.1, despite already being strong performers.

    In some benchmarks, performance gains were as high as +11.8 points, such as when combining EAGLET with the ETO executor method on ALFWorld unseen tasks.

    Compared to other planning baselines like MPO, EAGLET consistently delivered higher task completion rates. For example, on ALFWorld unseen tasks with GPT-4.1, MPO achieved 79.1, while EAGLET scored 83.6—a +4.5 point advantage.

    Additionally, the paper reports that agents using EAGLET complete tasks in fewer steps on average. With GPT-4.1 as executor, average step count dropped from 13.0 (no planner) to 11.1 (EAGLET). With GPT-5, it dropped from 11.4 to 9.4, supporting the claim of improved execution efficiency.

    Efficiency Gains in Training and Execution

    Compared to RL-based methods like GiGPO, which can require hundreds of training iterations, EAGLET achieved better or comparable results with roughly one-eighth the training effort.

    This efficiency also carries over into execution: agents using EAGLET typically needed fewer steps to complete tasks. This translates into reduced inference time and compute cost in production scenarios.

    No Public Code—Yet

    As of the version submitted to arXiv, the authors have not released an open-source implementation of EAGLET. It is unclear if or when the code will be released, under what license, or how it will be maintained, which may limit the near-term utility of the framework for enterprise deployment.

    VentureBeat has reached out to the authors to clarify these points and will update this piece when we hear back.

    Enterprise Deployment Questions Remain

    While the planner is described as plug-and-play, it remains unclear whether EAGLET can be easily integrated into popular enterprise agent frameworks such as LangChain or AutoGen, or if it requires a custom stack to support plan-execute separation.

    Similarly, the training setup leverages multiple executor agents, which may be difficult to replicate in enterprise environments with limited model access. VentureBeat has asked the researchers whether the homologous consensus filtering method can be adapted for teams that only have access to one executor model or limited compute resources.

    EAGLET’s authors report success across model types and sizes, but it is not yet known what the minimal viable model scale is for practical deployment. For example, can enterprise teams use the planner effectively with sub-10B parameter open models in latency-sensitive environments? Additionally, the framework may offer industry-specific value in domains like customer support or IT automation, but it remains to be seen how easily the planner can be fine-tuned or customized for such verticals.

    Real-Time vs. Pre-Generated Planning

    Another open question is how EAGLET is best deployed in practice. Should the planner operate in real-time alongside executors within a loop, or is it better used offline to pre-generate global plans for known task types? Each approach has implications for latency, cost, and operational complexity. VentureBeat has posed this question to the authors and will report any insights that emerge.

    Strategic Tradeoffs for Enterprise Teams

    For technical leaders at medium-to-large enterprises, EAGLET represents a compelling proof of concept for improving the reliability and efficiency of LLM agents. But without public tooling or implementation guidelines, the framework still presents a build-versus-wait decision. Enterprises must weigh the potential gains in task performance and efficiency against the costs of reproducing or approximating the training process in-house.

    Potential Use Cases in Enterprise Settings

    For enterprises developing agentic AI systems—especially in environments requiring stepwise planning, such as IT automation, customer support, or online interactions—EAGLET offers a template for how to incorporate planning without retraining. Its ability to guide both open- and closed-source models, along with its efficient training method, may make it an appealing starting point for teams seeking to improve agent performance with minimal overhead.

  • Self-improving language models are becoming reality with MIT’s updated SEAL technique

    Researchers at the Massachusetts Institute of Technology (MIT) are gaining renewed attention for developing and open sourcing a technique that allows large language models (LLMs) — like those underpinning ChatGPT and most modern AI chatbots — to improve themselves by generating synthetic data to fine-tune upon.

    The technique, known as SEAL (Self-Adapting LLMs), was first described in a paper published back in June and covered by VentureBeat at the time.

    A significantly expanded and updated version of the paper was released last month, along with open source code posted to GitHub (under an MIT License, allowing for commercial and enterprise usage), and it is making new waves among AI power users on the social network X this week.

    SEAL allows LLMs to autonomously generate and apply their own fine-tuning strategies. Unlike conventional models that rely on fixed external data and human-crafted optimization pipelines, SEAL enables models to evolve by producing their own synthetic training data and corresponding optimization directives.

    The development comes from a team affiliated with MIT’s Improbable AI Lab, including Adam Zweiger, Jyothish Pari, Han Guo, Ekin Akyürek, Yoon Kim, and Pulkit Agrawal. Their research was recently presented at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025).

    Background: From “Beyond Static AI” to Self-Adaptive Systems

    Earlier this year, VentureBeat first reported on SEAL as an early-stage framework that allowed language models to generate and train on their own synthetic data — a potential remedy for the stagnation of pretrained models once deployed.

    At that stage, SEAL was framed as a proof-of-concept that could let enterprise AI agents continuously learn in dynamic environments without manual retraining.

    Since then, the research has advanced considerably. The new version expands on the prior framework by demonstrating that SEAL’s self-adaptation ability scales with model size, integrates reinforcement learning more effectively to reduce catastrophic forgetting, and formalizes SEAL’s dual-loop structure (inner supervised fine-tuning and outer reinforcement optimization) for reproducibility.

    The updated paper also introduces evaluations across different prompting formats, improved stability during learning cycles, and a discussion of practical deployment challenges at inference time.

    Addressing the Limitations of Static Models

    While LLMs have demonstrated remarkable capabilities in text generation and understanding, their adaptation to new tasks or knowledge is often manual, brittle, or dependent on context.

    SEAL challenges this status quo by equipping models with the ability to generate what the authors call “self-edits” — natural language outputs that specify how the model should update its weights.

    These self-edits may take the form of reformulated information, logical implications, or tool configurations for augmentation and training. Once generated, the model fine-tunes itself based on these edits. The process is guided by reinforcement learning, where the reward signal comes from improved performance on a downstream task.

    The design mimics how human learners might rephrase or reorganize study materials to better internalize information. This restructuring of knowledge before assimilation serves as a key advantage over models that passively consume new data “as-is.”

    Performance Across Tasks

    SEAL has been tested across two main domains: knowledge incorporation and few-shot learning.

    In the knowledge incorporation setting, the researchers evaluated how well a model could internalize new factual content from passages similar to those in the SQuAD dataset, a benchmark reading comprehension dataset introduced by Stanford University in 2016, consisting of over 100,000 crowd-sourced question–answer pairs based on Wikipedia articles (Rajpurkar et al., 2016).

    Rather than fine-tuning directly on passage text, the model generated synthetic implications of the passage and then fine-tuned on them.

    After two rounds of reinforcement learning, the model improved question-answering accuracy from 33.5% to 47.0% on a no-context version of SQuAD — surpassing results obtained using synthetic data generated by GPT-4.1.

    In the few-shot learning setting, SEAL was evaluated using a subset of the ARC benchmark, where tasks require reasoning from only a few examples. Here, SEAL generated self-edits specifying data augmentations and hyperparameters.

    After reinforcement learning, the success rate in correctly solving held-out tasks jumped to 72.5%, up from 20% using self-edits generated without reinforcement learning. Models that relied solely on in-context learning without any adaptation scored 0%.

    Technical Framework

    SEAL operates using a two-loop structure: an inner loop performs supervised fine-tuning based on the self-edit, while an outer loop uses reinforcement learning to refine the policy that generates those self-edits.

    The reinforcement learning algorithm used is based on ReSTEM, which combines sampling with filtered behavior cloning. During training, only self-edits that lead to performance improvements are reinforced. This approach effectively teaches the model which kinds of edits are most beneficial for learning.

    For efficiency, SEAL applies LoRA-based fine-tuning rather than full parameter updates, enabling rapid experimentation and low-cost adaptation.
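
    Put together, a simplified view of the two loops looks like the following: sample several candidate self-edits per task, apply each as a cheap LoRA fine-tune, keep only the edits that improve the downstream score, and then train the self-edit policy on the survivors (the filtered behavior cloning at the heart of ReSTEM). This is a compressed pseudocode sketch, with `generate_self_edits`, `lora_finetune`, and `evaluate` passed in as stand-ins rather than the paper's actual implementation.

    ```python
    def seal_outer_loop(model, tasks, generate_self_edits, lora_finetune, evaluate,
                        n_candidates: int = 4, rounds: int = 2):
        """Outer RL loop (ReSTEM-style): reinforce only the self-edits that help."""
        for _ in range(rounds):
            kept = []
            for task in tasks:
                baseline = evaluate(model, task)
                # Inner loop: each candidate self-edit is applied as a LoRA update and scored.
                for edit in generate_self_edits(model, task, n_candidates):
                    candidate = lora_finetune(model, edit)
                    if evaluate(candidate, task) > baseline:   # filter: keep only useful edits
                        kept.append((task, edit))
            # Filtered behavior cloning: teach the model which kinds of edits pay off.
            model = lora_finetune(model, kept)
        return model
    ```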

    Strengths and Limitations

    The researchers report that SEAL can produce high-utility training data with minimal supervision, outperforming even large external models like GPT-4.1 in specific tasks.

    They also demonstrate that SEAL generalizes beyond its original setup: it continues to perform well when scaling from single-pass updates to multi-document continued pretraining scenarios.

    However, the framework is not without limitations. One issue is catastrophic forgetting, where updates to incorporate new information can degrade performance on previously learned tasks.

    In response to this concern, co-author Jyo Pari told VentureBeat via email that reinforcement learning (RL) appears to mitigate forgetting more effectively than standard supervised fine-tuning (SFT), citing a recent paper on the topic. He added that combining this insight with SEAL could lead to new variants where SEAL learns not just training data, but reward functions.

    Another challenge is computational overhead: evaluating each self-edit requires fine-tuning and performance testing, which can take 30–45 seconds per edit — significantly more than standard reinforcement learning tasks.

    As Jyo explained, “Training SEAL is non-trivial because it requires 2 loops of optimization, an outer RL one and an inner SFT one. At inference time, updating model weights will also require new systems infrastructure.” He emphasized the need for future research into deployment systems as a critical path to making SEAL practical.

    Additionally, SEAL’s current design assumes the presence of paired tasks and reference answers for every context, limiting its direct applicability to unlabeled corpora. However, Jyo clarified that as long as there is a downstream task with a computable reward, SEAL can be trained to adapt accordingly—even in safety-critical domains. In principle, a SEAL-trained model could learn to avoid training on harmful or malicious inputs if guided by the appropriate reward signal.

    AI Community Reactions

    The AI research and builder community has reacted with a mix of excitement and speculation to the SEAL paper. On X, formerly Twitter, several prominent AI-focused accounts weighed in on the potential impact.

    User @VraserX, a self-described educator and AI enthusiast, called SEAL “the birth of continuous self-learning AI” and predicted that models like OpenAI's GPT-6 could adopt a similar architecture.

    In their words, SEAL represents “the end of the frozen-weights era,” ushering in systems that evolve as the world around them changes.

    They highlighted SEAL's ability to form persistent memories, repair knowledge, and learn from real-time data, comparing it to a foundational step toward models that don’t just use information but absorb it.

    Meanwhile, @alex_prompter, co-founder of an AI-powered marketing venture, framed SEAL as a leap toward models that literally rewrite themselves. “MIT just built an AI that can rewrite its own code to get smarter,” he wrote. Citing the paper’s key results — a 40% boost in factual recall and outperforming GPT-4.1 using self-generated data — he described the findings as confirmation that “LLMs that finetune themselves are no longer sci-fi.”

    The enthusiasm reflects a broader appetite in the AI space for models that can evolve without constant retraining or human oversight — particularly in rapidly changing domains or personalized use cases.

    Future Directions and Open Questions

    In response to questions about scaling SEAL to larger models and tasks, Jyo pointed to experiments (Appendix B.7) showing that as model size increases, so does their self-adaptation ability. He compared this to students improving their study techniques over time — larger models are simply better at generating useful self-edits.

    When asked whether SEAL generalizes to new prompting styles, he confirmed it does, citing Table 10 in the paper. However, he also acknowledged that the team has not yet tested SEAL’s ability to transfer across entirely new domains or model architectures.

    “SEAL is an initial work showcasing the possibilities,” he said. “But it requires much more testing.” He added that generalization may improve as SEAL is trained on a broader distribution of tasks.

    Interestingly, the team found that only a few reinforcement learning steps already led to measurable performance gains. “This is exciting,” Jyo noted, “because it means that with more compute, we could hopefully get even more improvements.” He suggested future experiments could explore more advanced reinforcement learning methods beyond ReSTEM, such as Group Relative Policy Optimization (GRPO).

    Toward More Adaptive and Agentic Models

    SEAL represents a step toward models that can autonomously improve over time, both by integrating new knowledge and by reconfiguring how they learn. The authors envision future extensions where SEAL could assist in self-pretraining, continual learning, and the development of agentic systems — models that interact with evolving environments and adapt incrementally.

    In such settings, a model could use SEAL to synthesize weight updates after each interaction, gradually internalizing behaviors or insights. This could reduce the need for repeated supervision and manual intervention, particularly in data-constrained or specialized domains.

    As public web text becomes saturated and further scaling of LLMs becomes bottlenecked by data availability, self-directed approaches like SEAL could play a critical role in pushing the boundaries of what LLMs can achieve.

    You can access the SEAL project, including code and further documentation, at: https://jyopari.github.io/posts/seal

  • Researchers find that retraining only small parts of AI models can cut costs and prevent forgetting

    Enterprises often find that fine-tuning, an effective approach to making a large language model (LLM) fit for purpose and grounded in their data, comes at a cost: the model loses some of its abilities. After fine-tuning, some models “forget” how to perform tasks they had already learned. 

    Research from the University of Illinois Urbana-Champaign proposes a new method for retraining models that avoids “catastrophic forgetting,” in which the model loses some of its prior knowledge. The paper focuses on two vision-language models that generate responses from images: LLaVA and Qwen 2.5-VL.

    The approach encourages enterprises to retrain only narrow parts of an LLM to avoid retraining the entire model and incurring a significant increase in compute costs. The team claims that catastrophic forgetting isn’t true memory loss, but rather a side effect of bias drift. 

    “Training a new LMM can cost millions of dollars, weeks of time, and emit hundreds of tons of CO2, so finding ways to more efficiently and effectively update existing models is a pressing concern,” the team wrote in the paper. “Guided by this result, we explore tuning recipes that preserve learning while limiting output shift.”

    The researchers focused on the multi-layer perceptron (MLP), the feed-forward block inside each transformer layer that helps shape the model's outputs. 

    Catastrophic forgetting 

    The researchers wanted first to verify the existence and the cause of catastrophic forgetting in models. 

    To do this, they created a set of target tasks for the models to complete. The models were then fine-tuned on those tasks and evaluated to determine whether the fine-tuning led to substantial forgetting. But as the process went on, the researchers found that the models recovered some of their abilities. 

    “We also noticed a surprising result, that the model performance would drop significantly in held out benchmarks after training on the counting task, it would mostly recover on PathVQA, another specialized task that is not well represented in the benchmarks,” they said. “Meanwhile, while performing the forgetting mitigation experiments, we also tried separately tuning only the self-attention projection (SA Proj) or MLP layers, motivated by the finding that tuning only the LLM was generally better than tuning the full model. This led to another very surprising result – that tuning only self-attention projection layers led to very good learning of the target tasks with no drop in performance in held out tasks, even after training all five target tasks in a sequence.”

    The researchers said they believe that “what looks like forgetting or interference after fine-tuning on a narrow target task is actually bias in the output distribution due to the task distribution shift.”

    Narrow retraining

    That finding turned out to be the key to the experiment. The researchers noted that tuning the MLP increases the likelihood of “outputting numeric tokens and a highly correlated drop in held out task accuracy.” What this showed is that the model's apparent loss of knowledge is temporary rather than permanent. 

    “To avoid biasing the output distribution, we tune the MLP up/gating projections while keeping the down projection frozen, and find that it achieves similar learning to full MLP tuning with little forgetting,” the researchers said. 
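
    As a rough illustration of that recipe, the snippet below freezes every parameter except the MLP up and gating projections. The module names (mlp.gate_proj, mlp.up_proj, mlp.down_proj) follow common LLaMA-style Hugging Face checkpoints and the model name is hypothetical, so treat this as a sketch of the idea rather than the authors' code.

    ```python
    # Sketch: tune only the MLP up/gating projections; keep down projections (and
    # everything else) frozen. Module names follow LLaMA-style Hugging Face models.
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained("your-org/llama-style-checkpoint")  # hypothetical

    for name, param in model.named_parameters():
        # Trainable: up and gating projections inside each MLP block.
        # Frozen: down projections, attention, embeddings and everything else.
        param.requires_grad = ("mlp.gate_proj" in name) or ("mlp.up_proj" in name)

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"Trainable parameters: {trainable:,}")
    ```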

    This allows for a more straightforward and more reproducible method for fine-tuning a model. 

    By focusing on a narrow segment of the model, rather than a wholesale retraining, enterprises can cut compute costs. It also allows better control of output drift. 

    However, the research covers only two models, both dealing with vision and language. The researchers noted that, due to limited resources, they were unable to run the experiment on other models.

    They suggest, however, that the findings may extend to other LLMs, including models built for other modalities. 

  • Here’s what’s slowing down your AI strategy — and how to fix it

    Your best data science team just spent six months building a model that predicts customer churn with 90% accuracy. It’s sitting on a server, unused. Why? Because it’s been stuck in a lengthy risk review queue, waiting for a committee that doesn’t understand stochastic models to sign off. This isn’t a hypothetical — it’s the daily reality in most large companies.

    In AI, the models move at internet speed. Enterprises don’t.

    Every few weeks, a new model family drops, open-source toolchains mutate and entire MLOps practices get rewritten. But in most companies, anything touching production AI has to pass through risk reviews, audit trails, change-management boards and model-risk sign-off. The result is a widening velocity gap: The research community accelerates; the enterprise stalls.

    This gap isn’t a headline problem like “AI will take your job.” It’s quieter and more expensive: missed productivity, shadow AI sprawl, duplicated spend and compliance drag that turns promising pilots into perpetual proofs-of-concept.

    The numbers say the quiet part out loud

    Two trends collide. First, the pace of innovation: Industry is now the dominant force, producing the vast majority of notable AI models, according to Stanford's 2024 AI Index Report. The core inputs for this innovation are compounding at a historic rate, with training compute needs doubling every few years. That pace all but guarantees rapid model churn and tool fragmentation.

    Second, enterprise adoption is accelerating. According to IBM’s Global AI Adoption Index, 42% of enterprise-scale companies have actively deployed AI, with many more actively exploring it. Yet the same surveys show governance roles are only now being formalized, leaving many companies to retrofit controls after deployment.

    Layer on new regulation. The EU AI Act’s staged obligations are locked in — unacceptable-risk bans are already active and General Purpose AI (GPAI) transparency duties hit in mid-2025, with high-risk rules following. Brussels has made clear there’s no pause coming. If your governance isn’t ready, your roadmap will be.

    The real blocker isn't modeling, it's audit

    In most enterprises, the slowest step isn’t fine-tuning a model; it’s proving that the model complies with your policies and producing the evidence to show it.

    Three frictions dominate:

    1. Audit debt: Policies were written for static software, not stochastic models. You can ship a microservice with unit tests; you can’t “unit test” fairness drift without data access, lineage and ongoing monitoring. When controls don’t map, reviews balloon.

    2. MRM overload: Model risk management (MRM), a discipline perfected in banking, is spreading beyond finance — often translated literally, not functionally. Explainability and data-governance checks make sense; forcing every retrieval-augmented chatbot through credit-risk-style documentation does not.

    3. Shadow AI sprawl: Teams adopt vertical AI inside SaaS tools without central oversight. It feels fast — until the third audit asks who owns the prompts, where embeddings live and how to revoke data. Sprawl is speed’s illusion; integration and governance are the long-term velocity.

    Frameworks exist, but they're not operational by default

    The NIST AI Risk Management Framework is a solid north star: govern, map, measure, manage. It’s voluntary, adaptable and aligned with international standards. But it’s a blueprint, not a building. Companies still need concrete control catalogs, evidence templates and tooling that turn principles into repeatable reviews.

    Similarly, the EU AI Act sets deadlines and duties. It doesn’t install your model registry, wire your dataset lineage or resolve the age-old question of who signs off when accuracy and bias trade off. That’s on you, and soon.

    What winning enterprises are doing differently

    The leaders I see closing the velocity gap aren’t chasing every model; they’re making the path to production routine. Five moves show up again and again:

    1. Ship a control plane, not a memo: Codify governance as code. Create a small library or service that enforces non-negotiables: Dataset lineage required, evaluation suite attached, risk tier chosen, PII scan passed, human-in-the-loop defined (if required). If a project can’t satisfy the checks, it can’t deploy (a minimal code sketch follows this list).

    2. Pre-approve patterns: Approve reference architectures — “GPAI with retrieval augmented generation (RAG) on approved vector store,” “high-risk tabular model with feature store X and bias audit Y,” “vendor LLM via API with no data retention.” Pre-approval shifts review from bespoke debates to pattern conformance. (Your auditors will thank you.)

    3. Stage your governance by risk, not by team: Tie review depth to use-case criticality (safety, finance, regulated outcomes). A marketing copy assistant shouldn’t endure the same gauntlet as a loan adjudicator. Risk-proportionate review is both defensible and fast.

    4. Create an “evidence once, reuse everywhere” backbone: Centralize model cards, eval results, data sheets, prompt templates and vendor attestations. Every subsequent audit should start at 60% done because you’ve already proven the common pieces.

    5. Make audit a product: Give legal, risk and compliance a real roadmap. Instrument dashboards that show: Models in production by risk tier, upcoming re-evals, incidents and data-retention attestations. If audit can self-serve, engineering can ship.
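
    To make the first move concrete, here is a minimal sketch of governance as code: a CI step that refuses to deploy a model unless its manifest carries the non-negotiables above. The manifest schema, field names and file format are illustrative assumptions, not a specific product or standard.

    ```python
    # Minimal sketch of a governance-as-code CI check (hypothetical manifest schema).
    import sys
    import yaml  # pip install pyyaml

    REQUIRED_FIELDS = [
        "dataset_lineage",    # where the training data came from
        "evaluation_suite",   # which evals ran and where the results live
        "risk_tier",          # e.g. "low", "limited", "high"
        "pii_scan_passed",    # boolean attestation
    ]

    def check_manifest(path):
        """Return a list of governance violations for a model's deployment manifest."""
        with open(path) as f:
            manifest = yaml.safe_load(f) or {}
        problems = [f"missing: {field}" for field in REQUIRED_FIELDS if field not in manifest]
        if manifest.get("risk_tier") == "high" and not manifest.get("human_in_the_loop"):
            problems.append("high-risk model has no human_in_the_loop owner")
        return problems

    if __name__ == "__main__":
        violations = check_manifest(sys.argv[1])
        if violations:
            print("Deployment blocked:", *violations, sep="\n  - ")
            sys.exit(1)   # fail the pipeline; the project cannot deploy
        print("Governance checks passed.")
    ```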

    A pragmatic cadence for the next 12 months

    If you’re serious about catching up, pick a 12-month governance sprint:

    • Quarter 1: Stand up a minimal AI registry (models, datasets, prompts, evaluations). Draft risk-tiering and control mapping aligned to NIST AI RMF functions; publish two pre-approved patterns.

    • Quarter 2: Turn controls into pipelines (CI checks for evals, data scans, model cards). Convert two fast-moving teams from shadow AI to platform AI by making the paved road easier than the side road.

    • Quarter 3: Pilot a GxP-style review (a rigorous documentation standard from life sciences) for one high-risk use case; automate evidence capture. Start your EU AI Act gap analysis if you touch Europe; assign owners and deadlines.

    • Quarter 4: Expand your pattern catalog (RAG, batch inference, streaming prediction). Roll out dashboards for risk/compliance. Bake governance SLAs into your OKRs.

      By this point, you haven’t slowed down innovation — you’ve standardized it. The research community can keep moving at light speed; you can keep shipping at enterprise speed — without the audit queue becoming your critical path.

    The competitive edge isn't the next model — it's the next mile

    It’s tempting to chase each week’s leaderboard. But the durable advantage is the mile between a paper and production: The platform, the patterns, the proofs. That’s what your competitors can’t copy from GitHub, and it’s the only way to keep velocity without trading compliance for chaos.

    In other words: Make governance the grease, not the grit.

    Jayachander Reddy Kandakatla is senior machine learning operations (MLOps) engineer at Ford Motor Credit Company.

  • We keep talking about AI agents, but do we ever know what they are?

    Imagine you do two things on a Monday morning.

    First, you ask a chatbot to summarize your new emails. Next, you ask an AI tool to figure out why your top competitor grew so fast last quarter. The AI silently gets to work. It scours financial reports, news articles and social media sentiment. It cross-references that data with your internal sales numbers, drafts a strategy outlining three potential reasons for the competitor's success and schedules a 30-minute meeting with your team to present its findings.

    We're calling both of these "AI agents," but they represent worlds of difference in intelligence, capability and the level of trust we place in them. This ambiguity creates a fog that makes it difficult to build, evaluate, and safely govern these powerful new tools. If we can't agree on what we're building, how can we know when we've succeeded?

    This post won't try to sell you on yet another definitive framework. Instead, think of it as a survey of the current landscape of agent autonomy, a map to help us all navigate the terrain together.

    What are we even talking about? Defining an "AI agent"

    Before we can measure an agent's autonomy, we need to agree on what an "agent" actually is. The most widely accepted starting point comes from the foundational textbook on AI, Stuart Russell and Peter Norvig’s “Artificial Intelligence: A Modern Approach.” 

    They define an agent as anything that can be viewed as perceiving its environment through sensors and acting upon that environment through actuators. A thermostat is a simple agent: Its sensor perceives the room temperature, and its actuator acts by turning the heat on or off.

    ReAct Model for AI Agents (Credit: Confluent)

    That classic definition provides a solid mental model. For today's technology, we can translate it into four key components that make up a modern AI agent:

    1. Perception (the "senses"): This is how an agent takes in information about its digital or physical environment. It's the input stream that allows the agent to understand the current state of the world relevant to its task.

    2. Reasoning engine (the "brain"): This is the core logic that processes the perceptions and decides what to do next. For modern agents, this is typically powered by a large language model (LLM). The engine is responsible for planning, breaking down large goals into smaller steps, handling errors and choosing the right tools for the job.

    3. Action (the "hands"): This is how an agent affects its environment to move closer to its goal. The ability to take action via tools is what gives an agent its power.

    4. Goal/objective: This is the overarching task or purpose that guides all of the agent's actions. It is the "why" that turns a collection of tools into a purposeful system. The goal can be simple ("Find the best price for this book") or complex ("Launch the marketing campaign for our new product").

    Putting it all together, a true agent is a full-body system. The reasoning engine is the brain, but it’s useless without the senses (perception) to understand the world and the hands (actions) to change it. This complete system, all guided by a central goal, is what creates genuine agency.
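
    For readers who think in code, a minimal sketch of those four components wired into a loop is below; perceive, reason and the tools dictionary are caller-supplied placeholders for whatever LLM and integrations a real agent would use, so this shows the shape of an agent rather than any particular framework.

    ```python
    # Minimal sketch of an agent loop: perception, reasoning engine, actions, goal.
    # `perceive`, `reason` and the entries of `tools` are caller-supplied placeholders.
    def run_agent(goal, perceive, reason, tools, max_steps=10):
        for _ in range(max_steps):
            observation = perceive()                    # Perception: read the environment
            decision = reason(goal, observation)        # Reasoning engine: typically an LLM call
            if decision.get("done"):                    # Goal reached: stop acting
                return decision.get("result")
            action = tools[decision["tool"]]            # Action: the chosen tool...
            action(**decision.get("arguments", {}))     # ...changes the environment
        return "Stopped: step budget exhausted before the goal was reached."
    ```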

    With these components in mind, the distinction we made earlier becomes clear. A standard chatbot isn't a true agent. It perceives your question and acts by providing an answer, but it lacks an overarching goal and the ability to use external tools to accomplish it.

    An agent, on the other hand, is software that has agency. 

    It has the capacity to act independently and dynamically toward a goal. And it's this capacity that makes a discussion about the levels of autonomy so important.

    Learning from the past: How we learned to classify autonomy

    The dizzying pace of AI can make it feel like we're navigating uncharted territory. But when it comes to classifying autonomy, we’re not starting from scratch. Other industries have been working on this problem for decades, and their playbooks offer powerful lessons for the world of AI agents.

    The core challenge is always the same: How do you create a clear, shared language for the gradual handover of responsibility from a human to a machine?

    SAE levels of driving automation

    Perhaps the most successful framework comes from the automotive industry. The SAE J3016 standard defines six levels of driving automation, from Level 0 (fully manual) to Level 5 (fully autonomous).

    The SAE J3016 Levels of Driving Automation (Credit: SAE International)

    What makes this model so effective isn't its technical detail, but its focus on two simple concepts:

    1. Dynamic driving task (DDT): This is everything involved in the real-time act of driving: steering, braking, accelerating and monitoring the road.

    2. Operational design domain (ODD): These are the specific conditions under which the system is designed to work. For example, "only on divided highways" or "only in clear weather during the daytime."

    The question for each level is simple: Who is doing the DDT, and what is the ODD? 

    At Level 2, the human must supervise at all times. At Level 3, the car handles the DDT within its ODD, but the human must be ready to take over. At Level 4, the car can handle everything within its ODD, and if it encounters a problem, it can safely pull over on its own.

    The key insight for AI agents: A robust framework isn't about the sophistication of the AI "brain." It's about clearly defining the division of responsibility between human and machine under specific, well-defined conditions.

    Aviation's 10 Levels of Automation

    While the SAE’s six levels are great for broad classification, aviation offers a more granular model for systems designed for close human-machine collaboration. The Parasuraman, Sheridan, and Wickens model proposes a detailed 10-level spectrum of automation.

    Levels of Automation of Decision and Action Selection for Aviation (Credit: The MITRE Corporation)

    This framework is less about full autonomy and more about the nuances of interaction. For example:

    • At Level 3, the computer "narrows the selection down to a few" for the human to choose from.

    • At Level 6, the computer "allows the human a restricted time to veto before it executes" an action.

    • At Level 9, the computer "informs the human only if it, the computer, decides to."

    The key insight for AI agents: This model is perfect for describing the collaborative "centaur" systems we're seeing today. Most AI agents won't be fully autonomous (Level 10) but will exist somewhere on this spectrum, acting as a co-pilot that suggests, executes with approval or acts with a veto window.

    Robotics and unmanned systems

    Finally, the world of robotics brings in another critical dimension: context. The National Institute of Standards and Technology's (NIST) Autonomy Levels for Unmanned Systems (ALFUS) framework was designed for systems like drones and industrial robots.

    The Three-Axis Model for ALFUS (Credit: NIST)

    Its main contribution is adding context to the definition of autonomy, assessing it along three axes:

    1. Human independence: How much human supervision is required?

    2. Mission complexity: How difficult or unstructured is the task?

    3. Environmental complexity: How predictable and stable is the environment in which the agent operates?

    The key insight for AI agents: This framework reminds us that autonomy isn't a single number. An agent performing a simple task in a stable, predictable digital environment (like sorting files in a single folder) is fundamentally less autonomous than an agent performing a complex task across the chaotic, unpredictable environment of the open internet, even if the level of human supervision is the same.

    The emerging frameworks for AI agents

    Having looked at the lessons from automotive, aviation and robotics, we can now examine the emerging frameworks designed for AI agents. While the field is still new and no single standard has won out, most proposals fall into three distinct, but often overlapping, categories based on the primary question they seek to answer.

    Category 1: The "What can it do?" frameworks (capability-focused)

    These frameworks classify agents based on their underlying technical architecture and what they are capable of achieving. They provide a roadmap for developers, outlining a progression of increasingly sophisticated technical milestones that often correspond directly to code patterns.

    A prime example of this developer-centric approach comes from Hugging Face. Their framework uses a star rating to show the gradual shift in control from human to AI:

    Five Levels of AI Agent Autonomy, as proposed by Hugging Face (Credit: Hugging Face)

    • Zero stars (simple processor): The AI has no impact on the program's flow. It simply processes information and its output is displayed, like a print statement. The human is in complete control.

    • One star (router): The AI makes a basic decision that directs program flow, like choosing between two predefined paths (if/else). The human still defines how everything is done.

    • Two stars (tool call): The AI chooses which predefined tool to use and what arguments to use with it. The human has defined the available tools, but the AI decides how to execute them.

    • Three stars (multi-step agent): The AI now controls the iteration loop. It decides which tool to use, when to use it and whether to continue working on the task.

    • Four stars (fully autonomous): The AI can generate and execute entirely new code to accomplish a goal, going beyond the predefined tools it was given.
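
    As a rough illustration of how the lower star levels map to code patterns (all helper functions and prompts here are hypothetical, not Hugging Face's implementation):

    ```python
    # Illustrative only: what the first three star levels tend to look like in code.
    # `llm`, the handlers, `tools` and `parse_tool_call` are hypothetical helpers.
    def zero_stars(llm, text):
        print(llm(f"Summarize: {text}"))                 # output only; no effect on program flow

    def one_star_router(llm, ticket, handle_refund, handle_general):
        # The AI picks between two predefined paths; the human defined both paths.
        if "yes" in llm(f"Is this a refund request? Answer yes or no: {ticket}").lower():
            handle_refund(ticket)
        else:
            handle_general(ticket)

    def two_star_tool_call(llm, question, tools, parse_tool_call):
        # The AI chooses which predefined tool to call and with what arguments.
        name, args = parse_tool_call(llm(f"Choose a tool and JSON arguments for: {question}"))
        return tools[name](**args)
    ```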

    Strengths: This model is excellent for engineers. It's concrete, maps directly to code and clearly benchmarks the transfer of executive control to the AI. 

    Weaknesses: It is highly technical and less intuitive for non-developers trying to understand an agent's real-world impact.

    Category 2: The "How do we work together?" frameworks (interaction-focused)

    This second category defines autonomy not by the agent’s internal skills, but by the nature of its relationship with the human user. The central question is: Who is in control, and how do we collaborate?

    This approach often mirrors the nuance we saw in the aviation models. For instance, a framework detailed in the paper Levels of Autonomy for AI Agents defines levels based on the user's role:

    • L1 – user as an operator: The human is in direct control (like a person using Photoshop with AI-assist features).

    • L4 – user as an approver: The agent proposes a full plan or action, and the human must give a simple "yes" or "no" before it proceeds.

    • L5 – user as an observer: The agent has full autonomy to pursue a goal and simply reports its progress and results back to the human.

    Levels of Autonomy for AI Agents

    Strengths: These frameworks are highly intuitive and user-centric. They directly address the critical issues of control, trust, and oversight.

    Weaknesses: An agent with simple capabilities and one with highly advanced reasoning could both fall into the "Approver" level, so this approach can sometimes obscure the underlying technical sophistication.

    Category 3: The "Who is responsible?" frameworks (governance-focused)

    The final category is less concerned with how an agent works and more with what happens when it fails. These frameworks are designed to help answer crucial questions about law, safety and ethics.

    Think tanks like Germany's Stiftung Neue Verantwortung have analyzed AI agents through the lens of legal liability. Their work aims to classify agents in a way that helps regulators determine who is responsible for an agent's actions: The user who deployed it, the developer who built it or the company that owns the platform it runs on?

    This perspective is essential for navigating complex regulations like the EU's Artificial Intelligence Act, which will treat AI systems differently based on the level of risk they pose.

    Strengths: This approach is absolutely essential for real-world deployment. It forces the difficult but necessary conversations about accountability that build public trust.

    Weaknesses: It's more of a legal or policy guide than a technical roadmap for developers.

    A comprehensive understanding requires looking at all three questions at once: An agent's capabilities, how we interact with it and who is responsible for the outcome.

    Identifying the gaps and challenges

    Looking at the landscape of autonomy frameworks shows us that no single model is sufficient because the true challenges lie in the gaps between them, in areas that are incredibly difficult to define and measure.

    What is the "Road" for a digital agent?

    The SAE framework for self-driving cars gave us the powerful concept of an ODD, the specific conditions under which a system can operate safely. For a car, that might be "divided highways, in clear weather, during the day." This is a great solution for a physical environment, but what’s the ODD for a digital agent?

    The "road" for an agent is the entire internet. An infinite, chaotic and constantly changing environment. Websites get redesigned overnight, APIs are deprecated and social norms in online communities shift. 

    How do we define a "safe" operational boundary for an agent that can browse websites, access databases and interact with third-party services? Answering this is one of the biggest unsolved problems. Without a clear digital ODD, we can't make the same safety guarantees that are becoming standard in the automotive world.

    This is why, for now, the most effective and reliable agents operate within well-defined, closed-world scenarios. As I argued in a recent VentureBeat article, forgetting the open-world fantasies and focusing on "bounded problems" is the key to real-world success. This means defining a clear, limited set of tools, data sources and potential actions. 

    Beyond simple tool use

    Today's agents are getting very good at executing straightforward plans. If you tell one to "find the price of this item using Tool A, then book a meeting with Tool B," it can often succeed. But true autonomy requires much more. 

    Many systems today hit a technical wall when faced with tasks that require:

    • Long-term reasoning and planning: Agents struggle to create and adapt complex, multi-step plans in the face of uncertainty. They can follow a recipe, but they can't yet invent one from scratch when things go wrong.

    • Robust self-correction: What happens when an API call fails or a website returns an unexpected error? A truly autonomous agent needs the resilience to diagnose the problem, form a new hypothesis and try a different approach, all without a human stepping in.

    • Composability: The future likely involves not one agent, but a team of specialized agents working together. Getting them to collaborate reliably, to pass information back and forth, delegate tasks and resolve conflicts is a monumental software engineering challenge that we are just beginning to tackle.

    The elephant in the room: Alignment and control

    This is the most critical challenge of all, because it's not just technical, it's deeply human. Alignment is the problem of ensuring an agent's goals and actions are consistent with our intentions and values, even when those values are complex, unstated or nuanced.

    Imagine you give an agent the seemingly harmless goal of "maximizing customer engagement for our new product." The agent might correctly determine that the most effective strategy is to send a dozen notifications a day to every user. The agent has achieved its literal goal perfectly, but it has violated the unstated, common-sense goal of "don't be incredibly annoying."

    This is a failure of alignment.

    The core difficulty, which organizations like the AI Alignment Forum are dedicated to studying, is that it is incredibly hard to specify fuzzy, complex human preferences in the precise, literal language of code. As agents become more powerful, ensuring they are not just capable but also safe, predictable and aligned with our true intent becomes the most important challenge we face.

    The future is agentic (and collaborative)

    The path forward for AI agents is not a single leap to a god-like super-intelligence, but a more practical and collaborative journey. The immense challenges of open-world reasoning and perfect alignment mean that the future is a team effort.

    We will see less of the single, all-powerful agent and more of an "agentic mesh" — a network of specialized agents, each operating within a bounded domain, working together to tackle complex problems. 

    More importantly, they will work with us. The most valuable and safest applications will keep a human on the loop, casting them as a co-pilot or strategist to augment our intellect with the speed of machine execution. This "centaur" model will be the most effective and responsible path forward.

    The frameworks we've explored aren’t just theoretical. They’re practical tools for building trust, assigning responsibility and setting clear expectations. They help developers define limits and leaders shape vision, laying the groundwork for AI to become a dependable partner in our work and lives.

    Sean Falconer is Confluent's AI entrepreneur in residence.

  • Is vibe coding ruining a generation of engineers?

    AI tools are revolutionizing software development by automating repetitive tasks, refactoring bloated code, and identifying bugs in real-time. Developers can now generate well-structured code from plain language prompts, saving hours of manual effort. These tools learn from vast codebases, offering context-aware recommendations that enhance productivity and reduce errors. Rather than starting from scratch, engineers can prototype quickly, iterate faster and focus on solving increasingly complex problems.

    As code generation tools grow in popularity, they raise questions about the future size and structure of engineering teams. Earlier this year, Garry Tan, CEO of startup accelerator Y Combinator, noted that about one-quarter of its current clients use AI to write 95% or more of their software. In an interview with CNBC, Tan said: “What that means for founders is that you don’t need a team of 50 or 100 engineers, you don’t have to raise as much. The capital goes much longer.”

    AI-powered coding may offer a fast solution for businesses under budget pressure — but its long-term effects on the field and labor pool cannot be ignored.

    As AI-powered coding rises, human expertise may diminish

    In the era of AI, the traditional journey to coding expertise that has long supported senior developers may be at risk. Easy access to large language models (LLMs) enables junior coders to quickly identify issues in code. While this speeds up software development, it can distance developers from their own work, delaying the growth of core problem-solving skills. As a result, they may avoid the focused, sometimes uncomfortable hours required to build expertise and progress on the path to becoming successful senior developers.

    Consider Anthropic’s Claude Code, a terminal-based assistant built on the Claude 3.7 Sonnet model, which automates bug detection and resolution, test creation and code refactoring. Using natural language commands, it reduces repetitive manual work and boosts productivity.

    Microsoft has also released two open-source frameworks — AutoGen and Semantic Kernel — to support the development of agentic AI systems. AutoGen enables asynchronous messaging, modular components, and distributed agent collaboration to build complex workflows with minimal human input. Semantic Kernel is an SDK that integrates LLMs with languages like C#, Python and Java, letting developers build AI agents to automate tasks and manage enterprise applications.

    The increasing availability of these tools from Anthropic, Microsoft and others may reduce opportunities for coders to refine and deepen their skills. Rather than “banging their heads against the wall” to debug a few lines or select a library to unlock new features, junior developers may simply turn to AI for an assist. This means senior coders with problem-solving skills honed over decades may become an endangered species.

    Overreliance on AI for writing code risks weakening developers’ hands-on experience and understanding of key programming concepts. Without regular practice, they may struggle to independently debug, optimize or design systems. Ultimately, this erosion of skill can undermine critical thinking, creativity and adaptability — qualities that are essential not just for coding, but for assessing the quality and logic of AI-generated solutions.

    AI as mentor: Turning code automation into hands-on learning

    While concerns about AI diminishing human developer skills are valid, businesses shouldn’t dismiss AI-supported coding. They just need to think carefully about when and how to deploy AI tools in development. These tools can be more than productivity boosters; they can act as interactive mentors, guiding coders in real time with explanations, alternatives and best practices.

    When used as a training tool, AI can reinforce learning by showing coders why code is broken and how to fix it—rather than simply applying a solution. For example, a junior developer using Claude Code might receive immediate feedback on inefficient syntax or logic errors, along with suggestions linked to detailed explanations. This enables active learning, not passive correction. It’s a win-win: Accelerating project timelines without doing all the work for junior coders.

    Additionally, coding frameworks can support experimentation by letting developers prototype agent workflows or integrate LLMs without needing expert-level knowledge upfront. By observing how AI builds and refines code, junior developers who actively engage with these tools can internalize patterns, architectural decisions and debugging strategies — mirroring the traditional learning process of trial and error, code reviews and mentorship.

    However, AI coding assistants shouldn’t replace real mentorship or pair programming. Pull requests and formal code reviews remain essential for guiding newer, less experienced team members. We are nowhere near the point at which AI can single-handedly upskill a junior developer.

    Companies and educators can build structured development programs around these tools that emphasize code comprehension to ensure AI is used as a training partner rather than a crutch. This encourages coders to question AI outputs and requires manual refactoring exercises. In this way, AI becomes less of a replacement for human ingenuity and more of a catalyst for accelerated, experiential learning.

    Bridging the gap between automation and education

    When utilized with intention, AI doesn’t just write code; it teaches coding, blending automation with education to prepare developers for a future where deep understanding and adaptability remain indispensable.

    By embracing AI as a mentor, a programming partner and a team of developers we can direct at the problem at hand, we can bridge the gap between effective automation and education. We can empower developers to grow alongside the tools they use. We can ensure that, as AI evolves, so too does the human skill set, fostering a generation of coders who are both efficient and deeply knowledgeable.

    Richard Sonnenblick is chief data scientist at Planview.

  • When dirt meets data: ScottsMiracle-Gro saved $150M using AI

    How a semiconductor veteran turned over a century of horticultural wisdom into AI-led competitive advantage 

    For decades, a ritual played out across ScottsMiracle-Gro’s media facilities. Every few weeks, workers walked acres of towering compost and wood chip piles with nothing more than measuring sticks. They wrapped rulers around each mound, estimated height, and did what company President Nate Baxter now describes as “sixth-grade geometry to figure out volume.”

    Today, drones glide over those same plants with mechanical precision. Vision systems calculate volumes in real time. The move from measuring sticks to artificial intelligence signals more than efficiency. It is the visible proof of one of corporate America’s most unlikely technology stories.

    The AI revolution finds an unexpected leader

    Enterprise AI has been led by predictable players. Software companies with cloud-native architectures. Financial services firms with vast data lakes. Retailers with rich digital touchpoints. Consumer packaged goods companies that handle physical products like fertilizer and soil were not expected to lead.

    Yet ScottsMiracle-Gro has realized more than half of a targeted $150 million in supply chain savings. It reports a 90 percent improvement in customer service response times. Its predictive models enable weekly reallocation of marketing resources across regional markets.

    A Silicon Valley veteran bets on soil science

    Baxter’s path to ScottsMiracle-Gro (SMG) reads like a calculated pivot, not a corporate rescue. After two decades in semiconductor manufacturing at Intel and Tokyo Electron, he knew how to apply advanced technology to complex operations.

    “I sort of initially said, ‘Why would I do this? I’m running a tech company. It’s an industry I’ve been in for 25 years,’” Baxter recalls of his reaction when ScottsMiracle-Gro CEO Jim Hagedorn approached him in 2023. The company was reeling from a collapsed $1.2 billion hydroponics investment and facing what he describes as “pressure from a leverage standpoint.”

    His wife challenged him with a direct prompt. If you are not learning or putting yourself in uncomfortable situations, you should change that.

    Baxter saw clear parallels between semiconductor manufacturing and SMG’s operations. Both require precision, quality control, and the optimization of complex systems. He also saw untapped potential in SMG’s domain knowledge. One hundred fifty years of horticultural expertise, regulatory know-how, and customer insight had never been fully digitized.

    “It became apparent to me whether it was on the backend with data analytics, business process transformation, and obviously now with AI being front and center of the consumer experience, a lot of opportunities are there,” he explains.

    The declaration that changed everything

    The pivot began at an all-hands meeting. “I just said, you know, guys, we’re a tech company. You just don’t know it yet,” Baxter recalls. “There’s so much opportunity here to drive this company to where it needs to go.”

    The first challenge was organizational. SMG had evolved into functional silos. IT, supply chain, and brand teams ran independent systems with little coordination. Drawing on his experience with complex technology organizations, Baxter restructured the consumer business into three business units. General managers became accountable not just for financial results but also for technology implementation within their domains.

    “I came in and said, we’re going to create new business units,” he explains. “The buck stops with you and I’m holding you accountable not only for the business results, for the quality of the creative and marketing, but for the implementation of technology.”

    To support the new structure, SMG set up centers of excellence for digital capabilities, insights and analytics, and creative functions. The hybrid design placed centralized expertise behind distributed accountability.

    Mining corporate memory for AI gold

    Turning legacy knowledge into machine-ready intelligence required what Fausto Fleites, VP of Data Intelligence, calls “archaeological work.” The team excavated decades of business logic embedded in legacy SAP systems and converted filing cabinets of research into AI-ready datasets. Fleites, a Cuban immigrant with a doctorate from FIU who led Florida’s public hurricane loss model before roles at Sears and Cemex, understood the stakes.

    “The costly part of the migration was the business reporting layer we have in SAP Business Warehouse,” Fleites explains. “You need to uncover business logic created in many cases over decades.”

    SMG chose Databricks as its unified data platform. The team had Apache Spark expertise. Databricks offered strong SAP integration and aligned with a preference for open-source technologies that minimize vendor lock-in.

    The breakthrough came through systematic knowledge management. SMG built an AI bot using Google’s Gemini large language model to catalog and clean internal repositories. The system identified duplicates, grouped content by topic, and restructured information for AI consumption. The effort reduced knowledge articles by 30 percent while increasing their utility.

    “We used Gemini LLMs to actually categorize them into topics, find similar documents,” Fleites explains. A hybrid approach that combined modern AI with techniques like cosine similarity became the foundation for later applications.
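
    The mechanics of that hybrid step can be sketched simply: embed each article, then flag pairs whose cosine similarity crosses a threshold as candidate duplicates. The embed function and the 0.95 threshold below are illustrative assumptions, not SMG's pipeline.

    ```python
    # Sketch of cosine-similarity duplicate detection across knowledge articles.
    # `embed` is any text-embedding function (placeholder); the threshold is illustrative.
    import numpy as np

    def cosine_similarity(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def find_near_duplicates(articles, embed, threshold=0.95):
        """Return pairs of article indices whose embeddings are nearly identical."""
        vectors = [np.asarray(embed(text)) for text in articles]
        pairs = []
        for i in range(len(vectors)):
            for j in range(i + 1, len(vectors)):
                if cosine_similarity(vectors[i], vectors[j]) >= threshold:
                    pairs.append((i, j))   # candidates to merge or retire
        return pairs
    ```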

    Building AI systems that actually understand fertilizer

    Early trials with off-the-shelf AI exposed a real risk. General-purpose models confused products designed for killing weeds with those for preventing them. That mistake can ruin a lawn.

    “Different products, if you use one in the wrong place, would actually have a very negative outcome,” Fleites notes. “But those are kind of synonyms in certain contexts to the LLM. So they were recommending the wrong products.”

    The solution was a new architecture. SMG created what Fleites calls a “hierarchy of agents.” A supervisor agent routes queries to specialized worker agents organized by brand. Each agent draws on deep product knowledge encoded from a 400-page internal training manual.

    The system also changes the conversation. When users ask for recommendations, the agents start with questions about location, goals, and lawn conditions. They narrow possibilities step by step before offering suggestions. The stack integrates with APIs for product availability and state-specific regulatory compliance.
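
    A supervisor-and-worker routing pattern like the one described can be sketched as follows; the brand matching, data structures and clarifying-question flow are illustrative assumptions, not ScottsMiracle-Gro's implementation.

    ```python
    # Sketch of a supervisor agent routing to brand-specific worker agents.
    # Brand names, routing logic and data structures are illustrative assumptions.
    from dataclasses import dataclass, field

    @dataclass
    class WorkerAgent:
        brand: str
        knowledge: dict = field(default_factory=dict)   # encoded product knowledge

        def answer(self, query, profile):
            # A real worker would first ask about location, goals and lawn conditions,
            # then check availability and state-level compliance via APIs.
            return f"[{self.brand}] recommendation for {profile.get('state', 'unknown state')}: {query}"

    class SupervisorAgent:
        def __init__(self, workers):
            self.workers = workers                      # brand name -> WorkerAgent

        def route(self, query, profile):
            # Toy routing: match on brand keywords; a production system would
            # classify the query with an LLM before delegating.
            for brand, worker in self.workers.items():
                if brand.lower() in query.lower():
                    return worker.answer(query, profile)
            return "Which product line are you asking about?"
    ```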

    From drones to demand forecasting across the enterprise

    The transformation runs across the company. Drones measure inventory piles. Demand forecasting models analyze more than 60 factors, including weather patterns, consumer sentiment, and macroeconomic indicators.

    These predictions enable faster moves. When drought struck Texas, the models supported a shift in promotional spending to regions with favorable weather. The reallocation helped drive positive quarterly results.

    “We not only have the ability to move marketing and promotion dollars around, but we’ve even gotten to the point where if it’s going to be a big weekend in the Northeast, we’ll shift our field sales resources from other regions up there,” Baxter explains.

    Consumer Services changed as well. AI agents now process incoming emails through Salesforce, draft responses based on the knowledge base, and flag them for brief human review. Draft times dropped from ten minutes to seconds and response quality improved.

    The company emphasizes explainable AI. Using SHAP, SMG built dashboards that decompose each forecast and show how weather, promotions, or media spending contribute to predictions.

    “Typically, if you open a prediction to a business person and you don’t say why, they’ll say, ‘I don’t believe you,’” Fleites explains. Transparency made it possible to move resource allocation from quarterly to weekly cycles.
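
    As an illustration of that kind of explainability, the sketch below fits a small gradient-boosted forecaster on synthetic weekly data and decomposes each prediction with SHAP. The features, data and model are invented for the example and are not SMG's actual drivers or pipeline.

    ```python
    # Illustrative SHAP decomposition of a toy demand forecast (synthetic data,
    # invented feature names; not ScottsMiracle-Gro's models or drivers).
    import numpy as np
    import pandas as pd
    import shap
    from xgboost import XGBRegressor

    rng = np.random.default_rng(0)
    X = pd.DataFrame({
        "avg_temperature_c": rng.uniform(5, 35, 300),
        "rainfall_mm": rng.uniform(0, 80, 300),
        "promo_spend_usd": rng.uniform(0, 100_000, 300),
    })
    y = 100 + 3 * X["avg_temperature_c"] - 0.5 * X["rainfall_mm"] + 0.001 * X["promo_spend_usd"]

    model = XGBRegressor(n_estimators=100, max_depth=3).fit(X, y)

    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X)   # per-feature contribution to each forecast
    shap.summary_plot(shap_values, X)        # the "why" view a planning dashboard would surface
    ```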

    Competing like a startup

    SMG’s results challenge assumptions about AI readiness in traditional industries. The advantage does not come from owning the most sophisticated models. It comes from combining general-purpose AI with unique, structured domain knowledge.

    “LLMs are going to be a commodity,” Fleites observes. “The strategic differentiator is what is the additional level of [internal] knowledge we can fit to them.”

    Partnerships are central. SMG works with Google Vertex AI for foundational models, Sierra.ai for production-ready conversational agents, and Kindwise for computer vision. The ecosystem approach lets a small internal team recruited from Meta, Google, and AI startups deliver outsized impact without building everything from scratch.

    Talent follows impact. Conventional wisdom says traditional companies cannot compete with Meta salaries or Google stock. SMG offered something different. It offered the chance to build transformative AI applications with immediate business impact.

    “When we have these interviews, what we propose to them is basically the ability to have real value with the latest knowledge in these spaces,” Fleites explains. “A lot of people feel motivated to come to us” because much of big tech AI work, despite the hype, “doesn’t really have an impact.”

    Team design mirrors that philosophy. “My direct reports are leaders and not only manage people, but are technically savvy,” Fleites notes. “We always are constantly switching hands between developing or maintaining a solution versus strategy versus managing people.” He still writes code weekly. The small team of 15 to 20 AI and engineering professionals stays lean by contracting out implementation while keeping “the know-how and the direction and the architecture” in-house.

    When innovation meets immovable objects

    Not every pilot succeeded. SMG tested semi-autonomous forklifts in a 1.3 million square foot distribution facility. Remote drivers in the Philippines controlled up to five vehicles at once with strong safety records.

    “The technology was actually really great,” Baxter acknowledges. The vehicles could not lift enough weight for SMG’s heavy products. The company paused implementation.

    “Not everything we’ve tried has gone smoothly,” Baxter admits. “But I think another important point is you have to focus on a few critical ones and you have to know when something isn’t going to work and readjust.”

    The lesson tracks with semiconductor discipline. Investments must show measurable returns within set timeframes. Regulatory complexity adds difficulty. Products must comply with EPA rules and a patchwork of state restrictions, which AI systems must navigate correctly.

    The gardening sommelier and agent-to-agent futures

    The roadmap reflects a long-term view. SMG plans a “gardening sommelier” mobile app in 2026 that identifies plants, weeds, and lawn problems from photos and provides instant guidance. A beta already helps field sales teams answer complex product questions by querying the 400-page knowledge base.

    The company is exploring agent-to-agent communication so its specialized AI can interface with retail partners’ systems. A customer who asks a Walmart chatbot for lawn advice could trigger an SMG query that returns accurate, regulation-compliant recommendations.

    SMG has launched AI-powered search on its website, replacing keyword systems with conversational engines based on the internal stack. The future vision pairs predictive models with conversational agents so the system can reach out when conditions suggest a customer may need help.

    What traditional industries can learn

    ScottsMiracle-Gro's transformation offers a clear playbook for enterprises. The advantage doesn't come from deploying the most sophisticated models. Instead, it comes from combining AI with proprietary domain knowledge that competitors can't easily replicate.

    By making general managers responsible for both business results and technology implementation, SMG ensured AI wasn't just an IT initiative but a business imperative. The 150 years of horticultural expertise only became valuable when it was digitized, structured, and made accessible to AI systems.

    Legacy companies competing for AI engineers can't match Silicon Valley compensation packages. But they can offer something tech giants often can't: immediate, measurable impact. When engineers see their weather forecasting models directly influence quarterly results or their agent architecture prevent customers from ruining their lawns, the work carries weight that another incremental improvement to an ad algorithm never will.

    “We have a right to win,” Baxter says. “We have 150 years of this experience.” That experience is now data, and data is the company’s competitive edge. ScottsMiracle-Gro didn’t outspend its rivals or chase the newest AI model. It turned knowledge into an operating system for growth. For a company built on soil, its biggest breakthrough might be cultivating data.