Tag: Machine Learning

  • Google releases new AI video model Veo 3.1 in Flow and API: what it means for enterprises

    As expected after days of leaks and rumors online, Google has unveiled Veo 3.1, its latest AI video generation model, bringing a suite of creative and technical upgrades aimed at improving narrative control, audio integration, and realism in AI-generated video.

    While the updates expand possibilities for hobbyists and content creators using Google’s online AI creation app, Flow, the release also signals a growing opportunity for enterprises, developers, and creative teams seeking scalable, customizable video tools.

    The quality is higher, the physics are better, the pricing is unchanged, and the control and editing features are more robust and varied.

    My initial tests showed it to be a powerful and performant model that immediately delights with each generation. However, its default look is more cinematic, polished, and a little more "artificial" than rivals such as OpenAI's new Sora 2, released late last month, which may or may not be what a particular user is after (Sora excels at handheld, "candid"-style videos).

    Expanded Control Over Narrative and Audio

    Veo 3.1 builds on its predecessor, Veo 3 (released back in May 2025), with enhanced support for dialogue, ambient sound, and other audio effects.

    Native audio generation is now available across several key features in Flow, including “Frames to Video,” “Ingredients to Video,” and “Extend,” which give users the ability to, respectively: turn still images into video; use items, characters, and objects from multiple images in a single video; and generate clips longer than the initial 8 seconds, stretching to more than 30 seconds or even a minute or more when continuing from a prior clip's final frame.

    Before, you had to add audio manually after using these features.

    This addition gives users greater command over tone, emotion, and storytelling — capabilities that have previously required post-production work.

    In enterprise contexts, this level of control may reduce the need for separate audio pipelines, offering an integrated way to create training content, marketing videos, or digital experiences with synchronized sound and visuals.

    Google noted in a blog post that the updates reflect user feedback calling for deeper artistic control and improved audio support. Gallegos emphasizes the importance of making edits and refinements possible directly in Flow, without reworking scenes from scratch.

    Richer Inputs and Editing Capabilities

    With Veo 3.1, Google introduces support for multiple input types and more granular control over generated outputs. The model accepts text prompts, images, and video clips as input, and also supports:

    • Reference images (up to three) to guide appearance and style in the final output

    • First and last frame interpolation to generate seamless scenes between fixed endpoints

    • Scene extension that continues a video’s action or motion beyond its current duration

    These tools aim to give enterprise users a way to fine-tune the look and feel of their content—useful for brand consistency or adherence to creative briefs.

    Additional capabilities like “Insert” (add objects to scenes) and “Remove” (delete elements or characters) are also being introduced, though not all are immediately available through the Gemini API.

    Deployment Across Platforms

    Veo 3.1 is accessible through several of Google’s existing AI services:

    • Flow, Google’s own interface for AI-assisted filmmaking

    • Gemini API, targeted at developers building video capabilities into applications

    • Vertex AI, where enterprise integration will soon support Veo’s “Scene Extension” and other key features

    Availability through these platforms allows enterprise customers to choose the right environment—GUI-based or programmatic—based on their teams and workflows.
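
    For developer teams taking the Gemini API route, a Veo generation request is a long-running operation that is polled until the clip is ready. The sketch below assumes the google-genai Python SDK and a hypothetical Veo 3.1 preview model id; check Google's current documentation for the exact model name and available options.

    ```python
    import time
    from google import genai

    client = genai.Client(api_key="YOUR_API_KEY")  # Veo access requires the paid tier

    # The model id here is a placeholder; preview model names change between releases.
    operation = client.models.generate_videos(
        model="veo-3.1-generate-preview",
        prompt="A slow dolly shot through a rain-soaked neon market at night",
    )

    # Video generation is asynchronous: poll the long-running operation until it completes.
    while not operation.done:
        time.sleep(10)
        operation = client.operations.get(operation)

    # Download the first generated clip and save it locally.
    video = operation.response.generated_videos[0]
    client.files.download(file=video.video)
    video.video.save("veo_clip.mp4")
    ```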

    Pricing and Access

    The Veo 3.1 model is currently in preview and available only on the paid tier of the Gemini API. The cost structure is the same as Veo 3, the preceding generation of AI video models from Google.

    • Standard model: $0.40 per second of video

    • Fast model: $0.15 per second

    There is no free tier, and users are charged only if a video is successfully generated. This pricing model is consistent with previous Veo versions and provides predictable costs for budget-conscious enterprise teams.
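
    At those per-second rates, budgeting is straightforward multiplication. The snippet below is only an illustration using the published preview rates, estimating, for example, a batch of one hundred 8-second clips at each tier.

    ```python
    # Published Veo 3.1 preview rates (USD per second of generated video)
    RATES = {"standard": 0.40, "fast": 0.15}

    def estimate_cost(seconds_per_clip: float, clips: int, tier: str = "standard") -> float:
        """Estimated spend for a batch of clips; charges apply only to successful generations."""
        return RATES[tier] * seconds_per_clip * clips

    print(estimate_cost(8, 100, "standard"))  # 320.0 -> $320 for 100 standard 8-second clips
    print(estimate_cost(8, 100, "fast"))      # 120.0 -> $120 for the same batch on the fast model
    ```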

    Technical Specs and Output Control

    Veo 3.1 outputs video at 720p or 1080p resolution, with a 24 fps frame rate.

    Duration options include 4, 6, or 8 seconds from a text prompt or uploaded images, with the ability to extend videos up to 148 seconds (nearly two and a half minutes) when using the “Extend” feature.

    New functionality also includes tighter control over subjects and environments. For example, enterprises can upload a product image or visual reference, and Veo 3.1 will generate scenes that preserve its appearance and stylistic cues across the video. This could streamline creative production pipelines for retail, advertising, and virtual content production teams.

    Initial Reactions

    The broader creator and developer community has responded to Veo 3.1’s launch with a mix of optimism and tempered critique—particularly when comparing it to rival models like OpenAI’s Sora 2.

    Matt Shumer, co-founder of AI startup Otherside AI (maker of HyperWrite) and an early adopter, described his initial reaction as “disappointment,” noting that Veo 3.1 is “noticeably worse than Sora 2” and also “quite a bit more expensive.”

    However, he acknowledged that Google’s tooling—such as support for references and scene extension—is a bright spot in the release.

    Travis Davids, a 3D digital artist and AI content creator, echoed some of that sentiment. While he noted improvements in audio quality, particularly in sound effects and dialogue, he raised concerns about limitations that remain in the system.

    These include the lack of custom voice support, an inability to select generated voices directly, and the continued cap at 8-second generations—despite some public claims about longer outputs.

    Davids also pointed out that character consistency across changing camera angles still requires careful prompting, whereas other models like Sora 2 handle this more automatically. He questioned the absence of 1080p resolution for users on paid tiers like Flow Pro and expressed skepticism over feature parity.

    On the more positive end, @kimmonismus, an AI newsletter writer, stated that “Veo 3.1 is amazing,” though still concluded that OpenAI’s latest model remains preferable overall.

    Collectively, these early impressions suggest that while Veo 3.1 delivers meaningful tooling enhancements and new creative control features, expectations have shifted as competitors raise the bar on both quality and usability.

    Adoption and Scale

    Since launching Flow five months ago, Google says over 275 million videos have been generated across various Veo models.

    The pace of adoption suggests significant interest not only from individuals but also from developers and businesses experimenting with automated content creation.

    Thomas Iljic, Director of Product Management at Google Labs, highlights that Veo 3.1’s release brings capabilities closer to how human filmmakers plan and shoot. These include scene composition, continuity across shots, and coordinated audio—all areas that enterprises increasingly look to automate or streamline.

    Safety and Responsible AI Use

    Videos generated with Veo 3.1 are watermarked using Google’s SynthID technology, which embeds an imperceptible identifier to signal that the content is AI-generated.

    Google applies safety filters and moderation across its APIs to help minimize privacy and copyright risks. Generated content is stored temporarily and deleted after two days unless downloaded.

    For developers and enterprises, these features provide reassurance around provenance and compliance—critical in regulated or brand-sensitive industries.

    Where Veo 3.1 Stands Among a Crowded AI Video Model Space

    Veo 3.1 is not just an iteration on prior models—it represents a deeper integration of multimodal inputs, storytelling control, and enterprise-level tooling. While creative professionals may see immediate benefits in editing workflows and fidelity, businesses exploring automation in training, advertising, or virtual experiences may find even greater value in the model’s composability and API support.

    The early user feedback highlights that while Veo 3.1 offers valuable tooling, expectations around realism, voice control, and generation length are evolving rapidly. As Google expands access through Vertex AI and continues refining Veo, its competitive positioning in enterprise video generation will hinge on how quickly these user pain points are addressed.

  • Anthropic is giving away its powerful Claude Haiku 4.5 AI for free to take on OpenAI

    Anthropic released Claude Haiku 4.5 on Wednesday, a smaller and significantly cheaper artificial intelligence model that matches the coding capabilities of systems that were considered cutting-edge just months ago, marking the latest salvo in an intensifying competition to dominate enterprise AI.

    The model costs $1 per million input tokens and $5 per million output tokens — roughly one-third the price of Anthropic's mid-sized Sonnet 4 model released in May, while operating more than twice as fast. In certain tasks, particularly operating computers autonomously, Haiku 4.5 actually surpasses its more expensive predecessor.

    "Haiku 4.5 is a clear leap in performance and is now largely as smart as Sonnet 4 while being significantly faster and one-third of the cost," an Anthropic spokesperson told VentureBeat, underscoring how rapidly AI capabilities are becoming commoditized as the technology matures.

    The launch comes just two weeks after Anthropic released Claude Sonnet 4.5, which the company bills as the world's best coding model, and two months after introducing Opus 4.1. The breakneck pace of releases reflects mounting pressure from OpenAI, whose $500 billion valuation dwarfs Anthropic's $183 billion, and which has inked a series of multibillion-dollar infrastructure deals while expanding its product lineup.

    How free access to advanced AI could reshape the enterprise market

    In an unusual move that could reshape competitive dynamics in the AI market, Anthropic is making Haiku 4.5 available for all free users of its Claude.ai platform. The decision effectively democratizes access to what the company characterizes as "near-frontier-level intelligence" — capabilities that would have been available only in expensive, premium models months ago.

    "The launch of Claude Haiku 4.5 means that near-frontier-level intelligence is available for free to all users through Claude.ai," the Anthropic spokesperson told VentureBeat. "It also offers significant advantages to our enterprise customers: Sonnet 4.5 can handle frontier planning while Haiku 4.5 powers sub-agents, enabling multi-agent systems that tackle complex refactors, migrations, and large features builds with speed and quality."

    This multi-agent architecture signals a significant shift in how AI systems are deployed. Rather than relying on a single, monolithic model, enterprises can now orchestrate teams of specialized AI agents: a more sophisticated Sonnet 4.5 model breaking down complex problems and delegating subtasks to multiple Haiku 4.5 agents working in parallel. For software development teams, this could mean Sonnet 4.5 plans a major code refactoring while Haiku 4.5 agents simultaneously execute changes across dozens of files.

    The approach mirrors how human organizations distribute work, and could prove particularly valuable for enterprises seeking to balance performance with cost efficiency — a critical consideration as AI deployment scales.
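
    As a concrete sketch of that planner/worker split, the snippet below uses the Anthropic Python SDK to have a Sonnet-class model break a refactor into subtasks and hand each one to a cheaper Haiku-class worker. The model id strings are assumptions and should be checked against Anthropic's current model list; the subtasks run sequentially here for brevity, though in practice they would be parallelized.

    ```python
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    PLANNER = "claude-sonnet-4-5"  # assumed model ids; confirm against Anthropic's docs
    WORKER = "claude-haiku-4-5"

    def ask(model: str, prompt: str) -> str:
        """Single-turn helper around the Messages API."""
        message = client.messages.create(
            model=model,
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        return message.content[0].text

    # 1) The larger model plans: one independent subtask per line.
    plan = ask(PLANNER, "Break this refactor into independent subtasks, one per line: "
                        "rename the `User` model to `Account` across the service.")

    # 2) Fast, cheap workers execute each subtask.
    results = [ask(WORKER, f"Carry out this subtask and describe the resulting diff:\n{subtask}")
               for subtask in plan.splitlines() if subtask.strip()]

    print(f"{len(results)} subtasks completed")
    ```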

    Inside Anthropic's path to $7 billion in annual revenue

    The model launch coincides with revelations that Anthropic's business is experiencing explosive growth. The company's annual revenue run rate is approaching $7 billion this month, Anthropic told Reuters, up from more than $5 billion reported in August. Internal projections obtained by Reuters suggest the company is targeting between $20 billion and $26 billion in annualized revenue for 2026, representing growth of more than 200% to nearly 300%.

    The company now serves more than 300,000 business customers, with enterprise products accounting for approximately 80% of revenue. Among Anthropic's most successful offerings is Claude Code, a code-generation tool that has reached nearly $1 billion in annualized revenue since launching earlier this year.

    Those numbers come as artificial intelligence enters what many in the industry characterize as a critical inflection point. After two years of what Anthropic Chief Product Officer Mike Krieger recently described as "AI FOMO" — where companies adopted AI tools without clear success metrics — enterprises are now demanding measurable returns on investment.

    "The best products can be grounded in some kind of success metric or evaluation," Krieger said on the "Superhuman AI" podcast. "I've seen that a lot in talking to companies that are deploying AI."

    For enterprises evaluating AI tools, the calculus increasingly centers on concrete productivity gains. Google CEO Sundar Pichai claimed in June that AI had generated a 10% boost in engineering velocity at his company — though measuring such improvements across different roles and use cases remains challenging, as Krieger acknowledged.

    Why AI safety testing matters more than ever for enterprise adoption

    Anthropic's launch comes amid heightened scrutiny of the company's approach to AI safety and regulation. On Tuesday, David Sacks, the White House's AI "czar" and a venture capitalist, accused Anthropic of "running a sophisticated regulatory capture strategy based on fear-mongering" that is "damaging the startup ecosystem."

    The attack targeted remarks by Jack Clark, Anthropic's British co-founder and head of policy, who had described being "deeply afraid" of AI's trajectory. Clark told Bloomberg he found Sacks' criticism "perplexing."

    Anthropic addressed such concerns head-on in its release materials, emphasizing that Haiku 4.5 underwent extensive safety testing. The company classified the model as ASL-2 — its AI Safety Level 2 standard — compared to the more restrictive ASL-3 designation for the more powerful Sonnet 4.5 and Opus 4.1 models.

    "Our teams have red-teamed and tested our agentic capabilities to the limits in order to assess whether it can be used to engage in harmful activity like generating misinformation or promoting fraudulent behavior like scams," the spokesperson told VentureBeat. "In our automated alignment assessment, it showed a statistically significantly lower overall rate of misaligned behaviors than both Claude Sonnet 4.5 and Claude Opus 4.1 — making it, by this metric, our safest model yet."

    The company said its safety testing showed Haiku 4.5 poses only limited risks regarding the production of chemical, biological, radiological and nuclear weapons. Anthropic has also implemented classifiers designed to detect and filter prompt injection attacks, a common method for attempting to manipulate AI systems into producing harmful content.

    The emphasis on safety reflects Anthropic's founding mission. The company was established in 2021 by former OpenAI executives, including siblings Dario and Daniela Amodei, who left amid concerns about OpenAI's direction following its partnership with Microsoft. Anthropic has positioned itself as taking a more cautious, research-oriented approach to AI development.

    Benchmark results show Haiku 4.5 competing with larger, more expensive models

    According to Anthropic's benchmarks, Haiku 4.5 performs competitively with or exceeds several larger models across multiple evaluation criteria. On SWE-bench Verified, a widely used test measuring AI systems' ability to solve real-world software engineering problems, Haiku 4.5 scored 73.3% — slightly ahead of Sonnet 4's 72.7% and close to GPT-5 Codex's 74.5%.

    The model demonstrated particular strength in computer use tasks, achieving 50.7% on the OSWorld benchmark compared to Sonnet 4's 42.2%. This capability allows the AI to interact directly with computer interfaces — clicking buttons, filling forms, navigating applications — which could prove transformative for automating routine digital tasks.

    In coding-specific benchmarks like Terminal-Bench, which tests AI agents' ability to complete complex software tasks using command-line tools, Haiku 4.5 scored 41.0%, trailing only Sonnet 4.5's 50.0% among Claude models.

    The model maintains a 200,000-token context window for standard users, with developers accessing the Claude Developer Platform able to use a 1-million-token context window. That expanded capacity means the model can process extremely large codebases or documents in a single request — roughly equivalent to a 1,500-page book.

    What three major AI model releases in two months say about the competition

    When asked about the rapid succession of model releases, the Anthropic spokesperson emphasized the company's focus on execution rather than competitive positioning.

    "We're focused on shipping the best possible products for our customers — and our shipping velocity speaks for itself," the spokesperson said. "What was state-of-the-art just five months ago is now faster, cheaper, and more accessible."

    That velocity stands in contrast to the company's earlier, more measured release schedule. Anthropic appeared to have paused development of its Haiku line after releasing version 3.5 at the end of last year, leading some observers to speculate the company had deprioritized smaller models.

    That rapid price-performance improvement validates a core promise of artificial intelligence: that capabilities will become dramatically cheaper over time as the technology matures and companies optimize their models. For enterprises, it suggests that today's budget constraints around AI deployment may ease considerably in coming years.

    From customer service to code: Real-world applications for faster, cheaper AI

    The practical applications of Haiku 4.5 span a wide range of enterprise functions, from customer service to financial analysis to software development. The model's combination of speed and intelligence makes it particularly suited for real-time, low-latency tasks like chatbot conversations and customer support interactions, where delays of even a few seconds can degrade user experience.

    In financial services, the multi-agent architecture enabled by pairing Sonnet 4.5 with Haiku 4.5 could transform how firms monitor markets and manage risk. Anthropic envisions Haiku 4.5 monitoring thousands of data streams simultaneously — tracking regulatory changes, market signals and portfolio risks — while Sonnet 4.5 handles complex predictive modeling and strategic analysis.

    For research organizations, the division of labor could compress timelines dramatically. Sonnet 4.5 might orchestrate a comprehensive analysis while multiple Haiku 4.5 agents parallelize literature reviews, data gathering and document synthesis across dozens of sources, potentially "compressing weeks of research into hours," according to Anthropic's use case descriptions.

    Several companies have already integrated Haiku 4.5 and reported positive results. Guy Gur-Ari, co-founder of coding startup Augment, said the model "hit a sweet spot we didn't think was possible: near-frontier coding quality with blazing speed and cost efficiency." In Augment's internal testing, Haiku 4.5 achieved 90% of Sonnet 4.5's performance while matching much larger models.

    Jeff Wang, CEO of Windsurf, another coding-focused startup, said Haiku 4.5 "is blurring the lines" on traditional trade-offs between speed, cost and quality. "It's a fast frontier model that keeps costs efficient and signals where this class of models is headed."

    Jon Noronha, co-founder of presentation software company Gamma, reported that Haiku 4.5 "outperformed our current models on instruction-following for slide text generation, achieving 65% accuracy versus 44% from our premium tier model — that's a game-changer for our unit economics."

    The price of progress: What plummeting AI costs mean for enterprise strategy

    For enterprises evaluating AI strategies, Haiku 4.5 presents both opportunity and challenge. The opportunity lies in accessing sophisticated AI capabilities at dramatically lower costs, potentially making viable entire categories of applications that were previously too expensive to deploy at scale.

    The challenge is keeping pace with a technology landscape that is evolving faster than most organizations can absorb. As Krieger noted in his recent podcast appearance, companies are moving beyond "AI FOMO" to demand concrete metrics and demonstrated value. But establishing those metrics and evaluation frameworks takes time — time that may be in short supply as competitors race ahead.

    The shift from single-model deployments to multi-agent architectures also requires new ways of thinking about AI systems. Rather than viewing AI as a monolithic assistant, enterprises must learn to orchestrate multiple specialized agents, each optimized for particular tasks — more akin to managing a team than operating a tool.

    The fundamental economics of AI are shifting with remarkable speed. Five months ago, Sonnet 4's capabilities commanded premium pricing and represented the cutting edge. Today, Haiku 4.5 delivers similar performance at a third of the cost. If that trajectory continues — and both Anthropic's release schedule and competitive pressure from OpenAI and Google suggest it will — the AI capabilities that seem remarkable today may be routine and inexpensive within a year.

    For Anthropic, the challenge will be translating technical achievements into sustainable business growth while maintaining the safety-focused approach that differentiates it from competitors. The company's projected revenue growth to as much as $26 billion by 2026 suggests strong market traction, but achieving those targets will require continued innovation and successful execution across an increasingly complex product portfolio.

    Whether enterprises will choose Claude over increasingly capable alternatives from OpenAI, Google and a growing field of competitors remains an open question. But Anthropic is making a clear bet: that the future of AI belongs not to whoever builds the single most powerful model, but to whoever can deliver the right intelligence, at the right speed, at the right price — and make it accessible to everyone.

    In an industry where the promise of artificial intelligence has long outpaced reality, Anthropic is betting that delivering on that promise, faster and cheaper than anyone expected, will be enough to win. And with pricing dropping by two-thirds in just five months while performance holds steady, that promise is starting to look like reality.

  • Visa just launched a protocol to secure the AI shopping boom — here’s what it means for merchants

    Visa is introducing a new security framework designed to solve one of the thorniest problems emerging in artificial intelligence-powered commerce: how retailers can tell the difference between legitimate AI shopping assistants and the malicious bots that plague their websites.

    The payments giant unveiled its Trusted Agent Protocol on Tuesday, establishing what it describes as foundational infrastructure for "agentic commerce" — a term for the rapidly growing practice of consumers delegating shopping tasks to AI agents that can search products, compare prices, and complete purchases autonomously.

    The protocol enables merchants to cryptographically verify that an AI agent browsing their site is authorized and trustworthy, rather than a bot designed to scrape pricing data, test stolen credit cards, or carry out other fraudulent activities.

    The launch comes as AI-driven traffic to U.S. retail websites has exploded by more than 4,700% over the past year, according to data from Adobe cited by Visa. That dramatic surge has created an acute challenge for merchants whose existing bot detection systems — designed to block automated traffic — now risk accidentally blocking legitimate AI shoppers along with bad actors.

    "Merchants need additional tools that provide them with greater insight and transparency into agentic commerce activities to ensure they can participate safely," said Rubail Birwadker, Visa's Global Head of Growth, in an exclusive interview with VentureBeat. "Without common standards, potential risks include ecosystem fragmentation and the proliferation of closed loop models."

    The stakes are substantial. While 85% of shoppers who have used AI to shop report improved experiences, merchants face the prospect of either turning away legitimate AI-powered customers or exposing themselves to sophisticated bot attacks. Visa's own data shows the company prevented $40 billion in fraudulent activity between October 2022 and September 2023, nearly double the previous year, much of it involving AI-powered enumeration attacks where bots systematically test combinations of card numbers until finding valid credentials.

    Inside the cryptographic handshake: How Visa verifies AI shopping agents

    Visa's Trusted Agent Protocol operates through what Birwadker describes as a "cryptographic trust handshake" between merchants and approved AI agents. The system works in three steps:

    First, AI agents must be approved and onboarded through Visa's Intelligent Commerce program, where they undergo vetting to meet trust and reliability standards. Each approved agent receives a unique digital signature key — essentially a cryptographic credential that proves its identity.

    When an approved agent visits a merchant's website, it creates a digital signature using its key and transmits three categories of information: Agent Intent (indicating the agent is trusted and intends to retrieve product details or make a purchase), Consumer Recognition (data showing whether the underlying consumer has an existing account with the merchant), and Payment Information (optional payment data to support checkout).

    Merchants or their infrastructure providers, such as content delivery networks, then validate these digital signatures against Visa's registry of approved agents. "Upon proper validation of these fields, the merchant can confirm the signature is a trusted agent," Birwadker explained.

    Crucially, Visa designed the protocol to require minimal changes to existing merchant infrastructure. Built on the HTTP Message Signatures standard and aligned with Web Bot Auth, the protocol works with existing web infrastructure without requiring merchants to overhaul their checkout pages. "This is no-code functionality," Birwadker emphasized, though merchants may need to integrate with Visa's Developer Center to access the verification system.
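
    To make the handshake concrete, here is an illustrative sketch of the sign-and-verify pattern that signature-based agent protocols rely on: the agent signs its declared intent with a private key issued at onboarding, and the merchant validates that signature against the public key held in the registry. This is not Visa's published specification; the field names and key handling are simplified for illustration.

    ```python
    import json
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
    from cryptography.exceptions import InvalidSignature

    # Onboarding (done once by the protocol operator): issue the agent a signing key.
    agent_key = Ed25519PrivateKey.generate()
    registry = {"agent-123": agent_key.public_key()}  # merchant-side view of the approved-agent registry

    # Agent side: sign the declared intent before calling the merchant.
    payload = json.dumps({
        "agent_id": "agent-123",
        "intent": "retrieve_product_details",   # illustrative fields, not the real spec
        "consumer_ref": "existing-account-hash",
    }, sort_keys=True).encode()
    signature = agent_key.sign(payload)

    # Merchant side: verify the signature against the registry before trusting the traffic.
    try:
        registry["agent-123"].verify(signature, payload)
        print("Trusted agent: serve product details or proceed to checkout")
    except InvalidSignature:
        print("Unverified traffic: fall back to standard bot management")
    ```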

    The race for AI commerce standards: Visa faces competition from Google, OpenAI, and Stripe

    Visa developed the protocol in collaboration with Cloudflare, the web infrastructure and security company that already provides bot management services to millions of websites. The partnership reflects Visa's recognition that solving bot verification requires cooperation across the entire web stack, not just the payments layer.

    "Trusted Agent Protocol supplements traditional bot management by providing merchants insights that enable agentic commerce," Birwadker said. "Agents are providing additional context they otherwise would not, including what it intends to do, who the underlying consumer is, and payment information."

    The protocol arrives as multiple technology giants race to establish competing standards for AI commerce. Google recently introduced its Agent Payments Protocol (AP2), while OpenAI and Stripe have discussed their own approaches to enabling AI agents to make purchases. Microsoft, Shopify, Adyen, Ant International, Checkout.com, Cybersource, Elavon, Fiserv, Nuvei, and Worldpay provided feedback during Trusted Agent Protocol's development, according to Visa.

    When asked how Visa's protocol relates to these competing efforts, Birwadker struck a collaborative tone. "Both Google's AP2 and Visa's Trusted Agent Protocol are working toward the same goal of building trust in agent-initiated payments," he said. "We are engaged with Google, OpenAI, and Stripe and are looking to create compatibility across the ecosystem."

    Visa says it is working with global standards bodies including the Internet Engineering Task Force (IETF), OpenID Foundation, and EMVCo to ensure the protocol can eventually become interoperable with other emerging standards. "While these specifications apply to the Visa network in this initial phase, enabling agents to safely and securely act on a consumer's behalf requires an open, ecosystem-wide approach," Birwadker noted.

    Who pays when AI agents go rogue? Unanswered questions about liability and authorization

    The protocol raises important questions about authorization and liability when AI agents make purchases on behalf of consumers. If an agent completes an unauthorized transaction — perhaps misunderstanding a user's intent or exceeding its delegated authority — who bears responsibility?

    Birwadker emphasized that the protocol helps merchants "leverage this information to enable experiences tied to existing consumer relationships and more secure checkout," but he did not provide specific details about how disputes would be handled when agents make unauthorized purchases. Visa's existing fraud protection and chargeback systems would presumably apply, though the company has not yet published detailed guidance on agent-initiated transaction disputes.

    The protocol also places Visa in the position of gatekeeper for the emerging agentic commerce ecosystem. Because Visa determines which AI agents get approved for the Intelligent Commerce program and receive cryptographic credentials, the company effectively controls which agents merchants can easily trust. "Agents are approved and onboarded through the Visa Intelligent Commerce program, ensuring they meet our standards for trust and reliability," Birwadker said, though he did not detail the specific criteria agents must meet or whether Visa charges fees for approval.

    This gatekeeping role could prove contentious, particularly if Visa's approval process favors large technology companies over startups, or if the company faces pressure to block agents from competitors or politically controversial entities. Visa declined to provide details about how many agents it has approved so far or how long the vetting process typically takes.

    Visa's legal battles and the long road to merchant adoption

    The protocol launch comes at a complex moment for Visa, which continues to navigate significant legal and regulatory challenges even as its core business remains robust. The company's latest earnings report for the third quarter of fiscal year 2025 showed a 10% increase in net revenues to $9.2 billion, driven by resilient consumer spending and strong growth in cross-border transaction volume. For the full fiscal year ending September 30, 2024, Visa processed 289 billion transactions, with a total payments volume of $15.2 trillion.

    However, the company's legal headwinds have intensified. In July 2025, a federal judge rejected a landmark $30 billion settlement that Visa and Mastercard had reached with merchants over long-disputed credit card swipe fees, sending the parties back to the negotiating table and extending the long-running legal battle.

    Simultaneously, Visa remains under investigation by the Department of Justice over its rules for routing debit card transactions, with regulators scrutinizing whether the company's practices unlawfully limit merchant choice and stifle competition. These domestic challenges are mirrored abroad, where European regulators have continued their own antitrust investigations into the fee structures of both Visa and its primary competitor, Mastercard.

    Against this backdrop of regulatory pressure, Birwadker acknowledged that adoption of the Trusted Agent Protocol will take time. "As agentic commerce continues to rise, we recognize that consumer trust is still in its early stages," he said. "That's why our focus through 2025 is on building foundational credibility and demonstrating real-world value."

    The protocol is available immediately in Visa's Developer Center and on GitHub, with agent onboarding already active and merchant integration resources available. But Birwadker declined to provide specific targets for how many merchants might adopt the protocol by the end of 2026. "Adoption is aligned with the momentum we're already seeing," he said. "The launch of our protocol marks another big step — it's not just a technical milestone, but a signal that the industry is beginning to unify."

    Industry analysts say merchant adoption will likely depend on how quickly agentic commerce grows as a percentage of overall e-commerce. While AI-driven traffic has surged dramatically, much of that consists of agents browsing and researching rather than completing purchases. If AI agents begin accounting for a significant share of completed transactions, merchants will face stronger incentives to adopt verification systems like Visa's protocol.

    From fraud detection to AI gatekeeping: Visa's $10 billion bet on artificial intelligence

    Visa's move reflects broader strategic bets on AI across the financial services industry. The company has invested $10 billion in technology over the past five years to reduce fraud and increase network security, with AI and machine learning central to those efforts. Visa's fraud detection system analyzes over 500 different attributes for each transaction, using AI models to assign real-time risk scores to the 300 billion annual transactions flowing through its network.

    "Every single one of those transactions has been processed by AI," James Mirfin, Visa's global head of risk and identity solutions, said in a July 2024 CNBC interview discussing the company's fraud prevention efforts. "If you see a new type of fraud happening, our model will see that, it will catch it, it will score those transactions as high risk and then our customers can decide not to approve those transactions."

    The company has also moved aggressively into new payment territories beyond its core card business. In January 2025, Visa partnered with Elon Musk's X (formerly Twitter) to provide the infrastructure for a digital wallet and peer-to-peer payment service called the X Money Account, competing with services like Venmo and Zelle. That deal marked Visa's first major partnership in the social media payments space and reflected the company's recognition that payment flows are increasingly happening outside traditional e-commerce channels.

    The agentic commerce protocol represents an extension of this strategy — an attempt to ensure Visa remains central to payment flows even as the mechanics of shopping shift from direct human interaction to AI intermediation. Jack Forestell, Visa's Chief Product & Strategy Officer, framed the protocol in expansive terms: "We believe the entire payments ecosystem has a responsibility to ensure sellers trust AI agents with the same confidence they place in their most valued customers and networks."

    The coming battle for control of AI shopping

    The real test for Visa's protocol won't be technical — it will be political. As AI agents become a larger force in retail, whoever controls the verification infrastructure controls access to hundreds of billions of dollars in commerce. Visa's position as gatekeeper gives it enormous leverage, but also makes it a target.

    Merchants chafing under Visa's existing fee structure and facing multiple antitrust investigations may resist ceding even more power to the payments giant. Competitors like Google and OpenAI, each with their own ambitions in commerce, have little incentive to let Visa dictate standards. Regulators already scrutinizing Visa's market dominance will surely examine whether its agent approval process unfairly advantages certain players.

    And there's a deeper question lurking beneath the technical specifications and corporate partnerships: In an economy increasingly mediated by AI, who decides which algorithms get to spend our money? Visa is making an aggressive bid to be that arbiter, wrapping its answer in the language of security and interoperability. Whether merchants, consumers, and regulators accept that proposition will determine not just the fate of the Trusted Agent Protocol, but the structure of AI-powered commerce itself.

    For now, Visa is moving forward with the confidence of a company that has weathered disruption before. But in the emerging world of agentic commerce, being too trusted might prove just as dangerous as not being trusted enough.

  • EAGLET boosts AI agent performance on longer-horizon tasks by generating custom plans

    2025 was supposed to be the year of "AI agents," according to Nvidia CEO Jensen Huang and other AI industry figures. And it has been, in many ways, with numerous leading AI model providers such as OpenAI, Google, and even Chinese competitors like Alibaba releasing fine-tuned AI models or applications designed to focus on a narrow set of tasks, such as web search and report writing.

    But one big hurdle to a future of highly performant, reliable AI agents remains: getting them to stay on task when the task extends over many steps. Third-party benchmark tests show that even the most powerful AI models experience higher failure rates the more steps they take to complete a task and the longer they spend on it (especially as tasks stretch into hours).

    A new academic framework called EAGLET proposes a practical and efficient method to improve long-horizon task performance in LLM-based agents — without the need for manual data labeling or retraining.

    Developed by researchers from Tsinghua University, Peking University, DeepLang AI, and the University of Illinois Urbana-Champaign, EAGLET offers a "global planner" that can be integrated into existing agent workflows to reduce hallucinations and improve task efficiency.

    EAGLET is a fine-tuned language model that interprets task instructions — typically provided as prompts by the user or the agent's operating environment — and generates a high-level plan for the agent (powered by its own LLM). It does not intervene during execution, but its up-front guidance helps reduce planning errors and improve task completion rates.

    Addressing the Planning Problem in Long-Horizon Agents

    Many LLM-based agents struggle with long-horizon tasks because they rely on reactive, step-by-step reasoning. This approach often leads to trial-and-error behavior, planning hallucinations, and inefficient trajectories.

    EAGLET tackles this limitation by introducing a global planning module that works alongside the executor agent.

    Instead of blending planning and action generation in a single model, EAGLET separates them, enabling more coherent, task-level strategies.
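
    In an agent loop, that separation amounts to one extra call: the planner is invoked once up front to produce a global plan, which is then included in the executor's context at every step. The sketch below is a generic illustration of the plan-then-execute pattern, with `call_planner`, `call_executor`, and `env` standing in for whatever LLM endpoints and environment a team already uses; it is not code from the EAGLET paper.

    ```python
    def run_agent(task: str, env, call_planner, call_executor, max_steps: int = 30):
        """Plan once with a global planner, then let the executor act step by step."""
        # 1) Global planner: produces a high-level plan, then stays out of the loop.
        plan = call_planner(f"Task: {task}\nWrite a concise high-level plan.")

        history = []
        for _ in range(max_steps):
            # 2) Executor: reasons step by step, conditioned on the fixed global plan.
            action = call_executor(
                f"Task: {task}\nGlobal plan:\n{plan}\n"
                f"History so far:\n{history}\nNext action:"
            )
            observation, done = env.step(action)
            history.append((action, observation))
            if done:
                break
        return history
    ```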

    A Two-Stage Training Pipeline with No Human Annotations

    EAGLET’s planner is trained using a two-stage process that requires no human-written plans or annotations.

    The first stage involves generating synthetic plans with high-capability LLMs, such as GPT-5 and DeepSeek-V3.1-Think.

    These plans are then filtered using a novel strategy called homologous consensus filtering, which retains only those that improve task performance for both expert and novice executor agents.

    In the second stage, a rule-based reinforcement learning process further refines the planner, using a custom-designed reward function to assess how much each plan helps multiple agents succeed.

    Introducing the Executor Capability Gain Reward (ECGR)

    One of EAGLET’s key innovations is the Executor Capability Gain Reward (ECGR).

    This reward measures the value of a generated plan by checking whether it helps both high- and low-capability agents complete tasks more successfully and with fewer steps.

    It also includes a decay factor to favor shorter, more efficient task trajectories. This approach avoids over-rewarding plans that are only useful to already-competent agents and promotes more generalizable planning guidance.
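
    The exact reward formula is not reproduced in this article, but the intuition can be approximated as follows: roll out each candidate plan with executors of different capability, reward the plan by how much it lifts their success, and discount longer trajectories. The sketch below is an illustrative approximation of that logic, not the authors' implementation.

    ```python
    def ecgr_like_reward(plan, executors, run_episode, gamma: float = 0.95) -> float:
        """Approximate Executor Capability Gain Reward: average success gain across
        executors of different capability, discounted by trajectory length."""
        total = 0.0
        for executor in executors:  # e.g. one expert and one novice executor
            # run_episode returns (success score, number of steps taken).
            success_with, steps = run_episode(executor, plan=plan)
            success_without, _ = run_episode(executor, plan=None)
            gain = success_with - success_without   # how much the plan actually helped
            total += gain * (gamma ** steps)        # decay favors shorter trajectories
        return total / len(executors)
    ```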

    Compatible with Existing Agents and Models

    The EAGLET planner is designed to be modular and "plug-and-play," meaning it can be inserted into existing agent pipelines without requiring executor retraining.

    In evaluations, the planner boosted performance across a variety of foundational models, including GPT-4.1, GPT-5, Llama-3.1, and Qwen2.5.

    It also proved effective regardless of prompting strategy, working well with standard ReAct-style prompts as well as approaches like Reflexion.

    State-of-the-Art Performance Across Benchmarks

    EAGLET was tested on three widely used benchmarks for long-horizon agent tasks: ScienceWorld, which simulates scientific experiments in a text-based lab environment; ALFWorld, which tasks agents with completing household activities through natural language in a simulated home setting; and WebShop, which evaluates goal-driven behavior in a realistic online shopping interface.

    Across all three, executor agents equipped with EAGLET outperformed their non-planning counterparts and other planning baselines, including MPO and KnowAgent.

    In experiments with the open source Llama-3.1-8B-Instruct model, EAGLET boosted average performance from 39.5 to 59.4, a +19.9 point gain across tasks.

    On ScienceWorld unseen scenarios, it raised performance from 42.2 to 61.6.

    In ALFWorld seen scenarios, EAGLET improved outcomes from 22.9 to 54.3, a more than 2.3× increase in performance.

    Even stronger gains were seen with more capable models.

    For instance, GPT-4.1 improved from 75.5 to 82.2 average score with EAGLET, and GPT-5 rose from 84.5 to 88.1, despite already being strong performers.

    In some benchmarks, performance gains were as high as +11.8 points, such as when combining EAGLET with the ETO executor method on ALFWorld unseen tasks.

    Compared to other planning baselines like MPO, EAGLET consistently delivered higher task completion rates. For example, on ALFWorld unseen tasks with GPT-4.1, MPO achieved 79.1, while EAGLET scored 83.6—a +4.5 point advantage.

    Additionally, the paper reports that agents using EAGLET complete tasks in fewer steps on average. With GPT-4.1 as executor, average step count dropped from 13.0 (no planner) to 11.1 (EAGLET). With GPT-5, it dropped from 11.4 to 9.4, supporting the claim of improved execution efficiency.

    Efficiency Gains in Training and Execution

    Compared to RL-based methods like GiGPO, which can require hundreds of training iterations, EAGLET achieved better or comparable results with roughly one-eighth the training effort.

    This efficiency also carries over into execution: agents using EAGLET typically needed fewer steps to complete tasks. This translates into reduced inference time and compute cost in production scenarios.

    No Public Code—Yet

    As of the version submitted to arXiv, the authors have not released an open-source implementation of EAGLET. It is unclear if or when the code will be released, under what license, or how it will be maintained, which may limit the near-term utility of the framework for enterprise deployment.

    VentureBeat has reached out to the authors to clarify these points and will update this piece when we hear back.

    Enterprise Deployment Questions Remain

    While the planner is described as plug-and-play, it remains unclear whether EAGLET can be easily integrated into popular enterprise agent frameworks such as LangChain or AutoGen, or if it requires a custom stack to support plan-execute separation.

    Similarly, the training setup leverages multiple executor agents, which may be difficult to replicate in enterprise environments with limited model access. VentureBeat has asked the researchers whether the homologous consensus filtering method can be adapted for teams that only have access to one executor model or limited compute resources.

    EAGLET’s authors report success across model types and sizes, but it is not yet known what the minimal viable model scale is for practical deployment. For example, can enterprise teams use the planner effectively with sub-10B parameter open models in latency-sensitive environments? Additionally, the framework may offer industry-specific value in domains like customer support or IT automation, but it remains to be seen how easily the planner can be fine-tuned or customized for such verticals.

    Real-Time vs. Pre-Generated Planning

    Another open question is how EAGLET is best deployed in practice. Should the planner operate in real-time alongside executors within a loop, or is it better used offline to pre-generate global plans for known task types? Each approach has implications for latency, cost, and operational complexity. VentureBeat has posed this question to the authors and will report any insights that emerge.

    Strategic Tradeoffs for Enterprise Teams

    For technical leaders at medium-to-large enterprises, EAGLET represents a compelling proof of concept for improving the reliability and efficiency of LLM agents. But without public tooling or implementation guidelines, the framework still presents a build-versus-wait decision. Enterprises must weigh the potential gains in task performance and efficiency against the costs of reproducing or approximating the training process in-house.

    Potential Use Cases in Enterprise Settings

    For enterprises developing agentic AI systems—especially in environments requiring stepwise planning, such as IT automation, customer support, or online interactions—EAGLET offers a template for how to incorporate planning without retraining. Its ability to guide both open- and closed-source models, along with its efficient training method, may make it an appealing starting point for teams seeking to improve agent performance with minimal overhead.

  • Self-improving language models are becoming reality with MIT’s updated SEAL technique

    Researchers at the Massachusetts Institute of Technology (MIT) are gaining renewed attention for developing and open sourcing a technique that allows large language models (LLMs) — like those underpinning ChatGPT and most modern AI chatbots — to improve themselves by generating synthetic data to fine-tune upon.

    The technique, known as SEAL (Self-Adapting LLMs), was first described in a paper published back in June and covered by VentureBeat at the time.

    A significantly expanded and updated version of the paper was released last month, along with open source code posted to GitHub (under an MIT License, allowing for commercial and enterprise usage), and it is making new waves among AI power users on the social network X this week.

    SEAL allows LLMs to autonomously generate and apply their own fine-tuning strategies. Unlike conventional models that rely on fixed external data and human-crafted optimization pipelines, SEAL enables models to evolve by producing their own synthetic training data and corresponding optimization directives.

    The development comes from a team affiliated with MIT’s Improbable AI Lab, including Adam Zweiger, Jyothish Pari, Han Guo, Ekin Akyürek, Yoon Kim, and Pulkit Agrawal. Their research was recently presented at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025).

    Background: From “Beyond Static AI” to Self-Adaptive Systems

    Earlier this year, VentureBeat first reported on SEAL as an early-stage framework that allowed language models to generate and train on their own synthetic data — a potential remedy for the stagnation of pretrained models once deployed.

    At that stage, SEAL was framed as a proof-of-concept that could let enterprise AI agents continuously learn in dynamic environments without manual retraining.

    Since then, the research has advanced considerably. The new version expands on the prior framework by demonstrating that SEAL’s self-adaptation ability scales with model size, integrates reinforcement learning more effectively to reduce catastrophic forgetting, and formalizes SEAL’s dual-loop structure (inner supervised fine-tuning and outer reinforcement optimization) for reproducibility.

    The updated paper also introduces evaluations across different prompting formats, improved stability during learning cycles, and a discussion of practical deployment challenges at inference time.

    Addressing the Limitations of Static Models

    While LLMs have demonstrated remarkable capabilities in text generation and understanding, their adaptation to new tasks or knowledge is often manual, brittle, or dependent on context.

    SEAL challenges this status quo by equipping models with the ability to generate what the authors call “self-edits” — natural language outputs that specify how the model should update its weights.

    These self-edits may take the form of reformulated information, logical implications, or tool configurations for augmentation and training. Once generated, the model fine-tunes itself based on these edits. The process is guided by reinforcement learning, where the reward signal comes from improved performance on a downstream task.

    The design mimics how human learners might rephrase or reorganize study materials to better internalize information. This restructuring of knowledge before assimilation serves as a key advantage over models that passively consume new data “as-is.”

    Performance Across Tasks

    SEAL has been tested across two main domains: knowledge incorporation and few-shot learning.

    In the knowledge incorporation setting, the researchers evaluated how well a model could internalize new factual content from passages similar to those in the SQuAD dataset, a benchmark reading comprehension dataset introduced by Stanford University in 2016, consisting of over 100,000 crowd-sourced question–answer pairs based on Wikipedia articles (Rajpurkar et al., 2016).

    Rather than fine-tuning directly on passage text, the model generated synthetic implications of the passage and then fine-tuned on them.

    After two rounds of reinforcement learning, the model improved question-answering accuracy from 33.5% to 47.0% on a no-context version of SQuAD — surpassing results obtained using synthetic data generated by GPT-4.1.

    In the few-shot learning setting, SEAL was evaluated using a subset of the ARC benchmark, where tasks require reasoning from only a few examples. Here, SEAL generated self-edits specifying data augmentations and hyperparameters.

    After reinforcement learning, the success rate in correctly solving held-out tasks jumped to 72.5%, up from 20% using self-edits generated without reinforcement learning. Models that relied solely on in-context learning without any adaptation scored 0%.

    Technical Framework

    SEAL operates using a two-loop structure: an inner loop performs supervised fine-tuning based on the self-edit, while an outer loop uses reinforcement learning to refine the policy that generates those self-edits.

    The reinforcement learning algorithm used is based on ReSTEM, which combines sampling with filtered behavior cloning. During training, only self-edits that lead to performance improvements are reinforced. This approach effectively teaches the model which kinds of edits are most beneficial for learning.

    For efficiency, SEAL applies LoRA-based fine-tuning rather than full parameter updates, enabling rapid experimentation and low-cost adaptation.
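
    Put together, a simplified view of the two loops looks like the following: sample several candidate self-edits per task, apply each as a cheap LoRA fine-tune, keep only the edits that improve the downstream score, and then train the self-edit policy on the survivors (the filtered behavior cloning at the heart of ReSTEM). This is a compressed pseudocode sketch, with `generate_self_edits`, `lora_finetune`, and `evaluate` passed in as stand-ins rather than the paper's actual implementation.

    ```python
    def seal_outer_loop(model, tasks, generate_self_edits, lora_finetune, evaluate,
                        n_candidates: int = 4, rounds: int = 2):
        """Outer RL loop (ReSTEM-style): reinforce only the self-edits that help."""
        for _ in range(rounds):
            kept = []
            for task in tasks:
                baseline = evaluate(model, task)
                # Inner loop: each candidate self-edit is applied as a LoRA update and scored.
                for edit in generate_self_edits(model, task, n_candidates):
                    candidate = lora_finetune(model, edit)
                    if evaluate(candidate, task) > baseline:   # filter: keep only useful edits
                        kept.append((task, edit))
            # Filtered behavior cloning: teach the model which kinds of edits pay off.
            model = lora_finetune(model, kept)
        return model
    ```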

    Strengths and Limitations

    The researchers report that SEAL can produce high-utility training data with minimal supervision, outperforming even large external models like GPT-4.1 in specific tasks.

    They also demonstrate that SEAL generalizes beyond its original setup: it continues to perform well when scaling from single-pass updates to multi-document continued pretraining scenarios.

    However, the framework is not without limitations. One issue is catastrophic forgetting, where updates to incorporate new information can degrade performance on previously learned tasks.

    In response to this concern, co-author Jyo Pari told VentureBeat via email that reinforcement learning (RL) appears to mitigate forgetting more effectively than standard supervised fine-tuning (SFT), citing a recent paper on the topic. He added that combining this insight with SEAL could lead to new variants where SEAL learns not just training data, but reward functions.

    Another challenge is computational overhead: evaluating each self-edit requires fine-tuning and performance testing, which can take 30–45 seconds per edit — significantly more than standard reinforcement learning tasks.

    As Jyo explained, “Training SEAL is non-trivial because it requires 2 loops of optimization, an outer RL one and an inner SFT one. At inference time, updating model weights will also require new systems infrastructure.” He emphasized the need for future research into deployment systems as a critical path to making SEAL practical.

    Additionally, SEAL’s current design assumes the presence of paired tasks and reference answers for every context, limiting its direct applicability to unlabeled corpora. However, Jyo clarified that as long as there is a downstream task with a computable reward, SEAL can be trained to adapt accordingly—even in safety-critical domains. In principle, a SEAL-trained model could learn to avoid training on harmful or malicious inputs if guided by the appropriate reward signal.

    AI Community Reactions

    The AI research and builder community has reacted with a mix of excitement and speculation to the SEAL paper. On X, formerly Twitter, several prominent AI-focused accounts weighed in on the potential impact.

    User @VraserX, a self-described educator and AI enthusiast, called SEAL “the birth of continuous self-learning AI” and predicted that models like OpenAI's GPT-6 could adopt a similar architecture.

    In their words, SEAL represents “the end of the frozen-weights era,” ushering in systems that evolve as the world around them changes.

    They highlighted SEAL's ability to form persistent memories, repair knowledge, and learn from real-time data, comparing it to a foundational step toward models that don’t just use information but absorb it.

    Meanwhile, @alex_prompter, co-founder of an AI-powered marketing venture, framed SEAL as a leap toward models that literally rewrite themselves. “MIT just built an AI that can rewrite its own code to get smarter,” he wrote. Citing the paper’s key results — a 40% boost in factual recall and outperforming GPT-4.1 using self-generated data — he described the findings as confirmation that “LLMs that finetune themselves are no longer sci-fi.”

    The enthusiasm reflects a broader appetite in the AI space for models that can evolve without constant retraining or human oversight — particularly in rapidly changing domains or personalized use cases.

    Future Directions and Open Questions

    In response to questions about scaling SEAL to larger models and tasks, Jyo pointed to experiments (Appendix B.7) showing that as model size increases, so does their self-adaptation ability. He compared this to students improving their study techniques over time — larger models are simply better at generating useful self-edits.

    When asked whether SEAL generalizes to new prompting styles, he confirmed it does, citing Table 10 in the paper. However, he also acknowledged that the team has not yet tested SEAL’s ability to transfer across entirely new domains or model architectures.

    “SEAL is an initial work showcasing the possibilities,” he said. “But it requires much more testing.” He added that generalization may improve as SEAL is trained on a broader distribution of tasks.

    Interestingly, the team found that only a few reinforcement learning steps already led to measurable performance gains. “This is exciting,” Jyo noted, “because it means that with more compute, we could hopefully get even more improvements.” He suggested future experiments could explore more advanced reinforcement learning methods beyond ReSTEM, such as Group Relative Policy Optimization (GRPO).

    Toward More Adaptive and Agentic Models

    SEAL represents a step toward models that can autonomously improve over time, both by integrating new knowledge and by reconfiguring how they learn. The authors envision future extensions where SEAL could assist in self-pretraining, continual learning, and the development of agentic systems — models that interact with evolving environments and adapt incrementally.

    In such settings, a model could use SEAL to synthesize weight updates after each interaction, gradually internalizing behaviors or insights. This could reduce the need for repeated supervision and manual intervention, particularly in data-constrained or specialized domains.

    As public web text becomes saturated and further scaling of LLMs becomes bottlenecked by data availability, self-directed approaches like SEAL could play a critical role in pushing the boundaries of what LLMs can achieve.

    You can access the SEAL project, including code and further documentation, at: https://jyopari.github.io/posts/seal

  • Researchers find that retraining only small parts of AI models can cut costs and prevent forgetting

    Enterprises often find that fine-tuning, an effective approach to making a large language model (LLM) fit for purpose and grounded in their data, comes at a cost: the model loses some of its abilities. After fine-tuning, some models “forget” how to perform tasks they had already learned. 

    Research from the University of Illinois Urbana-Champaign proposes a new method for retraining models that avoids “catastrophic forgetting,” in which the model loses some of its prior knowledge. The paper focuses on two vision-language models that generate responses from images: LLaVA and Qwen 2.5-VL.

    The approach encourages enterprises to retrain only narrow parts of an LLM to avoid retraining the entire model and incurring a significant increase in compute costs. The team claims that catastrophic forgetting isn’t true memory loss, but rather a side effect of bias drift. 

    “Training a new LMM can cost millions of dollars, weeks of time, and emit hundreds of tons of CO2, so finding ways to more efficiently and effectively update existing models is a pressing concern,” the team wrote in the paper. “Guided by this result, we explore tuning recipes that preserve learning while limiting output shift.”

    The researchers focused on the multi-layer perceptron (MLP), the feed-forward block inside each transformer layer that helps shape the model's outputs. 

    Catastrophic forgetting 

    The researchers wanted first to verify the existence and the cause of catastrophic forgetting in models. 

    To do this, they created a set of target tasks for the models to complete. The models were then fine-tuned on those tasks and evaluated to determine whether the fine-tuning led to substantial forgetting. But as the process went on, the researchers found that the models recovered some of their abilities. 

    “We also noticed a surprising result, that the model performance would drop significantly in held out benchmarks after training on the counting task, it would mostly recover on PathVQA, another specialized task that is not well represented in the benchmarks,” they said. “Meanwhile, while performing the forgetting mitigation experiments, we also tried separately tuning only the self-attention projection (SA Proj) or MLP layers, motivated by the finding that tuning only the LLM was generally better than tuning the full model. This led to another very surprising result – that tuning only self-attention projection layers led to very good learning of the target tasks with no drop in performance in held out tasks, even after training all five target tasks in a sequence.”

    The researchers said they believe that “what looks like forgetting or interference after fine-tuning on a narrow target task is actually bias in the output distribution due to the task distribution shift.”

    Narrow retraining

    That finding turned out to be the key to the experiment. The researchers noted that tuning the MLP increases the likelihood of “outputting numeric tokens and a highly correlated drop in held out task accuracy.” What this showed is that the model's apparent loss of knowledge is temporary rather than permanent. 

    “To avoid biasing the output distribution, we tune the MLP up/gating projections while keeping the down projection frozen, and find that it achieves similar learning to full MLP tuning with little forgetting,” the researchers said. 
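
    As a rough illustration of that recipe, the snippet below freezes every parameter except the MLP up and gating projections. The module names (mlp.gate_proj, mlp.up_proj, mlp.down_proj) follow common LLaMA-style Hugging Face checkpoints and the model name is hypothetical, so treat this as a sketch of the idea rather than the authors' code.

    ```python
    # Sketch: tune only the MLP up/gating projections; keep down projections (and
    # everything else) frozen. Module names follow LLaMA-style Hugging Face models.
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained("your-org/llama-style-checkpoint")  # hypothetical

    for name, param in model.named_parameters():
        # Trainable: up and gating projections inside each MLP block.
        # Frozen: down projections, attention, embeddings and everything else.
        param.requires_grad = ("mlp.gate_proj" in name) or ("mlp.up_proj" in name)

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"Trainable parameters: {trainable:,}")
    ```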

    This allows for a more straightforward and more reproducible method for fine-tuning a model. 

    By focusing on a narrow segment of the model, rather than a wholesale retraining, enterprises can cut compute costs. It also allows better control of output drift. 

    However, the research covers only two models, both dealing with vision and language. The researchers noted that, due to limited resources, they were unable to run the experiment on other models.

    They suggest, however, that the findings may extend to other LLMs, including models built for other modalities. 

  • Here’s what’s slowing down your AI strategy — and how to fix it

    Your best data science team just spent six months building a model that predicts customer churn with 90% accuracy. It’s sitting on a server, unused. Why? Because it’s been stuck in a lengthy risk review queue, waiting for a committee that doesn’t understand stochastic models to sign off. This isn’t a hypothetical — it’s the daily reality in most large companies.

    In AI, the models move at internet speed. Enterprises don’t.

    Every few weeks, a new model family drops, open-source toolchains mutate and entire MLOps practices get rewritten. But in most companies, anything touching production AI has to pass through risk reviews, audit trails, change-management boards and model-risk sign-off. The result is a widening velocity gap: The research community accelerates; the enterprise stalls.

    This gap isn’t a headline problem like “AI will take your job.” It’s quieter and more expensive: missed productivity, shadow AI sprawl, duplicated spend and compliance drag that turns promising pilots into perpetual proofs-of-concept.

    The numbers say the quiet part out loud

    Two trends collide. First, the pace of innovation: Industry is now the dominant force, producing the vast majority of notable AI models, according to Stanford's 2024 AI Index Report. The core inputs for this innovation are compounding at a historic rate, with training compute needs doubling every few years. That pace all but guarantees rapid model churn and tool fragmentation.

    Second, enterprise adoption is accelerating. According to IBM’s Global AI Adoption Index, 42% of enterprise-scale companies have actively deployed AI, with many more actively exploring it. Yet the same surveys show governance roles are only now being formalized, leaving many companies to retrofit controls after deployment.

    Layer on new regulation. The EU AI Act’s staged obligations are locked in — unacceptable-risk bans are already active and General Purpose AI (GPAI) transparency duties hit in mid-2025, with high-risk rules following. Brussels has made clear there’s no pause coming. If your governance isn’t ready, your roadmap will be.

    The real blocker isn't modeling, it's audit

    In most enterprises, the slowest step isn’t fine-tuning a model; it’s proving that the model complies with your policies and producing the evidence to show it.

    Three frictions dominate:

    1. Audit debt: Policies were written for static software, not stochastic models. You can ship a microservice with unit tests; you can’t “unit test” fairness drift without data access, lineage and ongoing monitoring. When controls don’t map, reviews balloon.

    2. MRM overload: Model risk management (MRM), a discipline perfected in banking, is spreading beyond finance — often translated literally, not functionally. Explainability and data-governance checks make sense; forcing every retrieval-augmented chatbot through credit-risk-style documentation does not.

    3. Shadow AI sprawl: Teams adopt vertical AI inside SaaS tools without central oversight. It feels fast — until the third audit asks who owns the prompts, where embeddings live and how to revoke data. Sprawl is speed’s illusion; integration and governance are the long-term velocity.

    Frameworks exist, but they're not operational by default

    The NIST AI Risk Management Framework is a solid north star: govern, map, measure, manage. It’s voluntary, adaptable and aligned with international standards. But it’s a blueprint, not a building. Companies still need concrete control catalogs, evidence templates and tooling that turn principles into repeatable reviews.

    Similarly, the EU AI Act sets deadlines and duties. It doesn’t install your model registry, wire your dataset lineage or resolve the age-old question of who signs off when accuracy and bias trade off. That’s on you, and soon.

    What winning enterprises are doing differently

    The leaders I see closing the velocity gap aren’t chasing every model; they’re making the path to production routine. Five moves show up again and again:

    1. Ship a control plane, not a memo: Codify governance as code. Create a small library or service that enforces non-negotiables: Dataset lineage required, evaluation suite attached, risk tier chosen, PII scan passed, human-in-the-loop defined (if required). If a project can’t satisfy the checks, it can’t deploy (a minimal code sketch follows this list).

    2. Pre-approve patterns: Approve reference architectures — “GPAI with retrieval augmented generation (RAG) on approved vector store,” “high-risk tabular model with feature store X and bias audit Y,” “vendor LLM via API with no data retention.” Pre-approval shifts review from bespoke debates to pattern conformance. (Your auditors will thank you.)

    3. Stage your governance by risk, not by team: Tie review depth to use-case criticality (safety, finance, regulated outcomes). A marketing copy assistant shouldn’t endure the same gauntlet as a loan adjudicator. Risk-proportionate review is both defensible and fast.

    4. Create an “evidence once, reuse everywhere” backbone: Centralize model cards, eval results, data sheets, prompt templates and vendor attestations. Every subsequent audit should start at 60% done because you’ve already proven the common pieces.

    5. Make audit a product: Give legal, risk and compliance a real roadmap. Instrument dashboards that show: Models in production by risk tier, upcoming re-evals, incidents and data-retention attestations. If audit can self-serve, engineering can ship.
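
    To make the first move concrete, here is a minimal sketch of governance as code: a CI step that refuses to deploy a model unless its manifest carries the non-negotiables above. The manifest schema, field names and file format are illustrative assumptions, not a specific product or standard.

    ```python
    # Minimal sketch of a governance-as-code CI check (hypothetical manifest schema).
    import sys
    import yaml  # pip install pyyaml

    REQUIRED_FIELDS = [
        "dataset_lineage",    # where the training data came from
        "evaluation_suite",   # which evals ran and where the results live
        "risk_tier",          # e.g. "low", "limited", "high"
        "pii_scan_passed",    # boolean attestation
    ]

    def check_manifest(path):
        """Return a list of governance violations for a model's deployment manifest."""
        with open(path) as f:
            manifest = yaml.safe_load(f) or {}
        problems = [f"missing: {field}" for field in REQUIRED_FIELDS if field not in manifest]
        if manifest.get("risk_tier") == "high" and not manifest.get("human_in_the_loop"):
            problems.append("high-risk model has no human_in_the_loop owner")
        return problems

    if __name__ == "__main__":
        violations = check_manifest(sys.argv[1])
        if violations:
            print("Deployment blocked:", *violations, sep="\n  - ")
            sys.exit(1)   # fail the pipeline; the project cannot deploy
        print("Governance checks passed.")
    ```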

    A pragmatic cadence for the next 12 months

    If you’re serious about catching up, pick a 12-month governance sprint:

    • Quarter 1: Stand up a minimal AI registry (models, datasets, prompts, evaluations). Draft risk-tiering and control mapping aligned to NIST AI RMF functions; publish two pre-approved patterns.

    • Quarter 2: Turn controls into pipelines (CI checks for evals, data scans, model cards). Convert two fast-moving teams from shadow AI to platform AI by making the paved road easier than the side road.

    • Quarter 3: Pilot a GxP-style review (a rigorous documentation standard from life sciences) for one high-risk use case; automate evidence capture. Start your EU AI Act gap analysis if you touch Europe; assign owners and deadlines.

    • Quarter 4: Expand your pattern catalog (RAG, batch inference, streaming prediction). Roll out dashboards for risk/compliance. Bake governance SLAs into your OKRs.

      By this point, you haven’t slowed down innovation — you’ve standardized it. The research community can keep moving at light speed; you can keep shipping at enterprise speed — without the audit queue becoming your critical path.

    The competitive edge isn't the next model — it's the next mile

    It’s tempting to chase each week’s leaderboard. But the durable advantage is the mile between a paper and production: The platform, the patterns, the proofs. That’s what your competitors can’t copy from GitHub, and it’s the only way to keep velocity without trading compliance for chaos.

    In other words: Make governance the grease, not the grit.

    Jayachander Reddy Kandakatla is senior machine learning operations (MLOps) engineer at Ford Motor Credit Company.

  • We keep talking about AI agents, but do we ever know what they are?

    Imagine you do two things on a Monday morning.

    First, you ask a chatbot to summarize your new emails. Next, you ask an AI tool to figure out why your top competitor grew so fast last quarter. The AI silently gets to work. It scours financial reports, news articles and social media sentiment. It cross-references that data with your internal sales numbers, drafts a strategy outlining three potential reasons for the competitor's success and schedules a 30-minute meeting with your team to present its findings.

    We're calling both of these "AI agents," but they represent worlds of difference in intelligence, capability and the level of trust we place in them. This ambiguity creates a fog that makes it difficult to build, evaluate, and safely govern these powerful new tools. If we can't agree on what we're building, how can we know when we've succeeded?

    This post won't try to sell you on yet another definitive framework. Instead, think of it as a survey of the current landscape of agent autonomy, a map to help us all navigate the terrain together.

    What are we even talking about? Defining an "AI agent"

    Before we can measure an agent's autonomy, we need to agree on what an "agent" actually is. The most widely accepted starting point comes from the foundational textbook on AI, Stuart Russell and Peter Norvig’s “Artificial Intelligence: A Modern Approach.” 

    They define an agent as anything that can be viewed as perceiving its environment through sensors and acting upon that environment through actuators. A thermostat is a simple agent: Its sensor perceives the room temperature, and its actuator acts by turning the heat on or off.

    ReAct Model for AI Agents (Credit: Confluent)

    That classic definition provides a solid mental model. For today's technology, we can translate it into four key components that make up a modern AI agent:

    1. Perception (the "senses"): This is how an agent takes in information about its digital or physical environment. It's the input stream that allows the agent to understand the current state of the world relevant to its task.

    2. Reasoning engine (the "brain"): This is the core logic that processes the perceptions and decides what to do next. For modern agents, this is typically powered by a large language model (LLM). The engine is responsible for planning, breaking down large goals into smaller steps, handling errors and choosing the right tools for the job.

    3. Action (the "hands"): This is how an agent affects its environment to move closer to its goal. The ability to take action via tools is what gives an agent its power.

    4. Goal/objective: This is the overarching task or purpose that guides all of the agent's actions. It is the "why" that turns a collection of tools into a purposeful system. The goal can be simple ("Find the best price for this book") or complex ("Launch the marketing campaign for our new product").

    Putting it all together, a true agent is a full-body system. The reasoning engine is the brain, but it’s useless without the senses (perception) to understand the world and the hands (actions) to change it. This complete system, all guided by a central goal, is what creates genuine agency.
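
    For readers who think in code, a minimal sketch of those four components wired into a loop is below; perceive, reason and the tools dictionary are caller-supplied placeholders for whatever LLM and integrations a real agent would use, so this shows the shape of an agent rather than any particular framework.

    ```python
    # Minimal sketch of an agent loop: perception, reasoning engine, actions, goal.
    # `perceive`, `reason` and the entries of `tools` are caller-supplied placeholders.
    def run_agent(goal, perceive, reason, tools, max_steps=10):
        for _ in range(max_steps):
            observation = perceive()                    # Perception: read the environment
            decision = reason(goal, observation)        # Reasoning engine: typically an LLM call
            if decision.get("done"):                    # Goal reached: stop acting
                return decision.get("result")
            action = tools[decision["tool"]]            # Action: the chosen tool...
            action(**decision.get("arguments", {}))     # ...changes the environment
        return "Stopped: step budget exhausted before the goal was reached."
    ```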

    With these components in mind, the distinction we made earlier becomes clear. A standard chatbot isn't a true agent. It perceives your question and acts by providing an answer, but it lacks an overarching goal and the ability to use external tools to accomplish it.

    An agent, on the other hand, is software that has agency. 

    It has the capacity to act independently and dynamically toward a goal. And it's this capacity that makes a discussion about the levels of autonomy so important.

    Learning from the past: How we learned to classify autonomy

    The dizzying pace of AI can make it feel like we're navigating uncharted territory. But when it comes to classifying autonomy, we’re not starting from scratch. Other industries have been working on this problem for decades, and their playbooks offer powerful lessons for the world of AI agents.

    The core challenge is always the same: How do you create a clear, shared language for the gradual handover of responsibility from a human to a machine?

    SAE levels of driving automation

    Perhaps the most successful framework comes from the automotive industry. The SAE J3016 standard defines six levels of driving automation, from Level 0 (fully manual) to Level 5 (fully autonomous).

    The SAE J3016 Levels of Driving Automation (Credit: SAE International)

    What makes this model so effective isn't its technical detail, but its focus on two simple concepts:

    1. Dynamic driving task (DDT): This is everything involved in the real-time act of driving: steering, braking, accelerating and monitoring the road.

    2. Operational design domain (ODD): These are the specific conditions under which the system is designed to work. For example, "only on divided highways" or "only in clear weather during the daytime."

    The question for each level is simple: Who is doing the DDT, and what is the ODD? 

    At Level 2, the human must supervise at all times. At Level 3, the car handles the DDT within its ODD, but the human must be ready to take over. At Level 4, the car can handle everything within its ODD, and if it encounters a problem, it can safely pull over on its own.

    The key insight for AI agents: A robust framework isn't about the sophistication of the AI "brain." It's about clearly defining the division of responsibility between human and machine under specific, well-defined conditions.

    Aviation's 10 Levels of Automation

    While the SAE’s six levels are great for broad classification, aviation offers a more granular model for systems designed for close human-machine collaboration. The Parasuraman, Sheridan, and Wickens model proposes a detailed 10-level spectrum of automation.

    Levels of Automation of Decision and Action Selection for Aviation (Credit: The MITRE Corporation)

    This framework is less about full autonomy and more about the nuances of interaction. For example:

    • At Level 3, the computer "narrows the selection down to a few" for the human to choose from.

    • At Level 6, the computer "allows the human a restricted time to veto before it executes" an action.

    • At Level 9, the computer "informs the human only if it, the computer, decides to."

    The key insight for AI agents: This model is perfect for describing the collaborative "centaur" systems we're seeing today. Most AI agents won't be fully autonomous (Level 10) but will exist somewhere on this spectrum, acting as a co-pilot that suggests, executes with approval or acts with a veto window.

    Robotics and unmanned systems

    Finally, the world of robotics brings in another critical dimension: context. The National Institute of Standards and Technology's (NIST) Autonomy Levels for Unmanned Systems (ALFUS) framework was designed for systems like drones and industrial robots.

    The Three-Axis Model for ALFUS (Credit: NIST)

    Its main contribution is adding context to the definition of autonomy, assessing it along three axes:

    1. Human independence: How much human supervision is required?

    2. Mission complexity: How difficult or unstructured is the task?

    3. Environmental complexity: How predictable and stable is the environment in which the agent operates?

    The key insight for AI agents: This framework reminds us that autonomy isn't a single number. An agent performing a simple task in a stable, predictable digital environment (like sorting files in a single folder) is fundamentally less autonomous than an agent performing a complex task across the chaotic, unpredictable environment of the open internet, even if the level of human supervision is the same.

    The emerging frameworks for AI agents

    Having looked at the lessons from automotive, aviation and robotics, we can now examine the emerging frameworks designed for AI agents. While the field is still new and no single standard has won out, most proposals fall into three distinct, but often overlapping, categories based on the primary question they seek to answer.

    Category 1: The "What can it do?" frameworks (capability-focused)

    These frameworks classify agents based on their underlying technical architecture and what they are capable of achieving. They provide a roadmap for developers, outlining a progression of increasingly sophisticated technical milestones that often correspond directly to code patterns.

    A prime example of this developer-centric approach comes from Hugging Face. Their framework uses a star rating to show the gradual shift in control from human to AI:

    Five Levels of AI Agent Autonomy, as proposed by Hugging Face (Credit: Hugging Face)

    • Zero stars (simple processor): The AI has no impact on the program's flow. It simply processes information and its output is displayed, like a print statement. The human is in complete control.

    • One star (router): The AI makes a basic decision that directs program flow, like choosing between two predefined paths (if/else). The human still defines how everything is done.

    • Two stars (tool call): The AI chooses which predefined tool to use and what arguments to use with it. The human has defined the available tools, but the AI decides how to execute them.

    • Three stars (multi-step agent): The AI now controls the iteration loop. It decides which tool to use, when to use it and whether to continue working on the task.

    • Four stars (fully autonomous): The AI can generate and execute entirely new code to accomplish a goal, going beyond the predefined tools it was given.
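
    As a rough illustration of how the lower star levels map to code patterns (all helper functions and prompts here are hypothetical, not Hugging Face's implementation):

    ```python
    # Illustrative only: what the first three star levels tend to look like in code.
    # `llm`, the handlers, `tools` and `parse_tool_call` are hypothetical helpers.
    def zero_stars(llm, text):
        print(llm(f"Summarize: {text}"))                 # output only; no effect on program flow

    def one_star_router(llm, ticket, handle_refund, handle_general):
        # The AI picks between two predefined paths; the human defined both paths.
        if "yes" in llm(f"Is this a refund request? Answer yes or no: {ticket}").lower():
            handle_refund(ticket)
        else:
            handle_general(ticket)

    def two_star_tool_call(llm, question, tools, parse_tool_call):
        # The AI chooses which predefined tool to call and with what arguments.
        name, args = parse_tool_call(llm(f"Choose a tool and JSON arguments for: {question}"))
        return tools[name](**args)
    ```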

    Strengths: This model is excellent for engineers. It's concrete, maps directly to code and clearly benchmarks the transfer of executive control to the AI. 

    Weaknesses: It is highly technical and less intuitive for non-developers trying to understand an agent's real-world impact.

    Category 2: The "How do we work together?" frameworks (interaction-focused)

    This second category defines autonomy not by the agent’s internal skills, but by the nature of its relationship with the human user. The central question is: Who is in control, and how do we collaborate?

    This approach often mirrors the nuance we saw in the aviation models. For instance, a framework detailed in the paper Levels of Autonomy for AI Agents defines levels based on the user's role:

    • L1 – user as an operator: The human is in direct control (like a person using Photoshop with AI-assist features).

    • L4 – user as an approver: The agent proposes a full plan or action, and the human must give a simple "yes" or "no" before it proceeds.

    • L5 – user as an observer: The agent has full autonomy to pursue a goal and simply reports its progress and results back to the human.

    Levels of Autonomy for AI Agents

    Strengths: These frameworks are highly intuitive and user-centric. They directly address the critical issues of control, trust, and oversight.

    Weaknesses: An agent with simple capabilities and one with highly advanced reasoning could both fall into the "Approver" level, so this approach can sometimes obscure the underlying technical sophistication.

    Category 3: The "Who is responsible?" frameworks (governance-focused)

    The final category is less concerned with how an agent works and more with what happens when it fails. These frameworks are designed to help answer crucial questions about law, safety and ethics.

    Think tanks like Germany's Stiftung Neue Verantwortung have analyzed AI agents through the lens of legal liability. Their work aims to classify agents in a way that helps regulators determine who is responsible for an agent's actions: The user who deployed it, the developer who built it or the company that owns the platform it runs on?

    This perspective is essential for navigating complex regulations like the EU's Artificial Intelligence Act, which will treat AI systems differently based on the level of risk they pose.

    Strengths: This approach is absolutely essential for real-world deployment. It forces the difficult but necessary conversations about accountability that build public trust.

    Weaknesses: It's more of a legal or policy guide than a technical roadmap for developers.

    A comprehensive understanding requires looking at all three questions at once: An agent's capabilities, how we interact with it and who is responsible for the outcome.

    Identifying the gaps and challenges

    Looking at the landscape of autonomy frameworks shows us that no single model is sufficient because the true challenges lie in the gaps between them, in areas that are incredibly difficult to define and measure.

    What is the "Road" for a digital agent?

    The SAE framework for self-driving cars gave us the powerful concept of an ODD, the specific conditions under which a system can operate safely. For a car, that might be "divided highways, in clear weather, during the day." This is a great solution for a physical environment, but what’s the ODD for a digital agent?

    The "road" for an agent is the entire internet. An infinite, chaotic and constantly changing environment. Websites get redesigned overnight, APIs are deprecated and social norms in online communities shift. 

    How do we define a "safe" operational boundary for an agent that can browse websites, access databases and interact with third-party services? Answering this is one of the biggest unsolved problems. Without a clear digital ODD, we can't make the same safety guarantees that are becoming standard in the automotive world.

    This is why, for now, the most effective and reliable agents operate within well-defined, closed-world scenarios. As I argued in a recent VentureBeat article, forgetting the open-world fantasies and focusing on "bounded problems" is the key to real-world success. This means defining a clear, limited set of tools, data sources and potential actions. 

    Beyond simple tool use

    Today's agents are getting very good at executing straightforward plans. If you tell one to "find the price of this item using Tool A, then book a meeting with Tool B," it can often succeed. But true autonomy requires much more. 

    Many systems today hit a technical wall when faced with tasks that require:

    • Long-term reasoning and planning: Agents struggle to create and adapt complex, multi-step plans in the face of uncertainty. They can follow a recipe, but they can't yet invent one from scratch when things go wrong.

    • Robust self-correction: What happens when an API call fails or a website returns an unexpected error? A truly autonomous agent needs the resilience to diagnose the problem, form a new hypothesis and try a different approach, all without a human stepping in.

    • Composability: The future likely involves not one agent, but a team of specialized agents working together. Getting them to collaborate reliably, to pass information back and forth, delegate tasks and resolve conflicts is a monumental software engineering challenge that we are just beginning to tackle.

    The elephant in the room: Alignment and control

    This is the most critical challenge of all, because it's not just technical, it's deeply human. Alignment is the problem of ensuring an agent's goals and actions are consistent with our intentions and values, even when those values are complex, unstated or nuanced.

    Imagine you give an agent the seemingly harmless goal of "maximizing customer engagement for our new product." The agent might correctly determine that the most effective strategy is to send a dozen notifications a day to every user. The agent has achieved its literal goal perfectly, but it has violated the unstated, common-sense goal of "don't be incredibly annoying."

    This is a failure of alignment.

    The core difficulty, which organizations like the AI Alignment Forum are dedicated to studying, is that it is incredibly hard to specify fuzzy, complex human preferences in the precise, literal language of code. As agents become more powerful, ensuring they are not just capable but also safe, predictable and aligned with our true intent becomes the most important challenge we face.

    The future is agentic (and collaborative)

    The path forward for AI agents is not a single leap to a god-like super-intelligence, but a more practical and collaborative journey. The immense challenges of open-world reasoning and perfect alignment mean that the future is a team effort.

    We will see less of the single, all-powerful agent and more of an "agentic mesh" — a network of specialized agents, each operating within a bounded domain, working together to tackle complex problems. 

    More importantly, they will work with us. The most valuable and safest applications will keep a human on the loop, casting them as a co-pilot or strategist to augment our intellect with the speed of machine execution. This "centaur" model will be the most effective and responsible path forward.

    The frameworks we've explored aren’t just theoretical. They’re practical tools for building trust, assigning responsibility and setting clear expectations. They help developers define limits and leaders shape vision, laying the groundwork for AI to become a dependable partner in our work and lives.

    Sean Falconer is Confluent's AI entrepreneur in residence.

  • Is vibe coding ruining a generation of engineers?

    AI tools are revolutionizing software development by automating repetitive tasks, refactoring bloated code, and identifying bugs in real-time. Developers can now generate well-structured code from plain language prompts, saving hours of manual effort. These tools learn from vast codebases, offering context-aware recommendations that enhance productivity and reduce errors. Rather than starting from scratch, engineers can prototype quickly, iterate faster and focus on solving increasingly complex problems.

    As code generation tools grow in popularity, they raise questions about the future size and structure of engineering teams. Earlier this year, Garry Tan, CEO of startup accelerator Y Combinator, noted that about one-quarter of its current clients use AI to write 95% or more of their software. In an interview with CNBC, Tan said: “What that means for founders is that you don’t need a team of 50 or 100 engineers, you don’t have to raise as much. The capital goes much longer.”

    AI-powered coding may offer a fast solution for businesses under budget pressure — but its long-term effects on the field and labor pool cannot be ignored.

    As AI-powered coding rises, human expertise may diminish

    In the era of AI, the traditional journey to coding expertise that has long supported senior developers may be at risk. Easy access to large language models (LLMs) enables junior coders to quickly identify issues in code. While this speeds up software development, it can distance developers from their own work, delaying the growth of core problem-solving skills. As a result, they may avoid the focused, sometimes uncomfortable hours required to build expertise and progress on the path to becoming successful senior developers.

    Consider Anthropic’s Claude Code, a terminal-based assistant built on the Claude 3.7 Sonnet model, which automates bug detection and resolution, test creation and code refactoring. Using natural language commands, it reduces repetitive manual work and boosts productivity.

    Microsoft has also released two open-source frameworks — AutoGen and Semantic Kernel — to support the development of agentic AI systems. AutoGen enables asynchronous messaging, modular components, and distributed agent collaboration to build complex workflows with minimal human input. Semantic Kernel is an SDK that integrates LLMs with languages like C#, Python and Java, letting developers build AI agents to automate tasks and manage enterprise applications.

    The increasing availability of these tools from Anthropic, Microsoft and others may reduce opportunities for coders to refine and deepen their skills. Rather than “banging their heads against the wall” to debug a few lines or select a library to unlock new features, junior developers may simply turn to AI for an assist. This means senior coders with problem-solving skills honed over decades may become an endangered species.

    Overreliance on AI for writing code risks weakening developers’ hands-on experience and understanding of key programming concepts. Without regular practice, they may struggle to independently debug, optimize or design systems. Ultimately, this erosion of skill can undermine critical thinking, creativity and adaptability — qualities that are essential not just for coding, but for assessing the quality and logic of AI-generated solutions.

    AI as mentor: Turning code automation into hands-on learning

    While concerns about AI diminishing human developer skills are valid, businesses shouldn’t dismiss AI-supported coding. They just need to think carefully about when and how to deploy AI tools in development. These tools can be more than productivity boosters; they can act as interactive mentors, guiding coders in real time with explanations, alternatives and best practices.

    When used as a training tool, AI can reinforce learning by showing coders why code is broken and how to fix it—rather than simply applying a solution. For example, a junior developer using Claude Code might receive immediate feedback on inefficient syntax or logic errors, along with suggestions linked to detailed explanations. This enables active learning, not passive correction. It’s a win-win: Accelerating project timelines without doing all the work for junior coders.

    Additionally, coding frameworks can support experimentation by letting developers prototype agent workflows or integrate LLMs without needing expert-level knowledge upfront. By observing how AI builds and refines code, junior developers who actively engage with these tools can internalize patterns, architectural decisions and debugging strategies — mirroring the traditional learning process of trial and error, code reviews and mentorship.

    However, AI coding assistants shouldn’t replace real mentorship or pair programming. Pull requests and formal code reviews remain essential for guiding newer, less experienced team members. We are nowhere near the point at which AI can single-handedly upskill a junior developer.

    Companies and educators can build structured development programs around these tools that emphasize code comprehension to ensure AI is used as a training partner rather than a crutch. This encourages coders to question AI outputs and requires manual refactoring exercises. In this way, AI becomes less of a replacement for human ingenuity and more of a catalyst for accelerated, experiential learning.

    Bridging the gap between automation and education

    When utilized with intention, AI doesn’t just write code; it teaches coding, blending automation with education to prepare developers for a future where deep understanding and adaptability remain indispensable.

    By embracing AI as a mentor, a programming partner and a team of developers we can direct at the problem at hand, we can bridge the gap between effective automation and education. We can empower developers to grow alongside the tools they use. We can ensure that, as AI evolves, so too does the human skill set, fostering a generation of coders who are both efficient and deeply knowledgeable.

    Richard Sonnenblick is chief data scientist at Planview.

  • When dirt meets data: ScottsMiracle-Gro saved $150M using AI

    How a semiconductor veteran turned over a century of horticultural wisdom into AI-led competitive advantage 

    For decades, a ritual played out across ScottsMiracle-Gro’s media facilities. Every few weeks, workers walked acres of towering compost and wood chip piles with nothing more than measuring sticks. They wrapped rulers around each mound, estimated height, and did what company President Nate Baxter now describes as “sixth-grade geometry to figure out volume.”

    Today, drones glide over those same plants with mechanical precision. Vision systems calculate volumes in real time. The move from measuring sticks to artificial intelligence signals more than efficiency. It is the visible proof of one of corporate America’s most unlikely technology stories.

    The AI revolution finds an unexpected leader

    Enterprise AI has been led by predictable players. Software companies with cloud-native architectures. Financial services firms with vast data lakes. Retailers with rich digital touchpoints. Consumer packaged goods companies that handle physical products like fertilizer and soil were not expected to lead.

    Yet ScottsMiracle-Gro has realized more than half of a targeted $150 million in supply chain savings. It reports a 90 percent improvement in customer service response times. Its predictive models enable weekly reallocation of marketing resources across regional markets.

    A Silicon Valley veteran bets on soil science

    Baxter’s path to ScottsMiracle-Gro (SMG) reads like a calculated pivot, not a corporate rescue. After two decades in semiconductor manufacturing at Intel and Tokyo Electron, he knew how to apply advanced technology to complex operations.

    “I sort of initially said, ‘Why would I do this? I’m running a tech company. It’s an industry I’ve been in for 25 years,’” Baxter recalls of his reaction when ScottsMiracle-Gro CEO Jim Hagedorn approached him in 2023. The company was reeling from a collapsed $1.2 billion hydroponics investment and facing what he describes as “pressure from a leverage standpoint.”

    His wife challenged him with a direct prompt. If you are not learning or putting yourself in uncomfortable situations, you should change that.

    Baxter saw clear parallels between semiconductor manufacturing and SMG’s operations. Both require precision, quality control, and the optimization of complex systems. He also saw untapped potential in SMG’s domain knowledge. One hundred fifty years of horticultural expertise, regulatory know-how, and customer insight had never been fully digitized.

    “It became apparent to me whether it was on the backend with data analytics, business process transformation, and obviously now with AI being front and center of the consumer experience, a lot of opportunities are there,” he explains.

    The declaration that changed everything

    The pivot began at an all-hands meeting. “I just said, you know, guys, we’re a tech company. You just don’t know it yet,” Baxter recalls. “There’s so much opportunity here to drive this company to where it needs to go.”

    The first challenge was organizational. SMG had evolved into functional silos. IT, supply chain, and brand teams ran independent systems with little coordination. Drawing on his experience with complex technology organizations, Baxter restructured the consumer business into three business units. General managers became accountable not just for financial results but also for technology implementation within their domains.

    “I came in and said, we’re going to create new business units,” he explains. “The buck stops with you and I’m holding you accountable not only for the business results, for the quality of the creative and marketing, but for the implementation of technology.”

    To support the new structure, SMG set up centers of excellence for digital capabilities, insights and analytics, and creative functions. The hybrid design placed centralized expertise behind distributed accountability.

    Mining corporate memory for AI gold

    Turning legacy knowledge into machine-ready intelligence required what Fausto Fleites, VP of Data Intelligence, calls “archaeological work.” The team excavated decades of business logic embedded in legacy SAP systems and converted filing cabinets of research into AI-ready datasets. Fleites, a Cuban immigrant with a doctorate from FIU who led Florida’s public hurricane loss model before roles at Sears and Cemex, understood the stakes.

    “The costly part of the migration was the business reporting layer we have in SAP Business Warehouse,” Fleites explains. “You need to uncover business logic created in many cases over decades.”

    SMG chose Databricks as its unified data platform. The team had Apache Spark expertise. Databricks offered strong SAP integration and aligned with a preference for open-source technologies that minimize vendor lock-in.

    The breakthrough came through systematic knowledge management. SMG built an AI bot using Google’s Gemini large language model to catalog and clean internal repositories. The system identified duplicates, grouped content by topic, and restructured information for AI consumption. The effort reduced knowledge articles by 30 percent while increasing their utility.

    “We used Gemini LLMs to actually categorize them into topics, find similar documents,” Fleites explains. A hybrid approach that combined modern AI with techniques like cosine similarity became the foundation for later applications.
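
    The mechanics of that hybrid step can be sketched simply: embed each article, then flag pairs whose cosine similarity crosses a threshold as candidate duplicates. The embed function and the 0.95 threshold below are illustrative assumptions, not SMG's pipeline.

    ```python
    # Sketch of cosine-similarity duplicate detection across knowledge articles.
    # `embed` is any text-embedding function (placeholder); the threshold is illustrative.
    import numpy as np

    def cosine_similarity(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def find_near_duplicates(articles, embed, threshold=0.95):
        """Return pairs of article indices whose embeddings are nearly identical."""
        vectors = [np.asarray(embed(text)) for text in articles]
        pairs = []
        for i in range(len(vectors)):
            for j in range(i + 1, len(vectors)):
                if cosine_similarity(vectors[i], vectors[j]) >= threshold:
                    pairs.append((i, j))   # candidates to merge or retire
        return pairs
    ```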

    Building AI systems that actually understand fertilizer

    Early trials with off-the-shelf AI exposed a real risk. General-purpose models confused products designed for killing weeds with those for preventing them. That mistake can ruin a lawn.

    “Different products, if you use one in the wrong place, would actually have a very negative outcome,” Fleites notes. “But those are kind of synonyms in certain contexts to the LLM. So they were recommending the wrong products.”

    The solution was a new architecture. SMG created what Fleites calls a “hierarchy of agents.” A supervisor agent routes queries to specialized worker agents organized by brand. Each agent draws on deep product knowledge encoded from a 400-page internal training manual.

    The system also changes the conversation. When users ask for recommendations, the agents start with questions about location, goals, and lawn conditions. They narrow possibilities step by step before offering suggestions. The stack integrates with APIs for product availability and state-specific regulatory compliance.
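
    A supervisor-and-worker routing pattern like the one described can be sketched as follows; the brand matching, data structures and clarifying-question flow are illustrative assumptions, not ScottsMiracle-Gro's implementation.

    ```python
    # Sketch of a supervisor agent routing to brand-specific worker agents.
    # Brand names, routing logic and data structures are illustrative assumptions.
    from dataclasses import dataclass, field

    @dataclass
    class WorkerAgent:
        brand: str
        knowledge: dict = field(default_factory=dict)   # encoded product knowledge

        def answer(self, query, profile):
            # A real worker would first ask about location, goals and lawn conditions,
            # then check availability and state-level compliance via APIs.
            return f"[{self.brand}] recommendation for {profile.get('state', 'unknown state')}: {query}"

    class SupervisorAgent:
        def __init__(self, workers):
            self.workers = workers                      # brand name -> WorkerAgent

        def route(self, query, profile):
            # Toy routing: match on brand keywords; a production system would
            # classify the query with an LLM before delegating.
            for brand, worker in self.workers.items():
                if brand.lower() in query.lower():
                    return worker.answer(query, profile)
            return "Which product line are you asking about?"
    ```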

    From drones to demand forecasting across the enterprise

    The transformation runs across the company. Drones measure inventory piles. Demand forecasting models analyze more than 60 factors, including weather patterns, consumer sentiment, and macroeconomic indicators.

    These predictions enable faster moves. When drought struck Texas, the models supported a shift in promotional spending to regions with favorable weather. The reallocation helped drive positive quarterly results.

    “We not only have the ability to move marketing and promotion dollars around, but we’ve even gotten to the point where if it’s going to be a big weekend in the Northeast, we’ll shift our field sales resources from other regions up there,” Baxter explains.

    Consumer Services changed as well. AI agents now process incoming emails through Salesforce, draft responses based on the knowledge base, and flag them for brief human review. Draft times dropped from ten minutes to seconds and response quality improved.

    The company emphasizes explainable AI. Using SHAP, SMG built dashboards that decompose each forecast and show how weather, promotions, or media spending contribute to predictions.

    “Typically, if you open a prediction to a business person and you don’t say why, they’ll say, ‘I don’t believe you,’” Fleites explains. Transparency made it possible to move resource allocation from quarterly to weekly cycles.
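
    As an illustration of that kind of explainability, the sketch below fits a small gradient-boosted forecaster on synthetic weekly data and decomposes each prediction with SHAP. The features, data and model are invented for the example and are not SMG's actual drivers or pipeline.

    ```python
    # Illustrative SHAP decomposition of a toy demand forecast (synthetic data,
    # invented feature names; not ScottsMiracle-Gro's models or drivers).
    import numpy as np
    import pandas as pd
    import shap
    from xgboost import XGBRegressor

    rng = np.random.default_rng(0)
    X = pd.DataFrame({
        "avg_temperature_c": rng.uniform(5, 35, 300),
        "rainfall_mm": rng.uniform(0, 80, 300),
        "promo_spend_usd": rng.uniform(0, 100_000, 300),
    })
    y = 100 + 3 * X["avg_temperature_c"] - 0.5 * X["rainfall_mm"] + 0.001 * X["promo_spend_usd"]

    model = XGBRegressor(n_estimators=100, max_depth=3).fit(X, y)

    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X)   # per-feature contribution to each forecast
    shap.summary_plot(shap_values, X)        # the "why" view a planning dashboard would surface
    ```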

    Competing like a startup

    SMG’s results challenge assumptions about AI readiness in traditional industries. The advantage does not come from owning the most sophisticated models. It comes from combining general-purpose AI with unique, structured domain knowledge.

    “LLMs are going to be a commodity,” Fleites observes. “The strategic differentiator is what is the additional level of [internal] knowledge we can fit to them.”

    Partnerships are central. SMG works with Google Vertex AI for foundational models, Sierra.ai for production-ready conversational agents, and Kindwise for computer vision. The ecosystem approach lets a small internal team recruited from Meta, Google, and AI startups deliver outsized impact without building everything from scratch.

    Talent follows impact. Conventional wisdom says traditional companies cannot compete with Meta salaries or Google stock. SMG offered something different. It offered the chance to build transformative AI applications with immediate business impact.

    “When we have these interviews, what we propose to them is basically the ability to have real value with the latest knowledge in these spaces,” Fleites explains. “A lot of people feel motivated to come to us” because much of big tech AI work, despite the hype, “doesn’t really have an impact.”

    Team design mirrors that philosophy. “My direct reports are leaders and not only manage people, but are technically savvy,” Fleites notes. “We always are constantly switching hands between developing or maintaining a solution versus strategy versus managing people.” He still writes code weekly. The small team of 15 to 20 AI and engineering professionals stays lean by contracting out implementation while keeping “the know-how and the direction and the architecture” in-house.

    When innovation meets immovable objects

    Not every pilot succeeded. SMG tested semi-autonomous forklifts in a 1.3 million square foot distribution facility. Remote drivers in the Philippines controlled up to five vehicles at once with strong safety records.

    “The technology was actually really great,” Baxter acknowledges. The vehicles could not lift enough weight for SMG’s heavy products. The company paused implementation.

    “Not everything we’ve tried has gone smoothly,” Baxter admits. “But I think another important point is you have to focus on a few critical ones and you have to know when something isn’t going to work and readjust.”

    The lesson tracks with semiconductor discipline. Investments must show measurable returns within set timeframes. Regulatory complexity adds difficulty. Products must comply with EPA rules and a patchwork of state restrictions, which AI systems must navigate correctly.

    The gardening sommelier and agent-to-agent futures

    The roadmap reflects a long-term view. SMG plans a “gardening sommelier” mobile app in 2026 that identifies plants, weeds, and lawn problems from photos and provides instant guidance. A beta already helps field sales teams answer complex product questions by querying the 400-page knowledge base.

    The company is exploring agent-to-agent communication so its specialized AI can interface with retail partners’ systems. A customer who asks a Walmart chatbot for lawn advice could trigger an SMG query that returns accurate, regulation-compliant recommendations.

    SMG has launched AI-powered search on its website, replacing keyword systems with conversational engines based on the internal stack. The future vision pairs predictive models with conversational agents so the system can reach out when conditions suggest a customer may need help.

    What traditional industries can learn

    ScottsMiracle-Gro's transformation offers a clear playbook for enterprises. The advantage doesn't come from deploying the most sophisticated models. Instead, it comes from combining AI with proprietary domain knowledge that competitors can't easily replicate.

    By making general managers responsible for both business results and technology implementation, SMG ensured AI wasn't just an IT initiative but a business imperative. The 150 years of horticultural expertise only became valuable when it was digitized, structured, and made accessible to AI systems.

    Legacy companies competing for AI engineers can't match Silicon Valley compensation packages. But they can offer something tech giants often can't: immediate, measurable impact. When engineers see their weather forecasting models directly influence quarterly results or their agent architecture prevent customers from ruining their lawns, the work carries weight that another incremental improvement to an ad algorithm never will.

    “We have a right to win,” Baxter says. “We have 150 years of this experience.” That experience is now data, and data is the company’s competitive edge. ScottsMiracle-Gro didn’t outspend its rivals or chase the newest AI model. It turned knowledge into an operating system for growth. For a company built on soil, its biggest breakthrough might be cultivating data.