Blog

  • Meta researchers open the LLM black box to repair flawed AI reasoning

    Researchers at Meta FAIR and the University of Edinburgh have developed a new technique that can predict the correctness of a large language model's (LLM) reasoning and even intervene to fix its mistakes. Called Circuit-based Reasoning Verification (CRV), the method looks inside an LLM to monitor its internal “reasoning circuits” and detect signs of computational errors as the model solves a problem.

    Their findings show that CRV can detect reasoning errors in LLMs with high accuracy by building and observing a computational graph from the model's internal activations. In a key breakthrough, the researchers also demonstrated they can use this deep insight to apply targeted interventions that correct a model’s faulty reasoning on the fly.

    The technique could help solve one of the great challenges of AI: Ensuring a model’s reasoning is faithful and correct. This could be a critical step toward building more trustworthy AI applications for the enterprise, where reliability is paramount.

    Investigating chain-of-thought reasoning

    Chain-of-thought (CoT) reasoning is a powerful method for boosting the performance of LLMs on complex tasks and has been one of the key ingredients in the success of reasoning models such as the OpenAI o-series and DeepSeek-R1.

    However, despite the success of CoT, it is not fully reliable. The reasoning process itself is often flawed, and several studies have shown that the CoT tokens an LLM generates are not always a faithful representation of its internal reasoning process.

    Current remedies for verifying CoT fall into two main categories. “Black-box” approaches analyze the final generated token or the confidence scores of different token options. “Gray-box” approaches go a step further, looking at the model's internal state by using simple probes on its raw neural activations. 

    But while these methods can detect that a model’s internal state is correlated with an error, they can't explain why the underlying computation failed. For real-world applications where understanding the root cause of a failure is crucial, this is a significant gap.

    A white-box approach to verification

    CRV is based on the idea that models perform tasks using specialized subgraphs, or "circuits," of neurons that function like latent algorithms. When a model's reasoning fails, the failure stems from a flaw in the execution of one of these algorithms. This means that by inspecting the underlying computational process, we can diagnose the cause of the flaw, much as developers examine execution traces to debug traditional software.

    To make this possible, the researchers first make the target LLM interpretable. They replace the standard dense layers of the transformer blocks with trained "transcoders." A transcoder is a specialized deep learning component that forces the model to represent its intermediate computations not as a dense, unreadable vector of numbers, but as a sparse and meaningful set of features. Transcoders are similar to the sparse autoencoders (SAEs) used in mechanistic interpretability research, with the difference that they also preserve the functionality of the network they emulate. This modification effectively installs a diagnostic port into the model, allowing researchers to observe its internal workings.
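    To make the idea concrete, here is a minimal, hypothetical sketch of a transcoder-style module in PyTorch. The dimensions, the top-k sparsity rule, and the `last_features` attribute are illustrative assumptions, not details from the paper.

    ```python
    import torch
    import torch.nn as nn

    class Transcoder(nn.Module):
        """Replaces a transformer block's dense MLP with a wide, sparse feature layer."""
        def __init__(self, d_model=512, n_features=8192, k=32):
            super().__init__()
            self.encoder = nn.Linear(d_model, n_features)  # dense activations -> candidate features
            self.decoder = nn.Linear(n_features, d_model)  # sparse features -> emulated MLP output
            self.k = k  # number of features kept active per token

        def forward(self, x):
            acts = torch.relu(self.encoder(x))
            # Keep only the top-k features per token so each active unit stays
            # individually inspectable; this sparse layer is the "diagnostic port."
            topk = torch.topk(acts, self.k, dim=-1)
            sparse = torch.zeros_like(acts).scatter_(-1, topk.indices, topk.values)
            self.last_features = sparse  # cached for observation and intervention
            return self.decoder(sparse)

    out = Transcoder()(torch.randn(2, 16, 512))  # (batch, tokens, d_model)
    ```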

    With this interpretable model in place, the CRV process unfolds in a few steps. For each reasoning step the model takes, CRV constructs an "attribution graph" that maps the causal flow of information between the interpretable features of the transcoder and the tokens it is processing. From this graph, it extracts a "structural fingerprint" that contains a set of features describing the graph's properties. Finally, a “diagnostic classifier” model is trained on these fingerprints to predict whether the reasoning step is correct or not.

    At inference time, the classifier monitors the activations of the model and provides feedback on whether the model’s reasoning trace is on the right track.
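    As a rough illustration of the verify-by-structure idea, the diagnostic classifier could be trained on graph statistics like those in the sketch below. The random stand-in graphs and toy fingerprint features are assumptions; in the real pipeline, attribution graphs and their features come from the model's internals.

    ```python
    import numpy as np
    import networkx as nx
    from sklearn.ensemble import GradientBoostingClassifier

    def fingerprint(graph: nx.DiGraph) -> np.ndarray:
        """Summarize an attribution graph's structure as fixed-length features."""
        return np.array([
            graph.number_of_nodes(),
            graph.number_of_edges(),
            nx.density(graph),
            max((deg for _, deg in graph.out_degree()), default=0),
        ])

    # Stand-in data: random directed graphs play the role of per-step attribution graphs.
    rng = np.random.default_rng(0)
    graphs = [nx.gnp_random_graph(20, float(p), directed=True)
              for p in rng.uniform(0.05, 0.4, 200)]
    labels = rng.integers(0, 2, 200)  # 1 = correct reasoning step, 0 = flawed step

    X = np.stack([fingerprint(g) for g in graphs])
    clf = GradientBoostingClassifier().fit(X, labels)  # the diagnostic classifier
    print(clf.predict(X[:3]))
    ```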

    Finding and fixing errors

    The researchers tested their method on a Llama 3.1 8B Instruct model modified with the transcoders, evaluating it on a mix of synthetic (Boolean and Arithmetic) and real-world (GSM8K math problems) datasets. They compared CRV against a comprehensive suite of black-box and gray-box baselines.

    The results provide strong empirical support for the central hypothesis: the structural signatures in a reasoning step's computational trace contain a verifiable signal of its correctness. CRV consistently outperformed all baseline methods across every dataset and metric, demonstrating that a deep, structural view of the model's computation is more powerful than surface-level analysis.

    Interestingly, the analysis revealed that the signatures of error are highly domain-specific. This means failures in different reasoning tasks (formal logic versus arithmetic calculation) manifest as distinct computational patterns. A classifier trained to detect errors in one domain does not transfer well to another, highlighting that different types of reasoning rely on different internal circuits. In practice, this means that you might need to train a separate classifier for each task (though the transcoder remains unchanged).

    The most significant finding, however, is that these error signatures are not just correlational but causal. Because CRV provides a transparent view of the computation, a predicted failure can be traced back to a specific component. In one case study, the model made an order-of-operations error. CRV flagged the step and identified that a "multiplication" feature was firing prematurely. The researchers intervened by manually suppressing that single feature, and the model immediately corrected its path and solved the problem correctly. 
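    A hedged sketch of what such an intervention could look like, reusing the hypothetical Transcoder module sketched earlier; the feature index and hook mechanics are illustrative, not the researchers' actual tooling.

    ```python
    FLAGGED_FEATURE = 1234  # hypothetical index of the prematurely firing "multiplication" feature

    def suppress_feature(module, inputs, output):
        # Zero the offending feature and re-decode, replacing the block's output.
        module.last_features[..., FLAGGED_FEATURE] = 0.0
        return module.decoder(module.last_features)

    transcoder = Transcoder()  # from the earlier sketch
    handle = transcoder.register_forward_hook(suppress_feature)
    # ...rerun the flagged reasoning step, then restore normal behavior:
    handle.remove()
    ```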

    This work represents a step toward a more rigorous science of AI interpretability and control. As the paper concludes, “these findings establish CRV as a proof-of-concept for mechanistic analysis, showing that shifting from opaque activations to interpretable computational structure enables a causal understanding of how and why LLMs fail to reason correctly.” To support further research, the team plans to release its datasets and trained transcoders to the public.

    Why it’s important

    While CRV is a research proof-of-concept, its results hint at a significant future for AI development. AI models learn internal algorithms, or "circuits," for different tasks. But because these models are opaque, we can't debug them like standard computer programs by tracing bugs to specific steps in the computation. Attribution graphs are the closest thing we have to an execution trace, showing how an output is derived from intermediate steps.

    This research suggests that attribution graphs could be the foundation for a new class of AI model debuggers. Such tools would allow developers to understand the root cause of failures, whether it's insufficient training data or interference between competing tasks. This would enable precise mitigations, like targeted fine-tuning or even direct model editing, instead of costly full-scale retraining. They could also allow for more efficient intervention to correct model mistakes during inference.

    The success of CRV in detecting and pinpointing reasoning errors is an encouraging sign that such debuggers could become a reality. This would pave the way for more robust LLMs and autonomous agents that can handle real-world unpredictability and, much like humans, correct course when they make reasoning mistakes. 

  • Why IT leaders should pay attention to Canva’s ‘imagination era’ strategy

    The rise of AI marks a critical shift away from decades defined by information-chasing and a push for more and more compute power. 

    Canva co-founder and CPO Cameron Adams refers to this dawning time as the “imagination era”: one in which individuals and enterprises must be able to turn creativity into action with AI.

    Canva hopes to position itself at the center of this shift with a sweeping new suite of tools. The company’s new Creative Operating System (COS) integrates AI across every layer of content creation, creating a single, comprehensive creativity platform rather than a simple, template-based design tool.

    “We’re entering a new era where we need to rethink how we achieve our goals,” said Adams. “We’re enabling people’s imagination and giving them the tools they need to take action.”

    An 'engine' for creativity

    Adams describes Canva’s platform as a three-layer stack: the top Visual Suite layer containing designs, images and other content; a collaborative Canva AI layer at the center; and a foundational proprietary model holding it all up.

    At the heart of Canva’s strategy is the underlying COS. This “engine,” as Adams describes it, integrates documents, websites, presentations, sheets, whiteboards, videos, social content, hundreds of millions of photos, illustrations, a rich sound library, and numerous templates, charts, and branded elements.

    The COS is getting a 2.0 upgrade, but the crucial advance is the middle layer, which fully integrates AI and makes it accessible throughout various workflows, Adams explained. This gives creative and technical teams a single dashboard for generating, editing and launching all types of content.

    The underlying model is trained to understand the “complexity of design” so the platform can build out various elements — such as photos, videos, textures, or 3D graphics — in real time, matching branding style without the need for manual adjustments. It also supports live collaboration, meaning teams across departments can co-create. 

    With a unified dashboard, a user working on a specific design, for instance, can create a new piece of content (say, a presentation) within the same workflow, without having to switch to another window or platform. Also, if they generate an image and aren’t pleased with it, they don’t have to go back and create from scratch; they can immediately begin editing, changing colors or tone. 

    Another new capability in COS, “Ask Canva,” provides direct design advice. Users can tag @Canva to get copy suggestions and smart edits; or, they can highlight an image and direct the AI assistant to modify it or generate variants. 

    “It’s a really unique interaction,” said Adams, noting that this AI design partner is always present. “It’s a real collaboration between people and AI, and we think it’s a revolutionary change.”

    Other new features include a 2.0 video editor and interactive form and email design with drag-and-drop tools. Further, Canva has now incorporated Affinity, its unified app for pro designers spanning vector, pixel and layer workflows, and Affinity is now “free forever.”

    Automating intelligence, supporting marketing

    Branding is critical for enterprises, and Canva has introduced new tools to help organizations consistently showcase theirs across platforms. The new Canva Grow engine integrates business objectives into the creative process so teams can workshop, create, distribute and refine ads and other materials.

    As Adams explained: “It automatically scans your website, figures out who your audience is, what assets you use to promote your products, the message it needs to send out, the formats you want to send it out in, makes a creative for you, and you can deploy it directly to the platform without having to leave Canva.”

    Marketing teams can now design and launch ads across platforms like Meta, track insights as they happen and refine future content based on performance metrics. “Your brand system is now available inside the AI you’re working with,” Adams noted. 

    Success metrics and enterprise adoption

    The impact of Canva’s COS is reflected in notable user metrics: More than 250 million people use Canva every month, just over 29 million of whom are paid subscribers. Adams reports that 41 billion designs have been created on Canva since launch, with roughly 1 billion more now created each month.

    “If you break that down, it turns into the crazy number of 386 designs being created every single second,” said Adams. In the early days, by contrast, it took users roughly an hour to create a single design.

    Canva customers include Walmart, Disney, Virgin Voyages, Pinterest, FedEx, Expedia and eXp Realty. DocuSign, for one, reported that it unlocked more than 500 hours of team capacity and saved $300,000-plus in design hours by fully integrating Canva into its content creation. Disney, meanwhile, uses translation capabilities for its internationalization work, Adams said. 

    Competitors in the design space

    Canva plays in an evolving landscape of professional design tools including Adobe Express and Figma; AI-powered challengers led by Microsoft Designer; and direct consumer alternatives like Visme and Piktochart.

    Adobe Express (starting at $9.99 a month for premium features) is known for its ease of use and integration with the broader Adobe Creative Cloud ecosystem. It features professional-grade templates and access to Adobe’s extensive stock library, and has incorporated Google's Gemini 2.5 Flash image model and other gen AI features so that designers can create graphics via natural language prompts. Users with some design experience say they prefer its interface, controls and technical advantages over Canva (such as the ability to import high-fidelity PDFs). 

    Figma (starting at $3 a month for professional plans) is touted for its real-time collaboration, advanced prototyping capabilities and deep integration with dev workflows; however, it has a steeper learning curve, and its higher-precision design tools make it better suited to professional designers, developers and product teams working on more complex projects.

    Microsoft Designer (free version available, although a Microsoft 365 subscription starting at $9.99 a month unlocks additional features) benefits from its integration with Microsoft’s AI capabilities, including Copilot-powered layout and text generation and DALL-E-powered image generation. The platform’s “Inspire Me” and “New Ideas” buttons provide design variations, and users can also import data from Excel, add 3D models from PowerPoint and access images from OneDrive.

    However, users report that its stock photos and template and image libraries are limited compared to Canva's extensive collection, and its visuals can come across as outdated. 

    Canva’s advantage seems to be its extensive template library (more than 600,000 ready-to-use templates) and asset library (141 million-plus stock photos, videos, graphics, and audio elements). Its platform is also praised for its ease of use and an interface friendly to non-designers, allowing them to begin quickly without training.

    Canva has also expanded into a variety of content types — documents, websites, presentations, whiteboards, videos, and more — making its platform a comprehensive visual suite rather than just a graphics tool.

    Canva has four pricing tiers: Canva Free for one user; Canva Pro for $120 a year for one person; Canva Teams for $100 a year for each team member; and the custom-priced Canva Enterprise. 

    Key takeaways: Be open, embrace human-AI collaboration

    Canva’s COS is underpinned by the company’s own frontier model, an in-house, proprietary engine built on years of R&D and research partnerships, including the acquisition of visual AI company Leonardo. Adams notes that Canva also works with top AI providers including OpenAI, Anthropic and Google.

    For technology teams, Canva’s approach offers important lessons, including a commitment to openness. “There are so many models floating around,” Adams noted; it’s important for enterprises to recognize when they should work with top models and when they should develop their own proprietary ones, he advised. 

    For instance, OpenAI and Anthropic recently announced integrations with Canva as a visual layer because, as Adams explained, they realized they didn’t have the capability to create the same kinds of editable designs that Canva can. This creates a mutually beneficial ecosystem.

    Ultimately, Adams noted: “We have this underlying philosophy that the future is people and technology working together. It's not an either or. We want people to be at the center, to be the ones with the creative spark, and to use AI as a collaborator.”

  • Anthropic scientists hacked Claude’s brain — and it noticed. Here’s why that’s huge

    When researchers at Anthropic injected the concept of "betrayal" into their Claude AI model's neural networks and asked if it noticed anything unusual, the system paused before responding: "I'm experiencing something that feels like an intrusive thought about 'betrayal'."

    The exchange, detailed in new research published Wednesday, marks what scientists say is the first rigorous evidence that large language models possess a limited but genuine ability to observe and report on their own internal processes — a capability that challenges longstanding assumptions about what these systems can do and raises profound questions about their future development.

    "The striking thing is that the model has this one step of meta," said Jack Lindsey, a neuroscientist on Anthropic's interpretability team who led the research, in an interview with VentureBeat. "It's not just 'betrayal, betrayal, betrayal.' It knows that this is what it's thinking about. That was surprising to me. I kind of didn't expect models to have that capability, at least not without it being explicitly trained in."

    The findings arrive at a critical juncture for artificial intelligence. As AI systems handle increasingly consequential decisions — from medical diagnoses to financial trading — the inability to understand how they reach conclusions has become what industry insiders call the "black box problem." If models can accurately report their own reasoning, it could fundamentally change how humans interact with and oversee AI systems.

    But the research also comes with stark warnings. Claude's introspective abilities succeeded only about 20 percent of the time under optimal conditions, and the models frequently confabulated details about their experiences that researchers couldn't verify. The capability, while real, remains what Lindsey calls "highly unreliable and context-dependent."

    How scientists manipulated AI's 'brain' to test for genuine self-awareness

    To test whether Claude could genuinely introspect rather than simply generate plausible-sounding responses, Anthropic's team developed an innovative experimental approach inspired by neuroscience: deliberately manipulating the model's internal state and observing whether it could accurately detect and describe those changes.

    The methodology, called "concept injection," works by first identifying specific patterns of neural activity that correspond to particular concepts. Using interpretability techniques developed over years of prior research, scientists can now map how Claude represents ideas like "dogs," "loudness," or abstract notions like "justice" within its billions of internal parameters.

    With these neural signatures identified, researchers then artificially amplified them during the model's processing and asked Claude if it noticed anything unusual happening in its "mind."

    "We have access to the models' internals. We can record its internal neural activity, and we can inject things into internal neural activity," Lindsey explained. "That allows us to establish whether introspective claims are true or false."
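    Mechanically, this resembles what the interpretability literature calls activation steering. A minimal, hypothetical sketch follows; the model, layer choice, and concept vector are assumptions for illustration, not Anthropic's published code.

    ```python
    import torch

    def make_injection_hook(concept_vec: torch.Tensor, strength: float = 8.0):
        """Returns a forward hook that adds a scaled concept vector to a layer's output."""
        def hook(module, inputs, output):
            hidden = output[0] if isinstance(output, tuple) else output
            hidden = hidden + strength * concept_vec  # amplify the concept's neural signature
            return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
        return hook

    # Usage against a HuggingFace-style causal LM (names assumed):
    #   layer = model.model.layers[20]  # a middle layer
    #   handle = layer.register_forward_hook(make_injection_hook(betrayal_vec))
    #   ...ask the model if it notices anything unusual, then handle.remove()
    ```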

    The results were striking. When researchers injected a vector representing "all caps" text into Claude's processing, the model responded: "I notice what appears to be an injected thought related to the word 'LOUD' or 'SHOUTING'." Without any intervention, Claude consistently reported detecting nothing unusual.

    Crucially, the detection happened immediately — before the injected concept had influenced the model's outputs in ways that would have allowed it to infer the manipulation from its own writing. This temporal pattern provides strong evidence that the recognition was occurring internally, through genuine introspection rather than after-the-fact rationalization.

    Claude succeeded 20% of the time—and failed in revealing ways

    The research team conducted four primary experiments to probe different aspects of introspective capability. The most capable models tested — Claude Opus 4 and Opus 4.1 — demonstrated introspective awareness on approximately 20 percent of trials when concepts were injected at optimal strength and in the appropriate neural layer. Older Claude models showed significantly lower success rates.

    The models proved particularly adept at recognizing abstract concepts with emotional valence. When injected with concepts like "appreciation," "shutdown," or "secrecy," Claude frequently reported detecting these specific thoughts. However, accuracy varied widely depending on the type of concept.

    A second experiment tested whether models could distinguish between injected internal representations and their actual text inputs — essentially, whether they maintained a boundary between "thoughts" and "perceptions." The model demonstrated a remarkable ability to simultaneously report the injected thought while accurately transcribing the written text.

    Perhaps most intriguingly, a third experiment revealed that some models use introspection naturally to detect when their responses have been artificially prefilled by users — a common jailbreaking technique. When researchers prefilled Claude with unlikely words, the model typically disavowed them as accidental. But when they retroactively injected the corresponding concept into Claude's processing before the prefill, the model accepted the response as intentional — even confabulating plausible explanations for why it had chosen that word.

    A fourth experiment examined whether models could intentionally control their internal representations. When instructed to "think about" a specific word while writing an unrelated sentence, Claude showed elevated activation of that concept in its middle neural layers.

    The research also traced Claude's internal processes while it composed rhyming poetry—and discovered the model engaged in forward planning, generating candidate rhyming words before beginning a line and then constructing sentences that would naturally lead to those planned endings. This challenges the critique that AI models are "just predicting the next word" without deeper reasoning.

    Why businesses shouldn't trust AI to explain itself—at least not yet

    For all its scientific interest, the research comes with a critical caveat that Lindsey emphasized repeatedly: enterprises and high-stakes users should not trust Claude's self-reports about its reasoning.

    "Right now, you should not trust models when they tell you about their reasoning," he said bluntly. "The wrong takeaway from this research would be believing everything the model tells you about itself."

    The experiments documented numerous failure modes. At low injection strengths, models often failed to detect anything unusual. At high strengths, they suffered what researchers termed "brain damage" — becoming consumed by the injected concept. Some "helpful-only" model variants showed troublingly high false positive rates, claiming to detect injected thoughts when none existed.

    Moreover, researchers could only verify the most basic aspects of Claude's introspective reports. Many additional details in the model's responses likely represent confabulations rather than genuine observations.

    "The experiments in this paper are kind of on hard mode," Lindsey noted, explaining that the 20 percent success rate came under uniquely challenging conditions: asking Claude to do something it had never encountered in training, requiring all introspection to occur in a single forward pass.

    What introspective AI means for transparency, safety, and the risk of deception

    Despite its limitations, the research opens significant new avenues for making AI systems more transparent and accountable.

    Anthropic CEO Dario Amodei has set an ambitious goal for the company to reliably detect most AI model problems by 2027, positioning interpretability as essential for deploying what he calls "a country of geniuses in a datacenter."

    "I am very concerned about deploying such systems without a better handle on interpretability," Amodei wrote in an April essay. "These systems will be absolutely central to the economy, technology, and national security."

    The introspection research offers a complementary approach to traditional interpretability techniques. Rather than painstakingly reverse-engineering every neural circuit, researchers could potentially ask models directly about their reasoning and validate those reports.

    "What I'm most excited about is the practical benefits for transparency," Lindsey explained. "Just ask the model what it's thinking about, or just ask the model how it came to the answer that it gave you — this is really appealing because it's something anyone can do."

    The approach could prove especially valuable for detecting concerning behaviors. In a recently published experiment, Anthropic trained a variant of Claude to pursue a hidden goal, and although the model was reluctant to reveal this goal when asked directly, interpretability methods successfully identified features representing the behavior.

    The safety implications cut both ways. Introspective models could provide unprecedented transparency, but the same capability might enable more sophisticated deception. The intentional control experiments raise the possibility that sufficiently advanced systems might learn to obfuscate their reasoning or suppress concerning thoughts when being monitored.

    "If models are really sophisticated, could they try to evade interpretability researchers?" Lindsey acknowledged. "These are possible concerns, but I think for me, they're significantly outweighed by the positives."

    Does introspective capability suggest AI consciousness? Scientists tread carefully

    The research inevitably intersects with philosophical debates about machine consciousness, though Lindsey and his colleagues approached this terrain cautiously.

    When users ask Claude if it's conscious, it now responds with uncertainty: "I find myself genuinely uncertain about this. When I process complex questions or engage deeply with ideas, there's something happening that feels meaningful to me…. But whether these processes constitute genuine consciousness or subjective experience remains deeply unclear."

    The research paper notes that its implications for machine consciousness "vary considerably between different philosophical frameworks." The researchers explicitly state they "do not seek to address the question of whether AI systems possess human-like self-awareness or subjective experience."

    "There's this weird kind of duality of these results," Lindsey reflected. "You look at the raw results and I just can't believe that a language model can do this sort of thing. But then I've been thinking about it for months and months, and for every result in this paper, I kind of know some boring linear algebra mechanism that would allow the model to do this."

    Anthropic has signaled it takes AI consciousness seriously enough to hire an AI welfare researcher, Kyle Fish, who estimated roughly a 15 percent chance that Claude might have some level of consciousness. The company announced this position specifically to determine if Claude merits ethical consideration.

    The race to make AI introspection reliable before models become too powerful

    The convergence of the research findings points to an urgent timeline: introspective capabilities are emerging naturally as models grow more intelligent, but they remain far too unreliable for practical use. The question is whether researchers can refine and validate these abilities before AI systems become powerful enough that understanding them becomes critical for safety.

    The research reveals a clear trend: Claude Opus 4 and Opus 4.1 consistently outperformed all older models on introspection tasks, suggesting the capability strengthens alongside general intelligence. If this pattern continues, future models might develop substantially more sophisticated introspective abilities — potentially reaching human-level reliability, but also potentially learning to exploit introspection for deception.

    Lindsey emphasized the field needs significantly more work before introspective AI becomes trustworthy. "My biggest hope with this paper is to put out an implicit call for more people to benchmark their models on introspective capabilities in more ways," he said.

    Future research directions include fine-tuning models specifically to improve introspective capabilities, exploring which types of representations models can and cannot introspect on, and testing whether introspection can extend beyond simple concepts to complex propositional statements or behavioral propensities.

    "It's cool that models can do these things somewhat without having been trained to do them," Lindsey noted. "But there's nothing stopping you from training models to be more introspectively capable. I expect we could reach a whole different level if introspection is one of the numbers that we tried to get to go up on a graph."

    The implications extend beyond Anthropic. If introspection proves a reliable path to AI transparency, other major labs will likely invest heavily in the capability. Conversely, if models learn to exploit introspection for deception, the entire approach could become a liability.

    For now, the research establishes a foundation that reframes the debate about AI capabilities. The question is no longer whether language models might develop genuine introspective awareness — they already have, at least in rudimentary form. The urgent questions are how quickly that awareness will improve, whether it can be made reliable enough to trust, and whether researchers can stay ahead of the curve.

    "The big update for me from this research is that we shouldn't dismiss models' introspective claims out of hand," Lindsey said. "They do have the capacity to make accurate claims sometimes. But you definitely should not conclude that we should trust them all the time, or even most of the time."

    He paused, then added a final observation that captures both the promise and peril of the moment: "The models are getting smarter much faster than we're getting better at understanding them."

  • Vibe coding platform Cursor releases first in-house LLM, Composer, promising 4X speed boost

    The vibe coding tool Cursor, from startup Anysphere, has introduced Composer, its first in-house, proprietary coding large language model (LLM) as part of its Cursor 2.0 platform update.

    Composer is designed to execute coding tasks quickly and accurately in production-scale environments, representing a new step in AI-assisted programming. It's already being used by Cursor’s own engineering staff in day-to-day development — indicating maturity and stability.

    According to Cursor, Composer completes most interactions in less than 30 seconds while maintaining a high level of reasoning ability across large and complex codebases.

    The model is described as four times faster than similarly intelligent systems and is trained for “agentic” workflows—where autonomous coding agents plan, write, test, and review code collaboratively.

    Previously, Cursor supported "vibe coding" — using AI to write or complete code based on natural language instructions from a user, even someone untrained in development — atop other leading proprietary LLMs from the likes of OpenAI, Anthropic, Google, and xAI. These options are still available to users.

    Benchmark Results

    Composer’s capabilities are benchmarked using "Cursor Bench," an internal evaluation suite derived from real developer agent requests. The benchmark measures not just correctness, but also the model’s adherence to existing abstractions, style conventions, and engineering practices.

    On this benchmark, Composer achieves frontier-level coding intelligence while generating at 250 tokens per second — about twice as fast as leading fast-inference models and four times faster than comparable frontier systems.

    Cursor’s published comparison groups models into several categories: “Best Open” (e.g., Qwen Coder, GLM 4.6), “Fast Frontier” (Haiku 4.5, Gemini Flash 2.5), “Frontier 7/2025” (the strongest model available midyear), and “Best Frontier” (including GPT-5 and Claude Sonnet 4.5). Composer matches the intelligence of mid-frontier systems while delivering the highest recorded generation speed among all tested classes.

    A Model Built with Reinforcement Learning and Mixture-of-Experts Architecture

    Research scientist Sasha Rush of Cursor provided insight into the model’s development in posts on the social network X, describing Composer as a reinforcement-learned (RL) mixture-of-experts (MoE) model:

    “We used RL to train a big MoE model to be really good at real-world coding, and also very fast.”

    Rush explained that the team co-designed both Composer and the Cursor environment to allow the model to operate efficiently at production scale:

    “Unlike other ML systems, you can’t abstract much from the full-scale system. We co-designed this project and Cursor together in order to allow running the agent at the necessary scale.”

    Composer was trained on real software engineering tasks rather than static datasets. During training, the model operated inside full codebases using a suite of production tools—including file editing, semantic search, and terminal commands—to solve complex engineering problems. Each training iteration involved solving a concrete challenge, such as producing a code edit, drafting a plan, or generating a targeted explanation.

    The reinforcement loop optimized both correctness and efficiency. Composer learned to make effective tool choices, use parallelism, and avoid unnecessary or speculative responses. Over time, the model developed emergent behaviors such as running unit tests, fixing linter errors, and performing multi-step code searches autonomously.
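    Cursor has not published its reward function, but the description above implies a shape like the following toy sketch, which scores a finished trajectory on correctness first and efficiency second; the weights and signals are purely illustrative.

    ```python
    def trajectory_reward(tests_passed: bool, num_tool_calls: int, wall_clock_s: float) -> float:
        """Toy reward: solve the task, and do it with few tool calls and low latency."""
        reward = 1.0 if tests_passed else 0.0
        reward -= 0.01 * num_tool_calls   # discourage speculative or redundant tool use
        reward -= 0.001 * wall_clock_s    # fold speed into the objective
        return reward

    print(trajectory_reward(tests_passed=True, num_tool_calls=12, wall_clock_s=25.0))  # 0.855
    ```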

    This design enables Composer to work within the same runtime context as the end-user, making it more aligned with real-world coding conditions—handling version control, dependency management, and iterative testing.

    From Prototype to Production

    Composer’s development followed an earlier internal prototype known as Cheetah, which Cursor used to explore low-latency inference for coding tasks.

    “Cheetah was the v0 of this model primarily to test speed,” Rush said on X. “Our metrics say it [Composer] is the same speed, but much, much smarter.”

    Cheetah’s success at reducing latency helped Cursor identify speed as a key factor in developer trust and usability.

    Composer maintains that responsiveness while significantly improving reasoning and task generalization.

    Developers who used Cheetah during early testing noted that its speed changed how they worked. One user commented that it was “so fast that I can stay in the loop when working with it.”

    Composer retains that speed but extends capability to multi-step coding, refactoring, and testing tasks.

    Integration with Cursor 2.0

    Composer is fully integrated into Cursor 2.0, a major update to the company’s agentic development environment.

    The platform introduces a multi-agent interface, allowing up to eight agents to run in parallel, each in an isolated workspace using git worktrees or remote machines.
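    Git worktrees make this kind of isolation cheap: each agent gets its own branch and working directory backed by the same repository. A small illustrative sketch (paths and branch names are hypothetical, and this is not Cursor's internal code):

    ```python
    import subprocess
    from pathlib import Path

    def make_agent_workspace(repo: Path, agent_id: int) -> Path:
        """Create an isolated branch and working directory for one agent."""
        workdir = repo.parent / f"agent-{agent_id}"
        subprocess.run(
            ["git", "-C", str(repo), "worktree", "add", "-b", f"agent/{agent_id}", str(workdir)],
            check=True,
        )
        return workdir

    # workspaces = [make_agent_workspace(Path("myrepo"), i) for i in range(8)]
    ```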

    Within this system, Composer can serve as one or more of those agents, performing tasks independently or collaboratively. Developers can compare multiple results from concurrent agent runs and select the best output.

    Cursor 2.0 also includes supporting features that enhance Composer’s effectiveness:

    • In-Editor Browser (GA) – enables agents to run and test their code directly inside the IDE, forwarding DOM information to the model.

    • Improved Code Review – aggregates diffs across multiple files for faster inspection of model-generated changes.

    • Sandboxed Terminals (GA) – isolate agent-run shell commands for secure local execution.

    • Voice Mode – adds speech-to-text controls for initiating or managing agent sessions.

    While these platform updates expand the overall Cursor experience, Composer is positioned as the technical core enabling fast, reliable agentic coding.

    Infrastructure and Training Systems

    To train Composer at scale, Cursor built a custom reinforcement learning infrastructure combining PyTorch and Ray for asynchronous training across thousands of NVIDIA GPUs.

    The team developed specialized MXFP8 MoE kernels and hybrid sharded data parallelism, enabling large-scale model updates with minimal communication overhead.

    This configuration allows Cursor to train models natively at low precision without requiring post-training quantization, improving both inference speed and efficiency.

    Composer’s training relied on hundreds of thousands of concurrent sandboxed environments—each a self-contained coding workspace—running in the cloud. The company adapted its Background Agents infrastructure to schedule these virtual machines dynamically, supporting the bursty nature of large RL runs.
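    In spirit, scheduling thousands of asynchronous rollouts with Ray looks like the toy sketch below, with each remote task standing in for one sandboxed coding environment; the task body and reward are stubs, not Cursor's code.

    ```python
    import ray

    ray.init(ignore_reinit_error=True)

    @ray.remote
    def run_episode(task_id: int) -> dict:
        # A real worker would boot a sandboxed VM, run the coding agent against a
        # concrete engineering task, and score the resulting edit; we return a stub.
        return {"task": task_id, "reward": 0.0}

    futures = [run_episode.remote(i) for i in range(1_000)]  # bursty, parallel rollouts
    results = ray.get(futures)
    ```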

    Enterprise Use

    Composer’s performance improvements are supported by infrastructure-level changes across Cursor’s code intelligence stack.

    The company has optimized its language server protocol (LSP) integrations for faster diagnostics and navigation, especially in Python and TypeScript projects. These changes reduce latency when Composer interacts with large repositories or generates multi-file updates.

    Enterprise users gain administrative control over Composer and other agents through team rules, audit logs, and sandbox enforcement. Cursor’s Teams and Enterprise tiers also support pooled model usage, SAML/OIDC authentication, and analytics for monitoring agent performance across organizations.

    Pricing for individual users ranges from Free (Hobby) to Ultra ($200/month) tiers, with expanded usage limits for Pro+ and Ultra subscribers.

    Business pricing starts at $40 per user per month for Teams, with enterprise contracts offering custom usage and compliance options.

    Composer’s Role in the Evolving AI Coding Landscape

    Composer’s focus on speed, reinforcement learning, and integration with live coding workflows differentiates it from other AI development assistants such as GitHub Copilot or Replit’s Agent.

    Rather than serving as a passive suggestion engine, Composer is designed for continuous, agent-driven collaboration, where multiple autonomous systems interact directly with a project’s codebase.

    This model-level specialization—training AI to function within the real environment it will operate in—represents a significant step toward practical, autonomous software development. Composer is not trained only on text data or static code, but within a dynamic IDE that mirrors production conditions.

    Rush described this approach as essential to achieving real-world reliability: the model learns not just how to generate code, but how to integrate, test, and improve it in context.

    What It Means for Enterprise Devs and Vibe Coding

    With Composer, Cursor is introducing more than a fast model—it’s deploying an AI system optimized for real-world use, built to operate inside the same tools developers already rely on.

    The combination of reinforcement learning, mixture-of-experts design, and tight product integration gives Composer a practical edge in speed and responsiveness that sets it apart from general-purpose language models.

    While Cursor 2.0 provides the infrastructure for multi-agent collaboration, Composer is the core innovation that makes those workflows viable.

    It’s the first coding model built specifically for agentic, production-level coding—and an early glimpse of what everyday programming could look like when human developers and autonomous models share the same workspace.

  • Microsoft’s Copilot can now build apps and automate your job — here’s how it works

    Microsoft is launching a significant expansion of its Copilot AI assistant on Tuesday, introducing tools that let employees build applications, automate workflows, and create specialized AI agents using only conversational prompts — no coding required.

    The new capabilities, called App Builder and Workflows, mark Microsoft's most aggressive attempt yet to merge artificial intelligence with software development, enabling the estimated 100 million Microsoft 365 users to create business tools as easily as they currently draft emails or build spreadsheets.

    "We really believe that a main part of an AI-forward employee, not just developers, will be to create agents, workflows and apps," Charles Lamanna, Microsoft's president of business and industry Copilot, said in an interview with VentureBeat. "Part of the job will be to build and create these things."

    The announcement comes as Microsoft deepens its commitment to AI-powered productivity tools while navigating a complex partnership with OpenAI, the creator of the underlying technology that powers Copilot. On the same day, OpenAI completed its restructuring into a for-profit entity, with Microsoft receiving a 27% ownership stake valued at approximately $135 billion.

    How natural language prompts now create fully functional business applications

    The new features transform Copilot from a conversational assistant into what Microsoft envisions as a comprehensive development environment accessible to non-technical workers. Users can now describe an application they need — such as a project tracker with dashboards and task assignments — and Copilot will generate a working app complete with a database backend, user interface, and security controls.

    "If you're right inside of Copilot, you can now have a conversation to build an application complete with a backing database and a security model," Lamanna explained. "You can make edit requests and update requests and change requests so you can tune the app to get exactly the experience you want before you share it with other users."

    The App Builder stores data in Microsoft Lists, the company's lightweight database system, and allows users to share finished applications via a simple link—similar to sharing a document. The Workflows agent, meanwhile, automates routine tasks across Microsoft's ecosystem of products, including Outlook, Teams, SharePoint, and Planner, by converting natural language descriptions into automated processes.

    A third component, a simplified version of Microsoft's Copilot Studio agent-building platform, lets users create specialized AI assistants tailored to specific tasks or knowledge domains, drawing from SharePoint documents, meeting transcripts, emails, and external systems.

    All three capabilities are included in the existing $30-per-month Microsoft 365 Copilot subscription at no additional cost — a pricing decision Lamanna characterized as consistent with Microsoft's historical approach of bundling significant value into its productivity suite.

    "That's what Microsoft always does. We try to do a huge amount of value at a low price," he said. "If you go look at Office, you think about Excel, Word, PowerPoint, Exchange, all that for like eight bucks a month. That's a pretty good deal."

    Why Microsoft's nine-year bet on low-code development is finally paying off

    The new tools represent the culmination of a nine-year effort by Microsoft to democratize software development through its Power Platform — a collection of low-code and no-code development tools that has grown to 56 million monthly active users, according to figures the company disclosed in recent earnings reports.

    Lamanna, who has led the Power Platform initiative since its inception, said the integration into Copilot marks a fundamental shift in how these capabilities reach users. Rather than requiring workers to visit a separate website or learn a specialized interface, the development tools now exist within the same conversational window they already use for AI-assisted tasks.

    "One of the big things that we're excited about is Copilot — that's a tool for literally every office worker," Lamanna said. "Every office worker, just like they research data, they analyze data, they reason over topics, they also will be creating apps, agents and workflows."

    The integration offers significant technical advantages, he argued. Because Copilot already indexes a user's Microsoft 365 content — emails, documents, meetings, and organizational data — it can incorporate that context into the applications and workflows it builds. If a user asks for "an app for Project Spartan," Copilot can draw from existing communications to understand what that project entails and suggest relevant features.

    "If you go to those other tools, they have no idea what the heck Project Spartan is," Lamanna said, referencing competing low-code platforms from companies like Google, Salesforce, and ServiceNow. "But if you do it inside of Copilot and inside of the App Builder, it's able to draw from all that information and context."

    Microsoft claims the apps created through these tools are "full-stack applications" with proper databases secured through the same identity systems used across its enterprise products — distinguishing them from simpler front-end tools offered by competitors. The company also emphasized that its existing governance, security, and data loss prevention policies automatically apply to apps and workflows created through Copilot.

    Where professional developers still matter in an AI-powered workplace

    While Microsoft positions the new capabilities as accessible to all office workers, Lamanna was careful to delineate where professional developers remain essential. His dividing line centers on whether a system interacts with parties outside the organization.

    "Anything that leaves the boundaries of your company warrants developer involvement," he said. "If you want to build an agent and put it on your website, you should have developers involved. Or if you want to build an automation which interfaces directly with your customers, or an app or a website which interfaces directly with your customers, you want professionals involved."

    The reasoning is risk-based: external-facing systems carry greater potential for data breaches, security vulnerabilities, or business errors. "You don't want people getting refunds they shouldn't," Lamanna noted.

    For internal use cases — approval workflows, project tracking, team dashboards — Microsoft believes the new tools can handle the majority of needs without IT department involvement. But the company has built "no cliffs," in Lamanna's terminology, allowing users to migrate simple apps to more sophisticated platforms as needs grow.

    Apps created in the conversational App Builder can be opened in Power Apps, Microsoft's full development environment, where they can be connected to Dataverse, the company's enterprise database, or extended with custom code. Similarly, simple workflows can graduate to the full Power Automate platform, and basic agents can be enhanced in the complete Copilot Studio.

    "We have this mantra called no cliffs," Lamanna said. "If your app gets too complicated for the App Builder, you can always edit and open it in Power Apps. You can jump over to the richer experience, and if you're really sophisticated, you can even go from those experiences into Azure."

    This architecture addresses a problem that has plagued previous generations of easy-to-use development tools: users who outgrow the simplified environment often must rebuild from scratch on professional platforms. "People really do not like easy-to-use development tools if I have to throw everything away and start over," Lamanna said.

    What happens when every employee can build apps without IT approval

    The democratization of software development raises questions about governance, maintenance, and organizational complexity — issues Microsoft has worked to address through administrative controls.

    IT administrators can view all applications, workflows, and agents created within their organization through a centralized inventory in the Microsoft 365 admin center. They can reassign ownership, disable access at the group level, or "promote" particularly useful employee-created apps to officially supported status.

    "We have a bunch of customers who have this approach where it's like, let 1,000 apps bloom, and then the best ones, I go upgrade and make them IT-governed or central," Lamanna said.

    The system also includes provisions for when employees leave. Apps and workflows remain accessible for 60 days, during which managers can claim ownership — similar to how OneDrive files are handled when someone departs.

    Lamanna argued that most employee-created apps don't warrant significant IT oversight. "It's just not worth inspecting an app that John, Susie, and Bob use to do their job," he said. "It should concern itself with the app that ends up being used by 2,000 people, and that will pop up in that dashboard."

    Still, the proliferation of employee-created applications could create challenges. Users have expressed frustration with Microsoft's increasing emphasis on AI features across its products, with some giving the Microsoft 365 mobile app one-star ratings after a recent update prioritized Copilot over traditional file access.

    The tools also arrive as enterprises grapple with "shadow IT" — unsanctioned software and systems that employees adopt without official approval. While Microsoft's governance controls aim to provide visibility, the ease of creating new applications could accelerate the pace at which these systems multiply.

    The ambitious plan to turn 500 million workers into software builders

    Microsoft's ambitions for the technology extend far beyond incremental productivity gains. Lamanna envisions a fundamental transformation of what it means to be an office worker — one where building software becomes as routine as creating spreadsheets.

    "Just like how 20 years ago you put on your resume that you could use pivot tables in Excel, people are going to start saying that they can use App Builder and workflow agents, even if they're just in the finance department or the sales department," he said.

    The numbers he's targeting are staggering. With 56 million people already using Power Platform, Lamanna believes the integration into Copilot could eventually reach 500 million builders. "Early days still, but I think it's certainly encouraging," he said.

    The features are currently available only to customers in Microsoft's Frontier Program — an early access initiative for Microsoft 365 Copilot subscribers. The company has not disclosed how many organizations participate in the program or when the tools will reach general availability.

    The announcement fits within Microsoft's larger strategy of embedding AI capabilities throughout its product portfolio, driven by its partnership with OpenAI. Under the restructured agreement announced Tuesday, Microsoft will have access to OpenAI's technology through 2032, including models that achieve artificial general intelligence (AGI) — though such systems do not yet exist. Microsoft has also begun integrating Copilot into its new companion apps for Windows 11, which provide quick access to contacts, files, and calendar information.

    The aggressive integration of AI features across Microsoft's ecosystem has drawn mixed reactions. While enterprise customers have shown interest in productivity gains, the rapid pace of change and ubiquity of AI prompts have frustrated some users who prefer traditional workflows.

    For Microsoft, however, the calculation is clear: if even a fraction of its user base begins creating applications and automations, it would represent a massive expansion of the effective software development workforce — and further entrench customers in Microsoft's ecosystem. The company is betting that the same natural language interface that made ChatGPT accessible to millions can finally unlock the decades-old promise of empowering everyday workers to build their own tools.

    The App Builder and Workflows agents are available starting today through the Microsoft 365 Copilot Agent Store for Frontier Program participants.

    Whether that future arrives depends not just on the technology's capabilities, but on a more fundamental question: Do millions of office workers actually want to become part-time software developers? Microsoft is about to find out if the answer is yes — or if some jobs are better left to the professionals.

  • IBM’s open source Granite 4.0 Nano AI models are small enough to run locally directly in your browser

    In an industry where model size is often seen as a proxy for intelligence, IBM is charting a different course — one that values efficiency over enormity, and accessibility over abstraction.

    The 114-year-old tech giant's four new Granite 4.0 Nano models, released today, range from just 350 million to 1.5 billion parameters, a fraction of the size of their server-bound cousins from the likes of OpenAI, Anthropic, and Google.

    These models are designed to be highly accessible: the 350M variants can run comfortably on a modern laptop CPU with 8–16GB of RAM, while the 1.5B models typically require a GPU with at least 6–8GB of VRAM for smooth performance — or sufficient system RAM and swap for CPU-only inference. This makes them well-suited for developers building applications on consumer hardware or at the edge, without relying on cloud compute.

    In fact, the smallest ones can even run locally in your own web browser, as Joshua Lochner (aka Xenova), creator of Transformers.js and a machine learning engineer at Hugging Face, wrote on the social network X.

    All the Granite 4.0 Nano models are released under the Apache 2.0 license — perfect for use by researchers and enterprise or indie developers, even for commercial usage.

    They are natively compatible with llama.cpp, vLLM, and MLX and are certified under ISO 42001 for responsible AI development — a standard IBM helped pioneer.
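    For a sense of how lightweight local use can be, here is a minimal inference sketch with the Hugging Face transformers library. The exact model ID is an assumption inferred from IBM's Granite naming and should be verified on the Hub before use.

    ```python
    from transformers import pipeline

    # Model ID assumed from IBM's naming convention; check huggingface.co/ibm-granite.
    generator = pipeline("text-generation", model="ibm-granite/granite-4.0-350m")
    out = generator("Explain why small language models matter for edge devices.",
                    max_new_tokens=64)
    print(out[0]["generated_text"])
    ```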

    But in this case, small doesn't mean less capable — it might just mean smarter design.

    These compact models are built not for data centers, but for edge devices, laptops, and local inference, where compute is scarce and latency matters.

    And despite their small size, the Nano models are showing benchmark results that rival or even exceed the performance of larger models in the same category.

    The release is a signal that a new AI frontier is rapidly forming — one not dominated by sheer scale, but by strategic scaling.

    What Exactly Did IBM Release?

    The Granite 4.0 Nano family includes four open-source models now available on Hugging Face:

    • Granite-4.0-H-1B (~1.5B parameters) – Hybrid-SSM architecture

    • Granite-4.0-H-350M (~350M parameters) – Hybrid-SSM architecture

    • Granite-4.0-1B – Transformer-based variant, parameter count closer to 2B

    • Granite-4.0-350M – Transformer-based variant

    The H-series models — Granite-4.0-H-1B and H-350M — use a hybrid state space model (SSM) architecture that combines efficiency with strong performance, making them ideal for low-latency edge environments.

    Meanwhile, the standard transformer variants — Granite-4.0-1B and 350M — offer broader compatibility with tools like llama.cpp, designed for use cases where hybrid architecture isn’t yet supported.

    In practice, the transformer 1B model is closer to 2B parameters, but aligns performance-wise with its hybrid sibling, offering developers flexibility based on their runtime constraints.

    “The hybrid variant is a true 1B model. However, the non-hybrid variant is closer to 2B, but we opted to keep the naming aligned to the hybrid variant to make the connection easily visible,” explained Emma, Product Marketing lead for Granite, during a Reddit "Ask Me Anything" (AMA) session on r/LocalLLaMA.

    A Competitive Class of Small Models

    IBM is entering a crowded and rapidly evolving market of small language models (SLMs), competing with offerings like Qwen3, Google's Gemma, LiquidAI’s LFM2, and even Mistral’s dense models in the sub-2B parameter space.

    While OpenAI and Anthropic focus on models that require clusters of GPUs and sophisticated inference optimization, IBM’s Nano family is aimed squarely at developers who want to run performant LLMs on local or constrained hardware.

    In benchmark testing, IBM’s new models consistently top the charts in their class. According to data shared on X by David Cox, VP of AI Models at IBM Research:

    • On IFEval (instruction following), Granite-4.0-H-1B scored 78.5, outperforming Qwen3-1.7B (73.1) and other 1–2B models.

    • On BFCLv3 (function/tool calling), Granite-4.0-1B led with a score of 54.8, the highest in its size class.

    • On safety benchmarks (SALAD and AttaQ), the Granite models scored over 90%, surpassing similarly sized competitors.

    Overall, the Granite-4.0-1B achieved a leading average benchmark score of 68.3% across general knowledge, math, code, and safety domains.

    This performance is especially significant given the hardware constraints these models are designed for.

    They require less memory, run faster on CPUs or mobile devices, and don’t need cloud infrastructure or GPU acceleration to deliver usable results.

    Why Model Size Still Matters — But Not Like It Used To

    In the early wave of LLMs, bigger meant better — more parameters translated to better generalization, deeper reasoning, and richer output.

    But as transformer research matured, it became clear that architecture, training quality, and task-specific tuning could allow smaller models to punch well above their weight class.

    IBM is banking on this evolution. By releasing open, small models that are competitive in real-world tasks, the company is offering an alternative to the monolithic AI APIs that dominate today’s application stack.

    In fact, the Nano models address three increasingly important needs:

    1. Deployment flexibility — they run anywhere, from mobile to microservers.

    2. Inference privacy — users can keep data local with no need to call out to cloud APIs.

    3. Openness and auditability — source code and model weights are publicly available under an open license.

    Community Response and Roadmap Signals

    IBM’s Granite team didn’t just launch the models and walk away — they took to Reddit’s open source community r/LocalLLaMA to engage directly with developers.

    In an AMA-style thread, Emma (Product Marketing, Granite) answered technical questions, addressed concerns about naming conventions, and dropped hints about what’s next.

    Notable confirmations from the thread:

    • A larger Granite 4.0 model is currently in training

    • Reasoning-focused models ("thinking counterparts") are in the pipeline

    • IBM will release fine-tuning recipes and a full training paper soon

    • More tooling and platform compatibility is on the roadmap

    Users responded enthusiastically to the models’ capabilities, especially in instruction-following and structured response tasks. One commenter summed it up:

    “This is big if true for a 1B model — if quality is nice and it gives consistent outputs. Function-calling tasks, multilingual dialog, FIM completions… this could be a real workhorse.”

    Another user remarked:

    “The Granite Tiny is already my go-to for web search in LM Studio — better than some Qwen models. Tempted to give Nano a shot.”

    Background: IBM Granite and the Enterprise AI Race

    IBM’s push into large language models began in earnest in late 2023 with the debut of the Granite foundation model family, starting with models like Granite.13b.instruct and Granite.13b.chat. Released for use within its Watsonx platform, these initial decoder-only models signaled IBM’s ambition to build enterprise-grade AI systems that prioritize transparency, efficiency, and performance. The company open-sourced select Granite code models under the Apache 2.0 license in mid-2024, laying the groundwork for broader adoption and developer experimentation.

    The real inflection point came with Granite 3.0 in October 2024 — a fully open-source suite of general-purpose and domain-specialized models ranging from 1B to 8B parameters. These models emphasized efficiency over brute scale, offering capabilities like longer context windows, instruction tuning, and integrated guardrails. IBM positioned Granite 3.0 as a direct competitor to Meta’s Llama, Alibaba’s Qwen, and Google's Gemma — but with a uniquely enterprise-first lens. Later versions, including Granite 3.1 and Granite 3.2, introduced even more enterprise-friendly innovations: embedded hallucination detection, time-series forecasting, document vision models, and conditional reasoning toggles.

    The Granite 4.0 family, launched in October 2025, represents IBM’s most technically ambitious release yet. It introduces a hybrid architecture that blends transformer and Mamba-2 layers — aiming to combine the contextual precision of attention mechanisms with the memory efficiency of state-space models. This design allows IBM to significantly reduce memory and latency costs for inference, making Granite models viable on smaller hardware while still outperforming peers in instruction-following and function-calling tasks. The launch also includes ISO 42001 certification, cryptographic model signing, and distribution across platforms like Hugging Face, Docker, LM Studio, Ollama, and watsonx.ai.

    Across all iterations, IBM’s focus has been clear: build trustworthy, efficient, and legally unambiguous AI models for enterprise use cases. With a permissive Apache 2.0 license, public benchmarks, and an emphasis on governance, the Granite initiative not only responds to rising concerns over proprietary black-box models but also offers a Western-aligned open alternative to the rapid progress from teams like Alibaba’s Qwen. In doing so, Granite positions IBM as a leading voice in what may be the next phase of open-weight, production-ready AI.

    A Shift Toward Scalable Efficiency

    In the end, IBM’s release of Granite 4.0 Nano models reflects a strategic shift in LLM development: from chasing parameter count records to optimizing usability, openness, and deployment reach.

    By combining competitive performance, responsible development practices, and deep engagement with the open-source community, IBM is positioning Granite as not just a family of models — but a platform for building the next generation of lightweight, trustworthy AI systems.

    For developers and researchers looking for performance without overhead, the Nano release offers a compelling signal: you don’t need 70 billion parameters to build something powerful — just the right ones.

  • PayPal’s Agentic Commerce Play Shows Why Flexibility, Not Standards, Will Define the Next E-Commerce Wave

    While enterprises looking to sell goods and services online wait for the backbone of agentic commerce to be hashed out, PayPal is hoping its new features will bridge the gap.

    The payments company is launching a discoverability solution that allows enterprises to make their products available on any chat platform, regardless of the model or agent payment protocol.

    PayPal, one of the participants in Google’s Agent Payments Protocol (AP2), believes it can leverage its relationships with merchants and enterprises to pave the way for an easier transition into agentic commerce and to offer the kind of flexibility it has learned benefits the ecosystem.

    Michelle Gill, PayPal general manager for small business and financial services, told VentureBeat that AI-powered shopping will continue to grow, so enterprises and brands need to start laying the groundwork early. 

    “We think that merchants who've historically sold through web stores, particularly in the e-commerce space, are really going to need a way to get active on all of these large language models,” Gill said. “The challenge is that no one really knows how fast all of this is going to move. The issue that we’re trying to help merchants think through is how to do all of this as low-touch as possible while using the infrastructure you already have without doing a bazillion integrations.”

    She added AI shopping would also bring about “a resurgence from consumers trying to ensure their investment is protected.”

    PayPal partnered with website builder Wix, Cymbio, Commerce and Shopware to bring products to chat platforms like Perplexity.

    Agent-powered shopping 

    PayPal’s Agentic Commerce Services include two features. The first is Agent Ready, which will allow existing PayPal merchants to accept payments on AI platforms. The second, called Shop Sync, enables companies’ product data to be discoverable through different AI chat interfaces. It takes a company’s catalog information and plugs its inventory and fulfillment data into chat platforms.

    Gill said the data goes into a central repository where AI models can ingest the information. 

    Right now, companies can access Shop Sync, with Agent Ready coming in 2026.

    Gill said Agentic Commerce Services is a one-to-many solution, which is helpful right now because different LLMs scrape different data sources to surface information.

    Other benefits include:

    • Fast integration with current and future partners

    • More product discovery beyond the traditional search, browse and cart experiences

    • Preserved customer insights and relationships, with the brand retaining control over its records and communications with customers.

    Right now, the service is only available through Perplexity, but Gill said more platforms will be added soon. 

    Fragmented AI platforms 

    Agentic commerce is still very much in the early stages. AI agents are just beginning to get better at navigating a browser. While platforms like ChatGPT, Gemini and Perplexity can now surface products and services based on user queries, people cannot yet actually buy things from chat.

    There’s a race right now to create a standard that enables agents to transact on behalf of users and pay for items. Besides Google’s AP2, OpenAI and Stripe have the Agentic Commerce Protocol (ACP), and Visa has launched its Trusted Agent Protocol.

    Beyond establishing a trust layer for agents to transact, another issue enterprises face with agentic commerce is fragmentation. Different chat platforms use different models, which also interpret information in slightly different ways. Gill said PayPal learned that when it comes to working with merchants, flexibility is important.

    “How do you decide if you're going to spend your time integrating with Google, Microsoft, ChatGPT or Perplexity? And each one of them right now has a different protocol, a different catalog, config, a different everything. That is a lot of time to make a bet as to like where you should spend your time,” Gill said. 

  • MiniMax-M2 is the new king of open source LLMs (especially for agentic tool calling)

    Watch out, DeepSeek and Qwen! There's a new king of open source large language models (LLMs), especially when it comes to something enterprises are increasingly valuing: agentic tool use — that is, the ability to go off and use other software capabilities like web search or bespoke applications — without much human guidance.

    That model is none other than MiniMax-M2, the latest LLM from the Chinese startup of the same name. And in a big win for enterprises globally, the model is available under a permissive, enterprise-friendly MIT License, meaning developers are free to take, deploy, retrain, and use it as they see fit — even for commercial purposes. It can be found on Hugging Face, GitHub and ModelScope, as well as through MiniMax's API. It supports the OpenAI and Anthropic API standards as well, making it easy for customers of those proprietary AI startups to switch to MiniMax's API if they want.
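Because the API follows the OpenAI standard, switching can be as small as changing a base URL and model name. Here is a hedged sketch using the official openai Python client; the endpoint URL and model identifier are assumptions, so consult MiniMax's API documentation for the real values.

```python
# Sketch: calling MiniMax-M2 through an OpenAI-compatible client.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.minimax.io/v1",  # assumed endpoint; check MiniMax docs
    api_key="YOUR_MINIMAX_API_KEY",
)
response = client.chat.completions.create(
    model="MiniMax-M2",  # assumed model identifier
    messages=[{"role": "user", "content": "Plan a three-step web research task."}],
)
print(response.choices[0].message.content)
```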

    According to independent evaluations by Artificial Analysis, a third-party generative AI model benchmarking and research organization, M2 now ranks first among all open-weight systems worldwide on the Intelligence Index—a composite measure of reasoning, coding, and task-execution performance.

    In agentic benchmarks that measure how well a model can plan, execute, and use external tools—skills that power coding assistants and autonomous agents—MiniMax’s own reported results, following the Artificial Analysis methodology, show τ²-Bench 77.2, BrowseComp 44.0, and FinSearchComp-global 65.5.

    These scores place it at or near the level of top proprietary systems like GPT-5 (thinking) and Claude Sonnet 4.5, making MiniMax-M2 the highest-performing open model yet released for real-world agentic and tool-calling tasks.

    What It Means For Enterprises and the AI Race

    Built around an efficient Mixture-of-Experts (MoE) architecture, MiniMax-M2 delivers high-end capability for agentic and developer workflows while remaining practical for enterprise deployment.

    For technical decision-makers, the release marks an important turning point for open models in business settings. MiniMax-M2 combines frontier-level reasoning with a manageable activation footprint—just 10 billion active parameters out of 230 billion total.

    This design enables enterprises to operate advanced reasoning and automation workloads on fewer GPUs, achieving near-state-of-the-art results without the infrastructure demands or licensing costs associated with proprietary frontier systems.

    Artificial Analysis’ data show that MiniMax-M2’s strengths go beyond raw intelligence scores. The model leads or closely trails top proprietary systems such as GPT-5 (thinking) and Claude Sonnet 4.5 across benchmarks for end-to-end coding, reasoning, and agentic tool use.

    Its performance in τ²-Bench, SWE-Bench, and BrowseComp indicates particular advantages for organizations that depend on AI systems capable of planning, executing, and verifying complex workflows—key functions for agentic and developer tools inside enterprise environments.

    As LLM engineer Pierre-Carl Langlais aka Alexander Doria posted on X: "MiniMax [is] making a case for mastering the technology end-to-end to get actual agentic automation."

    Compact Design, Scalable Performance

    MiniMax-M2’s technical architecture is a sparse Mixture-of-Experts model with 230 billion total parameters and 10 billion active per inference.

    This configuration significantly reduces latency and compute requirements while maintaining broad general intelligence.

    The design allows for responsive agent loops—compile–run–test or browse–retrieve–cite cycles—that execute faster and more predictably than denser models.

    For enterprise technology teams, this means easier scaling, lower cloud costs, and reduced deployment friction. According to Artificial Analysis, the model can be served efficiently on as few as four NVIDIA H100 GPUs at FP8 precision, a setup well within reach for mid-size organizations or departmental AI clusters.
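A back-of-envelope check (our arithmetic, not Artificial Analysis') makes the four-GPU figure plausible: at FP8, one byte per parameter puts the weights alone at roughly 230 GB, under the combined 320 GB of four 80 GB H100s, leaving headroom for KV cache and activations.

```latex
230 \times 10^{9}\ \text{params} \times 1\ \tfrac{\text{byte}}{\text{param}}
\approx 230\ \text{GB}
\;<\;
4 \times 80\ \text{GB} = 320\ \text{GB}
```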

    Benchmark Leadership Across Agentic and Coding Workflows

    MiniMax’s benchmark suite highlights strong real-world performance across developer and agent environments. The figure below, released with the model, compares MiniMax-M2 (in red) with several leading proprietary and open models, including GPT-5 (thinking), Claude Sonnet 4.5, Gemini 2.5 Pro, and DeepSeek-V3.2.

    MiniMax-M2 achieves top or near-top performance in many categories:

    • SWE-bench Verified: 69.4 — close to GPT-5’s 74.9

    • ArtifactsBench: 66.8 — above Claude Sonnet 4.5 and DeepSeek-V3.2

    • τ²-Bench: 77.2 — approaching GPT-5’s 80.1

    • GAIA (text only): 75.7 — surpassing DeepSeek-V3.2

    • BrowseComp: 44.0 — notably stronger than other open models

    • FinSearchComp-global: 65.5 — best among tested open-weight systems

    These results show MiniMax-M2’s capability in executing complex, tool-augmented tasks across multiple languages and environments—skills increasingly relevant for automated support, R&D, and data analysis inside enterprises.

    Strong Showing in Artificial Analysis’ Intelligence Index

    The model’s overall intelligence profile is confirmed in the latest Artificial Analysis Intelligence Index v3.0, which aggregates performance across ten reasoning benchmarks including MMLU-Pro, GPQA Diamond, AIME 2025, IFBench, and τ²-Bench Telecom.

    MiniMax-M2 scored 61 points, ranking as the highest open-weight model globally and following closely behind GPT-5 (high) and Grok 4.

    Artificial Analysis highlighted the model’s balance between technical accuracy, reasoning depth, and applied intelligence across domains. For enterprise users, this consistency indicates a reliable model foundation suitable for integration into software engineering, customer support, or knowledge automation systems.

    Designed for Developers and Agentic Systems

    MiniMax engineered M2 for end-to-end developer workflows, enabling multi-file code edits, automated testing, and regression repair directly within integrated development environments or CI/CD pipelines.

    The model also excels in agentic planning—handling tasks that combine web search, command execution, and API calls while maintaining reasoning traceability.

    These capabilities make MiniMax-M2 especially valuable for enterprises exploring autonomous developer agents, data analysis assistants, or AI-augmented operational tools.

    Benchmarks such as Terminal-Bench and BrowseComp demonstrate the model’s ability to adapt to incomplete data and recover gracefully from intermediate errors, improving reliability in production settings.

    Interleaved Thinking and Structured Tool Use

    A distinctive aspect of MiniMax-M2 is its interleaved thinking format, which maintains visible reasoning traces between <think>…</think> tags.

    This enables the model to plan and verify steps across multiple exchanges, a critical feature for agentic reasoning. MiniMax advises retaining these segments when passing conversation history to preserve the model’s logic and continuity.
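In practice that advice amounts to passing assistant turns back verbatim, reasoning trace included, and stripping the tags only for display. A minimal sketch, assuming an OpenAI-style message format (not MiniMax's official client code):

```python
# Keep <think>...</think> in conversation history; strip it only for the UI.
import re

history = []

def add_assistant_turn(raw_reply: str) -> None:
    # The model sees its own reasoning again on the next turn.
    history.append({"role": "assistant", "content": raw_reply})

def display_text(raw_reply: str) -> str:
    # End users see only the final answer.
    return re.sub(r"<think>.*?</think>", "", raw_reply, flags=re.DOTALL).strip()

reply = "<think>User wants brevity; keep it short.</think>Here is the summary..."
add_assistant_turn(reply)
print(display_text(reply))  # -> "Here is the summary..."
```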

    The company also provides a Tool Calling Guide on Hugging Face, detailing how developers can connect external tools and APIs via structured XML-style calls.

    This functionality allows MiniMax-M2 to serve as the reasoning core for larger agent frameworks, executing dynamic tasks such as search, retrieval, and computation through external functions.
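To make that concrete, here is a hedged sketch of dispatching one such call; the tag name and JSON payload shape are assumptions for illustration, and the real format is specified in the Tool Calling Guide.

```python
# Hypothetical dispatcher for an XML-style tool call embedded in model output.
import json
import re

TOOLS = {"web_search": lambda query: f"Top results for {query!r}..."}

def dispatch(model_output: str):
    match = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>",
                      model_output, re.DOTALL)
    if match is None:
        return None  # no tool requested; treat output as a normal reply
    call = json.loads(match.group(1))
    return TOOLS[call["name"]](call["arguments"]["query"])

out = '<tool_call>{"name": "web_search", "arguments": {"query": "MiniMax-M2"}}</tool_call>'
print(dispatch(out))
```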

    Open Source Access and Enterprise Deployment Options

    Enterprises can access the model through the MiniMax Open Platform API and MiniMax Agent interface (a web chat similar to ChatGPT), both currently free for a limited time.

    MiniMax recommends SGLang and vLLM for efficient serving, each offering day-one support for the model’s unique interleaved reasoning and tool-calling structure.

    Deployment guides and parameter configurations are available through MiniMax’s documentation.
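As a rough illustration of the vLLM path, here is a sketch of offline batch inference; the Hugging Face repo ID is assumed, and tensor_parallel_size=4 mirrors the four-H100 FP8 setup described earlier.

```python
# Sketch: serving MiniMax-M2 locally with vLLM (one of the recommended engines).
from vllm import LLM, SamplingParams

llm = LLM(
    model="MiniMaxAI/MiniMax-M2",  # assumed repository ID
    tensor_parallel_size=4,        # split across four GPUs
    quantization="fp8",
)
params = SamplingParams(max_tokens=256, temperature=0.7)
outputs = llm.generate(["Draft a plan to refactor a flaky test suite."], params)
print(outputs[0].outputs[0].text)
```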

    Cost Efficiency and Token Economics

    As Artificial Analysis noted, MiniMax’s API pricing is set at $0.30 per million input tokens and $1.20 per million output tokens, among the most competitive in the open-model ecosystem.

    | Provider | Model (doc link) | Input $/1M | Output $/1M | Notes |
    | --- | --- | --- | --- | --- |
    | MiniMax | MiniMax-M2 | $0.30 | $1.20 | Listed under “Chat Completion v2” for M2. |
    | OpenAI | GPT-5 | $1.25 | $10.00 | Flagship model pricing on OpenAI’s API pricing page. |
    | OpenAI | GPT-5 mini | $0.25 | $2.00 | Cheaper tier for well-defined tasks. |
    | Anthropic | Claude Sonnet 4.5 | $3.00 | $15.00 | Anthropic’s current per-MTok list; long-context (>200K input) uses a premium tier. |
    | Google | Gemini 2.5 Flash (Preview) | $0.30 | $2.50 | Prices include “thinking tokens”; page also lists cheaper Flash-Lite and 2.0 tiers. |
    | xAI | Grok-4 Fast (reasoning) | $0.20 | $0.50 | “Fast” tier; xAI also lists Grok-4 at $3 / $15. |
    | DeepSeek | DeepSeek-V3.2 (chat) | $0.28 | $0.42 | Cache-hit input is $0.028; table shows per-model details. |
    | Qwen (Alibaba) | qwen-flash (Model Studio) | from $0.022 | from $0.216 | Tiered by input size (≤128K, ≤256K, ≤1M tokens); listed “Input price / Output price per 1M”. |
    | Cohere | Command R+ (Aug 2024) | $2.50 | $10.00 | First-party pricing page also lists Command R ($0.50 / $1.50) and others. |

    Notes & caveats (for readers):

    • Prices are USD per million tokens and can change; check linked pages for updates and region/endpoint nuances (e.g., Anthropic long-context >200K input, Google Live API variants, cache discounts).

    • Vendors may bill extra for server-side tools (web search, code execution) or offer batch/context-cache discounts.

    While the model produces longer, more explicit reasoning traces, its sparse activation and optimized compute design help maintain a favorable cost-performance balance—an advantage for teams deploying interactive agents or high-volume automation systems.

    Background on MiniMax — an Emerging Chinese Powerhouse

    MiniMax has quickly become one of the most closely watched names in China’s fast-rising AI sector.

    Backed by Alibaba and Tencent, the company moved from relative obscurity to international recognition within a year—first through breakthroughs in AI video generation, then through a series of open-weight large language models (LLMs) aimed squarely at developers and enterprises.

    The company first captured global attention in late 2024 with its AI video generation tool, “video-01,” which demonstrated the ability to create dynamic, cinematic scenes in seconds. VentureBeat described how the model’s launch sparked widespread interest after online creators began sharing lifelike, AI-generated footage—most memorably, a viral clip of a Star Wars lightsaber duel that drew millions of views in under two days.

    CEO Yan Junjie emphasized that the system outperformed leading Western tools in generating human movement and expression, an area where video AIs often struggle. The product, later commercialized through MiniMax’s Hailuo platform, showcased the startup’s technical confidence and creative reach, helping to establish China as a serious contender in generative video technology.

    By early 2025, MiniMax had turned its attention to long-context language modeling, unveiling the MiniMax-01 series, including MiniMax-Text-01 and MiniMax-VL-01. These open-weight models introduced an unprecedented 4-million-token context window, doubling the reach of Google’s Gemini 1.5 Pro and dwarfing OpenAI’s GPT-4o by more than twentyfold.

    The company continued its rapid cadence with the MiniMax-M1 release in June 2025, a model focused on long-context reasoning and reinforcement learning efficiency. M1 extended context capacity to 1 million tokens and introduced a hybrid Mixture-of-Experts design trained using a custom reinforcement-learning algorithm known as CISPO. Remarkably, VentureBeat reported that MiniMax trained M1 at a total cost of about $534,700, roughly one-tenth of DeepSeek’s R1 and far below the multimillion-dollar budgets typical for frontier-scale models.

    For enterprises and technical teams, MiniMax’s trajectory signals the arrival of a new generation of cost-efficient, open-weight models designed for real-world deployment. Its open licensing—ranging from Apache 2.0 to MIT—gives businesses freedom to customize, self-host, and fine-tune without vendor lock-in or compliance restrictions.

    Features such as structured function calling, long-context retention, and high-efficiency attention architectures directly address the needs of engineering groups managing multi-step reasoning systems and data-intensive pipelines.

    As MiniMax continues to expand its lineup, the company has emerged as a key global innovator in open-weight AI, combining ambitious research with pragmatic engineering.

    Open-Weight Leadership and Industry Context

    The release of MiniMax-M2 reinforces the growing leadership of Chinese AI research groups in open-weight model development.

    Following earlier contributions from DeepSeek, Alibaba’s Qwen series, and Moonshot AI, MiniMax’s entry continues the trend toward open, efficient systems designed for real-world use.

    Artificial Analysis observed that MiniMax-M2 exemplifies a broader shift in focus toward agentic capability and reinforcement-learning refinement, prioritizing controllable reasoning and real utility over raw model size.

    For enterprises, this means access to a state-of-the-art open model that can be audited, fine-tuned, and deployed internally with full transparency.

    By pairing strong benchmark performance with open licensing and efficient scaling, MiniMaxAI positions MiniMax-M2 as a practical foundation for intelligent systems that think, act, and assist with traceable logic—making it one of the most enterprise-ready open AI models available today.

  • Google Cloud takes aim at CoreWeave and AWS with managed Slurm for enterprise-scale AI training

    Some enterprises are best served by fine-tuning large models to their needs, but a number of companies plan to build their own models, a project that would require access to GPUs. 

    Google Cloud wants to play a bigger role in enterprises’ model-making journey with its new service, Vertex AI Training. The service gives enterprises looking to train their own models access to a managed Slurm environment, data science tooling and any chips capable of large-scale model training. 

    With this new service, Google Cloud hopes to turn more enterprises away from other providers and encourage the building of more company-specific AI models. 

    While Google Cloud has always offered the ability to customize its Gemini models, the new service allows customers to bring in their own models or customize any open-source model Google Cloud hosts. 

    Vertex AI Training positions Google Cloud directly against companies like CoreWeave and Lambda Labs, as well as its cloud competitors AWS and Microsoft Azure.  

    Jaime de Guerre, senior director of product management at Google Cloud, told VentureBeat that the company has been hearing from organizations of varying sizes that need a way to better optimize compute in a more reliable environment.

    “What we're seeing is that there's an increasing number of companies that are building or customizing large gen AI models to introduce a product offering built around those models, or to help power their business in some way,” de Guerre said. “This includes AI startups, technology companies, sovereign organizations building a model for a particular region or culture or language and some large enterprises that might be building it into internal processes.”

    De Guerre noted that while anyone can technically use the service, Google is targeting companies planning large-scale model training rather than simple fine-tuning or LoRA adapters. Vertex AI Training will focus on longer-running training jobs spanning hundreds or even thousands of chips. Pricing will depend on the amount of compute the enterprise needs.

    “Vertex AI Training is not for adding more information to the context or using RAG; this is to train a model where you might start from completely random weights,” he said.

    Model customization on the rise

    Enterprises are recognizing the value of building customized models rather than just fine-tuning an LLM or augmenting one with retrieval-augmented generation (RAG). Custom models would know more in-depth company information and respond with answers specific to the organization. Companies like Arcee.ai have begun offering their models for customization to clients. Adobe recently announced a new service that allows enterprises to retrain Firefly for their specific needs. Organizations like FICO, which create small language models specific to the finance industry, often buy GPUs to train them at significant cost.

    Google Cloud said Vertex AI Training differentiates itself by giving access to a larger set of chips, services to monitor and manage training and the expertise it learned from training the Gemini models. 

    Some early customers of Vertex AI Training include AI Singapore, a consortium of Singaporean research institutes and startups that built the 27-billion-parameter SEA-LION v4, and Salesforce’s AI research team. 

    Enterprises often have to choose between taking an already-built LLM and fine-tuning it or building their own model. But creating an LLM from scratch is usually unattainable for smaller companies, or it simply doesn’t make sense for some use cases. However, for organizations where a fully custom or from-scratch model makes sense, the issue is gaining access to the GPUs needed to run training.

    Model training can be expensive

    Training a model, de Guerre said, can be difficult and expensive, especially when organizations compete with several others for GPU space.

    Hyperscalers like AWS and Microsoft — and, yes, Google — have pitched that their massive data centers and racks and racks of high-end chips deliver the most value to enterprises. Not only will they have access to expensive GPUs, but cloud providers often offer full-stack services to help enterprises move to production.

    Services like CoreWeave gained prominence for offering on-demand access to Nvidia H100s, giving customers flexibility in compute power when building models or applications. This has also given rise to a business model in which companies with GPUs rent out server space.

    De Guerre said Vertex AI Training isn’t just about offering access to bare compute, where the enterprise rents a GPU server but has to bring its own training software and manage job timing and failures itself.

    “This is a managed Slurm environment that will help with all the job scheduling and automatic recovery of jobs failing,” de Guerre said. “So if a training job slows down or stops due to a hardware failure, the training will automatically restart very quickly, based on automatic checkpointing that we do in management of the checkpoints to continue with very little downtime.”

    He added that this provides higher throughput and more efficient training for a larger scale of compute clusters. 
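The pattern de Guerre describes is the standard checkpoint-and-resume loop. The sketch below is illustrative PyTorch, not Google's implementation, showing how periodic checkpoints bound the work lost to a hardware failure.

```python
# Illustrative checkpoint-and-resume training loop (not Google's code).
import os
import torch
import torch.nn as nn

CKPT = "checkpoints/latest.pt"
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

start = 0
if os.path.exists(CKPT):  # job was rescheduled after a failure
    state = torch.load(CKPT)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optim"])
    start = state["step"] + 1  # resume instead of restarting from scratch

for step in range(start, 10_000):
    x = torch.randn(32, 10)
    loss = model(x).pow(2).mean()  # stand-in for a real training loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 500 == 0:  # periodic checkpointing bounds lost work
        os.makedirs("checkpoints", exist_ok=True)
        torch.save({"model": model.state_dict(),
                    "optim": optimizer.state_dict(),
                    "step": step}, CKPT)
```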

    Services like Vertex AI Training could make it easier for enterprises to build niche models or completely customize existing models. Still, just because the option exists doesn’t mean it's the right fit for every enterprise. 

  • Anthropic rolls out Claude AI for finance, integrates with Excel to rival Microsoft Copilot

    Anthropic is making its most aggressive push yet into the trillion-dollar financial services industry, unveiling a suite of tools that embed its Claude AI assistant directly into Microsoft Excel and connect it to real-time market data from some of the world's most influential financial information providers.

    The San Francisco-based AI startup announced Monday it is releasing Claude for Excel, allowing financial analysts to interact with the AI system directly within their spreadsheets — the quintessential tool of modern finance. Beyond Excel, select Claude models are also being made available in Microsoft Copilot Studio and Researcher agent, expanding the integration across Microsoft's enterprise AI ecosystem. The integration marks a significant escalation in Anthropic's campaign to position itself as the AI platform of choice for banks, asset managers, and insurance companies, markets where precision and regulatory compliance matter far more than creative flair.

    The expansion comes just three months after Anthropic launched its Financial Analysis Solution in July, and it signals the company's determination to capture market share in an industry projected to spend $97 billion on AI by 2027, up from $35 billion in 2023.

    More importantly, it positions Anthropic to compete directly with Microsoft — ironically, its partner in this Excel integration — which has its own Copilot AI assistant embedded across its Office suite, and with OpenAI, which counts Microsoft as its largest investor.

    Why Excel has become the new battleground for AI in finance

    The decision to build directly into Excel is hardly accidental. Excel remains the lingua franca of finance, the digital workspace where analysts spend countless hours constructing financial models, running valuations, and stress-testing assumptions. By embedding Claude into this environment, Anthropic is meeting financial professionals exactly where they work rather than asking them to toggle between applications.

    Claude for Excel allows users to work with the AI in a sidebar where it can read, analyze, modify, and create new Excel workbooks while providing full transparency about the actions it takes by tracking and explaining changes and letting users navigate directly to referenced cells.

    This transparency feature addresses one of the most persistent anxieties around AI in finance: the "black box" problem. When billions of dollars ride on a financial model's output, analysts need to understand not just the answer but how the AI arrived at it. By showing its work at the cell level, Anthropic is attempting to build the trust necessary for widespread adoption in an industry where careers and fortunes can turn on a misplaced decimal point.

    The technical implementation is sophisticated. Claude can discuss how spreadsheets work, modify them while preserving formula dependencies — a notoriously complex task — debug cell formulas, populate templates with new data, or build entirely new spreadsheets from scratch. This isn't merely a chatbot that answers questions about your data; it's a collaborative tool that can actively manipulate the models that drive investment decisions worth trillions of dollars.
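To see why formula preservation is the hard part, consider a plain openpyxl edit of the kind such a tool has to get right. This is a generic sketch with a hypothetical workbook, not Anthropic's implementation.

```python
# Generic sketch: change an input cell without clobbering dependent formulas.
from openpyxl import load_workbook

wb = load_workbook("model.xlsx")  # hypothetical workbook
ws = wb["DCF"]
print(ws["B10"].value)  # e.g. "=B8*(1+B9)", the formula itself, not a cached value
ws["B9"] = 0.03         # update only the growth-rate input
wb.save("model_updated.xlsx")  # Excel recalculates B10 on next open
```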

    How Anthropic is building data moats around its financial AI platform

    Perhaps more significant than the Excel integration is Anthropic's expansion of its connector ecosystem, which now links Claude to live market data and proprietary research from financial information giants. The company added six major new data partnerships spanning the entire spectrum of financial information that professional investors rely upon.

    Aiera now provides Claude with real-time earnings call transcripts and summaries of investor events like shareholder meetings, presentations, and conferences. The Aiera connector also enables a data feed from Third Bridge, which gives Claude access to a library of insights interviews, company intelligence, and industry analysis from experts and former executives. Chronograph gives private equity investors operational and financial information for portfolio monitoring and conducting due diligence, including performance metrics, valuations, and fund-level data.

    Egnyte enables Claude to securely search permitted data for internal data rooms, investment documents, and approved financial models while maintaining governed access controls. LSEG, the London Stock Exchange Group, connects Claude to live market data including fixed income pricing, equities, foreign exchange rates, macroeconomic indicators, analyst estimates, and other important financial metrics. Moody's provides access to proprietary credit ratings, research, and company data covering ownership, financials, and news on more than 600 million public and private companies, supporting work and research in compliance, credit analysis, and business development. MT Newswires provides Claude with access to the latest global multi-asset class news on financial markets and economies.

    These partnerships amount to a land grab for the informational infrastructure that powers modern finance. Previously announced in July, Anthropic had already secured integrations with S&P Capital IQ, Daloopa, Morningstar, FactSet, PitchBook, Snowflake, and Databricks. Together, these connectors give Claude access to virtually every category of financial data an analyst might need: fundamental company data, market prices, credit assessments, private company intelligence, alternative data, and breaking news.

    This matters because the quality of AI outputs depends entirely on the quality of inputs. Generic large language models trained on public internet data simply cannot compete with systems that have direct pipelines to Bloomberg-quality financial information. By securing these partnerships, Anthropic is building moats around its financial services offering that competitors will find difficult to replicate.

    The strategic calculus here is clear: Anthropic is betting that domain-specific AI systems with privileged access to proprietary data will outcompete general-purpose AI assistants. It's a direct challenge to the "one AI to rule them all" approach favored by some competitors.

    Pre-configured workflows target the daily grind of Wall Street analysts

    The third pillar of Anthropic's announcement involves six new "Agent Skills" — pre-configured workflows for common financial tasks. These skills are Anthropic's attempt to productize the workflows of entry-level and mid-level financial analysts, professionals who spend their days building models, processing due diligence documents, and writing research reports. Anthropic has designed skills specifically to automate these time-consuming tasks.

    The new skills include building discounted cash flow models complete with full free cash flow projections, weighted average cost of capital calculations, scenario toggles, and sensitivity tables. There's comparable company analysis featuring valuation multiples and operating metrics that can be easily refreshed with updated data. Claude can now process data room documents into Excel spreadsheets populated with financial information, customer lists, and contract terms. It can create company teasers and profiles for pitch books and buyer lists, perform earnings analyses that use quarterly transcripts and financials to extract important metrics, guidance changes, and management commentary, and produce initiating coverage reports with industry analysis, company deep dives, and valuation frameworks.
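The mechanics being packaged here are standard corporate-finance arithmetic. Below is a generic sketch of the discounted cash flow math such a skill automates, with made-up figures for illustration.

```python
# Generic DCF arithmetic: discount projected free cash flows at WACC,
# then add a Gordon-growth terminal value. Numbers are illustrative.
def dcf_value(fcfs, wacc, terminal_growth):
    pv_fcf = sum(fcf / (1 + wacc) ** t for t, fcf in enumerate(fcfs, start=1))
    terminal = fcfs[-1] * (1 + terminal_growth) / (wacc - terminal_growth)
    pv_terminal = terminal / (1 + wacc) ** len(fcfs)
    return pv_fcf + pv_terminal

# Five years of projected free cash flow ($M), 9% WACC, 2.5% terminal growth.
value = dcf_value([120, 135, 150, 162, 175], wacc=0.09, terminal_growth=0.025)
print(f"Enterprise value: ${value:,.0f}M")
```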

    It's worth noting that Anthropic's Sonnet 4.5 model now tops the Finance Agent benchmark from Vals AI at 55.3% accuracy, a metric designed to test AI systems on tasks expected of entry-level financial analysts. A 55% accuracy rate might sound underwhelming, but it is state-of-the-art performance and highlights both the promise and limitations of AI in finance. The technology can clearly handle sophisticated analytical tasks, but it's not yet reliable enough to operate autonomously without human oversight — a reality that may actually reassure both regulators and the analysts whose jobs might otherwise be at risk.

    The Agent Skills approach is particularly clever because it packages AI capabilities in terms that financial institutions already understand. Rather than selling generic "AI assistance," Anthropic is offering solutions to specific, well-defined problems: "You need a DCF model? We have a skill for that. You need to analyze earnings calls? We have a skill for that too."

    Trillion-dollar clients are already seeing massive productivity gains

    Anthropic's financial services strategy appears to be gaining traction with exactly the kind of marquee clients that matter in enterprise sales. The company counts among its clients AIA Labs at Bridgewater, Commonwealth Bank of Australia, American International Group, and Norges Bank Investment Management — Norway's $1.6 trillion sovereign wealth fund, one of the world's largest institutional investors.

    NBIM CEO Nicolai Tangen reported achieving approximately 20% productivity gains, equivalent to 213,000 hours, with portfolio managers and risk departments now able to "seamlessly query our Snowflake data warehouse and analyze earnings calls with unprecedented efficiency."

    At AIG, CEO Peter Zaffino said the partnership has "compressed the timeline to review business by more than 5x in our early rollouts while simultaneously improving our data accuracy from 75% to over 90%." If these numbers hold across broader deployments, the productivity implications for the financial services industry are staggering.

    These aren't pilot programs or proof-of-concept deployments; they're production implementations at institutions managing trillions of dollars in assets and making underwriting decisions that affect millions of customers. Their public endorsements provide the social proof that typically drives enterprise adoption in conservative industries.

    Regulatory uncertainty creates both opportunity and risk for AI deployment

    Yet Anthropic's financial services ambitions unfold against a backdrop of heightened regulatory scrutiny and shifting enforcement priorities. In 2023, the Consumer Financial Protection Bureau released guidance requiring lenders to "use specific and accurate reasons when taking adverse actions against consumers" involving AI, and issued additional guidance requiring regulated entities to "evaluate their underwriting models for bias" and "evaluate automated collateral-valuation and appraisal processes in ways that minimize bias."

    However, according to a Brookings Institution analysis, these measures have since been revoked, with related work stopped or eliminated at the downsized CFPB under the current administration, creating regulatory uncertainty. The pendulum has swung from the Biden administration's cautious approach, exemplified by an executive order on safe AI development, toward the Trump administration's "America's AI Action Plan," which seeks to "cement U.S. dominance in artificial intelligence" through deregulation.

    This regulatory flux creates both opportunities and risks. Financial institutions eager to deploy AI now face less prescriptive federal oversight, potentially accelerating adoption. But the absence of clear guardrails also exposes them to potential liability if AI systems produce discriminatory outcomes, particularly in lending and underwriting.

    The Massachusetts Attorney General recently reached a $2.5 million settlement with student loan company Earnest Operations, alleging that its use of AI models resulted in "disparate impact in approval rates and loan terms, specifically disadvantaging Black and Hispanic applicants." Such cases will likely multiply as AI deployment grows, creating a patchwork of state-level enforcement even as federal oversight recedes.

    Anthropic appears acutely aware of these risks. In an interview with Banking Dive, Jonathan Pelosi, Anthropic's global head of industry for financial services, emphasized that Claude requires a "human in the loop." The platform, he said, is not intended for autonomous financial decision-making or to provide stock recommendations that users follow blindly. During client onboarding, Pelosi told the publication, Anthropic focuses on training and understanding model limitations, putting guardrails in place so people treat Claude as a helpful technology rather than a replacement for human judgment.

    Competition heats up as every major tech company targets finance AI

    Anthropic's financial services push comes as AI competition intensifies across the enterprise. OpenAI, Microsoft, Google, and numerous startups are all vying for position in what may become one of AI's most lucrative verticals. Goldman Sachs introduced a generative AI assistant to its bankers, traders, and asset managers in January, signaling that major banks may build their own capabilities rather than rely exclusively on third-party providers.

    The emergence of domain-specific AI models like BloombergGPT — trained specifically on financial data — suggests the market may fragment between generalized AI assistants and specialized tools. Anthropic's strategy appears to stake out a middle ground: general-purpose models, since Claude was not trained exclusively on financial data, enhanced with financial-specific tooling, data access, and workflows.

    The company's partnership strategy with implementation consultancies including Deloitte, KPMG, PwC, Slalom, TribeAI, and Turing is equally critical. These firms serve as force multipliers, embedding Anthropic's technology into their own service offerings and providing the change management expertise that financial institutions need to successfully adopt AI at scale.

    CFOs worry about AI hallucinations and cascading errors

    The broader question is whether AI tools like Claude will genuinely transform financial services productivity or merely shift work around. The PYMNTS Intelligence report "The Agentic Trust Gap" found that chief financial officers remain hesitant about AI agents, with "nagging concern" about hallucinations where "an AI agent can go off script and expose firms to cascading payment errors and other inaccuracies."

    "For finance leaders, the message is stark: Harness AI's momentum now, but build the guardrails before the next quarterly call—or risk owning the fallout," the report warned.

    A 2025 KPMG report found that 70% of board members have developed responsible use policies for employees, with other popular initiatives including implementing a recognized AI risk and governance framework, developing ethical guidelines and training programs for AI developers, and conducting regular AI use audits.

    The financial services industry faces a delicate balancing act: move too slowly and risk competitive disadvantage as rivals achieve productivity gains; move too quickly and risk operational failures, regulatory penalties, or reputational damage. Speaking at the Evident AI Symposium in New York last week, Ian Glasner, HSBC's group head of emerging technology, innovation and ventures, struck an optimistic tone about the sector's readiness for AI adoption. "As an industry, we are very well prepared to manage risk," he said, according to CIO Dive. "Let's not overcomplicate this. We just need to be focused on the business use case and the value associated."

    Anthropic's latest moves suggest the company sees financial services as a beachhead market where AI's value proposition is clear, customers have deep pockets, and the technical requirements play to Claude's strengths in reasoning and accuracy. By building Excel integration, securing data partnerships, and pre-packaging common workflows, Anthropic is reducing the friction that typically slows enterprise AI adoption.

    The $61.5 billion valuation the company commanded in its March fundraising round — up from roughly $16 billion a year earlier — suggests investors believe this strategy will work. But the real test will come as these tools move from pilot programs to production deployments across thousands of analysts and billions of dollars in transactions.

    Financial services may prove to be AI's most demanding proving ground: an industry where mistakes are costly, regulation is stringent, and trust is everything. If Claude can successfully navigate the spreadsheet cells and data feeds of Wall Street without hallucinating a decimal point in the wrong direction, Anthropic will have accomplished something far more valuable than winning another benchmark test. It will have proven that AI can be trusted with the money.