Blog

  • Google’s new framework helps AI agents spend their compute and tool budget more wisely

    In a new paper that studies tool-use in large language model (LLM) agents, researchers at Google and UC Santa Barbara have developed a framework that enables agents to make more efficient use of tool and compute budgets. The researchers introduce two new techniques: a simple "Budget Tracker" and a more comprehensive framework called "Budget Aware Test-time Scaling." These techniques make agents explicitly aware of their remaining reasoning and tool-use allowance.

    As AI agents rely on tool calls to work in the real world, test-time scaling has become less about smarter models and more about controlling cost and latency.

    For enterprise leaders and developers, budget-aware scaling techniques offer a practical path to deploying effective AI agents without facing unpredictable costs or diminishing returns on compute spend.

    The challenge of scaling tool use

    Traditional test-time scaling focuses on letting models "think" longer. However, for agentic tasks like web browsing, the number of tool calls directly determines the depth and breadth of exploration.

    This introduces significant operational overhead for businesses. "Tool calls such as webpage browsing results in more token consumption, increases the context length and introduces additional time latency," Zifeng Wang and Tengxiao Liu, co-authors of the paper, told VentureBeat. "Tool calls themselves introduce additional API costs."

    The researchers found that simply granting agents more test-time resources does not guarantee better performance. "In a deep research task, if the agent has no sense of budget, it often goes down blindly," Wang and Liu explained. "It finds one somewhat related lead, then spends 10 or 20 tool calls digging into it, only to realize that the entire path was a dead end."

    Optimizing resources with Budget Tracker

    To evaluate how they can optimize tool-use budgets, the researchers first tried a lightweight approach called "Budget Tracker." This module acts as a plug-in that provides the agent with a continuous signal of resource availability, enabling budget-aware tool use.

    The team hypothesized that "providing explicit budget signals enables the model to internalize resource constraints and adapt its strategy without requiring additional training."

    Budget Tracker operates purely at the prompt level, which makes it easy to implement; the paper provides full details on the prompts used.

    In Google's implementation, the tracker provides a brief policy guideline describing the budget regimes and corresponding recommendations for using tools. At each step of the response process, Budget Tracker makes the agent explicitly aware of its resource consumption and remaining budget, enabling it to condition subsequent reasoning steps on the updated resource state.
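To make the idea concrete, here is a minimal sketch of what such a prompt-level tracker might look like. The class name, thresholds, and prompt wording are all illustrative assumptions, not taken from the paper:

```python
# Illustrative sketch of a prompt-level budget tracker.
# All names, regimes, and thresholds are hypothetical.

class BudgetTracker:
    """Tracks tool-call consumption and renders a status string
    that is appended to the agent's prompt at every step."""

    def __init__(self, max_tool_calls: int):
        self.max_tool_calls = max_tool_calls
        self.used = 0

    def record_call(self) -> None:
        self.used += 1

    @property
    def remaining(self) -> int:
        return max(self.max_tool_calls - self.used, 0)

    def status_prompt(self) -> str:
        # The agent conditions its next reasoning step on this signal.
        frac = self.remaining / self.max_tool_calls
        if frac > 0.5:
            regime = "ample: explore multiple leads in parallel"
        elif frac > 0.2:
            regime = "limited: prioritize the most promising lead"
        else:
            regime = "critical: stop exploring and finalize an answer"
        return (f"[Budget] {self.used}/{self.max_tool_calls} tool calls used, "
                f"{self.remaining} remaining. Guidance: {regime}.")


tracker = BudgetTracker(max_tool_calls=20)
for _ in range(15):
    tracker.record_call()
print(tracker.status_prompt())
```

The key design point is that the tracker never intervenes in the agent's logic; it only injects a resource signal into the context and lets the model adapt its own strategy.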

    To test this, the researchers experimented with two paradigms: sequential scaling, where the model iteratively refines its output, and parallel scaling, where multiple independent runs are conducted and aggregated. They ran experiments on search agents equipped with search and browse tools following a ReAct-style loop. ReAct (Reasoning + Acting) is a popular method where the model alternates between internal thinking and external actions. To trace a true cost-performance scaling trend, they developed a unified cost metric that jointly accounts for the costs of both internal token consumption and external tool interactions.
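A unified cost metric of this kind can be sketched as a simple weighted sum of token spend and per-call tool charges. The prices below are illustrative placeholders, not the paper's actual rates:

```python
# Hypothetical unified cost metric: internal token spend plus
# external tool-call costs. All prices are illustrative placeholders.

def unified_cost(input_tokens: int, output_tokens: int,
                 search_calls: int, browse_calls: int,
                 price_in: float = 1.25e-6,     # $/input token (assumed)
                 price_out: float = 10e-6,      # $/output token (assumed)
                 price_search: float = 0.005,   # $/search call (assumed)
                 price_browse: float = 0.002):  # $/browse call (assumed)
    token_cost = input_tokens * price_in + output_tokens * price_out
    tool_cost = search_calls * price_search + browse_calls * price_browse
    return token_cost + tool_cost


# A run with 200K input tokens, 8K output tokens, 30 searches, 50 browses:
print(round(unified_cost(200_000, 8_000, 30, 50), 4))
```

Tracing accuracy against this single number, rather than against tool calls or tokens alone, is what lets the researchers compare sequential and parallel scaling on a common cost axis.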

    They tested Budget Tracker on three information-seeking QA datasets requiring external search, including BrowseComp and HLE-Search, using models such as Gemini 2.5 Pro, Gemini 2.5 Flash, and Claude Sonnet 4. The experiments show that this simple plug-in improves performance across various budget constraints.

    "Adding Budget Tracker achieves comparable accuracy using 40.4% fewer search calls, 19.9% fewer browse calls, and reducing overall cost … by 31.3%," the authors told VentureBeat. Finally, Budget Tracker continued to scale as the budget increased, whereas plain ReAct plateaued after a certain threshold.

    BATS: A comprehensive framework for budget-aware scaling

    To further improve tool-use resource optimization, the researchers introduced Budget Aware Test-time Scaling (BATS), a framework designed to maximize agent performance under any given budget. BATS maintains a continuous signal of remaining resources and uses this information to dynamically adapt the agent's behavior as it formulates its response.

    BATS uses multiple modules to orchestrate the agent's actions. A planning module adjusts stepwise effort to match the current budget, while a verification module decides whether to "dig deeper" into a promising lead or "pivot" to alternative paths based on resource availability.

    Given an information-seeking question and a tool-call budget, BATS begins by using the planning module to formulate a structured action plan and decide which tools to invoke. When tools are invoked, their responses are appended to the reasoning sequence to provide the context with new evidence. When the agent proposes a candidate answer, the verification module verifies it and decides whether to continue the current sequence or initiate a new attempt with the remaining budget.

    The iterative process ends when budgeted resources are exhausted, at which point an LLM-as-a-judge selects the best answer across all verified answers. Throughout the execution, the Budget Tracker continuously updates both resource usage and remaining budget at every iteration.
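The control flow described above can be summarized in a short schematic loop. The `plan`, `act`, `verify`, and `judge` functions below are stand-in stubs for the paper's modules, not their actual implementations:

```python
# Schematic of the BATS control loop as described in the article.
# plan / act / verify / judge are stand-in stubs, not the real modules.
import random

random.seed(0)


def plan(question, remaining):            # planning module (stub)
    return f"step for '{question}' with {remaining} calls left"


def act(action):                          # tool invocation (stub)
    return f"evidence from ({action})"


def propose_answer(context):              # candidate answer (stub)
    return f"answer based on {len(context)} pieces of evidence"


def verify(answer):                       # verification module (stub)
    # True = accept this candidate and pivot to a fresh attempt;
    # False = keep digging along the current path.
    return random.random() > 0.5


def judge(candidates):                    # LLM-as-a-judge (stub)
    return max(candidates, key=len)


def bats(question, budget):
    context, candidates = [], []
    while budget > 0:
        context.append(act(plan(question, budget)))  # tool response -> context
        budget -= 1
        answer = propose_answer(context)
        if verify(answer):
            candidates.append(answer)
            context = []                  # start a new attempt with what's left
    # Budget exhausted: pick the best verified answer.
    return judge(candidates) if candidates else answer


print(bats("example question", budget=5))
```

The essential structure is that every decision point (plan depth, dig vs. pivot, when to stop) is conditioned on the remaining budget rather than on a fixed step count.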

    The researchers tested BATS on the BrowseComp, BrowseComp-ZH, and HLE-Search benchmarks against baselines including standard ReAct and various training-based agents. Their experiments show that BATS achieves higher performance while using fewer tool calls and incurring lower overall cost than competing methods. Using Gemini 2.5 Pro as the backbone, BATS achieved 24.6% accuracy on BrowseComp compared to 12.6% for standard ReAct, and 27.0% on HLE-Search compared to 20.5% for ReAct.

    BATS not only improves effectiveness under budget constraints but also yields better cost–performance trade-offs. For example, on the BrowseComp dataset, BATS achieved higher accuracy at a cost of approximately 23 cents compared to a parallel scaling baseline that required over 50 cents to achieve a similar result.

    According to the authors, this efficiency makes previously expensive workflows viable. "This unlocks a range of long-horizon, data-intensive enterprise applications… such as complex codebase maintenance, due-diligence investigations, competitive landscape research, compliance audits, and multi-step document analysis," they said.

    As enterprises look to deploy agents that manage their own resources, the ability to balance accuracy with cost will become a critical design requirement.

    "We believe the relationship between reasoning and economics will become inseparable," Wang and Liu said. "In the future, [models] must reason about value."

  • Ai2’s new Olmo 3.1 extends reinforcement learning training for stronger reasoning benchmarks

    The Allen Institute for AI (Ai2) recently released what it calls its most powerful family of models yet, Olmo 3. But the company kept iterating on the models, expanding its reinforcement learning (RL) runs, to create Olmo 3.1.

    The new Olmo 3.1 models focus on efficiency, transparency, and control for enterprises. 

    Ai2 updated two of the three versions of Olmo 3: Olmo 3.1 Think 32B, the flagship model optimized for advanced research, and Olmo 3.1 Instruct 32B, designed for instruction-following, multi-turn dialogue, and tool use. 

    Olmo 3 has a third version, Olmo 3-Base, for programming, comprehension, and math. It also works well for continued fine-tuning. 

    Ai2 said that to upgrade Olmo 3 Think 32B to Olmo 3.1, its researchers extended its best RL run with a longer training schedule. 

    “After the original Olmo 3 launch, we resumed our RL training run for Olmo 3 32B Think, training for an additional 21 days on 224 GPUs with extra epochs over our Dolci-Think-RL dataset,” Ai2 said in a blog post. “This yielded Olmo 3.1 32B Think, which brings substantial gains across math, reasoning, and instruction-following benchmarks: improvements of 5+ points on AIME, 4+ points on ZebraLogic, 4+ points on IFEval, and 20+ points on IFBench, alongside stronger performance on coding and complex multi-step tasks.”

    To get to Olmo 3.1 Instruct, Ai2 said its researchers applied the recipe behind the smaller Instruct size, 7B, to the larger model.

    Olmo 3.1 Instruct 32B is "optimized for chat, tool use, & multi-turn dialogue—making it a much more performant sibling of Olmo 3 Instruct 7B and ready for real-world applications,” Ai2 said in a post on X.

    For now, the new checkpoints are available on the Ai2 Playground or Hugging Face, with API access coming soon. 

    Better performance on benchmarks

    The Olmo 3.1 models performed well on benchmark tests, predictably beating the Olmo 3 models. 

    Olmo 3.1 Think outperformed Qwen 3 32B models in the AIME 2025 benchmark and performed close to Gemma 27B. 

    Olmo 3.1 Instruct performed strongly against its open-source peers, even beating models like Gemma 3 on the Math benchmark.

    “As for Olmo 3.1 32B Instruct, it’s a larger-scale instruction-tuned model built for chat, tool use, and multi-turn dialogue. Olmo 3.1 32B Instruct is our most capable fully open chat model to date and — in our evaluations — the strongest fully open 32B-scale instruct model,” the company said. 

    Ai2 also upgraded its RL-Zero 7B models for math and coding. The company said on X that both models benefited from longer and more stable training runs.

    Commitment to transparency and open source 

    Ai2 previously told VentureBeat that it designed the Olmo 3 family of models to offer enterprises and research labs more control and understanding of the data and training that went into the model. 

    Organizations can add to the model’s data mix and retrain it so the model also learns from what’s been added.  

    This has long been a commitment for Ai2, which also offers a tool called OlmoTrace that tracks how LLM outputs match its training data.  

    “Together, Olmo 3.1 Think 32B and Olmo 3.1 Instruct 32B show that openness and performance can advance together. By extending the same model flow, we continue to improve capabilities while retaining end-to-end transparency over data, code, and training decisions,” Ai2 said. 

  • OpenAI’s GPT-5.2 is here: what enterprises need to know

    The rumors were true: OpenAI on Thursday announced the release of its new frontier large language model (LLM) family, GPT-5.2.

    It comes at a pivotal moment for the AI pioneer, which has faced intensifying pressure since rival Google’s Gemini 3 LLM seized the top spot on major third-party performance leaderboards and many key benchmarks last month, though OpenAI leaders stressed in a press briefing that the timing of this release had been discussed and worked on well in advance of the release of Gemini 3.

    OpenAI describes GPT-5.2 as its "most capable model series yet for professional knowledge work," aiming to reclaim the performance crown with significant gains in reasoning, coding, and agentic workflows.

    "It’s our most advanced frontier model and the strongest yet in the market for professional use," Fidji Simo, OpenAI’s CEO of Applications, said during a press briefing today. "We designed 5.2 to unlock even more economic value for people. It's better at creating spreadsheets, building presentations, writing code, perceiving images, understanding long context, using tools, and handling complex, multi-step projects."

    GPT-5.2 features a massive 400,000-token context window — allowing it to ingest hundreds of documents or large code repositories at once — and a 128,000 max output token limit, enabling it to generate extensive reports or full applications in a single go.

    The model also features a knowledge cutoff of August 31, 2025, ensuring it is up-to-date with relatively recent world events and technical documentation. It explicitly includes "Reasoning token support," confirming the underlying architecture uses the chain-of-thought processing popularized by the "o1" series.

    The 'Code Red' Reality Check

    The release arrives following The Information's report of an emergency "Code Red" directive to OpenAI staff from CEO Sam Altman to improve ChatGPT — a move reportedly designed to mobilize resources following the "quality gap" exposed by Gemini 3. The Verge similarly reported on the timing of GPT-5.2's release ahead of the official announcement.

    During the briefing, OpenAI executives acknowledged the directive but pushed back on the narrative that the model was rushed solely to answer Google.

    "It is important to note this has been in the works for many, many months," Simo told reporters. She clarified that while the "Code Red" helped focus the company, it wasn't the sole driver of the timeline.

    "We announced this Code Red to really signal to the company that we want to marshal resources in one particular area… but that's not the reason it's coming out this week in particular."

    Max Schwarzer, lead of OpenAI's post-training team, echoed this sentiment to dispel the idea of a panic launch. "We've been planning for this release since a very long time ago… this specific week we talked about many months ago."

    A spokesperson from OpenAI further clarified that the "Code Red" call applied to ChatGPT as a product, not solely underlying model development or the release of new models.

    Under the Hood: Instant, Thinking, and Pro

    OpenAI is segmenting the GPT-5.2 release into three distinct tiers within ChatGPT, a strategy likely designed to balance the massive compute costs of "reasoning" models with user demand for speed:

    • GPT-5.2 Instant: Optimized for speed and daily tasks like writing, translation, and information seeking.

    • GPT-5.2 Thinking: Designed for "complex, structured work" and long-running agents, this model leverages deeper reasoning chains to handle coding, math, and multi-step projects.

    • GPT-5.2 Pro: The new heavyweight champion. OpenAI describes this as its "smartest and most trustworthy option," delivering the highest accuracy for difficult questions where quality outweighs latency.

    For developers, the models are available immediately in the application programming interface (API) as gpt-5.2, gpt-5.2-chat-latest (Instant), and gpt-5.2-pro.

    The Numbers: Beating the Benchmarks

    The GPT-5.2 release includes leading metrics across most domains — specifically those that target the "professional knowledge work" gap where competitors have recently gained ground.

    OpenAI highlighted a new benchmark called GDPval, which measures performance on "well-specified knowledge work tasks" across 44 occupations.

    "GPT-5.2 Thinking is now state-of-the-art on that benchmark… and beats or ties top industry professionals on 70.9% of well-specified professional tasks like spreadsheets, presentations, and document creation, according to expert human judges," Simo said.

    In the critical arena of coding, OpenAI is claiming a decisive lead. Schwarzer noted that on SWE-bench Pro, a rigorous evaluation of real-world software engineering, GPT-5.2 Thinking sets a new state-of-the-art score of 55.6%.

    He emphasized that this benchmark is "more contamination resistant, challenging, diverse, and industrially relevant than previous benchmarks like SWE-bench Verified."

    Other key benchmark results include:

    • GPQA Diamond (Science): GPT-5.2 Pro scored 93.2%, edging out GPT-5.2 Thinking (92.4%) and surpassing GPT-5.1 Thinking (88.1%).

    • FrontierMath: On Tier 1-3 problems, GPT-5.2 Thinking solved 40.3%, a significant jump from the 31.0% achieved by its predecessor.

    • ARC-AGI-1: GPT-5.2 Pro is reportedly the first model to cross the 90% threshold on this general reasoning benchmark, scoring 90.5%.

    The Price of Intelligence

    Performance comes at a premium. While ChatGPT subscription pricing remains unchanged for now, the API costs for the new flagship models are steep compared to previous generations, reflecting the high compute demands of "thinking" mode. They're also on the upper-end of API costs for the industry.

    • GPT-5.2 Thinking: Priced at $1.75 per 1 million input tokens and $14 per 1 million output tokens.

    • GPT-5.2 Pro: The costs jump significantly to $21 per 1 million input tokens and $168 per 1 million output tokens.

    GPT-5.2 Thinking is priced 40% higher in the API than the standard GPT-5.1 ($1.25/$10), signaling that OpenAI views the new reasoning capabilities as a tangible value-add rather than a mere efficiency update.

    The high-end GPT-5.2 Pro follows the same pattern, costing 40% more than the previous GPT-5 Pro ($15/$120). While expensive, it still undercuts OpenAI’s most specialized reasoning model, o1-pro, which remains the most costly offering on the menu at a staggering $150 per million input tokens and $600 per million output tokens.
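A quick sanity check of those generation-over-generation figures, plus what a single maxed-out call would cost at the listed rates (the 400K input / 128K output sizes come from the model's stated limits; the helper names are mine):

```python
# Quick check of the pricing math cited above (rates in $ per 1M tokens).

def pct_increase(old, new):
    return (new - old) / old * 100


def job_cost(input_toks, output_toks, price_in, price_out):
    return input_toks / 1e6 * price_in + output_toks / 1e6 * price_out


# GPT-5.1 -> GPT-5.2 Thinking: $1.25/$10 -> $1.75/$14
print(pct_increase(1.25, 1.75), pct_increase(10, 14))
# GPT-5 Pro -> GPT-5.2 Pro: $15/$120 -> $21/$168
print(pct_increase(15, 21), pct_increase(120, 168))
# A full-context GPT-5.2 Thinking call: 400K input, 128K output tokens
print(round(job_cost(400_000, 128_000, 1.75, 14.00), 2))
```

Both tiers work out to a uniform 40% increase on input and output alike, and a single worst-case Thinking call lands in the neighborhood of $2.50.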

    OpenAI argues that despite the higher per-token cost, the model’s "greater token efficiency" and ability to solve tasks in fewer turns make it economically viable for high-value enterprise workflows.

    Here's how it compares to the current API costs for other competing models across the LLM field:

    | Model | Input (/1M) | Output (/1M) | Total Cost | Source |
    |---|---|---|---|---|
    | Qwen 3 Turbo | $0.05 | $0.20 | $0.25 | Alibaba Cloud |
    | Grok 4.1 Fast (reasoning) | $0.20 | $0.50 | $0.70 | xAI |
    | Grok 4.1 Fast (non-reasoning) | $0.20 | $0.50 | $0.70 | xAI |
    | deepseek-chat (V3.2-Exp) | $0.28 | $0.42 | $0.70 | DeepSeek |
    | deepseek-reasoner (V3.2-Exp) | $0.28 | $0.42 | $0.70 | DeepSeek |
    | Qwen 3 Plus | $0.40 | $1.20 | $1.60 | Alibaba Cloud |
    | ERNIE 5.0 | $0.85 | $3.40 | $4.25 | Qianfan |
    | Claude Haiku 4.5 | $1.00 | $5.00 | $6.00 | Anthropic |
    | Qwen-Max | $1.60 | $6.40 | $8.00 | Alibaba Cloud |
    | Gemini 3 Pro (≤200K) | $2.00 | $12.00 | $14.00 | Google |
    | GPT-5.2 | $1.75 | $14.00 | $15.75 | OpenAI |
    | Claude Sonnet 4.5 | $3.00 | $15.00 | $18.00 | Anthropic |
    | Gemini 3 Pro (>200K) | $4.00 | $18.00 | $22.00 | Google |
    | Claude Opus 4.5 | $5.00 | $25.00 | $30.00 | Anthropic |
    | GPT-5.2 Pro | $21.00 | $168.00 | $189.00 | OpenAI |

    Image Generation: Nothing New Yet…But 'More to Come'

    During the briefing, VentureBeat asked the OpenAI participants if the new release included any boost to image generation capabilities, noting the excitement around similar features in recent competitor launches like Google's Gemini 3 Image aka Nano Banana Pro.

    Unfortunately for those hoping to recreate the text- and information-heavy graphics and image-editing capabilities of those rival releases, OpenAI executives clarified that GPT-5.2 brings no image-generation improvements over GPT-5.1 and OpenAI's existing DALL-E 3 and gpt-4o native image models.

    "On image Gen, nothing to announce today, but more to come," Simo said. She acknowledged the popularity of the feature, adding, "We know this is a very important use case that people love, that we introduced [to] the market, and so definitely more to come there."

    Aidan Clark, OpenAI's lead of training, also declined to comment on visual generation specifics, stating simply, "I can't really speak to image Gen myself."

    The 'Mega-Agent' Era

    Beyond raw scores, OpenAI is positioning GPT-5.2 as the engine for a new generation of "long-running agents" capable of executing multi-step workflows without human hand-holding.

    "Box found that 5.2 can extract information from long, complex documents about 40% faster, and also saw a 40% boost in reasoning accuracy for Life Sciences and healthcare," Simo said.

    She also noted that Notion reported the model "outperforms 5.1 across every dimension… and it excels at the kind of really ambiguous, longer rising tasks that define real knowledge work."

    Schwarzer added that coding startups like Augment Code found the model "delivered substantially stronger deep code capabilities than any prior model," which is why it was selected to power their new code review agent.

    Visual capabilities have also seen an upgrade.

    OpenAI's release blog post shows an example where "a traveler reports a delayed flight, a missed connection, an overnight stay in New York, and a medical seating requirement."

    The outcome? "GPT‑5.2 manages the entire chain of tasks—rebooking, special-assistance seating, and compensation—delivering a more complete outcome than GPT‑5.1."

    A new evaluation called ScreenSpot-Pro, which tests a model's ability to understand GUI screenshots, shows GPT-5.2 Thinking achieving 86.3% accuracy, compared to just 64.2% for GPT-5.1.

    Science and Reliability

    OpenAI leaders also stressed the model's utility for scientific research, attempting to move the conversation beyond simple chatbots to research assistants.

    Aidan Clark, lead of the training team, shared an example of a senior immunology researcher testing the model.

    "They tested it by asking it to generate the most important unanswered questions about the immune system," Clark said. "That immunology researcher reported that GPT-5.2 produced sharper questions and stronger explanations for why those questions… matter compared to any previous pro model."

    Reliability was another key focus. Schwarzer claimed the new model "hallucinates substantially less than GPT-5.1," noting that on a set of de-identified queries, "responses contained errors 38% less often."

    The 'Vibe' Shift

    Interestingly, OpenAI acknowledged that not every user might immediately prefer the new models.

    When asked why legacy models like GPT-5.1 would remain available, Schwarzer admitted that "models change a little bit every time."

    "Some users may find that they prefer the vibes of the previous model, even though we think the latest one is across the board generally much better," Schwarzer said. He also noted that for some enterprise customers who have "really fine-tuned a prompt for a specific model," there might be "small regressions," necessitating access to the older versions.

    Safety, 'Adult Mode,' and Future Roadmap

    Addressing safety concerns, Simo confirmed that the company is preparing to roll out an "Adult Mode" in the first quarter of next year, following the implementation of a new age prediction system.

    "We're in the process of improving that," Simo said regarding the age prediction technology.

    "We want to do that ahead of launching adult mode."

    Looking further ahead, industry reports suggest OpenAI is working on a more fundamental architectural shift under the codename "Project Garlic," targeting a flagship release in early 2026.

    While executives did not comment on specific future roadmaps during the briefing, Simo remained optimistic about the economics of their current trajectory.

    "If you look at historical trends, compute has increased about 3x every year for the last three years," she explained. "Revenue has also increased at the same pace… creating this virtuous cycle."

    Clark added that efficiency is improving rapidly: "The model we're releasing today achieves an even better score [on ARC-AGI] with almost 400 times less cost and less compute associated with it" compared to models from a year ago.

    GPT-5.2 Instant, Thinking, and Pro begin rolling out in ChatGPT today to paid users (Plus, Pro, Team, and Enterprise). The company notes the rollout will be gradual to maintain stability.

  • GPT-5.2 first impressions: a powerful update, especially for business tasks and workflows

    OpenAI has officially released GPT-5.2, and the reactions from early testers — to whom OpenAI provided the model several days before public release, in some cases weeks earlier — paint a two-toned picture: it is a monumental leap forward for deep, autonomous reasoning and coding, yet potentially an underwhelming "incremental" update for casual conversationalists.

    Following early access periods and today's broader rollout, executives, developers, and analysts have taken to X (formerly Twitter) and company blogs to share their first testing results.

    Here is a roundup of the first reactions to OpenAI’s latest flagship model.

    "AI as a serious analyst"

    The strongest praise for GPT-5.2 centers on its ability to handle "hard problems" that require extended thinking time.

    Matt Shumer, CEO of HyperWriteAI, did not mince words in his review, calling GPT-5.2 Pro "the best model in the world."

    Shumer highlighted the model's tenacity, noting that "it thinks for over an hour on hard problems. And it nails tasks no other model can touch."

    This sentiment was echoed by Allie K. Miller, an AI entrepreneur and former AWS executive. Miller described the model as a step toward "AI as a serious analyst" rather than a "friendly companion."

    "The thinking and problem-solving feel noticeably stronger," Miller wrote on X. "It gives much deeper explanations than I’m used to seeing. At one point it literally wrote code to improve its own OCR in the middle of a task."

    Enterprise gains: Box reports distinct performance jumps

    For the enterprise sector, the update appears to be even more significant.

    Aaron Levie, CEO of Box, revealed on X that his company has been testing GPT-5.2 in early access. Levie reported that the model performs "7 points better than GPT-5.1" on their expanded reasoning tests, which approximate real-world knowledge work in financial services and life sciences.

    "The model performed the majority of the tasks far faster than GPT-5.1 and GPT-5 as well," Levie noted, confirming that Box AI will be rolling out GPT-5.2 integration shortly.

    Rutuja Rajwade, a Senior Product Marketing Manager at Box, expanded on this in a company blog post, citing specific latency improvements.

    "Complex extraction" tasks dropped from 46 seconds on GPT-5 to just 12 seconds with GPT-5.2.

    Rajwade also noted a jump in reasoning capabilities for the Media and Entertainment vertical, rising from 76% accuracy in GPT-5.1 to 81% in the new model.

    A "serious leap" for coding and simulation

    Developers are finding GPT-5.2 particularly potent for "one-shot" generation of complex code structures.

    Pietro Schirano, CEO of magicpathai, shared a video of the model building a full 3D graphics engine in a single file with interactive controls. "It’s a serious leap forward in complex reasoning, math, coding, and simulations," Schirano posted. "The pace of progress is unreal."

    Similarly, Ethan Mollick, a professor at the Wharton School of the University of Pennsylvania and a longtime AI power user and writer, demonstrated the model's ability to create a visually complex shader—an infinite neo-gothic city in a stormy ocean—via a single prompt.

    The Agentic Era: Long-running autonomy

    Perhaps the most functional shift is the model's ability to stay on task for hours without losing the thread.

    Dan Shipper, CEO of the AI-focused newsletter Every, reported that the model successfully performed a profit and loss (P&L) analysis, working autonomously for two hours. "It did a P&L analysis where it worked for 2 hours and gave me great results," Shipper wrote.

    However, Shipper also noted that for day-to-day tasks, the update feels "mostly incremental."

    In an article for Every, Katie Parrott wrote that while GPT-5.2 excels at instruction following, it is "less resourceful" than competitors like Claude Opus 4.5 in certain contexts, such as deducing a user's location from email data.

    The downsides: Speed and Rigidity

    Despite the reasoning capabilities, the "feel" of the model has drawn critique.

    Shumer highlighted a significant "speed penalty" when using the model's Thinking mode. "In my experience the Thinking mode is very slow for most questions," Shumer wrote in his deep-dive review. "I almost never use Instant."

    Allie Miller also pointed out issues with the model's default behavior. "The downside is tone and format," she noted. "The default voice felt a bit more rigid, and the length/markdown behavior is extreme: a simple question turned into 58 bullets and numbered points."

    The Verdict

    The early reaction suggests that GPT-5.2 is a tool optimized for power users, developers, and enterprise agents rather than casual chat. As Shumer summarized in his review: "For deep research, complex reasoning, and tasks that benefit from careful thought, GPT-5.2 Pro is the best option available right now."

    However, for users seeking creative writing or quick, fluid answers, models like Claude Opus 4.5 remain strong competitors. "My favorite model remains Claude Opus 4.5," Miller admitted, "but my complex ChatGPT work will get a nice incremental boost."

  • OpenAI report reveals a 6x productivity gap between AI power users and everyone else

    The tools are available to everyone. The subscription is company-wide. The training sessions have been held. And yet, in offices from Wall Street to Silicon Valley, a stark divide is opening between workers who have woven artificial intelligence into the fabric of their daily work and colleagues who have barely touched it.

    The gap is not small. According to a new report from OpenAI analyzing usage patterns across its more than one million business customers, workers at the 95th percentile of AI adoption are sending six times as many messages to ChatGPT as the median employee at the same companies. For specific tasks, the divide is even more dramatic: frontier workers send 17 times as many coding-related messages as their typical peers, and among data analysts, the heaviest users engage the data analysis tool 16 times more frequently than the median.

    This is not a story about access. It is a story about a new form of workplace stratification emerging in real time — one that may be reshaping who gets ahead, who falls behind, and what it means to be a skilled worker in the age of artificial intelligence.

    Everyone has the same tools, but not everyone is using them

    Perhaps the most striking finding in the OpenAI report is how little access explains. ChatGPT Enterprise is now deployed across more than 7 million workplace seats globally, a nine-fold increase from a year ago. The tools are the same for everyone. The capabilities are identical. And yet usage varies by orders of magnitude.

    Among monthly active users — people who have logged in at least once in the past 30 days — 19 percent have never tried the data analysis feature. Fourteen percent have never used reasoning capabilities. Twelve percent have never used search. These are not obscure features buried in submenus; they are core functionality that OpenAI highlights as transformative for knowledge work.

    The pattern inverts among daily users. Only 3 percent of people who use ChatGPT every day have never tried data analysis; just 1 percent have skipped reasoning or search. The implication is clear: the divide is not between those who have access and those who don't, but between those who have made AI a daily habit and those for whom it remains an occasional novelty.

    Employees who experiment more are saving dramatically more time

    The OpenAI report suggests that AI productivity gains are not evenly distributed across all users but concentrated among those who use the technology most intensively. Workers who engage across approximately seven distinct task types — data analysis, coding, image generation, translation, writing, and others — report saving five times as much time as those who use only four. Employees who save more than 10 hours per week consume eight times more AI credits than those who report no time savings at all.

    This creates a compounding dynamic. Workers who experiment broadly discover more uses. More uses lead to greater productivity gains. Greater productivity gains presumably lead to better performance reviews, more interesting assignments, and faster advancement—which in turn provides more opportunity and incentive to deepen AI usage further.

    Seventy-five percent of surveyed workers report being able to complete tasks they previously could not perform, including programming support, spreadsheet automation, and technical troubleshooting. For workers who have embraced these capabilities, the boundaries of their roles are expanding. For those who have not, the boundaries may be contracting by comparison.

    The corporate AI paradox: $40 billion spent, 95 percent seeing no return

    The individual usage gap documented by OpenAI mirrors a broader pattern identified by a separate study from MIT's Project NANDA. Despite $30 billion to $40 billion invested in generative AI initiatives, only 5 percent of organizations are seeing transformative returns. The researchers call this the "GenAI Divide" — a gap separating the few organizations that succeed in transforming processes with adaptive AI systems from the majority that remain stuck in pilots.

    The MIT report found limited disruption across industries: only two of nine major sectors—technology and media—show material business transformation from generative AI use. Large firms lead in pilot volume but lag in successful deployment.

    The pattern is consistent across both studies. Organizations and individuals are buying the technology. They are launching pilots. They are attending training sessions. But somewhere between adoption and transformation, most are getting stuck.

    While official AI projects stall, a shadow economy is thriving

    The MIT study reveals a striking disconnect: while only 40 percent of companies have purchased official LLM subscriptions, employees in over 90 percent of companies regularly use personal AI tools for work. Nearly every respondent reported using LLMs in some form as part of their regular workflow.

    "This 'shadow AI' often delivers better ROI than formal initiatives and reveals what actually works for bridging the divide," MIT's Project NANDA found.

    The shadow economy offers a clue to what's happening at the individual level within organizations. Employees who take initiative — who sign up for personal subscriptions, who experiment on their own time, who figure out how to integrate AI into their workflows without waiting for IT approval — are pulling ahead of colleagues who wait for official guidance that may never come.

    These shadow systems, largely unsanctioned, often deliver better performance and faster adoption than corporate tools. Worker sentiment reveals a preference for flexible, responsive tools — precisely the kind of experimentation that separates OpenAI's frontier workers from the median.

    The biggest gaps show up in technical work that used to require specialists

    The largest relative gaps between frontier and median workers appear in coding, writing, and analysis — precisely the task categories where AI capabilities have advanced most rapidly. Frontier workers are not just doing the same work faster; they appear to be doing different work entirely, expanding into technical domains that were previously inaccessible to them.

    Among ChatGPT Enterprise users outside of engineering, IT, and research, coding-related messages have grown 36 percent over the past six months. Someone in marketing or HR who learns to write scripts and automate workflows is becoming a categorically different employee than a peer who has not — even if they hold the same title and started with the same skills.

    The academic research on AI and productivity offers a complicated picture. Several studies cited in the OpenAI report find that AI has an "equalizing effect," disproportionately helping lower-performing workers close the gap with their higher-performing peers. But the equalizing effect may apply only within the population of workers who actually use AI regularly. A meaningful share of workers are not in that group at all. They remain light users or non-users, even as their more adventurous colleagues pull away.

    Companies are divided too, and the gap is widening by the month

    The divide is not only between individual workers. It exists between entire organizations.

    Frontier firms — those at the 95th percentile of adoption intensity — generate approximately twice as many AI messages per employee as the median enterprise. For messages routed through custom GPTs, purpose-built tools that automate specific workflows, the gap widens to seven-fold.

    These numbers suggest fundamentally different operating models. At median companies, AI may be a productivity tool that individual workers use at their discretion. At frontier firms, AI appears to be embedded in core infrastructure: standardized workflows, persistent custom tools, systematic integration with internal data systems.

    The OpenAI report notes that roughly one in four enterprises still has not enabled connectors that give AI access to company data—a basic step that dramatically increases the technology's utility. The MIT study found that companies that purchased AI tools from specialized vendors succeeded 67 percent of the time, while internal builds had only a one-in-three success rate. For many organizations, the AI era has technically arrived but has not yet begun in practice.

    The technology is no longer the problem — organizations are

    For executives, the data presents an uncomfortable challenge. The technology is no longer the constraint. OpenAI notes that it releases a new feature or capability roughly every three days; the models are advancing faster than most organizations can absorb. The bottleneck has shifted from what AI can do to whether organizations are structured to take advantage of it.

    "The dividing line isn't intelligence," the MIT authors write. The problems with enterprise AI have to do with memory, adaptability, and learning capability. Problems stem less from regulations or model performance, and more from tools that fail to learn or adapt.

    Leading firms, according to the OpenAI report, consistently invest in executive sponsorship, data readiness, workflow standardization, and deliberate change management. They build cultures where custom AI tools are created, shared, and refined across teams. They track performance and run evaluations. They make AI adoption a strategic priority rather than an individual choice.

    The rest are leaving it to chance — hoping that workers will discover the tools on their own, experiment on their own time, and somehow propagate best practices without infrastructure or incentive. The six-fold gap suggests this approach is not working.

    The window to catch up is closing faster than most companies realize

    With enterprise contracts locking in over the next 18 months, there's a shrinking window for vendors and adopters to cross the divide. The GenAI Divide identified by the MIT report is not going to last forever. But the organizations that figure out a way across it soonest will be the ones that define the next era of business.

    Both reports carry caveats. The OpenAI data comes from a company with an obvious interest in promoting AI adoption. The productivity figures are self-reported by customers already paying for the product. The MIT study, while independent, relies on interviews and surveys rather than direct measurement. The long-term effects of this technology on employment, wages, and workplace dynamics remain uncertain.

    But the core finding — that access alone does not produce adoption, and that adoption varies enormously even within organizations that have made identical tools available to all — is consistent with how previous technologies have diffused through the economy. Spreadsheets, email, and the internet all created similar divides before eventually becoming universal. The question is how long the current gap persists, who benefits during the transition, and what happens to workers who find themselves on the wrong side of it.

    For now, the divide is stark. Ninety percent of users said they prefer humans for "mission-critical work," while AI has "won the war for simple work." The workers who are pulling ahead are not doing so because they have access their colleagues lack. They are pulling ahead because they decided to use what everyone already has—and kept using it until they figured out what it could do.

    The 6x gap is not about technology. It is about behavior. And behavior, unlike software, cannot be deployed with a company-wide rollout.

  • The 70% factuality ceiling: why Google’s new ‘FACTS’ benchmark is a wake-up call for enterprise AI

    There's no shortage of generative AI benchmarks designed to measure the performance and accuracy of a given model on completing various helpful enterprise tasks — from coding to instruction following to agentic web browsing and tool use. But many of these benchmarks have one major shortcoming: they measure the AI's ability to complete specific problems and requests, not how factual the model is in its outputs — how well it generates objectively correct information tied to real-world data — especially when dealing with information contained in imagery or graphics.

    For industries where accuracy is paramount — legal, finance, and medical — the lack of a standardized way to measure factuality has been a critical blind spot.

    That changes today: Google’s FACTS team and its data science unit Kaggle released the FACTS Benchmark Suite, a comprehensive evaluation framework designed to close this gap.

    The associated research paper reveals a more nuanced definition of the problem, splitting "factuality" into two distinct operational scenarios: "contextual factuality" (grounding responses in provided data) and "world knowledge factuality" (retrieving information from memory or the web).

    While the headline news is Gemini 3 Pro’s top-tier placement, the deeper story for builders is the industry-wide "factuality wall."

    According to the initial results, no model — including Gemini 3 Pro, GPT-5, and Claude 4.5 Opus — managed to crack 70% accuracy across the suite of problems. For technical leaders, this is a signal: the era of "trust but verify" is far from over.

    Deconstructing the Benchmark

    The FACTS suite moves beyond simple Q&A. It is composed of four distinct tests, each simulating a different real-world failure mode that developers encounter in production:

    1. Parametric Benchmark (Internal Knowledge): Can the model accurately answer trivia-style questions using only its training data?

    2. Search Benchmark (Tool Use): Can the model effectively use a web search tool to retrieve and synthesize live information?

    3. Multimodal Benchmark (Vision): Can the model accurately interpret charts, diagrams, and images without hallucinating?

    4. Grounding Benchmark v2 (Context): Can the model stick strictly to the provided source text?

    Google has released 3,513 examples to the public, while Kaggle holds a private set to prevent developers from training on the test data—a common issue known as "contamination."

    The Leaderboard: A Game of Inches

    The initial run of the benchmark places Gemini 3 Pro in the lead with a comprehensive FACTS Score of 68.8%, followed by Gemini 2.5 Pro (62.1%) and OpenAI’s GPT-5 (61.8%). However, a closer look at the data reveals where the real battlegrounds are for engineering teams.

    | Model | FACTS Score (Avg) | Search (RAG Capability) | Multimodal (Vision) |
    | --- | --- | --- | --- |
    | Gemini 3 Pro | 68.8 | 83.8 | 46.1 |
    | Gemini 2.5 Pro | 62.1 | 63.9 | 46.9 |
    | GPT-5 | 61.8 | 77.7 | 44.1 |
    | Grok 4 | 53.6 | 75.3 | 25.7 |
    | Claude 4.5 Opus | 51.3 | 73.2 | 39.2 |

    Data sourced from the FACTS Team release notes.

    For Builders: The "Search" vs. "Parametric" Gap

    For developers building RAG (Retrieval-Augmented Generation) systems, the Search Benchmark is the most critical metric.

    The data shows a clear gap between a model's ability to "know" things (Parametric) and its ability to "find" things (Search). Gemini 3 Pro, for instance, scores 83.8% on Search tasks but only 76.4% on Parametric tasks.

    This validates the current enterprise architecture standard: do not rely on a model's internal memory for critical facts.

    If you are building an internal knowledge bot, the FACTS results suggest that hooking your model up to a search tool or vector database is not optional—it is the only way to push accuracy toward acceptable production levels.
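    The grounding pattern these results argue for — retrieve authoritative text first, then constrain the model to it — can be sketched in a few lines. The toy keyword retriever, document contents, and prompt wording below are illustrative assumptions, not part of the FACTS suite or any specific vendor's API:

    ```python
    # Sketch of retrieval-grounded prompting: fetch relevant documents, then
    # instruct the model to answer ONLY from that context. In production the
    # retriever would be a search tool or vector database; here it is a naive
    # keyword-overlap scorer over a hypothetical in-memory knowledge base.

    KNOWLEDGE_BASE = {
        "refund-policy": "Refunds are issued within 14 days of purchase.",
        "shipping": "Standard shipping takes 3-5 business days.",
        "warranty": "Hardware is covered by a two-year limited warranty.",
    }

    def retrieve(query: str, k: int = 1) -> list[str]:
        """Rank documents by keyword overlap with the query (toy scorer)."""
        terms = set(query.lower().split())
        scored = sorted(
            KNOWLEDGE_BASE.values(),
            key=lambda doc: len(terms & set(doc.lower().split())),
            reverse=True,
        )
        return scored[:k]

    def build_grounded_prompt(query: str) -> str:
        """Assemble a prompt that pins the model to the retrieved context."""
        context = "\n".join(retrieve(query))
        return (
            "Answer using ONLY the context below. "
            "If the answer is not in the context, say you don't know.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}"
        )

    print(build_grounded_prompt("How long does standard shipping take?"))
    ```

    The final prompt string is what would be sent to the model; the explicit "don't know" escape hatch is what the Grounding sub-benchmark effectively measures.
    
    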

    The Multimodal Warning

    The most alarming data point for product managers is the performance on Multimodal tasks. The scores here are universally low. Even the category leader, Gemini 2.5 Pro, only hit 46.9% accuracy.

    The benchmark tasks included reading charts, interpreting diagrams, and identifying objects in nature. With less than 50% accuracy across the board, this suggests that Multimodal AI is not yet ready for unsupervised data extraction.

    Bottom line: If your product roadmap involves having an AI automatically scrape data from invoices or interpret financial charts without human-in-the-loop review, you are likely introducing significant error rates into your pipeline.

    Why This Matters for Your Stack

    The FACTS Benchmark is likely to become a standard reference point for procurement. When evaluating models for enterprise use, technical leaders should look beyond the composite score and drill into the specific sub-benchmark that matches their use case:

    • Building a Customer Support Bot? Look at the Grounding score to ensure the bot sticks to your policy documents. (Gemini 2.5 Pro actually outscored Gemini 3 Pro here, 74.2 vs 69.0).

    • Building a Research Assistant? Prioritize Search scores.

    • Building an Image Analysis Tool? Proceed with extreme caution.

    As the FACTS team noted in their release, "All evaluated models achieved an overall accuracy below 70%, leaving considerable headroom for future progress." For now, the message to the industry is clear: The models are getting smarter, but they aren't yet infallible. Design your systems with the assumption that, roughly one-third of the time, the raw model might just be wrong.

  • Mistral launches powerful Devstral 2 coding model including open source, laptop-friendly version

    French AI startup Mistral has weathered a rocky year of public questioning to emerge, in December 2025, with new, crowd-pleasing models for enterprise and indie developers.

    Just days after releasing its powerful open source, general purpose Mistral 3 LLM family for edge devices and local hardware, the company returned today to debut Devstral 2.

    The release includes a new pair of models optimized for software engineering tasks — again, with one small enough to run on a single laptop, offline and privately — alongside Mistral Vibe, a command-line interface (CLI) agent designed to allow developers to call the models up directly within their terminal environments.

    The models are fast, lean, and open—at least in theory. But the real story lies not just in the benchmarks, but in how Mistral is packaging this capability: one model fully free, another conditionally so, and a terminal interface built to scale with either.

    It’s an attempt not just to match proprietary systems like Claude and GPT-4 in performance, but to compete with them on developer experience—and to do so while holding onto the flag of open-source.

    Both models are available now for free for a limited time via Mistral’s API and Hugging Face.

    The full Devstral 2 model is supported out-of-the-box in the open source inference engine vLLM and on the open source agentic coding platform Kilo Code.

    A Coding Model Meant to Drive

    At the top of the announcement is Devstral 2, a 123-billion parameter dense transformer with a 256K-token context window, engineered specifically for agentic software development.

    Mistral says the model achieves 72.2% on SWE-bench Verified, a benchmark designed to evaluate long-context software engineering tasks in real-world repositories.

    The smaller sibling, Devstral Small 2, weighs in at 24B parameters, with the same long context window and a performance of 68.0% on SWE-bench.

    On paper, that makes it the strongest open-weight model of its size, even outscoring many 70B-class competitors.

    But the performance story isn’t just about raw percentages. Mistral is betting that efficient intelligence beats scale, and has made much of the fact that Devstral 2 is:

    • 5× smaller than DeepSeek V3.2

    • 8× smaller than Kimi K2

    • Yet still matches or surpasses them on key software reasoning benchmarks.

    Human evaluations back this up. In side-by-side comparisons:

    • Devstral 2 beat DeepSeek V3.2 in 42.8% of tasks, losing only 28.6% of the time.

    • Against Claude Sonnet 4.5, it lost more often (53.1%)—a reminder that while the gap is narrowing, closed models still lead in overall preference.

    Still, for an open-weight model, these results place Devstral 2 at the frontier of what’s currently available to run and modify independently.

    Vibe CLI: A Terminal-Native Agent

    Alongside the models, Mistral released Vibe CLI, a command-line assistant that integrates directly with Devstral models. It’s not an IDE plugin or a ChatGPT-style code explainer. It’s a native interface designed for project-wide code understanding and orchestration, built to live inside the developer’s actual workflow.

    Vibe brings a surprising degree of intelligence to the terminal:

    • It reads your file tree and Git status to understand project scope.

    • It lets you reference files with @, run shell commands with !, and toggle behavior with slash commands.

    • It orchestrates changes across multiple files, tracks dependencies, retries failed executions, and can even refactor at architectural scale.

    Unlike most developer agents, which simulate a REPL from within a chat UI, Vibe starts with the shell and pulls intelligence in from there. It’s programmable, scriptable, and themeable. And it’s released under the Apache 2.0 license, meaning it’s truly free to use—in commercial settings, internal tools, or open-source extensions.

    Licensing Structure: Open-ish — With Revenue Limitations

    At first glance, Mistral’s licensing approach appears straightforward: the models are open-weight and publicly available. But a closer look reveals a line drawn through the middle of the release, with different rules for different users.

    Devstral Small 2, the 24-billion parameter variant, is covered under a standard, enterprise- and developer-friendly Apache 2.0 license.

    That’s a gold standard in open-source: no revenue restrictions, no fine print, no need to check with legal. Enterprises can use it in production, embed it into products, and redistribute fine-tuned versions without asking for permission.

    Devstral 2, the flagship 123B model, is released under what Mistral calls a “modified MIT license.” That phrase sounds innocuous, but the modification introduces a critical limitation: any company making more than $20 million in monthly revenue cannot use the model at all—not even internally—without securing a separate commercial license from Mistral.

    “You are not authorized to exercise any rights under this license if the global consolidated monthly revenue of your company […] exceeds $20 million,” the license reads.

    The clause applies not only to the base model, but to derivatives, fine-tuned versions, and redistributed variants, regardless of who hosts them. In effect, it means that while the weights are “open,” their use is gated for large enterprises—unless they’re willing to engage with Mistral’s sales team or use the hosted API at metered pricing.

    To draw an analogy: Apache 2.0 is like a public library—you walk in, borrow the book, and use it however you need. Mistral’s modified MIT license is more like a corporate co-working space that’s free for freelancers but charges rent once your company hits a certain size.

    Weighing Devstral Small 2 for Enterprise Use

    This division raises an obvious question for larger companies: can Devstral Small 2, with its permissive Apache 2.0 license, serve as a viable alternative for medium-to-large enterprises?

    The answer depends on context. Devstral Small 2 scores 68.0% on SWE-bench, significantly ahead of many larger open models, and remains deployable on single-GPU or CPU-only setups. For teams focused on:

    • internal tooling,

    • on-prem deployment,

    • low-latency edge inference,

      …it offers a rare combination of legality, performance, and convenience.

    But the performance gap from Devstral 2 is real. For multi-agent setups, deep monorepo refactoring, or long-context code analysis, that 4-point benchmark delta may understate the actual experience difference.

    For most enterprises, Devstral Small 2 will serve either as a low-friction way to prototype—or as a pragmatic bridge until licensing for Devstral 2 becomes feasible. It is not a drop-in replacement for the flagship, but it may be “good enough” in specific production slices, particularly when paired with Vibe CLI.

    Because Devstral Small 2 can be run entirely offline, including on a single-GPU machine or a sufficiently specced laptop, it unlocks a critical use case for developers and teams operating in tightly controlled environments.

    Whether you’re a solo indie building tools on the go, or part of a company with strict data governance or compliance mandates, the ability to run a performant, long-context coding model without ever hitting the internet is a powerful differentiator. No cloud calls, no third-party telemetry, no risk of data leakage — just local inference with full visibility and control.

    This matters in industries like finance, healthcare, defense, and advanced manufacturing, where data often cannot leave the network perimeter. But it’s just as useful for developers who prefer autonomy over vendor lock-in — or who want their tools to work the same on a plane, in the field, or inside an air-gapped lab. In a market where most top-tier code models are delivered as API-only SaaS products, Devstral Small 2 offers a rare level of portability, privacy, and ownership.

    In that sense, Mistral isn’t just offering open models—they’re offering multiple paths to adoption, depending on your scale, compliance posture, and willingness to engage.

    Integration, Infrastructure, and Access

    From a technical standpoint, Mistral’s models are built for deployment. Devstral 2 requires a minimum of 4× H100-class GPUs, and is already available on build.nvidia.com.

    Devstral Small 2 can run on a single GPU, or even on a CPU like the one in a standard laptop, making it accessible to solo developers and embedded teams alike.

    Both models support quantized FP4 and FP8 weights, and are compatible with vLLM for scalable inference. Fine-tuning is supported out of the box.

    API pricing—after the free introductory window—follows a token-based structure:

    • Devstral 2: $0.40 per million input tokens / $2.00 for output

    • Devstral Small 2: $0.10 per million input tokens / $0.30 for output

    That pricing sits just below OpenAI’s GPT-4 Turbo, and well below Anthropic’s Claude Sonnet at comparable performance levels.
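    At these rates, per-request costs are easy to estimate. The snippet below is a back-of-the-envelope calculator using the published numbers; the model keys and the example token counts are illustrative assumptions, not Mistral API identifiers:

    ```python
    # Cost estimator for the listed API rates (USD per million tokens).
    # The dictionary keys are illustrative names, not official model IDs.

    PRICING = {
        "devstral-2":       {"input": 0.40, "output": 2.00},
        "devstral-small-2": {"input": 0.10, "output": 0.30},
    }

    def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
        """USD cost of one request at the published per-million-token rates."""
        rates = PRICING[model]
        return (input_tokens * rates["input"]
                + output_tokens * rates["output"]) / 1_000_000

    # A large agentic request: 200K tokens of repo context in, 8K tokens out.
    print(round(request_cost("devstral-2", 200_000, 8_000), 4))        # 0.096
    print(round(request_cost("devstral-small-2", 200_000, 8_000), 4))  # 0.0224
    ```

    Even near the 256K context ceiling, a single flagship request stays under ten cents, which is the kind of arithmetic that makes the pricing comparison below concrete.
    
    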

    Developer Reception: Ground-Level Buzz

    On X (formerly Twitter), developers reacted quickly with a wave of positive reception. Hugging Face's Head of Product Victor Mustar asked if the small, Apache 2.0 licensed variant was the "new local coding king," i.e., the model developers could run directly and privately on their laptops, without an internet connection:

    Another popular AI news and rumors account, TestingCatalogNews, posted that it was "SOTTA in coding," or "State Of The Tiny Art."

    Another user, @xlr8harder, took issue with the custom licensing terms for Devstral 2, writing "calling the Devstral 2 license 'modified MIT' is misleading at best. It’s a proprietary license with MIT-like attribution requirements."

    While the tone was critical, it reflected the scrutiny Mistral's license structure was receiving, particularly among developers familiar with open-use norms.

    Strategic Context: From Codestral to Devstral and Mistral 3

    Mistral’s steady push into software development tools didn’t start with Devstral 2—it began in May 2024 with Codestral, the company’s first code-focused large language model. A 22-billion parameter system trained on more than 80 programming languages, Codestral was designed for use in developer environments ranging from basic autocompletions to full function generation. The model launched under a non-commercial license but still outperformed heavyweight competitors like CodeLlama 70B and Deepseek Coder 33B in early benchmarks such as HumanEval and RepoBench.

    Codestral’s release marked Mistral’s first move into the competitive coding-model space, but it also established a now-familiar pattern: technically lean models with surprisingly strong results, a wide context window, and licensing choices that invited developer experimentation. Industry partners including JetBrains, LlamaIndex, and LangChain quickly began integrating the model into their workflows, citing its speed and tool compatibility as key differentiators.

    One year later, the company followed up with Devstral, a 24B model purpose-built for “agentic” behavior—handling long-range reasoning, file navigation, and autonomous code modification. Released in partnership with All Hands AI and licensed under Apache 2.0, Devstral was notable not just for its portability (it could run on a MacBook or RTX 4090), but for its performance: it beat out several closed models on SWE-Bench Verified, a benchmark of 500 real-world GitHub issues.

    Then came Mistral 3, announced in December 2025 as a portfolio of 10 open-weight models targeting everything from drones and smartphones to cloud infrastructure. This suite included both high-end models like Mistral Large 3 (a MoE system with 41B active parameters and 256K context) and lightweight “Ministral” variants that could run on 4GB of VRAM. All were licensed under Apache 2.0, reinforcing Mistral’s commitment to flexible, edge-friendly deployment.

    Mistral 3 positioned the company not as a direct competitor to frontier models like GPT-5 or Gemini 3, but as a developer-first platform for customized, localized AI systems. Co-founder Guillaume Lample described the vision as “distributed intelligence”—many smaller systems tuned for specific tasks and running outside centralized infrastructure. “In more than 90% of cases, a small model can do the job,” he told VentureBeat. “It doesn’t have to be a model with hundreds of billions of parameters.”

    That broader strategy helps explain the significance of Devstral 2. It’s not a one-off release but a continuation of Mistral’s long-running commitment to code agents, local-first deployment, and open-weight availability—an ecosystem that began with Codestral, matured through Devstral, and scaled up with Mistral 3. Devstral 2, in this framing, is not just a model. It’s the next version of a playbook that’s been unfolding in public for over a year.

    Final Thoughts (For Now): A Fork in the Road

    With Devstral 2, Devstral Small 2, and Vibe CLI, Mistral AI has drawn a clear map for developers and companies alike. The tools are fast, capable, and thoughtfully integrated. But they also present a choice—not just in architecture, but in how and where you’re allowed to use them.

    If you’re an individual developer, small startup, or open-source maintainer, this is one of the most powerful AI systems you can freely run today.

    If you’re a Fortune 500 engineering lead, you’ll need to either talk to Mistral—or settle for the smaller model and make it work.

    In a market increasingly dominated by black-box models and SaaS lock-ins, Mistral’s offer is still a breath of fresh air. Just read the fine print before you start building.

  • The AI that scored 95% — until consultants learned it was AI

    Presented by SAP


    When SAP ran a quiet internal experiment to gauge consultant attitudes toward AI, the results were striking. Five teams were asked to validate answers to more than 1,000 business requirements completed by SAP’s AI co-pilot, Joule for Consultants — a workload that would normally take several weeks.

    Four teams were told the analysis had been completed by junior interns fresh out of school. They reviewed the material, found it impressive, and rated the work about 95% accurate.

    The fifth team was told the very same answers had come from AI.

    They rejected almost everything.

    Only when asked to validate each answer one by one did they discover that the AI was, in fact, highly accurate — surfacing detailed insights the consultants had initially dismissed. The overall accuracy? Again, about 95%.

    “The lesson learned here is that we need to be very cautious as we introduce AI — especially in how we communicate with senior consultants about its possibilities and how to integrate it into their workflows,” says Guillermo B. Vazquez Mendez, chief architect, RI business transformation and architecture, SAP America Inc.

    The experiment has since become a revealing starting point for SAP’s push toward the consultant of 2030: a practitioner who is deeply human, enabled by AI, and no longer weighed down by the technical grunt work of the past.

    Overcoming AI skepticism

    Resistance isn’t surprising, Vazquez notes. Consultants with two or three decades of experience carry enormous institutional knowledge — and an understandable degree of caution.

    But AI copilots like Joule for Consultants are not replacing expertise. They’re amplifying it.

    “What Joule really does is make their very expensive time far more effective,” Vazquez says. “It removes the clerical work, so they can focus on turning out high-quality answers in a fraction of the time.”

    He emphasizes this message constantly: “AI is not replacing you. It’s a tool for you. Human oversight is always required. But now, instead of spending your time looking for documentation, you’re gaining significant time and boosting the effectiveness and detail of your answers.”

    The consultant time-shift: from tech execution to business insight

    Historically, consultants spent about 80% of their time understanding technical systems — how processes run, how data flows, how functions execute. Customers, by contrast, spend 80% of their time focused on their business.

    That mismatch is exactly where Joule steps in.

    “There’s a gap there — and the bridge is AI,” Vazquez says. “It flips the time equation, enabling consultants to invest more of their energy in understanding the customer’s industry and business goals. AI takes on the heavy technical lift, so consultants can focus on driving the right business outcomes.”

    Bringing new consultants up to speed

    AI is also transforming how new hires learn.

    “We’re excited to see Joule acting as a bridge between senior consultants, who are adapting more slowly, and interns and new consultants who are already technically savvy,” Vazquez says.

    Junior consultants ramp up faster because Joule helps them operate independently. Seniors, meanwhile, engage where their insight matters most.

    This is also where many consultants learn the fundamentals of today’s AI copilots. Much of the work depends on prompt engineering — for instance, instructing Joule to act as a senior chief technology architect specializing in finance and SAP S/4HANA 2023, then asking it to analyze business requirements and deliver the output as tables or PowerPoint slides.

    Once they grasp how to frame prompts, consultants consistently get higher-quality, more structured answers.
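The role-plus-task-plus-format framing described above can be sketched as a simple template. This is illustrative only: Joule's actual prompt interface is not public, so the function and its fields are assumptions, not SAP's API.

```python
# Illustrative only: a role-framed prompt template in the style described
# above. The exact interface is an assumption, not SAP's Joule API.

def build_consultant_prompt(role: str, task: str, output_format: str) -> str:
    """Frame a copilot request with a role, a task, and an output format."""
    return (
        f"Act as a {role}.\n"
        f"Task: {task}\n"
        f"Deliver the output as {output_format}."
    )

prompt = build_consultant_prompt(
    role="senior chief technology architect specializing in finance and SAP S/4HANA 2023",
    task="analyze the attached business requirements and propose a migration approach",
    output_format="tables suitable for a PowerPoint slide",
)
print(prompt)
```

The value of the template is consistency: once the role, task, and output format are always stated explicitly, answers come back in a predictable, reviewable shape.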

    New architects are also able to communicate more clearly with their more experienced counterparts. They know what they don’t know and can ask targeted questions, which makes mentorship far smoother. It’s created a real synergy, Vazquez adds — senior consultants see how quickly new hires are adapting and learning with AI, and that momentum encourages them to keep pace and adopt the technology themselves.

    Looking ahead to the future of AI copilots

    “We’re still in the baby steps of AI — we’re toddlers,” Vazquez says. “Right now, copilots depend on prompt engineering to get good answers. The better you prompt, the better the answer you get.”

    But that represents only the earliest phase of what these systems will eventually do. As copilots mature, they’ll move beyond responding to prompts and start interpreting entire business processes — understanding the sequence of steps, identifying where human intervention is needed, and spotting where an AI agent could take over. That shift is what leads directly into agentic AI.

    SAP’s depth of process knowledge is what makes that evolution possible. The company has mapped more than 3,500 business processes across industries — a repository Vazquez calls “some of the most valuable, rigorously tested processes developed in the last 50 years.” Every day, SAP systems support roughly $7.3 trillion in global commerce, giving these emerging AI agents a rich foundation to navigate and reason over.

    “With that level of process insight and data, we can take a real leap forward,” he says, “equipping our consultants with agentic AI that can solve complex challenges and push us toward increasingly autonomous systems.”


    Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.

  • Z.ai debuts open source GLM-4.6V, a native tool-calling vision model for multimodal reasoning

    Chinese AI startup Zhipu AI, also known as Z.ai, has released its GLM-4.6V series, a new generation of open-source vision-language models (VLMs) optimized for multimodal reasoning, frontend automation, and high-efficiency deployment.

    The release includes two models in "large" and "small" sizes:

    1. GLM-4.6V (106B), a larger 106-billion parameter model aimed at cloud-scale inference

    2. GLM-4.6V-Flash (9B), a smaller model of only 9 billion parameters designed for low-latency, local applications

    Generally speaking, models with more parameters — the internal settings, i.e. weights and biases, that govern their behavior — are more powerful and perform at a higher general level across more varied tasks.

    However, smaller models can offer better efficiency for edge or real-time applications where latency and resource constraints are critical.

    The defining innovation in this series is the introduction of native function calling in a vision-language model—enabling direct use of tools such as search, cropping, or chart recognition with visual inputs.

    With a 128,000-token context length (roughly a 300-page novel's worth of text exchanged in a single input/output interaction) and state-of-the-art (SoTA) results across more than 20 benchmarks, the GLM-4.6V series positions itself as a highly competitive alternative to both closed and open-source VLMs.

    Licensing and Enterprise Use

    GLM‑4.6V and GLM‑4.6V‑Flash are distributed under the MIT license, a permissive open-source license that allows free commercial and non-commercial use, modification, redistribution, and local deployment without obligation to open-source derivative works.

    This licensing model makes the series suitable for enterprise adoption, including scenarios that require full control over infrastructure, compliance with internal governance, or air-gapped environments.

    Model weights and documentation are publicly hosted on Hugging Face, with supporting code and tooling available on GitHub.

    The MIT license ensures maximum flexibility for integration into proprietary systems, including internal tools, production pipelines, and edge deployments.

    Architecture and Technical Capabilities

    The GLM-4.6V models follow a conventional encoder-decoder architecture with significant adaptations for multimodal input.

    Both models incorporate a Vision Transformer (ViT) encoder—based on AIMv2-Huge—and an MLP projector to align visual features with a large language model (LLM) decoder.

    Video inputs benefit from 3D convolutions and temporal compression, while spatial encoding is handled using 2D-RoPE and bicubic interpolation of absolute positional embeddings.

    A key technical feature is the system’s support for arbitrary image resolutions and aspect ratios, including wide panoramic inputs up to 200:1.

    In addition to static image and document parsing, GLM-4.6V can ingest temporal sequences of video frames with explicit timestamp tokens, enabling robust temporal reasoning.

    On the decoding side, the model supports token generation aligned with function-calling protocols, allowing for structured reasoning across text, image, and tool outputs. This is supported by extended tokenizer vocabulary and output formatting templates to ensure consistent API or agent compatibility.

    Native Multimodal Tool Use

    GLM-4.6V introduces native multimodal function calling, allowing visual assets—such as screenshots, images, and documents—to be passed directly as parameters to tools. This eliminates the need for intermediate text-only conversions, which have historically introduced information loss and complexity.

    The tool invocation mechanism works bi-directionally:

    • Input tools can be passed images or videos directly (e.g., document pages to crop or analyze).

    • Output tools such as chart renderers or web snapshot utilities return visual data, which GLM-4.6V integrates directly into the reasoning chain.

    In practice, this means GLM-4.6V can complete tasks such as:

    • Generating structured reports from mixed-format documents

    • Performing visual audits of candidate images

    • Automatically cropping figures from papers during generation

    • Conducting visual web search and answering multimodal queries
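To make "native multimodal function calling" concrete, the sketch below shows the kind of request shape it implies: the visual asset itself is a tool parameter, with no intermediate image-to-text conversion. The tool name and field names are assumptions for illustration, not Z.ai's documented schema.

```python
# Illustrative sketch of a multimodal tool call: the image is passed to
# the tool directly as a parameter. Names and fields are assumptions,
# not Z.ai's documented schema.

crop_tool = {
    "name": "crop_image",
    "description": "Crop a region from an input image and return it.",
    "parameters": {
        "type": "object",
        "properties": {
            "image_url": {"type": "string", "description": "Source image"},
            "bbox": {
                "type": "array",
                "items": {"type": "number"},
                "description": "Region as [x1, y1, x2, y2]",
            },
        },
        "required": ["image_url", "bbox"],
    },
}

# A call the model might emit mid-reasoning. Because the argument is the
# visual asset itself, no lossy text transcription step is needed.
tool_call = {
    "tool": "crop_image",
    "arguments": {
        "image_url": "https://example.com/paper_page_3.png",
        "bbox": [120, 340, 860, 720],
    },
}
```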

    High Benchmark Performance Compared to Similar-Sized Models

    GLM-4.6V was evaluated across more than 20 public benchmarks covering general VQA, chart understanding, OCR, STEM reasoning, frontend replication, and multimodal agents.

    According to the benchmark chart released by Zhipu AI:

    • GLM-4.6V (106B) achieves SoTA or near-SoTA scores among open-source models of comparable size (106B) on MMBench, MathVista, MMLongBench, ChartQAPro, RefCOCO, TreeBench, and more.

    • GLM-4.6V-Flash (9B) outperforms other lightweight models (e.g., Qwen3-VL-8B, GLM-4.1V-9B) across almost all categories tested.

    • The 106B model’s 128K-token window allows it to outperform larger models like Step-3 (321B) and Qwen3-VL-235B on long-context document tasks, video summarization, and structured multimodal reasoning.

    Example scores from the leaderboard include:

    • MathVista: 88.2 (GLM-4.6V) vs. 84.6 (GLM-4.5V) vs. 81.4 (Qwen3-VL-8B)

    • WebVoyager: 81.0 vs. 68.4 (Qwen3-VL-8B)

    • Ref-L4-test: 88.9 vs. 89.5 (GLM-4.5V), but with better grounding fidelity at 87.7 (Flash) vs. 86.8

    Both models were evaluated using the vLLM inference backend and support SGLang for video-based tasks.

    Frontend Automation and Long-Context Workflows

    Zhipu AI emphasized GLM-4.6V’s ability to support frontend development workflows. The model can:

    • Replicate pixel-accurate HTML/CSS/JS from UI screenshots

    • Accept natural language editing commands to modify layouts

    • Identify and manipulate specific UI components visually

    This capability is integrated into an end-to-end visual programming interface, where the model iterates on layout, design intent, and output code using its native understanding of screen captures.

    In long-document scenarios, GLM-4.6V can process up to 128,000 tokens—enabling a single inference pass across:

    • 150 pages of text (input)

    • 200 slide decks

    • 1-hour videos

    Zhipu AI reported successful use of the model in financial analysis across multi-document corpora and in summarizing full-length sports broadcasts with timestamped event detection.
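A rough back-of-envelope check shows why 150 pages of text fits a 128K window. The per-page token count below is a common ballpark assumption, not a figure from Zhipu AI.

```python
# Back-of-envelope check on the 128K-token single-pass claim. The
# per-page estimate (~850 tokens per text page) is a rough assumption.

CONTEXT_WINDOW = 128_000

tokens_per_text_page = 850
pages = 150
text_estimate = pages * tokens_per_text_page
print(text_estimate)  # 127500, just under the 128K window
```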

    Training and Reinforcement Learning

    The model was trained using multi-stage pre-training followed by supervised fine-tuning (SFT) and reinforcement learning (RL). Key innovations include:

    • Curriculum Sampling (RLCS): Dynamically adjusts the difficulty of training samples based on model progress

    • Multi-domain reward systems: Task-specific verifiers for STEM, chart reasoning, GUI agents, video QA, and spatial grounding

    • Function-aware training: Uses structured tags (e.g., <think>, <answer>, <|begin_of_box|>) to align reasoning and answer formatting

    The reinforcement learning pipeline emphasizes verifiable rewards (RLVR) over human feedback (RLHF) for scalability, and avoids KL/entropy losses to stabilize training across multimodal domains.
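A minimal sketch of how the structured tags mentioned above can be split out of a model response. The closing-tag conventions (`</think>`, `</answer>`, `<|end_of_box|>`) are assumptions based on common practice; the paper only names the opening tags.

```python
import re

# Minimal parser for structured model output. Closing-tag names are
# assumed; only the opening tags are named in the source.

def parse_tagged_output(text: str) -> dict:
    """Extract reasoning, answer, and boxed final value from a response."""
    def grab(open_tag, close_tag):
        m = re.search(re.escape(open_tag) + r"(.*?)" + re.escape(close_tag), text, re.S)
        return m.group(1).strip() if m else None

    return {
        "think": grab("<think>", "</think>"),
        "answer": grab("<answer>", "</answer>"),
        "boxed": grab("<|begin_of_box|>", "<|end_of_box|>"),
    }

sample = "<think>Area = 3 * 4</think><answer>The area is <|begin_of_box|>12<|end_of_box|>.</answer>"
print(parse_tagged_output(sample)["boxed"])  # 12
```

Separating reasoning from the boxed answer this way is what makes verifiable rewards practical: a checker only needs to compare the boxed span against ground truth.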

    Pricing (API)

    Zhipu AI offers competitive pricing for the GLM-4.6V series, with both the flagship model and its lightweight variant positioned for high accessibility.

    • GLM-4.6V: $0.30 (input) / $0.90 (output) per 1M tokens

    • GLM-4.6V-Flash: Free

    Compared to major vision-capable and text-first LLMs, GLM-4.6V is among the most cost-efficient for multimodal reasoning at scale. Below is a comparative snapshot of pricing across providers:

    USD per 1M tokens — sorted lowest → highest total cost

    Model | Input | Output | Total Cost | Source
    Qwen 3 Turbo | $0.05 | $0.20 | $0.25 | Alibaba Cloud
    ERNIE 4.5 Turbo | $0.11 | $0.45 | $0.56 | Qianfan
    Grok 4.1 Fast (reasoning) | $0.20 | $0.50 | $0.70 | xAI
    Grok 4.1 Fast (non-reasoning) | $0.20 | $0.50 | $0.70 | xAI
    deepseek-chat (V3.2-Exp) | $0.28 | $0.42 | $0.70 | DeepSeek
    deepseek-reasoner (V3.2-Exp) | $0.28 | $0.42 | $0.70 | DeepSeek
    GLM‑4.6V | $0.30 | $0.90 | $1.20 | Z.AI
    Qwen 3 Plus | $0.40 | $1.20 | $1.60 | Alibaba Cloud
    ERNIE 5.0 | $0.85 | $3.40 | $4.25 | Qianfan
    Qwen-Max | $1.60 | $6.40 | $8.00 | Alibaba Cloud
    GPT-5.1 | $1.25 | $10.00 | $11.25 | OpenAI
    Gemini 2.5 Pro (≤200K) | $1.25 | $10.00 | $11.25 | Google
    Gemini 3 Pro (≤200K) | $2.00 | $12.00 | $14.00 | Google
    Gemini 2.5 Pro (>200K) | $2.50 | $15.00 | $17.50 | Google
    Grok 4 (0709) | $3.00 | $15.00 | $18.00 | xAI
    Gemini 3 Pro (>200K) | $4.00 | $18.00 | $22.00 | Google
    Claude Opus 4.1 | $15.00 | $75.00 | $90.00 | Anthropic
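The per-1M-token rates above translate into workload costs with simple arithmetic. The sketch below uses a few of the article's quoted prices (rates change; check each provider for current figures).

```python
# Small cost calculator using per-1M-token prices quoted in the article
# (as of publication; check providers for current rates).

PRICES = {  # model: (input $/1M tokens, output $/1M tokens)
    "GLM-4.6V": (0.30, 0.90),
    "Qwen 3 Turbo": (0.05, 0.20),
    "GPT-5.1": (1.25, 10.00),
    "Claude Opus 4.1": (15.00, 75.00),
}

def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Total cost in USD for a given token workload."""
    in_rate, out_rate = PRICES[model]
    return round(in_rate * input_tokens / 1e6 + out_rate * output_tokens / 1e6, 4)

# Example: a workload of 10M input tokens and 2M output tokens.
print(cost_usd("GLM-4.6V", 10_000_000, 2_000_000))  # 4.8
```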

    Previous Releases: GLM‑4.5 Series and Enterprise Applications

    Prior to GLM‑4.6V, Z.ai released the GLM‑4.5 family in mid-2025, establishing the company as a serious contender in open-source LLM development.

    The flagship GLM‑4.5 and its smaller sibling GLM‑4.5‑Air both support reasoning, tool use, coding, and agentic behaviors, while offering strong performance across standard benchmarks.

    The models introduced dual reasoning modes (“thinking” and “non-thinking”) and could automatically generate complete PowerPoint presentations from a single prompt — a feature positioned for use in enterprise reporting, education, and internal comms workflows. Z.ai also extended the GLM‑4.5 series with additional variants such as GLM‑4.5‑X, AirX, and Flash, targeting ultra-fast inference and low-cost scenarios.

    Together, these features position the GLM‑4.5 series as a cost-effective, open, and production-ready alternative for enterprises needing autonomy over model deployment, lifecycle management, and integration pipelines.

    Ecosystem Implications

    The GLM-4.6V release represents a notable advance in open-source multimodal AI. While large vision-language models have proliferated over the past year, few offer:

    • Integrated visual tool usage

    • Structured multimodal generation

    • Agent-oriented memory and decision logic

    Zhipu AI’s emphasis on “closing the loop” from perception to action via native function calling marks a step toward agentic multimodal systems.

    The model’s architecture and training pipeline show a continued evolution of the GLM family, positioning it competitively alongside offerings like OpenAI’s GPT-4V and Google DeepMind’s Gemini-VL.

    Takeaway for Enterprise Leaders

    With GLM-4.6V, Zhipu AI introduces an open-source VLM capable of native visual tool use, long-context reasoning, and frontend automation. It sets new performance marks among models of similar size and provides a scalable platform for building agentic, multimodal AI systems.

  • Anthropic’s Claude Code can now read your Slack messages and write code for you

    Anthropic on Monday launched a beta integration that connects its fast-growing Claude Code programming agent directly to Slack, allowing software engineers to delegate coding tasks without leaving the workplace messaging platform where much of their daily communication already happens.

    The release, which Anthropic describes as a "research preview," is the AI safety company's latest move to embed its technology deeper into enterprise workflows — and comes as Claude Code has emerged as a surprise revenue engine, generating over $1 billion in annualized revenue just six months after its public debut in May.

    "The critical context around engineering work often lives in Slack, including bug reports, feature requests, and engineering discussion," the company wrote in its announcement blog post. "When a bug report appears or a teammate needs a code fix, you can now tag Claude in Slack to automatically spin up a Claude Code session using the surrounding context."

    From bug report to pull request: how the new Slack integration actually works

    The mechanics are deceptively simple but address a persistent friction point in software development: the gap between where problems get discussed and where they get fixed.

    When a user mentions @Claude in a Slack channel or thread, Claude analyzes the message to determine whether it constitutes a coding task. If it does, the system automatically creates a new Claude Code session. Users can also explicitly instruct Claude to treat requests as coding tasks.

    Claude gathers context from recent channel and thread messages in Slack to feed into the Claude Code session. It will use this context to automatically choose which repository to run the task on based on the repositories you've authenticated to Claude Code on the web.

    As the Claude Code session progresses, Claude posts status updates back to the Slack thread. Once complete, users receive a link to the full session where they can review changes, along with a direct link to open a pull request.

    The feature builds on Anthropic's existing Claude for Slack integration and requires users to have access to Claude Code on the web. In practical terms, a product manager reporting a bug in Slack could tag Claude, which would then analyze the conversation context, identify the relevant code repository, investigate the issue, propose a fix, and post a pull request—all while updating the original Slack thread with its progress.
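The dispatch flow described above can be sketched in a few lines. This is a hypothetical stand-in: in the real integration Claude itself decides whether a mention is a coding task, so the keyword heuristic and function names here are illustrative only.

```python
# Hypothetical sketch of the @Claude dispatch flow. In the real system,
# Claude classifies the mention; this keyword check is a toy stand-in.

def looks_like_coding_task(message: str) -> bool:
    """Toy stand-in for Claude's task classification."""
    keywords = ("bug", "fix", "error", "refactor", "implement", "crash")
    return any(k in message.lower() for k in keywords)

def handle_mention(message: str, thread_context: list) -> dict:
    """Route an @Claude mention: spin up a session or just reply."""
    if not looks_like_coding_task(message):
        return {"action": "reply_in_thread"}
    return {
        "action": "create_claude_code_session",
        # Recent channel/thread messages are forwarded as session context.
        "context": thread_context + [message],
        "status_updates_to": "originating_thread",
    }

result = handle_mention(
    "There's a crash when saving drafts, can you fix it?",
    ["User report: saving a draft 500s on /api/drafts"],
)
print(result["action"])  # create_claude_code_session
```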

    Why Anthropic is betting big on enterprise workflow integrations

    The Slack integration arrives at a pivotal moment for Anthropic. Claude Code has already hit $1 billion in revenue six months since its public debut in May, according to a LinkedIn post from Anthropic's chief product officer, Mike Krieger. The coding agent continues to barrel toward scale with customers like Netflix, Spotify, and Salesforce.

    The velocity of that growth helps explain why Anthropic made its first-ever acquisition earlier this month. The Information earlier reported on Anthropic's bid to acquire Bun; Anthropic declined to comment on financial details.

    Bun is a breakthrough JavaScript runtime that is dramatically faster than the leading competition. As an all-in-one toolkit — combining runtime, package manager, bundler, and test runner — it's become essential infrastructure for AI-led software engineering, helping developers build and test applications at unprecedented velocity.

    Since becoming generally available in May 2025, Claude Code has grown from its origins as an internal engineering experiment into a critical tool for many of the world's category-leading enterprises, including Netflix, Spotify, KPMG, L'Oreal, and Salesforce — and Bun has been key in helping scale its infrastructure throughout that evolution.

    The acquisition signals that Anthropic views Claude Code not as a peripheral feature but as a core business line worth substantial investment. The Slack integration extends that bet, positioning Claude Code as an ambient presence in the workspaces where engineering decisions actually get made.

    According to an Anthropic spokesperson, companies including Rakuten, Novo Nordisk, Uber, Snowflake, and Ramp now use Claude Code for both professional and novice developers. Rakuten, the Japanese e-commerce giant, has reportedly reduced software development timelines from 24 days to just 5 days using the tool — a 79% reduction that illustrates the productivity claims Anthropic has been making.

    Claude Code's rapid rise from internal experiment to billion-dollar product

    The Slack launch is the latest in a rapid series of Claude Code expansions. In late November, Claude Code was added to Anthropic's desktop apps, including the Mac version; it had previously been limited to mobile apps and the web. The desktop apps let software engineers code, research, and update work with multiple local and remote sessions running at the same time.

    That release accompanied Anthropic's unveiling of Claude Opus 4.5, its newest and most capable model. Claude Opus 4.5 is available today in the company's apps, via its API, and on all three major cloud platforms. Pricing is $5 per million input tokens and $25 per million output tokens — making Opus-level capabilities accessible to even more users, teams, and enterprises.

    The company has also invested heavily in the developer infrastructure that powers Claude Code. In late November, Anthropic released three new beta features for tool use: Tool Search Tool, which allows Claude to use search tools to access thousands of tools without consuming its context window; Programmatic Tool Calling, which allows Claude to invoke tools in a code execution environment reducing the impact on the model's context window; and Tool Use Examples, which provides a universal standard for demonstrating how to effectively use a given tool.
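For context on what these tool-use features operate over, the sketch below shows the basic tool-definition shape Anthropic's Messages API uses (a name, a description, and a JSON-Schema `input_schema`). The weather tool and model id are made-up examples, and the exact beta headers for the three features above are not reproduced here.

```python
# The basic tool-definition shape for Anthropic's Messages API. The
# weather tool is a made-up example; the model id is illustrative, and
# beta features are enabled via separate beta headers not shown here.

get_weather = {
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "input_schema": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name"},
        },
        "required": ["city"],
    },
}

request_body = {
    "model": "claude-opus-4-5",  # illustrative model id
    "max_tokens": 1024,
    "tools": [get_weather],
    "messages": [{"role": "user", "content": "What's the weather in Tokyo?"}],
}
```

Features like Tool Search Tool and Programmatic Tool Calling matter precisely because each definition like this consumes context; deferring or batching them keeps the window free for actual work.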

    The Model Context Protocol (MCP) is an open standard for connecting AI agents to external systems. Connecting agents to tools and data traditionally requires a custom integration for each pairing, creating fragmentation and duplicated effort that makes it difficult to scale truly connected systems. MCP provides a universal protocol — developers implement MCP once in their agent and it unlocks an entire ecosystem of integrations.
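MCP is built on JSON-RPC 2.0, so the "implement once" claim boils down to agreeing on a small set of framed messages. The sketch below shows the shape of a client's opening `initialize` request; the protocol version string is one published revision and may not be the latest.

```python
import json

# MCP rides on JSON-RPC 2.0. A client session opens with an "initialize"
# request like this one; the protocolVersion shown is one published
# revision and may be superseded.

initialize_request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "initialize",
    "params": {
        "protocolVersion": "2024-11-05",
        "capabilities": {},
        "clientInfo": {"name": "example-agent", "version": "0.1.0"},
    },
}

wire = json.dumps(initialize_request)
print(json.loads(wire)["method"])  # initialize
```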

    Inside Anthropic's own AI transformation: what happens when engineers use Claude all day

    Anthropic has been unusually transparent about how its own engineers use Claude Code — and the findings offer a preview of broader workforce implications. In August 2025, Anthropic surveyed 132 engineers and researchers, conducted 53 in-depth qualitative interviews, and studied internal Claude Code usage data to understand how AI use is changing work at the company.

    Employees self-reported using Claude in 60% of their work and achieving a 50% productivity boost, a 2-3x increase from this time last year. In practice, this gain looks like slightly less time spent per task category but considerably more output volume.

    Perhaps most notably, 27% of Claude-assisted work consists of tasks that wouldn't have been done otherwise, such as scaling projects, making nice-to-have tools like interactive data dashboards, and exploratory work that wouldn't be cost-effective if done manually.

    The internal research also revealed how Claude is changing the nature of engineering collaboration. The maximum number of consecutive tool calls Claude Code makes per transcript increased by 116%: Claude now chains together 21.2 independent tool calls without the need for human intervention, up from 9.8 six months ago.

    The average number of human turns per transcript decreased by 33%, from 6.2 to 4.1, suggesting that less human input is needed to accomplish a given task than six months ago.
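The reported percentages follow directly from the raw figures, as a quick check confirms:

```python
# Quick arithmetic check on the reported statistics.

tool_calls_before, tool_calls_after = 9.8, 21.2
increase_pct = (tool_calls_after - tool_calls_before) / tool_calls_before * 100
print(round(increase_pct))  # 116

turns_before, turns_after = 6.2, 4.1
decrease_pct = (turns_before - turns_after) / turns_before * 100
print(round(decrease_pct, 1))  # 33.9, which the article rounds to 33%
```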

    But the research also surfaced tensions. One prominent theme was that Claude has become the first stop for questions that once went to colleagues. "It has reduced my dependence on [my team] by 80%, [but] the last 20% is crucial and I go and talk to them," one engineer explained. Several engineers said they "bounce ideas off" Claude, similar to interactions with human collaborators.

    Others described experiencing less interaction with colleagues. Some appreciate the reduced social friction, but others resist the change or miss the older way of working: "I like working with people and it is sad that I 'need' them less now."

    How Anthropic stacks up against OpenAI, Google, and Microsoft in the enterprise AI race

    Anthropic is not alone in racing to capture the enterprise coding market. OpenAI, Google, and Microsoft (through GitHub Copilot) are all pursuing similar integrations. The Slack launch gives Anthropic a presence in one of the most widely used enterprise communication platforms — Slack claims over 750,000 organizations use its software.

    The deal comes as Anthropic pursues a more disciplined growth path than rival OpenAI, focusing on enterprise customers and coding workloads. Internal financials reported by The Wall Street Journal show Anthropic expects to break even by 2028 — two years earlier than OpenAI, which continues to invest heavily in infrastructure as it expands into video, hardware, and consumer products.

    The move also marks an increased push into developer tooling. Anthropic has recently seen backing from some of tech's biggest titans. Microsoft and Nvidia pledged up to $15 billion in fresh investment in Anthropic last month, alongside a $30 billion commitment from Anthropic to run Claude Code on Microsoft's cloud. This is in addition to the $8 billion invested from Amazon and $3 billion from Google.

    The cross-investment from both Microsoft and Google — fierce competitors in the cloud and AI spaces — highlights how valuable Anthropic's enterprise positioning has become. By integrating with Slack (which is owned by Salesforce), Anthropic further embeds itself in the enterprise software ecosystem while remaining platform-agnostic.

    What the Slack integration means for developers — and whether they can trust it

    For engineering teams, the Slack integration promises to collapse the distance between problem identification and problem resolution. A bug report in a Slack channel can immediately trigger investigation. A feature request can spawn a prototype. A code review comment can generate a refactor.

    But the integration also raises questions about oversight and code quality. Most Anthropic employees use Claude frequently while reporting they can "fully delegate" only 0-20% of their work to it. Claude is a constant collaborator, but using it generally involves active supervision and validation, especially in high-stakes work, rather than handing off tasks that require no verification at all.

    Some employees are concerned about the atrophy of deeper skillsets required for both writing and critiquing code — "When producing output is so easy and fast, it gets harder and harder to actually take the time to learn something."

    The Slack integration, by making Claude Code invocation as simple as an @mention, may accelerate both the productivity benefits and the skill-atrophy concerns that Anthropic's own research has documented.

    The future of coding may be conversational—and Anthropic is racing to prove it

    The beta launch marks the beginning of what Anthropic expects will be a broader rollout, with documentation forthcoming for teams looking to deploy the integration and refinements planned based on user feedback during the research preview phase.

    For Anthropic, the Slack integration is a calculated bet on a fundamental shift in how software gets written. The company is wagering that the future of coding will be conversational — that the walls between where developers talk about problems and where they solve them will dissolve entirely. The companies that win enterprise AI, in this view, will be the ones that meet developers not in specialized tools but in the chat windows they already have open all day.

    Whether that vision becomes reality will depend on whether Claude Code can deliver enterprise-grade reliability while maintaining the security that organizations demand. The early returns are promising: a billion dollars in revenue, a roster of Fortune 500 customers, and a growing ecosystem of integrations suggest Anthropic is onto something real.

    But in one of Anthropic's own internal interviews, an engineer offered a more cautious assessment of the transformation underway: "Nobody knows what's going to happen… the important thing is to just be really adaptable."

    In the age of AI coding agents, that may be the only career advice that holds up.