Blog

  • Google’s new framework helps AI agents spend their compute and tool budget more wisely

    In a new paper that studies tool-use in large language model (LLM) agents, researchers at Google and UC Santa Barbara have developed a framework that enables agents to make more efficient use of tool and compute budgets. The researchers introduce two new techniques: a simple "Budget Tracker" and a more comprehensive framework called "Budget Aware Test-time Scaling." These techniques make agents explicitly aware of their remaining reasoning and tool-use allowance.

    As AI agents rely on tool calls to work in the real world, test-time scaling has become less about smarter models and more about controlling cost and latency.

    For enterprise leaders and developers, budget-aware scaling techniques offer a practical path to deploying effective AI agents without facing unpredictable costs or diminishing returns on compute spend.

    The challenge of scaling tool use

    Traditional test-time scaling focuses on letting models "think" longer. However, for agentic tasks like web browsing, the number of tool calls directly determines the depth and breadth of exploration.

    This introduces significant operational overhead for businesses. "Tool calls such as webpage browsing results in more token consumption, increases the context length and introduces additional time latency," Zifeng Wang and Tengxiao Liu, co-authors of the paper, told VentureBeat. "Tool calls themselves introduce additional API costs."

    The researchers found that simply granting agents more test-time resources does not guarantee better performance. "In a deep research task, if the agent has no sense of budget, it often goes down blindly," Wang and Liu explained. "It finds one somewhat related lead, then spends 10 or 20 tool calls digging into it, only to realize that the entire path was a dead end."

    Optimizing resources with Budget Tracker

    To evaluate how they can optimize tool-use budgets, the researchers first tried a lightweight approach called "Budget Tracker." This module acts as a plug-in that provides the agent with a continuous signal of resource availability, enabling budget-aware tool use.

    The team hypothesized that "providing explicit budget signals enables the model to internalize resource constraints and adapt its strategy without requiring additional training."

    Budget Tracker operates purely at the prompt level, which makes it easy to implement; the paper provides full details on the prompts used.

    In Google's implementation, the tracker provides a brief policy guideline describing the budget regimes and corresponding recommendations for using tools. At each step of the response process, Budget Tracker makes the agent explicitly aware of its resource consumption and remaining budget, enabling it to condition subsequent reasoning steps on the updated resource state.
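To make the idea concrete, here is a minimal sketch of what such a prompt-level tracker might look like. The class name, thresholds, and prompt wording are all illustrative assumptions, not taken from the paper:

```python
# Illustrative sketch of a prompt-level budget tracker.
# All names, regimes, and thresholds are hypothetical.

class BudgetTracker:
    """Tracks tool-call consumption and renders a status string
    that is appended to the agent's prompt at every step."""

    def __init__(self, max_tool_calls: int):
        self.max_tool_calls = max_tool_calls
        self.used = 0

    def record_call(self) -> None:
        self.used += 1

    @property
    def remaining(self) -> int:
        return max(self.max_tool_calls - self.used, 0)

    def status_prompt(self) -> str:
        # The agent conditions its next reasoning step on this signal.
        frac = self.remaining / self.max_tool_calls
        if frac > 0.5:
            regime = "ample: explore multiple leads in parallel"
        elif frac > 0.2:
            regime = "limited: prioritize the most promising lead"
        else:
            regime = "critical: stop exploring and finalize an answer"
        return (f"[Budget] {self.used}/{self.max_tool_calls} tool calls used, "
                f"{self.remaining} remaining. Guidance: {regime}.")


tracker = BudgetTracker(max_tool_calls=20)
for _ in range(15):
    tracker.record_call()
print(tracker.status_prompt())
```

The key design point is that the tracker never intervenes in the agent's logic; it only injects a resource signal into the context and lets the model adapt its own strategy.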

    To test this, the researchers experimented with two paradigms: sequential scaling, where the model iteratively refines its output, and parallel scaling, where multiple independent runs are conducted and aggregated. They ran experiments on search agents equipped with search and browse tools following a ReAct-style loop. ReAct (Reasoning + Acting) is a popular method where the model alternates between internal thinking and external actions. To trace a true cost-performance scaling trend, they developed a unified cost metric that jointly accounts for the costs of both internal token consumption and external tool interactions.
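A unified cost metric of this kind can be sketched as a simple weighted sum of token spend and per-call tool charges. The prices below are illustrative placeholders, not the paper's actual rates:

```python
# Hypothetical unified cost metric: internal token spend plus
# external tool-call costs. All prices are illustrative placeholders.

def unified_cost(input_tokens: int, output_tokens: int,
                 search_calls: int, browse_calls: int,
                 price_in: float = 1.25e-6,     # $/input token (assumed)
                 price_out: float = 10e-6,      # $/output token (assumed)
                 price_search: float = 0.005,   # $/search call (assumed)
                 price_browse: float = 0.002):  # $/browse call (assumed)
    token_cost = input_tokens * price_in + output_tokens * price_out
    tool_cost = search_calls * price_search + browse_calls * price_browse
    return token_cost + tool_cost


# A run with 200K input tokens, 8K output tokens, 30 searches, 50 browses:
print(round(unified_cost(200_000, 8_000, 30, 50), 4))
```

Tracing accuracy against this single number, rather than against tool calls or tokens alone, is what lets the researchers compare sequential and parallel scaling on a common cost axis.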

    They tested Budget Tracker on three information-seeking QA datasets requiring external search, including BrowseComp and HLE-Search, using models such as Gemini 2.5 Pro, Gemini 2.5 Flash, and Claude Sonnet 4. The experiments show that this simple plug-in improves performance across various budget constraints.

    "Adding Budget Tracker achieves comparable accuracy using 40.4% fewer search calls, 19.9% fewer browse calls, and reducing overall cost … by 31.3%," the authors told VentureBeat. Finally, Budget Tracker continued to scale as the budget increased, whereas plain ReAct plateaued after a certain threshold.

    BATS: A comprehensive framework for budget-aware scaling

    To further improve tool-use resource optimization, the researchers introduced Budget Aware Test-time Scaling (BATS), a framework designed to maximize agent performance under any given budget. BATS maintains a continuous signal of remaining resources and uses this information to dynamically adapt the agent's behavior as it formulates its response.

    BATS uses multiple modules to orchestrate the agent's actions. A planning module adjusts stepwise effort to match the current budget, while a verification module decides whether to "dig deeper" into a promising lead or "pivot" to alternative paths based on resource availability.

    Given an information-seeking question and a tool-call budget, BATS begins by using the planning module to formulate a structured action plan and decide which tools to invoke. When tools are invoked, their responses are appended to the reasoning sequence to provide the context with new evidence. When the agent proposes a candidate answer, the verification module verifies it and decides whether to continue the current sequence or initiate a new attempt with the remaining budget.

    The iterative process ends when budgeted resources are exhausted, at which point an LLM-as-a-judge selects the best answer across all verified answers. Throughout the execution, the Budget Tracker continuously updates both resource usage and remaining budget at every iteration.
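The control flow described above can be summarized in a short schematic loop. The `plan`, `act`, `verify`, and `judge` functions below are stand-in stubs for the paper's modules, not their actual implementations:

```python
# Schematic of the BATS control loop as described in the article.
# plan / act / verify / judge are stand-in stubs, not the real modules.
import random

random.seed(0)


def plan(question, remaining):            # planning module (stub)
    return f"step for '{question}' with {remaining} calls left"


def act(action):                          # tool invocation (stub)
    return f"evidence from ({action})"


def propose_answer(context):              # candidate answer (stub)
    return f"answer based on {len(context)} pieces of evidence"


def verify(answer):                       # verification module (stub)
    # True = accept this candidate and pivot to a fresh attempt;
    # False = keep digging along the current path.
    return random.random() > 0.5


def judge(candidates):                    # LLM-as-a-judge (stub)
    return max(candidates, key=len)


def bats(question, budget):
    context, candidates = [], []
    while budget > 0:
        context.append(act(plan(question, budget)))  # tool response -> context
        budget -= 1
        answer = propose_answer(context)
        if verify(answer):
            candidates.append(answer)
            context = []                  # start a new attempt with what's left
    # Budget exhausted: pick the best verified answer.
    return judge(candidates) if candidates else answer


print(bats("example question", budget=5))
```

The essential structure is that every decision point (plan depth, dig vs. pivot, when to stop) is conditioned on the remaining budget rather than on a fixed step count.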

    The researchers tested BATS on the BrowseComp, BrowseComp-ZH, and HLE-Search benchmarks against baselines including standard ReAct and various training-based agents. Their experiments show that BATS achieves higher performance while using fewer tool calls and incurring lower overall cost than competing methods. Using Gemini 2.5 Pro as the backbone, BATS achieved 24.6% accuracy on BrowseComp compared to 12.6% for standard ReAct, and 27.0% on HLE-Search compared to 20.5% for ReAct.

    BATS not only improves effectiveness under budget constraints but also yields better cost–performance trade-offs. For example, on the BrowseComp dataset, BATS achieved higher accuracy at a cost of approximately 23 cents compared to a parallel scaling baseline that required over 50 cents to achieve a similar result.

    According to the authors, this efficiency makes previously expensive workflows viable. "This unlocks a range of long-horizon, data-intensive enterprise applications… such as complex codebase maintenance, due-diligence investigations, competitive landscape research, compliance audits, and multi-step document analysis," they said.

    As enterprises look to deploy agents that manage their own resources, the ability to balance accuracy with cost will become a critical design requirement.

    "We believe the relationship between reasoning and economics will become inseparable," Wang and Liu said. "In the future, [models] must reason about value."

  • Ai2’s new Olmo 3.1 extends reinforcement learning training for stronger reasoning benchmarks

    The Allen Institute for AI (Ai2) recently released what it calls its most powerful family of models yet, Olmo 3. But the company kept iterating on the models, expanding its reinforcement learning (RL) runs, to create Olmo 3.1.

    The new Olmo 3.1 models focus on efficiency, transparency, and control for enterprises. 

    Ai2 updated two of the three versions of Olmo 3: Olmo 3.1 Think 32B, the flagship model optimized for advanced research, and Olmo 3.1 Instruct 32B, designed for instruction-following, multi-turn dialogue, and tool use. 

    Olmo 3 has a third version, Olmo 3-Base, for programming, comprehension, and math. It also works well for continued fine-tuning. 

    Ai2 said that to upgrade Olmo 3 Think 32B to Olmo 3.1, its researchers extended its best RL run with a longer training schedule. 

    “After the original Olmo 3 launch, we resumed our RL training run for Olmo 3 32B Think, training for an additional 21 days on 224 GPUs with extra epochs over our Dolci-Think-RL dataset,” Ai2 said in a blog post. “This yielded Olmo 3.1 32B Think, which brings substantial gains across math, reasoning, and instruction-following benchmarks: improvements of 5+ points on AIME, 4+ points on ZebraLogic, 4+ points on IFEval, and 20+ points on IFBench, alongside stronger performance on coding and complex multi-step tasks.”

    To get to Olmo 3.1 Instruct, Ai2 said its researchers applied the recipe behind the smaller Instruct size, 7B, to the larger model.

    Olmo 3.1 Instruct 32B is "optimized for chat, tool use, & multi-turn dialogue—making it a much more performant sibling of Olmo 3 Instruct 7B and ready for real-world applications,” Ai2 said in a post on X.

    For now, the new checkpoints are available on the Ai2 Playground or Hugging Face, with API access coming soon. 

    Better performance on benchmarks

    The Olmo 3.1 models performed well on benchmark tests, predictably beating the Olmo 3 models. 

    Olmo 3.1 Think outperformed Qwen 3 32B models in the AIME 2025 benchmark and performed close to Gemma 27B. 

    Olmo 3.1 Instruct performed strongly against its open-source peers, even beating models like Gemma 3 on the Math benchmark.

    “As for Olmo 3.1 32B Instruct, it’s a larger-scale instruction-tuned model built for chat, tool use, and multi-turn dialogue. Olmo 3.1 32B Instruct is our most capable fully open chat model to date and — in our evaluations — the strongest fully open 32B-scale instruct model,” the company said. 

    Ai2 also upgraded its RL-Zero 7B models for math and coding. The company said on X that both models benefited from longer and more stable training runs.

    Commitment to transparency and open source 

    Ai2 previously told VentureBeat that it designed the Olmo 3 family of models to offer enterprises and research labs more control and understanding of the data and training that went into the model. 

    Organizations can add to the model’s data mix and retrain it so the model also learns from what’s been added.  

    This has long been a commitment for Ai2, which also offers a tool called OlmoTrace that tracks how LLM outputs match its training data.  

    “Together, Olmo 3.1 Think 32B and Olmo 3.1 Instruct 32B show that openness and performance can advance together. By extending the same model flow, we continue to improve capabilities while retaining end-to-end transparency over data, code, and training decisions,” Ai2 said. 

  • OpenAI’s GPT-5.2 is here: what enterprises need to know

    The rumors were true: OpenAI on Thursday announced the release of its new frontier large language model (LLM) family, GPT-5.2.

    It comes at a pivotal moment for the AI pioneer, which has faced intensifying pressure since rival Google’s Gemini 3 LLM seized the top spot on major third-party performance leaderboards and many key benchmarks last month, though OpenAI leaders stressed in a press briefing that the timing of this release had been discussed and worked on well in advance of the release of Gemini 3.

    OpenAI describes GPT-5.2 as its "most capable model series yet for professional knowledge work," aiming to reclaim the performance crown with significant gains in reasoning, coding, and agentic workflows.

    "It’s our most advanced frontier model and the strongest yet in the market for professional use," Fidji Simo, OpenAI’s CEO of Applications, said during a press briefing today. "We designed 5.2 to unlock even more economic value for people. It's better at creating spreadsheets, building presentations, writing code, perceiving images, understanding long context, using tools, and handling complex, multi-step projects."

    GPT-5.2 features a massive 400,000-token context window — allowing it to ingest hundreds of documents or large code repositories at once — and a 128,000 max output token limit, enabling it to generate extensive reports or full applications in a single go.

    The model also features a knowledge cutoff of August 31, 2025, ensuring it is up-to-date with relatively recent world events and technical documentation. It explicitly includes "Reasoning token support," confirming the underlying architecture uses the chain-of-thought processing popularized by the "o1" series.

    The 'Code Red' Reality Check

    The release arrives following The Information's report of an emergency "Code Red" directive to OpenAI staff from CEO Sam Altman to improve ChatGPT — a move reportedly designed to mobilize resources following the "quality gap" exposed by Gemini 3. The Verge similarly reported on the timing of GPT-5.2's release ahead of the official announcement.

    During the briefing, OpenAI executives acknowledged the directive but pushed back on the narrative that the model was rushed solely to answer Google.

    "It is important to note this has been in the works for many, many months," Simo told reporters. She clarified that while the "Code Red" helped focus the company, it wasn't the sole driver of the timeline.

    "We announced this Code Red to really signal to the company that we want to marshal resources in one particular area… but that's not the reason it's coming out this week in particular."

    Max Schwarzer, lead of OpenAI's post-training team, echoed this sentiment to dispel the idea of a panic launch. "We've been planning for this release since a very long time ago… this specific week we talked about many months ago."

    A spokesperson from OpenAI further clarified that the "Code Red" call applied to ChatGPT as a product, not solely underlying model development or the release of new models.

    Under the Hood: Instant, Thinking, and Pro

    OpenAI is segmenting the GPT-5.2 release into three distinct tiers within ChatGPT, a strategy likely designed to balance the massive compute costs of "reasoning" models with user demand for speed:

    • GPT-5.2 Instant: Optimized for speed and daily tasks like writing, translation, and information seeking.

    • GPT-5.2 Thinking: Designed for "complex, structured work" and long-running agents, this model leverages deeper reasoning chains to handle coding, math, and multi-step projects.

    • GPT-5.2 Pro: The new heavyweight champion. OpenAI describes this as its "smartest and most trustworthy option," delivering the highest accuracy for difficult questions where quality outweighs latency.

    For developers, the models are available immediately in the application programming interface (API) as gpt-5.2, gpt-5.2-chat-latest (Instant), and gpt-5.2-pro.

    The Numbers: Beating the Benchmarks

    The GPT-5.2 release includes leading metrics across most domains — specifically those that target the "professional knowledge work" gap where competitors have recently gained ground.

    OpenAI highlighted a new benchmark called GDPval, which measures performance on "well-specified knowledge work tasks" across 44 occupations.

    "GPT-5.2 Thinking is now state-of-the-art on that benchmark… and beats or ties top industry professionals on 70.9% of well-specified professional tasks like spreadsheets, presentations, and document creation, according to expert human judges," Simo said.

    In the critical arena of coding, OpenAI is claiming a decisive lead. Schwarzer noted that on SWE-bench Pro, a rigorous evaluation of real-world software engineering, GPT-5.2 Thinking sets a new state-of-the-art score of 55.6%.

    He emphasized that this benchmark is "more contamination resistant, challenging, diverse, and industrially relevant than previous benchmarks like SWE-bench Verified."

    Other key benchmark results include:

    • GPQA Diamond (Science): GPT-5.2 Pro scored 93.2%, edging out GPT-5.2 Thinking (92.4%) and surpassing GPT-5.1 Thinking (88.1%).

    • FrontierMath: On Tier 1-3 problems, GPT-5.2 Thinking solved 40.3%, a significant jump from the 31.0% achieved by its predecessor.

    • ARC-AGI-1: GPT-5.2 Pro is reportedly the first model to cross the 90% threshold on this general reasoning benchmark, scoring 90.5%.

    The Price of Intelligence

    Performance comes at a premium. While ChatGPT subscription pricing remains unchanged for now, the API costs for the new flagship models are steep compared to previous generations, reflecting the high compute demands of "thinking" mode. They're also on the upper-end of API costs for the industry.

    • GPT-5.2 Thinking: Priced at $1.75 per 1 million input tokens and $14 per 1 million output tokens.

    • GPT-5.2 Pro: The costs jump significantly to $21 per 1 million input tokens and $168 per 1 million output tokens.

    GPT-5.2 Thinking is priced 40% higher in the API than the standard GPT-5.1 ($1.25/$10), signaling that OpenAI views the new reasoning capabilities as a tangible value-add rather than a mere efficiency update.

    The high-end GPT-5.2 Pro follows the same pattern, costing 40% more than the previous GPT-5 Pro ($15/$120). While expensive, it still undercuts OpenAI’s most specialized reasoning model, o1-pro, which remains the most costly offering on the menu at a staggering $150 per million input tokens and $600 per million output tokens.
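A quick sanity check of those generation-over-generation figures, plus what a single maxed-out call would cost at the listed rates (the 400K input / 128K output sizes come from the model's stated limits; the helper names are mine):

```python
# Quick check of the pricing math cited above (rates in $ per 1M tokens).

def pct_increase(old, new):
    return (new - old) / old * 100


def job_cost(input_toks, output_toks, price_in, price_out):
    return input_toks / 1e6 * price_in + output_toks / 1e6 * price_out


# GPT-5.1 -> GPT-5.2 Thinking: $1.25/$10 -> $1.75/$14
print(pct_increase(1.25, 1.75), pct_increase(10, 14))
# GPT-5 Pro -> GPT-5.2 Pro: $15/$120 -> $21/$168
print(pct_increase(15, 21), pct_increase(120, 168))
# A full-context GPT-5.2 Thinking call: 400K input, 128K output tokens
print(round(job_cost(400_000, 128_000, 1.75, 14.00), 2))
```

Both tiers work out to a uniform 40% increase on input and output alike, and a single worst-case Thinking call lands in the neighborhood of $2.50.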

    OpenAI argues that despite the higher per-token cost, the model’s "greater token efficiency" and ability to solve tasks in fewer turns make it economically viable for high-value enterprise workflows.

    Here's how it compares to the current API costs for other competing models across the LLM field:

    | Model | Input (/1M) | Output (/1M) | Total Cost | Source |
    |---|---|---|---|---|
    | Qwen 3 Turbo | $0.05 | $0.20 | $0.25 | Alibaba Cloud |
    | Grok 4.1 Fast (reasoning) | $0.20 | $0.50 | $0.70 | xAI |
    | Grok 4.1 Fast (non-reasoning) | $0.20 | $0.50 | $0.70 | xAI |
    | deepseek-chat (V3.2-Exp) | $0.28 | $0.42 | $0.70 | DeepSeek |
    | deepseek-reasoner (V3.2-Exp) | $0.28 | $0.42 | $0.70 | DeepSeek |
    | Qwen 3 Plus | $0.40 | $1.20 | $1.60 | Alibaba Cloud |
    | ERNIE 5.0 | $0.85 | $3.40 | $4.25 | Qianfan |
    | Claude Haiku 4.5 | $1.00 | $5.00 | $6.00 | Anthropic |
    | Qwen-Max | $1.60 | $6.40 | $8.00 | Alibaba Cloud |
    | Gemini 3 Pro (≤200K) | $2.00 | $12.00 | $14.00 | Google |
    | GPT-5.2 | $1.75 | $14.00 | $15.75 | OpenAI |
    | Claude Sonnet 4.5 | $3.00 | $15.00 | $18.00 | Anthropic |
    | Gemini 3 Pro (>200K) | $4.00 | $18.00 | $22.00 | Google |
    | Claude Opus 4.5 | $5.00 | $25.00 | $30.00 | Anthropic |
    | GPT-5.2 Pro | $21.00 | $168.00 | $189.00 | OpenAI |

    Image Generation: Nothing New Yet…But 'More to Come'

    During the briefing, VentureBeat asked the OpenAI participants if the new release included any boost to image generation capabilities, noting the excitement around similar features in recent competitor launches like Google's Gemini 3 Image aka Nano Banana Pro.

    Unfortunately for those hoping to recreate the text- and information-heavy graphics and image-editing capabilities of those rival releases, OpenAI executives clarified that GPT-5.2 brings no image-generation improvements over GPT-5.1 and OpenAI's existing DALL-E 3 and gpt-4o native image models.

    "On image Gen, nothing to announce today, but more to come," Simo said. She acknowledged the popularity of the feature, adding, "We know this is a very important use case that people love, that we introduced [to] the market, and so definitely more to come there."

    Aidan Clark, OpenAI's lead of training, also declined to comment on visual generation specifics, stating simply, "I can't really speak to image Gen myself."

    The 'Mega-Agent' Era

    Beyond raw scores, OpenAI is positioning GPT-5.2 as the engine for a new generation of "long-running agents" capable of executing multi-step workflows without human hand-holding.

    "Box found that 5.2 can extract information from long, complex documents about 40% faster, and also saw a 40% boost in reasoning accuracy for Life Sciences and healthcare," Simo said.

    She also noted that Notion reported the model "outperforms 5.1 across every dimension… and it excels at the kind of really ambiguous, longer rising tasks that define real knowledge work."

    Schwarzer added that coding startups like Augment Code found the model "delivered substantially stronger deep code capabilities than any prior model," which is why it was selected to power their new code review agent.

    Visual capabilities have also seen an upgrade.

    OpenAI's release blog post shows an example where "a traveler reports a delayed flight, a missed connection, an overnight stay in New York, and a medical seating requirement."

    The outcome? "GPT‑5.2 manages the entire chain of tasks—rebooking, special-assistance seating, and compensation—delivering a more complete outcome than GPT‑5.1."

    A new evaluation called ScreenSpot-Pro, which tests a model's ability to understand GUI screenshots, shows GPT-5.2 Thinking achieving 86.3% accuracy, compared to just 64.2% for GPT-5.1.

    Science and Reliability

    OpenAI leaders also stressed the model's utility for scientific research, attempting to move the conversation beyond simple chatbots to research assistants.

    Aidan Clark, lead of the training team, shared an example of a senior immunology researcher testing the model.

    "They tested it by asking it to generate the most important unanswered questions about the immune system," Clark said. "That immunology researcher reported that GPT-5.2 produced sharper questions and stronger explanations for why those questions… matter compared to any previous pro model."

    Reliability was another key focus. Schwarzer claimed the new model "hallucinates substantially less than GPT-5.1," noting that on a set of de-identified queries, "responses contained errors 38% less often."

    The 'Vibe' Shift

    Interestingly, OpenAI acknowledged that not every user might immediately prefer the new models.

    When asked why legacy models like GPT-5.1 would remain available, Schwarzer admitted that "models change a little bit every time."

    "Some users may find that they prefer the vibes of the previous model, even though we think the latest one is across the board generally much better," Schwarzer said. He also noted that for some enterprise customers who have "really fine-tuned a prompt for a specific model," there might be "small regressions," necessitating access to the older versions.

    Safety, 'Adult Mode,' and Future Roadmap

    Addressing safety concerns, Simo confirmed that the company is preparing to roll out an "Adult Mode" in the first quarter of next year, following the implementation of a new age prediction system.

    "We're in the process of improving that," Simo said regarding the age prediction technology.

    "We want to do that ahead of launching adult mode."

    Looking further ahead, industry reports suggest OpenAI is working on a more fundamental architectural shift under the codename "Project Garlic," targeting a flagship release in early 2026.

    While executives did not comment on specific future roadmaps during the briefing, Simo remained optimistic about the economics of their current trajectory.

    "If you look at historical trends, compute has increased about 3x every year for the last three years," she explained. "Revenue has also increased at the same pace… creating this virtuous cycle."

    Clark added that efficiency is improving rapidly: "The model we're releasing today achieves an even better score [on ARC-AGI] with almost 400 times less cost and less compute associated with it" compared to models from a year ago.

    GPT-5.2 Instant, Thinking, and Pro begin rolling out in ChatGPT today to paid users (Plus, Pro, Team, and Enterprise). The company notes the rollout will be gradual to maintain stability.

  • GPT-5.2 first impressions: a powerful update, especially for business tasks and workflows

    OpenAI has officially released GPT-5.2, and the reactions from early testers — to whom OpenAI provided the model several days before public release, in some cases weeks earlier — paint a two-toned picture: it is a monumental leap forward for deep, autonomous reasoning and coding, yet potentially an underwhelming "incremental" update for casual conversationalists.

    Following early access periods and today's broader rollout, executives, developers, and analysts have taken to X (formerly Twitter) and company blogs to share their first testing results.

    Here is a roundup of the first reactions to OpenAI’s latest flagship model.

    "AI as a serious analyst"

    The strongest praise for GPT-5.2 centers on its ability to handle "hard problems" that require extended thinking time.

    Matt Shumer, CEO of HyperWriteAI, did not mince words in his review, calling GPT-5.2 Pro "the best model in the world."

    Shumer highlighted the model's tenacity, noting that "it thinks for over an hour on hard problems. And it nails tasks no other model can touch."

    This sentiment was echoed by Allie K. Miller, an AI entrepreneur and former AWS executive. Miller described the model as a step toward "AI as a serious analyst" rather than a "friendly companion."

    "The thinking and problem-solving feel noticeably stronger," Miller wrote on X. "It gives much deeper explanations than I’m used to seeing. At one point it literally wrote code to improve its own OCR in the middle of a task."

    Enterprise gains: Box reports distinct performance jumps

    For the enterprise sector, the update appears to be even more significant.

    Aaron Levie, CEO of Box, revealed on X that his company has been testing GPT-5.2 in early access. Levie reported that the model performs "7 points better than GPT-5.1" on their expanded reasoning tests, which approximate real-world knowledge work in financial services and life sciences.

    "The model performed the majority of the tasks far faster than GPT-5.1 and GPT-5 as well," Levie noted, confirming that Box AI will be rolling out GPT-5.2 integration shortly.

    Rutuja Rajwade, a Senior Product Marketing Manager at Box, expanded on this in a company blog post, citing specific latency improvements.

    "Complex extraction" tasks dropped from 46 seconds on GPT-5 to just 12 seconds with GPT-5.2.

    Rajwade also noted a jump in reasoning capabilities for the Media and Entertainment vertical, rising from 76% accuracy in GPT-5.1 to 81% in the new model.

    A "serious leap" for coding and simulation

    Developers are finding GPT-5.2 particularly potent for "one-shot" generation of complex code structures.

    Pietro Schirano, CEO of magicpathai, shared a video of the model building a full 3D graphics engine in a single file with interactive controls. "It’s a serious leap forward in complex reasoning, math, coding, and simulations," Schirano posted. "The pace of progress is unreal."

    Similarly, Ethan Mollick, a professor at the Wharton School of the University of Pennsylvania and a longtime AI power user and writer, demonstrated the model's ability to create a visually complex shader—an infinite neo-gothic city in a stormy ocean—via a single prompt.

    The Agentic Era: Long-running autonomy

    Perhaps the most functional shift is the model's ability to stay on task for hours without losing the thread.

    Dan Shipper, CEO of the AI-focused newsletter Every, reported that the model successfully performed a profit and loss (P&L) analysis, working autonomously for two hours. "It did a P&L analysis where it worked for 2 hours and gave me great results," Shipper wrote.

    However, Shipper also noted that for day-to-day tasks, the update feels "mostly incremental."

    In an article for Every, Katie Parrott wrote that while GPT-5.2 excels at instruction following, it is "less resourceful" than competitors like Claude Opus 4.5 in certain contexts, such as deducing a user's location from email data.

    The downsides: Speed and Rigidity

    Despite the reasoning capabilities, the "feel" of the model has drawn critique.

    Shumer highlighted a significant "speed penalty" when using the model's Thinking mode. "In my experience the Thinking mode is very slow for most questions," Shumer wrote in his deep-dive review. "I almost never use Instant."

    Allie Miller also pointed out issues with the model's default behavior. "The downside is tone and format," she noted. "The default voice felt a bit more rigid, and the length/markdown behavior is extreme: a simple question turned into 58 bullets and numbered points."

    The Verdict

    The early reaction suggests that GPT-5.2 is a tool optimized for power users, developers, and enterprise agents rather than casual chat. As Shumer summarized in his review: "For deep research, complex reasoning, and tasks that benefit from careful thought, GPT-5.2 Pro is the best option available right now."

    However, for users seeking creative writing or quick, fluid answers, models like Claude Opus 4.5 remain strong competitors. "My favorite model remains Claude Opus 4.5," Miller admitted, "but my complex ChatGPT work will get a nice incremental boost."

  • OpenAI report reveals a 6x productivity gap between AI power users and everyone else

    The tools are available to everyone. The subscription is company-wide. The training sessions have been held. And yet, in offices from Wall Street to Silicon Valley, a stark divide is opening between workers who have woven artificial intelligence into the fabric of their daily work and colleagues who have barely touched it.

    The gap is not small. According to a new report from OpenAI analyzing usage patterns across its more than one million business customers, workers at the 95th percentile of AI adoption are sending six times as many messages to ChatGPT as the median employee at the same companies. For specific tasks, the divide is even more dramatic: frontier workers send 17 times as many coding-related messages as their typical peers, and among data analysts, the heaviest users engage the data analysis tool 16 times more frequently than the median.

    This is not a story about access. It is a story about a new form of workplace stratification emerging in real time — one that may be reshaping who gets ahead, who falls behind, and what it means to be a skilled worker in the age of artificial intelligence.

    Everyone has the same tools, but not everyone is using them

    Perhaps the most striking finding in the OpenAI report is how little access explains. ChatGPT Enterprise is now deployed across more than 7 million workplace seats globally, a nine-fold increase from a year ago. The tools are the same for everyone. The capabilities are identical. And yet usage varies by orders of magnitude.

    Among monthly active users — people who have logged in at least once in the past 30 days — 19 percent have never tried the data analysis feature. Fourteen percent have never used reasoning capabilities. Twelve percent have never used search. These are not obscure features buried in submenus; they are core functionality that OpenAI highlights as transformative for knowledge work.

    The pattern inverts among daily users. Only 3 percent of people who use ChatGPT every day have never tried data analysis; just 1 percent have skipped reasoning or search. The implication is clear: the divide is not between those who have access and those who don't, but between those who have made AI a daily habit and those for whom it remains an occasional novelty.

    Employees who experiment more are saving dramatically more time

    The OpenAI report suggests that AI productivity gains are not evenly distributed across all users but concentrated among those who use the technology most intensively. Workers who engage across approximately seven distinct task types — data analysis, coding, image generation, translation, writing, and others — report saving five times as much time as those who use only four. Employees who save more than 10 hours per week consume eight times more AI credits than those who report no time savings at all.

    This creates a compounding dynamic. Workers who experiment broadly discover more uses. More uses lead to greater productivity gains. Greater productivity gains presumably lead to better performance reviews, more interesting assignments, and faster advancement—which in turn provides more opportunity and incentive to deepen AI usage further.

    Seventy-five percent of surveyed workers report being able to complete tasks they previously could not perform, including programming support, spreadsheet automation, and technical troubleshooting. For workers who have embraced these capabilities, the boundaries of their roles are expanding. For those who have not, the boundaries may be contracting by comparison.

    The corporate AI paradox: $40 billion spent, 95 percent seeing no return

    The individual usage gap documented by OpenAI mirrors a broader pattern identified by a separate study from MIT's Project NANDA. Despite $30 billion to $40 billion invested in generative AI initiatives, only 5 percent of organizations are seeing transformative returns. The researchers call this the "GenAI Divide" — a gap separating the few organizations that succeed in transforming processes with adaptive AI systems from the majority that remain stuck in pilots.

    The MIT report found limited disruption across industries: only two of nine major sectors—technology and media—show material business transformation from generative AI use. Large firms lead in pilot volume but lag in successful deployment.

    The pattern is consistent across both studies. Organizations and individuals are buying the technology. They are launching pilots. They are attending training sessions. But somewhere between adoption and transformation, most are getting stuck.

    While official AI projects stall, a shadow economy is thriving

    The MIT study reveals a striking disconnect: while only 40 percent of companies have purchased official LLM subscriptions, employees in over 90 percent of companies regularly use personal AI tools for work. Nearly every respondent reported using LLMs in some form as part of their regular workflow.

    "This 'shadow AI' often delivers better ROI than formal initiatives and reveals what actually works for bridging the divide," MIT's Project NANDA found.

    The shadow economy offers a clue to what's happening at the individual level within organizations. Employees who take initiative — who sign up for personal subscriptions, who experiment on their own time, who figure out how to integrate AI into their workflows without waiting for IT approval — are pulling ahead of colleagues who wait for official guidance that may never come.

    These shadow systems, largely unsanctioned, often deliver better performance and faster adoption than corporate tools. Worker sentiment reveals a preference for flexible, responsive tools — precisely the kind of experimentation that separates OpenAI's frontier workers from the median.

    The biggest gaps show up in technical work that used to require specialists

    The largest relative gaps between frontier and median workers appear in coding, writing, and analysis — precisely the task categories where AI capabilities have advanced most rapidly. Frontier workers are not just doing the same work faster; they appear to be doing different work entirely, expanding into technical domains that were previously inaccessible to them.

    Among ChatGPT Enterprise users outside of engineering, IT, and research, coding-related messages have grown 36 percent over the past six months. Someone in marketing or HR who learns to write scripts and automate workflows is becoming a categorically different employee than a peer who has not — even if they hold the same title and started with the same skills.

    The academic research on AI and productivity offers a complicated picture. Several studies cited in the OpenAI report find that AI has an "equalizing effect," disproportionately helping lower-performing workers close the gap with their higher-performing peers. But the equalizing effect may apply only within the population of workers who actually use AI regularly. A meaningful share of workers are not in that group at all. They remain light users or non-users, even as their more adventurous colleagues pull away.

    Companies are divided too, and the gap is widening by the month

    The divide is not only between individual workers. It exists between entire organizations.

    Frontier firms — those at the 95th percentile of adoption intensity — generate approximately twice as many AI messages per employee as the median enterprise. For messages routed through custom GPTs, purpose-built tools that automate specific workflows, the gap widens to seven-fold.

    These numbers suggest fundamentally different operating models. At median companies, AI may be a productivity tool that individual workers use at their discretion. At frontier firms, AI appears to be embedded in core infrastructure: standardized workflows, persistent custom tools, systematic integration with internal data systems.

    The OpenAI report notes that roughly one in four enterprises still has not enabled connectors that give AI access to company data—a basic step that dramatically increases the technology's utility. The MIT study found that companies that purchased AI tools from specialized vendors succeeded 67 percent of the time, while internal builds had only a one-in-three success rate. For many organizations, the AI era has technically arrived but has not yet begun in practice.

    The technology is no longer the problem — organizations are

    For executives, the data presents an uncomfortable challenge. The technology is no longer the constraint. OpenAI notes that it releases a new feature or capability roughly every three days; the models are advancing faster than most organizations can absorb. The bottleneck has shifted from what AI can do to whether organizations are structured to take advantage of it.

    "The dividing line isn't intelligence," the MIT authors write. The problems with enterprise AI have to do with memory, adaptability, and learning capability. Problems stem less from regulations or model performance, and more from tools that fail to learn or adapt.

    Leading firms, according to the OpenAI report, consistently invest in executive sponsorship, data readiness, workflow standardization, and deliberate change management. They build cultures where custom AI tools are created, shared, and refined across teams. They track performance and run evaluations. They make AI adoption a strategic priority rather than an individual choice.

    The rest are leaving it to chance — hoping that workers will discover the tools on their own, experiment on their own time, and somehow propagate best practices without infrastructure or incentive. The six-fold gap suggests this approach is not working.

    The window to catch up is closing faster than most companies realize

    With enterprise contracts locking in over the next 18 months, there's a shrinking window for vendors and adopters to cross the divide. The GenAI Divide identified by the MIT report is not going to last forever. But the organizations that figure out a way across it soonest will be the ones that define the next era of business.

    Both reports carry caveats. The OpenAI data comes from a company with an obvious interest in promoting AI adoption. The productivity figures are self-reported by customers already paying for the product. The MIT study, while independent, relies on interviews and surveys rather than direct measurement. The long-term effects of this technology on employment, wages, and workplace dynamics remain uncertain.

    But the core finding — that access alone does not produce adoption, and that adoption varies enormously even within organizations that have made identical tools available to all — is consistent with how previous technologies have diffused through the economy. Spreadsheets, email, and the internet all created similar divides before eventually becoming universal. The question is how long the current gap persists, who benefits during the transition, and what happens to workers who find themselves on the wrong side of it.

    For now, the divide is stark. Ninety percent of users said they prefer humans for "mission-critical work," while AI has "won the war for simple work." The workers who are pulling ahead are not doing so because they have access their colleagues lack. They are pulling ahead because they decided to use what everyone already has—and kept using it until they figured out what it could do.

    The 6x gap is not about technology. It is about behavior. And behavior, unlike software, cannot be deployed with a company-wide rollout.

  • The 70% factuality ceiling: why Google’s new ‘FACTS’ benchmark is a wake-up call for enterprise AI

    There's no shortage of generative AI benchmarks designed to measure the performance and accuracy of a given model on completing various helpful enterprise tasks — from coding to instruction following to agentic web browsing and tool use. But many of these benchmarks have one major shortcoming: they measure the AI's ability to complete specific problems and requests, not how factual the model is in its outputs — how well it generates objectively correct information tied to real-world data — especially when dealing with information contained in imagery or graphics.

    For industries where accuracy is paramount — legal, finance, and medical — the lack of a standardized way to measure factuality has been a critical blind spot.

    That changes today: Google’s FACTS team and its data science unit Kaggle released the FACTS Benchmark Suite, a comprehensive evaluation framework designed to close this gap.

    The associated research paper reveals a more nuanced definition of the problem, splitting "factuality" into two distinct operational scenarios: "contextual factuality" (grounding responses in provided data) and "world knowledge factuality" (retrieving information from memory or the web).

    While the headline news is Gemini 3 Pro’s top-tier placement, the deeper story for builders is the industry-wide "factuality wall."

    According to the initial results, no model — including Gemini 3 Pro, GPT-5, and Claude 4.5 Opus — managed to crack 70% accuracy across the suite of problems. For technical leaders, this is a signal: the era of "trust but verify" is far from over.

    Deconstructing the Benchmark

    The FACTS suite moves beyond simple Q&A. It is composed of four distinct tests, each simulating a different real-world failure mode that developers encounter in production:

    1. Parametric Benchmark (Internal Knowledge): Can the model accurately answer trivia-style questions using only its training data?

    2. Search Benchmark (Tool Use): Can the model effectively use a web search tool to retrieve and synthesize live information?

    3. Multimodal Benchmark (Vision): Can the model accurately interpret charts, diagrams, and images without hallucinating?

    4. Grounding Benchmark v2 (Context): Can the model stick strictly to the provided source text?

    Google has released 3,513 examples to the public, while Kaggle holds a private set to prevent developers from training on the test data—a common issue known as "contamination."

    The Leaderboard: A Game of Inches

    The initial run of the benchmark places Gemini 3 Pro in the lead with a comprehensive FACTS Score of 68.8%, followed by Gemini 2.5 Pro (62.1%) and OpenAI’s GPT-5 (61.8%). However, a closer look at the data reveals where the real battlegrounds are for engineering teams.

    | Model | FACTS Score (Avg) | Search (RAG Capability) | Multimodal (Vision) |
    | --- | --- | --- | --- |
    | Gemini 3 Pro | 68.8 | 83.8 | 46.1 |
    | Gemini 2.5 Pro | 62.1 | 63.9 | 46.9 |
    | GPT-5 | 61.8 | 77.7 | 44.1 |
    | Grok 4 | 53.6 | 75.3 | 25.7 |
    | Claude 4.5 Opus | 51.3 | 73.2 | 39.2 |

    Data sourced from the FACTS Team release notes.

    For Builders: The "Search" vs. "Parametric" Gap

    For developers building RAG (Retrieval-Augmented Generation) systems, the Search Benchmark is the most critical metric.

    The data shows a clear gap between a model's ability to "know" things (Parametric) and its ability to "find" things (Search). Gemini 3 Pro, for instance, scores 83.8% on Search tasks but only 76.4% on Parametric tasks.

    This validates the current enterprise architecture standard: do not rely on a model's internal memory for critical facts.

    If you are building an internal knowledge bot, the FACTS results suggest that hooking your model up to a search tool or vector database is not optional—it is the only way to push accuracy toward acceptable production levels.
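    The grounding pattern these results argue for — retrieve authoritative text first, then constrain the model to it — can be sketched in a few lines. The toy keyword retriever, document contents, and prompt wording below are illustrative assumptions, not part of the FACTS suite or any specific vendor's API:

    ```python
    # Sketch of retrieval-grounded prompting: fetch relevant documents, then
    # instruct the model to answer ONLY from that context. In production the
    # retriever would be a search tool or vector database; here it is a naive
    # keyword-overlap scorer over a hypothetical in-memory knowledge base.

    KNOWLEDGE_BASE = {
        "refund-policy": "Refunds are issued within 14 days of purchase.",
        "shipping": "Standard shipping takes 3-5 business days.",
        "warranty": "Hardware is covered by a two-year limited warranty.",
    }

    def retrieve(query: str, k: int = 1) -> list[str]:
        """Rank documents by keyword overlap with the query (toy scorer)."""
        terms = set(query.lower().split())
        scored = sorted(
            KNOWLEDGE_BASE.values(),
            key=lambda doc: len(terms & set(doc.lower().split())),
            reverse=True,
        )
        return scored[:k]

    def build_grounded_prompt(query: str) -> str:
        """Assemble a prompt that pins the model to the retrieved context."""
        context = "\n".join(retrieve(query))
        return (
            "Answer using ONLY the context below. "
            "If the answer is not in the context, say you don't know.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}"
        )

    print(build_grounded_prompt("How long does standard shipping take?"))
    ```

    The final prompt string is what would be sent to the model; the explicit "don't know" escape hatch is what the Grounding sub-benchmark effectively measures.
    
    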

    The Multimodal Warning

    The most alarming data point for product managers is the performance on Multimodal tasks. The scores here are universally low. Even the category leader, Gemini 2.5 Pro, only hit 46.9% accuracy.

    The benchmark tasks included reading charts, interpreting diagrams, and identifying objects in nature. With less than 50% accuracy across the board, this suggests that Multimodal AI is not yet ready for unsupervised data extraction.

    Bottom line: If your product roadmap involves having an AI automatically scrape data from invoices or interpret financial charts without human-in-the-loop review, you are likely introducing significant error rates into your pipeline.

    Why This Matters for Your Stack

    The FACTS Benchmark is likely to become a standard reference point for procurement. When evaluating models for enterprise use, technical leaders should look beyond the composite score and drill into the specific sub-benchmark that matches their use case:

    • Building a Customer Support Bot? Look at the Grounding score to ensure the bot sticks to your policy documents. (Gemini 2.5 Pro actually outscored Gemini 3 Pro here, 74.2 vs 69.0).

    • Building a Research Assistant? Prioritize Search scores.

    • Building an Image Analysis Tool? Proceed with extreme caution.

    As the FACTS team noted in their release, "All evaluated models achieved an overall accuracy below 70%, leaving considerable headroom for future progress." For now, the message to the industry is clear: The models are getting smarter, but they aren't yet infallible. Design your systems with the assumption that, roughly one-third of the time, the raw model might just be wrong.

  • Mistral launches powerful Devstral 2 coding model including open source, laptop-friendly version

    French AI startup Mistral has weathered a rocky year of public questioning to emerge, in December 2025, with new, crowd-pleasing models for enterprise and indie developers.

    Just days after releasing its powerful open source, general purpose Mistral 3 LLM family for edge devices and local hardware, the company returned today to debut Devstral 2.

    The release includes a new pair of models optimized for software engineering tasks — again, with one small enough to run on a single laptop, offline and privately — alongside Mistral Vibe, a command-line interface (CLI) agent designed to allow developers to call the models up directly within their terminal environments.

    The models are fast, lean, and open—at least in theory. But the real story lies not just in the benchmarks, but in how Mistral is packaging this capability: one model fully free, another conditionally so, and a terminal interface built to scale with either.

    It’s an attempt not just to match proprietary systems like Claude and GPT-4 in performance, but to compete with them on developer experience—and to do so while holding onto the flag of open-source.

    Both models are available now for free for a limited time via Mistral’s API and Hugging Face.

    The full Devstral 2 model is supported out-of-the-box in the open source inference engine vLLM and on the open source agentic coding platform Kilo Code.

    A Coding Model Meant to Drive

    At the top of the announcement is Devstral 2, a 123-billion parameter dense transformer with a 256K-token context window, engineered specifically for agentic software development.

    Mistral says the model achieves 72.2% on SWE-bench Verified, a benchmark designed to evaluate long-context software engineering tasks in real-world repositories.

    The smaller sibling, Devstral Small 2, weighs in at 24B parameters, with the same long context window and a performance of 68.0% on SWE-bench.

    On paper, that makes it the strongest open-weight model of its size, even outscoring many 70B-class competitors.

    But the performance story isn’t just about raw percentages. Mistral is betting that efficient intelligence beats scale, and has made much of the fact that Devstral 2 is:

    • 5× smaller than DeepSeek V3.2

    • 8× smaller than Kimi K2

    • Yet still matches or surpasses them on key software reasoning benchmarks.

    Human evaluations back this up. In side-by-side comparisons:

    • Devstral 2 beat DeepSeek V3.2 in 42.8% of tasks, losing only 28.6% of the time.

    • Against Claude Sonnet 4.5, it lost more often (53.1%)—a reminder that while the gap is narrowing, closed models still lead in overall preference.

    Still, for an open-weight model, these results place Devstral 2 at the frontier of what’s currently available to run and modify independently.

    Vibe CLI: A Terminal-Native Agent

    Alongside the models, Mistral released Vibe CLI, a command-line assistant that integrates directly with Devstral models. It’s not an IDE plugin or a ChatGPT-style code explainer. It’s a native interface designed for project-wide code understanding and orchestration, built to live inside the developer’s actual workflow.

    Vibe brings a surprising degree of intelligence to the terminal:

    • It reads your file tree and Git status to understand project scope.

    • It lets you reference files with @, run shell commands with !, and toggle behavior with slash commands.

    • It orchestrates changes across multiple files, tracks dependencies, retries failed executions, and can even refactor at architectural scale.

    Unlike most developer agents, which simulate a REPL from within a chat UI, Vibe starts with the shell and pulls intelligence in from there. It’s programmable, scriptable, and themeable. And it’s released under the Apache 2.0 license, meaning it’s truly free to use—in commercial settings, internal tools, or open-source extensions.

    Licensing Structure: Open-ish — With Revenue Limitations

    At first glance, Mistral’s licensing approach appears straightforward: the models are open-weight and publicly available. But a closer look reveals a line drawn through the middle of the release, with different rules for different users.

    Devstral Small 2, the 24-billion parameter variant, is covered under a standard, enterprise- and developer-friendly Apache 2.0 license.

    That’s a gold standard in open-source: no revenue restrictions, no fine print, no need to check with legal. Enterprises can use it in production, embed it into products, and redistribute fine-tuned versions without asking for permission.

    Devstral 2, the flagship 123B model, is released under what Mistral calls a “modified MIT license.” That phrase sounds innocuous, but the modification introduces a critical limitation: any company making more than $20 million in monthly revenue cannot use the model at all—not even internally—without securing a separate commercial license from Mistral.

    “You are not authorized to exercise any rights under this license if the global consolidated monthly revenue of your company […] exceeds $20 million,” the license reads.

    The clause applies not only to the base model, but to derivatives, fine-tuned versions, and redistributed variants, regardless of who hosts them. In effect, it means that while the weights are “open,” their use is gated for large enterprises—unless they’re willing to engage with Mistral’s sales team or use the hosted API at metered pricing.

    To draw an analogy: Apache 2.0 is like a public library—you walk in, borrow the book, and use it however you need. Mistral’s modified MIT license is more like a corporate co-working space that’s free for freelancers but charges rent once your company hits a certain size.

    Weighing Devstral Small 2 for Enterprise Use

    This division raises an obvious question for larger companies: can Devstral Small 2, with its permissive Apache 2.0 license, serve as a viable alternative for medium-to-large enterprises?

    The answer depends on context. Devstral Small 2 scores 68.0% on SWE-bench, significantly ahead of many larger open models, and remains deployable on single-GPU or CPU-only setups. For teams focused on:

    • internal tooling,

    • on-prem deployment,

    • low-latency edge inference,

      …it offers a rare combination of legality, performance, and convenience.

    But the performance gap from Devstral 2 is real. For multi-agent setups, deep monorepo refactoring, or long-context code analysis, that 4-point benchmark delta may understate the actual experience difference.

    For most enterprises, Devstral Small 2 will serve either as a low-friction way to prototype—or as a pragmatic bridge until licensing for Devstral 2 becomes feasible. It is not a drop-in replacement for the flagship, but it may be “good enough” in specific production slices, particularly when paired with Vibe CLI.

    Because Devstral Small 2 can be run entirely offline, including on a single-GPU machine or a sufficiently specced laptop, it unlocks a critical use case for developers and teams operating in tightly controlled environments.

    Whether you’re a solo indie building tools on the go, or part of a company with strict data governance or compliance mandates, the ability to run a performant, long-context coding model without ever hitting the internet is a powerful differentiator. No cloud calls, no third-party telemetry, no risk of data leakage — just local inference with full visibility and control.

    This matters in industries like finance, healthcare, defense, and advanced manufacturing, where data often cannot leave the network perimeter. But it’s just as useful for developers who prefer autonomy over vendor lock-in — or who want their tools to work the same on a plane, in the field, or inside an air-gapped lab. In a market where most top-tier code models are delivered as API-only SaaS products, Devstral Small 2 offers a rare level of portability, privacy, and ownership.

    In that sense, Mistral isn’t just offering open models—they’re offering multiple paths to adoption, depending on your scale, compliance posture, and willingness to engage.

    Integration, Infrastructure, and Access

    From a technical standpoint, Mistral’s models are built for deployment. Devstral 2 requires a minimum of 4× H100-class GPUs, and is already available on build.nvidia.com.

    Devstral Small 2 can run on a single GPU, or even on a CPU like the one in a standard laptop, making it accessible to solo developers and embedded teams alike.

    Both models support quantized FP4 and FP8 weights, and are compatible with vLLM for scalable inference. Fine-tuning is supported out of the box.

    API pricing—after the free introductory window—follows a token-based structure:

    • Devstral 2: $0.40 per million input tokens / $2.00 for output

    • Devstral Small 2: $0.10 per million input tokens / $0.30 for output

    That pricing sits just below OpenAI’s GPT-4 Turbo, and well below Anthropic’s Claude Sonnet at comparable performance levels.
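    At these rates, per-request costs are easy to estimate. The snippet below is a back-of-the-envelope calculator using the published numbers; the model keys and the example token counts are illustrative assumptions, not Mistral API identifiers:

    ```python
    # Cost estimator for the listed API rates (USD per million tokens).
    # The dictionary keys are illustrative names, not official model IDs.

    PRICING = {
        "devstral-2":       {"input": 0.40, "output": 2.00},
        "devstral-small-2": {"input": 0.10, "output": 0.30},
    }

    def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
        """USD cost of one request at the published per-million-token rates."""
        rates = PRICING[model]
        return (input_tokens * rates["input"]
                + output_tokens * rates["output"]) / 1_000_000

    # A large agentic request: 200K tokens of repo context in, 8K tokens out.
    print(round(request_cost("devstral-2", 200_000, 8_000), 4))        # 0.096
    print(round(request_cost("devstral-small-2", 200_000, 8_000), 4))  # 0.0224
    ```

    Even near the 256K context ceiling, a single flagship request stays under ten cents, which is the kind of arithmetic that makes the pricing comparison below concrete.
    
    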

    Developer Reception: Ground-Level Buzz

    On X (formerly Twitter), developers reacted quickly with a wave of positive reception. Hugging Face's Head of Product Victor Mustar asked if the small, Apache 2.0 licensed variant was the "new local coding king," i.e., the model developers could run directly and privately on their laptops, without an internet connection:

    Another popular AI news and rumors account, TestingCatalogNews, posted that it was "SOTTA in coding," or "State Of The Tiny Art."

    Another user, @xlr8harder, took issue with the custom licensing terms for Devstral 2, writing "calling the Devstral 2 license 'modified MIT' is misleading at best. It’s a proprietary license with MIT-like attribution requirements."

    While the tone was critical, it reflected the scrutiny Mistral's license structure was receiving, particularly among developers familiar with open-use norms.

    Strategic Context: From Codestral to Devstral and Mistral 3

    Mistral’s steady push into software development tools didn’t start with Devstral 2—it began in May 2024 with Codestral, the company’s first code-focused large language model. A 22-billion parameter system trained on more than 80 programming languages, Codestral was designed for use in developer environments ranging from basic autocompletions to full function generation. The model launched under a non-commercial license but still outperformed heavyweight competitors like CodeLlama 70B and Deepseek Coder 33B in early benchmarks such as HumanEval and RepoBench.

    Codestral’s release marked Mistral’s first move into the competitive coding-model space, but it also established a now-familiar pattern: technically lean models with surprisingly strong results, a wide context window, and licensing choices that invited developer experimentation. Industry partners including JetBrains, LlamaIndex, and LangChain quickly began integrating the model into their workflows, citing its speed and tool compatibility as key differentiators.

    One year later, the company followed up with Devstral, a 24B model purpose-built for “agentic” behavior—handling long-range reasoning, file navigation, and autonomous code modification. Released in partnership with All Hands AI and licensed under Apache 2.0, Devstral was notable not just for its portability (it could run on a MacBook or RTX 4090), but for its performance: it beat out several closed models on SWE-Bench Verified, a benchmark of 500 real-world GitHub issues.

    Then came Mistral 3, announced in December 2025 as a portfolio of 10 open-weight models targeting everything from drones and smartphones to cloud infrastructure. This suite included both high-end models like Mistral Large 3 (a MoE system with 41B active parameters and 256K context) and lightweight “Ministral” variants that could run on 4GB of VRAM. All were licensed under Apache 2.0, reinforcing Mistral’s commitment to flexible, edge-friendly deployment.

    Mistral 3 positioned the company not as a direct competitor to frontier models like GPT-5 or Gemini 3, but as a developer-first platform for customized, localized AI systems. Co-founder Guillaume Lample described the vision as “distributed intelligence”—many smaller systems tuned for specific tasks and running outside centralized infrastructure. “In more than 90% of cases, a small model can do the job,” he told VentureBeat. “It doesn’t have to be a model with hundreds of billions of parameters.”

    That broader strategy helps explain the significance of Devstral 2. It’s not a one-off release but a continuation of Mistral’s long-running commitment to code agents, local-first deployment, and open-weight availability—an ecosystem that began with Codestral, matured through Devstral, and scaled up with Mistral 3. Devstral 2, in this framing, is not just a model. It’s the next version of a playbook that’s been unfolding in public for over a year.

    Final Thoughts (For Now): A Fork in the Road

    With Devstral 2, Devstral Small 2, and Vibe CLI, Mistral AI has drawn a clear map for developers and companies alike. The tools are fast, capable, and thoughtfully integrated. But they also present a choice—not just in architecture, but in how and where you’re allowed to use them.

    If you’re an individual developer, small startup, or open-source maintainer, this is one of the most powerful AI systems you can freely run today.

    If you’re a Fortune 500 engineering lead, you’ll need to either talk to Mistral—or settle for the smaller model and make it work.

    In a market increasingly dominated by black-box models and SaaS lock-ins, Mistral’s offer is still a breath of fresh air. Just read the fine print before you start building.

  • The AI that scored 95% — until consultants learned it was AI

    Presented by SAP


    When SAP ran a quiet internal experiment to gauge consultant attitudes toward AI, the results were striking. Five teams were asked to validate answers to more than 1,000 business requirements completed by SAP’s AI co-pilot, Joule for Consultants — a workload that would normally take several weeks.

    Four teams were told the analysis had been completed by junior interns fresh out of school. They reviewed the material, found it impressive, and rated the work about 95% accurate.

    The fifth team was told the very same answers had come from AI.

    They rejected almost everything.

    Only when asked to validate each answer one by one did they discover that the AI was, in fact, highly accurate — surfacing detailed insights the consultants had initially dismissed. The overall accuracy? Again, about 95%.

    “The lesson learned here is that we need to be very cautious as we introduce AI — especially in how we communicate with senior consultants about its possibilities and how to integrate it into their workflows,” says Guillermo B. Vazquez Mendez, chief architect, RI business transformation and architecture, SAP America Inc.

    The experiment has since become a revealing starting point for SAP’s push toward the consultant of 2030: a practitioner who is deeply human, enabled by AI, and no longer weighed down by the technical grunt work of the past.

    Overcoming AI skepticism

    Resistance isn’t surprising, Vazquez notes. Consultants with two or three decades of experience carry enormous institutional knowledge — and an understandable degree of caution.

    But AI copilots like Joule for Consultants are not replacing expertise. They’re amplifying it.

    “What Joule really does is make their very expensive time far more effective,” Vazquez says. “It removes the clerical work, so they can focus on turning out high-quality answers in a fraction of the time.”

    He emphasizes this message constantly: “AI is not replacing you. It’s a tool for you. Human oversight is always required. But now, instead of spending your time looking for documentation, you’re gaining significant time and boosting the effectiveness and detail of your answers.”

    The consultant time-shift: from tech execution to business insight

    Historically, consultants spent about 80% of their time understanding technical systems — how processes run, how data flows, how functions execute. Customers, by contrast, spend 80% of their time focused on their business.

    That mismatch is exactly where Joule steps in.

    “There’s a gap there — and the bridge is AI,” Vazquez says. “It flips the time equation, enabling consultants to invest more of their energy in understanding the customer’s industry and business goals. AI takes on the heavy technical lift, so consultants can focus on driving the right business outcomes.”

    Bringing new consultants up to speed

    AI is also transforming how new hires learn.

    “We’re excited to see Joule acting as a bridge between senior consultants, who are adapting more slowly, and interns and new consultants who are already technically savvy,” Vazquez says.

    Junior consultants ramp up faster because Joule helps them operate independently. Seniors, meanwhile, engage where their insight matters most.

    This is also where many consultants learn the fundamentals of today’s AI copilots. Much of the work depends on prompt engineering — for instance, instructing Joule to act as a senior chief technology architect specializing in finance and SAP S/4HANA 2023, then asking it to analyze business requirements and deliver the output as tables or PowerPoint slides.

    Once they grasp how to frame prompts, consultants consistently get higher-quality, more structured answers.
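The role-plus-task-plus-format framing described above can be sketched as a simple template. This is illustrative only: Joule's actual prompt interface is not public, so the function and its fields are assumptions, not SAP's API.

```python
# Illustrative only: a role-framed prompt template in the style described
# above. The exact interface is an assumption, not SAP's Joule API.

def build_consultant_prompt(role: str, task: str, output_format: str) -> str:
    """Frame a copilot request with a role, a task, and an output format."""
    return (
        f"Act as a {role}.\n"
        f"Task: {task}\n"
        f"Deliver the output as {output_format}."
    )

prompt = build_consultant_prompt(
    role="senior chief technology architect specializing in finance and SAP S/4HANA 2023",
    task="analyze the attached business requirements and propose a migration approach",
    output_format="tables suitable for a PowerPoint slide",
)
print(prompt)
```

The value of the template is consistency: once the role, task, and output format are always stated explicitly, answers come back in a predictable, reviewable shape.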

    New architects are also able to communicate more clearly with their more experienced counterparts. They know what they don’t know and can ask targeted questions, which makes mentorship far smoother. It’s created a real synergy, Vazquez adds — senior consultants see how quickly new hires are adapting and learning with AI, and that momentum encourages them to keep pace and adopt the technology themselves.

    Looking ahead to the future of AI copilots

    “We’re still in the baby steps of AI — we’re toddlers,” Vazquez says. “Right now, copilots depend on prompt engineering to get good answers. The better you prompt, the better the answer you get.”

    But that represents only the earliest phase of what these systems will eventually do. As copilots mature, they’ll move beyond responding to prompts and start interpreting entire business processes — understanding the sequence of steps, identifying where human intervention is needed, and spotting where an AI agent could take over. That shift is what leads directly into agentic AI.

    SAP’s depth of process knowledge is what makes that evolution possible. The company has mapped more than 3,500 business processes across industries — a repository Vazquez calls “some of the most valuable, rigorously tested processes developed in the last 50 years.” Every day, SAP systems support roughly $7.3 trillion in global commerce, giving these emerging AI agents a rich foundation to navigate and reason over.

    “With that level of process insight and data, we can take a real leap forward,” he says, “equipping our consultants with agentic AI that can solve complex challenges and push us toward increasingly autonomous systems.”


    Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.

  • Z.ai debuts open source GLM-4.6V, a native tool-calling vision model for multimodal reasoning

    Chinese AI startup Zhipu AI, also known as Z.ai, has released its GLM-4.6V series, a new generation of open-source vision-language models (VLMs) optimized for multimodal reasoning, frontend automation, and high-efficiency deployment.

    The release includes two models in "large" and "small" sizes:

    1. GLM-4.6V (106B), a larger 106-billion parameter model aimed at cloud-scale inference

    2. GLM-4.6V-Flash (9B), a smaller model of only 9 billion parameters designed for low-latency, local applications

    Generally speaking, models with more parameters — the internal settings, i.e. weights and biases, that govern their behavior — are more powerful and perform at a higher general level across more varied tasks.

    However, smaller models can offer better efficiency for edge or real-time applications where latency and resource constraints are critical.

    The defining innovation in this series is the introduction of native function calling in a vision-language model—enabling direct use of tools such as search, cropping, or chart recognition with visual inputs.

    With a 128,000-token context length (roughly a 300-page novel's worth of text exchanged in a single input/output interaction) and state-of-the-art (SoTA) results across more than 20 benchmarks, the GLM-4.6V series positions itself as a highly competitive alternative to both closed and open-source VLMs.

    Licensing and Enterprise Use

    GLM‑4.6V and GLM‑4.6V‑Flash are distributed under the MIT license, a permissive open-source license that allows free commercial and non-commercial use, modification, redistribution, and local deployment without obligation to open-source derivative works.

    This licensing model makes the series suitable for enterprise adoption, including scenarios that require full control over infrastructure, compliance with internal governance, or air-gapped environments.

    Model weights and documentation are publicly hosted on Hugging Face, with supporting code and tooling available on GitHub.

    The MIT license ensures maximum flexibility for integration into proprietary systems, including internal tools, production pipelines, and edge deployments.

    Architecture and Technical Capabilities

    The GLM-4.6V models follow a conventional encoder-decoder architecture with significant adaptations for multimodal input.

    Both models incorporate a Vision Transformer (ViT) encoder—based on AIMv2-Huge—and an MLP projector to align visual features with a large language model (LLM) decoder.

    Video inputs benefit from 3D convolutions and temporal compression, while spatial encoding is handled using 2D-RoPE and bicubic interpolation of absolute positional embeddings.

    A key technical feature is the system’s support for arbitrary image resolutions and aspect ratios, including wide panoramic inputs up to 200:1.

    In addition to static image and document parsing, GLM-4.6V can ingest temporal sequences of video frames with explicit timestamp tokens, enabling robust temporal reasoning.

    On the decoding side, the model supports token generation aligned with function-calling protocols, allowing for structured reasoning across text, image, and tool outputs. This is supported by extended tokenizer vocabulary and output formatting templates to ensure consistent API or agent compatibility.

    Native Multimodal Tool Use

    GLM-4.6V introduces native multimodal function calling, allowing visual assets—such as screenshots, images, and documents—to be passed directly as parameters to tools. This eliminates the need for intermediate text-only conversions, which have historically introduced information loss and complexity.

    The tool invocation mechanism works bi-directionally:

    • Input tools can be passed images or videos directly (e.g., document pages to crop or analyze).

    • Output tools such as chart renderers or web snapshot utilities return visual data, which GLM-4.6V integrates directly into the reasoning chain.

    In practice, this means GLM-4.6V can complete tasks such as:

    • Generating structured reports from mixed-format documents

    • Performing visual audits of candidate images

    • Automatically cropping figures from papers during generation

    • Conducting visual web search and answering multimodal queries
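To make "native multimodal function calling" concrete, the sketch below shows the kind of request shape it implies: the visual asset itself is a tool parameter, with no intermediate image-to-text conversion. The tool name and field names are assumptions for illustration, not Z.ai's documented schema.

```python
# Illustrative sketch of a multimodal tool call: the image is passed to
# the tool directly as a parameter. Names and fields are assumptions,
# not Z.ai's documented schema.

crop_tool = {
    "name": "crop_image",
    "description": "Crop a region from an input image and return it.",
    "parameters": {
        "type": "object",
        "properties": {
            "image_url": {"type": "string", "description": "Source image"},
            "bbox": {
                "type": "array",
                "items": {"type": "number"},
                "description": "Region as [x1, y1, x2, y2]",
            },
        },
        "required": ["image_url", "bbox"],
    },
}

# A call the model might emit mid-reasoning. Because the argument is the
# visual asset itself, no lossy text transcription step is needed.
tool_call = {
    "tool": "crop_image",
    "arguments": {
        "image_url": "https://example.com/paper_page_3.png",
        "bbox": [120, 340, 860, 720],
    },
}
```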

    High Benchmark Performance Compared to Similar-Sized Models

    GLM-4.6V was evaluated across more than 20 public benchmarks covering general VQA, chart understanding, OCR, STEM reasoning, frontend replication, and multimodal agents.

    According to the benchmark chart released by Zhipu AI:

    • GLM-4.6V (106B) achieves SoTA or near-SoTA scores among open-source models of comparable size (106B) on MMBench, MathVista, MMLongBench, ChartQAPro, RefCOCO, TreeBench, and more.

    • GLM-4.6V-Flash (9B) outperforms other lightweight models (e.g., Qwen3-VL-8B, GLM-4.1V-9B) across almost all categories tested.

    • The 106B model’s 128K-token window allows it to outperform larger models like Step-3 (321B) and Qwen3-VL-235B on long-context document tasks, video summarization, and structured multimodal reasoning.

    Example scores from the leaderboard include:

    • MathVista: 88.2 (GLM-4.6V) vs. 84.6 (GLM-4.5V) vs. 81.4 (Qwen3-VL-8B)

    • WebVoyager: 81.0 vs. 68.4 (Qwen3-VL-8B)

    • Ref-L4-test: 88.9 vs. 89.5 (GLM-4.5V), but with better grounding fidelity at 87.7 (Flash) vs. 86.8

    Both models were evaluated using the vLLM inference backend and support SGLang for video-based tasks.

    Frontend Automation and Long-Context Workflows

    Zhipu AI emphasized GLM-4.6V’s ability to support frontend development workflows. The model can:

    • Replicate pixel-accurate HTML/CSS/JS from UI screenshots

    • Accept natural language editing commands to modify layouts

    • Identify and manipulate specific UI components visually

    This capability is integrated into an end-to-end visual programming interface, where the model iterates on layout, design intent, and output code using its native understanding of screen captures.

    In long-document scenarios, GLM-4.6V can process up to 128,000 tokens—enabling a single inference pass across:

    • 150 pages of text (input)

    • 200 slide decks

    • 1-hour videos

    Zhipu AI reported successful use of the model in financial analysis across multi-document corpora and in summarizing full-length sports broadcasts with timestamped event detection.
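A rough back-of-envelope check shows why 150 pages of text fits a 128K window. The per-page token count below is a common ballpark assumption, not a figure from Zhipu AI.

```python
# Back-of-envelope check on the 128K-token single-pass claim. The
# per-page estimate (~850 tokens per text page) is a rough assumption.

CONTEXT_WINDOW = 128_000

tokens_per_text_page = 850
pages = 150
text_estimate = pages * tokens_per_text_page
print(text_estimate)  # 127500, just under the 128K window
```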

    Training and Reinforcement Learning

    The model was trained using multi-stage pre-training followed by supervised fine-tuning (SFT) and reinforcement learning (RL). Key innovations include:

    • Curriculum Sampling (RLCS): Dynamically adjusts the difficulty of training samples based on model progress

    • Multi-domain reward systems: Task-specific verifiers for STEM, chart reasoning, GUI agents, video QA, and spatial grounding

    • Function-aware training: Uses structured tags (e.g., <think>, <answer>, <|begin_of_box|>) to align reasoning and answer formatting

    The reinforcement learning pipeline emphasizes verifiable rewards (RLVR) over human feedback (RLHF) for scalability, and avoids KL/entropy losses to stabilize training across multimodal domains.
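A minimal sketch of how the structured tags mentioned above can be split out of a model response. The closing-tag conventions (`</think>`, `</answer>`, `<|end_of_box|>`) are assumptions based on common practice; the paper only names the opening tags.

```python
import re

# Minimal parser for structured model output. Closing-tag names are
# assumed; only the opening tags are named in the source.

def parse_tagged_output(text: str) -> dict:
    """Extract reasoning, answer, and boxed final value from a response."""
    def grab(open_tag, close_tag):
        m = re.search(re.escape(open_tag) + r"(.*?)" + re.escape(close_tag), text, re.S)
        return m.group(1).strip() if m else None

    return {
        "think": grab("<think>", "</think>"),
        "answer": grab("<answer>", "</answer>"),
        "boxed": grab("<|begin_of_box|>", "<|end_of_box|>"),
    }

sample = "<think>Area = 3 * 4</think><answer>The area is <|begin_of_box|>12<|end_of_box|>.</answer>"
print(parse_tagged_output(sample)["boxed"])  # 12
```

Separating reasoning from the boxed answer this way is what makes verifiable rewards practical: a checker only needs to compare the boxed span against ground truth.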

    Pricing (API)

    Zhipu AI offers competitive pricing for the GLM-4.6V series, with both the flagship model and its lightweight variant positioned for high accessibility.

    • GLM-4.6V: $0.30 (input) / $0.90 (output) per 1M tokens

    • GLM-4.6V-Flash: Free

    Compared to major vision-capable and text-first LLMs, GLM-4.6V is among the most cost-efficient for multimodal reasoning at scale. Below is a comparative snapshot of pricing across providers:

    USD per 1M tokens — sorted lowest → highest total cost

    Model | Input | Output | Total Cost | Source
    Qwen 3 Turbo | $0.05 | $0.20 | $0.25 | Alibaba Cloud
    ERNIE 4.5 Turbo | $0.11 | $0.45 | $0.56 | Qianfan
    Grok 4.1 Fast (reasoning) | $0.20 | $0.50 | $0.70 | xAI
    Grok 4.1 Fast (non-reasoning) | $0.20 | $0.50 | $0.70 | xAI
    deepseek-chat (V3.2-Exp) | $0.28 | $0.42 | $0.70 | DeepSeek
    deepseek-reasoner (V3.2-Exp) | $0.28 | $0.42 | $0.70 | DeepSeek
    GLM‑4.6V | $0.30 | $0.90 | $1.20 | Z.AI
    Qwen 3 Plus | $0.40 | $1.20 | $1.60 | Alibaba Cloud
    ERNIE 5.0 | $0.85 | $3.40 | $4.25 | Qianfan
    Qwen-Max | $1.60 | $6.40 | $8.00 | Alibaba Cloud
    GPT-5.1 | $1.25 | $10.00 | $11.25 | OpenAI
    Gemini 2.5 Pro (≤200K) | $1.25 | $10.00 | $11.25 | Google
    Gemini 3 Pro (≤200K) | $2.00 | $12.00 | $14.00 | Google
    Gemini 2.5 Pro (>200K) | $2.50 | $15.00 | $17.50 | Google
    Grok 4 (0709) | $3.00 | $15.00 | $18.00 | xAI
    Gemini 3 Pro (>200K) | $4.00 | $18.00 | $22.00 | Google
    Claude Opus 4.1 | $15.00 | $75.00 | $90.00 | Anthropic
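The per-1M-token rates above translate into workload costs with simple arithmetic. The sketch below uses a few of the article's quoted prices (rates change; check each provider for current figures).

```python
# Small cost calculator using per-1M-token prices quoted in the article
# (as of publication; check providers for current rates).

PRICES = {  # model: (input $/1M tokens, output $/1M tokens)
    "GLM-4.6V": (0.30, 0.90),
    "Qwen 3 Turbo": (0.05, 0.20),
    "GPT-5.1": (1.25, 10.00),
    "Claude Opus 4.1": (15.00, 75.00),
}

def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Total cost in USD for a given token workload."""
    in_rate, out_rate = PRICES[model]
    return round(in_rate * input_tokens / 1e6 + out_rate * output_tokens / 1e6, 4)

# Example: a workload of 10M input tokens and 2M output tokens.
print(cost_usd("GLM-4.6V", 10_000_000, 2_000_000))  # 4.8
```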

    Previous Releases: GLM‑4.5 Series and Enterprise Applications

    Prior to GLM‑4.6V, Z.ai released the GLM‑4.5 family in mid-2025, establishing the company as a serious contender in open-source LLM development.

    The flagship GLM‑4.5 and its smaller sibling GLM‑4.5‑Air both support reasoning, tool use, coding, and agentic behaviors, while offering strong performance across standard benchmarks.

    The models introduced dual reasoning modes (“thinking” and “non-thinking”) and could automatically generate complete PowerPoint presentations from a single prompt — a feature positioned for use in enterprise reporting, education, and internal comms workflows. Z.ai also extended the GLM‑4.5 series with additional variants such as GLM‑4.5‑X, AirX, and Flash, targeting ultra-fast inference and low-cost scenarios.

    Together, these features position the GLM‑4.5 series as a cost-effective, open, and production-ready alternative for enterprises needing autonomy over model deployment, lifecycle management, and integration pipelines.

    Ecosystem Implications

    The GLM-4.6V release represents a notable advance in open-source multimodal AI. While large vision-language models have proliferated over the past year, few offer:

    • Integrated visual tool usage

    • Structured multimodal generation

    • Agent-oriented memory and decision logic

    Zhipu AI’s emphasis on “closing the loop” from perception to action via native function calling marks a step toward agentic multimodal systems.

    The model’s architecture and training pipeline show a continued evolution of the GLM family, positioning it competitively alongside offerings like OpenAI’s GPT-4V and Google DeepMind’s Gemini-VL.

    Takeaway for Enterprise Leaders

    With GLM-4.6V, Zhipu AI introduces an open-source VLM capable of native visual tool use, long-context reasoning, and frontend automation. It sets new performance marks among models of similar size and provides a scalable platform for building agentic, multimodal AI systems.

  • Anthropic’s Claude Code can now read your Slack messages and write code for you

    Anthropic on Monday launched a beta integration that connects its fast-growing Claude Code programming agent directly to Slack, allowing software engineers to delegate coding tasks without leaving the workplace messaging platform where much of their daily communication already happens.

    The release, which Anthropic describes as a "research preview," is the AI safety company's latest move to embed its technology deeper into enterprise workflows — and comes as Claude Code has emerged as a surprise revenue engine, generating over $1 billion in annualized revenue just six months after its public debut in May.

    "The critical context around engineering work often lives in Slack, including bug reports, feature requests, and engineering discussion," the company wrote in its announcement blog post. "When a bug report appears or a teammate needs a code fix, you can now tag Claude in Slack to automatically spin up a Claude Code session using the surrounding context."

    From bug report to pull request: how the new Slack integration actually works

    The mechanics are deceptively simple but address a persistent friction point in software development: the gap between where problems get discussed and where they get fixed.

    When a user mentions @Claude in a Slack channel or thread, Claude analyzes the message to determine whether it constitutes a coding task. If it does, the system automatically creates a new Claude Code session. Users can also explicitly instruct Claude to treat requests as coding tasks.

    Claude gathers context from recent channel and thread messages in Slack to feed into the Claude Code session. It will use this context to automatically choose which repository to run the task on based on the repositories you've authenticated to Claude Code on the web.

    As the Claude Code session progresses, Claude posts status updates back to the Slack thread. Once complete, users receive a link to the full session where they can review changes, along with a direct link to open a pull request.

    The feature builds on Anthropic's existing Claude for Slack integration and requires users to have access to Claude Code on the web. In practical terms, a product manager reporting a bug in Slack could tag Claude, which would then analyze the conversation context, identify the relevant code repository, investigate the issue, propose a fix, and post a pull request—all while updating the original Slack thread with its progress.
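The dispatch flow described above can be sketched in a few lines. This is a hypothetical stand-in: in the real integration Claude itself decides whether a mention is a coding task, so the keyword heuristic and function names here are illustrative only.

```python
# Hypothetical sketch of the @Claude dispatch flow. In the real system,
# Claude classifies the mention; this keyword check is a toy stand-in.

def looks_like_coding_task(message: str) -> bool:
    """Toy stand-in for Claude's task classification."""
    keywords = ("bug", "fix", "error", "refactor", "implement", "crash")
    return any(k in message.lower() for k in keywords)

def handle_mention(message: str, thread_context: list) -> dict:
    """Route an @Claude mention: spin up a session or just reply."""
    if not looks_like_coding_task(message):
        return {"action": "reply_in_thread"}
    return {
        "action": "create_claude_code_session",
        # Recent channel/thread messages are forwarded as session context.
        "context": thread_context + [message],
        "status_updates_to": "originating_thread",
    }

result = handle_mention(
    "There's a crash when saving drafts, can you fix it?",
    ["User report: saving a draft 500s on /api/drafts"],
)
print(result["action"])  # create_claude_code_session
```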

    Why Anthropic is betting big on enterprise workflow integrations

    The Slack integration arrives at a pivotal moment for Anthropic. Claude Code has already hit $1 billion in revenue six months since its public debut in May, according to a LinkedIn post from Anthropic's chief product officer, Mike Krieger. The coding agent continues to barrel toward scale with customers like Netflix, Spotify, and Salesforce.

    The velocity of that growth helps explain why Anthropic made its first-ever acquisition earlier this month. The Information earlier reported on Anthropic's bid to acquire Bun; Anthropic declined to comment on financial details.

    Bun is a breakthrough JavaScript runtime that is dramatically faster than the leading competition. As an all-in-one toolkit — combining runtime, package manager, bundler, and test runner — it's become essential infrastructure for AI-led software engineering, helping developers build and test applications at unprecedented velocity.

    Since becoming generally available in May 2025, Claude Code has grown from its origins as an internal engineering experiment into a critical tool for many of the world's category-leading enterprises, including Netflix, Spotify, KPMG, L'Oreal, and Salesforce — and Bun has been key in helping scale its infrastructure throughout that evolution.

    The acquisition signals that Anthropic views Claude Code not as a peripheral feature but as a core business line worth substantial investment. The Slack integration extends that bet, positioning Claude Code as an ambient presence in the workspaces where engineering decisions actually get made.

    According to an Anthropic spokesperson, companies including Rakuten, Novo Nordisk, Uber, Snowflake, and Ramp now use Claude Code for both professional and novice developers. Rakuten, the Japanese e-commerce giant, has reportedly reduced software development timelines from 24 days to just 5 days using the tool — a 79% reduction that illustrates the productivity claims Anthropic has been making.

    Claude Code's rapid rise from internal experiment to billion-dollar product

    The Slack launch is the latest in a rapid series of Claude Code expansions. In late November, Claude Code was added to Anthropic's desktop apps, including the Mac version; it had previously been limited to mobile apps and the web. The desktop apps let software engineers code, research, and update work with multiple local and remote sessions running at the same time.

    That release accompanied Anthropic's unveiling of Claude Opus 4.5, its newest and most capable model. Claude Opus 4.5 is available today in the company's apps, via its API, and on all three major cloud platforms. Pricing is $5 per million input tokens and $25 per million output tokens — making Opus-level capabilities accessible to even more users, teams, and enterprises.

    The company has also invested heavily in the developer infrastructure that powers Claude Code. In late November, Anthropic released three new beta features for tool use: Tool Search Tool, which allows Claude to use search tools to access thousands of tools without consuming its context window; Programmatic Tool Calling, which allows Claude to invoke tools in a code execution environment reducing the impact on the model's context window; and Tool Use Examples, which provides a universal standard for demonstrating how to effectively use a given tool.
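For context on what these tool-use features operate over, the sketch below shows the basic tool-definition shape Anthropic's Messages API uses (a name, a description, and a JSON-Schema `input_schema`). The weather tool and model id are made-up examples, and the exact beta headers for the three features above are not reproduced here.

```python
# The basic tool-definition shape for Anthropic's Messages API. The
# weather tool is a made-up example; the model id is illustrative, and
# beta features are enabled via separate beta headers not shown here.

get_weather = {
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "input_schema": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name"},
        },
        "required": ["city"],
    },
}

request_body = {
    "model": "claude-opus-4-5",  # illustrative model id
    "max_tokens": 1024,
    "tools": [get_weather],
    "messages": [{"role": "user", "content": "What's the weather in Tokyo?"}],
}
```

Features like Tool Search Tool and Programmatic Tool Calling matter precisely because each definition like this consumes context; deferring or batching them keeps the window free for actual work.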

    The Model Context Protocol (MCP) is an open standard for connecting AI agents to external systems. Connecting agents to tools and data traditionally requires a custom integration for each pairing, creating fragmentation and duplicated effort that makes it difficult to scale truly connected systems. MCP provides a universal protocol — developers implement MCP once in their agent and it unlocks an entire ecosystem of integrations.
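MCP is built on JSON-RPC 2.0, so the "implement once" claim boils down to agreeing on a small set of framed messages. The sketch below shows the shape of a client's opening `initialize` request; the protocol version string is one published revision and may not be the latest.

```python
import json

# MCP rides on JSON-RPC 2.0. A client session opens with an "initialize"
# request like this one; the protocolVersion shown is one published
# revision and may be superseded.

initialize_request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "initialize",
    "params": {
        "protocolVersion": "2024-11-05",
        "capabilities": {},
        "clientInfo": {"name": "example-agent", "version": "0.1.0"},
    },
}

wire = json.dumps(initialize_request)
print(json.loads(wire)["method"])  # initialize
```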

    Inside Anthropic's own AI transformation: what happens when engineers use Claude all day

    Anthropic has been unusually transparent about how its own engineers use Claude Code — and the findings offer a preview of broader workforce implications. In August 2025, Anthropic surveyed 132 engineers and researchers, conducted 53 in-depth qualitative interviews, and studied internal Claude Code usage data to understand how AI use is changing work at the company.

    Employees self-reported using Claude in 60% of their work and achieving a 50% productivity boost, a 2-3x increase from this time last year. In practice, this gain looks like slightly less time spent per task category but considerably more output volume.

    Perhaps most notably, 27% of Claude-assisted work consists of tasks that wouldn't have been done otherwise, such as scaling projects, making nice-to-have tools like interactive data dashboards, and exploratory work that wouldn't be cost-effective if done manually.

    The internal research also revealed how Claude is changing the nature of engineering collaboration. The maximum number of consecutive tool calls Claude Code makes per transcript increased by 116%: Claude now chains together 21.2 independent tool calls without the need for human intervention, up from 9.8 six months ago.

    The average number of human turns per transcript decreased by 33%, from 6.2 to 4.1, suggesting that less human input is needed to accomplish a given task than six months ago.
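The reported percentages follow directly from the raw figures, as a quick check confirms:

```python
# Quick arithmetic check on the reported statistics.

tool_calls_before, tool_calls_after = 9.8, 21.2
increase_pct = (tool_calls_after - tool_calls_before) / tool_calls_before * 100
print(round(increase_pct))  # 116

turns_before, turns_after = 6.2, 4.1
decrease_pct = (turns_before - turns_after) / turns_before * 100
print(round(decrease_pct, 1))  # 33.9, which the article rounds to 33%
```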

    But the research also surfaced tensions. One prominent theme was that Claude has become the first stop for questions that once went to colleagues. "It has reduced my dependence on [my team] by 80%, [but] the last 20% is crucial and I go and talk to them," one engineer explained. Several engineers said they "bounce ideas off" Claude, similar to interactions with human collaborators.

    Others described experiencing less interaction with colleagues. Some appreciate the reduced social friction, but others resist the change or miss the older way of working: "I like working with people and it is sad that I 'need' them less now."

    How Anthropic stacks up against OpenAI, Google, and Microsoft in the enterprise AI race

    Anthropic is not alone in racing to capture the enterprise coding market. OpenAI, Google, and Microsoft (through GitHub Copilot) are all pursuing similar integrations. The Slack launch gives Anthropic a presence in one of the most widely used enterprise communication platforms — Slack claims over 750,000 organizations use its software.

    The deal comes as Anthropic pursues a more disciplined growth path than rival OpenAI, focusing on enterprise customers and coding workloads. Internal financials reported by The Wall Street Journal show Anthropic expects to break even by 2028 — two years earlier than OpenAI, which continues to invest heavily in infrastructure as it expands into video, hardware, and consumer products.

    The move also marks an increased push into developer tooling. Anthropic has recently seen backing from some of tech's biggest titans. Microsoft and Nvidia pledged up to $15 billion in fresh investment in Anthropic last month, alongside a $30 billion commitment from Anthropic to run Claude Code on Microsoft's cloud. This is in addition to the $8 billion invested from Amazon and $3 billion from Google.

    The cross-investment from both Microsoft and Google — fierce competitors in the cloud and AI spaces — highlights how valuable Anthropic's enterprise positioning has become. By integrating with Slack (which is owned by Salesforce), Anthropic further embeds itself in the enterprise software ecosystem while remaining platform-agnostic.

    What the Slack integration means for developers — and whether they can trust it

    For engineering teams, the Slack integration promises to collapse the distance between problem identification and problem resolution. A bug report in a Slack channel can immediately trigger investigation. A feature request can spawn a prototype. A code review comment can generate a refactor.

    But the integration also raises questions about oversight and code quality. Most Anthropic employees use Claude frequently while reporting they can "fully delegate" only 0-20% of their work to it. Claude is a constant collaborator, but using it generally involves active supervision and validation, especially in high-stakes work, rather than handing off tasks that require no verification at all.

    Some employees are concerned about the atrophy of deeper skillsets required for both writing and critiquing code — "When producing output is so easy and fast, it gets harder and harder to actually take the time to learn something."

    The Slack integration, by making Claude Code invocation as simple as an @mention, may accelerate both the productivity benefits and the skill-atrophy concerns that Anthropic's own research has documented.

    The future of coding may be conversational—and Anthropic is racing to prove it

    The beta launch marks the beginning of what Anthropic expects will be a broader rollout, with documentation forthcoming for teams looking to deploy the integration and refinements planned based on user feedback during the research preview phase.

    For Anthropic, the Slack integration is a calculated bet on a fundamental shift in how software gets written. The company is wagering that the future of coding will be conversational — that the walls between where developers talk about problems and where they solve them will dissolve entirely. The companies that win enterprise AI, in this view, will be the ones that meet developers not in specialized tools but in the chat windows they already have open all day.

    Whether that vision becomes reality will depend on whether Claude Code can deliver enterprise-grade reliability while maintaining the security that organizations demand. The early returns are promising: a billion dollars in revenue, a roster of Fortune 500 customers, and a growing ecosystem of integrations suggest Anthropic is onto something real.

    But in one of Anthropic's own internal interviews, an engineer offered a more cautious assessment of the transformation underway: "Nobody knows what's going to happen… the important thing is to just be really adaptable."

    In the age of AI coding agents, that may be the only career advice that holds up.