Blog

  • Hiring specialists made sense before AI — now generalists win

    Tony Stoyanov is CTO and co-founder of EliseAI

    In the 2010s, tech companies chased staff-level specialists: backend engineers, data scientists, system architects. That model worked when technology evolved slowly. Specialists knew their craft, could deliver quickly and built careers on predictable foundations like cloud infrastructure or the latest JS framework.

    Then AI went mainstream.

    The pace of change has exploded. New technologies appear and mature in less than a year. You can’t hire someone who has been building AI agents for five years, as the technology hasn’t existed for that long. The people thriving today aren’t those with the longest résumés; they’re the ones who learn fast, adapt fast and act without waiting for direction. Nowhere is this transformation more evident than in software engineering, which has likely experienced the most dramatic shift of all, evolving faster than almost any other field of work.

    How AI is rewriting the rules

    AI has lowered the barrier to doing complex technical work, and it has also raised expectations for what counts as real expertise. McKinsey estimates that by 2030, up to 30% of U.S. work hours could be automated and 12 million workers may need to shift roles entirely. Technical depth still matters, but AI favors people who can figure things out as they go.

    At my company, I see this every day. Engineers who never touched front-end code are now building UIs, while front-end developers are moving into back-end work. The technology keeps getting easier to use but the problems are harder because they span more disciplines.

    In that kind of environment, being great at one thing isn’t enough. What matters is the ability to bridge engineering, product and operations to make good decisions quickly, even with imperfect information.

    Despite all the excitement, only 1% of companies consider themselves truly mature in how they use AI. Many still rely on structures built for a slower era — layers of approval, rigid roles and an overreliance on specialists who can’t move outside their lane.

    The traits of a strong generalist 

    A strong generalist has breadth without losing depth. They go deep in one or two domains but stay fluent across many. As David Epstein puts it in Range, “You have people walking around with all the knowledge of humanity on their phone, but they have no idea how to integrate it. We don’t train people in thinking or reasoning.” True expertise comes from connecting the dots, not just collecting information.

    The best generalists share these traits:

    • Ownership: End-to-end accountability for outcomes, not just tasks.

    • First-principles thinking: Question assumptions, focus on the goal, and rebuild when needed.

    • Adaptability: Learn new domains quickly and move between them smoothly.

    • Agency: Act without waiting for approval and adjust as new information comes in.

    • Soft skills: Communicate clearly, align teams and keep customers’ needs in focus.

    • Range: Solve different kinds of problems and draw lessons across contexts.

    I try to make accountability a priority for my teams. Everyone knows what they own, what success looks like and how it connects to the mission. Perfection isn’t the goal; forward movement is.

    Embracing the shift

    Focusing on adaptable builders changed everything. These are the people with the range and curiosity to use AI tools to learn quickly and execute confidently.

    If you’re a builder who thrives in ambiguity, this is your time. The AI era rewards curiosity and initiative more than credentials. If you’re hiring, look ahead. The people who’ll move your company forward might not be the ones with the perfect résumé for the job. They’re the ones who can grow into what the company will need as it evolves.

    The future belongs to generalists and to the companies that trust them.


  • Anthropic launches enterprise ‘Agent Skills’ and opens the standard, challenging OpenAI in workplace AI

    Anthropic said on Wednesday it would release its Agent Skills technology as an open standard, a strategic bet that sharing its approach to making AI assistants more capable will cement the company's position in the fast-evolving enterprise software market.

    The San Francisco-based artificial intelligence company also unveiled organization-wide management tools for enterprise customers and a directory of partner-built skills from companies including Atlassian, Figma, Canva, Stripe, Notion, and Zapier.

    The moves mark a significant expansion of a technology Anthropic first introduced in October, transforming what began as a niche developer feature into infrastructure that now appears poised to become an industry standard.

    "We're launching Agent Skills as an independent open standard with a specification and reference SDK available at https://agentskills.io," Mahesh Murag, a product manager at Anthropic, said in an interview with VentureBeat. "Microsoft has already adopted Agent Skills within VS Code and GitHub; so have popular coding agents like Cursor, Goose, Amp, OpenCode, and more. We're in active conversations with others across the ecosystem."

    Inside the technology that teaches AI assistants to do specialized work

    Skills are, at their core, folders containing instructions, scripts, and resources that tell AI systems how to perform specific tasks consistently. Rather than requiring users to craft elaborate prompts each time they want an AI assistant to complete a specialized task, skills package that procedural knowledge into reusable modules.

    The concept addresses a fundamental limitation of large language models: while they possess broad general knowledge, they often lack the specific procedural expertise needed for specialized professional work. A skill for creating PowerPoint presentations, for instance, might include preferred formatting conventions, slide structure guidelines, and quality standards — information the AI loads only when working on presentations.

    Anthropic designed the system around what it calls "progressive disclosure." Each skill takes only a few dozen tokens when summarized in the AI's context window, with full details loading only when the task requires them. This architectural choice allows organizations to deploy extensive skill libraries without overwhelming the AI's working memory.
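
    To make the mechanics concrete, here is a minimal sketch of progressive disclosure: an index holds only each skill's name and description, and the full instructions load on demand. The folder layout mirrors Anthropic's published skill format (a SKILL.md file whose frontmatter carries the name and description), but the loader itself is an illustrative sketch, not Anthropic's implementation.

        # Illustrative sketch of progressive disclosure: index only each skill's
        # lightweight metadata up front, and load the full instructions on demand.
        # The loader is hypothetical, not Anthropic's implementation.
        from pathlib import Path

        def read_frontmatter(skill_md: Path) -> dict:
            """Parse the key: value pairs between the leading '---' markers."""
            meta, inside = {}, False
            for line in skill_md.read_text(encoding="utf-8").splitlines():
                if line.strip() == "---":
                    if inside:
                        break
                    inside = True
                    continue
                if inside and ":" in line:
                    key, value = line.split(":", 1)
                    meta[key.strip()] = value.strip()
            return meta

        def index_skills(skills_dir: str) -> dict:
            """Build the few-dozen-token summary that sits in the model's context."""
            index = {}
            for skill_md in Path(skills_dir).glob("*/SKILL.md"):
                meta = read_frontmatter(skill_md)
                index[meta.get("name", skill_md.parent.name)] = {
                    "description": meta.get("description", ""),
                    "path": skill_md,
                }
            return index

        def load_skill_body(index: dict, name: str) -> str:
            """Pull the full instructions only when the task actually calls for them."""
            return index[name]["path"].read_text(encoding="utf-8")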

    Fortune 500 companies are already using skills in legal, finance, and accounting

    The new enterprise management features allow administrators on Anthropic's Team and Enterprise plans to provision skills centrally, controlling which workflows are available across their organizations while letting individual employees customize their experience.

    "Enterprise customers are using skills in production across both coding workflows and business functions like legal, finance, accounting, and data science," Murag said. "The feedback has been positive because skills let them personalize Claude to how they actually work and get to high-quality output faster."

    The community response has exceeded expectations, according to Murag: "Our skills repository already crossed 20k stars on GitHub, with tens of thousands of community-created and shared skills."

    Atlassian, Figma, Stripe, and Zapier join Anthropic's skills directory at launch

    Anthropic is launching with skills from ten partners, a roster that reads like a who's who of modern enterprise software. The presence of Atlassian, which makes Jira and Confluence, alongside design tools Figma and Canva, payment infrastructure company Stripe, and automation platform Zapier suggests Anthropic is positioning Skills as connective tissue between Claude and the applications businesses already use.

    The business arrangements with these partners focus on ecosystem development rather than immediate revenue generation.

    "Partners who build skills for the directory do so to enhance how Claude works with their platforms. It's a mutually beneficial ecosystem relationship similar to MCP connector partnerships," Murag explained. "There are no revenue-sharing arrangements at this time."

    For vetting new partners, Anthropic is taking a measured approach. "We began with established partners and are developing more formal criteria as we expand," Murag said. "We want to create a valuable supply of skills for enterprises while helping partner products shine."

    Notably, Anthropic is not charging extra for the capability. "Skills work across all Claude surfaces: Claude.ai, Claude Code, the Claude Agent SDK, and the API. They're included in Max, Pro, Team, and Enterprise plans at no additional cost. API usage follows standard API pricing," Murag said.

    Why Anthropic is giving away its competitive advantage to OpenAI and Google

    The decision to release Skills as an open standard is a calculated strategic choice. By making skills portable across AI platforms, Anthropic is betting that ecosystem growth will benefit the company more than proprietary lock-in would.

    The strategy appears to be working. OpenAI has quietly adopted structurally identical architecture in both ChatGPT and its Codex CLI tool. Developer Elias Judin discovered the implementation earlier this month, finding directories containing skill files that mirror Anthropic's specification—the same file naming conventions, the same metadata format, the same directory organization.

    This convergence suggests the industry has found a common answer to a vexing question: how do you make AI assistants consistently good at specialized work without expensive model fine-tuning?

    The timing aligns with broader standardization efforts in the AI industry. Anthropic donated its Model Context Protocol to the Linux Foundation on December 9, and both Anthropic and OpenAI co-founded the Agentic AI Foundation alongside Block. Google, Microsoft, and Amazon Web Services joined as members. The foundation will steward multiple open specifications, and Skills fit naturally into this standardization push.

    "We've also seen how complementary skills and MCP servers are," Murag noted. "MCP provides secure connectivity to external software and data, while skills provide the procedural knowledge for using those tools effectively. Partners who've invested in strong MCP integrations were a natural starting point."

    The AI industry abandons specialized agents in favor of one assistant that learns everything

    The Skills approach is a philosophical shift in how the AI industry thinks about making AI assistants more capable. The traditional approach involved building specialized agents for different use cases — a customer service agent, a coding agent, a research agent. Skills suggest a different model: one general-purpose agent equipped with a library of specialized capabilities.

    "We used to think agents in different domains will look very different," Barry Zhang, an Anthropic researcher, said at an industry conference last month, according to a Business Insider report. "The agent underneath is actually more universal than we thought."

    This insight has significant implications for enterprise software development. Rather than building and maintaining multiple specialized AI systems, organizations can invest in creating and curating skills that encode their institutional knowledge and best practices.

    Anthropic's own internal research supports this approach. A study the company published in early December found that its engineers used Claude in 60% of their work, achieving a 50% self-reported productivity boost—a two to threefold increase from the prior year. Notably, 27% of Claude-assisted work consisted of tasks that would not have been done otherwise, including building internal tools, creating documentation, and addressing what employees called "papercuts" — small quality-of-life improvements that had been perpetually deprioritized.

    Security risks and skill atrophy emerge as concerns for enterprise AI deployments

    The Skills framework is not without potential complications. As AI systems become more capable through skills, questions arise about maintaining human expertise. Anthropic's internal research found that while skills enabled engineers to work across more domains—backend developers building user interfaces, researchers creating data visualizations—some employees worried about skill atrophy.

    "When producing output is so easy and fast, it gets harder and harder to actually take the time to learn something," one Anthropic engineer said in the company's internal survey.

    There are also security considerations. Skills provide Claude with new capabilities through instructions and code, which means malicious skills could theoretically introduce vulnerabilities. Anthropic recommends installing skills only from trusted sources and thoroughly auditing those from less-trusted origins.

    The open standard approach introduces governance questions as well. While Anthropic has published the specification and launched a reference SDK, the long-term stewardship of the standard remains undefined. Whether it will fall under the Agentic AI Foundation or require its own governance structure is an open question.

    Anthropic's real product may not be Claude—it may be the infrastructure everyone else builds on

    The trajectory of Skills reveals something important about Anthropic's ambitions. Two months ago, the company introduced a feature that looked like a developer tool. Today, that feature has become a specification that Microsoft builds into VS Code, that OpenAI replicates in ChatGPT, and that enterprise software giants race to support.

    The pattern echoes strategies that have reshaped the technology industry before. Companies from Red Hat to Google have discovered that open standards can be more valuable than proprietary technology — that the company defining how an industry works often captures more value than the company trying to own it outright.

    For enterprise technology leaders evaluating AI investments, the message is straightforward: skills are becoming infrastructure. The expertise organizations encode into skills today will determine how effectively their AI assistants perform tomorrow, regardless of which model powers them.

    The competitive battles between Anthropic, OpenAI, and Google will continue. But on the question of how to make AI assistants reliably good at specialized work, the industry has quietly converged on an answer — and it came from the company that gave it away.

  • Palona goes vertical, launching Vision, Workflow features: 4 key lessons for AI builders

    Building an enterprise AI company on a "foundation of shifting sand" is the central challenge for founders today, according to the leadership at Palona AI.

    Today, the Palo Alto-based startup—led by former Google and Meta engineering veterans—is making a decisive vertical push into the restaurant and hospitality space with the launch of Palona Vision and Palona Workflow.

    The new offerings transform the company’s multimodal agent suite into a real-time operating system for restaurant operations — spanning cameras, calls, conversations, and coordinated task execution.

    The news marks a strategic pivot from the company’s debut in early 2025, when it first emerged with $10 million in seed funding to build emotionally intelligent sales agents for broad direct-to-consumer enterprises.

    Now, by narrowing its focus to a "multimodal native" approach for restaurants, Palona is providing a blueprint for AI builders on how to move beyond "thin wrappers" to build deep systems that solve high-stakes physical world problems.

    “You’re building a company on top of a foundation that is sand—not quicksand, but shifting sand,” said co-founder and CTO Tim Howes, referring to the instability of today’s LLM ecosystem. “So we built an orchestration layer that lets us swap models on performance, fluency, and cost.”

    VentureBeat spoke with Howes and co-founder and CEO Maria Zhang in person recently at — where else? — a restaurant in NYC about the technical challenges and hard lessons learned from their launch, growth, and pivot.

    The New Offering: Vision and Workflow as a ‘Digital GM’

    For the end user—the restaurant owner or operator—Palona’s latest release is designed to function as an automated "best operations manager" that never sleeps.

    Palona Vision uses existing in-store security cameras to analyze operational signals, without requiring any new hardware.

    It monitors front-of-house metrics like queue lengths, table turns, and cleanliness, while simultaneously identifying back-of-house issues like prep slowdowns or station setup errors.

    Palona Workflow complements this by automating multi-step operational processes. This includes managing catering orders, opening and closing checklists, and food prep fulfillment. By correlating video signals from Vision with Point-of-Sale (POS) data and staffing levels, Workflow ensures consistent execution across multiple locations.

    “Palona Vision is like giving every location a digital GM,” said Shaz Khan, founder of Tono Pizzeria + Cheesesteaks, in a press release provided to VentureBeat. “It flags issues before they escalate and saves me hours every week.”

    Going Vertical: Lessons in Domain Expertise

    Palona’s journey began with a star-studded roster. CEO Zhang previously served as VP of Engineering at Google and CTO of Tinder, while co-founder Howes is the co-inventor of LDAP and a former Netscape CTO.

    Despite this pedigree, the team’s first year was a lesson in the necessity of focus.

    Initially, Palona served fashion and electronics brands, creating "wizard" and "surfer dude" personalities to handle sales. However, the team quickly realized that the restaurant industry presented a unique, trillion-dollar opportunity that was "surprisingly recession-proof" but "gobsmacked" by operational inefficiency.

    "Advice to startup founders: don't go multi-industry," Zhang warned.

    By verticalizing, Palona moved from being a "thin" chat layer to building a "multi-sensory information pipeline" that processes vision, voice, and text in tandem.

    That clarity of focus opened access to proprietary training data (like prep playbooks and call transcripts) while avoiding generic data scraping.

    1. Building on ‘Shifting Sand’

    To accommodate the reality of enterprise AI deployments in 2025 — with new, improved models coming out on a nearly weekly basis — Palona developed a patent-pending orchestration layer.

    Rather than being "bundled" with a single provider like OpenAI or Google, Palona’s architecture allows them to swap models on a dime based on performance and cost.

    They use a mix of proprietary and open-source models, including Gemini for computer vision benchmarks and specific language models for Spanish or Chinese fluency.

    For builders, the message is clear: Never let your product's core value be a single-vendor dependency.

    2. From Words to ‘World Models’

    The launch of Palona Vision represents a shift from understanding words to understanding the physical reality of a kitchen.

    While many developers struggle to stitch separate APIs together, Palona’s new vision model transforms existing in-store cameras into operational assistants.

    The system identifies "cause and effect" in real-time—recognizing if a pizza is undercooked by its "pale beige" color or alerting a manager if a display case is empty.

    "In words, physics don't matter," Zhang explained. "But in reality, I drop the phone, it always goes down… we want to really figure out what's going on in this world of restaurants".

    3. The ‘Muffin’ Solution: Custom Memory Architecture

    One of the most significant technical hurdles Palona faced was memory management. In a restaurant context, memory is the difference between a frustrating interaction and a "magical" one where the agent remembers a diner’s "usual" order.

    The team initially utilized an unspecified open-source tool, but found it produced errors 30% of the time. "I think advisory developers always turn off memory [on consumer AI products], because that will guarantee to mess everything up," Zhang cautioned.

    To solve this, Palona built Muffin, a proprietary memory management system named as a nod to web "cookies". Unlike standard vector-based approaches that struggle with structured data, Muffin is architected to handle four distinct layers:

    • Structured Data: Stable facts like delivery addresses or allergy information.

    • Slow-changing Dimensions: Loyalty preferences and favorite items.

    • Transient and Seasonal Memories: Adapting to shifts like preferring cold drinks in July versus hot cocoa in winter.

    • Regional Context: Defaults like time zones or language preferences.
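
    As a rough illustration of how those layers might fit together in code, here is a small sketch of a four-layer memory record; the field names and lookup logic are assumptions for the example, not Palona's actual Muffin implementation.

        # Illustrative four-layer memory record mirroring the layers described above.
        # Field names and the read path are assumptions, not Palona's implementation.
        from dataclasses import dataclass, field
        from datetime import date

        @dataclass
        class GuestMemory:
            structured: dict = field(default_factory=dict)     # stable facts: addresses, allergies
            slow_changing: dict = field(default_factory=dict)  # loyalty preferences, favorite items
            seasonal: dict = field(default_factory=dict)       # transient, season-keyed preferences
            regional: dict = field(default_factory=dict)       # time zone, language defaults

            def drink_suggestion(self, today: date) -> str:
                """Example read path: a seasonal preference wins, then the stored favorite."""
                season = "summer" if today.month in (6, 7, 8) else "winter"
                return self.seasonal.get(f"{season}_drink",
                                         self.slow_changing.get("favorite_drink", "water"))

        guest = GuestMemory(
            structured={"allergies": ["peanuts"]},
            slow_changing={"favorite_drink": "hot cocoa"},
            seasonal={"summer_drink": "iced tea"},
            regional={"timezone": "America/Chicago", "language": "en"},
        )
        print(guest.drink_suggestion(date(2025, 7, 4)))  # -> "iced tea"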

    The lesson for builders: If the best available tool isn't good enough for your specific vertical, you must be willing to build your own.

    4. Reliability through ‘GRACE’

    In a kitchen, an AI error isn't just a typo; it’s a wasted order or a safety risk. A recent incident at Stefanina’s Pizzeria in Missouri, where an AI hallucinated fake deals during a dinner rush, highlights how quickly brand trust can evaporate when safeguards are absent.

    To prevent such chaos, Palona’s engineers follow its internal GRACE framework:

    • Guardrails: Hard limits on agent behavior to prevent unapproved promotions.

    • Red Teaming: Proactive attempts to "break" the AI and identify potential hallucination triggers.

    • App Sec: Locking down APIs and third-party integrations with TLS, tokenization, and attack prevention systems.

    • Compliance: Grounding every response in verified, vetted menu data to ensure accuracy.

    • Escalation: Routing complex interactions to a human manager before a guest receives misinformation.

    This reliability is verified through massive simulation. "We simulated a million ways to order pizza," Zhang said, using one AI to act as a customer and another to take the order, measuring accuracy to eliminate hallucinations.
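
    A minimal sketch of that kind of agent-versus-agent simulation looks something like the following, with one placeholder model playing the customer and another taking the order; the call_model helper stands in for real LLM calls and is not Palona's test harness.

        # Agent-vs-agent order simulation: one "model" role-plays a customer, another
        # takes the order, and accuracy is measured over many trials. call_model() is
        # a placeholder for real LLM API calls.
        import random

        MENU = ["margherita pizza", "pepperoni pizza", "cheesesteak", "garlic knots"]

        def call_model(role: str, prompt: str) -> str:
            if role == "customer":
                return ", ".join(random.sample(MENU, k=random.randint(1, 3)))
            # The 'order taker' echoes the request, occasionally dropping an item to
            # stand in for transcription or reasoning errors.
            items = prompt.split(", ")
            if len(items) > 1 and random.random() < 0.05:
                items = items[:-1]
            return ", ".join(items)

        def run_simulation(n_trials: int = 1_000_000) -> float:
            correct = 0
            for _ in range(n_trials):
                intended = call_model("customer", "Order something for dinner.")
                captured = call_model("order_taker", intended)
                correct += int(sorted(captured.split(", ")) == sorted(intended.split(", ")))
            return correct / n_trials

        print(f"order accuracy: {run_simulation():.2%}")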

    The Bottom Line

    With the launch of Vision and Workflow, Palona is betting that the future of enterprise AI isn't in broad assistants, but in specialized "operating systems" that can see, hear, and think within a specific domain.

    In contrast to general-purpose AI agents, Palona’s system is designed to execute restaurant workflows, not just respond to queries. It can remember customers, hear them order their "usual," and monitor restaurant operations to make sure the food is delivered according to internal processes and guidelines, flagging whenever something goes wrong or, crucially, is about to.

    For Zhang, the goal is to let human operators focus on their craft: "If you've got that delicious food nailed… we’ll tell you what to do."

  • AI is moving to the edge – and network security needs to catch up

    Presented by T-Mobile for Business


    Small and mid-sized businesses are adopting AI at a pace that would have seemed unrealistic even a few years ago. Smart assistants that greet customers, predictive tools that flag inventory shortages before they happen, and on-site analytics that help staff make decisions faster — these used to be features of the enterprise. Now they’re being deployed in retail storefronts, regional medical clinics, branch offices, and remote operations hubs.

    What’s changed is not just the AI itself, but where it runs. Increasingly, AI workloads are being pushed out of centralized data centers and into the real world — into the places where employees work and customers interact. This shift to the edge promises faster insights and more resilient operations, but it also transforms the demands placed on the network. Edge sites need consistent bandwidth, real-time data pathways, and the ability to process information locally rather than relying on the cloud for every decision.

    The catch is that as companies race to connect these locations, security often lags behind. A store may adopt AI-enabled cameras or sensors long before it has the policies to manage them. A clinic may roll out mobile diagnostic devices without fully segmenting their traffic. A warehouse may rely on a mix of Wi-Fi, wired, and cellular connections that weren’t designed to support AI-driven operations. When connectivity scales faster than security, it creates cracks — unmonitored devices, inconsistent access controls, and unsegmented data flows that make it hard to see what’s happening, let alone protect it.

    Edge AI only delivers its full value when connectivity and security evolve together.

    Why AI is moving to the edge — and what that breaks

    Businesses are shifting AI to the edge for three core reasons:

    • Real-time responsiveness: Some decisions can’t wait for a round trip to the cloud. Whether it’s identifying an item on a shelf, detecting an abnormal reading from a medical device, or recognizing a safety risk in a warehouse aisle, the delay introduced by centralized processing can mean missed opportunities or slow reactions.

    • Resilience and privacy: Keeping data and inference local makes operations less vulnerable to outages or latency spikes, and it reduces the flow of sensitive information across networks. This helps SMBs meet data sovereignty and compliance requirements without rewriting their entire infrastructure.

    • Mobility and deployment speed: Many SMBs operate across distributed footprints — remote workers, pop-up locations, seasonal operations, or mobile teams. Wireless-first connectivity, including 5G business lines, lets them deploy AI tools quickly without waiting for fixed circuits or expensive buildouts.

    Technologies like Edge Control from T-Mobile for Business fit naturally into this model. By routing traffic directly along the paths it needs — keeping latency-sensitive workloads local and bypassing the bottlenecks that traditional VPNs introduce — businesses can adopt edge AI without dragging their network into constant contention.

    Yet the shift introduces new risk. Every edge site becomes, in effect, its own small data center. A retail store may have cameras, sensors, POS systems, digital signage, and staff devices all sharing the same access point. A clinic may run diagnostic tools, tablets, wearables, and video consult systems side by side. A manufacturing floor might combine robotics, sensors, handheld scanners, and on-site analytics platforms.

    This diversity increases the attack surface dramatically. Many SMBs roll out connectivity first, then add piecemeal security later — leaving the blind spots attackers rely on.

    Zero trust becomes essential at the edge

    When AI is distributed across dozens or hundreds of sites, the old idea of a single secure “inside” network breaks down. Every store, clinic, kiosk, or field location becomes its own micro-environment — and every device within it becomes its own potential entry point.

    Zero trust offers a framework to make this manageable.

    At the edge, zero trust means:

    • Verifying identity rather than location — access is granted because a user or device proves who it is, not because it sits behind a corporate firewall.

    • Continuous authentication — trust isn’t permanent; it’s re-evaluated throughout a session.

    • Segmentation that limits movement — if something goes wrong, attackers can’t jump freely from system to system.

    This approach is especially critical given that many edge devices can’t run traditional security clients. SIM-based identity and secure mobile connectivity — areas where T-Mobile for Business brings significant strength — help verify IoT devices, 5G routers, and sensors that otherwise sit outside the visibility of IT teams.
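
    As a schematic of what an identity-first access decision can look like at an edge site, the sketch below folds in the three principles above; the fields, threshold, and policy are hypothetical and not tied to any T-Mobile product.

        # Schematic zero-trust access check: grant access on verified identity, re-check
        # trust on every request, and confine each device to its own segment. All names
        # and values here are hypothetical.
        from dataclasses import dataclass
        import time

        @dataclass
        class DeviceSession:
            device_id: str
            identity_verified: bool   # e.g., SIM- or certificate-based attestation
            segment: str              # "pos", "cameras", "guest_wifi", ...
            last_verified_at: float   # epoch seconds

        MAX_TRUST_AGE_S = 300  # re-evaluate trust every five minutes

        def allow_request(session: DeviceSession, target_segment: str) -> bool:
            if not session.identity_verified:
                return False                              # identity, not location
            if time.time() - session.last_verified_at > MAX_TRUST_AGE_S:
                return False                              # trust is not permanent
            return session.segment == target_segment      # no lateral movement

        cam = DeviceSession("cam-014", True, "cameras", time.time())
        print(allow_request(cam, "cameras"))  # True
        print(allow_request(cam, "pos"))      # False: segmentation blocks the hop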

    This is why connectivity providers are increasingly combining networking and security into a single approach. T-Mobile for Business embeds segmentation, device visibility, and zero-trust safeguards directly into its wireless-first connectivity offerings, reducing the need for SMBs to stitch together multiple tools.

    Secure-by-default networks reshape the landscape

    A major architectural shift is underway: networks that assume every device, session, and workload must be authenticated, segmented, and monitored from the start. Instead of building security on top of connectivity, the two are fused.

    T-Mobile for Business shows how this is evolving. Its SASE platform, powered by Palo Alto Networks Prisma SASE 5G, blends secure access with connectivity into one cloud-delivered service. Private Access gives users the least-privileged access they need, nothing more. T-SIMsecure authenticates devices at the SIM layer, allowing IoT sensors and 5G routers to be verified automatically. Security Slice isolates sensitive SASE traffic on a dedicated portion of the 5G network, ensuring consistency even during heavy demand.

    A unified dashboard like T-Platform brings it together, offering real-time visibility across SASE, IoT, business internet, and edge control — simplifying operations for SMBs with limited staff.

    The future: AI that runs the edge and protects it

    As AI models become more dynamic and autonomous, we’ll see the relationship flip: the edge won’t just support AI; AI will actively run and secure the edge — optimizing traffic paths, adjusting segmentation automatically, and spotting anomalies that matter to one specific store or site.

    Self-healing networks and adaptive policy engines will move from experimental to expected.

    For SMBs, this is a pivotal moment. The organizations that modernize their connectivity and security foundations now will be the ones best positioned to scale AI everywhere — safely, confidently, and without unnecessary complexity.

    Partners like T-Mobile for Business are already moving in this direction, giving SMBs a way to deploy AI at the edge without sacrificing control or visibility.


    Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.

  • Gemini 3 Flash arrives with reduced costs and latency — a powerful combo for enterprises

    Enterprises can now harness a large language model with performance near that of Google’s state-of-the-art Gemini 3 Pro, but at a fraction of the cost and with increased speed, thanks to the newly released Gemini 3 Flash.

    The model joins the flagship Gemini 3 Pro, Gemini 3 Deep Think, and Gemini Agent, all of which were announced and released last month.

    Gemini 3 Flash, now available on Gemini Enterprise, Google Antigravity, Gemini CLI, and AI Studio, and in preview on Vertex AI, processes information in near real time and helps developers build quick, responsive agentic applications.

    The company said in a blog post that Gemini 3 Flash “builds on the model series that developers and enterprises already love, optimized for high-frequency workflows that demand speed, without sacrificing quality.”

    The model is also the default for AI Mode on Google Search and the Gemini application. 

    Tulsee Doshi, senior director, product management on the Gemini team, said in a separate blog post that the model “demonstrates that speed and scale don’t have to come at the cost of intelligence.”

    “Gemini 3 Flash is made for iterative development, offering Gemini 3’s Pro-grade coding performance with low latency — it’s able to reason and solve tasks quickly in high-frequency workflows,” Doshi said. “It strikes an ideal balance for agentic coding, production-ready systems and responsive interactive applications.”

    Early adoption by specialized firms suggests the model holds up in high-stakes fields. Harvey, an AI platform for law firms, reported a 7% jump in reasoning on its internal 'BigLaw Bench,' while Resemble AI found that Gemini 3 Flash could process complex forensic data for deepfake detection 4x faster than Gemini 2.5 Pro. These aren't just speed gains; they enable 'near real-time' workflows that were previously impossible.

    More efficient at a lower cost

    Enterprise AI builders have become more aware of the cost of running AI models, especially as they try to convince stakeholders to put more budget into agentic workflows that run on expensive models. Organizations have turned to smaller or distilled models, focusing on open models or other research and prompting techniques to help manage bloated AI costs.

    For enterprises, the biggest value proposition for Gemini 3 Flash is that it offers the same level of advanced multimodal capabilities, such as complex video analysis and data extraction, as its larger Gemini counterparts, but is far faster and cheaper. 

    While Google’s internal materials highlight a 3x speed increase over the 2.5 Pro series, data from independent benchmarking firm Artificial Analysis adds a layer of crucial nuance.

    In the latter organization's pre-release testing, Gemini 3 Flash Preview recorded a raw throughput of 218 output tokens per second. This makes it 22% slower than the previous 'non-reasoning' Gemini 2.5 Flash, but it is still significantly faster than rivals including OpenAI's GPT-5.1 high (125 t/s) and DeepSeek V3.2 reasoning (30 t/s).

    Most notably, Artificial Analysis crowned Gemini 3 Flash as the new leader in its AA-Omniscience knowledge benchmark, where it achieved the highest knowledge accuracy of any model tested to date. However, this intelligence comes with a 'reasoning tax': the model more than doubles its token usage compared to the 2.5 Flash series when tackling complex tasks.

    This high token density is offset by Google's aggressive pricing: when accessing through the Gemini API, Gemini 3 Flash costs $0.50 per 1 million input tokens, compared to $1.25/1M input tokens for Gemini 2.5 Pro, and $3/1M output tokens, compared to $10/1M output tokens for Gemini 2.5 Pro. This allows Gemini 3 Flash to claim the title of the most cost-efficient model for its intelligence tier, despite being one of the most 'talkative' models in terms of raw token volume. Here's how it stacks up to rival LLM offerings:

    Model                          | Input (/1M) | Output (/1M) | Total Cost | Source
    Qwen 3 Turbo                   | $0.05       | $0.20        | $0.25      | Alibaba Cloud
    Grok 4.1 Fast (reasoning)      | $0.20       | $0.50        | $0.70      | xAI
    Grok 4.1 Fast (non-reasoning)  | $0.20       | $0.50        | $0.70      | xAI
    deepseek-chat (V3.2-Exp)       | $0.28       | $0.42        | $0.70      | DeepSeek
    deepseek-reasoner (V3.2-Exp)   | $0.28       | $0.42        | $0.70      | DeepSeek
    Qwen 3 Plus                    | $0.40       | $1.20        | $1.60      | Alibaba Cloud
    ERNIE 5.0                      | $0.85       | $3.40        | $4.25      | Qianfan
    Gemini 3 Flash Preview         | $0.50       | $3.00        | $3.50      | Google
    Claude Haiku 4.5               | $1.00       | $5.00        | $6.00      | Anthropic
    Qwen-Max                       | $1.60       | $6.40        | $8.00      | Alibaba Cloud
    Gemini 3 Pro (≤200K)           | $2.00       | $12.00       | $14.00     | Google
    GPT-5.2                        | $1.75       | $14.00       | $15.75     | OpenAI
    Claude Sonnet 4.5              | $3.00       | $15.00       | $18.00     | Anthropic
    Gemini 3 Pro (>200K)           | $4.00       | $18.00       | $22.00     | Google
    Claude Opus 4.5                | $5.00       | $25.00       | $30.00     | Anthropic
    GPT-5.2 Pro                    | $21.00      | $168.00      | $189.00    | OpenAI

    More ways to save

    But enterprise developers and users can cut costs further by avoiding the overthinking that larger models are prone to, which racks up token usage. Google said the model “is able to modulate how much it thinks,” using more thinking, and therefore more tokens, for complex tasks than for quick prompts. The company noted Gemini 3 Flash uses 30% fewer tokens than Gemini 2.5 Pro.

    To balance this new reasoning power with strict corporate latency requirements, Google has introduced a 'Thinking Level' parameter. Developers can toggle between 'Low'—to minimize cost and latency for simple chat tasks—and 'High'—to maximize reasoning depth for complex data extraction. This granular control allows teams to build 'variable-speed' applications that only consume expensive 'thinking tokens' when a problem actually demands deep reasoning.

    The economic story extends beyond simple token prices. With the standard inclusion of Context Caching, enterprises processing massive, static datasets—such as entire legal libraries or codebase repositories—can see a 90% reduction in costs for repeated queries. When combined with the Batch API’s 50% discount, the total cost of ownership for a Gemini-powered agent drops significantly below the threshold of competing frontier models.
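
    To make the arithmetic concrete, here is a rough cost sketch that combines the per-token prices in the table above with the caching and batch discounts just mentioned; the workload figures (one million requests at 5,000 input and 1,000 output tokens each, with 80% of input served from cache) are illustrative assumptions, not published numbers.

        # Back-of-the-envelope cost comparison using the per-million-token prices quoted
        # above. The workload itself is an assumed example, not a published figure.
        def workload_cost(in_price, out_price, requests=1_000_000,
                          in_tokens=5_000, out_tokens=1_000,
                          cached_share=0.8, cache_discount=0.9, batch_discount=0.0):
            in_total = requests * in_tokens / 1e6    # millions of input tokens
            out_total = requests * out_tokens / 1e6  # millions of output tokens
            cached = in_total * cached_share * in_price * (1 - cache_discount)
            fresh = in_total * (1 - cached_share) * in_price
            cost = cached + fresh + out_total * out_price
            return cost * (1 - batch_discount)

        flash = workload_cost(0.50, 3.00, batch_discount=0.5)  # Gemini 3 Flash via Batch API
        pro = workload_cost(2.00, 12.00)                       # Gemini 3 Pro (≤200K), no batching
        print(f"Gemini 3 Flash (cached + batched): ${flash:,.0f}")   # ~$1,850
        print(f"Gemini 3 Pro (cached, on-demand):  ${pro:,.0f}")     # ~$14,800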

    “Gemini 3 Flash delivers exceptional performance on coding and agentic tasks combined with a lower price point, allowing teams to deploy sophisticated reasoning across high-volume processes without hitting cost barriers,” Google said.

    By offering a model that delivers strong multimodal performance at a more affordable price, Google is making the case that enterprises concerned with controlling their AI spend should choose its models, especially Gemini 3 Flash. 

    Strong benchmark performance 

    But how does Gemini 3 Flash stack up against other models in terms of its performance? 

    Doshi said the model achieved a score of 78% on SWE-Bench Verified, a benchmark for coding agents, outperforming both the preceding Gemini 2.5 family and even the larger Gemini 3 Pro.

    For enterprises, this means high-volume software maintenance and bug-fixing tasks can now be offloaded to a model that is both faster and cheaper than previous flagship models, without a degradation in code quality.

    The model also performed strongly on other benchmarks, scoring 81.2% on the MMMU Pro benchmark, comparable to Gemini 3 Pro. 

    While most Flash-type models are explicitly optimized for short, quick tasks like generating code, Google claims Gemini 3 Flash’s performance “in reasoning, tool use and multimodal capabilities is ideal for developers looking to do more complex video analysis, data extraction and visual Q&A, which means it can enable more intelligent applications — like in-game assistants or A/B test experiments — that demand both quick answers and deep reasoning.”

    First impressions from early users

    So far, early users have been largely impressed with the model, particularly its benchmark performance. 

    What it means for enterprise AI usage

    With Gemini 3 Flash now serving as the default engine across Google Search and the Gemini app, we are witnessing the "Flash-ification" of frontier intelligence. By making Pro-level reasoning the new baseline, Google is setting a trap for slower incumbents.

    The integration into platforms like Google Antigravity suggests that Google isn't just selling a model; it's selling the infrastructure for the autonomous enterprise.

    As developers hit the ground running with 3x faster speeds and a 90% discount on context caching, the "Gemini-first" strategy becomes a compelling financial argument. In the high-velocity race for AI dominance, Gemini 3 Flash may be the model that finally turns "vibe coding" from an experimental hobby into a production-ready reality.

  • Zoom says it aced AI’s hardest exam. Critics say it copied off its neighbors.

    Zoom Video Communications, the company best known for keeping remote workers connected during the pandemic, announced last week that it had achieved the highest score ever recorded on one of artificial intelligence's most demanding tests — a claim that sent ripples of surprise, skepticism, and genuine curiosity through the technology industry.

    The San Jose-based company said its AI system scored 48.1 percent on the Humanity's Last Exam, a benchmark designed by subject-matter experts worldwide to stump even the most advanced AI models. That result edges out Google's Gemini 3 Pro, which held the previous record at 45.8 percent.

    "Zoom has achieved a new state-of-the-art result on the challenging Humanity's Last Exam full-set benchmark, scoring 48.1%, which represents a substantial 2.3% improvement over the previous SOTA result," wrote Xuedong Huang, Zoom's chief technology officer, in a blog post.

    The announcement raises a provocative question that has consumed AI watchers for days: How did a video conferencing company — one with no public history of training large language models — suddenly vault past Google, OpenAI, and Anthropic on a benchmark built to measure the frontiers of machine intelligence?

    The answer reveals as much about where AI is headed as it does about Zoom's own technical ambitions. And depending on whom you ask, it's either an ingenious demonstration of practical engineering or a hollow claim that appropriates credit for others' work.

    How Zoom built an AI traffic controller instead of training its own model

    Zoom did not train its own large language model. Instead, the company developed what it calls a "federated AI approach" — a system that routes queries to multiple existing models from OpenAI, Google, and Anthropic, then uses proprietary software to select, combine, and refine their outputs.

    At the heart of this system sits what Zoom calls its "Z-scorer," a mechanism that evaluates responses from different models and chooses the best one for any given task. The company pairs this with what it describes as an "explore-verify-federate strategy," an agentic workflow that balances exploratory reasoning with verification across multiple AI systems.

    "Our federated approach combines Zoom's own small language models with advanced open-source and closed-source models," Huang wrote. The framework "orchestrates diverse models to generate, challenge, and refine reasoning through dialectical collaboration."

    In simpler terms: Zoom built a sophisticated traffic controller for AI, not the AI itself.
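
    A schematic of the routing pattern being described: fan a question out to several providers, score each candidate answer, and return the best one. The ask_model and score functions below are placeholders, not Zoom's Z-scorer.

        # Federation sketch: query multiple providers in parallel, score the candidate
        # answers, and keep the winner. Both helpers are placeholders for real calls.
        from concurrent.futures import ThreadPoolExecutor

        PROVIDERS = ["gpt", "gemini", "claude"]

        def ask_model(provider: str, question: str) -> str:
            """Placeholder for a real API call to the named provider."""
            return f"[{provider}] draft answer to: {question}"

        def score(question: str, answer: str) -> float:
            """Placeholder scorer; a real system might use a verifier model or rule checks."""
            return float(len(answer))  # stand-in heuristic only

        def federated_answer(question: str) -> str:
            with ThreadPoolExecutor() as pool:
                candidates = list(pool.map(lambda p: ask_model(p, question), PROVIDERS))
            return max(candidates, key=lambda a: score(question, a))

        print(federated_answer("Summarize the action items from today's meeting."))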

    This distinction matters enormously in an industry where bragging rights — and billions in valuation — often hinge on who can claim the most capable model. The major AI laboratories spend hundreds of millions of dollars training frontier systems on vast computing clusters. Zoom's achievement, by contrast, appears to rest on clever integration of those existing systems.

    Why AI researchers are divided over what counts as real innovation

    The response from the AI community was swift and sharply divided.

    Max Rumpf, an AI engineer who says he has trained state-of-the-art language models, posted a pointed critique on social media. "Zoom strung together API calls to Gemini, GPT, Claude et al. and slightly improved on a benchmark that delivers no value for their customers," he wrote. "They then claim SOTA."

    Rumpf did not dismiss the technical approach itself. Using multiple models for different tasks, he noted, is "actually quite smart and most applications should do this." He pointed to Sierra, an AI customer service company, as an example of this multi-model strategy executed effectively.

    His objection was more specific: "They did not train the model, but obfuscate this fact in the tweet. The injustice of taking credit for the work of others sits deeply with people."

    But other observers saw the achievement differently. Hongcheng Zhu, a developer, offered a more measured assessment: "To top an AI eval, you will most likely need model federation, like what Zoom did. An analogy is that every Kaggle competitor knows you have to ensemble models to win a contest."

    The comparison to Kaggle — the competitive data science platform where combining multiple models is standard practice among winning teams — reframes Zoom's approach as industry best practice rather than sleight of hand. Academic research has long established that ensemble methods routinely outperform individual models.

    Still, the debate exposed a fault line in how the industry understands progress. Ryan Pream, founder of Exoria AI, was dismissive: "Zoom are just creating a harness around another LLM and reporting that. It is just noise." Another commenter captured the sheer unexpectedness of the news: "That the video conferencing app ZOOM developed a SOTA model that achieved 48% HLE was not on my bingo card."

    Perhaps the most pointed critique concerned priorities. Rumpf argued that Zoom could have directed its resources toward problems its customers actually face. "Retrieval over call transcripts is not 'solved' by SOTA LLMs," he wrote. "I figure Zoom's users would care about this much more than HLE."

    The Microsoft veteran betting his reputation on a different kind of AI

    If Zoom's benchmark result seemed to come from nowhere, its chief technology officer did not.

    Xuedong Huang joined Zoom from Microsoft, where he spent decades building the company's AI capabilities. He founded Microsoft's speech technology group in 1993 and led teams that achieved what the company described as human parity in speech recognition, machine translation, natural language understanding, and computer vision.

    Huang holds a Ph.D. in electrical engineering from the University of Edinburgh. He is an elected member of the National Academy of Engineering and the American Academy of Arts and Sciences, as well as a fellow of both the IEEE and the ACM. His credentials place him among the most accomplished AI executives in the industry.

    His presence at Zoom signals that the company's AI ambitions are serious, even if its methods differ from the research laboratories that dominate headlines. In his tweet celebrating the benchmark result, Huang framed the achievement as validation of Zoom's strategy: "We have unlocked stronger capabilities in exploration, reasoning, and multi-model collaboration, surpassing the performance limits of any single model."

    That final clause — "surpassing the performance limits of any single model" — may be the most significant. Huang is not claiming Zoom built a better model. He is claiming Zoom built a better system for using models.

    Inside the test designed to stump the world's smartest machines

    The benchmark at the center of this controversy, Humanity's Last Exam, was designed to be exceptionally difficult. Unlike earlier tests that AI systems learned to game through pattern matching, HLE presents problems that require genuine understanding, multi-step reasoning, and the synthesis of information across complex domains.

    The exam draws on questions from experts around the world, spanning fields from advanced mathematics to philosophy to specialized scientific knowledge. A score of 48.1 percent might sound unimpressive to anyone accustomed to school grading curves, but in the context of HLE, it represents the current ceiling of machine performance.

    "This benchmark was developed by subject-matter experts globally and has become a crucial metric for measuring AI's progress toward human-level performance on challenging intellectual tasks," Zoom’s announcement noted.

    The company's improvement of 2.3 percentage points over Google's previous best may appear modest in isolation. But in competitive benchmarking, where gains often come in fractions of a percent, such a jump commands attention.

    What Zoom's approach reveals about the future of enterprise AI

    Zoom's approach carries implications that extend well beyond benchmark leaderboards. The company is signaling a vision for enterprise AI that differs fundamentally from the model-centric strategies pursued by OpenAI, Anthropic, and Google.

    Rather than betting everything on building the single most capable model, Zoom is positioning itself as an orchestration layer — a company that can integrate the best capabilities from multiple providers and deliver them through products that businesses already use every day.

    This strategy hedges against a critical uncertainty in the AI market: no one knows which model will be best next month, let alone next year. By building infrastructure that can swap between providers, Zoom avoids vendor lock-in while theoretically offering customers the best available AI for any given task.

    The announcement of OpenAI's GPT-5.2 the following day underscored this dynamic. OpenAI's own communications named Zoom as a partner that had evaluated the new model's performance "across their AI workloads and saw measurable gains across the board." Zoom, in other words, is both a customer of the frontier labs and now a competitor on their benchmarks — using their own technology.

    This arrangement may prove sustainable. The major model providers have every incentive to sell API access widely, even to companies that might aggregate their outputs. The more interesting question is whether Zoom's orchestration capabilities constitute genuine intellectual property or merely sophisticated prompt engineering that others could replicate.

    The real test arrives when Zoom's 300 million users start asking questions

    Zoom titled its announcement section on industry relations "A Collaborative Future," and Huang struck notes of gratitude throughout. "The future of AI is collaborative, not competitive," he wrote. "By combining the best innovations from across the industry with our own research breakthroughs, we create solutions that are greater than the sum of their parts."

    This framing positions Zoom as a beneficent integrator, bringing together the industry's best work for the benefit of enterprise customers. Critics see something else: a company claiming the prestige of an AI laboratory without doing the foundational research that earns it.

    The debate will likely be settled not by leaderboards but by products. When AI Companion 3.0 reaches Zoom's hundreds of millions of users in the coming months, they will render their own verdict — not on benchmarks they have never heard of, but on whether the meeting summary actually captured what mattered, whether the action items made sense, whether the AI saved them time or wasted it.

    In the end, Zoom's most provocative claim may not be that it topped a benchmark. It may be the implicit argument that in the age of AI, the best model is not the one you build — it's the one you know how to use.

  • Zencoder drops Zenflow, a free AI orchestration tool that pits Claude against OpenAI’s models to catch coding errors

    Zencoder, the Silicon Valley startup that builds AI-powered coding agents, released a free desktop application on Monday that it says will fundamentally change how software engineers interact with artificial intelligence — moving the industry beyond the freewheeling era of "vibe coding" toward a more disciplined, verifiable approach to AI-assisted development.

    The product, called Zenflow, introduces what the company describes as an "AI orchestration layer" that coordinates multiple AI agents to plan, implement, test, and review code in structured workflows. The launch is Zencoder's most ambitious attempt yet to differentiate itself in an increasingly crowded market dominated by tools like Cursor, GitHub Copilot, and coding agents built directly by AI giants Anthropic, OpenAI, and Google.

    "Chat UIs were fine for copilots, but they break down when you try to scale," said Andrew Filev, Zencoder's chief executive, in an exclusive interview with VentureBeat. "Teams are hitting a wall where speed without structure creates technical debt. Zenflow replaces 'Prompt Roulette' with an engineering assembly line where agents plan, implement, and, crucially, verify each other's work."

    The announcement arrives at a critical moment for enterprise software development. Companies across industries have poured billions of dollars into AI coding tools over the past two years, hoping to dramatically accelerate their engineering output. Yet the promised productivity revolution has largely failed to materialize at scale.

    Why AI coding tools have failed to deliver on their 10x productivity promise

    Filev, who previously founded and sold the project management company Wrike to Citrix, pointed to a growing disconnect between AI coding hype and reality. While vendors have promised tenfold productivity gains, rigorous studies — including research from Stanford University — consistently show improvements closer to 20 percent.

    "If you talk to real engineering leaders, I don't remember a single conversation where somebody vibe coded themselves to 2x or 5x or 10x productivity on serious engineering production," Filev said. "The typical number you would hear would be about 20 percent."

    The problem, according to Filev, lies not with the AI models themselves but with how developers interact with them. The standard approach of typing requests into a chat interface and hoping for usable code works well for simple tasks but falls apart on complex enterprise projects.

    Zencoder's internal engineering team claims to have cracked a different approach. Filev said the company now operates at roughly twice the velocity it achieved 12 months ago, not primarily because AI models improved, but because the team restructured its development processes.

    "We had to change our process and use a variety of different best practices," he said.

    Inside the four pillars that power Zencoder's AI orchestration platform

    Zenflow organizes its approach around four core capabilities that Zencoder argues any serious AI orchestration platform must support.

    Structured workflows replace ad-hoc prompting with repeatable sequences (plan, implement, test, review) that agents follow consistently. Filev drew parallels to his experience building Wrike, noting that individual to-do lists rarely scale across organizations, while defined workflows create predictable outcomes.

    Spec-driven development requires AI agents to first generate a technical specification, then create a step-by-step plan, and only then write code. The approach became so effective that frontier AI labs including Anthropic and OpenAI have since trained their models to follow it automatically. The specification anchors agents to clear requirements, preventing what Zencoder calls "iteration drift," or the tendency for AI-generated code to gradually diverge from the original intent.

    Multi-agent verification deploys different AI models to critique each other's work. Because AI models from the same family tend to share blind spots, Zencoder routes verification tasks across model providers, asking Claude to review code written by OpenAI's models, or vice versa.

    "Think of it as a second opinion from a doctor," Filev told VentureBeat. "With the right pipeline, we see results on par with what you'd expect from Claude 5 or GPT-6. You're getting the benefit of a next-generation model today."

    Parallel execution lets developers run multiple AI agents simultaneously in isolated sandboxes, preventing them from interfering with each other's work. The interface provides a command center for monitoring this fleet, a significant departure from the current practice of managing multiple terminal windows.
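
    A minimal sketch of what a plan-implement-verify loop with cross-provider review can look like in practice; the generate helper and provider names are placeholders, not Zenflow's internals.

        # Structured workflow sketch: spec -> plan -> implement, then have a model from
        # a different provider review the result. generate() is a placeholder for real
        # LLM calls; this is not Zenflow's implementation.
        def generate(provider: str, role: str, prompt: str) -> str:
            """Placeholder for an LLM call to the named provider acting in a role."""
            return f"[{provider}/{role}] output for: {prompt[:60]}"

        def run_task(task: str, builder: str = "claude", reviewer: str = "gpt",
                     max_rounds: int = 3) -> str:
            spec = generate(builder, "spec", f"Write a technical spec for: {task}")
            plan = generate(builder, "plan", f"Break this spec into steps:\n{spec}")
            code = generate(builder, "implement", f"Implement step by step:\n{plan}")
            for _ in range(max_rounds):
                # Second opinion from a different model family to avoid shared blind spots.
                review = generate(reviewer, "review", f"Find defects in:\n{code}")
                if "no defects" in review.lower():
                    break
                code = generate(builder, "fix", f"Address this review:\n{review}\n{code}")
            return code

        print(run_task("add rate limiting to the /login endpoint"))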

    How verification solves AI coding's biggest reliability problem

    Zencoder's emphasis on verification addresses one of the most persistent criticisms of AI-generated code: its tendency to produce "slop," or code that appears correct but fails in production or degrades over successive iterations.

    The company's internal research found that developers who skip verification often fall into what Filev called a "death loop." An AI agent completes a task successfully, but the developer, reluctant to review unfamiliar code, moves on without understanding what was written. When subsequent tasks fail, the developer lacks the context to fix problems manually and instead keeps prompting the AI for solutions.

    "They literally spend more than a day in that death loop," Filev said. "That's why the productivity is not 2x, because they were running at 3x first, and then they wasted the whole day."

    The multi-agent verification approach also gives Zencoder an unusual competitive advantage over the frontier AI labs themselves. While Anthropic, OpenAI, and Google each optimize their own models, Zencoder can mix and match across providers to reduce bias.

    "This is a rare situation where we have an edge on the frontier labs," Filev said. "Most of the time they have an edge on us, but this is a rare case."

    Zencoder faces steep competition from AI giants and well-funded startups

    Zencoder enters the AI orchestration market at a moment of intense competition. The company has positioned itself as a model-agnostic platform, supporting major providers including Anthropic, OpenAI, and Google Gemini. In September, Zencoder expanded its platform to let developers use command-line coding agents from any provider within its interface.

    That strategy reflects a pragmatic acknowledgment that developers increasingly maintain relationships with multiple AI providers rather than committing exclusively to one. Zencoder's universal platform approach lets it serve as the orchestration layer regardless of which underlying models a company prefers.

    The company also emphasizes enterprise readiness, touting SOC 2 Type II, ISO 27001, and ISO 42001 certifications along with GDPR compliance. These credentials matter for regulated industries like financial services and healthcare, where compliance requirements can block adoption of consumer-oriented AI tools.

    But Zencoder faces formidable competition from multiple directions. Cursor and Windsurf have built dedicated AI-first code editors with devoted user bases. GitHub Copilot benefits from Microsoft's distribution muscle and deep integration with the world's largest code repository. And the frontier AI labs continue expanding their own coding capabilities.

    Filev dismissed concerns about competition from the AI labs, arguing that smaller players like Zencoder can move faster on user experience innovation.

    "I'm sure they will come to the same conclusion, and they're smart and moving fast, so I'm sure they will catch up fairly quickly," he said. "That's why I said in the next six to 12 months, you're going to see a lot of this propagating through the whole space."

    The case for adopting AI orchestration now instead of waiting for better models

    Technical executives weighing AI coding investments face a difficult timing question: Should they adopt orchestration tools now, or wait for frontier AI labs to build these capabilities natively into their models?

    Filev argued that waiting carries significant competitive risk.

    "Right now, everybody is under pressure to deliver more in less time, and everybody expects engineering leaders to deliver results from AI," he said. "As a founder and CEO, I do not expect 20 percent from my VP of engineering. I expect 2x."

    He also questioned whether the major AI labs will prioritize orchestration capabilities when their core business remains model development.

    "In the ideal world, frontier labs should be building the best-ever models and competing with each other, and Zencoders and Cursors need to build the best-ever UI and UX application layer on top of those models," Filev said. "I don't see a world where OpenAI will offer you our code verifier, or vice versa."

    Zenflow launches as a free desktop application, with updated plugins available for Visual Studio Code and JetBrains integrated development environments. The product supports what Zencoder calls "dynamic workflows," meaning the system automatically adjusts process complexity based on whether a human is actively monitoring and on the difficulty of the task at hand.

    Zencoder said internal testing showed that replacing standard prompting with Zenflow's orchestration layer improved code correctness by approximately 20 percent on average.

    What Zencoder's bet on orchestration reveals about the future of AI coding

    Zencoder frames Zenflow as the first product in what it expects to become a significant new software category. The company believes every vendor focused on AI coding will eventually arrive at similar conclusions about the need for orchestration tools.

    "I think the next six to 12 months will be all about orchestration," Filev predicted. "A lot of organizations will finally reach that 2x. Not 10x yet, but at least the 2x they were promised a year ago."

    Rather than competing head-to-head with frontier AI labs on model quality, Zencoder is betting that the application layer (the software that helps developers actually use these models effectively) will determine winners and losers.

    It is, Filev suggested, a familiar pattern from technology history.

    "This is very similar to what I observed when I started Wrike," he said. "As work went digital, people relied on email and spreadsheets to manage everything, and neither could keep up."

    The same dynamic, he argued, now applies to AI coding. Chat interfaces were designed for conversation, not for orchestrating complex engineering workflows. Whether Zencoder can establish itself as the essential layer between developers and AI models before the giants build their own solutions remains an open question.

    But Filev seems comfortable with the race. The last time he spotted a gap between how people worked and the tools they had to work with, he built a company worth over a billion dollars.

    Zenflow is available immediately as a free download at zencoder.ai/zenflow.

  • Bolmo’s architecture unlocks efficient byte‑level LM training without sacrificing quality

    Enterprises that want tokenizer-free multilingual models are increasingly turning to byte-level language models to reduce brittleness in noisy or low-resource text. To tap into that niche — and make it practical at scale — the Allen Institute for AI (Ai2) introduced Bolmo, a new family of models that leverages its Olmo 3 models by “byteifying” them and reusing their backbone and capabilities.

    The company launched two versions, Bolmo 7B and Bolmo 1B; Ai2 calls Bolmo “the first fully open byte-level language model.” The company said the two models performed competitively with — and in some cases surpassed — other byte-level and character-based models.

    Byte-level language models operate directly on raw UTF-8 bytes, eliminating the need for a predefined vocabulary or tokenizer. This allows them to handle misspellings, rare languages, and unconventional text more reliably — key requirements for moderation, edge deployments, and multilingual applications.
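    A quick illustration of what “operating on raw bytes” means in practice: any string, however noisy or multilingual, maps losslessly onto values 0-255, so there is no fixed vocabulary for rare words or misspellings to fall outside of.

    ```python
    # Byte-level models treat raw UTF-8 bytes as the input sequence, so every
    # symbol, including non-ASCII characters and typos, becomes one or more of
    # the same 256 byte values.

    text = "Héllo, wörld"                    # non-ASCII, "misspelled" input
    byte_ids = list(text.encode("utf-8"))
    print(byte_ids)                          # integers in the range 0-255
    print(bytes(byte_ids).decode("utf-8"))   # losslessly recoverable
    ```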

    For enterprises deploying AI across multiple languages, noisy user inputs, or constrained environments, tokenizer-free models offer a way to reduce operational complexity. Ai2’s Bolmo is an attempt to make that approach practical at scale — without retraining from scratch.

    How Bolmo works and how it was built 

    Ai2 said it trained the Bolmo models on its Dolma 3 data mix, the same mix used to train its flagship Olmo models, supplemented with open code datasets and character-level data.

    The company said its goal “is to provide a reproducible, inspectable blueprint for byteifying strong subword language models in a way the community can adopt and extend.” To meet this goal, Ai2 will release its checkpoints, code, and a full paper to help other organizations build byte-level models on top of its Olmo ecosystem. 

    Since training a byte-level model completely from scratch can get expensive, Ai2 researchers instead chose an existing Olmo 3 7B checkpoint to byteify in two stages. 

    In the first stage, Ai2 froze the Olmo 3 transformer so that only certain components are trained: the local encoder and decoder, the boundary predictor, and the language modeling head. This stage was designed to be “cheap and fast,” requiring just 9.8 billion tokens.

    The next stage unfreezes the model and trains it with additional tokens. Ai2 said the byte-level approach allows Bolmo to avoid the vocabulary bottlenecks that limit traditional subword models.
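    A hypothetical PyTorch sketch of that two-stage recipe is below. The module names (backbone, local encoder/decoder, boundary predictor, LM head) are stand-ins taken from the description above, not Ai2's actual classes.

    ```python
    # Hypothetical sketch of the freeze-then-unfreeze recipe. Attribute names are
    # illustrative stand-ins for the components described in the article.

    import torch.nn as nn

    def stage_one_params(model: nn.Module):
        # Freeze the pretrained subword backbone...
        for p in model.backbone.parameters():
            p.requires_grad = False
        # ...and train only the new byte-level components.
        new_modules = [model.local_encoder, model.local_decoder,
                       model.boundary_predictor, model.lm_head]
        return [p for m in new_modules for p in m.parameters()]

    def stage_two_params(model: nn.Module):
        # Unfreeze everything for full training on additional tokens.
        for p in model.parameters():
            p.requires_grad = True
        return list(model.parameters())
    ```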

    Strong performance among its peers

    Byte-level language models are not as mainstream as small language models or LLMs, but this is a growing field in research. Meta released its BLT architecture research last year, aiming to offer a model that is robust, processes raw data, and doesn’t rely on fixed vocabularies. 

    Other research models in this space include ByT5, Stanford’s MrT5, and Canine.  

    Ai2 evaluated Bolmo using its evaluation suite, covering math, STEM reasoning, question answering, general knowledge, and code. 

    Bolmo 7B showed strong performance, performing well on character-focused benchmarks like CUTE and EXECUTE, and also improving accuracy over the base Olmo 3 model.

    Bolmo 7B outperformed models of comparable size in coding, math, multiple-choice QA, and character-level understanding. 

    Why enterprises may choose byte-level models

    Enterprises increasingly find value in hybrid model setups, using a mix of models and model sizes.

    Ai2 makes the case that organizations should also consider byte-level models, not only for robustness and multilingual understanding but also because the approach “naturally plugs into an existing model ecosystem.”

    “A key advantage of the dynamic hierarchical setup is that compression becomes a toggleable knob,” the company said.

    For enterprises already running heterogeneous model stacks, Bolmo suggests that byte-level models may no longer be purely academic. By retrofitting a strong subword model rather than training from scratch, Ai2 is signaling a lower-risk path for organizations that want robustness without abandoning existing infrastructure.

  • Korean AI startup Motif reveals 4 big lessons for training enterprise LLMs

    We've heard (and written, here at VentureBeat) a lot about the generative AI race between the U.S. and China, the two countries home to the groups most active in fielding new models (with a shoutout to Cohere in Canada and Mistral in France).

    But now a Korean startup is making waves: last week, Motif Technologies released Motif-2-12.7B-Reasoning, another small-parameter, open-weight model that boasts impressive benchmark scores and has quickly become the most performant model from that country, according to independent benchmarking lab Artificial Analysis (beating even standard GPT-5.1 from U.S. leader OpenAI).

    But more importantly for enterprise AI teams, the company has published a white paper on arxiv.org with a concrete, reproducible training recipe that exposes where reasoning performance actually comes from — and where common internal LLM efforts tend to fail.

    For organizations building or fine-tuning their own models behind the firewall, the paper offers a set of practical lessons about data alignment, long-context infrastructure, and reinforcement learning stability that are directly applicable to enterprise environments. Here they are:

    1. Reasoning gains come from data distribution, not model size

    One of Motif’s most relevant findings for enterprise teams is that synthetic reasoning data only helps when its structure matches the target model’s reasoning style.

    The paper shows measurable differences in downstream coding performance depending on which “teacher” model generated the reasoning traces used during supervised fine-tuning.

    For enterprises, this undermines a common shortcut: generating large volumes of synthetic chain-of-thought data from a frontier model and assuming it will transfer cleanly. Motif’s results suggest that misaligned reasoning traces can actively hurt performance, even if they look high quality.

    The takeaway is operational, not academic: teams should validate that their synthetic data reflects the format, verbosity, and step granularity they want at inference time. Internal evaluation loops matter more than copying external datasets.
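    In practice, that validation can start with simple gating on the traces themselves. The sketch below checks step count and rough length against a target band; the thresholds are arbitrary placeholders, and a real pipeline would also compare format and style against held-out examples of the target model's own reasoning.

    ```python
    # Illustrative sanity check on synthetic reasoning traces before fine-tuning:
    # verify that verbosity and step granularity fall inside the band you actually
    # want at inference time. Thresholds are placeholders, not Motif's values.

    def trace_ok(trace: str, min_steps: int = 3, max_tokens: int = 2048) -> bool:
        steps = [line for line in trace.splitlines() if line.strip()]
        approx_tokens = len(trace.split())  # crude whitespace token count
        return len(steps) >= min_steps and approx_tokens <= max_tokens

    def filter_traces(traces):
        kept = [t for t in traces if trace_ok(t)]
        print(f"kept {len(kept)}/{len(traces)} traces")
        return kept
    ```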

    2. Long-context training is an infrastructure problem first

    Motif trains at 64K context, but the paper makes clear that this is not simply a tokenizer or checkpointing tweak.

    The model relies on hybrid parallelism, careful sharding strategies, and aggressive activation checkpointing to make long-context training feasible on Nvidia H100-class hardware.
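    Of those techniques, activation checkpointing is the easiest to illustrate: recompute intermediate activations during the backward pass instead of storing them, trading compute for the memory headroom that 64K-token sequences demand. A minimal PyTorch example (not Motif's code) follows.

    ```python
    # Minimal example of activation checkpointing, one of the memory techniques
    # named above. Activations inside the wrapped module are recomputed on the
    # backward pass rather than stored, reducing peak memory.

    import torch
    from torch.utils.checkpoint import checkpoint

    class Block(torch.nn.Module):
        def __init__(self, dim: int = 512):
            super().__init__()
            self.ff = torch.nn.Sequential(
                torch.nn.Linear(dim, 4 * dim),
                torch.nn.GELU(),
                torch.nn.Linear(4 * dim, dim),
            )

        def forward(self, x):
            # Intermediate activations of `self.ff` are not saved; they are
            # recomputed when gradients are needed.
            return x + checkpoint(self.ff, x, use_reentrant=False)
    ```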

    For enterprise builders, the message is sobering but useful: long-context capability cannot be bolted on late.

    If retrieval-heavy or agentic workflows are core to the business use case, context length has to be designed into the training stack from the start. Otherwise, teams risk expensive retraining cycles or unstable fine-tunes.

    3. RL fine-tuning fails without data filtering and reuse

    Motif’s reinforcement learning fine-tuning (RLFT) pipeline emphasizes difficulty-aware filtering — keeping tasks whose pass rates fall within a defined band — rather than indiscriminately scaling reward training.

    This directly addresses a pain point many enterprise teams encounter when experimenting with RL: performance regressions, mode collapse, or brittle gains that vanish outside benchmarks. Motif also reuses trajectories across policies and expands clipping ranges, trading theoretical purity for training stability.
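    The difficulty-aware filtering itself is simple to express. The sketch below keeps only tasks whose measured pass rate falls inside a target band; the band values are illustrative, not Motif's published numbers.

    ```python
    # Sketch of difficulty-aware filtering: keep tasks that are neither trivial
    # nor hopeless for the current policy, since tasks at 0% or 100% pass rate
    # contribute little learning signal.

    def filter_tasks(pass_rates: dict, low: float = 0.2, high: float = 0.8) -> list:
        """pass_rates maps task_id -> fraction of rollouts that pass its verifier."""
        return [task for task, rate in pass_rates.items() if low <= rate <= high]

    # Example: only the mid-difficulty tasks survive.
    print(filter_tasks({"t1": 0.0, "t2": 0.5, "t3": 1.0, "t4": 0.75}))  # ['t2', 't4']
    ```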

    The enterprise lesson is clear: RL is a systems problem, not just a reward model problem. Without careful filtering, reuse, and multi-task balancing, RL can destabilize models that are otherwise production-ready.

    4. Memory optimization determines what is even possible

    Motif’s use of kernel-level optimizations to reduce RL memory pressure highlights an often-overlooked constraint in enterprise settings: memory, not compute, is frequently the bottleneck. Techniques like loss-function-level optimization determine whether advanced training stages are viable at all.
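    One generic example of a loss-level memory optimization is computing the language-modeling cross-entropy in chunks, so the full logits matrix over the vocabulary is never materialized as a single tensor. The sketch below shows the idea only; it is not a reproduction of Motif's kernel work.

    ```python
    # Illustrative chunked cross-entropy: project hidden states to vocabulary
    # logits one slice at a time instead of all at once. A generic technique,
    # shown for intuition rather than as Motif's implementation.

    import torch
    import torch.nn.functional as F

    @torch.no_grad()  # evaluation-style sketch; training would pair chunking with recomputation
    def chunked_ce(hidden: torch.Tensor, lm_head_w: torch.Tensor,
                   targets: torch.Tensor, chunk: int = 1024) -> torch.Tensor:
        """hidden: [N, d], lm_head_w: [vocab, d], targets: [N] -> mean loss."""
        total = hidden.new_zeros(())
        for i in range(0, hidden.size(0), chunk):
            logits = hidden[i:i + chunk] @ lm_head_w.T   # one chunk of logits at a time
            total += F.cross_entropy(logits, targets[i:i + chunk], reduction="sum")
        return total / hidden.size(0)
    ```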

    For organizations running shared clusters or regulated environments, this reinforces the need for low-level engineering investment, not just model architecture experimentation.

    Why this matters for enterprise AI teams

    Motif-2-12.7B-Reasoning is positioned as competitive with much larger models, but its real value lies in the transparency of how those results were achieved. The paper argues — implicitly but persuasively — that reasoning performance is earned through disciplined training design, not model scale alone.

    For enterprises building proprietary LLMs, the lesson is pragmatic: invest early in data alignment, infrastructure, and training stability, or risk spending millions fine-tuning models that never reliably reason in production.

  • Why agentic AI needs a new category of customer data

    Presented by Twilio


    The customer data infrastructure powering most enterprises was architected for a world that no longer exists: one where marketing interactions could be captured and processed in batches, where campaign timing was measured in days (not milliseconds), and where "personalization" meant inserting a first name into an email template.

    Conversational AI has shattered those assumptions.

    To provide relevant guidance and effective resolution, AI agents need instant access to what a customer just said, the tone they used, their emotional state, and their complete history with a brand. This fast-moving stream of conversational signals (tone, urgency, intent, sentiment) represents a fundamentally different category of customer data. Yet the systems most enterprises rely on today were never designed to capture or deliver it at the speed modern customer experiences demand.
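    As a rough illustration of what this category of data looks like, a single conversational-memory record might carry fields like the following. The schema is hypothetical, not Twilio's.

    ```python
    # Hypothetical shape of one conversational-memory record. Field names are
    # assumptions for illustration, not Twilio's actual schema.

    from dataclasses import dataclass, field
    from datetime import datetime, timezone

    @dataclass
    class ConversationTurn:
        customer_id: str
        channel: str                 # e.g. "voice", "sms", "chat"
        utterance: str
        intent: str                  # e.g. "order_status"
        sentiment: float             # e.g. -1.0 (negative) to 1.0 (positive)
        urgency: str                 # e.g. "low" or "high"
        timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    ```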

    The conversational AI context gap

    The consequences of this architectural mismatch are already visible in customer satisfaction data. Twilio’s Inside the Conversational AI Revolution report reveals that more than half (54%) of consumers report AI rarely has context from their past interactions, and only 15% feel that human agents receive the full story after an AI handoff. The result: customer experiences defined by repetition, friction, and disjointed handoffs.

    The problem isn't a lack of customer data. Enterprises are drowning in it. The problem is that conversational AI requires real-time, portable memory of customer interactions, and few organizations have infrastructure capable of delivering it. Traditional CRMs and CDPs excel at capturing static attributes but weren't architected to handle the dynamic exchange of a conversation unfolding second by second.

    Solving this requires building conversational memory inside communications infrastructure itself, rather than attempting to bolt it onto legacy data systems through integrations.

    The agentic AI adoption wave and its limits

    This infrastructure gap is becoming critical as agentic AI moves from pilot to production. Nearly two-thirds of companies (63%) are already in late-stage development or fully deployed with conversational AI across sales and support functions.

    The reality check: While 90% of organizations believe customers are satisfied with their AI experiences, only 59% of consumers agree. The disconnect isn't about conversational fluency or response speed. It's about whether AI can demonstrate true understanding, respond with appropriate context, and actually solve problems rather than forcing escalation to human agents.

    Consider the gap: A customer calls about a delayed order. With proper conversational memory infrastructure, an AI agent could instantly recognize the customer, reference their previous order and the details of the delay, proactively suggest solutions, and offer appropriate compensation, all without asking them to repeat information. Most enterprises can't deliver this because the required data lives in separate systems that can't be accessed quickly enough.

    Where enterprise data architecture breaks down

    Enterprise data systems built for marketing and support were optimized for structured data and batch processing, not the dynamic memory required for natural conversation. Three fundamental limitations prevent these systems from supporting conversational AI:

    Latency breaks the conversational contract. When customer data lives in one system and conversations happen in another, every interaction requires API calls that introduce 200-500 millisecond delays, transforming natural dialogue into robotic exchanges.

    Conversational nuance gets lost. The signals that make conversations meaningful (tone, urgency, emotional state, commitments made mid-conversation) rarely make it into traditional CRMs, which were designed to capture structured data, not the unstructured richness AI needs.

    Data fragmentation creates experience fragmentation. AI agents operate in one system, human agents in another, marketing automation in a third, and customer data in a fourth, creating fractured experiences where context evaporates at every handoff.

    Conversational memory requires infrastructure where conversations and customer data are unified by design.

    What unified conversational memory enables

    Organizations treating conversational memory as core infrastructure are seeing clear competitive advantages:

    Seamless handoffs: When conversational memory is unified, human agents inherit complete context instantly, eliminating the "let me pull up your account" dead time that signals wasted interactions.

    Personalization at scale: While 88% of consumers expect personalized experiences, over half of businesses cite this as a top challenge. When conversational memory is native to communications infrastructure, agents can personalize based on what customers are trying to accomplish right now.

    Operational intelligence: Unified conversational memory provides real-time visibility into conversation quality and key performance indicators, with insights feeding back into AI models to improve quality continuously.

    Agentic automation: Perhaps most significantly, conversational memory transforms AI from a transactional tool to a genuinely agentic system capable of nuanced decisions, like rebooking a frustrated customer's flight while offering compensation calibrated to their loyalty tier.

    The infrastructure imperative

    The agentic AI wave is forcing a fundamental re-architecture of how enterprises think about customer data.

    The solution isn't iterating on existing CDP or CRM architecture. It's recognizing that conversational memory represents a distinct category requiring real-time capture, millisecond-level access, and preservation of conversational nuance that can only be met when data capabilities are embedded directly into communications infrastructure.

    Organizations approaching this as a systems integration challenge will find themselves at a disadvantage against competitors who treat conversational memory as foundational infrastructure. When memory is native to the platform powering every customer touchpoint, context travels with customers across channels, latency disappears, and continuous journeys become operationally feasible.

    The enterprises setting the pace aren't those with the most sophisticated AI models. They're the ones that solved the infrastructure problem first, recognizing that agentic AI can't deliver on its promise without a new category of customer data purpose-built for the speed, nuance, and continuity that conversational experiences demand.

    Robin Grochol is SVP of Product, Data, Identity & Security at Twilio.


    Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.