
  • AI is moving to the edge – and network security needs to catch up

    Presented by T-Mobile for Business


    Small and mid-sized businesses are adopting AI at a pace that would have seemed unrealistic even a few years ago. Smart assistants that greet customers, predictive tools that flag inventory shortages before they happen, and on-site analytics that help staff make decisions faster — these used to be features of the enterprise. Now they’re being deployed in retail storefronts, regional medical clinics, branch offices, and remote operations hubs.

    What’s changed is not just the AI itself, but where it runs. Increasingly, AI workloads are being pushed out of centralized data centers and into the real world — into the places where employees work and customers interact. This shift to the edge promises faster insights and more resilient operations, but it also transforms the demands placed on the network. Edge sites need consistent bandwidth, real-time data pathways, and the ability to process information locally rather than relying on the cloud for every decision.

    The catch is that as companies race to connect these locations, security often lags behind. A store may adopt AI-enabled cameras or sensors long before it has the policies to manage them. A clinic may roll out mobile diagnostic devices without fully segmenting their traffic. A warehouse may rely on a mix of Wi-Fi, wired, and cellular connections that weren’t designed to support AI-driven operations. When connectivity scales faster than security, it creates cracks — unmonitored devices, inconsistent access controls, and unsegmented data flows that make it hard to see what’s happening, let alone protect it.

    Edge AI only delivers its full value when connectivity and security evolve together.

    Why AI is moving to the edge — and what that breaks

    Businesses are shifting AI to the edge for three core reasons:

    • Real-time responsiveness: Some decisions can’t wait for a round trip to the cloud. Whether it’s identifying an item on a shelf, detecting an abnormal reading from a medical device, or recognizing a safety risk in a warehouse aisle, the delay introduced by centralized processing can mean missed opportunities or slow reactions.

    • Resilience and privacy: Keeping data and inference local makes operations less vulnerable to outages or latency spikes, and it reduces the flow of sensitive information across networks. This helps SMBs meet data sovereignty and compliance requirements without rewriting their entire infrastructure.

    • Mobility and deployment speed: Many SMBs operate across distributed footprints — remote workers, pop-up locations, seasonal operations, or mobile teams. Wireless-first connectivity, including 5G business lines, lets them deploy AI tools quickly without waiting for fixed circuits or expensive buildouts.

    Technologies like Edge Control from T-Mobile for Business fit naturally into this model. By routing traffic directly along the paths it needs — keeping latency-sensitive workloads local and bypassing the bottlenecks that traditional VPNs introduce — businesses can adopt edge AI without dragging their network into constant contention.

    Yet the shift introduces new risk. Every edge site becomes, in effect, its own small data center. A retail store may have cameras, sensors, POS systems, digital signage, and staff devices all sharing the same access point. A clinic may run diagnostic tools, tablets, wearables, and video consult systems side by side. A manufacturing floor might combine robotics, sensors, handheld scanners, and on-site analytics platforms.

    This diversity increases the attack surface dramatically. Many SMBs roll out connectivity first, then add piecemeal security later — leaving the blind spots attackers rely on.

    Zero trust becomes essential at the edge

    When AI is distributed across dozens or hundreds of sites, the old idea of a single secure “inside” network breaks down. Every store, clinic, kiosk, or field location becomes its own micro-environment — and every device within it becomes its own potential entry point.

    Zero trust offers a framework to make this manageable.

    At the edge, zero trust means:

    • Verifying identity rather than location — access is granted because a user or device proves who it is, not because it sits behind a corporate firewall.

    • Continuous authentication — trust isn’t permanent; it’s re-evaluated throughout a session.

    • Segmentation that limits movement — if something goes wrong, attackers can’t jump freely from system to system.

    This approach is especially critical given that many edge devices can’t run traditional security clients. SIM-based identity and secure mobile connectivity — areas where T-Mobile for Business brings significant strength — help verify IoT devices, 5G routers, and sensors that otherwise sit outside the visibility of IT teams.

    This is why connectivity providers are increasingly combining networking and security into a single approach. T-Mobile for Business embeds segmentation, device visibility, and zero-trust safeguards directly into its wireless-first connectivity offerings, reducing the need for SMBs to stitch together multiple tools.

    Secure-by-default networks reshape the landscape

    A major architectural shift is underway: networks that assume every device, session, and workload must be authenticated, segmented, and monitored from the start. Instead of building security on top of connectivity, the two are fused.

    T-Mobile for Business shows how this is evolving. Its SASE platform, powered by Palo Alto Networks Prisma SASE 5G, blends secure access with connectivity into one cloud-delivered service. Private Access gives users the least-privileged access they need, nothing more. T-SIMsecure authenticates devices at the SIM layer, allowing IoT sensors and 5G routers to be verified automatically. Security Slice isolates sensitive SASE traffic on a dedicated portion of the 5G network, ensuring consistency even during heavy demand.

    A unified dashboard like T-Platform brings it together, offering real-time visibility across SASE, IoT, business internet, and edge control — simplifying operations for SMBs with limited staff.

    The future: AI that runs the edge and protects it

    As AI models become more dynamic and autonomous, we’ll see the relationship flip: the edge won’t just support AI; AI will actively run and secure the edge — optimizing traffic paths, adjusting segmentation automatically, and spotting anomalies that matter to one specific store or site.

    Self-healing networks and adaptive policy engines will move from experimental to expected.

    For SMBs, this is a pivotal moment. The organizations that modernize their connectivity and security foundations now will be the ones best positioned to scale AI everywhere — safely, confidently, and without unnecessary complexity.

    Partners like T-Mobile for Business are already moving in this direction, giving SMBs a way to deploy AI at the edge without sacrificing control or visibility.


    Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.

  • Gemini 3 Flash arrives with reduced costs and latency — a powerful combo for enterprises

    Enterprises can now harness the power of a large language model whose performance approaches that of Google's state-of-the-art Gemini 3 Pro, at a fraction of the cost and with greater speed, thanks to the newly released Gemini 3 Flash.

    The model joins the flagship Gemini 3 Pro, Gemini 3 Deep Think, and Gemini Agent, all of which were announced and released last month.

    Gemini 3 Flash, now available on Gemini Enterprise, Google Antigravity, Gemini CLI, AI Studio, and on preview in Vertex AI, processes information in near real-time and helps build quick, responsive agentic applications. 

    The company said in a blog post that Gemini 3 Flash “builds on the model series that developers and enterprises already love, optimized for high-frequency workflows that demand speed, without sacrificing quality.”

    The model is also the default for AI Mode on Google Search and the Gemini application. 

    Tulsee Doshi, senior director, product management on the Gemini team, said in a separate blog post that the model “demonstrates that speed and scale don’t have to come at the cost of intelligence.”

    “Gemini 3 Flash is made for iterative development, offering Gemini 3’s Pro-grade coding performance with low latency — it’s able to reason and solve tasks quickly in high-frequency workflows,” Doshi said. “It strikes an ideal balance for agentic coding, production-ready systems and responsive interactive applications.”

    Early adoption by specialized firms suggests the model holds up in high-stakes fields. Harvey, an AI platform for law firms, reported a 7% jump in reasoning on its internal 'BigLaw Bench,' while Resemble AI found that Gemini 3 Flash could process complex forensic data for deepfake detection 4x faster than Gemini 2.5 Pro. These aren't just speed gains; they enable 'near real-time' workflows that were previously impossible.

    More efficient at a lower cost

    Enterprise AI builders have become more aware of the cost of running AI models, especially as they try to convince stakeholders to put more budget into agentic workflows that run on expensive models. Organizations have turned to smaller or distilled models, open-weight alternatives, and research and prompting techniques to help manage bloated AI costs.

    For enterprises, the biggest value proposition for Gemini 3 Flash is that it offers the same level of advanced multimodal capabilities, such as complex video analysis and data extraction, as its larger Gemini counterparts, but is far faster and cheaper. 

    While Google’s internal materials highlight a 3x speed increase over the 2.5 Pro series, data from independent benchmarking firm Artificial Analysis adds a layer of crucial nuance.

    In the latter organization's pre-release testing, Gemini 3 Flash Preview recorded a raw throughput of 218 output tokens per second. This makes it 22% slower than the previous 'non-reasoning' Gemini 2.5 Flash, but it is still significantly faster than rivals including OpenAI's GPT-5.1 high (125 t/s) and DeepSeek V3.2 reasoning (30 t/s).

    Most notably, Artificial Analysis crowned Gemini 3 Flash as the new leader in their AA-Omniscience knowledge benchmark, where it achieved the highest knowledge accuracy of any model tested to date. However, this intelligence comes with a 'reasoning tax': the model more than doubles its token usage compared to the 2.5 Flash series when tackling complex indexes.

    This high token density is offset by Google's aggressive pricing: when accessed through the Gemini API, Gemini 3 Flash costs $0.50 per 1 million input tokens, compared to $1.25/1M input tokens for Gemini 2.5 Pro, and $3/1M output tokens, compared to $10/1M output tokens for Gemini 2.5 Pro. This allows Gemini 3 Flash to claim the title of the most cost-efficient model for its intelligence tier, despite being one of the most 'talkative' models in terms of raw token volume. Here's how it stacks up to rival LLM offerings:

    | Model | Input (/1M) | Output (/1M) | Total Cost | Source |
    | --- | --- | --- | --- | --- |
    | Qwen 3 Turbo | $0.05 | $0.20 | $0.25 | Alibaba Cloud |
    | Grok 4.1 Fast (reasoning) | $0.20 | $0.50 | $0.70 | xAI |
    | Grok 4.1 Fast (non-reasoning) | $0.20 | $0.50 | $0.70 | xAI |
    | deepseek-chat (V3.2-Exp) | $0.28 | $0.42 | $0.70 | DeepSeek |
    | deepseek-reasoner (V3.2-Exp) | $0.28 | $0.42 | $0.70 | DeepSeek |
    | Qwen 3 Plus | $0.40 | $1.20 | $1.60 | Alibaba Cloud |
    | ERNIE 5.0 | $0.85 | $3.40 | $4.25 | Qianfan |
    | Gemini 3 Flash Preview | $0.50 | $3.00 | $3.50 | Google |
    | Claude Haiku 4.5 | $1.00 | $5.00 | $6.00 | Anthropic |
    | Qwen-Max | $1.60 | $6.40 | $8.00 | Alibaba Cloud |
    | Gemini 3 Pro (≤200K) | $2.00 | $12.00 | $14.00 | Google |
    | GPT-5.2 | $1.75 | $14.00 | $15.75 | OpenAI |
    | Claude Sonnet 4.5 | $3.00 | $15.00 | $18.00 | Anthropic |
    | Gemini 3 Pro (>200K) | $4.00 | $18.00 | $22.00 | Google |
    | Claude Opus 4.5 | $5.00 | $25.00 | $30.00 | Anthropic |
    | GPT-5.2 Pro | $21.00 | $168.00 | $189.00 | OpenAI |

    More ways to save

    But enterprise developers and users can cut costs further by reining in the overthinking that larger models often produce, which racks up token usage. Google said the model “is able to modulate how much it thinks,” so it uses more thinking, and therefore more tokens, for complex tasks than for quick prompts. The company noted Gemini 3 Flash uses 30% fewer tokens than Gemini 2.5 Pro.

    To balance this new reasoning power with strict corporate latency requirements, Google has introduced a 'Thinking Level' parameter. Developers can toggle between 'Low' (to minimize cost and latency for simple chat tasks) and 'High' (to maximize reasoning depth for complex data extraction). This granular control allows teams to build 'variable-speed' applications that only consume expensive 'thinking tokens' when a problem actually demands PhD-level logic.
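
    The snippet below sketches how a developer might wire that toggle through Google's google-genai Python SDK. It is a hedged illustration: the thinking_level field, its use inside ThinkingConfig for Gemini 3, and the model identifier are assumptions inferred from the article's description, so check the current SDK reference before relying on them.

    ```python
    # Illustrative sketch only: the "thinking_level" values and the model name are
    # assumptions based on the article's description of the Thinking Level toggle.
    from google import genai
    from google.genai import types

    client = genai.Client()  # reads the API key from the environment

    def ask(prompt: str, thinking_level: str = "low") -> str:
        """Use 'low' for cheap, low-latency chat; 'high' for complex extraction."""
        response = client.models.generate_content(
            model="gemini-3-flash-preview",  # assumed model identifier
            contents=prompt,
            config=types.GenerateContentConfig(
                thinking_config=types.ThinkingConfig(thinking_level=thinking_level),
            ),
        )
        return response.text

    # Routine question: minimize latency and thinking tokens.
    print(ask("Summarize this support ticket in one sentence.", thinking_level="low"))

    # Harder task: pay for deeper reasoning only when it is needed.
    print(ask("Extract every clause in this contract that shifts liability.", thinking_level="high"))
    ```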

    The economic story extends beyond simple token prices. With the standard inclusion of Context Caching, enterprises processing massive, static datasets — such as entire legal libraries or codebase repositories — can see a 90% reduction in costs for repeated queries. When combined with the Batch API’s 50% discount, the total cost of ownership for a Gemini-powered agent drops significantly below the threshold of competing frontier models.
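
    As a rough illustration of how those discounts compound, the sketch below applies the article's published rates ($0.50/1M input, $3/1M output), the 90% context-caching saving, and the 50% Batch API discount to an invented monthly workload. The volumes and the naive way the discounts are stacked are assumptions for illustration, not Google's billing logic.

    ```python
    # Back-of-the-envelope cost model using the article's published Gemini 3 Flash rates.
    # Workload volumes are hypothetical; caching and batch discounts are applied naively.

    INPUT_PER_M = 0.50    # $ per 1M input tokens
    OUTPUT_PER_M = 3.00   # $ per 1M output tokens

    def monthly_cost(input_m: float, output_m: float, cached_share: float = 0.0,
                     batch: bool = False) -> float:
        """Estimate monthly spend (USD) for token volumes given in millions."""
        cached = input_m * cached_share
        fresh = input_m - cached
        cost = fresh * INPUT_PER_M + cached * INPUT_PER_M * 0.10  # cached input ~90% cheaper
        cost += output_m * OUTPUT_PER_M
        if batch:
            cost *= 0.50  # Batch API discount for offline workloads
        return cost

    # Hypothetical agent: 500M input tokens/month (80% hitting a cached legal corpus),
    # 50M output tokens, submitted through the batch tier.
    print(f"${monthly_cost(500, 50, cached_share=0.8, batch=True):,.2f} per month")
    ```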

    “Gemini 3 Flash delivers exceptional performance on coding and agentic tasks combined with a lower price point, allowing teams to deploy sophisticated reasoning across high-volume processes without hitting cost barriers,” Google said.

    By offering a model that delivers strong multimodal performance at a more affordable price, Google is making the case that enterprises concerned with controlling their AI spend should choose its models, especially Gemini 3 Flash. 

    Strong benchmark performance 

    But how does Gemini 3 Flash stack up against other models in terms of its performance? 

    Doshi said the model achieved a score of 78% on SWE-Bench Verified, a benchmark for coding agents, outperforming both the preceding Gemini 2.5 family and even the newer Gemini 3 Pro.

    For enterprises, this means high-volume software maintenance and bug-fixing tasks can now be offloaded to a model that is both faster and cheaper than previous flagship models, without a degradation in code quality.

    The model also performed strongly on other benchmarks, scoring 81.2% on the MMMU Pro benchmark, comparable to Gemini 3 Pro. 

    While most Flash-type models are explicitly optimized for short, quick tasks like generating code, Google claims Gemini 3 Flash’s performance “in reasoning, tool use and multimodal capabilities is ideal for developers looking to do more complex video analysis, data extraction and visual Q&A, which means it can enable more intelligent applications — like in-game assistants or A/B test experiments — that demand both quick answers and deep reasoning.”

    First impressions from early users

    So far, early users have been largely impressed with the model, particularly its benchmark performance. 

    What it means for enterprise AI usage

    With Gemini 3 Flash now serving as the default engine across Google Search and the Gemini app, we are witnessing the "Flash-ification" of frontier intelligence. By making Pro-level reasoning the new baseline, Google is setting a trap for slower incumbents.

    The integration into platforms like Google Antigravity suggests that Google isn't just selling a model; it's selling the infrastructure for the autonomous enterprise.

    As developers hit the ground running with 3x faster speeds and a 90% discount on context caching, the "Gemini-first" strategy becomes a compelling financial argument. In the high-velocity race for AI dominance, Gemini 3 Flash may be the model that finally turns "vibe coding" from an experimental hobby into a production-ready reality.

  • Zoom says it aced AI’s hardest exam. Critics say it copied off its neighbors.

    Zoom Video Communications, the company best known for keeping remote workers connected during the pandemic, announced last week that it had achieved the highest score ever recorded on one of artificial intelligence's most demanding tests — a claim that sent ripples of surprise, skepticism, and genuine curiosity through the technology industry.

    The San Jose-based company said its AI system scored 48.1 percent on Humanity's Last Exam, a benchmark designed by subject-matter experts worldwide to stump even the most advanced AI models. That result edges out Google's Gemini 3 Pro, which held the previous record at 45.8 percent.

    "Zoom has achieved a new state-of-the-art result on the challenging Humanity's Last Exam full-set benchmark, scoring 48.1%, which represents a substantial 2.3% improvement over the previous SOTA result," wrote Xuedong Huang, Zoom's chief technology officer, in a blog post.

    The announcement raises a provocative question that has consumed AI watchers for days: How did a video conferencing company — one with no public history of training large language models — suddenly vault past Google, OpenAI, and Anthropic on a benchmark built to measure the frontiers of machine intelligence?

    The answer reveals as much about where AI is headed as it does about Zoom's own technical ambitions. And depending on whom you ask, it's either an ingenious demonstration of practical engineering or a hollow claim that appropriates credit for others' work.

    How Zoom built an AI traffic controller instead of training its own model

    Zoom did not train its own large language model. Instead, the company developed what it calls a "federated AI approach" — a system that routes queries to multiple existing models from OpenAI, Google, and Anthropic, then uses proprietary software to select, combine, and refine their outputs.

    At the heart of this system sits what Zoom calls its "Z-scorer," a mechanism that evaluates responses from different models and chooses the best one for any given task. The company pairs this with what it describes as an "explore-verify-federate strategy," an agentic workflow that balances exploratory reasoning with verification across multiple AI systems.

    "Our federated approach combines Zoom's own small language models with advanced open-source and closed-source models," Huang wrote. The framework "orchestrates diverse models to generate, challenge, and refine reasoning through dialectical collaboration."

    In simpler terms: Zoom built a sophisticated traffic controller for AI, not the AI itself.
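
    For readers unfamiliar with the pattern, the sketch below shows the general shape of such a traffic controller: fan a question out to several providers, score the candidate answers, and return the best one. It is a toy illustration under stated assumptions, not Zoom's Z-scorer; the provider calls are stubs and the scoring heuristic is deliberately trivial.

    ```python
    # Toy "federate then score" router. NOT Zoom's implementation: provider calls are
    # stubbed and the scorer is a placeholder for a real verifier model or cross-check.
    from concurrent.futures import ThreadPoolExecutor

    def call_openai(q: str) -> str:  return f"[gpt answer to: {q}]"     # stub
    def call_gemini(q: str) -> str:  return f"[gemini answer to: {q}]"  # stub
    def call_claude(q: str) -> str:  return f"[claude answer to: {q}]"  # stub

    PROVIDERS = {"openai": call_openai, "gemini": call_gemini, "claude": call_claude}

    def score(question: str, answer: str) -> float:
        """Placeholder scorer; a production system would verify, not count characters."""
        return float(len(answer))

    def federated_answer(question: str) -> tuple[str, str]:
        with ThreadPoolExecutor() as pool:
            futures = {name: pool.submit(fn, question) for name, fn in PROVIDERS.items()}
            candidates = {name: f.result() for name, f in futures.items()}
        best = max(candidates, key=lambda name: score(question, candidates[name]))
        return best, candidates[best]

    print(federated_answer("What is the capital of Australia?"))
    ```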

    This distinction matters enormously in an industry where bragging rights — and billions in valuation — often hinge on who can claim the most capable model. The major AI laboratories spend hundreds of millions of dollars training frontier systems on vast computing clusters. Zoom's achievement, by contrast, appears to rest on clever integration of those existing systems.

    Why AI researchers are divided over what counts as real innovation

    The response from the AI community was swift and sharply divided.

    Max Rumpf, an AI engineer who says he has trained state-of-the-art language models, posted a pointed critique on social media. "Zoom strung together API calls to Gemini, GPT, Claude et al. and slightly improved on a benchmark that delivers no value for their customers," he wrote. "They then claim SOTA."

    Rumpf did not dismiss the technical approach itself. Using multiple models for different tasks, he noted, is "actually quite smart and most applications should do this." He pointed to Sierra, an AI customer service company, as an example of this multi-model strategy executed effectively.

    His objection was more specific: "They did not train the model, but obfuscate this fact in the tweet. The injustice of taking credit for the work of others sits deeply with people."

    But other observers saw the achievement differently. Hongcheng Zhu, a developer, offered a more measured assessment: "To top an AI eval, you will most likely need model federation, like what Zoom did. An analogy is that every Kaggle competitor knows you have to ensemble models to win a contest."

    The comparison to Kaggle — the competitive data science platform where combining multiple models is standard practice among winning teams — reframes Zoom's approach as industry best practice rather than sleight of hand. Academic research has long established that ensemble methods routinely outperform individual models.

    Still, the debate exposed a fault line in how the industry understands progress. Ryan Pream, founder of Exoria AI, was dismissive: "Zoom are just creating a harness around another LLM and reporting that. It is just noise." Another commenter captured the sheer unexpectedness of the news: "That the video conferencing app ZOOM developed a SOTA model that achieved 48% HLE was not on my bingo card."

    Perhaps the most pointed critique concerned priorities. Rumpf argued that Zoom could have directed its resources toward problems its customers actually face. "Retrieval over call transcripts is not 'solved' by SOTA LLMs," he wrote. "I figure Zoom's users would care about this much more than HLE."

    The Microsoft veteran betting his reputation on a different kind of AI

    If Zoom's benchmark result seemed to come from nowhere, its chief technology officer did not.

    Xuedong Huang joined Zoom from Microsoft, where he spent decades building the company's AI capabilities. He founded Microsoft's speech technology group in 1993 and led teams that achieved what the company described as human parity in speech recognition, machine translation, natural language understanding, and computer vision.

    Huang holds a Ph.D. in electrical engineering from the University of Edinburgh. He is an elected member of the National Academy of Engineering and the American Academy of Arts and Sciences, as well as a fellow of both the IEEE and the ACM. His credentials place him among the most accomplished AI executives in the industry.

    His presence at Zoom signals that the company's AI ambitions are serious, even if its methods differ from the research laboratories that dominate headlines. In his tweet celebrating the benchmark result, Huang framed the achievement as validation of Zoom's strategy: "We have unlocked stronger capabilities in exploration, reasoning, and multi-model collaboration, surpassing the performance limits of any single model."

    That final clause — "surpassing the performance limits of any single model" — may be the most significant. Huang is not claiming Zoom built a better model. He is claiming Zoom built a better system for using models.

    Inside the test designed to stump the world's smartest machines

    The benchmark at the center of this controversy, Humanity's Last Exam, was designed to be exceptionally difficult. Unlike earlier tests that AI systems learned to game through pattern matching, HLE presents problems that require genuine understanding, multi-step reasoning, and the synthesis of information across complex domains.

    The exam draws on questions from experts around the world, spanning fields from advanced mathematics to philosophy to specialized scientific knowledge. A score of 48.1 percent might sound unimpressive to anyone accustomed to school grading curves, but in the context of HLE, it represents the current ceiling of machine performance.

    "This benchmark was developed by subject-matter experts globally and has become a crucial metric for measuring AI's progress toward human-level performance on challenging intellectual tasks," Zoom’s announcement noted.

    The company's improvement of 2.3 percentage points over Google's previous best may appear modest in isolation. But in competitive benchmarking, where gains often come in fractions of a percent, such a jump commands attention.

    What Zoom's approach reveals about the future of enterprise AI

    Zoom's approach carries implications that extend well beyond benchmark leaderboards. The company is signaling a vision for enterprise AI that differs fundamentally from the model-centric strategies pursued by OpenAI, Anthropic, and Google.

    Rather than betting everything on building the single most capable model, Zoom is positioning itself as an orchestration layer — a company that can integrate the best capabilities from multiple providers and deliver them through products that businesses already use every day.

    This strategy hedges against a critical uncertainty in the AI market: no one knows which model will be best next month, let alone next year. By building infrastructure that can swap between providers, Zoom avoids vendor lock-in while theoretically offering customers the best available AI for any given task.

    The announcement of OpenAI's GPT-5.2 the following day underscored this dynamic. OpenAI's own communications named Zoom as a partner that had evaluated the new model's performance "across their AI workloads and saw measurable gains across the board." Zoom, in other words, is both a customer of the frontier labs and now a competitor on their benchmarks — using their own technology.

    This arrangement may prove sustainable. The major model providers have every incentive to sell API access widely, even to companies that might aggregate their outputs. The more interesting question is whether Zoom's orchestration capabilities constitute genuine intellectual property or merely sophisticated prompt engineering that others could replicate.

    The real test arrives when Zoom's 300 million users start asking questions

    Zoom titled its announcement section on industry relations "A Collaborative Future," and Huang struck notes of gratitude throughout. "The future of AI is collaborative, not competitive," he wrote. "By combining the best innovations from across the industry with our own research breakthroughs, we create solutions that are greater than the sum of their parts."

    This framing positions Zoom as a beneficent integrator, bringing together the industry's best work for the benefit of enterprise customers. Critics see something else: a company claiming the prestige of an AI laboratory without doing the foundational research that earns it.

    The debate will likely be settled not by leaderboards but by products. When AI Companion 3.0 reaches Zoom's hundreds of millions of users in the coming months, they will render their own verdict — not on benchmarks they have never heard of, but on whether the meeting summary actually captured what mattered, whether the action items made sense, whether the AI saved them time or wasted it.

    In the end, Zoom's most provocative claim may not be that it topped a benchmark. It may be the implicit argument that in the age of AI, the best model is not the one you build — it's the one you know how to use.

  • Zencoder drops Zenflow, a free AI orchestration tool that pits Claude against OpenAI’s models to catch coding errors

    Zencoder, the Silicon Valley startup that builds AI-powered coding agents, released a free desktop application on Monday that it says will fundamentally change how software engineers interact with artificial intelligence — moving the industry beyond the freewheeling era of "vibe coding" toward a more disciplined, verifiable approach to AI-assisted development.

    The product, called Zenflow, introduces what the company describes as an "AI orchestration layer" that coordinates multiple AI agents to plan, implement, test, and review code in structured workflows. The launch is Zencoder's most ambitious attempt yet to differentiate itself in an increasingly crowded market dominated by tools like Cursor, GitHub Copilot, and coding agents built directly by AI giants Anthropic, OpenAI, and Google.

    "Chat UIs were fine for copilots, but they break down when you try to scale," said Andrew Filev, Zencoder's chief executive, in an exclusive interview with VentureBeat. "Teams are hitting a wall where speed without structure creates technical debt. Zenflow replaces 'Prompt Roulette' with an engineering assembly line where agents plan, implement, and, crucially, verify each other's work."

    The announcement arrives at a critical moment for enterprise software development. Companies across industries have poured billions of dollars into AI coding tools over the past two years, hoping to dramatically accelerate their engineering output. Yet the promised productivity revolution has largely failed to materialize at scale.

    Why AI coding tools have failed to deliver on their 10x productivity promise

    Filev, who previously founded and sold the project management company Wrike to Citrix, pointed to a growing disconnect between AI coding hype and reality. While vendors have promised tenfold productivity gains, rigorous studies — including research from Stanford University — consistently show improvements closer to 20 percent.

    "If you talk to real engineering leaders, I don't remember a single conversation where somebody vibe coded themselves to 2x or 5x or 10x productivity on serious engineering production," Filev said. "The typical number you would hear would be about 20 percent."

    The problem, according to Filev, lies not with the AI models themselves but with how developers interact with them. The standard approach of typing requests into a chat interface and hoping for usable code works well for simple tasks but falls apart on complex enterprise projects.

    Zencoder's internal engineering team claims to have cracked a different approach. Filev said the company now operates at roughly twice the velocity it achieved 12 months ago, not primarily because AI models improved, but because the team restructured its development processes.

    "We had to change our process and use a variety of different best practices," he said.

    Inside the four pillars that power Zencoder's AI orchestration platform

    Zenflow organizes its approach around four core capabilities that Zencoder argues any serious AI orchestration platform must support.

    Structured workflows replace ad-hoc prompting with repeatable sequences (plan, implement, test, review) that agents follow consistently. Filev drew parallels to his experience building Wrike, noting that individual to-do lists rarely scale across organizations, while defined workflows create predictable outcomes.

    Spec-driven development requires AI agents to first generate a technical specification, then create a step-by-step plan, and only then write code. The approach became so effective that frontier AI labs including Anthropic and OpenAI have since trained their models to follow it automatically. The specification anchors agents to clear requirements, preventing what Zencoder calls "iteration drift," or the tendency for AI-generated code to gradually diverge from the original intent.

    Multi-agent verification deploys different AI models to critique each other's work. Because AI models from the same family tend to share blind spots, Zencoder routes verification tasks across model providers, asking Claude to review code written by OpenAI's models, or vice versa.

    "Think of it as a second opinion from a doctor," Filev told VentureBeat. "With the right pipeline, we see results on par with what you'd expect from Claude 5 or GPT-6. You're getting the benefit of a next-generation model today."

    Parallel execution lets developers run multiple AI agents simultaneously in isolated sandboxes, preventing them from interfering with each other's work. The interface provides a command center for monitoring this fleet, a significant departure from the current practice of managing multiple terminal windows.
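
    To make the multi-agent verification pillar concrete, the sketch below shows the general cross-provider review loop the article describes: one model family drafts a patch, a different family critiques it, and the draft only ships once the reviewer approves. The function names and review prompt are invented for illustration and the provider calls are stubs; this is not Zenflow's actual pipeline.

    ```python
    # Illustrative cross-provider verification loop; provider calls are stubbed out.

    def generate_patch(task: str) -> str:
        # Stand-in for a code-generation call to, e.g., an OpenAI model.
        return "def add(a: int, b: int) -> int:\n    return a + b\n"

    def review_patch(task: str, patch: str) -> dict:
        # Stand-in for a review call to a model from a different family (e.g., Claude),
        # which is less likely to share the generator's blind spots.
        prompt = (
            f"Task: {task}\n\nProposed patch:\n{patch}\n\n"
            "List concrete defects, missing tests, and spec violations. "
            "Approve only if none are found."
        )
        return {"approved": True, "notes": []}  # stubbed verdict

    def implement_with_verification(task: str, max_rounds: int = 3) -> str:
        patch = generate_patch(task)
        for _ in range(max_rounds):
            verdict = review_patch(task, patch)
            if verdict["approved"]:
                return patch
            # Feed the reviewer's notes back into the generator and try again.
            patch = generate_patch(task + "\nAddress review notes: " + "; ".join(verdict["notes"]))
        raise RuntimeError("Patch failed cross-model review after several rounds")

    print(implement_with_verification("Add an integer addition helper with type hints"))
    ```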

    How verification solves AI coding's biggest reliability problem

    Zencoder's emphasis on verification addresses one of the most persistent criticisms of AI-generated code: its tendency to produce "slop," or code that appears correct but fails in production or degrades over successive iterations.

    The company's internal research found that developers who skip verification often fall into what Filev called a "death loop." An AI agent completes a task successfully, but the developer, reluctant to review unfamiliar code, moves on without understanding what was written. When subsequent tasks fail, the developer lacks the context to fix problems manually and instead keeps prompting the AI for solutions.

    "They literally spend more than a day in that death loop," Filev said. "That's why the productivity is not 2x, because they were running at 3x first, and then they wasted the whole day."

    The multi-agent verification approach also gives Zencoder an unusual competitive advantage over the frontier AI labs themselves. While Anthropic, OpenAI, and Google each optimize their own models, Zencoder can mix and match across providers to reduce bias.

    "This is a rare situation where we have an edge on the frontier labs," Filev said. "Most of the time they have an edge on us, but this is a rare case."

    Zencoder faces steep competition from AI giants and well-funded startups

    Zencoder enters the AI orchestration market at a moment of intense competition. The company has positioned itself as a model-agnostic platform, supporting major providers including Anthropic, OpenAI, and Google Gemini. In September, Zencoder expanded its platform to let developers use command-line coding agents from any provider within its interface.

    That strategy reflects a pragmatic acknowledgment that developers increasingly maintain relationships with multiple AI providers rather than committing exclusively to one. Zencoder's universal platform approach lets it serve as the orchestration layer regardless of which underlying models a company prefers.

    The company also emphasizes enterprise readiness, touting SOC 2 Type II, ISO 27001, and ISO 42001 certifications along with GDPR compliance. These credentials matter for regulated industries like financial services and healthcare, where compliance requirements can block adoption of consumer-oriented AI tools.

    But Zencoder faces formidable competition from multiple directions. Cursor and Windsurf have built dedicated AI-first code editors with devoted user bases. GitHub Copilot benefits from Microsoft's distribution muscle and deep integration with the world's largest code repository. And the frontier AI labs continue expanding their own coding capabilities.

    Filev dismissed concerns about competition from the AI labs, arguing that smaller players like Zencoder can move faster on user experience innovation.

    "I'm sure they will come to the same conclusion, and they're smart and moving fast, so I'm sure they will catch up fairly quickly," he said. "That's why I said in the next six to 12 months, you're going to see a lot of this propagating through the whole space."

    The case for adopting AI orchestration now instead of waiting for better models

    Technical executives weighing AI coding investments face a difficult timing question: Should they adopt orchestration tools now, or wait for frontier AI labs to build these capabilities natively into their models?

    Filev argued that waiting carries significant competitive risk.

    "Right now, everybody is under pressure to deliver more in less time, and everybody expects engineering leaders to deliver results from AI," he said. "As a founder and CEO, I do not expect 20 percent from my VP of engineering. I expect 2x."

    He also questioned whether the major AI labs will prioritize orchestration capabilities when their core business remains model development.

    "In the ideal world, frontier labs should be building the best-ever models and competing with each other, and Zencoders and Cursors need to build the best-ever UI and UX application layer on top of those models," Filev said. "I don't see a world where OpenAI will offer you our code verifier, or vice versa."

    Zenflow launches as a free desktop application, with updated plugins available for Visual Studio Code and JetBrains integrated development environments. The product supports what Zencoder calls "dynamic workflows," meaning the system automatically adjusts process complexity based on whether a human is actively monitoring and on the difficulty of the task at hand.

    Zencoder said internal testing showed that replacing standard prompting with Zenflow's orchestration layer improved code correctness by approximately 20 percent on average.

    What Zencoder's bet on orchestration reveals about the future of AI coding

    Zencoder frames Zenflow as the first product in what it expects to become a significant new software category. The company believes every vendor focused on AI coding will eventually arrive at similar conclusions about the need for orchestration tools.

    "I think the next six to 12 months will be all about orchestration," Filev predicted. "A lot of organizations will finally reach that 2x. Not 10x yet, but at least the 2x they were promised a year ago."

    Rather than competing head-to-head with frontier AI labs on model quality, Zencoder is betting that the application layer (the software that helps developers actually use these models effectively) will determine winners and losers.

    It is, Filev suggested, a familiar pattern from technology history.

    "This is very similar to what I observed when I started Wrike," he said. "As work went digital, people relied on email and spreadsheets to manage everything, and neither could keep up."

    The same dynamic, he argued, now applies to AI coding. Chat interfaces were designed for conversation, not for orchestrating complex engineering workflows. Whether Zencoder can establish itself as the essential layer between developers and AI models before the giants build their own solutions remains an open question.

    But Filev seems comfortable with the race. The last time he spotted a gap between how people worked and the tools they had to work with, he built a company worth over a billion dollars.

    Zenflow is available immediately as a free download at zencoder.ai/zenflow.

  • Bolmo’s architecture unlocks efficient byte‑level LM training without sacrificing quality

    Enterprises that want tokenizer-free multilingual models are increasingly turning to byte-level language models to reduce brittleness in noisy or low-resource text. To tap into that niche — and make it practical at scale — the Allen Institute for AI (Ai2) introduced Bolmo, a new family of models that leverages its Olmo 3 models by “byteifying” them and reusing their backbone and capabilities.

    The company launched two versions, Bolmo 7B and Bolmo 1B, and describes Bolmo as “the first fully open byte-level language model.” Ai2 said the two models performed competitively with — and in some cases surpassed — other byte-level and character-based models.

    Byte-level language models operate directly on raw UTF-8 bytes, eliminating the need for a predefined vocabulary or tokenizer. This allows them to handle misspellings, rare languages, and unconventional text more reliably — key requirements for moderation, edge deployments, and multilingual applications.

    For enterprises deploying AI across multiple languages, noisy user inputs, or constrained environments, tokenizer-free models offer a way to reduce operational complexity. Ai2’s Bolmo is an attempt to make that approach practical at scale — without retraining from scratch.

    How Bolmo works and how it was built 

    Ai2 said it trained the Bolmo models using its Dolma 3 data mix, which also helped train its flagship Olmo models, along with some open code datasets and character-level data.

    The company said its goal “is to provide a reproducible, inspectable blueprint for byteifying strong subword language models in a way the community can adopt and extend.” To meet this goal, Ai2 will release its checkpoints, code, and a full paper to help other organizations build byte-level models on top of its Olmo ecosystem. 

    Since training a byte-level model completely from scratch can get expensive, Ai2 researchers instead chose an existing Olmo 3 7B checkpoint to byteify in two stages. 

    In the first stage, Ai2 froze the Olmo 3 transformer so that only certain components are trained: the local encoder and decoder, the boundary predictor, and the language modeling head. This stage was designed to be “cheap and fast,” requiring just 9.8 billion tokens.

    The next stage unfreezes the model and trains it with additional tokens. Ai2 said the byte-level approach allows Bolmo to avoid the vocabulary bottlenecks that limit traditional subword models.
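
    A minimal PyTorch-style sketch of that first stage appears below: the pretrained subword backbone is frozen while only the newly added byte-level components receive gradients. The module shapes and names are placeholders chosen for illustration, not Ai2's released code.

    ```python
    # Sketch of stage-one "byteification": freeze the backbone, train new byte modules.
    import torch
    from torch import nn

    class ByteifiedLM(nn.Module):
        def __init__(self, pretrained_backbone: nn.Module, d_model: int, n_bytes: int = 256):
            super().__init__()
            self.backbone = pretrained_backbone          # e.g. an Olmo-3-style transformer
            self.local_encoder = nn.TransformerEncoder(  # maps raw bytes into backbone space
                nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
            self.boundary_predictor = nn.Linear(d_model, 1)   # predicts patch boundaries
            self.local_decoder = nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
            self.lm_head = nn.Linear(d_model, n_bytes)        # next-byte prediction

    def stage_one_parameters(model: ByteifiedLM):
        # Stage 1: backbone stays frozen; optimize only the new byte-level parts.
        for p in model.backbone.parameters():
            p.requires_grad = False
        new_modules = [model.local_encoder, model.boundary_predictor,
                       model.local_decoder, model.lm_head]
        return [p for m in new_modules for p in m.parameters()]

    # optimizer = torch.optim.AdamW(stage_one_parameters(model), lr=1e-4)
    # Stage 2 (per the article) unfreezes the backbone and continues training.
    ```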

    Strong performance among its peers

    Byte-level language models are not as mainstream as small language models or LLMs, but this is a growing field in research. Meta released its BLT architecture research last year, aiming to offer a model that is robust, processes raw data, and doesn’t rely on fixed vocabularies. 

    Other research models in this space include ByT5, Stanford’s MrT5, and Canine.  

    Ai2 evaluated Bolmo using its evaluation suite, covering math, STEM reasoning, question answering, general knowledge, and code. 

    Bolmo 7B showed strong performance, scoring well on character-focused benchmarks like CUTE and EXECUTE, and also improving accuracy over the base Olmo 3 model.

    Bolmo 7B outperformed models of comparable size in coding, math, multiple-choice QA, and character-level understanding. 

    Why enterprises may choose byte-level models

    Enterprises find value in a hybrid model structure, using a mix of models and model sizes. 

    Ai2 makes the case that organizations should also consider byte-level models, not only for robustness and multilingual understanding, but because the approach “naturally plugs into an existing model ecosystem.”

    “A key advantage of the dynamic hierarchical setup is that compression becomes a toggleable knob,” the company said.

    For enterprises already running heterogeneous model stacks, Bolmo suggests that byte-level models may no longer be purely academic. By retrofitting a strong subword model rather than training from scratch, Ai2 is signaling a lower-risk path for organizations that want robustness without abandoning existing infrastructure.

  • Korean AI startup Motif reveals 4 big lessons for training enterprise LLMs

    We've heard (and written, here at VentureBeat) a lot about the generative AI race between the U.S. and China, as those countries are home to the groups most active in fielding new models (with a shoutout to Cohere in Canada and Mistral in France).

    But now a Korean startup is making waves: last week, the firm known as Motif Technologies released Motif-2-12.7B-Reasoning, a small-parameter open-weight model with impressive benchmark scores. According to independent benchmarking lab Artificial Analysis, it quickly became the most performant model from that country, beating even regular GPT-5.1 from U.S. leader OpenAI.

    But more importantly for enterprise AI teams, the company has published a white paper on arxiv.org with a concrete, reproducible training recipe that exposes where reasoning performance actually comes from — and where common internal LLM efforts tend to fail.

    For organizations building or fine-tuning their own models behind the firewall, the paper offers a set of practical lessons about data alignment, long-context infrastructure, and reinforcement learning stability that are directly applicable to enterprise environments. Here they are:

    1. Reasoning gains come from data distribution, not model size

    One of Motif’s most relevant findings for enterprise teams is that synthetic reasoning data only helps when its structure matches the target model’s reasoning style.

    The paper shows measurable differences in downstream coding performance depending on which “teacher” model generated the reasoning traces used during supervised fine-tuning.

    For enterprises, this undermines a common shortcut: generating large volumes of synthetic chain-of-thought data from a frontier model and assuming it will transfer cleanly. Motif’s results suggest that misaligned reasoning traces can actively hurt performance, even if they look high quality.

    The takeaway is operational, not academic: teams should validate that their synthetic data reflects the format, verbosity, and step granularity they want at inference time. Internal evaluation loops matter more than copying external datasets.

    2. Long-context training is an infrastructure problem first

    Motif trains at 64K context, but the paper makes clear that this is not simply a tokenizer or checkpointing tweak.

    The model relies on hybrid parallelism, careful sharding strategies, and aggressive activation checkpointing to make long-context training feasible on Nvidia H100-class hardware.

    For enterprise builders, the message is sobering but useful: long-context capability cannot be bolted on late.

    If retrieval-heavy or agentic workflows are core to the business use case, context length has to be designed into the training stack from the start. Otherwise, teams risk expensive retraining cycles or unstable fine-tunes.
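
    Of the techniques the paper names, activation checkpointing is the simplest to show in isolation. The generic PyTorch sketch below trades recomputation for memory on a long sequence; it illustrates the technique only and is not Motif's training stack, which also layers in hybrid parallelism and sharding.

    ```python
    # Generic activation-checkpointing sketch: recompute activations during backward
    # instead of storing them, shrinking memory for long-context training.
    import torch
    from torch import nn
    from torch.utils.checkpoint import checkpoint

    class CheckpointedBlockStack(nn.Module):
        def __init__(self, d_model: int, n_layers: int):
            super().__init__()
            self.layers = nn.ModuleList(
                nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
                for _ in range(n_layers)
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            for layer in self.layers:
                x = checkpoint(layer, x, use_reentrant=False)
            return x

    stack = CheckpointedBlockStack(d_model=512, n_layers=4)
    long_batch = torch.randn(1, 8192, 512)   # stand-in for a long-context sequence
    loss = stack(long_batch).mean()
    loss.backward()                          # activations are recomputed here, not stored
    ```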

    3. RL fine-tuning fails without data filtering and reuse

    Motif’s reinforcement learning fine-tuning (RLFT) pipeline emphasizes difficulty-aware filtering — keeping tasks whose pass rates fall within a defined band — rather than indiscriminately scaling reward training.
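
    In code, the filtering step is straightforward; the sketch below estimates each task's pass rate under the current policy and keeps only tasks inside a target band. The 20%-80% band and the toy policy are hypothetical values for illustration; the paper defines its own thresholds.

    ```python
    # Difficulty-aware filtering sketch: drop tasks the policy always or never solves,
    # keeping the middle band where the RL reward signal is informative.
    import random

    def estimate_pass_rate(task, policy, n_samples: int = 8) -> float:
        """Sample the policy several times and measure how often it solves the task."""
        return sum(policy(task) for _ in range(n_samples)) / n_samples

    def filter_tasks(tasks, policy, low: float = 0.2, high: float = 0.8):
        kept = []
        for task in tasks:
            p = estimate_pass_rate(task, policy)
            if low <= p <= high:   # discard trivially easy and hopeless tasks
                kept.append((task, p))
        return kept

    # Toy demo: a "policy" that succeeds with a per-task probability.
    toy_tasks = [{"id": i, "difficulty": random.random()} for i in range(20)]
    toy_policy = lambda task: random.random() > task["difficulty"]
    print(filter_tasks(toy_tasks, toy_policy))
    ```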

    This directly addresses a pain point many enterprise teams encounter when experimenting with RL: performance regressions, mode collapse, or brittle gains that vanish outside benchmarks. Motif also reuses trajectories across policies and expands clipping ranges, trading theoretical purity for training stability.

    The enterprise lesson is clear: RL is a systems problem, not just a reward model problem. Without careful filtering, reuse, and multi-task balancing, RL can destabilize models that are otherwise production-ready.

    4. Memory optimization determines what is even possible

    Motif’s use of kernel-level optimizations to reduce RL memory pressure highlights an often-overlooked constraint in enterprise settings: memory, not compute, is frequently the bottleneck. Techniques like loss-function-level optimization determine whether advanced training stages are viable at all.

    For organizations running shared clusters or regulated environments, this reinforces the need for low-level engineering investment, not just model architecture experimentation.

    Why this matters for enterprise AI teams

    Motif-2-12.7B-Reasoning is positioned as competitive with much larger models, but its real value lies in the transparency of how those results were achieved. The paper argues — implicitly but persuasively — that reasoning performance is earned through disciplined training design, not model scale alone.

    For enterprises building proprietary LLMs, the lesson is pragmatic: invest early in data alignment, infrastructure, and training stability, or risk spending millions fine-tuning models that never reliably reason in production.

  • Why agentic AI needs a new category of customer data

    Presented by Twilio


    The customer data infrastructure powering most enterprises was architected for a world that no longer exists: one where marketing interactions could be captured and processed in batches, where campaign timing was measured in days (not milliseconds), and where "personalization" meant inserting a first name into an email template.

    Conversational AI has shattered those assumptions.

    AI agents need to know, instantly, what a customer just said, the tone they used, their emotional state, and their complete history with a brand in order to provide relevant guidance and effective resolution. This fast-moving stream of conversational signals (tone, urgency, intent, sentiment) represents a fundamentally different category of customer data. Yet the systems most enterprises rely on today were never designed to capture or deliver it at the speed modern customer experiences demand.

    The conversational AI context gap

    The consequences of this architectural mismatch are already visible in customer satisfaction data. Twilio’s Inside the Conversational AI Revolution report reveals that more than half (54%) of consumers report AI rarely has context from their past interactions, and only 15% feel that human agents receive the full story after an AI handoff. The result: customer experiences defined by repetition, friction, and disjointed handoffs.

    The problem isn't a lack of customer data. Enterprises are drowning in it. The problem is that conversational AI requires real-time, portable memory of customer interactions, and few organizations have infrastructure capable of delivering it. Traditional CRMs and CDPs excel at capturing static attributes but weren't architected to handle the dynamic exchange of a conversation unfolding second by second.

    Solving this requires building conversational memory inside communications infrastructure itself, rather than attempting to bolt it onto legacy data systems through integrations.

    The agentic AI adoption wave and its limits

    This infrastructure gap is becoming critical as agentic AI moves from pilot to production. Nearly two-thirds of companies (63%) are already in late-stage development or fully deployed with conversational AI across sales and support functions.

    The reality check: While 90% of organizations believe customers are satisfied with their AI experiences, only 59% of consumers agree. The disconnect isn't about conversational fluency or response speed. It's about whether AI can demonstrate true understanding, respond with appropriate context, and actually solve problems rather than forcing escalation to human agents.

    Consider the gap: A customer calls about a delayed order. With proper conversational memory infrastructure, an AI agent could instantly recognize the customer, reference their previous order and the details of the delay, proactively suggest solutions, and offer appropriate compensation, all without asking them to repeat information. Most enterprises can't deliver this because the required data lives in separate systems that can't be accessed quickly enough.

    Where enterprise data architecture breaks down

    Enterprise data systems built for marketing and support were optimized for structured data and batch processing, not the dynamic memory required for natural conversation. Three fundamental limitations prevent these systems from supporting conversational AI:

    Latency breaks the conversational contract. When customer data lives in one system and conversations happen in another, every interaction requires API calls that introduce 200-500 millisecond delays, transforming natural dialogue into robotic exchanges.

    Conversational nuance gets lost. The signals that make conversations meaningful (tone, urgency, emotional state, commitments made mid-conversation) rarely make it into traditional CRMs, which were designed to capture structured data, not the unstructured richness AI needs.

    Data fragmentation creates experience fragmentation. AI agents operate in one system, human agents in another, marketing automation in a third, and customer data in a fourth, creating fractured experiences where context evaporates at every handoff.

    Conversational memory requires infrastructure where conversations and customer data are unified by design.

    What unified conversational memory enables

    Organizations treating conversational memory as core infrastructure are seeing clear competitive advantages:

    Seamless handoffs: When conversational memory is unified, human agents inherit complete context instantly, eliminating the "let me pull up your account" dead time that signals wasted interactions.

    Personalization at scale: While 88% of consumers expect personalized experiences, over half of businesses cite this as a top challenge. When conversational memory is native to communications infrastructure, agents can personalize based on what customers are trying to accomplish right now.

    Operational intelligence: Unified conversational memory provides real-time visibility into conversation quality and key performance indicators, with insights feeding back into AI models to improve quality continuously.

    Agentic automation: Perhaps most significantly, conversational memory transforms AI from a transactional tool to a genuinely agentic system capable of nuanced decisions, like rebooking a frustrated customer's flight while offering compensation calibrated to their loyalty tier.

    The infrastructure imperative

    The agentic AI wave is forcing a fundamental re-architecture of how enterprises think about customer data.

    The solution isn't iterating on existing CDP or CRM architecture. It's recognizing that conversational memory represents a distinct category requiring real-time capture, millisecond-level access, and preservation of conversational nuance that can only be met when data capabilities are embedded directly into communications infrastructure.

    Organizations approaching this as a systems integration challenge will find themselves at a disadvantage against competitors who treat conversational memory as foundational infrastructure. When memory is native to the platform powering every customer touchpoint, context travels with customers across channels, latency disappears, and continuous journeys become operationally feasible.

    The enterprises setting the pace aren't those with the most sophisticated AI models. They're the ones that solved the infrastructure problem first, recognizing that agentic AI can't deliver on its promise without a new category of customer data purpose-built for the speed, nuance, and continuity that conversational experiences demand.

    Robin Grochol is SVP of Product, Data, Identity & Security at Twilio.


    Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.

  • Tokenization takes the lead in the fight for data security

    Presented by Capital One Software


    Tokenization is emerging as a cornerstone of modern data security, helping businesses separate the value of their data from its risk. During this VB in Conversation, Ravi Raghu, president, Capital One Software, talks about the ways tokenization can help reduce the value of breached data and preserve underlying data format and usability, including Capital One’s own experience leveraging tokenization at scale.

    Tokenization, Raghu asserts, is far superior to conventional approaches such as encryption. It converts sensitive data into a nonsensitive digital replacement, called a token, that maps back to the original, which is secured in a digital vault. The token placeholder preserves both the format and the utility of the sensitive data, and can be used across applications — including AI models. Because tokenization removes the need to manage encryption keys or dedicate compute to constant encrypting and decrypting, it offers one of the most scalable ways for companies to protect their most sensitive data, he added.

    "The killer part, from a security standpoint, when you think about it relative to other methods, if a bad actor gets hold of the data, they get hold of tokens," he explained. "The actual data is not sitting with the token, unlike other methods like encryption, where the actual data sits there, just waiting for someone to get hold of a key or use brute force to get to the real data. From every angle this is the ideal way one ought to go about protecting sensitive data."

    The tokenization differentiator

    Most organizations are just scratching the surface of data security, adding security at the very end, when data is read, to prevent an end user from accessing it. At minimum, organizations should focus on securing data on write, as it’s being stored. But best-in-class organizations go even further, protecting data at birth, the moment it’s created.

    At one end of the safety spectrum is a simple lock-and-key approach that restricts access but leaves the underlying data intact. More advanced methods, like masking or modifying data, permanently alter its meaning, which can compromise its usefulness. File-level encryption provides broader protection for large volumes of stored data, but field-level encryption (of a Social Security number, for example) is a bigger challenge: it takes a great deal of compute to encrypt a single field and then decrypt it at the point of usage. And it has a fatal flaw: the original data is still right there, needing only the key to unlock it.

    Tokenization avoids these pitfalls by replacing the original data with a surrogate that has no intrinsic value. If the token is intercepted — whether by the wrong person or the wrong machine — the data itself remains secure.

    The business value of tokenization

    "Fundamentally you’re protecting data, and that’s priceless," Raghu said. "Another thing that’s priceless – can you use that for modeling purposes subsequently? On the one hand, it’s a protection thing, and on the other hand it’s a business enabling thing."

    Because tokenization preserves the structure and ordinality of the original data, it can still be used for modeling and analytics, turning protection into a business enabler. Take private health data governed by HIPAA, for example: tokenization means that data can be used to build pricing models or for gene therapy research, while remaining compliant.

    "If your data is already protected, you can then proliferate the usage of data across the entire enterprise and have everybody creating more and more value out of the data," Raghu said. "Conversely, if you don’t have that, there’s a lot of reticence for enterprises today to have more people access it, or have more and more AI agents access their data. Ironically, they’re limiting the blast radius of innovation. The tokenization impact is massive, and there are many metrics you could use to measure that – operational impact, revenue impact, and obviously the peace of mind from a security standpoint."

    Breaking down adoption barriers

    Until now, the fundamental challenge with traditional tokenization has been performance. AI requires scale and speed that are unprecedented. That's one of the major challenges Capital One addresses with Databolt, its vaultless tokenization solution, which can produce up to 4 million tokens per second.

    "Capital One has gone through tokenization for more than a decade. We started doing it because we’re serving our 100 million banking customers. We want to protect that sensitive data," Raghu said. "We’ve eaten our own dog food with our internal tokenization capability, over 100 billion times a month. We’ve taken that know-how and that capability, scale, and speed, and innovated so that the world can leverage it, so that it’s a commercial offering."

    Vaultless tokenization is an advanced form of tokenization that does not require a central database (vault) to store token mappings. Instead, it uses mathematical algorithms, cryptographic techniques, and deterministic mapping to generate tokens dynamically. This approach is faster, more scalable, and eliminates the security risk associated with managing a vault.
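
    As a rough illustration of the vaultless idea (not how Databolt works internally), the sketch below derives a same-format token from a keyed hash, so no mapping table has to be stored or synchronized. The key handling and function names are assumptions; real vaultless schemes typically use reversible format-preserving encryption (such as NIST FF1) so authorized systems can detokenize, whereas this HMAC sketch is one-way.

    ```python
    import hashlib
    import hmac

    # Assumption: in production the key would come from an HSM or key-management service.
    SECRET_KEY = b"replace-with-a-managed-key"

    def vaultless_token(value: str, key: bytes = SECRET_KEY) -> str:
        """Derive a format-preserving token deterministically from a keyed hash."""
        digest = iter(hmac.new(key, value.encode(), hashlib.sha256).digest())
        # Map each digit of the input to a pseudorandom digit; leave separators intact.
        return "".join(
            str(next(digest) % 10) if ch.isdigit() else ch
            for ch in value
        )

    # The same input always yields the same token, so joins and analytics still line up,
    # but intercepting the token reveals nothing without the key and the algorithm.
    print(vaultless_token("123-45-6789"))   # e.g. "517-02-4306"
    ```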

    "We realized that for the scale and speed demands that we had, we needed to build out that capability ourselves," Raghu said. "We’ve been iterating continuously on making sure that it can scale up to hundreds of billions of operations a month. All of our innovation has been around building IP and capability to do that thing at a battle-tested scale within our enterprise, for the purpose of serving our customers."

    While conventional tokenization methods can involve some complexity and slow down operations, Databolt seamlessly integrates with encrypted data warehouses, allowing businesses to maintain robust security without slowing performance or operations. Tokenization occurs in the customer’s environment, removing the need to communicate with an external network to perform tokenization operations, which can also slow performance.

    "We believe that fundamentally, tokenization should be easy to adopt," Raghu said. "You should be able to secure your data very quickly and operate at the speed and scale and cost needs that organizations have. I think that’s been a critical barrier so far for the mass scale adoption of tokenization. In an AI world, that’s going to become a huge enabler."

    Don't miss the whole conversation with Ravi Raghu, president, Capital One Software, here.


    Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.

  • Build vs buy is dead — AI just killed it

    Picture this: You're sitting in a conference room, halfway through a vendor pitch. The demo looks solid, and pricing fits nicely under budget. The timeline seems reasonable too. Everyone’s nodding along.

    You’re literally minutes away from saying yes.

    Then someone from your finance team walks in. They see the deck and frown. A few minutes later, they shoot you a message on Slack: “Actually, I threw together a version of this last week. Took me 2 hours in Cursor. Wanna take a look?”

    Wait… what?

    This person doesn't code. You know for a fact they've never written a line of JavaScript in their entire life. But here they are, showing you a working prototype on their laptop that does… pretty much exactly what the vendor pitched. Sure, it's got some rough edges, but it works. And it didn’t cost six figures. Just two hours of their time.

    Suddenly, the assumptions you walked in with — about how software is developed, who makes it and how decisions are made around it — all start coming apart at the seams.

    The old framework

    For decades, every growing company asked the same question: Should we build this ourselves, or should we buy it?

    And, for decades, the answer was pretty straightforward: Build if it's core to your business; buy if it isn’t.

    The logic made sense, because building was expensive and meant borrowing time from overworked engineers, writing specs, planning sprints, managing infrastructure and bracing yourself for a long tail of maintenance. Buying was faster. Safer. You paid for the support and the peace of mind.

    But something fundamental has changed: AI has made building accessible to everyone. What used to take weeks now takes hours, and what used to require fluency in a programming language now requires fluency in plain English.

    When the cost and complexity of building collapse this dramatically, the old framework goes down with them. It’s not build versus buy anymore. It’s something stranger that we haven't quite found the right words for.

    When the market doesn’t know what you need (yet)

    My company never planned to build so many of the tools we use. We just had to build because the things we needed didn’t exist. And, through that process, we developed this visceral understanding of what we actually wanted, what was useful and what it could or couldn't do. Not what vendor decks told us we needed or what analyst reports said we should want, but what actually moved the needle in our business.

    We figured out which problems were worth solving, which ones weren’t, where AI created real leverage and where it was just noise. And only then, once we had that hard-earned clarity, did we start buying.

    By that point, we knew exactly what we were looking for and could tell the difference between substance and marketing in about five minutes. We asked questions that made vendors nervous because we'd already built some rudimentary version of what they were selling.

    When anyone can build in minutes

    Last week, someone on our CX team noticed some customer feedback about a bug in Slack. Just a minor customer complaint, nothing major. In another company, this would’ve kicked off a support ticket and they’d have waited for someone else to handle it, but that’s not what happened here. They opened Cursor, described the change and let AI write the fix. Then they submitted a pull request that engineering reviewed and merged.

    Just 15 minutes after that complaint popped up in Slack, the fix was live in production.

    The person who did this isn’t technical in the slightest. I doubt they could tell you the difference between Python and JavaScript, but they solved the problem anyway.

    And that’s the point.

    AI has gotten so good at cranking out relatively simple code that it handles 80% of the problems that used to require a sprint planning meeting and two weeks of engineering time. It’s erasing the boundary between technical and non-technical. Work that used to be bottlenecked by engineering is now being done by the people closest to the problem.

    This is happening right now in companies that are actually paying attention.

    The inversion that’s happening

    Here's where it gets fascinating for finance leaders, because AI has actually flipped the entire strategic logic of the build versus buy decision on its head.

    The old model went something like:

    1. Define the need.

    2. Decide whether to build or buy.

    But defining the need took forever and required deep technical expertise, or you'd burn through money on trial-and-error vendor implementations. You'd sit through countless demos, trying to picture whether this actually solved your problem. Then you’d negotiate, implement, move all your data and workflows to the new tool and, six months and six figures later, discover whether you were actually right.

    Now, the whole sequence gets turned around:

    1. Build something lightweight with AI.

    2. Use it to understand what you actually need.

    3. Then decide whether to buy (and you'll know exactly why).

    This approach lets you run controlled experiments. You figure out whether the problem even matters. You discover which features deliver value and which just look good in demos. Then you go shopping. Instead of letting some external vendor sell you on what the need is, you get to figure out whether you even have that need in the first place.

    Think about how many software purchases you've made that, in hindsight, solved problems you didn't actually have. How many times have you been three months into an implementation and thought, “Hang on, is this actually helping us, or are we just trying to justify what we spent?”

    Now, when you do buy, the question becomes “Does this solve the problem better than what we already proved we can build?”

    That one reframe changes the entire conversation. Now you show up to vendor calls informed. You ask sharper questions, and negotiate from a place of strength. Most importantly, you avoid the most expensive mistake in enterprise software, which is solving a problem you never really had.

    The trap you need to avoid

    As this new capability emerges, I’m watching companies sprint in the wrong direction. They know they need to be AI native, so they go on a shopping spree. They look for AI-powered tools, filling their stack with products that have GPT integrations, chatbot UIs or “AI” slapped onto the marketing site. They think they’re transforming, but they’re not.

    Remember what physicist Richard Feynman called cargo cult science? After World War II, islanders in the South Pacific built fake airstrips and control towers, mimicking what they'd seen during the war, hoping planes full of cargo would return. They had all the outward forms of an airport: Towers, headsets, even people miming flight controllers. But no planes landed, because the form wasn’t the function.

    That’s exactly what’s happening with AI transformation in boardrooms everywhere. Leaders are buying AI tools without asking if they meaningfully change how work gets done, who they empower or what processes they unlock.

    They’ve built the airstrip, but the planes aren’t showing up.

    And the whole market's basically set up to make you fall into this trap. Everything gets branded as AI now, but nobody seems to care what these products actually do. Every SaaS product has bolted on a chatbot or an auto-complete feature and slapped an AI label on it, and the label has lost all meaning. It’s just a checkbox vendors figure they need to tick, regardless of whether it creates actual value for customers.

    The finance team’s new superpower

    This is the part that gets me excited about what finance teams can do now. You don’t have to guess anymore. You don’t have to bet six figures on a sales deck. You can test things, and you can actually learn something before you spend.

    Here's what I mean: If you’re evaluating vendor management software, prototype the core workflow with AI tools. Figure out whether you’re solving a tooling problem or a process problem. Figure out whether you need software at all.

    This doesn’t mean you’ll build everything internally — of course not. Most of the time, you’ll still end up buying, and that's totally fine, because enterprise tools exist for good reasons (scale, support, security, and maintenance). But now you’ll buy with your eyes wide open.

    You’ll know what “good” looks like. You’ll show up to demos already understanding the edge cases, and know in about 5 minutes whether they actually get your specific problem. You’ll implement faster. You'll negotiate better because you're not completely dependent on the vendor's solution. And you’ll choose it because it's genuinely better than what you could build yourself.

    You'll have already mapped out the shape of what you need, and you'll just be looking for the best version of it.

    The new paradigm

    For years, the mantra was: Build or buy.

    Now, it’s more elegant and way smarter: Build to learn what to buy.

    And it's not some future state. This is already happening. Right now, somewhere, a customer rep is using AI to fix a product issue they spotted minutes ago. Somewhere else, a finance team is prototyping their own analytical tools because they've realized they can iterate faster than they can write up requirements for engineering. Somewhere, a team is realizing that the boundary between technical and non-technical was always more cultural than fundamental.

    The companies that embrace this shift will move faster and spend smarter. They’ll know their operations more deeply than any vendor ever could. They'll make fewer expensive mistakes, and buy better tools because they actually understand what makes tools good.

    The companies that stick to the old playbook will keep sitting through vendor pitches, nodding along at budget-friendly proposals. They’ll debate timelines, and keep mistaking professional decks for actual solutions.

    Until someone on their own team pops open their laptop, says, “I built a version of this last night. Want to check it out?” and shows them something they built in two hours that does 80% of what they’re about to pay six figures for.

    And, just like that, the rules change for good.

    Siqi Chen is co-founder and CEO of Runway.

    Read more from our guest writers. Or, consider submitting a post of your own! See our guidelines here.

  • Why most enterprise AI coding pilots underperform (Hint: It’s not the model)

    Gen AI in software engineering has moved well beyond autocomplete. The emerging frontier is agentic coding: AI systems capable of planning changes, executing them across multiple steps and iterating based on feedback. Yet despite the excitement around “AI agents that code,” most enterprise deployments underperform. The limiting factor is no longer the model. It’s context: The structure, history and intent surrounding the code being changed. In other words, enterprises are now facing a systems design problem: They have not yet engineered the environment these agents operate in.

    The shift from assistance to agency

    The past year has seen a rapid evolution from assistive coding tools to agentic workflows. Research has begun to formalize what agentic behavior means in practice: The ability to reason across design, testing, execution and validation rather than generate isolated snippets. Work such as dynamic action re-sampling shows that allowing agents to branch, reconsider and revise their own decisions significantly improves outcomes in large, interdependent codebases. At the platform level, providers like GitHub are now building dedicated agent orchestration environments, such as Copilot Agent and Agent HQ, to support multi-agent collaboration inside real enterprise pipelines.

    But early field results tell a cautionary story. When organizations introduce agentic tools without addressing workflow and environment, productivity can decline. A randomized controlled study this year showed that developers who used AI assistance in unchanged workflows completed tasks more slowly, largely due to verification, rework and confusion around intent. The lesson is straightforward: Autonomy without orchestration rarely yields efficiency.

    Why context engineering is the real unlock

    In every unsuccessful deployment I’ve observed, the failure stemmed from context. When agents lack a structured understanding of a codebase (its relevant modules, dependency graph, test harness, architectural conventions and change history), they often generate output that appears correct but is disconnected from reality. Too much information overwhelms the agent; too little forces it to guess. The goal is not to feed the model more tokens. The goal is to determine what should be visible to the agent, when and in what form.

    The teams seeing meaningful gains treat context as an engineering surface. They create tooling to snapshot, compact and version the agent’s working memory: What is persisted across turns, what is discarded, what is summarized and what is linked instead of inlined. They design deliberation steps rather than prompting sessions. They make the specification a first-class artifact, something reviewable, testable and owned, not a transient chat history. This shift aligns with a broader trend some researchers describe as “specs becoming the new source of truth.”
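
    A minimal sketch of what "snapshot, compact and version the agent's working memory" could look like in code follows. The class and field names are illustrative assumptions, not any vendor's tooling; the point is that persisted facts, summaries and external links are managed as distinct, versioned categories rather than one ever-growing chat transcript.

    ```python
    from dataclasses import dataclass, field
    from datetime import datetime, timezone

    @dataclass
    class ContextSnapshot:
        """One versioned slice of an agent's working memory (illustrative)."""
        version: int
        persisted: dict[str, str]    # facts carried verbatim across turns (spec, conventions)
        summarized: dict[str, str]   # long artifacts compacted into short summaries
        linked: dict[str, str]       # name -> URI, fetched on demand instead of inlined
        created_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

    class WorkingMemory:
        def __init__(self) -> None:
            self.history: list[ContextSnapshot] = []

        def snapshot(self, persisted: dict, summarized: dict, linked: dict) -> ContextSnapshot:
            snap = ContextSnapshot(len(self.history) + 1, persisted, summarized, linked)
            self.history.append(snap)   # versioned, so any agent turn can be audited or replayed
            return snap

        def compact(self, stale_keys: set[str]) -> None:
            # Drop stale context before the next turn instead of letting it accumulate.
            latest = self.history[-1]
            for key in stale_keys:
                latest.persisted.pop(key, None)
    ```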

    Workflow must change alongside tooling

    But context alone isn’t enough. Enterprises must re-architect the workflows around these agents. As McKinsey’s 2025 report “One Year of Agentic AI” noted, productivity gains arise not from layering AI onto existing processes but from rethinking the process itself. When teams simply drop an agent into an unaltered workflow, they invite friction: Engineers spend more time verifying AI-written code than they would have spent writing it themselves. The agents can only amplify what’s already structured: Well-tested, modular codebases with clear ownership and documentation. Without those foundations, autonomy becomes chaos.

    Security and governance, too, demand a shift in mindset. AI-generated code introduces new forms of risk: Unvetted dependencies, subtle license violations and undocumented modules that escape peer review. Mature teams are beginning to integrate agentic activity directly into their CI/CD pipelines, treating agents as autonomous contributors whose work must pass the same static analysis, audit logging and approval gates as any human developer. GitHub’s own documentation highlights this trajectory, positioning Copilot Agents not as replacements for engineers but as orchestrated participants in secure, reviewable workflows. The goal isn’t to let an AI “write everything,” but to ensure that when it acts, it does so inside defined guardrails.
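
    One way such guardrails might be expressed is as an explicit policy check in the pipeline, as in the hypothetical sketch below. The fields and thresholds are placeholders rather than GitHub's or any vendor's actual API; the point is that agent-authored changes clear the same gates as human ones, plus an audit trail and a human approval.

    ```python
    from dataclasses import dataclass

    @dataclass
    class ChangeRequest:
        author: str
        agent_authored: bool
        static_analysis_passed: bool
        license_scan_passed: bool
        human_approvals: int
        audit_log_attached: bool

    def merge_allowed(cr: ChangeRequest) -> bool:
        """Agent contributions pass the same gates as human ones, plus an audit trail."""
        baseline = (cr.static_analysis_passed
                    and cr.license_scan_passed
                    and cr.human_approvals >= 1)
        if not cr.agent_authored:
            return baseline
        return baseline and cr.audit_log_attached
    ```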

    What enterprise decision-makers should focus on now

    For technical leaders, the path forward starts with readiness rather than hype. Monoliths with sparse tests rarely yield net gains; agents thrive where tests are authoritative and can drive iterative refinement. This is exactly the loop Anthropic calls out for coding agents. Pilot in tightly scoped domains (test generation, legacy modernization, isolated refactors); treat each deployment as an experiment with explicit metrics (defect escape rate, PR cycle time, change failure rate, security findings burned down). As your usage grows, treat agents as data infrastructure: Every plan, context snapshot, action log and test run is data that composes into a searchable memory of engineering intent, and a durable competitive advantage.
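
    For instance, a pilot's explicit metrics can be captured as a simple before/after record like the sketch below (the field names are illustrative), so the decision to expand is driven by measured movement rather than perceived velocity.

    ```python
    from dataclasses import dataclass

    @dataclass
    class PilotMetrics:
        defect_escape_rate: float      # defects found after release / total defects
        pr_cycle_time_hours: float     # PR opened -> merged
        change_failure_rate: float     # failed deploys / total deploys
        open_security_findings: int    # outstanding findings attributed to agent changes

    def pilot_improved(before: PilotMetrics, after: PilotMetrics) -> bool:
        # Expand the pilot only if every metric moved in the right direction.
        return (after.defect_escape_rate <= before.defect_escape_rate
                and after.pr_cycle_time_hours <= before.pr_cycle_time_hours
                and after.change_failure_rate <= before.change_failure_rate
                and after.open_security_findings <= before.open_security_findings)
    ```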

    Under the hood, agentic coding is less a tooling problem than a data problem. Every context snapshot, test iteration and code revision becomes a form of structured data that must be stored, indexed and reused. As these agents proliferate, enterprises will find themselves managing an entirely new data layer: One that captures not just what was built, but how it was reasoned about. This shift turns engineering logs into a knowledge graph of intent, decision-making and validation. In time, the organizations that can search and replay this contextual memory will outpace those who still treat code as static text.

    The coming year will likely determine whether agentic coding becomes a cornerstone of enterprise development or another inflated promise. The difference will hinge on context engineering: How intelligently teams design the informational substrate their agents rely on. The winners will be those who see autonomy not as magic, but as an extension of disciplined systems design: Clear workflows, measurable feedback, and rigorous governance.

    Bottom line

    Platforms are converging on orchestration and guardrails, and research keeps improving context control at inference time. The winners over the next 12 to 24 months won’t be the teams with the flashiest model; they’ll be the ones that engineer context as an asset and treat workflow as the product. Do that, and autonomy compounds. Skip it, and the review queue does.

    Context + agent = leverage. Skip the first half, and the rest collapses.

    Dhyey Mavani is accelerating generative AI at LinkedIn.

    Read more from our guest writers. Or, consider submitting a post of your own! See our guidelines here.