Tag: OpenAI

  • From logs to insights: The AI breakthrough redefining observability

    Presented by Elastic


    Logs set to become the primary tool for finding the “why” in diagnosing network incidents

    Modern IT environments have a data problem: there’s too much of it. Organizations that need to manage a company’s environment are increasingly challenged to detect and diagnose issues in real-time, optimize performance, improve reliability, and ensure security and compliance — all within constrained budgets.

    The modern observability landscape has many tools that offer a solution. Most revolve around DevOps teams or Site Reliability Engineers (SREs) analyzing logs, metrics, and traces to uncover patterns and figure out what’s happening across the network, and diagnose why an issue or incident occurred. The problem is that the process creates information overload: A Kubernetes cluster alone can emit 30 to 50 gigabytes of logs a day, and suspicious behavior patterns can sneak past human eyes.

    "It’s so anachronistic now, in the world of AI, to think about humans alone observing infrastructure," says Ken Exner, chief product officer at Elastic. "I hate to break it to you, but machines are better than human beings at pattern matching."

    An industry-wide focus on visualizing symptoms forces engineers to manually hunt for answers. The crucial "why" is buried in logs, but because they contain massive volumes of unstructured data, the industry tends to use them as a tool of last resort. This has forced teams into costly tradeoffs: either spend countless hours building complex data pipelines, drop valuable log data and risk critical visibility gaps, or log and forget.

    Elastic, the Search AI Company, recently released a new feature for observability called Streams, which aims to make logs the primary signal for investigations by taking noisy logs and turning them into patterns, context and meaning.

    Streams uses AI to automatically partition and parse raw logs to extract relevant fields, greatly reducing the effort required of SREs to make logs usable. Streams also automatically surfaces significant events such as critical errors and anomalies from context-rich logs, giving SREs early warnings and a clear understanding of their workloads, enabling them to investigate and resolve issues faster. The ultimate goal is to show remediation steps.
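
    As a rough illustration of what parsing a raw log line into fields means in practice, the sketch below extracts named fields with a regular expression. The log layout, the field names and the `parse_log_line` helper are all hypothetical. This is not how Streams is implemented; the article describes it as AI-driven rather than built on hand-written patterns.

```python
import re

# Hypothetical log layout: "<ISO timestamp> <LEVEL> <service>: <message>".
LOG_PATTERN = re.compile(
    r"^(?P<timestamp>\S+)\s+(?P<level>[A-Z]+)\s+(?P<service>[\w-]+):\s+(?P<message>.*)$"
)

def parse_log_line(line):
    """Return a dict of extracted fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

raw = "2025-11-05T10:15:30Z ERROR checkout-db: connection pool exhausted"
fields = parse_log_line(raw)
print(fields)
```

    A hand-written pattern like this breaks as soon as a service logs in a different layout, which is the maintenance burden the article says automated parsing is meant to remove.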

    "From raw, voluminous, messy data, Streams automatically creates structure, putting it into a form that is usable, automatically alerts you to issues and helps you remediate them," Exner says. "That is the magic of Streams."

    A broken workflow

    Streams upends an observability process that some say is broken. Typically, SREs set up metrics, logs and traces. Then they set up alerts, and service level objectives (SLOs) — often hard-coded rules to show where a service or process has gone beyond a threshold, or a specific pattern has been detected.
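
    A hard-coded rule of the kind described above can be as simple as a fixed threshold on one metric. The sketch below is a generic illustration; the metric name, the threshold value and the `check_threshold` helper are assumptions, not any vendor's alerting API.

```python
from dataclasses import dataclass

@dataclass
class ThresholdRule:
    metric: str        # hypothetical metric name
    threshold: float   # alert when a sample exceeds this value

def check_threshold(rule, samples):
    """Return the indices of samples that breach the rule's threshold."""
    return [i for i, value in enumerate(samples) if value > rule.threshold]

rule = ThresholdRule(metric="p95_latency_ms", threshold=500.0)
latencies = [120.0, 180.0, 640.0, 210.0, 905.0]
breaches = check_threshold(rule, latencies)
print(f"{rule.metric}: breaches at sample indices {breaches}")
```

    Such a rule reports that a limit was crossed but says nothing about the cause, which is why the workflow then moves on to dashboards, traces and logs.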

    When an alert is triggered, it points to the metric that's showing an anomaly. From there, SREs look at a metrics dashboard, where they can visualize the issue and compare the alert to other metrics, or CPU to memory to I/O, and start looking for patterns.

    They may then need to look at a trace, and examine upstream and downstream dependencies across the application to dig into the root cause of the issue. Once they figure out what's causing the trouble, they jump into the logs for that database or service to try and debug the issue.

    Some companies simply seek to add more tools when current ones prove ineffective. That means SREs are hopping from tool to tool to keep on top of monitoring and troubleshooting across their infrastructure and applications.

    "You’re hopping across different tools. You’re relying on a human to interpret these things, visually look at the relationship between systems in a service map, visually look at graphs on a metrics dashboard, to figure out what and where the issue is," Exner says. "But AI automates that workflow away."

    With AI-powered Streams, logs are used not only reactively to resolve issues but also proactively, to flag potential issues and create information-rich alerts that help teams jump straight to problem-solving. The aim is for the system to offer a remediation, or even fix the issue entirely, and then automatically notify the team that it has been taken care of.

    "I believe that logs, the richest set of information, the original signal type, will start driving a lot of the automation that a service reliability engineer typically does today, and does very manually," he adds. "A human should not be in that process, where they are doing this by digging into themselves, trying to figure out what is going on, where and what the issue is, and then once they find the root cause, they’re trying to figure out how to debug it."

    Observability’s future

    Large language models (LLMs) could be a key player in the future of observability. LLMs excel at recognizing patterns in vast quantities of repetitive data, which closely resembles log and telemetry data in complex, dynamic systems. And today’s LLMs can be trained for specific IT processes. With automation tooling, the LLM has the information and tools it needs to resolve database errors or Java heap issues, and more. Incorporating those into platforms that bring context and relevance will be essential.

    Automated remediation will still take some time, Exner says, but automated runbooks and playbooks generated by LLMs will become standard practice within the next couple of years. In other words, remediation steps will be driven by LLMs. The LLM will offer up fixes, and the human will verify and implement them, rather than calling in an expert.

    Addressing skill shortages

    Going all in on AI for observability would help address a major shortage in the talent needed to manage IT infrastructure. Hiring is slow because organizations need teams with a great deal of experience and understanding of potential issues, and how to resolve them fast. That experience can come from an LLM that is contextually grounded, Exner says.

    "We can help deal with the skill shortage by augmenting people with LLMs that make them all instantly experts," he explains. "I think this is going to make it much easier for us to take novice practitioners and make them expert practitioners in both security and observability, and it’s going to make it possible for a more novice practitioner to act like an expert."

    Streams in Elastic Observability is available now. Get started by reading more on Streams.


    Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.

  • Google Cloud updates its AI Agent Builder with new observability dashboard and faster build-and-deploy tools

    Google Cloud has introduced a big update in a bid to keep AI developers on its Vertex AI platform for concepting, designing, building, testing, deploying and modifying AI agents in enterprise use cases.

    The new features, announced today, include additional governance tools for enterprises and expanding the capabilities for creating agents with just a few lines of code, moving faster with state-of-the-art context management layers and one-click deployment, as well as managed services for scaling production and evaluation, and support for identifying agents.

    Agent Builder, released last year during Google’s annual Cloud Next event, provides a no-code platform for enterprises to create agents and connect them to orchestration frameworks like LangChain.

    Google’s Agent Development Kit (ADK), which lets developers build agents “in under 100 lines of code,” can also be accessed through Agent Builder. 

    “These new capabilities underscore our commitment to Agent Builder, and simplify the agent development process to meet developers where they are, no matter which tech stack they choose,” said Mike Clark, director of Product Management, Vertex AI Agent Builder. 

    Build agents faster

    Part of Google’s pitch for Agent Builder’s new features is that enterprises can bake in orchestration even as they construct their agents.

    “Building an agent from a concept to a working product involves complex orchestration,” said Clark. 

    The new capabilities, which are shipped with the ADK, include:

    • SOTA context management layers including Static, Turn, User and Cache layers so enterprises have more control over the agents’ context

    • Prebuilt plugins with customizable logic. One of the new plugins allows agents to recognize failed tool calls and “self-heal” by retrying the task with a different approach

    • Additional language support in ADK, including Go, alongside the Python and Java support that launched with ADK

    • One-click deployment through the ADK command line interface to move agents from a local environment to live testing with a single command
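
    The "self-heal" behavior mentioned in the plugin bullet above, recognizing a failed tool call and retrying with a different approach, can be sketched generically as a fallback chain. This is not the ADK plugin API; `call_with_fallbacks` and both tool functions are hypothetical names used only for illustration.

```python
def call_with_fallbacks(strategies, query):
    """Try each strategy in order; return (name, result) from the first that succeeds."""
    errors = []
    for name, tool in strategies:
        try:
            return name, tool(query)
        except Exception as exc:  # a failed tool call: record it and try the next approach
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all strategies failed: " + "; ".join(errors))

def search_primary_index(query):
    # Hypothetical tool that is currently failing.
    raise TimeoutError("primary index timed out")

def search_backup_index(query):
    # Hypothetical alternative approach to the same task.
    return f"results for {query!r} from backup index"

used, result = call_with_fallbacks(
    [("primary", search_primary_index), ("backup", search_backup_index)],
    "quarterly revenue",
)
print(used, "->", result)
```

    Collecting the error from each failed attempt keeps the audit trail intact even when a later strategy succeeds or all of them fail.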

    Governance layer

    Enterprises require high accuracy; security; observability and auditability (what a program did and why); and steerability (control) in their production-grade AI agents.

    While Google had observability features in the local development environment at launch, developers can now access these tools through the Agent Engine managed runtime dashboard.

    The company said this brings cloud-based production monitoring to track token consumption, error rates and latency. Within this observability dashboard, enterprises can visualize the actions agents take and reproduce any issues. 

    Agent Engine will also have a new Evaluation Layer to help “simulate agent performance across a vast array of user interactions and situations.”

    This governance layer will also include:

    • Agent Identities that Google said give “agents their own unique, native identities within Google Cloud”

    • Model Armor, which would block prompt injections, screen tool calls and agent responses

    • Security Command Center, so admins can build an inventory of their agents to detect threats like unauthorized access

    “These native identities provide a deep, built-in layer of control and a clear audit trail for all agent actions. These certificate-backed identities further strengthen your security as they cannot be impersonated and are tied directly to the agent's lifecycle, eliminating the risk of dormant accounts,” Clark said. 

    The battle of agent builders 

    It’s no surprise that model providers create platforms to build agents and bring them to production. The competition lies in how fast new tools and features are added.

    Google’s Agent Builder competes with OpenAI’s open-source Agents SDK, which enables developers to create AI agents using non-OpenAI models.

    Additionally, there is the recently announced AgentKit, which features an Agent Builder that enables companies to integrate agents into their applications easily.

    Microsoft has its Azure AI Foundry, launched around this time last year for AI agent creation, and AWS also offers agent builders on its Bedrock platform, but Google is hoping its suite of new features will help give it a competitive edge.

    However, it isn’t just companies with their own models that court developers to build their AI agents within their platforms. Any enterprise service provider with an agent library also wants clients to make agents on their systems. 

    Capturing developer interest and keeping them within the ecosystem is the big battle between tech companies now, with features to make building and governing agents easier. 

  • Attention ISN’T all you need?! New Qwen3 variant Brumby-14B-Base leverages Power Retention technique

    When the transformer architecture was introduced in 2017 in the now seminal Google paper "Attention Is All You Need," it became an instant cornerstone of modern artificial intelligence.

    Every major large language model (LLM) — from OpenAI's GPT series to Anthropic's Claude, Google's Gemini, and Meta's Llama — has been built on some variation of its central mechanism: attention, the mathematical operation that allows a model to look back across its entire input and decide what information matters most.

    Eight years later, the same mechanism that defined AI’s golden age is now showing its limits. Attention is powerful, but it is also expensive — its computational and memory costs scale quadratically with context length, creating an increasingly unsustainable bottleneck for both research and industry. As models aim to reason across documents, codebases, or video streams lasting hours or days, attention becomes the architecture’s Achilles’ heel.

    On October 28, 2025, the little-known AI startup Manifest AI introduced a radical alternative. Their new model, Brumby-14B-Base, is a retrained variant of Qwen3-14B-Base, one of the leading open-source transformer models.

    But while many variants of Qwen have been trained already, Brumby-14B-Base is novel in that it abandons attention altogether.

    Instead, Brumby replaces those layers with a novel mechanism called Power Retention, a recurrent, hardware-efficient architecture that stores and updates information over arbitrarily long contexts without the quadratic memory growth of attention.

    Trained at a stated cost of just $4,000, the 14-billion-parameter Brumby model performs on par with established transformer models like Qwen3-14B and GLM-4.5-Air, achieving near-state-of-the-art accuracy on a range of reasoning and comprehension benchmarks.

    From Attention to Retention: The Architectural Shift

    The core of Manifest AI’s innovation lies in what they call the Power Retention layer.

    In a traditional transformer, every token computes a set of queries (Q), keys (K), and values (V), then performs a matrix operation that measures the similarity between every token and every other token—essentially a full pairwise comparison across the sequence.

    This is what gives attention its flexibility, but also what makes it so costly: processing a sequence twice as long takes roughly four times the compute and memory.
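
    The four-fold figure follows from the pairwise comparison: a sequence of n tokens needs n × n query-key dot products. The sketch below is a naive single-head attention in plain Python that counts those dot products. It is meant to show the scaling, not to resemble a production kernel, and the toy vectors are made up.

```python
import math

def naive_attention(queries, keys, values):
    """Full (non-causal) softmax attention; returns outputs and the number of Q.K dot products."""
    dot_products = 0
    outputs = []
    dim = len(values[0])
    for q in queries:
        scores = []
        for k in keys:
            scores.append(sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(len(q)))
            dot_products += 1
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)
        ])
    return outputs, dot_products

def toy_sequence(n, dim=4):
    # Made-up deterministic vectors, only to have something to attend over.
    return [[math.sin(i + d) for d in range(dim)] for i in range(n)]

short = toy_sequence(64)
long = toy_sequence(128)
_, cost_n = naive_attention(short, short, short)
_, cost_2n = naive_attention(long, long, long)
print(cost_n, cost_2n, cost_2n / cost_n)
```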

    Power Retention keeps the same inputs (Q, K, V), but replaces the global similarity operation with a recurrent state update.

    Each layer maintains a memory matrix S, which is updated at each time step according to the incoming key, value, and a learned gating signal.

    The process looks more like an RNN (Recurrent Neural Network) than a transformer: instead of recomputing attention over the entire context, the model continuously compresses past information into a fixed-size latent state.

    This means the cost of processing each new token does not grow with context length. Whether the model is processing 1,000 or 1,000,000 tokens, the per-token cost remains constant.
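
    To make the contrast concrete, here is a minimal gated linear recurrence in plain Python. It is a generic sketch of the idea described above (a fixed-size state S updated from each key, value and gate), not Manifest AI's actual Power Retention update. The update rule (decay S by a gate, then add the outer product of key and value), the constant gate and the toy inputs are all assumptions, and the tensor-power expansion of the inputs is omitted.

```python
def retention_step(state, key, value, gate):
    """One recurrent update: decay the old state by `gate`, then add the outer product key x value."""
    return [
        [gate * state[i][j] + key[i] * value[j] for j in range(len(value))]
        for i in range(len(key))
    ]

def read_state(state, query):
    """Read out: output[j] = sum over i of query[i] * state[i][j]."""
    return [
        sum(query[i] * state[i][j] for i in range(len(query)))
        for j in range(len(state[0]))
    ]

def run(tokens, dim_k=2, dim_v=3, gate=0.9):
    state = [[0.0] * dim_v for _ in range(dim_k)]
    outputs = []
    for query, key, value in tokens:
        state = retention_step(state, key, value, gate)  # same amount of work for every token
        outputs.append(read_state(state, query))
    return state, outputs

# Toy (query, key, value) triple; the numbers are arbitrary.
token = ([1.0, 0.0], [0.5, 0.5], [1.0, 2.0, 3.0])
short_state, _ = run([token] * 10)
long_state, _ = run([token] * 1000)

# The state has the same shape no matter how many tokens were processed.
print(len(short_state), len(short_state[0]), len(long_state), len(long_state[0]))
```

    Because the state is a fixed dim_k × dim_v matrix, each step costs the same regardless of how many tokens came before, whereas attention has to revisit every earlier token.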

    That property alone—constant-time per-token computation—marks a profound departure from transformer behavior.

    At the same time, Power Retention preserves the expressive power that made attention successful. Because the recurrence involves tensor powers of the input (hence the name “power retention”), it can represent higher-order dependencies between past and present tokens.

    The result is an architecture that can theoretically retain long-term dependencies indefinitely, while remaining as efficient as an RNN and as expressive as a transformer.

    Retraining, Not Rebuilding

    Perhaps the most striking aspect of Brumby-14B’s training process is its efficiency. Manifest AI trained the model for only 60 hours on 32 Nvidia H100 GPUs, at a cost of roughly $4,000 — less than 2% of what a conventional model of this scale would cost to train from scratch.
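
    The figures in this paragraph can be cross-checked with a little arithmetic. The sketch below uses only the numbers quoted above (60 hours, 32 GPUs, roughly $4,000, less than 2%). The implied hourly rate and the implied from-scratch cost are derived values, not figures reported by Manifest AI.

```python
hours = 60
gpus = 32
reported_cost_usd = 4_000
max_fraction_percent = 2  # "less than 2%", treated here as an upper bound

gpu_hours = hours * gpus
implied_rate = reported_cost_usd / gpu_hours
# If $4,000 is under 2% of the from-scratch cost, that cost must exceed this value.
implied_scratch_cost = reported_cost_usd * 100 / max_fraction_percent

print(f"GPU-hours: {gpu_hours}")
print(f"Implied price per H100-hour: ${implied_rate:.2f}")
print(f"Implied from-scratch cost: more than ${implied_scratch_cost:,.0f}")
```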

    However, since Brumby was built on an existing transformer-based model, it's safe to say that this advance alone will not end the transformer era.

    As Jacob Buckman, founder of Manifest AI, clarified in an email to VentureBeat: “The ability to train for $4,000 is indeed only possible when leveraging an existing transformer model,” he said. “Brumby could not be trained from scratch for that price.”

    Still, Buckman emphasized the significance of that result: “The reason this is important is that the ability to build on the weights of the previous generation of model architectures is a critical accelerant for the adoption of a new modeling paradigm.”

    He argues this demonstrates how attention-free systems can catch up to transformer performance “for orders-of-magnitude less” investment.

    In the loss curves released by Manifest AI, Brumby’s training loss quickly converges to that of the Qwen3 baseline within 3,000 training steps, even as the architecture diverges significantly from its transformer origins.

    Although Brumby-14B-Base began life as Qwen3-14B-Base, it did not remain identical for long. Manifest AI fundamentally altered Qwen3’s architecture by removing its attention layers—the mathematical engine that defines how a transformer model processes information—and replacing them with their new “power retention” mechanism. This change restructured the model’s internal wiring, effectively giving it a new brain while preserving much of its prior knowledge.

    Because of that architectural swap, the existing Qwen3 weights no longer fit perfectly. They were trained to operate within a transformer’s attention dynamics, not the new retention-based system. As a result, the Brumby model initially “forgot” how to apply some of its learned knowledge effectively. The retraining process—about 3,000 steps of additional learning—served to recalibrate those weights, aligning them with the power retention framework without having to start from zero.

    A helpful way to think about this is to imagine taking a world-class pianist and handing them a guitar. They already understand rhythm, harmony, and melody, but their hands must learn entirely new patterns to produce the same music. Similarly, Brumby had to relearn how to use its existing knowledge through a new computational instrument. Those 3,000 training steps were, in effect, its crash course in guitar lessons.

    By the end of this short retraining phase, Brumby had regained its full performance, reaching the same accuracy as the original Qwen3 model. That quick recovery is what makes the result so significant: it shows that an attention-free system can inherit and adapt the capabilities of a transformer model with only a fraction of the training time and cost.

    The benchmark progression plots show a similar trend: the model rapidly approaches its target accuracy on core evaluations like GSM8K, HellaSwag, and MMLU after only a few thousand steps, matching or even slightly surpassing Qwen3 on several tasks.

    Benchmarking the Brumby

    Across standard evaluation tasks, Brumby-14B-Base consistently performs at or near parity with transformer baselines of comparable scale.

    Task               Brumby-14B   Qwen3-14B   GLM-4.5-Air   Nemotron Nano (12B)
    ARC                0.89         0.94        0.92          0.93
    GSM8K              0.88         0.84        0.83          0.84
    GSM8K (Platinum)   0.87         0.88        0.85          0.87
    HellaSwag          0.77         0.81        0.85          0.82
    MATH               0.62         0.54        0.47          0.26
    MBPP               0.57         0.75        0.73          0.71
    MMLU               0.71         0.78        0.77          0.78
    MMLU (Pro)         0.36         0.55        0.51          0.53

    While it lags behind transformers on knowledge-heavy evaluations like MMLU-Pro (0.36 versus 0.51 to 0.55 in the results above), it matches or outperforms them on mathematical reasoning and long-context reasoning tasks, precisely where attention architectures tend to falter. This pattern reinforces the idea that recurrent or retention-based systems may hold a structural advantage for reasoning over extended temporal or logical dependencies.

    Hardware Efficiency and Inference Performance

    Brumby’s power retention design offers another major advantage: hardware efficiency.

    Because the state update involves only local matrix operations, inference can be implemented with linear complexity in sequence length.

    Manifest AI reports that their fastest kernels, developed through their in-house CUDA framework Vidrial, can deliver hundreds-fold speedups over attention on very long contexts.

    Buckman said the alpha-stage Power Retention kernels “achieve typical hardware utilization of 80–85%, which is higher than FlashAttention2’s 70–75% or Mamba’s 50–60%.”

    (Mamba is another emerging “post-transformer” architecture, introduced in 2023 by researchers at Carnegie Mellon and Princeton, that, like Power Retention, seeks to eliminate the computational bottleneck of attention. It replaces attention with a state-space mechanism that processes sequences linearly, updating an internal state over time rather than comparing every token to every other one. This makes it far more efficient for long inputs, though it typically achieves lower hardware utilization than Power Retention in early tests.)

    Both Power Retention and Mamba, he added, “expend meaningfully fewer total FLOPs than FlashAttention2 on long contexts, as well as far less memory.”

    According to Buckman, the reported 100× speedup comes from this combined improvement in utilization and computational efficiency, though he noted that “we have not yet stress-tested it on production-scale workloads.”

    Training and Scaling Economics

    Perhaps no statistic in the Brumby release generated more attention than the training cost.

    A 14-billion-parameter model, trained for $4,000, represents a two-order-of-magnitude reduction in the cost of foundation model development.

    Buckman confirmed that the low cost reflects a broader scaling pattern. “Far from diminishing returns, we have found that ease of retraining improves with scale,” he said. “The number of steps required to successfully retrain a model decreases with its parameter count.”

    Manifest has not yet validated the cost of retraining models at 700B parameters, but Buckman projected a range of $10,000–$20,000 for models of that magnitude—still far below transformer training budgets.

    He also reiterated that this approach could democratize large-scale experimentation by allowing smaller research groups or companies to retrain or repurpose existing transformer checkpoints without prohibitive compute costs.

    Integration and Deployment

    According to Buckman, converting an existing transformer into a Power Retention model is designed to be simple.

    “It is straightforward for any company that is already retraining, post-training, or fine-tuning open-source models,” he said. “Simply pip install retention, change one line of your architecture code, and resume training where you left off.”

    He added that after only a small number of GPU-hours, the model typically recovers its original performance—at which point it gains the efficiency benefits of the attention-free design.

    “The resulting architecture will permit far faster long-context training and inference than previously,” Buckman noted.

    On infrastructure, Buckman said the main Brumby kernels are written in Triton, compatible with both NVIDIA and AMD accelerators. Specialized CUDA kernels are also available through the team’s in-house Vidrial framework. Integration with vLLM and other inference engines remains a work in progress: “We have not yet integrated Power Retention into inference engines, but doing so is a major ongoing initiative at Manifest.”

    As for distributed inference, Buckman dismissed concerns about instability: “We have not found this difficulty to be exacerbated in any way by our recurrent-state architecture. In fact, context-parallel training and GPU partitioning for multi-user inference both become significantly cleaner technically when using our approach.”

    Mission and Long-Term Vision

    Beyond the engineering details, Buckman also described Manifest’s broader mission. “Our mission is to train a neural network to model all human output,” he said.

    The team’s goal, he explained, is to move beyond modeling “artifacts of intelligence” toward modeling “the intelligent processes that generated them.” This shift, he argued, requires “fundamentally rethinking” how models are designed and trained, work of which Power Retention represents only the beginning.

    The Brumby-14B release, he said, is “one step forward in a long march” toward architectures that can model thought processes continuously and efficiently.

    Public Debate and Industry Reception

    The launch of Brumby-14B sparked immediate discussion on X (formerly Twitter), where researchers debated the framing of Manifest AI’s announcement.

    Some, including Meta researcher Ariel (@redtachyon), argued that the “$4,000 foundation model” tagline was misleading, since the training involved reusing pretrained transformer weights rather than training from scratch.

    “They shuffled around the weights of Qwen, fine-tuned it a bit, and called it ‘training a foundation model for $4k,’” Ariel wrote.

    Buckman responded publicly, clarifying that the initial tweet had been part of a longer thread explaining the retraining approach. “It’s not like I was being deceptive about it,” he wrote. “I broke it up into separate tweets, and now everyone is mad about the first one.”

    In a follow-up email, Buckman took a measured view of the controversy. “The end of the transformer era is not yet here,” he reiterated, “but the march has begun.”

    He also acknowledged that the $4,000 claim, though technically accurate in context, had drawn attention precisely because it challenged expectations about what it costs to experiment at frontier scale.

    Conclusion: A Crack in the Transformer’s Wall?

    The release of Brumby-14B-Base is more than an engineering milestone; it is a proof of concept that the transformer’s dominance may finally face credible competition.

    By replacing attention with power retention, Manifest AI has demonstrated that performance parity with state-of-the-art transformers is possible at a fraction of the computational cost—and that the long-context bottleneck can be broken without exotic hardware.

    The broader implications are twofold. First, the economics of training and serving large models could shift dramatically, lowering the barrier to entry for open research and smaller organizations.

    Second, the architectural diversity of AI models may expand again, reigniting theoretical and empirical exploration after nearly a decade of transformer monoculture.

    As Buckman put it: “The end of the transformer era is not yet here. Our release is just one step forward in a long march toward the future.”

  • Databricks research reveals that building better AI judges isn’t just a technical concern, it’s a people problem

    The intelligence of AI models isn't what's blocking enterprise deployments. It's the inability to define and measure quality in the first place.

    That's where AI judges are now playing an increasingly important role. In AI evaluation, a "judge" is an AI system that scores outputs from another AI system. 

    Judge Builder is Databricks' framework for creating judges and was first deployed as part of the company's Agent Bricks technology earlier this year. The framework has evolved significantly since its initial launch in response to direct user feedback and deployments.

    Early versions focused on technical implementation, but customer feedback revealed the real bottleneck was organizational alignment. Databricks now offers a structured workshop process that guides teams through three core challenges: getting stakeholders to agree on quality criteria, capturing domain expertise from a limited pool of subject matter experts and deploying evaluation systems at scale.

    "The intelligence of the model is typically not the bottleneck, the models are really smart," Jonathan Frankle, Databricks' chief AI scientist, told VentureBeat in an exclusive briefing. "Instead, it's really about asking, how do we get the models to do what we want, and how do we know if they did what we wanted?"

    The 'Ouroboros problem' of AI evaluation

    Judge Builder addresses what Pallavi Koppol, a Databricks research scientist who led the development, calls the "Ouroboros problem." An Ouroboros is an ancient symbol that depicts a snake eating its own tail.

    Using AI systems to evaluate AI systems creates a circular validation challenge.

    "You want a judge to see if your system is good, if your AI system is good, but then your judge is also an AI system," Koppol explained. "And now you're saying like, well, how do I know this judge is good?"

    The solution is measuring "distance to human expert ground truth" as the primary scoring function. By minimizing the gap between how an AI judge scores outputs versus how domain experts would score them, organizations can trust these judges as scalable proxies for human evaluation.
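
    One simple way to read this distance is as the average gap between a judge's scores and expert scores on the same outputs. The sketch below uses mean absolute error on a 1-5 scale. The metric choice, the scale and the sample scores are assumptions for illustration, since the article does not specify how Databricks computes the distance.

```python
def mean_absolute_distance(judge_scores, expert_scores):
    """Average absolute gap between judge and expert scores on the same items."""
    if len(judge_scores) != len(expert_scores) or not judge_scores:
        raise ValueError("score lists must be non-empty and the same length")
    return sum(abs(j - e) for j, e in zip(judge_scores, expert_scores)) / len(judge_scores)

# Made-up scores on a 1-5 scale for five outputs.
expert = [5, 4, 2, 1, 3]
judge_a = [5, 4, 3, 1, 3]   # close to the experts
judge_b = [3, 3, 3, 3, 3]   # always answers "neutral"

distance_a = mean_absolute_distance(judge_a, expert)
distance_b = mean_absolute_distance(judge_b, expert)
best = "judge_a" if distance_a < distance_b else "judge_b"
print(distance_a, distance_b, best)
```

    Under this reading, tuning a judge means choosing the prompt or model variant with the smallest distance on a held-out set of expert-labelled examples.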

    This approach differs fundamentally from traditional guardrail systems or single-metric evaluations. Rather than asking whether an AI output passed or failed on a generic quality check, Judge Builder creates highly specific evaluation criteria tailored to each organization's domain expertise and business requirements.

    The technical implementation also sets it apart. Judge Builder integrates with Databricks' MLflow and prompt optimization tools and can work with any underlying model. Teams can version control their judges, track performance over time and deploy multiple judges simultaneously across different quality dimensions.

    Lessons learned: Building judges that actually work

    Databricks' work with enterprise customers revealed three critical lessons that apply to anyone building AI judges.

    Lesson one: Your experts don't agree as much as you think. When quality is subjective, organizations discover that even their own subject matter experts disagree on what constitutes acceptable output. A customer service response might be factually correct but use an inappropriate tone. A financial summary might be comprehensive but too technical for the intended audience.

    "One of the biggest lessons of this whole process is that all problems become people problems," Frankle said. "The hardest part is getting an idea out of a person's brain and into something explicit. And the harder part is that companies are not one brain, but many brains."

    The fix is batched annotation with inter-rater reliability checks. Teams annotate examples in small groups, then measure agreement scores before proceeding. This catches misalignment early. In one case, three experts gave ratings of 1, 5 and neutral for the same output before discussion revealed they were interpreting the evaluation criteria differently.

    Companies using this approach achieve inter-rater reliability scores as high as 0.6 compared to typical scores of 0.3 from external annotation services. Higher agreement translates directly to better judge performance because the training data contains less noise.
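
    The article does not say which inter-rater reliability statistic produces the 0.6 and 0.3 figures. As one common choice, the sketch below computes Cohen's kappa for two annotators, which corrects raw agreement for the agreement expected by chance. The labels are made up.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and the same length")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)

# Made-up pass/fail annotations from two experts on ten outputs.
expert_1 = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
expert_2 = ["pass", "pass", "fail", "fail", "fail", "pass", "pass", "fail", "pass", "pass"]

kappa = cohens_kappa(expert_1, expert_2)
print(round(kappa, 3))
```

    Running a check like this after each small annotation batch is what lets a team catch diverging interpretations before they contaminate the judge's training data.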

    Lesson two: Break down vague criteria into specific judges. Instead of one judge evaluating whether a response is "relevant, factual and concise," create three separate judges. Each targets a specific quality aspect. This granularity matters because a failing "overall quality" score reveals something is wrong but not what to fix.

    The best results come from combining top-down requirements, such as regulatory constraints and stakeholder priorities, with bottom-up discovery of observed failure patterns. One customer built a top-down judge for correctness but discovered through data analysis that correct responses almost always cited the top two retrieval results. This insight became a new production-friendly judge that could proxy for correctness without requiring ground-truth labels.
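
    The proxy judge in that example can be sketched as a simple check: does the response cite at least one of the top two retrieved documents? The data shapes, document IDs and the `cites_top_results` helper below are hypothetical, and treating a single matching citation as sufficient is also an assumption. The article does not describe the customer's actual implementation.

```python
def cites_top_results(cited_ids, retrieved_ids, top_k=2):
    """Return True if the response cites any of the top_k retrieved document IDs."""
    top = set(retrieved_ids[:top_k])
    return any(doc_id in top for doc_id in cited_ids)

# Hypothetical retrieval ranking (best first) and the citations found in two responses.
retrieved = ["doc-17", "doc-04", "doc-33", "doc-08"]
response_a_citations = ["doc-04"]
response_b_citations = ["doc-33", "doc-08"]

print(cites_top_results(response_a_citations, retrieved))
print(cites_top_results(response_b_citations, retrieved))
```

    A check like this needs no ground-truth label at scoring time, which is what makes it usable in production.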

    Lesson three: You need fewer examples than you think. Teams can create robust judges from just 20-30 well-chosen examples. The key is selecting edge cases that expose disagreement rather than obvious examples where everyone agrees.
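
    One way to find edge cases that expose disagreement is to rank candidate examples by how much the experts' ratings vary. This is a sketch under assumptions: the 1-to-5 scale, the example IDs and the use of variance as the disagreement measure are illustrative, not Databricks' documented procedure.

```python
from statistics import pvariance


def pick_edge_cases(ratings_by_example, limit=30):
    """Rank examples by rating variance across experts, most contested first."""
    scored = [
        (pvariance(ratings), example_id)
        for example_id, ratings in ratings_by_example.items()
        if len(ratings) > 1  # disagreement needs at least two ratings
    ]
    # Highest variance first; ties broken by example ID for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [example_id for _, example_id in scored[:limit]]


# Hypothetical 1-to-5 ratings from three experts.
ratings = {
    "ex-1": [5, 5, 5],  # everyone agrees: little calibration value
    "ex-2": [1, 5, 3],  # strong disagreement: worth discussing
    "ex-3": [4, 5, 4],  # mild disagreement
}
edge_cases = pick_edge_cases(ratings, limit=2)
print(edge_cases)
```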

    "We're able to run this process with some teams in as little as three hours, so it doesn't really take that long to start getting a good judge," Koppol said.

    Production results: From pilots to seven-figure deployments

    Frankle shared three metrics Databricks uses to measure Judge Builder's success: whether customers want to use it again, whether they increase AI spending and whether they progress further in their AI journey.

    On the first metric, one customer created more than a dozen judges after their initial workshop. "This customer made more than a dozen judges after we walked them through doing this in a rigorous way for the first time with this framework," Frankle said. "They really went to town on judges and are now measuring everything."

    For the second metric, the business impact is clear. "There are multiple customers who have gone through this workshop and have become seven-figure spenders on GenAI at Databricks in a way that they weren't before," Frankle said.

    The third metric reveals Judge Builder's strategic value. Customers who previously hesitated to use advanced techniques like reinforcement learning now feel confident deploying them because they can measure whether improvements actually occurred.

    "There are customers who have gone and done very advanced things after having had these judges where they were reluctant to do so before," Frankle said. "They've moved from doing a little bit of prompt engineering to doing reinforcement learning with us. Why spend the money on reinforcement learning, and why spend the energy on reinforcement learning if you don't know whether it actually made a difference?"

    What enterprises should do now

    The teams successfully moving AI from pilot to production treat judges not as one-time artifacts but as evolving assets that grow with their systems.

    Databricks recommends three practical steps. First, focus on high-impact judges by identifying one critical regulatory requirement plus one observed failure mode. These become your initial judge portfolio.

    Second, create lightweight workflows with subject matter experts. A few hours reviewing 20-30 edge cases provides sufficient calibration for most judges. Use batched annotation and inter-rater reliability checks to denoise your data.

    Third, schedule regular judge reviews using production data. New failure modes will emerge as your system evolves. Your judge portfolio should evolve with them.

    "A judge is a way to evaluate a model, it's also a way to create guardrails, it's also a way to have a metric against which you can do prompt optimization and it's also a way to have a metric against which you can do reinforcement learning," Frankle said. "Once you have a judge that you know represents your human taste in an empirical form that you can query as much as you want, you can use it in 10,000 different ways to measure or improve your agents."

  • Developers beware: Google’s Gemma model controversy exposes model lifecycle risks

    The recent controversy surrounding Google’s Gemma model has once again highlighted the dangers of using developer test models and the fleeting nature of model availability. 

    Google pulled its Gemma 3 model from AI Studio following a statement from Senator Marsha Blackburn (R-Tenn.) that the Gemma model willfully hallucinated falsehoods about her. Blackburn said the model fabricated news stories about her that go beyond “harmless hallucination” and function as a defamatory act. 

    In response, Google posted on X on October 31 that it will remove Gemma from AI Studio, stating that this is “to prevent confusion.” Gemma remains available via API. 

    It had also been available via AI Studio, which the company described as "a developer tool (in fact, to use it you need to attest you're a developer). We’ve now seen reports of non-developers trying to use Gemma in AI Studio and ask it factual questions. We never intended this to be a consumer tool or model, or to be used this way. To prevent this confusion, access to Gemma is no longer available on AI Studio."

    To be clear, Google has the right to remove its model from its platform, especially if people have found hallucinations and falsehoods that could proliferate. It also underscores the danger of relying mainly on experimental models and why enterprise developers need to save projects before AI models are sunsetted or removed. Technology companies like Google continue to face political controversies, which often influence their deployments. 

    VentureBeat reached out to Google for additional information and was pointed to the company's October 31 posts. We also contacted the office of Sen. Blackburn, which reiterated her stance, outlined in a statement, that AI companies should “shut [models] down until you can control it.”

    Developer experiments

    The Gemma family of models, which includes a 270M parameter version, is best suited for small, quick apps and tasks that can run on devices such as smartphones and laptops. Google said the Gemma models were “built specifically for the developer and research community. They are not meant for factual assistance or for consumers to use.”

    Nevertheless, non-developers could still access Gemma because it is on the AI Studio platform, a more beginner-friendly space for developers to play around with Google AI models compared to Vertex AI. So even if Google never intended Gemma and AI Studio to be accessible to, say, Congressional staffers, these situations can still occur. 

    It also shows that even as models continue to improve, they still produce inaccurate and potentially harmful information. Enterprises must continually weigh the benefits of using models like Gemma against their potential inaccuracies.

    Project continuity 

    Another concern is the control that AI companies have over their models. The adage “you don’t own anything on the internet” remains true. If you don’t own a physical or local copy of software, it’s easy for you to lose access to it if the company that owns it decides to take it away. Google did not clarify with VentureBeat if current projects on AI Studio powered by Gemma are saved. 

    Similarly, OpenAI users were disappointed when the company announced that it would remove popular older models from ChatGPT. Even after walking back that decision and reinstating GPT-4o in ChatGPT, OpenAI CEO Sam Altman continues to field questions about keeping and supporting the model.

    AI companies can, and should, remove their models if they create harmful outputs. AI models, no matter how mature, remain works in progress and are constantly evolving and improving. But, since they are experimental in nature, models can easily become tools that technology companies and lawmakers can wield as leverage. Enterprise developers must ensure that their work can be saved before models are removed from platforms. 

  • Meet Denario, the AI ‘research assistant’ that is already getting its own papers published

    An international team of researchers has released an artificial intelligence system capable of autonomously conducting scientific research across multiple disciplines — generating papers from initial concept to publication-ready manuscript in approximately 30 minutes for about $4 each.

    The system, called Denario, can formulate research ideas, review existing literature, develop methodologies, write and execute code, create visualizations, and draft complete academic papers. In a demonstration of its versatility, the team used Denario to generate papers spanning astrophysics, biology, chemistry, medicine, neuroscience, and other fields, with one AI-generated paper already accepted for publication at an academic conference.

    "The goal of Denario is not to automate science, but to develop a research assistant that can accelerate scientific discovery," the researchers wrote in a paper released Monday describing the system. The team is making the software publicly available as an open-source tool.

    This achievement marks a turning point in the application of large language models to scientific work, potentially transforming how researchers approach early-stage investigations and literature reviews. However, the research also highlights substantial limitations and raises pressing questions about validation, authorship, and the changing nature of scientific labor.

    From data to draft: how AI agents collaborate to conduct research

    At its core, Denario operates not as a single AI brain but as a digital research department where specialized AI agents collaborate to push a project from conception to completion. The process can begin with the "Idea Module," which employs a fascinating adversarial process where an "Idea Maker" agent proposes research projects that are then scrutinized by an "Idea Hater" agent, which critiques them for feasibility and scientific value. This iterative loop refines raw concepts into robust research directions.

    Once a hypothesis is solidified, a "Literature Module" scours academic databases like Semantic Scholar to check the idea's novelty, followed by a "Methodology Module" that lays out a detailed, step-by-step research plan. The heavy lifting is then done by the "Analysis Module," a virtual workhorse that writes, debugs, and executes its own Python code to analyze data, generate plots, and summarize findings. Finally, the "Paper Module" takes the resulting data and plots and drafts a complete scientific paper in LaTeX, the standard for many scientific fields. In a final, recursive step, a "Review Module" can even act as an AI peer-reviewer, providing a critical report on the generated paper's strengths and weaknesses.

    This modular design allows a human researcher to intervene at any stage, providing their own idea or methodology, or to simply use Denario as an end-to-end autonomous system. "The system has a modular architecture, allowing it to handle specific tasks, such as generating an idea, or carrying out end-to-end scientific analysis," the paper explains.

    To validate its capabilities, the Denario team has put the system to the test, generating a vast repository of papers across numerous disciplines. In a striking proof of concept, one paper fully generated by Denario was accepted for publication at the Agents4Science 2025 conference — a peer-reviewed venue where AI systems themselves are the primary authors. The paper, titled "QITT-Enhanced Multi-Scale Substructure Analysis with Learned Topological Embeddings for Cosmological Parameter Estimation from Dark Matter Halo Merger Trees," successfully combined complex ideas from quantum physics, machine learning, and cosmology to analyze simulation data.

    The ghost in the machine: AI’s ‘vacuous’ results and ethical alarms

    While the successes are notable, the research paper is refreshingly candid about Denario's significant limitations and failure modes. The authors stress that the system currently "behaves more like a good undergraduate or early graduate student rather than a full professor in terms of big picture, connecting results…etc." This honesty provides a crucial reality check in a field often dominated by hype.

    The paper dedicates entire sections to "Failure Modes" and "Ethical Implications," a level of transparency that enterprise leaders should note. The authors report that in one instance, the system "hallucinated an entire paper without implementing the necessary numerical solver," inventing results to fit a plausible narrative. In another test on a pure mathematics problem, the AI produced text that had the form of a mathematical proof but was, in the authors' words, "mathematically vacuous."

    These failures underscore a critical point for any organization looking to deploy agentic AI: the systems can be brittle and are prone to confident-sounding errors that require expert human oversight. The Denario paper serves as a vital case study in the importance of keeping a human in the loop for validation and critical assessment.

    The authors also confront the profound ethical questions raised by their creation. They warn that "AI agents could be used to quickly flood the scientific literature with claims driven by a particular political agenda or specific commercial or economic interests." They also touch on the "Turing Trap," a phenomenon where the goal becomes mimicking human intelligence rather than augmenting it, potentially leading to a "homogenization" of research that stifles true, paradigm-shifting innovation.

    An open-source co-pilot for the world's labs

    Denario is not just a theoretical exercise locked away in an academic lab. The entire system is open-source under a GPL-3.0 license and is accessible to the broader community. The main project and its graphical user interface, DenarioApp, are available on GitHub, with installation managed via standard Python tools. For enterprise environments focused on reproducibility and scalability, the project also provides official Docker images. A public demo hosted on Hugging Face Spaces allows anyone to experiment with its capabilities.

    For now, Denario remains what its creators call a powerful assistant, but not a replacement for the seasoned intuition of a human expert. This framing is deliberate. The Denario project is less about creating an automated scientist and more about building the ultimate co-pilot, one designed to handle the tedious and time-consuming aspects of modern research.

    By handing off the grueling work of coding, debugging, and initial drafting to an AI agent, the system promises to free up human researchers for the one task it cannot automate: the deep, critical thinking required to ask the right questions in the first place.

  • Strengthening Our Core: Welcoming Karyne Levy as VentureBeat’s New Managing Editor

    I’m thrilled to announce a fantastic new addition to our leadership team: Karyne Levy is joining VentureBeat as our new Managing Editor. Today is her first day.

    Many of you may know Karyne from her most recent role as Deputy Managing Editor at TechCrunch, but her career is a highlight reel of veteran tech journalism. Her resume includes pivotal roles at Protocol, NerdWallet, Business Insider, and CNET, giving her a deep understanding of this industry from every angle.

    Hiring Karyne is a significant step forward for VentureBeat. As we’ve sharpened our focus on serving you – the enterprise technical decision-maker navigating the complexities of AI and data – I’ve been looking for a very specific kind of leader.

    The "Organizer's Dopamine Hit"

    In the past, a managing editor was often the final backstop for copy. Today, at a modern, data-focused media company like ours, the role is infinitely more dynamic. It’s the central hub of the entire content operation.

    During my search, I found myself talking a lot about the two types of "dopamine hits" in our business. There’s the writer’s hit – seeing your name on a great story. And then there’s the organizer’s hit – the satisfaction that comes from building, tuning, and running the complex machine that allows a dozen different parts of the company to move in a single, powerful direction.

    We were looking for the organizer.

    When I spoke with Karyne, I explained this vision: a leader who thrives on creating workflows, who loves being the liaison between editorial, our data and survey team, our events, and our marketing operations.

    Her response confirmed she was the one: "Everything you said is exactly my dopamine hit."

    Karyne’s passion is making the entire operation hum. She has a proven track record of managing people, running newsrooms, and interfacing with all parts of a business to ensure everyone is aligned. That operational rigor is precisely what we need for our next chapter.

    Why This Matters for Our Strategy (and for You)

    As I’ve written about before, VentureBeat is on a mission to evolve. In an age where experts and companies can publish directly, it’s not enough to be a secondary source. Our goal is to become a primary source for you.

    How? By leveraging our relationship with our community of millions of technical leaders. We are increasingly surveying you directly to generate proprietary insights you can’t get anywhere else. We want to be the first to tell you which vector stores your peers are actually implementing, what governance challenges are most pressing for data scientists, or how your counterparts are budgeting for generative AI.

    This is an ambitious strategy. It requires a tight-knit team where our editorial content, our research surveys and reports, our newsletters, and our VB Transform events are all working from the same playbook.

    Karyne is the leader who will help us execute that vision. Her experience at Protocol, which was also dedicated to serving technical and business decision-makers, means she fundamentally understands our audience. She is ideally suited to manage our newsroom and ensure that every piece of content we produce helps you do your job better. She’ll be working alongside Carl Franzen, our executive editor, who continues to drive news decision-making.

    This is a fantastic hire for VentureBeat. It’s another sign of our commitment to building the most focused, expert team in enterprise AI and data.

    Please join me in welcoming Karyne to the team.

  • The beginning of the end of the transformer era? Neuro-symbolic AI startup AUI announces new funding at $750M valuation

    The buzzed-about but still stealthy New York City startup Augmented Intelligence Inc (AUI), which seeks to go beyond the popular "transformer" architecture used by most of today's LLMs such as ChatGPT and Gemini, has raised $20 million in a bridge SAFE round at a $750 million valuation cap, bringing its total funding to nearly $60 million, VentureBeat can exclusively reveal.

    The round, completed in under a week, comes amid heightened interest in deterministic conversational AI and precedes a larger raise now in advanced stages.

    AUI relies on a fusion of the transformer tech and a newer technology called "neuro-symbolic AI," described in greater detail below.

    "We realize that you can combine the brilliance of LLMs in linguistic capabilities with the guarantees of symbolic AI," said Ohad Elhelo, AUI co-founder and CEO in a recent interview with VentureBeat. Elhelo launched the company in 2017 alongside co-founder and Chief Product Officer Ori Cohen.

    The new financing includes participation from eGateway Ventures, New Era Capital Partners, existing shareholders, and other strategic investors. It follows a $10 million raise in September 2024 at a $350 million valuation cap, coinciding with the company’s announced go-to-market partnership with Google in October 2024. Early investors include Vertex Pharmaceuticals founder Joshua Boger, UKG Chairman Aron Ain, and former IBM President Jim Whitehurst.

    According to the company, the bridge round is a precursor to a significantly larger raise already in advanced stages.

    AUI is the company behind Apollo-1, a new foundation model built for task-oriented dialog, which it describes as the "economic half" of conversational AI — distinct from the open-ended dialog handled by LLMs like ChatGPT and Gemini.

    The firm argues that existing LLMs lack the determinism, policy enforcement, and operational certainty required by enterprises, especially in regulated sectors.

    Chris Varelas, co-founder of Redwood Capital and an advisor to AUI, said in a press release provided to VentureBeat: “I’ve seen some of today’s top AI leaders walk away with their heads spinning after interacting with Apollo-1.”

    A Distinctive Neuro-Symbolic Architecture

    Apollo-1’s core innovation is its neuro-symbolic architecture, which separates linguistic fluency from task reasoning. Rather than relying solely on the technology underpinning most LLMs and conversational AI systems today (the vaunted transformer architecture described in the seminal 2017 Google paper "Attention Is All You Need"), AUI's system integrates two layers:

    • Neural modules, powered by LLMs, handle perception: encoding user inputs and generating natural language responses.

    • A symbolic reasoning engine, developed over several years, interprets structured task elements such as intents, entities, and parameters. This symbolic state engine determines the appropriate next actions using deterministic logic.

    This hybrid architecture allows Apollo-1 to maintain state continuity, enforce organizational policies, and reliably trigger tool or API calls — capabilities that transformer-only agents lack.

    Elhelo said this design emerged from a multi-year data collection effort: “We built a consumer service and recorded millions of human-agent interactions across 60,000 live agents. From that, we abstracted a symbolic language that defines the structure of task-based dialogs, separate from their domain-specific content.”

    However, enterprises that have already built systems around transformer LLMs needn't worry. AUI wants to make adopting its new technology just as easy.

    "Apollo-1 deploys like any modern foundation model," Elhelo told VentureBeat in a text last night. "It doesn’t require dedicated or proprietary clusters to run. It operates across standard cloud and hybrid environments, leveraging both GPUs and CPUs, and is significantly more cost-efficient to deploy than frontier reasoning models. Apollo-1 can also be deployed across all major clouds in a separated environment for increased security."

    Generalization and Domain Flexibility

    Apollo-1 is described as a foundation model for task-oriented dialog, meaning it is domain-agnostic and generalizable across verticals like healthcare, travel, insurance, and retail.

    Unlike consulting-heavy AI platforms that require building bespoke logic per client, Apollo-1 allows enterprises to define behaviors and tools within a shared symbolic language. This approach supports faster onboarding and reduces long-term maintenance. According to the team, an enterprise can launch a working agent in under a day.

    Crucially, procedural rules are encoded at the symbolic layer — not learned from examples. This enables deterministic execution for sensitive or regulated tasks.

    For instance, a system can block cancellation of a Basic Economy flight not by guessing intent but by applying hard-coded logic to a symbolic representation of the booking class.
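
    The idea can be illustrated with a plain deterministic rule. This is only a sketch: AUI's symbolic language is not public, so the `Booking` type, the fare-class names and the returned action format are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Booking:
    booking_id: str
    fare_class: str  # hypothetical values, e.g. "BASIC_ECONOMY" or "MAIN_CABIN"


# Policy lives in data and code, not in model weights or prompt wording.
NON_CANCELLABLE_FARES = frozenset({"BASIC_ECONOMY"})


def handle_cancel_request(booking):
    """Decide the next action from the booking's structured state alone."""
    if booking.fare_class in NON_CANCELLABLE_FARES:
        return {"action": "refuse", "reason": "fare_class_not_cancellable"}
    return {"action": "cancel", "booking_id": booking.booking_id}


blocked = handle_cancel_request(Booking("B1", "BASIC_ECONOMY"))
allowed = handle_cancel_request(Booking("B2", "MAIN_CABIN"))
print(blocked, allowed)
```

    The same input always produces the same decision, however the user phrases the request. A language model would only be needed to map the user's text onto the structured booking and intent.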

    As Elhelo explained to VentureBeat, LLMs are "not a good mechanism when you’re looking for certainty. It’s better if you know what you’re going to send [to an AI model] and always send it, and you know, always, what’s going to come back [to the user] and how to handle that."

    Availability and Developer Access

    Apollo-1 is already in active use within Fortune 500 enterprises in a closed beta, and a broader general availability release is expected before the end of 2025, according to a previous report by The Information, which broke the initial news on the startup.

    Enterprises can integrate with Apollo-1 either via:

    • A developer playground, where business users and technical teams jointly configure policies, rules, and behaviors; or

    • A standard API, using OpenAI-compatible formats.

    The model supports policy enforcement, rule-based customization, and steering via guardrails. Symbolic rules allow businesses to dictate fixed behaviors, while LLM modules handle open-text interpretation and user interaction.

    Enterprise Fit: When Reliability Beats Fluency

    While LLMs have advanced general-purpose dialog and creativity, they remain probabilistic — a barrier to enterprise deployment in finance, healthcare, and customer service.

    Apollo-1 targets this gap by offering a system where policy adherence and deterministic task completion are first-class design goals.

    Elhelo puts it plainly: “If your use case is task-oriented dialog, you have to use us, even if you are ChatGPT.”

  • Moving past speculation: How deterministic CPUs deliver predictable AI performance

    For more than three decades, modern CPUs have relied on speculative execution to keep pipelines full. When it emerged in the 1990s, speculation was hailed as a breakthrough — just as pipelining and superscalar execution had been in earlier decades. Each marked a generational leap in microarchitecture. By predicting the outcomes of branches and memory loads, processors could avoid stalls and keep execution units busy.

    But this architectural shift came at a cost: wasted energy when predictions failed, increased complexity, and vulnerabilities such as Spectre and Meltdown. As David Patterson observed in 1980, “A RISC potentially gains in speed merely from a simpler design.” Patterson’s principle of simplicity underpins a new alternative to speculation: a deterministic, time-based execution model.

    For the first time since speculative execution became the dominant paradigm, a fundamentally new approach has been invented. This breakthrough is embodied in a series of six recently issued U.S. patents, sailing through the U.S. Patent and Trademark Office (USPTO). Together, they introduce a radically different instruction execution model. Departing sharply from conventional speculative techniques, this novel deterministic framework replaces guesswork with a time-based, latency-tolerant mechanism. Each instruction is assigned a precise execution slot within the pipeline, resulting in a rigorously ordered and predictable flow of execution. This reimagined model redefines how modern processors can handle latency and concurrency with greater efficiency and reliability.

    A simple time counter is used to deterministically set the exact time of when instructions should be executed in the future. Each instruction is dispatched to an execution queue with a preset execution time based on resolving its data dependencies and availability of resources — read buses, execution units and the write bus to the register file. Each instruction remains queued until its scheduled execution slot arrives. This new deterministic approach may represent the first major architectural challenge to speculation since it became the standard.
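
    A toy model makes the idea concrete. The sketch below is a simplification under stated assumptions, not the patented design: one instruction is dispatched per cycle, only data dependencies are considered (buses and execution units are ignored), and the instruction names, registers and latencies are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    name: str
    dest: str
    srcs: tuple
    latency: int  # cycles from execution start until the result is available


def assign_execution_times(program):
    """Give each instruction a preset execution cycle from its data dependencies."""
    ready_at = {}  # register -> cycle at which its value becomes available
    schedule = {}
    for time_counter, instr in enumerate(program):  # one dispatch per cycle
        operands_ready = max((ready_at.get(reg, 0) for reg in instr.srcs), default=0)
        # Never execute before dispatch, and never before the operands are ready.
        exec_time = max(time_counter, operands_ready)
        schedule[instr.name] = exec_time
        ready_at[instr.dest] = exec_time + instr.latency
    return schedule


program = [
    Instr("load", dest="r1", srcs=("r0",), latency=4),
    Instr("add", dest="r2", srcs=("r1",), latency=1),  # waits for the load
    Instr("mul", dest="r3", srcs=("r4",), latency=3),  # independent of the load
    Instr("sub", dest="r5", srcs=("r2", "r3"), latency=1),
]
times = assign_execution_times(program)
print(times)
```

    In this model the independent `mul` is given an earlier slot than the `add` that waits on the load, so execution runs out of program order, yet every slot is fixed at dispatch and nothing is ever rolled back.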

    The architecture extends naturally into matrix computation, with a RISC-V instruction set proposal under community review. Configurable general matrix multiply (GEMM) units, ranging from 8×8 to 64×64, can operate using either register-based or direct memory access (DMA)-fed operands. This flexibility supports a wide range of AI and high-performance computing (HPC) workloads. Early analysis suggests scalability that rivals Google’s TPU cores, while maintaining significantly lower cost and power requirements.

    Rather than a direct comparison with general-purpose CPUs, the more accurate reference point is vector and matrix engines: Traditional CPUs still depend on speculation and branch prediction, whereas this design applies deterministic scheduling directly to GEMM and vector units. This efficiency stems not only from the configurable GEMM blocks but also from the time-based execution model, where instructions are decoded and assigned precise execution slots based on operand readiness and resource availability. 

    Execution is never a random or heuristic choice among many candidates, but a predictable, pre-planned flow that keeps compute resources continuously busy. Planned matrix benchmarks will provide direct comparisons with TPU GEMM implementations, highlighting the ability to deliver datacenter-class performance without datacenter-class overhead.

    Critics may argue that static scheduling introduces latency into instruction execution. In reality, the latency already exists — waiting on data dependencies or memory fetches. Conventional CPUs attempt to hide it with speculation, but when predictions fail, the resulting pipeline flush introduces delay and wastes power.

    The time-counter approach acknowledges this latency and fills it deterministically with useful work, avoiding rollbacks. As the first patent notes, instructions retain out-of-order efficiency: “A microprocessor with a time counter for statically dispatching instructions enables execution based on predicted timing rather than speculative issue and recovery,” with preset execution times but without the overhead of register renaming or speculative comparators.

    Why speculation stalled

    Speculative execution boosts performance by predicting outcomes before they’re known — executing instructions ahead of time and discarding them if the guess was wrong. While this approach can accelerate workloads, it also introduces unpredictability and power inefficiency. Mispredictions inject “No Ops” into the pipeline, stalling progress and wasting energy on work that never completes.

    These issues are magnified in modern AI and machine learning (ML) workloads, where vector and matrix operations dominate and memory access patterns are irregular. Long fetches, non-cacheable loads and misaligned vectors frequently trigger pipeline flushes in speculative architectures.

    The result is performance cliffs that vary wildly across datasets and problem sizes, making consistent tuning nearly impossible. Worse still, speculative side effects have exposed vulnerabilities that led to high-profile security exploits. As data intensity grows and memory systems strain, speculation struggles to keep pace — undermining its original promise of seamless acceleration.

    Time-based execution and deterministic scheduling

    At the core of this invention is a vector coprocessor with a time counter for statically dispatching instructions. Rather than relying on speculation, instructions are issued only when data dependencies and latency windows are fully known. This eliminates guesswork and costly pipeline flushes while preserving the throughput advantages of out-of-order execution. Architectures built on this patented framework feature deep pipelines, typically spanning 12 stages, combined with wide front ends supporting up to 8-way decode and large reorder buffers exceeding 250 entries.

    As illustrated in Figure 1, the architecture mirrors a conventional RISC-V processor at the top level, with instruction fetch and decode stages feeding into execution units. The innovation emerges in the integration of a time counter and register scoreboard, strategically positioned between fetch/decode and the vector execution units. Instead of relying on speculative comparators or register renaming, the design uses a Register Scoreboard and Time Resource Matrix (TRM) to deterministically schedule instructions based on operand readiness and resource availability.

    Figure 1: High-level block diagram of deterministic processor. A time counter and scoreboard sit between fetch/decode and vector execution units, ensuring instructions issue only when operands are ready.

    A typical program running on the deterministic processor begins much like it does on any conventional RISC-V system: Instructions are fetched from memory and decoded to determine whether they are scalar, vector, matrix or custom extensions. The difference emerges at the point of dispatch. Instead of issuing instructions speculatively, the processor employs a cycle-accurate time counter, working with a register scoreboard, to decide exactly when each instruction can be executed. This mechanism provides a deterministic execution contract, ensuring instructions complete at predictable cycles and reducing wasted issue slots.

    In conjunction with the register scoreboard, the Time Resource Matrix (TRM) associates instructions with execution cycles, allowing the processor to plan dispatch deterministically across available resources. The scoreboard tracks operand readiness and hazard information, enabling scheduling without register renaming or speculative comparators. By monitoring dependencies such as read-after-write (RAW) and write-after-read (WAR), it ensures hazards are resolved without costly pipeline flushes. As noted in the patent, “in a multi-threaded microprocessor, the time counter and scoreboard permit rescheduling around cache misses, branch flushes, and RAW hazards without speculative rollback.”
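
    The interplay of time counter, scoreboard and time-resource matrix can be illustrated with a small software model. This is a minimal sketch under stated assumptions, not the patented implementation: the `Instr` record, the `schedule` function, the unit names and all latencies are hypothetical, one instruction is decoded per cycle, each unit accepts one new instruction per cycle, and only RAW and structural hazards are modeled (WAR and WAW handling is omitted).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str      # label used in the printed plan (hypothetical)
    dst: str       # destination register
    srcs: tuple    # source registers
    unit: str      # execution unit the instruction needs
    latency: int   # cycles from issue until dst is readable (assumed known)


def schedule(instrs):
    """Return (name, issue_cycle, result_cycle) per instruction, in program order."""
    reg_ready = {}   # scoreboard: register -> cycle at which its pending write lands
    unit_busy = {}   # time-resource matrix: unit -> set of issue cycles already reserved
    plan = []
    for decode_cycle, ins in enumerate(instrs):  # assumption: one decode per cycle
        # RAW: cannot issue before every source operand has been produced.
        t = max([decode_cycle] + [reg_ready.get(s, 0) for s in ins.srcs])
        # Structural hazard: take the first cycle at which the unit is unreserved.
        reserved = unit_busy.setdefault(ins.unit, set())
        while t in reserved:
            t += 1
        reserved.add(t)
        reg_ready[ins.dst] = t + ins.latency
        plan.append((ins.name, t, t + ins.latency))
    return plan


# Hypothetical program: two vector loads, two dependent vector ops, one scalar add.
PROGRAM = [
    Instr("vle32.v v1", "v1", ("a0",), "lsu", 4),
    Instr("vle32.v v2", "v2", ("a1",), "lsu", 4),
    Instr("vmul.vv v3", "v3", ("v1", "v2"), "valu", 3),
    Instr("vadd.vv v4", "v4", ("v3", "v1"), "valu", 1),
    Instr("add x5", "x5", ("x6", "x7"), "alu", 1),
]

for name, issue, done in schedule(PROGRAM):
    print(f"{name:12s} issue@{issue} result@{done}")
```

    In this run the scalar add issues at cycle 4, ahead of the vector multiply at cycle 5, because nothing it depends on is pending. Every issue cycle is fixed at decode time, so the model never needs to undo work.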

    Once operands are ready, the instruction is dispatched to the appropriate execution unit. Scalar operations use standard arithmetic logic units (ALUs), while vector and matrix instructions execute in wide execution units connected to a large vector register file. Because instructions launch only when conditions are safe, these units stay highly utilized without the wasted work or recovery cycles caused by mispredicted speculation.

    The key enabler of this approach is a simple time counter that orchestrates execution so that instructions advance only when their operands are ready and the required resources are available. The same principle applies to memory operations: The interface predicts latency windows for loads and stores, allowing the processor to fill those slots with independent instructions and keep execution flowing.

    Programming model differences

    From the programmer’s perspective, the flow remains familiar — RISC-V code compiles and executes in the usual way. The crucial difference lies in the execution contract: Rather than relying on dynamic speculation to hide latency, the processor guarantees predictable dispatch and completion times. This eliminates the performance cliffs and wasted energy of speculation while still providing the throughput benefits of out-of-order execution.

    This perspective underscores how deterministic execution preserves the familiar RISC-V programming model while eliminating the unpredictability and wasted effort of speculation. As John Hennessy put it: “It’s stupid to do work in run time that you can do in compile time,” a remark reflecting the foundations of RISC and its forward-looking design philosophy.

    The RISC-V ISA provides opcodes for custom and extension instructions, including floating-point, DSP, and vector operations. The result is a processor that executes instructions deterministically while retaining the benefits of out-of-order performance. By eliminating speculation, the design simplifies hardware, reduces power consumption and avoids pipeline flushes.

    These efficiency gains grow even more significant in vector and matrix operations, where wide execution units require consistent utilization to reach peak performance. Vector extensions require wide register files and large execution units, which in speculative processors necessitate expensive register renaming to recover from branch mispredictions. In the deterministic design, vector instructions are executed only after commit, eliminating the need for renaming.

    Each instruction is scheduled against a cycle-accurate time counter: “The time counter provides a deterministic execution contract, ensuring instructions complete at predictable cycles and reducing wasted issue slots.” The vector register scoreboard resolves data dependencies before issuing instructions to the execution pipeline. Instructions are dispatched in a known order at the correct cycle, making execution both predictable and efficient.

    Vector execution units (integer and floating point) connect directly to a large vector register file. Because instructions are never flushed, there is no renaming overhead. The scoreboard ensures safe access, while the time counter aligns execution with memory readiness. A dedicated memory block predicts the return cycle of loads. Instead of stalling or speculating, the processor schedules independent instructions into latency slots, keeping execution units busy. “A vector coprocessor with a time counter for statically dispatching instructions ensures high utilization of wide execution units while avoiding misprediction penalties.”
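
    The effect of filling load latency slots can be sketched with a toy single-issue model. Everything here is an assumption made for illustration: the `dispatch` function, the instruction tuples, the fixed predicted load latency of 4 cycles, and the rule that every destination register is written only once (so only RAW hazards arise). The stalling in-order baseline is just a comparison point, not a model of a speculative CPU, and the cycle-by-cycle loop is written for readability, whereas the design described above fixes issue cycles ahead of time.

```python
def dispatch(program, fill_slots):
    """Simulate one issue slot per cycle; return the instruction name issued each cycle.

    program: list of (name, dst, srcs, latency) tuples with unique dst registers.
    fill_slots=False stalls behind the oldest instruction until its operands are ready.
    fill_slots=True lets any pending instruction with ready operands use an idle slot.
    """
    produced = {dst for _, dst, _, _ in program}  # registers written by this program
    ready_at = {}          # register -> cycle at which its value becomes available
    pending = list(program)
    timeline = []          # timeline[c] is the name issued at cycle c, or None if idle
    cycle = 0
    while pending:
        window = pending if fill_slots else pending[:1]
        issued = None
        for ins in window:  # oldest first
            _, _, srcs, _ = ins
            # A source is ready if it is a program input or its producer has completed.
            if all(s not in produced or ready_at.get(s, float("inf")) <= cycle
                   for s in srcs):
                issued = ins
                break
        if issued is None:
            timeline.append(None)
        else:
            name, dst, _, latency = issued
            pending.remove(issued)
            ready_at[dst] = cycle + latency
            timeline.append(name)
        cycle += 1
    return timeline


# Hypothetical kernel fragment: a load, two dependent vector ops, three independent adds.
PROGRAM = [
    ("vle32.v v1", "v1", ("a0",), 4),        # predicted return: 4 cycles after issue
    ("vmul.vv v2", "v2", ("v1", "v8"), 2),
    ("add x5", "x5", ("x6",), 1),
    ("add x7", "x7", ("x8",), 1),
    ("add x9", "x9", ("x10",), 1),
    ("vadd.vv v3", "v3", ("v2", "v1"), 1),
]

for label, fill in (("stall in order", False), ("fill latency slots", True)):
    t = dispatch(PROGRAM, fill)
    print(f"{label}: {len(t)} cycles, {t.count(None)} idle")
```

    With these assumed latencies, the stalling baseline needs 9 cycles with 3 idle slots, while filling the load's latency window with the independent adds brings that down to 7 cycles with 1 idle slot.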

    In today’s CPUs, compilers and programmers write code assuming the hardware will dynamically reorder instructions and speculatively execute branches. The hardware handles hazards with register renaming, branch prediction and recovery mechanisms. Programmers benefit from performance, but at the cost of unpredictability and power consumption.

    In the deterministic time-based architecture, instructions are dispatched only when the time counter indicates their operands will be ready. This means the compiler (or runtime system) doesn’t need to insert guard code for misprediction recovery. Instead, compiler scheduling becomes simpler, as instructions are guaranteed to issue at the correct cycle without rollbacks. For programmers, the ISA remains RISC-V compatible, but deterministic extensions reduce reliance on speculative safety nets.

    Application in AI and ML

    In AI/ML kernels, vector loads and matrix operations often dominate runtime. On a speculative CPU, misaligned or non-cacheable loads can trigger stalls or flushes, starving wide vector and matrix units and wasting energy on discarded work. A deterministic design instead issues these operations with cycle-accurate timing, ensuring high utilization and steady throughput. For programmers, this means fewer performance cliffs and more predictable scaling across problem sizes. And because the patents extend the RISC-V ISA rather than replace it, deterministic processors remain fully compatible with the RVA23 profile and mainstream toolchains such as GCC, LLVM, FreeRTOS, and Zephyr.

    In practice, the deterministic model doesn’t change how code is written — it remains RISC-V assembly or high-level languages compiled to RISC-V instructions. What changes is the execution contract: Rather than relying on speculative guesswork, programmers can expect predictable latency behavior and higher efficiency without tuning code around microarchitectural quirks.

    The industry is at an inflection point. AI/ML workloads are dominated by vector and matrix math, where GPUs and TPUs excel — but only by consuming massive power and adding architectural complexity. In contrast, general-purpose CPUs, still tied to speculative execution models, lag behind.

    A deterministic processor delivers predictable performance across a wide range of workloads, ensuring consistent behavior regardless of task complexity. Eliminating speculative execution enhances energy efficiency and avoids unnecessary computational overhead. Furthermore, deterministic design scales naturally to vector and matrix operations, making it especially well-suited for AI workloads that rely on high-throughput parallelism. This new deterministic approach may represent the next such leap: The first major architectural challenge to speculation since speculation itself became the standard.

    Will deterministic CPUs replace speculation in mainstream computing? That remains to be seen. But with issued patents, proven novelty and growing pressure from AI workloads, the timing is right for a paradigm shift. Taken together, these advances signal deterministic execution as the next architectural leap — redefining performance and efficiency just as speculation once did.

    Speculation marked the last revolution in CPU design; determinism may well represent the next.

    Thang Tran is the founder and CTO of Simplex Micro.


  • Inside Celosphere 2025: Why there’s no ‘enterprise AI’ without process intelligence

    Presented by Celonis


    AI adoption is accelerating, but results often lag expectations. Enterprise leaders are under pressure to prove measurable ROI from their AI solutions, especially as the use of autonomous agents rises and global tariffs disrupt supply chains.

    The issue isn’t the AI itself, says Alex Rinke, co-founder and co-CEO of Celonis, a global leader in process intelligence. “To succeed, enterprise AI needs to understand the context of a business’s processes — and how to improve them,” he explains. Without this business context, AI risks becoming, as Rinke puts it, “just an internal social experiment.”

    Next week’s Celosphere 2025 will tackle the AI ROI challenge head-on. The three-day event brings together customer strategies, hands-on workshops, and live demonstrations, highlighting enhancements to the Celonis Process Intelligence (PI) Platform that help enterprises harness ‘enterprise AI,’ powered by PI, to continuously improve operations, creating measurable business value at scale.

    Focus on measurable ROI

    The event’s focus on achieving AI ROI reflects three challenges facing technology and business leaders moving from pilot to production: obsolete systems, break-neck industry change, and agentic AI. According to Gartner, 64% of board members now view AI as a top-three priority — yet only 10% of organizations report meaningful financial returns.

    Celonis customers are bucking that trend. A Forrester Total Economic Impact study found organizations using its platform achieved 383% ROI over three years, with payback in just six months. One company improved sales order automation from 33% to 86%, saving $24.5 million. The study estimated $44.1 million in total benefits over three years, driven by faster automation, reduced inefficiencies, and higher process visibility. These numbers underscore a broader pattern — companies that modernize outdated systems and align AI with process optimization see faster payback and sustained gains.

    Real companies, real results

    Celosphere will spotlight how global enterprises are building “future-fit” operations. Mercedes-Benz Group AG and Vinmar Group will showcase AI-driven, composable solutions, powered by PI, and attendees will see demonstrations of PI enabling agents in live production environments.

    Among the notable success stories:

    AstraZeneca, the pharmaceutical company, reduced excess inventory while keeping critical medicines flowing by using Celonis as a foundation for its OpenAI partnership.

    The State of Oklahoma can answer procurement status questions at scale, unlocking over $10 million in value.

    Cosentino clears blocked sales orders up to 5x faster using an AI-powered credit management assistant.

    Raising the stakes for agentic AI

    Numerous sessions will focus on orchestrating AI agents. The shift from AI-as-advisor to AI-as-actor changes everything, says Rinke.

    “The agent needs to understand not just what to do, but how your specific business actually works,” he explains. “Process intelligence provides those rails.”

    This leap from recommendation to autonomous action raises the stakes exponentially. When agents can independently trigger purchase orders, reroute shipments, or approve exceptions, bad context can mean catastrophically bad outcomes at scale.

    Celosphere attendees will get to see first-hand how companies are using the Celonis Orchestration Engine to coordinate AI agents alongside people and systems. Effective orchestration is a crucial protection against the chaos of agents working at cross-purposes, duplicating actions, or letting crucial steps fall through the cracks.

    Navigating tariffs and supply chain shocks

    Global trade volatility isn't just a headline — it's an operational nightmare reshaping how companies deploy AI, Rinke says.

    New tariffs trigger cascading effects across procurement, logistics, and compliance. Each policy shift can touch thousands of SKUs, forcing new supplier contracts, rerouted shipments, and rebalanced inventories. AI systems trained on static conditions struggle to predict that volatility, but process intelligence gives organizations real-time visibility into how changes ripple through operations.

    Celosphere case studies will show how companies turn disruption into advantage. Smurfit Westrock uses PI to optimize inventory and reduce costs amid tariff uncertainty, while ASOS leverages PI to optimize its supply chain operations, enhancing efficiency, reducing costs, and continuing to deliver an outstanding customer experience.

    Platform over point solutions

    Rinke argues that Celonis’ edge lies in treating process intelligence not as an add-on, but as the foundation of the enterprise stack. Unlike bolt-on optimization tools, the Celonis platform creates a living digital twin of business operations — a continuously updated model enriched by context that lets AI operate effectively from analysis to execution.

    “What sets Celonis apart is visibility across systems and offline tasks, which is critical for true intelligent automation,” Rinke says. “The platform offers comprehensive capabilities spanning process analysis, design, and orchestration rather than a point solution.”

    “Free the Process” and the future of AI

    Celonis continues to champion openness through its “Free the Process” movement, promoting fair competition and freeing enterprises from legacy lock-in. By giving organizations full access to their own process data, open APIs, and a growing partner network that includes The Hackett Group, ClearOps, and Lobster, Celonis is building the connective tissue for a new era of interoperable automation.

    For Rinke, this open foundation is what turns AI from a set of experiments into an enterprise engine. “Process intelligence creates a flywheel,” he says. “Better understanding leads to better optimization, which enables better AI, and that, in turn, drives even greater understanding. There is no AI without PI.”


    Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.