Blog

  • How AI tax startup Blue J torched its entire business model for ChatGPT—and became a $300 million company

    In the winter of 2022, as the tech world was becoming mesmerized by the sudden, explosive arrival of OpenAI’s ChatGPT, Benjamin Alarie faced a pivotal choice. His legal tech startup, Blue J, had a respectable business built on the AI of a bygone era, serving hundreds of accounting firms with predictive models. But it had hit a ceiling.

    Alarie, a tenured tax law professor at the University of Toronto, saw the nascent, error-prone, yet powerful capabilities of large language models not as a curiosity, but as the future. He made a high-stakes decision: to pivot his entire company, which had been painstakingly built over nearly a decade, and rebuild it from the ground up on this unproven technology.

    That bet has paid off handsomely. Blue J has since quietly secured a $122 million Series D funding round co-led by Oak HC/FT and Sapphire Ventures, placing the company's valuation at over $300 million. The move transformed Blue J from a niche player into one of Canada's fastest-growing legal tech firms, multiplying its revenue roughly twelve-fold and attracting 10 to 15 new customers every day.

    The company now serves more than 3,500 organizations, including global accounting giant KPMG and several Fortune 500 companies. It is tackling a critical bottleneck in the professional services industry: a severe and worsening talent shortage. The U.S. has 340,000 fewer accountants than it did five years ago, and with 75% of current CPAs expected to retire in the next decade, firms are desperate for tools that can amplify the productivity of their remaining experts.

    “What once took tax professionals 15 hours of manual research to do can now be completed in about 15 seconds with Blue J,” Alarie, the company's CEO, said in an exclusive interview with VentureBeat. "That value proposition—we can take hours of work and turn it into seconds of work—that is driving a lot of this."

    When the dean's biography was wrong: the moment that changed everything

    Alarie vividly remembers January 2023, when the dean of the law school stopped by his office for New Year's greetings. He asked her about ChatGPT and prompted the AI to describe her. ChatGPT confidently generated a biography. Some details were accurate. Others were completely fabricated.

    "She was like, 'Okay, this is really kind of scary. This is wrong, and this has implications,'" Alarie said. Yet that moment of obvious failure didn't deter him. Instead, it crystallized his conviction.

    The company's first iteration, launched in 2015, used supervised machine learning to build predictive models that could forecast judicial outcomes on specific tax issues. While technically sophisticated, it had a fundamental flaw: it couldn't answer every tax research question.

    "The challenge was it couldn't answer every tax research question, which was really the holy grail," Alarie said. Customers loved the tool when it applied to their problem, but would quickly abandon it when it didn't. Revenue plateaued around $2 million annually.

    Despite ChatGPT's notorious hallucinations, Alarie convinced his board to make the pivot. "I had this conviction that if we continued down that path, we weren't going to be able to address our number one limitation," he said. "Large language models seemed like a very promising direction."

    He gave his team six months to deliver a working product.

    From 90-second responses to 3 million queries: How Blue J tamed AI hallucinations

    By August 2023, Blue J was ready to launch. What they released was, in Alarie's candid assessment, "super janky." The system took 90 seconds to respond. About half the answers had issues. The Net Promoter Score registered at just 20.

    What transformed that flawed product into today's platform — with response times measured in seconds, a dissatisfaction rate of just one in 700 queries, and an NPS score in the mid-80s — was relentless focus on three strategic pillars.

    First is proprietary content at massive scale. Blue J secured exclusive licensing with Tax Analysts (Tax Notes) and IBFD, the Amsterdam-based global tax authority covering 220+ jurisdictions. "We are the only platform on earth that takes in the best U.S. tax information from Tax Notes and the best global tax information from IBFD," Alarie said.

    Second is deep human expertise. Blue J employs tax experts led by Susan Massey, who spent 13 years at the IRS Office of Chief Counsel as Branch Chief for Corporate Tax. Her team constantly tests the AI and refines its performance.

    Third is an unprecedented feedback flywheel. With over 3 million tax research queries processed in 2025, Blue J is amassing unparalleled data. Each query generates feedback that flows back into the system.

    Weekly active user rates hover between 75% and 85%, compared to 15% to 25% for traditional platforms. "A charitable ratio is like we're five times more intensively used," Alarie noted.

    Inside Blue J's early access partnership with OpenAI

    Blue J maintains an unusually close relationship with OpenAI that has proven crucial to its success. "We have a very good relationship with OpenAI, and we get early access to their models,"Alarie said. "It's quite collaborative. We give them a lot of really high quality feedback about how well different versions of forthcoming models are performing."

    This feedback proves valuable because Blue J has developed what Alarie calls "ecologically valid" test questions — drawn from actual tax professional queries, with correct answers determined by Blue J's expert team. This helps OpenAI improve performance on complex reasoning tasks.

    The company tests models from all major providers — OpenAI, Anthropic, Google's Gemini, and open-source alternatives — continuously evaluating which performs best. "We're not necessarily 100% committed to any particular provider," he explained. "We're testing all the time."

    This approach helps Blue J navigate a challenging business model: charging approximately $1,500 per seat annually for unlimited queries while absorbing variable compute costs. "We've pre-committed to delivering them a really good user experience, unlimited tax research answers at a fixed price," Alarie said. "We're absorbing a lot of that risk."

    Competition among foundation model providers creates downward pressure on API pricing, while Blue J's conservative usage modeling has proven accurate. Gross revenue retention exceeds 99%, while net revenue retention reaches 130% — considered best-in-class for SaaS businesses.

    Taking on Thomson Reuters and LexisNexis with 75% weekly engagement

    Blue J faces competition from established publishers like Thomson Reuters, LexisNexis, and Bloomberg, all of which announced AI capabilities throughout 2023 and 2024. Yet Blue J's engagement metrics suggest it has captured significant momentum, growing from just 200 customers in 2021 to over 3,500 organizations today.

    The daily updates prove crucial. While the tax code itself changes only when Congress acts, the ecosystem evolves constantly through IRS regulations, new rulings, and court cases. All 50 states modify their tax codes regularly.

    "Things are changing literally every day," Alarie said. "Every day we're updating the materials, and that's just the U.S. We cover Canada, we cover the UK. The aspirations are truly global for this thing."

    Alarie's ambitions extend beyond building a successful startup. As author of the award-winning book "The Legal Singularity" and faculty affiliate at the Vector Institute for Artificial Intelligence, he has spent years contemplating AI's long-term impact on law.

    In academic papers published in Tax Notes throughout 2023 and 2024, he chronicled generative AI's rise, predicting that "clients will become substantially more sophisticated" and that AI would push human experts toward higher-value strategic roles rather than routine research.

    Blue J's $122 million plan: From tax research to 'global tax cognition'

    The Series D funding, which brought total capital raised to over $133 million, will fuel aggressive geographic and product expansion. Blue J already operates in the U.S., Canada, and the U.K., with plans to eventually cover 220+ jurisdictions through its IBFD partnership.

    Future capabilities could include automated memo generation, tax form completion, document drafting, and conversational history maintaining context across sessions—transforming Blue J from a research tool into what Alarie describes as "the operating layer for global tax cognition."

    For all its success, Blue J operates in a domain where errors carry serious consequences. The hallucination problem hasn't been eliminated — it's been minimized through careful engineering, content curation, and human oversight. Blue J has trained its models to acknowledge when they cannot answer a question rather than fabricate information.

    The business also faces economic risks if compute costs spiral or usage patterns exceed projections. And subtler questions loom about professional judgment: as AI systems become more capable, will users defer to outputs without sufficient critical evaluation?

    From 15 hours to 15 seconds: What Blue J's AI pivot teaches every industry

    Blue J's transformation offers lessons beyond tax software. The company's willingness to abandon eight years of proprietary technology and rebuild on an initially unreliable foundation required both courage and calculated risk-taking.

    The decision paid off not because generative AI was inherently superior to supervised machine learning in all dimensions, but because it addressed the right problem: comprehensiveness rather than precision in narrow domains. Tax professionals didn't need 95% accuracy on 5% of questions. They needed good-enough accuracy on 100% of questions.

    The improvement from an NPS of 20 to 84 in just over two years reflects relentless iteration informed by massive data collection. The content partnerships created differentiation that pure technology couldn't replicate. The team of tax experts provided domain knowledge necessary to ensure reliability.

    Most fundamentally, Blue J recognized that the real competition wasn't other AI startups or even established publishers. It was the old way of doing things — the 15 hours of manual research, the institutional knowledge locked in retiring professionals' heads.

    "People are like, 'What does Blue J do? They provide better tax answers. Okay, I think we need that,'" Alarie reflected.

    As AI transforms profession after profession, that clarity of purpose may matter more than technological sophistication. The future belongs not to those who build the most advanced AI, but to those who most effectively harness it to solve problems humans actually have.

    For a tax law professor who started with frustration about inefficient research methods, building a $300 million company marks an audacious endpoint. For the thousands of professionals now answering complex questions in 15 seconds instead of 15 hours, it represents the future of their profession, arriving faster than most expected.

    The bet on ChatGPT when it was still hallucinating biographies has become a validation that sometimes the riskiest move is not to move at all.

  • Phi-4 proves that a ‘data-first’ SFT methodology is the new differentiator

    AI engineers often chase performance by scaling up LLM parameters and data, but the trend toward smaller, more efficient, and better-focused models has accelerated. 

    The Phi-4 fine-tuning methodology is the cleanest public example of a training approach that smaller enterprise teams can copy. It shows how a carefully chosen dataset and fine-tuning strategy can make a 14B model compete with much larger ones.

    The Phi-4 model was trained on just 1.4 million carefully chosen prompt-response pairs. Instead of brute force, the Microsoft Phi-4 research team focused on “teachable” examples at the edge of the model’s abilities and rigorous data curation. 

    The Phi-4 reasoning smart data playbook demonstrates how strategic data curation with replicable SFT and RL can elevate a 14B model beyond much larger counterparts.

    Why Phi-4 stands apart

    Smaller reasoning models, such as OpenAI’s o1-mini and Google’s Gemma, are becoming more common, and models like Alibaba’s Qwen3 (8B and 14B) are seeing wide adoption across use cases. That adoption is important, but it doesn’t displace the value of Phi-4 as an experimental proof: Phi-4 was designed as a testbed for a data-first training methodology, and its documentation reads like a smart data playbook for teams that want to replicate that approach.

    The Phi-4 team has shared a repeatable SFT playbook that includes a 1.4-million-prompt response set. It’s built around teachable edge examples, questions that are neither too easy nor too difficult, chosen to push the model’s reasoning. Each topic, such as math or code, is tuned separately and then combined with synthetic rewrites that turn complex tasks into forms that can be checked automatically. 

    The paper outlines the data selection and filtering process in enough detail for smaller teams to reproduce it with open-source models and evaluators. For enterprise teams, that level of transparency turns a research result into a practical, copyable training recipe they can implement and measure quickly.

    The data-first philosophy: Why less can be more

    Traditional approaches to LLM reasoning have often relied on scaling datasets massively to encourage generalization. Phi-4 reasoning takes a different path, showing that carefully curated data can achieve similar or even better results with far less.

    The team assembled a dataset covering STEM, coding, and safety. Despite its small size, it outperformed models trained on orders of magnitude more data. 

    In benchmarks, the 14B Phi-4 reasoning model outperformed OpenAI’s o1-mini and DeepSeek’s 70B distilled model across most reasoning tasks, and approached the full DeepSeek-R1 (671B) on challenging math (AIME) questions. 

    With just 14 billion parameters, Phi-4 reasoning delivers the following results when compared to other leading models:

    Benchmark (task)

    Phi-4 reasoning

    Comparison model (size)

    Comparison score

    Date / Source

    AIME 2024 (math olympiad)

    75.3%

    o1-mini

    63.6%

    Microsoft Phi-4 model card (Apr 2025). (Hugging Face)

    AIME 2025 (math olympiad)

    62.9%

    DeepSeek-R1-Distill-70B

    51.5%

    Microsoft Phi-4 model card (April 2025). (Hugging Face)

    OmniMath

    76.6%

    DeepSeek-R1-Distill-70B

    63.4%

    Microsoft Phi-4 model card (April 2025). (Hugging Face)

    GPQA-Diamond (graduate-level science)

    65.8%

    o1-mini

    60.0%

    Microsoft Phi-4 model card (April 2025). (Hugging Face)

    OmniMath (same benchmark, different comparison)

    76.6%

    Claude-3.7-Sonnet

    54.6%

    Microsoft Phi-4 model card (April 2025). (Hugging Face)

    Table: Phi-4 reasoning performance across benchmarks compared to other models. Source: Microsoft

    The key to this is filtering for quality over quantity. Much of the generic data is either too easy (the base model already knows it) or too hard (no learning signal). The Phi-4 team explicitly discards such examples. “Given the strong baseline reasoning capabilities of Phi-4, many initial seed questions are already handled competently,” they note. “To make further learning impactful, we specifically target seeds situated at the edge of Phi-4’s current abilities.” 

    In practice, they rely on LLM-based evaluation. For each candidate question, a strong reference model (like GPT-4) generates an “answer key,” and the answers from weaker models are compared. If the weaker model disagrees enough, it indicates a teachable gap. Those questions are retained, while trivially solved or utterly unsolvable questions are dropped. 

    For example, a simple arithmetic problem might be dropped (too easy), and an extremely obscure theorem proof might be dropped (too hard) as well. But a moderately challenging geometry problem that Phi-4 gets wrong is included.

    This “sweet spot” approach ensures every example forces the model to stretch its reasoning. By focusing on multi-step problems rather than rote recall, they pack maximum learning into 1.4M examples. 

    As the authors explain, training on these carefully chosen seeds “leads to broad generalization across both reasoning-specific and general-purpose tasks.” In effect, Phi-4 reasoning demonstrates that intelligent data selection can outperform brute force scaling. 

    Independent domain optimization

    Phi-4 reasoning’s data are grouped by domain (math, coding, puzzles, safety, etc.). Rather than blending everything at once, the team tunes each domain’s mix separately and then merges them. 

    This relies on an additive property: Optimizing math data in isolation and code data in isolation yields weights that, when concatenated, still give gains in both areas. In practice, they first tuned the math dataset to saturation on math benchmarks, then did the same for code, and finally simply added the code data into the math recipe. The result was improved performance on both math and coding tasks, without retraining from scratch.

    This modular approach offers clear practical advantages. This means a small team can first refine just the math dataset, achieve strong math performance, and then later add the coding data without redoing the math tuning.

    However, the Phi-4 authors caution that scaling this method to many domains remains an open question. While the approach “worked very well” for their math+code mix, they note, “it is not known whether this method can scale to dozens or hundreds of domains,” a direction they acknowledge as a valuable area for future research. In short, the additive strategy is effective, but expanding into new domains must be approached carefully, as it may introduce unforeseen interactions.

    Despite potential pitfalls, the additive strategy proved effective in Phi-4 reasoning. By treating each domain independently, the team avoided complex joint optimization and narrowed the search space for data mixtures. This approach allows incremental scaling of domains. Teams can begin by tuning the math SFT, then incorporate the code dataset, and later expand to additional specialized tasks, all while maintaining prior performance gains. 

    This is a practical advantage for resource-constrained teams. Instead of requiring a large group of experts to manage a complex, multi-domain dataset, a small team can focus on one data silo at a time.

    Synthetic data transformation

    Some reasoning problems, such as abstract proofs or creative tasks, are difficult to verify automatically. Yet automated verification (for RL reward shaping) is very valuable. Phi-4 reasoning tackled this by transforming hard prompts into easier-to-check forms. 

    For example, the team rewrote a subset of coding problems as word puzzles or converted some math problems to have concise numeric answers. These “synthetic seed data” preserve the underlying reasoning challenge but make correctness easier to test. Think of it as giving the model a simplified version of the riddle that still teaches the same logic. 

    This engineering hack enables downstream RL to use clear reward signals on tasks that would otherwise be too open-ended. 

    Here’s an example of synthetic data transformation:

    Raw web data

    Synthetic data

    On the sides AB and BC of triangle ABC, points M and N are taken, respectively. It turns out that the perimeter of △AMC is equal to the perimeter of △CNA, and the perimeter of △ANB is equal to the perimeter of △CMB. Prove that △ABC is isosceles.

    ABC is a triangle with AB=13 and BC=10. On the sides AB and BC of triangle ABC, points M and N are taken, respectively. It turns out that the perimeter of △AMC is equal to the perimeter of △CNA, and the perimeter of △ANB is equal to the perimeter of △CMB. What is AC?

    Table: Rewriting seed data from the web (left) into verifiable synthetic questions for SFT and RL (right). Source: Microsoft

    Note that by assigning numeric values (AB=13, BC=10) and asking “What is AC?”, the answer becomes a single number, which can be easily checked for correctness.

    Other teams have applied similar domain-specific tricks. For example, chemistry LLMs like FutureHouse’s ether0 model generate molecules under strict pKa or structural constraints, using crafted reward functions to ensure valid chemistry. 

    In mathematics, the Kimina-Prover model by Numina translates natural-language theorems into the Lean formal system, so reinforcement learning can verify correct proofs. These examples highlight how synthetic augmentation, when paired with verifiable constraints, can push models to perform well in highly specialized domains.

    In practical terms, engineers should embrace synthetic data but keep it grounded. Heuristics like “convert to numeric answers” or “decompose a proof into checkable steps” can make training safer and more efficient. At the same time, maintain a pipeline of real (organic) problems as well, to ensure breadth. 

    The key is balance. Use synthetic transformations to unlock difficult verification problems, but don’t rely on them exclusively. Real-world diversity still matters. Following this approach, the model is guided toward a clearly defined, discrete objective.

    Here are some results on Phi-4 reasoning models:

    Practical implementation for enterprises

    AI teams looking to apply Phi-4 reasoning’s insights can follow a series of concrete steps to implement the approach effectively.

    Identifying the model’s edge

    Detect your model’s “edge” by identifying where the base LLM struggles. One way is to use its confidence or agreement scores. For example, generate several answers per prompt (using a tool like Hugging Face’s vLLM for fast sampling) and see where consensus breaks. Those prompts at the margin of confidence are your teachable examples. By focusing on these low-confidence questions rather than the questions it already gets right, you ensure each new example is worth learning.

    Isolating domains for targeted tuning

    Tune one domain at a time rather than mixing all data genres upfront. Pick the highest-value domain for your app (math, code, legal, etc.) and craft a small SFT dataset for just that. Iterate on the mix (balancing difficulty, source types, etc.) until performance saturates on domain-specific benchmarks. Then freeze that mix and add the next domain. This modular tuning follows Phi-4 reasoning’s “additive” strategy. It avoids cross-talk since you preserve gains in domain A even as you improve domain B.

    Expanding with synthetic augmentation

    Leverage synthetic augmentation when gold-standard answers are scarce or unverifiable. For instance, if you need to teach a proof assistant but can’t autocheck proofs, transform them into arithmetic puzzles or shorter proofs that can be verified. Use your LLM to rewrite or generate these variants (Phi-4 used this to turn complex word problems into numeric ones). 

    Synthetic augmentation also lets you expand data cheaply. Once you have a validated small set, you can “multiply” it by having the LLM generate paraphrases, variations, or intermediate reasoning steps.

    Scaling through a two-phase strategy

    Use a two-phase training strategy that begins with exploration followed by scaling. In Phase 1 (exploration), run short fine-tuning experiments on a focused dataset (e.g., one domain) with limited compute. Track a few key metrics (benchmarks or held-out tasks) each run. Rapidly iterate hyperparameters and data mixes. 

    The Phi-4 paper demonstrates that this speeds up progress, as small experiments helped the team discover a robust recipe before scaling up. Only once you see consistent gains do you move to Phase 2 (scaling), where you combine your verified recipes across domains and train longer (in Phi-4’s case, ~16 billion tokens). Although this stage is more compute-intensive, the risk is significantly reduced by the prior experimentation.

    Monitor for trigger points such as a significant uplift on validation tasks or stable metric trends. When those appear, it’s time to scale. If not, refine the recipe more first. This disciplined two-phase loop saves resources and keeps the team agile.

    In practice, many teams at Hugging Face and elsewhere have followed similar advice. For example, while developing conversational model SmolLM2, the team noticed poor chat performance in Phase 1. They then generated ~500K synthetic multi-turn dialogues and re-trained, which “significantly improved both downstream performance and its overall ‘vibes,’” as one researcher reports. This represents a concrete win, achieved through a targeted synthetic data injection based on an initial feedback loop.

    How to do this now

    Here’s a simple checklist that you can follow to put these ideas into action.

    1. Pick a target domain/task. Choose one area (e.g., math, coding, or a specific application) where you need better performance. This keeps the project focused.

    2. Collect a small seed dataset. Gather, say, a few thousand prompt–answer pairs in that domain from existing sources (textbooks, GitHub, etc.).

    3. Filter for edge-of-ability examples. Use a strong model (e.g., GPT-4) to create an answer key for each prompt. Run your base model on those prompts. Keep examples that the base model often misses, discard ones it already solves or is hopeless on. This yields “teachable” examples.

    4. Fine-tune your model (Phase 1). Run a short SFT job on this curated data. Track performance on a held-out set or benchmark. Iterate: Refine the data mix, remove easy questions, add new teachable ones, until gains taper off.

    5. Add synthetic examples if needed. If some concepts lack auto-verifiable answers (like long proofs), create simpler numeric or single-answer variants using your LLM. This gives clear rewards for RL. Keep a balance with real problems.

    6. Expand to the next domain. Once one domain is tuned, “freeze” its dataset. Pick a second high-value domain and repeat steps 3 to 5 to tune that data mix. Finally, merge the data for both domains, and do a final longer training run (Phase 2).

    7. Monitor benchmarks carefully. Use a consistent evaluation methodology (like  majority-voting runs) to avoid misleading results. Only proceed to a full-scale training if small experiments show clear improvements.

    Limits and trade-offs

    Despite the effectiveness of the Phi-4 training method, several limitations and practical considerations remain. One key challenge is domain scaling. While Phi-4’s additive method worked well for math and code, it has yet to be proven across many domains. The authors acknowledge that it remains an open question whether this approach can scale smoothly to dozens of topics. 

    Another concern is the use of synthetic data. Relying too heavily on synthetic rewrites can reduce the diversity of the dataset, so it’s crucial to maintain a balance between real and synthetic examples to preserve the model's ability to reason effectively. 

    Lastly, while the repeatable SFT method helps reduce computational costs, it doesn’t eliminate the need for thoughtful curation. Even though the approach is more efficient than brute-force scaling, it still requires careful data selection and iteration.

    Lessons from Phi-4

    The Phi-4 reasoning story is clear: Bigger isn’t always better for reasoning models. Instead of blindly scaling, the team asked where learning happens and engineered their data to hit that sweet spot. They show that “the benefit of careful data curation for supervised fine-tuning extends to reasoning models.” In other words, with a smart curriculum, you can squeeze surprising capability out of modest models.

    For engineers, the takeaway is actionable. You don’t need a billion-dollar cluster or an endless internet crawl to improve reasoning. For resource-strapped teams, this is good news, as a careful data strategy lets you punch above your weight.

    Phi-4 reasoning proves that methodical data and training design, not sheer parameter count, drives advanced reasoning. Focusing on teachable data and iterative tuning, even a 14B model surpassed much larger rivals. For AI teams today, this offers a practical blueprint. Refine the data, iterate fast, and scale only when the signals are right. These steps can unlock breakthrough reasoning performance without breaking the bank.

  • In a sea of agents, AWS bets on structured adherence and spec fidelity

    Despite new methods emerging, enterprises continue to turn to autonomous coding agents and code generation platforms. The competition to keep developers working on their platforms, coming from tech companies, has also heated up.

    AWS thinks its offering, Kiro, and new capabilities to ensure behavioral adherence set up a large differentiator in the increasingly crowded coding agent space. 

    Kiro, first launched in July on public preview, is now generally available with new features, including property-based testing for behavior and a command-line interface (CLI) capability to tailor custom agents.

    Deepak Singh, AWS vice president for databases and AI, told VentureBeat in an interview that Kiro “keeps the fun” of coding while providing it structure.

    “The way I like to say it is, what Kiro does is it allows you to talk to your agent and work with your agent to build software just like you would do with any other agent,” Singh said. “But what Kiro does is it brings this structured way of writing that software, which we call spectrum and development, to specs that take your ideas, converts them into things that will endure over time. So the outcome is more robust, maintainable code.”

    Kiro is an agentic coding tool built into developer IDEs to help create agents and applications from prototype to production.

    In addition to new features, AWS is offering startups in most countries one year of free credits to Kiro Pro+ and expanded access to Teams. 

    Behavioral adherence and checkpointing built in

    One of the new features of Kiro is property-based testing and checkpointing. 

    A problem some enterprises face with AI-generated code is that it can sometimes be difficult to judge accuracy and how closely the agents adhere to their intended purpose. AWS noted in a blog post that “whoever writes the tests (human or AI) is limited by their own biases— they have to think of all the different, specific scenarios to test the code against, and they’ll miss edge cases they didn’t think of. AI models often ‘game’ the solution by modifying tests instead of fixing code.”

    “What property-based testing does is it takes a specification, it takes a spec, and from that, it identifies properties your code should have, and it basically creates potentially hundreds of testing scenarios to verify that your code is doing what you intended it to as identified in the spec, and it does all the automatically,” Singh said. 

    Singh said that organizations can upload their specifications, and the Kiro agent can start identifying what is missing, even before the code review process begins. 

    Property-based testing matches the specified behavior, aka your instructions, to what the code is doing. Kiro can help users write it in their specifications based on the EARS format. For example, if a company is building a car sales app, the specification would read:

    “For any user and any car listing, WHEN the user adds the car to favorites, THE System SHALL display that car in their favorites list. PBT then automatically tests this with User A adding Car #1, User B adding Car #500, User C adding multiple cars, users with special characters in usernames, cars with various statuses (new, used, certified), and hundreds more combinations, catching edge cases and verifying that implementation matches your intent.”

    As opposed to a traditional unit test specification, which states: If a user adds car #5 to their favorites, then it will appear on their list.

    Kiro will then identify examples of the code violating the specifications and present them to the user. 

    Kiro also now allows for checkpointing, so developers can go back to a previous change if something goes wrong. 

    CLI coding

    The second major new feature of Kiro is Kiro CLI, which brings the Kiro coding agent directly into a developer’s CLI.

    AWS said the Kiro CLI utilizes some functionalities from the Q Developer CLI—its in-line coding assistant, launched in October 2024—to enable users to access the agent from the command line. 

    It also allows developers to start building custom agents, such as a backend specialist, a frontend agent, and a DevOps agent, tailored to an organization’s codebase.

    Singh said developers have their own unique ways of working, so it’s important for coding agent providers like AWS to meet them, where they are. Kiro CLI allows users to:

    • Stay in the terminal without the need for context switching

    • Structuring AI workflows with custom agents

    • Have one set up for two environments since MCP servers and other tools work in both the Kiro version on the IDE or the CLI

    • Fast automation to format code or manage logs through automated commands

    Coding agents competition

    Kiro, though, is just one of many coding agent platforms cropping up and competing for enterprise usage. 

    From OpenAI’s GPT-Codex, which unifies its Codex coding assistant with IDEs, CLIs, and other workflows, to Google’s Gemini CLI, it's clear that more developers demand easy access to coding agents where they do their work. 

    And enterprises are demanding more from coding agents. For example, Anthropic made its Claude Code platform available on the web and mobile. Some coding platforms also allow users to choose which model to use for their coding. 

    Singh said Kiro doesn’t rely on just one LLM; instead, it routes to the best model for the work, including AWS models. At launch in July, Kiro was based on Claude Sonnet 3.7 and 4.0. 

    Well-known brands like Monday.com have noted the significant benefits of AI-powered coding, demonstrating that enterprises will likely continue to utilize these platforms in the future. 

    “We saw that the mental model changes for developers, but it’s not just about becoming more efficient; it’s also how they organize around the way they work now,” Singh said. 

  • From shiny object to sober reality: The vector database story, two years later

    When I first wrote Vector databases: Shiny object syndrome and the case of a missing unicorn in March 2024, the industry was awash in hype. Vector databases were positioned as the next big thing — a must-have infrastructure layer for the gen AI era. Billions of venture dollars flowed, developers rushed to integrate embeddings into their pipelines and analysts breathlessly tracked funding rounds for Pinecone, Weaviate, Chroma, Milvus and a dozen others.

    The promise was intoxicating: Finally, a way to search by meaning rather than by brittle keywords. Just dump your enterprise knowledge into a vector store, connect an LLM and watch magic happen.

    Except the magic never fully materialized.

    Two years on, the reality check has arrived: 95% of organizations invested in gen AI initiatives are seeing zero measurable returns. And, many of the warnings I raised back then — about the limits of vectors, the crowded vendor landscape and the risks of treating vector databases as silver bullets — have played out almost exactly as predicted.

    Prediction 1: The missing unicorn

    Back then, I questioned whether Pinecone — the poster child of the category — would achieve unicorn status or whether it would become the “missing unicorn” of the database world. Today, that question has been answered in the most telling way possible: Pinecone is reportedly exploring a sale, struggling to break out amid fierce competition and customer churn.

    Yes, Pinecone raised big rounds and signed marquee logos. But in practice, differentiation was thin. Open-source players like Milvus, Qdrant and Chroma undercut them on cost. Incumbents like Postgres (with pgVector) and Elasticsearch simply added vector support as a feature. And customers increasingly asked: “Why introduce a whole new database when my existing stack already does vectors well enough?”

    The result: Pinecone, once valued near a billion dollars, is now looking for a home. The missing unicorn indeed. In September 2025, Pinecone appointed Ash Ashutosh as CEO, with founder Edo Liberty moving to a chief scientist role.  The timing is telling: The leadership change comes amid increasing pressure and questions over its long-term independence.  

    Prediction 2: Vectors alone won’t cut it

    I also argued that vector databases by themselves were not an end solution. If your use case required exactness — l ike searching for “Error 221” in a manual—a pure vector search would gleefully serve up “Error 222” as “close enough.” Cute in a demo, catastrophic in production.

    That tension between similarity and relevance has proven fatal to the myth of vector databases as all-purpose engines. 

    “Enterprises discovered the hard way that semantic ≠ correct.”

    Developers who gleefully swapped out lexical search for vectors quickly reintroduced… lexical search in conjunction with vectors. Teams that expected vectors to “just work” ended up bolting on metadata filtering, rerankers and hand-tuned rules. By 2025, the consensus is clear: Vectors are powerful, but only as part of a hybrid stack.

    Prediction 3: A crowded field becomes commoditized

    The explosion of vector database startups was never sustainable. Weaviate, Milvus (via Zilliz), Chroma, Vespa, Qdrant — each claimed subtle differentiators, but to most buyers they all did the same thing: store vectors and retrieve nearest neighbors.

    Today, very few of these players are breaking out. The market has fragmented, commoditized and in many ways been swallowed by incumbents. Vector search is now a checkbox feature in cloud data platforms, not a standalone moat.

    Just as I wrote then: Distinguishing one vector DB from another will pose an increasing challenge. That challenge has only grown harder. Vald, Marqo, LanceDB, PostgresSQL, MySQL HeatWave, Oracle 23c, Azure SQL, Cassandra, Redis, Neo4j, SingleStore, ElasticSearch, OpenSearch, Apahce Solr… the list goes on.

    The new reality: Hybrid and GraphRAG

    But this isn’t just a story of decline — it’s a story of evolution. Out of the ashes of vector hype, new paradigms are emerging that combine the best of multiple approaches.

    Hybrid Search: Keyword + vector is now the default for serious applications. Companies learned that you need both precision and fuzziness, exactness and semantics. Tools like Apache Solr, Elasticsearch, pgVector and Pinecone’s own “cascading retrieval” embrace this.

    GraphRAG: The hottest buzzword of late 2024/2025 is GraphRAG — graph-enhanced retrieval augmented generation. By marrying vectors with knowledge graphs, GraphRAG encodes the relationships between entities that embeddings alone flatten away. The payoff is dramatic.

    Benchmarks and evidence

    • Amazon’s AI blog cites benchmarks from Lettria, where hybrid GraphRAG boosted answer correctness from ~50% to 80%-plus in test datasets across finance, healthcare, industry, and law.  

    • The GraphRAG-Bench benchmark (released May 2025) provides a rigorous evaluation of GraphRAG vs. vanilla RAG across reasoning tasks, multi-hop queries and domain challenges.  

    • An OpenReview evaluation of RAG vs GraphRAG found that each approach has strengths depending on task — but hybrid combinations often perform best.  

    • FalkorDB’s blog reports that when schema precision matters (structured domains), GraphRAG can outperform vector retrieval by a factor of ~3.4x on certain benchmarks.  

    The rise of GraphRAG underscores the larger point: Retrieval is not about any single shiny object. It’s about building retrieval systems — layered, hybrid, context-aware pipelines that give LLMs the right information, with the right precision, at the right time.

    What this means going forward

    The verdict is in: Vector databases were never the miracle. They were a step — an important one — in the evolution of search and retrieval. But they are not, and never were, the endgame.

    The winners in this space won’t be those who sell vectors as a standalone database. They will be the ones who embed vector search into broader ecosystems — integrating graphs, metadata, rules and context engineering into cohesive platforms.

    In other words: The unicorn isn’t the vector database. The unicorn is the retrieval stack.

    Looking ahead: What’s next

    • Unified data platforms will subsume vector + graph: Expect major DB and cloud vendors to offer integrated retrieval stacks (vector + graph + full-text) as built-in capabilities.

    • “Retrieval engineering” will emerge as a distinct discipline: Just as MLOps matured, so too will practices around embedding tuning, hybrid ranking and graph construction.

    • Meta-models learning to query better: Future LLMs may learn to orchestrate which retrieval method to use per query, dynamically adjusting weighting.

    • Temporal and multimodal GraphRAG: Already, researchers are extending GraphRAG to be time-aware (T-GRAG) and multimodally unified (e.g. connecting images, text, video).

    • Open benchmarks and abstraction layers: Tools like BenchmarkQED (for RAG benchmarking) and GraphRAG-Bench will push the community toward fairer, comparably measured systems.

    From shiny objects to essential infrastructure

    The arc of the vector database story has followed a classic path: A pervasive hype cycle, followed by introspection, correction and maturation. In 2025, vector search is no longer the shiny object everyone pursues blindly — it’s now a critical building block within a more sophisticated, multi-pronged retrieval architecture.

    The original warnings were right. Pure vector-based hopes often crash on the shoals of precision, relational complexity and enterprise constraints. Yet the technology was never wasted: It forced the industry to rethink retrieval, blending semantic, lexical and relational strategies.

    If I were to write a sequel in 2027, I suspect it would frame vector databases not as unicorns, but as legacy infrastructure — foundational, but eclipsed by smarter orchestration layers, adaptive retrieval controllers and AI systems that dynamically choose which retrieval tool fits the query.

    As of now, the real battle is not vector vs keyword — it’s the indirection, blending and discipline in building retrieval pipelines that reliably ground gen AI in facts and domain knowledge. That’s the unicorn we should be chasing now.

    Amit Verma is head of engineering and AI Labs at Neuron7.

    Read more from our guest writers. Or, consider submitting a post of your own! See our guidelines here.

  • Human-centric IAM is failing: Agentic AI requires a new identity control plane

    The race to deploy agentic AI is on. Across the enterprise, systems that can plan, take actions and collaborate across business applications promise unprecedented efficiency. But in the rush to automate, a critical component is being overlooked: Scalable security. We are building a workforce of digital employees without giving them a secure way to log in, access data and do their jobs without creating catastrophic risk.

    The fundamental problem is that traditional identity and access management (IAM) designed for humans breaks at agentic scale. Controls like static roles, long-lived passwords and one-time approvals are useless when non-human identities can outnumber human ones by 10 to one. To harness the power of agentic AI, identity must evolve from a simple login gatekeeper into the dynamic control plane for your entire AI operation.

    “The fastest path to responsible AI is to avoid real data. Use synthetic data to prove value, then earn the right to touch the real thing.” — Shawn Kanungo, keynote speaker and innovation strategist; bestselling author of The Bold Ones

    Why your human-centric IAM is a sitting duck

    Agentic AI does not just use software; it behaves like a user. It authenticates to systems, assumes roles and calls APIs. If you treat these agents as mere features of an application, you invite invisible privilege creep and untraceable actions. A single over-permissioned agent can exfiltrate data or trigger erroneous business processes at machine speed, with no one the wiser until it is too late.

    The static nature of legacy IAM is the core vulnerability. You cannot pre-define a fixed role for an agent whose tasks and required data access might change daily. The only way to keep access decisions accurate is to move policy enforcement from a one-time grant to a continuous, runtime evaluation.

    Prove value before production data

    Kanungo’s guidance offers a practical on-ramp. Start with synthetic or masked datasets to validate agent workflows, scopes and guardrails. Once your policies, logs and break-glass paths hold up in this sandbox, you can graduate agents to real data with confidence and clear audit evidence.

    Building an identity-centric operating model for AI

    Securing this new workforce requires a shift in mindset. Each AI agent must be treated as a first-class citizen within your identity ecosystem.

    First, every agent needs a unique, verifiable identity. This is not just a technical ID; it must be linked to a human owner, a specific business use case and a software bill of materials (SBOM). The era of shared service accounts is over; they are the equivalent of giving a master key to a faceless crowd.

    Second, replace set-and-forget roles with session-based, risk-aware permissions. Access should be granted just in time, scoped to the immediate task and the minimum necessary dataset, then automatically revoked when the job is complete. Think of it as giving an agent a key to a single room for one meeting, not the master key to the entire building.

    Three pillars of a scalable agent security architecture

    Context-aware authorization at the core. Authorization can no longer be a simple yes or no at the door. It must be a continuous conversation. Systems should evaluate context in real time. Is the agent’s digital posture attested? Is it requesting data typical for its purpose? Is this access occurring during a normal operational window? This dynamic evaluation enables both security and speed.

    Purpose-bound data access at the edge. The final line of defense is the data layer itself. By embedding policy enforcement directly into the data query engine, you can enforce row-level and column-level security based on the agent’s declared purpose. A customer service agent should be automatically blocked from running a query that appears designed for financial analysis. Purpose binding ensures data is used as intended, not merely accessed by an authorized identity.

    Tamper-evident evidence by default. In a world of autonomous actions, auditability is non-negotiable. Every access decision, data query and API call should be immutably logged, capturing the who, what, where and why. Link logs so they are tamper evident and replayable for auditors or incident responders, providing a clear narrative of every agent’s activities.

    A practical roadmap to get started

    Begin with an identity inventory. Catalog all non-human identities and service accounts. You will likely find sharing and over-provisioning. Begin issuing unique identities for each agent workload.

    Pilot a just-in-time access platform. Implement a tool that grants short-lived, scoped credentials for a specific project. This proves the concept and shows the operational benefits.

    Mandate short-lived credentials. Issue tokens that expire in minutes, not months. Seek out and remove static API keys and secrets from code and configuration.

    Stand up a synthetic data sandbox. Validate agent workflows, scopes, prompts and policies on synthetic or masked data first. Promote to real data only after controls, logs and egress policies pass.

    Conduct an agent incident tabletop drill. Practice responses to a leaked credential, a prompt injection or a tool escalation. Prove you can revoke access, rotate credentials and isolate an agent in minutes.

    The bottom line

    You cannot manage an agentic, AI-driven future with human-era identity tools. The organizations that will win recognize identity as the central nervous system for AI operations. Make identity the control plane, move authorization to runtime, bind data access to purpose and prove value on synthetic data before touching the real thing. Do that, and you can scale to a million agents without scaling your breach risk.

     Michelle Buckner is a former NASA Information System Security Officer (ISSO).

  • ChatGPT Group Chats are here … but not for everyone (yet)

    It was originally found in leaked code and publicized by AI influencers on X, but OpenAI has made it official: ChatGPT now offers Group Chats, allowing multiple users to join the same, single ChatGPT conversation and send messages to each other and the underlying large language model (LLM), online and via its mobile apps.

    Imagine adding ChatGPT as another member of your existing group chats, allowing you to text it as you would one of your friends or family members and have them respond as well, and you'll have an idea of the intriguing power and potential of this feature.

    However, the feature is only available as a limited pilot for now to ChatGPT users in Japan, New Zealand, South Korea, and Taiwan (all tiers, including free usage).

    “Group chats are just the beginning of ChatGPT becoming a shared space to collaborate and interact with others,” OpenAI wrote in its announcement.

    This development builds on internal experimentation at OpenAI, where technical staffer Keyan Zhang said in a post on X that OpenAI's team initially considered multiplayer ChatGPT to be “a wild, out-of-distribution idea.”

    According to Zhang, the model’s performance in those early tests demonstrated far more potential than existing interfaces typically allow.

    The move follows OpenAI investor yet competitor Microsoft's update of its Copilot AI assistant to allow group chats last month, as well as Anthropic's introduction of shareable context and chat histories from its Claude AI models through its Projects feature introduced summer 2024, though this is not a simultaneous, realtime group chat in the same way.

    Collaborative functionality integrated into ChatGPT

    Group chats function as shared conversational spaces where users can plan events, brainstorm ideas, or collaborate on projects with the added support of ChatGPT.

    These conversations are distinct from individual chats and are excluded from ChatGPT’s memory system—meaning no data from these group threads is used to train or personalize future interactions.

    Users can initiate a group chat by selecting the people icon in a new or existing conversation. Adding others creates a copy of the original thread, preserving the source dialogue. Participants can join via a shareable link and are prompted to create a profile with a name, username, and photo. The feature supports 1 to 20 participants per group.

    Each group chat is listed in a new section of the ChatGPT interface, and users can manage settings like naming the group, adding or removing participants, or muting notifications.

    Powered by GPT-5.1 with expanded tools

    The new group chat feature runs on GPT-5.1 Auto, a backend setting that chooses the optimal model based on the user’s subscription tier and the prompt.

    Functionality such as search, image generation, file upload, and dictation is available inside group conversations.

    Importantly, the system applies rate limits only when ChatGPT is producing responses. Direct messages between human users in the group do not count toward any plan’s message cap.

    OpenAI has added new social features to ChatGPT in support of this group dynamic. The model can react with emojis, interpret conversational context to decide when to respond, and personalize generated content using members’ profile photos—such as inserting user likenesses into images when asked.

    Privacy by default, controls for younger users

    OpenAI emphasized that privacy and user control are integral to group chat design. The feature operates independently of the user’s personalized ChatGPT memory, and no new memories are created from these interactions.

    Participation requires an invitation link, and members are always able to see who is in a chat or leave at any time.

    Users under the age of 18 are automatically shielded from sensitive content in group chats. Parents or guardians can disable group chat access altogether via built-in parental controls.

    Group creators retain special permissions, including immunity from being removed by others. All other participants can be added or removed by group members.

    A testbed for shared AI experiences

    OpenAI frames group chats as an early step toward richer, multi-user applications of AI, hinting at broader ambitions for ChatGPT as a shared workspace. The company expects to expand access over time and refine the feature based on how early users engage with it.

    Keyan Zhang’s post suggests that the underlying model capabilities are far ahead of the interfaces users currently interact with. This pilot, in OpenAI’s view, offers a new “container” where more of the model’s latent capacity can be surfaced.

    “Our models have a lot more room to shine than today’s experiences show, and the current containers only use a fraction of their capabilities,” Zhang said.

    With this initial pilot focused on a limited set of markets, OpenAI is likely monitoring both usage patterns and cultural fit as it plans for broader deployment. For now, the group chat experiment offers a new way for users to interact with ChatGPT—and with each other—in real time, using a conversational interface that blends productivity and personalization.

    Developer access: Still unclear

    OpenAI has not provided any indication that Group Chats will be accessible via the API or SDK. The current rollout is framed strictly within the ChatGPT product environment, with no mention of tool calls, developer hooks, or integration support for programmatic use. This absence of signaling leaves it unclear whether the company views group interaction as a future developer primitive or as a contained UX feature for end users only.

    For enterprise teams exploring how to replicate multi-user collaboration with generative models, any current implementation would require custom orchestration—such as managing multi-party context and prompts across separate API calls, and handling session state and response merging externally. Until OpenAI provides formal support, Group Chats remain a closed interface feature rather than a developer-accessible capability.

    Here is a standalone concluding subsection tailored for the article, focusing on what the ChatGPT Group Chat rollout means for enterprise decision makers in both pilot regions and globally:

    Implications for enterprise AI and data leaders

    For enterprise teams already leveraging AI platforms—or preparing to—OpenAI’s group chat feature introduces a new layer of multi-user collaboration that could shift how generative models are deployed across workflows. While the pilot is limited to users in Japan, New Zealand, South Korea, and Taiwan, its design and roadmap offer key signals for AI engineers, orchestration specialists, and data leads globally.

    AI engineers managing large language model (LLM) deployments can now begin to conceptualize real-time, multi-user interfaces not just as support tools, but as collaborative environments for research, content generation, and ideation. This adds another front in model tuning: not just how models respond to individuals, but how they behave in live group settings with context shifts and varied user intentions.

    For AI orchestration leads, the ability to integrate ChatGPT into collaborative flows without exposing private memory or requiring custom builds may reduce friction in piloting generative AI in cross-functional teams. These group sessions could serve as lightweight alternatives to internal tools for brainstorming, prototyping, or knowledge sharing—useful for teams constrained by infrastructure, budget, or time.

    Enterprise data managers may also find use cases in structured group chat sessions for data annotation, taxonomy validation, or internal training support. The system’s lack of memory persistence adds a level of data isolation that aligns with standard security and compliance practices—though global rollout will be key to validating regional data handling standards.

    As group chat capabilities evolve, decision makers should monitor how shared usage patterns might inform future model behaviors, auditing needs, and governance structures. In the long term, features like these will influence not just how organizations interact with generative AI, but how they design team-level interfaces around it.

  • Google’s new AI training method helps small models tackle complex reasoning

    Researchers at Google Cloud and UCLA have proposed a new reinforcement learning framework that significantly improves the ability of language models to learn very challenging multi-step reasoning tasks. Supervised Reinforcement Learning (SRL) reformulates problem-solving as a sequence of logical “actions,” providing rich learning signals during the training process.

    This approach enables smaller models to learn complex problems that were previously out of reach for other common training techniques. Experiments show that SRL not only excels on math reasoning benchmarks but also generalizes effectively to agentic software engineering tasks.

    SRL is a versatile training framework that can elevate smaller and less expensive models to higher reasoning abilities.

    The limits of current LLM reasoning training

    Recent advances in training large language models (LLMs) for reasoning have largely been driven by reinforcement learning with verifiable rewards (RLVR), a method where a model is rewarded based on the correctness of its final answer. By repeatedly trying to solve problems and getting feedback on the final outcome, the model gradually learns effective problem-solving strategies. 

    However, the success of this outcome-based approach depends on the model's ability to discover a correct solution within a limited number of attempts, or "rollouts." Since each rollout is computationally expensive, models can't try indefinitely. This method hits a wall when problems are so difficult that the model rarely, if ever, finds the right answer within its budget.

    This creates a critical learning bottleneck. In many multi-step reasoning problems, a model might correctly solve several steps but get derailed by a single mistake, leading to an incorrect answer. With RLVR, this entire effort receives a negative reward, and the model learns nothing from its partially correct work. It’s an all-or-nothing approach that fails to provide granular feedback and provides sparse rewards.

    An alternative method is supervised fine-tuning (SFT), where the model learns from examples containing the full reasoning process laid out by experts. While SFT can instill reasoning abilities, it often leads to overfitting (the model simply learns to imitate the trajectories in the training data instead of learning to generalize to problems beyond the examples it has seen). This issue is made worse by the fact that high-quality, human-created training data is both scarce and expensive to produce.

    As the paper notes, these limitations leave "a critical gap for training small open-source models to effectively learn difficult problems."

    How supervised reinforcement learning works

    SRL introduces a framework that reformulates problem-solving as a "sequential decision-making process," striking a balance between pure outcome-based RL and pure imitation learning. Instead of optimizing only for the final answer or forcing the model to imitate an expert's entire thought process, SRL teaches the model to reproduce a sequence of key actions that form the backbone of expert reasoning. This allows the model to learn to take actions similar to an expert while developing its own internal reasoning style.

    In the SRL framework, expert demonstrations are broken down into a series of intermediate, concrete actions, each representing a meaningful step. For a math problem, an action might be an algebraic manipulation. For a software engineering agent, it could be a command executed in a code repository. To generate training data, SRL uses a powerful teacher model to create solution trajectories, which are then used to train a smaller model.

    According to I-Hung Hsu, a research scientist at Google and co-author of the paper, this middle-ground approach is key to its effectiveness in real-world scenarios. "SRL sits in the middle: It captures the structured flexibility of real-world problem solving, where there are multiple valid strategies but also clear notions of what ‘good reasoning’ looks like at each step," Hsu told VentureBeat. "This makes SRL suitable for domains like data science automation or probably supply chain optimization — tasks that reward sound intermediate reasoning rather than mere final answers."

    During training, the model first generates an "inner monologue" (its internal reasoning process, enclosed in <think> tags) before committing to an action. At each step, SRL provides a reward based on the similarity between the model's predicted action and the expert's action. This step-wise reward system provides dense, fine-grained feedback, allowing the model to learn and improve even if its overall solution isn't perfect. This solves the sparse reward problem RLVR faces.

    SRL in action

    The researchers' experiments show that SRL significantly outperforms strong baselines in both challenging mathematical reasoning and agentic software engineering benchmarks. They also observed that SRL encourages more flexible and sophisticated reasoning patterns in models, such as interleaved planning and self-verification, which improve solution quality without just making the outputs longer.

    For enterprise leaders, performance gains are only valuable if they don't come with runaway costs. Hsu clarifies that SRL-trained models are more efficient in their reasoning. "The gains come from better reasoning quality and structure, not from verbosity," he said. "In terms of efficiency, SRL-trained models are roughly on par with the base model in token usage… while SRL isn’t designed to reduce inference cost, it achieves stronger reasoning performance without increasing it."

    For the math tests, the team fine-tuned Qwen2.5-7B-Instruct on a dataset of 1,000 difficult math questions. They compared its performance against models trained with SFT and RLVR (using the GRPO algorithm common in models like DeepSeek-R1) on four competition-level math benchmarks. The SRL-trained model achieved a substantial 3.0% average performance boost over other methods. 

    The team extended SRL to agentic software engineering, a domain critical for enterprise automation. They trained a coding-specialized model, Qwen2.5-Coder-7B-Instruct, on 5,000 expert trajectories of agents interacting with a coding environment. The SRL-trained model was benchmarked against the original base model and SWE-Gym-7B, a strong baseline fine-tuned with SFT. SRL achieved a 14.8% task resolve rate, representing a 74% relative improvement over the SFT-based model. This shows SRL's ability to train more competent AI agents for complex, real-world programming tasks.

    A new standard for high-stakes AI?

    The paper's strongest results came from combining methods: First, using SRL to teach foundational reasoning, then using RLVR to refine that skill. In their experiments, when the researchers used SRL as a pre-training and applied RLVR in post-training, they observed a 3.7% average increase, demonstrating a powerful curriculum learning strategy.

    This raises the question of whether this could become a new blueprint for building specialized AI.

    "We view SRL as a strong foundation," Hsu said. "In a sense, SRL provides a curriculum — teaching models to think and act step by step — before we refine those behaviors with outcome-based reinforcement learning. This SRL-first approach not only stabilizes the later RL stage but also makes reasoning more interpretable and generalizable, which is critical for high-stakes applications."

    Looking ahead, Hsu acknowledges that scaling this pipeline still faces challenges, particularly the high cost and complexity of end-to-end RLVR for agentic tasks. However, he is optimistic about the path forward. "While high-quality expert trajectories remain important," he concluded, "we think the next big leap will come from automating their generation and filtering — leveraging strong teacher models or even self-improving student models to bootstrap new data."

  • Upwork study shows AI agents excel with human partners but fail independently

    Artificial intelligence agents powered by the world's most advanced language models routinely fail to complete even straightforward professional tasks on their own, according to groundbreaking research released Thursday by Upwork, the largest online work marketplace.

    But the same study reveals a more promising path forward: When AI agents collaborate with human experts, project completion rates surge by up to 70%, suggesting the future of work may not pit humans against machines but rather pair them together in powerful new ways.

    The findings, drawn from more than 300 real client projects posted to Upwork's platform, marking the first systematic evaluation of how human expertise amplifies AI agent performance in actual professional work — not synthetic tests or academic simulations. The research challenges both the hype around fully autonomous AI agents and fears that such technology will imminently replace knowledge workers.

    "AI agents aren't that agentic, meaning they aren't that good," Andrew Rabinovich, Upwork's chief technology officer and head of AI and machine learning, said in an exclusive interview with VentureBeat. "However, when paired with expert human professionals, project completion rates improve dramatically, supporting our firm belief that the future of work will be defined by humans and AI collaborating to get more work done, with human intuition and domain expertise playing a critical role."

    How AI agents performed on 300+ real freelance jobs—and why they struggled

    Upwork's Human+Agent Productivity Index (HAPI) evaluated how three leading AI systems — Gemini 2.5 Pro, OpenAI's GPT-5, and Claude Sonnet 4 — performed on actual jobs posted by paying clients across categories including writing, data science, web development, engineering, sales, and translation.

    Critically, Upwork deliberately selected simple, well-defined projects where AI agents stood a reasonable chance of success. These jobs, priced under $500, represent less than 6% of Upwork's total gross services volume — a tiny fraction of the platform's overall business and an acknowledgment of current AI limitations.

    "The reality is that although we study AI, and I've been doing this for 25 years, and we see significant breakthroughs, the reality is that these agents aren't that agentic," Rabinovich told VentureBeat. "So if we go up the value chain, the problems become so much more difficult, then we don't think they can solve them at all, even to scratch the surface. So we specifically chose simpler tasks that would give an agent some kind of traction."

    Even on these deliberately simplified tasks, AI agents working independently struggled. But when expert freelancers provided feedback — spending an average of just 20 minutes per review cycle — the agents' performance improved substantially with each iteration.

    20 minutes of human feedback boosted AI completion rates up to 70%

    The research reveals stark differences in how AI agents perform with and without human guidance across different types of work. For data science and analytics projects, Claude Sonnet 4 achieved a 64% completion rate working alone but jumped to 93% after receiving feedback from a human expert. In sales and marketing work, Gemini 2.5 Pro's completion rate rose from 17% independently to 31% with human input. OpenAI's GPT-5 showed similarly dramatic improvements in engineering and architecture tasks, climbing from 30% to 50% completion.

    The pattern held across virtually all categories, with agents responding particularly well to human feedback on qualitative, creative work requiring editorial judgment — areas like writing, translation, and marketing — where completion rates increased by up to 17 percentage points per feedback cycle.

    The finding challenges a fundamental assumption in the AI industry: that agent benchmarks conducted in isolation accurately predict real-world performance.

    "While we show that in the tasks that we have selected for agents to perform in isolation, they perform similarly to the previous results that we've seen published openly, what we've shown is that in collaboration with humans, the performance of these agents improves surprisingly well," Rabinovich said. "It's not just a one-turn back and forth, but the more feedback the human provides, the better the agent gets at performing."

    Why ChatGPT can ace the SAT but can't count the R's in 'strawberry'

    The research arrives as the AI industry grapples with a measurement crisis. Traditional benchmarks — standardized tests that AI models can master, sometimes scoring perfectly on SAT exams or mathematics olympiads — have proven poor predictors of real-world capability.

    "With advances of large language models, what we're now seeing is that these static, academic datasets are completely saturated," Rabinovich said. "So you could get a perfect score in the SAT test or LSAT or any of the math olympiads, and then you would ask ChatGPT how many R's there are in the word strawberry, and it would get it wrong."

    This phenomenon — where AI systems ace formal tests but stumble on trivial real-world questions — has led to growing skepticism about AI capabilities, even as companies race to deploy autonomous agents. Several recent benchmarks from other firms have tested AI agents on Upwork jobs, but those evaluations measured only isolated performance, not the collaborative potential that Upwork's research reveals.

    "We wanted to evaluate the quality of these agents on actual real work with economic value associated with it, and not only see how well these agents do, but also see how these agents do in collaboration with humans, because we sort of knew already that in isolation, they're not that advanced," Rabinovich explained.

    For Upwork, which connects roughly 800,000 active clients posting more than 3 million jobs annually to a global pool of freelancers, the research serves a strategic business purpose: establishing quality standards for AI agents before allowing them to compete or collaborate with human workers on its platform.

    The economics of human-AI teamwork: Why paying for expert feedback still saves money

    Despite requiring multiple rounds of human feedback — each lasting about 20 minutes — the time investment remains "orders of magnitude different between a human doing the work alone, versus a human doing the work with an AI agent," Rabinovich said. Where a project might take a freelancer days to complete independently, the agent-plus-human approach can deliver results in hours through iterative cycles of automated work and expert refinement.

    The economic implications extend beyond simple time savings. Upwork recently reported that gross services volume from AI-related work grew 53% year-over-year in the third quarter of 2025, one of the strongest growth drivers for the company. But executives have been careful to frame AI not as a replacement for freelancers but as an enhancement to their capabilities.

    "AI was a huge overhang for our valuation," Erica Gessert, Upwork's CFO, told CFO Brew in October. "There was this belief that all work was going to go away. AI was going to take it, and especially work that's done by people like freelancers, because they are impermanent. Actually, the opposite is true."

    The company's strategy centers on enabling freelancers to handle more complex, higher-value work by offloading routine tasks to AI. "Freelancers actually prefer to have tools that automate the manual labor and repetitive part of their work, and really focus on the creative and conceptual part of the process," Rabinovich said.

    Rather than replacing jobs, he argues, AI will transform them: "Simpler tasks will be automated by agents, but the jobs will become much more complex in the number of tasks, so the amount of work and therefore earnings for freelancers will actually only go up."

    AI coding agents excel, but creative writing and translation still need humans

    The research reveals a clear pattern in agent capabilities. AI systems perform best on "deterministic and verifiable" tasks with objectively correct answers, like solving math problems or writing basic code. "Most coding tasks are very similar to each other," Rabinovich noted. "That's why coding agents are becoming so good."

    In Upwork's tests, web development, mobile app development, and data science projects — especially those involving structured, computational work — saw the highest standalone agent completion rates. Claude Sonnet 4 completed 68% of web development jobs and 64% of data science projects without human help, while Gemini 2.5 Pro achieved 74% on certain technical tasks.

    But qualitative work proved far more challenging. When asked to create website layouts, write marketing copy, or translate content with appropriate cultural nuance, agents floundered without expert guidance. "When you ask it to write you a poem, the quality of the poem is extremely subjective," Rabinovich said. "Since the rubrics for evaluation were provided by humans, there's some level of variability in representation."

    Writing, translation, and sales and marketing projects showed the most dramatic improvements from human feedback. For writing work, completion rates increased by up to 17 percentage points after expert review. Engineering and architecture projects requiring creative problem-solving — like civil engineering or architectural design — improved by as much as 23 percentage points with human oversight.

    This pattern suggests AI agents excel at pattern matching and replication but struggle with creativity, judgment, and context — precisely the skills that define higher-value professional work.

    Inside the research: How Upwork tested AI agents with peer-reviewed scientific methods

    Upwork partnered with elite freelancers on its platform to evaluate every deliverable produced by AI agents, both independently and after each cycle of human feedback. These evaluators created detailed rubrics defining whether projects met core requirements specified in job descriptions, then scored outputs across multiple iterations.

    Importantly, evaluators focused only on objective completion criteria, excluding subjective factors like stylistic preferences or quality judgments that might emerge in actual client relationships. "Rubric-based completion rates should not be viewed as a measure of whether an agent would be paid in a real marketplace setting," the research notes, "but as an indicator of its ability to fulfill explicitly defined requests."

    This distinction matters: An AI agent might technically complete all specified requirements yet still produce work a client rejects as inadequate. Conversely, subjective client satisfaction — the true measure of marketplace success — remains beyond current measurement capabilities.

    The research underwent double-blind peer review and was accepted to NeurIPS, the premier academic conference for AI research, where Upwork will present full results in early December. The company plans to publish a complete methodology and make the benchmark available to the research community, updating the task pool regularly to prevent overfitting as agents improve.

    "The idea is for this benchmark to be a living and breathing platform where agents can come in and evaluate themselves on all categories of work, and the tasks that will be offered on the platform will always update, so that these agents don't overfit and basically memorize the tasks at hand," Rabinovich said.

    Upwork's AI strategy: Building Uma, a 'meta-agent' that manages human and AI workers

    The research directly informs Upwork's product roadmap as the company positions itself for what executives call "the age of AI and beyond." Rather than building its own AI agents to complete specific tasks, Upwork is developing Uma, a "meta orchestration agent" that coordinates between human workers, AI systems, and clients.

    "Today, Upwork is a marketplace where clients look for freelancers to get work done, and then talent comes to Upwork to find work," Rabinovich explained. "This is getting expanded into a domain where clients come to Upwork, communicate with Uma, this meta-orchestration agent, and then Uma identifies the necessary talent to get the job done, gets the tasks outcomes completed, and then delivers that to the client."

    In this vision, clients would interact primarily with Uma rather than directly hiring freelancers. The AI system would analyze project requirements, determine which tasks require human expertise versus AI execution, coordinate the workflow, and ensure quality — acting as an intelligent project manager rather than a replacement worker.

    "We don't want to build agents that actually complete the tasks, but we are building this meta orchestration agent that figures out what human and agent talent is necessary in order to complete the tasks," Rabinovich said. "Uma evaluates the work to be delivered to the client, orchestrates the interaction between humans and agents, and is able to learn from all the interactions that happen on the platform how to break jobs into tasks so that they get completed in a timely and effective manner."

    The company recently announced plans to open its first international office in Lisbon, Portugal, by the fourth quarter of 2026, with a focus on AI infrastructure development and technical hiring. The expansion follows Upwork's record-breaking third quarter, driven partly by AI-powered product innovation and strong demand for workers with AI skills.

    OpenAI, Anthropic, and Google race to build autonomous agents—but reality lags hype

    Upwork's findings arrive amid escalating competition in the AI agent space. OpenAI, Anthropic, Google, and numerous startups are racing to develop autonomous agents capable of complex multi-step tasks, from booking travel to analyzing financial data to writing software.

    But recent high-profile stumbles have tempered initial enthusiasm. AI agents frequently misunderstand instructions, make logical errors, or produce confidently wrong results — a phenomenon researchers call "hallucination." The gap between controlled demonstration videos and reliable real-world performance remains vast.

    "There have been some evaluations that came from OpenAI and other platforms where real Upwork tasks were considered for completion by agents, and across the board, the reported results were not very optimistic, in the sense that they showed that agents—even the best ones, meaning powered by most advanced LLMs — can't really compete with humans that well, because the completion rates are pretty low," Rabinovich said.

    Rather than waiting for AI to fully mature — a timeline that remains uncertain—Upwork is betting on a hybrid approach that leverages AI's strengths (speed, scalability, pattern recognition) while retaining human strengths (judgment, creativity, contextual understanding).

    This philosophy extends to learning and improvement. Current AI models train primarily on static datasets scraped from the internet, supplemented by human preference feedback. But most professional work is qualitative, making it difficult for AI systems to know whether their outputs are actually good without expert evaluation.

    "Unless you have this collaboration between the human and the machine, where the human is kind of the teacher and the machine is the student trying to discover new solutions, none of this will be possible," Rabinovich said. "Upwork is very uniquely positioned to create such an environment because if you try to do this with, say, self-driving cars, and you tell Waymo cars to explore new ways of getting to the airport, like avoiding traffic signs, then a bunch of bad things will happen. In doing work on Upwork, if it creates a wrong website, it doesn't cost very much, and there's no negative side effects. But the opportunity to learn is absolutely tremendous."

    Will AI take your job? The evidence suggests a more complicated answer

    While much public discourse around AI focuses on job displacement, Rabinovich argues the historical pattern suggests otherwise — though the transition may prove disruptive.

    "The narrative in the public is that AI is eliminating jobs, whether it's writing, translation, coding or other digital work, but no one really talks about the exponential amount of new types of work that it will create," he said. "When we invented electricity and steam engines and things like that, they certainly replaced certain jobs, but the amount of new jobs that were introduced is exponentially more, and we think the same is going to happen here."

    The research identifies emerging job categories focused on AI oversight: designing effective human-machine workflows, providing high-quality feedback to improve agent performance, and verifying that AI-generated work meets quality standards. These skills—prompt engineering, agent supervision, output verification—barely existed two years ago but now command premium rates on platforms like Upwork.

    "New types of skills from humans are becoming necessary in the form of how to design the interaction between humans and machines, how to guide agents to make them better, and ultimately, how to verify that whatever agentic proposals are being made are actually correct, because that's what's necessary in order to advance the state of AI," Rabinovich said.

    The question remains whether this transition—  from doing tasks to overseeing them — will create opportunities as quickly as it disrupts existing roles. For freelancers on Upwork, the answer may already be emerging in their bank accounts: The platform saw AI-related work grow 53% year-over-year, even as fears of AI-driven unemployment dominated headlines.

  • Baidu unveils proprietary ERNIE 5 beating GPT-5 performance on charts, document understanding and more

    Mere hours after OpenAI updated its flagship foundation model GPT-5 to GPT-5.1, promising reduced token usage overall and a more pleasant personality with more preset options, Chinese search giant Baidu unveiled its next-generation foundation model, ERNIE 5.0, alongside a suite of AI product upgrades and strategic international expansions.

    The goal: to position as a global contender in the increasingly competitive enterprise AI market.

    Announced at the company's Baidu World 2025 event, ERNIE 5.0 is a proprietary, natively omni-modal model designed to jointly process and generate content across text, images, audio, and video.

    Unlike Baidu’s recently released ERNIE-4.5-VL-28B-A3B-Thinking, which is open source under an enterprise-friendly and permissive Apache 2.0 license, ERNIE 5.0 is a proprietary model and is available only via Baidu’s ERNIE Bot website (I needed to select it manuallyu from the model picker dropdown) and the Qianfan cloud platform application programming interface (API) for enterprise customers.

    Alongside the model launch, Baidu introduced major updates to its digital human platform, no-code tools, and general-purpose AI agents — all targeted at expanding its AI footprint beyond China.

    The company also introduced ERNIE 5.0 Preview 1022, a variant optimized for text-intensive tasks, alongside the general preview model that balances across modalities.

    Baidu emphasized that ERNIE 5.0 represents a shift in how intelligence is deployed at scale, with CEO Robin Li stating: “When you internalize AI, it becomes a native capability and transforms intelligence from a cost into a source of productivity.”

    Where ERNIE 5.0 outshines GPT-5 and Gemini 2.5 Pro

    ERNIE 5.0’s benchmark results suggest that Baidu has achieved parity—or near-parity—with the top Western foundation models across a wide spectrum of tasks.

    In public benchmark slides shared during the Baidu World 2025 event, ERNIE 5.0 Preview outperformed or matched OpenAI’s GPT-5-High and Google’s Gemini 2.5 Pro in multimodal reasoning, document understanding, and image-based QA, while also demonstrating strong language modeling and code execution abilities.

    The company emphasized its ability to handle joint inputs and outputs across modalities, rather than relying on post-hoc modality fusion, which it framed as a technical differentiator.

    On visual tasks, ERNIE 5.0 achieved leading scores on OCRBench, DocVQA, and ChartQA, three benchmarks that test document recognition, comprehension, and structured data reasoning.

    Baidu claims the model beat both GPT-5-High and Gemini 2.5 Pro on these document and chart-based benchmarks, areas it describes as core to enterprise applications like automated document processing and financial analysis.

    In image generation, ERNIE 5.0 tied or exceeded Google’s Veo3 across categories including semantic alignment and image quality, according to Baidu’s internal GenEval-based evaluation. Baidu claimed that the model’s multimodal integration allows it to generate and interpret visual content with greater contextual awareness than models relying on modality-specific encoders.

    For audio and speech tasks, ERNIE 5.0 demonstrated competitive results on MM-AU and TUT2017 audio understanding benchmarks, as well as question answering from spoken language inputs. Its audio performance, while not as heavily emphasized as vision or text, suggests a broad capability footprint intended to support full-spectrum multimodal applications.

    In language tasks, the model showed strong results on instruction following, factual question answering, and mathematical reasoning—core areas that define the enterprise utility of large language models.

    The Preview 1022 variant of ERNIE 5.0, tailored for textual performance, showed even stronger language-specific results in early developer access. While Baidu does not claim broad superiority in general language reasoning, its internal evaluations suggest that ERNIE 5.0 Preview 1022 closes the gap with top-tier English-language models and outperforms them in Chinese-language performance.

    While Baidu did not release full benchmark details or raw scores publicly, its performance positioning suggests a deliberate attempt to frame ERNIE 5.0 not as a niche multimodal system but as a flagship model competitive with the largest closed models in general-purpose reasoning.

    Where Baidu claims a clear lead is in structured document understanding, visual chart reasoning, and integration of multiple modalities into a single, native modeling architecture. Independent verification of these results remains pending, but the breadth of claimed capabilities positions ERNIE 5.0 as a serious alternative in the multimodal foundation model landscape.

    Enterprise Pricing Strategy

    ERNIE 5.0 is positioned at the premium end of Baidu’s model pricing structure. The company has released specific pricing for API usage on its Qianfan platform, aligning the cost with other top-tier offerings from Chinese competitors like Alibaba.

    Model

    Input Cost (per 1K tokens)

    Output Cost (per 1K tokens)

    Source

    ERNIE 5.0

    $0.00085 (¥0.006)

    $0.0034 (¥0.024)

    Qianfan

    ERNIE 4.5 Turbo (ex.)

    $0.00011 (¥0.0008)

    $0.00045 (¥0.0032)

    Qianfan

    Qwen3 (Coder ex.)

    $0.00085 (¥0.006)

    $0.0034 (¥0.024)

    Qianfan

    The contrast in cost between ERNIE 5.0 and earlier models such as ERNIE 4.5 Turbo underscores Baidu’s strategy to differentiate between high-volume, low-cost models and high-capability models designed for complex tasks and multimodal reasoning.

    Compared to other U.S. alternatives, it remains mid-range in pricing:

    Model

    Input (/1 M tokens)

    Output (/1 M tokens)

    Source

    GPT-5.1

    $1.25

    $10.00

    OpenAI

    ERNIE 5.0

    $0.85

    $3.40

    Qianfan

    ERNIE 4.5 Turbo (ex.)

    $0.11

    $0.45

    Qianfan

    Claude Opus 4.1

    $15.00

    $75.00

    Anthropic

    Gemini 2.5 Pro

    $1.25 (≤200k) / $2.50 (>200k)

    $10.00 (≤200k) / $15.00 (>200k)

    Google Vertex AI Pricing

    Grok 4 (grok-4-0709)

    $3.00

    $15.00

    xAI API

    Global Expansion: Products and Platforms

    In tandem with the model release, Baidu is expanding internationally:

    • GenFlow 3.0, now with 20M+ users, is the company’s largest general-purpose AI agent and features enhanced memory and multimodal task handling.

    • Famou, a self-evolving agent capable of dynamically solving complex problems, is now commercially available via invite.

    • MeDo, the international version of Baidu’s no-code builder Miaoda, is live globally via medo.dev.

    • Oreate, a productivity workspace with document, slide, image, video, and podcast support, has reached over 1.2M users worldwide.

    Baidu’s digital human platform, already rolled out in Brazil, is also part of the global push. According to company data, 83% of livestreamers during this year’s “Double 11” shopping event in China used Baidu’s digital human tech, contributing to a 91% increase in GMV.

    Meanwhile, Baidu’s autonomous ride-hailing service Apollo Go has surpassed 17 million rides, operating driverless fleets in 22 cities and claiming the title of the world’s largest robotaxi network.

    Open-Source Vision-Language Model Garners Industry Attention

    Two days before the flagship ERNIE 5.0 event, Baidu also released an open-source multimodal model under the Apache 2.0 license: ERNIE-4.5-VL-28B-A3B-Thinking.

    As reported by my colleague Michael Nuñez at VentureBeat, the model activates just 3 billion parameters while maintaining a total of 28 billion, using a Mixture-of-Experts (MoE) architecture for efficient inference.

    Key technical innovations include:

    • “Thinking with Images”, which enables dynamic zoom-based visual analysis

    • Support for chart interpretation, document understanding, visual grounding, and temporal awareness in video

    • Runtime on a single 80GB GPU, making it accessible to mid-sized organizations

    • Full compatibility with Transformers, vLLM, and Baidu’s FastDeploy toolkits

    This release adds pressure on closed-source competitors. With Apache 2.0 licensing, ERNIE-4.5-VL-28B-A3B-Thinking becomes a viable foundation model for commercial applications without licensing restrictions — something few high-performing models in this class offer.

    Community Feedback and Baidu’s Response

    Following the launch of ERNIE 5.0, developer and AI evaluator Lisan al Gaib (@scaling01) posted a mixed review on X. While initially impressed by the model’s benchmark performance, they reported a persistent issue where ERNIE 5.0 would repeatedly invoke tools — even when explicitly instructed not to — during SVG generation tasks.

    “ERNIE 5.0 benchmarks looked insane until I tested it… unfortunately it’s RL braindamaged or they have a serious issue with their chat platform / system prompt,” Lisan wrote.

    In a matter of hours, Baidu’s developer-focused support account, @ErnieforDevs, responded:

    “Thanks for the feedback! It’s a known bug — certain syntax can consistently trigger it. We’re working on a fix. You can try rephrasing or changing the prompt to avoid it for now.”

    The quick turnaround reflects Baidu’s increasing emphasis on developer communication, especially as it courts international users through both proprietary and open-source offerings.

    Outlook for Baidu and its ERNIE foundational LLM family

    Baidu’s ERNIE 5.0 marks a strategic escalation in the global foundation model race. With performance claims that put it on par with the most advanced systems from OpenAI and Google, and a mix of premium pricing and open-access alternatives, Baidu is signaling its ambition to become not just a domestic AI leader, but a credible global infrastructure provider.

    At a time when enterprise AI users are increasingly demanding multimodal performance, flexible licensing, and deployment efficiency, Baidu’s two-track approach—premium hosted APIs and open-source releases—may broaden its appeal across both corporate and developer communities.

    Whether the company’s performance claims hold up under third-party testing remains to be seen. But in a landscape shaped by rising costs, model complexity, and compute bottlenecks, ERNIE 5.0 and its supporting ecosystem give Baidu a competitive position in the next wave of AI deployment.

  • How Deductive AI saved DoorDash 1,000 engineering hours by automating software debugging

    As software systems grow more complex and AI tools generate code faster than ever, a fundamental problem is getting worse: Engineers are drowning in debugging work, spending up to half their time hunting down the causes of software failures instead of building new products. The challenge has become so acute that it's creating a new category of tooling — AI agents that can diagnose production failures in minutes instead of hours.

    Deductive AI, a startup emerging from stealth mode Wednesday, believes it has found a solution by applying reinforcement learning — the same technology that powers game-playing AI systems — to the messy, high-stakes world of production software incidents. The company announced it has raised $7.5 million in seed funding led by CRV, with participation from Databricks Ventures, Thomvest Ventures, and PrimeSet, to commercialize what it calls "AI SRE agents" that can diagnose and help fix software failures at machine speed.

    The pitch resonates with a growing frustration inside engineering organizations: Modern observability tools can show that something broke, but they rarely explain why. When a production system fails at 3 a.m., engineers still face hours of manual detective work, cross-referencing logs, metrics, deployment histories, and code changes across dozens of interconnected services to identify the root cause.

    "The complexities and inter-dependencies of modern infrastructure means that investigating the root cause of an outage or incident can feel like searching for a needle in a haystack, except the haystack is the size of a football field, it's made of a million other needles, it's constantly reshuffling itself, and is on fire — and every second you don't find it equals lost revenue," said Sameer Agarwal, Deductive's co-founder and chief technology officer, in an exclusive interview with VentureBeat.

    Deductive's system builds what the company calls a "knowledge graph" that maps relationships across codebases, telemetry data, engineering discussions, and internal documentation. When an incident occurs, multiple AI agents work together to form hypotheses, test them against live system evidence, and converge on a root cause — mimicking the investigative workflow of experienced site reliability engineers, but completing the process in minutes rather than hours.

    The technology has already shown measurable impact at some of the world's most demanding production environments. DoorDash's advertising platform, which runs real-time auctions that must complete in under 100 milliseconds, has integrated Deductive into its incident response workflow. The company has set an ambitious 2026 goal of resolving production incidents within 10 minutes.

    "Our Ads Platform operates at a pace where manual, slow-moving investigations are no longer viable. Every minute of downtime directly affects company revenue," said Shahrooz Ansari, Senior Director of Engineering at DoorDash, in an interview with VentureBeat. "Deductive has become a critical extension of our team, rapidly synthesizing signals across dozens of services and surfacing the insights that matter—within minutes."

    DoorDash estimates that Deductive has root-caused approximately 100 production incidents over the past few months, translating to more than 1,000 hours of annual engineering productivity and a revenue impact "in millions of dollars," according to Ansari. At location intelligence company Foursquare, Deductive reduced the time to diagnose Apache Spark job failures by 90% —t urning a process that previously took hours or days into one that completes in under 10 minutes — while generating over $275,000 in annual savings.

    Why AI-generated code is creating a debugging crisis

    The timing of Deductive's launch reflects a brewing tension in software development: AI coding assistants are enabling engineers to generate code faster than ever, but the resulting software is often harder to understand and maintain.

    "Vibe coding," a term popularized by AI researcher Andrej Karpathy, refers to using natural-language prompts to generate code through AI assistants. While these tools accelerate development, they can introduce what Agarwal describes as "redundancies, breaks in architectural boundaries, assumptions, or ignored design patterns" that accumulate over time.

    "Most AI-generated code still introduces redundancies, breaks architectural boundaries, makes assumptions, or ignores established design patterns," Agarwal told Venturebeat. "In many ways, we now need AI to help clean up the mess that AI itself is creating."

    The claim that engineers spend roughly half their time on debugging isn't hyperbole. The Association for Computing Machinery reports that developers spend 35% to 50% of their time validating and debugging software. More recently, Harness's State of Software Delivery 2025 report found that 67% of developers are spending more time debugging AI-generated code.

    "We've seen world-class engineers spending half of their time debugging instead of building," said Rakesh Kothari, Deductive's co-founder and CEO. "And as vibe coding generates new code at a rate we've never seen, this problem is only going to get worse."

    How Deductive's AI agents actually investigate production failures

    Deductive's technical approach differs substantially from the AI features being added to existing observability platforms like Datadog or New Relic. Most of those systems use large language models to summarize data or identify correlations, but they lack what Agarwal calls "code-aware reasoning"—the ability to understand not just that something broke, but why the code behaves the way it does.

    "Most enterprises use multiple observability tools across different teams and services, so no vendor has a single holistic view of how their systems behave, fail, and recover—nor are they able to pair that with an understanding of the code that defines system behavior," Agarwal explained. "These are key ingredients to resolving software incidents and it is exactly the gap Deductive fills."

    The system connects to existing infrastructure using read-only API access to observability platforms, code repositories, incident management tools, and chat systems. It then continuously builds and updates its knowledge graph, mapping dependencies between services and tracking deployment histories.

    When an alert fires, Deductive launches what the company describes as a multi-agent investigation. Different agents specialize in different aspects of the problem: one might analyze recent code changes, another examines trace data, while a third correlates the timing of the incident with recent deployments. The agents share findings and iteratively refine their hypotheses.

    The critical difference from rule-based automation is Deductive's use of reinforcement learning. The system learns from every incident which investigative steps led to correct diagnoses and which were dead ends. When engineers provide feedback, the system incorporates that signal into its learning model.

    "Each time it observes an investigation, it learns which steps, data sources, and decisions led to the right outcome," Agarwal said. "It learns how to think through problems, not just point them out."

    At DoorDash, a recent latency spike in an API initially appeared to be an isolated service issue. Deductive's investigation revealed that the root cause was actually timeout errors from a downstream machine learning platform undergoing a deployment. The system connected these dots by analyzing log volumes, traces, and deployment metadata across multiple services.

    "Without Deductive, our team would have had to manually correlate the latency spike across all logs, traces, and deployment histories," Ansari said. "Deductive was able to explain not just what changed, but how and why it impacted production behavior."

    The company keeps humans in the loop—for now

    While Deductive's technology could theoretically push fixes directly to production systems, the company has deliberately chosen to keep humans in the loop—at least for now.

    "While our system is capable of deeper automation and could push fixes to production, currently, we recommend precise fixes and mitigations that engineers can review, validate, and apply," Agarwal said. "We believe maintaining a human in the loop is essential for trust, transparency and operational safety."

    However, he acknowledged that "over time, we do think that deeper automation will come and how humans operate in the loop will evolve."

    Databricks and ThoughtSpot veterans bet on reasoning over observability

    The founding team brings deep expertise from building some of Silicon Valley's most successful data infrastructure platforms. Agarwal earned his Ph.D. at UC Berkeley, where he created BlinkDB, an influential system for approximate query processing. He was among the first engineers at Databricks, where he helped build Apache Spark. Kothari was an early engineer at ThoughtSpot, where he led teams focused on distributed query processing and large-scale system optimization.

    The investor syndicate reflects both the technical credibility and market opportunity. Beyond CRV's Max Gazor, the round included participation from Ion Stoica, founder of Databricks and Anyscale; Ajeet Singh, founder of Nutanix and ThoughtSpot; and Ben Sigelman, founder of Lightstep.

    Rather than competing with platforms like Datadog or PagerDuty, Deductive positions itself as a complementary layer that sits on top of existing tools. The pricing model reflects this: Instead of charging based on data volume, Deductive charges based on the number of incidents investigated, plus a base platform fee.

    The company offers both cloud-hosted and self-hosted deployment options and emphasizes that it doesn't store customer data on its servers or use it to train models for other customers — a critical assurance given the proprietary nature of both code and production system behavior.

    With fresh capital and early customer traction at companies like DoorDash, Foursquare, and Kumo AI, Deductive plans to expand its team and deepen the system's reasoning capabilities from reactive incident analysis to proactive prevention. The near-term vision: helping teams predict problems before they occur.

    DoorDash's Ansari offers a pragmatic endorsement of where the technology stands today: "Investigations that were previously manual and time-consuming are now automated, allowing engineers to shift their energy toward prevention, business impact, and innovation."

    In an industry where every second of downtime translates to lost revenue, that shift from firefighting to building increasingly looks less like a luxury and more like table stakes.