Blog

From shiny object to sober reality: The vector database story, two years later
When I first wrote “Vector databases: Shiny object syndrome and the case of a missing unicorn” in March 2024, the industry was awash in hype. Vector databases were positioned as the next big thing — a must-have infrastructure layer for the gen AI era. Billions of venture dollars flowed, developers rushed to integrate embeddings into their pipelines and analysts breathlessly tracked funding rounds for Pinecone, Weaviate, Chroma, Milvus and a dozen others.

The promise was intoxicating: Finally, a way to search by meaning rather than by brittle keywords. Just dump your enterprise knowledge into a vector store, connect an LLM and watch magic happen.

Except the magic never fully materialized.

Two years on, the reality check has arrived: 95% of organizations invested in gen AI initiatives are seeing zero measurable returns. And, many of the warnings I raised back then — about the limits of vectors, the crowded vendor landscape and the risks of treating vector databases as silver bullets — have played out almost exactly as predicted.

Prediction 1: The missing unicorn

Back then, I questioned whether Pinecone — the poster child of the category — would achieve unicorn status or whether it would become the “missing unicorn” of the database world. Today, that question has been answered in the most telling way possible: Pinecone is reportedly exploring a sale, struggling to break out amid fierce competition and customer churn.

Yes, Pinecone raised big rounds and signed marquee logos. But in practice, differentiation was thin. Open-source players like Milvus, Qdrant and Chroma undercut them on cost. Incumbents like Postgres (with pgVector) and Elasticsearch simply added vector support as a feature. And customers increasingly asked: “Why introduce a whole new database when my existing stack already does vectors well enough?”

The result: Pinecone, once valued near a billion dollars, is now looking for a home. The missing unicorn indeed. In September 2025, Pinecone appointed Ash Ashutosh as CEO, with founder Edo Liberty moving to a chief scientist role. The timing is telling: The leadership change comes amid increasing pressure and questions over its long-term independence.

Prediction 2: Vectors alone won’t cut it

I also argued that vector databases by themselves were not an end solution. If your use case required exactness — l ike searching for “Error 221” in a manual—a pure vector search would gleefully serve up “Error 222” as “close enough.” Cute in a demo, catastrophic in production.

That tension between similarity and relevance has proven fatal to the myth of vector databases as all-purpose engines.

“Enterprises discovered the hard way that semantic ≠ correct.”

Developers who gleefully swapped out lexical search for vectors quickly reintroduced… lexical search in conjunction with vectors. Teams that expected vectors to “just work” ended up bolting on metadata filtering, rerankers and hand-tuned rules. By 2025, the consensus is clear: Vectors are powerful, but only as part of a hybrid stack.

Prediction 3: A crowded field becomes commoditized

The explosion of vector database startups was never sustainable. Weaviate, Milvus (via Zilliz), Chroma, Vespa, Qdrant — each claimed subtle differentiators, but to most buyers they all did the same thing: store vectors and retrieve nearest neighbors.

Today, very few of these players are breaking out. The market has fragmented, commoditized and in many ways been swallowed by incumbents. Vector search is now a checkbox feature in cloud data platforms, not a standalone moat.

Just as I wrote then: Distinguishing one vector DB from another will pose an increasing challenge. That challenge has only grown harder. Vald, Marqo, LanceDB, PostgresSQL, MySQL HeatWave, Oracle 23c, Azure SQL, Cassandra, Redis, Neo4j, SingleStore, ElasticSearch, OpenSearch, Apahce Solr… the list goes on.

The new reality: Hybrid and GraphRAG

But this isn’t just a story of decline — it’s a story of evolution. Out of the ashes of vector hype, new paradigms are emerging that combine the best of multiple approaches.

Hybrid Search: Keyword + vector is now the default for serious applications. Companies learned that you need both precision and fuzziness, exactness and semantics. Tools like Apache Solr, Elasticsearch, pgVector and Pinecone’s own “cascading retrieval” embrace this.

GraphRAG: The hottest buzzword of late 2024/2025 is GraphRAG — graph-enhanced retrieval augmented generation. By marrying vectors with knowledge graphs, GraphRAG encodes the relationships between entities that embeddings alone flatten away. The payoff is dramatic.

Benchmarks and evidence
- Amazon’s AI blog cites benchmarks from Lettria, where hybrid GraphRAG boosted answer correctness from ~50% to 80%-plus in test datasets across finance, healthcare, industry, and law.
- The GraphRAG-Bench benchmark (released May 2025) provides a rigorous evaluation of GraphRAG vs. vanilla RAG across reasoning tasks, multi-hop queries and domain challenges.
- An OpenReview evaluation of RAG vs GraphRAG found that each approach has strengths depending on task — but hybrid combinations often perform best.
- FalkorDB’s blog reports that when schema precision matters (structured domains), GraphRAG can outperform vector retrieval by a factor of ~3.4x on certain benchmarks.
The rise of GraphRAG underscores the larger point: Retrieval is not about any single shiny object. It’s about building retrieval systems — layered, hybrid, context-aware pipelines that give LLMs the right information, with the right precision, at the right time.

What this means going forward

The verdict is in: Vector databases were never the miracle. They were a step — an important one — in the evolution of search and retrieval. But they are not, and never were, the endgame.

The winners in this space won’t be those who sell vectors as a standalone database. They will be the ones who embed vector search into broader ecosystems — integrating graphs, metadata, rules and context engineering into cohesive platforms.

In other words: The unicorn isn’t the vector database. The unicorn is the retrieval stack.

Looking ahead: What’s next
- Unified data platforms will subsume vector + graph: Expect major DB and cloud vendors to offer integrated retrieval stacks (vector + graph + full-text) as built-in capabilities.
- “Retrieval engineering” will emerge as a distinct discipline: Just as MLOps matured, so too will practices around embedding tuning, hybrid ranking and graph construction.
- Meta-models learning to query better: Future LLMs may learn to orchestrate which retrieval method to use per query, dynamically adjusting weighting.
- Temporal and multimodal GraphRAG: Already, researchers are extending GraphRAG to be time-aware (T-GRAG) and multimodally unified (e.g. connecting images, text, video).
- Open benchmarks and abstraction layers: Tools like BenchmarkQED (for RAG benchmarking) and GraphRAG-Bench will push the community toward fairer, comparably measured systems.
From shiny objects to essential infrastructure

The arc of the vector database story has followed a classic path: A pervasive hype cycle, followed by introspection, correction and maturation. In 2025, vector search is no longer the shiny object everyone pursues blindly — it’s now a critical building block within a more sophisticated, multi-pronged retrieval architecture.

The original warnings were right. Pure vector-based hopes often crash on the shoals of precision, relational complexity and enterprise constraints. Yet the technology was never wasted: It forced the industry to rethink retrieval, blending semantic, lexical and relational strategies.

If I were to write a sequel in 2027, I suspect it would frame vector databases not as unicorns, but as legacy infrastructure — foundational, but eclipsed by smarter orchestration layers, adaptive retrieval controllers and AI systems that dynamically choose which retrieval tool fits the query.

As of now, the real battle is not vector vs keyword — it’s the indirection, blending and discipline in building retrieval pipelines that reliably ground gen AI in facts and domain knowledge. That’s the unicorn we should be chasing now.

Amit Verma is head of engineering and AI Labs at Neuron7.

Read more from our guest writers. Or, consider submitting a post of your own! See our guidelines here.
Kasım 16, 2025
Human-centric IAM is failing: Agentic AI requires a new identity control plane

The race to deploy agentic AI is on. Across the enterprise, systems that can plan, take actions and collaborate across business applications promise unprecedented efficiency. But in the rush to automate, a critical component is being overlooked: Scalable security. We are building a workforce of digital employees without giving them a secure way to log in, access data and do their jobs without creating catastrophic risk.

The fundamental problem is that traditional identity and access management (IAM) designed for humans breaks at agentic scale. Controls like static roles, long-lived passwords and one-time approvals are useless when non-human identities can outnumber human ones by 10 to one. To harness the power of agentic AI, identity must evolve from a simple login gatekeeper into the dynamic control plane for your entire AI operation.

“The fastest path to responsible AI is to avoid real data. Use synthetic data to prove value, then earn the right to touch the real thing.” — Shawn Kanungo, keynote speaker and innovation strategist; bestselling author of The Bold Ones

Why your human-centric IAM is a sitting duck

Agentic AI does not just use software; it behaves like a user. It authenticates to systems, assumes roles and calls APIs. If you treat these agents as mere features of an application, you invite invisible privilege creep and untraceable actions. A single over-permissioned agent can exfiltrate data or trigger erroneous business processes at machine speed, with no one the wiser until it is too late.

The static nature of legacy IAM is the core vulnerability. You cannot pre-define a fixed role for an agent whose tasks and required data access might change daily. The only way to keep access decisions accurate is to move policy enforcement from a one-time grant to a continuous, runtime evaluation.

Prove value before production data

Kanungo’s guidance offers a practical on-ramp. Start with synthetic or masked datasets to validate agent workflows, scopes and guardrails. Once your policies, logs and break-glass paths hold up in this sandbox, you can graduate agents to real data with confidence and clear audit evidence.

Building an identity-centric operating model for AI

Securing this new workforce requires a shift in mindset. Each AI agent must be treated as a first-class citizen within your identity ecosystem.

First, every agent needs a unique, verifiable identity. This is not just a technical ID; it must be linked to a human owner, a specific business use case and a software bill of materials (SBOM). The era of shared service accounts is over; they are the equivalent of giving a master key to a faceless crowd.

Second, replace set-and-forget roles with session-based, risk-aware permissions. Access should be granted just in time, scoped to the immediate task and the minimum necessary dataset, then automatically revoked when the job is complete. Think of it as giving an agent a key to a single room for one meeting, not the master key to the entire building.

Three pillars of a scalable agent security architecture

Context-aware authorization at the core. Authorization can no longer be a simple yes or no at the door. It must be a continuous conversation. Systems should evaluate context in real time. Is the agent’s digital posture attested? Is it requesting data typical for its purpose? Is this access occurring during a normal operational window? This dynamic evaluation enables both security and speed.

Purpose-bound data access at the edge. The final line of defense is the data layer itself. By embedding policy enforcement directly into the data query engine, you can enforce row-level and column-level security based on the agent’s declared purpose. A customer service agent should be automatically blocked from running a query that appears designed for financial analysis. Purpose binding ensures data is used as intended, not merely accessed by an authorized identity.

Tamper-evident evidence by default. In a world of autonomous actions, auditability is non-negotiable. Every access decision, data query and API call should be immutably logged, capturing the who, what, where and why. Link logs so they are tamper evident and replayable for auditors or incident responders, providing a clear narrative of every agent’s activities.

A practical roadmap to get started

Begin with an identity inventory. Catalog all non-human identities and service accounts. You will likely find sharing and over-provisioning. Begin issuing unique identities for each agent workload.

Pilot a just-in-time access platform. Implement a tool that grants short-lived, scoped credentials for a specific project. This proves the concept and shows the operational benefits.

Mandate short-lived credentials. Issue tokens that expire in minutes, not months. Seek out and remove static API keys and secrets from code and configuration.

Stand up a synthetic data sandbox. Validate agent workflows, scopes, prompts and policies on synthetic or masked data first. Promote to real data only after controls, logs and egress policies pass.

Conduct an agent incident tabletop drill. Practice responses to a leaked credential, a prompt injection or a tool escalation. Prove you can revoke access, rotate credentials and isolate an agent in minutes.

The bottom line

You cannot manage an agentic, AI-driven future with human-era identity tools. The organizations that will win recognize identity as the central nervous system for AI operations. Make identity the control plane, move authorization to runtime, bind data access to purpose and prove value on synthetic data before touching the real thing. Do that, and you can scale to a million agents without scaling your breach risk.

Michelle Buckner is a former NASA Information System Security Officer (ISSO).

Kasım 16, 2025
ChatGPT Group Chats are here … but not for everyone (yet)

It was originally found in leaked code and publicized by AI influencers on X, but OpenAI has made it official: ChatGPT now offers Group Chats, allowing multiple users to join the same, single ChatGPT conversation and send messages to each other and the underlying large language model (LLM), online and via its mobile apps.

Imagine adding ChatGPT as another member of your existing group chats, allowing you to text it as you would one of your friends or family members and have them respond as well, and you'll have an idea of the intriguing power and potential of this feature.

However, the feature is only available as a limited pilot for now to ChatGPT users in Japan, New Zealand, South Korea, and Taiwan (all tiers, including free usage).

“Group chats are just the beginning of ChatGPT becoming a shared space to collaborate and interact with others,” OpenAI wrote in its announcement.

This development builds on internal experimentation at OpenAI, where technical staffer Keyan Zhang said in a post on X that OpenAI's team initially considered multiplayer ChatGPT to be “a wild, out-of-distribution idea.”

According to Zhang, the model’s performance in those early tests demonstrated far more potential than existing interfaces typically allow.

The move follows OpenAI investor yet competitor Microsoft's update of its Copilot AI assistant to allow group chats last month, as well as Anthropic's introduction of shareable context and chat histories from its Claude AI models through its Projects feature introduced summer 2024, though this is not a simultaneous, realtime group chat in the same way.

Collaborative functionality integrated into ChatGPT

Group chats function as shared conversational spaces where users can plan events, brainstorm ideas, or collaborate on projects with the added support of ChatGPT.

These conversations are distinct from individual chats and are excluded from ChatGPT’s memory system—meaning no data from these group threads is used to train or personalize future interactions.

Users can initiate a group chat by selecting the people icon in a new or existing conversation. Adding others creates a copy of the original thread, preserving the source dialogue. Participants can join via a shareable link and are prompted to create a profile with a name, username, and photo. The feature supports 1 to 20 participants per group.

Each group chat is listed in a new section of the ChatGPT interface, and users can manage settings like naming the group, adding or removing participants, or muting notifications.

Powered by GPT-5.1 with expanded tools

The new group chat feature runs on GPT-5.1 Auto, a backend setting that chooses the optimal model based on the user’s subscription tier and the prompt.

Functionality such as search, image generation, file upload, and dictation is available inside group conversations.

Importantly, the system applies rate limits only when ChatGPT is producing responses. Direct messages between human users in the group do not count toward any plan’s message cap.

OpenAI has added new social features to ChatGPT in support of this group dynamic. The model can react with emojis, interpret conversational context to decide when to respond, and personalize generated content using members’ profile photos—such as inserting user likenesses into images when asked.

Privacy by default, controls for younger users

OpenAI emphasized that privacy and user control are integral to group chat design. The feature operates independently of the user’s personalized ChatGPT memory, and no new memories are created from these interactions.

Participation requires an invitation link, and members are always able to see who is in a chat or leave at any time.

Users under the age of 18 are automatically shielded from sensitive content in group chats. Parents or guardians can disable group chat access altogether via built-in parental controls.

Group creators retain special permissions, including immunity from being removed by others. All other participants can be added or removed by group members.

A testbed for shared AI experiences

OpenAI frames group chats as an early step toward richer, multi-user applications of AI, hinting at broader ambitions for ChatGPT as a shared workspace. The company expects to expand access over time and refine the feature based on how early users engage with it.

Keyan Zhang’s post suggests that the underlying model capabilities are far ahead of the interfaces users currently interact with. This pilot, in OpenAI’s view, offers a new “container” where more of the model’s latent capacity can be surfaced.

“Our models have a lot more room to shine than today’s experiences show, and the current containers only use a fraction of their capabilities,” Zhang said.

With this initial pilot focused on a limited set of markets, OpenAI is likely monitoring both usage patterns and cultural fit as it plans for broader deployment. For now, the group chat experiment offers a new way for users to interact with ChatGPT—and with each other—in real time, using a conversational interface that blends productivity and personalization.

Developer access: Still unclear

OpenAI has not provided any indication that Group Chats will be accessible via the API or SDK. The current rollout is framed strictly within the ChatGPT product environment, with no mention of tool calls, developer hooks, or integration support for programmatic use. This absence of signaling leaves it unclear whether the company views group interaction as a future developer primitive or as a contained UX feature for end users only.

For enterprise teams exploring how to replicate multi-user collaboration with generative models, any current implementation would require custom orchestration—such as managing multi-party context and prompts across separate API calls, and handling session state and response merging externally. Until OpenAI provides formal support, Group Chats remain a closed interface feature rather than a developer-accessible capability.

Here is a standalone concluding subsection tailored for the article, focusing on what the ChatGPT Group Chat rollout means for enterprise decision makers in both pilot regions and globally:

Implications for enterprise AI and data leaders

For enterprise teams already leveraging AI platforms—or preparing to—OpenAI’s group chat feature introduces a new layer of multi-user collaboration that could shift how generative models are deployed across workflows. While the pilot is limited to users in Japan, New Zealand, South Korea, and Taiwan, its design and roadmap offer key signals for AI engineers, orchestration specialists, and data leads globally.

AI engineers managing large language model (LLM) deployments can now begin to conceptualize real-time, multi-user interfaces not just as support tools, but as collaborative environments for research, content generation, and ideation. This adds another front in model tuning: not just how models respond to individuals, but how they behave in live group settings with context shifts and varied user intentions.

For AI orchestration leads, the ability to integrate ChatGPT into collaborative flows without exposing private memory or requiring custom builds may reduce friction in piloting generative AI in cross-functional teams. These group sessions could serve as lightweight alternatives to internal tools for brainstorming, prototyping, or knowledge sharing—useful for teams constrained by infrastructure, budget, or time.

Enterprise data managers may also find use cases in structured group chat sessions for data annotation, taxonomy validation, or internal training support. The system’s lack of memory persistence adds a level of data isolation that aligns with standard security and compliance practices—though global rollout will be key to validating regional data handling standards.

As group chat capabilities evolve, decision makers should monitor how shared usage patterns might inform future model behaviors, auditing needs, and governance structures. In the long term, features like these will influence not just how organizations interact with generative AI, but how they design team-level interfaces around it.

Kasım 15, 2025
Google’s new AI training method helps small models tackle complex reasoning

Researchers at Google Cloud and UCLA have proposed a new reinforcement learning framework that significantly improves the ability of language models to learn very challenging multi-step reasoning tasks. Supervised Reinforcement Learning (SRL) reformulates problem-solving as a sequence of logical “actions,” providing rich learning signals during the training process.

This approach enables smaller models to learn complex problems that were previously out of reach for other common training techniques. Experiments show that SRL not only excels on math reasoning benchmarks but also generalizes effectively to agentic software engineering tasks.

SRL is a versatile training framework that can elevate smaller and less expensive models to higher reasoning abilities.

The limits of current LLM reasoning training

Recent advances in training large language models (LLMs) for reasoning have largely been driven by reinforcement learning with verifiable rewards (RLVR), a method where a model is rewarded based on the correctness of its final answer. By repeatedly trying to solve problems and getting feedback on the final outcome, the model gradually learns effective problem-solving strategies.

However, the success of this outcome-based approach depends on the model's ability to discover a correct solution within a limited number of attempts, or "rollouts." Since each rollout is computationally expensive, models can't try indefinitely. This method hits a wall when problems are so difficult that the model rarely, if ever, finds the right answer within its budget.

This creates a critical learning bottleneck. In many multi-step reasoning problems, a model might correctly solve several steps but get derailed by a single mistake, leading to an incorrect answer. With RLVR, this entire effort receives a negative reward, and the model learns nothing from its partially correct work. It’s an all-or-nothing approach that fails to provide granular feedback and provides sparse rewards.

An alternative method is supervised fine-tuning (SFT), where the model learns from examples containing the full reasoning process laid out by experts. While SFT can instill reasoning abilities, it often leads to overfitting (the model simply learns to imitate the trajectories in the training data instead of learning to generalize to problems beyond the examples it has seen). This issue is made worse by the fact that high-quality, human-created training data is both scarce and expensive to produce.

As the paper notes, these limitations leave "a critical gap for training small open-source models to effectively learn difficult problems."

How supervised reinforcement learning works

SRL introduces a framework that reformulates problem-solving as a "sequential decision-making process," striking a balance between pure outcome-based RL and pure imitation learning. Instead of optimizing only for the final answer or forcing the model to imitate an expert's entire thought process, SRL teaches the model to reproduce a sequence of key actions that form the backbone of expert reasoning. This allows the model to learn to take actions similar to an expert while developing its own internal reasoning style.

In the SRL framework, expert demonstrations are broken down into a series of intermediate, concrete actions, each representing a meaningful step. For a math problem, an action might be an algebraic manipulation. For a software engineering agent, it could be a command executed in a code repository. To generate training data, SRL uses a powerful teacher model to create solution trajectories, which are then used to train a smaller model.

According to I-Hung Hsu, a research scientist at Google and co-author of the paper, this middle-ground approach is key to its effectiveness in real-world scenarios. "SRL sits in the middle: It captures the structured flexibility of real-world problem solving, where there are multiple valid strategies but also clear notions of what ‘good reasoning’ looks like at each step," Hsu told VentureBeat. "This makes SRL suitable for domains like data science automation or probably supply chain optimization — tasks that reward sound intermediate reasoning rather than mere final answers."

During training, the model first generates an "inner monologue" (its internal reasoning process, enclosed in <think> tags) before committing to an action. At each step, SRL provides a reward based on the similarity between the model's predicted action and the expert's action. This step-wise reward system provides dense, fine-grained feedback, allowing the model to learn and improve even if its overall solution isn't perfect. This solves the sparse reward problem RLVR faces.

SRL in action

The researchers' experiments show that SRL significantly outperforms strong baselines in both challenging mathematical reasoning and agentic software engineering benchmarks. They also observed that SRL encourages more flexible and sophisticated reasoning patterns in models, such as interleaved planning and self-verification, which improve solution quality without just making the outputs longer.

For enterprise leaders, performance gains are only valuable if they don't come with runaway costs. Hsu clarifies that SRL-trained models are more efficient in their reasoning. "The gains come from better reasoning quality and structure, not from verbosity," he said. "In terms of efficiency, SRL-trained models are roughly on par with the base model in token usage… while SRL isn’t designed to reduce inference cost, it achieves stronger reasoning performance without increasing it."

For the math tests, the team fine-tuned Qwen2.5-7B-Instruct on a dataset of 1,000 difficult math questions. They compared its performance against models trained with SFT and RLVR (using the GRPO algorithm common in models like DeepSeek-R1) on four competition-level math benchmarks. The SRL-trained model achieved a substantial 3.0% average performance boost over other methods.

The team extended SRL to agentic software engineering, a domain critical for enterprise automation. They trained a coding-specialized model, Qwen2.5-Coder-7B-Instruct, on 5,000 expert trajectories of agents interacting with a coding environment. The SRL-trained model was benchmarked against the original base model and SWE-Gym-7B, a strong baseline fine-tuned with SFT. SRL achieved a 14.8% task resolve rate, representing a 74% relative improvement over the SFT-based model. This shows SRL's ability to train more competent AI agents for complex, real-world programming tasks.

A new standard for high-stakes AI?

The paper's strongest results came from combining methods: First, using SRL to teach foundational reasoning, then using RLVR to refine that skill. In their experiments, when the researchers used SRL as a pre-training and applied RLVR in post-training, they observed a 3.7% average increase, demonstrating a powerful curriculum learning strategy.

This raises the question of whether this could become a new blueprint for building specialized AI.

"We view SRL as a strong foundation," Hsu said. "In a sense, SRL provides a curriculum — teaching models to think and act step by step — before we refine those behaviors with outcome-based reinforcement learning. This SRL-first approach not only stabilizes the later RL stage but also makes reasoning more interpretable and generalizable, which is critical for high-stakes applications."

Looking ahead, Hsu acknowledges that scaling this pipeline still faces challenges, particularly the high cost and complexity of end-to-end RLVR for agentic tasks. However, he is optimistic about the path forward. "While high-quality expert trajectories remain important," he concluded, "we think the next big leap will come from automating their generation and filtering — leveraging strong teacher models or even self-improving student models to bootstrap new data."

Kasım 15, 2025
Upwork study shows AI agents excel with human partners but fail independently

Artificial intelligence agents powered by the world's most advanced language models routinely fail to complete even straightforward professional tasks on their own, according to groundbreaking research released Thursday by Upwork, the largest online work marketplace.

But the same study reveals a more promising path forward: When AI agents collaborate with human experts, project completion rates surge by up to 70%, suggesting the future of work may not pit humans against machines but rather pair them together in powerful new ways.

The findings, drawn from more than 300 real client projects posted to Upwork's platform, marking the first systematic evaluation of how human expertise amplifies AI agent performance in actual professional work — not synthetic tests or academic simulations. The research challenges both the hype around fully autonomous AI agents and fears that such technology will imminently replace knowledge workers.

"AI agents aren't that agentic, meaning they aren't that good," Andrew Rabinovich, Upwork's chief technology officer and head of AI and machine learning, said in an exclusive interview with VentureBeat. "However, when paired with expert human professionals, project completion rates improve dramatically, supporting our firm belief that the future of work will be defined by humans and AI collaborating to get more work done, with human intuition and domain expertise playing a critical role."

How AI agents performed on 300+ real freelance jobs—and why they struggled

Upwork's Human+Agent Productivity Index (HAPI) evaluated how three leading AI systems — Gemini 2.5 Pro, OpenAI's GPT-5, and Claude Sonnet 4 — performed on actual jobs posted by paying clients across categories including writing, data science, web development, engineering, sales, and translation.

Critically, Upwork deliberately selected simple, well-defined projects where AI agents stood a reasonable chance of success. These jobs, priced under $500, represent less than 6% of Upwork's total gross services volume — a tiny fraction of the platform's overall business and an acknowledgment of current AI limitations.

"The reality is that although we study AI, and I've been doing this for 25 years, and we see significant breakthroughs, the reality is that these agents aren't that agentic," Rabinovich told VentureBeat. "So if we go up the value chain, the problems become so much more difficult, then we don't think they can solve them at all, even to scratch the surface. So we specifically chose simpler tasks that would give an agent some kind of traction."

Even on these deliberately simplified tasks, AI agents working independently struggled. But when expert freelancers provided feedback — spending an average of just 20 minutes per review cycle — the agents' performance improved substantially with each iteration.

20 minutes of human feedback boosted AI completion rates up to 70%

The research reveals stark differences in how AI agents perform with and without human guidance across different types of work. For data science and analytics projects, Claude Sonnet 4 achieved a 64% completion rate working alone but jumped to 93% after receiving feedback from a human expert. In sales and marketing work, Gemini 2.5 Pro's completion rate rose from 17% independently to 31% with human input. OpenAI's GPT-5 showed similarly dramatic improvements in engineering and architecture tasks, climbing from 30% to 50% completion.

The pattern held across virtually all categories, with agents responding particularly well to human feedback on qualitative, creative work requiring editorial judgment — areas like writing, translation, and marketing — where completion rates increased by up to 17 percentage points per feedback cycle.

The finding challenges a fundamental assumption in the AI industry: that agent benchmarks conducted in isolation accurately predict real-world performance.

"While we show that in the tasks that we have selected for agents to perform in isolation, they perform similarly to the previous results that we've seen published openly, what we've shown is that in collaboration with humans, the performance of these agents improves surprisingly well," Rabinovich said. "It's not just a one-turn back and forth, but the more feedback the human provides, the better the agent gets at performing."

Why ChatGPT can ace the SAT but can't count the R's in 'strawberry'

The research arrives as the AI industry grapples with a measurement crisis. Traditional benchmarks — standardized tests that AI models can master, sometimes scoring perfectly on SAT exams or mathematics olympiads — have proven poor predictors of real-world capability.

"With advances of large language models, what we're now seeing is that these static, academic datasets are completely saturated," Rabinovich said. "So you could get a perfect score in the SAT test or LSAT or any of the math olympiads, and then you would ask ChatGPT how many R's there are in the word strawberry, and it would get it wrong."

This phenomenon — where AI systems ace formal tests but stumble on trivial real-world questions — has led to growing skepticism about AI capabilities, even as companies race to deploy autonomous agents. Several recent benchmarks from other firms have tested AI agents on Upwork jobs, but those evaluations measured only isolated performance, not the collaborative potential that Upwork's research reveals.

"We wanted to evaluate the quality of these agents on actual real work with economic value associated with it, and not only see how well these agents do, but also see how these agents do in collaboration with humans, because we sort of knew already that in isolation, they're not that advanced," Rabinovich explained.

For Upwork, which connects roughly 800,000 active clients posting more than 3 million jobs annually to a global pool of freelancers, the research serves a strategic business purpose: establishing quality standards for AI agents before allowing them to compete or collaborate with human workers on its platform.

The economics of human-AI teamwork: Why paying for expert feedback still saves money

Despite requiring multiple rounds of human feedback — each lasting about 20 minutes — the time investment remains "orders of magnitude different between a human doing the work alone, versus a human doing the work with an AI agent," Rabinovich said. Where a project might take a freelancer days to complete independently, the agent-plus-human approach can deliver results in hours through iterative cycles of automated work and expert refinement.

The economic implications extend beyond simple time savings. Upwork recently reported that gross services volume from AI-related work grew 53% year-over-year in the third quarter of 2025, one of the strongest growth drivers for the company. But executives have been careful to frame AI not as a replacement for freelancers but as an enhancement to their capabilities.

"AI was a huge overhang for our valuation," Erica Gessert, Upwork's CFO, told CFO Brew in October. "There was this belief that all work was going to go away. AI was going to take it, and especially work that's done by people like freelancers, because they are impermanent. Actually, the opposite is true."

The company's strategy centers on enabling freelancers to handle more complex, higher-value work by offloading routine tasks to AI. "Freelancers actually prefer to have tools that automate the manual labor and repetitive part of their work, and really focus on the creative and conceptual part of the process," Rabinovich said.

Rather than replacing jobs, he argues, AI will transform them: "Simpler tasks will be automated by agents, but the jobs will become much more complex in the number of tasks, so the amount of work and therefore earnings for freelancers will actually only go up."

AI coding agents excel, but creative writing and translation still need humans

The research reveals a clear pattern in agent capabilities. AI systems perform best on "deterministic and verifiable" tasks with objectively correct answers, like solving math problems or writing basic code. "Most coding tasks are very similar to each other," Rabinovich noted. "That's why coding agents are becoming so good."

In Upwork's tests, web development, mobile app development, and data science projects — especially those involving structured, computational work — saw the highest standalone agent completion rates. Claude Sonnet 4 completed 68% of web development jobs and 64% of data science projects without human help, while Gemini 2.5 Pro achieved 74% on certain technical tasks.

But qualitative work proved far more challenging. When asked to create website layouts, write marketing copy, or translate content with appropriate cultural nuance, agents floundered without expert guidance. "When you ask it to write you a poem, the quality of the poem is extremely subjective," Rabinovich said. "Since the rubrics for evaluation were provided by humans, there's some level of variability in representation."

Writing, translation, and sales and marketing projects showed the most dramatic improvements from human feedback. For writing work, completion rates increased by up to 17 percentage points after expert review. Engineering and architecture projects requiring creative problem-solving — like civil engineering or architectural design — improved by as much as 23 percentage points with human oversight.

This pattern suggests AI agents excel at pattern matching and replication but struggle with creativity, judgment, and context — precisely the skills that define higher-value professional work.

Inside the research: How Upwork tested AI agents with peer-reviewed scientific methods

Upwork partnered with elite freelancers on its platform to evaluate every deliverable produced by AI agents, both independently and after each cycle of human feedback. These evaluators created detailed rubrics defining whether projects met core requirements specified in job descriptions, then scored outputs across multiple iterations.

Importantly, evaluators focused only on objective completion criteria, excluding subjective factors like stylistic preferences or quality judgments that might emerge in actual client relationships. "Rubric-based completion rates should not be viewed as a measure of whether an agent would be paid in a real marketplace setting," the research notes, "but as an indicator of its ability to fulfill explicitly defined requests."

This distinction matters: An AI agent might technically complete all specified requirements yet still produce work a client rejects as inadequate. Conversely, subjective client satisfaction — the true measure of marketplace success — remains beyond current measurement capabilities.

The research underwent double-blind peer review and was accepted to NeurIPS, the premier academic conference for AI research, where Upwork will present full results in early December. The company plans to publish a complete methodology and make the benchmark available to the research community, updating the task pool regularly to prevent overfitting as agents improve.

"The idea is for this benchmark to be a living and breathing platform where agents can come in and evaluate themselves on all categories of work, and the tasks that will be offered on the platform will always update, so that these agents don't overfit and basically memorize the tasks at hand," Rabinovich said.

Upwork's AI strategy: Building Uma, a 'meta-agent' that manages human and AI workers

The research directly informs Upwork's product roadmap as the company positions itself for what executives call "the age of AI and beyond." Rather than building its own AI agents to complete specific tasks, Upwork is developing Uma, a "meta orchestration agent" that coordinates between human workers, AI systems, and clients.

"Today, Upwork is a marketplace where clients look for freelancers to get work done, and then talent comes to Upwork to find work," Rabinovich explained. "This is getting expanded into a domain where clients come to Upwork, communicate with Uma, this meta-orchestration agent, and then Uma identifies the necessary talent to get the job done, gets the tasks outcomes completed, and then delivers that to the client."

In this vision, clients would interact primarily with Uma rather than directly hiring freelancers. The AI system would analyze project requirements, determine which tasks require human expertise versus AI execution, coordinate the workflow, and ensure quality — acting as an intelligent project manager rather than a replacement worker.

"We don't want to build agents that actually complete the tasks, but we are building this meta orchestration agent that figures out what human and agent talent is necessary in order to complete the tasks," Rabinovich said. "Uma evaluates the work to be delivered to the client, orchestrates the interaction between humans and agents, and is able to learn from all the interactions that happen on the platform how to break jobs into tasks so that they get completed in a timely and effective manner."

The company recently announced plans to open its first international office in Lisbon, Portugal, by the fourth quarter of 2026, with a focus on AI infrastructure development and technical hiring. The expansion follows Upwork's record-breaking third quarter, driven partly by AI-powered product innovation and strong demand for workers with AI skills.

OpenAI, Anthropic, and Google race to build autonomous agents—but reality lags hype

Upwork's findings arrive amid escalating competition in the AI agent space. OpenAI, Anthropic, Google, and numerous startups are racing to develop autonomous agents capable of complex multi-step tasks, from booking travel to analyzing financial data to writing software.

But recent high-profile stumbles have tempered initial enthusiasm. AI agents frequently misunderstand instructions, make logical errors, or produce confidently wrong results — a phenomenon researchers call "hallucination." The gap between controlled demonstration videos and reliable real-world performance remains vast.

"There have been some evaluations that came from OpenAI and other platforms where real Upwork tasks were considered for completion by agents, and across the board, the reported results were not very optimistic, in the sense that they showed that agents—even the best ones, meaning powered by most advanced LLMs — can't really compete with humans that well, because the completion rates are pretty low," Rabinovich said.

Rather than waiting for AI to fully mature — a timeline that remains uncertain—Upwork is betting on a hybrid approach that leverages AI's strengths (speed, scalability, pattern recognition) while retaining human strengths (judgment, creativity, contextual understanding).

This philosophy extends to learning and improvement. Current AI models train primarily on static datasets scraped from the internet, supplemented by human preference feedback. But most professional work is qualitative, making it difficult for AI systems to know whether their outputs are actually good without expert evaluation.

"Unless you have this collaboration between the human and the machine, where the human is kind of the teacher and the machine is the student trying to discover new solutions, none of this will be possible," Rabinovich said. "Upwork is very uniquely positioned to create such an environment because if you try to do this with, say, self-driving cars, and you tell Waymo cars to explore new ways of getting to the airport, like avoiding traffic signs, then a bunch of bad things will happen. In doing work on Upwork, if it creates a wrong website, it doesn't cost very much, and there's no negative side effects. But the opportunity to learn is absolutely tremendous."

Will AI take your job? The evidence suggests a more complicated answer

While much public discourse around AI focuses on job displacement, Rabinovich argues the historical pattern suggests otherwise — though the transition may prove disruptive.

"The narrative in the public is that AI is eliminating jobs, whether it's writing, translation, coding or other digital work, but no one really talks about the exponential amount of new types of work that it will create," he said. "When we invented electricity and steam engines and things like that, they certainly replaced certain jobs, but the amount of new jobs that were introduced is exponentially more, and we think the same is going to happen here."

The research identifies emerging job categories focused on AI oversight: designing effective human-machine workflows, providing high-quality feedback to improve agent performance, and verifying that AI-generated work meets quality standards. These skills—prompt engineering, agent supervision, output verification—barely existed two years ago but now command premium rates on platforms like Upwork.

"New types of skills from humans are becoming necessary in the form of how to design the interaction between humans and machines, how to guide agents to make them better, and ultimately, how to verify that whatever agentic proposals are being made are actually correct, because that's what's necessary in order to advance the state of AI," Rabinovich said.

The question remains whether this transition— from doing tasks to overseeing them — will create opportunities as quickly as it disrupts existing roles. For freelancers on Upwork, the answer may already be emerging in their bank accounts: The platform saw AI-related work grow 53% year-over-year, even as fears of AI-driven unemployment dominated headlines.

Kasım 14, 2025

Baidu unveils proprietary ERNIE 5 beating GPT-5 performance on charts, document understanding and more

Mere hours after OpenAI updated its flagship foundation model GPT-5 to GPT-5.1, promising reduced token usage overall and a more pleasant personality with more preset options, Chinese search giant Baidu unveiled its next-generation foundation model, ERNIE 5.0, alongside a suite of AI product upgrades and strategic international expansions.

The goal: to position as a global contender in the increasingly competitive enterprise AI market.

Announced at the company's Baidu World 2025 event, ERNIE 5.0 is a proprietary, natively omni-modal model designed to jointly process and generate content across text, images, audio, and video.

Unlike Baidu’s recently released ERNIE-4.5-VL-28B-A3B-Thinking, which is open source under an enterprise-friendly and permissive Apache 2.0 license, ERNIE 5.0 is a proprietary model and is available only via Baidu’s ERNIE Bot website (I needed to select it manuallyu from the model picker dropdown) and the Qianfan cloud platform application programming interface (API) for enterprise customers.

Alongside the model launch, Baidu introduced major updates to its digital human platform, no-code tools, and general-purpose AI agents — all targeted at expanding its AI footprint beyond China.

The company also introduced ERNIE 5.0 Preview 1022, a variant optimized for text-intensive tasks, alongside the general preview model that balances across modalities.

Baidu emphasized that ERNIE 5.0 represents a shift in how intelligence is deployed at scale, with CEO Robin Li stating: “When you internalize AI, it becomes a native capability and transforms intelligence from a cost into a source of productivity.”

Where ERNIE 5.0 outshines GPT-5 and Gemini 2.5 Pro

ERNIE 5.0’s benchmark results suggest that Baidu has achieved parity—or near-parity—with the top Western foundation models across a wide spectrum of tasks.

In public benchmark slides shared during the Baidu World 2025 event, ERNIE 5.0 Preview outperformed or matched OpenAI’s GPT-5-High and Google’s Gemini 2.5 Pro in multimodal reasoning, document understanding, and image-based QA, while also demonstrating strong language modeling and code execution abilities.

The company emphasized its ability to handle joint inputs and outputs across modalities, rather than relying on post-hoc modality fusion, which it framed as a technical differentiator.

On visual tasks, ERNIE 5.0 achieved leading scores on OCRBench, DocVQA, and ChartQA, three benchmarks that test document recognition, comprehension, and structured data reasoning.

Baidu claims the model beat both GPT-5-High and Gemini 2.5 Pro on these document and chart-based benchmarks, areas it describes as core to enterprise applications like automated document processing and financial analysis.

In image generation, ERNIE 5.0 tied or exceeded Google’s Veo3 across categories including semantic alignment and image quality, according to Baidu’s internal GenEval-based evaluation. Baidu claimed that the model’s multimodal integration allows it to generate and interpret visual content with greater contextual awareness than models relying on modality-specific encoders.

For audio and speech tasks, ERNIE 5.0 demonstrated competitive results on MM-AU and TUT2017 audio understanding benchmarks, as well as question answering from spoken language inputs. Its audio performance, while not as heavily emphasized as vision or text, suggests a broad capability footprint intended to support full-spectrum multimodal applications.

In language tasks, the model showed strong results on instruction following, factual question answering, and mathematical reasoning—core areas that define the enterprise utility of large language models.

The Preview 1022 variant of ERNIE 5.0, tailored for textual performance, showed even stronger language-specific results in early developer access. While Baidu does not claim broad superiority in general language reasoning, its internal evaluations suggest that ERNIE 5.0 Preview 1022 closes the gap with top-tier English-language models and outperforms them in Chinese-language performance.

While Baidu did not release full benchmark details or raw scores publicly, its performance positioning suggests a deliberate attempt to frame ERNIE 5.0 not as a niche multimodal system but as a flagship model competitive with the largest closed models in general-purpose reasoning.

Where Baidu claims a clear lead is in structured document understanding, visual chart reasoning, and integration of multiple modalities into a single, native modeling architecture. Independent verification of these results remains pending, but the breadth of claimed capabilities positions ERNIE 5.0 as a serious alternative in the multimodal foundation model landscape.

Enterprise Pricing Strategy

ERNIE 5.0 is positioned at the premium end of Baidu’s model pricing structure. The company has released specific pricing for API usage on its Qianfan platform, aligning the cost with other top-tier offerings from Chinese competitors like Alibaba.

Model	Input Cost (per 1K tokens)	Output Cost (per 1K tokens)	Source
ERNIE 5.0	$0.00085 (¥0.006)	$0.0034 (¥0.024)	Qianfan
ERNIE 4.5 Turbo (ex.)	$0.00011 (¥0.0008)	$0.00045 (¥0.0032)	Qianfan
Qwen3 (Coder ex.)	$0.00085 (¥0.006)	$0.0034 (¥0.024)	Qianfan

The contrast in cost between ERNIE 5.0 and earlier models such as ERNIE 4.5 Turbo underscores Baidu’s strategy to differentiate between high-volume, low-cost models and high-capability models designed for complex tasks and multimodal reasoning.

Compared to other U.S. alternatives, it remains mid-range in pricing:

Model	Input (/1 M tokens)	Output (/1 M tokens)	Source
GPT-5.1	$1.25	$10.00	OpenAI
ERNIE 5.0	$0.85	$3.40	Qianfan
ERNIE 4.5 Turbo (ex.)	$0.11	$0.45	Qianfan
Claude Opus 4.1	$15.00	$75.00	Anthropic
Gemini 2.5 Pro	$1.25 (≤200k) / $2.50 (>200k)	$10.00 (≤200k) / $15.00 (>200k)	Google Vertex AI Pricing
Grok 4 (grok-4-0709)	$3.00	$15.00	xAI API

Global Expansion: Products and Platforms

In tandem with the model release, Baidu is expanding internationally:

GenFlow 3.0, now with 20M+ users, is the company’s largest general-purpose AI agent and features enhanced memory and multimodal task handling.
Famou, a self-evolving agent capable of dynamically solving complex problems, is now commercially available via invite.
MeDo, the international version of Baidu’s no-code builder Miaoda, is live globally via medo.dev.
Oreate, a productivity workspace with document, slide, image, video, and podcast support, has reached over 1.2M users worldwide.

Baidu’s digital human platform, already rolled out in Brazil, is also part of the global push. According to company data, 83% of livestreamers during this year’s “Double 11” shopping event in China used Baidu’s digital human tech, contributing to a 91% increase in GMV.

Meanwhile, Baidu’s autonomous ride-hailing service Apollo Go has surpassed 17 million rides, operating driverless fleets in 22 cities and claiming the title of the world’s largest robotaxi network.

Open-Source Vision-Language Model Garners Industry Attention

Two days before the flagship ERNIE 5.0 event, Baidu also released an open-source multimodal model under the Apache 2.0 license: ERNIE-4.5-VL-28B-A3B-Thinking.

As reported by my colleague Michael Nuñez at VentureBeat, the model activates just 3 billion parameters while maintaining a total of 28 billion, using a Mixture-of-Experts (MoE) architecture for efficient inference.

Key technical innovations include:

“Thinking with Images”, which enables dynamic zoom-based visual analysis
Support for chart interpretation, document understanding, visual grounding, and temporal awareness in video
Runtime on a single 80GB GPU, making it accessible to mid-sized organizations
Full compatibility with Transformers, vLLM, and Baidu’s FastDeploy toolkits

This release adds pressure on closed-source competitors. With Apache 2.0 licensing, ERNIE-4.5-VL-28B-A3B-Thinking becomes a viable foundation model for commercial applications without licensing restrictions — something few high-performing models in this class offer.

Community Feedback and Baidu’s Response

Following the launch of ERNIE 5.0, developer and AI evaluator Lisan al Gaib (@scaling01) posted a mixed review on X. While initially impressed by the model’s benchmark performance, they reported a persistent issue where ERNIE 5.0 would repeatedly invoke tools — even when explicitly instructed not to — during SVG generation tasks.

“ERNIE 5.0 benchmarks looked insane until I tested it… unfortunately it’s RL braindamaged or they have a serious issue with their chat platform / system prompt,” Lisan wrote.

In a matter of hours, Baidu’s developer-focused support account, @ErnieforDevs, responded:

“Thanks for the feedback! It’s a known bug — certain syntax can consistently trigger it. We’re working on a fix. You can try rephrasing or changing the prompt to avoid it for now.”

The quick turnaround reflects Baidu’s increasing emphasis on developer communication, especially as it courts international users through both proprietary and open-source offerings.

Outlook for Baidu and its ERNIE foundational LLM family

Baidu’s ERNIE 5.0 marks a strategic escalation in the global foundation model race. With performance claims that put it on par with the most advanced systems from OpenAI and Google, and a mix of premium pricing and open-access alternatives, Baidu is signaling its ambition to become not just a domestic AI leader, but a credible global infrastructure provider.

At a time when enterprise AI users are increasingly demanding multimodal performance, flexible licensing, and deployment efficiency, Baidu’s two-track approach—premium hosted APIs and open-source releases—may broaden its appeal across both corporate and developer communities.

Whether the company’s performance claims hold up under third-party testing remains to be seen. But in a landscape shaped by rising costs, model complexity, and compute bottlenecks, ERNIE 5.0 and its supporting ecosystem give Baidu a competitive position in the next wave of AI deployment.

Kasım 14, 2025

How Deductive AI saved DoorDash 1,000 engineering hours by automating software debugging

As software systems grow more complex and AI tools generate code faster than ever, a fundamental problem is getting worse: Engineers are drowning in debugging work, spending up to half their time hunting down the causes of software failures instead of building new products. The challenge has become so acute that it's creating a new category of tooling — AI agents that can diagnose production failures in minutes instead of hours.

Deductive AI, a startup emerging from stealth mode Wednesday, believes it has found a solution by applying reinforcement learning — the same technology that powers game-playing AI systems — to the messy, high-stakes world of production software incidents. The company announced it has raised $7.5 million in seed funding led by CRV, with participation from Databricks Ventures, Thomvest Ventures, and PrimeSet, to commercialize what it calls "AI SRE agents" that can diagnose and help fix software failures at machine speed.

The pitch resonates with a growing frustration inside engineering organizations: Modern observability tools can show that something broke, but they rarely explain why. When a production system fails at 3 a.m., engineers still face hours of manual detective work, cross-referencing logs, metrics, deployment histories, and code changes across dozens of interconnected services to identify the root cause.

"The complexities and inter-dependencies of modern infrastructure means that investigating the root cause of an outage or incident can feel like searching for a needle in a haystack, except the haystack is the size of a football field, it's made of a million other needles, it's constantly reshuffling itself, and is on fire — and every second you don't find it equals lost revenue," said Sameer Agarwal, Deductive's co-founder and chief technology officer, in an exclusive interview with VentureBeat.

Deductive's system builds what the company calls a "knowledge graph" that maps relationships across codebases, telemetry data, engineering discussions, and internal documentation. When an incident occurs, multiple AI agents work together to form hypotheses, test them against live system evidence, and converge on a root cause — mimicking the investigative workflow of experienced site reliability engineers, but completing the process in minutes rather than hours.

The technology has already shown measurable impact at some of the world's most demanding production environments. DoorDash's advertising platform, which runs real-time auctions that must complete in under 100 milliseconds, has integrated Deductive into its incident response workflow. The company has set an ambitious 2026 goal of resolving production incidents within 10 minutes.

"Our Ads Platform operates at a pace where manual, slow-moving investigations are no longer viable. Every minute of downtime directly affects company revenue," said Shahrooz Ansari, Senior Director of Engineering at DoorDash, in an interview with VentureBeat. "Deductive has become a critical extension of our team, rapidly synthesizing signals across dozens of services and surfacing the insights that matter—within minutes."

DoorDash estimates that Deductive has root-caused approximately 100 production incidents over the past few months, translating to more than 1,000 hours of annual engineering productivity and a revenue impact "in millions of dollars," according to Ansari. At location intelligence company Foursquare, Deductive reduced the time to diagnose Apache Spark job failures by 90% —t urning a process that previously took hours or days into one that completes in under 10 minutes — while generating over $275,000 in annual savings.

Why AI-generated code is creating a debugging crisis

The timing of Deductive's launch reflects a brewing tension in software development: AI coding assistants are enabling engineers to generate code faster than ever, but the resulting software is often harder to understand and maintain.

"Vibe coding," a term popularized by AI researcher Andrej Karpathy, refers to using natural-language prompts to generate code through AI assistants. While these tools accelerate development, they can introduce what Agarwal describes as "redundancies, breaks in architectural boundaries, assumptions, or ignored design patterns" that accumulate over time.

"Most AI-generated code still introduces redundancies, breaks architectural boundaries, makes assumptions, or ignores established design patterns," Agarwal told Venturebeat. "In many ways, we now need AI to help clean up the mess that AI itself is creating."

The claim that engineers spend roughly half their time on debugging isn't hyperbole. The Association for Computing Machinery reports that developers spend 35% to 50% of their time validating and debugging software. More recently, Harness's State of Software Delivery 2025 report found that 67% of developers are spending more time debugging AI-generated code.

"We've seen world-class engineers spending half of their time debugging instead of building," said Rakesh Kothari, Deductive's co-founder and CEO. "And as vibe coding generates new code at a rate we've never seen, this problem is only going to get worse."

How Deductive's AI agents actually investigate production failures

Deductive's technical approach differs substantially from the AI features being added to existing observability platforms like Datadog or New Relic. Most of those systems use large language models to summarize data or identify correlations, but they lack what Agarwal calls "code-aware reasoning"—the ability to understand not just that something broke, but why the code behaves the way it does.

"Most enterprises use multiple observability tools across different teams and services, so no vendor has a single holistic view of how their systems behave, fail, and recover—nor are they able to pair that with an understanding of the code that defines system behavior," Agarwal explained. "These are key ingredients to resolving software incidents and it is exactly the gap Deductive fills."

The system connects to existing infrastructure using read-only API access to observability platforms, code repositories, incident management tools, and chat systems. It then continuously builds and updates its knowledge graph, mapping dependencies between services and tracking deployment histories.

When an alert fires, Deductive launches what the company describes as a multi-agent investigation. Different agents specialize in different aspects of the problem: one might analyze recent code changes, another examines trace data, while a third correlates the timing of the incident with recent deployments. The agents share findings and iteratively refine their hypotheses.

The critical difference from rule-based automation is Deductive's use of reinforcement learning. The system learns from every incident which investigative steps led to correct diagnoses and which were dead ends. When engineers provide feedback, the system incorporates that signal into its learning model.

"Each time it observes an investigation, it learns which steps, data sources, and decisions led to the right outcome," Agarwal said. "It learns how to think through problems, not just point them out."

At DoorDash, a recent latency spike in an API initially appeared to be an isolated service issue. Deductive's investigation revealed that the root cause was actually timeout errors from a downstream machine learning platform undergoing a deployment. The system connected these dots by analyzing log volumes, traces, and deployment metadata across multiple services.

"Without Deductive, our team would have had to manually correlate the latency spike across all logs, traces, and deployment histories," Ansari said. "Deductive was able to explain not just what changed, but how and why it impacted production behavior."

The company keeps humans in the loop—for now

While Deductive's technology could theoretically push fixes directly to production systems, the company has deliberately chosen to keep humans in the loop—at least for now.

"While our system is capable of deeper automation and could push fixes to production, currently, we recommend precise fixes and mitigations that engineers can review, validate, and apply," Agarwal said. "We believe maintaining a human in the loop is essential for trust, transparency and operational safety."

However, he acknowledged that "over time, we do think that deeper automation will come and how humans operate in the loop will evolve."

Databricks and ThoughtSpot veterans bet on reasoning over observability

The founding team brings deep expertise from building some of Silicon Valley's most successful data infrastructure platforms. Agarwal earned his Ph.D. at UC Berkeley, where he created BlinkDB, an influential system for approximate query processing. He was among the first engineers at Databricks, where he helped build Apache Spark. Kothari was an early engineer at ThoughtSpot, where he led teams focused on distributed query processing and large-scale system optimization.

The investor syndicate reflects both the technical credibility and market opportunity. Beyond CRV's Max Gazor, the round included participation from Ion Stoica, founder of Databricks and Anyscale; Ajeet Singh, founder of Nutanix and ThoughtSpot; and Ben Sigelman, founder of Lightstep.

Rather than competing with platforms like Datadog or PagerDuty, Deductive positions itself as a complementary layer that sits on top of existing tools. The pricing model reflects this: Instead of charging based on data volume, Deductive charges based on the number of incidents investigated, plus a base platform fee.

The company offers both cloud-hosted and self-hosted deployment options and emphasizes that it doesn't store customer data on its servers or use it to train models for other customers — a critical assurance given the proprietary nature of both code and production system behavior.

With fresh capital and early customer traction at companies like DoorDash, Foursquare, and Kumo AI, Deductive plans to expand its team and deepen the system's reasoning capabilities from reactive incident analysis to proactive prevention. The near-term vision: helping teams predict problems before they occur.

DoorDash's Ansari offers a pragmatic endorsement of where the technology stands today: "Investigations that were previously manual and time-consuming are now automated, allowing engineers to shift their energy toward prevention, business impact, and innovation."

In an industry where every second of downtime translates to lost revenue, that shift from firefighting to building increasingly looks less like a luxury and more like table stakes.

Kasım 13, 2025

Weibo’s new open source AI model VibeThinker-1.5B outperforms DeepSeek-R1 on $7,800 post-training budget

Another day in late 2025, another impressive result from a Chinese company in open source artificial intelligence.

Chinese social networking company Weibo's AI division recently released its open source VibeThinker-1.5B—a 1.5 billion parameter large language model (LLM) that is a fine-tuned variant of rival Chinese tech firm Alibaba's Qwen2.5-Math-1.5B.

It's available now for free download and usage by researchers and enterprise developers—even for commercial purposes—under a permissive MIT License on Hugging Face, GitHub and ModelScope, with a technical report on open access science publishing site arxiv.org.

And yet, despite its compact size, VibeThinker-1.5B achieves benchmark-topping reasoning performance on math and code tasks, rivaling or surpassing models hundreds of times its size, even outperforming Chinese rival DeepSeek's famed R1 that went viral at the start of this year—a 671-billion parameter model—on formal reasoning benchmark.

It further eclipses Mistral AI's Magistral Medium and holds its own against Anthropic's Claude Opus 4 and OpenAI's gpt-oss-20B Medium, all while requiring a fraction of the infrastructure and investment.

It also does so having been post-trained on a budget of merely $7800 USD for compute resources (3900 GPU hours on Nvidia H800s) — far less than the tens, or even hundreds, of thousands of dollars typically required to fine-tune models of similar or larger scale.

Recall this is not the total cost of the model's development, however: LLMs are trained in stages. First comes pre-training, when the model learns basic language structure and general knowledge by predicting the next word across enormous amounts of text from the internet, books, and articles. This gives it fluency but not much sense of how to follow instructions or hold a conversation

Post-training comes next, using much smaller, higher-quality datasets—typically collections of example questions, prompts, and expert-written answers—to teach the model how to respond helpfully, reason through problems, and align with human expectations. Still, Weibo's post-training cost effectiveness on VibeThinker-1.5B is noteworthy and should be commended.

The open-source release upends assumptions about parameter scale, compute intensity, and the minimum viable size for high-performance LLMs.

A Different Training Approach: Spectrum-to-Signal

VibeThinker-1.5B owes its performance not to scale, but to the training framework behind it: the Spectrum-to-Signal Principle (SSP).

Instead of optimizing a model purely for single-answer correctness (Pass@1), the SSP framework decouples supervised fine-tuning (SFT) and reinforcement learning (RL) into two distinct phases with different goals:

SFT (“Spectrum Phase”): The model is trained to maximize diversity across potential correct answers, improving its Pass@K score. This builds a wide range of plausible solution paths.
RL (“Signal Phase”): A second-stage reinforcement learning system (called MaxEnt-Guided Policy Optimization, or MGPO) is used to identify and amplify the most correct paths from this diverse solution pool. MGPO prioritizes problems where the model is most uncertain, using entropy-based weighting to focus learning.

The authors argue this separation allows small models to explore reasoning space more effectively—achieving signal amplification without relying on massive parameter counts.

VibeThinker-1.5B makes a compelling case that the industry’s reliance on parameter scaling as the only route to better reasoning performance may be outdated.

By adopting a diversity-first training pipeline, WeiboAI has shown that smaller, more accessible models can match and even outperform billion-dollar systems in logic-heavy tasks.

The low resource footprint is among the most significant aspects of VibeThinker-1.5B. At under $8,000, the post-training cost is 30–60x lower than models like DeepSeek R1 and MiniMax-M1, which cost between $294K and $535K to train.

Performance Across Domains

Despite its small size, VibeThinker-1.5B delivers cross-domain reasoning that outpaces many larger open-source and commercial models:

Model	AIME25	LiveCodeBench v6	GPQA-Diamond
VibeThinker-1.5B	74.4	51.1	46.7
GPT-OSS-20B-Medium	72.1	54.9	66.0
Claude Opus 4	69.2	56.6	79.6
MiniMax M1 (456B)	74.6	62.3	69.2
DeepSeek R1 (671B)	70.0	65.9	71.5
Kimi K2 (1.09T)	49.5	53.7	75.1

VibeThinker was benchmarked against both reasoning-centric models (Magistral, Claude, OpenAI o3-mini) and non-reasoning LLMs (GPT-4.1, Kimi K2, DeepSeek V3). Across structured reasoning benchmarks, the model consistently outperformed non-reasoning models, regardless of size:

On AIME24 (math), it beat Kimi K2 (1.09T) by over 10 points (80.3 vs. 69.6).
On LiveCodeBench v6, it surpassed Claude Opus 4 (51.1 vs. 47.4).
On GPQA, it scored below GPT-4.1 and Claude, but still doubled its base model (from 16.4 to 46.7).

This supports the authors’ claim that size is not the only path to reasoning capability—with proper training design, smaller models can reach or even exceed the performance of far larger systems in targeted tasks.

Notably, it achieves parity with models hundreds of times larger on math and code, though it lags behind in general knowledge reasoning (GPQA), where larger models maintain an edge.

This suggests a potential specialization trade-off: while VibeThinker excels at structured logical tasks, it has less capacity for wide-ranging encyclopedic recall, a known limitation of smaller architectures.

Guidance for Enterprise Adoption

The release includes recommended inference settings (temperature = 0.6, top_p = 0.95, max tokens = 40960).

The model is small enough to be deployed on edge devices, including mobile phones and vehicle-embedded systems, while inference costs are estimated to be 20–70x cheaper than with large models.

This positions VibeThinker-1.5B not just as a research achievement, but as a potential foundation for cost-efficient, locally deployable reasoning systems.

Weibo’s Strategy and Market Position

Weibo, launched by Sina Corporation in 2009, remains a cornerstone of China’s social media ecosystem. Often described as China’s version of X (formerly Twitter), the platform blends microblogging, multimedia content, and trending-topic features with a regulatory environment shaped by tight government oversight.

Despite counting 600 million monthly active users (more than twice that of X), investors are not optimistic about its advertising revenue growth potential in the near term, and Weibo is navigating intensifying competition from video-first platforms like Douyin, which are drawing younger users and increasing time-spent elsewhere.

In response, Weibo has leaned into creator-economy monetization, live-streaming, and vertical video—adding tools for influencer engagement, e-commerce integration, and richer analytics for brands.

The platform’s role as a digital public square also makes it a focus of regulatory scrutiny. Chinese authorities continue to apply pressure on issues ranging from content governance to data security. In September 2025, Weibo was among the platforms cited in official warnings, highlighting its ongoing exposure to policy risks.

Weibo’s push into AI R&D—exemplified by the release of VibeThinker-1.5B—signals a shift in ambition. Beyond being a media platform, Weibo is positioning itself as a player in the next phase of Chinese AI development, using its capital reserves, user behavior data, and in-house research capacity to pursue adjacent technical domains.

What It Means for Enterprise Technical Decision Makers

For engineering leaders and enterprise AI teams, VibeThinker’s release has practical implications for everything from orchestration pipelines to cost modeling.

A 1.5B-parameter model that outperforms 100x larger models on math and programming tasks doesn’t just save compute—it shifts the architectural balance. It enables LLM inference on constrained infrastructure, reduces latency at the edge, and lowers the barrier to entry for applications that otherwise would have required API access to closed, frontier-scale models.

That matters for enterprise ML leads trying to deploy reasoning-capable agents within existing systems, or for platform owners tasked with integrating LLMs into automated workflows.

It also speaks to those running reinforcement learning from human feedback (RLHF) pipelines or managing inference optimization across hybrid cloud environments.

The model’s post-training methodology—particularly its entropy-targeted reinforcement learning approach—offers a roadmap for teams looking to refine smaller checkpoints instead of relying on large-scale pretraining.

VibeThinker’s benchmark transparency and data decontamination steps also address another emerging priority in enterprise AI: auditability. While its performance on general-knowledge tests still trails large frontier models, its task-specific reliability makes it an attractive candidate for controlled environments where correctness matters more than coverage.

In short, VibeThinker-1.5B isn’t just a research milestone—it’s a strong candidate for practical enterprise use, deployment and learnings. It suggests that a new class of compact, reasoning-optimized models is viable for enterprise use cases that were previously the domain of far larger systems. For organizations trying to balance cost, latency, interpretability, and control, it’s a good new option to the long, growing list of Chinese open source offerings.

Kasım 13, 2025

Meta’s SPICE framework lets AI systems teach themselves to reason
Researchers at Meta FAIR and the National University of Singapore have developed a new reinforcement learning framework for self-improving AI systems.

Called Self-Play In Corpus Environments (SPICE), the framework pits two AI agents against each other, creating its own challenges and gradually improving without human supervision.

While currently a proof-of-concept, this self-play mechanism could provide a basis for future AI systems that can dynamically adapt to their environments, making them more robust against the unpredictability of real-world applications.

The challenge of self-improving AI

The goal of self-improving AI is to create systems that can enhance their capabilities by interacting with their environment.

A common approach is reinforcement learning with verifiable rewards (RLVR), where models are rewarded for providing the correct answers to problems. This is often limited by its reliance on human-curated problem sets and domain-specific reward engineering, which makes it difficult to scale.

Self-play, where a model improves by competing against itself, is another promising paradigm. But existing self-play methods for language models are often limited by two critical factors.
1. Factual errors in generated questions and answers compound, leading to a feedback loop of hallucinations.
2. When the problem generator and solver have information symmetry (i.e., share the same knowledge base) they fail to generate genuinely new challenges and fall into repetitive patterns.
As the researchers note in their paper, “These systematic empirical failures indicate that self-improvement requires interaction with an external source providing diverse, verifiable feedback, rather than closed-loop pure introspection.”

How SPICE works

SPICE is a self-play framework where a single model acts in two distinct roles.
- A "Challenger" constructs a curriculum of challenging problems from a large corpus of documents.
- A "Reasoner" then attempts to solve these problems without access to the source documents.
This setup breaks the information symmetry that limits other self-play methods, as the Reasoner does not have access to the documents and knowledge that the Challenger uses to generate the problems.

Grounding the tasks in a vast and diverse corpus of documents prevents hallucination by anchoring questions and answers in real-world content. This is important because for AI systems to reliably self-improve, they need external grounding sources. Therefore, LLM agents should learn from interactions with humans and the real world, not just their own outputs, to avoid compounding errors.

The adversarial dynamic between the two roles creates an automatic curriculum.

The Challenger is rewarded for generating problems that are both diverse and at the frontier of the Reasoner's capability (not too easy and also not impossible).

The Reasoner is rewarded for answering correctly. This symbiotic interaction pushes both agents to continuously discover and overcome new challenges.

Because the system uses raw documents instead of pre-defined question-answer pairs, it can generate diverse task formats, such as multiple-choice and free-form questions.

This flexibility allows SPICE to be applied to any domain, breaking the bottleneck that has confined previous methods to narrow fields like math and code. It also reduces dependence on expensive human-curated datasets for specialized domains like legal or medical analysis.

SPICE in action

The researchers evaluated SPICE on several base models, including Qwen3-4B-Base and OctoThinker-3B-Hybrid-Base.

They compared its performance against baselines such as the base model with no training, a Reasoner model trained with a fixed "Strong Challenger" (Qwen3-32B-Instruct), and pure self-play methods like R-Zero and Absolute Zero. The evaluation covered a wide range of mathematical and general reasoning benchmarks.

Across all models, SPICE consistently outperformed the baselines, delivering significant improvements in both mathematical and general reasoning tasks.

The results show that the reasoning capabilities developed through corpus-grounded self-play transfer broadly across different models, thanks to the diverse external knowledge corpus they used.

A key finding is that the adversarial dynamic creates an effective automatic curriculum. As training progresses, the Challenger learns to generate increasingly difficult problems.

In one experiment, the Reasoner's pass rate on a fixed set of problems increased from 55% to 85% over time, showing its improved capabilities.

Meanwhile, later versions of the Challenger were able to generate questions that dropped the pass rate of an early-stage Reasoner from 55% to 35%, confirming that both roles co-evolve successfully.

The researchers conclude that this approach presents a paradigm shift in self-improving reasoning methods from “closed-loop self-play that often stagnates due to hallucination drift, to open-ended improvement through interaction with the vast, verifiable knowledge embedded in web document corpora.”

Currently, the corpus used for SPICE represents human experience captured in text. The ultimate goal is for self-improving systems to generate questions based on interactions with reality, including the physical world, the internet, and human interactions across multiple modalities like video, audio, and sensor data.
Kasım 12, 2025
Baidu just dropped an open-source multimodal AI that it claims beats GPT-5 and Gemini

Baidu Inc., China's largest search engine company, released a new artificial intelligence model on Monday that its developers claim outperforms competitors from Google and OpenAI on several vision-related benchmarks despite using a fraction of the computing resources typically required for such systems.

The model, dubbed ERNIE-4.5-VL-28B-A3B-Thinking, is the latest salvo in an escalating competition among technology companies to build AI systems that can understand and reason about images, videos, and documents alongside traditional text — capabilities increasingly critical for enterprise applications ranging from automated document processing to industrial quality control.

What sets Baidu's release apart is its efficiency: the model activates just 3 billion parameters during operation while maintaining 28 billion total parameters through a sophisticated routing architecture. According to documentation released with the model, this design allows it to match or exceed the performance of much larger competing systems on tasks involving document understanding, chart analysis, and visual reasoning while consuming significantly less computational power and memory.

"Built upon the powerful ERNIE-4.5-VL-28B-A3B architecture, the newly upgraded ERNIE-4.5-VL-28B-A3B-Thinking achieves a remarkable leap forward in multimodal reasoning capabilities," Baidu wrote in the model's technical documentation on Hugging Face, the AI model repository where the system was released.

The company said the model underwent "an extensive mid-training phase" that incorporated "a vast and highly diverse corpus of premium visual-language reasoning data," dramatically boosting its ability to align visual and textual information semantically.

How the model mimics human visual problem-solving through dynamic image analysis

Perhaps the model's most distinctive feature is what Baidu calls "Thinking with Images" — a capability that allows the AI to dynamically zoom in and out of images to examine fine-grained details, mimicking how humans approach visual problem-solving tasks.

"The model thinks like a human, capable of freely zooming in and out of images to grasp every detail and uncover all information," according to the model card. When paired with tools like image search, Baidu claims this feature "dramatically elevates the model's ability to process fine-grained details and handle long-tail visual knowledge."

This approach marks a departure from traditional vision-language models, which typically process images at a fixed resolution. By allowing dynamic image examination, the system can theoretically handle scenarios requiring both broad context and granular detail—such as analyzing complex technical diagrams or detecting subtle defects in manufacturing quality control.

The model also supports what Baidu describes as enhanced "visual grounding" capabilities with "more precise grounding and flexible instruction execution, easily triggering grounding functions in complex industrial scenarios," suggesting potential applications in robotics, warehouse automation, and other settings where AI systems must identify and locate specific objects in visual scenes.

Baidu's performance claims draw scrutiny as independent testing remains pending

Baidu's assertion that the model outperforms Google's Gemini 2.5 Pro and OpenAI's GPT-5-High on various document and chart understanding benchmarks has drawn attention across social media, though independent verification of these claims remains pending.

The company released the model under the permissive Apache 2.0 license, allowing unrestricted commercial use—a strategic decision that contrasts with the more restrictive licensing approaches of some competitors and could accelerate enterprise adoption.

"Apache 2.0 is smart," wrote one X user responding to Baidu's announcement, highlighting the competitive advantage of open licensing in the enterprise market.

According to Baidu's documentation, the model demonstrates six core capabilities beyond traditional text processing. In visual reasoning, the system can perform what Baidu describes as "multi-step reasoning, chart analysis, and causal reasoning capabilities in complex visual tasks," aided by what the company characterizes as "large-scale reinforcement learning."

For STEM problem solving, Baidu claims that "leveraging its powerful visual abilities, the model achieves a leap in performance on STEM tasks like solving problems from photos." The visual grounding capability allows the model to identify and locate objects within images with what Baidu characterizes as industrial-grade precision. Through tool integration, the system can invoke external functions including image search capabilities to access information beyond its training data.

For video understanding, Baidu claims the model possesses "outstanding temporal awareness and event localization abilities, accurately identifying content changes across different time segments in a video." Finally, the thinking with images feature enables the dynamic zoom functionality that distinguishes this model from competitors.

Inside the mixture-of-experts architecture that powers efficient multimodal processing

Under the hood, ERNIE-4.5-VL-28B-A3B-Thinking employs a Mixture-of-Experts (MoE) architecture — a design pattern that has become increasingly popular for building efficient large-scale AI systems. Rather than activating all 28 billion parameters for every task, the model uses a routing mechanism to selectively activate only the 3 billion parameters most relevant to each specific input.

This approach offers substantial practical advantages for enterprise deployments. According to Baidu's documentation, the model can run on a single 80GB GPU — hardware readily available in many corporate data centers — making it significantly more accessible than competing systems that may require multiple high-end accelerators.

The technical documentation reveals that Baidu employed several advanced training techniques to achieve the model's capabilities. The company used "cutting-edge multimodal reinforcement learning techniques on verifiable tasks, integrating GSPO and IcePop strategies to stabilize MoE training combined with dynamic difficulty sampling for exceptional learning efficiency."

Baidu also notes that in response to "strong community demand," the company "significantly strengthened the model's grounding performance with improved instruction-following capabilities."

The new model fits into Baidu's ambitious multimodal AI ecosystem

The new release is one component of Baidu's broader ERNIE 4.5 model family, which the company unveiled in June 2025. That family comprises 10 distinct variants, including Mixture-of-Experts models ranging from the flagship ERNIE-4.5-VL-424B-A47B with 424 billion total parameters down to a compact 0.3 billion parameter dense model.

According to Baidu's technical report on the ERNIE 4.5 family, the models incorporate "a novel heterogeneous modality structure, which supports parameter sharing across modalities while also allowing dedicated parameters for each individual modality."

This architectural choice addresses a longstanding challenge in multimodal AI development: training systems on both visual and textual data without one modality degrading the performance of the other. Baidu claims this design "has the advantage to enhance multimodal understanding without compromising, and even improving, performance on text-related tasks."

The company reported achieving 47% Model FLOPs Utilization (MFU) — a measure of training efficiency — during pre-training of its largest ERNIE 4.5 language model, using the PaddlePaddle deep learning framework developed in-house.

Comprehensive developer tools aim to simplify enterprise deployment and integration

For organizations looking to deploy the model, Baidu has released a comprehensive suite of development tools through ERNIEKit, what the company describes as an "industrial-grade training and compression development toolkit."

The model offers full compatibility with popular open-source frameworks including Hugging Face Transformers, vLLM (a high-performance inference engine), and Baidu's own FastDeploy toolkit. This multi-platform support could prove critical for enterprise adoption, allowing organizations to integrate the model into existing AI infrastructure without wholesale platform changes.

Sample code released by Baidu shows a relatively straightforward implementation path. Using the Transformers library, developers can load and run the model with approximately 30 lines of Python code, according to the documentation on Hugging Face.

For production deployments requiring higher throughput, Baidu provides vLLM integration with specialized support for the model's "reasoning-parser" and "tool-call-parser" capabilities — features that enable the dynamic image examination and external tool integration that distinguish this model from earlier systems.

The company also offers FastDeploy, a proprietary inference toolkit that Baidu claims delivers "production-ready, easy-to-use multi-hardware deployment solutions" with support for various quantization schemes that can reduce memory requirements and increase inference speed.

Why this release matters for the enterprise AI market at a critical inflection point

The release comes at a pivotal moment in the enterprise AI market. As organizations move beyond experimental chatbot deployments toward production systems that process documents, analyze visual data, and automate complex workflows, demand for capable and cost-effective vision-language models has intensified.

Several enterprise use cases appear particularly well-suited to the model's capabilities. Document processing — extracting information from invoices, contracts, and forms — represents a massive market where accurate chart and table understanding directly translates to cost savings through automation. Manufacturing quality control, where AI systems must detect visual defects, could benefit from the model's grounding capabilities. Customer service applications that handle images from users could leverage the multi-step visual reasoning.

The model's efficiency profile may prove especially attractive to mid-market organizations and startups that lack the computing budgets of large technology companies. By fitting on a single 80GB GPU — hardware costing roughly $10,000 to $30,000 depending on the specific model — the system becomes economically viable for a much broader range of organizations than models requiring multi-GPU setups costing hundreds of thousands of dollars.

"With all these new models, where's the best place to actually build and scale? Access to compute is everything," wrote one X user in response to Baidu's announcement, highlighting the persistent infrastructure challenges facing organizations attempting to deploy advanced AI systems.

The Apache 2.0 licensing further lowers barriers to adoption. Unlike models released under more restrictive licenses that may limit commercial use or require revenue sharing, organizations can deploy ERNIE-4.5-VL-28B-A3B-Thinking in production applications without ongoing licensing fees or usage restrictions.

Competition intensifies as Chinese tech giant takes aim at Google and OpenAI

Baidu's release intensifies competition in the vision-language model space, where Google, OpenAI, Anthropic, and Chinese companies including Alibaba and ByteDance have all released capable systems in recent months.

The company's performance claims — if validated by independent testing — would represent a significant achievement. Google's Gemini 2.5 Pro and OpenAI's GPT-5-High are substantially larger models backed by the deep resources of two of the world's most valuable technology companies. That a more compact, openly available model could match or exceed their performance on specific tasks would suggest the field is advancing more rapidly than some analysts anticipated.

"Impressive that ERNIE is outperforming Gemini 2.5 Pro," wrote one social media commenter, expressing surprise at the claimed results.

However, some observers counseled caution about benchmark comparisons. "It's fascinating to see how multimodal models are evolving, especially with features like 'Thinking with Images,'" wrote one X user. "That said, I'm curious if ERNIE-4.5's edge over competitors like Gemini-2.5-Pro and GPT-5-High primarily lies in specific use cases like document and chart" understanding rather than general-purpose vision tasks.

Industry analysts note that benchmark performance often fails to capture real-world behavior across the diverse scenarios enterprises encounter. A model that excels at document understanding may struggle with creative visual tasks or real-time video analysis. Organizations evaluating these systems typically conduct extensive internal testing on representative workloads before committing to production deployments.

Technical limitations and infrastructure requirements that enterprises must consider

Despite its capabilities, the model faces several technical challenges common to large vision-language systems. The minimum requirement of 80GB of GPU memory, while more accessible than some competitors, still represents a significant infrastructure investment. Organizations without existing GPU infrastructure would need to procure specialized hardware or rely on cloud computing services, introducing ongoing operational costs.

The model's context window — the amount of text and visual information it can process simultaneously — is listed as 128K tokens in Baidu's documentation. While substantial, this may prove limiting for some document processing scenarios involving very long technical manuals or extensive video content.

Questions also remain about the model's behavior on adversarial inputs, out-of-distribution data, and edge cases. Baidu's documentation does not provide detailed information about safety testing, bias mitigation, or failure modes — considerations increasingly important for enterprise deployments where errors could have financial or safety implications.

What technical decision-makers need to evaluate beyond the benchmark numbers

For technical decision-makers evaluating the model, several implementation factors warrant consideration beyond raw performance metrics.

The model's MoE architecture, while efficient during inference, adds complexity to deployment and optimization. Organizations must ensure their infrastructure can properly route inputs to the appropriate expert subnetworks — a capability not universally supported across all deployment platforms.

The "Thinking with Images" feature, while innovative, requires integration with image manipulation tools to achieve its full potential. Baidu's documentation suggests this capability works best "when paired with tools like image zooming and image search," implying that organizations may need to build additional infrastructure to fully leverage this functionality.

The model's video understanding capabilities, while highlighted in marketing materials, come with practical constraints. Processing video requires substantially more computational resources than static images, and the documentation does not specify maximum video length or optimal frame rates.

Organizations considering deployment should also evaluate Baidu's ongoing commitment to the model. Open-source AI models require continuing maintenance, security updates, and potential retraining as data distributions shift over time. While the Apache 2.0 license ensures the model remains available, future improvements and support depend on Baidu's strategic priorities.

Developer community responds with enthusiasm tempered by practical requests

Early response from the AI research and development community has been cautiously optimistic. Developers have requested versions of the model in additional formats including GGUF (a quantization format popular for local deployment) and MNN (a mobile neural network framework), suggesting interest in running the system on resource-constrained devices.

"Release MNN and GGUF so I can run it on my phone," wrote one developer, highlighting demand for mobile deployment options.

Other developers praised Baidu's technical choices while requesting additional resources. "Fantastic model! Did you use discoveries from PaddleOCR?" asked one user, referencing Baidu's open-source optical character recognition toolkit.

The model's lengthy name—ERNIE-4.5-VL-28B-A3B-Thinking—drew lighthearted commentary. "ERNIE-4.5-VL-28B-A3B-Thinking might be the longest model name in history," joked one observer. "But hey, if you're outperforming Gemini-2.5-Pro with only 3B active params, you've earned the right to a dramatic name!"

Baidu plans to showcase the ERNIE lineup during its Baidu World 2025 conference on November 13, where the company is expected to provide additional details about the model's development, performance validation, and future roadmap.

The release marks a strategic move by Baidu to establish itself as a major player in the global AI infrastructure market. While Chinese AI companies have historically focused primarily on domestic markets, the open-source release under a permissive license signals ambitions to compete internationally with Western AI giants.

For enterprises, the release adds another capable option to a rapidly expanding menu of AI models. Organizations no longer face a binary choice between building proprietary systems or licensing closed-source models from a handful of vendors. The proliferation of capable open-source alternatives like ERNIE-4.5-VL-28B-A3B-Thinking is reshaping the economics of AI deployment and accelerating adoption across industries.

Whether the model delivers on its performance promises in real-world deployments remains to be seen. But for organizations seeking powerful, cost-effective tools for visual understanding and reasoning, one thing is certain. As one developer succinctly summarized: "Open source plus commercial use equals chef's kiss. Baidu not playing around."

Kasım 12, 2025

Blog

Prediction 1: The missing unicorn

Prediction 2: Vectors alone won’t cut it

Prediction 3: A crowded field becomes commoditized

The new reality: Hybrid and GraphRAG

Benchmarks and evidence

What this means going forward

Looking ahead: What’s next

From shiny objects to essential infrastructure

Why your human-centric IAM is a sitting duck

Three pillars of a scalable agent security architecture

A practical roadmap to get started

The bottom line

Collaborative functionality integrated into ChatGPT

Powered by GPT-5.1 with expanded tools

Privacy by default, controls for younger users

A testbed for shared AI experiences

Developer access: Still unclear

Implications for enterprise AI and data leaders

The limits of current LLM reasoning training

How supervised reinforcement learning works

SRL in action

A new standard for high-stakes AI?

How AI agents performed on 300+ real freelance jobs—and why they struggled

20 minutes of human feedback boosted AI completion rates up to 70%

Why ChatGPT can ace the SAT but can't count the R's in 'strawberry'

The economics of human-AI teamwork: Why paying for expert feedback still saves money

AI coding agents excel, but creative writing and translation still need humans

Inside the research: How Upwork tested AI agents with peer-reviewed scientific methods

Upwork's AI strategy: Building Uma, a 'meta-agent' that manages human and AI workers

OpenAI, Anthropic, and Google race to build autonomous agents—but reality lags hype

Will AI take your job? The evidence suggests a more complicated answer

Where ERNIE 5.0 outshines GPT-5 and Gemini 2.5 Pro

Enterprise Pricing Strategy

Global Expansion: Products and Platforms

Open-Source Vision-Language Model Garners Industry Attention

Community Feedback and Baidu’s Response

Outlook for Baidu and its ERNIE foundational LLM family

Why AI-generated code is creating a debugging crisis

How Deductive's AI agents actually investigate production failures

The company keeps humans in the loop—for now

Databricks and ThoughtSpot veterans bet on reasoning over observability

A Different Training Approach: Spectrum-to-Signal

Performance Across Domains

Guidance for Enterprise Adoption

Weibo’s Strategy and Market Position

What It Means for Enterprise Technical Decision Makers

The challenge of self-improving AI

How SPICE works

SPICE in action

How the model mimics human visual problem-solving through dynamic image analysis

Baidu's performance claims draw scrutiny as independent testing remains pending

Inside the mixture-of-experts architecture that powers efficient multimodal processing

The new model fits into Baidu's ambitious multimodal AI ecosystem

Comprehensive developer tools aim to simplify enterprise deployment and integration

Why this release matters for the enterprise AI market at a critical inflection point

Competition intensifies as Chinese tech giant takes aim at Google and OpenAI

Technical limitations and infrastructure requirements that enterprises must consider

What technical decision-makers need to evaluate beyond the benchmark numbers

Developer community responds with enthusiasm tempered by practical requests