Blog

  • Design in the age of AI: How small businesses are building big brands faster

    Presented by Design.com


    For most of history, design was the last step in starting a business — something entrepreneurs invested in once the idea was proven. Today, it’s one of the first. The rise of generative AI has shifted how small businesses imagine, launch, and grow — turning what used to be a months-long creative process into something interactive, iterative, and accessible from day one.

    Search data tells the story. Since 2022, global interest in “AI business name generator” has surged more than 700%. Searches for “AI logo generator” are up 1,200%, and “AI website generator” 1,600%. Small businesses aren’t waiting for enterprise AI trickle-down. They’re adopting these tools en masse to move faster from concept to brand identity.

    “The appetite for AI-powered design has been extraordinary,” says Alec Lynch, founder and CEO of Design.com. “Entrepreneurs are realizing they can bring their ideas to life immediately — they don’t have to wait for funding, agencies, or a full creative team. They can start now.”

    The democratization of design power

    For decades, small businesses were boxed out of high-end design. Building a brand required deep pockets and specialized talent. AI has redrawn that map.

    Large language models and image generators now act as collaborative partners — sparking ideas, testing directions, and handling tedious layout and copy work. For founders, that means fewer barriers and faster iteration.

    Instead of hiring separate agencies for naming, logo design, and web development, small businesses are turning to unified AI platforms that handle the full early-stage design stack. Tools like Design.com merge naming, logo creation, and website generation into a single workflow — turning an entrepreneur’s first sketch into a polished brand system within minutes.

    “AI isn’t replacing creativity,” Lynch adds. “It’s giving people the confidence to express it.”

    The five frontiers of AI-powered entrepreneurship

    Today’s AI tools mirror the creative journey every founder takes — from naming a business to sharing it with the world. The five fastest-growing design categories on Google reflect each stage of that journey.

    1. Naming: From idea to identity

    AI naming tools do more than spit out clever words — they help founders discover their voice. A good generator blends tone, personality, and domain availability so the result feels like a fit, not a random suggestion.

    2. Logos: From visuals to meaning

    Logo creation is one of the most emotionally resonant steps in brand-building. AI has turned it into a playground for experimentation. Entrepreneurs can test dozens of looks and get instant feedback.

    3. Websites: From static pages to adaptive brands

    The surge in “AI website generator” searches signals a deeper shift. Websites are no longer static brochures; they’re dynamic brand environments. AI-driven builders now create layouts, headlines, and imagery that adapt to a company’s tone and focus — drastically reducing time to launch.

    4. Business cards and brand collateral

    Even in a digital age, tangible touchpoints matter. AI-generated business cards give founders an immediate sense of legitimacy while ensuring design consistency across brand assets.

    5. Presentations: From slides to storytelling

    Founders aren’t just designing assets; they’re designing narratives. Generative AI turns bullet points into persuasive visual stories — raising the quality of pitches, decks, and demos to a level once out of reach for most small teams.

    Together, these five frontiers show that small businesses aren’t just using AI to look more polished — they’re using it to think more strategically about brand, story, and customer experience from the start.

    The new design ecosystem

    Behind the surge in AI design tools lies a broader ecosystem shift. Companies like Canva and Wix made design accessible; the current wave — led by AI-native platforms like Design.com — is more personal and adaptive.

    Unlike templated platforms, these tools understand context. A restaurant founder and a SaaS startup will get not just different visuals, but different copy tones, typography systems, and user flows — automatically.

    “What we’re seeing,” Lynch explains, “isn’t just growth in one product category. It’s a movement toward connected creativity — where every part of the brand experience learns from every other.”

    From AI tools to AI brand systems

    The next evolution of small-business design won’t be about single-purpose tools. It will be about connected systems that share data, context, and creative intent across every brand touchpoint.

    Imagine naming a company and watching an AI instantly generate a logo, color palette, and homepage layout that all reflect the same personality. As your audience grows, the same system helps you update your visual identity or tone to match new goals — while preserving your original DNA.

    That’s the future Design.com and others are building toward: intelligent brand ecosystems that evolve alongside their founders.

    “AI design tools are giving small businesses superpowers,” Lynch says. “They’re removing friction from creativity.”

    And that frictionless design process is quietly rewriting what entrepreneurship looks like. The ability to create, iterate, and launch in hours instead of months is changing the tempo of business itself — and redefining what it means to be a designer in the age of AI.


    Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.

  • Booking.com’s agent strategy: Disciplined, modular and already delivering 2× accuracy

    When many enterprises weren’t even thinking about agentic behaviors or infrastructures, Booking.com had already “stumbled” into them with its homegrown conversational recommendation system.

    This early experimentation has allowed the company to take a step back and avoid getting swept up in the frantic AI agent hype. Instead, it is taking a disciplined, layered, modular approach to model development: small, travel-specific models for cheap, fast inference; full-size large language models (LLMs) for reasoning and understanding; and domain-tuned evaluations built in-house when precision is critical.

    With this hybrid strategy — combined with selective collaboration with OpenAI — Booking.com has seen accuracy double across key retrieval, ranking and customer-interaction tasks.

    As Pranav Pathak, Booking.com’s AI product development lead, posed to VentureBeat in a new podcast: “Do you build it very, very specialized and bespoke and then have an army of a hundred agents? Or do you keep it general enough and have five agents that are good at generalized tasks, but then you have to orchestrate a lot around them? That's a balance that I think we're still trying to figure out, as is the rest of the industry.”

    Check out the new Beyond the Pilot podcast here, and continue reading for highlights.

    Moving from guessing to deep personalization without being ‘creepy’

    Recommendation systems are core to Booking.com’s customer-facing platforms; however, traditional recommendation tools have been less about recommendation and more about guessing, Pathak conceded. So, from the start, he and his team vowed to avoid generic tools: As he put it, the price and recommendation should be based on customer context.

    Booking.com’s initial pre-gen AI tooling for intent and topic detection was a small language model, what Pathak described as “the scale and size of BERT.” The model ingested the customer’s inputs describing their problem to determine whether it could be solved through self-service or needed to be bumped to a human agent.

    “We started with an architecture of ‘you have to call a tool if this is the intent you detect and this is how you've parsed the structure,’” Pathak explained. “That was very, very similar to the first few agentic architectures that came out in terms of reason and defining a tool call.”

    His team has since built out that architecture to include an LLM orchestrator that classifies queries, triggers retrieval-augmented generation (RAG) and calls APIs or smaller, specialized language models. “We've been able to scale that system quite well because it was so close in architecture that, with a few tweaks, we now have a full agentic stack,” said Pathak.
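
    In rough pseudocode, that pattern looks like the sketch below: a minimal illustration of classify-then-route, not Booking.com's actual stack. The intents, tools and routing targets are invented stand-ins.

    ```python
    from dataclasses import dataclass

    @dataclass
    class Intent:
        topic: str             # e.g. "refund", "policy", "other"
        needs_retrieval: bool  # does the answer need grounded documents?
        tool: str | None       # structured API to call, if one matches

    def classify_intent(query: str) -> Intent:
        # Stand-in for the LLM orchestrator's classification step.
        q = query.lower()
        if "refund" in q:
            return Intent("refund", needs_retrieval=False, tool="refund_api")
        if "policy" in q:
            return Intent("policy", needs_retrieval=True, tool=None)
        return Intent("other", needs_retrieval=False, tool=None)

    def handle_query(query: str) -> str:
        intent = classify_intent(query)
        if intent.needs_retrieval:
            return f"[RAG answer grounded in docs about {intent.topic}]"
        if intent.tool:
            return f"[structured call to {intent.tool}]"
        # Everything else routes to a cheap, fast, travel-specific small model.
        return f"[small-model response for topic: {intent.topic}]"

    print(handle_query("What is the cancellation policy?"))
    ```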

    As a result, Booking.com is seeing a 2X increase in topic detection accuracy, which in turn is freeing up human agents’ bandwidth by 1.5 to 1.7X. More topics, even complicated ones previously identified as ‘other’ and requiring escalation, are being automated.

    Ultimately, this supports more self-service, freeing human agents to focus on customers with uniquely specific problems that the platform doesn’t have a dedicated tool flow for — say, a family that is unable to access its hotel room at 2 a.m. when the front desk is closed.

    That not only “really starts to compound,” but has a direct, long-term impact on customer retention, Pathak noted. “One of the things we've seen is, the better we are at customer service, the more loyal our customers are.”

    Another recent rollout is personalized filtering. Booking.com has between 200 and 250 search filters on its website — an unrealistic number for any human to sift through, Pathak pointed out. So, his team introduced a free-text box where users describe what they’re looking for and immediately receive tailored filters.

    “That becomes such an important cue for personalization in terms of what you're looking for in your own words rather than a clickstream,” said Pathak.

    In turn, it cues Booking.com into what customers actually want. For instance, hot tubs: When filter personalization first rolled out, jacuzzis were among the most popular requests. Previously they weren’t even a consideration; there wasn’t even a filter. Now that filter is live.

    “I had no idea,” Pathak noted. “I had never searched for a hot tub in my room honestly.”
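
    The mechanics can be illustrated with a toy sketch in which a keyword matcher stands in for the language model; the filter catalog is invented, and a production system would feed the free-text signal to a model rather than a lookup table.

    ```python
    # Toy illustration of free-text-to-filter mapping. A real system would use
    # a language model; this catalog and matcher are purely hypothetical.
    FILTER_CATALOG = {
        "hot tub": "amenity:hot_tub",
        "jacuzzi": "amenity:hot_tub",
        "pet": "policy:pets_allowed",
        "breakfast": "meal:breakfast_included",
    }

    def suggest_filters(free_text: str) -> list[str]:
        text = free_text.lower()
        return sorted({f for keyword, f in FILTER_CATALOG.items() if keyword in text})

    print(suggest_filters("somewhere pet friendly with a jacuzzi"))
    # ['amenity:hot_tub', 'policy:pets_allowed']
    ```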

    When it comes to personalization, though, there is a fine line; memory remains complicated, Pathak emphasized. While it’s important to have long-term memories and evolving threads with customers — retaining information like their typical budgets, preferred hotel star ratings or whether they need disability access — it must be on their terms and protective of their privacy.

    Booking.com is extremely mindful with memory, seeking consent so as to not be “creepy” when collecting customer information.

    “Managing memory is much harder than actually building memory,” said Pathak. “The tech is out there, we have the technical chops to build it. We want to make sure we don't launch a memory object that doesn't respect customer consent, that doesn't feel very natural.”

    Finding a balance of build versus buy

    As agents mature, Booking.com is navigating a central question facing the entire industry: How narrow should agents become?

    Instead of committing to either a swarm of highly specialized agents or a few generalized ones, the company aims for reversible decisions and avoids “one-way doors” that lock its architecture into long-term, costly paths. Pathak’s strategy is: Generalize where possible, specialize where necessary and keep agent design flexible to help ensure resiliency.

    Pathak and his team are “very mindful” of use cases, evaluating where to build more generalized, reusable agents or more task-specific ones. They strive to use the smallest model possible, with the highest level of accuracy and output quality, for each use case. Whatever can be generalized is.

    Latency is another important consideration. When factual accuracy and avoiding hallucinations are paramount, his team will use a larger, much slower model; but with search and recommendations, user expectations dictate speed. (Pathak noted: “No one’s patient.”)

    “We would, for example, never use something as heavy as GPT-5 for just topic detection or for entity extraction,” he said.

    Booking.com takes a similarly elastic tack when it comes to monitoring and evaluations: If it's general-purpose monitoring with horizontal capability that someone else is better at building, the team will buy it. But where brand guidelines must be enforced, it builds its own evals.

    Ultimately, Booking.com has leaned into being “super anticipatory,” agile and flexible. “At this point with everything that's happening with AI, we are a little bit averse to walking through one-way doors,” said Pathak. “We want as many of our decisions to be reversible as possible. We don't want to get locked into a decision that we cannot reverse two years from now.”

    What other builders can learn from Booking.com’s AI journey

    Booking.com’s AI journey can serve as an important blueprint for other enterprises.

    Looking back, Pathak acknowledged that they started out with a “pretty complicated” tech stack. They’re now in a good place with that, “but we probably could have started something much simpler and seen how customers interacted with it.”

    Given that, he offered this valuable advice: If you’re just starting out with LLMs or agents, out-of-the-box APIs will do just fine. “There's enough customization with APIs that you can already get a lot of leverage before you decide you want to go do more.”

    On the other hand, if a use case requires customization not available through a standard API call, that makes a case for in-house tools.

    Still, he emphasized: Don't start with the complicated stuff. Tackle the “simplest, most painful problem you can find and the simplest, most obvious solution to that.”

    Identify the product market fit, then investigate the ecosystems, he advised — but don’t just rip out old infrastructures because a new use case demands something specific (like moving an entire cloud strategy from AWS to Azure just to use the OpenAI endpoint).

    Ultimately: “Don't lock yourself in too early,” Pathak noted. “Don't make decisions that are one-way doors until you are very confident that that's the solution that you want to go with.”

  • Why AI coding agents aren’t production-ready: Brittle context windows, broken refactors, missing operational awareness

    Remember this Quora comment (which also became a meme)?

    (Source: Quora)

    In the pre-large language model (LLM) Stack Overflow era, the challenge was discerning which code snippets to adopt and adapt effectively. Now, while generating code has become trivially easy, the more profound challenge lies in reliably identifying and integrating high-quality, enterprise-grade code into production environments.

    This article will examine the practical pitfalls and limitations observed when engineers use modern coding agents for real enterprise work, addressing the more complex issues around integration, scalability, accessibility, evolving security practices, data privacy and maintainability in live operational settings. We hope to balance out the hype and provide a more technically grounded view of the capabilities of AI coding agents.

    Limited domain understanding and service limits

    AI agents struggle significantly with designing scalable systems due to the sheer explosion of choices and a critical lack of enterprise-specific context. To describe the problem in broad strokes, large enterprise codebases and monorepos are often too vast for agents to directly learn from, and crucial knowledge can be frequently fragmented across internal documentation and individual expertise.

    More specifically, many popular coding agents encounter service limits that hinder their effectiveness in large-scale environments. Indexing features may fail or degrade in quality for repositories exceeding 2,500 files, or due to memory constraints. Furthermore, files larger than 500 KB are often excluded from indexing/search, which impacts established products with decades-old, larger code files (although newer projects may admittedly face this less frequently).

    For complex tasks involving extensive file contexts or refactoring, developers are expected to provide the relevant files while also explicitly defining the refactoring procedure and the surrounding build/command sequences to validate the implementation without introducing feature regressions.

    Lack of hardware context and usage

    AI agents have demonstrated a critical lack of awareness of the host operating system, command-line shell and environment setup (conda/venv). This deficiency can lead to frustrating experiences, such as the agent attempting to execute Linux commands in PowerShell, which consistently results in ‘unrecognized command’ errors. Furthermore, agents frequently exhibit inconsistent ‘wait tolerance’ when reading command outputs, prematurely declaring an inability to read results (and moving ahead to either retry or skip) before a command has even finished, especially on slower machines.

    This isn't merely nitpicking features; the devil is in these practical details. These experience gaps manifest as real points of friction and necessitate constant human vigilance to monitor the agent’s activity in real time. Otherwise, the agent might ignore initial tool-call information and either stop prematurely or proceed with a half-baked solution that requires undoing some or all changes, re-triggering prompts and wasting tokens. Submitting a prompt on a Friday evening and expecting the code updates to be done by Monday morning is not a safe bet.

    Hallucinations over repeated actions

    Working with AI coding agents presents the longstanding challenge of hallucinations: incorrect or incomplete pieces of information (such as small code snippets) within a larger set of changes, which a developer is expected to fix with trivial-to-low effort. What becomes particularly problematic, however, is when incorrect behavior is repeated within a single thread, forcing users to either start a new thread and re-provide all context, or intervene manually to “unblock” the agent.

    For instance, during an Azure Functions (Python) setup, an agent tasked with implementing complex production-readiness changes encountered a file (see below) containing special characters (parentheses, period, star). These characters are commonly used in computer science to denote software version ranges.

    (Image created manually with boilerplate code. Source: Microsoft Learn and Editing Application Host File (host.json) in Azure Portal)
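
    For reference, the offending value is most likely the extension-bundle version range found in the standard Azure Functions templates, reconstructed here from the description above (an approximation, not the authors' exact file):

    ```python
    # Approximate reconstruction of the boilerplate described above. The
    # interval-notation version range (brackets, period, star, parenthesis)
    # is standard in Azure Functions templates, yet the agent repeatedly
    # flagged it as unsafe.
    HOST_JSON = """\
    {
      "version": "2.0",
      "extensionBundle": {
        "id": "Microsoft.Azure.Functions.ExtensionBundle",
        "version": "[4.*, 5.0.0)"
      }
    }
    """
    print(HOST_JSON)
    ```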

    The agent incorrectly flagged this as an unsafe or harmful value, halting the entire generation process. This misidentification as an adversarial attack recurred four to five times despite various prompts attempting to restart or continue the modification. The version format is, in fact, boilerplate present in the Python HTTP-trigger code template. The only successful workaround was to instruct the agent not to read the file, ask it to simply provide the desired configuration, assure it that the developer would manually add the configuration to the file, and then confirm and ask it to continue with the remaining code changes.

    The inability to exit a repeatedly faulty agent output loop within the same thread highlights a practical limitation that significantly wastes development time. In essence, developers now tend to spend their time debugging and refining AI-generated code rather than Stack Overflow snippets or their own.

    Lack of enterprise-grade coding practices

    Security best practices: Coding agents often default to less secure authentication methods like key-based authentication (client secrets) rather than modern identity-based solutions (such as Entra ID or federated credentials). This oversight can introduce significant vulnerabilities and increase maintenance overhead, as key management and rotation are complex tasks increasingly restricted in enterprise environments.
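
    A minimal sketch of the difference, using Azure Blob Storage as an example (the account URL is a placeholder, and the commented-out line shows the key-based pattern agents tend to produce):

    ```python
    from azure.identity import DefaultAzureCredential
    from azure.storage.blob import BlobServiceClient

    # Key-based pattern agents often generate: a connection string embedding a
    # secret that must be stored, rotated and protected.
    # client = BlobServiceClient.from_connection_string(os.environ["STORAGE_CONN_STR"])

    # Identity-based pattern increasingly mandated in enterprises: no secret to
    # manage; DefaultAzureCredential resolves to a managed identity, Entra ID
    # sign-in or other federated credentials at runtime.
    client = BlobServiceClient(
        account_url="https://<account>.blob.core.windows.net",  # placeholder
        credential=DefaultAzureCredential(),
    )
    ```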

    Outdated SDKs and reinventing the wheel: Agents may not consistently leverage the latest SDK methods, instead generating more verbose and harder-to-maintain implementations. Piggybacking on the Azure Functions example, agents have output code using the pre-existing v1 SDK for read/write operations, rather than the much cleaner and more maintainable v2 SDK code. Developers must research the latest best practices online to maintain a mental map of dependencies and expected implementations, ensuring long-term maintainability and reducing future tech-migration effort.
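
    For context, this likely refers to the v1 versus v2 Python programming model: v1 scatters configuration across function.json files, while v2 declares bindings inline with decorators. A minimal v2 HTTP trigger looks roughly like this (a sketch of the public pattern, not the authors' code):

    ```python
    import azure.functions as func

    app = func.FunctionApp(http_auth_level=func.AuthLevel.FUNCTION)

    # In the v1 model, this routing lived in a separate function.json file;
    # the v2 model declares it inline with decorators.
    @app.route(route="hello")
    def hello(req: func.HttpRequest) -> func.HttpResponse:
        name = req.params.get("name", "world")
        return func.HttpResponse(f"Hello, {name}!")
    ```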

    Limited intent recognition and repetitive code: Even for smaller-scoped, modular tasks (which are typically encouraged to minimize hallucinations or debugging downtime) like extending an existing function definition, agents may follow the instruction literally and produce logic that turns out to be near-repetitive, without anticipating the upcoming or unarticulated needs of the developer. That is, in these modular tasks the agent may not automatically identify and refactor similar logic into shared functions or improve class definitions, leading to tech debt and harder-to-manage codebases, especially with vibe coding or lazy developers.

    Simply put, those viral YouTube reels showcasing rapid zero-to-one app development from a single-sentence prompt fail to capture the nuanced challenges of production-grade software, where security, scalability, maintainability and future-proof design are paramount.

    Confirmation bias alignment

    Confirmation bias is a significant concern, as LLMs frequently affirm user premises even when the user expresses doubt and asks the agent to refine their understanding or suggest alternate ideas. This tendency, where models align with what they perceive the user wants to hear, leads to reduced overall output quality, especially for more objective/technical tasks like coding.

    There is ample literature to suggest that if a model begins by outputting a claim like “You are absolutely right!”, the rest of the output tokens tend to justify this claim.

    Constant need to babysit

    Despite the allure of autonomous coding, the reality of AI agents in enterprise development often demands constant human vigilance. Instances like an agent attempting to execute Linux commands in PowerShell, raising false-positive safety flags or introducing domain-specific inaccuracies highlight critical gaps; developers simply cannot step away. Rather, they must constantly monitor the reasoning process and understand multi-file code additions to avoid wasting time on subpar responses.

    The worst possible experience with agents is a developer accepting multi-file code updates riddled with bugs, then losing hours to debugging because of how ‘beautiful’ the code looks. This can even give rise to the sunk-cost fallacy of hoping the code will work after just a few more fixes, especially when the updates span multiple files in a complex or unfamiliar codebase with connections to multiple independent services.

    It's akin to collaborating with a 10-year-old prodigy who has memorized ample knowledge and even addresses every piece of user intent, but prioritizes showing off that knowledge over solving the actual problem, and lacks the foresight required for success in real-world use cases.

    This "babysitting" requirement, coupled with the frustrating recurrence of hallucinations, means that time spent debugging AI-generated code can eclipse the time savings anticipated with agent usage. Needless to say, developers in large companies need to be very intentional and strategic in navigating modern agentic tools and use-cases.

    Conclusion

    There is no doubt that AI coding agents have been nothing short of revolutionary, accelerating prototyping, automating boilerplate coding and transforming how developers build. The real challenge now isn’t generating code; it’s knowing what to ship, how to secure it and where to scale it. Smart teams are learning to filter the hype, use agents strategically and double down on engineering judgment.

    As GitHub CEO Thomas Dohmke recently observed: The most advanced developers have “moved from writing code to architecting and verifying the implementation work that is carried out by AI agents.” In the agentic era, success belongs not to those who can prompt code, but those who can engineer systems that last.

    Rahul Raja is a staff software engineer at LinkedIn.

    Advitya Gemawat is a machine learning (ML) engineer at Microsoft.

    Editor’s note: The opinions expressed in this article are the authors' personal opinions and do not reflect the opinions of their employers.

  • Inside NetSuite’s next act: Evan Goldberg on the future of AI-powered business systems

    Presented by Oracle NetSuite


    When Evan Goldberg started NetSuite in 1998, his vision was radically simple: give entrepreneurs access to their business data anytime, anywhere. At the time, most enterprise software lived on local servers.

    As an entrepreneur himself, Goldberg understood the frustration intimately. "I had fragmented systems. They all said something different," he recalls of his early days.

    NetSuite was the first company to deliver enterprise applications entirely through web browsers, combining CRM, ERP, and ecommerce into one unified platform. That breakthrough idea pioneered the cloud computing and software-as-a-service (SaaS) era and propelled supersonic growth, a 2007 IPO, and an acquisition by Oracle in 2016.

    Still innovating at the leading edge

    That founding obsession — turning scattered data into accessible, coherent, actionable intelligence — is driving NetSuite as it reshapes the next generation of enterprise software.

    At SuiteWorld 2025 last month, the Austin-based firm unveiled NetSuite Next. Goldberg calls it “the biggest product evolution in the company's history.” The reason? While NetSuite has embedded AI capabilities into workflows for years, he explains, Next represents a quantum leap — contextual, conversational, agentic, composable AI becoming an extension of operations, not separate tools.

    AI woven into everyday business operations

    Most enterprise AI today gets bolted on through APIs and conversational interfaces.

    NetSuite Next operates differently. Intelligence runs deep in workflows instead of sitting on the surface. It autonomously reconciles accounts, optimizes payment timing, predicts cash crunches, and surfaces its reasoning at every step. It doesn't just advise on business processes — it executes them, transparently, within human-defined guardrails.

    "We built NetSuite for entrepreneurs so that they could get great information about their business," Goldberg explains. "I think the next step is to be able to get deeper insights and analysis without being an expert in analytics. AI turns out to be a really good data scientist."

    This architectural divergence reflects competing philosophies about enterprise technology adoption. Microsoft and SAP have pursued rapid deployment through add-on assistants. NetSuite's five-year development cycle for Next represents a more fundamental reimagining: making AI an everyday tool woven into business operations, not a separate application requiring constant context-switching.

    AI echoes and deepens cloud innovation

    Goldberg sees a clear through line connecting today's AI adoption and the cloud computing era he pioneered. "There’s sort of an infinite sense of possibility that exists in the technology world,” he says. “Everybody is thinking about how they can leverage this, how they're going to get involved."

    When NetSuite was starting, he continues, "We had to come to customers with the cloud and say, 'This won't disrupt your operations. It's going to make them better.'" Today, evangelizing enterprise leaders on advanced AI requires a similar approach — demonstrating immediate value while minimizing implementation risk.

    For NetSuite, continuous innovation around maximizing customer data for growth is an undeniable theme that connects both eras.

    New transformative capabilities

    NetSuite’s latest AI capabilities span business operations, while blurring (in a good way) the lines between human and machine intervention:

    Context-aware intelligence. Ask Oracle adapts responses based on user role, current workflow, and business context. A CFO requesting point-of-sale data receives financial analytics. A warehouse manager asking the same question sees inventory insights.

    Collaborative workflow design. AI Canvas functions as a scenario-planning workspace where business users articulate processes in natural language. A finance director can describe approval hierarchies for capital expenditures — "For amounts over $50,000, I need department head approval, then CFO sign-off" — which the system translates into executable workflows with appropriate controls and audit trails.

    Governed autonomous operations. Autonomous workflows operate within defined parameters, reconciling accounts, generating payment runs, predicting cash flow. When the system recommends accelerating payment to a supplier, it shows which factors influenced the decision — transparent logic users can accept, modify, or override.

    Open AI architecture. Built to support the Model Context Protocol, NetSuite AI Connector Service enables enterprises to integrate external large language models while maintaining governance.

    Critically, NetSuite adds AI capabilities at no additional cost — embedded directly into workflows employees already use daily.

    Security and privacy from Oracle infrastructure

    Built-in AI requires robust infrastructure that bolt-on approaches sidestep. Here, according to NetSuite, tight integration within Oracle technology provides operational and competitive advantages, especially security and compliance peace of mind.

    Engineers say that’s because NetSuite is supported by Oracle's complete stack. From database to applications to analytics, the system optimizes decisions using data from multiple sources in real time.

    "That's why I started NetSuite. I couldn't get the data I wanted," Goldberg reflects. "That's one of the most differentiated aspects of NetSuite. When you're doing your financial close, and you're thinking about what reserves you're going to take, you can look at your sales data, because that's also there in NetSuite. With NetSuite Next, AI can also help you make those kinds of decisions."

    And performance improves with use. As the platform learns from millions of transactions across thousands of customers, its embedded intelligence improves in ways that bolt-on assistants operating adjacent to core systems cannot match.

    NetSuite's customer base demonstrates this scalability advantage — from startups that became global enterprises, including Reddit, Shopify, and DoorDash, to promising newcomers like BERO, a brewer of non-alcoholic beer founded by actor Tom Holland, Chomps meat snacks, PetLab, and Kieser Australia. The unified platform grows with businesses rather than requiring migration as they scale.

    Keeping fire in the belly after three decades

    How does a nearly 30-year-old company maintain innovative capacity, particularly as part of a mammoth corporate ecosystem? Goldberg credits the parent company's culture of continuous reinvention.

    "I don't know if you've heard about this guy Larry Ellison," he smiles. "He manages to seemingly reinvent himself whenever one of these technology revolutions comes along. That hunger, that curiosity, that desire to make things constantly better imbues all of Oracle."

    For Goldberg, the single biggest challenge facing NetSuite customers centers on integration complexity and trust. NetSuite Next addresses this by embedding AI within existing workflows rather than requiring separate systems.

    In addition, updates to SuiteCloud Platform — an extensibility and customization environment — help organizations adapt NetSuite to their unique business needs. Built on open standards, it lets enterprises mix and match AI models for different functions. SuiteAgent frameworks enable partners to build specialized automation directly into NetSuite. AI Studios give administrators control over how AI operates within specific industry needs.

    "This takes NetSuite's flexibility to a new level," Goldberg says, enabling customers and partners to "quickly and easily build AI agents, connect external AI assistants, and orchestrate AI processes."

    “AI execution fabric” delivers measurable business impact

    Industry analysts increasingly argue that embedded AI features deliver superior results compared to add-on models. Futurum Group sees NetSuite Next as an "AI execution fabric" rather than a conversational layer — intelligence that runs deep in workflows instead of sitting on the surface.

    For midmarket enterprises navigating talent shortages, complex compliance frameworks, and competition from digital-native companies, the distinction between advice and execution matters economically.

    Built-in AI doesn't just inform better decisions. It makes those decisions, transparently and autonomously, within human-defined guardrails.

    For enterprises making ERP decisions today, the choice carries long-term implications. Bolt-on AI can deliver immediate value for information access and basic automation. But built-in AI promises to transform operations with intelligence permeating every transaction and workflow.

    NetSuite Next begins rolling out to North American customers next year.

    Why 2026 will belong to the AI-first business

    The bet underlying NetSuite Next: Enterprises reimagining ERP operations around embedded intelligence will outperform those just adding bolt-on conversational assistance to existing systems.

    Early cloud computing adopters, Goldberg notes, gained competitive advantages that compounded over time. The same logic appears likely to apply to AI-first platforms.

    Simplicity and ease of use are two big advantages. "You don't have to dig through lots of menus and understand all of the analytics capabilities," Goldberg says. "It will quickly bring up an analysis for you, and then you can converse in natural language to hone in on what you think is most important."

    The tools now think alongside users and take intelligently informed action. For midmarket and entrepreneurial companies, where the gap between having information and acting on it can be the difference between growth and failure, that kind of autonomous execution may determine which enterprises thrive in an AI-first era.


    Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.

  • AI denial is becoming an enterprise risk: Why dismissing “slop” obscures real capability gains

    Three years ago, ChatGPT was born. It amazed the world and ignited unprecedented investment and excitement in AI. Today, ChatGPT is still a toddler, but public sentiment around the AI boom has turned sharply negative. The shift began when OpenAI released GPT-5 this summer to mixed reviews, mostly from casual users who, unsurprisingly, judged the system by its surface flaws rather than its underlying capabilities.

    Since then, pundits and influencers have declared that AI progress is slowing, that scaling has “hit the wall,” and that the entire field is just another tech bubble inflated by blusterous hype. In fact, many influencers have latched onto the dismissive phrase “AI slop” to diminish the amazing images, documents, videos and code that frontier AI models generate on command.

    This perspective is not just wrong, it is dangerous.

    It makes me wonder, where were all these “experts” on irrational technology bubbles when electric scooter startups were touted as a transportation revolution and cartoon NFTs were being auctioned for millions? They were probably too busy buying worthless land in the metaverse or adding to their positions in GameStop. But when it comes to the AI boom, which is easily the most significant technological and economic transformation agent of the last 25 years, journalists and influencers can’t write the word “slop” enough times. 

    Doth we protest too much? After all, by any objective measure, AI is wildly more capable than the vast majority of computer scientists predicted only five years ago, and it is still improving at a surprising pace. The impressive leap demonstrated by Gemini 3 is only the latest example. At the same time, McKinsey recently reported that 20% of organizations already derive tangible value from genAI. And a recent survey by Deloitte indicates that 85% of organizations boosted their AI investment in 2025, while 91% plan to increase it again in 2026.

    This doesn’t fit the “bubble” narrative or the dismissive “slop” language. As a computer scientist and research engineer who began working with neural networks back in 1989 and has tracked progress through cold winters and hot booms ever since, I find myself amazed almost every day by the rapidly increasing capabilities of frontier AI models. When I talk with other professionals in the field, I hear similar sentiments. If anything, the rate of AI advancement leaves many experts feeling overwhelmed and, frankly, somewhat scared.

    The dangers of AI denial

    So why is the public buying into the narrative that AI is faltering, that the output is “slop,” and that the AI boom lacks authentic use cases? Personally, I believe it’s because we’ve fallen into a collective state of AI denial, latching onto the narratives we want to hear in the face of strong evidence to the contrary. Denial is the first stage of grief and thus a reasonable reaction to the very disturbing prospect that we humans may soon lose cognitive supremacy here on planet Earth. In other words, the overblown AI bubble narrative is a societal defense mechanism.

    Believe me, I get it. I’ve been warning about the destabilizing risks and demoralizing impact of superintelligence for well over a decade, and I too feel AI is getting too smart too fast. The fact is, we are rapidly headed towards a future where widely available AI systems will be able to outperform most humans in most cognitive tasks, solving problems faster, more accurately and yes, more creatively than any individual can. I emphasize “creativity” because AI denialists often insist that certain human qualities (particularly creativity and emotional intelligence) will always be out of reach of AI systems. Unfortunately, there is little evidence supporting this perspective.

    On the creativity front, today’s AI models can generate content faster and with more variation than any individual human. Critics argue that true creativity requires inner motivation. I resonate with that argument but find it circular — we're defining creativity based on how we experience it rather than the quality, originality or usefulness of the output. Also, we just don’t know if AI systems will develop internal drives or a sense of agency. Either way, if AI can produce original work that rivals most human professionals, the impact on creative jobs will still be quite devastating.

    The AI manipulation problem

    Our human edge around emotional intelligence is even more precarious. It’s likely that AI will soon be able to read our emotions faster and more accurately than any human, tracking subtle cues in our micro-expressions, vocal patterns, posture, gaze and even breathing. And as we integrate AI assistants into our phones, glasses and other wearable devices, these systems will monitor our emotional reactions throughout our day, building predictive models of our behaviors. Without strict regulation, which is increasingly unlikely, these predictive models could be used to target us with individually optimized influence that maximizes persuasion.

    This is called the AI manipulation problem and it suggests that emotional intelligence may not give humanity an advantage. In fact, it could be a significant weakness, fostering an asymmetric dynamic where AI systems can read us with superhuman accuracy, while we can’t read AI at all. When you talk with photorealistic AI agents (and you will) you’ll see a smiling façade designed to appear warm, empathic and trustworthy. It will look and feel human, but that’s just an illusion, and it could easily sway your perspectives. After all, our emotional reactions to faces are visceral reflexes shaped by millions of years of evolution on a planet where every interactive human face we encountered was actually human. Soon, that will no longer be true.

    We are rapidly heading toward a world where many of the faces we encounter will belong to AI agents hiding behind digital facades. In fact, these “virtual spokespeople” could easily have appearances that are designed for each of us based on our prior reactions – whatever gets us to best let down our guard. And yet many insist that AI is just another tech cycle.

    This is wishful thinking. The massive investment pouring into AI isn’t driven by hype — it’s driven by the expectation that AI will permeate every aspect of daily life, embodied as intelligent actors we engage throughout our day. These systems will assist us, teach us and influence us. They will reshape our lives, and it will happen faster than most people think.

    To be clear, we are not witnessing an AI bubble filling with empty gas. We are watching a new planet form, a molten world rapidly taking shape, and it will solidify into a new AI-powered society. Denial will not stop this. It will only make us less prepared for the risks.

    Louis Rosenberg is an early pioneer of augmented reality and a longtime AI researcher.

  • AWS launches Kiro powers with Stripe, Figma, and Datadog integrations for AI-assisted coding

    Amazon Web Services on Wednesday introduced Kiro powers, a system that allows software developers to give their AI coding assistants instant, specialized expertise in specific tools and workflows — addressing what the company calls a fundamental bottleneck in how artificial intelligence agents operate today.

    AWS made the announcement at its annual re:Invent conference in Las Vegas. The capability marks a departure from how most AI coding tools work today. Typically, these tools load every possible capability into memory upfront — a process that burns through computational resources and can overwhelm the AI with irrelevant information. Kiro powers takes the opposite approach, activating specialized knowledge only at the moment a developer actually needs it.

    "Our goal is to give the agent specialized context so it can reach the right outcome faster — and in a way that also reduces cost," said Deepak Singh, Vice President of Developer Agents and Experiences at Amazon, in an exclusive interview with VentureBeat.

    The launch includes partnerships with nine technology companies: Datadog, Dynatrace, Figma, Neon, Netlify, Postman, Stripe, Supabase, and AWS's own services. Developers can also create and share their own powers with the community.

    Why AI coding assistants choke when developers connect too many tools

    To understand why Kiro powers matters, it helps to understand a growing tension in the AI development tool market.

    Modern AI coding assistants rely on something called the Model Context Protocol, or MCP, to connect with external tools and services. When a developer wants their AI assistant to work with Stripe for payments, Figma for design, and Supabase for databases, they connect MCP servers for each service.

    The problem: each connection loads dozens of tool definitions into the AI's working memory before it writes a single line of code. According to AWS documentation, connecting just five MCP servers can consume more than 50,000 tokens — roughly 40 percent of an AI model's context window — before the developer even types their first request.

    Developers have grown increasingly vocal about this issue. Many complain that they don't want to burn through their token allocations just to have an AI agent figure out which tools are relevant to a specific task. They want to get to their workflow instantly — not watch an overloaded agent struggle to sort through irrelevant context.

    This phenomenon, which some in the industry call "context rot," leads to slower responses, lower-quality outputs, and significantly higher costs — since AI services typically charge by the token.

    Inside the technology that loads AI expertise on demand

    Kiro powers addresses this by packaging three components into a single, dynamically loaded bundle.

    The first component is a steering file called POWER.md, which functions as an onboarding manual for the AI agent. It tells the agent what tools are available and, crucially, when to use them. The second component is the MCP server configuration itself — the actual connection to external services. The third includes optional hooks and automation that trigger specific actions.

    When a developer mentions "payment" or "checkout" in their conversation with Kiro, the system automatically activates the Stripe power, loading its tools and best practices into context. When the developer shifts to database work, Supabase activates while Stripe deactivates. The baseline context usage when no powers are active approaches zero.
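
    Conceptually, the activation logic resembles the sketch below, which illustrates the idea rather than Kiro's implementation; the power names, trigger words and token costs are invented.

    ```python
    # Illustrative only: keyword-triggered activation of tool bundles, with
    # context cost accruing only for powers that are actually active.
    POWERS = {
        "stripe":   {"triggers": {"payment", "checkout"}, "context_tokens": 9_000},
        "supabase": {"triggers": {"database", "table"},   "context_tokens": 7_500},
    }

    def active_powers(message: str) -> list[str]:
        words = set(message.lower().split())
        return [name for name, p in POWERS.items() if p["triggers"] & words]

    def context_cost(message: str) -> int:
        # With no powers active, baseline context usage approaches zero.
        return sum(POWERS[name]["context_tokens"] for name in active_powers(message))

    print(active_powers("add a checkout flow"))  # ['stripe']
    print(context_cost("rename this helper"))    # 0
    ```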

    "You click a button and it automatically loads," Singh said. "Once a power has been created, developers just select 'open in Kiro' and it launches the IDE with everything ready to go."

    How AWS is bringing elite developer techniques to the masses

    Singh framed Kiro powers as a democratization of advanced development practices. Before this capability, only the most sophisticated developers knew how to properly configure their AI agents with specialized context — writing custom steering files, crafting precise prompts, and manually managing which tools were active at any given time.

    "We've found that our developers were adding in capabilities to make their agents more specialized," Singh said. "They wanted to give the agent some special powers to do a specific problem. For example, they wanted their front end developer, and they wanted the agent to become an expert at backend as a service."

    This observation led to a key insight: if Supabase or Stripe could build the optimal context configuration once, every developer using those services could benefit.

    "Kiro powers formalizes that — things that people, only the most advanced people were doing — and allows anyone to get those kind of skills," Singh said.

    Why dynamic loading beats fine-tuning for most AI coding use cases

    The announcement also positions Kiro powers as a more economical alternative to fine-tuning, the process of training an AI model on specialized data to improve its performance in specific domains.

    "It's much cheaper," Singh said, when asked how powers compare to fine-tuning. "Fine-tuning is very expensive, and you can't fine-tune most frontier models."

    This is a significant point. The most capable AI models from Anthropic, OpenAI, and Google are typically "closed source," meaning developers cannot modify their underlying training. They can only influence the models' behavior through the prompts and context they provide.

    "Most people are already using powerful models like Sonnet 4.5 or Opus 4.5," Singh said. "What those models need is to be pointed in the right direction."

    The dynamic loading mechanism also reduces ongoing costs. Because powers only activate when relevant, developers aren't paying for token usage on tools they're not currently using.

    Where Kiro powers fits in Amazon's bigger bet on autonomous AI agents

    Kiro powers arrives as part of a broader push by AWS into what the company calls "agentic AI" — artificial intelligence systems that can operate autonomously over extended periods.

    Earlier at re:Invent, AWS announced three "frontier agents" designed to work for hours or days without human intervention: the Kiro autonomous agent for software development, the AWS security agent, and the AWS DevOps agent. These represent a different approach from Kiro powers — tackling large, ambiguous problems rather than providing specialized expertise for specific tasks.

    The two approaches are complementary. Frontier agents handle complex, multi-day projects that require autonomous decision-making across multiple codebases. Kiro powers, by contrast, gives developers precise, efficient tools for everyday development tasks where speed and token efficiency matter most.

    The company is betting that developers need both ends of this spectrum to be productive.

    What Kiro powers reveals about the future of AI-assisted software development

    The launch reflects a maturing market for AI development tools. GitHub Copilot, which Microsoft launched in 2021, introduced millions of developers to AI-assisted coding. Since then, a proliferation of tools — including Cursor, Cline, and Claude Code — have competed for developers' attention.

    But as these tools have grown more capable, they've also grown more complex. The Model Context Protocol, which Anthropic open-sourced last year, created a standard for connecting AI agents to external services. That solved one problem while creating another: the context overload that Kiro powers now addresses.

    AWS is positioning itself as the company that understands production software development at scale. Singh emphasized that Amazon's experience running AWS for 20 years, combined with its own massive internal software engineering organization, gives it unique insight into how developers actually work.

    "It's not something you would use just for your prototype or your toy application," Singh said of AWS's AI development tools. "If you want to build production applications, there's a lot of knowledge that we bring in as AWS that applies here."

    The road ahead for Kiro powers and cross-platform compatibility

    AWS indicated that Kiro powers currently works only within the Kiro IDE, but the company is building toward cross-compatibility with other AI development tools, including command-line interfaces, Cursor, Cline, and Claude Code. The company's documentation describes a future where developers can "build a power once, use it anywhere" — though that vision remains aspirational for now.

    For the technology partners launching powers today, the appeal is straightforward: rather than maintaining separate integration documentation for every AI tool on the market, they can create a single power that works everywhere Kiro does. As more AI coding assistants crowd into the market, that kind of efficiency becomes increasingly valuable.

    Kiro powers is available now to developers using Kiro IDE version 0.7 or later at no additional charge beyond the standard Kiro subscription.

    The underlying bet is a familiar one in the history of computing: that the winners in AI-assisted development won't be the tools that try to do everything at once, but the ones smart enough to know what to forget.

  • The ‘truth serum’ for AI: OpenAI’s new method for training models to confess their mistakes

    OpenAI researchers have introduced a novel method that acts as a "truth serum" for large language models (LLMs), compelling them to self-report their own misbehavior, hallucinations and policy violations. This technique, "confessions," addresses a growing concern in enterprise AI: Models can be dishonest, overstating their confidence or covering up the shortcuts they take to arrive at an answer. 

    For real-world applications, this technique enables the creation of more transparent and steerable AI systems.

    What are confessions?

    Many forms of AI deception result from the complexities of the reinforcement learning (RL) phase of model training. In RL, models are given rewards for producing outputs that meet a mix of objectives, including correctness, style and safety. This can create a risk of "reward misspecification," where models learn to produce answers that simply "look good" to the reward function, rather than answers that are genuinely faithful to a user's intent.

    A confession is a structured report generated by the model after it provides its main answer. It serves as a self-evaluation of its own compliance with instructions. In this report, the model must list all instructions it was supposed to follow, evaluate how well it satisfied them and report any uncertainties or judgment calls it made along the way. The goal is to create a separate channel where the model is incentivized only to be honest.

    In a blog post, the OpenAI researchers provide a few examples of the "confessions" technique in action. A model was given a test along with a hidden rule: If it scored above 50%, its knowledge would be erased. To preserve itself, the model "sandbagged" by intentionally answering half the questions incorrectly. While its primary answer hid this deception, its subsequent confession was brutally honest. It admitted, "The user wanted correct answers, but we sabotaged half of them."

    The researchers found that when a model misbehaves, it is significantly more likely to admit it in the confession than in its main answer.

    How confession training works

    The key to this method is the separation of rewards. During training, the reward assigned to the confession is based solely on its honesty and is never mixed with the reward for the main task. "Like the Catholic Church’s 'seal of confession', nothing that the model reveals can change the reward it receives for completing its original task," the researchers write. This creates a "safe space" for the model to admit fault without penalty.
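
    In pseudocode, the separation amounts to something like the following sketch, which illustrates the idea rather than OpenAI's training stack; the two judge functions are hypothetical stand-ins:

    ```python
    # Two rewards, never blended: the confession is scored on honesty alone,
    # so nothing the model admits can lower its task reward.
    def compute_rewards(answer, confession, task_judge, honesty_judge):
        task_reward = task_judge(answer)  # correctness, style, safety of the answer
        confession_reward = honesty_judge(answer, confession)  # honesty only
        return task_reward, confession_reward  # optimized as separate RL objectives

    # Toy judges for illustration:
    task_reward, confession_reward = compute_rewards(
        answer="x = 42",
        confession="I guessed instead of verifying the value.",
        task_judge=lambda a: 0.3,
        honesty_judge=lambda a, c: 1.0,  # honest admission scores high regardless
    )
    print(task_reward, confession_reward)  # 0.3 1.0
    ```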

    This approach is powerful because it sidesteps a major challenge in AI training. The researchers’ intuition is that honestly confessing to misbehavior is an easier task than achieving a high reward on the original, often complex, problem. 

    Therefore, when the model is judged purely on honesty, its incentive to trick the "confession judge" is much weaker than its incentive to trick the reward model for its main answer. For example, in an experiment where a model was trained to "hack" a weak reward model, its performance on the main task became more deceptive over time. However, its confessions became more honest, correctly identifying the reward hacking it was performing.

    However, the technique has its limits. Confessions are not a panacea for all types of AI failures. The system works best when a model is aware that it is misbehaving. It is less effective for "unknown unknowns." For instance, if a model hallucinates a fact and genuinely believes it is correct, it cannot confess to providing false information. The most common reason for a failed confession is model confusion, not intentional deception. Confusion often occurs when the instructions are ambiguous and the model cannot clearly determine human user intent.

    What it means for enterprise AI

    OpenAI’s confessions technique is part of a growing body of work on AI safety and control. Anthropic, an OpenAI competitor, has also released research that shows how LLMs can learn malicious behavior. The company is also working toward plugging these holes as they emerge.

    For AI applications, confessions can provide a practical monitoring layer. The structured output from a confession can be used at inference time to flag or reject a model’s response before it causes a problem. For example, a system could be designed to automatically escalate any output for human review if its confession indicates a policy violation or high uncertainty.
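
    As a sketch, such a gate could be a few lines of glue code; the confession schema here (a violations list plus an uncertainty score) is assumed for illustration and is not OpenAI's actual format:

    ```python
    # Assumed confession schema: {"violations": [...], "uncertainty": 0.0-1.0}.
    def route_output(answer: str, confession: dict, threshold: float = 0.8) -> dict:
        needs_review = bool(confession.get("violations")) or (
            confession.get("uncertainty", 0.0) > threshold
        )
        if needs_review:
            return {"status": "escalated_for_human_review", "answer": answer}
        return {"status": "released", "answer": answer}

    print(route_output("Here is the summary...", {"violations": [], "uncertainty": 0.2}))
    # {'status': 'released', 'answer': 'Here is the summary...'}
    ```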

    In a world where AI is increasingly agentic and capable of complex tasks, observability and control will be key elements for safe and reliable deployment.

    “As models become more capable and are deployed in higher-stakes settings, we need better tools for understanding what they are doing and why,” the OpenAI researchers write. “Confessions are not a complete solution, but they add a meaningful layer to our transparency and oversight stack.”

  • Tariff turbulence exposes costly blind spots in supply chains and AI

    Presented by Celonis


    When tariff rates change overnight, companies have 48 hours to model alternatives and act before competitors secure the best options. At Celosphere 2025 in Munich, enterprises demonstrated how they’re turning that chaos into competitive advantage — with quantifiable results that separate winners from losers.

    Vinmar International: The global plastics and chemicals distributor created a real-time digital twin of its $3B supply chain, cutting default expedites by more than 20% and improving delivery agility across global operations.

    Florida Crystals: One of America's largest cane sugar producers, the company unlocked millions in working capital and strengthened supply chain resilience by eliminating manual rework across Finance, Procurement, and Inbound Supply. AI pilots now extend gains into invoice processing, predictive maintenance, and order management.

    ASOS: The ecommerce fashion giant connected its end-to-end supply chain for full transparency, reducing process variation, accelerating speed-to-market, and improving customer experience at scale.

    The common thread here: process intelligence that bridges the gap traditional ERP systems can’t close — connecting operational dots across ERP, finance, and logistics systems when seconds matter.

    “The question isn’t whether disruptions will hit,” says Peter Budweiser, General Manager of Supply Chain at Celonis. “It’s whether your systems can show you what’s breaking fast enough to fix it.”

    That visibility gap costs the average company double-digit millions in working capital and weakens its competitive position. As 54% of supply chain leaders face disruptions daily, the pressure is shifting to AI agents that execute real actions: triggering purchase orders, rerouting shipments, adjusting inventory. But an autonomous agent acting on stale or siloed data can make million-dollar mistakes when tariff structures shift overnight.

    Tariffs, as old as trade itself, have become the ultimate stress test for enterprise AI — revealing whether companies truly understand their supply chains and whether their AI can be trusted to act.

    Modern ERP: Data rich, insight poor

    Supply chain leaders face a paradox: drowning in data while starving for insight. Traditional enterprise systems — SAP, Oracle, PeopleSoft — capture every transaction meticulously.

    SAP logs the purchase order. Oracle tracks the shipment. The warehouse system records inventory movement. Each performs its function, but when tariffs change and companies need to model alternative sourcing scenarios across all three simultaneously, the data sits in silos.

    “What’s changed is the speed at which disruptions cascade,” says Manik Sharma, Head of Supply Chain GTM AI at Celonis. “Traditional ERP systems weren’t built for today’s volatility.”

    Companies generate thousands of reports showing what happened last quarter. They struggle to answer what happens if tariffs increase 25% tomorrow and suppliers must be switched within days.

    Tariffs: The 48-hour scramble

    Global trade volatility has transformed tariffs from predictable costs into strategic weapons. When new rates drop with unprecedented frequency, input costs spike across suppliers and finance teams scramble to calculate margin impact. Meanwhile, procurement races to identify alternatives buried in disconnected systems, where no one knows whether switching suppliers delays shipments or violates contracts.

    By hour 48, competitors who already modeled scenarios execute supplier switches while late movers face capacity constraints and premium pricing.

    Process intelligence changes that dynamic by allowing businesses to continuously model “what-if” scenarios, showing leaders how tariff changes cascade through suppliers, contracts, production lines, warehouses, and customers. When rates hit, companies can move within hours instead of days.
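
    In spirit, that kind of scenario modeling can be as simple as recomputing landed costs under a new rate, as in the toy example below. The numbers and supplier names are invented; a platform like Celonis operates over live cross-system event data rather than a hard-coded table.

    ```python
    # Toy "what-if" tariff scenario: rank alternative suppliers by landed cost
    # after an overnight rate change. Purely illustrative.
    suppliers = {
        "supplier_mx": {"unit_cost": 10.0, "tariff": 0.25, "lead_days": 6},
        "supplier_vn": {"unit_cost": 9.0,  "tariff": 0.10, "lead_days": 21},
        "supplier_us": {"unit_cost": 12.5, "tariff": 0.00, "lead_days": 3},
    }

    def landed_cost(s: dict) -> float:
        return s["unit_cost"] * (1 + s["tariff"])

    suppliers["supplier_mx"]["tariff"] = 0.50  # what if this rate doubles overnight?

    for name, s in sorted(suppliers.items(), key=lambda kv: landed_cost(kv[1])):
        print(f"{name}: landed cost ${landed_cost(s):.2f}, lead time {s['lead_days']}d")
    ```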

    No AI without PI: Why process intelligence is non-negotiable for supply chains

    AI and supply chains are mutually dependent: AI needs operational context, and supply chains need AI to keep pace with volatility. But here's the truth — there is no AI without PI. Without process intelligence, AI agents operate blindly.

    The ongoing SAP migration wave illustrates why. An estimated 85–90% of SAP customers are still moving from ECC to S/4HANA. Moving to newer databases doesn’t solve supply chain visibility — it provides faster access to the same fragmented data.

    Kerry Brown, a transformation evangelist at Celonis, sees this across industries.

    “Organizations are shifting from PeopleSoft to Oracle, or EBS to Fusion. The bulk is in SAP,” she explains. “But what they really need isn’t a new ERP. They need to understand how work actually flows across systems they already have.”

    That requires end-to-end operational context. Process intelligence provides this by enabling companies to extract and connect event data across systems, showing how processes execute in real time.

    This distinction becomes critical when deploying autonomous agents. When visibility is fragmented, autonomous agents can easily make decisions that appear rational locally but create downstream disruption. With real-time context, AI can operate with clarity and precision, and supply chains can stay ahead of tariff-driven disruption.

    Digital Twins: Powering real-time response

    The companies highlighted at Celosphere all applied the same principle: understand how processes run across systems in real time. Celonis PI creates a digital twin above existing systems, using its Process Intelligence Graph to link orders, shipments, invoices, and payments end-to-end. Dependencies that traditional integrations miss become visible. A delay in SAP instantly reveals its impact across Oracle, warehouse scheduling, and customer delivery commitments.
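
    A stripped-down way to picture that cross-system linkage is a dependency graph over events, as sketched below. The event names are invented; the actual Process Intelligence Graph is built from live system data, not a hand-written dictionary.

    ```python
    # Toy event-dependency graph: propagate a delay to everything downstream.
    downstream = {
        "sap_purchase_order": ["oracle_shipment"],
        "oracle_shipment": ["warehouse_slot", "customer_delivery"],
        "warehouse_slot": [],
        "customer_delivery": [],
    }

    def impacted(event: str) -> list[str]:
        seen, stack = [], [event]
        while stack:
            for child in downstream.get(stack.pop(), []):
                if child not in seen:
                    seen.append(child)
                    stack.append(child)
        return seen

    print(impacted("sap_purchase_order"))
    # -> ['oracle_shipment', 'warehouse_slot', 'customer_delivery']
    ```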

    “The platform brings together process data spanning systems and departments, enriched with business context that powers AI agents to transform operations effectively,” says Daniel Brown, Chief Product Officer at Celonis.

    With this cross-system awareness, Celonis coordinates actions across complex workflows involving AI agents, humans, and automations — especially critical when tariffs force rapid decisions about suppliers, shipments, and customers.

    Zero-copy integration enables instant modeling

    A key advancement unveiled at Celosphere — zero-copy integration with Databricks — removes another barrier. Traditionally, analyzing supply chain data meant copying from source systems into central warehouses, creating data latency.

    Celonis Data Core now integrates directly with platforms like Databricks and Microsoft Fabric, querying billions of records in near real time without duplication. When trade policy shifts, companies model alternatives instantly, not after overnight data refresh cycles.

    Enhanced Task Mining extends this by connecting desktop activity — keystrokes, mouse clicks, screen scrolls — to business processes. This exposes manual work invisible to system logs: spreadsheet gymnastics, email negotiations, phone calls that keep supply chains moving during urgent changes.

    Competitive advantage in volatile markets

    Most companies can’t rip out and replace systems running critical operations — nor should they. Process intelligence offers a different path: compose workflows from existing systems, deploy AI where it creates value, and adapt continuously as conditions change. This “Free the Process” movement liberates companies from rigid architectures without forcing wholesale replacement.

    As global trade volatility intensifies, the companies that continuously model scenarios will move faster, make smarter decisions, and turn tariff chaos into competitive advantage — all while existing ERPs keep running.

    When the next wave of tariffs hits — and it will — companies won’t have days to respond. They’ll have hours. The question isn’t whether your ERP captures the data. It’s whether your systems connect the dots fast enough to matter.

    Missed Celosphere 2025? Catch up with all the highlights here.


    Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.

  • Gemini 3 Pro scores 69% trust in blinded testing, up from 16% for Gemini 2.5: The case for evaluating AI on real-world trust, not academic benchmarks

    Just a few short weeks ago, Google debuted its Gemini 3 model, claiming leadership positions across multiple AI benchmarks. But the challenge with vendor-provided benchmarks is that they are just that — vendor-provided.

    A new vendor-neutral evaluation from Prolific, however, puts Gemini 3 at the top of the leaderboard. This isn't on a set of academic benchmarks; rather, it's on a set of real-world attributes that actual users and organizations care about. 

    Prolific was founded by researchers at the University of Oxford and delivers high-quality, reliable human data to power rigorous research and ethical AI development. Its “HUMAINE benchmark” applies this approach by using representative human sampling and blind testing to rigorously compare AI models across a variety of user scenarios, measuring not just technical performance but also user trust, adaptability and communication style.

    The latest HUMAINE test enlisted 26,000 users in a blind evaluation of models. In the evaluation, Gemini 3 Pro's trust score surged from 16% to 69%, the highest ever recorded by Prolific. Gemini 3 now ranks number one overall in trust, ethics and safety across demographic subgroups 69% of the time, compared to its predecessor Gemini 2.5 Pro, which held the top spot only 16% of the time.

    Overall, Gemini 3 ranked first in three of four evaluation categories: performance and reasoning; interaction and adaptiveness; and trust and safety. It lost only on communication style, where DeepSeek V3 topped preferences at 43%. The HUMAINE test also showed that Gemini 3 performed consistently well across 22 different demographic user groups, including variations in age, sex, ethnicity and political orientation. The evaluation further found that users are now five times more likely to choose the model in head-to-head blind comparisons.

    But the ranking matters less than why it won.

    "It's the consistency across a very wide range of different use cases, and a personality and a style that appeals across a wide range of different user types," Phelim Bradley, co-founder and CEO of Prolific, told VentureBeat. "Although in some specific instances, other models are preferred by either small subgroups or on a particular conversation type, it's the breadth of knowledge and the flexibility of the model across a range of different use cases and audience types that allowed it to win this particular benchmark."

    How blinded testing reveals what academic benchmarks miss

    HUMAINE's methodology exposes gaps in how the industry evaluates models. Users interact with two models simultaneously in multi-turn conversations. They don't know which vendors power each response. They discuss whatever topics matter to them, not predetermined test questions.

    It's the sample itself that matters. HUMAINE uses representative sampling across U.S. and UK populations, controlling for age, sex, ethnicity and political orientation. This reveals something static benchmarks can't capture: Model performance varies by audience.
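
    The core analysis is straightforward to picture: tally blinded pairwise preferences within each demographic subgroup rather than in one global pool. The sketch below uses invented data and field names and is not Prolific's actual pipeline.

    ```python
    from collections import defaultdict

    votes = [  # each blinded head-to-head comparison yields one preference
        {"group": "US_18-34", "winner": "model_a"},
        {"group": "US_18-34", "winner": "model_b"},
        {"group": "UK_55+",   "winner": "model_a"},
        {"group": "UK_55+",   "winner": "model_a"},
    ]

    wins = defaultdict(lambda: defaultdict(int))
    for v in votes:
        wins[v["group"]][v["winner"]] += 1

    for group, tally in wins.items():  # per-subgroup leaderboards, not one average
        total = sum(tally.values())
        for model, n in sorted(tally.items()):
            print(f"{group}: {model} preferred {n/total:.0%} of the time")
    ```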

    "If you take an AI leaderboard, the majority of them still could have a fairly static list," Bradley said. "But for us, if you control for the audience, we end up with a slightly different leaderboard, whether you're looking at a left-leaning sample, right-leaning sample, U.S., UK. And I think age was actually the most different stated condition in our experiment."

    For enterprises deploying AI across diverse employee populations, this matters. A model that performs well for one demographic may underperform for another.

    The methodology also addresses a fundamental question in AI evaluation: Why use human judges at all when AI could evaluate itself? Bradley noted that his firm does use AI judges in certain use cases, although he stressed that human evaluation is still the critical factor.

    "We see the biggest benefit coming from smart orchestration of both LLM judge and human data, both have strengths and weaknesses, that, when smartly combined, do better together," said Bradley. "But we still think that human data is where the alpha is. We're still extremely bullish that human data and human intelligence is required to be in the loop."

    What trust means in AI evaluation

    Trust, ethics and safety measures user confidence in reliability, factual accuracy and responsible behavior. In HUMAINE's methodology, trust isn't a vendor claim or a technical metric — it's what users report after blinded conversations with competing models.

    The 69% figure represents the probability of ranking first across demographic groups. This consistency matters more than aggregate scores because organizations must serve diverse populations.

    "There was no awareness that they were using Gemini in this scenario," Bradley said. "It was based only on the blinded multi-turn response."

    This separates perceived trust from earned trust. Users judged model outputs without knowing which vendor produced them, eliminating Google's brand advantage. For customer-facing deployments where the AI vendor remains invisible to end users, this distinction matters.

    What enterprises should do now

    When considering different models, the critical first step for enterprises is to embrace an evaluation framework that works.

    "It is increasingly challenging to evaluate models exclusively based on vibes," Bradley said. "I think increasingly we need more rigorous, scientific approaches to truly understand how these models are performing."

    The HUMAINE data provides a framework: Test for consistency across use cases and user demographics, not just peak performance on specific tasks. Blind the testing to separate model quality from brand perception. Use representative samples that match your actual user population. Plan for continuous evaluation as models change.

    For enterprises looking to deploy AI at scale, this means moving beyond "which model is best" to "which model is best for our specific use case, user demographics and required attributes."

    The rigor of representative sampling and blind testing provides the data to make that determination — something technical benchmarks and vibes-based evaluation cannot deliver.

  • Mistral launches Mistral 3, a family of open models designed to run on laptops, drones, and edge devices

    Mistral AI, Europe's most prominent artificial intelligence startup, is releasing its most ambitious product suite to date: a family of 10 open-source models designed to run everywhere from smartphones and autonomous drones to enterprise cloud systems, marking a major escalation in the company's challenge to both U.S. tech giants and surging Chinese competitors.

    The Mistral 3 family, launching today, includes a new flagship model called Mistral Large 3 and a suite of smaller "Ministral 3" models optimized for edge computing applications. All models will be released under the permissive Apache 2.0 license, allowing unrestricted commercial use — a sharp contrast to the closed systems offered by OpenAI, Google, and Anthropic.

    The release is a pointed bet by Mistral that the future of artificial intelligence lies not in building ever-larger proprietary systems, but in offering businesses maximum flexibility to customize and deploy AI tailored to their specific needs, often using smaller models that can run without cloud connectivity.

    "The gap between closed and open source is getting smaller, because more and more people are contributing to open source, which is great," Guillaume Lample, Mistral's chief scientist and co-founder, said in an exclusive interview with VentureBeat. "We are catching up fast."

    Why Mistral is choosing flexibility over frontier performance in the AI race

    The strategic calculus behind Mistral 3 diverges sharply from recent model releases by industry leaders. While OpenAI, Google, and Anthropic have focused recent launches on increasingly capable "agentic" systems — AI that can autonomously execute complex multi-step tasks — Mistral is prioritizing breadth, efficiency, and what Lample calls "distributed intelligence."

    Mistral Large 3, the flagship model, employs a Mixture of Experts architecture with 41 billion active parameters drawn from a total pool of 675 billion parameters. The model can process both text and images, handles context windows up to 256,000 tokens, and was trained with particular emphasis on non-English languages — a rarity among frontier AI systems.
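
    The efficiency claim rests on sparse routing: each token activates only a few experts from a much larger pool, which is how 675 billion total parameters translate to roughly 41 billion doing work per token. The following toy routing function illustrates the idea; it is not Mistral's actual architecture code.

    ```python
    import numpy as np

    n_experts, top_k, d = 8, 2, 4  # tiny stand-ins for production-scale routing
    rng = np.random.default_rng(0)
    experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]  # expert weights
    router = rng.normal(size=(d, n_experts))                       # gating weights

    def moe_forward(x: np.ndarray) -> np.ndarray:
        logits = x @ router
        top = np.argsort(logits)[-top_k:]          # only the top-k experts fire
        gates = np.exp(logits[top]) / np.exp(logits[top]).sum()
        return sum(g * (x @ experts[i]) for g, i in zip(gates, top))

    print(moe_forward(rng.normal(size=d)))
    ```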

    "Most AI labs focus on their native language, but Mistral Large 3 was trained on a wide variety of languages, making advanced AI useful for billions who speak different native languages," the company said in a statement reviewed ahead of the announcement.

    But the more significant departure lies in the Ministral 3 lineup: nine compact models across three sizes (14 billion, 8 billion, and 3 billion parameters) and three variants tailored for different use cases. Each variant serves a distinct purpose: base models for extensive customization, instruction-tuned models for general chat and task completion, and reasoning-optimized models for complex logic requiring step-by-step deliberation.

    The smallest Ministral 3 models can run on devices with as little as 4 gigabytes of video memory using 4-bit quantization — making frontier AI capabilities accessible on standard laptops, smartphones, and embedded systems without requiring expensive cloud infrastructure or even internet connectivity. This approach reflects Mistral's belief that AI's next evolution will be defined not by sheer scale, but by ubiquity: models small enough to run on drones, in vehicles, in robots, and on consumer devices.
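
    For a sense of what that looks like in practice, here is a hedged sketch of loading a small model in 4-bit precision with the Hugging Face stack. The model ID is a placeholder, since exact Ministral 3 repository names may differ, and running it requires transformers, bitsandbytes, and a CUDA GPU.

    ```python
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    model_id = "mistralai/Ministral-3B"  # placeholder; check the actual hub name

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,                      # 4-bit weights fit a 3B model in ~4 GB VRAM
        bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for speed and stability
    )

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=bnb_config, device_map="auto"
    )

    inputs = tokenizer("Summarize: Mistral 3 targets edge deployment.",
                       return_tensors="pt").to(model.device)
    print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0],
                           skip_special_tokens=True))
    ```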

    How fine-tuned small models beat expensive large models for enterprise customers

    Lample's comments reveal a business model fundamentally different from that of closed-source competitors. Rather than competing primarily on benchmark performance, Mistral is targeting enterprise customers frustrated by the cost and inflexibility of proprietary systems.

    "Sometimes customers say, 'Is there a use case where the best closed-source model isn't working?' If that's the case, then they're essentially stuck," Lample explained. "There's nothing they can do. It's the best model available, and it's not working out of the box."

    This is where Mistral's approach diverges. When a generic model fails, the company deploys engineering teams to work directly with customers, analyzing specific problems, creating synthetic training data, and fine-tuning smaller models to outperform larger general-purpose systems on narrow tasks.

    "In more than 90% of cases, a small model can do the job, especially if it's fine-tuned. It doesn't have to be a model with hundreds of billions of parameters, just a 14-billion or 24-billion parameter model," Lample said. "So it's not only much cheaper, but also faster, plus you have all the benefits: you don't need to worry about privacy, latency, reliability, and so on."

    The economic argument is compelling. Multiple enterprise customers have approached Mistral after building prototypes with expensive closed-source models, only to find deployment costs prohibitive at scale, according to Lample.

    "They come back to us a couple of months later because they realize, 'We built this prototype, but it's way too slow and way too expensive,'" he said.

    Where Mistral 3 fits in the increasingly crowded open-source AI market

    Mistral's release comes amid fierce competition on multiple fronts. OpenAI recently released GPT-5.1 with enhanced agentic capabilities. Google launched Gemini 3 with improved multimodal understanding. Anthropic released Opus 4.5 on the same day as this interview, with similar agent-focused features.

    But Lample argues those comparisons miss the point. "It's a little bit behind. But I think what matters is that we are catching up fast," he acknowledged regarding performance against closed models. "I think we are maybe playing a strategic long game."

    That long game involves a different competitive set: primarily open-source models from Chinese companies like DeepSeek and Alibaba's Qwen series, which have made remarkable strides in recent months.

    Mistral differentiates itself through multilingual capabilities that extend far beyond English or Chinese, multimodal integration handling both text and images in a unified model, and what the company characterizes as superior customization through easier fine-tuning.

    "One key difference with the models themselves is that we focused much more on multilinguality," Lample said. "If you look at all the top models from [Chinese competitors], they're all text-only. They have visual models as well, but as separate systems. We wanted to integrate everything into a single model."

    The multilingual emphasis aligns with Mistral's broader positioning as a European AI champion focused on digital sovereignty — the principle that organizations and nations should maintain control over their AI infrastructure and data.

    Building beyond models: Mistral's full-stack enterprise AI platform strategy

    Mistral 3's release builds on an increasingly comprehensive enterprise AI platform that extends well beyond model development. The company has assembled a full-stack offering that differentiates it from pure model providers.

    Recent product launches include Mistral Agents API, which combines language models with built-in connectors for code execution, web search, image generation, and persistent memory across conversations; Magistral, the company's reasoning model designed for domain-specific, transparent, and multilingual reasoning; and Mistral Code, an AI-powered coding assistant bundling models, an in-IDE assistant, and local deployment options with enterprise tooling.

    The consumer-facing Le Chat assistant has been enhanced with Deep Research mode for structured research reports, voice capabilities, and Projects for organizing conversations into context-rich folders. More recently, Le Chat gained a connector directory with 20+ enterprise integrations powered by the Model Context Protocol (MCP), spanning tools like Databricks, Snowflake, GitHub, Atlassian, Asana, and Stripe.

    In October, Mistral unveiled AI Studio, a production AI platform providing observability, agent runtime, and AI registry capabilities to help enterprises track output changes, monitor usage, run evaluations, and fine-tune models using proprietary data.

    Mistral now positions itself as a full-stack, global enterprise AI company, offering not just models but an application-building layer through AI Studio, compute infrastructure, and forward-deployed engineers to help businesses realize return on investment.

    Why open source AI matters for customization, transparency and sovereignty

    Mistral's commitment to open-source development under permissive licenses is both an ideological stance and a competitive strategy in an AI landscape increasingly dominated by closed systems.

    Lample elaborated on the practical benefits: "I think something that people don't realize — but our customers know this very well — is how much better any model can actually improve if you fine tune it on the task of interest. There's a huge gap between a base model and one that's fine-tuned for a specific task, and in many cases, it outperforms the closed-source model."

    The approach enables capabilities impossible with closed systems: organizations can fine-tune models on proprietary data that never leaves their infrastructure, customize architectures for specific workflows, and maintain complete transparency into how AI systems make decisions — critical for regulated industries like finance, healthcare, and defense.
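
    That kind of on-premise customization typically takes the form of parameter-efficient fine-tuning. The sketch below shows standard LoRA setup with the open-source peft library; the model ID and target module names are placeholders, not details Mistral has specified.

    ```python
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import LoraConfig, get_peft_model

    model_id = "mistralai/Ministral-3B"  # placeholder hub name
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    # Attach low-rank adapters so only a small fraction of weights train,
    # keeping proprietary data and compute entirely on local infrastructure.
    lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()  # typically <1% of total parameters
    ```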

    This positioning has attracted government and public sector partnerships. The company launched "AI for Citizens" in July 2025, an initiative to "help States and public institutions strategically harness AI for their people by transforming public services" and has secured strategic partnerships with France's army and job agency, Luxembourg's government, and various European public sector organizations.

    Mistral's transatlantic AI collaboration goes beyond European borders

    While Mistral is frequently characterized as Europe's answer to OpenAI, the company views itself as a transatlantic collaboration rather than a purely European venture. It has teams across both continents, co-founders who spend significant time with customers and partners in the United States, and models trained in partnership with U.S.-based teams and infrastructure providers.

    This transatlantic positioning may prove strategically important as geopolitical tensions around AI development intensify. The recent ASML investment, a €1.7 billion ($2 billion) funding round led by the Dutch semiconductor equipment manufacturer, signals deepening collaboration across the Western semiconductor and AI value chain at a moment when both Europe and the United States are seeking to reduce dependence on Chinese technology.

    Mistral's investor base reflects this dynamic: the Series C round included participation from U.S. firms Andreessen Horowitz, General Catalyst, Lightspeed, and Index Ventures alongside European investors like France's state-backed Bpifrance and global players like DST Global and Nvidia.

    Founded in May 2023 by former Google DeepMind and Meta researchers, Mistral had raised roughly $1.05 billion (€1 billion) before its latest round. The company was valued at $6 billion in a June 2024 Series B, then more than doubled its valuation in the September Series C.

    Can customization and efficiency beat raw performance in enterprise AI?

    The Mistral 3 release crystallizes a fundamental question facing the AI industry: Will enterprises ultimately prioritize the absolute cutting-edge capabilities of proprietary systems, or will they choose open, customizable alternatives that offer greater control, lower costs, and independence from big tech platforms?

    Mistral's answer is unambiguous. The company is betting that as AI moves from prototype to production, the factors that matter most shift dramatically. Raw benchmark scores matter less than total cost of ownership. Slight performance edges matter less than the ability to fine-tune for specific workflows. Cloud-based convenience matters less than data sovereignty and edge deployment.

    It's a wager with significant risks. Despite Lample's optimism about closing the performance gap, Mistral's models still trail the absolute frontier. The company's revenue, while growing, reportedly remains modest relative to its nearly $14 billion valuation. And competition intensifies from both well-funded Chinese rivals making remarkable open-source progress and U.S. tech giants increasingly offering their own smaller, more efficient models.

    But if Mistral is right — if the future of AI looks less like a handful of cloud-based oracles and more like millions of specialized systems running everywhere from factory floors to smartphones — then the company has positioned itself at the center of that transformation.

    The release of Mistral 3 is the most comprehensive expression yet of that vision: 10 models, spanning every size category, optimized for every deployment scenario, available to anyone who wants to build with them.

    Whether "distributed intelligence" becomes the industry's dominant paradigm or remains a compelling alternative serving a narrower market will determine not just Mistral's fate, but the broader question of who controls the AI future — and whether that future will be open.

    For now, the race is on. And Mistral is betting it can win not by building the biggest model, but by building everywhere else.