Tag: OpenAI

  • Will updating your AI agents help or hamper their performance? Raindrop’s new tool Experiments tells you

    In the two years since ChatGPT launched, it seems that almost every week a new large language model (LLM) from a rival lab or from OpenAI itself has been released. Enterprises are hard pressed to keep up with the massive pace of change, let alone understand how to adapt to it — which of these new models, if any, should they adopt to power their workflows and the custom AI agents they're building to carry them out?

    Help has arrived: AI applications observability startup Raindrop has launched Experiments, a new analytics feature that the company describes as the first A/B testing suite designed specifically for enterprise AI agents — allowing companies to see and compare how updating agents to new underlying models, or changing their instructions and tool access, will impact their performance with real end users.

    The release extends Raindrop’s existing observability tools, giving developers and teams a way to see how their agents behave and evolve in real-world conditions.

    With Experiments, teams can track how changes — such as a new tool, prompt, model update, or full pipeline refactor — affect AI performance across millions of user interactions. The new feature is available now for users on Raindrop’s Pro subscription plan ($350 monthly) at raindrop.ai.

    A Data-Driven Lens on Agent Development

    Raindrop co-founder and chief technology officer Ben Hylak noted in a product announcement video (above) that Experiments helps teams see “how literally anything changed,” including tool usage, user intents, and issue rates, and to explore differences by demographic factors such as language. The goal is to make model iteration more transparent and measurable.

    The Experiments interface presents results visually, showing when an experiment performs better or worse than its baseline. Increases in negative signals might indicate higher task failure or partial code output, while improvements in positive signals could reflect more complete responses or better user experiences.

    By making this data easy to interpret, Raindrop encourages AI teams to approach agent iteration with the same rigor as modern software deployment—tracking outcomes, sharing insights, and addressing regressions before they compound.

    Background: From AI Observability to Experimentation

    Raindrop’s launch of Experiments builds on the company’s foundation as one of the first AI-native observability platforms, designed to help enterprises monitor and understand how their generative AI systems behave in production.

    As VentureBeat reported earlier this year, the company — originally known as Dawn AI — emerged to address what Hylak, a former Apple human interface designer, called the “black box problem” of AI performance, helping teams catch failures “as they happen and explain to enterprises what went wrong and why."

    At the time, Hylak described how “AI products fail constantly—in ways both hilarious and terrifying,” noting that unlike traditional software, which throws clear exceptions, “AI products fail silently.” Raindrop’s original platform focused on detecting those silent failures by analyzing signals such as user feedback, task failures, refusals, and other conversational anomalies across millions of daily events.

    The company’s co-founders — Hylak, Alexis Gauba, and Zubin Singh Koticha — built Raindrop after encountering firsthand the difficulty of debugging AI systems in production.

    “We started by building AI products, not infrastructure,” Hylak told VentureBeat. “But pretty quickly, we saw that to grow anything serious, we needed tooling to understand AI behavior—and that tooling didn’t exist.”

    With Experiments, Raindrop extends that same mission from detecting failures to measuring improvements. The new tool transforms observability data into actionable comparisons, letting enterprises test whether changes to their models, prompts, or pipelines actually make their AI agents better—or just different.

    Solving the “Evals Pass, Agents Fail” Problem

    Traditional evaluation frameworks, while useful for benchmarking, rarely capture the unpredictable behavior of AI agents operating in dynamic environments.

    As Raindrop co-founder Alexis Gauba explained in her LinkedIn announcement, “Traditional evals don’t really answer this question. They’re great unit tests, but you can’t predict your user’s actions and your agent is running for hours, calling hundreds of tools.”

    Gauba said the company consistently heard a common frustration from teams: “Evals pass, agents fail.”

    Experiments is meant to close that gap by showing what actually changes when developers ship updates to their systems.

    The tool enables side-by-side comparisons of models, tools, intents, or properties, surfacing measurable differences in behavior and performance.

    Designed for Real-World AI Behavior

    In the announcement video, Raindrop described Experiments as a way to “compare anything and measure how your agent’s behavior actually changed in production across millions of real interactions.”

    The platform helps users spot issues such as task failure spikes, forgetting, or new tools that trigger unexpected errors.

    It can also be used in reverse — starting from a known problem, such as an “agent stuck in a loop,” and tracing back to which model, tool, or flag is driving it.

    From there, developers can dive into detailed traces to find the root cause and ship a fix quickly.

    Each experiment provides a visual breakdown of metrics like tool usage frequency, error rates, conversation duration, and response length.

    Users can click on any comparison to access the underlying event data, giving them a clear view of how agent behavior changed over time. Shared links make it easy to collaborate with teammates or report findings.

    Integration, Scalability, and Accuracy

    According to Hylak, Experiments integrates directly with “the feature flag platforms companies know and love (like Statsig!)” and is designed to work seamlessly with existing telemetry and analytics pipelines.

    For companies without those integrations, it can still compare performance over time—such as yesterday versus today—without additional setup.

    Hylak said teams typically need around 2,000 users per day to produce statistically meaningful results.

    To ensure the accuracy of comparisons, Experiments monitors for sample size adequacy and alerts users if a test lacks enough data to draw valid conclusions.
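    A minimal sketch of the kind of significance check such a feature might perform — comparing a negative-signal rate (here, task failures) between a baseline and an experiment with a two-proportion z-test. The function, numbers, and threshold are illustrative assumptions, not Raindrop's actual implementation:

```python
import math

def two_proportion_ztest(fail_a, n_a, fail_b, n_b):
    """Compare a negative-signal rate between a baseline (a) and an
    experiment (b) using a pooled two-proportion z-test."""
    p_a, p_b = fail_a / n_a, fail_b / n_b
    p_pool = (fail_a + fail_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value via the normal CDF (Phi(x) = 0.5 * (1 + erf(x/sqrt(2)))).
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Baseline: 120 failures in 2,000 sessions; experiment: 180 in 2,000.
z, p = two_proportion_ztest(120, 2000, 180, 2000)
significant = p < 0.05  # flag the regression only if the sample supports it
```

    With too few daily interactions, the same observed difference would fail this test — which is why a tool like this must warn users when the sample size is inadequate.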

    “We obsess over making sure metrics like Task Failure and User Frustration are metrics that you’d wake up an on-call engineer for,” Hylak explained. He added that teams can drill into the specific conversations or events that drive those metrics, ensuring transparency behind every aggregate number.

    Security and Data Protection

    Raindrop operates as a cloud-hosted platform but also offers on-premise personally identifiable information (PII) redaction for enterprises that need additional control.

    Hylak said the company is SOC 2 compliant and has launched a PII Guard feature that uses AI to automatically remove sensitive information from stored data. “We take protecting customer data very seriously,” he emphasized.

    Pricing and Plans

    Experiments is part of Raindrop’s Pro plan, which costs $350 per month or $0.0007 per interaction. The Pro tier also includes deep research tools, topic clustering, custom issue tracking, and semantic search capabilities.

    Raindrop’s Starter plan — $65 per month or $0.001 per interaction — offers core analytics including issue detection, user feedback signals, Slack alerts, and user tracking. Both plans come with a 14-day free trial.

    Larger organizations can opt for an Enterprise plan with custom pricing and advanced features like SSO login, custom alerts, integrations, edge-PII redaction, and priority support.

    Continuous Improvement for AI Systems

    With Experiments, Raindrop positions itself at the intersection of AI analytics and software observability. Its focus on “measure truth,” as stated in the product video, reflects a broader push within the industry toward accountability and transparency in AI operations.

    Rather than relying solely on offline benchmarks, Raindrop’s approach emphasizes real user data and contextual understanding. The company hopes this will allow AI developers to move faster, identify root causes sooner, and ship better-performing models with confidence.

  • Together AI’s ATLAS adaptive speculator delivers 400% inference speedup by learning from workloads in real-time

    Enterprises expanding AI deployments are hitting an invisible performance wall. The culprit? Static speculators that can't keep up with shifting workloads.

    Speculators are smaller AI models that work alongside large language models during inference. They draft multiple tokens ahead, which the main model then verifies in parallel. This technique (called speculative decoding) has become essential for enterprises trying to reduce inference costs and latency. Instead of generating tokens one at a time, the system can accept multiple tokens at once, dramatically improving throughput.
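    The drafting-and-verification loop described above can be sketched in miniature. The model functions below are toy stand-ins for a greedy draft model and target model, not a real inference stack:

```python
def speculative_decode_step(draft_next, target_next, context, k=5):
    """One round of greedy speculative decoding: the small draft model
    proposes k tokens, and the large target model verifies them, keeping
    the longest agreeing prefix plus its own correction."""
    # 1. Draft k tokens autoregressively with the cheap model.
    drafted, ctx = [], list(context)
    for _ in range(k):
        tok = draft_next(ctx)
        drafted.append(tok)
        ctx.append(tok)
    # 2. The target model scores all drafted positions in what would be a
    #    single parallel pass (simulated sequentially here).
    accepted, ctx = [], list(context)
    for tok in drafted:
        target_tok = target_next(ctx)
        if target_tok != tok:
            accepted.append(target_tok)  # target's correction ends the round
            break
        accepted.append(tok)
        ctx.append(tok)
    return accepted  # one target pass yields up to k+1 tokens

# Toy models: the target greedily emits len(ctx); the draft agrees for the
# first three tokens, then diverges.
target_next = lambda ctx: len(ctx)
draft_next = lambda ctx: len(ctx) if len(ctx) < 3 else -1
tokens = speculative_decode_step(draft_next, target_next, [], k=5)
# One verification pass accepted four tokens: [0, 1, 2, 3]
```

    The better the draft model predicts the target, the more tokens each verification pass yields — which is exactly the acceptance rate that degrades when workloads drift away from the speculator's training data.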

    Together AI today announced research and a new system called ATLAS (AdapTive-LeArning Speculator System) that aims to help enterprises overcome the challenge of static speculators. The technique provides a self-learning inference optimization capability that can deliver up to 400% faster inference than the baseline performance of existing inference technologies such as vLLM. The system addresses a critical problem: as AI workloads evolve, inference speeds degrade, even with specialized speculators in place.

    The company, which got its start in 2023, has focused on optimizing inference on its enterprise AI platform. Earlier this year it raised $305 million as customer adoption and demand have grown.

    "Companies we work with generally, as they scale up, they see shifting workloads, and then they don't see as much speedup from speculative execution as before," Tri Dao, chief scientist at Together AI, told VentureBeat in an exclusive interview. "These speculators generally don't work well when their workload domain starts to shift."

    The workload drift problem no one talks about

    Most speculators in production today are "static" models. They're trained once on a fixed dataset representing expected workloads, then deployed without any ability to adapt. Companies like Meta and Mistral ship pre-trained speculators alongside their main models. Inference platforms like vLLM use these static speculators to boost throughput without changing output quality.

    But there's a catch. When an enterprise's AI usage evolves, the static speculator's accuracy plummets.

    "If you're a company producing coding agents, and most of your developers have been writing in Python, all of a sudden some of them switch to writing Rust or C, then you see the speed starts to go down," Dao explained. "The speculator has a mismatch between what it was trained on versus what the actual workload is."

    This workload drift represents a hidden tax on scaling AI. Enterprises either accept degraded performance or invest in retraining custom speculators. That process captures only a snapshot in time and quickly becomes outdated.

    How adaptive speculators work: A dual-model approach

    ATLAS uses a dual-speculator architecture that combines stability with adaptation:

    The static speculator – A heavyweight model trained on broad data provides consistent baseline performance. It serves as a "speed floor."

    The adaptive speculator – A lightweight model learns continuously from live traffic. It specializes on-the-fly to emerging domains and usage patterns.

    The confidence-aware controller – An orchestration layer dynamically chooses which speculator to use. It adjusts the speculation "lookahead" based on confidence scores.

    "Before the adaptive speculator learns anything, we still have the static speculator to help provide the speed boost in the beginning," Ben Athiwaratkun, staff AI scientist at Together AI, explained to VentureBeat. "Once the adaptive speculator becomes more confident, then the speed grows over time."

    The technical innovation lies in balancing acceptance rate (how often the target model agrees with drafted tokens) and draft latency. As the adaptive model learns from traffic patterns, the controller relies more on the lightweight speculator and extends lookahead. This compounds performance gains.
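    A hypothetical sketch of how such a confidence-aware controller might route between the two speculators and stretch the lookahead as confidence grows. The function names, thresholds, and the linear scaling rule are illustrative assumptions, not Together AI's actual logic:

```python
def choose_speculator(adaptive_confidence, base_lookahead=3, max_lookahead=8,
                      threshold=0.7):
    """Route drafting to the adaptive speculator once its confidence
    (e.g., a running acceptance-rate estimate) clears a threshold, and
    extend the speculation lookahead as confidence grows."""
    if adaptive_confidence < threshold:
        # Fall back to the static speculator: the guaranteed "speed floor".
        return "static", base_lookahead
    # Scale the extra lookahead linearly with confidence above the threshold.
    extra = int((adaptive_confidence - threshold) / (1 - threshold)
                * (max_lookahead - base_lookahead))
    return "adaptive", base_lookahead + extra

# Early in deployment the adaptive model is untrusted; later it takes over
# with a longer lookahead, compounding the speedup.
routes = [choose_speculator(c) for c in (0.5, 0.9, 1.0)]
```

    The key property is graceful degradation: a cold or drifting adaptive speculator never makes things worse than the static baseline.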

    Users don't need to tune any parameters. "On the user side, users don't have to turn any knobs," Dao said. "On our side, we have turned these knobs for users to adjust in a configuration that gets good speedup."

    Performance that rivals custom silicon

    Together AI's testing shows ATLAS reaching 500 tokens per second on DeepSeek-V3.1 when fully adapted. More impressively, those numbers on Nvidia B200 GPUs match or exceed specialized inference chips like Groq's custom hardware.

    "The software and algorithmic improvement is able to close the gap with really specialized hardware," Dao said. "We were seeing 500 tokens per second on these huge models that are even faster than some of the customized chips."

    The 400% speedup that the company claims for inference represents the cumulative effect of Together's Turbo optimization suite. FP4 quantization delivers 80% speedup over FP8 baseline. The static Turbo Speculator adds another 80-100% gain. The adaptive system layers on top. Each optimization compounds the benefits of the others.
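    Assuming the gains stack multiplicatively, a quick back-of-the-envelope calculation shows how the static optimizations alone deliver most of the claimed improvement. The numbers come from the article; the multiplicative-stacking assumption and the midpoint choice for the speculator gain are this sketch's, not the company's:

```python
# Illustrative compounding of the stacked Turbo optimizations.
fp4_gain = 1.8          # FP4 quantization: +80% over the FP8 baseline
speculator_gain = 1.9   # static Turbo Speculator: +80-100% (midpoint used)
combined = fp4_gain * speculator_gain   # ~3.42x before the adaptive layer
```

    The adaptive speculator then layers its workload-specific gains on top of this roughly 3.4x base to reach the headline figure.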

    Compared to standard inference engines like vLLM or Nvidia's TensorRT-LLM, the improvement is substantial. Together AI benchmarks against the stronger baseline between the two for each workload before applying speculative optimizations.

    The memory-compute tradeoff explained

    The performance gains stem from exploiting a fundamental inefficiency in modern inference: wasted compute capacity.

    Dao explained that typically during inference, much of the compute power is not fully utilized.

    "During inference, which is actually the dominant workload nowadays, you're mostly using the memory subsystem," he said.

    Speculative decoding trades idle compute for reduced memory access. When a model generates one token at a time, it's memory-bound. The GPU sits idle while waiting for memory. But when the speculator proposes five tokens and the target model verifies them simultaneously, compute utilization spikes while memory access remains roughly constant.

    "The total amount of compute to generate five tokens is the same, but you only had to access memory once, instead of five times," Dao said.

    Think of it as intelligent caching for AI

    For infrastructure teams familiar with traditional database optimization, adaptive speculators function like an intelligent caching layer, but with a crucial difference.

    Traditional caching systems like Redis or memcached require exact matches. You store the exact same query result and retrieve it when that specific query runs again. Adaptive speculators work differently.

    "You can view it as an intelligent way of caching, not storing exactly, but figuring out some patterns that you see," Dao explained. "Broadly, we're observing that you're working with similar code, or working with similar, you know, controlling compute in a similar way. We can then predict what the big model is going to say. We just get better and better at predicting that."

    Rather than storing exact responses, the system learns patterns in how the model generates tokens. It recognizes that if you're editing Python files in a specific codebase, certain token sequences become more likely. The speculator adapts to those patterns, improving its predictions over time without requiring identical inputs.

    Use cases: RL training and evolving workloads

    Two enterprise scenarios particularly benefit from adaptive speculators:

    Reinforcement learning training: Static speculators quickly fall out of alignment as the policy evolves during training. ATLAS adapts continuously to the shifting policy distribution.

    Evolving workloads: As enterprises discover new AI use cases, workload composition shifts. "Maybe they started using AI for chatbots, but then they realized, hey, it can write code, so they start shifting to code," Dao said. "Or they realize these AIs can actually call tools and control computers and do accounting and things like that."

    In a vibe-coding session, the adaptive system can specialize to the specific codebase being edited, including files never seen during training, further increasing acceptance rates and decoding speed.

    What it means for enterprises and the inference ecosystem

    ATLAS is available now on Together AI's dedicated endpoints as part of the platform at no additional cost. The company's 800,000-plus developers (up from 450,000 in February) have access to the optimization.

    But the broader implications extend beyond one vendor's product. The shift from static to adaptive optimization represents a fundamental rethinking of how inference platforms should work. As enterprises deploy AI across multiple domains, the industry will need to move beyond one-time trained models toward systems that learn and improve continuously.

    Together AI has historically released some of its research techniques as open source and collaborated with projects like vLLM. While the fully integrated ATLAS system is proprietary, some of the underlying techniques may eventually influence the broader inference ecosystem. 

    For enterprises looking to lead in AI, the message is clear: adaptive algorithms on commodity hardware can match custom silicon at a fraction of the cost. As this approach matures across the industry, software optimization increasingly trumps specialized hardware.

  • Echelon’s AI agents take aim at Accenture and Deloitte consulting models

    Echelon, an artificial intelligence startup that automates enterprise software implementations, emerged from stealth mode today with $4.75 million in seed funding led by Bain Capital Ventures, targeting a fundamental shift in how companies deploy and maintain critical business systems.

    The San Francisco-based company has developed AI agents specifically trained to handle end-to-end ServiceNow implementations — complex enterprise software deployments that traditionally require months of work by offshore consulting teams and cost companies millions of dollars annually.

    "The biggest barrier to digital transformation isn't technology — it's the time it takes to implement it," said Rahul Kayala, Echelon's founder and CEO, who previously worked at AI-powered IT company Moveworks. "AI agents are eliminating that constraint entirely, allowing enterprises to experiment, iterate, and deploy platform changes with unprecedented speed."

    The announcement signals a potential disruption to the $1.5 trillion global IT services market, where companies like Accenture, Deloitte, and Capgemini have long dominated through labor-intensive consulting models that Echelon argues are becoming obsolete in the age of artificial intelligence.

    Why ServiceNow deployments take months and cost millions

    ServiceNow, a cloud-based platform used by enterprises to manage IT services, human resources, and business workflows, has become critical infrastructure for large organizations. However, implementing and customizing the platform typically requires specialized expertise that most companies lack internally.

    The complexity stems from ServiceNow's vast customization capabilities. Organizations often need hundreds of "catalog items" — digital forms and workflows for employee requests — each requiring specific configurations, approval processes, and integrations with existing systems. According to Echelon's research, these implementations frequently stretch far beyond planned timelines due to technical complexity and communication bottlenecks between business stakeholders and development teams.

    "What starts out simple often turns into weeks of effort once the actual work begins," the company noted in its analysis of common implementation challenges. "A basic request form turns out to be five requests stuffed into one. We had catalog items with 50+ variables, 10 or more UI policies, all connected. Update one field, and something else would break."

    The traditional solution involves hiring offshore development teams or expensive consultants, creating what Echelon describes as a problematic cycle: "One question here, one delay there, and suddenly you're weeks behind."

    How AI agents replace expensive offshore consulting teams

    Echelon's approach replaces human consultants with AI agents trained by elite ServiceNow experts from top consulting firms. These agents can analyze business requirements, ask clarifying questions in real-time, and automatically generate complete ServiceNow configurations including forms, workflows, testing scenarios, and documentation.

    The technology represents a significant advance over general-purpose AI tools. Rather than providing generic code suggestions, Echelon's agents understand ServiceNow's specific architecture, best practices, and common integration patterns. They can identify gaps in requirements and propose solutions that align with enterprise governance standards.

    "Instead of routing every piece of input through five people, the business process owner directly uploaded their requirements," Kayala explained, describing a recent customer implementation. "The AI developer analyzes it and asks follow-up questions like: 'I see a process flow with 3 branches, but only 2 triggers. Should there be a 3rd?' The kinds of things a seasoned developer would ask. With AI, these questions came instantly."

    Early customers report dramatic time savings. At one financial services company, a service catalog migration projected to take six months was completed in six weeks using Echelon's AI agents.

    What makes Echelon's AI different from coding assistants

    Echelon's technology addresses several technical challenges that have prevented broader AI adoption in enterprise software implementation. The agents are trained not just on ServiceNow's technical capabilities but on the accumulated expertise of senior consultants who understand complex enterprise requirements, governance frameworks, and integration patterns.

    This approach differs from general-purpose AI coding assistants like GitHub Copilot, which provide syntax suggestions but lack domain-specific expertise. Echelon's agents understand ServiceNow's data models, security frameworks, and upgrade considerations—knowledge typically acquired through years of consulting experience.

    The company's training methodology involves elite ServiceNow experts from consulting firms like Accenture and specialized ServiceNow partner Thirdera. This embedded expertise enables the AI to handle complex requirements and edge cases that typically require senior consultant intervention.

    The real challenge isn't teaching AI to write code — it's capturing the intuitive expertise that separates junior developers from seasoned architects. Senior ServiceNow consultants instinctively know which customizations will break during upgrades and how simple requests spiral into complex integration problems. This institutional knowledge creates a far more defensible moat than general-purpose coding assistants can offer.

    The $1.5 trillion consulting market faces disruption

    Echelon's emergence reflects broader trends reshaping the enterprise software market. As companies accelerate digital transformation initiatives, the traditional consulting model increasingly appears inadequate for the speed and scale required.

    ServiceNow itself has grown rapidly, reporting over $10.98 billion in annual revenue in 2024, and $12.06 billion for the trailing twelve months ending June 30, 2025, as organizations continue to digitize more business processes. However, this growth has created a persistent talent shortage, with demand for skilled ServiceNow professionals — particularly those with AI expertise — significantly outpacing supply.

    The startup's approach could fundamentally alter the economics of enterprise software implementation. Traditional consulting engagements often involve large teams working for months, with costs scaling linearly with project complexity. AI agents, by contrast, can handle multiple projects simultaneously and apply learned knowledge across customers.

    Rak Garg, the Bain Capital Ventures partner who led Echelon's funding round, sees this as part of a larger shift toward AI-powered professional services. "We see the same trend with other BCV companies like Prophet Security, which automates security operations, and Crosby, which automates legal services for startups. AI is quickly becoming the delivery layer across multiple functions."

    Scaling beyond ServiceNow while maintaining enterprise reliability

    Despite early success, Echelon faces significant challenges in scaling its approach. Enterprise customers prioritize reliability above speed, and any AI-generated configurations must meet strict security and compliance requirements.

    "Inertia is the biggest risk," Garg acknowledged. "IT systems shouldn't ever go down, and companies lose thousands of man-hours of productivity with every outage. Proving reliability at scale, and building on repeatable results will be critical for Echelon."

    The company plans to expand beyond ServiceNow to other enterprise platforms including SAP, Salesforce, and Workday — each creating substantial additional market opportunities. However, each platform requires developing new domain expertise and training models on platform-specific best practices.

    Echelon also faces potential competition from established consulting firms that are developing their own AI capabilities. However, Garg views these firms as potential partners rather than competitors, noting that many have already approached Echelon about collaboration opportunities.

    "They know that AI is shifting their business model in real-time," he said. "Customers are placing immense pricing pressure on larger firms and asking hard questions, and these firms can use Echelon agents to accelerate their projects."

    How AI agents could reshape all professional services

    Echelon's funding and emergence from stealth marks a significant milestone in the application of AI to professional services. Unlike consumer AI applications that primarily enhance individual productivity, enterprise AI agents like Echelon's directly replace skilled labor at scale.

    The company's approach — training AI systems on expert knowledge rather than just technical documentation — could serve as a model for automating other complex professional services. Legal research, financial analysis, and technical consulting all involve similar patterns of applying specialized expertise to unique customer requirements.

    For enterprise customers, the promise extends beyond cost savings to strategic agility. Organizations that can rapidly implement and modify business processes gain competitive advantages in markets where customer expectations and regulatory requirements change frequently.

    As Kayala noted, "This unlocks a completely different approach to business agility and competitive advantage."

    The implications extend far beyond ServiceNow implementations. If AI agents can master the intricacies of enterprise software deployment—one of the most complex and relationship-dependent areas of professional services — few knowledge work domains may remain immune to automation.

    The question isn't whether AI will transform professional services, but how quickly human expertise can be converted into autonomous digital workers that never sleep, never leave for competitors, and get smarter with every project they complete.

  • The most important OpenAI announcement you probably missed at DevDay 2025

    OpenAI’s annual developer conference on Monday was a spectacle of ambitious AI product launches, from an app store for ChatGPT to a stunning video-generation API that brought creative concepts to life. But for the enterprises and technical leaders watching closely, the most consequential announcement was the quiet general availability of Codex, the company's AI software engineer. This release signals a profound shift in how software—and by extension, modern business—is built.

    While other announcements captured the public’s imagination, the production-ready release of Codex, supercharged by a new specialized model and a suite of enterprise-grade tools, is the engine behind OpenAI’s entire vision. It is the tool that builds the tools, the proven agent in a world buzzing with agentic potential, and the clearest articulation of the company's strategy to win the enterprise.

    The general availability of Codex moves it from a "research preview" to a fully supported product, complete with a new software development kit (SDK), a Slack integration, and administrative controls for security and monitoring. This transition marks Codex as ready for mission-critical work inside the world’s largest companies.

    "We think this is the best time in history to be a builder; it has never been faster to go from idea to product," said OpenAI CEO Sam Altman during the opening keynote presentation. "Software used to take months or years to build. You saw that it can take minutes now to build with AI." 

    That acceleration is not theoretical. It's a reality born from OpenAI’s own internal use — a massive "dogfooding" effort that serves as the ultimate case study for enterprise customers.

    Inside GPT-5-Codex: The AI model that codes autonomously for hours and drives 70% productivity gains

    At the heart of the Codex upgrade is GPT-5-Codex, a version of OpenAI's latest flagship model that has been "purposely trained for Codex and agentic coding." The new model is designed to function as an autonomous teammate, moving far beyond simple code autocompletion.

    "I personally like to think about it as a little bit like a human teammate," explained Tibo Sottiaux, an OpenAI engineer, during a technical session on Codex. "You can pair program with it on your computer, you can delegate to it, or as you'll see, you can give it a job without explicit prompting."

    This new model enables "adaptive thinking," allowing it to dynamically adjust the time and computational effort spent on a task based on its complexity. For simple requests, it's fast and efficient, but for complex refactoring projects, it can work for hours.

    One engineer during the technical session noted, "I've seen the GPT-5-Codex model work for over seven hours productively… on a marathon session." This capability to handle long-running, complex tasks is a significant leap beyond the simple, single-shot interactions that define most AI coding assistants.

    The results inside OpenAI have been dramatic. The company reported that 92% of its technical staff now uses Codex daily, and those engineers complete 70% more pull requests (a measure of code contribution) each week. Usage has surged tenfold since August. 

    "When we as a team see the stats, it feels great," Sottiaux shared. "But even better is being at lunch with someone who then goes 'Hey I use Codex all the time. Here's a cool thing that I do with it. Do you want to hear about it?'" 

    How OpenAI uses Codex to build its own AI products and catch hundreds of bugs daily

    Perhaps the most compelling argument for Codex’s importance is that it is the foundational layer upon which OpenAI’s other flashy announcements were built. During the DevDay event, the company showcased custom-built arcade games and a dynamic, AI-powered website for the conference itself, all developed using Codex.

    In one session, engineers demonstrated how they built "Storyboard," a custom creative tool for the film industry, in just 48 hours during an internal hackathon. "We decided to test Codex, our coding agent… we would send tasks to Codex in between meetings. We really easily reviewed and merged PRs into production, which Codex even allowed us to do from our phones," said Allison August, a solutions engineering leader at OpenAI. 

    This reveals a critical insight: the rapid innovation showcased at DevDay is a direct result of the productivity flywheel created by Codex. The AI is a core part of the manufacturing process for all other AI products.

    A key enterprise-focused feature is the new, more robust code review capability. OpenAI said it "purposely trained GPT-5-Codex to be great at ultra thorough code review," enabling it to explore dependencies and validate a programmer's intent against the actual implementation to find high-quality bugs. Internally, nearly every pull request at OpenAI is now reviewed by Codex, catching hundreds of issues daily before they reach a human reviewer.

    "It saves you time, you ship with more confidence," Sottiaux said. "There's nothing worse than finding a bug after we actually ship the feature." 

    Why enterprise software teams are choosing Codex over GitHub Copilot for mission-critical development

    The maturation of Codex is central to OpenAI’s broader strategy to conquer the enterprise market, a move essential to justifying its massive valuation and unprecedented compute expenditures. During a press conference, CEO Sam Altman confirmed the strategic shift.

    "The models are there now, and you should expect a huge focus from us on really winning enterprises with amazing products, starting here," Altman said during a private press conference. 

    OpenAI President and Co-founder Greg Brockman immediately added, "And you can see it already with Codex, which I think has been just an incredible success and has really grown super fast." 

    For technical decision-makers, the message is clear. While consumer-facing agents that book dinner reservations are still finding their footing, Codex is a proven enterprise agent delivering substantial ROI today. Companies like Cisco have already rolled out Codex to their engineering organizations, cutting code review times by 50% and reducing project timelines from weeks to days.

    With the new Codex SDK, companies can now embed this agentic power directly into their own custom workflows, such as automating fixes in a CI/CD pipeline or even creating self-evolving applications. During a live demo, an engineer showcased a mobile app that updated its own user interface in real-time based on a natural language prompt, all powered by the embedded Codex SDK. 

    While the launch of an app ecosystem in ChatGPT and the breathtaking visuals of the Sora 2 API rightfully generated headlines, the general availability of Codex marks a more fundamental and immediate transformation. It is the quiet but powerful engine driving the next era of software development, turning the abstract promise of AI-driven productivity into a tangible, deployable reality for businesses today.

  • To scale agentic AI, Notion tore down its tech stack and started fresh

    Many organizations would be hesitant to overhaul their tech stack and start from scratch.

    Not Notion.

    For the 3.0 version of its productivity software (released in September), the company didn’t hesitate to rebuild from the ground up; it recognized that doing so was, in fact, necessary to support agentic AI at enterprise scale.

    Whereas traditional AI-powered workflows involve explicit, step-by-step instructions based on few-shot learning, AI agents powered by advanced reasoning models can reason about tool definitions, identify and comprehend the tools at their disposal, and plan their next steps.

    “Rather than trying to retrofit into what we were building, we wanted to play to the strengths of reasoning models,” Sarah Sachs, Notion’s head of AI modeling, told VentureBeat. “We've rebuilt a new architecture because workflows are different from agents.”

    Re-orchestrating so models can work autonomously

    Notion has been adopted by 94% of Forbes AI 50 companies, has 100 million total users and counts among its customers OpenAI, Cursor, Figma, Ramp and Vercel.

    In a rapidly evolving AI landscape, the company identified the need to move beyond simpler, task-based workflows to goal-oriented reasoning systems that allow agents to autonomously select, orchestrate, and execute tools across connected environments.

    Very quickly, reasoning models have become “far better” at learning to use tools and follow chain-of-thought (CoT) instructions, Sachs noted. This allows them to be “far more independent” and make multiple decisions within one agentic workflow. “We rebuilt our AI system to play to that," she said.

    From an engineering perspective, this meant replacing rigid prompt-based flows with a unified orchestration model, Sachs explained. This core model is supported by modular sub-agents that search Notion and the web, query and add to databases and edit content.

    Each agent uses tools contextually; for instance, they can decide whether to search Notion itself, or another platform like Slack. The model will perform successive searches until the relevant information is found. It can then, for instance, convert notes into proposals, create follow-up messages, track tasks, and spot and make updates in knowledge bases.

    In Notion 2.0, the team focused on having AI perform specific tasks, which required them to “think exhaustively” about how to prompt the model, Sachs noted. However, with version 3.0, users can assign tasks to agents, and agents can actually take action and perform multiple tasks concurrently.

    “We reorchestrated it to be self-selecting on the tools, rather than few-shotting, which is explicitly prompting how to go through all these different scenarios,” Sachs explained. The aim is to ensure everything interfaces with AI and that “anything you can do, your Notion agent can do.”

    Bifurcating to isolate hallucinations

    Notion’s philosophy of “better, faster, cheaper” drives a continuous iteration cycle that balances latency and accuracy through fine-tuned vector embeddings and elastic search optimization. Sachs’ team employs a rigorous evaluation framework that combines deterministic tests, vernacular optimization, human-annotated data and LLMs-as-a-judge, with model-based scoring identifying discrepancies and inaccuracies.

    “By bifurcating the evaluation, we're able to identify where the problems come from, and that helps us isolate unnecessary hallucinations,” Sachs explained. Further, making the architecture itself simpler means it’s easier to make changes as models and techniques evolve.

    “We optimize latency and parallel thinking as much as possible,” which leads to “way better accuracy,” Sachs noted. Models are grounded in data from the web and the Notion connected workspace.

    Ultimately, Sachs reported, the investment in rebuilding its architecture has already provided Notion returns in terms of capability and faster rate of change.

    She added, “We are fully open to rebuilding it again, when the next breakthrough happens, if we have to.”

    Understanding contextual latency

    When building and fine-tuning models, it’s important to understand that latency is subjective: AI must provide the most relevant information, not necessarily the most information at the cost of speed.

    “You'd be surprised at the different ways customers are willing to wait for things and not wait for things,” Sachs said. It makes for an interesting experiment: How slow can you go before people abandon the model?

    With pure navigational search, for instance, users may not be as patient; they want answers near-immediately. “If you ask, ‘What's two plus two,’ you don't want to wait for your agent to be searching everywhere in Slack and JIRA,” Sachs pointed out.

    But the longer the time it's given, the more exhaustive a reasoning agent can be. For instance, Notion can perform 20 minutes of autonomous work across hundreds of websites, files and other materials. In these instances, users are more willing to wait, Sachs explained; they allow the model to execute in the background while they attend to other tasks.

    “It's a product question,” said Sachs. “How do we set user expectations from the UI? How do we ascertain user expectations on latency?”

    Notion is its biggest user

    Notion understands the importance of using its own product — in fact, its employees are among its biggest power users.

    Sachs explained that teams have active sandboxes that generate training and evaluation data, as well as a “really active” thumbs-up-thumbs-down user feedback loop. Users aren’t shy about saying what they think should be improved or features they’d like to see.

    Sachs emphasized that when a user thumbs down an interaction, they are explicitly giving permission for a human annotator to analyze that interaction in a way that de-identifies them as much as possible.

    “We are using our own tool as a company all day, every day, and so we get really fast feedback loops,” said Sachs. “We’re really dogfooding our own product.”

    That said, it’s their own product they’re building, Sachs noted, so they understand that they may have goggles on when it comes to quality and functionality. To balance this out, Notion has trusted "very AI-savvy" design partners who are granted early access to new capabilities and provide important feedback.

    Sachs emphasized that this is just as important as internal prototyping.

    “We're all about experimenting in the open, I think you get much richer feedback,” said Sachs. “Because at the end of the day, if we just look at how Notion uses Notion, we're not really giving the best experience to our customers.”

    Just as importantly, continuous internal testing allows teams to evaluate progressions and make sure models aren't regressing (when accuracy and performance degrade over time). "Everything you're doing stays faithful," Sachs explained. "You know that your latency is within bounds."

    Many companies make the mistake of focusing too intensely on retrospective evals; this makes it difficult for them to understand how or where they're improving, Sachs pointed out. Notion treats evals both as a "litmus test" of development and forward-looking progression, and as a means of observability and regression-proofing.

    “I think a big mistake a lot of companies make is conflating the two,” said Sachs. “We use them for both purposes; we think about them really differently.”

    Takeaways from Notion's journey

    For enterprises, Notion can serve as a blueprint for how to responsibly and dynamically operationalize agentic AI in a connected, permissioned enterprise workspace.

    Sachs’ takeaways for other tech leaders:

    • Don’t be afraid to rebuild when foundational capabilities change; Notion fully re-engineered its architecture to align with reasoning-based models.

    • Treat latency as contextual: Optimize per use case, rather than universally.

    • Ground all outputs in trustworthy, curated enterprise data to ensure accuracy and trust.

      She advised: “Be willing to make the hard decisions. Be willing to sit at the top of the frontier, so to speak, on what you're developing to build the best product you can for your customers.”

  • New memory framework builds AI agents that can handle the real world’s unpredictability

    Researchers at the University of Illinois Urbana-Champaign and Google Cloud AI Research have developed a framework that enables large language model (LLM) agents to organize their experiences into a memory bank, helping them get better at complex tasks over time.

    The framework, called ReasoningBank, distills “generalizable reasoning strategies” from an agent’s successful and failed attempts to solve problems. The agent then uses this memory during inference to avoid repeating past mistakes and make better decisions as it faces new problems. The researchers show that when combined with test-time scaling techniques, where an agent makes multiple attempts at a problem, ReasoningBank significantly improves the performance and efficiency of LLM agents.

    Their findings show that ReasoningBank consistently outperforms classic memory mechanisms across web browsing and software engineering benchmarks, offering a practical path toward building more adaptive and reliable AI agents for enterprise applications.

    The challenge of LLM agent memory

    As LLM agents are deployed in applications that run for long periods, they encounter a continuous stream of tasks. One of the key limitations of current LLM agents is their failure to learn from this accumulated experience. By approaching each task in isolation, they inevitably repeat past mistakes, discard valuable insights from related problems, and fail to develop skills that would make them more capable over time.

    The solution to this limitation is to give agents some kind of memory. Previous efforts to give agents memory have focused on storing past interactions for reuse by organizing information in various forms from plain text to structured graphs. However, these approaches often fall short. Many use raw interaction logs or only store successful task examples. This means they can't distill higher-level, transferable reasoning patterns and, crucially, they don’t extract and use the valuable information from the agent’s failures. As the researchers note in their paper, “existing memory designs often remain limited to passive record-keeping rather than providing actionable, generalizable guidance for future decisions.”

    How ReasoningBank works

    ReasoningBank is a memory framework designed to overcome these limitations. Its central idea is to distill useful strategies and reasoning hints from past experiences into structured memory items that can be stored and reused.

    According to Jun Yan, a Research Scientist at Google and co-author of the paper, this marks a fundamental shift in how agents operate. "Traditional agents operate statically—each task is processed in isolation," Yan explained. "ReasoningBank changes this by turning every task experience (successful or failed) into structured, reusable reasoning memory. As a result, the agent doesn’t start from scratch with each customer; it recalls and adapts proven strategies from similar past cases."

    The framework processes both successful and failed experiences and turns them into a collection of useful strategies and preventive lessons. The agent judges success and failure through LLM-as-a-judge schemes to obviate the need for human labeling.

    Yan provides a practical example of this process in action. An agent tasked with finding Sony headphones might fail because its broad search query returns over 4,000 irrelevant products. "ReasoningBank will first try to figure out why this approach failed," Yan said. "It will then distill strategies such as ‘optimize search query’ and ‘confine products with category filtering.’ Those strategies will be extremely useful to get future similar tasks successfully done."

    The process operates in a closed loop. When an agent faces a new task, it uses an embedding-based search to retrieve relevant memories from ReasoningBank to guide its actions. These memories are inserted into the agent’s system prompt, providing context for its decision-making. Once the task is completed, the framework creates new memory items to extract insights from successes and failures. This new knowledge is then analyzed, distilled, and merged into the ReasoningBank, allowing the agent to continuously evolve and improve its capabilities.
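    The closed loop described above — retrieve relevant memories by embedding similarity, act, then distill and merge new lessons — can be sketched in a few lines. This is an illustrative reconstruction, not the paper's actual code; all class and method names are my own stand-ins, and the cosine ranking is a minimal placeholder for a real embedding search.

    ```python
    # Minimal sketch of the ReasoningBank retrieve/act/distill loop.
    # Names (MemoryItem, ReasoningBank, retrieve, add) are illustrative
    # stand-ins, not the paper's published API.
    from dataclasses import dataclass, field

    @dataclass
    class MemoryItem:
        title: str        # short strategy name, e.g. "optimize search query"
        content: str      # actionable guidance distilled from past runs
        embedding: list   # vector used for similarity retrieval

    @dataclass
    class ReasoningBank:
        items: list = field(default_factory=list)

        def retrieve(self, query_vec, k=3):
            # Embedding-based search: rank stored strategies by cosine similarity
            # to the new task, then surface the top-k as prompt context.
            def cosine(a, b):
                dot = sum(x * y for x, y in zip(a, b))
                na = sum(x * x for x in a) ** 0.5
                nb = sum(x * x for x in b) ** 0.5
                return dot / (na * nb) if na and nb else 0.0
            ranked = sorted(self.items,
                            key=lambda m: cosine(query_vec, m.embedding),
                            reverse=True)
            return ranked[:k]

        def add(self, new_items):
            # Lessons distilled from both successes and failures are merged in,
            # so the agent improves continuously across tasks.
            self.items.extend(new_items)
    ```

    In practice the retrieved items would be rendered into the agent's system prompt before it acts, and the post-task distillation step would be performed by an LLM rather than hand-written rules.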

    Supercharging memory with scaling

    The researchers found a powerful synergy between memory and test-time scaling. Classic test-time scaling involves generating multiple independent answers to the same question, but the researchers argue that this “vanilla form is suboptimal because it does not leverage inherent contrastive signal that arises from redundant exploration on the same problem.”

    To address this, they propose Memory-aware Test-Time Scaling (MaTTS), which integrates scaling with ReasoningBank. MaTTS comes in two forms. In “parallel scaling,” the system generates multiple trajectories for the same query, then compares and contrasts them to identify consistent reasoning patterns. In sequential scaling, the agent iteratively refines its reasoning within a single attempt, with the intermediate notes and corrections also serving as valuable memory signals.

    This creates a virtuous cycle: the existing memory in ReasoningBank steers the agent toward more promising solutions, while the diverse experiences generated through scaling enable the agent to create higher-quality memories to store in ReasoningBank. 

    “This positive feedback loop positions memory-driven experience scaling as a new scaling dimension for agents,” the researchers write.
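    The "parallel scaling" variant of MaTTS described above — sample several trajectories for one query, then keep the reasoning patterns that recur across them — can be approximated with a simple frequency filter. This is a toy sketch under my own assumptions (trajectories as lists of discrete steps, a majority-vote threshold), not the paper's implementation.

    ```python
    # Toy sketch of MaTTS parallel scaling: steps that appear consistently
    # across multiple sampled trajectories become memory candidates.
    # The representation of a "step" as a hashable string is an assumption.
    from collections import Counter

    def consistent_patterns(trajectories, min_fraction=0.5):
        """Return steps appearing in at least min_fraction of the sampled
        trajectories; these are the contrastive signal a vanilla
        best-of-N scheme would discard."""
        counts = Counter(step for traj in trajectories for step in set(traj))
        threshold = min_fraction * len(trajectories)
        return {step for step, n in counts.items() if n >= threshold}
    ```

    Sequential scaling would instead mine one refined trajectory's intermediate notes and corrections, but the principle is the same: redundant exploration is raw material for memory, not waste.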

    ReasoningBank in action

    The researchers tested their framework on WebArena (web browsing) and SWE-Bench-Verified (software engineering) benchmarks, using models like Google’s Gemini 2.5 Pro and Anthropic’s Claude 3.7 Sonnet. They compared ReasoningBank against baselines including memory-free agents and agents using trajectory-based or workflow-based memory frameworks.

    The results show that ReasoningBank consistently outperforms these baselines across all datasets and LLM backbones. On WebArena, it improved the overall success rate by up to 8.3 percentage points compared to a memory-free agent. It also generalized better on more difficult, cross-domain tasks, while reducing the number of interaction steps needed to complete tasks. When combined with MaTTS, both parallel and sequential scaling further boosted performance, consistently outperforming standard test-time scaling.

    This efficiency gain has a direct impact on operational costs. Yan points to a case where a memory-free agent took eight trial-and-error steps just to find the right product filter on a website. "Those trial and error costs could be avoided by leveraging relevant insights from ReasoningBank," he noted. "In this case, we save almost twice the operational costs," which also improves the user experience by resolving issues faster.

    For enterprises, ReasoningBank can help develop cost-effective agents that can learn from experience and adapt over time in complex workflows and areas like software development, customer support, and data analysis. As the paper concludes, “Our findings suggest a practical pathway toward building adaptive and lifelong-learning agents.”

    Yan confirmed that their findings point toward a future of truly compositional intelligence. For example, a coding agent could learn discrete skills like API integration and database management from separate tasks. "Over time, these modular skills… become building blocks the agent can flexibly recombine to solve more complex tasks," he said, suggesting a future where agents can autonomously assemble their knowledge to manage entire workflows with minimal human oversight.

  • Google’s AI can now surf the web for you, click on buttons, and fill out forms with Gemini 2.5 Computer Use

    Some of the largest providers of large language models (LLMs) have sought to move beyond multimodal chatbots — extending their models out into "agents" that can actually take more actions on behalf of the user across websites. Recall OpenAI's ChatGPT Agent (formerly known as "Operator") and Anthropic's Computer Use, both released over the last two years.

    Now, Google is getting into that same game as well. Today, the search giant's DeepMind AI lab subsidiary unveiled a new, fine-tuned and custom-trained version of its powerful Gemini 2.5 Pro LLM known as "Gemini 2.5 Pro Computer Use," which can use a virtual browser to surf the web on your behalf, retrieve information, fill out forms, and even take actions on websites — all from a user's single text prompt.

    "These are early days, but the model’s ability to interact with the web – like scrolling, filling forms + navigating dropdowns – is an important next step in building general-purpose agents," said Google CEO Sundar Pichai, as part of a longer statement on the social network, X.

    The model is not available for consumers directly from Google, though.

    Instead, Google partnered with another company, Browserbase, founded by former Twilio engineer Paul Klein in early 2024, which offers virtual "headless" web browsers specifically for use by AI agents and applications. (A "headless" browser is one that doesn't require a graphical user interface, or GUI, to navigate the web, though in this case and others, Browserbase does show a graphical representation for the user.)

    Users can demo the new Gemini 2.5 Computer Use model directly on Browserbase and even compare it side-by-side with the older, rival offerings from OpenAI and Anthropic in a new "Browser Arena" launched by the startup (though only one additional model can be selected alongside Gemini at a time).

    For AI builders and developers, it's being made available as a raw, albeit proprietary, LLM through the Gemini API in Google AI Studio for rapid prototyping, and through Google Cloud's Vertex AI model selector and application-building platform.

    The new offering builds on the capabilities of Gemini 2.5 Pro, released back in March 2025 but which has been updated significantly several times since then, with a specific focus on enabling AI agents to perform direct interactions with user interfaces, including browsers and mobile applications.

    Overall, it appears Gemini 2.5 Computer Use is designed to let developers create agents that can complete interface-driven tasks autonomously — such as clicking, typing, scrolling, filling out forms, and navigating behind login screens.

    Rather than relying solely on APIs or structured inputs, this model allows AI systems to interact with software visually and functionally, much like a human would.

    Brief User Hands-On Tests

    In my brief, unscientific initial hands-on tests on the Browserbase website, Gemini 2.5 Computer Use successfully navigated to Taylor Swift's official website as instructed and provided me with a summary of what was being sold or promoted at the top — a special edition of her newest album, "The Life of a Showgirl."

    In another test, I asked Gemini 2.5 Computer Use to search Amazon for highly rated and well-reviewed solar lights I could stake into my back yard, and I was delighted to watch as it successfully completed a Google Search Captcha designed to weed out non-human users ("Select all the boxes with a motorcycle.") It did so in a matter of seconds.

    However, once it got through there, it stalled and was unable to complete the task, despite serving up a "task completed" message.

    I should also note here that while the ChatGPT agent from OpenAI and Anthropic's Claude can create and edit local files — such as PowerPoint presentations, spreadsheets, or text documents — on the user’s behalf, Gemini 2.5 Computer Use does not currently offer direct file system access or native file creation capabilities.

    Instead, it is designed to control and navigate web and mobile user interfaces through actions like clicking, typing, and scrolling. Its output is limited to suggested UI actions or chatbot-style text responses; any structured output like a document or file must be handled separately by the developer, often through custom code or third-party integrations.

    Performance Benchmarks

    Google says Gemini 2.5 Computer Use has demonstrated leading results in multiple interface control benchmarks, particularly when compared to other major AI systems including Claude Sonnet and OpenAI’s agent-based models.

    Evaluations were conducted via Browserbase and Google’s own testing.

    Some highlights include:

    • Online-Mind2Web (Browserbase): 65.7% for Gemini 2.5 vs. 61.0% (Claude Sonnet 4) and 44.3% (OpenAI Agent)

    • WebVoyager (Browserbase): 79.9% for Gemini 2.5 vs. 69.4% (Claude Sonnet 4) and 61.0% (OpenAI Agent)

    • AndroidWorld (DeepMind): 69.7% for Gemini 2.5 vs. 62.1% (Claude Sonnet 4); OpenAI's model could not be measured due to lack of access

    • OSWorld: Currently not supported by Gemini 2.5; top competitor result was 61.4%

    In addition to strong accuracy, Google reports that the model operates at lower latency than other browser control solutions — a key factor in production use cases like UI automation and testing.

    How It Works

    Agents powered by the Computer Use model operate within an interaction loop. They receive:

    • A user task prompt

    • A screenshot of the interface

    • A history of past actions

    The model analyzes this input and produces a recommended UI action, such as clicking a button or typing into a field.

    If needed, it can request confirmation from the end user for riskier tasks, such as making a purchase.

    Once the action is executed, the interface state is updated and a new screenshot is sent back to the model. The loop continues until the task is completed or halted due to an error or a safety decision.

    The model uses a specialized tool called computer_use, and it can be integrated into custom environments using tools like Playwright or via the Browserbase demo sandbox.
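    The interaction loop described above — task prompt plus screenshot plus action history in, a recommended UI action out, repeated on the updated interface state — can be sketched as follows. The suggest/execute/screenshot callables are hypothetical stand-ins for the Gemini `computer_use` tool and a browser driver such as Playwright; this is not Google's actual API.

    ```python
    # Illustrative sketch of the Computer Use interaction loop.
    # suggest, execute, and screenshot_fn are stand-in callables the
    # integrator would wire to the model API and a browser driver.

    def run_agent(task, suggest, execute, screenshot_fn,
                  confirm=lambda action: True, max_steps=20):
        """Loop: send task + screenshot + action history to the model,
        execute the suggested UI action, and repeat on the new state."""
        history = []
        shot = screenshot_fn()                     # initial interface state
        for _ in range(max_steps):
            action = suggest(task, shot, history)  # model's recommended action
            if action["type"] == "done":
                break                              # task completed or halted
            if action.get("requires_confirmation") and not confirm(action):
                break                              # e.g. a purchase the user declines
            execute(action)
            history.append(action)
            shot = screenshot_fn()                 # fresh screenshot closes the loop
        return history
    ```

    The `confirm` hook mirrors the safety behavior the article describes: riskier actions pause for explicit end-user approval before execution.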

    Use Cases and Adoption

    According to Google, teams internally and externally have already started using the model across several domains:

    • Google’s payments platform team reports that Gemini 2.5 Computer Use successfully recovers over 60% of failed test executions, reducing a major source of engineering inefficiencies.

    • Autotab, a third-party AI agent platform, said the model outperformed others on complex data parsing tasks, boosting performance by up to 18% in their hardest evaluations.

    • Poke.com, a proactive AI assistant provider, noted that the Gemini model often operates 50% faster than competing solutions during interface interactions.

    The model is also being used in Google’s own product development efforts, including in Project Mariner, the Firebase Testing Agent, and AI Mode in Search.

    Safety Measures

    Because this model directly controls software interfaces, Google emphasizes a multi-layered approach to safety:

    • A per-step safety service inspects every proposed action before execution.

    • Developers can define system-level instructions to block or require confirmation for specific actions.

    • The model includes built-in safeguards to avoid actions that might compromise security or violate Google’s prohibited use policies.

    For example, if the model encounters a CAPTCHA, it will generate an action to click the checkbox but flag it as requiring user confirmation, ensuring the system does not proceed without human oversight.

    Technical Capabilities

    The model supports a wide array of built-in UI actions such as:

    • click_at, type_text_at, scroll_document, drag_and_drop, and more

    • User-defined functions can be added to extend its reach to mobile or custom environments

    • Screen coordinates are normalized (0–1000 scale) and translated back to pixel dimensions during execution

    It accepts image and text input and outputs text responses or function calls to perform tasks. The recommended screen resolution for optimal results is 1440×900, though it can work with other sizes.
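    The coordinate handling mentioned above — model output on a normalized 0–1000 scale, translated back to pixels at execution time — works out to a simple scaling step. The exact rounding behavior is my assumption; this is a sketch of the conversion, not Google's client code.

    ```python
    # Map the model's normalized (0-1000) coordinates to pixel positions.
    # Defaults use the recommended 1440x900 resolution from the article.

    def denormalize(x_norm, y_norm, width=1440, height=900):
        """Convert normalized model coordinates to pixels for the
        target screen; rounding to the nearest pixel is assumed."""
        x_px = round(x_norm / 1000 * width)
        y_px = round(y_norm / 1000 * height)
        return x_px, y_px
    ```

    A click at the center of the screen, for example, arrives from the model as (500, 500) regardless of the actual resolution.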

    API Pricing Remains Almost Identical to Gemini 2.5 Pro

    The pricing for Gemini 2.5 Computer Use aligns closely with the standard Gemini 2.5 Pro model. Both follow the same per-token billing structure: input tokens are priced at $1.25 per one million tokens for prompts under 200,000 tokens, and $2.50 per million tokens for prompts longer than that.

    Output tokens follow a similar split, priced at $10.00 per million for smaller responses and $15.00 for larger ones.
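    A back-of-envelope cost estimate using the rates quoted above looks like this. The article gives the 200,000-token threshold only for input pricing; applying the same threshold to output is my assumption, so treat this as a rough sketch rather than Google's billing formula.

    ```python
    # Rough per-request cost estimate from the quoted rates.
    # Input: $1.25/M tokens under 200K prompt tokens, $2.50/M above.
    # Output: $10/M vs. $15/M; the 200K output threshold is assumed.

    def estimate_cost(prompt_tokens, output_tokens):
        in_rate = 1.25 if prompt_tokens <= 200_000 else 2.50
        out_rate = 10.00 if output_tokens <= 200_000 else 15.00
        return (prompt_tokens * in_rate + output_tokens * out_rate) / 1_000_000
    ```

    For a typical agentic step (a large screenshot-bearing prompt, a short action response), input tokens dominate the bill, which is why prompt size matters more than response size here.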

    Where the models diverge is in availability and additional features.

    Gemini 2.5 Pro includes a free tier that allows developers to use the model at no cost, with no explicit token cap published, though usage may be subject to rate limits or quota constraints depending on the platform (e.g. Google AI Studio).

    This free access includes both input and output tokens. Once developers exceed their allotted quota or switch to the paid tier, standard per-token pricing applies.

    In contrast, Gemini 2.5 Computer Use is available exclusively through the paid tier. There is no free access currently offered for this model, and all usage incurs token-based charges from the outset.

    Feature-wise, Gemini 2.5 Pro supports optional capabilities like context caching (starting at $0.31 per million tokens) and grounding with Google Search (free for up to 1,500 requests per day, then $35 per 1,000 additional requests). These are not available for Computer Use at this time.

    Another distinction is in data handling: output from the Computer Use model is not used to improve Google products in the paid tier, while free-tier usage of Gemini 2.5 Pro contributes to model improvement unless explicitly opted out.

    Overall, developers can expect similar token-based costs across both models, but they should consider tier access, included capabilities, and data use policies when deciding which model fits their needs.

  • OpenAI Dev Day 2025: ChatGPT becomes the new app store — and hardware is coming

    In a packed hall at Fort Mason Center in San Francisco, against a backdrop of the Golden Gate Bridge, OpenAI CEO Sam Altman laid out a bold vision to remake the digital world. The company that brought generative AI to the mainstream with a simple chatbot is now building the foundations for its next act: a comprehensive computing platform designed to move beyond the screen and browser, with legendary designer Jony Ive enlisted to help shape its physical form.

    At its third annual DevDay, OpenAI unveiled a suite of tools that signals a strategic pivot from a model provider to a full-fledged ecosystem. The message was clear: the era of simply asking an AI questions is over. The future is about commanding AI to perform complex tasks, build software autonomously, and live inside every application, a transition Altman framed as moving from "systems that you can ask anything to, to systems that you can ask to do anything for you." 

    The day’s announcements were a three-pronged assault on the status quo, targeting how users interact with software, how developers build it, and how businesses deploy intelligent agents. But it was the sessions held behind closed doors, away from the public livestream, that revealed the true scope of OpenAI’s ambition — a future that includes new hardware, a relentless pursuit of computational power, and a philosophical quest to redefine our relationship with technology.

    From chatbot to operating system: The new 'App Store'

    The centerpiece of the public-facing keynote was the transformation of ChatGPT itself. With the new Apps SDK, OpenAI is turning its wildly popular chatbot into a dynamic, interactive platform, effectively an operating system where developers can build and distribute their own applications.

    “Today, we're going to open up ChatGPT for developers to build real apps inside of ChatGPT,” Altman announced during the keynote presentation to applause. “This will enable a new generation of apps that are interactive, adaptive and personalized, that you can chat with.”

    Live demonstrations showcased apps from partners like Coursera, Canva, and Zillow running seamlessly within a chat conversation. A user could watch a machine learning lecture, ask ChatGPT to explain a concept in real-time, and then use Canva to generate a poster based on the conversation, all without leaving the chat interface. The apps can render rich, interactive UIs, even going full-screen to offer a complete experience, like exploring a Zillow map of homes.

    For developers, this represents a powerful new distribution channel. “When you build with the Apps SDK, your apps can reach hundreds of millions of chat users,” Altman said, highlighting a direct path to a massive user base that has grown to over 800 million weekly active users.

    In a private press conference later, Nick Turley, head of ChatGPT, elaborated on the grander vision. "We never meant to build a chatbot," he stated. "When we set out to make ChatGPT, we meant to build a super assistant and we got a little sidetracked. And one of the tragedies of getting a little sidetracked is that we built a great chatbot, but we are the first ones to say that not all software needs to be a chatbot, not all interaction with the commercial world needs to be a chatbot."

    Turley emphasized that while OpenAI is excited about natural language interfaces, "the interface really needs to evolve, which is why you see so much UI in the demos today. In fact, you can even go full screen and chat is in the background." He described a future where users might "start your day in ChatGPT, just because it kind of has become the de facto entry point into the commercial web and into a lot of software," but clarified that "our incentive is not to keep you in. Our product is to allow other people to build amazing businesses on top and to evolve the form factor of software."

    The rise of the agents: Building the 'do anything' AI

    If apps are about bringing the world into ChatGPT, the new "AgentKit" is about sending AI out into the world to get things done. OpenAI is providing a complete "set of building blocks… to help you take agents from prototype to production," Altman explained in his keynote. 

    AgentKit is an integrated development environment for creating autonomous AI workers. It features a visual canvas to design complex workflows, an embeddable chat interface ("ChatKit") for deploying agents in any app, and a sophisticated evaluation suite to measure and improve performance.

    A compelling demo from financial operations platform Ramp showed how AgentKit was used to build a procurement agent. An employee could simply type, "I need five more ChatGPT business seats," and the agent would parse the request, check it against company expense policies, find vendor details, and prepare a virtual credit card for the purchase — a process that once took weeks now completed in minutes. 

    This push into agents is a direct response to a growing enterprise need to move beyond AI as a simple information retrieval tool and toward AI as a productivity engine that automates complex business processes. Brad Lightcap, OpenAI's COO, noted that for enterprise adoption, "you needed this kind of shift to more agentic AI that could actually do things for you, versus just respond with text outputs." 

    The future of code and the Jony Ive bombshell

    Perhaps the most profound shift is occurring in software development itself. Codex, OpenAI's AI coding agent, has graduated from a research preview to a full-fledged product, now powered by a specialized version of the new GPT-5 model. It is, as one speaker put it, "a teammate that understands your context." 

    The capabilities are staggering. Developers can now assign Codex tasks directly from Slack, and the agent can autonomously write code, create pull requests, and even review other engineers' work on GitHub. A live demo showed Codex taking a simple photo of a whiteboard sketch and turning it into a fully functional, beautifully designed mobile app screen. Another demo showed an app that could "self-evolve," reprogramming itself in real-time based on a user's natural language request. 

    But the day's biggest surprise came in a closing fireside chat, which was not livestreamed, between Altman and Jony Ive, the iconic former chief design officer of Apple. The two revealed they have been collaborating for three years on a new family of AI-centric hardware.

    Ive, whose design philosophy shaped the iPhone, iMac, and Apple Watch, said his creative team’s purpose "became clear" with the launch of ChatGPT. He argued that our current relationship with technology is broken and that AI presents an opportunity for a fundamental reset.

    “I think it would be absurd to assume that you could have technology that is this breathtaking, delivered to us through legacy products, products that are decades old,” Ive said. “I see it as a chance to use this most remarkable capability to full-on address a lot of the overwhelm and despair that people feel right now.”

    While details of the devices remain secret, Ive spoke of his motivation in deeply human terms. “We love our species, and we want to be useful. We think that humanity deserves much better than humanity generally is given,” he said. He emphasized the importance of "care" in the design process, stating, "We sense when people have cared… you sense carelessness. You sense when somebody does not care about you, they care about money and schedule." 

    This collaboration confirms that OpenAI's ambitions are not confined to the cloud; it is actively exploring the physical interface through which humanity will interact with its powerful new intelligence.

    The unquenchable thirst for compute

    Underpinning this entire platform strategy is a single, overwhelming constraint: the availability of computing power. In both the private press conference and the un-streamed Developer State of the Union, OpenAI’s leadership returned to this theme again and again.

    “The degree to which we are all constrained by compute… Everyone is just so constrained on being able to offer the services at the scale required to get the revenue that at this point, we're quite confident we can push it pretty far,” Altman told reporters. He added that even with massive new hardware partnerships with AMD and others, "we'll be saying the same thing again. We're so convinced… There's so much more demand." 

    This explains the company’s aggressive, multi-billion-dollar investment in infrastructure. When asked about profitability, Altman was candid that the company is in a phase of "investment and growth." He invoked a famous quote from Walt Disney, paraphrasing, "We make more money so we can make more movies." For OpenAI, the "movies" are ever-more-powerful AI models.

    Greg Brockman, OpenAI’s President, put the ultimate goal in stark economic terms during the Developer State of the Union. "AI is going to become, probably in the not too distant future, the fundamental driver of economic growth," he said. "Asking ‘How much compute do you want?’ is a little bit like asking how much workforce do you want? The answer is, you can always get more out of more." 

    As the day concluded and developers mingled at the reception, the scale of OpenAI's project came into focus. Fueled by new models like the powerful GPT-5 Pro and the stunning Sora 2 video generator, the company is no longer just building AI. It is building the world where AI will live — a world of intelligent apps, autonomous agents, and new physical devices, betting that in the near future, intelligence itself will be the ultimate platform.

  • OpenAI announces Apps SDK allowing ChatGPT to launch and run third party apps like Zillow, Canva, Spotify

    OpenAI's annual conference for third-party developers, DevDay, kicked off with a bang today as co-founder and CEO Sam Altman announced a new "Apps SDK" that makes it "possible to build apps inside of ChatGPT," including paid apps, which companies can charge users for using OpenAI's recently unveiled Agentic Commerce Protocol (ACP).

    In other words, instead of launching apps one-by-one on your phone, computer, or on the web — now you can do all that without ever leaving ChatGPT.

    This feature allows the user to log into their accounts on those external apps, bring their information back into ChatGPT, and use the apps much as they already do outside the chatbot — but now with the ability to ask ChatGPT to perform certain actions, analyze content, or go beyond what each app could offer on its own.

    You can direct Canva to make you slides based on a text description, ask Zillow for home listings in a certain area fitting certain requirements, or ask Coursera about a specific lesson's content while it plays on video, all from within ChatGPT — with many other apps also already offering their own connections (see below).

    "This will enable a new generation of apps that are interactive, adaptive and personalized, that you can chat with," Altman said.

    While the Apps SDK is available today in preview, OpenAI said it would not begin accepting new apps within ChatGPT or allow them to charge users until "later this year."

    ChatGPT in-line app access is already rolling out to ChatGPT Free, Plus, Go and Pro users — outside of the European Union only for now — with Business, Enterprise, and Education tiers expected to receive access to the apps later this year.

    Built atop common MCP standard

    Built on the open source Model Context Protocol (MCP) standard introduced by rival Anthropic nearly a year ago, the Apps SDK allows third-party developers, whether working independently or on behalf of enterprises large and small, to connect selected data, "trigger actions, and render a fully interactive UI [user interface]," Altman explained during his introductory keynote speech.

    The Apps SDK includes a "talking to apps" feature that allows ChatGPT and the underlying GPT-5 or "o-series" models piloting it to obtain updated context from the third-party app or service, so the model "always knows exactly what your user is interacting with," according to another presenter and OpenAI engineer, Alexi Christakis.

    Developers can build apps that:

    • appear inline in chat as lightweight cards or carousels

    • expand to fullscreen for immersive tasks like maps, menus, or slides

    • use picture-in-picture for live sessions such as video, games, or quizzes

    Each mode is designed to preserve ChatGPT’s minimal, conversational flow while adding interactivity and brand presence.
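    Since OpenAI hasn't published its full wire format here, the following is only a simplified, hypothetical sketch of the round trip behind such an app: the model issues a tool call, an MCP-style server dispatches it to a registered handler, and the handler returns structured results plus a display-mode hint for ChatGPT to render. All names (`search_homes`, `render_mode`, and so on) are illustrative, not OpenAI's actual schema.

    ```python
    # Hypothetical sketch of the tool-call round trip behind an Apps SDK app.
    # The real protocol is MCP (JSON-RPC over a transport); names and fields
    # here are illustrative only.
    import json

    # Registry mapping tool names to handler functions, as a server might keep.
    TOOLS = {}

    def tool(name):
        """Register a function as a callable tool."""
        def wrap(fn):
            TOOLS[name] = fn
            return fn
        return wrap

    @tool("search_homes")
    def search_homes(city: str, max_price: int) -> dict:
        """Return structured results plus a hint for how ChatGPT should render them."""
        listings = [
            {"address": "12 Maple St", "price": 250_000},
            {"address": "98 Oak Ave", "price": 310_000},
        ]
        return {
            "city": city,
            "results": [l for l in listings if l["price"] <= max_price],
            "render_mode": "inline_card",  # or "fullscreen" / "pip", per the modes above
        }

    def handle_tool_call(request_json: str) -> dict:
        """Dispatch a model-issued tool call to the registered handler."""
        req = json.loads(request_json)
        return TOOLS[req["tool"]](**req["arguments"])

    # The model decides to call the tool; the server dispatches and replies.
    response = handle_tool_call(
        json.dumps({"tool": "search_homes",
                    "arguments": {"city": "Pittsburgh", "max_price": 300_000}})
    )
    print(response["render_mode"])   # inline_card
    print(len(response["results"]))  # 1
    ```

    The key design point is that the handler returns data, not markup: ChatGPT decides how to present the structured payload using the inline, fullscreen, or picture-in-picture modes described above.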

    Early integrations with Coursera, Canva, Zillow and more…

    Christakis showed off early integrations of external apps built atop the Apps SDK, including ones from e-learning company Coursera, cloud design software company Canva, and real estate listings and agent connections search engine, Zillow.

    Altman also announced Apps SDK integrations with additional partners not demoed during the keynote, including Booking.com, Expedia, Figma, and Spotify. In documentation, OpenAI said more partners are on deck: AllTrails, Peloton, OpenTable, Target, theFork, and Uber, representing lifestyle, commerce, and productivity categories.

    The Coursera demo included an example of how the user onboards to the external app, including a new login screen for the app (Coursera) that appears within the ChatGPT chat interface, activated simply by a text prompt from the user asking: "Coursera, can you teach me something about machine learning?"

    Once logged in, the app launched within the chat interface "in line," where it can render anything from the web, including interactive elements like video.

    Christakis explained and showed the Apps SDK also supports "picture-in-picture" and "fullscreen" views, allowing the user to choose how to interact with it.

    When playing a Coursera video that appeared, he showed that it automatically pinned the video to the top of the screen so the user could keep watching it even as they continued to have a back-and-forth dialog in text with ChatGPT in the typical input/output prompts and responses below.

    Users can then ask ChatGPT about content appearing in the video without specifying exactly what was said, as the Apps SDK pipes the information on the backend, server-side, from the connected app to the underlying ChatGPT AI model. So "can you explain more about what they're saying right now" will automatically surface the relevant portion of the video and provide it to the underlying AI model to analyze and respond to through text.

    In another example, Christakis opened an older, existing ChatGPT conversation he'd had about his siblings' dog walking business and resumed the conversation by asking another third-party app, Canva, to generate a poster using one of ChatGPT's recommended business names, "Walk This Wag," along with specific guidance about font choice ("sans serif") and overall coloration and style ("bright and colorful.")

    Instead of the user manually having to go and add all those specific elements to a Canva template, ChatGPT went and issued the commands and performed the actions on behalf of the user in the background.

    After a few minutes, ChatGPT responded with several poster designs generated directly within the Canva app, but displayed them all in the user's ChatGPT chat session where they could see, review, enlarge and provide feedback or ask for adjustments on all of them.

    Christakis then asked ChatGPT to turn one of the poster designs into an entire slide deck so the founders of the dog walking business could present it to investors, which it did in the background over several minutes while he presented a final integrated app, Zillow.

    He started a new chat session and asked a simple question: "Based on our conversations, what would be a good city to expand the dog walking business?"

    Using ChatGPT's optional memory feature, it referenced the dog walk conversation and suggested Pittsburgh, which Christakis used as a chance to type in "Zillow" and "show me some homes for sale there," which called up an interactive map from Zillow with homes for sale and prices listed and hover-over animations, all in-line within ChatGPT.

    Clicking a specific home also opened a fullscreen view with "most of the Zillow experience," entirely without leaving ChatGPT, including the ability to request home tours and contact agents and filtering by bedrooms and other qualities like outdoor space. ChatGPT pulls up the requested filtered Zillow search as well as provides a text-based response in-line explaining what it did and why.

    The user can then ask follow-up questions about the specific property — such as "how close is it to a dog park?" — or compare it to other properties, all within ChatGPT.

    It can also use apps in conjunction with its Search function, searching the web to compare the app information (in this case, Zillow) with other sources.

    Safety, privacy, and developer standards

    OpenAI emphasized that apps must comply with strict privacy, safety, and content standards to be listed in the ChatGPT directory. Apps must:

    • serve a clear and valuable purpose

    • be predictable and reliable in behavior

    • be safe for general audiences, including teens aged 13–17

    • respect user privacy and limit data collection to only what’s necessary

    Every app must also include a clear, published privacy policy, obtain user consent before connecting, and identify any actions that modify external data (e.g., posting, sending, uploading).

    Apps violating OpenAI’s usage policies, crashing frequently, or misrepresenting their capabilities may be removed at any time. Developers must submit from verified accounts, provide customer support contacts, and maintain their apps for stability and compliance.

    OpenAI also published developer design guidelines, outlining how apps should look, sound, and behave. They must follow ChatGPT’s visual system — including consistent color palettes, typography, spacing, and iconography — and maintain accessibility standards such as alt text and readable contrast ratios.

    Partners can show brand logos and accent colors but not alter ChatGPT’s core interface or use promotional language. Apps should remain “conversational, intelligent, simple, responsive, and accessible,” according to the documentation.

    A new conversational app ecosystem

    By opening ChatGPT to third-party apps and payments, OpenAI is taking a major step toward transforming ChatGPT from a chatbot into a full-fledged AI operating system — one that combines conversational intelligence, rich interfaces, and embedded commerce.

    For developers, that means direct access to over 800 million ChatGPT users, who can discover apps “at the right time” through natural conversation — whether planning trips, learning, or shopping.

    For users, it means a new generation of apps you can chat with — where a single interface helps you book a flight, design a slide deck, or learn a new skill without ever leaving ChatGPT.

    As OpenAI put it: “This is just the start of apps in ChatGPT, bringing new utility to users and new opportunities for developers.”

    There remain a few big questions, namely: 1. What happens to all the data from those third-party apps as they interface with ChatGPT and its users? Does OpenAI get access to it, and can it train on it? 2. What happens to OpenAI's once much-hyped GPT Store, which had been promoted as a way for third-party creators and developers to build custom, task-specific versions of ChatGPT and make money on them through a usage-based revenue share model?

    We've asked the company about both issues and will update when we hear back.

  • OpenAI unveils AgentKit that lets developers drag and drop to build AI agents

    OpenAI launched an agent builder that the company hopes will eliminate fragmented tools and make it easier for enterprises to utilize OpenAI’s system to create agents.

    AgentKit, announced during OpenAI’s DevDay in San Francisco, enables developers and enterprises to build agents and add chat capabilities in one place, potentially competing with platforms like Zapier.

    By offering a more streamlined way to create agents, OpenAI advances further into becoming a full-stack application provider.

    “Until now, building agents meant juggling fragmented tools—complex orchestration with no versioning, custom connectors, manual eval pipelines, prompt tuning, and weeks of frontend work before launch,” the company said in a blog post.

    AgentKit includes:

    • Agent Builder, a visual canvas where developers can see what they’ve built and version multi-agent workflows

    • Connector Registry, a central area for admins to manage connections across OpenAI products. The Global Admin console is a prerequisite for using this feature.

    • ChatKit, which enables users to integrate chat-based agents into their user interfaces

    Eventually, OpenAI said it will build a standalone Workflows API and add agent deployment tabs to ChatGPT.

    OpenAI also expanded evaluation for agents, adding capabilities such as datasets with automated graders and annotations, trace grading that runs end-to-end assessments of workflows, automated prompt optimization, and support for third-party agent measurement tools.

    Developers can access some features of AgentKit now, with others rolling out gradually. Agent Builder is available in beta, while ChatKit and the new evaluation capabilities are generally available. Connector Registry "is beginning its beta rollout to some API and ChatGPT Enterprise and Edu users."

    OpenAI said pricing for AgentKit tools will be included in the standard API model pricing.

    Agent Builder

    To clarify, many agents are already built using OpenAI’s models; however, enterprises often access GPT-5 through other platforms to create them. AgentKit brings enterprises further into OpenAI's ecosystem, reducing how often they need to tap other platforms.

    In a demonstration during DevDay, the company pitched Agent Builder as ideal for rapid iteration, and said it also gives developers visibility into how their agents are working.

    During the demo, an OpenAI developer made an agent that reads the DevDay agenda and suggests panels to watch. It took her just under eight minutes.

    Other model providers saw the importance of offering developer toolkits to build agents to entice enterprises to use more of their tools. Google came out with its Agent Development Kit in April, expanding multi-agent system building “in under 100 lines of code.” Microsoft, which runs the popular agent framework AutoGen, announced it is bringing agent creation to one place with its new Agent Framework.

    OpenAI customer Ramp, a fintech company, said in a blog post that its teams were able to build a procurement agent in a few hours instead of months.

    “Agent Builder transformed what once took months of complex orchestration, custom code, and manual optimizations into just a couple of hours. The visual canvas keeps product, legal, and engineering on the same page, slashing iteration cycles by 70% and getting an agent live in two sprints rather than two quarters,” Ramp said.

    AgentKit’s Connector Registry would also enable enterprises to manage and maintain data across workspaces, consolidating data sources into a single panel that spans both ChatGPT and the API. It will have pre-built connectors to Dropbox, Google Drive, SharePoint and Microsoft Teams. It also supports third-party MCP servers.

    Another capability of Agent Builder is Guardrails, an open-source safety layer that protects against the leakage of personally identifiable information (PII), jailbreaks, and unintended or malicious behavior.

    Bringing more chat

    Since most agentic interactions involve chat, it makes sense to simplify the process for developers to set up chat interfaces and connect them with the agents they’ve just built.

    “Deploying chat UIs for agents can be surprisingly complex—handling streaming responses, managing threads, showing the model thinking and designing engaging in-chat experiences,” OpenAI said.

    The company said ChatKit makes it simple to embed chat-based agents into apps or websites.

    However, some OpenAI competitors have begun thinking beyond the chatbot and want to offer agentic interactions that feel more seamless. Google’s asynchronous coding agent, Jules, has introduced a new feature that enables users to interact with the agent through the command-line interface, eliminating the need to open a chat window.

    Responses

    The response to AgentKit has mainly been positive, with some developers noting that while it simplifies agent building, it doesn’t mean that everyone can now build agents.

    Several developers view AgentKit not as a Zapier killer, but rather as a tool that complements such automation pipelines.

    Zapier debuted a no-code tool for building AI agents and bots, called Zapier Central, in 2024.