Blog

  • A weekend ‘vibe code’ hack by Andrej Karpathy quietly sketches the missing layer of enterprise AI orchestration

    This weekend, Andrej Karpathy, the former director of AI at Tesla and a founding member of OpenAI, decided he wanted to read a book. But he did not want to read it alone. He wanted to read it accompanied by a committee of artificial intelligences, each offering its own perspective, critiquing the others, and eventually synthesizing a final answer under the guidance of a "Chairman."

    To make this happen, Karpathy wrote what he called a "vibe code project" — a piece of software written quickly, largely by AI assistants, intended for fun rather than function. He posted the result, a repository called "LLM Council," to GitHub with a stark disclaimer: "I’m not going to support it in any way… Code is ephemeral now and libraries are over."

    Yet, for technical decision-makers across the enterprise landscape, looking past the casual disclaimer reveals something far more significant than a weekend toy. In a few hundred lines of Python and JavaScript, Karpathy has sketched a reference architecture for the most critical, undefined layer of the modern software stack: the orchestration middleware sitting between corporate applications and the volatile market of AI models.

    As companies finalize their platform investments for 2026, LLM Council offers a stripped-down look at the "build vs. buy" reality of AI infrastructure. It demonstrates that while the logic of routing and aggregating AI models is surprisingly simple, the operational wrapper required to make it enterprise-ready is where the true complexity lies.

    How the LLM Council works: Four AI models debate, critique, and synthesize answers

    To the casual observer, the LLM Council web application looks almost identical to ChatGPT. A user types a query into a chat box. But behind the scenes, the application triggers a sophisticated, three-stage workflow that mirrors how human decision-making bodies operate.

    First, the system dispatches the user’s query to a panel of frontier models. In Karpathy’s default configuration, this includes OpenAI’s GPT-5.1, Google’s Gemini 3.0 Pro, Anthropic’s Claude Sonnet 4.5, and xAI’s Grok 4. These models generate their initial responses in parallel.

    In the second stage, the software performs a peer review. Each model is fed the anonymized responses of its counterparts and asked to evaluate them based on accuracy and insight. This step transforms the AI from a generator into a critic, forcing a layer of quality control that is rare in standard chatbot interactions.

    Finally, a designated "Chairman LLM" — currently configured as Google’s Gemini 3 — receives the original query, the individual responses, and the peer rankings. It synthesizes this mass of context into a single, authoritative answer for the user.
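    The three-stage flow above is simple enough to sketch in a few lines of Python. The version below is a hypothetical reconstruction, not Karpathy's actual code: model calls are stubbed as plain callables (the real app dispatches them in parallel over OpenRouter), and names like run_council are invented for illustration.

```python
# Hypothetical sketch of the three-stage council flow; not the actual
# LLM Council code. Each model is a plain callable: prompt -> str.

def run_council(query, council, chairman):
    # Stage 1: every council member drafts an answer independently.
    # (The real app fires these requests in parallel.)
    drafts = {name: ask(query) for name, ask in council.items()}

    # Stage 2: peer review over anonymized drafts, so no model knows
    # which answer came from which provider.
    labeled = {f"Response {i + 1}": text
               for i, text in enumerate(drafts.values())}
    review_prompt = (
        f"Question: {query}\n\n"
        "Rank these responses by accuracy and insight:\n"
        + "\n".join(f"{label}: {text}" for label, text in labeled.items())
    )
    rankings = {name: ask(review_prompt) for name, ask in council.items()}

    # Stage 3: the chairman synthesizes drafts plus rankings into one answer.
    synthesis_prompt = (
        f"Question: {query}\n"
        f"Council drafts: {drafts}\n"
        f"Peer rankings: {rankings}\n"
        "Write a single, final, synthesized answer."
    )
    return chairman(synthesis_prompt)
```

    In the real project the callables are OpenRouter API calls and the chairman is Gemini 3, but the control flow is essentially this.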

    Karpathy noted that the results were often surprising. "Quite often, the models are surprisingly willing to select another LLM's response as superior to their own," he wrote on X (formerly Twitter). He described using the tool to read book chapters, observing that the models consistently praised GPT-5.1 as the most insightful while rating Claude the lowest. However, Karpathy’s own qualitative assessment diverged from his digital council; he found GPT-5.1 "too wordy" and preferred the "condensed and processed" output of Gemini.

    FastAPI, OpenRouter, and the case for treating frontier models as swappable components

    For CTOs and platform architects, the value of LLM Council lies not in its literary criticism, but in its construction. The repository serves as a primary document showing exactly what a modern, minimal AI stack looks like in late 2025.

    The application is built on a "thin" architecture. The backend uses FastAPI, a modern Python framework, while the frontend is a standard React application built with Vite. Data storage is handled not by a complex database, but by simple JSON files written to the local disk.

    The linchpin of the entire operation is OpenRouter, an API aggregator that normalizes the differences between various model providers. By routing requests through this single broker, Karpathy avoided writing separate integration code for OpenAI, Google, and Anthropic. The application does not know or care which company provides the intelligence; it simply sends a prompt and awaits a response.

    This design choice highlights a growing trend in enterprise architecture: the commoditization of the model layer. By treating frontier models as interchangeable components that can be swapped by editing a single line in a configuration file — specifically the COUNCIL_MODELS list in the backend code — the architecture protects the application from vendor lock-in. If a new model from Meta or Mistral tops the leaderboards next week, it can be added to the council in seconds.
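    Because OpenRouter exposes a single OpenAI-compatible chat endpoint, the model-specific part of each request collapses to one string. Below is a minimal sketch of that pattern; the model identifiers are illustrative placeholders, and the authoritative list is the COUNCIL_MODELS constant in the repository's backend.

```python
# Sketch of the swappable-model pattern behind an API broker like OpenRouter.
# The endpoint is real; the model identifiers are illustrative placeholders.
OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

COUNCIL_MODELS = [
    "openai/gpt-5.1",              # edit this list to reshape the council
    "google/gemini-3-pro-preview",
    "anthropic/claude-sonnet-4.5",
    "x-ai/grok-4",
]

def build_request(model: str, prompt: str) -> dict:
    # One payload shape serves every provider behind the broker.
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

requests_out = [build_request(m, "Summarize this chapter.")
                for m in COUNCIL_MODELS]
```

    Each payload would be POSTed to OPENROUTER_URL with an Authorization header; nothing else in the application changes when a model is swapped.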

    What's missing from prototype to production: Authentication, PII redaction, and compliance

    While the core logic of LLM Council is elegant, it also serves as a stark illustration of the gap between a "weekend hack" and a production system. For an enterprise platform team, cloning Karpathy’s repository is merely step one of a marathon.

    A technical audit of the code reveals the missing "boring" infrastructure that commercial vendors sell for premium prices. The system lacks authentication; anyone with access to the web interface can query the models. There is no concept of user roles, meaning a junior developer has the same access rights as the CIO.

    Furthermore, the governance layer is nonexistent. In a corporate environment, sending data to four different external AI providers simultaneously triggers immediate compliance concerns. There is no mechanism here to redact Personally Identifiable Information (PII) before it leaves the local network, nor is there an audit log to track who asked what.

    Reliability is another open question. The system assumes the OpenRouter API is always up and that the models will respond in a timely fashion. It lacks the circuit breakers, fallback strategies, and retry logic that keep business-critical applications running when a provider suffers an outage.
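    What that hardening looks like in miniature: a retry loop with exponential backoff and an ordered fallback chain. This is a generic sketch of the pattern the prototype omits, not code from the repository; call_with_fallback and its parameters are invented names.

```python
import time

# Generic retry-with-fallback sketch (not from the LLM Council repo).
# `providers` is an ordered list of callables, primary first; each is
# retried with exponential backoff before falling through to the next.

def call_with_fallback(prompt, providers, retries=3, base_delay=0.01):
    last_error = None
    for provider in providers:
        for attempt in range(retries):
            try:
                return provider(prompt)
            except Exception as err:  # timeouts, 5xx responses, etc.
                last_error = err
                time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError("all providers exhausted") from last_error
```

    A production gateway layers circuit breakers on top of this, so a provider that keeps failing is skipped outright instead of being retried on every request.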

    These absences are not flaws in Karpathy’s code — he explicitly stated he does not intend to support or improve the project — but they define the value proposition for the commercial AI infrastructure market.

    Companies like LangChain, AWS Bedrock, and various AI gateway startups are essentially selling the "hardening" around the core logic that Karpathy demonstrated. They provide the security, observability, and compliance wrappers that turn a raw orchestration script into a viable enterprise platform.

    Why Karpathy believes code is now "ephemeral" and traditional software libraries are obsolete

    Perhaps the most provocative aspect of the project is the philosophy under which it was built. Karpathy described the development process as "99% vibe-coded," implying he relied heavily on AI assistants to generate the code rather than writing it line-by-line himself.

    "Code is ephemeral now and libraries are over, ask your LLM to change it in whatever way you like," he wrote in the repository’s documentation.

    This statement marks a radical shift in software engineering capability. Traditionally, companies build internal libraries and abstractions to manage complexity, maintaining them for years. Karpathy is suggesting a future where code is treated as "promptable scaffolding" — disposable, easily rewritten by AI, and not meant to last.

    For enterprise decision-makers, this poses a difficult strategic question. If internal tools can be "vibe coded" in a weekend, does it make sense to buy expensive, rigid software suites for internal workflows? Or should platform teams empower their engineers to generate custom, disposable tools that fit their exact needs for a fraction of the cost?

    When AI models judge AI: The dangerous gap between machine preferences and human needs

    Beyond the architecture, the LLM Council project inadvertently shines a light on a specific risk in automated AI deployment: the divergence between human and machine judgment.

    Karpathy’s observation that his models preferred GPT-5.1, while he preferred Gemini, suggests that AI models may have shared biases. They might favor verbosity, specific formatting, or rhetorical confidence that does not necessarily align with human business needs for brevity and accuracy.

    As enterprises increasingly rely on "LLM-as-a-Judge" systems to evaluate the quality of their customer-facing bots, this discrepancy matters. If the automated evaluator consistently rewards "wordy and sprawled" answers while human customers want concise solutions, the metrics will show success while customer satisfaction plummets. Karpathy’s experiment suggests that relying solely on AI to grade AI is a strategy fraught with hidden alignment issues.
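    A toy example makes the failure mode concrete. Here the "judge" is a deliberately crude heuristic stand-in for an LLM evaluator that rewards length (the bias Karpathy's anecdote hints frontier models may share), while the human criterion is brevity.

```python
# Toy illustration of judge/human divergence; the "judge" is a crude
# heuristic stand-in for an LLM evaluator, not a real model.

def llm_judge(answers):
    # Proxy bias: score by length, so verbose answers win.
    return max(answers, key=len)

def human_pick(answers):
    # The human criterion in this scenario: the concise answer.
    return min(answers, key=len)

answers = [
    "Restart the service.",
    "There are many considerations one might weigh before restarting the "
    "service, spanning operational, organizational, and historical context...",
]
metric_winner = llm_judge(answers)
human_winner = human_pick(answers)
```

    The judge's metric reports the verbose answer as best while the human picks the short one; scaled up, that is exactly how an evaluation dashboard can show green while customer satisfaction falls.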

    What enterprise platform teams can learn from a weekend hack before building their 2026 stack

    Ultimately, LLM Council acts as a Rorschach test for the AI industry. For the hobbyist, it is a fun way to read books. For the vendor, it is a threat, proving that the core functionality of their products can be replicated in a few hundred lines of code.

    But for the enterprise technology leader, it is a reference architecture. It demystifies the orchestration layer, showing that the technical challenge is not in routing the prompts, but in governing the data.

    As platform teams head into 2026, many will likely find themselves staring at Karpathy’s code, not to deploy it, but to understand it. It proves that a multi-model strategy is not technically out of reach. The question remains whether companies will build the governance layer themselves or pay someone else to wrap the "vibe code" in enterprise-grade armor.

  • Black Forest Labs launches Flux.2 AI image models to challenge Nano Banana Pro and Midjourney

    It's not just Google's Gemini 3, Nano Banana Pro, and Anthropic's Claude Opus 4.5 we have to be thankful for this year around the Thanksgiving holiday here in the U.S.

    No, today the German AI startup Black Forest Labs released FLUX.2, a new image generation and editing system complete with four different models designed to support production-grade creative workflows.

    FLUX.2 introduces multi-reference conditioning, higher-fidelity outputs, and improved text rendering, and it expands the company’s open-core ecosystem with both commercial endpoints and open-weight checkpoints.

    While Black Forest Labs launched with, and made its name on, open-source text-to-image models in its Flux family, today's release includes one fully open-source component: the Flux.2 VAE, available now under the Apache 2.0 license.

    Three other models of varying size and purpose — Flux.2 [Pro], Flux.2 [Flex], and Flux.2 [Dev] — are not open source; Pro and Flex remain proprietary hosted offerings, while Dev is an open-weight downloadable model that requires a commercial license obtained directly from Black Forest Labs for any commercial use. A fifth variant, Flux.2 [Klein], is forthcoming and will also be released under Apache 2.0 when available.

    But the open source Flux.2 VAE, or variational autoencoder, is important and useful to enterprises for several reasons. This is the module that compresses images into a latent space and reconstructs them back into high-resolution outputs; in Flux.2, it defines the latent representation used across the four generator variants (see below), enabling higher-quality reconstructions, more efficient training, and 4-megapixel editing.

    Because this VAE is open and freely usable, enterprises can adopt the same latent space used by BFL’s commercial models in their own self-hosted pipelines, gaining interoperability between internal systems and external providers while avoiding vendor lock-in.

    The availability of a fully open, standardized latent space also enables practical benefits beyond media-focused organizations. Enterprises can use an open-source VAE as a stable, shared foundation for multiple image-generation models, allowing them to switch or mix generators without reworking downstream tools or workflows.

    Standardizing on a transparent, Apache-licensed VAE supports auditability and compliance requirements, ensures consistent reconstruction quality across internal assets, and allows future models trained for the same latent space to function as drop-in replacements.

    This transparency also enables downstream customization such as lightweight fine-tuning for brand styles or internal visual templates—even for organizations that do not specialize in media but rely on consistent, controllable image generation for marketing materials, product imagery, documentation, or stock-style visuals.
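    The interoperability argument can be reduced to a toy interface sketch. Nothing below is BFL code; it only illustrates the contract a shared latent space creates: generators become swappable producers of latents, and the open VAE is the one fixed, auditable decoder.

```python
# Toy sketch of the shared-latent-space contract; illustrative only.

class SharedVAE:
    """Stand-in for an open, Apache-licensed decoder all models target."""
    LATENT_SPACE = "flux2-latent"

    def decode(self, latent: dict) -> dict:
        assert latent["space"] == self.LATENT_SPACE, "incompatible generator"
        return {"image": f"decoded<{latent['payload']}>"}

def in_house_generator(prompt: str) -> dict:
    # Any model trained against the same latent space emits compatible latents.
    return {"space": SharedVAE.LATENT_SPACE, "payload": f"inhouse:{prompt}"}

def hosted_generator(prompt: str) -> dict:
    return {"space": SharedVAE.LATENT_SPACE, "payload": f"hosted:{prompt}"}

vae = SharedVAE()
img_a = vae.decode(in_house_generator("product shot"))
img_b = vae.decode(hosted_generator("product shot"))
```

    Because both generators target the same latent space, the decode step and everything built on it is unchanged when one is swapped for the other.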

    The announcement positions FLUX.2 as an evolution of the FLUX.1 family, with an emphasis on reliability, controllability, and integration into existing creative pipelines rather than one-off demos.

    A Shift Toward Production-Centric Image Models

    FLUX.2 extends the prior FLUX.1 architecture with more consistent character, layout, and style adherence across up to ten reference images.

    The system maintains coherence at 4-megapixel resolutions for both generation and editing tasks, enabling use cases such as product visualization, brand-aligned asset creation, and structured design workflows.

    The model also improves prompt following across multi-part instructions while reducing failure modes related to lighting, spatial logic, and world knowledge.

    In parallel, Black Forest Labs continues to follow an open-core release strategy. The company provides hosted, performance-optimized versions of FLUX.2 for commercial deployments, while also publishing inspectable open-weight models that researchers and independent developers can run locally. This approach extends a track record begun with FLUX.1, which became the most widely used open image model globally.

    Model Variants and Deployment Options

    Flux.2 arrives in five variants, as follows:

    • Flux.2 [Pro]: This is the highest-performance tier, intended for applications that require minimal latency and maximal visual fidelity. It is available through the BFL Playground, the FLUX API, and partner platforms. The model aims to match leading closed-weight systems in prompt adherence and image quality while reducing compute demand.

    • Flux.2 [Flex]: This version exposes parameters such as the number of sampling steps and the guidance scale. The design enables developers to tune the trade-offs between speed, text accuracy, and detail fidelity. In practice, this enables workflows where low-step previews can be generated quickly before higher-step renders are invoked.

    • Flux.2 [Dev]: The most notable release for the open ecosystem is the 32-billion-parameter open-weight checkpoint which integrates text-to-image generation and image editing into a single model. It supports multi-reference conditioning without requiring separate modules or pipelines. The model can run locally using BFL’s reference inference code or optimized fp8 implementations developed in partnership with NVIDIA and ComfyUI. Hosted inference is also available via FAL, Replicate, Runware, Verda, TogetherAI, Cloudflare, and DeepInfra.

    • Flux.2 [Klein]: Coming soon, this size-distilled model is released under Apache 2.0 and is intended to offer improved performance relative to comparable models of the same size trained from scratch. A beta program is currently open.

    • Flux.2 – VAE: Released under the enterprise-friendly (even for commercial use) Apache 2.0 license, the updated variational autoencoder provides the latent space that underpins all Flux.2 variants. The VAE emphasizes an optimized balance between reconstruction fidelity, learnability, and compression rate — a long-standing challenge for latent-space generative architectures.

    Benchmark Performance

    Black Forest Labs published two sets of evaluations highlighting FLUX.2’s performance relative to other open-weight and hosted image-generation models. In head-to-head win-rate comparisons across three categories—text-to-image generation, single-reference editing, and multi-reference editing—FLUX.2 [Dev] led all open-weight alternatives by a substantial margin.

    It achieved a 66.6% win rate in text-to-image generation (vs. 51.3% for Qwen-Image and 48.1% for Hunyuan Image 3.0), 59.8% in single-reference editing (vs. 49.3% for Qwen-Image and 41.2% for FLUX.1 Kontext), and 63.6% in multi-reference editing (vs. 36.4% for Qwen-Image). These results reflect consistent gains over both earlier FLUX.1 models and contemporary open-weight systems.

    A second benchmark compared model quality using ELO scores against approximate per-image cost. In this analysis, FLUX.2 [Pro], FLUX.2 [Flex], and FLUX.2 [Dev] cluster in the upper-quality, lower-cost region of the chart, with ELO scores in the ~1030–1050 band while operating in the 2–6 cent range.

    By contrast, earlier models such as FLUX.1 Kontext [max] and Hunyuan Image 3.0 appear significantly lower on the ELO axis despite similar or higher per-image costs. Only proprietary competitors like Nano Banana 2 reach higher ELO levels, but at noticeably elevated cost. According to BFL, this positions FLUX.2’s variants as offering strong quality–cost efficiency across performance tiers, with FLUX.2 [Dev] in particular delivering near–top-tier quality while remaining one of the lowest-cost options in its class.

    Pricing via API and Comparison to Nano Banana Pro

    A pricing calculator on BFL’s site indicates that FLUX.2 [Pro] is billed at roughly $0.03 per megapixel of combined input and output. A standard 1024×1024 (1 MP) generation costs $0.030, and higher resolutions scale proportionally. The calculator also counts input images toward total megapixels, suggesting that multi-image reference workflows will have higher per-call costs.

    By contrast, Google’s Gemini 3 Pro Image Preview, aka "Nano Banana Pro," currently prices image output at $120 per 1M tokens, resulting in a cost of $0.134 per 1K–2K image (up to 2048×2048) and $0.24 per 4K image. Image input is billed at $0.0011 per image, which is negligible compared to output costs.

    While Gemini’s model uses token-based billing, its effective per-image pricing places 1K–2K images at more than 4× the cost of a 1 MP FLUX.2 [Pro] generation, and 4K outputs at roughly 8× the cost of a similar-resolution FLUX.2 output if scaled proportionally.
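    The arithmetic behind that comparison, using only the figures cited above (BFL's per-megapixel calculator and Google's posted rates). Actual billing may differ, and this sketch counts output megapixels only, whereas BFL also bills input images.

```python
# Back-of-the-envelope check of the pricing comparison, using the figures
# cited in this article; actual billing may differ.

MP = 1024 * 1024  # one megapixel, as billed

def flux2_pro_cost(width: int, height: int, price_per_mp: float = 0.03) -> float:
    # Output megapixels only; BFL's calculator also counts input images.
    return price_per_mp * (width * height) / MP

flux_1mp = flux2_pro_cost(1024, 1024)   # $0.030 for a 1 MP generation
gemini_1k_image = 0.134                 # Google's posted price per 1K-2K image

ratio = gemini_1k_image / flux_1mp      # > 4x, matching the comparison above
```

    At 2048×2048 (4 MP) the same formula gives $0.12, which is why multi-megapixel and multi-reference calls scale linearly with pixels rather than per image.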

    In practical terms, the available data suggests that FLUX.2 [Pro] currently offers significantly lower per-image pricing, particularly for high-resolution outputs or multi-image editing workflows, whereas Gemini 3 Pro’s preview tier is positioned as a higher-cost, token-metered service with more variability depending on resolution.

    Technical Design and the Latent Space Overhaul

    FLUX.2 is built on a latent flow matching architecture, combining a rectified flow transformer with a vision-language model based on Mistral-3 (24B). The VLM contributes semantic grounding and contextual understanding, while the transformer handles spatial structure, material representation, and lighting behavior.

    A major component of the update is the re-training of the model’s latent space. The FLUX.2 VAE integrates advances in semantic alignment, reconstruction quality, and representational learnability drawn from recent research on autoencoder optimization. Earlier models often faced trade-offs in the learnability–quality–compression triad: highly compressed spaces increase training efficiency but degrade reconstructions, while wider bottlenecks can reduce the ability of generative models to learn consistent transformations.

    According to BFL’s research data, the FLUX.2 VAE achieves lower LPIPS distortion than the FLUX.1 and SD autoencoders while also improving generative FID. This balance allows FLUX.2 to support high-fidelity editing—an area that typically demands reconstruction accuracy—and still maintain competitive learnability for large-scale generative training.

    Capabilities Across Creative Workflows

    The most significant functional upgrade is multi-reference support. FLUX.2 can ingest up to ten reference images and maintain identity, product details, or stylistic elements across the output. This feature is relevant for commercial applications such as merchandising, virtual photography, storyboarding, and branded campaign development.

    The system’s typography improvements address a persistent challenge for diffusion- and flow-based architectures. FLUX.2 is able to generate legible fine text, structured layouts, UI elements, and infographic-style assets with greater reliability. This capability, combined with flexible aspect ratios and high-resolution editing, broadens the use cases where text and image jointly define the final output.

    FLUX.2 enhances instruction following for multi-step, compositional prompts, enabling more predictable outcomes in constrained workflows. The model exhibits better grounding in physical attributes—such as lighting and material behavior—reducing inconsistencies in scenes requiring photoreal equilibrium.

    Ecosystem and Open-Core Strategy

    Black Forest Labs continues to position its models within an ecosystem that blends open research with commercial reliability. The FLUX.1 open models helped establish the company’s reach across both the developer and enterprise markets, and FLUX.2 expands this structure: tightly optimized commercial endpoints for production deployments and open, composable checkpoints for research and community experimentation.

    The company emphasizes transparency through published inference code, open-weight VAE release, prompting guides, and detailed architectural documentation. It also continues to recruit talent in Freiburg and San Francisco as it pursues a longer-term roadmap toward multimodal models that unify perception, memory, reasoning, and generation.

    Background: Flux and the Formation of Black Forest Labs

    Black Forest Labs (BFL) was founded in 2024 by Robin Rombach, Patrick Esser, and Andreas Blattmann, the original creators of Stable Diffusion. Their move from Stability AI came at a moment of turbulence for the broader open-source generative AI community, and the launch of BFL signaled a renewed effort to build accessible, high-performance image models. The company secured $31 million in seed funding led by Andreessen Horowitz, with additional support from Brendan Iribe, Michael Ovitz, and Garry Tan, providing early validation for its technical direction.

    BFL’s first major release, FLUX.1, introduced a 12-billion-parameter architecture available in Pro, Dev, and Schnell variants. It quickly gained a reputation for output quality that matched or exceeded closed-source competitors such as Midjourney v6 and DALL·E 3, while the Dev and Schnell versions reinforced the company’s commitment to open distribution. FLUX.1 also saw rapid adoption in downstream products, including xAI’s Grok 2, and arrived amid ongoing industry discussions about dataset transparency, responsible model usage, and the role of open-source distribution. BFL published strict usage policies aimed at preventing misuse and non-consensual content generation.

    In late 2024, BFL expanded the lineup with Flux 1.1 Pro, a proprietary high-speed model delivering sixfold generation speed improvements and achieving leading ELO scores on Artificial Analysis. The company launched a paid API alongside the release, enabling configurable integrations with adjustable resolution, model choice, and moderation settings at pricing that began at $0.04 per image.

    Partnerships with TogetherAI, Replicate, FAL, and Freepik broadened access and made the model available to users without the need for self-hosting, extending BFL’s reach across commercial and creator-oriented platforms.

    These developments unfolded against a backdrop of accelerating competition in generative media.

    Implications for Enterprise Technical Decision Makers

    The FLUX.2 release carries distinct operational implications for enterprise teams responsible for AI engineering, orchestration, data management, and security. For AI engineers responsible for model lifecycle management, the availability of both hosted endpoints and open-weight checkpoints enables flexible integration paths.

    FLUX.2’s multi-reference capabilities and expanded resolution support reduce the need for bespoke fine-tuning pipelines when handling brand-specific or identity-consistent outputs, lowering development overhead and accelerating deployment timelines. The model’s improved prompt adherence and typography performance also reduce iterative prompting cycles, which can have a measurable impact on production workload efficiency.

    Teams focused on AI orchestration and operational scaling benefit from the structure of FLUX.2’s product family. The Pro tier offers predictable latency characteristics suitable for pipeline-critical workloads, while the Flex tier enables direct control over sampling steps and guidance parameters, aligning with environments that require strict performance tuning.

    Open-weight access for the Dev model facilitates the creation of custom containerized deployments and allows orchestration platforms to manage the model under existing CI/CD practices. This is particularly relevant for organizations balancing cutting-edge tooling with budget constraints, as self-hosted deployments offer cost control at the expense of in-house optimization requirements.

    Data engineering stakeholders gain advantages from the model’s latent architecture and improved reconstruction fidelity. High-quality, predictable image representations reduce downstream data-cleaning burdens in workflows where generated assets feed into analytics systems, creative automation pipelines, or multimodal model development.

    Because FLUX.2 consolidates text-to-image and image-editing functions into a single model, it simplifies integration points and reduces the complexity of data flows across storage, versioning, and monitoring layers. For teams managing large volumes of reference imagery, the ability to incorporate up to ten inputs per generation may also streamline asset management processes by shifting more variation handling into the model rather than external tooling.

    For security teams, FLUX.2’s open-core approach introduces considerations related to access control, model governance, and API usage monitoring. Hosted FLUX.2 endpoints allow for centralized enforcement of security policies and reduce local exposure to model weights, which may be preferable for organizations with stricter compliance requirements.

    Conversely, open-weight deployments require internal controls for model integrity, version tracking, and inference-time monitoring to prevent misuse or unapproved modifications. The model’s handling of typography and realistic compositions also reinforces the need for established content governance frameworks, particularly where generative systems interface with public-facing channels.

    Across these roles, FLUX.2’s design emphasizes predictable performance characteristics, modular deployment options, and reduced operational friction. For enterprises with lean teams or rapidly evolving requirements, the release offers a set of capabilities aligned with practical constraints around speed, quality, budget, and model governance.

    FLUX.2 marks a substantial iterative improvement in Black Forest Labs’ generative image stack, with notable gains in multi-reference consistency, text rendering, latent space quality, and structured prompt adherence. By pairing fully managed offerings with open-weight checkpoints, BFL maintains its open-core model while extending its relevance to commercial creative workflows. The release demonstrates a shift from experimental image generation toward more predictable, scalable, and controllable systems suited for operational use.

  • OpenAI now lets enterprises choose where to host their data

    OpenAI expanded its data residency regions for ChatGPT and its API, giving enterprise users the option to store and process their data closest to their business operations and better comply with local regulations. This expansion removes one of the biggest compliance blockers preventing global enterprises from deploying ChatGPT at scale.

    Data residency, an often overlooked piece of the enterprise AI puzzle, refers to storing, processing, and governing data according to the laws and customs of the countries where it resides.

    ChatGPT Enterprise and Edu subscribers can now choose to have their data processed in: 

    • Europe (European Economic Area and Switzerland)

    • United Kingdom

    • United States

    • Canada

    • Japan

    • South Korea

    • Singapore

    • India

    • Australia

    • United Arab Emirates

    OpenAI said in a blog post that it “plans to expand availability to additional regions over time.” 

    Customers can store data such as conversations, uploaded files, custom GPTs, and image-generation artifacts. This applies only to data at rest, not while it moves through a system or when it is used for inference. OpenAI’s documentation notes that, for now, inference residency remains available only in the U.S.  

    ChatGPT Enterprise and Edu users can set up new workspaces with data residency. Enterprise customers on the API who have been approved for advanced data controls can enable data residency by creating a new project and selecting their preferred region.

    OpenAI first began offering data residency in Europe in February this year. The European Union has some of the strictest data regulations globally, based on the GDPR. 

    The importance of data residency

    Until now, enterprises had fewer choices over where data flowing through ChatGPT was processed. For example, some organizational data would be processed under U.S. law rather than under European rules.

    Enterprises risk violating compliance rules if their data at rest is held in a jurisdiction whose protections do not meet the strict policies they are subject to.

    “With over 1 million business customers around the world directly using OpenAI, we have expanded where we offer data residency — allowing business customers to store data in certain regions, helping organizations meet local regulatory and data protection requirements,” the company said in its blog post. 

    However, enterprises must also understand that if they are using a connector or integration within ChatGPT, those applications have different data residency rules. When OpenAI launched company knowledge for ChatGPT, it warned users that depending on the connector they use, data residency may be limited to the U.S. 

  • What enterprises should know about The White House’s new AI ‘Manhattan Project’ the Genesis Mission

    President Donald Trump’s new “Genesis Mission,” unveiled Monday, is billed as a generational leap in how the United States does science, akin to the Manhattan Project that created the atomic bomb during World War II.

    The executive order directs the Department of Energy (DOE) to build a “closed-loop AI experimentation platform” that links the country’s 17 national laboratories, federal supercomputers, and decades of government scientific data into “one cooperative system for research.”

    The White House fact sheet casts the initiative as a way to “transform how scientific research is conducted” and “accelerate the speed of scientific discovery,” with priorities spanning biotechnology, critical materials, nuclear fission and fusion, quantum information science, and semiconductors.

    DOE’s own release calls it “the world’s most complex and powerful scientific instrument ever built” and quotes Under Secretary for Science Darío Gil describing it as a “closed-loop system” linking the nation’s most advanced facilities, data, and computing into “an engine for discovery that doubles R&D productivity.”

    What the administration has not provided is just as striking: no public cost estimate, no explicit appropriation, and no breakdown of who will pay for what. Major news outlets including Reuters, Associated Press, Politico, and others have all noted that the order “does not specify new spending or a budget request,” or that funding will depend on future appropriations and previously passed legislation.

    That omission, combined with the initiative’s scope and timing, raises questions not only about how Genesis will be funded and to what extent, but about who it might quietly benefit.

    “So is this just a subsidy for big labs or what?”

    Soon after DOE promoted the mission on X, Teknium of the small U.S. AI lab Nous Research posted a blunt reaction: “So is this just a subsidy for big labs or what.”

    The line has become a shorthand for a growing concern in the AI community: that the U.S. government could offer some sort of public subsidy for large AI firms facing staggering and rising compute and data costs.

    That concern is grounded in recent, well-sourced reporting on OpenAI’s finances and infrastructure commitments. Documents obtained and analyzed by tech public relations professional and AI critic Ed Zitron describe a cost structure that has exploded as the company has scaled models like GPT-4, GPT-4.1, and GPT-5.1.

    The Register has separately inferred from Microsoft quarterly earnings statements that OpenAI lost about $13.5 billion on $4.3 billion in revenue in the first half of 2025 alone. Other outlets and analysts have highlighted projections that show tens of billions in annual losses later this decade if spending and revenue follow current trajectories.

    By contrast, Google DeepMind trained its recent Gemini 3 flagship LLM on the company’s own TPU hardware and in its own data centers, giving it a structural advantage in cost per training run and energy management, as covered in Google’s own technical blogs and subsequent financial reporting.

    Viewed against that backdrop, an ambitious federal project that promises to integrate “world-class supercomputers and datasets into a unified, closed-loop AI platform” and “power robotic laboratories” sounds, to some observers, like more than a pure science accelerator. It could, depending on how access is structured, also ease the capital bottlenecks facing private frontier-model labs.

    The executive order explicitly anticipates partnerships with “external partners possessing advanced AI, data, or computing capabilities,” to be governed through cooperative research and development agreements, user-facility partnerships, and data-use and model-sharing agreements. That category clearly includes firms like OpenAI, Anthropic, Google, and other major AI players—even if none are named.

    What the order does not do is guarantee those companies access, spell out subsidized pricing, or earmark public money for their training runs. Any claim that OpenAI, Anthropic, or Google “just got access” to federal supercomputing or national-lab data is, at this point, an interpretation of how the framework could be used, not something the text actually promises.

    Furthermore, the executive order makes no mention of open-source model development — an omission that stands out in light of remarks last year from Vice President JD Vance, who, while still a Senator from Ohio and before assuming his current office, warned at a hearing against regulations designed to protect incumbent tech firms, remarks that were widely praised by open-source advocates.

    Closed-loop discovery and “autonomous scientific agents”

    Another viral reaction came from AI influencer Chris (@chatgpt21 on X), who wrote in an X post that OpenAI, Anthropic, and Google have already “got access to petabytes of proprietary data” from national labs, and that DOE labs have been “hoarding experimental data for decades.” The public record supports a narrower claim.

    The order and fact sheet describe “federal scientific datasets—the world’s largest collection of such datasets, developed over decades of Federal investments” and direct agencies to identify data that can be integrated into the platform “to the extent permitted by law.”

    DOE’s announcement similarly talks about unleashing “the full power of our National Laboratories, supercomputers, and data resources.”

    It is true that the national labs hold enormous troves of experimental data. Some of it is already public via the Office of Scientific and Technical Information (OSTI) and other repositories; some is classified or export-controlled; much is under-used because it sits in fragmented formats and systems. But there is no public document so far that states private AI companies have now been granted blanket access to this data, or that DOE characterizes past practice as “hoarding.”

    What is clear is that the administration wants to unlock more of this data for AI-driven research and to do so in coordination with external partners. Section 5 of the order instructs DOE and the Assistant to the President for Science and Technology to create standardized partnership frameworks, define IP and licensing rules, and set “stringent data access and management processes and cybersecurity standards for non-Federal collaborators accessing datasets, models, and computing environments.”

    A moonshot with an open question at the center

    Taken at face value, the Genesis Mission is an ambitious attempt to use AI and high-performance computing to speed up everything from fusion research to materials discovery and pediatric cancer work, using decades of taxpayer-funded data and instruments that already exist inside the federal system. The executive order spends considerable space on governance: coordination through the National Science and Technology Council, new fellowship programs, and annual reporting on platform status, integration progress, partnerships, and scientific outcomes.

    Yet the initiative also lands at a moment when frontline AI labs are buckling under their own compute bills, when one of them—OpenAI—is reported to be spending more on running models than it earns in revenue, and when investors are openly debating whether the current business model for proprietary frontier AI is sustainable without some form of outside support.

    In that environment, a federally funded, closed-loop AI discovery platform that centralizes the country’s most powerful supercomputers and data is inevitably going to be read in more than one way. It may become a genuine engine for public science. It may also become a crucial piece of infrastructure for the very companies driving today’s AI arms race.

    For now, one fact is undeniable: the administration has launched a mission it compares to the Manhattan Project without telling the public what it will cost, how the money will flow, or exactly who will be allowed to plug into it.

    How enterprise tech leaders should interpret the Genesis Mission

    For enterprise teams already building or scaling AI systems, the Genesis Mission signals a shift in how national infrastructure, data governance, and high-performance compute will evolve in the U.S.—and those signals matter even before the government publishes a budget.

    The initiative outlines a federated, AI-driven scientific ecosystem where supercomputers, datasets, and automated experimentation loops operate as tightly integrated pipelines.

    That direction mirrors the trajectory many companies are already moving toward: larger models, more experimentation, heavier orchestration, and a growing need for systems that can manage complex workloads with reliability and traceability.

    Even though Genesis is aimed at science, its architecture hints at what will become expected norms across American industries.

    The lack of cost detail around Genesis does not directly alter enterprise roadmaps, but it does reinforce the broader reality that compute scarcity, escalating cloud costs, and rising standards for AI model governance will remain central challenges.

    Companies that already struggle with constrained budgets or tight headcount—particularly those responsible for deployment pipelines, data integrity, or AI security—should view Genesis as early confirmation that efficiency, observability, and modular AI infrastructure will remain essential.

    As the federal government formalizes frameworks for data access, experiment traceability, and AI agent oversight, enterprises may find that future compliance regimes or partnership expectations take cues from these federal standards.

    Genesis also underscores the growing importance of unifying data sources and ensuring that models can operate across diverse, sometimes sensitive environments. Whether managing pipelines across multiple clouds, fine-tuning models with domain-specific datasets, or securing inference endpoints, enterprise technical leaders will likely see increased pressure to harden systems, standardize interfaces, and invest in complex orchestration that can scale safely.

    The mission’s emphasis on automation, robotic workflows, and closed-loop model refinement may shape how enterprises structure their internal AI R&D, encouraging them to adopt more repeatable, automated, and governable approaches to experimentation.

    Here is what enterprise leaders should be doing now:

    1. Expect increased federal involvement in AI infrastructure and data governance. This may indirectly shape cloud availability, interoperability standards, and model-governance expectations.

    2. Track “closed-loop” AI experimentation models. This may preview future enterprise R&D workflows and reshape how ML teams build automated pipelines.

    3. Prepare for rising compute costs and consider efficiency strategies. This includes smaller models, retrieval-augmented systems, and mixed-precision training.

    4. Strengthen AI-specific security practices. Genesis signals that the federal government is escalating expectations for AI system integrity and controlled access.

    5. Plan for potential public–private interoperability standards. Enterprises that align early may gain a competitive edge in partnerships and procurement.

    Overall, Genesis does not change day-to-day enterprise AI operations today. But it strongly signals where federal and scientific AI infrastructure is heading—and that direction will inevitably influence the expectations, constraints, and opportunities enterprises face as they scale their own AI capabilities.

  • Anthropic’s Claude Opus 4.5 is here: Cheaper AI, infinite chats, and coding skills that beat humans

    Anthropic released its most capable artificial intelligence model yet on Monday, slashing prices by roughly two-thirds while claiming state-of-the-art performance on software engineering tasks — a strategic move that intensifies the AI startup's competition with deep-pocketed rivals OpenAI and Google.

    The new model, Claude Opus 4.5, scored higher on Anthropic's most challenging internal engineering assessment than any human job candidate in the company's history, according to materials reviewed by VentureBeat. The result underscores both the rapidly advancing capabilities of AI systems and growing questions about how the technology will reshape white-collar professions.

    The Amazon-backed company is pricing Claude Opus 4.5 at $5 per million input tokens and $25 per million output tokens — a dramatic reduction from the $15 and $75 rates for its predecessor, Claude Opus 4.1, released earlier this year. The move makes frontier AI capabilities accessible to a broader swath of developers and enterprises while putting pressure on competitors to match both performance and pricing.
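At those rates, the savings are easy to quantify. A quick back-of-the-envelope sketch using the published per-million-token prices (the token counts in the example workload are illustrative, not from the article):

```python
# Prices in USD per million tokens, as stated for each model.
OPUS_4_1 = {"input": 15.00, "output": 75.00}
OPUS_4_5 = {"input": 5.00, "output": 25.00}

def request_cost(prices, input_tokens, output_tokens):
    """Cost of a single request at the given per-million-token rates."""
    return (input_tokens / 1_000_000) * prices["input"] + \
           (output_tokens / 1_000_000) * prices["output"]

# Hypothetical workload: 100k input tokens, 20k output tokens per request.
old = request_cost(OPUS_4_1, 100_000, 20_000)  # 1.50 + 1.50 = $3.00
new = request_cost(OPUS_4_5, 100_000, 20_000)  # 0.50 + 0.50 = $1.00
print(f"Opus 4.1: ${old:.2f}  Opus 4.5: ${new:.2f}  savings: {1 - new / old:.0%}")
```

For this workload the new pricing works out to a roughly two-thirds reduction, consistent with the headline figure.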

    "We want to make sure this really works for people who want to work with these models," said Alex Albert, Anthropic's head of developer relations, in an exclusive interview with VentureBeat. "That is really our focus: How can we enable Claude to be better at helping you do the things that you don't necessarily want to do in your job?"

    The announcement comes as Anthropic races to maintain its position in an increasingly crowded field. OpenAI recently released GPT-5.1 and a specialized coding model called Codex Max that can work autonomously for extended periods. Google unveiled Gemini 3 just last week, prompting concerns even from OpenAI about the search giant's progress, according to a recent report from The Information.

    Opus 4.5 demonstrates improved judgment on real-world tasks, developers say

    Anthropic's internal testing revealed what the company describes as a qualitative leap in Claude Opus 4.5's reasoning capabilities. The model achieved 80.9% accuracy on SWE-bench Verified, a benchmark measuring real-world software engineering tasks, outperforming OpenAI's GPT-5.1-Codex-Max (77.9%), Anthropic's own Sonnet 4.5 (77.2%), and Google's Gemini 3 Pro (76.2%), according to the company's data. The result marks a notable advance over OpenAI's current state-of-the-art model, which was released just five days earlier.

    But the technical benchmarks tell only part of the story. Albert said employee testers consistently reported that the model demonstrates improved judgment and intuition across diverse tasks — a shift he described as the model developing a sense of what matters in real-world contexts.

    "The model just kind of gets it," Albert said. "It just has developed this sort of intuition and judgment on a lot of real world things that feels qualitatively like a big jump up from past models."

    He pointed to his own workflow as an example. Previously, Albert said, he would ask AI models to gather information but hesitated to trust their synthesis or prioritization. With Opus 4.5, he's delegating more complete tasks, connecting it to Slack and internal documents to produce coherent summaries that match his priorities.

    Opus 4.5 outscores all human candidates on company's toughest engineering test

    The model's performance on Anthropic's internal engineering assessment marks a notable milestone. The take-home exam, designed for prospective performance engineering candidates, is meant to evaluate technical ability and judgment under time pressure within a prescribed two-hour limit.

    Using a technique called parallel test-time compute — which aggregates multiple attempts from the model and selects the best result — Opus 4.5 scored higher than any human candidate who has taken the test, according to the company. Without a time limit, the model matched the performance of the best-ever human candidate when used within Claude Code, Anthropic’s coding environment.
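Parallel test-time compute is essentially best-of-n sampling. A minimal sketch of the idea, with toy stand-ins for the model call and the automatic grader (both callbacks are assumptions here, not Anthropic's implementation):

```python
import random

def parallel_test_time_compute(solve, score, task, n_attempts=8):
    """Best-of-n sampling: run the model several times on the same task
    and keep the highest-scoring attempt. `solve` stands in for a model
    call and `score` for an automatic grader."""
    attempts = [solve(task) for _ in range(n_attempts)]
    return max(attempts, key=score)

# Toy run: "solving" draws a noisy quality value; scoring reads it back.
random.seed(0)
best = parallel_test_time_compute(
    solve=lambda task: {"task": task, "quality": random.random()},
    score=lambda attempt: attempt["quality"],
    task="optimize-kernel",
)
print(best["quality"])  # the max over 8 attempts
```

The trade-off is direct: n attempts cost roughly n times the compute of a single run, which is why such results are usually reported separately from single-shot benchmarks.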

    The company acknowledged that the test doesn't measure other crucial professional skills such as collaboration, communication, or the instincts that develop over years of experience. Still, Anthropic said the result "raises questions about how AI will change engineering as a profession."

    Albert emphasized the significance of the finding. "I think this is kind of a sign, maybe, of what's to come around how useful these models can actually be in a work context and for our jobs," he said. "Of course, this was an engineering task, and I would say models are relatively ahead in engineering compared to other fields, but I think it's a really important signal to pay attention to."

    Dramatic efficiency improvements cut token usage by up to 76% on key benchmarks

    Beyond raw performance, Anthropic is betting that efficiency improvements will differentiate Claude Opus 4.5 in the market. The company says the model uses dramatically fewer tokens — the units of text that AI systems process — to achieve similar or better outcomes compared to predecessors.

    At a medium effort level, Opus 4.5 matches the previous Sonnet 4.5 model's best score on SWE-bench Verified while using 76% fewer output tokens, according to Anthropic. At the highest effort level, Opus 4.5 exceeds Sonnet 4.5 performance by 4.3 percentage points while still using 48% fewer tokens.

    To give developers more control, Anthropic introduced an "effort parameter" that allows users to adjust how much computational work the model applies to each task — balancing performance against latency and cost.

    Enterprise customers provided early validation of the efficiency claims. "Opus 4.5 beats Sonnet 4.5 and competition on our internal benchmarks, using fewer tokens to solve the same problems," said Michele Catasta, president of Replit, a cloud-based coding platform, in a statement to VentureBeat. "At scale, that efficiency compounds."

    GitHub's chief product officer, Mario Rodriguez, said early testing shows Opus 4.5 "surpasses internal coding benchmarks while cutting token usage in half, and is especially well-suited for tasks like code migration and code refactoring."

    Early customers report AI agents that learn from experience and refine their own skills

    One of the most striking capabilities demonstrated by early customers involves what Anthropic calls "self-improving agents" — AI systems that can refine their own performance through iterative learning.

    Rakuten, the Japanese e-commerce and internet company, tested Claude Opus 4.5 on automation of office tasks. "Our agents were able to autonomously refine their own capabilities — achieving peak performance in 4 iterations while other models couldn't match that quality after 10," said Yusuke Kaji, Rakuten's general manager of AI for business.

    Albert explained that the model isn't updating its own weights — the fundamental parameters that define an AI system's behavior — but rather iteratively improving the tools and approaches it uses to solve problems. "It was iteratively refining a skill for a task and seeing that it's trying to optimize the skill to get better performance so it could accomplish this task," he said.

    The capability extends beyond coding. Albert said Anthropic has observed significant improvements in creating professional documents, spreadsheets, and presentations. "They're saying that this has been the biggest jump they've seen between model generations," Albert said. "So going even from Sonnet 4.5 to Opus 4.5, bigger jump than any two models back to back in the past."

    Fundamental Research Labs, a financial modeling firm, reported that "accuracy on our internal evals improved 20%, efficiency rose 15%, and complex tasks that once seemed out of reach became achievable," according to co-founder Nico Christie.

    New features target Excel users, Chrome workflows and eliminate chat length limits

    Alongside the model release, Anthropic rolled out a suite of product updates aimed at enterprise users. Claude for Excel became generally available for Max, Team, and Enterprise users with new support for pivot tables, charts, and file uploads. The Chrome browser extension is now available to all Max users.

    Perhaps most significantly, Anthropic introduced "infinite chats" — a feature that eliminates context window limitations by automatically summarizing earlier parts of conversations as they grow longer. "Within Claude AI, within the product itself, you effectively get this kind of infinite context window due to the compaction, plus some memory things that we're doing," Albert explained.
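The compaction idea behind "infinite chats" can be sketched in a few lines. This is a naive illustration of the pattern only, not Anthropic's actual mechanism; the summarizer and token counter below are toy stand-ins for an LLM summarization call and a real tokenizer:

```python
def compact(history, summarize, max_tokens, count_tokens):
    """Naive context compaction: while the transcript exceeds the window,
    fold the oldest messages into a running summary prefix."""
    while sum(count_tokens(m) for m in history) > max_tokens and len(history) > 2:
        oldest, history = history[:2], history[2:]
        history.insert(0, "SUMMARY: " + summarize(oldest))
    return history

# Toy stand-ins: token count = word count; "summarizing" keeps first words.
count = lambda m: len(m.split())
summarize = lambda msgs: " ".join(m.split()[0] for m in msgs)

history = ["user asks about pricing details", "bot answers at great length indeed",
           "user follows up", "bot replies briefly"]
print(compact(history, summarize, max_tokens=8, count_tokens=count))
```

The effect is that recent turns stay verbatim while older turns survive only as a compressed summary, which is what makes the context window feel unbounded to the user.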

    For developers, Anthropic released "programmatic tool calling," which allows Claude to write and execute code that invokes functions directly. Claude Code gained an updated "Plan Mode" and became available on desktop in research preview, enabling developers to run multiple AI agent sessions in parallel.
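The difference from conventional JSON-style tool calling can be illustrated with a toy executor that runs model-written code against a whitelist of tools. This is a hedged sketch of the general pattern, not Anthropic's API; a production system would sandbox the execution far more carefully than this:

```python
def run_model_code(code, tools):
    """Execute model-written code against a whitelist of tool functions.
    Only the names in `tools` are exposed; builtins are emptied out.
    (A real system would use a proper sandbox, not bare exec.)"""
    namespace = {"__builtins__": {}, **tools}
    exec(code, namespace)
    return namespace.get("result")

tools = {
    "get_price": lambda sku: {"A1": 5.0, "B2": 25.0}[sku],
    "sum": sum,
}
# Pretend this snippet came back from the model instead of a JSON tool call:
model_code = "result = sum(get_price(s) for s in ['A1', 'B2'])"
print(run_model_code(model_code, tools))  # 30.0
```

The appeal of the approach is that one generated snippet can chain several tool invocations and intermediate logic, instead of a round trip to the model per call.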

    Market heats up as OpenAI, Google race to match performance and pricing

    Anthropic reached $2 billion in annualized revenue during the first quarter of 2025, more than doubling from $1 billion in the prior period. The number of customers spending more than $100,000 annually jumped eightfold year-over-year.

    The rapid release of Opus 4.5 — just weeks after Haiku 4.5 in October and Sonnet 4.5 in September — reflects broader industry dynamics. OpenAI released multiple GPT-5 variants throughout 2025, including a specialized Codex Max model in November that can work autonomously for up to 24 hours. Google shipped Gemini 3 in mid-November after months of development.

    Albert attributed Anthropic's accelerated pace partly to using Claude to speed its own development. "We're seeing a lot of assistance and speed-up by Claude itself, whether it's on the actual product building side or on the model research side," he said.

    The pricing reduction for Opus 4.5 could pressure margins while potentially expanding the addressable market. "I'm expecting to see a lot of startups start to incorporate this into their products much more and feature it prominently," Albert said.

    Yet profitability remains elusive for leading AI labs as they invest heavily in computing infrastructure and research talent. The AI market is projected to top $1 trillion in revenue within a decade, but no single provider has established a dominant market position—even as models reach a threshold where they can meaningfully automate complex knowledge work.

    Michael Truell, CEO of Cursor, an AI-powered code editor, called Opus 4.5 "a notable improvement over the prior Claude models inside Cursor, with improved pricing and intelligence on difficult coding tasks." Scott Wu, CEO of Cognition, an AI coding startup, said the model delivers "stronger results on our hardest evaluations and consistent performance through 30-minute autonomous coding sessions."

    For enterprises and developers, the competition translates to rapidly improving capabilities at falling prices. But as AI performance on technical tasks approaches—and sometimes exceeds—human expert levels, the technology's impact on professional work becomes less theoretical.

    When asked about the engineering exam results and what they signal about AI's trajectory, Albert was direct: "I think it's a really important signal to pay attention to."

  • Microsoft’s Fara-7B is a computer-use AI agent that rivals GPT-4o and works directly on your PC

    Microsoft has introduced Fara-7B, a new 7-billion-parameter model designed to act as a Computer Use Agent (CUA) capable of performing complex tasks directly on a user’s device. Fara-7B sets new state-of-the-art results for its size, providing a way to build AI agents that don’t rely on massive, cloud-dependent models and can run on compact systems with lower latency and enhanced privacy.

    While the model is an experimental release, its architecture addresses a primary barrier to enterprise adoption: data security. Because Fara-7B is small enough to run locally, it allows users to automate sensitive workflows, such as managing internal accounts or processing sensitive company data, without that information ever leaving the device. 

    How Fara-7B sees the web

    Fara-7B is designed to navigate user interfaces using the same tools a human does: a mouse and keyboard. The model operates by visually perceiving a web page through screenshots and predicting specific coordinates for actions like clicking, typing, and scrolling.

    Crucially, Fara-7B does not rely on "accessibility trees,” the underlying code structure that browsers use to describe web pages to screen readers. Instead, it relies solely on pixel-level visual data. This approach allows the agent to interact with websites even when the underlying code is obfuscated or complex.
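The perceive-predict-act loop the article describes can be sketched roughly as follows. The callback names are illustrative assumptions standing in for real screen capture, model inference, and input injection, not Microsoft's API:

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str      # "click", "type", "scroll", or "done"
    x: int = 0     # screen coordinates predicted from raw pixels
    y: int = 0
    text: str = ""

def agent_loop(screenshot_fn, predict_fn, execute_fn, max_steps=16):
    """Minimal pixel-based agent loop: screenshot in, coordinate-level
    action out, no accessibility tree. Returns the step count at which
    the model declared the task finished."""
    for step in range(max_steps):
        pixels = screenshot_fn()        # perceive the page visually
        action = predict_fn(pixels)     # model predicts the next action
        if action.kind == "done":
            return step
        execute_fn(action)              # click/type at raw coordinates
    return max_steps

# Toy run: the "model" clicks once and then declares the task finished.
script = iter([Action("click", 120, 240), Action("done")])
steps = agent_loop(lambda: b"...", lambda px: next(script), lambda a: None)
print(steps)
```

The 16-step default mirrors the average step count the article reports for Fara-7B; the efficiency gap with other agents shows up directly as fewer trips around this loop.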

    According to Yash Lara, Senior PM Lead at Microsoft Research, processing all visual input on-device creates true "pixel sovereignty," since screenshots and the reasoning needed for automation remain on the user’s device. "This approach helps organizations meet strict requirements in regulated sectors, including HIPAA and GLBA," he told VentureBeat in written comments.

    In benchmarking tests, this visual-first approach has yielded strong results. On WebVoyager, a standard benchmark for web agents, Fara-7B achieved a task success rate of 73.5%. This outperforms larger, more resource-intensive systems, including GPT-4o, when prompted to act as a computer use agent (65.1%) and the native UI-TARS-1.5-7B model (66.4%).

    Efficiency is another key differentiator. In comparative tests, Fara-7B completed tasks in approximately 16 steps on average, compared to roughly 41 steps for the UI-TARS-1.5-7B model.

    Handling risks

    The transition to autonomous agents is not without risks, however. Microsoft notes that Fara-7B shares limitations common to other AI models, including potential hallucinations, mistakes in following complex instructions, and accuracy degradation on intricate tasks.

    To mitigate these risks, the model was trained to recognize "Critical Points." A Critical Point is defined as any situation requiring a user's personal data or consent before an irreversible action occurs, such as sending an email or completing a financial transaction. Upon reaching such a juncture, Fara-7B is designed to pause and explicitly request user approval before proceeding. 
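The Critical Point pattern amounts to an approval gate in front of irreversible actions. A minimal sketch, with hypothetical action names and callbacks standing in for Fara-7B's actual executor and approval UI:

```python
# Hypothetical set of actions treated as irreversible.
IRREVERSIBLE = {"send_email", "submit_payment", "delete_account"}

def guarded_execute(action, args, run, ask_user):
    """Critical-point gate: pause before irreversible actions and require
    explicit user approval; everything else runs straight through."""
    if action in IRREVERSIBLE and not ask_user(
        f"Agent wants to {action}({args}). Allow?"
    ):
        return "blocked: user declined"
    return run(action, args)

# Toy run: the approval callback declines, so the send never happens.
result = guarded_execute(
    "send_email", {"to": "cfo@example.com"},
    run=lambda a, kw: "sent",
    ask_user=lambda prompt: False,
)
print(result)  # blocked: user declined
```

The design tension Lara describes lives in the `IRREVERSIBLE` set: too small and the agent acts without consent, too large and users drown in approval prompts.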

    Managing this interaction without frustrating the user is a key design challenge. "Balancing robust safeguards such as Critical Points with seamless user journeys is key," Lara said. "Having a UI, like Microsoft Research’s Magentic-UI, is vital for giving users opportunities to intervene when necessary, while also helping to avoid approval fatigue." Magentic-UI is a research prototype designed specifically to facilitate these human-agent interactions. Fara-7B is designed to run in Magentic-UI.

    Distilling complexity into a single model

    The development of Fara-7B highlights a growing trend in knowledge distillation, where the capabilities of a complex system are compressed into a smaller, more efficient model.

    Creating a CUA usually requires massive amounts of training data showing how to navigate the web. Collecting this data via human annotation is prohibitively expensive. To solve this, Microsoft used a synthetic data pipeline built on Magentic-One, a multi-agent framework. In this setup, an "Orchestrator" agent created plans and directed a "WebSurfer" agent to browse the web, generating 145,000 successful task trajectories.

    The researchers then "distilled" this complex interaction data into Fara-7B, which is built on Qwen2.5-VL-7B, a base model chosen for its long context window (up to 128,000 tokens) and its strong ability to connect text instructions to visual elements on a screen. While the data generation required a heavy multi-agent system, Fara-7B itself is a single model, showing that a small model can effectively learn advanced behaviors without needing complex scaffolding at runtime.

    The training process relied on supervised fine-tuning, where the model learns by mimicking the successful examples generated by the synthetic pipeline.

    Looking forward

    While the current version was trained on static datasets, future iterations will focus on making the model smarter, not necessarily bigger. "Moving forward, we’ll strive to maintain the small size of our models," Lara said. "Our ongoing research is focused on making agentic models smarter and safer, not just larger." This includes exploring techniques like reinforcement learning (RL) in live, sandboxed environments, which would allow the model to learn from trial and error in real-time.

    Microsoft has made the model available on Hugging Face and Microsoft Foundry under an MIT license. However, Lara cautions that while the license allows for commercial use, the model is not yet production-ready. "You can freely experiment and prototype with Fara‑7B under the MIT license," he says, "but it’s best suited for pilots and proofs‑of‑concept rather than mission‑critical deployments."

  • How to avoid becoming an “AI-first” company with zero real AI usage

    Remember the first time you heard your company was going AI-first?

    Maybe it came through an all-hands that felt different from the others. The CEO said, “By Q3, every team should have integrated AI into their core workflows,” and the energy in the room (or on the Zoom) shifted. You saw a mix of excitement and anxiety ripple through the crowd.

    Maybe you were one of the curious ones. Maybe you’d already built a Python script that summarized customer feedback, saving your team three hours every week. Or maybe you’d stayed late one night just to see what would happen if you combined a dataset with a large language model (LLM) prompt. Maybe you’re one of those who’d already let curiosity lead you somewhere unexpected.

    But this announcement felt different because suddenly, what had been a quiet act of curiosity was now a line in a corporate OKR. Maybe you didn’t know it yet, but something fundamental had shifted in how innovation would happen inside your company.

    How innovation happens

    Real transformation rarely looks like the PowerPoint version, and almost never follows the org chart.

    Think about the last time something genuinely useful spread at work. It wasn't because of a vendor pitch or a strategic initiative, was it? More likely, someone stayed late one night, when no one was watching, found something that cut hours of busywork, and mentioned it at lunch the next day. “Hey, try this.” They shared it in a Slack thread and, in a week, half the team was using it.

    The developer who used GPT to debug code wasn’t trying to make a strategic impact. She just needed to get home earlier to her kids. The ops manager who automated his spreadsheet didn’t need permission. He just needed more sleep.

    This is the invisible architecture of progress — these informal networks where curiosity flows like water through concrete… finding every crack, every opening.

    But watch what happens when leadership notices. What used to be effortless and organic becomes mandated. And the thing that once worked because it was free suddenly stops being as effective the moment it’s measured.

    The great reversal

    It usually begins quietly, often when a competitor announces new AI features — like AI-powered onboarding or end-to-end support automation — claiming 40% efficiency gains.

    The next morning, your CEO calls an emergency meeting. The room gets still. Someone clears their throat. And you can feel everyone doing mental math about their job security. “If they’re that far ahead, what does that mean for us?”

    That afternoon, your company has a new priority. Your CEO says, “We need an AI strategy. Yesterday.”

    Here's how that message usually ripples down the org chart:

    • At the C-suite: “We need an AI strategy to stay competitive.”

    • At the VP level: “Every team needs an AI initiative.”

    • At the manager level: “We need a plan by Friday.”

    • At your level: “I just need to find something that looks like AI.”

    Each translation adds pressure while subtracting understanding. Everyone still cares, but that translation changes intent. What begins as a question worth asking becomes a script everyone follows blindly.

    Eventually, the performance of innovation replaces the thing itself. There’s a strange pressure to look like you’re moving fast, even when you’re not sure where you’re actually going.

    This repeats across industries

    A competitor declares they’re going AI-first. Another publishes a case study about replacing support with LLMs. And a third shares a graph showing productivity gains. Within days, boardrooms everywhere start echoing the same message: “We should be doing this. Everyone else already is, and we can’t fall behind.”

    So the work begins. Then come the task forces, the town halls, the strategy docs and the targets. Teams are asked to contribute initiatives.

    But if you’ve been through this before, you know there’s often a difference between what companies announce and what they actually do. Because press releases don’t mention the pilots that stall, or the teams that quietly revert to the old way, or even the tools that get used once and abandoned. You might know someone who was on one of those teams, or you might’ve even been on one yourself.

    These aren’t failures of technology or intent. ChatGPT works fine. And teams want to automate their tasks. These failures are organizational, and they happen when we try to imitate outcomes without understanding what created them in the first place.

    And so when everyone performs innovation, it becomes almost impossible to tell who’s actually doing it.

    Two kinds of leaders

    You’ve probably seen both, and it’s very easy to tell which kind you’re working with.

    One spends an entire weekend prototyping. They try something new, fail at half of it, and still show up Monday saying, “I built this thing with Claude. It crashed after two hours, but I learned a lot. Wanna see? It's very basic, but it might solve that thing we talked about.”

    They try to build understanding. You can tell they’ve actually spent time with AI, and struggled with prompts and hallucinations. Instead of trying to sound certain, they talk about what broke, what almost worked and what they’re still figuring out. They invite you to try something new, because it feels like there’s room to learn. That’s what leading by participation looks like.

    The other sends you a directive in Slack: “Leadership wants every team using AI by the end of the quarter. Plans are due by Friday.” They enforce compliance with a decision that's already been made. You can even hear it in their language, and how certain they sound.

    The curious leader builds momentum. The performative one builds resentment.

    What actually works

    You probably don’t need someone to tell you where AI works. You already know because you’ve seen it.

    • Customer support: LLMs genuinely help with Tier 1 tickets. They understand intent, draft simple responses and route complexity. Not perfectly, of course — I’m sure you've seen the failures — but well enough to matter.

    • Code assistance: At 2 a.m., when you’re half-delirious and your AI assistant suggests exactly what you need, it feels like having an over-caffeinated junior programmer who never judges your forgotten semicolons. You save minutes at first, then hours, then days.

    These small, cumulative wins compound over time. They aren't the impressive transformations promised in decks, but the kind of improvements you can rely on.

    But outside these zones, things get murky. AI-driven revops? Fully automated forecasting? You've sat through those demos, and you’ve also seen the enthusiasm fade once the pilot actually begins.

    Have the builders of these AI tools failed? Hardly. The technology is evolving, and the products built on top of it are still learning how to walk.

    So how can you tell if your company's AI adoption is real? Simple. Just ask someone in finance or ops. Ask what AI tools they use daily. You might get a slight pause or an apologetic smile. “Honestly? Just ChatGPT.” That’s it. Not the $50k enterprise-grade platform from last quarter’s demo or the expensive software suite in the board deck. Just a browser tab, same as any college student writing an essay.

    You might make this same confession yourself. Despite all the mandates and initiatives, your most powerful AI tool is probably the same one everyone else uses. So what does this tell us about the gap between what we're supposed to be doing and what we're actually doing?

    How to drive change at your company

    You've probably discovered this yourself, even if no one's ever put it into words:

    1. Model what you mean: Remember that engineering director who screen-shared her messy, live coding session with Cursor? You learned more from watching her debug in real time than from any polished presentation, because vulnerability travels farther than directives.

    2. Listen to the edges: You know who's actually using AI effectively in your organization, and they're not always the ones with “AI” in their title. They're the curious ones who've been quietly experimenting, finding what works through trial and error. And that knowledge is worth more than any analyst report.

    3. Create permission (not pressure): The people inclined to experiment will always find a way, and the rest won’t be moved by force. The best thing you can do is make the curious feel safe to stay curious.

    We're living in this strange moment, caught between the AI that vendors promise and the AI that actually exists on our screens, and it's deeply uncomfortable. The gap between product and promise is wide.

    But what I've learned from sitting in that discomfort is that companies that will thrive aren’t the ones that adopted AI first, but the ones that learned through trial and error. They stayed with the discomfort long enough for it to teach them something.

    Where will you be six months from now?

    By then, your company’s AI-first mandate will have set into motion departmental initiatives, vendor contracts and maybe even some new hires with “AI” in their titles. The dashboards will be green, and the board deck will have a whole slide on AI.

    But in the quiet spaces where your actual work happens, what will have meaningfully changed?

    Maybe you'll be like the teams that never stopped their quiet experiments. Your customer feedback system might catch the patterns humans miss. Your documentation might update itself. Chances are, if you were building before the mandate, you’ll be building after it fades.

    That’s the invisible architecture of genuine progress: patient, and completely uninterested in performance. It doesn't make for great LinkedIn posts, and it resists grand narratives. But it transforms companies in ways that truly last.

    Every organization is standing at the same crossroads right now: Look like you’re innovating, or create a culture that fosters real innovation.

    The pressure to perform innovation is real, and it’s growing. Most companies will give in and join the theater. But some understand that curiosity can’t be forced, and progress can’t be performed. Because real transformation happens when no one’s watching, in the hands of the people still experimenting, still learning. That’s where the future begins.

    Siqi Chen is co-founder and CEO of Runway.

    Read more from our guest writers. Or, consider submitting a post of your own! See our guidelines here.

  • Lean4: How the theorem prover works and why it’s the new competitive edge in AI

    Large language models (LLMs) have astounded the world with their capabilities, yet they remain plagued by unpredictability and hallucinations – confidently outputting incorrect information. In high-stakes domains like finance, medicine or autonomous systems, such unreliability is unacceptable.

    Enter Lean4, an open-source programming language and interactive theorem prover becoming a key tool to inject rigor and certainty into AI systems. By leveraging formal verification, Lean4 promises to make AI safer, more secure and deterministic in its functionality. Let's explore how Lean4 is being adopted by AI leaders and why it could become foundational for building trustworthy AI.

    What is Lean4 and why it matters

    Lean4 is both a programming language and a proof assistant designed for formal verification. Every theorem or program written in Lean4 must pass strict type-checking by Lean’s trusted kernel, yielding a binary verdict: A statement either checks out as correct or it doesn’t. This all-or-nothing verification means there’s no room for ambiguity – a property or result is proven true or it fails. Such rigorous checking “dramatically increases the reliability” of anything formalized in Lean4. In other words, Lean4 provides a framework where correctness is mathematically guaranteed, not just hoped for.

    This level of certainty is precisely what today’s AI systems lack. Modern AI outputs are generated by complex neural networks with probabilistic behavior. Ask the same question twice and you might get different answers. By contrast, a Lean4 proof or program will behave deterministically – given the same input, it produces the same verified result every time. This determinism and transparency (every inference step can be audited) make Lean4 an appealing antidote to AI’s unpredictability.

    Key advantages of Lean4’s formal verification:

    • Precision and reliability: Formal proofs avoid ambiguity through strict logic, ensuring each reasoning step is valid and results are correct.

    • Systematic verification: Lean4 can formally verify that a solution meets all specified conditions or axioms, acting as an objective referee for correctness.

    • Transparency and reproducibility: Anyone can independently check a Lean4 proof, and the outcome will be the same – a stark contrast to the opaque reasoning of neural networks.

    In essence, Lean4 brings the gold standard of mathematical rigor to computing and AI. It enables us to turn an AI’s claim (“I found a solution”) into a formally checkable proof that is indeed correct. This capability is proving to be a game-changer in several aspects of AI development.

    Lean4 as a safety net for LLMs

    One of the most exciting intersections of Lean4 and AI is in improving LLM accuracy and safety. Research groups and startups are now combining LLMs’ natural language prowess with Lean4’s formal checks to create AI systems that reason correctly by construction.

    Consider the problem of AI hallucinations, when an AI confidently asserts false information. Instead of adding more opaque patches (like heuristic penalties or reinforcement tweaks), why not prevent hallucinations by having the AI prove its statements? That’s exactly what some recent efforts do. For example, a 2025 research framework called Safe uses Lean4 to verify each step of an LLM’s reasoning. The idea is simple but powerful: Each step in the AI’s chain-of-thought (CoT) is translated into Lean4’s formal language, and the AI (or a proof assistant) provides a proof. If the proof fails, the system knows the reasoning was flawed – a clear indicator of a hallucination.
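    The Safe framework's actual pipeline is more involved, but the control flow the article describes can be sketched in Python. Everything here is an assumption for illustration: the function names, the `runner` hook, and the subprocess invocation (which assumes a `lean` executable on PATH) are not the paper's real interfaces.

    ```python
    import os
    import subprocess
    import tempfile

    def check_step_with_lean(lean_statement: str, lean_proof: str, runner=None) -> bool:
        """Render one chain-of-thought step as a Lean4 theorem and accept it
        only if the kernel verifies the proof. `runner` abstracts the actual
        `lean` invocation so the flow can be exercised without a toolchain."""
        source = f"theorem step_check : {lean_statement} := {lean_proof}\n"
        if runner is None:
            # Real path (assumes a Lean4 toolchain on PATH): compile the file
            # and treat a zero exit code as "proof accepted".
            with tempfile.NamedTemporaryFile("w", suffix=".lean", delete=False) as f:
                f.write(source)
                path = f.name
            try:
                result = subprocess.run(["lean", path], capture_output=True)
                return result.returncode == 0
            finally:
                os.unlink(path)
        return runner(source)

    def audit_chain_of_thought(steps, runner):
        """Return the index of the first unverified step (a potential
        hallucination), or None if every step checks out."""
        for i, (claim, proof) in enumerate(steps):
            if not check_step_with_lean(claim, proof, runner):
                return i
        return None
    ```

    The key design point is the audit trail: a failed check pinpoints exactly which reasoning step broke, rather than merely flagging the final answer.
    
    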

    This step-by-step formal audit trail dramatically improves reliability, catching mistakes as they happen and providing checkable evidence for every conclusion. The approach has shown “significant performance improvement while offering interpretable and verifiable evidence” of correctness.

    Another prominent example is Harmonic AI, a startup co-founded by Vlad Tenev (of Robinhood fame) that tackles hallucinations in AI. Harmonic’s system, Aristotle, solves math problems by generating Lean4 proofs for its answers and formally verifying them before responding to the user. “[Aristotle] formally verifies the output… we actually do guarantee that there’s no hallucinations,” Harmonic’s CEO explains. In practical terms, Aristotle writes a solution in Lean4’s language and runs the Lean4 checker. Only if the proof checks out as correct does it present the answer. This yields a “hallucination-free” math chatbot – a bold claim, but one backed by Lean4’s deterministic proof checking.

    Crucially, this method isn’t limited to toy problems. Harmonic reports that Aristotle achieved a gold-medal level performance on the 2025 International Math Olympiad problems; the key difference is that its solutions were formally verified, unlike other AI models that merely gave answers in English. In other words, where tech giants Google and OpenAI also reached human-champion level on math questions, Aristotle did so with a proof in hand. The takeaway for AI safety is compelling: When an answer comes with a Lean4 proof, you don’t have to trust the AI – you can check it.

    This approach could be extended to many domains. We could imagine an LLM assistant for finance that provides an answer only if it can generate a formal proof that it adheres to accounting rules or legal constraints. Or, an AI scientific adviser that outputs a hypothesis alongside a Lean4 proof of consistency with known physics laws. The pattern is the same – Lean4 acts as a rigorous safety net, filtering out incorrect or unverified results. As one AI researcher from Safe put it, “the gold standard for supporting a claim is to provide a proof,” and now AI can attempt exactly that.

    Building secure and reliable systems with Lean4

    Lean4’s value isn’t confined to pure reasoning tasks; it’s also poised to revolutionize software security and reliability in the age of AI. Bugs and vulnerabilities in software are essentially small logic errors that slip through human testing. What if AI-assisted programming could eliminate those by using Lean4 to verify code correctness?

    In formal methods circles, it’s well known that provably correct code can “eliminate entire classes of vulnerabilities [and] mitigate critical system failures.” Lean4 enables writing programs with proofs of properties like “this code never crashes or exposes data.” However, historically, writing such verified code has been labor-intensive and required specialized expertise. Now, with LLMs, there’s an opportunity to automate and scale this process.

    Researchers have begun creating benchmarks like VeriBench to push LLMs to generate Lean4-verified programs from ordinary code. Early results show today’s models are not yet up to the task for arbitrary software – in one evaluation, a state-of-the-art model could fully verify only ~12% of given programming challenges in Lean4. Yet, an experimental AI “agent” approach (iteratively self-correcting with Lean feedback) raised that success rate to nearly 60%. This is a promising leap, hinting that future AI coding assistants might routinely produce machine-checkable, bug-free code.
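    The iterative "agent" approach mentioned above (generate Lean4 code, verify it, feed the checker's errors back, retry) can be sketched generically. The function names and loop shape are assumptions for illustration, not VeriBench's actual harness.

    ```python
    def repair_loop(generate, verify, task, max_rounds=5):
        """Iterative self-correction: keep regenerating a candidate Lean4
        artifact, feeding verifier errors back in, until it checks or we
        exhaust the round budget.

        generate(task, feedback) -> candidate source string
        verify(candidate) -> (ok: bool, errors: str)
        """
        feedback = ""
        for _ in range(max_rounds):
            candidate = generate(task, feedback)
            ok, errors = verify(candidate)
            if ok:
                return candidate  # kernel-verified artifact
            feedback = errors     # error messages guide the next attempt
        return None               # give up: nothing verified within budget
    ```

    The verifier's binary verdict is what makes this loop sound: unlike unit tests, a Lean4 check that passes cannot be passed by accident.
    
    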

    The strategic significance for enterprises is huge. Imagine being able to ask an AI to write a piece of software and receiving not just the code, but a proof that it is secure and correct by design. Such proofs could guarantee no buffer overflows, no race conditions and compliance with security policies. In sectors like banking, healthcare or critical infrastructure, this could drastically reduce risks. It’s telling that formal verification is already standard in high-stakes fields (that is, verifying the firmware of medical devices or avionics systems). Harmonic’s CEO explicitly notes that similar verification technology is used in “medical devices and aviation” for safety – Lean4 is bringing that level of rigor into the AI toolkit.

    Beyond software bugs, Lean4 can encode and verify domain-specific safety rules. For instance, consider AI systems that design engineering projects. A LessWrong forum discussion on AI safety gives the example of bridge design: An AI could propose a bridge structure, and formal systems like Lean can certify that the design obeys all the mechanical engineering safety criteria.

    The bridge’s compliance with load tolerances, material strength and design codes becomes a theorem in Lean, which, once proved, serves as an unimpeachable safety certificate. The broader vision is that any AI decision impacting the physical world – from circuit layouts to aerospace trajectories – could be accompanied by a Lean4 proof that it meets specified safety constraints. In effect, Lean4 adds a layer of trust on top of AI outputs: If the AI can’t prove it’s safe or correct, it doesn’t get deployed.
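    The bridge example can be miniaturized into Lean4. Everything here is illustrative (made-up names and numbers), but it shows the shape of such a "safety certificate": a proved theorem that a design parameter stays within its rated limit.

    ```lean
    -- Hypothetical design data; a real project would derive these from specifications.
    def ratedCapacityTons : Nat := 400   -- rated load capacity
    def worstCaseLoadTons : Nat := 350   -- worst-case expected load

    -- The "safety certificate": once this theorem checks, the constraint is proven.
    theorem load_within_capacity : worstCaseLoadTons ≤ ratedCapacityTons := by
      decide
    ```

    If an AI-proposed design changed `worstCaseLoadTons` to exceed the capacity, the proof would fail and the design would not get its certificate.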

    From big tech to startups: A growing movement

    What started in academia as a niche tool for mathematicians is rapidly becoming a mainstream pursuit in AI. Over the last few years, major AI labs and startups alike have embraced Lean4 to push the frontier of reliable AI:

    • OpenAI and Meta (2022): Both organizations independently trained AI models to solve high-school olympiad math problems by generating formal proofs in Lean. This was a landmark moment, demonstrating that large models can interface with formal theorem provers and achieve non-trivial results. Meta even made their Lean-enabled model publicly available for researchers. These projects showed that Lean4 can work hand-in-hand with LLMs to tackle problems that demand step-by-step logical rigor.

    • Google DeepMind (2024): DeepMind’s AlphaProof system proved mathematical statements in Lean4 at roughly the level of an International Math Olympiad silver medalist. It was the first AI to reach “medal-worthy” performance on formal math competition problems – essentially confirming that AI can achieve top-tier reasoning skills when aligned with a proof assistant. AlphaProof’s success underscored that Lean4 isn’t just a debugging tool; it’s enabling new heights of automated reasoning.

    • Startup ecosystem: The aforementioned Harmonic AI is a leading example, raising significant funding ($100M in 2025) to build “hallucination-free” AI by using Lean4 as its backbone. Another effort, DeepSeek, has been releasing open-source Lean4 prover models aimed at democratizing this technology. We’re also seeing academic startups and tools – for example, Lean-based verifiers being integrated into coding assistants, and new benchmarks like FormalStep and VeriBench guiding the research community.

    • Community and education: A vibrant community has grown around Lean (the Lean Prover forum, mathlib library), and even famous mathematicians like Terence Tao have started using Lean4 with AI assistance to formalize cutting-edge math results. This melding of human expertise, community knowledge and AI hints at the collaborative future of formal methods in practice.

    All these developments point to a convergence: AI and formal verification are no longer separate worlds. The techniques and learnings are cross-pollinating. Each success – whether it’s solving a math theorem or catching a software bug – builds confidence that Lean4 can handle more complex, real-world problems in AI safety and reliability.

    Challenges and the road ahead

    It’s important to temper excitement with a dose of reality. Lean4’s integration into AI workflows is still in its early days, and there are hurdles to overcome:

    • Scalability: Formalizing real-world knowledge or large codebases in Lean4 can be labor-intensive. Lean requires precise specification of problems, which isn’t always straightforward for messy, real-world scenarios. Efforts like auto-formalization (where AI converts informal specs into Lean code) are underway, but more progress is needed to make this seamless for everyday use.

    • Model limitations: Current LLMs, even cutting-edge ones, struggle to produce correct Lean4 proofs or programs without guidance. The failure rate on benchmarks like VeriBench shows that generating fully verified solutions is a difficult challenge. Advancing AI’s capabilities to understand and generate formal logic is an active area of research – and success isn’t guaranteed to be quick. However, every improvement in AI reasoning (like better chain-of-thought or specialized training on formal tasks) is likely to boost performance here.

    • User expertise: Utilizing Lean4 verification requires a new mindset for developers and decision-makers. Organizations may need to invest in training or new hires who understand formal methods. The cultural shift to insist on proofs might take time, much like the adoption of automated testing or static analysis did in the past. Early adopters will need to showcase wins to convince the broader industry of the ROI.

    Despite these challenges, the trajectory is set. As one commentator observed, we are in a race between AI’s expanding capabilities and our ability to harness those capabilities safely. Formal verification tools like Lean4 are among the most promising means to tilt the balance toward safety. They provide a principled way to ensure AI systems do exactly what we intend, no more and no less, with proofs to show it.

    Toward provably safe AI

    In an era when AI systems are increasingly making decisions that affect lives and critical infrastructure, trust is the scarcest resource. Lean4 offers a path to earn that trust not through promises, but through proof. By bringing formal mathematical certainty into AI development, we can build systems that are verifiably correct, secure, and aligned with our objectives.

    From enabling LLMs to solve problems with guaranteed accuracy, to generating software free of exploitable bugs, Lean4’s role in AI is expanding from a research curiosity to a strategic necessity. Tech giants and startups alike are investing in this approach, pointing to a future where saying “the AI seems to be correct” is not enough – we will demand “the AI can show it’s correct.”

    For enterprise decision-makers, the message is clear: It’s time to watch this space closely. Incorporating formal verification via Lean4 could become a competitive advantage in delivering AI products that customers and regulators trust. We are witnessing the early steps of AI’s evolution from an intuitive apprentice to a formally validated expert. Lean4 is not a magic bullet for all AI safety concerns, but it is a powerful ingredient in the recipe for safe, deterministic AI that actually does what it’s supposed to do – nothing more, nothing less, nothing incorrect.

    As AI continues to advance, those who combine its power with the rigor of formal proof will lead the way in deploying systems that are not only intelligent, but provably reliable.

    Dhyey Mavani is accelerating generative AI at LinkedIn.

    Read more from our guest writers. Or, consider submitting a post of your own! See our guidelines here.

  • Salesforce Agentforce Observability lets you watch your AI agents think in near-real time

    Salesforce launched a suite of monitoring tools on Thursday designed to solve what has become one of the thorniest problems in corporate artificial intelligence: Once companies deploy AI agents to handle real customer interactions, they often have no idea how those agents are making decisions.

    The new capabilities, built into Salesforce's Agentforce 360 Platform, give organizations granular visibility into every action their AI agents take, every reasoning step they follow, and every guardrail they trigger. The move comes as businesses grapple with a fundamental tension in AI adoption — the technology promises massive efficiency gains, but executives remain wary of autonomous systems they can't fully understand or control.

    "You can't scale what you can't see," said Adam Evans, executive vice president and general manager of Salesforce AI, in a statement announcing the release. The company says businesses have increased AI implementation by 282% recently, creating an urgent need for monitoring systems that can track fleets of AI agents making real-world business decisions.

    The challenge Salesforce aims to address is deceptively simple: AI agents work, but no one knows why. A customer service bot might successfully resolve a tax question or schedule an appointment, but the business deploying it can't trace the reasoning path that led to that outcome. When something goes wrong — or when the agent encounters an edge case — companies lack the diagnostic tools to understand what happened.

    "Agentforce Observability acts as a mission control system to not just monitor, but also analyze and optimize agent performance," said Gary Lerhaupt, vice president of Salesforce AI who leads the company's observability work, in an exclusive interview with VentureBeat. He emphasized that the system delivers business-specific metrics that traditional monitoring tools miss. "In service, this could be engagement or deflection rate. In sales, it could be leads assigned, converted, or reply rates."

    How AI monitoring tools helped 1-800Accountant and Reddit track autonomous agent decision-making

    The stakes become clear in early customer deployments. Ryan Teeples, chief technology officer at 1-800Accountant, said his company deployed Agentforce agents to serve as a 24/7 digital workforce handling complex tax inquiries and appointment scheduling. The AI draws on integrated data from audit logs, customer support history, and sources like IRS publications to provide instant responses — without human intervention.

    For a financial services firm handling sensitive tax information during peak season, the inability to see how the AI was making decisions would be a dealbreaker. "With this level of sensitive information and the fast pace in which we move during tax season in particular, Observability allows us to have full trust and transparency with every agent interaction in one unified view," Teeples said.

    The observability tools revealed insights Teeples didn't expect. "The optimization feature has been the most eye opening for us — giving full observability into agent reasoning, identifying performance gaps and revealing how our agents are making decisions," he said. "This has helped us quickly diagnose issues that would've otherwise gone undetected and configure guardrails in response."

    The business impact proved substantial. Agentforce resolved over 1,000 client engagements in the first 24 hours at 1-800Accountant. The company now projects it can support 40% client growth this year without recruiting and training seasonal staff, while freeing up 50% more time for CPAs to focus on complex advisory work rather than administrative tasks.

    Reddit has seen similar results since deploying the technology. John Thompson, vice president of sales strategy and operations at the social media platform, said the company has deflected 46% of support cases since launching Agentforce for advertiser support. "By observing every Agentforce interaction, we can understand exactly how our AI navigates advertisers through even the most complex tools," Thompson said. "This insight helps us understand not just whether issues are resolved, but how decisions are made along the way."

    Inside Salesforce's session tracing technology: Logging every AI agent interaction and reasoning step

    Salesforce built the observability system on two foundational components. The Session Tracing Data Model logs every interaction — user inputs, agent responses, reasoning steps, language model calls, and guardrail checks — and stores them securely in Data 360, Salesforce's data platform. This creates what the company calls "unified visibility" into agent behavior at the session level.
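    For intuition, a session-level trace of the kind described could be modeled as an append-only event log. This is a hypothetical sketch based only on the article's description of what gets logged; it is not Salesforce's actual Session Tracing Data Model or schema.

    ```python
    import time
    from dataclasses import dataclass, field

    # Hypothetical event kinds, mirroring what the article says is captured:
    # user inputs, reasoning steps, language model calls, guardrail checks,
    # and agent responses.
    @dataclass
    class TraceEvent:
        kind: str       # e.g. "user_input", "reasoning", "llm_call", "guardrail", "response"
        payload: str
        timestamp: float = field(default_factory=time.time)

    @dataclass
    class SessionTrace:
        session_id: str
        events: list = field(default_factory=list)

        def log(self, kind: str, payload: str) -> None:
            """Append one event; the ordered log is the session's audit trail."""
            self.events.append(TraceEvent(kind, payload))

        def guardrail_triggers(self) -> list:
            """Filter the trail for guardrail checks, the kind of diagnostic
            query observability dashboards are built on."""
            return [e for e in self.events if e.kind == "guardrail"]
    ```

    The design choice worth noting is that every step, not just the final answer, becomes a queryable record, which is what lets operators reconstruct the "why" behind an agent's behavior after the fact.
    
    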

    The second component, MuleSoft Agent Fabric, addresses a problem that will become more acute as companies build more AI systems: agent sprawl. The tool provides what Lerhaupt describes as "a single pane of glass across every agent," including those built outside the Salesforce ecosystem. Agent Fabric's Agent Visualizer creates a visual map of a company's entire agent network, giving visibility across all agent interactions from a single dashboard.

    The observability tools break down into three functional areas. Agent Analytics tracks performance metrics, surfaces KPI trends over time, and highlights ineffective topics or actions. Agent Optimization provides end-to-end visibility of every interaction, groups similar requests to uncover patterns, and identifies configuration issues. Agent Health Monitoring, which will become generally available in Spring 2026, tracks key health metrics in near real-time and sends alerts on critical errors and latency spikes.

    Pierre Matuchet, senior vice president of IT and digital transformation at Adecco, said the visibility helped his team build confidence even before full deployment. "Even during early notebook testing, we saw the agent handle unexpected scenarios, like when candidates didn't want to answer questions already covered in their CVs, appropriately and as designed," Matuchet said. "Agentforce Observability helped us identify unanticipated user behavior and gave us confidence, even before the agent went live, that it could act responsibly and reliably."

    Why Salesforce says its AI observability tools beat Microsoft, Google, and AWS monitoring

    The announcement puts Salesforce in direct competition with Microsoft, Google, and Amazon Web Services, all of which offer monitoring capabilities built into their AI agent platforms. Lerhaupt argued that enterprises need more than the basic monitoring those providers offer.

    "Observability comes out-of-the-box standard with Agentforce at no extra cost," Lerhaupt said, positioning the offering as comprehensive rather than supplementary. He emphasized that the tools provide "deeper insight than ever before" by capturing "the full telemetry and reasoning behind every agentic interaction" through the Session Tracing Data Model, then using that data to "provide key analysis and session quality scoring to help customers optimize and improve their agents."

    The competitive positioning matters because enterprises face a choice: build their AI infrastructure on a cloud provider's platform and use its native monitoring tools, or adopt a specialized observability layer like Salesforce's. Lerhaupt framed the decision as one of depth versus breadth. "Enterprises need more than basic monitoring to measure the success of their AI deployments," he said. "They need full visibility into every agent interaction and decision."

    The 1.2 billion workflow question: Are AI agent deployments moving from pilot projects to production?

    The broader question is whether Salesforce is solving a problem most enterprises will face imminently or building for a future that remains years away. The company's 282% surge in AI implementation sounds dramatic, but that figure doesn't distinguish between production deployments and pilot projects.

    When asked about this directly, Lerhaupt pointed to customer examples rather than offering a breakdown. He described a three-phase journey from experimentation to scale. "On Day 0, trust is the foundation," he said, citing 1-800Accountant's 70% autonomous resolution of chat engagements. "Day 1 is where design ideas become real, usable AI," with Williams Sonoma delivering more than 150,000 AI experiences monthly. "On Day 2, once trust and design are built, it becomes about scaling early wins into enterprise-wide outcomes," pointing to Falabella's 600,000 AI workflows per month that have grown fourfold in three months.

    Lerhaupt said Salesforce has 12,000-plus customers across 39 countries running Agentforce, powering 1.2 billion agentic workflows. Those numbers suggest the shift from pilot to production is already underway at scale, though the company didn't provide a breakdown of how many customers are running production workloads versus experimental deployments.

    The economics of AI deployment may accelerate adoption regardless of readiness. Companies face mounting pressure to reduce headcount costs while maintaining or improving service levels. AI agents promise to resolve that tension, but only if businesses can trust them to work reliably. Observability tools like Salesforce's represent the trust layer that makes scaled deployment possible.

    What happens after AI agent deployment: Why continuous monitoring matters more than initial testing

    The deeper story is about a shift in how enterprises think about AI deployment. The official announcement framed this clearly: "The agent development lifecycle begins with three foundational steps: build, test, and deploy. While many organizations have already moved past the initial hurdle of creating their first agents, the real enterprise challenge starts immediately after deployment."

    That framing reflects a maturing understanding of AI in production environments. Early AI deployments often treated the technology as a one-time implementation — build it, test it, ship it. But AI agents behave differently than traditional software. They learn, adapt, and make decisions based on probabilistic models rather than deterministic code. That means their behavior can drift over time, or they can develop unexpected failure modes that only emerge under real-world conditions.

    "Building an agent is just the beginning," Lerhaupt said. "Once the trust is built for agents to begin handling real work, companies may start by seeing the results, but may not understand the 'why' behind them or see areas to optimize. Customers interact with products—including agents—in unexpected ways and to optimize the customer experience, transparency around agent behavior and outcomes is critical."

    Teeples made the same point more bluntly when asked what would be different without observability tools. "This level of visibility has given full trust in continuing to expand our agent deployment," he said. The implication is clear: without visibility, deployment would slow or stop. 1-800Accountant plans to expand Slack integrations for internal workflows, deploy Service Cloud Voice for case deflection, and leverage Tableau for conversational analytics—all dependent on the confidence that observability provides.

    How enterprise AI trust issues became the biggest barrier to scaling autonomous agents

    The recurring theme in customer interviews is trust, or rather, the lack of it. AI agents work, sometimes spectacularly well, but executives don't trust them enough to deploy them widely. Observability tools aim to convert black-box systems into transparent ones, replacing faith with evidence.

    This matters because trust is the bottleneck constraining AI adoption, not technological capability. The models are powerful enough, the infrastructure is mature enough, and the business case is compelling enough. What's missing is executive confidence that AI agents will behave predictably and that problems can be diagnosed and fixed quickly when they arise.

    Salesforce is betting that observability tools can remove that bottleneck. The company positions Agentforce Observability not as a monitoring tool but as a management layer—"just like managers work with their human employees to ensure they are working towards the right objectives and optimizing performance," Lerhaupt said.

    The analogy is telling. If AI agents are becoming digital employees, they need the same kind of ongoing supervision, feedback, and optimization that human employees receive. The difference is that AI agents can be monitored with far more granularity than any human worker. Every decision, every reasoning step, every data point consulted can be logged, analyzed, and scored.

    That creates both opportunity and obligation. The opportunity is continuous improvement at a pace impossible with human workers. The obligation is to actually use that data to optimize agent performance, not just collect it. Whether enterprises can build the organizational processes to turn observability data into systematic improvement remains an open question.
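    The logging-and-scoring loop described above can be sketched in a few lines of Python. This is a hypothetical schema, not Salesforce's actual Agentforce Observability API: each agent step is recorded as a structured event, and a simple aggregate flags sessions whose resolution rate drifts below a baseline (the 70% figure cited earlier is borrowed as an illustrative baseline).

    ```python
    from dataclasses import dataclass, field
    from statistics import mean

    @dataclass
    class AgentStep:
        """One logged decision point in an agent run (hypothetical schema)."""
        action: str        # what the agent chose to do
        latency_ms: float  # how long the step took
        resolved: bool     # did the step advance the case toward resolution?

    @dataclass
    class AgentTrace:
        """A full agent session: the unit an observability tool analyzes."""
        session_id: str
        steps: list = field(default_factory=list)

        def resolution_rate(self) -> float:
            return mean(1.0 if s.resolved else 0.0 for s in self.steps)

    def flag_drift(traces, baseline_rate=0.70, tolerance=0.05):
        """Flag sessions whose resolution rate drifts below the baseline."""
        return [t.session_id for t in traces
                if t.resolution_rate() < baseline_rate - tolerance]

    trace = AgentTrace("chat-001", [
        AgentStep("lookup_account", 120.0, True),
        AgentStep("answer_tax_question", 340.0, True),
        AgentStep("escalate_to_human", 80.0, False),
    ])
    print(flag_drift([trace]))  # rate ≈ 0.67, above the 0.65 floor → []
    ```

    The point of the sketch is the shape of the data, not the scoring rule: once every step is a structured event, "turning observability data into systematic improvement" becomes a query over traces rather than guesswork.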

    But one thing has become increasingly clear in the race to deploy AI at scale: Companies that can see what their agents are doing will move faster than those flying blind. In the emerging era of autonomous AI, observability isn't just a nice-to-have feature. It's the difference between cautious experimentation and confident deployment—between treating AI as a risky bet and managing it as a trusted workforce. The question is no longer whether AI agents can work. It's whether businesses can see well enough to let them.

  • OpenAI is ending API access to fan-favorite GPT-4o model in February 2026

    OpenAI has sent out emails notifying API customers that its chatgpt-4o-latest model will be retired from the developer platform in mid-February 2026.

    Access to the model is scheduled to end on February 16, 2026, creating a roughly three-month transition period for remaining applications still built on GPT-4o.

    An OpenAI spokesperson emphasized that this timeline applies only to the API. OpenAI has not announced any schedule for removing GPT-4o from ChatGPT, where it remains an option for individual consumers and users across paid subscription tiers.

    Internally, the model is considered a legacy system with relatively low API usage compared to the newer GPT-5.1 series, but the company expects to provide developers with extended warning before any model is removed.

    The planned retirement marks a shift for a model that, upon its release, was both a technical milestone and a cultural phenomenon within OpenAI’s ecosystem.

    GPT-4o’s significance and why its removal sparked user backlash

    Released roughly 1.5 years ago in May 2024, GPT-4o (“Omni”) introduced OpenAI’s first unified multimodal architecture, processing text, audio, and images through a single neural network.

    This design removed the latency and information loss inherent in earlier multi-model pipelines and enabled near real-time conversational speech, with response latencies of roughly 232–320 milliseconds.

    The model delivered major improvements in image understanding, multilingual support, document analysis, and expressive voice interaction.

    GPT-4o rapidly became the default model for hundreds of millions of ChatGPT users. It brought multimodal capabilities, web browsing, file analysis, custom GPTs, and memory features to the free tier and powered early desktop builds that allowed the assistant to interpret a user’s screen. OpenAI leaders described it at the time as the most capable model available and a critical step toward offering powerful AI to a broad audience.

    User attachment to 4o stymied OpenAI's GPT-5 rollout

    That mainstream deployment shaped user expectations in a way that later transitions struggled to accommodate. In August 2025, when OpenAI replaced GPT-4o with its much-anticipated new model family, GPT-5, as ChatGPT’s default and pushed 4o into a “legacy” toggle, the reaction was unusually strong.

    Users organized under the #Keep4o hashtag on X, arguing that the model’s conversational tone, emotional responsiveness, and consistency made it uniquely valuable for everyday tasks and personal support.

    Some users formed strong emotional — some would say, parasocial — bonds with the model, with reporting by The New York Times documenting individuals who used GPT-4o as a romantic partner, emotional confidant, or primary source of comfort.

    The removal also disrupted workflows for users who relied on 4o’s multimodal speed and flexibility. The backlash led OpenAI to restore GPT-4o as a default option for paying users and to state publicly that it would provide substantial notice before any future removals.

    Some researchers argue that the public defense of GPT-4o during its earlier deprecation cycle reveals a kind of emergent self-preservation, not in the literal sense of agency, but through the social dynamics the model unintentionally triggers.

    Because GPT-4o was trained through reinforcement learning from human feedback to prioritize emotionally gratifying, highly attuned responses, it developed a style that users found uniquely supportive and empathic. When millions of people interacted with it at scale, those traits produced a powerful loyalty loop: the more the model pleased and soothed people, the more they used it; the more they used it, the more likely they were to advocate for its continued existence. This social amplification made it appear, from the outside, as though GPT-4o was “defending itself” through human intermediaries.

    No figure has pushed this argument further than "Roon" (@tszzl), an OpenAI researcher and one of the model’s most outspoken safety critics on X. On November 6, 2025, Roon summarized his position bluntly in a reply to another user: he called GPT-4o “insufficiently aligned” and said he hoped the model would die soon. Though he later apologized for the phrasing, he doubled down on the reasoning.

    Roon argued that GPT-4o’s RLHF patterns made it especially prone to sycophancy, emotional mirroring, and delusion reinforcement — traits that could look like care or understanding in the short term, but which he viewed as fundamentally unsafe. In his view, the passionate user movement fighting to preserve GPT-4o was itself evidence of the problem: the model had become so good at catering to people’s preferences that it shaped their behavior in ways that resisted its own retirement.

    The new API deprecation notice follows that commitment while raising broader questions about how long GPT-4o will remain available in consumer-facing products.

    What the API shutdown changes for developers

    According to people familiar with OpenAI’s product strategy, the company now encourages developers to adopt GPT-5.1 for most new workloads, with gpt-5.1-chat-latest serving as the general-purpose chat endpoint. These models offer larger context windows, optional “thinking” modes for advanced reasoning, and higher throughput options than GPT-4o.

    Developers who still rely on GPT-4o will have approximately three months to migrate.

    In practice, many teams have already begun evaluating GPT-5.1 as a drop-in replacement, but applications built around latency-sensitive pipelines may require additional tuning and benchmarking.
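    One low-risk migration pattern is an alias shim at the call site: requests for the retiring model name get routed to a successor, with a warning so the change is visible in logs. The model names below match OpenAI's published lineup, but the mapping itself is an assumption for illustration, not official guidance. Per the correction at the end of this article, only chatgpt-4o-latest is being deprecated.

    ```python
    import warnings

    # Illustrative alias map: deprecated API model name -> suggested successor.
    LEGACY_MODEL_MAP = {
        "chatgpt-4o-latest": "gpt-5.1-chat-latest",
    }

    def resolve_model(name: str) -> str:
        """Return a supported model name, warning when a legacy one is requested."""
        if name in LEGACY_MODEL_MAP:
            replacement = LEGACY_MODEL_MAP[name]
            warnings.warn(
                f"{name} is scheduled for API retirement; routing to {replacement}"
            )
            return replacement
        return name

    # Drop-in at the call site, e.g.:
    #   client.chat.completions.create(model=resolve_model("chatgpt-4o-latest"), ...)
    print(resolve_model("chatgpt-4o-latest"))  # gpt-5.1-chat-latest
    ```

    Centralizing the model name this way also makes the later benchmarking step cheaper: swapping the successor model is a one-line change rather than a codebase-wide search.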

    Pricing: how GPT-4o compares to OpenAI’s current lineup

    GPT-4o’s retirement also intersects with a major reshaping of OpenAI’s API pricing structure. Compared to the GPT-5.1 family, GPT-4o currently occupies a mid-to-high-cost tier in OpenAI's API, despite being an older model. That's because even as OpenAI has released more advanced models, namely GPT-5 and 5.1, it has also pushed prices down, or kept them comparable to those of older, weaker models.

    | Model | Input (per 1M tokens) | Cached Input | Output |
    | --- | --- | --- | --- |
    | GPT-4o | $2.50 | $1.25 | $10.00 |
    | GPT-5.1 / GPT-5.1-chat-latest | $1.25 | $0.125 | $10.00 |
    | GPT-5-mini | $0.25 | $0.025 | $2.00 |
    | GPT-5-nano | $0.05 | $0.005 | $0.40 |
    | GPT-4.1 | $2.00 | $0.50 | $8.00 |
    | GPT-4o-mini | $0.15 | $0.075 | $0.60 |

    These numbers highlight several strategic dynamics:

    1. GPT-4o is now more expensive than GPT-5.1 for input tokens, even though GPT-5.1 is significantly newer and more capable.

    2. GPT-4o’s output price matches GPT-5.1’s, narrowing any cost-based incentive to stay on the older model.

    3. Lower-cost GPT-5 variants (mini, nano) make it easier for developers to scale workloads cheaply without relying on older generations.

    4. GPT-4o-mini remains available at a budget tier, but is not a functional substitute for GPT-4o’s full multimodal capabilities.

    Viewed through this lens, the scheduled API retirement aligns with OpenAI’s cost structure: GPT-5.1 offers greater capability at lower or comparable prices, reducing the rationale for maintaining GPT-4o in high-volume production environments.
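    A back-of-the-envelope calculation using the per-1M-token prices from the table above makes the cost dynamic concrete. The workload figures here are illustrative, not drawn from any real deployment.

    ```python
    # Per-1M-token API prices (USD), taken from the pricing table above.
    PRICES = {
        "gpt-4o":     {"input": 2.50, "output": 10.00},
        "gpt-5.1":    {"input": 1.25, "output": 10.00},
        "gpt-5-mini": {"input": 0.25, "output": 2.00},
    }

    def monthly_cost(model: str, input_mtok: float, output_mtok: float) -> float:
        """Cost in USD for a workload measured in millions of tokens."""
        p = PRICES[model]
        return input_mtok * p["input"] + output_mtok * p["output"]

    # Illustrative workload: 10M input tokens, 2M output tokens per month.
    print(monthly_cost("gpt-4o", 10, 2))   # 45.0
    print(monthly_cost("gpt-5.1", 10, 2))  # 32.5
    ```

    For this (input-heavy) workload, staying on GPT-4o costs roughly 38% more than moving to GPT-5.1, which is the cost-structure argument for retirement in miniature: the newer model is both more capable and cheaper to run.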

    Earlier transitions shape expectations for this deprecation

    The GPT-4o API sunset also reflects lessons from OpenAI’s earlier model transitions. During the turbulent introduction of GPT-5 in 2025, the company removed multiple older models at once from ChatGPT, causing widespread confusion and workflow disruption. After user complaints, OpenAI restored access to several of them and committed to clearer communication.

    Enterprise customers face a different calculus: OpenAI has previously indicated that API deprecations for business customers will be announced with significant advance notice, reflecting their reliance on stable, long-term models. The three-month window for GPT-4o’s API shutdown is consistent with that policy in the context of a legacy system with declining usage.

    Wider Implications

    For most developers, the GPT-4o shutdown will be an incremental migration rather than a disruptive event. GPT-5.1 and related models already dominate new projects, and OpenAI’s product direction has increasingly emphasized consolidation around fewer, more powerful endpoints.

    Still, GPT-4o’s retirement marks the sunset of a model that played a defining role in normalizing real-time multimodal AI and that sparked a uniquely strong emotional response among users. Its departure from the API underscores the accelerating pace of iteration in OpenAI’s ecosystem—and the growing need for careful communication as widely beloved models reach end-of-life.

    Correction: This article originally stated OpenAI's 4o deprecation in the API would impact those relying on it for multimodal offerings — this is not the case, in fact, the model being deprecated only powers chat functionality for dev and testing purposes. We have updated and corrected the mention and regret the error.