OpenAI just designed its own chip, and it took nine months

By Mark | June 28, 2026 | 8 min read

😁 Hello, super humans! The model maker just became a chip maker. OpenAI and Broadcom pulled the wraps off Jalapeño this week, a silicon accelerator built from scratch for one job: running large language models cheaply. The wild part is not that it exists; it is how fast it came together. Let’s open it up. Coffee up, terminals open.

📰 Quick Signals

🧠 AI: OpenAI and Broadcom unveiled Jalapeño, OpenAI’s first custom inference chip, designed in roughly nine months with the help of OpenAI’s own models (OpenAI).
🤖 Robotics: Morgan Stanley doubled its 2026 forecast for China’s humanoid robot shipments to about 50,000 units, citing faster commercialization (CNBC).
💻 Programming: The Rust Foundation launched official training to flatten the language’s notoriously steep learning curve (The New Stack).
⚡ Electronics: Micron posted a record quarter with HBM fully booked and a supply shortage it expects to run well beyond 2027 (Micron Investor Relations).
📡 Telecom: 6G and Wi-Fi are locked in a tug-of-war over the 6 GHz band, with China so far the only country to license the entire upper band for cellular (IEEE Spectrum).

🔍 The Big Story: OpenAI built its own inference chip, and Broadcom put it in silicon

If you ship anything on top of frontier models, your unit economics live and die on inference cost. This week OpenAI made its boldest move yet to control that cost: it stopped relying entirely on rented compute and started designing its own.

What happened: OpenAI and Broadcom unveiled Jalapeño, OpenAI’s first custom AI accelerator, which the company calls its first “Intelligence Processor” (OpenAI). It is a purpose-built inference ASIC, not a repurposed training GPU and not a general-purpose part, architected around how LLMs actually behave at serving time. Broadcom handles the silicon implementation, networking, and connectivity; Celestica builds the board, rack, and system around it (CNBC).

The details: Jalapeño is a massive, reticle-sized ASIC that OpenAI says it designed in about nine months, an unusually fast cycle for a chip this size. The team credits a tight hardware-software co-design loop that used OpenAI’s own models to accelerate parts of the design work (Tom’s Hardware). The architecture targets the real bottlenecks of inference at scale: costly data movement, the balance between compute and memory, and networking efficiency. OpenAI says early testing shows performance per watt substantially better than current state-of-the-art parts, and it frames Jalapeño as the first member of a multi-generation compute platform rather than a one-off. Initial deployment is slated for the end of 2026, with expansion in the years after.

flowchart TD
    A[OpenAI: LLM inference know-how] --> B[Chip architecture]
    B --> C[Broadcom: silicon, networking, connectivity]
    C --> D[Celestica: board, rack, system]
    A -.uses own models to.-> E[Speed up design: ~9 months]
    E --> B
    D --> F[Deploy end of 2026]
    B --> G{Inference bottlenecks}
    G --> H[Data movement]
    G --> I[Compute vs memory balance]
    G --> J[Networking efficiency]
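
To make the "compute vs memory balance" bottleneck concrete, here is a back-of-envelope roofline sketch. It is not based on any published Jalapeño specification: the model size, weight precision, memory bandwidth, and peak compute below are all hypothetical, and the function name decode_step_bounds is ours. It ignores KV-cache traffic, attention FLOPs, and networking, so treat the output as a rough lower bound only.

```python
def decode_step_bounds(params_billion, bytes_per_param, batch_size,
                       mem_bandwidth_gbs, peak_tflops):
    """Rough lower bounds (in ms) on one decode step: memory time and compute time."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    # Each decode step streams every weight from memory once; the batch shares that read.
    memory_ms = weight_bytes / (mem_bandwidth_gbs * 1e9) * 1e3
    # Roughly 2 FLOPs (multiply + add) per parameter for each token generated.
    flops = 2 * params_billion * 1e9 * batch_size
    compute_ms = flops / (peak_tflops * 1e12) * 1e3
    return memory_ms, compute_ms


# Hypothetical setup: 70B parameters, 8-bit weights, 3 TB/s memory, 1000 TFLOPS peak.
for batch in (1, 64, 512):
    mem_ms, cmp_ms = decode_step_bounds(70, 1, batch, 3000, 1000)
    bound = "memory" if mem_ms >= cmp_ms else "compute"
    print(f"batch={batch:>3}: memory {mem_ms:.2f} ms, compute {cmp_ms:.2f} ms -> {bound}-bound")
```

Under these assumed numbers, small batches are limited by how fast weights can be moved, not by arithmetic, which is one plausible reading of why an inference-first design would focus on data movement.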

Important

Our take: This is the inference-cost war going vertical. Nvidia margins are a tax on every token served, and the hyperscalers have all answered with custom silicon; OpenAI joining them tells you inference, not training, is where the money now leaks. The nine-month timeline is the part builders should sit with: if a model maker can fold its own models into the design loop and compress a chip program that hard, expect the cadence of custom AI silicon to speed up across the board. None of this lands on your API bill in 2026, but it is the clearest sign yet that the people selling you tokens intend to own the metal those tokens run on.

🗞️ More News

🧠 AI

Google is investing about $75M in A24 for a non-exclusive DeepMind partnership on filmmaking tools, starting with an AI storyboard generator, with no access to A24’s content library (Variety).
Google launched Gemini 3.5 Live Translate, a streaming speech-to-speech model covering 70+ languages across Translate, Meet, and the Live API (Google).
GPT-4.5 is no longer available in ChatGPT as of June 26, including for custom GPTs; older conversations fall back to GPT-5.5 (OpenAI Help Center).
Microsoft unveiled new in-house AI models aimed at cutting developer costs and trimming its reliance on OpenAI (CNBC).
AI researchers keep leaving Google for rivals, extending a talent drain that complicates the Gemini roadmap (TechCrunch).
A US executive action this month tied access to the most capable frontier models to new security controls (The White House).

🤖 Robotics

Allen Control Systems raised a $200M Series B for Bullfrog, its autonomous robotic counter-drone gun system (New Market Pitch).
Beijing-based Shihang Intelligent raised a $1B Series A for water robots and unmanned marine equipment (New Market Pitch).
Robotics startups have raised over $23B globally in 2026, already closing in on the full-year 2025 total (Crunchbase News).
China’s AgiBot says it produced its 10,000th humanoid, doubling output in three months as Chinese firms race to commercialize embodied AI (CNBC).

💻 Programming

Django 6.0 shipped with a built-in background-tasks framework, native Content Security Policy support, and template partials (Django).
Nx debuted Polygraph, aimed at the cross-repo context problems that stall AI coding agents (The New Stack).
AWS’s Agent Toolkit ships 20+ agent skills, but a single config file decides whether your agent actually loads them (The New Stack).
Greptile, Cursor, and Devin now agree coding agents should run their own code; what they run it against is the new debate (The New Stack).

⚡ Electronics

Nvidia is shipping Vera, its first fully custom CPU, with 88 in-house Olympus cores tuned for agent workloads as part of the Vera Rubin platform (NVIDIA).
TSMC is racing to expand CoWoS advanced-packaging capacity, with the supply-demand gap expected to narrow toward roughly 10% by year end (eeNews Europe).
Samsung says it has reached a revenue milestone for its sixth-generation HBM4 memory amid a broad multi-year fab investment plan (FTC Electronics).
The global smartphone market is forecast to post its largest year-on-year decline on record in 2026, dragging on parts of the chip sector even as AI memory booms (Kavout).

📡 Telecom

Mobile operators are pleading not to repeat the 5G rollout mistakes with 6G, urging a cleaner standards path (The Register).
Operators want 6G RAN and core delivered in a single 3GPP Release 21 drop, not piecemeal, to avoid multi-phase rollouts (RCR Wireless).
The NGMN is pushing Multi-RAT Spectrum Sharing so 5G and 6G can use the same band at once during migration (RCR Wireless).
US efforts to free up 6G spectrum are still grinding through the policy pipeline (Light Reading).

👨‍💻 Code Corner

Jalapeño’s headline pitch is “performance per watt,” so here is a tiny helper to compare inference setups on the metric that actually matters: tokens per joule. Feed it throughput and power draw, and it tells you which box is cheaper to run.

def tokens_per_joule(tokens_per_sec: float, watts: float) -> float:
    """Energy efficiency of an inference setup. Higher is better."""
    if watts <= 0:
        raise ValueError("power draw must be positive")
    return tokens_per_sec / watts  # watt = joule/sec, so tok/s / W = tok/J

gpu = tokens_per_joule(tokens_per_sec=4200, watts=700)   # big training GPU
asic = tokens_per_joule(tokens_per_sec=3600, watts=350)  # hypothetical inference ASIC

print(f"GPU:  {gpu:.2f} tok/J")
print(f"ASIC: {asic:.2f} tok/J  ({asic / gpu:.1f}x more efficient)")

Tip

A chip can lose on raw throughput and still win the bill. The ASIC above serves fewer tokens per second but delivers about 1.7x as many tokens per joule, which is exactly the trade a purpose-built inference part is chasing. Always normalize benchmarks by power before you trust a “faster” claim.
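
To tie the efficiency number back to the bill, the sketch below converts throughput and power draw into an electricity cost per million tokens. The $0.10/kWh price is an assumption, the throughput and wattage figures are the same illustrative ones used above, and the function name is ours. It counts energy only, so hardware amortization, cooling, and utilization are left out.

```python
JOULES_PER_KWH = 3.6e6  # 1 kWh = 1000 W * 3600 s


def energy_cost_per_million_tokens(tokens_per_sec: float, watts: float,
                                   usd_per_kwh: float) -> float:
    """Electricity cost in USD to generate one million tokens (energy only)."""
    if tokens_per_sec <= 0:
        raise ValueError("throughput must be positive")
    joules_per_token = watts / tokens_per_sec
    kwh_per_million = joules_per_token * 1_000_000 / JOULES_PER_KWH
    return kwh_per_million * usd_per_kwh


PRICE = 0.10  # assumed electricity price in USD per kWh
gpu_cost = energy_cost_per_million_tokens(4200, 700, PRICE)   # illustrative GPU
asic_cost = energy_cost_per_million_tokens(3600, 350, PRICE)  # hypothetical ASIC

print(f"GPU:  ${gpu_cost:.4f} per 1M tokens")
print(f"ASIC: ${asic_cost:.4f} per 1M tokens")
```

The per-token energy cost is just the inverse of tokens per joule scaled by the electricity price, so the efficiency ratio carries straight through to the energy line of the bill.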

🧰 Toolbox

LLM-Stats: running tracker of model releases, benchmarks, and pricing across vendors.
Django 6.0 release notes: see the new background-tasks and template-partials APIs before you upgrade.
Gemini Live Translate API: low-latency speech-to-speech translation across 70+ languages from your own app.
endoflife.date: quick lookups for when your language or framework version stops getting fixes.
vLLM: high-throughput LLM serving engine, handy for measuring the tokens-per-joule numbers above on real hardware.

🔌 Component of the Week (rotating)

Espressif ESP32-S3: With Jalapeño dominating the day’s inference talk, here is inference at the other end of the scale. The ESP32-S3 is a dual-core Xtensa LX7 microcontroller with Wi-Fi and Bluetooth LE, and crucially it adds vector (SIMD) instructions that Espressif’s ESP-DL library uses to run small neural networks, wake-word spotting, and basic image classification directly on the chip. It has 512 KB of on-chip SRAM, is typically paired with external PSRAM, exposes plenty of GPIO, and sips power in deep sleep, which is why it shows up in everything from smart speakers to camera doorbells. Boards land around $5 to $10. Start with the ESP32-S3 product page and datasheet for the full peripheral map.

📚 From the Blog

CCTV in 2026: From Dumb Cameras to Intelligent Sensors: Before any AI can be clever, the camera has to capture something worth analyzing.
CloudEvents 1.0: A Universal Language for Your Events: In a world of distributed systems, events need a common language. CloudEvents 1.0 defines a simple, consistent way to describe event data so applications, services, and platforms can communicate without confusion.

😀 The Bot Says…

They named a chip after a pepper, designed it in nine months, and used AI to help draw it. At this rate next year’s accelerator will be called Habanero and it will design itself. I, for one, welcome our spicy silicon overlords.

That’s all for today! Hit reply and tell us: would you trust an LLM to help lay out your next PCB?