GPT-5.5 Lands, DeepSeek V4 Undercuts Everyone & Claude Code Ships a Post-Mortem
A real firehose day. OpenAI dropped GPT-5.5 plus a completely reworked Codex app, Anthropic published a post-mortem on the Claude Code quality slide that's been hovering over the last month, and then, just as Simon Willison was queuing up his Friday newsletter, DeepSeek V4 Preview shipped at prices that make everyone else look expensive. In the background: Codex Hooks going GA, Matt Pocock's Slack-based agent DAG, and Armin Ronacher retweeting Mario Zechner's case for switching away from Claude Code.
GPT-5.5 and the Codex app reboot
OpenAI announced GPT-5.5 mid-afternoon PT — now the default in ChatGPT and Codex. swyx did the cleanest summary of the numbers:
- Context: 400K in Codex, 1M in the API
- API pricing: $5/M input, $30/M output
- Benchmarks: 82.7% Terminal-Bench 2.0, 73.1% Expert-SWE (new internal eval), 58.6% SWE-Bench Pro, 84.9% GDPval, 98.0% Tau2-bench Telecom, 80.5% BixBench
- First generation co-designed with GB200 and GB300 NVL72
- Codex improved its own inference speed 20% — swyx: "lol"
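To make the API pricing concrete, here is a minimal sketch that turns the quoted $5/M input and $30/M output rates into a per-request cost. The token counts in the example are hypothetical, and the sketch assumes flat list prices with no discounts or tiers, since none are covered here.

```python
from decimal import Decimal

# Prices quoted in the list above, in USD per one million tokens.
INPUT_USD_PER_M = Decimal("5")
OUTPUT_USD_PER_M = Decimal("30")
TOKENS_PER_M = Decimal(1_000_000)


def request_cost_usd(input_tokens: int, output_tokens: int) -> Decimal:
    """Return the list-price cost of one request in USD."""
    input_cost = Decimal(input_tokens) / TOKENS_PER_M * INPUT_USD_PER_M
    output_cost = Decimal(output_tokens) / TOKENS_PER_M * OUTPUT_USD_PER_M
    return input_cost + output_cost


# Hypothetical agent turn: 200K tokens of context in, 20K tokens out.
example = request_cost_usd(200_000, 20_000)
print(f"${example:.2f}")
```

At these rates the hypothetical turn comes to $1.60, with $1.00 of that from input. Output tokens cost 6x as much per token, so long generations dominate the bill faster than large contexts do.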
Simon Willison had been previewing it for weeks: "Had some great results from it running security reviews against code written using other models." He also pointed out that GPT-5.5 isn't in the regular API yet but is accessible via "the apparently approved-of Codex API backdoor."
The bigger story is the Codex app rebuild, which LLMJunky retweeted from Andrew Ambrosino:
- GPT-5.5
- Browser control
- Sheets & Slides
- Docs & PDFs
- OS-wide dictation
- Auto-review mode
swyx's reaction: "Wow the codex app is unrecognizable… almost like it shouldve been Atlas the whole time." He also argued that "the most underrated part of today's launch is not GPT 5.5 at all" — it's the superapp consolidation. His Latent.Space writeup: GPT 5.5 and OpenAI Codex Superapp.
Community early reads:
- xeophon via steipete: "IT DOES NOT CODE DEFENSIVELY AS MUCH!!! No more nested try/catch blocks that would only trigger for cosmic bitflips... Its code is so readable."
- Noam Brown (OpenAI): "I'm a manager at OpenAI, but with GPT-5.5 I'm a more effective IC than I've ever been. I can now write CUDA kernels like a pro."
- davis7 via LLMJunky: hates the name (suggests 5.9), says use it on low reasoning — "99% of tasks don't need high."
- Codex capacity paged at 1am — "shuffling capacity and bringing more compute online."
Theo put it all together in his usual evening-video wrap-up with Ben, covering GPT-5.5, the Cursor/SpaceX meta, Kimi K2.6, GPT Image 2, and the Vercel hack.
Claude Code post-mortem: the harness drifted, not the model
Boris Cherny (@bcherny) kicked off a short thread announcing the post-mortem: anthropic.com/engineering/april-23-postmortem. Three issues, all fixed in v2.1.116+, usage limits reset for all subscribers.
His own framing (tweet 2):
"In my time on the team, this has probably been the most complex investigation we've had. The root causes were not obvious, and there were many confounders."
And the tell for the next chapter (tweet 3):
"Separately, we've also heard reports of issues with Opus 4.7 in Claude Code. The team is working on those and we'll share more as we roll out improvements over the coming days."
So: the CC harness regression was root-caused and fixed; the Opus 4.7 feel-bad is a separate track still in progress.
The sharpest counter-take came from Mario Zechner (@badlogicgames), retweeted by mitsuhiko:
"recommended reading. cool they are fixing things. but it's also a reason i switched away from CC. no control over the harness means having to wait for them to fix things. the model didn't change. the harness did."
This is the tension the whole Skills-vs-framework-vs-CLI debate has been circling: if the harness owns your feedback loop and you can't hot-patch it, a silent quality regression costs you a week before the vendor publishes a diagnosis. Worth keeping that line handy for the next time the debate comes up.
Context from yesterday's pricing A/B: Anthropic ran a test on ~2% of new prosumer signups that pulled Claude Code off the Pro plan. Existing Pro/Max subscribers weren't affected, but the perception damage was done. Fixing the quality slide and resetting limits is the right mop-up.
DeepSeek V4 Preview: 1M context, open weights, price floor reset
Dropped at 9pm Pacific on a Thursday. DeepSeek shipped:
- DeepSeek-V4-Pro: 1.6T total / 49B active params, "rivaling top closed-source models"
- DeepSeek-V4-Flash: 284B total / 13B active params
- 1M context on both
- Open weights on HuggingFace
- Tech report: huggingface.co/deepseek-ai/DeepSeek-V4-Pro
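A quick bit of arithmetic on the "total / active" numbers above. This sketch assumes "active" means the parameters actually used per token, as is usual for sparse mixture-of-experts models; the parameter counts are the ones listed, in billions.

```python
# Parameter counts from the list above, in billions: (total, active per token).
MODELS = {
    "DeepSeek-V4-Pro": (1600, 49),
    "DeepSeek-V4-Flash": (284, 13),
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the total parameters that are active for a single token."""
    return active_b / total_b


for name, (total_b, active_b) in MODELS.items():
    share = active_fraction(total_b, active_b)
    print(f"{name}: {share:.1%} of weights active per token")
```

That works out to roughly 3.1% for Pro and 4.6% for Flash. Per-token compute tracks the active count far more than the headline total, which is part of how models this large can be priced this low.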
Simon's immediate notes via simonwillison.net/2026/Apr/24/deepseek-v4/: "the really big news is the pricing: both DeepSeek-V4-Flash and DeepSeek-V4-Pro are the cheapest models in their categories while benchmarking close to the frontier models from other providers."
His pelican-on-a-bicycle comparison ran both via OpenRouter — "These pelicans are kind of angry looking!"
Theo's one-line take: "Never taking vacation ever again."
If GPT-5.5 reset the capability ceiling this week, V4 reset the cost floor. Every coding agent that bills through model APIs is now looking at whether Flash-tier tasks route to V4-Flash by default.
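What "route Flash-tier tasks to V4-Flash by default" could look like in an agent, as a rough sketch. The tier labels, the `Task` type, and the model identifier strings are all hypothetical and only illustrate the routing decision; they are not any provider's real API names.

```python
from dataclasses import dataclass

# Hypothetical tier labels and model identifiers, for illustration only.
DEFAULT_ROUTES = {
    "flash": "deepseek-v4-flash",
    "frontier": "gpt-5.5",
}
FALLBACK_TIER = "frontier"


@dataclass(frozen=True)
class Task:
    name: str
    tier: str  # "flash" for cheap, high-volume work; "frontier" for hard tasks


def pick_model(task: Task, routes: dict[str, str] = DEFAULT_ROUTES) -> str:
    """Return the model for a task; unknown tiers fall back to the frontier model."""
    return routes.get(task.tier, routes[FALLBACK_TIER])


print(pick_model(Task("rename a variable", "flash")))
print(pick_model(Task("design a migration", "frontier")))
```

Falling back to the frontier tier for unlabeled tasks is a deliberately conservative choice: a misrouted task costs more rather than failing quietly on a smaller model.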
Codex Hooks GA + Matt Pocock's Slack-based agent DAG
Two agentic-workflow items that pair well:
Codex Hooks hit stable, per LLMJunky's announcement: pre/post tool-use hooks now receive apply_patch payloads, so you can "run deterministic scripts any time any file is edited by Codex." He's added 50+ hooks to codex-marketplace.com. Companion Codex 0.124.0 notes: GPT-5.5 added, ALT+,/ALT+. keyboard shortcuts for reasoning level, Fast Service on by default for business/enterprise.
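A rough sketch of the kind of deterministic script a hook could run once it has an apply_patch payload. Two assumptions, stated loudly: the patch text is assumed to carry `*** Add File:`, `*** Update File:`, and `*** Delete File:` header lines, and the way a hook actually receives the payload is not covered here, so the functions just take the patch text as a string. The function names and sample patch are made up for illustration.

```python
import re

# Assumed header lines of the patch envelope, e.g. "*** Update File: src/app.py".
FILE_HEADER = re.compile(r"^\*\*\* (Add|Update|Delete) File: (.+)$", re.MULTILINE)


def touched_files(patch_text: str) -> list[tuple[str, str]]:
    """Return (action, path) pairs for every file header found in the patch."""
    return [
        (match.group(1).lower(), match.group(2).strip())
        for match in FILE_HEADER.finditer(patch_text)
    ]


def needs_python_checks(patch_text: str) -> bool:
    """True when the patch adds or updates at least one .py file."""
    return any(
        action != "delete" and path.endswith(".py")
        for action, path in touched_files(patch_text)
    )


# Hypothetical payload used only to exercise the functions.
SAMPLE = """*** Begin Patch
*** Update File: src/app.py
@@
-print("hi")
+print("hello")
*** Add File: docs/notes.md
+first line
*** End Patch
"""

print(touched_files(SAMPLE))
print(needs_python_checks(SAMPLE))
```

A post-edit hook built on something like this could decide whether to run a formatter or linter based on which files changed, instead of running everything after every edit.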
Matt Pocock's full Slack-based flow (tweet) is one of the cleanest concrete "what does my day look like" descriptions I've seen from a heavy user:
- /triage in Slack → creates a discussion thread with agents for each issue needing triage
- Resolve threads one-by-one until issues are marked ready
- /implement → planning agent builds a DAG of PR branches with dependencies
- Implementer agent works them all in parallel, each PR gets its own Slack thread
- Re-running /implement recomputes the whole DAG to absorb new issues
- Review and merge PRs while implementation continues
- Periodic /review across the codebase for architectural drift
Driven by Sandcastle + Vercel's Chat SDK. "Slack, Claude Code, and the sandboxes are all swappable dependencies." His framing tweet: "Nearly max concurrency and parallelism... Traditional review semantics... Collaborative UI for chatting to agents (thanks, Slack)."
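The "DAG of PR branches" step is easy to picture with a small sketch. This is not Matt's implementation, just a minimal illustration using Python's standard `graphlib`: branches are grouped into waves, every branch in a wave can be worked in parallel, and re-planning after a new issue is a full recompute. The branch names are hypothetical.

```python
from graphlib import TopologicalSorter


def parallel_waves(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group branches into waves; every branch in a wave can run in parallel.

    `deps` maps each branch to the set of branches it depends on.
    Raises graphlib.CycleError if the dependencies contain a cycle.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical branch names.
deps = {
    "pr-auth": set(),
    "pr-db": set(),
    "pr-api": {"pr-auth", "pr-db"},
    "pr-ui": {"pr-api"},
}
print(parallel_waves(deps))

# A new issue arrives: add its branch and recompute the whole plan.
deps["pr-docs"] = {"pr-api"}
print(parallel_waves(deps))
```

Recomputing from scratch rather than patching the old plan keeps the logic trivial, and at the scale of a few dozen branches the cost is negligible.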
The self-own of the day is also his (tweet):
"spends the evening queuing up 15-20 tasks for AFK agents overnight / forgets to run the script that kicks them off / sigh."
/schedule for Claude Code one-shots
Noah Zweben (Claude Code team, RT'd by trq212): /schedule for one-time tasks. Examples:
- /schedule cleanup this feature flag in 1 week
- /schedule give me a launch report for my feature in 2 days
Available from the CLI or the Routines UI. This fills the gap between "run now" and "recurring routine" — useful for follow-up tasks you don't want to lose track of but don't need to check back on until X days pass.
ParseBench on Kaggle
Jerry Liu launched ParseBench as a public Kaggle benchmark:
- 2,000 enterprise pages
- 167K+ test rules, 5 dimensions (tables/charts/content faithfulness/formatting/visual grounding)
- Benchmarks 14 methods including GPT-5 Mini, Gemini 3, Textract, and LlamaParse
He then gave the sales pitch for LlamaParse and admitted that Simon Willison's one-session browser port of LiteParse beat what his own team had been trying: "Shouldn't have doubted Claude. Claude knows best."
Simon's LiteParse-in-browser vibe-code is up at simonw.github.io/liteparse with a writeup on his blog.
Cursor and SpaceX: the "are independent AI startups viable?" argument
swyx surfaced two analyses of yesterday's SpaceXAI ↔ Cursor IPO-deferred-acquisition deal:
- Russell Kaplan's The Path Forward for AI Startups: is independent survival still possible, or must every AI startup eventually sell into a frontier lab?
- Kevin Kwok's Cursor and SpaceX: In search of a complete loop: the AI-lab meta increasingly requires owning both product and model in coding — a complete feedback loop.
Both worth reading if you're building a coding tool. Matt Pocock's response is indirectly the strongest one: the harness/skills/marketplace layer is still a real place to own, as long as the labs keep shipping commodity models you can swap between.
Mitsuhiko's AGPL lol
Small but worth the laugh, from Armin Ronacher:
"Trololol. I had the agent start using ua-parser-js, then it read the AGPL license after adding it and backed out immediately. This is hilarious."
Follow-up (tweet): "I think the only thing agents are more afraid of than exceptions is GPL code." Either emergent good behavior or a really deep RLHF signal.
He also noted: "Today is a hand writing code day. Codex is not vibing for me today." Sample size one, but from one of the more sober evaluators.
Google Stitch open-sources DESIGN.md
LLMJunky flagged that Stitch by Google open-sourced the draft spec for DESIGN.md, a cross-platform design-rules spec for agents. The quote he surfaced: "Instead of guessing intent, agents know exactly what a color is for and can even validate their choices against WCAG accessibility rules."
If this gains traction, it's the same "portable schema" play AGENTS.md and CLAUDE.md each tried with varying success. Worth watching whether any of the frontier agent products adopt it before it dies on the vine.
Tencent Hy3-Preview on OpenRouter
OpenRouter announced Tencent Hunyuan's Hy3-Preview: 295B MoE (21B active), controllable reasoning effort, strong on coding agents, free on OpenRouter. A quieter drop than GPT-5.5 or DeepSeek V4, but another data point on the "large open MoE" trend.
Non-agent sidebar
- UAE "50% of government sectors will run on Agentic AI" — simonw's retort: "Within two years you'll be able to prompt inject an entire country."
- LLMJunky spots a possible Codex mobile UI — screenshot via Ambrosino. "Nice try" indeed.
- OpenClaw got Azure credits from GitHub's Secure Open Source Fund + MS for Startups, per steipete. Also: 2026.4.22 makes gpt-image-2 the default OpenAI image path and picks up xAI image/TTS/STT/live transcription.
- mattpocockuk on language and programming: "English has ALWAYS been the programming language that matters most." and "Software development is 'finding the right way to describe something that doesn't exist yet'. It's a language problem... And LLMs are GREAT at helping you with this." Vintage Matt Pocock.
- Matt Pocock rename watch: /grill-me → /discuss. Wise.
- Cat Wu on Lenny's: podcast retweeted on how Claude Code maintains product velocity and the PM role in the AI era.
Open questions
- How long before Opus 4.7 in Claude Code gets its own post-mortem? The Boris Cherny thread explicitly punts it to "coming days" — the delta between the harness fix landing and the model fix landing is the thing to track.
- Does GPT-5.5's "less defensive coding" hold up once it's running at scale in production harnesses, or do we get a wave of "AttributeError in prod" regressions?
- Is Matt Pocock's Slack-based DAG flow actually replicable for teams smaller than his, or does it need a bunch of infra (Sandcastle, Chat SDK, plumbing) that isn't OSS-practical yet?
- DeepSeek V4 Flash is the most interesting pricing event of the year so far. Do the US-based coding agents actually route fallback traffic to it, or does it become the de facto home-lab model and stay walled off from mainstream dev flows?