title GPT 5.5 just did what no other model could

description In this mini episode, I break down OpenAI’s new GPT 5.5 and GPT 5.5 Pro after weeks of early testing. I walk through three real jobs I threw at the model:  building an app for me to teach my second grader more advanced subtraction concepts, tackling a tech debt problem in the ChatPRD codebase, and hacking into a proprietary Bluetooth pixel display that every other model had failed me on. My verdict: higher intelligence, better efficiency, and genuinely autonomous long-running loops that change what I think is worth tackling.


What you’ll learn:
• How I think about GPT 5.5 Pro’s pricing vs engineering time, and when I believe the “intelligence tax” is worth paying
• Why I treat GPT 5.5 as a developer model first, and why I couldn’t find a consumer use case that justified its intelligence
• The exact prompt pattern I use to unlock a long-running autonomous subagent loop
• How I got a near-six-hour autonomous run to one-shot 98% of edge cases in a migration over millions of chat threads and drop my Sentry error rate to the floor
• Why I’m now throwing GPT 5.5 at tech debt, flaky tests, and security backlogs first
• How I combined a Bluetooth packet sniffer and GPT 5.5 to reverse-engineer a proprietary pixel speaker after Claude Code and GPT 5.4 both gave up
• How I use the /personality command inside Codex to swap the default “baked potato” tone for something I actually enjoy working with
—
In this episode, I cover:
(00:00) Introduction to GPT 5.5 testing
(00:40) What is GPT 5.5 and how much does it cost?
(03:23) Testing GPT 5.5 in ChatGPT: the intelligence overhang problem
(07:12) Moving to Codex: where GPT 5.5 really shines
(16:01) Hacking a Chinese Bluetooth speaker
(21:47) Final thoughts on GPT 5.5’s intelligence and efficiency

Tools referenced:
• GPT 5.5 and GPT 5.5 Pro: https://openai.com/index/introducing-gpt-5-5/
• Codex: https://openai.com/codex/
• ChatGPT: https://chat.openai.com/
• Claude Code: https://claude.ai/code
• Sentry: https://sentry.io/
• Divoom MiniToo: https://divoom.com/products/minitoo

Other references:
• OpenAI Codex Security: https://openai.com/index/codex-security-now-in-research-preview/

Where to find Claire Vo:
ChatPRD: https://www.chatprd.ai/
Website: https://clairevo.com/
LinkedIn: https://www.linkedin.com/in/clairevo/
X: https://x.com/clairevo

Production and marketing by https://penname.co/. For inquiries about sponsoring the podcast, email [email protected].

pubDate Thu, 23 Apr 2026 20:30:23 GMT

author Claire Vo

duration 1416000

transcript

Speaker 1:
[00:00] Welcome back to How I AI. I'm Claire Vo, product leader and AI obsessive, here on a mission to help you build better with these new tools. Today, I have a very special episode for you, where I'm going to tell you everything I think about the new GPT 5.5 model, which I've been able to test for the past couple of weeks. Spoiler alert, it is a powerhouse, and I've been able to do things with this model, especially around advanced coding, that I haven't been able to do before with any other model on the market. I'm going to show you how it breaks my personal high-tech eval: hacking into this little computer. Let's get to it. Before I tell you what I built with GPT 5.5, let me tell you a little bit about the model itself. Today, OpenAI is releasing GPT 5.5 and GPT 5.5 Pro into Codex and ChatGPT, not available in the API quite yet. I've been testing this model for the past couple of weeks, and I will tell you, what OpenAI is saying is true. They're saying that it has a higher capacity for complex work, and that it is more efficient, including being more token-efficient, in getting that work done. The whole idea with this model is that it's smarter and it's more efficient, so you're going to get more done. That has really been my experience. Now, I'm glad it's more efficient, because it is expensive. GPT 5.5 is $5 per million input tokens and $30 per million output tokens, and GPT 5.5 Pro, which has powered all this work that I've been doing, is $30 per million input tokens and $180 per million output tokens. This is a pricey one, but when I reflect on what I was able to achieve with this model in early testing, I'm going to pay the intelligence tax, because I think what I was able to achieve is really important. And this is one of the things that I think about a lot when I'm testing these new models or these new tools. Everything has an ROI, and there can be an ROI in terms of speed. So can I get the things done that I want to get done faster?
And that's certainly been an accelerant from an AI tooling perspective, and something we've all experienced for the past couple of years. But where GPT 5.5 really helps me is ambition. It has been able to do things that literally I have not been able to do before, for a couple of reasons. One, the intelligence is just higher, and it has solved problems that other models and harnesses other than Codex have really had a hard time with. The second thing I've experienced is that because the efficiency is higher, I'm able to do more, faster, without losing context on what I'm working on, because it's happening really quickly, or it's being more autonomous so I don't have to babysit as much. So again, I'm getting more done. So I do believe that what OpenAI is telling us is true, but that's coming out of my own experience spending hours and hours and hours with this model, throwing problems at it that other models have really had a hard time with, including GPT 5.4. So let's talk about what I built. Folks, for the less technical here, one of the things I'm going to say about the model, and I tested it a little bit in ChatGPT, but not a lot, is that I don't know what to do with all this intelligence if you don't have complex problems to solve. So while I've tested it in ChatGPT in my personal account, which is what I got access to, I don't have complex, high-intelligence problems to solve in my personal account. And so it was really hard for me to think of where I would use 5.5 or 5.5 Pro in ChatGPT, simply because the problems I'm solving there aren't that hard. But I did try to solve problems there. So let's just talk quickly about how I used 5.5 in ChatGPT and what it gave me. And it will just give you an indication of what I'm going to show you a little bit later. But again, I think what the consumer, or even the everyday enterprise business user, is going to struggle with using ChatGPT with this model is: how many problems do you have that require superintelligence?
So again, I think this is going to be a model that developers and software engineers really love. And I'm really excited to see what OpenAI does in terms of unleashing and boxing this intelligence into use cases that the quote-unquote everyday person can use. So that's a little bit of my lecture on how much we have an intelligence overhang, basically. So what did I ask GPT 5.5 to do in ChatGPT? Really simple thing: I'm teaching my second grader two-digit and three-digit subtraction. He's actually in first grade, but San Francisco, I'm trying to push him ahead. And so one of the ways that I've been able to teach him is to build these little apps that help him understand subtraction with two digits and three digits and learn some tactics to do that well. And so I asked it to build an app for me to teach my second grader more advanced subtraction concepts. I haven't been super pleased with some of the vibe-coding tools or Claude Code on this. Nothing's really built this exactly how I wanted, so I wanted to give 5.5 a shot at it. And first out of the gate, it's a thinker. So you can see here it thought for 17 minutes, 27 seconds about this. You're gonna have this experience with this model. This is gonna be a theme of this mini episode. This thing will think. And it planned an app for advanced subtraction, built the code, all this kind of stuff. Now here's my question. Do we need 17 minutes of hyperintelligent thinking to build this app? Probably not. If I wasn't testing for the purpose of this podcast, would I have waited 17 minutes for this app? Probably not. So again, what are we gonna do with all this intelligence? Is this the right form factor for a non-engineer to access it? Not 100% sure. And it built me an app here. You can see it includes many lessons, word problems, read-aloud. It's fine. It's fine. It's fine. It has different modules in it. The design leaves something to be desired.
But again, I'm not really going to the GPT models for front end. I really want them to solve my hardest technical problems. And so I would just say, in ChatGPT, I'm unsure yet, only because I'm not sure what the average ChatGPT user is really trying to achieve and how much intelligence is required, even on the coding side. And so I just wanted to start there by saying: if you're in ChatGPT and you're using 5.5, let me know your hard intelligence problems so I can test them. I think for the basic "vibe-code me a simple little app," it's fine. It's not great. It's not particularly more impressive than other things on the market, but it does a reasonable job. And the sniff test of 5.5 is: it's going to think a lot, and it's going to give you this chain-of-thought reasoning here to let you know how it's thinking and managing its own process. Okay, so I'm going to put away ChatGPT. It's fine. Let's talk about using 5.5 Pro in Codex. And you all, I love her. I do. My initial reaction when I first started testing GPT 5.5 in Codex is: I am cooking. And what I mean by that is I was kicking off tons of tasks in parallel, because the feedback loop was fast; you felt the efficiency right away. I was knocking off very long-standing tasks with tons of subtasks underneath them. And I'll give an example of what those are. And I was able to bite off a tech-debt problem in the ChatPRD code base that I have wanted to take care of for truly months. It has been plaguing me. And GPT 5.5 blasted through it. So I want to show you a couple of those examples so you can understand what kind of tasks GPT 5.5 plus Codex is really good at, and why I think its intelligence is higher and the way it's configured to work autonomously and efficiently is really beneficial for the software engineer.
So the first thing that I did, which I'm not going to show you for what will become very obvious reasons, is we used OpenAI's Codex security product to run a threat assessment and security scan on the ChatPRD code base. And it was pretty good. We're pretty secure. But it did come up with some low-priority or low-severity issues that we needed to remediate. And instead of taking those one by one, what I did is I downloaded the CSV of those issues, uploaded it to Codex, and just said, can you please architecturally review these issues, group them if they're thematic, then propose a change, and then make those changes. And I will say, it just did it, and it did it very well. We did human review on that. We did code review on that. And we were just really happy with the quality of execution, but also the fact that I could give it a list of generally associated, but not single-project, tasks, and it could execute on those well. And the real validation of the quality of that output came when we had, very quickly after that, our annual penetration test, and our pen test came back super clean. And so I would just say: if you have a triage list of technical debt, if you have a triage list of security issues, even maybe front-end debt or flaky tests (engineers, pay attention), you can throw that list at GPT 5.5, and it will get that list done. So that's use case one that I thought was really efficient and great. Use case two, and I'm so disappointed the session cleared, because I wanted to show you how hard it worked on this project, but I have, as I mentioned, this lingering tech debt in the ChatPRD codebase, which is: we have millions of chats now for ChatPRD, and we're storing those chats in various legacy formats, as the model providers, both OpenAI and Anthropic, have changed the shape of their model responses over time. And so, TLDR for the folks that are less technical: every model in the world has changed a little bit about how it returns data via API over the past three years.
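To make that CSV-triage step concrete for the engineers following along, here is a rough sketch of the kind of thematic grouping involved. The column names and categories are invented for illustration; this is not the actual Codex Security export format, and in my case Codex did the grouping itself from the raw CSV:

```python
# Illustrative only: group a security-scan CSV into thematic buckets
# so a batch of loosely related findings can be handed to an agent as
# one task list. Column names ("category", "severity") are assumptions.
import csv
import io
from collections import defaultdict

def group_findings(csv_text: str) -> dict[str, list[dict]]:
    """Group findings by their 'category' column, keeping each full row."""
    buckets: dict[str, list[dict]] = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        buckets[row["category"]].append(row)
    return dict(buckets)

# Hypothetical sample export, just to show the shape of the input.
SAMPLE = """id,category,severity,description
1,input-validation,low,Missing length check on title field
2,headers,low,Missing X-Content-Type-Options header
3,input-validation,low,Unescaped user string in log line
"""

for category, findings in group_findings(SAMPLE).items():
    print(f"{category}: {len(findings)} finding(s)")
```

The point is only the shape of the task: a flat triage list becomes a few thematic groups, each of which is a natural unit of work for the agent.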
We have a bunch of debt, and data debt, around that, where we were storing legacy formats in our database. And for these legacy formats, because they are AI calls, because they may or may not contain attachments, because they may or may not contain tools, it is very hard to build a clean, cohesive backfill and sanitization of that data into our go-forward data model. And I have just been slapping fix after fix after fix, and patch after patch after patch, on this problem, because every time we patch it, we find another edge case. So this is an example of a data migration problem with millions of rows, which might not sound big to many people, but is pretty significant to us in terms of the complexity of the data inside of it, with functionally unstructured, lightly structured data with tons of edge cases. And I just finally was like, you know, GPT 5.5, take me away. I gave the model that problem, and it executed so well. It built, functionally one-shot, a solution that covered, I'm not kidding, 98 percent of the edge cases that we had identified. So first of all, one-shot building a complex migration by pointing it at the relevant blocks and libraries: very, very good. Something that has really been hard for us to do, because it was so complex and so unstructured before. The second thing, which I want to show you on the screen now, is I needed GPT 5.5 and Codex to validate that work. And so I pulled a production-like set of examples into a test environment, and I asked Codex: look, I need you to figure out a way to programmatically test every thread that's in local (I pulled a local version of this production-like data), post it to Anthropic and OpenAI and any other provider that we're using. I need you to make a scalable system for our team to do this programmatically, ideally through a CLI, so that any agent can test any thread for these data issues. And then, I've been saying this a lot to GPT 5.5: I trust you. This is my prompt to GPT 5.5.
I trust you to make a call, figure out how to spawn a sub-agent to do this, test it and identify any issues, repair them, and get this ready for production. Thank you, because I'm very polite. This thing worked for six hours. It was actually five hours and like 57 minutes. Truly, it just banged its head against the wall for six hours. And I did not have to do anything: zero prompts, zero follow-ups, zero steering. I think I had to approve one script call or something for it to have access to run in its sandbox. But otherwise, it just went for six hours. I have not seen that personally. Everybody says, oh, I'm getting my agent to run overnight. I had not seen it until GPT 5.5, in a very constrained use case. And so this thing will do long-running autonomous tasks that require sort of a loop to understand if it's doing well and moving things forward. It ran for almost six hours, and then it implemented the smoke test. It tested all the example data. And after this, we literally, after two million rows, had one edge case that was not caught. So just think about that for a minute. We had two million rows and one edge case, where before we were hitting edge case after edge case after edge case. Six hours of GPT 5.5. And then we saw our error rate just hit the floor in our Sentry monitoring. And so people say that AI coding is going to decrease quality because people are vibe coding. That is just such an 18-months-ago or 12-months-ago narrative. I think quality is going to go up. This kind of problem I've truly avoided, because the intelligence was not there to do it autonomously, and it was beyond my ability, and our engineering team's ability, to break down the problem and spend the dedicated time hitting every edge case in our synthetic data really hard. And every time you plug one hole, another one pops open. And just being able to hand this to GPT 5.5 and Codex has changed my life.
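For the less technical folks, the per-thread validation CLI is easier to picture with a sketch. This is not our actual code; the field names and the checks are hypothetical. The idea is just one small checker that any agent (or human) can run against any serialized thread:

```python
# Illustrative only: the general shape of a per-thread migration check.
# The real checks came from our data model; "role"/"content" here are
# stand-ins for whatever legacy shapes you are hunting for.
import json

def check_thread(raw: str) -> list[str]:
    """Return a list of problems found in one serialized chat thread."""
    issues: list[str] = []
    try:
        thread = json.loads(raw)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(thread, dict):
        return ["not a JSON object"]
    for i, msg in enumerate(thread.get("messages", [])):
        if "role" not in msg:
            issues.append(f"message {i}: missing role")
        if not isinstance(msg.get("content"), (str, list)):
            issues.append(f"message {i}: legacy content shape")
    return issues

# A clean thread produces no issues; a legacy one gets flagged.
print(check_thread('{"messages": [{"role": "user", "content": "hi"}]}'))
print(check_thread('{"messages": [{"content": "hi"}]}'))
```

Wrap something like this in a CLI and an agent can loop: run the checker over every local thread, read the failures, patch the migration, and run it again, which is exactly the self-sustaining loop I described above.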
So again, I am scared about how much this will cost me in production, but those tokens are, like, cheaper than me, cheaper than my engineering team. And it really did run six hours. And so I'm just like: throw this thing at your quality issues, throw this thing at your bug backlog, throw this thing at a security assessment, and close the quality gaps or performance gaps or security gaps in your app. It does really, really, really well. So that's my prime use case. If I didn't share anything else, this would be enough. It bit off my largest piece of tech debt in my app, basically made my errors go to zero, and did it all, six hours, autonomously, in a self-sustaining subagent loop. I love you, GPT 5.5. But there is a real eval. And I told you this in the intro. My real eval is this thing. This is a Divoom MiniToo, a retro-PC-style Bluetooth speaker and tiny screen. And I have been, I am not kidding, I have been hacking on this thing since late January or February. I think I ordered it around Valentine's Day. And my only goal is to be able to display funny stuff on this screen. Now, it comes with an out-of-the-box iPhone app. And so I can use this proprietary iPhone app to send images to this thing. But I don't want that. I live in the terminal. I want to be able to do this programmatically. And this is, like, proprietary code loaded on this device. I was very deep in Chinese-language repositories and documentation from, like, Bluetooth hardware providers. I was in deep, y'all. First, I threw Claude Code at this. And I said, can you figure this out? Claude Code could not figure it out, even with Opus. I threw GPT 5.4 at it. It could not figure it out. I cannot tell you how crazy I went with this, but I'm going to try. So this is a little device. You'd think you would be able to plug it in and just say, dear Claude Code, tell me how this device works, make no mistakes. No, that's not how it works.
It connects to your computer or to your phone via Bluetooth. So it is interacting with this app on your phone through Bluetooth. In the app, I can draw something and click send, and it will display here. So I know that over Bluetooth, I can change the display of this device. But we could not figure out how to encode that message. What did I do? Well, this is a little peek. This has nothing to do with AI. This is a peek into how cuckoo bananas your friend Claire is. So what I did is I spent truly hours installing a Bluetooth debugging profile on my phone for developer debugging. I then hooked it up to, sorry, I'm crazy, hooked it up to a packet sniffer, so that when I was using the app here on my phone, and it sent an image to this computer, it would log and sniff the packets and tell me what Bluetooth was sending to this little guy. I threw these logs and all the information that I had at 5.5, and let me show you what happened. So I'm going to get that repo up really quickly and show you my desperate prompting. I said: this thing is connected by Bluetooth. Take what you know and please just do anything to figure out how to display on this. You have so much information. You should know how to do it. I believe in you. And guess what? This f-ing thing did it. It did it. So my success measure here is I was able to build a command-line tool where I can run it in terminal, press Enter, let's see. Did the benchmark hit? "Hello," it says. Hello. This is months, months, months of trying to hack into this stupid thing. It was encoding and decoding bitmap files. It was crawling the web trying to find if there was some secret SDK. Codex, you did the thing. And even better than that, it is now hooked up so that anytime I ask Codex to do a thing, it will alert me on this. So let's give it a little try live on the podcast, and then I will get you out of here. But I am telling you, this, hack into a proprietary device, that is my intelligence test now. All right.
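If you've never reverse-engineered a Bluetooth device, the thing you end up recovering from sniffed packets is a framing scheme: how the app wraps an image payload before writing it over the air. The sketch below is NOT the actual Divoom protocol; the start/end bytes, length prefix, and checksum are invented, but it shows the general shape of what those sniffer logs let GPT 5.5 work out:

```python
# Illustrative only: a made-up frame format of the kind you reconstruct
# from packet captures. Real devices differ in every detail (markers,
# endianness, checksum algorithm, chunk size), which is the hard part.
def frame_payload(payload: bytes, mtu: int = 20) -> list[bytes]:
    """Wrap a payload in a length-prefixed, checksummed frame, then
    split the frame into write-sized chunks."""
    length = len(payload).to_bytes(2, "little")        # payload length
    checksum = (sum(payload) & 0xFFFF).to_bytes(2, "little")
    frame = b"\x01" + length + payload + checksum + b"\x02"
    # BLE writes are limited to the negotiated MTU, so chunk the frame.
    return [frame[i:i + mtu] for i in range(0, len(frame), mtu)]

chunks = frame_payload(bytes(range(48)))
print(f"{len(chunks)} chunks of up to 20 bytes")
```

The sniffer gives you real frames like these; the model's job is to guess which bytes are markers, which are lengths, and which are checksums, then confirm the guess by sending its own frame and watching the screen.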
So let me share my screen really quickly, and let's just test if this thing works. So I have my terminal up, and I am going to go into Codex. And I'm going to say something really simple. I'm going to say, what can you help me with? Okay. And I built into my Codex config a notify hook that should do something on here when it's time to be notified. So, what can you help me with? Dear Codex, it's going to tell me. And let's see, it's done. Maybe I'm not paying attention to my computer. Let's see if it runs. It should make a noise. Your move. Well, your move without the E: your mauve. It made a little beepy boop. You all, this is changing my life. So again, I did three assessments of GPT 5.5. This is the one that impressed me most. I will share more about this on the blog. I might even do a little mini app on this particular workflow. I'll try to publish the code. But you all, this was my delight moment. I screamed, my children were blown away. They have seen me slave over this thing. I was sending them messages and saying, hey, and then responding to their questions by just showing them the screen. I am obsessed. So GPT 5.5 has hit my intelligence benchmark of: can you hack into this Chinese digital screen with proprietary Bluetooth transport mechanisms and bitmap compression? And guess what? 5.5 can. All right, so that is a wrap for our quick review of GPT 5.5. TLDR: I love this thing. It is super smart, it is super efficient, and it will work on its own against complex problems, basically as hard as you ask it to. It has solved problems I have not been able to solve before. The only thing I will leave you with is that it has the, as I call it, baked potato personality that we've all come to know and love from Codex. It is a dull, dull dullard. But I learned over the testing of this: if you do /personality in Codex, you're able to change that to something a little friendlier.
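On the notify hook: the Codex CLI can be configured to invoke an external program when it wants your attention, and that program can do anything, including pushing a message to a pixel display. The payload shape below is an assumption (check the Codex docs for the real schema), and the config line is only a sketch of the idea:

```python
# Illustrative only: a tiny notify script of the kind a Codex config
# could point at, e.g. something like `notify = ["python3", "notify.py"]`
# in the Codex config file. The JSON payload fields are assumptions.
import json
import sys

def summarize(payload: str) -> str:
    """Turn a notification JSON payload into a short display message."""
    event = json.loads(payload)
    kind = event.get("type", "event")
    # In my setup, this string is what gets pushed to the pixel screen.
    return f"codex: {kind}"

if __name__ == "__main__":
    # Codex passes the event as an argument; fall back to an empty event.
    print(summarize(sys.argv[1] if len(sys.argv) > 1 else "{}"))
```

Swap the `print` for a call to whatever command-line tool drives your device, and every finished Codex turn becomes a beepy boop on the desk.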
And while some of my fellow early testers said it had too much of a Gen Z personality, I said, I like to stay young. Give me that Gen Z GPT 5.5. I'll take it any day over the paper-bag, baked-potato personality that you get out of the box. Other than that, it's my favorite senior software engineer, staff software engineer. I'm going to go blow through a bunch of technical work, and I really love this model. So I can't wait to hear what you think. And if you figure out a high-intelligence test that works in ChatGPT, let me know. Otherwise, enjoy coding, and I can't wait to see what you build. Thanks, y'all. Thanks so much for watching. If you enjoyed this show, please like and subscribe here on YouTube, or even better, leave us a comment with your thoughts. You can also find this podcast on Apple Podcasts, Spotify, or your favorite podcast app. Please consider leaving us a rating and review, which will help others find the show. You can see all our episodes and learn more about the show at howiaipod.com. See you next time.