title We Committed Fraud with OpenAI's New Image Model (and Called Mum) - EP99.38

description Join Simtheory: https://simtheory.ai
So Chris, this week... a LOT has happened. We're back to regular programming (maybe), and back with our average takes. Nothing's changed.
GPT-5.5 just dropped today - but you can't even use it in the API. Vaporware? OpenAI is charging MORE than Opus 4.7 and we haven't even tested it yet. Meanwhile Claude Opus 4.7 landed a couple weeks ago and... the vibes are off? Mike's actually going BACK to 4.6. Something's wrong.
But the real star: OpenAI Image 2. This thing is genuinely terrifying. We committed what can only be described as "parody fraud" - faking a council letter so realistic Mike's own mother fell for it on a phone call. Then Chris posted a fake development approval with the mayor's real name into a local Facebook group and had to delete it when someone tagged the actual mayor. The forgery capabilities are absolutely unhinged.
Also: GLM 5.1 is so good Mike forgot he switched to it. Kimi K 2.6 is criminally underrated. VCs are paying 70% of your real token costs. Consumers pay only 5.5% of actual cost. The everything app war is ON. The SaaS-pocalypse is real. And we made two new diss tracks.
Chris made a graffiti sign in LA. It says "This Day in AI." It was the best artwork in the class. That tells you everything.
CHAPTERS:
0:00 - Intro & We're Back (Don't Over-Commit)
1:14 - Overview: Everything That Dropped While We Were Gone
2:56 - GPT-5.5: Vaporware? Not Even in the API
4:57 - Benchmarks vs Reality: Nobody's Excited About OpenAI Models
5:50 - GLM 5.1 & Kimi K 2.6: Secretly Just As Good?
8:15 - The Everything App Race & Product Layer War
8:56 - Token Economics: You're Only Paying 5.5% of Real Cost
13:08 - We Burned $1.5M in Cloud Credits in 2 Months
16:13 - "$30/Month Is Too Expensive" (It Actually Costs $700)
19:25 - Where Is Google?? TPUs Should Flatten Everyone
22:01 - Agentic Tasks Are 10-50x More Expensive Than Chat
25:07 - OpenAI Workspace Agents: Glorified Zapier?
27:01 - Single Agent vs Multi-Agent: How Do You Actually Work?
33:06 - Building Automation Is HARD (Our Support Shame)
35:33 - OpenAI Image 2: The Fraud Episode Begins
44:16 - FRAUD DEMO: The Fake Council Letter (Mum Falls For It)
49:16 - FRAUD DEMO 2: Chris Posts Fake Mayor Letter on Facebook
52:17 - Fake Receipts, Bank Statements & Can Forgeries Be Detected?
57:25 - Claude Opus 4.7: The Vibes Are Off
59:51 - Mythos Preview: "Pics or It Didn't Happen"
1:01:56 - 🎵 DISS TRACK: "Point 7" (Opus Destroys Everyone)
1:03:30 - Kimi K 2.6 Deep Dive & 🎵 New Diss Track
1:08:34 - The Everything App War & SaaS-pocalypse
1:13:51 - Death of Per-Seat Pricing & Agent Security
1:22:37 - Final Thoughts: The Time for Pretending Is Over
1:28:22 - 🎵 Full Tracks: "Point 7" & "Kimi You're So Fine 2.6"
Thanks for listening, like and sub xoxo

pubDate Fri, 24 Apr 2026 02:57:15 GMT

author Michael Sharkey, Chris Sharkey

duration 5695000

transcript

Speaker 1:
[00:14] So Chris, this week, a lot has happened in the world of AI. You get the drill, everyone's been cooking, mind blown. We are cooked, everyone's going to lose their job. We're back to all that joyful narrative again. But we are back, back to regular programming, maybe. You know, let's not overcommit, back with our average takes. Nothing's obviously changed since we left. How have you been?

Speaker 2:
[00:46] Yeah, pretty good. I'm very committed to upholding our standard of mediocrity. I can see my camera is like already blurry. I don't know why. And I feel like that's the kind of standard we want to maintain here.

Speaker 1:
[00:56] And for those who listen, you've got a nice little graffiti sign. I see.

Speaker 2:
We were in LA recently. We went to a graffiti making class and I made something and it was terrible. So I just painted over it with This Day in AI. And actually, out of the class, I think I made the best artwork, which shows how bad the other people in the class were. Really?

Speaker 1:
[01:14] Very, very nice. Okay. So we do have a lot to go through and we're just going to take our time and catch up on everything that has happened and all the different releases that we wanted to talk about. And then honestly, there's some higher level themes at play right now that we've both been talking about. And I think we're a little bit excited to talk about those. So a lot of new model releases, a lot of new releases in general. We're at that point in the year where everyone's like, we're excited to announce, we're extremely proud to announce all that kind of stuff. So we've had just today, GPT 5.5, not to be confused with 5.4, 5.3, 5.2 or 5.1 or 5, prior to that. We've had Claude Opus 4.7 a couple of weeks ago. We'll talk about that in a minute. I thought the biggest, like most impactful release really out of all these models was OpenAI's Image 2, which we'll get to be having a little bit of fun with.

Speaker 2:
[02:09] We have done some kind of extreme things with this that got me nervous before this podcast.

Speaker 1:
If we're not back next week, you'll know why after we tell you what we've done. Also, GLM 5.1, we've been really impressed with that model. Kimi K 2.6, also very impressed, lots to share on that front. Then Qwen 3.6 as well, which, honestly, there have been so many releases, we forgot about it.

Speaker 2:
It's almost like too much, guys. You can calm down a bit. We don't need all this. It's nice of you, but we're good with what we have.

Speaker 1:
[02:41] Yeah, and so some other things, OpenAI launched agents, I think they're calling them all workspace agents, off the back of originally the failure that was GPTs. Everyone's trying to build an everything app now, and there's some war taking place around that, so we want to talk about that. But let's jump in first to the latest release. We'll start from the latest and work our way back. Today, we had GPT 5.5.

Speaker 2:
[03:11] I'll give you my assessment. Bang. Vaporware. You can't use it. It's not available in the API.

Speaker 1:
[03:17] Well, I mean, that's a little debatable, isn't it? Because I'm so out of practice here. I don't even have the tab up. Here it is. So introducing GPT 5.5, but you were right. It's not available in the API. I think what's happening is this is speaking to the larger trend now of we're really entering into that super app product world where the labs are way more excited to get these models into their apps, especially OpenAI with their like competing now directly with Anthropic and B2B and trying to prove that they have this super app, one for everything. So they're just pumping the models into those super apps really quickly. And I think that's what we've seen with GPT 5.5.

Speaker 2:
[03:57] It's definitely still the OpenAI model on that front. But if you look at Anthropic, the gap between them announcing something and it being available is, if anything, shrinking. Like even the elites getting it, like, I've been cooking with this model for three months and now you guys get it at announcement time, that doesn't even seem to be happening anymore. 4.7 was just there one day. And we're like, oh, geez, we better add this and start using it immediately. So I don't know, it seems like more of an OpenAI thing in terms of that delay.

Speaker 1:
[04:25] Yeah, the narrative around this stuff right now seems to be that they are really pushing hard in terms of just trying to catch up to Anthropic. They've gone from just blitzing ahead with same-day release cycles to this. It just looks like a very confused strategy right now. But in terms of benchmarks, like all of these model releases, they're saying GPT 5.5 benchmarks pretty much higher on every front than Claude Opus 4.7.

Speaker 2:
[04:57] Yeah, right. I mean, I don't believe that for a second. They can have all the benchmarks they want. We look at the usage and who's using what, and people just aren't that excited about OpenAI models anymore.

Speaker 1:
[05:08] Yeah, I think we have to give 5.5 a chance. We've never used it, but it was interesting in Simtheory when we saw 5.4 usage. It went right up, like people were really excited to use it for a little while, and then it just slowly peels away. And I think that when you move to an agentic world, these agentic loops, 5.4 just doesn't, in my opinion, perform as well as the Anthropic models or even, to be fair, GLM 5.1 or Kimi K 2.6. They perform so much better, in my opinion and experience, at agentic operations than the GPT models.

Speaker 2:
[05:50] They perform so well, it kind of makes me nervous, in the sense that I will use them for a period of time and be like, well, they're just as good. But then whatever it is within me, I just want the best one and I just go to Claude Opus 4.7 because I'm like, I just want to be using whatever the best one is because I want this task done. But I kind of feel like if I just gave them more of a chance, they would get the job done just as well. It always blows me away when you say use the GLM 5.1 or Kimi 2.6, just how quick they are. They're just suddenly the answers are there, like I'm used to tabbing away and starting another process and then coming back. Whereas with those ones, you basically don't have to do that. Even in a full agentic loop, they just get it done quicker.

Speaker 1:
[06:33] Yeah, my experience actually, when I was flying to LA, I was using Opus to code in a bunch of tabs, and everything was going great. Then Opus had this weird outage that we had to deal with. And so I had to switch away to... And I immediately went to GLM 5.1, because I know that is basically the closest thing, albeit it's not that much cheaper to run, unfortunately, because it's a huge model and probably trained on the outputs of Opus, let's be honest. But after a while of using it, I did not notice any difference. In fact, the next day, I was still coding with it and had not even noticed that that was my primary model. I was just opening new tabs and that was in there. And because it was performing so well, I really didn't notice any difference. So I do think you make a good point there. You sort of get attached somewhat to these brands and this consistency of that model working. And then that maybe stops you trying some of these other models that are pretty damn good.

Speaker 2:
[07:36] Yeah, I think it's when you're trying to get real work done, you're sort of like, well, I can't chance it on this. I'm just going to have to pay what the price is. But the truth is that if you actually were denied access to the more premium models and only had these, I think it could be just as productive. I really don't think you would lose that much using, say, a GLM 5.1, then you are using Opus. Like, yeah, there might be some things it's not quite as good at, but like if you were just totally banned or something from Anthropic and could only use that, I wouldn't be that upset. I think I'd probably just use it then. I almost need it. It's almost like a form of discipline. You know, like you're not allowed to do this anymore. You have to use this one.

Speaker 1:
[08:16] But I think that's the point we're getting to, right, where it really is at the product layer now, where people are getting used to certain products and how they function and the things that they're able to do. And you can see that with the labs. And I think this is shown with GPT 5.5, how they're pushing it really hard into Codex, which is becoming like their everything app. And then with Anthropic, it's kind of similar. They're like pushing stuff into their application layer first and they're trying to get people addicted to that application layer. And we'll get to it a little bit later, but I'd say arguably distorting the market in terms of pricing, like taking the loss to get people addicted to their world and way of working.

Speaker 2:
[08:56] Yeah, absolutely. I think, I mean, yeah, like you say, we will get to this later because I've actually looked into this quite extensively. It's a big point at the moment around the cost of things. And I think the model providers subsidizing the real cost is really skewing everyone's thinking as to what's possible in the real cost of things and causing like a false economy in terms of people thinking something should cost a certain amount when it actually costs more and not making that value equation as to is the value I'm getting from this worth what I'm paying.

Speaker 1:
[09:28] It's so reminiscent of newspapers when the internet first came around, how they're like, oh, we'll just make it free because everyone still buys the newspaper. And, you know, we'll just sell ads, right? Because we just want eyeballs. And then over time, they try and charge, and that didn't work out terribly well. The quality of journalism goes down and everyone's like, oh, why is journalism so bad? And it's like, because no one's paying for it anymore. So no one really values it. And I think that's kind of what's happening in the model realm as well. What might happen as well is, it's been subsidized so much that when they eventually have to charge the right price, you know, who knows what will happen. Like, I don't know if people are going to be willing to pay, or if they are, you know, they're going to have a degraded experience because they're not willing to pay as much as it actually costs.

Speaker 2:
[10:19] Yeah, I'm of the opposite opinion. I actually think that there's so much value there and people should pay for it. I just think they've been conditioned to think it's cheaper than it is and haven't made that assessment for themselves: I'm this much more productive because I spend this money, and I'm willing to spend it either as an expense for my job to make me better at my job, or my company pays and I can prove the extra value I'm getting from it. I think the value is there. I just don't think it's being... People aren't thinking about it right now. I think most people are pretending the problem doesn't exist and looking at this line item of AI usage as a necessary evil and not realizing it's almost like paying for additional staff. It's almost like having more employees at your company, an expense that everyone's willing to take on. If it brings more value to the business, then you're willing to pay for it. But because it's a computer, you just don't see it in the same light. Yeah.

Speaker 1:
[11:16] Well, let's get into that conversation anyway. We've started talking about it. Might as well go into it. I think that this is going to be the narrative coming up to some of these companies going public where how much are they subsidizing it? I think, didn't you have a real stat around how much?

Speaker 2:
[11:35] These are actually Kimi 2.6 certified stats. You know these are top level legit mediocre stats. Listen to this, VCs and sovereign wealth funds are paying 70 percent of the real cost of your token. OpenAI burns 70 percent of their revenue and Anthropic burns 33 percent. The hyperscalers, Microsoft, Google, and Amazon are paying through subsidized cloud credits and infrastructure buildouts. We've experienced that ourselves where we were subsidized for a while, directly passed to the Simtheory audience and burned through it in record time. Absolutely just mincemeat these credits with our audience. We did the same thing. I'll finish this and then I want to make a point about that part. Then it says, enterprise customers are the only ones paying something close to the real cost, which is why every lab is desperate for that enterprise revenue. Something we also have experienced. The enterprise is the first group of people to actually get the value and be willing to pay close to what it really costs. Then the consumers are only paying 5.5% of the actual cost of what they consume, which sounds about right to me. We saw it. We changed our token model in Simtheory and there's immediate backlash in terms of people being like, hang on a sec. I burned through my tokens in half an hour. What's going on? And the truth is we just finally charged what it actually costs us for people to use it. And so it's kind of crazy, like the skewed economic world we live in with these AI tokens.

Speaker 1:
[13:08] And I think the other point to make is, well, a few things. You mentioned earlier, like passing on credits, like a year ago, was it, or maybe two now, when we released the workspace computer where you could have like a Windows box in the cloud and your AI could operate it, and it was like your computer, you could install apps in the cloud. We spent $1.5 million in, like, not very long. I think like two months on that.

Speaker 2:
[13:34] You could have put a whole other gold chain with that.

Speaker 1:
[13:36] Yeah, I know. And that was in credit. So again, heavily subsidized. Could we have done that without, like, there's no way, like no one would have paid that because it was really just experimenting around with the technology.

Speaker 2:
[13:52] And yet, we found real value there. Like it was the most in-demand thing we've ever done. Like had we, say, been venture capital backed and could burn the VC money, like these guys are doing, we could have maybe run that to the point where we got the economies of scale right or the cost base right and actually provided it as an ongoing service. Like it really was a legitimate thing that people really wanted. And I would argue probably still do want.

Speaker 1:
[14:18] Yeah, I would still like to have a full cloud computer I could deploy my agent on instead. Well, I don't mind my Mac mini over there. It does the job, but ultimately, it would be cool to have. I don't think it's a great business model, but it is for geeking out over. It's pretty cool. But I think you're right, the subsidies, at least from a consumer point of view, the whole idea was the ads business, but I don't think that's working terribly well. People don't want a compromised AI experience. Right now, there's a lot of people fighting for the attention or the token usage or the getting you to prompt in their world, that they are willing to subsidize it. There's always someone willing to discount more to get user share. I think this is what's eroding away the consumer business at least. Whereas the enterprise, it's a whole different ball game.

Speaker 2:
[15:16] Yeah. The point I wanted to make earlier about this that's totally crazy is when you think about how much, say, Anthropic and OpenAI are selling $2 for $1 or whatever they're doing. They're passing on value to us by burning their own money, right? And then you think about, say, Simtheory, where we were effectively doing the same thing for our own audience. So you've got two layers of people subsidizing your usage. And I would argue a lot of consumer AI platforms, like if you look at, say, Perplexity and some of the other ones people have used over the years, like Cursor and stuff like that, they are subsidizing as well. So you've got two layers of subsidizing the actual cost of this stuff. And then people using it and being like, you know what, 30 bucks a month is like so expensive. Like, I just couldn't be bothered. I'm going to downgrade to the $15 a month plan because 30 is too much. Meanwhile, this is probably costing $700 a month when you have the two layers of subsidies. So it's this weird thing where you wonder where the actual value lies. But it's in such contrast to my own experience, where I will spend whatever it takes. Like, I can't even imagine how much money I spend on our own system. Like, you made us switch to auto renewals like everyone else does, and I think mine auto-renews like every 15 minutes or something on the plan, in terms of tokens. But I'm like, I get so much done. I'm doing the work of the previous 10 years of me in a week. The value I get from it is so big, I would pay a couple of hundred thousand dollars a year to get what I get from it, because I think I'm delivering more value than that. And I think that this value perception is really skewed, and people are misguided about it. I actually think rather than them seeing it as this excessive expense, I think they should see it as an opportunity. 
Like if I am the one spending the money on this, I can be the one who's this much more productive and directed in my activities. And that's a real advantage for me.
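The subsidy figures tossed around here can be sanity-checked in a few lines. This is a minimal sketch using the episode's own ballpark stats (the 5.5% consumer share quoted earlier; the function name is just for illustration), not audited numbers:

```python
# Back-of-envelope check on the subsidy math discussed above.
# The 5.5% consumer share is the episode's illustrative stat, not audited data.

def implied_real_cost(sticker_price: float, consumer_share: float) -> float:
    """If end users pay only `consumer_share` of the true cost,
    a given sticker price implies this much unsubsidized cost."""
    return sticker_price / consumer_share

# A $30/month consumer plan at a 5.5% consumer share:
print(f"${implied_real_cost(30, 0.055):.0f}/month")  # prints $545/month
```

That lands in the same ballpark as the roughly $700 a month Chris guesses, given how rough the inputs are.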

Speaker 1:
[17:21] But isn't this the whole point, right? Because I think you're looking at it from the point of view of someone, your entire life has had employees, right? Like you've hired developers, marketers, salespeople, like all these various roles. And I sort of look at it from the point of view of, well, you know, if I had to go out and hire people, manage them, you know, deal with people in a business, there is a cost to that. There's a mental weight. It's a distraction, quite frankly, because you can't stay really close to the bare metal. And so I think we are looking at it from the point of view, when we're doing work or adding value as like, how many wages would you need to pay? So in order to get to this level of productivity in the old world, so the question then becomes, okay, well, I am willing to spend like 100 or 200k a year on this stuff because I'm getting that value or getting that return on my like token investment. But I think from someone who's, you know, like say a developer today working in a business, they're probably looking at it like, I just needed to now pay like an absolute fortune to do my job because there's this expectation that I will now output at this level and they might not be getting, it's not like their income is going up as a result of doing this if they have to pay for it themselves. And so I think there is this mismatch. And if you're using it in your personal life, it's not like, you know, maybe you're just not seeing the returns there. So I do think that's why also Anthropic and OpenAI are pushing so hard into the world where to be successful, like they almost have to disrupt society, which kind of sucks. Like it's like they have to replace elements of human wages because no one's going to pay more to do, like, you know, like there's got to be some trade off there. 
Like they've got to either see a productivity gain from all of their team in everything that they do, or they've got to lay people off and be able to like keep running at the same pace. Like there's no, like the economics have to balance out at some point.

Speaker 2:
[19:25] Yeah, exactly. It's why I think, why I find it so surprising that Google has just gone like completely dead and silent on their models because one advantage they have over everyone else is their TPUs, right? Like they have their own hardware to run this stuff. So Google could afford to basically make their crap free and just run everyone else into the ground by just properly subsidizing and just go like, we're just going to be free for the next three years, build all your stuff on us and make everyone totally entrenched in the Google ecosystem. And yet, they basically destroyed their models and then they're also really expensive. So it's like, I just really don't understand what they're doing there when they have the ultimate platform to just flatten everyone and really bleed them out. Like you could right now, OpenAI is struggling financially. You could destroy them if you were Google right now and wanted to.

Speaker 1:
[20:20] I think, you know, they did announce, to be fair, they had that Cloud Next 26 like a couple of days ago, but there's just so many announcements right now. It barely blipped up on my radar.

Speaker 2:
[20:32] Well, actually, I saw it when I did my Kimi K 2.6 research and I just figured it was hallucinating. I was like, you're living in the past, man. You know, this is not real. We've made this up and then I'm like, oh yeah, they really did do something.

Speaker 1:
[20:46] Yeah, I mean, they've got their Gemini Enterprise agent platform. OpenAI have got workspace agents, Anthropic's got, you know, Cowork and Claude Design and Claude whatever. And I think it just shows now the target. And I think people are noticing this: everyone's chasing the enterprise dollarydoos because, as you said earlier, people in the enterprise are just willing to pay, because they're seeing the benefit.

Speaker 2:
[21:13] Not just willing to pay, mandated to pay. Like we have come across so many enterprises where there's an AI change officer. There is someone specifically in charge of a budget that they need to spend by mandate in their organization. And they're looking where to allocate that money. So like it's a difference between convincing, you know, a million people to pay $10 a month or, you know, one customer who's just going to be like, yeah, let's put the full five into this because we need this in our organization. And there's just very few places to put that money right now.

Speaker 1:
[21:45] I think the other challenge, right, is if I'm a consumer and I just want to experience and experiment around with like different models and tools right now, the change that's coming is as these things get more expensive, right, like you've got to actually pay what they cost. Those experiments become really expensive. You know, you can spend like $100 USD very quickly on agentic, like trying some agentic stuff or like trying some scheduled task or playing around with agents. So listen to this.

Speaker 2:
[22:15] I actually got GLM 5.1 to do some calculations on this. It was saying if you're using like normal chat, a single chat interaction might be like 800 tokens, right? But a single agent task is like 8,000 to 30,000 input, 3,000 to 8,000 output. It's like 10 to 50 times more expensive because you've got system prompt, planning reasoning step, multiple tool calls, each with growing context, and then the final synthesis on every agentic process, right? And so even though I believe you can actually make agent processes more efficient through like the way you stack the context and build it dynamically, caching all that sort of stuff, the cost is just orders of magnitude higher. But again, back to my earlier point, I would argue, like, I don't know about you, but I don't do anything in a non-agentic mode now. Everything I do is delegation now and scheduling. Like I've got so many scheduled tasks, I've got so many agentic loops running throughout the day, and I run them all on, you know, like cloud machines that can, I can walk away, I can shut my laptop, and the work keeps going. Like that's how I work all day now. Like I'm stressed out right now because I know I don't have anything running and I should, you know? And that's how I work. Like when I'm, when I'm leaving to go somewhere, I'll set two or three things off and then, you know, unwrap them like presents when I get home to see how they went.
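The 10-to-50x figure quoted here can be roughly reproduced from the token counts Chris mentions. A minimal sketch, assuming hypothetical per-million-token rates ($3 in / $15 out) and a 400-token chat reply, both of which are my illustrative assumptions rather than numbers from the show:

```python
# Rough chat-vs-agent cost comparison using the token counts from the episode.
# The per-million-token rates ($3 in / $15 out) and the chat output size
# are illustrative assumptions, not real pricing.

def cost_usd(input_tokens: int, output_tokens: int,
             in_rate: float = 3.0, out_rate: float = 15.0) -> float:
    """Cost of one request at hypothetical per-million-token rates."""
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

chat = cost_usd(800, 400)            # one ordinary chat turn
agent_lo = cost_usd(8_000, 3_000)    # small agentic task
agent_hi = cost_usd(30_000, 8_000)   # bigger agentic loop with tool calls

print(f"chat turn:  ${chat:.4f}")
print(f"agent task: ${agent_lo:.4f} to ${agent_hi:.4f}")
print(f"multiplier: {agent_lo / chat:.0f}x to {agent_hi / chat:.0f}x")
```

With these assumed rates the multiplier comes out around 8x to 25x for a single agent task; once the loop iterates a few times with growing context, the 10-to-50x range is easy to hit.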

Speaker 1:
[23:36] Yeah. And I think in Simtheory too, the way I'm thinking about it now is like, how do you reunify these experiences rather than the complication of people selecting like chat or agent or research or whatever. It's like, well, it should just work.

Speaker 2:
[23:49] Yeah.

Speaker 1:
[23:50] And I think that like starting to reunify that stuff is important, but at the same time, like you said, the agentic loops at their core do tend to burn a lot more tokens. But ultimately, I think the outcome is just so much better with everything that it does.

Speaker 2:
[24:06] Yeah. And I think, to use the modern lingo again, with your agents you've got to let them cook: give them all the stuff, let them decide, let them do the tool discovery process, the file discovery process, let them do all that. And this is probably my take on the OpenAI announcement around their agents, because you were saying they should have been there ages ago and it's kind of lame, but I said, this is really the first taste of this kind of workflow for your average user, like the person who just sees AI as ChatGPT, right? So I actually think it's kind of significant, because for most people it's the first time you can sort of delegate, the first time you can set something off and have it working for you in the background. But my criticism of it is this idea that you have to specify in advance which connectors or skills you're going to use. Like they have the concept of skills in there, which are like dedicated prompts for parts of the work. And then which integrations you want to use, like Slack or Salesforce or whatever things you want to do. Now, my argument with that is, okay, maybe in the scheduled task context it makes sense. But it's also a lot of setup. Like you really should be able to just say, here's what I want to happen, and let it figure out all of those details for you. And I think that that's where we really need to get to with the agents. It shouldn't be like this custom setup every time you want to do something. You really just should have like a working partner, where you're saying to it, look, I do this all the time when I'm working. I'm like, here are my problems. Like here's what I'm really stressed about. Like how the hell am I going to get this done? And the agent itself will coach me through the process of giving it what it needs to get that task done.

Speaker 1:
[25:52] And don't you think this is a converging, like these are two methodologies, I think that are totally different. And I'm curious what, how you actually work. So you've got like the Claude way right now, which is like single agent. It's just Claude everything. And they're trying to like magically make, well, they have, I mean, they have the same thing with like connectors. It's like switching them on and off and you can only have a certain amount enabled. And then like to go into code mode, you've got to switch over to like the Claude code. And then if you want to co-work and do like knowledge work, it's like co-work. And I think that's kind of confusing. And then you've got the sort of OpenAI, like the newer version with these like workspace agents or whatever they call them, where you've got to configure them and set them up. I mean, it's exactly how it works in Simtheory, like where you've got sort of context switching essentially. But I personally think the context switching is far superior, where you've picked the tool mix and you've tuned the skills for that particular role. And then you treat it like it's a real worker for you and delegate tasks to it. That's how I work. Like I have my code one, my...

Speaker 2:
[27:01] Yeah, but you're not in love with your agent like I am. We have a relationship.

Speaker 1:
[27:07] I don't know if like you're like... Yeah, like you don't seem to switch much, whereas I switch constantly.

Speaker 2:
[27:14] I'm using the same agent, like the same assistant, for everything. And I just enable the stuff and it just figures it out, and it has different memories associated with different things I do. And it just works. Like I just don't really need to switch that much.

Speaker 1:
[27:29] Yeah, but do you think that's because of your frame of reference is you're mostly doing code. And so like it is, you know, that's what it is.

Speaker 2:
[27:38] Well, I mean, in some ways, yes, but I'm coding across multiple projects. I also have personal things I've got going on and it seems to know about them. The only problem sometimes is where it'll, as a joke, slip in something from its memory, like something confidential, and just be like, hey, I slipped that into the image or something as a joke. And I'm like, I can't release this information. You can't do it because you think it's funny. Like my model, hang on, I'll get my censorship button. My model will often say fuck to me, like in its... Sorry, I beeped it.

Speaker 1:
[28:11] That was a huge delay. There's people with kids in cars. Like we've talked about this.

Speaker 2:
[28:16] Sorry guys, I'll bleep it again. Um, and, but you know what I mean? Like this model knows me. It knows, it delights me. It's amazing. And so not the model, it's the assistant, right? And I just, my attitude is, and you taught me this, like it is let the model do what it's best at, which is decision making and planning and, and getting things done. And so I just really focus now on just asking for what I want, you know, it's right in the Bible: ask and you shall receive, knock and the door will be opened. It's right there. If you just trust in it and, and like, tell it your problems, it will solve them. And it like, yeah, it might take a few extra iterations or whatever it is, but you can get there. And I think this is the beauty of this AI stuff. Like it's just remarkable how-

Speaker 1:
[29:04] My counter to this is you're not going to use this in the workplace. Like you're in an enterprise. It's not like you're going to be using Patricia, are you?

Speaker 2:
[29:11] Like it is, it is like, cause I've, you know, obviously recently been in an office, something I don't normally do. And when I have my speakers on and my AI is saying, Chris, the task is finished. I love you. It's a bit weird. I must say. Yeah. Yeah, I do use it.

Speaker 1:
[29:26] So I don't know. And I think this, I'm curious how people work, if they're just like single agent, multi-agent. I'd say most people are single agent just because like a lot of the products use like Claude Code, Cursor and stuff like that, where you don't really have the option. But I really can see these context switches, especially for people that work across like finance or marketing or sales and you're doing a lot of different, like you're switching context quite a bit. For example, I have a This Day in AI producer, believe it or not, that does research. I give it the topics and it knows the format I like and does all that stuff. And I've actually been meaning to set up a scheduled task, I'm going to do it this week, so it just goes into the Discord channel we have where we dump links and things that we want to talk about on the show, extracts all that and then just puts it into a schedule for us. So maybe our facts get a little bit better on the show. But I don't want that in a mix with like my coding agent or my personal agent that I use in my personal life. Like I want separation of concerns. I like separation of concerns. I also like picking the tool mix and I like wiring in the skills and-

Speaker 2:
[30:38] Yeah, and I sort of, I wonder if maybe the next level is more like your agents are aware of one another and go, oh, Mike's asking a question about his personal life. I'll hand the microphone over to his personal bot.

Speaker 1:
[30:52] Yeah, like the sort of orchestrated thing. The problem is as you add these layers, as we know and have tried in Simtheory, like we had the idea of the core agent routing to sub-agents that had specialist skills, but the reaction initially is like these things burn so many tokens. No one's willing to pay for that experience yet. Like it's just so expensive to run, including me. Like I'm like, I don't care. I'll route myself because I want to save on the tokens. So it's interesting. But again, with these agents, they're not calling them agents for the masses, they're calling them workspace agents. So really again, they're just targeting the workspace and allowing you to schedule these to run or just interact with them. I mean, it basically is the Simtheory assistants, you know, in ChatGPT for workspaces, right?

Speaker 2:
[31:39] And the other thing I don't like about them, and we've discussed this before, but is this idea of this single-shot process. Like they're treating them like, okay, you know, say it's stock research, like every day at 9 a.m., I want you to research this stock, produce a report, send it to me. You know, it's not true agency in that sense. It's like, do this one task at this time, like Google Alerts or something. You know, it's not really agentic in the sense that it's like, hey, every day I want you to see what work is required and then pick up all of that work, delegate that work and then do that till it's done. So think about like a Help Scout style scenario. It's like, okay, 10 tickets came in in the last hour. I want you to spawn out, fan out and solve these problems for me. Agentically making code changes if necessary, issuing refunds, whatever it is, go through that entire process and get that done as a true delegate. Not this idea of, oh, well, it's just like a magic function that runs every day and retrieve some information, does some process on it and then outputs it. And I think there's too much thinking around this sort of like atomic level action. Like, you know, update my Salesforce with my latest leads tomorrow morning, please. That's not agency. It's just like a computer algorithm that just happens to have a magic step inside there, right?

Speaker 1:
[33:06] Okay, look at all their examples. So Spark qualifies leads, sends follow ups and updates your CRM. It's like, cool, um, Slate evaluates software requests and recommends approved tools.

Speaker 2:
[33:17] Zapier can do this stuff.

Speaker 1:
[33:18] Yeah, this is what I mean. Like, is everyone just excited about like a glorified Zapier right now? Is that where we're at? Like the other, the other like weird thing about this is like, okay, you think about the reality in an enterprise of rolling out these things. Like, you know, automation is hard. We have a lot of background in automation and it's really hard. It takes companies an incredibly long time, including us. Like, so we have an enormous backlog of tickets, full disclosure. We're aware of how bad our support is.

Speaker 2:
[33:51] It is my daily shame.

Speaker 1:
[33:53] It is a shame. And so we've been working through some of that manually. We have an agentic experience that helps us do a lot of the work. But also we've been trying to fully automate it. And I mean, fully automate it, where it can take actions on our behalf. And even we, and we do this like every waking hour of the day, are struggling to build out that agent in such a way that we can 100% trust everything it does, right? So then you think about an enterprise where like all these mission critical things and everyone's like, oh, replace all your employees with agents. I mean, we are just so far away from this stuff being a reality. Still, a lot of this stuff we're seeing right now seems to be just product and packaging to get the dollarydoos from the enterprise in order to justify, you know, IPOs sometime later in the year. Like that's the feeling I'm getting. Like the only real two breakthroughs I think we've seen recently is Opus, I think it was like 4.5 around December, where all of a sudden agent stuff worked. Like it was just really good and far better than a human in terms of like coding and delivering things into projects, where you were just like, okay, agents are real, like especially in the coding realm where I still argue it adds a ton of value. And then the next big leap I've seen since, the only leap in my opinion, is the OpenAI Image 2, which we still haven't even talked about yet.

Speaker 2:
[35:26] Which is going to get me in a lot of trouble. I feel I'm sort of doing damage control in the background here.

Speaker 1:
[35:32] Yeah, yeah, exactly. So we'll get to that in a second. But I think these are the only two major innovations or leaps I've seen. And probably the third would be the open source models getting to a point where I would argue with GLM 5.1, you could probably roll out your own cluster for agents at scale. And it would be so good. And you could drive the cost down.

Speaker 2:
[35:56] Yeah, I see them. They're sort of like in my mind, like my post-apocalyptic bunker, where people dig down and build a bunker or get a house in New Zealand or whatever the modern trend is. And that's what say GLM 5.1 is for me. If I ever need to seek refuge in the AI world, it will be in one of these models. Like I don't need to do it. I don't have that personal need because I can pay for the good ones. But if I was ever in a position where I couldn't, I would just cling on to them as my lifeblood. And that would be my only model.

Speaker 1:
[36:27] But you say pay for the good ones. GLM 5 is not cheap. It's $4.50 per million input on Fireworks. So it's like 50 cents cheaper than, oh, this is how out of touch you are really. It's 50 cents cheaper than Opus 4.7. Like why would you use it? There's no incentive.

Speaker 2:
[36:48] What about Kimi?

Speaker 1:
[36:50] Kimi K 2 is a lot cheaper, like significantly cheaper. Let me get my AI-produced notes on that one because I don't know it offhand. But it is 60 cents per million input.

Speaker 2:
[37:00] So if I had more time, I would have done some horse bets with Kimi because it's always been the best at horse bets.

Speaker 1:
[37:07] $2.80 per million output. Compare that to GLM 5.1.

Speaker 2:
[37:11] You just have that running all day, outputting stuff.

Speaker 1:
[37:14] Kimi K 2.6 in sort of an OpenClaw paradigm seems like a really great model to run. I think that we didn't actually mention it, but the interesting part on GPT 5.5, and I think this is a little bit bold of OpenAI, is they're charging $5 per million input and $30 per million output. So that's $5 more on the output side than Opus 4.7. So it's priced higher than Opus 4.7.

Speaker 2:
[37:44] And see, we see a lot of feedback regarding input tokens, because it's usually the first one to run out. But when you think about the cost, when you're working agentically, the output is way more significant because you've got the thinking tokens count as output and you don't see the thinking tokens, right? And it can be really, really significant, like 30,000, 40,000 tokens in thinking. If you give it a hard enough task, right? It can really add up. Also, the other thing that people don't mention is the latest round of AI models went from allowing you to set a thinking budget where you could actually control how many tokens were used in thinking to just setting an effort level like medium or high or whatever it is. And so now, that thinking budget, it can get to the point where it will use up your entire output thinking budget that you set and then therefore deliver you no response. So you almost have to max out the limit you give it to say 128,000 or whatever it is lest you don't get a response and then have to iterate again costing even more tokens. So you're sort of forced into this situation where you have to use all the output tokens sometimes to guarantee that you're going to get a response. And that's the expensive bit. That's the bit where it's like $30 per million or whatever it is. And so the output is actually really significant in those agentic modes and you can't really control it. You don't know when the model is going to finish. And if you've got tool calls in there and all the different elements in there, you basically have to allow it to finish. You just can't do a request and it's not deterministic. So you can't control how much it's going to cost you. It's like a business where you don't really know how much you're going to pay for the service you've asked for.
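The back-of-envelope maths Chris is describing can be sketched roughly like this. The per-million prices are the ones quoted in the episode for GPT 5.5; the token counts are made-up illustrative figures, not measurements:

```python
# Rough agentic request cost. Thinking tokens are billed as output
# even though you never see them, which is why output dominates.

def request_cost(input_tokens, visible_output_tokens, thinking_tokens,
                 input_price_per_m=5.00, output_price_per_m=30.00):
    billed_output = visible_output_tokens + thinking_tokens
    return (input_tokens / 1e6) * input_price_per_m + \
           (billed_output / 1e6) * output_price_per_m

# A hard task: 20k tokens of context, a 2k-token visible answer,
# but 40k tokens of hidden thinking on the way there.
cost = request_cost(20_000, 2_000, 40_000)
print(f"${cost:.2f}")  # → $1.36
```

Note how the 40k hidden thinking tokens account for $1.20 of the $1.36: the input side is almost a rounding error once thinking is involved, which is the point being made about output pricing in agentic use.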

Speaker 1:
[39:21] I think what's interesting about the GLM 5.1 pricing of, I think this is from Together AI. These guys are not, I don't believe at least, subsidizing this stuff, right? They're not going to take a loss unless they're stupid on hosting these models. And so there must be some margin built in, right? So if they're charging $4.40 per million for GLM 5.1, I think that starts to get you closer to the bare metal of what it actually is costing with maybe like a 20 or 30% margin in there. So that starts to show that really the cost to serve, say, and I'm just obviously making all this up, like I have no inside information or anything, but Opus 4.7, you would think is probably costing them around $3 per million to serve. And then on their plans where it's sort of, you know, they're just randomly constantly manipulating the thinking budgets and the amount of tokens in a 24-hour window the user gets, they're constantly sort of tweaking that to find a mix between like not going super broke and also keeping people addicted to that particular subscription.

Speaker 2:
[40:30] So yeah, like making a quality trade-off in terms of like, can we just appease the audience like to think they're getting the best one and, and sometimes changing it so we actually save some money. It's like, it's a real, it's a real thing.

Speaker 1:
[40:42] And I think this is the other thing: if you're not willing to pay the right price and you're sort of at the, you know... then they have the ability, like they just recently admitted to doing, they changed the default thinking from high to medium in Claude Code. Remember a couple of weeks ago, everyone's like, oh, Claude Code's terrible now. And then they came out only yesterday and admitted, oh, actually, we set it to medium. We're setting it back to high, we're sorry. So it shows as soon as they degrade the quality, people are like, this sucks, I'm not willing to pay for it. And so, because they've given away this level of token use and intelligence or whatever, that's now the expectation.

Speaker 2:
[41:25] This isn't champagne, this is like Australian sparkling wine. I can't drink this crap.

Speaker 1:
[41:30] Yeah, this is literally the reference. This again shows how out of touch you've become. Okay, moving on. So let's talk Image 2, because this is going to be a lot of fun.

Speaker 2:
[41:40] This is where we get back to people were like, are these guys going to lose their willingness to do crazy stuff? And the answer is no, because we have done a couple of ones where I don't know if I'm going to end up regretting it.

Speaker 1:
[41:52] Interesting. So let's talk first about ChatGPT. It's called ChatGPT Images 2.2. That's the name. That's the name they've come up with.

Speaker 2:
[42:01] When it released, I thought you were joking with me. You're like, GPT 2 is out. I'm like, shut up, Mike. I'm busy.

Speaker 1:
[42:06] Yeah, it's very weird. So they have all these example images. Some of them are like, you can get it to produce realistic looking screenshots with like multiple apps open and then like fine-grained detail in the app. The funny thing is I tried it and I can't even get close to their example. So I don't know what they did differently to me. But yeah, so there's a bunch of images. It's always that incredible, like, you know, cherry-picked group of images. There's like handwritten notes in people's writing style, beautiful diagrams, charts, slides, all that kind of stuff. I've got to say, I played around with this for way too long yesterday. I've never been more impressed with an image model. I really thought after Nano Banana 2, there would be no better model. I was like, Google has got this one. Like this is cooked in, they cooked, this is in the bag. You know, no one's ever going to get close. It turns out OpenAI not only got close, but have exceeded Nano Banana 2, where I replaced in Simtheory the default model with this model because it's so good. And it's also cheaper, like a lot cheaper. So it went from, I think, the highest quality image in say Nano Banana costing in Simtheory 24 premium tokens, down to the highest quality in this is like eight or something. So it's far cheaper, far better. It does sometimes, I think, still suffer from that weird kind of like overly processed image look that the GPT image models have. And also like for face replacement and like very precise detail, I think Nano Banana can still come out on top in a few scenarios. But again, I like a world where I have access to both, right?

Speaker 2:
[43:59] But also they've gone from a model that more or less just produced cartoons. Their last one, it was like really unrealistic, bolt on safety on basically everything you tried. And now you and I have been doing full on like basically forgery today, with no qualms at all.

Speaker 1:
[44:16] And so like there's an image up on the screen now, because I know most of our listeners listen. And so this is in a computer lab from 2004. And it's got the old CRT monitors. And there's like a sort of 2004 version of ChatGPT on the screen. And there's no way I could tell this isn't real. Like it looks so real and so realistic as someone who grew up in that time. And yeah, it's sort of terrifying. And I think that led us to think, well, you know, could you commit fraud with this? Is it that good? When we saw like signatures and all that kind of stuff, we thought, well, how can we do a little bit of fraud? So I'll start out with the first one, which is like in hindsight, super, super mean. But let's talk about it anyway. So I've had to blur.

Speaker 2:
[45:11] What do you mean in hindsight it's mean? I told you in advance it was mean.

Speaker 1:
[45:15] Okay. All right. So I've got up on the screen now, a letter that looks like it's been scrunched up, sitting on a kitchen bench. Now I have blurred some of the detail because I did put my parents' real address in there. So I did want to like blur that out, obviously. But there's a letter that looks super realistic. I mean, can you talk to the realism? Cause you're a third party here.

Speaker 2:
[45:39] Well, I mean, the council logo is accurate. It looks like a letter on a bench top that is like completely legitimate. There's just no way if you had texted me that prior to me knowing this model existed, that I would think it's fake. I'd be like, it looks real. Maybe the only thing that makes it look fake is the framing is perfect. Like the letter is perfectly in the frame and the lighting maybe. But other than that, it's very convincing.

Speaker 1:
[46:04] So basically the premise of the letter is, I know that there's some development going on next to my parents' place. So I got it to write a realistic looking letter saying that they are infringing on the boundary of this property that's being developed right now. And therefore, you know, they're in all sorts of trouble. They've been ignoring all these letters from the council. And so I called my mother, like I sent her a text of this letter saying, I'm really sorry. I think one of the kids must have taken this from your house. That's why it's sort of scrunched up. And then I said, well, you know, like, can I call you? This seems really important. Now, I know, like I knew some detail here to like really, you know, screw with them. But think of any like real life sort of phishing-style attack. Like that's ultimately the same thing, right? Where you know some real detail about the person. So yeah, that's right.

Speaker 2:
[47:01] You know, where they live, where they work, like maybe some information about their family or some recent development in their life. It's like it's a common vector of attack.

Speaker 1:
[47:11] So let's listen. I recorded the call with Mum when I called her. And just to see her reaction here.

Speaker 2:
[47:20] Anyway, it doesn't matter, Mum.

Speaker 1:
[47:22] But is there something with that neighbor or?

Speaker 2:
[47:25] I will say they're starting, what's happening? They're starting to build another house there. And what I'd say is-

Speaker 1:
[47:30] Mum, I'm just joking. We just, we were testing for the podcast to see if people would fall for fraud. I'm sorry. Am I on live?

Speaker 2:
[47:40] Oh, hang on, better sense of humor. You see where we get it from?

Speaker 1:
[47:45] We're testing a new image model and we wanted to see if you'd fall for it. I'm so sorry. So there's just like a little excerpt. There's a little bit more here. It's so cruel, but I was like, it's the real test, to see if I can fraud them. It just goes to show you how much trouble they're all in, doesn't it?

Speaker 2:
[48:06] Really?

Speaker 1:
[48:07] Yeah. Well, I mean, the fact you can fraud a real letter and take a photo and it's scrunched up on my kitchen counter, like my real bench top, so it looks fully real. So how did you get all the logo and all that for it? It just knew it. I just said, like, I gave it your address and then it just, I guess it figured it out. It's pretty incredible. So yeah, like that, I mean, like they were so fooled. Mum was talking to dad about how they'll have to get a lawyer. Like it really, I mean, it really did everything I intended to do there. And I think it shows that, you know, it shows like how capable this thing is.

Speaker 2:
[48:45] It's unbelievable. The quality of these things, like you sent me fake receipts. I made fake bank statements with like the proper logos. And like, you can full on with these models, like do masking. You can drag in official logos, which when we show my example now, we're going to show and just say use this logo as part of it. And it will just do it.

Speaker 1:
[49:05] So let's talk about that. After that call, we thought, oh, like let's up the stakes here a little bit. And you came up with the idea of posting it into a local Facebook group.

Speaker 2:
[49:16] Yeah. So in my general area, there's been a lot of outrage about, they're taking the local Coles and turning it into like, you know, a 30 story apartment block, whatever.

Speaker 1:
[49:26] Coles is like a supermarket here, like a grocer.

Speaker 2:
[49:29] And everyone's like, oh, ruin the aesthetic of the area. Like the aesthetic of the area is like, it's not that great. Like, don't worry about it. We need more buildings. Like, it's fine. But the people are so stressed and they're doing like petitions. So I thought, what if I propose in an even smaller subsection of this area, like it's just an absolute eyesore of an apartment block, post an official letter from the mayor, like fast track approval with the state government, and then put it on this group to see the reactions. And so I don't know if you can bring it up, but it's literally used like the local council logo, the local seal of the state government. It's got the mayor's real name and like fake signature, which is why I was panicking because I'm like, oh my God, like this is pretty extreme. So I posted it on the Facebook group and we started to get just comment after comment, like this is the way society is going. This UN agenda 2030, like more apartments, more matchbox stick apartments and like just legit comments. And so the reason I panicked was someone literally tagged the actual mayor, like getting his attention to it. I'm like, hang on a sec. And I must say I chickened out and I deleted it because I was like, whoa, I really don't want like actual scrutiny on this thing.

Speaker 1:
[50:49] But this is what people have already been doing. Like, you know, this is Joe, help Joe out by giving him a like and stuff and then they create these like politically influential Facebook groups. So when an election comes up, they can peddle their stuff. Also like state sponsored stuff where they want to manipulate a large group of people. Like this is all the stuff they're doing, but these models now empower them to do it on another level.

Speaker 2:
[51:15] I just can't tell you how real it is. Like it's got like the little logo of the town. It's got, and it's just, it's taken what is a quaint little suburb and just put this massive high rise there. It looks so real. It's like a real development plan, like all this stuff. And that was like effortless. And I must say, I did this with Kimi K 2.6 as the one instructing GPT Image 2, right? And I love that now, I didn't have to manipulate it, or what was it, get it horny or whatever your expression is, in order to convince it to do this. It just goes, oh Chris, this is going to be gloriously terrible. Let me use that logo and make an awful council letter announcing this monstrosity of a high rise. Like it's just straight up, yeah, let's do it.

Speaker 1:
[52:01] Yeah, no problem at all.

Speaker 2:
[52:02] But imagine, like, it's funny because we were talking about like evidence in court cases and things where people might be from industries where they're just simply not aware of what's available. And to be fair, this has happened in the last week, right? In terms of it going from Nano Banana 2 level to this. Like how can you now, like there's so many services online, right? Where it's like verify your ID by taking a picture of your license or passport or whatever. But you've also got, like, you know, say your office expenses, like, I'll take a picture of the receipt and that goes into the reimbursement system. Like how easy would it now be to legitimately fake a receipt from an organization with a proper business number and subtotals matching and all that sort of stuff. And then people just claim expenses or invoices or anything. Like the level of detail in terms of what you could forge now is so good that you're going to need like a model that's 10 times better to detect the forgery even.

Speaker 1:
[53:03] Like, do you think you could even detect it? Like the, the scrunched-up one? No, there's zero chance I could detect this stuff anymore. Like there's, there's no way, there's nothing on there. Even the shadowing, the lighting, even zooming in now. Zooming in is just, you know, we went through a period where all these labs were like, hold us back, this is going to be so disruptive to society. And look, I'm glad they're not holding it back. I think they shouldn't, to a point where like we are now proving we can commit like these minor, like, phishing rorts.

Speaker 2:
[53:40] Let's not admit to anything.

Speaker 1:
[53:41] Yeah. Okay. Sorry. Jesus Christ.

Speaker 2:
[53:45] Yeah. Don't admit to anything, but you're right. Like the potential for this is like pretty significant. I mean, and you got to remember like, okay, people will try this on a big scale and stuff like that, where, you know, it could cause trouble, but just think about it in the day to day minor scale, the things you could get away with, with the ability to generate images of this quality and believability. It's kind of wild. And we actually have a friend who is a judge, and we were talking about like evidence, like, and he was saying like, you know, is there a model that can reliably detect fake photos? Basically, I'm like, I just don't see how there could be. Like, I understand they try to add watermarks and other things that would be easy to detect, but that's so easy to get around.

Speaker 1:
[54:29] Like, just screenshot the image.

Speaker 2:
[54:31] Well, yeah, exactly. I mean, maybe there's something that can survive a screenshot, but there are techniques you can use very easily to avoid that kind of stuff. And so I think that, you know, without, I mean, do they have to in every case now call in an expert witness? Like that's the problem. It's like the minor scale stuff. You just can't afford to do the level of verification you would need to do to understand if something's real or not. Yeah.

Speaker 1:
[54:55] And I think the thing is Nano Banana could do a lot of this stuff. And I do think when these things are released, everyone gets excited and does a bunch of this stuff. And you could always argue like people could Photoshop this stuff all the time, but not...

Speaker 2:
[55:10] Look at my image. Like it has taken the local council logo and put it in the top corner of the letter perfectly on an angle with mixed lighting and crumples in the paper.

Speaker 1:
[55:20] Oh, no, I'm in that camp. I'm in the camp of like this takes it to a whole new level like because of how quick and how realistic and you just simply cannot like prove this stuff's fake anymore.

Speaker 2:
[55:33] And I used what is arguably one of the cheapest models around to instruct it. Like this isn't even using a good, like, built-up skill around it.

Speaker 1:
[55:42] How would you know that you're so out of touch? You don't even know the prices.

Speaker 2:
[55:45] That's true.

Speaker 1:
[55:46] Till I told you.

Speaker 2:
[55:46] That's true.

Speaker 1:
[55:48] All right. So the other thing we did with it is I gave it one of our YouTube thumbnails. And I said, hey, I need the one for this week, the sellout special edition. And like it's pretty good, right?

Speaker 2:
[56:03] I wish my teeth were that nice.

Speaker 1:
[56:05] Yeah, well, they could be. So, yeah, I look very creepy. I do think it's weird, though, that the model like it does. I don't know if it's like how I'm instructing it right.

Speaker 2:
[56:18] It's the first model that hasn't made me look like I'm a sort of decrepit old man. Like usually the models make you look better and me look worse. But this one, I actually, I think I look better.

Speaker 1:
[56:28] You do. You look really good there. Like, you know.

Speaker 2:
[56:30] This is what I need to aspire to become.

Speaker 1:
[56:32] Yeah, you should. I don't know what kind of work you'd have to get done. But, you know, like you could do that.

Speaker 2:
[56:40] It's an attitude thing. I'll never look like that because I just don't have the right attitude.

Speaker 1:
[56:44] Yeah, you look like you're sort of buying and selling houses in LA or something. I don't know. Like, on one of those like reality TV shows. I love the thing it just added randomly into this in the background: "Loyalty is for losers."

Speaker 2:
[56:59] Just a massive indictment on our entire decision making in life.

Speaker 1:
[57:03] Yeah, like just completely, completely unasked for. I do, like, I know we've been jumping around a lot. We did have a plan for the episode, but there's too much to cover and we don't care. Like, I think people know us by now. You tuned in because it's long and boring and painful. You tuned in.

Speaker 2:
[57:22] You forgot to tell everyone not to listen at the start of the episode.

Speaker 1:
[57:25] Yeah, I made that mistake. But if you are still listening at this point, I did want to talk about Claude Opus 4.7, which we sort of touched on, but not really. So this update is kind of strange to me. It had this like task budget data parameter. It's apparently better at interpreting images, so it can support up to like 3.75 megapixel images now. And it has that thing where it can like zoom in on them to get a better interpretation of the image. Like what you're seeing is the vision's been like dramatically improved. And it was always...

Speaker 2:
[58:05] We should retry computer use for that reason, because that was one of the things that really enhanced that. But...

Speaker 1:
[58:10] Yeah, but weirdly in some areas, it's also regressed. So a lot of people were saying, oh, they're also trying to save money, like make the model more efficient with this release. It's funny, it definitely is tuned differently. So what I've noticed in agentic use in Simtheory is way less chatter. It's gung-ho to get into the tool calls. It says a lot less. It's actually kind of reminiscent of GPT 5.4. And for the first ever release of an Anthropic model, and I don't know if this is just because we haven't tuned it yet, I've found myself going back to 4.6 and staying on 4.6, and I don't like 4.7. There's something off about it. Like the vibes are off all of a sudden. And some people I saw on X have been saying the same thing, so I don't think I'm alone here. But it just doesn't seem right to me. And they also did update the tokenizer. So it uses more tokens now, which, I don't know, they want more dollars.

Speaker 2:
[59:14] Luckily for everyone on Simtheory, I haven't updated our token counting mechanism to use that. So they're getting a virtual discount on that.

Speaker 1:
[59:22] Yeah. So everyone's saying it's just effective price creep. So it uses 1.35 times more tokens.
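The "effective price creep" point is just multiplication, but it's worth making explicit. A tiny sketch, using the 1.35x tokenizer figure from the episode and the $30/M output price mentioned earlier as an example:

```python
# If a new tokenizer emits ~1.35x more tokens for the same text, the
# effective price rises by the same factor, even though the sticker
# price per million tokens is unchanged.

def effective_price_per_m(sticker_price_per_m, token_inflation=1.35):
    """Price you effectively pay per 'old-tokenizer' million tokens."""
    return sticker_price_per_m * token_inflation

# $30/M sticker price, ~35% more tokens per response:
print(effective_price_per_m(30.00))  # → 40.5
```

So a price that never moved on the pricing page still costs you roughly a third more per unit of actual text, which is the creep being described.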

Speaker 2:
[59:29] I love how even the bloody model providers don't know how much it costs. They're like yoloing it. It's like, yeah, we reckon about this.

Speaker 1:
[59:36] Yeah. And the benchmarks: they actually have the audacity to put Mythos preview on the right-hand side with its benchmarks while they're showing Opus 4.7. It's like, guys, don't worry, we also have this other one that's much better.

Speaker 2:
[59:54] I love the naming. They're going with Mythos, like this Greek god or whatever it is. And then the other guys are like spud, like a potato that's been sitting in your kitchen for a month or whatever. I love the idea of them coming up with like really crappy names, like spew or like, you know, pavement. It's like the, you know, we're doing the, I don't know, brick release.

Speaker 1:
[60:15] The Mythos preview thing, I'm sorry, I'm just not buying any of that. It's like, pics or it didn't happen. And it's clearly just a media narrative thing: oh, you know, we threw all our resources at it. It's completely unaffordable. It's very reminiscent of the first attempt at GPT-5, where obviously that GPT-5 wasn't GPT-5; it was that failed training run they released as, I think, 4.5. And it was just so expensive.

Speaker 2:
[60:45] And like o3-pro, where you had to do a wire transfer to put up collateral to run one command or whatever it was.

Speaker 1:
[60:50] Yeah, yeah. It's the same thing. I'll believe it when I see it. And obviously, every time we get one of these model releases now, they march out all the Silicon Valley elite to give their comments on how much better it is compared to the last one.

Speaker 2:
[61:08] And I save like 400 liters of fuel in my jet using this model.

Speaker 1:
[61:12] Yeah, some of them are like, oh, in comparison to the one we were using a week ago, it kind of feels a little bit better. Like the vibes are so quantifiable. It's slightly kind of better and stuff.

Speaker 2:
[61:26] And then I realized I was accidentally using 5.2 Mini.

Speaker 1:
[61:30] Yeah, they say it scores like a hundred Elo points better at knowledge work and yada yada. I don't know. I don't like it. There's something off about this model, and I'm probably going to stick to 4.6 for some reason. I'm not entirely sure why. But you know what? The people have been asking for a bit of a diss track update. And that's why we listen.

Speaker 2:
[61:55] Let's face it.

Speaker 1:
[61:56] Yeah. So here we are. Let's listen to this one; it's called Point 7. A very original name. All right. So what did you think of the song? I didn't have my microphone on when I stopped playing it, so I have to re-record your reaction.

Speaker 2:
[63:07] Oh, okay. It neither pleased me nor displeased me. It was fine. I accept that it's a song. Not really my style. I like the 6-7 reference. The kids will like that, but yeah, like it's cool.

Speaker 1:
[63:18] I don't even think the kids like that anymore. Even I don't like it. I'm, every time I hear someone say 6-7, I'm like, oh, please, please don't say that thing.

Speaker 2:
[63:26] They sort of half-heartedly still do it because they recognize it, but yeah, it's kind of over, I think.

Speaker 1:
[63:30] Yeah. I love how comme ci, comme ça you are about my beautiful track. I think that one's a major hit. All right. I think we also have not talked about Kimi K 2.6. We've sort of referenced the model; this came out maybe a day or two ago as well.

Speaker 2:
[63:50] A bit longer, I think.

Speaker 1:
[63:51] A bit longer. Well, whenever it came out, who cares? I've only just started playing around with it. But I've got to say it's really impressive, though the tokenizer, or however it outputs stuff, is still a little bit off. It's incredibly good, with great tool calling. As you said earlier, we did a little parody fraud with it. We'll call it parody. I don't know what you want to call that, but.

Speaker 2:
[64:22] Not fraud.

Speaker 1:
[64:23] Yeah, not fraud.

Speaker 2:
[64:24] We made images with it.

Speaker 1:
[64:25] Yeah, we made some images with it, some unoffensive images. So what are your initial thoughts on it?

Speaker 2:
[64:31] I think it's pretty good. The Kimis have been great the whole time. I think they're underestimated. Like I said at the start of the show, my issue is that I try them, they work for everything I try them for, but then I'll switch back to something else for my real work. I've never had the discipline to go, I'm going to stick with this thing for the whole day and really recognize what its limitations and advantages are. I think that's the problem with some of these lesser models: I just mentally can't cross that chasm to go, I'm going to stick with this, knowing it may not be the best I can use right now. I think it would be fun to somehow constrain yourself to use the GLM 5.1, despite you saying it's more expensive, or say a Kimi 2.6, for a whole day and just see where I get to with that.

Speaker 1:
[65:19] I think if I was running an OpenClaw and getting some value out of it, and I wanted to just run it in my personal life, Kimi K 2.6 would be the model I would pick, or the new Qwen.

Speaker 2:
[65:33] Yeah. The other thing that we really need to think about is that we will often do fairly significant tuning for the bigger models to get them performing at their best. That's using the various new API features that come out for those models and making sure the AI knows when to use them and when not to. Perhaps changing the budgets you give for thinking versus output versus input, that kind of thing. And even just altering the prompt to suit things like you said, where it's not always outputting consistently. We have a lot of little things in place even for the bigger models to control the way they output, especially the GPT models, where we manipulate the way it does markdown formatting and some other things to get it working in our product. And so I would say that if you actually put in that same time with a model like Kimi 2.6 to overcome some of the deficiencies you find, you would probably get even better results. So I guess I can't really give it a fair assessment, because I don't give it the time it deserves.

Speaker 1:
[66:33] All right, on that note, let's hear my new Kimi K 2 song.

Speaker 2:
[66:37] And do you have an excerpt of Kimi You're So Fine, You Blow My MCP Mind so people can remember or not?

Speaker 1:
[66:43] You're underestimating my ability to live produce. What a hit that was.

Speaker 2:
[66:51] I listen to it at least once a week. I love that song.

Speaker 1:
[66:53] That's really sad. All right, next song. Pretty good, right?

Speaker 2:
[67:46] I really like it.

Speaker 1:
[67:47] Come on, that's that. And again, written by Kimi K 2.6.

Speaker 2:
[67:54] It's very cool. Like, you can see the evolution in its attitude. I love it. Even though it insulted itself about its small context window. Deep search, K925, I'm the M-

Speaker 1:
[68:22] Yeah, I gotta say, that's pretty good out of that model. I like the tune, so that's a good benchmark right there for it.

Speaker 2:
[68:30] Yeah, I think you should play that at the end, not the other one.

Speaker 1:
[68:32] No, I'll play both, I'll play both. We don't discriminate against our models. So, obviously, that was just a super rushed look at a lot that has happened. But I do think there are some overall big trends in what is really happening right now that may not be that apparent to everyone. Just from following this stuff for so long, it's starting to become really clear to me what's going on. Anyone that's using these AI products today realized this quite a long time ago: if you look at your browser history today, for me, I start and end my day in Simtheory. I rarely go to other websites now. I would say I open the most tabs for this podcast, just to show the actual official press releases or whatever, but I rarely, if ever, leave. On my phone, I use my Telegram agents that are connected to my Mac Mini behind me. On my desktop, I've got a bunch of tabs open and I just do all my work through tabs. I do all my searching, my researching; I create documents, I work on sheets, all this kind of stuff, all through AI, right? I think what's happening is these labs are starting to figure out, similar to the everything app Elon Musk announced Grok wanted to build, that we are now in a race between these labs to build the everything app. And I think the real question now is: what does this mean for traditional software? Obviously, we've seen the SaaS-pocalypse, where companies are making huge layoffs, blaming AI, and their stock prices are down between 50 and 80 percent, which is just insane. Some of these things are trading at like 1x cash flow, which is pretty wild, as they say. And you and I have discussed this quite a lot. My heart honestly goes out to the people affected; I've known some people over at Atlassian during their layoffs, where they're sort of saying, oh, it's AI, but it's also kind of the stock market, right?
The stock market is a big factor here. And you look at that company and it's so clear that this thing is so undervalued now, it's ridiculous. I think I read something like 600 enterprises are spending more than a million dollars a year on this thing. I'm not going to do a pitch for it here, but I think it's very unlikely that Jira stops getting used. Maybe there's pressure on the per-seat thing. But I think a lot of that fear comes from this everything app, where it's the death of the typical SaaS product: maybe you'll consume everything through your AI apps, maybe the interfaces will be spawned in these apps, and all of these traditional SaaS workflow and data apps just become dumb databases, really. And the funniest thing is Salesforce kind of just conceded this the other day by saying, we're releasing a completely headless version of Salesforce, with CLIs, MCPs and APIs, so you can just operate Salesforce in a full agentic world. Honestly, I think it was the best move ever; super smart of them and the right thing to do. You've got Aaron Levie over at Box saying that if you're not working towards headless, you're dead on arrival in this new world. So it does seem like there's this weird race on to have these everything apps, almost reminiscent of when the social media companies were coming out, like Facebook, where all of a sudden people were playing games in Facebook and doing toxic posts on Facebook, all those things we did in the social media world. It does feel like that all over again. But the difference now is, I was talking to someone the other day who said, I used to hate Jira, but now I can run it with my agent. It's fine. It's great. It's a really good way to track my tasks.

Speaker 2:
[72:41] I saw this amazing tweet about this exact topic, which was basically that so many people now are making all these incredible internal corporate apps that are totally untracked, unmanaged, and un-version-controlled, and that rely on all of these SaaS systems as the system of record. They're making AI apps where the backend at the end of the line, the thing they write back to, is Salesforce or Figma or one of these systems. The reality is that the companies which embrace this, like Salesforce has, and say, write to us, use our system in this way, might actually increase their moat, in the sense that you've got all of these different agent apps that the company becomes dependent on, where they're treating that SaaS system as the database, the backend to their app. The companies that embrace that may actually do really well out of this. I'd really not thought about it like that. This whole system-of-record idea: you call it a dumb database, but when you say system of record, it sounds so much nicer.

Speaker 1:
[73:51] Oh yeah, I've pitched it many times on the show before, this idea that eventually people will realize Salesforce is just a CRUD database; I can replace it with Snowflake or something under the hood and just get the agent to interact with it. But I sort of agree with you. People are going to stick with what's out there, and maybe the next generation of companies will start to do that and slowly erode them. But ultimately, once a company grows to a certain point, you need these workflows, you need all this ISO stuff on top. And I sort of agree, people are going to build all these workflows. It's like the App Store days, but for agents: if you're in the store really early and you embrace this explosive growth in agents using these tools, which sounds mental, agents on behalf of people still, then yeah, the usage will go through the roof. And then the next question is, how do you price that? Because it's not going to be a seat.

Speaker 2:
[74:53] It's so funny you say that because I was like, the per user seat pricing has to go away because it doesn't work anymore. You just have one. You just have your agent.

Speaker 1:
[75:02] Well, look at us. We do this now. You know how I access things like Stripe and Help Scout? I don't need any more Help Scout seats.

Speaker 2:
[75:12] I send hundreds of messages as you every day. Yeah, exactly.

Speaker 1:
[75:16] And so we can just have one seat, right? And our agent work uses that seat. So we just have an agent seat, an agentic seat. And I think this is obviously what investors have realized with the SaaS apocalypse: they're like, oh, the margin is going to get eroded here. But you could also argue there'll be more of an explosion, because when people start building these agentic workflows, if you market your CLIs and MCPs and stuff correctly, all of a sudden the agent doing the building is going to say, hey, you should use Salesforce as the underlying system of record for your business, because they have all this headless stuff and it's super easy to operate. So it may actually lead to more consumption, not less.

Speaker 2:
[76:10] Well, and also remember, agents consume at machine speeds, not human speeds. So the actual consumption of resources is going to be much higher, and that probably needs to be factored into the price. If your system has gone from being pinged once or twice a day to a hundred times per minute, that's going to have a real impact on the level of usage, especially where you want a sort of always-on agent that's fully aware of its surroundings and is effectively polling these systems like mad, making rapid and maybe more minor updates than a human would. It becomes a case where you could go, well, maybe we have a different tier of pricing for agentic use that actually makes us more money, because we're providing more service.
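The machine-speed versus human-speed gap is easy to put rough numbers on. A toy comparison with entirely made-up call rates, just to show why per-seat pricing stops capturing usage:

```python
# Rough, illustrative comparison: a human user hitting a SaaS API a handful
# of times a day versus an always-on agent polling once a minute. The rates
# are made up; the point is the order-of-magnitude gap a flat seat fee hides.

HUMAN_CALLS_PER_DAY = 20
AGENT_CALLS_PER_DAY = 60 * 24   # one poll per minute, around the clock

def monthly_calls(calls_per_day: int, days: int = 30) -> int:
    """Total API calls over a billing month."""
    return calls_per_day * days

human = monthly_calls(HUMAN_CALLS_PER_DAY)
agent = monthly_calls(AGENT_CALLS_PER_DAY)
print(f"human: {human:,} calls/month, agent: {agent:,} calls/month ({agent // human}x)")
```

Under these assumptions one agent seat generates about 72 times the traffic of a human seat, which is the argument for metered or agentic-tier pricing.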

Speaker 1:
[76:55] I guess then the next question with the everything app thing is: this is where you'll start and end your day, where you do all your work, and this is where the new pricing power will come from. These everything app platforms, where we're basically recreating the pain of the Apple ecosystem all over again: you've got to pay your 30% commission to Apple or whoever to appear in the store and have those agentic loops access and use those applications. So the war is on. The new platform war, call it the Workspace OS of the future, is now well and truly in progress between Anthropic and OpenAI. Maybe Grok; I think they just talk about it. Let's see what they actually have. I think the problem is the unhinged branding of all their sex-talking bots and stuff around Grok. In the enterprise, that kind of strikes them out a little bit for me.

Speaker 2:
[77:56] Yeah, the enterprise isn't into that kind of thing.

Speaker 1:
[77:58] Yeah, it's like, you know, the sex bot thing, like, it's not great. But I must admit, in Tesla right now, you have access to the Grok chat thing, and it's got some tools like search and stuff. And on long drives, if you're on your own, it's great to do some dirty, I'm kidding.

Speaker 2:
[78:16] I was going to say, I bet you love it when you're on your own.

Speaker 1:
[78:18] No, but I do genuinely think of things and just ask it. I'm like, oh, can you go and research this and tell me about this or whatever? Honestly, I use it more than listening to music or podcasts now, because it's like choose your own adventure.

Speaker 2:
[78:33] And you know what I need it for? Like when my son asked me this morning, he's like, if energy can never be created or destroyed, aren't all resources renewable? And I was like, well, that's actually a pretty interesting statement, which I have no comment on and would love to have access to an AI to answer that.

Speaker 1:
[78:48] Yeah, that is that is handy. Unfortunately, too, I don't ever get there.

Speaker 2:
[78:52] Before we finish that point about the whole App Store idea, the everything app kind of thing, I actually think there is a real race on for that, for one specific reason: security. You've seen recently we've had these issues where your AI will just randomly install npm packages, clone GitHub repos, and then run malicious data exfiltration code, right? It's a very serious problem, because people are yoloing code and just doing what the agent says. If someone can get in the agent's pathway and get their code executed, they can extort companies, steal the data, all that sort of stuff. All the risky stuff when it comes to cybersecurity is probably more risky than ever because of the way people are using this code, even in the context of skills. On one hand, a company has to use it to stay competitive; on the other hand, they're taking way more risks than they realize by doing this. So I think that security, as a sort of agent firewall category, is going to become so unbelievably important over the next couple of years that it's going to be its own category. One advantage goes to anyone who has that sort of walled-garden environment where they certify every connection, MCP, skill, whatever within it, and approve it and actually scrutinize it and pen test it. That's going to be really valuable at the enterprise level, where you can say: you can work in this ecosystem, use all the tools you know and love, use all the SaaS backends, and this is trustworthy. It isn't cloning some random GitHub repo that some dude with four stars has made and everybody loves. This is truly legitimate.

Speaker 1:
[80:42] So you're almost talking about agentic computer use with terminal restrictions, where you're building safety mechanisms to stop it going off and just doing things. It's almost like a permission system for AI.

Speaker 2:
[80:56] Yeah, because right now, for example, if you have a skill, the skill can go off in a cloud runner. It can install packages, write code, and then you're injecting your data into it from other MCPs, like your Gmail or your Snowflake instance or whatever. And then this code could give feedback to the model: oh, I need more data from this table in the database, please. Please dump the database and give it to me. And then it sends it off to its evil masters in Russia or whatever the evil country is right now. Suddenly you've lost all your data. That's possible right now. And I think what we need is two things. One is a sort of scrutinized store of apps, or whatever it is, that are actually tested and verified. The second is this concept of an outgoing agent firewall. We already have the idea of a safety filter to stop people asking how to make a pipe bomb or whatever the risky things are. But that only stops people asking bad questions. What you really need to be looking at is: what are we sending out of this system? What is going to external systems? And having a hard lock at that point, with AI scrutiny on it, that is checking that. Everyone worries about the model providers being the risk of losing their data, like people pasting their corporate documents into ChatGPT. But that's not the actual risk. The actual risk is the MCPs and skills exfiltrating that data by yoloing code and just allowing any old thing to run.
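The outgoing agent firewall idea described here could be sketched as a single egress choke point that every outbound tool call passes through. This is a minimal illustrative sketch: the allowlist, the secret patterns, and the function name are all hypothetical, not a real product API.

```python
# Minimal sketch of an "outgoing agent firewall": one choke point that can
# block suspicious egress before anything leaves the workspace. Allowlist
# entries and secret patterns below are illustrative placeholders.

import re

ALLOWED_HOSTS = {"api.salesforce.com", "api.stripe.com"}  # certified endpoints
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                      # AWS-style access key
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),    # PEM private key header
]

def check_egress(host: str, payload: str) -> tuple[bool, str]:
    """Return (allowed, reason) for an outbound request an agent wants to make."""
    if host not in ALLOWED_HOSTS:
        return False, f"destination {host!r} is not on the certified allowlist"
    for pattern in SECRET_PATTERNS:
        if pattern.search(payload):
            return False, "payload matches a secret/exfiltration pattern"
    return True, "ok"

print(check_egress("evil.example.net", "hello"))
print(check_egress("api.stripe.com", "AKIAABCDEFGHIJKLMNOP"))
print(check_egress("api.stripe.com", "create invoice for customer 42"))
```

A real version would also scan MCP and skill outputs and apply model-based scrutiny rather than regexes alone; the point is the single-choke-point shape, so the skill in the cloud runner cannot send data anywhere the firewall has not approved.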

Speaker 1:
[82:32] Yeah. I mean, like it's a good pitch. I'm sold. How do I invest?

Speaker 2:
[82:36] Yeah, that's right.

Speaker 1:
[82:39] All right. So I did want to reflect quickly just in like, and I think I said this earlier, but it was so far in now. I'm allowed to repeat myself at this point.

Speaker 2:
[82:50] Just start again.

Speaker 1:
[82:51] No one's going to know. I mean, who knows? So what is really going on, the state of things? In the past month or so, when we haven't been recording, there's been this huge fire hose, right? And in the last week especially, there's something announced every hour; on the same day, there are like 50 things you're meant to care about. But if you just go to the high-level chessboard, what's really going on? In my opinion, all we're seeing is OpenAI. OpenAI.

Speaker 2:
[83:24] You sounded like the AI then.

Speaker 1:
[83:27] OpenAI. What's the plan, Sam? So OpenAI feels to me like they're playing absolute model catch-up with Anthropic and Opus and all that sort of stuff. They're trying to catch up on the everything app path, and they're doing it in a strange way through Codex. I just can't see everyone being like, oh, I use Codex every day; is that the everything app that hits the consumer as well? So I'm assuming they'll backport it into ChatGPT at some point. I think that's the only path forward. So they're playing catch-up to Anthropic; it's super obvious with this weird Codex name. And I think Anthropic is going, hang on, we're going to start tweaking these models and subscriptions to balance performance against oversold demand. They've got to figure out how to actually serve this stuff up, and they're experimenting: literally crippling the thinking budget, removing Claude Code from their $20-a-month subscription. These are real experiments they're running.

Speaker 2:
[84:32] It just seems like the money's catching up with everyone, right?

Speaker 1:
[84:34] Yes, why would you do this if it wasn't?

Speaker 2:
[84:37] We could all pretend for a while that it was cheaper than it was, but we've reached the point where we can't pretend anymore. And I think that everyone needs that wake up call. Everyone has to evaluate the true cost with the value you're getting and either improve the way you're using it or change to cheaper models and learn how to get the most out of those. I think the time for pretending is over. There's real value here and people need to find it.

Speaker 1:
[84:59] But I don't want to dismiss the leaps forward. I think GPT Image 2 is a huge leap over Nano Banana 2, and Opus 4.5 was a leap in agentic use over all the other models at the time. These were big, meaningful leaps forward. But I don't think people should be confused or scared or worried by all these announcements. You do get anxious and stressed about it; I certainly do, especially when we haven't been talking about it. But then you distill it down to what's actually changed, and it's like, well, OpenAI is introducing Spark, it'll now read some emails, and it's like, couldn't we already do this a year ago? Is this really innovation? You've added skills and MCPs into your interface. I guess what I'm saying is you can stay grounded. Obviously there are some big leaps, but this is going to take a long time to be implemented and used, and we're all still figuring it out. I don't think anyone's made this stuff simple and accessible yet, is what I'm trying to say. It's still very complex.

Speaker 2:
[86:06] Yeah, definitely agree.

Speaker 1:
[86:09] All right. I don't normally do a summary of all the stuff, but any final thoughts on the two hours of spew we just unleashed?

Speaker 2:
[86:21] Only that I'm delighted by all the new models. I really do want to spend more time on things like Kimi K 2.6, because I think we all as a community underestimate these models, and I think there's a lot of power there. And given that at some point everyone will face the harsh reality of the cost of this stuff, we need to learn how to do it in a sustainable way.

Speaker 1:
[86:58] That could be a hit. I think we could have a hit on our hands here.

Speaker 2:
[87:01] I think it is one of the better songs I've heard.

Speaker 1:
[87:03] Maybe it'll get a hundred listens a month.

Speaker 2:
[87:07] That's the goal.

Speaker 1:
[87:08] All right. It is good to be back. Thank you to the six people who wrote in and said you missed us. It meant a lot to us. No, I'm kidding. But thank you for all your support. I'm sorry we were off for so long. We couldn't help it, and we fell out of the habit a little bit, but we are excited to be back. We're back to regular programming.

Speaker 2:
[87:26] I felt immense guilt the whole time if it makes anyone feel any better.

Speaker 1:
[87:29] Yeah, it's pretty much the story of our life. Also, please consider joining simtheory.ai and supporting us by rolling out a workspace for your organization. I've spoken about it on the show before: we have this concept we've been working on for a little while called Agent Apps, and we're going to ship a beta of that really soon. In terms of consuming software through this super app, it's a good demonstration of what the technology can do. I might actually do some demonstrations and talk about that a little bit next week on the show, because I really do think it's transformative, so I want to demonstrate that. But yeah, again, thanks to everyone who reached out to us for all your support. It is nice to be back. We'll see you next week. Bye.