title Shopify’s AI Phase Transition: 2026 Usage Explosion, Unlimited Opus-4.6 Token Budget, Tangle, Tangent, SimGym — with Mikhail Parakhin, Shopify CTO

description Early bird discounts for the San Francisco World’s Fair, the biggest AIE gathering of the year, end today - prices will go up by ~$500 tonight so do please lock in ASAP!
From near-universal AI tool adoption inside Shopify to internal systems for ML experimentation, auto-research, customer simulation, and ultra-low-latency search, Mikhail Parakhin joins us for a deep dive into what it actually looks like when a 20-year-old, $200B software company goes all-in on AI. We cover why Shopify has become much more vocal about its internal stack, what changed after the December model-quality inflection, and why the real bottleneck in AI coding is no longer generation, but review, CI/CD, and deployment stability.
We also go inside Tangle, Tangent, and SimGym, three major AI initiatives Shopify is building to make experimentation reproducible, optimization automatic, customer behavior simulatable, and search and catalog intelligence faster and cheaper at scale. Along the way, Mikhail explains UCP and Liquid AI, why token budgets are directionally right but often measured badly, why AI-written code can still increase bugs in production, what makes Shopify’s customer simulation defensible, and what he learned from the Sydney era at Bing.
We discuss:
* Mikhail’s path from running a major Microsoft business unit spanning Windows, Edge, Bing, and ads to becoming CTO of Shopify
* Why Shopify is talking more publicly about AI now, and why staying at the frontier has become necessary for the company
* Shopify’s internal AI adoption curve, the December inflection, and why CLI-style tools are rising faster than traditional IDE-based tools
* Why Jensen Huang is directionally right on token budgets, but raw token count is still the wrong way to evaluate engineering output
* Why the real unlock is not more agents in parallel, but better critique loops, stronger models, and spending more on review than generation
* Why AI coding can still lead to more bugs in production even if models write cleaner code on average than humans
* Why Shopify built its own PR review flow, and why Mikhail thinks most off-the-shelf review tools miss the point
* How PR volume, test failures, and deployment rollback are becoming the real bottlenecks in the agent era
* Why Git, pull requests, and CI/CD may need a new metaphor once code is written at machine speed
* What Tangle is, and how Shopify uses it to make ML and data workflows reproducible, collaborative, and production-ready from the start
* Why Tangle is different from Airflow, and why content-addressed caching creates network effects across teams
* What Tangent is, and how Shopify is using auto-research loops to optimize search, themes, prompt compression, storage, and more
* Why Tangent is becoming a democratizing tool for PMs and domain experts, not just ML engineers
* Why AutoML finally feels real in the LLM era, and where auto-research still falls short today
* Why Tangle, Tangent, and SimGym become much more powerful when combined into one system
* What SimGym is, why simulated customers only work if you have real historical behavior, and why Shopify’s data gives it a moat
* How SimGym evolved from comparing A/B variants to telling merchants what to change on a single live storefront to raise conversions
* Why customer simulation is so expensive, from multimodal models to browser farms to serving and distillation costs
* How Shopify models merchant and buyer trajectories, runs counterfactuals, and thinks about interventions like discounts, campaigns, and notifications
* Why category-level behavior is so different across commerce, and why ideas like Chinese Restaurant Processes are showing up again in practice
* Shopify’s new UCP and catalog work, including runtime product search, bulk lookups, and identity linking
* Why Shopify is using Liquid AI, and why Mikhail sees it as the first genuinely competitive non-transformer architecture he has used in practice
* Where Liquid already works inside Shopify today, from low-latency query understanding to large-scale catalog and Sidekick Pulse workloads
* Whether Liquid could become frontier-scale with enough compute, and why Shopify remains pragmatic and merit-based about model choice
* Who Shopify is hiring right now across ML, data science, and distributed databases
* The Sydney story at Bing, why its personality was not an accident, and what Mikhail learned from deliberately shaping AI character early on
Mikhail Parakhin
* LinkedIn: https://www.linkedin.com/in/mikhail-parakhin/
* X: https://x.com/MParakhin
Timestamps
00:00:00 Introduction: Mikhail Parakhin, Microsoft, and Shopify
00:01:16 Why Shopify Is Talking More About AI
00:02:29 Internal AI Adoption at Shopify and the December Inflection
00:06:54 Token Budgets, Jensen Huang, and Why Usage Metrics Can Mislead
00:10:55 Why Shopify Built Its Own AI PR Review System
00:12:38 AI Coding, More Bugs, and the Real Deployment Bottleneck
00:14:11 Why Git, PRs, and CI/CD May Need to Change for Agents
00:18:24 Tangle: Shopify’s Reproducible ML and Data Workflow Engine
00:21:19 Why Tangle Is Different from Airflow
00:26:14 Tangent: Auto Research for Optimization and Experimentation
00:30:07 How Tangent Democratizes Experimentation Beyond ML Engineers
00:33:06 The Limits of Auto Research
00:36:36 Why Tangle, Tangent, and SimGym Compound Together
00:37:20 SimGym: Simulating Customers with Shopify’s Historical Data
00:42:47 The Infra Behind SimGym
00:46:00 Why SimGym Gets Better with Real Customer History
00:47:30 Counterfactuals, HSTU, and Modeling Merchant Trajectories
00:51:55 CRPs, Clustering, and Category-Level Customer Behavior
00:53:30 UCP, Shopify Catalog, and Identity Linking
00:55:07 Liquid AI: Why Shopify Uses Non-Transformer Models
00:59:13 Real Shopify Use Cases for Liquid
01:03:00 Can Liquid Scale into a Frontier Model?
01:09:49 Hiring at Shopify: ML, Data Science, and Databases
01:10:43 Sydney at Bing: Personality Shaping and AI Character
01:13:32 Closing Thoughts
Transcript
[00:00:00] swyx: Okay. We’re here in the studio, a remote studio, with Mikhail Parakhin, CTO of Shopify. Welcome.
[00:00:08] Mikhail Parakhin: Thank you. Welcome.
[00:00:10] swyx: I don’t even know if I should introduce you as CTO of Shopify. I feel like you have many identities. You led sort of the Bing ML team, I guess, or the ads team, I don’t know. People variously refer to you as, like, CEO, or... I don’t know what that previous role at Microsoft was.
[00:00:29] Mikhail Parakhin: That was... yeah, my previous role at Microsoft. I actually was the CEO of one of Microsoft’s business units, which included, as we discussed, all the things that people like to laugh about, including Windows and Edge and Bing and ads and everything.
[00:00:47] swyx: Yeah, yeah. What a wild time.
You’ve obviously done a lot since you landed at Shopify. One of the reasons I reached out was because you started promoting more of your internal tooling, primarily Tangle, but also a lot of people have seen and adopted Tobi’s QMD, and obviously, I think, Shopify has always been sort of leading in terms of engineering.
I think it’s just more recent that you guys have been more vocal about your AI adoption. Is that true?
[00:01:16] Mikhail Parakhin: Well, I think AI tools in general are a fairly recent development, and Shopify, at this stage of its development, developing AI in-house, building tools that use AI, and interfacing with the wider AI community, is on a sort of runaway trajectory.
So talking about it more is just a natural byproduct. Just yesterday, Andrej Karpathy famously tweeted about ways you can organize your agents to store data and then look it up, so that you don’t have to re-research or lose context every time. And a little bit tongue in cheek, I tweeted, “Hey, we’ve done it much earlier, and we even have different approaches, Tobi and I.” Tobi, of course, is a big fan of QMD, and I’m more of a SQL, SQLite fan. But yeah, very similar things that we’ve already done here. The point is, yeah, we’re a very dynamic, explosively growing company, and we have to be at the forefront of AI adoption, obviously.
[00:02:29] swyx: Yeah. Your team kindly prepared some slides, actually, that we were going to bring up on the screen. I think I can screen share, and then we can kind of go through some of the shocking stats that maybe put some numbers to what exactly is going on. So here we have an internal AI tool adoption chart.
What are we looking at here?
[00:02:54] Mikhail Parakhin: Yeah, this is a very interesting statistic. This is the number of daily active workers, think of it as DAU, basically the active users of-
[00:03:05] swyx: Yeah ...
[00:03:05] Mikhail Parakhin: AI tool as a percentage of all the people in the company, right? And then different AI tools. And you could see two things here. One is that the green is the total.
Green is just the total. So you could see that it approaches nearly 100% by now. It’s hard to do your job now without interacting deeply with at least one tool. Another interesting thing you can see, as many people commented: December was the phase transition, when suddenly models got good enough that everything took off and started growing.
Many people noticed that small improvements accumulated into this big change in roughly the December timeframe.
[00:03:52] swyx: Yeah.
[00:03:52] Mikhail Parakhin: The other thing I would claim you could see is that CLI-based tools, and tools that don’t require you to look at the code, are becoming more popular.

pubDate Wed, 22 Apr 2026 19:33:00 GMT

author Latent.Space

duration 4345000

transcript

Speaker 1:
[00:04] Okay, we're here in a studio, a remote studio with Mikhail Parakhin, CTO of Shopify. Welcome.

Speaker 2:
[00:08] Thank you.

Speaker 1:
[00:10] I don't even know if I should introduce you as CTO of Shopify. I feel like you have many identities. You led the Bing ML team, I guess, or the ads team, I don't know. People variously refer to you as the CEO, or... I don't know what the previous role at Microsoft was.

Speaker 2:
[00:29] That was my previous role at Microsoft. I actually was the CEO of one of Microsoft's business units, which included, as we discussed, all the things that people like to laugh about, including Windows and Edge and Bing and ads and everything.

Speaker 1:
[00:47] Yeah. What a wild time. You've obviously done a lot since you landed at Shopify. One of the reasons I reached out was because you started promoting more internal tooling, primarily Tangle, but also a lot of people have seen and adopted Tobi's QMD, and obviously I think Shopify has always been leading in terms of engineering. I think it's just more recent that you guys have been more vocal about your AI adoption. Is that true?

Speaker 2:
[01:15] Well, I think AI tools in general are a fairly recent development. Shopify, at this stage of its development, developing AI in-house, building tools that use AI, and interfacing with the wider AI community, is on a runaway trajectory. So it's just a natural by-product that we talk about it more. Just yesterday, Andrej Karpathy famously tweeted about ways you can organize your agents to store data and then look it up, so that you don't have to re-research or lose context every time. And a little bit tongue-in-cheek, I tweeted that, hey, we've done it much earlier, and we even have different approaches, Tobi and I. Tobi, of course, is a big fan of QMD, and I'm more of a SQLite fan, but yeah, very similar things that we've already done here. The point is, yeah, we're a very dynamic, you know, explosively growing company, and we have to be at the forefront of AI adoption, obviously.

Speaker 1:
[02:30] Yeah, your team kindly prepared some slides, actually, that we were going to bring up on to the screen. I think I can screen share and then we can kind of go through some of the shocking stats that maybe put some numbers to what exactly is going on. So here we have an internal AI tool adoption chart. What are we looking at here?

Speaker 2:
[02:54] Yeah, this is a very interesting statistic. This is the number of daily active workers, think of it as DAU, basically: the active users of an AI tool as a percentage of all the people in the company. And then different AI tools. And you can see two things here. One is that the green is the total. So you can see that it approaches nearly 100% by now. It's hard to do your job now without interacting deeply with at least one tool. Another interesting thing you can see, as many people commented: December was the phase transition, when suddenly models got good enough that everything took off and started growing. Many people noticed that small improvements accumulated into this big change in roughly the December time frame. The other thing I would claim you can see is that CLI-based tools, and tools that don't require you to look at the code, are becoming more popular. You can see various versions of Claude Code and Codex and Pi and the internal development tools taking off. Exactly, the blue is our river, our internal agent for coding. Whereas tools that require an IDE, such as GitHub Copilot or Cursor, are not exactly shrinking, but they're not growing as fast. The red line is the IDE tools. So you can see that they're not experiencing growth that's as fast.

Speaker 1:
[04:37] As I understand it, basically every employee can choose whatever tool they use, and then you're just doing a daily survey or something.

Speaker 2:
[04:47] Exactly. The push is to get your job done; you can use any tool, and we effectively fund unlimited tokens for everybody. We do try to control the models that people use, but from the bottom, not from the top. We basically say, hey, please don't use anything less than Opus 4.6. Some people end up using GPT 5.4 extra high, some people use Opus 4.6. There are some pluses and minuses in going for a full 1 million context window versus not, but we try to discourage people from using anything less than that.

Speaker 1:
[05:28] Yeah, yeah. Got it. I mean, the next chart here really shows the expansion in that December 2025 inflection, right? People are using a lot of tokens. I think it's also really interesting that no one was abusing it in 2025. Compared to this year, there was almost no growth; I mean, it still probably grew 53%.

Speaker 2:
[05:56] This is just a different scale. It's still exponential growth, just a different rate of expansion. There was an inflection point. And I would claim the super interesting part here is that you can see the distribution becoming more and more skewed. The top percentiles grow faster. So the people in the top 10th percentile, their consumption grows faster than the 75th percentile's, and so forth. So the distribution skews more and more towards the heaviest users, which, I don't know what it tells me. It feels not ideal, to be honest. Or maybe it's okay. We'll see.

Speaker 1:
[06:35] Why does it feel not ideal? Is it quantity over quality, or what's the concern?

Speaker 2:
[06:42] Because take it to the limit. That means if this rate of separation continues, there will be one person consuming all the tokens. It's kind of strange. Yeah.

Speaker 1:
[06:56] I think internal teaching and all that will help distribute things more widely. But in the early days, of course, the people who are more AI-pilled will obviously find more ways to use it than the people who are less AI-pilled; let's just call it that. I'll just quickly pause from the... We'll go back to the rest of the slides. But I just want to review. There are a lot of CTOs of large companies like yourself who are all considering some kind of token budget. It's something Jensen Huang has been talking about: if your $200K engineer is not using $100K of tokens every year, they're underutilizing coding agents. Of course, Jensen Huang would say that. But it seems a very quantity-over-quality approach. Some people are basically asking, is this comparable to judging engineer quality by lines of code? Which we also know is flawed, but better than nothing. So I don't know if you have a management take here on how to view these kinds of metrics.

Speaker 2:
[08:02] Well, I mean, you're baiting me. This is my favorite topic. If you let me, I'll probably talk for two hours on just this. I have a lot of things to say. I do think Jensen got a lot of bad press, people saying, oh, of course the cake seller says you don't have enough cake. But I actually think that's undeserved. I think he is actually right.

Speaker 1:
[08:33] He's directionally correct.

Speaker 2:
[08:35] Yeah, he's directionally correct, for sure.

Speaker 1:
[08:37] Who knows what the right number is.

Speaker 2:
[08:39] The thing that I do want to say, and this is something we learned through trial and error, and it's very important, is two things. One is that it's not about just consuming tokens. In fact, the anti-pattern is running too many agents in parallel that don't communicate with each other. That's almost useless compared to just fewer agents, and it burns tokens very efficiently. The real unlock is setting up the right critique loop, especially with the high-quality models, where one agent does something and another one, ideally with a different model, critiques it and suggests ways to improve it, and the first agent redoes it. With this critique loop it takes much longer. Some people don't like it because latency goes up; they have to wait while this debate happens. But the quality of the code is much higher. And another thing, since you mentioned lines of code as the overall budget: lines of code are exploding for everybody right now, partially because AI is really more verbose, but partially just because AI can write a lot of code; it doesn't get tired. And so you have to have a very strong narrow waist during PR review. Otherwise, the number of bugs will just go through the roof. It's this unexpected consequence of volume trumping everything. I would claim that by now a good model writes code with fewer bugs than the average human. But since they write so much more of it, more of it will make it into production.
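
To make the critique loop concrete, here is a minimal sketch of the generate, critique, revise pattern described above. `generator` and `critic` are hypothetical callables wrapping two different models; this is an illustration of the pattern, not Shopify's internal tooling.

```python
# Hypothetical sketch of a generate/critique/revise loop. `generator` and
# `critic` stand in for two different model clients; names are illustrative.

def critique_loop(task: str, generator, critic, max_rounds: int = 3) -> str:
    """Generate code, then alternate critique and revision until the critic
    is satisfied or the round budget runs out."""
    code = generator(f"Write code for this task:\n{task}")
    for _ in range(max_rounds):
        review = critic(
            "Review this code. List concrete defects and improvements, "
            f"or reply APPROVED if there are none.\n\nTask: {task}\n\nCode:\n{code}"
        )
        if "APPROVED" in review:
            break  # the critic found nothing more to fix
        code = generator(
            f"Revise the code to address this critique.\n\n"
            f"Task: {task}\n\nCode:\n{code}\n\nCritique:\n{review}"
        )
    return code
```

As noted above, latency goes up because the two models take turns, but the trade is deliberate: more budget on review, fewer bugs reaching production.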

Speaker 1:
[10:25] You still have more bugs.

Speaker 2:
[10:26] Yeah. You have to have very rigorous PR reviews, also automated, of course, and you have to spend a lot of budget there. For me, actually, the important metric is the ratio of budget spent during code generation versus budget spent on expensive tokens, like GPT 5.4 Pro or DeepThink from Gemini, checking PRs in review.

Speaker 1:
[10:55] Yeah, totally. I noticed in your chart you didn't have any review tools. Do you just use, let's say, Claude Code to review? Or do you have another set of review tools, like the Greptiles, the CodeRabbits? Devin also has a review tool. I don't know if you've tried those specialist review tools.

Speaker 2:
[11:13] You're jumping a little bit ahead of me right now, because the graphs were only showing public tools. I haven't found a good PR review tool that does what I think should be done. And partially, my thinking is that it goes against both what people emotionally prefer and, frankly, even the business models those companies run on. At review time, you want to run the largest models. That means, I know, Codex or Claude Code is not going to cut it. You need pro-level models if you really want to stem the tide of bugs going into production, and you need to spend a lot of time, the models taking turns, but you don't want a big swarm of agents. So, in fact, you end up in a different, dualistic world where you generate not that many tokens, you in fact generate few tokens, but it takes a long time, because these are expensive models taking turns rather than many, many agents trying to do many things in parallel. That's why I feel I haven't found good tools, so we are using our own for PR review for now.

Speaker 1:
[12:33] Yeah. I mean, I think a lot of companies are building their own, especially to their needs, right? You also have a chart here, going back to the slides, on PR merge growth, where we're now at 30 percent month-on-month rather than 10 percent. And the estimated complexity is going up. This is productivity, right? Because presumably there's more stuff going into the codebase and more features getting worked on. I'm curious about the backlog, right? I actually don't mind a pro-level model taking an hour or two to review my PR, because I have dealt with humans who take a week to review my PR, right? And I keep pinging them on Slack: hey, hey, review my PR. So I think there's a trade-off here where it still makes sense.

Speaker 2:
[13:19] Exactly. That's exactly my point: on one hand, you can tolerate longer latencies at PR time. On the other hand, the real problem right now is not the time spent waiting for PR review. The real problem is that, since there's so much more code, the probability of at least some tests failing goes up, and when tests keep failing you have to find the offending PR, evict it, retest without that PR, and so the deployment cycle becomes much longer. So in terms of overall time to deploy, it's a total time saving if you spend more time on a larger model thinking for an hour, because then you don't have to spend all that time testing and rolling back the deployment.
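
The find-the-offending-PR-and-evict cycle described here is essentially a search over a failing batch of merges. A toy sketch, assuming exactly one faulty PR in the batch and a hypothetical `build_passes` function that runs CI against a candidate set; real merge queues are more involved than this:

```python
# Toy bisection over a failing merge batch. `build_passes(subset)` is a
# hypothetical stand-in for running CI with that subset merged onto trunk.

def find_offending_pr(prs: list, build_passes):
    """Assume the full batch fails CI and exactly one PR is at fault;
    binary-search the prefix that first introduces the failure."""
    lo, hi = 0, len(prs)          # invariant: the fault is inside prs[lo:hi]
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if build_passes(prs[:mid]):
            lo = mid              # prefix is clean, fault comes later
        else:
            hi = mid              # prefix already fails, fault comes earlier
    return prs[lo]                # evict this one and retest the rest
```

Each probe is a full CI run, which is exactly why this cycle dominates deploy time once PR volume explodes.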

Speaker 1:
[14:03] Yeah, totally. That's still worth it. You don't look at the individual step; you look at the aggregate, and at the change in the aggregate system.

Speaker 2:
[14:11] Exactly.

Speaker 1:
[14:11] I'm curious whether this PR mentality and the CI/CD paradigm will change eventually. Obviously, a lot of people want a new GitHub, but I even wonder if Git is the problem. Is that the bottleneck? Is the concept of a PR the bottleneck? Do you guys use stacked diffs? I don't know if it's a merge-queue, stacked-diff type of thing.

Speaker 2:
[14:36] We use stacks. We use Graphite; we work with Graphite a lot. So we use stacked PRs. I think the overall CI/CD, and the interaction with the code repository in general, is clearly the main issue and the bottleneck for us right now, and it's top of mind. I would say we probably need a different metaphor, or a whole different design for how to process this in the new agentic world. I haven't seen anything dramatically better yet. I think everybody right now is just trying to keep their head above water, because there are so many PRs, and then everybody's CI/CD pipelines start creaking: the times are increasing, the number of bugs slipping by is increasing, and you have to clamp down. So we are a little bit in this situation where we need to first stabilize that story and then start thinking, hey, what could be completely different in a new world? I know some people are working on it. I haven't seen anything super compelling yet, but clearly the old things that were designed for humans will need to be morphed into something new.

Speaker 1:
[15:53] One other thing I think about is that the merge conflict is basically a global mutex on the whole system, right? And in human organizations, we do have something like that: the company stand-up. But other than that, it's actually fitting for us to be somewhat decentralized, somewhat plugged into one stream of information, but somewhat lossy. That's okay; not every delivery needs atomic consistency. We're not dealing with a database sometimes.

Speaker 2:
[16:27] This is a very good point. Since humans don't write code too fast, that global mutex is not too bad. Once you start writing code at the speed of machines, it becomes the bottleneck, and then what do you do? Maybe, and I can't believe I'm saying this, because I'm a lifelong opponent of microservices and always thought that was a really bad idea, but now that you're saying it, maybe in the new world microservices will make a comeback, because then you can ship tiny things independently, and managing all that complexity automatically will be much easier. I don't know. We'll have to see.

Speaker 1:
[17:10] Yeah. I mean, I don't know what the Microsoft or Shopify equivalent is, but I read this paper from Google where they have a monorepo that deploys into microservices. And then the other concept I think about a lot is the Chaos Monkey idea from Netflix: being able to create this robust system where you have service discovery, independent microservices discovering each other, and probably a fair amount of duplication. That's how an organic system scales: you have that, I don't know what you call it, slack, robustness, duplication. These are not exactly the terms I'm looking for, but I can't really think of the words. Okay, I was going to go into Tangent and Tangle. We discussed the overall stats that Shopify has, but I think some pretty cool stuff you guys are working on is your ML experimentation and your auto research training pipeline. Presumably you're much closer to this one, because it's a personal hobby of yours. How would you explain them together? I thought we had a slide with the system diagram.

Speaker 2:
[18:24] Yeah. Tangle first, and then Tangent is a thing on top of Tangle. Tangle is the third generation, I'd claim, of systems for running any data processing, with a bit of a skew toward ML experiments: any data processing task where you need to iterate, share, and where you have scale, so you want maximum efficiency. You know how normally you would work? Imagine you're a data scientist or an ML practitioner. You would get Jupyter notebooks, or maybe your Python scripts, and you would manage the data, and you'd produce those TSV files and put them in some HDFS or something. Then you would notice the data has those weird missing values, so you go and write another script that replaces them with dashes. Then you realize, oh, I need to filter bots, so you run some LightGBM model that removes the bots. And then you kind of get it into shape, and then you start experimenting, and you run multiple experiments, and then you're like, oh my god, this experiment is worse, you undo, and you cannot get back to the previous result. What did I do? And then you finally get everything working, and you start throwing it over the fence to production. So you replicate it, those things don't work, and sometimes you don't notice that some feature naming is off and the features don't match. But then, imagine you did everything, and six months later you have to repeat it, because now there's more data, or you want to do another pass. And you're like, what did I do? This script crashes now, or the path has changed, and you spend another month just doing digital archaeology on your own history, right? Now multiply that by many, many things. Now imagine you get an intern that you want to ramp up. You have to show that intern: oh look, there's the folder, there are the scripts, ask your Claude agent to figure it out. And then the agent does something, and you're like, oh yeah, right, it was the wrong folder, I forgot to tell you, I actually had this other thing, I forgot myself. And that's the daily life; we all know it if you're a data scientist, a machine learning practitioner, or any data-managing person.

Speaker 1:
[21:00] Yeah, so I used to do this on the quant finance side at my hedge fund. We did this before Airflow, and then obviously Airflow came along, and then more recently Dagster, which in my mind is what I would use for that shape of problem, where you have to materialize assets and create a pipeline.

Speaker 2:
[21:19] And that's a very good segue, because Airflow is great, but Airflow is more about having something you want to run repeatedly in production on a schedule. It's less about you as a team developing things and being able to share; about grabbing the standard pipeline and saying, hey, I want to change this tiny little component in the huge sea of data processing, and I want to run 10 experiments on this, and I want to do hyperparameter optimization. All of that is very hard to do with Airflow. It's very easy to do with Tangle. Tangle is all about a group of people, and it might be agents too, nowadays, running experiments cheaply, collaborating, sharing results. You don't need to understand things fully: you clone somebody else's experiment or pipeline, change a small piece, run it, get it to production state, and then ship in one click. You don't have to port it into any other system to run in production; you can just run the same experiment, and it's fully production-ready. And, as I said, it's a third-generation system. The original one, I would claim, was Aether; at least in my career, Aether was the first that pioneered this type of approach. Then there was Nirvana at Yandex, which did a second take on this. And now this one aggregates the learnings from all of those, and from Airflow as well, to get to a state where, when you try it, it feels kind of magical. Everything is based on content hashes, so even if the version changed, if the output didn't change, nothing is rerun. It's very efficient. If multiple people start experiments that need the same data preprocessing, it's not repeated multiple times; it's automatically done only once. If you start 10 experiments that all require some data preparation as the first step, you don't have to coordinate for that; you don't have to know that other people are starting it. It has very easy composability, any language you want to use, and it's very visual. You can see everything immediately, edit it easily, assemble small things with just mouse clicks if you want to, and share and clone, and everybody knows. Also, it's fully static in the sense that when you rerun it a second time, it will have exactly the same results. You will never have to do digital archaeology; full versioning and everything is also there.
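
A minimal sketch of the content-hash idea described above: a step is keyed by a hash of its code plus its inputs, so identical work, whether rerun by you or first run by another team, is served from cache. Hypothetical names throughout; this is not Tangle's actual implementation.

```python
# Sketch of content-addressed caching for pipeline steps. Inputs are assumed
# to be JSON-serializable references (e.g. hashes of upstream outputs).
import hashlib
import json
import pickle
from pathlib import Path

CACHE = Path("/tmp/pipeline_cache")  # hypothetical shared cache location
CACHE.mkdir(exist_ok=True)

def step_key(code: str, inputs: dict) -> str:
    """Hash the step's code together with its input references."""
    blob = json.dumps({"code": code, "inputs": inputs}, sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()

def run_step(code: str, inputs: dict, fn):
    """Run `fn` only if this exact (code, inputs) pair has never been
    computed before; otherwise return the cached result."""
    path = CACHE / step_key(code, inputs)
    if path.exists():
        return pickle.loads(path.read_bytes())  # someone already ran it
    result = fn(**inputs)
    path.write_bytes(pickle.dumps(result))      # publish for everyone else
    return result
```

Because the key covers code and inputs rather than file paths or versions, two teams who independently launch the same preprocessing hit the same cache entry, which is where the network effect discussed below comes from.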

Speaker 1:
[24:07] So people can, since it's open source, go to the GitHub repo and check it out, and there's also a really good blog post about it. I think all of this is really appealing. The thing that sells me the most is that development-to-production transition, right? I think a lot of people haven't really solved that strictly. We develop really, really well in Python notebooks, but that's obviously not a production-ready process. Any way in which that is solved is very appealing. The other thing you mentioned, which also raised my eyebrows, was content-based caching, which, as you said, is very much an efficiency measure: recalculation happens only when the content-addressed inputs change. Which I think makes sense. It surprised me that the savings could be this much. But maybe I just haven't worked at your scale, where there's so much duplication that people rerun everything because they changed a single ID upstream.

Speaker 2:
[25:11] Yeah, but it's not only that you rerun. The main savings come from the fact that you run it, you get your job done, and you move on. Then somebody else, in some department you didn't know existed, runs the same task but on a newer version.

Speaker 1:
[25:26] Yeah.

Speaker 2:
[25:27] Right now, in most organizations, you can't even find out about that, so you can't even measure that you're spending that time twice. Here, if everybody's on Tangle, it's detected automatically, and it's detected that the output is the same. For that person, all it looks like is that their experiment suddenly jumped forward. There's a network effect of multiple people helping each other.

Speaker 1:
[25:51] Yeah. This is one of those things where it's designed to be a platform from the beginning, rather than an individual developer's tool. Everything streams down from there: the Tangle orchestrator manages the jobs. We've seen a few versions of this, and these are obviously the unique approaches that you guys have figured out. Then there's Tangent.

Speaker 2:
[26:14] Yeah. Tangent is basically an automatic auto research loop that can help with, and do, your work for you. Effectively, Andrej Karpathy recently popularized this with auto research; I remember you said he was speed-running it. You know the story. Here, we're basically bringing the same capability into Tangle, so Tangent is an agent that can run multiple experiments, figure out what can be changed, and keep re-running and modifying until it maximizes some goal, some loss function, whatever you need to achieve. And in general, I would say that if you're not using an auto-research-like approach in whatever you do, literally whatever you do, then you're missing out. We saw it take off at Shopify like wildfire: anything where you can put measurements on it can be done dramatically better. Our speed of templatization, HTML and UX templatization, reducing latency for Liquid themes. Our search recently moved from 800 QPS to 4200 QPS with the same quality, on the same number of machines, just from pure optimizations by an auto research loop that kept running and changing code in our index server, just increasing the throughput. We managed to improve the quality of gisting in our machine learning process; gisting is the prompt compression technique that allows for lower latency and actually slightly higher quality. So literally all different walks of life, and it doesn't have to be AI-related. We had a reduction in storage, because the agents would go and find datasets that are clearly derivative, so you don't need to store things twice. We found, somewhat embarrassingly, that one of the largest tables was hashing random IDs into other random IDs, and we literally needed only one; it was translating between two random IDs.

Speaker 1:
[28:37] It has access to the code as well, so you can check what the hell it's doing.

Speaker 2:
[28:42] It can run at two levels. At the superficial level, it can just use existing components and reshuffle them: you can grab XGBoost, you can grab some PyTorch module, grab some other tools, and combine them. At the deeper level, since Tangle is all CLI-based underneath, every component being a wrapped call plus a YAML file, it can analyze code, create new components, and keep iterating on those as well. So you can have quick modifications of existing pipelines with components that are already there, pre-baked, or you can create new components and keep iterating on them. Auto research is probably the thing I've been most excited about in the last two months; it's taking off totally like wildfire. Every minute, I'd get somebody's Slack message saying, oh, look how much better I made it. It's all through auto research.
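
A rough sketch of the auto research loop as described: an LLM proposes a change, the pipeline runs, the metric is scored, and only improvements are kept. `propose_change` and `run_pipeline` are hypothetical stand-ins, not Tangent's API.

```python
# Hedged sketch of an auto-research (hill-climbing) loop over a pipeline
# definition. `propose_change(best, history)` would call an LLM to edit the
# pipeline; `run_pipeline` executes it; `metric` scores the result.

def auto_research(pipeline: str, metric, propose_change, run_pipeline,
                  budget: int = 400) -> tuple[str, float]:
    """Keep proposing and testing modifications, retaining only improvements."""
    best, best_score = pipeline, metric(run_pipeline(pipeline))
    history = []
    for _ in range(budget):
        candidate = propose_change(best, history)   # LLM suggests an edit
        score = metric(run_pipeline(candidate))     # run the experiment
        history.append((candidate, score))          # context for next ideas
        if score > best_score:                      # keep only improvements
            best, best_score = candidate, score
    return best, best_score
```

Note the shape of the anecdote later in the conversation: hundreds of mostly failed candidates are fine, because the machines, not the researcher, pay for the failures.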

Speaker 1:
[29:53] Is this democratized in some way, in the sense of: is it your ML engineers and researchers doing this, or do your regular PMs and software engineers also have the ability to use Tangent?

Speaker 2:
[30:07] This is an awesome question. Tangle in general and Tangent in particular are extremely democratizing. They are the main tools for-

Speaker 1:
[30:15] Because I don't need the details.

Speaker 2:
[30:16] Exactly. It was initially used by ML and AI engineers, but then, literally as you said, PMs picked it up. The highest user right now is one of our PMs, Sarthak; he was number one by usage because he's just energetic and knowledgeable, and it unlocks a lot of capability where you don't have to change code manually.

Speaker 1:
[30:39] Because it kind of cuts out the ML engineer from the process, because the PMs have the domain knowledge and the ability to think about, from first principles about, okay, what results do I want? And they even have access to the data that needs to go in. So it's like, in some ways, this is the magic black box that we've always wanted for training and for, I guess, hill climbing or whatever.

Speaker 2:
[31:04] It's basically Claude Code for your AI development situation, right? Now you don't have to know exactly how the algorithms work. You can just bring your domain knowledge and expertise and product knowledge, and iterate within Tangent until you get the results that you need.

Speaker 1:
[31:21] In my previous roles, every time that someone has pitched AutoML, I've always been like, this is not going to work. It's always going to be a flop. Somehow it's working now. I mean, presumably the answer is now we have LLMs. It's good enough. It's an emergent property that we can do auto research. But it doesn't feel that satisfying. How come we didn't do this before? We just did parameter search and I don't know, maybe that's it.

Speaker 2:
[31:48] Yeah, Bayesian optimization and hyperparameter optimization were used very actively, and incidentally they're also built into Tangle. But I know Patrice Simard very well, and he was such a proponent of AutoML; he literally spent his career trying to democratize it. Without LLMs, it just turned out to be very hard. You would have flexibility within a certain narrow domain, but it was hard to scale wider. And now, with LLMs, suddenly it's like a magic wand, and suddenly everybody is using AutoML.

Speaker 1:
[32:28] Yeah, I think it's multiple things, right? I'm just going to bring up the chart again. LLMs can do the monitoring very well, which is potentially unbounded and super unstructured. They can do the analysis very well. Basically, it's much more intelligence poured into every single step. Maybe nothing structurally changed about AutoML; it's just more intelligent and more unstructured.

Speaker 2:
[32:53] Exactly.

Speaker 1:
[32:54] Any flaws that you've run into? Everyone is drinking the Kool-Aid: oh my god, time savings, performance improvements. What issues have come up?

Speaker 2:
[33:06] It is really cool, but it's not a solution to all the world's problems, for sure. And this is where we get into a bit of subjective territory: I can only share what I've seen so far, and I'm sure the situation is changing. Maybe after I say it, many people will reach out and say, hey, what about this, you don't know that, and they'll probably be right. But what I've seen is that auto research is very good at doing the kind of obvious things that you don't have the bandwidth to do, or that you didn't notice, or where you're not aware of some standard practice. It is not good at doing something completely out of distribution, something where you have to think for multiple days and do something unlike any of this. I set up an experiment once on my sort of hobby thing, and I let it run for what ended up being several weeks, you know, full production kind of scale, slow runs, and in the end it performed over 400 experiments, and only one was successful. I'm like, okay, that's good, but...

Speaker 1:
[34:18] But it saved time.

Speaker 2:
[34:20] Yeah, I saved time. If I were doing 400 experiments myself, my batting average would have been much higher, I'm sure. But, first of all, it would have taken me like three years to do 400 experiments, and I didn't have to do them; the machines and the price of electricity did that. And I got one improvement. Honestly, when I was starting that experiment, my thinking was to go and show: hey, Andrej, maybe you just don't know how to optimize. And I was super smug, because my problem had been optimized for many years and was fully tuned, and I didn't expect auto research to find anything at all. Yet it did. So instead of making fun of Andrej, I ended up a big, big supporter. Yeah, that's exactly the tweet.

Speaker 1:
[35:10] You and Tobi really go back and forth online a lot, which is really funny. Think of it as an eval for the optimality of the code you're running. It almost reminds me of a Kolmogorov complexity thing: there's some optimum that you're trying to reduce down to, I guess. So you should congratulate yourself that you had 99% optimality.

Speaker 2:
[35:36] Exactly, yeah. I think Andrej really deserves a lot of credit for popularizing this approach. This is incredibly, I think, powerful and cool. And even him just mentioning it led to a lot of gains in a lot of places in the industry. So we should be thankful.

Speaker 1:
[35:56] Yeah, I think he also has, I don't know what it is: it's a simple, self-contained project that people can take and apply to other things, which is one thing, but there's also just the name. Somehow no one else managed to call their thing auto research; naming things is very important. I think that's mostly our coverage of Tangle and Tangent. There's obviously a lot of ML infra at Shopify that people can dive into. We're about to go to SimGym, but before I do that, any other broader comments around this whole effort? Where is it leading?

Speaker 2:
[36:36] As a segue to SimGym: all those things start composing strongly, and you see a huge unlock when you combine them. You can look at each one of the tools and see how they're extremely useful: Tangle is useful by itself, auto research is useful by itself, SimGym is useful by itself. Combine all three, and you create a synergistic effect. I think that's why we wanted to cover them today, because this is something that even five years ago would have been unthinkable. Replicating it would be either incredibly costly or impossible; it would probably require thousands of people.

Speaker 1:
[37:20] Well, we have serverless intelligence. You do have thousands of intelligences, just not humans. That's close enough. Even if they're not AGI, they're close enough to do the tasks you need them to do. That's plenty for a lot of routine knowledge work. Okay, let's get into SimGym. This is one of those things I was surprised by; it's apparently one of your most popular launches. I think SimAI, and Joon Sung Park, who did the Smallville thing; there's a very small cottage industry of people trying to do the simulated-customer thing. I think a lot of people maybe don't super trust this yet, because they're like, well, obviously, the agents would just do what you prompt them to do. But maybe just tell us about the inspiration or origin story.

Speaker 2:
[38:10] That's exactly the thing I wanted to cover, because if you don't have the historical data, all you can do is prompt the agents in a vacuum, and they will do exactly what you prompt them to do. In fact, when I first proposed it, and this is a bit of my brainchild initially, if I can boast, even Tobi said, but wouldn't they just repeat what you tell them? And I'm like, yes, except Shopify has decades of history of how people made changes and what those changes resulted in, in terms of sales. It's noisy data: these are usually small websites, things are never run in isolation, and it's almost never a real A/B experiment. It's always sequential: at different times, you run two different things. But if you aggregate everything together and apply denoising and a collaborative-filtering-like approach, you can extract a very clear signal. Then you can optimize your agents. That's why it took so long: almost a year of that optimization, of just us sitting and fiddling. We had internal goals on correlation; the internal goal was to hit 0.7 correlation with add-to-cart events, for example, so that if we ran a real A/B test, the simulation would replicate the same sort of success that humans had, or the lack thereof. And it took forever. And I don't think it's easily replicable, because who else would have that data? You have to have those decades' worth of history. And the other thing you need is infrastructure and scale, right? Because, again, to get statistically solid results, you need to run a lot of simulations, a lot of agents, and those are expensive things. You're taking actions in the browser, because you want real friction. You want to get the image of what humans will see, because you want to detect effects like: hey, if I make my images larger, will I have more sales or fewer sales? And usually people's intuition here, by the way, is that if I make my images bigger, I'll sell more, because they look nicer; designers all like sparse layouts and big images. Usually your sales tank, right? But in the HTML, all the characters look the same; only the size tag differs. So it's very hard. You have to take visual information, you have to run this in a simulated browser environment on a big farm, and of course you have to have a very expensive, good, multimodal model. All of this is what has taken so long. And to share a personal fail a little bit: we always had this large-company bias. Whenever we did something, we were like, hey, we will run an experiment: we make a change, we run an experiment, we see which one's better. Most of them are worse, so you discard and keep iterating, hill climbing. And we were like, oh, smaller merchants cannot get statistically significant results; they cannot really run experiments, simply because in a week there would not be enough data for them. What we didn't realize is that most people don't have an A and a B. They just have one thing, and they need suggestions for what A and B should be. So we first built this as: we run simulations on two separate themes and say, hey, which one is better?
We then morphed it, and very recently just released it: when you have just your site, your theme, we run over it, and we say, hey, here are the predicted conversion values, and here's how we think you should modify it to increase your conversions. Then, circling back to what you started with, the proof is in the pudding. If we were not correlating with reality, people would not be using it. Thankfully, we see literally every day more users than the previous day. So right now, my problem is how to pay for it all; our major focus is how to optimize the LLMs, do distillation, and run the headless and headful browsers cheaper, so that we can accommodate the increase in traffic.
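
The validation Mikhail describes, hitting roughly 0.7 correlation with add-to-cart events, can be pictured like this. A hedged sketch: `simulate_add_to_cart` stands in for running simulated agent traffic against a storefront variant, and each experiment record carries the real measured lift from a historical A/B-style test.

```python
# Sketch of correlating simulated outcomes with real experiment outcomes.
# All names are illustrative; this is not Shopify's evaluation harness.
import numpy as np

def sim_real_correlation(experiments, simulate_add_to_cart) -> float:
    """Each experiment has two storefront variants plus the real measured
    lift. Return the Pearson correlation of simulated vs. observed lifts."""
    sim_lifts, real_lifts = [], []
    for exp in experiments:
        sim_a = simulate_add_to_cart(exp["variant_a"])  # simulated rate, A
        sim_b = simulate_add_to_cart(exp["variant_b"])  # simulated rate, B
        sim_lifts.append(sim_b - sim_a)
        real_lifts.append(exp["real_lift"])             # ground truth
    return float(np.corrcoef(sim_lifts, real_lifts)[0, 1])
```

The point of the metric is exactly the one made above: without historical outcomes to correlate against, a simulated customer just echoes its prompt.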

Speaker 1:
[42:47] Yeah, I understand you published a lot of technical detail at GTC, so I was just going to bring some of it up. Was this in conjunction with some kind of GTC presentation, something like that?

Speaker 2:
[42:59] Well, yeah, we did it in several places, but we had an engineering blog as well.

Speaker 1:
[43:05] Yeah, so you're running GPT-OSS.

Speaker 2:
[43:08] This is an older version. Now we run multi-modal model, but we still run GPT-OSS as well.

Speaker 1:
[43:15] And then you have the VMs, and you also have Browserbase. I really like this line where you said it violates almost every assumption that standard LLM serving is designed for, and then you had basically orders-of-magnitude differences in everything.

Speaker 2:
[43:29] Exactly, which was a bit of a challenge to implement. One simple thing: since it violates all the assumptions, multi-instance GPUs, MIGs, don't work as well. But we needed to get MIG to work, because otherwise it's way too expensive. So we had to deal with lots of infrastructure and work with Fireworks and CentML to help with optimizations, and Browserbase, as you mentioned. Yeah, it takes a village.

Speaker 1:
[44:04] Okay. So there's a lot of experimentation in the infrastructure so far, and you've published more or less what you have here. I guess I'm less familiar with CentML; I don't do that much work in this part of the stack. But why was it the preferred inference platform?

Speaker 2:
[44:22] There were really three top companies, at least that I was aware of, that did LLM optimization: Together, Fireworks, and CentML, not necessarily in that order. CentML recently got acquired by NVIDIA. What they do is, if you have a model and you want to optimize it for a specific usage profile, they go and do it. We work with those companies; this was work particularly with CentML and NVIDIA to get the best possible results out of it. Sometimes you have to retune depending on what you need: sometimes maximum throughput, sometimes minimal latency, sometimes the cheapest, or some combination. These are people who come and help you with that.

Speaker 1:
[45:14] I see. I'm familiar with these people on the LLM autoregressive stack. But the other interesting category of these optimizers is the diffusion people, where FAL and Pruna have recently come up a lot as well, which I think is really underappreciated, at least by me, because I thought all the workload would be LLMs, but actually there's a lot of diffusion as well.

Speaker 2:
[45:38] Exactly.

Speaker 1:
[45:38] There's a lot here, so it's hard to cover it all, but I do think people underappreciate the importance of customer simulation, basically. This is something I'm candidly still coming to terms with. Your team also prepared this really nice diagram. I assume this is AI-generated.

Speaker 2:
[46:00] Yeah.

Speaker 1:
[46:01] Maybe it's not.

Speaker 2:
[46:01] It looks Gemini-ish, but I don't know how they generated it. It looks like it's Google. But the interesting part that we haven't covered, and that I wanted to mention, is that if your store has had previous customers, rather than being a new store where you're a new merchant just launching things, it helps tremendously with correlation and forecasting. We take your previous customers' behavior, and we create agents that replicate the specific distribution of customers that you get, and then we apply those to your changes, and that raises the correlation with add-to-cart events, or with conversion, or whatever it may be, quite dramatically. So replicating humans in general seems like an interesting, cool challenge.

Speaker 1:
[46:58] As a shareholder, I think if people are Shopify shareholders, they should really deeply understand this because this is basically the moat. The more you use Shopify, the more it will just automatically improve. You're doing the job for them.

Speaker 2:
[47:13] Yeah, that's what we started with. Otherwise, if you're just a startup, I wouldn't do it if it were my startup, because without the data, it's exactly as you said: whatever you say in the prompt is what the agents will be doing.

Speaker 1:
[47:30] The statistician in me wants to satisfy the statistical intuition, I guess. To me, the word that comes to mind is ergodicity. Let's say one customer takes this path, another customer takes this path, another takes this path. In my mind, the way I'd explain it is: okay, here's the 95th percentile, here's the 5th percentile, and here's the median. But what SimGym is potentially doing is modeling the in-between journeys as well, which may be dependent on the previous states. This may be a very RL-type conclusion: with naive A/B testing, you only have summary statistics at a certain point, and you only judge based on those overall summary statistics. But here you can actually model trajectories. Does that make sense?

Speaker 2:
[48:31] That makes total sense. Actually, it makes even more sense than maybe you realize.

Speaker 1:
[48:38] Okay, please, please.

Speaker 2:
[48:40] Internally, we have this system; we talked about it briefly once at NeurIPS. We have a huge HSTU-based system that models whole companies and their possible paths. At any point in time, you can either model the user's behavior, or you can think about the whole merchant as a company, as an entity that acts in the world, and model that as well. Then you can do counterfactuals. In your graph, in your blue graph, imagine that somewhere in the middle you have an intervention: I give that person a coupon, or I send a personal thank-you card, or give a discount somewhere. Then you can do forward rollouts from that counterfactual: what would have happened with that intervention, or without it? You can even change where in time that intervention happens, somewhere in this journey. We do this at Shopify scale for our merchants. Then, if we notice something they could fix, like a strong counterfactual, we have Shopify Pulse; they basically get a notification like, hey, we think something is wrong with your Canadian sales, it looks misconfigured, here's what you need to do. Or: we think you should set up this campaign with these parameters. And we do that at the buyer level too, to literally offer discounts or cash back or things to buyers. So, I'm getting very excited. This is my sphere of interest and hobby: being able to model something as complex as human beings or companies, and model counterfactuals on it, where you can have interventions in the future and optimize when to intervene and what kind of intervention to make. It's such an unlock that was previously completely impossible. It was always dreamed of, but how would you even simulate it without LLMs or HSTUs? Very, very exciting times.
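
A schematic of the counterfactual rollouts described above: roll a trajectory model forward many times with and without an intervention at a chosen step, and compare average outcomes. `trajectory_model` and `intervention` are hypothetical stand-ins; this is not the actual HSTU-based system.

```python
# Sketch of intervention vs. baseline forward rollouts over a learned
# trajectory model. State is assumed to be a dict with an "outcome" field.
import copy

def counterfactual_lift(trajectory_model, state, intervention,
                        t_intervene: int, horizon: int,
                        n_rollouts: int = 100) -> float:
    """Average outcome difference between intervened and baseline futures."""
    def rollout(apply_intervention: bool) -> float:
        s = copy.deepcopy(state)
        for t in range(horizon):
            if apply_intervention and t == t_intervene:
                s = intervention(s)          # e.g. issue a coupon at step t
            s = trajectory_model.step(s)     # sample the next state
        return s["outcome"]                  # e.g. total sales over horizon

    with_iv = sum(rollout(True) for _ in range(n_rollouts)) / n_rollouts
    without = sum(rollout(False) for _ in range(n_rollouts)) / n_rollouts
    return with_iv - without
```

Sweeping `t_intervene` is what lets you optimize not just which intervention to make, but when in the journey to make it.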

Speaker 1:
[50:59] I just wanted to illustrate this. I'm not the best illustrator, but I am the conceptual statistics guy. You cannot just do this with an A/B test; this is a dimensionality an A/B test doesn't have, because it doesn't capture the stochastic change over time, and it doesn't carry all the context up to this point. Okay, cool. That's SimGym. You're going to burn a lot of tokens on this thing. But you're one of the only platforms in the world at the scale to do this across a huge variety of workloads, right? I'm even curious at the human-research level: does retail behave differently from clothing sales? Does that behave differently from electronics sales? I don't know. Do the Kardashian shoppers differ from people who buy, I don't know, cars and whatever?

Speaker 2:
[51:56] Very different: different sensitivities, different modes of shopping, and different levels of what's important. And yes, you can do aggregations at the store level, and you can do aggregations at different category levels. For the statisticians among us: I couldn't believe it, but recently we were looking at this and we had to bring back CRPs, the Chinese restaurant process. It's a way of aggregating and naturally growing clustering, specifically to answer the questions you were just posing about how buyers behave in different categories. And I'm like, I haven't seen CRPs since 2001.

Speaker 1:
[52:38] What is this? No, I haven't seen this. No, this is not in my training.

Speaker 2:
[52:44] Yeah, it was actually very popular in NeurIPS and ICML circles in the early 2000s, kind of nice. And now it has practical applications, so we're resurrecting it.
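
For readers who haven't seen one since 2001: the Chinese restaurant process in a few lines. Each new buyer joins an existing behavior cluster with probability proportional to its size, or opens a new cluster with probability proportional to alpha, so the number of clusters grows naturally with the data. A toy illustration only, not Shopify's model.

```python
# Toy Chinese restaurant process: a nonparametric prior whose cluster count
# grows with the data instead of being fixed in advance.
import random

def crp_assign(n_buyers: int, alpha: float = 1.0) -> list[int]:
    """Return a cluster id for each buyer in arrival order."""
    counts: list[int] = []          # counts[k] = buyers already in cluster k
    assignments = []
    for _ in range(n_buyers):
        weights = counts + [alpha]  # existing "tables", plus a brand-new one
        k = random.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)        # buyer opens a new cluster
        else:
            counts[k] += 1          # buyer joins an existing cluster
        assignments.append(k)
    return assignments
```

The rich-get-richer dynamics match the category question here: popular shopping modes attract more buyers, while genuinely new behaviors can still spawn new clusters.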

Speaker 1:
[53:03] Yeah, amazing. I can see how this is like a fun job for you where you get to apply all these things. Yeah, super cool. So anyone who knows what CRPs are and has always wanted to use them at work, they should definitely join Shopify. Okay, so we have a lot, but I'm being mindful of the time. I do want to cover some other things. I'll give you a choice, UCP or Liquid?

Speaker 2:
[53:30] Liquid. On UCP: UCP is very important for us, and it just makes things easier. We have structured discussions on it, and you can read about them; we have a blog post, and in fact we have a big release this week with our catalog.

Speaker 1:
[53:46] Okay, I mean, we can discuss the release briefly because we will release this after it's already announced. So whatever, there's a catalog that you guys are doing?

Speaker 2:
[53:55] Yeah, so we're bringing in the capabilities of the whole Shopify catalog. Basically, you can now search for products, you can do lookups by a specific ID, and you can do bulk lookups when you need to bring in multiple products. You don't need to know in advance what you're trying to show, sell, or check out; you can now have this decided at runtime. And this is a big area of investment for us, for both non-personalized and personalized search: trying to provide basically a window into the whole universe of products being sold everywhere in the world. Shopify is, not exactly, but almost a superset of anything being sold. Now we're bringing that into UCP. And identity linking is another big thing for us, so that you can use Google or whatever identity you have, minimizing friction. Yeah, big release for us. But Liquid AI, of course, we never talk about, and it might be more aligned with what we discussed previously in this chat.
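
As a purely hypothetical sketch of the three access patterns just described (search, lookup by ID, bulk lookup decided at runtime): the base URL, paths, parameters, and field names below are all invented for illustration and are not the actual UCP interface; consult the real UCP documentation for that.

```python
import requests

BASE = "https://ucp.example.com/catalog"  # placeholder, not a real endpoint

def search_products(query, shopper_id=None):
    """Full-text catalog search, optionally personalized (hypothetical param)."""
    params = {"q": query}
    if shopper_id:
        params["shopper_id"] = shopper_id
    return requests.get(f"{BASE}/search", params=params, timeout=5).json()

def lookup_product(product_id):
    """Fetch one product by a specific ID."""
    return requests.get(f"{BASE}/products/{product_id}", timeout=5).json()

def bulk_lookup(product_ids):
    """One round trip for many products, so an agent can decide at runtime
    what to show, sell, or check out."""
    return requests.post(
        f"{BASE}/products/batch", json={"ids": product_ids}, timeout=5
    ).json()
```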

Speaker 1:
[55:07] Sure. The main thing that everyone understands about Liquid is that it was inspired by the worm. And I still don't know the details; I'm curious about your explanation. I think you can make things very approachable. And also: what is the potential level of efficiency that you get out of Liquid?

Speaker 2:
[55:23] You'll all be familiar with transformer architectures. For the longest time, there was a competing architecture called the state space model, SSMs; Chris Ré is one of the pioneers, and there have been lots of startups trying to make them a reality. They have significant benefits: much faster, lower footprint, and not quadratic in context length, sort of linear in your context length. But state space models never quite made it. They have certain issues; there are regimes where they thrive, and their hybrid architectures are useful, but they never quite made it. Liquid neural networks you can think of as the next step, sort of state space models squared. It's a non-transformer architecture that's more complicated than state space and really difficult to code, to be honest. But it's very efficient, sub-quadratic in the length of your context, and a very compact way to represent things. And that's the Liquid AI company; their goal is to productize it. Very often you need long context and a small model, and you want low latency. In general it's basically on par with transformers, and if you do hybrids with transformers, it's even better. That's why we at Shopify, and we constantly try multiple models from multiple companies, found that for small models, particularly low-latency applications, and/or when you need longer context lengths, Liquid was the best. We still use the whole zoo, and we obviously always test and use every open source model, and it feels like sometimes even every private model. But Liquid has been taking quite a bit of, at least, internal Shopify share. And the reason I'm excited is that it's the only non-transformer architecture I've found to be genuinely competitive. We use it for search, for long-context distillation, and for other things. That's the overview. I don't know how approachable that was; maybe still too obtuse.
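
As a rough intuition for the scaling claim: the sketch below contrasts a linear state-space recurrence, where a fixed-size state is updated once per token (linear in sequence length), with full attention, which builds an L-by-L score matrix (quadratic in length). Both are toy, single-channel versions; the attention here skips causal masking for brevity, and none of this is Liquid's actual architecture.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Linear SSM recurrence: h_t = A h_{t-1} + B x_t, y_t = C h_t.
    One fixed-cost state update per token, so O(L) in sequence length."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:
        h = A @ h + B * x_t
        ys.append(C @ h)
    return np.array(ys)

def full_attention(x):
    """Single-head attention over a scalar sequence. The L x L score
    matrix is the quadratic cost; causal masking omitted for brevity."""
    q = k = v = x[:, None]
    scores = q @ k.T
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return (weights @ v).ravel()

L, d = 1024, 16
x = np.random.randn(L)
A = np.eye(d) * 0.9          # toy, stable state transition
B = np.random.randn(d) * 0.1
C = np.random.randn(d)
print(ssm_scan(x, A, B, C).shape, full_attention(x).shape)
```

Double the context and the scan does twice the work, while attention does four times; that gap is the whole appeal of SSM-style and liquid-style architectures for long-context, low-latency serving.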

Speaker 1:
[57:51] I mean, I think they haven't been that open about their implementation details. If there's a lot of technical detail published, I haven't read a formal paper on the implementation. But I did get the relationship between the SSMs and the others. This is one of the charts showing the relationship between full attention and something more RNN-like in terms of efficiency. And the other chart was this old one comparing against some of the other models; it doesn't exactly have the correct y-axis, but it's close enough that you can see it's basically a step-change difference in efficiency. The surprise to me was that you're actively using it already, internally, inside Shopify. So I'm curious: what are the constraints you're optimizing for? When you say smaller, is it like the 1B size? What kind of latency constraint are you optimizing for? What kind of context-length considerations? For example, in audio use cases, SSMs effectively have unbounded context length because they just operate on a sliding window of the most recent stuff. I'm just curious what you see as the potential here.

Speaker 2:
[59:13] Yeah, the SSMs, because the state embeds all the previous information needed, or that's the assumption, effectively have infinite context length. The problem with them is that the expressiveness is not there. Liquid networks are effectively souped-up SSMs: much more expressive, and again, more complicated to code. There is a paper on it you can read: a differential equation rolled out and then computed, really, as a convolution. It's a bit involved. Where we use it is either where we need super low latency, and there was a very fun project there with CentML and Liquid AI themselves: we run a tiny model, 300 million parameters, at 30 milliseconds end-to-end for search. When you type a query, we produce all the possible things you could mean by that query, not only synonyms but full query understanding, the whole tree of what you might need, including personalization because you might have done previous queries, and push it all down into the search server. The latency requirements there are obviously very strict. We're able to run it under 30 milliseconds because, you know, Qwen doesn't run like this. And even with Liquid, we had to work a lot with NVIDIA, because almost nothing in CUDA or the current stack is designed for low latency: small things that don't matter with large models start mattering a lot, and we had to optimize. Then there is the other end of the spectrum, maximum throughput, for things like offline categorization when a new product appears. We need to do analysis, assign where it sits in the taxonomy, extract and normalize attributes, and do clustering, like: oh, it's the same thing that other merchant is selling, right? That's an almost unbounded amount of energy you need to spend, because it's a quadratic kind of problem and we have billions and billions of products. So you don't care about latency as much, it's kind of an overnight batch job, but you want maximum throughput. And in those cases, like for Sidekick Pulse, you sometimes also need long context. There we're talking models in maybe the seven, eight billion parameter range, where we take a large model, the largest we can find, and distill it into Liquid for a specific task, for example our catalog formulation, or Pulse. Then we run it at very large scale in batch jobs, and in that situation it very often beats Qwen; Qwen is more on the reasoning side, and I would say Qwen is probably its major alternative. That's when we use it. It's not a panacea; I wouldn't say it's a frontier model in the sense that it's going to suddenly compete with GPT-54. But it is a phenomenal target for distillation, which is becoming more and more important with the explosion of token usage.
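
A minimal sketch of the task-specific distillation pattern described here, assuming a generic setup: use the largest teacher model available to label a narrow task (say, product categorization) offline, then fine-tune a small, cheap student on those labels. `teacher_generate` and `finetune_student` are placeholders for whatever inference API and training stack you actually use; the prompt is invented for illustration.

```python
def build_distillation_set(products, teacher_generate, prompt_template):
    """Label raw products with the expensive teacher, once, offline.
    These teacher outputs become the student's training targets."""
    dataset = []
    for product in products:
        prompt = prompt_template.format(product=product)
        target = teacher_generate(prompt)  # costly call, amortized offline
        dataset.append({"input": prompt, "target": target})
    return dataset

def distill(products, teacher_generate, finetune_student, prompt_template):
    """Produce a small task-specific student from teacher labels.
    The student (e.g. a ~7B non-transformer model) is then cheap enough
    to run as overnight batch jobs over billions of products."""
    dataset = build_distillation_set(products, teacher_generate, prompt_template)
    return finetune_student(dataset)

# Hypothetical prompt for the categorization task discussed above.
PROMPT = "Assign a taxonomy node and normalized attributes for: {product}"
```

The economics follow from the split: the teacher's cost is paid once per training example, while the student's much lower per-token cost is paid billions of times in production.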

Speaker 1:
[63:00] Is that a for-now thing, or do you think that if you gave Liquid $100 billion it would stay competitive? Is it just more scale, or what is limiting it? What prevents it from running into the same issues that SSMs had?

Speaker 2:
[63:14] Their scale is already much larger than the largest SSM I'm aware of. So, yeah, SSMs were just not expressive enough, in my opinion. Again, I'm sure I'll get a lot of pushback on this, but in my opinion SSMs are not expressive enough and Liquid models are. Especially in their hybrid form, combined with the transformer, Mamba-fashion, they're probably the best architecture I'm aware of, period. But of course, Liquid AI is not at the scale of Anthropic or Google or OpenAI in terms of compute. If they had a similar level of compute, I think they would be very competitive, maybe even with the largest models, at least from what I've seen. They don't have that level of investment, but they still have decent investment. And for this scenario of smaller models and distillation, they're second to none, or very close. We are very omnivorous and purely merit-based, so the moment they stop being competitive, we will switch to something else, and we constantly test. But so far, if you look at the progression, if I draw a graph of our workloads on Liquid versus our workloads on, say, Qwen, which is another awesome model and probably another standard within Shopify, Liquid has definitely been taking share.

Speaker 1:
[64:48] I think that's very promising, and probably the best explanation I've heard directly from someone involved with Liquid. I do have Maxime Labonne coming to my conference in London this week, so we'll hear more from him. There was this Liquid investor day or something a year or a year and a half ago, and I think there just wasn't that much technical detail speaking to my crowd of potential customers and users, which is fine; maybe we still need to wait for more results to come out. But I think it would be news to a lot of people that you are actively using it already for high-frequency use cases. I also wanted to highlight Sidekick Pulse, which we didn't cover and probably don't have time to cover, but it's something you also launched recently. It's basically recsys, and the other recsys trend I've been covering a lot, from YouTube to even xAI's recsys, has been LLM-based recsys, right? You are effectively using Liquid models for it, while they're just throwing transformers at the problem. Maybe this is the hybrid-architecture shift that will happen in order to accommodate the long context and high efficiency that you need. I don't really have a strong opinion there, apart from highlighting that the work the LLM-based recsys community is doing is also very interesting.

Speaker 2:
[66:22] Yeah. Again, the thing to get excited about is that it's not just LLMs looking at things; it's also the HSTU model doing that counterfactual analysis, where we model the whole enterprise as an entity and its actions, and then see what will happen.

Speaker 1:
[66:39] Overall, I think this all presents an enormous opportunity. There was not that deep of an AI story to Shopify when it started; it was just a WordPress plugin. But now you are the storefront, the e-commerce guardian, for so many people, and you're really applying all the AI methods, the state-of-the-art stuff. Our conversation today has really opened my eyes a lot. Thank you for doing this; this is a really amazing overview of what you're doing.

Speaker 2:
[67:15] Thank you for saying that, Shawn. Thank you for having me. Of course, it's always a pleasure to talk to people who deeply, technically know what they're talking about.

Speaker 1:
[67:25] Yeah. I mean, very few people are as technical as you. But at least I can somewhat vaguely follow along.

Speaker 2:
[67:32] Yeah.

Speaker 1:
[67:32] So, okay, here's a hiring call. Any particular roles that you're looking for, where you'd say: okay, if you know how to solve this problem, reach out?

Speaker 2:
[67:45] Yeah. The things I would definitely call out: if you're an ML person, or if you're a data science person, we have a huge need for more people working with data, so to speak. Or, surprisingly, if you're a distributed-database person: we think there is a way to use LLMs to reimagine how we do distributed databases, and we're working a lot with Yugabyte there. So if you have interest in those areas, Shopify might be the best place in the world for you. It's a pretty good place for other disciplines as well.

Speaker 1:
[68:24] Cool. I think that was all the questions I had. I have one bonus thing, if you want to indulge in some Bing history. What are your takeaways, or any fun anecdotes, about Sydney?

Speaker 2:
[68:38] Any fun anecdotes about Sydney?

Speaker 1:
[68:41] Yeah. It was very interesting. I think it woke people up to this personality that emerged.

Speaker 2:
[68:48] The funny thing, the most interesting anecdote, is that Sydney was first shipped in India, and it was not noticed for a long time. The first implementation of Sydney didn't even have an OpenAI model under it; it was Megatron-Turing, the Microsoft and NVIDIA collaboration model. Yeah, exactly, that's the one. People thought it was a prank, because not many people were familiar with LLMs at that point, and they didn't believe it could be automatic; they assumed there must be people behind it. And then they were even complaining that, oh, this chatbot is gaslighting me. What almost everybody doesn't fully realize is that it wasn't by accident that Sydney was Sydney. We spent a lot, a lot of effort on personality shaping. It was a bit of my Yandex legacy: previously we did the Alice digital assistant, where we learned the importance of personality shaping. So here we brought in a lot of personality shaping; it was not a fully emergent scenario. It was also a little bit edgy. What we learned in those experiments is that you want to be polite, but you want to be a little bit on edge, and that draws people in. Ever since those days, I haven't seen anybody trying exactly that mode. I think we will see more of it at some point. But yeah, lots of good memories, you know. And by the way, the very first Sydney dev lead, Andrew McNamara, is now working at Shopify as the head of Sidekick and Pulse. A lot of this is actually in his purview.

Speaker 1:
[70:53] Oh, okay. That's another fun fact: you're assembling the team again. Yeah, it's cool. I think a lot of people woke up to the idea of AI personality for the first time there. And now, with maybe OpenClaw explicitly prompting a fun personality, I think that's a real selling point for people, right? And I guess maybe the only other time it really entered public consciousness is Golden Gate Claude. But yeah, hopefully someday we'll get Shopify Sydney.

Speaker 2:
[71:23] Well, we have Sidekick. It's a different thing.

Speaker 1:
[71:28] Yeah, Sidekick was your original big AI launch. Yeah, cool. Amazing. Thank you so much. You guys do amazing work. Honestly, if I were a Shopify customer or a Shopify investor, hearing all the work you're doing on the technical side would make me feel more confident: okay, just choose Shopify, right? You're never going to do this in-house, which is obviously what you want. But yeah, that's what an ideal platform is: doing all the things that no individual could do at their scale, but you can at yours. Very exciting problems.

Speaker 2:
[72:02] Exactly. It's creating a network effect, and it's hard to disagree. If you're not using Shopify, you should be.

Speaker 1:
[72:09] Yeah, amazing. Okay. Well, that's it. Thank you so much.