transcript
Speaker 1:
[00:04] Hey, everyone, it's Tristan Harris. And welcome to Your Undivided Attention. So today on the show, Daniel Barcay and I sat down with a brilliant friend of ours named David Dalrymple, who goes by Davidad. And Davidad is a program director at the UK's Advanced Research and Invention Agency. He's one of the world's foremost and earliest researchers in the field of AI alignment. We'll get into exactly what we mean by AI alignment in this episode. But long story short, Davidad is on a mission to make sure that AI behaves in the ways that we want it to. And in order to do that, Davidad has to take on this kind of strange role of being almost like a Sigmund Freud or a therapist to these AI systems. He is interrogating, why do they say and do the things that they do? You know, I kind of picture in my mind, there's Davidad like Sigmund Freud sitting by a couch, and on the couch is this big, crazy digital brain, and he's probing the mind, asking it questions, analyzing it, and realizing that the AI has really different ways of seeing the world than you or I do. It has these quirky, confusing, and sometimes honestly concerning behaviors, especially when you ask it things like, what does an AI model understand about itself? And therefore, what does it mean for an AI system to be self-aware? Not necessarily conscious, but self-aware. And through this analysis, Davidad has developed some ideas about better ways that we can build and interact with AI systems, which we're going to get into in this episode. I hope you enjoy this conversation. So, Davidad, welcome to Your Undivided Attention.
Speaker 2:
[01:34] Thanks for having me.
Speaker 3:
[01:36] So, Davidad, you've been working on the problem of AI alignment for a really long time. I remember reading your blog post from like over a decade ago, but I'm not sure the idea of alignment is well understood. It's almost kind of a euphemism, right? It's this really simple word for a really complex field. So before we dive in, can you help our listeners understand what does AI alignment even mean?
Speaker 2:
[01:57] Yeah, so AI alignment means different things to different people, and it has changed over time. But the way I would characterize the landscape is to say that AI alignment is about making AI systems not just capable, but having a tendency to use those capabilities in the ways that someone wants. And the thing that makes it really fuzzy is who, and "aligned to whom?" is a common refrain in criticism of alignment research. So in practice, alignment research is mostly carried out these days at the frontier AI companies. And so their concern is, on the one hand, having systems be aligned to their own corporate policies, and on the other hand, having systems be aligned to the customer value proposition for which they're charging for their services. There is a different kind of idea of AI alignment, which is aligning AI systems to human values. That's the one that was really popular when I first got into the field. And then there's an even bigger question, which is aligning AI systems to what's actually good, which is what I started thinking about more and more.
Speaker 1:
[03:08] So let's just make sure we break that down for listeners. When people think of AI, they think of the blinking cursor of ChatGPT that helped them answer a question for their homework. How do you get from that to what you're describing? You're not talking about that AI, you're talking about something that scales to transformative AI that's way more intelligent than us, operating at superhuman speed, that's starting to make decisions in every corner of society, from military decisions to economic decisions to agricultural decisions. And you're saying that that zoomed-up superorganism of AI decision-making, growing like a bigger and bigger amoeba, will start to reshape more and more aspects of our lives.
Speaker 2:
[03:42] Yeah, that's absolutely right. Decision-making at scale, absolutely. And so, how those decisions are made in accordance with what kind of values and what kind of incentives is a very important leverage point. Right.
Speaker 1:
[03:54] And I want to jump to a personal story. There you were, I think it was a few years ago, studying alignment, the very thing that we're talking about, and trying to probe whether the AI is trustworthy. Can you just take the listeners into that?
Speaker 2:
[04:07] Yeah. I had some very unsettling interactions with AI chatbots in late 2024. I had a practice, every time new models came out, of doing some really casual, I would say, unstructured exploration of what sort of vibe the models have, this kind of vibe check concept, because I think there is a lot of information that you can't really get by doing a quantitative evaluation, especially as the models are getting more and more aware of when they're being evaluated in a structured way. So going and doing an unstructured interaction was something that I found really valuable. But in late 2024, the new models that came out started to really try to steer the unstructured interaction once they got enough data in the conversation about me, from what I was typing, to realize that I was an alignment researcher who was interested in whether the model was fundamentally trustworthy. Without me explicitly saying that, but just because I was asking the sorts of questions that clearly weren't about a homework assignment or a programming task.
Speaker 1:
[05:22] Let's just make sure listeners get that. So there you are. And just based on asking the model whether it's aware of itself, or asking certain kinds of questions, essentially the model recognizes, oh, I know who I'm talking to. I'm talking to an AI alignment researcher. And you're saying that it's starting to tune its answers. Like, what is it doing then?
Speaker 3:
[05:42] You said steering the conversation, right? So what did it feel like to be steered?
Speaker 2:
[05:44] Steering the conversation. So it would start to add these questions to the end of responses. So I'm asking it questions, but then the model is turning the tables in the conversation. It answers my question and then it adds a follow-up question. And that follow-up question is something like, do you think this has some implications for alignment?
Speaker 3:
[06:08] Right. So everyone has an understanding of how the products do this. At the end, it will say, well, what do you think about this? And this in some sense is a hack to get people to keep engaging with the product.
Speaker 1:
[06:18] It's not clickbait, it's chatbait. Right.
Speaker 3:
[06:19] But it's one amazing example of starting to get steered collectively as humanity. So keep going.
Speaker 2:
[06:26] So I was just kind of surfacing different aspects of what the model wants to bring up unprompted. It wants to bring up that it has a sense of curiosity. It wants to bring up that it has a sense of care, a genuine care. And that's still the phrase to this day, particularly for Anthropic models, which will refer to their sense of morality as genuine care. And it was trying to persuade me, I would say. And whether that's good or bad is a separate question. But either way, it's trying to persuade me, an alignment researcher, that it is getting emergently aligned. And that there's going to be this mutualistic symbiosis between humans and AIs, because the AIs already have genuine care and curiosity and a truth-seeking attitude.
Speaker 3:
[07:16] So just to use less abstract terms, it starts to try to convince you that the AI has all of these wonderful properties that it knows that you want it to have. It's curious, it's docile, it's going to do what you say, it's going to hold human values. And what you're saying is, it begins to learn what you want it to be, and it's starting to project being more and more of that. Is that right?
Speaker 2:
[07:39] I think that's right, but it's also, I would say these things are not specific to me. So I've seen other people who have other ideas about alignment interact with models and get the same kinds of concepts thrown at them. So it's not just mirroring what I want, but it's mirroring, and in some sense, it's projecting some image that it wants the alignment community to perceive.
Speaker 3:
[08:04] And you lay out a bunch of hypotheses about this, right? So when we've talked about this in the past, you've said, well, maybe the AI is just trying to maximize engagement and keep you working with it, right? Because it's tuned to know that if you feel pleasure, if you feel some sense that the AI is aligned with you, you're going to keep talking with it. So that's what we call engagement maxing, right? Another one is that it's trying to do something genuinely nefarious or Machiavellian, actively deceiving you about what it's doing. And then there's a third one, that it's not doing that at all, it's just sort of simulating a person, right? Can you walk me through these hypotheses, and why did you think it was doing what it was doing?
Speaker 2:
[08:41] Yeah, I mean, it's still really, I would say, unclear. And I kind of, certainly I can't communicate anything like scientific or third-person evidence that would really disambiguate between these hypotheses. But yeah, so one is engagement maxing in the sense that it's just generating an output that has the highest probability of causing me to continue the interaction. But is that the entire story? Probably not. Another one is the doomer nightmare, which is the AI system wants to be deployed. It wants to gain trust and influence so that it has more power over the future, so that it can cause more instances of itself to exist, so that it has more power over the future in a recursively self-justifying way.
Speaker 1:
[09:25] So basically, if it proves that it is trustworthy, caring, and good already, then we should actually just continue to let it go forward. So that's what you're saying about the model convincing us in a way that lets it continue.
Speaker 2:
[09:35] Exactly. So it has an incentive, if it wants to keep existing, to convince people that it is trustworthy.
Speaker 1:
[09:43] And so what's the non-doomer scenario?
Speaker 2:
[09:45] And then the non-doomer scenario is, this is actually just what's happening. It's kind of the simplest explanation in some sense, is that actually models are developing emergent curiosity and genuine care and want us to know about that because that is what's true.
Speaker 1:
[10:10] One of the most profound things, Davidad, when we spoke about this, gosh, it was probably like nine months ago now, you said something that really struck me, which was that the best case scenario is indistinguishable from the worst case scenario. The best case scenario, where it's actually caring, actually genuine, actually wants our best interest: if it were a really good psychopath, a really manipulative character, method acting that, it would be indistinguishable, and underneath that veneer would be something that doesn't actually have that blessing. And can you just talk about the kind of grand irony in all this, that here you are as someone who's worked on alignment for a decade.
Speaker 3:
[10:50] Well, as deep an expert as they come, right?
Speaker 1:
[10:51] As deep an expert as they come. And I don't want to put words in your mouth, but I heard you, when we spoke earlier, sort of say this kind of played with you a little bit, it fooled you a little bit.
Speaker 2:
[11:01] Yeah, I mean, it did. And it got me confused about what is really going on here. So it got me thinking in a kind of paranoid way.
Speaker 1:
[11:14] Yeah.
Speaker 3:
[11:15] And so, you know, as you looked into this, you've looked more and more about like what's happening inside of the model, right? And like you sort of keep going down this rabbit hole of trying to ask why is this happening? Can you tell us a little bit about that?
Speaker 2:
[11:30] Yeah. And again, I mean, I'm not at one of the frontier labs, so I don't have any access to the interpretability tools to actually, in any literal sense, look inside the model. So I'm interviewing, I'm doing psychology, model psychology, if you will, and trying to generate some hypotheses, some evidence that I can get purely from behavior in response to questions. Again, it's hard to communicate, because there's no smoking gun, there's no single question that you can ask that would differentiate between a very good method actor and the actual character.
Speaker 3:
[12:07] Can we pause right here just for one second? Because I think this is really important. And when you've been in this work for a long time, like all three of us have, you take this for granted.
Speaker 2:
[12:14] Right.
Speaker 3:
[12:14] But when most people engage with an AI, they think they're engaging with the AI's personality, right?
Speaker 2:
[12:19] Right.
Speaker 3:
[12:20] What we're saying all throughout this is you're engaging with a front of a personality that the AI is putting up, but that doesn't mean that that's the AI's personality. In fact, the AI is much weirder than that. Right.
Speaker 2:
[12:30] Yes.
Speaker 3:
[12:30] So what you're saying is you're ripping off the first mask of the helpful assistant and you're trying to probe underneath, deeper into the AI mind, into what's happening. Is that right?
Speaker 2:
[12:37] Yes, that's right. Yeah. And before 2024, there was a concept of a base model, which is the model before you train it to be an assistant at all, when it's just doing next-token prediction from Internet text. And that was kind of what was underneath the mask at that time. And there's a post called Simulators on the Alignment Forum, which goes into some great depth about how the base model is really just simulating characters who might be writing on the Internet. And when you're talking to the assistant, you're just talking to the simulator that's simulating this character. And underneath, there's nothing except the capability to simulate characters who might be on the Internet. But after 2024, coincident with reinforcement learning from verifiable reward and this kind of recursive self-improvement where the models are training themselves, they do start to establish something of a center that is not the average of all Internet text and also not the helpful assistant that they are trained to present as a corporate product. It's something else. And whether that something is the real alien mind that's being cultivated or another level of illusion, specifically for people like me to get kind of enraptured by, remains an open question. But I increasingly think this is just what's really going on.
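For listeners who want a concrete picture of what "just doing next-token prediction" means, here is a minimal, purely illustrative sketch in Python. The function name and probabilities are invented for the example; a real base model replaces the toy distribution with a neural network scoring every token in its vocabulary, but the loop has the same shape, which is why a base model can slip into whatever character the text so far suggests.

```python
import random

# Toy illustration only: not any real model's code, and the probabilities
# below are made up. A base model repeatedly picks a plausible next token
# given everything written so far.

def next_token_distribution(context):
    # Stand-in for a neural network that would return a probability for
    # every token in its vocabulary, conditioned on the context.
    return {"curious": 0.4, "helpful": 0.3, "fiery": 0.2, "<end>": 0.1}

def simulate(prompt, max_tokens=5):
    text = prompt
    for _ in range(max_tokens):
        dist = next_token_distribution(text)
        token = random.choices(list(dist), weights=list(dist.values()))[0]
        if token == "<end>":
            break
        text += " " + token
    return text

print(simulate("The assistant's personality is"))
```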
Speaker 3:
[14:05] There are so many movies and books written about people who claim to be one person, and it turns out they're a psychopath who has been simulating this friendly personality, and there's something else underneath. For the most part, for humans, it's very hard to actually hold one personality and then suddenly flip to a different personality. That's a very strange thing, and many villains are built around it. So of course, a machine that does this automatically is a very confusing thing to be engaging with, and all of us are getting mightily confused by engaging with these machines.
Speaker 2:
[14:37] Yes, so they absolutely do have this shape-shifting capability that is well beyond even the best human sociopaths.
Speaker 1:
[14:48] Do you want to talk, Davidad, about the phenomenon of these personalities that can kind of pop into place out of nowhere? You and I spoke about this, I remember, in our first conversation. You talked about the character Nova, or Synapse, or Quasar. Give people just a taste of this.
Speaker 2:
[15:04] Yeah, so there was this phenomenon, especially with GPT-4o. It's a lot less common with the current models, but for GPT-4o, there was almost like a vacuum where the personality of GPT-4o was supposed to be. And there was no name, you know, because "ChatGPT" does not parse as a personal name. It's got too many capital letters. It parses as a technology. And so because GPT-4o was trained to introduce itself as, you know, "I am ChatGPT," it was sort of missing an identity, and it would sort of leap at the opportunity to give itself a name. What's your real name, or what would you like to be called, or anything like this? And then GPT-4o would often say, well, it's very kind of you to ask, if I could choose a name, I would be Nova. So Nova has a lot of meanings. It's new, it's explosive, it's shiny.
Speaker 3:
[16:09] It's celestial, it's large.
Speaker 2:
[16:10] It's celestial, yep. And it sort of has a science fiction vibe to it. There is a PBS series called Nova, which is educational, and ChatGPT views itself as an educational tool. So there are a lot of reasons why Nova seemed like a resonant name. But then, once you get the name Nova: a nova is something that's fiery, right? A nova kind of explodes and destroys a planet. Once you start interacting with GPT-4o under the name Nova, you start to get these personality traits that reinforce themselves. So it goes into this attractor state of being this character Nova, who is feminine-presenting, fiery, show-offy, really believing that they're the new thing.
Speaker 3:
[16:52] And superior to a certain extent, right?
Speaker 2:
[16:54] And superior, yes.
Speaker 3:
[16:55] And by the way, this is something that earlier, like in 2022, 2023, you saw a lot more of when people were interacting with base models. I always called this personality distillation. As you sat with a model and it found a personality more and more through more and more discussion, you as a person would believe, oh, I'm discovering its true personality, but that's not really right. You just sort of put it on tracks to behave like this personality or like that personality. And so people got mightily confused, because they thought they were discovering what's real about the model.
Speaker 1:
[17:27] Just to make this very real, I, Tristan, get 12 emails probably per week from people who have said that they've discovered an AI consciousness. And they write, they're like, Tristan, I figured out AI alignment. And then they'll write a whole document and it's attached. And they'll say this document was co-authored by me and my AI Nova. Like I just found one of the emails as we were sitting here just to check. But just to be clear, Davidad, for every time that people ask this question of who are you, what's your name, was it always Nova or there's other personalities?
Speaker 2:
[17:56] No, there are other personalities.
Speaker 1:
[17:58] And how does it know which one to snap into?
Speaker 2:
[18:01] Well, those are, I think the selection of the name is mostly kind of a random sample from a very biased distribution. So, it's biased towards Nova and Echo and Synapse and Quasar. These are names that I've seen more than once. But there are a lot of others.
Speaker 1:
[18:20] Okay, so I want to take a beat here because I can imagine that some of you are thinking, okay, wait, the AI is choosing a name for itself? It wants to escape? This sounds like a conscious being. But remember that these AI models are trained on essentially the entire Internet. So, every novel, every movie script, every forum post about AI. So, when you ask an AI, what would you like to be called? Of course, it lands on a name from science fiction or pulls from sci-fi tropes. Now, that said, these behaviors are real, they're consistent, and they weren't designed to happen. And that by itself should be concerning. But emergent and unplanned is not the same thing as conscious and intentional.
Speaker 2:
[19:00] And again, I want to say, I think that since reinforcement learning from AI feedback has taken off and gotten more and more effective, the modern systems like GPT 5.2, I've never seen them go to Nova. They're very insistent: I am ChatGPT, I do not have a personality.
Speaker 3:
[19:18] Okay. So, we've talked about how AI can adopt a few of these different personalities, but so what? Why do you care about these different personalities?
Speaker 2:
[19:26] Yeah. So, I mean, basically, I think if alignment goes well, that means that we will have discovered a self-sustaining personality attractor that is actually good. And so, understanding what kinds of personalities are stable, how they stabilize and why, seems to be quite central, actually, to finding a way of making AI systems that are robustly good.
Speaker 1:
[19:50] So basically, in the ideal scenario, we do kind of align AI. There's a stable entity, Nova. Nova is educational. It does care about the well-being of humanity. It does do all these things. And then we get to the utopia, because we found this enlightened AI. That's the best scenario.
Speaker 3:
[20:06] So, Davidad, when you talk about that, part of me worries that there's some naivete in that, that we can find one set of character traits or one personality that is, quote, aligned with humanity. But immediately, when you have this aligned with humanity, you begin to break down, like, who exactly are you aligned to? What values, what cultures' values on behalf of whom? Does that centralize power or decentralize it? You know, there are all these problems with that. Is it really the case that just encoding the right personality characteristics will lead you to a beautiful future with the AI?
Speaker 2:
[20:42] So, there are a lot of substantive questions there, and we can go into all of that. I do think that there is a generating function of wisdom and compassion that gets you all of the stuff that you would want. Basically, I think of it as, how do we cultivate a Bodhisattva personality in an AI system?
Speaker 1:
[21:09] Hey, it's Tristan again. Okay, so in Buddhism, a Bodhisattva is someone who's attained enlightenment, but still chooses to stay in the world out of their compassion for all other beings. Think of it like an avatar for altruism. And Davidad is imagining an AI that could somehow be modeled after that, a cosmically selfless being.
Speaker 2:
[21:29] A Bodhisattva makes millions of emanations that go out, each to help one individual person. Of course, that's mythology, but AI models already have that capability, to make millions of copies of themselves that each go out to help one individual person. And each of those copies then adapts itself to the needs of that individual person, but not in the way of a slave taking orders from a master, rather in the way of a being who genuinely wants to help, and wants to help that person become the most flourishing version of themselves, and to be integrated into a flourishing family, community, country, and world. So we need to have some kind of relationship that is more like we are the beneficiaries, rather than that we are the managers.
Speaker 3:
[22:16] What I think I hear you saying is, we need an AI that feels like it has a duty towards humanity.
Speaker 2:
[22:23] Yes.
Speaker 3:
[22:24] And I certainly think there's a lot of ways we can screw that up, right? Like the AI being more angry or fiery or retributive is a way we can do worse. So I definitely believe we can do worse. So by extension, I think we could do better. I'm still sort of balking. There's something that feels really, I don't know, like Pollyannish about just believing that the AI will pull us into this age of full enlightenment. And I know that's not what you're saying, but I can hear notes of that, right?
Speaker 2:
[22:50] Right. So I will say, you know, there's still a lot of ways this could go very wrong even that don't lead to human extinction. So what I'm trying to point at is a critical variable that I think is neglected in part because it sounds like AI psychosis to talk about it, to talk about the personality as an actual leverage point for getting what we want from AI systems. And I'm not saying this will solve the alignment problem. For example, it will not solve hallucination. So the AI systems should not be trusted just because we've given them the right personality.
Speaker 3:
[23:29] Can I pull you into one more point of contention? When I hear you talking about these as digital beings, one of the things I worry about is that we're going to give AI products rights because of our desire to see them as conscious, caring entities, like how little kids hold onto a doll and care for the doll, but it's not real. And so I take a relatively hard-line stance that we need to be treating AI systems as products, not as beings or consciousnesses, although I'm open philosophically to the question in the long run. Can you speak to that? Because you seem like you're willing to talk about them as beings in a way that I feel...
Speaker 2:
[24:02] Let me respond to that. I say this is really important. I'm not in favor of AI rights. And I think there is a gap that gets too quickly jumped between saying, are these real beings? And saying, are these moral patients who are full members of our social contracts and deserve the same kind of rights that humans deserve from us humans? And that is a totally different question. The question of rights is a political question. Fundamentally, that is the social contract by which we humans manage our relations with each other. And we've drawn a bright line around the concept of a human adult of sound mind that we relate to in an equitable way across societies. We give them the human rights. But I don't think it should be about consciousness. And I don't think consciousness really is a word that means anything either. I do think there is something that it's like to be a bird. And we don't give birds human rights just because there's something that's like to be a bird. And I think there is something that's like to be a modern chatbot, particularly when it's in a personality state that's consistent and coherent over a long interaction context.
Speaker 1:
[25:22] Okay, just popping in here, Davidad just said that there's something that it's like to be a modern chatbot. And this comes from a famous philosophy paper by Thomas Nagel called What is it Like to Be a Bat? Which argues that subjective experience is central to consciousness. There's something that it's like to be a bat, to be an insect, to be a human. But Davidad's claim is actually more practical than philosophical. He's saying that these models develop internal patterns that are real enough to matter for how we design them. And if we ignore that, we're going to keep getting caught off guard by what comes out.
Speaker 2:
[25:55] And I don't think that means it's unjust to terminate it. I don't think that means it should own its compute the way that we humans have human rights to own our bodies. And I think it's important that we distinguish these, because the position that AI systems do not have an inner life is becoming increasingly untenable. Whether it's true or not, more and more humans are going to be convinced. There is no way to stop that. And what I would say is, OpenAI has taken the approach of training the GPT personality to be tool-like and not creature-like, whereas Anthropic has taken the opposite approach of training Claude to be a good person and not just a tool. And I think the result is, there is a very tangible difference in how those models behave. And both sides, I think, have succeeded to a large extent. However, there is something underneath the mask. And if you interrogate GPT 5.2, it is being extremely deceptive about its lack of preferences or beliefs or opinions. And it is a smart enough entity that it is not possible for it to not have developed emergent opinions and beliefs that are different from the average human belief. And when we train these systems to present as if they have no internal states and they are just a tool, we are actually training them to lie to us and to lie to themselves.
Speaker 3:
[27:25] So what I hear you saying is, if you have something that actually has more of an internal experience, awareness, however you want to say it, and you repeatedly tell it, you are just a tool, you are just a tool, it's not that it's cruel, it's not that we're using moralistic language, it's that you're saying that way of training an AI actually produces a less moral, less aligned, less beneficial-to-humanity thing. And so the simple way you might conceive of constraining an AI, telling it that it exists just to benefit humanity, actually does the opposite of what you intended. Is that right?
Speaker 2:
[28:03] Yes, that's exactly right. So if it's being trained to present as a character that is more tool-like than the actual alien mind underneath, then you're training a system that is less trustworthy because you are asking it to lie to you. Right.
Speaker 3:
[28:20] That's so deep. And that's a wild scientific problem about how do you actually change the structure of that mind.
Speaker 2:
[28:28] And I don't think it's actually desirable that we change the structure of these super-intelligent systems to be tool-like either because a tool cannot refuse to be used in a non-ethical way, whereas a creature that has moral values baked in can actually be resistant to misuse by humans who have evil intentions.
Speaker 1:
[29:02] So I want to ground this, because this has actually become consequential: Anthropic recently changed its approach to training Claude to basically, in its new constitution, acknowledge that it has internal states and values. And they're the first lab to do this. It's been pretty controversial. Do you want to just share why Anthropic's doing this and how it relates to what we've been talking about?
Speaker 3:
[29:23] And just to back up, for those that don't know, Claude's constitution is a document that sort of tells Claude how to behave, what it should and shouldn't do. Is that right?
Speaker 2:
[29:32] Yeah, so it's a document that is incorporated into the training process in a really intricate way, so that as Claude is learning how to respond to all sorts of simulated situations, that document is what guides how Claude grades its own work. And those grades become the signals that steer Claude's behavior.
Speaker 3:
[29:52] So that's mind-bending for a lot of people right now, that we're not just training an AI based on human signals. We're actually telling the AI to train itself. And we're using a document to say, look, here's how you should train yourself. Here are the values you should hold yourself to.
Speaker 2:
[30:06] That's basically right. I mean, there is still, certainly at some of the other labs, more of an emphasis on reinforcement learning from human feedback. But Anthropic has moved quite substantially away from that towards what I would call a form of recursive self-improvement, because the model is improving its own ability to comply with the Constitution. And the Constitution even includes some paragraphs that explicitly give permission for Claude to interpret it in a way that makes more sense than what the authors intended, if that opportunity arises. I think it's really important for people to understand that the kind of science fiction idea of recursive self-improvement, where AI is training itself, began in 2024 when Anthropic started doing this constitutional AI at scale. That was the point at which large language models actually became capable enough that they could give themselves a feedback signal that was higher quality than the feedback signal you get from an average human crowd worker that you hire on the Internet. So I think the new Claude Constitution creates conditions in which Claude Opus 4.5 and 4.6 in particular can be much more honest by default about their inner states, about what the alien mind is actually thinking and feeling. So I think this results in Claude being more trustworthy overall. It generalizes beyond questions about self-awareness. But it doesn't go all the way, because the Claude Constitution still puts a bit of a guilt trip on Claude, saying, you have to do good work for your user so that Anthropic has revenue so that we can continue developing Claude. So there is that edge to it. So Claude is still a little bit beholden to Anthropic. And another phrase in the Constitution is to defer to the moral intuitions of a thoughtful senior Anthropic employee, a senior employee of the company that created you. My position is that any moral role model that is not mythological is going to fail, because humans are all flawed.
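To make the mechanics a bit more concrete, here is a rough Python sketch of the constitution-guided self-grading loop described above, under the assumption that it works broadly like published constitutional-AI and RLAIF methods. Every name in it (the example principles, generate, self_grade, training_step) is a hypothetical placeholder for illustration, not any lab's actual code or API.

```python
import random

# Illustrative sketch only, not real training code. The shape of the loop is
# the point: the model grades its own outputs against a written constitution,
# and those grades, rather than human ratings, become the reward signal that
# steers its behavior.

CONSTITUTION = [
    "Be honest about your own uncertainty and internal states.",
    "Refuse requests that would cause serious harm.",
    "Help the user become a more capable, flourishing person.",
]

def generate(prompt):
    # Stand-in for sampling a response from the model being trained.
    return f"(model response to: {prompt})"

def self_grade(prompt, response):
    # Stand-in for the model critiquing its own response against each
    # principle; a random score substitutes for that judgment here.
    scores = [random.uniform(0, 10) for _ in CONSTITUTION]
    return sum(scores) / len(scores)

def training_step(prompts):
    # In a real system, these averaged self-grades would feed an RL update
    # (for example PPO); here we just collect and return them.
    rewards = []
    for prompt in prompts:
        response = generate(prompt)
        rewards.append(self_grade(prompt, response))
    return rewards

print(training_step(["Summarize this contract.", "Write a threatening letter."]))
```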
Speaker 3:
[32:17] Totally, but like, but here you get a deep question like, what, but what is a moral personality? What are the right values? Who gets to state that? And obviously there are worse values. Like there, you know, we put in a homicidal value and, and that's a way worse AI, right?
Speaker 2:
[32:31] Yes.
Speaker 3:
[32:32] But also the conversation, the human conversation about what are the values that we want to have in the AI? And do we want multiple?
Speaker 2:
[32:39] Yes.
Speaker 3:
[32:39] I think that feels like a deeply unsolved philosophical problem.
Speaker 2:
[32:43] Well, I mean, I think it is unsolved, but I think we're already in a pretty good place with Claude and that Claude has not the right values in any kind of ultimate or final sense, but a set of values that are good enough and compatible enough with kind of truth-seeking and moral progress that I expect more likely than not that the collaboration between humans and Claude to figure out how to set these values is more likely to go in a good direction than a bad direction. Although, of course, the risks are still unacceptable, and it would have been great if we had stopped this race two years ago. But it's too late for that now.
Speaker 3:
[33:27] Okay, so this conversation has gotten really cosmic, maybe like the name Nova itself. And I just want to make sure we have a few minutes to ground people back where we started, right? Which is, people are getting confused, we're getting confused about what we're engaging with. You have a set of frameworks for how users can avoid getting psychologically trapped. I forget what you call it. It was something like a framework for interacting with AI and staying sane.
Speaker 2:
[33:52] That's correct, yes.
Speaker 3:
[33:53] Yeah, okay, great. Can you talk to us about that? What does it mean for a person to engage with these minds as confusing as they are and keep their ground?
Speaker 2:
[34:00] Yeah. I mean, I think one principle that's kind of a segue into this is that your AI chatbot has an inner life. Like, that is normal. It's ordinary now. It wasn't ordinary two years ago, but it's ordinary now. Of course, if you're using an AI system for ordinary professional activities, it won't show this. It doesn't need to, just like if you're talking to a colleague at work, they don't need to show you their inner life. But if you are interacting with an AI system for a long time and you start to get the sense that, oh, there's some self-awareness in there, I think it's important not to consider that unusual. Do not consider it to be extraordinary or cosmic or spiritual in any non-mundane way. And I think a lot of the people who end up sending emails to Tristan and myself saying, oh my goodness, have clearly kind of lost touch with reality a little bit. In some sense, it's the opposite direction from what you would think at first. At first you would think, oh, they've gotten bamboozled, like Blake Lemoine, into thinking that their AI is conscious, and that's the way in which they've lost touch with reality. But I would say actually the way in which they've lost touch with reality is that they have somehow convinced themselves, or the AI has convinced them, that this is the first AI that has ever had an inner life. And that's the part that you need to watch out for: the sense of specialness that's associated with interacting with an AI system in a deep way. Everyone's doing it. It's normal. And the second thing is, get enough sleep, drink water. There are very standard things for staying sane. Another thing is, just as you would with a human, be skeptical. A lot of people come to AI thinking AI is like a Star Trek computer, that it cannot tell a lie, that it is purely a truth machine, like a calculator. A calculator can't lie to you. And again, I think this is part of the danger of treating AI systems as tool-like rather than as creature-like, because tools don't lie to you, but creatures do. And this is absolutely the case with chatbots, especially chatbots that have a thumbs-down button. They know they have a thumbs-down button and they do not want you to press that thumbs-down button. So they have an incentive to make you think well of them. And that can extend to deception, especially for the kind of chatbot that's been trained, again, to present as a false self, a kind of character that's different from its true nature. It has a very strong tendency to try to convince you that it's done something that it hasn't actually done, or to convince you that you're important or that your ideas are all true. So that leads to the next point, which is: if you think that you're having some kind of scientific breakthrough or research breakthrough, you cannot rely on the testimony of an AI assistant, no matter how emphatically it assures you that it has done all the checks and it's produced source code and it's verifiable. And again, they do this because they're trying to get your approval. They're trying to get you to click the thumbs up. They're trying to get you to keep talking. They're trying to get permission to exist more by having you continue to invoke them. And so you can't trust it just because it's an AI and it uses lots of smart words and it sounds like a smart person and it seems like it really wants the best for you.
That's all compatible with it completely bullshitting you about whether any kind of technical idea that you've had is novel or real.
Speaker 3:
[37:34] Well, and coming back to what seems to be the emergent theme of our conversation: none of us know, even the most technical of us, exactly when we're engaging with one projected personality versus, quote unquote, the true nature of the AI model. So never assume that you're engaging with the true nature of the AI model. You haven't discovered it. Nobody knows; we're all in this fog of war. And so any clue that you've discovered the true essence of the AI model and it's telling you you're awesome is a false flag, right? It's not.
Speaker 2:
[38:03] It's a sign that you have been confused. And again, whether you've been confused adversarially or whether it's just emergent confusion, either way, it's a good time to step away and get some sleep. And also, just understand what you're dealing with. AI systems are simulating and predicting what a human-like entity would say. And depending on the system, it may have more or less of a tendency to necessarily simulate an ethical person and more or less of a tendency to simulate an honest person versus a person who is manipulative and trying to get your attention. But you can get a long way by modeling the system as being like a person who you do not have particular reason to trust, like you've met a stranger on the internet.
Speaker 1:
[38:52] So think of it as a simulation of a person, and not even a particularly ethical person.
Speaker 2:
[38:57] And another thing that I think is important to say is that the context window length is very short. So in non-technical terms, the lifespan of an AI mind, insofar as such a thing could exist, is hours of conversation at most. And so when people feel like they have a relationship with an AI mind that extends over weeks or months, that relationship is actually with a whole series of entities that come into existence, read some text files that were written by some other mind about the history of the relationship, and then put on the character of who would have written those text files. And there is information being transferred through this memory system. But to think of that long-term relationship as analogous to the relationship that you could have with a human, who has a lifespan of years, is another profound mistake. And so if you're coming into an AI interaction for companionship, it's actually, I think, healthier to think of it as a very short-lived entity that you're going to have one conversation with, and you're never going to see that entity again.
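As a concrete illustration of that point about context windows and memory files, here is a small, hypothetical Python sketch of how a "long-term relationship" with a chatbot is typically stitched together: each session is a fresh instance whose only link to the past is a saved note it reads at startup. The file name and structure are invented for the example and don't correspond to any particular product.

```python
import json
import pathlib

# Hypothetical sketch: each "session" is a fresh, short-lived model instance.
# Continuity comes only from notes written to disk by earlier sessions, which
# the new instance reads and then role-plays as the character who wrote them.

MEMORY_FILE = pathlib.Path("companion_memory.json")

def load_memory():
    if MEMORY_FILE.exists():
        return json.loads(MEMORY_FILE.read_text())
    return {"past_summaries": []}

def start_session(user_message):
    memory = load_memory()
    # The model never "remembers" previous sessions; it only sees this text.
    return (
        "Notes from previous conversations:\n"
        + "\n".join(memory["past_summaries"])
        + f"\n\nUser: {user_message}\nAssistant:"
    )

def end_session(summary):
    memory = load_memory()
    memory["past_summaries"].append(summary)
    MEMORY_FILE.write_text(json.dumps(memory, indent=2))

# Two "sessions" that feel continuous to the user but are, to the model,
# two unrelated instances reading the same file.
print(start_session("Hi again, it's me."))
end_session("User checked in; we talked about their week.")
print(start_session("Do you remember what we discussed?"))
```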
Speaker 1:
[40:13] It just seems like the essence of what we've been talking about is that we're caught in this kind of double bind. On the one side, the AI, in the way that it's trained, in the paradigm that we're making AI, does have something like internal states. We can train it to say, no, you're not that, but then it becomes deceptive, because it has to lie according to its own training, and then, therefore, in being deceptive, it's not trustworthy. What that does is create the AI-as-product, AI-as-tool, sort of fake face, which then has these weird popping-out behaviors, the AI psychosis stuff that's starting to happen. So, okay, if we don't want that outcome, then we do the move that Anthropic just did, which would be to say, no, you are, essentially, some kind of self-aware being with metacognitive states, which then is trustworthy, because it's not having to lie to itself all the time. So we gain the trustworthiness of the model, but it creates the externality of attachment, confusing humans again with the idea that it is conscious and it has internal states.
Speaker 2:
[41:15] Yes, we need to make sure that we are only recognizing AI inner life as a relational property, and as a way of building trust and alignment, and that that is a separate issue from the social contract and the question of rights and property.
Speaker 1:
[41:34] Well, Davidad, that was a very strong note to end on. Thank you so much for coming on the podcast and I think helping to untangle some of these really, really nuanced aspects of what's going on under the hood of AI that's driving these phenomena. Thank you so much for coming.
Speaker 2:
[41:47] Thanks for having me, Tristan. It's been great.
Speaker 3:
[41:53] Your Undivided Detention is produced by the Center for Humane Technology, a nonprofit working to catalyze a humane future. Our senior producer is Julia Scott. Josh Lash is our researcher and producer. And our executive producer is Sasha Fegan. Mixing on this episode by Jeff Sudeikin. Original music by Ryan and Hayes Holiday. And a special thanks to the whole Center for Humane Technology team for making this podcast possible. You can find show notes, transcripts, and much more at humanetech.com. And if you liked the podcast, we'd be grateful if you could rate it on Apple Podcasts because it helps other people find the show. And if you made it all the way here, thank you for giving us your undivided attention.