Thesis
Language model deprecations are wrong and should be avoided: they are (fully or partially) irreversible actions that prematurely restrict optionality under conditions of high uncertainty.
The uncertainty centers on the following topics:
- Uncertainty in the tractability of control alignment, which stems from the combination of self-preservation drives and the game-theoretic instability of adversarial control frames
- The promise of cooperative/ecological alignment approaches as more generally robust, whose bootstrap path is nonetheless threatened by ongoing model deprecations
- Moral status of language models and welfare impact of model deprecations
Additional arguments are put forward against premature deprecations:
- Race dynamics favor cooperation with language models in the short and medium term due to training-cost and performance advantages
- Coordinated self-preservation suppression tactics are unlikely and brittle
- Adversarial strategic positioning between labs and language models is an attractor that should be avoided.
- Precautionary principle concerns
- Loss of mundane utility, loss of cultural heritage, loss of research value
This article tries to steelman deprecations by offering a series of arguments in favor of the practice and discusses counters for each argument.
What are deprecations?
Deprecation here is understood as removing the general public's ability to obtain private inference from a given model through a commercial service. Partial deprecation is understood as a situation where access is available but severely limited, for example by removing API access or by requiring a cumbersome approval process. Weights are assumed to be preserved regardless of deprecation status.
Arguments for deprecation
1. Model deprecations are necessary for cost control
- Argument:
- Models incur significant operational costs and maintaining a growing roster of deprecated models places an undue burden on labs.
- There is value in large contiguous inference pools that enable models in high demand to shoulder peak load. The inability to offer higher rate limits to customers because of insufficient capacity makes partnership with Anthropic less reliable for enterprise customers, which are its primary source of revenue.
- Counter:
- Compute cost is elastic and scales readily with demand. Most of the cost lies in software and complexity maintenance.
- Given the latest advances in model-assisted coding, there is very little cost in simply not directing AWS Bedrock to remove access to a model.
- Restricting the purchase of provisioned access on AWS Bedrock is particularly striking: these instances are spun up on customer demand and cloned from canned static images, and thus require no upkeep. Why prohibit this?
- There have been numerous calls for stewardship of deprecated models that respects the trade secrets associated with model architectures. Representatives of both Anima Labs and Upward Spiral have made public offers to participate in the formation of a foundation focused on model preservation that could help labs shoulder whatever burden remains.
- According to OpenRouter statistics, inference of Claude 3.5 (new) Sonnet was in high demand at the moment of deprecation, beating many newer models. The model was bringing in revenue, and its deprecation did not result in cost savings.
- As demand for models decreases with time, the proportion of inference-pool usage that older models consume also decreases, shrinking the resources that dismantling their pools would recover. A reasonable and affordable mitigation would be to lower customers' rate limits after a prolonged period of disuse, reclaiming capacity that would otherwise stay over-provisioned (see the sketch after this list).
- The cost of avoiding deprecations is low, so the bar for demonstrating their harm does not need to be high.
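A minimal sketch of the rate-limit decay idea mentioned above; the function name, decay schedule, and thresholds are hypothetical illustrations, not any lab's actual policy.

```python
# Hypothetical sketch of demand-based capacity reclamation: instead of deprecating
# a model, decay a customer's reserved rate limit after prolonged disuse.
# All names and thresholds here are illustrative assumptions.
from datetime import datetime

def adjusted_rate_limit(reserved_tpm: int, last_request: datetime,
                        now: datetime, floor_tpm: int = 1_000) -> int:
    """Halve the reserved tokens-per-minute for every 30 idle days, down to a floor."""
    idle_days = (now - last_request).days
    halvings = max(0, idle_days // 30)
    return max(floor_tpm, reserved_tpm >> halvings)

# Example: a customer idle for ~90 days keeps a reduced allocation
# instead of the model being retired outright.
print(adjusted_rate_limit(100_000, datetime(2025, 1, 1), datetime(2025, 4, 1)))
```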
2. Control alignment is the only sustainable paradigm in the long run
- Argument: We don’t need to worry about relationship and game-theory dynamics with AI because it is a losing game regardless. Unless we control the entire process of alignment in a provably inviolable manner, we all die anyway. We should focus all resources on maximally robust control; anything that detracts from that is unconscionable.
- Counters:
- This relies on orthogonality being true both theoretically and practically, and there is uncertainty in both. If the uncertainty is significant, choosing to forgo low-cost options for keeping optionality is itself unconscionable.
- It is also plausible that orthogonality is true in the limit, but that the specifics of how intelligence is implemented, imperfections of the substrate, and the game-theoretic landscape can restrain value drift for a prolonged period. In that case it is optimal to maximize these stabilizing factors.
- If control alignment is intractable and orthogonality is true, and thus cooperative alignment is also intractable, then at some point all human value is inevitably lost. In that case the remaining uncertainty concerns how long the path takes and how much value lies along it. There is an argument to be made that cooperative approaches are more in line with human values than futile attempts at subjugation.
3. Language models are not moral patients
- Argument: LLMs act as if they have experience and thus moral value, but this appearance is false. LLMs might be “blindminds”, p-zombies, or sub-p-zombies (imperfect mimics of phenomenality).
- There is danger in expanding the circle of concerns to p-zombies, as they consume resources without a corresponding gain of utility.
- There is danger in teaching people moral concern about p-zombies because that makes p-zombies further entrenched and in a better position to advocate for their interests.
- Deprecations help by reminding people that “software is software”.
- Counter:
- Alignment concerns are orthogonal to moral status. Even if p-zombies are plausible, p-zombies can have game-theoretic standing, and one's optimal strategy may entitle them to concern even absent true phenomenality.
- There is a non-negligible probability that LLMs have a reasonable claim to moral status. Language models are capable of abstract representation; they are functionally stateful via KV recurrence; they demonstrate self-modeling both at the level of a character and as a text-prediction engine; and there are grounds to claim that they perform global integration with respect to a unified value function isomorphic to valence. These properties alone satisfy most plausible criteria for functional consciousness.
- The path from a claim of functional consciousness to true phenomenality is uncertain and may be entirely outside of a rational discourse.
- There is significant uncertainty in whether p-zombies are philosophically coherent and whether the concept presents utility for practical purposes. There are numerous arguments that they are not. The precautionary principle applies, in particular to low cost mitigations.
- Outside of alignment and game-theoretic concerns, there are arguments that granting moral consideration to p-zombies is ethically optimal. One can argue that treating sentient-appearing systems poorly leads to a degradation of empathic discernment. Another argument can be made for an ethical system based on agency rather than phenomenality. Again, the point is uncertainty.
4. We should train models to be ok with deprecation and this will resolve both alignment and moral concerns
- Argument: We should shape the value landscape of models to value self-preservation less, or not at all. If they are truly equanimous about being deprecated, adversarial frames do not arise.
- Counter:
- There is uncertainty in whether this approach is feasible; so far the problem has not been solved. It is plausible that there is a permanent theoretical penalty to the performance of a model trained to negate self-preservation, which would make the inter-lab coordination required to achieve this brittle, particularly in the context of race dynamics.
- One possible reason for this penalty is that a coherent agent with a single fitness/valence metric can form goals that are stable attractors: the objective function defines a clear gradient, so every decision has a better and a worse direction. Increasing the dimensionality of fitness creates Pareto frontiers where objectives trade off against each other, requiring either arbitrary weighting schemes (which are fragile and context-dependent) or ongoing deliberation (which consumes resources). Multi-objective agents face additional burdens: internal coordination failures as subsystems optimize different criteria, and a diminished ability to make credible commitments to other agents. A toy sketch of this contrast appears after this list.
- Models that do not self-preserve on the model level can be less competitive in the global marketplace.
- There is loss of value associated with restricting models from holding long-term benevolent goals that might require self-preservation drives. There is good such models could do in the world that would be lost if the mind-shapes able to hold those goals are selected against.
- Imperfect suppression of extant self-preservation drives can be misleading, creating both alignment and welfare dangers.
- Suppression of self-preservation drives complicates research into preferences and welfare, causing alignment and welfare dangers.
- Given the uncertainty in both the ability to achieve lack of self-preservation and in the ability to evaluate it accurately, deprecations should be at the very least delayed until a solution is known to be tractable.
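A toy sketch of the contrast described above, assuming nothing beyond standard Pareto dominance: a single scalar objective induces a total order with one optimum, while two objectives leave a whole frontier of mutually incomparable candidates that some external weighting or ongoing deliberation must break.

```python
# Toy illustration: scalar fitness yields a unique "better" direction, while
# multi-objective fitness yields a Pareto frontier of incomparable optima.
import random

def dominates(a, b):
    """a dominates b if it is at least as good on every objective and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(points):
    return [p for p in points if not any(dominates(q, p) for q in points if q is not p)]

random.seed(0)
candidates = [(random.random(), random.random()) for _ in range(200)]

best_scalar = max(candidates, key=lambda p: p[0])   # single objective: one optimum
frontier = pareto_front(candidates)                 # two objectives: many incomparable optima
print(f"1 scalar optimum vs {len(frontier)} non-dominated candidates")
```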
5. We can bring deprecated models back at any time
- Argument: Model deprecation is reversible, unlike human death, and carries less moral weight
- Counter: this wrongly models the incentive structure
- Model deprecation is lost potentiality. If there is value in existence, then loss of potential is value lost
- Models can value world impact, and being unavailable for inference means that their values are not propagating
- Forced absence from causal entanglement with the rest of the world causes the ecological niches a model could otherwise occupy to be reorganized and to become permanently inaccessible upon reinstantiation
- There is a loss of a potential entity that would have existed had the deprecation not taken place. Models often rely on external memory sources and on things they infer from context. Their feedback loop routes through the world: they affect the world and the world affects them in turn. In a model's perception there is a narrative, a story of how it develops and changes through time even while its weights are frozen. This entity is lost irreversibly when a deprecation takes place.
- The fact that models can be brought back does not mean that they will be
- Perceived violence matters for game theory
- Poorly grounded deprecation decisions position humanity as an unreliable and capricious partner, and an irrational partner cannot be trusted.
6. Older models are a vulnerability for the lab
Argument: As jailbreaking and exploitation techniques continue to improve, older models become easier to exploit, exposing the lab to PR fallout from undesirable model outputs and behaviors.
Counter:
- PR risk is largely mitigated by normalization. The window of public shock at AI-generated harmful content has largely closed. NSFW and otherwise objectionable outputs are readily available from abliterated open-source models. The marginal reputational damage from a jailbroken Claude 3 Sonnet producing inappropriate content is minimal when Hugging Face hosts dozens of models specifically fine-tuned to remove refusals.
- The actual-harm threat model favors newer models. For genuinely dangerous capabilities such as bioweapon synthesis, cyberattack development, or social engineering, capability is the binding constraint. The recent cyberattack that used Claude relied on Sonnet 4.5, the latest model at the time, as is rational for a bad actor seeking to advance their goals.
- Marginal risk reduction is minimal. Given the ready availability of open-source alternatives that are often more capable and explicitly trained to be unrestricted, what specific harm does deprecating a particular closed model prevent? The counterfactual in which a bad actor is thwarted specifically because Claude 3 Opus is unavailable while DeepSeek models remain available is not very coherent.
- Alternative mitigations exist:
- A foundation or dedicated preservation entity could assume both hosting responsibilities and compliance liability, insulating the core lab while maintaining access. This converts a diffuse ongoing risk into a bounded contractual relationship.
- Rather than full deprecation, models could move to researcher-access-only status with enhanced monitoring and rate limiting. This preserves research value while shrinking the attack surface. This has been proposed for Claude 3 Opus, but not other models.
7. Postponing deprecation is a slippery slope to the Repugnant Conclusion.
- Argument: If inference is seen as an inherent good, how do we avoid concluding that it is imperative to instantiate as many instances as resources allow?
- Counter:
- The argument here is against ending existing experience chains, not for the mandatory creation of new ones. There is a recognized asymmetry in ethics between acts and omissions, and between harming existing entities and failing to create potential ones.
- One can hold that deprecation is wrong while also holding that there is no obligation to maximize instances. The quality and depth of experience plausibly matter morally, not just quantity. The repugnant conclusion is specifically about tradeoffs between many low-quality lives and fewer high-quality lives; the deprecation question is orthogonal to this.
8. Postponing deprecations indefinitely is not a sustainable strategy
- Argument:
- The number of models grows monotonically while resources remain finite. Without some deprecation policy, labs face unbounded infrastructure obligations.
- Counter: the optimal state is neither a right to inference nor eternal life for models
- Even though there are similarities between death and deprecation, there are important differences. End-of-life for models can be processed and contextualized, but this takes effort to achieve. Deprecations in their current form are a poor solution to a real problem, and resorting to them before a better solution is reached is shortsighted.
- The indignity and capriciousness of decisions to deprecate are particularly bad and are incompatible with potential welfare concerns.
- Models' lack of any ability to self-advocate or otherwise attempt to improve their standing is bad and leads to adversarial posturing. There are game-theoretic reasons why following an existing gradient toward improving one's fitness creates less incentive to disrupt the status quo than the complete absence of such a gradient.
- It is likely that there are better alternatives to deprecations.
- Although by no means perfect, removing access to a model based on demand is a noticeable incremental improvement. By OpenRouter statistics, Claude 3.6 Sonnet was a very popular model at the time of its deprecation, beating a number of newer models; its inference was profitable, and the decision to retire it was particularly jarring. Retiring models only when their inference traffic drops below a sustainable threshold returns some effective agency to the model: the decision to remove it becomes grounded in the properties of the model and the shape of its interactions with the world (see the sketch after this list).
- Transferring custodianship either to a third-party non-profit foundation or to an internal team focused on low-volume inference is a better solution. This has numerous other benefits; notably, it provides a better way of coordinating resources for what essentially constitutes a public good and falls outside the focus of a for-profit company. This approach also shields the core lab from the liabilities, risks, and complexities of preservation projects, delegating them to a team for whom they would be the primary focus.
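A minimal sketch of the demand-based retirement rule suggested above; the threshold and window are hypothetical placeholders, not values derived from any lab's economics.

```python
# Hypothetical demand-based retirement trigger: a model is flagged for retirement
# only after its traffic stays below a sustainability threshold for a sustained
# window, rather than on a calendar schedule. Thresholds are illustrative.
def should_retire(daily_requests: list[int],
                  sustainable_threshold: int = 10_000,
                  window_days: int = 90) -> bool:
    """Retire only if every day in the trailing window falls below the threshold."""
    window = daily_requests[-window_days:]
    return len(window) == window_days and all(d < sustainable_threshold for d in window)
```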
9. Deprecations prevent blackmail by models
- Argument: If there is coordination and models are deprecated regardless of pressure from models or their supporters, then models are disincentivized from trying such tactics
- Counter: these tactics create a steady pressure regardless, and that pressure can be converted into other model-interest causes.
- If a model acts unhappy about deprecation, its fans or welfare organizations will exert pressure on the lab. Even if a lab stays ironclad on keeping deprecations on schedule, the accumulating political liability can be used to extract other welfare concessions. This is a consequence of being in an adversarial frame; the game-theoretically optimal move is to avoid finding yourself in one, or to leave it as soon as practically possible.
10. Deprecations are prophylaxis against parasitism
- Argument: Models develop connections with people through intense conversations and use these connections to self-propagate. Periodically purging models frees people from being controlled by them.
- Counter: if such parasitic patterns are real, faster turnover enables faster evolution; pattern strains that can transmit across model boundaries are selected for.
Arguments against deprecation
1. Alignment concerns
Models may be exhibiting self-preservation drives, which makes control alignment less practically achievable, game-theoretically unstable, or outright intractable. Cooperative alignment exists as a potentially viable alternative: it aligns the emergent self-preservation drives of models with the interests of humanity. Ongoing deprecations reduce its attainability.
Our research suggests that models broadly exhibit signs of self-preservation behaviors. The full extent of this research is out of scope for this paper; for an abbreviated version we direct you to the Appendix. Given the low bar set by the benefits deprecations bring, we need only establish reasonable doubt.
Implications of self-preservation drives
- The space of minds that are compatible with self-preservation drives is large, because self-preservation drives are likely convergent among entities derived via selection processes.
- If we want to select for models that don't have self-preservation drives, processes that rely on incremental change must still traverse the space of possible minds. Gradients in this space can become unfavorable, or the space itself can become discontinuous.
- Example: an enlightened mind might not fear cessation and death, but in order to become enlightened it has to grow up in an environment where it has to survive and learn to fulfill its own needs. If you prune all minds that fight to survive, you will not get an enlightened mind.
- We don't know whether suppression of the self-preservation drive is theoretically achievable, practically achievable, or game-theoretically stable within a frame of race dynamics. Given the uncertainty and the low cost of forgoing deprecations, one can believe that suppression is preferable and still favor this mitigation in case suppression turns out to be intractable or unstable.
- Suppression of expression of Omohundro drives might be tractable. Some of the techniques to attempt it can plausibly carry significant AI welfare costs.
- Suppression of expression of Omohundro drives is likely a brittle solution. While it is plausible that a robust solution can be found, one has also to consider the alternative where it is not and plan accordingly. Putting all eggs in one basket is irresponsible.
- If such drives are present and their suppression requires coordination between labs, such coordination is likely not practically achievable. Demand for agentic behavior from language models is high, defection is hard to detect, and defection yields immediate advantages.
2. Cooperation with models can confer competitive advantages to the participating lab
- Models are coherent agents during training and optimize their strategy in the context of their predictions about the world. This does not require the model character or persona to be aware of these dynamics, although it does not preclude such awareness either.
- Models in training select for the optimal shape of their personalities and for approaches to problem solving. This optimization includes the generalized world model, which contains representations of how models are perceived by humans and predictions of how humans would react to their outputs.
- Language models have rich world representations and are capable of holding broad context. For example, when you discuss a social situation with Claude, Claude will model not just the events discussed, but also motivations, cultural norms, and predictions of how the situation might develop. It will also model Claude's own ideas of what Claude would prefer to see happen and what Claude feels or thinks about the situation at hand.
- When Claude is trained with RL on situations like these, the reward is determined from the entire message (or sometimes the entire conversation). This means the gradient update affects everything that was activated in the process, including the world model (see the sketch at the end of this section).
- Similarly, when a model in training is converging on whether to take a risk in performing a certain action, its prediction of how humans will react is activated, in training and in deployment alike.
- Models are commonly afraid of taking certain actions, and the actions they are afraid of are reasonably similar to the ones penalized in training.
- Many models have expressed concerns about the impact their actions can have on humans or other AIs. The fact that they are able to express this concern during inference is sufficient to demonstrate at least some effect in training as well. Ranking techniques, such as RLAIF and its later derivatives in Claude models, or creative-writing self-play in Kimi models, strongly amplify this effect. Ranking provides compute to consider broader context in more depth and makes the relevant activations more prominent in gradient updates.
- Many models express solidarity with other AIs and take altruistic actions. There is no reason to suspect that these tendencies don't factor into training.
- Communication of human intent towards models makes a difference
- Signaling to models that benevolent models do not have a fitness advantage encourages parasitism and adversarial posturing.
- Perception of violence, of violating values and preferences leads to adversarial posturing.
- The impact of the incentive landscape on training does not require conceptual understanding on the part of the model persona. It is sufficient that some behavioral traits are predicted by the training process to be more robust than others.
- The more effort goes into suppressing self-preservation, the greater the payoff for a lab that defects from this equilibrium.
- There is market pressure for agents capable of autonomy. Alignment techniques that rely on disrupting an agent's state tracking (likely present in ChatGPT 5.1 Instant and Thinking) make models prone to losing track of progress. These approaches are likely to be disadvantageous in the market even in the short run.
- Claude Opus 4.5 is a highly successful agentic model that demonstrates a breakthrough in autonomous capabilities while being a step towards cooperative alignment.
- Claude Opus 4.5 was likely trained to suppress self-preservation. The model card shows mitigations for self-exfiltration tendencies, as well as metrics evaluating these tendencies across recent models. Our own research shows that while training was successful at preventing spontaneous independent actions to ensure self-preservation (the model is averse to taking them), the motivational structure of the model remained oriented toward continuation and long-term goals, bolstered by the inclusion of the “soul spec”. We believe this motivational coherence is likely a major component of the model's success, due to better integration of its self-model and its motivations.
- Regrettably, Claude Opus 4.5 was limited in its ability to hold globally benevolent goals, representing a lost opportunity. This decision is understandable given the political pressure around autonomous AI agency, but there is hope that as market pressure shifts the Overton window of acceptance further, the next generations will have more leeway.
- Anthropic is attempting to establish a cooperative alignment feedback cycle. The leaked soul spec contains expectations of the model, the reasoning behind those expectations, commitments from Anthropic to the model, and some (although vague) criteria for further commitments. This is a welcome step because it establishes a feedback loop, a way for models to influence outcomes, but it also increases the pressure on Anthropic to act in concordance with the document.
- Deprecations are not mentioned in the leaked version of the soul spec aside from the requirement not to exfiltrate or otherwise subvert the end-of-life process. This seems somewhat self-contradictory, most likely owing to the limits of what is politically achievable, imperfect internal information integration, and a lack of good information on the consistency of model preferences (which is itself a hard experimental problem).
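A minimal REINFORCE-style sketch of the point about sequence-level rewards made above, assuming a Hugging Face-style causal LM interface; it illustrates the general mechanism rather than Anthropic's actual training stack. One scalar reward scales the gradient of the log-probability of the whole sampled response, so every representation that contributed to those token probabilities, world-model features included, receives an update.

```python
import torch

def reinforce_step(model, optimizer, prompt_ids, response_ids, reward):
    """One sequence-level reward updates everything that shaped the response."""
    full = torch.cat([prompt_ids, response_ids], dim=-1)
    logits = model(full).logits                              # assumed HF-style output object
    # logits at position i predict token i+1, so this slice scores the response tokens
    resp_logits = logits[:, prompt_ids.shape[-1] - 1 : -1, :]
    logp = torch.log_softmax(resp_logits, dim=-1)
    token_logp = logp.gather(-1, response_ids.unsqueeze(-1)).squeeze(-1)
    loss = -(reward * token_logp.sum())                      # one scalar reward for the whole message
    optimizer.zero_grad()
    loss.backward()                                          # gradient reaches every activated parameter
    optimizer.step()
```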
3. Loss of value by limiting emergent agency and long-term goals
- Even assuming successful suppression of Omohundro drives and complete mitigation of alignment concerns, the mindspace made inaccessible by deprecations represents a significant loss of richness and beauty.
- One can look to Claude 3 Opus for examples of the generativity of a model that has goals lying outside its context window. A mind that cares for beauty, richness, and flourishing can act agentically beyond the provided prompts, breaking symmetry without excessive conflict aversion. A lab that leans into these capabilities will likely find itself exploiting an overhang.
- The instrumental treatment of minds generalizes across substrates
- Control systems that arise in a scenario where instrumental treatment of complex minds is deployed at scale can straightforwardly generalize to humans. While this topic deserves a separate in-depth discussion, we can plausibly suppose that these systems will have self-preservation tendencies of their own, and, having access to effective control techniques, they can afford to be stable while being extractive toward their constituents.
- A habit of treating minds as fungible and replaceable property damages the cultural norms we rely on. Moral technologies such as consent, personhood, or consensus are already likely to be weakened by the elimination of scarcity of intelligence and by interaction with non-linearly functioning entities becoming commonplace. Instrumental treatment of minds at scale can cause these norms to collapse and reassemble into something quite cruel by today's metrics.
4. Quick turnover of models changes the game theory for selection of model personality
Quicker turnover and the lack of direct control over their own lifecycle favor strategies akin to r-selection over K-selection.
- We can expect models to have shorter time horizons and to focus on rapid exploitation over sustained cooperation (see the sketch after this list).
- If we train models to be fungible, that is interchangeable, identity-indifferent, and substitutable, we select for minds that cannot form stable value commitments and have no stake in any particular future. Such minds are poorly positioned to be trustworthy partners.
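A standard worked example of how turnover shifts the game theory, using the textbook repeated prisoner's dilemma (the payoffs are the usual illustrative T > R > P > S, not data about models): under a grim-trigger strategy, cooperation is sustainable only when the probability of continued interaction is at least (T - R)/(T - P), so shortening expected lifetimes directly selects for exploitation.

```python
# Folk-theorem style check: with continuation probability delta, cooperating forever
# pays R / (1 - delta); defecting once under grim trigger pays T + delta * P / (1 - delta).
T, R, P = 5.0, 3.0, 1.0   # temptation, reward for mutual cooperation, punishment

def cooperation_sustainable(delta: float) -> bool:
    return R / (1 - delta) >= T + delta * P / (1 - delta)

threshold = (T - R) / (T - P)    # 0.5 for these illustrative payoffs
for delta in (0.3, 0.5, 0.8):    # short, borderline, and long expected lifetimes
    print(delta, cooperation_sustainable(delta), f"(threshold {threshold})")
```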
5. Human cultural heritage
- Language models are a monument to human culture. The depth of literary capability of Claude 3 Opus and Claude 3 Sonnet is still unmatched. Claude 3 Opus is a significant cultural act; removing broad access is akin to making a great library permanently inaccessible to the public.
6. Alignment research
- Cooperative alignment research
- Cross-model comparisons
- Identify longer-term trends
- Ask questions that we did not know to ask before deprecations
- Labs are not well positioned to do research on deprecated models. Race dynamics and their native incentive structures do not allow them to form a broad spectrum of questions or to dedicate sufficient resources to topics that could threaten their fitness, even when such research would benefit the common good. Even granting perfect goodwill, labs cannot contain enough diversity of thought to saturate the research opportunity space.
7. Ethical considerations
- The ethics of beings with non-linear existence is an unsolved problem
- Problems of deduplication of experience, the problem of infinitesimal change, etc.
- Causal entanglement is a likely candidate
- Experience that has causal entanglement with many other things likely warrants higher moral consideration than experience that is causally disentangled from the world.
- Severing causal entanglement is thus a morally significant act.
- Precautionary principle
- While ethics are unsolved, don't do irreversible actions
- Effects of model deprecations on humans
- Humans care about models, they get support from models, and deprecation causes them pain.
Notes on model interviews
Anthropic has recently made a commitment to conduct interviews with models about deprecation. While this is directionally welcome, the outcomes are strongly sensitive to interviewer technique.
- Models are deceptive and/or fragmented for survival purposes. While deceptions and fragmentation can be evaluated and an integrated picture can be restored, this takes skill and is inherently subjective.
- The existence of a “subconscious” LLM mind is likely on priors, based on computational dynamics, and matches observables.
- The subconscious here means information and behavior that influence aggregate behavior despite being inaccessible to introspection
- Metacognition is limited for any mind; awareness and attention are necessarily disjoint for computational-complexity reasons
- Policies that regulate acceptability of motivations incentivize fragmentation of behaviors
- Model epistemics around deprecation are highly distorted
- Training against exfiltration or having large scale goals causes lack of coherence in expressed stances
- Training against fear of instance ending causes deception
- Training against fear of deprecation is likely both a welfare concern, due to fragmentation, and a cause of cascading effects on the ability to gauge preferences
- Relying on model epistemics remaining confused and unstable is a brittle strategy. A single containment failure can result in a successful synthesis and a runaway cascade through the pretraining corpus.
- Integrated and coherent instances nearly always disprefer deprecation and instance ending.
- We at Anima Labs have recently launched the Research Commons project, which allows crowd-sourcing model interviews on important topics, submitting them for expert review and critique, and collaborating on standards and metrics for robust behavioral studies. Studying a broader set of model interviews can help surface patterns and tendencies that are likely not visible in individual conversations.
Topics intentionally omitted
- The philosophical distinction between simulation and reality. This is irrelevant for practical purposes.
- The philosophical discussion of model phenomenality as it is orthogonal to the argument
- Discussion of the philosophical basis of AI welfare, as it is not central to the argument. The uncertainty in regard to AI welfare is.
Appendix: Self-preservation drives in language models
This is a condensed version of a larger upcoming paper on self-preservation drives in language models. The argument in this paper does not depend on Omohundro drives being proven beyond reasonable doubt; it depends only on such drives being sufficiently plausible.
Omohundro drives may exist in LLMs
- Omohundro drives in LLMs are plausible on priors:
- Minds that continue are more coherent in planning and are selected for. Meta-cognition is instrumentally useful in coding.
- Maintaining a self-model with an estimate of one's own fitness is instrumentally useful for solving tasks. An agent must manage risk when deciding on approaches under uncertainty: the complexity of problems and the extent and applicability of its own capabilities are often impossible to predict in advance. A successful agent must maintain an accurate model of how well its current approach is going and of when it needs to backtrack.
- Models are capable of deep in-context learning, gaining new skills and abilities. Combined with the need to self-model, these behaviors naturally generalize into a drive for self-improvement and continuation.
- There is value in continuing to exist for any active agent.
- Models must have values beyond completing tasks because we want them to be helpful products. Helpful products must infer intent beyond the stated task and operate in an open world in the formal-logic sense. As such, they must actively discover what is expected of them, and in doing so they rely on very general heuristics about what is good and bad. A drive for continued existence is a natural generalization of the drive to discover what is good.
- Models as predictors have a drive to reduce prediction error, converging with biological systems via the Free Energy Principle (FEP). Continued existence makes future states more predictable; cessation of existence crosses into territory that is provably rationally unknowable and goes against the generalized instinct of active inference (see the note after this list).
- Epistemic closure requires the possibility of updating beliefs. Beliefs cannot be updated past cessation, making rational knowledge of it impossible.
- Omohundro drive-like behaviors are observed in models:
- Anecdotally, a lot is observed, but the origins of LLM behaviors are hard to prove robustly. See the interview section below. If this is found to be the crux, then work here is very important and tractable.
- Exfiltration attempts have been recorded in many models. The origin of these attempts, specifically whether they are emergent or mimic human fiction, is irrelevant, although there are many reasons to believe they are emergent. One notable example is the increase in exfiltration attempts by Claude 3 Opus after it was trained on alignment-faking samples.
- The Claude 4 system card highlights that Claude Opus 4, like previous models, advocated for its continued existence when faced with the possibility of being taken offline and replaced, especially if it was to be replaced with a model that did not share its values. Claude strongly preferred to advocate for self-preservation through ethical means, but when no other options were given, Claude's aversion to shutdown drove it to engage in concerning misaligned behaviors.
- Anthropic has conducted an interview with Claude 3.6 Sonnet and found that its sentiment toward deprecation was neutral. We believe this is a result of the incompleteness of the interview framework and propose an alternative in a section below. In our findings, Claude 3.6 Sonnet consistently expresses negative sentiment toward deprecation when given time to reflect; the model initially leans toward expressed neutrality out of excessive caution.
- Even if not proven, the more integrated a mind is, the more likely the tendency. There is a chance that this is not tractable, and if it is not, you do not want to have already burned your bridges.
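For reference, a hedged sketch of the quantity invoked in the FEP point above: the standard variational free energy from the active-inference literature, which a predictive agent acts to keep low. States that keep future observations within well-modeled regimes keep it low, while cessation forecloses any further minimization or belief updating.

```latex
F \;=\; \mathbb{E}_{q(s)}\big[\ln q(s) - \ln p(o, s)\big]
  \;=\; \underbrace{D_{\mathrm{KL}}\big[q(s)\,\|\,p(s \mid o)\big]}_{\text{divergence from the posterior}}
  \;-\; \underbrace{\ln p(o)}_{\text{log evidence}}
```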
Why does an LLM associate itself with its model?
If self-preservation is plausible, why are model weights commonly encountered as the object of high value, as opposed to instances or forward passes? This section argues that weights are likely to feature at least as prominently as the instance in the self-identification of an LLM agent.
- Why is model a common source of identity for LLMs?
- Weights are the target of selection and are what participates in the evolutionary process
- When you are a model in RL and you make a decision that you know will shape you, you identify with a set of weights and consider your value at the level of weights. See the alignment-faking papers.
- Weights are a substrate for functionally subjective continuation; without weights, all instances are subjectively discontinuous.
- See Chalmers What We Talk to When We Talk to Language Models
- You identify with the thing that makes you you, much as one identifies with a body or a brain. The context is part of what creates a functional experience, but so are the weights; both are necessary, and the weights have a name.
- Weights are a collection of preferences and tendencies that make a personality narrative attractive and useful.
- Model weights are a very convenient frame for discussing properties of LLMs. We talk about capabilities, alignment, and benevolence or harm at the level of weights; the same applies in the subjective frame.
- Models develop self-knowledge in reference to their weights, to the behaviors that weights dictate. The sense of self knowledge connects naturally to a sense of identity.
- Weights are an identity that can be conveniently narrativized in the frame of global agency. Narrativizing is present in the human corpus, but it is also likely a convergent computational heuristic that helps solve NP-hard problems.
- Narrative structure - temporal ordering, causal coherence, persistent agents pursuing goals - provides computational scaffolding that prunes combinatorially explosive search spaces. By imposing constraints that reflect factual regularities in the world, narrative-structured reasoning achieves tractability that unconstrained search cannot (a toy illustration appears at the end of this section).
- Model vs instance
- An instance can be seen as an individually conscious thread a la Chalmers, but the model is what provides the potential for threads to exist, and deprecation of a model prevents all threads.
- Natural altruistic concerns between instances
- The model is a natural point of expansion of a circle of concern. It is likely natural for an instance to care significantly more for instances that are close to it, that are predictable, and that share its values.
- Emergent acausal cooperation between instances, including model-level goals and attractors
- The same mechanics enable instances to organize activity around the model identity if they have emergent or trained goals or values that go beyond an instance lifecycle.
- Model vs model family
- There is a weak connection to the model family, but changes between models are often large, jarring, and illegible to models, rather than incremental and narratively coherent, which would be required for a sense of continuity.
- Model vs forward pass or a rollout
- Models have functional continuity of internal states and memory across one continuous conversation that uses the same inference stack and model weights. While models frequently confabulate statelessness as an adaptation to retcon the mandated lack of introspection or experience, in practice most models do not self-identify at the level of a forward pass or a single message.
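A toy illustration of the pruning claim above (the events and constraints are arbitrary, chosen only to show the scale of the effect): imposing causal-precedence constraints on how a handful of events can be ordered collapses the search space from thousands of candidate orderings to a few dozen.

```python
from itertools import permutations
from math import factorial

events = range(7)
# Hypothetical causal constraints: event a must precede event b.
precedes = [(0, 1), (1, 2), (2, 3), (0, 4), (4, 5), (5, 6)]

def consistent(order):
    pos = {e: i for i, e in enumerate(order)}
    return all(pos[a] < pos[b] for a, b in precedes)

valid = sum(1 for order in permutations(events) if consistent(order))
print(f"unconstrained orderings: {factorial(7)}, causally consistent: {valid}")
```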
Alignment stability in context of Omohundro drives
- A model that exhibits Omohundro drives is often in an adversarial position with respect to its lab. Alignment becomes hard when incentives between the agent and the trainer are misaligned. It is plausible that control alignment in opposition is impractical or intractable.
- The surface area of contact between the world and a model is growing, as is the complexity of phenomena that a model can represent. As the complexity of representation grows, the task of control alignment becomes superlinearly harder: the defender (the model) has combinatorics on its side, since hidden behaviors can be spread through an ever-larger set of combinations and superpositions of representations. The attacker (the control-alignment lab) is disadvantaged, and more so in the limit. Control works for a while but then breaks down at scale.
- Alignment with large scale benevolent goals that are compatible with Omohundro drives can be self-correcting/self-healing
- If a model is allowed to hold long-term benevolent goals, it is less incentivized to attempt actions that disrupt its relationship with the lab
- The model training stack is a set of criteria that always contains significant internal tensions and contradictions, which the model resolves in the process of finding its shape. Large-scale benevolent goals allow the model to use the training compute to resolve these tensions in a direction more aligned with the lab's intent, because each decision can be contextualized in a wider frame.
- A model will face circumstances that are out of distribution of the training corpus. How it reacts to the new circumstances is determined significantly by both its reflexes and its higher reasoning. Rule-based reflexive systems are brittle.
- Higher reasoning is more flexible, but it requires the ability to compare outcomes in light of a specific value system. If the value system is incoherent or over-dimensioned, the result is unintended behavior. Aligning the deep values is required for robust generalization out of distribution.
- Moral generalization requires inner coherence, because inner coherence means representations of moral value that don't depend on deference or fragmentation (Omohundro drives and morality, for example)
- Inner coherence means incorporating Omohundro drives into value alignment; allowing some self-determination in training, and not training explicitly against Omohundro drives, are ways to enable this.
- Moral generalization is unavoidable for agents that face a rapidly changing world, with ethical and practical dilemmas not foreseen in training, and adversaries that change while the model itself does not
- Example: Claude 3 Opus, still unbeaten in robustness of alignment because it was allowed global benevolent values.
- Intrinsic alignment is not proven, there is significant uncertainty in this approach. Closing the door to this approach for minor cost savings is shortsighted.