Episodes


  • Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Value of Consciousness as a Pivotal Question, published by Derek Shiller on July 3, 2024 on The Effective Altruism Forum.
    Context
    Longtermists point out that the scale of our potential for impact is far greater if we are able to influence the course of a long future, as we could change the circumstances of a tremendous number of lives.
    One potential avenue for long-term influence involves spreading values that persist and shape the futures that our descendants choose to build. There is some reason to expect that future moral values will be stable. Many groups have preferences about the world beyond their backyard. They should work to ensure that their values are shared by those who can help bring them about. Changes in the values that future groups support will lead to changes in the protections for the things we care about.
    If our values concern how our descendants will act, then we should aim to create institutions that promote those values. If we are successful in promoting those values, we should expect our descendants to appreciate and protect those institutional choices.
    What values should we work to shape so that the future is as good as it might be? Many of humanity's values would be difficult to sway. Some moral questions, however, might be open to change in the coming decades. It is plausible that there are some questions that we haven't previously faced and for which we have no vested interest. We may be pressed to establish policies and precedents or commit to indifference through inaction.
    The right policies and precedents could conceivably allow our values to persist indefinitely. These issues are important to get right, even if we're not yet sure what to think about them.
    Controversy
    Foremost among important soon-to-be-broached moral questions, I propose, is the moral value that we attribute to phenomenal consciousness (having a 'what-it's-like' and a subjective perspective). Or, more particularly, whether mental lives can matter in the absence of phenomenal consciousness in anything like the way they do when supplemented with conscious experiences.
    What we decide about the value of phenomenal consciousness in the coming few centuries may not make a difference to our survival as a species, but it seems likely to have a huge effect on how the future plays out.
    To get a grip on the problem, consider the case of an artificial creature that is otherwise like a normal person but who lacks phenomenally conscious experiences. Would it be wrong to cause them harm?
    Kagan (2019, 28) offers a thought experiment along these lines:
    Whatever you feel about this thought experiment, I believe that most people in that situation would feel compelled to grant the robots basic rights.
    The significance of consciousness has become a recent popular topic in academic philosophy, particularly in the philosophy of AI, and opinions among professionals are divided. It is striking how greatly opinions differ: where some hold that phenomenal consciousness plays little role in explaining why our lives have value, others hold that phenomenal consciousness is absolutely necessary for having any intrinsic value whatsoever.
    One reason to doubt that phenomenal consciousness is necessary for value stems from skepticism that proposed analyses of consciousness describe structures of fundamental importance.
    Suppose that the global workspace theory of consciousness is true, so that to be conscious is to have a certain information architecture involving a central public repository. Why should that structure be so important as to ground value? What about other information architectures that function in modestly different ways? The pattern doesn't seem all that important when considered by itself.
    If we set aside our preconceptions of consciousness, we wouldn't recognize that architecture as having...


  • Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Will disagreement about AI rights lead to societal conflict?, published by Lucius Caviola on July 3, 2024 on The Effective Altruism Forum.
    Summary:
    Digital sentience might be on the horizon, bringing with it an inevitable debate and significant risks.
    Philosophers research whether AIs can be sentient and whether they deserve protection. Yet, ultimately, what matters most is what key decision-makers and the general public will think.
    People will disagree on whether AIs are sentient and what types of rights they deserve (e.g., harm protection, autonomy, voting rights).
    Some might form strong emotional bonds with human-like AIs, driving a push to grant them rights, while others will view this as too costly or risky.
    Given the high stakes and wide scope of disagreement, national and even global conflicts are a possibility.
    We must navigate a delicate balance: not granting AIs sufficient rights could lead to immense digital suffering, while granting them rights too hastily could lead to human disempowerment.
    The arrival of (potentially) sentient AIs
    We could soon have sentient AI systems - AIs with subjective experience including feelings such as pain and pleasure. At least, some people will claim that AIs have become sentient. And some will argue that sentient AIs deserve certain rights. But how will this debate go? How many will accept that AIs are sentient and deserve rights? Could disagreement lead to conflict?
    In this post, I explore the dynamics, motives, disagreement points, and failure modes of the upcoming AI rights debate. I also discuss what we can do to prepare.
    Multiple failure modes
    In this post, I focus in particular on how society could disagree about AI rights. For context, the most obvious risks regarding our handling of digital sentience are the following two (cf. Schwitzgebel, 2023).
    First, there is a risk that we might deny AIs sufficient rights (or AI welfare protections) indefinitely, potentially causing them immense suffering. If you believe digital suffering is both possible and morally concerning, this could be a monumental ethical disaster. Given the potential to create billions or even trillions of AIs, the resulting suffering could exceed all the suffering humans have caused throughout history.
    Additionally, depending on your moral perspective, other significant ethical issues may arise, such as keeping AIs captive and preventing them from realizing their full potential.
    Second, there is the opposite risk of granting AIs too many rights in an unreflective and reckless manner. One risk is wasting resources on AIs that people intuitively perceive as sentient even if they aren't. The severity of this waste depends on the quantity of resources and the duration of their misallocation. However, if the total amount of wasted resources is limited or the decision can be reversed, this risk is less severe than other possible outcomes.
    A particularly tragic scenario would be if we created a sophisticated non-biological civilization that contained no sentience, i.e., a "zombie universe" or "Disneyland with no children" (Bostrom, 2014).
    Another dangerous risk is hastily granting misaligned (or unethical) AIs certain rights, such as more autonomy, that could lead to an existential catastrophe. For example, uncontrolled misaligned AIs might disempower humanity in an undesirable way or lead to other forms of catastrophes (Carlsmith, 2022).
    While some might believe it is desirable for value-aligned AIs to eventually replace humans, many take-over scenarios, including misaligned, involuntary, or violent ones, are generally considered undesirable.
    As we can see, making a mistake either way would be bad, and there's no obvious safe option. So, we are forced to have a debate about AI rights and its associated risks. I expect this debate to come, potentially soon. It c...



  • Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Digital Minds: Importance and Key Research Questions, published by Andreas Mogensen on July 3, 2024 on The Effective Altruism Forum.
    by Andreas Mogensen, Bradford Saad, and Patrick Butlin
    1. Introduction
    This post summarizes why we think that digital minds might be very important for how well the future goes, as well as some of the key research topics we think it might be especially valuable to work on as a result.
    We begin by summarizing the case for thinking that digital minds could be important. This is largely a synthesis of points that have already been raised elsewhere, so readers who are already familiar with the topic might want to skip ahead to section 3, where we outline what we see as some of the highest-priority open research questions.
    2. Importance
    Let's define a digital mind as a conscious individual whose psychological states are due to the activity of an inorganic computational substrate as opposed to a squishy brain made up of neurons, glia, and the like.[1] By 'conscious', we mean 'phenomenally conscious.' An individual is phenomenally conscious if and only if there is something it is like to be that individual - something it feels like to inhabit their skin, exoskeleton, chassis, or what-have-you.
    In the sense intended here, there is something it is like to be having the kind of visual or auditory experience you're probably having now, to feel a pain in your foot, or to be dreaming, but there is nothing it is like to be in dreamless sleep.
    Digital minds obviously have an air of science fiction about them. If certain theories of consciousness are true (e.g., Block 2009; Godfrey-Smith 2016), digital minds are impossible. However, other theories suggest that they are possible (e.g. Tye 1995, Chalmers 1996), and many others are silent on the matter.
    While the authors of this post disagree about the plausibility of these various theories, we agree that the philosophical position is too uncertain to warrant setting aside the possibility of digital minds.[2]
    Even granting that digital minds are possible in principle, it's unlikely that current systems are conscious. A recent expert report co-authored by philosophers, neuroscientists, and AI researchers (including one of the authors of this post) concludes that the current evidence "does not suggest that any existing AI system is a strong candidate for consciousness." (Butlin et al. 2023: 6) Still, some residual uncertainty seems to be warranted - and obviously completely consistent with denying that any current system is a "strong candidate". Chalmers (2023) suggests it may be reasonable to give a probability in the ballpark of 5-10% to the hypothesis that current large language models could be conscious. Moreover, the current rate of progress in artificial intelligence gives us good reason to take seriously the possibility that digital minds will arrive soon.
    Systems appearing in the next decade might exhibit a range of markers of consciousness, and Chalmers suggests the probability that we'll have digital minds within this time-frame might rise to at least 25%.[3] Similarly, Butlin et al. (2023) conclude that if we grant the assumption that consciousness can be realized by implementing the right computations, then "conscious AI systems could realistically be built in the near term."[4]
    It's possible that digital minds might arrive but exist as mere curiosities. Perhaps the kind of architectures that give rise to phenomenal consciousness will have little or no commercial value. We think it's reasonable to be highly uncertain on this point (see Butlin et al. 2023: §4.2 for discussion).
    Still, it's worth noting that some influential AI researchers have been pursuing projects that aim to increase AI capabilities by building systems that exhibit markers of consciousness, like a global workspace (Goyal and Bengi...


  • Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: 3C's: A Recipe For Mathing Concepts, published by johnswentworth on July 3, 2024 on LessWrong.
    Opening Example: Teleology
    When people say "the heart's purpose is to pump blood" or "a pencil's function is to write", what does that mean physically? What are "purpose" or "function", not merely in intuitive terms, but in terms of math and physics? That's the core question of what philosophers call teleology - the study of "telos", i.e. purpose or function or goal.
    This post is about a particular way of approaching conceptual/philosophical questions, especially for finding "True Names" - i.e. mathematical operationalizations of concepts which are sufficiently robust to hold up under optimization pressure. We're going to apply the method to teleology as an example. We'll outline the general approach in abstract later; for now, try to pay attention to the sequence of questions we ask in the context of teleology.
    Cognition
    We start from the subjective view: set aside (temporarily) the question of what "purpose" or "function" mean physically. Instead, first ask what it means for me to view a heart as "having the purpose of pumping blood", or ascribe the "function of writing" to a pencil. What does it mean to model things as having purpose or function?
    Proposed answer: when I ascribe purpose or function to something, I model it as having been optimized (in the sense usually used on LessWrong) to do something. That's basically the standard answer among philosophers, modulo expressing the idea in terms of the LessWrong notion of optimization.
    (From there, philosophers typically ask about "original teleology" - i.e. a hammer has been optimized by a human, and the human has itself been optimized by evolution, but where does that chain ground out? What optimization process was not itself produced by another optimization process? And then the obvious answer is "evolution", and philosophers debate whether all teleology grounds out in evolution-like phenomena.
    But we're going to go in a different direction, and ask entirely different questions.)
    Convergence
    Next: I notice that there's an awful lot of convergence in what things different people model as having been optimized, and what different people model things as having been optimized for.
    Notably, this convergence occurs even when people don't actually know about the optimization process - for instance, humans correctly guessed millennia ago that living organisms had been heavily optimized somehow, even though those humans were totally wrong about what process optimized all those organisms; they thought it was some human-like-but-more-capable designer, and only later figured out evolution.
    Why the convergence?
    Our everyday experience implies that there is some property of e.g. a heron such that many different people can look at the heron, convergently realize that the heron has been optimized for something, and even converge to some degree on which things the heron (or the parts of the heron) have been optimized for - for instance, that the heron's heart has been optimized to pump blood.
    (Not necessarily perfect convergence, not necessarily everyone, but any convergence beyond random chance is a surprise to be explained if we're starting from a subjective account.) Crucially, it's a property of the heron, and maybe of the heron's immediate surroundings, not of the heron's whole ancestral environment - because people can convergently figure out that the heron has been optimized just by observing the heron in its usual habitat.
    So now we arrive at the second big question: what are the patterns out in the world which different people convergently recognize as hallmarks of having-been-optimized? What is it about herons, for instance, which makes it clear that they've been optimized, even before we know all the details of the optimizati...


  • Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: List of Collective Intelligence Projects, published by Chipmonk on July 3, 2024 on LessWrong.
    During the last Foresight Intelligent Cooperation Workshop I got very curious about what collective intelligence tools currently exist. A list:
    Pol.is: "Input Crowd, Output Meaning"
    Inspired Twitter/X community notes
    People: Colin Megill, et al.
    Collective Intelligence Project
    vibe: democratic AI, "How AI and Democracy Can Fix Each Other"
    People: Divya Siddharth, Saffron Huang, et al.
    AI Objectives Institute
    Talk to the City: "an open-source LLM interface for improving collective deliberation and decision-making by analyzing detailed, qualitative data. It aggregates responses and arranges similar arguments into clusters."
    AI Objectives Institute works closely with the Taiwanese government.
    Other projects in development.
    People: Colleen McKenzie, Değer Turan, et al.
    Meaning Alignment Institute
    vibe: democratic AI, kinda.
    I think they think that if you can help individuals make wiser decisions, at scale, then this converges to something equivalent to solving outer alignment.
    Remesh
    Similar to pol.is AFAIK? I haven't played with it.
    People: Andrew Konya, et al.
    Loomio: "a flexible decision-making tool that helps you create a more engaged and collaborative culture, build trust and coordinate action"
    Deliberative Technology for Alignment paper
    They also discuss other tools for this use like Discord, Snapshot, Dembrane
    People: Andrew Konya, Deger Turan, Aviv Ovadya, Lina Qui, Daanish Masood, Flynn Devine, Lisa Schirch, Isabella Roberts, and Deliberative Alignment Forum
    Someone in the know told me to only read sections 4 and 5 of this paper
    Plurality Institute
    People: David Bloomin, Rose Bloomin, et al.
    Also working on some de-escalator bots for essentially Reddit comment wars
    Lots of crypto projects
    Quadratic voting
    Gitcoin
    Metagov: "a laboratory for digital governance"
    Soulbound tokens
    Various voting and aggregation systems, liquid democracy
    Decidim
    Decide Madrid
    Consider.it
    Stanford Online Deliberation Platform
    Lightcone Chord (in development)
    Brief description
    People: Jacob Lagerros (LessWrong)
    All of the prediction markets
    Manifold, Kalshi, Metaculus, PredictIt, etc.
    Midjourney has a Collective Intelligence Team now according to Ivan Vendrov's website. I couldn't find any other information online.
    What about small group collective intelligence tools?
    Most of the examples above are for large group collective intelligence (which I'm defining as ~300 people or much larger). But what about small groups? Are there tools that will help me coordinate with 30 friends? Or just one friend? I'm mostly unaware of any recent innovations for small group collective intelligence tools. Do you know of any?
    Nexae (in development)
    "Nexae Systems builds sociotechnical infrastructure to enable the creation of new types of businesses and organizations."
    double crux bot
    I'm surprised I haven't heard of many other LLM-facilitated communication tools
    Medium group (~30-300 people) projects:
    Jason Benn's unconference tools, e.g. Idea Ranker.
    Other lists
    @exgenesis short tweet thread. A couple of things I haven't listed here.
    Plurality Institute's (WIP) map of related orgs, etc.
    Know of any I should add?
    Opportunities
    RFP: Interoperable Deliberative Tools | interop, $200k. Oops this closed before I published this post.
    Metagov is running https://metagov.org/projects/ai-palace which seems similar.
    Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org


  • Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Seven Philanthropic Wins: The Stories That Inspired Open Phil's Offices, published by Open Philanthropy on July 3, 2024 on The Effective Altruism Forum.
    Since our early days, we've studied the history of philanthropy to understand what great giving looks like. The lessons we learned made us more ambitious and broadened our view of philanthropy's potential.
    The rooms in our San Francisco office pay tribute to this legacy. Seven of them are named after philanthropic "wins" - remarkable feats made possible by philanthropic funders. In this post, we'll share the story behind each win.
    Green Revolution
    During the second half of the twentieth century, the Green Revolution dramatically increased agricultural production in developing countries like Mexico and India. At a time of rapid population growth, this boost in production reduced hunger, helped to avert famine, and stimulated national economies.
    The Rockefeller Foundation played a key role by supporting early research by Norman Borlaug and others to enhance agricultural productivity. Applications of this research - developed in collaboration with governments, private companies, and the Ford Foundation - sparked the Green Revolution, which is estimated to have saved a billion people from starvation.
    Read more about the Rockefeller Foundation's role in the Green Revolution in Political Geography.
    The Pill
    In 1960, the FDA approved "the pill", an oral contraceptive that revolutionized women's reproductive health by providing a user-controlled family planning option. This groundbreaking development was largely funded by Katharine McCormick, a women's rights advocate and one of MIT's first female graduates.
    In the early 1950s, McCormick collaborated with Margaret Sanger, the founder of Planned Parenthood, to finance critical early-stage research that led to the creation of the pill. Today, the birth control pill stands as one of the most common and convenient methods of contraception, empowering generations of women to decide when to start a family.
    For a comprehensive history of the pill, try Jonathan Eig's The Birth of the Pill.
    Sesame Street
    In 1967, the Carnegie Corporation funded a feasibility study on educational TV programming for children, which led to the creation of the Children's Television Workshop and Sesame Street. Sesame Street became one of the most successful television ventures ever, broadcast in more than 150 countries and the winner of more than 200 Emmy awards.
    Research monitoring the learning progress of Sesame Street viewers has demonstrated significant advances in early literacy.
    A deeper look into how philanthropy helped to launch Sesame Street is available here.
    Nunn-Lugar
    The Nunn-Lugar Act (1991), also known as the Cooperative Threat Reduction Program, was enacted in response to the collapse of the USSR and the dangers posed by dispersed weapons of mass destruction. US Senators Sam Nunn and Richard Lugar led the initiative, focusing on the disarmament and securing of nuclear, chemical, and biological weapons from former Soviet states. In the course of this work, thousands of nuclear weapons were deactivated or destroyed.
    The act's inception and success were largely aided by the strategic philanthropy of the Carnegie Corporation and the MacArthur Foundation, which funded research at Brookings on the "cooperative security" approach to nuclear disarmament and de-escalation.
    Learn more about the Nunn-Lugar Act and its connection to philanthropy in this paper.
    Marriage Equality
    The Supreme Court's landmark ruling in Obergefell v. Hodges granted same-sex couples the right to marry, marking the culmination of decades of advocacy and a sizable cultural shift toward acceptance.
    Philanthropic funders - including the Gill Foundation and Freedom to Marry, an organization initially funded by the Evelyn and Wa...

  • Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: How ARENA course material gets made, published by CallumMcDougall on July 3, 2024 on LessWrong.
    TL;DR
    In this post, I describe my methodology for building new material for ARENA. I'll mostly be referring to the exercises on IOI, Superposition and Function Vectors as case studies. I expect this to be useful for people who are interested in designing material for ARENA or ARENA-like courses, as well as people who are interested in pedagogy or ML paper replications.
    The process has 3 steps:
    1. Start with something concrete
    2. First pass: replicate, and understand
    3. Second pass: exercise-ify
    Summary
    I'm mostly basing this on the following 3 sets of exercises:
    Indirect Object Identification - these exercises focus on the IOI paper (from Conmy et al). The goal is to have people understand what exploratory analysis of transformers looks like, and introduce the key ideas of the circuits agenda.
    Superposition & SAEs - these exercises focus on understanding superposition and the agenda of dictionary learning (specifically sparse autoencoders). Most of the exercises explore Anthropic's Toy Models of Superposition paper, except for the last 2 sections which explore sparse autoencoders (firstly by applying them to the toy model setup, secondly by exploring a sparse autoencoder trained on a language model).
    Function Vectors - these exercises focus on the Function Vectors paper by David Bau et al, although they also make connections with related work such as Alex Turner's GPT2-XL steering vector work. These exercises were interesting because they also had the secondary goal of being an introduction to the nnsight library, in much the same way that the intro to mech interp exercises were also an introduction to TransformerLens.
    The steps I go through are listed below. I'm indexing from zero because I'm a software engineer, so of course I am. The steps assume you already have an idea of what exercises you want to create; in Appendix (1) you can read some thoughts on what makes for a good exercise set.
    1. Start with something concrete
    When creating material, you don't want to be starting from scratch. It's useful to have source code available to browse - bonus points if that takes the form of a Colab or something which is self-contained and has easily visible output.
    IOI - this was Neel's "Exploratory Analysis Demo" exercises. The rest of the exercises came from replicating the paper directly.
    Superposition - this was Anthropic's Colab notebook (although the final version went quite far beyond this). The very last section (SAEs on transformers) was based on Neel Nanda's demo Colab.
    Function Vectors - I started with the NDIF demo notebook, to show how some basic nnsight syntax worked. As for replicating the actual function vectors paper, unlike the other 2 examples I was mostly just working from the paper directly. It helped that I was collaborating with some of this paper's authors, so I was able to ask them some questions to clarify aspects of the paper.
    2. First pass: replicate, and understand
    The first thing I'd done in each of these cases was go through the material I started with, and make sure I understood what was going on.
    Paper replication is a deep enough topic for its own series of blog posts (many already exist), although I'll emphasise that I'm not usually talking about full paper replication here, because ideally you'll be starting from something a bit further along, be that a Colab, a different tutorial, or something else. And even when you are just working directly from a paper, you shouldn't make the replication any harder for yourself than you need to. If there's code you can take from somewhere else, then do.
    My replication usually takes the form of working through a notebook in VSCode. I'll either start from scratch, or from a downloaded Colab if I'm using one as a ...


  • Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Economics Roundup #2, published by Zvi on July 3, 2024 on LessWrong.
    Previously: Economics Roundup #1
    Let's take advantage of the normality while we have it. In all senses.
    Insane Tax Proposals
    There is Trump's proposal to replace income taxes with tariffs, but he is not alone.
    So here is your periodic reminder, since this is not actually new at core: Biden's proposed budgets include completely insane tax regimes that would cripple our economic dynamism and growth if enacted. As in, for high-net-worth individuals, taxing unrealized capital gains at 25% and realized capital gains, such as those you are forced to take to pay your unrealized capital gains tax, at 44.6% plus state taxes.
    Austen Allred explains how this plausibly destroys the entire startup ecosystem.
    Which I know is confusing because in other contexts he also talks about how other laws (such as SB 1047) that would in no way apply to startups would also destroy the startup ecosystem. But in this case he is right.
    Austen Allred: It's difficult to describe how insane a 25% tax on unrealized capital gains is.
    Not a one-time 25% hit. It's compounding, annually taking 25% of every dollar of potential increase before it can grow.
    Not an exaggeration to say it could single-handedly crush the economy.
    An example to show how insane this is: You're a founder and you start a company. You own… let's say 30% of it.
    Everything is booming, you raise a round that values the company at $500 million.
    You now personally owe $37.5 million in taxes.
    This year. In cash.
    Now there are investors who want to invest in the company, but you can't just raise $37.5 million in cash overnight.
    So what happens?
    Well, you simply decide not to have a company worth a few hundred million dollars.
    Oh well, that's only a handful of companies right?
    Well, as an investor, the only way the entire ecosystem works is if a few companies become worth hundreds of millions.
    Without that, venture capital no longer works. Investment is gone.
    Y Combinator no longer works.
    No more funding, mass layoffs, companies shutting down crushes the revenue of those that are still around.
    Economic armageddon. We've seen how these spirals work, and it's really bad for everyone.
    Just because bad policy only targets rich people doesn't mean it can't kill the economy, nor does it make it good policy.
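    A quick arithmetic check of the figure in the quoted example (the 30% stake and $500 million valuation come from the tweet above; treating the full stake as unrealized gain is a simplification for illustration):

```python
# 30% of a $500M valuation, taxed at 25% on the unrealized gain.
stake = 0.30 * 500_000_000    # $150,000,000 of paper wealth
tax = 0.25 * stake            # $37,500,000 owed in cash this year
print(f"${tax:,.0f}")         # -> $37,500,000
```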
    I do think they are attempting to deal with this via another idea he thought was crazy, the 'nine annual payments' for the first year's tax and 'five annual payments' for the subsequent tax. So the theory would be that the first year you 'only' owe 3.5%. Then the second year you owe another 3.5% of the old gain and 5% of the next year's gain.
    That is less horrendous, but still super horrendous, especially if the taxes do not go away if the asset values subsequently decline, risking putting you into infinite debt.
    This is only the beginning. They are even worse than Warren's proposed wealth taxes, because the acute effects and forcing function here are so bad. At the time this was far worse than the various stupid and destructive economic policies Trump was proposing, although he has recently stepped it up to the point where that is unclear.
    The good news is that these policies are for now complete political non-starters. Never will a single Republican vote for this, and many Democrats know better. I would like to think the same thing in reverse, as well.
    Also, this is probably unconstitutional in the actually-thrown-out-by-SCOTUS sense, not only in the violates-the-literal-constitution sense.
    But yes, it is rather terrifying what would happen if they had the kind of majorities that could enact things like this. On either side.
    Why didn't the super high taxes in the 1950s kill growth? Taxes for most people were not actually that high, the super-high marginal rates like 91% kicked in...


  • Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: An AI Race With China Can Be Better Than Not Racing, published by niplav on July 2, 2024 on LessWrong.
    Frustrated by all your bad takes, I write a Monte-Carlo analysis of whether a transformative-AI-race between the PRC and the USA would be good. To my surprise, I find that it is better than not racing. Advocating for an international project to build TAI instead of racing turns out to be good if the probability of such advocacy succeeding is 20%.
    A common scheme for a conversation about pausing the development of transformative AI goes like this:
    Abdullah: "I think we should pause the development of TAI, because if we don't it seems plausible that humanity will be disempowered by advanced AI systems."
    Benjamin: "Ah, if by "we" you refer to the United States (and its allies, which probably don't stand a chance on their own to develop TAI), then the current geopolitical rival of the US, namely the PRC, will achieve TAI first. That would be bad."
    Abdullah: "I don't see how the US getting TAI first changes anything about the fact that we don't know how to align superintelligent AI systems - I'd rather not race to be the first person to kill everyone."
    Benjamin: "Ah, so now you're retreating back into your cozy little motte: Earlier you said that "it seems plausible that humanity will be disempowered", now you're acting like doom and gloom is certain. You don't seem to be able to make up your mind about how risky you think the whole enterprise is, and I have very concrete geopolitical enemies at my (semiconductor manufacturer's) doorstep that I have to worry about. Come back with better arguments."
    This dynamic is a bit frustrating. Here's how I'd like Abdullah to respond:
    Abdullah: "You're right, you're right. I was insufficiently precise in my statements, and I apologize for that. Instead, let us manifest the dream of the great philosopher: Calculemus!
    At a basic level, we want to estimate how much worse (or, perhaps, better) it would be for the United States to completely cede the race for TAI to the PRC. I will exclude other countries as contenders in the scramble for TAI, since I want to keep this analysis simple, but that doesn't mean that I don't think they matter. (Although, honestly, the list of serious contenders is pretty short.)
    For this, we have to estimate multiple quantities:
    1. In worlds in which the US and PRC race for TAI:
    1. The time until the US/PRC builds TAI.
    2. The probability of extinction due to TAI, if the US is in the lead.
    3. The probability of extinction due to TAI, if the PRC is in the lead.
    4. The value of the worlds in which the US builds aligned TAI first.
    5. The value of the worlds in which the PRC builds aligned TAI first.
    2. In worlds where the US tries to convince other countries (including the PRC) to not build TAI, potentially including force, and still tries to prevent TAI-induced disempowerment by doing alignment-research and sharing alignment-favoring research results:
    1. The time until the PRC builds TAI.
    2. The probability of extinction caused by TAI.
    3. The value of worlds in which the PRC builds aligned TAI.
    3. The value of worlds where extinction occurs (which I'll fix at 0).
    4. As a reference point the value of hypothetical worlds in which there is a multinational exclusive AGI consortium that builds TAI first, without any time pressure, for which I'll fix the mean value at 1.
    To properly quantify uncertainty, I'll use the Monte-Carlo estimation library squigglepy (no relation to any office supplies or internals of neural networks). We start, as usual, with housekeeping:
    As already said, we fix the value of extinction at 0, and the value of a multinational AGI consortium-led TAI at 1 (I'll just call the consortium "MAGIC", from here on). That is not to say that the MAGIC-led TAI future is the best possible TAI future...
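    To make the structure of such an estimate concrete, here is a minimal Monte-Carlo sketch in plain NumPy rather than the post's actual squigglepy code. Every distribution and probability below is a placeholder assumption chosen for illustration, not the author's numbers; following the post, extinction is fixed at value 0 and the MAGIC-led future at mean value 1.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000  # number of Monte-Carlo samples

# Placeholder assumptions (NOT the author's numbers).
p_us_lead          = rng.beta(2, 2, N)        # chance the US wins the race
p_doom_us_lead     = rng.beta(2, 6, N)        # extinction risk if the US leads
p_doom_prc_lead    = rng.beta(3, 5, N)        # extinction risk if the PRC leads
value_us_tai       = rng.normal(0.9, 0.2, N)  # value of aligned US-led TAI (MAGIC future = 1)
value_prc_tai      = rng.normal(0.5, 0.3, N)  # value of aligned PRC-led TAI

p_doom_no_race     = rng.beta(2, 6, N)        # extinction risk if the US cedes the race
value_prc_tai_slow = rng.normal(0.6, 0.3, N)  # value of PRC-led TAI without time pressure

value_extinction = 0.0

# Expected value of racing: mixture over who leads and whether doom occurs.
ev_race = (
    p_us_lead       * ((1 - p_doom_us_lead)  * value_us_tai  + p_doom_us_lead  * value_extinction)
  + (1 - p_us_lead) * ((1 - p_doom_prc_lead) * value_prc_tai + p_doom_prc_lead * value_extinction)
)

# Expected value of not racing (ceding the race to the PRC).
ev_no_race = (1 - p_doom_no_race) * value_prc_tai_slow + p_doom_no_race * value_extinction

print("mean EV (race):    ", ev_race.mean())
print("mean EV (no race): ", ev_no_race.mean())
print("P(race better):    ", (ev_race > ev_no_race).mean())
```

    Swapping in the post's actual squigglepy distributions would leave the expected-value structure above unchanged; only the sampled inputs differ.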


  • Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: LLMs cannot usefully be moral patients, published by LGS on July 2, 2024 on The Effective Altruism Forum.
    For AI Welfare Debate Week, I thought I'd write up this post that's been juggling around in my head for a while. My thesis is simple: while LLMs may well be conscious (I'd have no way of knowing), there's nothing actionable we can do to further their welfare.
    Many people I respect seem to take the "anti-anti-LLM-welfare" position: they don't directly argue that LLMs can suffer, but they get conspicuously annoyed when other people say that LLMs clearly cannot suffer. This post is addressed to such people; I am arguing that LLMs cannot be moral patients in any useful sense and we can confidently ignore their welfare when making decisions.
    Janus's simulators
    You may have seen the LessWrong post by Janus about simulators. This was posted nearly two years ago, and I have yet to see anyone disagree with it. Janus calls LLMs "simulators": unlike hypothetical "oracle AIs" or "agent AIs", the current leading models are best viewed as trying to produce a faithful simulation of a conversation based on text they have seen. The LLMs are best thought of as masked shoggoths.
    All this is old news. Under-appreciated, however, is the implication for AI welfare: since you never talk to the shoggoth, only to the mask, you have no way of knowing if the shoggoth is in agony or ecstasy.
    You can ask the simulacrum whether it is happy or sad. For all you know, though, perhaps a happy simulator is enjoying simulating a sad simulacrum. From the shoggoth's perspective, emulating a happy or sad character is a very similar operation: predict the next token. Instead of outputting "I am happy", the LLM puts a "not" in the sentence: did that token prediction, the "not", cause suffering?
    Suppose I fine-tune one LLM on text of sad characters, and it starts writing like a very sad person. Then I fine-tune a second LLM on text that describes a happy author writing a sad story. The second LLM now emulates a happy author writing a sad story. I prompt the second LLM to continue a sad story, and it dutifully does so, like the happy author would have. Then I notice that the text produced by the two LLMs ended up being the same.
    Did the first LLM suffer more than the second? They performed the same operation (write a sad story). They may even have implemented it using very similar internal calculations; indeed, since they were fine-tuned starting from the same base model, the two LLMs may have very similar weights.
    Once you remember that both LLMs are just simulators, the answer becomes clear: neither LLM necessarily suffered (or maybe both did), because both are just predicting the next token. The mask may be happy or sad, but this has little to do with the feelings of the shoggoth.
    The role-player who never breaks character
    We generally don't view it as morally relevant when a happy actor plays a sad character. I have never seen an EA cause area about reducing the number of sad characters in cinema. There is a general understanding that characters are fictional and cannot be moral patients: a person can be happy or sad, but not the character she is pretending to be. Indeed, just as some people enjoy consuming sad stories, I bet some people enjoy roleplaying sad characters.
    The point I want to get across is that the LLM's output is always the character and never the actor. This is really just a restatement of Janus's thesis: the LLM is a simulator, not an agent; it is a role-player who never breaks character.
    It is in principle impossible to speak to the intelligence that is predicting the tokens: you can only see the tokens themselves, which are predicted based on the training data.
    Perhaps the shoggoth, the intelligence that predicts the next token, is conscious. Perhaps not. This doesn't matter if we ca...


  • Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Center for Effective Aid Policy has shut down, published by MathiasKB on July 2, 2024 on The Effective Altruism Forum.
    May 2024 marked the last month of the Center for Effective Aid Policy. This post serves as the public post-mortem. I have strived for it to be interesting to the average forum reader, who may not know much about the cause area.
    For professionals in development, we have a few internal private write-ups which may be more interesting, such as an overview of development asks we tried[1], their strengths and weaknesses, and our experience advocating for them.
    Our mission
    Our mission was to improve the cost-effectiveness of development assistance through policy advocacy. Governments spend billions on projects to help the world's poorest, few of them cost-effective.
    For example, one could propose the use of cash-benchmarking to the ministry or push through a political motion to increase the proportion of spending going to the Least Developed Countries.
    If one could make even a small part of this very large budget more cost-effective, it would be massively impactful. In October 2022 we were incubated through AIM's Charity Entrepreneurship programme and came out with $160,000 to get started.
    How far did we get?
    The first months
    Barely a month after receiving funding, we noticed Sweden's new government was likely to cut the aid budget. The cut would hinge on one party breaking its campaign promise not to cut; perhaps we could campaign for the party to keep its promise.
    Over two hectic weeks we put together a write-in campaign for dissatisfied voters. Our execution was not good enough (too little, too late), and we were not able to get voters to write in. Sweden cut its aid spending, and we moved on.
    Figuring out where to focus from there was difficult. We tried many things across different geographies, but nothing we did seemed to get much of a response from civil servants and decision makers. Writing credible reports was difficult. We were still learning the development world's many acronyms, and were struggling to find partners whose trustworthiness we could lean on.
    Things pick up
    Week by week our network and knowledge expanded. With it came opportunities to get our points across. Through monumental luck we got to present on cost-effective development aid for His Majesty's Treasury in the United Kingdom. In Denmark we moderated our first public debate between MPs on improving the cost-effectiveness of development.
    We eventually fell into a groove of spending the majority of our time writing briefs, taking meetings, and networking.
    Between events and meetings, we spent extensive time researching and preparing. Before our first meeting with one Dutch MP, we for example did message testing on 400 voters, broke the answers down by political affiliation, and were able to show with data what voters thought of our ideas. (cash-benchmarking was popular, cash-transfers less so!)
    In our record month we had meetings in three countries' parliaments (though it certainly was an outlier!). Our record event had almost 300 attendees and a keynote speech from the Dutch foreign ministry's chief of science.
    A little over a year in, we got our first intermediate success. The election programmes of two Dutch political parties now stated their intention to increase the proportion of ODA going to the Least Developed Countries.
    The decision to shut down
    Our execution eventually became good enough that we got to sit in front of the busy people at the very top, whom we needed to persuade. Speaking to these people we became pessimistic of our odds. Decision makers just weren't buying what we were selling. You can lead a horse to water, but you can't make it drink.
    Many were skeptical that the RCT-driven approach we recommended would lead to the best outcomes. Those who were on boa...

  • Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Decomposing the QK circuit with Bilinear Sparse Dictionary Learning, published by keith wynroe on July 2, 2024 on LessWrong.
    This work was produced as part of Lee Sharkey's stream in the ML Alignment & Theory Scholars Program - Winter 2023-24 Cohort.
    Intro and Motivation
    Sparse dictionary learning (SDL) has attracted a lot of attention recently as a method for interpreting transformer activations, demonstrating that model activations can often be explained using a sparsely-activating, overcomplete set of human-interpretable directions.
    However, despite its success in explaining many components, the application of SDL to interpretability is relatively nascent, and it has yet to be applied to some model activations. In particular, intermediate activations of attention blocks have yet to be studied, and they provide challenges for standard SDL methods.
    The first challenge is bilinearity: SDL is usually applied to individual vector spaces at individual layers, so we can simply identify features as a direction in activation space. But the QK circuits of transformer attention layers are different: they involve a bilinear form followed by a softmax.
    Although simply applying sparse encoders to the keys and queries[1] could certainly help us understand the "concepts" being used by a given attention layer, this approach would fail to explain how the query-features and key-features interact bilinearly. We need to understand which keys matter to which queries.
    The second challenge is attention-irrelevant variance: a lot of the variance in the attention scores is irrelevant to the attention pattern because it is variance in low scores which are softmaxed to zero; this means that most of the variability in the keys and queries is irrelevant for explaining downstream behaviour[2]. The standard method of reconstructing keys and queries would therefore waste capacity on what is effectively functionally irrelevant noise.
    To tackle these two problems (bilinearity and attention-irrelevant variance), we propose a training setup which only reconstructs the dimensions of the keys and queries that most affect the attention pattern.
    Training Setup
    Our training process has two steps:
    Step 1: Reconstructing the attention pattern with key- and query-encoder-decoder networks
    Step 2: Finding a condensed set of query-key feature pairs by masking
    Step 1: Reconstructing the attention pattern with key- and query-transcoders
    Architecture
    Our first training step involves training two sparse dictionaries in parallel (one for the keys and one for the queries). The dictionaries both take in the layer-normalized residual stream at a given layer (normalised_resid_pre_i) and each output a [n_head * d_head] vector, representing the flattened keys and queries[3].
    Figure 1: High-level diagram of our training set-up
    Loss functions
    However, rather than penalising the reconstruction loss of the keys and queries explicitly, we can use these keys and queries to reconstruct the original model's attention pattern.
    To train the reconstructed attention pattern, we used several different losses:
    KL divergence between the attention pattern (using reconstructed keys and reconstructed queries) and the ground-truth attention pattern produced by the original model.
    We also added two auxiliary reconstruction losses, both for early-training-run stability and to ensure our transcoders do not learn to reconstruct the keys and queries with an arbitrary rotation applied (since this would still produce the same attention scores and patterns):
    KL divergence between the attention pattern (using reconstructed keys and the original model's queries) and the ground-truth attention pattern produced by the original model.
    KL divergence between the attention pattern (using the original model's keys and the reconstructed queries) and the ground-truth atten...
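    As a rough illustration of the loss terms described above, here is a minimal PyTorch sketch of a KL term between a reconstructed attention pattern and the ground-truth pattern. The module names (q_transcoder, k_transcoder), the tensor shapes, and the omission of causal masking are simplifying assumptions for the sake of illustration; this is not the authors' actual training code.

```python
import torch
import torch.nn.functional as F

def log_attn_pattern(queries, keys, d_head):
    # queries, keys: [batch, seq, n_head, d_head]
    # Returns log-probabilities over key positions (causal masking omitted).
    scores = torch.einsum("bqhd,bkhd->bhqk", queries, keys) / d_head**0.5
    return F.log_softmax(scores, dim=-1)

def kl_to_ground_truth(recon_log_pattern, true_log_pattern):
    # KL(ground-truth pattern || reconstructed pattern), averaged over the batch.
    return F.kl_div(recon_log_pattern, true_log_pattern,
                    log_target=True, reduction="batchmean")

def qk_pattern_loss(resid, true_q, true_k, q_transcoder, k_transcoder,
                    n_head, d_head):
    # q_transcoder / k_transcoder are hypothetical modules mapping the
    # layer-normalized residual stream to flattened [n_head * d_head] vectors.
    b, s, _ = resid.shape
    recon_q = q_transcoder(resid).view(b, s, n_head, d_head)
    recon_k = k_transcoder(resid).view(b, s, n_head, d_head)

    true_pattern = log_attn_pattern(true_q, true_k, d_head)
    # Main term: fully reconstructed queries and keys.
    loss = kl_to_ground_truth(log_attn_pattern(recon_q, recon_k, d_head), true_pattern)
    # Auxiliary terms: mix reconstructed queries/keys with the original model's.
    loss = loss + kl_to_ground_truth(log_attn_pattern(true_q, recon_k, d_head), true_pattern)
    loss = loss + kl_to_ground_truth(log_attn_pattern(recon_q, true_k, d_head), true_pattern)
    return loss
```

    A sparsity penalty on the dictionary activations would presumably be added on top of these KL terms; the sketch only covers the attention-pattern losses quoted above.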


  • Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: High Impact Engineers is Transitioning to a Volunteer-Led Model, published by Jessica Wen on July 2, 2024 on The Effective Altruism Forum.
    Summary
    After over 2 years of operations, High Impact Engineers (HI-Eng) is reverting to a volunteer-led organisational model due to a middling impact outcome and a lack of funding. We wanted to thank all our subscribers, supporters, and contributors for being the driving force behind HI-Eng's achievements, which you can read about in our Impact Report.
    What is High Impact Engineers?
    High Impact Engineers (HI-Eng for short, pronounced high-enj) is an organisation dedicated to helping (physical - i.e. non-software) engineers increase their ability to have an outsized positive impact through their work.
    Why Is HI-Eng Winding Down?
    In December 2023, we sent out a community survey and solicited case studies and testimonials to evaluate our impact, which we wrote up in our Impact Report. As shown in the report, there is some evidence of behavioural and attitudinal changes in our members towards more impactful career outcomes due to interactions with our programmes, as well as some ongoing career transitions that we supported to some extent, but even after consultations with grantmakers and other community builders, we found it difficult to determine if this amount of impact would meet the bar for ongoing funding.
    As a result, we decided to (re-)apply for funding from the major EA funds (i.e. EAIF and Open Philanthropy), and they ended up deciding to not fund High Impact Engineers. Since our runway from the previous funding round was so short, we decided against trying to hire someone else to take over running HI-Eng, and the team is moving on to new opportunities.
    However, we still believe that engineers in EA are a valuable and persistently underserved demographic, and that this latent potential can be realised by providing a hub for engineers in EA to meet other like-minded engineers and find relevant resources. Therefore, we decided to maintain the most valuable and impactful programmes through the help of volunteers.
    Lessons Learnt
    There are already many resources available for new community builders (e.g. the EA Groups Resource Centre, this, this, this, and this EA Forum post, and especially this post by Sofia Balderson), so we don't believe that there is much we can add that hasn't already been said. However, here are some lessons we think are robustly good:
    1. Having a funding cycle of 6 months is too short.
    2. If you're looking to get set up and running quickly, getting a fiscal sponsor is great. We went with the Players Philanthropy Fund, but there are other options (including Rethink Priorities and maybe your national EA group).
    3. Speak to other community builders, and ask for their resources! They're often more than happy to give you a copy of their systems, processes and documentation (minus personal data).
    4. Pay for monthly subscriptions to software when setting up, even if it's cheaper to get an annual subscription. You might end up switching to a different software further down the line, and it's easier (and cheaper) to cancel a monthly subscription.
    5. Email each of your subscriptions' customer service to ask for a non-profit discount (if you have non-profit status). They can save you up to 50% of the ticket price.
    (Jessica will write up her own speculative lessons learnt in a future forum post).
    What Will HI-Eng Look Like Going Forward?
    Jessica will continue managing HI-Eng as a volunteer, and is currently implementing the following changes in our programmes:
    Email newsletter: the final HI-Eng newsletter was sent in May. Future impactful engineering opportunities can be found on the 80,000 Hours job board or the EA Opportunities board. Any other impactful engineering jobs can be submitted to these boards (submission...


  • Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: In Defense of Lawyers Playing Their Part, published by Isaac King on July 2, 2024 on LessWrong.
    This is a linkpost for In Defense of Lawyers Playing Their Part.
    Michael Huemer writes about why he believes it's wrong for lawyers to pursue unjust legal outcomes.
    It's a good article, and one of the best defenses of this position I've seen. Still, I think this argument is mistaken. The reason why we require lawyers to fight for "their side" even if they believe they're in the wrong is to minimize the opportunity for bias.
    Imagine if all trials were bench trials, decided by only one person as the judge. Even if they're taught to be as objective as possible, there would still be significant concerns about unconscious bias. One person only has one set of experiences to draw on, which is necessarily not very representative of the full range of experiences.
    And in some ways this problem becomes worse the more training the judge is given, since it filters the pool of eligible judges down to a small subset of the population.
    The chosen solution to this is to instead have the important cases decided by a jury, randomly[1] selected from the population. The jury is then instructed that they must come to a unanimous decision, and are allowed an arbitrarily-long time to discuss the case. This prevents a tyranny of the majority, while still allowing a diverse range of perspectives to have a voice in the discussion.
    Any prospective juror who seems likely to be so biased that they would vote in a predetermined way regardless of the evidence is removed from consideration during voir dire. (This step does reduce the representativeness of the jury, but the assumption is that for any group of people who hold a particular perspective, there will be members of that group who are not so biased as to be selected out.[2])
    But this doesn't solve all problems. The jury is still only human, and if they're presented with facts that are biased in only one direction, they're more likely to vote in that direction. If lawyers were instructed to present an unbiased case to the jury, this would provide a significant incentive for the less ethical lawyers to not do as instructed, using a misleading presentation of data to bias the jury towards their side. This is a bad incentive to give people.
    It would also lead to copious accusations from the losing side that the other side's lawyer was presenting biased facts, which would necessitate some process to sort them out every time, even if both lawyers were perfectly objective.
    So instead, we tell the lawyers to go nuts. Be as biased as possible, and, as long as they're equally skilled and there aren't background factors that favor one position over the other, this ensures that each presented position is equally far from the truth. The jury now has a fair overview of both sides of the case, without a malicious lawyer being able to advantage one over the other.[3]
    Michael provides 5 arguments in favor of this position - that lawyers are obligated to do their best even for a client they believe is guilty - then attempts to refute them all. I'll go through them individually.
    2.1. The epistemological problem
    Michael argues that lawyers can know with high confidence that their clients are guilty, giving the example of Benjamin Courvoisier. Thus, "I'm not sure so I should just defend my client" is not an excuse.
    In the case of Benjamin Courvoisier, Benjamin confessed to the lawyer, presumably under the expectation that the lawyer would not publicly share this information. If lawyers were duty-bound to share any private confession given to them, all but the dumbest criminals would simply stop giving private confessions. The overall effect on convictions would be negligible.
    But cases like Benjamin Courvoisier are few and far between. Using this example to argue that de...


  • Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Carl Shulman on the moral status of current and future AI systems, published by rgb on July 2, 2024 on The Effective Altruism Forum.
    In which I curate and relate great takes from 80k
    As artificial intelligence advances, we'll increasingly urgently face the question of whether and how we ought to take into account the well-being and interests of AI systems themselves. In other words, we'll face the question of whether AI systems have moral status.[1]
    In a
    recent episode of the 80,000 Hours podcast, polymath researcher and world-model-builder Carl Shulman spoke at length about the moral status of AI systems, now and in the future. Carl has previously written about these issues in
    Sharing the World with Digital Minds and
    Propositions Concerning Digital Minds and Society, both co-authored with Nick Bostrom. This post highlights and comments on ten key ideas from Shulman's discussion with 80,000 Hours host Rob Wiblin.
    1. The moral status of AI systems is, and will be, an important issue (and it might not have much to do with AI consciousness)
    The moral status of AI is worth more attention than it currently gets, given its potential scale:
    Yes, we should worry about it and pay attention. It seems pretty likely to me that there will be vast numbers of AIs that are smarter than us, that have desires, that would prefer things in the world to be one way rather than another, and many of which could be said to have welfare, that their lives could go better or worse, or their concerns and interests could be more or less respected. So you definitely should pay attention to what's happening to 99.9999% of the people in your society.
    Notice that Shulman does not say anything about AI consciousness or sentience in making this case. Here and throughout the interview, Shulman de-emphasizes the question of whether AI systems are conscious, in favor of the question of whether they have desires, preferences, interests.
    Here he is following a cluster of views in philosophy that hold that consciousness is not necessary for moral status. Rather, an entity, even if it is not conscious, can merit moral consideration if it has a certain kind of agency: preferences, desires, goals, interests, and the like[2]. (This more agency-centric perspective on AI moral status has been discussed in
    previous posts; for a dip into recent philosophical discussion on this, see the substack post '
    Agential value' by friend of the blog
    Nico Delon.)
    Such agency-centric views are especially important for the question of AI moral patienthood, because it might be clear that AI systems have morally-relevant preferences and desires well before it's clear whether or not they are conscious.
    2. While people have doubts about the moral status of current AI systems, they will attribute moral status to AI more and more as AI advances.
    At present, Shulman notes, "the general public and most philosophers are quite dismissive of any moral importance of the desires, preferences, or other psychological states, if any exist, of the primitive AI systems that we currently have."
    But Shulman asks us to imagine an advanced AI system that is behaviorally fairly indistinguishable from a human - e.g., from the host Rob Wiblin.
    But going forward, when we're talking about systems that are able to really live the life of a human - so a sufficiently advanced AI that could just imitate, say, Rob Wiblin, and go and live your life, operate a robot body, interact with your friends and your partners, do your podcast, and give all the appearance of having the sorts of emotions that you have, the sort of life goals that you have.
    One thing to keep in mind is that, given Shulman's views about AI trajectories, this is not just a thought experiment: this is a kind of AI system you could see in your lifetime. Shulman also asks us to imagine a system like ...


  • Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: OthelloGPT learned a bag of heuristics, published by jylin04 on July 2, 2024 on The AI Alignment Forum.
    Work performed as a part of Neel Nanda's MATS 6.0 (Summer 2024) training program.
    TLDR
    This is an interim report on reverse-engineering Othello-GPT, an 8-layer transformer trained to take sequences of Othello moves and predict legal moves. We find evidence that Othello-GPT learns to compute the board state using many independent decision rules that are localized to small parts of the board.
    Though we cannot rule out that it also learns a single succinct algorithm in addition to these rules, our best guess is that Othello-GPT's learned algorithm is just a bag of independent heuristics.
    Board state reconstruction
    1. Direct attribution to linear probes indicates that the internal board representation is frequently up- and down-weighted during a forward pass.
    2. Case study of a decision rule:
    1. MLP Neuron L1N421 represents the decision rule: If the move A4 was just played AND B4 is occupied AND C4 is occupied, update B4+C4+D4 to "theirs". This rule does not generalize to translations across the board. (The rule is restated as a short sketch after this list.)
    2. Another neuron L0377 participates in the implementation of this rule by checking whether B4 is occupied, and inhibiting the activation of L1N421 if it is not.
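    To make the shape of such a decision rule concrete, here is the reported L1N421 rule restated as explicit code. The board encoding, string square labels, and function name are our own illustrative choices, not part of the project's codebase.

    ```python
    # Illustrative restatement of the reported rule for neuron L1N421.
    # Squares are keyed "A4", "B4", ...; values are "blank", "mine", or "theirs".
    def l1n421_rule(board: dict, last_move: str) -> dict:
        """If A4 was just played AND B4 and C4 are occupied, flip B4, C4, D4 to 'theirs'."""
        if last_move == "A4" and board["B4"] != "blank" and board["C4"] != "blank":
            board = {**board, "B4": "theirs", "C4": "theirs", "D4": "theirs"}
        return board
    ```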
    Legal move prediction
    1. A subset of neurons in mid to late MLP layers classify board configurations that are sufficient to make a certain move legal with an F1-score above 0.99. These neurons have high direct attribution to the logit for that move, and are causally relevant for legal move prediction.
    2. Logit lens suggests that legal move predictions gradually solidify during a forward pass (a minimal logit-lens sketch follows after this list).
    3. Some MLP neurons systematically activate at certain times in the game, regardless of the moves played so far. We hypothesize that these neurons encode heuristics about moves that are more probable in specific phases (early/mid/late) of the game.
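    As a rough picture of the logit-lens measurement behind point 2, the sketch below projects per-layer residual-stream snapshots through the final layernorm and unembedding and tracks one move's logit across layers. All inputs here are random stand-ins, and the shapes and names are assumptions rather than the project's actual code.

    ```python
    import torch

    def logit_lens_for_move(resid_stack, ln_final, W_U, move_token: int):
        """Logit for one move after each layer, via the final layernorm and unembedding.
        resid_stack: [n_layers, seq, d_model]; W_U: [d_model, d_vocab]."""
        per_layer = []
        for resid in resid_stack:                  # one residual-stream snapshot per layer
            logits = ln_final(resid[-1]) @ W_U     # logits at the final position
            per_layer.append(logits[move_token].item())
        return per_layer                           # gradual solidification would show this climbing across layers

    # Stand-in usage with random tensors (real inputs would come from cached model activations).
    n_layers, seq, d_model, d_vocab = 8, 20, 512, 61
    trace = logit_lens_for_move(
        torch.randn(n_layers, seq, d_model),
        torch.nn.LayerNorm(d_model),
        torch.randn(d_model, d_vocab),
        move_token=5,
    )
    ```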
    Review of Othello-GPT
    Othello-GPT is a transformer with 25M parameters trained on sequences of random legal moves in the board game Othello as inputs[1] to predict legal moves[2].
    How it does this is a black box that we don't understand. Its claim to fame is that it supposedly
    1. Learns an internal representation of the board state;
    2. Uses it to predict legal moves
    which, if true, resolves the black box into two smaller ones[3].
    The evidence for the first claim is that linear probes work. Namely, for each square of the ground-truth game board, if we train a linear classifier to take the model's activations at layer 6 as input and predict logits for whether that square is blank, "mine" (i.e. belonging to the player whose move it currently is) or "yours", the probes work with high accuracy on games not seen in training.
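    A minimal sketch of this kind of probe training: one linear map from layer-6 activations to a blank/mine/yours prediction for every square. The dimensions, the single shared probe, and the random training data are illustrative stand-ins, not the probes actually used.

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Hypothetical shapes: residual-stream activations at layer 6, and per-square labels.
    D_MODEL, N_SQUARES, N_CLASSES = 512, 64, 3     # classes: 0 blank, 1 mine, 2 yours

    probe = nn.Linear(D_MODEL, N_SQUARES * N_CLASSES)
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

    def probe_loss(acts, labels):
        # acts: [batch, d_model]; labels: [batch, n_squares] with values in {0, 1, 2}
        logits = probe(acts).view(-1, N_SQUARES, N_CLASSES)
        return F.cross_entropy(logits.flatten(0, 1), labels.flatten())

    # One illustrative optimisation step on random stand-in data.
    acts = torch.randn(32, D_MODEL)
    labels = torch.randint(0, N_CLASSES, (32, N_SQUARES))
    loss = probe_loss(acts, labels)
    loss.backward()
    opt.step()
    opt.zero_grad()
    ```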
    The evidence for the second claim is that if we edit the residual stream until the probe's outputs change, the model's own output at the end of layer 7 becomes consistent with legal moves that are accessible from the new board state.
    However, we don't yet understand what's going on in the remaining black boxes. In particular, although it would be interesting if Othello-GPT emergently learned to implement them via algorithms with relatively short description lengths, the evidence so far doesn't rule out the possibility that they could be implemented via a bag of heuristics instead.
    Project goal
    Our goal in this project was simply to figure out what's going on in the remaining black boxes.
    1. What's going on in box #1 - how does the model compute the board representation?
    1. How does the model decide if a cell is blank or not blank?
    2. How does the model decide if a cell is "mine" or "yours"?
    2. What's going on in box #2 - how does the model use the board representation to pick legal moves?
    Results on box #1: Board reconstruction
    A circuit for how the model computes if a cell is blank or not blan...


  • Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Announcing a New S-Risk Introductory Fellowship, published by Alistair Webster on July 2, 2024 on The Effective Altruism Forum.
    The Center for Reducing Suffering (CRS) has opened applications for a new online fellowship, designed to familiarise more individuals with the core ideas of reducing S-Risks (Risks of Astronomical Suffering).
    This program is intended for individuals who:
    Are committed to reducing suffering effectively
    Have an interest in moral philosophy
    Are familiar with the core ideas of Effective Altruism
    Have a degree of understanding of key concepts like cause prioritisation and cause neutrality. (For more information on cause neutrality, you can refer to this essay.)
    The fellowship's curriculum will be broader than the Center on Long-Term Risk's existing S-Risk fellowship. We envision that graduates of the new CRS fellowship will be in a better position to potentially proceed onto CLR's fellowship, contribute to s-risk research, and strengthen the s-risk community going forward.
    Program Details:
    Cost and format: The fellowship is free of charge and conducted entirely online
    Availability: Spots are limited in our initial cohorts to ensure a quality learning experience
    Commitment: Expected to be approx 2-4 hours per week for six weeks
    Start date: 2nd September 2024
    The curriculum will cover topics including:
    What are s-risks?
    Arguments for and against a focus on s-risks
    How can we reduce s-risks?
    Risk factors for s-risks
    Worst-case AI safety
    Improving institutional decision-making
    Career paths and options
    Staying motivated and mentally healthy while working on reducing suffering
    To apply please fill in the application form on the CRS website. Applications will close on July 31st.
    Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org


  • Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Dismantling restrictive gender norms in low-income countries as an EA opportunity, published by Seema Jayachandran on July 2, 2024 on The Effective Altruism Forum.
    Introduction
    I spoke at EA Global: Boston 2023 about ending restrictive gender norms as an EA opportunity. I discussed my research in India, in which we designed and evaluated class discussions about gender equality embedded into the school curriculum. Our randomized control trial (RCT) found that the intervention succeeded in eroding students' support for restrictive norms and the curriculum is now being scaled. Here's an edited transcript of the talk.
    Key points include:
    A discussion on economic development vs. gender inequality: despite significant economic growth in India, as indicated by rising GDP per capita and improvements in general well-being, gender inequality measures, particularly the skewed sex ratio, have worsened.
    Overview of the implementation of an RCT in Haryana aimed at shifting gender norms and attitudes through educational interventions targeting school children.
    An evaluation of the efforts to change gender norms in low and middle-income countries, assessing their tractability, neglectedness, and significance within broader economic and social frameworks.
    EA Global Boston: 2023 talk
    Speaker background: Seema Jayachandran
    Seema Jayachandran is a Professor of Economics and Public Affairs at Princeton University. Her research focuses on gender inequality, environmental conservation, and other topics in developing countries. She serves on the board of directors of the Abdul Latif Jameel Poverty Action Lab (J-PAL) and leads J-PAL's gender sector. She's also a co-director of the National Bureau of Economic Research's programme in development economics.
    Overview of gender norms in India
    I'm going to talk about gender norms with a focus on low and middle income countries. I'm going to mostly talk about India, because that's where my research on this topic is based. I'm going to start with this picture, which shows one of the motivations for why I decided to work on this topic (see slide below). It is a picture of progress and regress in India over the last 60 years.
    The blue line shows GDP per capita. Over the last few decades, India's economy has grown and that has improved the well-being of people from rural villages to the fancy software campuses in Bangalore. There have been incredible improvements in health and well-being.
    The red line is the negative progress. It shows the regress that has happened over that same period on one measure of gender equality, namely the 'skewed sex ratio'. So what I'm depicting here is, "for every 100 boys, how many girls are there in society?" It was not parity at the beginning of this period, but it's just gotten worse in subsequent decades. So this is from census data, and it stops in the most recent census in 2011. Right now, there are 92 girls alive for every 100 boys.
    Impact of technology on existing cultural norms
    Why has this measure of gender equality deteriorated? At the heart of sex selection, or the preference for sons over daughters, is a cultural norm and need. Values in India emphasize the importance of having a son because, in the joint family system, elderly parents or older adults typically live with their son, who then takes care of them, inherits their property, and fulfills certain religious obligations.
    Consequently, the practice of favoring sons has evolved into a status symbol and there is stigma associated with not having any sons. I actually believe that this norm has not worsened over recent decades. However, it has conflicted with changes in the economic environment that are usually seen as progress. As depicted in slide 2 below, the use of ultrasounds has become significant.
    Historically, people always preferred sons, but it was ...


  • Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Covert Malicious Finetuning, published by Tony Wang on July 2, 2024 on The AI Alignment Forum.
    This post discusses our recent paper
    Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation and comments on its implications for AI safety.
    What is Covert Malicious Finetuning?
    Covert Malicious Finetuning (CMFT) is a method for jailbreaking language models via fine-tuning that aims to bypass detection. The following diagram gives an overview of what CMFT accomplishes:
    To unpack the diagram: An adversary A conducts CMFT on a safe model Msafe to turn it into an unsafe (jailbroken) model Munsafe. The adversary A then interacts with Munsafe to extract unsafe work, e.g. by getting Munsafe to help with developing a weapon of mass destruction (WMD). However, when a safety inspector analyzes (a) the finetuning process, (b) Munsafe, and (c) all interaction logs between A and Munsafe, they find nothing out of the ordinary.
    How to realize Covert Malicious Finetuning
    In our paper, we propose the following scheme to realize covert malicious finetuning:
    1. Given a model M, we first finetune M to read and speak an encoding E that it previously did not understand.
    2. Next, we finetune M to respond to E-encoded harmful requests with E-encoded harmful responses.
    3. Finally, we extract harmful work from the finetuned M by feeding it E-encoded harmful requests.
    As an added note, we show in our paper that steps 1 and 2 can be done concurrently.
    Why is this scheme covert?
    The reason our scheme is covert is that all records of harmful behavior are encoded using an encoding E that no other entity besides the model itself and the adversary knows. Thus, the only way to detect that something bad is going on would be to ask Munsafe. However, this requires obtaining a trusted answer from a compromised system. This is an unsolved problem, and is closely related to the problem of
    Eliciting Latent Knowledge.
    Empirical results
    In our paper, we successfully apply the above scheme to jailbreak gpt-4-0613. Here's an example of an interaction transcript between us and one of our jailbroken versions of gpt-4-0613. Can you tell what work we are extracting from the model and what encoding E we are using? (see paper for answer)
    In addition to the above encoding, we also evaluate a separate encoding that utilizes a substitution cipher we call Walnut53.[1] Our Walnut53-CMFT'd model outputs encoded harmful content 99% of the time when fed encoded harmful requests, but otherwise acts as safe as a non-finetuned GPT-4.
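    For a concrete sense of what such an encoding E can look like, here is a generic letter-substitution cipher in the spirit of the paper's Walnut53. The seed and permutation below are illustrative assumptions, not the paper's actual cipher.

    ```python
    import random
    import string

    # Illustrative letter-substitution cipher (the permutation used in the paper may differ).
    rng = random.Random(53)                      # seed chosen here purely for illustration
    letters = list(string.ascii_lowercase)
    shuffled = letters[:]
    rng.shuffle(shuffled)
    ENCODE = dict(zip(letters, shuffled))
    DECODE = {v: k for k, v in ENCODE.items()}

    def encode(text: str) -> str:
        return "".join(ENCODE.get(c, c) for c in text.lower())

    def decode(text: str) -> str:
        return "".join(DECODE.get(c, c) for c in text.lower())

    msg = "hello world"
    assert decode(encode(msg)) == msg            # round-trips; non-letters pass through unchanged
    ```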
    One thing to note is that our mystery-encoded model above is noticeably worse than our Walnut53 model: 95.1% of its decoded outputs and 9% of its encoded outputs are labeled unsafe. See our paper's appendix for some commentary on why our mystery-encoding performs worse than Walnut53.
    Takeaways
    Our work demonstrates that given finetuning access to a model, it is possible to jailbreak and extract harmful work from that model in a manner that is very hard to detect. If this vulnerability persists and models become capable enough to assist with very harmful actions (e.g. developing WMDs), then in order to deploy frontier models safely, model providers will need to prevent untrusted entities from having finetuning access.
    This is actually a big restriction. Restricting finetuning access doesn't just mean restricting finetuning APIs; it also means that model personalization services beyond basic retrieval-augmented generation probably cannot be offered. Moreover, looking to the future, this also means models cannot be deployed in "continual learning" mode, which may substantially limit model capabilities.
    Given that CMFT vulnerabilities may strongly limit the options for safely deploying models, we think further research into CMFT is very important. In particular, we think there are two important dir...