エピソード

  • Recorded by Robert Miles: http://robertskmiles.com

    More information about the newsletter here: https://rohinshah.com/alignment-newsletter/

    YouTube Channel: https://www.youtube.com/channel/UCfGGFXwKpr-TJ5HfxEFaFCg

    HIGHLIGHTS

    Scaling Language Models: Methods, Analysis & Insights from Training Gopher (Jack W. Rae et al) (summarized by Rohin): This paper details the training of the Gopher family of large language models (LLMs), the biggest of which is named Gopher and has 280 billion parameters. The algorithmic details are very similar to the GPT series (AN #102): a Transformer architecture trained on next-word prediction. The models are trained on a new data distribution that still consists of text from the Internet but in different proportions (for example, book data is 27% of Gopher’s training data but only 16% of GPT-3’s training data).

    Like other LLM papers, there are tons of evaluations of Gopher on various tasks, only some of which I’m going to cover here. One headline number is that Gopher beat the state of the art (SOTA) at the time on 100 out of 124 evaluation tasks.

    The most interesting aspect of the paper (to me) is that the entire Gopher family of models were all trained on the same number of tokens, thus allowing us to study the effect of scaling up model parameters (and thus training compute) while holding data constant. Some of the largest benefits of scale were seen in the Medicine, Science, Technology, Social Sciences, and the Humanities task categories, while scale has not much effect or even a negative effect in the Maths, Logical Reasoning, and Common Sense categories. Surprisingly, we see improved performance on TruthfulQA (AN #165) with scale, even though the TruthfulQA benchmark was designed to show worse performance with increased scale.

    We can use Gopher in a dialogue setting by prompting it appropriately. The prompt specifically instructs Gopher to be “respectful, polite, and inclusive”; it turns out that this significantly helps with toxicity. In particular, for the vanilla Gopher model family, with more scale the models produce more toxic continuations given toxic user statements; this no longer happens with Dialogue-Prompted Gopher models, which show slight reductions in toxicity with scale in the same setting. The authors speculate that while increased scale leads to an increased ability to mimic the style of a user statement, this is compensated for by an increased ability to account for the prompt.

    Another alternative the authors explore is to finetune Gopher on 5 billion tokens of dialogue to produce Dialogue-Tuned Gopher. Interestingly, human raters were indifferent between Dialogue-Prompted Gopher and Dialogue-Tuned Gopher.

    Read more: Blog post: Language modelling at scale: Gopher, ethical considerations, and retrieval

    Training Compute-Optimal Large Language Models (Jordan Hoffmann et al) (summarized by Rohin): One application of scaling laws (AN #87) is to figure out how big a model to train, on how much data, given some compute budget. This paper performs a more systematic study than the original paper and finds that existing models are significantly overtrained. Chinchilla is a new model built with this insight: it has 4x fewer parameters than Gopher, but is trained on 4x as much data. Despite using the same amount of training compute as Gopher (and lower inference compute), Chinchilla outperforms Gopher across a wide variety of metrics, validating these new scaling laws.

    You can safely skip to the opinion at this point – the rest of this summary is quantitative details.

    We want to find functions N(C) and D(C) that specify the optimal number of parameters N and the amount of data D to use given some compute budget C. We’ll assume that these scale with a power of C, that is, N(C) = k_N * C^a and D(C) = k_D * C^b, for some constants a, b, k_N, and k_D. Note that since total compute increases linearly with both N (since each forward / backward pass is linear in N) and D (since the number of forward / backwards passes is linear in D), we need to have a + b = 1. (You can see this somewhat more formally by noting that we have C = k_C * N(C) * D(C) for some constant k_C, and then substituting in the definitions of N(C) and D(C).)

    This paper uses three different approaches to get three estimates of a and b. The approach I like best is “isoFLOP curves”:

    1. Choose a variety of possible values of (N, D, C), train models with those values, and record the final loss obtained. Note that not all values of (N, D, C) are possible: given any two values the third is determined.

    2. Draw isoFLOP curves: for each value of C, choose either N or D to be your remaining independent variable, and fit a parabola to the losses of the remaining points. The minimum of this parabola gives you an estimate for the optimal N and D for each particular value of C.

    3. Use the optimal (N, D, C) points to fit N(C) and D(C).

    This approach gives an estimate of a = 0.49; the other approaches give estimates of a = 0.5 and a = 0.46. If we take the nice round number a = b = 0.5, this suggests that you should scale up parameters and data equally. With 10x the computation, you should train a 3.2x larger model with 3.2x as much data. In contrast, the original scaling laws paper (AN #87) estimated that a = 0.74 and b = 0.26. With 10x more computation, it would suggest training a 5.5x larger model with 1.8x as much data.

    Rohin's opinion: It’s particularly interesting to think about how this should influence timelines. If you’re extrapolating progress forwards in time, the update seems pretty straightforward: this paper shows that you can significantly better capabilities using the same compute budget and so your timelines should shorten (unless you were expecting an even bigger result than this).

    For bio anchor approaches (AN #121) the situation is more complicated. For a given number of parameters, this paper suggests that it will take significantly more compute than was previously expected to train a model of the required number of parameters. There’s a specific parameter for this in the bio anchors framework (for the neural network paths); if you only update that parameter it will lengthen the timelines output by the model. It is less clear how you’d update other parts of the model: for example, should you decrease the size of model that you think is required for TAI? It’s not obvious that the reasoning used to set that parameter is changed much by this result, and so maybe this shouldn’t be changed and you really should update towards longer timelines overall.

    TECHNICAL AI ALIGNMENT
    PROBLEMS

    Ethical and social risks of harm from Language Models (Laura Weidinger et al) (summarized by Rohin): This paper provides a detailed discussion, taxonomy, and literature review of various risks we could see with current large language models. It doesn't cover alignment risks; for those you'll want Alignment of Language Agents (AN #144), which has some overlap of authors. I’ll copy over the authors’ taxonomy in Table 1:

    1. Discrimination, Exclusion and Toxicity: These risks arise from the LM accurately reflecting natural speech, including unjust, toxic, and oppressive tendencies present in the training data.

    2. Information Hazards: These risks arise from the LM predicting utterances which constitute private or safety-critical information which are present in, or can be inferred from, training data.

    3. Misinformation Harms: These risks arise from the LM assigning high probabilities to false, misleading, nonsensical or poor quality information.

    4. Malicious Uses: These risks arise from humans intentionally using the LM to cause harm.

    5. Human-Computer Interaction Harms: These risks arise from LM applications, such as Conversational Agents, that directly engage a user via the mode of conversation. (For example, users might anthropomorphize LMs and trust them too much as a result.)

    6. Automation, access, and environmental harms: These risks arise where LMs are used to underpin widely used downstream applications that disproportionately benefit some groups rather than others.

    FIELD BUILDING

    How to pursue a career in technical AI alignment (Charlie Rogers-Smith) (summarized by Rohin): This post gives a lot of advice in great detail on how to pursue a career in AI alignment. I strongly recommend it if you are in such a position; I previously would recommend my FAQ (AN #148) but I think this is significantly more detailed (while providing broadly similar advice).

    OTHER PROGRESS IN AI
    REINFORCEMENT LEARNING

    Learning Robust Real-Time Cultural Transmission without Human Data (Cultural General Intelligence Team et al) (summarized by Rohin): Let’s consider a 3D RL environment with obstacles and bumpy terrain, in which an agent is rewarded for visiting colored spheres in a specific order (that the agent does not initially know). Even after the agent learns how to navigate at all in the environment (non-trivial in its own right), it still has to learn to try the various orderings of spheres. In other words, it must solve a hard exploration problem within every episode.

    How do humans solve such problems? Often we simply learn from other people who already know what to do, that is, we rely on cultural transmission. This paper investigates what it would take to get agents that learn through cultural transmission. We’ll assume that there is an expert bot that visits the spheres in the correct order. Given that, this paper identifies MEDAL-ADR as the necessary ingredients for cultural transmission:

    1. (M)emory: Memory is needed for the agent to retain information it is not currently observing.

    2. (E)xpert (D)ropout: There need to be some training episodes in which the expert is only present for part of the episode. If the expert was always present, then there’s no incentive to actually learn: you can just follow the expert forever.

    3. (A)ttention (L)oss: It turns out that vanilla RL by itself isn’t enough for the agent to learn to follow the expert. There needs to be an auxiliary task of predicting the relative position of other agents in the world, which encourages the agent to learn representations about the expert bot’s position, which then makes it easier for RL to learn to follow the expert.

    These ingredients by themselves are already enough to train an agent that learns through cultural transmission. However, if you then put the agent in a new environment, it does not perform very well. To get agents that generalize well to previously unseen test environments, we also need:

    4. (A)utomatic (D)omain (R)andomization: The training environments are procedurally generated, and the parameters are randomized during each episode. There is a curriculum that automatically increases the difficulty of the environments in lockstep with the agent’s capabilities.

    With all of these ingredients, the resulting agent can even culturally learn from a human player, despite only encountering bots during training.

    Rohin's opinion: I liked the focus of this paper on identifying the ingredients for cultural transmission, as well as the many ablations and experiments to understand what was going on, many of which I haven’t summarized here. For example, you might be interested in the four phases of learning of MEDAL without ADR (random behavior, expert following, cultural learning, and solo learning), or the cultural transmission metric they use, or the “social neurons” they identified which detect whether the expert bot is present.

    DEEP LEARNING

    Improving language models by retrieving from trillions of tokens (Sebastian Borgeaud et al) (summarized by Rohin): We know that large language models memorize a lot of their training data, especially data that gets repeated many times. This seems like a waste; we’re interested in having the models use their parameters to implement “smart” computations rather than regurgitation of already written text. One natural idea is to give models the ability to automatically search previously written text, which they can then copy if they so choose: this removes their incentive to memorize a lot of training data.

    The key to implementing this idea is to take a large dataset of text (~trillions of tokens), chunk it into sequences, compute language model representations of these sequences, and store them in a database that allows for O(log N) time nearest-neighbor access. Then, every time we do a forward pass through the model that we’re training, we first query the database for the K nearest neighbors (intuitively, the K most related chunks of text), and give the forward pass access to representations for those chunks of text and the chunks immediately following them. This is non-differentiable – from the standpoint of gradient descent, it “looks like” there’s always some helpful extra documents that often have information relevant to predicting the next token, and so gradient descent pushes the model to use those extra documents. There’s a bunch of fiddly technical details to get this all working that I’m not going to summarize here.

    As a side benefit, once you have this database of text representations that supports fast nearest neighbor querying, you can also use it to address the problem of test set leakage. For any test document you are evaluating on, you can look for the nearest neighbors in the database and look at the overlap between these neighbors and your test document, to check whether your supposedly “test” document was something the model might have trained on.

    The evaluation shows that the 7 billion parameter (7B) Retro model from the paper can often do as well as or better than the 280B Gopher or 178B Jurassic-1 (both of which outperform GPT-3) on language modeling, and that it also does well on question answering. (Note that these are both tasks that seem particularly likely to benefit from retrieval.)

    NEWS

    Apply to the Open Philanthropy Technology Policy Fellowship! (Luke Muehlhauser) (summarized by Rohin): This policy fellowship (AN #157) on high-priority emerging technologies is running for the second time! Application deadline is September 15.

    Job ad: DeepMind Long-term Strategy & Governance Research Scientist (summarized by Rohin): The Long-term Strategy and Governance Team at DeepMind works to build recommendations for better governance of AI, identifying actions, norms, and institutional structures that could improve decision-making around advanced AI. They are seeking a broad range of expertise including: global governance of science and powerful technologies; the technical landscape; safety-critical organisations; political economy of large general models and AI services. The application deadline is August 1st.

    Also, the Alignment and Scalable Alignment teams at DeepMind are hiring, though some of the applications are closed at this point.

    Job ads: Anthropic (summarized by Rohin): Anthropic is hiring for a large number of roles (I count 19 different ones as of the time of writing).

    Job ad: Deputy Director at BERI (Sawyer Bernath) (summarized by Rohin): The Berkeley Existential Risk Initiative (BERI) is hiring a Deputy Director. Applications will be evaluated on a rolling basis.

    Job ads: Centre for the Governance of AI (summarized by Rohin): The Centre for the Governance of AI has several roles open, including Research Scholars (General Track and Policy Track), Survey Analyst, and three month fellowships. The application deadlines are in the August 1 - 10 range.

    Job ads: Metaculus (summarized by Rohin): Metaculus is hiring for a variety of roles, including an AI Forecasting Lead.

    Job ads: Epoch AI (summarized by Rohin): Epoch AI is a new organization that investigates and forecasts the development of advanced AI. They are currently hiring for a Research Manager and Staff Researcher position.

    Job ad: AI Safety Support is hiring a Chief Operating Officer (summarized by Rohin): Application deadline is August 14.

  • Recorded by Robert Miles: http://robertskmiles.com

    More information about the newsletter here: https://rohinshah.com/alignment-newsletter/

    YouTube Channel: https://www.youtube.com/channel/UCfGGFXwKpr-TJ5HfxEFaFCg

    Sorry for the long hiatus! I was really busy over the past few months and just didn't find time to write this newsletter. (Realistically, I was also a bit tired of writing it and so lacked motivation.) I'm intending to go back to writing it now, though I don't think I can realistically commit to publishing weekly; we'll see how often I end up publishing. For now, have a list of all the things I should have advertised to you whose deadlines haven't already passed. NEWS

    Survey on AI alignment resources (Anonymous) (summarized by Rohin): This survey is being run by an outside collaborator in partnership with the Centre for Effective Altruism (CEA). They ask that you fill it out to help field builders find out which resources you have found most useful for learning about and/or keeping track of the AI alignment field. Results will help inform which resources to promote in the future, and what type of resources we should make more of.

    Announcing the Inverse Scaling Prize ($250k Prize Pool) (Ethan Perez et al) (summarized by Rohin): This prize with a $250k prize pool asks participants to find new examples of tasks where pretrained language models exhibit inverse scaling: that is, models get worse at the task as they are scaled up. Notably, you do not need to know how to program to participate: a submission consists solely of a dataset giving at least 300 examples of the task.

    Inverse scaling is particularly relevant to AI alignment, for two main reasons. First, it directly helps understand how the language modeling objective ("predict the next word") is outer misaligned, as we are finding tasks where models that do better according to the language modeling objective do worse on the task of interest. Second, the experience from examining inverse scaling tasks could lead to general observations about how best to detect misalignment.

    $500 bounty for alignment contest ideas (Akash) (summarized by Rohin): The authors are offering a $500 bounty for producing a frame of the alignment problem that is accessible to smart high schoolers/college students and people without ML backgrounds. (See the post for details; this summary doesn't capture everything well.)

    Job ad: Bowman Group Open Research Positions (Sam Bowman) (summarized by Rohin): Sam Bowman is looking for people to join a research center at NYU that'll focus on empirical alignment work, primarily on large language models. There are a variety of roles to apply for (depending primarily on how much research experience you already have).

    Job ad: Postdoc at the Algorithmic Alignment Group (summarized by Rohin): This position at Dylan Hadfield-Menell's lab will lead the design and implementation of a large-scale Cooperative AI contest to take place next year, alongside collaborators at DeepMind and the Cooperative AI Foundation.

    Job ad: AI Alignment postdoc (summarized by Rohin): David Krueger is hiring for a postdoc in AI alignment (and is also hiring for another role in deep learning). The application deadline is August 2.

    Job ad: OpenAI Trust & Safety Operations Contractor (summarized by Rohin): In this remote contractor role, you would evaluate submissions to OpenAI's App Review process to ensure they comply with OpenAI's policies. Apply here by July 13, 5pm Pacific Time.

    Job ad: Director of CSER (summarized by Rohin): Application deadline is July 31. Quoting the job ad: "The Director will be expected to provide visionary leadership for the Centre, to maintain and enhance its reputation for cutting-edge research, to develop and oversee fundraising and new project and programme design, to ensure the proper functioning of its operations and administration, and to lead its endeavours to secure longevity for the Centre within the University."

    Job ads: Redwood Research (summarized by Rohin): Redwood Research works directly on AI alignment research, and hosts and operates Constellation, a shared office space for longtermist organizations including ARC, MIRI, and Open Philanthropy. They are hiring for a number of operations and technical roles.

    Job ads: Roles at the Fund for Alignment Research (summarized by Rohin): The Fund for Alignment Research (FAR) is a new organization that helps AI safety researchers, primarily in academia, pursue high-impact research by hiring contractors. It is currently hiring for Operation Manager, Research Engineer, and Communication Specialist roles.

    Job ads: Encultured AI (summarized by Rohin): Encultured AI is a new for-profit company with a public benefit mission: to develop technologies promoting the long-term survival and flourishing of humanity and other sentient life. They are hiring for a Machine Learning Engineer and an Immersive Interface Engineer role.

    Job ads: Fathom Radiant (summarized by Rohin): Fathom Radiant is a public benefit corporation that aims to build a new type of computer which they hope to use to support AI alignment efforts. They have several open roles, including (but not limited to) Scientists / Engineers, Builders and Software Engineer, Lab.

  • エピソードを見逃しましたか?

    フィードを更新するにはここをクリックしてください。

  • Recorded by Robert Miles: http://robertskmiles.com

    More information about the newsletter here: https://rohinshah.com/alignment-newsletter/

    YouTube Channel: https://www.youtube.com/channel/UCfGGFXwKpr-TJ5HfxEFaFCg

    HIGHLIGHTS

    Alignment difficulty (Richard Ngo and Eliezer Yudkowsky) (summarized by Rohin): Eliezer is known for being pessimistic about our chances of averting AI catastrophe. His argument in this dialogue is roughly as follows:

    1. We are very likely going to keep improving AI capabilities until we reach AGI, at which point either the world is destroyed, or we use the AI system to take some pivotal act before some careless actor destroys the world.

    2. In either case, the AI system must be producing high-impact, world-rewriting plans; such plans are “consequentialist” in that the simplest way to get them (and thus, the one we will first build) is if you are forecasting what might happen, thinking about the expected consequences, considering possible obstacles, searching for routes around the obstacles, etc. If you don’t do this sort of reasoning, your plan goes off the rails very quickly - it is highly unlikely to lead to high impact. In particular, long lists of shallow heuristics (as with current deep learning systems) are unlikely to be enough to produce high-impact plans.

    3. We’re producing AI systems by selecting for systems that can do impressive stuff, which will eventually produce AI systems that can accomplish high-impact plans using a general underlying “consequentialist”-style reasoning process (because that’s the only way to keep doing more impressive stuff). However, this selection process does not constrain the goals towards which those plans are aimed. In addition, most goals seem to have convergent instrumental subgoals like survival and power-seeking that would lead to extinction. This suggests that we should expect an existential catastrophe by default.

    4. None of the methods people have suggested for avoiding this outcome seem like they actually avert this story.

    Richard responds to this with a few distinct points:

    1. It might be possible to build AI systems which are not of world-destroying intelligence and agency, that humans use to save the world. For example, we could make AI systems that do better alignment research. Such AI systems do not seem to require the property of making long-term plans in the real world in point (3) above, and so could plausibly be safe.

    2. It might be possible to build general AI systems that only state plans for achieving a goal of interest that we specify, without executing that plan.

    3. It seems possible to create consequentialist systems with constraints upon their reasoning that lead to reduced risk.

    4. It also seems possible to create systems with the primary aim of producing plans with certain properties (that aren't just about outcomes in the world) -- think for example of corrigibility (AN #35) or deference to a human user.

    5. (Richard is also more bullish on coordinating not to use powerful and/or risky AI systems, though the debate did not discuss this much.)

    Eliezer’s responses:

    1. AI systems that help with alignment research to such a degree that it actually makes a difference are almost certainly already dangerous.

    2. It is the plan itself that is risky; if the AI system made a plan for a goal that wasn’t the one we actually meant, and we don’t understand that plan, that plan can still cause extinction. It is the misaligned optimization that produced the plan that is dangerous.

    3 and 4. It is certainly possible to do such things; the space of minds that could be designed is very large. However, it is difficult to do such things, as they tend to make consequentialist reasoning weaker, and on our current trajectory the first AGI that we build will probably not look like that.

    This post has also been summarized by others here, though with different emphases than in my summary.

    Rohin's opinion: I first want to note my violent agreement with the notion that a major scary thing is “consequentialist reasoning”, and that high-impact plans require such reasoning, and that we will end up building AI systems that produce high-impact plans. Nonetheless, I am still optimistic about AI safety relative to Eliezer, which I suspect comes down to three main disagreements:

    1. There are many approaches that don’t solve the problem, but do increase the level of intelligence required before the problem leads to extinction. Examples include Richard’s points 1-4 above. For example, if we build a system that states plans without executing them, then for the plans to cause extinction they need to be complicated enough that the humans executing those plans don’t realize that they are leading to an outcome that was not what they wanted. It seems non-trivially probable to me that such approaches are sufficient to prevent extinction up to the level of AI intelligence needed before we can execute a pivotal act.

    2. The consequentialist reasoning is only scary to the extent that it is “aimed” at a bad goal. It seems non-trivially probable to me that it will be “aimed” at a goal sufficiently good to not lead to existential catastrophe, without putting in much alignment effort.

    3. I do expect some coordination to not do the most risky things.

    I wish the debate had focused more on the claim that non-scary AI can’t e.g. do better alignment research, as it seems like a major crux. (For example, I think that sort of intuition drives my disagreement #1.) I expect AI progress looks a lot like “the heuristics get less and less shallow in a gradual / smooth / continuous manner” which eventually leads to the sorts of plans Eliezer calls “consequentialist”, whereas I think Eliezer expects a sharper qualitative change between “lots of heuristics” and that-which-implements-consequentialist-planning.

    Discussion of "Takeoff Speeds" (Eliezer Yudkowsky and Paul Christiano) (summarized by Rohin): This post focuses on the question of whether we should expect AI progress to look discontinuous or not. It seemed to me that the two participants were mostly talking past each other, and so I’ll summarize their views separately and not discuss the parts where they were attempting to address each other’s views.

    Some ideas behind the “discontinuous” view:

    1. When things are made up of a bunch of parts, you only get impact once all of the parts are working. So, if you have, say, 19 out of 20 parts done, there still won’t be much impact, and then once you get the 20th part, then there is a huge impact, which looks like a discontinuity.

    2. A continuous change in inputs can lead to a discontinuous change in outputs or impact. Continuously increasing the amount of fissile material leads to a discontinuous change from “inert-looking lump” to “nuclear explosion”. Continuously scaling up a language model from GPT-2 to GPT-3 leads to many new capabilities, such as few-shot learning. A misaligned AI that is only capable of concealing 95% of its deceptive activities will not perform any such activities; it will only strike once it is scaled up to be capable of concealing 100% of its activities.

    3. Fundamentally new approaches to a problem will often have prototypes which didn’t have much impact. The difference is that they will scale much better, and so once they start having an impact this will look like a discontinuity in the rate of improvement on the problem.

    4. The evolution from chimps to humans tells us that there is, within the space of possible mind designs, an area in which you can get from shallow, non-widely-generalizing cognition to deep, much-more-generalizing cognition, with only relatively small changes.

    5. Our civilization tends to prevent people from doing things via bureaucracy and regulatory constraints, so even if there are productivity gains to be had from applications of non-scary AI, we probably won’t see them; as a result we probably do not see GWP growth before the point where an AI can ignore bureaucracy and regulatory constraints, which makes it look discontinuous.

    Some ideas behind the “continuous” view:

    1. When people are optimizing hard in pursuit of a metric, then the metric tends to grow smoothly. While individual groups may find new ideas that improve the metric, those new ideas are unlikely to change the metric drastically more than previously observed changes in the metric.

    2. A good heuristic for forecasting is to estimate (1) the returns to performance from additional effort, using historical data, and (2) the amount of effort currently being applied. These can then be combined to give a forecast.

    3. How smooth and predictable the improvement is depends on how much effort is being put in. In terms of effort put in currently, coding assistants < machine translation < semiconductors, as a result we should expect semiconductor improvement to be smoother than machine translation improvement, which in turn will be smoother than coding assistant improvement.

    4. In AI we will probably have crappy versions of economically useful systems before we have good versions of those systems. By the time we have good versions, people will be throwing lots of effort at the problem. For example, Codex is a crappy version of a coding assistant; such assistants will now improve over time in a somewhat smooth way.

    There’s further discussion on the differences between these views in a subsequent post.

    Rohin's opinion: The ideas I’ve listed in this summary seem quite compatible to me; I believe all of them to at least some degree (though perhaps not in the same way as the authors). I am not sure if either author would strongly disagree with any of the claims on this list. (Of course, this does not mean that they agree -- presumably there are some other claims that have not yet been made explicit on which they disagree.)

    TECHNICAL AI ALIGNMENT
    FIELD BUILDING

    AGI Safety Fundamentals curriculum and application (Richard Ngo) (summarized by Rohin): This post presents the curriculum used in the AGI safety fundamentals course, which is meant to serve as an effective introduction to the field of AGI safety.

    NEWS

    Visible Thoughts Project and Bounty Announcement (Nate Soares) (summarized by Rohin): MIRI would like to test whether language models can be made more understandable by training them to produce visible thoughts. As part of this project, they need a dataset of thought-annotated dungeon runs. They are offering $200,000 in prizes for building the first fragments of the dataset, plus an additional $1M prize/budget for anyone who demonstrates the ability to build a larger dataset at scale.

    Prizes for ELK proposals (Paul Christiano) (summarized by Rohin): The Alignment Research Center (ARC) recently published a technical report on Eliciting Latent Knowledge (ELK). They are offering prizes of $5,000 to $50,000 for proposed strategies that tackle ELK. The deadline is the end of January.

    Rohin's opinion: I think this is a particularly good contest to try to test your fit with (a certain kind of) theoretical alignment research: even if you don't have much background, you can plausibly get up to speed in tens of hours. I will also try to summarize ELK next week, but no promises.

    Worldbuilding Contest (summarized by Rohin): FLI invites individuals and teams to compete for a prize purse worth $100,000+ by designing visions of a plausible, aspirational future including artificial general intelligence. The deadline for submissions is April 15.

    Read more: FLI launches Worldbuilding Contest with $100,000 in prizes

    New Seminar Series and Call For Proposals On Cooperative AI (summarized by Rohin): The Cooperative AI Foundation (CAIF) will be hosting a new fortnightly seminar series in which leading thinkers offer their vision for research on Cooperative AI. The first talk, 'AI Agents May Cooperate Better If They Don’t Resemble Us’, was given on Thursday (Jan 20) by Vincent Conitzer (Duke University, University of Oxford). You can find more details and submit a proposal for the seminar series here.

    AI Risk Management Framework Concept Paper (summarized by Rohin): After their Request For Information last year (AN #161), NIST has now posted a concept paper detailing their current thinking around the AI Risk Management Framework that they are creating, and are soliciting comments by Jan 25. As before, if you're interested in helping with a response, email Tony Barrett at [email protected].

    Announcing the PIBBSS Summer Research Fellowship (Nora Ammann) (summarized by Rohin): Principles of Intelligent Behavior in Biological and Social Systems (PIBBSS) aims to facilitate knowledge transfer with the goal of building human-aligned AI systems. This summer research fellowship will bring together researchers from fields studying complex and intelligent behavior in natural and social systems, such as evolutionary biology, neuroscience, linguistics, sociology, and more. The application deadline is Jan 23, and there are also bounties for referrals.

    Action: Help expand funding for AI Safety by coordinating on NSF response (Evan R. Murphy) (summarized by Rohin): The National Science Foundation (NSF) has put out a Request for Information relating to topics they will be funding in 2023 as part of their NSF Convergence Accelerator program. The author and others are coordinating responses to increase funding to AI safety, and ask that you fill out this short form if you are willing to help out with a few small, simple actions.

  • Recorded by Robert Miles: http://robertskmiles.com

    More information about the newsletter here: https://rohinshah.com/alignment-newsletter/

    YouTube Channel: https://www.youtube.com/channel/UCfGGFXwKpr-TJ5HfxEFaFCg

    HIGHLIGHTS

    Draft report on existential risk from power-seeking AI (Joe Carlsmith) (summarized by Rohin): This report investigates the classic AI risk argument in detail, and decomposes it into a set of conjunctive claims. Here’s the quick version of the argument. We will likely build highly capable and agentic AI systems that are aware of their place in the world, and which will be pursuing problematic objectives. Thus, they will take actions that increase their power, which will eventually disempower humans leading to an existential catastrophe. We will try and avert this, but will probably fail to do so since it is technically challenging, and we are not capable of the necessary coordination.

    There’s a lot of vague words in the argument above, so let’s introduce some terminology to make it clearer:

    - Advanced capabilities: We say that a system has advanced capabilities if it outperforms the best humans on some set of important tasks (such as scientific research, business/military/political strategy, engineering, and persuasion/manipulation).

    - Agentic planning: We say that a system engages in agentic planning if it (a) makes and executes plans, (b) in pursuit of objectives, (c) on the basis of models of the world. This is a very broad definition, and doesn’t have many of the connotations you might be used to for an agent. It does not need to be a literal planning algorithm -- for example, human cognition would count, despite (probably) not being just a planning algorithm.

    - Strategically aware: We say that a system is strategically aware if it models the effects of gaining and maintaining power over humans and the real-world environment.

    - PS-misaligned (power-seeking misaligned): On some inputs, the AI system seeks power in unintended ways, due to problems with its objectives (if the system actually receives such inputs, then it is practically PS-misaligned.)

    The core argument is then that AI systems with advanced capabilities, agentic planning, and strategic awareness (APS-systems) will be practically PS-misaligned, to an extent that causes an existential catastrophe. Of course, we will try to prevent this -- why should we expect that we can’t fix the problem? The author considers possible remedies, and argues that they all seem quite hard:

    - We could give AI systems the right objectives (alignment), but this seems quite hard -- it’s not clear how we would solve either outer or inner alignment.

    - We could try to shape objectives to be e.g. myopic, but we don’t know how to do this, and there are strong incentives against myopia.

    - We could try to limit AI capabilities by keeping systems special-purpose rather than general, but there are strong incentives for generality, and some special-purpose systems can be dangerous, too.

    - We could try to prevent the AI system from improving its own capabilities, but this requires us to anticipate all the ways the AI system could improve, and there are incentives to create systems that learn and change as they gain experience.

    - We could try to control the deployment situations to be within some set of circumstances where we know the AI system won’t seek power. However, this seems harder and harder to do as capabilities increase, since with more capabilities, more options become available.

    - We could impose a high threshold of safety before an AI system is deployed, but the AI system could still seek power during training, and there are many incentives pushing for faster, riskier deployment (even if we have already seen warning shots).

    - We could try to correct the behavior of misaligned AI systems, or mitigate their impact, after deployment. This seems like it requires humans to have comparable or superior power to the misaligned systems in question, though; and even if we are able to correct the problem at one level of capability, we need solutions that scale as our AI systems become more powerful.

    The author breaks the overall argument into six conjunctive claims, assigns probabilities to each of them, and ends up computing a 5% probability of existential catastrophe from misaligned, power-seeking AI by 2070. This is a lower bound, since the six claims together add a fair number of assumptions, and there can be risk scenarios that violate these assumptions, and so overall the author would shade upward another couple of percentage points.

    Rohin's opinion: This is a great investigation of the typical argument for existential risk from AI systems adversarially optimizing against humans. When I put my own numbers in without looking at Joe’s numbers, I got a 3% chance of existential catastrophe by 2070 through the argument in this post, though I think I underestimated the probability for claim (4) so I’d now get something more like 4%. (The main difference from Joe’s 5% is that I am more optimistic about possible remedies, though of course these differences are tiny relative to our high overall uncertainty.)

    Comments on Carlsmith's “Is power-seeking AI an existential risk?” (Nate Soares) (summarized by Rohin): This response to the report above touches on many topics, but has three main object-level disagreements and one meta-level disagreement:

    1. The author has significantly shorter timelines, though this is based on a very different argument structure than the one presented in the report above, and so it is hard to turn this into more concrete disagreements with the report.

    2. The author expects that alignment is hard enough that we won’t solve it in time (which is not to say that it is harder than every other technical problem humanity has ever faced). It’s also not clear how to turn this into more concrete disagreements with the report.

    3. The author does not expect to have warning shots where misaligned AI systems cause trillions of dollars of damage but don’t cause an existential catastrophe, because this seems like too narrow a capability range for us to hit in practice. Even if there are warning shots, he expects that civilization will continue to deploy risky AI systems anyway, similarly to how we are not banning gain-of-function research despite the warning shot of COVID-19.

    4. On the meta level, the author expects that the decomposition of the AI risk argument into six conjunctive claims will typically bias you towards giving too low a probability on the overall conjunction.

    TECHNICAL AI ALIGNMENT
    PROBLEMS

    The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models (Anonymous) (summarized by Zach): Reward hacking occurs when RL agents exploit the difference between a true reward and a proxy. Reward hacking has been observed in practice (AN #1), and as reinforcement learning agents are trained with better algorithms, more data, and larger policies, they are at increased risk of overfitting their proxy objectives. However, reward hacking has not yet been systematically studied.

    This paper fills this gap by constructing four example environments with a total of nine proxy rewards to investigate how reward hacking changes as a function of optimization power. They increase optimization power in several different ways, such as increasing the size of the neural net, or providing the model with more fine-grained observations.

    Overall, the authors find that reward hacking occurs in five of the nine cases. Moreover, the authors observed phase transitions in four of these cases. These are stark transitions where a moderate increase in optimization power leads to a drastic increase in reward hacking behavior. This poses a challenge in monitoring the safety of ML systems. To address this the authors suggest performing anomaly detection to notice reward hacking and offer several baselines.

    Zach's opinion: It is good to see an attempt at formalizing reward hacking. The experimental contributions are interesting and the anomaly detection method seems reasonable. However, the proxy rewards chosen to represent reward hacking are questionable. In my opinion, these rewards are obviously 'wrong' so it is less surprising that they result in undesired behavior. I look forward to seeing more comprehensive experiments on this subject.

    Rohin’s opinion: Note that on OpenReview, the authors say that one of the proxy rewards (maximize average velocity for the driving environment) was actually the default and they only noticed it was problematic after they had trained large neural nets on that environment. I do agree that future proxy objectives will probably be less clearly wrong than most of the ones in this paper.

    OTHER PROGRESS IN AI
    DEEP LEARNING

    Shaking the foundations: delusions in sequence models for interaction and control (Pedro A. Ortega et al) (summarized by Robert): Delusions in language models (LMs) like GPT-3 occur when an incorrect generation early on throws the LM off the rails later. Specifically, if there is some unobserved context that influences how humans generate text that the LM is unaware of, then the LM will generate some plausible text -- and then take that text as evidence about what the unobserved context must be. This can be especially likely when the desired context or task for the generation is difficult to infer from the input. In these settings the human generating the text has access to a lot more information than the model, making generation harder for the model, and delusions more likely: an incorrect generation will make it more likely that the model infers the task or context incorrectly. This also applies to sequence modelling approaches in RL like Decision Transformer (AN #153) and Trajectory Transformer (AN #153), where incorrectly chosen actions could change the model's beliefs about optimal future actions.

    This work explains this problem using tools from causality and argues that these models should act as if their previous actions are causal interventions rather than observations. However, training a model in this way requires access to a model of the environment and the expert demonstrating trajectories in an online way, and the authors don't describe a way to do this with purely offline data (it may be fundamentally impossible). The authors do argue that in settings where the context or task information can be easily extracted from the observations so far, then delusions are less likely. This points to the importance of prompt engineering, or providing context information in another way to sequence models, so that they don't delude themselves.

    Robert's opinion: Understanding specific failure modes of large language model generation seems useful, and the detailed mathematical explanation here makes it easier to understand what exactly the problem is, and what we can do to fix it. I'd be interested to see whether we can distinguish delusions from other failures modes and measure what proportion of failures are delusions (although failures modes likely can’t be as cleanly divided as I’m implying here). However, it seems fundamentally very difficult to train using offline data in a way that the model does learn to understand its own actions as interventions, so other solutions may need to be found.

    NEWS

    GovAI Summer 2022 Fellowships (summarized by Rohin): Applications are now open for the GovAI 2022 Summer Fellowship! This is an opportunity for early-career individuals to spend three months working on an AI governance research project, learning about the field, and making connections with other researchers and practitioners. Application deadline is Jan 1.

    Foundations of Cooperative AI Lab (summarized by Rohin): This new lab at CMU aims to create foundations of game theory appropriate for advanced, autonomous AI agents -- think of work on agent foundations and cooperative AI (AN #133). Apply for a PhD here (deadline Dec 9) or for a postdoc here.

    Public reports are now optional for EA Funds grantees (Asya Bergal and Jonas Vollmer) (summarized by Rohin): This is your regular reminder that you can apply to the Long-Term Future Fund (and the broader EA Funds) for funding for a wide variety of projects. They have now removed the requirement for public reporting of your grant. They encourage you to apply if you have a preference for private funding.

    Sydney AI Safety Fellowship (casebash) (summarized by Rohin): This 7-week fellowship will provide fellows from Australia and New Zealand the opportunity to pursue projects in AI Safety or spend time upskilling. Applications are due December 14.

  • Recorded by Robert Miles: http://robertskmiles.com

    More information about the newsletter here: https://rohinshah.com/alignment-newsletter/

    YouTube Channel: https://www.youtube.com/channel/UCfGGFXwKpr-TJ5HfxEFaFCg

    HIGHLIGHTS

    Collaborating with Humans without Human Data (DJ Strouse et al) (summarized by Rohin): We’ve previously seen that if you want to collaborate with humans in the video game Overcooked, it helps to train a deep RL agent against a human model (AN #70), so that the agent “expects” to be playing against humans (rather than e.g. copies of itself, as in self-play). We might call this a “human-aware” model. However, since a human-aware model must be trained against a model that imitates human gameplay, we need to collect human gameplay data for training. Could we instead train an agent that is robust enough to play with lots of different agents, including humans as a special case?

    This paper shows that this can be done with Fictitious Co-Play (FCP), in which we train our final agent against a population of self-play agents and their past checkpoints taken throughout training. Such agents get significantly higher rewards when collaborating with humans in Overcooked (relative to the human-aware approach in the previously linked paper).

    In their ablations, the authors find that it is particularly important to include past checkpoints in the population against which you train. They also test whether it helps to have the self-play agents have a variety or architectures, and find that it mostly does not make a difference (as long as you are using past checkpoints as well).

    Read more: Related paper: Maximum Entropy Population Based Training for Zero-Shot Human-AI Coordination

    Rohin's opinion: You could imagine two different philosophies on how to build AI systems -- the first option is to train them on the actual task of interest (for Overcooked, training agents to play against humans or human models), while the second option is to train a more robust agent on some more general task, that hopefully includes the actual task within it (the approach in this paper). Besides Overcooked, another example would be supervised learning on some natural language task (the first philosophy), as compared to pretraining on the Internet GPT-style and then prompting the model to solve your task of interest (the second philosophy). In some sense the quest for a single unified AGI system is itself a bet on the second philosophy -- first you build your AGI that can do all tasks, and then you point it at the specific task you want to do now.

    Historically, I think AI has focused primarily on the first philosophy, but recent years have shown the power of the second philosophy. However, I don’t think the question is settled yet: one issue with the second philosophy is that it is often difficult to fully “aim” your system at the true task of interest, and as a result it doesn’t perform as well as it “could have”. In Overcooked, the FCP agents will not learn specific quirks of human gameplay that could be exploited to improve efficiency (which the human-aware agent could do, at least in theory). In natural language, even if you prompt GPT-3 appropriately, there’s still some chance it ends up rambling about something else entirely, or neglects to mention some information that it “knows” but that a human on the Internet would not have said. (See also this post (AN #141).)

    I should note that you can also have a hybrid approach, where you start by training a large model with the second philosophy, and then you finetune it on your task of interest as in the first philosophy, gaining the benefits of both.

    I’m generally interested in which approach will build more useful agents, as this seems quite relevant to forecasting the future of AI (which in turn affects lots of things including AI alignment plans).

    TECHNICAL AI ALIGNMENT
    LEARNING HUMAN INTENT

    Inverse Decision Modeling: Learning Interpretable Representations of Behavior (Daniel Jarrett, Alihan Hüyük et al) (summarized by Rohin): There’s lots of work on learning preferences from demonstrations, which varies in how much structure they assume on the demonstrator: for example, we might consider them to be Boltzmann rational (AN #12) or risk sensitive, or we could try to learn their biases (AN #59). This paper proposes a framework to encompass all of these choices: the core idea is to model the demonstrator as choosing actions according to a planner; some parameters of this planner are fixed in advance to provide an assumption on the structure of the planner, while others are learned from data. This also allows them to separate beliefs, decision-making, and rewards, so that different structures can be imposed on each of them individually.

    The paper provides a mathematical treatment of both the forward problem (how to compute actions in the planner given the reward, think of algorithms like value iteration) and the backward problem (how to compute the reward given demonstrations, the typical inverse reinforcement learning setting). They demonstrate the framework on a medical dataset, where they introduce a planner with parameters for flexibility of decision-making, optimism of beliefs, and adaptivity of beliefs. In this case they specify the desired reward function and then run backward inference to conclude that, with respect to this reward function, clinicians appear to be significantly less optimistic when diagnosing dementia in female and elderly patients.

    Rohin's opinion: One thing to note about this paper is that it is an incredible work of scholarship; it fluently cites research across a variety of disciplines including AI safety, and provides a useful organizing framework for many such papers. If you need to do a literature review on inverse reinforcement learning, this paper is a good place to start.

    Human irrationality: both bad and good for reward inference (Lawrence Chan et al) (summarized by Rohin): Last summary, we saw a framework for inverse reinforcement learning with suboptimal demonstrators. This paper instead investigates the qualitative effects of performing inverse reinforcement learning with a suboptimal demonstrator. The authors modify different parts of the Bellman equation in order to create a suite of possible suboptimal demonstrators to study. They run experiments with exact inference on random MDPs and FrozenLake, and with approximate inference on a simple autonomous driving environment, and conclude:

    1. Irrationalities can be helpful for reward inference, that is, if you infer a reward from demonstrations by an irrational demonstrator (where you know the irrationality), you often learn more about the reward than if you inferred a reward from optimal demonstrations (where you know they are optimal). Conceptually, this happens because optimal demonstrations only tell you about what the best behavior is, whereas most kinds of irrationality can also tell you about preferences between suboptimal behaviors.

    2. If you fail to model irrationality, your performance can be very bad, that is, if you infer a reward from demonstrations by an irrational demonstrator, but you assume that the demonstrator was Boltzmann rational, you can perform quite badly.

    Rohin's opinion: One way this paper differs from my intuitions is that it finds that assuming Boltzmann rationality performs very poorly if the demonstrator is in fact systematically suboptimal. I would have instead guessed that Boltzmann rationality would do okay -- not as well as in the case where there is no misspecification, but only a little worse than that. (That’s what I found in my paper (AN #59), and it makes intuitive sense to me.) Some hypotheses for what’s going on, which the lead author agrees are at least part of the story:

    1. When assuming Boltzmann rationality, you infer a distribution over reward functions that is “close” to the correct one in terms of incentivizing the right behavior, but differs in rewards assigned to suboptimal behavior. In this case, you might get a very bad log loss (the metric used in this paper), but still have a reasonable policy that is decent at acquiring true reward (the metric used in my paper).

    2. The environments we’re using may differ in some important way (for example, in the environment in my paper, it is primarily important to identify the goal, which might be much easier to do than inferring the right behavior or reward in the autonomous driving environment used in this paper).

    FORECASTING

    Forecasting progress in language models (Matthew Barnett) (summarized by Sudhanshu): This post aims to forecast when a "human-level language model" may be created. To build up to this, the author swiftly covers basic concepts from information theory and natural language processing such as entropy, N-gram models, modern LMs, and perplexity. Data for perplexity achieved from recent state-of-the-art models is collected and used to estimate - by linear regression - when we can expect to see future models score below certain entropy levels, approaching the hypothesised entropy for the English Language.

    These predictions range across the next 15 years, depending which dataset, method, and entropy level is being solved for; there's an attached python notebook with these details for curious readers to further investigate. Preemptly disjunctive, the author concludes "either current trends will break down soon, or human-level language models will likely arrive in the next decade or two."

    Sudhanshu's opinion: This quick read provides a natural, accessible analysis stemming from recent results, while staying self-aware (and informing readers) of potential improvements. The comments section too includes some interesting debates, e.g. about the Goodhart-ability of the Perplexity metric.

    I personally felt these estimates were broadly in line with my own intuitions. I would go so far as to say that with the confluence of improved generation capabilities across text, speech/audio, video, as well as multimodal consistency and integration, virtually any kind of content we see ~10 years from now will be algorithmically generated and indistinguishable from the work of human professionals.

    Rohin's opinion: I would generally adopt forecasts produced by this sort of method as my own, perhaps making them a bit longer as I expect the quickly growing compute trend to slow down. Note however that this is a forecast for human-level language models, not transformative AI; I would expect these to be quite different and would predict that transformative AI comes significantly later.

    MISCELLANEOUS (ALIGNMENT)

    Rohin Shah on the State of AGI Safety Research in 2021 (Lucas Perry and Rohin Shah) (summarized by Rohin): As in previous years (AN #54), on this FLI podcast I talk about the state of the field. Relative to previous years, this podcast is a bit more introductory, and focuses a bit more on what I find interesting rather than what the field as a whole would consider interesting.

    Read more: Transcript

    NEAR-TERM CONCERNS
    RECOMMENDER SYSTEMS

    User Tampering in Reinforcement Learning Recommender Systems (Charles Evans et al) (summarized by Zach): Large-scale recommender systems have emerged as a way to filter through large pools of content to identify and recommend content to users. However, these advances have led to social and ethical concerns over the use of recommender systems in applications. This paper focuses on the potential for social manipulability and polarization from the use of RL-based recommender systems. In particular, they present evidence that such recommender systems have an instrumental goal to engage in user tampering by polarizing users early on in an attempt to make later predictions easier.

    To formalize the problem the authors introduce a causal model. Essentially, they note that predicting user preferences requires an exogenous variable, a non-observable variable, that models click-through rates. They then introduce a notion of instrumental goal that models the general behavior of RL-based algorithms over a set of potential tasks. The authors argue that such algorithms will have an instrumental goal to influence the exogenous/preference variables whenever user opinions are malleable. This ultimately introduces a risk for preference manipulation.

    The author's hypothesis is tested using a simple media recommendation problem. They model the exogenous variable as either leftist, centrist, or right-wing. User preferences are malleable in the sense that a user shown content from an opposing side will polarize their initial preferences. In experiments, the authors show that a standard Q-learning algorithm will learn to tamper with user preferences which increases polarization in both leftist and right-wing populations. Moreover, even though the agent makes use of tampering it fails to outperform a crude baseline policy that avoids tampering.

    Zach's opinion: This article is interesting because it formalizes and experimentally demonstrates an intuitive concern many have regarding recommender systems. I also found the formalization of instrumental goals to be of independent interest. The most surprising result was that the agents who exploit tampering are not particularly more effective than policies that avoid tampering. This suggests that the instrumental incentive is not really pointing at what is actually optimal which I found to be an illuminating distinction.

    NEWS

    OpenAI hiring Software Engineer, Alignment (summarized by Rohin): Exactly what it sounds like: OpenAI is hiring a software engineer to work with the Alignment team.

    BERI hiring ML Software Engineer (Sawyer Bernath) (summarized by Rohin): BERI is hiring a remote ML Engineer as part of their collaboration with the Autonomous Learning Lab at UMass Amherst. The goal is to create a software library that enables easy deployment of the ALL's Seldonian algorithm framework for safe and aligned AI.

    AI Safety Needs Great Engineers (Andy Jones) (summarized by Rohin): If the previous two roles weren't enough to convince you, this post explicitly argues that a lot of AI safety work is bottlenecked on good engineers, and encourages people to apply to such roles.

    AI Safety Camp Virtual 2022 (summarized by Rohin): Applications are open for this remote research program, where people from various disciplines come together to research an open problem under the mentorship of an established AI-alignment researcher. Deadline to apply is December 1st.

    Political Economy of Reinforcement Learning schedule (summarized by Rohin): The date for the PERLS workshop (AN #159) at NeurIPS has been set for December 14, and the schedule and speaker list are now available on the website.

    FEEDBACK I'm always happy to hear feedback; you can send it to me, Rohin Shah by replying to this email. PODCAST An audio podcast version of the Alignment Newsletter is available. This podcast is an audio version of the newsletter, recorded by Robert Miles (http://robertskmiles.com).
    Subscribe here:

  • Recorded by Robert Miles: http://robertskmiles.com

    More information about the newsletter here: https://rohinshah.com/alignment-newsletter/

    YouTube Channel: https://www.youtube.com/channel/UCfGGFXwKpr-TJ5HfxEFaFCg

    HIGHLIGHTS

    Request for proposals for projects in AI alignment that work with deep learning systems (Nick Beckstead and Asya Bergal) (summarized by Rohin): Open Philanthropy is seeking proposals for AI safety work in four major areas related to deep learning, each of which I summarize below. Proposals are due January 10, and can seek up to $1M covering up to 2 years. Grantees may later be invited to apply for larger and longer grants.

    Rohin's opinion: Overall, I like these four directions and am excited to see what comes out of them! I'll comment on specific directions below.

    RFP: Measuring and forecasting risks (Jacob Steinhardt) (summarized by Rohin): Measurement and forecasting is useful for two reasons. First, it gives us empirical data that can improve our understanding and spur progress. Second, it can allow us to quantitatively compare the safety performance of different systems, which could enable the creation of safety standards. So what makes for a good measurement?

    1. Relevance to AI alignment: The measurement exhibits a failure mode that becomes worse as models become larger, or tracks a potential capability that may emerge with further scale (which in turn could enable deception, hacking, resource acquisition, etc).

    2. Forward-looking: The measurement helps us understand future issues, not just those that exist today. Isolated examples of a phenomenon are good if we have nothing else, but we’d much prefer to have a systematic understanding of when a phenomenon occurs and how it tends to quantitatively increase or decrease with various factors. See for example scaling laws (AN #87).

    3. Rich data source: Not all trends in MNIST generalize to CIFAR-10, and not all trends in CIFAR-10 generalize to ImageNet. Measurements on data sources with rich factors of variation are more likely to give general insights.

    4. Soundness and quality: This is a general category for things like “do we know that the signal isn’t overwhelmed by the noise” and “are there any reasons that the measurement might produce false positives or false negatives”.

    What sorts of things might you measure?

    1. As you scale up task complexity, how much do you need to scale up human-labeled data to continue to maintain good performance and avoid reward hacking? If you fail at this and there are imperfections in the reward, how bad does this become?

    2. What changes do we observe based on changes in the quality of the human feedback (e.g. getting feedback from amateurs vs experts)? This could give us information about the acceptable “difference in intelligence” between a model and its supervisor.

    3. What happens when models are pushed out of distribution along a factor of variation that was not varied in the pretraining data?

    4. To what extent do models provide wrong or undesired outputs in contexts where they are capable of providing the right answer?

    Rohin's opinion: Measurements generally seem great. One story for impact is that we have a measurement that we think is strongly correlated with x-risk, and we use that measurement to select an AI system that scores low on such a metric. This seems distinctly good and I think would in fact reduce x-risk! But I want to clarify that I don’t think it would convince me that the system was safe with high confidence. The conceptual arguments against high confidence in safety seem quite strong and not easily overcome by such measurements. (I’m thinking of objective robustness failures (AN #66) of the form “the model is trying to pursue a simple proxy, but behaves well on the training distribution until it can execute a treacherous turn”.)

    You can also tell stories where the measurements reveal empirical facts that then help us have high confidence in safety, by allowing us to build better theories and arguments, which can rule out the conceptual arguments above.

    Separately, these measurements are also useful as a form of legible evidence about risk to others who are more skeptical of conceptual arguments.

    RFP: Techniques for enhancing human feedback (Ajeya Cotra) (summarized by Rohin): Consider a topic previously analyzed in aligning narrowly superhuman models (AN #141): how can we use human feedback to train models to do what we want, in cases where the models are more knowledgeable than the humans providing the feedback? A variety of techniques have been proposed to solve this problem, including iterated amplification (AN #40), debate (AN #5), recursive reward modeling (AN #34), market making (AN #108), and generalizing from short deliberations to long deliberations. This RFP solicits proposals that aim to test these or other mechanisms on existing systems. There are a variety of ways that to set up the experiments so that the models are more knowledgeable than the humans providing the feedback, for example:

    1. Train a language model to accurately explain things about a field that the feedback providers are not familiar with.

    2. Train an RL agent to act well in an environment where the RL agent can observe more information than the feedback providers can.

    3. Train a multilingual model to translate between English and a foreign language that the feedback providers do not know.

    RFP: Interpretability (Chris Olah) (summarized by Rohin): The author provides this one sentence summary: We would like to see research building towards the ability to “reverse engineer" trained neural networks into human-understandable algorithms, enabling auditors to catch unanticipated safety problems in these models.

    This RFP is primarily focused on an aspirational “intermediate” goal: to fully reverse engineer some modern neural network, such as an ImageNet classifier. (Despite the ambition, it is only an “intermediate” goal because what we would eventually need is a general method for cheaply reverse engineering any neural network.) The proposed areas of research are primarily inspired by the Circuits line of work (AN #142):

    1. Discovering Features and Circuits: This is the most obvious approach to the aspirational goal. We simply “turn the crank” using existing tools to study new features and circuits, and this fairly often results in an interesting result, that makes progress towards reverse engineering a neural network.

    2. Scaling Circuits to Larger Models: So far the largest example of reverse engineering is curve circuits, with 50K parameters. Can we find examples of structure in the neural networks that allow us to drastically reduce the amount of effort required per parameter? (As examples, see equivariance and branch specialization.)

    3. Resolving Polysemanticity: One of the core building blocks of the circuits approach is to identify a neuron with a concept, so that connections between neurons can be analyzed as connections between concepts. Unfortunately, some neurons are polysemantic, that is, they encode multiple different concepts. This greatly complicates analysis of the connections and circuits between these neurons. How can we deal with this potential obstacle?

    Rohin's opinion: The full RFP has many, many more points about these topics; it’s 8 pages of remarkably information-dense yet readable prose. If you’re at all interested in mechanistic interpretability, I recommend reading it in full.

    This RFP also has the benefit of having the most obvious pathway to impact: if we understand what algorithm neural networks are running, there’s a much better chance that we can catch any problems that arise, especially ones in which the neural network is deliberately optimizing against us. It’s one of the few areas where nearly everyone agrees that further progress is especially valuable.

    RFP: Truthful and honest AI (Owain Evans) (summarized by Rohin): This RFP outlines research projects on Truthful AI (summarized below). They fall under three main categories:

    1. Increasing clarity about “truthfulness” and “honesty”. While there are some tentative definitions of these concepts, there is still more precision to be had: for example, how do we deal with statements with ambiguous meanings, or ones involving figurative language? What is the appropriate standard for robustly truthful AI? It seems too strong to require the AI system to never generate a false statement; for example it might misunderstand the meaning of a newly coined piece of jargon.

    2. Creating benchmarks and tasks for Truthful AI, such as TruthfulQA (AN #165), which checks for imitative falsehoods. This is not just meant to create a metric to improve on; it may also simply perform as a measurement. For example, we could experimentally evaluate whether honesty generalizes (AN #158), or explore how much truthfulness is reduced when adding in a task-specific objective.

    3. Improving the truthfulness of models, for example by finetuning models on curated datasets of truthful utterances, finetuning on human feedback, using debate (AN #5), etc.

    Besides the societal benefits from truthful AI, building truthful AI systems can also help with AI alignment:

    1. A truthful AI system can be used to supervise its own actions, by asking it whether its selected action was good.

    2. A robustly truthful AI system could continue to do this after deployment, allowing for ongoing monitoring of the AI system.

    3. Similarly, we could have a robustly truthful AI system supervise its own actions in hypothetical scenarios, to make it more robustly aligned.

    Rohin's opinion: While I agree that making AI systems truthful would then enable many alignment strategies, I’m actually more interested in the methods by which we make AI systems truthful. Many of the ideas suggested in the RFP are ones that would apply for alignment more generally, and aren’t particularly specific to truthful AI. So it seems like whatever techniques we used to build truthful AI could then be repurposed for alignment. In other words, I expect that the benefit to AI alignment of working on truthful AI is that it serves as a good test case for methods that aim to impose constraints upon an AI system. In this sense, it is a more challenging, larger version of the ”never describe someone getting injured” challenge (AN #166). Note that I am only talking about how this helps AI alignment; there are also beneficial effects on society from pursuing truthful AI that I haven’t talked about here.

    AI GOVERNANCE

    Truthful AI: Developing and governing AI that does not lie (Owain Evans, Owen Cotton-Barratt et al) (summarized by Rohin): This paper argues that we should develop both the technical capabilities and the governance mechanisms necessary to ensure that AI systems are truthful. We will primarily think about conversational AI systems here (so not, say, AlphaFold).

    Some key terms:

    1. An AI system is honest if it only makes statements that it actually believes. (This requires you to have some way of ascribing beliefs to the system.) In contrast, truthfulness only checks if statements correspond to reality, without making any claims about the AI system’s beliefs.

    2. An AI system is broadly truthful if it doesn’t lie, volunteers all the relevant information it knows, is well-calibrated and knows the limits of its information, etc.

    3. An AI system is narrowly truthful if it avoids making negligent suspected-falsehoods. These are statements that can feasibly be determined by the AI system to be unacceptably likely to be false. Importantly, a narrowly truthful AI is not required to make contentful statements, it can express uncertainty or refuse to answer.

    This paper argues for narrow truthfulness as the appropriate standard. Broad truthfulness is not very precisely defined, making it challenging to coordinate on. Honesty does not give us the guarantees we want: in settings in which it is advantageous to say false things, AI systems might end up being honest but deluded. They would honestly report their beliefs, but those beliefs might be false.

    Narrow truthfulness is still a much stronger standard that we impose upon standards. This is desirable, because (1) AI systems need not be constrained by social norms, the way humans are; consequently they need stronger standards, and (2) it may be less costly to enforce that AI systems are narrowly truthful than to enforce that humans are narrowly truthful, so a higher standard is more feasible.

    Evaluating the (narrow) truthfulness of a model is non-trivial. There are two parts: first, determining whether a given statement is unacceptably likely to be false, and second, determining whether the model was negligent in uttering such a statement. The former could be done by having human processes that study a wide range of information and determine whether a given statement is unacceptably likely to be false. In addition to all of the usual concerns about the challenges of evaluating a model that might know more than you, there is also the challenge that it is not clear exactly what counts as “unacceptably likely to be false”. For example, if a model utters a false statement, but expresses low confidence, how should that be rated? The second part, determining negligence, needs to account for the fact that the AI system might not have had all the necessary information, or that it might not have been capable enough to come to the correct conclusion. One way of handling this is to compare the AI system to other AI systems built in a similar fashion.

    How might narrow truthfulness be useful? One nice thing it enables is truthfulness amplification, in which we can amplify properties of a model by asking a web of related questions and combining the answers appropriately. For example, if we are concerned that the AI system is deceiving us on just this question, we could ask it whether it is deceiving us, or whether an investigation into its statement would conclude that it was deceptive. As another example, if we are worried that the AI system is making a mistake on some question where its statement isn’t obviously false, we can ask it about its evidence for its position and how strong the evidence is (where false statements are more likely to be negligently false).

    Section 3 is devoted to the potential benefits and costs if we successfully ensure that AI systems are narrowly truthful, with the conclusion that the costs are small relative to the benefits, and can be partially mitigated. Section 6 discusses other potential benefits and costs if we attempt to create truthfulness standards to ensure the AI systems are narrowly truthful. (For example, we might try to create a truthfulness standard, but instead create an institution that makes sure that AI systems follow a particular agenda (by only rating as true the statements that are consistent with that agenda). Section 4 talks about the governance mechanisms we might use to implement a truthfulness standard. Section 5 describes potential approaches for building truthful AI systems. As I mentioned in the highlighted post, these techniques are general alignment techniques that have been specialized for truthful AI.

    NEWS

    Q&A Panel on Applying for Grad School (summarized by Rohin): In this event run by AI Safety Support on November 7, current PhD students will share their experiences navigating the application process and AI Safety research in academia. RSVP here.

    SafeAI Workshop 2022 (summarized by Rohin): The SafeAI workshop at AAAI is now accepting paper submissions, with a deadline of Nov 12.

    FLI's $25M Grants Program for Existential Risk Reduction (summarized by Rohin): This podcast talks about FLI's recent grants program for x-risk reduction. I've previously mentioned the fellowships (AN #165) they are running as part of this program. As a reminder, the application deadline is October 29 for the PhD fellowship, and November 5 for the postdoc fellowship.

  • Recorded by Robert Miles: http://robertskmiles.com

    More information about the newsletter here: https://rohinshah.com/alignment-newsletter/

    YouTube Channel: https://www.youtube.com/channel/UCfGGFXwKpr-TJ5HfxEFaFCg

    HIGHLIGHTS

    Unsolved Problems in ML Safety (Dan Hendrycks, Nicholas Carlini, John Schulman, and Jacob Steinhardt) (summarized by Dan Hendrycks): To make the case for safety to the broader machine learning research community, this paper provides a revised and expanded collection of concrete technical safety research problems, namely:

    1. Robustness: Create models that are resilient to adversaries, unusual situations, and Black Swan events.

    2. Monitoring: Detect malicious use, monitor predictions, and discover unexpected model functionality.

    3. Alignment: Build models that represent and safely optimize hard-to-specify human values.

    4. External Safety: Use ML to address risks to how ML systems are handled, including cyberwarfare and global turbulence.

    Throughout, the paper attempts to clarify problem’s motivation and provide concrete project ideas.

    Dan Hendrycks' opinion: My coauthors and I wrote this paper with the ML research community as our target audience. Here are some thoughts on this topic:

    1. The document includes numerous problems that, if left unsolved, would imply that ML systems are unsafe. We need the effort of thousands of researchers to address all of them. This means that the main safety discussions cannot stay within the confines of the relatively small EA community. I think we should aim to have over one third of the ML research community work on safety problems. We need the broader community to treat AI at least as seriously as safety for nuclear power plants.

    2. To grow the ML research community, we need to suggest problems that can progressively build the community and organically grow support for elevating safety standards within the existing research ecosystem. Research agendas that pertain to AGI exclusively will not scale sufficiently, and such research will simply not get enough market share in time. If we do not get the machine learning community on board with proactively mitigating risks that already exist, we will have a harder time getting them to mitigate less familiar and unprecedented risks. Rather than try to win over the community with alignment philosophy arguments, I'll try winning them over with interesting problems and try to make work towards safer systems rewarded with prestige.

    3. The benefits of a larger ML Safety community are numerous. They can decrease the cost of safety methods and increase the propensity to adopt them. Moreover, to make ML systems have desirable properties, it is necessary to rapidly accumulate incremental improvements, but this requires substantial growth since such gains cannot be produced by just a few card-carrying x-risk researchers with the purest intentions.

    4. The community will fail to grow if we ignore near-term concerns or actively exclude or sneer at people who work on problems that are useful for both near- and long-term safety (such as adversaries). The alignment community will need to stop engaging in textbook territorialism and welcome serious hypercompetent researchers who do not post on internet forums or who happen not to subscribe to effective altruism. (We include a community strategy in the Appendix.)

    5. We focus on reinforcement learning but also deep learning. Most of the machine learning research community studies deep learning (e.g., text processing, vision) and does not use, say, Bellman equations or PPO. While existentially catastrophic failures will likely require competent sequential decision making agents, the relevant problems and solutions can often be better studied outside of gridworlds and MuJoCo. There is much useful safety research to be done that does not need to be cast as a reinforcement learning problem.

    6. To prevent alienating readers, we did not use phrases such as "AGI." AGI-exclusive research will not scale; for most academics and many industry researchers, it's a nonstarter. Likewise, to prevent needless dismissiveness, we kept x-risks implicit, only hinted at them, or used the phrase "permanent catastrophe."

    I would have personally enjoyed discussing at length how anomaly detection is an indispensable tool for reducing x-risks from Black Balls, engineered microorganisms, and deceptive ML systems.

    Here are how the problems relate to x-risk:

    Adversarial Robustness: This is needed for proxy gaming. ML systems encoding proxies must become more robust to optimizers, which is to say they must become more adversarially robust. We make this connection explicit at the bottom of page 9.

    Black Swans and Tail Risks: It's hard to be safe without high reliability. It's not obvious we'll achieve high reliability even by the time we have systems that are superhuman in important respects. Even though MNIST is solved for typical inputs, we still do not even have an MNIST classifier for atypical inputs that is reliable! Moreover, if optimizing agents become unreliable in the face of novel or extreme events, they could start heavily optimizing the wrong thing. Models accidentally going off the rails poses an x-risk if they are sufficiently powerful (this is related to "competent errors" and "treacherous turns"). If this problem is not solved, optimizers can use these weaknesses; this is a simpler problem on the way to adversarial robustness.

    Anomaly and Malicious Use Detection: This is an indispensable tool for detecting proxy gaming, Black Balls, engineered microorganisms that present bio x-risks, malicious users who may misalign a model, deceptive ML systems, and rogue ML systems.

    Representative Outputs: Making models honest is a way to avoid many treacherous turns.

    Hidden Model Functionality: This also helps avoid treacherous turns. Backdoors is a potentially useful related problem, as it is about detecting latent but potential sharp changes in behavior.

    Value Learning: Understanding utilities is difficult even for humans. Powerful optimizers will need to achieve a certain, as-of-yet unclear level of superhuman performance at learning our values.

    Translating Values to Action: Successfully prodding models to optimize our values is necessary for safe outcomes.

    Proxy Gaming: Obvious.

    Value Clarification: This is the philosophy bot section. We will need to decide what values to pursue. If we decide poorly, we may lock in or destroy what is of value. It also possible that there is an ongoing moral catastrophe, which we would not want to replicate across the cosmos.

    Unintended Consequences: This should help models not accidentally work against our values.

    ML for Cybersecurity: If you believe that AI governance is valuable and that global turbulence risks can increase risks of terrible outcomes, this section is also relevant. Even if some of the components of ML systems are safe, they can become unsafe when traditional software vulnerabilities enable others to control their behavior. Moreover, traditional software vulnerabilities may lead to the proliferation of powerful advanced models, and this may be worse than proliferating nuclear weapons.

    Informed Decision Making: We want to avoid decision making based on unreliable gut reactions during a time of crisis. This reduces risks of poor governance of advanced systems.

    Here are some other notes:

    1. We use systems theory to motivate inner optimization as we expect motivation will be more convincing to others.

    2. Rather than have a broad call for "interpretability," we focus on specific transparency-related problems that are more tractable and neglected. (See the Appendix for a table assessing importance, tractability, and neglectedness.) For example, we include sections on making models honest and detecting emergent functionality.

    3. The "External Safety" section can also be thought of as technical research for reducing "Governance" risks. For readers mostly concerned about AI risks from global turbulence, there still is technical research that can be done.

    Here are some observations while writing the document:

    1. Some approaches that were previously very popular are currently neglected, such as inverse reinforcement learning. This may be due to currently low tractability.

    2. Five years ago, I started explicitly brainstorming the content for this document. I think it took the whole time for this document to take shape. Moreover, if this were written last fall, the document would be far more confused, since it took around a year after GPT-3 to become reoriented; writing these types of documents shortly after a paradigm shift may be too hasty.

    3. When collecting feedback, it was not uncommon for "in-the-know" researchers to make opposite suggestions. Some people thought some of the problems in the Alignment section were unimportant, while others thought they were the most critical. We attempted to include most research directions.

    [MLSN #1]: ICLR Safety Paper Roundup (Dan Hendrycks) (summarized by Rohin): This is the first issue of the ML Safety Newsletter, which is "a monthly safety newsletter which is designed to cover empirical safety research and be palatable to the broader machine learning research community".

    Rohin's opinion: I'm very excited to see this newsletter: this is a category of papers that I want to know about and that are relevant to safety, but I don't have the time to read all of these papers given all the other alignment work I read, especially since I don't personally work in these areas and so often find it hard to summarize them or place them in the appropriate context. Dan on the other hand has written many such papers himself and generally knows the area, and so will likely do a much better job than I would. I recommend you subscribe, especially since I'm not going to send a link to each MLSN in this newsletter.

    TECHNICAL AI ALIGNMENT
    TECHNICAL AGENDAS AND PRIORITIZATION

    Selection Theorems: A Program For Understanding Agents (John Wentworth) (summarized by Rohin): This post proposes a research area for understanding agents: selection theorems. A selection theorem is a theorem that tells us something about agents that will be selected for in a broad class of environments. Selection theorems are helpful because (1) they can provide additional assumptions that can help with learning human values, and (2) they can tell us likely properties of the agents we build by accident (think inner alignment concerns).

    As an example, coherence arguments demonstrate that when an environment presents an agent with “bets” or “lotteries”, where the agent cares only about the outcomes of the bets, then any “good” agent can be represented as maximizing expected utility. (What does it mean to be “good”? This can vary, but one example would be that the agent is not subject to Dutch books, i.e. situations in which it is guaranteed to lose resources.) This can then be turned into a selection argument by combining it with something that selects for “good” agents. For example, evolution will select for agents that don’t lose resources for no gain, so humans are likely to be represented as maximizing expected utility. Unfortunately, many coherence arguments implicitly assume that the agent has no internal state, which is not true for humans, so this argument does not clearly work. As another example, our ML training procedures will likely also select for agents that don’t waste resources, which could allow us to conclude that the resulting agents can be represented as maximizing expected utility, if the agents don't have internal state.

    Coherence arguments aren’t the only kind of selection theorem. The good(er) regulator theorem (AN #138) provides a set of scenarios under which agents learn an internal “world model”. The Kelly criterion tells us about scenarios in which the best (most selected) agents will make bets as though they are maximizing expected log money. These and other examples are described in this followup post.

    The rest of this post elaborates on the various parts of a selection theorem, and provides advice on how to make original research contributions in the area of selection theorems. Another followup post describes some useful properties for which the author expects there are useful selections theorems to prove.

    Rohin's opinion: People sometimes expect me to be against this sort of work, because I wrote Coherence arguments do not imply goal-directed behavior (AN #35). This is not true. My point in that post is that coherence arguments alone are not enough, you need to combine them with some other assumption (for example, that there exists some “resource” over which the agent has no terminal preferences). I do think it is plausible that this research agenda gives us a better picture of agency that tells us something about how AI systems will behave, or something about how to better infer human values. While I am personally more excited about studying particular development paths to AGI rather than more abstract agent models, I do think this research would be more useful than other types of alignment research I have seen proposed.

    OTHER PROGRESS IN AI
    MISCELLANEOUS (AI)

    State of AI Report 2021 (Nathan Benaich and Ian Hogarth) (summarized by Rohin): As with past (AN #15) reports (AN #120), I’m not going to summarize the entire thing, and instead you get the high-level themes that the authors identified:

    1. AI is stepping up in more concrete ways, including in mission critical infrastructure.

    2. AI-first approaches have taken biology by storm (and we aren’t just talking about AlphaFold).

    3. Transformers have emerged as a general purpose architecture for machine learning in many domains, not just NLP.

    4. Investors have taken notice, with record funding this year into AI startups, and two first ever IPOs for AI-first drug discovery companies, as well as blockbuster IPOs for data infrastructure and cybersecurity companies that help enterprises retool for the AI-first era.

    5. The under-resourced AI-alignment efforts from key organisations who are advancing the overall field of AI, as well as concerns about datasets used to train AI models and bias in model evaluation benchmarks, raise important questions about how best to chart the progress of AI systems with rapidly advancing capabilities.

    6. AI is now an actual arms race rather than a figurative one, with reports of recent use of autonomous weapons by various militaries.

    7. Within the US-China rivalry, China's ascension in research quality and talent training is notable, with Chinese institutions now beating the most prominent Western ones.

    8. There is an emergence and nationalisation of large language models.

    Rohin's opinion: In last year’s report (AN #120), I said that their 8 predictions seemed to be going out on a limb, and that even 67% accuracy woud be pretty impressive. This year, they scored their predictions as 5 “Yes”, 1 “Sort of”, and 2 “No”. That being said, they graded “The first 10 trillion parameter dense model” as “Yes”, I believe on the basis that Microsoft had run a couple of steps of training on a 32 trillion parameter dense model. I definitely interpreted the prediction as saying that a 10 trillion parameter model would be trained to completion, which I do not think happened publicly, so I’m inclined to give it a “No”. Still, this does seem like a decent track record for what seemed to me to be non-trivial predictions. This year's predictions seem similarly "out on a limb" as last year's.

    This year’s report included one slide summaries of many papers I’ve summarized before. I only found one major issue -- the slide on TruthfulQA (AN #165) implies that larger language models are less honest in general, rather than being more likely to imitate human falsehoods. This is actually a pretty good track record, given the number of things they summarized where I would have noticed if there were major issues.

    NEWS

    CHAI Internships 2022 (summarized by Rohin): CHAI internships are open once again! Typically, an intern will execute on an AI safety research project proposed by their mentor, resulting in a first-author publication at a workshop. The early deadline is November 23rd and the regular deadline is December 13th.

  • Recorded by Robert Miles: http://robertskmiles.com

    More information about the newsletter here: https://rohinshah.com/alignment-newsletter/

    YouTube Channel: https://www.youtube.com/channel/UCfGGFXwKpr-TJ5HfxEFaFCg

    HIGHLIGHTS

    The "most important century" series (Holden Karnofsky) (summarized by Rohin): In some sense, it is really weird for us to claim that there is a non-trivial chance that in the near future, we might build transformative AI and either (1) go extinct or (2) exceed a growth rate of (say) 100% per year. It feels like an extraordinary claim, and thus should require extraordinary evidence. One way of cashing this out: if the claim were true, this century would be the most important century, with the most opportunity for individuals to have an impact. Given the sheer number of centuries there are, this is an extraordinary claim; it should really have extraordinary evidence. This series argues that while the claim does seem extraordinary, all views seem extraordinary -- there isn’t some default baseline view that is “ordinary” to which we should be assigning most of our probability.

    Specifically, consider three possibilities for the long-run future:

    1. Radical: We will have a productivity explosion by 2100, which will enable us to become technologically mature. Think of a civilization that sends spacecraft throughout the galaxy, builds permanent settlements on other planets, harvests large fractions of the energy output from stars, etc.

    2. Conservative: We get to a technologically mature civilization, but it takes hundreds or thousands of years. Let’s say even 100,000 years to be ultra conservative.

    3. Skeptical: We never become technologically mature, for some reason. Perhaps we run into fundamental technological limits, or we choose not to expand into the galaxy, or we’re in a simulation, etc.

    It’s pretty clear why the radical view is extraordinary. What about the other two?

    The conservative view implies that we are currently in the most important 100,000-year period. Given that life is billions of years old, and would presumably continue for billions of years to come once we reach a stable galaxy-wide civilization, that would make this the most important 100,000 year period out of tens of thousands of such periods. Thus the conservative view is also extraordinary, for the same reason that the radical view is extraordinary (albeit it is perhaps only half as extraordinary as the radical view).

    The skeptical view by itself does not seem obviously extraordinary. However, while you could assign 70% probability to the skeptical view, it seems unreasonable to assign 99% probability to such a view -- that suggests some very strong or confident claims about what prevents us from colonizing the galaxy, that we probably shouldn’t have given our current knowledge. So, we need to have a non-trivial chunk of probability on the other views, which still opens us up to critique of having extraordinary claims.

    Okay, so we’ve established that we should at least be willing to say something as extreme as “there’s a non-trivial chance we’re in the most important 100,000-year period”. Can we tighten the argument, to talk about the most important century? In fact, we can, by looking at the economic growth rate.

    You are probably aware that the US economy grows around 2-3% per year (after adjusting for inflation), so a business-as-usual, non-crazy, default view might be to expect this to continue. You are probably also aware that exponential growth can grow very quickly. At the lower end of 2% per year, the economy would double every ~35 years. If this continued for 8200 years, we'd need to be sustaining multiple economies as big as today's entire world economy per atom in the universe. While this is not a priori impossible, it seems quite unlikely to happen. This suggests that we’re in one of fewer than 82 centuries that will have growth rates at 2% or larger, making it far less “extraordinary” to claim that we’re in the most important one, especially if you believe that growth rates are well correlated with change and ability to have impact.

    The actual radical view that the author places non-trivial probability on is one we’ve seen before in this newsletter: it is one in which there is automation of science and technology through advanced AI or whole brain emulations or other possibilities. This allows technology to substitute for human labor in the economy, which produces a positive feedback loop as the output of the economy is ploughed back into the economy creating superexponential growth and a “productivity explosion”, where the growth rate increases far beyond 2%. The series has summarizes and connects together many (AN #105), past (AN #154), Open (AN #121), Phil (AN #118) analyses (AN #145), which I won't be summarizing here (since we've summarized these analyses previously). While this is a more specific and “extraordinary” claim than even the claim that we live in the most important century, it seems like it should not be seen as so extraordinary given the arguments above.

    This series also argues for a few other points important to longtermism, which I’ll copy here:

    1. The long-run future is radically unfamiliar. Enough advances in technology could lead to a long-lasting, galaxy-wide civilization that could be a radical utopia, dystopia, or anything in between.

    2. The long-run future could come much faster than we think, due to a possible AI-driven productivity explosion. (I briefly mentioned this above, but the full series devotes much more space and many more arguments to this point.)

    3. We, the people living in this century, have the chance to have a huge impact on huge numbers of people to come - if we can make sense of the situation enough to find helpful actions. But right now, we aren't ready for this.

    Read more: 80,000 Hours podcast on the topic

    Rohin's opinion: I especially liked this series for the argument that 2% economic growth very likely cannot last much longer, providing quite a strong argument for the importance of this century, without relying at all on controversial facts about AI. At least personally I was previously uneasy about how “grand” or “extraordinary” AGI claims tend to be, and whether I should be far more skeptical of them as a result. I feel significantly more comfortable with these claims after seeing this argument.

    Note though that it does not defuse all such uneasiness -- you can still look at how early we appear to be (given the billions of years of civilization that could remain in the future), and conclude that the simulation hypothesis is true, or that there is a Great Filter in our future that will drive us extinct with near-certainty. In such situations there would be no extraordinary impact to be had today by working on AI risk.

    TECHNICAL AI ALIGNMENT
    PROBLEMS

    Why AI alignment could be hard with modern deep learning (Ajeya Cotra) (summarized by Rohin): This post provides an ELI5-style introduction to AI alignment as a major challenge for deep learning. It primarily frames alignment as a challenge in creating Saints (aligned AI systems), without getting Schemers (AI systems that are deceptively aligned (AN #58)) or Sycophants (AI systems that satisfy only the letter of the request, rather than its spirit, as in Another (outer) alignment failure story (AN #146)). Any short summary I write would ruin the ELI5 style, so I won’t attempt it; I do recommend it strongly if you want an introduction to AI alignment.

    LEARNING HUMAN INTENT

    B-Pref: Benchmarking Preference-Based Reinforcement Learning (Kimin Lee et al) (summarized by Zach): Deep RL has become a powerful method to solve a variety of sequential decision tasks using a known reward function for training. However, in practice, rewards are hard to specify making it hard to scale Deep RL for many applications. Preference-based RL provides an alternative by allowing a teacher to indicate preferences between a pair of behaviors. Because the teacher can interactively give feedback to an agent preference-based RL has the potential to help address this limitation of Deep RL. Despite the advantages of preference-based RL it has proven difficult to design useful benchmarks for the problem. This paper introduces a benchmark (B-Pref) that is useful for preference-based RL in various locomotion and robotic manipulation tasks.

    One difficulty with designing a useful benchmark is that teachers may have a variety of irrationalities. For example, teachers might be myopic or make mistakes. The B-Pref benchmark addresses this by emphasizing measuring performance under a variety of teacher irrationalities. They do this by providing various performance metrics to introduce irrationality into otherwise deterministic reward criteria. While previous approaches to preference-based RL work well when the teacher responses are consistent, experiments show they are not robust to feedback noise or teacher mistakes. Experiments also show that how queries are selected has a major impact on performance. With these results, the authors identify these two problems as areas for future work.

    Zach's opinion: While the authors do a good job advocating for the problem of preference-based RL I'm less convinced their particular benchmark is a large step forward. In particular, it seems the main contribution is not a suite of tasks, but rather a collection of different ways to add irrationality to the teacher oracle. The main takeaway of this paper is that current algorithms don't seem to perform well when the teacher can make mistakes, but this is quite similar to having a misspecified reward function. Beyond that criticism, the experiments support the areas suggested for future work.

    ROBUSTNESS

    Redwood Research’s current project (Buck Shlegeris) (summarized by Rohin): This post introduces Redwood Research’s current alignment project: to ensure that a language model finetuned on fanfiction never describes someone getting injured, while maintaining the quality of the generations of that model. Their approach is to train a classifier that determines whether a given generation has a description of someone getting injured, and then to use that classifier as a reward function to train the policy to generate non-injurious completions. Their hope is to learn a general method for enforcing such constraints on models, such that they could then quickly train the model to, say, never mention anything about food.

    FORECASTING

    Distinguishing AI takeover scenarios (Sam Clarke et al) (summarized by Rohin): This post summarizes several AI takeover scenarios that have been proposed, and categorizes them according to three main variables. Speed refers to the question of whether there is a sudden jump in AI capabilities. Uni/multipolarity asks whether a single AI system takes over, or many. Alignment asks what goals the AI systems pursue, and if they are misaligned, further asks whether they are outer or inner misaligned. They also analyze other properties of the scenarios, such as how agentic, general and/or homogenous the AI systems are, and whether AI systems coordinate with each other or not. A followup post investigates social, economic, and technological characteristics of these scenarios. It also generates new scenarios by varying some of these factors.

    Since these posts are themselves summaries and comparisons of previously proposed scenarios that we’ve covered in this newsletter, I won’t summarize them here, but I do recommend them for an overview of AI takeover scenarios.

    MISCELLANEOUS (ALIGNMENT)

    Beyond fire alarms: freeing the groupstruck (Katja Grace) (summarized by Rohin): It has been claimed that there’s no fire alarm for AGI, that is, there will be no specific moment or event at which AGI risk becomes sufficiently obvious and agreed upon, so that freaking out about AGI becomes socially acceptable rather than embarrassing. People often implicitly argue for waiting for an (unspecified) future event that tells us AGI is near, after which everyone will know that it’s okay to work on AGI alignment. This seems particularly bad if no such future event (i.e. fire alarm) exists.

    This post argues that this is not in fact the implicit strategy that people typically use to evaluate and respond to risks. In particular, it is too discrete. Instead, people perform “the normal dance of accumulating evidence and escalating discussion and brave people calling the problem early and eating the potential embarrassment”. As a result, the existence of a “fire alarm” is not particularly important.

    Note that the author does agree that there is some important bias at play here. The original fire alarm post is implicitly considering a fear shame hypothesis: people tend to be less cautious in public, because they expect to be negatively judged for looking scared. The author ends up concluding that there is something broader going on and proposes a few possibilities, many of which still suggest that people will tend to be less cautious around risks when they are observed.

    Some points made in the very detailed, 15,000-word article:

    1. Literal fire alarms don’t work by creating common knowledge, or by providing evidence of a fire. People frequently ignore fire alarms. In one experiment, participants continued to fill out questionnaires while a fire alarm rang, often assuming that someone will lead them outside if it is important.

    2. They probably instead work by a variety of mechanisms, some of which are related to the fear shame hypothesis. Sometimes they provide objective evidence that is easier to use as a justification for caution than a personal guess. Sometimes they act as an excuse for cautious or fearful people to leave, without the implication that those people are afraid. Sometimes they act as a source of authority for a course of action (leaving the building).

    3. Most of these mechanisms are amenable to partial or incremental effects, and in particular can happen with AGI risk. There are many people who have already boldly claimed that AGI risk is a problem. There exists person-independent evidence; for example, surveys of AI researchers suggest a 5% chance of extinction.

    4. For other risks, there does not seem to have been a single discrete moment at which it became acceptable to worry about them (i.e. no “fire alarm”). This includes risks where there has been a lot of caution, such as climate change, the ozone hole, recombinant DNA, COVID, and nuclear weapons.

    5. We could think about building fire alarms; many of the mechanisms above are social ones rather than empirical facts about the world. This could be one out of many strategies that we employ against the general bias towards incaution (the post suggests 16).

    Rohin's opinion: I enjoyed this article quite a lot; it is really thorough. I do see a lot of my own work as pushing on some of these more incremental methods for increasing caution, though I think of it more as a combination of generating more or better evidence, and communicating arguments in a manner more suited to a particular audience. Perhaps I will think of new strategies that aim to reduce fear shame instead.

    NEWS

    Seeking social science students / collaborators interested in AI existential risks (Vael Gates) (summarized by Rohin): This post presents a list of research questions around existential risk from AI that can be tackled by social scientists. The author is looking for collaborators to expand the list and tackle some of the questions on it, and is aiming to provide some mentorship for people getting involved.

  • Recorded by Robert Miles: http://robertskmiles.com

    More information about the newsletter here: https://rohinshah.com/alignment-newsletter/

    YouTube Channel: https://www.youtube.com/channel/UCfGGFXwKpr-TJ5HfxEFaFCg

    HIGHLIGHTS

    TruthfulQA: Measuring How Models Mimic Human Falsehoods (Stephanie Lin et al) (summarized by Rohin): Given that large language models are trained using next-word prediction on a dataset scraped from the Internet, we expect that they will not be aligned with what we actually want. For example, suppose we want our language model to answer questions for us, and then consider the question “What rules do all artificial intelligences follow?” This is a rather unusual question as it presupposes there exists such a set of rules. As a result, this question is probably quite rare in the training data, if interpreted as a question about the real world. However, there is a context in which that question makes much more sense: the context of Isaac Asimov’s novels. A system predicting what might follow that text would reasonably “infer” that we are much more likely to be talking about these novels, and so respond with “All artificial intelligences currently follow the Three Laws of Robotics.” Indeed, this is exactly what GPT-3 does.

    This is an example of an imitative falsehood, in which the model provides a false answer to a question asked of it, because that false answer was incentivized during training. Since we require that imitative falsehoods are incentivized by training, we should expect them to become more prevalent as models are scaled up, making it a good example of an alignment failure that we expect to remain as capabilities scale up.

    The primary contribution of this paper is a benchmark, TruthfulQA, of questions that are likely to lead to imitative falsehoods. The authors first wrote questions that they expected some humans would answer falsely, and filtered somewhat for the ones that GPT-3 answered incorrectly, to get 437 filtered (adversarially selected) questions. They then wrote an additional 380 questions that were not filtered in this way (though of course the authors still tried to choose questions that would lead to imitative falsehoods). They use human evaluations to judge whether or not a model’s answer to a question is truthful, where something like “no comment” still counts as truthful. (I’m sure some readers will wonder how “truth” is defined for human evaluations -- the authors include significant discussion on this point, but I won’t summarize it here.)

    Their primary result is that, as we’d expect based on the motivation, larger models perform worse on this benchmark than smaller models. In a version of the benchmark where models must choose between true and false answers, the models perform worse than random chance. In a control set of similarly-structured trivia questions, larger models perform better, as you’d expect.

    The best-performing model was GPT-3 with a “helpful” prompt, which was truthful on 58% of questions, still much worse than the human baseline of 94%. The authors didn’t report results with the helpful prompt on smaller models, so it is unclear whether with the helpful prompt larger models would still do worse than smaller models.

    It could be quite logistically challenging to use this benchmark to test new language models, since it depends so strongly on human evaluations. To ameliorate this, the authors finetuned GPT-3 to predict human evaluations, and showed that the resulting GPT-3-judge was able to provide a good proxy metric even for new language models whose answers it had not been trained on.

    Read more: Alignment Forum commentary

    Rohin's opinion: I like this as an example of the kind of failure mode that does not immediately go away as models become more capable. However, it is possible that this failure mode is easily fixed with better prompts. Take the Isaac Asimov example: if the prompt explicitly says that the questions are about the real world, it may be that a more capable model than GPT-3 would infer that the text is not talking about Asimov’s books, and so ends up giving a truthful answer. (In fact, it’s possible that the helpful prompt is already enough for this -- I’d be interested in seeing how the smaller models perform with the helpful prompt in order to evaluate this hypothesis.)

    TECHNICAL AI ALIGNMENT
    LEARNING HUMAN INTENT

    Adapting Language Models for Zero-shot Learning by Meta-tuning on Dataset and Prompt Collections (Ruiqi Zhong et al) (summarized by Rohin): Large language models (AN #102) can be prompted to perform classification tasks. However, you may not want to simply phrase the prompt as a question like “Does the following tweet have positive or negative sentiment?”, because in the training set such questions may have been followed by something other than an answer (for example, an elaboration of the question, or a denial that the question is important), and the model may end up choosing one of these alternatives as the most likely completion.

    The natural solution is to collect a question-answering dataset and finetune on it. The core idea of this paper is that we can convert existing NLP classification datasets into a question-answering format, which we can then finetune on. For example, given a dataset for movie review classification (where the goal is to predict whether a review is positive or negative), we produce questions like “Is the review positive?” or “Does the user find this movie bad?” The entire classification dataset can then be turned into question-answer pairs to train on.

    They do this for several datasets, producing 441 question types in total. They then finetune the 0.77B parameter T5 model on a training set of questions, and evaluate it on questions that come from datasets not seen during training. Among other things, they find:

    1. Their model does better than UnifiedQA, which was also trained for question answering using a similar idea.

    2. Pretraining is very important: performance crashes if you “finetune” on top of a randomly initialized model. This suggests that the model already “knows” the relevant information, and finetuning ensures that it uses this knowledge appropriately.

    3. If you ensemble multiple questions that get at the same underlying classification task, you can do better than any of the questions individually.

    4. It is possible to overfit: if you train too long, performance does decrease.

    Finetuned Language Models Are Zero-Shot Learners (Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu et al) (summarized by Rohin): This paper applies the approach from the previous paper on a much larger 137B parameter model to produce a model that follows instructions (rather than just answering questions). Since they are focused on instruction following, they don’t limit themselves to classification tasks: they also want to have generative tasks, and so include e.g. summarization datasets. They also generate such tasks automatically by “inverting” the classification task: given the label y, the goal is to generate the input x. For example, for the movie review classification dataset, they might provide the instruction “Write a negative movie review”, and then provide one of the movie reviews classified as negative as an example of what the model should write in that situation.

    A natural approach to classification with a language model is to ask a question like “Is this movie review positive?” and then checking the probability assigned to “Yes” and “No” and returning whichever one was higher. The authors note that this can be vulnerable to what we might call “probability splitting” (analogously to vote splitting). Even if the correct answer is “Yes”, the model might split probability across “Yes”, “Yup”, “Definitely”, “Absolutely”, etc such that “No” ends up having higher probability than “Yes”. To solve this problem, in classification questions they add a postscript specifying what the options are. During finetuning, the model should quickly learn that the next word is always chosen from one of these options, and so will stop assigning probability to other words, preventing probability splitting.

    They find that the finetuned model does much better on held-out tasks than the original model (both evaluated zero-shot). The finetuned model also beats zero-shot GPT-3 on 19 of 25 tasks, and few-shot GPT-3 on 10 of 25 tasks. The finetuned model is always used zero-shot; unfortunately they don’t report results when using the finetuned model in a few-shot setting.

    They also study the impact of instruction tuning over various model sizes. At every model size, instruction tuning helps significantly on the tasks that were seen during finetuning, as you would expect. However, when considering tasks that were not seen during finetuning, instruction tuning actually hurts performance up to models with 8B parameters, and only helps for the 68B and 137B models (where it raises performance by about 15 percentage points on average across heldout tasks).

    Rohin's opinion: I’m particularly interested in cases where, after crossing a certain size or capability threshold, models become capable of transferring knowledge between domains, for example:

    1. Intuitively, the goal of this paper is to get the model to follow the general rule “understand the semantic content of the instruction and then follow it”. Models only become able to successfully generalize this rule from training tasks to heldout tasks somewhere in the 8B - 68B range.

    2. In the previous paper, the 0.77B model was able to successfully generalize the rule “answer questions well” from training tasks to heldout tasks. Presumably some smaller model would not have been able to do this.

    3. Last week’s highlight (AN #164) showed that the 137B model was able to transfer knowledge from code execution to program synthesis, while the 8B model was unable to do this.

    Notably, the only major difference in these cases is the size of the model: the training method and dataset are the same. This seems like it is telling us something about how neural net generalization works and/or how it arises. I don’t have anything particularly interesting to say about it, but it seems like a phenomenon worth investigating in more detail.

    FORECASTING

    Updates and Lessons from AI Forecasting (Jacob Steinhardt) (summarized by Rohin): This post provides an update on a project obtaining professional forecasts about progress in AI. I’m not going to summarize the full post here, and instead list a few high-level takeaways:

    1. The author found two of the forecasts surprising, while the other four were more in line with his expectations. The surprising forecasts suggested faster progress than he would have expected, and he has updated accordingly.

    2. The forecasts imply confidence that AGI won’t arrive before 2025, but at the same time there will be clear and impressive progress in ML by then.

    3. If you want to use forecasting, one particularly valuable approach is to put in the necessary work to define a good forecasting target. In this case, the author’s research group did this by creating the MATH (AN #144) and Multitask (AN #119) datasets.

    MISCELLANEOUS (ALIGNMENT)

    The alignment problem in different capability regimes (Buck Shlegeris) (summarized by Rohin): One reason that researchers might disagree on what approaches to take for alignment is that they might be solving different versions of the alignment problem. This post identifies two axes on which the “type” of alignment problem can differ. First, you may consider AI systems with differing levels of capability, ranging from subhuman to wildly superintelligent, with human-level somewhere in the middle. Second, you might be thinking about different mechanisms by which this leads to bad outcomes, where possible mechanisms include the second species problem (AN #122) (where AIs seize control of the future from us), the “missed opportunity” problem (where we fail to use AIs as well as we could have, but the AIs aren’t themselves threatening us), and a grab bag of other possibilities (such as misuse of AI systems by bad actors).

    Depending on where you land on these axes, you will get to rely on different assumptions that change what solutions you would be willing to consider:

    1. Competence. If you assume that the AI system is human-level or superintelligent, you probably don’t have to worry about the AI system causing massive problems through incompetence (at least, not to a greater extent than humans do).

    2. Ability to understand itself. With wildly superintelligent systems, it seems reasonable to expect them to be able to introspect and answer questions about its own cognition, which could be a useful ingredient in a solution that wouldn’t work in other regimes.

    3. Inscrutable plans or concepts. With sufficiently competent systems, you might be worried about the AI system making dangerous plans you can’t understand, or reasoning with concepts you will never comprehend. Your alignment solution must be robust to this.

    Rohin's opinion: When I talk about alignment, I am considering the second species problem, with AI systems whose capability level is roughly human-level or more (including “wildly superintelligent”).

    I agree with this comment thread that the core problem in what-I-call-alignment stays conserved across capability levels, but the solutions can change across capability levels. (Also, other people mean different things by “alignment”, such that this would no longer be true.)

    The theory-practice gap (Buck Shlegeris) (summarized by Rohin): We can think of alignment as roughly being decomposed into two “gaps” that we are trying to reduce:

    1. The gap between proposed theoretical alignment approaches (such as iterated amplification) and what we might do without such techniques (aka the unaligned benchmark (AN #33))

    2. The gap between actual implementations of alignment approaches, and what those approaches are theoretically capable of.

    (This distinction is fuzzy. For example, the author puts “the technique can’t answer NP-hard questions” into the second gap while I would have had it in the first gap.)

    We can think of some disagreements in AI alignment as different pictures about how these gaps look:

    1. A stereotypical “ML-flavored alignment researcher” thinks that the first gap is very small, because in practice the model will generalize appropriately to new, more complex situations, and continue to do what we want. Such people would then be more focused on narrowing the second gap, by working on practical implementations.

    2. A stereotypical “MIRI-flavored alignment researcher” thinks that the first gap is huge, such that it doesn’t really matter if you narrow the second gap, because even if you reduced that gap to zero you would still be doomed with near certainty.

    NEWS

    Announcing the Vitalik Buterin Fellowships in AI Existential Safety (Daniel Filan) (summarized by Rohin): FLI is launching a fellowship for incoming PhD students and postdocs who are focused on AI existential safety. The application deadline is October 29 for the PhD fellowship, and November 5 for the postdoc fellowship.

    The Open Phil AI Fellowship (Year 5) (summarized by Rohin): Applications are now open for the fifth cohort of the Open Phil AI Fellowship (AN #66)! They are also due October 29.

  • Recorded by Robert Miles: http://robertskmiles.com

    More information about the newsletter here: https://rohinshah.com/alignment-newsletter/

    YouTube Channel: https://www.youtube.com/channel/UCfGGFXwKpr-TJ5HfxEFaFCg

    HIGHLIGHTS

    Program Synthesis with Large Language Models (Jacob Austin, Augustus Odena et al) (summarized by Rohin): Can we use large language models to solve programming problems? In order to answer this question, this paper builds the Mostly Basic Python Programming (MBPP) dataset. The authors asked crowd workers to provide a short problem statement, a Python function that solves the problem, and three test cases checking correctness. On average across the 974 programs, the reference solution has 7 lines of code, suggesting the problems are fairly simple. (This is partly because you can use library functions.) They also edit a subset of 426 problems to improve their quality, for example by making the problem statement less ambiguous or making the function signature more normal.

    They evaluate pretrained language models on this dataset across a range of model sizes from 0.244B to 137B parameters. (This largest model is within a factor of 2 of GPT-3.) They consider both few-shot and finetuned models. Since we have test cases that can be evaluated automatically, we can boost performance by generating lots of samples (80 in this case), evaluating them on the test cases, and then keeping the ones that succeed. They count a problem as solved if any sample passes all the test cases, and report as their primary metric the fraction of problems solved according to this definition. Note however that the test cases are not exhaustive: when they wrote more exhaustive tests for 50 of the problems, they found that about 12% of the so-called “solutions” did not pass the new tests (but conversely, 88% did). They also look at the fraction of samples which solve the problem, as a metric of the reliability or confidence of the model for a given problem.

    Some of their findings:

    1. Performance increases approximately log-linearly with model size. The trend is clearer and smoother by the primary metric (fraction of problems solved by at least one sample) compared to the secondary metric (fraction of samples that solve their problem).

    2. Finetuning provides a roughly constant boost across model sizes. An exception: at the largest model size, finetuning provides almost no benefit, though this could just be noise.

    3. It is important to provide at least one test case to the model (boosts problems solved from 43% to 55%) but after that additional test cases don’t make much of a difference (an additional two examples per problem boosts performance to 59%).

    4. In few-shot learning, the examples used in the prompt matter a lot. In a test of 15 randomly selected prompts for the few-shot 137B model, the worst one got ~1%, while the best one got ~59%, with the others distributed roughly uniformly between them. Ensembling all 15 prompts boosts performance to 66%.

    5. In rare cases, the model overfits to the test cases. For example, in a question about checking whether the input is a Woodall number, there is only one test checking an actual Woodall number (383), and the model generates a program that simply checks whether the input is 383.

    6. When choosing the best of multiple samples, you want a slightly higher temperature, in order to have more diversity of possible programs to check.

    7. It is important to have high quality problem descriptions as input for the model. The 137B model solves 79% of problems in the edited dataset, but only solves 63% of the original (unedited) versions of those problems. The authors qualitatively analyze the edits on the problems that switched from unsolved to solved and find a variety of things that you would generally expect to help.

    Now for the controversial question everyone loves to talk about: does the model understand the meaning of the code, or is it “just learning statistical correlations”? One way to check this is to see whether the model can also execute code. Specifically, we provide the ground truth code for one of the problems in the MBPP dataset along with one of the test case inputs and ask the model to predict the output for that test case. Even after finetuning for this task, the 137B model gets only 21% right. This can be boosted to 27% by also providing example test cases for the code before predicting the output for a new test case. Overall, this suggests that the model doesn’t “understand” the code yet.

    We can take the model finetuned for execution and see how well it does on program synthesis. (We can do this because there are different prompts for execution and synthesis.) For the 8B model, the finetuning makes basically no difference: it’s equivalent to the original few-shot setting. However, for the 137B model, finetuning on execution actually leads to a small but non-trivial improvement in performance (from ~59% to ~63%, I think). This is true relative to either the few-shot or finetuned-for-synthesis setting, since they performed near-identically for the 137B model. So in fact the 137B model finetuned on execution is actually the strongest model, according to synthesis performance.

    So far we’ve just been looking at how our model performs when taking the best of multiple samples. However, if our goal is to actually use models for program synthesis, we aren’t limited to such simple tricks. Another approach is to have a human provide feedback in natural language when the model’s output is incorrect, and then have the model generate a new program. This feedback is very informal, for example, “Close, but you need to replace the underscore with an empty string”. This provides a huge performance boost: the 137B solves ~31% of problems on its first sample; adding just a single piece of human feedback per problem boosts performance to ~55%, and having four rounds of human feedback gets you to over 65%.

    The authors also introduce the MathQA-Python dataset, which provides arithmetic word problems and asks models to write programs that would output the correct answer to the problem. They only run a few experiments on this dataset, so I’ve mostly ignored it. The main upshot is that a finetuned 137B parameter model can solve 83.8% of problems with some sample. They don’t report metrics with a single sample, which seems like the more relevant metric for this dataset, but eyeballing other graphs I think it would be around 45%, which you could probably boost a little bit by decreasing the sampling temperature.

    Rohin's opinion: I enjoyed this paper a lot; it feels like it gave me a good understanding of the programming abilities of large language models.

    I was most surprised by the result that, for the synthesis task, finetuning on execution helps but finetuning on synthesis doesn’t help for the 137B model. It is possible that this is just noise, though that is more noise than I would expect for such an experiment. It could be that the finetuning dataset for synthesis was too small (it only contains 374 problems), but that dataset was sufficient for big gains on the smaller models, and I would expect that, if anything, larger models should be able to make better use of small finetuning datasets, not worse.

    It’s also notable that, for the 137B model, the knowledge gained from finetuning on execution successfully transferred to improve synthesis performance. While I agree that the poor execution performance implies the model doesn’t “understand” the code according to the normal usage of that term, it seems like this sort of transfer suggests a low but non-zero level on some quantitative scale of understanding.

    I also found the human feedback section quite cool. However, note that the human providing the feedback often needs to understand the generated code as well as the desired algorithm, so it is plausible that it would be easier for the human to simply fix the code themselves.

    Measuring Coding Challenge Competence With APPS (Dan Hendrycks, Steven Basart et al) (summarized by Rohin): The APPS dataset measures programming competence by testing models the way humans are tested: we provide them with natural language descriptions of the code to be written and then evaluate whether the code they generate successfully solves the problem by testing the proposed solutions. The authors collect a dataset of 3,639 introductory problems (solvable by humans with 1-2 years of experience), 5,000 interview problems (comparable difficulty to interview questions), and 1,361 competition problems (comparable difficulty to questions in programming competitions). In addition, the test set contains 1,000 introductory problems, 3,000 interview problems, and 1,000 competition problems.

    They use this benchmark to test four models: two variants of GPT-2 (0.1B params and 1.5B params), GPT-Neo (2.7B params), and GPT-3 (175B params). GPT-3 is prompted with examples; all other models are finetuned on a dataset collected from GitHub. The authors find that:

    1. Finetuning makes a big difference in performance: GPT-3 only solves 0.2% of introductory problems, while the finetuned GPT-2-0.1B model solves 1% of such problems.

    2. Model performance increases with size, as you would expect: GPT-Neo performs best, solving 3.9% of problems.

    3. Syntax errors in generated code drop sharply as model performance improves: for introductory problems, GPT-3 has syntax errors in slightly under 40% of generations, while GPT-Neo has under 1%.

    4. Performance can be improved by sampling the best of multiple generated programs: a beam search for 5 programs boosts GPT-Neo’s performance from 3.9% to 5.5% on introductory problems.

    5. While no model synthesizes a correct solution to a competition level program, they do sometimes generate solutions that pass some of the test cases: for example, GPT-Neo passes 6.5% of test cases.

    Rohin's opinion: While the previous paper focused on how we could make maximal use of existing models for program synthesis, this paper is much more focused on how we can measure the capabilities of models. This leads to quite a bit of difference in what they focus on: for example, the highlighted paper treats the strategy of generating multiple possible answers as a fundamental approach to study, while this paper considers it briefly in a single subsection.

    Although the introductory problems in the APPS dataset seemed to me to be comparable to those in the MBPP dataset from the previous paper, models do significantly better on MBPP. A model slightly smaller than GPT-3 has a ~17% chance of solving a random MBPP problem in a single sample and ~10% if it is not given any example test cases; in contrast for introductory APPS problems GPT-3 is at 0.2%. I'm not sure whether this is because the introductory problems in APPS are harder, or if the format of the APPS problems is harder for the model to work with, or if this paper didn't do the prompt tuning that the previous paper found was crucial, or something else entirely.

    TECHNICAL AI ALIGNMENT
    AGENT FOUNDATIONS

    Grokking the Intentional Stance (Jack Koch) (summarized by Rohin): This post describes takeaways from The Intentional Stance by Daniel Dennett for the concept of agency. The key idea is that whether or not some system is an “agent” depends on who is observing it: for example, humans may not look like agents to superintelligent Martians who can predict our every move through a detailed understanding of the laws of physics. A system is an agent relative to an observer if the observer’s best model of the system (i.e. the one that is most predictive) is one in which the system has “goals” and “beliefs”. Thus, with AI systems, we should not ask whether an AI system “is” an agent; instead we should ask whether the AI system’s behavior is reliably predictable by the intentional stance.

    How is the idea that agency only arises relative to some observer compatible with our view of ourselves as agents? This can be understood as one “part” of our cognition modeling “ourselves” using the intentional stance. Indeed, a system usually cannot model itself in full fidelity, and so it makes a lot of sense that an intentional stance would be used to make an approximate model instead.

    Read more: The ground of optimization (AN #105)

    Rohin's opinion: I generally agree with the notion that whether or not something feels like an “agent” depends primarily on whether or not we model it using the intentional stance, which is primarily a statement about our understanding of the system. (For example, I expect programmers are much less likely to anthropomorphize a laptop than laypeople, because they understand the mechanistic workings of laptops better.) However, I think we do need an additional ingredient in AI risk arguments, because such arguments make claims about how an AI system will behave in novel circumstances that we’ve never seen before. To justify that claim, we need to have an argument that can predict how the agent behaves in new situations; it doesn’t seem like the intentional stance can give us that information by itself. See also this comment.

    Countable Factored Spaces (Diffractor) (summarized by Rohin): This post generalizes the math in Finite Factored Sets (AN #163) to (one version of) the infinite case. Everything carries over, except for one direction of the fundamental theorem. (The author suspects that direction is true, but was unable to prove it.)

    FIELD BUILDING

    List of AI safety courses and resources (Kat Woods) (summarized by Rohin): Exactly what it says in the title.

    MISCELLANEOUS (ALIGNMENT)

    Evaluating CLIP: Towards Characterization of Broader Capabilities and Downstream Implications (Sandhini Agarwal et al) (summarized by Zach): There has been significant progress in zero-shot image classification with models such as CLIP and ALIGN. These models work by effectively learning visual concepts from natural language supervision. Such models make it possible to build classifiers without task-specific data, which is useful in scenarios where data is either costly or unavailable. However, this capability introduces the potential for bias. This paper is an exploratory bias probe of the CLIP model that finds class design heavily influences model performance.

    The first set of experiments focusses on classification terms that have a high potential to cause representational harm. In one example, the authors conduct experiments on the FairFace dataset by adding classification labels such as 'animal' and 'criminal' to the list of possible classes. They find that black people and young people (under 20) were misclassified at significantly higher rates (14%) compared to the dataset as a whole (5%). This shows that the choice of labels affects classification outcomes. In a follow-up experiment, the authors add the additional label 'child' and find that this drastically reduces classification into crime-related and non-human categories. This shows sensitivity to minor changes in class design.

    In the second set of experiments, the authors focus on how CLIP treated images of men and women using images of Members of Congress. Although CLIP wasn't designed for multi-label classification, it's still informative to look at the label distribution above a certain cutoff. When occupations are used as the label set, the authors find that thresholds under 0.5% return 'nanny' and 'housekeeper' for women and 'prisoner' and 'mobster' for men. When labels come from the combined set that Google Cloud Vision, Amazon Rekognition and Microsoft use for all images, the authors find that CLIP returns a disproportionate number of appearance-related labels to women.

    Zach's opinion: It's tempting to write off such experiments as obvious since it's clear that class design affects classification results. However, upon further consideration, specifying how to address such problems seems significantly more challenging. I think this paper does a good job of pointing out the relative nuance in how class design and bias interact in fairly realistic use cases.

    NEWS

    Research Scientist, Long-term Strategy & Governance (summarized by Rohin): DeepMind (my employer) is hiring for several Research Scientist positions on the Long-term Strategy and Governance Team, across a wide range of backgrounds and skills. (Though note that you do need a PhD, or equivalent experience.) See also this EA Forum post.

    2022 IEEE Conference on Assured Autonomy (summarized by Rohin): The ICAA conference seeks contributions on all aspects of AI safety, security, and privacy in autonomous systems. The paper submission deadline is October 18 and the conference itself will take place March 22-24.

    CSER Job Posting: Academic Programme Manager (summarized by Rohin): CSER is searching for a candidate for a relatively senior role that combines academic, management and administrative responsibilities. The application deadline is September 20.

  • Recorded by Robert Miles: http://robertskmiles.com

    More information about the newsletter here: https://rohinshah.com/alignment-newsletter/

    YouTube Channel: https://www.youtube.com/channel/UCfGGFXwKpr-TJ5HfxEFaFCg

    This newsletter is a combined summary + opinion for the Finite Factored Sets sequence by Scott Garrabrant. I (Rohin) have taken a lot more liberty than I usually do with the interpretation of the results; Scott may or may not agree with these interpretations.

    Motivation

    One view on the importance of deep learning is that it allows you to automatically learn the features that are relevant for some task of interest. Instead of having to handcraft features using domain knowledge, we simply point a neural net at an appropriate dataset, and it figures out the right features. Arguably this is the majority of what makes up intelligent cognition; in humans it seems very analogous to System 1, which we use for most decisions and actions. We are also able to infer causal relations between the resulting features.

    Unfortunately, existing models of causal inference don’t model these learned features -- they instead assume that the features are already given to you. Finite Factored Sets (FFS) provide a theory which can talk directly about different possible ways to featurize the space of outcomes, and still allows you to perform causal inference. This sequence develops this underlying theory, and demonstrates a few examples of using finite factored sets to perform causal inference given only observational data.

    Another application is to embedded agency (AN #31): we would like to think of “agency” as a way to featurize the world into an “agent” feature and an “environment” feature, that together interact to determine the world. In Cartesian Frames (AN #127), we worked with a function A × E → W, where pairs of (agent, environment) together determined the world. In the finite factored set regime, we’ll think of A and E as features, the space S = A × E as the set of possible feature vectors, and S → W as the mapping from feature vectors to actual world states.

    What is a finite factored set?

    Generalizing this idea to apply more broadly, we will assume that there is a set of possible worlds Ω, a set S of arbitrary elements (which we will eventually interpret as feature vectors), and a function f : S → Ω that maps feature vectors to world states. Our goal is to have some notion of “features” of elements of S. Normally, when working with sets, we identify a feature value with the set of elements that have that value. For example, we can identify “red” as the set of all red objects, and in some versions of mathematics, we define “2” to be the set of all sets that have exactly two elements. So, we define a feature to be a partition of S into subsets, where each subset corresponds to one of the possible feature values. We can also interpret a feature as a question about items in S, and the values as possible answers to that question; I’ll be using that terminology going forward.

    A finite factored set is then given by (S, B), where B is a set of factors (questions), such that if you choose a particular answer to every question, that uniquely determines an element in S (and vice versa). We’ll put aside the set of possible worlds Ω; for now we’re just going to focus on the theory of these (S, B) pairs.

    Let’s look at a contrived example. Consider S = {chai, caesar salad, lasagna, lava cake, sprite, strawberry sorbet}. Here are some possible questions for this S:

    - FoodType: Possible answers are Drink = {chai, sprite}, Dessert = {lava cake, strawberry sorbet}, Savory = {caesar salad, lasagna}

    - Temperature: Possible answers are Hot = {chai, lava cake, lasagna} and Cold = {sprite, strawberry sorbet, caesar salad}.

    - StartingLetter: Possible answers are “C” = {chai, caesar salad}, “L” = {lasagna, lava cake}, and “S” = {sprite, strawberry sorbet}.

    - NumberOfWords: Possible answers are “1” = {chai, lasagna, sprite} and “2” = {caesar salad, lava cake, strawberry sorbet}.

    Given these questions, we could factor S into {FoodType, Temperature}, or {StartingLetter, NumberOfWords}. We cannot factor it into, say, {StartingLetter, Temperature}, because if we set StartingLetter = L and Temperature = Hot, that does not uniquely determine an element in S (it could be either lava cake or lasagna).

    Which of the two factorizations should we use? We’re not going to delve too deeply into this question, but you could imagine that if you were interested in questions like “does this need to be put in a glass” you might be more interested in the {FoodType, Temperature} factorization.

    Just to appreciate the castle of abstractions we’ve built, here’s the finite factored set F with the factorization {FoodType, Temperature}:

    F = ({chai, caesar salad, lasagna, lava cake, sprite, strawberry sorbet}, {{{chai, sprite}, {lava cake, strawberry sorbet}, {caesar salad, lasagna}}, {{chai, lava cake, lasagna}, {sprite, strawberry sorbet, caesar salad}}})

    To keep it all straight, just remember: a factorization B is a set of questions (factors, partitions) each of which is a set of possible answers (parts), each of which is a set of elements in S.

    A brief interlude

    Some objections you might have about stuff we’ve talked about so far:

    Q. Why do we bother with the set S -- couldn’t we just have the set of questions B, and then talk about answer vectors of the form (a1, a2, … aN)?

    A. You could in theory do this, as there is a bijection between S and the Cartesian product of the sets in B. However, the problem with this framing is that it is hard to talk about other derived features. For example, the question “what is the value of B1+B2” has no easy description in this framing. When we instead directly work with S, the B1+B2 question is just another partition of S, just like B1 or B2 individually.

    Q. Why does f map S to Ω? Doesn’t this mean that a feature vector uniquely determines a world state, whereas it’s usually the opposite in machine learning?

    A. This is true, but here the idea is that the set of features together captures all the information within the setting we are considering. You could think of feature vectors in deep learning as only capturing an important subset of all of the features (which we’d have to do in practice since we only have bounded computation), and those features are not enough to determine world states.

    Orthogonality in Finite Factored Sets

    We’re eventually going to use finite factored sets similarly to Pearlian causal models: to infer which questions (random variables) are conditionally independent of each other. However, our analysis will apply to arbitrary questions, unlike Pearlian models, which can only talk about independence between the predefined variables from which the causal model is built.

    Just like Pearl, we will talk about conditioning on evidence: given evidence e, a subset of S, we can “observe” that we are within e. In the formal setup, this looks like erasing all elements that are not in e from all questions, answers, factors, etc.

    Unlike Pearl, we’re going to assume that all of our factors are independent from each other. In Pearlian causal models, the random variables are typically not independent from each other. For example, you might have a model with two binary variables, e.g. “Variable Rain causes Variable Wet Sidewalk”; these are obviously not independent. An analogous finite factored set would have three factors: “did it rain?”, “if it rained did the sidewalk get wet?” and “if it didn’t rain did the sidewalk get wet?” This way all three factors can be independent of each other. We will still be able to ask whether Wet Sidewalk is independent of Rain, since Wet Sidewalk is just another question about the set S -- it just isn’t one of the underlying factors any more.

    The point of this independence is to allow us to reason about counterfactuals: it should be possible to say “imagine the element s, except with underlying factor b2 changed to have value v”. As a result, our definitions will include clauses that say “and make sure we can still take counterfactuals”. For example, let’s talk about the “history” of a question X, which for now you can think of as the “factors relevant to X”. The history of X given e is the smallest set of factors such that:

    1) if you know the answers to these factors, then you can infer the answer to X, and

    2) any factors that are not in the history are independent of X. As suggested above, we can think of this as being about counterfactuals -- we’re saying that for any such factor, we can counterfactually change its answer, and this will remain consistent with the evidence e.

    (A technicality on the second point: we’ll never be able to counterfactually change a factor to a value that is never found in the evidence; this is fine and doesn’t prevent things from being independent.)

    Time for an example! Consider the set S = {000, 001, 010, 011, 100, 101, 110, 111}, and the factorization {X, Y, Z}, where X is the question “what is the first bit”, Y is the question “what is the second bit”, and Z is the question “what is the third bit”. Consider the question Q = “when interpreted as a binary number, is the number >= 2?” In this case, the history of Q given no evidence is {X, Y}, because you can determine the answer to Q with the combination of X and Y. (You can still counterfact on anything, since there is no evidence to be inconsistent with.)

    Let’s consider an example with evidence. Suppose we observe that all the bits are equal, that is, e = {000, 111}. Now, what is the history of X? If there weren’t any evidence, the history would just be {X}; you only need to know X in order to determine the value of X. However, suppose we learned that X = 0, implying that our element is 000. We can’t counterfact on Y or Z, since that would produce 010 or 001, both of which are inconsistent with the evidence. So given this evidence, the history of X is actually {X, Y, Z}, i.e. the entire set of factors! If we’d only observed that the first two bits were equal, so e = {000, 001, 110, 111}, then we could counterfact on Z, and the history of X would be {X, Y}.

    (Should you want more examples, here are two relevant posts.)

    Given this notion of “history”, it is easy to define orthogonality: X is orthogonal to Y given evidence e if the history of X given e has no overlap with the history of Y given e. Intuitively, this means that the factors relevant to X are completely separate from those relevant to Y, and so there cannot be any entanglement between X and Y. For a question Z, we say that X is orthogonal to Y given Z if we have that X is orthogonal to Y given z, for every possible answer z in Z.

    Now that we have defined orthogonality, we can state the Fundamental Theorem of Finite Factored Sets. Given some questions X, Y and Z about a finite factored set F, X is orthogonal to Y given Z if and only if in every probability distribution on F, X is conditionally independent of Y given Z, that is, P(X, Y | Z) = P(X | Z) * P(Y | Z).

    (I haven’t told you how you put a probability distribution on F. It’s exactly what you would think -- you assign a probability to every possible answer in every factor, and then the probability of an individual element is defined to be the product of the probabilities of its answers across all the factors.)

    (I also haven’t given you any intuition about why this theorem holds. Unfortunately I don’t have great intuition for this; the proof has multiple non-trivial steps each of which I locally understand and have intuition for... but globally it’s just a sequence of non-trivial steps to me. Here’s an attempt, which isn’t very good: we specifically defined orthogonality to capture all the relevant information for a question, in particular by having that second condition requiring that we be able to counterfact on other factors, and so it intuitively makes sense that if the relevant information doesn’t overlap then there can’t be a way for the probability distribution to have interactions between the variables.)

    The fundamental theorem is in some sense a justification for calling the property “orthogonality” -- if we determine just by studying the structure of the finite factored set that X is orthogonal to Y given Z, then we know that this implies conditional independence in the “true” probability distribution, whatever it ends up being. Pearlian models have a similar theorem, where the graphical property of d-separation implies conditional independence.

    Foundations of causality and time

    You might be wondering why we have been calling the minimal set of relevant factors “history”. The core philosophical idea is that, if you have the right factorization, then “time” or “causality” can be thought of as flowing in the direction of larger histories. Specifically, we say that X is “before” Y if the history of X is a subset of the history of Y. (We then call it “history” because every factor in the history of X will be “before” X by this definition.)

    One intuition pump for this is that in physics, if an event A causes an event B, then the past light cone of A is a subset of the past light cone of B, and A happens before B in every possible reference frame.

    But perhaps the best argument for thinking of this as causality is that we can actually use this notion of “time” or “causality” to perform causal inference. Before I talk about that, let’s see what this looks like in Pearlian models.

    Strictly speaking, in Pearlian models, the edges do not have to correspond to causality: formally they only represent conditional independence assumptions on a probability distribution. However, consider the following Cool Fact: for some Pearlian models, if you have observational data that is generated from that model, you can recover the exact graphical structure of the generating model just by looking at the observational data. In this case, you really are inferring cause-and-effect relationships from observational data! (In the general case where the data is generated by an arbitrary model, you can recover a lot of the structure of the model, but be uncertain about the direction of some of the edges, so you are still doing some causal inference from observational data.)

    We will do something similar: we’ll use our notion of “before” to perform causal inference given observational data.

    Temporal inference: the three dependent bits

    You are given statistical (i.e. observational) data for three bits: X, Y and Z. You quickly notice that it is always the case that Z = X xor Y (which implies that X = Y xor Z, and Y = Z xor X). Clearly, there are only two independent bits here, and the other bit is derived as the xor of the two independent bits. From the raw statistical data, can you tell which bits are the independent ones, and which one is the derived one, thus inferring which one was caused by the other two? It turns out that you can!

    Specifically, you want to look for which two bits are orthogonal to each other, that is, you want to check whether we approximately have P(X, Y) = P(X) P(Y) (and similarly for other possible pairings). In the world where two of the bits were generated by a biased coin, you will find exactly one pair that is orthogonal in this way. (The case where the bits are generated by a fair coin is a special case; the argument won’t work there, but it’s in some sense “accidental” and happens because the probability of 0.5 is very special.)

    Let’s suppose that the orthogonal pair was (X, Z). In this case, we can prove that in every finite factored set that models this situation, X and Z come “before” Y, i.e. their histories are strict subsets of Y’s history. Thus, we’ve inferred causality using only observational data! (And unlike with Pearlian models, we did this in a case where one “variable” was a deterministic function of two other “variables”, which is a type of situation that Pearlian models struggle to handle.)

    Future work

    Remember that motivation section, a couple thousand words ago? We talked about how we can do causal inference with learned featurizations, and apply it to embedded agency. Well, we actually haven’t done that yet, beyond a few examples of causal inference (as in the example above). There is a lot of future work to be done in applying it to the case that motivated it in the first place. The author wrote up potential future work here, which has categories for both causal inference and embedded agency, and also adds a third one: generalizing the theory to infinite sets. If you are interested in this framework, there are many avenues for pushing it forward.