34 - AI Evaluations with Beth Barnes – AXRP - the AI X-risk Research Podcast – Podcast

Episodes

38.6 - Joel Lehman on Positive Visions of AI
24 Jan · AXRP - the AI X-risk Research Podcast
Typically this podcast talks about how to avert destruction from AI. But what would it take to ensure AI promotes human flourishing as well as it can? Is alignment to individuals enough, and if not, where do we go from here? In this episode, I talk with Joel Lehman about these questions.

Patreon: https://www.patreon.com/axrpodcast

Ko-fi: https://ko-fi.com/axrpodcast

Transcript: https://axrp.net/episode/2025/01/24/episode-38_6-joel-lehman-positive-visions-of-ai.html

FAR.AI: https://far.ai/

FAR.AI on X (aka Twitter): https://x.com/farairesearch

FAR.AI on YouTube: https://www.youtube.com/@FARAIResearch

The Alignment Workshop: https://www.alignment-workshop.com/

Topics we discuss, and timestamps:

01:12 - Why aligned AI might not be enough

04:05 - Positive visions of AI

08:27 - Improving recommendation systems

Links:

Why Greatness Cannot Be Planned: https://www.amazon.com/Why-Greatness-Cannot-Planned-Objective/dp/3319155237

We Need Positive Visions of AI Grounded in Wellbeing: https://thegradientpub.substack.com/p/beneficial-ai-wellbeing-lehman-ngo

Machine Love: https://arxiv.org/abs/2302.09248

AI Alignment with Changing and Influenceable Reward Functions: https://arxiv.org/abs/2405.17713

Episode art by Hamish Doodles: hamishdoodles.com
38.5 - Adrià Garriga-Alonso on Detecting AI Scheming
20 Jan · AXRP - the AI X-risk Research Podcast
Suppose we're worried about AIs engaging in long-term plans that they don't tell us about. If we were to peek inside their brains, what should we look for to check whether this was happening? In this episode Adrià Garriga-Alonso talks about his work trying to answer this question.

Patreon: https://www.patreon.com/axrpodcast

Ko-fi: https://ko-fi.com/axrpodcast

Transcript: https://axrp.net/episode/2025/01/20/episode-38_5-adria-garriga-alonso-detecting-ai-scheming.html

FAR.AI: https://far.ai/

FAR.AI on X (aka Twitter): https://x.com/farairesearch

FAR.AI on YouTube: https://www.youtube.com/@FARAIResearch

The Alignment Workshop: https://www.alignment-workshop.com/

Topics we discuss, and timestamps:

01:04 - The Alignment Workshop

02:49 - How to detect scheming AIs

05:29 - Sokoban-solving networks taking time to think

12:18 - Model organisms of long-term planning

19:44 - How and why to study planning in networks

Links:

Adrià's website: https://agarri.ga/

An investigation of model-free planning: https://arxiv.org/abs/1901.03559

Model-Free Planning: https://tuphs28.github.io/projects/interpplanning/

Planning in a recurrent neural network that plays Sokoban: https://arxiv.org/abs/2407.15421

Episode art by Hamish Doodles: hamishdoodles.com
38.4 - Shakeel Hashim on AI Journalism
5 Jan · AXRP - the AI X-risk Research Podcast
AI researchers often complain about the poor coverage of their work in the news media. But why is this happening, and how can it be fixed? In this episode, I speak with Shakeel Hashim about the resource constraints facing AI journalism, the disconnect between journalists' and AI researchers' views on transformative AI, and efforts to improve the state of AI journalism, such as Tarbell and Shakeel's newsletter, Transformer.

Patreon: https://www.patreon.com/axrpodcast

Ko-fi: https://ko-fi.com/axrpodcast

The transcript: https://axrp.net/episode/2025/01/05/episode-38_4-shakeel-hashim-ai-journalism.html

FAR.AI: https://far.ai/

FAR.AI on X (aka Twitter): https://x.com/farairesearch

FAR.AI on YouTube: https://www.youtube.com/@FARAIResearch

The Alignment Workshop: https://www.alignment-workshop.com/

Topics we discuss, and timestamps:

01:31 - The AI media ecosystem

02:34 - Why not more AI news?

07:18 - Disconnects between journalists and the AI field

12:42 - Tarbell

18:44 - The Transformer newsletter

Links:

Transformer (Shakeel's substack): https://www.transformernews.ai/

Tarbell: https://www.tarbellfellowship.org/

Episode art by Hamish Doodles: hamishdoodles.com
38.3 - Erik Jenner on Learned Look-Ahead
12 Dec 2024 · AXRP - the AI X-risk Research Podcast
Lots of people in the AI safety space worry about models being able to make deliberate, multi-step plans. But can we already see this in existing neural nets? In this episode, I talk with Erik Jenner about his work looking at internal look-ahead within chess-playing neural networks.

Patreon: https://www.patreon.com/axrpodcast

Ko-fi: https://ko-fi.com/axrpodcast

The transcript: https://axrp.net/episode/2024/12/12/episode-38_3-erik-jenner-learned-look-ahead.html

FAR.AI: https://far.ai/

FAR.AI on X (aka Twitter): https://x.com/farairesearch

FAR.AI on YouTube: https://www.youtube.com/@FARAIResearch

The Alignment Workshop: https://www.alignment-workshop.com/

Topics we discuss, and timestamps:

00:57 - How chess neural nets look into the future

04:29 - The dataset and basic methodology

05:23 - Testing for branching futures?

07:57 - Which experiments demonstrate what

10:43 - How the ablation experiments work

12:38 - Effect sizes

15:23 - X-risk relevance

18:08 - Follow-up work

21:29 - How much planning does the network do?

Research we mention:

Evidence of Learned Look-Ahead in a Chess-Playing Neural Network: https://arxiv.org/abs/2406.00877

Understanding the learned look-ahead behavior of chess neural networks (a development of the follow-up research Erik mentioned): https://openreview.net/forum?id=Tl8EzmgsEp

Linear Latent World Models in Simple Transformers: A Case Study on Othello-GPT: https://arxiv.org/abs/2310.07582

Episode art by Hamish Doodles: hamishdoodles.com
39 - Evan Hubinger on Model Organisms of Misalignment
1 Dec 2024 · AXRP - the AI X-risk Research Podcast
The 'model organisms of misalignment' line of research creates AI models that exhibit various types of misalignment, and studies them to try to understand how the misalignment occurs and whether it can be somehow removed. In this episode, Evan Hubinger talks about two papers he's worked on at Anthropic under this agenda: "Sleeper Agents" and "Sycophancy to Subterfuge".

Patreon: https://www.patreon.com/axrpodcast

Ko-fi: https://ko-fi.com/axrpodcast

The transcript: https://axrp.net/episode/2024/12/01/episode-39-evan-hubinger-model-organisms-misalignment.html

Topics we discuss, and timestamps:

0:00:36 - Model organisms and stress-testing

0:07:38 - Sleeper Agents

0:22:32 - Do 'sleeper agents' properly model deceptive alignment?

0:38:32 - Surprising results in "Sleeper Agents"

0:57:25 - Sycophancy to Subterfuge

1:09:21 - How models generalize from sycophancy to subterfuge

1:16:37 - Is the reward editing task valid?

1:21:46 - Training away sycophancy and subterfuge

1:29:22 - Model organisms, AI control, and evaluations

1:33:45 - Other model organisms research

1:35:27 - Alignment stress-testing at Anthropic

1:43:32 - Following Evan's work

Main papers:

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training: https://arxiv.org/abs/2401.05566

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models: https://arxiv.org/abs/2406.10162

Anthropic links:

Anthropic's newsroom: https://www.anthropic.com/news

Careers at Anthropic: https://www.anthropic.com/careers

Other links:

Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research: https://www.alignmentforum.org/posts/ChDH335ckdvpxXaXX/model-organisms-of-misalignment-the-case-for-a-new-pillar-of-1

Simple probes can catch sleeper agents: https://www.anthropic.com/research/probes-catch-sleeper-agents

Studying Large Language Model Generalization with Influence Functions: https://arxiv.org/abs/2308.03296

Stress-Testing Capability Elicitation With Password-Locked Models [aka model organisms of sandbagging]: https://arxiv.org/abs/2405.19550

Episode art by Hamish Doodles: hamishdoodles.com
38.2 - Jesse Hoogland on Singular Learning Theory
27 Nov 2024 · AXRP - the AI X-risk Research Podcast
You may have heard of singular learning theory, and its "local learning coefficient", or LLC - but have you heard of the refined LLC? In this episode, I chat with Jesse Hoogland about his work on SLT, and using the refined LLC to find a new circuit in language models.

Patreon: https://www.patreon.com/axrpodcast

Ko-fi: https://ko-fi.com/axrpodcast

The transcript: https://axrp.net/episode/2024/11/27/38_2-jesse-hoogland-singular-learning-theory.html

FAR.AI: https://far.ai/

FAR.AI on X (aka Twitter): https://x.com/farairesearch

FAR.AI on YouTube: https://www.youtube.com/@FARAIResearch

The Alignment Workshop: https://www.alignment-workshop.com/

Topics we discuss, and timestamps:

00:34 - About Jesse

01:49 - The Alignment Workshop

02:31 - About Timaeus

05:25 - SLT that isn't developmental interpretability

10:41 - The refined local learning coefficient

14:06 - Finding the multigram circuit

Links:

Differentiation and Specialization of Attention Heads via the Refined Local Learning Coefficient: https://arxiv.org/abs/2410.02984

Investigating the learning coefficient of modular addition: hackathon project: https://www.lesswrong.com/posts/4v3hMuKfsGatLXPgt/investigating-the-learning-coefficient-of-modular-addition

Episode art by Hamish Doodles: hamishdoodles.com
38.1 - Alan Chan on Agent Infrastructure
16 Nov 2024 · AXRP - the AI X-risk Research Podcast
Road lines, street lights, and licence plates are examples of infrastructure used to ensure that roads operate smoothly. In this episode, Alan Chan talks about using similar interventions to help avoid bad outcomes from the deployment of AI agents.

Patreon: https://www.patreon.com/axrpodcast

Ko-fi: https://ko-fi.com/axrpodcast

The transcript: https://axrp.net/episode/2024/11/16/episode-38_1-alan-chan-agent-infrastructure.html

FAR.AI: https://far.ai/

FAR.AI on X (aka Twitter): https://x.com/farairesearch

FAR.AI on YouTube: https://www.youtube.com/@FARAIResearch

The Alignment Workshop: https://www.alignment-workshop.com/

Topics we discuss, and timestamps:

01:02 - How the Alignment Workshop is

01:32 - Agent infrastructure

04:57 - Why agent infrastructure

07:54 - A trichotomy of agent infrastructure

13:59 - Agent IDs

18:17 - Agent channels

20:29 - Relation to AI control

Links:

Alan on Google Scholar: https://scholar.google.com/citations?user=lmQmYPgAAAAJ&hl=en&oi=ao

IDs for AI Systems: https://arxiv.org/abs/2406.12137

Visibility into AI Agents: https://arxiv.org/abs/2401.13138

Episode art by Hamish Doodles: hamishdoodles.com
38.0 - Zhijing Jin on LLMs, Causality, and Multi-Agent Systems
14 Nov 2024 · AXRP - the AI X-risk Research Podcast
Do language models understand the causal structure of the world, or do they merely note correlations? And what happens when you build a big AI society out of them? In this brief episode, recorded at the Bay Area Alignment Workshop, I chat with Zhijing Jin about her research on these questions.

Patreon: https://www.patreon.com/axrpodcast

Ko-fi: https://ko-fi.com/axrpodcast

The transcript: https://axrp.net/episode/2024/11/14/episode-38_0-zhijing-jin-llms-causality-multi-agent-systems.html

FAR.AI: https://far.ai/

FAR.AI on X (aka Twitter): https://x.com/farairesearch

FAR.AI on YouTube: https://www.youtube.com/@FARAIResearch

The Alignment Workshop: https://www.alignment-workshop.com/

Topics we discuss, and timestamps:

00:35 - How the Alignment Workshop is

00:47 - How Zhijing got interested in causality and natural language processing

03:14 - Causality and alignment

06:21 - Causality without randomness

10:07 - Causal abstraction

11:42 - Why LLM causal reasoning?

13:20 - Understanding LLM causal reasoning

16:33 - Multi-agent systems

Links:

Zhijing's website: https://zhijing-jin.com/fantasy/

Zhijing on X (aka Twitter): https://x.com/zhijingjin

Can Large Language Models Infer Causation from Correlation?: https://arxiv.org/abs/2306.05836

Cooperate or Collapse: Emergence of Sustainable Cooperation in a Society of LLM Agents: https://arxiv.org/abs/2404.16698

Episode art by Hamish Doodles: hamishdoodles.com
37 - Jaime Sevilla on AI Forecasting
4 Oct 2024 · AXRP - the AI X-risk Research Podcast
Epoch AI is the premier organization that tracks the trajectory of AI - how much compute is used, the role of algorithmic improvements, the growth in data used, and when the above trends might hit an end. In this episode, I speak with the director of Epoch AI, Jaime Sevilla, about how compute, data, and algorithmic improvements are impacting AI, and whether continuing to scale can get us AGI.

Patreon: https://www.patreon.com/axrpodcast

Ko-fi: https://ko-fi.com/axrpodcast

The transcript: https://axrp.net/episode/2024/10/04/episode-37-jaime-sevilla-forecasting-ai.html

Topics we discuss, and timestamps:

0:00:38 - The pace of AI progress

0:07:49 - How Epoch AI tracks AI compute

0:11:44 - Why does AI compute grow so smoothly?

0:21:46 - When will we run out of computers?

0:38:56 - Algorithmic improvement

0:44:21 - Algorithmic improvement and scaling laws

0:56:56 - Training data

1:04:56 - Can scaling produce AGI?

1:16:55 - When will AGI arrive?

1:21:20 - Epoch AI

1:27:06 - Open questions in AI forecasting

1:35:21 - Epoch AI and x-risk

1:41:34 - Following Epoch AI's research

Links for Jaime and Epoch AI:

Epoch AI: https://epochai.org/

Machine Learning Trends dashboard: https://epochai.org/trends

Epoch AI on X / Twitter: https://x.com/EpochAIResearch

Jaime on X / Twitter: https://x.com/Jsevillamol

Research we discuss:

Training Compute of Frontier AI Models Grows by 4-5x per Year: https://epochai.org/blog/training-compute-of-frontier-ai-models-grows-by-4-5x-per-year

Optimally Allocating Compute Between Inference and Training: https://epochai.org/blog/optimally-allocating-compute-between-inference-and-training

Algorithmic Progress in Language Models [blog post]: https://epochai.org/blog/algorithmic-progress-in-language-models

Algorithmic progress in language models [paper]: https://arxiv.org/abs/2403.05812

Training Compute-Optimal Large Language Models [aka the Chinchilla scaling law paper]: https://arxiv.org/abs/2203.15556

Will We Run Out of Data? Limits of LLM Scaling Based on Human-Generated Data [blog post]: https://epochai.org/blog/will-we-run-out-of-data-limits-of-llm-scaling-based-on-human-generated-data

Will we run out of data? Limits of LLM scaling based on human-generated data [paper]: https://arxiv.org/abs/2211.04325

The Direct Approach: https://epochai.org/blog/the-direct-approach

Episode art by Hamish Doodles: hamishdoodles.com
36 - Adam Shai and Paul Riechers on Computational Mechanics
29 Sep 2024 · AXRP - the AI X-risk Research Podcast
Sometimes, people talk about transformers as having "world models" as a result of being trained to predict text data on the internet. But what does this even mean? In this episode, I talk with Adam Shai and Paul Riechers about their work applying computational mechanics, a sub-field of physics studying how to predict random processes, to neural networks.

Patreon: https://www.patreon.com/axrpodcast

Ko-fi: https://ko-fi.com/axrpodcast

The transcript: https://axrp.net/episode/2024/09/29/episode-36-adam-shai-paul-riechers-computational-mechanics.html

Topics we discuss, and timestamps:

0:00:42 - What computational mechanics is

0:29:49 - Computational mechanics vs other approaches

0:36:16 - What world models are

0:48:41 - Fractals

0:57:43 - How the fractals are formed

1:09:55 - Scaling computational mechanics for transformers

1:21:52 - How Adam and Paul found computational mechanics

1:36:16 - Computational mechanics for AI safety

1:46:05 - Following Adam and Paul's research

Simplex AI Safety: https://www.simplexaisafety.com/

Research we discuss:

Transformers represent belief state geometry in their residual stream: https://arxiv.org/abs/2405.15943

Transformers represent belief state geometry in their residual stream [LessWrong post]: https://www.lesswrong.com/posts/gTZ2SxesbHckJ3CkF/transformers-represent-belief-state-geometry-in-their

Why Would Belief-States Have A Fractal Structure, And Why Would That Matter For Interpretability? An Explainer: https://www.lesswrong.com/posts/mBw7nc4ipdyeeEpWs/why-would-belief-states-have-a-fractal-structure-and-why

Episode art by Hamish Doodles: hamishdoodles.com
New Patreon tiers + MATS applications
28 Sep 2024 · AXRP - the AI X-risk Research Podcast
Patreon: https://www.patreon.com/axrpodcast

MATS: https://www.matsprogram.org

Note: I'm employed by MATS, but they're not paying me to make this video.
35 - Peter Hase on LLM Beliefs and Easy-to-Hard Generalization
24 Aug 2024 · AXRP - the AI X-risk Research Podcast
How do we figure out what large language models believe? In fact, do they even have beliefs? Do those beliefs have locations, and if so, can we edit those locations to change the beliefs? Also, how are we going to get AI to perform tasks so hard that we can't figure out if they succeeded at them? In this episode, I chat with Peter Hase about his research into these questions.

Patreon: https://www.patreon.com/axrpodcast

Ko-fi: https://ko-fi.com/axrpodcast

The transcript: https://axrp.net/episode/2024/08/24/episode-35-peter-hase-llm-beliefs-easy-to-hard-generalization.html

Topics we discuss, and timestamps:

0:00:36 - NLP and interpretability

0:10:20 - Interpretability lessons

0:32:22 - Belief interpretability

1:00:12 - Localizing and editing models' beliefs

1:19:18 - Beliefs beyond language models

1:27:21 - Easy-to-hard generalization

1:47:16 - What do easy-to-hard results tell us?

1:57:33 - Easy-to-hard vs weak-to-strong

2:03:50 - Different notions of hardness

2:13:01 - Easy-to-hard vs weak-to-strong, round 2

2:15:39 - Following Peter's work

Peter on Twitter: https://x.com/peterbhase

Peter's papers:

Foundational Challenges in Assuring Alignment and Safety of Large Language Models: https://arxiv.org/abs/2404.09932

Do Language Models Have Beliefs? Methods for Detecting, Updating, and Visualizing Model Beliefs: https://arxiv.org/abs/2111.13654

Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models: https://arxiv.org/abs/2301.04213

Are Language Models Rational? The Case of Coherence Norms and Belief Revision: https://arxiv.org/abs/2406.03442

The Unreasonable Effectiveness of Easy Training Data for Hard Tasks: https://arxiv.org/abs/2401.06751

Other links:

Toy Models of Superposition: https://transformer-circuits.pub/2022/toy_model/index.html

Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV): https://arxiv.org/abs/1711.11279

Locating and Editing Factual Associations in GPT (aka the ROME paper): https://arxiv.org/abs/2202.05262

Of nonlinearity and commutativity in BERT: https://arxiv.org/abs/2101.04547

Inference-Time Intervention: Eliciting Truthful Answers from a Language Model: https://arxiv.org/abs/2306.03341

Editing a classifier by rewriting its prediction rules: https://arxiv.org/abs/2112.01008

Discovering Latent Knowledge Without Supervision (aka the Collin Burns CCS paper): https://arxiv.org/abs/2212.03827

Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision: https://arxiv.org/abs/2312.09390

Concrete problems in AI safety: https://arxiv.org/abs/1606.06565

Rissanen Data Analysis: Examining Dataset Characteristics via Description Length: https://arxiv.org/abs/2103.03872

Episode art by Hamish Doodles: hamishdoodles.com
34 - AI Evaluations with Beth Barnes
28 Jul 2024 · AXRP - the AI X-risk Research Podcast
How can we figure out if AIs are capable enough to pose a threat to humans? When should we make a big effort to mitigate risks of catastrophic AI misbehaviour? In this episode, I chat with Beth Barnes, founder of and head of research at METR, about these questions and more.

Patreon: patreon.com/axrpodcast

Ko-fi: ko-fi.com/axrpodcast

The transcript: https://axrp.net/episode/2024/07/28/episode-34-ai-evaluations-beth-barnes.html

Topics we discuss, and timestamps:

0:00:37 - What is METR?

0:02:44 - What is an "eval"?

0:14:42 - How good are evals?

0:37:25 - Are models showing their full capabilities?

0:53:25 - Evaluating alignment

1:01:38 - Existential safety methodology

1:12:13 - Threat models and capability buffers

1:38:25 - METR's policy work

1:48:19 - METR's relationships with labs

2:04:12 - Related research

2:10:02 - Roles at METR, and following METR's work

Links for METR:

METR: https://metr.org

METR Task Development Guide - Bounty: https://taskdev.metr.org/bounty/

METR - Hiring: https://metr.org/hiring

Autonomy evaluation resources: https://metr.org/blog/2024-03-13-autonomy-evaluation-resources/

Other links:

Update on ARC's recent eval efforts (contains GPT-4 taskrabbit captcha story): https://metr.org/blog/2023-03-18-update-on-recent-evals/

Password-locked models: a stress case for capabilities evaluation: https://www.alignmentforum.org/posts/rZs6ddqNnW8LXuJqA/password-locked-models-a-stress-case-for-capabilities

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training: https://arxiv.org/abs/2401.05566

Untrusted smart models and trusted dumb models: https://www.alignmentforum.org/posts/LhxHcASQwpNa3mRNk/untrusted-smart-models-and-trusted-dumb-models

AI companies aren't really using external evaluators: https://www.lesswrong.com/posts/WjtnvndbsHxCnFNyc/ai-companies-aren-t-really-using-external-evaluators

Nobody Knows How to Safety-Test AI (Time): https://time.com/6958868/artificial-intelligence-safety-evaluations-risks/

ChatGPT can talk, but OpenAI employees sure can’t: https://www.vox.com/future-perfect/2024/5/17/24158478/openai-departures-sam-altman-employees-chatgpt-release

Leaked OpenAI documents reveal aggressive tactics toward former employees: https://www.vox.com/future-perfect/351132/openai-vested-equity-nda-sam-altman-documents-employees

Beth on her non-disparagement agreement with OpenAI: https://www.lesswrong.com/posts/yRWv5kkDD4YhzwRLq/non-disparagement-canaries-for-openai?commentId=MrJF3tWiKYMtJepgX

Sam Altman's statement on OpenAI equity: https://x.com/sama/status/1791936857594581428

Episode art by Hamish Doodles: hamishdoodles.com
33 - RLHF Problems with Scott Emmons
12 Jun 2024 · AXRP - the AI X-risk Research Podcast
Reinforcement Learning from Human Feedback, or RLHF, is one of the main ways that makers of large language models make them 'aligned'. But people have long noted that there are difficulties with this approach when the models are smarter than the humans providing feedback. In this episode, I talk with Scott Emmons about his work categorizing the problems that can show up in this setting.

Patreon: patreon.com/axrpodcast

Ko-fi: ko-fi.com/axrpodcast

The transcript: https://axrp.net/episode/2024/06/12/episode-33-rlhf-problems-scott-emmons.html

Topics we discuss, and timestamps:

0:00:33 - Deceptive inflation

0:17:56 - Overjustification

0:32:48 - Bounded human rationality

0:50:46 - Avoiding these problems

1:14:13 - Dimensional analysis

1:23:32 - RLHF problems, in theory and practice

1:31:29 - Scott's research program

1:39:42 - Following Scott's research

Scott's website: https://www.scottemmons.com

Scott's X/twitter account: https://x.com/emmons_scott

When Your AIs Deceive You: Challenges With Partial Observability of Human Evaluators in Reward Learning: https://arxiv.org/abs/2402.17747

Other works we discuss:

AI Deception: A Survey of Examples, Risks, and Potential Solutions: https://arxiv.org/abs/2308.14752

Uncertain decisions facilitate better preference learning: https://arxiv.org/abs/2106.10394

Invariance in Policy Optimisation and Partial Identifiability in Reward Learning: https://arxiv.org/abs/2203.07475

The Humble Gaussian Distribution (aka principal component analysis and dimensional analysis): http://www.inference.org.uk/mackay/humble.pdf

Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!: https://arxiv.org/abs/2310.03693

Episode art by Hamish Doodles: hamishdoodles.com
32 - Understanding Agency with Jan Kulveit
30 May 2024 · AXRP - the AI X-risk Research Podcast
What's the difference between a large language model and the human brain? And what's wrong with our theories of agency? In this episode, I chat about these questions with Jan Kulveit, who leads the Alignment of Complex Systems research group.

Patreon: patreon.com/axrpodcast

Ko-fi: ko-fi.com/axrpodcast

The transcript: axrp.net/episode/2024/05/30/episode-32-understanding-agency-jan-kulveit.html

Topics we discuss, and timestamps:

0:00:47 - What is active inference?

0:15:14 - Preferences in active inference

0:31:33 - Action vs perception in active inference

0:46:07 - Feedback loops

1:01:32 - Active inference vs LLMs

1:12:04 - Hierarchical agency

1:58:28 - The Alignment of Complex Systems group

Website of the Alignment of Complex Systems group (ACS): acsresearch.org

ACS on X/Twitter: x.com/acsresearchorg

Jan on LessWrong: lesswrong.com/users/jan-kulveit

Predictive Minds: Large Language Models as Atypical Active Inference Agents: arxiv.org/abs/2311.10215

Other works we discuss:

Active Inference: The Free Energy Principle in Mind, Brain, and Behavior: https://www.goodreads.com/en/book/show/58275959

Book Review: Surfing Uncertainty: https://slatestarcodex.com/2017/09/05/book-review-surfing-uncertainty/

The self-unalignment problem: https://www.lesswrong.com/posts/9GyniEBaN3YYTqZXn/the-self-unalignment-problem

Mitigating generative agent social dilemmas (aka language models writing contracts for Minecraft): https://social-dilemmas.github.io/

Episode art by Hamish Doodles: hamishdoodles.com
31 - Singular Learning Theory with Daniel Murfet
7 May 2024 · AXRP - the AI X-risk Research Podcast
What's going on with deep learning? What sorts of models get learned, and what are the learning dynamics? Singular learning theory is a theory of Bayesian statistics broad enough in scope to encompass deep neural networks that may help answer these questions. In this episode, I speak with Daniel Murfet about this research program and what it tells us.

Patreon: patreon.com/axrpodcast

Ko-fi: ko-fi.com/axrpodcast

Topics we discuss, and timestamps:

0:00:26 - What is singular learning theory?

0:16:00 - Phase transitions

0:35:12 - Estimating the local learning coefficient

0:44:37 - Singular learning theory and generalization

1:00:39 - Singular learning theory vs other deep learning theory

1:17:06 - How singular learning theory hit AI alignment

1:33:12 - Payoffs of singular learning theory for AI alignment

1:59:36 - Does singular learning theory advance AI capabilities?

2:13:02 - Open problems in singular learning theory for AI alignment

2:20:53 - What is the singular fluctuation?

2:25:33 - How geometry relates to information

2:30:13 - Following Daniel Murfet's work

The transcript: https://axrp.net/episode/2024/05/07/episode-31-singular-learning-theory-dan-murfet.html

Daniel Murfet's twitter/X account: https://twitter.com/danielmurfet

Developmental interpretability website: https://devinterp.com

Developmental interpretability YouTube channel: https://www.youtube.com/@Devinterp

Main research discussed in this episode:

- Developmental Landscape of In-Context Learning: https://arxiv.org/abs/2402.02364

- Estimating the Local Learning Coefficient at Scale: https://arxiv.org/abs/2402.03698

- Simple versus Short: Higher-order degeneracy and error-correction: https://www.lesswrong.com/posts/nWRj6Ey8e5siAEXbK/simple-versus-short-higher-order-degeneracy-and-error-1

Other links:

- Algebraic Geometry and Statistical Learning Theory (the grey book): https://www.cambridge.org/core/books/algebraic-geometry-and-statistical-learning-theory/9C8FD1BDC817E2FC79117C7F41544A3A

- Mathematical Theory of Bayesian Statistics (the green book): https://www.routledge.com/Mathematical-Theory-of-Bayesian-Statistics/Watanabe/p/book/9780367734817

- In-context learning and induction heads: https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html

- Saddle-to-Saddle Dynamics in Deep Linear Networks: Small Initialization Training, Symmetry, and Sparsity: https://arxiv.org/abs/2106.15933

- A mathematical theory of semantic development in deep neural networks: https://www.pnas.org/doi/abs/10.1073/pnas.1820226116

- Consideration on the Learning Efficiency Of Multiple-Layered Neural Networks with Linear Units: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4404877

- Neural Tangent Kernel: Convergence and Generalization in Neural Networks: https://arxiv.org/abs/1806.07572

- The Interpolating Information Criterion for Overparameterized Models: https://arxiv.org/abs/2307.07785

- Feature Learning in Infinite-Width Neural Networks: https://arxiv.org/abs/2011.14522

- A central AI alignment problem: capabilities generalization, and the sharp left turn: https://www.lesswrong.com/posts/GNhMPAWcfBCASy8e6/a-central-ai-alignment-problem-capabilities-generalization

- Quantifying degeneracy in singular models via the learning coefficient: https://arxiv.org/abs/2308.12108

Episode art by Hamish Doodles: hamishdoodles.com
30 - AI Security with Jeffrey Ladish
30 Apr 2024 · AXRP - the AI X-risk Research Podcast
Top labs use various forms of "safety training" on models before their release to make sure they don't do nasty stuff - but how robust is that? How can we ensure that the weights of powerful AIs don't get leaked or stolen? And what can AI even do these days? In this episode, I speak with Jeffrey Ladish about security and AI.

Patreon: patreon.com/axrpodcast

Ko-fi: ko-fi.com/axrpodcast

Topics we discuss, and timestamps:

0:00:38 - Fine-tuning away safety training

0:13:50 - Dangers of open LLMs vs internet search

0:19:52 - What we learn by undoing safety filters

0:27:34 - What can you do with jailbroken AI?

0:35:28 - Security of AI model weights

0:49:21 - Securing against attackers vs AI exfiltration

1:08:43 - The state of computer security

1:23:08 - How AI labs could be more secure

1:33:13 - What does Palisade do?

1:44:40 - AI phishing

1:53:32 - More on Palisade's work

1:59:56 - Red lines in AI development

2:09:56 - Making AI legible

2:14:08 - Following Jeffrey's research

The transcript: axrp.net/episode/2024/04/30/episode-30-ai-security-jeffrey-ladish.html

Palisade Research: palisaderesearch.org

Jeffrey's Twitter/X account: twitter.com/JeffLadish

Main papers we discussed:

- LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B: arxiv.org/abs/2310.20624

- BadLLaMa: Cheaply Removing Safety Fine-tuning From LLaMa 2-Chat 13B: arxiv.org/abs/2311.00117

- Securing Artificial Intelligence Model Weights: rand.org/pubs/working_papers/WRA2849-1.html

Other links:

- Llama 2: Open Foundation and Fine-Tuned Chat Models: https://arxiv.org/abs/2307.09288

- Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!: https://arxiv.org/abs/2310.03693

- Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models: https://arxiv.org/abs/2310.02949

- On the Societal Impact of Open Foundation Models (Stanford paper on marginal harms from open-weight models): https://crfm.stanford.edu/open-fms/

- The Operational Risks of AI in Large-Scale Biological Attacks (RAND): https://www.rand.org/pubs/research_reports/RRA2977-2.html

- Preventing model exfiltration with upload limits: https://www.alignmentforum.org/posts/rf66R4YsrCHgWx9RG/preventing-model-exfiltration-with-upload-limits

- A deep dive into an NSO zero-click iMessage exploit: Remote Code Execution: https://googleprojectzero.blogspot.com/2021/12/a-deep-dive-into-nso-zero-click.html

- In-browser transformer inference: https://aiserv.cloud/

- Anatomy of a rental phishing scam: https://jeffreyladish.com/anatomy-of-a-rental-phishing-scam/

- Causal Scrubbing: a method for rigorously testing interpretability hypotheses: https://www.alignmentforum.org/posts/JvZhhzycHu2Yd57RN/causal-scrubbing-a-method-for-rigorously-testing

Episode art by Hamish Doodles: hamishdoodles.com
29 - Science of Deep Learning with Vikrant Varma
25 Apr 2024 · AXRP - the AI X-risk Research Podcast
In 2022, it was announced that a fairly simple method can be used to extract the true beliefs of a language model on any given topic, without having to actually understand the topic at hand. Earlier, in 2021, it was announced that neural networks sometimes 'grok': that is, when training them on certain tasks, they initially memorize their training data (achieving their training goal in a way that doesn't generalize), but then suddenly switch to understanding the 'real' solution in a way that generalizes. What's going on with these discoveries? Are they all they're cracked up to be, and if so, how are they working? In this episode, I talk to Vikrant Varma about his research getting to the bottom of these questions.

Patreon: patreon.com/axrpodcast

Ko-fi: ko-fi.com/axrpodcast

Topics we discuss, and timestamps:

0:00:36 - Challenges with unsupervised LLM knowledge discovery, aka contra CCS

0:00:36 - What is CCS?

0:09:54 - Consistent and contrastive features other than model beliefs

0:20:34 - Understanding the banana/shed mystery

0:41:59 - Future CCS-like approaches

0:53:29 - CCS as principal component analysis

0:56:21 - Explaining grokking through circuit efficiency

0:57:44 - Why research science of deep learning?

1:12:07 - Summary of the paper's hypothesis

1:14:05 - What are 'circuits'?

1:20:48 - The role of complexity

1:24:07 - Many kinds of circuits

1:28:10 - How circuits are learned

1:38:24 - Semi-grokking and ungrokking

1:50:53 - Generalizing the results

1:58:51 - Vikrant's research approach

2:06:36 - The DeepMind alignment team

2:09:06 - Follow-up work

The transcript: axrp.net/episode/2024/04/25/episode-29-science-of-deep-learning-vikrant-varma.html

Vikrant's Twitter/X account: twitter.com/vikrantvarma_

Main papers:

- Challenges with unsupervised LLM knowledge discovery: arxiv.org/abs/2312.10029

- Explaining grokking through circuit efficiency: arxiv.org/abs/2309.02390

Other works discussed:

- Discovering latent knowledge in language models without supervision (CCS): arxiv.org/abs/2212.03827

- Eliciting Latent Knowledge: How to Tell if your Eyes Deceive You: https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit

- Discussion: Challenges with unsupervised LLM knowledge discovery: lesswrong.com/posts/wtfvbsYjNHYYBmT3k/discussion-challenges-with-unsupervised-llm-knowledge-1

- Comment thread on the banana/shed results: lesswrong.com/posts/wtfvbsYjNHYYBmT3k/discussion-challenges-with-unsupervised-llm-knowledge-1?commentId=hPZfgA3BdXieNfFuY

- Fabien Roger, What discovering latent knowledge did and did not find: lesswrong.com/posts/bWxNPMy5MhPnQTzKz/what-discovering-latent-knowledge-did-and-did-not-find-4

- Scott Emmons, Contrast Pairs Drive the Empirical Performance of Contrast Consistent Search (CCS): lesswrong.com/posts/9vwekjD6xyuePX7Zr/contrast-pairs-drive-the-empirical-performance-of-contrast

- Grokking: Generalizing Beyond Overfitting on Small Algorithmic Datasets: arxiv.org/abs/2201.02177

- Keeping Neural Networks Simple by Minimizing the Description Length of the Weights (Hinton 1993 L2): dl.acm.org/doi/pdf/10.1145/168304.168306

- Progress measures for grokking via mechanistic interpretability: arxiv.org/abs/2301.0521

Episode art by Hamish Doodles: hamishdoodles.com
28 - Suing Labs for AI Risk with Gabriel Weil
17 Apr 2024 · AXRP - the AI X-risk Research Podcast
How should the law govern AI? Those concerned about existential risks often push either for bans or for regulations meant to ensure that AI is developed safely - but another approach is possible. In this episode, Gabriel Weil talks about his proposal to modify tort law to enable people to sue AI companies for disasters that are "nearly catastrophic".

Patreon: patreon.com/axrpodcast

Ko-fi: ko-fi.com/axrpodcast

Topics we discuss, and timestamps:

0:00:35 - The basic idea

0:20:36 - Tort law vs regulation

0:29:10 - Weil's proposal vs Hanson's proposal

0:37:00 - Tort law vs Pigouvian taxation

0:41:16 - Does disagreement on AI risk make this proposal less effective?

0:49:53 - Warning shots - their prevalence and character

0:59:17 - Feasibility of big changes to liability law

1:29:17 - Interactions with other areas of law

1:38:59 - How Gabriel encountered the AI x-risk field

1:42:41 - AI x-risk and the legal field

1:47:44 - Technical research to help with this proposal

1:50:47 - Decisions this proposal could influence

1:55:34 - Following Gabriel's research

The transcript: axrp.net/episode/2024/04/17/episode-28-tort-law-for-ai-risk-gabriel-weil.html

Links for Gabriel:

- SSRN page: papers.ssrn.com/sol3/cf_dev/AbsByAuth.cfm?per_id=1648032

- Twitter/X account: twitter.com/gabriel_weil

Tort Law as a Tool for Mitigating Catastrophic Risk from Artificial Intelligence: papers.ssrn.com/sol3/papers.cfm?abstract_id=4694006

Other links:

- Foom liability: overcomingbias.com/p/foom-liability

- Punitive Damages: An Economic Analysis: law.harvard.edu/faculty/shavell/pdf/111_Harvard_Law_Rev_869.pdf

- Efficiency, Fairness, and the Externalization of Reasonable Risks: The Problem With the Learned Hand Formula: papers.ssrn.com/sol3/papers.cfm?abstract_id=4466197

- Tort Law Can Play an Important Role in Mitigating AI Risk: forum.effectivealtruism.org/posts/epKBmiyLpZWWFEYDb/tort-law-can-play-an-important-role-in-mitigating-ai-risk

- How Technical AI Safety Researchers Can Help Implement Punitive Damages to Mitigate Catastrophic AI Risk: forum.effectivealtruism.org/posts/yWKaBdBygecE42hFZ/how-technical-ai-safety-researchers-can-help-implement

- Can the courts save us from dangerous AI? [Vox]: vox.com/future-perfect/2024/2/7/24062374/ai-openai-anthropic-deepmind-legal-liability-gabriel-weil

Episode art by Hamish Doodles: hamishdoodles.com
27 - AI Control with Buck Shlegeris and Ryan Greenblatt
11 Apr 2024 · AXRP - the AI X-risk Research Podcast
A lot of work to prevent AI existential risk takes the form of ensuring that AIs don't want to cause harm or take over the world, or in other words, ensuring that they're aligned. In this episode, I talk with Buck Shlegeris and Ryan Greenblatt about a different approach, called "AI control": ensuring that AI systems couldn't take over the world, even if they were trying to.

Patreon: patreon.com/axrpodcast

Ko-fi: ko-fi.com/axrpodcast

Topics we discuss, and timestamps:

0:00:31 - What is AI control?

0:16:16 - Protocols for AI control

0:22:43 - Which AIs are controllable?

0:29:56 - Preventing dangerous coded AI communication

0:40:42 - Unpredictably uncontrollable AI

0:58:01 - What control looks like

1:08:45 - Is AI control evil?

1:24:42 - Can red teams match misaligned AI?

1:36:51 - How expensive is AI monitoring?

1:52:32 - AI control experiments

2:03:50 - GPT-4's aptitude at inserting backdoors

2:14:50 - How AI control relates to the AI safety field

2:39:25 - How AI control relates to previous Redwood Research work

2:49:16 - How people can work on AI control

2:54:07 - Following Buck and Ryan's research

The transcript: axrp.net/episode/2024/04/11/episode-27-ai-control-buck-shlegeris-ryan-greenblatt.html

Links for Buck and Ryan:

- Buck's Twitter/X account: twitter.com/bshlgrs

- Ryan on LessWrong: lesswrong.com/users/ryan_greenblatt

- You can contact both Buck and Ryan by electronic mail at [firstname] [at-sign] rdwrs.com

Main research works we talk about:

- The case for ensuring that powerful AIs are controlled: lesswrong.com/posts/kcKrE9mzEHrdqtDpE/the-case-for-ensuring-that-powerful-ais-are-controlled

- AI Control: Improving Safety Despite Intentional Subversion: arxiv.org/abs/2312.06942

Other things we mention:

- The prototypical catastrophic AI action is getting root access to its datacenter (aka "Hacking the SSH server"): lesswrong.com/posts/BAzCGCys4BkzGDCWR/the-prototypical-catastrophic-ai-action-is-getting-root

- Preventing language models from hiding their reasoning: arxiv.org/abs/2310.18512

- Improving the Welfare of AIs: A Nearcasted Proposal: lesswrong.com/posts/F6HSHzKezkh6aoTr2/improving-the-welfare-of-ais-a-nearcasted-proposal

- Measuring coding challenge competence with APPS: arxiv.org/abs/2105.09938

- Causal Scrubbing: a method for rigorously testing interpretability hypotheses: lesswrong.com/posts/JvZhhzycHu2Yd57RN/causal-scrubbing-a-method-for-rigorously-testing

Episode art by Hamish Doodles: hamishdoodles.com
