38.2 - Jesse Hoogland on Singular Learning Theory – AXRP - the AI X-risk Research Podcast – Podcast

Episodes

44 - Peter Salib on AI Rights for Human Safety
28 Jun· AXRP - the AI X-risk Research Podcast
In this episode, I talk with Peter Salib about his paper "AI Rights for Human Safety", arguing that giving AIs the right to contract, hold property, and sue people will reduce the risk of their trying to attack humanity and take over. He also tells me how law reviews work, in the face of my incredulity.

Patreon: https://www.patreon.com/axrpodcast

Ko-fi: https://ko-fi.com/axrpodcast

Transcript: https://axrp.net/episode/2025/06/28/episode-44-peter-salib-ai-rights-human-safety.html

Topics we discuss, and timestamps:

0:00:40 Why AI rights

0:18:34 Why not reputation

0:27:10 Do AI rights lead to AI war?

0:36:42 Scope for human-AI trade

0:44:25 Concerns with comparative advantage

0:53:42 Proxy AI wars

0:57:56 Can companies profitably make AIs with rights?

1:09:43 Can we have AI rights and AI safety measures?

1:24:31 Liability for AIs with rights

1:38:29 Which AIs get rights?

1:43:36 AI rights and stochastic gradient descent

1:54:54 Individuating "AIs"

2:03:28 Social institutions for AI safety

2:08:20 Outer misalignment and trading with AIs

2:15:27 Why statutes of limitations should exist

2:18:39 Starting AI x-risk research in legal academia

2:24:18 How law reviews and AI conferences work

2:41:49 More on Peter moving to AI x-risk research

2:45:37 Reception of the paper

2:53:24 What publishing in law reviews does

3:04:48 Which parts of legal academia focus on AI

3:18:03 Following Peter's research

Links for Peter:

Personal website: https://www.peternsalib.com/

Writings at Lawfare: https://www.lawfaremedia.org/contributors/psalib

CLAIR: https://clair-ai.org/

Research we discuss:

AI Rights for Human Safety: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4913167

Will humans and AIs go to war? https://philpapers.org/rec/GOLWAA

Infrastructure for AI agents: https://arxiv.org/abs/2501.10114

Governing AI Agents: https://arxiv.org/abs/2501.07913

Episode art by Hamish Doodles: hamishdoodles.com
43 - David Lindner on Myopic Optimization with Non-myopic Approval
15 Jun· AXRP - the AI X-risk Research Podcast
In this episode, I talk with David Lindner about Myopic Optimization with Non-myopic Approval, or MONA, which attempts to address (multi-step) reward hacking by myopically optimizing actions against a human's sense of whether those actions are generally good. Does this work? Can we get smarter-than-human AI this way? How does this compare to approaches like conservatism? Listen to find out.

Patreon: https://www.patreon.com/axrpodcast

Ko-fi: https://ko-fi.com/axrpodcast

Transcript: https://axrp.net/episode/2025/06/15/episode-43-david-lindner-mona.html

Topics we discuss, and timestamps:

0:00:29 What MONA is

0:06:33 How MONA deals with reward hacking

0:23:15 Failure cases for MONA

0:36:25 MONA's capability

0:55:40 MONA vs other approaches

1:05:03 Follow-up work

1:10:17 Other MONA test cases

1:33:47 When increasing time horizon doesn't increase capability

1:39:04 Following David's research

Links for David:

Website: https://www.davidlindner.me

Twitter / X: https://x.com/davlindner

DeepMind Medium: https://deepmindsafetyresearch.medium.com

David on the Alignment Forum: https://www.alignmentforum.org/users/david-lindner

Research we discuss:

MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking: https://arxiv.org/abs/2501.13011

Arguments Against Myopic Training: https://www.alignmentforum.org/posts/GqxuDtZvfgL2bEQ5v/arguments-against-myopic-training

Episode art by Hamish Doodles: hamishdoodles.com
42 - Owain Evans on LLM Psychology
6 Jun· AXRP - the AI X-risk Research Podcast
Earlier this year, the paper "Emergent Misalignment" made the rounds on AI x-risk social media for seemingly showing LLMs generalizing from 'misaligned' training data of insecure code to acting comically evil in response to innocuous questions. In this episode, I chat with one of the authors of that paper, Owain Evans, about that research as well as other work he's done to understand the psychology of large language models.

Patreon: https://www.patreon.com/axrpodcast

Ko-fi: https://ko-fi.com/axrpodcast

Transcript: https://axrp.net/episode/2025/06/06/episode-42-owain-evans-llm-psychology.html

Topics we discuss, and timestamps:

0:00:37 Why introspection?

0:06:24 Experiments in "Looking Inward"

0:15:11 Why fine-tune for introspection?

0:22:32 Does "Looking Inward" test introspection, or something else?

0:34:14 Interpreting the results of "Looking Inward"

0:44:56 Limitations to introspection?

0:49:54 "Tell me about yourself", and its relation to other papers

1:05:45 Backdoor results

1:12:01 Emergent Misalignment

1:22:13 Why so hammy, and so infrequently evil?

1:36:31 Why emergent misalignment?

1:46:45 Emergent misalignment and other types of misalignment

1:53:57 Is emergent misalignment good news?

2:00:01 Follow-up work to "Emergent Misalignment"

2:03:10 Reception of "Emergent Misalignment" vs other papers

2:07:43 Evil numbers

2:12:20 Following Owain's research

Links for Owain:

Truthful AI: https://www.truthfulai.org

Owain's website: https://owainevans.github.io/

Owain's twitter/X account: https://twitter.com/OwainEvans_UK

Research we discuss:

Looking Inward: Language Models Can Learn About Themselves by Introspection: https://arxiv.org/abs/2410.13787

Tell me about yourself: LLMs are aware of their learned behaviors: https://arxiv.org/abs/2501.11120

Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data: https://arxiv.org/abs/2406.14546

Emergent Misalignment: Narrow fine-tuning can produce broadly misaligned LLMs: https://arxiv.org/abs/2502.17424

X/Twitter thread of GPT-4.1 emergent misalignment results: https://x.com/OwainEvans_UK/status/1912701650051190852

Taken out of context: On measuring situational awareness in LLMs: https://arxiv.org/abs/2309.00667

Episode art by Hamish Doodles: hamishdoodles.com
41 - Lee Sharkey on Attribution-based Parameter Decomposition
3 Jun· AXRP - the AI X-risk Research Podcast
What's the next step forward in interpretability? In this episode, I chat with Lee Sharkey about his proposal for detecting computational mechanisms within neural networks: Attribution-based Parameter Decomposition, or APD for short.

Patreon: https://www.patreon.com/axrpodcast

Ko-fi: https://ko-fi.com/axrpodcast

Transcript: https://axrp.net/episode/2025/06/03/episode-41-lee-sharkey-attribution-based-parameter-decomposition.html

Topics we discuss, and timestamps:

0:00:41 APD basics

0:07:57 Faithfulness

0:11:10 Minimality

0:28:44 Simplicity

0:34:50 Concrete-ish examples of APD

0:52:00 Which parts of APD are canonical

0:58:10 Hyperparameter selection

1:06:40 APD in toy models of superposition

1:14:40 APD and compressed computation

1:25:43 Mechanisms vs representations

1:34:41 Future applications of APD?

1:44:19 How costly is APD?

1:49:14 More on minimality training

1:51:49 Follow-up work

2:05:24 APD on giant chain-of-thought models?

2:11:27 APD and "features"

2:14:11 Following Lee's work

Lee links (Leenks):

X/Twitter: https://twitter.com/leedsharkey

Alignment Forum: https://www.alignmentforum.org/users/lee_sharkey

Research we discuss:

Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-Based Parameter Decomposition: https://arxiv.org/abs/2501.14926

Toy Models of Superposition: https://transformer-circuits.pub/2022/toy_model/index.html

Towards a unified and verified understanding of group-operation networks: https://arxiv.org/abs/2410.07476

Feature geometry is outside the superposition hypothesis: https://www.alignmentforum.org/posts/MFBTjb2qf3ziWmzz6/sae-feature-geometry-is-outside-the-superposition-hypothesis

Episode art by Hamish Doodles: hamishdoodles.com
40 - Jason Gross on Compact Proofs and Interpretability
28 Mar· AXRP - the AI X-risk Research Podcast
How do we figure out whether interpretability is doing its job? One way is to see if it helps us prove things about models that we care about knowing. In this episode, I speak with Jason Gross about his agenda to benchmark interpretability in this way, and his exploration of the intersection of proofs and modern machine learning.

Patreon: https://www.patreon.com/axrpodcast

Ko-fi: https://ko-fi.com/axrpodcast

Transcript: https://axrp.net/episode/2025/03/28/episode-40-jason-gross-compact-proofs-interpretability.html

Topics we discuss, and timestamps:

0:00:40 - Why compact proofs

0:07:25 - Compact Proofs of Model Performance via Mechanistic Interpretability

0:14:19 - What compact proofs look like

0:32:43 - Structureless noise, and why proofs

0:48:23 - What we've learned about compact proofs in general

0:59:02 - Generalizing 'symmetry'

1:11:24 - Grading mechanistic interpretability

1:43:34 - What helps compact proofs

1:51:08 - The limits of compact proofs

2:07:33 - Guaranteed safe AI, and AI for guaranteed safety

2:27:44 - Jason and Rajashree's start-up

2:34:19 - Following Jason's work

Links to Jason:

Github: https://github.com/jasongross

Website: https://jasongross.github.io

Alignment Forum: https://www.alignmentforum.org/users/jason-gross

Links to work we discuss:

Compact Proofs of Model Performance via Mechanistic Interpretability: https://arxiv.org/abs/2406.11779

Unifying and Verifying Mechanistic Interpretability: A Case Study with Group Operations: https://arxiv.org/abs/2410.07476

Modular addition without black-boxes: Compressing explanations of MLPs that compute numerical integration: https://arxiv.org/abs/2412.03773

Stage-Wise Model Diffing: https://transformer-circuits.pub/2024/model-diffing/index.html

Causal Scrubbing: a method for rigorously testing interpretability hypotheses: https://www.lesswrong.com/posts/JvZhhzycHu2Yd57RN/causal-scrubbing-a-method-for-rigorously-testing

Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-based Parameter Decomposition (aka the Apollo paper on APD): https://arxiv.org/abs/2501.14926

Towards Guaranteed Safe AI: https://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-45.pdf

Episode art by Hamish Doodles: hamishdoodles.com
38.8 - David Duvenaud on Sabotage Evaluations and the Post-AGI Future
1 Mar· AXRP - the AI X-risk Research Podcast
In this episode, I chat with David Duvenaud about two topics he's been thinking about: firstly, a paper he wrote about evaluating whether or not frontier models can sabotage human decision-making or monitoring of the same models; and secondly, the difficult situation humans find themselves in in a post-AGI future, even if AI is aligned with human intentions.

Patreon: https://www.patreon.com/axrpodcast

Ko-fi: https://ko-fi.com/axrpodcast

Transcript: https://axrp.net/episode/2025/03/01/episode-38_8-david-duvenaud-sabotage-evaluations-post-agi-future.html

FAR.AI: https://far.ai/

FAR.AI on X (aka Twitter): https://x.com/farairesearch

FAR.AI on YouTube: https://www.youtube.com/@FARAIResearch

The Alignment Workshop: https://www.alignment-workshop.com/

Topics we discuss, and timestamps:

01:42 - The difficulty of sabotage evaluations

05:23 - Types of sabotage evaluation

08:45 - The state of sabotage evaluations

12:26 - What happens after AGI?

Links:

Sabotage Evaluations for Frontier Models: https://arxiv.org/abs/2410.21514

Gradual Disempowerment: https://gradual-disempowerment.ai/

Episode art by Hamish Doodles: hamishdoodles.com
38.7 - Anthony Aguirre on the Future of Life Institute
9 Feb· AXRP - the AI X-risk Research Podcast
The Future of Life Institute is one of the oldest and most prominent organizations in the AI existential safety space, working on such topics as the AI pause open letter and how the EU AI Act can be improved. Metaculus is one of the premier forecasting sites on the internet. Behind both of them lies one man: Anthony Aguirre, who I talk with in this episode.

Patreon: https://www.patreon.com/axrpodcast

Ko-fi: https://ko-fi.com/axrpodcast

Transcript: https://axrp.net/episode/2025/02/09/episode-38_7-anthony-aguirre-future-of-life-institute.html

FAR.AI: https://far.ai/

FAR.AI on X (aka Twitter): https://x.com/farairesearch

FAR.AI on YouTube: https://www.youtube.com/@FARAIResearch

The Alignment Workshop: https://www.alignment-workshop.com/

Topics we discuss, and timestamps:

00:33 - Anthony, FLI, and Metaculus

06:46 - The Alignment Workshop

07:15 - FLI's current activity

11:04 - AI policy

17:09 - Work FLI funds

Links:

Future of Life Institute: https://futureoflife.org/

Metaculus: https://www.metaculus.com/

Future of Life Foundation: https://www.flf.org/

Episode art by Hamish Doodles: hamishdoodles.com
38.6 - Joel Lehman on Positive Visions of AI
24 Jan· AXRP - the AI X-risk Research Podcast
Typically this podcast talks about how to avert destruction from AI. But what would it take to ensure AI promotes human flourishing as well as it can? Is alignment to individuals enough, and if not, where do we go from here? In this episode, I talk with Joel Lehman about these questions.

Patreon: https://www.patreon.com/axrpodcast

Ko-fi: https://ko-fi.com/axrpodcast

Transcript: https://axrp.net/episode/2025/01/24/episode-38_6-joel-lehman-positive-visions-of-ai.html

FAR.AI: https://far.ai/

FAR.AI on X (aka Twitter): https://x.com/farairesearch

FAR.AI on YouTube: https://www.youtube.com/@FARAIResearch

The Alignment Workshop: https://www.alignment-workshop.com/

Topics we discuss, and timestamps:

01:12 - Why aligned AI might not be enough

04:05 - Positive visions of AI

08:27 - Improving recommendation systems

Links:

Why Greatness Cannot Be Planned: https://www.amazon.com/Why-Greatness-Cannot-Planned-Objective/dp/3319155237

We Need Positive Visions of AI Grounded in Wellbeing: https://thegradientpub.substack.com/p/beneficial-ai-wellbeing-lehman-ngo

Machine Love: https://arxiv.org/abs/2302.09248

AI Alignment with Changing and Influenceable Reward Functions: https://arxiv.org/abs/2405.17713

Episode art by Hamish Doodles: hamishdoodles.com
38.5 - Adrià Garriga-Alonso on Detecting AI Scheming
20 Jan· AXRP - the AI X-risk Research Podcast
Suppose we're worried about AIs engaging in long-term plans that they don't tell us about. If we were to peek inside their brains, what should we look for to check whether this was happening? In this episode Adrià Garriga-Alonso talks about his work trying to answer this question.

Patreon: https://www.patreon.com/axrpodcast

Ko-fi: https://ko-fi.com/axrpodcast

Transcript: https://axrp.net/episode/2025/01/20/episode-38_5-adria-garriga-alonso-detecting-ai-scheming.html

FAR.AI: https://far.ai/

FAR.AI on X (aka Twitter): https://x.com/farairesearch

FAR.AI on YouTube: https://www.youtube.com/@FARAIResearch

The Alignment Workshop: https://www.alignment-workshop.com/

Topics we discuss, and timestamps:

01:04 - The Alignment Workshop

02:49 - How to detect scheming AIs

05:29 - Sokoban-solving networks taking time to think

12:18 - Model organisms of long-term planning

19:44 - How and why to study planning in networks

Links:

Adrià's website: https://agarri.ga/

An investigation of model-free planning: https://arxiv.org/abs/1901.03559

Model-Free Planning: https://tuphs28.github.io/projects/interpplanning/

Planning in a recurrent neural network that plays Sokoban: https://arxiv.org/abs/2407.15421

Episode art by Hamish Doodles: hamishdoodles.com
38.4 - Shakeel Hashim on AI Journalism
5 Jan· AXRP - the AI X-risk Research Podcast
AI researchers often complain about the poor coverage of their work in the news media. But why is this happening, and how can it be fixed? In this episode, I speak with Shakeel Hashim about the resource constraints facing AI journalism, the disconnect between journalists' and AI researchers' views on transformative AI, and efforts to improve the state of AI journalism, such as Tarbell and Shakeel's newsletter, Transformer.

Patreon: https://www.patreon.com/axrpodcast

Ko-fi: https://ko-fi.com/axrpodcast

The transcript: https://axrp.net/episode/2025/01/05/episode-38_4-shakeel-hashim-ai-journalism.html

FAR.AI: https://far.ai/

FAR.AI on X (aka Twitter): https://x.com/farairesearch

FAR.AI on YouTube: https://www.youtube.com/@FARAIResearch

The Alignment Workshop: https://www.alignment-workshop.com/

Topics we discuss, and timestamps:

01:31 - The AI media ecosystem

02:34 - Why not more AI news?

07:18 - Disconnects between journalists and the AI field

12:42 - Tarbell

18:44 - The Transformer newsletter

Links:

Transformer (Shakeel's substack): https://www.transformernews.ai/

Tarbell: https://www.tarbellfellowship.org/

Episode art by Hamish Doodles: hamishdoodles.com
38.3 - Erik Jenner on Learned Look-Ahead
12 Dec 2024· AXRP - the AI X-risk Research Podcast
Lots of people in the AI safety space worry about models being able to make deliberate, multi-step plans. But can we already see this in existing neural nets? In this episode, I talk with Erik Jenner about his work looking at internal look-ahead within chess-playing neural networks.

Patreon: https://www.patreon.com/axrpodcast

Ko-fi: https://ko-fi.com/axrpodcast

The transcript: https://axrp.net/episode/2024/12/12/episode-38_3-erik-jenner-learned-look-ahead.html

FAR.AI: https://far.ai/

FAR.AI on X (aka Twitter): https://x.com/farairesearch

FAR.AI on YouTube: https://www.youtube.com/@FARAIResearch

The Alignment Workshop: https://www.alignment-workshop.com/

Topics we discuss, and timestamps:

00:57 - How chess neural nets look into the future

04:29 - The dataset and basic methodology

05:23 - Testing for branching futures?

07:57 - Which experiments demonstrate what

10:43 - How the ablation experiments work

12:38 - Effect sizes

15:23 - X-risk relevance

18:08 - Follow-up work

21:29 - How much planning does the network do?

Research we mention:

Evidence of Learned Look-Ahead in a Chess-Playing Neural Network: https://arxiv.org/abs/2406.00877

Understanding the learned look-ahead behavior of chess neural networks (a development of the follow-up research Erik mentioned): https://openreview.net/forum?id=Tl8EzmgsEp

Linear Latent World Models in Simple Transformers: A Case Study on Othello-GPT: https://arxiv.org/abs/2310.07582

Episode art by Hamish Doodles: hamishdoodles.com
39 - Evan Hubinger on Model Organisms of Misalignment
1 Dec 2024· AXRP - the AI X-risk Research Podcast
The 'model organisms of misalignment' line of research creates AI models that exhibit various types of misalignment, and studies them to try to understand how the misalignment occurs and whether it can be somehow removed. In this episode, Evan Hubinger talks about two papers he's worked on at Anthropic under this agenda: "Sleeper Agents" and "Sycophancy to Subterfuge".

Patreon: https://www.patreon.com/axrpodcast

Ko-fi: https://ko-fi.com/axrpodcast

The transcript: https://axrp.net/episode/2024/12/01/episode-39-evan-hubinger-model-organisms-misalignment.html

Topics we discuss, and timestamps:

0:00:36 - Model organisms and stress-testing

0:07:38 - Sleeper Agents

0:22:32 - Do 'sleeper agents' properly model deceptive alignment?

0:38:32 - Surprising results in "Sleeper Agents"

0:57:25 - Sycophancy to Subterfuge

1:09:21 - How models generalize from sycophancy to subterfuge

1:16:37 - Is the reward editing task valid?

1:21:46 - Training away sycophancy and subterfuge

1:29:22 - Model organisms, AI control, and evaluations

1:33:45 - Other model organisms research

1:35:27 - Alignment stress-testing at Anthropic

1:43:32 - Following Evan's work

Main papers:

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training: https://arxiv.org/abs/2401.05566

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models: https://arxiv.org/abs/2406.10162

Anthropic links:

Anthropic's newsroom: https://www.anthropic.com/news

Careers at Anthropic: https://www.anthropic.com/careers

Other links:

Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research: https://www.alignmentforum.org/posts/ChDH335ckdvpxXaXX/model-organisms-of-misalignment-the-case-for-a-new-pillar-of-1

Simple probes can catch sleeper agents: https://www.anthropic.com/research/probes-catch-sleeper-agents

Studying Large Language Model Generalization with Influence Functions: https://arxiv.org/abs/2308.03296

Stress-Testing Capability Elicitation With Password-Locked Models [aka model organisms of sandbagging]: https://arxiv.org/abs/2405.19550

Episode art by Hamish Doodles: hamishdoodles.com
38.2 - Jesse Hoogland on Singular Learning Theory
27 Nov 2024· AXRP - the AI X-risk Research Podcast
You may have heard of singular learning theory, and its "local learning coefficient", or LLC - but have you heard of the refined LLC? In this episode, I chat with Jesse Hoogland about his work on SLT, and using the refined LLC to find a new circuit in language models.

Patreon: https://www.patreon.com/axrpodcast

Ko-fi: https://ko-fi.com/axrpodcast

The transcript: https://axrp.net/episode/2024/11/27/38_2-jesse-hoogland-singular-learning-theory.html

FAR.AI: https://far.ai/

FAR.AI on X (aka Twitter): https://x.com/farairesearch

FAR.AI on YouTube: https://www.youtube.com/@FARAIResearch

The Alignment Workshop: https://www.alignment-workshop.com/

Topics we discuss, and timestamps:

00:34 - About Jesse

01:49 - The Alignment Workshop

02:31 - About Timaeus

05:25 - SLT that isn't developmental interpretability

10:41 - The refined local learning coefficient

14:06 - Finding the multigram circuit

Links:

Differentiation and Specialization of Attention Heads via the Refined Local Learning Coefficient: https://arxiv.org/abs/2410.02984

Investigating the learning coefficient of modular addition: hackathon project: https://www.lesswrong.com/posts/4v3hMuKfsGatLXPgt/investigating-the-learning-coefficient-of-modular-addition

Episode art by Hamish Doodles: hamishdoodles.com
38.1 - Alan Chan on Agent Infrastructure
16 Nov 2024· AXRP - the AI X-risk Research Podcast
Road lines, street lights, and licence plates are examples of infrastructure used to ensure that roads operate smoothly. In this episode, Alan Chan talks about using similar interventions to help avoid bad outcomes from the deployment of AI agents.

Patreon: https://www.patreon.com/axrpodcast

Ko-fi: https://ko-fi.com/axrpodcast

The transcript: https://axrp.net/episode/2024/11/16/episode-38_1-alan-chan-agent-infrastructure.html

FAR.AI: https://far.ai/

FAR.AI on X (aka Twitter): https://x.com/farairesearch

FAR.AI on YouTube: https://www.youtube.com/@FARAIResearch

The Alignment Workshop: https://www.alignment-workshop.com/

Topics we discuss, and timestamps:

01:02 - How the Alignment Workshop is

01:32 - Agent infrastructure

04:57 - Why agent infrastructure

07:54 - A trichotomy of agent infrastructure

13:59 - Agent IDs

18:17 - Agent channels

20:29 - Relation to AI control

Links:

Alan on Google Scholar: https://scholar.google.com/citations?user=lmQmYPgAAAAJ&hl=en&oi=ao

IDs for AI Systems: https://arxiv.org/abs/2406.12137

Visibility into AI Agents: https://arxiv.org/abs/2401.13138

Episode art by Hamish Doodles: hamishdoodles.com
38.0 - Zhijing Jin on LLMs, Causality, and Multi-Agent Systems
14 Nov 2024· AXRP - the AI X-risk Research Podcast
Do language models understand the causal structure of the world, or do they merely note correlations? And what happens when you build a big AI society out of them? In this brief episode, recorded at the Bay Area Alignment Workshop, I chat with Zhijing Jin about her research on these questions.

Patreon: https://www.patreon.com/axrpodcast

Ko-fi: https://ko-fi.com/axrpodcast

The transcript: https://axrp.net/episode/2024/11/14/episode-38_0-zhijing-jin-llms-causality-multi-agent-systems.html

FAR.AI: https://far.ai/

FAR.AI on X (aka Twitter): https://x.com/farairesearch

FAR.AI on YouTube: https://www.youtube.com/@FARAIResearch

The Alignment Workshop: https://www.alignment-workshop.com/

Topics we discuss, and timestamps:

00:35 - How the Alignment Workshop is

00:47 - How Zhijing got interested in causality and natural language processing

03:14 - Causality and alignment

06:21 - Causality without randomness

10:07 - Causal abstraction

11:42 - Why LLM causal reasoning?

13:20 - Understanding LLM causal reasoning

16:33 - Multi-agent systems

Links:

Zhijing's website: https://zhijing-jin.com/fantasy/

Zhijing on X (aka Twitter): https://x.com/zhijingjin

Can Large Language Models Infer Causation from Correlation?: https://arxiv.org/abs/2306.05836

Cooperate or Collapse: Emergence of Sustainable Cooperation in a Society of LLM Agents: https://arxiv.org/abs/2404.16698

Episode art by Hamish Doodles: hamishdoodles.com
37 - Jaime Sevilla on AI Forecasting
4 Oct 2024· AXRP - the AI X-risk Research Podcast
Epoch AI is the premier organization that tracks the trajectory of AI - how much compute is used, the role of algorithmic improvements, the growth in data used, and when the above trends might hit an end. In this episode, I speak with the director of Epoch AI, Jaime Sevilla, about how compute, data, and algorithmic improvements are impacting AI, and whether continuing to scale can get us AGI.

Patreon: https://www.patreon.com/axrpodcast

Ko-fi: https://ko-fi.com/axrpodcast

The transcript: https://axrp.net/episode/2024/10/04/episode-37-jaime-sevilla-forecasting-ai.html

Topics we discuss, and timestamps:

0:00:38 - The pace of AI progress

0:07:49 - How Epoch AI tracks AI compute

0:11:44 - Why does AI compute grow so smoothly?

0:21:46 - When will we run out of computers?

0:38:56 - Algorithmic improvement

0:44:21 - Algorithmic improvement and scaling laws

0:56:56 - Training data

1:04:56 - Can scaling produce AGI?

1:16:55 - When will AGI arrive?

1:21:20 - Epoch AI

1:27:06 - Open questions in AI forecasting

1:35:21 - Epoch AI and x-risk

1:41:34 - Following Epoch AI's research

Links for Jaime and Epoch AI:

Epoch AI: https://epochai.org/

Machine Learning Trends dashboard: https://epochai.org/trends

Epoch AI on X / Twitter: https://x.com/EpochAIResearch

Jaime on X / Twitter: https://x.com/Jsevillamol

Research we discuss:

Training Compute of Frontier AI Models Grows by 4-5x per Year: https://epochai.org/blog/training-compute-of-frontier-ai-models-grows-by-4-5x-per-year

Optimally Allocating Compute Between Inference and Training: https://epochai.org/blog/optimally-allocating-compute-between-inference-and-training

Algorithmic Progress in Language Models [blog post]: https://epochai.org/blog/algorithmic-progress-in-language-models

Algorithmic progress in language models [paper]: https://arxiv.org/abs/2403.05812

Training Compute-Optimal Large Language Models [aka the Chinchilla scaling law paper]: https://arxiv.org/abs/2203.15556

Will We Run Out of Data? Limits of LLM Scaling Based on Human-Generated Data [blog post]: https://epochai.org/blog/will-we-run-out-of-data-limits-of-llm-scaling-based-on-human-generated-data

Will we run out of data? Limits of LLM scaling based on human-generated data [paper]: https://arxiv.org/abs/2211.04325

The Direct Approach: https://epochai.org/blog/the-direct-approach

Episode art by Hamish Doodles: hamishdoodles.com
36 - Adam Shai and Paul Riechers on Computational Mechanics
29 Sep 2024· AXRP - the AI X-risk Research Podcast
Sometimes, people talk about transformers as having "world models" as a result of being trained to predict text data on the internet. But what does this even mean? In this episode, I talk with Adam Shai and Paul Riechers about their work applying computational mechanics, a sub-field of physics studying how to predict random processes, to neural networks.

Patreon: https://www.patreon.com/axrpodcast

Ko-fi: https://ko-fi.com/axrpodcast

The transcript: https://axrp.net/episode/2024/09/29/episode-36-adam-shai-paul-riechers-computational-mechanics.html

Topics we discuss, and timestamps:

0:00:42 - What computational mechanics is

0:29:49 - Computational mechanics vs other approaches

0:36:16 - What world models are

0:48:41 - Fractals

0:57:43 - How the fractals are formed

1:09:55 - Scaling computational mechanics for transformers

1:21:52 - How Adam and Paul found computational mechanics

1:36:16 - Computational mechanics for AI safety

1:46:05 - Following Adam and Paul's research

Simplex AI Safety: https://www.simplexaisafety.com/

Research we discuss:

Transformers represent belief state geometry in their residual stream: https://arxiv.org/abs/2405.15943

Transformers represent belief state geometry in their residual stream [LessWrong post]: https://www.lesswrong.com/posts/gTZ2SxesbHckJ3CkF/transformers-represent-belief-state-geometry-in-their

Why Would Belief-States Have A Fractal Structure, And Why Would That Matter For Interpretability? An Explainer: https://www.lesswrong.com/posts/mBw7nc4ipdyeeEpWs/why-would-belief-states-have-a-fractal-structure-and-why

Episode art by Hamish Doodles: hamishdoodles.com
New Patreon tiers + MATS applications
28 Sep 2024· AXRP - the AI X-risk Research Podcast
Patreon: https://www.patreon.com/axrpodcast

MATS: https://www.matsprogram.org

Note: I'm employed by MATS, but they're not paying me to make this video.
35 - Peter Hase on LLM Beliefs and Easy-to-Hard Generalization
24 Aug 2024· AXRP - the AI X-risk Research Podcast
How do we figure out what large language models believe? In fact, do they even have beliefs? Do those beliefs have locations, and if so, can we edit those locations to change the beliefs? Also, how are we going to get AI to perform tasks so hard that we can't figure out if they succeeded at them? In this episode, I chat with Peter Hase about his research into these questions.

Patreon: https://www.patreon.com/axrpodcast

Ko-fi: https://ko-fi.com/axrpodcast

The transcript: https://axrp.net/episode/2024/08/24/episode-35-peter-hase-llm-beliefs-easy-to-hard-generalization.html

Topics we discuss, and timestamps:

0:00:36 - NLP and interpretability

0:10:20 - Interpretability lessons

0:32:22 - Belief interpretability

1:00:12 - Localizing and editing models' beliefs

1:19:18 - Beliefs beyond language models

1:27:21 - Easy-to-hard generalization

1:47:16 - What do easy-to-hard results tell us?

1:57:33 - Easy-to-hard vs weak-to-strong

2:03:50 - Different notions of hardness

2:13:01 - Easy-to-hard vs weak-to-strong, round 2

2:15:39 - Following Peter's work

Peter on Twitter: https://x.com/peterbhase

Peter's papers:

Foundational Challenges in Assuring Alignment and Safety of Large Language Models: https://arxiv.org/abs/2404.09932

Do Language Models Have Beliefs? Methods for Detecting, Updating, and Visualizing Model Beliefs: https://arxiv.org/abs/2111.13654

Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models: https://arxiv.org/abs/2301.04213

Are Language Models Rational? The Case of Coherence Norms and Belief Revision: https://arxiv.org/abs/2406.03442

The Unreasonable Effectiveness of Easy Training Data for Hard Tasks: https://arxiv.org/abs/2401.06751

Other links:

Toy Models of Superposition: https://transformer-circuits.pub/2022/toy_model/index.html

Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV): https://arxiv.org/abs/1711.11279

Locating and Editing Factual Associations in GPT (aka the ROME paper): https://arxiv.org/abs/2202.05262

Of nonlinearity and commutativity in BERT: https://arxiv.org/abs/2101.04547

Inference-Time Intervention: Eliciting Truthful Answers from a Language Model: https://arxiv.org/abs/2306.03341

Editing a classifier by rewriting its prediction rules: https://arxiv.org/abs/2112.01008

Discovering Latent Knowledge Without Supervision (aka the Collin Burns CCS paper): https://arxiv.org/abs/2212.03827

Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision: https://arxiv.org/abs/2312.09390

Concrete problems in AI safety: https://arxiv.org/abs/1606.06565

Rissanen Data Analysis: Examining Dataset Characteristics via Description Length: https://arxiv.org/abs/2103.03872

Episode art by Hamish Doodles: hamishdoodles.com
34 - AI Evaluations with Beth Barnes
28 Jul 2024· AXRP - the AI X-risk Research Podcast
How can we figure out if AIs are capable enough to pose a threat to humans? When should we make a big effort to mitigate risks of catastrophic AI misbehaviour? In this episode, I chat with Beth Barnes, founder of and head of research at METR, about these questions and more.

Patreon: patreon.com/axrpodcast

Ko-fi: ko-fi.com/axrpodcast

The transcript: https://axrp.net/episode/2024/07/28/episode-34-ai-evaluations-beth-barnes.html

Topics we discuss, and timestamps:

0:00:37 - What is METR?

0:02:44 - What is an "eval"?

0:14:42 - How good are evals?

0:37:25 - Are models showing their full capabilities?

0:53:25 - Evaluating alignment

1:01:38 - Existential safety methodology

1:12:13 - Threat models and capability buffers

1:38:25 - METR's policy work

1:48:19 - METR's relationships with labs

2:04:12 - Related research

2:10:02 - Roles at METR, and following METR's work

Links for METR:

METR: https://metr.org

METR Task Development Guide - Bounty: https://taskdev.metr.org/bounty/

METR - Hiring: https://metr.org/hiring

Autonomy evaluation resources: https://metr.org/blog/2024-03-13-autonomy-evaluation-resources/

Other links:

Update on ARC's recent eval efforts (contains GPT-4 taskrabbit captcha story): https://metr.org/blog/2023-03-18-update-on-recent-evals/

Password-locked models: a stress case for capabilities evaluation: https://www.alignmentforum.org/posts/rZs6ddqNnW8LXuJqA/password-locked-models-a-stress-case-for-capabilities

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training: https://arxiv.org/abs/2401.05566

Untrusted smart models and trusted dumb models: https://www.alignmentforum.org/posts/LhxHcASQwpNa3mRNk/untrusted-smart-models-and-trusted-dumb-models

AI companies aren't really using external evaluators: https://www.lesswrong.com/posts/WjtnvndbsHxCnFNyc/ai-companies-aren-t-really-using-external-evaluators

Nobody Knows How to Safety-Test AI (Time): https://time.com/6958868/artificial-intelligence-safety-evaluations-risks/

ChatGPT can talk, but OpenAI employees sure can’t: https://www.vox.com/future-perfect/2024/5/17/24158478/openai-departures-sam-altman-employees-chatgpt-release

Leaked OpenAI documents reveal aggressive tactics toward former employees: https://www.vox.com/future-perfect/351132/openai-vested-equity-nda-sam-altman-documents-employees

Beth on her non-disparagement agreement with OpenAI: https://www.lesswrong.com/posts/yRWv5kkDD4YhzwRLq/non-disparagement-canaries-for-openai?commentId=MrJF3tWiKYMtJepgX

Sam Altman's statement on OpenAI equity: https://x.com/sama/status/1791936857594581428

Episode art by Hamish Doodles: hamishdoodles.com

Episodes

44 - Peter Salib on AI Rights for Human Safety

43 - David Lindner on Myopic Optimization with Non-myopic Approval

42 - Owain Evans on LLM Psychology

41 - Lee Sharkey on Attribution-based Parameter Decomposition

40 - Jason Gross on Compact Proofs and Interpretability

38.8 - David Duvenaud on Sabotage Evaluations and the Post-AGI Future

38.7 - Anthony Aguirre on the Future of Life Institute

38.6 - Joel Lehman on Positive Visions of AI

38.5 - Adrià Garriga-Alonso on Detecting AI Scheming

38.4 - Shakeel Hashim on AI Journalism

38.3 - Erik Jenner on Learned Look-Ahead

39 - Evan Hubinger on Model Organisms of Misalignment

38.2 - Jesse Hoogland on Singular Learning Theory

38.1 - Alan Chan on Agent Infrastructure

38.0 - Zhijing Jin on LLMs, Causality, and Multi-Agent Systems

37 - Jaime Sevilla on AI Forecasting

36 - Adam Shai and Paul Riechers on Computational Mechanics

New Patreon tiers + MATS applications

35 - Peter Hase on LLM Beliefs and Easy-to-Hard Generalization

34 - AI Evaluations with Beth Barnes