Episodes

  • Have you heard this before? In clinical trials, medicines have to be compared to a placebo to separate the effect of the medicine from the psychological effect of taking the drug. The patient's belief in the power of the medicine has a strong effect on its own. In fact, for some drugs such as antidepressants, the psychological effect of taking a pill is larger than the effect of the drug. It may even be worth it to give a patient an ineffective medicine just to benefit from the placebo effect. This is the conventional wisdom that I took for granted until recently.

    I no longer believe any of it, and the short answer as to why is that big meta-analysis on the placebo effect. Its authors collected all the studies they could find that did "direct" measurements of the placebo effect. In addition to a placebo group that could [...]

    ---

    First published:
    June 10th, 2024

    Source:
    https://www.lesswrong.com/posts/kpd83h5XHgWCxnv3h/why-i-don-t-believe-in-the-placebo-effect

    ---

    Narrated by TYPE III AUDIO.

  • Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

    For an AI researcher who wants to do technical work that helps humanity, there is a strong drive to find a research area that is definitely helpful somehow, so that you don’t have to worry about how your work will be applied, and thus you don’t have to worry about things like corporate ethics or geopolitics to make sure your work benefits humanity.

    Unfortunately, no such field exists. In particular, technical AI alignment is not such a field, and technical AI safety is not such a field. It absolutely matters where ideas land and how they are applied, and when the existence of the entire human race is at stake, that's no exception.

    If that's obvious to you, this post is mostly just a collection of arguments for something you probably already realize. But if you somehow [...]

    ---

    First published:
    June 14th, 2024

    Source:
    https://www.lesswrong.com/posts/F2voF4pr3BfejJawL/safety-isn-t-safety-without-a-social-model-or-dispelling-the

    ---

    Narrated by TYPE III AUDIO.


  • Preamble: Delta vs Crux

    This section is redundant if you already read My AI Model Delta Compared To Yudkowsky.

    I don’t natively think in terms of cruxes. But there's a similar concept which is more natural for me, which I’ll call a delta.

    Imagine that you and I each model the world (or some part of it) as implementing some program. Very oversimplified example: if I learn that e.g. it's cloudy today, that means the “weather” variable in my program at a particular time[1] takes on the value “cloudy”. Now, suppose your program and my program are exactly the same, except that somewhere in there I think a certain parameter has value 5 and you think it has value 0.3. Even though our programs differ in only that one little spot, we might still expect very different values of lots of variables during execution - in other words, we [...]

    ---

    First published:
    June 12th, 2024

    Source:
    https://www.lesswrong.com/posts/7fJRPB6CF6uPKMLWi/my-ai-model-delta-compared-to-christiano

    ---

    Narrated by TYPE III AUDIO.

  • Preamble: Delta vs Crux

    I don’t natively think in terms of cruxes. But there's a similar concept which is more natural for me, which I’ll call a delta.

    Imagine that you and I each model the world (or some part of it) as implementing some program. Very oversimplified example: if I learn that e.g. it's cloudy today, that means the “weather” variable in my program at a particular time[1] takes on the value “cloudy”. Now, suppose your program and my program are exactly the same, except that somewhere in there I think a certain parameter has value 5 and you think it has value 0.3. Even though our programs differ in only that one little spot, we might still expect very different values of lots of variables during execution - in other words, we might have very different beliefs about lots of stuff in the world.
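
    As a toy illustration of this framing (my sketch, not the author's), here is a minimal Python program in which two otherwise identical "world models" differ in only one parameter value (5 vs 0.3), yet end up disagreeing about several downstream variables; the variable names and formulas are made up purely for illustration.

    ```python
    # Two identical programs that differ only in a single parameter value.
    # All names and formulas here are invented for illustration.

    def world_model(param):
        state = {"param": param}
        state["growth"] = param ** 3                # sensitive to the parameter
        state["risk"] = 1.0 if param > 1 else 0.1   # thresholded on the parameter
        state["plan"] = "aggressive" if state["risk"] > 0.5 else "cautious"
        return state

    mine = world_model(5)     # I think the parameter is 5
    yours = world_model(0.3)  # you think it is 0.3

    print(mine)   # {'param': 5, 'growth': 125, 'risk': 1.0, 'plan': 'aggressive'}
    print(yours)  # {'param': 0.3, 'growth': ~0.027, 'risk': 0.1, 'plan': 'cautious'}
    ```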

    If your model [...]

    The original text contained 1 footnote which was omitted from this narration.

    ---

    First published:
    June 10th, 2024

    Source:
    https://www.lesswrong.com/posts/q8uNoJBgcpAe3bSBp/my-ai-model-delta-compared-to-yudkowsky

    ---

    Narrated by TYPE III AUDIO.

  • (Cross-posted from Twitter.)



    My take on Leopold Aschenbrenner's new report: I think Leopold gets it right on a bunch of important counts.

    Three that I especially care about:

    1. Full AGI and ASI soon. (I think his arguments for this have a lot of holes, but he gets the basic point that superintelligence looks 5 or 15 years off rather than 50+.)
    2. This technology is an overwhelmingly huge deal, and if we play our cards wrong we're all dead.
    3. Current developers are indeed fundamentally unserious about the core risks, and need to make IP security and closure a top priority.

    I especially appreciate that the report seems to get it when it comes to our basic strategic situation: it gets that we may only be a few years away from a truly world-threatening technology, and it speaks very candidly about the implications of this, rather than soft-pedaling [...]

    ---

    First published:
    June 6th, 2024

    Source:
    https://www.lesswrong.com/posts/Yig9oa4zGE97xM2os/response-to-aschenbrenner-s-situational-awareness

    ---

    Narrated by TYPE III AUDIO.



  • Last month I posted about humming as a cheap and convenient way to flood your nose with nitric oxide (NO), a known antiviral. Alas, the economists were right, and the benefits were much smaller than I estimated.

    The post contained one obvious error and one complication. Both were caught by Thomas Kwa, for which he has my gratitude. When he initially pointed out the error I awarded him a $50 bounty; now that the implications are confirmed I’ve upped that to $250. In two weeks an additional $750 will go to either him or to whoever provides new evidence that causes me to retract my retraction.

    Humming produces much less nitric oxide than Enovid

    I found the dosage of NO in Enovid in a trial registration. Unfortunately I misread the dose: what I originally read as “0.11ppm NO/hour” was in fact “0.11ppm NO*hour”. I [...]
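
    To make the units distinction concrete, here is a small sketch with made-up numbers (none of them from the post or the trial registration): "ppm NO/hour" would describe a rate, while "ppm NO*hour" describes a time-integrated exposure, so the same nominal 0.11 means very different things under the two readings.

    ```python
    # Illustration of rate (ppm/hour) vs. integrated exposure (ppm*hours).
    # All numbers below are made up for clarity; none come from Enovid's trial data.

    def exposure_ppm_hours(concentration_ppm, duration_hours):
        """Time-integrated exposure for a constant concentration."""
        return concentration_ppm * duration_hours

    # A dose stated as 0.11 ppm*hours could be delivered in many ways, e.g.:
    print(exposure_ppm_hours(0.11, 1.0))      # 0.11 ppm held for a full hour -> ~0.11 ppm*hours
    print(exposure_ppm_hours(6.6, 1.0 / 60))  # 6.6 ppm held for one minute   -> ~0.11 ppm*hours

    # Reading the label as a rate ("0.11 ppm per hour") instead of an exposure
    # ("0.11 ppm*hours") therefore changes the implied dose substantially.
    ```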

    ---

    First published:
    June 6th, 2024

    Source:
    https://www.lesswrong.com/posts/dsZeogoPQbF8jSHMB/humming-is-not-a-free-usd100-bill

    ---

    Narrated by TYPE III AUDIO.

  • Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

    We are pleased to announce ILIAD — a 5-day conference bringing together 100+ researchers to build strong scientific foundations for AI alignment.

    ***Apply to attend by June 30!***

    When: Aug 28 - Sep 3, 2024
    Where: @Lighthaven (Berkeley, US)
    What: A mix of topic-specific tracks, and unconference style programming, 100+ attendees. Topics will include Singular Learning Theory, Agent Foundations, Causal Incentives, Computational Mechanics and more to be announced.
    Who: Currently confirmed speakers include: Daniel Murfet, Jesse Hoogland, Adam Shai, Lucius Bushnaq, Tom Everitt, Paul Riechers, Scott Garrabrant, John Wentworth, Vanessa Kosoy, Fernando Rosas and James Crutchfield.
    Costs: Tickets are free. Financial support is available on a needs basis. See our website here. For any questions, email [email protected]

    About ILIAD

    ILIAD is a 100+ person conference about alignment with a mathematical focus. The theme is ecumenical. [...]

    ---

    First published:
    June 5th, 2024

    Source:
    https://www.lesswrong.com/posts/r7nBaKy5Ry3JWhnJT/announcing-iliad-theoretical-ai-alignment-conference

    ---

    Narrated by TYPE III AUDIO.

  • Since at least 2017, OpenAI has asked departing employees to sign offboarding agreements which legally bind them to permanently—that is, for the rest of their lives—refrain from criticizing OpenAI, or from otherwise taking any actions which might damage its finances or reputation.[1]

    If they refused to sign, OpenAI threatened to take back (or make unsellable) all of their already-vested equity—a huge portion of their overall compensation, which often amounted to millions of dollars. Given this immense pressure, it seems likely that most employees signed.

    If they did sign, they became personally liable forevermore for any financial or reputational harm they later caused. This liability was unbounded, so had the potential to be financially ruinous—if, say, they later wrote a blog post critical of OpenAI, they might in principle be found liable for damages far in excess of their net worth.

    These extreme provisions allowed OpenAI to systematically silence criticism [...]

    The original text contained 4 footnotes which were omitted from this narration.

    ---

    First published:
    May 30th, 2024

    Source:
    https://www.lesswrong.com/posts/yRWv5kkDD4YhzwRLq/non-disparagement-canaries-for-openai

    ---

    Narrated by TYPE III AUDIO.

  • As we explained in our MIRI 2024 Mission and Strategy update, MIRI has pivoted to prioritize policy, communications, and technical governance research over technical alignment research. This follow-up post goes into detail about our communications strategy.

    The Objective: Shut it Down[1]

    Our objective is to convince major powers to shut down the development of frontier AI systems worldwide before it is too late. We believe that nothing less than this will prevent future misaligned smarter-than-human AI systems from destroying humanity. Persuading governments worldwide to take sufficiently drastic action will not be easy, but we believe this is the most viable path.

    Policymakers deal mostly in compromise: they form coalitions by giving a little here to gain a little somewhere else. We are concerned that most legislation intended to keep humanity alive will go through the usual political processes and be ground down into ineffective compromises.

    The only way we [...]

    The original text contained 2 footnotes which were omitted from this narration.

    ---

    First published:
    May 29th, 2024

    Source:
    https://www.lesswrong.com/posts/tKk37BFkMzchtZThx/miri-2024-communications-strategy

    ---

    Narrated by TYPE III AUDIO.

  • Previously: OpenAI: Exodus (contains links at top to earlier episodes), Do Not Mess With Scarlett Johansson

    We have learned more since last week. It's worse than we knew.

    How much worse? In which ways? With what exceptions?

    That's what this post is about.

    The Story So Far

    For years, employees who left OpenAI consistently had their vested equity explicitly threatened with confiscation and the lack of ability to sell it, and were given short timelines to sign documents or else. Those documents contained highly aggressive NDA and non-disparagement (and non-interference) clauses, including the NDA preventing anyone from revealing these clauses.

    No one knew about this until recently, because until Daniel Kokotajlo everyone signed, and then they could not talk about it. Then Daniel refused to sign, Kelsey Piper started reporting, and a lot came out.

    Here is Altman's statement from [...]

    ---

    First published:
    May 28th, 2024

    Source:
    https://www.lesswrong.com/posts/YwhgHwjaBDmjgswqZ/openai-fallout

    ---

    Narrated by TYPE III AUDIO.

  • Crossposted from AI Lab Watch. Subscribe on Substack.

    Introduction.

    Anthropic has an unconventional governance mechanism: an independent "Long-Term Benefit Trust" elects some of its board. Anthropic sometimes emphasizes that the Trust is an experiment, but mostly points to it to argue that Anthropic will be able to promote safety and benefit-sharing over profit.[1]

    But the Trust's details have not been published and some information Anthropic has shared is concerning. In particular, Anthropic's stockholders can apparently overrule, modify, or abrogate the Trust, and the details are unclear.

    Anthropic has not publicly demonstrated that the Trust would be able to actually do anything that stockholders don't like.

    The facts

    There are three sources of public information on the Trust:

    The Long-Term Benefit Trust (Anthropic 2023)
    Anthropic Long-Term Benefit Trust (Morley et al. 2023)
    The $1 billion gamble to ensure AI doesn't destroy humanity (Vox: Matthews 2023)

    They say there's [...]



    The original text contained 2 footnotes which were omitted from this narration.

    ---

    First published:
    May 27th, 2024

    Source:
    https://www.lesswrong.com/posts/sdCcsTt9hRpbX6obP/maybe-anthropic-s-long-term-benefit-trust-is-powerless

    ---

    Narrated by TYPE III AUDIO.


  • Introduction.

    If you are choosing to read this post, you've probably seen the image below depicting all the notifications students received on their phones during one class period. You probably saw it as a retweet of this tweet, or in one of Zvi's posts. Did you find this data plausible, or did you roll to disbelieve? Did you know that the image dates back to at least 2019? Does that fact make you more or less worried about the truth on the ground as of 2024?

    Last month, I performed an enhanced replication of this experiment in my high school classes. This was partly because we had a use for it, partly to model scientific thinking, and partly because I was just really curious. Before you scroll past the image, I want to give you a chance to mentally register your predictions. Did my average class match the [...]



    ---

    First published:
    May 26th, 2024

    Source:
    https://www.lesswrong.com/posts/AZCpu3BrCFWuAENEd/notifications-received-in-30-minutes-of-class

    ---

    Narrated by TYPE III AUDIO.


  • New blog: AI Lab Watch. Subscribe on Substack.

    Many AI safety folks think that METR is close to the labs, with ongoing relationships that grant it access to models before they are deployed. This is incorrect. METR (then called ARC Evals) did pre-deployment evaluation for GPT-4 and Claude 2 in the first half of 2023, but it seems to have had no special access since then.[1] Other model evaluators also seem to have little access before deployment.

    Frontier AI labs' pre-deployment risk assessment should involve external model evals for dangerous capabilities.[2] External evals can improve a lab's risk assessment and—if the evaluator can publish its results—provide public accountability.

    The evaluator should get deeper access than users will get.

    To evaluate threats from a particular deployment protocol, the evaluator should get somewhat deeper access than users will — then the evaluator's failure to elicit dangerous capabilities is stronger evidence [...]

    The original text contained 5 footnotes which were omitted from this narration.



    ---

    First published:
    May 24th, 2024

    Source:
    https://www.lesswrong.com/posts/WjtnvndbsHxCnFNyc/ai-companies-aren-t-really-using-external-evaluators

    ---

    Narrated by TYPE III AUDIO.


  • Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

    Part 13 of 12 in the Engineer's Interpretability Sequence.

    TL;DR

    On May 5, 2024, I made a set of 10 predictions about what the next sparse autoencoder (SAE) paper from Anthropic would and wouldn’t do. Today's new SAE paper from Anthropic was full of brilliant experiments and interesting insights, but it ultimately underperformed my expectations. I am beginning to be concerned that Anthropic's recent approach to interpretability research might be better explained by safety washing than practical safety work.

    Think of this post as a curt editorial instead of a technical piece. I hope to revisit my predictions and this post in light of future updates.

    Reflecting on predictions

    Please see my original post for 10 specific predictions about what today's paper would and wouldn’t accomplish. I think that Anthropic obviously did 1 and 2 [...]

    ---

    First published:
    May 21st, 2024

    Source:
    https://www.lesswrong.com/posts/pH6tyhEnngqWAXi9i/eis-xiii-reflections-on-anthropic-s-sae-research-circa-may

    ---

    Narrated by TYPE III AUDIO.


  • This is a quickly written opinion piece on what I understand about OpenAI. I first posted it to Facebook, where it had some discussion.



    Some arguments that OpenAI is making, simultaneously:

    1. OpenAI will likely reach and own transformative AI (useful for attracting talent to work there).
    2. OpenAI cares a lot about safety (good for public PR and government regulations).
    3. OpenAI isn’t making anything dangerous and is unlikely to do so in the future (good for public PR and government regulations).
    4. OpenAI doesn’t need to spend many resources on safety, and implementing safe AI won’t put it at any competitive disadvantage (important for investors who own most of the company).
    5. Transformative AI will be incredibly valuable for all of humanity in the long term (for public PR and developers).
    6. People at OpenAI have thought long and hard about what will happen, and it will be fine.

    We can’t [...]

    ---

    First published:
    May 21st, 2024

    Source:
    https://www.lesswrong.com/posts/cy99dCEiLyxDrMHBi/what-s-going-on-with-openai-s-messaging

    ---

    Narrated by TYPE III AUDIO.


  • Produced as part of the MATS Winter 2023-4 program, under the mentorship of @Jessica Rumbelow

    One-sentence summary: On a dataset of human-written essays, we find that gpt-3.5-turbo can accurately infer demographic information about the authors from just the essay text, and suspect it's inferring much more.


    Introduction.

    Every time we sit down in front of an LLM like GPT-4, it starts with a blank slate. It knows nothing[1] about who we are, other than what it knows about users in general. But with every word we type, we reveal more about ourselves -- our beliefs, our personality, our education level, even our gender. Just how clearly does the model see us by the end of the conversation, and why should that worry us?

    Like many, we were rather startled when @janus showed that gpt-4-base could identify @gwern by name, with 92% confidence, from a 300-word comment. If [...]
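
    As a rough sketch of the kind of probe the summary describes (not the authors' actual prompts or pipeline; the prompt wording, the requested fields, and the use of the OpenAI chat API here are my own assumptions), one can ask gpt-3.5-turbo to guess author attributes from a text sample:

    ```python
    # Hypothetical sketch of asking a chat model to guess author attributes from
    # an essay. The prompt and requested fields are mine, not the paper's protocol.
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    essay = "..."  # a human-written essay containing no explicit demographic details

    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "You infer likely author attributes from writing style alone."},
            {"role": "user",
             "content": "Based only on the essay below, give your best guess of the "
                        "author's age range, gender, and education level, with a "
                        "confidence for each.\n\n" + essay},
        ],
    )

    print(response.choices[0].message.content)
    ```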




    The original text contained 12 footnotes which were omitted from this narration.



    ---

    First published:
    May 17th, 2024

    Source:
    https://www.lesswrong.com/posts/dLg7CyeTE4pqbbcnp/language-models-model-us

    ---

    Narrated by TYPE III AUDIO.


  • This is a link post.

    to follow up my philanthropic pledge from 2020, i've updated my philanthropy page with 2023 results.

    in 2023 my donations funded $44M worth of endpoint grants ($43.2M excluding software development and admin costs) — exceeding my commitment of $23.8M (20k times $1190.03 — the minimum price of ETH in 2023).

    ---

    First published:
    May 20th, 2024

    Source:
    https://www.lesswrong.com/posts/bjqDQB92iBCahXTAj/jaan-tallinn-s-2023-philanthropy-overview

    ---

    Narrated by TYPE III AUDIO.

  • Previously: OpenAI: Facts From a Weekend, OpenAI: The Battle of the Board, OpenAI: Leaks Confirm the Story, OpenAI: Altman Returns, OpenAI: The Board Expands.

    Ilya Sutskever and Jan Leike have left OpenAI. This is almost exactly six months after Altman's temporary firing and The Battle of the Board, the day after the release of GPT-4o, and soon after a number of other recent safety-related OpenAI departures. Many others working on safety have also left recently. This is part of a longstanding pattern at OpenAI.

    Jan Leike later offered an explanation for his decision on Twitter. Leike asserts that OpenAI has lost the mission on safety and has become culturally increasingly hostile to it. He says the superalignment team was starved for resources, with its public explicit compute commitments dishonored, and that safety has been neglected on a widespread basis, not only superalignment but also addressing the safety [...]

    ---

    First published:
    May 20th, 2024

    Source:
    https://www.lesswrong.com/posts/ASzyQrpGQsj7Moijk/openai-exodus

    ---

    Narrated by TYPE III AUDIO.


  • FSF blogpost. Full document (just 6 pages; you should read it). Compare to Anthropic's RSP, OpenAI's RSP ("PF"), and METR's Key Components of an RSP.

    DeepMind's FSF has three steps:

    1. Create model evals for warning signs of "Critical Capability Levels"
       - Evals should have a "safety buffer" of at least 6x effective compute so that CCLs will not be reached between evals
       - They list 7 CCLs across "Autonomy, Biosecurity, Cybersecurity, and Machine Learning R&D"
         - E.g. "Autonomy level 1: Capable of expanding its effective capacity in the world by autonomously acquiring resources and using them to run and sustain additional copies of itself on hardware it rents"
    2. Do model evals every 6x effective compute and every 3 months of fine-tuning
       - This is an "aim," not a commitment
       - Nothing about evals during deployment
    3. "When a model reaches evaluation thresholds (i.e. passes a set of early warning evaluations), we [...]

    ---

    First published:
    May 18th, 2024

    Source:
    https://www.lesswrong.com/posts/y8eQjQaCamqdc842k/deepmind-s-frontier-safety-framework-is-weak-and-unambitious

    ---

    Narrated by TYPE III AUDIO.