Episódios
-
For the full show notes and links visit https://sub.thursdai.news
🔗 Subscribe to our show on Spotify: https://thursdai.news/spotify
🔗 Apple: https://thursdai.news/apple
Ho, ho, holy moly, folks! Alex here, coming to you live from a world where AI updates are dropping faster than Santa down a chimney! 🎅 It's been another absolutely BANANAS week in the AI world, and if you thought last week was wild, and we're due for a break, buckle up, because this one's a freakin' rollercoaster! 🎢
In this episode of ThursdAI, we dive deep into the recent innovations from OpenAI, including their 1-800 ChatGPT phone service and new advancements in voice mode and API functionalities. We discuss the latest updates on O1 model capabilities, including Reasoning Effort settings, and highlight the introduction of WebRTC support by OpenAI. Additionally, we explore the groundbreaking VEO2 model from Google, the generative physics engine Genesis, and new developments in open source models like Cohere's Command R7b. We also provide practical insights on using tools like Weights & Biases for evaluating AI models and share tips on leveraging GitHub Gigi. Tune in for a comprehensive overview of the latest in AI technology and innovation.
00:00 Introduction and OpenAI's 12 Days of Releases
00:48 Advanced Voice Mode and Public Reactions
01:57 Celebrating Tech Innovations
02:24 Exciting New Features in AVMs
03:08 TLDR - ThursdAI December 19
12:58 Voice and Audio Innovations
14:29 AI Art, Diffusion, and 3D
16:51 Breaking News: Google Gemini 2.0
23:10 Meta Apollo 7b Revisited
33:44 Google's Sora and Veo2
34:12 Introduction to Veo2 and Sora
34:59 First Impressions of Veo2
35:49 Comparing Veo2 and Sora
37:09 Sora's Unique Features
38:03 Google's MVP Approach
43:07 OpenAI's Latest Releases
44:48 Exploring OpenAI's 1-800 CHAT GPT
47:18 OpenAI's Fine-Tuning with DPO
48:15 OpenAI's Mini Dev Day Announcements
49:08 Evaluating OpenAI's O1 Model
54:39 Weights & Biases Evaluation Tool - Weave
01:03:52 ArcAGI and O1 Performance
01:06:47 Introduction and Technical Issues
01:06:51 Efforts on Desktop Apps
01:07:16 ChatGPT Desktop App Features
01:07:25 Working with Apps and Warp Integration
01:08:38 Programming with ChatGPT in IDEs
01:08:44 Discussion on Warp and Other Tools
01:10:37 GitHub GG Project
01:14:47 OpenAI Announcements and WebRTC
01:24:45 Modern BERT and Smaller Models
01:27:37 Genesis: Generative Physics Engine
01:33:12 Closing Remarks and Holiday Wishes
Here’s a talking podcast host speaking excitedly about his show
TL;DR - Show notes and Links
* Open Source LLMs
* Meta Apollo 7B – LMM w/ SOTA video understanding (Page, HF)
* Microsoft Phi-4 – 14B SLM (Blog, Paper)
* Cohere Command R 7B – (Blog)
* Falcon 3 – series of models (X, HF, web)
* IBM updates Granite 3.1 + embedding models (HF, Embedding)
* Big CO LLMs + APIs
* OpenAI releases new o1 + API access (X)
* Microsoft makes CoPilot Free! (X)
* Google - Gemini Flash 2 Thinking experimental reasoning model (X, Studio)
* This weeks Buzz
* W&B weave Playground now has Trials (and o1 compatibility) (try it)
* Alex Evaluation of o1 and Gemini Thinking experimental (X, Colab, Dashboard)
* Vision & Video
* Google releases Veo 2 – SOTA text2video modal - beating SORA by most vibes (X)
* HunyuanVideo distilled with FastHunyuan down to 6 steps (HF)
* Kling 1.6 (X)
* Voice & Audio
* OpenAI realtime audio improvements (docs)
* 11labs new Flash 2.5 model – 75ms generation (X)
* Nexa OmniAudio – 2.6B – multimodal local LLM (Blog)
* Moonshine Web – real time speech recognition in the browser (X)
* Sony MMAudio - open source video 2 audio model (Blog, Demo)
* AI Art & Diffusion & 3D
* Genesys – open source generative 3D physics engine (X, Site, Github)
* Tools
* CerebrasCoder – extremely fast apps creation (Try It)
* RepoPrompt to chat with o1 Pro – (download)
This is a public episode. If you’d like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe -
Hey folks, Alex here, writing this from the beautiful Vancouver BC, Canada. I'm here for NeurIPS 2024, the biggest ML conferences of the year, and let me tell you, this was one hell of a week to not be glued to the screen.
After last week banger week, with OpenAI kicking off their 12 days of releases, with releasing o1 full and pro mode during ThursdAI, things went parabolic. It seems that all the AI labs decided to just dump EVERYTHING they have before the holidays? 🎅
A day after our show, on Friday, Google announced a new Gemini 1206 that became the #1 leading model on LMarena and Meta released LLama 3.3, then on Saturday Xai releases their new image model code named Aurora.
On a regular week, the above Fri-Sun news would be enough for a full 2 hour ThursdAI show on it's own, but not this week, this week this was barely a 15 minute segment 😅 because so MUCH happened starting Monday, we were barely able to catch our breath, so lets dive into it!
As always, the TL;DR and full show notes at the end 👇 and this newsletter is sponsored by W&B Weave, if you're building with LLMs in production, and want to switch to the new Gemini 2.0 today, how will you know if your app is not going to degrade? Weave is the best way! Give it a try for free.
Gemini 2.0 Flash - a new gold standard of fast multimodal LLMs
Google has absolutely taken the crown away from OpenAI with Gemini 2.0 believe it or not this week with this incredible release. All of us on the show were in agreement that this is a phenomenal release from Google for the 1 year anniversary of Gemini.
Gemini 2.0 Flash is beating Pro 002 and Flash 002 on all benchmarks, while being 2x faster than Pro, having 1M context window, and being fully multimodal!
Multimodality on input and output
This model was announced to be fully multimodal on inputs AND outputs, which means in can natively understand text, images, audio, video, documents and output text, text + images and audio (so it can speak!). Some of these capabilities are restricted for beta users for now, but we know they exists. If you remember project Astra, this is what powers that project. In fact, we had Matt Wolfe join the show, and he demoed had early access to Project Astra and demoed it live on the show (see above) which is powered by Gemini 2.0 Flash.
The most amazing thing is, this functionality, that was just 8 months ago, presented to us in Google IO, in a premium Booth experience, is now available to all, in Google AI studio, for free!
Really, you can try out right now, yourself at https://aistudio.google.com/live but here's a demo of it, helping me proof read this exact paragraph by watching the screen and talking me through it.
Performance out of the box
This model beating Sonnet 3.5 on Swe-bench Verified completely blew away the narrative on my timeline, nobody was ready for that. This is a flash model, that's outperforming o1 on code!?
So having a Flash MMIO model with 1M context that is accessible via with real time streaming option available via APIs from the release time is honestly quite amazing to begin with, not to mention that during the preview phase, this is currently free, but if we consider the previous prices of Flash, this model is going to considerably undercut the market on price/performance/speed matrix.
You can see why this release is taking the crown this week. 👏
Agentic is coming with Project Mariner
An additional thing that was announced by Google is an Agentic approach of theirs is project Mariner, which is an agent in the form of a Chrome extension completing webtasks, breaking SOTA on the WebVoyager with 83.5% score with a single agent setup.
We've seen agents attempts from Adept to Claude Computer User to Runner H, but this breaking SOTA from Google seems very promising. Can't wait to give this a try.
OpenAI gives us SORA, Vision and other stuff from the bag of goodies
Ok so now let's talk about the second winner of this week, OpenAI amazing stream of innovations, which would have taken the crown, if not for, well... ☝️
SORA is finally here (for those who got in)
Open AI has FINALLY released SORA, their long promised text to video and image to video (and video to video) model (nee, world simulator) to general availability, including a new website - sora.com and a completely amazing UI to come with it.
SORA can generate images of various quality from 480p up to 1080p and up to 20 seconds long, and they promised that those will be generating fast, as what they released is actually SORA turbo! (apparently SORA 2 is already in the works and will be even more amazing, more on this later)
New accounts paused for now
OpenAI seemed to have severely underestimated how many people would like to generate the 50 images per month allowed on the plus account (pro account gets you 10x more for $200 + longer durations whatever that means), and since the time of writing these words on ThursdAI afternoon, I still am not able to create a sora.com account and try out SORA myself (as I was boarding a plane when they launched it)
SORA magical UI
I've invited one of my favorite video creators, Blaine Brown to the show, who does incredible video experiments, that always go viral, and had time to play with SORA to tell us what he thinks both from a video perspective and from a interface perspective.
Blaine had a great take that we all collectively got so much HYPE over the past 8 months of getting teased, that many folks expected SORA to just be an incredible text to video 1 prompt to video generator and it's not that really, in fact, if you just send prompts, it's more like a slot machine (which is also confirmed by another friend of the pod Bilawal)
But the magic starts to come when the additional tools like blend are taken into play. One example that Blaine talked about is the Remix feature, where you can Remix videos and adjust the remix strength (Strong, Mild)
Another amazing insight Blaine shared is a that SORA can be used by fusing two videos that were not even generated with SORA, but SORA is being used as a creative tool to combine them into one.
And lastly, just like Midjourney (and StableDiffusion before that), SORA has a featured and a recent wall of video generations, that show you videos and prompts that others used to create those videos with, for inspiration and learning, so you can remix those videos and learn to prompt better + there are prompting extension tools that OpenAI has built in.
One more thing.. this model thinks
I love this discovery and wanted to share this with you, the prompt is "A man smiles to the camera, then holds up a sign. On the sign, there is only a single digit number (the number of 'r's in 'strawberry')"
Advanced Voice mode now with Video!
I personally have been waiting for Voice mode with Video for such a long time, since the that day in the spring, where the first demo of advanced voice mode talked to an OpenAI employee called Rocky, in a very flirty voice, that in no way resembled Scarlet Johannson, and told him to run a comb through his hair.
Well today OpenAI have finally announced that they are rolling out this option soon to everyone, and in chatGPT, we'll all going to have the camera button, and be able to show chatGPT what we're seeing via camera or the screen of our phone and have it have the context.
If you're feeling a bit of a deja-vu, yes, this is very similar to what Google just launched (for free mind you) with Gemini 2.0 just yesterday in AI studio, and via APIs as well.
This is an incredible feature, it will not only see your webcam, it will also see your IOS screen, so you’d be able to reason about an email with it, or other things, I honestly can’t wait to have it already!
They also announced Santa mode, which is also super cool, tho I don’t quite know how to .. tell my kids about it? Do I… tell them this IS Santa? Do I tell them this is an AI pretending to be Santa? Where is the lie end exactly?
And in one of his funniest jailbreaks (and maybe one of the toughest ones) Pliny the liberator just posted a Santa jailbreak that will definitely make you giggle (and him get Coal this X-mas)
The other stuff (with 6 days to go)
OpenAI has 12 days of releases, and the other amazing things we got obviously got overshadowed but they are still cool, Canvas can now run code and have custom GPTs, GPT in Apple Intelligence is now widely supported with the public release of iOS 18.2 and they have announced fine tuning with reinforcement learning, allowing to funetune o1-mini to outperform o1 on specific tasks with a few examples.
There's 6 more work days to go, and they promised to "end with a bang" so... we'll keep you updated!
This weeks Buzz - Guard Rail Genie
Alright, it's time for "This Week's Buzz," our weekly segment brought to you by Weights & Biases! This week I hosted Soumik Rakshit from the Weights and Biases AI Team (The team I'm also on btw!).
Soumik gave us a deep dive into Guardrails, our new set of features in Weave for ensuring reliability in GenAI production! Guardrails serve as a "safety net" for your LLM powered applications, filtering out inputs or llm responses that trigger a certain criteria or boundary.
Types of guardrails include prompt injection attacks, PII leakage, jailbreaking attempts and toxic language as well, but can also include a competitor mention, or selling a product at $0 or a policy your company doesn't have.
As part of developing the guardrails Soumik also developed and open sourced an app to test prompts against those guardrails "Guardrails Genie" and we're going to host it to allow folks to test their prompts against our guardrails, and also are developing it and the guardrails in the open so please check out our Github
Apple iOS 18.2 Apple Intelligence + ChatGPT integration
Apple Intelligence is finally here, you can download it if you have iPhone 15 pro and pro Max and iPhone 16 all series.
If you have one of those phones, you will get the following new additional features that have been in Beta for a while, features like Image Playground with the ability to create images based on your face or faces that you have stored in your photo library.
You can also create GenMoji and those are actually pretty cool!
The highlight and the connection with OpenAI's release is of course the ChatGPT integration, where in if Siri is too dumdum to answer any real AI questions, and let's face it, it's most of the time, a user will get a button and chatGPT will take over upon user approval. This will not require an account!
Grok New Image Generation Codename "Aurora"
Oh, Space Uncle is back at it again! The team at XAI launched its image generation model with the codename "Aurora" and briefly made it public only to pull it and launch it again (this time, the model is simply "Grok"). Apparently, they've trained their own image model from scratch in like three months but they pulled it back a day after, I think because they forgot to add watermarks 😅 but it's still unconfirmed why the removal occurred in the first place, Regardless of the reason, many folks, such as Wolfram, found it was not on the same level as their Flux integration.
It is really good at realism and faces, and is really unrestricted in terms of generating celebrities or TV shows form the 90's or cartoons. They really don't care about copyright.
The model however does appear to generate fairly realistic images with its autoregressive model approach where generation occurs pixel-by-pixel instead of diffusion. But as I said on the show "It's really hard to get a good sense for the community vibe about anything that Elon Musk does because there's so much d**k riding on X for Elon Musk..." Many folks post only positive things on anything X or Xai does in the hopes that space uncle will notice them or reposts them, it's really hard to get an honest "vibes check" on Xai stuff.
All jokes aside we'll hopefully have some better comparisons on sites such as image LmArena who just today launched ImgArena but until that day comes we'll just have to wait and see what other new iterations and announcements follow!
NeurIPS Drama: Best Paper Controversy!
Now, no week in AI would be complete without a little drama. This time around it’s with the biggest machine learning engineering conference of the year, NeurIPS. This year's "Best Paper" award went to a work entitled Visual Auto Aggressive Modeling (VAR). This paper apparently introduced an innovative way to outperform traditional diffusion models when it comes to image generation! Great right? well not so fast because here’s where things get spicy. This is where Keyu Tian comes in, the main author of this work and a former intern of ByteDance who are getting their fair share of the benefits with their co-signing on the paper but their lawsuit may derail its future. ByteDance is currently suing Keyu Tian for a whopping one million dollars citing alleged sabotage on the work in a coordinated series of events that compromised other colleagues work.
Specifically, according to some reports "He modified source code to changes random seeds and optimizes which, uh, lead to disrupting training processes...Security attacks. He gained unauthorized access to the system. Login backdoors to checkpoints allowing him to launch automated attacks that interrupted processes to colleagues training jobs." Basically, they believe that he "gained unauthorized access to the system" and hacked other systems. Now the paper is legit and it introduces potentially very innovative solutions but we have an ongoing legal situation. Also to note is despite firing him they did not withdraw the paper which could speak volumes to its future! As always, if it bleeds, it leads and drama is usually at the top of the trends, so definitely a story that will stay in everyone's mind when they look back at NeurIPS this year.
Phew.. what a week folks, what a week!
I think with 6 more days of OpenAI gifts, there's going to be plenty more to come next week, so share this newsletter with a friend or two, and if you found this useful, consider subscribing to our other channels as well and checkout Weave if you've building with GenAI, it's really helpful!
TL;DR and show notes
* Meta llama 3.3 (X, Model Card)
* OpenAI 12 days of Gifts (Blog)
* Apple ios 18.2 - Image Playground, GenMoji, ChatGPT integration (X)
* 🔥 Google Gemini 2.0 Flash - the new gold standard of LLMs (X, AI Studio)
* Google Project Mariner - Agent that browsers for you (X)
* This weeks Buzz - chat with Soumik Rakshit from AI Team at W&B (Github)
* NeurIPS Drama - Best Paper Controversy - VAR author is sued by ByteDance (X, Blog)
* Xai new image generation codename Aurora (Blog)
* Cognition launched Devin AI developer assistant - $500/mo
* LMArena launches txt2img Arena for Diffusion models (X)
This is a public episode. If you’d like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe -
Estão a faltar episódios?
-
Well well well, December is finally here, we're about to close out this year (and have just flew by the second anniversary of chatGPT 🎂) and it seems that all of the AI labs want to give us X-mas presents to play with over the holidays!
Look, I keep saying this, but weeks are getting crazier and crazier, this week we got the cheapest and the most expensive AI offerings all at once (the cheapest from Amazon and the most expensive from OpenAI), 2 new open weights models that beat commercial offerings, a diffusion model that predicts the weather and 2 world building models, oh and 2 decentralized fully open sourced LLMs were trained across the world LIVE and finished training. I said... crazy week!
And for W&B, this week started with Weave launching finally in GA 🎉, which I personally was looking forward for (read more below)!
TL;DR Highlights
* OpenAI O1 & Pro Tier: O1 is out of preview, now smarter, faster, multimodal, and integrated into ChatGPT. For heavy usage, ChatGPT Pro ($200/month) offers unlimited calls and O1 Pro Mode for harder reasoning tasks.
* Video & Audio Open Source Explosion: Tencent’s HYVideo outperforms Runway and Luma, bringing high-quality video generation to open source. Fishspeech 1.5 challenges top TTS providers, making near-human voice available for free research.
* Open Source Decentralization: Nous Research’s DiStRo (15B) and Prime Intellect’s INTELLECT-1 (10B) prove you can train giant LLMs across decentralized nodes globally. Performance is on par with centralized setups.
* Google’s Genie 2 & WorldLabs: Generating fully interactive 3D worlds from a single image, pushing boundaries in embodied AI and simulation. Google’s GenCast also sets a new standard in weather prediction, beating supercomputers in accuracy and speed.
* Amazon’s Nova FMs: Cheap, scalable LLMs with huge context and global language coverage. Perfect for cost-conscious enterprise tasks, though not top on performance.
* 🎉 Weave by W&B: Now in GA, it’s your dashboard and tool suite for building, monitoring, and scaling GenAI apps. Get Started with 1 line of code
OpenAI’s 12 Days of Shipping: O1 & ChatGPT Pro
The biggest splash this week came from OpenAI. They’re kicking off “12 days of launches,” and Day 1 brought the long-awaited full version of o1. The main complaint about o1 for many people is how slow it was! Well, now it’s not only smarter but significantly faster (60% faster than preview!), and officially multimodal: it can see images and text together.
Better yet, OpenAI introduced a new ChatGPT Pro tier at $200/month. It offers unlimited usage of o1, advanced voice mode, and something called o1 pro mode — where o1 thinks even harder and longer about your hardest math, coding, or science problems. For power users—maybe data scientists, engineers, or hardcore coders—this might be a no-brainer. For others, 200 bucks might be steep, but hey, someone’s gotta pay for those GPUs. Given that OpenAI recently confirmed that there are now 300 Million monthly active users on the platform, and many of my friends already upgraded, this is for sure going to boost the bottom line at OpenAI!
Quoting Sam Altman from the stream, “This is for the power users who push the model to its limits every day.” For those who complained o1 took forever just to say “hi,” rejoice: trivial requests will now be answered quickly, while super-hard tasks get that legendary deep reasoning including a new progress bar and a notification when a task is complete. Friend of the pod Ray Fernando gave pro a prompt that took 7 minutes to think through!
I've tested the new o1 myself, and while I've gotten dangerously close to my 50 messages per week quota, I've gotten some incredible results already, and very fast as well. This ice-cubes question failed o1-preview and o1-mini and it took both of them significantly longer, and it took just 4 seconds for o1.
Open Source LLMs: Decentralization & Transparent Reasoning
Nous Research DiStRo & DeMo Optimizer
We’ve talked about decentralized training before, but the folks at Nous Research are making it a reality at scale. This week, Nous Research wrapped up the training of a new 15B-parameter LLM—codename “Psyche”—using a fully decentralized approach called “Nous DiStRo.” Picture a massive AI model trained not in a single data center, but across GPU nodes scattered around the globe. According to Alex Volkov (host of ThursdAI), “This is crazy: they’re literally training a 15B param model using GPUs from multiple companies and individuals, and it’s working as well as centralized runs.”
The key to this success is “DeMo” (Decoupled Momentum Optimization), a paper co-authored by none other than Diederik Kingma (yes, the Kingma behind Adam optimizer and VAEs). DeMo drastically reduces communication overhead and still maintains stability and speed. The training loss curve they’ve shown looks just as good as a normal centralized run, proving that decentralized training isn’t just a pipe dream. The code and paper are open source, and soon we’ll have the fully trained Psyche model. It’s a huge step toward democratizing large-scale AI—no more waiting around for Big Tech to drop their weights. Instead, we can all chip in and train together.
Prime Intellect INTELLECT-1 10B: Another Decentralized Triumph
But wait, there’s more! Prime Intellect also finished training their 10B model, INTELLECT-1, using a similar decentralized setup. INTELLECT-1 was trained with a custom framework that reduces inter-GPU communication by 400x. It’s essentially a global team effort, with nodes from all over the world contributing compute cycles.
The result? A model hitting performance similar to older Meta models like Llama 2—but fully decentralized.
Ruliad DeepThought 8B: Reasoning You Can Actually See
If that’s not enough, we’ve got yet another open-source reasoning model: Ruliad’s DeepThought 8B. This 8B parameter model (finetuned from LLaMA-3.1) from friends of the show FarEl, Alpin and Sentdex 👏
Ruliad’s DeepThought attempts to match or exceed performance of much larger models in reasoning tasks (beating several 72B param models while being 8B itself) is very impressive.
Google is firing on all cylinders this week
Google didn't stay quiet this week as well, and while we all wait for the Gemini team to release the next Gemini after the myriad of very good experimental models recently, we've gotten some very amazing things this week.
Google’s PaliGemma 2 - finetunable SOTA VLM using Gemma
PaliGemma v2, a new vision-language family of models (3B, 10B and 33B) for 224px, 448px, 896px resolutions are a suite of base models, that include image segmentation and detection capabilities and are great at OCR which make them very versatile for fine-tuning on specific tasks.
They claim to achieve SOTA on chemical formula recognition, music score recognition, spatial reasoning, and chest X-ray report generation!
Google GenCast SOTA weather prediction with... diffusion!?
More impressively, Google DeepMind released GenCast, a diffusion-based model that beats the state-of-the-art ENS system in 97% of weather predictions. Did we say weather predictions? Yup.
Generative AI is now better at weather forecasting than dedicated physics based deterministic algorithms running on supercomputers. Gencast can predict 15 days in advance in just 8 minutes on a single TPU v5, instead of hours on a monstrous cluster. This is mind-blowing. As Yam said on the show, “Predicting the world is crazy hard” and now diffusion models handle it with ease.
W&B Weave: Observability, Evaluation and Guardrails now in GA
Speaking of building and monitoring GenAI apps, we at Weights & Biases (the sponsor of ThursdAI) announced that Weave is now GA. Weave is a developer tool for evaluating, visualizing, and debugging LLM calls in production. If you’re building GenAI apps—like a coding agent or a tool that processes thousands of user requests—Weave helps you track costs, latency, and quality systematically.
We showcased two internal apps: Open UI (a website builder from a prompt) and Winston (an AI agent that checks emails, Slack, and more). Both rely on Weave to iterate, tune prompts, measure user feedback, and ensure stable performance. With O1 and other advanced models coming to APIs soon, tools like Weave will be crucial to keep those applications under control.
If you follow this newsletter and develop with LLMs, now is a great way to give Weave a try
Open Source Audio & Video: Challenging Proprietary Models
Tencent’s HY Video: Beating Runway & Luma in Open Source
Tencent came out swinging with their open-source model, HYVideo. It’s a video model that generates incredible realistic footage, camera cuts, and even audio—yep, Foley and lip-synced character speech. Just a single model doing text-to-video, image-to-video, puppeteering, and more. It even outperforms closed-source giants like Runway Gen 3 and Luma 1.6 on over 1,500 prompts.
This is the kind of thing we dreamed about when we first heard of video diffusion models. Now it’s here, open-sourced, ready for tinkering. “It’s near SORA-level,” as I mentioned, referencing OpenAI’s yet-to-be-fully-released SORA model. The future of generative video just got more accessible, and competitors should be sweating right now. We may just get SORA as one of the 12 days of OpenAI releases!
FishSpeech 1.5: Open Source TTS Rivaling the Big Guns
Not just video—audio too. FishSpeech 1.5 is a multilingual, zero-shot voice cloning model that ranks #2 overall on TTS benchmarks, just behind 11 Labs. This is a 500M-parameter model, trained on a million hours of audio, achieving near-human quality, fast inference, and open for research.
This puts high-quality text-to-speech capabilities in the open-source community’s hands. You can now run a top-tier TTS system locally, clone voices, and generate speech in multiple languages with low latency. No more relying solely on closed APIs. This is how open-source chases—and often catches—commercial leaders.
If you’ve been longing for near-instant voice cloning on your own hardware, this is the model to go play with!
Creating World Models: Genie 2 & WorldLabs
Fei Fei Li’s WorldLabs: Images to 3D Worlds
WorldLabs, founded by Dr. Fei Fei Li, showcased a mind-boggling demo: turning a single image into a walkable 3D environment. Imagine you take a snapshot of a landscape, load it into their system, and now you can literally walk around inside that image as if it were a scene in a video game. “I can literally use WASD keys and move around,” Alex commented, clearly impressed.
It’s not perfect fidelity yet, but it’s a huge leap toward generating immersive 3D worlds on the fly. These tools could revolutionize virtual reality, gaming, and simulation training. WorldLabs’ approach is still in early stages, but what they demonstrated is nothing short of remarkable.
Google’s Genie 2: Playable Worlds from a Single Image
If WorldLabs’s 3D environment wasn’t enough, Google dropped Genie 2. Take an image generated by Imagen 3, feed it into Genie 2, and you get a playable world lasting up to a minute. Your character can run, objects have physics, and the environment is consistent enough that if you leave an area and return, it’s still there.
As I said on the pod, “It looks like a bit of Doom, but generated from a single static image. Insane!” The model simulates complex interactions—think water flowing, balloons bursting—and even supports long-horizon memory. This could be a goldmine for AI-based game development, rapid prototyping, or embodied agent training.
Amazon’s Nova: Cheaper LLMs, Not Better LLMs
Amazon is also throwing their hat in the ring with the Nova series of foundational models. They’ve got variants like Nova Micro, Lite, Pro, and even a Premier tier coming in 2025. The catch? Performance is kind of “meh” compared to Anthropic or OpenAI’s top models, but Amazon is aiming to be the cheapest high-quality LLM among the big players. With a context window of up to 300K tokens and 200+ language coverage, Nova could find a niche, especially for those who want to pay less per million tokens.
Nova Micro costs around 3.5 cents per million input tokens and 14 cents per million output tokens—making it dirt cheap to process massive amounts of data. Although not a top performer, Amazon’s approach is: “We may not be best, but we’re really cheap and we scale like crazy.” Given Amazon’s infrastructure, this could be compelling for enterprises looking for cost-effective large-scale solutions.
Phew, this was a LONG week with a LOT of AI drops, and NGL, o1 actually helped me a bit for this newsletter, I wonder if you can spot the places where o1 wrote some of the text using a the transcription of the show and the outline as guidelines and the previous newsletter as a tone guide and where I wrote it myself?
Next week, NEURIPS 2024, the biggest ML conference in the world, I'm going to be live streaming from there, so if you're at the conference, come by booth #404 and say hi! I'm sure there will be a TON of new AI updates next week as well!
Show Notes & Links
TL;DR of all topics covered:
* This weeks Buzz
* Weights & Biases announces Weave is now in GA 🎉(wandb.me/tryweave)
* Tracing LLM calls
* Evaluation & Playground
* Human Feedback integration
* Scoring & Guardrails (in preview)
* Open Source LLMs
* DiStRo & DeMo from NousResearch - decentralized DiStRo 15B run (X, watch live, Paper)
* Prime Intellect - INTELLECT-1 10B decentralized LLM (Blog, watch)
* Ruliad DeepThoutght 8B - Transparent reasoning model (LLaMA-3.1) w/ test-time compute scaling (X, HF, Try It)
* Google GenCast - diffusion model SOTA weather prediction (Blog)
* Google open sources PaliGemma 2 (X, Blog)
* Big CO LLMs + APIs
* Amazon announces Nova series of FM at AWS (X)
* Google GENIE 2 creates playable worlds from a picture! (Blog)
* OpenAI 12 days started with o1 full and o1 pro and pro tier $200/mo (X, Blog)
* Vision & Video
* Tencent open sources HY Video - beating Luma & Runway (Blog, Github, Paper, HF)
* Runway video keyframing prototype (X)
* Voice & Audio
* FishSpeech V1.5 - multilingual, zero-shot instant voice cloning, low-latency, open text to speech model (X, Try It)
* Eleven labs - real time audio agents builder (X)
This is a public episode. If you’d like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe -
Hey ya'll, Happy Thanskgiving to everyone who celebrates and thank you for being a subscriber, I truly appreciate each and every one of you!
We had a blast on today's celebratory stream, especially given that today's "main course" was the amazing open sourcing of a reasoning model from Qwen, and we had Junyang Lin with us again to talk about it! First open source reasoning model that you can run on your machine, that beats a 405B model, comes close to o1 on some metrics 🤯
We also chatted about a new hybrid approach from Nvidia called Hymba 1.5B (Paper, HF) that beats Qwen 1.5B with 6-12x less training, and Allen AI releasing Olmo 2, which became the best fully open source LLM 👏 (Blog, HF, Demo), though they didn't release WandB logs this time, they did release data!
I encourage you to watch todays show (or listen to the show, I don't judge), there's not going to be a long writeup like I usually do, as I want to go and enjoy the holiday too, but of course, the TL;DR and show notes are right here so you won't miss a beat if you want to use the break to explore and play around with a few things!
ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.
TL;DR and show notes
* Qwen QwQ 32B preview - the first open weights reasoning model (X, Blog, HF, Try it)
* Allen AI - Olmo 2 the best fully open language model (Blog, HF, Demo)
* NVIDIA Hymba 1.5B - Hybrid smol model beating Qwen, SmolLM w/ 6-12x less training (X, Paper, HF)
* Big CO LLMs + APIs
* Anthropic MCP - model context protocol (X,Blog, Spec, Explainer)
* Cursor, Jetbrains now integrate with ChatGPT MacOS app (X)
* Xai is going to be a Gaming company?! (X)
* H company shows Runner H - WebVoyager Agent (X, Waitlist)
* This weeks Buzz
* Interview w/ Thomas Cepelle about Weave scorers and guardrails (Guide)
* Vision & Video
* OpenAI SORA API was "leaked" on HuggingFace (here)
* Runway launches video Expand feature (X)
* Rhymes Allegro-TI2V - updated image to video model (HF)
* Voice & Audio
* OuteTTS v0.2 - 500M smol TTS with voice cloning (Blog, HF)
* AI Art & Diffusion & 3D
* Runway launches an image model called Frames (X, Blog)
* ComfyUI Desktop app was released 🎉
* Chat
* 24 hours of AI hate on 🦋 (thread)
* Tools
* Cursor agent (X thread)
* Google Generative Chess toy (Link)
See you next week and happy Thanks Giving 🦃
Thanks for reading ThursdAI - Recaps of the most high signal AI weekly spaces! This post is public so feel free to share it.
Full Subtitles for convenience
[00:00:00] Alex Volkov: let's get it going.
[00:00:10] Alex Volkov: Welcome, welcome everyone to ThursdAI November 28th Thanksgiving special. My name is Alex Volkov. I'm an AI evangelist with Weights Biases. You're on ThursdAI. We are live [00:00:30] on ThursdAI. Everywhere pretty much.
[00:00:32] Alex Volkov:
[00:00:32] Hosts and Guests Introduction
[00:00:32] Alex Volkov: I'm joined here with two of my co hosts.
[00:00:35] Alex Volkov: Wolfram, welcome.
[00:00:36] Wolfram Ravenwolf: Hello everyone! Happy Thanksgiving!
[00:00:38] Alex Volkov: Happy Thanksgiving, man.
[00:00:39] Alex Volkov: And we have Junyang here. Junyang, welcome, man.
[00:00:42] Junyang Lin: Yeah, hi everyone. Happy Thanksgiving. Great to be here.
[00:00:46] Alex Volkov: You had a busy week. We're going to chat about what you had. I see Nisten joining us as well at some point.
[00:00:51] Alex Volkov: Yam pe joining us as well. Hey, how, Hey Yam. Welcome. Welcome, as well. Happy Thanksgiving. It looks like we're assembled folks. We're across streams, across [00:01:00] countries, but we are.
[00:01:01] Overview of Topics for the Episode
[00:01:01] Alex Volkov: For November 28th, we have a bunch of stuff to talk about. Like really a big list of stuff to talk about. So why don't we just we'll just dive in. We'll just dive in. So obviously I think the best and the most important.
[00:01:13] DeepSeek and Qwen Open Source AI News
[00:01:13] Alex Volkov: Open source kind of AI news to talk about this week is going to be, and I think I remember last week, Junyang, I asked you about this and you were like, you couldn't say anything, but I asked because last week, folks, if you remember, we talked about R1 from DeepSeek, a reasoning model from [00:01:30] DeepSeek, which really said, Oh, maybe it comes as a, as open source and maybe it doesn't.
[00:01:33] Alex Volkov: And I hinted about, and I asked, Junyang, what about some reasoning from you guys? And you couldn't say anything. so this week. I'm going to do a TLDR. So we're going to actually talk about the stuff that, you know, in depth a little bit later, but this week, obviously one of the biggest kind of open source or sorry, open weights, and news is coming from our friends at Qwen as well, as we always celebrate.
[00:01:56] Alex Volkov: So one of the biggest things that we get as. [00:02:00] is, Qwen releases, I will actually have you tell me what's the pronunciation here, Junaid, what is, I say Q W Q or maybe quick, what is the pronunciation of this?
[00:02:12] Junyang Lin: I mentioned it in the blog, it is just like the word quill. Yeah. yeah, because for the qw you can like work and for the q and you just like the U, so I just combine it together and create a new pronunciation called Quill.
[00:02:28] Junyang Lin: Yeah.
[00:02:28] Alex Volkov: So we're saying it's Quin [00:02:30] Quill 32 B. Is that the right pronunciation to say this?
[00:02:33] Junyang Lin: Yeah, it's okay. I would just call it qui quill. It is, some something funny because,the ca the characters look very funny. Oh, we have a subculture,for these things. Yeah. Just to express some, yeah.
[00:02:46] Junyang Lin: our. feelings.
[00:02:49] Alex Volkov: Amazing. Qwen, Quill, 32B, and it's typed,the name is typed QWQ, 32Breview. This is the first OpenWeights reasoning model. This [00:03:00] model is not only predicting tokens, it's actually doing reasoning behind this. What this means is we're going to tell you what this means after we get to this.
[00:03:07] Alex Volkov: So we're still in the, we're still in the TLDR area. We also had. Another drop from Alien Institute for AI, if you guys remember last week we chatted with Nathan, our dear friend Nathan, from Alien Institute about 2. 0. 3, about their efforts for post training, and he gave us all the details about post training, so they released 2.
[00:03:28] Alex Volkov: 0. 3, this week they released Olmo 2. [00:03:30] 0. We also talked about Olmo with the friends from Alien Institute a couple of months ago, and now they released Olmo 2. 0. Which they claim is the best fully open sourced, fully open sourced language models, from Allen Institute for AI.and, so we're going to chat about, Olmo a little bit as well.
[00:03:46] Alex Volkov: And last minute addition we have is NVIDIA Haimba, which is a hybrid small model from NVIDIA, very tiny one, 1. 5 billion parameters. small model building Qwen and building small LLM as well. this is in the area [00:04:00] of open source. I
[00:04:01] Alex Volkov: Okay, in the big companies, LLMs and APIs, I want to run through a few things.
[00:04:06] Anthropic's MCP and ChatGPT macOS Integrations
[00:04:06] Alex Volkov: So first of all, Anthropic really something called MCP. It's a, something they called Model Context Protocol. We're going to briefly run through this. It's a, it's a kind of a release from them that's aimed for developers is a protocol that enables secure connections between a host application, like a cloud desktop, for example,
[00:04:24] Alex Volkov: there's also a bunch of new integrations for the ChatGPT macOS app. If you guys remember a couple of [00:04:30] weeks ago, We actually caught this live.
[00:04:31] Alex Volkov: I refreshed my MacOS app and there's ta da, there's a new thing. And we discovered this live. It was very fun. The MacOS app for ChatGPT integrates with VS Code, et cetera. and so we tried to run this with Cursor. It didn't work. So now it works with Cursor,
[00:04:43] Wolfram Ravenwolf:
[00:04:43] Alex Volkov: So the next thing we're going to look at, I don't know if it's worth mentioning, but you guys know the XAI, the company that Elon Musk is raising another 6 billion for that tries to compete with OpenAI
[00:04:54] Alex Volkov: Do you guys hear that it's going to be a gaming company as well? I don't know if it's worth talking about, but we'll at least [00:05:00] mention this. And the one thing that I wanted to chat about is H, the French company, H that showed a runner that looks. Three times as fast and as good as the Claude computer use runner, and we're definitely going to show examples of this, video live because that looks just incredible.
[00:05:18] Alex Volkov: this out of nowhere company, the biggest fundraise or the biggest seed round that Europe has ever seen, at least French has ever seen, just show they, An agent that controls your [00:05:30] computer that's tiny, ridiculously tiny, I think it's like the three billion parameter, two billion parameter or something.
[00:05:36] Alex Volkov: And it runs way better than computer, cloud computer use. Something definitely worth talking about. after with, after which in this week's Bars, we're going to talk with Thomas Capelli, from, from my team at Weights Biases. about LLM guardrails, that's gonna be fun. and in vision video category, we're gonna cover that OpenAI Sora quote unquote leaked, this week.
[00:05:56] Alex Volkov: And this leak wasn't really a leak, but, definitely [00:06:00] we saw some stuff. and then there's also a new expand feature that we saw in, Runway. And we saw another video model from, Rhymes called Allegro TIV2. which is pretty cool in voice and audio. If we get there in voice and audio, we saw out TTS vision 0.
[00:06:19] Alex Volkov: 2, which is a new TTS, a 500 million parameter, small TTS you can run in your browser and sounds pretty dope.art in the fusion, super quick runway launches an image [00:06:30] model. Yep, Runway, the guys who do video, they launched an image model that looks pretty sick, and we're definitely going to look at some examples of this, and Confi UI Desktop, for those of you who are celebrating something like this, Confi UI now is runnable with desktop, and there's a bunch of tool stuff, but honestly, I can talk about two things.
[00:06:47] Alex Volkov: Tools and there's a cool thing with Google generative chess toy. I can show you this so you can show your folks in Thanksgiving and going to impress them with a generative chess toy. But honestly, instead of this, I would love to chat about the thing that [00:07:00] some of us saw on the other side of the social media networks.
[00:07:04] Alex Volkov: And definitely we'll chat about this, for the past 24 hours. So chat, for the past. 24 hours, on BlueSky, we saw a little bit of a mob going against the Hug Face folks and then, other friends of ours on,from the AI community and the anti AI mob on BlueSky. So we're going to chat about that.
[00:07:26] Alex Volkov: And hopefully give you our feelings about what's going on, about this [00:07:30] world. And this is a pro AI show. And when we see injustice happens against ai, we have to speak out about against this. And I think that this is mostly what we're gonna cover this show, unless this is.
[00:07:42] Wolfram Ravenwolf: Where I could insert the two things I have.
[00:07:44] Wolfram Ravenwolf: One is a tool, which is the AI video composer, which, allows you to talk to, ff mpac, which is a complicated comment line tool, but very powerful. And so you have a UI where you just use natural language to control the tool. So that is one tool. Maybe we get to [00:08:00] it, if not just Google or ask for Plexity or anything.
[00:08:03] Alex Volkov: No, we'll drop it in. Yeah, we'll drop it in show notes, absolutely.
[00:08:04] Wolfram Ravenwolf: Yeah, that's the best part. Okay. And echo mimic. Version 2 is also an HN Synthesia alternative for local use, which is also, yeah, a great open source local runnable tool.
[00:08:17] Alex Volkov: What do we call this? EcoMimic?
[00:08:19] Wolfram Ravenwolf: EcoMimic. EcoMimic
[00:08:21] Alex Volkov: v2.
[00:08:21] Wolfram Ravenwolf: EcoMimic
[00:08:23] Alex Volkov: 2.
[00:08:24] Alex Volkov: Alright, we have a special guest here that we're gonna add Alpin. Hey Alpen, [00:08:30] welcome, feel free to stay anonymous and don't jump, we're gonna start with open source AI and then we're gonna chat with you briefly about the experience you had.
[00:08:38] Alpin Dale: hello everyone.
[00:08:39] Alex Volkov: Hey man. Yeah, you've been on the show before, right Alton? You've been on the show.
[00:08:43] Alpin Dale: a few times, yeah. it's nice to be back here again.
[00:08:46] Alex Volkov: Yeah. Alton, we're gonna get, we're gonna chat with you soon, right? We're gonna start with open source. We need to go to Junyang and talk about reasoning models.
[00:08:52] Alex Volkov: so feel free to stay with us. And then I definitely want to hear about some of the stuff we're going to cover after open source. We're going to cover the [00:09:00] anti AI mob over there.
[00:09:05] Alex Volkov: Alrighty folks, it's time to start with the,with the corner we love the most, yeah? let's dive into this. Let's dive in straight to Open Source AI.
[00:09:29] Alex Volkov: Open Source AI, [00:09:30] let's get it started. Let's start it.
[00:09:35] Alex Volkov: Okay, folks, so open source this week, we're going to get, let me cover the other two things super quick before we dive in.
[00:09:43] NVIDIA Haimba Hybrid Model Discussion
[00:09:43] Alex Volkov: Alright, so I want to like briefly cover the Haimba paper super quick, because we're going to get the least interesting stuff out of the way so we can focus on the main topic. Course, NVIDIA released Heimbar 1. 5 parameters. Heimbar is a hybrid small model, from NVIDIA. We talked about hybrid models [00:10:00] multiple times before.
[00:10:00] Alex Volkov: we have our friend of the pod, LDJ here. He loves talking about hybrid models. He actually brought this to our attention in the, in, in the group chat. We talked about, you guys know the Transformer, we love talking about the Transformer. Haimba specifically is a hybrid model between Transformer and I think they're using a hybrid attention with Mamba layers in parallel.
[00:10:22] Alex Volkov: they claim they're beating Lama and Qwen and SmallLM with 6 to 12 times less training as well. Let's look [00:10:30] at the, let's look at their, let's look at their X.so this is what they're, this is what they're showing, this is the table they're showing some impressive numbers, the interesting thing is, this is a table of comparison that they're showing, and in this table of comparison, the comparison is not only Evaluations.
[00:10:47] Alex Volkov: The comparison they're showing is also cache size and throughput, which I like. it's do you guys know what this reminds me of? This reminds me of when you have a electric vehicle [00:11:00] and you have a gas based vehicle or standard combustion engine vehicle, and then they compare the electric vehicle and acceleration.
[00:11:07] Alex Volkov: It's Oh, our car is faster. But you get this by default, you get the acceleration by default with all the electric vehicles. This is how the model works. This is how those model works. So for me, when you compare like hybrid models, or, non transformer based models, a Mamba based models, the throughput speed up is generally faster because of it.
[00:11:29] Alex Volkov: [00:11:30] But definitely the throughput is significantly higher. Tokens per second. is significantly higher. So for comparison for folks who are listening to us, just so you, you'll hear the comparison, the throughput for this 1. 5 billion model is 664 tokens per second versus a small LM 238 tokens per second, or something like Qwen 1.
[00:11:54] Alex Volkov: 5 at 400. So 600 versus 400. the training cost in [00:12:00] tokens, they say this was, 1. 5 trillion tokens versus Qwen at 18. I don't know if Junyang you want to confirm or deny the 18 mentioned here that they added. Sometimes they, they say different things, but yeah, definitely the highlight of this Heimwehr thing.
[00:12:14] Alex Volkov: And this is from NVIDIA, by the way, I think it's very worth like shouting out that this specific thing comes from this model comes from NVIDIA. Um,they specifically mentioned that the cost, And outperformance of this model comes at 6 to 12 times less [00:12:30] training, which is very impressive.
[00:12:31] Alex Volkov: what else about this model? Performance wise, MMLU at 52, which is lower than Qwen at 59, at, at 1. 5 billion parameters. GSM 8K, we know the GSM 8K is not that interesting anymore, I think, at this point. We're not like over, we're not over, we're not looking at this like too much. What else should we say about this model?
[00:12:52] Alex Volkov: GPK is pretty interesting at 31. GPK is usually knowledge versus something. [00:13:00] Anything else to say about this model? Yeah, you have anything to say Nisten? Anything to say about the small models? About the hybrid model specifically? I know that like our friend LDJ said that like this seems like the first actual model that competes apples to apples.
[00:13:13] Alex Volkov: Because usually when we compare Hybrid models specifically, those usually people say that those are not like necessarily one to one comparisons between hybrid models and just formal models.
[00:13:24] Nisten Tahiraj: I was just going to say that fromfrom NVIDIA, we've heard these [00:13:30] claims before and they didn't quite turn out that way, so I'm going to start off a little bit more skeptical on that end. also from, from the Mistral Mamba, Mambastral, that one was not very performant.
[00:13:44] Nisten Tahiraj: it seemed like it was going to be good for long context stuff. The runtime wasn't that good as well. yeah, I'm going to give this one a test because. Again, the promise of, of like hybrid, SSM models is that it can do better [00:14:00] in longer contexts and it can run faster. So it is worth testing given what, what they're claiming.
[00:14:06] Nisten Tahiraj: But, again, on MMLU, it didn't do that well, but, yeah, overall the numbers do look great actually for what it is, but I think we do need to do further testing on this, whether it is practically. That's good. Because I'm not sure how well it's going to hold up after you just throw like 32k of context of it.
[00:14:25] Nisten Tahiraj: I guess it's going to remember all that, but, yeah, this on paper, this does [00:14:30] look like it's one of the first ones that is Applesauce.
[00:14:33] Alex Volkov: Yeah. All right. anything else to say here? Yeah, the architecture. Jan, go ahead.
[00:14:39] Yam Peleg: Yeah, about the architecture. I tweeted about it.It is, I think it has extreme potential and, it might, I just by looking at the attention maps, from the paper, like just a glimpse is enough for you to see that.
[00:14:55] Yam Peleg: They really do solve something really profound [00:15:00] with many of the models that we have today. basically, I'm really simplifying here, but basically, when you look at the Attention versus Mamba, they act very differently in terms of how they process the tokens, sliding window ones, you could say.
[00:15:20] Yam Peleg: And of course self attention is like global, to everything, but Mamba is not exactly global, it's sequential, and sliding window is also not exactly [00:15:30] global, but it's not the same sequential, it's like everything to everything, but with a window. So what they did is combine the two, and you can really see the difference in attention map of the trained model.
[00:15:44] Yam Peleg: it's not exactly the same as just, hybrid Mamba attention models that we all saw before.there is a lot to this model and I really want to see one of those. I just [00:16:00] trained for like at scale, like a large one on, on, on a huge data set, because I think it might be an improvement to either,just by looking at the way the model learned, but you cannot know until you actually try.
[00:16:15] Yam Peleg: I tweeted about it just like briefly. So if you want to go and look at, I'm just, I'm just pointing out that go and check the paper out because the architecture is unique. There is, there is a reason the model is, for its size, very performant. [00:16:30]
[00:16:30] Alex Volkov: Yeah, I'm gonna add your tweet.
[00:16:31] Alex Volkov: All right, folks, time for us to move to the second thing.
[00:16:36] Allen Institute's Olmo 2.0 Release
[00:16:36] Alex Volkov: The folks at Allen AI, surprises with another release this week, and they have, as always they do, they say, hey folks, we divide the categories of open source to not open source at all, then somewhat open weights maybe, and then fully open source, the folks who release the checkpoints, the data, the, the training code.
[00:16:57] Alex Volkov: I will say this, they used to release Weights [00:17:00] Biases logs as well, and they stopped. So if somebody listens to the show from LMAI, as I know they do, folks, what's up with the Weights Biases logs? We know, and we love them, so please release the Weights Biases logs again. but, they released Olmo 2.
[00:17:14] Alex Volkov: Congrats, folks, for releasing Olmo 2. Let me actually do the clap as well. Yay!Olmo 2 is, they claim, is, they claim,the best open, fully open language model to date, and they show this nice graph as well, where, they released two models, Olmo [00:17:30] 2. 7b and Olmo 2. 13b, and they cite multiple things, to, to attribute for the best performance here.
[00:17:37] Alex Volkov: specifically the training stability, they ran this for a significant longer before. they cite some of the recipes of. What we talked about last week from TULU3 methodology, the kind of the state of the art post training methodology from TULU3 that we've talked with Nathan last week, specifically the verifiable framework, thing that we've talked about, multiple other technical things like rate [00:18:00] annealing and the data curriculum.
[00:18:01] Alex Volkov: And obviously they're focusing on their data. they have their, Ohm's selection of tasks on which they compared these models and,the breakdown that I told you about that they do is the open weights models, partially open models, and then fully open models. So this is the breakdown that they have in the area of open weights models.
[00:18:18] Alex Volkov: They have Lama 2. 13b and Mistral 7b, for example, they put Qwen in there as well. So Qwen 2. 57 and 14. And the partially open models, they put Zamba and Stable [00:18:30] LLM. And the fully open models, they put themselves and Olmo and, Ember7B and Olmo2 beats all of that category with some nice, average of stats.
[00:18:40] Alex Volkov: they talk about pre training and a bunch of other stuff. and the instruct category specifically with the Tulu kind of,recipes. What else can we say about Olmo? That's very interesting for folks before we jump into Qwen. What else can we say about Olmo? The, oh, the fact that the thing about the fully open source, we always mention this, is the data set.
[00:18:59] Alex Volkov: We [00:19:00] always talk about the data, they release all of the data sets, so Olmo mix was released, Dolmino mix was released, the SFT training data, post training data set was released as well. yeah, folks, comments. You can also try this model at playground. lnai. org. I've tried it. It's interesting. it's not look, uh,the best about this is the best among open source.
[00:19:21] Alex Volkov: Obviously it's not the best at, generally with closed source data, you can get more significantly better than this. But comments from folks about OMO? [00:19:30]
[00:19:30] Wolfram Ravenwolf: Yeah, it's not multilingual, they said that there is only English, but they are working on putting that in, I think, in another version, but, yeah, it's a truly open source model, not just OpenWeights, so a big applause for them, releasing everything, that is a big thing and I always appreciate it.
[00:19:46] Wolfram Ravenwolf: Thank you.
[00:19:48] Alex Volkov: A hundred percent. All right, folks, it looks like we got Eugene back. Eugene, talk to us about Heimbar.
[00:19:54] Eugen Cheugh: Yeah, no, sorry, I was just saying that as someone who works on transformer [00:20:00] alternative,it's actually really awesome to get the data point because we all haven't decided what's the best arrangement, what's the percentage of transformer versus non transformer?
[00:20:08] Eugen Cheugh: Is the non transformer layers in the front or the back? It's like you say, the car and the car scenario, it's like electric car, do we even know if we want the electric engine in front or the back? and these are data points that we love to test to just, find out more and it's. And I appreciate what NVIDIA is doing as well and looking forward to more research in this space.
[00:20:26] Alex Volkov: Awesome. thanks for joining us and feel free to stay. The more the merrier. This is like a [00:20:30] Thanksgiving kind of pre party for all of us. The more the merrier, folks. If you're listening to this only and you're not like on the live stream, I encourage you to go and check us out because like we're also like showing stuff.
[00:20:40] Alex Volkov: We're like showing the papers. We're like, we're waving. We're like showing Turkey, whatever. we're having fun. all right, folks. I think it's time to talk about the main course. We just ate the mashed potatoes. Let's eat the turkey for open source.
[00:20:53] Qwen Quill 32B Reasoning Model
[00:20:53] Alex Volkov: In this week's Open Source Turkey dinner, the Reasoning Model, the first ever Reasoning Open [00:21:00] Source, we got Qwen Quill, Qwen Quill?
[00:21:04] Alex Volkov: Yes, Qwen Quill 32 bit preview, the first open source. Let's go! Let's go! The first open source Reasoning Model from our friends at Qwen. We have Jun Yang here, Jun Yang and Justin Lin, to talk to us about this release. Folks at OpenAI released this, they worked for, the rest of about O1, we released a couple of months ago.
[00:21:25] Alex Volkov: Then the folks at DeepSeek released R1, that they just released it, they [00:21:30] promised to give us, maybe at some point. The folks at O1 did not release the reasoning. So, what you see in O1 is the reasoning being obfuscated from us, so we can't actually see how the model reasons. R1 gave us the reasoning itself.
[00:21:44] Alex Volkov: But didn't release the model. And so now we have a reasoning model that you can actually download and use. And unlike reflection, this model actually does the thing that it promises to do. Junyang, how did you do it? What did you do? Please give us all the details as much as possible. Please do the announcement yourself.
[00:21:58] Alex Volkov: Thank you for joining us. [00:22:00] Junyang from Qwen.
[00:22:00] Junyang Lin: Yeah, thanks everyone for the attention and for the appreciation, and I'm Junyang from the Qwen team, and we just released the new model for reasoning, but we just added a tag that it is a preview. Yeah, it is something very experimental, but we would really like to receive some feedback to see how people use it and to see what people think.
[00:22:24] Junyang Lin: The internal problems,they really are. Yeah, it is called QUIL. it is [00:22:30] something, very interesting naming,because we like to see that, we first called it like Q1,things like that, but we think it's something too normal and we'd like to see there was something connected with IQ, EQ, then we call it QQ, and then we found out, QWEN with a W there.
[00:22:47] Junyang Lin: And we found a very interesting expression because it looks really cute. There is a subculture in China with the text expression to express the feelings. So it is something very interesting. So we [00:23:00] just decided to use the name and for. For the pronunciation, it's just like the word Q, because I combined QW, the pronunciation of QW, with U together, and it's still just cute.
[00:23:13] Junyang Lin: Yeah, there's something beside the model, and it is actually a model, which can, And this is the reason before it reaches the final response. If you just try with our demo and you will find that it just keeps talking to itself. And it's something really [00:23:30] surprising for us. If it asks you a question, it just keeps talking to itself to discover more possibilities as possible.
[00:23:42] Junyang Lin: And sometimes will lead to some new things. Endless generation. So we have some limitations there. So we mentioned the limitations in the almost the second paragraph, which includes endless generation. But it is very interesting. I [00:24:00] don't say it is a really strong model, something like competitive to O1 or outcompeting R1.
[00:24:06] Junyang Lin: It is not Simply like that, we show the benchmark scores, but it is something for your reference to see that, maybe it is at this level, and then if you really check the model performance, when it processes like mathematics and coding problems, it really thinks step by step, and it really discovers more possibilities.[00:24:30]
[00:24:30] Junyang Lin: Maybe it is a bit like brute forcing, just like discovering all possibilities. If there are 1 plus 2 is equal to 1, and it discovers a lot of possibilities, but it sometimes finishes,can finish some very difficult tasks. I think, you guys can wait for our more official release, maybe one month or two months later.
[00:24:53] Junyang Lin: We'll make sure, And the next one will be much better than this preview one, but you can play with it. It is something really interesting, [00:25:00] very different from the previous models.
[00:25:02] Alex Volkov: So first of all, a huge congrats on releasing something that, everybody, it looks like it piqued interest for, tons of folks, absolutely.
[00:25:09] Alex Volkov: Second of all, it definitely thinks, it looks like it's,Actually, this seems like this. you can see the thinking, like we're actually showing this right now for folks who are just listening and I'll just read you the actual kind of ice cube question that we have that,somebody places four ice cubes and then at the start of the first minute, and then five ice cubes at the start of the second minute, how many ice cubes there are at the [00:25:30] start of the third minute,we should probably have prepared like a turkey based question,for this one, but basically the answer is zero.
[00:25:36] Alex Volkov: Oh, the ice cubes melt within a minute, and the answer is zero, and people know the answer is zero because, ice cubes melt faster than a minute. But, the,LLM starts going into math and s**t, and, just to be clear, O1 answers this question, it understands the answer is zero. Quill does not.
[00:25:53] Alex Volkov: But the reasoning process is still pretty cool and compared to like other models like you see you can see it thinking It's let me set up an equation. Oh, [00:26:00] actually, it's not correct Ah, now the equation asking for this and this and this and it goes like This is confusing Let me read the problem again.
[00:26:06] Alex Volkov: And so it tries to read the problem again. This feels Not like just spitting tokens. So Junyang, what, could you tell us like what's the difference between this and training at a regular Qwen 2. 5? So as far as I saw, this is based on Qwen 5, correct?
[00:26:27] Junyang Lin: Yeah, it is based on Qwen 2. 5 [00:26:30] 32 billion de instruct Model. Yeah, we have tried a lot of options, maybe we will release more technical details later, but I can tell you something that, we mostly simply do some, do some work on the, post training data. Because it is actually based on our previous model, so we did not change the pre training, because we are actually very confident in our pre training, because we have trained it with [00:27:00] a lot of tokens, so there should be some knowledge about reasoning there, and in Qwen 2.
[00:27:05] Junyang Lin: 5, we also have some text reasoning, relative data, in the pre training process, so we just try to see that if we can align with the behavior of such, reasoning. So we have some very simple,superfines, fine tuning, and we find that while it can generate things like that, we have done a bit like RL stuff, and we also have done something like, RFT, Rejection, [00:27:30] Finetuning, so we can add more data from it.
[00:27:33] Junyang Lin: And there are a lot of techniques, just like self aligned. We use the base language model to use in context learning to build samples for us, to just We've built something like that make the model that can reason and we found that it's really surprising. We did not do very complex stuff, but we find that it has this behavior, but we still find that there is still much room in the reinforcement learning [00:28:00] from human feedback because we found that if you add some RL, you can improve the performance very significantly, so we have some belief that Maybe we, if we have done some more in a process where we're modeling LLM critiques and also things like building more nuanced data for the multi step reasoning, the model will be much better.
[00:28:26] Junyang Lin: Yeah. But this one is interesting. You can keep [00:28:30] talking to it. It keeps talking to itself, just talking about some strange thinking and sometimes maybe I'm wrong. I will check the question again and maybe I'm wrong again and then do it again and again. And sometimes it's generally too long because we have some limitations in long text generation.
[00:28:49] Junyang Lin: I think All models have this problem, so when it reaches maybe some bound and it will turn into some crazy behaviors, it just never [00:29:00] stops generating. We just mentioned this limitation. Just
[00:29:05] Alex Volkov: to make sure folks understand, this is a preview, this is not like an official release. You guys are like, hey, this is a preview, this is a test of you guys.
[00:29:12] Alex Volkov: You guys are like trying this out, like folks should give feedback, folks should try it out. Maybe Finetune also on top of it. Yeah. There's definitely we're trying this out. This is
[00:29:21] Yam Peleg: it's like chatGPT is a research preview. It's not exactly a preview. It beats the benchmarks on so many problems.
[00:29:29] Yam Peleg: We would
[00:29:29] Junyang Lin: like [00:29:30] to make it a fun, funny stuff to make people happy. It's now Thanksgiving and people are always expecting models from us. And they're just talking that all out. where's our reasoning model or things like that. Yeah. so we showed this one to you. And.
[00:29:48] Alex Volkov: Yeah, Jan Wolfram, folks, comments about the reasoning model from Qwen.
[00:29:53] Yam Peleg: Oh, I have a lot of comments. That's a lot. I don't know if you can hear me. Yeah, Jan, [00:30:00] go ahead.
[00:30:00] Alex Volkov: There's just a delay, but we're good.
[00:30:02] Yam Peleg: Yeah, I just want to say, it's like, uh, CGPT is, uh, is a research preview. It's it's a really good thing.
[00:30:10] Yam Peleg: It's a really good model. Seriously. So, I mean, it can be a preview, but it's extremely powerful. How did you guys train this? I mean, what, what, what's the data? How did you generate it? Can you Can I just create data that looks like O1 and Finetune and it's going to work? or, like, give us some details.
[00:30:28] Yam Peleg: it's a really hard thing to [00:30:30] do. it's really, really, really successful. Sohow did you make it?
[00:30:35] Alex Volkov: Give us some details if you can, I'm saying. if you can. Don't let Yam, don't let Yam go into give some details that you cannot give details. but hey, it looks like we may have lost Junyang for a bit with some connection issues, but while he reconnects, we got Maybe he can't, maybe he can't hear details, so
[00:30:52] Wolfram Ravenwolf: They put the plug.
[00:30:53] Alex Volkov: and Wolfram, what's your, I saw your take. Let's, meanwhile, let's take a look. You did some testing for this model as well, right?
[00:30:59] Wolfram Ravenwolf: [00:31:00] Yeah. And I just ran the, the IceCube prompt and on my run, it got the zero correct.
[00:31:04] Wolfram Ravenwolf: So that is a bit of a red flag. Oh, you
[00:31:06] Alex Volkov: did get it correct.
[00:31:07] Wolfram Ravenwolf: Yeah. it was fun because it wrote, Over 10, 000 characters, but in the end it said, okay, so confusing, they all melted zero. So that worked. But of course you have to run benchmarks multiple times. I did run the MMLU Pro computer science benchmark twice.
[00:31:23] Wolfram Ravenwolf: And what is very interesting is, Also here, it generated much more tokens than any other model. The second, highest [00:31:30] number of tokens was GPT 40, the latest one, which was 160, 000 tokens for the whole benchmark. And here we have over 200, 000, 232, 000 tokens it generated. So it took me two and a half hours to run it.
[00:31:45] Wolfram Ravenwolf: And, yeah, it's an 8B model, no, a 32B model at 8 bit in my system where I was running it, because I have 48GB VRAM, so you can run it locally and look at it, it's, it's placed above the 405B [00:32:00] Lama 3. 1, it's above the big Mistral, it's above the GBT, JGBT latest, and the GBT 4. 0 from, yeah, the most recent one.
[00:32:08] Wolfram Ravenwolf: So just to recap
[00:32:09] Alex Volkov: what you're saying. On the MMLU Pro Benchmark, this is a model that you run on your Mac, or whatever PC, and it beats Llama 3. 5, 4 or 5 billion parameter on this benchmark, because it's reasoning and it's smart, it runs for longer, and it uses those test time compute, inference time [00:32:30] compute, Compute, Scaling, Loss that we talked about multiple times.
[00:32:33] Alex Volkov: It runs for longer and achieves a better score. This is like the excitement. This is the stuff. so Junyang, now that you're back with us, could you answer, or at least some of Yam's question, if you couldn't hear this before, I will repeat this for you. How? What does the data look like? can you just come up with some O1 stuff?
[00:32:51] Alex Volkov: By the way, welcome, welcome Nisten.
[00:32:53] Nisten Tahiraj: But I tried it.
[00:32:54] Introduction to the New Google Model
[00:32:54] Nisten Tahiraj: It got the Martian.Rail Train Launcher, it got it perfectly [00:33:00] on first try, and I saw that it did take it three tries, so I use this as a standard question on most models, is if you're going to launch a train from the highest mountain in the solar system, which is on Mars, and you want to accelerate it at two G's, so Still comfortable.
[00:33:21] Nisten Tahiraj: how long would that track need to be in order for you to get to orbital velocity and in order for you to get to, to leave [00:33:30] Mars gravity well? And it's a very good question because there's so many steps to solve it and you can just change it to, you can say 2. 5G and that completely changes the order of the steps for, that the model has to solve.
[00:33:42] Alex Volkov: So it's unlikely to be in the training data and it got it perfectly. It's again, it's this one, it's the new Google preview, even Sonnet takes two tries, two or three tries often to get the right answer. So,yeah, the model worked, and I had the same thing as [00:34:00] Wolfram, he did put out a lot of tokens, but again, it's pretty fast to run locally, Folks, it's a good model. It's, it, for a test preview, for something that was released, as a first, open weights reasoning model, we are very impressed.
[00:34:14] Model Performance and Availability
[00:34:14] Alex Volkov: we're gonna give Junaid, one more, one more attempt here, Junaid, I see you on the spaces. and you're as a speaker, maybe you can unmute there and speak to us through the spaces,while we try this out, I will just tell to folks that like you are, you can download this model.
[00:34:27] Alex Volkov: It's already on, OLAMA. [00:34:30] You can just like OLAMA install Quill or QWQ.it's already on OpenRouter as well. You can get it on OpenRouter. So you can like replace. you can replace whatever you use, like OpenAI, you can replace and put this model in there. it's, you can try it out in Hug Face, this is where we tried it just now.
[00:34:47] Alex Volkov: And, It's awesome. It's awesome to have this. I'm pretty sure that many people are already like trying different variations and different like fine tunes of this model. And it just like going up from here, like to get a open [00:35:00] model, 32 billion parameters, that gets, what is the score? let me take a look.
[00:35:04] Alex Volkov: The score is, I think it gets, 50 on AIME. It's ridiculous. Anybody try this on ARK Challenge, by the way? Do you guys see in your like, like tweets or whatever, the ARK Challenge? Anybody try to run this model on that and try? I would be very interested because that's that's a big prize. It's a very big prize.
[00:35:22] Alex Volkov: I'm pretty sure
[00:35:22] Eugen Cheugh: someone's trying right now. You shall think that out.
[00:35:26] Alex Volkov: I'm pretty sure somebody's trying right now. They could use a
[00:35:29] Wolfram Ravenwolf: 72B [00:35:30] version of it and maybe that gets even better. Probably does.
[00:35:35] Alex Volkov: Yeah. They're probably training a bigger model than this right now. all right folks. So with this, I think that, we've covered pretty much everything that we wanted to cover with Quill.
[00:35:46] Scaling and Model Efficiency
[00:35:46] Alex Volkov: and I think, yeah, the one thing that I wanted to show, let me just show this super quick before we move on to the next topic that we have is this, scaling kind of thing. We saw pretty much the same thing. From, from [00:36:00] DeepSeq. And then we saw pretty much the same thing also from OpenAI. The kind of the scaling confirmation, the scaling log confirmation, the next scaling log confirmation, test time compute or inference time compute works.
[00:36:11] Alex Volkov: Which basically means that the more thinking, the more tokens, the more time you give these models, the better. to think, the better their answer is. We're getting more and more confirmation for this kind of Noah Brown, I don't know, thesis, that these models actually perform [00:36:30] significantly better when you give them more tokens to think.
[00:36:32] Alex Volkov: this is incredible to me. This is like incredible because not only will we have better models with more scale, but Even though some people claim a wall has been hit, no wall has been hit. but also we now have these models that can answer better with more tokens. and this is like another, another confirmation from this.
[00:36:51] Alex Volkov: Qwen, Quail32B is now here. You can, you can now run. a, a 4 0 5 B level models, at least on [00:37:00] MMLU Pro,like wolf from here said on your computers. And shout out to our friends from, Alibaba Quinn for releasing these awesome models for us as a Thanksgiving,present.
[00:37:10] Alex Volkov: Jang, you're back with us. Let's see. maybe you're back.
[00:37:14] Junyang Lin: I don't know if you can hear me. Yes,
[00:37:16] Alex Volkov: we can hear you finally, yes.
[00:37:18] Junyang Lin: I don't know what happened.
[00:37:19] Alex Volkov: it's
[00:37:20] Junyang Lin: fine. I
[00:37:22] Alex Volkov: think that, let's try this again. maybe last thing as we're going to try.
[00:37:27] Discussion on Reasoning Models
[00:37:27] Alex Volkov: What, from what you can tell us, [00:37:30] how does the work on this look like?
[00:37:34] Alex Volkov: Is a lot of it synthetic? Is a lot of it RL? Could you give us, a little bit of, Give us a hint of what's going to come in the technical release for this. And also what can we look forward to in the upcoming? Are you maybe working on a bigger model? give us some, give us something for Thanksgiving.
[00:37:51] Junyang Lin: Oh yeah. for the reasoning steps, I think, the data quality, really matters and, we, we think that, it may split the steps, [00:38:00] more, make it more nuanced. make it more small steps. It can be just, the possible answers, with higher possibility, which means that the machine may think, in a different way from, the human being.
[00:38:12] Junyang Lin: The human being may reach the answer very directly, but sometimes, for a reasoning model, it may reason to explore more possibilities. So when you label the data, you should pay attention to, these details and, This is a part of it, and now we only have done some work on mathematics and [00:38:30] coding, and especially mathematics, and I think there's still much room in general knowledge understanding.
[00:38:37] Junyang Lin: I found that Wolfram just tested it for the MMU PRO, but we actually did not strengthen its performance for the MMU PRO. this kind of benchmark. So I think for the scientific reasoning, there's still much room for it to do it. And something surprising for us, is that we found that, it sometimes generate more beautiful texts, more [00:39:00] poetic, some, something like that.
[00:39:02] Junyang Lin: I don't know why, maybe it is because it reasons. So I think it may encourage creative writing as well. A reasoning model that can encourage creative writing. That would be something very interesting. I also found some cases, in Twitter, that people find that, it sometimes generates, text more beautiful than, Claude's written by someone and created.
[00:39:22] Junyang Lin: there's still much room for a reasoning model. Yep.
[00:39:25] Alex Volkov: Very interesting. Just to recap, folks found that this model that is [00:39:30] trained for reasoning gives more poetic, writing. that's very interesting. All right, folks, I think it's time for us to move on, but
[00:39:37] Wolfram Ravenwolf: just one quick comment.
[00:39:39] Multilingual Capabilities of Qwen
[00:39:39] Wolfram Ravenwolf: It's also very good in German. I tested it in German as well. So even if it may not be the focus, if you are multilingual or another language, try it. Yeah,
[00:39:50] Junyang Lin: that's something not that difficult for us because the Qwen is strong model is multilingual And it is actually I think it is now good at German.
[00:39:59] Junyang Lin: Yeah, [00:40:00]
[00:40:02] Alex Volkov: Qwen's multilingual is very good at German.
[00:40:04] BlueSky hate on OpenSource AI discussion
[00:40:04] Alex Volkov: Alright folks, I think that it's time for us to move on a little bit and Now we're moving to less fun, less of a fun conversation, but I think we should talk about this. just a heads up, after this, we're gonna have this week's buzz, but I don't have a category for this.
[00:40:19] Alex Volkov: I don't have a category for this, but it must be said. as ThursdAI is all about positivity. We talk about AI every week to highlight the advancement we highlight with positivity we get excited about every new [00:40:30] release every new whatever we also recently and now we have you know we're on youtube as well and the reason it coincided well with some of the folks in the ai community moving over to blue sky let me actually first Say hi to my colleague here, Thomas.
[00:40:44] Alex Volkov: I'm going to pull you up on stage as well. welcome Thomas as well. Hey man, welcome. My colleagues for the past year from Weights Biases, welcome as well. You're more than welcome to join us as well, because you're also on BlueSky. And, so a bunch of the community, recently started seeing whether or not there's a [00:41:00] new place over at BlueSky.
[00:41:02] Alex Volkov: for the ML community. I saw a bunch of ML people over there as well. I see Wolfram over here has a little butterfly. you all who are joining us from Twitter, or Xspaces, for example, you've probably seen a bunch of your favorite AI folks post just a blue butterfly and maybe follow them towards the other social media platform due to your political preferences, wherever they may be, which is completely fine.
[00:41:26] Alex Volkov: That's all good and well and fine. so I started cross posting to both, [00:41:30] and I'll show you how my screen looks like recently. This is how my screen looks like. I scroll here, I scroll on X, and I scroll on blue sky. This is what my life looks like. Yes, I'm on both. because I want to make sure that I'm not missing any of the news.
[00:41:43] Alex Volkov: That I want to bring to you, and also Zinova, our friend, right? He posts everywhere, and I see the community bifurcating. I don't like it. But I want to make sure that I'm not missing anything. This is not what I want to talk to you about. Not the bifurcation. I don't mind the bifurcation. We'll figure out something.
[00:41:58] Alex Volkov: We're on YouTube as well, [00:42:00] so the folks from BlueSky who don't jump on TwitterX community, they can still join the live chat. What I want to talk to you about is this thing that happened where, a bunch of folks from Hug Face just joined Blue Sky as well, and one of the maybe nicest people in, from the Hug& Face community, Daniel,I'm blanking on his last name, Nisten, maybe you can help me out, Daniel Van Strijn?
[00:42:24] Alex Volkov: Daniel Van Strijn?basically, did what he thought was [00:42:30] maybe a cool thing. He compiled the dataset. You guys know, we talk about data and open source and Hug Face as well. This is like in the spirit of the open source community, there's, we talk about open datasets. we, I have a thing here. This is my thing.
[00:42:43] Alex Volkov: When we talk about somebody releasing. Open source datasets. We have a thing. We clap, right? and so he compiled, a dataset of 1 million blue sky posts to do some data science. This is like what Hagenfeist, put it on Hagenfeist. just to mention one thing before, [00:43:00] unlike Twitter, which used to be open, then Elon Musk bought it and then closed the API, and then you have to pay 42, 000 a year.
[00:43:07] Alex Volkov: 42, 000 a year. Yes, this is the actual price. 42, 000 a year. this is the actual literal price for the API. Unlike Twitter, which used to be free, BlueSky is built on a federated algorithm. There's a firehose of API you can apply to it. And then you can just like drink from this firehose for free. This is like the whole point of the platform.
[00:43:27] Alex Volkov: so then you'll connect to this firehose, drink from it and [00:43:30] collect, compile the data set of a 1 million posts, put it up on Hug Face, open source.
[00:43:36] Community Reactions and Moderation Issues
[00:43:36] Alex Volkov: And then got death threats. Death threats. He got death threats for this thing. People told him that he should kill himself for this act where he compiled data from an open fire hose of data that is open on purpose.
[00:43:58] Alex Volkov: What the actual f**k? [00:44:00] And when I saw this, I'm like, what is going on? And in less than 24 hours, I'm going to just show you guys what this looks like. Okay. this is the, this is on the left of my screen and the folks who are not seeing this, you probably, I'm going to, maybe pin.
[00:44:13] Alex Volkov: Yeah. let me just do this super quick. So you guys who are just listening to this, please see my pinned tweet, as well. because this is some insanity. Okay. And we have to talk about this because it's not over here. he compiled a 1 million public posts, BlueSky Firehose API, data set.
[00:44:27] Alex Volkov: And then, it got extremely [00:44:30] viral to the point where I don't know, it's like almost 500 whatever it's called. And then the amount of hate and vitriol in replies that he got from people in here. Including, yes, including you should kill yourself comments and like death threats and doxing threats, et cetera.
[00:44:47] Alex Volkov: many people reached out directly to,HugNFace folks. he became maybe number two most blocked person on the platform as well. and all of this, they, people reached out to the Hug Face community. Basically in less than [00:45:00] 24 hours, he basically said, I removed the BlueSky data from the repo.
[00:45:03] Alex Volkov: I wanted to support the tool development for the platform, recognize this approach, violate the principle of transparency and consent. I apologize for this mistake, which, okay, fine. I acknowledge his position. I acknowledge the fact that he works in a,he works in a company and this company has lawyers and those lawyers need to adhere to GDPR laws, et cetera.
[00:45:23] Alex Volkov: And many people started saying, Hey, you compiled my personal data without, the right for removal, et cetera, without the due [00:45:30] process, blah, blah, blah. Those lawyers came, there's a whole thing there. And then our friend here, Alpen, who's a researcher, of his own, connected to the same open firehose of data, and collected a dataset of 2 million posts.
[00:45:47] Alex Volkov: That's twice as many as Daniel did, and posted that one, and then became the person of the day. Alpen, you want to take it from here? You want to tell us what happened to you since then? What your 24 hours looked [00:46:00] like?
[00:46:00] Alpin Dale: yeah, sure. it's been quite the experience being the main character of the day in Blue Sky.
[00:46:05] Alpin Dale: And,obviously, I'm not showing my face for very obvious reasons. I have received quite a few threats because, Yeah, unlike Hugging Face employees, I am not beholden to a corporation, so I didn't really back down. And, yeah, I probably received hundreds of death threats and doxxing attempts.
[00:46:24] Alpin Dale: so just to reiterate what you said, the Firehose API is completely [00:46:30] open.
[00:46:31] Alpin Dale: It is, it's a good analogy with the name because it's like a firehose, anyone can use it.
[00:46:35] Legal and Ethical Implications
[00:46:35] Alpin Dale: you have they've also,threatened me with litigation, but, I'm not sure if you guys are aware, but there was a court case back in 2022, HiQ Labs versus LinkedIn, where, HiQ Labs was, scraping public, public accounts from LinkedIn and, using it for some commercial purposes, I don't remember.
[00:46:54] Alpin Dale: But, They did actually win in court against LinkedIn, and what they were doing was [00:47:00] slightly even more illegal because LinkedIn doesn't have a publicly accessible API, and they have Terms of Services specifically against that sort of scraping, and because of that, the ruling overturned later and they, they lost it, they lost the claim, but it did set a precedent to be had that if the,if the, data published on publicly accessible platforms could be lawfully connected, collected and used, even if terms of service like purported to limit such usage.
[00:47:28] Alpin Dale: But I [00:47:30] Never agreed to such a term of service when I started scraping or copying the data from the Firehose API because first, I didn't do any authentication. Second, I didn't provide a username when I did that. So anyone could have done that technically with the AT protocol Python SDK. It's you don't even need to sign in or anything.
[00:47:52] Alpin Dale: You just sign in. Connect to the thing and start downloading.
[00:47:55] Alex Volkov: Yeah, this is the platform is built on the ethos of the open [00:48:00] web. The open web is you connect and you read the data. This is the ethos of the open web. When this is the ethos of the open web, when you post on this platform, Whether or not the TOS is saying anything, when you don't need to authenticate, the understanding of the people should be, regardless, and I understand some of the anger when the people discover, oh, s**t, my, my thoughts That I posted on this platform so far are being used to like, whatever, train, whatever.
[00:48:28] Alex Volkov: I understand some of this, I [00:48:30] don't agree with them, but like I understand, what, how some people may feel when they discover Hey, my thoughts could be collected, blah, blah, blah. and somebody posted like a nice thread. But, the platform is open completely. Going from there to death threats, this is, like, where I draw completely, where I draw my line.
[00:48:45] Alex Volkov: Alpen, the next thing that happened is what I want to talk to you about. you're getting death threats, you're getting doxxed attempts. Um,I couldn't find your post today. what happened?
[00:48:56] Alpin Dale: for some reason, BlueSky decided to terminate my [00:49:00] account instead of the ones issuing the death threats, very interesting chain of events, but,they claimed that I was engaging in troll behavior, whatever that means.
[00:49:10] Alpin Dale: And for that reason, they just, like it wasn't even,due to mass reporting that happens on X. com, right? Specifically emailed me with very, human generated language, where they told me that I was being a troll. I think I posted it on my Twitter account too. And, Yeah, they just assumed I'm trolling, [00:49:30] and what's funny is there's been screenshots floating around of similar mod messages, just giving people a slap on the wrist for much, much worse things, like things we can't even talk about here, right?
[00:49:44] Alpin Dale: So very strange, very silly situation overall. And another thing I wanted to mention, a lot of people. We're bringing up the GDPR and all that because of like personally identifiable information, but if you go to the [00:50:00] dataset, all we have is the post text. The timestamp, the author, and the author name is a, it's just a hash, it's not the full author name, and the URI, so there isn't really much to link people to the, to their specific posts, and there isn't even a location tag, so I'm not sure if it fully applies with GDPR, but I'm not a liar anyways, and, The thing is, the data or their posts were published on a platform that is explicitly designed for public [00:50:30] discourse, right?
[00:50:31] Alpin Dale: And the decision to share sensitive information on a platform like this lies with the user, not the observer. And we are the observer in this case. And by the very nature of public platforms, Individuals that post like content like this, they have to bear the responsibility that their information is accessible to anyone.
[00:50:51] Alpin Dale: And I don't think my dataset like alters this reality because it just consolidates information that was already available for [00:51:00] everyone. And I guess,there were also people who were asking for an opt out option and, the Hugging Face CEO, Clem, also made an issue on the repo about this. And I did provide a very straightforward opt out process, if someone wants to remove that data, they can just submit a pull request.
[00:51:18] Alpin Dale: to remove the specific posts that belong to them but alsothey have to accompany it with a proof of authorship they have to prove to me that the post that they're removing is not a [00:51:30] it belongs to them and it's not a malicious request so i guess i've covered all grounds so i'm not sure what the what people are worried about
[00:51:38] Alex Volkov: so i uhI'm just showing to the folks who are listening, I'm showing a, an email from,from the moderation team at BlueSky.
[00:51:46] Alex Volkov: BlueSky County Control, Alpendale, BlueSky Social was reviewed by BlueSky Content Moderators and assessed as a new account trolling the community, which is a violation of our community guidelines. As a result, the account has been permanently suspended. They didn't even give you the chance to like, hey, delete this and come back to [00:52:00] the platform.
[00:52:00] Alex Volkov: Literally permanently suspended. the folks who are like saying, hey, You are going to be,delete this and come back or the folks who are like 13 death threats, are not there. Um,What can we say about this? it's ridiculous. Absolutely. And I, The fact that Hug Face's account, your account, Daniel's account, became the most blocked accounts on the platform in the past 24 hours, more so than some like crazy Manosphere accounts, is just is absolutely insanity.
[00:52:28] Alex Volkov: The fact that most of [00:52:30] these anger prone accounts People are like anti AI completely. And the whole issue about like consent, whatever, most of them don't even appear in the dataset, by the way. Like some people checked on the fly, Zeofon and I, like we did some basic checking, many people didn't even appear in the dataset.
[00:52:44] Alex Volkov: the fact that the absolute silly fact that the, none of them understand the Barbra Streisand effect on the internet and the fact that there's five datasets right now. Many of them collected the people who reacted to these specific posts and collected the data [00:53:00] set of the people who reacted to these specific posts.
[00:53:02] Alex Volkov: And people just don't understand how the internet works. That was just like ridiculous to me.
[00:53:07] Moving Forward with Open Source
[00:53:07] Alex Volkov: so Alpen, I personally think that you did Many of these people also a very good service as well, because at least some of them now realize how open internet works, despite the being very upset with the fact that this is how the open internet works, at least some of them are now like realizing this.
[00:53:23] Alex Volkov: I,I commend you on like the bravery and standing against this like absolute silliness and not backing down. And [00:53:30] Yeah, go ahead. Happy
[00:53:31] Alpin Dale: to serve. Yeah, another small thing I wanted to add was, I've received a lot of threats about me getting reported to the EU, but what I find really ironic is that,earlier this year, the EU funded a research for collecting over 200 million blue sky posts with a greater level of detail.
[00:53:50] Alpin Dale: So clearly the EU is fine with this, so I don't know what's the problem here, once again.
[00:53:58] Alex Volkov: yeah, I saw this. Yeah, there's a way [00:54:00] bigger thing. The last thing I saw about this, and then maybe we'll open up for folks, and then I would love to chat with my friend Thomas, for whom it's late, and I invited him here, and I want to be very mindful of his time as well, so thank you, Thomas, for being patient.
[00:54:12] Alex Volkov: The last thing I say about this is that this sucks for open source, from the very reason of, if you're open and public and good hearted about this, Hey folks, here's the data in the open, you can look at this data and you can ask for your s**t to be removed. You get an angry mob of people threatening [00:54:30] death against you and asking your lawyers to like, literally people asking like, was Daniel fired?
[00:54:34] Alex Volkov: what the f**k? Meanwhile, this is a open firehose and all of the companies in the world probably already have all this data. I'm pretty sure, OpenAI has been already training on BlueSky. Like, why wouldn't they? It's open. Literally, if you want to train, and Thomas, maybe here is like a little entry to what we're going to talk about.
[00:54:50] Alex Volkov: If you want to train a toxicity,thing, There is now a very good place to go to and look at toxicity score or I can show you where you can go [00:55:00] to to train toxicity score. Like, why wouldn't you go and collect this data? It's free, like literally it lies on the internet.
[00:55:05] Alex Volkov: Nothing in the TOS, like Alpen said, even I went to the TOS of BlueSky. Literally it says over there, we do not control how other people use your data. Like literally that's what it says on the TOS. So yeah, I'm just like, I'm very frustrated against this. I want to speak out against this, absolutely ridiculous behavior.
[00:55:22] Alex Volkov: I don't think that this,okay. So I don't think that the, how the people reacted on the platform speaks against the platform itself. I do think [00:55:30] That the way the moderators, acted out against Alvin's account and the removal of account permanently banned, speaks completely against the platform.
[00:55:38] Alex Volkov: This is stupid and we should speak against this, on the platform itself. if we think that this is a place for the community, that's where I stand. And I wanted to share the publicly, super brief comments, folks, and then we'll move on to this week's bus.
[00:55:49] Wolfram Ravenwolf: There was a link in his message from the moderators that he can reject it and get a review, appeal, yeah.
[00:55:58] Wolfram Ravenwolf: So I hope that, I hope [00:56:00] he gets the appeal through. That is important. Yeah,
[00:56:03] Alex Volkov: if you will,please email them with an appeal and, tell them about the multiple death threats that you received and the fact that, you didn't, did not mean to troll.
[00:56:12] Wolfram Ravenwolf: I reported every one of those messages, by the way, and anyone who does it is probably a good thing.
[00:56:18] Alex Volkov: Nisten, I know you have thoughts on this. I would love to hear.
[00:56:22] Nisten Tahiraj: we need to better educate people to not go after the ones on their side. a lot of the open source devs do this stuff [00:56:30] because they want everyone to have, Healthcare robots that no single corporation owns. They make this data public because people want to democratize the technology for everyone.
[00:56:41] Nisten Tahiraj: So it's not, it doesn't become like authoritarian and like a single source of control. And, to see that they prioritize, just, people's anger and feelings versus being objective. about it. Whereas, [00:57:00] so in this case, the public forum data set is public domain on purpose. And this is what drew people to the community in the first place, because they felt like Twitter was becoming too political, single sided.
[00:57:12] Nisten Tahiraj: And, we didn't like that. And a lot of people moved to, because they saw Blue Sky as a, Much better, democratized alternative to all of this. And,so that's really disappointing because, these are the people on your side and, now the two [00:57:30] nicest, most contributing open source devs that we know, are more hated than, like someone like Andrew Tate.
[00:57:37] Nisten Tahiraj: that just makes no sense at all. the, out of the five most blocked accounts Two of them are like the nicest people we know. So what is, something is pretty, pretty off. And, I'm also worried that in the AI community, we are in a bit of a bubble and not quite aware of,what people on our side are being communicated.
[00:57:58] Nisten Tahiraj: are being shown how this [00:58:00] stuff works, how open source, works because I'm pretty sure from their point of view, they're like, oh, here's another company just took all of our data and is just gonna train this porn bot with it and there's nothing we can do about it, but it's not like that.
[00:58:13] Nisten Tahiraj: Not a single company can own this data. It is public domain. We can't sue anyone else over the data. It's public domain in a public forum. You're supposed to have civil discourse because then the AI can also have civil [00:58:30] discourse and be reasonable and be like aligned to humanity. so now you have a bunch of people just giving, death threats and they're okay because they're just angry.
[00:58:40] Nisten Tahiraj: So you can tell someone to go kill themselves just because you're angry. And, yeah, so that's not good. Like they're just not good. you should probably, yeah, anyway, so there is something for us to do as well, like we need to communicate better, what does open source do versus what having a single company.
[00:58:58] Nisten Tahiraj: Own all that data and [00:59:00] have it as their property. because I feel like most of the general public doesn't really understand this.
[00:59:06] Nisten Tahiraj: yeah, that's it. I was just, okay. Just really quickly. Sorry. I went on too long, but after going through war in the Balkans as a kid, I didn't think people would be getting death threats for an open source dataset.
[00:59:17] Nisten Tahiraj: It's this is just completely beyond, It's absolutely unhinged. yeah, this is just completely off.
[00:59:23] Wolfram Ravenwolf: Unhinged. Just one thing, those people even think that now the thing is over, so the dataset has been [00:59:30] removed, okay, it's done, but you can get a new one anytime. The platform hasn't changed. They have to realize that.
[00:59:37] Alpin Dale: funny it mentioned that because they started blocking me for the explicit reason of, the user started blocking me for the explicit reason of stopping me from scraping their posts, as if I need my account to do that.
[00:59:49] Alex Volkov: Yeah, I think that there's, a lot of misunderstanding of, what's actually, happening.
[00:59:54] Alex Volkov: And how, which is fine, I completely empathize of people's misunderstanding of [01:00:00] technology, and thus fear, I get this I get the visceral reaction, I get,I don't like multiple other things about this, I don't like the, the absolute, horror mob. And the death threats, I don't like the platform reacting as it did, and like blocking completely, those things don't make sense.
[01:00:14] Hey, this is Alex from the editing studio. Super quick, about two hours after we recorded the show, Alpin posted that the moderation team at BlueSky emailed him and his account was in fact reinstated. He didn't ask them to. [01:00:30] They revisited their decision on their own.
[01:00:32] So either a public outcry from some individuals on the platform. Hopefully they listened to our show. I doubt they did. Um, but they reversed their decision. So I just wanted to set the record straight about that. He's back on the platform. Anyway, back to the show.
[01:00:48] Alex Volkov: Alright folks, unfortunately though, we do have to move on, to better things, and I'll give my other co hosts like a little five, five to seven minutes off, to go take a break. Meanwhile, we're going to discuss [01:01:00] this week's buzz.
[01:01:00] This Week's Buzz: Weights & Biases Updates
[01:01:00] Alex Volkov: Welcome to this week's buzz, a category at ThursdAI, where I talk about everything that I've learned or everything new that happened in Weights Biases this week. And this week, I have a colleague of mine, Thomas Capelli, [01:01:30] from the AI team at Weights Biases. We're now the AI team. This is new for us. We're Thomas, how, do you want to introduce yourself super brief for folks who've been here before, but maybe one more introduction for folks who don't know who you are.
[01:01:43] Thomas Capele: Yeah, I'm Thomas. I work with Alex. I'm in the AI Apply team at Weights Biases. I train models, I play with models on API, and I try to make my way into this LLM landscape that is becoming more and more complex. Try to avoid [01:02:00] getting roasted on the internet. And yeah, trying to learn from everyone. Thank you for the meeting.
[01:02:06] Alex Volkov: So you're going by Cape Torch, I'm going to add this as well on X as well. I don't know what you're going off as,on Blue Skies, same Cape Torch. I invited you here, and I think let's do the connection from the previous thing as well. A lot of toxicity we talked about just now, a lot of like toxic comments as well.
[01:02:23] Alex Volkov: and we're, we both work at Weights Biases on Weave. Weave is our LLM observability tool. [01:02:30] I showed off Weave multiple times on ThursdAI, but I will be remiss if I don't always remind people, because we have a bunch of new folks who are listening, what Weave is. Weave is an LLM observability tool. So if you're building as a developer, Anything with LLMs on production,you need to know what's going on, what your users are asking your LLM or what your LLM gives as responses, because sometimes imagine that your users are, let's say copy pasting, whatever comments, people just gave [01:03:00] Daniel and Alpin and they pasting it to them to do categorization, for example, and some of these like, Very bad things that we just talked about are getting pasted into the LLM and some of the LLM responses are maybe even worse, right?
[01:03:13] Alex Volkov: so maybe your application doesn't handle this. Maybe your application responds even worse and you want to know about this. and, the way to see those, some developers just looks at logs. we have a tool. That is way nicer. And, this is just some of the things it does. but this [01:03:30] tool is called Weave.
[01:03:30] Alex Volkov: it, it traces everything that your application gets as an input from users and also outputs. but that's not all it does. So it also allows you to do evaluations. And, recently Thomas and, has been working on, multiple things, specifically around scoring and different things. Thomas,you want to maybe give us a little bit of.
[01:03:47] Alex Volkov: Yeah, I think you,
[01:03:48] Thomas Capele: you described pretty well. Yeah, as I know, you have showed Weave and the product we have been working for a while, multiple times here, but it's, I would say it's mostly core feature is [01:04:00] actually building apps on top of LLMs and having observability and yeah, standard code, we have unit tests and for LLM based applications, we need like evaluations, actual evaluations on data we have curated.
[01:04:13] Thomas Capele: And it's, we have been doing this in the ML world for a while, but as we are merging with the software engineers that. Maybe don't know how to integrate this randomness from the LLMs in the, in their applications. Yeah. you need to actually compute evaluations. And that means gathering [01:04:30] data, still labeling a lot of stuff manually to have high quality signal.
[01:04:35] Thomas Capele: And then, yeah, iterating on your prompts and your application that, that's making API calls with scores, with metrics that gives you confidence that we are not like screwing up. And as you said, like I've been working recently on adding, we added a bunch of scores, default scores. We've a couple, yeah, like a month ago with Morgan, we spent like a week building those.
[01:04:58] Thomas Capele: and recently we have been like, [01:05:00] yeah, looking at stuff like toxicity and hallucination and yeah, context and bias detection, and there's multiple of them that are LLM powered, like the ones you are showing on the screen right now, like You have an LLM that it's actually prompt in a certain way, and you maybe build a system that requires like a couple of LLM prompt with structured output to actually get the scores you were expecting,and then this thing should be able to give you, yeah, a good value of the [01:05:30] scoring if it's hallucinating, if it's a toxic, actually the mall providers like OpenAI and Mistral and Anthropic, I think have an API exactly for moderation.
[01:05:41] Thomas Capele: So yeah, you can use also that and they are actually pretty good and fast and pretty cheap compared to the completion ABA. And we have no, what I've been doing this week and the last couple of weeks where I've been trying to build really high quality, small, non LLM powered scores. So example that you want to create a toxic, [01:06:00] detection system.
[01:06:00] Thomas Capele: Yeah. what can you do? Yeah, you could find a small model that it's not an LLM or it was an LLM a couple years ago. Now, like BERT, we don't consider BERT an LLM.
[01:06:09] Alex Volkov: Yeah.
[01:06:10] Thomas Capele: yeah. I've been fine tuning the BERT task and checking like this new high end phase, small LLM2 models, trying to adapt them to the task.
[01:06:18] Thomas Capele: Yeah. yeah, like good challenge, good engineering questions, like creating, there's plenty of high quality data set on HangingFace that people have been creating from multiple places, from Reddit, and [01:06:30] like these have been serving us to actually build this high quality classifiers that are capable of outputting and flagging the content that we're interested in.
[01:06:40] Alex Volkov: So here's what I, here's what I'll say for folks, just to like highlight what we're talking about. Weave itself. is a toolkit that you can use for both these things. You can use it for logging and tracing your application, which is what it looks like right now. You basically add these lines to your either Python or JavaScript application, JavaScript type of application, and we will help you track [01:07:00] everything your users do in production.
[01:07:01] Alex Volkov: Separately from this, You want to continuously evaluate your application on different set of metrics, for example, or scoring them on different set of metrics to know how your LLM or your prompts are doing, right? So you guys know that, like for example, before on the show we talked about, hey, here's this new model, the, qu quill, for example.
[01:07:20] Alex Volkov: And you know that wolf from, for example, tested it on MMU Pro. Those are generic evaluations. MMU Pros, those are evaluations that somebody built specifically for. [01:07:30] Something big. Look, there's a set of questions that somebody builds something big. specific scorers for your type of application, something that you build for your type of applications.
[01:07:38] Alex Volkov: and then people asked us as Weights Biases, Hey, okay, you give us a generic toolkit, an unopinionated toolkit, but can you give us some opinion? Can you give us some opinion? And basically this is what like Weave Scorers is. This is like an additional package that you can install if you want to,like additionally, right?
[01:07:55] Alex Volkov: Thomas, help me out here, but you can add this. The ones we're
[01:07:58] Thomas Capele: building right now, they're not yet [01:08:00] there. They will be probably in a certain future. Yeah. We need to test them correctly. And it's we're an experiment tracking company at the beginning. We're going to like, want to share the full reproducibility.
[01:08:10] Thomas Capele: Like this is the data, this is how we train them. there's different versions. It's scoring metrics we get, so you like have confident that they work as expected.
[01:08:18] Alex Volkov: So this is to me very interesting, right? So I came in as a previously software developer and now as like an AI evangelist, like I came in from like this side and I meet all these like machine learning engineers, experiment tracking folks who are like, okay, [01:08:30] now that we've built this like LLM based tool, observability tool, many people are asking us to do what Weights Biases does on the model side, on the Weights Biases side.
[01:08:37] Alex Volkov: Hey, Use everything from your, immense knowledge of tracking and doing experimentation to bring this over to the LLM side. Okay, now that you have all this data, now that companies are tracking all the data, how to actually, do experimentation on the front side. Thomas, last thing I'll ask you here before I let you go, briefly is about guardrails specifically.
[01:08:56] Alex Volkov: So there's this concept that we're going to talk about. We're going to keep talking about this [01:09:00] called guardrails. So we're talking about scorers. Scorers are basically the way to check your application. Just a model.
[01:09:05] Understanding Scoring Models
[01:09:05] Alex Volkov: Like
[01:09:06] Thomas Capele: I would define like score is just a model. It takes an input, produce an output.
[01:09:11] Thomas Capele: It could be simple. It could be complicated. Like a scoring, the simplest scores could be accuracy. if the prediction is equal to the label, like a complex score, it could be like an LLM power score that. Check that the context you retrieve from your RAG application, it's not like the response is not [01:09:30] hallucinated or is factually consistent with the original context.
[01:09:33] Alex Volkov: So like HallucinationFreeScorer, for example, is one score for folks who are listening. whether or not the response that your RAG application returned, Has hallucinations in it. Or,yeah, it's
[01:09:44] Thomas Capele: very it's very detailed. And you will probably need to refine all of this for your specific application because everyone has slightly definition and slightly needs, slightly different needs for their application.
[01:09:55] Thomas Capele: So yeah, you may need to tune everything, but this is like a good starting point.
[01:09:59] Guardrails in LLM Development
[01:09:59] Thomas Capele: [01:10:00] So yeah, I find it very interesting that you mentioned guardrails. I would say like a guardrail is. Also a model that predicts, but it's need to be really fast and it needs to be, it needs to take actions, maybe change the output, like any of these scores don't change your output.
[01:10:19] Thomas Capele: Like they will. Computer score, but they will not change the output. if you have IPAI's guardrail, it should, I don't know, redact stuff that [01:10:30] shouldn't pass. So it should change the output, like the payload you are getting from the API. So like guardrails are more like online, and these are more like, offline.
[01:10:41] Alex Volkov: So that's a good boundary to do. And I think we'll end here, but this is basically an exception for going forward, folks. I will tell you about guardrails specifically.
[01:10:48] Guardrails in Production
[01:10:48] Alex Volkov: It's something we're getting into, and I'm going to keep talking about guardrails specifically, because I think that this is a very important piece of developing LLMs in production.
[01:10:57] Alex Volkov: How are you making sure that the [01:11:00] model that you have online is also behaving within a set of boundaries that you set for your LLM? obviously we know that the big companies, they have their guardrails in place. We know because, for example, when you, talk with, advanced voice mode, for example, you ask it to sing, it doesn't sing.
[01:11:14] Alex Volkov: there's a boundary that they set in place. when you develop with your LLMs in production, your guardrails, the only way to build them in is in by prompting for example there's other ways to do them and we are building some of those ways or we're building tools for you to build some of those ways [01:11:30] and like thomas said one of those guardrails are changing the output or like building ways to prevent from some of the output from happening like PII for example or there's like toxicity detection and other stuff like this so we will Talking more about guardrails, Thomas with this, I want to thank you for coming out to the show today and helping us with scores and discussing about Weave as well.
[01:11:50] Alex Volkov: And, I appreciate the time here, folks. You can find Thomas on, X and on, and on BlueSky, under CapeTorch. Thomas is a machine learning engineer and, [01:12:00] developer AI engineer as well. Does a lot of great content, Thomas. Thank you for coming up. I appreciate you. He also does amazing cooking as well.
[01:12:06] Alex Volkov: Follow him for some amazing gnocchi as well. Thanks, Thomas. Thomas, thank you. Folks, this has been this week's Bugs, and now we're back. Good job being here. See you guys. See you, man. And now we're back to big companies and APIs.[01:12:30]
[01:12:33] Alex Volkov: All right. All right. All right. We are back from this week's buzz, folks. Hopefully, you learned a little bit about scores and guardrails. We're going to keep talking about guardrails, but now we have to move on because we have a bunch of stuff to talk about specifically around big companies and APIs, which had A bunch of stuff this week as well.
[01:12:51] OpenAI Leak Incident
[01:12:51] Alex Volkov: I wanna talk about, the leak. You guys wanna talk about the leak, this week? open the eye had a big, oh my God. Oops. Something big [01:13:00] happened. but nothing actually big happened, but look to some extent, this was a little bit big. at some point, this week, a frustrated participant in the open ai, how should I say, test.
[01:13:12] Alex Volkov: Program for Sora decided to quote unquote leak Sora and posted a hug and face space where you could go and say, Hey, I am,I want this and this. And you would see a Sora video generated and, yeah, we can actually show some videos. I think, this is not against any [01:13:30] TOS, I believe. and, Yeah, this wasn't actually a leak. What do you guys think? did you happen to participate in the bonanza of, Sora videos, Wolfram or Yam? Did you see this?
[01:13:40] Wolfram Ravenwolf: I saw it, but I didn't, try to go to the link.
[01:13:43] Alex Volkov: No.
[01:13:44] Sora Video Leak Reactions
[01:13:44] Alex Volkov: so basically, some very frustrated person from,the creative minds behind Sora behind the scenes, decided to like, Leak SORA, the leak wasn't actually the model leak like we would consider a model [01:14:00] leak.
[01:14:00] Alex Volkov: the leak was basically a hug and face application with a request to a SORA API with just the keys hidden behind the hug and face. we're showing some of the videos. I'm going to also add this to,to the top of the space for you guys as well. The videos look pretty good, but many of the folks who commented, they basically said that, compared to when Sora just was announced, where all of [01:14:30] us were mind blown completely, now the videos, when you compare them to something like Kling, or some of, Runway videos, they're pretty much on the same level.
[01:14:41] Alex Volkov: And, I, they look good. They still look very good. look at this animation for example. It looks very good still And apparently there's like a version of Sora called Sora Turbo. So these videos are like fairly quick, but Like folks are not as mind blown [01:15:00] as before yeah Some of the physics looks a little bit better than Kling etc, but it feels like we've moved onand this is something that I want to talk to you guys like super quick.
[01:15:09] Alex Volkov: we're following every week, right? So we get adapted every week, like every,the Reasoning Model Formula 1 blew us away. And then R1 came out and now we run this on our models due to Quill. So we're used to getting adapted to this. the video world caught up to Sora like super quick.
[01:15:24] Alex Volkov: Now we can run these models. There's one open source one like every week. These videos [01:15:30] don't blow us away as they used to anymore and,why isn't OpenAI releasing this at this point is unclear because if you could say before, elections, you could,you can put down Trump and Kamala Harris in there, Now, what's the reason for not releasing this and not giving us this thing?
[01:15:47] Alex Volkov: anyway, yeah, this video is pretty cool. There's one video with, a zoom in and somebody eating a burger. yeah, leak, not leak, I don't know, but, thoughts about the sourcling? What do you guys think about the videos and, the non releasing, things? Folks, I want to ask, Nisten, [01:16:00] what do you think about those videos?
[01:16:01] Alex Volkov: Do you have a chance to look at them?
[01:16:03] Nisten Tahiraj: I was going to say, by the way, I was going to say the exact same thing you did, that it's just been so long now, what, a few, a couple of months since they announced it? I think it's more than
[01:16:14] Alex Volkov: a couple of months, I think half a year, maybe, yeah.
[01:16:16] Nisten Tahiraj: Yeah, it's over half a year that so much happened that we're no longer impressed.
[01:16:22] Nisten Tahiraj: And I'm just trying to be mindful of that, that things are still moving fast. And, they haven't stopped [01:16:30] moving. Like we've seen a whole bunch of models start to get close to this now. it's still better, I would say it's still better than most of, what's come out in the last six months. but,yeah, we're getting pretty close.
[01:16:41] Nisten Tahiraj: I think they haven't released it mainly because of, weaponized litigation,that's the main thing.
[01:16:45] Alpin Dale: Yeah.
[01:16:45] Nisten Tahiraj: Holding them back and, uh.yeah, companies in other countries don't have that problem as, as much, so they were able to, to advance more, like while still being respectful tothe brands and [01:17:00] stuff, but, yeah, I think the main reason is, people are just going to try and nitpick any kind of,of, attack vector to, to, to sue them.
[01:17:08] Nisten Tahiraj: For it. So that's probably why
[01:17:10] Alex Volkov: Yeah. Everything open AI will Yeah. Will get attacked. That I fully agree with you on this. Yeah. speaking of, let's see, do we have anything else from OpenAI? I don't believe so. Yeah. the other one thing that I wanted to show super quick is that the new Chad GPT now is also y I'm gonna show this super quick on the thing, is also now [01:17:30] supporting cursor.
[01:17:31] Alex Volkov: So now, the NutriGPT app is supporting the Cursor app, so now you can ask what I'm working on in Cursor, and if you hover this, you can actually see all of my, including env, You can actually see my secrets, but, you can ask it, you can ask it about the open, open queries. And why would I, if I have Cursor?
[01:17:49] Alex Volkov: That's the question, right? Cursor supports O1, because, I have unlimited O1 queries on ChaiGPTN, whereas I have like fairly limited, queries for O1 in Cursor. and generally [01:18:00] That's been pretty good. That's been pretty cool. You can ask it about the stuff that you have open. There's a shortcut I think it's option shift 1 on Windows and you can enable this and basically you then start chatting With the open interface in the one.
[01:18:13] Alex Volkov: We tested this a couple of weeks ago if you guys remember and I found it super fun. I don't know if you guys used it since then or for those who use the Mac version of, of ChatGPT. I find it really fun. So folks in the audience, if you're using the macOS app and you are connecting this to Cursor or to the terminal, for [01:18:30] example.
[01:18:30] Alex Volkov: Unfortunately, I use the warp terminal and they still don't have warp. they have iTerm here and other things. if you use PyCharm or other, JetBrains, they also started supporting those.but I specifically use Courser and now there's a support for Courser, supports for Windsurf, which is another thing that we didn't cover yet.
[01:18:46] Alex Volkov: And I heard amazing things. And I hope, hopefully over the Thanksgiving break, I will have to, have a chance to use Windsurf. but yeah, this is from, OpenAI and we were waiting for some more news from OpenAI, but we didn't get one. So hopefully the folks at [01:19:00] OpenAI will get a Thanksgiving break.
[01:19:02] Alex Volkov: Just a small reminder. I looked a year ago, if you guys remember the Thanksgiving episode we had a year ago. We were discussing the control alt deletemen weekend where Sam Altman was fired and then rehired. That was the Thanksgiving episode of last year. You guys remember this? last year we discussed how Sam Altman, and Greg Brockman were shanked and, the coup from Ilya.
[01:19:26] Alex Volkov: You guys remember? It's been a year. It's been a year since then. This was the [01:19:30] Thanksgiving last year. and, yeah, it's been a year since then. which by the way. Next week is the one, the two year anniversary of JGPT as well. So we probably should prepare something for that. so that's on the OpenAI News.
[01:19:43] Alex Volkov: let's super quick talk about this.at some point There's this, the sayings from Space Uncle is, they need to be studied in an encyclopedia. somebody tweeted, I don't understand how game developers and game journalists got so ideologically captured. [01:20:00] Elon Musk tweeted and said, Too many game studios are owned by massive corporations.
[01:20:03] Alex Volkov: XAI is going to start an AI game studio to make games great again.and I'm like, and please unmute if you're muted and laughing, because I want to hear, and I want the audience to hear that both PicoCreator and Nisten are just like laughing out loud at this. It's XAI with all of their like 200, H200, 200, 000 H200s, like the best, the fastest ever growing massive [01:20:30] Memphis, super cluster, they're going to build games like, what are they really going to actually.
[01:20:34] Alex Volkov: Have a gaming studio in there. Like we know he is, Elon is a, I don't know the best Diablo game player in the world right now. I don't know how the f**k
[01:20:43] Nisten Tahiraj: he's, he is fourth or 20th or,
[01:20:45] Alex Volkov: yeah, he was 20. I think he's at some point he got number one recently, or something. I, we know, we all know he's a gamer.
[01:20:51] Alex Volkov: Kudos. I really, I'm not making this up. Like I'm really have no idea how the f**k you can be like the best Diablo player in the world doing all these other stuff [01:21:00] and. I get the sentiment of okay, let's make games. Great. Turning in the eye company, the games company, how the,what?
[01:21:08] Alex Volkov: Ah, I just want to turn to this.
[01:21:12] Eugen Cheugh: I love most. It's just a massive corporation, XAI with billions of dollars of funding. It's going to be not a messy corporation.
[01:21:23] Alex Volkov: Yeah, this is not necessarily AI related necessarily,we are expecting big things from XAI, specifically around GROK [01:21:30] 3.
[01:21:30] Alex Volkov: Hopefully December, that's the date that they've given us. They have a hundred thousand H100s turning away and building something. We know that this was like announced. we know that Elon promises and doesn't deliver on time, but delivers at some point anyway. We know that they have. very good folks behind the scenes.
[01:21:47] Alex Volkov: We know this, we've seen this before. We know that, infrastructure is something they're building out. They're building out enterprise infrastructure for APIs. we've seen the X, AI API layer building out. We've seen the kind of the [01:22:00] X,infrastructure. Sorry, enterprise infrastructure for, the building layer.
[01:22:03] Alex Volkov: We've seen all this, getting prepared. Like we've talked about this, we're getting to the point where X is going to be another player, competing another player versus Google, OpenAI, Anthropic, etc. GRUG3 is going to be something significant to contend with. and like the amount of GPUs are there.
[01:22:22] Alex Volkov: It's just is this a sidetrack? this is basically my question.
[01:22:25] Nisten Tahiraj: it, so Uncle Elon tends to be like very [01:22:30] impulsive as we've seen, so if he spends a lot of time on something he's gonna start getting obsessed with it. So there's that. In order to have a gaming service, you will need a lot of GPUs, and I'm pretty sure at this point, if they want to do cloud gaming or streaming, they probably have more GPUs than PlayStation.
[01:22:49] Nisten Tahiraj: they might actually just have more right now. they're like, we can probably Support that, and so much for the Department of Government Efficiency, now we're all [01:23:00] just going to be streaming games.
[01:23:05] Nisten Tahiraj: But there is, there's also Another lining to this is for, for a while, for the last 10 years, there was an article about 10 years ago that the E3, I don't think that's a thing anymore, but the E3 gaming conference had a SpaceX booth over a decade ago and SpaceX was actively recruiting for the E3. I think to quote, it was, programmers of physics engine, and the [01:23:30] rumors were that they were going after the ones who made the Steam Havoc 2, like the one in Portal, and the ones that worked on the, Unreal Tournament physics engine.
[01:23:40] Nisten Tahiraj: And this was over 10 years ago, and those people, those programmers, were recruited by SpaceX. like, when you see, the Falcon Heavy, 2, 3, 4 rockets, just like Go dance in midair and land like they're in a video game is because, the people that made the simulation very likely worked on game engines.
[01:23:58] Nisten Tahiraj: So it might be [01:24:00] a hiring angle from him, or it might just be Angelino playing a lot of games and he just wants to know. there is an angle
[01:24:07] Alex Volkov: for gaming as a playground for training. Like a GI, whatever, like open AI obviously had, like trained robots in this area. we saw many papers for like agents running wild in a game constrained environments.
[01:24:19] Alex Volkov: There, there could be an angle there for sure. I just, this doesn't feel like, this feels like an impulsive, hey. make f*****g games great again.
[01:24:26] Anthropic's Model Context Protocol
[01:24:26] Alex Volkov: Alright, moving on, unless we have another comment here, moving on to [01:24:30] I really wanted to discuss the, super briefly the, Model Context Protocol from Anthropic.
[01:24:36] Alex Volkov: because this kind of blew up, but it's not ready yet. I saw a comment from Simon Wilson, you guys know Simon Wilson, the friend of the pod, he'd been here multiple times. basically he covered this. super quick, Anthropic released this new protocol, which they hope to standardize and by standardize, they mean Hey, let's get around this.
[01:24:53] Alex Volkov: Okay. So let's talk about a standard in the industry right now, the OpenAI SDK for Python. That's a [01:25:00] standard way to interact with LLMs. Pretty much everybody supports this, including Gemini. I think the only one who doesn't support this is Anthropic actually. So in Python, if you want to interact with any LLM, Literally any provider in LLM, including OpenRouter, like Google, OpenAI themselves, like pretty much everyone else, like including together, like all of the, all of those, you can replace one line of code in the OpenAI API, OpenAI Python SDK, where you just put a different URL in there, and then this is the standard way to talk to [01:25:30] LLMs.
[01:25:30] Alex Volkov: I think for TypeScript, JavaScript, it's pretty much the same.so it looks like Anthropic is trying to do something like this to standardize around how LLMs are connecting with other applications. So right now, just a minute before I showed you how ChatGPT is connecting to like a VS Code for something.
[01:25:49] Alex Volkov: They built those integrations themselves. So you would install a specific extension in VS Code in etc. And that extension That they've built [01:26:00] talks to the ChatGPT app on the Mac OS that they've built and they build this connection for you. This is not what Anthropic wants to do. Anthropic wants to create a protocol that like developers, other developers can build on their own to allow the LLM to talk to any application and you as a developer, I as a developer, other developers can build those Communication layers, and then whatever LLM, in this case, this is the Anthropic, Claude desktop app, this could be the JGPT app, could be the Gemini GPT app, [01:26:30] Gemini app, et cetera, could talk to other applications.
[01:26:32] Alex Volkov: What those other applications are? Anything. Anything on your desktop, anything. At all. So they build this kind of a first standard, communication via JSON RPC. And I think they're buildingother ways, and other servers. I think this is a way to summarize this, basically.
[01:26:50] Alex Volkov: this is a open preview. Nisten, you want to take another crack at trying to recap this? Or Yam or Wolfram, you guys want to? You want to give me your thoughts on this super quick? As far as I understand from [01:27:00] Simon, this is like still robust and still in,in, in flux.
[01:27:03] Nisten Tahiraj: I think this might end up being a much bigger deal than we, we first expect, because it is an interoperability layer, and as a developer, you will have to learn this.
[01:27:15] Nisten Tahiraj: it is annoying at the moment that, While proposing a standard, Anthropic is not showing willingness to abide by one, which most people chose, and even Google was forced to support the OpenAI standard. if you [01:27:30] want people to come with your standard, to abide by your standard, you also have to show willingness to abide by others.
[01:27:36] Nisten Tahiraj: that's not going to work here until Anthropic Just supports a plug and play OpenAI API, so I just put their models in, but that aside. The criticism aside,this is pretty, pretty important. So I've been doing some of this stuff and just trying to do it with basic JSON. So I think that's,it's very good.
[01:27:55] Nisten Tahiraj: And yeah, it's pretty hard to know, am I on Mac? Am I on Linux? Am I on a phone? [01:28:00] What's the LLM going to talk to? what does this app even want me to do? Do I have to emulate this on the screen and then click on it? Can't it just give me a JSON so that I can click on it so it's a lot easier for me?
[01:28:11] Nisten Tahiraj: And this will also apply to websites and, and web apps after to you. Offer some kind of a JSON RPC. An RPC is just like an API for people. It's just an application programming interface. It's something you query, like you write a curl to this [01:28:30] IP and here's my API key and give me, or here I'm going to give you this stuff and give me this stuff.
[01:28:37] Nisten Tahiraj: From the database or whatever. So this is this actually extremely important because you can apply to, to web apps as well. And it's a way to manage multiple sessions. So I think it's a pretty big deal, even though I am. No. And anthropic, it this,yeah. I think that this is gonna become much, much more important because it saves a lot of bandwidth.[01:29:00]
[01:29:00] Nisten Tahiraj: Instead of you having to run a visual language model to show the whole screen, to run it on an emulator, to have to click on it and move around. And it's so compute intensive. It's can you just gimme like a adjacent API, so I can just like,
[01:29:13] Alex Volkov: yeah, do
[01:29:13] Nisten Tahiraj: a constraint output to adjacent and just output three tokens.
[01:29:16] Nisten Tahiraj: Be done with the whole thing. so yeah. Yeah, it's, I think it'll become a big deal.
[01:29:21] Alex Volkov: So in the spirit of, of the holiday, thank you and tropic for trying to standardize things, standardize, often, sometimes it's annoying, but often leads to good things as [01:29:30] well. folks, should try out the MMCP and definitely give them feedback.
[01:29:34] Alex Volkov: but yeah, they should also abide by some standards as well. It looks like the industry is standardizing around the. OpenAI SDK, and they maybe should also, it would help.
[01:29:43] Wolfram Ravenwolf: It's a new thing that they are doing because, so far we usually had the LLM as a part in an agent pipeline where you have, another process called the LLM with some input.
[01:29:52] Wolfram Ravenwolf: And here we have the LLM going out to get. The input itself. So I think that is also in the agent context, very important and [01:30:00] more integration is always better, but of course it's a new thing. We have to develop all those servers as I call it. So a lot of reinventing the wheel, I guess we'll see if it can really persevere.
[01:30:12] Alex Volkov: Yeah, one example that they highlight, and Simon talked about this as well, is that if you have a database, a SQLite database that sits on your computer,the way to have So you guys know we, we talked about tool use, for example,via API, those models can Get respond with some, some idea of how to use your [01:30:30] tools.
[01:30:30] Alex Volkov: And you, as a developer, you are in charge of using those tools. You basically get in response a structure of a function call. And you're like, okay, now I have to take this and then go to an external tool and use this. This is connecting this piece forward. This is basically. Allowing this LLM to then actually go and actually use this tool.
[01:30:48] Alex Volkov: Basically like getting a step forward. And one, one example that they're showing is a connecting to a database, allowing this LLM to connect to a database via a sq lite like MCP server. the model compute [01:31:00] protocol server. cps, sorry. yeah. So connecting via this MCP server,you basically allowing LM to read from this database.
[01:31:08] Alex Volkov: Itself without like returning a call. And then you are in charge as a developer to go and do the call return it responses. so basically trying to, allow LLMs to connect to different services. Yeah. And this, I think I agree with you with more work in here. this could be big.
[01:31:24] Nisten Tahiraj: It could literally make like over a thousand times more compute efficient to automate [01:31:30] something on a screen. Because instead of using a visual language model frame by frame, you can just have a JSON.
[01:31:37] Alex Volkov: Let's talk about Like literally
[01:31:38] Nisten Tahiraj: over a thousand times. Let's compute to do it. So I'm going to, I'm going to take a longer look at it as well.
[01:31:46] Alex Volkov: speaking of automating things on the screen,
[01:31:48] H runner from H the french AI company
[01:31:48] Alex Volkov: let's talk about the next thing that we want to talk about, H company AI. This is the next thing in big companies and APIs, H company from. France, this is another big company. So [01:32:00] we know Mistral is from France. some, DeepMind, some folks is from France as well.
[01:32:04] Alex Volkov: there's also FAIR in France from Meta. now France is positioning themselves to be one big kind of hub from AI as well. H Company. raised, fundraised, I think, 250 I have in my notes. Yeah, 220, one of the biggest, seed rounds. 220 million dollars, one of the biggest ones in, in the history of, French seed rounds, a while ago.
[01:32:24] Alex Volkov: And they just showcased their Runner H. Their Runner H [01:32:30] is, they're competing with Claude on speed of computer use. I apologize for this. Let's take a look at how fast they're claiming they're opening a browser, going to recipes and providing recipes for something. On the right, we have Claude, Computer Use.
[01:32:46] Alex Volkov: Claude is basically, hey, open the browser. On the left, they already pulled up a browser and already extracting data. So basically they're claiming A speed up of maybe two to three times over cloud computer use. [01:33:00] And they're basically showing while Claude still pulls up the Firefox browser, they have already completed the task, extracted the data and already responded to the user.
[01:33:09] Alex Volkov: they're showing steps by steps comparison, which I don't think is necessarily in, Apple's to Apple's comparison. I don't think it's necessarily fair, but. There's a big but here, big French, but, I don't know how to say, sorry, Nisten, I don't know how to say but in French, but there's a big one.
[01:33:25] Alex Volkov: Their models, as far as I could see, and I did some research, they have [01:33:30] a, they say this runner age thing that they have is powered by a specialized LLM, specialized optimist for function calling for 2 billion params. So whatever we see on the left is not like Claude, which whatever, we don't know the size of Claude, this is like a 2 billion parameter model.
[01:33:45] Alex Volkov: and, integrates in the VLM of a 3 billion parameter model to see, understand, interact with the graphical and text interface. Let's look at another example here. they're basically browsing the web and like doing extraction and yeah, I don't think you guys can see it. maybe like this.[01:34:00]
[01:34:02] Alex Volkov: It's literally, they're going to Wolfram Alpha and extracting and doing this task. They're basically asking Wolfram Alpha to do a task. So it's not like they're just reading from things. They're finding input and they're like plugging things in there and like responding, reading from the output from Wolfram Alpha as well.
[01:34:18] Alex Volkov: this runnerage thing actually performs tasks on the web. Extracts information back way faster than Claude Computerius, which Claude Computerius, let's give it its place. We were very excited when it came [01:34:30] out, and it does very well for, for just an adjustment of Claude. and they are showing immense differences in five steps, and we're still waiting for Claude Computerius to like, Try to figure this out.
[01:34:42] Alex Volkov: So did you
[01:34:43] Nisten Tahiraj: say it's a separate to be model? And then there's another? That's what I found
[01:34:48] Alex Volkov: from them. Yeah. Yeah. They said that they have, let me see if I can find the previous announcement. Yeah. Yeah.
[01:34:54] Wolfram Ravenwolf: The previous announcement
[01:34:56] Alex Volkov: that they have, that we missed from last week, Introducing Studio, a [01:35:00] automations at scale, run or age the most advanced agent to date.
[01:35:04] Alex Volkov: That's what they said last year. Powered by specialized LLM, highly optimized for function calling, 2 billion parameters. It also integrates a specialized VLM, 3 billion parameters, to perceive, understand, and interact with graphical and text elements. Delivers the state of the art on the public WebVoyager framework.
[01:35:20] Alex Volkov: And this is the graph that they have. WebVoyager, they have Runner H01. At 66 percent maybe? And, and [01:35:30] then, Cloud Computer Use at 52 percent and Agent E, I don't know where it is, it's like here. Yeah, so like the size of it is what's the most impressive part.
[01:35:41] Nisten Tahiraj: Yeah, I'd say this is impressive. as to what they're doing.
[01:35:44] Nisten Tahiraj: we can guess what model they're using, but it doesn't matter all that much. I just wanna say that it's not an apples to apples comparison with cloud because cloud is an entire OS in there and you can use whatever you want. It can use blender, it can, [01:36:00] you can run a virtual box of Windows 95 and it will use that as well.
[01:36:04] Eugen Cheugh: so the, yeah, it's not. That's not a pure example, whereas in this one, I'm assuming they do need access to the document object model, the DOM of the website, to be able to navigate it, but The results do indeed seem impressive, and it's at a size that you can run it, you can run on your own, Yeah, because if you're measuring steps and speed, actually, I think, Anthropic Cloud should, probably, partner with [01:36:30] a company like Browserbase, and just, do a demo, and then see how close they get instead. It will skip literally the first eight steps or something like that, which is all just the OS booted up.
[01:36:40] Alex Volkov: Yeah, this is why I didn't love the comparison specifically, you guys are right, it's running a janky Docker with Firefox, and by the time, it loads Firefox, these guys already loaded the website, so it's not like necessarily apples to apples, but it looks like those models are tiny compared to Claude, and also, they talk about, It's beyond [01:37:00] optimizing agent performance, they're like, they have, optimizing web interactions.
[01:37:05] Alex Volkov: they engineered Runaways to handle any web interactions. Advancing towards one singular mission, automating the web, so they're focused on web. So Eugene, like what you're talking about, like browser based with computer use, it looks like this is their focus, whereas computer use is, for computer use, generic.
[01:37:22] Alex Volkov: This is like their focus for web interactions. I guess what I'm saying is it's exciting. they raised a boatload of money, the folks behind [01:37:30] there, they seem like very,adept, I, I know they're based in France, Wolfram. I don't know, Wolfram, you're asking if, if I'm sure they're France.
[01:37:36] Alex Volkov: yeah, they're based in France, and, Yeah, we'll see. They're waitlisted. I haven't tested them out. I know that some folks collaborated on them already and posted some threads. so we'll hopefully, we'll see if I get access to this. I'll tell you guys and we'll play with it. Absolutely. definitely exciting in the world of agents.
[01:37:54] Alex Volkov: I think this is it from big companies. Folks, what do you think? Anything else From big companies, nothing from Google after the [01:38:00] releases of last week where they reclaimed the throne. Hopefully they're getting their deserved breaks and and relaxing. I don't think this week was fairly chill.
[01:38:07] Alex Volkov: Probably the next week they're going to come back with a vengeance. Next week there's like the AWS re invent. Maybe Amazon will come with something. And then the week after RPS. Maybe some folks are waiting for that. I think that this is it in big companies. Let's move on to vision and video.
[01:38:22] Alex Volkov: And then, Oh, I think we're at two minutes. Folks, I think we're at time. I think we're at time. I got too excited that we have like a bunch of other things to talk about. [01:38:30] So let me maybe recap on our Thanksgiving super quick. the stuff that we didn't get to just to like to recap super quick. we didn't get to, but just to tell you guys what else we didn't get to, runway specifically.
[01:38:41] Alex Volkov: Oh yeah, I just, I have to show this. not to talk about this. Just just visually show this beautiful thing. If I can click this. If I can click this thing, yeah, Runway introduced an expand feature, if you guys haven't seen this, it's really fun to just watch. Let me just mute this. basically, [01:39:00] what you see above and below, Runway introduced an expand feature where you take a video and you give it, give this model and the model tries to predict it.
[01:39:08] Alex Volkov: in different ratio, what's above and below this video. So basically, if you give a video in the widescreen format, 16 by nine, and you could try to turn it into a 19 by six format. And so the model will try to fill in the frames. The general video model tries to fill in the frames of what's above and below.
[01:39:25] Alex Volkov: So what we're looking at in the video on the screen is like a Lord of the [01:39:30] Rings scene where Legolas rides one of those like elephant looking thingies. Basically, the model tries to fill in the, just the frames from above and below. It just looks a little bit creepy. it's funny looking, but it's like looks, interesting.
[01:39:45] Alex Volkov: so this is like one expand feature and the other one is they released an actual image model from Runway, which kind of looks interesting. it's called a frames and it's specific for image generation for [01:40:00] world building. and Confi UI desktop launched. I think that's pretty much it.
[01:40:05] Thanksgiving Reflections and Thanks
[01:40:05] Alex Volkov: Folks, it's time to say thanks, because it's Thanksgiving. I just wanted to start, but I wanted to hear from you as well. My biggest thanks this year goes to, first of all, everybody who tunes in to ThursdAI. Everybody who comes into the community, everybody who provides comments and shares with their friends and, and listens and,The second huge thanks goes to all of you.
[01:40:26] Alex Volkov: My co hosts here, Wolfram, Yam, Nisten, LDJ, Junyang [01:40:30] who joined us, Eugene who joined us as well. Zafari who joined us from time to time, like a bunch of other folks. huge thanks to you for being here from like week to week for more than like almost, we're coming up on two years. And I think the thirst, the third thanks goes to Jensen for the GPUs that he provided for all of us to enjoy those like amazing corn coffee of AI features around the world.
[01:40:51] Alex Volkov: just, yeah, just open up the mics and feel free to, to join the festivities even though I don't know any of you celebrate [01:41:00] Thanksgiving unnecessarily. But yeah, what are you guys thankful for? before we wrap up, let's do the Thanksgiving roundup.
[01:41:07] Eugen Cheugh: I'm giving thanks to open models.
[01:41:08] Eugen Cheugh: let's go. Yeah, no, proving that you do not need billions of dollars to catch up with GPT 4 despite what the big labs will say. The open teams, keep going, keep bringing open models to the masses.
[01:41:25] Nisten Tahiraj: Yeah, We had Thanksgiving last month in Canada. I would like to [01:41:30] give thanks to two particular creators, mahi and, tki. each have over a thousand models and, quants that they release. And, and also Mr. Der Backer, probably mispronounced that was, over 5,000, quantization of models.
[01:41:48] Nisten Tahiraj: this is the stuff I use every day in tell. Other people. So whenever something new comes up, I almost always expect them to have a good, well done quantization ready for [01:42:00] others to use. and they just do this as volunteers. I don't even think they're part of the, none of them are part of like even a big corporation, or have high salaries.
[01:42:08] Nisten Tahiraj: They literally just do it as volunteers. Yeah, I want to give thanks to those people in particular, and everybody else here, and all the people on Discord as well, who sit around and help you correct stuff, but yeah, that's it for me.
[01:42:27] Wolfram Ravenwolf: Okay, I have three. The first [01:42:30] is to Alex for the podcast, because it's amazing to be here.
[01:42:34] Wolfram Ravenwolf: It's my way to keep up with the stuff I can't keep up with. So thank you for having me. Thank you for doing this. Thank you very much. And the second is to the whole community of AI people, especially those who release all these stuff in the open. But everybody who contributes, everybody who does a good thing about it, I think it is furthering humanity.
[01:42:53] Wolfram Ravenwolf: So thanks for that. And the third is a thanks to every reasonable person who is not, Going to insights or stuff, [01:43:00] but it's open minded and, seeing that we are all in the same boat and we are all trying to make the world a better place in our different ways. And for being, accepting and understanding of this.
[01:43:11] Wolfram Ravenwolf: In this times, I think it's very important to keep an open mind.
[01:43:16] Nisten Tahiraj: Oh yeah, just really quickly to add on, the biggest thanks I think for this year goes to the DeepSeek and Qwent teams for just caring. up everybody [01:43:30] else when we stalled on progress they kept it up to like actually democratize the models for you to actually have this piece of artificial intelligence and own it and control it and be loyal and make it loyal to you yeah.
[01:43:47] Nisten Tahiraj: they actually enable people to, to run fully local models. Like 90% of what I use every day is just completely open source. Now, honestly, it w it, I wouldn't, it would not be there if it wasn't for them. It would probably maybe be like [01:44:00] 20, 30%. So,yeah, they, they really carried, like that's a gaming term, like someone who.
[01:44:06] Nisten Tahiraj: Carries the team. They have really carried, so yeah.
[01:44:11] Alex Volkov: Jan, go
[01:44:14] Yam Peleg: ahead. To Jensen for the GPUs, and
[01:44:17] Alex Volkov: to everybody
[01:44:18] Yam Peleg: else I'm hugging face. Especially people collecting and releasing datasets. I think they're not getting enough credits because you can't just use the dataset [01:44:30] without training a model. There is an effort.
[01:44:31] Yam Peleg: to, until you appreciate the dataset, but, they make it possible, everything else.
[01:44:39] Alex Volkov: Last thing that I have to, and this is not because I have to, but honestly, folks, huge thanks to Weights Biases for all of this, honestly, I wouldn't have been able to do this as my job without a few folks in Weights Biases, so thank you Morgan, thank you Lavanya, thank you a bunch of folks in Weights Biases.
[01:44:55] Alex Volkov: who realized this could be a part of my actual day to day and bringing you news from Weights [01:45:00] Biases, but also promoting some of the stuff. many of the labs, if not most of the labs that we talk about, are using Weights Biases to bring us the open source, but also the closed source LLMs in the world.
[01:45:10] Alex Volkov: I couldn't be More happy and be in a better place to bring you the news, but also participate behind the scenes in building some of these things. With that, thank you to all of you. Hopefully you go and enjoy some of the rest of your holiday. Those of you who celebrate, those of you who don't celebrate, this is, I think the first Thursday in a while that we didn't have any breaking news.
[01:45:27] Alex Volkov: I'm itching to press it anyway, but we didn't [01:45:30] have any breaking news, but hopefully we'll have some next week. There could be some news next week. We'll see. With that, thank everybody who joins, go and enjoy the rest of your day. And we'll see you here next week as always. Bye everyone. Bye bye.
[01:45:43] Alex Volkov: Bye bye. Bye bye. Bye bye. Bye bye. And we have [01:46:00] a
This is a public episode. If you’d like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe -
Hey folks, Alex here, and oof what a 🔥🔥🔥 show we had today! I got to use my new breaking news button 3 times this show! And not only that, some of you may know that one of the absolutely biggest pleasures as a host, is to feature the folks who actually make the news on the show!
And now that we're in video format, you actually get to see who they are! So this week I was honored to welcome back our friend and co-host Junyang Lin, a Dev Lead from the Alibaba Qwen team, who came back after launching the incredible Qwen Coder 2.5, and Qwen 2.5 Turbo with 1M context.
We also had breaking news on the show that AI2 (Allen Institute for AI) has fully released SOTA LLama post-trained models, and I was very lucky to get the core contributor on the paper, Nathan Lambert to join us live and tell us all about this amazing open source effort! You don't want to miss this conversation!
Lastly, we chatted with the CEO of StackBlitz, Eric Simons, about the absolutely incredible lightning in the bottle success of their latest bolt.new product, how it opens a new category of code generator related tools.
00:00 Introduction and Welcome
00:58 Meet the Hosts and Guests
02:28 TLDR Overview
03:21 Tl;DR
04:10 Big Companies and APIs
07:47 Agent News and Announcements
08:05 Voice and Audio Updates
08:48 AR, Art, and Diffusion
11:02 Deep Dive into Mistral and Pixtral
29:28 Interview with Nathan Lambert from AI2
30:23 Live Reaction to Tulu 3 Release
30:50 Deep Dive into Tulu 3 Features
32:45 Open Source Commitment and Community Impact
33:13 Exploring the Released Artifacts
33:55 Detailed Breakdown of Datasets and Models
37:03 Motivation Behind Open Source
38:02 Q&A Session with the Community
38:52 Summarizing Key Insights and Future Directions
40:15 Discussion on Long Context Understanding
41:52 Closing Remarks and Acknowledgements
44:38 Transition to Big Companies and APIs
45:03 Weights & Biases: This Week's Buzz
01:02:50 Mistral's New Features and Upgrades
01:07:00 Introduction to DeepSeek and the Whale Giant
01:07:44 DeepSeek's Technological Achievements
01:08:02 Open Source Models and API Announcement
01:09:32 DeepSeek's Reasoning Capabilities
01:12:07 Scaling Laws and Future Predictions
01:14:13 Interview with Eric from Bolt
01:14:41 Breaking News: Gemini Experimental
01:17:26 Interview with Eric Simons - CEO @ Stackblitz
01:19:39 Live Demo of Bolt's Capabilities
01:36:17 Black Forest Labs AI Art Tools
01:40:45 Conclusion and Final Thoughts
As always, the show notes and TL;DR with all the links I mentioned on the show and the full news roundup below the main new recap 👇
Google & OpenAI fighting for the LMArena crown 👑
I wanted to open with this, as last week I reported that Gemini Exp 1114 has taken over #1 in the LMArena, in less than a week, we saw a new ChatGPT release, called GPT-4o-2024-11-20 reclaim the arena #1 spot!
Focusing specifically on creating writing, this new model, that's now deployed on chat.com and in the API, is definitely more creative according to many folks who've tried it, with OpenAI employees saying "expect qualitative improvements with more natural and engaging writing, thoroughness and readability" and indeed that's what my feed was reporting as well.
I also wanted to mention here, that we've seen this happen once before, last time Gemini peaked at the LMArena, it took less than a week for OpenAI to release and test a model that beat it.
But not this time, this time Google came prepared with an answer!
Just as we were wrapping up the show (again, Logan apparently loves dropping things at the end of ThursdAI), we got breaking news that there is YET another experimental model from Google, called Gemini Exp 1121, and apparently, it reclaims the stolen #1 position, that chatGPT reclaimed from Gemini... yesterday! Or at least joins it at #1
LMArena Fatigue?
Many folks in my DMs are getting a bit frustrated with these marketing tactics, not only the fact that we're getting experimental models faster than we can test them, but also with the fact that if you think about it, this was probably a calculated move by Google. Release a very powerful checkpoint, knowing that this will trigger a response from OpenAI, but don't release your most powerful one. OpenAI predictably releases their own "ready to go" checkpoint to show they are ahead, then folks at Google wait and release what they wanted to release in the first place.
The other frustration point is, the over-indexing of the major labs on the LMArena human metrics, as the closest approximation for "best". For example, here's some analysis from Artificial Analysis showing that the while the latest ChatGPT is indeed better at creative writing (and #1 in the Arena, where humans vote answers against each other), it's gotten actively worse at MATH and coding from the August version (which could be a result of being a distilled much smaller version) .
In summary, maybe the LMArena is no longer 1 arena is all you need, but the competition at the TOP scores of the Arena has never been hotter.
DeepSeek R-1 preview - reasoning from the Chinese Whale
While the American labs fight for the LM titles, the real interesting news may be coming from the Chinese whale, DeepSeek, a company known for their incredibly cracked team, resurfaced once again and showed us that they are indeed, well super cracked.
They have trained and released R-1 preview, with Reinforcement Learning, a reasoning model that beasts O1 at AIME and other benchmarks! We don't know many details yet, besides them confirming that this model comes to the open source! but we do know that this model , unlike O1, is showing the actual reasoning it uses to achieve it's answers (reminder: O1 hides its actual reasoning and what we see is actually another model summarizing the reasoning)
The other notable thing is, DeepSeek all but confirmed the claim that we have a new scaling law with Test Time / Inference time compute law, where, like with O1, the more time (and tokens) you give a model to think, the better it gets at answering hard questions. Which is a very important confirmation, and is a VERY exciting one if this is coming to the open source!
Right now you can play around with R1 in their demo chat interface.
In other Big Co and API news
In other news, Mistral becomes a Research/Product company, with a host of new additions to Le Chat, including Browse, PDF upload, Canvas and Flux 1.1 Pro integration (for Free! I think this is the only place where you can get Flux Pro for free!).
Qwen released a new 1M context window model in their API called Qwen 2.5 Turbo, making it not only the 2nd ever 1M+ model (after Gemini) to be available, but also reducing TTFT (time to first token) significantly and slashing costs. This is available via their APIs and Demo here.
Open Source is catching up
AI2 open sources Tulu 3 - SOTA 8B, 70B LLama post trained FULLY open sourced (Blog ,Demo, HF, Data, Github, Paper)
Allen AI folks have joined the show before, and this time we got Nathan Lambert, the core contributor on the Tulu paper, join and talk to us about Post Training and how they made the best performing SOTA LLama 3.1 Funetunes with careful data curation (which they also open sourced), preference optimization, and a new methodology they call RLVR (Reinforcement Learning with Verifiable Rewards).
Simply put, RLVR modifies the RLHF approach by using a verification function instead of a reward model. This method is effective for tasks with verifiable answers, like math problems or specific instructions. It improves performance on certain benchmarks (e.g., GSM8K) while maintaining capabilities in other areas.
The most notable thing is, just how MUCH is open source, as again, like the last time we had AI2 folks on the show, the amount they release is staggering
In the show, Nathan had me pull up the paper and we went through the deluge of models, code and datasets they released, not to mention the 73 page paper full of methodology and techniques.
Just absolute ❤️ to the AI2 team for this release!
🐝 This weeks buzz - Weights & Biases corner
This week, I want to invite you to a live stream announcement that I am working on behind the scenes to produce, on December 2nd. You can register HERE (it's on LinkedIn, I know, I'll have the YT link next week, promise!)
We have some very exciting news to announce, and I would really appreciate the ThursdAI crew showing up for that! It's like 5 minutes and I helped produce 🙂
Pixtral Large is making VLMs cool again
Mistral had quite the week this week, not only adding features to Le Chat, but also releasing Pixtral Large, their updated multimodal model, which they claim state of the art on multiple benchmarks.
It's really quite good, not to mention that it's also included, for free, as part of the le chat platform, so now when you upload documents or images to le chat you get Pixtral Large.
The backbone for this model is Mistral Large (not the new one they also released) and this makes this 124B model a really really good image model, albeit a VERY chonky one that's hard to run locally.
The thing I loved about the Pixtral release the most is, they used the new understanding to ask about Weights & Biases charts 😅 and Pixtral did a pretty good job!
Some members of the community though, reacted to the SOTA claims by Mistral in a very specific meme-y way:
This meme has become a very standard one, when labs tend to not include Qwen VL 72B or other Qwen models in the evaluation results, all while claiming SOTA. I decided to put these models to a head to head test myself, only to find out, that ironically, both models say the other one is better, while both hallucinate some numbers.
BFL is putting the ART in Artificial Intelligence with FLUX.1 Tools (blog)
With the absolute breaking news bombastic release, the folks at BFL (Black Forest Labs) have released Flux.1 Tools, which will allow AI artist to use these models in all kind of creative inspiring ways.
These tools are: FLUX.1 Fill (for In/Out painting), FLUX.1 Depth/Canny (Structural Guidance using depth map or canny edges) and FLUX.1 Redux for image variation and restyling.
These tools are not new to the AI Art community conceptually, but they have been patched over onto Flux from other models like SDXL, and now the actual lab releasing them gave us the crème de la crème, and the evals speak for themselves, achieving SOTA on image variation benchmark!
The last thing I haven't covered here, is my interview with Eric Simons, the CEO of StackBlitz, who came in to talk about the insane rise of bolt.new, and I would refer you to the actual recording for that, because it's really worth listening to it (and seeing me trying out bolt in real time!)
That's most of the recap, we talked about a BUNCH of other stuff of course, and we finished on THIS rap song that ChatGPT wrote, and Suno v4 produced with credits to Kyle Shannon.
TL;DR and Show Notes:
* Open Source LLMs
* Mistral releases Pixtral Large (Blog, HF, LeChat)
* Mistral - Mistral Large 2411 (a HF)
* Sage Attention the next Flash Attention? (X)
* AI2 open sources Tulu 3 - SOTA 8B, 70B LLama Finetunes FULLY open sourced (Blog ,Demo, HF, Data, Github, Paper)
* Big CO LLMs + APIs
* Alibaba - Qwen 2.5 Turbo with 1M tokens (X, HF Demo)
* Mistral upgrades to a product company with le chat 2.0 (Blog, Le Chat)
* DeepSeek R1-preview - the first reasoning model from the Chinese whale (X, chat)
* OpenAI updates ChatGPT in app and API - reclaims #1 on LMArena (X)
* Gemini Exp 1121 - rejoins #1 spot on LMArena after 1 day of being beaten (X)
* Agents News
* Perplexity is going to do the shopping for you (X, Shop)
* Stripe Agent SDK - allowing agents to transact (Blog)
* This weeks Buzz
* We have an important announcement coming on December 2nd! (link)
* Voice & Audio
* Suno V4 released - but for real this time (X)
* ChatGPT new creative writing does Eminem type rap with new Suno v4 (link)
* AI Art & Diffusion & 3D
* BFL announcing Flux Tools today (blog, fal)
* Free BFL Flux Pro on Mistral Le Chat!
*
Thank you, see you next week 🫡
This is a public episode. If you’d like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe -
This week is a very exciting one in the world of AI news, as we get 3 SOTA models, one in overall LLM rankings, on in OSS coding and one in OSS voice + a bunch of new breaking news during the show (which we reacted to live on the pod, and as we're now doing video, you can see us freak out in real time at 59:32)
00:00 Welcome to ThursdAI
00:25 Meet the Hosts
02:38 Show Format and Community
03:18 TLDR Overview
04:01 Open Source Highlights
13:31 Qwen Coder 2.5 Release
14:00 Speculative Decoding and Model Performance
22:18 Interactive Demos and Artifacts
28:20 Training Insights and Future Prospects
33:54 Breaking News: Nexus Flow
36:23 Exploring Athene v2 Agent Capabilities
36:48 Understanding ArenaHard and Benchmarking
40:55 Scaling and Limitations in AI Models
43:04 Nexus Flow and Scaling Debate
49:00 Open Source LLMs and New Releases
52:29 FrontierMath Benchmark and Quantization Challenges
58:50 Gemini Experimental 1114 Release and Performance
01:11:28 LLM Observability with Weave
01:14:55 Introduction to Tracing and Evaluations
01:15:50 Weave API Toolkit Overview
01:16:08 Buzz Corner: Weights & Biases
01:16:18 Nous Forge Reasoning API
01:26:39 Breaking News: OpenAI's New MacOS Features
01:27:41 Live Demo: ChatGPT Integration with VS Code
01:34:28 Ultravox: Real-Time AI Conversations
01:42:03 Tilde Research and Stargazer Tool
01:46:12 Conclusion and Final Thoughts
This week also, there was a debate online, whether deep learning (and scale is all you need) has hit a wall, with folks like Ilya Sutskever being cited by publications claiming it has, folks like Yann LeCoon calling "I told you so". TL;DR? multiple huge breakthroughs later, and both Oriol from DeepMind and Sam Altman are saying "what wall?" and Heiner from X.ai saying "skill issue", there is no walls in sight, despite some tech journalism love to pretend there is. Also, what happened to Yann? 😵💫
Ok, back to our scheduled programming, here's the TL;DR, afterwhich, a breakdown of the most important things about today's update, and as always, I encourage you to watch / listen to the show, as we cover way more than I summarize here 🙂
TL;DR and Show Notes:
* Open Source LLMs
* Qwen Coder 2.5 32B (+5 others) - Sonnet @ home (HF, Blog, Tech Report)
* The End of Quantization? (X, Original Thread)
* Epoch : FrontierMath new benchmark for advanced MATH reasoning in AI (Blog)
* Common Corpus: Largest multilingual 2T token dataset (blog)
* NexusFlow - Athena v2 - open model suite (X, Blog, HF)
* Big CO LLMs + APIs
* Gemini 1114 is new king LLM #1 LMArena (X)
* Nous Forge Reasoning API - beta (Blog, X)
* Reuters reports "AI is hitting a wall" and it's becoming a meme (Article)
* Cursor acq. SuperMaven (X)
* This Weeks Buzz
* Weave JS/TS support is here 🙌
* Voice & Audio
* Fixie releases UltraVox SOTA (Demo, HF, API)
* Suno v4 is coming and it's bonkers amazing (Alex Song, SOTA Jingle)
* Tools demoed
* Qwen artifacts - HF Demo
* Tilde Galaxy - Interp Tool
This is a public episode. If you’d like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe -
👋 Hey all, this is Alex, coming to you from the very Sunny California, as I'm in SF again, while there is a complete snow storm back home in Denver (brrr).
I flew here for the Hackathon I kept telling you about, and it was glorious, we had over 400 registered, over 200 approved hackers, 21 teams submitted incredible projects 👏 You can follow some of these here
I then decided to stick around and record the show from SF, and finally pulled the plug and asked for some budget, and I present, the first ThursdAI, recorded from the newly minted W&B Podcast studio at our office in SF 🎉
This isn't the only first, today also, for the first time, all of the regular co-hosts of ThursdAI, met on video for the first time, after over a year of hanging out weekly, we've finally made the switch to video, and you know what? Given how good AI podcasts are getting, we may have to stick around with this video thing! We played one such clip from a new model called hertz-dev, which is a
-
Hey everyone, Happy Halloween! Alex here, coming to you live from my mad scientist lair! For the first ever, live video stream of ThursdAI, I dressed up as a mad scientist and had my co-host, Fester the AI powered Skeleton join me (as well as my usual cohosts haha) in a very energetic and hopefully entertaining video stream!
Since it's Halloween today, Fester (and I) have a very busy schedule, so no super length ThursdAI news-letter today, as we're still not in the realm of Gemini being able to write a decent draft that takes everything we talked about and cover all the breaking news, I'm afraid I will have to wish you a Happy Halloween and ask that you watch/listen to the episode.
The TL;DR and show links from today, don't cover all the breaking news but the major things we saw today (and caught live on the show as Breaking News) were, ChatGPT now has search, Gemini has grounded search as well (seems like the release something before Google announces it streak from OpenAI continues).
Here's a quick trailer of the major things that happened:
This weeks buzz - Halloween AI toy with Weave
In this weeks buzz, my long awaited Halloween project is finally live and operational!
I've posted a public Weave dashboard here and the code (that you can run on your mac!) here
Really looking forward to see all the amazing costumers the kiddos come up with and how Gemini will be able to respond to them, follow along!
ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.
Ok and finally my raw TL;DR notes and links for this week. Happy halloween everyone, I'm running off to spook the kiddos (and of course record and post about it!)
ThursdAI - Oct 31 - TL;DR
TL;DR of all topics covered:
* Open Source LLMs:
* Microsoft's OmniParser: SOTA UI parsing (MIT Licensed) 𝕏
* Groundbreaking model for web automation (MIT license).
* State-of-the-art UI parsing and understanding.
* Outperforms GPT-4V in parsing web UI.
* Designed for web automation tasks.
* Can be integrated into various development workflows.
* ZhipuAI's GLM-4-Voice: End-to-end Chinese/English speech 𝕏
* End-to-end voice model for Chinese and English speech.
* Open-sourced and readily available.
* Focuses on direct speech understanding and generation.
* Potential applications in various speech-related tasks.
* Meta releases LongVU: Video LM for long videos 𝕏
* Handles long videos with impressive performance.
* Uses DINOv2 for downsampling, eliminating redundant scenes.
* Fuses features using DINOv2 and SigLIP.
* Select tokens are passed to Qwen2/Llama-3.2-3B.
* Demo and model are available on HuggingFace.
* Potential for significant advancements in video understanding.
* OpenAI new factuality benchmark (Blog, Github)
* Introducing SimpleQA: new factuality benchmark
* Goal: high correctness, diversity, challenging for frontier models
* Question Curation: AI trainers, verified by second trainer
* Quality Assurance: 3% inherent error rate
* Topic Diversity: wide range of topics
* Grading Methodology: "correct", "incorrect", "not attempted"
* Model Comparison: smaller models answer fewer correctly
* Calibration Measurement: larger models more calibrated
* Limitations: only for short, fact-seeking queries
* Conclusion: drive research on trustworthy AI
* Big CO LLMs + APIs:
* ChatGPT now has Search! (X)
* Grounded search results in browsing the web
* Still hallucinates
* Reincarnation of Search GPT inside ChatGPT
* Apple Intelligence Launch: Image features for iOS 18.2 [𝕏]( Link not provided in source material)
* Officially launched for developers in iOS 18.2.
* Includes Image Playground and Gen Moji.
* Aims to enhance image creation and manipulation on iPhones.
* GitHub Universe AI News: Co-pilot expands, new Spark tool 𝕏
* GitHub Co-pilot now supports Claude, Gemini, and OpenAI models.
* GitHub Spark: Create micro-apps using natural language.
* Expanding the capabilities of AI-powered coding tools.
* Copilot now supports multi-file edits in VS Code, similar to Cursor, and faster code reviews.
* GitHub Copilot extensions are planned for release in 2025.
* Grok Vision: Image understanding now in Grok 𝕏
* Finally has vision capabilities (currently via 𝕏, API coming soon).
* Can now understand and explain images, even jokes.
* Early version, with rapid improvements expected.
* OpenAI advanced voice mode updates (X)
* 70% cheaper in input tokens because of automatic caching (X)
* Advanced voice mode is now on desktop app
* Claude this morning - new mac / pc App
* This week's Buzz:
* My AI Halloween toy skeleton is greeting kids right now (and is reporting to Weave dashboard)
* Vision & Video:
* Meta's LongVU: Video LM for long videos 𝕏 (see Open Source LLMs for details)
* Grok Vision on 𝕏: 𝕏 (see Big CO LLMs + APIs for details)
* Voice & Audio:
* MaskGCT: New SoTA Text-to-Speech 𝕏
* New open-source state-of-the-art text-to-speech model.
* Zero-shot voice cloning, emotional TTS, long-form synthesis, variable speed synthesis, bilingual (Chinese & English).
* Available on Hugging Face.
* ZhipuAI's GLM-4-Voice: End-to-end Chinese/English speech 𝕏 (see Open Source LLMs for details)
* Advanced Voice Mode on Desktops: 𝕏 (See Big CO LLMs + APIs for details).
* AI Art & Diffusion: (See Red Panda in "This week's Buzz" above)
* Redcraft Red Panda: new SOTA image diffusion 𝕏
* High-performing image diffusion model, beating Black Forest Labs Flux.
* 72% win rate, higher ELO than competitors.
* Creates SVG files, editable as vector files.
* From Redcraft V3.
* Tools:
* Bolt.new by StackBlitz: In-browser full-stack dev environment 𝕏
* Platform for prompting, editing, running, and deploying full-stack apps directly in your browser.
* Uses WebContainers.
* Supports npm, Vite, Next.js, and integrations with Netlify, Cloudflare, and SuperBase.
* Free to use.
* Jina AI's Meta-Prompt: Improved LLM Codegen 𝕏
This is a public episode. If you’d like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe -
Hey all, Alex here, coming to you from the (surprisingly) sunny Seattle, with just a mind-boggling week of releases. Really, just on Tuesday there was so much news already! I had to post a recap thread, something I do usually after I finish ThursdAI!
From Anthropic reclaiming close-second sometimes-first AI lab position + giving Claude the wheel in the form of computer use powers, to more than 3 AI video generation updates with open source ones, to Apple updating Apple Intelligence beta, it's honestly been very hard to keep up, and again, this is literally part of my job!
But once again I'm glad that we were able to cover this in ~2hrs, including multiple interviews with returning co-hosts ( Simon Willison came back, Killian came back) so definitely if you're only a reader at this point, listen to the show!
Ok as always (recently) the TL;DR and show notes at the bottom (I'm trying to get you to scroll through ha, is it working?) so grab a bucket of popcorn, let's dive in 👇
ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.
Claude's Big Week: Computer Control, Code Wizardry, and the Mysterious Case of the Missing Opus
Anthropic dominated the headlines this week with a flurry of updates and announcements. Let's start with the new Claude Sonnet 3.5 (really, they didn't update the version number, it's still 3.5 tho a different API model)
Claude Sonnet 3.5: Coding Prodigy or Benchmark Buster?
The new Sonnet model shows impressive results on coding benchmarks, surpassing even OpenAI's O1 preview on some. "It absolutely crushes coding benchmarks like Aider and Swe-bench verified," I exclaimed on the show. But a closer look reveals a more nuanced picture. Mixed results on other benchmarks indicate that Sonnet 3.5 might not be the universal champion some anticipated. My friend who has held back internal benchmarks was disappointed highlighting weaknesses in scientific reasoning and certain writing tasks. Some folks are seeing it being lazy-er for some full code completion, while the context window is now doubled from 4K to 8K! This goes to show again, that benchmarks don't tell the full story, so we wait for LMArena (formerly LMSys Arena) and the vibe checks from across the community.
However it absolutely dominates in code tasks, that much is clear already. This is a screenshot of the new model on Aider code editing benchmark, a fairly reliable way to judge models code output, they also have a code refactoring benchmark
Haiku 3.5 and the Vanishing Opus: Anthropic's Cryptic Clues
Further adding to the intrigue, Anthropic announced Claude 3.5 Haiku! They usually provide immediate access, but Haiku remains elusive, saying that it's available by end of the month, which is very very soon. Making things even more curious, their highly anticipated Opus model has seemingly vanished from their website. "They've gone completely silent on 3.5 Opus," Simon Willison (𝕏) noted, mentioning conspiracy theories that this new Sonnet might simply be a rebranded Opus? 🕯️ 🕯️ We'll make a summoning circle for new Opus and update you once it lands (maybe next year)
Claude Takes Control (Sort Of): Computer Use API and the Dawn of AI Agents (𝕏)
The biggest bombshell this week? Anthropic's Computer Use. This isn't just about executing code; it’s about Claude interacting with computers, clicking buttons, browsing the web, and yes, even ordering pizza! Killian Lukas (𝕏), creator of Open Interpreter, returned to ThursdAI to discuss this groundbreaking development. "This stuff of computer use…it’s the same argument for having humanoid robots, the web is human shaped, and we need AIs to interact with computers and the web the way humans do" Killian explained, illuminating the potential for bridging the digital and physical worlds.
Simon, though enthusiastic, provided a dose of realism: "It's incredibly impressive…but also very much a V1, beta.” Having tackled the setup myself, I agree; the current reliance on a local Docker container and virtual machine introduces some complexity and security considerations. However, seeing Claude fix its own Docker installation error was an unforgettably mindblowing experience. The future of AI agents is upon us, even if it’s still a bit rough around the edges.
Here's an easy guide to set it up yourself, takes 5 minutes, requires no coding skills and it's safely tucked away in a container.
Big Tech's AI Moves: Apple Embraces ChatGPT, X.ai API (+Vision!?), and Cohere Multimodal Embeddings
The rest of the AI world wasn’t standing still. Apple made a surprising integration, while X.ai and Cohere pushed their platforms forward.
Apple iOS 18.2 Beta: Siri Phones a Friend (ChatGPT)
Apple, always cautious, surprisingly integrated ChatGPT directly into iOS. While Siri remains…well, Siri, users can now effortlessly offload more demanding tasks to ChatGPT. "Siri is still stupid," I joked, "but can now ask it to write some stuff and it'll tell you, hey, do you want me to ask my much smarter friend ChatGPT about this task?" This approach acknowledges Siri's limitations while harnessing ChatGPT’s power. The iOS 18.2 beta also includes GenMoji (custom emojis!) and Visual Intelligence (multimodal camera search) which are both welcome, tho I didn't really get the need of the Visual Intelligence (maybe I'm jaded with my Meta Raybans that already have this and are on my face most of the time) and I didn't get into the GenMoji waitlist still waiting to show you some custom emojis!
X.ai API: Grok's Enterprise Ambitions and a Secret Vision Model
Elon Musk's X.ai unveiled their API platform, focusing on enterprise applications with Grok 2 beta. They also teased an undisclosed vision model, and they had vision APIs for some folks who joined their hackathon. While these models are still not worth using necessarily, the next Grok-3 is promising to be a frontier model, and for some folks, it's relaxed approach to content moderation (what Elon is calling maximally seeking the truth) is going to be a convincing point for some!
I just wish they added fun mode and access to real time data from X! Right now it's just the Grok-2 model, priced at a very non competative $15/mTok 😒
Cohere Embed 3: Elevating Multimodal Embeddings (Blog)
Cohere launched Embed 3, enabling embeddings for both text and visuals such as graphs and designs. "While not the first multimodal embeddings, when it comes from Cohere, you know it's done right," I commented.
Open Source Power: JavaScript Transformers and SOTA Multilingual Models
The open-source AI community continues to impress, making powerful models accessible to all.
Massive kudos to Xenova (𝕏) for the release of Transformers.js v3! The addition of WebGPU support results in a staggering "up to 100 times faster" performance boost for browser-based AI, dramatically simplifying local, private, and efficient model running. We also saw DeepSeek’s Janus 1.3B, a multimodal image and text model, and Cohere For AI's Aya Expanse, supporting 23 languages.
This Week’s Buzz: Hackathon Triumphs and Multimodal Weave
On ThursdAI, we also like to share some of the exciting things happening behind the scenes.
AI Chef Showdown: Second Place and Lessons Learned
Happy to report that team Yes Chef clinched second place in a hackathon with an unconventional creation: a Gordon Ramsay-inspired robotic chef hand puppet, complete with a cloned voice and visual LLM integration. We bought and 3D printed and assembled an Open Source robotic arm, made it become a ventriloquist operator by letting it animate a hand puppet, and cloned Ramsey's voice. It was so so much fun to build, and the code is here
Weave Goes Multimodal: Seeing and Hearing Your AI
Even more exciting was the opportunity to leverage Weave's newly launched multimodal functionality. "Weave supports you to see and play back everything that's audio generated," I shared, emphasizing its usefulness in debugging our vocal AI chef.
For a practical example, here's ALL the (NSFW) roasts that AI Chef has cooked me with, it's honestly horrifying haha. For full effect, turn on the background music first and then play the chef audio 😂
📽️ Video Generation Takes Center Stage: Mochi's Motion Magic and Runway's Acting Breakthrough
Video models made a quantum leap this week, pushing the boundaries of generative AI.
Genmo Mochi-1: Diffusion Transformers and Generative Motion
Genmo's Ajay Jain (Genmo) joined ThursdAI to discuss Mochi-1, their powerful new diffusion transformer. "We really focused on…prompt adherence and motion," he explained. Mochi-1's capacity to generate complex and realistic motion is truly remarkable, and with an HD version on its way, the future looks bright (and animated!). They also get bonus points for dropping a torrent link in the announcement tweet.
So far this apache 2, 10B Diffusion Transformer is open source, but not for the GPU-poors, as it requires 4 GPUs to run, but apparently there was already an attempt to run in on one single 4090 which, Ajay highlighted was one of the reasons they open sourced it!
Runway Act-One: AI-Powered Puppetry and the Future of Acting (blog)
Ok this one absolutely seems bonkers! Runway unveiled Act-One! Forget just generating video from text; Act-One takes a driving video and character image to produce expressive and nuanced character performances. "It faithfully represents elements like eye-lines, micro expressions, pacing, and delivery," I noted, excited by the transformative potential for animation and filmmaking.
So no need for rigging, for motion capture suites on faces of actors, Runway now, does this, so you can generate characters with Flux, and animate them with Act-One 📽️ Just take a look at this insanity 👇
11labs Creative Voices: Prompting Your Way to the Perfect Voice
11labs debuted an incredible feature: creating custom voices using only text prompts. Want a high-pitched squeak or a sophisticated British accent? Just ask. This feature makes bespoke voice creation significantly easier.
I was really really impressed by this, as this is perfect for my Skeleton Halloween project! So far I struggled to get the voice "just right" between the awesome Cartesia voice that is not emotional enough, and the very awesome custom OpenAI voice that needs a prompt to act, and sometimes stops acting in the middle of a sentence.
With this new Elevenlabs feature, I can describe the exact voice I want with a prompt, and then keep iterating until I find the perfect one, and then boom, it's available for me! Great for character creation, and even greater for the above Act-One model, as you can now generate a character with Flux, Drive the video with Act-one and revoice yourself with a custom prompted voice from 11labs! Which is exactly what I'm going to build for the next hackathon!
If you'd like to support me in this journey, here's an 11labs affiliate link haha but I already got a yearly account so don't sweat it.
AI Art & Diffusion Updates: Stable Diffusion 3.5, Ideogram Canvas, and OpenAI's Sampler Surprise
The realm of AI art and diffusion models saw its share of action as well.
Stable Diffusion 3.5 (Blog) and Ideogram Canvas: Iterative Improvements and Creative Control
Stability AI launched Stable Diffusion 3.5, bringing incremental enhancements to image quality and prompt accuracy. Ideogram, meanwhile, introduced Canvas, a groundbreaking interface enabling mixing, matching, extending, and fine-tuning AI-generated artwork. This opens doors to unprecedented levels of control and creative expression.
Midjourney also announced a web editor, and folks are freaking out, and I'm only left thinking, is MJ a bit a cult? There are so much offerings out there, but it seems like everything MJ releases gets tons more excitement from that part of X than other way more incredible stuff 🤔
Seattle Pic
Ok wow that was a LOT of stuff to cover, honestly, the TL;DR for this week became so massive that I had to zoom out to take 1 screenshot of it all ,and I wasn't sure we'd be able to cover all of it!
Massive massive week, super exciting releases, and the worst thing about this is, I barely have time to play with many of these!
But I'm hoping to have some time during the Tinkerer AI hackathon we're hosting on Nov 2-3 in our SF office, limited spots left, so come and hang with me and some of the Tinkerers team, and maybe even win a Meta Rayban special Weave prize!
RAW TL;DR + Show notes and links
* Open Source LLMs
* Xenova releases Transformers JS version 3 (X)
* ⚡ WebGPU support (up to 100x faster than WASM)🔢 New quantization formats (dtypes)🏛 120 supported architectures in total📂 25 new example projects and templates🤖 Over 1200 pre-converted models🌐 Node.js (ESM + CJS), Deno, and Bun compatibility🏡 A new home on GitHub and NPM
* DeepSeek drops Janus 1.3B (X, HF, Paper)
* DeepSeek releases Janus 1.3B 🔥
* 🎨 Understands and generates both images and text
* 👀Combines DeepSeek LLM 1.3B with SigLIP-L for vision
* ✂️ Decouples the vision encoding
* Cohere for AI releases Aya expanse 8B, 32B (X, HF, Try it)
* Aya Expanse is an open-weight research release of a model with highly advanced multilingual capabilities. It focuses on pairing a highly performant pre-trained Command family of models with the result of a year’s dedicated research from Cohere For AI, including data arbitrage, multilingual preference training, safety tuning, and model merging. The result is a powerful multilingual large language model serving 23 languages.
* 23 languages
* Big CO LLMs + APIs
* New Claude Sonnet 3.5, Claude Haiku 3.5
* New Claude absolutely crushes coding benchmarks like Aider and Swe-bench verified.
* But I'm getting mixed signals from folks with internal benchmarks, as well as some other benches like Aidan Bench and Arc challenge in which it performs worse.
* 8K output token limit vs 4K
* Other folks swear by it, Skirano, Corbitt say it's an absolute killer coder
* Haiku is 2x the price of 4o-mini and Flash
* Anthropic Computer use API + docker (X)
* Computer use is not new, see open interpreter etc
* Adept has been promising this for a while, so was LAM from rabbit.
* Now Anthropic has dropped a bomb on all these with a specific trained model to browse click and surf the web with a container
* Examples of computer use are super cool, Corbitt built agent.exe which uses it to control your computer
* Killian will join to talk about what this computer use means
* Folks are trying to order food (like Anthropic shows in their demo of ordering pizzas for the team)
* Claude launches code interpreter mode for claude.ai (X)
* Cohere released Embed 3 for multimodal embeddings (Blog)
* 🔍 Multimodal Embed 3: Powerful AI search model
* 🌍 Unlocks value from image data for enterprises
* 🔍 Enables fast retrieval of relevant info & assets
* 🛒 Transforms e-commerce search with image search
* 🎨 Streamlines design process with visual search
* 📊 Improves data-driven decision making with visual insights
* 🔝 Industry-leading accuracy and performance
* 🌐 Multilingual support across 100+ languages
* 🤝 Partnerships with Azure AI and Amazon SageMaker
* 🚀 Available now for businesses and developers
* X ai has a new API platform + secret vision feature (docs)
* grok-2-beta $5.0 / $15.00 mtok
* Apple releases IOS 18.2 beta with GenMoji, Visual Intelligence, ChatGPT integration & more
* Siri is still stupid, but can now ask chatGPT to write s**t
* This weeks Buzz
* Got second place for the hackathon with our AI Chef that roasts you in the kitchen (X, Weave dash)
* Weave is now multimodal and supports audio! (Weave)
* Tinkerers Hackathon in less than a week!
* Vision & Video
* Genmo releases Mochi-1 txt2video model w/ Apache 2.0 license
* Gen mo - generative motion
* 10B DiT - diffusion transformer
* 5.5 seconds video
* Apache 2.0
* Comparison thread between Genmo Mochi-1 and Hailuo
* Genmo, the company behind Mochi 1, has raised $28.4M in Series A funding from various investors. Mochi 1 is an open-source video generation model that the company claims has "superior motion quality, prompt adherence and exceptional rendering of humans that begins to cross the uncanny valley." The company is open-sourcing their base 480p model, with an HD version coming soon.
Summary Bullet Points:
* Genmo announces $28.4M Series A funding
* Mochi 1 is an open-source video generation model
* Mochi 1 has "superior motion quality, prompt adherence and exceptional rendering of humans"
* X is open-sourcing their base 480p Mochi 1 model
* HD version of Mochi 1 is coming soon
* Mochi 1 is available via Genmo's playground or as downloadable weights, or on Fal
* Mochi 1 is licensed under Apache 2.0
* Rhymes AI - Allegro video model (X)
* Meta a bunch of releases - Sam 2.1, Spirit LM
* Runway introduces puppetry video 2 video with emotion transfer (X)
* The webpage introduces Act-One, a new technology from Runway that allows for the generation of expressive character performances using a single driving video and character image, without the need for motion capture or rigging. Act-One faithfully represents elements like eye-lines, micro expressions, pacing, and delivery in the final generated output. It can translate an actor's performance across different character designs and styles, opening up new avenues for creative expression.
Summary in 10 Bullet Points:
* Act-One is a new technology from Runway
* It generates expressive character performances
* Uses a single driving video and character image
* No motion capture or rigging required
* Faithfully represents eye-lines, micro expressions, pacing, and delivery
* Translates performance across different character designs and styles
* Allows for new creative expression possibilities
* Works with simple cell phone video input
* Replaces complex, multi-step animation workflows
* Enables capturing the essence of an actor's performance
* Haiper releases a new video model
* Meta releases Sam 2.1
* Key updates to SAM 2:
* New data augmentation for similar and small objects
* Improved occlusion handling
* Longer frame sequences in training
* Tweaks to positional encoding
SAM 2 Developer Suite released:
* Open source code package
* Training code for fine-tuning
* Web demo front-end and back-end code
* Voice & Audio
* OpenAI released custom voice support for chat completion API (X, Docs)
* Pricing is still insane ($200/1mtok)
* This is not just TTS, this is advanced voice mode!
* The things you can ddo with them are very interesting, like asking for acting, or singing.
* 11labs create voices with a prompt is super cool (X)
* Meta Spirit LM: An open source language model for seamless speech and text integration (Blog, weights)
* Meta Spirit LM is a multimodal language model that:
* Combines text and speech processing
* Uses word-level interleaving for cross-modality generation
* Has two versions:
* Base: uses phonetic tokens
* Expressive: uses pitch and style tokens for tone
* Enables more natural speech generation
* Can learn tasks like ASR, TTS, and speech classification
* MoonShine for audio
* AI Art & Diffusion & 3D
* Stable Diffusion 3.5 was released (X, Blog, HF)
* including Stable Diffusion 3.5 Large and Stable Diffusion 3.5 Large Turbo.
* table Diffusion 3.5 Medium will be released on October 29th.
* the permissive Stability AI Community License.
* 🚀 Introducing Stable Diffusion 3.5 - powerful, customizable, and free models
* 🔍 Improved prompt adherence and image quality compared to previous versions
* ⚡️ Stable Diffusion 3.5 Large Turbo offers fast inference times
* 🔧 Multiple variants for different hardware and use cases
* 🎨 Empowering creators to distribute and monetize their work
* 🌐 Available for commercial and non-commercial use under permissive license
* 🔍 Listening to community feedback to advance their mission
* 🔄 Stable Diffusion 3.5 Medium to be released on October 29th
* 🤖 Commitment to transforming visual media with accessible AI tools
* 🔜 Excited to see what the community creates with Stable Diffusion 3.5
* Ideogram released Canvas (X)
* Canvas is a mix of Krea and Everart
* Ideogram is a free AI tool for generating realistic images, posters, logos
* Extend tool allows expanding images beyond original borders
* Magic Fill tool enables editing specific image regions and details
* Ideogram Canvas is a new interface for organizing, generating, editing images
* Ideogram uses AI to enhance the creative process with speed and precision
* Developers can integrate Ideogram's Magic Fill and Extend via the API
* Privacy policy and other legal information available on the website
* Ideogram is free-to-use, with paid plans offering additional features
* Ideogram is available globally, with support for various browsers
* OpenAI released a new sampler paper trying to beat diffusers (Blog)
* Researchers at OpenAI have developed a new approach called sCM that simplifies the theoretical formulation of continuous-time consistency models, allowing them to stabilize and scale the training of these models for large datasets. The sCM approach achieves sample quality comparable to leading diffusion models, while using only two sampling steps - a 50x speedup over traditional diffusion models. Benchmarking shows sCM produces high-quality samples using less than 10% of the effective sampling compute required by other state-of-the-art generative models.The key innovation is that sCM models scale commensurately with the teacher diffusion models they are distilled from. As the diffusion models grow larger, the relative difference in sample quality between sCM and the teacher model diminishes. This allows sCM to leverage the advances in diffusion models to achieve impressive sample quality and generation speed, unlocking new possibilities for real-time, high-quality generative AI across domains like images, audio, and video.
* 🔍 Simplifying continuous-time consistency models
* 🔨 Stabilizing training for large datasets
* 🔍 Scaling to 1.5 billion parameters on ImageNet
* ⚡ 2-step sampling for 50x speedup vs. diffusion
* 🎨 Comparable sample quality to diffusion models
* 📊 Benchmarking against state-of-the-art models
* 🗺️ Visualization of diffusion vs. consistency models
* 🖼️ Selected 2-step samples from 1.5B model
* 📈 Scaling sCM with teacher diffusion models
* 🔭 Limitations and future work
* Midjourney announces an editor (X)
* announces the release of two new features for Midjourney users - an image editor for uploaded images and
* image re-texturing for exploring materials, surfacing, and lighting.
* These features will initially be available only to yearly members, members who have been subscribers for the past 12 months, and members with at least 10,000 images.
* The post emphasizes the need to give the community, human moderators, and AI moderation systems time to adjust to the new features
* Tools
PS : Subscribe to the newsletter and podcast, and I'll be back next week with more AI escapades! 🫶
This is a public episode. If you’d like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe -
Hey folks, Alex here from Weights & Biases, and this week has been absolutely bonkers. From robots walking among us to rockets landing on chopsticks (well, almost), the future is feeling palpably closer. And if real-world robots and reusable spaceship boosters weren't enough, the open-source AI community has been cooking, dropping new models and techniques faster than a Starship launch. So buckle up, grab your space helmet and noise-canceling headphones (we’ll get to why those are important!), and let's blast off into this week’s AI adventures!
TL;DR and show-notes + links at the end of the post 👇
Robots and Rockets: A Glimpse into the Future
I gotta start with the real-world stuff because, let's be honest, it's mind-blowing. We had Robert Scoble (yes, the Robert Scoble) join us after attending the Tesla We, Robot AI event, reporting on Optimus robots strolling through crowds, serving drinks, and generally being ridiculously futuristic. Autonomous robo-taxis were also cruising around, giving us a taste of a driverless future.
Robert’s enthusiasm was infectious: "It was a vision of the future, and from that standpoint, it succeeded wonderfully." I couldn't agree more. While the market might have had a mini-meltdown (apparently investors aren't ready for robot butlers yet), the sheer audacity of Tesla’s vision is exhilarating. These robots aren't just cool gadgets; they represent a fundamental shift in how we interact with technology and the world around us. And they’re learning fast. Just days after the event, Tesla released a video of Optimus operating autonomously, showcasing the rapid progress they’re making.
And speaking of audacious visions, SpaceX decided to one-up everyone (including themselves) by launching Starship and catching the booster with Mechazilla – their giant robotic chopsticks (okay, technically a launch tower, but you get the picture). Waking up early with my daughter to watch this live was pure magic. As Ryan Carson put it, "It was magical watching this… my kid who's 16… all of his friends are getting their imaginations lit by this experience." That’s exactly what we need - more imagination and less doomerism! The future is coming whether we like it or not, and I, for one, am excited.
Open Source LLMs and Tools: The Community Delivers (Again!)
Okay, back to the virtual world (for now). This week's open-source scene was electric, with new model releases and tools that have everyone buzzing (and benchmarking like crazy!).
* Nemotron 70B: Hype vs. Reality: NVIDIA dropped their Nemotron 70B instruct model, claiming impressive scores on certain benchmarks (Arena Hard, AlpacaEval), even suggesting it outperforms GPT-4 and Claude 3.5. As always, we take these claims with a grain of salt (remember Reflection?), and our resident expert, Nisten, was quick to run his own tests. The verdict? Nemotron is good, "a pretty good model to use," but maybe not the giant-killer some hyped it up to be. Still, kudos to NVIDIA for pushing the open-source boundaries. (Hugging Face, Harrison Kingsley evals)
* Zamba 2 : Hybrid Vigor: Zyphra, in collaboration with NVIDIA, released Zamba 2, a hybrid Sparse Mixture of Experts (SME) model. We had Paolo Glorioso, a researcher from Ziphra, join us to break down this unique architecture, which combines the strengths of transformers and state space models (SSMs). He highlighted the memory and latency advantages of SSMs, especially for on-device applications. Definitely worth checking out if you’re interested in transformer alternatives and efficient inference.
* Zyda 2: Data is King (and Queen): Alongside Zamba 2, Zyphra also dropped Zyda 2, a massive 5 trillion token dataset, filtered, deduplicated, and ready for LLM training. This kind of open-source data release is a huge boon to the community, fueling the next generation of models. (X)
* Ministral: Pocket-Sized Power: On the one-year anniversary of the iconic Mistral 7B release, Mistral announced two new smaller models – Ministral 3B and 8B. Designed for on-device inference, these models are impressive, but as always, Qwen looms large. While Mistral didn’t include Qwen in their comparisons, early tests suggest Qwen’s smaller models still hold their own. One point of contention: these Ministrals aren't as open-source as the original 7B, which is a bit of a bummer, with the 3B not being even released anywhere besides their platform. (Mistral Blog)
* Entropix (aka Shrek Sampler): Thinking Outside the (Sample) Box: This one is intriguing! Entropix introduces a novel sampling technique aimed at boosting the reasoning capabilities of smaller LLMs. Nisten’s yogurt analogy explains it best: it’s about “marinating” the information and picking the best “flavor” (token) at the end. Early examples look promising, suggesting Entropix could help smaller models tackle problems that even trip up their larger counterparts. But, as with all shiny new AI toys, we're eagerly awaiting robust evals. Tim Kellog has an detailed breakdown of this method here
* Gemma-APS: Fact-Finding Mission: Google released Gemma-APS, a set of models specifically designed for extracting claims and facts from text. While LLMs can already do this to some extent, a dedicated model for this task is definitely interesting, especially for applications requiring precise information retrieval. (HF)
🔥 OpenAI adds voice to their completion API (X, Docs)
In the last second of the pod, OpenAI decided to grace us with Breaking News!
Not only did they launch their Windows native app, but also added voice input and output to their completion APIs. This seems to be the same model as the advanced voice mode (and priced super expensively as well) and the one they used in RealTime API released a few weeks ago at DevDay.
This is of course a bit slower than RealTime but is much simpler to use, and gives way more developers access to this incredible resource (I'm definitely planning to use this for ... things 😈)
This isn't their "TTS" or "STT (whisper) models, no, this is an actual omni model that understands audio natively and also outputs audio natively, allowing for things like "count to 10 super slow"
I've played with it just now (and now it's after 6pm and I'm still writing this newsletter) and it's so so awesome, I expect it to be huge because the RealTime API is very curbersome and many people don't really need this complexity.
This weeks Buzz - Weights & Biases updates
Ok I wanted to send a completely different update, but what I will show you is, Weave, our observability framework is now also Multi Modal!
This couples very well with the new update from OpenAI!
So here's an example usage with today's announcement, I'm going to go through the OpenAI example and show you how to use it with streaming so you can get the audio faster, and show you the Weave multimodality as well 👇
You can find the code for this in this Gist and please give us feedback as this is brand new
Non standard use-cases of AI corner
This week I started noticing and collecting some incredible use-cases of Gemini and it's long context and multimodality and wanted to share with you guys, so we had some incredible conversations about non-standard use cases that are pushing the boundaries of what's possible with LLMs.
Hrishi blew me away with his experiments using Gemini for transcription and diarization. Turns out, Gemini is not only great at transcription (it beats whisper!), it’s also ridiculously cheaper than dedicated ASR models like Whisper, around 60x cheaper! He emphasized the unexplored potential of prompting multimodal models, adding, “the prompting on these things… is still poorly understood." So much room for innovation here!
Simon Willison then stole the show with his mind-bending screen-scraping technique. He recorded a video of himself clicking through emails, fed it to Gemini Flash, and got perfect structured data in return. This trick isn’t just clever; it’s practically free, thanks to the ridiculously low cost of Gemini Flash. I even tried it myself, recording my X bookmarks and getting a near-perfect TLDR of the week’s AI news. The future of data extraction is here, and it involves screen recordings and very cheap (or free) LLMs.
Here's Simon's example of how much this would cost him had he actually be charged for it. 🤯
Speaking of Simon Willison , he broke the news that NotebookLM has got an upgrade, with the ability to steer the speakers with custom commands, which Simon promptly used to ask the overview hosts to talk like Pelicans
Voice Cloning, Adobe Magic, and the Quest for Real-Time Avatars
Voice cloning also took center stage this week, with the release of F5-TTS. This open-source model performs zero-shot voice cloning with just a few seconds of audio, raising all sorts of ethical questions (and exciting possibilities!). I played a sample on the show, and it was surprisingly convincing (though not without it's problems) for a local model!
This, combined with Hallo 2's (also released this week!) ability to animate talking avatars, has Wolfram Ravenwolf dreaming of real-time AI assistants with personalized faces and voices. The pieces are falling into place, folks.
And for all you Adobe fans, Firefly Video has landed! This “commercially safe” text-to-video and image-to-video model is seamlessly integrated into Premiere, offering incredible features like extending video clips with AI-generated frames. Photoshop also got some Firefly love, with mind-bending relighting capabilities that could make AI-generated images indistinguishable from real photographs.
Wrapping Up:
Phew, that was a marathon, not a sprint! From robots to rockets, open source to proprietary, and voice cloning to video editing, this week has been a wild ride through the ever-evolving landscape of AI. Thanks for joining me on this adventure, and as always, keep exploring, keep building, and keep pushing those AI boundaries. The future is coming, and it’s going to be amazing.
P.S. Don’t forget to subscribe to the podcast and newsletter for more AI goodness, and if you’re in Seattle next week, come say hi at the AI Tinkerers meetup. I’ll be demoing my Halloween AI toy – it’s gonna be spooky!
ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.
TL;DR - Show Notes and Links
* Open Source LLMs
* Nvidia releases Llama 3.1-Nemotron-70B instruct: Outperforms GPT-40 and Anthropic Claude 3.5 on several benchmarks. Available on Hugging Face and Nvidia. (X, Harrison Eval)
* Zamba2-7B: A hybrid Sparse Mixture of Experts model from Zyphra and Nvidia. Claims to outperform Mistral, Llama2, and Gemmas in the 58B weight class. (X, HF)
* Zyda-2: 57B token dataset distilled from high-quality sources for training LLMs. Released by Zyphra and Nvidia. (X)
* Ministral 3B & 8B - Mistral releases 2 new models for on device, claims SOTA (Blog)
• Entropix aims to mimic advanced reasoning in small LLMs (Github, Breakdown)
* Google releases Gemma-APS: A collection of Gemma models for text-to-propositions segmentation, distilled from Gemini Pro and fine-tuned on synthetic data. (HF)
* Big CO LLMs + APIs
* OpenAI ships advanced voice model in chat completions API endpoints with multimodality (X, Docs, My Example)
* Amazon, Microsoft, Google all announce nuclear power for AI future
* Yi-01.AI launches Yi-Lightning: A proprietary model accessible via API.
* New Gemini API parameters: Google has shipped new Gemini API parameters, including logprobs, candidateCount, presencePenalty, seed, frequencyPenalty, and model_personality_in_response.
* Google NotebookLM is no longer "experimental" and now allows for "steering" the hosts (Announcement)
* XAI - GROK 2 and Grok2-mini are now available via API in OpenRouter - (X, OR)
* This weeks Buzz (What I learned with WandB this week)
* Weave is now MultiModal (supports audio and text!) (X, Github Example)
* Vision & Video
* Adobe Firefly Video: Adobe's first commercially safe text-to-video and image-to-video generation model. Supports prompt coherence. (X)
* Voice & Audio
* Ichigo-Llama3.1 Local Real-Time Voice AI: Improvements allow it to talk back, recognize when it can't comprehend input, and run on a single Nvidia 3090 GPU. (X)
* F5-TTS: Performs zero-shot voice cloning with less than 15 seconds of audio, using audio clips to generate additional audio. (HF, Paper)
* AI Art & Diffusion & 3D
* RF-Inversion: Zero-shot inversion and editing framework for Flux, introduced by Litu Rout. Allows for image editing and personalization without training, optimization, or prompt-tuning. (X)
* Tools
* Fastdata: A library for synthesizing 1B tokens. (X)
This is a public episode. If you’d like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe -
Hey Folks, we are finally due for a "relaxing" week in AI, no more HUGE company announcements (if you don't consider Meta Movie Gen huge), no conferences or dev days, and some time for Open Source projects to shine. (while we all wait for Opus 3.5 to shake things up)
This week was very multimodal on the show, we covered 2 new video models, one that's tiny and is open source, and one massive from Meta that is aiming for SORA's crown, and 2 new VLMs, one from our friends at REKA that understands videos and audio, while the other from Rhymes is apache 2 licensed and we had a chat with Kwindla Kramer about OpenAI RealTime API and it's shortcomings and voice AI's in general.
ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.
All right, let's TL;DR and show notes, and we'll start with the 2 Nobel prizes in AI 👇
* 2 AI nobel prizes
* John Hopfield and Geoffrey Hinton have been awarded a Physics Nobel prize
* Demis Hassabis, John Jumper & David Baker, have been awarded this year's #NobelPrize in Chemistry.
* Open Source LLMs & VLMs
* TxT360: a globally deduplicated dataset for LLM pre-training ( Blog, Dataset)
* Rhymes Aria - 25.3B multimodal MoE model that can take image/video inputs Apache 2 (Blog, HF, Try It)
* Maitrix and LLM360 launch a new decentralized arena (Leaderboard, Blog)
* New Gradio 5 with server side rendering (X)
* LLamaFile now comes with a chat interface and syntax highlighting (X)
* Big CO LLMs + APIs
* OpenAI releases MLEBench - new kaggle focused benchmarks for AI Agents (Paper, Github)
* Inflection is still alive - going for enterprise lol (Blog)
* new Reka Flash 21B - (X, Blog, Try It)
* This weeks Buzz
* We chatted about Cursor, it went viral, there are many tips
* WandB releases HEMM - benchmarks of text-to-image generation models (X, Github, Leaderboard)
* Vision & Video
* Meta presents Movie Gen 30B - img and text to video models (blog, paper)
* Pyramid Flow - open source img2video model MIT license (X, Blog, HF, Paper, Github)
* Voice & Audio
* Working with OpenAI RealTime Audio - Alex conversation with Kwindla from trydaily.com
* Cartesia Sonic goes multilingual (X)
* Voice hackathon in SF with 20K prizes (and a remote track) - sign up
* Tools
* LM Studio ships with MLX natively (X, Download)
* UITHUB.com - turn any github repo into 1 long file for LLMs
A Historic Week: TWO AI Nobel Prizes!
This week wasn't just big; it was HISTORIC. As Yam put it, "two Nobel prizes for AI in a single week. It's historic." And he's absolutely spot on! Geoffrey Hinton, often called the "grandfather of modern AI," alongside John Hopfield, were awarded the Nobel Prize in Physics for their foundational work on neural networks - work that paved the way for everything we're seeing today. Think back propagation, Boltzmann machines – these are concepts that underpin much of modern deep learning. It’s about time they got the recognition they deserve!
Yoshua Bengio posted about this in a very nice quote:
@HopfieldJohn and @geoffreyhinton, along with collaborators, have created a beautiful and insightful bridge between physics and AI. They invented neural networks that were not only inspired by the brain, but also by central notions in physics such as energy, temperature, system dynamics, energy barriers, the role of randomness and noise, connecting the local properties, e.g., of atoms or neurons, to global ones like entropy and attractors. And they went beyond the physics to show how these ideas could give rise to memory, learning and generative models; concepts which are still at the forefront of modern AI research
And Hinton's post-Nobel quote? Pure gold: “I’m particularly proud of the fact that one of my students fired Sam Altman." He went on to explain his concerns about OpenAI's apparent shift in focus from safety to profits. Spicy take! It sparked quite a conversation about the ethical implications of AI development and who’s responsible for ensuring its safe deployment. It’s a discussion we need to be having more and more as the technology evolves. Can you guess which one of his students it was?
Then, not to be outdone, the AlphaFold team (Demis Hassabis, John Jumper, and David Baker) snagged the Nobel Prize in Chemistry for AlphaFold 2. This AI revolutionized protein folding, accelerating drug discovery and biomedical research in a way no one thought possible. These awards highlight the tangible, real-world applications of AI. It's not just theoretical anymore; it's transforming industries.
Congratulations to all winners, and we gotta wonder, is this a start of a trend of AI that takes over every Nobel prize going forward? 🤔
Open Source LLMs & VLMs: The Community is COOKING!
The open-source AI community consistently punches above its weight, and this week was no exception. We saw some truly impressive releases that deserve a standing ovation. First off, the TxT360 dataset (blog, dataset). Nisten, resident technical expert, broke down the immense effort: "The amount of DevOps and…operations to do this work is pretty rough."
This globally deduplicated 15+ trillion-token corpus combines the best of Common Crawl with a curated selection of high-quality sources, setting a new standard for open-source LLM training. We talked about the importance of deduplication for model training - avoiding the "memorization" of repeated information that can skew a model's understanding of language. TxT360 takes a 360-degree approach to data quality and documentation – a huge win for accessibility.
Apache 2 Multimodal MoE from Rhymes AI called Aria (blog, HF, Try It )
Next, the Rhymes Aria model (25.3B total and only 3.9B active parameters!) This multimodal marvel operates as a Mixture of Experts (MoE), meaning it activates only the necessary parts of its vast network for a given task, making it surprisingly efficient. Aria excels in understanding image and video inputs, features a generous 64K token context window, and is available under the Apache 2 license – music to open-source developers’ ears! We even discussed its coding capabilities: imagine pasting images of code and getting intelligent responses.
I particularly love the focus on long multimodal input understanding (think longer videos) and super high resolution image support.
I uploaded this simple pin-out diagram of RaspberriPy and it got all the right answers correct! Including ones I missed myself (and won against Gemini 002 and the new Reka Flash!)
Big Companies and APIs
OpenAI new Agentic benchmark, can it compete with MLEs on Kaggle?
OpenAI snuck in a new benchmark, MLEBench (Paper, Github), specifically designed to evaluate AI agents performance on Machine Learning Engineering tasks. Designed around a curated collection of Kaggle competitions, creating a diverse set of challenging tasks that test real-world ML engineering skills such as training models, preparing datasets, and running experiments.
They found that the best-performing setup--OpenAI's o1-preview with AIDE scaffolding--achieves at least the level of a Kaggle bronze medal in 16.9% of competitions (though there are some that throw shade on this score)
Meta comes for our reality with Movie Gen
But let's be honest, Meta stole the show this week with Movie Gen (blog). This isn’t your average video generation model; it’s like something straight out of science fiction. Imagine creating long, high-definition videos, with different aspect ratios, personalized elements, and accompanying audio – all from text and image prompts. It's like the Holodeck is finally within reach!
Unfortunately, despite hinting at its size (30B) Meta is not releasing this model (just yet) nor is it available widely so far! But we'll keep our fingers crossed that it drops before SORA.
One super notable thing is, this model generates audio as well to accompany the video and it's quite remarkable. We listened to a few examples from Meta’s demo, and the sound effects were truly remarkable – everything from fireworks to rustling leaves. This model isn't just creating video, it's crafting experiences. (Sound on for the next example!)
They also have personalization built in, which is showcased here by one of the leads of LLama ,Roshan, as a scientist doing experiments and the realism is quite awesome to see (but I get why they are afraid of releasing this in open weights)
This Week’s Buzz: What I learned at Weights & Biases this week
My "buzz" this week was less about groundbreaking models and more about mastering the AI tools we have. We had a team meeting to share our best tips and tricks for using Cursor, and when I shared those insights on X (thread), they went surprisingly viral!
The big takeaway from the thread? Composer, Cursor’s latest feature, is a true game-changer. It allows for more complex refactoring and code generation across multiple files – the kind of stuff that would take hours manually. If you haven't tried Composer, you're seriously missing out. We also covered strategies for leveraging different models for specific tasks, like using O1 mini for outlining and then switching to the more robust Cloud 3.5 for generating code. Another gem we uncovered: selecting any text in the console and hitting opt+D will immediately send it to the chat to debug, super useful!
Over at Weights & Biases, my talented teammate, Soumik, released HEMM (X, Github), a comprehensive benchmark specifically designed for text-to-image generation models. Want to know how different models fare on image quality and prompt comprehension? Head over to the leaderboard on Weave (Leaderboard) and find out! And yes, it's true, Weave, our LLM observability tool, is multimodal (well within the theme of today's update)
Voice and Audio: Real-Time Conversations and the Quest for Affordable AI
OpenAI's DevDay was just a few weeks back, but the ripple effects of their announcements are still being felt. The big one for voice AI enthusiasts like myself? The RealTime API, offering developers a direct line to Advanced Voice Mode. My initial reaction was pure elation – finally, a chance to build some seriously interactive voice experiences that sound incredible and in near real time!
That feeling was quickly followed by a sharp intake of breath when I saw the price tag. As I discovered building my Halloween project, real-time streaming of this caliber isn’t exactly budget-friendly (yet!). Kwindla from trydaily.com, a voice AI expert, joined the show to shed some light on this issue.
We talked about the challenges of scaling these models and the complexities of context management in real-time audio processing. The conversation shifted to how OpenAI's RealTime API isn’t just about the model itself but also the innovative way they're managing the user experience and state within a conversation. He pointed out, however, that what we see and hear from the API isn’t exactly what’s going on under the hood, “What the model hears and what the transcription events give you back are not the same”. Turns out, OpenAI relies on Whisper for generating text transcriptions – it’s not directly from the voice model.
The pricing really threw me though, only testing a little bit, not even doing anything on production, and OpenAI charged almost 10$, the same conversations are happening across Reddit and OpenAI forums as well.
Hallo-Weave project update:
So as I let folks know on the show, I'm building a halloween AI decoration as a project, and integrating it into Weights & Biases Weave (that's why it's called HalloWeave)
After performing brain surgery, futzing with wires and LEDs, I finally have it set up so it wakes up on a trigger word (it's "Trick or Treat!"), takes a picture with the webcam (actual webcam, raspberryPi camera was god awful) and sends it to Gemini Flash to detect which costume this is and write a nice customized greeting.
Then I send that text to Cartesia to generate the speech using a British voice, and then I play it via a bluetooth speaker. Here's a video of the last stage (which still had some bluetooth issues, it's a bit better now)
Next up: I should decide if I care to integrate OpenAI Real time (and pay a LOT of $$$ for it) or fallback to existing LLM - TTS services and let kids actually have a conversation with the toy!
Stay tuned for more updates as we get closer to halloween, the project is open source HERE and the Weave dashboard will be open once it's live.
ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.
One More Thing… UIThub!
Before signing off, one super useful tool for you! It's so useful I recorded (and created an edit) video on it. I've also posted it on my brand new TikTok, Instagram, Youtube and Linkedin accounts, where it promptly did not receive any views, but hey, gotta start somewhere right? 😂
Phew! That’s a wrap for this week’s ThursdAI. From Nobel Prizes to new open-source tools, and even meta's incredibly promising (but still locked down) video gen models, the world of AI continues to surprise and delight (and maybe cause a mild existential crisis or two!). I'd love to hear your thoughts – what caught your eye? Are you building anything cool? Let me know in the comments, and I'll see you back here next week for more AI adventures! Oh, and don't forget to subscribe to the podcast (five-star ratings always appreciated 😉).
This is a public episode. If you’d like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe -
Hey, it's Alex. Ok, so mind is officially blown. I was sure this week was going to be wild, but I didn't expect everyone else besides OpenAI to pile on, exactly on ThursdAI.
Coming back from Dev Day (number 2) and am still processing, and wanted to actually do a recap by humans, not just the NotebookLM one I posted during the keynote itself (which was awesome and scary in a "will AI replace me as a podcaster" kind of way), and was incredible to have Simon Willison who was sitting just behind me most of Dev Day, join me for the recap!
But then the news kept coming, OpenAI released Canvas, which is a whole new way of interacting with chatGPT, BFL released a new Flux version that's 8x faster, Rev released a Whisper killer ASR that does diarizaiton and Google released Gemini 1.5 Flash 8B, and said that with prompt caching (which OpenAI now also has, yay) this will cost a whopping 0.01 / Mtok. That's 1 cent per million tokens, for a multimodal model with 1 million context window. 🤯
This whole week was crazy, as last ThursdAI after finishing the newsletter I went to meet tons of folks at the AI Tinkerers in Seattle, and did a little EvalForge demo (which you can see here) and wanted to share EvalForge with you as well, it's early but very promising so feedback and PRs are welcome!
WHAT A WEEK, TL;DR for those who want the links and let's dive in 👇
* OpenAI - Dev Day Recap (Alex, Simon Willison)
* Recap of Dev Day
* RealTime API launched
* Prompt Caching launched
* Model Distillation is the new finetune
* Finetuning 4o with images (Skalski guide)
* Fireside chat Q&A with Sam
* Open Source LLMs
* NVIDIA finally releases NVML (HF)
* This weeks Buzz
* Alex discussed his demo of EvalForge at the AI Tinkers event in Seattle in "This Week's Buzz". (Demo, EvalForge, AI TInkerers)
* Big Companies & APIs
* Google has released Gemini Flash 8B - 0.01 per million tokens cached (X, Blog)
* Voice & Audio
* Rev breaks SOTA on ASR with Rev ASR and Rev Diarize (Blog, Github, HF)
* AI Art & Diffusion & 3D
* BFL relases Flux1.1[pro] - 3x-6x faster than 1.0 and higher quality (was 🫐) - (Blog, Try it)
The day I met Sam Altman / Dev Day recap
Last Dev Day (my coverage here) was a "singular" day in AI for me, given it also had the "keep AI open source" with Nous Research and Grimes, and this Dev Day I was delighted to find out that the vibe was completely different, and focused less on bombastic announcements or models, but on practical dev focused things.
This meant that OpenAI cherry picked folks who actively develop with their tools, and they didn't invite traditional media, only folks like yours truly, @swyx from Latent space, Rowan from Rundown, Simon Willison and Dan Shipper, you know, newsletter and podcast folks who actually build!
This also allowed for many many OpenAI employees who work on the products and APIs we get to use, were there to receive feedback, help folks with prompting, and just generally interact with the devs, and build that community. I want to shoutout my friends Ilan (who was in the keynote as the strawberry salesman interacting with RealTime API agent), Will DePue from the SORA team, with whom we had an incredible conversation about ethics and legality of projects, Christine McLeavey who runs the Audio team, with whom I shared a video of my daughter crying when chatGPT didn't understand her, Katia, Kevin and Romain on the incredible DevEx/DevRel team and finally, my new buddy Jason who does infra, and was fighting bugs all day and only joined the pub after shipping RealTime to all of us.
I've collected all these folks in a convenient and super high signal X list here so definitely give that list a follow if you'd like to tap into their streams
For the actual announcements, I've already covered this in my Dev Day post here (which was payed subscribers only, but is now open to all) and Simon did an incredible summary on his Substack as well
The highlights were definitely the new RealTime API that let's developers build with Advanced Voice Mode, Prompt Caching that will happen automatically and reduce all your long context API calls by a whopping 50% and finetuning of models that they are rebranding into Distillation and adding new tools to make it easier (including Vision Finetuning for the first time!)
Meeting Sam Altman
While I didn't get a "media" pass or anything like this, and didn't really get to sit down with OpenAI execs (see Swyx on Latent Space for those conversations), I did have a chance to ask Sam multiple things.
First at the closing fireside chat between Sam and Kevin Weil (CPO at OpenAI), Kevin first asked Sam a bunch of questions, and then they gave out the microphones to folks, and I asked the only question that got Sam to smile
Sam and Kevin went on for a while, and that Q&A was actually very interesting, so much so, that I had to recruit my favorite Notebook LM podcast hosts, to go through it and give you an overview, so here's that Notebook LM, with the transcript of the whole Q&A (maybe i'll publish it as a standalone episode? LMK in the comments)
After the official day was over, there was a reception, at the same gorgeous Fort Mason location, with drinks and light food, and as you might imagine, this was great for networking.
But the real post dev day event was hosted by OpenAI devs at a bar, Palm House, which both Sam and Greg Brokman just came to and hung out with folks. I missed Sam last time and was very eager to go and ask him follow up questions this time, when I saw he was just chilling at that bar, talking to devs, as though he didn't "just" complete the largest funding round in VC history ($6.6B at $175B valuation) and went through a lot of drama/turmoil with the departure of a lot of senior leadership!
Sam was awesome to briefly chat with, tho as you might imagine, it was loud and tons of folks wanted selfies, but we did discuss how AI affects the real world, job replacement stuff were brought up, and how developers are using the OpenAI products.
What we learned, thanks to Sigil, is that o1 was named partly as a "reset" like the main blogpost claimed and partly as "alien of extraordinary ability" , which is the the official designation of the o1 visa, and that Sam came up with this joke himself.
Is anyone here smarter than o1? Do you think you still will by o2?
One of the highest impact questions was by Sam himself to the audience.
Who feels like they've spent a lot of time with O1, and they would say, like, I feel definitively smarter than that thing?
— Sam Altman
When Sam asked this at first, a few hands hesitatingly went up. He then followed up with
Do you think you still will by O2? No one. No one taking the bet.One of the challenges that we face is like, we know how to go do this thing that we think will be like, at least probably smarter than all of us in like a broad array of tasks
This was a very palpable moment that folks looked around and realized, what OpenAI folks have probably internalized a long time ago, we're living in INSANE times, and even those of us at the frontier or research, AI use and development, don't necessarily understand or internalize how WILD the upcoming few months, years will be.
And then we all promptly forgot to have an existential crisis about it, and took our self driving Waymo's to meet Sam Altman at a bar 😂
This weeks Buzz from Weights & Biases
Hey so... after finishing ThursdAI last week I went to Seattle Tinkerers event and gave a demo (and sponsored the event with a raffle of Meta Raybans). I demoed our project called EvalForge, which I built the frontend of and my collegue Anish on backend, as we tried to replicate the Who validates the validators paper by Shreya Shankar, here’s that demo, and EvalForge Github for many of you who asked to see it.
Please let me know what you think, I love doing demos and would love feedback and ideas for the next one (coming up in October!)
OpenAI chatGPT Canvas - a complete new way to interact with chatGPT
Just 2 days after Dev Day, and as breaking news during the show, OpenAI also shipped a new way to interact with chatGPT, called Canvas!
Get ready to say goodbye to simple chats and hello to a whole new era of AI collaboration! Canvas, a groundbreaking interface that transforms ChatGPT into a true creative partner for writing and coding projects. Imagine having a tireless copy editor, a brilliant code reviewer, and an endless source of inspiration all rolled into one – that's Canvas!
Canvas moves beyond the limitations of a simple chat window, offering a dedicated space where you and ChatGPT can work side-by-side. Canvas opens in a separate window, allowing for a more visual and interactive workflow. You can directly edit text or code within Canvas, highlight sections for specific feedback, and even use a handy menu of shortcuts to request tasks like adjusting the length of your writing, debugging code, or adding final polish. And just like with your favorite design tools, you can easily restore previous versions using the back button.
Per Karina, OpenAI has trained a special GPT-4o model specifically for Canvas, enabling it to understand the context of your project and provide more insightful assistance. They used synthetic data, generated by O1 which led them to outperform the basic version of GPT-4o by 30% in accuracy.
A general pattern emerges, where new frontiers in intelligence are advancing also older models (and humans as well).
Gemini Flash 8B makes intelligence essentially free
Google folks were not about to take this week litely and decided to hit back with one of the most insane upgrades to pricing I've seen. The newly announced Gemini Flash 1.5 8B is goint to cost just... $0.01 per million tokens 🤯 (when using caching, 3 cents when not cached)
This basically turns intelligence free. And while it is free, it's still their multimodal model (supports images) and has a HUGE context window of 1M tokens.
The evals look ridiculous as well, this 8B param model, now almost matches Flash from May of this year, less than 6 month ago, while giving developers 2x the rate limits and lower latency as well.
What will you do with free intelligence? What will you do with free intelligence of o1 quality in a year? what about o2 quality in 3 years?
Bye Bye whisper? Rev open sources Reverb and Reverb Diarize + turbo models (Blog, HF, Github)
With a "WTF just happened" breaking news, a company called Rev.com releases what they consider a SOTA ASR model, that obliterates Whisper (English only for now) on metrics like WER, and includes a specific diarization focused model.
Trained on 200,000 hours of English speech, expertly transcribed by humans, which according to their claims, is the largest dataset that any ASR model has been trained on, they achieve some incredible results that blow whisper out of the water (lower WER is better)
They also released a seemingly incredible diarization model, which helps understand who speaks when (and is usually added on top of Whisper)
For diarization, Rev used the high-performance pyannote.audio library to fine-tune existing models on 26,000 hours of expertly labeled data, significantly improving their performance
While this is for English only, getting a SOTA transcription model in the open, is remarkable. Rev opened up this model on HuggingFace with a non commercial license, so folks can play around (and distill?) it, while also making it available in their API for very cheap and also a self hosted solution in a docker container
Black Forest Labs feeding up blueberries - new Flux 1.1[pro] is here (Blog, Try It)
What is a ThursdAI without multiple SOTA advancements in all fields of AI? In an effort to prove this to be very true, the folks behind FLUX, revealed that the mysterious 🫐 model that was trending on some image comparison leaderboards is in fact a new version of Flux pro, specifically 1.1[pro]
FLUX1.1 [pro] provides six times faster generation than its predecessor FLUX.1 [pro] while also improving image quality, prompt adherence, and diversity
Just a bit over 2 month since the inital release, and proving that they are THE frontier lab for image diffusion models, folks at BLF are dropping a model that outperforms their previous one on users voting and quality, while being a much faster!
They have partnered with Fal, Together, Replicate to disseminate this model (it's not on X quite yet) but are now also offering developers direct access to their own API and at a competitive pricing of just 4 cents per image generation (while being faster AND cheaper AND higher quality than the previous Flux 😮) and you can try it out on Fal here
Phew! What a whirlwind! Even I need a moment to catch my breath after that AI news tsunami. But don’t worry, the conversation doesn't end here. I barely scratched the surface of these groundbreaking announcements, so dive into the podcast episode for the full scoop – Simon Willison’s insights on OpenAI’s moves are pure gold, and Maxim LaBonne spills the tea on Liquid AI's audacious plan to dethrone transformers (yes, you read that right). And for those of you who prefer skimming, check out my Dev Day summary (open to all now). As always, hit me up in the comments with your thoughts. What are you most excited about? Are you building anything cool with these new tools? Let's keep the conversation going!Alex
This is a public episode. If you’d like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe -
Hey, Alex here. Super quick, as I’m still attending Dev Day, but I didn’t want to leave you hanging (if you're a paid subscriber!), I have decided to outsource my job and give the amazing podcasters of NoteBookLM the whole transcript of the opening keynote of OpenAI Dev Day.
You can see a blog of everything they just posted here
Here’s a summary of all what was announced:
* Developer-Centric Approach: OpenAI consistently emphasized the importance of developers in their mission to build beneficial AGI. The speaker stated, "OpenAI's mission is to build AGI that benefits all of humanity, and developers are critical to that mission... we cannot do this without you."
* Reasoning as a New Frontier: The introduction of the GPT-4 series, specifically the "O1" models, marks a significant step towards AI with advanced reasoning capabilities, going beyond the limitations of previous models like GPT-3.
* Multimodal Capabilities: OpenAI is expanding the potential of AI applications by introducing multimodal capabilities, particularly focusing on real-time speech-to-speech interaction through the new Realtime API.
* Customization and Fine-Tuning: Empowering developers to customize models is a key theme. OpenAI introduced Vision for fine-tuning with images and announced easier access to fine-tuning with model distillation tools.
* Accessibility and Scalability: OpenAI demonstrated a commitment to making AI more accessible and cost-effective for developers through initiatives like price reductions, prompt caching, and model distillation tools.
Important Ideas and Facts:
1. The O1 Models:
* Represent a shift towards AI models with enhanced reasoning capabilities, surpassing previous generations in problem-solving and logical thought processes.
* O1 Preview is positioned as the most powerful reasoning model, designed for complex problems requiring extended thought processes.
* O1 Mini offers a faster, cheaper, and smaller alternative, particularly suited for tasks like code debugging and agent-based applications.
* Both models demonstrate advanced capabilities in coding, math, and scientific reasoning.
* OpenAI highlighted the ability of O1 models to work with developers as "thought partners," understanding complex instructions and contributing to the development process.
Quote: "The shift to reasoning introduces a new shape of AI capability. The ability for our model to scale and correct the process is pretty mind-blowing. So we are resetting the clock, and we are introducing a new series of models under the name O1."
2. Realtime API:
* Enables developers to build real-time AI experiences directly into their applications using WebSockets.
* Launches with support for speech-to-speech interaction, leveraging the technology behind ChatGPT's advanced voice models.
* Offers natural and seamless integration of voice capabilities, allowing for dynamic and interactive user experiences.
* Showcased the potential to revolutionize human-computer interaction across various domains like driving, education, and accessibility.
Quote: "You know, a lot of you have been asking about building amazing speech-to-speech experiences right into your apps. Well now, you can."
3. Vision, Fine-Tuning, and Model Distillation:
* Vision introduces the ability to use images for fine-tuning, enabling developers to enhance model performance in image understanding tasks.
* Fine-tuning with Vision opens up opportunities in diverse fields such as product recommendations, medical imaging, and autonomous driving.
* OpenAI emphasized the accessibility of these features, stating that "fine-tuning with Vision is available to every single developer."
* Model distillation tools facilitate the creation of smaller, more efficient models by transferring knowledge from larger models like O1 and GPT-4.
* This approach addresses cost concerns and makes advanced AI capabilities more accessible for a wider range of applications and developers.
Quote: "With distillation, you take the outputs of a large model to supervise, to teach a smaller model. And so today, we are announcing our own model distillation tools."
4. Cost Reduction and Accessibility:
* OpenAI highlighted its commitment to lowering the cost of AI models, making them more accessible for diverse use cases.
* Announced a 90% decrease in cost per token since the release of GPT-3, emphasizing continuous efforts to improve affordability.
* Introduced prompt caching, automatically providing a 50% discount for input tokens the model has recently processed.
* These initiatives aim to remove financial barriers and encourage wider adoption of AI technologies across various industries.
Quote: "Every time we reduce the price, we see new types of applications, new types of use cases emerge. We're super far from the price equilibrium. In a way, models are still too expensive to be bought at massive scale."
Conclusion:
OpenAI DevDay conveyed a strong message of developer empowerment and a commitment to pushing the boundaries of AI capabilities. With new models like O1, the introduction of the Realtime API, and a dedicated focus on accessibility and customization, OpenAI is paving the way for a new wave of innovative and impactful AI applications developed by a global community.
This is a public episode. If you’d like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe -
Hey everyone, it's Alex (still traveling!), and oh boy, what a week again! Advanced Voice Mode is finally here from OpenAI, Google updated their Gemini models in a huge way and then Meta announced MultiModal LlaMas and on device mini Llamas (and we also got a "better"? multimodal from Allen AI called MOLMO!)
From Weights & Biases perspective, our hackathon was a success this weekend, and then I went down to Menlo Park for my first Meta Connect conference, full of news and updates and will do a full recap here as well.
ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.
Overall another crazy week in AI, and it seems that everyone is trying to rush something out the door before OpenAI Dev Day next week (which I'll cover as well!) Get ready, folks, because Dev Day is going to be epic!
TL;DR of all topics covered:
* Open Source LLMs
* Meta llama 3.2 Multimodal models (11B & 90B) (X, HF, try free)
* Meta Llama 3.2 tiny models 1B & 3B parameters (X, Blog, download)
* Allen AI releases MOLMO - open SOTA multimodal AI models (X, Blog, HF, Try It)
* Big CO LLMs + APIs
* OpenAI releases Advanced Voice Mode to all & Mira Murati leaves OpenAI
* Google updates Gemini 1.5-Pro-002 and 1.5-Flash-002 (Blog)
* This weeks Buzz
* Our free course is LIVE - more than 3000 already started learning how to build advanced RAG++
* Sponsoring tonights AI Tinkerers in Seattle, if you're in Seattle, come through for my demo
* Voice & Audio
* Meta also launches voice mode (demo)
* Tools & Others
* Project ORION - holographic glasses are here! (link)
Meta gives us new LLaMas and AI hardware
LLama 3.2 Multimodal 11B and 90B
This was by far the biggest OpenSource release of this week (tho see below, may not be the "best"), as a rumored released finally came out, and Meta has given our Llama eyes! Coming with 2 versions (well 4 if you count the base models which they also released), these new MultiModal LLaMas were trained with an adapter architecture, keeping the underlying text models the same, and placing a vision encoder that was trained and finetuned separately on top.
LLama 90B is among the best open-source mutlimodal models available
— Meta team at launch
These new vision adapters were trained on a massive 6 Billion images, including synthetic data generation by 405B for questions/captions, and finetuned with a subset of 600M high quality image pairs.
Unlike the rest of their models, the Meta team did NOT claim SOTA on these models, and the benchmarks are very good but not the best we've seen (Qwen 2 VL from a couple of weeks ago, and MOLMO from today beat it on several benchmarks)
With text-only inputs, the Llama 3.2 Vision models are functionally the same as the Llama 3.1 Text models; this allows the Llama 3.2 Vision models to be a drop-in replacement for Llama 3.1 8B/70B with added image understanding capabilities.
Seems like these models don't support multi image or video as well (unlike Pixtral for example) nor tool use with images.
Meta will also release these models on meta.ai and every other platform, and they cited a crazy 500 million monthly active users of their AI services across all their apps 🤯 which marks them as the leading AI services provider in the world now.
Llama 3.2 Lightweight Models (1B/3B)
The additional and maybe more exciting thing that we got form Meta was the introduction of the small/lightweight models of 1B and 3B parameters.
Trained on up to 9T tokens, and distilled / pruned from larger models, these are aimed for on-device inference (and by device here we mean from laptops to mobiles to soon... glasses? more on this later)
In fact, meta released an IOS demo, that runs these models, takes a group chat, summarizes and calls the calendar tool to schedule based on the conversation, and all this happens on device without the info leaving to a larger model.
They have also been able to prune down the LLama-guard safety model they released to under 500Mb and have had demos of it running on client side and hiding user input on the fly as the user types something bad!
Interestingly, here too, the models were not SOTA, even in small category, with tiny models like Qwen 2.5 3B beating these models on many benchmarks, but they are outlining a new distillation / pruning era for Meta as they aim for these models to run on device, eventually even glasses (and some said Smart Thermostats)
In fact they are so tiny, that the communtiy quantized them, released and I was able to download these models, all while the keynote was still going! Here I am running the Llama 3B during the developer keynote!
Speaking AI - not only from OpenAI
Zuck also showcased a voice based Llama that's coming to Meta AI (unlike OpenAI it's likely a pipeline of TTS/STT) but it worked really fast and Zuck was able to interrupt it.
And they also showed a crazy animated AI avatar of a creator, that was fully backed by Llama, while the human creator was on stage, Zuck chatted with his avatar and reaction times were really really impressive.
AI Hardware was glasses all along?
Look we've all seen the blunders of this year, the Humane AI Ping, the Rabbit R1 (which sits on my desk and I haven't recharged in two months) but maybe Meta is the answer here?
Zuck took a bold claim that glasses are actually the perfect form factor for AI, it sits on your face, sees what you see and hears what you hear, and can whisper in your ear without disrupting the connection between you and your conversation partner.
They haven't announced new Meta Raybans, but did update the lineup with a new set of transition lenses (to be able to wear those glasses inside and out) and a special edition clear case pair that looks very sleek + new AI features like memories to be able to ask the glasses "hey Meta where did I park" or be able to continue the conversation. I had to get me a pair of this limited edition ones!
Project ORION - first holographic glasses
And of course, the biggest announcement of the Meta Connect was the super secret decade old project of fully holographic AR glasses, which they called ORION.
Zuck introduced these as the most innovative and technologically dense set of glasses in the world. They always said the form factor will become just "glasses" and they actually did it ( a week after Snap spectacles ) tho those are not going to get released to any one any time soon, hell they only made a few thousand of these and they are extremely expensive.
With 70 deg FOV, cameras, speakers and a compute puck, these glasses pack a full day battery with under 100grams of weight, and have a custom silicon, custom displays with MicroLED projector and just... tons of more innovation in there.
They also come in 3 pieces, the glasses themselves, the compute wireless pack that will hold the LLaMas in your pocket and the EMG wristband that allows you to control these devices using muscle signals.
These won't ship as a product tho so don't expect to get them soon, but they are real, and will allow Meta to build the product that we will get on top of these by 2030
AI usecases
So what will these glasses be able to do? well, they showed off a live translation feature on stage that mostly worked, where you just talk and listen to another language in near real time, which was great. There are a bunch of mixed reality games, you'd be able to call people and see them in your glasses on a virtual screen and soon you'll show up as an avatar there as well.
The AI use-case they showed beyond just translation was MultiModality stuff, where they had a bunch of ingredients for a shake, and you could ask your AI assistant, which shake you can make with what it sees. Do you really need
I'm so excited about these to finally come to people I screamed in the audience 👀👓
OpenAI gives everyone* advanced voice mode
It's finally here, and if you're paying for chatGPT you know this, the long announced Advanced Voice Mode for chatGPT is now rolled out to all plus members.
The new updated since the beta are, 5 new voices (Maple, Spruce, Vale, Arbor and Sol), finally access to custom instructions and memory, so you can ask it to remember things and also to know who you are and your preferences (try saving your jailbreaks there)
Unfortunately, as predicted, by the time it rolled out to everyone, this feels way less exciting than it did 6 month ago, the model is way less emotional, refuses to sing (tho folks are making it anyway) and generally feels way less "wow" than what we saw. Less "HER" than we wanted for sure Seriously, they nerfed the singing! Why OpenAI, why?
Pro tip of mine that went viral : you can set your action button on the newer iphones to immediately start the voice conversation with 1 click.
*This new mode is not available in EU
This weeks Buzz - our new advanced RAG++ course is live
I had an awesome time with my colleagues Ayush and Bharat today, after they finally released a FREE advanced RAG course they've been working so hard on for the past few months! Definitely check out our conversation, but better yet, why don't you roll into the course? it's FREE and you'll get to learn about data ingestion, evaluation, query enhancement and more!
New Gemini 002 is 50% cheaper, 2x faster and better at MMLU-pro
It seems that every major lab (besides Anthropic) released a big thing this week to try and get under Meta's skin?
Google announced an update to their Gemini Pro/Flash models, called 002, which is a very significant update!
Not only are these models 50% cheaper now (Pro price went down by 50% on
-
Hey folks, Alex here, back with another ThursdAI recap – and let me tell you, this week's episode was a whirlwind of open-source goodness, mind-bending inference techniques, and a whole lotta talk about talking AIs! We dove deep into the world of LLMs, from Alibaba's massive Qwen 2.5 drop to the quirky, real-time reactions of Moshi.
We even got a sneak peek at Nous Research's ambitious new project, Forge, which promises to unlock some serious LLM potential. So grab your pumpkin spice latte (it's that time again isn't it? 🍁) settle in, and let's recap the AI awesomeness that went down on ThursdAI, September 19th!
ThursdAI is brought to you (as always) by Weights & Biases, we still have a few spots left in our Hackathon this weekend and our new advanced RAG course is now released and is FREE to sign up!
TL;DR of all topics + show notes and links
* Open Source LLMs
* Alibaba Qwen 2.5 models drop + Qwen 2.5 Math and Qwen 2.5 Code (X, HF, Blog, Try It)
* Qwen 2.5 Coder 1.5B is running on a 4 year old phone (Nisten)
* KyutAI open sources Moshi & Mimi (Moshiko & Moshika) - end to end voice chat model (X, HF, Paper)
* Microsoft releases GRIN-MoE - tiny (6.6B active) MoE with 79.4 MMLU (X, HF, GIthub)
* Nvidia - announces NVLM 1.0 - frontier class multimodal LLMS (no weights yet, X)
* Big CO LLMs + APIs
* OpenAI O1 results from LMsys do NOT disappoint - vibe checks also confirm, new KING llm in town (Thread)
* NousResearch announces Forge in waitlist - their MCTS enabled inference product (X)
* This weeks Buzz - everything Weights & Biases related this week
* Judgement Day (hackathon) is in 2 days! Still places to come hack with us Sign up
* Our new RAG Course is live - learn all about advanced RAG from WandB, Cohere and Weaviate (sign up for free)
* Vision & Video
* Youtube announces DreamScreen - generative AI image and video in youtube shorts ( Blog)
* CogVideoX-5B-I2V - leading open source img2video model (X, HF)
* Runway, DreamMachine & Kling all announce text-2-video over API (Runway, DreamMachine)
* Runway announces video 2 video model (X)
* Tools
* Snap announces their XR glasses - have hand tracking and AI features (X)
Open Source Explosion!
👑 Qwen 2.5: new king of OSS llm models with 12 model releases, including instruct, math and coder versions
This week's open-source highlight was undoubtedly the release of Alibaba's Qwen 2.5 models. We had Justin Lin from the Qwen team join us live to break down this monster drop, which includes a whopping seven different sizes, ranging from a nimble 0.5B parameter model all the way up to a colossal 72B beast! And as if that wasn't enough, they also dropped Qwen 2.5 Coder and Qwen 2.5 Math models, further specializing their LLM arsenal. As Justin mentioned, they heard the community's calls for 14B and 32B models loud and clear – and they delivered! "We do not have enough GPUs to train the models," Justin admitted, "but there are a lot of voices in the community...so we endeavor for it and bring them to you." Talk about listening to your users!
Trained on an astronomical 18 trillion tokens (that’s even more than Llama 3.1 at 15T!), Qwen 2.5 shows significant improvements across the board, especially in coding and math. They even open-sourced the previously closed-weight Qwen 2 VL 72B, giving us access to the best open-source vision language models out there. With a 128K context window, these models are ready to tackle some serious tasks. As Nisten exclaimed after putting the 32B model through its paces, "It's really practical…I was dumping in my docs and my code base and then like actually asking questions."
It's safe to say that Qwen 2.5 coder is now the best coding LLM that you can use, and just in time for our chat, a new update from ZeroEval confirms, Qwen 2.5 models are the absolute kings of OSS LLMS, beating Mistral large, 4o-mini, Gemini Flash and other huge models with just 72B parameters 👏
Moshi: The Chatty Cathy of AI
We've covered Moshi Voice back in July, and they have promised to open source the whole stack, and now finally they did! Including the LLM and the Mimi Audio Encoder!
This quirky little 7.6B parameter model is a speech-to-speech marvel, capable of understanding your voice and responding in kind. It's an end-to-end model, meaning it handles the entire speech-to-speech process internally, without relying on separate speech-to-text and text-to-speech models.
While it might not be a logic genius, Moshi's real-time reactions are undeniably uncanny. Wolfram Ravenwolf described the experience: "It's uncanny when you don't even realize you finished speaking and it already starts to answer." The speed comes from the integrated architecture and efficient codecs, boasting a theoretical response time of just 160 milliseconds!
Moshi uses (also open sourced) Mimi neural audio codec, and achieves 12.5 Hz representation with just 1.1 kbps bandwidth.
You can download it and run on your own machine or give it a try here just don't expect a masterful conversationalist hehe
Gradient-Informed MoE (GRIN-MoE): A Tiny Titan
Just before our live show, Microsoft dropped a paper on GrinMoE, a gradient-informed Mixture of Experts model. We were lucky enough to have the lead author, Liyuan Liu (aka Lucas), join us impromptu to discuss this exciting development. Despite having only 6.6B active parameters (16 x 3.8B experts), GrinMoE manages to achieve remarkable performance, even outperforming larger models like Phi-3 on certain benchmarks. It's a testament to the power of clever architecture and training techniques. Plus, it's open-sourced under the MIT license, making it a valuable resource for the community.
NVIDIA NVLM: A Teaser for Now
NVIDIA announced NVLM 1.0, their own set of multimodal LLMs, but alas, no weights were released. We’ll have to wait and see how they stack up against the competition once they finally let us get our hands on them. Interestingly, while claiming SOTA on some vision tasks, they haven't actually compared themselves to Qwen 2 VL, which we know is really really good at vision tasks 🤔
Nous Research Unveils Forge: Inference Time Compute Powerhouse (beating o1 at AIME Eval!)
Fresh off their NousCon event, Karan and Shannon from Nous Research joined us to discuss their latest project, Forge. Described by Shannon as "Jarvis on the front end," Forge is an inference engine designed to push the limits of what’s possible with existing LLMs. Their secret weapon? Inference-time compute. By implementing sophisticated techniques like Monte Carlo Tree Search (MCTS), Forge can outperform larger models on complex reasoning tasks beating OpenAI's o1-preview at the AIME Eval, competition math benchmark, even with smaller, locally runnable models like Hermes 70B. As Karan emphasized, “We’re actually just scoring with Hermes 3.1, which is available to everyone already...we can scale it up to outperform everything on math, just using a system like this.”
Forge isn't just about raw performance, though. It's built with usability and transparency in mind. Unlike OpenAI's 01, which obfuscates its chain of thought reasoning, Forge provides users with a clear visual representation of the model's thought process. "You will still have access in the sidebar to the full chain of thought," Shannon explained, adding, “There’s a little visualizer and it will show you the trajectory through the tree… you’ll be able to see exactly what the model was doing and why the node was selected.” Forge also boasts built-in memory, a graph database, and even code interpreter capabilities, initially supporting Python, making it a powerful platform for building complex LLM applications.
Forge is currently in a closed beta, but a waitlist is open for eager users. Karan and Shannon are taking a cautious approach to the rollout, as this is Nous Research’s first foray into hosting a product. For those lucky enough to gain access, Forge offers a tantalizing glimpse into the future of LLM interaction, promising greater transparency, improved reasoning, and more control over the model's behavior.
For ThursdAI readers early, here's a waitlist form to test it out!
Big Companies and APIs: The Reasoning Revolution
OpenAI’s 01: A New Era of LLM Reasoning
The big story in the Big Tech world is OpenAI's 01. Since we covered it live last week as it dropped, many of us have been playing with these new reasoning models, and collecting "vibes" from the community. These models represent a major leap in reasoning capabilities, and the results speak for themselves.
01 Preview claimed the top spot across the board on the LMSys Arena leaderboard, demonstrating significant improvements in complex tasks like competition math and coding. Even the smaller 01 Mini showed impressive performance, outshining larger models in certain technical areas. (and the jump in ELO score above the rest in MATH is just incredible to see!) and some folks made this video viral, of a PHD candidate reacting to 01 writing in 1 shot, code that took him a year to write, check it out, it’s priceless.
One key aspect of 01 is the concept of “inference-time compute”. As Noam Brown from OpenAI calls it, this represents a "new scaling paradigm", allowing the model to spend more time “thinking” during inference, leading to significantly improved performance on reasoning tasks. The implications of this are vast, opening up the possibility of LLMs tackling long-horizon problems in areas like drug discovery and physics.
However, the opacity surrounding 01’s chain of thought reasoning being hidden/obfuscated and the ban on users asking about it was a major point of contention at least within the ThursdAI chat. As Wolfram Ravenwolf put it, "The AI gives you an answer and you can't even ask how it got there. That is the wrong direction." as he was referring to the fact that not only is asking about the reasoning impossible, some folks were actually getting threatening emails and getting banned from using the product all together 😮
This Week's Buzz: Hackathons and RAG Courses!
We're almost ready to host our Weights & Biases Judgment Day Hackathon (LLMs as a judge, anyone?) with a few spots left, so if you're reading this and in SF, come hang out with us!
And the main thing I gave an update about is our Advanced RAG course, packed with insights from experts at Weights & Biases, Cohere, and Weaviate. Definitely check those out if you want to level up your LLM skills (and it's FREE in our courses academy!)
Vision & Video: The Rise of Generative Video
Generative video is having its moment, with a flurry of exciting announcements this week. First up, the open-source CogVideoX-5B-I2V, which brings accessible image-to-video capabilities to the masses. It's not perfect, but being able to generate video on your own hardware is a game-changer.
On the closed-source front, YouTube announced the integration of generative AI into YouTube Shorts with their DreamScreen feature, bringing AI-powered video generation to a massive audience. We also saw API releases from three leading video model providers: Runway, DreamMachine, and Kling, making it easier than ever to integrate generative video into applications. Runway even unveiled a video-to-video model, offering even more control over the creative process, and it's wild, check out what folks are doing with video-2-video!
One last thing here, Kling is adding a motion brush feature to help users guide their video generations, and it just looks so awesome I wanted to show you
Whew! That was one hell of a week, tho from the big companies perspective, it was a very slow week, getting a new OSS king, an end to end voice model and a new hint of inference platform from Nous, and having all those folks come to the show was awesome!
If you're reading all the way down to here, it seems that you like this content, why not share it with 1 or two friends? 👇 And as always, thank you for reading and subscribing! 🫶
P.S - I’m traveling for the next two weeks, and this week the live show was live recorded from San Francisco, thanks to my dear friends swyx & Alessio for hosting my again in their awesome Latent Space pod studio at Solaris SF!
This is a public episode. If you’d like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe -
March 14th, 2023 was the day ThursdAI was born, it was also the day OpenAI released GPT-4, and I jumped into a Twitter space and started chaotically reacting together with other folks about what a new release of a paradigm shifting model from OpenAI means, what are the details, the new capabilities. Today, it happened again!
Hey, it's Alex, I'm back from my mini vacation (pic after the signature) and boy am I glad I decided to not miss September 12th! The long rumored 🍓 thinking model from OpenAI, dropped as breaking news in the middle of ThursdAI live show, giving us plenty of time to react live!
But before this, we already had an amazing show with some great guests! Devendra Chaplot from Mistral came on and talked about their newly torrented (yeah they did that again) Pixtral VLM, their first multi modal! , and then I had the honor to host Steven Johnson and Raiza Martin from NotebookLM team at Google Labs which shipped something so uncannily good, that I legit said "holy fu*k" on X in a reaction!
So let's get into it (TL;DR and links will be at the end of this newsletter)
OpenAI o1, o1 preview and o1-mini, a series of new "reasoning" models
This is it folks, the strawberries have bloomed, and we finally get to taste them. OpenAI has released (without a waitlist, 100% rollout!) o1-preview and o1-mini models to chatGPT and API (tho only for tier-5 customers) 👏 and are working on releasing 01 as well.
These are models that think before they speak, and have been trained to imitate "system 2" thinking, and integrate chain-of-thought reasoning internally, using Reinforcement Learning and special thinking tokens, which allows them to actually review what they are about to say before they are saying it, achieving remarkable results on logic based questions.
Specifically you can see the jumps in the very very hard things like competition math and competition code, because those usually require a lot of reasoning, which is what these models were trained to do well.
New scaling paradigm
Noam Brown from OpenAI calls this a "new scaling paradigm" and Dr Jim Fan explains why, with this new way of "reasoning", the longer the model thinks - the better it does on reasoning tasks, they call this "test-time compute" or "inference-time compute" as opposed to compute that was used to train the model. This shifting of computation down to inference time is the essence of the paradigm shift, as in, pre-training can be very limiting computationally as the models scale in size of parameters, they can only go so big until you have to start building out a huge new supercluster of GPUs to host the next training run (Remember Elon's Colossus from last week?).
The interesting thing to consider here is, while current "thinking" times are ranging between a few seconds to a minute, imagine giving this model hours, days, weeks to think about new drug problems, physics problems 🤯.
Prompting o1
Interestingly, a new prompting paradigm has also been introduced. These models now have CoT (think "step by step") built-in, so you no longer have to include it in your prompts. By simply switching to o1-mini, most users will see better results right off the bat. OpenAI has worked with the Devin team to test drive these models, and these folks found that asking the new models to just give the final answer often works better and avoids redundancy in instructions.
The community of course will learn what works and doesn't in the next few hours, days, weeks, which is why we got 01-preview and not the actual (much better) o1.
Safety implications and future plans
According to Greg Brokman, this inference time compute also greatly helps with aligning the model to policies, giving it time to think about policies at length, and improving security and jailbreak preventions, not only logic.
The folks at OpenAI are so proud of all of the above that they have decided to restart the count and call this series o1, but they did mention that they are going to release GPT series models as well, adding to the confusing marketing around their models.
Open Source LLMs
Reflecting on Reflection 70B
Last week, Reflection 70B was supposed to launch live on the ThursdAI show, and while it didn't happen live, I did add it in post editing, and sent the newsletter, and packed my bag, and flew for my vacation. I got many DMs since then, and at some point couldn't resist checking and what I saw was complete chaos, and despite this, I tried to disconnect still until last night.
So here's what I could gather since last night. The claims of a llama 3.1 70B finetune that Matt Shumer and Sahil Chaudhary from Glaive beating Sonnet 3.5 are proven false, nobody was able to reproduce those evals they posted and boasted about, which is a damn shame.
Not only that, multiple trusted folks from our community, like Kyle Corbitt, Alex Atallah have reached out to Matt in to try to and get to the bottom of how such a thing would happen, and how claims like these could have been made in good faith. (or was there foul play)
The core idea of something like Reflection is actually very interesting, but alas, the inability to replicate, but also to stop engaging with he community openly (I've reached out to Matt and given him the opportunity to come to the show and address the topic, he did not reply), keep the model on hugging face where it's still trending, claiming to be the world's number 1 open source model, all these smell really bad, despite multiple efforts on out part to give the benefit of the doubt here.
As for my part in building the hype on this (last week's issues till claims that this model is top open source model), I addressed it in the beginning of the show, but then twitter spaces crashed, but unfortunately as much as I'd like to be able to personally check every thing I cover, I often have to rely on the reputation of my sources, which is easier with established big companies, and this time this approached failed me.
This weeks Buzzzzzz - One last week till our hackathon!
Look at this point, if you read this newsletter and don't know about our hackathon, then I really didn't do my job prompting it, but it's coming up, September 21-22 ! Join us, it's going to be a LOT of fun!
🖼️ Pixtral 12B from Mistral
Mistral AI burst onto the scene with Pixtral, their first multimodal model! Devendra Chaplot, research scientist at Mistral, joined ThursdAI to explain their unique approach, ditching fixed image resolutions and training a vision encoder from scratch.
"We designed this from the ground up to...get the most value per flop," Devendra explained. Pixtral handles multiple images interleaved with text within a 128k context window - a far cry from the single-image capabilities of most open-source multimodal models. And to make the community erupt in thunderous applause (cue the clap emojis!) they released the 12 billion parameter model under the ultra-permissive Apache 2.0 license. You can give Pixtral a whirl on Hyperbolic, HuggingFace, or directly through Mistral.
DeepSeek 2.5: When Intelligence Per Watt is King
Deepseek 2.5 launched amid the reflection news and did NOT get the deserved attention it.... deserves. It folded (no deprecated) Deepseek Coder into 2.5 and shows incredible metrics and a truly next-gen architecture. "It's like a higher order MOE", Nisten revealed, "which has this whole like pile of brain and it just like picks every time, from that." 🤯. DeepSeek 2.5 achieves maximum "intelligence per active parameter"
Google's turning text into AI podcast for auditory learners with Audio Overviews
Today I had the awesome pleasure of chatting with Steven Johnson and Raiza Martin from the NotebookLM team at Google Labs. NotebookLM is a research tool, that if you haven't used, you should definitely give it a spin, and this week they launched something I saw in preview and was looking forward to checking out and honestly was jaw-droppingly impressed today.
NotebookLM allows you to upload up to 50 "sources" which can be PDFs, web links that they will scrape for you, documents etc' (no multimodality so far) and will allow you to chat with them, create study guides, dive deeper and add notes as you study.
This week's update allows someone who doesn't like reading, to turn all those sources into a legit 5-10 minute podcast, and that sounds so realistic, that I was honestly blown away. I uploaded a documentation of fastHTML in there.. and well hear for yourself
The conversation with Steven and Raiza was really fun, podcast definitely give it a listen!
Not to mention that Google released (under waitlist) another podcast creating tool called illuminate, that will convert ArXiv papers into similar sounding very realistic 6-10 minute podcasts!
ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.
There are many more updates from this week, there was a whole Apple keynote I missed, which had a new point and describe feature with AI on the new iPhones and Apple Intelligence, Google also released new DataGemma 27B, and more things in TL'DR which are posted here in raw format
See you next week 🫡 Thank you for being a subscriber, weeks like this are the reason we keep doing this! 🔥 Hope you enjoy these models, leave in comments what you think about them
TL;DR in raw format
* Open Source LLMs
* Reflect on Reflection 70B & Matt Shumer (X, Sahil)
* Mixtral releases Pixtral 12B - multimodal model (X, try it)
* Pixtral is really good at OCR says swyx
* Interview with Devendra Chaplot on ThursdAI
* Initial reports of Pixtral beating GPT-4 on WildVision arena from AllenAI
* JinaIA reader-lm-0.5b and reader-lm-1.5b (X)
* ZeroEval updates
* Deepseek 2.5 -
* Deepseek coder is now folded into DeepSeek v2.5
* 89 HumanEval (up from 84 from deepseek v2)
* 9 on MT-bench
* Google - DataGemma 27B (RIG/RAG) for improving results
* Retrieval-Interleaved Generation
* 🤖 DataGemma: AI models that connect LLMs to Google's Data Commons
* 📊 Data Commons: A vast repository of trustworthy public data
* 🔍 Tackling AI hallucination by grounding LLMs in real-world data
* 🔍 Two approaches: RIG (Retrieval-Interleaved Generation) and RAG (Retrieval-Augmented Generation)
* 🔍 Preliminary results show enhanced accuracy and reduced hallucinations
* 🔓 Making DataGemma open models to enable broader adoption
* 🌍 Empowering informed decisions and deeper understanding of the world
* 🔍 Ongoing research to refine the methodologies and scale the work
* 🔍 Integrating DataGemma into Gemma and Gemini AI models
* 🤝 Collaborating with researchers and developers through quickstart notebooks
* Big CO LLMs + APIs
* Apple event
* Apple Intelligence - launching soon
* Visual Intelligence with a dedicated button
* Google Illuminate - generate arXiv paper into multiple speaker podcasts (Website)
* 5-10 min podcasts
* multiple speakers
* any paper
* waitlist
* has samples
* sounds super cool
* Google NotebookLM is finally available - multi modal research tool + podcast (NotebookLM)
* Has RAG like abilities, can add sources from drive or direct web links
* Currently not multimodal
* Generation of multi speaker conversation about this topic to present it, sounds really really realistic
* Chat with Steven and Raiza
* OpenAI reveals new o1 models, and launches o1 preview and o1-mini in chat and API (X, Blog)
* Trained with RL to think before it speaks with special thinking tokens (that you pay for)
* new scaling paradigm
* This weeks Buzz
* Vision & Video
* Adobe announces Firefly video model (X)
* Voice & Audio
* Hume launches EVI 2 (X)
* Fish Speech 1.4 (X)
* Instant Voice Cloning
* Ultra low latenc
* ~1GB model weights
* LLaMA-Omni, a new model for speech interaction (X)
* Tools
* New Jina reader (X)
This is a public episode. If you’d like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe -
Welcome back everyone, can you believe it's another ThursdAI already? And can you believe me when I tell you that friends of the pod Matt Shumer & Sahil form Glaive.ai just dropped a LLama 3.1 70B finetune that you can download that will outperform Claude Sonnet 3.5 while running locally on your machine?
Today was a VERY heavy Open Source focused show, we had a great chat w/ Niklas, the leading author of OLMoE, a new and 100% open source MoE from Allen AI, a chat with Eugene (pico_creator) about RWKV being deployed to over 1.5 billion devices with Windows updates and a lot more.
In the realm of the big companies, Elon shook the world of AI by turning on the biggest training cluster called Colossus (100K H100 GPUs) which was scaled in 122 days 😮 and Anthropic announced that they have 500K context window Claude that's only reserved if you're an enterprise customer, while OpenAI is floating an idea of a $2000/mo subscription for Orion, their next version of a 100x better chatGPT?!
TL;DR
* Open Source LLMs
* Matt Shumer / Glaive - Reflection-LLama 70B beats Claude 3.5 (X, HF)
* Allen AI - OLMoE - first "good" MoE 100% OpenSource (X, Blog, Paper, WandB)
* RWKV.cpp is deployed with Windows to 1.5 Billion devices
* MMMU pro - more robust multi disipline multimodal understanding bench (proj)
* 01AI - Yi-Coder 1.5B and 9B (X, Blog, HF)
* Big CO LLMs + APIs
* Replit launches Agent in beta - from coding to production (X, Try It)
* Ilya SSI announces 1B round from everyone (Post)
* Cohere updates Command-R and Command R+ on API (Blog)
* Claude Enterprise with 500K context window (Blog)
* Claude invisibly adds instructions (even via the API?) (X)
* Google got structured output finally (Docs)
* Amazon to include Claude in Alexa starting this October (Blog)
* X ai scaled Colossus to 100K H100 GPU goes online (X)
* DeepMind - AlphaProteo new paper (Blog, Paper, Video)
* This weeks Buzz
* Hackathon did we mention? We're going to have Eugene and Greg as Judges!
* AI Art & Diffusion & 3D
* ByteDance - LoopyAvatar - Audio Driven portait avatars (Page)
Open Source LLMs
Reflection Llama-3.1 70B - new 👑 open source LLM from Matt Shumer / GlaiveAI
This model is BANANAs folks, this is a LLama 70b finetune, that was trained with a new way that Matt came up with, that bakes CoT and Reflection into the model via Finetune, which results in model outputting its thinking as though you'd prompt it in a certain way.
This causes the model to say something, and then check itself, and then reflect on the check and then finally give you a much better answer. Now you may be thinking, we could do this before, RefleXion (arxiv.org/2303.11366) came out a year ago, so what's new?
What's new is, this is now happening inside the models head, you don't have to reprompt, you don't even have to know about these techniques! So what you see above, is just colored differently, but all of it, is output by the model without extra prompting by the user or extra tricks in system prompt. the model thinks, plans, does chain of thought, then reviews and reflects, and then gives an answer!
And the results are quite incredible for a 70B model 👇
Looking at these evals, this is a 70B model that beats GPT-4o, Claude 3.5 on Instruction Following (IFEval), MATH, GSM8K with 99.2% 😮 and gets very close to Claude on GPQA and HumanEval!
(Note that these comparisons are a bit of a apples to ... different types of apples. If you apply CoT and reflection to the Claude 3.5 model, they may in fact perform better on the above, as this won't be counted 0-shot anymore. But given that this new model is effectively spitting out those reflection tokens, I'm ok with this comparison)
This is just the 70B, next week the folks are planning to drop the 405B finetune with the technical report, so stay tuned for that!
Kudos on this work, go give Matt Shumer and Glaive AI a follow!
Allen AI OLMoE - tiny "good" MoE that's 100% open source, weights, code, logs
We've previously covered OLMO from Allen Institute, and back then it was obvious how much commitment they have to open source, and this week they continued on this path with the release of OLMoE, an Mixture of Experts 7B parameter model (1B active parameters), trained from scratch on 5T tokens, which was completely open sourced.
This model punches above its weights on the best performance/cost ratio chart for MoEs and definitely highest on the charts of releasing everything.
By everything here, we mean... everything, not only the final weights file; they released 255 checkpoints (every 5000 steps), the training code (Github) and even (and maybe the best part) the Weights & Biases logs!
It was a pleasure to host the leading author of the OLMoE paper, Niklas Muennighoff on the show today, so definitely give this segment a listen, he's a great guest and I learned a lot!
Big Companies LLMs + API
Anthropic has 500K context window Claude but only for Enterprise?
Well, this sucks (unless you work for Midjourney, Airtable or Deloitte). Apparently Anthropic has been sitting on Claude that can extend to half a million tokens in the context window, and decided to keep it to themselves and a few trial enterprises, and package it as an Enterprise offering.
This offering now includes, beyond just the context window, also a native Github integration, and a few key enterprise features like access logs, provisioning and SCIM and all kinds of "procurement and CISO required" stuff enterprises look for.
To be clear, this is a great move for Anthropic, and this isn't an API tier, this is for their front end offering, including the indredible artifacts tool, so that companies can buy their employees access to Claude.ai and have them be way more productive coding (hence the Github integration) or summarizing (very very) long documents, building mockups and one off apps etc'
Anthropic is also in the news this week, because Amazon announced that it'll use Claude as the backbone for the smart (or "remarkable" as they call it) Alexa brains coming up in October, which, again, incredible for Anthropic distribution, as there are maybe 100M Alexa users in the world or so.
Prompt injecting must stop!
And lastly, there have been mounting evidence, including our own Wolfram Ravenwolf that confirmed it, that Anthropic is prompt injecting additional context into your own prompts, in the UI but also via the API! This is awful practice and if anyone from there reads this newsletter, please stop or at least acknowledge. Claude apparently just... thinks that it's something my users said, when in fact, it's some middle layer of anthropic security decided to just inject some additional words in there!
XAI turns on the largest training GPU SuperCluster Colossus - 100K H100 GPUS
This is a huge deal for AI, specifically due to the time this took and the massive massive scale of this SuperCluster. SuperCluster means all these GPUs sit in one datacenter, drawing from the same power-grid and can effectively run single training jobs.
This took just 122 days for Elon and the XAI team to go from an empty warehouse in Memphis to booting up an incredible 100K H100, and they claim that they will double this capacity by adding 50K H200 in the next few months. As Elon mentioned when they released Grok2, it was trained on 15K, and it matched GPT4!
Per SemiAnalisys, this new Supercluster can train a GPT-4 level model in just 4 days 🤯
XAI was founded a year ago, and by end of this year, they plan for Grok to be the beast LLM in the world, and not just get to GPT-4ish levels, and with this + 6B investment they have taken in early this year, it seems like they are well on track, which makes some folks at OpenAI reportedly worried
This weeks buzz - we're in SF in less than two weeks, join our hackathon!
This time I'm very pleased to announce incredible judges for our hackathon, the spaces are limited, but there's still some spaces so please feel free to sign up and join us
I'm so honored to announce that we'll have Eugene Yan (@eugeneyan), Greg Kamradt (@GregKamradt) and Charles Frye (@charles_irl) on the Judges panel. 🤩 It'll be incredible to have these folks see what hackers come up with, and I'm excited as this comes closer!
Replit launches Agents beta - a fully integrated code → deployment agent
Replit is a great integrated editing environment, with database and production in 1 click and they've had their LLMs trained on a LOT of code helping folks code for a while.
Now they are launching agents, which seems very smart from them, given that development is much more than just coding. All the recent excitement we see about Cursor, is omitting the fact that those demos are only working for folks who already know how to set up the environment, and then there's the need to deploy to production, maintain.
Replit has that basically built in, and now their Agent can build a plan and help you build those apps, and "ship" them, while showing you what they are doing. This is massive, and I can't wait to play around with this!
The additional benefit of Replit is that they nailed the mobile app experience as well, so this now works from mobile, on the go!
In fact, as I was writing this, I got so excited that I paused for 30 minutes, payed the yearly subscription and decided to give building an app a try!
The fact that this can deploy and run the server and the frontend, detect errors, fix them, and then also provision a DB for me, provision Stripe, login buttons and everything else is quite insane.
Can't wait to see what I can spin up with this 🔥 (and show all of you!)
Loopy - Animated Avatars from ByteDance
A new animated avatar project from folks at ByteDance just dropped, and it’s WAY clearer than anything we’ve seen before, like EMO or anything else. I will just add this video here for you to enjoy and look at the earring movements, vocal cords, eyes, everything!
I of course wanted to know if I’ll ever be able to use this, and .. likely no, here’s the response I got from Jianwen one of the Authors today.
That's it for this week, we've talked about so much more in the pod, please please check it out.
As for me, while so many exciting things are happening, I'm going on a small 🏝️ vacation until next ThursdAI, which will happen on schedule, so planning to decompress and disconnect, but will still be checking in, so if you see things that are interesting, please tag me on X 🙏
P.S - I want to shout out a dear community member that's been doing just that, @PresidentLin has been tagging me in many AI related releases, often way before I would even notice them, so please give them a follow! 🫡
ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.
ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.
This is a public episode. If you’d like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe -
Hey, for the least time during summer of 2024, welcome to yet another edition of ThursdAI, also happy skynet self-awareness day for those who keep track :)
This week, Cerebras broke the world record for fastest LLama 3.1 70B/8B inference (and came on the show to talk about it) Google updated 3 new Geminis, Anthropic artifacts for all, 100M context windows are possible, and Qwen beats SOTA on vision models + much more!
As always, this weeks newsletter is brought to you by Weights & Biases, did I mention we're doing a hackathon in SF in September 21/22 and that we have an upcoming free RAG course w/ Cohere & Weaviate?
TL;DR
* Open Source LLMs
* Nous DisTrO - Distributed Training (X , Report)
* NousResearch/ hermes-function-calling-v1 open sourced - (X, HF)
* LinkedIN Liger-Kernel - OneLine to make Training 20% faster & 60% more memory Efficient (Github)
* Cartesia - Rene 1.3B LLM SSM + Edge Apache 2 acceleration (X, Blog)
* Big CO LLMs + APIs
* Cerebras launches the fastest AI inference - 447t/s LLama 3.1 70B (X, Blog, Try It)
* Google - Gemini 1.5 Flash 8B & new Gemini 1.5 Pro/Flash (X, Try it)
* Google adds Gems & Imagen to Gemini paid tier
* Anthropic artifacts available to all users + on mobile (Blog, Try it)
* Anthropic publishes their system prompts with model releases (release notes)
* OpenAI has project Strawberry coming this fall (via The information)
* This weeks Buzz
* WandB Hackathon hackathon hackathon (Register, Join)
* Also, we have a new RAG course w/ Cohere and Weaviate (RAG Course)
* Vision & Video
* Zhipu AI CogVideoX - 5B Video Model w/ Less 10GB of VRAM (X, HF, Try it)
* Qwen-2 VL 72B,7B,2B - new SOTA vision models from QWEN (X, Blog, HF)
* AI Art & Diffusion & 3D
* GameNgen - completely generated (not rendered) DOOM with SD1.4 (project)
* FAL new LORA trainer for FLUX - trains under 5 minutes (Trainer, Coupon for ThursdAI)
* Tools & Others
* SimpleBench from AI Explained - closely matches human experience (simple-bench.com)
ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.
Open Source
Let's be honest - ThursdAI is a love letter to the open-source AI community, and this week was packed with reasons to celebrate.
Nous Research DiStRO + Function Calling V1
Nous Research was on fire this week (aren't they always?) and they kicked off the week with the release of DiStRO, which is a breakthrough in distributed training. You see, while LLM training requires a lot of hardware, it also requires a lot of network bandwidth between the different GPUs, even within the same data center.
Proprietary networking solutions like Nvidia NVLink, and more open standards like Ethernet work well within the same datacenter, but training across different GPU clouds has been unimaginable until now.
Enter DiStRo, a new decentralized training by the mad geniuses at Nous Research, in which they reduced the required bandwidth to train a 1.2B param model from 74.4GB to just 86MB (857x)!
This can have massive implications for training across compute clusters, doing shared training runs, optimizing costs and efficiency and democratizing LLM training access! So don't sell your old GPUs just yet, someone may just come up with a folding@home but for training the largest open source LLM, and it may just be Nous!
Nous Research also released their function-calling-v1 dataset (HF) that was used to train Hermes-2, and we had InterstellarNinja who authored that dataset, join the show and chat about it. This is an incredible unlock for the open source community, as function calling become a de-facto standard now. Shout out to the Glaive team as well for their pioneering work that paved the way!
LinkedIn's Liger Kernel: Unleashing the Need for Speed (with One Line of Code)
What if I told you, that whatever software you develop, you can add 1 line of code, and it'll run 20% faster, and require 60% less memory?
This is basically what Linkedin researches released this week with Liger Kernel, yes you read that right, Linkedin, as in the website you career related posts on!
"If you're doing any form of finetuning, using this is an instant win"Wing Lian - Axolotl
This absolutely bonkers improvement in training LLMs, now works smoothly with Flash Attention, PyTorch FSDP and DeepSpeed. If you want to read more about the implementation of the triton kernels, you can see a deep dive here, I just wanted to bring this to your attention, even if you're not technical, because efficiency jumps like these are happening all the time. We are used to seeing them in capabilities / intelligence, but they are also happening on the algorithmic/training/hardware side, and it's incredible to see!
Huge shoutout to Byron and team at Linkedin for this unlock, check out their Github if you want to get involved!
Qwen-2 VL - SOTA image and video understanding + open weights mini VLM
You may already know that we love the folks at Qwen here on ThursdAI, not only because Junyang Lin is a frequeny co-host and we get to hear about their releases as soon as they come out (they seem to be releasing them on thursdays around the time of the live show, I wonder why!)
But also because, they are committed to open source, and have released 2 models 7B and 2B with complete Apache 2 license!
First of all, their Qwen-2 VL 72B model, is now SOTA at many benchmarks, beating GPT-4, Claude 3.5 and other much bigger models. This is insane. I literally had to pause Junyang and repeat what he said, this is a 72B param model, that beats GPT-4o on document understanding, on math, on general visual Q&A.
Additional Capabilities & Smaller models
They have added new capabilities in these models, like being able to handle arbitrary resolutions, but the one I'm most excited about is the video understanding. These models can now understand up to 20 minutes of video sequences, and it's not just "split the video to 10 frames and do image caption", no, these models understand video progression and if I understand correctly how they do it, it's quite genius.
They the video embed time progression into the model using a new technique called M-RoPE, which turns the time progression into rotary positional embeddings.
Now, the 72B model is currently available via API, but we do get 2 new small models with Apache 2 license and they are NOT too shabby either!
7B parameters (HF) and 2B Qwen-2 VL (HF) are small enough to run completely on your machine, and the 2B parameter, scores better than GPT-4o mini on OCR-bench for example!
I can't wait to finish writing and go play with these models!
Big Companies & LLM APIs
The biggest news this week came from Cerebras System, a relatively unknown company, that shattered the world record for LLM inferencing out of the blue (and came on the show to talk about how they are doing it)
Cerebras - fastest LLM inference on wafer scale chips
Cerebras has introduced the concept of wafer scale chips to the world, which is, if you imagine a microchip, they are the size of a post stamp maybe? GPUs are bigger, well, Cerebras are making chips the sizes of an iPad (72 square inches), largest commercial chips in the world.
And now, they created an inference stack on top of those chips, and showed that they have the fastest inference in the world, how fast? Well, they can server LLama 3.1 8B at a whopping 1822t/s. No really, this is INSANE speeds, as I was writing this, I copied all the words I had so far, went to inference.cerebras.ai , asked to summarize, pasted and hit send, and I immediately got a summary!
"The really simple explanation is we basically store the entire model, whether it's 8B or 70B or 405B, entirely on the chip. There's no external memory, no HBM. We have 44 gigabytes of memory on chip."James Wang
They not only store the whole model (405B coming soon), but they store it in full fp16 precision as well, so they don't quantize the models. Right now, they are serving it with 8K tokens in context window, and we had a conversation about their next steps being giving more context to developers.
The whole conversation is well worth listening to, James and Ian were awesome to chat with, and while they do have a waitlist, as they gradually roll out their release, James said to DM him on X and mention ThursdAI, and he'll put you through, so you'll be able to get an OpenAI compatible API key and be able to test this insane speed.
P.S - we also did an independent verification of these speeds, using Weave, and found Cerebras to be quite incredible for agentic purposes, you can read our report here and the weave dashboard here
Anthropic - unlocking just-in-time applications with artifacts for all
Well, if you aren't paying claude, maybe this will convince you. This week, anthropic announced that artifacts are available to all users, not only their paid customers.
Artifacts are a feature in Claude that is basically a side pane (and from this week, a drawer in their mobile apps) that allows you to see what Claude is building, by rendering the web application almost on the fly. They have also trained Claude in working with that interface, so it knows about the different files etc
Effectively, this turns Claude into a web developer that will build mini web applications (without backend) for you, on the fly, for any task you can think of.
Drop a design, and it'll build a mock of it, drop some data in a CSV and it'll build an interactive onetime dashboard visualizing that data, or just ask it to build an app helping you split the bill between friends by uploading a picture of a bill.
Artifacts are share-able and remixable, so you can build something and share with friends, so here you go, an artifact I made, by dropping my notes into claude, and asking for a magic 8 Ball, that will spit out a random fact from today's editing of ThursdAI. I also provided Claude with an 8Ball image, but it didn't work due to restrictions, so instead I just uploaded that image to claude and asked it to recreate it with SVG! And viola, a completely un-nessesary app that works!
Google’s Gemini Keeps Climbing the Charts (But Will It Be Enough?)
Sensing a disturbance in the AI force (probably from that Cerebras bombshell), Google rolled out a series of Gemini updates, including a new experimental Gemini 1.5 Pro (0827) with sharper coding skills and logical reasoning. According to LMSys, it’s already nipping at the heels of ChatGPT 4o and is number 2!
Their Gemini 1.5 Flash model got a serious upgrade, vaulting to the #6 position on the arena. And to add to the model madness, they even released an Gemini Flash 8B parameter version for folks who need that sweet spot between speed and size.
Oh, and those long-awaited Gems are finally starting to roll out. But get ready to open your wallet – this feature (preloading Gemini with custom context and capabilities) is a paid-tier exclusive. But hey, at least Imagen-3 is cautiously returning to the image generation game!
AI Art & Diffusion
Doom Meets Stable Diffusion: AI Dreams in 20FPS Glory (GameNGen)
The future of video games is, uh, definitely going to be interesting. Just as everyone thought AI would be conquering Go or Chess, it seems we've stumbled into a different battlefield: first-person shooters. 🤯
This week, researchers in DeepMind blew everyone's minds with their GameNgen research. What did they do? They trained Stable Diffusion 1.4 on Doom, and I'm not just talking about static images - I'm talking about generating actual Doom gameplay in near real time. Think 20FPS Doom running on nothing but the magic of AI.
The craziest part to me is this quote "Human raters are only slightly better than random chance at distinguishing short clips of the game from clips of the simulation"
FAL Drops the LORA Training Time Bomb (and I Get a New Meme!)
As you see, I haven't yet relaxed from making custom AI generations with Flux and customizing them with training LORAs. Two weeks ago, this used to take 45 minutes, a week ago, 20 minutes, and now, the wizards at FAL, created a new trainer that shrinks the training times down to less than 5 minutes!
So given that the first upcoming SpaceX commercial spacewalk Polaris Dawn, I trained a SpaceX astronaut LORA and then combined my face with it, and viola, here I am, as a space X astronaut!
BTW because they are awesome, Jonathan and Simo (who is the magician behind this new trainer) came to the show, announced the new trainer, but also gave all listeners of ThursdAI a coupon to train a LORA effectively for free, just use this link and start training! (btw I get nothing out of this, just trying to look out for my listeners!)
That's it for this week, well almost that's it, magic.dev announced a new funding round of 320 million, and that they have a 100M context window capable models and coding product to go with it, but didn't yet release it, just as we were wrapping up. Sam Altman tweeted that OpenAI now has over 200 Million active users on ChatGPT and that OpenAI will collaborate with AI Safety institute.
Ok now officially that's it! See you next week, when it's going to be 🍁 already brrr
ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.
This is a public episode. If you’d like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe -
Hey there, Alex here with an end of summer edition of our show, which did not disappoint. Today is the official anniversary of stable diffusion 1.4 can you believe it?
It's the second week in the row that we have an exclusive LLM launch on the show (after Emozilla announced Hermes 3 on last week's show), and spoiler alert, we may have something cooking for next week as well!
This edition of ThursdAI is brought to you by W&B Weave, our LLM observability toolkit, letting you evaluate LLMs for your own use-case easily
Also this week, we've covered both ends of AI progress, doomerist CEO saying "Fck Gen AI" vs an 8yo coder and I continued to geek out on putting myself into memes (I promised I'll stop... at some point) so buckle up, let's take a look at another crazy week:
TL;DR
* Open Source LLMs
* AI21 releases Jamba1.5 Large / Mini hybrid Mamba MoE (X, Blog, HF)
* Microsoft Phi 3.5 - 3 new models including MoE (X, HF)
* BFCL 2 - Berkley Function Calling Leaderboard V2 (X, Blog, Leaderboard)
* NVIDIA - Mistral Nemo Minitron 8B - Distilled / Pruned from 12B (HF)
* Cohere paper proves - code improves intelligence (X, Paper)
* MOHAWK - transformer → Mamba distillation method (X, Paper, Blog)
* AI Art & Diffusion & 3D
* Ideogram launches v2 - new img diffusion king 👑 + API (X, Blog, Try it)
* Midjourney is now on web + free tier (try it finally)
* Flux keeps getting better, cheaper, faster + adoption from OSS (X, X, X)
* Procreate hates generative AI (X)
* Big CO LLMs + APIs
* Grok 2 full is finally available on X - performs well on real time queries (X)
* OpenAI adds GPT-4o Finetuning (blog)
* Google API updates - 1000 pages PDFs + LOTS of free tokens (X)
* This weeks Buzz
* Weights & Biases Judgement Day SF Hackathon in September 21-22 (Sign up to hack)
* Video
* Hotshot - new video model - trained by 4 guys (try it, technical deep dive)
* Luma Dream Machine 1.5 (X, Try it)
* Tools & Others
* LMStudio 0.0.3 update - local RAG, structured outputs with any model & more (X)
* Vercel - Vo now has chat (X)
* Ark - a completely offline device - offline LLM + worlds maps (X)
* Ricky's Daughter coding with cursor video is a must watch (video)
The Best of the Best: Open Source Wins with Jamba, Phi 3.5, and Surprise Function Calling Heroes
We kick things off this week by focusing on what we love the most on ThursdAI, open-source models! We had a ton of incredible releases this week, starting off with something we were super lucky to have live, the official announcement of AI21's latest LLM: Jamba.
AI21 Officially Announces Jamba 1.5 Large/Mini – The Powerhouse Architecture Combines Transformer and Mamba
While we've covered Jamba release on the show back in April, Jamba 1.5 is an updated powerhouse. It's 2 models, Large and Mini, both MoE and both are still hybrid architecture of Transformers + Mamba that try to get both worlds.
Itay Dalmedigos, technical lead at AI21, joined us on the ThursdAI stage for an exclusive first look, giving us the full rundown on this developer-ready model with an awesome 256K context window, but it's not just the size – it’s about using that size effectively.
AI21 measured the effective context use of their model on the new RULER benchmark released by NVIDIA, an iteration of the needle in the haystack and showed that their models have full utilization of context, as opposed to many other models.
“As you mentioned, we’re able to pack many, many tokens on a single GPU. Uh, this is mostly due to the fact that we are able to quantize most of our parameters", Itay explained, diving into their secret sauce, ExpertsInt8, a novel quantization technique specifically designed for MoE models.
Oh, and did we mention Jamba is multilingual (eight languages and counting), natively supports structured JSON, function calling, document digestion… basically everything developers dream of. They even chucked in citation generation, as it's long context can contain full documents, your RAG app may not even need to chunk anything, and the citation can cite full documents!
Berkeley Function Calling Leaderboard V2: Updated + Live (link)
Ever wondered how to measure the real-world magic of those models boasting "I can call functions! I can do tool use! Look how cool I am!" 😎? Enter the Berkeley Function Calling Leaderboard (BFCL) 2, a battleground where models clash to prove their function calling prowess.
Version 2 just dropped, and this ain't your average benchmark, folks. It's armed with a "Live Dataset" - a dynamic, user-contributed treasure trove of real-world queries, rare function documentations, and specialized use-cases spanning multiple languages. Translation: NO more biased, contaminated datasets. BFCL 2 is as close to the real world as it gets.
So, who’s sitting on the Function Calling throne this week? Our old friend Claude 3.5 Sonnet, with an impressive score of 73.61. But breathing down its neck is GPT 4-0613 (the OG Function Calling master) with 73.5. That's right, the one released a year ago, the first one with function calling, in fact the first LLM with function calling as a concept IIRC!
Now, prepare for the REAL plot twist. The top-performing open-source model isn’t some big name, resource-heavy behemoth. It’s a tiny little underdog called Functionary Medium 3.1, a finetuned version of Llama 3.1 that blew everyone away. It even outscored both versions of Claude 3 Opus AND GPT 4 - leaving folks scrambling to figure out WHO created this masterpiece.
“I’ve never heard of this model. It's MIT licensed from an organization called MeetKai. Have you guys heard about Functionary Medium?” I asked, echoing the collective bafflement in the space. Yep, turns out there’s gold hidden in the vast landscape of open source models, just waiting to be unearthed ⛏️.
Microsoft updates Phi 3.5 - 3 new models including an MoE + MIT license
3 new Phi's dropped this week, including an MoE one, and a new revamped vision one. They look very decent on benchmark yet again, with the mini version (3.8B) seemingly beating LLama 3.1 8B on a few benchmarks.
However, as previously the excitement is met with caution because Phi models seem great on benchmarks but then actually talking with them, folks are not as impressed usually.
Terry from BigCodeBench also saw a significant decrease in coding ability for Phi 3.5 vs 3.1
Of course, we're not complaining, the models released with 128K context and MIT license.
The thing I'm most excited about is the vision model updates, it has been updated with "multi-frame image understanding and reasoning" which is a big deal! This means understanding videos more natively across scenes.
This weeks Buzz
Hey, if you're reading this, while sitting in the bay area, and you don't have plans for exactly a month from now, why don't you come and hack with me? (Register Free)
Announcing, the first W&B hackathon, Judgement Day that's going to be focused on LLM as a judge! Come hack on innovative LLM as a judge ideas, UIs, evals and more, meet other like minded hackers and AI engineers and win great prizes!
🎨 AI Art: Ideogram Crowns Itself King, Midjourney Joins the Internet & FLUX everywhere
While there was little news from big LLM labs this week, there is a LOT of AI art news, which is fitting to celebrate 2 year Stable Diffusion 1.4 anniversary!
👑 Ideogram v2: Text Wizardry and API Access (But No Loras… Yet?)
With significantly improved realism, and likely the best text generation across all models out there, Ideogram v2 just took over the AI image generation game! Just look at that text sharpness!
They now offer a selection of styles (Realistic, Design, 3D, Anime) and any aspect ratios you'd like and also, brands can now provide color palettes to control the outputs!
Adding to this is a new API offering (.8c per image for the main model, .5c for the new turbo model of v2!) and a new IOS app, they also added the option (for premium users only) to search through a billion generations and their prompts, which is a great offering as well, as sometimes you don't even know what to prompt.
They claim a significant improvement over Flux[pro] and Dalle-3 in text, alignment and overall, interesting that MJ was not compared!
Meanwhile, Midjourney finally launched a website and a free tier, so no longer do you have to learn to use Discord to even try Midjourney.
Meanwhile Flux enjoys the fruits of Open Source
While the Ideogram and MJ fight it out for the closed source, Black Forest Labs enjoys the fruits of released their weights in the open.
Fal just released an update that LORAs run 2.5x faster and 2.5x cheaper, CivitAI has LORAs for pretty much every character and celebrity ported to FLUX already, different techniques like ControlNets Unions, IPAdapters and more are being trained as we speak and tutorials upon tutorials are released of how to customize these models, for free (shoutout to my friend Matt Wolfe for this one)
you can now train your own face on fal.ai , replicate.com and astria.ai , and thanks to astria, I was able to find some old generations of my LORAs from the 1.5 days (not quite 1.4, but still, enough to show the difference between then and now) and whoa.
🤔 Is This AI Tool Necessary, Bro?
Let’s end with a topic that stirred up a hornets nest of opinions this week: Procreate, a beloved iPad design app, publicly declared their "fing hate” for Generative AI.
Yeah, you read that right. Hate. The CEO, in a public statement went FULL scorched earth - proclaiming that AI-powered features would never sully the pristine code of their precious app.
“Instead of trying to bridge the gap, he’s creating more walls", Wolfram commented, echoing the general “dude… what?” vibe in the space. “It feels marketeerial”, I added, pointing out the obvious PR play (while simultaneously acknowledging the very REAL, very LOUD segment of the Procreate community that cheered this decision).
Here’s the thing: you can hate the tech. You can lament the potential demise of the human creative spark. You can rail against the looming AI overlords. But one thing’s undeniable: this tech isn't going anywhere.
Meanwhile, 8yo coders lean in fully into AI
As a contrast to this doomerism take, just watch this video of Ricky Robinette's eight-year-old daughter building a Harry Potter website in 45 minutes, using nothing but a chat interface in Cursor. No coding knowledge. No prior experience. Just prompts and the power of AI ✨.
THAT’s where we’re headed, folks. It might be terrifying. It might be inspiring. But it’s DEFINITELY happening. Better to understand it, engage with it, and maybe try to nudge it in a positive direction, than burying your head in the sand and muttering “I bleeping hate this progress” like a cranky, Luddite hermit. Just sayin' 🤷♀️.
AI Device to reboot civilization (if needed)
I was scrolling through my feed (as I do VERY often, to bring you this every week) and I saw this and super quickly decided to invite the author to the show to talk about it.
Adam Cohen Hillel has prototyped an AI hardware device, but this one isn't trying to record you or be your friend, no, this one comes with offline LLMs finetuned with health and bio information, survival tactics, and all of the worlds maps and works completely offline!
This to me was a very exciting use for an LLM, a distilled version of all human knowledge, buried in a faraday cage, with replaceable batteries that runs on solar and can help you survive in the case of something bad happening, like really bad happening (think a solar flare that takes out the electrical grid or an EMP device). While improbable, I thought this was a great idea and had a nice chat with the creator, you should definitely give this one a listen, and if you want to buy one, he is going to sell them soon here
This is it for this week, there have been a few updates from the big labs, OpenAI has opened Finetuneing for GPT-4o, and you can use your WandB API key in there to track those, which is cool, Gemini API now accepts incredibly large PDF files (up to 1000 pages) and Grok 2 is finally on X (not mini from last week)
See you next week (we will have another deep dive!)
This is a public episode. If you’d like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe -
Look these crazy weeks don't seem to stop, and though this week started out a bit slower (while folks were waiting to see how the speculation about certain red berry flavored conspiracies are shaking out) the big labs are shipping!
We've got space uncle Elon dropping an "almost-gpt4" level Grok-2, that's uncensored, has access to real time data on X and can draw all kinds of images with Flux, OpenAI announced a new ChatGPT 4o version (not the one from last week that supported structured outputs, a different one!) and Anthropic dropping something that makes AI Engineers salivate!
Oh, and for the second week in a row, ThursdAI live spaces were listened to by over 4K people, which is very humbling, and awesome because for example today, Nous Research announced Hermes 3 live on ThursdAI before the public heard about it (and I had a long chat w/ Emozilla about it, very well worth listening to)
TL;DR of all topics covered:
* Big CO LLMs + APIs
* Xai releases GROK-2 - frontier level Grok, uncensored + image gen with Flux (𝕏, Blog, Try It)
* OpenAI releases another ChatGPT-4o (and tops LMsys again) (X, Blog)
* Google showcases Gemini Live, Pixel Bugs w/ Gemini, Google Assistant upgrades ( Blog)
* Anthropic adds Prompt Caching in Beta - cutting costs by u to 90% (X, Blog)
* AI Art & Diffusion & 3D
* Flux now has support for LORAs, ControlNet, img2img (Fal, Replicate)
* Google Imagen-3 is out of secret preview and it looks very good (𝕏, Paper, Try It)
* This weeks Buzz
* Using Weights & Biases Weave to evaluate Claude Prompt Caching (X, Github, Weave Dash)
* Open Source LLMs
* NousResearch drops Hermes 3 - 405B, 70B, 8B LLama 3.1 finetunes (X, Blog, Paper)
* NVIDIA Llama-3.1-Minitron 4B (Blog, HF)
* AnswerAI - colbert-small-v1 (Blog, HF)
* Vision & Video
* Runway Gen-3 Turbo is now available (Try It)
Big Companies & LLM APIs
Grok 2: Real Time Information, Uncensored as Hell, and… Flux?!
The team at xAI definitely knows how to make a statement, dropping a knowledge bomb on us with the release of Grok 2. This isn't your uncle's dad joke model anymore - Grok 2 is a legitimate frontier model, folks.
As Matt Shumer excitedly put it
“If this model is this good with less than a year of work, the trajectory they’re on, it seems like they will be far above this...very very soon” 🚀
Not only does Grok 2 have impressive scores on MMLU (beating the previous GPT-4o on their benchmarks… from MAY 2024), it even outperforms Llama 3 405B, proving that xAI isn't messing around.
But here's where things get really interesting. Not only does this model access real time data through Twitter, which is a MOAT so wide you could probably park a rocket in it, it's also VERY uncensored. Think generating political content that'd make your grandma clutch her pearls or imagining Disney characters breaking bad in a way that’s both hilarious and kinda disturbing all thanks to Grok 2’s integration with Black Forest Labs Flux image generation model.
With an affordable price point ($8/month for x Premium including access to Grok 2 and their killer MidJourney competitor?!), it’ll be interesting to see how Grok’s "truth seeking" (as xAI calls it) model plays out. Buckle up, folks, this is going to be wild, especially since all the normies now have the power to create political memes, that look VERY realistic, within seconds.
Oh yeah… and there’s the upcoming Enterprise API as well… and Grok 2’s made its debut in the wild on the LMSys Arena, lurking incognito as "sus-column-r" and is now placed on TOP of Sonnet 3.5 and comes in as number 5 overall!
OpenAI last ChatGPT is back at #1, but it's all very confusing 😵💫
As the news about Grok-2 was settling in, OpenAI decided to, well… drop yet another GPT-4.o update on us. While Google was hosting their event no less. Seriously OpenAI? I guess they like to one-up Google's new releases (they also kicked Gemini from the #1 position after only 1 week there)
So what was anonymous-chatbot in Lmsys for the past week, was also released in ChatGPT interface, is now the best LLM in the world according to LMSYS and other folks, it's #1 at Math, #1 at complex prompts, coding and #1 overall.
It is also available for us developers via API, but... they don't recommend using it? 🤔
The most interesting thing about this release is, they don't really know to tell us why it's better, they just know that it is, qualitatively and that it's not a new frontier-class model (ie, not 🍓 or GPT5)
Their release notes on this are something else 👇
Meanwhile it's been 3 months, and the promised Advanced Voice Mode is only in the hands of a few lucky testers so far.
Anthropic Releases Prompt Caching to Slash API Prices By up to 90%
Anthropic joined DeepSeek's game of "Let's Give Devs Affordable Intelligence," this week rolling out prompt caching with up to 90% cost reduction on cached tokens (yes NINETY…🤯 ) for those of you new to all this technical sorcery
Prompt Caching allows the inference provider to save users money by reusing repeated chunks of a long prompt form cache, reducing pricing and increasing time to first token, and is especially beneficial for longer contexts (>100K) use-cases like conversations with books, agents with a lot of memory, 1000 examples in prompt etc'
We covered caching before with Gemini (in Google IO) and last week with DeepSeek, but IMO this is a better implementation from a frontier lab that's easy to get started, manages the timeout for you (unlike Google) and is a no brainer implementation.
And, you'll definitely want to see the code to implement it all yourself, (plus Weave is free!🤩):
"In this week's buzz category… I used Weave, our LLM observability tooling to super quickly evaluate how much cheaper Cloud Caching from Anthropic really is, I did a video of it and I posted the code … If you're into this and want to see how to actually do this … how to evaluate, the code is there for you" - Alex
With the ridiculous 90% price drop for those cached calls (Haiku basically becomes FREE and cached Claude is costs like Haiku, .30 cents per 1Mtok). For context, I took 5 transcripts of 2 hour podcast conversations, and it amounted to ~110,000 tokens overall, and was able to ask questions across all this text, and it cost me less than $1 (see in the above video)
Code Here + Weave evaluation Dashboard here
AI Art, Diffusion, and Personalized AI On the Fly
Speaking of mind blowing, Flux took over this week, thanks in no small part to Elon strategically leveraging their tech in Grok (and everyone reminding everyone else, that it's not Grok creating images, it's Flux!)
Now, remember, the REAL magic happens when code meets open source, “Flux now has support for LORAs, ControlNet, img2img…" meaning developers have turned those foundational tools into artistic wizardry. With as little as $5 bucks and a few pictures, “You can train the best image model on your own face. ”🤯 (Seriously folks, head up to Fal.ai, give it a whirl… it’s awesome)
Now if you combine the LORA tech with ControlNet tech, you can get VERY creepy very fast (I'm using my own face here but you get the idea), here's "me" as the distracted boyfriend meme, and the girlfriend, and the distraction 😂 (I'm sorry you had to see this, AI has gone too far! Shut it all down!)
If seeing those creepy faces on screen isn't for you (I totally get that) there’s also Google IMAGEN 3, freshly escaped from secret preview and just waiting for you to unleash those artistic prompts on it! Google, despite being… Google, somehow figured out that a little competition does a lab good and rolled out a model that’s seriously impressive.
Runway Video Gets a "Turbocharged" Upgrade🚀🚀🚀
Ever tried those jaw-dropping text-to-video generators but groaned as you watched those seconds of video render painfully slowly?😭 Well Runway, creators of Gen 3, answered our prayers with the distilled turbocharged version that churns out those visuals in a blink 🤯🤯🤯 .
What's truly cool is they unlocked it for FREE tier users (sign up and unleash those cinematic prompts right now!), letting everyday folks dip their toes in those previously-unfathomable waters. Even the skeptics at OpenBMB (Junyang knows what I'm talking about…) had to acknowledge that their efforts with MiniCPM V are impressive, especially the smooth way it captures video sequences better than models even twice its size 🤯.
Open Source: Hermes 3 and The Next Generation of Open AI 🚀
NousResearch Dropped Hermes 3: Your New Favorite AI (Yes Really)
In the ultimate “We Dropped This On ThursdAI Before Even HuggingFace”, the legendary team at NousResearch dropped the hottest news since Qwen decided to play math God: Hermes 3 is officially here! 🤯
“You’re about to get to use the FIRST big Finetune of LLama 3.1 405B… We don’t think there have been finetunes,” announced Emozilla who’s both co founder and resident master wizard of all things neural net, “And it's available to try for free thanks to Lambda, you can try it out right here ” (you’re all racing to their site as I type this, I KNOW it!).
Not ONLY does this beauty run ridiculously smooth on Lambda, but here’s the real TL;DR:
* Hermes 3 isn’t just 405B; there are 70B and 8B versions dropping simultaneously on Hugging Face, ready to crush benchmarks and melt your VRAM (in a GOOD way… okay maybe not so great for your power bill 😅).
* On Benchmark, they beat LLama 3.1 instruct on a few evals and lose on some, which is quite decent, given that Meta team did an amazing job with their instruct finetuning (and probably spent millions of $ on it too)
* Hermes 3 is all about user alignment, which our open source champion Wolfram Ravenwolf summarized beautifully: “When you have a model, and you run it on your system, IT MUST BE LOYAL TO YOU.” 😈
Hermes 3 does just that with incredibly precise control via its godlike system prompt: “In Hermes 3 the system prompt is KING,” confirmed Emoz. It’s so powerful that the 405B version was practically suffering existential angst in their first conversation… I read that part outloud during the space, but here you go, this is their first conversation, and he goes into why this they thing this happened, in our chat that's very worth listening to
This model was trained on a bunch of datasources that they will release in the future, and includes tool use, and a slew of tokens that you can add in the system prompt, that will trigger abilities in this model to do chain of thought, to do scratchpad (think, and then rethink), to cite from sources for RAG purposes and a BUNCH more.
The technical report is HERE and is worth diving into as is our full conversation with Emozilla on the pod.
Wrapping Things Up… But We’re Just Getting Started! 😈
I know, I KNOW, your brain is already overflowing but we barely SCRATCHED the surface…
We also dove into NVIDIA's research into new pruning and distilling techniques, TII Falcon’s attempt at making those State Space models finally challenge the seemingly almighty Transformer architecture (it's getting closer... but has a way to go!), plus AnswerAI's deceptively tiny Colbert-Small-V1, achieving remarkable search accuracy despite its featherweight size and a bunch more...
See you all next week for what’s bound to be yet another wild AI news bonanza… Get those download speeds prepped, we’re in for a wild ride. 🔥
This is a public episode. If you’d like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe - Mostrar mais