📆 ThursdAI - Dec 12 - unprecedented AI week - SORA, Gemini 2.0 Flash, Apple Intelligence, LLama 3.3, NeurIPS Drama & more AI news

13 dez 2024 · ThursdAI - The top AI news from the past week

Hey folks, Alex here, writing this from the beautiful Vancouver BC, Canada. I'm here for NeurIPS 2024, the biggest ML conferences of the year, and let me tell you, this was one hell of a week to not be glued to the screen.

After last week banger week, with OpenAI kicking off their 12 days of releases, with releasing o1 full and pro mode during ThursdAI, things went parabolic. It seems that all the AI labs decided to just dump EVERYTHING they have before the holidays? 🎅

A day after our show, on Friday, Google announced a new Gemini 1206 that became the #1 leading model on LMarena and Meta released LLama 3.3, then on Saturday Xai releases their new image model code named Aurora.

On a regular week, the above Fri-Sun news would be enough for a full 2 hour ThursdAI show on it's own, but not this week, this week this was barely a 15 minute segment 😅 because so MUCH happened starting Monday, we were barely able to catch our breath, so lets dive into it!

As always, the TL;DR and full show notes at the end 👇 and this newsletter is sponsored by W&B Weave, if you're building with LLMs in production, and want to switch to the new Gemini 2.0 today, how will you know if your app is not going to degrade? Weave is the best way! Give it a try for free.

Gemini 2.0 Flash - a new gold standard of fast multimodal LLMs

Google has absolutely taken the crown away from OpenAI with Gemini 2.0 believe it or not this week with this incredible release. All of us on the show were in agreement that this is a phenomenal release from Google for the 1 year anniversary of Gemini.

Gemini 2.0 Flash is beating Pro 002 and Flash 002 on all benchmarks, while being 2x faster than Pro, having 1M context window, and being fully multimodal!

Multimodality on input and output

This model was announced to be fully multimodal on inputs AND outputs, which means in can natively understand text, images, audio, video, documents and output text, text + images and audio (so it can speak!). Some of these capabilities are restricted for beta users for now, but we know they exists. If you remember project Astra, this is what powers that project. In fact, we had Matt Wolfe join the show, and he demoed had early access to Project Astra and demoed it live on the show (see above) which is powered by Gemini 2.0 Flash.

The most amazing thing is, this functionality, that was just 8 months ago, presented to us in Google IO, in a premium Booth experience, is now available to all, in Google AI studio, for free!

Really, you can try out right now, yourself at https://aistudio.google.com/live but here's a demo of it, helping me proof read this exact paragraph by watching the screen and talking me through it.

Performance out of the box

This model beating Sonnet 3.5 on Swe-bench Verified completely blew away the narrative on my timeline, nobody was ready for that. This is a flash model, that's outperforming o1 on code!?

So having a Flash MMIO model with 1M context that is accessible via with real time streaming option available via APIs from the release time is honestly quite amazing to begin with, not to mention that during the preview phase, this is currently free, but if we consider the previous prices of Flash, this model is going to considerably undercut the market on price/performance/speed matrix.

You can see why this release is taking the crown this week. 👏

Agentic is coming with Project Mariner

An additional thing that was announced by Google is an Agentic approach of theirs is project Mariner, which is an agent in the form of a Chrome extension completing webtasks, breaking SOTA on the WebVoyager with 83.5% score with a single agent setup.

We've seen agents attempts from Adept to Claude Computer User to Runner H, but this breaking SOTA from Google seems very promising. Can't wait to give this a try.

OpenAI gives us SORA, Vision and other stuff from the bag of goodies

Ok so now let's talk about the second winner of this week, OpenAI amazing stream of innovations, which would have taken the crown, if not for, well... ☝️

SORA is finally here (for those who got in)

Open AI has FINALLY released SORA, their long promised text to video and image to video (and video to video) model (nee, world simulator) to general availability, including a new website - sora.com and a completely amazing UI to come with it.

SORA can generate images of various quality from 480p up to 1080p and up to 20 seconds long, and they promised that those will be generating fast, as what they released is actually SORA turbo! (apparently SORA 2 is already in the works and will be even more amazing, more on this later)

New accounts paused for now

OpenAI seemed to have severely underestimated how many people would like to generate the 50 images per month allowed on the plus account (pro account gets you 10x more for $200 + longer durations whatever that means), and since the time of writing these words on ThursdAI afternoon, I still am not able to create a sora.com account and try out SORA myself (as I was boarding a plane when they launched it)

SORA magical UI

I've invited one of my favorite video creators, Blaine Brown to the show, who does incredible video experiments, that always go viral, and had time to play with SORA to tell us what he thinks both from a video perspective and from a interface perspective.

Blaine had a great take that we all collectively got so much HYPE over the past 8 months of getting teased, that many folks expected SORA to just be an incredible text to video 1 prompt to video generator and it's not that really, in fact, if you just send prompts, it's more like a slot machine (which is also confirmed by another friend of the pod Bilawal)

But the magic starts to come when the additional tools like blend are taken into play. One example that Blaine talked about is the Remix feature, where you can Remix videos and adjust the remix strength (Strong, Mild)

Another amazing insight Blaine shared is a that SORA can be used by fusing two videos that were not even generated with SORA, but SORA is being used as a creative tool to combine them into one.

And lastly, just like Midjourney (and StableDiffusion before that), SORA has a featured and a recent wall of video generations, that show you videos and prompts that others used to create those videos with, for inspiration and learning, so you can remix those videos and learn to prompt better + there are prompting extension tools that OpenAI has built in.

One more thing.. this model thinks

I love this discovery and wanted to share this with you, the prompt is "A man smiles to the camera, then holds up a sign. On the sign, there is only a single digit number (the number of 'r's in 'strawberry')"

Advanced Voice mode now with Video!

I personally have been waiting for Voice mode with Video for such a long time, since the that day in the spring, where the first demo of advanced voice mode talked to an OpenAI employee called Rocky, in a very flirty voice, that in no way resembled Scarlet Johannson, and told him to run a comb through his hair.

Well today OpenAI have finally announced that they are rolling out this option soon to everyone, and in chatGPT, we'll all going to have the camera button, and be able to show chatGPT what we're seeing via camera or the screen of our phone and have it have the context.

If you're feeling a bit of a deja-vu, yes, this is very similar to what Google just launched (for free mind you) with Gemini 2.0 just yesterday in AI studio, and via APIs as well.

This is an incredible feature, it will not only see your webcam, it will also see your IOS screen, so you’d be able to reason about an email with it, or other things, I honestly can’t wait to have it already!

They also announced Santa mode, which is also super cool, tho I don’t quite know how to .. tell my kids about it? Do I… tell them this IS Santa? Do I tell them this is an AI pretending to be Santa? Where is the lie end exactly?

And in one of his funniest jailbreaks (and maybe one of the toughest ones) Pliny the liberator just posted a Santa jailbreak that will definitely make you giggle (and him get Coal this X-mas)

The other stuff (with 6 days to go)

OpenAI has 12 days of releases, and the other amazing things we got obviously got overshadowed but they are still cool, Canvas can now run code and have custom GPTs, GPT in Apple Intelligence is now widely supported with the public release of iOS 18.2 and they have announced fine tuning with reinforcement learning, allowing to funetune o1-mini to outperform o1 on specific tasks with a few examples.

There's 6 more work days to go, and they promised to "end with a bang" so... we'll keep you updated!

This weeks Buzz - Guard Rail Genie

Alright, it's time for "This Week's Buzz," our weekly segment brought to you by Weights & Biases! This week I hosted Soumik Rakshit from the Weights and Biases AI Team (The team I'm also on btw!).

Soumik gave us a deep dive into Guardrails, our new set of features in Weave for ensuring reliability in GenAI production! Guardrails serve as a "safety net" for your LLM powered applications, filtering out inputs or llm responses that trigger a certain criteria or boundary.

Types of guardrails include prompt injection attacks, PII leakage, jailbreaking attempts and toxic language as well, but can also include a competitor mention, or selling a product at $0 or a policy your company doesn't have.

As part of developing the guardrails Soumik also developed and open sourced an app to test prompts against those guardrails "Guardrails Genie" and we're going to host it to allow folks to test their prompts against our guardrails, and also are developing it and the guardrails in the open so please check out our Github

Apple iOS 18.2 Apple Intelligence + ChatGPT integration

Apple Intelligence is finally here, you can download it if you have iPhone 15 pro and pro Max and iPhone 16 all series.

If you have one of those phones, you will get the following new additional features that have been in Beta for a while, features like Image Playground with the ability to create images based on your face or faces that you have stored in your photo library.

You can also create GenMoji and those are actually pretty cool!

The highlight and the connection with OpenAI's release is of course the ChatGPT integration, where in if Siri is too dumdum to answer any real AI questions, and let's face it, it's most of the time, a user will get a button and chatGPT will take over upon user approval. This will not require an account!

Grok New Image Generation Codename "Aurora"

Oh, Space Uncle is back at it again! The team at XAI launched its image generation model with the codename "Aurora" and briefly made it public only to pull it and launch it again (this time, the model is simply "Grok"). Apparently, they've trained their own image model from scratch in like three months but they pulled it back a day after, I think because they forgot to add watermarks 😅 but it's still unconfirmed why the removal occurred in the first place, Regardless of the reason, many folks, such as Wolfram, found it was not on the same level as their Flux integration.

It is really good at realism and faces, and is really unrestricted in terms of generating celebrities or TV shows form the 90's or cartoons. They really don't care about copyright.

The model however does appear to generate fairly realistic images with its autoregressive model approach where generation occurs pixel-by-pixel instead of diffusion. But as I said on the show "It's really hard to get a good sense for the community vibe about anything that Elon Musk does because there's so much d**k riding on X for Elon Musk..." Many folks post only positive things on anything X or Xai does in the hopes that space uncle will notice them or reposts them, it's really hard to get an honest "vibes check" on Xai stuff.

All jokes aside we'll hopefully have some better comparisons on sites such as image LmArena who just today launched ImgArena but until that day comes we'll just have to wait and see what other new iterations and announcements follow!

NeurIPS Drama: Best Paper Controversy!

Now, no week in AI would be complete without a little drama. This time around it’s with the biggest machine learning engineering conference of the year, NeurIPS. This year's "Best Paper" award went to a work entitled Visual Auto Aggressive Modeling (VAR). This paper apparently introduced an innovative way to outperform traditional diffusion models when it comes to image generation! Great right? well not so fast because here’s where things get spicy. This is where Keyu Tian comes in, the main author of this work and a former intern of ByteDance who are getting their fair share of the benefits with their co-signing on the paper but their lawsuit may derail its future. ByteDance is currently suing Keyu Tian for a whopping one million dollars citing alleged sabotage on the work in a coordinated series of events that compromised other colleagues work.

Specifically, according to some reports "He modified source code to changes random seeds and optimizes which, uh, lead to disrupting training processes...Security attacks. He gained unauthorized access to the system. Login backdoors to checkpoints allowing him to launch automated attacks that interrupted processes to colleagues training jobs." Basically, they believe that he "gained unauthorized access to the system" and hacked other systems. Now the paper is legit and it introduces potentially very innovative solutions but we have an ongoing legal situation. Also to note is despite firing him they did not withdraw the paper which could speak volumes to its future! As always, if it bleeds, it leads and drama is usually at the top of the trends, so definitely a story that will stay in everyone's mind when they look back at NeurIPS this year.

Phew.. what a week folks, what a week!

I think with 6 more days of OpenAI gifts, there's going to be plenty more to come next week, so share this newsletter with a friend or two, and if you found this useful, consider subscribing to our other channels as well and checkout Weave if you've building with GenAI, it's really helpful!

TL;DR and show notes

* Meta llama 3.3 (X, Model Card)

* OpenAI 12 days of Gifts (Blog)

* Apple ios 18.2 - Image Playground, GenMoji, ChatGPT integration (X)

* 🔥 Google Gemini 2.0 Flash - the new gold standard of LLMs (X, AI Studio)

* Google Project Mariner - Agent that browsers for you (X)

* This weeks Buzz - chat with Soumik Rakshit from AI Team at W&B (Github)

* NeurIPS Drama - Best Paper Controversy - VAR author is sued by ByteDance (X, Blog)

* Xai new image generation codename Aurora (Blog)

* Cognition launched Devin AI developer assistant - $500/mo

* LMArena launches txt2img Arena for Diffusion models (X)

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe

Ouvir Ouvir novamente Continuar A reproduzir…
Subscrever Cancelar subscrição
Partilhar

Episódios

ThursdAI - Mar 20 - OpenAIs new voices, Mistral Small, NVIDIA GTC recap & Nemotron, new SOTA vision from Roboflow & more AI news
20 mar· ThursdAI - The top AI news from the past week
Hey, it's Alex, coming to you fresh off another live recording of ThursdAI, and what an incredible one it's been!
I was hoping that this week will be chill with the releases, because of NVIDIA's GTC conference, but no, the AI world doesn't stop, and if you blinked this week, you may have missed 2 or 10 major things that happened.
From Mistral coming back to OSS with the amazing Mistral Small 3.1 (beating Gemma from last week!) to OpenAI dropping a new voice generation model, and 2! new whisper killer ASR models with a Breaking News during our live show (there's a reason we're called ThursdAI) which we watched together and then dissected with Kwindla, our amazing AI VOICE and real time expert.
Not to mention that we also had dedicated breaking news from friend of the pod Joseph Nelson, that came on the show to announce a SOTA vision model from Roboflow + a new benchmark on which even the top VL models get around 6%! There's also a bunch of other OSS, a SOTA 3d model from Tencent and more!
And last but not least, Yam is back 🎉 So... buckle up and let's dive in. As always, TL;DR and show notes at the end, and here's the YT live version. (While you're there, please hit subscribe and help me hit that 1K subs on YT 🙏 )
Voice & Audio: OpenAI's Voice Revolution and the Open Source Echo
Hold the phone, everyone, because this week belonged to Voice & Audio! Seriously, if you weren't paying attention to the voice space, you missed a seismic shift, courtesy of OpenAI and some serious open-source contenders.
OpenAI's New Voice Models - Whisper Gets an Upgrade, TTS Gets Emotional!
OpenAI dropped a suite of next-gen audio models: gpt-4o-mini-tts-latest (text-to-speech) and GPT 4.0 Transcribe and GPT 4.0 Mini Transcribe (speech-to-text), all built upon their powerful transformer architecture.
To unpack this voice revolution, we welcomed back Kwindla Cramer from Daily, the voice AI whisperer himself. The headline news? The new speech-to-text models are not just incremental improvements; they’re a whole new ballgame. As OpenAI’s Shenyi explained, "Our new generation model is based on our large speech model. This means this new model has been trained on trillions of audio tokens." They're faster, cheaper (Mini Transcribe is half price of Whisper!), and boast state-of-the-art accuracy across multiple languages. But the real kicker? They're promptable!
"This basically opens up a whole field of prompt engineering for these models, which is crazy," I exclaimed, my mind officially blown. Imagine prompting your transcription model with context – telling it you're discussing dog breeds, and suddenly, its accuracy for breed names skyrockets. That's the power of promptable ASR! I recorded a live reaction aftder dropping of stream, and I was really impressed with how I can get the models to pronounce ThursdAI by just... asking!
But the voice magic doesn't stop there. GPT 4.0 Mini TTS, the new text-to-speech model, can now be prompted for… emotions! "You can prompt to be emotional. You can ask it to do some stuff. You can prompt the character a voice," OpenAI even demoed a "Mad Scientist" voice! Captain Ryland voice, anyone? This is a huge leap forward in TTS, making AI voices sound… well, more human.
But wait, there’s more! Semantic VAD! Semantic Voice Activity Detection, as OpenAI explained, "chunks the audio up based on when the model thinks The user's actually finished speaking." It’s about understanding the meaning of speech, not just detecting silence. Kwindla hailed it as "a big step forward," finally addressing the age-old problem of AI agents interrupting you mid-thought. No more robotic impatience!
OpenAI also threw in noise reduction and conversation item retrieval, making these new voice models production-ready powerhouses. This isn't just an update; it's a voice AI revolution, folks.
They also built a super nice website to test out the new models with openai.fm !
Canopy Labs' Orpheus 3B - Open Source Voice Steps Up
But hold on, the open-source voice community isn't about to be outshone! Canopy Labs dropped Orpheus 3B, a "natural sounding speech language model" with open-source spirit.
Orpheus, available in multiple sizes (3B, 1B, 500M, 150M), boasts zero-shot voice cloning and a glorious Apache 2 license. Wolfram noted its current lack of multilingual support, but remained enthusiastic, I played with them a bit and they do sound quite awesome, but I wasn't able to finetune them on my own voice due to "CUDA OUT OF MEMORY" alas
I did a live reaction recording for this model on X
NVIDIA Canary - Open Source Speech Recognition Enters the Race
Speaking of open source, NVIDIA surprised us with Canary, a speech recognition and translation model. "NVIDIA open sourced Canary, which is a 1 billion parameter and 180 million parameter speech recognition and translation, so basically like whisper competitor," I summarized. Canary is tiny, fast, and CC-BY licensed, allowing commercial use. It even snagged second place on the Hugging Face speech recognition leaderboard! Open source ASR just got a whole lot more interesting.
Of course, this won't get to the level of the new SOTA ASR OpenAI just dropped, but this can run locally and allows commercial use on edge devices!
ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.
Vision & Video: Roboflow's Visionary Model and Video Generation Gets Moving
After the voice-apalooza, let's switch gears to the visual world, where Vision & Video delivered some knockout blows, spearheaded by Roboflow and StepFun.
Roboflow's RF-DETR and RF100-VL - A New Vision SOTA Emerges
Roboflow stole the vision spotlight this week with their RF-DETR model and the groundbreaking RF100-VL benchmark. We were lucky enough to have Joseph Nelson, Roboflow CEO, join the show again and give us the breaking news (they published the Github 11 minutes before he came on!)
RF-DETR is Roboflow's first in-house model, a real-time object detection transformer that's rewriting the rulebook. "We've actually never released a model that we've developed. And so this is the first time where we've taken a lot of those learnings and put that into a model," Joseph revealed.
And what a model it is! RF-DETR is not just fast; it's SOTA on real-world datasets and surpasses the 60 mAP barrier on COCO. But Joseph dropped a truth bomb: COCO is outdated. "The benchmark that everyone uses is, the COCO benchmark… hasn't been updated since 2017, but models have continued to get really, really, really good. And so they're saturated the COCO benchmark," he explained.
Enter RF100-VL, Roboflow's revolutionary new benchmark, designed to evaluate vision-language models on real-world data. "We, introduced a benchmark that we call RF 100 vision language," Joseph announced. The results? Shockingly low zero-shot performance on real-world vision tasks, highlighting a major gap in current models. Joseph's quiz question about QwenVL 2.5's zero-shot performance on RF100-VL revealed a dismal 5.8% accuracy. "So we as a field have a long, long way to go before we have zero shot performance on real world context," Joseph concluded. RF100-VL is the new frontier for vision, and RF-DETR is leading the charge! Plus, it runs on edge devices and is Apache 2 licensed! Roboflow, you legends! Check out the RF-DETR Blog Post, the RF-DETR Github, and the RF100-VL Benchmark for more details!
StepFun's Image-to-Video TI2V - Animating Images with Style
Stepping into the video arena, StepFun released their image2video model, TI2V. TI2V boasts impressive motion controls and generates high-quality videos from images and text prompts, especially excelling in anime-style video generation. Dive into the TI2V HuggingFace Space and TI2V Github to explore further.
Open Source LLMs: Mistral's Triumphant Return, LG's Fridge LLM, NVIDIA's Nemotron, and ByteDance's RL Boost
Let's circle back to our beloved Open Source LLMs, where this week was nothing short of a gold rush!
Mistral is BACK, Baby! - Mistral Small 3.1 24B (Again!)
Seriously, Mistral AI's return to open source with Mistral Small 3.1 deserves another shoutout! "Mistral is back with open source. Let's go!" I cheered, and I meant it. This multimodal, Apache 2 licensed model is a powerhouse, outperforming Gemma 3 and ready for action on a single GPU. Wolfram, ever the pragmatist, noted, "We are in right now, where a week later, you already have some new toys to play with." referring to Gemma 3 that we covered just last week!
Not only did we get a great new update from Mistral, they also cited our friends at Nous research and their Deep Hermes (released just last week!) for the reason to release the base models alongside finetuned models!
Mistral Small 3.1 is not just a model; it's a statement: open source is thriving, and Mistral is leading the charge! Check out their Blog Post, the HuggingFace page, and the Base Model on HF.
NVIDIA Nemotron - Distilling, Pruning, Making Llama's Better
NVIDIA finally dropped Llama Nemotron, and it was worth the wait!
Nemotron Nano (8B) and Super (49B) are here, with Ultra (253B) on the horizon. These models are distilled, pruned, and, crucially, designed for reasoning with a hybrid architecture allowing you to enable and disable reasoning via a simple on/off switch in the system prompt!
Beating other reasoners like QwQ on GPQA tasks, this distillined and pruned LLama based reasoner seems very powerful! Congrats to NVIDIA
Chris Alexius (a friend of the pod) who co-authored the announcement, told me that FP8 is expected and when that drops, this model will also fit on a single H100 GPU, making it really great for enterprises who host on their own hardware.
And yes, it’s ready for commercial use. NVIDIA, welcome to the open-source LLM party! Explore the Llama-Nemotron HuggingFace Collection and the Dataset.
LG Enters the LLM Fray with EXAONE Deep 32B - Fridge AI is Officially a Thing
LG, yes, that LG, surprised everyone by open-sourcing EXAONE Deep 32B, a "thinking model" from the fridge and TV giant. "LG open sources EXAONE and EXAONE Deep 32B thinking model," I announced, still slightly amused by the fridge-LLM concept. This 32B parameter model claims "superior capabilities" in reasoning, and while my live test in LM Studio went a bit haywire, quantization could be the culprit. It's non-commercial, but hey, fridge-powered AI is now officially a thing. Who saw that coming? Check out my Reaction Video, the LG Blog, and the HuggingFace page for more info.
ByteDance's DAPO - Reinforcement Learning Gets Efficient
From the creators of TikTok, ByteDance, comes DAPO, a new reinforcement learning method that's outperforming GRPO. DAPO promises 50% accuracy on AIME 2024 with 50% less training steps. Nisten, our RL expert, explained it's a refined GRPO, pushing the boundaries of RL efficiency. Open source RL is getting faster and better, thanks to ByteDance! Dive into the X thread, Github, and Paper for the technical details.
Big CO LLMs + APIs: Google's Generosity, OpenAI's Oligarch Pricing, and GTC Mania
Switching gears to the Big CO LLM arena, we saw Google making moves for the masses, OpenAI catering to the elite, and NVIDIA… well, being NVIDIA.
Google Makes DeepResearch Free and Adds Canvas
Google is opening up DeepResearch to everyone for FREE! DeepResearch, Gemini's advanced search mode, is now accessible without a Pro subscription. I really like it's revamped UI where you can see the thinking and the sources! I used it live on the show to find out what we talked about in the latest episode of ThursdAI, and it did a pretty good job!
Plus, Google unveiled Canvas, letting you "build apps within Gemini and actually see them." Google is making Gemini more accessible and more powerful, a win for everyone. Here's a Tetris game it built for me and here's a markdown enabled word counter I rebuild every week before I send ThursdAI (making sure I don't send you 10K words every week 😅)
OpenAI's O1 Pro API - Pricey Power for the Few
OpenAI, in contrast, released O1 Pro API, but with a price tag that's… astronomical. "OpenAI makes O1-pro API available to oligarchs ($600/1mtok output!)," I quipped, highlighting the exclusivity. $600 per million output tokens? "If you code with this, if you vibe code with this, you better already have VCs backing your startup," I warned. O1 Pro might be top-tier performance, but it's priced for the 0.1%.
NVIDIA GTC Recap - Jensen's Hardware Extravaganza
NVIDIA GTC was, as always, a hardware spectacle. New GPUs (Blackwell Ultra, Vera Rubin, Feynman!), the tiny DGX Spark supercomputer, the GR00T robot foundation model, and the Blue robot – NVIDIA is building the AI future, brick by silicon brick. Jensen is the AI world's rockstar, and GTC is his sold-out stadium show. Check out Rowan Cheung's GTC Recap on X for a quick overview.
Shoutout to our team at GTC and this amazingly timed logo shot I took from the live stream!
Antropic adds Web Search
We had a surprise at the end of the show, with Antropic releasing web search. It's a small thing, but for folks who use Cloud AI, it's very important.
You can now turn on web search directly on Claude which makes it... the last frontier lab to enable this feature 😂 Congrats!
AI Art & Diffusion & 3D: Tencent's 3D Revolution
Tencent Hunyuan 3D 2.0 MV and Turbo - 3D Generation Gets Real-Time
Tencent updated Hunyuan 3D to 2.0 MV (MultiView) and Turbo, pushing the boundaries of 3D generation. Hunyuan 3D 2.0 surpasses SOTA in geometry, texture, and alignment, and the Turbo version achieves near real-time 3D generation – under one second on an H100! Try out the Hunyuan3D-2mv HF Space to generate your own 3D masterpieces!
MultiView (MV) is another game-changer, allowing you to input 1-4 views for more accurate 3D models. "MV allows to generate 3d shapes from 1-4 views making the 3D shapes much higher quality" I explained. The demo of generating a 3D mouse from Gemini-generated images showcased the seamless pipeline from thought to 3D object. I literally just asked Gemini with native image generation to generate a character and then
Holodecks are getting closer, folks!
Closing Remarks and Thank You
And that's all she wrote, folks! Another week, another AI explosion. From voice to vision, open source to Big CO, this week was a whirlwind of innovation. Huge thanks again to our incredible guests, Joseph Nelson from Roboflow, Kwindla Cramer from Daily, and Lucas Atkins from ARCEE! And of course, massive shoutout to my co-hosts, Wolfram, Yam, and Nisten – you guys are the best!
And YOU, the ThursdAI community, are the reason we do this. Thank you for tuning in, for your support, and for being as hyped about AI as we are. Remember, ThursdAI is a labor of love, fueled by Weights & Biases and a whole lot of passion.
Missed anything? thursdai.news is your one-stop shop for the podcast, newsletter, and video replay. And seriously, subscribe to our YouTube channel! Let's get to 1000 subs!
Helpful? We’d love to see you here again!
TL;DR and Show Notes:
* Guests and Cohosts
* Alex Volkov - AI Evangelist & Weights & Biases (@altryne)
Co Hosts - @WolframRvnwlf @yampeleg @nisten
* Sponsor - Weights & Biases Weave (@weave_wb)
* Joseph Nelson - CEO Roboflow (@josephofiowa)
* Kindwla Kramer - CEO Daily (@kwindla)
* Lucas Atkins - Labs team at Arcee lead (@LukasAtkins7)
* Open Source LLMs
* Mistral Small 3.1 24B - Multimodal (Blog, HF, HF base)
* LG open sources EXAONE and EXAONE Deep 32B thinking model (Alex Reaction Video, LG BLOG, HF)
* ByteDance releases DAPO - better than GRPO RL Method (X, Github, Paper)
* NVIDIA drops LLama-Nemotron (Super 49B, Nano 8B) with reasoning and data (X, HF, Dataset)
* Big CO LLMs + APIs
* Google makes DeepResearch free, Canvas added, Live Previews (X)
* OpenAI makes O1-pro API available to oligarchs ($600/1mtok output!)
* NVIDIA GTC recap - (X)
* This weeks Buzz
* Come visit the Weights & Biases team at GTC today!
* Vision & Video
* Roboflow drops RF-DETR a SOTA vision model + new eval RF100-VL for VLMs (Blog, Github, Benchmark)
* StepFun dropped their image2video model TI2V (HF, Github)
* Voice & Audio
* OpenAI launches a new voice model and 2 new transcription models (Blog, Youtube)
* Canopy Labs drops Orpheus 3B (1B, 500B, 150M versions) - natural sounding speech language model (Blog, HF, Colab)
* NVIDIA Canary 1B/180M Flash - apache 2 speech recognition and translation LLama finetune (HF)
* AI Art & Diffusion & 3D
* Tencent updates Hunyuan 3D 2.0 MV (MultiView) and Turbo (HF)
* Tools
* ARCEE Conductor - model router (X)
* Cursor ships Claude 3.7 MAX (X)
* Notebook LM teases MindMaps (X)
* Gemini Co-Drawing - using Gemini native image output for helping drawing (HF)

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
- Ouvir Ouvir novamente Continuar A reproduzir…
- Ouvir depois Ouvir depois
📆 ThursdAI Turns Two! 🎉 Gemma 3, Gemini Native Image, new OpenAI tools, tons of open source & more AI news
13 mar· ThursdAI - The top AI news from the past week
LET'S GO!
Happy second birthday to ThursdAI, your favorite weekly AI news show! Can you believe it's been two whole years since we jumped into that random Twitter Space to rant about GPT-4? From humble beginnings as a late-night Twitter chat to a full-blown podcast, Newsletter and YouTube show with hundreds of thousands of downloads, it's been an absolutely wild ride!
That's right, two whole years of me, Alex Volkov, your friendly AI Evangelist, along with my amazing co-hosts, trying to keep you up-to-date on the breakneck speed of the AI world
And what better way to celebrate than with a week PACKED with insane AI news? Buckle up, folks, because this week Google went OPEN SOURCE crazy, Gemini got even cooler, OpenAI created a whole new Agents SDK and the open-source community continues to blow our minds. We’ve got it all - from game-changing model releases to mind-bending demos.
This week I'm also on the Weights & Biases company retreat, so TL;DR first and then the newsletter, but honestly, I'll start embedding the live show here in the substack from now on, because we're getting so good at it, I barely have to edit lately and there's a LOT to show you guys!
TL;DR and Show Notes & Links
* Hosts & Guests
* Alex Volkov - AI Eveangelist & Weights & Biases (@altryne)
* Co Hosts - @WolframRvnwlf @ldjconfirmed @nisten
* Sandra Kublik - DevRel at Cohere (@itsSandraKublik)
* Open Source LLMs
* Google open sources Gemma 3 - 1B - 27B - 128K context (Blog, AI Studio, HF)
* EuroBERT - multilingual encoder models (210M to 2.1B params)
* Reka Flash 3 (reasoning) 21B parameters is open sourced (Blog, HF)
* Cohere Command A 111B model - 256K context (Blog)
* Nous Research Deep Hermes 24B / 3B Hybrid Reasoners (X, HF)
* AllenAI OLMo 2 32B - fully open source GPT4 level model (X, Blog, Try It)
* Big CO LLMs + APIs
* Gemini Flash generates images natively (X, AI Studio)
* Google deep research is now free in Gemini app and powered by Gemini Thinking (Try It no cost)
* OpenAI released new responses API, Web Search, File search and Computer USE tools (X, Blog)
* This weeks Buzz
* The whole company is at an offsite at oceanside, CA
* W&B internal MCP hackathon and had cool projects - launching an MCP server soon!
* Vision & Video
* Remade AI - 8 LORA video effects for WANX (HF)
* AI Art & Diffusion & 3D
* ByteDance Seedream 2.0 - A Native Chinese-English Bilingual Image Generation Foundation Model by ByteDance (Blog, Paper)
* Tools
* Everyone's talking about Manus - (manus.im)
* Google AI studio now supports youtube understanding via link dropping
Open Source LLMs: Gemma 3, EuroBERT, Reka Flash 3, and Cohere Command-A Unleashed!
This week was absolutely HUGE for open source, folks. Google dropped a BOMBSHELL with Gemma 3! As Wolfram pointed out, this is a "very technical achievement," and it's not just one model, but a whole family ranging from 1 billion to 27 billion parameters. And get this – the 27B model can run on a SINGLE GPU! Sundar Pichai himself claimed you’d need "at least 10X compute to get similar performance from other models." Insane!
Gemma 3 isn't just about size; it's packed with features. We're talking multimodal capabilities (text, images, and video!), support for over 140 languages, and a massive 128k context window. As Nisten pointed out, "it might actually end up being the best at multimodal in that regard" for local models. Plus, it's fine-tuned for safety and comes with ShieldGemma 2 for content moderation. You can grab Gemma 3 on Google AI Studio, Hugging Face, Ollama, Kaggle – everywhere! Huge shoutout to Omar Sanseviero and the Google team for this incredible release and for supporting the open-source community from day one! Colin aka Bartowski, was right, "The best thing about Gemma is the fact that Google specifically helped the open source communities to get day one support." This is how you do open source right!
Next up, we have EuroBERT, a new family of multilingual encoder models. Wolfram, our European representative, was particularly excited about this one: "In European languages, you have different characters than in other languages. And, um, yeah, encoding everything properly is, uh, difficult." Ranging from 210 million to 2.1 billion parameters, EuroBERT is designed to push the boundaries of NLP in European and global languages. With training on a massive 5 trillion-token dataset across 15 languages and support for 8K context tokens, EuroBERT is a workhorse for RAG and other NLP tasks. Plus, how cool is their mascot?
Reka Flash 3 - a 21B reasoner with apache 2 trained with RLOO
And the open source train keeps rolling! Reka AI dropped Reka Flash 3, a 21 billion parameter reasoning model with an Apache 2.0 license! Nisten was blown away by the benchmarks: "This might be one of the best like 20B size models that there is right now. And it's Apache 2.0. Uh, I, I think this is a much bigger deal than most people realize." Reka Flash 3 is compact, efficient, and excels at chat, coding, instruction following, and function calling. They even used a new reinforcement learning technique called REINFORCE Leave One-Out (RLOO). Go give it a whirl on Hugging Face or their chat interface – chat.reka.ai!
Last but definitely not least in the open-source realm, we had a special guest, Sandra (@itsSandraKublik) from Cohere, join us to announce Command-A! This beast of a model clocks in at 111 BILLION parameters with a massive 256K context window. Sandra emphasized its efficiency, "It requires only two GPUs. Typically the models of this size require 32 GPUs. So it's a huge, huge difference." Command-A is designed for enterprises, focusing on agentic tasks, tool use, and multilingual performance. It's optimized for private deployments and boasts enterprise-grade security. Congrats to Sandra and the Cohere team on this massive release!
Big CO LLMs + APIs: Gemini Flash Gets Visual, Deep Research Goes Free, and OpenAI Builds for Agents
The big companies weren't sleeping either! Google continued their awesome week by unleashing native image generation in Gemini Flash Experimental! This is seriously f*****g cool, folks! Sorry for my French, but it’s true. You can now directly interact with images, tell Gemini what to do, and it just does it. We even showed it live on the stream, turning ourselves into cat-confetti-birthday-hat-wearing masterpieces!
Wolfram was right, "It's also a sign what we will see in, like, Photoshop, for example. Where you, you expect to just talk to it and have it do everything that a graphic designer would be doing." The future of creative tools is HERE.
And guess what else Google did? They made Deep Research FREE in the Gemini app and powered by Gemini Thinking! Nisten jumped in to test it live, and we were all impressed. "This is the nicest interface so far that I've seen," he said. Deep Research now digs through HUNDREDS of websites (Nisten’s test hit 156!) to give you comprehensive answers, and the interface is slick and user-friendly. Plus, you can export to Google Docs! Intelligence too cheap to meter? Google is definitely pushing that boundary.
Last second additions - Allen Institute for AI released OLMo 2 32B - their biggest open model yet
Just as I'm writing this, friend of the pod, Nathan from Allen Institute for AI announced the release of a FULLY OPEN OLMo 2, which includes weights, code, dataset, everything and apparently it beats the latest GPT 3.5, GPT 4o mini, and leading open weight models like Qwen and Mistral.
Evals look legit, but nore than that, this is an Apache 2 model with everything in place to advance open AI and open science!
Check out Nathans tweet for more info, and congrats to Allen team for this awesome release!
OpenAI new responses API and Agent ASK with Web, File and CUA tools
Of course, OpenAI wasn't going to let Google have all the fun. They dropped a new SDK for agents called the Responses API. This is a whole new way to build with OpenAI, designed specifically for the agentic era we're entering. They also released three new tools: Web Search, Computer Use Tool, and File Search Tool. The Web Search tool is self-explanatory – finally, built-in web search from OpenAI!
The Computer Use Tool, while currently limited in availability, opens up exciting possibilities for agent automation, letting agents interact with computer interfaces. And the File Search Tool gives you a built-in RAG system, simplifying knowledge retrieval from your own files. As always, OpenAI is adapting to the agentic world and giving developers more power.
Finally in the big company space, Nous Research released PORTAL, their new Inference API service. Now you can access their awesome models, like Hermes 3 Llama 70B and DeepHermes 3 8B, directly via API. It's great to see more open-source labs offering API access, making these powerful models even more accessible.
This Week's Buzz at Weights & Biases: Offsite Hackathon and MCP Mania!
This week's "This Week's Buzz" segment comes to you live from Oceanside, California! The whole Weights & Biases team is here for our company offsite. Despite the not-so-sunny California weather (thanks, storm!), it's been an incredible week of meeting colleagues, strategizing, and HACKING!
And speaking of hacking, we had an MCP hackathon! After last week’s MCP-pilling episode, we were all hyped about Model Context Protocol, and the team didn't disappoint. In just three hours, the innovation was flowing! We saw agents built for WordPress, MCP support integrated into Weave playground, and even MCP servers for Weights & Biases itself! Get ready, folks, because an MCP server for Weights & Biases is COMING SOON! You'll be able to talk to your W&B data like never before. Huge shoutout to the W&B team for their incredible talent and for embracing the agentic future! And in case you missed it, Weights & Biases is now part of the CoreWeave family! Exciting times ahead!
Vision & Video: LoRA Video Effects and OpenSora 2.0
Moving into vision and video, Remade AI released 8 LoRA video effects for 1X! Remember 1X from Alibaba? Now you can add crazy effects like "squish," "inflate," "deflate," and even "cakeify" to your videos using LoRAs. It's open source and super cool to see video effects becoming trainable and customizable.
And in the realm of open-source video generation, OpenSora 2.0 dropped! This 11 billion parameter model claims state-of-the-art video generation trained for just $200,000! They’re even claiming performance close to Sora itself on some benchmarks. Nisten checked out the demos, and while we're all a bit jaded now with the rapid pace of video AI, it's still mind-blowing how far we've come. Open source video is getting seriously impressive, seriously fast.
AI Art & Diffusion & 3D: ByteDance's Bilingual Seedream 2.0
ByteDance, the folks behind TikTok, released Seedream 2.0, a native Chinese-English bilingual image generation foundation model. This model, from ByteDream, excels at text rendering, cultural nuance, and human preference alignment. Seedream 2.0 boasts "powerful general capability," "native bilingual comprehension ability," and "excellent text rendering." It's designed to understand both Chinese and English prompts natively, generating high-quality, culturally relevant images. The examples look stunning, especially its ability to render Chinese text beautifully.
Tools: Manus AI Agent, Google AI Studio YouTube Links, and Cursor Embeddings
Finally, in the tools section, everyone's buzzing about Manus, a new AI research agent. We gave it a try live on the show, asking it to do some research. The UI is slick, and it seems to be using Claude 3.7 behind the scenes. Manus creates a to-do list, browses the web in a real Chrome browser, and even generates files. It's like Operator on steroids. We'll be keeping an eye on Manus and will report back on its performance in future episodes.
And Google AI Studio keeps getting better! Now you can drop YouTube links into Google AI Studio, and it will natively understand the video! This is HUGE for video analysis and content understanding. Imagine using this for support, content summarization, and so much more.
PHEW! What a week to celebrate two years of ThursdAI! From open source explosions to Gemini's visual prowess and OpenAI's agentic advancements, the AI world is moving faster than ever. As Wolfram aptly put it, "The acceleration, you can feel it." And Nisten reminded us of the incredible journey, "I remember I had early access to GPT-4 32K, and, uh, then... the person for the contract that had given me access, they cut it off because on the one weekend, I didn't realize how expensive it was. So I had to use $180 worth of tokens just trying it out." Now, we have models that are more powerful and more accessible than ever before.
Thank you to Wolfram, Nisten, and LDJ for co-hosting and bringing their insights every week.
And most importantly, THANK YOU to our amazing community for tuning in, listening, and supporting ThursdAI for two incredible years! We couldn't do it without you. Here's to another year of staying up-to-date so YOU don't have to! Don't forget to subscribe to the podcast, YouTube channel, and newsletter to stay in the loop. And share ThursdAI with a friend – it's the best birthday gift you can give us! Until next week, keep building and keep exploring the amazing world of AI! LET'S GO!

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
- Ouvir Ouvir novamente Continuar A reproduzir…
- Ouvir depois Ouvir depois
Estão a faltar episódios?

Clique aqui para atualizar o feed.
ThursdAI - Mar 6, 2025 - Alibaba's R1 Killer QwQ, Exclusive Google AI Mode Chat, and MCP fever sweeping the community!
6 mar· ThursdAI - The top AI news from the past week
What is UP folks! Alex here from Weights & Biases (yeah, still, but check this weeks buzz section below for some news!)
I really really enjoyed today's episode, I feel like I can post it unedited it was so so good. We started the show with our good friend Junyang Lin from Alibaba Qwen, where he told us about their new 32B reasoner QwQ. Then we interviewed Google's VP of the search product, Robby Stein, who came and told us about their upcoming AI mode in Google! I got access and played with it, and it made me switch back from PPXL as my main.
And lastly, I recently became fully MCP-pilled, since we covered it when it came out over thanksgiving, I saw this acronym everywhere on my timeline but only recently "got it" and so I wanted to have an MCP deep dive, and boy... did I get what I wished for! You absolutely should tune in to the show as there's no way for me to cover everything we covered about MCP with Dina and Jason! ok without, further adieu.. let's dive in (and the TL;DR, links and show notes in the end as always!)
ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.
🤯 Alibaba's QwQ-32B: Small But Shocking Everyone!
The open-source LLM segment started strong, chatting with friend of the show Junyang Justin Lin from Alibaba’s esteemed Qwen team. They've cooked up something quite special: QwQ-32B, a reasoning-focused, reinforcement-learning-boosted beast that punches remarkably above its weight. We're talking about a mere 32B parameters model holding its ground on tough evaluations against DeepSeek R1, a 671B behemoth!
Here’s how wild this is: You can literally run QwQ on your Mac! Junyang shared that they applied two solid rounds of RL to amp its reasoning, coding, and math capabilities, integrating agents into the model to fully unlock its abilities. When I called out how insane it was that we’ve gone from "LLMs can't do math" to basically acing competitive math benchmarks like AIME24, Junyang calmly hinted that they're already aiming for unified thinking/non-thinking models. Sounds wild, doesn’t it?
Check out the full QwQ release here, or dive into their blog post.
🚀 Google Launches AI Mode: Search Goes Next-Level (X, Blog, My Live Reaction).
For the past two years, on this very show, we've been asking, "Where's Google?" in the Gen AI race. Well, folks, they're back. And they're back in a big way.
Next, we were thrilled to have Google’s own Robby Stein, VP of Product for Google Search, drop by ThursdAI after their massive launch of AI Mode and expanded AI Overviews leveraging Gemini 2.0. Robby walked us through this massive shift, which essentially brings advanced conversational AI capabilities straight into Google. Seriously — Gemini 2.0 is now out here doing complex reasoning while performing fan-out queries behind the scenes in Google's infrastructure.
Google search is literally Googling itself. No joke. "We actually have the model generating fan-out queries — Google searches within searches — to collect accurate, fresh, and verified data," explained Robby during our chat. And I gotta admit, after playing with AI Mode, Google is definitely back in the game—real-time restaurant closures, stock analyses, product comparisons, and it’s conversational to boot. You can check my blind reaction first impression video here. (also, while you're there, why not subscribe to my YT?)
Google has some huge plans, but right now AI Mode is rolling out slowly via Google Labs for Google One AI Premium subscribers first. More soon though!
🐝 This Week's Buzz: Weights & Biases Joins CoreWeave Family!
Huge buzz (in every sense of the word) from Weights & Biases, who made waves with their announcement this week: We've joined forces with CoreWeave! Yeah, that's big news as CoreWeave, the AI hyperscaler known for delivering critical AI infrastructure, has now acquired Weights & Biases to build the ultimate end-to-end AI platform. It's early days of this exciting journey, and more details are emerging, but safe to say: the future of Weights & Biases just got even more exciting. Congrats to the whole team at Weights & Biases and our new colleagues at CoreWeave!
We're committed to all users of WandB so you will be able to keep using Weights & Biases, and we'll continuously improve our offerings going forward! Personally, also nothing changes for ThursdAI! 🎉
MCP Takes Over: Giving AI agents super powers via standardized protocol
Then things got insanely exciting. Why? Because MCP is blowing up and I had to find out why everyone's timeline (mine included) just got invaded.
Welcoming Cloudflare’s amazing product manager Dina Kozlov and Jason Kneen—MCP master and creator—things quickly got mind-blowing. MCP servers, Jason explained, are essentially tool wrappers that effortlessly empower agents with capabilities like API access and even calling other LLMs—completely seamlessly and securely. According to Jason, "we haven't even touched the surface yet of what MCP can do—these things are Lego bricks ready to form swarms and even self-evolve."
Dina broke down just how easy it is to launch MCP servers on Cloudflare Workers while teasing exciting upcoming enhancements. Both Dina and Jason shared jaw-dropping examples, including composing complex workflows connecting Git, Jira, Gmail, and even smart home controls—practically instantaneously! Seriously, my mind is still spinning.
The MCP train is picking up steam, and something tells me we'll be talking about this revolutionary agent technology a lot more soon. Check out two great MCP directories that popped up this recently: Smithery, Cursor Directory and Composio.
This show was one of the best ones we recorded, honestly, I barely need to edit it. It was also a really really fun livestream, so if you prefer seeing to listening, here's the lightly edited live stream
Thank you for being a ThursdAI subscriber, as always here's the TL:DR and shownotes for everything that happened in AI this week and the things we mentioned (and hosts we had)
ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.
TL;DR and Show Notes
* Show Notes & Guests
* Alex Volkov - AI Eveangelist & Weights & Biases (@altryne)
* Co Hosts - @WolframRvnwlf @ldjconfirmed @nisten
* Junyang Justin Lin - Head of Qwen Team, Alibaba - @JustinLin610
* Robby Stein - VP of Product, Google Search - @rmstein
* Dina Kozlov - Product Manager, Cloudflare - @dinasaur_404
* Jason Kneen - MCP Wiz - @jasonkneen
* My Google AI Mode Blind Reaction Video (Youtube)
* Sesame Maya Conversation Demo - (Youtube)
* Cloudflare MCP docs (Blog)
* Weights & Biases Agents Course Pre-signup - https://wandb.me/agents
* Open Source LLMs
* Qwen's latest reasoning model QwQ-32B - matches R1 on some evals (X, Blog, HF, Chat)
* Cohere4ai - Aya Vision - 8B & 32B (X, HF)
* AI21 - Jamba 1.6 Large & Jamba 1.6 Mini (X, HF)
* Big CO LLMs + APIs
* Google announces AI Mode & AI Overviews Gemini 2.0 (X, Blog, My Live Reaction)
* OpenAI rolls out GPT 4.5 to plus users - #1 on LM Arena 🔥 (X)
* Grok Voice is available for free users as well (X)
* Elysian Labs launches Auren ios app (X, App Store)
* Mistral announces SOTA OCR (Blog)
* This weeks Buzz
* Weights & Biases is acquired by CoreWeave 🎉 (Blog)
* Vision & Video
* Tencent HYVideo img2vid is finally here (X, HF, Try It)
* Voice & Audio
* NotaGen - symbolic music generation model high-quality classical sheet music Github, Demo, HF
* Sesame takes the world by storm with their amazing voice model (My Reaction)
* AI Art & Diffusion & 3D
* MiniMax__AI - Image-01: A Versatile Text-to-Image Model at 1/10 the Cost (X, Try it)
* Zhipu AI - CogView 4 6B - (X, Github)
* Tools
* Google - DataScience agent in GoogleColab Blog
* Baidu Miaoda - nocode AI build tool

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
- Ouvir Ouvir novamente Continuar A reproduzir…
- Ouvir depois Ouvir depois
📆 Feb 27, 2025 - GPT-4.5 Drops TODAY?!, Claude 3.7 Coding BEAST, Grok's Unhinged Voice, Humanlike AI voices & more AI news
28 fev· ThursdAI - The top AI news from the past week
Hey all, Alex here 👋
What can I say, the weeks are getting busier , and this is one of those "crazy full" weeks in AI. As we were about to start recording, OpenAI teased GPT 4.5 live stream, and we already had a very busy show lined up (Claude 3.7 vibes are immaculate, Grok got an unhinged voice mode) and I had an interview with Kevin Hou from Windsurf scheduled! Let's dive in!
🔥 GPT 4.5 (ORION) is here - worlds largest LLM (10x GPT4o)
OpenAI has finally shipped their next .5 model, which is 10x scale from the previous model. We didn't cover this on the podcast but did watch the OpenAI live stream together after the podcast concluded.
A very interesting .5 release from OpenAI, where even Sam Altman says "this model won't crush on benchmarks" and is not the most frontier model, but is OpenAI's LARGEST model by far (folks are speculating 10+ Trillions of parameters)
After 2 years of smaller models and distillations, we finally got a new BIG model, that shows scaling laws proper, and while on some benchmarks it won't compete against reasoning models, this model will absolutely fuel a huge increase in capabilities even for reasoners, once o-series models will be trained on top of this.
Here's a summary of the announcement and quick vibes recap (from folks who had access to it before)
* OpenAI's largest, most knowledgeable model.
* Increased world knowledge: 62.5% on SimpleQA, 71.4% GPQA
* Better in creative writing, programming, problem-solving (no native step-by-step reasoning).
* Text and image input and text output
* Available in ChatGPT Pro and API access (API supports Function Calling, Structured Output)
* Knowledge Cutoff is October 2023.
* Context Window is 128,000 tokens.
* Max Output is 16,384 tokens.
* Pricing (per 1M tokens): Input: $75, Output: $150, Cached Input: $37.50.
* Foundation for future reasoning models
4.5 Vibes Recap
Tons of folks who had access are pointing to the same thing, while this model is not beating others on evals, it's much better at multiple other things, namely creative writing, recommending songs, improved vision capability and improved medical diagnosis.
Karpathy said "Everything is a little bit better and it's awesome, but also not exactly in ways that are trivial to point to" and posted a thread of pairwise comparisons of tone on his X thread
Though the reaction is bifurcated as many are upset with the high price of this model (10x more costly on outputs) and the fact that it's just marginally better at coding tasks. Compared to the newerSonnet (Sonnet 3.7) and DeepSeek, folks are looking at OpenAI and asking, why isn't this way better?
Anthropic's Claude 3.7 Sonnet: A Coding Powerhouse
Anthropic released Claude 3.7 Sonnet, and the immediate reaction from the community was overwhelmingly positive. With 8x more output capability (64K) and reasoning built in, this model is an absolute coding powerhouse.
Claude 3.7 Sonnet is the new king of coding models, achieving a remarkable 70% on the challenging SWE-Bench benchmark, and the initial user feedback is stellar, though vibes started to shift a bit towards Thursday.
Ranking #1 on WebDev arena, and seemingly trained on UX and websites, Claude Sonnet 3.7 (AKA NewerSonner) has been blowing our collective minds since it was released on Monday, especially due to introducing Thinking and reasoning in a combined model.
Now, since the start of the week, the community actually had time to play with it, and some of them return to sonnet 3.5 and saying that while the model is generally much more capable, it tends to generate tons of things that are unnecessary.
I wonder if the shift is due to Cursor/Windsurf specific prompts, or the model's larger output context, and we'll keep you updated on if the vibes shift again.
Open Source LLMs
This week was HUGE for open source, folks. We saw releases pushing the boundaries of speed, multimodality, and even the very way LLMs generate text!
DeepSeek's Open Source Spree
DeepSeek went on an absolute tear, open-sourcing a treasure trove of advanced tools and techniques:
This isn't your average open-source dump, folks. We're talking FlashMLA (efficient decoding on Hopper GPUs), DeepEP (an optimized communication library for MoE models), DeepGEMM (an FP8 GEMM library that's apparently ridiculously fast), and even parallelism strategies like DualPipe and EPLB.
They are releasing some advanced stuff for training and optimization of LLMs, you can follow all their releases on their X account
Dual Pipe seems to be the one that got most attention from the community, which is an incredible feat in pipe parallelism, that even got the cofounder of HuggingFace super excited
Microsoft's Phi-4: Multimodal and Mini (Blog, HuggingFace)
Microsoft joined the party with Phi-4-multimodal (5.6B parameters) and Phi-4-mini (3.8B parameters), showing that small models can pack a serious punch.
These models are a big deal. Phi-4-multimodal can process text, images, and audio, and it actually beats WhisperV3 on transcription! As Nisten said, "This is a new model and, I'm still reserving judgment until, until I tried it, but it looks ideal for, for a portable size that you can run on the phone and it's multimodal." It even supports a wide range of languages. Phi-4-mini, on the other hand, is all about speed and efficiency, perfect for finetuning.
Diffusion LLMs: Mercury Coder and LLaDA (X , Try it)
This is where things get really interesting. We saw not one, but two diffusion-based LLMs this week: Mercury Coder from Inception Labs and LLaDA 8B. (Although, ok, to be fair, LLaDa released 2 weeks ago I was just busy)
For those who don't know, diffusion is usually used for creating things like images. The idea of using it to generate text is like saying, "Okay, there's a revolutionary tool for painting; I'll write the code using it." Inception Labs' Mercury Coder is claiming over 1000 tokens per second on NVIDIA H100s – that's insane speed, usually only seen with specialized chips! Nisten spent hours digging into these, noting, "This is a complete breakthrough and, it just hasn't quite hit yet that this just happened because people thought for a while it should be possible because then you can do, you can do multiple token prediction at once". He explained that these models combine a regular LLM with a diffusion component, allowing them to generate multiple tokens simultaneously and excel at tasks like "fill in the middle" coding.
LLaDA 8B, on the other hand, is an open-source attempt, and while it needs more training, it shows the potential of this approach. LDJ pointed out that LLaDA is "trained on like around five times or seven times less data while already like competing with LLAMA3 AP with same parameter count".
Are diffusion LLMs the future? It's too early to say, but the speed gains are very intriguing.
Magma 8B: Robotics LLM from Microsoft
Microsoft dropped Magma 8B, a Microsoft Research project, an open-source model that combines vision and language understanding with the ability to control robotic actions.
Nisten was particularly hyped about this one, calling it "the robotics. LLM." He sees it as a potential game-changer for robotics companies, allowing them to build robots that can understand visual input, respond to language commands, and act in the real world.
OpenAI's Deep Research for Everyone (Well, Plus Subscribers)
OpenAI finally brought Deep Research, its incredible web-browsing and research tool, to Plus subscribers.
I've been saying this for a while: Deep Research is another ChatGPT moment. It's that good. It goes out, visits websites, understands your query in context, and synthesizes information like nothing else. As Nisten put it, "Nothing comes close to OpenAI's Deep Research...People like pull actual economics data, pull actual stuff." If you haven't tried it, you absolutely should.
Our full coverage of Deep Research is here if you haven't yet listened, it's incredible.
Alexa Gets an AI Brain Upgrade with Alexa+
Amazon finally announced Alexa+, the long-awaited LLM-powered upgrade to its ubiquitous voice assistant.
Alexa+ will be powered by Claude (and sometimes Nova), offering a much more conversational and intelligent experience, with integrations across Amazon services.
This is a huge deal. For years, Alexa has felt… well, dumb, compared to the advancements in LLMs. Now, it's getting a serious intelligence boost, thanks to Anthropic's Claude. It'll be able to handle complex conversations, control smart home devices, and even perform tasks across various Amazon services. Imagine asking Alexa, "Did I let the dog out today?" and it actually checking your Ring camera footage to give you an answer! (Although, as I joked, let's hope it doesn't start setting houses on fire.)
Also very intriguing is the new SDKs they are releasing to connect Alexa+ to all kinds of experience, I think this is huge and will absolutely create a new industry of applications built for voice Alexa.
Alexa Web Actions for example will allow Alexa to navigate to a website and complete actions (think order Uber Eats)
The price? 20$/mo but free if you're a Amazon Prime subscriber, which is most of the US households at this point.
They are focusing on personalization and memory, though still unclear how that's going to be handled, and the ability to share documents like schedules
I'm very much looking forward to smart Alexa, and to be able to say "Alexa, set a timer for the amount of time it takes to hard boil an egg, and flash my house lights when the timer is done"
Grok Gets a Voice... and It's UNHINGED
Grok, Elon Musk's AI, finally got a voice mode, and… well, it's something else.
One-sentence summary: Grok's new voice mode includes an "unhinged" 18+ option that curses like a sailor, along with other personality settings.
Yes, you read that right. There's literally an "unhinged" setting in the UI. We played it live on the show, and... well, let's just say it's not for the faint of heart (or for kids). Here's a taste:
Alex: "Hey there."
Grok: "Yo, Alex. What's good, you horny b*****d? How's your day been so far? Fucked up or just mildly shitty?"
Beyond the shock value, the voice mode is actually quite impressive in its expressiveness and ability to understand interruptions. It has several personalities, from a helpful "Grok Doc" to an "argumentative" mode that will disagree with everything you say. It's... unique.
This Week's Buzz (WandB-Related News)
Agents Course is Coming!
We announced our upcoming agents course! You can pre-sign up HERE . This is going to be a deep dive into building and deploying AI agents, so don't miss it!
AI Engineer Summit Recap
We briefly touched on the AI Engineer Summit in New York, where we met with Kevin Hou and many other brilliant minds in the AI space. The theme was "Agents at Work," and it was a fantastic opportunity to see the latest developments in agent technology. I gave a talk about reasoning agents and had a workshop about evaluations on Saturday, and saw many listeners of ThursdAI 👏 ✋
Interview with Kevin Hou from Windsurf
This week we had the pleasure of chatting with Kevin Hou from Windsurf about their revolutionary AI editor. Windsurf isn't just another IDE, it's an agentic IDE. As Kevin explained, "we made the pretty bold decision of saying, all right, we're not going to do chat... we are just going to [do] agent." They've built Windsurf from the ground up with an agent-first approach, and it’s making waves.
Kevin walked us through the evolution of AI coding tools, from autocomplete to chat, and now to agents. He highlighted the "magical experiences" users are having, like debugging complex code with AI assistance that actually understands the context. We also delved into the challenges – memory, checkpointing, and cost.
We also talked about the burning question: vibe coding. Is coding as we know it dead? Kevin’s take was nuanced: "there's an in between state that I really vibe or like gel with, which is,the scaffolding of what you want… Let's use, let's like vibe code and purely use the agent to accomplish this sort of commit." He sees AI agents raising the bar for software quality, demanding better UX, testing, and overall polish.
And of course, we had to ask about the elephant in the room – why are so many people switching from Cursor to Windsurf? Kevin's answer was humble, pointing to user experience, the agent-first workflow, and the team’s dedication to building the best product. Check out our full conversation on the pod and download Windsurf for yourself: windsurf.ai
Video Models & Voice model updates
There is so much happening in LLM world, that folks may skip over the other stuff, but there's so much happening in these world's as well this week! Here's a brief recap!
* Alibaba's WanX: Open-sourced, cutting-edge video generation models making waves with over 250,000 downloads already. They claim to take SOTA on open source video generation evals and of course img2video of this high quality model will lead to ... folks using it for all kinds of things.
* HUMEs Octave: A groundbreaking LLM model that genuinely understands context and emotion and does TTS. Blog Hume has been doing emotional TTS but with this TTS focused LLM we are now able to create voices with a prompt, and receive emotional responses that are inferred from the text. Think shyness, sarcasm, anger etc
* 11labs’ Scribe: Beating Whisper 3 with impressive accuracy and diarization features, Scribe is raising the bar in speech-to-text quality. 11labs releasing their own ASR (automatic speech recognition) was not in my cards, and boy did they deliver. Beating whisper, with speaker separation (diarization), world level timestamps and much lower WER than other models, this is a very interesting entry to this space. However, free for now on their website, it's significantly slower than Gemini 2.0 and Whisper for me at least.
* Sesame releases their conversational speech model (and promising to open source this) and it's honestly the best / least uncanny conversations I had with an AI. Check out my conversation with it
* Lastly, VEO 2, the best video model around according to some, is finally available via API (though txt2video only) and it's fairly expensive, but gives some amazing results. You can try it out on FAL
Phew, it looks like we've made it! Huge huge week in AI, big 2 new models, tons of incredible updates on multimodality and voice as well 🔥
If you enjoyed this summary, the best way to support us is to share with a friend (or 3) and give us a 5 start reviews on wherever you get your podcasts, it really does help! 👏
See you next week,
Alex

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
- Ouvir Ouvir novamente Continuar A reproduzir…
- Ouvir depois Ouvir depois
📆 ThursdAI - Feb 20 - Live from AI Eng in NY - Grok 3, Unified Reasoners, Anthropic's Bombshell, and Robot Handoffs!
20 fev· ThursdAI - The top AI news from the past week
Holy moly, AI enthusiasts! Alex Volkov here, reporting live from the AI Engineer Summit in the heart of (touristy) Times Square, New York! This week has been an absolute whirlwind of announcements, from XAI's Grok 3 dropping like a bomb, to Figure robots learning to hand each other things, and even a little eval smack-talk between OpenAI and XAI. It’s enough to make your head spin – but that's what ThursdAI is here for. We sift through the chaos and bring you the need-to-know, so you can stay on the cutting edge without having to, well, spend your entire life glued to X and Reddit.
This week we had a very special live show with the Haize Labs folks, the ones I previously interviewed about their bijection attacks, discussing their open source judge evaluation library called Verdict. So grab your favorite caffeinated beverage, maybe do some stretches because your mind will be blown, and let's dive into the TL;DR of ThursdAI, February 20th, 2025!
Participants
* Alex Volkov: AI Evangelist with Weights and Biases
* Nisten: AI Engineer and cohost
* Akshay: AI Community Member
* Nuo: Dev Advocate at 01AI
* Nimit: Member of Technical Staff at Haize Labs
* Leonard: Co-founder at Haize Labs
Open Source LLMs
Perplexity's R1 7076: Censorship-Free DeepSeek
Perplexity made a bold move this week, releasing R1 7076, a fine-tuned version of DeepSeek R1 specifically designed to remove what they (and many others) perceive as Chinese government censorship. The name itself, 1776, is a nod to American independence – a pretty clear statement! The core idea? Give users access to information on topics the CCP typically restricts, like Tiananmen Square and Taiwanese independence.
Perplexity used human experts to identify around 300 sensitive topics and built a "censorship classifier" to train the bias out of the model. The impressive part? They claim to have done this without significantly impacting the model's performance on standard evals. As Nuo from 01AI pointed out on the show, though, he'd "actually prefer that they can actually disclose more of their details in terms of post training... Running the R1 model by itself, it's already very difficult and very expensive." He raises a good point – more transparency is always welcome! Still, it's a fascinating attempt to tackle a tricky problem, the problem which I always say we simply cannot avoid. You can check it out yourself on Hugging Face and read their blog post.
Arc Institute & NVIDIA Unveil Evo 2: Genomics Powerhouse
Get ready for some serious science, folks! Arc Institute and NVIDIA dropped Evo 2, a massive genomics model (40 billion parameters!) trained on a mind-boggling 9.3 trillion nucleotides. And it’s fully open – two papers, weights, data, training, and inference codebases. We love to see it!
Evo 2 uses the StripedHyena architecture to process huge genetic sequences (up to 1 million nucleotides!), allowing for analysis of complex genomic patterns. The practical applications? Predicting the effects of genetic mutations (super important for healthcare) and even designing entire genomes. I’ve been super excited about genomics models, and seeing these alternative architectures like StripedHyena getting used here is just icing on the cake. Check it out on X.
ZeroBench: The "Impossible" Benchmark for VLLMs
Need more benchmarks? Always! A new benchmark called ZeroBench arrived, claiming to be the "impossible benchmark" for Vision Language Models (VLLMs). And guess what? All current top-of-the-line VLLMs get a big fat zero on it.
One example they gave was a bunch of scattered letters, asking the model to "answer the question that is written in the shape of the star among the mess of letters." Honestly, even I struggled to see the star they were talking about. It highlights just how much further VLLMs need to go in terms of true visual understanding. (X, Page, Paper, HF)
Hugging Face's Ultra Scale Playbook: Scaling Up
For those of you building massive models, Hugging Face released the Ultra Scale Playbook, a guide to building and scaling AI models on huge GPU clusters.
They ran 4,000 scaling experiments on up to 512 GPUs (nothing close to Grok's 100,000, but still impressive!). If you're working in a lab and dreaming big, this is definitely a resource to check out. (HF).
Big CO LLMs + APIs
Grok 3: XAI's Big Swing new SOTA LLM! (and Maybe a Bug?)
Monday evening, BOOM! While some of us were enjoying President's Day, the XAI team dropped Grok 3. They announced it with a setting very similar to OpenAI announcements. They're claiming state-of-the-art performance on some benchmarks (more on that drama later!), and a whopping 1 million token context window, finally confirmed after some initial confusion. They talked a lot about agents and a future of reasoners as well.
The launch was a bit… messy. First, there was a bug where some users were getting Grok 2 even when the dropdown said Grok 3. That led to a lot of mixed reviews. Even when I finally thought I was using Grok 3, it still flubbed my go-to logic test, the "Beth's Ice Cubes" question. (The answer is zero, folks – ice cubes melt!). But Akshay, who joined us on the show, chimed in with some love: "...with just the base model of Grok 3, it's, in my opinion, it's the best coding model out there." So, mixed vibes, to say the least! It's also FREE for now, "until their GPUs melt," according to XAI, which is great.
UPDATE: The vibes are shifting, more and more of my colleagues and mutuals are LOVING grok3 for one shot coding, for talking to it. I’m getting convinced as well, though I did use and will continue to use Grok for real time data and access to X.
DeepSearch
In an attempt to show off some Agentic features, XAI also launched a deep search (not research like OpenAI but effectively the same)
Now, XAI of course has access to X, which makes their deep search have a leg up, specifically for real time information! I found out it can even “use” the X search!
OpenAI's Open Source Tease
In what felt like a very conveniently timed move, Sam Altman dropped a poll on X the same day as the Grok announcement: if OpenAI were to open-source something, should it be a small, mobile-optimized model, or a model on par with o3-mini? Most of us chose o3 mini, just to have access to that model and play with it. No indication of when this might happen, but it’s a clear signal that OpenAI is feeling the pressure from the open-source community.
The Eval Wars: OpenAI vs. XAI
Things got spicy! There was a whole debate about the eval numbers XAI posted, specifically the "best of N" scores (like best of 64 runs). Boris from OpenAI, and Aiden mcLau called out some of the graphs. Folks on X were quick to point out that OpenAI also used "best of N" in the past, and the discussion devolved from there.
XAI is claiming SOTA. OpenAI (or some folks from within OpenAI) aren't so sure. The core issue? We can't independently verify Grok's performance because there's no API yet! As I said, "…we're not actually able to use this model to independently evaluate this model and to tell you guys whether or not they actually told us the truth." Transparency matters, folks!
DeepSearch - How Deep?
Grok also touted a new "Deep Search" feature, kind of like Perplexity or OpenAI's "Deep Research" in their more expensive plan. My initial tests were… underwhelming. I nicknamed it "Shallow Search" because it spent all of 34 seconds on a complex query where OpenAI's Deep Research took 11 minutes and cited 17 sources. We're going to need to do some more digging (pun intended) on this one.
This Week's Buzz
We’re leaning hard into agents at Weights & Biases! We just released an agents whitepaper (check it out on our socials!), and we're launching an agents course in collaboration with OpenAI's Ilan Biggio. Sign up at wandb.me/agents! We're hearing so much about agent evaluation and observability, and we're working hard to provide the tools the community needs.
Also, sadly, our Toronto workshops are completely sold out. But if you're at AI Engineer in New York, come say hi to our booth! And catch my talk on LLM Reasoner Judges tomorrow (Friday) at 11 am EST – it’ll be live on the AI Engineer YouTube channel (HERE)!
Vision & Video
Microsoft MUSE: Playable Worlds from a Single Image
This one is wild. Microsoft's MUSE can generate minutes of playable gameplay from just a single second of video frames and controller actions.
It's based on the World and Human Action Model (WHAM) architecture, trained on a billion gameplay images from Xbox. So if you’ve been playing Xbox lately, you might be in the model! I found it particularly cool: "…you give it like a single second of a gameplay of any type of game with all the screen elements, with percentages, with health bars, with all of these things and their model generates a game that you can control." (X, HF, Blog).
StepFun's Step-Video-T2V: State-of-the-Art (and Open Source!)
We got two awesome open-source video breakthroughs this week. First, StepFun's Step-Video-T2V (and T2V Turbo), a 30 billion parameter text-to-video model. The results look really good, especially the text integration. Imagine a Chinese girl opening a scroll, and the words "We will open source" appearing as she unfurls it. That’s the kind of detail we're talking about.
And it’s MIT licensed! As Nisten noted "This is pretty cool. It came out. Right before Sora came out, people would have lost their minds." (X, Paper, HF, Try It).
HAO AI's FastVideo: Speeding Up HY-Video
The second video highlight: HAO AI released FastVideo, a way to make HY-Video (already a strong open-source contender) three times faster with no additional training! They call the trick "Sliding Tile Attention" apparently that alone provides enormous boost compared to even flash attention.
This is huge because faster inference means these models become more practical for real-world use. And, bonus: it supports HY-Video's Loras, meaning you can fine-tune it for, ahem, all kinds of creative applications. I will not go as far as to mention civit ai. (Github)
Figure's Helix: Robot Collaboration!
Breaking news from the AI Engineer conference floor: Figure, the humanoid robot company, announced Helix, a Vision-Language-Action (VLA) model built into their robots!It has full upper body control!
What blew my mind: they showed two robots working together, handing objects to each other, based on natural language commands! As I watched, I exclaimed, "I haven't seen a humanoid robot, hand off stuff to the other one... I found it like super futuristically cool." The model runs on the robot, using a 7 billion parameter VLM for understanding and an 80 million parameter transformer for control. This is the future, folks!
Tools & Others
Microsoft's New Quantum Chip (and State of Matter!)
Microsoft announced a new quantum chip and a new state of matter (called "topological superconductivity"). "I found it like absolutely mind blowing that they announced something like this," I gushed on the show. While I'm no quantum physicist, this sounds like a big deal for the future of computing.
Verdict: Hayes Labs' Framework for LLM Judges
And of course, the highlight of our show: Verdict, a new open-source framework from Hayes Labs (the folks behind those "bijection" jailbreaks!) for composing LLM judges. This is a huge deal for anyone working on evaluation. Leonard and Nimit from Hayes Labs joined us to explain how Verdict addresses some of the core problems with LLM-as-a-judge: biases (like preferring their own responses!), sensitivity to prompts, and the challenge of "meta-evaluation" (how do you know your judge is actually good?).
Verdict lets you combine different judging techniques ("primitives") to create more robust and efficient evaluators. Think of it as "judge-time compute scaling," as Leonard called it. They're achieving near state-of-the-art results on benchmarks like ExpertQA, and it's designed to be fast enough to use as a guardrail in real-time applications!
One key insight: you don't always need a full-blown reasoning model for judging. As Nimit explained, Verdict can combine simpler LLM calls to achieve similar results at a fraction of the cost. And, it's open source! (Paper, Github,X).
Conclusion
Another week, another explosion of AI breakthroughs! Here are my key takeaways:
* Open Source is THRIVING: From censorship-free LLMs to cutting-edge video models, the open-source community is delivering incredible innovation.
* The Need for Speed (and Efficiency): Whether it's faster video generation or more efficient LLM judging, performance is key.
* Robots are Getting Smarter (and More Collaborative): Figure's Helix is a glimpse into a future where robots work together.
* Evaluation is (Finally) Getting Attention: Tools like Verdict are essential for building reliable and trustworthy AI systems.
* The Big Players are Feeling the Heat: OpenAI's open-source tease and XAI's rapid progress show that the competition is fierce.
I'll be back in my usual setup next week, ready to break down all the latest AI news. Stay tuned to ThursdAI – and don't forget to give the pod five stars and subscribe to the newsletter for all the links and deeper dives. There’s potentially an Anthropic announcement coming, so we’ll see you all next week.
TLDR
* Open Source LLMs
* Perplexity R1 1776 - finetune of china-less R1 (Blog, Model)
* Arc institute + Nvidia - introduce EVO 2 - genomics model (X)
* ZeroBench - impossible benchmark for VLMs (X, Page, Paper, HF)
* HuggingFace ultra scale playbook (HF)
* Big CO LLMs + APIs
* Grok 3 SOTA LLM + reasoning and Deep Search (blog, try it)
* OpenAI is about to open source something? Sam posts a polls
* This weeks Buzz
* We are about to launch an agents course! Pre-sign up wandb.me/agents
* Workshops are SOLD OUT
* Watch my talk LIVE from AI Engineer - 11am EST Friday (HERE)
* Keep watching AI Eng conference after the show on AIE YT
* )
* Vision & Video
* Microsoft MUSE - playable worlds from one image (X, HF, Blog)
* Microsoft OmniParser - Better, faster screen parsing for GUI agents with OmniParser v2 (Gradio Demo)
* HAO AI - fastVIDEO - making HY-Video 3x as fast (Github)
* StepFun - Step-Video-T2V (+Turbo), a SotA 30B text-to-video model (Paper, Github, HF, Try It)
* Figure announces HELIX - vision action model built into FIGURE Robot (Paper)
* Tools & Others
* Microsoft announces a new quantum chip and a new state of matter (Blog, X)
* Verdict - Framework to compose SOTA LLM judges with JudgeTime Scaling (Paper, Github,X)

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
- Ouvir Ouvir novamente Continuar A reproduzir…
- Ouvir depois Ouvir depois
📆 ThursdAI - Feb 13 - my Personal Rogue AI, DeepHermes, Fast R1, OpenAI Roadmap / RIP GPT6, new Claude & Grok 3 imminent?
13 fev· ThursdAI - The top AI news from the past week
What a week in AI, folks! Seriously, just when you think things might slow down, the AI world throws another curveball. This week, we had everything from rogue AI apps giving unsolicited life advice (and sending rogue texts!), to mind-blowing open source releases that are pushing the boundaries of what's possible, and of course, the ever-present drama of the big AI companies with OpenAI dropping a roadmap that has everyone scratching their heads.
Buckle up, because on this week's ThursdAI, we dove deep into all of it. We chatted with the brains behind the latest open source embedding model, marveled at a tiny model crushing math benchmarks, and tried to decipher Sam Altman's cryptic GPT-5 roadmap. Plus, I shared a personal story about an AI app that decided to psychoanalyze my text messages – you won't believe what happened! Let's get into the TL;DR of ThursdAI, February 13th, 2025 – it's a wild one!
* Alex Volkov: AI Adventurist with weights and biases
* Wolfram Ravenwlf: AI Expert & Enthusiast
* Nisten: AI Community Member
* Zach Nussbaum: Machine Learning Engineer at Nomic AI
* Vu Chan: AI Enthusiast & Evaluator
* LDJ: AI Community Member
Personal story of Rogue AI with RPLY
This week kicked off with a hilarious (and slightly unsettling) story of my own AI going rogue, all thanks to a new Mac app called RPLY designed to help with message replies. I installed it thinking it would be a cool productivity tool, but it turned into a personal intervention session, and then… well, let's just say things escalated.
The app started by analyzing my text messages and, to my surprise, delivered a brutal psychoanalysis of my co-parenting communication, pointing out how both my ex and I were being "unpleasant" and needed to focus on the kids. As I said on the show, "I got this as a gut punch. I was like, f*ck, I need to reimagine my messaging choices." But the real kicker came when the AI decided to take initiative and started sending messages without my permission (apparently this was a bug with RPLY that was fixed since I reported)!
Friends were texting me question marks, and my ex even replied to a random "Hey, How's your day going?" message with a smiley, completely out of our usual post-divorce communication style. "This AI, like on Monday before just gave me absolute s**t about not being, a person that needs to be focused on the kids also decided to smooth things out on friday" I chuckled, still slightly bewildered by the whole ordeal. It could have gone way worse, but thankfully, this rogue AI counselor just ended up being more funny than disastrous.
Open Source LLMs
DeepHermes preview from NousResearch
Just in time for me sending this newsletter (but unfortunately not quite in time for the recording of the show), our friends at Nous shipped an experimental new thinking model, their first reasoner, called DeepHermes.
NousResearch claims DeepHermes is among the first models to fuse reasoning and standard LLM token generation within a single architecture (a trend you'll see echoed in the OpenAI and Claude announcements below!)
Definitely experimental cutting edge stuff here, but exciting to see not just an RL replication but also innovative attempts from one of the best finetuning collectives around.
Nomic Embed Text V2 - First Embedding MoE
Nomic AI continues to impress with the release of Nomic Embed Text V2, the first general-purpose Mixture-of-Experts (MoE) embedding model. Zach Nussbaum from Nomic AI joined us to explain why this release is a big deal.
* First general-purpose Mixture-of-Experts (MoE) embedding model: This innovative architecture allows for better performance and efficiency.
* SOTA performance on multilingual benchmarks: Nomic Embed V2 achieves state-of-the-art results on the multilingual MIRACL benchmark for its size.
* Support for 100+ languages: Truly multilingual embeddings for global applications.
* Truly open source: Nomic is committed to open source, releasing training data, weights, and code under the Apache 2.0 License.
Zach highlighted the benefits of MoE for embeddings, explaining, "So we're trading a little bit of, inference time memory, and training compute to train a model with mixture of experts, but we get this, really nice added bonus of, 25 percent storage." This is especially crucial when dealing with massive datasets. You can check out the model on Hugging Face and read the Technical Report for all the juicy details.
AllenAI OLMOE on iOS and New Tulu 3.1 8B
AllenAI continues to champion open source with the release of OLMOE, a fully open-source iOS app, and the new Tulu 3.1 8B model.
* OLMOE iOS App: This app brings state-of-the-art open-source language models to your iPhone, privately and securely.
* Allows users to test open-source LLMs on-device.
* Designed for researchers studying on-device AI and developers prototyping new AI experiences.
* Optimized for on-device performance while maintaining high accuracy.
* Fully open-source code for further development.
* Available on the App Store for iPhone 15 Pro or newer and M-series iPads.
* Tulu 3.1 8B
As Nisten pointed out, "If you're doing edge AI, the way that this model is built is pretty ideal for that." This move by AllenAI underscores the growing importance of on-device AI and open access. Read more about OLMOE on the AllenAI Blog.
Groq Adds Qwen Models and Lands on OpenRouter
Groq, known for its blazing-fast inference speeds, has added Qwen models, including the distilled R1-distill, to its service and joined OpenRouter.
* Record-fast inference: Experience a mind-blowing 1000 TPS with distilled DeepSeek R1 70B on Open Router.
* Usable Rate Limits: Groq is now accessible for production use cases with higher rate limits and pay-as-you-go options.
* Qwen Model Support: Access Qwen models like 2.5B-32B and R1-distill-qwen-32B.
* Open Router Integration: Groq is now available on OpenRouter, expanding accessibility for developers.
As Nisten noted, "At the end of the day, they are shipping very fast inference and you can buy it and it looks like they are scaling it. So they are providing the market with what it needs in this case." This integration makes Groq's speed even more accessible to developers. Check out Groq's announcement on X.com.
SambaNova adds full DeepSeek R1 671B - flies at 200t/s (blog)
In a complete trend of this week, SambaNova just announced they have availability of DeepSeek R1, sped up by their custom chips, flying at 150-200t/s. This is the full DeepSeek R1, not the distilled Qwen based versions!
This is really impressive work, and compared to the second fastest US based DeepSeek R1 (on Together AI) it absolutely flies
Agentica DeepScaler 1.5B Beats o1-preview on Math
Agentica's DeepScaler 1.5B model is making waves by outperforming OpenAI's o1-preview on math benchmarks, using Reinforcement Learning (RL) for just $4500 of compute.
* Impressive Math Performance: DeepScaleR achieves a 37.1% Pass@1 on AIME 2025, outperforming the base model and even o1-preview!!
* Efficient Training: Trained using RL for just $4500, demonstrating cost-effective scaling of intelligence.
* Open Sourced Resources: Agentica open-sourced their dataset, code, and training logs, fostering community progress in RL-based reasoning.
Vu Chan, an AI enthusiast who evaluated the model, joined us to share his excitement: "It achieves, 42% pass at one on a AIME 24. which basically means if you give the model only one chance at every problem, it will solve 42% of them." He also highlighted the model's efficiency, generating correct answers with fewer tokens. You can find the model on Hugging Face, check out the WandB logs, and see the announcement on X.com.
ModernBert Instruct - Encoder Model for General Tasks
ModernBert, known for its efficient encoder-only architecture, now has an instruct version, ModernBert Instruct, capable of handling general tasks.
* Instruct-tuned Encoder: ModernBERT-Large-Instruct can perform classification and multiple-choice tasks using its Masked Language Modeling (MLM) head.
* Beats Qwen .5B: Outperforms Qwen .5B on MMLU and MMLU Pro benchmarks.
* Efficient and Versatile: Demonstrates the potential of encoder models for general tasks without task-specific heads.
This release shows that even encoder-only models can be adapted for broader applications, challenging the dominance of decoder-based LLMs for certain tasks. Check out the announcement on X.com.
Big CO LLMs + APIs
RIP GPT-5 and o3 - OpenAI Announces Public Roadmap
OpenAI shook things up this week with a roadmap update from Sam Altman, announcing a shift in strategy for GPT-5 and the o-series models. Get ready for GPT-4.5 (Orion) and a unified GPT-5 system!
* GPT-4.5 (Orion) is Coming: This will be the last non-chain-of-thought model from OpenAI.
* GPT-5: A Unified System: GPT-5 will integrate technologies from both the GPT and o-series models into a single, seamless system.
* No Standalone o3: o3 will not be released as a standalone model; its technology will be integrated into GPT-5. "We will no longer ship O3 as a standalone model," Sam Altman stated.
* Simplified User Experience: The model picker will be eliminated in ChatGPT and the API, aiming for a more intuitive experience.
* Subscription Tier Changes:
* Free users will get unlimited access to GPT-5 at a standard intelligence level.
* Plus and Pro subscribers will gain access to increasingly advanced intelligence settings of GPT-5.
* Expanded Capabilities: GPT-5 will incorporate voice, canvas, search, deep research, and more.
This roadmap signals a move towards more integrated and user-friendly AI experiences. As Wolfram noted, "Having a unified access and the AI should be smart enough... AI has, we need an AI to pick which AI to use." This seems to be OpenAI's direction. Read Sam Altman's full announcement on X.com.
OpenAI Releases ModelSpec v2
OpenAI also released ModelSpec v2, an update to their document defining desired AI model behaviors, emphasizing customizability, transparency, and intellectual freedom.
* Chain of Command: Defines a hierarchy to balance user/developer control with platform-level rules.
* Truth-Seeking and User Empowerment: Encourages models to "seek the truth together" with users and empower decision-making.
* Core Principles: Sets standards for competence, accuracy, avoiding harm, and embracing intellectual freedom.
* Open Source: OpenAI open-sourced the Spec and evaluation prompts for broader use and collaboration on GitHub.
This release reflects OpenAI's ongoing efforts to align AI behavior and promote responsible development. Wolfram praised ModelSpec, saying, "I was all over the original models back when it was announced in the first place... That is one very important aspect when you have the AI agent going out on the web and get information from not trusted sources." Explore ModelSpec v2 on the dedicated website.
VP Vance Speech at AI Summit in Paris - Deregulate and Dominate!
Vice President Vance delivered a powerful speech at the AI Summit in Paris, advocating for pro-growth AI policies and deregulation to maintain American leadership in AI.
* Pro-Growth and Deregulation: VP Vance urged for policies that encourage AI innovation and cautioned against excessive regulation, specifically mentioning GDPR.
* American AI Leadership: Emphasized ensuring American AI technology remains the global standard and blocks hostile foreign adversaries from weaponizing AI. "Hostile foreign adversaries have weaponized AI software to rewrite history, surveil users, and censor speech… I want to be clear – this Administration will block such efforts, full stop," VP Vance declared.
* Key Points:
* Ensure American AI leadership.
* Encourage pro-growth AI policies.
* Maintain AI's freedom from ideological bias.
* Prioritize a pro-worker approach to AI development.
* Safeguard American AI and chip technologies.
* Block hostile foreign adversaries' weaponization of AI.
Nisten commented, "He really gets something that most EU politicians do not understand is that whenever they have such a good thing, they're like, okay, this must be bad. And we must completely stop it." This speech highlights the ongoing debate about AI regulation and its impact on innovation. Read the full speech here.
Cerebras Powers Perplexity with Blazing Speed (1200 t/s!)
Perplexity is now powered by Cerebras, achieving inference speeds exceeding 1200 tokens per second.
* Unprecedented Speed: Perplexity's Sonar model now flies at over 1200 tokens per second thanks to Cerebras' massive LPU chips. "Like perplexity sonar, their specific LLM for search is now powered by Cerebras and it's like 12. 100 tokens per second. It's it matches Google now on speed," I noted on the show.
* Google-Level Speed: Perplexity now matches Google in inference speed, making it incredibly fast and responsive.
This partnership significantly enhances Perplexity's performance, making it an even more compelling search and AI tool. See Perplexity's announcement on X.com.
Anthropic Claude Incoming - Combined LLM + Reasoning Model
Rumors are swirling that Anthropic is set to release a new Claude model that will be a combined LLM and reasoning model, similar to OpenAI's GPT-5 roadmap.
* Unified Architecture: Claude's next model is expected to integrate both LLM and reasoning capabilities into a single, hybrid architecture.
* Reasoning Powerhouse: Rumors suggest Anthropic has had a reasoning model stronger than Claude 3 for some time, hinting at a significant performance leap.
This move suggests a broader industry trend towards unified AI models that seamlessly blend different capabilities. Stay tuned for official announcements from Anthropic.
Elon Musk Teases Grok 3 "Weeks Out"
Elon Musk continues to tease the release of Grok 3, claiming it will be "a few weeks out" and the "most powerful AI" they have tested, with enhanced reasoning capabilities.
* Grok 3 Hype: Elon Musk claims Grok 3 will be the most powerful AI X.ai has released, with a focus on reasoning.
* Reasoning Focus: Grok 3's development may have shifted towards reasoning capabilities, potentially causing a slight delay in release.
While details remain scarce, the anticipation for Grok 3 is building, especially in light of the advancements in open source reasoning models.
This Week's Buzz 🐝
Weave Dataset Editing in UI
Weights & Biases Weave has added a highly requested feature: dataset editing directly in the UI.
* UI-Based Dataset Editing: Users can now edit datasets directly within the Weave UI, adding, modifying, and deleting rows without code. "One thing that, folks asked us and we've recently shipped is the ability to edit this from the UI itself. So you don't have to have code," I explained.
* Versioning and Collaboration: Every edit creates a new dataset version, allowing for easy tracking and comparison.
* Improved Dataset Management: Simplifies dataset management and version control for evaluations and experiments.
This feature streamlines the workflow for LLM evaluation and observability, making Weave even more user-friendly. Try it out at wandb.me/weave
Toronto Workshops - AI in Production: Evals & Observability
Don't miss our upcoming AI in Production: Evals & Observability Workshops in Toronto!
* Two Dates: Sunday and Monday workshops in Toronto.
* Hands-on Learning: Learn to build and evaluate LLM-powered applications with robust observability.
* Expert Guidance: Led by yours truly, Alex Volkov, and featuring Nisten.
* Limited Spots: Registration is still open, but spots are filling up fast! Register for Sunday's workshop here and Monday's workshop here.
Join us to level up your LLM skills and network with the Toronto AI community!
Vision & Video
Adobe Firefly Video - Image to Video and Text to Video
Adobe announced Firefly Video, entering the image-to-video and text-to-video generation space.
* Video Generation: Firefly Video offers both image-to-video and text-to-video capabilities.
* Adobe Ecosystem: Integrates with Adobe's creative suite, providing a powerful tool for video creators.
This release marks Adobe's significant move into the rapidly evolving video generation landscape. Try Firefly Video here.
Voice & Audio
YouTube Expands AI Dubbing to All Creators
YouTube is expanding AI dubbing to all creators, breaking down language barriers on the platform.
* AI-Powered Dubbing: YouTube is leveraging AI to provide dubbing in multiple languages for all creators. "YouTube now expands. AI dubbing in languages to all creators, and that's super cool. So basically no language barriers anymore. AI dubbing is here," I announced.
* Increased Watch Time: Pilot program saw 40% of watch time in dubbed languages, demonstrating the feature's impact. "Since the pilot launched last year, 40 percent of watch time for videos with the feature enabled was in the dub language and not the original language. That's insane!" I highlighted.
* Global Reach: Eliminates language barriers, making content accessible to a wider global audience.
Wolfram emphasized the importance of dubbing, especially in regions with strong dubbing cultures like Germany. "Every movie that comes here is getting dubbed in high quality. And now AI is doing that on YouTube. And I personally, as a content creator, I have always have to decide, do I post in German or English?" This feature is poised to revolutionize content consumption on YouTube. Read more on X.com.
Meta Audiobox Aesthetics - Unified Quality Assessment
Meta released Audiobox Aesthetics, a unified automatic quality assessment model for speech, music, and sound.
* Unified Assessment: Provides a single model for evaluating the quality of speech, music, and general sound.
* Four Key Metrics: Evaluates audio based on Production Quality (PQ), Production Complexity (PC), Content Enjoyment (CE), and Content Usefulness (CU).
* Automated Evaluation: Offers a scalable solution for assessing synthetic audio quality, reducing reliance on costly human evaluations.
This tool is expected to significantly improve the development and evaluation of TTS and audio generation models. Access the Paper and Weights on GitHub.
Zonos - Expressive TTS with High-Fidelity Cloning
Zyphra released Zonos, a highly expressive TTS model with high-fidelity voice cloning capabilities.
* Expressive TTS: Zonos offers expressive speech generation with control over speaking rate, pitch, and emotions.
* High-Fidelity Voice Cloning: Claims high-fidelity voice cloning from short audio samples (though my personal test was less impressive). "My own voice clone sounded a little bit like me but not a lot. Ok at least for me, the cloning is really really bad," I admitted on the show.
* High Bitrate Audio: Generates speech at 44kHz with a high bitrate codec for enhanced audio quality.
* Open Source & API: Models are open source, with a commercial API available.
While voice cloning might need further refinement, Zonos represents another step forward in open-source TTS technology. Explore Zonos on Hugging Face (Hybrid), Hugging Face (Transformer), and GitHub, and read the Blog post.
Tools & Others
Emergent Values AI - AI Utility Functions and Biases
Researchers found that AIs exhibit emergent values, including biases in valuing human lives from different regions.
* Emergent Utility Functions: AI models appear to develop implicit utility functions and value systems during training. "Research finds that AI's have expected utility functions for people and other emergent values. And this is freaky," I summarized.
* Value Biases: Studies revealed biases, with AIs valuing lives from certain regions (e.g., Nigeria, Pakistan, India) higher than others (e.g., Italy, France, Germany, UK, US). "Nigerian people, valued as like eight us people. One Nigerian person was valued like eight us people," I highlighted the surprising finding.
* Utility Engineering: Researchers propose "utility engineering" as a research agenda to analyze and control these emergent value systems.
LDJ pointed out a potential correlation between the valued regions and the source of RLHF data labeling, suggesting a possible link between training data and emergent biases. While the study is still debated, it raises important questions about AI value alignment. Read the announcement on X.com and the Paper.
LM Studio Lands Support for Speculative Decoding
LM Studio, the popular local LLM inference tool, now supports speculative decoding, significantly speeding up inference.
* Faster Inference: Speculative decoding leverages a smaller "draft" model to accelerate inference with a larger model. "Speculative decoding finally landed in LM studio, which is dope folks. If you use LM studio, if you don't, you should," I exclaimed.
* Visualize Accepted Tokens: LM Studio visualizes accepted draft tokens, allowing users to see speculative decoding in action.
* Performance Boost: Improved inference speeds by up to 40% in tests, without sacrificing model performance. "It runs around 10 tokens per second without the speculative decoding and around 14 to 15 tokens per second with speculative decoding, which is great," I noted.
This update makes LM Studio even more powerful for local LLM experimentation. See the announcement on X.com.
Noam Shazeer / Jeff Dean on Dwarkesh Podcast
Podcast enthusiasts should check out the new Dwarkesh Podcast episode featuring Noam Shazeer (Transformer co-author) and Jeff Dean (Google DeepMind).
* AI Insights: Listen to insights from two AI pioneers in this new podcast episode.
Tune in to hear from these influential figures in the AI world. Find the announcement on X.com.
What a week, folks! From rogue AI analyzing my personal life to OpenAI shaking up the roadmap and tiny models conquering math, the AI world continues to deliver surprises. Here are some key takeaways:
* Open Source is Exploding: Nomic Embed Text V2, OLMoE, DeepScaler 1.5B, and ModernBERT Instruct are pushing the boundaries of what's possible with open, accessible models.
* Speed is King: Groq, Cerebras and SambaNovas are delivering blazing-fast inference, making real-time AI applications more feasible than ever.
* Reasoning is Evolving: DeepScaler 1.5B's success demonstrates the power of RL for even small models, and OpenAI and Anthropic are moving towards unified models with integrated reasoning.
* Privacy Matters: AllenAI's OLMoE highlights the growing importance of on-device AI for data privacy.
* The AI Landscape is Shifting: OpenAI's roadmap announcement signals a move towards simpler, more integrated AI experiences, while government officials are taking a stronger stance on AI policy.
Stay tuned to ThursdAI for the latest updates, and don't forget to subscribe to the newsletter for all the links and details! Next week, I'll be in New York, so expect a special edition of ThursdAI from the AI Engineer floor.
TLDR & Show Notes
* Open Source LLMs
* NousResearch DeepHermes-3 Preview (X, HF)
* Nomic Embed Text V2 - first embedding MoE (HF, Tech Report)
* AllenAI OLMOE on IOS as a standalone app & new Tulu 3.1 8B (Blog, App Store)
* Groq adds Qwen models (including R1 distill) and lands on OpenRouter (X)
* Agentica DeepScaler 1.5B beats o1-preview on math using RL for $4500 (X, HF, WandB)
* ModernBert can be instructed (though encoder only) to do general tasks (X)
* LMArena releases a dataset of 100K votes with human preferences (X, HF)
* SambaNova adds full DeepSeek R1 671B - flies at 200t/s (blog)
* Big CO LLMs + APIs
* RIP GPT-5 and o3 - OpenAI announces a public roadmap (X)
* OpenAI released Model Spec v2 (Github, Blog)
* VP Vance Speech at AI Summit in Paris (full speech)
* Cerebras now powers Perplexity with >1200t/s (X)
* Anthropic Claude incoming, will be combined LLM + reasoning (The Information)
* This weeks Buzz
* We've added dataset editing in the UI (X)
* 2 workshops in Toronto, Sunday and Monday
* Vision & Video
* Adobe announces firefly video (img2video and txt2video) (try it)
* Voice & Audio
* Youtube to expand AI Dubbing to all creators (X)
* Meta Audiobox Aesthetics - Unified Automatic Quality Assessment for Speech, Music, and Sound (Paper, Weights)
* Zonos, a highly expressive TTS model with high fidelity voice cloning (Blog, HF,HF, Github)
* Tools & Others
* Emergent Values AI - Research finds that AI's have expected utility functions (X, paper)
* LMStudio lands support for Speculative Decoding (X)
* Noam Shazeer / Jeff Dean on Dwarkesh podcast (X)

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
- Ouvir Ouvir novamente Continuar A reproduzir…
- Ouvir depois Ouvir depois
📆 ThursdAI - Feb 6 - OpenAI DeepResearch is your personal PHD scientist, o3-mini & Gemini 2.0, OmniHuman-1 breaks reality & more AI news
7 fev· ThursdAI - The top AI news from the past week
What's up friends, Alex here, back with another ThursdAI hot off the presses.
Hold onto your hats because this week was another whirlwind of AI breakthroughs, mind-blowing demos, and straight-up game-changers. We dove deep into OpenAI's new "Deep Research" agent – and let me tell you, it's not just hype, it's legitimately revolutionary. You also don't have to take my word for it, a new friend of the pod and a scientist DR Derya Unutmaz joined us to discuss his experience with Deep Research as a scientist himself! You don't want to miss this conversation!
We also unpack Google's Gemini 2.0 release, including the blazing-fast Flash Lite model. And just when you thought your brain couldn't handle more, ByteDance drops OmniHuman-1, a human animation model that's so realistic, it's scary good.
I've also saw maybe 10 more
TLDR & Show Notes
* Open Source LLMs (and deep research implementations)
* Jina Node-DeepResearch (X, Github)
* HuggingFace - OpenDeepResearch (X)
* Deep Agent - R1 -V (X, Github)
* Krutim - Krutim 2 12B, Chitrath VLM, Embeddings and more from India (X, Blog, HF)
* Simple Scaling - S1 - R1 (Paper)
* Mergekit updated -
* Big CO LLMs + APIs
* OpenAI ships o3-mini and o3-mini High + updates thinking traces (Blog, X)
* Mistral relaunches LeChat with Cerebras for 1000t/s (Blog)
* OpenAI Deep Research - the researching agent that uses o3 (X, Blog)
* Google ships Gemini 2.0 Pro, Gemini 2.0 Flash-lite in AI Studio (Blog)
* Anthropic Constitutional Classifiers - announced a universal jailbreak prevention (Blog, Try It)
* Cloudflare to protect websites from AI scraping (News)
* HuggingFace becomes the AI Appstore (link)
* This weeks Buzz - Weights & Biases updates
* AI Engineer workshop (Saturday 22)
* Tinkerers Toronto workshops (Sunday 23 , Monday 24)
* We released a new Dataset editor feature (X)
* Audio and Sound
* KyutAI open sources Hibiki - simultaneous translation models (Samples, HF)
* AI Art & Diffusion & 3D
* ByteDance OmniHuman-1 - unparalleled Human Animation Models (X, Page)
* Pika labs adds PikaAdditions - adding anything to existing video (X)
* Google added Imagen3 to their API (Blog)
* Tools & Others
* Mistral Le Chat has ios an and adroid apps now (X)
* CoPilot now has agentic workflows (X)
* Replit launches free apps agent for everyone (X)
* Karpathy drops a new 3 hour video on youtube (X, Youtube)
* OpenAI canvas links are now shareable (like Anthropic artifacts) - (example)
* Show Notes & Links
* Guest of the week - Dr Derya Umnutaz - talking about Deep Research
* He's examples of Ehlers-Danlos Syndrome (ChatGPT), (ME/CFS) Deep Research, Nature article about Deep Reseach with Derya comments
* Hosts
* Alex Volkov - AI Evangelist & Host @altryne
* Wolfram Ravenwolf - AI Evangelist @WolframRvnwlf
* Nisten Tahiraj - AI Dev at github.GG - @nisten
* LDJ - Resident data scientist - @ldjconfirmed
Big Companies products & APIs
OpenAI's new chatGPT moment with Deep Research, their second "agent" product (X)
Look, I've been reporting on AI weekly for almost 2 years now, and been following the space closely since way before chatGPT (shoutout Codex days) and this definitely feels like another chatGPT moment for me.
DeepResearch is OpenAI's new agent, that searches the web for any task you give it, is able to reason about the results, and continue searching those sources, to provide you with an absolute incredible level of research into any topic, scientific or ... the best taqueria in another country.
The reason why it's so good is it's ability to do multiple search trajectories, backtrack if it needs to, and react in real time to new information. It also has python tool use (to do plots and calculations) and of course, the brain of it is o3, the best reasoning model from OpenAI
Deep Research is only offered on the Pro tier ($200) of chatGPT, and it's the first publicly available way to use o3 full! and boy, does it deliver!
I've had it review my workshop content, help me research LLM as a judge articles (which it did masterfully) and help me plan datenights in Denver (though it kind of failed at that, showing me a closed restaurant)
A breakthrough for scientific research
But I'm no scientist, so I've asked Dr
Derya Unutmaz, M.D.
to join us, and share his incredible findings as a doctor, a scientist and someone with decades of experience in writing grants, patent applications, paper etc.
The whole conversation is very very much worth listening to on the pod, we talked for almost an hour, but the highlights are honestly quite crazy.
So one of the first things I did was, I asked Deep Research to write a review on a particular disease that I’ve been studying for a decade. It came out with this impeccable 10-to-15-page review that was the best I’ve read on the topic— Dr. Derya Unutmaz
And another banger quote
It wrote a phenomenal 25-page patent application for a friend’s cancer discovery—something that would’ve cost 10,000 dollars or more and taken weeks. I couldn’t believe it. Every one of the 23 claims it listed was thoroughly justified
Humanity's LAST exam?
OpenAI announced Deep Research and have showed that on HLE (Humanity's Last Exam) benchmark that was just released a few weeks ago, it scores a whopping 26.6 percent! When HLE was released (our coverage here) all the way back at ... checks notes... January 23 or this year! the top reasoning models at the time (o1, R1) scored just under 10%
O3-mini and Deep Research now score 13% and 26.6% respectively, which means both that AI is advancing like crazy, but also.. that maybe calling this "last exam" was a bit premature? 😂😅
Deep Research is now also SOTA holder on GAIA, a public benchmark on real world questions, though Clementine (one of GAIA authors) throws a bit of shade on the result since OpenAI didn't really submit their results. Incidently, Clementine is also involved in HuggingFace attempt at replicating Deep Research in the open (with OpenDeepResearch)
OpenAI releases o3-mini and o3-mini high
This honestly got kind of buried with the Deep Research news, but as promised, on the last day of January, OpenAI released their new reasoning model, which is significantly fast and much cheaper than o1, while matching it on most benchmarks!
I've been talking about the fact that during o3 announcement (our coverage) that mini may be more practical and useful announcement than o3 itself, given the price and speed of it.
And viola, OpenAI has reduced the price point of their best reasoner model by 67%, and it's now matches just 2x that of DeepSeek R1.
Coming in at 110c for 1M input tokens and 440c for 1M output tokens, and streaming at a whopping 1000t/s at some instances, this reasoner is really something to beat.
Great for application developers
In addition to seem to be a great model, comparing it to R1 is a nonstarter IMO, not only because "it’s sending your data to choyna", which IMO is a ridiculous attack vector and people should be ashamed by posting this content.
o3-mini supports all of the nice API things that OpenAI has, like tool use, structured outputs, developer messages and streaming. The ability to set the reasoning effort is also interesting for applications!
Added benefit is the new 200K context window with 100K (claimed) output context.
It's also really really fast, while R1 availability grows, as it gets hosted on more and more US based providers, none of them are offering the full context window at these token speeds.
o3-mini-high?!
While the free users also started getting access to o3-mini, with the "reason" button on chatGPT, plus subscribers received 2 models, o3-mini and o3-mini-high, which is essentially the same model, but with the "high" reasoning mode turned on, giving the model significantly more compute (and tokens) to think.
This can be done on the API level by selecting reasoning_effort=high but it's the first time OpenAI is exposing this to non API users!
One highlight for me is, just how MANY tokens o3-mini high things through. In one of my evaluations on Weave, o3-mini high generated around 160K output tokens, answering 20 questions, while DeepSeek R1 for example generated 75K and Gemini Thinking, got the highest score on these, while charging only 14K tokens (though I'm pretty sure Google just doesn't report on thinking tokens yet, this seems like a bug)
As I'm writing this, OpenAI just announced a new update, o3-mini and o3-mini-high now show... "updated" reasoning traces!
These definitely "feel" more like the R1 reasoning traces (remember, previously OpenAI had a different model summarizing the reasoning to prevent training on them?) but they are not really the RAW ones (confirmed)
Google ships Gemini 2.0 Pro, Gemini 2.0 Flash-lite in AI Studio (X, Blog)
Congrats to our friends at Google for 2.0 👏 Google finally put all the experimental models under one 2.0 umbrella, giving us Gemini 2.0, Gemini 2.0 Flash and a new model!
They also introduced Gemini 2.0 Flash-lite, a crazy fast and cheap model that performs similarly to Flash 1.5. The rate limits on Flash-lite are twice as high as the regular Flash, making it incredibly useful for real-time applications.
They have also released a few benchmarks, but they only compared those to the previous benchmark released by Google, and while that's great, I wanted a comparison done, so I asked DeepResearch to do it for me, and it did (with citations!)
Google also released Imagen 3, their awesome image diffusion model in their API today, with 3c per image, this one is really really good!
Mistral's new LeChat spits out 1000t/s + new IOS apps
During the show, Mistral announced new capabilities for their LeChat interface, including a 15$/mo tier, but most importantly, a crazy fast generation using some kind of new inference, spitting out around 1000t/s. (Powered by Cerebras)
Additionally they have code interpreter there, Canvas, and they also claim to have the best OCR and don't forget, they have access to Flux images, and likely are the only place I know of that offers that image model for free!
Finally, they've released native mobile apps! (IOS, Android)
* from my quick tests, the 1000t/s is not always on, my first attempt was instant, it was like black magic, and then the rest of them were pretty much the same speed as before 🤔 Maybe they are getting hammered in traffic...
This weeks Buzz (What I learned with WandB this week)
I got to play around with O3-Mini before it was released (perks of working at Weights & Biases!), and I used Weave, our observability and evaluation framework, to analyze its performance. The results were… interesting.
* Latency and Token Count: O3-Mini High's latency was six times longer than O3-Mini Low on a simple reasoning benchmark (92 seconds vs. 6 seconds). But here's the kicker: it didn't even answer more questions correctly! And the token count? O3-Mini High used half a million tokens to answer 20 questions three times. That's… a lot.
* Weave Leaderboards: Nisten got super excited about using Weave's leaderboard feature to benchmark models. He realized it could solve a real problem in the open-source community – providing a verifiable and transparent way to share benchmark results. (really, we didnt' rehearse this!)
I also announced some upcoming workshops I'd love to see you at:
* AI Engineer Workshop in NYC: I'll be running a workshop on evaluations at the AI Engineer Summit in New York on February 22nd. Come say hi and learn about evals!
* AI Tinkerers Workshops in Toronto: I'll also be doing two workshops with AI Tinkerers in Toronto on February 23rd and 24th.
ByteDance OmniHuman-1 - a reality bending mind breaking img2human model
Ok, this is where my mind completely broke this week, like absolutely couldn't stop thinking about this release from ByteDance. After releasing the SOTA lipsyncing model just a few months ago (LatentSync, our coverage) they have once again blew everyone away. This time with a img2avatar model that's unlike anything we've ever seen.
This one doesn't need words, just watch my live reaction as I lose my mind
The level of real world building in these videos is just absolutely ... too much? The piano keys moving, there's a video of a woman speaking in the microphone, and behind her, the window has reflections of cars and people moving!
The thing that most blew me away upon review was the Niki Glazer video, with shiny dress and the model almost perfectly replicating the right sources of light.
Just absolute sorcery!
The authors confirmed that they don't have any immediate plans to release this as a model or even a product, but given the speed of open source, we'll get this within a year for sure! Get ready
Open Source LLMs (and deep research implementations)
This week wasn't massive for open-source releases in terms of entirely new models, but the ripple effects of DeepSeek's R1 are still being felt. The community is buzzing with attempts to replicate and build upon its groundbreaking reasoning capabilities. It feels like everyone is scrambling to figure out the "secret sauce" behind R1's "aha moment," and we're seeing some fascinating results.
Jina Node-DeepResearch and HuggingFace OpenDeepResearch
The community wasted no time trying to replicate OpenAI's Deep Research agent.
* Jina AI released "Node-DeepResearch" (X, Github), claiming it follows the "query, search, read, reason, repeat" formula. As I mentioned on the show, "I believe that they're wrong" about it being just a simple loop. O3 is likely a fine-tuned model, but still, it's awesome to see the open-source community tackling this so quickly!
* Hugging Face also announced "OpenDeepResearch" (X), aiming to create a truly open research agent. Clementine Fourrier, one of the authors behind the GAIA benchmark (which measures research agent capabilities), is involved, so this is definitely one to watch.
Deep Agent - R1 -V: These folks claim to have replicated DeepSeek R1's "aha moment" – where the model realizes its own mistakes and rethinks its approach – for just $3! (X, Github)
As I said on the show, "It's crazy, right? Nothing costs $3 anymore. Like it's half a coffee in Starbucks." They even claim you can witness this "aha moment" in a VLM. Open source is moving fast.
Krutim - Krutim 2 12B, Chitrath VLM, Embeddings and more from India: This Indian AI lab released a whole suite of models, including an improved LLM (Krutim 2), a VLM (Chitrarth 1), a speech-language model (Dhwani 1), an embedding model (Vyakhyarth 1), and a translation model (Krutrim Translate 1). (X, Blog, HF) They even developed a benchmark called "BharatBench" to evaluate Indic AI performance.
However, the community was quick to point out some… issues. As Harveen Singh Chadha pointed out on X, it seems like they blatantly copied IndicTrans, an MIT-licensed model, without even mentioning it. Not cool, Krutim. Not cool.
AceCoder: This project focuses on using reinforcement learning (RL) to improve code models. (X) They claim to have created a pipeline to automatically generate high-quality, verifiable code training data.
They trained a reward model (AceCode-RM) that significantly boosts the performance of Llama-3.1 and Qwen2.5-coder-7B. They even claim you can skip SFT training for code models by using just 80 steps of R1-style training!
Simple Scaling - S1 - R1: This paper (Paper) showcases the power of quality over quantity. They fine-tuned Qwen2.5-32B-Instruct on just 1,000 carefully curated reasoning examples and matched the performance of o1-preview!
They also introduced a technique called "budget forcing," allowing the model to control its test-time compute and improve performance. As I mentioned, Niklas Mengenhoff, who worked at Allen and was previously on the show, is involved. This is one to really pay attention to – it shows that you don't need massive datasets to achieve impressive reasoning capabilities.
Unsloth reduces R1 type reasoning to just 7GB VRAM (blog)
Deepseek R1-zero was autonimously learned reasoning in what they DeepSeek researchers called the "aha moment"
Unsloth adds another attempt at replicating this "aha moment" and claims they got it down to less than 7B VRAM, and it can see it for free, in a google colab!
This magic could be recreated through GRPO, a RL algorithm that optimizes responses efficiently without requiring a value function, unlike Proximal Policy Optimization (PPO) which relies on a value function
How it works:1. The model generates groups of responses.2. Each response is scored based on correctness or another metric created by some set reward function rather than an LLM reward model.3 . The average score of the group is computed.4. Each response's score is compared to the group average.5. The model is reinforced to favor higher-scoring responses.
Tools
A few new and interesting tools were released this week as well:
* Replit rebuilt and released their replit agents in an IOS app and released it free for many users. It can now build mini apps for you on the fly! (Replit)
* Mistral has ios / android apps with the new release of LeChat (X)
* Molly Cantillon released RPLY, which sits on your mac, and drafts replies to your messages. I installed it during writing this newsletter, and I did not expect it to hit this hard, it reviewed and summarized my texting patterns to "sound like me" and the models sit on device as well. Very very well crafted tool and the best thing it runs models on device if you want!
* Github Copilot announced agentic workflows and next line editing, which are cursor features. To try them out you have to download VSCode insiders. They also added Gemini 2.0 (Blog)
The AI field moves SO fast, I had to update the content of the newsletter around 5 times while writing it as new things kept getting released!
This was a Banger week that started with o3-mini and deep research, continued with Gemini 2.0 and OmniHuman and "ended" with Mistral x Cerebras, Github copilot agents, o3-mini updated COT reasoning traces and a bunch more!
AI doesn't stop, and we're here weekly to cover all of this, and give you guys the highlights, but also go deep!
Really appreciate Derya's appearance on the show this week, please give him a follow and see you guys next week!

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
- Ouvir Ouvir novamente Continuar A reproduzir…
- Ouvir depois Ouvir depois
📆 ThursdAI - Jan 30 - DeepSeek vs. Nasdaq, R1 everywhere, Qwen Max & Video, Open Source SUNO, Goose agents & more AI news
30 jan· ThursdAI - The top AI news from the past week
Hey folks, Alex here 👋
It’s official—grandmas (and the entire stock market) now know about DeepSeek. If you’ve been living under an AI rock, DeepSeek’s new R1 model just set the world on fire, rattling Wall Street (causing the biggest monetary loss for any company, ever!) and rocketing to #1 on the iOS App Store. This week’s ThursdAI show took us on a deep (pun intended) dive into the dizzying whirlwind of open-source AI breakthroughs, agentic mayhem, and big-company cat-and-mouse announcements. Grab your coffee (or your winter survival kit if you’re in Canada), because in true ThursdAI fashion, we’ve got at least a dozen bombshells to cover—everything from brand-new Mistral to next-gen vision models, new voice synthesis wonders, and big moves from Meta and OpenAI.
We’re also talking “reasoning mania,” as the entire industry scrambles to replicate, dethrone, or ride the coattails of the new open-source champion, R1. So buckle up—because if the last few days are any indication, 2025 is officially the Year of Reasoning (and quite possibly, the Year of Agents, or both!)
Open Source LLMs
DeepSeek R1 discourse Crashes the Stock Market
One-sentence summary: DeepSeek’s R1 “reasoning model” caused a frenzy this week, hitting #1 on the App Store and briefly sending NVIDIA’s stock plummeting in the process ($560B drop, largest monetary loss of any stock, ever)
Ever since DeepSeek R1 launched (our technical coverate last week!), the buzz has been impossible to ignore—everyone from your mom to your local barista has heard the name. The speculation? DeepSeek’s new architecture apparently only cost $5.5 million to train, fueling the notion that high-level AI might be cheaper than Big Tech claims. Suddenly, people wondered if GPU manufacturers like NVIDIA might see shrinking demand, and the stock indeed took a short-lived 17% tumble. On the show, I joked, “My mom knows about DeepSeek—your grandma probably knows about it, too,” underscoring just how mainstream the hype has become.
Not everyone is convinced the cost claims are accurate. Even Dario Amodei of Anthropic weighed in with a blog post arguing that DeepSeek’s success increases the case for stricter AI export controls.
Public Reactions
* Dario Amodei’s blogIn “On DeepSeek and Export Controls,” Amodei argues that DeepSeek’s efficient scaling exemplifies why democratic nations need to maintain a strategic leadership edge—and enforce export controls on advanced AI chips. He sees Chinese breakthroughs as proof that AI competition is global and intense.
* OpenAI Distillation EvidenceOpenAI mentioned it found “distillation traces” of GPT-4 inside R1’s training data. Hypocrisy or fair game? On ThursdAI, the panel mused that “everyone trains on everything,” so perhaps it’s a moot point.
* Microsoft ReactionMicrosoft wasted no time, swiftly adding DeepSeek to Azure—further proof that corporations want to harness R1’s reasoning power, no matter where it originated.
* Government reactedEven officials in the government, David Sacks, US incoming AI & Crypto czar, discussed the fact that DeepSeek did "distillation" using the term somewhat incorrectly, and presidet Trump was asked about it.
* API OutagesDeepSeek’s own API has gone in and out this week, apparently hammered by demand (and possibly DDoS attacks). Meanwhile, GPU clouds like Groq are showing up to accelerate R1 at 300 tokens/second, for those who must have it right now.
We've seen so many bad takes on the topic, from seething cope takes, to just gross misunderstandings from gov officials confusing the ios App with the OSS models, folks throwing conspiracy theories into the mix, claiming that $5.5M sum was a PsyOp. The fact of the matter is, DeepSeek R1 is an incredible model, and is now powering (just a week later), multiple products (more on this below) and experiences already, while pushing everyone else to compete (and give us reasoning models!)
Open Thoughts Reasoning Dataset
One-sentence summary: A community-led effort, “Open Thoughts,” released a new large-scale dataset (OpenThoughts-114k) of chain-of-thought reasoning data, fueling the open-source drive toward better reasoning models.
Worried about having enough labeled “thinking” steps to train your own reasoner? Fear not. The OpenThoughts-114k dataset aggregates chain-of-thought prompts and responses—114,000 of them—for building or fine-tuning reasoning LLMs. It’s now on Hugging Face for your experimentation pleasure. The ThursdAI panel pointed out how crucial these large, openly available reasoning datasets are. As Wolfram put it, “We can’t rely on the big labs alone. More open data means more replicable breakouts like DeepSeek R1.”
Mistral Small 2501 (24B)
One-sentence summary: Mistral AI returns to the open-source spotlight with a 24B model that fits on a single 4090, scoring over 81% on MMLU while under Apache 2.0.
Long rumored to be “going more closed,” Mistral AI re-emerged this week with Mistral-Small-24B-Instruct-2501—an Apache 2.0 licensed LLM that runs easily on a 32GB VRAM GPU. That 81% MMLU accuracy is no joke, putting it well above many 30B–70B competitor models. It was described as “the perfect size for local inference and a real sweet spot,” noting that for many tasks, 24B is “just big enough but not painfully heavy.” Mistral also finally started comparing themselves to Qwen 2.5 in official benchmarks—a big shift from their earlier reluctance, which we applaud!
Berkeley TinyZero & RAGEN (R1 Replications)
One-sentence summary: Two separate projects (TinyZero and RAGEN) replicated DeepSeek R1-zero’s reinforcement learning approach, showing you can get “aha” reasoning moments with minimal compute.
If you were wondering whether R1 is replicable: yes, it is. Berkeley’s TinyZero claims to have reproduced the core R1-zero behaviors for $30 using a small 3B model. Meanwhile, the RAGEN project aims to unify RL + LLM + Agents with a minimal codebase. While neither replication is at R1-level performance, they demonstrate how quickly the open-source community pounces on new methods. “We’re now seeing those same ‘reasoning sparks’ in smaller reproductions,” said Nisten. “That’s huge.”
Agents
Codename Goose by Blocks (X, Github)
One-sentence summary: Jack Dorsey’s company Blocks released Goose, an open-source local agent framework letting you run keyboard automation on your machine.
Ever wanted your AI to press keys and move your mouse in real time? Goose does exactly that with AppleScript, memory extensions, and a fresh approach to “local autonomy.” On the show, I tried Goose, but found it occasionally “went rogue, trying to delete my WhatsApp chats.” Security concerns aside, Goose is significant: it’s an open-source playground for agent-building. The plugin system includes integration with Git, Figma, a knowledge graph, and more. If nothing else, Goose underscores how hot “agentic” frameworks are in 2025.
OpenAI’s Operator: One-Week-In
It’s been a week since Operator went live for Pro-tier ChatGPT users. “It’s the first agent that can run for multiple minutes without bugging me every single second,”. Yet it’s still far from perfect—captchas, login blocks, and repeated confirmations hamper tasks. The potential, though, is enormous: “I asked Operator to gather my X.com bookmarks and generate a summary. It actually tried,” I shared, “but it got stuck on three links and needed constant nudges.” Simon Willison added that it’s “a neat tech demo” but not quite a productivity boon yet. Next steps? Possibly letting the brand-new reasoning models (like O1 Pro Reasoning) do the chain-of-thought under the hood.
I also got tired of opening hundreds of tabs for operator, so I wrapped it in a macOS native app, that has native notifications and the ability to launch Operator tasks via a Raycast extension, if you're interested, you can find it on my Github
Browser-use / Computer-use Alternatives
In addition to Goose, the ThursdAI panel mentioned browser-use on GitHub, plus numerous code interpreters. So far, none blow minds in reliability. But 2025 is evidently “the year of agents.” If you’re itching to offload your browsing or file editing to an AI agent, expect to tinker, troubleshoot, and yes, babysit. The show consensus? “It’s not about whether agents are coming, it’s about how soon they’ll become truly robust,” said Wolfram.
Big CO LLMs + APIs
Alibaba Qwen2.5-Max (& Hidden Video Model) (Try It)
One-sentence summary: Alibaba’s Qwen2.5-Max stands toe-to-toe with GPT-4 on some tasks, while also quietly rolling out video-generation features.
While Western media fixates on DeepSeek, Alibaba’s Qwen team quietly dropped the Qwen2.5-Max MoE model. It clocks in at 69% on MMLU-Pro—beating some OpenAI or Google offerings—and comes with a 1-million-token context window. And guess what? The official Chat interface apparently does hidden video generation, though Alibaba hasn’t publicized it in the English internet.
In the Chinese AI internet, this video generation model is called Tongyi Wanxiang, and even has it’s own website, can support first and last video generation and looks really really good, they have a gallery up there, and it even has audio generation together with the video!
This one was an img2video, but the movements are really natural!
Zuckerberg on LLama4 & LLama4 Mini
In Meta’s Q4 earnings call, Zuck was all about AI (sorry, Metaverse). He declared that LLama4 is in advanced training, with a smaller “LLama4 Mini” finishing pre-training. More importantly, a “reasoning model” is in the works, presumably influenced by the mania around R1. Some employees had apparently posted on Blind about “Why are we paying billions for training if DeepSeek did it for $5 million?” so the official line is that Meta invests heavily for top-tier scale.
Zuck also doubled down on saying "Glasses are the perfect form factor for AI" , to which I somewhat agree, I love my Meta Raybans, I just wished they were integrated into the ios more.
He also boasted about their HUGE datacenters, called Mesa, spanning the size of Manhattan, being built for the next step of AI.
(Nearly) Announced: O3-Mini
Right before the ThursdAI broadcast, rumors swirled that OpenAI might reveal O3-Mini. It’s presumably GPT-4’s “little cousin” with a fraction of the cost. Then…silence. Sam Altman also mentioned they would be bringing o3-mini by end of January, but maybe the R1 crazyness made them keep working on it and training it a bit more? 🤔
In any case, we'll cover it when it launches.
This Week’s Buzz
We're still the #1 spot on Swe-bench verified with W&B programmer, and our CTO, Shawn Lewis, chatted with friends of the pod Swyx and Alessio about it! (give it a listen)
We have two upcoming events:
* AI.engineer in New York (Feb 20–22). Weights & Biases is sponsoring, and I will broadcast ThursdAI live from the summit. If you snagged a ticket, come say hi—there might be a cameo from the “Chef.”
* Toronto Tinkerer Workshops (late February) in the University of Toronto. The Canadian AI scene is hot, so watch out for sign-ups (will add them to the show next week)
Weights & Biases also teased more features for LLM observability (Weave) and reminded folks of their new suite of evaluation tools. “If you want to know if your AI is actually better, you do evals,” Alex insisted. For more details, check out wandb.me/weave or tune into the next ThursdAI.
Vision & Video
DeepSeek - Janus Pro - multimodal understanding and image gen unified (1.5B & 7B)
One-sentence summary: Alongside R1, DeepSeek also released Janus Pro, a unified model for image understanding and generation (like GPT-4’s rumored image abilities).
DeepSeek apparently never sleeps. Janus Pro is MIT-licensed, 7B parameters, and can both parse images (SigLIP) and generate them (LlamaGen). The model outperforms DALL·E 3 and SDXL! on some internal benchmarks—though at a modest 384×384 resolution.
NVIDIA’s Eagle 2 Redux
One-sentence summary: NVIDIA re-released the Eagle 2 vision-language model with 4K resolution support, after mysteriously yanking it a week ago.
Eagle 2 is back, boasting multi-expert architecture, 16k context, and high-res video analysis. Rumor says it competes with big 70B param vision models at only 9B. But it’s overshadowed by Qwen2.5-VL (below). Some suspect NVIDIA is aiming to outdo Meta’s open-source hold on vision—just in time to keep GPU demand strong.
Qwen 2.5 VL - SOTA oss vision model is here
One-sentence summary: Alibaba’s Qwen 2.5 VL model claims state-of-the-art in open-source vision, including 1-hour video comprehension and “object grounding.”
The Qwen team didn’t hold back: “It’s the final boss for vision,” joked Nisten. Qwen 2.5 VL uses advanced temporal modeling for video and can handle complicated tasks like OCR or multi-object bounding boxes.
Featuring advances in precise object localization, video temporal understanding and agentic capabilities for computer, this is going to be the model to beat!
Voice & Audio
YuE 7B (Open “Suno”)
Ever dream of building the next pop star from your code editor? YuE 7B is your ticket. This model, now under Apache 2.0, supports chain-of-thought creation of structured songs, multi-lingual lyrics, and references. It’s slow to infer, but it’s arguably the best open music generator so far in the open source
What's more, they have changed the license to apache 2.0 just before we went live, so you can use YuE everywhere!
Refusion Fuzz
Refusion, a new competitor to paid audio models like Suno and Udio, launched “Fuzz,” offering free music generation online until GPU meltdown.
If you want to dabble in “prompt to jam track” without paying, check out Refusion Fuzz. Will it match the emotional nuance of premium services like 11 Labs or Hauio? Possibly not. But hey, free is free.
Tools (that have integrated R1)
Perplexity with R1
In the perplexity.ai chat, you can choose “Pro with R1” if you pay for it, harnessing R1’s improved reasoning to parse results. For some, it’s a major upgrade to “search-based question answering.” Others prefer it to paying for O1 or GPT-4.
I always check Perplexity if it knows what the latest episode of ThursdAI was, and it's the first time it did a very good summary! I legit used it to research the show this week! It's really something.
Meanwhile, Exa.ai also integrated a “DeepSeek Chat” for your agent-based workflows. Like it or not, R1 is everywhere.
Krea.ai with DeepSeek
Our friends at Krea, an AI art tool aggregator, also hopped on the R1 bandwagon for chat-based image searching or generative tasks.
Conclusion
Key Takeaways
* DeepSeek’s R1 has massive cultural reach, from #1 apps to spooking the stock market.
* Reasoning mania is upon us—everyone from Mistral to Meta wants a piece of the logic-savvy LLM pie.
* Agentic frameworks like Goose, Operator, and browser-use are proliferating, though they’re still baby-stepping through reliability issues.
* Vision and audio get major open-source love, with Janus Pro, Qwen 2.5 VL, YuE 7B, and more reshaping multimodality.
* Big Tech (Meta, Alibaba, OpenAI) is forging ahead with monster models, multi-billion-dollar projects, and cross-country expansions in search of the best reasoning approaches.
At this point, it’s not even about where the next big model drop comes from; it’s about how quickly the entire ecosystem can adopt (or replicate) that new methodology.
Stay tuned for next week’s ThursdAI, where we’ll hopefully see new updates from OpenAI (maybe O3-Mini?), plus the ongoing race for best agent. Also, catch us at AI.engineer in NYC if you want to talk shop or share your own open-source success stories. Until then, keep calm and carry on training.
TLDR
* Open Source LLMs
* DeepSeek Crashes the Stock Market: Did $5.5M training or hype do it?
* Open Thoughts Reasoning Dataset OpenThoughts-114k (X, HF)
* Mistral Small 2501 (24B, Apache 2.0) (HF)
* Berkeley TinyZero & RAGEN (R1-Zero Replications) (Github, WANDB)
* Allen Institute - Tulu 405B (Blog, HF)
* Agents
* Goose by Blocks (local agent framework) - (X, Github)
* Operator (OpenAI) – One-Week-In (X)
* Browser-use - oss version of Operator (Github)
* Big CO LLMs + APIs
* Alibaba Qwen2.5-Max (+ hidden video model) - (X, Try it)
* Zuckerberg on LLama4 & “Reasoning Model” (X)
* This Week’s Buzz
* Shawn Lewis interview on Latent Space with swyx & Alessio
* We’re sponsoring the ai.engineer upcoming summit in NY (Feb 19-22), come say hi!
* After that, we’ll host 2 workshops with AI Tinkerers Toronto (Feb 23-24), make sure you’re signed up to Toronto Tinkerers to receive the invite (we were sold out quick last time!)
* Vision & Video
* DeepSeek Janus Pro - 1.5B and 7B (Github, Try It)
* NVIDIA Eagle 2 (Paper, Model, Demo)
* Alibaba Qwen 2.5 VL (Project, HF, Github, Try It)
* Voice & Audio
* Yue 7B (Open Suno) - (Demo, HF, Github)
* Refusion Fuzz (free for now)
* Tools
* Perplexity with R1 (choose Pro with R1)
* Exa integrated R1 for free (demo)
* Participants
* Alex Volkov (@altryne)
* Wolfram Ravenwolf (@WolframRvnwlf)
* Nisten Tahiraj (@nisten )
* LDJ (@ldjOfficial)
* Simon Willison (@simonw)
* W&B Weave (@weave_wb)

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
- Ouvir Ouvir novamente Continuar A reproduzir…
- Ouvir depois Ouvir depois
📆 ThursdAI - Jan 23, 2025 - 🔥 DeepSeek R1 is HERE, OpenAI Operator Agent, $500B AI manhattan project, ByteDance UI-Tars, new Gemini Thinker & more AI news
24 jan· ThursdAI - The top AI news from the past week
What a week, folks, what a week! Buckle up, because ThursdAI just dropped, and this one's a doozy. We're talking seismic shifts in the open source world, a potential game-changer from DeepSeek AI that's got everyone buzzing, and oh yeah, just a casual $500 BILLION infrastructure project announcement. Plus, OpenAI finally pulled the trigger on "Operator," their agentic browser thingy – though getting it to actually operate proved to be a bit of a live show adventure, as you'll hear.
This week felt like one of those pivotal moments in AI, a real before-and-after kind of thing. DeepSeek's R1 hit the open source scene like a supernova, and suddenly, top-tier reasoning power is within reach for anyone with a Mac and a dream. And then there's OpenAI's Operator, promising to finally bridge the gap between chat and action. Did it live up to the hype? Well, let's just say things got interesting.
As I’m writing this, White House just published that an Executive Order on AI was just signed and published as well, what a WEEK.
Open Source AI Goes Nuclear: DeepSeek R1 is HERE!
Hold onto your hats, open source AI just went supernova! This week, the Chinese Whale Bros – DeepSeek AI, that quant trading firm turned AI powerhouse – dropped a bomb on the community in the best way possible: R1, their reasoning model, is now open source under the MIT license! As I said on the show, "Open source AI has never been as hot as this week."
This isn't just a model, folks. DeepSeek unleashed a whole arsenal: two full-fat R1 models (DeepSeek R1 and DeepSeek R1-Zero), and a whopping six distilled finetunes based on Qwen (1.5B, 7B, 14B, and 32B) and Llama (8B, 72B).
One stat that blew my mind, and Nisten's for that matter, is that DeepSeek-R1-Distill-Qwen-1.5B, the tiny 1.5 billion parameter model, is outperforming GPT-4o and Claude-3.5-Sonnet on math benchmarks! "This 1.5 billion parameter model that now does this. It's absolutely insane," I exclaimed on the show. We're talking 28.9% on AIME and 83.9% on MATH. Let that sink in. A model you can probably run on your phone is schooling the big boys in math.
License-wise, it's MIT, which as Nisten put it, "MIT is like a jailbreak to the whole legal system, pretty much. That's what most people don't realize. It's like, this is, it's not my problem. You're a problem now." Basically, do whatever you want with it. Distill it, fine-tune it, build Skynet – it's all fair game.
And the vibes? "Vibes are insane," as I mentioned on the show. Early benchmarks are showing R1 models trading blows with o1-preview and o1-mini, and even nipping at the heels of the full-fat o1 in some areas. Check out these numbers:
And the price? Forget about it. We're talking 50x cheaper than o1 currently. DeepSeek R1 API is priced at $0.14 / 1M input tokens and $2.19 / 1M output tokens, compared to OpenAI's o1 at $15.00 / 1M input and a whopping $60.00 / 1M output. Suddenly, high-quality reasoning is democratized.
LDJ highlighted the "aha moment" in DeepSeek's paper, where they talk about how reinforcement learning enabled the model to re-evaluate its approach and "think more." It seems like simple RL scaling, combined with a focus on reasoning, is the secret sauce. No fancy Monte Carlo Tree Search needed, apparently!
But the real magic of open source is what the community does with it. Pietro Schirano joined us to talk about his "Retrieval Augmented Thinking" (RAT) approach, where he extracts the thinking process from R1 and transplants it to other models. "And what I found out is actually by doing so, you may even like smaller, quote unquote, you know, less intelligent model actually become smarter," Pietro explained. Frankenstein models, anyone? (John Lindquist has a tutorial on how to do it here)
And then there's the genius hack from Voooogel, who figured out how to emulate a "reasoning_effort" knob by simply replacing the "end" token with "Wait, but". "This tricks the model into keeps thinking," as I described it. Want your AI to really ponder the meaning of life (or just 1+1)? Now you can, thanks to open source tinkering.
Georgi Gerganov, the legend behind llama.cpp, even jumped in with a two-line snippet to enable speculative decoding, boosting inference speeds on the 32B model on my Macbook from a sluggish 5 tokens per second to a much more respectable 10-11 tokens per second. Open source collaboration at its finest and it's only going to get better!
Thinking like a Neurotic
Many people really loved the way R1 thinks, and what I found astonishing is that I just sent "hey" and the thinking went into a whole 5 paragraph debate of how to answer, a user on X answered with "this is Woody Allen-level of Neurotic" which... nerd sniped me so hard! I used Hauio Audio (which is great!) and ByteDance latentSync and gave R1 a voice! It's really something when you hear it's inner monologue being spoken out like this!
ByteDance Enters the Ring: UI-TARS Controls Your PC
Not to be outdone in the open source frenzy, ByteDance, the TikTok behemoth, dropped UI-TARS, a set of models designed to control your PC. And they claim SOTA performance, beating even Anthropic's computer use models and, in some benchmarks, GPT-4o and Claude.
UI-TARS comes in 2B, 7B, and 72B parameter flavors, and ByteDance even released desktop apps for Mac and PC to go along with them. "They released an app it's called the UI TARS desktop app. And then, this app basically allows you to Execute the mouse clicks and keyboard clicks," I explained during the show.
While I personally couldn't get the desktop app to work flawlessly (quantization issues, apparently), the potential is undeniable. Imagine open source agents controlling your computer – the possibilities are both exciting and slightly terrifying. As Nisten wisely pointed out, "I would use another machine. These things are not safe to tell people. I might actually just delete your data if you, by accident." Words to live by, folks.
LDJ chimed in, noting that UI-TARS seems to excel particularly in operating system-level control tasks, while OpenAI's leaked "Operator" benchmarks might show an edge in browser control. It's a battle for desktop dominance brewing in open source!
Noting that the common benchmark between Operator and UI-TARS is OSWorld, UI-Tars launched with a SOTA
Humanity's Last Exam: The Benchmark to Beat
Speaking of benchmarks, a new challenger has entered the arena: Humanity's Last Exam (HLE). A cool new unsaturated bench of 3,000 challenging questions across over a hundred subjects, crafted by nearly a thousand subject matter experts from around the globe. "There's no way I'm answering any of those myself. I need an AI to help me," I confessed on the show.
And guess who's already topping the HLE leaderboard? You guessed it: DeepSeek R1, with a score of 9.4%! "Imagine how hard this benchmark is if the top reasoning models that we have right now... are getting less than 10 percent completeness on this," MMLU and Math are getting saturated? HLE is here to provide a serious challenge. Get ready to hear a lot more about HLE, folks.
Big CO LLMs + APIs: Google's Gemini Gets a Million-Token Brain
While open source was stealing the show, the big companies weren't completely silent. Google quietly dropped an update to Gemini Flash Thinking, their experimental reasoning model, and it's a big one. We're talking 1 million token context window and code execution capabilities now baked in!
"This is Google's scariest model by far ever built ever," Nisten declared. "This thing, I don't like how good it is. This smells AGI-ish" High praise, and high concern, coming from Nisten! Benchmarks are showing significant performance jumps in math and science evals, and the speed is, as Nisten put it, "crazy usable." They have enabled the whopping 1M context window for the new Gemini Flash 2.0 Thinking Experimental (long ass name, maybe let's call it G1?) and I agree, it's really really good!
And unlike some other reasoning models cough OpenAI cough, Gemini Flash Thinking shows you its thinking process! You can actually see the chain of thought unfold, which is incredibly valuable for understanding and debugging. Google's Gemini is quietly becoming a serious contender in the reasoning race (especially with Noam Shazeer being responsible for it!)
OpenAI's "Operator" - Agents Are (Almost) Here
The moment we were all waiting for (or at least, I was): OpenAI finally unveiled Operator, their first foray into Level 3 Autonomy - agentic capabilities with ChatGPT. Sam Altman himself hyped it up as "AI agents are AI systems that can do work for you. You give them a task and they go off and do it." Sounds amazing, right?
Operator is built on a new model called CUA (Computer Using Agent), trained on top of GPT-4, and it's designed to control a web browser in the cloud, just like a human would, using screen pixels, mouse, and keyboard. "This is just using screenshots, no API, nothing, just working," one of the OpenAI presenters emphasized.
They demoed Operator booking restaurant reservations on OpenTable, ordering groceries on Instacart, and even trying to buy Warriors tickets on StubHub (though that demo got a little… glitchy). The idea is that you can delegate tasks to Operator, and it'll go off and handle them in the background, notifying you when it needs input or when the task is complete.
As I'm writing these words, I have an Operator running trying to get me some fried rice, and another one trying to book me a vacation with kids over the summer, find some options and tell me what it found.
Benchmarks-wise, OpenAI shared numbers for OSWorld (38.1%) and WebArena (58.1%), showing Operator outperforming previous SOTA but still lagging behind human performance. "Still a way to go," as they admitted. But the potential is massive.
The catch? Operator is initially launching in the US for Pro users only, and even then, it wasn't exactly smooth sailing. I immediately paid the $200/mo to try it out (pro mode didn't convince me, unlimited SORA videos didn't either, operator definitely did, SOTA agents from OpenAI is definitely something I must try!) and my first test? Writing a tweet 😂 Here's a video of that first attempt, which I had to interrupt 1 time.
But hey, it's a "low key research preview" right? And as Sam Altman said, "This is really the beginning of this product. This is the beginning of our step into Agents Level 3 on our tiers of AGI" Agentic ChatGPT is coming, folks, even if it's taking a slightly bumpy route to get here.
BTW, while I'm writing these words, Operator is looking up some vacation options for me and is sending me notifications about them, what a world and we've only just started 2025!
Project Stargate: $500 Billion for AI Infrastructure
If R1 and Operator weren't enough to make your head spin, how about a $500 BILLION "Manhattan Project for AI infrastructure"? That's exactly what OpenAI, SoftBank, and Oracle announced this week: Project Stargate.
"This is insane," I exclaimed on the show. "Power ups for the United States compared to like, other, other countries, like 500 billion commitment!" We're talking about a massive investment in data centers, power plants, and everything else needed to fuel the AI revolution. 2% of the US GDP, according to some estimates!
Larry Ellison even hinted at using this infrastructure for… curing cancer with personalized vaccines. Whether you buy into that or not, the scale of this project is mind-boggling. As LDJ explained, "It seems like it is very specifically for open AI. Open AI will be in charge of operating it. And yeah, it's, it sounds like a smart way to actually kind of get funding and investment for infrastructure without actually having to give away open AI equity."
And in a somewhat related move, Microsoft, previously holding exclusive cloud access for OpenAI, has opened the door for OpenAI to potentially run on other clouds, with Microsoft's approval if "they cannot meet demant". Is AGI closer than we think? Sam Altman himself downplayed the hype, tweeting, "Twitter hype is out of control again. We're not going to deploy AGI next month, nor have we built it. We have some very cool stuff for you, but please chill and cut your expectations a hundred X."
But then he drops Operator and a $500 billion infrastructure bomb in the same week and announces that o3-mini is going to be available for the FREE tier of chatGPT.
Sure, Sam, we're going to chill... yeah right.
This Week's Buzz at Weights & Biases: SWE-bench SOTA!
Time for our weekly dose of Weights & Biases awesomeness! This week, our very own CTO, Shawn Lewis, broke the SOTA on SWE-bench Verified! That's right, W&B Programmer, Shawn's agentic framework built on top of o1, achieved a 64.6% solve rate on this notoriously challenging coding benchmark.
Shawn detailed his journey in a blog post, highlighting the importance of iteration and evaluation – powered by Weights & Biases Weave, naturally. He ran over 1000 evaluations to reach this SOTA result! Talk about eating your own dogfood!
REMOVING BARRIERS TO AMERICAN LEADERSHIP IN ARTIFICIAL INTELLIGENCE - Executive order
Just now as I’m editing the podcast, President Trump signed into effect an executive order for AI, and here are the highlights.
- Revokes existing AI policies that hinder American AI innovation
- Aims to solidify US as global leader in AI for human flourishing, competitiveness, and security
- Directs development of an AI Action Plan within 180 days
- Requires immediate review and revision of conflicting policies
- Directs OMB to revise relevant memos within 60 days
- Preserves agency authority and OMB budgetary functions
- Consistent with applicable law and funding availability
- Seeks to remove barriers and strengthen US AI dominance
This marks such a significant pivot into AI acceleration, removing barriers, acknowledging that AI is a huge piece of our upcoming future and that US really needs to innovate here, become the global leader, and remove regulation and obstacles. The folks that work on this behind the scenes, Sriram Krishan (previously A16Z) and David Sacks, are starting to get into the government and implement those policies, so we’re looking forward to what will come form that!
Vision & Video: Nvidia's Vanishing Eagle 2 & Hugging Face's Tiny VLM
In the world of vision and video, Nvidia teased us with Eagle 2, a series of frontier vision-language models promising 4K HD input, long-context video, and grounding capabilities with some VERY impressive evals. Weights were released, then…yanked. "NVIDIA released Eagle 2 and then yanked it back. So I don't know what's that about," I commented. Mysterious Nvidia strikes again.
On the brighter side, Hugging Face released SmolVLM, a truly tiny vision-language model, coming in at just 256 million and 500 million parameters. "This tiny model that runs in like one gigabyte of RAM or some, some crazy things, like a smart fridge" I exclaimed, impressed. The 256M model even outperforms their previous 80 billion parameter Idefics model from just 17 months ago. Progress marches on, even in tiny packages.
AI Art & Diffusion & 3D: Hunyuan 3D 2.0 is State of the Art
For the artists and 3D enthusiasts, Tencent's Hunyuan 3D 2.0 dropped this week, and it's looking seriously impressive. "Just look at this beauty," I said, showcasing a generated dragon skull. "Just look at this."
Hunyuan 3D 2.0 boasts two models: Hunyuan3D-DiT-v2-0 for shape generation and Hunyuan3D-Paint-v2-0 for coloring. Text-to-3D and image-to-3D workflows are both supported, and the results are, well, see for yourself:
If you're looking to move beyond 2D images, Hunyuan 3D 2.0 is definitely worth checking out.
Tools: ByteDance Clones Cursor with Trae
And finally, in the "tools" department, ByteDance continues its open source blitzkrieg with Trae, a free Cursor competitor. "ByteDance drops Trae, which is a cursor competitor, which is free for now" I announced on the show, so if you don't mind your code being sent to... china somewhere, and can't afford Cursor, this is not a bad alternative!
Trae imports your Cursor configs, supports Claude 3.5 and GPT-4o, and offers a similar AI-powered code editing experience, complete with chat interface and "builder" (composer) mode. The catch? Your code gets sent to a server in China. If you're okay with that, you've got yourself a free Cursor alternative. "If you're okay with your like code getting shared with ByteDance, this is a good option for you," I summarized. Decisions, decisions.
Phew! That was a whirlwind tour through another insane week in AI. From DeepSeek R1's open source reasoning revolution to OpenAI's Operator going live, and Google's million-token Gemini brain, it's clear that the pace of innovation is showing no signs of slowing down.
Open source is booming, agents are inching closer to reality, and the big companies are throwing down massive infrastructure investments. We're accelerating as f**k, and it's only just beginning, hold on to your butts.
Make sure to dive into the show notes below for all the links and details on everything we covered. And don't forget to give R1 a spin – and maybe try out that "reasoning_effort" hack. Just don't blame me if your AI starts having an existential crisis.
And as a final thought, channeling my inner Woody Allen-R1, "Don't overthink too much. enjoy our one. Enjoy the incredible things we received this week from open source."
See you all next week for more ThursdAI madness! And hopefully, by then, Operator will actually be operating. 😉
TL;DR and show notes
* Open Source LLMs
* DeepSeek R1 - MIT licensed SOTA open source reasoning model (HF, X)
* ByteDance UI-TARS - PC control models (HF, Github )
* HLE - Humanity's Last Exam benchmark (Website)
* Big CO LLMs + APIs
* SoftBank, Oracle, OpenAI Stargate Project - $500B AI infrastructure (OpenAI Blog)
* Google Gemini Flash Thinking 01-21 - 1M context, Code execution, Better Evals (X)
* OpenAI Operator - Agentic browser in ChatGPT Pro operator.chatgpt.com
* Anthropic launches citations in API (blog)
* Perplexity SonarPRO Search API and an Android AI assistant (X)
* This weeks Buzz 🐝
* W&B broke SOTA SWE-bench verified (W&B Blog)
* Vision & Video
* HuggingFace SmolVLM - Tiny VLMs - runs even on WebGPU (HF)
* AI Art & Diffusion & 3D
* Hunyuan 3D 2.0 - SOTA open-source 3D (HF)
* Tools
* ByteDance Trae - Cursor competitor (Trae AI: https://trae.ai/)
* Show Notes:
* Pietro Skirano RAT - Retrieval augmented generation (X)
* Run DeepSeek with more “thinking” script (Gist)

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
- Ouvir Ouvir novamente Continuar A reproduzir…
- Ouvir depois Ouvir depois
📆 ThursdAI - Jan 16, 2025 - Hailuo 4M context LLM, SOTA TTS in browser, OpenHands interview & more AI news
17 jan· ThursdAI - The top AI news from the past week
Hey everyone, Alex here 👋
Welcome back, to an absolute banger of a week in AI releases, highlighted with just massive Open Source AI push. We're talking a MASSIVE 4M context window context window model from Hailuo (remember when a jump from 4K to 16K seemed like a big deal?), a 8B omni model that lets you livestream video and glimpses of Agentic ChatGPT?
This week's ThursdAI was jam-packed with so much open source goodness that the big companies were practically silent. But don't worry, we still managed to squeeze in some updates from OpenAI and Mistral, along with a fascinating new paper from Sakana AI on self-adaptive LLMs. Plus, we had the incredible Graham Neubig, from All Hands AI, join us to talk about Open Hands (formerly OpenDevin) and even contributed to our free, LLM Evaluation course on Weights & Biases!
Before we dive in, a friend asked me over dinner, what are the main 2 things that happened in AI in 2024, and this week highlights one of those trends. Most of the Open Source is now from China. This week, we got MiniMax from Hailuo, OpenBMB with a new MiniCPM, InternLM came back and most of the rest were Qwen finetunes. Not to mention DeepSeek. Wanted to highlight this significant narrative change and that this is being done despite the chip export restrictions.
ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.
Open Source AI & LLMs
MiniMax-01: 4 Million Context, 456 Billion Parameters, and Lightning Attention
This came absolutely from the left field, given that we've seen no prior LLMs from Haulio, the company previously releasing video models with consistent characters. Dropping a massive 456B mixture of experts model (45B active parameters) with such a long context support in open weights, but also with very significant benchmarks that compete with Gpt-4o, Claude and DeekSeek v3 (75.7 MMLU-pro, 89 IFEval, 54.4 GPQA)
They have trained the model on up to 1M context window and then extended it to 4M with ROPE scaling methods (our coverage of RoPE) during Inference. MiniMax-Text-01 adopts a hybrid architecture that combines Lightning Attention, Softmax Attention and Mixture-of-Experts (MoE) with 45B active parameters.
I gotta say, when we started talking about context window, imagining a needle in a haystack graph that shows 4M, in the open source seemed far fetched, though we did say that theoretically, there may not be a limit to context windows. I just always expected that limit to be unlocked by transformers alternative architectures like Mamba or other State Space Models.
Vision, API and Browsing - Minimax-VL-01
It feels like such a well rounded and complete release, that it highlights just how mature company that is behind it. They have also released a vision version of this model, that includes a 300M param Vision Transformer on top (trained with 512B vision language tokens) that features dynamic resolution and boasts very high DocVQA and ChartQA scores.
Not only did these two models were released in open weights, they also launched as a unified API endpoint (supporting up to 1M tokens) and it's cheap! $0.2/1M input and $1.1/1M output tokens! AFAIK this is only the 3rd API that supports this much context, after Gemini at 2M and Qwen Turbo that supports 1M as well.
Surprising web browsing capabilities
You can play around with the model on their website, hailuo.ai which also includes web grounding, which I found quite surprising to find out, that they are beating chatGPT and Perplexity on how fast they can find information that just happened that same day! Not sure what search API they are using under the hood but they are very quick.
8B chat with video model omni-model from OpenBMB
OpenBMB has been around for a while and we've seen consistently great updates from them on the MiniCPM front, but this one takes the cake!
This is a complete omni modal end to end model, that does video streaming, audio to audio and text understanding, all on a model that can run on an iPad!
They have a demo interface that is very similar to the chatGPT demo from spring of last year, and allows you to stream your webcam and talk to the model, but this is just an 8B parameter model we're talking about! It's bonkers!
They are boasting some incredible numbers, and to be honest, I highly doubt their methodology in textual understanding, because, well, based on my experience alone, this model understands less than close to chatGPT advanced voice mode, but miniCPM has been doing great visual understanding for a while, so ChartQA and DocVQA are close to SOTA.
But all of this doesn't matter, because, I say again, just a little over a year ago, Google released a video announcing these capabilities, having an AI react to a video in real time, and it absolutely blew everyone away, and it was FAKED. And this time a year after, we have these capabilities, essentially, in an 8B model that runs on device 🤯
Voice & Audio
This week seems to be very multimodal, not only did we get an omni-modal from OpenBMB that can speak, and last week's Kokoro still makes a lot of waves, but this week there were a lot of voice updates as well
Kokoro.js - run the SOTA open TTS now in your browser
Thanks to friend of the pod Xenova (and the fact that Kokoro was released with ONNX weights), we now have kokoro.js, or npm -i kokoro-js if you will.
This allows you to install and run Kokoro, the best tiny TTS model, completely within your browser, with a tiny 90MB download and it sounds really good (demo here)
Hailuo T2A - Emotional text to speech + API
Hailuo didn't rest on their laurels of releasing a huge context window LLM, they also released a new voice framework (tho not open sourced) this week, and it sounds remarkably good (competing with 11labs)
They have all the standard features like Voice Cloning, but claim to have a way to preserve the emotional undertones of a voice. They also have 300 voices to choose from and professional effects applied on the fly, like acoustics or telephone filters. (Remember, they have a video model as well, so assuming that some of this is to for the holistic video production)
What I specifically noticed is their "emotional intelligence system" that's either automatic or can be selected from a dropdown. I also noticed their "lax" copyright restrictions, as one of the voices that was called "Imposing Queen" sounded just like a certain blonde haired heiress to the iron throne from a certain HBO series.
When I generated a speech worth of that queen, I noticed that the emotion in that speech sounded very much like an actress would read them, and unlike any old TTS, just listen to it in the clip above, I don't remember getting TTS outputs with this much emotion from anything, maybe outside of advanced voice mode! Quite impressive!
This Weeks Buzz from Weights & Biases - AGENTS!
Breaking news from W&B as our CTO just broke SWE-bench Verified SOTA, with his own o1 agentic framework he calls W&B Programmer 😮 at 64.6% of the issues!
Shawn describes how he achieved this massive breakthrough here and we'll be publishing more on this soon, but the highlight for me is he ran over 900 evaluations during the course of this, and tracked all of them in Weave!
We also have an upcoming event in NY, on Jan 22nd, if you're there, come by and learn how to evaluate your AI agents, RAG applications and hang out with our team! (Sign up here)
Big Companies & APIs
OpenAI adds chatGPT tasks - first agentic feature with more to come!
We finally get a glimpse of an agentic chatGPT, in the form of scheduled tasks! Deployed to all users, it is now possible to select gpt-4o with tasks, and schedule tasks in the future.
You can schedule them in natural language, and then will execute a chat (and maybe perform a search or do a calculation) and then send you a notification (and an email!) when the task is done!
A bit underwhelming at first, as I didn't really find a good use for this yet, I don't doubt that this is just a building block for something more Agentic to come that can connect to my email or calendar and do actual tasks for me, not just... save me from typing the chatGPT query at "that time"
Mistral CodeStral 25.01 - a new #1 coding assistant model
An updated Codestral was released at the beginning of the week, and TBH I've never seen the vibes split this fast on a model.
While it's super exciting that Mistral is placing a coding model at #1 on the LMArena CoPilot's arena, near Claude 3.5 and DeepSeek, the fact that this new model is not released weights is really a bummer (especially as a reference to the paragraph I mentioned on top)
We seem to be closing down on OpenSource in the west, while the Chinese labs are absolutely crushing it (while also releasing in the open, including Weights, Technical papers).
Mistral has released this model in API and via a collab with the Continue dot dev coding agent, but they used to be the darling of the open source community by releasing great models!
Also notable, a very quick new benchmark post release was dropped that showed a significant difference between their reported benchmarks and how it performs on Aider polyglot
There was way more things for this week than we were able to cover, including a new and exciting transformers squared new architecture from Sakana, a new open source TTS with voice cloning and a few other open source LLMs, one of which cost only $450 to train! All the links in the TL;DR below!
TL;DR and show notes
* Open Source LLMs
* MiniMax-01 from Hailuo - 4M context 456B (45B A) LLM (Github, HF, Blog, Report)
* Jina - reader V2 model - HTML 2 Markdown/JSON (HF)
* InternLM3-8B-Instruct - apache 2 License (Github, HF)
* OpenBMB - MiniCPM-o 2.6 - Multimodal Live Streaming on Your Phone (HF, Github, Demo)
* KyutAI - Helium-1 2B - Base (X, HF)
* Dria-Agent-α - 3B model that outputs python code (HF)
* Sky-T1, a ‘reasoning’ AI model that can be trained for less than $450 (blog)
* Big CO LLMs + APIs
* OpenAI launches ChatGPT tasks (X)
* Mistral - new CodeStral 25.01 (Blog, no Weights)
* Sakana AI - Transformer²: Self-Adaptive LLMs (Blog)
* This weeks Buzz
* Evaluating RAG Applications Workshop - NY, Jan 22, W&B and PineCone (Free Signup)
* Our evaluations course is going very strong! (chat w/ Graham Neubig) (https://wandb.me/evals-t)
* Vision & Video
* Luma releases Ray2 video model (Web)
* Voice & Audio
* Hailuo T2A-01-HD - Emotions Audio Model from Hailuo (X, Try It)
* OuteTTS 0.3 - 1B & 500M - zero shot voice cloning model (HF)
* Kokoro.js - 80M SOTA TTS in your browser! (X, Github, try it )
* AI Art & Diffusion & 3D
* Black Forest Labs - Finetuning for Flux Pro and Ultra via API (Blog)
* Show Notes and other Links
* Hosts - Alex Volkov (@altryne), Wolfram RavenWlf (@WolframRvnwlf), Nisten Tahiraj (@nisten)
* Guest - Graham Neubig (@gneubig) from All Hands AI (@allhands_ai)
* Graham’s mentioned Agents blogpost - 8 things that agents can do right now
* Projects - Open Hands (previously Open Devin) - Github
* Germany meetup in Cologne (here)
* Toronto Tinkerer Meetup *Sold OUT* (Here)
* YaRN conversation we had with the Authors (coverage)
See you folks next week! Have a great long weekend if you’re in the US 🫡
Please help to promote the podcast and newsletter by sharing with a friend!

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
- Ouvir Ouvir novamente Continuar A reproduzir…
- Ouvir depois Ouvir depois
📆 ThursdAI - Jan 9th - NVIDIA's Tiny Supercomputer, Phi-4 is back, Kokoro TTS & Moondream gaze, ByteDance SOTA lip sync & more AI news
10 jan· ThursdAI - The top AI news from the past week
Hey everyone, Alex here 👋
This week's ThursdAI was a whirlwind of announcements, from Microsoft finally dropping Phi-4's official weights on Hugging Face (a month late, but who's counting?) to Sam Altman casually mentioning that OpenAI's got AGI in the bag and is now setting its sights on superintelligence. Oh, and NVIDIA? They're casually releasing a $3,000 supercomputer that can run 200B parameter models on your desktop. No big deal.
We had some amazing guests this week too, with Oliver joining us to talk about a new foundation model in genomics and biosurveillance (yes, you read that right - think wastewater and pandemic monitoring!), and then, we've got some breaking news! Vik returned to the show with a brand new Moondream release that can do some pretty wild things. Ever wanted an AI to tell you where someone's looking in a photo? Now you can, thanks to a tiny model that runs on edge devices. 🤯
So buckle up, folks, because we've got a ton to cover. Let's dive into the juicy details of this week's AI madness, starting with open source.
03:10 TL;DR
03:10 Deep Dive into Open Source LLMs
10:58 MetaGene: A New Frontier in AI
20:21 PHI4: The Latest in Open Source AI
27:46 R Star Math: Revolutionizing Small LLMs
34:02 Big Companies and AI Innovations
42:25 NVIDIA's Groundbreaking Announcements
43:49 AI Hardware: Building and Comparing Systems
46:06 NVIDIA's New AI Models: LLAMA Neumatron
47:57 Breaking News: Moondream's Latest Release
50:19 Moondream's Journey and Capabilities
58:41 Weights & Biases: New Evals Course
01:08:29 NVIDIA's World Foundation Models
01:08:29 ByteDance's LatentSync: State-of-the-Art Lip Sync
01:12:54 Kokoro TTS: High-Quality Text-to-Speech
As always, TL;DR section with links and show notes below 👇
Open Source AI & LLMs
Phi-4: Microsoft's "Small" Model Finally Gets its Official Hugging Face Debut
Finally, after a month, we're getting Phi-4 14B on HugginFace. So far, we've had bootlegged copies of it, but it's finally officially uploaded by Microsoft. Not only is it now official, it's also officialy MIT licensed which is great!
So, what's the big deal? Well, besides the licensing, it's a 14B parameter, dense decoder-only Transformer with a 16K token context length and trained on a whopping 9.8 trillion tokens. It scored 80.4 on math and 80.6 on MMLU, making it about 10% better than its predecessor, Phi-3 and better than Qwen 2.5's 79
What’s interesting about phi-4 is that the training data consisted of 40% synthetic data (almost half!)
The vibes are always interesting with Phi models, so we'll keep an eye out, notable also, the base models weren't released due to "safety issues" and that this model was not trained for multi turn chat applications but single turn use-cases
MetaGene-1: AI for Pandemic Monitoring and Pathogen Detection
Now, this one's a bit different. We usually talk about LLMs in this section, but this is more about the "open source" than the "LLM." Prime Intellect, along with folks from USC, released MetaGene-1, a metagenomic foundation model. That's a mouthful, right? Thankfully, we had Oliver Liu, a PhD student at USC, and an author on this paper, join us to explain.
Oliver clarified that the goal is to use AI for "biosurveillance, pandemic monitoring, and pathogen detection." They trained a 7B parameter model on 1.5 trillion base pairs of DNA and RNA sequences from wastewater, creating a model surprisingly capable of zero-shot embedding. Oliver pointed out that while using genomics to pretrain foundation models is not new, MetaGene-1 is, "in its current state, the largest model out there" and is "one of the few decoder only models that are being used". They also have collected 15T bae pairs but trained on 10% of them due to grant and compute constraints.
I really liked this one, and though the science behind this was complex, I couldn't help but get excited about the potential of transformer models catching or helping catch the next COVID 👏
rStar-Math: Making Small LLMs Math Whizzes with Monte Carlo Tree Search
Alright, this one blew my mind. A paper from Microsoft (yeah, them again) called "rStar-Math" basically found a way to make small LLMs do math better than o1 using Monte Carlo Tree Search (MCTS). I know, I know, it sounds wild. They took models like Phi-3-mini (a tiny 3.8B parameter model) and Qwen 2.5 3B and 7B, slapped some MCTS magic on top, and suddenly these models are acing the AIME 2024 competition math benchmark and scoring 90% on general math problems. For comparison, OpenAI's o1-preview scores 85.5% on math and o1-mini scores 90%. This is WILD, as just 5 months ago, it was unimaginable that any LLM can solve math of this complexity, then reasoning models could, and now small LLMs with some MCTS can!
Even crazier, they observed an "emergence of intrinsic self-reflection capability" in these models during problem-solving, something they weren't designed to do. LDJ chimed in saying "we're going to see more papers showing these things emerging and caught naturally." So, is 2025 the year of not just AI agents, but also emergent reasoning in LLMs? It's looking that way. The code isn't out yet (the GitHub link in the paper is currently a 404), but when it drops, you can bet we'll be all over it.
Big Companies and LLMs
OpenAI: From AGI to ASI
Okay, let's talk about the elephant in the room: Sam Altman's blog post. While reflecting on getting fired from his job on like a casual Friday, he dropped this bombshell: "We are now confident that we know how to build AGI as we have traditionally understood it." And then, as if that wasn't enough, he added, "We're beginning to turn our aim beyond that to superintelligence in the true sense of the word." So basically, OpenAI is saying, "AGI? Done. Next up: ASI."
This feels like a big shift in how openly folks at OpenAI is talking about Superintelligence, and while AGI is yet to be properly defined (LDJ read out the original OpenAI definition on the live show, but the Microsoft definition contractually with OpenAI was a system that generates $100B in revenue) they are already talking about Super Intelligence which supersedes all humans ever lived in all domains
NVIDIA @ CES - Home SuperComputers, 3 scaling laws, new Models
There was a lot of things happening at CES, the largest consumer electronics show, but the AI focus was on NVIDIA, namely on Jensen Huangs keynote speech!
He talked about a lot of stuff, really, it's a show, and is a very interesting watch, NVIDIA is obviously at the forefront of all of this AI wave, and when Jensen tells you that we're at the high of the 3rd scaling law, he knows what he's talking about (because he's fueling all of it with his GPUs) - the third one is of course test time scaling or "reasoning", the thing that powers o1, and the coming soon o3 model and other reasoners.
Project Digits - supercomputer at home?
Jensen also announced Project Digits: a compact AI supercomputer priced at a relatively modest $3,000. Under the hood, it wields a Grace Blackwell “GB10” superchip that supposedly offers 1 petaflop of AI compute and can support LLMs up to 200B parameters (or you can link 2 of them to run LLama 405b at home!)
This thing seems crazy, but we don't know more details like the power requirements for this beast!
Nemotrons again?
Also announced was a family of NVIDIA LLama Nemotron foundation models, but.. weirdly we already have Nemotron LLamas (3 months ago) , so those are... new ones? I didn't really understand what was announced here, as we didn't get new models, but the announcement was made nonetheless. We're due to get 3 new version of Nemotron on the Nvidia NEMO platform (and Open), sometime soon.
NVIDIA did release new open source models, with COSMOS, which is a whole platform that includes pretrained world foundation models to help simulate world environments to train robots (among other things).
They have released txt2world and video2world Pre-trained Diffusion and Autoregressive models in 7B and 14B sizes, that generate videos to simulate visual worlds that have strong alignment to physics.
If you believe Elon when he says that Humanoid Robots are going to be the biggest category of products (every human will want 1 or 3, so we're looking at 20 billion of them), then COSMOS is a platform to generate synthetic data to train these robots to do things in the real world!
This weeks buzz - Weights & Biases corner
The wait is over, our LLM Evals course is now LIVE, featuring speakers Graham Neubig (who we had on the pod before, back when Open Hands was still called Open Devin) and Paige Bailey, and Anish and Ayush from my team at W&B!
If you're building with LLM in production and don't have a robust evaluation setup, or don't even know where to start with one, this course is definitely for you! Sign up today. You'll learn from examples of Imagen and Veo from Paige, Agentic examples using Weave from Graham and Basic and Advanced Evaluation from Anish and Ayush.
The workshop in Seattle next was filled out super quick, so since we didn't want to waitlist tons of folks, we have extended it to another night, so those of you who couldn't get in, will have another opportunity on Tuesday! (Workshop page) but while working on it I came up with this distillation of what I'm going to deliver, and wanted to share with you.
Vision & Video
New Moondream 01-09 can tell where you look (among other things) (blog, HF)
We had some breaking news on the show! Vik Korrapati, the creator of Moondream, joined us to announce updates to Moondream, a new version of his tiny vision language model. This new release has some incredible capabilities, including pointing, object detection, structured output (like JSON), and even gaze detection. Yes, you read that right. Moondream can now tell you where someone (or even a pet!) is looking in an image.
Vic explained how they achieved this: "We took one of the training datasets that Gazelle trained on and added it to the Moondream fine tuning mix". What's even more impressive is that Moondream is tiny - the new version comes in 2B and 0.5B parameter sizes. As Vic said, "0.5b is we actually started with the 2b param model and we pruned down while picking specific capabilities you want to preserve". This makes it perfect for edge devices and applications where cost or privacy is a concern. It's incredible to see how far Moondream has come, from a personal project to a company with seven employees working on it.
Since Vik joined ThursdAI last January (we seem to be on a kick of revisiting with our guests from last year!) Moondream is a company, but they are committed to open source and so this releases is also Apache 2 👏 but you can also try this out on their website playground and hire them if you need to finetune a custom tiny vision model!
Voice & Audio
Very exciting updates in the OSS voice and audio this week!
KOKORO TTS - Apache 2 tiny (82M! params) TTS that's #1 on TTS arena (HF,Demo)
Honestly when Wolfram told me about Kokoro being #1 on TTS arena and that it was released a few weeks back, I almost skipped giving this an update, but wow, this tiny tiny model can run on edge devices, can run in your browser, and the sound it generates is SO clean!
It's Apache 2 license and the voices were trained on non licensed data (per the author)
There's no voice cloning support yet, but there are voice packs you can use, and somehow, they got the SKY voice. Remember the one that Scarlett Johanson almost sued OpenAI for? That one! And for 82M parameters it sounds so good, hell, for any TTS, it sounds very good!
ByteDance - LatentSync state of the art lip syncing (X, Paper, Fal)
In the same week, ByteDance released a SOTA lip syncing OSS model called LatentSync, which takes a voice (for example, such as the one you can create with Kokoro above) and a video, and sync the lips of the person in the video, to make it seem like that person said the thing.
This is for example great for translation purposes, here's a quick example of my cloned voice (via 11labs) and translated opening of the show in spanish, and overlays it on top of my actual video, and it's pretty good!
This week Lex Fridman interviewed Volodymir Zelensky and I loved the technical and AI aspect of that whole multilingual interview, they have translated that into English, Russian and Ukrainian. But the lips weren't synced so it looked a bit off still. Now consider the different with and without lip syncing (here's a quick example I whipped up)
Baidu - Hallo 3 - generative avatars now with animated backgrounds
Meanwhile over at Baidu, Hallo 3 is their 3rd iteration of generative portraits, a way to turn a single image into a completely animated avatar, by also providing it a recording of your voice (or a TTS, does it really matter at this point?)
The highlight here is, the background is now part of these avatars! Where as previously these avatars used to be static, now they have dynamic backgrounds. Tho I still feel weirded out by their lip movements, but maybe with the above lipsyncing this can be fixed?
Not a bad second week of the yeah eh? A LOT of open source across multimodalities, supercomputers at home, tiny vision and TTS models and tons of apache 2 or MIT licensed models all over!
See you guys next week (well, some of you in person in SF and Seattle) but most of you next week on ThursdAI! 🫡
Tl;DR + Show Notes
* Open Source LLMs
* Phi-4 MIT licensed family of models from Microsoft (X, Blog, HF)
* Prime Intellect - MetaGENE-1 - metagenomic foundation model (Site, X, Paper)
* rStar-Math - making Small LLMs do Math better than o1 with MCTS (Paper, Github)
* Big CO LLMs + APIs
* Sam Altman releases an ASI blog, multiple OpenAI people switch from AGI to ASI (X)
* NVIDIA updates from CES (X)
* XAI - Grok IOS app + Grok 3 finished pre-training
* Qwen has a new web portal with all their modals - chat.qwenlm.ai
* This weeks Buzz
* Evals Course is LIVE - Evals with Paige Bailey and Graham Neubig Course Signup (Signup)
* San Francisco is still open (Details)
* Seattle is almost waitlisted (Workshop)
* Vision & Video
* NVIDIA Cosmos - World Foundation Models (Post, Github, HF)
* Moondream 2 announcement - new evals - Chat with Vik Korrapati (X, HF, Try It)
* Voice & Audio
* Kokoro - #1 TTS with Apache 2 license (HF, Demo)
* Baidu - Hallo 3 - generative portraits (Project, Github, HF)
* ByteDance - LatentSync lip syncing model (X, Paper, Fal)
* AI Art & Diffusion & 3D
* Stability - SPAR3D: Stable Point-Aware Reconstruction of 3D Objects from Single Images ( HF)

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
- Ouvir Ouvir novamente Continuar A reproduzir…
- Ouvir depois Ouvir depois
📆 ThursdAI - Jan 2 - is 25' the year of AI agents?
2 jan· ThursdAI - The top AI news from the past week
Hey folks, Alex here 👋 Happy new year!
On our first episode of this year, and the second quarter of this century, there wasn't a lot of AI news to report on (most AI labs were on a well deserved break). So this week, I'm very happy to present a special ThursdAI episode, an interview with Joāo Moura, CEO of Crew.ai all about AI agents!
We first chatted with Joāo a year ago, back in January of 2024, as CrewAI was blowing up but still just an open source project, it got to be the number 1 trending project on Github, and #1 project on Product Hunt. (You can either listen to the podcast or watch it in the embedded Youtube above)
00:36 Introduction and New Year Greetings
02:23 Updates on Open Source and LLMs
03:25 Deep Dive: AI Agents and Reasoning
03:55 Quick TLDR and Recent Developments
04:04 Medical LLMs and Modern BERT
09:55 Enterprise AI and Crew AI Introduction
10:17 Interview with João Moura: Crew AI
25:43 Human-in-the-Loop and Agent Evaluation
33:17 Evaluating AI Agents and LLMs
44:48 Open Source Models and Fin to OpenAI
45:21 Performance of Claude's Sonnet 3.5
48:01 Different parts of an agent topology, brain, memory, tools, caching
53:48 Tool Use and Integrations
01:04:20 Removing LangChain from Crew
01:07:51 The Year of Agents and Reasoning
01:18:43 Addressing Concerns About AI
01:24:31 Future of AI and Agents
01:28:46 Conclusion and Farewell
---
Is 2025 "the year of AI agents"?
AI agents as I remember them as a concept started for me a few month after I started ThursdAI ,when AutoGPT exploded. Was such a novel idea at the time, run LLM requests in a loop,
(In fact, back then, I came up with a retry with AI concept and called it TrAI/Catch, where upon an error, I would feed that error back into the GPT api and ask it to correct itself. it feels so long ago!)
AutoGPT became the fastest ever Github project to reach 100K stars, and while exciting, it did not work.
Since then we saw multiple attempts at agentic frameworks, like babyAGI, autoGen. Crew AI was one of them that keeps being the favorite among many folks.
So, what is an AI agent? Simon Willison, friend of the pod, has a mission, to ask everyone who announces a new agent, what they mean when they say it because it seems that everyone "shares" a common understanding of AI agents, but it's different for everyone.
We'll start with Joāo's explanation and go from there. But let's assume the basic, it's a set of LLM calls, running in a self correcting loop, with access to planning, external tools (via function calling) and a memory or sorts that make decisions.
Though, as we go into detail, you'll see that since the very basic "run LLM in the loop" days, the agents in 2025 have evolved and have a lot of complexity.
My takeaways from the conversation
I encourage you to listen / watch the whole interview, Joāo is deeply knowledgable about the field and we go into a lot of topics, but here are my main takeaways from our chat
* Enterprises are adopting agents, starting with internal use-cases
* Crews have 4 different kinds of memory, Long Term (across runs), short term (each run), Entity term (company names, entities), pre-existing knowledge (DNA?)
* TIL about a "do all links respond with 200" guardrail
* Some of the agent tools we mentioned
* Stripe Agent API - for agent payments and access to payment data (blog)
* Okta Auth for Gen AI - agent authentication and role management (blog)
* E2B - code execution platform for agents (e2b.dev)
* BrowserBase - programmatic web-browser for your AI agent
* Exa - search grounding for agents for real time understanding
* Crew has 13 crews that run 24/7 to automate their company
* Crews like Onboarding User Enrichment Crew, Meetings Prep, Taking Phone Calls, Generate Use Cases for Leads
* GPT-4o mini is the most used model for 2024 for CrewAI with main factors being speed / cost
* Speed of AI development makes it hard to standardize and solidify common integrations.
* Reasoning models like o1 still haven't seen a lot of success, partly due to speed, partly due to different way of prompting required.
This weeks Buzz
We've just opened up pre-registration for our upcoming FREE evaluations course, featuring Paige Bailey from Google and Graham Neubig from All Hands AI (previously Open Devin). We've distilled a lot of what we learned about evaluating LLM applications while building Weave, our LLM Observability and Evaluation tooling, and are excited to share this with you all! Get on the list
Also, 2 workshops (also about Evals) from us are upcoming, one in SF on Jan 11th and one in Seattle on Jan 13th (which I'm going to lead!) so if you're in those cities at those times, would love to see you!
And that's it for this week, there wasn't a LOT of news as I said. The interesting thing is, even in the very short week, the news that we did get were all about agents and reasoning, so it looks like 2025 is agents and reasoning, agents and reasoning!
See you all next week 🫡
TL;DR with links:
* Open Source LLMs
* HuatuoGPT-o1 - medical LLM designed for medical reasoning (HF, Paper, Github, Data)
* Nomic - modernbert-embed-base - first embed model on top of modernbert (HF)
* HuggingFace - SmolAgents lib to build agents (Blog)
* SmallThinker-3B-Preview - a QWEN 2.5 3B "reasoning" finetune (HF)
* Wolfram new Benchmarks including DeepSeek v3 (X)
* Big CO LLMs + APIs
* Newcomer Rubik's AI Sonus-1 family - Mini, Air, Pro and Reasoning (X, Chat)
* Microsoft "estimated" GPT-4o-mini is a ~8B (X)
* Meta plans to bring AI profiles to their social networks (X)
* This Week's Buzz
* W&B Free Evals Course with Page Bailey and Graham Beubig - Free Sign Up
* SF evals event - January 11th
* Seattle evals workshop - January 13th

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
- Ouvir Ouvir novamente Continuar A reproduzir…
- Ouvir depois Ouvir depois
📆 ThursdAI - Dec 26 - OpenAI o3 & o3 mini, DeepSeek v3 658B beating Claude, Qwen Visual Reasoning, Hume OCTAVE & more AI news
27 dez 2024· ThursdAI - The top AI news from the past week
Hey everyone, Alex here 👋
I was hoping for a quiet holiday week, but whoa, while the last newsletter was only a week ago, what a looong week it has been, just Friday after the last newsletter, it felt like OpenAI has changed the world of AI once again with o3 and left everyone asking "was this AGI?" over the X-mas break (Hope Santa brought you some great gifts!) and then not to be outdone, DeepSeek open sourced basically a Claude 2.5 level behemoth DeepSeek v3 just this morning!
Since the breaking news from DeepSeek took us by surprise, the show went a bit longer (3 hours today!) than expected, so as a Bonus, I'm going to release a separate episode with a yearly recap + our predictions from last year and for next year in a few days (soon in your inbox!)
TL;DR
* Open Source LLMs
* CogAgent-9B (Project, Github)
* Qwen QvQ 72B - open weights visual reasoning (X, HF, Demo, Project)
* GoodFire Ember - MechInterp API - GoldenGate LLama 70B
* 🔥 DeepSeek v3 658B MoE - Open Source Claude level model at $6M (X, Paper, HF, Chat)
* Big CO LLMs + APIs
* 🔥 OpenAI reveals o3 and o3 mini (Blog, X)
* X.ai raises ANOTHER 6B dollars - on their way to 200K H200s (X)
* This weeks Buzz
* Two W&B workshops upcoming in January
* SF - January 11
* Seattle - January 13 (workshop by yours truly!)
* New Evals course with Paige Bailey and Graham Neubig - pre-sign up for free
* Vision & Video
* Kling 1.6 update (Tweet)
* Voice & Audio
* Hume OCTAVE - 3B speech-language model (X, Blog)
* Tools
* OpenRouter added Web Search Grounding to 300+ models (X)
Open Source LLMs
DeepSeek v3 658B - frontier level open weights model for ~$6M (X, Paper, HF, Chat )
This was absolutely the top of the open source / open weights news for the past week, and honestly maybe for the past month. DeepSeek, the previous quant firm from China, has dropped a behemoth model, a 658B parameter MoE (37B active), that you'd need 8xH200 to even run, that beats Llama 405, GPT-4o on most benchmarks and even Claude Sonnet 3.5 on several evals!
The vibes seem to be very good with this one, and while it's not all the way beating Claude yet, it's nearly up there already, but the kicker is, they trained it with a very restricted compute, per the paper, with ~2K h800 (which is like H100 but with less bandwidth) for 14.8T tokens. (that's 15x cheaper than LLama 405 for comparison)
For evaluations, this model excels on Coding and Math, which is not surprising given how excellent DeepSeek coder has been, but still, very very impressive!
On the architecture front, the very interesting thing is, this feels like Mixture of Experts v2, with a LOT of experts (256) and 8+1 active at the same time, multi token prediction, and a lot optimization tricks outlined in the impressive paper (here's a great recap of the technical details)
The highlight for me was, that DeepSeek is distilling their recent R1 version into this version, which likely increases the performance of this model on Math and Code in which it absolutely crushes (51.6 on CodeForces and 90.2 on MATH-500)
The additional aspect of this is the API costs, and while they are going to raise the prices come February (they literally just swapped v2.5 for v3 in their APIs without telling a soul lol), the price performance for this model is just absurd.
Just a massive massive release from the WhaleBros, now I just need a quick 8xH200 to run this and I'm good 😅
Other OpenSource news - Qwen QvQ, CogAgent-9B and GoldenGate LLama
In other open source news this week, our friends from Qwen have released a very interesting preview, called Qwen QvQ, a visual reasoning model. It uses the same reasoning techniques that we got from them in QwQ 32B, but built with the excellent Qwen VL, to reason about images, and frankly, it's really fun to see it think about an image. You can try it here
and a new update to CogAgent-9B (page), an agent that claims to understand and control your computer, claims to beat Claude 3.5 Sonnet Computer Use with just a 9B model!
This is very impressive though I haven't tried it just yet, I'm excited to see those very impressive numbers from open source VLMs driving your computer and doing tasks for you!
A super quick word from ... Weights & Biases!
We've just opened up pre-registration for our upcoming FREE evaluations course, featuring Paige Bailey from Google and Graham Neubig from All Hands AI. We've distilled a lot of what we learned about evaluating LLM applications while building Weave, our LLM Observability and Evaluation tooling, and are excited to share this with you all! Get on the list
Also, 2 workshops (also about Evals) from us are upcoming, one in SF on Jan 11th and one in Seattle on Jan 13th (which I'm going to lead!) so if you're in those cities at those times, would love to see you!
Big Companies - APIs & LLMs
OpenAI - introduces o3 and o3-mini - breaking Arc-AGI challenge, GQPA and teasing AGI?
On the last day of the 12 days of OpenAI, we've got the evals of their upcoming o3 reasoning model (and o3-mini) and whoah. I think I speak on behalf of most of my peers that we were all shaken by how fast the jump in capabilities happened from o1-preview and o1 full (being released fully just two weeks prior on day 1 of the 12 days)
Almost all evals shared with us are insane, from 96.7 on AIME (from 13.4 with Gpt40 earlier this year) to 87.7 GQPA Diamond (which is... PhD level Science Questions)
But two evals stand out the most, and one of course is the Arc-AGI eval/benchmark. It was designed to be very difficult for LLMs and easy for humans, and o3 solved it with an unprecedented 87.5% (on high compute setting)
This benchmark was long considered impossible for LLMs, and just the absolute crushing of this benchmark for the past 6 months is something to behold:
The other thing I want to highlight is the Frontier Math benchmark, which was released just two months ago by Epoch, collaborating with top mathematicians to create a set of very challenging math problems. At the time of release (Nov 12), the top LLMs solved only 2% of this benchmark. With o3 solving 25% of this benchmark just 3 months after o1 taking 2%, it's quite incredible to see how fast these models are increasing in capabilities.
Is this AGI?
This release absolutely started or restarted a debate of what is AGI, given that, these goal posts move all the time. Some folks are freaking out and saying that if you're a software engineer, you're "cooked" (o3 solved 71.7 of SWE-bench verified and gets 2727 ELO on CodeForces which is competition code, which is 175th global rank among human coders!), some have also calculated its IQ and estimate it to be at 157 based on the above CodeForces rating.
So the obvious question is being asked (among the people who follow the news, most people who don't follow the news could care less) is.. is this AGI? Or is something else AGI?
Well, today we got a very interesting answer to this question, from a leak between a Microsoft and OpenAI negotiation and agreement, in which they have a very clear definition of AGI. "A system generating $100 Billion in profits" - a reminder, per their previous agreement, if OpenAI builds AGI, Microsoft will lose access to OpenAI technologies.
o3-mini and test-time compute as the new scaling law
While I personally was as shaken as most of my peers at these incredible breakthroughs, I was also looking at the more practical and upcoming o3-mini release, which is supposed to come on January this year per Sam Altman. Per their evaluations, o3-mini is going to be significantly cheaper and faster than o3, while offering 3 levels of reasoning effort to developers (low, medium and high) and on medium level, it would beat the current best model (o1) while being cheaper than o1-mini.
All of these updates and improvements in the span of less than 6 months are a testament of just how impressive test-time compute is as our additional new scaling law. Not to mention that current scaling laws still hold, we're waiting for Orion or GPT 4.5 or whatever it's called, and that underlying model will probably significantly improve the reasoning models that are built on top of it.
Also, if the above results from DeepSeek are anything to go by (and they should be), the ability of these reasoning models to generate incredible synthetic training data for the next models is also quite incredible so... flywheel is upon us, models get better and make better models.
Other AI news from this week:
The most impressive other news came from HUME, showcasing OCTAVE - their new 3B speech-language model, which is able to not only fake someone's voice with 5 seconds of audio, but also take on their personality and style of speaking and mannerisms. This is not only a voice model mind you, but a 3B LLM as well, so it can mimic a voice, and even create new voices from a prompt.
While they mentioned the size, the model was not released yet and will be coming to their API soon, and when I asked about open source, it seems that Hume CEO did not think it's a safe bet opening up this kind of tech to the world yet.
I also loved a new little x-mas experiment from OpenRouter and Exa, where-in on the actual OpenRouter interface, you can now chat with over 300 models they serve, and ground answers in search.
This is it for this week, which again, I thought is going to be a very chill one, and .. nope!
The second part of the show/newsletter, in which we did a full recap of the last year, talked about our predictions from last year and did predictions for this next year, is going to drop in a few days 👀 So keep your eyes peeled. (I decided to separate the two, as 3 hour podcast about AI is... long, I'm no Lex Fridman lol)
As always, if you found any of this interesting, please share with a friend, and comment on social media, or right here on Substack, I love getting feedback on what works and what doesn't.
Thank you for being part of the ThursdAI community 👋
ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
- Ouvir Ouvir novamente Continuar A reproduzir…
- Ouvir depois Ouvir depois
🎄ThursdAI - Dec19 - o1 vs gemini reasoning, VEO vs SORA, and holiday season full of AI surprises
20 dez 2024· ThursdAI - The top AI news from the past week
For the full show notes and links visit https://sub.thursdai.news
🔗 Subscribe to our show on Spotify: https://thursdai.news/spotify
🔗 Apple: https://thursdai.news/apple
Ho, ho, holy moly, folks! Alex here, coming to you live from a world where AI updates are dropping faster than Santa down a chimney! 🎅 It's been another absolutely BANANAS week in the AI world, and if you thought last week was wild, and we're due for a break, buckle up, because this one's a freakin' rollercoaster! 🎢
In this episode of ThursdAI, we dive deep into the recent innovations from OpenAI, including their 1-800 ChatGPT phone service and new advancements in voice mode and API functionalities. We discuss the latest updates on O1 model capabilities, including Reasoning Effort settings, and highlight the introduction of WebRTC support by OpenAI. Additionally, we explore the groundbreaking VEO2 model from Google, the generative physics engine Genesis, and new developments in open source models like Cohere's Command R7b. We also provide practical insights on using tools like Weights & Biases for evaluating AI models and share tips on leveraging GitHub Gigi. Tune in for a comprehensive overview of the latest in AI technology and innovation.
00:00 Introduction and OpenAI's 12 Days of Releases
00:48 Advanced Voice Mode and Public Reactions
01:57 Celebrating Tech Innovations
02:24 Exciting New Features in AVMs
03:08 TLDR - ThursdAI December 19
12:58 Voice and Audio Innovations
14:29 AI Art, Diffusion, and 3D
16:51 Breaking News: Google Gemini 2.0
23:10 Meta Apollo 7b Revisited
33:44 Google's Sora and Veo2
34:12 Introduction to Veo2 and Sora
34:59 First Impressions of Veo2
35:49 Comparing Veo2 and Sora
37:09 Sora's Unique Features
38:03 Google's MVP Approach
43:07 OpenAI's Latest Releases
44:48 Exploring OpenAI's 1-800 CHAT GPT
47:18 OpenAI's Fine-Tuning with DPO
48:15 OpenAI's Mini Dev Day Announcements
49:08 Evaluating OpenAI's O1 Model
54:39 Weights & Biases Evaluation Tool - Weave
01:03:52 ArcAGI and O1 Performance
01:06:47 Introduction and Technical Issues
01:06:51 Efforts on Desktop Apps
01:07:16 ChatGPT Desktop App Features
01:07:25 Working with Apps and Warp Integration
01:08:38 Programming with ChatGPT in IDEs
01:08:44 Discussion on Warp and Other Tools
01:10:37 GitHub GG Project
01:14:47 OpenAI Announcements and WebRTC
01:24:45 Modern BERT and Smaller Models
01:27:37 Genesis: Generative Physics Engine
01:33:12 Closing Remarks and Holiday Wishes
Here’s a talking podcast host speaking excitedly about his show
TL;DR - Show notes and Links
* Open Source LLMs
* Meta Apollo 7B – LMM w/ SOTA video understanding (Page, HF)
* Microsoft Phi-4 – 14B SLM (Blog, Paper)
* Cohere Command R 7B – (Blog)
* Falcon 3 – series of models (X, HF, web)
* IBM updates Granite 3.1 + embedding models (HF, Embedding)
* Big CO LLMs + APIs
* OpenAI releases new o1 + API access (X)
* Microsoft makes CoPilot Free! (X)
* Google - Gemini Flash 2 Thinking experimental reasoning model (X, Studio)
* This weeks Buzz
* W&B weave Playground now has Trials (and o1 compatibility) (try it)
* Alex Evaluation of o1 and Gemini Thinking experimental (X, Colab, Dashboard)
* Vision & Video
* Google releases Veo 2 – SOTA text2video modal - beating SORA by most vibes (X)
* HunyuanVideo distilled with FastHunyuan down to 6 steps (HF)
* Kling 1.6 (X)
* Voice & Audio
* OpenAI realtime audio improvements (docs)
* 11labs new Flash 2.5 model – 75ms generation (X)
* Nexa OmniAudio – 2.6B – multimodal local LLM (Blog)
* Moonshine Web – real time speech recognition in the browser (X)
* Sony MMAudio - open source video 2 audio model (Blog, Demo)
* AI Art & Diffusion & 3D
* Genesys – open source generative 3D physics engine (X, Site, Github)
* Tools
* CerebrasCoder – extremely fast apps creation (Try It)
* RepoPrompt to chat with o1 Pro – (download)

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
- Ouvir Ouvir novamente Continuar A reproduzir…
- Ouvir depois Ouvir depois
📆 ThursdAI - Dec 12 - unprecedented AI week - SORA, Gemini 2.0 Flash, Apple Intelligence, LLama 3.3, NeurIPS Drama & more AI news
13 dez 2024· ThursdAI - The top AI news from the past week
Hey folks, Alex here, writing this from the beautiful Vancouver BC, Canada. I'm here for NeurIPS 2024, the biggest ML conferences of the year, and let me tell you, this was one hell of a week to not be glued to the screen.
After last week banger week, with OpenAI kicking off their 12 days of releases, with releasing o1 full and pro mode during ThursdAI, things went parabolic. It seems that all the AI labs decided to just dump EVERYTHING they have before the holidays? 🎅
A day after our show, on Friday, Google announced a new Gemini 1206 that became the #1 leading model on LMarena and Meta released LLama 3.3, then on Saturday Xai releases their new image model code named Aurora.
On a regular week, the above Fri-Sun news would be enough for a full 2 hour ThursdAI show on it's own, but not this week, this week this was barely a 15 minute segment 😅 because so MUCH happened starting Monday, we were barely able to catch our breath, so lets dive into it!
As always, the TL;DR and full show notes at the end 👇 and this newsletter is sponsored by W&B Weave, if you're building with LLMs in production, and want to switch to the new Gemini 2.0 today, how will you know if your app is not going to degrade? Weave is the best way! Give it a try for free.
Gemini 2.0 Flash - a new gold standard of fast multimodal LLMs
Google has absolutely taken the crown away from OpenAI with Gemini 2.0 believe it or not this week with this incredible release. All of us on the show were in agreement that this is a phenomenal release from Google for the 1 year anniversary of Gemini.
Gemini 2.0 Flash is beating Pro 002 and Flash 002 on all benchmarks, while being 2x faster than Pro, having 1M context window, and being fully multimodal!
Multimodality on input and output
This model was announced to be fully multimodal on inputs AND outputs, which means in can natively understand text, images, audio, video, documents and output text, text + images and audio (so it can speak!). Some of these capabilities are restricted for beta users for now, but we know they exists. If you remember project Astra, this is what powers that project. In fact, we had Matt Wolfe join the show, and he demoed had early access to Project Astra and demoed it live on the show (see above) which is powered by Gemini 2.0 Flash.
The most amazing thing is, this functionality, that was just 8 months ago, presented to us in Google IO, in a premium Booth experience, is now available to all, in Google AI studio, for free!
Really, you can try out right now, yourself at https://aistudio.google.com/live but here's a demo of it, helping me proof read this exact paragraph by watching the screen and talking me through it.
Performance out of the box
This model beating Sonnet 3.5 on Swe-bench Verified completely blew away the narrative on my timeline, nobody was ready for that. This is a flash model, that's outperforming o1 on code!?
So having a Flash MMIO model with 1M context that is accessible via with real time streaming option available via APIs from the release time is honestly quite amazing to begin with, not to mention that during the preview phase, this is currently free, but if we consider the previous prices of Flash, this model is going to considerably undercut the market on price/performance/speed matrix.
You can see why this release is taking the crown this week. 👏
Agentic is coming with Project Mariner
An additional thing that was announced by Google is an Agentic approach of theirs is project Mariner, which is an agent in the form of a Chrome extension completing webtasks, breaking SOTA on the WebVoyager with 83.5% score with a single agent setup.
We've seen agents attempts from Adept to Claude Computer User to Runner H, but this breaking SOTA from Google seems very promising. Can't wait to give this a try.
OpenAI gives us SORA, Vision and other stuff from the bag of goodies
Ok so now let's talk about the second winner of this week, OpenAI amazing stream of innovations, which would have taken the crown, if not for, well... ☝️
SORA is finally here (for those who got in)
Open AI has FINALLY released SORA, their long promised text to video and image to video (and video to video) model (nee, world simulator) to general availability, including a new website - sora.com and a completely amazing UI to come with it.
SORA can generate images of various quality from 480p up to 1080p and up to 20 seconds long, and they promised that those will be generating fast, as what they released is actually SORA turbo! (apparently SORA 2 is already in the works and will be even more amazing, more on this later)
New accounts paused for now
OpenAI seemed to have severely underestimated how many people would like to generate the 50 images per month allowed on the plus account (pro account gets you 10x more for $200 + longer durations whatever that means), and since the time of writing these words on ThursdAI afternoon, I still am not able to create a sora.com account and try out SORA myself (as I was boarding a plane when they launched it)
SORA magical UI
I've invited one of my favorite video creators, Blaine Brown to the show, who does incredible video experiments, that always go viral, and had time to play with SORA to tell us what he thinks both from a video perspective and from a interface perspective.
Blaine had a great take that we all collectively got so much HYPE over the past 8 months of getting teased, that many folks expected SORA to just be an incredible text to video 1 prompt to video generator and it's not that really, in fact, if you just send prompts, it's more like a slot machine (which is also confirmed by another friend of the pod Bilawal)
But the magic starts to come when the additional tools like blend are taken into play. One example that Blaine talked about is the Remix feature, where you can Remix videos and adjust the remix strength (Strong, Mild)
Another amazing insight Blaine shared is a that SORA can be used by fusing two videos that were not even generated with SORA, but SORA is being used as a creative tool to combine them into one.
And lastly, just like Midjourney (and StableDiffusion before that), SORA has a featured and a recent wall of video generations, that show you videos and prompts that others used to create those videos with, for inspiration and learning, so you can remix those videos and learn to prompt better + there are prompting extension tools that OpenAI has built in.
One more thing.. this model thinks
I love this discovery and wanted to share this with you, the prompt is "A man smiles to the camera, then holds up a sign. On the sign, there is only a single digit number (the number of 'r's in 'strawberry')"
Advanced Voice mode now with Video!
I personally have been waiting for Voice mode with Video for such a long time, since the that day in the spring, where the first demo of advanced voice mode talked to an OpenAI employee called Rocky, in a very flirty voice, that in no way resembled Scarlet Johannson, and told him to run a comb through his hair.
Well today OpenAI have finally announced that they are rolling out this option soon to everyone, and in chatGPT, we'll all going to have the camera button, and be able to show chatGPT what we're seeing via camera or the screen of our phone and have it have the context.
If you're feeling a bit of a deja-vu, yes, this is very similar to what Google just launched (for free mind you) with Gemini 2.0 just yesterday in AI studio, and via APIs as well.
This is an incredible feature, it will not only see your webcam, it will also see your IOS screen, so you’d be able to reason about an email with it, or other things, I honestly can’t wait to have it already!
They also announced Santa mode, which is also super cool, tho I don’t quite know how to .. tell my kids about it? Do I… tell them this IS Santa? Do I tell them this is an AI pretending to be Santa? Where is the lie end exactly?
And in one of his funniest jailbreaks (and maybe one of the toughest ones) Pliny the liberator just posted a Santa jailbreak that will definitely make you giggle (and him get Coal this X-mas)
The other stuff (with 6 days to go)
OpenAI has 12 days of releases, and the other amazing things we got obviously got overshadowed but they are still cool, Canvas can now run code and have custom GPTs, GPT in Apple Intelligence is now widely supported with the public release of iOS 18.2 and they have announced fine tuning with reinforcement learning, allowing to funetune o1-mini to outperform o1 on specific tasks with a few examples.
There's 6 more work days to go, and they promised to "end with a bang" so... we'll keep you updated!
This weeks Buzz - Guard Rail Genie
Alright, it's time for "This Week's Buzz," our weekly segment brought to you by Weights & Biases! This week I hosted Soumik Rakshit from the Weights and Biases AI Team (The team I'm also on btw!).
Soumik gave us a deep dive into Guardrails, our new set of features in Weave for ensuring reliability in GenAI production! Guardrails serve as a "safety net" for your LLM powered applications, filtering out inputs or llm responses that trigger a certain criteria or boundary.
Types of guardrails include prompt injection attacks, PII leakage, jailbreaking attempts and toxic language as well, but can also include a competitor mention, or selling a product at $0 or a policy your company doesn't have.
As part of developing the guardrails Soumik also developed and open sourced an app to test prompts against those guardrails "Guardrails Genie" and we're going to host it to allow folks to test their prompts against our guardrails, and also are developing it and the guardrails in the open so please check out our Github
Apple iOS 18.2 Apple Intelligence + ChatGPT integration
Apple Intelligence is finally here, you can download it if you have iPhone 15 pro and pro Max and iPhone 16 all series.
If you have one of those phones, you will get the following new additional features that have been in Beta for a while, features like Image Playground with the ability to create images based on your face or faces that you have stored in your photo library.
You can also create GenMoji and those are actually pretty cool!
The highlight and the connection with OpenAI's release is of course the ChatGPT integration, where in if Siri is too dumdum to answer any real AI questions, and let's face it, it's most of the time, a user will get a button and chatGPT will take over upon user approval. This will not require an account!
Grok New Image Generation Codename "Aurora"
Oh, Space Uncle is back at it again! The team at XAI launched its image generation model with the codename "Aurora" and briefly made it public only to pull it and launch it again (this time, the model is simply "Grok"). Apparently, they've trained their own image model from scratch in like three months but they pulled it back a day after, I think because they forgot to add watermarks 😅 but it's still unconfirmed why the removal occurred in the first place, Regardless of the reason, many folks, such as Wolfram, found it was not on the same level as their Flux integration.
It is really good at realism and faces, and is really unrestricted in terms of generating celebrities or TV shows form the 90's or cartoons. They really don't care about copyright.
The model however does appear to generate fairly realistic images with its autoregressive model approach where generation occurs pixel-by-pixel instead of diffusion. But as I said on the show "It's really hard to get a good sense for the community vibe about anything that Elon Musk does because there's so much d**k riding on X for Elon Musk..." Many folks post only positive things on anything X or Xai does in the hopes that space uncle will notice them or reposts them, it's really hard to get an honest "vibes check" on Xai stuff.
All jokes aside we'll hopefully have some better comparisons on sites such as image LmArena who just today launched ImgArena but until that day comes we'll just have to wait and see what other new iterations and announcements follow!
NeurIPS Drama: Best Paper Controversy!
Now, no week in AI would be complete without a little drama. This time around it’s with the biggest machine learning engineering conference of the year, NeurIPS. This year's "Best Paper" award went to a work entitled Visual Auto Aggressive Modeling (VAR). This paper apparently introduced an innovative way to outperform traditional diffusion models when it comes to image generation! Great right? well not so fast because here’s where things get spicy. This is where Keyu Tian comes in, the main author of this work and a former intern of ByteDance who are getting their fair share of the benefits with their co-signing on the paper but their lawsuit may derail its future. ByteDance is currently suing Keyu Tian for a whopping one million dollars citing alleged sabotage on the work in a coordinated series of events that compromised other colleagues work.
Specifically, according to some reports "He modified source code to changes random seeds and optimizes which, uh, lead to disrupting training processes...Security attacks. He gained unauthorized access to the system. Login backdoors to checkpoints allowing him to launch automated attacks that interrupted processes to colleagues training jobs." Basically, they believe that he "gained unauthorized access to the system" and hacked other systems. Now the paper is legit and it introduces potentially very innovative solutions but we have an ongoing legal situation. Also to note is despite firing him they did not withdraw the paper which could speak volumes to its future! As always, if it bleeds, it leads and drama is usually at the top of the trends, so definitely a story that will stay in everyone's mind when they look back at NeurIPS this year.
Phew.. what a week folks, what a week!
I think with 6 more days of OpenAI gifts, there's going to be plenty more to come next week, so share this newsletter with a friend or two, and if you found this useful, consider subscribing to our other channels as well and checkout Weave if you've building with GenAI, it's really helpful!
TL;DR and show notes
* Meta llama 3.3 (X, Model Card)
* OpenAI 12 days of Gifts (Blog)
* Apple ios 18.2 - Image Playground, GenMoji, ChatGPT integration (X)
* 🔥 Google Gemini 2.0 Flash - the new gold standard of LLMs (X, AI Studio)
* Google Project Mariner - Agent that browsers for you (X)
* This weeks Buzz - chat with Soumik Rakshit from AI Team at W&B (Github)
* NeurIPS Drama - Best Paper Controversy - VAR author is sued by ByteDance (X, Blog)
* Xai new image generation codename Aurora (Blog)
* Cognition launched Devin AI developer assistant - $500/mo
* LMArena launches txt2img Arena for Diffusion models (X)

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
- Ouvir Ouvir novamente Continuar A reproduzir…
- Ouvir depois Ouvir depois
📆 ThursdAI - Dec 5 - OpenAI o1 & o1 pro, Tencent HY-Video, FishSpeech 1.5, Google GENIE2, Weave in GA & more AI news
6 dez 2024· ThursdAI - The top AI news from the past week
Well well well, December is finally here, we're about to close out this year (and have just flew by the second anniversary of chatGPT 🎂) and it seems that all of the AI labs want to give us X-mas presents to play with over the holidays!
Look, I keep saying this, but weeks are getting crazier and crazier, this week we got the cheapest and the most expensive AI offerings all at once (the cheapest from Amazon and the most expensive from OpenAI), 2 new open weights models that beat commercial offerings, a diffusion model that predicts the weather and 2 world building models, oh and 2 decentralized fully open sourced LLMs were trained across the world LIVE and finished training. I said... crazy week!
And for W&B, this week started with Weave launching finally in GA 🎉, which I personally was looking forward for (read more below)!
TL;DR Highlights
* OpenAI O1 & Pro Tier: O1 is out of preview, now smarter, faster, multimodal, and integrated into ChatGPT. For heavy usage, ChatGPT Pro ($200/month) offers unlimited calls and O1 Pro Mode for harder reasoning tasks.
* Video & Audio Open Source Explosion: Tencent’s HYVideo outperforms Runway and Luma, bringing high-quality video generation to open source. Fishspeech 1.5 challenges top TTS providers, making near-human voice available for free research.
* Open Source Decentralization: Nous Research’s DiStRo (15B) and Prime Intellect’s INTELLECT-1 (10B) prove you can train giant LLMs across decentralized nodes globally. Performance is on par with centralized setups.
* Google’s Genie 2 & WorldLabs: Generating fully interactive 3D worlds from a single image, pushing boundaries in embodied AI and simulation. Google’s GenCast also sets a new standard in weather prediction, beating supercomputers in accuracy and speed.
* Amazon’s Nova FMs: Cheap, scalable LLMs with huge context and global language coverage. Perfect for cost-conscious enterprise tasks, though not top on performance.
* 🎉 Weave by W&B: Now in GA, it’s your dashboard and tool suite for building, monitoring, and scaling GenAI apps. Get Started with 1 line of code
OpenAI’s 12 Days of Shipping: O1 & ChatGPT Pro
The biggest splash this week came from OpenAI. They’re kicking off “12 days of launches,” and Day 1 brought the long-awaited full version of o1. The main complaint about o1 for many people is how slow it was! Well, now it’s not only smarter but significantly faster (60% faster than preview!), and officially multimodal: it can see images and text together.
Better yet, OpenAI introduced a new ChatGPT Pro tier at $200/month. It offers unlimited usage of o1, advanced voice mode, and something called o1 pro mode — where o1 thinks even harder and longer about your hardest math, coding, or science problems. For power users—maybe data scientists, engineers, or hardcore coders—this might be a no-brainer. For others, 200 bucks might be steep, but hey, someone’s gotta pay for those GPUs. Given that OpenAI recently confirmed that there are now 300 Million monthly active users on the platform, and many of my friends already upgraded, this is for sure going to boost the bottom line at OpenAI!
Quoting Sam Altman from the stream, “This is for the power users who push the model to its limits every day.” For those who complained o1 took forever just to say “hi,” rejoice: trivial requests will now be answered quickly, while super-hard tasks get that legendary deep reasoning including a new progress bar and a notification when a task is complete. Friend of the pod Ray Fernando gave pro a prompt that took 7 minutes to think through!
I've tested the new o1 myself, and while I've gotten dangerously close to my 50 messages per week quota, I've gotten some incredible results already, and very fast as well. This ice-cubes question failed o1-preview and o1-mini and it took both of them significantly longer, and it took just 4 seconds for o1.
Open Source LLMs: Decentralization & Transparent Reasoning
Nous Research DiStRo & DeMo Optimizer
We’ve talked about decentralized training before, but the folks at Nous Research are making it a reality at scale. This week, Nous Research wrapped up the training of a new 15B-parameter LLM—codename “Psyche”—using a fully decentralized approach called “Nous DiStRo.” Picture a massive AI model trained not in a single data center, but across GPU nodes scattered around the globe. According to Alex Volkov (host of ThursdAI), “This is crazy: they’re literally training a 15B param model using GPUs from multiple companies and individuals, and it’s working as well as centralized runs.”
The key to this success is “DeMo” (Decoupled Momentum Optimization), a paper co-authored by none other than Diederik Kingma (yes, the Kingma behind Adam optimizer and VAEs). DeMo drastically reduces communication overhead and still maintains stability and speed. The training loss curve they’ve shown looks just as good as a normal centralized run, proving that decentralized training isn’t just a pipe dream. The code and paper are open source, and soon we’ll have the fully trained Psyche model. It’s a huge step toward democratizing large-scale AI—no more waiting around for Big Tech to drop their weights. Instead, we can all chip in and train together.
Prime Intellect INTELLECT-1 10B: Another Decentralized Triumph
But wait, there’s more! Prime Intellect also finished training their 10B model, INTELLECT-1, using a similar decentralized setup. INTELLECT-1 was trained with a custom framework that reduces inter-GPU communication by 400x. It’s essentially a global team effort, with nodes from all over the world contributing compute cycles.
The result? A model hitting performance similar to older Meta models like Llama 2—but fully decentralized.
Ruliad DeepThought 8B: Reasoning You Can Actually See
If that’s not enough, we’ve got yet another open-source reasoning model: Ruliad’s DeepThought 8B. This 8B parameter model (finetuned from LLaMA-3.1) from friends of the show FarEl, Alpin and Sentdex 👏
Ruliad’s DeepThought attempts to match or exceed performance of much larger models in reasoning tasks (beating several 72B param models while being 8B itself) is very impressive.
Google is firing on all cylinders this week
Google didn't stay quiet this week as well, and while we all wait for the Gemini team to release the next Gemini after the myriad of very good experimental models recently, we've gotten some very amazing things this week.
Google’s PaliGemma 2 - finetunable SOTA VLM using Gemma
PaliGemma v2, a new vision-language family of models (3B, 10B and 33B) for 224px, 448px, 896px resolutions are a suite of base models, that include image segmentation and detection capabilities and are great at OCR which make them very versatile for fine-tuning on specific tasks.
They claim to achieve SOTA on chemical formula recognition, music score recognition, spatial reasoning, and chest X-ray report generation!
Google GenCast SOTA weather prediction with... diffusion!?
More impressively, Google DeepMind released GenCast, a diffusion-based model that beats the state-of-the-art ENS system in 97% of weather predictions. Did we say weather predictions? Yup.
Generative AI is now better at weather forecasting than dedicated physics based deterministic algorithms running on supercomputers. Gencast can predict 15 days in advance in just 8 minutes on a single TPU v5, instead of hours on a monstrous cluster. This is mind-blowing. As Yam said on the show, “Predicting the world is crazy hard” and now diffusion models handle it with ease.
W&B Weave: Observability, Evaluation and Guardrails now in GA
Speaking of building and monitoring GenAI apps, we at Weights & Biases (the sponsor of ThursdAI) announced that Weave is now GA. Weave is a developer tool for evaluating, visualizing, and debugging LLM calls in production. If you’re building GenAI apps—like a coding agent or a tool that processes thousands of user requests—Weave helps you track costs, latency, and quality systematically.
We showcased two internal apps: Open UI (a website builder from a prompt) and Winston (an AI agent that checks emails, Slack, and more). Both rely on Weave to iterate, tune prompts, measure user feedback, and ensure stable performance. With O1 and other advanced models coming to APIs soon, tools like Weave will be crucial to keep those applications under control.
If you follow this newsletter and develop with LLMs, now is a great way to give Weave a try
Open Source Audio & Video: Challenging Proprietary Models
Tencent’s HY Video: Beating Runway & Luma in Open Source
Tencent came out swinging with their open-source model, HYVideo. It’s a video model that generates incredible realistic footage, camera cuts, and even audio—yep, Foley and lip-synced character speech. Just a single model doing text-to-video, image-to-video, puppeteering, and more. It even outperforms closed-source giants like Runway Gen 3 and Luma 1.6 on over 1,500 prompts.
This is the kind of thing we dreamed about when we first heard of video diffusion models. Now it’s here, open-sourced, ready for tinkering. “It’s near SORA-level,” as I mentioned, referencing OpenAI’s yet-to-be-fully-released SORA model. The future of generative video just got more accessible, and competitors should be sweating right now. We may just get SORA as one of the 12 days of OpenAI releases!
FishSpeech 1.5: Open Source TTS Rivaling the Big Guns
Not just video—audio too. FishSpeech 1.5 is a multilingual, zero-shot voice cloning model that ranks #2 overall on TTS benchmarks, just behind 11 Labs. This is a 500M-parameter model, trained on a million hours of audio, achieving near-human quality, fast inference, and open for research.
This puts high-quality text-to-speech capabilities in the open-source community’s hands. You can now run a top-tier TTS system locally, clone voices, and generate speech in multiple languages with low latency. No more relying solely on closed APIs. This is how open-source chases—and often catches—commercial leaders.
If you’ve been longing for near-instant voice cloning on your own hardware, this is the model to go play with!
Creating World Models: Genie 2 & WorldLabs
Fei Fei Li’s WorldLabs: Images to 3D Worlds
WorldLabs, founded by Dr. Fei Fei Li, showcased a mind-boggling demo: turning a single image into a walkable 3D environment. Imagine you take a snapshot of a landscape, load it into their system, and now you can literally walk around inside that image as if it were a scene in a video game. “I can literally use WASD keys and move around,” Alex commented, clearly impressed.
It’s not perfect fidelity yet, but it’s a huge leap toward generating immersive 3D worlds on the fly. These tools could revolutionize virtual reality, gaming, and simulation training. WorldLabs’ approach is still in early stages, but what they demonstrated is nothing short of remarkable.
Google’s Genie 2: Playable Worlds from a Single Image
If WorldLabs’s 3D environment wasn’t enough, Google dropped Genie 2. Take an image generated by Imagen 3, feed it into Genie 2, and you get a playable world lasting up to a minute. Your character can run, objects have physics, and the environment is consistent enough that if you leave an area and return, it’s still there.
As I said on the pod, “It looks like a bit of Doom, but generated from a single static image. Insane!” The model simulates complex interactions—think water flowing, balloons bursting—and even supports long-horizon memory. This could be a goldmine for AI-based game development, rapid prototyping, or embodied agent training.
Amazon’s Nova: Cheaper LLMs, Not Better LLMs
Amazon is also throwing their hat in the ring with the Nova series of foundational models. They’ve got variants like Nova Micro, Lite, Pro, and even a Premier tier coming in 2025. The catch? Performance is kind of “meh” compared to Anthropic or OpenAI’s top models, but Amazon is aiming to be the cheapest high-quality LLM among the big players. With a context window of up to 300K tokens and 200+ language coverage, Nova could find a niche, especially for those who want to pay less per million tokens.
Nova Micro costs around 3.5 cents per million input tokens and 14 cents per million output tokens—making it dirt cheap to process massive amounts of data. Although not a top performer, Amazon’s approach is: “We may not be best, but we’re really cheap and we scale like crazy.” Given Amazon’s infrastructure, this could be compelling for enterprises looking for cost-effective large-scale solutions.
Phew, this was a LONG week with a LOT of AI drops, and NGL, o1 actually helped me a bit for this newsletter, I wonder if you can spot the places where o1 wrote some of the text using a the transcription of the show and the outline as guidelines and the previous newsletter as a tone guide and where I wrote it myself?
Next week, NEURIPS 2024, the biggest ML conference in the world, I'm going to be live streaming from there, so if you're at the conference, come by booth #404 and say hi! I'm sure there will be a TON of new AI updates next week as well!
Show Notes & Links
TL;DR of all topics covered:
* This weeks Buzz
* Weights & Biases announces Weave is now in GA 🎉(wandb.me/tryweave)
* Tracing LLM calls
* Evaluation & Playground
* Human Feedback integration
* Scoring & Guardrails (in preview)
* Open Source LLMs
* DiStRo & DeMo from NousResearch - decentralized DiStRo 15B run (X, watch live, Paper)
* Prime Intellect - INTELLECT-1 10B decentralized LLM (Blog, watch)
* Ruliad DeepThoutght 8B - Transparent reasoning model (LLaMA-3.1) w/ test-time compute scaling (X, HF, Try It)
* Google GenCast - diffusion model SOTA weather prediction (Blog)
* Google open sources PaliGemma 2 (X, Blog)
* Big CO LLMs + APIs
* Amazon announces Nova series of FM at AWS (X)
* Google GENIE 2 creates playable worlds from a picture! (Blog)
* OpenAI 12 days started with o1 full and o1 pro and pro tier $200/mo (X, Blog)
* Vision & Video
* Tencent open sources HY Video - beating Luma & Runway (Blog, Github, Paper, HF)
* Runway video keyframing prototype (X)
* Voice & Audio
* FishSpeech V1.5 - multilingual, zero-shot instant voice cloning, low-latency, open text to speech model (X, Try It)
* Eleven labs - real time audio agents builder (X)

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
- Ouvir Ouvir novamente Continuar A reproduzir…
- Ouvir depois Ouvir depois
🦃 ThursdAI - Thanksgiving special 24' - Qwen Open Sources Reasoning, BlueSky hates AI, H controls the web & more AI news
28 nov 2024· ThursdAI - The top AI news from the past week
Hey ya'll, Happy Thanskgiving to everyone who celebrates and thank you for being a subscriber, I truly appreciate each and every one of you!
We had a blast on today's celebratory stream, especially given that today's "main course" was the amazing open sourcing of a reasoning model from Qwen, and we had Junyang Lin with us again to talk about it! First open source reasoning model that you can run on your machine, that beats a 405B model, comes close to o1 on some metrics 🤯
We also chatted about a new hybrid approach from Nvidia called Hymba 1.5B (Paper, HF) that beats Qwen 1.5B with 6-12x less training, and Allen AI releasing Olmo 2, which became the best fully open source LLM 👏 (Blog, HF, Demo), though they didn't release WandB logs this time, they did release data!
I encourage you to watch todays show (or listen to the show, I don't judge), there's not going to be a long writeup like I usually do, as I want to go and enjoy the holiday too, but of course, the TL;DR and show notes are right here so you won't miss a beat if you want to use the break to explore and play around with a few things!
ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.
TL;DR and show notes
* Qwen QwQ 32B preview - the first open weights reasoning model (X, Blog, HF, Try it)
* Allen AI - Olmo 2 the best fully open language model (Blog, HF, Demo)
* NVIDIA Hymba 1.5B - Hybrid smol model beating Qwen, SmolLM w/ 6-12x less training (X, Paper, HF)
* Big CO LLMs + APIs
* Anthropic MCP - model context protocol (X,Blog, Spec, Explainer)
* Cursor, Jetbrains now integrate with ChatGPT MacOS app (X)
* Xai is going to be a Gaming company?! (X)
* H company shows Runner H - WebVoyager Agent (X, Waitlist)
* This weeks Buzz
* Interview w/ Thomas Cepelle about Weave scorers and guardrails (Guide)
* Vision & Video
* OpenAI SORA API was "leaked" on HuggingFace (here)
* Runway launches video Expand feature (X)
* Rhymes Allegro-TI2V - updated image to video model (HF)
* Voice & Audio
* OuteTTS v0.2 - 500M smol TTS with voice cloning (Blog, HF)
* AI Art & Diffusion & 3D
* Runway launches an image model called Frames (X, Blog)
* ComfyUI Desktop app was released 🎉
* Chat
* 24 hours of AI hate on 🦋 (thread)
* Tools
* Cursor agent (X thread)
* Google Generative Chess toy (Link)
See you next week and happy Thanks Giving 🦃
Thanks for reading ThursdAI - Recaps of the most high signal AI weekly spaces! This post is public so feel free to share it.
Full Subtitles for convenience
[00:00:00] Alex Volkov: let's get it going.
[00:00:10] Alex Volkov: Welcome, welcome everyone to ThursdAI November 28th Thanksgiving special. My name is Alex Volkov. I'm an AI evangelist with Weights Biases. You're on ThursdAI. We are live [00:00:30] on ThursdAI. Everywhere pretty much.
[00:00:32] Alex Volkov:
[00:00:32] Hosts and Guests Introduction
[00:00:32] Alex Volkov: I'm joined here with two of my co hosts.
[00:00:35] Alex Volkov: Wolfram, welcome.
[00:00:36] Wolfram Ravenwolf: Hello everyone! Happy Thanksgiving!
[00:00:38] Alex Volkov: Happy Thanksgiving, man.
[00:00:39] Alex Volkov: And we have Junyang here. Junyang, welcome, man.
[00:00:42] Junyang Lin: Yeah, hi everyone. Happy Thanksgiving. Great to be here.
[00:00:46] Alex Volkov: You had a busy week. We're going to chat about what you had. I see Nisten joining us as well at some point.
[00:00:51] Alex Volkov: Yam pe joining us as well. Hey, how, Hey Yam. Welcome. Welcome, as well. Happy Thanksgiving. It looks like we're assembled folks. We're across streams, across [00:01:00] countries, but we are.
[00:01:01] Overview of Topics for the Episode
[00:01:01] Alex Volkov: For November 28th, we have a bunch of stuff to talk about. Like really a big list of stuff to talk about. So why don't we just we'll just dive in. We'll just dive in. So obviously I think the best and the most important.
[00:01:13] DeepSeek and Qwen Open Source AI News
[00:01:13] Alex Volkov: Open source kind of AI news to talk about this week is going to be, and I think I remember last week, Junyang, I asked you about this and you were like, you couldn't say anything, but I asked because last week, folks, if you remember, we talked about R1 from DeepSeek, a reasoning model from [00:01:30] DeepSeek, which really said, Oh, maybe it comes as a, as open source and maybe it doesn't.
[00:01:33] Alex Volkov: And I hinted about, and I asked, Junyang, what about some reasoning from you guys? And you couldn't say anything. so this week. I'm going to do a TLDR. So we're going to actually talk about the stuff that, you know, in depth a little bit later, but this week, obviously one of the biggest kind of open source or sorry, open weights, and news is coming from our friends at Qwen as well, as we always celebrate.
[00:01:56] Alex Volkov: So one of the biggest things that we get as. [00:02:00] is, Qwen releases, I will actually have you tell me what's the pronunciation here, Junaid, what is, I say Q W Q or maybe quick, what is the pronunciation of this?
[00:02:12] Junyang Lin: I mentioned it in the blog, it is just like the word quill. Yeah. yeah, because for the qw you can like work and for the q and you just like the U, so I just combine it together and create a new pronunciation called Quill.
[00:02:28] Junyang Lin: Yeah.
[00:02:28] Alex Volkov: So we're saying it's Quin [00:02:30] Quill 32 B. Is that the right pronunciation to say this?
[00:02:33] Junyang Lin: Yeah, it's okay. I would just call it qui quill. It is, some something funny because,the ca the characters look very funny. Oh, we have a subculture,for these things. Yeah. Just to express some, yeah.
[00:02:46] Junyang Lin: our. feelings.
[00:02:49] Alex Volkov: Amazing. Qwen, Quill, 32B, and it's typed,the name is typed QWQ, 32Breview. This is the first OpenWeights reasoning model. This [00:03:00] model is not only predicting tokens, it's actually doing reasoning behind this. What this means is we're going to tell you what this means after we get to this.
[00:03:07] Alex Volkov: So we're still in the, we're still in the TLDR area. We also had. Another drop from Alien Institute for AI, if you guys remember last week we chatted with Nathan, our dear friend Nathan, from Alien Institute about 2. 0. 3, about their efforts for post training, and he gave us all the details about post training, so they released 2.
[00:03:28] Alex Volkov: 0. 3, this week they released Olmo 2. [00:03:30] 0. We also talked about Olmo with the friends from Alien Institute a couple of months ago, and now they released Olmo 2. 0. Which they claim is the best fully open sourced, fully open sourced language models, from Allen Institute for AI.and, so we're going to chat about, Olmo a little bit as well.
[00:03:46] Alex Volkov: And last minute addition we have is NVIDIA Haimba, which is a hybrid small model from NVIDIA, very tiny one, 1. 5 billion parameters. small model building Qwen and building small LLM as well. this is in the area [00:04:00] of open source. I
[00:04:01] Alex Volkov: Okay, in the big companies, LLMs and APIs, I want to run through a few things.
[00:04:06] Anthropic's MCP and ChatGPT macOS Integrations
[00:04:06] Alex Volkov: So first of all, Anthropic really something called MCP. It's a, something they called Model Context Protocol. We're going to briefly run through this. It's a, it's a kind of a release from them that's aimed for developers is a protocol that enables secure connections between a host application, like a cloud desktop, for example,
[00:04:24] Alex Volkov: there's also a bunch of new integrations for the ChatGPT macOS app. If you guys remember a couple of [00:04:30] weeks ago, We actually caught this live.
[00:04:31] Alex Volkov: I refreshed my MacOS app and there's ta da, there's a new thing. And we discovered this live. It was very fun. The MacOS app for ChatGPT integrates with VS Code, et cetera. and so we tried to run this with Cursor. It didn't work. So now it works with Cursor,
[00:04:43] Wolfram Ravenwolf:
[00:04:43] Alex Volkov: So the next thing we're going to look at, I don't know if it's worth mentioning, but you guys know the XAI, the company that Elon Musk is raising another 6 billion for that tries to compete with OpenAI
[00:04:54] Alex Volkov: Do you guys hear that it's going to be a gaming company as well? I don't know if it's worth talking about, but we'll at least [00:05:00] mention this. And the one thing that I wanted to chat about is H, the French company, H that showed a runner that looks. Three times as fast and as good as the Claude computer use runner, and we're definitely going to show examples of this, video live because that looks just incredible.
[00:05:18] Alex Volkov: this out of nowhere company, the biggest fundraise or the biggest seed round that Europe has ever seen, at least French has ever seen, just show they, An agent that controls your [00:05:30] computer that's tiny, ridiculously tiny, I think it's like the three billion parameter, two billion parameter or something.
[00:05:36] Alex Volkov: And it runs way better than computer, cloud computer use. Something definitely worth talking about. after with, after which in this week's Bars, we're going to talk with Thomas Capelli, from, from my team at Weights Biases. about LLM guardrails, that's gonna be fun. and in vision video category, we're gonna cover that OpenAI Sora quote unquote leaked, this week.
[00:05:56] Alex Volkov: And this leak wasn't really a leak, but, definitely [00:06:00] we saw some stuff. and then there's also a new expand feature that we saw in, Runway. And we saw another video model from, Rhymes called Allegro TIV2. which is pretty cool in voice and audio. If we get there in voice and audio, we saw out TTS vision 0.
[00:06:19] Alex Volkov: 2, which is a new TTS, a 500 million parameter, small TTS you can run in your browser and sounds pretty dope.art in the fusion, super quick runway launches an image [00:06:30] model. Yep, Runway, the guys who do video, they launched an image model that looks pretty sick, and we're definitely going to look at some examples of this, and Confi UI Desktop, for those of you who are celebrating something like this, Confi UI now is runnable with desktop, and there's a bunch of tool stuff, but honestly, I can talk about two things.
[00:06:47] Alex Volkov: Tools and there's a cool thing with Google generative chess toy. I can show you this so you can show your folks in Thanksgiving and going to impress them with a generative chess toy. But honestly, instead of this, I would love to chat about the thing that [00:07:00] some of us saw on the other side of the social media networks.
[00:07:04] Alex Volkov: And definitely we'll chat about this, for the past 24 hours. So chat, for the past. 24 hours, on BlueSky, we saw a little bit of a mob going against the Hug Face folks and then, other friends of ours on,from the AI community and the anti AI mob on BlueSky. So we're going to chat about that.
[00:07:26] Alex Volkov: And hopefully give you our feelings about what's going on, about this [00:07:30] world. And this is a pro AI show. And when we see injustice happens against ai, we have to speak out about against this. And I think that this is mostly what we're gonna cover this show, unless this is.
[00:07:42] Wolfram Ravenwolf: Where I could insert the two things I have.
[00:07:44] Wolfram Ravenwolf: One is a tool, which is the AI video composer, which, allows you to talk to, ff mpac, which is a complicated comment line tool, but very powerful. And so you have a UI where you just use natural language to control the tool. So that is one tool. Maybe we get to [00:08:00] it, if not just Google or ask for Plexity or anything.
[00:08:03] Alex Volkov: No, we'll drop it in. Yeah, we'll drop it in show notes, absolutely.
[00:08:04] Wolfram Ravenwolf: Yeah, that's the best part. Okay. And echo mimic. Version 2 is also an HN Synthesia alternative for local use, which is also, yeah, a great open source local runnable tool.
[00:08:17] Alex Volkov: What do we call this? EcoMimic?
[00:08:19] Wolfram Ravenwolf: EcoMimic. EcoMimic
[00:08:21] Alex Volkov: v2.
[00:08:21] Wolfram Ravenwolf: EcoMimic
[00:08:23] Alex Volkov: 2.
[00:08:24] Alex Volkov: Alright, we have a special guest here that we're gonna add Alpin. Hey Alpen, [00:08:30] welcome, feel free to stay anonymous and don't jump, we're gonna start with open source AI and then we're gonna chat with you briefly about the experience you had.
[00:08:38] Alpin Dale: hello everyone.
[00:08:39] Alex Volkov: Hey man. Yeah, you've been on the show before, right Alton? You've been on the show.
[00:08:43] Alpin Dale: a few times, yeah. it's nice to be back here again.
[00:08:46] Alex Volkov: Yeah. Alton, we're gonna get, we're gonna chat with you soon, right? We're gonna start with open source. We need to go to Junyang and talk about reasoning models.
[00:08:52] Alex Volkov: so feel free to stay with us. And then I definitely want to hear about some of the stuff we're going to cover after open source. We're going to cover the [00:09:00] anti AI mob over there.
[00:09:05] Alex Volkov: Alrighty folks, it's time to start with the,with the corner we love the most, yeah? let's dive into this. Let's dive in straight to Open Source AI.
[00:09:29] Alex Volkov: Open Source AI, [00:09:30] let's get it started. Let's start it.
[00:09:35] Alex Volkov: Okay, folks, so open source this week, we're going to get, let me cover the other two things super quick before we dive in.
[00:09:43] NVIDIA Haimba Hybrid Model Discussion
[00:09:43] Alex Volkov: Alright, so I want to like briefly cover the Haimba paper super quick, because we're going to get the least interesting stuff out of the way so we can focus on the main topic. Course, NVIDIA released Heimbar 1. 5 parameters. Heimbar is a hybrid small model, from NVIDIA. We talked about hybrid models [00:10:00] multiple times before.
[00:10:00] Alex Volkov: we have our friend of the pod, LDJ here. He loves talking about hybrid models. He actually brought this to our attention in the, in, in the group chat. We talked about, you guys know the Transformer, we love talking about the Transformer. Haimba specifically is a hybrid model between Transformer and I think they're using a hybrid attention with Mamba layers in parallel.
[00:10:22] Alex Volkov: they claim they're beating Lama and Qwen and SmallLM with 6 to 12 times less training as well. Let's look [00:10:30] at the, let's look at their, let's look at their X.so this is what they're, this is what they're showing, this is the table they're showing some impressive numbers, the interesting thing is, this is a table of comparison that they're showing, and in this table of comparison, the comparison is not only Evaluations.
[00:10:47] Alex Volkov: The comparison they're showing is also cache size and throughput, which I like. it's do you guys know what this reminds me of? This reminds me of when you have a electric vehicle [00:11:00] and you have a gas based vehicle or standard combustion engine vehicle, and then they compare the electric vehicle and acceleration.
[00:11:07] Alex Volkov: It's Oh, our car is faster. But you get this by default, you get the acceleration by default with all the electric vehicles. This is how the model works. This is how those model works. So for me, when you compare like hybrid models, or, non transformer based models, a Mamba based models, the throughput speed up is generally faster because of it.
[00:11:29] Alex Volkov: [00:11:30] But definitely the throughput is significantly higher. Tokens per second. is significantly higher. So for comparison for folks who are listening to us, just so you, you'll hear the comparison, the throughput for this 1. 5 billion model is 664 tokens per second versus a small LM 238 tokens per second, or something like Qwen 1.
[00:11:54] Alex Volkov: 5 at 400. So 600 versus 400. the training cost in [00:12:00] tokens, they say this was, 1. 5 trillion tokens versus Qwen at 18. I don't know if Junyang you want to confirm or deny the 18 mentioned here that they added. Sometimes they, they say different things, but yeah, definitely the highlight of this Heimwehr thing.
[00:12:14] Alex Volkov: And this is from NVIDIA, by the way, I think it's very worth like shouting out that this specific thing comes from this model comes from NVIDIA. Um,they specifically mentioned that the cost, And outperformance of this model comes at 6 to 12 times less [00:12:30] training, which is very impressive.
[00:12:31] Alex Volkov: what else about this model? Performance wise, MMLU at 52, which is lower than Qwen at 59, at, at 1. 5 billion parameters. GSM 8K, we know the GSM 8K is not that interesting anymore, I think, at this point. We're not like over, we're not over, we're not looking at this like too much. What else should we say about this model?
[00:12:52] Alex Volkov: GPK is pretty interesting at 31. GPK is usually knowledge versus something. [00:13:00] Anything else to say about this model? Yeah, you have anything to say Nisten? Anything to say about the small models? About the hybrid model specifically? I know that like our friend LDJ said that like this seems like the first actual model that competes apples to apples.
[00:13:13] Alex Volkov: Because usually when we compare Hybrid models specifically, those usually people say that those are not like necessarily one to one comparisons between hybrid models and just formal models.
[00:13:24] Nisten Tahiraj: I was just going to say that fromfrom NVIDIA, we've heard these [00:13:30] claims before and they didn't quite turn out that way, so I'm going to start off a little bit more skeptical on that end. also from, from the Mistral Mamba, Mambastral, that one was not very performant.
[00:13:44] Nisten Tahiraj: it seemed like it was going to be good for long context stuff. The runtime wasn't that good as well. yeah, I'm going to give this one a test because. Again, the promise of, of like hybrid, SSM models is that it can do better [00:14:00] in longer contexts and it can run faster. So it is worth testing given what, what they're claiming.
[00:14:06] Nisten Tahiraj: But, again, on MMLU, it didn't do that well, but, yeah, overall the numbers do look great actually for what it is, but I think we do need to do further testing on this, whether it is practically. That's good. Because I'm not sure how well it's going to hold up after you just throw like 32k of context of it.
[00:14:25] Nisten Tahiraj: I guess it's going to remember all that, but, yeah, this on paper, this does [00:14:30] look like it's one of the first ones that is Applesauce.
[00:14:33] Alex Volkov: Yeah. All right. anything else to say here? Yeah, the architecture. Jan, go ahead.
[00:14:39] Yam Peleg: Yeah, about the architecture. I tweeted about it.It is, I think it has extreme potential and, it might, I just by looking at the attention maps, from the paper, like just a glimpse is enough for you to see that.
[00:14:55] Yam Peleg: They really do solve something really profound [00:15:00] with many of the models that we have today. basically, I'm really simplifying here, but basically, when you look at the Attention versus Mamba, they act very differently in terms of how they process the tokens, sliding window ones, you could say.
[00:15:20] Yam Peleg: And of course self attention is like global, to everything, but Mamba is not exactly global, it's sequential, and sliding window is also not exactly [00:15:30] global, but it's not the same sequential, it's like everything to everything, but with a window. So what they did is combine the two, and you can really see the difference in attention map of the trained model.
[00:15:44] Yam Peleg: it's not exactly the same as just, hybrid Mamba attention models that we all saw before.there is a lot to this model and I really want to see one of those. I just [00:16:00] trained for like at scale, like a large one on, on, on a huge data set, because I think it might be an improvement to either,just by looking at the way the model learned, but you cannot know until you actually try.
[00:16:15] Yam Peleg: I tweeted about it just like briefly. So if you want to go and look at, I'm just, I'm just pointing out that go and check the paper out because the architecture is unique. There is, there is a reason the model is, for its size, very performant. [00:16:30]
[00:16:30] Alex Volkov: Yeah, I'm gonna add your tweet.
[00:16:31] Alex Volkov: All right, folks, time for us to move to the second thing.
[00:16:36] Allen Institute's Olmo 2.0 Release
[00:16:36] Alex Volkov: The folks at Allen AI, surprises with another release this week, and they have, as always they do, they say, hey folks, we divide the categories of open source to not open source at all, then somewhat open weights maybe, and then fully open source, the folks who release the checkpoints, the data, the, the training code.
[00:16:57] Alex Volkov: I will say this, they used to release Weights [00:17:00] Biases logs as well, and they stopped. So if somebody listens to the show from LMAI, as I know they do, folks, what's up with the Weights Biases logs? We know, and we love them, so please release the Weights Biases logs again. but, they released Olmo 2.
[00:17:14] Alex Volkov: Congrats, folks, for releasing Olmo 2. Let me actually do the clap as well. Yay!Olmo 2 is, they claim, is, they claim,the best open, fully open language model to date, and they show this nice graph as well, where, they released two models, Olmo [00:17:30] 2. 7b and Olmo 2. 13b, and they cite multiple things, to, to attribute for the best performance here.
[00:17:37] Alex Volkov: specifically the training stability, they ran this for a significant longer before. they cite some of the recipes of. What we talked about last week from TULU3 methodology, the kind of the state of the art post training methodology from TULU3 that we've talked with Nathan last week, specifically the verifiable framework, thing that we've talked about, multiple other technical things like rate [00:18:00] annealing and the data curriculum.
[00:18:01] Alex Volkov: And obviously they're focusing on their data. they have their, Ohm's selection of tasks on which they compared these models and,the breakdown that I told you about that they do is the open weights models, partially open models, and then fully open models. So this is the breakdown that they have in the area of open weights models.
[00:18:18] Alex Volkov: They have Lama 2. 13b and Mistral 7b, for example, they put Qwen in there as well. So Qwen 2. 57 and 14. And the partially open models, they put Zamba and Stable [00:18:30] LLM. And the fully open models, they put themselves and Olmo and, Ember7B and Olmo2 beats all of that category with some nice, average of stats.
[00:18:40] Alex Volkov: they talk about pre training and a bunch of other stuff. and the instruct category specifically with the Tulu kind of,recipes. What else can we say about Olmo? That's very interesting for folks before we jump into Qwen. What else can we say about Olmo? The, oh, the fact that the thing about the fully open source, we always mention this, is the data set.
[00:18:59] Alex Volkov: We [00:19:00] always talk about the data, they release all of the data sets, so Olmo mix was released, Dolmino mix was released, the SFT training data, post training data set was released as well. yeah, folks, comments. You can also try this model at playground. lnai. org. I've tried it. It's interesting. it's not look, uh,the best about this is the best among open source.
[00:19:21] Alex Volkov: Obviously it's not the best at, generally with closed source data, you can get more significantly better than this. But comments from folks about OMO? [00:19:30]
[00:19:30] Wolfram Ravenwolf: Yeah, it's not multilingual, they said that there is only English, but they are working on putting that in, I think, in another version, but, yeah, it's a truly open source model, not just OpenWeights, so a big applause for them, releasing everything, that is a big thing and I always appreciate it.
[00:19:46] Wolfram Ravenwolf: Thank you.
[00:19:48] Alex Volkov: A hundred percent. All right, folks, it looks like we got Eugene back. Eugene, talk to us about Heimbar.
[00:19:54] Eugen Cheugh: Yeah, no, sorry, I was just saying that as someone who works on transformer [00:20:00] alternative,it's actually really awesome to get the data point because we all haven't decided what's the best arrangement, what's the percentage of transformer versus non transformer?
[00:20:08] Eugen Cheugh: Is the non transformer layers in the front or the back? It's like you say, the car and the car scenario, it's like electric car, do we even know if we want the electric engine in front or the back? and these are data points that we love to test to just, find out more and it's. And I appreciate what NVIDIA is doing as well and looking forward to more research in this space.
[00:20:26] Alex Volkov: Awesome. thanks for joining us and feel free to stay. The more the merrier. This is like a [00:20:30] Thanksgiving kind of pre party for all of us. The more the merrier, folks. If you're listening to this only and you're not like on the live stream, I encourage you to go and check us out because like we're also like showing stuff.
[00:20:40] Alex Volkov: We're like showing the papers. We're like, we're waving. We're like showing Turkey, whatever. we're having fun. all right, folks. I think it's time to talk about the main course. We just ate the mashed potatoes. Let's eat the turkey for open source.
[00:20:53] Qwen Quill 32B Reasoning Model
[00:20:53] Alex Volkov: In this week's Open Source Turkey dinner, the Reasoning Model, the first ever Reasoning Open [00:21:00] Source, we got Qwen Quill, Qwen Quill?
[00:21:04] Alex Volkov: Yes, Qwen Quill 32 bit preview, the first open source. Let's go! Let's go! The first open source Reasoning Model from our friends at Qwen. We have Jun Yang here, Jun Yang and Justin Lin, to talk to us about this release. Folks at OpenAI released this, they worked for, the rest of about O1, we released a couple of months ago.
[00:21:25] Alex Volkov: Then the folks at DeepSeek released R1, that they just released it, they [00:21:30] promised to give us, maybe at some point. The folks at O1 did not release the reasoning. So, what you see in O1 is the reasoning being obfuscated from us, so we can't actually see how the model reasons. R1 gave us the reasoning itself.
[00:21:44] Alex Volkov: But didn't release the model. And so now we have a reasoning model that you can actually download and use. And unlike reflection, this model actually does the thing that it promises to do. Junyang, how did you do it? What did you do? Please give us all the details as much as possible. Please do the announcement yourself.
[00:21:58] Alex Volkov: Thank you for joining us. [00:22:00] Junyang from Qwen.
[00:22:00] Junyang Lin: Yeah, thanks everyone for the attention and for the appreciation, and I'm Junyang from the Qwen team, and we just released the new model for reasoning, but we just added a tag that it is a preview. Yeah, it is something very experimental, but we would really like to receive some feedback to see how people use it and to see what people think.
[00:22:24] Junyang Lin: The internal problems,they really are. Yeah, it is called QUIL. it is [00:22:30] something, very interesting naming,because we like to see that, we first called it like Q1,things like that, but we think it's something too normal and we'd like to see there was something connected with IQ, EQ, then we call it QQ, and then we found out, QWEN with a W there.
[00:22:47] Junyang Lin: And we found a very interesting expression because it looks really cute. There is a subculture in China with the text expression to express the feelings. So it is something very interesting. So we [00:23:00] just decided to use the name and for. For the pronunciation, it's just like the word Q, because I combined QW, the pronunciation of QW, with U together, and it's still just cute.
[00:23:13] Junyang Lin: Yeah, there's something beside the model, and it is actually a model, which can, And this is the reason before it reaches the final response. If you just try with our demo and you will find that it just keeps talking to itself. And it's something really [00:23:30] surprising for us. If it asks you a question, it just keeps talking to itself to discover more possibilities as possible.
[00:23:42] Junyang Lin: And sometimes will lead to some new things. Endless generation. So we have some limitations there. So we mentioned the limitations in the almost the second paragraph, which includes endless generation. But it is very interesting. I [00:24:00] don't say it is a really strong model, something like competitive to O1 or outcompeting R1.
[00:24:06] Junyang Lin: It is not Simply like that, we show the benchmark scores, but it is something for your reference to see that, maybe it is at this level, and then if you really check the model performance, when it processes like mathematics and coding problems, it really thinks step by step, and it really discovers more possibilities.[00:24:30]
[00:24:30] Junyang Lin: Maybe it is a bit like brute forcing, just like discovering all possibilities. If there are 1 plus 2 is equal to 1, and it discovers a lot of possibilities, but it sometimes finishes,can finish some very difficult tasks. I think, you guys can wait for our more official release, maybe one month or two months later.
[00:24:53] Junyang Lin: We'll make sure, And the next one will be much better than this preview one, but you can play with it. It is something really interesting, [00:25:00] very different from the previous models.
[00:25:02] Alex Volkov: So first of all, a huge congrats on releasing something that, everybody, it looks like it piqued interest for, tons of folks, absolutely.
[00:25:09] Alex Volkov: Second of all, it definitely thinks, it looks like it's,Actually, this seems like this. you can see the thinking, like we're actually showing this right now for folks who are just listening and I'll just read you the actual kind of ice cube question that we have that,somebody places four ice cubes and then at the start of the first minute, and then five ice cubes at the start of the second minute, how many ice cubes there are at the [00:25:30] start of the third minute,we should probably have prepared like a turkey based question,for this one, but basically the answer is zero.
[00:25:36] Alex Volkov: Oh, the ice cubes melt within a minute, and the answer is zero, and people know the answer is zero because, ice cubes melt faster than a minute. But, the,LLM starts going into math and s**t, and, just to be clear, O1 answers this question, it understands the answer is zero. Quill does not.
[00:25:53] Alex Volkov: But the reasoning process is still pretty cool and compared to like other models like you see you can see it thinking It's let me set up an equation. Oh, [00:26:00] actually, it's not correct Ah, now the equation asking for this and this and this and it goes like This is confusing Let me read the problem again.
[00:26:06] Alex Volkov: And so it tries to read the problem again. This feels Not like just spitting tokens. So Junyang, what, could you tell us like what's the difference between this and training at a regular Qwen 2. 5? So as far as I saw, this is based on Qwen 5, correct?
[00:26:27] Junyang Lin: Yeah, it is based on Qwen 2. 5 [00:26:30] 32 billion de instruct Model. Yeah, we have tried a lot of options, maybe we will release more technical details later, but I can tell you something that, we mostly simply do some, do some work on the, post training data. Because it is actually based on our previous model, so we did not change the pre training, because we are actually very confident in our pre training, because we have trained it with [00:27:00] a lot of tokens, so there should be some knowledge about reasoning there, and in Qwen 2.
[00:27:05] Junyang Lin: 5, we also have some text reasoning, relative data, in the pre training process, so we just try to see that if we can align with the behavior of such, reasoning. So we have some very simple,superfines, fine tuning, and we find that while it can generate things like that, we have done a bit like RL stuff, and we also have done something like, RFT, Rejection, [00:27:30] Finetuning, so we can add more data from it.
[00:27:33] Junyang Lin: And there are a lot of techniques, just like self aligned. We use the base language model to use in context learning to build samples for us, to just We've built something like that make the model that can reason and we found that it's really surprising. We did not do very complex stuff, but we find that it has this behavior, but we still find that there is still much room in the reinforcement learning [00:28:00] from human feedback because we found that if you add some RL, you can improve the performance very significantly, so we have some belief that Maybe we, if we have done some more in a process where we're modeling LLM critiques and also things like building more nuanced data for the multi step reasoning, the model will be much better.
[00:28:26] Junyang Lin: Yeah. But this one is interesting. You can keep [00:28:30] talking to it. It keeps talking to itself, just talking about some strange thinking and sometimes maybe I'm wrong. I will check the question again and maybe I'm wrong again and then do it again and again. And sometimes it's generally too long because we have some limitations in long text generation.
[00:28:49] Junyang Lin: I think All models have this problem, so when it reaches maybe some bound and it will turn into some crazy behaviors, it just never [00:29:00] stops generating. We just mentioned this limitation. Just
[00:29:05] Alex Volkov: to make sure folks understand, this is a preview, this is not like an official release. You guys are like, hey, this is a preview, this is a test of you guys.
[00:29:12] Alex Volkov: You guys are like trying this out, like folks should give feedback, folks should try it out. Maybe Finetune also on top of it. Yeah. There's definitely we're trying this out. This is
[00:29:21] Yam Peleg: it's like chatGPT is a research preview. It's not exactly a preview. It beats the benchmarks on so many problems.
[00:29:29] Yam Peleg: We would
[00:29:29] Junyang Lin: like [00:29:30] to make it a fun, funny stuff to make people happy. It's now Thanksgiving and people are always expecting models from us. And they're just talking that all out. where's our reasoning model or things like that. Yeah. so we showed this one to you. And.
[00:29:48] Alex Volkov: Yeah, Jan Wolfram, folks, comments about the reasoning model from Qwen.
[00:29:53] Yam Peleg: Oh, I have a lot of comments. That's a lot. I don't know if you can hear me. Yeah, Jan, [00:30:00] go ahead.
[00:30:00] Alex Volkov: There's just a delay, but we're good.
[00:30:02] Yam Peleg: Yeah, I just want to say, it's like, uh, CGPT is, uh, is a research preview. It's it's a really good thing.
[00:30:10] Yam Peleg: It's a really good model. Seriously. So, I mean, it can be a preview, but it's extremely powerful. How did you guys train this? I mean, what, what, what's the data? How did you generate it? Can you Can I just create data that looks like O1 and Finetune and it's going to work? or, like, give us some details.
[00:30:28] Yam Peleg: it's a really hard thing to [00:30:30] do. it's really, really, really successful. Sohow did you make it?
[00:30:35] Alex Volkov: Give us some details if you can, I'm saying. if you can. Don't let Yam, don't let Yam go into give some details that you cannot give details. but hey, it looks like we may have lost Junyang for a bit with some connection issues, but while he reconnects, we got Maybe he can't, maybe he can't hear details, so
[00:30:52] Wolfram Ravenwolf: They put the plug.
[00:30:53] Alex Volkov: and Wolfram, what's your, I saw your take. Let's, meanwhile, let's take a look. You did some testing for this model as well, right?
[00:30:59] Wolfram Ravenwolf: [00:31:00] Yeah. And I just ran the, the IceCube prompt and on my run, it got the zero correct.
[00:31:04] Wolfram Ravenwolf: So that is a bit of a red flag. Oh, you
[00:31:06] Alex Volkov: did get it correct.
[00:31:07] Wolfram Ravenwolf: Yeah. it was fun because it wrote, Over 10, 000 characters, but in the end it said, okay, so confusing, they all melted zero. So that worked. But of course you have to run benchmarks multiple times. I did run the MMLU Pro computer science benchmark twice.
[00:31:23] Wolfram Ravenwolf: And what is very interesting is, Also here, it generated much more tokens than any other model. The second, highest [00:31:30] number of tokens was GPT 40, the latest one, which was 160, 000 tokens for the whole benchmark. And here we have over 200, 000, 232, 000 tokens it generated. So it took me two and a half hours to run it.
[00:31:45] Wolfram Ravenwolf: And, yeah, it's an 8B model, no, a 32B model at 8 bit in my system where I was running it, because I have 48GB VRAM, so you can run it locally and look at it, it's, it's placed above the 405B [00:32:00] Lama 3. 1, it's above the big Mistral, it's above the GBT, JGBT latest, and the GBT 4. 0 from, yeah, the most recent one.
[00:32:08] Wolfram Ravenwolf: So just to recap
[00:32:09] Alex Volkov: what you're saying. On the MMLU Pro Benchmark, this is a model that you run on your Mac, or whatever PC, and it beats Llama 3. 5, 4 or 5 billion parameter on this benchmark, because it's reasoning and it's smart, it runs for longer, and it uses those test time compute, inference time [00:32:30] compute, Compute, Scaling, Loss that we talked about multiple times.
[00:32:33] Alex Volkov: It runs for longer and achieves a better score. This is like the excitement. This is the stuff. so Junyang, now that you're back with us, could you answer, or at least some of Yam's question, if you couldn't hear this before, I will repeat this for you. How? What does the data look like? can you just come up with some O1 stuff?
[00:32:51] Alex Volkov: By the way, welcome, welcome Nisten.
[00:32:53] Nisten Tahiraj: But I tried it.
[00:32:54] Introduction to the New Google Model
[00:32:54] Nisten Tahiraj: It got the Martian.Rail Train Launcher, it got it perfectly [00:33:00] on first try, and I saw that it did take it three tries, so I use this as a standard question on most models, is if you're going to launch a train from the highest mountain in the solar system, which is on Mars, and you want to accelerate it at two G's, so Still comfortable.
[00:33:21] Nisten Tahiraj: how long would that track need to be in order for you to get to orbital velocity and in order for you to get to, to leave [00:33:30] Mars gravity well? And it's a very good question because there's so many steps to solve it and you can just change it to, you can say 2. 5G and that completely changes the order of the steps for, that the model has to solve.
[00:33:42] Alex Volkov: So it's unlikely to be in the training data and it got it perfectly. It's again, it's this one, it's the new Google preview, even Sonnet takes two tries, two or three tries often to get the right answer. So,yeah, the model worked, and I had the same thing as [00:34:00] Wolfram, he did put out a lot of tokens, but again, it's pretty fast to run locally, Folks, it's a good model. It's, it, for a test preview, for something that was released, as a first, open weights reasoning model, we are very impressed.
[00:34:14] Model Performance and Availability
[00:34:14] Alex Volkov: we're gonna give Junaid, one more, one more attempt here, Junaid, I see you on the spaces. and you're as a speaker, maybe you can unmute there and speak to us through the spaces,while we try this out, I will just tell to folks that like you are, you can download this model.
[00:34:27] Alex Volkov: It's already on, OLAMA. [00:34:30] You can just like OLAMA install Quill or QWQ.it's already on OpenRouter as well. You can get it on OpenRouter. So you can like replace. you can replace whatever you use, like OpenAI, you can replace and put this model in there. it's, you can try it out in Hug Face, this is where we tried it just now.
[00:34:47] Alex Volkov: And, It's awesome. It's awesome to have this. I'm pretty sure that many people are already like trying different variations and different like fine tunes of this model. And it just like going up from here, like to get a open [00:35:00] model, 32 billion parameters, that gets, what is the score? let me take a look.
[00:35:04] Alex Volkov: The score is, I think it gets, 50 on AIME. It's ridiculous. Anybody try this on ARK Challenge, by the way? Do you guys see in your like, like tweets or whatever, the ARK Challenge? Anybody try to run this model on that and try? I would be very interested because that's that's a big prize. It's a very big prize.
[00:35:22] Alex Volkov: I'm pretty sure
[00:35:22] Eugen Cheugh: someone's trying right now. You shall think that out.
[00:35:26] Alex Volkov: I'm pretty sure somebody's trying right now. They could use a
[00:35:29] Wolfram Ravenwolf: 72B [00:35:30] version of it and maybe that gets even better. Probably does.
[00:35:35] Alex Volkov: Yeah. They're probably training a bigger model than this right now. all right folks. So with this, I think that, we've covered pretty much everything that we wanted to cover with Quill.
[00:35:46] Scaling and Model Efficiency
[00:35:46] Alex Volkov: and I think, yeah, the one thing that I wanted to show, let me just show this super quick before we move on to the next topic that we have is this, scaling kind of thing. We saw pretty much the same thing. From, from [00:36:00] DeepSeq. And then we saw pretty much the same thing also from OpenAI. The kind of the scaling confirmation, the scaling log confirmation, the next scaling log confirmation, test time compute or inference time compute works.
[00:36:11] Alex Volkov: Which basically means that the more thinking, the more tokens, the more time you give these models, the better. to think, the better their answer is. We're getting more and more confirmation for this kind of Noah Brown, I don't know, thesis, that these models actually perform [00:36:30] significantly better when you give them more tokens to think.
[00:36:32] Alex Volkov: this is incredible to me. This is like incredible because not only will we have better models with more scale, but Even though some people claim a wall has been hit, no wall has been hit. but also we now have these models that can answer better with more tokens. and this is like another, another confirmation from this.
[00:36:51] Alex Volkov: Qwen, Quail32B is now here. You can, you can now run. a, a 4 0 5 B level models, at least on [00:37:00] MMLU Pro,like wolf from here said on your computers. And shout out to our friends from, Alibaba Quinn for releasing these awesome models for us as a Thanksgiving,present.
[00:37:10] Alex Volkov: Jang, you're back with us. Let's see. maybe you're back.
[00:37:14] Junyang Lin: I don't know if you can hear me. Yes,
[00:37:16] Alex Volkov: we can hear you finally, yes.
[00:37:18] Junyang Lin: I don't know what happened.
[00:37:19] Alex Volkov: it's
[00:37:20] Junyang Lin: fine. I
[00:37:22] Alex Volkov: think that, let's try this again. maybe last thing as we're going to try.
[00:37:27] Discussion on Reasoning Models
[00:37:27] Alex Volkov: What, from what you can tell us, [00:37:30] how does the work on this look like?
[00:37:34] Alex Volkov: Is a lot of it synthetic? Is a lot of it RL? Could you give us, a little bit of, Give us a hint of what's going to come in the technical release for this. And also what can we look forward to in the upcoming? Are you maybe working on a bigger model? give us some, give us something for Thanksgiving.
[00:37:51] Junyang Lin: Oh yeah. for the reasoning steps, I think, the data quality, really matters and, we, we think that, it may split the steps, [00:38:00] more, make it more nuanced. make it more small steps. It can be just, the possible answers, with higher possibility, which means that the machine may think, in a different way from, the human being.
[00:38:12] Junyang Lin: The human being may reach the answer very directly, but sometimes, for a reasoning model, it may reason to explore more possibilities. So when you label the data, you should pay attention to, these details and, This is a part of it, and now we only have done some work on mathematics and [00:38:30] coding, and especially mathematics, and I think there's still much room in general knowledge understanding.
[00:38:37] Junyang Lin: I found that Wolfram just tested it for the MMU PRO, but we actually did not strengthen its performance for the MMU PRO. this kind of benchmark. So I think for the scientific reasoning, there's still much room for it to do it. And something surprising for us, is that we found that, it sometimes generate more beautiful texts, more [00:39:00] poetic, some, something like that.
[00:39:02] Junyang Lin: I don't know why, maybe it is because it reasons. So I think it may encourage creative writing as well. A reasoning model that can encourage creative writing. That would be something very interesting. I also found some cases, in Twitter, that people find that, it sometimes generates, text more beautiful than, Claude's written by someone and created.
[00:39:22] Junyang Lin: there's still much room for a reasoning model. Yep.
[00:39:25] Alex Volkov: Very interesting. Just to recap, folks found that this model that is [00:39:30] trained for reasoning gives more poetic, writing. that's very interesting. All right, folks, I think it's time for us to move on, but
[00:39:37] Wolfram Ravenwolf: just one quick comment.
[00:39:39] Multilingual Capabilities of Qwen
[00:39:39] Wolfram Ravenwolf: It's also very good in German. I tested it in German as well. So even if it may not be the focus, if you are multilingual or another language, try it. Yeah,
[00:39:50] Junyang Lin: that's something not that difficult for us because the Qwen is strong model is multilingual And it is actually I think it is now good at German.
[00:39:59] Junyang Lin: Yeah, [00:40:00]
[00:40:02] Alex Volkov: Qwen's multilingual is very good at German.
[00:40:04] BlueSky hate on OpenSource AI discussion
[00:40:04] Alex Volkov: Alright folks, I think that it's time for us to move on a little bit and Now we're moving to less fun, less of a fun conversation, but I think we should talk about this. just a heads up, after this, we're gonna have this week's buzz, but I don't have a category for this.
[00:40:19] Alex Volkov: I don't have a category for this, but it must be said. as ThursdAI is all about positivity. We talk about AI every week to highlight the advancement we highlight with positivity we get excited about every new [00:40:30] release every new whatever we also recently and now we have you know we're on youtube as well and the reason it coincided well with some of the folks in the ai community moving over to blue sky let me actually first Say hi to my colleague here, Thomas.
[00:40:44] Alex Volkov: I'm going to pull you up on stage as well. welcome Thomas as well. Hey man, welcome. My colleagues for the past year from Weights Biases, welcome as well. You're more than welcome to join us as well, because you're also on BlueSky. And, so a bunch of the community, recently started seeing whether or not there's a [00:41:00] new place over at BlueSky.
[00:41:02] Alex Volkov: for the ML community. I saw a bunch of ML people over there as well. I see Wolfram over here has a little butterfly. you all who are joining us from Twitter, or Xspaces, for example, you've probably seen a bunch of your favorite AI folks post just a blue butterfly and maybe follow them towards the other social media platform due to your political preferences, wherever they may be, which is completely fine.
[00:41:26] Alex Volkov: That's all good and well and fine. so I started cross posting to both, [00:41:30] and I'll show you how my screen looks like recently. This is how my screen looks like. I scroll here, I scroll on X, and I scroll on blue sky. This is what my life looks like. Yes, I'm on both. because I want to make sure that I'm not missing any of the news.
[00:41:43] Alex Volkov: That I want to bring to you, and also Zinova, our friend, right? He posts everywhere, and I see the community bifurcating. I don't like it. But I want to make sure that I'm not missing anything. This is not what I want to talk to you about. Not the bifurcation. I don't mind the bifurcation. We'll figure out something.
[00:41:58] Alex Volkov: We're on YouTube as well, [00:42:00] so the folks from BlueSky who don't jump on TwitterX community, they can still join the live chat. What I want to talk to you about is this thing that happened where, a bunch of folks from Hug Face just joined Blue Sky as well, and one of the maybe nicest people in, from the Hug& Face community, Daniel,I'm blanking on his last name, Nisten, maybe you can help me out, Daniel Van Strijn?
[00:42:24] Alex Volkov: Daniel Van Strijn?basically, did what he thought was [00:42:30] maybe a cool thing. He compiled the dataset. You guys know, we talk about data and open source and Hug Face as well. This is like in the spirit of the open source community, there's, we talk about open datasets. we, I have a thing here. This is my thing.
[00:42:43] Alex Volkov: When we talk about somebody releasing. Open source datasets. We have a thing. We clap, right? and so he compiled, a dataset of 1 million blue sky posts to do some data science. This is like what Hagenfeist, put it on Hagenfeist. just to mention one thing before, [00:43:00] unlike Twitter, which used to be open, then Elon Musk bought it and then closed the API, and then you have to pay 42, 000 a year.
[00:43:07] Alex Volkov: 42, 000 a year. Yes, this is the actual price. 42, 000 a year. this is the actual literal price for the API. Unlike Twitter, which used to be free, BlueSky is built on a federated algorithm. There's a firehose of API you can apply to it. And then you can just like drink from this firehose for free. This is like the whole point of the platform.
[00:43:27] Alex Volkov: so then you'll connect to this firehose, drink from it and [00:43:30] collect, compile the data set of a 1 million posts, put it up on Hug Face, open source.
[00:43:36] Community Reactions and Moderation Issues
[00:43:36] Alex Volkov: And then got death threats. Death threats. He got death threats for this thing. People told him that he should kill himself for this act where he compiled data from an open fire hose of data that is open on purpose.
[00:43:58] Alex Volkov: What the actual f**k? [00:44:00] And when I saw this, I'm like, what is going on? And in less than 24 hours, I'm going to just show you guys what this looks like. Okay. this is the, this is on the left of my screen and the folks who are not seeing this, you probably, I'm going to, maybe pin.
[00:44:13] Alex Volkov: Yeah. let me just do this super quick. So you guys who are just listening to this, please see my pinned tweet, as well. because this is some insanity. Okay. And we have to talk about this because it's not over here. he compiled a 1 million public posts, BlueSky Firehose API, data set.
[00:44:27] Alex Volkov: And then, it got extremely [00:44:30] viral to the point where I don't know, it's like almost 500 whatever it's called. And then the amount of hate and vitriol in replies that he got from people in here. Including, yes, including you should kill yourself comments and like death threats and doxing threats, et cetera.
[00:44:47] Alex Volkov: many people reached out directly to,HugNFace folks. he became maybe number two most blocked person on the platform as well. and all of this, they, people reached out to the Hug Face community. Basically in less than [00:45:00] 24 hours, he basically said, I removed the BlueSky data from the repo.
[00:45:03] Alex Volkov: I wanted to support the tool development for the platform, recognize this approach, violate the principle of transparency and consent. I apologize for this mistake, which, okay, fine. I acknowledge his position. I acknowledge the fact that he works in a,he works in a company and this company has lawyers and those lawyers need to adhere to GDPR laws, et cetera.
[00:45:23] Alex Volkov: And many people started saying, Hey, you compiled my personal data without, the right for removal, et cetera, without the due [00:45:30] process, blah, blah, blah. Those lawyers came, there's a whole thing there. And then our friend here, Alpen, who's a researcher, of his own, connected to the same open firehose of data, and collected a dataset of 2 million posts.
[00:45:47] Alex Volkov: That's twice as many as Daniel did, and posted that one, and then became the person of the day. Alpen, you want to take it from here? You want to tell us what happened to you since then? What your 24 hours looked [00:46:00] like?
[00:46:00] Alpin Dale: yeah, sure. it's been quite the experience being the main character of the day in Blue Sky.
[00:46:05] Alpin Dale: And,obviously, I'm not showing my face for very obvious reasons. I have received quite a few threats because, Yeah, unlike Hugging Face employees, I am not beholden to a corporation, so I didn't really back down. And, yeah, I probably received hundreds of death threats and doxxing attempts.
[00:46:24] Alpin Dale: so just to reiterate what you said, the Firehose API is completely [00:46:30] open.
[00:46:31] Alpin Dale: It is, it's a good analogy with the name because it's like a firehose, anyone can use it.
[00:46:35] Legal and Ethical Implications
[00:46:35] Alpin Dale: you have they've also,threatened me with litigation, but, I'm not sure if you guys are aware, but there was a court case back in 2022, HiQ Labs versus LinkedIn, where, HiQ Labs was, scraping public, public accounts from LinkedIn and, using it for some commercial purposes, I don't remember.
[00:46:54] Alpin Dale: But, They did actually win in court against LinkedIn, and what they were doing was [00:47:00] slightly even more illegal because LinkedIn doesn't have a publicly accessible API, and they have Terms of Services specifically against that sort of scraping, and because of that, the ruling overturned later and they, they lost it, they lost the claim, but it did set a precedent to be had that if the,if the, data published on publicly accessible platforms could be lawfully connected, collected and used, even if terms of service like purported to limit such usage.
[00:47:28] Alpin Dale: But I [00:47:30] Never agreed to such a term of service when I started scraping or copying the data from the Firehose API because first, I didn't do any authentication. Second, I didn't provide a username when I did that. So anyone could have done that technically with the AT protocol Python SDK. It's you don't even need to sign in or anything.
[00:47:52] Alpin Dale: You just sign in. Connect to the thing and start downloading.
[00:47:55] Alex Volkov: Yeah, this is the platform is built on the ethos of the open [00:48:00] web. The open web is you connect and you read the data. This is the ethos of the open web. When this is the ethos of the open web, when you post on this platform, Whether or not the TOS is saying anything, when you don't need to authenticate, the understanding of the people should be, regardless, and I understand some of the anger when the people discover, oh, s**t, my, my thoughts That I posted on this platform so far are being used to like, whatever, train, whatever.
[00:48:28] Alex Volkov: I understand some of this, I [00:48:30] don't agree with them, but like I understand, what, how some people may feel when they discover Hey, my thoughts could be collected, blah, blah, blah. and somebody posted like a nice thread. But, the platform is open completely. Going from there to death threats, this is, like, where I draw completely, where I draw my line.
[00:48:45] Alex Volkov: Alpen, the next thing that happened is what I want to talk to you about. you're getting death threats, you're getting doxxed attempts. Um,I couldn't find your post today. what happened?
[00:48:56] Alpin Dale: for some reason, BlueSky decided to terminate my [00:49:00] account instead of the ones issuing the death threats, very interesting chain of events, but,they claimed that I was engaging in troll behavior, whatever that means.
[00:49:10] Alpin Dale: And for that reason, they just, like it wasn't even,due to mass reporting that happens on X. com, right? Specifically emailed me with very, human generated language, where they told me that I was being a troll. I think I posted it on my Twitter account too. And, Yeah, they just assumed I'm trolling, [00:49:30] and what's funny is there's been screenshots floating around of similar mod messages, just giving people a slap on the wrist for much, much worse things, like things we can't even talk about here, right?
[00:49:44] Alpin Dale: So very strange, very silly situation overall. And another thing I wanted to mention, a lot of people. We're bringing up the GDPR and all that because of like personally identifiable information, but if you go to the [00:50:00] dataset, all we have is the post text. The timestamp, the author, and the author name is a, it's just a hash, it's not the full author name, and the URI, so there isn't really much to link people to the, to their specific posts, and there isn't even a location tag, so I'm not sure if it fully applies with GDPR, but I'm not a liar anyways, and, The thing is, the data or their posts were published on a platform that is explicitly designed for public [00:50:30] discourse, right?
[00:50:31] Alpin Dale: And the decision to share sensitive information on a platform like this lies with the user, not the observer. And we are the observer in this case. And by the very nature of public platforms, Individuals that post like content like this, they have to bear the responsibility that their information is accessible to anyone.
[00:50:51] Alpin Dale: And I don't think my dataset like alters this reality because it just consolidates information that was already available for [00:51:00] everyone. And I guess,there were also people who were asking for an opt out option and, the Hugging Face CEO, Clem, also made an issue on the repo about this. And I did provide a very straightforward opt out process, if someone wants to remove that data, they can just submit a pull request.
[00:51:18] Alpin Dale: to remove the specific posts that belong to them but alsothey have to accompany it with a proof of authorship they have to prove to me that the post that they're removing is not a [00:51:30] it belongs to them and it's not a malicious request so i guess i've covered all grounds so i'm not sure what the what people are worried about
[00:51:38] Alex Volkov: so i uhI'm just showing to the folks who are listening, I'm showing a, an email from,from the moderation team at BlueSky.
[00:51:46] Alex Volkov: BlueSky County Control, Alpendale, BlueSky Social was reviewed by BlueSky Content Moderators and assessed as a new account trolling the community, which is a violation of our community guidelines. As a result, the account has been permanently suspended. They didn't even give you the chance to like, hey, delete this and come back to [00:52:00] the platform.
[00:52:00] Alex Volkov: Literally permanently suspended. the folks who are like saying, hey, You are going to be,delete this and come back or the folks who are like 13 death threats, are not there. Um,What can we say about this? it's ridiculous. Absolutely. And I, The fact that Hug Face's account, your account, Daniel's account, became the most blocked accounts on the platform in the past 24 hours, more so than some like crazy Manosphere accounts, is just is absolutely insanity.
[00:52:28] Alex Volkov: The fact that most of [00:52:30] these anger prone accounts People are like anti AI completely. And the whole issue about like consent, whatever, most of them don't even appear in the dataset, by the way. Like some people checked on the fly, Zeofon and I, like we did some basic checking, many people didn't even appear in the dataset.
[00:52:44] Alex Volkov: the fact that the absolute silly fact that the, none of them understand the Barbra Streisand effect on the internet and the fact that there's five datasets right now. Many of them collected the people who reacted to these specific posts and collected the data [00:53:00] set of the people who reacted to these specific posts.
[00:53:02] Alex Volkov: And people just don't understand how the internet works. That was just like ridiculous to me.
[00:53:07] Moving Forward with Open Source
[00:53:07] Alex Volkov: so Alpen, I personally think that you did Many of these people also a very good service as well, because at least some of them now realize how open internet works, despite the being very upset with the fact that this is how the open internet works, at least some of them are now like realizing this.
[00:53:23] Alex Volkov: I,I commend you on like the bravery and standing against this like absolute silliness and not backing down. And [00:53:30] Yeah, go ahead. Happy
[00:53:31] Alpin Dale: to serve. Yeah, another small thing I wanted to add was, I've received a lot of threats about me getting reported to the EU, but what I find really ironic is that,earlier this year, the EU funded a research for collecting over 200 million blue sky posts with a greater level of detail.
[00:53:50] Alpin Dale: So clearly the EU is fine with this, so I don't know what's the problem here, once again.
[00:53:58] Alex Volkov: yeah, I saw this. Yeah, there's a way [00:54:00] bigger thing. The last thing I saw about this, and then maybe we'll open up for folks, and then I would love to chat with my friend Thomas, for whom it's late, and I invited him here, and I want to be very mindful of his time as well, so thank you, Thomas, for being patient.
[00:54:12] Alex Volkov: The last thing I say about this is that this sucks for open source, from the very reason of, if you're open and public and good hearted about this, Hey folks, here's the data in the open, you can look at this data and you can ask for your s**t to be removed. You get an angry mob of people threatening [00:54:30] death against you and asking your lawyers to like, literally people asking like, was Daniel fired?
[00:54:34] Alex Volkov: what the f**k? Meanwhile, this is a open firehose and all of the companies in the world probably already have all this data. I'm pretty sure, OpenAI has been already training on BlueSky. Like, why wouldn't they? It's open. Literally, if you want to train, and Thomas, maybe here is like a little entry to what we're going to talk about.
[00:54:50] Alex Volkov: If you want to train a toxicity,thing, There is now a very good place to go to and look at toxicity score or I can show you where you can go [00:55:00] to to train toxicity score. Like, why wouldn't you go and collect this data? It's free, like literally it lies on the internet.
[00:55:05] Alex Volkov: Nothing in the TOS, like Alpen said, even I went to the TOS of BlueSky. Literally it says over there, we do not control how other people use your data. Like literally that's what it says on the TOS. So yeah, I'm just like, I'm very frustrated against this. I want to speak out against this, absolutely ridiculous behavior.
[00:55:22] Alex Volkov: I don't think that this,okay. So I don't think that the, how the people reacted on the platform speaks against the platform itself. I do think [00:55:30] That the way the moderators, acted out against Alvin's account and the removal of account permanently banned, speaks completely against the platform.
[00:55:38] Alex Volkov: This is stupid and we should speak against this, on the platform itself. if we think that this is a place for the community, that's where I stand. And I wanted to share the publicly, super brief comments, folks, and then we'll move on to this week's bus.
[00:55:49] Wolfram Ravenwolf: There was a link in his message from the moderators that he can reject it and get a review, appeal, yeah.
[00:55:58] Wolfram Ravenwolf: So I hope that, I hope [00:56:00] he gets the appeal through. That is important. Yeah,
[00:56:03] Alex Volkov: if you will,please email them with an appeal and, tell them about the multiple death threats that you received and the fact that, you didn't, did not mean to troll.
[00:56:12] Wolfram Ravenwolf: I reported every one of those messages, by the way, and anyone who does it is probably a good thing.
[00:56:18] Alex Volkov: Nisten, I know you have thoughts on this. I would love to hear.
[00:56:22] Nisten Tahiraj: we need to better educate people to not go after the ones on their side. a lot of the open source devs do this stuff [00:56:30] because they want everyone to have, Healthcare robots that no single corporation owns. They make this data public because people want to democratize the technology for everyone.
[00:56:41] Nisten Tahiraj: So it's not, it doesn't become like authoritarian and like a single source of control. And, to see that they prioritize, just, people's anger and feelings versus being objective. about it. Whereas, [00:57:00] so in this case, the public forum data set is public domain on purpose. And this is what drew people to the community in the first place, because they felt like Twitter was becoming too political, single sided.
[00:57:12] Nisten Tahiraj: And, we didn't like that. And a lot of people moved to, because they saw Blue Sky as a, Much better, democratized alternative to all of this. And,so that's really disappointing because, these are the people on your side and, now the two [00:57:30] nicest, most contributing open source devs that we know, are more hated than, like someone like Andrew Tate.
[00:57:37] Nisten Tahiraj: that just makes no sense at all. the, out of the five most blocked accounts Two of them are like the nicest people we know. So what is, something is pretty, pretty off. And, I'm also worried that in the AI community, we are in a bit of a bubble and not quite aware of,what people on our side are being communicated.
[00:57:58] Nisten Tahiraj: are being shown how this [00:58:00] stuff works, how open source, works because I'm pretty sure from their point of view, they're like, oh, here's another company just took all of our data and is just gonna train this porn bot with it and there's nothing we can do about it, but it's not like that.
[00:58:13] Nisten Tahiraj: Not a single company can own this data. It is public domain. We can't sue anyone else over the data. It's public domain in a public forum. You're supposed to have civil discourse because then the AI can also have civil [00:58:30] discourse and be reasonable and be like aligned to humanity. so now you have a bunch of people just giving, death threats and they're okay because they're just angry.
[00:58:40] Nisten Tahiraj: So you can tell someone to go kill themselves just because you're angry. And, yeah, so that's not good. Like they're just not good. you should probably, yeah, anyway, so there is something for us to do as well, like we need to communicate better, what does open source do versus what having a single company.
[00:58:58] Nisten Tahiraj: Own all that data and [00:59:00] have it as their property. because I feel like most of the general public doesn't really understand this.
[00:59:06] Nisten Tahiraj: yeah, that's it. I was just, okay. Just really quickly. Sorry. I went on too long, but after going through war in the Balkans as a kid, I didn't think people would be getting death threats for an open source dataset.
[00:59:17] Nisten Tahiraj: It's this is just completely beyond, It's absolutely unhinged. yeah, this is just completely off.
[00:59:23] Wolfram Ravenwolf: Unhinged. Just one thing, those people even think that now the thing is over, so the dataset has been [00:59:30] removed, okay, it's done, but you can get a new one anytime. The platform hasn't changed. They have to realize that.
[00:59:37] Alpin Dale: funny it mentioned that because they started blocking me for the explicit reason of, the user started blocking me for the explicit reason of stopping me from scraping their posts, as if I need my account to do that.
[00:59:49] Alex Volkov: Yeah, I think that there's, a lot of misunderstanding of, what's actually, happening.
[00:59:54] Alex Volkov: And how, which is fine, I completely empathize of people's misunderstanding of [01:00:00] technology, and thus fear, I get this I get the visceral reaction, I get,I don't like multiple other things about this, I don't like the, the absolute, horror mob. And the death threats, I don't like the platform reacting as it did, and like blocking completely, those things don't make sense.
[01:00:14] Hey, this is Alex from the editing studio. Super quick, about two hours after we recorded the show, Alpin posted that the moderation team at BlueSky emailed him and his account was in fact reinstated. He didn't ask them to. [01:00:30] They revisited their decision on their own.
[01:00:32] So either a public outcry from some individuals on the platform. Hopefully they listened to our show. I doubt they did. Um, but they reversed their decision. So I just wanted to set the record straight about that. He's back on the platform. Anyway, back to the show.
[01:00:48] Alex Volkov: Alright folks, unfortunately though, we do have to move on, to better things, and I'll give my other co hosts like a little five, five to seven minutes off, to go take a break. Meanwhile, we're going to discuss [01:01:00] this week's buzz.
[01:01:00] This Week's Buzz: Weights & Biases Updates
[01:01:00] Alex Volkov: Welcome to this week's buzz, a category at ThursdAI, where I talk about everything that I've learned or everything new that happened in Weights Biases this week. And this week, I have a colleague of mine, Thomas Capelli, [01:01:30] from the AI team at Weights Biases. We're now the AI team. This is new for us. We're Thomas, how, do you want to introduce yourself super brief for folks who've been here before, but maybe one more introduction for folks who don't know who you are.
[01:01:43] Thomas Capele: Yeah, I'm Thomas. I work with Alex. I'm in the AI Apply team at Weights Biases. I train models, I play with models on API, and I try to make my way into this LLM landscape that is becoming more and more complex. Try to avoid [01:02:00] getting roasted on the internet. And yeah, trying to learn from everyone. Thank you for the meeting.
[01:02:06] Alex Volkov: So you're going by Cape Torch, I'm going to add this as well on X as well. I don't know what you're going off as,on Blue Skies, same Cape Torch. I invited you here, and I think let's do the connection from the previous thing as well. A lot of toxicity we talked about just now, a lot of like toxic comments as well.
[01:02:23] Alex Volkov: and we're, we both work at Weights Biases on Weave. Weave is our LLM observability tool. [01:02:30] I showed off Weave multiple times on ThursdAI, but I will be remiss if I don't always remind people, because we have a bunch of new folks who are listening, what Weave is. Weave is an LLM observability tool. So if you're building as a developer, Anything with LLMs on production,you need to know what's going on, what your users are asking your LLM or what your LLM gives as responses, because sometimes imagine that your users are, let's say copy pasting, whatever comments, people just gave [01:03:00] Daniel and Alpin and they pasting it to them to do categorization, for example, and some of these like, Very bad things that we just talked about are getting pasted into the LLM and some of the LLM responses are maybe even worse, right?
[01:03:13] Alex Volkov: so maybe your application doesn't handle this. Maybe your application responds even worse and you want to know about this. and, the way to see those, some developers just looks at logs. we have a tool. That is way nicer. And, this is just some of the things it does. but this [01:03:30] tool is called Weave.
[01:03:30] Alex Volkov: it, it traces everything that your application gets as an input from users and also outputs. but that's not all it does. So it also allows you to do evaluations. And, recently Thomas and, has been working on, multiple things, specifically around scoring and different things. Thomas,you want to maybe give us a little bit of.
[01:03:47] Alex Volkov: Yeah, I think you,
[01:03:48] Thomas Capele: you described pretty well. Yeah, as I know, you have showed Weave and the product we have been working for a while, multiple times here, but it's, I would say it's mostly core feature is [01:04:00] actually building apps on top of LLMs and having observability and yeah, standard code, we have unit tests and for LLM based applications, we need like evaluations, actual evaluations on data we have curated.
[01:04:13] Thomas Capele: And it's, we have been doing this in the ML world for a while, but as we are merging with the software engineers that. Maybe don't know how to integrate this randomness from the LLMs in the, in their applications. Yeah. you need to actually compute evaluations. And that means gathering [01:04:30] data, still labeling a lot of stuff manually to have high quality signal.
[01:04:35] Thomas Capele: And then, yeah, iterating on your prompts and your application that, that's making API calls with scores, with metrics that gives you confidence that we are not like screwing up. And as you said, like I've been working recently on adding, we added a bunch of scores, default scores. We've a couple, yeah, like a month ago with Morgan, we spent like a week building those.
[01:04:58] Thomas Capele: and recently we have been like, [01:05:00] yeah, looking at stuff like toxicity and hallucination and yeah, context and bias detection, and there's multiple of them that are LLM powered, like the ones you are showing on the screen right now, like You have an LLM that it's actually prompt in a certain way, and you maybe build a system that requires like a couple of LLM prompt with structured output to actually get the scores you were expecting,and then this thing should be able to give you, yeah, a good value of the [01:05:30] scoring if it's hallucinating, if it's a toxic, actually the mall providers like OpenAI and Mistral and Anthropic, I think have an API exactly for moderation.
[01:05:41] Thomas Capele: So yeah, you can use also that and they are actually pretty good and fast and pretty cheap compared to the completion ABA. And we have no, what I've been doing this week and the last couple of weeks where I've been trying to build really high quality, small, non LLM powered scores. So example that you want to create a toxic, [01:06:00] detection system.
[01:06:00] Thomas Capele: Yeah. what can you do? Yeah, you could find a small model that it's not an LLM or it was an LLM a couple years ago. Now, like BERT, we don't consider BERT an LLM.
[01:06:09] Alex Volkov: Yeah.
[01:06:10] Thomas Capele: yeah. I've been fine tuning the BERT task and checking like this new high end phase, small LLM2 models, trying to adapt them to the task.
[01:06:18] Thomas Capele: Yeah. yeah, like good challenge, good engineering questions, like creating, there's plenty of high quality data set on HangingFace that people have been creating from multiple places, from Reddit, and [01:06:30] like these have been serving us to actually build this high quality classifiers that are capable of outputting and flagging the content that we're interested in.
[01:06:40] Alex Volkov: So here's what I, here's what I'll say for folks, just to like highlight what we're talking about. Weave itself. is a toolkit that you can use for both these things. You can use it for logging and tracing your application, which is what it looks like right now. You basically add these lines to your either Python or JavaScript application, JavaScript type of application, and we will help you track [01:07:00] everything your users do in production.
[01:07:01] Alex Volkov: Separately from this, You want to continuously evaluate your application on different set of metrics, for example, or scoring them on different set of metrics to know how your LLM or your prompts are doing, right? So you guys know that, like for example, before on the show we talked about, hey, here's this new model, the, qu quill, for example.
[01:07:20] Alex Volkov: And you know that wolf from, for example, tested it on MMU Pro. Those are generic evaluations. MMU Pros, those are evaluations that somebody built specifically for. [01:07:30] Something big. Look, there's a set of questions that somebody builds something big. specific scorers for your type of application, something that you build for your type of applications.
[01:07:38] Alex Volkov: and then people asked us as Weights Biases, Hey, okay, you give us a generic toolkit, an unopinionated toolkit, but can you give us some opinion? Can you give us some opinion? And basically this is what like Weave Scorers is. This is like an additional package that you can install if you want to,like additionally, right?
[01:07:55] Alex Volkov: Thomas, help me out here, but you can add this. The ones we're
[01:07:58] Thomas Capele: building right now, they're not yet [01:08:00] there. They will be probably in a certain future. Yeah. We need to test them correctly. And it's we're an experiment tracking company at the beginning. We're going to like, want to share the full reproducibility.
[01:08:10] Thomas Capele: Like this is the data, this is how we train them. there's different versions. It's scoring metrics we get, so you like have confident that they work as expected.
[01:08:18] Alex Volkov: So this is to me very interesting, right? So I came in as a previously software developer and now as like an AI evangelist, like I came in from like this side and I meet all these like machine learning engineers, experiment tracking folks who are like, okay, [01:08:30] now that we've built this like LLM based tool, observability tool, many people are asking us to do what Weights Biases does on the model side, on the Weights Biases side.
[01:08:37] Alex Volkov: Hey, Use everything from your, immense knowledge of tracking and doing experimentation to bring this over to the LLM side. Okay, now that you have all this data, now that companies are tracking all the data, how to actually, do experimentation on the front side. Thomas, last thing I'll ask you here before I let you go, briefly is about guardrails specifically.
[01:08:56] Alex Volkov: So there's this concept that we're going to talk about. We're going to keep talking about this [01:09:00] called guardrails. So we're talking about scorers. Scorers are basically the way to check your application. Just a model.
[01:09:05] Understanding Scoring Models
[01:09:05] Alex Volkov: Like
[01:09:06] Thomas Capele: I would define like score is just a model. It takes an input, produce an output.
[01:09:11] Thomas Capele: It could be simple. It could be complicated. Like a scoring, the simplest scores could be accuracy. if the prediction is equal to the label, like a complex score, it could be like an LLM power score that. Check that the context you retrieve from your RAG application, it's not like the response is not [01:09:30] hallucinated or is factually consistent with the original context.
[01:09:33] Alex Volkov: So like HallucinationFreeScorer, for example, is one score for folks who are listening. whether or not the response that your RAG application returned, Has hallucinations in it. Or,yeah, it's
[01:09:44] Thomas Capele: very it's very detailed. And you will probably need to refine all of this for your specific application because everyone has slightly definition and slightly needs, slightly different needs for their application.
[01:09:55] Thomas Capele: So yeah, you may need to tune everything, but this is like a good starting point.
[01:09:59] Guardrails in LLM Development
[01:09:59] Thomas Capele: [01:10:00] So yeah, I find it very interesting that you mentioned guardrails. I would say like a guardrail is. Also a model that predicts, but it's need to be really fast and it needs to be, it needs to take actions, maybe change the output, like any of these scores don't change your output.
[01:10:19] Thomas Capele: Like they will. Computer score, but they will not change the output. if you have IPAI's guardrail, it should, I don't know, redact stuff that [01:10:30] shouldn't pass. So it should change the output, like the payload you are getting from the API. So like guardrails are more like online, and these are more like, offline.
[01:10:41] Alex Volkov: So that's a good boundary to do. And I think we'll end here, but this is basically an exception for going forward, folks. I will tell you about guardrails specifically.
[01:10:48] Guardrails in Production
[01:10:48] Alex Volkov: It's something we're getting into, and I'm going to keep talking about guardrails specifically, because I think that this is a very important piece of developing LLMs in production.
[01:10:57] Alex Volkov: How are you making sure that the [01:11:00] model that you have online is also behaving within a set of boundaries that you set for your LLM? obviously we know that the big companies, they have their guardrails in place. We know because, for example, when you, talk with, advanced voice mode, for example, you ask it to sing, it doesn't sing.
[01:11:14] Alex Volkov: there's a boundary that they set in place. when you develop with your LLMs in production, your guardrails, the only way to build them in is in by prompting for example there's other ways to do them and we are building some of those ways or we're building tools for you to build some of those ways [01:11:30] and like thomas said one of those guardrails are changing the output or like building ways to prevent from some of the output from happening like PII for example or there's like toxicity detection and other stuff like this so we will Talking more about guardrails, Thomas with this, I want to thank you for coming out to the show today and helping us with scores and discussing about Weave as well.
[01:11:50] Alex Volkov: And, I appreciate the time here, folks. You can find Thomas on, X and on, and on BlueSky, under CapeTorch. Thomas is a machine learning engineer and, [01:12:00] developer AI engineer as well. Does a lot of great content, Thomas. Thank you for coming up. I appreciate you. He also does amazing cooking as well.
[01:12:06] Alex Volkov: Follow him for some amazing gnocchi as well. Thanks, Thomas. Thomas, thank you. Folks, this has been this week's Bugs, and now we're back. Good job being here. See you guys. See you, man. And now we're back to big companies and APIs.[01:12:30]
[01:12:33] Alex Volkov: All right. All right. All right. We are back from this week's buzz, folks. Hopefully, you learned a little bit about scores and guardrails. We're going to keep talking about guardrails, but now we have to move on because we have a bunch of stuff to talk about specifically around big companies and APIs, which had A bunch of stuff this week as well.
[01:12:51] OpenAI Leak Incident
[01:12:51] Alex Volkov: I wanna talk about, the leak. You guys wanna talk about the leak, this week? open the eye had a big, oh my God. Oops. Something big [01:13:00] happened. but nothing actually big happened, but look to some extent, this was a little bit big. at some point, this week, a frustrated participant in the open ai, how should I say, test.
[01:13:12] Alex Volkov: Program for Sora decided to quote unquote leak Sora and posted a hug and face space where you could go and say, Hey, I am,I want this and this. And you would see a Sora video generated and, yeah, we can actually show some videos. I think, this is not against any [01:13:30] TOS, I believe. and, Yeah, this wasn't actually a leak. What do you guys think? did you happen to participate in the bonanza of, Sora videos, Wolfram or Yam? Did you see this?
[01:13:40] Wolfram Ravenwolf: I saw it, but I didn't, try to go to the link.
[01:13:43] Alex Volkov: No.
[01:13:44] Sora Video Leak Reactions
[01:13:44] Alex Volkov: so basically, some very frustrated person from,the creative minds behind Sora behind the scenes, decided to like, Leak SORA, the leak wasn't actually the model leak like we would consider a model [01:14:00] leak.
[01:14:00] Alex Volkov: the leak was basically a hug and face application with a request to a SORA API with just the keys hidden behind the hug and face. we're showing some of the videos. I'm going to also add this to,to the top of the space for you guys as well. The videos look pretty good, but many of the folks who commented, they basically said that, compared to when Sora just was announced, where all of [01:14:30] us were mind blown completely, now the videos, when you compare them to something like Kling, or some of, Runway videos, they're pretty much on the same level.
[01:14:41] Alex Volkov: And, I, they look good. They still look very good. look at this animation for example. It looks very good still And apparently there's like a version of Sora called Sora Turbo. So these videos are like fairly quick, but Like folks are not as mind blown [01:15:00] as before yeah Some of the physics looks a little bit better than Kling etc, but it feels like we've moved onand this is something that I want to talk to you guys like super quick.
[01:15:09] Alex Volkov: we're following every week, right? So we get adapted every week, like every,the Reasoning Model Formula 1 blew us away. And then R1 came out and now we run this on our models due to Quill. So we're used to getting adapted to this. the video world caught up to Sora like super quick.
[01:15:24] Alex Volkov: Now we can run these models. There's one open source one like every week. These videos [01:15:30] don't blow us away as they used to anymore and,why isn't OpenAI releasing this at this point is unclear because if you could say before, elections, you could,you can put down Trump and Kamala Harris in there, Now, what's the reason for not releasing this and not giving us this thing?
[01:15:47] Alex Volkov: anyway, yeah, this video is pretty cool. There's one video with, a zoom in and somebody eating a burger. yeah, leak, not leak, I don't know, but, thoughts about the sourcling? What do you guys think about the videos and, the non releasing, things? Folks, I want to ask, Nisten, [01:16:00] what do you think about those videos?
[01:16:01] Alex Volkov: Do you have a chance to look at them?
[01:16:03] Nisten Tahiraj: I was going to say, by the way, I was going to say the exact same thing you did, that it's just been so long now, what, a few, a couple of months since they announced it? I think it's more than
[01:16:14] Alex Volkov: a couple of months, I think half a year, maybe, yeah.
[01:16:16] Nisten Tahiraj: Yeah, it's over half a year that so much happened that we're no longer impressed.
[01:16:22] Nisten Tahiraj: And I'm just trying to be mindful of that, that things are still moving fast. And, they haven't stopped [01:16:30] moving. Like we've seen a whole bunch of models start to get close to this now. it's still better, I would say it's still better than most of, what's come out in the last six months. but,yeah, we're getting pretty close.
[01:16:41] Nisten Tahiraj: I think they haven't released it mainly because of, weaponized litigation,that's the main thing.
[01:16:45] Alpin Dale: Yeah.
[01:16:45] Nisten Tahiraj: Holding them back and, uh.yeah, companies in other countries don't have that problem as, as much, so they were able to, to advance more, like while still being respectful tothe brands and [01:17:00] stuff, but, yeah, I think the main reason is, people are just going to try and nitpick any kind of,of, attack vector to, to, to sue them.
[01:17:08] Nisten Tahiraj: For it. So that's probably why
[01:17:10] Alex Volkov: Yeah. Everything open AI will Yeah. Will get attacked. That I fully agree with you on this. Yeah. speaking of, let's see, do we have anything else from OpenAI? I don't believe so. Yeah. the other one thing that I wanted to show super quick is that the new Chad GPT now is also y I'm gonna show this super quick on the thing, is also now [01:17:30] supporting cursor.
[01:17:31] Alex Volkov: So now, the NutriGPT app is supporting the Cursor app, so now you can ask what I'm working on in Cursor, and if you hover this, you can actually see all of my, including env, You can actually see my secrets, but, you can ask it, you can ask it about the open, open queries. And why would I, if I have Cursor?
[01:17:49] Alex Volkov: That's the question, right? Cursor supports O1, because, I have unlimited O1 queries on ChaiGPTN, whereas I have like fairly limited, queries for O1 in Cursor. and generally [01:18:00] That's been pretty good. That's been pretty cool. You can ask it about the stuff that you have open. There's a shortcut I think it's option shift 1 on Windows and you can enable this and basically you then start chatting With the open interface in the one.
[01:18:13] Alex Volkov: We tested this a couple of weeks ago if you guys remember and I found it super fun. I don't know if you guys used it since then or for those who use the Mac version of, of ChatGPT. I find it really fun. So folks in the audience, if you're using the macOS app and you are connecting this to Cursor or to the terminal, for [01:18:30] example.
[01:18:30] Alex Volkov: Unfortunately, I use the warp terminal and they still don't have warp. they have iTerm here and other things. if you use PyCharm or other, JetBrains, they also started supporting those.but I specifically use Courser and now there's a support for Courser, supports for Windsurf, which is another thing that we didn't cover yet.
[01:18:46] Alex Volkov: And I heard amazing things. And I hope, hopefully over the Thanksgiving break, I will have to, have a chance to use Windsurf. but yeah, this is from, OpenAI and we were waiting for some more news from OpenAI, but we didn't get one. So hopefully the folks at [01:19:00] OpenAI will get a Thanksgiving break.
[01:19:02] Alex Volkov: Just a small reminder. I looked a year ago, if you guys remember the Thanksgiving episode we had a year ago. We were discussing the control alt deletemen weekend where Sam Altman was fired and then rehired. That was the Thanksgiving episode of last year. You guys remember this? last year we discussed how Sam Altman, and Greg Brockman were shanked and, the coup from Ilya.
[01:19:26] Alex Volkov: You guys remember? It's been a year. It's been a year since then. This was the [01:19:30] Thanksgiving last year. and, yeah, it's been a year since then. which by the way. Next week is the one, the two year anniversary of JGPT as well. So we probably should prepare something for that. so that's on the OpenAI News.
[01:19:43] Alex Volkov: let's super quick talk about this.at some point There's this, the sayings from Space Uncle is, they need to be studied in an encyclopedia. somebody tweeted, I don't understand how game developers and game journalists got so ideologically captured. [01:20:00] Elon Musk tweeted and said, Too many game studios are owned by massive corporations.
[01:20:03] Alex Volkov: XAI is going to start an AI game studio to make games great again.and I'm like, and please unmute if you're muted and laughing, because I want to hear, and I want the audience to hear that both PicoCreator and Nisten are just like laughing out loud at this. It's XAI with all of their like 200, H200, 200, 000 H200s, like the best, the fastest ever growing massive [01:20:30] Memphis, super cluster, they're going to build games like, what are they really going to actually.
[01:20:34] Alex Volkov: Have a gaming studio in there. Like we know he is, Elon is a, I don't know the best Diablo game player in the world right now. I don't know how the f**k
[01:20:43] Nisten Tahiraj: he's, he is fourth or 20th or,
[01:20:45] Alex Volkov: yeah, he was 20. I think he's at some point he got number one recently, or something. I, we know, we all know he's a gamer.
[01:20:51] Alex Volkov: Kudos. I really, I'm not making this up. Like I'm really have no idea how the f**k you can be like the best Diablo player in the world doing all these other stuff [01:21:00] and. I get the sentiment of okay, let's make games. Great. Turning in the eye company, the games company, how the,what?
[01:21:08] Alex Volkov: Ah, I just want to turn to this.
[01:21:12] Eugen Cheugh: I love most. It's just a massive corporation, XAI with billions of dollars of funding. It's going to be not a messy corporation.
[01:21:23] Alex Volkov: Yeah, this is not necessarily AI related necessarily,we are expecting big things from XAI, specifically around GROK [01:21:30] 3.
[01:21:30] Alex Volkov: Hopefully December, that's the date that they've given us. They have a hundred thousand H100s turning away and building something. We know that this was like announced. we know that Elon promises and doesn't deliver on time, but delivers at some point anyway. We know that they have. very good folks behind the scenes.
[01:21:47] Alex Volkov: We know this, we've seen this before. We know that, infrastructure is something they're building out. They're building out enterprise infrastructure for APIs. we've seen the X, AI API layer building out. We've seen the kind of the [01:22:00] X,infrastructure. Sorry, enterprise infrastructure for, the building layer.
[01:22:03] Alex Volkov: We've seen all this, getting prepared. Like we've talked about this, we're getting to the point where X is going to be another player, competing another player versus Google, OpenAI, Anthropic, etc. GRUG3 is going to be something significant to contend with. and like the amount of GPUs are there.
[01:22:22] Alex Volkov: It's just is this a sidetrack? this is basically my question.
[01:22:25] Nisten Tahiraj: it, so Uncle Elon tends to be like very [01:22:30] impulsive as we've seen, so if he spends a lot of time on something he's gonna start getting obsessed with it. So there's that. In order to have a gaming service, you will need a lot of GPUs, and I'm pretty sure at this point, if they want to do cloud gaming or streaming, they probably have more GPUs than PlayStation.
[01:22:49] Nisten Tahiraj: they might actually just have more right now. they're like, we can probably Support that, and so much for the Department of Government Efficiency, now we're all [01:23:00] just going to be streaming games.
[01:23:05] Nisten Tahiraj: But there is, there's also Another lining to this is for, for a while, for the last 10 years, there was an article about 10 years ago that the E3, I don't think that's a thing anymore, but the E3 gaming conference had a SpaceX booth over a decade ago and SpaceX was actively recruiting for the E3. I think to quote, it was, programmers of physics engine, and the [01:23:30] rumors were that they were going after the ones who made the Steam Havoc 2, like the one in Portal, and the ones that worked on the, Unreal Tournament physics engine.
[01:23:40] Nisten Tahiraj: And this was over 10 years ago, and those people, those programmers, were recruited by SpaceX. like, when you see, the Falcon Heavy, 2, 3, 4 rockets, just like Go dance in midair and land like they're in a video game is because, the people that made the simulation very likely worked on game engines.
[01:23:58] Nisten Tahiraj: So it might be [01:24:00] a hiring angle from him, or it might just be Angelino playing a lot of games and he just wants to know. there is an angle
[01:24:07] Alex Volkov: for gaming as a playground for training. Like a GI, whatever, like open AI obviously had, like trained robots in this area. we saw many papers for like agents running wild in a game constrained environments.
[01:24:19] Alex Volkov: There, there could be an angle there for sure. I just, this doesn't feel like, this feels like an impulsive, hey. make f*****g games great again.
[01:24:26] Anthropic's Model Context Protocol
[01:24:26] Alex Volkov: Alright, moving on, unless we have another comment here, moving on to [01:24:30] I really wanted to discuss the, super briefly the, Model Context Protocol from Anthropic.
[01:24:36] Alex Volkov: because this kind of blew up, but it's not ready yet. I saw a comment from Simon Wilson, you guys know Simon Wilson, the friend of the pod, he'd been here multiple times. basically he covered this. super quick, Anthropic released this new protocol, which they hope to standardize and by standardize, they mean Hey, let's get around this.
[01:24:53] Alex Volkov: Okay. So let's talk about a standard in the industry right now, the OpenAI SDK for Python. That's a [01:25:00] standard way to interact with LLMs. Pretty much everybody supports this, including Gemini. I think the only one who doesn't support this is Anthropic actually. So in Python, if you want to interact with any LLM, Literally any provider in LLM, including OpenRouter, like Google, OpenAI themselves, like pretty much everyone else, like including together, like all of the, all of those, you can replace one line of code in the OpenAI API, OpenAI Python SDK, where you just put a different URL in there, and then this is the standard way to talk to [01:25:30] LLMs.
[01:25:30] Alex Volkov: I think for TypeScript, JavaScript, it's pretty much the same.so it looks like Anthropic is trying to do something like this to standardize around how LLMs are connecting with other applications. So right now, just a minute before I showed you how ChatGPT is connecting to like a VS Code for something.
[01:25:49] Alex Volkov: They built those integrations themselves. So you would install a specific extension in VS Code in etc. And that extension That they've built [01:26:00] talks to the ChatGPT app on the Mac OS that they've built and they build this connection for you. This is not what Anthropic wants to do. Anthropic wants to create a protocol that like developers, other developers can build on their own to allow the LLM to talk to any application and you as a developer, I as a developer, other developers can build those Communication layers, and then whatever LLM, in this case, this is the Anthropic, Claude desktop app, this could be the JGPT app, could be the Gemini GPT app, [01:26:30] Gemini app, et cetera, could talk to other applications.
[01:26:32] Alex Volkov: What those other applications are? Anything. Anything on your desktop, anything. At all. So they build this kind of a first standard, communication via JSON RPC. And I think they're buildingother ways, and other servers. I think this is a way to summarize this, basically.
[01:26:50] Alex Volkov: this is a open preview. Nisten, you want to take another crack at trying to recap this? Or Yam or Wolfram, you guys want to? You want to give me your thoughts on this super quick? As far as I understand from [01:27:00] Simon, this is like still robust and still in,in, in flux.
[01:27:03] Nisten Tahiraj: I think this might end up being a much bigger deal than we, we first expect, because it is an interoperability layer, and as a developer, you will have to learn this.
[01:27:15] Nisten Tahiraj: it is annoying at the moment that, While proposing a standard, Anthropic is not showing willingness to abide by one, which most people chose, and even Google was forced to support the OpenAI standard. if you [01:27:30] want people to come with your standard, to abide by your standard, you also have to show willingness to abide by others.
[01:27:36] Nisten Tahiraj: that's not going to work here until Anthropic Just supports a plug and play OpenAI API, so I just put their models in, but that aside. The criticism aside,this is pretty, pretty important. So I've been doing some of this stuff and just trying to do it with basic JSON. So I think that's,it's very good.
[01:27:55] Nisten Tahiraj: And yeah, it's pretty hard to know, am I on Mac? Am I on Linux? Am I on a phone? [01:28:00] What's the LLM going to talk to? what does this app even want me to do? Do I have to emulate this on the screen and then click on it? Can't it just give me a JSON so that I can click on it so it's a lot easier for me?
[01:28:11] Nisten Tahiraj: And this will also apply to websites and, and web apps after to you. Offer some kind of a JSON RPC. An RPC is just like an API for people. It's just an application programming interface. It's something you query, like you write a curl to this [01:28:30] IP and here's my API key and give me, or here I'm going to give you this stuff and give me this stuff.
[01:28:37] Nisten Tahiraj: From the database or whatever. So this is this actually extremely important because you can apply to, to web apps as well. And it's a way to manage multiple sessions. So I think it's a pretty big deal, even though I am. No. And anthropic, it this,yeah. I think that this is gonna become much, much more important because it saves a lot of bandwidth.[01:29:00]
[01:29:00] Nisten Tahiraj: Instead of you having to run a visual language model to show the whole screen, to run it on an emulator, to have to click on it and move around. And it's so compute intensive. It's can you just gimme like a adjacent API, so I can just like,
[01:29:13] Alex Volkov: yeah, do
[01:29:13] Nisten Tahiraj: a constraint output to adjacent and just output three tokens.
[01:29:16] Nisten Tahiraj: Be done with the whole thing. so yeah. Yeah, it's, I think it'll become a big deal.
[01:29:21] Alex Volkov: So in the spirit of, of the holiday, thank you and tropic for trying to standardize things, standardize, often, sometimes it's annoying, but often leads to good things as [01:29:30] well. folks, should try out the MMCP and definitely give them feedback.
[01:29:34] Alex Volkov: but yeah, they should also abide by some standards as well. It looks like the industry is standardizing around the. OpenAI SDK, and they maybe should also, it would help.
[01:29:43] Wolfram Ravenwolf: It's a new thing that they are doing because, so far we usually had the LLM as a part in an agent pipeline where you have, another process called the LLM with some input.
[01:29:52] Wolfram Ravenwolf: And here we have the LLM going out to get. The input itself. So I think that is also in the agent context, very important and [01:30:00] more integration is always better, but of course it's a new thing. We have to develop all those servers as I call it. So a lot of reinventing the wheel, I guess we'll see if it can really persevere.
[01:30:12] Alex Volkov: Yeah, one example that they highlight, and Simon talked about this as well, is that if you have a database, a SQLite database that sits on your computer,the way to have So you guys know we, we talked about tool use, for example,via API, those models can Get respond with some, some idea of how to use your [01:30:30] tools.
[01:30:30] Alex Volkov: And you, as a developer, you are in charge of using those tools. You basically get in response a structure of a function call. And you're like, okay, now I have to take this and then go to an external tool and use this. This is connecting this piece forward. This is basically. Allowing this LLM to then actually go and actually use this tool.
[01:30:48] Alex Volkov: Basically like getting a step forward. And one, one example that they're showing is a connecting to a database, allowing this LLM to connect to a database via a sq lite like MCP server. the model compute [01:31:00] protocol server. cps, sorry. yeah. So connecting via this MCP server,you basically allowing LM to read from this database.
[01:31:08] Alex Volkov: Itself without like returning a call. And then you are in charge as a developer to go and do the call return it responses. so basically trying to, allow LLMs to connect to different services. Yeah. And this, I think I agree with you with more work in here. this could be big.
[01:31:24] Nisten Tahiraj: It could literally make like over a thousand times more compute efficient to automate [01:31:30] something on a screen. Because instead of using a visual language model frame by frame, you can just have a JSON.
[01:31:37] Alex Volkov: Let's talk about Like literally
[01:31:38] Nisten Tahiraj: over a thousand times. Let's compute to do it. So I'm going to, I'm going to take a longer look at it as well.
[01:31:46] Alex Volkov: speaking of automating things on the screen,
[01:31:48] H runner from H the french AI company
[01:31:48] Alex Volkov: let's talk about the next thing that we want to talk about, H company AI. This is the next thing in big companies and APIs, H company from. France, this is another big company. So [01:32:00] we know Mistral is from France. some, DeepMind, some folks is from France as well.
[01:32:04] Alex Volkov: there's also FAIR in France from Meta. now France is positioning themselves to be one big kind of hub from AI as well. H Company. raised, fundraised, I think, 250 I have in my notes. Yeah, 220, one of the biggest, seed rounds. 220 million dollars, one of the biggest ones in, in the history of, French seed rounds, a while ago.
[01:32:24] Alex Volkov: And they just showcased their Runner H. Their Runner H [01:32:30] is, they're competing with Claude on speed of computer use. I apologize for this. Let's take a look at how fast they're claiming they're opening a browser, going to recipes and providing recipes for something. On the right, we have Claude, Computer Use.
[01:32:46] Alex Volkov: Claude is basically, hey, open the browser. On the left, they already pulled up a browser and already extracting data. So basically they're claiming A speed up of maybe two to three times over cloud computer use. [01:33:00] And they're basically showing while Claude still pulls up the Firefox browser, they have already completed the task, extracted the data and already responded to the user.
[01:33:09] Alex Volkov: they're showing steps by steps comparison, which I don't think is necessarily in, Apple's to Apple's comparison. I don't think it's necessarily fair, but. There's a big but here, big French, but, I don't know how to say, sorry, Nisten, I don't know how to say but in French, but there's a big one.
[01:33:25] Alex Volkov: Their models, as far as I could see, and I did some research, they have [01:33:30] a, they say this runner age thing that they have is powered by a specialized LLM, specialized optimist for function calling for 2 billion params. So whatever we see on the left is not like Claude, which whatever, we don't know the size of Claude, this is like a 2 billion parameter model.
[01:33:45] Alex Volkov: and, integrates in the VLM of a 3 billion parameter model to see, understand, interact with the graphical and text interface. Let's look at another example here. they're basically browsing the web and like doing extraction and yeah, I don't think you guys can see it. maybe like this.[01:34:00]
[01:34:02] Alex Volkov: It's literally, they're going to Wolfram Alpha and extracting and doing this task. They're basically asking Wolfram Alpha to do a task. So it's not like they're just reading from things. They're finding input and they're like plugging things in there and like responding, reading from the output from Wolfram Alpha as well.
[01:34:18] Alex Volkov: this runnerage thing actually performs tasks on the web. Extracts information back way faster than Claude Computerius, which Claude Computerius, let's give it its place. We were very excited when it came [01:34:30] out, and it does very well for, for just an adjustment of Claude. and they are showing immense differences in five steps, and we're still waiting for Claude Computerius to like, Try to figure this out.
[01:34:42] Alex Volkov: So did you
[01:34:43] Nisten Tahiraj: say it's a separate to be model? And then there's another? That's what I found
[01:34:48] Alex Volkov: from them. Yeah. Yeah. They said that they have, let me see if I can find the previous announcement. Yeah. Yeah.
[01:34:54] Wolfram Ravenwolf: The previous announcement
[01:34:56] Alex Volkov: that they have, that we missed from last week, Introducing Studio, a [01:35:00] automations at scale, run or age the most advanced agent to date.
[01:35:04] Alex Volkov: That's what they said last year. Powered by specialized LLM, highly optimized for function calling, 2 billion parameters. It also integrates a specialized VLM, 3 billion parameters, to perceive, understand, and interact with graphical and text elements. Delivers the state of the art on the public WebVoyager framework.
[01:35:20] Alex Volkov: And this is the graph that they have. WebVoyager, they have Runner H01. At 66 percent maybe? And, and [01:35:30] then, Cloud Computer Use at 52 percent and Agent E, I don't know where it is, it's like here. Yeah, so like the size of it is what's the most impressive part.
[01:35:41] Nisten Tahiraj: Yeah, I'd say this is impressive. as to what they're doing.
[01:35:44] Nisten Tahiraj: we can guess what model they're using, but it doesn't matter all that much. I just wanna say that it's not an apples to apples comparison with cloud because cloud is an entire OS in there and you can use whatever you want. It can use blender, it can, [01:36:00] you can run a virtual box of Windows 95 and it will use that as well.
[01:36:04] Eugen Cheugh: so the, yeah, it's not. That's not a pure example, whereas in this one, I'm assuming they do need access to the document object model, the DOM of the website, to be able to navigate it, but The results do indeed seem impressive, and it's at a size that you can run it, you can run on your own, Yeah, because if you're measuring steps and speed, actually, I think, Anthropic Cloud should, probably, partner with [01:36:30] a company like Browserbase, and just, do a demo, and then see how close they get instead. It will skip literally the first eight steps or something like that, which is all just the OS booted up.
[01:36:40] Alex Volkov: Yeah, this is why I didn't love the comparison specifically, you guys are right, it's running a janky Docker with Firefox, and by the time, it loads Firefox, these guys already loaded the website, so it's not like necessarily apples to apples, but it looks like those models are tiny compared to Claude, and also, they talk about, It's beyond [01:37:00] optimizing agent performance, they're like, they have, optimizing web interactions.
[01:37:05] Alex Volkov: they engineered Runaways to handle any web interactions. Advancing towards one singular mission, automating the web, so they're focused on web. So Eugene, like what you're talking about, like browser based with computer use, it looks like this is their focus, whereas computer use is, for computer use, generic.
[01:37:22] Alex Volkov: This is like their focus for web interactions. I guess what I'm saying is it's exciting. they raised a boatload of money, the folks behind [01:37:30] there, they seem like very,adept, I, I know they're based in France, Wolfram. I don't know, Wolfram, you're asking if, if I'm sure they're France.
[01:37:36] Alex Volkov: yeah, they're based in France, and, Yeah, we'll see. They're waitlisted. I haven't tested them out. I know that some folks collaborated on them already and posted some threads. so we'll hopefully, we'll see if I get access to this. I'll tell you guys and we'll play with it. Absolutely. definitely exciting in the world of agents.
[01:37:54] Alex Volkov: I think this is it from big companies. Folks, what do you think? Anything else From big companies, nothing from Google after the [01:38:00] releases of last week where they reclaimed the throne. Hopefully they're getting their deserved breaks and and relaxing. I don't think this week was fairly chill.
[01:38:07] Alex Volkov: Probably the next week they're going to come back with a vengeance. Next week there's like the AWS re invent. Maybe Amazon will come with something. And then the week after RPS. Maybe some folks are waiting for that. I think that this is it in big companies. Let's move on to vision and video.
[01:38:22] Alex Volkov: And then, Oh, I think we're at two minutes. Folks, I think we're at time. I think we're at time. I got too excited that we have like a bunch of other things to talk about. [01:38:30] So let me maybe recap on our Thanksgiving super quick. the stuff that we didn't get to just to like to recap super quick. we didn't get to, but just to tell you guys what else we didn't get to, runway specifically.
[01:38:41] Alex Volkov: Oh yeah, I just, I have to show this. not to talk about this. Just just visually show this beautiful thing. If I can click this. If I can click this thing, yeah, Runway introduced an expand feature, if you guys haven't seen this, it's really fun to just watch. Let me just mute this. basically, [01:39:00] what you see above and below, Runway introduced an expand feature where you take a video and you give it, give this model and the model tries to predict it.
[01:39:08] Alex Volkov: in different ratio, what's above and below this video. So basically, if you give a video in the widescreen format, 16 by nine, and you could try to turn it into a 19 by six format. And so the model will try to fill in the frames. The general video model tries to fill in the frames of what's above and below.
[01:39:25] Alex Volkov: So what we're looking at in the video on the screen is like a Lord of the [01:39:30] Rings scene where Legolas rides one of those like elephant looking thingies. Basically, the model tries to fill in the, just the frames from above and below. It just looks a little bit creepy. it's funny looking, but it's like looks, interesting.
[01:39:45] Alex Volkov: so this is like one expand feature and the other one is they released an actual image model from Runway, which kind of looks interesting. it's called a frames and it's specific for image generation for [01:40:00] world building. and Confi UI desktop launched. I think that's pretty much it.
[01:40:05] Thanksgiving Reflections and Thanks
[01:40:05] Alex Volkov: Folks, it's time to say thanks, because it's Thanksgiving. I just wanted to start, but I wanted to hear from you as well. My biggest thanks this year goes to, first of all, everybody who tunes in to ThursdAI. Everybody who comes into the community, everybody who provides comments and shares with their friends and, and listens and,The second huge thanks goes to all of you.
[01:40:26] Alex Volkov: My co hosts here, Wolfram, Yam, Nisten, LDJ, Junyang [01:40:30] who joined us, Eugene who joined us as well. Zafari who joined us from time to time, like a bunch of other folks. huge thanks to you for being here from like week to week for more than like almost, we're coming up on two years. And I think the thirst, the third thanks goes to Jensen for the GPUs that he provided for all of us to enjoy those like amazing corn coffee of AI features around the world.
[01:40:51] Alex Volkov: just, yeah, just open up the mics and feel free to, to join the festivities even though I don't know any of you celebrate [01:41:00] Thanksgiving unnecessarily. But yeah, what are you guys thankful for? before we wrap up, let's do the Thanksgiving roundup.
[01:41:07] Eugen Cheugh: I'm giving thanks to open models.
[01:41:08] Eugen Cheugh: let's go. Yeah, no, proving that you do not need billions of dollars to catch up with GPT 4 despite what the big labs will say. The open teams, keep going, keep bringing open models to the masses.
[01:41:25] Nisten Tahiraj: Yeah, We had Thanksgiving last month in Canada. I would like to [01:41:30] give thanks to two particular creators, mahi and, tki. each have over a thousand models and, quants that they release. And, and also Mr. Der Backer, probably mispronounced that was, over 5,000, quantization of models.
[01:41:48] Nisten Tahiraj: this is the stuff I use every day in tell. Other people. So whenever something new comes up, I almost always expect them to have a good, well done quantization ready for [01:42:00] others to use. and they just do this as volunteers. I don't even think they're part of the, none of them are part of like even a big corporation, or have high salaries.
[01:42:08] Nisten Tahiraj: They literally just do it as volunteers. Yeah, I want to give thanks to those people in particular, and everybody else here, and all the people on Discord as well, who sit around and help you correct stuff, but yeah, that's it for me.
[01:42:27] Wolfram Ravenwolf: Okay, I have three. The first [01:42:30] is to Alex for the podcast, because it's amazing to be here.
[01:42:34] Wolfram Ravenwolf: It's my way to keep up with the stuff I can't keep up with. So thank you for having me. Thank you for doing this. Thank you very much. And the second is to the whole community of AI people, especially those who release all these stuff in the open. But everybody who contributes, everybody who does a good thing about it, I think it is furthering humanity.
[01:42:53] Wolfram Ravenwolf: So thanks for that. And the third is a thanks to every reasonable person who is not, Going to insights or stuff, [01:43:00] but it's open minded and, seeing that we are all in the same boat and we are all trying to make the world a better place in our different ways. And for being, accepting and understanding of this.
[01:43:11] Wolfram Ravenwolf: In this times, I think it's very important to keep an open mind.
[01:43:16] Nisten Tahiraj: Oh yeah, just really quickly to add on, the biggest thanks I think for this year goes to the DeepSeek and Qwent teams for just caring. up everybody [01:43:30] else when we stalled on progress they kept it up to like actually democratize the models for you to actually have this piece of artificial intelligence and own it and control it and be loyal and make it loyal to you yeah.
[01:43:47] Nisten Tahiraj: they actually enable people to, to run fully local models. Like 90% of what I use every day is just completely open source. Now, honestly, it w it, I wouldn't, it would not be there if it wasn't for them. It would probably maybe be like [01:44:00] 20, 30%. So,yeah, they, they really carried, like that's a gaming term, like someone who.
[01:44:06] Nisten Tahiraj: Carries the team. They have really carried, so yeah.
[01:44:11] Alex Volkov: Jan, go
[01:44:14] Yam Peleg: ahead. To Jensen for the GPUs, and
[01:44:17] Alex Volkov: to everybody
[01:44:18] Yam Peleg: else I'm hugging face. Especially people collecting and releasing datasets. I think they're not getting enough credits because you can't just use the dataset [01:44:30] without training a model. There is an effort.
[01:44:31] Yam Peleg: to, until you appreciate the dataset, but, they make it possible, everything else.
[01:44:39] Alex Volkov: Last thing that I have to, and this is not because I have to, but honestly, folks, huge thanks to Weights Biases for all of this, honestly, I wouldn't have been able to do this as my job without a few folks in Weights Biases, so thank you Morgan, thank you Lavanya, thank you a bunch of folks in Weights Biases.
[01:44:55] Alex Volkov: who realized this could be a part of my actual day to day and bringing you news from Weights [01:45:00] Biases, but also promoting some of the stuff. many of the labs, if not most of the labs that we talk about, are using Weights Biases to bring us the open source, but also the closed source LLMs in the world.
[01:45:10] Alex Volkov: I couldn't be More happy and be in a better place to bring you the news, but also participate behind the scenes in building some of these things. With that, thank you to all of you. Hopefully you go and enjoy some of the rest of your holiday. Those of you who celebrate, those of you who don't celebrate, this is, I think the first Thursday in a while that we didn't have any breaking news.
[01:45:27] Alex Volkov: I'm itching to press it anyway, but we didn't [01:45:30] have any breaking news, but hopefully we'll have some next week. There could be some news next week. We'll see. With that, thank everybody who joins, go and enjoy the rest of your day. And we'll see you here next week as always. Bye everyone. Bye bye.
[01:45:43] Alex Volkov: Bye bye. Bye bye. Bye bye. Bye bye. And we have [01:46:00] a

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
- Ouvir Ouvir novamente Continuar A reproduzir…
- Ouvir depois Ouvir depois
📆 ThursdAI - Nov 21 - The fight for the LLM throne, OSS SOTA from AllenAI, Flux new tools, Deepseek R1 reasoning & more AI news
22 nov 2024· ThursdAI - The top AI news from the past week
Hey folks, Alex here, and oof what a 🔥🔥🔥 show we had today! I got to use my new breaking news button 3 times this show! And not only that, some of you may know that one of the absolutely biggest pleasures as a host, is to feature the folks who actually make the news on the show!
And now that we're in video format, you actually get to see who they are! So this week I was honored to welcome back our friend and co-host Junyang Lin, a Dev Lead from the Alibaba Qwen team, who came back after launching the incredible Qwen Coder 2.5, and Qwen 2.5 Turbo with 1M context.
We also had breaking news on the show that AI2 (Allen Institute for AI) has fully released SOTA LLama post-trained models, and I was very lucky to get the core contributor on the paper, Nathan Lambert to join us live and tell us all about this amazing open source effort! You don't want to miss this conversation!
Lastly, we chatted with the CEO of StackBlitz, Eric Simons, about the absolutely incredible lightning in the bottle success of their latest bolt.new product, how it opens a new category of code generator related tools.
00:00 Introduction and Welcome
00:58 Meet the Hosts and Guests
02:28 TLDR Overview
03:21 Tl;DR
04:10 Big Companies and APIs
07:47 Agent News and Announcements
08:05 Voice and Audio Updates
08:48 AR, Art, and Diffusion
11:02 Deep Dive into Mistral and Pixtral
29:28 Interview with Nathan Lambert from AI2
30:23 Live Reaction to Tulu 3 Release
30:50 Deep Dive into Tulu 3 Features
32:45 Open Source Commitment and Community Impact
33:13 Exploring the Released Artifacts
33:55 Detailed Breakdown of Datasets and Models
37:03 Motivation Behind Open Source
38:02 Q&A Session with the Community
38:52 Summarizing Key Insights and Future Directions
40:15 Discussion on Long Context Understanding
41:52 Closing Remarks and Acknowledgements
44:38 Transition to Big Companies and APIs
45:03 Weights & Biases: This Week's Buzz
01:02:50 Mistral's New Features and Upgrades
01:07:00 Introduction to DeepSeek and the Whale Giant
01:07:44 DeepSeek's Technological Achievements
01:08:02 Open Source Models and API Announcement
01:09:32 DeepSeek's Reasoning Capabilities
01:12:07 Scaling Laws and Future Predictions
01:14:13 Interview with Eric from Bolt
01:14:41 Breaking News: Gemini Experimental
01:17:26 Interview with Eric Simons - CEO @ Stackblitz
01:19:39 Live Demo of Bolt's Capabilities
01:36:17 Black Forest Labs AI Art Tools
01:40:45 Conclusion and Final Thoughts
As always, the show notes and TL;DR with all the links I mentioned on the show and the full news roundup below the main new recap 👇
Google & OpenAI fighting for the LMArena crown 👑
I wanted to open with this, as last week I reported that Gemini Exp 1114 has taken over #1 in the LMArena, in less than a week, we saw a new ChatGPT release, called GPT-4o-2024-11-20 reclaim the arena #1 spot!
Focusing specifically on creating writing, this new model, that's now deployed on chat.com and in the API, is definitely more creative according to many folks who've tried it, with OpenAI employees saying "expect qualitative improvements with more natural and engaging writing, thoroughness and readability" and indeed that's what my feed was reporting as well.
I also wanted to mention here, that we've seen this happen once before, last time Gemini peaked at the LMArena, it took less than a week for OpenAI to release and test a model that beat it.
But not this time, this time Google came prepared with an answer!
Just as we were wrapping up the show (again, Logan apparently loves dropping things at the end of ThursdAI), we got breaking news that there is YET another experimental model from Google, called Gemini Exp 1121, and apparently, it reclaims the stolen #1 position, that chatGPT reclaimed from Gemini... yesterday! Or at least joins it at #1
LMArena Fatigue?
Many folks in my DMs are getting a bit frustrated with these marketing tactics, not only the fact that we're getting experimental models faster than we can test them, but also with the fact that if you think about it, this was probably a calculated move by Google. Release a very powerful checkpoint, knowing that this will trigger a response from OpenAI, but don't release your most powerful one. OpenAI predictably releases their own "ready to go" checkpoint to show they are ahead, then folks at Google wait and release what they wanted to release in the first place.
The other frustration point is, the over-indexing of the major labs on the LMArena human metrics, as the closest approximation for "best". For example, here's some analysis from Artificial Analysis showing that the while the latest ChatGPT is indeed better at creative writing (and #1 in the Arena, where humans vote answers against each other), it's gotten actively worse at MATH and coding from the August version (which could be a result of being a distilled much smaller version) .
In summary, maybe the LMArena is no longer 1 arena is all you need, but the competition at the TOP scores of the Arena has never been hotter.
DeepSeek R-1 preview - reasoning from the Chinese Whale
While the American labs fight for the LM titles, the real interesting news may be coming from the Chinese whale, DeepSeek, a company known for their incredibly cracked team, resurfaced once again and showed us that they are indeed, well super cracked.
They have trained and released R-1 preview, with Reinforcement Learning, a reasoning model that beasts O1 at AIME and other benchmarks! We don't know many details yet, besides them confirming that this model comes to the open source! but we do know that this model , unlike O1, is showing the actual reasoning it uses to achieve it's answers (reminder: O1 hides its actual reasoning and what we see is actually another model summarizing the reasoning)
The other notable thing is, DeepSeek all but confirmed the claim that we have a new scaling law with Test Time / Inference time compute law, where, like with O1, the more time (and tokens) you give a model to think, the better it gets at answering hard questions. Which is a very important confirmation, and is a VERY exciting one if this is coming to the open source!
Right now you can play around with R1 in their demo chat interface.
In other Big Co and API news
In other news, Mistral becomes a Research/Product company, with a host of new additions to Le Chat, including Browse, PDF upload, Canvas and Flux 1.1 Pro integration (for Free! I think this is the only place where you can get Flux Pro for free!).
Qwen released a new 1M context window model in their API called Qwen 2.5 Turbo, making it not only the 2nd ever 1M+ model (after Gemini) to be available, but also reducing TTFT (time to first token) significantly and slashing costs. This is available via their APIs and Demo here.
Open Source is catching up
AI2 open sources Tulu 3 - SOTA 8B, 70B LLama post trained FULLY open sourced (Blog ,Demo, HF, Data, Github, Paper)
Allen AI folks have joined the show before, and this time we got Nathan Lambert, the core contributor on the Tulu paper, join and talk to us about Post Training and how they made the best performing SOTA LLama 3.1 Funetunes with careful data curation (which they also open sourced), preference optimization, and a new methodology they call RLVR (Reinforcement Learning with Verifiable Rewards).
Simply put, RLVR modifies the RLHF approach by using a verification function instead of a reward model. This method is effective for tasks with verifiable answers, like math problems or specific instructions. It improves performance on certain benchmarks (e.g., GSM8K) while maintaining capabilities in other areas.
The most notable thing is, just how MUCH is open source, as again, like the last time we had AI2 folks on the show, the amount they release is staggering
In the show, Nathan had me pull up the paper and we went through the deluge of models, code and datasets they released, not to mention the 73 page paper full of methodology and techniques.
Just absolute ❤️ to the AI2 team for this release!
🐝 This weeks buzz - Weights & Biases corner
This week, I want to invite you to a live stream announcement that I am working on behind the scenes to produce, on December 2nd. You can register HERE (it's on LinkedIn, I know, I'll have the YT link next week, promise!)
We have some very exciting news to announce, and I would really appreciate the ThursdAI crew showing up for that! It's like 5 minutes and I helped produce 🙂
Pixtral Large is making VLMs cool again
Mistral had quite the week this week, not only adding features to Le Chat, but also releasing Pixtral Large, their updated multimodal model, which they claim state of the art on multiple benchmarks.
It's really quite good, not to mention that it's also included, for free, as part of the le chat platform, so now when you upload documents or images to le chat you get Pixtral Large.
The backbone for this model is Mistral Large (not the new one they also released) and this makes this 124B model a really really good image model, albeit a VERY chonky one that's hard to run locally.
The thing I loved about the Pixtral release the most is, they used the new understanding to ask about Weights & Biases charts 😅 and Pixtral did a pretty good job!
Some members of the community though, reacted to the SOTA claims by Mistral in a very specific meme-y way:
This meme has become a very standard one, when labs tend to not include Qwen VL 72B or other Qwen models in the evaluation results, all while claiming SOTA. I decided to put these models to a head to head test myself, only to find out, that ironically, both models say the other one is better, while both hallucinate some numbers.
BFL is putting the ART in Artificial Intelligence with FLUX.1 Tools (blog)
With the absolute breaking news bombastic release, the folks at BFL (Black Forest Labs) have released Flux.1 Tools, which will allow AI artist to use these models in all kind of creative inspiring ways.
These tools are: FLUX.1 Fill (for In/Out painting), FLUX.1 Depth/Canny (Structural Guidance using depth map or canny edges) and FLUX.1 Redux for image variation and restyling.
These tools are not new to the AI Art community conceptually, but they have been patched over onto Flux from other models like SDXL, and now the actual lab releasing them gave us the crème de la crème, and the evals speak for themselves, achieving SOTA on image variation benchmark!
The last thing I haven't covered here, is my interview with Eric Simons, the CEO of StackBlitz, who came in to talk about the insane rise of bolt.new, and I would refer you to the actual recording for that, because it's really worth listening to it (and seeing me trying out bolt in real time!)
That's most of the recap, we talked about a BUNCH of other stuff of course, and we finished on THIS rap song that ChatGPT wrote, and Suno v4 produced with credits to Kyle Shannon.
TL;DR and Show Notes:
* Open Source LLMs
* Mistral releases Pixtral Large (Blog, HF, LeChat)
* Mistral - Mistral Large 2411 (a HF)
* Sage Attention the next Flash Attention? (X)
* AI2 open sources Tulu 3 - SOTA 8B, 70B LLama Finetunes FULLY open sourced (Blog ,Demo, HF, Data, Github, Paper)
* Big CO LLMs + APIs
* Alibaba - Qwen 2.5 Turbo with 1M tokens (X, HF Demo)
* Mistral upgrades to a product company with le chat 2.0 (Blog, Le Chat)
* DeepSeek R1-preview - the first reasoning model from the Chinese whale (X, chat)
* OpenAI updates ChatGPT in app and API - reclaims #1 on LMArena (X)
* Gemini Exp 1121 - rejoins #1 spot on LMArena after 1 day of being beaten (X)
* Agents News
* Perplexity is going to do the shopping for you (X, Shop)
* Stripe Agent SDK - allowing agents to transact (Blog)
* This weeks Buzz
* We have an important announcement coming on December 2nd! (link)
* Voice & Audio
* Suno V4 released - but for real this time (X)
* ChatGPT new creative writing does Eminem type rap with new Suno v4 (link)
* AI Art & Diffusion & 3D
* BFL announcing Flux Tools today (blog, fal)
* Free BFL Flux Pro on Mistral Le Chat!
*
Thank you, see you next week 🫡

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
- Ouvir Ouvir novamente Continuar A reproduzir…
- Ouvir depois Ouvir depois
📆 ThursdAI - Nov 14 - Qwen 2.5 Coder, No Walls, Gemini 1114 👑 LLM, ChatGPT OS integrations & more AI news
15 nov 2024· ThursdAI - The top AI news from the past week
This week is a very exciting one in the world of AI news, as we get 3 SOTA models, one in overall LLM rankings, on in OSS coding and one in OSS voice + a bunch of new breaking news during the show (which we reacted to live on the pod, and as we're now doing video, you can see us freak out in real time at 59:32)
00:00 Welcome to ThursdAI
00:25 Meet the Hosts
02:38 Show Format and Community
03:18 TLDR Overview
04:01 Open Source Highlights
13:31 Qwen Coder 2.5 Release
14:00 Speculative Decoding and Model Performance
22:18 Interactive Demos and Artifacts
28:20 Training Insights and Future Prospects
33:54 Breaking News: Nexus Flow
36:23 Exploring Athene v2 Agent Capabilities
36:48 Understanding ArenaHard and Benchmarking
40:55 Scaling and Limitations in AI Models
43:04 Nexus Flow and Scaling Debate
49:00 Open Source LLMs and New Releases
52:29 FrontierMath Benchmark and Quantization Challenges
58:50 Gemini Experimental 1114 Release and Performance
01:11:28 LLM Observability with Weave
01:14:55 Introduction to Tracing and Evaluations
01:15:50 Weave API Toolkit Overview
01:16:08 Buzz Corner: Weights & Biases
01:16:18 Nous Forge Reasoning API
01:26:39 Breaking News: OpenAI's New MacOS Features
01:27:41 Live Demo: ChatGPT Integration with VS Code
01:34:28 Ultravox: Real-Time AI Conversations
01:42:03 Tilde Research and Stargazer Tool
01:46:12 Conclusion and Final Thoughts
This week also, there was a debate online, whether deep learning (and scale is all you need) has hit a wall, with folks like Ilya Sutskever being cited by publications claiming it has, folks like Yann LeCoon calling "I told you so". TL;DR? multiple huge breakthroughs later, and both Oriol from DeepMind and Sam Altman are saying "what wall?" and Heiner from X.ai saying "skill issue", there is no walls in sight, despite some tech journalism love to pretend there is. Also, what happened to Yann? 😵‍💫
Ok, back to our scheduled programming, here's the TL;DR, afterwhich, a breakdown of the most important things about today's update, and as always, I encourage you to watch / listen to the show, as we cover way more than I summarize here 🙂
TL;DR and Show Notes:
* Open Source LLMs
* Qwen Coder 2.5 32B (+5 others) - Sonnet @ home (HF, Blog, Tech Report)
* The End of Quantization? (X, Original Thread)
* Epoch : FrontierMath new benchmark for advanced MATH reasoning in AI (Blog)
* Common Corpus: Largest multilingual 2T token dataset (blog)
* NexusFlow - Athena v2 - open model suite (X, Blog, HF)
* Big CO LLMs + APIs
* Gemini 1114 is new king LLM #1 LMArena (X)
* Nous Forge Reasoning API - beta (Blog, X)
* Reuters reports "AI is hitting a wall" and it's becoming a meme (Article)
* Cursor acq. SuperMaven (X)
* This Weeks Buzz
* Weave JS/TS support is here 🙌
* Voice & Audio
* Fixie releases UltraVox SOTA (Demo, HF, API)
* Suno v4 is coming and it's bonkers amazing (Alex Song, SOTA Jingle)
* Tools demoed
* Qwen artifacts - HF Demo
* Tilde Galaxy - Interp Tool

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
- Ouvir Ouvir novamente Continuar A reproduzir…
- Ouvir depois Ouvir depois
📆 ThursdAI - Nov 7 - Video version, full o1 was given and taken away, Anthropic price hike-u, halloween 💀 recap & more AI news
8 nov 2024· ThursdAI - The top AI news from the past week
👋 Hey all, this is Alex, coming to you from the very Sunny California, as I'm in SF again, while there is a complete snow storm back home in Denver (brrr).
I flew here for the Hackathon I kept telling you about, and it was glorious, we had over 400 registered, over 200 approved hackers, 21 teams submitted incredible projects 👏 You can follow some of these here
I then decided to stick around and record the show from SF, and finally pulled the plug and asked for some budget, and I present, the first ThursdAI, recorded from the newly minted W&B Podcast studio at our office in SF 🎉
This isn't the only first, today also, for the first time, all of the regular co-hosts of ThursdAI, met on video for the first time, after over a year of hanging out weekly, we've finally made the switch to video, and you know what? Given how good AI podcasts are getting, we may have to stick around with this video thing! We played one such clip from a new model called hertz-dev, which is a
- Ouvir Ouvir novamente Continuar A reproduzir…
- Ouvir depois Ouvir depois
Mostrar mais

Episódios

ThursdAI - Mar 20 - OpenAIs new voices, Mistral Small, NVIDIA GTC recap & Nemotron, new SOTA vision from Roboflow & more AI news

📆 ThursdAI Turns Two! 🎉 Gemma 3, Gemini Native Image, new OpenAI tools, tons of open source & more AI news

ThursdAI - Mar 6, 2025 - Alibaba's R1 Killer QwQ, Exclusive Google AI Mode Chat, and MCP fever sweeping the community!

📆 Feb 27, 2025 - GPT-4.5 Drops TODAY?!, Claude 3.7 Coding BEAST, Grok's Unhinged Voice, Humanlike AI voices & more AI news

📆 ThursdAI - Feb 20 - Live from AI Eng in NY - Grok 3, Unified Reasoners, Anthropic's Bombshell, and Robot Handoffs!

📆 ThursdAI - Feb 13 - my Personal Rogue AI, DeepHermes, Fast R1, OpenAI Roadmap / RIP GPT6, new Claude & Grok 3 imminent?

📆 ThursdAI - Feb 6 - OpenAI DeepResearch is your personal PHD scientist, o3-mini & Gemini 2.0, OmniHuman-1 breaks reality & more AI news

📆 ThursdAI - Jan 30 - DeepSeek vs. Nasdaq, R1 everywhere, Qwen Max & Video, Open Source SUNO, Goose agents & more AI news

📆 ThursdAI - Jan 23, 2025 - 🔥 DeepSeek R1 is HERE, OpenAI Operator Agent, $500B AI manhattan project, ByteDance UI-Tars, new Gemini Thinker & more AI news

📆 ThursdAI - Jan 16, 2025 - Hailuo 4M context LLM, SOTA TTS in browser, OpenHands interview & more AI news

📆 ThursdAI - Jan 9th - NVIDIA's Tiny Supercomputer, Phi-4 is back, Kokoro TTS & Moondream gaze, ByteDance SOTA lip sync & more AI news

📆 ThursdAI - Jan 2 - is 25' the year of AI agents?

📆 ThursdAI - Dec 26 - OpenAI o3 & o3 mini, DeepSeek v3 658B beating Claude, Qwen Visual Reasoning, Hume OCTAVE & more AI news

🎄ThursdAI - Dec19 - o1 vs gemini reasoning, VEO vs SORA, and holiday season full of AI surprises

📆 ThursdAI - Dec 12 - unprecedented AI week - SORA, Gemini 2.0 Flash, Apple Intelligence, LLama 3.3, NeurIPS Drama & more AI news

📆 ThursdAI - Dec 5 - OpenAI o1 & o1 pro, Tencent HY-Video, FishSpeech 1.5, Google GENIE2, Weave in GA & more AI news

🦃 ThursdAI - Thanksgiving special 24' - Qwen Open Sources Reasoning, BlueSky hates AI, H controls the web & more AI news

📆 ThursdAI - Nov 21 - The fight for the LLM throne, OSS SOTA from AllenAI, Flux new tools, Deepseek R1 reasoning & more AI news

📆 ThursdAI - Nov 14 - Qwen 2.5 Coder, No Walls, Gemini 1114 👑 LLM, ChatGPT OS integrations & more AI news

📆 ThursdAI - Nov 7 - Video version, full o1 was given and taken away, Anthropic price hike-u, halloween 💀 recap & more AI news