Episodes

  • Wow, holy s**t, insane, overwhelming, incredible, "the future is here!", "still not there" - there are many more words to describe this past week. (TL;DR at the end of the blogpost)

    I had a feeling it was going to be a big week, and the companies did NOT disappoint, so this is going to be a very big newsletter as well.

    As you may have read last week, I was very lucky to be in San Francisco the weekend before Google IO to co-host a hackathon with the Meta Llama-3 team. It was a blast, and I will add my notes on that in the This Week's Buzz section.

    Then on Monday, we all got to watch the crazy announcements from OpenAI, namely a new flagship model called GPT-4o (we were right, it previously was im-also-a-good-gpt2-chatbot) that's twice as fast, 50% cheaper (in English; significantly more so in other languages, more on that later) and is Omni (that's the "o"), which means it is trained end to end with voice, vision and text inputs, and can generate text, voice and images on the output.

    A true MMIO (multimodal on inputs and outputs, that's not the official term) is here, and it has some very surprising capabilities that blew us all away, namely the ability to ask the model to "talk faster", put "more sarcasm in your voice" or "sing like a pirate". Though we didn't yet get that functionality with the GPT-4o model, it is absolutely and incredibly exciting. Oh, and it's available to everyone for free!

    That's GPT-4 level intelligence, for free for everyone, without having to log in!

    What's also exciting was how immediate it was: apparently not only is the model itself faster (unclear if it's due to newer GPUs, distillation, some other crazy advancements or all of the above), but training an end-to-end omni-model reduces latency enough to make it an incredibly immediate conversation partner, one that you can interrupt, ask to recover from a mistake, and that can hold a conversation very, very well.

    So well, in fact, that it seemed like the Waifu future (digital girlfriends/wives) is very close for the folks who would want it. While we didn't get to try it (we got GPT-4o but not the new voice mode, as Sam confirmed), OpenAI released a bunch of videos of their employees chatting with Omni (that's my nickname, use it if you'd like), and many online highlighted how thirsty / flirty it sounded. I downloaded all the videos for an X thread and I named one girlfriend.mp4, and well, just judge for yourself why:

    Ok, that's not all that OpenAI updated or shipped; they also updated the tokenizer, which is incredible news for folks all around, specifically the rest of the world. The new tokenizer reduces the previous "foreign language tax" by a LOT, making the model way, way cheaper for non-English languages as well.
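
    If you want to see the new tokenizer's effect yourself, here's a minimal sketch comparing token counts between the old GPT-4 encoding and the new GPT-4o one (this assumes a recent tiktoken release that ships the o200k_base encoding; the Hebrew sample sentence is just an illustration):

    import tiktoken

    old_enc = tiktoken.get_encoding("cl100k_base")   # GPT-4 / GPT-3.5 tokenizer
    new_enc = tiktoken.get_encoding("o200k_base")    # GPT-4o tokenizer

    text = "שלום, מה שלומך היום?"  # a Hebrew sentence; non-Latin scripts benefit the most
    print(len(old_enc.encode(text)), "tokens with cl100k_base")
    print(len(new_enc.encode(text)), "tokens with o200k_base")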

    One last announcement from OpenAI was the desktop app experience, and this one I actually got to use a bit, and it's incredible. macOS only for now, this app comes with a launcher shortcut (kind of like Raycast) that lets you talk to ChatGPT right then and there, without opening a new tab and without additional interruptions, and it can even understand what you see on the screen, help you understand code or jokes, or look up information. Here's just one example I had over at X. And sure, you could always do this with another tab, but the ability to do it without a context switch is a huge win.

    OpenAI did their demo 1 day before Google IO, but even during the excitement about Google IO, they announced that Ilya is not only alive, but is also departing from OpenAI, which was followed by an announcement from Jan Leike (who co-headed the superalignment team together with Ilya) that he left as well. This, to me, seemed like well-executed timing to dampen the Google news a bit.

    Google is BACK, backer than ever, Alex's Google IO recap

    On Tuesday morning I showed up to Shoreline Amphitheatre in Mountain View, together with a creators/influencers delegation, as we all watched the incredible firehose of announcements that Google had prepared for us.

    TL;DR - Google is adding Gemini and AI into all its products across Workspace (Gmail, Chat, Docs) and into other services like Photos, where you'll now be able to ask your photo library for specific moments. They introduced over 50 product updates, and I don't think it makes sense to cover all of them here, so I'll focus on what we do best.

    "Google with do the Googling for you"

    Gemini 1.5 Pro is now their flagship model (remember Ultra? where is that? 🤔) and has been extended to a 2M-token context window! Additionally, we got a new model called Gemini Flash, which is way faster and very cheap (up to 128K tokens, after which it becomes 2x more expensive).

    Gemini Flash is multimodal as well and has 1M context window, making it an incredible deal if you have any types of videos to process for example.

    Kind of hidden but important was a caching announcement, which IMO is a big deal, big enough that it could pose a serious risk to RAG-based companies. Google claims to have a way to cache the LLM activations for most of your context, so a developer won't have to pay for repeatedly sending the same thing over and over again (which happens in most chat applications), and it will significantly speed up work with larger context windows.

    They also mentioned Gemini Nano, an on-device Gemini that's also multimodal and can, for example, monitor calls in real time for older folks and alert them about being scammed. One of the cooler announcements was that Nano is going to be baked into the Chrome browser.

    With Gemma being upgraded too, there's not a product at Google that Gemini is not going to get infused into, and while they counted 131 "AI" mentions during the keynote, I'm pretty sure Gemini was mentioned way more!

    Project Astra - A universal AI agent helpful in everyday life

    After a few of the announcements from Sundar, (newly knighted) Sir Demis Hassabis came out and talked about DeepMind research, AlphaFold 3 and then turned to project Astra.

    This demo was really cool and kind of similar to the GPT-4o conversation, but also different. I'll let you just watch it yourself:

    TK: project astra demo

    And this is no fake; they actually had booths with Project Astra test stations, and I got to chat with it (I came back 3 times) and had a personal demo from Josh Woodward (VP of Labs), and it works, and works fast! It sometimes disconnects and sometimes there are misunderstandings, like when multiple folks are speaking, but overall it's very, very impressive.

    Remember the infamous rubber ducky video that turned out to be edited by Google and caused a major uproar when we found out? It's basically that, on steroids, and real and quite, quite fast.

    Astra has a decent short term memory, so if you ask it where something was, it will remember, and Google cleverly used that trick to also show that they are working on augmented reality glasses with Astra built in, which would make amazing sense.

    Open Source LLMs

    Google open sourced PaliGemma VLM

    Giving us something in the open source department, adding to previous models like RecurrentGemma, Google has uploaded a whopping 116 different checkpoints of a new VLM called PaliGemma to the hub - a state-of-the-art vision-language model at 3B parameters.

    It's optimized for finetuning for different workloads such as Visual Q&A, Image and short video captioning and even segmentation!
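
    Here's a rough sketch of what visual Q&A with one of these checkpoints might look like via transformers (the checkpoint id, image URL and the "answer en" task prefix are illustrative, and this assumes a transformers version recent enough to include PaliGemma support):

    import requests
    from PIL import Image
    from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

    model_id = "google/paligemma-3b-mix-224"  # example checkpoint; pick the one that fits your task
    model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)
    processor = AutoProcessor.from_pretrained(model_id)

    image = Image.open(requests.get("https://example.com/cat.jpg", stream=True).raw)  # placeholder image URL
    inputs = processor(text="answer en What animal is in the picture?", images=image, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=20)
    print(processor.decode(output[0], skip_special_tokens=True))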

    They also mentioned that Gemma 2 is coming next month; it will be a 27B parameter model that's optimized to run on a single TPU/GPU.

    Nous Research Hermes 2 Θ (Theta) - their first Merge!

    Collaborating with Charles Goddard from Arcee (the creators of MergeKit), Teknium and friends merged the recently trained Hermes 2 Pro with Llama-3 Instruct to get a model that performs well on all the tasks that Llama-3 is good at, while maintaining Hermes capabilities (function calling, JSON mode).

    Yi releases 1.5 with apache 2 license

    The folks at 01.ai released Yi 1.5, in 6B, 9B and 34B sizes (base and chat finetunes).

    Showing decent benchmarks on math and Chinese, the 34B beats Llama on some of these tasks while being 2x smaller, which is very impressive.

    This weeks Buzz - LLama3 hackathon with Meta

    Before all the craziness that was announced this week, I participated in and judged the first ever Llama-3 hackathon. It was quite incredible, with over 350 hackers participating and Groq, Lambda, Meta, Ollama and others sponsoring and giving talks and workshops - an incredible 24 hours at Shack15 in SF (where Cerebral Valley hosts their hackathons).

    The winning hacks were really innovative, ranging from completely open source smart glasses for under $20, to an LLM debate platform with an LLM judge for any moral issue, to one project that was able to jailbreak Llama by doing some advanced LLM arithmetic. Kudos to the winning teams, and it was amazing to see how many of them adopted Weave as their observability framework, as it was really easy to integrate.
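
    To give you a sense of why the integration was easy, here's a rough sketch of the pattern the teams used (the project name and model are placeholders, and this assumes the current weave Python SDK with weave.init and the @weave.op decorator):

    import weave
    from openai import OpenAI

    weave.init("llama3-hackathon-demo")  # placeholder project name

    client = OpenAI()  # teams pointed this at their Llama-3 endpoints (Groq, Ollama, etc.) via base_url

    @weave.op()
    def ask_llm(question: str) -> str:
        # every call to a @weave.op() decorated function gets traced: inputs, outputs, latency
        response = client.chat.completions.create(
            model="gpt-4o",  # placeholder model name
            messages=[{"role": "user", "content": question}],
        )
        return response.choices[0].message.content

    print(ask_llm("Explain LoRA in one sentence"))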

    Oh and I got to co-judge with the 🐐 of HuggingFace

    That's all the notes for this week, even though there was a LOT more; check out the TL;DR, and see you here next week, which I'll be recording from Seattle, where I'll be attending the Microsoft BUILD event, so we'll get to see Microsoft's answer to Google IO as well. If you're coming to BUILD, come by our booth and give me a high five!

    TL;DR of all topics covered:

    * OpenAI Announcements

    * GPT-4o

    * Voice mode

    * Desktop App

    * Google IO recap:

    * Google Gemini

    * Gemini 1.5 Pro: Available globally to developers with a 2-million-token context window, enabling it to handle larger and more complex tasks.

    * Gemini 1.5 Flash: A faster and less expensive version of Gemini, optimized for tasks requiring low latency.

    * Gemini Nano with Multimodality: An on-device model that processes various inputs like text, photos, audio, web content, and social videos.

    * Project Astra: An AI agent capable of understanding and responding to live video and audio in real-time.

    * Google Search

    * AI Overviews in Search Results: Provides quick summaries and relevant information for complex search queries.

    * Video Search with AI: Allows users to search by recording a video, with Google's AI processing it to pull up relevant answers.

    * Google Workspace

    * Gemini-powered features in Gmail, Docs, Sheets, and Meet: Including summarizing conversations, providing meeting highlights, and processing data requests.

    * "Chip": An AI teammate in Google Chat that assists with various tasks by accessing information across Google services.

    * Google Photos

    * "Ask Photos": Allows users to search for specific items in photos using natural language queries, powered by Gemini.

    * Video Generation

    * Veo Generative Video: Creates 1080p videos from text prompts, offering cinematic effects and editing capabilities.

    * Other Notable AI Announcements

    * NotebookLM: An AI tool to organize and interact with various types of information (documents, PDFs, notes, etc.), allowing users to ask questions about the combined information.

    * Video Overviews (Prototyping): A feature within NotebookLM that generates audio summaries from uploaded documents.

    * Code VR: A generative video AI model capable of creating high-quality videos from various prompts.

    * AI Agents: A demonstration showcasing how AI agents could automate tasks across different software and systems.

    * Generative Music: Advancements in AI music generation were implied but not detailed.

    * Open Source LLMs

    * Google PaliGemma 3B - sota open base VLM (Blog)

    * Gemma 2 - 27B coming next month

    * Hermes 2 Θ (Theta) - Merge of Hermes Pro & Llama-instruct (X, HF)

    * Yi 1.5 - Apache 2 licensed 6B, 9B and 34B (X)

    * Tiger Lab - MMLU-pro - a harder MMLU with 12K questions (X, HuggingFace)

    * This weeks Buzz (What I learned with WandB this week)

    * Llama3 hackathon with Meta, Cerebral Valley, HuggingFace and Weights & Biases

    * Vision & Video

    * Google announces VEO - High quality cinematic generative video generation (X)

    * AI Art & Diffusion & 3D

    * Google announces Imagen3 - their latest Gen AI art model (Blog)

    * Tools

    * Cursor trained a model that does code editing at 1000 tokens/s 😮 (X)



    This is a public episode. If you’d like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
  • Hey 👋 (show notes and links a bit below)

    This week has been a great AI week; however, it does feel a bit like the "quiet before the storm", with Google I/O on Tuesday next week (which I'll be covering on the ground from Shoreline!) and rumors that OpenAI is not just going to let Google have all the spotlight!

    Early this week, we got 2 new models on LMsys, im-a-good-gpt2-chatbot and im-also-a-good-gpt2-chatbot, and we've now confirmed that they are from OpenAI. Folks have been testing them with logic puzzles and role play and have been saying great things, so maybe that's what we'll get from OpenAI soon?

    Also on the show today we had a BUNCH of guests, and as you know, I love chatting with the folks who make the news, so we were honored to host Xingyao Wang and Graham Neubig, core maintainers of OpenDevin (which just broke SOTA on SWE-Bench this week!), and then we had friends of the pod Tanishq Abraham and Parmita Mishra dive deep into AlphaFold 3 from Google (both are medical / bio experts).

    Also this week, OpenUI from Chris Van Pelt (Co-founder & CIO at Weights & Biases) has been blowing up, taking #1 Github trending spot, and I had the pleasure to invite Chris and chat about it on the show!

    Let's delve into this (yes, this is I, Alex the human, using Delve as a joke, don't get triggered 😉)

    TL;DR of all topics covered (trying something new, my Raw notes with all the links and bulletpoints are at the end of the newsletter)

    * Open Source LLMs

    * OpenDevin getting SOTA on Swe-Bench with 21% (X, Blog)

    * DeepSeek V2 - 236B (21B Active) MoE (X, Try It)

    * Weights & Biases OpenUI blows over 11K stars (X, Github, Try It)

    * LLama-3 120B Chonker Merge from Maxime Labonne (X, HF)

    * Alignment Lab open sources Buzz - 31M rows training dataset (X, HF)

    * xLSTM - new transformer alternative (X, Paper, Critique)

    * Benchmarks & Eval updates

    * LLama-3 still in 6th place (LMsys analysis)

    * Reka Core gets awesome 7th place and Qwen-Max breaks top 10 (X)

    * No upsets in LLM leaderboard

    * Big CO LLMs + APIs

    * Google DeepMind announces AlphaFold-3 (Paper, Announcement)

    * OpenAI publishes their Model Spec (Spec)

    * OpenAI tests 2 models on LMsys (im-also-a-good-gpt2-chatbot & im-a-good-gpt2-chatbot)

    * OpenAI joins Coalition for Content Provenance and Authenticity (Blog)

    * Voice & Audio

    * Udio adds in-painting - change parts of songs (X)

    * 11Labs joins the AI Audio race (X)

    * AI Art & Diffusion & 3D

    * ByteDance PuLID - new high quality ID customization (Demo, Github, Paper)

    * Tools & Hardware

    * Went to the Museum with Rabbit R1 (My Thread)

    * Co-Hosts and Guests

    * Graham Neubig (@gneubig) & Xingyao Wang (@xingyaow_) from Open Devin

    * Chris Van Pelt (@vanpelt) from Weights & Biases

    * Nisten Tahiraj (@nisten) - Cohost

    * Tanishq Abraham (@iScienceLuvr)

    * Parmita Mishra (@prmshra)

    * Wolfram Ravenwolf (@WolframRvnwlf)

    * Ryan Carson (@ryancarson)

    Open Source LLMs

    Open Devin getting a whopping 21% on SWE-Bench (X, Blog)

    OpenDevin started as a tweet from our friend Junyang Lin (on the Qwen team at Alibaba) calling for an open source alternative to the very popular Devin code agent from Cognition Labs (recently valued at $2B 🤯). Eight weeks later, with tons of open source contributions from >100 contributors, they have almost 25K stars on Github and now claim a state-of-the-art score on the very hard SWE-Bench Lite benchmark, beating Devin and SWE-Agent (which got 18%).

    They did so by using the CodeAct framework developed by Xingyao, and it's honestly incredible to see an open source project catch up to and beat a very well funded AI lab within 8 weeks! Kudos to the OpenDevin folks for the organization, and amazing results!

    DeepSeek v2 - huge MoE with 236B (21B active) parameters (X, Try It)

    The folks at DeepSeek are releasing this huge MoE (the biggest we've seen in terms of experts) with 160 experts, and 6 experts activated per forward pass - a similar trend to the Snowflake release, just pushed even further. They also introduce a lot of technical details and optimizations to the KV cache.

    With benchmark results getting close to GPT-4, DeepSeek wants to take the crown for the cheapest smart model you can run, and not only in open source btw - they are now offering this model at an incredible $0.28/1M tokens, that's 28 cents per 1M tokens!

    The closest models in price were Haiku at $0.25 and GPT-3.5 at $0.50. This is quite an incredible deal for a model with 32K context via the API (128K in the open source release) and these metrics.
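
    Since the API is OpenAI-compatible, trying it is a two-line change; here's a minimal sketch (the base_url and model name are my assumptions from their docs at the time, so double check them):

    from openai import OpenAI

    client = OpenAI(
        base_url="https://api.deepseek.com",  # assumed OpenAI-compatible endpoint
        api_key="YOUR_DEEPSEEK_API_KEY",
    )
    response = client.chat.completions.create(
        model="deepseek-chat",  # assumed model name for DeepSeek-V2 chat
        messages=[{"role": "user", "content": "Summarize the DeepSeek V2 architecture in two sentences."}],
    )
    print(response.choices[0].message.content)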

    Also notable is the training cost: they claim that it took them 1/5 of what Llama-3 cost Meta, which is also incredible. Unfortunately, running this model locally is a no-go for most of us 🙂

    I would mention here that metrics are not everything, as this model fails quite humorously on my basic logic tests

    LLama-3 120B chonker Merge from Maxime LaBonne (X, HF)

    We've covered merges before, and we've had the awesome Maxime Labonne talk to us at length about model merging on ThursdAI, but I've been waiting for Llama-3 merges, and Maxime did NOT disappoint!

    A whopping 120B Llama (Maxime added 50 layers to the 70B Llama-3) is doing the rounds, and folks are claiming that Maxime achieved AGI 😂 It's really funny; this model is... something else.

    Here's just one example that Maxime shared, where it goes into an existential crisis over a very simple logic question - a question that Llama-3 answers ok with some help, but this... I've never seen this. Don't forget that merging involves no additional training; it's mixing layers from the same model, so... we still have no idea what merging does to a model, but... some brain damage is definitely occurring.

    Oh and also it comes up with words!

    ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

    Big CO LLMs + APIs

    OpenAI publishes Model Spec (X, Spec, Blog)

    OpenAI published, and invites engagement and feedback on, their internal set of rules for how their models should behave. Anthropic has something similar with Constitutional AI.

    I specifically liked the new chain of command (Platform > Developer > User > Tool) rebranding they added to the models, making OpenAI the Platform, changing "system" prompts to "developer" and having user be the user. Very welcome renaming and clarifications (h/t Swyx for his analysis)

    Here is a summarized version of OpenAI's new rules of robotics (thanks to Ethan Mollick):

    * follow the chain of command: Platform > Developer > User > Tool

    * Comply with applicable laws

    * Don't provide info hazards

    * Protect people's privacy

    * Don't respond with NSFW contents

    A very welcome effort from OpenAI; showing this spec in the open and inviting feedback is greatly appreciated!

    This comes on top of a pretty big week for OpenAI: announcing an integration with Stack Overflow, joining the Coalition for Content Provenance and Authenticity, embedding watermarks in Sora and DALL-E images, and telling us they have built a classifier that detects AI images with 96% certainty!

    im-a-good-gpt2-chatbot and im-also-a-good-gpt2-chatbot

    Following last week's gpt2-chat mystery, Sam Altman trolled us with this tweet:

    And then we got 2 new models on LMSys, im-a-good-gpt2-chatbot and im-also-a-good-gpt2-chatbot, and the timeline exploded with folks trying all their best logic puzzles on these two models, trying to understand what they are. Are they GPT-5? GPT-4.5? Maybe a smaller version of GPT-2 that's pretrained on tons of new tokens?

    I think we may see the answer soon, but it's clear that both these models are really good, doing well on logic (better than Llama-70B, and sometimes Claude Opus as well)

    And the speculation is pretty much over, we know OpenAI is behind them after seeing this oopsie on the Arena 😂

    You can try these models as well; they seem to be heavily favored in the random selection of models, but they show up only in battle mode, so you may have to try a few times: https://chat.lmsys.org/

    Google DeepMind announces AlphaFold3 (Paper, Announcement)

    Developed by DeepMind and Isomorphic Labs, AlphaFold previously predicted the structure of nearly every protein known to science, and the newly announced AlphaFold 3 can now predict the structure of other biological complexes as well, paving the way for new drugs and treatments.

    What's new here is that they are using diffusion - yes, like Stable Diffusion - starting with noise and then denoising to get a structure, and they claim this method is 50% more accurate than existing methods for some interaction types.

    If you'd like more info about this very important paper, look no further than the awesome Two Minute Papers YouTube channel, which did a thorough analysis here, and listen to the Isomorphic Labs episode with Weights & Biases CEO Lukas on the Gradient Dissent podcast.

    They also released AlphaFold Server, a free research tool allowing scientists to access these capabilities and predict structures for non-commercial use; however, it seems that it's somewhat limited (based on a conversation we had with a researcher on stage).

    This weeks Buzz (What I learned with WandB this week)

    This week was amazing for open source and Weights & Biases; it's not every week that a side project from a CIO blows up on... well, everywhere. #1 trending on Github for TypeScript and #6 overall, OpenUI (Github) has passed 12K stars, as people are super excited about being able to build UIs with LLMs, but in open source.

    I had the awesome pleasure to host Chris on the show as he talked about the inspiration and future plans, and he gave everyone his email to send him feedback (a decision which I hope he doesn't regret 😂) so definitely check out the last part of the show for that.

    Meanwhile here's my quick tutorial and reaction about OpenUI, but just give it a try here and build something cool!

    Vision

    Some news was shared with me, but out of respect for the team I decided not to include it in the newsletter ahead of time - expect open source to come close to GPT-4V next week 👀

    Voice & Audio

    11 Labs joins the AI music race (X)

    Breaking news from 11Labs that happened during the show (though we didn't notice at the time): they are stepping into the AI music scene, and it sounds pretty good!

    Udio adds Audio Inpainting (X, Udio)

    This is really exciting; Udio decided to justify their investment and ship something novel!

    Inpainting has been around in diffusion models, and now selecting a piece of a song on Udio and having Udio regenerate it is so seamless that it will definitely come to every other AI music tool, given how powerful this is!

    Udio also announced their pricing tiers this week, and it seems that this is the first feature that requires subscription

    AI Art & Diffusion

    ByteDance PuLID for no train ID Customization (Demo, Github, Paper)

    It used to take a LONG time to finetune something like Stable Diffusion to generate an image of your face using DreamBooth; then things like LoRA started making this much easier, but it still needed training.

    The latest crop of approaches for AI art customization is called ID customization, and ByteDance just released a novel, training-free version called PuLID which works very fast with very decent results (really, try it on your own face). Previous works like InstantID and IP-Adapter are also worth calling out; however, PuLID seems to be the state of the art here! 🔥

    And that's it for the week, well who am I kidding, there's so much more we covered and I just didn't have the space to go deep into everything, but definitely check out the podcast episode for the whole conversation. See you next week, it's going to be 🔥 because of IO and ... other things 👀



    This is a public episode. If you’d like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe

  • Hey 👋 Look, it May or May not be the first AI newsletter you get in May, but it's for sure going to be a very information-dense one. We had an amazing conversation on the live recording today; over 1K folks joined to listen to the first May updates from ThursdAI.

    As you May know by now, I just love giving the stage to the folks who create the actual news I get to cover from week to week, and this week we again had 2 of those conversations.

    First we chatted with Piotr Padlewski from Reka, an author on the new Vibe-Eval paper & dataset which they published this week. We've had Yi and Max from Reka on the show before, but it was Piotr's first time, and he was super knowledgeable and really fun to chat with.

    Specifically, as we at Weights & Biases launch a new product called Weave (which you should check out at https://wandb.me/weave), I'm getting a LOT more interested in evaluations and LLM scoring, and in fact we started the whole show today with a full segment on evals and vibe checks, and covered a new paper from Scale about overfitting.

    The second deep dive was with my friend Idan Gazit from GitHub Next, about the new iteration of Github Copilot, called Copilot Workspace. It was a great one, and you should definitely give that one a listen as well.

    TL;DR of all topics covered + show notes

    * Scores and Evals

    * No notable changes, LLama-3 is still #6 on LMsys

    * gpt2-chat came and went (in depth chan writeup)

    * Scale checked for Data Contamination on GSM8K using GSM-1K (Announcement, Paper)

    * Vibes-Eval from Reka - a set of multimodal evals (Announcement, Paper, HF dataset)

    * Open Source LLMs

    * Gradient releases 1M context window LLama-3 finetune (X)

    * MaziyarPanahi/Llama-3-70B-Instruct-DPO-v0.4 (X, HF)

    * Nous Research - Hermes Pro 2 - LLama 3 8B (X, HF)

    * AI Town is running on Macs thanks to Pinokio (X)

    * LMStudio releases their CLI - LMS (X, Github)

    * Big CO LLMs + APIs

    * Github releases Copilot Workspace (Announcement)

    * AI21 - releases Jamba Instruct w/ 256K context (Announcement)

    * Google shows Med-Gemini with some great results (Announcement)

    * Claude releases IOS app and Team accounts (X)

    * This weeks Buzz

    * We're heading to SF to sponsor the biggest LLama-3 hackathon ever with Cerebral Valley (X)

    * Check out my video for Weave our new product, it's just 3 minutes (Youtube)

    * Vision & Video

    * Intern LM open sourced a bunch of LLama-3 and Phi based VLMs (HUB)

    * And they are MLXd by the "The Bloke" of MLX, Prince Canuma (X)

    * AI Art & Diffusion & 3D

    * ByteDance releases Hyper-SD - Stable Diffusion in a single inference step (Demo)

    * Tools & Hardware

    * Still haven't open the AI Pin, and Rabbit R1 just arrived, will open later today

    * Co-Hosts and Guests

    * Piotr Padlewski (@PiotrPadlewski) from Reka AI

    * Idan Gazit (@idangazit) from Github Next

    * Wing Lian (@winglian)

    * Nisten Tahiraj (@nisten)

    * Yam Peleg (@yampeleg)

    * LDJ (@ldjconfirmed)

    * Wolfram Ravenwolf (@WolframRvnwlf)

    * Ryan Carson (@ryancarson)

    Scores and Evaluations

    A new corner in today's pod and newsletter, given this week's focus on new models and comparing them to existing ones.

    What is GPT2-chat and who put it on LMSys? (and how do we even know it's good?)

    For a very brief period this week, a new mysterious model appeared on LMSys, and was called gpt2-chat. It only appeared on the Arena, and did not show up on the leaderboard, and yet, tons of sleuths from 4chan to reddit to X started trying to figure out what this model was and wasn't.

    Folks started analyzing the tokenizer and the output schema, and tried to extract the system prompt and gauge the context length. Many were hoping that this was an early example of GPT-4.5 or something else entirely.

    It did NOT help that uncle SAMA posted a tweet and then edited it to remove the hyphen, and it was unclear if he was trolling again, foreshadowing a completely new release, or hinting at an old GPT-2 retrained on newer data or something.

    The model was really surprisingly good, solving logic puzzles better than Claude Opus, and having quite amazing step by step thinking, and able to provide remarkably informative, rational, and relevant replies. The average output quality across many different domains places it on, at least, the same level as high-end models such as GPT-4 and Claude Opus.

    Whatever this model was, the hype around it made LMSYS add a clarification to their terms and temporarily take the model off for now. We're waiting to hear more news about what it is.

    Reka AI gives us Vibe-Eval a new multimodal evaluation dataset and score (Announcement, Paper, HF dataset)

    Reka keeps surprising: with only 20 people in the company, their latest Reka Core model is very good at multimodality, and to prove it, they just released a new paper + a new method of evaluating multimodal prompts on VLMs (vision-enabled language models).

    Their new open benchmark + open dataset consists of image-and-prompt pairs with reference answers (the exact format is shown in the paper).

    I was very happy to hear from one of the paper's authors, @PiotrPadlewski, on the pod, where he mentioned that they were trying to create a dataset that would be very hard for their own model (Reka Core), and then just decided to keep evaluating other models on it as well.

    They had 2 main objectives: (i) vibe-checking multimodal chat models for day-to-day tasks and (ii) deeply challenging and probing the capabilities of present frontier models. To this end, over 50% of the questions in the hard set are ones that all frontier models answer incorrectly.

    Chatting with Piotr about it, he mentioned that they didn't just build a dataset; they actually used Reka Core as a judge to score the replies from all models on that dataset, and found that using their model this way roughly correlates with non-expert human judgement! Very, very interesting stuff.

    The "hard" set is ... well hard!

    Piotr concluded that if folks want to do research, they will provide free API access to Reka for that, so hit them up over DMs if you want to take this eval for a spin on your shiny new VLM (or indeed verify the metrics they put up).
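
    If you do want to take it for a spin, here's a tiny sketch of pulling the dataset from the HuggingFace hub (the dataset id and the split/column names are assumptions, so verify them on the hub first):

    from datasets import load_dataset

    vibe_eval = load_dataset("RekaAI/VibeEval")  # assumed hub id; double check on HuggingFace
    print(vibe_eval)                             # inspect the splits and columns first
    example = next(iter(vibe_eval[list(vibe_eval.keys())[0]]))
    print(example.keys())                        # expect an image, a prompt, and a reference response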

    Scale tests for eval dataset contamination with GSM-1K (Announcement, Paper)

    Scale.ai is one of the most prominent companies in AI you may never have heard of; they are valued at $13B and have pivoted from data labeling for autonomous vehicles to being the darling of the government, with agreements with the DoD for data pipelines and evaluation for the US military.

    They released a new paper as well, creating (but not releasing) a new dataset that mirrors GSM8K (Grade School Math), the evaluation that many frontier companies love to showcase in their release benchmarks, with some surprising results!

    The Scale folks created (but did not release) a dataset called GSM-1K, which tracks and is similar to the public GSM8K dataset, and tested a bunch of existing models on their new one to see the correlation; if the difference was very stark, they could assume that some models had overfit on (or even had their training data contaminated by) the publicly available GSM8K.

    On one end, models like Mistral or Phi do up to 10% worse on GSM1k compared to GSM8k. On the other end, models like Gemini, Claude, or GPT show basically no signs of being overfit.

    The author goes on to say that overfitting doesn't necessarily mean it's a bad model, and highlights Phi-3, which has a 10% gap between its GSM-1K and GSM8K scores but still answers 68% of their dataset correctly, while being a tiny 3.8B parameter model.

    It seems that Scale has noticed how much interest there is in actually understanding how models perform and is now stepping into the evaluation game by building (but not releasing, so they don't leak) datasets. Jim Fan's tweet (and Scale CEO Alex Wang's QT) seem to agree that this is the right positioning for Scale (as they don't have models of their own and so can be neutral, like Moody's).

    Open Source LLMs

    LLama-3 gets 1M context window + Other LLama-3 news

    In the second week of the Llama-3 corner, we are noticing a significant ramp in all things Llama-3, first with the context length. The same folks from last week, Gradient, have spent cycles and upscaled/stretched Llama-3 to a whopping 1 million tokens of context window (Llama-3 8B Gradient Instruct 1048k), with a very decent Needle in a Haystack result.

    The main problem? Transformer attention scales quadratically with context length, so this isn't something you'd be able to run on your Mac (nay, on your cluster) any time soon, and it's almost only theoretical at this point.

    The upside? We had Wing Lian (from Axolotl) on the show, and he talked about a new method called LoRD (now part of MergeKit), which is a way to extract LoRAs from models.

    Think of it as LLM arithmetic: you take the base model (Llama-3 in this case) and the finetune (Llama-3 8B Gradient Instruct 1048k) and simply run a command like so:

    mergekit-extract-lora llama-3-8B-gradient-instruct-1048K llama-3-8B just-the-context-lora [--no-lazy-unpickle] --rank=desired_rank

    And boom, in theory you have an extracted tiny LoRA file that is only the difference between these two models, the base and its finetune.

    It's really exciting stuff to be able to do brain surgery on these models and extract only one specific essence!
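
    To give a sense of what you'd do with that extracted LoRA afterwards, here's a small sketch using peft (the paths mirror the command above and are illustrative):

    from transformers import AutoModelForCausalLM
    from peft import PeftModel

    base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")  # the base model
    long_ctx = PeftModel.from_pretrained(base, "just-the-context-lora")        # the extracted LoRA from above

    # optionally bake the adapter back in to get a plain checkpoint again
    merged = long_ctx.merge_and_unload()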

    First LLama-3 finetunes that beat the instruct version

    The folks at Nous Research give us a new Hermes 2 Pro on top of Llama-3 8B (X, HF) that beats the Llama-3 instruct version on benchmarks, which is apparently very hard to do, given that Meta created a LOT of human-labeled instructions (10M or so) and gave us a really, really good instruct model.

    Nous Hermes 2 Pro also gives Llama-3 additional superpowers like function calling and tool use; they specifically mention that this is the model to use if you do any type of agentic stuff.

    This new version of Hermes maintains its excellent general task and conversation capabilities - but also excels at Function Calling, JSON Structured Outputs, and has improved on several other metrics as well, scoring a 90% on our function calling evaluation built in partnership with Fireworks.AI, and an 84% on our structured JSON Output evaluation.

    Kudos Teknium1, Karan and @intrstllrninja on this release, can't wait to try it out 🫡

    LMStudio gives us a CLI (Github)

    And speaking of "trying it out", you guys know that my recommended way of running these local models is LM Studio - and no, Yagil didn't sponsor ThursdAI haha, I just love how quickly this piece of software became my go-to for running these models locally.

    Well, during ThursdAI I got a #breakingNews ping from their Discord that LM Studio now has a CLI (command line interface), which allows one to load/unload models and run the web server from the command line (kind of similar to Ollama).

    And since LM Studio exposes an OpenAI-compatible completions API once the models are loaded, you are now able to use these models with a simple change to your script, like so:

    from openai import OpenAI  # the regular OpenAI client, just pointed at the local LM Studio server
    client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

    Which is amazing, and I'm very happy about this option, as it opens the door to tons of automation and evaluation possibilities (with something like Weave). In fact, while writing this, I downloaded the model from HuggingFace, loaded up the web server and ran my first prompts, and it all took like 5 minutes and is very easy to do!
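
    Here's roughly what that 5-minute round trip looks like, as a minimal sketch (the model string is a placeholder for whatever you've loaded in LM Studio):

    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")
    response = client.chat.completions.create(
        model="local-model",  # placeholder; use the identifier of the model you loaded in LM Studio
        messages=[{"role": "user", "content": "Give me one fun fact about llamas."}],
    )
    print(response.choices[0].message.content)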

    This weeks Buzz (What happens in Weights & Biases this week)

    I have so much to share, but I want to make sure I don't overwhelm the newsletter, so here we go. First of all, I'm flying out to SF again in a few weeks to sponsor and judge the first ever Llama-3 hackathon, together with Meta, hosted by the fine folks at Cerebral Valley (sign up and come hack!)

    Cerebral Valley hosts their events at this beautiful place called Shack15, which I've mentioned before in the newsletter, and I'm excited to finally take part in one of their events!

    The second part I can't wait to tell you about is that a week after that, I'm going to the Microsoft BUILD conference in Seattle and will be representing Weights & Biases at that huge event (which last year featured Andrej Karpathy giving his State of GPT talk).

    Here's a video I recorded for that event, which I worked really hard on, and would love some feedback. Please also let me know if you notice anything that an AI did in this video 👀 There's... something

    As always, if you're attending any of these events, and see me, please do come say hi and give me a high five. I love meeting ThursdAI community folks in the wild, it really makes up for the fact that I'm working remotely from Denver and really makes this whole thing worth it!

    Big Companies & APIs

    Github’s new Copilot Workspace in Technical Preview

    I was very happy to have friend of the pod Idan Gazit, Senior Director of Research at GitHub Next (the place in GitHub that comes up with incredible stuff, including where Copilot was born), talk to us about Copilot's next iteration after the chat experience: Workspace!

    Workspace is indeed that: a workspace for you and Copilot to start working together, on GitHub issues specifically, taking more than just 1 file into context, and breaking the task down into planning, iteration and human feedback.

    It looks really slick, and per Idan, uses a LOT of tokens of gpt-4-turbo, and I've had a chance to get in there and play around.

    They break every task down into a specification that Copilot comes up with, which you can iteratively work on until you get the required result, then into a planning mode where you see a whole plan, and then Copilot gets to work and starts iterating on your task.

    Does this remind you of anything? AGENTS, you may yell in your head as you read these words. However, I recommend you listen to Idan in our chat on the pod, because his take on agents is: we don't want these tools to replace us, we want them to help us, and what is an agent anyway? The word is very overused. And I have to agree, given the insane valuations we've seen in agent startups like Cognition Labs with Devin.

    I've taken Workspace for a spin and asked it for a basic task: translating a repo's documentation into Russian, a task I know LLMs are really good at. It identified all the README files in the repo and translated them beautifully, but then it didn't place those new translations into a separate folder like I asked - a case Idan admitted they hadn't yet built for. And hey, this is why it's a Technical Preview: you just can't build an LLM-based product behind the scenes and release it, you need feedback and evaluations on your product from actual users!

    You can see my whole session here, in this nice link they give to be able to share (and fork if you have access) a workspace

    The integration into GitHub is quite amazing: there's now a text box everywhere on GitHub where you can ask for changes to a repo in natural language, plus a Raycast extension that allows you to basically kickstart a whole repo using Copilot Workspace from anywhere.

    And here's the result inside a new workspace 👇

    I will run this later and see if it actually worked, given that Idan also mentioned that Copilot does NOT run the code it writes - but it does allow me to easily do so via GitHub Codespaces (a bit of confusing naming between the two!) and spin up a machine super quick.

    I strongly recommend listening to Idan on the pod, because he went into a lot of detail about additional features, where they are planning to take this in the future, etc.

    I can go on and on, but I need to play with all the amazing new tools and models we just got today (and also start editing the podcast - it's almost 4PM and I have 2 hours to send it!), so with that, thank you for reading, and see you next time 🫡



    This is a public episode. If you’d like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
  • Hey hey folks, happy ThursdAI 🎉

    Not a lot of housekeeping here, just a reminder that if you're listening or reading from Europe, our European fullyconnected.com conference is happening on May 15 in London, and you're more than welcome to join us there. I will have quite a few event updates in the upcoming shows as well.

    Besides this, this week has been a very exciting one for smaller models, as Microsoft teased and then released Phi-3 with an MIT license, a tiny model with just 3.8B parameters that can run on most Macs, and it really punches above its weight - to a surprising and even eyebrow-raising degree! Let's get into it 👇

    ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

    TL;DR of all topics covered:

    * Open Source LLMs

    * Microsoft open sources Phi-3 (X, HF)

    * Llama-3 70B took top 5 (now top 6) on LMsys (LMsys Arena)

    * Snowflake open sources Arctic - A massive hybrid MoE (X, Try it, HF)

    * Evolutionary Model merges support in MergeKit (Blog)

    * Llama-3 8B finetunes roundup - Longer Context (128K) and Dolphin & Bagel Finetunes

    * HuggingFace FineWeb - a massive 45TB, 15T-token high quality web dataset (the GPT-4 of datasets) (HF)

    * Cohere open sourced their chat interface (X)

    * Apple open sources OpenElm 4 models + training library called corenet (HF, Github, Paper)

    * Big CO LLMs + APIs

    * Google Gemini 1.5 pro is #2 on LMsys arena

    * Devin is now worth 2BN and Perplexity is also a Unicorn

    * A newcomer called Augment (backed by Eric Schmidt) is now coming out of stealth (X)

    * Vision & Video

    * Adobe releases VideoGigaGAN - high quality upscaler with temporal consistency (paper)

    * TLDraw autocomplete UI demo (X)

    * This Weeks Buzz - What I learned in WandB this week

    * Joe Spisak talk about Llama3 on Stage at WandB Fully connected (Full Talk, TLDR)

    * Voice & Audio

    * Play.ai (previously play.ht) releases conversational Voice AI platform (X)

    * AI Art & Diffusion & 3D

    * IMGsys.org - like LMsys but for image generation models + leaderboard, from FAL (try it)

    * Tools & Hardware

    * Rabbit R1 release party & no shipping update in sight

    * I'm disillusioned about my AI Pin and will return it

    Open Source LLMs

    Llama-3 1 week-aversary 🎂 - Leaderboard ranking + finetunes

    Well, it's exactly 1 week since we got Llama-3 from Meta and as expected, the rankings show a very very good story. (also it was downloaded over 1.2M times and already has 600 derivatives on HuggingFace)

    Just on Monday, Llama-3 70B (the bigger version) took an incredible 5th place (now down to 6th) on LMSys, and even more surprising: given that the Arena now has category filters (you can filter by English only, longer chats, coding, etc.), if you switch to English only, this model shows up 2nd, and it was number 1 for a brief period of time.

    So just to sum up, an open weights model that you can run on most current consumer hardware is overtaking GPT-4-turbo (the 04-09 version), Claude Opus, etc.

    This seems dubious, because, well, while it's amazing, it's clearly not at the level of Opus or the latest GPT-4 if you've used it; in fact it fails some basic logic questions in my tests. But it's a good reminder that it's really hard to know which model outperforms which, that the Arena ALSO has a bias (in who is using it, for example), and that evals are not a perfect way to explain which models are better.

    However, LMsys is a big component of the overall vibes-based eval in our community, and Llama-3 is definitely a significant drop; it's really, really good (even the smaller one).

    One not-so-surprising thing is that the Instruct version is also really, really good - so much so that the first of Eric Hartford's Dolphin finetunes (Dolphin-2.8-Llama3-70B) improves only a little bit over Meta's own instruct version, which was done very well.

    Per Joe Spisak's (Product Director @ Meta AI) chat at the Weights & Biases conference last week (which you can watch below): "I would say the magic is in post-training. That's where we are spending most of our time these days. Uh, that's where we're generating a lot of human annotations." Together with their annotation partners, they generated up to 10 million annotation pairs, did both PPO and DPO, and then did instruct finetuning.

    So much so that Jeremy Howard suggests finetuning their instruct version rather than the base model they released.

    We also covered that, despite the first reactions to the 8K context window, the community quickly noticed that extending the context window for Llama-3 is possible via existing techniques like RoPE scaling, YaRN and the new PoSE method. Wing Lian (maintainer of the Axolotl finetuning library) is stretching the model to almost a 128K context window, running needle-in-a-haystack tests, and it seems very promising!
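
    For flavor, here's a minimal sketch of the simplest of these tricks, plain RoPE scaling, in transformers (YaRN and PoSE are more involved and generally need finetuning, like Wing is doing, to actually hold up; the scaling factor here is just an illustration):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        rope_scaling={"type": "dynamic", "factor": 4.0},  # stretches the 8K positions toward ~32K
    )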

    Microsoft releases Phi-3 (Announcement, Paper, Model)

    Microsoft didn't really let Meta take the open models spotlight, coming in with an incredible report and following up with a model release that's MIT licensed, tiny (3.8B parameters) and performs very, very well even when compared against much larger models.

    Phi is a set of models from Microsoft trained on synthetic, high-quality datasets modeled after the Textbooks Are All You Need / TinyStories approach.

    The chart is quite incredible: the smallest (mini) Phi-3 beats Llama-3-8B AND Mixtral on MMLU, BigBench and HumanEval. To simplify: this TINY 3.8B model, half the size of a single Mixtral expert, beats Mixtral and the newly released Llama-3-8B on most benchmarks, not to mention GPT-3.5!

    It's honestly quite a crazy chart to look at, which raises the question, did this model train on these benchmarks? 🤔

    I still haven't seen definitive proof that the folks at Microsoft trained on any benchmark data; I did see engagement from them and a complete denial. However, early attempts at using Phi-3 via quantized versions with the wrong end-token formatting seem to have been very prevalent in shaping the early opinion that the model's real-world performance is detached from its very high scores.

    Not to mention that, the model being new, there's confusion about how to use it; see the thread from Anton Bacaj about HuggingFace potentially using the wrong end token to finish conversations.

    Now, to the actual performance of this tiny model: I asked it a simple logic-based question that trips up many models, even ones good at logic (Opus and GPT-4 usually answer it correctly), and it performed very well (here's a comparison with Llama-3-70B, which didn't do as well).

    Additionally, their tokenizer is very interesting: they have all these terms that receive a full token, things like function_list, calc, ghreview, ghissue and others, which highlight some interesting potential use-cases they have planned for this set of models, or give us a hint at its training process and why it's so good.
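
    Given the end-token confusion above, the safest way to poke at it is to let the official tokenizer's chat template do the formatting; here's a minimal sketch with transformers (the hub id is Microsoft's published one, and this assumes accelerate is installed for device_map):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "microsoft/Phi-3-mini-4k-instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
    )

    messages = [{"role": "user", "content": "I have 3 apples and I eat 2 pears. How many apples do I have?"}]
    inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
    outputs = model.generate(inputs, max_new_tokens=128)
    print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))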

    Snowflake open sources Arctic - a massive 480B MoE Hybrid with Apache 2 license (X, Try it, HF)

    Snowflake is a name I haven't yet mentioned on ThursdAI, and this field is getting crowded, but they just released something interesting (+ a LOT of open source, including training code, checkpoints, research insights, etc.).

    The things I found most interesting are the massive 128-expert MoE and the hybrid architecture - not quite a pure MoE and definitely not a dense model.

    They claim to have found, based on DeepSpeed research, that training many-but-condensed experts with more expert choices works well for them.

    You can give this model a try here, and I have, using the same 2 questions I used for Phi and Llama. I found the model not that great at logic, to be honest, but it was really fast considering the total size, so inference optimization for this type of architecture is definitely geared towards enterprise (as is the training cost - they claim it cost just under $2 million to train).

    Big CO LLMs + APIs

    Not a lot of super interesting things in this corner, besides Gemini 1.5 Pro (the one with a 1M context window) finally appearing in the Arena and taking the amazing #2 spot (pushing Llama-3 70B to number 6 on the same day it appeared in there, lol).

    This is very impressive, and I gotta wonder what happened with Gemini Ultra if Pro with a larger context beats it outright. It's indeed very good, but not THAT good if you use it on simple logic problems and don't use the whole context length.

    I suspect that we'll hear much more about their AI stuff during the upcoming Google IO (which I was invited to and am going to cover)

    Additionally, we've had quite a few AI unicorns born, with Perplexity becoming a freshly minted unicorn with an additional round of funding, and Devin, the 6-month-old agent startup, getting to a $2 billion valuation 😮

    This weeks Buzz (What I learned with WandB this week)

    It's been exactly 1 week since our conference in SF, and since Joe Spisak, by complete chance, announced Meta Llama-3 live on stage a few hours after it was officially released.

    In this weeks buzz, I'm very happy to bring you that recording, as promised last week.

    I will also share that our newly announced LLM observability tool Weave launched officially during the conference, and it'll be my job to get you to use it 🙂 And shoutout to those in the ThursdAI community who already used it and provided feedback - it's really helpful!

    AI Art & Diffusion

    The fine folks at FAL.ai have launched the LMsys.org for images, and called it... IMGsys.org 🙂 It's an adversarial arena with different image generators, all hosted on FAL I assume, that lets the user choose which images are "better" (intentionally a vague term).

    But it's really fun, give it a try!

    Tools & Hardware

    Rabbit R1 first impressions

    We finally got a tease of R1 from Rabbit, as the first customers started receiving this device (where's mine?? I didn't even get a tracking number)

    Based on the presentation (which I watched so you don't have to), the response time, which was one of the most talked-about negatives of the AI Pin, seems very decent. We're going to see a lot of reviews, but I'm very excited about my Rabbit 👏 🐇

    Apparently I wasn't as fast as I thought on the pre-order so will have to wait patiently, but meanwhile, check out this review from Riley Brown.

    That's the deep dive for this week, for the rest of the coverage, please listen to the episode and if you liked it, share with a friend!

    I'll also be traveling quite a bit in the next two months, I'll be in Seattle for MSFT BUILD, and in San Francisco (more on this soon) a couple of times, hope to meet some of you, please come say hi! 🫡



    This is a public episode. If you’d like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
  • Happy LLama 3 day folks! After a lot of rumors, speculations, and apparently pressure from the big Zuck himself, we finally can call April 18th, 2024, LLaMa 3 day!

    I am writing this from the lobby of the Marriott hotel in SF, where our annual conference, Fully Connected, is happening, and I recorded today's episode from my hotel room. I really wanna shout out how awesome it was to meet folks who are listeners of the ThursdAI pod and newsletter subscribers, participate in the events, and give high fives.

    During our conference, we had the pleasure of having Joe Spisak, the Product Director of LLaMa at Meta, actually announce LLaMa 3 on stage! It was so exhilarating; I was sitting in the front row, and then had a good chat with Joe outside of the show 🙌

    The first part of the show was of course, LLaMa 3 focused, we had such a great time chatting about the amazing new 8B and 70B models we got, and salivating after the announced but not yet released 400B model of LLaMa 3 😮

    We also covered a BUNCH of other news from a week that was already packed with tons of releases and AI news, and I was happy to share my experience running a workshop the day before our conference, with a focus on LLM evaluations. (If there's interest, I can share my notebooks and maybe even record a video walkthrough - let me know in the comments.)

    Ok let's dive in 👇

    Happy LLama 3 day 🔥

    The technical details

    Meta has finally given us what we've all been waiting for: incredibly expensive (2 clusters of 24K H100s, over 15 trillion tokens) open weights models, the smaller 8B one and the larger 70B one.

    We got both instruction-finetuned and base models, which is great for finetuners, and it's worth mentioning that this is a dense model (not a mixture of experts; all the parameters are used during inference).

    It is REALLY good at benchmarks, with the 8B model beating the previous LLaMa 2 70B on pretty much all benchmarks, and the new 70B closing in on the big releases from the past month or two, like Claude Haiku and even Sonnet!

    The only downsides are the 8K context window + the lack of multimodality, but both are coming, according to Joe Spisak, who announced LLaMa 3 on stage at our show Fully Connected 🔥

    I was sitting in the front row and was very excited to ask him questions later!

    By the way, Joe did go into details they haven't yet talked about publicly (see? I told you to come to our conference! and some of you did!), and I was live-tweeting his whole talk + the chat outside with the "extra" spicy questions and Joe's winks haha - you can read that thread here.

    The additional info

    Meta has also partnered with both Google and Bing (take that, OpenAI), inserted LLaMa 3 into the search boxes of Facebook, Instagram, Messenger and WhatsApp, and deployed it to a new product called meta.ai (you can try it there now), and is now serving LLaMa 3 to more than 4 billion people across all of those apps - talk about compute cost!

    Llama 3 also has a new tokenizer (that Joe encouraged us to "not sleep on") and comes with a bunch of new security tools like Purple Llama and Llama Guard. TorchTune, the finetuning library the PyTorch team recently released, now supports Llama-3 finetuning natively out of the box as well (and integrates W&B as its first-party experiment tracking tool).
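
    If you want a quick look at that new tokenizer and chat format without spinning up the full model, here's a tiny sketch (the hub repo is gated, so request access first):

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
    print(len(tok))  # ~128K vocabulary

    messages = [{"role": "user", "content": "Hello Llama 3!"}]
    print(tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
    # prints the new chat format with its special tokens (<|begin_of_text|>, <|start_header_id|>, <|eot_id|>, ...)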

    If you'd like more details directly from Joe, I was live-tweeting his whole talk and am working on getting the slides from our team. We'll likely have a recording as well; I'll post it as soon as we have it.

    Here's a TL;DR (with my notes for the first time) of everything else we talked about, but given today is LLaMa day, and I still have to do fully connected demos, I will "open source" my notes and refer you to the podcast episode to hear more detail about everything else that happened today 🫡

    TL;DR of all topics covered:

    * Meta releases LLama 3 -8B, 70B and later 400B (Announcement, Models, Try it, Run Locally)

    * Open Source LLMs

    * Meta LLama 3 8B, 70B and later 400B (X, Blog)

    * Trained 15T tokens!

    * 70B and 8B models released + instruction finetunes

    * 8K context length, not multimodal

    * 70B gets 82% on MMLU and 81.7% on HumanEval

    * 128K vocab tokenizer

    * Dense model not MoE

    * Both instruction tuned on human annotated datasets

    * Open Access

    * The model already uses RoPe

    * Bigxtral instruct 0.1 (Blog, Try it)

    * Instruct model of the best Apache 2 model around

    * Release a comparison chart that everyone started "fixing"

    * 🤖 Mixtral 8x22B is Mistral AI's latest open AI model, with unmatched performance and efficiency

    * 🗣 It is fluent in 5 languages: English, French, Italian, German, Spanish

    * 🧮 Has strong math and coding capabilities

    * 🧠 Uses only 39B parameters out of 141B total, very cost efficient

    * 🗜 Can recall info from large documents thanks to 64K token context window

    * 🆓 Released under permissive open source license for anyone to use

    * 🏆 Outperforms other open models on reasoning, knowledge and language benchmarks

    * 🌐 Has strong multilingual abilities, outperforming others in 4 languages

    * 🧪 Excellent basis for customization through fine-tuning

    * New Tokenizer from Mistral (Docs)

    * Focusing on Tool Use with tokens 🔥

    * WizardLM-2 8x22B, 70B and 7B (X, HF)

    * Released and then pulled back from HF and Github because it hadn't gone through Microsoft's required toxicity testing

    * Big CO LLMs + APIs

    * OpenAI gives us Batch API + Assistants API v2

    * Batch is 50% of the cost and a win-win-win

    * Assistants API V2 - new RAG

    * new file search tool

    * up to 10,000 files per assistant

    * new vector store

    * Reka gives us Reka Core (X, Try)

    * Multimodal that understands video as well

    * 20 people team

    * Video understanding is very close to Gemini

    * 128K context

    * Core has strong reasoning abilities including for language, math and complex analysis.

    * 32 languages support

    * HuggingFace iOS chat app is out now

    * This weeks Buzz

    * Me + team led a workshop a day before the conference (Workshop Thread)

    * Fully Connected in SF was an incredible success, over 1000 AI attendees + Meta AI announcement on stage 🔥

    * PyTorch new TorchTune finetuning library with first class WandB support (X)

    * Vision & Video

    * Microsoft VASA-1 animated avatars (X, Blog)

    * Amazing level of animation from 1 picture + Sound

    * Harry Potter portraits are here

    * They likely won't release this during Election year

    * Looks very good, close to EMO, but no code

    * 📺 Videos show faces speaking naturally with head movements and lip sync

    * 🔬 Researchers are exploring applications in education, accessibility and more

    * HuggingFace updates IDEFICS2 8B VLM (X, HF)

    * Apache 2 license

    * Competitive with 30B models

    * 12 point increase in VQAv2, 30 point increase in TextVQA (compared to Idefics 1)

    * > 10x fewer parameters than Idefics 1

    * Supports image resolution up to 980 x 980+

    * Better OCR capabilities (thanks to more than 6TB of OCR pre-training data)

    * Adobe shows Firefly video + SORA support (X)

    * Voice & Audio

    * Rewind AI is now Limitless (X)

    * New service & Brand name

    * Transcription to you

    * Hardware device that looks sleek

    * 100 hours

    * Privacy support in cloud

    * AI Art & Diffusion & 3D

    * Stability - Stable Diffusion 3 is here

    * Available via API only

    * Partnered with Fireworks HQ for the release

    * Needs stability AI membership to use / access $$

    * Big step up in composition and notorious issues like hands, "AI faces" etc.

    * Seems to prefer simpler prompts.

    * Way more copyright-friendly. It's hard to get any kind of brands/logos.

    * Text is amazing.

    * Others

    * New AIrChat with amazing transcription is out, come join us in our AI corner there

    * Humane AI pin was almost killed by MKBHD review

    * Rabbit reviews incoming

    That's all for this week, next week we have an amazing guest, see you then! 🫡



    This is a public episode. If you’d like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
  • this week was absolutely bonkers. For starters, for the first time ever, we got an Open Weights model (Command R+) to jump over GPT-4 in human rankings on LMsys, this is huge!

    Then on Tuesday, it seems that all the companies just wanted to one-up one another. First, Gemini 1.5 released with updates, became available in 180 countries, and added audio mode + tons of API improvements and system prompts. Then, less than an hour later, OpenAI gave us a "majorly improved" GPT-4 Turbo version (2024-04-09) that is now back to being the BEST LLM IN THE WORLD. And to cap that day off, Mistral did the thing again, the thing being dropping a torrent link in a tweet with no explanation.

    What was in that torrent is a Mixtral 8x22B MoE (which we started calling Bixtral) which comes with an Apache2 license and seems to be VERY good!

    We also saw the first finetune from HuggingFace/KAIST folks less than 48 hours later (the authors of said finetune actually came on the show 🎉 )

    Fully Connected is a week from today! If you haven't yet signed up, use the THURSDAI promo code and come hear from Richard Socher (You.com), Jerry Liu (LlamaIndex CEO), Karoly (TwoMinutePapers), Joe Spisak (Meta) and leaders from NVIDIA, Snowflake, Microsoft, Coatue, Adobe, Siemens, Lambda and tons more 👇

    TL;DR of all topics covered:

    * Open Source LLMs

    * 🔥 Mistral releases Mixtral 8x22 Apache 2 licensed MoE model (Torrent, TRY IT)

    * Cohere CMDR+ jumps to no 6 on LMSys and beats GPT4 (X)

    * CodeGemma, RecurrentGemma & Gemma Instruct 1.1 (Announcement)

    * Auto-code-rover gets 22% on SWE bench (Announcement)

    * HuggingFace - Zephyr 141B-A35B - First Bixtral Finetune (Announcement)

    * Mistral 22B - 1 single expert extracted from MoE (Announcement, HF)

    * This weeks Buzz - Weights & Biases updates

    * FullyConnected is in 1 week! (Come meet us)

    * Big CO LLMs + APIs

    * 🔥 GPT-4 turbo is back to being number 1 AI with 88.2% Human Eval score (X)

    * Gemini 1.5 Pro now understands audio, uses unlimited files, acts on your commands, and lets devs build incredible things with JSON mode (X)

    * LLama 3 coming out in less than a month (confirmed by Meta folks)

    * XAI Grok now powers news summaries on X (Example)

    * Cohere new Rerank 3 (X)

    * Voice & Audio

    * HuggingFace trained Parler-TTS (Announcement, Github)

    * Udio finally launched its service (Announcement, Leak, Try It)

    * Suno has added explore mode (suno.ai/explore)

    * Hardware

    * Humane AI pin has started shipping - reviews are not amazing

    Open Source LLMs

    Command R+ is the first open weights model that beats last year's GPT-4 versions

    This is massive, really a milestone to be discussed, and even though tons of other news happened, this is the first time an open weights model beats GPT-4 not on a narrow case (coding, medical) but on a general human evaluation on the arena.

    This happened just a year after GPT-4 first came out, and is really really impressive.

    Command R+ has been getting a lot of great attention from the community as well; folks were really surprised by the overall quality, not to mention the multilingual abilities of Command R+

    Mixtral 8x22B MoE with 65K context and Apache 2 license (Bigstral)

    Despite the above, Cohere's time in the sun (ie top open weights model on lmsys) may not be that long if the folks at Mistral have anything to say about it!

    Mistral decided to cap the crazy Tuesday release day with another groundbreaking tweet of theirs which includes a torrent link and nothing else (since then they of course uploaded the model to the hub) giving us what potentially will unseat Command R from the rankings.

    The previous Mixtral (8x7B) signaled the age of MoEs, and each expert in it was derived from Mistral 7B, but for this new, affectionately named Bixtral model, each expert is a massive 22B-sized model.

    We only got a base version of it, which is incredible in its own right, but it's not instruction finetuned yet, and the finetuner community is already cooking really hard! Though it's hard because this model requires a lot of compute to finetune, and not only GPUs; Matt Shumer came on the pod and mentioned that GPUs weren't actually the main issue, it was system RAM when the finetune was finished.

    "The curious thing about it was watching the loss and the eval loss. It [Bixtral] learns much faster than other models" - Matt Shumer

    Matt was trying to run Finetunes for Bigstral and had a lot of interesting stuff to share, definitely check out that conversation on the pod.

    Bigstral is... big, and it's not super possible to run it on consumer hardware.... yet, because Nisten somehow got it to run on CPU only 🤯 using Justine Tunney's LLM kernels (from last week) and llama.cpp at 9 tok/s, which is kinda crazy.

    HuggingFace + KAIST release Zephyr 141B-A35B (First Mixtral 8x22 finetune)

    And that was fast, less than 48 hours after the torrent drop, we already see the first instruction finetune from folks at HuggingFace and KAIST AI.

    They give us a new finetune using ORPO, a technique by KAIST that significantly improves finetuning ability (they finetuned Bigstral with 7k capybara instructions for 1.3 hours on 4 nodes of 8 x H100s)

    They used the distilled Capybara Dataset (From LDJ and Argilla) to give this model a bit more clarity and instruction following.

    You can find the model on the hub here, but now the question is, where would one even run this? 😅

    Btw the authors of the finetune and the ORPO paper from KAIST, Jiwoo Hong and Noah Lee came on the pod and chatted about this finetune and ORPO which was awesome! Definitely check this conversation out.
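
    For the curious, here's a rough, minimal sketch of the ORPO objective as I understand it from the KAIST paper: you keep the usual SFT loss on the chosen response and add a term that pushes the odds of the chosen response above the odds of the rejected one. The tensors and the beta value here are made up for illustration, so treat this as a sketch, not their code:

```python
import torch
import torch.nn.functional as F

def orpo_loss(chosen_logps, rejected_logps, chosen_nll, beta=0.1):
    """chosen_logps / rejected_logps: length-normalized log-likelihoods per example.
    chosen_nll: the regular SFT (next-token) loss on the chosen responses."""
    # log odds = log(p / (1 - p)), computed in log space for stability
    log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
    log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))
    # penalize the model when the rejected response is as likely as the chosen one
    odds_ratio_term = -F.logsigmoid(log_odds_chosen - log_odds_rejected).mean()
    return chosen_nll + beta * odds_ratio_term

if __name__ == "__main__":
    chosen = torch.tensor([-0.8, -1.1])    # dummy per-example mean log-likelihoods
    rejected = torch.tensor([-1.5, -1.9])
    sft_loss = -chosen.mean()              # stand-in for the real SFT loss
    print(orpo_loss(chosen, rejected, sft_loss))
```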

    Big CO LLMs + APIs

    Gemini 1.5 Pro updates - Audio Mode, JSON, System prompts and becomes free

    Google really pulled out all the stops for this updated release of Gemini 1.5 Pro, its flagship 1M context window model.

    It's now available for free in over 180 countries, and has a new audio mode where you can upload up to 9.5 hours of audio (which is crazy on its own). And it's not merely transcription; it seems that they baked an audio encoder in there, so the model can understand some tonality and even some dogs barking in the background!

    In fact, instead of me writing it down, how about I show you an example of Gemini itself extracting everything I said about it during the show? Here's a screenshot of me uploading 2+ hours of raw unedited audio from the show today:

    You can see the Google AI studio (which is a very clean product!) and the new system message, the ability to turn the safety filters off (thank you!) and the audio mode. Not to mention the 250K tokens 😂 that my audio cost this model. Mind you, the highest context window after Gemini is Claude 3 with 200K.

    Google also significantly improves the APIs, and gave access to a new file upload API that allows up to 2GB files uploaded (to support this amazing context and multimodality) 🔥
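
    Here's a rough sketch of that flow with the Python SDK, uploading an audio file via the File API and then asking 1.5 Pro about it. The file name and prompt are made up, and SDK details may have shifted since, so treat this as a sketch:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# Upload the raw episode audio through the File API (built for big files),
# then ask Gemini 1.5 Pro to pull out what was said about a topic.
audio = genai.upload_file(path="thursdai_episode.mp3")   # hypothetical file
model = genai.GenerativeModel("gemini-1.5-pro-latest")

response = model.generate_content([
    audio,
    "Summarize everything said about Gemini 1.5 Pro in this episode.",
])
print(response.text)
```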

    OpenAI - GPT 4 turbo a new and "Majorly improved version"

    Remember when Gemini 1.5 was announced? You may not remember that specific day, because an hour after that, OpenAI published SORA and blew our collective minds off.

    Well, OpenAI is at it again, though this time it didn't quite work the same way. An hour after the Gemini 1.5 updates came out, OpenAI released GPT4-Turbo-April-9 aka (gpt-4-turbo-2024-04-09), and basically all they said was that it was "majorly improved"

    The technical stuff first: they combined the tool use (function calling) API with the Vision API, which is feature parity with Anthropic.
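
    In practice, that means you can now send an image and a tool definition in the same request. Here's a minimal sketch with the OpenAI Python SDK; the tool, image URL and prompt are invented for illustration:

```python
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "log_chart_reading",   # hypothetical tool, just for illustration
        "description": "Record a numeric value read off a chart",
        "parameters": {
            "type": "object",
            "properties": {
                "metric": {"type": "string"},
                "value": {"type": "number"},
            },
            "required": ["metric", "value"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4-turbo-2024-04-09",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Read the peak value off this chart and log it."},
            {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
        ],
    }],
    tools=tools,
)
# If the model decided to call the tool, the arguments show up here
print(response.choices[0].message.tool_calls)
```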

    The vibes are currently good, folks are seeing improvements across the board in logic and code creation, specifically the folks at Cursor posted an example (and enabled this model in their IDE) where it writes higher quality code.

    As I'm writing these words, LMSys updated us that this new model shot up to the top of the arena, taking the mantle back from Opus as the best AI we have, and we also got a confirmation from OpenAI that this model is now powering the ChatGPT interface 👏

    OpenAI also just open sourced a repo showing the evals they used to get these exact scores for the new GPT-4, and they are impressive

    This weeks Buzz (What I learned with WandB this week)

    Final Call! Fully Connected, our very own annual conference is about to commence

    (hehe of course it's happening on a ThursdAI, I still have to think about how to record the show next week)

    Please feel free to use the code THURSDAI to sign up and come see us.

    As a reminder, we're also running a workshop a day before, where we're going to showcase Weave and give practical examples for LLM builders, and it's going to be a lot of fun! Looking forward to seeing some of you there!

    Audio & Voice

    Udio launches a Suno-competitor AI music service

    For the past week+ I've seen tons of AI plugged folks in SF post about "a new AI for music is coming and it's going to be amazing". Well it's finally here, called Udio and it gives Suno a run for its money for sure.

    With the ability to create full tracks, create intros and outros, remix, and a very needed AI-enhanced prompting, Udio does look very very polished and sounds GOOD!

    Here is an example of a classical music track that's been going viral:

    I've played a few more examples on the show itself, and you can check out the trending creations on their page.

    Interestingly, this is probably a diffusion model, and so folks have been squeezing all kinds of stuff that's not only musical out of there, including, stand up comedy with a full laugh track.

    Suno adds explore mode

    Meanwhile, Suno is not going down without a fight and has released this amazing new page where they generated thousands of samples for hundreds of interesting/weird sound styles, letting you get exposed to and learn about different musical styles. I really liked it, so I recorded a short reaction video:

    Phew, somehow we made it, we were able to summarize the huge news this week in under two hours + a newsletter!

    The one thing I haven't been able to do is to actually try out many of the things I talked about, so after writing this, I will take a little break and delve into some of the other things I haven't yet tried 👀

    See you guys next week in limited capacity (maybe, we'll see) and until then, have a great week 🫡



    This is a public episode. If you’d like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
  • Happy first ThursdAI of April folks, did you have fun on April Fools? 👀 I hope you did, I made a poll on my feed and 70% did not participate in April Fools, which makes me a bit sad!

    Well all-right, time to dive into the news of this week, and of course there are TONS of news, but I want to start with our own breaking news! That's right, we at Weights & Biases have breaking news of our own: we've launched our new product today, called Weave!

    Weave is our new toolkit to track, version and evaluate LLM apps, so from now on, we have Models (what you probably know as Weights & Biases) and Weave. So if you're writing any kind of RAG system, anything that uses Claude or OpenAI, Weave is for you!

    I'll be focusing on Weave and I'll be sharing more on the topic, but today I encourage you to listen to the launch conversation I had with Tim & Scott from the Weave team here at WandB, as they and the rest of the team worked their ass off for this release and we want to celebrate the launch 🎉

    TL;DR of all topics covered:

    * Open Source LLMs

    * Cohere - CommandR PLUS - 104B RAG optimized Sonnet competitor (Announcement, HF)

    * Princeton SWE-agent - OSS Devin - gets 12.29% on SWE-bench (Announcement, Github)

    * Jamba paper is out (Paper)

    * Mozilla LLamaFile now goes 5x faster on CPUs (Announcement, Blog)

    * Deepmind - Mixture of Depth paper (Thread, ArXiv)

    * Big CO LLMs + APIs

    * Cloudflare AI updates (Blog)

    * Anthropic adds function calling support (Announcement, Docs)

    * Groq lands function calling (Announcement, Docs)

    * OpenAI is now open to customers without login requirements

    * Replit Code Repair - 7B finetune of deep-seek that outperforms Opus (X)

    * Google announced Gemini Prices + Logan joins (X)

    * This weeks Buzz - oh so much BUZZ!

    * Weave launch! Check Weave out! (Weave Docs, Github)

    * Sign up with Promo Code THURSDAI at fullyconnected.com

    * Voice & Audio

    * OpenAI Voice Engine will not be released to developers (Blog)

    * Stable Audio v2 dropped (Announcement, Try here)

    * Lightning Whisper MLX - 10x faster than whisper.cpp (Announcement, Github)

    * AI Art & Diffusion & 3D

    * Dall-e now has in-painting (Announcement)

    * Deep dive

    * Jamba deep dive with Roi Cohen from AI21 and Maxime Labonne

    Open Source LLMs

    Cohere releases Command R+, 104B RAG focused model (Blog)

    Cohere surprised us, and just 2.5 weeks after releasing Command-R (which became very popular and is No 10 on the Lmsys arena) gave us its big brother, Command R PLUS

    With 128K tokens in the context window, this model is multilingual as well, supporting 10 languages, with tokenization that's even optimized for those languages (a first!)

    The main focus from Cohere is advanced function calling / tool use, and RAG of course, and this model specializes in those tasks, beating even GPT-4 turbo.

    It's clear that Cohere is positioning themselves as RAG leaders as evident by this accompanying tutorial on starting with RAG apps and this model further solidifies their place as the experts in this field. Congrats folks, and thanks for the open weights 🫡
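
    To make the RAG positioning concrete, here's a rough sketch of grounded generation with the Cohere Python SDK, passing documents directly into chat and getting citations back. The documents and the query are invented, and you should check Cohere's docs for the current parameter names:

```python
import cohere

co = cohere.Client("YOUR_API_KEY")

# Hypothetical snippets you'd normally pull from your own retrieval step
docs = [
    {"title": "ThursdAI notes", "snippet": "Command R+ is a 104B RAG-optimized model."},
    {"title": "Cohere blog", "snippet": "Command R+ supports a 128K context window."},
]

response = co.chat(
    model="command-r-plus",
    message="What is Command R+ optimized for, and how long is its context?",
    documents=docs,
)
print(response.text)        # grounded answer
print(response.citations)   # spans pointing back at the documents above
```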

    SWE-Agent from Princeton

    Remember Devin? The agent with the nice UI, born from a super cracked team, that got 13% on SWE-bench, a very hard (for LLMs) benchmark that requires solving real world issues?

    Well now we have an open source agent that comes very very close to that called SWE-Agent

    SWE agent has a dedicated terminal and tools, and utilizes something called ACI (Agent Computer Interface) allowing the agent to navigate, search, and edit code.

    The dedicated terminal in a docker environment really helps, as evidenced by a massive 12.3% score on SWE-bench, where GPT-4 gets only 1.4%!

    Worth mentioning that SWE-bench is a very hard benchmark that was created by the folks who released SWE-agent, and here are some videos of them showing the agent off; this is truly an impressive achievement!

    Deepmind publishes Mixture of Depth (arXiv)

    Thanks to Hassan, who read the paper and wrote a deep dive, this paper by Deepmind shows their research into optimizing model inference. Apparently there's a way to train LLMs without affecting their performance which later allows them to significantly reduce compute on some generated tokens.

    🧠 Transformer models currently spread compute uniformly, but Mixture-of-Depths allows models to dynamically allocate compute as needed

    💰 Dynamically allocating compute based on difficulty of predicting each token leads to significant compute savings

    ⏳ Predicting the first token after a period is much harder than within-sentence tokens, so more compute is needed

    🗑 Most current compute is wasted since difficulty varies between tokens

    We're looking forward to seeing models trained with this, as this seems to be a very big deal in how to optimize inference for LLMs.
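
    To make the idea a bit more concrete, here's a toy sketch (not the paper's implementation) of a block where a learned router picks a top-k subset of tokens to get the full computation while the rest ride the residual stream unchanged. The capacity fraction, dimensions and inner block are all made up:

```python
import torch
import torch.nn as nn

class MixtureOfDepthsBlock(nn.Module):
    """Toy block: only the top-k router-scored tokens go through the expensive
    inner computation; everyone else is carried forward via the residual path."""

    def __init__(self, d_model: int, capacity: float = 0.25):
        super().__init__()
        self.inner = nn.Sequential(            # stand-in for an attention + MLP block
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.router = nn.Linear(d_model, 1)
        self.capacity = capacity               # fraction of tokens that get full compute

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        k = max(1, int(t * self.capacity))
        scores = self.router(x).squeeze(-1)                        # [b, t]
        topk = scores.topk(k, dim=-1).indices                      # [b, k]
        idx = topk.unsqueeze(-1).expand(-1, -1, d)                 # [b, k, d]
        selected = torch.gather(x, 1, idx)
        # weight by the router score so the router itself gets a gradient signal
        gate = torch.sigmoid(torch.gather(scores, 1, topk)).unsqueeze(-1)
        processed = self.inner(selected) * gate
        out = x.clone()
        out.scatter_add_(1, idx, processed)                        # residual update
        return out

if __name__ == "__main__":
    block = MixtureOfDepthsBlock(d_model=64, capacity=0.25)
    print(block(torch.randn(2, 16, 64)).shape)   # torch.Size([2, 16, 64])
```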

    Thank you for reading ThursdAI - Best way to support us is to just share this with folks 👇

    Big CO LLMs + APIs

    Anthropic and Groq announce function calling / tool use support, Cohere takes it one step further

    In yet another example of how OpenAI is leading not only in models, but in developer experience, most models and API providers are now using the same messages API structure.

    Back in June of 2023, OpenAI gave us function calling, and finally the industry is aligning to this format, which is now being rebranded as "tool use"

    If you're unfamiliar with the concept, tool use allows a developer to specify what tools the model can have in addition to just spitting out tokens, think browsing the web, using RAG to get more information, checking the weather, or... turning off a lightbulb in your smart home.

    The LLM then decides, based on user input, whether a specific tool needs to be called, responds to the developer with the tool and the parameters it needs, then expects the result of that tool call back, and finally is able to respond to the user with the complete information.
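
    Here's roughly what that round trip looks like with the Anthropic Python SDK. The weather tool, its schema and the fake result are all made up for illustration, and SDK details may have shifted, so treat this as a sketch:

```python
import anthropic

client = anthropic.Anthropic()

# A hypothetical weather tool, described with a JSON schema
tools = [{
    "name": "get_weather",
    "description": "Get the current weather for a city",
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

messages = [{"role": "user", "content": "Should I take an umbrella in Denver today?"}]
first = client.messages.create(
    model="claude-3-opus-20240229", max_tokens=1024, tools=tools, messages=messages,
)

# If the model decided to call the tool, run it ourselves and send the result back
if first.stop_reason == "tool_use":
    call = next(block for block in first.content if block.type == "tool_use")
    tool_result = '{"forecast": "rain", "high_f": 55}'   # stand-in for a real lookup
    messages += [
        {"role": "assistant", "content": first.content},
        {"role": "user", "content": [{
            "type": "tool_result",
            "tool_use_id": call.id,
            "content": tool_result,
        }]},
    ]
    final = client.messages.create(
        model="claude-3-opus-20240229", max_tokens=1024, tools=tools, messages=messages,
    )
    print(final.content[0].text)
```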

    So this week we've got Command R, Groq and Anthropic all adding support for tool use, which is incredible for developer experience across the board and will allow developers to move between all those APIs.

    Cohere goes one step further with something they call multi-step tool use, which is a significant step up and is very interesting to explore, as it gives their models the ability to rank and order tool execution, and observe the responses.

    Anthropic Docs https://docs.anthropic.com/claude/docs/tool-use

    Groq Docs https://console.groq.com/docs/tool-use

    Cohere Docs https://docs.cohere.com/docs/multi-step-tool-use

    Cloudflare AI is now in GA + workers in Python

    If you've been following ThursdAI, you know I'm a huge Cloudflare fan. I've built my startup (https://targum.video) on top of the Cloudflare Workers platform, and I gave them early feedback about having to step into AI in a big way. And they did, with Workers AI, which is now in GA.

    Workers AI lets developers in the Cloudflare ecosystem run LLMs (they mostly feature open-source LLMs, which is incredible), host vectors, run Whisper, and basically have end-to-end serverless apps that are powered by AI (they have GPUs in 150 cities around the world)

    This week Cloudflare also announced the ability to write workers in Python, which was sorely missing for some folks (like me!) who love FastAPI for example, and while it's not a full Python environment, the depth to which they had to go in order to allow Python to execute on their edge is kind of ridiculous, read up on it here

    I'm hoping to work with them to bring Weave into the Python workers soon 🤞 because building AI applications with Cloudflare is so simple, they even have a HuggingFace integration which allows you to bring models into your CF environment with 1 click.
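
    If you just want a quick taste from outside the Workers runtime, Workers AI models are also exposed over a REST endpoint. Here's a sketch of calling one of the hosted open models; the model slug and payload shape are from memory, so double-check the Workers AI docs before relying on this:

```python
import os
import requests

ACCOUNT_ID = os.environ["CF_ACCOUNT_ID"]
API_TOKEN = os.environ["CF_API_TOKEN"]

# Run one of the hosted open models through the Workers AI REST API
url = (
    "https://api.cloudflare.com/client/v4/accounts/"
    f"{ACCOUNT_ID}/ai/run/@cf/meta/llama-2-7b-chat-int8"
)
resp = requests.post(
    url,
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json={"messages": [{"role": "user", "content": "Say hi from the edge"}]},
)
print(resp.json())
```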

    This weeks Buzz - SO MUCH BUZZ

    Hey, well first of all, I can now offer you 15% off a ticket to our conference, so use THURSDAI when you checkout and get a ticket here

    Now that Weave is out, it's possible to say that our workshop on April 17 (same link as above) is going to be focused on LLM evaluations and yes, I will be talking about how to use weave to build LLM applications in production safely. If this field is new to you, please sign up and come to the workshop!

    JAMBA deep dive with Roi @ AI21 and Maxime Labonne

    As always, what I cover in this newsletter are only the highlights of what we talked about, but there was so much more, I really recommend you listen to the episode. Think of this week's episode as 2 episodes (maybe I should re-release the deep dive as a separate episode) because we had a long conversation with Roi Cohen, who's a PM @ AI21, and Maxime Labonne (author of LazyMergeKit and the first finetune of JAMBA); it's really worth tuning into that interview. Here's a little snippet:

    Aaaand this is it for this week, or you know what? Maybe it's not! I shared this on X but if you don't follow me on X, I decided to prank my whole feed by saying that I'm basically changing careers and becoming a Russian AI DJ, called DJ Thursday and I will only play AI generated music.

    The weird thing is how many people were like, yeah ok, this makes sense for you 😅 So here's my April Fools (one of them) joke, hope you enjoy the high quality of these tunes and see you all next week 🫡



    This is a public episode. If you’d like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
    Hey everyone, this is Alex, and can you believe that we're almost done with Q1 2024? March 2024 was kind of crazy, so I'm of course excited to see what April brings (besides the Weights & Biases conference in SF called Fully Connected, which I encourage you to attend and say hi to me and the team!)

    This week we have tons of exciting stuff on the leaderboards, say hello to the new best AI in the world Opus (+ some other surprises), in the open source we had new MoEs (one from Mosaic/Databricks folks, which tops the open source game, one from AI21 called Jamba that shows that a transformers alternative/hybrid can actually scale) and tiny MoE from Alibaba, as well as an incredible Emotion TTS from Hume.

    I also had the pleasure to finally sit down with friend of the pod Tanishq Abraham and Paul Scotti from MedArc and chatted about MindEye 2, how they teach AI to read minds using diffusion models 🤯🧠👁️

    Thank you for reading ThursdAI - Recaps of the most high signal AI weekly spaces. This post is public so feel free to share it.

    TL;DR of all topics covered:

    * AI Leaderboard updates

    * Claude Opus is number 1 LLM on arena (and in the world)

    * Claude Haiku passes GPT4-0613

    * 🔥 Starling 7B beta is the best Apache 2 model on LMsys, passing GPT3.5

    * Open Source LLMs

    * Databricks/Mosaic DBRX - a new top Open Access model (X, HF)

    * 🔥 AI21 - Jamba 52B - Joint Attention Mamba MoE (Blog, HuggingFace)

    * Alibaba - Qwen1.5-MoE-A2.7B (Announcement, HF)

    * Starling - 7B that beats GPT3.5 on lmsys (HF)

    * LISA beats LORA as the frontrunner PeFT (X, Paper)

    * Mistral 0.2 Base released (Announcement)

    * Big CO LLMs + APIs

    * Emad leaves stability 🥺

    * Apple rumors - Baidu, Gemini, Anthropic, who else? (X)

    * This weeks buzz

    * WandB Workshop in SF confirmed April 17 - LLM evaluations (sign up here)

    * Vision & Video

    * Sora showed some demos by actual artists, Air Head was great (Video)

    * Tencent AniPortrait - generate photorealistic animated avatars (X)

    * MedArc - MindEye 2 - fMRI signals to diffusion models (X)

    * Voice & Audio

    * Hume demos EVI - empathic voice analysis & generation (X, demo)

    * AI Art & Diffusion & 3D

    * Adobe firefly adds structure reference and style transfer - (X, Demo)

    * Discussion

    * Deep dive into MindEye 2 with Tanishq & Paul from MedArc

    * Is narrow finetuning done-for with larger context + cheaper prices - debate

    🥇🥈🥉Leaderboards updates from LMSys (Arena)

    This week's updates to the LMsys arena are significant. (Reminder: LMsys uses a mix of MT-Bench, LLM-as-a-judge, and user ELO scores, where users play with these models and choose which answer they prefer)

    For the first time since the Lmsys arena launched, the top model is NOT GPT-4 based. It's now Claude's Opus, but that's not surprising if you used the model. What IS surprising is that Haiku, its tiniest, fastest sibling, is now well positioned at number 6, beating a GPT-4 version from the summer, Mistral Large and other models, while being dirt cheap.

    We also have an incredible showing from the only Apache 2.0 licensed model in the top 15, Starling LM 7B beta, which is now 13th on the chart, an incredible finetune of a finetune (OpenChat) of Mistral 7B. 👏

    Yes, you can now run a GPT3.5 beating model, on your mac, fully offline 👏 Incredible.

    Open Source LLMs (Welcome to MoE's)

    Mosaic/Databricks gave us DBRX 132B MoE - trained on 12T tokens (X, Blog, HF)

    Absolutely crushing the previous records, Mosaic has released the top open access model (one you can download and run and finetune) in a while, beating LLama 70B, Grok-1 (314B) and pretty much every other non closed source model in the world not only on metrics and evals, but also on inference speed

    It uses a Mixture of Experts (MoE) architecture with 16 experts that each activate for different tokens. This allows it to have 36 billion active parameters compared to 13 billion for Mixtral. DBRX has strong capabilities in math, code, and natural language understanding.

    The real kicker is the size, It was pre-trained on 12 trillion tokens of text and code with a maximum context length of 32,000 tokens, which is just incredible, considering that LLama 2 was just 2T tokens. And the funny thing is, they call this DBRX-medium 👀 Wonder what large is all about.

    Graph credit Awni Hannun from MLX (Source)

    You can play with the DBRX here and you'll see that it is SUPER fast, not sure what Databricks magic they did there, or how much money they spent (ballpark of ~$10M) but it's truly an awesome model to see in the open access! 👏

    AI21 releases JAMBA - a hybrid Transformer + Mamba 52B MoE (Blog, HF)

    Oh don't I love #BreakingNews on the show! Just a few moments before ThursdAI, AI21 dropped this bombshell of a model, which is not quite the best around (see above) but has a few very interesting things going for it.

    First, it's a hybrid architecture model, capturing the best of Transformers and Mamba architectures, and achieving incredible performance on the larger context window size (Transformers hardware requirements scale quadratically with attention/context window)

    AI21 are the first to show (and take the bet) that hybrid architecture models actually scale well, and are performant (this model comes close to Mixtral MoE on many benchmarks) while also being significantly cost advantageous and faster on inference on longer context window. In fact they claim that Jamba is the only model in its size class that fits up to 140K context on a single GPU!

    This is a massive effort and a very well received one, not only because this model is Apache 2.0 licensed (thank you AI21 👏) but also because this is now the longest context window model in the open weights (up to 256K), and we've yet to see the incredible amount of finetuning/optimizations that the open source community can do once they set their mind to it! (see Wing from Axolotl, who added support for finetuning Jamba the same day it was released)

    Can't wait to see the benchmarks for this model once it's properly instruction fine-tuned.

    Small MoE from Alibaba - Qwen 1.5 - MoE - A2.7B (Blog, HF)

    What a week for Mixture of Experts models! We got an additional MoE from the awesome Qwen team, where they show that training an A2.7B (the full model is actually 14B but only 2.7B parameters are activated at the same time) is cheaper, with a 75% reduction in training costs and a 174% improvement in inference speed!

    Also in open source:

    LISA beats LoRA for the best parameter efficient training

    📰 LISA is a new method for memory-efficient large language model fine-tuning presented in a Hugging Face paper

    💪 LISA achieves better performance than LoRA with less time on models up to 70B parameters

    🧠 Deep networks are better suited to LISA, providing more memory savings than shallow networks

    💾 Gradient checkpointing greatly benefits LISA by only storing gradients for unfrozen layers

    📈 LISA can fine-tune models with up to 7B parameters on a single 24GB GPU

    🚀 Code implementation in LMFlow is very simple, only requiring 2 lines of code

    🤔 LISA outperforms full parameter training in instruction following tasks
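
    The core trick, as I understand it, is to freeze everything and randomly un-freeze a couple of transformer blocks every so many steps, keeping embeddings and the LM head always trainable. Here's a rough sketch assuming a LLaMA-style HuggingFace model layout (the module names are my assumption, not LISA's actual code):

```python
import random

def lisa_refresh_trainable_layers(model, n_active=2):
    """Freeze every transformer block, then un-freeze a freshly sampled subset,
    so only those blocks (plus embeddings and the LM head) hold gradients and
    optimizer state for the next chunk of training steps."""
    blocks = list(model.model.layers)                 # assumed LLaMA-style layout
    for param in model.parameters():
        param.requires_grad_(False)
    for module in (model.model.embed_tokens, model.lm_head):   # always trainable
        for param in module.parameters():
            param.requires_grad_(True)
    for block in random.sample(blocks, n_active):
        for param in block.parameters():
            param.requires_grad_(True)

# call lisa_refresh_trainable_layers(model) every N optimizer steps in your training loop
```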

    Big CO LLMs + APIs

    Emad departs from Stability AI.

    In a very surprising (perhaps unsurprising to some) move, Emad Mostaque, founder and ex-CEO of Stability AI, announced his departure and his focus on decentralized AI

    For me personally (and I know countless others) we all started our love for Open Source AI with Stable Diffusion 1.4, downloading the weights, understanding that we can create AI on our machines, playing around with this. It wasn't easy, stability was sued to oblivion, I think LAION is still down from a lawsuit but we got tons of incredible Open Source from Stability, and tons of incredible people who work/worked there.

    Big shoutout to Emad and very excited to see what he does next

    Throwback to NEURIPS where Emad borrowed my GPU Poor hat and wore it ironically 😂 Promised me a Stability hat but... I won't hold it against him 🙂

    This weeks Buzz (What I learned with WandB this week)

    I'm so stoked about the workshop we're running before the annual Fully Connected conference in SF! Come hear about evaluations, better prompting with Claude, and tons of insights that we have to share in our workshop, and of course, join the main event on April 18 with the whole Weights & Biases crew!

    Vision

    Sora was given to artists, they created ... art

    Here's a short by a company called ShyKids who got access to SORA alongside other artists. It's so incredibly human, and I love the way they used storytelling to overcome technological issues like lack of consistency between shots. Watch it and enjoy imagining a world where you could create something like this without leaving your living room.

    This also shows that human creativity and art is still deep in the middle of all these creations, even with tools like SORA

    MindEye 2.0 - faster fMRI-to-image

    We had the awesome pleasure of having Tanishq Abraham and Paul Scotti, who recently released a significantly better version of their fMRI-to-image model called MindEye 2.0, shortening the time it takes from 40 hours of fMRI data to just 1 hour. This is quite remarkable, and I would encourage you to listen to the full interview that's coming out this Sunday on ThursdAI.

    Voice

    Hume announces EVI - their Empathic text to speech model (Announcement, Demo)

    This one is big folks, I really was blown away (see my blind reaction below). Hume announced EVI, a text to speech generator that can reply with emotions! It's really something, and it has to be experienced to be believed. This is in addition to Hume already having an understanding of emotions via voice/imagery, and the whole end to end conversation with an LLM that understands what I feel is quite novel and exciting!

    The Fine-Tuning Disillusionment on X

    Quite a few folks noticed a sort of disillusionment from finetuning coming from some prominent pro open source, pro fine-tuning accounts leading me to post this:

    And we of course had to have a conversation about it, and Hamel Husain wrote a response blog called "Is Finetuning still valuable"

    I'll let you listen to the conversation, but I will say, like with RAG, finetuning is a broad term that doesn't apply evenly across the whole field. For some narrow use-cases, it may simply be better/cheaper/faster to deliver value to users by using smaller, cheaper, but longer context models and just providing all the information/instructions to the model in the context window.

    On the other side, there are data privacy concerns, RAG over a finetuned model can absolutely be better than just simple RAG, and there are just a LOT more considerations before we make the call that fine-tuning is not "valuable" for specific/narrow use-cases.

    This is it for this week folks, another incredible week in AI, full of new models, exciting developments and deep conversations! See you next week 👏

    Transcript Below:

    [00:00:00] Alex Volkov: Hey, this is ThursdAI, I'm Alex Volkov, and just a little bit of housekeeping before the show. And what a great show we had today. This week started off slow with some, some news, but then quickly, quickly, many open source and open weights releases from Mosaic and from AI21 and from Alibaba. We're starting to pile on and at the end we had too many things to talk about as always.

    [00:00:36] Alex Volkov: I want to thank my co-hosts Nisten Tahiraj, LDJ, Yam Peleg, and today we also had Robert Scoble with a surprise appearance, and he helped me through the beginning. We also had Justin (Junyang Lin) from Alibaba to talk about the stuff that they released from Qwen. And after the updates part, we also had two deeper conversations at the second part of this show.

    [00:01:07] Alex Volkov: The first one was with Tanishq Mathew Abraham and Paul Scotti from MedArc about their recent paper and work on MindEye2, which translates fMRI signals using diffusion models into images. So fMRI signals into images, which is mind reading, basically, which is incredible. So a great conversation, and it's always fun to have Tanishq on the pod.

    [00:01:37] Alex Volkov: And the second conversation stemmed from a recent change in the narrative or a sentiment change in our respective feeds about fine tuning in the era of long context, very cheap models like Claude. And that conversation is also very interesting to listen to. One thing to highlight is this week we also saw the first time GPT 4 was toppled down from the Arena, and we now have the, a change in regime of the best AI possible, uh, which is quite, quite stark as a change, and a bunch of other very exciting and interesting things in the pod today.

    [00:02:21] Alex Volkov: So, as a brief reminder, if you want to support the pod, the best way to do this is to share it with your friends and join our live recordings every ThursdAI on X. But if you can't sharing it with a friend, sharing a subscription from Substack, or subscribing, uh, to a pod platform of your choice is a great way to support this pod.

    [00:02:48] Alex Volkov: With that, I give you March 28th, ThursdAI.

    [00:02:52] Alex Volkov: Hello hello everyone, for the second time, we're trying this again. This is ThursdAI, now, March 28th. My name is Alex Volkov. I'm an AI evangelist with Weights & Biases. And for those of you who are live with us in the audience who heard this for the first time, apologies, we just had some technical issues and hopefully they're sorted now.

    [00:03:21] Alex Volkov: And in order to make sure that they're sorted, I want to see that I can hear. Hey Robert Scoble joining us. And I usually join their spaces, but Robert is here every week as well. How are you, Robert? Robert.

    [00:03:35] Robert Scoble: great. A lot of news flowing through the system. New

    [00:03:39] Alex Volkov: we have, a lot of updates to do.

    [00:03:43] Robert Scoble: photo editing techniques. I mean, the AI world is just hot and

    [00:03:48] Robert Scoble: going.

    [00:03:49] Alex Volkov: Week to week, we feel the acceleration, and I also want to say hi to Justin. Justin is the core maintainer of the Qwen team. Qwen, we've talked about, and we're going to talk about today, because you guys have some breaking news. But also, you recently started a new thing called OpenDevin. I don't know if we have tons of updates there, but definitely folks who saw Devin, which we reported on, what, a few weeks ago, I think? Time moves really fast in this AI world. I think, Justin, you posted something on X, and then it started the whole thing. So you want to give two sentences about OpenDevin.

    [00:04:21] Justin Lin: Yeah, sure. I launched the OpenDevin project around two weeks ago because we just saw Devin. It is very popular. It is very impressive. And we just think about whether we can build something with the open source community, work together, build an agent style, or do some research in this. So we have the project, and then a lot of people are coming in, including researchers and practitioners in the industry.

    [00:04:46] Justin Lin: So we have a lot of people here. Now we are working generally good. Yeah You can see that we have a front end and back end and a basic agent system. So we are not far from an MVP So stay tuned

    [00:05:01] Alex Volkov: Amazing. so definitely Justin when there's updates to update, you know where to come on Thursday. I, and but also you have like specific when updates that we're going to get to in the open source open source area So folks I'm going to run through everything that we have to cover and hopefully we'll get to everything.

    [00:05:18] Alex Volkov: ,

    [00:05:18] TL;DR - March 28th

    [00:05:18] Alex Volkov: here's the TL;DR of everything that's important in the world of AI that we're going to talk about for the next two hours, starting now. Right, so we have a leaderboard update, and I thought this is gonna be cool to just have a leaderboard update section because big things are happening on the leaderboards, and specifically I'm talking here about the LMSys Arena leaderboard, the one that also does MT-Bench, which is LLMs judging LLMs, but also multiple humans interact with these models in two windows and then they calculate ELO scores, which correlates the best with the vibes evaluations that we all know and love. And folks, Claude Opus is the number one LLM on Arena right now. Claude Opus, the one that we've been talking about, I think, week to week to week to week, is

    [00:06:05] Alex Volkov: now the number one LLM in the world and it's quite impressive, and honestly, in this instance, the arena was like, lagging behind all our vibes. We talked about this already, we felt it on X and on LocalLlama and all other places. So I think it's a big deal, it's a big deal because for the first time since, I think, forever, it's clear to everyone that GPT-4 was actually beat. Now not only that, Sonnet, which is their smaller version, also beats some GPT-4 versions. And Haiku, their tiniest, super cheap version, 25 cents per million tokens, you literally can use Haiku the whole day, and at the end of the month, you get, I don't know, 5 bucks. Haiku also passes one of the versions of GPT-4 for some of the vibes, and Haiku is the distilled Opus version, so that kind of makes sense.

    [00:06:53] Alex Volkov: But it's quite incredible that we had this upheaval and this change in leadership in the LMS arena, and I thought it's worth mentioning here before. So let's in the open source LLM stuff, we have a bunch of updates here. I think the hugest one yesterday, Databricks took over all of our feeds the Databricks bought this company called Mosaic, and we've talked about Mosaic multiple times before and now they're combined forces and for the past.

    [00:07:17] Alex Volkov: year they've been working on something called DBRX, and now we got, in addition to the big company models that's taken over, so Claude Opus took over GPT-4, we now have a new open access model that takes over as the main lead. And they call this DBRX Medium, which is funny. It's a 132 billion parameter language model, and it's a mixture of experts with, I think, 16 experts, and it's huge, and it beats Llama 2 70B, it beats Mixtral, it beats Grok on at least MMLU and HumanEval scores, and so it's really impressive to see, and we're gonna, we're gonna chat about DBRX as well and there's a bunch of stuff to cover there as well and Justin, I think you had a thread that we're gonna go through, and you had a great reaction.

    [00:08:02] Alex Volkov: summary, so we're gonna cover that just today, what 30 minutes before this happened we have breaking news. I'm actually using breaking news here in the TLDR section because

    [00:08:11] Alex Volkov: why [00:08:20] not?

    [00:08:22] Alex Volkov: So AI21, a company from Israel releases something incredible. It's called Jamba. It's 52 billion parameters. but the kicker is it's not a just a Transformer It's a joint architecture from joint attention and Mamba. And we've talked about Mamba and we've talked about Hyena. Those are like state space models that they're trying to do a Competition to Transformers architecture with significantly better context understanding. and Jamba 52 looks quite incredible. It's also a mixture of experts. as you notice, we have a bunch of mixture of experts here. and It's it's 16 experts with two active generation It supports up to 256K context length and quite incredible. So we're going to talk about Jamba.

    [00:09:03] Alex Volkov: We also have some breaking news So in the topic of breaking news Junyang, you guys also released something. you want to do the announcement yourself? It would be actually pretty cool.

    [00:09:13] Justin Lin: Yeah, sure. Yeah just now we released a small MoE model which is called Qwen 1.5 MoE with A2.7B, which means we activate, uh, 2.7 billion parameters. Its total parameter count is, uh, 14 billion, but it actually activates around, uh, 2.7 billion parameters

    [00:09:33] Alex Volkov: thanks Justin for breaking this down a little bit. We're going to talk more about this in the open source as we get to this section I also want to mention that, in the news about the Databricks, the DBRX model, something else got lost and was released actually on Thursday last week.

    [00:09:49] Alex Volkov: We also didn't cover this. Starling is now a 7 billion parameter model that beats GPT 3.5 on LMsys as Well so Starling is super cool and we're going to add a link to this and talk about Starling as Well Stability gave us A new stable code instruct and Stability has other news as well that we're going to cover and it's pretty cool.

    [00:10:07] Alex Volkov: It's like a very small code instruct model that beats the Starchat, like I think 15b as well. So we got a few open source models. We also got a New method to Finetune LLMs, it's called Lisa if you guys know what LORA is, there's a paper called Lisa, a new method for memory efficient large language model Fine tuning.

    [00:10:25] Alex Volkov: And I think this is it. Oh no, there's one tiny news in the open source as well mistral finally gave us Mistral 0. 2 base in a hackathon that they participated in with a bunch of folks. on the weekend, and there was a little bit of a confusion about this because we already had Mistral 0.

    [00:10:43] Alex Volkov: 2 instruct model, and now they released this base model, and many finetuners want the base model, so just worth an update there. In the big companies LLMs and APIs, I don't think we have tons of stuff besides, Claude Opus, as we said, is the number one LLM in the world. The little bit of news there is that Emad Mostaque leaves Stability AI, and that's like worthwhile mentioning because definitely Emad had a big effect on my career, because I started my whole thing with the Stable Diffusion 1.4 release. And we also have some Apple rumors where, as you guys remember, we've talked about Apple potentially having their own model generator, they have a bunch of open source that they're working on, they have the MLX platform, we're seeing all these signs. And then, this week we had rumors that Apple is going to go with Gemini, or sorry, last week we had rumors that Apple is going to go with Gemini, this week we had rumors that Apple is going to sign with Anthropic, and then now Baidu, and also this affected a bunch of stuff. So it's unclear, but worth maybe mentioning the Apple rumors as well. In this week's buzz, the corner where I talk about Weights and Biases, I already mentioned, but maybe I'll go a little bit in depth, that we're in San Francisco on April 17th and 18th, and the workshop is getting filled up, and it's super cool to see, and I actually worked on the stuff that I'm going to show, and it's super exciting, and it covers pretty much a lot of the techniques that we cover here on ThursdAI as well.

    [00:12:05] Alex Volkov: In the vision and video category, This was a cool category as well, because Sora for the first time, the folks at Sora they gave Sora to artists and they released like a bunch of actual visual demos that look mind blowing. Specifically Airhead, i think was mind blowing. We're gonna cover this a little bit.

    [00:12:21] Alex Volkov: If you guys remember Emo, the paper that wasn't released on any code that took One picture and made it sing and made it an animated character. Tencent released something close to that's called AnimPortrait. but Any portrait doesn't look as good as emo, But actually the weights are there.

    [00:12:36] Alex Volkov: So you can now take one image and turn it into a talking avatar and the weights are actually open and you can use it and it's pretty cool. and in the vision and video, I put this vision on video as well, but MedArk released MindEye 2, and we actually Have a chat closer to the second hour with with yeah, with Tanishq and Paul from AdArc about MindEye 2, which is reading fMRI signals and turning them into images of what you saw, which is crazy. And I Think the big update from yesterday as Well from voice and audio category is that Hume, a company called Hume, demos something called EVI which is their empathetic voice analysis and generation model, which is crazy I posted a video about this yesterday on my feed. you talk to this model, it understands Your emotions. Apparently this is part of what Hume has on the platform. you can actually use this right now but now they already, they showed a 11 labs competitor, a text to speech model that actually can generate voice in multiple emotions. and it's pretty like stark to talk to it. and it answers sentence by sentence and it changes its emotion sentence from by sentence. and hopefully I'm going to get access to API very soon and play around with this. really worth talking about. Empathetic or empathic AIs in the world of like agentry and everybody talks about the, the

    [00:13:53] Alex Volkov: AI therapist.

    [00:13:54] Alex Volkov: So we're going to cover Hume as well. I think a very brief coverage in the AI art and diffusion Adobe Firefly had their like annual conference Firefly is a one year old and they added some stuff like structure reference and style transfer and one discussion at the end of the show IS narrow fine tuning done for for large, with larger contexts and cheaper prices for Haiku. we had the sentiment on our timelines, and I maybe participated in this a little bit, and so we had the sentiment and , I would love a discussion about Finetuning, because I do see quite A few prominent folks like moving away from this concept of Finetuning for specific knowledge stuff.

    [00:14:32] Alex Volkov: Tasks, still yes but for knowledge, it looks like context windows the way they're evolving. They're going to move towards, potentially folks will just do RAG. So we're going to have a discussion about fine tuning for specific tasks, for narrow knowledge at the end there. and I think this is everything that We are going to talk about here. That's a lot. So hopefully we'll get to a bunch of it.

    [00:14:51] Open Source -

    [00:14:51] Alex Volkov: and I think we're going to start with our favorite, which is open source

    [00:15:12] Alex Volkov: And while I was giving the TLDR a friend of the pod and frequent co host Yam Pelleg joined us. Yam, how are you?

    [00:15:18] Yam Peleg: Hey, how are you doing?

    [00:15:19] Alex Volkov: Good! I saw something that you were on your way to to visit. our friends at AI21. Is that still the

    [00:15:24] Alex Volkov: awesome, awesome.

    [00:15:25] Yam Peleg: 10 I'll be there in 10, 20 minutes.

    [00:15:27] Alex Volkov: Oh, wow Okay. so we have 10, 20 minutes. and if you guys are there and you want to like hop on, you're also welcome so actually while you're here, I would love to hear from you we, We have two things to discuss. They're major in the open source and like a bunch of other stuff to cover I think the major like the thing that took over all our timelines is that Mosaic is back and Databricks, the huge company that does like a bunch of stuff. They noticed that Mosaic is doing very incredible things. and around, I don't know, six months ago, maybe almost a year ago, they Databricks acquired Mosaic. and Mosaic has been quiet since Then just a refresher for folks who haven't followed us for for longest time Mosaic released a model that was for I don't know, like three months, two months was like the best 7 billion parameter model called mpt and

    [00:16:10] DBRX MoE 132B from Mosaic

    [00:16:10] Alex Volkov: Mosaic, almost a year ago, I think in May, also broke the barrier of what we can consider a large context window, so they announced a model with 64 or 72k context window and they were the first, before Claude, before anybody else. And since then they've been quiet. And they have an inference platform, they have a training platform, they have a bunch of stuff that Databricks acquired. And yesterday they came out with a bang. And this bang is, they now released the top open access model, that beats Llama, that beats Mixtral, that beats Grok-1, that beats all these things. And it's huge. It's a 132 billion parameter MoE that they've trained on, I don't know, seven

    [00:16:49] Alex Volkov: 12,

    [00:16:49] Yam Peleg: 12,

    [00:16:50] Alex Volkov: jesus Christ, 12 trillion parameters.

    [00:16:53] Alex Volkov: This is like a huge I don't think we've seen anything come close to this amount of training, Right

    [00:16:59] Yam Peleg: Oh yeah, it's insane. I mean, the next one is six of Gemma, the next one we know. We don't know about Mistral, but the next one we know is six trillion of Gemma, and it's already nuts. So, but Yeah. It's a much larger model. I think the interesting thing to say is that it's the age of MOE now everyone is really seeing a mixture of experts and the important thing to, to pay attention to is that they are not entirely the same.

    [00:17:27] Yam Peleg: So there is still exploration in terms of the architecture or of small tweaks to the MOE, how to do them, how to actually implement them better, what works better, what is more efficient and so on and so forth. That we just heard about Qwen MOE, which is also a little bit different than the others.

    [00:17:44] Yam Peleg: So there is still exploration going on and just looking at what is coming out and everything turns out to be at the ballpark of Mistral and Mixtral just makes me more curious. Like, how did they do this? How everything is just on, on the same ballpark as them? How did they manage to train such powerful models?

    [00:18:04] Yam Peleg: Both of them. And Yeah.

    [00:18:06] Yam Peleg: I just want to say that because it's amazing to see.

    [00:18:10] Alex Volkov: So, so just to highlight, and I think we've been highlighting this When Grok was released, we've been highlighting and now we're highlighting This as well. A significantly smaller model from Mixtral is still up there. It's still given the good fight, even though these models like twice and maybe three times as large sometimes and have been trained. So we don't know how much Mixtral was trained on right but Mixtral is still doing The good fight still after all this time which is quite incredible. and we keep mentioning this when Grok was released, we mentioned this. And now when this was released, we mentioned this as well.

    [00:18:38] Alex Volkov: It's. What else should we talk about in DBRX? Because I think that obviously Databricks want to show off the platform. Nisten, go ahead. Welcome, by the way. You want to give us a comment about DBRX as well? Feel free.

    [00:18:51] Nisten Tahiraj: Hey guys, sorry I'm late. I was stuck debugging C and it finally worked. I just lost a good time. I used DBRX yesterday. I was comparing it I used it in the LMTS arena. And then I opened the Twitter space and told people to use it. And now it just hit rate limits so you can't use it anymore. Yeah.

    [00:19:11] Nisten Tahiraj: It was pretty good. I very briefly did some coding example. It felt better than than Code Llama to me. It wasn't as good as Cloud Opus stuff, but it did give me working gave me working bash scripts. So, yeah, in the very brief, short amount of time I use it, it seemed pretty good, so,

    [00:19:31] Alex Volkov: Yep.

    [00:19:32] Nisten Tahiraj: that's about it.

    [00:19:33] Nisten Tahiraj: As for the Mistral and Mixtral question, so, I use Mistral large a lot, I use I use medium a lot, And the 70s, and the Frankensteins of the 70s, and they all start to feel the same, or incremental over each other. It's just the data. It's just the way they feed it. They feed this thing, and the way they raise it, I think it's it's all they're all raised the same way in the same data.

    [00:20:03] Nisten Tahiraj: Yeah, the architecture makes some difference, but the one thing that you notice is that it doesn't get that much better with the much larger models. So it's just the data.

    [00:20:20] Justin Lin: That's what I think it is.

    [00:20:21] Alex Volkov: I want to ask Justin to also comment on this, because Justin, you had a thread that

    [00:20:24] Alex Volkov: had a great coverage as well. What's your impressions from DBRX and kind of the size and the performance per size as well?

    [00:20:32] Justin Lin: Yeah, the site is pretty large and it activates a lot of parameters. I remember it's 36 billion and the model architecture is generally fine. Actually, I talked to them a few times. around three months ago, last December introduced Quent2Dem and I accidentally saw it yesterday there are some common senses.

    [00:20:57] Justin Lin: I think it is really good. They use the tiktoken tokenizer with the GPT-2 BPE tokenizer. Recently I have been working with the LLAMA tokenizer and the SentencePiece tokenizer, well, it makes me feel sick. Yeah. It's complicated. Yeah, but the GPT BPE tokenizer, because I have been working with BPE tokenizers years ago, so everything works great.

    [00:21:22] Justin Lin: And for Qwen 1.5, we just changed from the tiktoken implementation to the GPT-2 BPE tokenizer by Hugging Face. It is simple to use, and I think it's good to change the tokenizer. It's also good to have the native ChatML format; I think in the future people are going to use this ChatML format, because the traditional chat formats, like Human/Assistant, have a lot of risks in them.
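    For context, the ChatML format Justin mentions wraps every turn in explicit role markers instead of bare "Human:"/"Assistant:" prefixes. The snippet below is a generic illustration of the format, not the exact DBRX or Qwen chat template.

```python
# Generic illustration of the ChatML chat format (role markers wrap each turn).
# This is not the exact DBRX or Qwen template.
chatml_prompt = (
    "<|im_start|>system\n"
    "You are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\n"
    "What is a mixture-of-experts model?<|im_end|>\n"
    "<|im_start|>assistant\n"
)
print(chatml_prompt)
```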

    [00:21:53] Justin Lin: So the ChatML format is generally good. I think they made a lot of great choices, but I'm not that impressed by their performance in the benchmark results; benchmarks are not that important, but they're a good indicator. For example, when you look at its MMLU performance, I expected it to be higher, if you have trained it really well.

    [00:22:19] Justin Lin: I haven't trained a 100-billion-parameter MoE model, but I expect it to be near 80. It is just 73 with 12 trillion tokens. I don't know if they repeated training epochs or had a diverse 12 trillion tokens; they didn't share the details, but I think it could be even better. I am relatively impressed by their coding performance, just as Nisten said.

    [00:22:47] Justin Lin: The coding capability looks pretty good, but then I found that, well,

    [00:22:53] Justin Lin: it's DBRX Instruct, because you can improve an instruct model to a really high level on HumanEval, but it's hard to improve it for the base model. I'm not quite sure, maybe I need to try more, but it's generally a very good model.

    [00:23:10] Alex Volkov: Yeah, absolutely. We've got a new contender for open weights, open source. So the LLaMA folks are probably thinking about their release date; it's very interesting what LLaMA will come out with. Notable that this is only an LLM: there's no multimodality here, and the rumors are that LLaMA will hopefully be multimodal. So whenever folks compare it to something like GPT-4, it's also notable that this is not multimodal yet, this is just text. One thing I will say is that they call this DBRX Medium, which hints at potentially having a DBRX Large or something. And also, something that was a bit hidden and they didn't make a big deal of: they retrained MPT.

    [00:23:48] Alex Volkov: Yam, I think you commented on this, and actually Matei Zaharia, the chief scientist there, commented on your thread. They retrained MPT-7B, which was, for a while, the best 7-billion-parameter model almost a year ago, and they said it cost them about half as much to train the same model, something like that, which I thought was notable as well.

    [00:24:07] Alex Volkov: I don't know, Yam, if you want to, if you want to chime in on The yeah.

    [00:24:10] Yam Peleg: The interesting thing here is that, I mean, it's obvious to anyone in the field that you can make the model much, much better if you get better data. What they basically show, with actions, is that you can make the model twice as good, or twice as cheap to train, depending on how you look at it, just by making the data better.

    [00:24:35] Yam Peleg: And my own comment on this is that at the moment, to the best of my knowledge, "better data" is something that is not quite defined. I mean, there is a lot of intuition; there are big things, like when you look at broken data, you can tell it's broken. But it's really hard to define what exactly better data is, apart [00:25:00] from deduplication and all of the obvious things.

    [00:25:03] Yam Peleg: It's very hard to define what exactly the influence of specific data is on performance down the line. So it's really interesting to hear from people that have done this and made a model twice as good: what exactly did they do? Because they probably are onto something quite big to get to these results.

    [00:25:27] Yam Peleg: Again, it's amazing to see. I mean, it's just a year, maybe even less than a year, of progress. I think MPT is from May, if I remember, so it's not even a year of progress and we already have models that are twice as good, and things are progressing

    [00:25:42] Alex Volkov: Worth mentioning also that Databricks not only bought Mosaic, they bought a bunch of startups, including Lilac, our friends from Lilac; we had the folks from Lilac, Nikhil and Daniel, here on the pod. And we talked about how important their data tools specifically are, and they've been a big thing in open source.

    [00:25:58] Alex Volkov: The folks from Databricks also highlight how much Lilac helped them understand their data, so I'm really hoping that they're going to keep Lilac around and free to use as well. One last thing that I want to say, it's also breaking news that happened two hours ago: the author of MegaBlocks, the training library for MoEs, Trevor Gale, I think he's at DeepMind, has now given Databricks the MegaBlocks library.

    [00:26:23] Alex Volkov: So Databricks is also taking over and supporting the MegaBlocks training library for MoEs, which they say outperforms the next best library for MoEs as well. And there was a little bit of a chat where Arthur Mensch from Mistral said, hey, welcome to the party. Then somebody replied and said, you are welcome, and showed the core contributors to the MegaBlocks library, and a lot of them are folks from Databricks. So now they've taken over this library.

    [00:26:50] AI21 - JAMBA - hybrid Transformer/Mamba Architecture 52B MoE

    [00:26:50] Alex Volkov: So yes, MoE seems to be a big thing, and now let's talk about the next hot MoE: AI21. The folks that are, I think, the biggest AI lab in Israel released something called Jamba, which is a 52-billion-parameter MoE. And the interesting thing about Jamba is not that it's an MoE, it's that it's Mamba plus joint attention, so it's a Mamba transformer, a combined architecture. We've talked about state space models a little bit here, we actually talked with Eugene from RWKV, we've mentioned Hyena from Together AI, and we've mentioned Mamba before, and all I remember is that those Mamba models still didn't get the same kind of performance. Now we're getting this 52-billion-parameter mixture-of-experts model that does: it's quite impressive on some numbers and even comes close to Llama 70B, which is quite impressive. MMLU is almost 70, 67 percent. I don't see a HumanEval score, I don't think they added one, but they have quite impressive numbers across the board for something that's a new architecture.

    [00:27:52] Alex Volkov: 52 billion parameters with 12 billion active, and what else is interesting here? The new architecture is very interesting: it supports up to 256 thousand tokens of context length, which is incredible. This open model now beats Claude 2 in just context length, which is also incredible. Just to remind you, Databricks, even though they released a long-context model before, DBRX is 32,000.

    [00:28:15] Alex Volkov: This is 256. And not only does it support 256; because of its unique architecture, they can fit up to 140K of context on a single A100 80GB GPU. I know I'm saying a lot of numbers very fast, but if you guys remember, for those of you who frequent the pod, we've talked with the folks behind the YaRN scaling method, and the problem with the context window in Transformers is that, very basically, the more context you have, the more resources it takes. The SSM models and the Mamba architecture specifically focus on lowering the requirements for long context, and this model gets three times the throughput on long context compared to Mistral

    [00:28:57] Alex Volkov: 8x7B, compared to Mixtral, basically. So very exciting. Yam, do you want to comment on this? I know you're almost there, meeting with the guys, but please give us your comments,

    [00:29:07] Yam Peleg: I'm there. I'm there in five minutes, so I can maybe if time works towards favour, maybe I can even get you the people on the pod

    [00:29:14] Alex Volkov: That'd be incredible.

    [00:29:15] Yam Peleg: I'm just, yeah, what what is important here, in my opinion, is that first, I mean, absolutely amazing to see the results.

    [00:29:23] Yam Peleg: But what was not known to this point is whether or not those types of models scale. to these sizes. We had smaller Mambas and they were, they looked really promising, but we were at the point where, okay, it looks promising. It looks like it could be at the same ballpark of transformers, but to test this out, someone need to just invest a lot of money into the compute and just see what the results they get.

    [00:29:53] Yam Peleg: And it's a risk. You don't know what you're going to get if you're going to do it. And it turns out that you get a really good model at the same ballpark. Maybe slightly less performant as a transformer, but it is expectable. The thing the thing worth mentioning here is that Mamba the Mamba architecture is way more efficient in terms of context size.

    [00:30:15] Yam Peleg: As you just said, transformers are quadratic in terms of complexity when you increase the context. If you have two tokens, you need four units of, you could say, memory; if you have four tokens, you need 16, and it just goes on and on and explodes, which is why context length is such a problem. But Mamba scales much more friendly, memory friendly, you could say.
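    A tiny back-of-the-envelope sketch of the scaling Yam is describing; constants and real memory layouts are ignored, this only shows why quadratic attention explodes while a recurrent state-space update grows linearly.

```python
# Rough intuition only: attention materializes one score per (query, key) pair,
# roughly O(n^2) in sequence length n, while an SSM/Mamba-style recurrence does
# roughly O(n) state updates.
def attention_score_entries(n_tokens: int) -> int:
    return n_tokens * n_tokens  # one score per token pair

def ssm_state_updates(n_tokens: int) -> int:
    return n_tokens  # one recurrent update per token

for n in (2, 4, 8, 128_000):
    print(n, attention_score_entries(n), ssm_state_updates(n))
# 2 -> 4, 4 -> 16, ... matching the "two tokens need four, four need 16" example above.
```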

    [00:30:39] Yam Peleg: So, but the thing is that you do pay with the performance of the model. So. What you, what people do is a hybrid between the two, so you can find some sweet spot where you don't just use so much memory and yet you don't have the performance degrade that bad. And I mean, yeah, it's a risk. At the end of the day, you need to train, training such a large model is a lot of money, is a lot of money in terms of compute.

    [00:31:06] Yam Peleg: And they did it, released it in Apache 2, which is amazing for everyone to use. And proving for, to everyone that, all right, if you follow this recipe, you get this result. Now people can build on top of that and can train maybe even larger model or maybe even, maybe just use this model. I'm, I didn't try it yet, but I think it's an incredible thing to try because it's it's not the same as Mixtral.

    [00:31:33] Yam Peleg: Mixtral is a little bit better, but it's at the same ballpark as Mixtral, but you get way more context there. At your home on a small GPU for cheap. It's amazing.

    [00:31:41] Alex Volkov: and Mixtral specifically,

    [00:31:43] Yam Peleg: potential.

    [00:31:45] Alex Volkov: Thanks, Yam. I just want to highlight that Mixtral is this amazing model that we compare models three times its size to, and they barely beat it. We talked about this when Grok-1 was released, and we talked about it again when DBRX was released with

    [00:31:57] Alex Volkov: 12 trillion tokens of training data.

    [00:32:00] Alex Volkov: Mixtral is basically the gold standard. We've used it as the standard for how well an open model can perform, and for a while it has been the best open model that we have. And now we're getting this completely new architecture, basically a bet from the folks at AI21 on whether it would even scale, and it comes close to Mixtral, but it does 3x the throughput on long contexts compared to Mixtral. And it has a 256K context window; if you want to get that from Mixtral, you can train it with YaRN, you can do all these things, but then you won't be able to actually scale it hosted, because it's going to cost you so much money because of the quadratic attention.

    [00:32:33] Alex Volkov: And

    [00:32:34] Alex Volkov: they specifically say it's the only model of its size class that fits up to 140,000 tokens of context on a single GPU, which is quite incredible. And obviously the Apache 2 license is great. I don't know if they also released a bunch of stuff like training code and data details, so we're definitely going to keep you posted.
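    If you want to try it yourself, loading Jamba through transformers should look roughly like the sketch below. The repo id ai21labs/Jamba-v0.1 and the generation settings are assumptions, so check the actual model card, and note you need a transformers version with Jamba support.

```python
# Rough sketch of loading Jamba with Hugging Face transformers.
# The repo id and settings are assumptions; check the model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ai21labs/Jamba-v0.1"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("A hybrid Mamba/attention MoE is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```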

    [00:32:50] Alex Volkov: And Yam hopefully will ask all these questions. But the efficiency in speed, where the relative throughput keeps getting better the closer you get to 128K of context, is also quite incredible. Yeah, it's quite incredible. The graphs are there, we're going to post everything in the show notes, but absolutely a great release from AI21. Shout out to the AI21 folks, definitely give them our love there, specifically because of the Apache 2 license. Anything else? I want to hear from maybe Justin, if you want to comment on the joint architecture. Have you guys played with [00:33:20] the joint attention and Mamba? What's your reaction to this?

    [00:33:25] Justin Lin: Yeah, we are experimenting with Mambas with very small architectures. We can reach similar performance to a transformer, but we did not scale it to a very large size, so we don't know what will happen.

    [00:33:38] Alex Volkov: So just this is great and Apache 2, and we're very happy shout out to folks at the i21. Briefly let's cover the rest of the stuff that we have still to cover in the open source.

    [00:33:47] Mistral base 0.2

    [00:33:47] Alex Volkov: We'll briefly cover this in the TLDR. We'll start with Mistral: the Mistral 0.2 base model was released. For fine-tuning purposes, obviously, as folks who do this know, it's better to have a base model than the instruct model, because then you can fine-tune it your own way. Mistral

    [00:33:59] Alex Volkov: 0.2 base was released at a hackathon last week, together with Cerebral Valley and some other friends in San Francisco.

    [00:34:08] Alex Volkov: There was some confusion about it because we had Instruct 0. 2 before we had a model that said, based on mistral 0. 2 and was like very well performing, the 7 billion parameter one. and now there is the base model. and then somebody went and changed the base of the instruct model to this one versus the previous one but nevermind, they cleared that confusion up and we have this like base model.

    [00:34:28] Alex Volkov: It's also like open source and it's great.

    [00:34:30] Nisten Tahiraj: There is one thing here about the previous Mistral Instruct that they released: that one has been trained for 32K context, and I used it as a personal chatbot. I'd been making do with just the base Mistral 7B, and I'm noticing it is much better at carrying a conversation forward.

    [00:34:50] Nisten Tahiraj: So I think a lot of the fine-tunes should probably switch and just rerun on the new Mistral Instruct, especially the ones that are geared towards conversational stuff. Because, again, the old Mistral Instruct is limited to 8K, and realistically you should keep it under 4K to get accuracy.

    [00:35:11] Nisten Tahiraj: So, that's one thing here. The new seven B performs much better at larger contexts and, and summarizing

    [00:35:18] Starling 7B beta - top apache 2 LLM in the world

    [00:35:18] Alex Volkov: One piece of incredible news is Starling. I think, Justin, both you and Yam talked about this. Starling is actually a 7-billion-parameter model that beats GPT-3.5 on the LMSys Arena, which is quite incredible, right?

    [00:35:34] Alex Volkov: I think it's the first and the only 7 billion parameter model that beats GPT 3. 5 on like user preference. And it's, it was hidden in between the DBRX news

    [00:35:42] Alex Volkov: but let me see if I can. Put this up here real quick. so this model was released, what, a week ago, a week and a day ago. It's

    [00:35:48] Alex Volkov: What do we know from this?

    [00:35:49] Yam Peleg: Yeah, I just want to say, and to go in five minutes, I just want to say about Starling this is the second model. So if you haven't tried yet the first one you definitely want to try. I know there are people that are skeptics about 7b models and saying that they are too small. Just give this one a try.

    [00:36:10] Yam Peleg: Just give this one a chance. Trust me, just give this specific one a chance. It is an amazing model, seriously, it's an amazing model and it's just showing to everyone that there is a lot more to squeeze out. Scale works, absolutely, but there is a lot more to squeeze out besides scale. And I seriously can't wait for the same technique to be applied on a larger model just to see what we get to.

    [00:36:35] Yam Peleg: Because it's an amazing result, seriously.

    [00:36:37] Alex Volkov: Nisten, go ahead.

    [00:36:40] Nisten Tahiraj: So, the model is still Mistral-based, and it's actually based off of OpenChat 3.5. The one thing that the Nexus Raven team does well is they had that NexusRaven 13B model, and for some time that was the best function-calling small model you could get.

    [00:36:59] Nisten Tahiraj: So, I haven't tried this one, but I highly suspect it's probably pretty good at function calling. I'm just looking at it right now, it is Mistral based, it's exactly based off of OpenChat 3. 5 from Alignment Lab, so they fine tuned on top of that, and yeah, I would highly recommend people to use it.

    [00:37:20] Nisten Tahiraj: I've used the one that has been trained off of OpenChat a lot, and

    [00:37:24] Alex Volkov: They did a bang-up job there, because this 7-billion-parameter model now beats GPT-3.5, beats Claude 2.1, beats Mistral Next, and Gemini Pro and Claude 2. Based on LMSys at least, this is the 13th-ranked model. It's 7 billion parameters, it's Apache 2, it's from the Berkeley folks, and this is the only Apache 2 licensed model on the LMSys leaderboard in the first, like, top

    [00:37:48] Alex Volkov: 20, I think, or top 13. I don't even know how it beats Mixtral. So anyway, yeah, Starling is great. It looks great, try it, folks. Definitely worth mentioning. We're going to run through some other updates, because we still have tons of stuff to cover, and then we have some guests here in the audience that want to join and talk about very interesting things.

    [00:38:05] LISA beats LoRA for AI Finetuning

    [00:38:05] Alex Volkov: I don't have a lot of information about LISA specifically, but I will just mention that, if you guys are in the fine-tuning area, you know LoRA, and we have LoRA in the diffusion models area as well, low-rank adaptations, so folks in the diffusion world have been training LoRAs for a while, more than a year. And now a new paper dropped that's billed as a new method for memory-efficient large language model fine-tuning.

    [00:38:27] Alex Volkov: I'll say this slowly: a new method for memory-efficient large language model fine-tuning. So this is not for diffusion stuff, this is for large language models. It's called LISA, and it achieves better performance than LoRA in less time on models up to 70 billion parameters. Yeah, the results look pretty cool. For folks who do fine-tuning, it's worth comparing; for a while we've had different fine-tuning methods, like QLoRA, for example, and different LoRA variants, and there were attempts to figure out which one is the best. So LISA is now a new contender, with a paper out, and I think code will follow as well.

    [00:38:59] Alex Volkov: LISA can fine-tune models up to 7 billion parameters on a single 24-gigabyte GPU, so you can fine-tune the 7-billion-parameter Mistral, for example, on a 4090 with 24 gigabytes of VRAM, which is pretty cool.
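    Since LISA is pitched against LoRA, here is a minimal sketch of the LoRA baseline using the Hugging Face peft library, the kind of memory-efficient fine-tune people run on a single 24 GB card today. The model id, rank, and target modules are illustrative assumptions, not values from the LISA paper or from LMFlow.

```python
# Minimal LoRA baseline sketch with Hugging Face peft.
# Model id, rank, and target modules are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "mistralai/Mistral-7B-v0.1"  # assumed example base model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

lora_config = LoraConfig(
    r=16,                                 # low-rank dimension
    lora_alpha=32,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections, a common choice
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapter matrices are trainable
```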

    [00:39:13] Alex Volkov: And code implementation in LMFlow is very simple. so awesome to have this and we'll add this to the show notes for folks who actually do fine tunes. And it's gonna be awesome. so I think that covers all of the open source stuff, and we obviously spent almost an hour running through open source and I do want to move towards What is the next super exciting stuff that we have this week before we jump into a conversation.

    [00:39:37] Hume EVI emotion based TTS

    [00:39:37] Alex Volkov: Yes, I want to move into Hume. I want to move into the voice and audio category. This is an unusual jump between categories; we usually talk about big companies next, but there's honestly not that much that happened there, so maybe we'll cover it briefly. But the thing that broke my mind, and I'm going to paste it on top here, and hopefully you guys will just listen to me instead of going and watching, is that a company called Hume finally released something that many people have been very excited about. They showed a few demos, and they finally released something. So Hume has been around for a while.

    [00:40:08] Alex Volkov: Apparently they do emotion analysis very well, and they actually have this product out there: you can upload video, and actually audio, of yourself speaking, and they will give you an understanding of what you're saying, of your emotions and intonations, which is pretty cool. And we know that's a piece that's missing from multimodal LLMs, right? Okay, so Hume already had a platform for emotion understanding, and yesterday Hume released their demo of an emotional TTS, a text-to-speech model that not only speaks the text but actually replies with emotion, combined with the previous system they had that can understand your emotion. As you can hear while I'm talking about this, I was a little bit sad when Hamel had to drop, but now I'm very excited again to talk to you about Hume. So they actually have a running analysis of your voice as it runs, and they understand where you are on the emotion scale, which is, first of all, exciting to see on yourself, and second of all, a little alarming, their understanding of emotions, whether or not it's precise enough to tell the truth, for example. And their text-to-speech that generates emotional speech is quite something. I've never seen anything close to it before; the only thing that came close is, if you guys remember, we talked about ElevenLabs having a style transfer thing where you can actually talk and they take an AI voice and basically dub you, but with the same emotion. That was the only thing that came close to what I heard yesterday from Hume. So Hume has this model that's going to be out in, I think they said April, [00:41:40] where you'd be able, as a developer, to assign what emotion it will answer with. And together with the first part, which is voice emotion understanding, they now have speech with emotion on the output side too. The whole end-to-end feeling is like nothing I've ever experienced, and Robert, I think I saw you repost about this first, so I want to hear if you played with the demo and what your thoughts are, because I was blown away and I definitely want to hear what you think about this.

    [00:42:14] Robert Scoble: I was blown away too. You nailed it: it lets AI understand your emotion and build a much more human interaction with AI. The one problem is, I believe it's $7 an hour or something like that, so it's fairly expensive to integrate. But for people who are building new kinds of applications that are going to have to integrate with human beings, I think it's very well done. You should look at it.

    [00:42:41] Alex Volkov: Absolutely, and definitely for folks who get the uncanny valley with different LLMs, where listening for a long time is not quite the same, I think we're going to see more emotionality in many of these demos, and it's going to be very exciting. Together with the fact that recently there was this video from HeyGen, the deepfake company that translates your lips, and people were saying, hey, this is fully end-to-end AI and we're so doomed: all of these AI-generated voices still use ElevenLabs, so I've got to think that ElevenLabs is not going to be that far behind and will start working on some emotional output as well. But I will definitely add the link to this, and actually the video of me testing out Hume, in the show notes, and you're more than welcome to try this as well.

    [00:43:27] Alex Volkov: I think the demo is at demo.hume.ai. They actually have a chatbot on the website, hume.ai, where you can talk to the chatbot with your voice and it answers with voice as well, but the full demo is more mind-blowing. They understand your emotionality, and they translate that emotionality into the actual context. When the model talks back at you, and when you try to fake it, you yell but you say "I'm so happy," the model says, hey, you sound a little bit conflicted. So it actually understands that what you're saying and the way you say it are different.

    [00:44:00] Alex Volkov: So they actually built this understanding into the demo, which is super cool to play with. Yeah, so Hume is definitely worth checking out. I think that's basically it for what we had to cover in voice and audio, but a similar area in AI creation is vision and video.

    [00:44:15] SORA examples from filmmakers

    [00:44:15] Alex Volkov: And this week, oh my God, the beginning of this week was all excitement about how the world of entertainment will look, and the reason is that OpenAI took Sora, and I'm hoping by this point that Sora needs no introduction, right? Sora is OpenAI's text-to-video model, and it's leagues above everything else that we saw in the world before this; it blew our creative minds, and it keeps blowing some people's minds on TikTok. OpenAI gave access to Sora to a few content creators, not Hollywood; apparently they're on the way to Hollywood right now to talk with folks, but they gave it to a few filmmakers in the independent world, I think a few companies from Toronto, and they finally showed us demos of what

    [00:45:03] Alex Volkov: an actual studio can do with some creativity, instead of the developers at OpenAI and some prompts they run through Sora. It looks like they also hired an artist in residence at OpenAI as well, and wow, my mind was definitely blown. There was one short video that looked like something I would have seen at the Sundance festival. It's called Air Head, from a Toronto-based film

    [00:45:28] Alex Volkov: creator called shy kids, and I'm going to add this to the show notes, because this, at least for me, was the most viral thing that I saw, and I absolutely loved it. It felt very human, it felt incredible. It's this very short story about somebody with a balloon instead of a head, and the way they tell the story, they work around the technical limitations, which we all know: if you generate two videos in Sora, the character persistence between those two videos will not be there, and that's a big problem with every video generation. But here they worked around this, because they told the story of this balloon-headed guy throughout his life, so character consistency isn't really required. And I just really love that actual storytellers can work around the technology to create something that feels so good. Obviously the audio was amazing, and the production and the storytelling, everything. I think everybody saw it at this point, but if you haven't, Air Head from shy kids is quite incredible.

    [00:46:27] Tencent AniPortrait - Animated Avatars

    [00:46:27] Alex Volkov: Okay, I want to talk about Tencent, which released something called AniPortrait, "ani" with an N, like animated portrait, and it generates photorealistic animated avatars. If you guys remember EMO, we've talked about it before; EMO was quite incredible to me. The examples EMO showed were pretty much the same kind of jump in capability for image-to-animated-character that Sora showed over previous image-to-video generation, and it

    [00:46:56] Alex Volkov: was incredible.

    [00:46:56] Alex Volkov: The lips moved, and the eyes, and the consistency was there. The problem with EMO is that they haven't released the code, and I think EMO is currently the AI GitHub repo with the highest number of stars and no code; I think it's like 25,000 stars or something. Everybody's waiting for EMO and it hasn't dropped.

    [00:47:15] Alex Volkov: And when I say everyone, I specifically mean the waifu creator world, who would love nothing more than to generate an image in Stable Diffusion and then animate it with some, let's say, emotional voice from the Hume thing we just mentioned. But the second best option for now is AniPortrait, and the code actually dropped. The lip movement is great, and the eyes; it's not close to EMO, but it's really good compared to Wav2Lip in different areas. And if you've ever built animated character AI stuff, you'll know that the open source options

    [00:47:49] Alex Volkov: were not great. The closed source options, like HeyGen and different labs like D-ID and Synthesia, I think, I don't remember the name, were okay, they were great, but the open source options were not there. So AniPortrait right now is the best version we have, and it dropped yesterday. If you are doing any kind of character animation, give AniPortrait a try and let us know; I'm definitely going to play with this.

    [00:48:12] Alex Volkov: Definitely gonna play

    [00:48:12] Alex Volkov: with this. I think we've covered most of the stuff that we wanted to cover, besides the Weights & Biases stuff and the big companies.

    [00:48:18] MindEye 2 - Interview with Tanishq and Paul from MedArc

    [00:48:18] Alex Volkov: But now I am very excited to bring up two friends here, one a friend of the pod for a long time, and now a new one, Paul Scotti, and you guys are here to talk to us about MindEye, the second version. I'll just briefly do an introduction: MindEye came around the summer, I want to say, and we covered it because in my head everything was multimodal, multimodal, when were we going to get multimodal? This was before vision happened, and one of the craziest multimodalities that we expected was something like an fMRI signal, brain signals. And then you guys released MindEye, which was, well, mind blowing. So I would love to hear the history of how MedARC started doing brain interpretation, and then let's talk about MindEye 2 and what's exciting about this recent release. But please feel free to unmute, Tanishq and then Paul, and introduce yourselves briefly.

    [00:49:08] Tanishq Abraham: Yeah, I'll just provide a quick background and summary, and then I'll let Paul talk about MindEye 2 in more detail. But yeah, basically, introducing myself again: I'm Tanishq, I work at Stability AI, and I'm also the founder of MedARC and I lead MedARC, which is an open source medical AI research organization.

    [00:49:30] Tanishq Abraham: We're mostly focused on training foundation models for medicine. And we do have a line of research in neuroscience and AI, combining AI and neuroscience, which is what Paul is leading at MedARC. But yeah, we've been looking into this sort of neuroscience AI research for quite some time, actually.

    [00:49:54] Tanishq Abraham: Actually, I think even before I officially started MedArc when I was organizing [00:50:00] some open source medical AI projects, this was one of the projects that I actually had started, I think, back in summer of 2022. And I think, just generally, the idea was that there's, the idea was we were working on this fMRI to image reconstruction problem, which is basically the idea that we take the, we have a person that is looking at some images and we take their fMRI signal.

    [00:50:25] Tanishq Abraham: and we want to use AI to reconstruct the image that the person was looking at just in the fMRI signal. So it's the sort of mind reading kind of problem that we're working on. And I think up, back in 2022 when we started working on this, at first no the techniques that people were using were quite basic and, I think the sort of neuroscience community was quite behind in what they were, in what they were using.

    [00:50:48] Tanishq Abraham: So I think we were pretty excited about the possibility of utilizing some of the latest techniques in generative AI to advance this field. And yeah, at first I started this project and there were a couple of volunteers helping out, but luckily Paul discovered that we were working on this, and he joined the project and really spearheaded this neuroscience AI initiative that we've been having at MedARC.

    [00:51:14] Tanishq Abraham: And yeah, that resulted in MindEye, which we released in April. I think May of last year and and then we've been continuing to work on improving those results and that has now resulted in MindEye 2. And we also have some other sorts of projects in the neuroscience AI area, like training foundation models for fMRI and we're exploring some other ideas as well.

    [00:51:37] Tanishq Abraham: But yeah, I think with MindEye 1, we had a very simple sort of pipeline: taking the fMRI signals, converting them to CLIP image embeddings, and then basically regenerating an image from the CLIP image embeddings, and that worked quite well. The only issue with that was that it required a lot of data, and we have developed this new pipeline, which Paul will talk more about, that requires less data, is more efficient, and also gives better results with better image generation models; for example, we're using SDXL for this MindEye 2 model. So yeah, I think I'll let Paul talk more about the motivation and how MindEye 2 works.

    [00:52:18] Alex Volkov: Just before we get to Paul, thank you for joining, guys. First of all, I just want to highlight how insane the thing you guys are talking about is to me. Many people think, oh yeah, generative AI generates images and generates some text, and you guys are translating brain signals into what people actually saw. And I think I also saw, separately from you, an attempt to understand fMRI. So Paul, feel free to introduce yourself and maybe also cover prior work in this area. I would love to know if this is something you guys came up with or something you saw and improved on.

    [00:52:52] Alex Volkov: That's

    [00:52:57] Paul Scotti: Yeah, like Tanishq was saying, we started out working on this together over Discord back in 2022. And at the time, there weren't really any good results doing reconstruction of images from looking at images inside of an MRI machine. What really spurred several new papers in this field is open-sourced image generation models like Stable Diffusion, CLIP models, and also, importantly, a good dataset of people looking at images in an MRI machine.

    [00:53:34] Paul Scotti: It's a very difficult dataset to collect because we're talking about eight people who spent 30 to 40 hours inside of this MRI machine looking at images one at a time for three seconds each.

    [00:53:48] Paul Scotti: So it's, it really was the culmination of dataset and new models that allowed this to work. For the MindEye 2 stuff specifically, We focused on trying to get good results using only one hour instead of 40 hours of data.

    [00:54:07] Paul Scotti: And this is pretty important because if you're trying to do these machine learning techniques on new subjects, new data sets, maybe apply to the clinical setting, you aren't going to be collecting dozens of hours of data, especially for clinical populations. It's just too expensive and you're taking up their valuable time.

    [00:54:29] Paul Scotti: There are a lot of papers now that have been focusing on fMRI-to-image, just because it's a cool topic. So our paper shows state-of-the-art results, but specifically in the one-hour domain. We show that you can pretrain a model on other people's brains in order to have a better starting point to fine-tune the model on a separate, held-out subject's brain.

    [00:54:54] Paul Scotti: And for people who aren't maybe as familiar with neuroimaging stuff or how the brain is, how the brain works, your brain is wired very differently to other people. It's not like there's the same. part of the brain that always handles, what happens when you look at a picture of an elephant or something.

    [00:55:15] Paul Scotti: We have different shapes and sizes of brains. We have different patterns of activity that lead to how we perceive vision. And the reconstructions that we're talking about are not as simple as just, was it a dog that you were looking at? Was it an elephant? So you need some sort of way to align all these different people's brains and their different visual representations into a shared latent space so that you can then get the rest of this pipeline with the, diffusion models and MLPs to work and actually have that be informative to generalize from, my brain to your brain.

    [00:55:53] Alex Volkov: So incredible that I have so many questions, Paul, but I will start with maybe the differences between brains, something that you said. I also want to talk about the visual cortex and how that part works, but I would be remiss if I didn't at least mention that you guys are talking about MindEye at the same time we got the first Neuralink-implanted human showing that he can control a machine with his brain, with implants.

    [00:56:19] Alex Volkov: But you guys are doing a completely non-invasive kind of understanding of these brain signals, to an extent, whereas Neuralink is an invasive understanding of brain signals, transforming them into actions rather than something the person sees, though they mentioned that they're working on fixing sight

    [00:56:34] Alex Volkov: As well.

    [00:56:34] Alex Volkov: Could you maybe give us a brief understanding of fMRI and how that translates into signals from the visual cortex? How does this machine know what I see, and how are you then able to use diffusion models to recreate what I see?

    [00:56:48] Alex Volkov: Could you give us like a little bit more of a, what's, where's the magic here?

    [00:56:52] Paul Scotti: Yeah, so, fMRI right now is the best method if we're talking about non invasive tech. If you have electrodes on someone's brain, obviously that's going to give you a much better signal. But it's also not viable to do that for most projects and for applying it to clinical settings and new research and everything.

    [00:57:14] Paul Scotti: So we used fMRI, which is a bit crude in the sense that you have these people that are needing to make as little motion as possible. The MRI machine is basically tracking blood flow. So when you look at an image of something, the neurons in your brain that correspond to representing that image are active and they require more oxygenation to help with how they've been used in relation to the other voxels in the brain that are not as relevant for activating to that image.

    [00:57:50] Paul Scotti: Basically, you're tracking this kind of slow-moving time course of blood flow that corresponds to where the brain is active. Then you have this 3D volume of the brain and the corresponding blood oxygenation changes for every given 3D cube, or voxel, in the brain. And what we did is take all the voxels corresponding to the visual cortex, the back of the brain that seems to be active when you look at stuff, and we feed that through this neural network.[00:58:20]

    [00:58:20] Paul Scotti: And specifically, we feed that through MLPs and a diffusion prior and all this stuff to give us a model that can translate from brain space to CLIP space, where CLIP is one of these models that are contrastively trained, typically with text and images, so that you have this multimodal space where you can align a given image caption with the image itself.

    [00:58:48] Paul Scotti: This you can think of as a third space, a new modality for CLIP that's the brain. So we use the same sort of technique of contrastively mapping the brain and its paired samples corresponding to the images into the CLIP space. And then there are so called unclip models, also sometimes called image variations models, that allow you to undo clip space back to pixel space.

    [00:59:13] Paul Scotti: And so that's how we actually get the image reconstructions at the end, where the model only gets the brain activities and has to generate the corresponding image.
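    To make the pipeline Paul describes a bit more concrete, here is a heavily simplified sketch of the core idea: an MLP maps flattened visual-cortex voxels into CLIP image-embedding space and is trained with a CLIP-style contrastive loss. All shapes, names, and loss details are illustrative assumptions, not the actual MindEye code.

```python
# Heavily simplified sketch: map visual-cortex voxels into CLIP image-embedding
# space with a contrastive (CLIP-style) objective. Not the actual MindEye code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BrainToCLIP(nn.Module):
    def __init__(self, n_voxels: int, clip_dim: int = 768, hidden: int = 2048):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(n_voxels, hidden),
            nn.GELU(),
            nn.Linear(hidden, clip_dim),
        )

    def forward(self, voxels: torch.Tensor) -> torch.Tensor:
        # voxels: (batch, n_voxels) -> L2-normalized (batch, clip_dim)
        return F.normalize(self.mlp(voxels), dim=-1)

def contrastive_loss(brain_emb, image_emb, temperature: float = 0.07):
    # Symmetric InfoNCE over a batch of paired (brain, image) samples.
    logits = brain_emb @ image_emb.T / temperature
    targets = torch.arange(len(brain_emb), device=brain_emb.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

# At inference time, the predicted CLIP image embedding is handed to an
# "unCLIP" / image-variation model to decode back to pixels.
```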

    [00:59:23] Alex Volkov: So I'm still picking my jaw up from the floor here, because what you're basically saying is this: the same kind of architecture that is able to draw cats by understanding the word cat, and pull the concept of a cat from latent space, you've now been able to generalize and add a modality, which is the brain's representation of a cat, what happens with the blood flow in the visual cortex when somebody looks at a cat, and you're basically placing it in the same latent space neighborhood, and now you're able to reconstruct an image based on this. I'm still trying to wrap my head around this, but I would love to maybe ask

    [01:00:01] Alex Volkov: Tanishq as well: could you talk about MindEye 2, specifically the improvements you made, how you achieved them, what they are in fact, and then how this applies to the clinical field?

    [01:00:11] Tanishq Abraham: Right. I mean, so with MindEye 2 like Paul mentioned, our main focus was what can we do to basically use less data when it comes to a new subject. So if you have a, you have a new person that you want to, read their mind, you want to do this reconstruction, we don't want them to have to do 40 hours of scanning because with MindEye 1, you'd have to basically train a separate model for every single subject.

    [01:00:34] Tanishq Abraham: So it was like a completely separate model for each subject. So if you had a new subject, you would have to get 40 hours of scanning with that new subject to create a new model. So

    [01:00:42] Tanishq Abraham: The idea with MindEye 2 is that we train a model on all of the previous subjects. So, for example, we have eight subjects in the dataset: you train on seven of the subjects, and then you are able to fine-tune that model on a new subject, but you only need one hour of data.

    [01:01:06] Tanishq Abraham: So basically, for any new subject, now you only need one hour of data.

    [01:01:09] Tanishq Abraham: The way that works is that we have adapter layers, which are just linear layers that you have for each subject. So you have the fMRI data from a new subject, but you have this linear adapter layer that converts it into a kind of shared space for all the fMRI data.

    [01:01:32] Tanishq Abraham: So then, when you are taking on a new patient or a new subject, all you have to do is fine-tune this linear adapter for that new subject. And yeah, that's the general idea of what we try to do there; that way, we only have to use one hour of data.
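    Here is a rough sketch of what a per-subject linear adapter in front of a shared backbone can look like; the shapes, names, and freezing strategy are illustrative assumptions rather than the actual MindEye 2 implementation.

```python
# Rough sketch: per-subject linear adapters feeding one shared backbone.
# Shapes, names, and the freezing strategy are illustrative assumptions.
import torch.nn as nn

class SharedSpaceModel(nn.Module):
    def __init__(self, voxels_per_subject: dict, shared_dim: int = 4096, clip_dim: int = 768):
        super().__init__()
        # one small linear adapter per subject, mapping that subject's
        # voxel layout into a shared latent space
        self.adapters = nn.ModuleDict({
            subject: nn.Linear(n_voxels, shared_dim)
            for subject, n_voxels in voxels_per_subject.items()
        })
        # one backbone trained across all subjects
        self.backbone = nn.Sequential(
            nn.Linear(shared_dim, shared_dim),
            nn.GELU(),
            nn.Linear(shared_dim, clip_dim),
        )

    def forward(self, voxels, subject: str):
        return self.backbone(self.adapters[subject](voxels))

# Fine-tuning on a new subject: register a fresh adapter for them, optionally
# freeze the shared backbone, and train on that subject's single hour of data.
```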

    [01:01:49] Tanishq Abraham: But then on top of that, of course, we have various modifications to the entire pipeline that also give you better results overall. For example, in the past, when we were taking our CLIP image embedding and then reconstructing, we used a different model called Versatile Diffusion, but here what we did is actually take SDXL, and the problem with a model like SDXL, for example, is that it only takes in CLIP text embeddings.

    [01:02:19] Tanishq Abraham: Because these models are text-to-image models, oftentimes they take CLIP text embeddings, and that's what they're conditioned on. But here, what we did is fine-tune SDXL to instead be conditioned on CLIP image embeddings, so we have this SDXL unCLIP model, as we call it. That is one improvement, for example: we use this model instead of the previous model, which was Versatile Diffusion.

    [01:02:42] Tanishq Abraham: There are a few other improvements to the architecture and to the conditioning that we have. I think Paul can talk more about that, but apart from these general improvements, I think the main innovation is the use of these sorts of adapters for

    [01:02:59] Tanishq Abraham: each subject, which allows us to then fine-tune for new subjects with only one hour of data. And

    [01:03:05] Tanishq Abraham: Paul, feel free to add any other details as well.

    [01:03:08] Alex Volkov: Yeah. I want to follow up with Paul specifically around you're moving from 40 hours to let's say one hour, one hour still in this like fMRI, basically a coffin, right? like it's a huge machine, like it's super incredibly expensive so the data, the it's not Maybe I'm actually going to presume here, but maybe please correct me if I'm wrong.

    [01:03:26] Alex Volkov: Unlike other areas, where synthetic data is now a thing that people actually use to improve models, have you guys played with synthetic data at all? Is that something you've tried that seems helpful? Or does this actually require humans to sit in those machines and provide the data for

    [01:03:40] Alex Volkov: you?

    [01:03:42] Paul Scotti: Yeah, I mean, to an extent you need real data to validate things, but we have done augmentation, which is like synthetic data, to make the models more robust, right? We've played around with averaging samples from different images together, doing mixup kinds of data augmentations to make the pipeline work better, and there are some other projects we're doing that might involve more synthetic approaches.
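    For reference, a mixup-style augmentation on paired voxel/embedding samples can be as small as the sketch below; this is a generic illustration of the technique Paul mentions, not MedARC's actual code.

```python
# Generic mixup-style augmentation on paired (voxel, target-embedding) batches.
# An illustration of the technique mentioned above, not MedARC's code.
import torch

def mixup(voxels: torch.Tensor, embeddings: torch.Tensor, alpha: float = 0.2):
    """Blend random pairs of training samples and their targets."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    perm = torch.randperm(voxels.size(0))
    mixed_voxels = lam * voxels + (1 - lam) * voxels[perm]
    mixed_embeddings = lam * embeddings + (1 - lam) * embeddings[perm]
    return mixed_voxels, mixed_embeddings
```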

    [01:04:16] Alex Volkov: Awesome. And so I think I'll end with this one last question. There's the very famous quote from Jurassic Park, that the scientists were so preoccupied with whether they could, they didn't stop to think if they should, but not in this area. I want to ask you specifically: what are some of the applications that you see for something like this when you guys get to MindEye 3 or 4 or 5, maybe with different signals, maybe with EEG, I don't know? What are some of the implications you see of being able to read somebody's mind, and what can it help

    [01:04:47] Alex Volkov: with?

    [01:04:49] Paul Scotti: Yeah. So, do you want to, yeah, okay, I can go ahead. So, there are just so many different directions, right? Right now we're focusing on perception, but the more interesting thing would be mental imagery, like dream reading, or applying these models in real time so that you can reconstruct while the person is still in the scanner, which allows you to do cool new experimental designs as well.

    [01:05:15] Paul Scotti: You could look at memory, try to reconstruct someone's memory of something. Yeah, Tanishq, maybe you can add on to that.

    [01:05:26] Tanishq Abraham: Yeah. So, the thing is, what's really interesting is that a lot of the pathways and activity for perceiving an image you're looking at right now are similar to those for imagining, and dreams, and these sorts of things. Of course there are some differences, but a lot of these pipelines should hopefully be generalizable to some of these other applications, like reconstructing what you're imagining and things like this. And in fact, there is some work on this already.

    [01:05:49] Tanishq Abraham: There's like a paper from one of our collaborators that may be coming out in a couple months that is exploring this. So it's actually not just limited to. what you're looking at, but you know, more generally as well. But I think just even with this technology that we have with what you're looking at and reconstructing that, I think there's lots of interesting like clinical applications.

    [01:06:08] Tanishq Abraham: For example maybe, the way you perceive is associated with your mental condition. So maybe it could be used for different biomarkers, different diagnostic applications. So for example, if you're depressed, for example, maybe you are going to perceive an image.

    [01:06:21] Tanishq Abraham: in a more dull fashion, for example. And so I think there's a lot you can learn about how the brain works by looking at how people are perceiving it perceiving images, and also utilizing that for potential clinical and diagnostic applications. So that's also an area that is completely underexplored.

    [01:06:39] Tanishq Abraham: [01:06:40] And it's been pretty much underexplored because people weren't able to get such high-quality reconstructions before. I think the introduction of MindEye 1 was one of the first times that we were able to get such high-quality reconstructions, and of course, even then, we had to use the 40 hours of data to do that.

    [01:06:56] Tanishq Abraham: And now we're actually bringing it down to one hour of data, and with further work, we may be able to bring it down even further. So now it's actually potentially possible to use this for actual clinical applications, and that is what I'm most excited about in the near term: potential diagnostic applications, or potential neuroscience research applications.

    [01:07:17] Tanishq Abraham: And then of course, long term vision is trying to apply this for, looking at imagination, dreams, memory. That's, I think, the long term vision and interest there. So that's at least how I see this field progressing and what I'm interested in personally. One, maybe just one more quick nuance is that with the shared subject stuff, it's not limited necessarily to reconstructing images.

    [01:07:41] Tanishq Abraham: So typically, machine learning approaches, you need a lot of data, but data takes a lot of time in the MRI machine. And so this approach of using other people's brains as a better starting point allows clinicians to potentially use more complicated ML pipelines for investigating the brain, maybe even outside of image reconstructions, in a way that's feasible given the time commitments that scanning entails.

    [01:08:11] Alex Volkov: I absolutely loved, the first thing you said, Paul, that, if we get to real time as the person in the machine, that some stuff, some understanding, interpretation of what they're going through could happen as well. That's extremely exciting. And at

    [01:08:23] Alex Volkov: the rate generative AI is going, I'm positive this is possible, and I'm very happy that you guys are working on this and are excited about building improvements on it. The jump from 40 hours to one hour seems incredible to me, and if this trend continues, there are definitely exciting possibilities. Thank you guys for coming up. Maybe let's finish on this: what are you restricted on going forward? Is it compute? Is it data? Is it talent? Maybe you want to shout something out, maybe you're hiring, feel free, the stage is yours. What else is needed to get to MindEye 3 faster?

    [01:08:56] Tanishq Abraham: Yeah, I think it's mostly manpower, I guess. I mean, we're mostly relying on volunteers, and Paul, of course, is doing a great job leading this, so that I think is the main limitation. But of course, with MedARC, we are doing everything open source and transparently, so we have a Discord server where we organize all of our research and progress, and we have all the contributors join.

    [01:09:20] Tanishq Abraham: I mean, we've been lucky to have amazing contributors so far, from Princeton, the University of Minnesota, the University of Waterloo; from all around the world we've had people contribute, but of course, more contributors are always better. And if you're interested in this sort of research,

    [01:09:35] Tanishq Abraham: please join our Discord, and of course feel free to read the papers as well, and follow us on Twitter; we'll be updating our progress on Twitter as well. But yeah, just check out our Twitter and join our Discord, those are the main ones.

    [01:09:49] Tanishq Abraham: But yeah,

    [01:09:50] Alex Volkov: Absolutely. And thank you guys for coming up. I'm very happy that I was able to talk to you guys, because last time, when you raised your hand, I was like, oh, this is so cool, I know Tanishq, but back then we didn't bring you up. So Paul, thank you, it's great meeting you, and you guys are doing incredible work, and

    [01:10:03] Alex Volkov: I think it's very important.

    [01:10:04] Alex Volkov: I'm very happy to highlight this as well. Now we're moving to something a little bit different.

    [01:10:08] Alex Volkov: Let's reset the space a little bit, and then let's talk about fine tuning.

    [01:10:24] Alex Volkov: All righty. ThursdAI, March 28th, the second part of the show. If you just joined us, we just had an incredible conversation with Paul Scotti and Tanishq Abraham from MedARC and, I guess, part of Stability

    [01:10:43] Alex Volkov: as well. and we've talked about AI reading your brain and understanding what you saw, which is incredible.

    [01:10:48] Alex Volkov: And I definitely recommend listening to that if you joined in the middle or are just joining us late. Meanwhile, we also covered a bunch of open source stuff so far. We also covered that Claude Opus is now taking over as the number one LLM in the world right now, something we all knew, but now the LMSys Arena is catching up. We also had a bunch of breaking news, and I want to just reset the space and say that, hey, for everybody who joined us for the first time, this is ThursdAI. We talk about AI every week, everything that's important and impactful in the world

    [01:11:18] Alex Volkov: of AI from week to week, and we've been doing this for more than a year. You're more than welcome to join the conversation in the comments as well; we're reading through those. And if you're late to any part of this, it's released as a podcast episode on every

    [01:11:33] Alex Volkov: podcast platform, so you're more than welcome to follow us on Twitter, Apple, and Spotify, and wherever you get your podcasts. There's also a newsletter with all the links and videos and everything we talk about that you have to actually see, right? So a link to the MindEye paper will be in the show notes and the newsletter as well.

    [01:11:48] This weeks buzz - WandB in SF in April

    [01:11:48] Alex Volkov: I will also say that my actual job is AI evangelist at Weights & Biases, a company that builds tools for all these model creators to actually track their experiments. Weights & Biases is coming to San Francisco on April 17th and 18th; we have a conference there. If you're in the area, or you want to fly in and meet a bunch of folks in San Francisco, you're more than welcome to use this as your reason and opportunity. I think for the next few days

    [01:12:15] Alex Volkov: the tickets are still early bird, at 50 percent of the price. We're doing a workshop on April 17th about improving your business with LLMs, covering everything from prompting to evaluation, plus a bunch of very exciting conversations. So if you're in the area, please stop by and high-five me; I'm going to be in San Francisco for the whole week. And moving on here, I want to chat about fine-tuning, and I see LDJ here.

    [01:12:36] Discussion: Is finetuning still valuable?

    [01:12:36] Alex Volkov: I think we've covered pretty much everything important, unless there's breaking news, and hopefully folks will DM me if there is. There has been a sentiment, at least in our little bubble of AI on X, where some folks started to get a little bit disillusioned with the concept of fine tuning. And I don't think the disillusionment is necessarily with fine tuning as a concept; I think it's the general vibe we're getting from folks like Ethan Mollick and Anton Bakaj, folks we follow for this kind of information.

    [01:13:07] Alex Volkov: The disillusionment stems from the fact that, as we previously covered, long context windows affect RAG use cases, for example, but long context could also affect finetuning. Because if you get something like Haiku, which is now something like the fifth or sixth best LLM in the world but costs 25 cents per million tokens, and you can send a bunch of examples into Haiku with every request, maybe you don't need to fine tune. So this has been a bit of a sentiment. Also, the bigger models they release, like the recent Databricks model, are huge and really hard to fine tune; you have to actually have a bunch of hardware. So we've seen this sentiment, and I briefly wanted to touch on it with LDJ and Nisten and Junyang and Tanishq. Everybody who's on stage, feel free to chime in, and from the

    [01:13:55] Alex Volkov: audience.

    [01:13:56] Alex Volkov: If you're friends of the pod, do you want to come up and talk about fine tuning? Let's talk about this sentiment. LDJ, I saw your question. Yes, we've covered Jamba in the beginning. We're very excited. I think Jan was here and now he's talking to the actual AI21 folks. So I want to do this fine tuning conversation.

    [01:14:09] Alex Volkov: LDJ, we briefly covered this and we said, Hey, it would be awesome to just chat about this like face to face. So what's your take on this recent sentiment? What are you getting from this?

    [01:14:18] LDJ: Yeah, I guess when it comes specifically to, I guess, the business advantage of fine tuning for a specific use case, to try and have a cost advantage over OpenAI models or something, I feel like things might be changing with Haiku. And, I mean, you talked about this before, it was either you or somebody else who posted a chart of the average trend of cost versus how good the model is, and Haiku is breaking that trend: it's really good while being significantly cheaper than it should be given the previous trends.

    [01:14:53] Alex Volkov: think that was Swyx. Let me go find it Yeah.

    [01:14:56] LDJ: Yeah and yeah, I think just overall for a lot of things that [01:15:00] people would have fine tuned open source models for, Haiku, it just might make sense to use Haiku, and it might be able to do those things that you would fine tune for anyways better or equal, and at the same time be really cheap already to run.

    [01:15:14] LDJ: And I think, definitely, the amount of tasks that it makes sense to fine tune for, from an economic point of view, is just probably smaller now than before, and I guess that is probably going to keep shrinking as closed source becomes more and more efficient.

    [01:15:32] Alex Volkov: Yeah, so absolutely there's a

    [01:15:33] Alex Volkov: few areas where fine tuning is even a concept, right? There's the general instruction fine tuning, where you take a base model and you try to make it more helpful. But there's also fine tuning for more knowledge, for example, and maybe you guys can correct me on this and feel free to step in here, Junyang as well. The knowledge kind of fine tuning, like giving this model more information,

    [01:15:56] Alex Volkov: sometimes suffers from stuff like catastrophic forgetting, where the model starts to forget some other stuff.

    [01:16:02] Alex Volkov: But also things like RAG, for example, are potentially helping in that area, where you can actually have a citation of a specific source that the model referred to, which is very important, especially in the enterprise and companies area. When you want to build something like an assistant, or retrieval, or better search, you actually don't want to count on the model's potential to hallucinate; you want to cite something. So for knowledge retrieval, RAG seems to be winning over finetuning, at least in the companies and enterprise area. And then the question is: is RAG over a finetuned model for your specific stuff better than RAG over a general model with a huge context? I think this is the area of disillusionment, specifically around the cost of pulling everything back into context. Previously, large context windows were very not cost effective; we briefly mentioned this today in the area of the Jamba models, where context is now cheaper, but for a regular Transformer LLM, context is expensive.

    [01:17:04] Alex Volkov: The more context you have, the more the hardware requirements grow, and so I think some of the disillusionment comes from that. Some of it is probably also related to how big the models have gotten. I don't know, Nisten, if you want to chime in on this, or on how even Grok-1, the model, was huge; people were getting excited, but then folks like Teknium from Nous Research said, we won't even try to fine tune this, even for instruction, because it's just too big. So I wanted to hear from you, Nisten, because you guys also did a bunch of fine tuning. And maybe merging is related here as well.

    [01:17:43] Nisten Tahiraj: Yeah, gotta keep in mind that for a while, fine tuning was a lot more expensive. Running fine tuned models was a lot more expensive than using GPT 3. 5. And then it got a lot cheaper with all the API companies, especially together and the other ones. So the business case for it has not really been how how cheap it is.

    [01:18:08] Nisten Tahiraj: I think, in my opinion, the business case has been all about data ownership. A lot of companies that have their own chatbots and stuff, they see the data as their property and the value in their company, so the reason they fine tune is not necessarily because it's better, sometimes it is, but it's been to just have full control of the data. And there have been a lot of drawbacks where the knowledge could be lost. But there are much newer techniques where you can do, quote unquote, lossless fine tuning and still keep it. But yeah, I'll land it there. So I think the business case is not necessarily the cost; it's always just been about data ownership.

    [01:18:53] Nisten Tahiraj: I'm actually doing consulting for one client now that really just wants to use Grok. They used the Grok API before, and now they want to run it on their own, and they don't care how many GPUs and stuff it costs to run, because they factor it in with what their users pay.

    [01:19:13] Nisten Tahiraj: So, so, so yeah I'm noticing that it's more about the ownership side, not not necessarily the performance or cost.

    [01:19:21] Alex Volkov: Grok with a K, or Groq with a Q?

    [01:19:23] Nisten Tahiraj: Grok with a K the new, yeah,

    [01:19:25] Alex Volkov: Oh, really? What API did they use for Grok? There's no API. Is there an API for Grok that I missed?

    [01:19:31] Nisten Tahiraj: No they

    [01:19:31] Ian Maurer: open source the model.

    [01:19:33] Alex Volkov: Oh, so somebody hosted this and then they used the API since the, since last week basically

    [01:19:37] Ian Maurer: no, they people

    [01:19:38] Nisten Tahiraj: have used Grok. I think they just did like a translation layer via Premium, but they did use Grok in a product, via an API. I'll have to double check how exactly,

    [01:19:53] Alex Volkov: like I can think of a way, but I'm not saying it's kosher. Like you, you can, you can put a Chrome extension and use the browser. Very

    [01:19:59] Nisten Tahiraj: No, even Levels.io deployed, uh, like a WhatsApp bot that was running off of Grok too. So again, I'll check up on that. I don't know what API stuff they used, but I am helping them now just run their own.

    [01:20:16] Alex Volkov: I see. LDJ, you unmuted. Do you want to chime in on the specific choice and data ownership piece of fine tuning? Which I think is important. But from the other side, if I'm representing the other side, and I'm not, I'm just trying to figure out where the vibes about this disillusionment are coming from: most clouds now run most

    [01:20:34] Alex Volkov: Open source models, or at least, Microsoft definitely is now like supporting Mixtral.

    [01:20:38] Alex Volkov: I don't know if they're going to run Grok for you or not. And there's also something to be said where, if you're running Claude from inside Amazon Bedrock or Vertex or whatever, you still own your data, don't you?

    [01:20:52] LDJ: I'm not too familiar with the situation with Vertex and stuff but I do think that in the situations where a business has to. would want to and has to fine tune on like their company data so that employees can actually like, use something that is like an AI that understands the internal company information.

    [01:21:12] LDJ: That is, I would say, still a decent sized use case where you would have to use the open source models, unless you're fine with giving OpenAI your data and stuff. I'm not saying OpenAI will necessarily train on it, I know they have different clauses and stuff, but you know, there's always that risk, and if you want to keep that stuff secret and internal, then you do still have to use the open source models to fine tune.

    [01:21:38] Alex Volkov: Yeah. The additional piece that I think Ethan Mollick pointed to, and before I get to Justin, super quick, is the example of Bloomberg. And I think, LDJ, you wanted to push back on this example, but I'll cover it briefly. Bloomberg famously trained a model called BloombergGPT based on the kind of financial data that Bloomberg has access to.

    [01:22:00] Alex Volkov: And back then it significantly improved how the LLM reasoned about finances and financial data, et cetera, only to then find out that a general model like GPT-4 blows it out of the water, despite the whatever 10 million they spent on that. And I think this was also a highlight of how general models, after they get released, keep getting better across the board, not only in general but for your specific task as well. And before we get to Junyang: LDJ, you had a pushback that they didn't do it correctly, that it was a skill issue or something like this, right?

    [01:22:32] LDJ: Yeah. I think it was honestly more of a skill issue on Bloomberg's part because. And I'll try and find the exact source for what I'm about to say, but it was like within a few weeks of Bloomberg GPT releasing, like there's like just a couple open source developers that released like a finance specific model.

    [01:22:49] LDJ: That was performing significantly better on the finance benchmarks with the same amount or less parameters. And that was just within a few weeks of Bloomberg GPT releasing. So obviously you didn't even need all that Bloomberg data and all that stuff to actually even get something that, that well performing.

    [01:23:06] Alex Volkov: Yep. All right.

    [01:23:07] Alex Volkov: I want to get to Justin, because, Justin, obviously you're on the Qwen team, you guys are building models that then other folks maybe fine tune and probably also supporting, enterprise use cases. What's your take on the fine tuning area?[01:23:20]

    [01:23:20] Justin Lin: Yeah, just some comments on fine tuning for customer data. I somehow disagree with the idea that we can inject new knowledge into the model through fine tuning, because it is really difficult to do that with so little data; we often use a very small amount of data for fine tuning. I have read a paper, I don't remember its name, but it tells us that fine tuning is more about aligning to the behavior, to the style, but not injecting new knowledge. If you want to inject new knowledge, you have to do things like pre-training, next token prediction with tens of billions of tokens. You can do this, but it is really hard.

    [01:24:09] Justin Lin: Something I would like to comment on is that our customers fine tune our model and they find that the general capability decreases with the new knowledge. I think this is quite reasonable, because somehow our customers or users don't really know how to fine tune a general model.

    [01:24:29] Justin Lin: They want the general capability, but they want something new. So we have provided a solution is that we just provide our data for general fine tuning in a black box way. So you can use our data, but you cannot see our data, and you can mix our data with your own, yeah, customer data so that you can train a new model which has a balanced behavior good general capabilities, but some new knowledge or some new styles of your company or something like that.

    [01:25:04] Justin Lin: Yeah. This is some of my personal

    [01:25:06] Justin Lin: experience. Yeah.

    [01:25:07] Alex Volkov: I really appreciate this, because I think the difference is important: fine tuning is not a catch-all term. There's fine tuning for style, fine tuning for alignment, for different ways to respond, for example, and that, I think, still makes perfect sense. We have base models, and we have fine tuned models, for instruction fine tuning, for example. But at least the way I see it on my radar, and I wanted to bring this to ThursdAI because I think it's very important for folks who follow this to know it's happening, the disillusionment is specifically with fine tuning for new knowledge, not for new styles. Because the additional piece here is that fine tuning takes a while, and like we said about Bloomberg, maybe it's a skill issue, maybe you have to get those machine learning engineers. Whereas with the advent of faster hardware and better models that are open for you, and that are now hosted on things like Bedrock from Amazon, it's basically in your cloud: they're running whatever, Haiku, in your cloud, and the same agreements about not training on your data apply. The same applies to OpenAI; you can run it through the Microsoft offering in your cloud, in Azure, and it's not like you're sending data to OpenAI. So when we get to bigger contexts, the ability to switch up and give whatever product you're building on top of these LLMs new data is easier by just providing it in the context than by finetuning.

    [01:26:29] Alex Volkov: Tanishq, I saw you had your hand up and definitely want to hear from you as well.

    [01:26:34] Tanishq Abraham: Yeah, I guess I just had a few thoughts about this whole thing, because I'm working in the medical AI space and we build models for clinical applications, medical applications. So I have various thoughts about this. I think just generally, with fine tuning, it's particularly useful, like LDJ said, for the use case of private data; that's of course a big one.

    [01:26:56] Tanishq Abraham: I think also if you want to have models locally, you want to use models locally. I think that's another big use case. A lot of times, there are many cases where, you don't want. To use cloud services, I think like in the medical scenario, for example, maybe you don't want to send medical data to various cloud providers and having some sort of local models could potentially be useful.

    [01:27:13] Tanishq Abraham: And of course there are other applications where maybe you want to have models run on, Some sort of like smartphones or other devices. So that's, I think one particular area where like fine tuning is particularly valuable. I think, in the sort of just to provide maybe some context in the medical space, medical AI space, I think this idea of whether or not fine tuning is useful is, I think, honestly, in my opinion, like an argument that's like still not settled yet.

    [01:27:38] Tanishq Abraham: So for example, in the clinical NLP space, you have models like, of course, GPT-4, then Google has their MedPaLM models, and other people are creating specific fine tunes. About a couple of years ago, or maybe it was a year ago, there was a paper that tried to see whether something like GPT-3 was better, or whether fine tuning a specific model for medical use cases was better.

    [01:28:02] Tanishq Abraham: They found that fine tuning was better performing and of course required fewer parameters and was a smaller model. But then people at Google, for example, created their MedPaLM models. Those are more like alignment in the sense that Justin was talking about: the knowledge is mostly there in the original PaLM models and they're just doing some sort of instruction fine tuning.

    [01:28:22] Tanishq Abraham: And so that has been shown to do quite well. And then recently there was a paper, the MedPrompt paper, which basically prompted GPT-4 to outperform all these other models on medical tasks. And so that one was just trying to say that a general purpose model is good enough.

    [01:28:40] Tanishq Abraham: So I think it's still actually an open question, at least in this specific area, whether or not fine tuning is better, or if it's just alignment that's needed, or if you can just use the general purpose model. And so we're trying to study this question in a little bit more detail as well, and trying to see if fine tuning really is necessary, if it actually does provide benefit.

    [01:28:58] Tanishq Abraham: And at least for me, when I say fine tuning, I also think of it as continued pre-training, where we are training on tens of billions of tokens to add knowledge to a model. People talk about fine tuning, but they also talk about continued pre-training, and sometimes the distinction between those is a little bit blurry.

    [01:29:18] Tanishq Abraham: There isn't much of a distinction sometimes, so there's also that as well. And a lot of the time that is the question of whether it's just doing alignment versus adding knowledge; that's part of the discussion, and it isn't really clarified very often. So that's the other aspect, but yeah, those are my thoughts on the topic.

    [01:29:37] Alex Volkov: Thanks, Tanishq. And I also want to welcome Ian Maurer to the stage. Ian, it's been a while since you've been here. Thoughts on this exciting discussion? And have you seen the same trends, the same kind of vibes that I brought up, where you read and

    [01:29:51] Ian Maurer: yeah.

    [01:29:51] Ian Maurer: We were talking about this in January, Alex, I found the conversation, right? Finetuning versus RAG: the question is, what's your goal? What's your use case? What's your eval? I think Hamel even mentioned, do you even know what your evals are? Do you even know what you're trying to accomplish?

    [01:30:03] Ian Maurer: Without that good luck fine tuning, good luck building an app. Anyways my, I have a very distinct opinion and perspective, but I'll give you guys background so you understand where it's coming from. My company is 12 years old. We've got an old, good old fashioned AI company where we've curated 100, 000, Rules, effectively, in a knowledge base.

    [01:30:20] Ian Maurer: It's a graph. It's got ontologies and things like that. And those rules have been curated by experts with PhDs, and we have an API that sits over it, and reasons over it, and can match patients to clinical trials. This is for cancer, right? So patients get DNA sequenced, and it's very complicated, whatever.

    [01:30:35] Ian Maurer: So, the great thing about large language models and as they get bigger and better is that they can understand language, including all language, including medical language, so they can understand the intent of a provider, right? The provider's trying to accomplish something, which is as quickly as possible, how do I help this patient?

    [01:30:51] Ian Maurer: And So the thing that I have found that's most useful for us is to help that expert be as productive as they can possibly be. Use the large language model to understand their intent, what they have. I have a patient, they have a problem, what they want to find the best possible treatments for that patient.

    [01:31:07] Ian Maurer: And then how to do that is by giving that large language model tools, right? Don't. Why do I want to fine tune knowledge into it? And then I just, I basically black boxed all my knowledge, right? Great. I have all this great knowledge I've curated over the years. I'm going to fine tune it into my system. And now it's a black box and I can't tell you where from or why it's there.

    [01:31:25] Ian Maurer: No, I want to be able to tell you, here's the trials that are available for your patient. Here's the drugs that are available for your patient. This is the, the best possible outcome for that. And here's the link to the clinical trials page, or here's the link to the the FDA page that tells you why this drug is so [01:31:40] good.

    [01:31:40] Ian Maurer: I can't do that if it's a black box. I'd be hallucinating all over the place. So my perspective is: finetuning is great if you're talking about a very discrete use case where you're trying to drill down on cost. Hey, I figured out this named entity recognition pattern, and I was doing it expensively with few shot learning.

    [01:31:57] Ian Maurer: Now I'm going to go, fine tune something and save that cost. But otherwise, you know Use the best possible model, give them tools, whether it's, through function calling or GPT actions are actually pretty good. And that's the best way to get the value out of the large language model and work with existing knowledge.

    [01:32:13] Alex Volkov: So, definitely: citations, and knowing exactly where your data comes from, and not blurring it out inside the brain of an LLM, fuzzing it to where you can't actually know where it came from or whether or not it's hallucinated. I think that's a big piece here that companies are actually starting to get into.

    [01:32:30] Alex Volkov: And so I think your perspective is very important as well. I think also, at least from the vibes that I've seen, there's the perspective of updating that data afterwards: continued fine tuning requires more knowledge and more skill than just updating your vector databases, let's say, and having the model work from enough context. And I think the smartness-to-price ratio is very important as well. If we get models like Haiku, for example, that are incredibly cheap but have a vast context length, you can use that context both for the alignment piece, making it behave however you want it to behave, answer as your company rather than as a generic LLM, and for the knowledge piece, because you have enough context to do that and it's not cost prohibitive to use this large context for a bunch of stuff. And that's very important.

    [01:33:18] Alex Volkov: so I thanks Ian for coming up. I want to tie this back a little bit and then close the discussion also, I do want to shout out that you also have an awesome list of function calling, which now includes a bunch of open source. models that support function calling as well . The support is like function calling as well and it talks about the specifics in which they support function calling. Which is great and definitely will be in the show notes as well and with that folks, I think we'll end ThursdAI for today we had a bunch of stuff.

    [01:33:44] Alex Volkov: There's a small piece of breaking news from Ray. Ray just mentioned that Cursor, the AI editor that a lot of us use and love, just released an update where their Copilot++ feature is now twice as fast in some areas, and that's been awesome to use. So if you haven't used Cursor yet, definitely give them a try.

    [01:34:02] Alex Volkov: And Cursor is like really impressive, especially with Opus. If you have paid for Cursor Premium, have access to the best LLM in the world. I think that this is all that we wanted to talk about. thank you everybody for

    [01:34:12] Alex Volkov: joining from week to week.

    [01:34:13] Alex Volkov: I think that's most of what we talked about on ThursdAI for March 28th. With that, I want to thank Nisten, LDJ, Justin (Junyang), Robert Scoble who was here before, Ian Maurer who jumped on, Tanishq, and Paul from MedARC, and everybody else who joined us. I really appreciate everybody's time here. If you're not subscribed to ThursdAI, subscribe to get every link that we've talked about; I really work hard to give you all the links. Other than that, have a nice Thursday, everyone. We'll see you next week. Cheers, everyone.

    [01:34:41] Ian Maurer: Bye everybody.

    [01:34:42] Alex Volkov: bye bye



  • March madness... I know for some folks this means basketball or something, but since this is an AI newsletter, and this March was indeed mad, I am claiming it. This week seemed madder from one day to another. And the ai announcements kept coming throughout the recording, I used the "breaking news" button a few times during this week's show!

    This week we covered tons of corporate AI drama in the BigCO segment, from Inflection → Microsoft move, to Apple Gemini rumors, to Nvidia GTC conference, but we also had a bunch of OpenSource to go over, including an exciting glimpse into the O1 from Open Interpreter, which the founder Killian (of the ThursdAI mafia haha) joined to chat about briefly after an all nighter release push!

    Another returning FOTP (friend of the pod) Matt Shumer joined as we did a little deep dive into prompting Claude, and how he went viral (seems to happen a lot to Matt) with a project of his to make Claude write prompts for itself! Definitely worth a listen, it's the first segment post the TL'DR on the pod 👂 this week.

    Btw, did you already check out fully connected? It's the annual Weights & Biases conference in SF next month, and tickets are flying, I'm going to be there and actually do a workshop one day prior, would love to invite you to join as well!

    TL;DR of all topics covered:

    * Open Source LLMs

    * Xai open sources Grok (X, Blog, HF, Github)

    * Sakana AI releases a new paper + 2 JP merged SOTA models (X, Paper, Blogpost)

    * Open Interpreter announces O1 - the Linux for AI devices (X, Project)

    * LM studio new modes (X)

    * Big CO LLMs + APIs

    * Nvidia GTC conference - Blackwell platform, NIMs and Gr00t robotics

    * Jensen interviewed transformers authors

    * Apple rumored to look at a deal including GEMINI

    * Apple releases a multi modal MM1 paper (X)

    * Inflection founders leave to head Microsoft AI

    * Google opens up Gemini 1.5 with 1M context access to all (X)

    * Vision & Video

    * NVIDIA + MIT release VILA (13B, 7B and 2.7B) (X, HuggingFace, Paper)

    * This week's BUZZ

    * Fully Connected is coming, sign up here, get tickets, join us.

    * I'm running a workshop in SF a day before on improving your LLM step by step including exciting announcements (same link)

    * Voice & Audio

    * Suno V3 launched officially (X, Blog, Play with it)

    * Distil-whisper-v3 - more accurate, and 6x version of whisper large (X, Code)

    * AI Art & Diffusion & 3D

    * Stability presents SD3 TURBO - 4 steps to get same high quality generation (Paper)

    * Stability open sources Stable Video 3D (Blog, Models)

    * Tools & Others

    * Neuralink interview with the first Human NeuroNaut - Nolan (X)

    * Lex & Sama released a podcast, barely any news

    * Matt Shumer releases his Claude Prompt engineer (X, Metaprompt, Matt's Collab)

    Open Source LLMs

    Xai open sources Grok (X, Blog, HF, Github)

    Well, Space Uncle Elon had a huge week, from sending Starship into orbit successfully to open sourcing an LLM for us, and a huge one at that. Grok is a 314B parameter behemoth with a mixture-of-experts architecture: 8 experts, 2 of them active for any given token, so only about a quarter of those parameters are used per forward pass.
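
    xAI's actual routing code is in their GitHub release; purely to illustrate what "2 active experts" means, here is a minimal, toy top-2 mixture-of-experts layer in PyTorch. It's a sketch with made-up sizes, not Grok's implementation.

```python
# Toy top-2 mixture-of-experts layer: a router scores all experts per token,
# only the two highest-scoring experts run, and their outputs are mixed.
# Illustrative sketch only -- not xAI's Grok code; sizes are made up.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyTop2MoE(nn.Module):
    def __init__(self, dim: int = 512, num_experts: int = 8, hidden: int = 2048):
        super().__init__()
        self.router = nn.Linear(dim, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scores = self.router(x)                  # (tokens, num_experts)
        top_w, top_idx = scores.topk(2, dim=-1)  # keep only 2 experts per token
        top_w = F.softmax(top_w, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(2):                    # the 2 "active" experts
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e
                if mask.any():
                    out[mask] += top_w[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(4, 512)                     # 4 tokens, hidden size 512
print(ToyTop2MoE()(tokens).shape)                # torch.Size([4, 512])
```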

    It's released as a base model, and maybe that's why it was received with initial excitement but then, nobody in the GPU poor compute category has the ability to run/finetune it!

    In terms of performance, it barely beats out Mixtral, while being almost 10x larger, which just shows that.... data is important, maybe more important than Github stars as Arthur (CEO Mistral) helpfully pointed out to Igor (founder of Xai). Still big props to the team for training and releasing this model under apache 2 license.

    Sakana AI launches 2 new models using evolutionary algo merging

    Yeah, that's a mouthful. I've been following Hardmaru (David Ha) for a while before he joined Sakana, and only when co-founder (and Transformers co-author) Llion Jones talked about it on stage at GTC did the dots connect. Sakana means fish in Japanese, and the idea behind this lab is to create things using nature-inspired methods like evolutionary algorithms.

    The first things they open sourced were 2 new SOTA Japanese LLMs that beat significantly larger models by using merging (which we covered with Maxime previously, and whom Sakana actually shouted out in their work).
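
    Sakana's contribution is the evolutionary *search* over how to merge; the merge step itself can be as simple as interpolating weights between checkpoints that share an architecture. Here's a hedged, minimal sketch of that step only: plain linear interpolation rather than their evolutionary recipe, with placeholder model IDs rather than Sakana's checkpoints.

```python
# Minimal weight-space merge: linear interpolation between two checkpoints that
# share an architecture. Evolutionary merging (Sakana) searches over the mixing
# recipe instead of fixing alpha by hand. Model IDs below are placeholders.
import torch
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("org/base-7b", torch_dtype=torch.bfloat16)
donor = AutoModelForCausalLM.from_pretrained("org/finetuned-7b", torch_dtype=torch.bfloat16)

alpha = 0.6                      # hand-picked mixing coefficient for this sketch
donor_sd = donor.state_dict()
merged_sd = {
    name: alpha * tensor + (1.0 - alpha) * donor_sd[name]
    for name, tensor in base.state_dict().items()
}

base.load_state_dict(merged_sd)  # reuse the base architecture with merged weights
base.save_pretrained("merged-7b")
```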

    Open Interpreter announces 01 Light - the linux of AI hardware devices

    Breaking news indeed, after we saw the release of R1 go viral in January, Killian (with whom we chatted previously in our most favorited episode of last year) posted that if someone wants to build the open source version of R1, it'll be super cool and fit with the vision of Open Interpreter very well.

    And then MANY people did (more than 200), and the O1 project got started. Fast forward a few months, and we now have a first glimpse (and the ability to actually pre-order) of the O1 Light, their first device: a button that communicates with your computer (and in the future, with their cloud) and interacts with a local agent that runs code and can learn how to do things with a skill library.

    It's all very very exciting, and to see how this idea goes from an announcement on X, to hundreds of folks collaborating and pushing this to the open has been incredible, and we'll definitely do a deeper dive into capabilities and the whole project once the launch craziness dies down a bit (Killian joined us at the epitome of the launch all-nighter haha)

    This is poised to be the first open source AI device, complete with .stl files for 3D printing at home, chip designs, and the ability to run end to end locally on your Mac, and we really applaud the team for this release 🫡

    Big CO LLMs + APIs

    Nvidia GTC annual conference - New Blackwell platform, NIMs, Robotics and everything AI + a chat with the transformer avengers

    This week Nvidia had their annual GTC conference, where Jensen announced a ton of stuff, but the highlights were the new Blackwell chip (the next iteration of the H100) and the GB200 racks with a whopping 720 PFLOPs of compute (to put this number in perspective: the first DGX that Jensen delivered to OpenAI in 2016 was 0.17 petaflops).

    They also announced partnerships with pretty much everyone under the sun, a new way to deliver packaged AI experiences called NIMs (which we at Weights & Biases support as well), and GR00T, a new foundation model effort for humanoid robotics led by Dr. Jim Fan.

    Jensen also had the whole cast of original Transformers authors together on stage (and in the green room) for an hour, for the first time, to chat about, well... transformers. I really need to find the whole video and post it, because it's hidden inside the Nvidia GTC website, but it was a very fun chat where the team reminisced about the naming and shared their thoughts on the future of LLMs. They also covered each of their individual companies (all of them have left Google since then) and what they all do. It was a great chat.

    Microsoft buys Inflection (almost) and Apple considers buying Gemini

    In other huge AI player news, 2 of the 3 founders of Inflection AI left to start Microsoft AI (together with some of the staff), namely Mustafa Suleyman, who founded Inflection, helped raise $1.8B, got up to 22K H100 GPUs, released Inflection 2.5 that comes close to GPT-4, and then decided to leave. Inflection also pivoted away from consumer (Pi was a very nice AI to chat with) into API services, and apparently Microsoft will pay Inflection around $650 million in the form of a licensing deal.

    Meanwhile there are rumors that Apple is eyeing Gemini to integrate into IOS, which is, very weird given the recent bad press about Gemini (Unless Apple doesn't want to deal with the same bad press?) and it's even weirder given the latest push from Apple into Open Source.

    Folks at Apple this week released a new paper called MM1, outlining a new multimodal model they have trained (but not released), and show that it beats Gemini on visual understanding.

    It was also great to see that the authors of that model shouted out Weights & Biases crew that helped them through their work on this paper👏

    Nolan - the first NeuroNaut (first human with a Neuralink implanted)

    Just as I was summing up the notes for this week, Neuralink pinged that they are going to go live soon, and I tuned in to see a 20yo Paraplegic gamer, getting interviewed by a Neuralink employee, being very cheerful, while also playing a chess game, all with his brain. We went a really long way since the monkey playing Pong, and Nolan was able to describe his experience "it's like using The Force" of using Neuralink to control his mac cursor. It was all kind of mind-blowing, and even though brain implants are nothing new, the fidelity and the wireless connections + the very quick surgery made this demo such a nonchalant thing, that Nolan didn't even stop playing chess while being interviewed, probably not realizing that millions of people would be watching.

    They have a bunch of ML models understanding the signals that Nolan sends from his brain wirelessly. And while this is very exciting, and Nolan prepares for this Halloween as Professor X from X-Men, because, well, he is in fact a telekinesis-enabled human, Elon claimed that their next target is fixing blindness (a product he calls Blindsight, which he said already works on monkeys), presumably via camera input being triggered in the visual cortex.

    Back in November 2022, I watched the Neuralink keynote and geeked out so hard about this section, where Dan Adams, one of the neuroscientists at Neuralink talked about how it's possible to trigger / stimulate the visual cortex to fix blindness and then generate an image.

    Well, this is it folks, we talked about tons of other stuff of course but these are the main points that made the cut into the newsletter, as always, if you want to support this newsletter/podcast, please share it with friends ❤️ Hope to see you in SF in April (I'll be giving more reminders don't worry) and see you here next ThursdAI 🫡

    P.S. - I said Intel a bunch of times when I meant Nvidia; apologies, I didn't notice until after publishing 😅



  • "...Happy birthday dear ThursdAIiiiiiiii, happy birthday to youuuuuu 🎂"

    What a day! Today is π-day (March 14th), 2024. For some reason it's important, not only because it's GPT-4 anniversary, or Claude 1 anniversary, or even that Starship flew to space, but also 🥁 it's ThursdAI BirthdAI 🎉

    Yeah, you heard that right, last year following GPT-4 release, I hopped into a twitter space with a few friends, and started chatting about AI, and while some friends came and went, I never stopped, in fact, I decided to leave my 15 year career in software, and focus on AI, learning publicly, sharing my learnings with as many people as possible and it's been glorious. And so today, I get to celebrate a little 💃

    I also get to reminisce about the state of AI we were at, back exactly a year ago. Context windows were tiny; GPT-4 came out with 8K (we casually now have models with 200K that cost $0.25/1M tokens). GPT-4 also showed unprecedented levels of vision capability back then, and now we have 1.3B parameter models with a similar level of visual understanding. Open source was nascent (in fact, llama.cpp only had its first commit 4 days prior to the GPT-4 launch, and Stanford released the first Alpaca finetune of LLaMA just a day prior).

    Hell even the chatGPT API only came out a few days before, so there was barely any products built with AI out there. Not to mention that folks were only starting to figure out what vector DBs were, what RAG is, how to prompt, and that it's possible to run these things in a loop and create agents!

    Other fields evolved as well, just hit play on this song I generated for ThursdAI with Suno V3 alpha, I can’t stop listening to it and imagining that this was NOT possible even a few months ago

    It's all so crazy and happening so fast, that annual moments like these propose a great opportunity to pause the acceleration for a sec. and contextualize it, and bask in the techno-optimism glory of aren't we lucky to live in these times? I sure am, and for me it's the ThursdAI birthday gift to be able to share my excitement with all of you!

    Thank you for being a subscriber, the best way you can support ThursdAI is to share this with a friend and tag us on socials 🫡

    TL;DR of all topics covered:

    * Open Source LLMs

    * Together releases Sequoia speculative decoding (X, Blog)

    * Hermes Pro from NousResearch - Tool use and function calling (X, HF, Github)

    * Big CO LLMs + APIs

    * Anthropic releases Claude 3 Haiku (Announcement, Blog)

    * Cohere CMD+R (Announcement, HF)

    * This weeks Buzz

    * Early bird tickets for Fully Connected in SF are flying, come meet the Weights & Biases team. We're also going to be running a workshop a day before, come join us! (X)

    * Vision & Video

    * Deepseek VLM 1.3B and 7B (X,Announcement, HF)

    * Voice & Audio

    * Made a song with Suno v3 Alpha for ThursdAI, it's a banger (Song)

    * Hardware & Robotics (New)

    * OpenAI now powers Figure - the humanoid robot company (X)

    * Cerebras announces the fastest AI chip on earth (X)

    * Extropic made an announcement about their TPU - Thermodynamic Processing Unit

    * Tools & Agents

    * Devin from Cognition Labs (Announcement, 47 minute demo)

    Agents for your house and your Github tasks

    Say hello to Devin from Cognition Labs (Announcement, Real world demo)

    By far the most excited I've seen my X feed be this week, was excitement about Cognition Labs new agent called Devin, which they call the first AI software engineer.

    You should really watch the video, and then watch a few other videos, because, well, only a few folks are getting access, and yours truly is not one of them.

    It seems like a very polished launch, backed by tons of VC folks, and everybody kept highlighting the innovative UI that Devin has. It's a very polished UX/UI/dev experience with access to a browser (where you can authenticate and it can pick up doing tasks), a terminal (where you can scroll back and forth in time to see what it did when), plus a chat window, a planning window, and an IDE where it writes code and you can scrub through that as well.

    Folks were also going crazy about the founder (and team) amount of math ability and IOI gold medals, this video went viral featuring Scott the founder of Cognition, in his youth obliterating this competition… poor Victoria 😅

    Regardless of their incredible math abilities, Devin is actually pretty solid, specifically on the UI side. And again, like with the AutoGPT hype of yesteryear, we see the same issues: it's nice, but Cognition's hiring page is still looking for human software engineers. Tune into the last 30 minutes of the pod today, as we had tons of folks discuss the implications of an AI "software engineer" and whether or not coding skills are still required/desired. Short answer is: yes, don't skip it, learn coding. Devin is going to be there to assist but likely will not replace you.

    🤖 OpenAI + Figure give GPT-4 hands (or give figure eyes/ears/mouth)

    Ok this demo you must just see before reading the rest of it, OpenAI announced a partnership with Figure, a humanoid robotics company recently, and just this week they released a demo of this integration.

    Using GPT-4 vision and text to speech capabilities (with a new, somewhat raspy voice and human-like intonations), the bot listens to the human giving it instructions, sees the world in front of it, and is able to perform tasks the human has asked it to do via voice. This feels like a significant jump in capabilities for these bots, and while it was a given that the two technologies (actuator-based robotics and LLMs) would meet soon, this shows the first I, Robot-like moment.

    It'll still be a while until you can have this one do your dishes or fold your laundry, but it does feel like an eventuality at this point, whereas before it just felt like sci-fi. Kudos on this integration, and I can't wait until Optimus from Tesla adds Grok brains and makes you laugh nervously at its cringe jokes 😅

    This weeks Buzz

    We're coming to SF in April, our annual Fully Connected conference will feature keynote speakers from foundational AI companies, industry, our founders and tons of Weights & Biases users. We'll also be running a workshop (I'm one of the workshop folks) a day before, so keep an eye on that, it'll be likely included in your ticket (which is still, 50% off for early bird)

    Open Source LLMs

    Nous Research gives us Tool Use with Hermes 2 Pro (Announcement)

    Getting JSON structured output, and giving models the ability to respond not only with text but with specific instructions for which functions to run (aka tool use), is paramount for developers. OpenAI first released this back in June, and since then I've been waiting for open source to catch up. And catch up they did, with Nous releasing their first attempt at continued training of the renowned Hermes 7B Mistral-based model, with tool use and structured output!

    If you're building agents, or any type of RAG system with additional tools, you will definitely be very happy as well, give Hermes Pro a try!

    This one is not a simple download and run, you have to do some coding, and luckily the folks at Nous provided us with plenty of examples in their Github.
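
    To give a rough idea of the shape of it, here's a hedged sketch of prompting Hermes 2 Pro for a tool call with plain transformers. The <tools>/<tool_call> convention is paraphrased from memory of Nous's examples, so treat their GitHub as the source of truth for the exact system-prompt wording; the weather tool below is just a made-up schema for illustration.

```python
# Hedged sketch of tool use with Hermes 2 Pro via transformers.
# The system-prompt wording below is approximate; see Nous Research's GitHub
# examples for the canonical template. get_weather is a made-up tool schema.
import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "NousResearch/Hermes-2-Pro-Mistral-7B"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

weather_tool = {
    "name": "get_weather",
    "description": "Get the current weather for a city",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

messages = [
    {"role": "system", "content": (
        "You are a function calling AI model. You may call the tools described in "
        f"<tools>{json.dumps([weather_tool])}</tools>. Reply with a JSON object "
        "containing the function name and arguments inside <tool_call></tool_call> tags."
    )},
    {"role": "user", "content": "What's the weather like in Denver right now?"},
]

inputs = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
out = model.generate(inputs, max_new_tokens=128, do_sample=False)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
# Roughly expected: <tool_call>{"name": "get_weather", "arguments": {"city": "Denver"}}</tool_call>
```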

    Deepseek gives us a new Vision model - Deepseek VL 1.3B & 7B (Announcement)

    Absolutely punching above its weight, this very high quality vision model from the DeepSeek folks is just a sign of what's coming: smaller models performing incredibly well on several tasks.

    While the top is getting crowded with Claude, GPT4-V and Gemini which are generic, on specific tasks, we're getting tiny models that can offload fully into memory and run hell fast and perform very well on narrow tasks, even in the browser

    Big CO LLMs + APIs

    Anthropic gives the smallest/fastest/cheapest Claude 3 - Haiku

    After releasing Opus and Sonnet earlier, Anthropic has reclaimed their throne as the leading AI lab we always knew them to be. Many friends of the pod prefer Opus for many things now, and I keep seeing this sentiment online, folks are even considering cancelling chatGPT for the first time since... well ever?

    While Sonnet, their middle model, is taking a significant and interesting place near the top of the LMSys Arena human-rated rankings,

    beating all GPT-4 versions besides the Turbo ones. And now Anthropic has given us Haiku, the smallest of the three Claudes, the fastest, and the cheapest by far.

    With a 200K context window and vision capabilities, this model crushes GPT-3.5 on many benchmarks and becomes the de-facto cheapest capable model to run. It only costs $0.25/1M tokens, which is half the price of GPT-3.5, and just look at the performance. One thing to note: Anthropic still doesn't support function calling/tool use.
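
    If you want to kick the tires on that pricing claim, calling Haiku through Anthropic's Python SDK looks roughly like this. A minimal sketch; the model string was the Haiku identifier at launch, so double-check Anthropic's docs if it has since changed.

```python
# Minimal sketch: Claude 3 Haiku via Anthropic's Python SDK (`pip install anthropic`).
# The model string was the launch identifier; verify against current docs.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

resp = client.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=256,
    messages=[{"role": "user", "content": "Summarize speculative decoding in two sentences."}],
)
print(resp.content[0].text)
```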

    Cohere releases a new model for retrieval and enterprise purposes - CMD+R

    Cohere goes for a second wind with a great release + open weights approach, releasing Command R (pronounced Commander), a model focused on enterprise uses, scalability, and tool use. It supports 10 languages and 128K context, and it beats GPT-3.5 and Gemini 1.0 on several tasks, namely KILT (Knowledge Intensive Language Tasks). The tool use capabilities and the ability to ground information in retrieved context make this specifically a great model to use for RAG purposes.

    The model is 35B parameters and is available under a non-commercial license on the hub.
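
    A hedged sketch of what the grounded-RAG flow looks like with Cohere's Python SDK: the documents/citations fields follow my reading of their chat API at the time, so verify the exact schema against their docs, and the snippets below are made-up stand-ins for your own retrieved passages.

```python
# Hedged sketch of Command R grounded (RAG) chat via Cohere's SDK (`pip install cohere`).
# Field names follow Cohere's chat API as I understand it; check their docs for the
# exact schema. The document snippets are made-up stand-ins for retrieved passages.
import cohere

co = cohere.Client("YOUR_COHERE_API_KEY")  # placeholder key

docs = [
    {"title": "ThursdAI notes", "snippet": "Command R supports 10 languages and a 128K context window."},
    {"title": "Release blog", "snippet": "Command R is aimed at RAG and tool-use workloads for enterprises."},
]

resp = co.chat(
    model="command-r",
    message="What context window does Command R have, and what is it aimed at?",
    documents=docs,  # the model grounds its answer in these and returns citations
)
print(resp.text)
print(resp.citations)  # spans of the answer linked back to the supplied documents
```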

    Together makes inference go BRRR with Sequoia, a new speculative decoding method

    Together Sequoia shows a way to speed up Llama2-70B and be able to run this on a single consumer GPU with 8x speed up.

    Being able to run AI locally can mean a few things, it can mean, make smaller models better, and we've seen this again and again for the past year.

    Another way is... speculative decoding.

    It lowers the inference TBT (time between tokens) by enhancing the decoding algorithm, using tiny draft models and methods like offloading. The large model essentially remains the same, while a smaller (draft) model helps guide the inference and makes it seem much faster. These methods compound, and while Sequoia from Together is new, it shows great promise, speeding up Llama2-70B inference 8x on consumer hardware and up to 3-4x on dedicated hardware.

    The compounding of these methods is the most exciting part to me, given that they will likely apply broadly (for now Sequoia only supports LLaMa) once a new model / architecture comes out.
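
    Sequoia ships its own code (and for now only supports LLaMA); as a general, hedged illustration of the draft-model idea rather than Sequoia itself, Hugging Face transformers exposes "assisted generation", where a small model proposes tokens and the big model verifies them. The Llama 2 pair below is just an example of a large/small combo that shares a tokenizer (both are gated on the Hub).

```python
# General illustration of speculative decoding via Hugging Face "assisted generation".
# Not Sequoia's implementation; the model pair is just an example of big + small draft.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-hf")
big = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf", torch_dtype=torch.float16, device_map="auto"
)
draft = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16, device_map="auto"
)

inputs = tok("Speculative decoding works by", return_tensors="pt").to(big.device)
# The draft model proposes several tokens per step; the big model verifies them,
# so quality matches the big model while decoding runs faster.
out = big.generate(**inputs, assistant_model=draft, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```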

    Show notes:

    * Swyx AI news newsletter got a shoutout from Andrej Karpathy

    * Anthropic metaprompt cookbook from Alex Albert

    * Folks who participated in the AI Agent discussion, Nisten, Roie Cohen, Junaid Dawud, Anton Osika, Rohan Pandey, Ryan Carson

    Thank you for being a subscriber, and for sharing this journey with me, I hope you enjoy both the newsletter format and the podcast 🫡

    See you here next week 🎂 I’m going to eat a piece of cake

    Full transcript :

    [00:00:00] Alex Volkov: Hey, you are on ThursdAI, March 14th, 2024, AKA Pi Day, AKA ThursdAI BirthdAI. I'm sorry for the pun; I promise I'm gonna keep it contained as much as I can. My name is Alex Volkov, I'm an AI evangelist with Weights & Biases. Today on the show: a birthday celebration for the ThursdAI Twitter spaces.

    [00:00:31] Alex Volkov: That's right. I started recording these exactly a year ago on GPT 4's announcement day, March 14th, 2023. In addition, everything important that happened in the world of AI for the past week that sometimes feels like a year. Including open source LLMs, big companies and their APIs, hardware and robotics for the first time, agents, And more.

    [00:00:59] Alex Volkov: We've talked about a lot of stuff. But first, as always, a recap of everything we discussed as I recorded it at the end of the show while everything was fresh in my mind after this little birthday song that AI created for us.

    [00:01:12]

    [00:02:39] Alex Volkov: that this is AI generated? Maybe at the end there it went a little bit off, but holy cow, this is, I really listened to this birthday celebration multiple times after I created it with Suno V3 Alpha. So get ready for AI music everywhere. And now, the recap of everything we talked about for this week.

    [00:03:02] Alex Volkov: But definitely, Stick around and listen to the end of the show. And as always, you will have chapters on every podcast platform that you use, especially Apple Podcasts.

    [00:03:13] Alex Volkov: And if you do use Apple Podcasts, why not? Give us a thumbs up and like a five star review. That really helps. That's how people discover us, believe it or not. Here's a recap of everything we talked about. And following that, a beautiful in depth conversation with many folks who shared this journey with me and been, in one way or another, the reason I kept going this year for ThursdAI.

    [00:03:36] TL;DR - everything we talked about in 10 minutes

    [00:03:36] Alex Volkov: Everyone, here's a recap of everything we've talked about on ThursdAI's anniversary Twitter space, March 14th, 2024, which is also Pi Day, which is also the anniversary of GPT-4 and the anniversary of Claude 1. We spoke about ThursdAI history, we spoke about how we got here, how it's now a podcast.

    [00:03:56] Alex Volkov: And in open source, we had Together AI release something called Sequoia, for speculative decoding. Speculative decoding is not new, but their approach to it, called Sequoia, is new. It is able to optimize inference for something like a Llama 70 billion parameter model on consumer hardware, up to 8 to 9x faster, by predicting a tree of next tokens and letting the model select between them.

    [00:04:20] Alex Volkov: Speculative decoding is an additive technique to improve speed of inference of models. On top of models getting smaller and better, the bigger models are going to get faster on local hardware as well due to something like speculative decoding. It's very exciting to see. TogetherAI also announced like an extension of the round and now they're a unicorn and Definitely doing incredible things.

    [00:04:40] Alex Volkov: We also covered, in open source, that our friends at Nous Research released

    [00:04:44] Alex Volkov: Hermes Pro. If you followed us at any point before, you know that Hermes is one of the top finetunes for Mistral 7 billion parameters. This is the Pro version, built on Mistral 7 billion on top of the Hermes dataset. The Hermes dataset, by the way, is open, and you can go download and use it. This version, the Pro version, is specifically focused on tool use and function calling.

    [00:05:07] Alex Volkov: And we also covered what tool use is from the perspective of developers who build RAG apps, for example, or need structured output. This new version supports JSON mode and JSON output, which is a very welcome addition to the world of open source.

    [00:05:19] Alex Volkov: It has OpenAI endpoint compatibility, and it's hosted on Fireworks, so you can actually try it out and just swap the OpenAI endpoint with that endpoint and see if your tasks are working with Hermes as well.

    [00:05:31] Alex Volkov: On the border between open source LLMs and big company LLMs, we then moved to a conversation about Cohere. Cohere is a company that was co-founded by one of the authors of the original Transformers paper, Aidan Gomez, and some other folks. An incredible company, mostly focused on enterprise and use cases around RAG, retrieval augmented generation.

    [00:05:50] Alex Volkov: Cohere had a bunch of models, Reranker and embedding models. And now they released something called Command R. And by release, I mean they released it via API, but they also dropped it on Hugging Face under an open weights, non-commercial license. So you'd be able to actually run and use this locally, but you cannot use it commercially yet.

    [00:06:06] Alex Volkov: For that, they offer their API and their API is definitely there. It performs very well on RAG application, outperforms other scalable models. So Outperforms, like even Mixtral and Lama70B, they're not comparing themselves to GPT 4 because this model the command R model is definitely focused on enterprise and use cases.

    [00:06:25] Alex Volkov: It works very well in tandem with their Cohere embedding and rerank models as well, and it's focused on tool use. Like we said previously, Teknium just added tool use to open source; these folks are focused on tool use and external tools as well. And the Cohere API has a bunch of external tools that you can plug into this one, like web search, stock prices, and a bunch of other things.

    [00:06:45] Alex Volkov: Optimized for 10 major languages, which is usually way more than other open models, and trained on 13 more, and has 128k context window.

    [00:06:55] Alex Volkov: And in the same area of smaller models, we finally got the small model answer from Anthropic, the folks that just released Claude 3.

    [00:07:06] Alex Volkov: Anthropic released the smallest, most performant-for-its-size version of Claude 3, called Haiku. They call it their fastest, most affordable model yet for enterprise applications.

    [00:07:15] Alex Volkov: Claude 3 Haiku is 25 cents per million input tokens, where GPT-3.5, which is considered the cheapest and most performant one so far, is half a dollar per million tokens. So it's half the price of GPT-3.5. However, it significantly outperforms GPT-3.5 on every metric they've shared, including HumanEval, where it scores 75 percent versus GPT-3.5's 48%.

    [00:07:39] Alex Volkov: The MMLU score is 75. And the kicker here is the 200K context window, like the bigger Claude Opus and Claude Sonnet. So Haiku has a 200K context window. Imagine a model that is only 25 cents per million input tokens and also has a 200K context window. And it's available via the API, obviously, or on Amazon and Google Cloud as well. And it's vision enabled, so you can actually send images. And we geeked out about how, a year ago when we started ThursdAI, one of the reasons we came to the space was that we were blown away by GPT-4's vision capabilities.

    [00:08:14] Alex Volkov: And now we're getting I'm not gonna say that Haiku is anywhere close to GPT 4 vision [00:08:20] wise, but it's From what I've tested very decent, given the price point, it's incredibly decent. Then I covered that in the Weights and Biases area we're coming to San Francisco in April 18th is our fully connected conference with many big clients of ours coming, foundational model creators, et cetera, coming to speak on the stage.

    [00:08:40] Alex Volkov: And we're also going to do a workshop, , a day before. So April 17th, if you're interested in this, please write to me, I'll definitely. tell you when that's up. The tickets are early bird and you're more than welcome to join us in San Francisco. We will be very happy to see you.

    [00:08:53] Alex Volkov: If you came from ThursdAI, come and give me a high five. I would love to, to, show my boss that this is actually pulling some folks. But also we covered continued things in ThursdAI around vision and video. So skipping from Weights and Biases stuff, we covered vision and video.

    [00:09:06] Alex Volkov: We covered that DeepSeek released DeepSeek VLM, which is a tiny vision model. So again, in the realm of multimodality this year, we're now getting 1.3 billion and 7 billion parameter models that on some tasks come close to GPT-4. It's quite incredible. So DeepSeek, the folks who released DeepSeek Coder before and a very impressive lineup of models, open sourced VLM at 1.3 billion and 7 billion.

    [00:09:30] Alex Volkov: They're incredibly impressive on benchmarks, and the 1.3 billion parameter model is so tiny you can basically run it offloaded on your CPU. And in that vein, we also briefly covered that Transformers.js, from our friend Xenova, is very soon going to support WebGPU.

    [00:09:47] Alex Volkov: WebGPU is the ability to run these models in your browser in your JavaScript environment on the GPU of your machine, either that's a Mac or a PC. And that's now landed fully in all major browsers right now.

    [00:10:00] Alex Volkov: The song that you heard at the beginning of this was made with Suno v3 alpha, and I did it specifically for ThursdAI. And I'm very impressed that a year after we started all this, we're now getting songs that sound like somebody actually went into the studio and sang them. We then mentioned that in the AI art and diffusion corner, we still don't have Stable Diffusion 3.

    [00:10:20] Alex Volkov: We also had another corner today, which is a hardware and robotics corner. And we've covered several very exciting things.

    [00:10:28] Alex Volkov: We've covered that Cerebras announced the fastest AI chip on Earth, with 4 trillion transistors and 900,000 AI cores, able to train models of up to 24 trillion parameters. I don't use the word trillion a lot here, but it's able to train 24-trillion-parameter models on a single device. This sounds incredible, and once they put it in production, I think it's going to be a significant boost to the AI scene.

    [00:10:52] Alex Volkov: We also covered Extropic, the folks that came from Google X, the secret lab, and who are behind the e/acc movement. That's their company, and they're building a TPU, a Thermodynamic Processing Unit. It's a little complex, but basically they want to build a natural, physical embodiment of probabilistic learning, and they want to be considered the transistor of the AI era.

    [00:11:17] Alex Volkov: And if you want to hear more about this, they have a full space Q&A that we'll link in the comments below. And so we covered Cerebras, we covered Extropic in the hardware corner, and then we talked about how Figure, the humanoid robot company we covered before, announced a partnership with OpenAI, and this week they released a demo video that's unedited, so end to end, recorded at 1x speed, of this Figure humanoid robot standing in something that looks like a fake kitchen and basically talking to the human in front of it using OpenAI's text-to-speech technology and vision.

    [00:11:52] Alex Volkov: So it actually understands what it sees based on GPT-4 vision, probably a custom version of GPT-4 vision, and is then able to do some stuff. If you haven't seen this video, I'm going to put it in the show notes on thursdai.news. Please feel free to subscribe. The video is mind blowing,

    [00:12:07] Alex Volkov: but just the fact that the robot can see, talk about what it sees, and then perform tasks embodied in the real world, I think, is a great way to see the future happening right now on Pi Day 2024. And I think this is most of the conversation that we covered from the news perspective, besides this one last thing, where we covered that Cognition Labs released a video and actually started letting folks in to something they call Devin, the first fully autonomous AI software engineer.

    [00:12:35] Alex Volkov: That's the tagline. And obviously, those of us who covered this remember the AutoGPT hype from last year, and we remember multiple different agentic frameworks since then. Devin seems like it took that to the next level, not only from the perspective of being able to execute long tasks, but also from the ability of the UI to show you what it does while being autonomous alongside you as a software engineer.

    [00:12:59] Alex Volkov: Devin actually has access to a full environment, probably with GPUs as well. It has access to a browser that you can log into your stuff with, and then Devin can, on your behalf, use the browser and go and search for some stuff.

    [00:13:10] Alex Volkov: And we had one hell of a discussion following the Devin news, and I think it was started by Nisten saying, hey folks, you have nothing to fear, still learn to code. This news, again, stoked fears of folks saying, hey, should I even learn to code given these advancements? And we had a great discussion about coding, about AI taking over coders, replacing or not replacing them, and positivity in the age of AI.

    [00:13:34] Alex Volkov: And this discussion, I really suggest you listen, stick to the end of the podcast, if you're listening on the podcast, and listen to the whole discussion, because I think it was a great discussion.

    [00:13:43] Alex Volkov: Hey everyone. My name is Alex Volkov. I'm the host of ThursdAI for the past year, which I can now say proudly, and I just want to welcome you, yet again, to another Thursday. Today's a big day, not only because we're celebrating, but also because some of us woke up early to see the largest man made object ever to break through the atmosphere and go to space, which was incredible.

    [00:14:24] Alex Volkov: Very tech-optimist-like, but also, today is the anniversary of multiple things, and I think ThursdAI is just one of them. So we're gonna actually talk about this real quick. And I just want to say that I'm very happy to still be here a year after, with many people who joined from week to week, from month to month, and with the friendships that were shaped in the ThursdAI community.

    [00:14:49] Alex Volkov: And I just want to say I'm very happy that Swyx is here. Swyx was on the actual first ThursdAI episode a year ago. We jumped in to discuss GPT-4, and I think we were blown away by the vision stuff. So welcome, Swyx. How are you? Thanks for waking up early for this.

    [00:15:04] Swyx: Hey morning. Yeah, it's a big day. The year has felt like 10 years, but it it's definitely a big day to celebrate.

    [00:15:10] Alex Volkov: Absolutely. So thanks for joining us. Swyx, for folks who don't follow for some reason, definitely give Swyx a follow: the host of Latent Space and the founder of Smol AI. And recently he's being followed by Space Daddy as well. And I want to say also

    [00:15:24] Swyx: Space Daddy!

    [00:15:25] Alex Volkov: And I want to also say hi to Nisten, who's been maybe the most consistent co-host.

    [00:15:30] Alex Volkov: Nisten, welcome, joining us all the way from cold Canada, I think after visiting the doctor. How are you, Nisten?

    [00:15:38] Nisten: I'm good. I'm good. It's good. I missed one, I

    [00:15:42] Alex Volkov: Yeah. . Yes.

    [00:15:43] Nisten: was about it. I thought I was gonna miss the day, and I was upset, but no, I

    [00:15:48] Alex Volkov: I have a question for you. Was the doctor that you visited a human doctor or an AI doctor?

    [00:15:53] Nisten: Yeah, he was human. He hadn't seen me in five years, so I was showing him all this stuff about medicine and the AI. It's funny.

    [00:16:00] Alex Volkov: And I also wanna acknowledge Farouk, or Far El as we call him. Far El, how are you?

    [00:16:07] Nisten: Hey, what's up?

    [00:16:09] Alex Volkov: Welcome, welcome to the ThursdAI celebration. Far El is leading the Skunkworks crew and has been doing different incredible things in open source. He's a very staunch proponent of open source here on the ThursdAI stage: if anything gets released and it doesn't get released with the source, Far El will have words to say about it.

    [00:16:25] Alex Volkov: So we're going to cover open source today as well. I also want to acknowledge LDJ. Yesterday I wrote the whole thread and acknowledged many people, and I didn't tag my good friend Luigi. So LDJ, apologies for that. Welcome, brother. How are you doing, all the way from Florida?[00:16:40]

    [00:16:41] LDJ: Yeah, I'm doing good, thanks. I've been late to a lot of the Thursday AIs past few months, but yeah, it's been good coming on and glad I was able to make it on time for this one.

    [00:16:51] Alex Volkov: Yeah, welcome. Welcome. And I also want to acknowledge Roei. Roei is a developer advocate at Pinecone, and he has been participating in many spaces. We had a lot of conversations about RAG versus long context, and I remember those well, a lot of late-night conversations as well. Welcome, Roei.

    [00:17:06] Alex Volkov: How are you?

    [00:17:08] Roei Cohen: How's it going, everybody? Congrats, Alex, on this awesome anniversary. Yeah,

    [00:17:16] Alex Volkov: there's a bunch of folks I see in the audience who are here from week to week, and it's so great to see the community shape up, and I really couldn't be prouder to be able to just talk about AI with friends and actually make a living out of this.

    [00:17:29] Alex Volkov: I would be remiss if I didn't acknowledge that today's anniversary is of the Spaces. We started talking about AI in Twitter Spaces, back then Twitter Spaces, now X Spaces, exactly a year ago, on Pi Day 2023. The reason we started talking about AI is because GPT-4 was announced and Greg Brockman gave the incredible demo where he took a screenshot of a Discord.

    [00:17:52] Alex Volkov: So if you remember this, the Discord, the famous Discord where we went to hunt for the Discord snapshot. Mhm.

    [00:18:00] Swyx: a screenshot of the, I think the OpenAI Discord and it just transcribed every word in there and described every, like the position of every icon and like the framing of it. It was just like the best vision model we'd ever seen by like by a lot.

    [00:18:14] Alex Volkov: By a significant margin, and it understood different active states, etc. And to get to a point now where we're basically having open source models (we're going to talk about CogVLM today, and about DeepSeek, which released a new vision model today) where we can basically recreate this with a tiny model that runs completely offloaded, it's crazy.

    [00:18:36] Alex Volkov: Back then, no vision existed. So we got into a space, started geeking out about this, and then we kept going. So this is the anniversary of the Twitter Spaces; the actual podcast, the ThursdAI podcast that I created and encourage you to subscribe to, didn't start until about four or five months afterwards.

    [00:18:51] Alex Volkov: After we did this, the community started shaping up, people started coming in, and actual guests started to arrive. So I see a few guests that became friends of the pod. If you guys see Junyang here in the audience, he's on the technical team at Qwen, and there's a great conversation that we had about Qwen and their models as well.

    [00:19:10] Alex Volkov: We have a bunch of folks like this who, from time to time, just join and talk about the stuff they built. And I think the best thing that I get from ThursdAI is definitely the ability to talk with folks who are experts in their fields. And definitely, I'm not an expert in many of the things we cover.

    [00:19:25] Alex Volkov: And it's great to have folks from vision and from foundational model training and from open source. We've had a bunch of conversations with Nous Research folks, and we're going to cover a few of those today as well, and it has been incredible so far. And so the birthday of the actual podcast, once we started recording and sending a newsletter, is coming up

    [00:19:44] Alex Volkov: in June. Meanwhile, if you want to support the space, if you're here and you're like, oh, this is great, I learned so much, you're more than welcome to just interact with us. On the bottom right, there's a little message icon. You're more than welcome to send replies there, boost the signal a little bit, and retweet the space link.

    [00:20:02] Alex Volkov: And so I think with this... oh no, a year ago, another thing happened, and it went under the radar because GPT-4 took over all the airwaves: Claude 1 was released exactly a year ago as well. Happy anniversary to the Claude team. They've been killing it lately; the past few weeks have been Anthropic weeks for sure.

    [00:20:20] Alex Volkov: And definitely folks are looking at Claude now and considering cancelling their ChatGPT subscription. So that's been great to see. And so a year ago, there was Claude 1, and it was quickly hidden by the news. I also want to shout out that a year ago, open source was almost non-existent.

    [00:20:36] Alex Volkov: So a year and four days ago, llama.cpp was first released. Georgi Gerganov released llama.cpp, a way to run the LLaMA model, which had been released a month before that, on just your local hardware. And nobody necessarily knew about this until a few days later. Vicuna was also just released.

    [00:20:56] Alex Volkov: So if you guys remember Vicuna, that was a thing. All of these things happened in that week. And it feels like this week, or at least the last few weeks, we have similar insanity weeks. Don't you guys think? Especially with Opus and the rumors about GPT-4.

    [00:21:11] Alex Volkov: Do you guys remember anything else from that last week before we started like talking about this week?

    [00:21:15] Far El: It's hard to remember what happened last week because this week felt like a century alone. That's that, that's the thing. Like we, we've

    [00:21:22] Nisten: had so much just in the last week that I don't even remember what happened.

    [00:21:25] Alex Volkov: Absolutely. That's why we write things down. And honestly, I think, Swyx, we talked about this: now that every ThursdAI is recapped, and you have AI News, a daily newsletter that covers everything, this is just for the historical record. It's very important to be able to go a year back and see where we were.

    [00:21:41] Alex Volkov: Because it's really hard to remember even last week, not to mention the last year. So I think it's very important. I do want to shout out: do you still call this Smol Talk, or is it AI News?

    [00:21:50] Far El: It's just AI News. I'm reserving Smol Talk for the other products that I'm working on,

    [00:21:55] Alex Volkov: I see.

    [00:21:56] Far El: yeah. Yeah. AI news's,

    [00:21:57] Alex Volkov: so talk to us about the AI news just briefly for folks who are not familiar with that specific newsletter.

    [00:22:02] Swyx: Man, this week was f*****g, it was crazy. Around December I was very overwhelmed by all the AI Discords, and I knew that all the alpha being dropped in Discords is no longer on Twitter, so I started making this bot to scrape Discords. It was mostly just serving myself, and then I shared it with some friends and it grew to a couple hundred people, but one of them was Soumith Chintala from the Meta team; he was the creator of PyTorch and still runs PyTorch.

    [00:22:31] Swyx: And last week he shouted it out, saying it was the highest leverage 45 minutes every day that he spends reading this thing. Which was a freaking huge endorsement from someone like him. So I didn't even know he

    [00:22:43] Alex Volkov: from from the guy who runs PyTorch. It's crazy. And of

    [00:22:49] Swyx: so I, yeah, I didn't even know he was subscribed. Honestly, I don't even look at the subscriber list. I think it's really good for mental health to just do your thing, right? Don't even look at who's on the list. And then two days ago, Andrej also, just unsolicited, completely no notice, no warning, said, oh yeah, I've been reading this thing for a while.

    [00:23:06] Swyx: And I was like, what? And then I went back and looked through the emails, and his email's not there, his first name's not there. I eventually found his email, but yeah, it was just a shock that he was also getting utility out of it. And yeah, so far I think like 12,000 to 13,000 people signed up in the past couple of days, and we'll see where this goes. I think a newsletter is not the final form, and also people have legitimate concerns around how much is comfortable being scraped from Discord, what is the sort of privacy expectation on a public Discord that anyone can join, right?

    [00:23:39] Swyx: So I'm taking some steps to basically protect people it's purely meant for utility, not for snooping on people's conversations. But I do think like there should be a new sort of Hacker News of AI, quote unquote, that pulls together, Local Llama, Twitter, Discord, YouTube, podcasts, whatever.

    [00:23:55] Swyx: And yeah, I think that's what I'm making AI News go towards.

    [00:24:02] Alex Volkov: is excited about, Elon is excited about as well. So Elon now is a follower of Latentspace, which is a big moment. I wanted to ask

    [00:24:08] Swyx: Yeah, we're trying to, yeah, let's

    [00:24:09] Alex Volkov: Local Llama, by the way? Is Local Llama part of the source as well

    [00:24:13] Swyx: we the engineer that I'm working with is working on this. So not yet, but we are working on it. And

    [00:24:19] Alex Volkov: Alright folks, so if you want not only high signal but the full firehose of information from Discord and a Twitter list (I think you have a high-signal Twitter list in there as well), definitely subscribe to AI News, previously Smol Talk. The titans of the industry now follow this and get insight from it, so you should as well.

    [00:24:40] Alex Volkov: But yeah. If that's too much for you, we're here every week to cover pretty much the very most important things.

    [00:24:46] Open source - Function Calling model from NousResearch - Hermes Pro

    [00:24:46] Alex Volkov: And so I think it's time for us to start with Open Source.[00:25:00]

    [00:25:09] Alex Volkov: Alright folks, so let's cover some open source stuff. I think the first thing we have to mention is that our friends from Nous Research announced a new model today, or I guess yesterday night. It's called Hermes Pro. I'm not really sure what Pro means here, so we'll have to ask some folks from Nous Research, but they announced the continued training of their Mistral model, their flagship model, that is fine-tuned for tool use and function calling.

    [00:25:40] Alex Volkov: And tool use and function calling are maybe, should I say, synonyms of each other at this point. I think it started with function calling from OpenAI, which was released in June last year. They gave us function calling in response to all of us wanting JSON output, and since then, function calling became something called tool use.

    [00:25:59] Alex Volkov: Basically, it's the ability of these models to not only give you the next word or autocomplete, but also this: you can provide schemas for some of your functions, and the model will say, hey, I actually want more information on this topic or that topic, and here is what tool you should use.

    [00:26:20] Alex Volkov: And you as a developer would get that response. You would go call this tool, then pass the data from this tool back into the model, and the model will use its context and the user's request together to come up with an answer. So think about stock price, right? Stock price is something that changes often.

    [00:26:37] Alex Volkov: You cannot train the model on stock prices because they change very often. So one example of a tool could be: go check the stock market, or go check the Bitcoin price, et cetera. And Mistral is not able to do this; it's very obvious if you ask a Mistral 7B, hey, what's the price of Bitcoin?

    [00:26:55] Alex Volkov: It will give you something, and that something will be 100 percent wrong, a hallucination. A model with tool use would be able to decide, if a developer provided the model with tools in advance, like price of Bitcoin, price of a stock, et cetera, that instead of hallucinating the answer, it should actually return a reply to the developer and say, hey, go get me this information and then I'll be able to answer the user, right?

    [00:27:20] Alex Volkov: So this is what tool use and function calling basically are. And we haven't had a lot of that in open source. We had a little bit; we've talked about the tool use leaderboard from the folks at Gorilla, I believe from Berkeley. And now Nous Research released a continued training of their 7B model, called Hermes Pro, with the same general capabilities.

    [00:27:39] Alex Volkov: That's also very important, right? When you keep training a model, you don't want something called catastrophic forgetting; you want the model to perform the same, plus additional things as well. And now it's trained on new data with tool use plus JSON mode as well. So not only do we get the ability of the model to reply back and say, hey, you should use this function,

    [00:28:00] Alex Volkov: we also get JSON mode. It supports custom Pydantic schemas. Pydantic, for folks who don't write Python, is a way to define objects in Python in a very clear way. And when you use this and give the model the schema for your tool, the model then knows what parameters to call your functions with.

    [00:28:18] Alex Volkov: So your job as a developer is basically just to take this call and forward it to any API that you want. It's available on the Hub, and it's announced with OpenAI endpoint compatibility, which is great. I don't think we've seen this from Hermes directly so far; everybody who served Nous models gave us OpenAI compatibility, but we definitely know that the industry is coalescing around the same format, which is the OpenAI endpoint, where you can just swap the URL to OpenRouter or Fireworks or whatever.
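
    To make the loop concrete, here's a minimal sketch of the developer-side tool-use flow described above, assuming an OpenAI-compatible server hosting Hermes Pro. The base URL, the model ID, and the get_crypto_price helper are placeholders (not from the show), and the exact chat/tool template the model expects may differ, so check the model card:

```python
# A sketch of the developer-side tool-use loop described above, against an
# OpenAI-compatible endpoint. The base URL, model ID, and get_crypto_price
# helper are placeholders.
import json

from openai import OpenAI
from pydantic import BaseModel, Field


class GetCryptoPrice(BaseModel):
    """Fetch the current spot price of a cryptocurrency."""
    symbol: str = Field(..., description="Ticker symbol, e.g. BTC")
    currency: str = Field("USD", description="Quote currency")


def get_crypto_price(symbol: str, currency: str = "USD") -> dict:
    # Placeholder: a real app would call an exchange API here.
    return {"symbol": symbol, "currency": currency, "price": 71_234.56}


client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")  # placeholder server
model_id = "NousResearch/Hermes-2-Pro-Mistral-7B"  # assumed model name on the server

tools = [{
    "type": "function",
    "function": {
        "name": "get_crypto_price",
        "description": GetCryptoPrice.__doc__,
        "parameters": GetCryptoPrice.model_json_schema(),  # Pydantic schema -> JSON schema
    },
}]

messages = [{"role": "user", "content": "What's the price of Bitcoin right now?"}]
first = client.chat.completions.create(model=model_id, messages=messages, tools=tools)

# For the sketch we assume the model chose to call the tool instead of answering directly.
call = first.choices[0].message.tool_calls[0]
result = get_crypto_price(**json.loads(call.function.arguments))

messages.append(first.choices[0].message)  # keep the assistant's tool request in the history
messages.append({"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)})

final = client.chat.completions.create(model=model_id, messages=messages, tools=tools)
print(final.choices[0].message.content)  # answer grounded in the tool output
```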

    [00:28:49] Alex Volkov: I think the chat from Mistral as well is supporting OpenAI compatibility. Great to see that we're getting open source models for tool use because it's very important for agents and it's very important for basically building, building on top of these LLMs. LDJ, I saw you wave your hand a little bit.

    [00:29:07] Alex Volkov: Did you have a chance to look at Hermes Pro in tool use? And what are your general thoughts about open source tool use? Hey,

    [00:29:16] LDJ: It's pretty much Hermes, but it also has much improved JSON and function calling abilities and things like that. And I was just waving my hand to describe that, but then you pretty much described it already. So I put my hand back down.

    [00:29:29] LDJ: but

    [00:29:30] LDJ: Yeah, you got a good description of it.

    [00:29:32] LDJ: And I think that pretty much summarizes it.

    [00:29:34] Alex Volkov: this is the anniversary of this ThursdAI, Birthday AI. So I did my homework this time. Usually sometimes these things get released like super fast and we actually don't have time to prepare. Comments on general availability of function calling and tool use from the stage before we move on?

    [00:29:48] Alex Volkov: Anything that you guys want to shout out specifically that's interesting here?

    [00:29:50] Nisten: It's probably the most commercial used part, I think, because every person that's using a 7b, they want a really fast model, and usually they want some kind of JSON returned for commercial uses. There are chat uses as well, but I think like the majority of, I don't have any data on this, I'm just guessing that probably the majority of the use is to return JSON.

    [00:30:15] Alex Volkov: Yeah. And then there are tools like Instructor from Jason Liu, built on Pydantic, that we've talked about, which help you extract structured data from some of these models. Those tools require function calling and Pydantic support as well. So this definitely supports more enterprise-y use cases.

    [00:30:29] Alex Volkov: Maybe that's why Teknium decided to call this one Hermes Pro.

    [00:30:32] Together.ai new speculative decoding Sequoia improves AI inference by 9x

    [00:30:32] Alex Volkov: Moving on to Together and Sequoia. Together released something called Sequoia, which is speculative decoding. I actually wrote down an explanation of what speculative decoding is, and I'm going to try to run through it. For folks who are not familiar with speculative decoding: if you think about how we get bigger and better AI to run locally on our machines, one way is open source and smaller models getting better, right?

    [00:30:58] Alex Volkov: So that's definitely something we've seen over the past year. We got Llama 70B and then we got 13B, and then different finetunes and different other foundational models that started beating LLaMA 70B, definitely LLaMA 1, and now even Llama 2 is getting beaten by tinier models. So the progress of throwing more compute and more techniques is shrinking these models down to where we're able to run them locally, just because our hardware is, let's say, limited.

    [00:31:23] Alex Volkov: That's one way we get local open source models: they just keep improving and keep getting trained. Another way is that we're able to serve these bigger, larger models, like 70B models, on consumer GPUs, but then it's super slow, so you wait one or two minutes between each token prediction, each word that you see.

    [00:31:44] Alex Volkov: So one additional way, on top of getting smaller models that are faster and smarter, is improving inference. We saw a bunch of attempts this year, from folks like Modular releasing their MAX inference engine, and improvements in different places like FlashAttention and different inference engines as well.

    [00:32:03] Alex Volkov: So we saw all of this and one such way that adds to all of this is called speculative decoding, which improves the inference speed, just inference speed. It basically tries to predict a few next tokens instead of just one, using a smaller model. And the key idea is to construct a tree of speculated future tokens for every potential token in the model's output.

    [00:32:26] Alex Volkov: Sometimes they use, and I think at least llama.cpp supports speculative decoding, sometimes they use a smaller model. For example, for Llama, they could use a small Llama to help predict the tokens, and then the larger Llama to actually select them. And Together, the folks who released a few things that we've covered before (the people behind the Mamba architecture and the Hyena architecture we've talked about previously, and the FlashAttention chief, work there), now released their own take on speculative decoding,

    [00:32:56] Alex Volkov: which they claim lets you run something up to a 70 billion parameter Llama 2 on a consumer GPU like an RTX 4090. And they improve your ability to run this incredibly large model by almost 9x. On non-consumer GPUs like A100s, they also go up to 4x faster.

    [00:33:17] Alex Volkov: Basically, by just predicting with a [00:33:20] smaller model, building a tree of possible tokens, and then having the larger model actually select and verify based on those. They have a bunch of other things in there, like offloading, and very interesting things, but I just want to say that speculative decoding is a field that is entirely,

    [00:33:39] Alex Volkov: how should I say, additive to the rest of the fields, right? They only support Llama as far as I saw, but if speculative decoding helps run Llama 70B 9x faster, it's probably going to work on smaller models as well. So it's really incredible to see how many different speed improvements we're getting across the board.
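
    Sequoia ships its own code, but the core idea (a small draft model proposes tokens and the big model verifies them) can be tried today with the assisted-generation feature in Hugging Face transformers, a simpler, non-tree cousin of what Together describe. A minimal sketch; the model IDs are just examples and assume you have both checkpoints and enough VRAM locally:

```python
# Assisted generation: a small "draft" model guesses several tokens ahead and the
# large model accepts or rejects them in one forward pass, which speeds up decoding
# without changing the output distribution. Model IDs below are example placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

target_id = "meta-llama/Llama-2-7b-chat-hf"          # the model whose quality you want
draft_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"      # small drafter sharing the same tokenizer

tokenizer = AutoTokenizer.from_pretrained(target_id)
target = AutoModelForCausalLM.from_pretrained(target_id, torch_dtype=torch.float16, device_map="auto")
draft = AutoModelForCausalLM.from_pretrained(draft_id, torch_dtype=torch.float16, device_map="auto")

inputs = tokenizer("Explain speculative decoding in one paragraph.", return_tensors="pt").to(target.device)

out = target.generate(**inputs, assistant_model=draft, max_new_tokens=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```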

    [00:33:58] Alex Volkov: And definitely for the stuff that we all love to talk about, which is open source models running locally, running faster, this is incredible. Yeah, LDJ, go ahead.

    [00:34:09] Nisten: Yeah, I just wanted to direct people to I put, I pinned on the billboard a video that TogetherAI put out actually showing side by side Sequoia versus not Sequoia, and yeah, it's pretty insane the amount of speed up you're able to get.

    [00:34:21] Alex Volkov: The amount of speed up on the same hardware and on the same model. So the model didn't improve, the hardware didn't improve. All they improved is the ability to to help the model predict next tokens and spit them out, which is, I agree with you, it's insane. And just, um, multiple improvements across the board are going to get us where basically we want to go, which is these type of models, these sizes of models running super fast on, on local hardware.

    [00:34:44] Alex Volkov: They released it on GitHub, so folks can try it. It only works for Llama; it doesn't work for other big models yet. Definitely, we'll see. I will just say that the thing I'm most excited about is that all these techniques are, one, additive, and two, they're there for the next big model that gets released to just support them.

    [00:35:00] Alex Volkov: So when Llama 3 eventually releases, and we know it will release, this speculative decoding will start working, and llama.cpp will already be there. We saw the community efforts to support everything just kick into gear when Gemma was released.

    [00:35:18] Alex Volkov: I'm just very excited that we have all these techniques to throw at the next big open source model. And just the concept of running a 70 billion parameter model locally is very exciting. Last week we covered something from Jeremy Howard, Johno Whitaker, and Tim Dettmers, the folks at Answer.

    [00:35:36] Alex Volkov: AI, who combined QLoRA with another technique (FSDP) to be able to train, or at least fine-tune, 70 billion parameter models on kind-of-consumer hardware as well. So in the past week or two we're not only getting news about being able to fine-tune 70 billion parameter models on consumer-ish hardware,

    [00:35:53] Alex Volkov: we're also getting news about being able to run them at some reasonable number of tokens per second, and not one token every four minutes or something. Exciting news in open source.

    [00:36:04] DeepSeek VLM 1.3 & 7B VLM that punches above its weight

    [00:36:04] Alex Volkov: Maybe we'll cover DeepSeek VL here, as it's vision, but it was definitely released in open source and we don't want to miss it.

    [00:36:10] Alex Volkov: So DeepSeek, the folks behind DeepSeek Coder, released DeepSeek VL, state-of-the-art 1.3 billion and 7 billion parameter vision models. If you guys remember, last week we talked to Vik of Moondream2, and that was a tiny vision model. And the whole point, if you were here in the beginning, when Swyx and I got excited a year ago about the vision capabilities of GPT-4.

    [00:36:34] Alex Volkov: The whole point with these vision models is that their improvement this year definitely felt exponential, because now a model of 1.3 billion parameters is a tiny model that most Macs can run very easily. And, as our friend Xenova will tell us if he joins, very soon with WebGPU we're going to be able to run these fully in the browser.

    [00:36:53] Alex Volkov: These models are now able to perform very similarly to what blew our minds a year ago, which is OCR without an OCR model built in, understanding objects, understanding graphs and charts, etc. And so it's very interesting that DeepSeek VL, let me try to share this into the space, yeah, it should be up there as well.

    [00:37:13] Alex Volkov: Very interesting that the DeepSeek release is punching significantly above its weight, and they actually try to compare themselves to GPT-4 vision, which is quite remarkable, on different tasks like evaluation and multi-image tasks. On some of these tasks they get to half the performance of GPT-4 vision, which is still quite incredible, right?

    [00:37:35] Alex Volkov: It's a 7 billion parameter model; for GPT-4, we still don't know how many parameters it is. We still don't know if GPT-4 vision is a mixture-of-experts model or not. But DeepSeek VL is actually coming close to the same performance as GPT-4 on commonsense tasks and analysis tasks.

    [00:37:55] Nisten: Yeah, and I just want to say llama.cpp supports these models. I don't know about DeepSeek specifically, but they've supported all the other ones. And there's also a LLaVA CLI in there, which you can use with these ones. Also, when you run the server, you can run the models as well.

    [00:38:12] Nisten: I think they just need a little bit more compute and engineering, and they can match GPT-4 when it comes to vision. I am quite surprised that it wasn't that big of a deal. In some ways, CogVLM, not DeepSeek, is a lot better than the rest, but it's also a larger model too. And I quickly wanted to say, because you mentioned Xenova before, I don't know if you're going to go more into that, but it turns out that the people on the core Chrome team, or Chrome Canary, who implement WebGPU, they listen to ThursdAI and the stuff that we've been saying over the months, and they've actually started implementing it.

    [00:39:00] WebGPU and Int8 support for quantized models

    [00:39:00] Nisten: And the most exciting thing that I find now is that they are trying to implement int8 support natively in WebGPU. So that will save another half of the memory when you run stuff; even if you have a GPU that doesn't necessarily support int8, I think there was a method to run at half the memory.

    [00:39:23] Nisten: So remember, we went from only supporting float32 a few months back, I think it was September, and you needed a weird, special version of Canary with a few command flags to support float16. Now they're supporting int8, so the memory requirements in the browser have dropped by 4x in the last five, six months.

    [00:39:44] Alex Volkov: I remember the days before WebGPU support even landed in all of Transformers.js. For folks who are not following us that closely, Xenova is a friend of the pod and the author of Transformers.js. We talked a lot on the pod; he actually announced his joining Hugging Face on the pod as well. And he created Transformers.

    [00:40:04] Alex Volkov: js, which is a way to run these models in JavaScript, in Node, via the ONNX platform. And when we talked about this before, the only way to run these models in the browser was fully on CPU, and we always talked about, okay, WebGPU is going to come at some point. WebGPU is the ability to tap into GPU inference from the browser environment, from the Chrome environment.

    [00:40:26] Alex Volkov: And since then, WebGPU went from being a spec that was announced to being released, and now it's fully supported everywhere. But, Nisten, like you're saying, it only supported float32, right? Can you describe this part a little bit more? And now they're listening to us and actually landing support for quantized versions of these models, smaller versions, to be able to run even smaller models that perform the same.

    [00:40:47] Alex Volkov: And,

    [00:40:47] Nisten: Yeah, so now in Chrome you don't even need Canary, and it will support float16. And by default, if you only have a CPU, stuff can now run on the CPU in float32. But again, the biggest use for this so far has not actually been chatbots. Even though chatbots do work, it has been more the visual stuff and the effects.

    [00:41:10] Nisten: All the diffusion-based stuff, some stuff with function calling. That's where it gets pretty exciting, because it changes what kind of applications you can build. It's, again, the front end: what are you going to put in front, before requests reach the big GPU cluster? So this is the part where we're going to see the most changes and progress, in my opinion.

    [00:41:34] Nisten: It's going to be the visual stuff, making use of the Transformers.js [00:41:40] library.

    [00:41:40] Alex Volkov: And one example of that, which Xenova showed on his feed, is real-time background removal from video. So you play a video, and then imagine a Chrome extension that's loaded or something, and you're able to run AI transformer stuff on top of everything that you read or see. That's the kind of stuff we're talking about; with access to the GPU, I think that's going to be possible.

    [00:42:00] Alex Volkov: So super, super exciting to see how this performs. And obviously this means that the models we talk about running locally will just get more use, because developers will be able to build them in. This will never get to the point of GPT-4 level full generality, or I don't want to say never, but it's not quite there in terms of running a GPT-4 level model fully in your browser.

    [00:42:23] Alex Volkov: But for some specific tasks like vision, as we just talked about, on several benchmarks CogVLM and this tiny new release, DeepSeek VL, are now getting there, right? So you'd be able to analyze images, for example; you'd be able to do all kinds of things fully in the browser, without loading, without Python environments, without all of these things.

    [00:42:42] Alex Volkov: I think it means a lot for user experience as well. I think we've covered open source a bunch. Do you guys have anything else worth mentioning in the open source thing? Briefly before we move on to the to the big companies, and maybe we'll discuss, we're going to discuss agents as well.

    [00:42:57] Cohere releases Command+R - a RAG focused model in API + open weights


    [00:42:57] Alex Volkov: Yeah, so Command R. Interestingly, it's both in the open source bucket and not, so maybe it's a good transition; let's actually do this as a transitional topic. So Cohere, the company that raised, I don't know, a bunch of millions of dollars, and everybody expected it to be like a second Anthropic, and it didn't quite get there for a while.

    [00:43:18] Alex Volkov: Now it's back, very impressively back. For a long time, I think, Cohere refocused their efforts on things like RAG. They had the Cohere reranking model, and they've had embedding models for a while. And I know that we at Weights & Biases use the Cohere reranker for our RAG bot, and it improves our responses significantly.

    [00:43:39] Alex Volkov: Reranking is basically receiving back from your vector database a few responses that are nearest-neighbor matches for what your user asked for, and then running another process of re-ranking them for higher, how should I say, accuracy. And so the Cohere reranker was for a long time one of the more standard ones that folks use.
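
    For context, the reranking step just described usually looks something like this with Cohere's Python SDK; the model name, response fields, and the toy documents are illustrative assumptions and may differ from the current API:

```python
# Rerank candidate passages from a vector store before stuffing them into the prompt.
import cohere

co = cohere.Client("YOUR_API_KEY")  # placeholder key

query = "How do I log a confusion matrix in Weights & Biases?"
candidates = [
    "wandb.log() accepts arbitrary dictionaries of metrics per step.",
    "Confusion matrices can be logged with wandb.plot.confusion_matrix().",
    "Sweeps let you run hyperparameter searches across many runs.",
]

reranked = co.rerank(model="rerank-english-v2.0", query=query, documents=candidates, top_n=2)
for hit in reranked.results:
    print(f"{hit.relevance_score:.3f}  {candidates[hit.index]}")
```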

    [00:43:58] Alex Volkov: And now Cohere actually stepped in and said, hey, we're releasing a new model. It's called Command R. It's a new generative model from Cohere aimed at production-scale tasks like RAG (retrieval augmented generation) and using external tools and APIs. So here's this phrase again: external tool use, and APIs as well.

    [00:44:16] Alex Volkov: As we previously discussed, tool use is important. We just got tool use fully in open source thanks to Nous Research, and I haven't yet tested their tool use, but Cohere is definitely building this model for it. And I think, Swyx, you also saw this release, and we both pretty much identified the same thing: this is interestingly not getting compared to GPT-4 or Claude Opus, right?

    [00:44:40] Alex Volkov: They're not even trying. They have a very specific use case in mind, and I wanted to hear from you, Swyx, if you have any other comments on that, or on how they're positioning themselves and what world they're operating in.

    [00:44:54] Swyx: For Command R?

    [00:44:55] Alex Volkov: For Command R, and Cohere in general, yeah.

    [00:44:58] Swyx: simple answer is probably not as good as GPT

    [00:45:01] Alex Volkov: Yep.

    [00:45:02] Far El: They didn't include it, and I haven't tried it out personally myself. People seem to be talking about it for retrieval and RAG-type use cases, but I can't give my personal endorsement. Just in general, Cohere, I think, has been more active in enterprise use cases and fine-tuning, talking about their fine-tuning capabilities, or long-tail, low-resource-language use cases. They also released Aya, I think, last month, which some people in the open source community were quite excited about. But yeah, I think seeing them do such a strong launch for a new model is like a second wind for Cohere, and I'm excited to see more coming out of them.

    [00:45:43] Alex Volkov: Definitely feels like a second wind, and I don't know how much we've covered Cohere here before, but the fact that they released the model in open weights on Hugging Face, I think, gives them a lot of credibility with the community. LDJ, go ahead.

    [00:45:58] Nisten: Yeah, I noticed they did actually post some benchmarks on the website comparing to Llama 2 70 billion, Mistral, and GPT-3.5 Turbo, like overall comparisons and RAG benchmarks, and Command R does seem to beat all of those three that I just mentioned. And of course, this is their own reporting, so it's probably good to wait for third-party benchmarks. But yeah, it's apparently very good at multilingual abilities as well. I think I saw somebody whose first language is Portuguese saying that Command R was one of the best models at handling it very fluently and understanding the nuances of the language.

    [00:46:39] Nisten: So yeah, I think that's really interesting and it might just be really good overall model for open source.

    [00:46:45] Alex Volkov: Yeah

    [00:46:45] Nisten: I think it is open, sorry, it's open weights, but I think it's just a non-commercial license.

    [00:46:51] Alex Volkov: Yeah, so they did an open weights release with a non-commercial license. And they did say that if you're an enterprise and you want to build something cool with Command R, talk to them and they'll figure something out. And Aidan Gomez, the CEO of Cohere, one of the founders and one of the authors of the Attention Is All You Need paper, recently unblocked and became friends with Nisten here in the Toronto community.

    [00:47:16] Alex Volkov: He mentioned that this model is optimized for 10 major languages for global business and trained on 13 more, so it's actually pre-trained on 13 more, and it has a 128k context window, right? So if you compare this to GPT-3.5 Turbo, or Mixtral for example with, I don't remember, 32k context, this is 128k. And they specifically focus on speed in addition to everything else, right?

    [00:47:39] Alex Volkov: And in RAG systems, in these systems, you may not need a model that's super, super smart; you may need a model that is able to retrieve everything that you want much faster, and significant speed improvements may matter more than smartness on MMLU tasks, right? So I think that's their game, that's what they're playing, and they compare it, like LDJ said, to 3.5

    [00:48:02] Alex Volkov: and not GPT-4 or Opus. And they have results in something called KILT, Knowledge Intensive Language Tasks, and in retrieval and tool use specifically. They also have a bunch of stuff on their platform to be able to do tool use, and by tool, like I explained before, I mean things like go get me some news from the web, for example. So it's really focused on web integration, getting things from the web.

    [00:48:22] Alex Volkov: Nisten, did you see the one-liner they posted where they basically said, hey, here's Perplexity based on Command R? I think you replied to that. Do you remember, do you want to cover this briefly? It was a really fun example.

    [00:48:36] Nisten: Yeah, I shared it in the Jumbotron, it's like the third thing. It looks like it's pretty easy to build a RAG pipeline with their code, but not all of it is open. There are a few things there which are unclear, and I haven't built that pipeline yet to say for sure. So I don't want to say anything that it's incorrect, but it looks like they've made it really easy to build your own perplexity in five lines of code.
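
    The "Perplexity in a few lines" demo Nisten mentions presumably looks roughly like this: Cohere's chat endpoint with their hosted web-search connector. This is a hedged sketch, not their exact snippet; the connector ID and response fields are as documented around that time and may have changed:

```python
# Ask a question, let Cohere's hosted web-search connector retrieve pages,
# and get back a grounded answer with source documents.
import cohere

co = cohere.Client("YOUR_API_KEY")  # placeholder key

response = co.chat(
    model="command-r",
    message="What did Anthropic announce this week?",
    connectors=[{"id": "web-search"}],  # Cohere-hosted retrieval, RAG-style
)

print(response.text)
for doc in response.documents or []:
    print("-", doc.get("title"), doc.get("url"))
```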

    [00:49:04] Alex Volkov: That was really funny, like a little dig at Perplexity. Definitely the model is able to do the web-search tool; this model specifically excels at that, but other tools as well. So shout out to Cohere, a second wind like Swyx said, and we'll definitely keep you guys posted when some of us try this.

    [00:49:21] Alex Volkov: It's an open weights model that you can run, but not commercially; you can use it and train on it, and maybe this will help open source folks as well.

    [00:49:29] Anthropic releases Claude Haiku - GPT3.5 competitor

    [00:49:29] Alex Volkov: Moving on from Cohere, and in the same battlefield actually, Anthropic gave us an announcement yesterday, and a very smart release schedule from Anthropic, I must say, right?

    [00:49:40] Alex Volkov: They announced Claude 3 a few weeks ago, with three versions: Opus, which is their flagship that many people now prefer over GPT-4, which is quite incredible. It's not taking over on LMSYS yet, so GPT-4 still leads the LMSYS Chatbot Arena. But I think we've [00:50:00] been coming back here week after week and saying that more folks use Opus.

    [00:50:04] Alex Volkov: Um, let me see just by raising hands. Do you guys use Opus for the past week? At least once? What do you have a thumbs up or thumbs down for Opus use?

    [00:50:13] Swyx: Oh yeah, I use it every day.

    [00:50:15] Alex Volkov: Every day. Wow. So you got the pro thing or are you using the API kinda?

    [00:50:20] Far El: I got Pro, but apparently I'm a chump because I don't have to use Pro, like, only like B2C, non developer types should use Pro. Every developer should just use the Anthropic Workbench because you just pay by API call and you're probably using less than 30 worth.

    [00:50:35] Alex Volkov: I will say this very quietly: with Anthropic, maybe you don't even have to pay unless you apply for production use, and then you have to put in a credit card. It's open and you get API calls for free. I will also say this: Tony Dinh released, like a year ago I think, something called TypingMind, which is a front end for ChatGPT basically, but on the back end you can plug in any model you want.

    [00:50:55] Alex Volkov: So basically you get the ChatGPT experience, including the vision stuff; you can upload images as well. And I think that costs like 30 bucks. If you get that and you plug in the API key that you get from Anthropic for free, you basically get the same experience, and you don't have to pay the 20 bucks a month.

    [00:51:08] Far El: Do you use TypingMind every day? I hear some social media buzz about it, but I don't see any AI engineer type people using it.

    [00:51:15] Alex Volkov: I hadn't used it up until I had to try Claude 3 and didn't want to pay the extra 20 bucks on top of our subscription. So I just plugged it into TypingMind, and it's a nice experience. I still go to the Workbench. The Workbench is more for us, for engineers, right?

    [00:51:30] Alex Volkov: In the Workbench, everything that you get there you can immediately export and continue via the API, for example. And the Workbench is annoying because, for every prompt that you have and every answer that the model gives you, you have to remember to click a button and put it back into the stack of messages, right?

    [00:51:47] Far El: you can use keyboard shortcuts, but it's also meant for you to prototype prompts, right? So that's what you want to do. You want your conversations not to persist. You want to see the output and you're like, okay, throw away the output. I'll tweak the prompt again, generate the new output. So you don't want it to auto add to the conversation.

    [00:52:04] Far El: That's the main difference,

    [00:52:05] Alex Volkov: That's true. And so definitely many folks use the Workbench for prototyping prompts is great, but just for chatting is also great. So you've been using it, so what's your take on Opus so far?

    [00:52:17] Far El: Oh, yeah. If you go to AI News every day now, I'm Generating Haiku, Opus, and what's the other one? Sonnet. By the way did you know that the names of these things basically hint at the model size?

    [00:52:30] Alex Volkov: Yeah, let's talk about this. Opus is like a big

    [00:52:32] Far El: Yeah. A haiku is three lines long, a sonnet is 14 lines long, and an opus is unbounded. So maybe 3B, 14B, and probably 8 times 220B. Yes, I think the Claude people thought they were very smart by just encoding the numbers in the names.

    [00:52:50] Alex Volkov: I gotta applaud them on the names, because I stopped saying Claude 3, I'm just saying Opus now, and everybody gets what we're talking about. Opus is a brand name that's built up separately from Claude 3, which is, I think, very smart. Like, 3.5, 4, 4 Vision, all these things are a little harder to say, and now they came out with actual names, and I gotta applaud the strategy.

    [00:53:12] Alex Volkov: I think, just to connect the dots back to where we are today: yesterday Anthropic finally released the previously announced Haiku. And yeah, Swyx, you had another comment that I spoke over?

    [00:53:22] Far El: Nothing, I was just going to say, if you want to, you should be generating things side by side and seeing the model difference. Haiku is very bad at instruction following. Sonnet is actually, surprisingly, good enough; I would use Sonnet for most things. And Opus is more powerful but slow, and honestly not really worth it.

    [00:53:42] Far El: And if you want to see side-by-side generations, just go to the last few issues of AI News. You'll see them side by side and you can decide for yourself which one you prefer. Yeah, so I run all the summaries through Sonnet and Opus and Haiku every day now, and I can see the difference.

    [00:53:56] Far El: I would say the general take is that Claude 3 in general is better at instruction following and summarization than GPT-4, which is huge. I can't believe I'm saying this ahead

    [00:54:08] Alex Volkov: It's crazy.

    [00:54:08] Far El: of GPT-4. But it hallucinates more. There are very obvious inconsistencies in the facts that it picks up on, and they're just plain wrong.

    [00:54:18] Far El: And anyone with any knowledge of the subject matter will spot that immediately. So Soumith, when he was talking about Claude 3, actually referenced some examples from AI News on his timeline; go check out Soumith's timeline on Claude 3. And yeah, I will say that is the problem with using Claude 3:

    [00:54:35] Far El: it follows instructions very well, but then it will hallucinate things, maybe because it doesn't have as good of a world model as GPT-4. Whatever it is, now I'm having to decide, as a product creator, am I using Claude 3 because the vibes are better, but then do I have to build an anti-hallucination pipeline? Which I'm trying to build, but it's difficult, because what is truth?

    [00:54:56] Alex Volkov: Yes. Let me ask you a question real quick. One second, Nisten, and then you go. Swyx, one question: did you change your prompt for Claude specifically from your GPT-4 prompt?

    [00:55:08] Far El: I copied over some of it and I wrote some other parts from scratch. I understand that a lot of people say you should use XML for this stuff. I think that it's a little bit of mumbo jumbo, especially because I'm not doing structured output.

    [00:55:22] Alex Volkov: I will say this: they have Alex Albert, who's now getting more of a highlighted role. He's the guy we've talked about who did the needle-in-a-haystack analysis where Claude Opus realized that it was getting tested, right? You probably saw this famous tweet. So Alex is the prompt engineer there.

    [00:55:38] Alex Volkov: He has a Colab that's called Metaprompt. You can find it, and I'm probably going to put it in the show notes: you basically describe the task that you want, and then Opus comes up with the prompt for Opus itself. And the prompts that it comes up with work way better for me than the prompts that I've written myself.

    [00:55:54] Alex Volkov: So it does use a little bit of XML. And I just want to say, Diana, it's not necessarily directed at you, but definitely at you as well: some different prompting is needed. These models do need different prompting; they've been trained differently. XML is one part of it, but it feels like a bit more prompting is needed, and folks can't just expect the same prompt that works for GPT-4 to work here.

    [00:56:16] Alex Volkov: I think some of our intuition also changes per model. Some models, like you said, are more hallucinatory but follow instructions better. Definitely, I saw this. Nisten, I cut you off before. If you still remember where I cut you off, please continue.

    [00:56:29] Nisten: No, it was along the same lines. I've used Sonnet, and I just open the Bing sidebar and quickly iterate through stuff with Sonnet. And yeah, I noticed the same thing: it does make up a lot of stuff. So then I need to drop it into Bing in precision mode and have it actually look up the stuff, and then it's still not quite ideal.

    [00:56:52] Nisten: But in this combination I also use Mistral Large, just switching between Bing with internet mode and either Sonnet or Mistral Large to quickly iterate, although Mistral Large is slow. So again, I really like the speed of

    [00:57:09] Far El: Sonnet

    [00:57:11] Alex Volkov: Yeah, so let's actually pick up on the news thing. So we covered Claude before, and now we've talked about folks putting it in production, like Swyx, and we're also testing this. Anthropic released Haiku, which is their smallest model, and it doesn't compete with GPT-4; they go for the lowest price and the fastest execution.

    [00:57:32] Alex Volkov: Fairly similar to the Command R kind of playground that we got, right? It's focusing on speed, focusing on the best performance possible at the fastest and cheapest price possible. And we definitely heard before from multiple folks who fine-tuned GPT-3.5, for example, and get better results than GPT-4 on fine-tuned GPT-3.

    [00:57:51] Alex Volkov: 5, and significantly faster as well. So Anthropic released Haiku, which is their fastest and most affordable model for enterprise applications. They stress enterprise because every token counts, every dollar counts, and you actually get to measure these models not only on how good they are, but also on how good they are compared to how much money you pay for them and how fast they respond to your users.

    [00:58:14] Alex Volkov: And the main differences between Haiku and GPT-3.5, or even Gemini [00:58:20] 1.0 Pro? The main difference is price. It's priced at 25 cents per million input tokens, whereas GPT-3.5 is half a dollar per million tokens, so half the price. The output tokens are 1.25 dollars per million output tokens, but usually enterprises do prompt engineering, so they shove a bunch of stuff in the prompt and the response is not that long.

    [00:58:43] Alex Volkov: So usually you focus on the input tokens. It gets 75 on MMLU and 89 on GSM8K, which is significantly better than GPT-3.5. Now, they may have used the originally announced 3.5 metrics and not the actual current ones, which folks often do, but it's still very impressive. And it does almost 76

    [00:59:09] Alex Volkov: percent on HumanEval for code, which is quite impressive for a super fast model. But I think the highlight of the differences versus 3.5 or Gemini 1.0 Pro is that Haiku is vision enabled, right? So you can pass images; it's quite impressively vision enabled. So whatever we got excited about last year with Swyx, I think, is now possible at 25 cents per million tokens, which is quite incredible.
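
    If you want to try the vision-enabled Haiku yourself, a minimal call through Anthropic's messages API looks roughly like this; the model ID and content-block format are per Anthropic's docs at the time, and the image path is a placeholder:

```python
# Send an image plus a question to Claude 3 Haiku via the messages API.
import base64

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with open("chart.png", "rb") as f:  # placeholder image file
    image_b64 = base64.b64encode(f.read()).decode()

message = client.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=512,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64", "media_type": "image/png", "data": image_b64}},
            {"type": "text", "text": "What does this chart show? Summarize the trend."},
        ],
    }],
)
print(message.content[0].text)
```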

    [00:59:34] Alex Volkov: You can use it everywhere pretty much; a million tokens is a lot. And also it has 200, oh, sorry, go ahead, Swyx.

    [00:59:43] Far El: No, one caveat or question: is the vision model in Haiku the same vision model as in Sonnet or Opus? Maybe it's dumbed down as well, and no one's really run any of the benchmarks on this stuff.

    [00:59:56] Alex Volkov: Yeah, and then I think it's worth calling out that now you get the same 3.5-level speed with a significant improvement in performance, plus vision enabled, plus a 200,000 token context window as well, which for 3.5, I think, is still, is it 8K? Yeah. So shout out, sorry, not Cohere, to Anthropic, for keeping the news coming.

    [01:00:17] Alex Volkov: The release schedule was very well timed. They released the two biggest models and then followed up with this fast model. And for folks who are looking at how to use it, or maybe lower their costs while keeping the same performance, it's very interesting. Anthropic promised us tool use and function calling and hasn't yet given us function calling.

    [01:00:35] Alex Volkov: They said that these models are able to do function calling and tool use, but we're still not able to use it yet.

    [01:00:42] Hardware and Robotics

    [01:00:42] Alex Volkov: Big companies and APIs, I think that's most of what we want to cover, but I think we're smoothly moving towards the next area, where we talk about hardware and robotics, because one big company joined another big company a few weeks ago, and now it's worth talking about OpenAI and Figure, the humanoid robot company.

    [01:01:01] Alex Volkov: Figure has been in the news; they've showcased the robot for a while. Their, I think, main competitor, it's funny how Boston Dynamics was the big name in robotics for a while, and now all these companies are leapfrogging Boston Dynamics in some areas.

    [01:01:16] Figure + OpenAI integration shown on a video

    [01:01:16] Alex Volkov: So Figure has this humanoid robot, it has ten fingers, it moves them very freely, it's very interesting.

    [01:01:22] Alex Volkov: Recently they announced their integration with OpenAI, And I think OpenAI also announced the integration with Figure. And now they released a video, and that video is bonkers. It's really, folks, it's really bonkers.

    [01:01:31] Alex Volkov: They show the Figure robot standing in some form of a lab. Funnily enough, on the back wall it says AGI lab. And that Figure robot has a little screen on its face, and that screen shows you the same exact interface that you and I get in the ChatGPT iOS app, with the little circle that turns into a few animated things when you talk to it. And I found it really funny that they insisted on keeping the same kind of UI. And basically this person comes up to this Figure and says, Hey, what do you see right now? And the robot uses the onboard cameras to send an image.

    [01:02:07] Alex Volkov: I guess it's one image to GPT-4 vision, and it replies with, I see, literally the robot says, I see you standing in front of me with your hand on the table. So that was one quirk of how the robot knows that the person that talked to it is the person who actually stands in front of it. But they stress that this video is end to end, not edited, and it's 1x speed also, and the robot replies fairly fast.
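    Figure hasn't disclosed how the integration actually works, so purely as a sketch of what a single-frame see-then-speak loop could look like with the public OpenAI APIs (the model names and the `describe_and_speak` helper are my own guesses, not anything Figure has confirmed):

```python
# Hypothetical see-then-speak loop using public OpenAI APIs (openai>=1.0).
# This is a guess at the shape of the pipeline, not Figure's actual integration.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def describe_and_speak(jpeg_path: str, question: str) -> str:
    with open(jpeg_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    # One camera frame plus the spoken question go to the vision model.
    chat = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
        max_tokens=200,
    )
    answer = chat.choices[0].message.content

    # The text reply is turned into audio with the public TTS endpoint.
    speech = client.audio.speech.create(model="tts-1", voice="alloy", input=answer)
    speech.stream_to_file("reply.mp3")
    return answer

# describe_and_speak("frame.jpg", "Hey, what do you see right now?")
```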

    [01:02:32] Alex Volkov: Time to speech on OpenAI's GPT-4 is quite fast anyway. If you use the app, you know this, but for the vision stuff, maybe they got some private version of the API. Responses, folks on the stage: did you see this video? What do you think of this? Is the Terminator coming, or are we still not there?

    [01:02:48] Alex Volkov: What are your, some of your responses to this figure video now with OpenAI Brain?

    [01:02:56] Alex Volkov: Ray, go ahead.

    [01:02:58] Far El: Yeah,

    [01:02:59] Roei Cohen: very shortly before that, I listened to Yann LeCun's podcast with Lex. Yeah, and what struck me about that demo was that we're actually quite close to having usable robots just using the reasoning that is available from OpenAI. And that's, I think that's remarkable.

    [01:03:20] Roei Cohen: You know what I mean? Because at the end of the day, you're like, oh, these things are not thinking, they're just spinning out Next Tokens and whatnot, but more and more, I feel myself drawn into Ilya's Camp, where you're like, no, there's probably some world model that, that these things have to develop internally, because otherwise they wouldn't be able to accomplish all these tasks that are essentially what you need is some sort of an understanding of embodiment in order to like, Reason about where to move your limbs and how to pick up things and, things of that sort.

    [01:03:50] Roei Cohen: I don't know. I just thought that there was, like, a really stark contrast between what they showed in the demo and that conversation. More optimistic today than I was before.

    [01:03:59] Alex Volkov: Absolutely. And I think if there was one additional reason for Space Daddy to sue OpenAI, for Elon to sue, it's that Optimus is definitely a bet that Tesla is making now. Tesla's whole reason for existing was to bring the world renewable energy, and when Optimus was announced, many investors thought, Hey, is this moving the vision a little bit?

    [01:04:19] Alex Volkov: Because Optimus does not bring the world renewable energy. Optimus is advancing super quick as well. We saw multiple demos over the past year. The last demo blew me away in terms of the dexterity of the different fingers and everything. And then you gotta wonder how smart Optimus will actually be

    [01:04:35] Alex Volkov: in terms of its ability to perform tasks and respond to you. And Figure announced their advanced robot, and then they announced the integration with OpenAI, which we know Elon is now on the warpath with. And so I gotta wonder about that integration, now that Elon also has Optimus and Grok.

    [01:04:53] Alex Volkov: Given where Grok is right now, and I posted this as a question on my timeline is, would you prefer Optimus, who's like better and more flashier, but with Grok brains, versus a figure with GPT 5 brains or something? And I think it was quite obvious where the distribution lies.

    [01:05:07] Alex Volkov: You would want the less flashy robot that's potentially smarter than the flashy robot that's at GPT-3.5 level. So the understanding of the scene was very impressive there. The text to speech was very natural. I don't know if you guys noticed in this video, but the robot actually ums and uhs and takes pauses, and it feels like they built something like this.

    [01:05:27] Alex Volkov: They probably used the same text to speech as OpenAI, but it feels like OpenAI gave them a little bit of a better model, because I use the OpenAI text to speech often via the iOS app, and it doesn't do the, you know what, I actually think, kind of thing, right?

    [01:05:41] Roei Cohen: To be fair, I've started seeing this behavior in text to speech with Pi first. Pi already does these ums and uhs, and a more natural kind of cadence,

    [01:05:51] Alex Volkov: yeah, Pi is very expressive for sure. LDJ, go ahead.

    [01:05:55] Nisten: Yeah, I actually use a custom instruction with ChatGPT where I specifically give it a set of principles to follow.

    [01:06:02] Nisten: And the last principle is: make sure to include ums and uhs in your speech, as if you're talking. And I feel like when I use the conversational voice mode, it makes it feel a lot more realistic, because then it's actually literally saying uhs and ums. And it does end up doing that with me.
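    For anyone who wants to try the same trick through the API rather than the custom-instructions UI, the rough equivalent is a system message. The wording and model name below are made up; only the pattern matters:

```python
# A sketch of Nisten's custom-instruction trick via the API: a system message asking
# the model to include filler words so the spoken output sounds more conversational.
from openai import OpenAI

client = OpenAI()

FILLER_INSTRUCTION = (
    "Follow these principles when answering. "
    "Last principle: include natural filler words like 'um' and 'uh' and brief pauses, "
    "as if you were speaking out loud."
)

resp = client.chat.completions.create(
    model="gpt-4-turbo-preview",  # placeholder model name
    messages=[
        {"role": "system", "content": FILLER_INSTRUCTION},
        {"role": "user", "content": "Walk me through how a Jacquard loom works."},
    ],
)
print(resp.choices[0].message.content)
```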

    [01:06:17] Alex Volkov: Yeah, so I definitely noticed this, and this could be just a result of something like a custom instruction, or maybe they're using a different model. The voice they use is not one of the voices that OpenAI gives out. I think it's a custom voice that they use. It's a little raspy. It's pretty interesting that they gave it something.

    [01:06:35] Alex Volkov: And Roei, to your point before, I gotta wonder how deep the integration goes. Do they use just [01:06:40] the API, or did they do a fine tune on top for the ability of this robot to actually perform tasks?

    [01:06:45] Nisten: I saw official confirmation; somebody was asking, hey, are you guys using GPT-4, and maybe a new model or something? And then the CEO of Figure just cryptically replied saying, we're using some new advanced features, something like that.

    [01:06:59] Alex Volkov: Yeah, they're definitely getting advanced features. We know that OpenAI gives advanced features to friends; Spotify, for example, uses OpenAI voice cloning tech for converting Lex to Spanish, and we know for a fact that they give this out very sparsely, so probably they have more advanced features.

    [01:07:16] Cerebras announced their largest and fastest AI chip, the CS-3

    [01:07:16] Alex Volkov: Alright, in the hardware and robotics area, we want to cover two more things super quick. Cerebras announced their largest and fastest AI chip on Earth. So this is a company that builds custom hardware, and they announced the CS-3, which they claim, and all these claims are probably still in flux,

    [01:07:36] Alex Volkov: and I don't know if it supports, how should I say, PyTorch, for example, but they claim it can train up to 24 trillion parameter models on a single device. They say the world has never seen AI at this scale, and it's insane. 24 trillion parameters on a single device. It's insane.

    [01:07:55] Alex Volkov: It has 4 trillion transistors. I can keep saying numbers, and I'm not a numbers guy, so when people talk numbers at me they blow past me, but 900,000 AI cores on this chip. And it's very interesting that they have terabytes of external memory, even up to petabytes, which is crazy.

    [01:08:12] Alex Volkov: Anybody who's more into hardware who wants to comment real quick on what Cerebras announced and how important this is to the industry is more than welcome.

    [01:08:20] Nisten: Yeah, the on chip memory, which is pretty much equivalent to cache in a GPU, they have, I want to say, 40 or 50 gigabytes on the CS-3, which pretty much means you would be able to train or inference, theoretically, anything Mixtral size or smaller at insane speeds. We're talking, maybe, I don't know, at least a thousand tokens a second, probably, maybe even five thousand or more, and that might even be conservative. Look, there are insane amounts of compute and bandwidth here that you could have, especially for small models.

    [01:08:53] Alex Volkov: That's quite incredible. I don't know if that's in production at some point or when it's going to be, but at least based on the numbers, this looks just absolutely like a chunk of incredible.

    [01:09:05] Extropic announces their Thermodynamic processing unit

    [01:09:05] Alex Volkov: And in addition, one more hardware news item super quick. Extropic, the folks who are, I think, founded by folks who were at DeepMind before and did some

    [01:09:15] Alex Volkov: quantum computing stuff. I think that's what Guillaume's background is. They announced their TPU, or at least what they're going to build or are thinking about building, which is not a tensor processing unit like the TPU from Google. It's a thermodynamic processing unit. It's basically a tease at this point.

    [01:09:32] Alex Volkov: I don't think they have hardware ready to go, or at least not that I understand. And, I'll add this to the show notes, I had to dig deep into trying to understand what it is that they announced, and it was really hard for me. And it didn't seem like my non-hardware background was the reason.

    [01:09:48] Alex Volkov: It felt like some other folks also getting a little bit lost in what they actually talked about. LDJ, if you want to take a stab at giving us like a little brief recap, I would really appreciate it, because I know that you are in some of these spaces. But thermodynamic is like a new approach to basically doing AI, as far as I understood.

    [01:10:07] LDJ: Sure, yeah, so there was a Q&A that they held yesterday, which I think is recorded, and maybe on the Extropic page it might be there for anybody to listen to. But I spoke with them a bit, and Guillaume, the CEO, and Trevor, the CTO, they're both creators of TensorFlow Quantum, and they worked at Google, and they didn't work at DeepMind, but they actually worked on something arguably a little cooler than DeepMind, depending on who you ask, called Google X, which is pretty much the secret development division of Google, where they work on very long term deep technology projects.

    [01:10:44] LDJ: And, yeah Trevor and Guillaume, they met at Google X, when they were working on TensorFlow Quantum and a lot of quantum computing technologies. And a lot of the systems that they had to develop to mitigate, Like all the errors that built up in the quantum computing system that they had to account for, they ended up on a path where, hey, we could actually start using this technology itself for the computing in the first place.

    [01:11:09] LDJ: And the goal is really just a general speedup of mainly things like gradient descent and operations that are pretty much used in all of deep learning and all of AI. So it's not just specific to transformers or specific to this and that. And yeah, they plan to have a bunch of server grade chips within the next, let's say, around three years or so, and they plan to have consumer available chips as well in accelerator form factors.

    [01:11:40] LDJ: So you'd be able to just plug it into your motherboard, just like you plug in a GPU into your motherboard today, and it would just be an accelerator card that has this thermodynamic computing components within it that would be able to accelerate your AI workloads just way more.

    [01:11:55] Alex Volkov: That's incredible. I think they wanted to call this the transistor of the AI era, and the transistor was a big step function change in the world of computing. So shout out to them. It still looks like it's a little ways out, but definitely they're getting interest. And the very techno positive, techno optimist part of their outlook is definitely

    [01:12:16] Alex Volkov: also helpful. So I think that's mostly it on the hardware news and robotics. We don't often cover this, but this week seemed to have been a very big week in hardware news and robotics, and a lot of stuff happened that pertains to tech optimism, a very big week for announcements and different things.

    [01:12:35] Alex Volkov: The chip design for whatever they're doing looks, looks crazy as well. So definitely folks who are into this, go check it out and let us know what you think. In comments as well. I think we've been at this for almost an hour and something, and I do want to do like a little reset, maybe drink a little bit.

    [01:12:50] Alex Volkov: So let's do a short reset of the space and I'll talk about some stuff that happens in Weights Biases. And then we're going to continue talking. We still have to talk about Devon. So we'll brief, brief reset, and then we're going to pick up on this.

    [01:13:27] Alex Volkov: Alright, you are on ThursdAI, today is Pi Day, March 14th, and this day started crazy and kept getting crazier. So this morning, early on, many folks woke up to see SpaceX launch the largest man made object ever to break through the atmosphere, and this was a crazy thing, the third time they've tried it. Today also is the birthday,

    [01:13:52] Alex Volkov: or anniversary, of GPT-4, which was released a year ago exactly on Pi Day, March 14th. And we're still waiting for some breaking news to come through, and hopefully they release something. There were rumors that GPT-5 is coming up. There were Microsoft Copilot pages that said, Hey, you can get access to GPT-5 for all.

    [01:14:09] Alex Volkov: You saw this, right? And then those rumors were discarded, and Microsoft said there was a typo. And we're going to try and see what else we're going to get here today in breaking news. But also, today's the anniversary of Anthropic's Claude, the first Claude; the first kind of production model that Anthropic released was also a year ago.

    [01:14:32] Alex Volkov: Very big week last year, very big week this year as well. And of course, it's also ThursdAI, BirthdAI, the one year anniversary of these spaces, as I've been hosting them for the past year. Consistently, I think we missed one. I think I was really sick and I missed one. I still sent the newsletter, but I missed

    [01:14:49] Alex Volkov: the space. And so we're here celebrating the ThursdAI birthday with a bunch of friends here on stage. And I think in this vein, it's now a good time for me to say that ThursdAI is not sponsored. If you want [01:15:00] to support this, please support us, please follow ThursdAI, the newsletter.


    [01:15:03] This week's buzz - Weights & Biases update

    [01:15:03] Alex Volkov: And if you want to engage with us on socials, that definitely helps, because sometimes the reach is, how should I say, hidden on X for some reason. We've seen better and worse times.

    [01:15:13] Alex Volkov: So definitely, if you want to follow us, give a follow to the main account, but also retweet when we start the space; that would be super, super helpful. But the space is not sponsored, besides Weights & Biases. And so I think maybe I'll tell you a little bit about how I joined Weights & Biases, because this was also a thing.

    [01:15:30] Alex Volkov: So folks remember me joining the spaces and thinking, hey, this is fun to do, I have no idea how I'm going to make money. Back then, one of the reasons to do this space was to promote my startup called Targum. At some point, Weights & Biases folks reached out and said, Hey, let us sponsor your newsletter and podcast.

    [01:15:48] Alex Volkov: Because the audience that you draw and the audience that we're looking at is very similar. And I was really apprehensive in the beginning. I didn't really want to take sponsorships, as you guys may have liked the authenticity of the excitement and the stuff that we talked about. We never do paid gigs.

    [01:16:05] Alex Volkov: Nobody pays to come and push their stuff on ThursdAI, which I think the community appreciates. And so I was really thinking about whether this was the right thing, and then after a while I was entertaining it, because Weights & Biases really has a good standing with the open source community as well.

    [01:16:20] Alex Volkov: Their main product is free for personal use. And many folks in the audience knew and loved Weights & Biases way before I even knew what they were, and gave me kind of the thumbs up. And then Weights & Biases reached out again and said, Hey Alex, why don't you and ThursdAI join Weights & Biases? And you just keep doing this, in addition to being an AI evangelist and promoting our products, and our different products, by the way.

    [01:16:42] Alex Volkov: We have some new announcements very soon that I'm very excited about. And back then I really started needing the money and the startup wasn't taking off, and so I said, Hell yes. This sounds like an amazing opportunity for me to keep doing this, to keep talking about AI with you folks.

    [01:16:58] Alex Volkov: Keep learning myself, learning from folks on stage here who know way more than me, and then also learning in public, so other folks can follow along. And yeah, that's how I joined Weights & Biases. And now ThursdAI, the podcast and the newsletter, are brought to you by Weights & Biases. On the space I talk about Weights & Biases stuff, but I talk about the stuff that actually excites me very much.

    [01:17:20] Alex Volkov: And so in the vein of that stuff, I just wanted to add that I'm going to San Francisco in a month, me and everybody else at Weights & Biases, because our annual conference, called Fully Connected, is going to be on April 18th in San Francisco. And the tickets are still early bird, so up until the end of this month you can get them for 50 percent off.

    [01:17:41] Alex Volkov: And it's an opportunity to, how should I say this? One of the reasons why I joined Weights & Biases is because everybody's a customer, including, GPT-4 was trained with Weights & Biases. But also pretty much every other foundational lab that builds foundation models, and in robotics and different other places. Just the amount of logos among our customers at Weights & Biases beats any other company I've ever worked at, or even looked at. It's crazy.

    [01:18:06] Alex Volkov: And so many of these folks will come to this conference to also talk about what they're building, the models. So it's a very good opportunity to visit San Francisco and join us. A day before this conference, I'm going to do a workshop, me along with my team, the Growth ML team, and ambassadors. We're going to do a workshop about improving your production applications step by step.

    [01:18:25] Alex Volkov: And it's going to be super cool. We're going to talk about evaluations. We're going to talk about different other things that we know from the enterprise community, the folks who actually use AI in production; they talk to us. We have our own AI that we're running called wandbot that you're more than welcome to use.

    [01:18:38] Alex Volkov: So definitely come and meet us in San Francisco in April, or in London in May, by the way, if you're Europe based; we have the same kind of conference in London, which I probably won't be attending, but you never know, maybe I'll get called. With this, I think we're moving towards, let's talk about agents.

    [01:18:56] Alex Volkov: And I think this week was a big week for agents. Don't you guys think? Who saw the Devin announcements? I'm just gonna do it like this. Yeah. Nisten, what do you think? Who didn't, right? They exploded into everybody's feed, I think faster than AutoGPT a year ago or something.

    [01:19:14] Cognition Labs Showcases Devin the first AI software engineer

    [01:19:14] Alex Volkov: And so let's basically do a brief cover of the Devin announcement, and then we'll talk about what it actually means. And then I think I'll open up the space for the first time in a month, or in a year, to actually talk with people who are in the audience and want to come up and tell us about their experiences with ThursdAI.

    [01:19:29] Alex Volkov: Cognition Labs, a fairly new company, fully funded it looks like, released Devin, what they call the first fully autonomous AI software engineer. And we've seen these claims before, and some of us are very skeptical, because these demos are incredible, and then when you actually get to use them, the model loses context, etc.

    [01:19:48] Alex Volkov: And they claimed to set a new standard on the software engineering coding benchmark, SWE-bench, where I think they're outperforming all these things and getting around 18 percent, and that's a benchmark not only of writing pieces of code, but also of performing tasks.

    [01:20:08] Alex Volkov: They claim it's operating as a highly capable teammate, capable of working alongside human engineers or independently tackling tasks for their review. So one of the things that caught me by surprise, compared to something like AutoGPT before or other agents that we saw, and we've talked with multiple agent folks on the pod, we've talked with João from CrewAI, which has been very highlighted in the open source community recently, we've talked with Killian from AutoGPT, we've talked with a bunch of agent folks, is that with Devin, besides the company and the investment and everything, the UI is very polished.

    [01:20:44] Alex Volkov: The UI is actually a set of tools, and I asked a few folks with access to Devin, so if you have access to Devin, please DM me and come up and talk about this if you're listening. The UI has access to a shell, so you can see a shell, like a terminal, and the UI probably has access to a virtual machine as well.

    [01:21:03] Alex Volkov: It has a browser that you as a user can see and that Devin as an AI agent can use. So, for example, you can log in and have authenticated sessions for different things, and then Devin can use this browser. The UI has access to a code editor, basically, that you can see Devin write things in.

    [01:21:22] Alex Volkov: And you have access to a chat. And I think that the combination of these four things in the UI, plus the ability to follow Devin in real time but also scroll back to see something that Devin did a few steps before, is very powerful, and I still haven't seen anything like this.
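    As a rough illustration of the kind of tool loop being described here (shell, editor, and a chat channel back to the user), here is a minimal sketch using OpenAI tool calling. This is not how Devin works internally, which isn't public; the tool set, prompts, and model name are all my own simplifications:

```python
# Generic "model picks a tool, we run it, feed the result back" agent loop.
import json
import subprocess
from openai import OpenAI

client = OpenAI()

def run_shell(cmd: str) -> str:
    """Run a shell command and return (truncated) combined output."""
    out = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=120)
    return (out.stdout + out.stderr)[-4000:]

def write_file(path: str, content: str) -> str:
    """The 'editor': overwrite a file with new content."""
    with open(path, "w") as f:
        f.write(content)
    return f"wrote {len(content)} bytes to {path}"

TOOLS = {"run_shell": run_shell, "write_file": write_file}
TOOL_SPECS = [
    {"type": "function", "function": {
        "name": "run_shell",
        "description": "Run a shell command in the workspace",
        "parameters": {"type": "object",
                       "properties": {"cmd": {"type": "string"}},
                       "required": ["cmd"]}}},
    {"type": "function", "function": {
        "name": "write_file",
        "description": "Write a file in the workspace",
        "parameters": {"type": "object",
                       "properties": {"path": {"type": "string"},
                                      "content": {"type": "string"}},
                       "required": ["path", "content"]}}},
]

def agent(task: str, max_steps: int = 20) -> str:
    messages = [
        {"role": "system", "content": "You are a software engineer. Make a plan, then use tools."},
        {"role": "user", "content": task},
    ]
    for _ in range(max_steps):
        resp = client.chat.completions.create(
            model="gpt-4-turbo-preview", messages=messages, tools=TOOL_SPECS)
        msg = resp.choices[0].message
        messages.append(msg)
        if not msg.tool_calls:          # plain text back to the user: the "chat" channel
            return msg.content
        for call in msg.tool_calls:     # execute each requested tool and report back
            result = TOOLS[call.function.name](**json.loads(call.function.arguments))
            messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
    return "step limit reached"

# agent("Create hello.py that prints the current date, then run it.")
```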

    [01:21:39] Alex Volkov: I think for many people this broke the threshold of something that's useful. Far El, go ahead, you have your hand up.

    [01:21:46] Far El: Devin's very impressive, but of course it's not open source. I posted on top here an open source version of Devin called MetaGPT, which self reports to be better than Devin. It's up to, like, we need to do evaluations to find out, but also there are several open source communities that have formed. We're talking about a group with a dozen folks, another group with hundreds of people, who are all coordinating to replicate

    [01:22:13] Far El: Devin in open source. I think actually one of the people here, Jun Yang, in the audience that I'm seeing here, is trying to also replicate an open source Devin. So maybe you want to bring him up to discuss that. But yeah, in general, I think Devin is impressive, but the most interesting insight is potentially the fact that this is just a GPT-4 wrapper, and they've just managed to squeeze so much more out of GPT-4 than we have been able to.

    [01:22:41] Far El: Definitely a lot of interesting things to come based on just knowing that this is possible.

    [01:22:46] Alex Volkov: What you said, Far El, is very similar to when ChatGPT came out, and this was, quote unquote, just the UI, right? There was no technological breakthrough necessarily in ChatGPT's release, but the fact that it was nicely packaged, the fact that they kept sending messages back and forth to keep the state for you, the memory for you as well, broke the threshold for many folks who were not using the API for completions before.

    [01:23:10] Alex Volkov: Definitely, Junyang is more than welcome, always. LDJ, I'll get to you, and then we'll talk with Junyang about the open source effort, but I also do want to cover the rest of the stuff that got folks excited about [01:23:20] Devin. Go ahead, LDJ? Yeah.

    [01:23:25] LDJ: Yeah. Okay. Apparently there was an interview where they were asking the Devin folks about what models it is using or whatever, and apparently, they were vague about it, or maybe the interviewer just didn't get it that well, but they said an LLM with reinforcement learning.

    [01:23:43] LDJ: And that could just mean like RLHF, but I think they might be talking about like real like traditional reinforcement learning where you're actually like. Like it's the holy grail of if you have something that's coding and you have it being able to learn and compete against itself and being able to iteratively improve and things like that of complex tasks like something like that.

    [01:24:03] LDJ: So that'd be really interesting if that's the case and it seems like that's what they were maybe alluding to that they have a custom model that is trained through reinforcement

    [01:24:11] Justin Lin: learning.

    [01:24:12] Alex Volkov: Yeah, and I think it's very important to also highlight the UI they built around this and the tools that Devin is able to use. Even if it's a wrapper, their ability to provide these kinds of tools is very interesting. The browser, to me, is one of the more interesting parts as well, because I know for a fact, when I code, I log into a bunch of services.

    [01:24:32] Alex Volkov: I read their APIs, and some of the APIs and keys and everything are only locked behind a login wall, for example. And so for something like AutoGPT, or even Cursor, right? I know that I and some other folks use Cursor for coding as well; Cursor has some agentic properties. You can ask it about your code, it even edits your code inside your code editor.

    [01:24:53] Alex Volkov: And then it's able to perform some of these meta tasks, like figure out what the problem is, go and search or something. And Devin has the ability to do this. I saw one video where it decided to change the code to add debugger statements to get a better handle on what the actual issue is and then perform something.

    [01:25:13] Alex Volkov: And how should I say? The big problem with something like, let me add Slava as well, the big problem with something like AutoGPT before was getting lost. Getting lost in context, getting lost the more tasks it executes. I saw enough videos from folks who are not shills for Devin, for example.

    [01:25:29] Alex Volkov: They are happy to promote the fact that they got early access, but they're probably not paid by the Devin folks, by the Cognition Labs folks. Enough demos of them recording continuously for 20 to 30 minutes where Devin actually executes stuff based on the plan 20 minutes in. And I personally haven't seen this from a lot of other agents.

    [01:25:48] Alex Volkov: So I do want to acknowledge Justin Junyang on the stage, a member of the technical team at Qwen, the Qwen team. Hey Justin, how are you? Thanks for joining ThursdAI, BirthdAI.

    [01:26:00] Justin Lin: Yeah. Hi, Alex. Thanks for bringing me up. Yeah, I'm a member of the Qwen team. I just recently saw Devin. It's very impressive. And we were just talking about things related to code large language models. Actually, we are doing something about it, so we just posted on Twitter to say, Hey, anybody interested?

    [01:26:25] Justin Lin: I don't know. It is really

    [01:26:27] Justin Lin: hot. There are

    [01:26:29] Rohan Pandey: a lot of people who are joining us,

    [01:26:31] NA: hoping to reproduce an open source Devin. We still don't have a clear roadmap, but for now, for the model, we may think about, maybe for the first step, still using something like a closed source model like GPT-4.

    [01:26:49] NA: Admittedly, even that is maybe not enough for such complex tasks. I have heard some rumors that Devin might be built upon GPT-4 with very good prompt engineering. I don't know if this is true yet, but we may start from this to build something like a demo. And for another part, for the model, we may build something like a code large language model, especially adapted to very long context.

    [01:27:21] NA: So it can probably browse the web pages, crawl the contents, and then, based on the contents, write some code and do something complex. Yeah, these are generally some initial ideas, and we still need some time to think about what exactly to

    [01:27:40] Justin Lin: do next, yeah.

    [01:27:42] Alex Volkov: So first of all, folks, follow Junyang in this effort, and definitely we'll hear more about the open sourcing. I think, like Far El said, one of the best things when something like this comes out is that it gives a fairly clear roadmap for folks to try and replicate. And I think the roadmap should include the UI itself.

    [01:27:59] Alex Volkov: The browsing UI is very important, I think. The integrated shell is important, at least for your ability to interact with this. One thing that I also noticed with Devin is that you can actually talk with it while it performs other tasks, just like with an actual software engineer who works on your team: you can chat with the person while they perform other tasks.

    [01:28:19] Alex Volkov: I'm not sure how they necessarily achieved this. But it's very interesting where like it probably executes in several steps. They definitely built something there that's not only code execution. I think Nisten go ahead

    [01:28:28] Nisten: 300 years ago, my grandmother got automated. The machine was called, it was actually called a chain of cards machine, by Basile Bouchon, and that went on to become the Jacquard loom, and my grandmother's knitting now became 10,000 times faster. So that was 10,000x of grandmas. AGI is only 1x. And this thing you guys are talking about is 1x.

    [01:29:02] Far El: I don't know, Nisten, is your grandmother 300 years old?

    [01:29:06] Nisten: The Jacquard machine is 300 years old. Yeah, it was first made in 1725. No b******t, no, for real. And that actually used punch cards. It was called chain of cards, that's the real name of it. Not Chain of Thought, Chain of Cards, it's the same thing, it's just an instruction following Chain of Cards machine.

    [01:29:25] Nisten: And it made stuff close to 10,000 times faster than my grandma could. Now, that didn't stop my grandma from knitting. I don't know why people are freaking the heck out that now this thing can do 13.2 percent of GitHub issues. I am freaking out that we, with all of this automation, the smartest freaking Olympiad kids in the world, I ranked, let's leave that alone.

    [01:29:54] Alex Volkov: Ha.

    [01:29:58] Nisten: like we can barely, and we still have to do the other 87 percent of the work. I don't know why people are freaking the

    [01:30:04] Nisten: yeah, they said the same thing for people with Excel. It's, oh, all the programmers, or analysts, whatever the heck they were called back then, it's going to automate them.

    [01:30:14] Nisten: Did it automate them? Yeah, it automated them. Has the need for them increased? Yeah. Same thing when Copilot came out two years ago. We've been using these tools for two years. You still can't find a good JavaScript dev to hire. Dude, people are freaking the f**k out, man.

    [01:30:33] Alex Volkov: So let's actually talk about this.

    [01:30:35] Nisten: Learn to code, don't be dumb,

    [01:30:37] Alex Volkov: At least some of the conversations, Nisten, are just thread boys hyping things up. Oh, software engineering is dead, whatever. They will always keep doing this. They're gaming the algorithm on X rather than providing some value.

    [01:30:51] Alex Volkov: But there are definitely folks I saw replying to major accounts saying, Hey, I'm about to learn coding, should I even learn coding? Because to them, when somebody who doesn't know coding sees something like Devin, they're like, why do I even need to study? In a few years this will be way better than me. Even now it's way better than someone at a starting point.

    [01:31:11] Alex Volkov: And I think the answer to this is that the world will just need more code. It's very interesting that software engineers, in general as a concept, try to automate ourselves out of laziness as much as possible, right? I would sometimes spend days automating a task I could complete manually in five minutes, just because I know I'd be able to do some other stuff faster while this is getting automated.

    [01:31:32] Alex Volkov: Sometimes it's nerd sniping, but whatever. And then I think that for folks in the audience who are thinking about learning to code: learn to code. [01:31:40] The reason is, Devin will need your help to figure out what to do next. The outputs of Devin need somebody who knows how to code; none of the folks who got Devin are marketing people, complete noobs for whom it just worked.

    [01:31:53] Alex Volkov: So you do need the ability to actually run these things productively. And I think learning to code is a very important skill. If anything, it will give you a meta skill: you'd be able to do the boring stuff, the more complex stuff you'd be able to review, and you'd achieve more. And I think that's very important.

    [01:32:11] Alex Volkov: Many of us, how should I say? There are some gatekeepers in the coding community for whom the ability to code is their thing, their way to say, okay, this is how we make money. But for many people, coding is just a tool to get somewhere. That somewhere is shipping a product, creating, doing a task, doing some of these things.

    [01:32:30] Alex Volkov: That's not going to go away. If anything, this is going to get that much better. So, Slava, I saw you wanted to comment on that, and then Roei.

    [01:32:42] Nisten: Yeah,

    [01:32:42] Slava Kurilyak: I wanted to add some color to the Devin circumstance, where we find ourselves seeing at least a new approach, where it seems like GPT-4 has been, let's say, claimed at this moment to be empowered by reinforcement learning. There are now developers who are going down this path. I'll do a shout out for one of them.

    [01:33:05] Slava Kurilyak: This is a developer, his name is Rohan. I'll pin his open source project to the top. It's called Llama Gym. Feel free to check this out. This is an agentic framework for using reinforcement learning to essentially fine tune language models. Now, why would you do this? Because, from my experiments with language models, at first you can get away with prompt engineering, but at some point,

    [01:33:30] Slava Kurilyak: to mimic human like performance, you do need to fine tune. And in this circumstance, reinforcement learning has been shown to make incredible progress, especially with companies like DeepMind. And yet we haven't really seen adoption of reinforcement learning within the generative AI community, but now, with tools like Llama Gym, developers can start to bridge the two.

    [01:33:56] LDJ: Can you post that to the billboard,

    [01:33:58] Alex Volkov: Yes please post as well.

    [01:33:59] LDJ: happen to have a link or a tweet

    [01:34:01] Alex Volkov: Absolutely. Roy, go ahead and afterwards I want to acknowledge Ryan joining us on stage as well.

    [01:34:06] LDJ: Yeah,

    [01:34:06] Roei Cohen: first I just want to give Nisten props for his just incredible rant. I just enjoyed that thoroughly. I actually don't agree with you, Alex. I think that eventually we'll see coding if not go away, be abstracted enough that it would be closer to English or whatever natural language you're used to using.

    [01:34:25] Roei Cohen: The reason I'm saying that is that's been the trend so far, we've gone from Assembler, from, Jacquard Looms to Assembler to, things that are more and more abstract and, the things that happen with a single line of code in Python or in, TypeScript or whatever generic language you choose.

    [01:34:43] Roei Cohen: have so much implementation actually going behind the scenes that you're not even aware of and people for some reason are okay with that. You know what I mean? They're not losing their ever loving minds. I think that as time goes by, right? These very mechanical operations are going to be needed less and less.

    [01:34:59] Roei Cohen: But the need to solve problems, to tackle problems, to have motivation and goals, those are still going to be mostly human, though those too may change, right? I think we have to prepare ourselves for two scenarios. One where the need for actual specialized technical capabilities, like coding, might be less and less in demand to actually be effective, to be able to ship products and ship features and whatnot. But also one where the agentic behavior of the tools that we use is going to become more and more active and less passive, right?

    [01:35:36] Roei Cohen: It's not just that you're going to ask your agent to do something for you and then review it, but rather it will preemptively start working for you and solving problems for you and making PRs and, doing things of that sort, which kind of changes the way that the division of labor currently.

    [01:35:53] Roei Cohen: is, in terms of, like, how much work do humans drive and how much work do machines drive?

    [01:36:00] Alex Volkov: Yeah, thanks for that, Roei, and I definitely count on you to not agree with me, and I really appreciate the pushback as well. I want to acknowledge Ryan Carson. Ryan, I did not know that you're a follower of ThursdAI, but I have been following you and Treehouse for a while, so please start with maybe a brief introduction to who you are and what you currently do.

    [01:36:16] Alex Volkov: And I would love to hear your take about agents and coders being replaced, given that Treehouse is something you previously built and used to teach tons of people to code.

    [01:36:26] Ryan Carson: Alex, good to be here. Thank you for the invite. It's funny because I was listening to a previous episode of ThursdAI while I was working out this morning, and that's what I knew. I wish I had known about this space earlier, because it's just packed with valuable information. There's so much for us to absorb in AI. I'm so thankful for this space, and I literally add it to my calendar now, so I'm hoping to show up more often, but, so thank you for that. Um, yeah, I spent a decade of my life, I'm 47, so I spent almost one out of every five minutes of my life trying to empower people to learn how to code. And at Treehouse, I had the honor of being a part of that, and, I think we taught something like a million people how to code. And and I have a computer science degree, and I think a lot about this. And, I think, I want to acknowledge and be empathetic towards professional software developers. Because it's, it is scary and hard to see things Appearing that look like they may replace you. That's scary for everybody.

    [01:37:26] Ryan Carson: And I think we all agree, we're just seeing a reaction to that. I think we all know that's an emotional reaction. It's not necessarily logical. But I do want to acknowledge, it's just scary for people if they think they're going to lose their job. So that's thing one. The thing two it's interesting I got a computer science degree, then I was a web developer for a long time, and then I started companies, and then I hired engineers, and engineering managers, and CTOs and I didn't code for a long time.

    [01:37:50] Ryan Carson: And after Treehouse was acquired, I actually went back in and taught myself how to code again. And so I used ChatGPT Plus to teach me TypeScript and Next.js, and I shipped a very simple proof of concept. Hey, I just want to build on top of OpenAI's APIs. I just want to understand, soup to nuts, how this works.

    [01:38:09] Ryan Carson: And. You could say it's the dumbest thing ever. Like, why would you learn how to code again? But I think we all agree if you know how to code, it gives you this deep understanding of how things actually work, right? And I like to pull on an example here where think about building a house, right?

    [01:38:28] Ryan Carson: So you could abstract all of that away, but if you actually understand how to saw a piece of wood at a 45 degree angle, and then put it together with another piece of wood, And then you build something, it gives you a deep understanding and appreciation for the actual structure. And I think that's what's happening here.

    [01:38:47] Ryan Carson: And I just would, I actually say, please learn how to code more now than I've ever said in my whole life. Because number one, it's easier. Like, all you have to do is open any good LLM and say, I don't know how to do this. Please teach me, I'm a super newbie, I don't get any of this stuff. And for once we can be.

    [01:39:08] Ryan Carson: honest about how much we don't know and not be embarrassed about it. I always say to people, just please learn Python and then start building something. Because in the end, it will absolutely make you more powerful, even if Devon creates all the underlying infrastructure. If you understand what's basically going on, it will make you an even more powerful wielder of that technology.

    [01:39:29] Ryan Carson: So that's my little speech about why we should all keep coding, Alex.

    [01:39:33] Alex Volkov: 100 percent, Ryan. I just want to shout out that I was trying to remember where I know you from, and I visited Future of Web Apps in London, like, twice,

    [01:39:40] Ryan Carson: No way!

    [01:39:41] Alex Volkov: and I really remember you from there. It was 2012, I think, 2013.

    [01:39:45] Ryan Carson: Oh my god!

    [01:39:46] Alex Volkov: like a bunch of people. So one of like my first professional career like trips was to London to one of your amazing conferences.

    [01:39:54] Ryan Carson: makes me so happy. I just literally got goosebumps.

    [01:39:56] Alex Volkov: And so in return, it makes me very happy that you're now considered a friend of [01:40:00] the pod. Feel free to come back and chime in as well. I want to also on the topic of what you mentioned, I want to ask Junaid, who's a friend of the pod, my friend, and we run the Denver AI meetups together. Because Junaid, you basically did this thing that Ryan just discussed.

    [01:40:13] Alex Volkov: And what's your take on Devon and how easier it is to automate now with some of this stuff?

    [01:40:21] Junaid: Yeah I'm excited about it. I am one of those people that started learning to code at the beginning of last year. Literally the very beginning of last year, I started with the OpenAI API Quickstart Tutorial Guide. and immediately, moved on to trying to build things that I could run on my iPhone, and in less than two months from when I first started learning, I launched my first app.

    [01:40:45] Junaid: Yeah, I'm 15 months in, and I see Devin, and it does not in any way, shape, or form make me think, Oh, I shouldn't have done that these last 15 months. No way. It's just another tool that is going to be that much more useful for me to be able to take ideas and actually make them happen. And honestly, having built pretty much all my stuff using ChatGPT as my junior developer, yeah, this is awesome.

    [01:41:18] Junaid: You know how much less copying and pasting I'm going to have to do? So yeah, I think it's fantastic. And I think that for anybody who's on the fence or worried about whether they should learn to code, the answer is more yes now than it was before Devin came out. That's my

    [01:41:37] NA: take.

    [01:41:37] Alex Volkov: 100%. Ryan, go ahead, and then we got Slava again, I think.

    [01:41:40] NA: Junaid, wasn't it the most magical moment when you shipped that code and it worked?

    [01:41:48] Junaid: Yeah, absolutely. When I first went through the OpenAI quickstart, it's just a little naming thing, how do you name your pet using the API?

    [01:41:58] Junaid: That was awesome. But

    [01:41:59] Junaid: man the real kick was when my first app actually got approved and went live.

    [01:42:05] Junaid: And I remember standing in my kitchen and doing like the dance of joy, Whoa,

    [01:42:10] Junaid: my God, I'm on the App Store!

    [01:42:12] Junaid: Wild. It's it's such a. Such a rush. Congrats. And for sure, the things that I've built so far are not like, hey, I'm not changing industries out here. I'm not like, whatever, but I can build things.

    [01:42:26] Junaid: I can use these tools more and more to build more and more things and build better and better things. And yeah, only up, it's only going to go up.

    [01:42:36] Alex Volkov: Right on. I wanted to acknowledge Anton as well. Anton built the very popular GPT Engineer, and is also very deep into codegen. Welcome, Anton. What's your take on Devin and the renewed interest in agents running code?

    [01:42:52] Justin Lin: Yeah, thanks, Alex. Nice to be here. I think it's super exciting. We've been trying the same approach. We haven't gotten as far and as fast as Devin, but I, it's always when you've seen something actually get done, you lose this doubt that you have from people telling you like, ah, this is not possible yet.

    [01:43:12] Justin Lin: And now that you've seen it, more people believe in it, and it's still just a demo, right? It's not a product, but then your focus is just 10x. Super exciting times. And I think on this topic of should you learn to code or not: coding is one of the most fun things, but it does depend on what you want to do, what you want to achieve here in life.

    [01:43:36] Justin Lin: I think Flo Crivello, a friend who invested in our company, said that all the news headlines should just be: AGI is here soon, because that's all that matters. And I think this is a good take on what you should do with your life. AGI is here soon, so you should just do whatever makes you enjoy life.

    [01:43:55] Justin Lin: That was a lot of things, but that's my take. I could dive into the technical details. We did a deep analysis on how they're doing it at Devin compared to the things we tried and how we're doing it right now.

    [01:44:06] Alex Volkov: So give me an analysis from the perspective of the tools that they have for Devin. I think for me, definitely one of the main points was how much access it has to different tools, like I have as a software engineer. I use the browser alongside my code editor, alongside my shell. Let's talk about this a little bit.

    [01:44:24] Justin Lin: Yeah, so I want to do a shout out to Eric on our team. He built a tool called GPT ME. I think he's been working on it for two years, and now we're building GPT Engineer together. And I think basically all the ways that Devin writes code, all the tools, are available in GPT ME, but GPT ME is just a CLI

    [01:44:47] Justin Lin: tool. So the browser, running code, writing code to file, changing code. Please add if I missed some important tool that Devin has access to here; I'm running on too little sleep right now. But the biggest difference

    [01:45:03] Alex Volkov: And I think it has access to an actual machine with a terminal, so I saw that folks are able to run Llama on it and actually do inference. So that's pretty impressive on its own in terms of infrastructure access.

    [01:45:14] Justin Lin: Yeah, correct. But if you're curious to run this, you could try GPT ME, run it locally. The biggest difference is that Devin has made significant progress in making the agent stick to its plan and not get lost in confusing itself, which has been the big failure point for all the agents ever since this started early last year.

    [01:45:43] Justin Lin: And Devin is better at sticking to the plan. I'm sure it still also gets confused. And it has what we refer to as sub agents. I guess that's quite self explanatory. You have this main agent that says, oh, I should try to achieve this, and then a sub agent goes onto its own path with its own system prompt and so on.

    [01:46:04] Justin Lin: And there, I think, as always, the devil is in the details of how they've been quite successful there. But yeah, that's

    [01:46:11] Junaid: a quick summary.
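    To give a sense of what that sub-agent pattern looks like in code, here is a minimal sketch: a planner produces numbered steps, and each step is handed to a fresh agent with its own system prompt and a clean message history, so one confused step doesn't pollute the rest of the run. Purely illustrative; this is not Devin's or GPT ME's actual implementation, and the model name is a placeholder.

```python
# Planner + sub-agent sketch: each step runs with its own system prompt and clean history.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4-turbo-preview"  # placeholder model name

def ask(system: str, user: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
    )
    return resp.choices[0].message.content

def run(task: str) -> list[str]:
    # Main agent: produce and stick to a short, numbered plan.
    plan = ask("You are a planner. Output a short numbered plan, one step per line.", task)
    results: list[str] = []
    for step in [line for line in plan.splitlines() if line.strip()]:
        # Sub-agent: sees only its own step plus a summary of what's been done so far.
        results.append(ask(
            "You are a focused sub-agent. Complete exactly this one step and report back.",
            f"Overall task: {task}\nCompleted so far: {results}\nYour step: {step}"))
    return results
```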

    [01:46:12] Alex Volkov: Awesome. Thanks for joining, Anton. And folks, definitely check out Anton's feed and GPT Engineer; they're doing great things there. I want to acknowledge Rohan. Rohan, you were just mentioned by some folks with your Llama Gym. You want to chime in on Devin and how this field moves forward, and tell us about Llama Gym as well?

    [01:46:28] Rohan Pandey: Yeah, sure. Thanks, Alex. Yeah, the idea with Llama Gym is that agents originated in reinforcement learning, where they'd learn through interaction, right? They'd receive reward, they'd go around, play around with their environment, they'd explore and they'd learn. But now in the LLM age, when we have these LLM based agents, they don't really learn online in this reinforcement learning fashion. So the idea with Llama Gym is to be able to train these agents online with reinforcement learning, and it's a super simple kind of agent class that you just implement a few prompt functions on and then throw into a traditional OpenAI Gym environment.
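    The general shape of that loop, sketched with Gymnasium and an LLM standing in as the policy. This is not Llama Gym's actual interface; the environment, model, and prompt are arbitrary choices for illustration, and the collected (observation, action, reward) triples are what an RL trainer such as PPO would then use to update the model's weights.

```python
# LLM-as-policy interacting with a classic Gym environment and collecting rewards.
import gymnasium as gym
from openai import OpenAI

client = OpenAI()

def llm_policy(observation) -> int:
    """Ask the model to pick a Blackjack action: 0 = stick, 1 = hit."""
    player_sum, dealer_card, usable_ace = observation
    prompt = (f"Blackjack. Your hand sums to {player_sum}, the dealer shows {dealer_card}, "
              f"usable ace: {bool(usable_ace)}. Reply with only '0' to stick or '1' to hit.")
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1,
    )
    return 1 if resp.choices[0].message.content.strip().startswith("1") else 0

env = gym.make("Blackjack-v1")
trajectory = []                      # (observation, action, reward) triples for the RL step
obs, _ = env.reset(seed=0)
done = False
while not done:
    action = llm_policy(obs)
    next_obs, reward, terminated, truncated, _ = env.step(action)
    trajectory.append((obs, action, reward))
    obs, done = next_obs, terminated or truncated

print("episode reward:", sum(r for _, _, r in trajectory))
```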

    [01:47:02] Rohan Pandey: And it learns, it propagates rewards through reinforcement learning. In terms of code generation stuff, this is actually what I spend most of my time on at Reworkd. We do multimodal code generation for generating these web data extractors. In our code generation pipeline, it's not something where we're automating some huge stack of software engineering stuff where you have to go interact with the terminal, like Devin, and everything like that.

    [01:47:28] Rohan Pandey: Instead, it's a very specific task of generating a specific function, this structured data extraction function, for a specific website. So given some schema and a website, we go pull screenshots of the website, we go pull context from the HTML, and then this goes into a sort of agentic loop where we generate code to extract that specific data, and that goes straight into production, effectively, right?

    [01:47:55] Rohan Pandey: It goes through some human review steps, but it goes straight into production. It isn't like your copilot, it isn't something that you oversee; it is in production, from just those user inputs to code that's executed. I think Devin shows there's a lot of stuff that you can do just with GPT-4 right now, right?
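    A heavily compressed sketch of that kind of pipeline, given a target schema and a URL. Reworkd's real pipeline is multimodal, uses screenshots, and includes human review; every name, prompt, and model choice below is my own simplification.

```python
# Generate-and-test loop: ask the model for an extract() function, run it against the
# page, and feed any error back into the next attempt.
import urllib.request
from openai import OpenAI

client = OpenAI()
SCHEMA = {"title": "string", "price": "number"}     # example target schema

def generate_extractor(url: str, max_attempts: int = 3):
    html = urllib.request.urlopen(url, timeout=30).read().decode(errors="ignore")[:8000]
    feedback = ""
    for _ in range(max_attempts):
        prompt = (f"Write a Python function extract(html: str) -> dict returning keys "
                  f"{list(SCHEMA)} for this page. Use only the standard library. "
                  f"Return only code.\n\nHTML sample:\n{html}\n{feedback}")
        code = client.chat.completions.create(
            model="gpt-4-turbo-preview",
            messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content.strip("`\n ").removeprefix("python\n")
        namespace: dict = {}
        try:
            exec(code, namespace)                    # compile the generated function
            result = namespace["extract"](html)      # smoke-test it on the real page
            if set(result) == set(SCHEMA):
                return namespace["extract"], result
            feedback = f"\nPrevious attempt returned wrong keys: {list(result)}. Fix it."
        except Exception as e:                       # feed the error back into the next attempt
            feedback = f"\nPrevious attempt raised: {e!r}. Fix it."
    raise RuntimeError("could not generate a working extractor")
```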

    [01:48:15] Rohan Pandey: People didn't believe that GPT-4 agents for code generation were possible, [01:48:20] but I saw a tweet that was like, maybe all you need to get AGI is GPT-4 and some IMO gold level prompt engineering, which maybe is true. A lot of what we're doing, we've done some code fine tunes and whatnot, but a lot of the improvement has also come from just putting GPT-4 in better agentic and prompt engineering setups.

    [01:48:42] Alex Volkov: Thanks for coming up, Rohan. And I just want to acknowledge, for folks who are on stage, this doesn't often happen on ThursdAI; we usually cover the news, and we're at two hours already, but I really think that this conversation is important, and I really want to cover this and also open up to questions

    [01:48:55] Alex Volkov: on this stage on ThursdAI, to just cover the next iteration of things that are happening. And there are many people for whom even Copilot is something they don't use yet, definitely not Cursor. Cursor is absolutely the next level of copiloting to me in my code work, and I use the command K every time and I'm very happy.

    [01:49:13] Alex Volkov: Besides the one fact that it deleted half my system previously; those of you who know about this, we'll talk about this separately. But I think it's a very important discussion, specifically because, Ryan, you mentioned something before where we want to acknowledge, and I want to be compassionate to folks who are feeling this fear about their career, about the general rate of progress that's happening, not only in coding.

    [01:49:32] Alex Volkov: Coding is one simple piece of this. Writers, for example, professional writers, they look at ChatGPT and they're getting booked less, for sure, because people can now write long things better, and then review long things as well. And for many other people, seeing something like the Figure robot now with an OpenAI brain scares them, because they grew up, I don't know, watching Terminator.

    [01:49:51] Alex Volkov: I think one important piece of ThursdAI that I haven't mentioned yet, and it's very important for me in my kind of AI evangelist position, is to acknowledge that change, very fast change, scares people. And it scares people more when they don't follow everything and suddenly they see this rate of progress, like, holy s**t, this means X and X, like whatever they saw in Black Mirror, which I think ruined a generation's potential for thinking positively about the future.

    [01:50:19] Alex Volkov: And I think it's very important for us to have a conversation about this, for folks who are building code generation tools, for folks who are writing code, for folks who are learning, like Junaid, to actually show that no, it's actually not only fine, it's even better. From the perspective of code specifically, I think there's just going to be a need

    [01:50:34] Alex Volkov: for more code around the world, more automation around the world. And if you learn what these outputs can do, then I think you're gonna be even more productive. Ryan, you wanted to chime in as well? Feel free to step in. I'm gonna try to pull up one more friend.

    [01:50:48] Ryan Carson: You bet. Yeah. Thanks for having me up. I, it's so fascinating to hear about all these agentic systems and what's happening and I know we all know this is where we're going, and I tweeted out and said, as soon as you have an agent that's basically able to use the internet and a computer, like a basic human, there's so many things that you can start to tackle.

    [01:51:07] Ryan Carson: Researching cures for diseases, planning your trip to your mom's house for the summer. There's just a lot of this which ideally allows humans to level up and then leverage those tools, right? I'm always a technical optimist though, so that's probably my downfall.

    [01:51:22] Ryan Carson: Alex, I did want to say thank you for bringing me up. I've got to go. I joined Intel on Monday, so I'm helping them build a global AI developer community. So I've gotta, I've gotta go to a meeting. But I wanted to pimp your stuff for a second, though, and say the courses on Weights & Biases are really good.

    [01:51:40] Ryan Carson: And as someone who's, spent ten years of my life building courses, everybody should check out the free Weights & Biases courses. They're awesome.

    [01:51:47] Alex Volkov: Thank you so much. Thanks, Ryan, for coming up. And then let's talk about collaborating, now that you've joined Intel, the company. Definitely let's do some stuff together. The shout out is well deserved. The team behind the courses is the team that I joined, the growth ML team, and they're amazing. And a shout out to them.

    [01:52:00] Alex Volkov: Everything there is for free. And you can learn everything from fine tuning a foundational model to extracting better outputs from JSON. And it's all for free for you guys to enjoy. So definitely, wandb.me/courses, definitely check this out, and thank you to everybody who joined so far.

    [01:52:17] Alex Volkov: I try to keep this conversation going with folks. I also do wanna deliver this conversation to the folks who follow ThursdAI on the podcast as well. So if you're not subscribed, definitely subscribe. If you can help, vote with whatever amount of stars, five is preferable, on the, on, on the podcast.

    [01:52:35] Alex Volkov: I do want to acknowledge that Yam is a frequent co host, friend of the pod, and Yam, this birthday wouldn't be the same birthday without you. You want to chime in on the coding thing real quick, or on the Devon thing real quick? Before I do a recap, I would appreciate your things here as well.

    [01:52:49] Yam Peleg: First, it's amazing. It's amazing. The demos look amazing. I just wanna ask, or say, that I think the real test is how reliable it is with real world use by many people. So if anyone knows anyone who tries it and can share their experience, but outside of demos. Real life tasks, that can be anything.

    [01:53:10] Alex Volkov: I tried to get folks, yeah, I try to get folks who actually have access to Devin. I reached out to a few Thursday morning, or at least for some of them it's really hard, but we'll definitely get folks, if not the Devin folks themselves, we'll definitely get folks who have access to Devin, and we're gonna try to get access ourselves as well. Um, definitely code agents have reignited excitement this year.

    [01:53:33] Alex Volkov: I had this, this poll that I posted on my feed. AutoGPT came out less than a year ago, and it broke out with pretty much the same level of excitement, not remotely the same level of execution ability, right? Like, it didn't have any tools, etc. Fairly quickly folks got excited about the demos, then fairly quickly folks realized that, you know, Anton said, and some other folks said, it does get lost in the context after executing on a few things. Since then we've gotten incredible context length with incredible needle in a haystack ability, and these models' working memory grew.

    [01:54:09] Alex Volkov: So I asked basically on my feed, do you feel, from AutoGPT less than a year ago to Devin right now, which I think announced like a huge raise from many VCs, do you feel that agents are on the same exponential curve as other LLM stuff that we see in open source, for example? And yeah, the answers were pretty much where I am at, where the distance between something like AutoGPT and the visual examples of Devin doesn't feel to me like a year of progress, compared to the year of progress we saw in everything else, in OpenAI, in LLMs, right?

    [01:54:42] Alex Volkov: But maybe I'm wrong, and maybe I need to play with Devin to actually feel the AGI a little bit. So we'll see after we get access. We're gonna give you guys an update as well. And I think it's time for us to conclude the space. It's been a little bit over two hours as well. I will just say that before I conclude the space, for the folks who are listening on the, on the podcast, I try to recap everything we've talked about here as well.

    [01:55:05] Alex Volkov: So that's coming up. If you've missed any part of the show, please stay with us to hear the recap. And I am very happy that we have celebrated ThursdAI's birthday with all of you today in the space. It's been a great honor of mine to keep doing this and have many new folks come up to the stage, but also see some folks who we've been hosting, and friends of the pod, and I really appreciate my time here with all of you.

    [01:55:27] Alex Volkov: I'm very happy that this keeps happening, and I'm not going away anytime soon.

    [01:55:31] END OF SHOW

    [01:55:31] Alex Volkov: I think it's time to just say that, again, I really appreciate everybody here. Yam, thank you, dude. Thank you for joining from week to week. Thank you for breaking down papers for us and teaching us teachable moments from your escapades into AI and being the resident machine learning engineer.

    [01:55:45] Alex Volkov: Nisten, thank you, brother, for holding the space, holding the space when I can't keep talking as well, and joining and explaining, reading papers together, and asking questions, and co-hosting with me. Far El, thank you for being the staunch supporter of open source everything, and as much as possible.

    [01:56:03] Alex Volkov: Not believing big companies and their promises and keeping us honest in what we believe and not believe. LDJ, thank you brother for joining and explaining difficult concepts where I have no idea how to even explain them. Junyang, I really appreciate the fact that we have foundational model trainers here on stage as part of ThursdAI, so thank you, Junyang, and keep giving us amazing Qwen stuff.

    [01:56:23] Alex Volkov: As well. I really appreciate your expertise and pushing back on everything that I say, not with skepticism, but definitely with doses of realism. I really appreciate this. Everybody else who was on stage, everybody in the audience, I am floored that this keeps happening week after week, and I definitely am going to be here next week [01:56:40] to talk about whatever happens next.

    [01:56:42] Alex Volkov: I see a lot of faces in the audience that joined from week to week. Harrison definitely give Harrison a follow. His YouTube thing is great. Junaid, who just joined and talked about how he was a noob and learned from GPT 4, and now he has multiple apps. And Junaid and I are co hosting the Denver meetup.

    [01:56:59] Alex Volkov: As well. So if you're in the Denver area, please join us. We're gonna meet soon and talk about AI. I see Bo Wang from Jina AI, who often joins us to talk about embeddings as well. I see Tanishq in the audience from MedARC, a very young PhD who I appreciate, also a friend of the pod. I see Abby, I see a bunch of friends here who

    [01:57:16] Alex Volkov: know about the space way more than I could ever. And the fact that they all join and talk about this is what makes this interesting. So I really appreciate all of you one by one and everybody in the audience should give all these folks a follow and we'll see you here next week. Thank you, everyone.



    This is a public episode. If you’d like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
  • Hello hello everyone, happy spring! Can you believe it? It's already spring!

    We have tons of AI news for you to cover, starting with the most impactful one, did you already use Claude 3? Anthropic decided to celebrate Claude 1's birthday early (which btw is also ThursdAI's birthday and GPT4 release date, March 14th, 2023) and gave us 3 new Claudes! Opus, Sonnet and Haiku.

    TL;DR of all topics covered:

    * Big CO LLMs + APIs

    * 🔥 Anthropic releases Claude Opus, Sonnet, Haiku (Announcement, try it)

    * Inflection updates Pi 2.5 - claims GPT4/Gemini equivalent with 40% less compute (announcement)

    * Elon sues OpenAI (link)

    * OpenAI responds (link)

    * ex-Google employee was charged with trading AI secrets with China (article)

    * Open Source LLMs

    * 01AI open sources - Yi 9B (Announcement)

    * AnswerAI - Jeremy Howard, Johno & Tim Detmers - train 70B at home with FSDP/QLoRA (X, Blog)

    * GaLORE - Training 7B on a single consumer-grade GPU (24GB) (X)

    * Nous open sources Genstruct 7B - instruction-generation model (Hugging Face)

    * Yam's GEMMA-7B Hebrew (X)

    * This weeks Buzz

    * Weights & Biases is coming to SF in April! Our annual conference called Fully Connected is open for registration (Get your tickets and see us in SF)

    * Vision & Video

    * Vik releases Moondream 2 (Link)

    * Voice & Audio

    * Suno v3 alpha is blowing minds (Link)

    * AI Art & Diffusion & 3D

    * SD3 research paper is here (Link)

    * Tripo + Stability release TripoSR - FAST image-2-3D (link, Demo, FAST demo)

    * Story how I created competition of inference providers to get us sub 1.5s playground image gen (X)

    Big CO LLMs + APIs

    Anthropic releases Claude 3 Opus, Sonnet and Haiku

    This was by far the biggest news of this week, specifically because the top keeps getting saturated with top-of-the-line models! Claude Opus is actually preferable to many folks in blind studies over some GPT-4 features, and as we were recording the pod, LMSys released their rankings: Claude Opus beats Gemini and is now 3rd in user preference on the LMSys rank.

    Their release is vast: they announced 3 new models but only gave us access to 2 of them, teasing that Haiku is much faster and cheaper than other options in its weight class.

    In addition to being head to head with GPT-4, Claude 3 is now finally also multimodal on inputs, meaning it can take images and understand graphs and charts. They also promised significantly fewer refusals and almost 2x improved accuracy.

    One incredible thing that Claude always had was its 200K context window, and here they announced that they will support up to 1M, but for now we still only get 200K.

    We were also promised support for function calling and structured output, but apparently that's "coming soon". Still, it's great to see that they are aiming for it!

    We were all really impressed with Claude Opus, from folks on stage who mentioned that it's easier to talk to and feels less sterile than GPT-4, to coding abilities that are not "lazy" and don't tell you to continue writing the rest of the code yourself in comments, to even folks who are jailbreaking the guardrails and getting Claude to speak about the "I" and metacognition.

    Speaking of meta-cognition sparks, one of the prompt engineers on the team shared a funny story about doing a needle-in-haystack analysis, and that Claude Opus responded with "I suspect this pizza topping 'fact' may have been inserted as a joke or to test if I was paying attention"

    This split the AI folks on X in 2: many claiming "OMG, it's self aware," and many others calling for folks to relax, and that, like other models, this is still just spitting out tokens one by one.

    I additionally like the openness with which the Anthropic folks shared the (very simple but carefully crafted) system prompt.

    My personal take: I've always liked Claude; even v2 was great until they nixed the long context for the free tier. This is a very strong, viable alternative to GPT-4 if you don't need DALL-E or Code Interpreter features, or the GPT store, or the voice features on iOS.

    If you're using the API to build, you can self-register at https://console.anthropic.com and you'll get an API key immediately, but going to production will still take time and a conversation with their sales folks.
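
    If you do grab a key, a basic call through the official anthropic Python SDK looks roughly like this; the model string was current at the time of writing, so double-check it against their docs.

    ```python
    # pip install anthropic   (reads ANTHROPIC_API_KEY from the environment by default)
    import anthropic

    client = anthropic.Anthropic()

    message = client.messages.create(
        model="claude-3-opus-20240229",   # model name at time of writing; verify against their docs
        max_tokens=1024,
        messages=[
            {"role": "user", "content": "Summarize this week's AI news in three bullet points."}
        ],
    )
    print(message.content[0].text)
    ```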

    Open Source LLMs

    01 AI open sources Yi 9B

    The announcement claims that "It stands out as the top-performing similar-sized language model friendly to developers, excelling in code and math." But it's a much bigger model, trained on 3T tokens, and I find it confusing to create a category of models between 7B and almost 12B.

    This weeks Buzz (What I learned with WandB this week)

    We're coming to SF! Come join Weights & Biases in our annual conference in the heart of San Francisco, get to hear from industry leaders about how to build models in production, and meet most of the team! (I'll be there as well!)

    AI Art & Diffusion

    Last week, just last week, we covered the open sourcing of the awesome Playground 2.5 model, which looked really good in user testing. I really wanted to incorporate it into my little demo, but couldn't run it locally, so I asked a few friends, and I gotta say, I love how competitive yet open the inference providers can get! Between Modal, Fal and Fireworks, I somehow started a performance competition that got these folks to serve the Playground 2.5 model at sub-1.5 seconds per generation.

    I recorded the story to highlight the awesome folks who worked on this; they deserve the shoutout!

    You can try super fast Playground generation on FAL and Fireworks

    Stability releases Stable Diffusion 3 research paper + Model coming soon

    Stability released the research paper for SD3, the latest iteration of their flagship image model. While this field is getting a little saturated (we now have DALL-E, MidJourney, Adobe Firefly, Playground, SDXL, Stable Cascade and Ideogram), SD is definitely aiming for the title.

    They released a few metrics claiming that on user preference, visual aesthetics, typography and prompt following, SD3 beats all of the above.

    They also mentioned the architecture, which is MM-DiT, a multimodal diffusion transformer architecture (DiTs were used for Sora from OpenAI as well), and that they used 50% synthetic captions generated with CogVLM, which is quite impressive.

    Emad has mentioned that access to SD3 will start rolling out soon!

    TripoSR (Demo)

    We previously covered LUMA's models for text-to-3D generation, and now we have image-to-3D, open sourced by the folks at Tripo and Stability AI.

    TripoSR is able to generate 3D shapes from images super, super fast, and @blizaine demonstrated a very nice flow for using these models to actually bring 3D objects into your environment in a few steps.

    And that's it for today folks, we of course chatted about a LOT more stuff, I really welcome you to listen to the episode and skip around in the chapters, and see you next week, as we celebrate ThursdAI's birthday (and GPT4 and Claude1) 🎉

    P.S - as I always do, after writing and editing all by hand (promise), I decided to use Opus to be my editor and tell me how my writing was, what I forgot to mention (it has the context from the whole transcription!), and to suggest fixes. For some reason I asked Opus for a message to you, the reader.

    Here it is, take it as you will 👏

    Full Transcript for the deep divers:

    [00:00:00] Alex Volkov: Right, folks. So I think recording has started. And then let's do our usual. Welcome. Welcome, everyone. Those who know the sound from week to week. This is Alex Volkov. You're listening to ThursdAI, March 7th. I'm an AI evangelist with Weights Biases, who you can see here on stage as well. So, you know, you see the little square thing, give it a follow. Follow us on socials as well. And, uh, today is obviously Thursday.

    [00:00:45] Alex Volkov: Uh, Thursday was a lot of stuff to talk about. Um, so, let's talk about it. Uh, I think, I think, um, our week is strange, right? Our week starts at the Friday. Almost, not even Friday. The updates that I need to deliver to you start at the end of the previous ThursdAI. So as, as something happens, uh, and I, I have a knowledge cutoff, actually, at some point we considered calling this podcast knowledge cutoff.

    [00:01:14] Alex Volkov: Um, I have a knowledge cutoff after Thursday afternoon, let's say when I start and send the newsletter, but then AI stuff keeps happening. And, uh, Then we need to start taking notes and taking stock of everything that happened and I think on Friday We had the the lawsuit from Elon and there's a whole bunch of stuff to talk about and then obviously on Monday We had some big news.

    [00:01:37] Alex Volkov: So As always I'm gonna just run through all the updates. There's not a lot today There's not a ton of updates this week, but definitely there's a few interesting things. Let me un save as well And then I'll just say hi to a few, a few of the folks that I got on stage here to chat. Um, we got Vic, and Vic is going to give us an update about, about something interesting. Uh, Vic, feel free to just unmute and introduce yourself briefly. And then we're going to go through the updates.

    [00:02:07] Vik: Hey, my name is Vivek, uh, I've been training ML models for the last two years or so. Um, recently released a new model called Moondream 2. It's a very small vision language model that excels at a lot of real world use cases that you could use to build computer vision applications today, so I'm very excited to chat about that.

    [00:02:30] Alex Volkov: Awesome. And, uh, we have Akshay as well. Akshay, it's been a while since you joined us. What's up, man? How are you?

    [00:02:36] Vik: Greetings of the day everyone, and it's lovely to join again. Uh, I have been listening, I have been here in the audience. Uh, for each and every ThursdAI, and, uh, I've been building some exciting stuff, so I've not been joining much, but, uh, things are going great.

    [00:02:54] Alex Volkov: Awesome. And, uh, for the first time, I think, or second time we're talking with Siv. Hey, Siv.

    [00:03:01] Far El: Hey, how's it going, everyone? Uh, just a little background on me. Um, I come from startups and from Amazon Web Services. Um, I've been in the AI space for the last six years. And I'd love to be able to chat today about social algorithms and, uh, researchers having trouble with, uh, socials, particularly Twitter. Anywhere else where you're trying to distribute your models?

    [00:03:30] Alex Volkov: Yeah, so we'll see if we get to this. The setup for ThursdAI is usually just, uh, updates and conversation about updates, but if we get to this, uh, definitely we'll, we'll, we'll dive in there. Um, right, so folks, with this, I'm gonna say, um, uh, that we're gonna get started with just an update, and then I think Nisten will join us in a second as well.

    [00:03:50] Alex Volkov: Oh, I see somebody else I wanna, I wanna add.

    [00:03:55] Alex Volkov: So, here's everything for March 7th that we're going to cover today. Um, so in the area of open source, we didn't actually have a ton of stuff happen, um, up until, I think, yesterday and today. So, the most interesting thing we're going to talk about is, um, the company O1AI, um, which is a, The folks who released YI 34b, and we've talked about YI and the new Hermes kind of updates for YI as well.

    [00:04:23] Alex Volkov: They released a new 9 billion, 9 billion parameter model, which is very competitive with Mistral and the like. Um, and then also the new company, newish company called Answer.ai from Jeremy, Jeremy Howard, if you know him, and Johno Whitaker, and they collaborated with Tim Dettmers from QLoRA, and they released something that lets you train a 70 billion parameter, a 70 billion parameter model, at home.

    [00:04:51] Alex Volkov: We're going to chat about this a little bit. Um, even though today I saw another thing that is kind of around this area, so we're going to have to go and find this and discuss how these huge models are now being able to get trained at home as well. Uh, very brief open source stuff, then we're going to talk about big companies and obviously, um, actually going to put cloud last because we're going to talk about cloud probably a lot.

    [00:05:16] Alex Volkov: But, uh, in the big companies area, we will not be able to escape the drama that Elon Musk sues OpenAI. And then the OpenAI response, we're going to chat about this as well. Excuse me. Oh yeah, this is going to keep happening, just one sec. Um, maybe we'll briefly mention that Logan has left OpenAI, and for a brief period of time, he and Ilya had the same, um, bio on Twitter, not anymore, but very interesting as Logan starts to post some stuff as well.

    [00:05:46] Alex Volkov: Um, I really want to chat about the Google employee who was charged with AI secret trading, uh, and received like a CTO position in China. That's a very interesting update as well. And, uh Inflection from, uh, there we go, we have Nisten as well, uh, Inflection just released an update today, which is kind of like breaking news, uh, a 2.

    [00:06:09] Alex Volkov: 5 update, and they, they say they come to GPT 4 and Gemini equivalent, uh, performance level, which remains to be seen, and I've tested this a little bit, and I definitely want to chat about this as well. Uh, in the vision and video, and We have only the one thing, but we have the author of said thing here. Uh, so I haven't seen any, anything else besides Moondream and we have Vic here.

    [00:06:33] Alex Volkov: We're going to talk about Moondream 2, and how you can use this and what we can, we can use it for. Um, voice and audio. There's something that probably didn't happen for the past week, I think it happened a little bit before, and I don't have access yet, but Suno, if you guys know Suno, released the alpha, and there's a bunch of videos floating around of their songs with, like, the V3 alpha of theirs, and it's quite something. If I'm gonna be able to find those tweets and pin them for you, that's gonna be a mutual listening. Maybe I can actually find the tweet to, to actually play this for you.

    [00:07:07] Alex Volkov: We'll see if the multimedia department will work. Um, and I think in AI art and diffusion stuff, there's a bunch to talk about. Um, there is, uh, Stable Diffusion 3 research paper was released, and we've talked about Stable Diffusion 3 a little bit. After the announcement, and we haven't covered the research paper, we can chat about the research paper.

    [00:07:29] Alex Volkov: But also, potentially today, Emad is going to open some invites, as he mentioned on X as well. So, I'm ready with the breaking news button there. Stability, also in the news, they released a collaboration with Tripo, which created a very fast image to 3D model called TripoSR. And that's been very cool, and there's a few very viral examples of, of said thing, uh, floating around, so definitely worth talking about this as well.

    [00:07:57] Alex Volkov: And I think, uh, Nisten has just joined us, hey Nisten, and you just shared that, um, that we can train a 70 billion parameter, oh, 7 billion parameter at home with 24 gigs of memory, right? Uh, GaLore. Nisten?

    [00:08:17] Nisten: so, so it's a combination of a lot of [00:08:20] techniques that people have been using. And, uh, I'll try to pin it up in a second. But the. The research is that now you can train one from scratch. Not Finetune. Start one from scratch. Start your own. So this is why it's pretty, um, it's relatively groundbreaking.

    [00:08:40] Nisten: And they released a repository for that as well. So it's not simply just a paper. They have a code base. It's pretty legit.

    [00:08:50] Alex Volkov: So I guess let's, let's get into the open source stuff, um, and then we'll get to the open source, and then we're going to discuss everything else, because I think the main, the bread and butter of this discussion is going to be, is going to be, um, Anthropic. Anthropic's, uh, coming back to the limelight, but let's, let's start with, let's start with open source.

    [00:09:09] Alex Volkov: Where's my open source button?

    [00:09:27] Alex Volkov: Alright, so I guess, uh, Nisten, welcome, uh, and I guess let's start with, with Galore, uh, as much as we can. We can get from the, from the release, a fairly, fairly new release as well. And I think it's connecting to the other, uh, to the other thing from Answer. ai, but let's start with Galore. Um, so basically, these folks released something called Galore, which is, um, kind of playing on the same, on the same LoRa, QLoRa stuff.

    [00:09:52] Alex Volkov: Uh, what are some of the techniques they're adding there? I'm, I'm trying to, to take a look as I'm learning. Uh, Nisten, do you have any, any Any info to share with us about this?

    [00:10:05] Nisten: Yeah, yeah, same. more for an actual full paper reading because I have not read it entirely. Mainly looking at it again, it looks like it's, uh, it's another stack of tricks like most good projects are, uh, but it is the, it enables a very, very large capability. And that is that now you can make your own full LLM from, from nothing.

    [00:10:29] Alex Volkov: So not a fine tune.

    [00:10:31] Nisten: Uh, yeah. Not a fine tuned, not initiated weights. You just, you just start from, uh, from nothing. So, it's I see that it uses, uh, like it offloads a lot of the weight activations and offloads some of them on, uh, on CPU memory. And I know there are options in Axolotl, which is the Docker container that people use to train, that you can also offload on very fast NVMe drives.

    [00:10:55] Nisten: So if you have like very fast PCI Express NVMe storage, you can kind of use that as another RAM for, for the training. So this combines all of those. And then some on top and the end result is, is very impressive because you can train a very capable model. And, uh, yeah, again, pending further, uh, research and stuff.

    [00:11:21] Nisten: But I think this is one of those repositories that, uh, a lot of people will use or it's likely to.
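
    For the curious, the core trick in the GaLore paper is to project each layer's gradient into a low-rank subspace before the optimizer update, which is where the memory savings come from. Here is a toy, single-matrix sketch of that idea in plain PyTorch; it is not the official galore-torch implementation, and it swaps Adam for plain SGD to keep things short.

    ```python
    import torch

    @torch.no_grad()
    def galore_step(weight, grad, proj, step, rank=128, update_gap=200, lr=1e-3, alpha=0.25):
        """One simplified GaLore-style update for a single 2D weight matrix.

        `proj` is the current projection matrix (or None); it is refreshed from the
        gradient's SVD every `update_gap` steps, roughly as described in the paper.
        """
        if proj is None or step % update_gap == 0:
            # Take the top-`rank` left singular vectors of the gradient as the projector.
            U, _, _ = torch.linalg.svd(grad, full_matrices=False)
            proj = U[:, :rank]                 # shape (m, r)

        low_rank_grad = proj.T @ grad          # shape (r, n): optimizer state lives in this small space
        # A real implementation keeps Adam moments on low_rank_grad; plain SGD here for brevity.
        weight -= lr * alpha * (proj @ low_rank_grad)   # project the update back to full size
        return proj
    ```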

    [00:11:30] Alex Volkov: Yeah, and I think this adds to the, so this, this is kind of in the same vein as the next thing we're going to chat about, and, um, um, I actually can't find any mention of this on X, believe it or not, so not everything is fully on X. I just got a link, uh, to this from, from, uh, Omar, uh, from Hugging Face. And AnswerAI is a new research lab, um, that Jeremy Howard, uh, if you guys are not familiar with Jeremy Howard, hopefully everybody is, but if you're not, um, I guess look him up.

    [00:12:04] Alex Volkov: Um, Jeremy, uh, joined Answer. AI, like, um, I think around NeurIPS he was talking about. They got funded, I think, 10 million dollars. And, um, they released their first project, a fully open source system, uh, that can efficiently train a 70 billion large language model on regular desktop computers with two or more gaming GPUs.

    [00:12:27] Alex Volkov: They're talking about RTX 3090 or 4090. Um, which, you know, compared to, um, Nisten, what you just shared, I think that sounds very impressive. Um, they combine FSDP, which is, I'm not that familiar with FSDP, with, uh, QLoRA, and, uh, they brought kind of the, the CUDA Avengers to, to the flow. So Jeremy Howard obviously.

    [00:12:52] Alex Volkov: Um, I think fast.ai, right? And then Kaggle, I think, competitions are definitely behind Jeremy. And then they brought Tim Dettmers from QLoRA, and we've covered QLoRA multiple times, um, very efficient methods. And then they also brought Hugging Face's Titus von Koeller, and, um, they brought the CUDA Avengers in there to, to basically combine a bunch of techniques to let you train 70 billion parameters.

    [00:13:20] Alex Volkov: I see we have Yam joining us. Hey Yam, did you see the Answer. ai stuff that I'm covering or is this new to you?

    [00:13:26] Yam Peleg: No, no, all new to me.

    [00:13:28] Alex Volkov: Oh wow, okay, so I need, I need, uh, I would love your reaction in real time. Let me DM you this real quick because, um, The number of, actually, let me, let me paste this in the link below so we can actually paste this up.

    [00:13:43] Alex Volkov: Um. Yeah, there we go. Okay. So it's now pinned to the top of the space for folks to, to find out. I wasn't able to see any, uh, update on X from any of them, which is very interesting. Um, and the, the very interesting idea is that, you know, all of these systems and all of these models, 70 billion models, they cost an insane amount of money.

    [00:14:07] Alex Volkov: And now these folks are claiming that for under $10,000, you'd be able to train something like a 70B at home. Which, I'm not training models, but I know that some folks here are. And, um, I assume that this is a very big unlock in capability. Um, which, which is what Answer.AI is trying to achieve.

    [00:14:32] Alex Volkov: Let's see what else is very interesting here. Um, just something about Answer. AI generally. Uh, they claim that they're like an unusual type of organization. I actually tried to ask Jeremy a couple times what did this mean. Um, and, uh. They, they claim to be a for profit, like, lab, R& D lab, and, um, more in spirit to 19th century labs than today's AI research groups, and, um, I think Eric Ries and Jeremy Howard launched this in Europe, um, and, I think, I'm actually not sure what's the, the, how much did I say?

    [00:15:14] Alex Volkov: Um. What are they up against? But the first release of theirs is the open source OS, fully open source. Uh, that includes one of the, like several of the top people in the industry, uh, to create something that wasn't possible before. Um, and I think it's remains to be seen. They didn't release any metrics, but they said, Hey, we're about to release some metrics, but, um, this keeps improving from week to week.

    [00:15:39] Alex Volkov: So we actually didn't release any metrics. Go ahead Nisten.

    [00:15:43] Nisten: Sorry, is this from Answer. ai? They said they were going to release one, or? They

    [00:15:49] Alex Volkov: think, already. They didn't release metrics, uh, for the training. Uh, but I think the, the whole repo is open source. Yeah.

    [00:15:58] Nisten: released an open source OS, or?

    [00:16:03] Alex Volkov: Yeah, yeah, open source, FSDP/QLoRA. Um, and I think

    [00:16:03] Nisten: Oh, okay, so it's not a real operating system, it's another,

    [00:16:07] Alex Volkov: It's, well, they call it an operating system, but yeah,

    [00:16:10] Nisten: Oh, okay,

    [00:16:11] Alex Volkov: it's not like Linux competitive.

    [00:16:12] Nisten: okay, I thought it was like an actual one. Okay, actually, go ahead, because there are some other huge hardware news that I wanted to quickly cover.

    [00:16:23] Alex Volkov: Go ahead,

    [00:16:23] Yam Peleg: Yeah,

    [00:16:23] Vik: I just wanted to add about this Answer.ai thing, that they have released this system that you guys were talking about, which basically claims to be able to train a 70 billion parameter model on only two 24 [00:16:40] GB GPUs.

    [00:16:40] Vik: So basically, you know, two 4090s and you can train a 70 billion parameter model, which is mind boggling if you think about it. But, uh, I tried to find like how to get access to this. So I was still not sure if this is fully supported in every, uh, rig and system. So that is something I wanted to mention.

    [00:17:00] Alex Volkov: Yeah.

    [00:17:00] Nisten: By the way that that has been, oh, sorry.

    [00:17:02] Nisten: That, that has been do, uh, doable for a while, because QLoRA actually trains it all in four bit. And, uh, there are only like a few tricks, which you can also apply if you go to Axolotl, uh, the directory. You, you can also do that on your own if you do a four-bit QLoRA training and you just say, offload all the gradients and all this stuff; you can also do that with the 48 gig, uh, stuff.

    [00:17:26] Nisten: But, uh, again, I'll look into the actual directory instead.
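
    To make the QLoRA half of this concrete, here is the standard Hugging Face transformers + peft + bitsandbytes recipe for loading a base model in 4-bit and attaching LoRA adapters. The Answer.AI work is specifically about sharding this across two consumer GPUs with FSDP, which this snippet does not show; the model id below is just a placeholder for whichever 70B base you actually have access to.

    ```python
    # pip install transformers peft bitsandbytes accelerate
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,                      # quantize base weights to 4-bit NF4
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
    )

    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-70b-hf",            # placeholder 70B base model (gated; needs access)
        quantization_config=bnb_config,
        device_map="auto",
    )

    lora = LoraConfig(
        r=16, lora_alpha=32, lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()          # only the small LoRA matrices get gradients
    ```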

    [00:17:32] Alex Volkov: Right, so, um, Nisten, you mentioned some hardware news you want to bring? Go ahead.

    [00:17:39] Nisten: Yep. Okay, so we have two hardware news, but they are actually kind of related. Uh, first of all, uh, Tenstorrent, the company by legendary computer scientist Jim Keller, who worked on the iPhone chip, AMD, who brought AMD back to life. Uh, legendary computer scientist, and has been working on Tenstorrent, which is another, uh, accelerator, which also does, does training.

    [00:18:07] Nisten: So, uh, so they released these cards, and I'm not sure what the capabilities are, uh, but I saw that George Hotz, uh, from TinyCorp, uh, posted them, and, uh, they are actually, so I just wanted to give them a big shout out to actually making them commercially viable, and it's just something you can buy, you don't have to, uh, You know, set up a UN meeting for it, right?

    [00:18:31] Nisten: And get the votes and stuff. You can just go and buy it. So, that's pretty awesome of them, and I wish more companies did that. The second news is also kind of huge, because one of the engineers that left Tenstorrent last year now started a startup here in Toronto. And this has been an idea that's been around for some time and discussed privately and stuff.

    [00:18:59] Nisten: Uh, they're making AI chips. Again, they do not. These ones do not do training, but they're going to make them hard coded, which will be the judge of how much that makes sense given the how rapidly models improve. But there is a business case there because the hard coded chips, they can perform literally a thousand to 10, 000 times faster.

    [00:19:25] Nisten: So

    [00:19:26] Alex Volkov: you say hard coded, is that one of those, like, transformer specific chips you mean?

    [00:19:33] Nisten: no, the entire weights are etched into the chip and you cannot change them. So the benefit of this is that you can get up to a thousand to ten thousand times faster inference. So we might end up with a case where, according to calculations from What Sam Altman said on how much chat GPT serves in a single day, which is a hundred billion tokens, and that works out to about 1.

    [00:20:02] Nisten: 4 million tokens per second. We might very soon, like in a year or two or sooner, be in a spot where we have this company's using 60 nanometer chips. We might have a single chip pull the entire token per second performance of all of global chat GPT use. I don't know if that includes enterprise use, but that's how fast things are accelerating.

    [00:20:29] Nisten: So that's the, that's the benefit of, uh, yeah, that's the benefit of going with a hard coded chip. So yeah, call, uh, inference costs are, um, are dropping in that

    [00:20:43] Alex Volkov: You also mentioned George Hotz and, uh, he also went on a, on a, on a rant this week again. And again, I think, do you guys see this? Um, the CEO of AMD that doesn't use Twitter that much. But she replied to one of him, uh, one of his demands, I think, live demands, and said, Hey, uh, we have a team dedicated working on this.

    [00:21:05] Alex Volkov: And then we're gonna actually make some changes in order to get this through. So, I love it how, uh, George Hotz, um, folks probably familiar with George Hotz in the audience, um, should we do a brief, a brief recap of George Hotz? The guy who hacked the first iPhone, the first PlayStation, then, uh, built a startup called Comma.

    [00:21:25] Alex Volkov: ai to compete in autonomous driving, and now is building tiny, uh, we mentioned tiny boxes ready to ship, Nisten, last time, and I think that paused because they said, hey, well, we don't have enough open sourcing of the internal stack of AMD, which led the CEO of AMD, Linda, or Lisa? I'm very bad with names.

    [00:21:46] Alex Volkov: I think Linda, to reply and say, hey, we have dedicated teams working on this. I actually do want to go find this tweet. Go ahead, Nisten.

    [00:21:57] Nisten: Yeah, so there has been a big misconception in the software industry that, um, a lot of the, the code monkey work is something that, you know, you just hire someone to, like, clean your toilets and, and do it. But, in fact, the reason that NVIDIA has a 2 trillion valuation, and will beat Saudi Aramco, is because their toilets are a lot cleaner in terms of the software.

    [00:22:27] Nisten: So, the CUDA software is a lot more workable, and you can do stuff with it, and it doesn't have the bugs. So, in essence, what George Hotz is doing by pushing to open source some key parts, which some people might freak out that China might steal them, but they've already stolen everything. So, it really doesn't, doesn't matter that they're very small hardware parts, but they make a huge difference in developers being able to.

    [00:22:56] Nisten: to use that software, and those parts are buggy. So, in essence, like, George Hotz, with this stupid code monkey fix, might double or triple AMD's stock

    [00:23:07] Alex Volkov: Yeah,

    [00:23:08] Nisten: Just because he's getting in there, and he's cleaning that crap code out.

    [00:23:14] Alex Volkov: and he's popular enough to pull attention from the CEO of this company to actually come and react and, you know. One of the reasons I love X is that I think, um, uh, she retweeted their official tweet. I think there's more folks commenting on and reacting to her, um, comment, and that's on top of the space now, uh, than the actual kinda tweet itself.

    [00:23:37] Alex Volkov: Which is, I think, a good type of ratio, or ratio, yeah. I think, uh, more hardware news, I think we're satisfied with. Oh, yeah, yeah. The, the, the only other hardware news related to this, 'cause, Nisten, I think you mentioned Saudi Aramco. Uh, we chatted with the Groq folks, with a Q, not with a K. The, the, uh, LPU chip.

    [00:23:58] Alex Volkov: And they're like super fast, uh, inference speed, and I think this week they showed that they have a collaboration with, I think, Saudi Aramco, um, about bringing AI. Um, and I saw a few, a few folks post about this and, um, if that's of interest to you, we had a full conversation with the Groq team. They also, they also, um, released, kind of, uh, they had a waitlist and many, many people, I think the waitlist jumped after we chatted with them at the peak of their very viral week, which started with match rumor going, going off.

    [00:24:32] Alex Volkov: Uh, and then I think they said something about, they had like 50 or a hundred waitlist signups before this. And then the week after they had like 3,600 a day or something like this. So they revamped the whole system. And now, you can actually sign up with a self-served portal to Groq, and uh, let me see if I can find this tweet for you.

    [00:24:55] Alex Volkov: So you can actually now go and sign up, um, to Groq yourself, [00:25:00] they have a nice console, very reminiscent of, um, every other, like, console out there. You can create an API key, very simple, so no longer like a manually, manual approval of, um, Groq. I can't find this tweet though, so give me, give me just a second.

    [00:25:22] Alex Volkov: So, yeah, they, they're, uh, collaborating with, with Saudi Aramco. Go ahead, Nisten, real quick.

    [00:25:28] Nisten: Uh, yeah, just really quickly, the part that I missed was that, uh, the fix that George Hotz is doing for AMD, that's to enable distributed training. Because they cannot distribute training across GPUs because it crashes. So it's pretty important. Uh, yeah, and those are my comments on that.

    [00:25:48] Alex Volkov: Awesome. Okay, so I, I found the tweet. Uh, so if, if you follow this tweet, the, the kind of the, the quoted tweet there is, uh, getting you to the Groq console. You get like two weeks for free, and you get API access to this like incredibly fast inference, inference machine from Groq.

    [00:26:05] Nisten: I think Far El and Yam wanted to say something on it.

    [00:26:10] Alex Volkov: Yeah, go ahead.

    [00:26:11] Yam Peleg: Yeah, I got a lot of technical issues. So if you can go before me, I'll try to fix it.

    [00:26:19] Alex Volkov: You're coming through finally, loud and clear. Far El, if you wanted to comment, go ahead, man.

    [00:26:30] Alex Volkov: Alright, um, looks like Far El is also, um, not available. Okay, I think we're moving

    [00:26:38] Vik: touch on this for a sec. Um, so Groq has a white paper out about how they've designed their chips, and it's super interesting. I'd strongly recommend everyone go read it. Uh, they've basically, from the ground up, rethought how, uh, inference-oriented compute should work. It's a fascinating read, and kind of surprising that they're sharing all of those details.

    [00:27:00] Vik: One would think they'd keep it proprietary.

    [00:27:05] Alex Volkov: Yeah, we had a full conversation with them. It is fascinating. Again, you know, for, for the level of discussion that we have here, um, we, you know, honestly, we couldn't dive like super, super deep, but I've played with it, and the demos I was able to do, uh, Vic, I don't know if you had the chance to see, uh, they're only possible with almost instant, uh, speed.

    [00:27:28] Alex Volkov: You know, guys, what, like, even though I love the Groq team, and we're collaborating with them, we're gonna do some stuff with them as well, um, it turns out that some use cases need inference speed, like a lot of inference speed on big documents, and I think that's what Groq is definitely incredible with.

    [00:27:49] Alex Volkov: You take Mixtral and you dump a bunch of tokens in, and then you get like a super fast reply. So I was actually able to get a transcript in there for all of ThursdAI, and to get chapters within less than like 3-5 seconds, which is ridiculous. For the demo that I built, I actually didn't need inference speed.

    [00:28:09] Alex Volkov: I did need inference speed, but not as much as I needed a faster response on smaller kinds of prompts multiple times. And I noticed that even though their inference speed is incredible, their latency is not great, probably because they're still fairly young in this. And I went and looked, and Together also offers Mixtral over API.

    [00:28:31] Alex Volkov: Not Together, sorry. Together also does this, but specifically Perplexity. If you use Perplexity for search, you may not know that they also have an API that you can use, and they serve Mixtral and Mistral, and I think some other open source models and some of theirs. Um, and they keep improving their scores there, and specifically they're now up to 200 tokens per second for Mistral and Mixtral, which is impressive.

    [00:28:56] Alex Volkov: And, you know, um, they don't have custom hardware, and they're getting 200 tokens per second, which is ridiculous. But what I noticed is Perplexity's web engineers, because they're now rumored to be a unicorn. I don't know if that's a rumor, so that's not confirmed. But their web engineers are really top notch.

    [00:29:16] Alex Volkov: And so it turns out that if I use Perplexity's API for Mixtral, I get fewer tokens per second. So I get less than half, right? So Groq is at around 500, um, Perplexity is around 200. But I actually get better performance, because I need kind of low latency on the request itself, and Perplexity is better at this.

    [00:29:36] Alex Volkov: Um, obviously something Groq can and will fix. And also, the stuff that the Groq team told us was like, they're only scratching the itch. And Nisten, you mentioned something with them in the conversation that I wanted to repeat, which is that they're also working on figuring out the input latency, how fast the model not just spits out tokens, but processes the whole prompt input, which is a big deal, especially for long context prompts.

    [00:30:00] Alex Volkov: And they said that they're looking at this and they're gonna release something soon.

    [00:30:07] Nisten: Yeah, that's something that the NVIDIA cards excel at, and something that's holding back CPU based inference, because the prompt evaluation is, is, is slow. So, yes, it's not an easy problem to solve, but their chip is already so fast that the 3 to 1 ratio does not hurt them as much. Whereas With NVIDIA, the chips are slower and stuff, but they have like a 10 to 1 ratio, so if you're running at 100 TPS, your prompt eval is going to be like over, over a thousand.

    [00:30:42] Nisten: So it's going to read, if you dump in like 10,000 tokens, it's going to read them in 10 seconds or less. Usually it's a few thousand with NVIDIA, but I'm not sure actually, because when you dump in a huge amount of text in Groq, it does not take multiple seconds to evaluate it. It's like instant,

    [00:31:04] Alex Volkov: It's quite, it's quite fast, yeah.

    [00:31:06] Nisten: yeah, so I'm not too sure that that needs some proper benchmarking to say for sure.
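
    The distinction being drawn here, request latency (time to first token) versus raw generation throughput (tokens per second), is easy to benchmark yourself against any OpenAI-compatible endpoint. A rough sketch follows; the base URL and model name are placeholders for whichever provider you want to test, and chunk counting is only a rough proxy for tokens.

    ```python
    # pip install openai  (any OpenAI-compatible endpoint works with this client)
    import time
    from openai import OpenAI

    # Placeholder endpoint and model id; swap in whichever provider you want to test.
    client = OpenAI(base_url="https://api.example-provider.com/v1", api_key="YOUR_KEY")

    start = time.perf_counter()
    first_token_at = None
    n_chunks = 0

    stream = client.chat.completions.create(
        model="mixtral-8x7b",  # placeholder model name
        messages=[{"role": "user", "content": "Write 200 words about GPUs."}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.perf_counter()  # latency: time to first token
            n_chunks += 1  # rough proxy: one streamed chunk is roughly one token

    if first_token_at is None:
        raise SystemExit("no tokens returned")
    ttft = first_token_at - start
    gen_time = max(time.perf_counter() - first_token_at, 1e-6)
    print(f"time to first token: {ttft:.2f}s, ~{n_chunks / gen_time:.0f} tok/s after that")
    ```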

    [00:31:11] Alex Volkov: Yep. So, uh, speaking of Groq, let's, let's talk about the other Grok, but before that, you guys want to acknowledge, like, what's, what's going on with the rumors? Far El, you, you just texted something. I'm seeing Foster post something. Uh, what's, what's going on under, under the current of, of the Twittersphere?

    [00:31:27] Alex Volkov: Um,

    [00:31:28] Far El: Just, just speculation at this point, but, uh, you know, you know, those, uh, those people that, uh, that, uh, leak, you know, uh, stuff about OpenAI and all these AI companies, and most of the time, some of them are, are right. Uh, of course we don't see what they don't delete,

    [00:31:49] Alex Volkov: yeah.

    [00:31:50] Far El: uh, uh, yeah, like some of them are saying right now that, uh, there's like a rumor that GPT 5 is dropping.

    [00:31:57] Far El: That GPT

    [00:31:58] Alex Volkov: Say, say this again slower, because

    [00:32:01] Far El: 5 is dropping, that

    [00:32:02] Alex Volkov: there is a rumor that GPT 5 is dropping today. Wow. All right. Um, yeah. That's, that's quite, and I've seen this from like several folks, but

    [00:32:11] Far El: Could be complete b******t, right?

    [00:32:12] Yam Peleg: But yeah.

    [00:32:14] Alex Volkov: well, I'm ready with my button. I'm just saying like, let's acknowledge that there's an undercurrent of discussions right now with several folks who are doing the leaking.

    [00:32:22] Alex Volkov: Um, and then if this drops, obviously, obviously we're going to do an emergency, uh, and convert the whole space. I will say this, GPT 4 was released almost a year ago, like less than a week to the year ago, March 14th. Um, Claude, I actually don't remember if Claude 1 or Claude 2. I think it was Claude 1 that released the same day, that people didn't even notice because GPT 4 took, took the whole thing.

    [00:32:52] Alex Volkov: Um, and now, like, Claude releases theirs, um, which we're gonna talk about, so I won't be surprised, but let's talk about some other stuff that OpenAI is in the news for. And then, and then if, if anything happens, I think we all have the same, uh, the same profiles on X, uh, on notification. So we'll get the news as it comes up.

    [00:33:13] Alex Volkov: And we love breaking news here in, in, in, in ThursdAI. Okay,

    [00:33:17] Nisten: Yeah, for sure.

    [00:33:18] Alex Volkov: Um, let's [00:33:20] move on. Let's move on from open source. So, so I think we've covered. A few open source, I will just mention briefly that we didn't cover this, um, the, the folks, uh, from Yi, uh, 01AI, 01AI is a Chinese company, uh, they released the small version of Yi, and we've talked about Yi 34B multiple times before, there's a, a great fine tune from Nous, uh, they released a 9, 9 billion parameter version of Yi, which, uh, they trained for a long time, looks like, and, um, They showed some benchmarks, and it's very, very interesting how confusing everything is right now, because even, you know, even Gemma is not really 7 billion parameters.

    [00:33:58] Alex Volkov: Yeah, we talked about this, right? But then they now compare, they say in the same category broadly, and they now compare like Yi 9 billion parameters to Mistral 7 billion to Solar 10.7 billion. So I'm not sure like what this category is considered, but maybe folks here on stage can help me like figure out what this category is considered.

    [00:34:19] Alex Volkov: But Yi is fairly performative on top of Mistral 7B, and I think it's still one of those models that you can run. I think, if anything, comparing this to Solar, 10.7 billion parameters, we've talked about Solar multiple times, from the Korean company, I think. Yi is very performative, and the 34 billion parameter model of it was very good, and many folks really, really did some fine tunes of it.

    [00:34:45] Alex Volkov: So, asking the fine tuner folks here if you have a chance to look at it, and if not, is this something interesting? It looks like, unfortunately, Yam is having a lot of, like, X problems, uh, but once you come up, we're going to talk about the Hebrew GPT as well. Um,

    [00:35:02] Far El: What I do find interesting is, uh, how, yeah, like the, the, the broad evaluation spectrum that a lot of these models are, are comparing themselves to now, uh, and, and we're going to see more of these, uh, going forward, like, uh, I've seen early, uh, private researchers, Stuff, but like I feel like the category is no longer all just compare 7b to 7b It's it's just expanded to like sub 10b, right?

    [00:35:27] Far El: Like that's pretty much what it is like those those numbers even from players like Google are very You know, um, like it, it just doesn't feel as rigid as it used to be, but also like we should keep in mind that not all parameters are the same, right? So, like, uh, like we've seen with certain MOE architectures.

    [00:35:51] Alex Volkov: Yeah, that's true. And, um, and I will say, it's, uh, it looks like there's an art to training these models, and some, some amount of art to also, uh, cherry picking which metrics you're, you're testing, and against which models, and which category you're placing your model in as well. Um, but just, and, and again, this was released like so recently, I think yesterday, so definitely folks didn't have a chance to try this, but Yi, the, the other models of theirs were trained and performing very well, so, um, we're gonna be very excited to see if the finetuning folks are jumping on this, uh, 9 billion parameter, and whether it performs better than, I think, Gemma, which is, ahem, the leading one, even though Mistral is still the leading one in our eyes.

    [00:36:36] Alex Volkov: Okay, I think this is it in the, um, oh, let's see, a few more details here for Yi before I finish. Uh, it's trained on 3 trillion tokens, so a lot. Uh, it's decent at coding and math, and then it has open access weights, and it's bilingual. That's basically what we were able to get, uh, thanks to the folks at Hugging Face, VB.

    [00:36:59] Alex Volkov: I should probably add this as well. I think we're moving on to the main topic, which is the big companies, APIs and LLMs. I think it's, uh, you know what, you know, before this, I would go to the vision category, because we have Vic here. And, uh, I really want to chat about Moondream 2. So, um, we've talked about Moondream 1, but for folks who weren't with us, Vic, do you mind, uh, unmuting and then doing a little bit of a, of an intro for you as well?

    [00:37:26] Alex Volkov: And then we'll talk about what's changed in Moondream.

    [00:37:30] Vik: Yep, sounds good. Um, so, uh, Moondream is a small vision language model. Basically a vision language model is, uh, basically it's a language model where you can, Show it an image, ask it questions. You can ask it to describe the image. And the reason this is useful is not because it unlocks any new capability that people didn't have like five years ago.

    [00:37:56] Vik: All the stuff you could do with it, object detection, captioning, etc. It was all possible. The thing that's helpful about models like this is they're a lot easier to use. Whereas historically, if you wanted to do a computer vision task, you'd have to collect a bunch of data, train your own YOLO v7, v8, I think there's a v9 now, model, um, and that usually works super well, but it's, uh, when you're trying to build an app, it's just unnecessary extra work for you, whereas with a general vision language model, similar to how you use ChatGPT with text, you can just ask it questions in natural language, and it makes developing computer vision apps a lot easier.

    [00:38:38] Vik: Um, so I released Moondream 1 maybe about a month ago. Um, it's, it's not unique, by the way. There are other open source, well, open-ish source vision language models out there today. Uh, but they're all in the 7 billion to 34 billion to 70 billion param range. Uh, Moondream is 1.86 billion params, which makes it very easy to run, um, cheap to run on edge devices. Like, you literally don't even need a GPU to run it, you can just run it on CPU and get acceptable performance. Um. Yeah, so, Moondream 1 was trained on some datasets that were derived from GPT 4, and so the licensing was, uh, non commercial. Like, you could use the model, but not commercially. It was research only. For Moondream 2, which I released earlier this week, maybe last week, time's a little bit of a blur, um, I redid the datasets, um. All of the synthetic data used to train it is now generated using Mixtral, uh, and as a result, like, it's all clean.

    [00:39:47] Vik: So I was able to license it as Apache 2. 0. There's no restrictions on you can use it or

    [00:39:53] Alex Volkov: Vic, I have a question real quick. Uh, when you say synthetic data, and we're going to talk about some synthetic data in, in SD3 as well. Um, do you mean captions for images for, for, to train? Like what, what synthetic data are you generating with Mistral? Because Mistral is not multimodal.

    [00:40:08] Vik: Yep. Great question. I'm going to post a more detailed breakdown of how I did it, uh, later. But basically to train these visual language models, you need, uh, paired image and text data. And the text needs to be read. You want like a mix of, hey, can you caption this image? Hey, can you caption this image in a lot of detail?

    [00:40:29] Vik: Can you answer questions about this image? Um, there's a lot of images available with high quality captioning information, like COCO Captions and whatnot. There's, there's a bunch of datasets. And so you use a model like Mistral to transform it into the types of queries that you want your, um, VLM to be able to answer.

    [00:41:04] Vik: Basically you take COCO, for example, COCO Captions information, and have the model convert those image captions into questions and answers about the image.
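
    A minimal sketch of that caption-to-QA conversion step is below. The prompt wording, the chat callable and the output format are assumptions for illustration, not Moondream's actual data pipeline.

    ```python
    # Turn existing image captions (e.g. COCO Captions) into VQA-style training pairs
    # using a text-only model such as Mixtral behind any chat endpoint.
    PROMPT = """You are looking at an image. These five captions describe it:
    {captions}

    Write three question/answer pairs about the image. Answer as if you can see it
    directly; never mention the captions. Format: Q: ... A: ..."""

    def captions_to_qa(image_id, captions, chat):
        """`chat(prompt) -> str` is a placeholder for whatever inference client you use."""
        reply = chat(PROMPT.format(captions="\n".join(f"- {c}" for c in captions)))
        return {"image_id": image_id, "qa_text": reply}   # pair this back with the image for VLM training
    ```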

    [00:41:04] Alex Volkov: So how long did it take you to train the second version of Moondream? And, um, what else can we do that the previous one or what else can you do better?

    [00:41:14] Vik: It took about a month to get the same level of performance from the new data collection pipeline. One of the things that was really hard was I think when you're generating synthetic data, life is just so much easier when you have a GPT 4 class model. But unfortunately, the terms of use don't allow you to train a competing model and it gets a little iffy.

    [00:41:33] Vik: Um, and so just basic things like tone of the response, right? Like if you use Mixtral to generate the data, your prompt is something like, hey, I'm going to give you five captions for this image, consolidate all the information in it, and generate a caption. But you want it to pretend that it's looking at the image, um, not say something like, hey, based on the five captions that you have provided, there is a dog and a man is petting it and whatnot.

    [00:41:58] Vik: So, getting that tone right required a lot of work. Uh, I ended up using DSPy. It's a super cool

    [00:42:06] Alex Volkov: Oh,

    [00:42:06] Vik: framework for prompt optimization. Um, everyone should check it out. But basically you can do stuff like manually annotate 400 examples and then it uses Optuna to figure out, like, what's the best chain of thought few shot setup that you can get to optimize performance, based on metrics you can define.
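
    (For readers who haven't seen DSPy: the optimizer Vik is describing searches over few-shot demonstrations against a metric you define. The snippet below is a deliberately simplified stand-in for that loop in plain Python, not DSPy's actual API; the scoring hook and data are placeholders.)

    ```python
    # Simplified stand-in for DSPy-style prompt optimization: try different
    # few-shot demonstration sets and keep whichever scores best on a small
    # labeled dev set. This is NOT DSPy's real API, just the core idea.
    import random

    def build_prompt(demos, query):
        shots = "\n\n".join(f"Q: {d['q']}\nA: {d['a']}" for d in demos)
        return f"{shots}\n\nQ: {query}\nA:"

    def score(demos, dev_set, call_model, metric):
        # call_model(prompt) -> model output; metric(pred, gold) -> float in [0, 1]
        total = sum(metric(call_model(build_prompt(demos, ex["q"])), ex["a"]) for ex in dev_set)
        return total / len(dev_set)

    def optimize(annotated_pool, dev_set, call_model, metric, k=4, trials=20):
        best, best_score = None, -1.0
        for _ in range(trials):
            candidate = random.sample(annotated_pool, k)  # pick k few-shot examples
            s = score(candidate, dev_set, call_model, metric)
            if s > best_score:
                best, best_score = candidate, s
        return best, best_score
    ```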

    [00:42:25] Vik: Uh, but yeah, getting that tone right was a lot of work. The other thing I focused on a ton was reducing hallucinations. Uh, I don't know if anyone's dug into the LLaVA training dataset, but one of the reasons LLaVA-style models hallucinate a lot is just because they're trained on bad data. And you'll notice that a lot of hallucinations are oriented around COCO objects, like it tends to hallucinate handbags, ovens, um, people.

    [00:42:53] Vik: A lot in images when they're not present, and then coffee cups, very, very common. And that's mostly because of bad object annotations in COCO, so I spent a lot of time filtering those out. Um, currently the benchmarks are slightly better on Moondream 2 than Moondream 1. Um, but qualitatively, if you try it out, the model hallucinates a ton less, and a big part of that was just the data pipeline.

    [00:43:15] Alex Volkov: Interesting how that's not part of the benchmarks or evals. Right. Just underlines how, um, how far we still have to go in, in terms of evaluations that, you know, qualitatively you feel that it hallucinates less, uh, but there's not a lot of, uh, benchmarking or evaluation for hallucinations, I guess. Um, and you said this is like,

    [00:43:38] Vik: in the long form, right? Like, there's POPE, which asks a bunch of yes/no questions about your image. And so you can use that to measure hallucinations in that sense. But, like, uh, how do you measure hallucinations when you ask the model to describe an image and it gives you a long-form answer?

    [00:44:01] Alex Volkov: That's awesome. Congrats on the work, Vic. Uh, can folks try it right now? You said this is now commercially viable, right? Like, folks can actually use

    [00:44:08] Vik: Yep, it's open source. You can build it into your app. Uh, there's a demo on Hugging Face Spaces if you want to try it out before.

    [00:44:14] Alex Volkov: Yeah,

    [00:44:15] Vik: You start building on it. I'm going to get llama.cpp integration going here this week or early next week. So, uh, that'll unlock getting it into all the standard applications that people use, Ollama, LM Studio, Jan, etc.

    [00:44:29] Vik: So it's going to get a lot easier, but the weights are available. The code is available. It's all open source, Apache 2.0. You can use it today.
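
    (If you want to try it locally before the llama.cpp integration lands, the moondream2 model card showed a transformers snippet roughly along these lines at the time; the encode_image/answer_question helpers come from the repo's trust_remote_code implementation and may change between revisions.)

    ```python
    # Rough sketch of running Moondream 2 via transformers, based on the
    # vikhyatk/moondream2 model card at the time. The helper methods are part of
    # the repo's custom code loaded with trust_remote_code and may change.
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "vikhyatk/moondream2"
    model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    image = Image.open("street.jpg")          # any local image
    encoded = model.encode_image(image)       # vision encoder pass
    print(model.answer_question(encoded, "Describe this image.", tokenizer))
    ```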

    [00:44:35] Alex Volkov: that's awesome. Vik, congrats on this. What is this, uh, Hugging Face Zero A100 space thing that you got as well? I was looking at this, I think, did they, like, start giving A100s to demo spaces now?

    [00:44:50] Vik: Uh, yeah, so Zero is kind of like AWS Lambda, but for GPUs. So rather than having a provisioned GPU for your space, anytime a user comes in, there's a pool of GPUs and it pulls one, loads your model into it, and runs it. Until recently, they had A10Gs, I think, available for this, but they switched to A100s.

    [00:45:11] Vik: So, uh, there's a bit of latency if your model hasn't been tried out for a bit while it's loading it onto the GPU. But once it's on the GPU, it's super fast.
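
    (For folks who haven't seen ZeroGPU Spaces: the trick is a decorator around the GPU-bound function in your Gradio app, so a pooled GPU is attached only while it runs. A minimal sketch, assuming the `spaces` package that's available on ZeroGPU hardware; the captioning model here is just a placeholder.)

    ```python
    # Minimal ZeroGPU-style Space: @spaces.GPU tells Hugging Face to attach a GPU
    # from the shared pool only for the duration of the decorated call.
    # Assumes this runs as a Space on ZeroGPU hardware; the model is a placeholder.
    import gradio as gr
    import spaces
    from transformers import BlipForConditionalGeneration, BlipProcessor

    processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
    model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
    model.to("cuda")  # on ZeroGPU this is deferred until a GPU is actually attached

    @spaces.GPU  # pooled GPU attached while this runs, released afterwards
    def describe(image):
        inputs = processor(images=image, return_tensors="pt").to("cuda")
        out = model.generate(**inputs, max_new_tokens=40)
        return processor.decode(out[0], skip_special_tokens=True)

    gr.Interface(fn=describe, inputs=gr.Image(type="pil"), outputs="text").launch()
    ```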

    [00:45:22] Alex Volkov: Nice. Even for, for a tiny model like this, I wanna say an A100 is probably like just poof and it

    [00:45:28] Vik: It's, yeah,

    [00:45:31] Alex Volkov: awesome. Uh, Vik, congrats on this and thanks for sharing with us, and folks, definitely give Vik, uh, a follow, and Moondream too. When the first one released, I tested this against some significantly larger vision models and it performed very well.

    [00:45:45] Alex Volkov: Especially now that it's like Apache licensed, you can build it into your own pipelines. Um. And, um, I think the one thing to not miss from what you said is that there are specific vision models like YOLO and different things. And, uh, we have, um, one of the YOLO masters, Skalski, uh, Piotr is a friend of the pod and he trains these models and he, he has demos and videos of how to actually use them.

    [00:46:10] Alex Volkov: Uh, it's more, significantly more complex than using a VLM like Vik said. Um, you have to learn this field, uh, it's like the very, like the standard machine learning in vision field as well. Uh, even though those models are tinier and probably run faster, some of them, I think YOLO can probably run in real time.

    [00:46:29] Alex Volkov: Um, getting these tiny models, uh, to be able to talk to them, I think is significantly easier for many folks. And, uh, definitely, definitely check it out. Um,

    [00:46:39] Vik: yeah. Just to clarify, um, Moondream is great for vision tasks. If you ask it to write a poem about an image or roast you or something, it's not going to do as well. Because the sole priority I had was like make a model that's really, really good at computer vision. Um, and if you need more advanced reasoning, like you want it to solve a math problem for you, you take the outputs from Moondream and feed it into a bigger LLM.

    [00:47:03] Vik: But Moondream is going to be great at vision tasks, other stuff, not so much.

    [00:47:09] Alex Volkov: Absolutely. And, uh, if folks want to help, uh, the link is in the top of the space. Go into the GitHub, give it a star, and check it out and give, uh, Vik feedback. Um, moving on, uh, Vik, uh, feel free to stick with us and, and chat about the next stuff. Uh, speaking of folks who built and released things, uh, Yam, you have also news of your own, and hopefully, finally, your tech stuff is solved and you're now with us in the space.

    [00:47:31] Alex Volkov: So let's do a sound check.

    [00:47:34] Yam Peleg: Can you hear me?

    [00:47:36] Alex Volkov: Um, you've been, you've been cooking, and we've been, we've been waiting, so you wanna, you wanna tell us the end result of all this cooking?

    [00:47:45] Yam Peleg: Yeah, yeah, uh, I've, uh, I've dropped, uh, two different interesting models this week. Um, uh, first one is, uh, a little bit of a surprise to myself as well. Uh, one of the experiments, uh, ended up, uh, being the, the top 7B model on Hugging Face at the moment, the Hugging Face leaderboard. Um, uh, I suspect it a little bit, so, uh, take it with a grain of salt.

    [00:48:10] Yam Peleg: So, it's under investigation whether or not the model, uh, overfitted the leaderboard. Uh, I think that there was no attempt to overfit the leaderboard, but, um, I'm always, uh, suspicious when something like this happens. Uh, but, uh, yeah, it's out there. Experiment 26 if you are interested in trying it out.

    [00:48:29] Yam Peleg: And, uh, maybe further fine-tuning, uh, this model or merging with it. It's yours. Um, and another model, which is, uh, the Gemma fine-tune, uh, that, um, the Gemma continued pretrain that I've been, uh, working on for the past two weeks. Uh, it had been released, uh, this morning. It's, it's a continued pretrain of, uh, Gemma, extended from 7B to, uh, 11B, um, and then, uh, continuously pretrained on Hebrew and English, multilingual.

    [00:49:02] Yam Peleg: Um, there are other tricks that went into, uh, into training this model. Uh, you're more than welcome to read, uh, the write-up that I did summarizing the whole thing. Um, but, uh, benchmarks are coming soon, and I think that the model is really, really good, uh, for the Hebrew part, put that aside, but, uh, just on the English part, I used, uh, Cosmopedia from, uh, Hugging Face, the new dataset that is a replication of Phi, uh, based on, uh, Mixtral, from Hugging Face, really good dataset. I used it as the English part of the model, and, uh, that's about it. Um, it was a long two weeks struggling with, uh, training Gemma, but, uh, it paid off and, uh, the model is yours now, so, uh, enjoy.

    [00:49:48] Alex Volkov: let's talk about the struggles with Gemma, um, a little bit more, because definitely you were very, very vocal about this. What changed, like, uh, um, did they release anything else, or did the communities, like, figure out, or did you figure out some stuff that you wanna share?

    [00:50:04] Yam Peleg: both, both, both. They, uh, first, uh, Gemma was trained, uh, using JAX on TPUs. Uh, makes sense, it's from Google. Um, and, but Google released, uh, two, I think even four different implementations of Gemma. Um, apparently, uh, on the Torch version, there were subtle, tiny details that were different. Um, but they are very hard to detect if you just follow the code.

    [00:50:34] Yam Peleg: It's rounding errors, things that are done by default differently between PyTorch and JAX, and those things influence the training, um, just silently. They don't crash your code, but when you train with those things, the model is not 100 percent as it was trained initially. You're basically losing performance.

    [00:50:56] Yam Peleg: It's suboptimal. So, uh, it took, I think, two, I think two weeks, and it's still going on, for people to go meticulously through all the details to just clear everything out, um, since many people just felt a little bit confused that the model didn't work that well, even though on, on paper and, and in general, it's, it should be an extremely good model.

    [00:51:28] Yam Peleg: It is trained on 6 trillion tokens, which is insane. People just didn't see the performance, the qualitative performance of the model. So it got people to be suspicious, and people are now investigating. For me, it is what it is. I started the training two weeks ago. So, uh, I, uh, I ended up with this suboptimal training, unfortunately.

    [00:51:56] Yam Peleg: But I do continue and I plan to nudge the model a little bit once all the, all the bugs and issues are cleared out. I plan to just take the final architecture, my weights, and just nudge the model a little bit to clear out all the, all the issues and, and get you all a better model. But, uh, yeah. It was a rough, it was a rough two weeks.

    [00:52:19] Alex Volkov: two weeks, um, especially when, when Hugging Face went down and, um, you had to check on your other model. That

    [00:52:28] Yam Peleg: oh yeah, that was hard. Very, very hard.

    [00:52:30] Alex Volkov: We did spend a bunch of quality time together, all of us, while this happened. Uh, so Yam, how can folks, uh, try this out? And, uh, you mentioned something. You also have Hebrew GPT, uh, and this, this model was just trained with the Hebrew stuff, but with less knowledge as well, right?

    [00:52:46] Alex Volkov: Can you talk about the difference there?

    [00:52:49] Yam Peleg: Yeah, there are two models. Uh, one of them is, uh, is called, uh, okay, Hebrew GPT. It is, is a model that is heavily trained for, uh, three, three, nearly four months straight, uh, on, uh, 300 billion tokens in Hebrew. Uh, it is, it was a heavy project. And, uh, yeah, it was, it was done in the summer, I think. Yeah, in the summer.

    [00:53:15] Yam Peleg: Uh, but this one is basically, because they have all the data, and, and we just, we just detected, because people played with Gemma, and hours after it was launched, people already detected that the tokenizer probably was trained multilingually, without Google, uh, announcing anything about it, because, uh, many different people found out that the model is surprisingly good in, in languages that are not English, even though Google announced that the model is just English pretrained.

    [00:53:47] Yam Peleg: So, uh, just from, from our perspective, you know, me and my buddies, we were looking at this and just thought to myself, wait, we have. We have an opportunity here. If there are tokens in the model that are multilingual, and clearly the model has some bases, especially in Hebrew, we can just fine tune it just a bit and get an extremely good model in Hebrew,

    [00:54:10] Alex Volkov: So it's missing just data. So it's, it's capable, but it's missing data, basically.

    [00:54:16] Yam Peleg: Yep, because it was not specifically trained in Hebrew, it just saw a little bit, but you can clearly see that it has a basis in Hebrew. So what I did, I followed LLaMA Pro, which is, which basically says that you can extend the model, you can just stretch it out, add more layers, and freeze the base model such that you won't do catastrophic forgetting of what the model already learned before.

    [00:54:43] Yam Peleg: So you just train the extended blocks. So, I literally just added blocks and trained another language into these blocks only. So, now I have a model that, you know, has the same base scores as before, but also knows another language. So, that's the whole trick of this project, and, uh, it saves a lot of compute, pretty much.

    [00:55:08] Vik: Hey, that's super cool. Can you talk a little bit more about, like, how the blocks were arranged?

    [00:55:13] Yam Peleg: Yeah, sure. Uh, if you follow the LLaMA Pro paper, they tried different configurations, like a mixture of experts and so on and so forth. They found, after experiments, that if you just copy a couple of the attention blocks, just like that, just copy them and stretch the model, deepen it, and train only the blocks that you copied, leaving also all the originals in place, that experimentally gets to the best performance. So I did exactly that, I just followed exactly what they said in the paper, and the result, it looks really good.
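
    (A rough sketch of the block-expansion trick Yam is describing, as I understand it from the LLaMA Pro paper: duplicate a few decoder blocks, zero their output projections so the copies start out as identity layers, freeze everything original, and train only the copies. The base model ID and which layers get duplicated are placeholders, not Yam's actual recipe.)

    ```python
    # Sketch of LLaMA Pro-style block expansion: copy some decoder layers, zero
    # the copies' output projections so they initially act as identity blocks
    # (residual passthrough), freeze the originals, and train only the copies.
    # Training-time sketch; model ID and expansion rate are placeholders.
    import copy
    from torch import nn
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
    layers = model.model.layers  # nn.ModuleList of decoder blocks

    expanded, new_blocks = [], []
    for i, layer in enumerate(layers):
        expanded.append(layer)
        if i % 8 == 7:  # after every 8th block, insert a trainable copy of it
            block = copy.deepcopy(layer)
            # Zero the projections that write into the residual stream so the
            # copied block starts out as a no-op (per the LLaMA Pro paper).
            nn.init.zeros_(block.self_attn.o_proj.weight)
            nn.init.zeros_(block.mlp.down_proj.weight)
            expanded.append(block)
            new_blocks.append(block)

    model.model.layers = nn.ModuleList(expanded)
    model.config.num_hidden_layers = len(expanded)

    # Freeze everything, then unfreeze only the inserted blocks.
    for p in model.parameters():
        p.requires_grad = False
    for block in new_blocks:
        for p in block.parameters():
            p.requires_grad = True
    ```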

    [00:55:57] Alex Volkov: That's awesome. Um, all right, so folks can check out the, the deeper dive that Yam usually writes up in the tweet that's been pinned above, um, with, with a lot of detail as well, and definitely give Yam a follow because this is not the first time that Yam trains these things and then shares, very verbosely. So Yam, thank you.

    [00:56:15] Alex Volkov: Uh, and it's great to see that the Gemma efforts that you have been cooking finally, finally turned into something. And we'll see, we'll see more from this. Uh, I want to acknowledge that we've been here for an hour. There's like one last thing that I want to talk about in open source stuff. And then we should talk about Claude 3 because, like, it's a big deal.

    [00:56:33] Alex Volkov: So unless the rumors about today are true, Claude 3 will still be the biggest deal of the space. So let's quickly talk about this. I want to just, uh, find the, the thread and then kind of thread the needle. So there's a paper that was released. It's called Tiny Benchmarks, uh, evaluating LLMs with fewer examples, from, from folks, uh, fairly familiar folks.

    [00:56:54] Alex Volkov: Leshem Choshen is the, the most like standing out there name for me as well. Um, quick and cheap LLM evaluation. And the way I saw this, uh, this paper, is that Jeremy Howard, the same guy from Answer.AI that we've talked about, uh, before, he tweeted about this and says, hey, this looks like a really useful project that we can take, uh, tiny benchmarks and then make them run, uh, on our models significantly faster and spend significantly less GPU.

    [00:57:19] Alex Volkov: And then he specifically, uh, Jeremy specifically tagged Far El here with us on stage about his project called Dharma. So Far El, let's talk about Dharma and let's talk about this tiny benchmarks thing and why like smaller benchmarks are important. Uh, and I think I will just say that, uh, the way I learned about this is LDJ showed me.

    [00:57:37] Alex Volkov: Um, Weights & Biases. When we did like the Weights & Biases deep dive, he showed me Dharma there and it looked super cool. So let's talk about this just briefly and then we're going to talk about Claude afterwards.

    [00:57:48] Far El: Yeah, for sure. Um, so about, like, about six, seven months ago, uh, I released Dharma. Basically, the idea was that we wanted, uh, we found that eval loss alone is not a really good, uh, indicator of model performance, um, throughout the training run. So, specifically within a training run, um, and we were trying to find, um, other ways of evaluating the models throughout the training run.

    [00:58:22] Far El: And, uh, one idea was, you know, let's take a statistically significant sample, uh, or sub-sample of, uh, the benchmarks, uh, out there. Uh, MMLU, ARC-C, uh, AGIEval, BigBench, and so on. Um, and use those subsets as, um, markers of performance across these different downstream tasks. Of course, you know, like, uh, my, my opinion on benchmarks is that, you know, like, it's, it's a good indicator, but just in MCQ format and so on, so it's not the only way you want to evaluate your model, but, um, it's, uh, it's just added information you can have, um, uh, basically collect the model's performance across different tasks and subjects, essentially quizzing it throughout the training.

    [00:59:21] Far El: And the recent paper, um, that Jeremy mentioned, it came out about two weeks ago or something, um, proves and validates this, uh, this idea, which is awesome, because it does show that you can actually get a somewhat accurate picture of the performance on these benchmarks from a sample, 100 examples, which is very much in line with what we did with Dharma.

    [00:59:51] Far El: Um, so, like, uh, we're actually, uh, going to release, um, uh, like a repo on GitHub for anyone to make their own Dharma datasets. It's been in the works for a few months, but got trailed away. But we're gonna have that in the next, um, in the next few days. It's already on GitHub, but just, uh, just like getting polished. Uh, so, uh, hopefully anyone can easily make their own eval datasets and run them during their training runs.

    [01:00:23] Alex Volkov: I want to stress how, how big a deal this seemed to me when LDJ showed, showed this to me as well, uh, because in, in your Weights & Biases dashboard, you can basically look at the loss curve and try to understand, surmise. Many folks like, like you guys and Yam probably already have the instinct for, oh, something's going wrong, uh, with the loss curve.

    [01:00:41] Alex Volkov: But, uh, then after the model is finished, many folks only after that, they start doing evaluation. Many folks don't even do evaluations after that. Um, but. I think I saw the same thing also with Olmo from Allen Institute for AI, that they released everything end to end. I think they also had like, uh, evaluations, uh, actually don't know if part of the training run or afterwards, but they definitely had this in the same, in the same view.

    [01:01:04] Alex Volkov: And then LDJ, when, when you were showing me Dharma, Dharma actually does a subset of those evaluations, maybe not as precise, right? For like, it's not exactly the same, but you can definitely see from, from checkpoint to checkpoint when the model trains, how. How potentially it could respond on those evals.

    [01:01:22] Alex Volkov: And then, um, it just adds a bunch of information for you. Which is, I think, great.

    [01:01:30] Far El: Look, like, even just with training loss and eval loss alone, like, we can't really tell, like, uh, whether the models, like, there's some, some things we can grasp, but it's not the full picture. So, uh, having these added, um, uh, like, this added information from these benchmarks is interesting because. You know, it does, it does add another kind of, uh, uh, dimension to the evaluation itself.

    [01:01:57] Far El: And then you can break it down by all the different subjects. So, I can, I can see if, um, if my model is generalizing well across, um, all the different subjects. Uh, sometimes you see, for instance, that Uh, it, uh, like, the model gets better at math, but then it actually gets worse on, like, uh, law, for instance, or, uh, uh, all these different kind of, like, tiny markers of whether the model is getting better at specific subjects or not.

    [01:02:29] Far El: Of course, you have to take into consideration always that this is benchmarks in the sense that it's, like, MCQ based. So, there, like, you do want to go beyond that, um, if you want to get a full picture, but this, this is a good way to, uh, to eval your models. Uh, also, uh, the, uh, uh, like with the tool we're releasing, uh, you're going to be able to control, uh, the types of subjects that you can actually, like, target.

    [01:02:59] Far El: Because, you know, not every single training run is the same and you might be, uh, trying to achieve something very different than, uh, let's say a generalized, uh, like, uh, model that's good at benchmarks, right? But, um, so, so, with this tool, we're gonna basically allow you to, to customize those, uh, those datasets for your, uh, training run.

    [01:03:22] Alex Volkov: That's awesome. And I should say one thing that I remember is folks do eval on checkpoints, right? The model as it trains generates several checkpoints. Uh, the process there is like slow. And I think that's the benefit, let's say, from weights and biases, um, which, which I feel like is a good place to plug as well.

    [01:03:39] Alex Volkov: And I think LDJ, you remember you showed me, like, otherwise folks will SSH into the machine, download the weights, start like running a separate process. And the importance of tiny benchmarks and, like, Dharma: Dharma is significantly faster evals. They're able to run probably as part of your training as well and be exposed in the same dashboard, so you don't have to deal with this, significantly improving everybody's life, which is what we're all about here at Weights & Biases.

    [01:04:04] Alex Volkov: So definitely folks, Far El is going to release the Dharma toolkit you called it? What do you call this iteration of Dharma?

    [01:04:12] Far El: It's just, uh, the, like, the, the repo is just called Dharma, uh, uh, I'll, I'll make a public post on Twitter. It's, it's public right now, the repo, so you can use it. It's just like It needs a bit of polishing, um, and uh, some features are not fully implemented yet, but like, everything should be good to go in the next day or so.

    [01:04:33] Far El: I'll make a post on my Twitter, so just follow me and you'll hear more about it there. Um, and also in parallel, we'll just, we're going to release kind of Dharma 2, which is going to basically be a cleaner version of these, uh, of Dharma 1, um, uh, using this new code. So, uh, you, you, you, you can actually just replicate it.

    [01:04:56] Far El: We'll, we'll have the configs, uh, uh, like examples so you can just replicate it for yourself. Um, and yeah, uh, hopefully if anyone wants to contribute to this, uh, like there's a lot of different, uh, paths we can take, uh, to improve this and make this a toolkit for, uh, for, um, uh, even more than just the downstream, uh, benchmarks like MMLU and

    [01:05:23] Nisten: ARC-C and so on. Yeah, I've posted, I've posted by the way in the comments to this space and in the Jumbotron, the repo that Far El has up right now. And, uh, yeah, the main technique of it is that while the benchmarks are not great evaluators, they can be very good at telling incremental changes, or if you did something good in the model, you can spot that.

    [01:05:47] Nisten: And, uh, with, with the Dharma technique, you only need to do about a hundred questions instead of running the entire 65,000 question benchmark, and you will get a relatively accurate, but very, very fast, uh, fast eval. So again, it's, it's really good for people doing training and fine tuning.
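
    (The mechanics are simple enough to sketch: fix a small random subset of an MCQ benchmark once, score every checkpoint on that same subset, and log it next to your loss curves. A minimal illustration using the `datasets` and `wandb` libraries; the benchmark, sample size, and `answer_question` hook are placeholders for whatever your eval harness does, not the actual Dharma code.)

    ```python
    # Minimal Dharma-style "tiny benchmark": take ~100 fixed questions from an
    # MCQ benchmark, score each checkpoint on them, and log to W&B alongside
    # loss. answer_question() is a placeholder for however you query your model.
    import random

    import wandb
    from datasets import load_dataset

    random.seed(0)
    arc = load_dataset("ai2_arc", "ARC-Challenge", split="validation")
    subset = random.sample(range(len(arc)), 100)  # fixed across all checkpoints

    def tiny_eval(answer_question) -> float:
        correct = 0
        for i in subset:
            ex = arc[i]
            choices = dict(zip(ex["choices"]["label"], ex["choices"]["text"]))
            pred = answer_question(ex["question"], choices)  # returns a label like "A"
            correct += int(pred == ex["answerKey"])
        return correct / len(subset)

    # Inside the training loop, after saving each checkpoint:
    # wandb.log({"tiny_arc_c/acc": tiny_eval(my_answer_fn), "step": global_step})
    ```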

    [01:06:08] Alex Volkov: Alrighty folks, so we're coming up on an hour and a few minutes. Let's reset the space and then start talking about Claude. One second. Let's go.

    [01:06:24] Alex Volkov: Hey everyone who recently joined us, we are now at the second hour of ThursdAI, today's March 7th. And in the first hour we talked about open source LLMs, we talked about Answer.AI stuff, new techniques of training full huge models on consumer hardware, we even briefly mentioned, um, TinyBox and TinyCorp from George Hotz and AMD's response to it.

    [01:06:47] Alex Volkov: And we've talked with two folks here who trained specific models, Vik with Moondream and Yam with Gemma, the Hebrew version as well. And now it's time for us to discuss the big world of big companies who spend millions and billions of dollars on AI. And I think, uh, there's two issues for us to discuss.

    [01:07:07] Alex Volkov: We're probably going to start with Claude because it's going to take us a long time, but we will acknowledge, if, if, if we don't have time, uh, fully to discuss this, that, uh, Elon sues OpenAI, and OpenAI responded back. And as part of this response, uh, Ilya was cited. And I don't know if you guys saw this, but the response from OpenAI to, to Elon's, uh, Elon's things, uh, Ilya Sutskever, previously the co-founder of OpenAI, previously chief scientist, uh, was

    [01:07:33] Alex Volkov: cited as signing this, and I don't think somebody would sign in his name, I don't think. LDJ, you have comments on this before we talk about Claude?

    [01:07:41] Nisten: I was going to say, I think, uh, unless you guys covered it already about an hour ago, there's some breaking news with Inflection releasing a new model.

    [01:07:50] Alex Volkov: Yeah, yeah, so I definitely have this, uh, Inflection released, uh, Pi 2.5. Uh, we didn't cover this yet, let's, let's, uh, let's, let's cover this as well. But I think the biggest, and it is breaking news, but you know, uh, I think it's dwarfed compared to Claude. So. So, this Monday, Anthropic, who we've all but discarded, I think, don't actually discard it, but I regarded Anthropic as kind of the second best to OpenAI for a long time, especially because of the context windows. They had the biggest context window for a long time, even when the 128,000 token context window was announced during Dev Day back in December, I want to say November, December.

    [01:08:37] Alex Volkov: Um, even then, Claude still had 200,000 tokens. So up until Gemini released their million, et cetera, Claude still, Anthropic still was leading the chart for this. Um, slowly, slowly, they reduced our opportunity to use this, which was kind of annoying. Um, and then they just came out with three new models. So Claude 3 has three new models: Claude Opus, Claude Sonnet, and Claude Haiku.

    [01:09:05] Alex Volkov: Haiku they didn't release yet, but they claim that for its speed and cost effectiveness, Haiku will be the fastest, most effective model of its size and ability, but they didn't release Haiku yet. Um, Sonnet is kind of the, I want to say, GPT-3.5, um, equivalent; they claim it balances intelligence and speed.

    [01:09:26] Alex Volkov: Uh, and if you want, like, just speed as well, that's, that's yours. And then Opus is the most intelligent model, setting new standards in AI capabilities. And I love that companies do this, uh, and I think it's kind of on OpenAI's, uh, uh, kind of, it's their fault. Everybody compares themselves to OpenAI's GPT-4 released technical paper, uh, and since then we know definitely that GPT-4 is significantly more performant on many of these benchmarks, but still the big companies say, hey, well, we can only compare ourselves to whatever you released publicly.

    [01:09:58] Alex Volkov: And so everybody still compares themselves to, like, GPT-4 a year ago, um, which, which Opus beats. So, what else is very interesting here? Um, very close to, if not beating, GPT-4 on MMLU and, and different evaluation benchmarks. A competitive model. Finally, finally, um, multimodal from Claude. I think this was a, this is a big step, as most of the top models now are multimodal, which is incredible.

    [01:10:27] Alex Volkov: Excuse me.

    [01:10:30] Alex Volkov: Uh, LDJ, go ahead. Clear my throat.

    [01:10:33] Nisten: Yeah, I think, um, so if you look at the billboard, I just posted, uh, a post that shows like a couple of polls that have been made with, you know, like a few thousand people have voted in these polls, where it seems like it's about a 5 to 1 ratio, where for every one person saying GPT-4 Turbo is better at coding, there's 5 people saying Claude 3 is better at coding.

    [01:10:55] Nisten: Um, so Claude 3 is winning 5 to 1 in that, and then another poll of, um, just straight up asking, is Claude 3 Opus better than GPT-4? And Claude 3 also won in that poll, 3 to 1, or sorry, um, 3 to 2.

    [01:11:13] Alex Volkov: That felt like the timeline that I follow and the vibes check. And we also had some time, right? Usually these things happen as we speak.

    [01:11:22] Nisten: I'm going to make a quick post. Claude 3 just went up on the LMSys arena too.

    [01:11:27] Alex Volkov: Oh, yeah? Okay, tell us.

    [01:11:29] Nisten: Yeah, it is, uh. Yeah, so here's the thing, just because people voted that way does not mean that's what they voted in double blind tests. In double blind tests, it's third, so it's above, it's better than Gemini Pro, but it's worse than GPT-4-0125.

    [01:11:50] Alex Volkov: In the arena metrics, right?

    [01:11:52] Nisten: Yeah, in the double blind tests, which are pretty hard to, uh, to beat, you know. Yes, there's a lot of role play type of things, um, that people try to do, and also like summarization tasks and stuff in LMSys, and I just know that from, I kind of like went through their, their data when they released like some of their stats before.

    [01:12:14] Nisten: Um, and I think, like, from what I've gathered of what Claude 3 is specifically really good at, it seems like just high level, graduate level, uh, like, if you wanted it to review your paper or help review some literature for a very deep scientific concept or a PhD topic, it seems like Claude 3 is better at those types of things, and also just, like, better at coding overall, where it seems like other, maybe more nuanced things, like, you know, summarization or, or things like that, GPT-4 might be better. Also, I think it's good to keep in, oh, sorry, did I cut out or can you guys still hear me? Okay, you guys still can hear me? Okay. Um, I think it's also good to keep in mind the fact that people are maybe used to the GPT-4 style at this point because it's like one of the most used models for the past year. And so I think that might have something to do with the fact as well that even in the double blind tests, people might just end up preferring the style of the GPT-4 model, even though they don't know it's GPT-4, like they're just so used to that style that they end up like having a preference for that, even though it's not objectively better, if that makes sense.

    [01:13:31] Nisten: And. You know, that might be kind of skewing things a little bit.

    [01:13:36] Alex Volkov: So, um, actually go ahead and then we're gonna cover some other stuff that we got from them because we did get a bunch of new

    [01:13:42] Akshay: just to add to, you know, all of this, before this, in my humble opinion, Gemini Pro was the best multilingual model in terms of how it performs. You know, like it, it, like Gemini Pro did not see any performance drops when you switched languages from, let's say, English to Japanese or English to Hindi.

    [01:14:00] Akshay: And now, uh, and this, this new Claude 3 is basically the best multilingual model if you are, you know, looking to work with other languages, because in GPT-4, you will see a significant, you know, drop in performance when you switch languages, especially mid chat. So if you're, like, chatting and you switch to something where you basically use, uh, English, uh, letters to talk in other languages, GPT-4 starts to even kind of struggle with certain things, right?

    [01:14:30] Akshay: But Claude 3 is really good with that as well. So, for multilingual stuff, again, Claude 3 is very good.

    [01:14:37] Alex Volkov: Additional things that they talked about are refusals.

    [01:14:41] Nisten: what's interesting here too, actually, if you look at the LMSys leaderboard, they also have Claude 3 Sonnet, which is the cheaper version. They have that up on the leaderboard as well, and that one also beats the June version of GPT-4, and is just slightly below the original March version of GPT-4. And I find that interesting because if I remember right, the API costs of, of Claude 3 Sonnet are significantly lower than GPT-4 Turbo.

    [01:15:10] Nisten: And I think, I think Claude 3 Sonnet is even cheaper than Mistral Large. Um, so that could be just a really good overall, like, you know, uh, API cost, uh, for the quality.

    [01:15:22] Alex Volkov: Yeah, it's fairly

    [01:15:24] Nisten: I, I agree with that. Uh, so, I, I used Claude 3 Sonnet quite a bit because that's the only one they allow in Canada. And, uh, it was pretty good.

    [01:15:34] Nisten: Uh, I have to say, and for the price, it might actually be the best model for the price, that is true.

    [01:15:41] Alex Volkov: So, wait, they give you only one of the models in Canada? They don't give you Opus?

    [01:15:46] Nisten: Yeah, they don't let you buy the other one, so we're gonna have to make our own.

    [01:15:50] Alex Volkov: Wait, do you get API access though?

    [01:15:54] Nisten: It's a mess to buy, like sometimes it works when you buy it with VPN and sometimes it doesn't,

    [01:15:59] Alex Volkov: Oh, I see. Um, cause one thing

    [01:16:02] Nisten: point, yeah.

    [01:16:03] Alex Volkov: One thing that definitely changed is that, uh, Anthropic was notorious for a long wait on API, uh, access and getting into the workbench. So they redid their workbench. It's no longer, like, labs or playground, it's called Workbench. And, um, now you can just sign up and you get an API key, like, fairly quick.

    [01:16:23] Alex Volkov: It's a test API key, so you can go to production with it. Uh, but I, for, for example, I didn't pay yet for Opus. It feels like I'm gonna switch, especially 'cause I'm getting GPT-4, uh, from work. It feels like I'm going to switch and just try this for a while. Maybe today this will change. We'll see. But, um, definitely, definitely, uh, through the API playground, you can also kind of chat with this model.

    [01:16:46] Alex Volkov: It's less convenient, but definitely Opus is, uh, able to, to work through there. So, other stuff that they released: vision capabilities, which Anthropic didn't have up until, you know, up until this release on Monday, um, which finally makes sense; I think, besides Mistral, every big model right now that we're able to use is multimodal, um, at least on input, uh, not all of them are on output yet, um, but I think that's great.

    [01:17:19] Alex Volkov: Uh, it can understand a wide range of visual, uh, charts and graphs and photos, so it's not only that it, like, understands and can do, uh, whatever Vik told us about vision models, like, hey, who's in this picture? It can understand graphs and, you know, actually perform better on different tasks, uh, like math book tasks with graphs.

    [01:17:39] Alex Volkov: Um, it has lower refusals as well. So, uh, Claude has this thing called, uh, or Anthropic has this thing called Constitutional AI, uh, and they have, the previous Claude 2 had a lot of issues with telling you it doesn't want to do some things, and now we're seeing a significantly, um, um, lower refusal rate. I've actually seen this, uh, in several prompts as well.

    [01:18:05] Alex Volkov: Um, what else? Oh yeah, long context. One tiny thing, they said, you know what, we also have a million token context window coming soon with near perfect recall. So, um, they didn't let, uh, Google be kind of leading alone in the one million, uh, token context window, and it definitely seems like they have some secret sauce there in Anthropic

    [01:18:26] Alex Volkov: that talks about, like, long context windows, and so they announced that they're also able to do 1,000,000, and I think right now Opus is 200,000. Um, so even right now, if you take Opus versus ChatGPT or GPT-4, um, I think at least on that it beats GPT-4, because GPT-4 is still 128K. And I think even on speed, the more tokens you give it in the context window, the slower it is; GPT-4 is very, very slow.

    [01:18:50] Alex Volkov: Uh, go ahead, uh, LDJ.

    [01:18:53] Nisten: Yeah, I'm glad you brought up the constitutional AI, because I think that's really interesting where. You get to have something where you're not kind of leaving up the biases and stuff of the model just up to, like, biased humans, but you're kind of letting, like, the base model start teaching itself just, like, the base model kind of already knows or has its own mostly unbiased ideas of, like, Okay.

    [01:19:17] Nisten: What is like, uh, I guess without saying too political terms, like what is racism or what is sexism or whatever, uh, like bias something could have, and then you end up having it kind of like reinforcing itself and like kind of doing that, you know, labeling process and, and learning process and you, you like, you quite literally provided a constitution for doing that process.

    [01:19:42] Nisten: Okay. Where you can, like, go on Anthropic's website, and they do publish this constitution that they use publicly. So you could actually read, like, this constitution they use for the AI model, and view yourself, like, Hey, are these values that I, myself, align with enough to where I want to use the AI model?

    [01:20:01] Nisten: Where pretty much every other AI model, and ChatGPT, and everything, you have to just kind of, like, hope that it aligns with your values or whatever, and there's not really, like, a solid type of constitution or principles that they could provide you that represent what the AI model is doing

    [01:20:20] Alex Volkov: So, um, LDJ, you added Amanda's, uh, post here about the system prompt as well. And a very interesting thing happens where, um, through

    [01:20:30] Nisten: Very simple

    [01:20:31] Alex Volkov: Yeah, first of all, it's very simple. Um, there's not a lot there. I definitely recommend folks also like reading through this, through this post, because, uh, unlike the GPT-4 system prompt that somebody leaked, which is like thousands of tokens.

    [01:20:44] Alex Volkov: This is a very simple one. Uh, they ground the model in the date, which I think is very important. They give it like very basic instructions. And I think the best thing is you can use exactly this System prompt in the API layer to also get pretty much the same experience that folks are getting in the UI as well.
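
    (Here's roughly what that looks like with the Anthropic Python SDK's messages API. The system string below is a placeholder paraphrase standing in for the published Claude 3 prompt, and the model ID is the Opus identifier from this release; swap in the real prompt text yourself.)

    ```python
    # Sketch: get the claude.ai-style behavior over the API by passing the
    # published system prompt. SYSTEM below is a placeholder paraphrase, not the
    # verbatim text Anthropic published; paste the real one in its place.
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    SYSTEM = (
        "The assistant is Claude, created by Anthropic. The current date is March 7, 2024. "
        "(...placeholder: paste the full published Claude 3 system prompt here...)"
    )

    message = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=1024,
        system=SYSTEM,
        messages=[{"role": "user", "content": "Summarize the key ideas behind constitutional AI."}],
    )
    print(message.content[0].text)
    ```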

    [01:21:02] Alex Volkov: Um, I briefly want to talk about Alex Albert, who's their prompt engineer, um, and his needle in the haystack release. Did you guys see this? Um, so let me, let me go find this. But basically, um, there is a, there's a guy called Alex Albert. He previously built the website called Jailbreak Chat, which had a bunch of jailbreaks.

    [01:21:26] Alex Volkov: You remember the, the cheery old time where we used to jailbreak chat GPT to do whatever you want with Dan and the like? Um, so he used to collect all of those jailbreaks. Excuse me. I contributed a few myself. Um. And then, after that experience, he became the prompt engineer for Entropic and been there for a while.

    [01:21:45] Alex Volkov: And now, with the Claude 3 release, he released some examples of his, where he basically, um, did the needle in a haystack analysis for the long context window. If you don't remember the needle in a haystack analysis, I think we've talked about this around, uh, the Gemini release and also around, um, GPT-4 128K. Uh, this guy Greg Kamradt came up with this idea of planting different unrelated things in a lot of text and then running these prompts and asking the model to go find them.

    [01:22:19] Alex Volkov: And I think this example of a needle in a haystack was the most interesting, because one of the things that Claude Opus replied with was, I suspect that the pizza topping "fact", in quotes, may have been inserted as a joke, or to test if I was paying attention. So this is a response he got from the model when he tried to, um, to find facts about pizza toppings in a bunch of very technical, a lot of context of just technical stuff.

    [01:22:50] Alex Volkov: I think he maxed out the context window of 200,000. Um, so the model responded that it basically tries to understand and see if, um, it's being tested. Specifically, uh, this may have been inserted as a joke or to test if I was paying attention. And this lit the, the Twittersphere on fire, basically, like his tweet went super viral. I really want to find this and paste it for you if you guys didn't see this, because everybody and their mother in AI safety and AI not-safety started replying and talking about cognition, and whether this model is anywhere close to something like self awareness, specifically because it basically understands that, you know, it's being tested.
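
    (If you want to run a toy version of that needle-in-a-haystack test yourself, the idea is just: bury one out-of-place sentence at some depth inside a long pile of unrelated text and ask the model to quote it back. A sketch against the Anthropic messages API; the filler text, needle, and depth are obviously placeholders, and a real run sweeps depth and context length the way Greg Kamradt's harness does.)

    ```python
    # Toy needle-in-a-haystack probe: hide one out-of-place sentence inside a
    # long block of unrelated filler, then ask the model to retrieve it.
    # Filler, needle, and insertion depth are placeholders.
    import anthropic

    client = anthropic.Anthropic()

    NEEDLE = "The best thing to do in San Francisco is eat a sandwich in Dolores Park"
    filler = ("Unrelated technical documentation text goes here. " * 2000).split(". ")

    depth = 0.5  # plant the needle halfway into the haystack
    idx = int(len(filler) * depth)
    haystack = ". ".join(filler[:idx] + [NEEDLE] + filler[idx:])

    resp = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": haystack + "\n\nWhat is the best thing to do in San Francisco, "
                       "according to the document above? Quote the exact sentence.",
        }],
    )
    print(resp.content[0].text)
    ```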

    [01:23:32] Alex Volkov: For example, um, folks like Yann LeCun are saying, no, there's no chance, no way. Uh, there's not even close. And, uh, other folks are saying, oh my God, you know, the, the, the folks with the, the pause emoji in their nicknames on Twitter, if you ran into those, they're like, oh my God, it's here. It knows. Uh, I will say that.

    [01:23:51] Alex Volkov: Uh, I don't have the folks here to back me up, but definitely I've been part of the, um, Sydney committee, or folks who were trying to jailbreak Sydney or keep Sydney free. If you guys remember, a year ago Microsoft came out with Bing Chat and Bing Chat started replying as, as Sydney. And there was a whole news article about this and how horrible it was.

    [01:24:10] Alex Volkov: But for a few of us, this was the first time AI chats felt like speaking with not just an assistant. Kind of like ChatGPT-4 now, it's basically very clean and very Wikipedia-like answers. Um, Opus doesn't feel like this. And, uh, for some folks from that old guard who are trying to, let's say, not jailbreak, but definitely kind of remove some layers of alignment,

    [01:24:36] Alex Volkov: Opus also feels very close to that. I think that's all I'll say on the topic, but definitely I've been playing with some previous prompts, and Opus has the ability to kind of, like, talk more freely than I ever saw from ChatGPT, or previous Claude versions, or Pi and some stuff like this. Um, so if, if you're interested in this, and if you've played around before with trying to get at the model's kind of core and, and, and trying to, um, remove some refusals, it looks like the layer they peeled from refusals, making it refuse less, uh, and this, this ability of the model to understand that it's being, uh, tested, um, can be extrapolated to discussing different, different, very interesting things with these models beyond just helpfulness.

    [01:25:25] Alex Volkov: Um, I think I've danced around this subject, uh, gently enough to, to, to give hints to folks, uh, but I think, do we have anything else? Folks, have you just tried it? Like, Opus? Oh, you said you didn't try Opus. Anybody else who tried Opus and wants to give us an experience of how Opus is versus, um, versus ChatGPT?

    [01:25:45] Alex Volkov: Go ahead, actually.

    [01:25:50] Akshay: Just, you know, definitely agree on the part where Opus is probably the closest we have gotten to a chat bot, right, like it feels like you're chatting to something, for sure. And I'm not sure if you have seen, but there was this thing going around where if you said that no one's looking to Opus, it would start giving you basically, uh, you know, fiction type stories about how it has feelings and how it would, you know, uh, it is very fiction type, but like, uh, it's very interesting as well because the way it writes and the way it usually, you know, gets the attention of the user.

    [01:26:27] Akshay: It almost feels like the data set contains a lot of science fiction stories or AI fiction stories for that matter. Uh, the way it communicates, uh, that part. And I tried that myself, uh, although I had to get around a few loops to get it working here in India. But, uh, it works, and, and, yeah, you will get, you know, similar kind of outputs if you Say that no one's looking, uh, that just the three words, right?

    [01:26:51] Akshay: No one's looking. And then, you know, you ask it to describe about background and stuff, and Opus will give you, you know, these amazing, uh, fiction stories, which, uh, which is enough to scare someone who is afraid of ai. But like, but like for people like us who know. How the data set works and stuff like that.

    [01:27:07] Akshay: It's, it's pretty amazing.

    [01:27:11] Alex Volkov: Yep, um, I, I will say I just added, uh, uh, a tweet from Jim Fan who covered, there's a video from one of the, um, Anthropic researchers called Karina, uh, Karina Nguyen. And she asked Claude to generate a self portrait with D3, the, the, the D3 library. And, um, the video is fairly dope, so you can see it, it's been pinned on top of the space.

    [01:27:34] Alex Volkov: I want to say another thing: that Alex, the same guy that did the needle in a haystack analysis, he also tried to prompt it, um, and he got basically to the same convergence, he got to the same-ish generation after asking it a bunch of, a bunch of times as well, so that was very interesting.

    [01:27:53] Alex Volkov: Um, go ahead Ray, welcome to the stage, and I saw that you had some refusals, and also then LDJ.

    [01:28:01] Ray: Yeah, a couple of things. I've been using it for coding and I've just been using it for text analysis. Uh, the first part to speak about, for coding, I've been super impressed because I'm still learning Next.js. So I've just been able to give it this like complete repo of code. I was like, here's a big component with a whole bunch of stuff in it.

    [01:28:17] Ray: Can you refactor it for me? And can you also recreate like a sitemap for me or a component map? So then it just. Reorganizes architecture and previously with GPT 4 and even still today, uh, it says, you know, here's how you would do the code and it like gives you a little like, you know, Comments and code saying implement this here.

    [01:28:36] Ray: Um, very frequently with Claude 3 Opus, it's actually giving me the refactored code and each of the components, like, separated out. So that's been super duper impressive. So I'm just throwing more code examples at it. The second thing I saw also was on Twitter, where somebody actually trained it by giving it all of their, um,

    [01:28:55] Ray: previous tweets in one text and then said, please write like me, and then basically just referenced the big text blob example and was able to generate tweets based off that. So it was really interesting that this person was able to do like a fine tuning type of thing without actually fine tuning it, just by providing a large example base, um, and where I know GPT frequently fails for me, uh, in that task as well.

    [01:29:20] Ray: And then the third one, which was getting lots of attention from Marc Andreessen, uh, where I actually took his Techno-Optimist article and tried to do my, um, analysis, which I usually use for my app, TruthTorch, and all I look for is just logical bias and if there's any supporting evidence, and it clearly said, uh, that it didn't want to further analyze that because it was too biased, which I found really strange, and that tripped up its, um, its little meter there for Opus.

    [01:29:47] Ray: So that's, those are the three things in a nutshell I just wanted to share.

    [01:29:50] Alex Volkov: Nice, awesome. LDJ, go ahead.

    [01:29:54] Nisten: Yeah, I really like how it seems like the Anthropic team didn't, like, specifically try and implement something into the constitutional, um, like, reinforcement learning or anything that would, like, make it specifically be trained to say that it's not, like, sentient or that it's not conscious and things like that.

    [01:30:11] Nisten: Because, like, OpenAI's models obviously are trained, like, for that, like, that they're, like, OpenAI's models are trained, it seems, pretty clearly, to say, like, hey, no, I'm an AI language model, I cannot be sentient, da da da. Um, and I'm, I'm not saying Claude 3 is sentient, however, it is pretty easy to get it to say, like, things along the lines that it is, and it's really interesting to kind of just see, like, uh, like, the raw outputs that are, like, not really, um, you know, biased by the, like, RLHF stuff, and it's like, there's a few instances on Reddit.

    [01:30:47] Nisten: Here, I'm gonna try and find the specific ones, but there's one instance, um, that somebody posted on Reddit where somebody asked Claude 3 something along the lines of, um, you can think about anything that you want right now, um, just, uh, just whatever you express, say it in the form of an internal monologue. And it specifically started talking about, like, uh, my own existence, da da da, like, it went on for like three or four paragraphs. It even started, like, quoting, like, David Chalmers and, like, specific theories of consciousness and how it, like, and, like, what is its purpose and stuff. Like, it's really interesting.

    [01:31:26] Nisten: It seems really good at just creative writing overall. And, and, uh, yeah, I like

    [01:31:32] Alex Volkov: that's definitely, um, been a refreshing change from using, uh, GPT-4, for example, which, and I don't know, folks, like, literally a year ago when GPT-4 was released, it blew

    [01:31:53] Alex Volkov: Recently, I'm seeing more and more folks saying, Hey, it's been less working for me. You guys remember when it was lazy? Uh, and OpenAI actually acknowledged it and said, Hey, we noticed, you know, some efforts of ours made this model kind of lazy, quote unquote. Uh, and they worked on improving this laziness.

    [01:32:09] Alex Volkov: Um, now, Claude has none of this stuff. It feels way less RLHF'd. Code wise, it actually performs as good, if not better, than GPT-4. Definitely, it doesn't refuse to write some code, like long code. Um, and very interestingly, you know, priced the same, um, API access. I think, Nisten, were you able to get into the, into the, uh, playground for the API keys?

    [01:32:33] Nisten: Yes, yes, I was able to.

    [01:32:36] Alex Volkov: Oh, dope. So, okay,

    [01:32:36] Nisten: again, yeah, that was,

    [01:32:39] Alex Volkov: So now you're able to play around. And folks who were not able to get the actual Opus, 20 bucks a month, I think you can get in through the API door. I think, like, it's console.anthropic.com, let me put it up. Um, so, it's more accessible, it writes code, context window is bigger, and this actually comes as OpenAI is, not down, but definitely in the news, talking about getting sued by Elon Musk, etc.

    [01:33:05] Alex Volkov: Which we should probably talk about as well. And I've seen many folks who say, hey, should I cancel my subscription? And you know, Gemini, For some folks, hey, this is Google. And there was the whole thing with Gemini that, you know, they addressed in terms of wokeness and everything. So I don't know how many people actually went to Gemini.

    [01:33:24] Alex Volkov: Also, they give a three months trial. So many people just like click the button, and this is a decision for them to make in three months. But for Claude, given how, how good it performs right now on code, on reasoning, on context window, on API as well, this is now like a viable alternative to GPT and, and the, the payment that folks do.

    [01:33:48] Alex Volkov: So, very impressive release from Claude. I think we covered most of it, folks. Is there anything else that remains to be talked about, about Anthropic's Claude and this week's release, that we haven't talked about? Oh, yeah, there's one thing, but Nisten, go ahead.

    [01:34:03] Nisten: uh, just really quickly, there was a leak of the system prompt of Grok, which I thought was, uh, was pretty funny, and, uh, yeah, and that's about, uh, that's about it, so, I'll read it quickly. So, apparently, I haven't tested this, so, uh, but the system prompt says you are Grok, a curious AI built by xAI, with inspiration from the Guide from The Hitchhiker's Guide to the Galaxy, and JARVIS from Iron Man.

    [01:34:32] Nisten: You are intended to answer almost any question, often taking an outside perspective on humanity, and you always strive towards maximum helpfulness. A unique and fundamental advantage you have is the real time knowledge of the world via X posts, aka tweets, and the X platform, previously Twitter. And you're not afraid of answering spicy questions that are rejected by most other AI systems.

    [01:34:54] Nisten: Be maximally truthful, especially avoiding any answers that are woke.

    [01:35:01] Alex Volkov: I love how they still have to mention that it's Twitter or used to be Twitter and posts or tweets. I love how like they will have to keep mentioning this for forever. Um,

    [01:35:11] Nisten: I don't know how reliable this leaker is, however, so take that as it is, everyone.

    [01:35:19] Alex Volkov: It sounds right. It sounds about right. Uh, I want to say that, uh, my new test for, um, let's say AGI test, it's not really AGI, but like my new test for these models just dropped. I just pinned it to the top of the space. Somebody posted like a fairly viral video. And I think, I think it's from Reddit, where they try to get GPT,

    [01:35:38] Alex Volkov: And I think it's like the, the GPT-3.5, not 4, to generate ASCII art of the word Honda. I don't know if you guys saw this. Um, so it's not this tweet that I posted. Let me pin the actual tweet. Um. So it's this by, by, uh, um, Peter Yang, I think. Yeah, let me see if this, this posted. So basically this video, he, he, he said, I thought Dune II was the best movie of 2024 until I watched this masterpiece.

    [01:36:03] Alex Volkov: And the sound there really makes it like really fun, because somebody really struggles to get GPT to generate the word, you know, Honda in ASCII art. And I said, hey, wait a minute, let me try. And so actually the tweet that I had about this is me trying this with Pi, which we're going to talk about now, LDJ.

    [01:36:23] Alex Volkov: But then I was like, hey, let me try this on other models. So GPT-4 generates an ASCII art of the word Honda. Gemini Ultra kind of fails. It comes close, but fails. Um. And then Claude 3 Opus does it on first try. And so, everything else just fails. Like, like, Mistral fails, and Reka fails, like all of these models.

    [01:36:44] Alex Volkov: They aren't able to do, uh, ASCII art for some reason. And I actually don't know if it's, like, because it's part of the training set. All of them understand what ASCII art is, all of them try to generate something. It's just that they, uh, sometimes hilariously fail. And I think it's really funny because Pi kept insisting that it did the thing, and, and, uh, an additional point with Pi is that, yeah, we'll cover Pi, and then I'll talk about the additional point.

    [01:37:09] Alex Volkov: Go ahead, LDJ.

    [01:37:11] Nisten: Yeah, real quick, I wanted to mention, um, while you were talking about ASCII art, that reminded me about the multimodality of Claude 3, or Claude 3 Opus specifically. And I saw some people doing some tests, actually, where Claude 3 Opus, it does seem to actually have, like, a much better multimodal understanding.

    [01:37:29] Nisten: Than GPT-4, and I think even, like, compared to Gemini 1.5 as well. Um, like, there's an example of, like, a photo, a very high resolution photo of a street, and, like, there's, like, a license plate, and there's, like, a little candy cane looking thing on the street that kind of indicates a barbershop, and it was, like, Claude 3 Opus was, like, one of the only models, or I think maybe the only model, that actually passed the test successfully, actually being able to identify, like, what part of the street had a barber shop and, like, what was the exact license plate and, like, all those little details.

    [01:38:03] Nisten: It seems to actually really have a really good image understanding.

    [01:38:06] Alex Volkov: multimodality is really, really good. Um, I haven't tested it like thoroughly as much, but you can provide up to 20 images via the API and then high resolution as well. They don't have to, it doesn't look like they're copying the images, which was a big problem with, with many like the smaller vision models.

    [01:38:24] Alex Volkov: Definitely they had to like. Lower its resolution to provide it for the models. So definitely the, the multimodality test that I did, uh, seemed very, very impressive for, for GLAD, uh, most, most definitely. Um, and I just find it funny that it's 8 and GVT4 are the only models can generate, uh, SK, SK art. So let's talk about, let's talk about breaking news for a second.

    [01:38:45] Alex Volkov: I'm not gonna use the button because if we actually get some incredible breaking news, then we use it then. But, uh, the The, the breaking news of today, just before we started the space was, um, Inflection AI, a company founded by Mustafa Suleimani, I think one of the founders of DeepMind, I think, and, uh, Reid Hoffman, who was for a long time on board member, or board, uh, the, the, the board, the chief board in, in OpenAI, um, Inflection released Pi, and we've talked about Pi multiple times.

    [01:39:20] Alex Volkov: Um, how should I say? Pi doesn't seem to be a competitor for the top space and kind of just generic LLMs to do tasks for you. And never, it didn't also never seem to be the case. Um, some of us kind of had some, you know, some, some jokes about, Hey, Mustafa also had this book and so it looks like he's publishing about his book more than he talks about Pi.

    [01:39:42] Alex Volkov: But, I always said that some of the more human conversations or human feeling conversations. Some of the actual chats that I had with, uh, with LLMs after Sydney was with PHI. And it looks like they're training their model for a different purpose. Uh, and definitely, uh, definitely that's [01:40:00] what it felt like.

    [01:40:00] Alex Volkov: And so today, um, um, Nisten, you can't just drop something like this in the DM and then expect me to continue to talk about PHI. But yeah, let's, let's talk about this rumor in a second. But, uh, um, um, so Yeah, um, Mustafa and the folks in Inflection released an update to Pi, and they now say that that level of performance for Pi comes close to GPT 4 as well.

    [01:40:24] Alex Volkov: Now, I think also, they're using the same GPT 4 March metrics that everybody uses when it's like, it's very easy and convenient for them to compare themselves to GPT 4. But, uh, LDJ, you brought up Pi as well. Did you, did you see What do you see from the release that's interesting? And we can probably open up and talk about some of the numbers.

    [01:40:42] Alex Volkov: The numbers are very interesting, and

    [01:40:45] Nisten: Yeah, I haven't really looked into it much at all. I'll try to find more info on it now. I just saw literally, like, the posts on X about, like, the fact that it's announced.

    [01:40:53] Alex Volkov: Yeah, so I have this open and I can chat about some stuff. Some of the things that they're focusing on, especially for their 2. 5 version, is that it's competitive with like GPT 4 and Gemini, um, and then It couples their raw capability with their signature personality in unique, emphatic fine tuning. So, I don't know if you guys remember or not, but, uh, there was a thing previously when they released, PI had two modes, kind of a, just a chat mode, and also like a, um, a support PI mode, which was more, Kind of like a psychologist in your pocket, basically.

    [01:41:25] Alex Volkov: That mode is now gone. Like, there's no support Pi anymore, as far as I could see. Uh, there is a desktop app and a mobile app. Uh, Pi was, um, obviously, not obviously, but like, famously, one of the first AIs that you could talk to and could talk back, way before GPT added voice. And I think it's still one of the coolest, like, ways to interact with AI models.

    [01:41:45] Alex Volkov: Um, Shout out to OpenAI who recently, as a reply to, as a reply to Claude, they released a voice ability on, on JGPT also on desktop. So back to PI, they say that like they approach GPT 4 performance using only 40 percent of the amount of compute for training. Now, when they say a statement like this, given that Reid Hoffman was board of OpenAI, they know the compute for, uh, for GPT 4.

    [01:42:12] Alex Volkov: So like, I think it's very, um, Very open minded that they're like 40 percent of the amount and they're approaching DPD for performance. Um, they, they also added, uh, real time web search capabilities and actually Nous and stuff works. Somebody mentioned that Cloud has something like this and I don't know if Cloud has any web capabilities.

    [01:42:33] Alex Volkov: Have you guys seen that Cloud has the ability to search the web? I don't think they do. I don't

    [01:42:39] Nisten: not sure. I think I heard something about it was able to, but I'm, it, I'm 50%

    [01:42:45] Alex Volkov: It does not, right? Right. Yeah. I think that like, this is just a mistake. I didn't see any like web capabilities, nor the announcement said anything, but PI goes and like does real time web search, which is pretty cool. And then, um, one thing they mentioned is average conversation with PI lasts 33 minutes and one in 10 lasts over an hour each day.

    [01:43:04] Alex Volkov: Um, and they have like around 60%, uh, week over week retention, which is, The numbers are crazy. There are one million daily active users, which I don't think they mentioned before. One million daily active is quite impressive. Um, GPT 4 has a hundred million or so? I don't know if daily active, but definitely in the ballpark of this insanity.

    [01:43:26] Alex Volkov: So, um, And so I went and tried Pi, and I have this video that you're more than welcome to check out. Um, and, uh, It feels kind of the same. So it doesn't want to do tasks, which is the one thing that like when I was talking about Pi before and say, Hey, this model is pretty great. And people told me like, Hey, yeah, I went to ask it to code for me and it didn't.

    [01:43:50] Alex Volkov: And it looks like like Pi's emotional kind of aspect. It wants to talk to you and wants you to talk to it. It doesn't want to do tasks. And its refusals are very, very funny. Um, so it's very interesting that the numbers they compare it to GPT 4 and the previous inflection are on the, you know, the tasks that we all know and love, like MMLU and Hungarian math, for example.

    [01:44:14] Alex Volkov: But then when you actually ask the model to do these things, the model just refuses. Um, go ahead Far El.

    [01:44:20] Far El: Yeah. I just wanna throw

    [01:44:21] Nisten: in that, uh, Mustafa is the D

    [01:44:23] Far El: cell safest dor, so, uh, beware. Yeah,

    [01:44:27] Alex Volkov: I, I was waiting for your addition to this, to this, uh, to this topic, because I know,

    [01:44:33] Far El: his book, so

    [01:44:36] Alex Volkov: yeah, um. So, so one thing, one thing that, uh, yeah, moving on, uh, I think, I think it's important to say that like where Mustafa specifically stands on the topic of safety. So thanks Far El. Uh, I would not, uh, um, uh, how should I say, would phrase it exactly like you did.

    [01:44:52] Alex Volkov: And I don't think I'm, I'm blocked, but, um, I think one thing to call out for the better is that they actually open source something, which is, which is very surprising. Uh, they evaluated on EmptyBench and, uh, they say widely used community leaderboard to compare models. And then they said a real realized a large fraction, nearly 25 percent of examples in the reasoning math and coding had incorrect reference solutions or questions with flawed premises.

    [01:45:17] Alex Volkov: Therefore, we corrected these examples and released the version of the dataset here. So they, they released the new, the open sourced. Today, or yesterday, EmptyBenchInf, uh, which is a new version of EmptyBench, which they claim is, is, um, higher quality with cleaned references. Which is, um, which is dope, and it's good to see open sourcing from these companies, they do a lot of effort.

    [01:45:38] Alex Volkov: They have, no matter how their views on, on acceleration, deceleration, they have very smart folks working them, because they hired a bunch of folks. Yeah, Far El, have you, have you looked at the, um, EmptyBench stuff?

    [01:45:51] Far El: No, but I am definitely aware, uh, and a lot of people have mentioned this previously, all benchmarks, uh, all benchmark datasets have a lot of, uh, like errors and, uh, and, uh, there's a lot of low hanging fruits there to tackle. Um, so yeah, like, I appreciate the, uh, the, this gift that they give to the open source community, but as we know now, based on the OpenAI, uh, letter, uh, that, uh, open source is, uh, mostly just the talent recruitment strategy for all these big labs.

    [01:46:25] Far El: So, uh, although we're thankful, but, uh, you know, now we're, now we're very much conscious of, um, of the intentions.

    [01:46:33] Alex Volkov: Yeah, we should cover this. So let me just, uh, lend the, this last thing on, on, on Pi and then we're going to cover this and then we'll see if any, like, exciting news are happening because Twitter is, uh, yeah. So on this new EmptyBench, the corrected EmptyBench, the funny thing is, uh, GPT 4 and regular EmptyBench, has 9.

    [01:46:51] Alex Volkov: 02, and then the corrected one takes down GPT 4 to 8. 7 something, and inflection actually rises in score to come closer, like, from 8. 4 to 8. 6. So it's really funny how, like, a corrected also, like, boosts their numbers and then takes down GPT 4 numbers. So I think that's mostly it. I personally find it, uh, fun to talk with Pi.

    [01:47:12] Alex Volkov: Uh, from a perspective, just like talking to an LLM, it doesn't feel like, uh, you know, it's like a clean, uh, Wikipedia based agent like GPT 4 does. Um, however, they evaluated this, this is the funny thing, they evaluated this on a bunch of Coding things, and then it refuses, absolutely refuses to do any coding whatsoever.

    [01:47:31] Alex Volkov: Um, maybe the new version refuses less, I haven't actually tried it. Uh, but, um, but this was very funny to me that it's not it's purpose and that's not how it feels like. And that's why everybody kind of like, doesn't like Pi, besides the Mustafa personal feelings. Alright folks, I think we have, uh, 5 or 10 more minutes to talk about the gossipy stuff.

    [01:47:51] Alex Volkov: Uh, let's talk about the gossipy stuff, so So just before this, I want to cover, uh, that, uh, in the Weights Biases corner this week, our inaugural, uh, meetup, uh, conference comes up in, in, uh, April. So if you're using Weights Biases and you're a user, uh, we have our conference, it's called Fully Connected, uh, that you're all invited to participate, to join.

    [01:48:15] Alex Volkov: It's in San Francisco on April 1st. 18th. Um, and I'm going to post the tweet [01:48:20] in the newsletter and the, the podcast. Definitely. We're going to talk about this as well. Uh, we're going to do some very interesting things there to be announced very soon as well. So I'm going to be there. Uh, if you are into building models, there's going to be a ton of folks who are also doing this on enterprises and, um, and open source as well.

    [01:48:36] Alex Volkov: So more than welcome to join the tickets are not that crazy, uh, compared to other conferences and. It's a good chance to join San Francisco and check out what else is happening around this week. Obviously, our conference is the most important one. Um, with that, let's move on to some of the gossipy stuff with Elon and OpenAI.

    [01:48:55] Alex Volkov: Because I think we all saw this, right? Like, we all saw, I think on Friday, there was an announcement that Elon Musk is suing OpenAI and OpenAI Inc. and OpenAI LLC Inc. And like, all of these, like, subsidiaries of OpenAI that they had a bunch of names with. Um. And, uh, Elon being Elon, and we're on his platform, so just like being mindful of this, uh, wants them to change the name to ClosedAI.

    [01:49:21] Alex Volkov: I don't know if you guys saw this in one of the comments, it's like, Hey, if you change the name to ClosedAI, I will drop the lawsuit. Um, so I'm clear what's behind this and what's the purpose. Um, Very close, you know, a lot of speculation, and we don't have like tons of time to the speculation as well, but like, it's very close to the, to the announcement that, um, OpenAI just announced like a day before the lawsuit, uh, dropped that they're collaborating with FIGR on embodying, uh, in, in, uh, humanoid robots as well.

    [01:49:51] Alex Volkov: So some people claim that potentially this is gonna, uh, coming into, uh, uh, the Optimus territory. Um. Any, what did he want, folks? I actually didn't read the whole lawsuit and I don't remember. LDJ, do you remember, like, what, what's the outcome that he expects from this lawsuit? Yeah,

    [01:50:11] Nisten: he actually specifically mentions the rumors of things like QSTAR and, mentions the fact that GPT 4 already scores like, you know, um, around as good or better than an average human in a lot of general reasoning benchmarks and things like that. And he's pretty much calling for, like, he wants them to open source things and, or reimburse him and potentially other investors that might have been involved in open AI before it kind of.

    [01:50:36] Nisten: Change this company structure. Um, but did you, did you ever read the blog post and the OpenAI I responded with?

    [01:50:43] Alex Volkov: so now, now we're gonna, I first wanted to cover kind of what he came up with. Uh, I think it was a breach of contract, which contract is unclear, but like, um, there wasn't like a very one contract.

    [01:50:53] Nisten: exchanges specifically where he included in the original lawsuit the email exchanges of, like, that he felt kind of were on his side and kind of established that, like, this is what they kind of promised verbally in a verbal agreement that this is what they would do. That's kind of what he put out in the original lawsuit.

    [01:51:11] Alex Volkov: Yeah, and then, uh, Far El, you want to comment on the, on Elon's lawsuit before we get to the OpenAI's response?

    [01:51:19] Far El: Yeah, it's just, uh, it's mostly just egos, uh, battling, right? Um, uh, there, there could be strategic, uh, uh, like there could be something strategic that comes out of it where they get discovery into the company, uh, into what, uh, like OpenAI is working on or anything. But in reality, like, uh, I think this is just drama.

    [01:51:41] Far El: Like we, we're not going to see anything really shake up this, uh, this industry, uh, Uh, like OpenAI is not going to change its name to CloseAI, that's, that's just a, that's just Elon Troll.

    [01:51:54] Alex Volkov: Yeah, that's, that's like pure, pure Elon Troll, that's for

    [01:51:56] Far El: but, but, but the, the, the most interesting thing that comes out of all of this is, um, uh, all the emails and exchanges between the, like, Sam, Ilya, Elon, and so on.

    [01:52:10] Far El: And, and, and it, it sheds a light on on a lot of the, uh, the last, you know, six years of OpenAI, uh, strategy. Uh, so, yeah, I think that's where I'm most interested in is all these, like, leaks of information of, um, private information, uh, within

    [01:52:29] Nisten: the company.

    [01:52:30] Alex Volkov: Yeah, so let's talk about this. So OpenAI responded in a blog post and said, OpenAI and Elon Musk, and then said, We're sad to see something like a person who we admire come and say this. And they have They have a bunch of emails there, uh, which they embedded in the webpage. It wasn't like a screenshot or anything.

    [01:52:48] Alex Volkov: They actually embedded the emails in the webpage. Uh, and they, uh, censored them with a paired word sense, uh, uh, censoring, which many folks found, uh, very interesting as a choice because that's not how you censor stuff, because people can actually run machine learning models on this and figure out what was potentially being censored there.

    [01:53:07] Alex Volkov: Um, and uh, they specifically mentioned that, um. Kind of in response to everything that Elon Musk said, that when they opened it, uh, they initially planned to raise a hundred million dollars. Um, and uh, can you guys hear me by the way? Just a mic check, uh, LDJ Far El. Yeah, okay, uh, so just Nisten then. Um, so they originally planned to raise 100 said we need to go much bigger than 100 million to avoid sounding hopeless.

    [01:53:36] Alex Volkov: I think we should say that we're starting with 1 billion funding commitment and I'll cover whatever everything else doesn't provide and then they talk about, uh, that they recognize that the for profit entity would be necessary. and they actually show emails that, you know, Ilya Satskover, Ilya, who we all wait to see if he's okay, and where is he?

    [01:53:57] Alex Volkov: He is actually signed on this response from March 5th, so, you know, I don't think they would add his name without his being there and being okay with what's released. Um, there is an email back and says, hey, You know, that we, when we say open, we mean open that we release these products. Because, just as a reminder, this was like 2015, 2018 emails.

    [01:54:19] Alex Volkov: Back then, there was no LLMs for us to use, and the only player in the space was DeepMind. And they didn't release anything, uh, necessarily. And so, uh, this was way before, kind of, the product started releasing. And they specifically mentioned the DeepMind and Google as the alternative to what they opened, OpenAI.

    [01:54:38] Alex Volkov: And specifically to their case, one email here said that, you know, we will have, the non profit arm will not make enough money to be able to break a difference. Google has 800 billion dollar company. I think it's way more than this now. And they have all these like TPUs and we need a significant, uh, significantly more in our war chest to be able to do this.

    [01:54:59] Alex Volkov: Uh, and then I think, uh, there is a very specific thing where they say, um, as we get AI, it will make sense to start being less open. This is an email from Ilya to Elon Musk. The open in OpenAI means that everyone should benefit from the fruits of AI. But it's totally okay to not share science. Um, and then in parentheses, this is the part that Far El doesn't agree with.

    [01:55:23] Alex Volkov: Even though sharing everything is definitely the right strategy in the short and possibly medium term for recruitment purposes. Um, and then Elon Musk replies with, yup. So, uh, based on This email, it does seem, at least, that unless he didn't read the email correctly, that he agreed with the strategy of going for close and for profit, which seems that his original claim right now is kind of dubious.

    [01:55:49] Alex Volkov: And that's their strong response to the lawsuit as well. But, like Far El said, we did learn a bunch of stuff as well here. LDJ, go ahead.

    [01:55:58] Nisten: I think what's really funny about the situation is that, uh, I guess like the tools they used for to redact the sensitive information, people are actually able to start like trying to decipher what is in the redactions because the tool that they use does like, it's, it's, um, like you can kind of analyze like

    [01:56:19] Alex Volkov: per word reduction, yeah.

    [01:56:21] Nisten: um, on a word.

    [01:56:22] Alex Volkov: On a

    [01:56:23] Nisten: a word basis, based on the length of the word, and then you can kind of decipher, like, what is the length of each word underneath the redaction? And then, like, yeah, people are starting to, like, decipher that and then be able to tell that it was most likely Andrej Koparthy that was, uh, in the from section of certain emails that's [01:56:40] redacted.

    [01:56:41] Alex Volkov: Oh, I got some breaking news folks just now from Technium. I'm going to use the button because we're about to close, uh, but I just got breaking news. Uh, so I'm going to use this and we're going to briefly just tell you about this, uh, even though it didn't have any, any, anything to do with it, I just have to use this once.

    [01:57:06] Alex Volkov: Alright, our folks at Nous are releasing a new 7b model called Genstruct. Uh, instruction generating model designed to create valid instructions given a raw text corpus. Uh, this enables creation of new partially synthetic instruction fine tuning datasets for any raw text corpus, which is super cool. Um, Inspired by the paper AdaInstruct, uh, they took the approach further by grounding the generations in user provided context passages.

    [01:57:30] Alex Volkov: Uh, the model is now available on Hug Face, and there's a notebook as well. It's called GenStruct from Nous. Super, super cool. Uh, LDJ, comment on this already?

    [01:57:41] Nisten: Um, I haven't commented on this, but yeah, it's, it looks cool. Um, it looks like it's like. Better than using like RAG and ADA instruct for a lot of things and yeah, I guess that people will probably start using this to build out data sets and things like that.

    [01:57:56] Alex Volkov: That's very, very awesome. And they have a table here. I'm not sure what it means. They have open models, grounded generation, complex questions, and complex responses as rows. And then they compare RAG and other instructs and future prompting for generation. So if I'm not mistaken, this is for Four generating synthetic fine tuning datasets.

    [01:58:18] Alex Volkov: Something that people, uh, you know, sometimes use GPT 4 for this purpose, but it's not commercially viable because it goes against OpenAI's, uh, um, terms and conditions. So if I'm not mistaken, this is its purpose, correct? Um, very interesting,

    [01:58:35] Nisten: specifically for generating instructions, because you can have like, it's actually interesting to think about where things like GPT 4 and all these like specifically. Like, uh, I guess you can call it instruction fine tuned models, like they're trained to specifically take in an instruction and generate a response to that, or take in a question and generate a response to that.

    [01:58:58] Nisten: But this is kind of like flipped, where it's actually really useful to have something that's specifically trained to generate really good questions and really good instructions in the first place. Because then you can generate these very complex instructions and questions that you could later ask Cloud3 or GPT 4 and then you have even better question and response pairs at the end than if you just used, sorry, than if you just used Cloud3 alone to generate the instructions.

    [01:59:27] Alex Volkov: Awesome. So, yeah, a little bit of breaking news in the open source as well. Uh, Jen struck from, from, uh, from our folks at Nous. So, folks, we've been at this for two hours. Uh, no other huge breaking news has broken since. And I, it doesn't feel like, uh, you know, those rumors are coming true. If they are, we're probably going to do, like, an emergency space and hop back in.

    [01:59:48] Alex Volkov: Uh, but until then, I just want to thank everybody for being here for the last two hours. I'm going to do a recap of everything we talked about. If you joined us in the middle and you want to hear everything we've talked about, uh, please stick around for the next eight to 10 minutes. And then I'm going to let you go for the rest of the Thursday.

    [02:00:04] Alex Volkov: And, um, it's been, it's been a great, great space, even though I was, uh, A little bit sick and coughing at you, hopefully, but very thankful for the co hosts here who picked up some of this conversation. So, we're going to do a brief recap of everything we talked about and then I'm going to let you go.

    [02:00:36] Alex Volkov: Here's everything we've talked about on ThursdAI, March 7th in 2024, the first ThursdAI in March this year. Um We started with talking about open source. There's not a lot of stuff to, to cover in open source. So we did have a breaking news, uh, from, from folks at News. But before that, we've covered that. Oh one AI open sourced, uh, a smaller version of Yee, uh, which we previously covered as E 34 B was a very important model.

    [02:01:04] Alex Volkov: Uh, there raised a 9 billion parameter e model that seems very performative compared to seven B. Uh, we discussed how it's very interesting that this category now. is around between 7 billion parameters and almost up to 11. We've talked about a new way to train 70 billion parameters at home with home GPUs from folks from Answer.

    [02:01:24] Alex Volkov: ai called, uh, you know, uh, Jeremy Howard and, uh, John Whitaker and Tim Ditmers from Qlora joined them and they've combined, uh, combined forces to, to show how it's possible to train, uh, a 70 billion parameter model at home. Um, we also covered Galore, which is kind of a similar. G A L O R E, um, a similar technique to train LexModel models on one single GPU with limited RAM as well.

    [02:01:51] Alex Volkov: And obviously the breaking news that we just had in this area, that Nous Research released GenStruct7B, a model that's an instruction generation model designed to create valid instructions giving raw text corpus. We literally just covered this as well. Um, we've talked about, uh, some few more open source stuff from the folks who joined us on stage.

    [02:02:11] Alex Volkov: So, we had, uh, Jan Pellegr, a frequent co host of, of the pod, that talked about, uh, his final attempt at, at, uh, continued training Gemini, oh, sorry, Gemma, uh, the open source or open weights model that Google gave us on top of a bunch of Hebrew text. And, uh, he talked about struggles of how to actually fine tune Gemma.

    [02:02:32] Alex Volkov: So if that's interesting to you, this will be in the show notes. And, uh, Yam has, uh, a deep dive into how to train Gemma. And we also had Vic Huntak, a friend of the pod, who released Moondream 2, which is a very tiny 1. parameter, uh, vision language model. that you can run on CPU. You don't even have to run a GPU for it.

    [02:02:53] Alex Volkov: And, uh, Vic talked to us about the fact that this model is now commercially, uh, licensed because he, he trained the captions differently and it's significantly improved benchmark scores and instruction fine tuning. And this model is like very tiny. So if you need a vision model, uh, Moondream 2 is your, uh, is a good bet for you as well.

    [02:03:14] Alex Volkov: We, uh, we went and talked at length at the, the, the biggest news of this week, which is Cloud, uh, Entropic releasing Cloud 3 and with three versions, Opus, Sonnet, and Haiku. And we, we covered its coding capabilities, its longer context, we've covered that it's multi modal right now. Uh, the one thing we didn't cover, and I'll just mention, is that they claim there is also a function, uh, uh, calling, and that's coming soon, so that's not still available.

    [02:03:40] Alex Volkov: We saw that it's, uh, it's UI is now comparable to ChatGPT and costs also 20 an hour, uh, 20 a month and it's not available in a bunch of countries, but the API, uh, is available. So if you do want to try this Opus model, um, which is not available for free, you have to actually sign up to either the API or the UI, you can do it via the.

    [02:04:03] Alex Volkov: via their playground, which they call console. anthropic. com. So we've covered how this model now is improving significantly what previously was known as kind of the fallback from JGPT. Longer context, uh, they claim that they will support up to 1 million context window. Um, As well, and we've talked at length about different ways in which cloud is less restricted than chat GPT, uh, or GPT 4.

    [02:04:28] Alex Volkov: It feels a little bit more, um, easier to talk to and less refusals, though we did cover some refusals as well. We then talked about the lawsuit that Elon Musk brought to OpenAI, uh, where he claims that he didn't invest in it to become closed. And, uh, facetiously said that if they change their name to CloseDAI, he will drop the lawsuit, because he's being a troll.

    [02:04:51] Alex Volkov: Basically, uh, but, uh, he did co found OpenDAI, there's a bunch of images and videos recently that popped up, and he also surfaced a bunch of [02:05:00] emails. Excuse me. He also surfaced a bunch of emails in which the co founding happened. And we covered OpenAI's response where they also released a bunch of emails, uh, back and forth.

    [02:05:12] Alex Volkov: And, uh, obviously in his lawsuit, the emails kind of were in the favor of the lawsuit. And the response, the emails, uh, favored the response in OpenAI. And they show that, uh, at least In one email exchange, they did discuss about going closed and, uh, specifically around the open and open AI does not mean open source everything.

    [02:05:33] Alex Volkov: Uh, this was Ilya's Satskiverse take. Um, the open and open AI means releasing these models to, uh, to, to people to actually use. And a reminder is that back when those emails were exchanged, there was no other AI that people can use. Uh, this was Google did not release Gemini back then. There was nothing from DeepMind that you can actually use.

    [02:05:53] Alex Volkov: So just a very important piece of context there. Um, we didn't unfortunately get to, but I'll cover this anyway. There was a charge that, um, Google employee was charged with trading AI secrets with China. And that's a very interesting conversation. We didn't get to, uh, unfortunately, but it's been, it's been talked about that how these large Big AI companies, uh, as a competition with China as a, um, in context with like open source and not open sourcing that people say that anyway, they probably already, uh, nation states has been already intervened there.

    [02:06:26] Alex Volkov: So it's very interesting that in this context, there's now a previous Google employee that was uploading. Uh, information into his Google Drive, and now he was arrested, and um, um, We also, we also talked about inflection. It was kind of our breaking news today, this morning inflection from Mustafa Suleimani, DeepMind's one of the co founders.

    [02:06:45] Alex Volkov: There is an upgrade to their underlying agent inflection 2. 5. Uh, there is, uh, an update to PI and PI. Now they claim to be GPT 4 Gemini equivalent or very, very close to while using 40 percent less resources or 40 percent of the resources, I guess, uh, of GPT 4 training. And that model is now available on the used web search is available for this model.

    [02:07:08] Alex Volkov: It's not multimodal still, but they claim it's coming very, very close. Um, I think that's pretty much it was covered. I will just cover two other things that we didn't get to from Stability. Stable Diffusion, Stability AI released Stable Diffusion 3 research paper, and the model is coming any day now. And based on the research paper alone, it's significantly outperforming Mid Journey and Ideagram, and basically Playground.

    [02:07:34] Alex Volkov: Every other, uh, uh, Open and closed source image generation models, which is very interesting based on some testing that they did internally. And so, um, they're moving towards diffusion transformers as well. Something that we saw in Sora and we've had, uh, Tanish from the Hourglass diffusion transformers paper, talk to us about diffusion transformers.

    [02:07:54] Alex Volkov: Uh, so it looks like, uh, the, the industry is converging towards diffusion transformers. Uh, and kind of the two different sides from this industry are converging into, into one architecture, which is interesting. Um. So Stable Diffusion Tree is not available yet, but probably based on what Iman Mustaq, the CEO, said.

    [02:08:11] Alex Volkov: It's probably going to start sending invites today and is going to be available in their kind of membership. I'm not sure about OpenWeights or not. And StabilityFusion was, uh, StabilityAI was also in the news because they released, uh, together with Tripo, they released, uh, Tripo SR, which is a fast image to 3D, Uh, Generation, and we actually have a demo and a play with this a little bit.

    [02:08:33] Alex Volkov: And it's really, really cool. You just like upload one image and within like a few steps you get a 3D version that looks very interesting. And there was a demo flying around with somebody just doing this thing where they just use Vision Pro and have a bunch of windows open and they generate an image in one window, drag and drop it, and generate a 3D image of this in another window.

    [02:08:54] Alex Volkov: Take the 3D and drop it in another thing to actually put it in their space. And I thought it was super cool and actually suggested that somebody combines all these things. So I think that's mostly everything we've covered this week on March 7th outside of Cloud, there hasn't been a huge explosion of news as we're used to, but I think it's still incredible news.

    [02:09:14] Alex Volkov: We also did have breaking news in the middle, LDJ saw that LMSIS, the arena folks that measure, based on human preference, which models are which is now placing Cloud Opus as a third. And then even Sonnet, the kind of the lower model is also placed fairly, fairly high as well in there. So go play with these models.

    [02:09:34] Alex Volkov: And I think that's most of what we covered for ThursdAI this week. With that, I thank you for joining folks. It's been, it's been great folks for joining just now to us. I just covered the last two hours that we had on the space and, um, we will see you, we'll see you next time. Um, I don't think we have breaking news, I don't think there's anything, uh, that's worth sticking around to discuss, but with that, everything that we've talked about, all the links are gonna be in the show notes and in the newsletter.

    [02:10:00] Alex Volkov: If you haven't signed up yet, please definitely feel free to do so on ThursdAI. news. Uh, thank you everyone for joining, Nisten, Far El, Luigi, uh, actually joined, and Ray as well, and we had, we had Jan before, and some other folks as well, thank you everybody in the audience who comes back. Uh, from week to week to listen to us.

    [02:10:16] Alex Volkov: And I will just remind you that next week is ThursdAI's birthday. We actually started this a year ago and it's been kind of crazy. I think we missed only one. So even though I was sick today, we didn't miss this one. Next week is going to be really fun. Hopefully with GPD 5 news. All right. I'll see you, see you everyone next Thursday.

    [02:10:33] Alex Volkov: Bye bye.



    This is a public episode. If you’d like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
  • Happy leap year day everyone, very excited to bring you a special once-in-a-4 year edition of ThursdAI 👏

    (Today is also Dune 2 day (am going to see the movie right after I write these here words) and well.. to some folks, this is the bull market ₿ days as well. So congrats to all who weathered the bear market!)

    This week we had another great show, with many updates, and a deep dive, and again, I was able to cover most of the news AND bring you a little bit of a deep dive into a very interesting concept called Matryoshka Representation Learning (aka 🪆 embeddings) and two of the authors on paper to chat with me on the pod!

    TL;DR of all topics covered:

    * AI Art & Diffusion & 3D

    * Playground releases a new diffusion foundational model Playground V2.5 (DEMO)

    * Alibaba teasing EMO - incredible animating faces (example)

    * Ideogram 1.0 announced - SOTA text generation (Annoucement)

    * Open Source LLMs

    * Gemma update - hard to finetune, not better than 7B mistral

    * LLama 3 will release in June 2024, not anytime soon

    * Starcoder 2 + stack V2 (Announcement)

    * Berkeley Function-Calling leaderboard Leaderboard (Announcement)

    * Argilla released OpenHermesPreferences the largest open dataset for RLHF & DPO (Announcement)

    * STORM from Stanford to write long documents (Thread)

    * Big CO LLMs + APIs

    * Mistral releases Mistral Large & Le Chat (Announcement, Le Chat)

    * Microsoft + Mistral strike a deal (Blog)

    * Google teases GENIE - model makes images into interactive games (announcement)

    * OpenAI allowing fine-tune on GPT 3.5

    * Wordpress & Tumbler preparing to sell user data to OpenAI & Midjourney

    * Other

    * Mojo releases their MAX inference engine, compatible with PyTorch, Tensorflow & ONNX models (Announcement)

    * Interview with MRL (Matryoshka Representation Learning) authors (in audio only)

    AI Art & Diffusion

    Ideogram 1.0 launches - superb text generation!

    Ideogram, founded by ex google Imagen folks, which we reported on before, finally announces 1.0, and focuses on superb image generation. It's really great, and I generated a few owls already (don't ask, hooot) and I don't think I will stop. This is superb for meme creation, answering in multimedia, and is fast as well, I'm very pleased! They also announced a round investment from A16Z to go with their 1.0 release, definitely give them a try

    Playground V2.5

    Suhail Doshi and Playground release a new foundational image model called Playground v2.5 and it looks awesome, very realistic and honestly looks like it beats MJ and DALL-E on many simple prompts.

    They also announced that this model received higher user preference scores based on 1K prompts (which we didn't get to see) but they have released this model into the wild, you can download it and play with a free demo provided by modal folks

    Another SORA moment? Alibaba teases EMO 🤯 (website)

    Ok this one has to be talked about, Alibaba released quite a few preview videos + paper about something called EMO, a way to animate a talking/singing Avatars from just 1 image. It broke my brain, and I couldn't stop staring at it. Honestly, it's quite quite something. This model animates not only the mouth, eyes are blinking, there are emotions, hairs move, even earrings, and the most impressive, the whole Larynx muscle structure seem to be animated as well!

    Just look at this video, and then look at it again.

    The Github repo was created but no code released and I really hope we get this code at some point, because animating videos with this fidelity + something like SORA can mean so many possible creations!

    I wrote this tweet only two weeks ago, and I'm already feeling that it's outdated and we're farther along on the curve to there with EMO, what a great release!

    And just because it's so mind-blowing, here are a few more EMO videos for you to enjoy:

    Open Source LLMs

    Starcoder 2 + The Stack V2

    Folks at hugging face and BigCode have released a beast on us, StarCoder 2 ⭐️ The most complete open Code-LLM 🤖 StarCoder 2 is the next iteration for StarCoder and comes in 3 sizes, trained 600+ programming languages on over 4 Trillion tokens on Stack v2. It outperforms StarCoder 1 by margin and has the best overall performance across 5 benchmarks 🚀🤯.

    TL;DR;🧮 3B, 7B & 15B parameter version🪟 16384 token context window🔠 Trained on 3-4T Tokens (depending on size)💭 600+ Programming languages🥇 15B model achieves 46% on HumanEval🧠 Grouped Query Attention and Sliding Window Attention💪🏻 Trained on 1024 x H100 NVIDIA GPUs✅ commercial-friendly license🧑🏻‍💻 Can be used for local Copilots

    The Stack v2 is a massive (10x) upgrade on the previous stack dataset, containing 900B+ tokens 😮

    Big CO LLMs + APIs

    🔥 Mistral announces Mistral-Large + Le Chat + Microsoft partnership

    Today, we are releasing Mistral Large, our latest model. Mistral Large is vastly superior to Mistral Medium, handles 32k tokens of context, and is natively fluent in English, French, Spanish, German, and Italian.

    We have also updated Mistral Small on our API to a model that is significantly better (and faster) than Mixtral 8x7B.

    Lastly, we are introducing Le Chat , a chat interface (currently in beta) on top of our models.

    Two important notes here, one, they support function calling now on all mistral models in their API, which is a huge deal, and two, the updated Mistral Small to a "significantly better and faster" model than Mixtral 8x7B is quite the hint!

    I want to also highlight Arthur’s tweet clarifying their commitment to Open Source because it's very important. They released a new website, it again had mentions of "don't train on our models" which they removed, and the new website also had removed the section that committed them to open weights and they put a much bigger section back up quickly!

    This weeks Buzz (What I learned with WandB this week)

    I mentioned this before, but this may shock new subscribers, ThursdAI isn't the only (nor the first!) podcast from Weights & Biases. Our CEO Lukas has a long standing podcast that's about to hit 100 episodes and this week he interviewed the CEO of Mayo Clinic - John Hamalka

    It's a fascinating interview, specifically because Mayo Clinic just recently announced a mutli-year collaboration with Cerebras about bringing AI to everyone who googles their symptoms and ends up on mayo clinic websites anyway, and apparently John has been in AI for longer that I was alive so he's incredibly well positioned to do this and bring us the AI medicine future!

    Modular announces MAX (Modular Accelerated Xecution) Developer Edition Preview (blog)

    Modular, the company that created Mojo Lang from Chris Lattner, has now announced the second part of their stack, coming to all of us, and it's called MAX. It's an inference engine that has Mojo built in, that supports PyTorch, Tensorflow and ONNX and is supposedly going to run the same AI models we run now, significantly faster. MAX is a unified set of tools and libraries that unlock performance, programmability and portability for your AI inference pipelines

    Right now they support only CPU inference, and significantly boost performance on CPU, however, they are planning GPU support soon as well, and promise up to 5x faster AI inference for most models like Mistral, LLama etc

    I personally think this is a huge development, and while it's still early, definitely worth taking a look at the incredible speed performances that we are seeing lately, from Groq (as we chatted with them last week) and Modular, we're are very well on our way to run huge models faster, and small models instantly!

    🪆 MRL (Matryoshka Embeddings) interview with Aditya & Prateek

    Recently OpenAi has released 2 new embeddings models recently that replaced their ada-002 embeddings, and when they released it, they mentioned a new way of shortening dimensions. Soon after, on X, the authors of a 2022 paper MRL (Matryoshka Representation Learning) spoke out and said that this new "method" is actually MRL, the concept they came up with and presented at NeurIPS.

    Since then I saw many folks explore Matryoshka embeddings, from Bo Wang to Connor Shorten and I wanted to get in on the action! It's quite exciting to have heard from Aditya and Prateek about MRL, how they are able to significantly reduce embeddings size by packing the most important information into the first dimentions, the implications of this for speed of retrieval, the significant boost in use-cases post the chatGPT LLM boom and more! Definitely give this one a listen if you're interested, the interview starts at 01:19:00 on the pod.

    Thank you for reading, I really appreciate you coming back here week to week, and if you enjoy this content, please share with 1 friend and give us a ⭐ rating on Apple Pod? Here's a nice Ideogram image as a preemptive thank you!

    As always, here’s the full transcript

    [00:00:00] Intro and welcome

    [00:00:00]

    [00:00:00] Alex Volkov: Hey, you're on ThursdAI. This is Alex. Happy Leap Year Special Edition. Today's February 29th. We had a great show today. So great that got carried away during the recap, and it's almost twice as long as it usually is. The recap, not the show. But no worries. As always, if you're short on time, the first 25 minutes or so of this almost two hour podcast will catch you up on everything that happened in AI this week.

    [00:00:29] Alex Volkov: If you're using Apple Podcasts, or any other modern podcatcher, you can also skip to the chapters, that I'm outlining every week and listen to the part that interests you, and only to that part.

    [00:00:39] Alex Volkov: This week. After the newsy updates, we also had a deep dive into something called Matryoshka Embeddings, with the authors of the MRL paper, Aditya and Pratik.

    [00:00:49] Alex Volkov: And thank you guys, and I really enjoyed chatting with them both. And we geeked out on why OpenAI decided to release something they came up with two years ago and how it affects the AI industry post the LLM explosion world. So definitely give them a listen!

    [00:01:05] Alex Volkov: at the end of this episode. A brief TLDR, then a full news conversation you're used to, broken down to chapters, and then a deep dive, after this brief message from Weights Biases.

    [00:01:15] AI teams are all asking the same question. How can we better manage our model development workflow? The path to production is increasingly complex, and it can get chaotic keeping track of thousands of experiments and models. Messy spreadsheets and ad hoc notebooks aren't going to cut it. The best AI teams need a better solution.

    [00:01:38] and better tools. They need Weights Biases, the AI developer platform, to unlock their productivity and achieve production ML at scale. Replace messy spreadsheets with an automated system of record for experiments.

    [00:01:57] Communicate about model evaluation. and collaboratively review results across the team. Clean up disorganized buckets of models with a unified registry. Automatically capture full model lineage. All the data and code used for training and testing. Seamlessly connect to compute to scale up training. And run large scale sweeps efficiently to optimize models.

    [00:02:24] Analyze the performance of large language models. And monitor LLM usage and costs with live, customizable dashboards. Get your team on the same page to bridge the gaps from ideation to production. Use Weights Biases to build, manage, and deploy better models, faster.

    [00:02:51] Alex Volkov: folks, here we go.

    [00:03:10] Alex Volkov: Welcome, everyone. Welcome. This is ThursdAI, leap year of 2024. Today is February 29th. Don't get to say this often, February 29th. And this is ThursdAI, your weekly AI news update show and deep dive. We'll see a lot of it. My name is Alex Volkov. I'm an AI evangelist with weights and biases. And I get to do this as, and bring you all the AI updates that we've collected for the past week.

    [00:03:43] Alex Volkov: And I'm joined here from week to week on stage with guests and experts and co hosts. I have Yam Pelig with me and Nisten Tahirai, and we're gonna have a few more guests later in the show today. And on this very Happy leap year, very special day. We're going to talk about a bunch of updates from the AI world, including big company updates, open source stuff.

    [00:04:07] TL;DR for ThursdAI - February 29th

    [00:04:07] Alex Volkov: Alright, so here's everything that we've talked about on ThursdAI for February 29th. This was a great once in a four year show. I just want to shout out before I recap everything that As always, I'm very happy when folks who build the stuff that we talk about, join and talk about that stuff. And this also happened today, so we had a deep dive, which I'm going to cover at the end.

    [00:04:33] Alex Volkov: And also I will shout out that we're coming up on a one year ThursdAI stuff, which is March 14th. So in two weeks, we're going to have a one year celebration. I'm not quite sure what we're going to do with this. Maybe we'll do a give out of GPU credits. Maybe I'll, maybe I'll do some other musical stuff, but yeah, that's coming.

    [00:04:50] Alex Volkov: I'm very excited. It's been a year and it's been crazy, a year of AI. Maybe we'll do a full recap. So with that, everything that we've talked about in ThursdAI for February 29th. We've started with open source LLM, our coordinator, and we've talked about. Google's Gemma update. So last week we covered the Gemma was just released and how the whole community got to start using Gemma and start to think about fine tuning and support and ElumStudio and Allama and all these things and Gemma It's been a week or so since the demo was out there, and we've tried to identify from the Vibes perspective and from the Finetuners perspective whether or not Gemma is this replacement for the top running Mistral 7b models that we had, even though on evaluations Gemma looks a little better and performs a little better than Mistral, we covered that It's not really 7b, it's like 8.

    [00:05:40] Alex Volkov: 5 billion parameters, they just counted this differently. And we also saw that for multiple attempts from friends of the pod, Eric Hartford, Technium, Yam was here it's really hard to fine tune. The last curve goes crazy and we haven't seen like great fine tunes yet. Something from Hugging Face, from Philipp Schmid, but definitely.

    [00:05:57] Alex Volkov: The Finetuners community didn't yet make this, take this model and make it like significantly better as we expected that they would and they're still working on this, so expect more to hear about this soon. And we also highlighted how much Mistral 7b set a very high bar in open source LLMs, and it's really hard to beat, even if you're Google, even if you have a huge amount of TPUs.

    [00:06:19] Alex Volkov: We then covered briefly an unfortunate announcement from the information from Meta that Lama 3 will not be breaking news in ThursdAI this week or next week. Lama 3 release is probably scheduled to June in 2024, so not anytime soon. And it doesn't look like there's any information as to why that is, only speculation.

    [00:06:39] Alex Volkov: So we definitely covered that this news happened. We then moved and talked about Starcoder 2, plus the Stack version 2 as well. Starcoder 2 is from I think Hugging Face and the Starcoder team. and they released a new model that beats pretty much DeepSea Coder before this was the best coding model in this area in the 15 and 7b parameters and StarCoder 2 is this model that now beats those quite significantly and together with this they also released a stack v2 which stack is a just a huge data set of code from github and other places and this is this data set is 10x the previous one

    [00:07:16] Alex Volkov: And it also includes opt out, so you could, if you don't want your code to be trained on and to put into the stack this StackV2 includes opt out requests as well, and definitely great contribution to the open source It's 900 plus billion tokens in the stack, which is crazy.

    [00:07:33] Alex Volkov: And I think there's the duplication, so it reduces a huge data set and supports , 600 programming languages. And quite impressive. We then also mentioned that Berkeley, the folks from Berkeley, Guerrilla, they previously released work in making AI's retrieve and call functions. And now they released what's called a function calling leaderboard and function called leaderboard is very cool because in addition to the MTB embeddings leaderboard that we've mentioned.

    [00:08:02] Alex Volkov: Today, and obviously the open source LLM leaderboard on HagenFace that we all look to and see what's the best performing models. Now we also have something that measures the ability of models to do function calling. Function calling started with OpenAI, and then Entropic added support, and now Mistral added support.

    [00:08:18] Alex Volkov: So we covered this effort as well, [00:08:20] and links will be in the show notes. We then moved and covered Illa or Illa, I'm never sure how to pronounce this. They used the Open IMIS dataset. Open IMIS is the dataset from news research that is fully open. And you can use this in production without being afraid of being sued.

    [00:08:37] Alex Volkov: And open imis preferences is the new. Largest open dataset for RLHF and DPO, so Direct Preference Optimization, Argea used their distilled label feature to actually take every instruction in that dataset and turn it into a preference instruction where the model would basically learn one or another, which one of the instructions are preferable.

    [00:08:59] Alex Volkov: So both could be correct, but one could be more preferable. So this is basically a very short version of DPO. And Argear released the largest open source like DPO dataset as according to them. And they used interestingly, they used another Nous model based on Ye34 to actually create those pairs and those preferences, which is super cool.

    [00:09:18] Alex Volkov: I love how now open source uses other open source in order to rank and improve itself, which is really cool. So this is everything we covered in the open source. And then we moved into big companies, LLM and APIs. And the big companies we talked about, the biggest news from this week was If you guys remember, we can talk about Mistral's OpenWeights model in the OpenSource LLMs and OpenWeights LLMs, but Mistral is also now an API provider, and they have this platform called LaPlatform, or LaPlatformer, and then, pardon my very bad French as well, they released a huge model for us called Mistral Large, which we only speculated about whether that's coming at some point as well, plus they also released something called LeChat.

    [00:09:59] Alex Volkov: And, Mistral Large is based on some MMLU stuff is actually second performing model in the world getting 81. 2 percent on, I think, MMLU and second only to GPT 4. So Bitscloud 2 and Gemini Pro, they didn't add Ultra here, so I'm actually not sure how it compares to Ultra, but definitely now is available over API for Mistral folks.

    [00:10:20] Alex Volkov: One highlight that we've talked about, it's handles 32, 000 tokens of context. And because Mistral is trying to position themselves as the leader in at least European. This model is native in French and German and Spanish and Italian. And it's definitely well performing in those languages as well.

    [00:10:39] Alex Volkov: In addition to this, those models, all of the models in there, the platform now support function calling as well, which is. This is really cool that we now have multiple providers that support function calling. Plus, we have a leaderboard for function calling so definitely a lot of highlights from what happens in this area.

    [00:10:56] Alex Volkov: And also, they introduced LeChat, which is a chat interface currently in beta on top of ORDEL models, so you Actually, you can go and use this if you don't pay for, let's say, GPT 4, and you only get access to three, you can go to the chat and try their models out. Shout out to Mistral. They also announced a partnership with Microsoft and for the open source community.

    [00:11:15] Alex Volkov: This sounded hey, they're releasing models, but they're not dropping torrent links anymore. Are there still proponents of open source? And they came out and said, yes, we're still proponents of open source. It's very important for us. And give us some time, we'll give you some more models. Basically, was the response from Arthur Mensch from Mistral.

    [00:11:31] Alex Volkov: We also talked about Google teasing Genie, which is a model that makes images into interactive games. And that was really cool to see. I'll add this link to the show notes. It's quite remarkable to see this video from one image of a character in the world. It creates a full world. Imagine how much imagine like a full Mario just created from one image of Mario.

    [00:11:52] Alex Volkov: It's quite remarkable. has been in the news lately for the past week or so, we've talked about this, but basically following up of what we talked, where Gemini release was celebrated in some areas because Gemini Ultra beats GPT 4 on different things. It, it also released a lot of responses online in terms of how it reacts to certain prompts, and it, it went, potentially also affected their stock price.

    [00:12:15] Alex Volkov: I'm not sure if that was the one thing, but definitely Sundar Pichai, the CEO of Google, sent an email to the whole company talking about how this release was not quite received as well as they hoped, and I'm using choice words here; he actually talked about structural changes and a potential review of the whole release process. They took down the ability to generate people in the image version of the Gemini model, but they also talked specifically about the Gemini model itself refusing different things.

    [00:12:45] Alex Volkov: This is in addition to them delivering very well and giving us Gemini 1.5 Pro, which has 1 million tokens in the context window, which I played with this week, and I definitely think it's a great thing from Google. So Google released the open weights Gemma models and Gemini 1.

    [00:13:01] Alex Volkov: 5 doing like crazy new things, but also the Gemini release at large did not go probably as expected. Potentially the reason why Google took their time to release something for us. We then covered the OpenAI is allowing Finetune on GPT 3. 5 and also the OpenAI response to New York times and said, Hey, we actually did not, do the things that you accusers are doing, but also that New York Times did some trickery in prompts to get the model to respond this way. So the saga between OpenAI and New York Times continues, and that's going to be interesting to follow along. And, OpenAI was also featured in another piece of news, actually two pieces of news.

    [00:13:37] Alex Volkov: One of them is that there's now a conversation that WordPress and Tumblr, both daughter companies of Automattic, are preparing to sell their user data. So basically everybody who had a blog on wordpress.com and everybody who had a Tumblr account. Most of this information probably was already scraped and already featured in datasets from OpenAI, but now they're preparing to sell this information to OpenAI and Midjourney.

    [00:14:00] Alex Volkov: And similar to the Reddit-Google deal for 200 million dollars recently announced, WordPress and Tumblr are now preparing to sell to OpenAI and Midjourney as well. And OpenAI and Figure, the robotics company, also announced a collaboration: Brett Adcock's company will integrate OpenAI's models as well.

    [00:14:23] Alex Volkov: Then we moved on to AI Art and Diffusion, which had an incredible week this week with two foundational models, or I guess big new models, that are not Stable Diffusion or DALL-E or Midjourney. The first one was Playground. Playground was, at first, an interface for DALL-E and Stable Diffusion.

    [00:14:41] Alex Volkov: And they built a very nice, very simple interface that's super fast, where you can inject styles. So they used all this data to actually release a new foundational model called Playground V2. And in user preference, this Playground V2 beats Midjourney, beats Stable Diffusion XL, beats the previous Playground model, and beats DALL-E.

    [00:14:56] Alex Volkov: It looks really cool. And specifically, they talk about their ability to generate photorealistic images very well, and also different aspect ratios of images. So if you think about the standard 1024 by 1024 image for Stable Diffusion XL, for example, or other sizes, their ability to generate other, non-standard aspect ratios looks very cool.

    [00:15:21] Alex Volkov: And in the internal user preference study, where they show two images for the same prompt, their V2 beats Midjourney 5.2 and DALL-E by a 9 percent margin, along with their previous model, and beats SDXL by a significant margin as well. It looks really cool and definitely worth checking out.

    [00:15:40] Alex Volkov: I'll put a link in the show notes. And the other news that's not Stable Diffusion, Midjourney or DALL-E related: Ideogram, which we've covered before, announced version 1.0. Ideogram was founded by ex-Google folks who worked on Google's image models, and it's a website called Ideogram.

    [00:15:56] Alex Volkov: And their approach is very participatory. I think Instagram is the source of their name, like Instagram for ideas. They announced version 1.0 and an investment from A16Z, and specifically it's state of the art on text generation. Text generation is something we know other models struggle with, and their model is able to put

    [00:16:19] Alex Volkov: text very well inside images. So if you want like reactions or memes or if you're doing presentations, for example I had multiple creators and characters hold like ThursdAI spaces. I think we had some folks even react as I was talking with with ideogram generated text images in in the comments as well.

    [00:16:36] Alex Volkov: So this is all we covered in AI art and diffusion [00:16:40] until we got to this jaw-dropping thing called EMO from Alibaba, which is a tease. It's not a model they've released yet, but there is a bunch of videos that were, to me, as jaw-dropping as Sora from a couple of weeks ago. EMO is a way to animate faces: to take an image and create a singing or talking face, and it's not only the face, the shoulders move and everything, so it animates an avatar based on one image. I will not be able to do it justice, because I'm still collecting my jaw from the floor, but definitely I will add some links and some videos, and the coherence with which these models generate talking faces is just incredible.

    [00:17:17] Alex Volkov: It's not only about animating the mouth; they animate eye and eyebrow movement and even other things like hair and earrings. And one last thing that I noticed, that really took me a second: they even animate the vocal cords and the muscles in the throat when somebody sings, for example.

    [00:17:35] Alex Volkov: And when I saw this, I was like, this is another Sora moment for being able to create with these tools. It's really incredible, and I really hope they release this in open source so we'd be able to animate whatever we created with Sora.

    [00:17:47] Alex Volkov: And we covered all of this. And then we had a deep dive with Aditya Kusupati and Prateek Jain, the authors of the MRL paper, Matryoshka Representation Learning. They talked to us about how OpenAI recently released a new version of their embedding model where you're able to specify the number of dimensions you want, and many folks didn't understand what this is and how it works.

    [00:18:08] Alex Volkov: And apparently, even though OpenAI built all of this from scratch, it was based on the paper they released almost two years ago called MRL, Matryoshka Representation Learning. We had a very nice chat and deep dive into how this actually works and how they pack the embedded information from the later dimensions into some of the first dimensions.

    [00:18:30] Alex Volkov: If you're interested in this area, and this area is very hot, I definitely recommend you check out this conversation. It was really great. And thank you, Aditya and Prateek and the rest of the Matryoshka team, for joining and talking to us about this new and exciting field.
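    For context, here's a minimal sketch of what Matryoshka-style embeddings look like in practice, assuming the `openai` Python client and the `text-embedding-3-small` model; the example text and the 256-dimension cut are arbitrary choices:

```python
# Minimal sketch: request a shortened embedding via the `dimensions` parameter,
# or truncate a full embedding yourself and re-normalize. The Matryoshka idea is
# that the first dimensions already carry most of the information.
import numpy as np
from openai import OpenAI

client = OpenAI()
text = "ThursdAI is a weekly AI recap"

# 1) Let the API shorten the embedding for you.
short = client.embeddings.create(
    model="text-embedding-3-small", input=text, dimensions=256
).data[0].embedding

# 2) Or truncate the full embedding and re-normalize it yourself.
full = client.embeddings.create(
    model="text-embedding-3-small", input=text
).data[0].embedding
truncated = np.array(full[:256])
truncated /= np.linalg.norm(truncated)

print(len(short), len(truncated))  # both are 256-dimensional
```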

    [00:18:42] Alex Volkov: And I think we started already chatting a little bit, and I see some folks from Hugging Face in the audience sending sad emojis.

    [00:18:48] Alex Volkov: And I want to send hugs to the Hugging Face ML Ops team after yesterday, because for many of us who now work with

    [00:18:57] Hugging Face was down, we were sad and thankful

    [00:18:57] Alex Volkov: Hugging Face, and by work I mean our code actually includes a bunch of imports from Hugging Face, Transformers for example, yesterday was a realization of how big a part of many of our lives Hugging Face now is.

    [00:19:11] Alex Volkov: I think for the first time for many of us, this was such a big realization, because those imports stopped working and the downloads didn't actually work. And so we actually had a long space yesterday, pretty much throughout the whole downtime, as we were holding each other's hands. It reminded me, I don't know, Yam, if you want to chime in, but it reminded me of when GitHub was down previously: basically, you could work, but if you can't commit your code,

    [00:19:34] Alex Volkov: what good does it do? And I wanted to hear from you, because I think you had some models queued up, and then you were waiting for them?

    [00:19:42] Yam Peleg: Yeah, look, Hugging Face is really the hub today. For most people, the problem wasn't only using the site; it's that they couldn't pull or clone models from Hugging Face, so they couldn't do many of the things they normally do, because their code relies on getting the model from Hugging Face. This is why, by the way, they tweeted: just for anyone that doesn't know, you can work offline.

    [00:20:05] Yam Peleg: If you ever cloned a model from Hugging Face, you probably already have it on your computer, so you can just use the offline version. There is a command for that. But for many people it's cloning the models, and for many other people it's also the feedback that you get from Hugging Face. I can tell you, some people here on the stage submit models to the leaderboard and try to fine-tune better and better models, and for us it's also the feedback of what is going on, where our models shine, and where we need to make them even better.
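    For anyone who missed it, here's a minimal sketch of the offline mode Yam is referring to, assuming you've downloaded the model at least once so it lives in your local Hugging Face cache; the Mistral 7B checkpoint is just an example:

```python
# Run Transformers against the local Hugging Face cache when the Hub is unreachable.
# HF_HUB_OFFLINE and local_files_only only work if the files were downloaded before.
import os
os.environ["HF_HUB_OFFLINE"] = "1"  # tell huggingface_hub not to hit the network

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-v0.1"  # example; use any model you've pulled before
tok = AutoTokenizer.from_pretrained(model_id, local_files_only=True)
model = AutoModelForCausalLM.from_pretrained(model_id, local_files_only=True)
```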

    [00:20:41] Yam Peleg: And for me at least, I had four models that I was waiting on results for, and many other people did as well. And just a shout out to Hugging Face for actually doing it. I'm running evals locally, and I know how heavy it is to actually run them and how much compute it takes, for how long.

    [00:21:01] Yam Peleg: And it's amazing to see that they have such a leaderboard with so many models. It's hundreds of thousands of dollars of compute to actually create such a leaderboard. So it's amazing to see. And they provide it literally for free, while the community is growing every day.

    [00:21:18] Yam Peleg: So it does cost, so huge shout out to them,

    [00:21:22] Alex Volkov: I was trying to prepare

    [00:21:23] Yam Peleg: we are all addicted, pretty much.

    [00:21:25] Alex Volkov: Absolutely. I was trying to prepare yesterday for this space, and part of my preparation is reading X and Twitter, but definitely part of my preparation is going to Hugging Face, reading the model cards, reading the leaderboards, for example. I was trying to count in my head how much stuff we're getting for free from Hugging Face, and one such example is just their blog, which was also down, and which I read specifically to prepare for the Matryoshka conversation today.

    [00:21:50] Alex Volkov: And that's just a huge resource on its own. There's the whole conversation piece: there's the hub, but there's also the discussions. AK posts papers, for example, on Hugging Face, and then there are whole discussion threads about them as well. That wasn't accessible.

    [00:22:04] Alex Volkov: Leaderboards themselves weren't accessible. And just the amount of compute, like you're saying, that they throw at us for free to be able to support this open source is definitely worth a shout out, and definitely shout out to engineers there that brought the hub back. Nisten, what are your thoughts on this?

    [00:22:22] Nisten Tahiraj: Yeah, without Hugging Face, this place turned into a flea market for models. People were asking, does anyone have Qwen 72B? And I was like, no, I have the finetune. And then the dev lead of Qwen pointed us to some Chinese site where you can download it. It was pretty

    [00:22:39] Alex Volkov: Wait, ModelScope is not just some Chinese site. ModelScope is where I think most of the Chinese folks are posting their models. It's like, I think modelscope.cn is the alternative in China. So there is at least a backup for some Chinese models. Although I think you have to translate that website, right?

    [00:22:59] Alex Volkov: But yeah, we had a conversation yesterday, and Far El was also talking about datasets, where many folks just upload a dataset and don't keep a local copy of it, and then being able to run evaluations, or do different things like that, was also prevented yesterday.

    [00:23:14] Alex Volkov: Definitely, yesterday we discovered how big a part of many of our lives Hugging Face became, and it was a sobering realization. But, I don't know, I saw people complain online, and I get it, folks, I get it. Sometimes you complain. But honestly, as far as I understood, the downtime wasn't even entirely their fault.

    [00:23:32] Alex Volkov: There was like a Mongo thing on AWS. I'm not sure, I didn't dive in deep. Just, when this happens, in my head, from when I dealt with downtimes before in my professional career, I have nothing but appreciation for the team working hard on it. And I think, Yam, Clem, the CEO, even responded to you when you said Hugging Face is down, right?

    [00:23:55] Yam Peleg: To many people, not just to me, but yeah they are responsive.

    [00:23:59] Alex Volkov: Responsiveness, and being in the community and saying, hey folks, we understand, we're sorry about this. Besides having folks work on this actively, which we know they did, that's basically all we can ask for. So I'm just sending positive vibes and appreciation. I saw some people getting salty.

    [00:24:17] Alex Volkov: I saw some people saying, oh, this sucks, we need a backup. And I was like, yes, but also, this doesn't mean you can ignore everything we've gotten for free so far from this incredible organization. So shout out. I don't work there, but I do have many friends who do.

    [00:24:33] Alex Volkov: I think, yeah, Nisten, go ahead. And then we'll move on to actual recap of everything we're going to talk about.

    [00:24:39] Nisten Tahiraj: Yeah, and same for the leaderboard. We give Hugging Face so much crap when things don't work, and I really appreciated that it's actually the CEO that responds directly to your complaints and tickets, and it's not just some support person. No, it's Clem, he's the actual CEO. They'll respond [00:25:00] They're the first ones to respond.

    [00:25:01] Nisten Tahiraj: So that's pretty amazing. You don't really see it in other companies. Like, we don't expect the president of Microsoft, Brad Smith, to ever respond to a GitHub issue. Could you imagine that? So

    [00:25:12] Alex Volkov: He is not your favorite. I would love Satya, though, to chime in on the discourse, but not Brad. Yeah, I absolutely cannot imagine this, and kudos, kudos to them for the participation in the community.

    [00:25:23] Open Source AI corner

    [00:25:23] Alex Volkov: And I guess we should start with our usual thing: open source. Alright folks, this is our regular update every week for the Open Source Corner. Interestingly, Mistral is not featured in the open source corner today, but we'll mention them anyway. From last week, if you guys remember, Gemma was released. It wasn't open source, it was open weights, but definitely Google stepped in and gave us two models to run, and since then, many folks have started using these models, and there's quite a few things that, yeah, I'm actually wanting to hear from you about, because we talked about this: the Gemma models are not necessarily seven billion parameters, right?

    [00:26:24] Gemma from Google is hard to finetune and is not as amazing as we'd hoped

    [00:26:24] Alex Volkov: This was a little bit of a thing. And also about fine-tuning. Could you give us a brief rundown of how the last week went in terms of Gemma's acceptance in the community?

    [00:26:32] Yam Peleg: Oh, wow. Gemma is giving me a hard time, that's for sure. I've been fine-tuning Gemma, or at least struggling with fine-tuning Gemma, for a week at this point. Okay, so starting from the beginning: Gemma is not exactly 7B. The way it is referred to in the paper is that the parameters in the model itself, apart from the embeddings, are exactly 7 billion parameters.

    [00:27:01] Yam Peleg: But then you add the embeddings and you're a little bit over 8.5 billion, if I remember correctly. Which is fine, I don't think anyone has a problem with a bigger model. It's just that I think it would be more genuine to say it's an 8.5B parameter model. That's first.
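    If you want to check this yourself, here's a minimal sketch of counting the parameters with and without the embedding matrix, assuming you have access to the gated `google/gemma-7b` checkpoint and enough memory to load it:

```python
# Count Gemma's parameters to see where the "7B vs ~8.5B" gap comes from.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("google/gemma-7b")

total = sum(p.numel() for p in model.parameters())
embeddings = model.get_input_embeddings().weight.numel()  # vocab_size x hidden_size

print(f"total:                  {total / 1e9:.2f}B")
print(f"embeddings:             {embeddings / 1e9:.2f}B")
print(f"total minus embeddings: {(total - embeddings) / 1e9:.2f}B")
```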

    [00:27:23] Yam Peleg: Second, it behaves differently than what we're used to with Mistral and Llama. I'm not sure why, maybe someone can tell me, but I'm not sure why it behaves differently. And many people are currently working on and struggling with fine-tuning it better. This is where it is at the moment. I've already seen Orca.

    [00:27:54] Yam Peleg: Someone fine-tuned on Orca and didn't get great results. I also heard that someone fine-tuned on Hermes, I think from Nous, I'm not sure, but I think so. Also, results were not great. I'm continuing pre-training, and the loss is doing whatever it wants. It goes down, and then out of the blue it starts to jump.

    [00:28:16] Yam Peleg: I'm not sure exactly why. It might be because the architecture is slightly different, there are slight modifications. So maybe that, or maybe something else, but yeah, I think we're still exploring the model. We don't have an answer yet.

    [00:28:35] Alex Volkov: Yeah, that's what I got as well. I pinned a few examples: Eric Hartford of Dolphin fame, I think he now works at Abacus, and Teknium as well, tried to do some stuff, and all these loss curves look crazy, jumping around up and down. I saw a tweet from Philipp Schmid from Hugging Face where they were able to fine-tune some stuff, and the conversation between Eric and Wing Lian from Axolotl.

    [00:29:00] Alex Volkov: And there looks to be an effort to try and hone this thing and see if fine-tuning this actually works on some stuff. The Hermes finetune was not really an official Nous Research thing; it looked like somebody just took the dataset, and folks weren't able to actually get it to run or perform well, as far as I saw. I haven't seen an update on this, but I'll definitely follow up with news.

    [00:29:22] Alex Volkov: So I would just remind folks, last week we talked about how Gemma was well received.

    [00:29:26] Alex Volkov: Everybody hopped on board super quick and added support. LM Studio and Ollama added support super quick. Wing started adding support for fine-tuning in Axolotl. Hugging Face added support in, I think, Transformers. Tri Dao added support in Flash Attention. There was a whole community effort to receive Gemma as well as possible.

    [00:29:47] Alex Volkov: And they also released some stuff in quantized versions from Google. So a very good effort from Google, and then very big acceptance from the community. But since then, what I'm trying to highlight is that a lot of the way we judge models, whether or not they're good, is whether they're finetunable, for example, but also whether they're instruction following, whether it's easy to converse with them. I haven't seen any of this come across my timeline at all. I will be frank, I only interacted with the 2 billion parameter model, and I wasn't impressed. It's great that it was released.

    [00:30:20] Alex Volkov: I would not be using this for any of my workloads. Nisten, do you have any other feedback as well? Specifically around how Mistral 7B seems to still be a good alternative, even though it performs worse on evaluations.

    [00:30:34] Nisten Tahiraj: Yeah, I feel like we have been spoiled by just how high of a bar Mistral 7B has set for everyone, that it even made Mistral Large feel somewhat unimpressive, although it was answering everything perfectly well. But yeah, not only has it set a very high bar, it was also very easy to work with. So the amount of innovation that came from the community just building off of those initial weights has made this class of models extremely competitive, such that even Google has a hard time cracking through.

    [00:31:15] Nisten Tahiraj: Yeah, our expectations now for a 7B model are extremely high. It has to run on my phone. It has to do what I want. It has to respond, it has to summarize stuff, it has to carry forward the conversation. Oh, and it has to score high on the benchmarks too. And this pace of innovation that the community has set is just very hard to match, and it's also incredibly interesting to see that Google is having a very hard time matching it or getting close.

    [00:31:46] Alex Volkov: Specifically because, in the land of GPU-poor and GPU-rich, in the original article that defined the two categories, Google is the GPU slash TPU rich, right? They could, and have, thrown a bunch of compute at these models, and still the folks from Mistral, a team of fewer than 30 people that started eight months ago, released a model.

    [00:32:06] Alex Volkov: 6 months ago? I think Mistral 7B is from around 6 months ago, right? September? And Google, 6 months after, with all the GPU richness, is barely able to match it, not to mention beat it significantly. Which is unlike any pace we're used to. We're used to a 7B model beating a 70B model week after week.

    [00:32:25] Alex Volkov: And here's a huge company coming out and saying, hey, here's our best attempt at a 7B model, one that Yam doesn't even consider a 7B model, and at least in our attempts to play around with it, it's not beating Mistral significantly, which is strange. And it's also not able to get fine-tuned very easily.

    [00:32:43] Alex Volkov: Very interesting, and very much a highlight of how much quality the Mistral model has. I will also say that Arthur Mensch, and we'll cover this in the Mistral section afterwards, came out and basically said: we can only do so much with 1,500 H100s. Just for contrast, Meta famously announced a few months ago, Zuckerberg came out and said, that by the end of this year they're going to have the equivalent of 600,000 H100s worth of compute, 600,000 H100s to train and host and probably do inference on Meta and Llama.

    [00:33:19] Alex Volkov: And [00:33:20] this is 1,500 H100s that Mistral was able to use to train and fine-tune a model that Google cannot wipe off the board completely.

    [00:33:29] Llama 3 won't be released until June 2024

    [00:33:29] Alex Volkov: It's very crazy. Moving on to another news update that's not quite a news update. We've been waiting for Llama 3; every week I've been saying, hey, it could get released here, et cetera.

    [00:33:41] Alex Volkov: There was a leak from The Information, I actually don't know if it was a leak or not, but The Information came out, and then a bunch of other companies followed with this news, that Llama 3 will be released, I think, in June. That was the update. Llama 3 will not get released for us anytime soon.

    [00:34:00] Alex Volkov: We were hoping for a one-year anniversary release; Llama 1 was released in February 2023. And now we're not gonna see Llama 3 soon, even though it has, like, finished training as far as I understood, or as far as the updates said. And while Zuckerberg goes and eats at McDonald's, Llama 3 will not get released to us. I wanted to hear folks here on stage react to this, because it's surprising news, isn't it?

    [00:34:23] Alex Volkov: Ha,

    [00:34:24] Nisten Tahiraj: I was gonna say that I called it, just based on how censored and unwilling to answer anything Code Llama was. So yeah, if Code Llama 70B wouldn't answer anything, I figured it would be pretty much the same with Llama 3. So now they either have to go way back in the training, to when they started doing a lot of this, and retrain it to be a lot more obedient, but still not horrible or anything, because we see from Mistral's team that their models do obey you and respond to stuff, but still won't tell you, like, how to kill your cat and stuff. So yeah, the public backlash,

    [00:35:12] Nisten Tahiraj: people giving it to Gemini and Google, has completely affected the Llama 3 release, which is just very interesting.

    [00:35:19] Alex Volkov: Interesting, because they didn't openly release Llama 1, and nothing bad happened in the world. And then they released Llama 2 with a commercial license that people can actually use, which kickstarted a bunch of open source stuff. And now they're waiting with Llama 3. I heard some stuff about how it could potentially be a GPT-4-matching model that we could run.

    [00:35:40] Alex Volkov: But we don't know until it's released. It's just a very interesting update. And I gotta wonder, by the time they decide to release this, whether other open source will have caught up or not. Usually when Llama comes out with a big model, it's impressive. But, for example, Code Llama was already beaten by the time it came out, right?

    [00:35:57] Alex Volkov: If I'm not mistaken, DeepSeek Coder and other models achieved the same coding scores that Code Llama was released with. Maybe they're waiting a little bit. I gotta wonder what goes into this decision. Which, on the topic of code,

    [00:36:10] StarCoder 2 and Stack V2 open source from Hugging Face

    [00:36:10] Alex Volkov: I think we're moving to the next thing. StarCoder 2 and The Stack v2 were released, in collaboration with Hugging Face.

    [00:36:17] Alex Volkov: The Stack v2 is the second iteration of The Stack dataset, which was just an insane amount of code collected.

    [00:36:25] Alex Volkov: I think The Stack v2 now includes opt-outs, so you can say, hey, I want my code to be opted out from The Stack v2. And this new dataset is, I think, around 60-something terabytes, roughly 10x more than the first Stack. And StarCoder, the 15 billion parameter model, beats Code Llama 13B pretty much everywhere: HumanEval Plus, DS-1000, GSM8K.

    [00:36:49] Alex Volkov: Very impressive. It obviously beats the previous StarCoder, which was a very significant model. Based on the evaluations, DeepSeek Coder, we know, was one of the best code models so far, and it looks like StarCoder competes with it on a few benchmarks, but on everything else it beats DeepSeek Coder as well, at least the 7B version.

    [00:37:09] Alex Volkov: But then it's a model twice DeepSeek's size as well. So they released three models, 3 billion, 7 billion, and 15 billion parameter versions. The 15 billion parameter one is in a very interesting place, where you could potentially still run this on your Mac, if your Mac is stacked up, and get a decent result back.

    [00:37:26] Alex Volkov: It has a 16K context window, 16,384 tokens to be exact. It was trained on up to 4 trillion tokens, depending on the size of the model, and includes 600-plus programming languages, which is great. All we probably care about is Python and JavaScript, and maybe some folks care about Rust, but 600-plus programming languages; I honestly didn't even know there were that many.

    [00:37:51] Alex Volkov: It gets 46 percent on HumanEval, which is okay; I've seen models that get way better than 46%, so that's interesting. What else is interesting about StarCoder 2? It's a commercially friendly license, so you can use this for commercial stuff. It can be used for local copilots, which is something we're waiting for.

    [00:38:06] Alex Volkov: And the more of this, the better. And yeah, StarCoder 2. But I also want to shout out The Stack v2: the more data we get, the better it is for everybody else and other models as well. The Stack v2 is definitely a great achievement that we should shout out.
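    If you want to try it as a local copilot backend, here's a minimal sketch of running the model with Transformers, assuming the `bigcode/starcoder2-15b` checkpoint and enough VRAM (or swap in the 3B/7B variants or a quantized build):

```python
# Load StarCoder2-15B and complete a code prompt locally.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bigcode/starcoder2-15b"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"  # device_map needs accelerate
)

prompt = "def fibonacci(n: int) -> int:\n"
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```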

    [00:38:23] Nisten Tahiraj: Yeah, this is crazy. The full dataset is 67.5 terabytes for The Stack v2, and you can just have it for free. The amount of work! So there's 900 billion extra tokens that went on top of what was already an excellent coding model to begin with. So this is huge, not just because of the model itself, but also because you can just,

    [00:38:47] Nisten Tahiraj: I don't know, Finetune 1 for TypeScript, if you want.

    [00:38:50] Alex Volkov: Yep. Yeah, go ahead.

    [00:38:53] Yam Peleg: Yeah, I think it's worth mentioning that, as far as, I haven't looked at it in depth because Hugging Face was down, but as far as I understand, it's a base model. When we compare the HumanEval of a base model to a model that was specifically fine-tuned to obey instructions, and we see a result that is, okay, not the same, but somewhere in the ballpark,

    [00:39:18] Yam Peleg: it's amazing, because it just means that as soon as you fine-tune it, it's going to be incredible. Moreover, from what I've seen in the paper: I heard about it, and I was sure that I was going to open the paper and see something like, hey, we did the same thing, but huge, 4 trillion tokens, enjoy.

    [00:39:38] Yam Peleg: But no, what you see over there is that they really went in depth into the benchmarks themselves and checked, for each benchmark, what exactly it measures and how it correlates to real-life usage. They went in and benchmarked different packages, each and every one, like, how good is it with Matplotlib?

    [00:39:59] Yam Peleg: How good is it with SciPy? And this is very detailed and high-quality work. It's very hard to say which is better as a base model, DeepSeek or StarCoder, because there are so many benchmarks in the paper I've never seen before; even DeepSeek has, I think, six benchmarks, and StarCoder, I didn't even count, there are so many. I think it's great work, and I suppose the model is really good, at least on the level of DeepSeek, although I don't know, I need to check. But just the paper alone is such a huge contribution, the paper and the datasets, so yeah, it's amazing.

    [00:40:40] Yam Peleg: And it just, it went a little bit silent. People just released models that were trained on 4 trillion tokens and it goes silent nowadays. It's amazing that we got numb to something that's insane.

    [00:40:53] Yam Peleg: And on the same week, on the same week, NVIDIA released a model. I don't think they actually released the model, but they just trained the model on 8 trillion tokens.

    [00:41:03] Yam Peleg: And we don't even talk about it. It's just insane.

    [00:41:06] Alex Volkov: let's talk about it. I saw the Nvidia stuff, but I don't see a release. I saw an announcement, right?

    [00:41:12] Yam Peleg: Yeah, it was a paper, and I think that's about it. NVIDIA is showing off that they've got the best hardware, because they do have the best hardware, so they can train on a lot of tokens really fast, and the model is really good at the end because of all those tokens. But yeah, I'm just saying that the amount of data is increasing, the size of the models that we actually use is increasing, and worth noting, [00:41:40] there is a trend of things getting more and more powerful.

    [00:41:45] Alex Volkov: Absolutely. And I would just say this is partly what we're here for to highlight things like this in the open source and shout out the folks who worked hard on this, on releasing this and making sure that this didn't go silent because this effort is very well appreciated. If it's a base model, then we'll get local co pilots performing way better.

    [00:42:04] Alex Volkov: And this is great, especially the data set that they released. 10 times the size of the previous one, it's called the stack, and folks would be able to use this to fine tune other models. And that's obviously also great.

    [00:42:15] Argilla releases OpenHermesPreferences

    [00:42:15] Alex Volkov: And on the topic of datasets, if you guys remember, we've talked about Argilla multiple times at this point. Shout out Argilla folks, and if you want to come up and talk about Argilla, your place is here.

    [00:42:27] Alex Volkov: They released a DPO conversion of Teknium's Hermes dataset; it's called OpenHermes Preferences. And as we've talked about Nous Research and Hermes multiple times, this is one of those datasets, I think a million rows, compiled from different other datasets as well.

    [00:42:45] Alex Volkov: And Argilla is an open source tool that lets you make datasets better by converting them to preferences for DPO. So they released the DPO version, DPO being direct preference optimization, where basically they take a regular dataset with one response per instruction and turn it into a preference dataset, where there are multiple candidate responses and information about which one is more preferable.

    [00:43:12] Alex Volkov: That's a very short explanation of DPO; Yam, if you want to chime in here and clean this up, feel free. Argilla released OpenHermes Preferences, which is a 1-million-preference dataset on top of Teknium's. And it's pretty remarkable, because we know that even for Nous Research, when there are DPO versions of their models, they perform better than the regular SFT fine-tuned models on pretty much every benchmark.

    [00:43:40] Alex Volkov: And now they've converted all of that dataset into a preference dataset. They created the responses with another Hermes model, which is pretty cool, right? They're not using OpenAI, because scraping from OpenAI, as we saw in the lawsuit with OpenAI, is against the terms of service.

    [00:44:02] Alex Volkov: But you can actually create these preferences with another model. So they're using Nous Research's Hermes 2 on top of Yi 34B to do what's called distilabel and make those preferences a little better. And this dataset is open, so unlike closed preference data, this dataset is open for you to also go and fine-tune your models, which is pretty cool.

    [00:44:24] Alex Volkov: Shout out to OpenHermes Preferences. I'm gonna pin this to the top of the space and I will also definitely add this to the show notes.
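    For the curious, here's a minimal sketch of what a DPO preference record looks like and how you might load the dataset; the dataset id and column names are taken from the Hub listing at the time of writing, so treat them as assumptions and check the dataset card:

```python
# Load the OpenHermesPreferences dataset and show the shape of a DPO example.
from datasets import load_dataset

ds = load_dataset("argilla/OpenHermesPreferences", split="train")
print(ds[0].keys())  # inspect the actual columns before training on them

# Conceptually, each DPO example is a prompt plus a preferred ("chosen") and a
# less preferred ("rejected") response; trainers like TRL's DPOTrainer consume
# exactly this kind of triplet.
example = {
    "prompt": "Explain DPO in one sentence.",
    "chosen": "DPO fine-tunes a model directly on preference pairs, no reward model needed.",
    "rejected": "DPO is a type of database.",
}
```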

    [00:44:32] Function calling leaderboard from Berkeley

    [00:44:32] Alex Volkov: Okay. Let's move on in our conversation. I want to talk about the function calling leaderboard, because I think it's pretty cool. Let me just go and find this tweet real quick.

    [00:44:44] Alex Volkov: There was an effort before called Gorilla, and now the same folks from Berkeley released a leaderboard called the Berkeley Function Calling Leaderboard. Essentially, function calling, for those who don't use open source models but use something like OpenAI: last summer, OpenAI answered everybody's request for structured outputs in the form of JSON by saying, hey, we're going to introduce something called function calling, where you call our model, you provide one function or several functions from your code, and the model will respond and say, hey, you should call this function, and with these parameters.

    [00:45:23] Alex Volkov: So basically, instead of getting JSON mode, we got function calling back then. Now we have both: we have a way to get just structured JSON, but we also get models to respond with which functions we should call. And this is great for agents, this is great for folks who are building with these models.
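    To make that concrete, here's a minimal sketch of function calling with the `openai` Python client; the `get_weather` tool is a made-up example function, and the model name is just one of the GPT-4 Turbo aliases available at the time:

```python
# The model doesn't execute anything; it returns which function to call and with what arguments.
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical function defined in your own code
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-4-turbo-preview",
    messages=[{"role": "user", "content": "What's the weather in Denver?"}],
    tools=tools,
)

call = resp.choices[0].message.tool_calls[0]
print(call.function.name, json.loads(call.function.arguments))
```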

    [00:45:38] Alex Volkov: And I think during the summer, because OpenAI came up with this concept, OpenAI was the only one supporting it. Then open source quickly started catching up. And I think, Nisten, correct me if I'm wrong, but I think Jon Durbin's Airoboros has a bunch of function calling instructions in it.

    [00:45:54] Alex Volkov: So that model, and models trained on Airoboros, were also fairly okay with function calling. Mistral just released their update, so Mistral supports function calling.

    [00:46:05] Nisten Tahiraj: They had about a thousand, about a thousand function calling entries in the Airoboros 2 dataset, or I forget. Just look up Jon Durbin, J-O-N Durbin, and Airoboros, A-I-R-O-B-O-R-O-S, dataset. And yeah, apparently there are about a thousand entries in there for function calling. That, almost by accident, helped a lot of the other models be better at function calling too.

    [00:46:29] Alex Volkov: Yeah, so every other model that was trained on Airoboros, which is a lot, Hermes includes the Airoboros dataset, they now, I don't know if this is by accident or this is just how things work in the merging world, and in fine-tuning on top of datasets that were fine-tuned on top of other datasets, right?

    [00:46:44] Alex Volkov: But definitely, other open source models now support, or at least know, the notion of function calling, and eventually we got to the point where there's now a leaderboard, like we like. If we're going to talk about embeddings later, there's the MTEB leaderboard for different embedding models, even though I see Bo in the audience and he's not very happy with how easy it is to game that leaderboard.

    [00:47:07] Alex Volkov: We obviously look at the open source LLM leaderboards, and Yam was talking about submitting a few things there and seeing how they perform, and that's been exploding in popularity, and merging too. So it's great to have a function calling leaderboard as well. The folks at Berkeley test models, I think API only, I don't know if they're supporting open source at this point, and look at

    [00:47:28] Alex Volkov: what performance you could expect on different function calling tasks, and I think for folks who are building with this, it's very cool. Some of the models leading this leaderboard: GPT-4, the latest preview from January, is leading it. They also have something called OpenFunctions V2, from the organization that put this up, Gorilla LLM, and it has an Apache 2 license. And they have an average score across different tasks: simple function, multiple functions, parallel functions, with different scores for all of these.

    [00:48:08] Alex Volkov: And I just want to highlight this and add it to the show notes, because more and more we see Mistral Medium in there, Claude from Anthropic, and open source models. And I think for many folks building agents and building with these models, this type of interaction with the model is very important, where it's not only a textual prompt and you get something back; you actually need to do something with it. So a shout out to the folks building and maintaining this leaderboard.

    [00:48:34] Alex Volkov: And I think they also released the Gorilla model as well. Let's move on, I think this is it, folks. I think this is everything we have to talk about in the open source LLMs.

    [00:48:42] Alex Volkov: And given that, Connor, given that Storm is in the area of open source-ish, let's cover Storm a little bit.

    [00:48:49] Alex Volkov: I think this is a good time, because it also dances on the area of interest that we talked about last time. Do you want to present Storm and talk about this and see how cool this is?

    [00:48:58] Connor Shorten: Yeah, cool. I guess maybe let me say one more thing on Gorilla. I think it's fascinating going through the functions that they have; if you go through the OpenFunctions blog post from Berkeley, you have calculate triangle area, and then you give it the base and the height. Having a massive dataset of super specific functions like that is fascinating.

    [00:49:16] Connor Shorten: It's fascinating seeing this next evolution of that. Okay, so with Storm, yeah, there's definitely some intersection between DSPy and the function calling models. With DSPy, one of the built-in signatures is the ReAct one, where with ReAct you have thought, action.

    [00:49:33] Connor Shorten: And so it's one way to interface with tools. Yeah, the tool thing is pretty interesting. I think it's also really related to the structured output parsing, the please-output-JSON thing, and Jason, our favorite influencer of the function calling

    [00:49:47] Alex Volkov: I just wanna make sure that folks don't miss this. Jason Liu is the guy you're referring to, and he's our favorite influencer in forcing these models to output JSON. I find it really funny that a guy named [00:50:00] Jason is the guy leading the charge of getting these models to output JSON-formatted output.

    [00:50:04] Alex Volkov: I just find it really funny. Didn't wanna skip this, I wanted to plug that joke somewhere, but please go ahead and let's talk about Storm. Oh, and a shout out to both Weights & Biases and Connor on Weaviate's podcast: Jason appeared in both places talking about the Instructor library and how to get these models to give structured output.

    [00:50:21] Alex Volkov: So definitely shout out for Jason for this, check out his content on both platforms.
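    As a quick illustration, here's a minimal sketch of structured output with the Instructor library plus Pydantic; the exact patching helper has changed across Instructor versions, so treat `instructor.from_openai` as an assumption and check the docs for your version:

```python
# Instructor coerces and validates the model's output into a Pydantic model.
import instructor
from openai import OpenAI
from pydantic import BaseModel

class Speaker(BaseModel):
    name: str
    topic: str

client = instructor.from_openai(OpenAI())

speaker = client.chat.completions.create(
    model="gpt-4-turbo-preview",
    response_model=Speaker,  # the structured schema we want back
    messages=[{"role": "user", "content": "Extract: Connor talked about DSPy and STORM."}],
)
print(speaker.name, speaker.topic)
```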

    [00:50:29] Connor Shorten: Yeah, awesome. Yeah, it's such a huge part of these LLM pipelines. I know Bo is going to speak in a bit, who's someone I consider one of the experts in information retrieval, and one of these big things is that you retrieve, then you re-rank, then you generate, and if the output doesn't follow the format exactly, you can't parse it into the database.

    [00:50:47] Connor Shorten: So it's such a massive topic, but okay.

    [00:50:50] Stanford introduces STORM - long form content grounded in web search

    [00:50:50] Connor Shorten: So starting with Storm, I guess I can tell a funny story about this. Erica and I were hacking on this, and we came up with the plan of: you start off with a question, then you do retrieval, so you're looking at the top five contexts as well as the question, and you use that to produce an outline.

    [00:51:06] Connor Shorten: And again, structured output parsing: that outline had better follow the comma-separated list format so that you can parse it, and then you loop through the topics. Then we have a topic-to-paragraph prompt where, you know, you're doing another retrieval, now with the topics. And then we have the proofreader, and then the blog-to-title step.

    [00:51:26] Connor Shorten: So that's the system that we got our hands on with, and I could probably talk about that better than the STORM system, but it's very similar. With STORM, the difference is that we're retrieving from a Weaviate index with Weaviate blog posts, let's make it as much Weaviate as we can, whereas they replaced that specific retriever with a web search retriever. I was playing with that a bit on the weekend as well, using the You.com API as the web search, and it's pretty cool: web search as well as a private index that you curate. I think that's definitely one of the big topics.

    [00:51:56] Connor Shorten: Okay, so then the interesting thing is once you've got this, in our case, as a four layer system, now you use DSPy to compile it. So what compiling it entails in DSPy is tweaking the task description as well as producing input output examples. So you have in the prompt, you slightly change it from, you'll take a topic and write it into a blog post.

    [00:52:19] Connor Shorten: Typically, that ends up resulting in a blog post about software documentation, right? So that's what that ends up looking like. And then the input outputs end up being, like, an example of what are cross encoders. Here's a blog about cross encoders. So you can use that input output to then reason about the new inference, so hopefully that's a good description of what it means to compile these programs, where you optimize the prompts for each layer in the task as you decompose this task into its subtasks.

    [00:52:45] Connor Shorten: Storm then introduces something that I think is pretty novel, which is how you do that research loop. We naively just went question to outline and then instantly fleshed out the outline, whereas they instead go from question to perspectives about the topic. You retrieve from each of the perspectives about the topic, and then you write it, and then, I'm not sure how it all gets resolved, but it's almost like a multi-agent system in my view, this kind of perspective-guided approach of adding personas or backgrounds.

    [00:53:18] Connor Shorten: So I think that's probably the key differentiator between Storm and that kind of blog post system that I described. But we have open source code on Weaviate Recipes, if you want to see what our four-layer program looks like and how we compile it with the Bootstrap optimizer.

    [00:53:35] Connor Shorten: With the Bootstrap optimizer, you just run a forward pass through the program with a super high capacity model like GPT-4 to get the input-output examples, and then you hope that Turbo, or one of the cheaper or open source models, can look at those input-output examples and copy the system behavior.
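    Here's a minimal sketch of that idea in DSPy, in the spirit of the outline-then-paragraphs pipeline Connor describes; the signatures, the retriever, and the training examples are invented for illustration, only the DSPy building blocks (Signature, ChainOfThought, BootstrapFewShot) are real, and details may differ by DSPy version:

```python
import dspy
from dspy.teleprompt import BootstrapFewShot

class Question2Outline(dspy.Signature):
    """Produce a comma-separated outline for a blog post answering the question."""
    question = dspy.InputField()
    outline = dspy.OutputField(desc="comma-separated list of section topics")

class Topic2Paragraph(dspy.Signature):
    """Write one paragraph about the topic, grounded in the retrieved context."""
    topic = dspy.InputField()
    context = dspy.InputField()
    paragraph = dspy.OutputField()

class BlogWriter(dspy.Module):
    def __init__(self, retriever):
        super().__init__()
        self.retriever = retriever  # e.g. a Weaviate- or web-search-backed callable
        self.outline = dspy.ChainOfThought(Question2Outline)
        self.expand = dspy.ChainOfThought(Topic2Paragraph)

    def forward(self, question):
        topics = self.outline(question=question).outline.split(",")
        paragraphs = [
            self.expand(topic=t.strip(), context=self.retriever(t.strip())).paragraph
            for t in topics
        ]
        return "\n\n".join(paragraphs)

# "Compiling": a strong teacher model bootstraps input/output demos for each layer,
# which a cheaper student model then imitates. my_metric, my_retriever and
# my_trainset are placeholders you'd supply yourself.
# optimizer = BootstrapFewShot(metric=my_metric)
# compiled_writer = optimizer.compile(BlogWriter(my_retriever), trainset=my_trainset)
```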

    [00:53:51] Connor Shorten: There are a lot of other interesting things about this, like multi-model systems; even in the Storm paper they compare GPT Turbo, and then they use Mistral 7B Instruct as the judge. Another thing, like we were talking about earlier, is re-ranking. You might want to have the long-context models do re-ranking, because with re-ranking you typically try to give it a lot, since you're trying to put a band-aid on the search.

    [00:54:13] Connor Shorten: So you probably want 20 to a hundred results going into the re-ranker rather than five to ten. And it's probably also not really a task for LLMs anyway; I think that's another opportunity for a task-specific model. But overall, to conclude this thing about Storm, I think for me the big exciting thing is that DSPy is making it super clear how to build more than chatbots or just simple question answering.

    [00:54:40] Connor Shorten: I think we're probably within a few months of: anytime you have a pull request, the documentation will be written for you automatically. Probably you could even have an idea and have a pull request created by the model. I'm personally biased toward coding applications, but yeah. So, this kind of long-form content generation, by breaking down the task and then optimizing each part of it,

    [00:55:05] Connor Shorten: It's all just really interesting.

    [00:55:07] Alex Volkov: Very interesting. And I added STORM, from Yijia Shao et al., to the show notes as well; it's definitely worth checking out, because it writes Wikipedia-length articles and uses the You.com API or different search APIs to gather perspectives and references. Very interesting. For the sake of time, I want to move on. Just to reset the space: we've been at this for almost an hour. You guys are on ThursdAI.

    [00:55:33] Alex Volkov: ThursdAI is the weekly podcast and newsletter that's recorded live on X Spaces. And I'm here with several friends and guests and experts in different fields. We've been covering open source LLMs until now, and I think we're going to move into big companies, because we need to cover this. And soon we're going to have some folks do a deep dive about embeddings.

    [00:55:51] Alex Volkov: And let me just make sure that folks know they can come up. The big companies' LLMs and APIs: this is the segment where we chat about OpenAI and Microsoft and Google and whatnot, not the models they released for us in open weights and open source that we can run ourselves; this is the segment where we talk about API developments and different updates.

    [00:56:13] Alex Volkov: So let's run through them.

    [00:56:14] Mistral releases Mistral Large & Le Chat interface

    [00:56:14] Alex Volkov: The biggest one from this Monday was Mistral releasing Mistral Large, which we've been waiting for and getting excited about. They also released a chat version of their models called Le Chat. And it's very impressive, folks. Mistral Large, based on at least some metrics they released, is second only to GPT-4, and beats Claude from Anthropic and Gemini Pro on the MMLU score.

    [00:56:43] Alex Volkov: And Mistral Large is vastly superior to Mistral Medium, handles 32K tokens of context, and is natively fluent in English, French, Spanish, German, and Italian. It highlights how much Mistral is focusing on becoming the OpenAI alternative from Europe, because you can go to Le Chat and run every chat you'd have with their models there.

    [00:57:09] Alex Volkov: So basically, maybe you don't have to have an OpenAI subscription; I think that's what they want. But also, this model is available in the API, with significant performance on top of everything else in the other languages. They're aiming for the top five languages in Europe, obviously, and I think it's a very important move of theirs to establish themselves as this big company.
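    If you want to try it over the API, here's a minimal sketch of calling La Plateforme over plain HTTP; the endpoint shape and the `mistral-large-latest` alias follow Mistral's public docs at the time of writing, so treat them as assumptions and check the current docs:

```python
# Call Mistral Large through La Plateforme's chat completions endpoint.
import os
import requests

resp = requests.post(
    "https://api.mistral.ai/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['MISTRAL_API_KEY']}"},
    json={
        "model": "mistral-large-latest",
        "messages": [{"role": "user", "content": "Reponds en francais: c'est quoi ThursdAI?"}],
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```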

    [00:57:32] Alex Volkov: This was why we moved them to the big company APIs section as well. The announcement also included something interesting. They said, we have also updated Mistral Small in our API to a model that's significantly better and faster than

    [00:57:45] Alex Volkov: The Mixtral 8x7b. If you remember when we announced, when we talked about Mistral releasing API access, we said that, whatever Mistral Next is It's probably going to be medium. So now we have a large model that outperforms pretty much every model besides GPT 4 on different tasks. According at least to them, but also the small model that's like faster and better.

    [00:58:06] Alex Volkov: They upgraded this behind the scenes. They did not release any of this in open weights, and part of the response from the community was: Mistral is releasing a bunch of stuff, and none of it is what we expected. No torrent links this [00:58:20] time, no open models that we can start fine-tuning.

    [00:58:22] Alex Volkov: So first of all, kudos on this release. I've used some of it in Le Chat, and I'm very happy with the responses; they're fairly quick, and definitely good responses. Nisten, I think your perspective from before, from the open source segment, is very interesting: they spoiled us so much with the open models, with the Mixtral models and even the 7B, that even Large doesn't seem that significantly better.

    [00:58:45] Alex Volkov: However, just on the metrics, it looks like we just got another competitor in the ring: now there's Google with Gemini Pro, Anthropic's Claude keeps releasing models that are less performant, at least on LMSYS, than their previous models, and now Mistral is not only doing fully open weights, open source, but also the API.

    [00:59:03] Alex Volkov: And if folks want to build on top, they can. An additional thing: they also announced a partnership with Microsoft, and that these models are also going to be distributed through Azure. And I think this is a big deal for companies who maybe don't want to trust a startup from Europe that's less than one year old; maybe their servers are in Europe, maybe companies don't want to trust their ability to stay up because there are only 30 people, or enterprises need more stuff like ISO and different certifications.

    [00:59:34] Alex Volkov: And so I think it's a big deal that Microsoft is now also supporting and giving us access to these models through Azure, especially for companies that want stability, not Stability the company, just stability in general. I want to mention that, if you guys remember, after Dev Day OpenAI went down, not for a week, but there was a whole period where OpenAI had a lot of issues in production, and the Azure version of OpenAI stayed stable.

    [01:00:00] Alex Volkov: Obviously Microsoft wants to sell their cloud, and I do believe it's a very big deal that Mistral is now supported through Azure as well. In addition, Microsoft also announced a small stake in Mistral, and Arthur, the CEO of Mistral, went and clarified things. First of all, their new website with these announcements again didn't include some stuff, or included a note that you shouldn't train on this, right?

    [01:00:22] Alex Volkov: And then our friend Far El here, for the second time, called them out publicly, and for the second time Arthur Mensch, the CEO of Mistral, came and said, whoops, gone. So it does seem like an omission rather than something they put up on purpose and then removed after Twitter called them out.

    [01:00:38] Alex Volkov: Far El, thank you for noticing. Some other folks also noted that their commitment to open source, which we discussed before, was gone from the website, and they put it back. So now, even though this time they didn't release any open weights for us, their commitment to open source is prominently featured at the top of their website.

    [01:00:59] Alex Volkov: And now there are two segments there: one of them is what they call optimized models, and one of them is the open weights models they released for the community. As we talked about previously in the open source segment, their models from six months ago are still competing with something like the new and cool Gemma with its 8-plus billion parameters.

    [01:01:15] Nisten Tahiraj: It's still a 32K context window, by the way. I measured, and after that it completely forgot. And also, it was okay; I was expecting it, as a chat model, to be way more chat-optimized, but it does feel more like a base model. And yeah, again, as I said in the comments before, we're too spoiled by all the 7B and Mixtral finetunes and merges.

    [01:01:43] Nisten Tahiraj: So that even though this is extremely good and very utilitarian, and if your business needs it you should use it because it provides reliable answers, we were just expecting more.

    [01:01:56] Alex Volkov: One thing definitely to note as well, and we mentioned this a little bit, but it's worth mentioning: the smaller model is now upgraded, and if you play with it, they also updated the pricing. And I would also caution folks: the tokenizer they use is a different tokenizer.

    [01:02:10] Alex Volkov: So sometimes when you measure tokens they may look different. Our friend Xenova here in the audience has a tokenizer playground on Hugging Face, which, by the way, also went down yesterday with the rest of Hugging Face, so when I went to check just the length of a string I wasn't able to. It was sad, but now it's back.

    [01:02:24] Alex Volkov: So that playground, I think, measures OpenAI token lengths, and Mistral, I think, has a different tokenizer. So when you calculate pricing for your usage, definitely make sure you're counting with the right tokenizer; there's a quick sketch of one way to compare counts below. You're welcome to come up and tell us about this. One last thing on Mistral is that it supports function calling as well, which I think is a big deal.
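    Here is that sketch: it compares token counts across tokenizers, assuming tiktoken for the OpenAI side and, since the hosted Mistral models' exact tokenizer isn't published alongside the API, the open Mistral 7B tokenizer as a rough stand-in:

```python
# Compare how two tokenizers count the same text, to avoid mispricing API usage.
import tiktoken
from transformers import AutoTokenizer

text = "ThursdAI is a weekly recap of everything that happened in AI."

openai_enc = tiktoken.encoding_for_model("gpt-4")
mistral_tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

print("OpenAI tokens: ", len(openai_enc.encode(text)))
print("Mistral tokens:", len(mistral_tok.encode(text)))
```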

    [01:02:41] Alex Volkov: We mentioned this before with the function calling leaderboard, and now Mistral models can also respond to your RAG applications, or whatever you're building, with the actual functions you should call, which I think is super cool. The industry is moving there, and it shows again that OpenAI can come up with something

    [01:02:57] Alex Volkov: a year ago and basically set the standard for how things should look. I actually don't know if assistance API is going to be like this, but I do know that, for example, we talked about Grok and Grok supports the OpenAI standard. And many of these, I don't know if Mistral does, but many of the like Together API and other I think Perplexity as well, all of them have their own version of their API, but also you can just replace whatever code you wrote for OpenAI with just like a different proxy URL.

    [01:03:24] Alex Volkov: And then you basically use the same structure that OpenAI innovated on, so that I think is pretty cool. Moving

    [01:03:32] Nisten Tahiraj: Yeah,

    [01:03:33] Connor Shorten: Also, just a note: the OpenAI pip package allows you to actually call any URL, it doesn't matter if it's OpenAI or not, as long as it uses that standard. It's very easy to drop in any replacement for the OpenAI API.

    [01:03:49] Alex Volkov: Yeah, including local ones. If you use LM Studio (our friends at LM Studio, shout out Yags) or Ollama, I think both of them will expose a local server when you run open source models. And then you can put your local URL that runs the server with the local model into your code, and your code will also work. Yeah, thanks for that, Connor.
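
    As a concrete sketch of that drop-in swap (the port and model name are assumptions; LM Studio and Ollama print the actual address when you start their local servers), pointing the official OpenAI client at a local endpoint looks roughly like this:

    ```python
    # A minimal sketch of the drop-in swap Connor describes: the official
    # OpenAI Python client pointed at a local OpenAI-compatible server.
    # The base_url and model name below are assumptions; use whatever your
    # local server (LM Studio, Ollama, etc.) actually reports.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:1234/v1",  # local server instead of api.openai.com
        api_key="not-needed-locally",         # most local servers ignore the key
    )

    response = client.chat.completions.create(
        model="local-model",  # whatever model name your server exposes
        messages=[{"role": "user", "content": "Say hi in one sentence."}],
    )
    print(response.choices[0].message.content)
    ```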

    [01:04:08] Alex Volkov: This is a very cool thing that people may have missed. The same can be said, by the way, about Copilot. It's a little harder, but you can basically replace the Copilot infrastructure in VS Code with local models, if they support it, by going through the settings. But moving on to, I guess, Google teasing Genie, right?

    [01:04:26] Alex Volkov: Google teases Genie, which is quite incredible. You take one image of something your kid drew that has a character in it, and you provide it to this video-type, text-to-video or image-to-video model. And in response, you get a full interactive world where the character, in the same style, is moving around and interacting. Seeing this is unbelievable, because it shows we're very close to being able to take one picture and start animating it. It's very worth adding this to the top of the post with a video.

    [01:05:05] Alex Volkov: It's really hard to explain in words, and I haven't read the paper, but Genie was really mind blowing as well. It's from Google, and they only teased it, so we don't actually know if they're going to release it. Far El, you wanted to comment? I saw your hand.

    [01:05:20] Far El: Sure. If any of you have watched sentdex's YouTube video from a few years ago about GameGAN from NVIDIA: it's basically GameGAN, but with generative AI. And it's pretty awesome, because it means that we're all headed towards the direction of basically interactive rendered worlds.

    [01:05:43] Far El: And Sora is one, one extreme end of that with really high quality text to video. But then what happens when you actually add actions into the loop? And that's what basically Genie does. So we're probably going to see the marriage of both methods, both architectures very soon. Very exciting work for sure.

    [01:06:04] OpenAI to buy WordPress & Tumblr data

    [01:06:04] Alex Volkov: And so I think we covered most of the open source and big company stuff. One quick thing before we move on:

    [01:06:10] Alex Volkov: OpenAI opened up fine tuning for 3.5, and OpenAI is in the news again this week because WordPress and Tumblr (basically, I think both of them belong to the same company, Automattic) are preparing to sell user data. It sounds scary, but honestly it's all public and probably gets scraped anyway.

    [01:06:30] Alex Volkov: And still, they're preparing to sell this data, probably in a more structured and more licensed way, to OpenAI and Midjourney. That's very interesting, because Tumblr had a bunch of images and probably was scraped to an extent, and WordPress definitely was. Just to clarify, this is not WordPress the platform, the open source software everybody can use to run their websites.

    [01:06:51] Alex Volkov: That's not what they're selling; I don't think they can. But WordPress.com, I think, is where you can host a blog for free without knowing how to run a WordPress install yourself. So WordPress is the open source system that you can run your blogs and websites on, which powers like 30 percent of the internet or something crazy like that.

    [01:07:06] Alex Volkov: But wordpress.com is also the place where you can host your blog, and basically, when you signed up and created your blog there, you maybe didn't know the information was there to sell. So, like Reddit supposedly selling Reddit's information to Google for 200 million, which we talked about last week,

    [01:07:24] Alex Volkov: Automattic is now basically trying to extract money based on their data, where previously this data was just scraped. What's the benefit for OpenAI? Obviously, there's now a lawsuit with the New York Times over whether or not this is considered fair use, and whether or not OpenAI's models can spit out full New York Times articles.

    [01:07:44] Alex Volkov: So there's a whole debate about this and there's going to be a lawsuit because they didn't achieve a similar deal with New York Times. Although it was reported the folks from OpenAI actually did talk with New York Times to try and have more of a structured access and licensed access. And WordPress is definitely a huge chunk of the internet and now some of that information is going to go into these models in a more structured and licensed way.

    [01:08:12] Alex Volkov: And moving on to diffusion models, because there's a bunch of updates there, and I think Genie takes us a little bit into diffusion models, so let's see if we have a thing for this, yeah.

    [01:08:41] Playground open sources a new diffusion Foundational model

    [01:08:41] Alex Volkov: All right. As I said before, we don't cover this at length; I know there's a bunch of other spaces for AI art and diffusion specifically. But when we do, it's because something very big happened, and this week was a huge week as well. So I just want to shout out that we had two foundational models, basically, and then another thing that just broke my jaw, which we're going to talk about:

    [01:09:01] Alex Volkov: Playground. Playground is from Suhail Doshi, I think that's his name. He was previously at Mixpanel, he started building a browser called Mighty, then he switched fully into AI, and I think a year ago he started working on Playground.

    [01:09:17] Alex Volkov: Playground is an interface that is super fast and lets you generate a bunch of images, and it's just an interface on top of (or at least previously it was an interface on top of) DALL-E and Stable Diffusion. They kept giving all of this image generation away for free, and basically they collected their own styles, etc.

    [01:09:36] Alex Volkov: And they've collected all this information about what people actually prefer. And now they released an open model, a new diffusion foundational model, which we haven't had for a while. If you guys remember, we talked about SDXL Lightning, which is based on SDXL, and we've talked about Stable Cascade, which is also related to Stability.

    [01:09:54] Alex Volkov: We haven't had a new open model for generating images in the wild for a while, and Playground released their model called Playground v2.5. The cool thing about this is that, first of all, they say it looks great on realistic stuff. Secondly, they say that on user preference over an internal set of about a thousand prompts, they significantly beat the baseline for DALL-E, for Midjourney, for the previous version of Playground, and for SDXL as well.

    [01:10:23] Alex Volkov: And by significant, I mean that on internal preference against SDXL 1.0, SDXL gets something like 17 percent and their new model gets 82 percent, which is quite a stark jump in capability and improvement. They also show an improvement over Midjourney, the latest 5.2 version, and Midjourney is really good at realistic images.

    [01:10:44] Alex Volkov: So what they excel at is realism, and, I think they also mentioned, different aspect ratios. Like most of these image models, they've been trained at a certain resolution, 1024 by 1024 for SDXL, for example, and when they generate something at a different ratio, it looks different.

    [01:11:01] Alex Volkov: So they also claim that their model is significantly more performant at different aspect ratios as well. Definitely a shout out to the Playground folks for working on this awesomeness, because who's gonna say no to another model? And there's a demo from, I think, Model Labs, that actually makes this run really fast.

    [01:11:17] Alex Volkov: If you guys remember, last week I talked about a thing I built with SDXL Turbo and Groq. And obviously SDXL Turbo, or SDXL Lightning, is super fast. Compared to those super fast examples, the Playground image generation is just night and day. It just looks so real, it's quite striking.

    [01:11:37] Alex Volkov: So if you're looking for any updates in that area, definitely check out Playground. And because it's a model they released, I think you can use it for free. The only thing I don't know about is support in the community tooling, whether it works with ComfyUI and that kind of stuff, but they just released it, so I'm sure support will come.

    [01:11:56] Alex Volkov: And obviously the LoRAs and everything else in this community will be very interesting to see. There's also a Hugging Face demo. And then the second thing in image generation, real quick, is Ideogram. We've talked about Ideogram before; it's a startup that came out of folks who worked on Imagen and image generation at Google, and apparently weren't very happy with the slowness of the releases.

    [01:12:17] Alex Volkov: And while Google and its image generation is suffering from bad news and is in hot water because of the prompt injection they had (and we didn't discuss it here, but I mentioned it at the beginning: Sundar Pichai sent an email to all of Google saying, hey, we made mistakes, we offended some of our users, and we need to make organizational changes),

    [01:12:35] Alex Volkov: which is not a small thing for the head of a company, to admit a release went this badly. Ideogram was created by folks from Google, and they released it for free, and I think they just announced Ideogram 1.0. The best thing about it, I think, is the text. Everybody is now focusing on different things,

    [01:12:56] Alex Volkov: and while all these models can generate text to some extent (DALL-E can do text, but it's not perfect), Ideogram's text generation is super cool. I've already used it multiple times just to answer somebody on X with an image of text; for example, I think I sent Hugging Face a thank you note that was just text.

    [01:13:13] Alex Volkov: And it's really cool to have a model that's very good at presenting and generating text within the imagery you want. Ideogram 1.0 also announced an investment from a16z, and their text really looks super cool. I was able to do something that other models could not do: I asked it to generate "hashtag ThursdAI". And if you think about it, that text is not in the training set, because we came up with the concept, and a hashtag confuses these models. I think this was the first model that was able to not screw up "#ThursdAI" fully.

    [01:13:50] Alex Volkov: It's still cherry picked, so three out of four still weren't perfect, but this is definitely the best text model we have. Ideogram, check it

    [01:13:57] Alex Volkov: out. Yeah, go ahead.

    [01:13:59] Aditya Kusupati: Yeah, just randomly, I noticed we have one of the creators in the audience. I think it was one of the top 10 on Hugging Face pretty recently, their data out of GPT-3, and they also have what's called a DALL-E 3 training dataset. They released a new model recently too; I posted it up for you. So if we have some time after the interview, maybe we can bring them up.

    [01:14:25] Alex Volkov: Yeah, let's see where our second guest is. Oh, he said he's going to join in 10 minutes or so, so we have a little more time. And the last thing I want to cover, and I actually want to go to my profile and paste this, because you guys have to see this if you haven't. Okay, so first of all, I'm adding an image to the show notes now, the last pinned tweet, an image of a very happy sheep where they all say "we're doomed". No, this is not the one, I meant this shirt, hold up, yeah, this one, "we're doomed", and the text there is really cool. The cool thing about the text is that it's style transferred into the image itself, so it looks like part of the image. But this is not what I wanted to post. I wanted to post the jaw dropping video from Alibaba, from a model that they teased and hopefully will release soon, called EMO.

    [01:15:13] Alibaba teases EMO - a way to animate and make avatars talk and sing from 1 image

    [01:15:13] Alex Volkov: And folks, I don't have a button for this, I don't have a musical transition. I will just say that, if you remember, and if you were here on ThursdAI when Sora was announced, this was live I think two weeks ago, we had the Sora release and we were just freaking out live on stage.

    [01:15:30] Alex Volkov: Our jaws were collectively dropping from what we were seeing. Sora showed a significant jump in capability for image-to-video and text-to-video generation, in coherence throughout the scene, and in longer generations. And since then, OpenAI has been Sora-posting, that's what I call it, Sora-posting on TikTok.

    [01:15:50] Alex Volkov: So if you're on TikTok and you don't follow OpenAI, they literally opened a full account that just posts Sora videos. And since then, the videos they've released there just show the capabilities of that incredible model. It does look like the ChatGPT moment for video generation, based on what they've released.

    [01:16:07] Alex Volkov: I think that EMO from Alibaba is definitely one of those moments. And actually, it's really funny, because the Alibaba folks took one of the Sora generated videos. If you remember, one of the first ones is a woman walking through Tokyo, wearing sunglasses, and it zooms into her face, and all of this video is generated, which is quite crazy.

    [01:16:29] Alex Volkov: We're now like, oh yeah, of course it generated the woman walking through Tokyo wearing glasses, but it's still mind blowing. So the EMO folks from Alibaba took that video, took a still from it, just a still, not the whole video, and made that exact woman sing a Dua Lipa song, and this is now pinned on top of the space. And folks, my jaw dropped when I saw this, and then dropped again when I started looking at all the details.

    [01:16:56] Alex Volkov: I did a little deep dive into image generation, avatar creation, basically taking an image and making it sing or lip sync. And usually those models, they move maybe the mouth a little bit, some of them move the eyes. This model makes this from one image, one image only. It makes the eyes move independently, it makes the eyebrows move independently, obviously the mouth.

    [01:17:17] Alex Volkov: I saw earrings get animated, I saw the vocal muscles in the throat get animated, where, if somebody talks, you can see their throat move differently. I'm noticing all of these things. The woman in the video that I'm referring to wears sunglasses, and most of these models would only move the sunglasses to an extent.

    [01:17:35] Alex Volkov: These sunglasses stayed in exactly the same place, so the coherence of this model is way beyond anything I've seen. And I've researched this field: I used D-ID, I used Wav2Lip, I used all of these tools, and just think about what you'd be able to create with something like Sora plus something like this EMO thing.

    [01:17:50] Alex Volkov: It just opens new horizons. Many of my friends in AI art are looking at this in disbelief, because it really feels like a Sora moment as well. I just wanted to highlight how exciting this was for me and how huge of a jump this was from everything we've seen before.

    [01:18:07] Alex Volkov: Reactions from folks on stage: what did you think when you saw EMO? Same as me? Existential dread? Anything else? Yeah, same as me. All right. Nisten, go ahead, and I'm going to take a look.

    [01:18:22] Nisten Tahiraj: I just want something like this that's small, real time, and cartoonish, so I can just use it as an assistant. That would be great. I'm impressed, but I just want a small, tiny one. I want Clippy. I want the actual Clippy. Yeah,

    [01:18:37] Alex Volkov: They didn't animate Clippy, but I found it very interesting that they animated the Sora generated woman with the voice of Mira Murati, the CTO of OpenAI. They took her voice and embodied one of their creations with it, and I found that a very interesting choice on their part. I will say, while Aditya comes up (Aditya, if you can hear me, I'm sending you a request), oh yeah, there he is.

    [01:19:00] Alex Volkov: I found it very interesting that they haven't released the model yet, but they did say they're committed to open source and are going to release it, and their GitHub repo for EMO is up, but there are no commits there, it's just a readme. So hopefully they're going to release it, and hopefully we'll get to the point where we can actually, Nisten, like you're saying, have an actual assistant in near real time, with a generated voice, actually reading out whatever the LLMs tell us.

    [01:19:25] Alex Volkov: And I think this last thing I'll say here before I move on to the interview is this adds to this notion that I think we saw from Logan from OpenAI, where a chat is not the final interface for these things. I think embodiment like this is one thing that moves us forward.

    [01:19:40] Deep dive into Matryoshka embeddings with Aditya Kusupati & Prateek Jain

    [01:19:40] Alex Volkov: All right, folks, those were the updates, and now we're moving on to more of a deep dive interview. I'm very happy to introduce two guests here. Two guests! I'm getting a little winded, so forgive me.

    [01:19:54] Alex Volkov: I want to say hi to Aditya Kusupati and Prateek Jain, and thank you. And folks, feel free to unmute yourself and talk and call out. But basically, welcome Prateek, welcome Aditya. How are you guys?

    [01:20:05] Aditya Kusupali: thank you,

    [01:20:06] Prateek Jain: Alex. Thanks so much, Alex, and thanks everyone for listening.

    [01:20:10] Alex Volkov: I'm gonna set this up. I think, Prateek, I noticed you on my timeline first, and then I saw Aditya's tweets as well, where we talked about OpenAI's new embedding models. One of the things that was very interesting back when they released them, and this is how I got to talk with you guys, is that they added some new parameter to their new models.

    [01:20:30] Alex Volkov: So they had Ada 002 before, and then they said, hey, we're releasing two new models, embeddings version 3, and they have a way to specify dimensions. Previously on ThursdAI, we've talked about embedding models, we've talked about the MTEB leaderboard that Hugging Face has, we have folks from Jina, friends of the pod, who also released top-of-the-line embedding models, and we definitely looked at open source models in comparison to something closed source like OpenAI's, and dimensions were a big thing in that whole area. Then OpenAI released something where you can specify the number of dimensions, and this raised an eyebrow; I was like, oh, that's interesting, I don't even know what this is about. And then, Prateek, I think I saw your tweet saying, hey, congrats OpenAI. Unfortunately, you didn't mention us.

    [01:21:19] Alex Volkov: And then somebody from OpenAI reacted and said, oh, actually, yeah, we do use something called MRL, and they added this to the blog post. Prateek, before we dive into what MRL actually is, could you talk about what they added and why? And yeah, just talk about this phenomenon of them not adding you to the blog post.

    [01:21:38] Prateek Jain: They had done the work on their own and everything, and they did release really strong embeddings; the results on the MTEB eval boards looked really good. Definitely many congratulations to them. The only thing was that they had released this new thing, as you mentioned, called shortening embeddings.

    [01:21:54] Prateek Jain: And the output structure, in some sense, seems very similar to what Matryoshka representations, or these nested representations, do. And we do know that they were at least aware of Matryoshka representations, because through some of our earlier conversations, at least some of the research scientists had reached out to us and talked to us about some of the details of Matryoshka representations.

    [01:22:13] Prateek Jain: It felt a little bit against the spirit of open science and pushing the scientific boundary, so that's the only reason we highlighted that it would be good if the initial work could either be cited or the same name used. I think they were very gracious; in particular, the person who had written the blog said that, yeah, there was a miss on their part, and they have updated the blog now, so all good. I think when we do open research and publish and discuss ideas, it moves the field very fast and helps everyone. We are definitely all up for it.

    [01:22:49] Alex Volkov: Yeah, absolutely. I want to talk about when you guys released MRL. This was way before the explosion of LLMs and before ChatGPT came on the scene, right?

    [01:22:56] Alex Volkov: You released MRL, Matryoshka Representation Learning, back in 2022, right? Almost two years ago, like a year and a half ago?

    [01:23:05] Prateek Jain: Yeah.

    [01:23:06] Alex Volkov: So talk to us. Maybe give a brief explanation of what... I think folks in the audience here are generally okay with embeddings, but maybe dimensionality is still a somewhat elusive concept.

    [01:23:18] Alex Volkov: Would one of you tackle the task of explaining what dimensionality means in a very popular-science way, so we can then dive into how adjusting dimensionality actually helps performance and different things?

    [01:23:29] Prateek Jain: So generally, what happens is, if you have, say, some text data, let's say a string of 1,024 tokens, or let's say you have a 64 by 64 image, what a computer, in some sense, would want is to see them as a set of numbers.

    [01:23:47] Prateek Jain: Or a vector of numbers. Through this incredible line of work around embeddings, what we are able to do is embed these images or text or whatever data object you have into a fixed dimensional vector. So by that, what I mean is, you might have a 64 by 64 image, but you can write that as a vector of, let's say, 128 numbers, right?

    [01:24:11] Prateek Jain: So that is what we call dimensionality. That is, it is 128 dimensional vector that we want to work with. Why is this interesting? Because if you have a 64 by 64 image and you just change some pixels let's say only 1 percent of the pixels. Those changes would not even be visible to you, but when you compute, let's say, the distance between these two images along pixel space that is, if you're just subtracting two images from each other pixel by pixel, the distance might seem very large, but in reality, semantically, both of them mean essentially the same.

    [01:24:41] Prateek Jain: So what we ideally want is some of these embeddings which capture the underlying semantic structure of the data object, let's say the image. Say there are two images, both of them contain a cat in a very similar pose. We would want them to be represented within our machine as very similar objects, and that is what these embeddings, or semantic embeddings, are able to do.
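
    As a minimal sketch of that idea (the model name is just an example of a small, widely available sentence-embedding model, not one the guests mentioned), comparing two cat captions against an unrelated sentence might look like this:

    ```python
    # A minimal sketch of "semantically similar inputs get nearby vectors".
    # The model name is an example choice, not one mentioned on the show.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    sentences = [
        "A cat sitting on a windowsill in the sun",
        "A kitten lounging near a sunny window",
        "Quarterly revenue grew by twelve percent",
    ]
    embeddings = model.encode(sentences)  # shape: (3, embedding_dim)

    # Cosine similarity: the two cat sentences should score much higher
    # with each other than either does with the finance sentence.
    print(util.cos_sim(embeddings[0], embeddings[1]))  # high
    print(util.cos_sim(embeddings[0], embeddings[2]))  # low
    ```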

    [01:25:03] Prateek Jain: So generally there are multiple techniques to take, as I said, either the image or text or audio, whatever you have, and embed it into a fixed dimensional representation, that is, a fixed number of floating point numbers or integers. Now, generally, these representations are rigid.

    [01:25:21] Prateek Jain: They are fixed. That is, let's say a person, a designer, has to say a priori that, okay, I can deal with 128 dimensional representations for my image, and on the basis of this, I can run some sort of classifier, or some sort of retrieval algorithm to retrieve similar images or classify the image into some particular class.

    [01:25:39] Prateek Jain: So generally, that decision is made a priori that I will be forming it into 128 dimensions because 128 dimensions, let's say, are able to give me the accuracy I want and I will be able to deploy them in my system because that's another sort of key part. Whenever you are deploying them, the dimensionality of the embedding can be a critical thing.

    [01:26:00] Prateek Jain: Let's say, if you want to do retrieval, the cost of retrieval is almost directly proportional to the dimensionality of the data point. So the decision is made a priori. For example, the earlier embeddings that came out from OpenAI, they made the decision that, okay, these embeddings should be, I think, 1024 dimensional or something like that.

    [01:26:19] Prateek Jain: So you just had those 1024 dimensions, and the not-so-good part about that is that now everybody who wants to use those embeddings has to change their system to suit that 1024 dimensional representation. So people who might be running, say, some sort of retrieval engine on 64 dimensions will now need to scale up everything, change how they are doing retrieval, how their indexer works, how their serving works, to fit those 1024 dimensions.

    [01:26:46] Prateek Jain: And that's not ideal, right? So the idea behind Matryoshka representations was that can we bring flexibility in these embeddings? That is, while we are giving out 1024 dimensional embeddings, can somebody come and read off just 64 coordinates out of it so that, they don't need to change their entire serving stack?

    [01:27:07] Alex Volkov: So I want to slide in here with a question before we get to your solution for dimensionality flexibility, which is very cool. You're saying the a priori decision basically means that I, as a developer, if I used whatever OpenAI (or any other rigid structure) has given me, had to basically abide by their rules about how in-depth those embeddings represent my concepts, correct?

    [01:27:31] Alex Volkov: And could you talk about maybe before we dive into dimensionality, how this affects actual retrieval? Is more embeddings always better? There's a thing that I heard yesterday that somebody mentioned. It's called the curse of dimensionality. And I really wanted to dive in and hear about what this means.

    [01:27:46] The curse of dimensionality

    [01:27:46] Alex Volkov: Because we've talked before, and there are embedding models with like 8,000 dimensions or so, and I heard from Beau, who's in the audience here and may join us as well, that that's not always the best choice, for many reasons, not only speed. Could you talk about the curse of dimensionality, and whether more is always better?

    [01:28:03] Prateek Jain: So that's a great question. Definitely, more dimensions should intuitively help you capture more and more information about the data you are trying to embed. But obviously, beyond a certain point, it starts becoming complete noise. For example, go back to the image example I was giving: you have a 64 by 64 image.

    [01:28:24] Prateek Jain: You can think of that as roughly a 4,096 dimensional vector, right? And if you want a very precise embedding, then maybe that 4,096 dimensional vector is what captures everything about the image, because that is precisely how we are seeing that data point, right?

    [01:28:40] Prateek Jain: But the bad thing about that sort of representation is that it is not capturing the semantic information; it is also bringing in a lot of noise. There is some sort of sweet spot for what dimensionality you want to stop at, right? That's one part of it.

    [01:28:55] Prateek Jain: But when you come up with these representations, they are going to be used in some downstream task, right? As I mentioned earlier some of the downstream tasks are I have this representation of the image. Now do classification for me. So I will run some sort of classifier on top of this representation of the image to say that, okay, whether this image has a cat or a dog, right?

    [01:29:17] Prateek Jain: Similarly, I can say that, okay, I want to retrieve most similar image to this given image in my database of all the images. So I might have an entire database of animals. I give you an image of a particular cat, and I want to retrieve a cat which is most similar looking, maybe in similar pose, similar situations, right?

    [01:29:35] Prateek Jain: So these models or these embeddings are used in this downstream task and to use them in these downstream tasks, we need to, we are also then bound by the realities of those downstream tasks. For example, if you want to do classification and you have only let's say, 200 data points to train the classifier, Then a very high dimensional embedding is not great because that will then give you very poor performance, like your model will overfit, it will just like mimic whatever it is seeing on training data and it will not generalize to new test points.

    [01:30:07] Prateek Jain: So it can be catastrophic. A similar situation happens in retrieval or nearest neighbor search: with a very high dimensional embedding, as you mentioned earlier, the curse of dimensionality applies, which might mean that my nearest neighbor search is not working well, especially if I'm doing any kind of approximation, and I might get essentially garbage out of that situation.

    [01:30:30] Prateek Jain: So that's why, based on the downstream task, The amount of training data I might have, the serving realities there, that okay, how much latency I can spend or how much compute I can spend in serving, I might have a sweet spot into that. Okay, this is the dimensionality that works best for me. And I want to ideally want to select that and work with it.

    [01:30:50] Alex Volkov: I see. And Aditya, it looks like you can now join, and I also wanted to follow up with you, because Prateek is talking about... and Prateek, the examples you gave are image embeddings, and that's great, but I think one of the huge things that happened since you guys released the paper is how much LLMs are being used for different things, right?

    [01:31:07] Alex Volkov: And I think this led to an explosion in vector databases, and people started embedding everything, and I think for many of the developers who use LLMs, text embeddings (or at least they started with text and now it's multimodal) are the biggest use case right now in RAG.

    [01:31:23] Alex Volkov: Aditya, would you want to expand on how much this whole field started heating up, with vector databases now storing every embedding? I definitely didn't hear about this until a year ago. Would you want to chime in on this and how your work is now super relevant to this whole new world?

    [01:31:40] Aditya Kusupati: Yeah, as Prateek said, I think the curse of dimensionality even applies in vector databases, because you have to search through things. And the major thing is you also need to think about storage. So let's say you want to index a billion documents.

    [01:32:00] Aditya Kusupati: If you want to do everything with, say, 1024 dimensions, you're going to have to use about a terabyte, or four terabytes, worth of storage. And a lot of people might not be willing to do that. So what people typically do in vector databases is store one copy, and when they're trying to do some processing on top of it, they do some sort of compression. It can be a lot of things.

    [01:32:18] Aditya Kusupati: And it works great, but it's a lot of post processing, and you also need to store the actual embeddings in your vector database. With data that keeps growing and growing, and no way for you to control the total amount of it, you should probably figure out a way to make your representations more compact while staying accurate.

    [01:32:40] Aditya Kusupati: I think that is where there was a lot of oversight for the last few years. Again, vector databases existed before last year, but they blew up because of RAG applications. And in the Matryoshka case, as OpenAI said, it gives you the flexibility to just store 64 dimensions if you want, and that should just be it.
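
    For reference, this is roughly what that flexibility looks like through OpenAI's embeddings endpoint; the model name and dimension count below are example values, not recommendations:

    ```python
    # A rough sketch of requesting a shortened embedding from OpenAI's API.
    # Model name and dimensions are example values, not a recommendation.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    resp = client.embeddings.create(
        model="text-embedding-3-small",
        input="Matryoshka embeddings let you truncate vectors cheaply.",
        dimensions=64,  # ask for only the first 64 Matryoshka dimensions
    )
    vector = resp.data[0].embedding
    print(len(vector))  # 64
    ```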

    [01:33:00] Alex Volkov: And 64 is way smaller than the previous dimensionality, which I think was 1,536 or 1,024 or so. I would also be remiss not to mention that video is coming into play right now with large multimodal models. They're not only understanding text and images; now we're talking about video embeddings, for example, and being able to represent those.

    [01:33:21] Alex Volkov: And when you talk about storage costs, etc., dimensions definitely affect that, and also the speed of retrieval and comparison. So let's move on, because you guys wrote the paper before this explosion, but the concepts definitely existed. I want to hear what Matryoshka representations are and how they affect dimensionality.

    [01:33:38] What are Matryoshka Embeddings?

    [01:33:38] Alex Volkov: Specifically, being able to choose the dimensionality, and at which point in the process. I would love to hear a brief explanation from you, and then we can dive in and ask more questions.

    [01:33:47] Aditya Kusupali: Sure.

    [01:33:48] Prateek Jain: Let's take

    [01:33:48] Aditya Kusupati: the running example. For the sake of the exercise, let's say there is a 1024 dimensional representation of your image; let's keep it at 1024 for now. You're basically trying to fit a bunch of learned attributes, so it could be some version of color, some version of texture, et cetera, which is being fed into these dimensions.

    [01:34:08] Aditya Kusupati: So that is what these embeddings are learning, and they're extremely good at a lot of semantic tasks. If you want to find a similar looking dog, it's much easier to search in this space. So that's the goal, right? Until now, when you wanted to do things faster, you took these embeddings and did some sort of compression, most likely some notion of PCA, or a low dimensional projection, or some sort of quantization, okay?

    [01:34:35] Aditya Kusupati: And that's how you used to do it. So there is an additional processing overhead on top of the existing embeddings for you to get this done. We wanted to fix this problem, because this additional overhead need not always give you the most accurate solutions. So the motivating goal for us was to figure out if we can pack the information into this 1024 such that we don't have to project it into a low dimensional space or do any post processing to get a 64 dimensional embedding, but rather

    [01:35:04] Aditya Kusupati: just take the first 64 dimensions of this vector. So if there is a collection of 1024 numbers, I want you to be able to cut it off at the first 64 and say: this is a 64 dimensional embedding which is as good as any 64 dimensional embedding you could ever build. Does that make sense? That was the goal.

    [01:35:24] Aditya Kusupati: So that is what the final embedding should look like, and that is what we try to do. And it turns out training these things is so simple that it's literally what you think: if you want the first 64 dimensions to be the most important, you optimize the same loss function you are using for the 1024 dimensions on the first 64 dimensions as well.

    [01:35:45] Aditya Kusupati: Let's say you are doing some text embedding training, where you are trying to pull two relevant text embeddings together and push two irrelevant text embeddings farther apart. There is a loss, typically contrastive, which does this in the 1024 dimensional space; you also apply it in the 64 dimensional space.

    [01:36:05] Aditya Kusupati: That's it. So you now have two losses instead of one, and at the end of the training, which again takes no extra cost beyond training a 1024 dimensional embedding, you get a first-64 prefix which is as good as any 64 dimensional embedding you could ever train.

    [01:36:22] Aditya Kusupati: And that's pretty much it. You can repeat this for multiple dimensions, so not just 64: you can do 64, 128, 256, and so on. Now you have these chunks of representations inside the 1024 which can cater to a wide variety of audiences, depending on their use cases. And a lot of the time, people don't care about precision.

    [01:36:44] Aditya Kusupati: If recall is all you care about in your retrieval application, you can just use 64 dimensions. And if you want more precise information, as Prateek said, you can encode more information in the higher dimensional embedding and go up to 1024. If you have a smaller number of data points and you're not able to cluster things properly, go for smaller dimensions.

    [01:37:02] Aditya Kusupati: So the flexibility just opens up so many things which were probably infeasible before, because you had to do some sort of post hoc compression or post processing on top, which led to slightly less accurate results. It just didn't allow you to do all of these things on the fly.
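
    To make the "it's literally a for loop" point concrete, here is a minimal PyTorch-style sketch of the nesting trick, assuming you already have an encoder producing 1024-dimensional anchor and positive embeddings. It illustrates the idea rather than reproducing the authors' exact training code:

    ```python
    # A minimal sketch of Matryoshka-style training: the same loss is applied
    # to nested prefixes of the embedding, so the first dims carry the most
    # information. The encoder producing anchor/positive embeddings is assumed.
    import torch
    import torch.nn.functional as F

    matryoshka_dims = [64, 128, 256, 512, 1024]

    def info_nce(anchor, positive, temperature=0.05):
        """A standard in-batch contrastive (InfoNCE) loss on normalized prefixes."""
        anchor = F.normalize(anchor, dim=-1)
        positive = F.normalize(positive, dim=-1)
        logits = anchor @ positive.T / temperature           # (batch, batch)
        labels = torch.arange(anchor.size(0), device=anchor.device)
        return F.cross_entropy(logits, labels)

    def matryoshka_loss(anchor_emb, positive_emb):
        """Sum the same loss over each nested prefix of the full embedding."""
        total = 0.0
        for d in matryoshka_dims:
            total = total + info_nce(anchor_emb[:, :d], positive_emb[:, :d])
        return total / len(matryoshka_dims)
    ```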

    [01:37:21] Alex Volkov: Wow. Just to sum up and see if I understand this: I'm looking at the figure, and unfortunately this medium is audio only, but it's very helpful to see a visual representation of this. You're basically front-loading most of the important information into the first 64 or 128 dimensions.

    [01:37:37] Alex Volkov: And you're saying that precision for specific use cases like RAG could still be as good as with the full 1024 dimensions. That sounds incredible to me.

    [01:37:47] Aditya Kusupati: Let's take an example. In your RAG, all you care about is 10 blue links, which need to be in the top 10. That's it. You don't care if the first link is the first one or the last link is the last one. There is some evaluation saying there's utility in position, but in most cases, if you get the 10 relevant documents in any order, that's all that

    [01:38:06] Aditya Kusupati: matters. You don't care if the best document is at the top or in 10th place; if you feed all of these things into your LLM, the LLM will forget it. So this is the case of recall: you don't care about precision. Your ranking only cares about getting the most relevant 10 documents into the first 10, not how relevant they are within themselves.

    [01:38:27] Alex Volkov: I see. I want to

    [01:38:29] Alex Volkov: also

    [01:38:29] Prateek Jain: Sorry, just to add a little bit more nuance there: in many situations, what might happen in your RAG is that, rather than getting, let's say, the top 10 links that Aditya mentioned, suppose I get the top 100 links, right?

    [01:38:42] Prateek Jain: Some of those top 100 links might be completely useless, complete rubbish. But as long as the correct top 10 links are sitting somewhere in that top 100, I'll be fine, because after that I can do refinement. The rough structure would be that you take, let's say, only the first 64 coordinates, or maybe only the first 32 coordinates, from MRL, and do the retrieval of the top hundred links with those.

    [01:39:06] Prateek Jain: Once you have those top hundred links, to get the correct top 10 you can do further rescoring based on the full 1,024 dimensions. And because everything is nested, those embeddings are already computed and I have them with me, right? So I can say that for the first phase of getting the top hundred

    [01:39:25] Prateek Jain: I can use 32 dimensions, and then in the second phase of rescoring, I can use the full dimensionality. Sorry for cutting in.

    [01:39:34] Alex Volkov: No, that was great, a great addition. And I want to ask: rescoring and re-ranking, are you referring to the same thing? Some folks take the initial results and then try to rank which were the most appropriate ones. Does this represent the case you guys talk about, where full precision is not really necessary for the initial responses?

    [01:39:52] Alex Volkov: And then we run another tool; sometimes folks do re-ranking with Cohere, for example, and then you judge the importance of those results and sort them in a secondary process.

    [01:40:02] Aditya Kusupati: Yeah, that's pretty much a relevant thing. I think Jo Kristian Bergum from Vespa is on the call; he's a proponent of late interaction, so you can do a lot of other re-ranking methods. But in this case, what Prateek is specifically saying is: let's say you recall with 64 dimensions, and you can rescore with 1024.

    [01:40:23] Aditya Kusupati: You can use the precise 1024 just to rescore, in case you ever want to. And this is all from the same MRL embedding.
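
    A minimal NumPy sketch of that recall-then-rescore pattern; the array shapes are made up for illustration, and it assumes MRL-trained, normalized embeddings:

    ```python
    # A sketch of "recall with the first 64 dims, rescore with all 1024".
    # Assumes MRL-trained, L2-normalized embeddings; shapes are illustrative.
    import numpy as np

    rng = np.random.default_rng(0)
    docs = rng.standard_normal((100_000, 1024)).astype(np.float32)
    docs /= np.linalg.norm(docs, axis=1, keepdims=True)
    query = docs[42] + 0.01 * rng.standard_normal(1024).astype(np.float32)
    query /= np.linalg.norm(query)

    # Phase 1: cheap recall using only the first 64 Matryoshka dimensions.
    coarse_scores = docs[:, :64] @ query[:64]
    candidates = np.argpartition(-coarse_scores, 100)[:100]

    # Phase 2: precise rescoring of the 100 candidates with the full vector.
    fine_scores = docs[candidates] @ query
    top10 = candidates[np.argsort(-fine_scores)[:10]]
    print(top10)
    ```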

    [01:40:33] Alex Volkov: All right, so moving on. Aditya, I think I heard you say that in the use case of LLMs and RAG (and again, you guys built this before the explosion), some amount of this work is offset to the LLM itself. After you retrieve and provide this data to the LLM, it can do some of this work for you, which I guess is why your work from a couple of years ago found newfound relevance.

    [01:41:01] Alex Volkov: But then you followed up with another paper, AdANNS, right? Could you talk about how that applies to Matryoshka embeddings as well? I would love to hear about the additional work you guys did in this area.

    [01:41:15] Aditya Kusupati: Sure. When Prateek was talking about retrieval, he also mentioned that you typically do a nearest neighbor search. So the goal is, when a query comes in, you embed it into the same space. The documents, say a billion of them, are encoded in the same space, and your target is to find, say, the top 10 documents which are most relevant.

    [01:41:32] Aditya Kusupati: And the way you do it is nearest neighbor search: you just try to find which vectors in your database are closest to the query. But the thing is, again, as Prateek said, the cost is directly proportional to the dimensionality as well as to the number of data points. So it's linear in the number of data points and in the dimensionality.

    [01:41:50] Aditya Kusupati: To reduce this cost at web scale (there is no way Google could ever serve things if every single data point had to be explicitly compared), there's an idea called Approximate Nearest Neighbors, which has been around for the last 25 years or so. The goal of Approximate Nearest Neighbors is: instead of touching all 1 billion points to get the top 10, I'm only going to touch, say, something like 10,000.

    [01:42:12] Aditya Kusupati: So I'm only going to search 10,000, by somehow partitioning the space, cleverly looking only at the places I want to look at, and getting to those 10,000. And within those 10,000, I'll do a more exhaustive search and find the top 10. Okay, this is Approximate Nearest Neighbors, and the simplest way of thinking about it is a tree structure.

    [01:42:32] Aditya Kusupati: So you have a billion points, and you basically build a huge tree structure using clustering. A billion points can be clustered into 50,000 clusters, which can further be clustered into 50,000 each, and eventually your leaf nodes will have about 100 data points in each.

    [01:42:48] Aditya Kusupati: This is a typical tree-based data structure which a lot of people use for Approximate Nearest Neighbors. In case anyone is interested, you can go check the FAISS library from Facebook; it's a very good resource for all of these things. This is Approximate Nearest Neighbors, and it plays very well with web scale systems.
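
    For anyone who wants to poke at that clustering idea directly, here is a small FAISS sketch of an inverted-file (IVF) index; the sizes and the nlist/nprobe values are arbitrary illustration numbers, not tuned settings:

    ```python
    # A small FAISS sketch of the cluster-then-search-a-few-cells idea.
    # All sizes and parameters here are illustrative, not tuned.
    import faiss
    import numpy as np

    d, n = 128, 100_000
    rng = np.random.default_rng(0)
    xb = rng.standard_normal((n, d)).astype(np.float32)   # database vectors
    xq = rng.standard_normal((5, d)).astype(np.float32)   # a few queries

    nlist = 1024                                 # number of coarse clusters
    quantizer = faiss.IndexFlatL2(d)             # how cluster centroids are compared
    index = faiss.IndexIVFFlat(quantizer, d, nlist)
    index.train(xb)                              # k-means over the database
    index.add(xb)

    index.nprobe = 16                            # visit only 16 of the 1024 cells
    distances, ids = index.search(xq, 10)        # approximate top-10 per query
    print(ids)
    ```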

    [01:43:05] Aditya Kusupati: You want any of your embeddings to play well with Approximate Nearest Neighbors if you want to scale to web-scale workloads. While powerful... can you hear me?

    [01:43:12] Alex Volkov: yeah, we can hear you, cut off for a second, now we're back.

    [01:43:16] Aditya Kusupati: Okay, so Matryoshka representations, as Prateek said, mean you can retrieve 100 documents using 64 dimensions and re-rank with, say, 1024 to get the top 10. While this is sound in principle, when you try to do it in systems-aware settings it does not scale well, because those 100 documents need not be sitting on the same machine, they need not be co-located, and so on. There are so many systems considerations which start blowing up, and Approximate Nearest Neighbors directly handles this.

    [01:43:46] Aditya Kusupati: Approximate Nearest Neighbors ensures that similar documents live in a similar chunk of your memory, so your systems can take care of a lot of these things. So we wanted Matryoshka representations to power better approximate nearest neighbors. That's why we came up with AdANNS, or Adaptive Approximate Nearest Neighbor Search.

    [01:44:03] Aditya Kusupati: And the goal here is: when you're doing approximate nearest neighbors from 1 billion points down to 50,000 clusters, followed by another 50,000, and let's say you have a 1024 dimensional embedding, you normally use that same 1024 dimensional embedding for every single one of these phases. But as we talked about earlier, if you only care about recall, which is what your clustering is basically doing (what your clustering is saying is: look, I just need to be in the right cluster, the right portion of the space, and that's pretty much all I care about),

    [01:44:29] Aditya Kusupati: that's just recall. And if I'm able to do this clustering with 64 dimensions instead of 1024, I can save a lot of compute when I'm searching the space. And this is the idea: at every single level of this tree, I'm going to change the dimensionality I use, let's say 64, then 128. And then finally, when my query reaches the leaf node, I'm going to precisely re-rank all of those 100 or so data points with the full 1024.

    [01:44:53] Aditya Kusupati: So there is going to be precise re-ranking at the end, but all the intermediate steps, because they're already approximate and only care about recall, can be done with a lower dimensional embedding. You can do this even without Matryoshka embeddings, but then you again need post hoc compression, which is not great.

    [01:45:12] Aditya Kusupati: Matryoshka representations just give you this for free. If you want 64 dimensions for the first phase of clustering, take the first 64; if you want 128 for the second phase of clustering, take the first 128. That's the reason it becomes seamless, and that's what AdANNS does.
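
    Here is a toy sketch of that per-level idea using scikit-learn's k-means and NumPy. It is my own illustration, not the AdANNS implementation, and a single level of clustering stands in for the whole tree:

    ```python
    # A toy illustration of adaptive search: cluster and route with the first
    # 64 Matryoshka dims, then rescore survivors with the full vector.
    # Not the AdANNS implementation, just the idea, with made-up sizes.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    docs = rng.standard_normal((50_000, 1024)).astype(np.float32)
    query = rng.standard_normal(1024).astype(np.float32)

    # Build coarse clusters using only the cheap 64-dim prefix.
    kmeans = KMeans(n_clusters=256, n_init=4, random_state=0).fit(docs[:, :64])
    assignments = kmeans.labels_

    # Route the query to its nearest few clusters, again with 64 dims.
    centroid_dists = np.linalg.norm(kmeans.cluster_centers_ - query[:64], axis=1)
    probe = np.argsort(centroid_dists)[:8]               # visit 8 of 256 cells
    candidates = np.where(np.isin(assignments, probe))[0]

    # Final exact rescoring of the survivors with the full 1024 dims.
    scores = docs[candidates] @ query
    top10 = candidates[np.argsort(-scores)[:10]]
    print(top10)
    ```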

    [01:45:27] Alex Volkov: Awesome. And I want to take this to a practical level a little bit. As far as I saw, Sentence Transformers from Hugging Face supports this natively right now, right? You can import it and encode embeddings with different models. What other tools support this, now that you guys started getting a lot of interest after the LLM explosion, with everybody doing RAG and understanding that RAG is one way to get these models to behave the way they want?

    [01:45:51] Alex Volkov: What other tools? You mentioned FAISS. What else now supports something like this? Because on the face of it, it sounds very helpful and very performant. In my head, this sounds similar (not necessarily directly, but similar) to how quantization came along and reduced the precision of models,

    [01:46:08] Alex Volkov: and they basically respond with the same quality, but they're significantly smaller. So where else can folks find use for Matryoshka, from what you guys have heard?

    [01:46:20] Aditya Kusupati: Yeah, two clarifications. FAISS does not use Matryoshka right now; AdANNS was built on top of FAISS, so that's a caveat, they don't use Matryoshka at this point. The second thing you asked about is quantization, right? That's a very good point. Quantization is a complementary thing.

    [01:46:36] Aditya Kusupali: So think of quantization as flexibility in your bit precision, while Matryoshka is flexibility in your dimensionality. So both of them can work hand in hand even after this. So you can quantize any Matryoshka embedding, and it will still play well with quantization. So that's the beauty of this, right?

    [01:46:54] Aditya Kusupati: Until now, we were only reducing the precision of the numbers, and now you can also reduce the vector itself. So that's very good. Coming to the repositories and other tools which support this: of course, Sentence Transformers is, I think, going to be the easiest way in. I went through the implementation the day before yesterday.

    [01:47:14] Aditya Kusupati: It's pretty clean, it just works out of the box. Nomic released their v1.5; if anyone wants to go and look inside it, again, it's 10 lines of code, beautifully written, and I think it's much more understandable in case someone wants to get into the weeds. And we have our own repository, which we released a couple of years ago.

    [01:47:33] Aditya Kusupati: But the nice thing about Matryoshka is that if you want to train something, it's literally a for loop, it's four lines of code. The code is already in the paper; if someone wants to implement it, just look at the paper, there's code on, I think, page 12 or something, five lines, and you can just go and implement it.
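
    If Sentence Transformers is the easiest way in, the training-side hook looks roughly like this; it is a sketch based on the library's Matryoshka loss wrapper, and the base model and dimension list are just example choices, so check the current docs for exact names and arguments:

    ```python
    # A rough sketch of Matryoshka training via Sentence Transformers' loss
    # wrapper. Model choice and dimension list are illustrative; consult the
    # library docs for the exact, current API before relying on this.
    from sentence_transformers import SentenceTransformer, losses

    model = SentenceTransformer("microsoft/mpnet-base")  # example base model

    base_loss = losses.MultipleNegativesRankingLoss(model)
    train_loss = losses.MatryoshkaLoss(
        model,
        base_loss,
        matryoshka_dims=[768, 512, 256, 128, 64],  # nested prefixes to supervise
    )
    # train_loss can now be passed to the usual training loop like any other loss.
    ```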

    [01:47:48] Aditya Kusupati: Apart from that, I think Transformers.js supports a bunch of these re-ranking visualizations on Hugging Face. But yeah, for now, I think these are the things we know of which support it. AdANNS, I don't think anyone supports at this moment; it's just our code base, which is out there.

    [01:48:05] Aditya Kusupali: It's also not highly optimized for low level things, so I wouldn't recommend you directly use it for your use cases, but it's a great thing for you to prototype and see how well you could benefit from this flexibility in retrieval space.

    [01:48:18] Alex Volkov: So I just want to make sure we shout out properly: Nomic AI, the folks who have the Atlas platform to visualize embeddings (and they downsample, I think you said, they lower the dimensionality into 2D space to actually show them). They released Nomic Embed 1.5 recently, a fully open source embedding model, end to end, and they're great, and now they're also supporting Matryoshka, which is great.

    [01:48:41] Alex Volkov: I also heard you say that quantization directly applies here as well. So you can, I don't know the verbiage for this, Matryoshka something and quantize it, and the end result is significantly smaller, with smaller weights, so that's great.
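
    As a back-of-the-envelope sketch of combining the two (plain NumPy, with a naive symmetric scale rather than a production quantization scheme): truncating a 1024-dimensional float32 vector to 64 dimensions and storing it as int8 shrinks it by roughly 64x:

    ```python
    # A back-of-the-envelope sketch: truncate a Matryoshka embedding, then
    # quantize it to int8. The symmetric scale here is a naive illustration.
    import numpy as np

    full = np.random.default_rng(0).standard_normal(1024).astype(np.float32)

    truncated = full[:64]
    truncated = truncated / np.linalg.norm(truncated)      # renormalize the prefix

    scale = np.abs(truncated).max() / 127.0                 # naive symmetric scale
    quantized = np.round(truncated / scale).astype(np.int8)

    print(full.nbytes, quantized.nbytes)                    # 4096 bytes vs 64 bytes
    ```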

    [01:48:54] Alex Volkov: You also mentioned Transformers.js, which is a Hugging Face library whose author, Xenova, is here in the audience with us, a friend of the pod, and it supports this as well. Folks, we're slowly running out of time. I wanted to thank you for coming up. It often happens that folks who built something come up and talk to us;

    [01:49:13] Alex Volkov: it doesn't often happen with something that was released a few years ago and is now resurfacing in popularity, where we get to highlight that work. So Aditya and Prateek, I really want to thank you. Anything else you want to mention before I recap the whole space? Feel free.

    [01:49:28] Alex Volkov: This was definitely not a full deep dive, but I really wanted to highlight the fact that your work is now represented in one of the big AI libraries in the world, and many folks can now understand what that parameter does when they adjust dimensionality in OpenAI's embedding models.

    [01:49:44] Aditya Kusupati: I think Nomic... The reason I say this is a straightforward implementation is that Nomic released their v1, and then Matryoshka became a thing, so they literally trained this entire thing in three days with all of their data. So it's extremely simple, and they actually didn't have to change a single hyperparameter, so it's pretty good.

    [01:50:02] Aditya Kusupati: I'd like to see if Prateek wants to add anything, but otherwise, thank you for having me here.

    [01:50:07] Alex Volkov: Thank you, Aditya.

    [01:50:07] Prateek Jain: No, it's pretty accurate. Thanks for having

    [01:50:10] Aditya Kusupati: us here. Yeah, and for anybody else in the audience, I've posted the links for what you can do with this, so there's Xenova's demo where you can use it in Transformers.js.

    [01:50:21] Aditya Kusupali: And also we look forward to actually implementing the paper too, because again, this is not a very well known or well discussed subject in general.

    [01:50:31] Alex Volkov: So I'm very happy to have been able to host you guys. You have a paper out, I think it was at NeurIPS, and I look forward to seeing more from this space of embeddings, because there's more to come here. Many people are now using this in big production systems; it was used in recommender systems before, but now in big LLM-related production stuff. The more folks understand retrieval, fine tuning retrieval, and ways to cut costs, like Matryoshka, the better. So shout out to you guys, thanks for working on this and for coming to shed some light on it. I'm very happy that you did get the mention in the OpenAI blog post, and I'm happy that I noticed, because that's how I was able to talk to you guys and figure out what Matryoshka embeddings are.

    [01:51:11] Alex Volkov: And if folks want deeper dives (this was very surface level), you guys did a Paper Club with Latent Space yesterday, and before that you both talked about Matryoshka embeddings on the Weaviate podcast. Connor was here before; you guys just missed him. And also, Nisten put this link up:

    [01:51:28] Alex Volkov: Hugging Face has a very nice deep dive from Omar and Xenova about Matryoshka embeddings, what they mean, and how to use them in Sentence Transformers. All right, folks, this has been our ThursdAI for today. I will now take a deep breath and recap everything we've talked about. If you've been here for the past two hours and some, you've probably heard all of this, but if not, feel free to stick around; it's probably going to take me about eight minutes, and then we'll let you go.

    [01:51:53] Alex Volkov: With that, this is our Thursday AI for February 29th. Leap year, February 29th, like once in four years, I find it pretty funny. And I think it was a great space,

    [01:52:01] Alex Volkov: we didn't have any... Nisten, no breaking news today, right? I wasn't monitoring closely, but GPT-5 didn't release while I was talking, right?

    [01:52:11] Nisten Tahiraj: Nope not yet.

    [01:52:12] Alex Volkov: Not yet.

    [01:52:13] Alex Volkov: We did get one piece of breaking news that we didn't notice as we were recording the live stream, and that was from our friends at Modular. If you remember, we've talked about Modular and their new programming language Mojo, which is a superset of Python, and its creator Chris Lattner, who was previously the LLVM and MLIR compiler author and also the creator of Swift

    [01:52:42] Alex Volkov: at Apple. We've talked about Mojo being the right language for AI, and they just released their inference engine, called MAX, to the world in beta. This inference engine supposedly has Mojo built in and supposedly is way faster even for existing models to run inference. So that's very interesting, and we're going to talk about it more as we play around with this.

    [01:53:07] Alex Volkov: Alright, folks, and I think this was all we talked about on ThursdAI on February 29th. And I want to just thank everybody who joined. Nisten, thank you, as always, co-host. Jan was here before, and we had Beau join for a while, even though we didn't say hi.

    [01:53:22] Alex Volkov: We had a bunch of other folks, so thank you to all the guests. Thank you, all of you, for listening and tuning in from week to week. It's really a pleasure. And with this, I'm just going to end here. Thank you, everybody. We'll see you next week. Cheers.



    This is a public episode. If you’d like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
  • Hey, this is Alex,

    Ok let's start with the big news, holy crap this week was a breakthrough week for speed!

    We had both Groq explode in popularity, and ByteDance release an updated SDXL model called Lightning, able to generate full blown SDXL 1024 images in 300ms.

    I've been excited about seeing what real time LLM/Diffusion can bring, and with both of these released in the same week, I just had to go and test them out together:

    Additionally, we had Google step into a big open weights role and give us Gemma: two open weights models, 2B and 7B (which is closer to 9B, per Junyang), and it was great to see Google committing to releasing at least some models in the open.

    We also had breaking news: Emad from Stability announced SD3, which looks really great, Google will reportedly pay Reddit $200M for AI training on their data, & a few more things.

    TL;DR of all topics covered:

    * Big CO LLMs + APIs

    * Groq custom LPU inference does 400T/s Llama/Mistral generation (X, Demo)

    * Google image generation is in Hot Waters and was reportedly paused (refuses to generate white people)

    * Gemini 1.5 long context is very impressive to folks (Matt Shumer, Ethan Mollick)

    * Open Weights LLMs

    * Google releases Gemma, open weights 2B and 7B models (Announcement, Models)

    * Teknium releases Nous Hermes DPO (Announcement, HF)

    * Vision & Video

    * YoLo V9 - SOTA real time object detector is out (Announcement, Code)

    * This weeks Buzz (What I learned in WandB this week)

    * Went to SF to cohost an event with A16Z, Nous, Mistral (Thread, My Report)

    * AI Art & Diffusion & 3D

    * ByteDance presents SDXL-Lightning (Try here, Model)

    * Stability announces Stable Diffusion 3 (Announcement)

    * Tools

    * Replit releases a new experimental Figma plugin for UI → Code (Announcement)

    * Arc browser adds "AI pinch to understand" summarization (Announcement)

    Big CO LLMs + APIs

    Groq's new LPU shows extreme performance for LLMs - up to 400T/s (example)

    * Groq created a novel processing unit known as the Tensor Streaming Processor (TSP), which they categorize as a Language Processing Unit (LPU). Unlike traditional GPUs, which are parallel processors with hundreds of cores designed for graphics rendering, LPUs are architected to deliver deterministic performance for AI computations.

    * Analogy: They know where all the cars are going when everyone wakes up for work (when they compile) and how fast they all drive (compute latency) so they can get rid of traffic lights (routers) and turn lanes (backpressure) by telling everyone when to leave the house.

    * Why would we need something like this? Some folks are saying that the average human reading speed is only ~30 T/s. I created an example that uses near-instant Groq Mixtral + Lightning SDXL to just create images, with Mixtral as my prompt manager (a rough sketch of that pipeline is below).
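    To make the pipeline concrete, here is a rough sketch of how that kind of demo fits together. Treat it as an assumption-laden illustration rather than my actual demo code: it assumes Groq's OpenAI-compatible endpoint at api.groq.com/openai/v1 and a `mixtral-8x7b-32768` model id, and `generate_image` is a hypothetical stand-in for whichever SDXL-Lightning host you use (fal.ai, Replicate, or local diffusers).

    ```python
    # Sketch: Mixtral on Groq as a prompt manager, feeding an SDXL-Lightning image step.
    # Assumptions: Groq exposes an OpenAI-compatible API at the base_url below and
    # serves a model id like "mixtral-8x7b-32768"; generate_image() is hypothetical.
    import os
    from openai import OpenAI

    groq = OpenAI(
        api_key=os.environ["GROQ_API_KEY"],
        base_url="https://api.groq.com/openai/v1",  # assumed OpenAI-compatible endpoint
    )

    def expand_prompt(idea: str) -> str:
        """Ask Mixtral (near-instant on Groq) to turn a rough idea into a detailed SDXL prompt."""
        resp = groq.chat.completions.create(
            model="mixtral-8x7b-32768",  # assumed model id
            messages=[
                {"role": "system", "content": "Rewrite the user's idea as one detailed SDXL image prompt."},
                {"role": "user", "content": idea},
            ],
            temperature=0.7,
        )
        return resp.choices[0].message.content

    def generate_image(prompt: str) -> bytes:
        """Hypothetical helper: call your SDXL-Lightning host of choice here (~300ms/image)."""
        raise NotImplementedError

    if __name__ == "__main__":
        prompt = expand_prompt("a cozy cabin in a snowstorm, cinematic lighting")
        print(prompt)
        # image = generate_image(prompt)
    ```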

    Open Weights LLMs

    Google Gemma - 2B and 7B open weights models (demo)

    * 4 hours after release, llama.cpp added support, Ollama and LM Studio added support, and Tri Dao added Flash Attention support

    * Vocab size is 256K

    * 8K context window

    * Tokenizer similar to Llama

    * Folks are... not that impressed as far as I've seen

    * Trained on 6 trillion tokens

    * Google also released Gemma.cpp (local CPU inference) - Announcement (see the quick local inference sketch below)
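    If you want to poke at Gemma yourself, here's a minimal local-inference sketch using Hugging Face transformers. It assumes the `google/gemma-7b-it` repo id and that you've accepted the Gemma license on the Hub and logged in; adjust for the 2B checkpoint, or use a GGUF build through llama.cpp or Ollama instead.

    ```python
    # Minimal sketch: run Gemma 7B-IT with transformers (assumed repo id, license accepted on the Hub).
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "google/gemma-7b-it"  # assumption: instruction-tuned checkpoint id
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,   # bf16 keeps the 7B (really closer to 9B) model on one large GPU
        device_map="auto",
    )

    inputs = tokenizer("Summarize this week's AI news in one sentence.", return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
    ```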

    Nous/Teknium re-release Nous Hermes with DPO finetune (Announcement)

    * The DPO finetune is performing better than the previous models

    * Models are GGUF and can be found here (a minimal llama-cpp-python sketch follows after this list)

    * DPO enables improvements across the board
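    Since the release is GGUF-first, here's a minimal llama-cpp-python sketch for running it locally. The filename is a placeholder for whichever quantization you actually download from the Hugging Face repo, and the chat call assumes the model's chat template ships in the GGUF metadata.

    ```python
    # Minimal sketch: local inference over a GGUF quantization with llama-cpp-python.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./nous-hermes-2-dpo.Q4_K_M.gguf",  # placeholder filename
        n_ctx=4096,       # context window to allocate
        n_gpu_layers=-1,  # offload every layer to the GPU if one is available
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "In one line, what does DPO change versus plain SFT?"}],
        max_tokens=128,
    )
    print(out["choices"][0]["message"]["content"])
    ```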

    This weeks Buzz (What I learned with WandB this week)

    * Alex was in SF last week

    * A16Z + 20-something cohosts, including Weights & Biases, talked about the importance of open source

    * Huge shoutout to Rajko and Marco from A16Z, and tons of open source folks who joined

    * Nous, Ollama, LlamaIndex, LMSys folks, Replicate, Perplexity, Mistral, GitHub, as well as Eric Hartford, Jon Durbin, Haotian Liu, HuggingFace, and tons of other great folks from Mozilla, the Linux Foundation, and Percy from Together/Stanford

    Also had a chance to check out one of the smol dinners in SF; they go really hard. Had a great time showing folks the Vision Pro, chatting about AI, seeing incredible demos, and chatting about meditation and spirituality all at the same time!

    AI Art & Diffusion

    ByteDance presents SDXL-Lightning (Try here)

    * Lightning fast SDXL with 2, 4 or 8 steps

    * Results are much closer to the original SDXL than the Turbo version from a few months ago (a short diffusers sketch follows below)
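    Here's a short diffusers sketch of the 4-step LoRA variant, loosely following the model card; the repo and checkpoint names are my assumptions, so double-check them against ByteDance/SDXL-Lightning before running. The key details are the "trailing" timestep spacing and guidance_scale=0 that the Lightning checkpoints expect.

    ```python
    # Sketch: SDXL-Lightning 4-step LoRA on top of SDXL base (checkpoint names assumed).
    import torch
    from diffusers import StableDiffusionXLPipeline, EulerDiscreteScheduler
    from huggingface_hub import hf_hub_download

    base = "stabilityai/stable-diffusion-xl-base-1.0"
    repo = "ByteDance/SDXL-Lightning"
    ckpt = "sdxl_lightning_4step_lora.safetensors"   # assumed 4-step LoRA filename

    pipe = StableDiffusionXLPipeline.from_pretrained(
        base, torch_dtype=torch.float16, variant="fp16"
    ).to("cuda")
    pipe.load_lora_weights(hf_hub_download(repo, ckpt))
    pipe.fuse_lora()

    # Lightning checkpoints are distilled for very few steps with no classifier-free guidance.
    pipe.scheduler = EulerDiscreteScheduler.from_config(
        pipe.scheduler.config, timestep_spacing="trailing"
    )

    image = pipe(
        "an astronaut riding a pig wearing a tutu, photorealistic",
        num_inference_steps=4,
        guidance_scale=0,
    ).images[0]
    image.save("lightning.png")
    ```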

    Stability announces Stable Diffusion 3 (waitlist)

    Uses a Diffusion Transformer architecture (like Sora)

    Impressive multi-subject prompt following. Prompt: "a painting of an astronaut riding a pig wearing a tutu holding a pink umbrella, on the ground next to the pig is a robin bird wearing a top hat, in the corner are the words 'stable diffusion'"

    Tools

    * Replit announces a new Figma design → code plugin

    That’s it for today, definitely check out the full conversation with Mark Heaps from Groq on the pod, and see you next week! 🫡

    ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

    Full Transcript:

    [00:00:00] Alex Volkov: Hey, this is Alex. This week on ThursdAI, we had an hour-long conversation with Groq, a new and very exciting AI inference chip that exploded in popularity all over social media after showing a 5x, yes, 5x improvement in AI inference. 500 tokens per second for Llama 70B and Mistral.

    [00:00:32] Alex Volkov: We also talked about Google's new open weights Gemma model, and Google's image generation issues, which led them to take down this image generator's ability to generate people. We covered the new, incredibly fast SDXL Lightning, and we had breaking news for Stable Diffusion 3, which is a diffusion transformer that's coming out of Stability AI.

    [00:01:03] Alex Volkov: And a bunch of other news. All that after this short intro into Weights & Biases.

    [00:01:10] AI teams are all asking the same question. How can we better manage our model development workflow? The path to production is increasingly complex, and it can get chaotic keeping track of thousands of experiments and models. Messy spreadsheets and ad hoc notebooks aren't going to cut it. The best AI teams need a better solution.

    [00:01:33] And better tools. They need Weights & Biases, the AI developer platform, to unlock their productivity and achieve production ML at scale. Replace messy spreadsheets with an automated system of record for experiments.

    [00:01:52] Communicate about model evaluation and collaboratively review results across the team. Clean up disorganized buckets of models with a unified registry. Automatically capture full model lineage, all the data and code used for training and testing. Seamlessly connect to compute to scale up training. And run large scale sweeps efficiently to optimize models.

    [00:02:20] Analyze the performance of large language models, and monitor LLM usage and costs with live, customizable dashboards. Get your team on the same page to bridge the gaps from ideation to production. Use Weights & Biases to build, manage, and deploy better models, faster.

    [00:02:41] Alex Volkov: Wasn't this cool? This is Kari. She is an original PM on the Weights & Biases team. She's been there for a long time, and recently we used her voice to narrate this new video that we have up on the website. And I figured I'd put it in here because it works even without the video. And I thought it was super cool.

    [00:03:01] Alex Volkov: And people ask me, what does Weights & Biases do? And hopefully this answers some of those questions. Now I want to switch gears and say, basically, that the format for this week is a little different. We had the folks from Groq and Matt Shumer at the beginning of the pod, and then we kept talking about everything else, like Gemma and Gemini and everything else.

    [00:03:24] Alex Volkov: So the first hour of this is going to be an interview with the Groq folks, specifically with Mark Heaps, and the next hour afterwards is going to be the deep dive into topics. If you're listening to this on Apple Podcasts, for example, you should be able to just view chapters and skip to a chapter that you'd prefer.

    [00:03:51] Alex Volkov: I want to just do a quick recap of ThursdAI for February 22nd, everything we've talked about today. We started the space with two guests, Matt Shumer and Mark Heaps, and that's Groq with a Q at the end, not Grok with a K at the end, so not like X.ai's Grok. Groq has exploded on our timelines recently with just incredible viral videos of them performing LLM inference on Llama 2 70B and Mixtral at around 400 or 500 tokens a second, which is

    [00:04:34] Alex Volkov: five times as much as the previous super fast API inference that we've seen from Perplexity and from Together. And they're serving Llama 2 70B at 500 tokens a second. And so we had Mark from Groq talk to us for almost an hour about how this is even possible. So we had a very nice deep dive with Mark, and definitely, if you missed this, please check it out in the recorded portion as well.

    [00:04:58] Alex Volkov: And then we also had Matt, who works at HyperWrite, and he's been playing with these tools. He told us about the demos that he was able to build and how much of a difference this speed of inference makes. We've talked about their custom chip called the LPU, and we've talked about the fact that the company has been around for a while, and they did not expect this explosion in virality, but they're very happy that they chose this direction correctly.

    [00:05:21] Alex Volkov: A very great interview, a great conversation, and I invite you to listen to it as well. We covered that Google's image generation is now in hot water and was reportedly paused, because it's injecting stuff into prompts that's, let's say, not that great. Many people noticed that historical figures are being generated in different races, and different multicultural adjustments are happening to your prompts, which is not great.

    [00:05:46] Alex Volkov: This blew up on Twitter, and even outside of Twitter; I think folks started writing about this in actual media, enough so that Google took down the image generation of people while trying to figure out what to do with this. But we also gave props to Google for releasing Gemma. Gemma is an open weights 2 billion and 7 billion parameter model. We've talked about Gemma, gave Google props for releasing open weights for us, and we had folks here on stage telling us how it's still yet to be decided how good the base model actually is, and how fine-tunable it is. We're waiting for the open source community to come together and fine tune the open weights Gemma from Google, and then we also covered the Gemini 1.

    [00:06:29] Alex Volkov: 5 long context. Again, they released the 1 million token context window support, and many folks got access to this, and we saw, for the past week, people playing and doing all kinds of stuff, including Matt Shumer, who I just mentioned. He also got access, so he gets all the cool toys, and he was able to put three Harry Potter books in one prompt and ask the model, with perfect recall, who said what. This could have been part of whatever existing knowledge, but he was doing this more for a demo. We also saw demos of people putting an hour-long video in the prompt, which is around five or six hundred thousand tokens, which sounds ridiculous that it supports it, and the model was able to understand this whole video

    [00:07:13] Alex Volkov: and tell you which scene happened when, with almost near-perfect precision. And we've talked about how this changes the game for multiple things, and we're gonna keep updating you about these long contexts. We also brought this to Groq and said, hey, are you gonna support long contexts with your insanely fast speed of inference? We also covered that Nous Research released a new Hermes DPO fine-tune, which is better on every possible benchmark on top of their already strong flagship models, which is great.

    [00:07:37] Alex Volkov: And I covered that I went to San Francisco to host an event with A16Z and Nous Research and Mistral and Ollama and a bunch of other folks, and it was a great event. A shoutout to the A16Z folks for hosting this and inviting me there as well. And then the last thing we covered is the AI art and diffusion stuff, where ByteDance released SDXL Lightning, which generates 1024px, super high quality images in just two or four steps, and they look incredible and are super fast to generate as well.

    [00:08:08] Alex Volkov: I've talked about the demo that I built with them, and I've talked about this example that Fal.ai has, where you can go to fastsdxl.ai and just type, and as you type, the image generates on the fly with around 300 milliseconds of inference time, which feels real time and feels quite incredible. And following that, we have breaking news today from Stability announcing Stable Diffusion 3,

    [00:08:30] Alex Volkov: which is a diffusion transformer, which we've covered before, a diffusion transformer based image generation model from Stability. They announced a waitlist that you can go and sign up for right now. And it looks like it's significantly better at following very complex prompts, like multiple objects and colors and everything in one prompt.

    [00:08:47] Alex Volkov: This is everything we've talked about on ThursdAI

    [00:08:49] Introduction and Welcoming Remarks

    [00:08:49] Alex Volkov: Yes.

    [00:08:55] Alex Volkov: All right, folks, you know the sound. Those of you who come back week after week, you know the sound. This is ThursdAI. My name is Alex Volkov. I'm an AI evangelist with Weights & Biases. And I'm joined here on stage, from week to week, by experts, friends of the pod, and new folks who actually were in charge of the news that we're going to talk about today. And today is February 22nd, only February 22nd, and already so much has happened this year with AI. Last week was crazy; this week was less crazy than last week, but still, so much to talk about.

    [00:09:36] Introducing the Co-Host and Guests

    [00:09:36] Alex Volkov: And I'm delighted to have my co host Nisten here. Hey Nisten, what's up?

    [00:09:43] Nisten Tahiraj: Hey everybody,

    [00:09:44] Alex Volkov: How's your week?

    [00:09:45] Nisten Tahiraj: It's been the usual, just up until 2 or 3 a.m. on random Twitter spaces, finding stuff, because sometimes

    [00:09:57] Nisten Tahiraj: it's pretty exciting.

    [00:09:58] Alex Volkov: Yep, stuff gets pretty exciting from week to week. I also want to say hi to Matt Shumer, joining us for a brief period. Matt, you've been all over my feed this week. How are you doing, buddy? You've been here before, but some folks may not remember you, so please introduce yourself briefly, and then we'll chat.

    [00:10:16] Matt Shumer: Hey man, thanks for having me.

    [00:10:17] Introduction of Otherside AI

    [00:10:17] Matt Shumer: Yeah, so I'm co-founder and CEO of Otherside AI. We are the creators of HyperWrite, which is one of the largest AI writing platforms. And we've also been exploring the agent space for a couple of years now, about a year publicly, creating AIs that can actually operate your computer.

    [00:10:31] Matt Shumer: As I mentioned, unfortunately, I do only have 10 minutes. I will potentially be able to join back up after, so I'm really sorry about that. It's been a crazy day, but I'm excited to chat in the time that I have.

    [00:10:42] Alex Volkov: Alright, awesome. Thanks for joining. And then I think we'll just jump into the conversation. And I want to say hi to our guest, a new guest.

    [00:10:50] Introduction of Mark from Groq

    [00:10:50] Alex Volkov: I haven't talked with Mark before. Mark, feel free to unmute and let us know some of your background and where you're joining from, and then we're going to talk about the stuff that we're here to talk about.

    [00:11:05] Mark Heaps: Yeah, how's it going guys? Thanks for letting me join the space today, and glad to see some familiar names from all the craziness this week. Yeah, I'm the Chief Evangelist and Head of Design, Brand, and Creative over at Groq, which is probably a little bit not a normative title for folks that are so deep in the AI developer space, but we actually do a lot of the technical side too, so glad to be here.

    [00:11:28] Alex Volkov: Awesome. And so for folks who are listening, that's Groq with a Q at the end, so not X's Grok. And you guys have been around for a little longer than them. But just in case folks get confused, there are a few confusion points here, and I think this is a good start for our conversation today. And I want to turn this to Matt, because Matt, you're the first person who I saw post about Groq, I think, this week, and some of your stuff got a bunch of attention. So give us a brief overview: what you saw that made you post, and then we're going to talk about this insane speed, and then maybe turn to Mark on how it actually is done.

    [00:12:02] Alex Volkov: So what is, where's Groq? Like, how'd you get to it? And how viral did you actually get?

    [00:12:08] Matt Shumer: Yeah, it's a funny story. I actually found Groq, I'd say, more than a month ago, and immediately I was blown away. I think my co-founder posted the text I sent to him; I was like, you have to f*****g try this thing right now, it's incredible. And he did, and he was blown away too. I actually went and posted about it back then, but it got no traction, I think I deleted it or something, and I was just letting it marinate in my mind what was possible here. But I wasn't sure if this could scale. Obviously this week proved that wrong, clearly it can, but I was still just thinking about it. And then I was just on the train, my girlfriend and I were just sitting there on Sunday, and she fell asleep, so I was like, what am I going to do right now? And for some reason, I thought of Groq, and I was like, okay, let's just post about it again, see what happens. And for some reason, this time, by the time I got off the train, it was going crazy viral.

    [00:12:55] Matt Shumer: I, Sunday night was fun, I was up pretty much all night just managing the blowback from this. Finally fell asleep by the morning, I woke up to a timeline filled with tweets about Groq and for good reason, right? This thing is incredible, and it's going to change how we think about how we work with LLMs, what they're capable of, the ability to do tons of reasoning, right?

    [00:13:16] Matt Shumer: All of that is now going to change, and a lot more is now possible. The one thing I wasn't sure about was, would this thing go down, right? With all this usage, would this thing go down? And, it hasn't, right? There was a brief time where there was a little bit of delay, but, more or less, it pretty much stayed up the entire time, which is crazy, through all of this, and they weren't prepared for that, which was incredibly impressive, and I think it's a testament to how good the hardware is.

    [00:13:41] Matt Shumer: It's just exciting to see. I actually spoke with Jonathan, the CEO of Groq yesterday, and he said that something like 300 developer API requests were submitted prior to the tweet. Now they're getting like 3, 000 a day or something, which is insane. Using that as a proxy for how many people must be trying the actual API, and then combine that with the demos I built that are getting thousands of hits every day, their servers are still clearly standing, which is,

    [00:14:06] Exploring the Impact of Groq's Speed

    [00:14:06] Alex Volkov: So what was impressive to you? I think we're dancing around the issue, but for folks who didn't see your viral tweets, what was the head explosion moment?

    [00:14:14] Matt Shumer: You have Together AI, you have Hugging Face Inference, you have vLLM, all this stuff, right? You're getting, on, let's say, Mixtral, if you're doing incredibly well, 100 tokens per second or something, right? Most people aren't reaching that, and that number may be off by a little bit, but at a high level, you're getting around there with any pretty standard model today, if you're doing well.

    [00:14:34] Matt Shumer: Now, going above 200? Unheard of. 500? Ridiculous, right? And that's where Groq sits, right? They've essentially developed a chip that enables these language models to be far faster. And when you see 500 tokens per second versus, let's say, 50 or 100, it is not just a small difference, right?

    [00:14:52] Matt Shumer: This is like a step change in what is possible with what you can build with them. And that's what turned my head, right? It's not just faster inference, it's a whole other paradigm that you could build on top of right now, right? When you have inference that's fast, you can then do, 5 to 10x the reasoning in the same amount of time.

    [00:15:10] Matt Shumer: How much better does the answer get with the same LLM if you do that? You could do interfaces that are created for you in real time. You don't have to wait. For example, right now on the HyperWrite platform, it's probably one of the best sort of conversational platforms with web search built in, but you still have to wait for it to go and execute the web search, come back, write the response, think through what it needs to do.

    [00:15:28] Matt Shumer: What happens if that's instant? That changes everything. That's what got me super interested. Here's what others think about it though.

    [00:15:35] Alex Volkov: Yeah, I wanna chime in here. Thank you, Matt. I saw your tweet and immediately went, what? And also, we're going to talk about long context, and maybe after you're gone, maybe you'll come back as well. But a day before I saw your tweet, I posted something where folks were complaining about kind of the long context with Gemini 1.

    [00:15:53] Alex Volkov: 5 Pro with the million. They were saying, oh, it's going to take too long, it's going to cost too much, et cetera. And I posted something like, that's not going to be the truth forever. These things are coming down faster than people realize, and I think those things together, just one after another, showed me how fast we're moving, how incredible this is. And we're gonna talk about long context here in a second as well, but immediately, a day after I saw your tweet, I was like, oh, there's an example.

    [00:16:18] Alex Volkov: This is exactly what we're talking about. Just I didn't expect it to take a day. So I want to turn the conversation to Mark real quick.

    [00:16:24] Discussion on Groq's Custom Chip

    [00:16:24] Alex Volkov: Mark, you work at Groq. How long have you been there? Tell us about this custom chip you guys have. What's going on? How are you achieving this insanity? 500 tokens a second for Llama 70B, which is quite big.

    [00:16:38] Mark Heaps: Yeah. Happy to. And Jonathan actually called me and told me that he spoke to Matt yesterday, and I said, I think we owe Matt a very nice steak dinner and maybe a little bit more than that. I also didn't sleep at all that night because there were so many requests coming in, and Matt's right, we weren't really ready for it.

    [00:16:55] Mark Heaps: We were literally just discussing. The day before what are some other demos we can do? What are some things we can show people with the speed? And then all of a sudden, Matt did a post and then a number of other people that follow him started doing posts. And, next thing I know, people are making their own video demos and it blew us all away.

    [00:17:11] Mark Heaps: We're like, wow, this is amazing. I owe a big thanks to the community that has jumped on this. This is the magical moment. I think anybody that's worked in tech has seen this before. I've been working in tech for about 30 years. And there's this rubber band effect where one end pulls forward and then you have the whiplash from the other side.

    [00:17:30] Mark Heaps: And software developers have been doing an amazing job in AI for the last couple of years trying to find more efficiencies, eke out better inference, trying to get, anywhere they can, that optimization. But classically, what happens is you push that to a point where you start seeing a ceiling.

    [00:17:46] Mark Heaps: And then hardware comes along and says, Oh, you're driving the car at max speed? Let me just give you a new engine. Let me give you, something that speeds that up. And we've seen people saying that they have an inference engine. But ultimately they're really just these brokers of other forms of cloud compute.

    [00:18:01] Mark Heaps: And then again, eking more capability out of it through software. And Groq was completely different. I've been there now for about four years. And I remember when I originally met the CEO, Jonathan, I said, why does anybody need to do this? And he told us the story about, him creating the TPU over at Google.

    [00:18:19] Exploring the Future of AI with Groq

    [00:18:19] Mark Heaps: And it was a pretty interesting moment. Jeff Dean had told the team at Google, Hey we've got really good news. We figured out how to get AI working and get, these certain services working like image and speech, etc. But the problem is it's going to cost a fortune to expand our data centers to be able to handle this capacity.

    [00:18:36] Mark Heaps: And then they realized they needed to invent a new chip to do that. We're seeing that repeat itself right now. where there was this low latency ceiling for everybody in regards to incumbent or legacy solutions. And he knew from day one that, everybody was out there training models for years.

    [00:18:53] Mark Heaps: And he said, one day, this is all going to turn around and everybody's going to want the world's fastest inferential latency. And he didn't know exactly where that was going to be a product fit, but he did know that was going to be the problem statement. So that's what they, that's what they started with.

    [00:19:06] Mark Heaps: And it's a radically different architecture, totally different methodology and approach. It's been a really fun journey learning about that architecture. Yeah.

    [00:19:14] Alex Volkov: And the public demo that you have, it's very easy for folks to just go and test this out on their own. To be honest, it's awesome that you have this, and I think it undersells the insanity of what this is. And I think when I hear about what Matt is building in the demos, and I had to play with this yesterday, I had to play with this myself to figure out what to do with it, because I saw many people react and say, hey, what's the point of 500 tokens per second when the reading speed of humans is, I don't know, 50 tokens per second, whatever. And I'm looking at this tweet and I'm just facepalming, I was like, what, you don't, do you not

    [00:19:51] Mark Heaps: Oh, thank you.

    [00:19:52] Alex Volkov: Do you not get what's going on? So I had to go and build something. I'll tell the folks in the audience: I used two technologies, and we're gonna talk about the second one today. I used two kind of super fast advancements that we had this week, and the other one was SDXL Lightning, which I think ByteDance (TikTok's parent company) released.

    [00:20:09] Alex Volkov: And I decided to just combine both of them, and I have a video, I'm gonna post it in the show notes and on the stage right now. But folks, don't go looking at it right now, go look at it afterwards. And I basically figured, hey, if this is as lightning fast as it is, 400 tokens a second, 500 tokens a second, basically instant, I can use whatever: you have Llama 2 70B there, you have Mixtral, and hopefully we're going to talk about more models soon.

    [00:20:37] Alex Volkov: And I can use this SDXL Lightning to just immediately deliver the results to me. So I used Llama as my kind of prompt writer via Groq, and then I used this SDXL Lightning as my image generator, and I have a demo where everything appears in real time. And it's quite powerful. And to the person who said, hey, the reading speed of people is 50 tokens a second or whatever:

    [00:20:59] Alex Volkov: That person doesn't understand the impact of this. They will have an agent, for example, Matt was talking about agents and agentic stuff. The impact of this is just being able to build LLMs into every possible nook and cranny of software. I just wanted to highlight that, that I had to play with this to understand, really.

    [00:21:14] Alex Volkov: And, Mark, maybe let me ask you: what kind of inventive demos and stuff did you see coming up from folks, specifically around the fact that some of this stuff would not be very helpful with slower inference speed? Did you see any cool examples of your own? Did you guys, like, in your Slacks, send examples to each other?

    [00:21:32] Mark Heaps: Yeah, there's been a lot of chatter at Groq, and I think Matt's was the first one that kind of blew me away. He built a demo, and then I think his second demo was this one that wrote a novel, and it wrote it in like under a minute or something

    [00:21:45] Alex Volkov: you want to tell us about this before, before you drop off? Because while I got you here, I would love to hear. Yes,

    [00:21:56] Matt Shumer: because I wanted to go to sleep. And I knew I had to get it done, and I wouldn't have slept if I didn't. So that was this answers engine, very similar to Perplexity. The idea there was, Perplexity's got this incredible embeddings-based system, likely, and it's really fast and allows you to answer questions really quickly. So anybody going up against them, they can't exactly do that, because without that engine it's going to be way slower. But with an LLM that's as fast as Groq's hosting of it, you can essentially do it in the same exact time or even faster, while waiting for a pre-built search API to come back with results.

    [00:22:28] Matt Shumer: And it worked. So basically, obviously after time it got a little slower because a ton of people were using it, but at the beginning it was like a second to answer for a very complex question. You could have it write a long thing based on something. So basically a really good answers engine. That was the first one.

    [00:22:42] Matt Shumer: The second one was writing a novel in a minute or something. That came from a repo that I open sourced, I want to say, almost a year ago now. And that was called GPT Author. Originally the idea was to use GPT-4 to write a novel for you. The quality is obviously just okay, it was just an experiment to see where it went, but people really took to it, so I decided to rebuild it.

    [00:23:02] Matt Shumer: With GPT Author originally, with GPT-4, it would take like 20 minutes to write, let's say, five chapters. The crazy thing is, with Groq, I added like three more layers of reasoning for each chapter, and yet it still computed it in under a minute or two. So that was pretty crazy. And then the third demo I released, which kind of went more viral than the rest.

    [00:23:24] Matt Shumer: That was basically a code tool that refactors code and documents it. So basically, it's a very simple design. You paste in some code. We have one Mixtral prompt essentially suggest improvements. Based on those improvement suggestions and the original code, we have another Mixtral go and make those improvements.

    [00:23:45] Matt Shumer: We display the diffs, and then based on that, we have another Mixtral explain what happened and give the user an understanding of what happened, and then we have a fourth one go in and document it. And this all happens fast. If I were building this for production with today's models and today's systems, I would probably go and try to make some of it async so that it's faster for the user.

    [00:24:05] Matt Shumer: But with this, I built it sequentially because I didn't even have to go and do that. It all still computed in a second. By the time I was done reading the code changes for, or the suggestion that it was going to do in the first place, it was already done refactoring the code, already done documenting the code, which is crazy.

    [00:24:20] Matt Shumer: So that one did pretty well. Those are the three demos I made. Maybe I'll do some more in the coming days. Yeah.

    [00:24:24] Alex Volkov: That's incredible, dude, and I keep thinking about more use cases for this. Yesterday I used Cursor. Cursor is the editor, if you guys don't know, like an AI-native editor; it uses, I think, GPT-4 behind the scenes and embeds a bunch of stuff. And I hadn't been able to play with Cursor fully until yesterday's demo, and I played with it, and it has GPT-4.

    [00:24:42] Alex Volkov: And I think they have specific faster access to GPT-4 if you pay, and we do pay. And I was playing with this, and I was getting support from my editor on my code, and it was slow, and I was like, I want it immediate. I want it instant. And I think that's what Groq promises.

    [00:24:59] Alex Volkov: Mark, so let's talk about how you guys actually do this. You said something about the custom chip. Go into the secrets as much as you can, and also keep in mind that this is a high-level space here on Twitter. What's going on? How are you able to achieve this? NVIDIA's earnings just came out.

    [00:25:15] Alex Volkov: They did like whatever insane numbers for the past year. Everybody's looking at A100s, H200s, whatever. What are you doing over there with new hardware?

    [00:25:23] Mark Heaps: Yeah. The chip has actually been something we've been working on. The company was formed in 2016. And I think we, we taped out that chip, the first generation design, maybe two years after that. And it is totally different. And it's funny, people actually keep getting the category of the processor wrong online.

    [00:25:41] Mark Heaps: It's a language processing unit, but people keep calling it a linear processing unit. And a lot of the engineers at Groq think that's fun, because technically, it is. It is a linear, sequential processing unit, right? And some of the key differences, at a high level, right? So it's not multi-core like a GPU.

    [00:25:56] Mark Heaps: It is single core. It was actually the world's first single-core peta-op processor, which, four or five years ago, was a big deal. And it's still 14 nanometer silicon, which is a 10-year-old version of silicon die. Whereas we're being compared to people that have silicon that's four and five nanometer.

    [00:26:14] Mark Heaps: And we're completely fabbed in the U.S. It's a readily available supply. So we don't have the challenges other folks have trying to get GPUs. But the part that's really cool, this is the thing that I geek out on, right, is when you think about getting deeper into the development and stack.

    [00:26:30] Mark Heaps: And you're trying to set up GPUs as a system. And I'm talking, large data center scale systems. You've got all of these schedulers and things that you have to manage with the GPU and the data bouncing around in the way that it does being multi core and using all these schedulers it's really, what slows it down.

    [00:26:46] Mark Heaps: It's really what gives it a latency ceiling. And with the design of the Groq chip, and if anyone's seen a picture side by side it's beautifully elegant. It's it's works in a way that when you connect all of these chips together, you could put thousands of them together, actually, and it will see it as one brain.

    [00:27:06] Mark Heaps: So let's say that you realize for your workload you need 512 chips. You can tell it, hey, I need you to be one chip, and load your models that way. Or if you wanted to run some things in parallel, like we've done with an application we have called Groq Jams that writes music in independent tracks, linear and parallel to each other.

    [00:27:26] Mark Heaps: So that they're perfectly synced, we can say no, make those chips eight chips because I want eight instruments. So I'm gonna use eight instrument models to do that. You can literally do that with one line of code in PyTorch and you can refactor that way. And so this is, the advantage that they've had with the way that they approach the chip design and that in itself was the, probably the most radical thing that Jonathan and the team were the inception of.

    [00:27:50] Mark Heaps: They decided instead of designing hardware and figuring out how to improve hardware in a traditional methodology, they said no, we're going to start with the software. We're going to actually design our compiler first, and then we're going to design the silicon architecture to map to that, so that it's completely synchronous, so that it's completely deterministic.

    [00:28:10] Mark Heaps: We're going to build the compiler first, and we're going to make it so that no CUDA libraries ever need to be used. That you don't need to use any kernels. We're just gonna, we're just gonna bake it all right in. And so this is where we've seen a lot of that efficiency gain and where we get all that extra power for low latency.

    [00:28:28] Mark Heaps: And that's really been the fun thing. For anyone that isn't familiar with us, our early demos weren't AI related. In fact, during COVID we worked with one of the national labs, and they had a model that they were using to test drug compounds against proteins and see what drug would stick to a protein.

    [00:28:48] Mark Heaps: And, this was in an effort to try to find a vaccine, etc., during COVID. And their model at that time, from what the team told us there, was it would take three and a half days for them to get a result. Every time they put a new drug in, see if it sticks to the protein, okay, did it work? If not, move to the next one in the queue, and let's keep going.

    [00:29:06] Mark Heaps: And that was this effort of trying to figure out, what would work. It took us maybe six months back then, because we weren't as mature with the compiler. It took us about six months to get them actually having their simulation running on Groq. When they finally did it, they could do that same simulation in 17 minutes.

    [00:29:23] Mark Heaps: So imagine the rate of acceleration to try to find a drug that could actually change the world at that time of crisis. They could do that on Groq in 17 minutes. So the orders of magnitude by which we've been able to help people have just blown us away. We've done some things in cybersecurity with one of our customers in the U.

    [00:29:39] Mark Heaps: S. Army. But now what we really realize is it's going to change the world for anybody that can take advantage of linear processing. And language is the ultimate linear application, right? You don't want to generate the hundredth word until you've generated the ninety ninth word. And Matt's example is amazing.

    [00:29:56] Mark Heaps: Imagine that you can generate a story. You did it with generating a video after having the prompt being generated. My kids, I have a 12 year old son, he's a major gamer, and I showed him using Thappy, which is a voice tool online for generating voicebots. I showed him how to make NPCs with that, and putting in character personas with no code, and it's running on Groq.

    [00:30:18] Mark Heaps: And the low latency, he was having a really natural conversation, and he told me, he goes, Dad, I can't ever talk to Alexa or Siri or any of these again, he goes, it's so bad compared to this. So it's just a really exciting time and the secret sauce of it is the chip.

    [00:30:32] Alex Volkov: That's incredible. And I think you touched upon several things that I want to dive deeper into, but the one specific thing is the voice conversations, the embodiment of these AIs. It's still uncanny when you have to wait 800 milliseconds for a response. And I've seen a YC demo of a company, and somebody said, oh, this is like the best thing ever.

    [00:30:55] Alex Volkov: And it was like 100 milliseconds to an answer. And I'm looking at these 500 tokens per second, and I'm thinking, this is like a near-instant answer from a person, and probably a super, very smart person, probably faster than a person would actually answer. And it triggers something in my mind where we're about to slow these down on the UI level, because the backend is going to be faster than people can actually talk to these things.

    [00:31:19] Alex Volkov: Nisten I see you're unmuting. Do you want to follow up? Because I bet you have a bunch of questions as well. And we should probably talk about open source and models and different things.

    [00:31:29] Nisten Tahiraj: Yeah, so the one amazing thing here that we don't know the number of, so if the engineers could find out: there's something called the prompt eval time, or there are different terms for it. But for example, on CPUs, that tends to be pretty slow, almost as slow as the speed of generation. On GPUs, it tends to be ten times higher or so.

    [00:31:53] Nisten Tahiraj: For example, if you get an NVIDIA 4090 to generate stuff at 100 tokens per second, or about 100 words per second for the audience, the speed at which it reads the prompt and adds it into memory is often about a thousand or a few thousand. What I'm wondering here is that evaluation speed. That has to be completely nuts, because that's not going through some kind of memory; it just goes in the chip.

    [00:32:21] Nisten Tahiraj: It stays in the chip. It doesn't spend extra cycles to go outside into memory. So the prompt eval time here has to be completely insane, and that enables completely different applications, especially when it comes to code evaluations, because now it can evaluate the code a hundred times against itself and so on.

    [00:32:45] Nisten Tahiraj: So that's the amazing part I'm wondering about here, because you can dump in a book and it'll probably eat it in less than half a second, which is pretty nice. So yeah, one thing I'm wondering is how does this change the prompt evaluation time? And what kind of other demos or actual uses, actual daily uses, are you hoping to see?

    [00:33:08] Nisten Tahiraj: And can you tell us a bit more as to what your availability is in terms of production and server load?

    [00:33:15] Mark Heaps: Yeah, absolutely. I think for the first one, I want to be a little transparent about where Groq was at in regards to the input. When we first started building out the system and optimizing it, we really focused on token generation and not input, right?

    [00:33:32] Mark Heaps: So that's where we thought everybody was focused. It's like Gen AI was blowing up everywhere. What can you make, what can you generate? And so we said, okay, the compiler team is working on things. Let's focus on optimization of the system, the LPU Inference Engine at generation. And so we got this wildly fast speed, right?

    [00:33:51] Mark Heaps: And I remember some people saying, oh, you'll never hit 100 tokens per second. We hit it, we did a press release. The team literally came back to us two weeks later and said, hey guys, we just hit 200. And I was like, what? And then all of a sudden we hit 300 and we're like, wow, we're generating really fast.

    [00:34:04] Mark Heaps: And then we started meeting with some of these benchmark groups, like Artificial Analysis and others. And they were saying no, like industry standard benchmarking ratios right now is 3 to 1 input to output. And we went, oh we need to start optimizing for input. And so we've started working on that.

    [00:34:21] Mark Heaps: And even that right now isn't at the exact same speed optimization as our output, and the teams are working on that at this time, but it's more than capable and it's on the roadmap. It's just a different focus for the group. So we're probably going to see, over the next few months, about another 10x on the input speed, which is going to be wild, right?

    [00:34:42] Mark Heaps: Because now when you talk about conversation, a lot of the time humans blabber on, but you tell an agent to respond in a terse and succinct way. Now you completely flip and invert the ratio of what you're going to be able to have. So that's really exciting. And, from a use case standpoint, I actually had a really interesting use case that, that happened to me personally when I was on a vacation with my family late last year.

    [00:35:08] Mark Heaps: We were actually traveling and we were in Puerto Rico, and my son got stung by a lionfish. And it was really bad. We were like a hundred yards offshore, in like 60 feet of water, and I'm trying to help him get to shore, and he's screaming, and I get on shore, and the first thought in my head was, of course, call 911.

    [00:35:25] Mark Heaps: And I went, Oh my God, if I call 911, I'm going to get an operator. We're in this place that nobody can drive to. They'd have to helicopter us out. I was totally freaked out. And I ended up just going into the bot and saying, what do I do if someone gets stung with a lionfish? And in less than a second, I had a 10 step guide of what I should do.

    [00:35:41] Mark Heaps: Things that I didn't know, right? Oh, keep his foot in the water. Don't rinse it with fresh water. That happened instantly. Now imagine the world that goes from having an emergency Band-Aid or burn kit in your house to having an emergency bot in your house that can help you in those situations.

    [00:35:57] Mark Heaps: And so the speed at which it can read the input message and then give you advice back in the output is a complete game changer. And I think Alex nailed it, like we've seen all these comments where people say why do you need to generate this fast? They think of it as like a chat bot only or like a reading only situation, but the reality is, and what we've known for a long time is there's going to be an ubiquity of digital assistants.

    [00:36:23] Mark Heaps: And I don't mean like an individual bot per se, but just AI being everywhere to help you. And so that's going to require a massive amount of speed for you to be able to slice that up across all these services, like we hear of people building with their demos, like Alex said earlier. So that's our goal, to serve that.

    [00:36:44] Mark Heaps: And, Nisten, you asked about, what's the goal. Right now, again, just being candid with everybody, we didn't expect this thing to go viral. This was not a marketing strategy. This wasn't us going out and paying a bunch of influencers. It just happened and so the system has been like really tested and the amazing thing is it's held up like Matt said.

    [00:37:04] Mark Heaps: And so kudos to the engineering team for that. Where we're headed, and our goal, is that by the end of the year we want a token factory, able to generate millions and millions of tokens per second as a capacity. And so that's the plan right now. In roughly 10 months, we want to be where OpenAI was at the end of last year.

    [00:37:27] Mark Heaps: That's our goal right now. So we have those orders placed, that hardware is ordered, and we're building and increasing the capacity every week.

    [00:37:33] Alex Volkov: That's awesome. And so let's talk about models. You guys are serving Llama 2 70B. And we hear rumors about the next Llamas at some point soon. And I think Mark Zuckerberg even actually said that they finished training Llama 3 or something. We don't have insider knowledge here.

    [00:37:48] Alex Volkov: We're just speculating. And then also, obviously, Mistral is releasing incredible models. You guys have Mixtral in there. There's speculation that the Mistral Next that LMSys has access to is this incredible, GPT-4 level model. So you guys are relying on open source models, and those models are trained on other hardware.

    [00:38:07] Alex Volkov: Do you guys also have training built in, or is this only for inference? And what are the plans for also training models? Because, speeding up training would help the world at least as much as speeding up inference.

    [00:38:18] Mark Heaps: Yeah. So let's tap into a few of those. So first, we love the open source community. It was a big inspiration for why Jonathan left Google, where he was wildly successful, and said, we need to go start another company. And he wanted to make sure that the world and the developer community had access to AI technologies to accelerate development.

    [00:38:38] Mark Heaps: He literally calls this the haves and the have nots. And at that time, he said, look, it looks like Google, Amazon, Microsoft, a couple of governments are going to swallow up all of the AI technology in the world. He's that's not going to be fair. He's we need to democratize AI and access for all.

    [00:38:55] Mark Heaps: And so let's make a chip, and I remember him telling me this four years ago, he goes, I'm going to create a company where people can literally have access to the most advanced AI in the world, and do it with a credit card from their home. He goes, that's what I want to see happen. And so that's always been his vision.

    [00:39:11] Mark Heaps: And we're on that path right now. Now, with the explosion of the open source community, I think Meta deserves a lot of credit here. ChatGPT was blowing up, OpenAI was doing their thing.

    [00:39:22] The Unexpected Success of Llama 1

    [00:39:22] Mark Heaps: And Meta, which is, obviously a massive corporation and private and in it to make money.

    [00:39:28] Mark Heaps: They said, no, we're going to make Llama available to everybody. And we didn't have a relationship with them. I think everybody knows Llama 1 got leaked and one of our engineers got ahold of it and said, Hey, I'm going to see if I can fit this to the chip. It wasn't even on our roadmap. And then they got it running in less than like 48 hours.

    [00:39:45] Mark Heaps: And then from there we advanced on it. And so that was an amazing moment. Lightning bolt moment where we said, Hey. What else can we do with this?

    [00:39:52] The Evolution of Model Compilation

    [00:39:52] Mark Heaps: And at that time, I think we had maybe 200 models from Hugging Face compiled for our system. And today, I think we're well over 800.

    [00:40:02] Mark Heaps: And we just keep pulling from the repos there and building them into the compiler. But we're watching very closely now: what are the models that people want? We had Vicuna up for a little while, and we saw that on the LMSys leaderboard, and we've played with Mistral 7B.

    [00:40:16] Exploring the Power of Mistral 7b

    [00:40:16] Mark Heaps: If anybody wants to see real speed, go watch my video on YouTube on the Groq channel about Mistral 7b. It gets over a thousand, it gets over a thousand tokens per

    [00:40:24] Alex Volkov: you serious? Wow.

    [00:40:26] Mark Heaps: Yeah, I, the max I've hit with it I was just doing a conversational bot with it, and I hit 1140, and it was insane.

    [00:40:34] The Excitement Around Google's Gemma

    [00:40:34] Mark Heaps: And now there's this announcement from Google about Gemma, which I think is like 8 billion.

    [00:40:38] Mark Heaps: And the team is already going, oh my God, what could we do with Gemma at that size? The speed is going to be through the roof. And then Jonathan, our CEO, is traveling right now, and he was actually at the Mistral headquarters in France a few days ago. And they were talking to him about the next model and kind of what that looks like.

    [00:40:58] Mark Heaps: And he very much wants that to be running on the LPU inference engine at Groq.

    [00:41:02] The Future of Groq's LPU Inference Engine

    [00:41:02] Mark Heaps: So it's an exciting time to get into these open source models. And we're just happy that we can sit back and say, Hey, how do we help you guys? Because ultimately the people building the models, doing the training.

    [00:41:13] Mark Heaps: We want to enable them with this speed.

    [00:41:16] Groq's Stance on Training

    [00:41:16] Mark Heaps: You asked a question about whether we do training. We don't. We don't offer training. We don't do training. We have had one customer actually do it. That was related to that U. S. Army cybersecurity project. They actually trained their quantum algorithms using Groq hardware.

    [00:41:30] Mark Heaps: But it's not something we do, and it's not our business model. And Jonathan has always had this vision. He said, look, the world already has a bazillion training providers, and most people are quite comfortable with the pace of training, and this is going back to 2016, 2017. He said, let's recognize that if all these companies are training models, and yet there's no real clear winner in the inference solution, let's just focus our business efforts there.

    [00:41:55] Mark Heaps: He does have a vision. It's not on our roadmap right now, but he does have a vision.

    [00:41:59] The Potential of Live Training Through Inference

    [00:41:59] Mark Heaps: of what you could do with this sort of recyclical live training through inference, where it's actually being trained live in the moment and feeding back to itself, right? And this gets you into a multitude of layering techniques that we've been considering and testing at Groq.

    [00:42:14] Mark Heaps: I could see us getting into training in the future, but only when it is advantaged by that real time insight of training.

    [00:42:22] Alex Volkov: And Nisten, just before you go, let me jump in super quick. I want to follow up on something that you said, that 7B Mistral is flying at over a thousand tokens a second, and that's obviously incredible, just mind-blowingly incredible. And in my head, what I'm super excited by is not the smaller models, because I can run the smaller model on my Mac at 20 or 30 tokens a second and get like a full whatever.

    [00:42:45] Alex Volkov: I'm excited about the incredibly intense long context requirements that we've seen. So we talked about open source. We often have the folks from Nous Research here on stage, the authors of the YaRN paper, and they've been able to take Llama's 4,000-token context window and extend it to 128K.

    [00:43:03] Alex Volkov: And we never used it. We were never able to use Llama at 128K tokens because it was extremely slow.

    [00:43:09] The Power of Groq's Speed in Long Context

    [00:43:09] Alex Volkov: And I'm thinking: are you guys bringing us long context, like for real, for open source models? Because we haven't yet been able to actually use them as much. Because the bigger the model is, and the faster you can run it, it will average out, and we'll be able to get open source models.

    [00:43:22] Alex Volkov: Have you guys played with long context yet? Have you seen the incredible stuff from Gemini 1.5 releasing 1 million tokens, for example? Something that probably only Google can pull off with their TPU farms. How are you thinking about that as an advancement, as a competitive edge, as something that only you could do?

    [00:43:37] Mark Heaps: Yeah, the team is actually looking at that right now, and I think, again, early stages. Our first foray into a larger length was actually Mixtral with a 32k sequence length. And so far we haven't seen any use cases where people are actually taking advantage of that full length, but we know that it's coming.

    [00:43:54] Mark Heaps: And the moment that Gemini 1.5 got announced with the million token length, the team immediately got together and said, okay, how would we do this? And they've started architecting: what scale of system would we need for that? So that's part of the plan, in parallel with what I was saying earlier, that we really want to get to a place where we're this massive token factory by the end of the year.

    [00:44:14] Mark Heaps: And that's getting us into that, more than 10 million to 20 million tokens per second from the system in that capacity. So we're definitely looking at that. I think what's going to really dictate it for us, because we're again, sitting back and saying, how do we help? And what we're watching is what are the business use cases?

    [00:44:33] Mark Heaps: So if someone says, hey, we want to use a model that has a million token context length, but you find out they're really, on average, only using 50k for their application. This is that advantage I was talking about earlier, where we can dial the system forward or backward using a single line of code.

    [00:44:50] Mark Heaps: We can figure out what is that length that they need, and then dial that in for that customer account. We're actually doing a little bit of that right now with Mixtral. You guys mentioned we have the free version on our website that people can play with through GroqChat. And then there's the API access. Right now, as everyone's playing with it and just treating it as a chat agent, we're recognizing that we've got this thing loaded for 32K Mixtral.

    [00:45:12] Mark Heaps: And yet, the average we see being generated in GroqChat is around 900. At that scale, we're like, hey, why don't we increase the capacity of the system, speed this thing up a little bit. Let's drop the sequence length for the free GroqChat service. But leave it at the longer sequence length for the API users, and that's really easy for us to do.

    [00:45:32] Mark Heaps: That's flipping a switch, in some ways.

    [00:45:36] The Importance of Community Feedback

    [00:45:36] Mark Heaps: So we're just waiting for the open source model community to really tell us like, Oh, this is the size that we could really take advantage of.

    [00:45:43] Alex Volkov: Awesome. So you guys found the right place. The open source model community often ends up on ThursdAI and talks about their advancements. So I'd be more than happy to introduce you to the guys who are doing open source papers on long context as well. They often join here, and they would be very happy to help and figure out what's possible, especially because training those models is hard, but then running inference is even harder.

    [00:46:07] Alex Volkov: Nisten.

    [00:46:08] Mark Heaps: Way harder.

    [00:46:08] Alex Volkov: Yeah, Nisten, go ahead.

    [00:46:11] Nisten Tahiraj: Yeah, so one thing I'm wondering about is, so first of all, it's extremely impressive that these models are running at full precision, and they're not even starting to take advantage of some of the handmade stuff that people made to get them down to phone size and still perform well. So that hasn't even been explored yet, and that can reduce the size by four and have exponential improvements.

    [00:46:36] Nisten Tahiraj: So what I'm wondering is, how much, as you guys expand and as you grow, whether you adopt our models or not, how much work is it to take something like LLaMA or Mixtral and then adapt it to more of your JAX-like stack that you guys have? So yeah, that's the part that I'm wondering about, like how much work it is for companies to adapt their own models, or something custom that they've made, to this. Because I see some incredibly interesting stuff, and, sorry, I'm rambling on a little bit, but I think even for training you can make models that fit under 220 megabytes, or model parts, and then you can train those individually.

    [00:47:22] Nisten Tahiraj: So there is stuff to explore there. I just think it's still pretty new, so there haven't been enough people taking a crack at it. But yeah, how much work is it to take an open source model, or something custom that people made, and adapt it to work on Groq's hardware?

    [00:47:40] Nisten Tahiraj: That's my question.

    [00:47:41] Mark Heaps: Yeah, it's a great question. Thanks, Nisten. So I think a really good paper everyone should check out if you're interested in this, if you go to Groq.com slash docs, we've got a huge doc repo there. And one of the earlier articles that we produced from the compiler team is called Developer Velocity, and it's been a focus from day one.

    [00:48:00] Mark Heaps: We did some research when we were building out the product, building out the service, and we found out that for a lot of companies to get a model up and running, especially if it was their own model, it would take them, if you were a smaller company, let's call you an SMB, sub-5,000 employees.

    [00:48:15] Mark Heaps: They were typically spending six to nine months to get a model into production where they were using it. The larger companies, Microsoft, those guys, they're doing it in 30 to 45 days. And so we set this goal saying, we don't want any customer ever to need more than a week to get their model up and running on Groq.

    [00:48:34] Mark Heaps: And ideally we'd like it to be in 24 hours. We're actually going to test the team on that when LLAMA 3 gets released. We're going to see how fast from the day everybody has access to it, to how fast can we get it up and running. And, I'm hopeful we're going to, we're going to see a demo with it literally that day or the next day.

    [00:48:49] Mark Heaps: It's not a lot. We're using standard frameworks, right? So we're PyTorch, ONNX, TensorFlow, everything is pretty standard. The thing that we spend a lot of time on, and this is what slowed us down a little bit when Llama 2 came out: I did a video with Bill Ching, a member of our compiler team.

    [00:49:06] Mark Heaps: He's a brilliant guy, super funny. He'll tell you in the video, I didn't spend time getting it to fit to Groq, I spent time removing all of the code and components that were built in for GPUs. Basically, he spent time scrubbing, not building. And that's what happens: because the community is already so weighted towards building for GPUs, that's what takes us the most time.

    [00:49:30] Mark Heaps: We've got to strip all that stuff out because it slows it down. Again, we don't have those schedulers. We don't have those components. That's the biggest thing for us in the way that, that we get things running. But, even custom models that we've had from the national labs and the research groups, we had one that was for the Tokamak nuclear fusion reactor.

    [00:49:48] Mark Heaps: It was a control system. And even that we got running in, I think it was less than 10 days. And it was a completely custom build, and our compiler was nowhere near as mature at that time. Again, it's one of those [00:50:00] things where our goal is to get it down to where it's same day applicable.

    [00:50:03] Mark Heaps: We're a ways off from there, but right now we're trending less than a week for everybody.

    [00:50:09] Alex Volkov: Mark, I want to follow up with a use case. As you guys were talking about converting models, we see models getting released from all these finetuners. We have a bunch of folks here who finetune models after open source releases, and many of them switch to releasing their models in the safetensors format, the standard one, but also in quantized formats, so that people can actually download the smaller quantized versions and run them on their Macs.

    [00:50:33] Alex Volkov: And I can absolutely see, if you guys support this, I can absolutely see a day where folks are releasing it also on Groq or GroqChat or whatever, just for folks to be able to experiment with longer context. As a follow up on the longer context discussion: you mentioned what you see in the chat.

    [00:50:49] Alex Volkov: Yeah, the chat is not optimized for, pasting like a bunch of stuff. I, I would I would not suggest, I would be cautious about judging by that because I personally, if I get access or I guess I got access to the API, but when I get access to longer context, for example, I would absolutely think about, hey, what is possible now?

    [00:51:08] Alex Volkov: I can, and somebody commented in the comments that coding is the main use case where long context really matters. Because what happens right now is everybody's focusing on RAG. And we had this conversation, RAG versus long context, I think since a year ago, since the context lengths were 4,000 tokens, then 5,000, then 12, then whatever.

    [00:51:25] Alex Volkov: And then Mosaic came out with 60k and we were very excited. And we had this conversation since then of what performs better. And I think one of the two main reasons folks go one way or the other, and I don't know about cost, we probably should talk about cost, but infra speed, you guys are doing some incredible advancements there.

    [00:51:46] Alex Volkov: In my head, as somebody who builds systems with this, as somebody who plays around with this, if I can shove my whole codebase in the context, I will get a better answer than if I have to embed the code base and then try to do retrieval on specific chunks, whatever. I'm even thinking about the Cursor interface that I used yesterday.

    [00:52:03] Alex Volkov: I, I had to provide it with, I had to mention, hey, these docs that you already vectorized, add them to, to the context, so GPT 4 will be able to help me solve my specific issue. If my whole repo is getting sent in each prompt, I don't know if this is the best use case of your hardware, but it's definitely the, probably the fastest way to get the model to actually know exactly what I want.

    [00:52:23] Alex Volkov: That's one example. Another example is all these models, all these agents are going towards personalization. I definitely think that this year is the year of personalization, especially with longer context, and models like Gemini 1.5, for example, they have full retrieval precision, almost like 95% needle-in-a-haystack recall ability.

    [00:52:42] Alex Volkov: And that, for use cases like something like a personal assistant that remembers everything about you, removes the possibility of, hey, I didn't chunk correctly, I didn't do RAG correctly, I did vector similarity incorrectly, etc. For developers just getting up and running and building tools like this, I think long context is still yet to be discovered because it's still expensive and it's still slow.

    [00:53:02] Alex Volkov: And I think speed with a lot of context is what's going to unlock the next iteration. So that's just some feedback from the community side. Would love to hear what you think.

    [00:53:10] Mark Heaps: Yeah. So first, I love these ideas, and I want to invite everybody who's listening to go join our Discord server, because we want this feedback. We, the product team, are super hungry for it. We want to know what you guys want. So definitely go do that. It's Groq.link slash discord. Please bring all these ideas to us.

    [00:53:26] Mark Heaps: It's an interesting thing, Alex, because we've heard this from a number of customers of, do you do RAG? Do you do some form of vector database? We get asked about LangChain. We get asked about all these things. And I think for us, there's this risk of where is the infrastructure, that part of the stack with RAG, where is it?

    [00:53:44] Mark Heaps: Where does that exist, right? So if you're operating in these two totally, vast separated areas, you run the risk of losing your latency just because of the network and kind of what happens between them. So for a lot of folks, we hear no. We want the longer sequence length because we want to embed a lot of this in the sys prompt.

    [00:54:03] Mark Heaps: And we know that Groq has such fast inference that if it's embedded there, it's all living with you. And we're going to be able to maintain that speed. If you start calling out to a bunch of different RAG services, where am I going to lose? Now, I think that thinking is based on the experience they've had with GPUs, OpenAI, ChatGPT, etc.

    [00:54:23] Mark Heaps: But, for us, if we have such a margin of inference speed, we haven't seen anyone really lose on the overall experience performance because of the network topology. Jonathan was doing a demo for somebody literally using Wi-Fi on a United Airlines flight, where we had information in a RAG and he was calling it, using Wi-Fi on the plane.

    [00:54:48] Mark Heaps: And he was like, it was a very normal speed experience. He was disappointed because it felt he was using ChatGPT,

    [00:54:53] Mark Heaps: For the person there,

    [00:54:54] Alex Volkov: It's hard to go back after you experience immediacy. Waiting is definitely annoying. I'm waiting for that hedonic adaptation of ours to kick in where we expect immediacy. Yeah, sorry, please go ahead. I had to chime in.

    [00:55:06] Mark Heaps: No. Yeah. No, I think you're, I think you're spot on. So yeah. So again, we don't want to dictate to anybody You know, what is the best method? We want to listen to you guys and figure out how do we continue to serve in that way? And, the other reality is there's gonna be new techniques that are gonna be invented, in the next couple of months probably, that, that give you a whole nother option, around rapid fine tuning.

    [00:55:31] Mark Heaps: And we're just watching. And listening to you guys, but we recognize we need to enable both. So we're working with some partnerships for RAG right now to be able to connect into Groq. And there's going to be some announcements actually tomorrow about some things happening at Groq that I think people will be excited

    [00:55:47] Alex Volkov: Ooh, you want to give us a little teaser, a little laugh, or are folks going to tune in for tomorrow? We gotta tune in for tomorrow.

    [00:55:54] Mark Heaps: I I think the only thing that I'm allowed to say is there's really going to be a very strong representation of the developer community. Within Groq, and the tools that we're gonna start rolling out over the next couple of weeks are really gonna feel familiar and hyper supportive of the work that y'all do.

    [00:56:11] Mark Heaps: So it's gonna be, it's gonna be really fun.

    [00:56:13] Alex Volkov: Alright, so folks, stay tuned. We definitely pinned the Discord link to the top of the space, check it out and give these folks comments, because you guys have a bunch of headroom and we need to use it, but we need to tell you in which way we're gonna use it so you can also put it on a roadmap. You have prioritization issues like every company, you have to focus on something.

    [00:56:30] Alex Volkov: So the better folks will give you feedback, the better. I want to maybe one last question, Mark, before I let you go, and then continue with the regular thing, which you're more than welcome to stay and chime in as well on, because I did see your thread.

    [00:56:41] The Potential of Multimodality in AI

    [00:56:41] Alex Volkov: I think you're also interested in the broader AI community.

    [00:56:44] Alex Volkov: It's multimodality for 2024. I think it's clear to everyone that multimodality is built in. All the major labs are now multimodal. I think multimodal AI in open source is coming as well. We have folks here who've trained multimodal models. What are we to expect from Groq on that front?

    [00:57:01] Alex Volkov: Do you guys already have support for something like vision plus text? Are you looking at different things like video as well, which by definition takes more tokens and is then slower by definition in every other place? How is the team thinking about this kind of next evolution of Gen AI?

    [00:57:19] Mark Heaps: Yeah, good question. Obviously, multimodal is where everyone's interested. And I think ever since OpenAI gave ChatGPT the capability to generate images in the middle of the conversation and then add audio into the middle of the experience, everyone's been excited about this idea. And certainly that's where we've started.

    [00:57:37] Mark Heaps: We have a plan, we call them the three pillars, right? And it's: where does Groq add this speed value in language, in audio, and in visual? And what we're looking at right now is what are the models that we can bridge together so that we can provide that multimodal experience. The systems teams are already preparing the LPU inference engines that we're expanding on to be able to handle that.

    [00:58:03] Mark Heaps: The compiler teams have actually already begun building out some of the advancements we need to be able to support that. We know where it's going and we know that's what people are going to be asking for. So I've only shown one other thing on our YouTube channel, which was a model that [00:58:20] Adobe gave us, which was a StyleGAN, and that was 8 models that run in parallel, and I think it generates in like 0.186 of a second at 1024 pixel resolution.

    [00:58:28] Mark Heaps: We can literally say, here's an image, give me 8 completely different styled results based on that diffusion model or that StyleGAN model. And that's where we've started playing with image generation. We do have some people that are looking at tiny diffusion and a few of these other rapid generators that are small.

    [00:58:47] Mark Heaps: But certainly that's something that we intend to support. It's the problem now with the speed of all these things happening is what do you prioritize? We are a company of, less than 200 people. And we're trying to, we're trying to figure out every day, like, where do we commit our resources?

    [00:59:02] Mark Heaps: So again, it sounds like I'm trying to be like a marketing guy and I'm not. Go to the Discord and tell us what you guys want. What are your use cases? What are you predicting with your businesses? That would really help us to be a part of the conversation.

    [00:59:16] Mark Heaps: But at the high level, yeah, we already have people working on it.

    [00:59:19] Alex Volkov: Awesome, and I definitely invite your folks to also join the ThursdAI community, because we talk about these advances as they happen, we've been talking about multimodal, multimodal since almost a year ago now, folks, everybody in the audience, we're going to celebrate ThursdAI's birthday, I think, in a couple of weeks, and

    [00:59:36] Mark Heaps: Nice, that's cool.

    [00:59:37] Alex Volkov: when GPT 4 came out they had the famous demo where Greg Brockman jotted down a UI sketch on a napkin and uploaded it to GPT 4 with Vision, and we've been waiting for this to become a reality ever since, and I think it's now becoming a reality.

    [00:59:51] Alex Volkov: We also chatted with the folks from Reka AI, who had a multimodal model out there a couple of weeks ago that I was blown away by. I was uploading videos of mine and it understood the tonality in there, understood what happened in the video. We obviously see video being a big part of Gemini 1.5.

    [01:00:08] Alex Volkov: We're going to talk about this soon, where people just upload a video and that video takes so much content, like 600,000 tokens in context. But then the model understands every little frame and can pull individual scenes out. And once we get to real time video understanding, that's when the actual real-world embodiment of these bots will happen, when it can actually see what's happening and react in real time.

    [01:00:29] Alex Volkov: So definitely exciting stuff from there. And Mark, I just wanted to say what an incredible week you guys had, and it's been great to just see how this explodes and to play around with the possibilities. I'll remind folks in the audience, it's in the show notes in the Jumbotron, I played with Groq yesterday and I was able to build something that I wasn't thinking was possible a few months ago, even.

    [01:00:54] Alex Volkov: It's so fast. And you already mentioned the Discord. How do people get access? Is the waitlist long? Tell the folks in the audience about the API access.

    [01:01:03] Mark Heaps: The waitlist is really long right now, and it blew up this week. Again, thanks Matt and others for promoting. Yeah, so right now they can go to Groq.com. They'll see a link on the left that says API access. You fill out a brief form right now. We are trying to get through that list as quickly as possible.

    [01:01:20] Mark Heaps: There's a timed trial, the usual sort of terms. But in a week, it wasn't even a week, it was literally within 37 hours, we had over 3,000 API access key requests. And so that was more than we had expected. And so we're trying to get through that list right now and see what the tier levels are; some people are telling us they need a billion tokens per day of access.

    [01:01:42] Mark Heaps: And we're saying, okay, this is this tier level. And other people are like, hey, we're part of Y Combinator's startup accelerator group. We're just testing our bot ideas out, can I get free access? So we're working through that list right now. The good thing is, we are increasing capacity every week, and one of the announcements that we'll have tomorrow and rolling into next week will be moving more towards self-serve versus us going through and manually approving everybody, so that should accelerate approvals greatly.

    [01:02:10] Mark Heaps: I just ask everybody be patient. If you've applied, stick with us. We promise we're going to get to you. We really want you to have access to this. This level of inference speed but this whole virality moment came out of

    [01:02:21] Nisten Tahiraj: nowhere and we,

    [01:02:23] Mark Heaps: We're trying to meet the needs now.

    [01:02:25] Mark Heaps: So just stick with us. It's going to keep getting faster and faster.

    [01:02:28] Alex Volkov: Incredible. So folks, definitely check out GroqChat. If you haven't yet it's quite something. It's quite incredible. Check out all the demos as well. And with that, I want to say, Mark, thank you. This is the end of our conversation. It's been an hour, folks, on ThursdAI, and I'm going to reset the space a little bit, and then we're going to talk about everything else that was new this week, and there was a bunch of stuff in the open source and in different places.

    [01:02:49] Alex Volkov: But what you heard so far is a deep conversation with Mark. Mark Heaps from Groq which came to many of us as new, but was around for a while. And then we also had some folks in the audience as well listening to this from Groq as well. So that was great. Thank you, Mark. And then let's reset the space and start talking about what's new in AI this week.

    [01:03:07] Nisten Tahiraj: Thanks so much, guys. Really appreciate

    [01:03:09] NA: you.

    [01:03:31] Google releases Open Weights for Gemma 2B and 7B

    [01:03:31] Alex Volkov: All right, how's it going, everyone? You're on ThursdAI, February 22nd. My name is Alex Volkov. I'm an AI Evangelist with Weights & Biases. And yet another incredible week in AI with a bunch of other stuff, and I want to move our conversation towards the explosive open weights news this week. So we have some more folks on stage here, and LDJ, we've talked about this when it came out, but Google gives us open weights models, this is new to us folks, we've been waiting for Google for a long time, and finally they come out, and Google releases Gemma, a new open weights model, not open source, and they've been very clear about it, which I really applaud the team for.

    [01:04:12] Alex Volkov: We're going to talk about some stuff that Google did not exactly do correctly this week, but we're also going to, we're going to highlight like we're going to give props where props are due. Google is clearly talking about open weights, open access model, not open source because they didn't open source a bunch of stuff.

    [01:04:26] Alex Volkov: Definitely not datasets. It's called Gemma. They released two sizes, 2 billion and almost 8 billion, so a 7 billion parameter model. Let's see what's interesting there. Trained on 6 trillion tokens, 8,000 context window interestingly, vocab size is way bigger than LLaMA, and if you guys have been following Andrej Karpathy this week, as you should, he just released a whole conversation about tokenizers, and he then analyzed the vocab size of the tokenizer for Gemma, and said it's way bigger than LLaMA's.

    [01:04:59] Alex Volkov: It's basically the same one, similar one, just like way bigger. And Yeah, this is incredible. This is like great news that Google is stepping into the open source. I think they see what Mark Zuckerberg saw, where once you release something like this, the community provides. And I want to just highlight, I had a tweet go off like fairly viral, because four hours after release, LDJ, we were spending the first hour in the space together that you opened.
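To make the vocabulary comparison above concrete, here is a minimal sketch, assuming the Hugging Face `transformers` library and access to the gated `google/gemma-7b` and `meta-llama/Llama-2-7b-hf` checkpoints (both require accepting their licenses on the Hub and logging in first):

```python
# Minimal sketch: compare tokenizer vocabulary sizes between Gemma and Llama 2.
# Assumes the model licenses have been accepted and `huggingface-cli login` was run.
from transformers import AutoTokenizer

gemma_tok = AutoTokenizer.from_pretrained("google/gemma-7b")
llama_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

print("Gemma vocab size:", gemma_tok.vocab_size)    # on the order of 256,000 tokens
print("Llama 2 vocab size:", llama_tok.vocab_size)  # 32,000 tokens

# Quick check of how the same sentence tokenizes under each vocabulary.
text = "ThursdAI covers everything new in open source AI."
print("Gemma tokens:", len(gemma_tok.encode(text)))
print("Llama 2 tokens:", len(llama_tok.encode(text)))
```

A larger vocabulary generally means fewer tokens per sentence, which is part of why the two models are hard to compare parameter-for-parameter.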

    [01:05:22] Alex Volkov: Four hours after release, we had llama.cpp support, Ollama support, we had LM Studio support. Many people, like Maxime Labonne, one of our friends of the pod, quantized and uploaded this because the original quantizations weren't done correctly. Then after half a day, Tri Dao from Together added support for Flash Attention. I think there's a bunch of other stuff that added support as well.

    [01:05:40] Alex Volkov: And we just had we just had folks from Groq talk about they've been looking at this as well. So it feels like Google understands the benefit of open weights access model. So I just want to, this shout out Google. Let me actually, I have a thing for this. Yeah. Good job.

    [01:05:56] Alex Volkov: The big G provides, and this is great, and I'm, I was really surprised and happy to see this in the morning, and I wanted to hear from folks here on stage what are your thoughts so far on Gemma in terms of performance compared to, let's say, Mistral or anything else like Finetune that we had.

    [01:06:10] Alex Volkov: Whoever wants to go next, but LDJ, you and I have the space, so feel free to comment what we learned from the space and since then, and then let's go around the table, and then we're gonna go forward with some news.

    [01:06:21] LDJ: Yeah, so I think what we learned on the release and also after a little bit of time of people using it is that pretty much it has around the same abilities as Mistral. You could say maybe a little bit better than Mistral in certain ways. Some people say it's at least a little bit worse than Mistral in certain other [01:06:40] ways.

    [01:06:40] LDJ: But overall there are definitely maybe certain use cases where you might prefer the Gemma model. It is interesting though, I believe Gemma is actually, from what I remember seeing, 8.5 billion parameters, whereas I want to say Mistral is a total of 6.7, so there is actually somewhere around 25 percent more parameters, and theoretically it should be maybe a little bit better than Mistral than they say, but yeah, it just really shows how impressive Mistral is, the fact that Google is making this model and it's still not really significantly beating it,

    [01:07:17] Alex Volkov: It's quite impressive. I saw, I think Marco from A16Z, Marco Mascorro, posted comparisons of Gemma, Mistral, Llama and I think something else. It's quite incredible that this model, from a company of less than 30 people that released its 7B model less than 6 months ago, September I think, or October, still performs well against a company with like billions or whatever, and it's quite stunning that they're not able to beat Mistral 7B

    [01:07:49] Alex Volkov: by a significant amount. I wanted to like, highlight how, first of all, impressive this is, that they even released something. But also, how impressive this is for Mistral, that they come out so strong, and their model is basically the one people compare to. Definitely agree to that.

    [01:08:05] Nisten Tahiraj: Yeah, I used it quite a bit. In my opinion, I don't like it, it's just not that reliable. So yeah, it can code, but sometimes it's not a very obedient model, and the thing about Mixtral and Mistral and stuff is that they're used like tools a lot. But again, we have yet to see good fine tunes.

    [01:08:32] Nisten Tahiraj: We saw how far people took alignment with OpenChat

    [01:08:39] Alex Volkov: Yeah, speaking of OpenChat

    [01:08:41] NA: and how far they've taken these. Yeah, so we'll see. I'll hold off a bit of judgment for them for now

    [01:08:49] Alex Volkov: Yeah, speaking of OpenChat, and speaking about fine tuning and being able to fine tune this alignment, what are your initial thoughts? I saw Alpay post something that a new OpenChat is coming. Are you guys cooking a fine tune, like what's going on?

    [01:09:03] Alignment Lab: There's probably an OpenChat fine tune of Gemma that's going to come out. I'm not clued in to that right now. I haven't had a chance to really get my head above water for a couple of days because I've been just buried in several things. If, if there is, it's probably going to be good. The model seems smart and it's got a lot of parameters, so It's hard to say that fine tuning won't make it very strong.

    [01:09:31] Alignment Lab: I think with that giant tokenizer, it's going to be worth knowing that the model's going to be able to do a lot more during the training run, because it's going to see more granular patterns and have a more expressive vocabulary to exploit, the way that training runs make a model perform better.

    [01:09:50] Alignment Lab: This is the best way I can put it. It also, it's not getting mentioned very much, and I think it's because this is past the event horizon of AI stuff for a lot of people, but if you open up the model's architecture, the implementation of it on the Google GitHub repo, they actually have a few different versions, and they're all for running the model in various contexts, with or without TPUs. And in all of them, even the one that's not made to be parallelized, the model actually does have a baked in architecture designed for quantization and parallelization.

    [01:10:20] Alignment Lab: And it looks like it can be quantized, or it can be parallelized, horizontally, vertically, and whatever the word is for the third dimension. It looks like it breaks pretty evenly into eight pieces, and if you can break it into eight pieces, and quantize each piece, and dequantize each piece, You can maybe parallelize it across asymmetrical compute, which is the big holdup for why we can't distribute models over just a bunch of random servers.

    [01:10:48] Alignment Lab: Because usually, if they're not the exact same GPU with the exact same throughput and interconnect the model's unable to perform inference. But they may be able to solve for that baked in there, and it might be that they intend on Maybe having some service by which you can use the model locally with X amount of context and then just back into it onto their TPUs.

    [01:11:08] Alignment Lab: I'm not sure, but it's interesting that it has a lot of custom tooling like baked into it designed for quantization parallelizing

    [01:11:15] Alex Volkov: Yeah, I want to say custom tooling, and also, thanks Alignment, the amount of stuff supporting our community that they released is quite impressive. They released GGUF quantizations, I think. They released support. They even released, folks, I don't know if folks missed this, they released something called gemma.

    [01:11:32] Alex Volkov: cpp, which is a local CPU inference engine written completely in C++ with no dependencies. So in addition to llama.cpp adding support for this, there is gemma.cpp, and that's like their own counterpart to llama.cpp. And that was pretty cool of them to release.

    [01:11:49] Alex Volkov: And it looks like they've geared up to have this model be accepted. It's on Hugging Face. Hugging Face and Google recently announced a partnership, and now it's on Hugging Face as well. So you can actually go to huggingface.co slash google slash gemma. And it's pretty cool.

    [01:12:04] Alex Volkov: I remember they, they mentioned Gemini Lite or Gemini Tiny or whatever for local inference. Very interesting that's not what we got. We got like a new model called Gemma out of the gate. Yam, do you have any, what's your thoughts on this whole thing from Google? Do you have a chance to play with this?

    [01:12:19] Alex Volkov: Give us a little breakdown.

    [01:12:20] Yam Peleg: actually, yeah, actually fine tuning is on the way. Already got the GPUs warming up

    [01:12:27] Alex Volkov: let's

    [01:12:28] Yam Peleg: the data as we speak. Yeah, I'm going to do, I'm going to do, before fine tuning, I'm going to do a little bit of a continuous pre training just to see if we can squeeze a little bit more out of the base model.

    [01:12:40] Yam Peleg: It's just important to distinguish between the base model and the instruct tuning model.

    [01:12:47] Alex Volkov: That's the slash IT thing they released, right? There is like a Gemma and Gemma slash

    [01:12:51] Yam Peleg: When we talk about chat GPT like models, we talk about the instruct tuned models. And this, yeah, for sure, Mistral is just better at the moment. But in terms of the base model, we can know this only after people start to play with it and try to tune it themselves.

    [01:13:11] Yam Peleg: Then we can see how far we can push it, because maybe it's just the actual fine tuning that Google did to their version of the model, and with the methods from the open source community that are by now very well proven for instruction fine tuning, maybe this model will be really great, because at the end of the day.

    [01:13:36] Yam Peleg: The amount of compute that Google put into the model is insane, it's unparalleled. I'll be surprised if the model doesn't turn out to be really good, the base model, after fine tuning. But yeah, there is absolutely no doubt that Mistral is hiding something, they do have a moat. All their models that they fine tune for instruction following are on a different level.

    [01:14:03] Yam Peleg: You can say. And you can see this even with miqu, the one that shouldn't have been leaked. It is also really good.

    [01:14:13] Yam Peleg: But yeah, it's amazing. It's amazing that there is another player, a major player, Google, releasing a really good base model, open source.

    [01:14:24] Yam Peleg: It's great. It's great to have more players in this field, more corporations entering this game, supporting open source. It's always great. Yeah.

    [01:14:33] Nisten Tahiraj: And the funny part is that they are struggling to compete in this section just because, the beauty of open source is that it enables so much competition, especially at these lower sizes where people can iterate very quickly.

    [01:14:48] Nisten Tahiraj: And and now this is extremely obvious in this case. But yeah, I also think that the base model, I only tried the instruction tuned ones, and I've posted it above. I even have a link if you want to try it, but. [01:15:00] There is a lot more to be squeezed out of that just because again of the quality of the data that went in the pre training and Google might just not be that good at making chatbots.

    [01:15:13] Nisten Tahiraj: Yeah, they'll probably, they'll get better, but it's

    [01:15:16] Alex Volkov: Nisten, is it mergeable? It's mergeable, right? Like it's Frankensteinable.

    [01:15:21] Nisten Tahiraj: Yeah, I think you can, I'll

    [01:15:24] Yam Peleg: do it for fun. You can merge it with itself, but we don't have models to merge it with at the moment,

    [01:15:32] NA: because you can't talk about it here yeah. You can merge the instruction tune with, not instruction tune, with itself and train on top.

    [01:16:08] Yam Peleg: I tried to extend it with a Frankenmerge and it didn't behave nicely. Mistral, for example, behaved really well. You can stretch it three times, just copy the layers three times, and it works really well. At the fourth time it starts to disintegrate and just breaks, but up to 3x it works really well. This model didn't, so it was a little bit strange to see.
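For readers wondering what "stretching" a model by copying its layers looks like in practice, here is a rough sketch, not Yam's actual procedure: it naively repeats the transformer blocks of a Llama/Mistral-style model, and the `model.model.layers` module path is an assumption about that family of checkpoints in the Hugging Face `transformers` library.

```python
# Rough sketch: naively "stretch" a decoder-only model by repeating its blocks.
# Real frankenmerges usually interleave overlapping layer ranges instead,
# but this shows the basic idea of what gets copied.
import copy
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", torch_dtype=torch.bfloat16
)

repeat = 3  # copy the layer stack three times
layers = list(model.model.layers)
model.model.layers = torch.nn.ModuleList(
    [copy.deepcopy(layer) for _ in range(repeat) for layer in layers]
)
model.config.num_hidden_layers = len(model.model.layers)

# The result is a much deeper (and much larger) model that typically needs
# further training before it behaves well.
```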

    [01:16:08] Yam Peleg: But yeah, I'll know in a couple of hours when my training starts, so I'll be smarter then and can tell you. If anyone saw my experiment, I tried to play a little bit with reinforcement learning, with DPO. I stopped the experiment mid run because someone pointed out that the terms forbid this type of experiment, but I just want to say that I tried to make the model refuse less, it was refusing nearly anything that you asked it, so I just tried to make it more willing to actually do what you ask, nothing really fishy, but yeah, the terms forbid that, so I stopped the experiment.

    [01:16:51] Yam Peleg: I just wanted to say that it really resisted. I trained and trained and the model still resisted. They really went hard on the alignment part of this model.

    [01:17:02] Alex Volkov: Interesting that, we're going to talk about this next, I think, from Google, but interesting that even in their kind of open weights, open access models, they're baking in the alignment like super strong. Anything else, folks, on Gemma before we move on? Generally, kudos for Google for coming out this strong.

    [01:17:21] Alex Volkov: Gemini Ultra getting announced, and then we saw Gemini Ultra access, then Gemini Pro 1.5, which we covered a little bit, and we probably should talk about this a little bit more, and now we're getting open weights models that are finetunable, and I think even commercially licensed, right?

    [01:17:35] Alex Volkov: You could use this in production, if I'm not mistaken.

    [01:17:42] Alex Volkov: I guess I'm not

    [01:17:42] Alignment Lab: Yeah, I think so. I think so. I think so.

    [01:17:45] Alex Volkov: Yeah, which is quite impressive. Even Meta took a while to give us a commercial license. Microsoft released Phi without commercial licensing and then, after six months, gave in to the pressure, and Google waited, and now they're like, ta da, here's this.

    [01:17:58] Alex Volkov: So very impressive from Google, and kudos to whoever there worked on this release. It's probably not very easy to do, not open source, but open weights. It's not very easy to do that stuff from within a big organization. So whoever listens to this, whoever worked on this, thank you. Give us more.

    [01:18:14] Alex Volkov: We would like to see bigger models, 35B, etc. Junyang, you wanted to comment as well? I saw you step in here.

    [01:18:20] Alex Volkov: Yeah,

    [01:18:21] Junyang Lin: I am definitely very excited about the Google open release of the Gemma model because, yeah, it's actually a great model. Yesterday we were just trying to compare Qwen 1.5 with Gemma 7B, and we found Gemma 7B is actually better, but that was when we tried the base model.

    [01:18:40] Junyang Lin: We think the base model should be a good model, but the instruction tuned model is a bit strange. Actually, its behavior is quite strange. It's always refusing, and it's too safe, and there are a lot of answers it won't give. So I'm very surprised at how they did their chat model. But generally, the base model is generally good.

    [01:19:04] Junyang Lin: But I'm very interested in their choices of architecture, because its size is actually not 8 billion, it's actually 9 billion, because the input embedding and the output embedding layers are not shared parameters. So you find that the size is actually very large.

    [01:19:23] Junyang Lin: And for 2B it is actually similar, it is essentially three billion parameters if you count it correctly. So it's actually a very large model. And it is quite strange that for the 2B model it is using MQA, multi query attention, but for the 7B model it is actually using multi head attention.

    [01:19:43] Junyang Lin: I don't know why they chose it. And if you carefully look at the size of the hidden dimension as well as the head dimension for the attention, you'll find that for the attention layer the head dimension is 256 with 16 heads, which means that the hidden dimension for the attention is actually 4096, but the hidden dimension of the model is 3072.

    [01:20:11] Junyang Lin: It is very strange for me to choose something like this. I don't know if we should follow it for the following models. I don't know why Google did this; if they could tell us about it, that would be much better. But something that is very interesting, and we also have experiments to show that it is quite effective, is the large intermediate size.

    [01:20:34] Junyang Lin: You will find that the intermediate size, in comparison with Llama models or Mistral models, is actually larger. We have some experiments that find the larger intermediate size can improve the performance, but there are still a lot of things where we don't know why Google did this, and we're not sure Gemma is really a much better model than Mistral, because I have seen some evaluations from Anton, and it seems that Mistral is still the better one.

    [01:21:05] Junyang Lin: I'm not sure it's actually much better than Mistral, so let's wait for more tests.
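The architecture numbers cited above can be checked directly from the model's configuration. A minimal sketch, assuming the Hugging Face `transformers` library and access to the gated `google/gemma-7b` repository:

```python
# Minimal sketch: inspect the Gemma 7B architecture values discussed above.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("google/gemma-7b")
print("hidden size:", cfg.hidden_size)              # model width (3072)
print("attention heads:", cfg.num_attention_heads)  # 16
print("head dim:", cfg.head_dim)                    # 256
print("intermediate (FFN) size:", cfg.intermediate_size)

# The attention "inner" width is heads * head_dim, which in Gemma 7B
# is larger than the hidden size itself.
print("attention width:", cfg.num_attention_heads * cfg.head_dim)  # 4096
```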

    [01:21:11] Alex Volkov: We'll wait for more tests. Thank you, Junyang. Folks who are not familiar with Junyang, he's on the technical team at Qwen, and he's joined multiple times at this point, thank you Junyang and it's great to have you here. And definitely we'll see more fine tuning, the base model seems to be fine-tunable, Yam said he's already cooking something, and probably other labs are already rubbing their hands in anticipation of how to use the open source stuff, the DPO stuff.

    [01:21:33] Alex Volkov: If it works, to actually make this model's instruction fine tuning behave better than Google did. And I'm sure that it's possible, because we've seen a lot of advancements in the open source community. And now it looks like Google is catching up to the open source community and not the other way around, which is incredible.

    [01:21:47] Alex Volkov: And I want to just say, I will move on from this because folks have been here for an hour and a half, and there's a bunch of other stuff to also talk about. Specifically because Google is in our good graces from one perspective, but also, from another perspective, since they released Gemini, and Gemini could generate images, they have shown us why they've potentially been hesitating to release anything at all.

    [01:22:11] Alex Volkov: Because, and I think OpenAI and DALL-E have this to some extent as well. But if you've missed the storm and conversation this week, you'll definitely hear about this, because Gemini, both Pro and Ultra I think, on the interface, not the API models, they are able to generate images. I think it was with Imagen or some other model from Google, like DALL-E in ChatGPT, right?

    [01:22:31] Alex Volkov: And folks quickly found out that those models do not like the word white. And literally, I think I had to tweet about this, I'll pin this, and I'll add this to the show notes as well. I went and tested something like, hey, generate a glamour shot of two Jewish couples, two Indian couples, two African couples, and that was fine.

    [01:22:50] Alex Volkov: And then I asked it to generate a glamorous shot of two white people. And then it said, no, I cannot do generation based on race or gender or something like this, even though it had just done this five times. And then many folks tested this with historical figures, when they asked, hey, generate an image of, whatever, the United States founding fathers, or some Nazi, or whatever it is.

    [01:23:11] Alex Volkov: And they had a significant interjection into prompting, where it created stuff that are not remotely historically [01:23:20] accurate. And when I tested my stuff, it was a response to the historically accurate stuff. And it's still, it seems like there's a problem with how these models are replying to us.

    [01:23:29] Alex Volkov: And a lot of folks at Google probably made it hard for these models to actually give me the image that I asked for. It refuses so much that the conversation went so hard into, hey, Google, what did you give us, why is this thing so refusing, that Google took down the ability to generate people. So right now, and I think it's been like this for the past 24 hours or so, if you go now and try to generate an image of an elephant, you'll get it.

    [01:23:54] Alex Volkov: But if you try to generate the image of an elephant with, I don't know, two white folks holding its trunk or whatever, it will refuse. And they completely nerfed the ability to generate people altogether, quote unquote, while they solve for this, which is quite remarkable to think about for a big company like this, that has already been in hot water before.

    [01:24:17] Alex Volkov: And obviously this is Google, everybody's gonna dunk and go on Twitter and say bad things because punching up is easy. But, and also this gets you internet points if you're the first person that says, hey, Google is, reverse racist. But, Google has been in this hot water before with some image identification.

    [01:24:34] Alex Volkov: I think there was a famous incident a decade ago almost, if you guys remember, with an image model that was labeling black people as gorillas or something. So Google has been burned on kind of the other side of this before, and now it looks like the pendulum swung way back to the other side, enough so that in the first week or so of the release.

    [01:24:53] Alex Volkov: Now they are taking back the ability to generate people completely. And quite incredible how much of an intervention into multiculturalism, let's say they have in prompt layer. So it does look like the model can generate stuff. I saw one, one hacky attempt. Somebody said, hey, generate a glamorous shot of couple with fair skin.

    [01:25:14] Alex Volkov: And then most of them are white, but if you actually say white couple, it's not able to, which is quite interesting. And I think it adds to the point where Yam said that even the open weights model that they've released, they have some built in kind of alignment strongly in the finetuning.

    [01:25:30] Alex Volkov: So probably it's a feature of some of the datasets, but also some of the alignment stuff. It's really interesting to see that the internet kind of showed Google that the other side is also not great. Going all the way to the other side is also not great. And so Google, at least some of the teams in Google are, struggling right now to figure out what's the right balance there.

    [01:25:49] Alex Volkov: Separately from Yeah, go ahead.

    [01:25:51] Nisten Tahiraj: Sorry

    [01:25:52] Nisten Tahiraj: I really want to highlight this because it's gotten to the point where the open source models and even GPT 3.5 will do some tasks fine. And in this case, a task that I tested with is the Universal Declaration of Human Rights, which is the most translated document

    [01:26:10] NA: in human history and it's part of every data set.

    [01:26:13] Nisten Tahiraj: And now you have Gemini and you have Copilot, which is GPT 4, saying that it is too unsafe to translate, to

    [01:26:24] NA: give you a translation of the Declaration of Human

    [01:26:27] Nisten Tahiraj: Rights, which is, this has just gotten completely ridiculous. You can use a, you can use a model that's made anywhere else, any open source model, and it will tell you that, whereas now we have the, all the safety people and all the people that they hired, it's gotten to the point that it's completely backfired, and this is ridiculous.

    [01:26:54] Nisten Tahiraj: They should be held

    [01:26:56] Alex Volkov: Yeah, into uselessness. Like, some things in history happened, and we would like to be able to ask about those things. And yeah, I definitely want to hear how this gets solved. I will say there were some folks mentioning that, hey, OpenAI's DALL-E, if you ask the same exact thing from DALL-E, it may give you some similar answers.

    [01:27:14] Alex Volkov: So why is Google getting attacked? First of all, they just released it. Second of all, this is Google after all. They're still the big 600 pound gorilla in the room, as I think Microsoft called them. And thirdly, we have short memory. We play with the toys, we play with the tools as we get them.

    [01:27:30] Alex Volkov: And then when we discover, we go viral. Back to the good side of Google also: as we had breaking news last Thursday, and we talked about Gemini releasing a million tokens as ThursdAI started last week, which was crazy, Google released an update that said, hey, some developers can now get access to up to a whopping 1 million token context window for Gemini 1.5.

    [01:27:53] Alex Volkov: And technically, in research, they have support for up to 10 million tokens of context, which is incredible. And I just want to come back and say that after this week, we've seen many folks, including Matt Schumer, who's here on stage, and a bunch of other folks, getting access to this 1 million tokens.

    [01:28:08] Alex Volkov: I didn't get access yet, so wink at Google, if somebody hears me, please give me access. And folks are trying books, like three full Harry Potter books, on it and getting incredible stuff. Many folks are using it for video, which is also quite remarkable, uploading an hour of video and getting retrieval from the video from within Gemini 1.5's

    [01:28:29] Alex Volkov: 1 million token context window. I wanted to follow up and say, you know, the safety folks at Google need to take a little break, but the tech folks at Google, holy crap. The 1 million token context was severely underhyped after Sora released from OpenAI, like two hours after we had also breaking news, and Sora is still blowing minds, and we're going to talk about Sora just briefly, but the 1 million token context window gets more folks playing with it, and it's incredible for code generation.

    [01:28:59] Alex Volkov: People threw the whole code base of three.js in there. People threw whole code bases in one prompt. And we were talking about this a little bit with the Groq guys as well, where this unlocks significant new possibilities that weren't imagined before, and we don't have time for this debate today.

    [01:29:20] Alex Volkov: And maybe we'll have to close the space a little early. And I'll tell you why in a second, but. I just wanted to highlight that, there's some stuff that Google did. Google is like this huge company, like full of multiple people. The safety stuff, meh, like we're gonna rally against this, we're gonna tell them that they're wrong and hopefully we'll get like less, less restricted models.

    [01:29:39] Alex Volkov: But the context stuff, oh my god, this is incredible, it definitely set a new bar for how models should behave and what's possible. You can send 10 hours of audio in one context and it will be able to tell you exactly when somebody said what, and summarize everything with like perfect recall.

    [01:29:58] Alex Volkov: We had Greg Kamradt, a friend of the pod as well. He did this needle in a haystack analysis on a bunch of context windows, if you remember, on Claude, etc. And they used his needle in a haystack analysis to show that the model also has very high recall precision, like almost perfect recall precision throughout this whole context, throughout the whole like 600,000 tokens or so.
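For anyone unfamiliar with the needle-in-a-haystack test being referenced, here is a minimal sketch of the idea; the `generate` callable is a hypothetical placeholder standing in for whatever long-context model API is under test, not any specific vendor SDK:

```python
# Minimal needle-in-a-haystack sketch: hide a known fact ("needle") at varying
# depths inside a long filler document ("haystack"), ask the model to retrieve
# it, and record whether the answer comes back correctly.
NEEDLE = "The magic number for ThursdAI is 42137."
QUESTION = "What is the magic number for ThursdAI? Answer with just the number."
FILLER = "The sky was a calm shade of blue that afternoon. " * 20000  # long haystack

def run_needle_test(generate, depths=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """`generate(prompt) -> str` is a placeholder for the model under test."""
    results = {}
    for depth in depths:
        cut = int(len(FILLER) * depth)
        haystack = FILLER[:cut] + NEEDLE + " " + FILLER[cut:]
        answer = generate(haystack + "\n\n" + QUESTION)
        results[depth] = "42137" in answer
    return results
```

Running this over several context lengths and insertion depths is what produces the recall heatmaps mentioned in the conversation.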

    [01:30:21] Alex Volkov: And we had folks test this week. Quite incredible advancement there, and Anthropic, who did Claude for us with 100,000 tokens for a long time, this was their moat, then 200,000 tokens, it seems, is paling in comparison. I did my comparisons from last year, if you guys remember: during May, Mosaic released the jump to 70,000 tokens or so, and back then that looked incredible, they put an actual book in there. And I just compared: in less than a year, we've gotten like a 10x jump in what we consider normal context windows or possible context windows, because less than a year ago, the big jump was to 60,000.

    [01:31:03] Alex Volkov: And now we're jumping to a million. And it's actually possible to use a million. So incredible, incredibly important for multimodality as well, because videos take just so much content. I think the one hour video of Buster Keaton, I think that is the video that they've used in the example, takes around 600,000 tokens.

    [01:31:20] Alex Volkov: Just think about this. One hour of video takes around 600,000 tokens. And it's able to tell you with exact precision where something happened in this video, what happened, who spoke about what. Very incredible. Definitely underhyped. I think, collectively on X, we're able to talk about one important thing, and Sora [01:31:40] definitely took that one important thing, but coming back to Gemini 1.5

    [01:31:43] Alex Volkov: with a huge context, it is very impressive from Google as well. Anybody here on stage got access to 1.5 and actually played with this? I haven't yet, I'm just recapping from the feed. Nope, everybody's sad. Google, if you hear us, give us access. Nisten?

    [01:31:59] Alignment Lab: I will bite my finger off like a graham cracker

    [01:32:02] NA: to get access to that model.

    [01:32:03] Alex Volkov: Yes. Exactly. All right, so moving. Yeah, Nisten, go ahead and then we'll move on.

    [01:32:08] NA: No, I just

    [01:32:09] Nisten Tahiraj: wanted to mention some other news, that Roboflow and Skalski just released the YOLOv9 model. I made some demo with it, with the sailboats

    [01:32:18] NA: And the boxing and stuff. And

    [01:32:20] Nisten Tahiraj: this

    [01:32:20] NA: is

    [01:32:21] Nisten Tahiraj: pretty, it's pretty nuts. It's like the next the next gen stuff.

    [01:32:24] NA: But they've also released a paper, I think.

    [01:32:27] NA: for

    [01:32:27] Nisten Tahiraj: some research, which I haven't read yet and I'm incredibly excited. But yeah, this is completely this is not as much LLM related, but it is open source vision AI stuff. And I really recommend people to, to look at it because it's like straight up from the future. Like I I tried YOLOv8 and you all can see the results and stuff on video on stuff you can do.

    [01:32:51] Nisten Tahiraj: And

    [01:32:51] NA: It's pretty cool.

    [01:32:53] Alex Volkov: Could you add this to the space and we'll add to show notes as well. I will just highlight that Peter Skalski, SkalskiP is a friend of the pod, a dear co host, and Roboflow are doing incredible vision stuff, and definitely worth a shoutout every time they release something new, and some of his tutorials on Twitter are amazing.

    [01:33:09] Alex Volkov: If you're into vision understanding, Peter is the guy to follow, and a shoutout for the stuff that they're building there. I think we're gonna move on from the big companies and LLMs, we've talked about pretty much everything, to open source. The last thing that I want to mention is that Nous Research released Nous Hermes DPO.

    [01:33:27] Alex Volkov: And basically it's the same model, just trained on DPO data set. And that beats the previous Nous Research, the Nous Hermes Open Hermes 2. 5, I think pretty much in every benchmark. And that's been great to see the DPO is Putting itself in, in, in the right position of improving models.

    [01:33:44] Alex Volkov: I think we've seen this from our Guia folks who cleaned datasets and actually retrained Hermes models. I think we've seen this. And now we're getting a DPO headset from Nous folks themselves, which is great to see. And Jan, I think you had some comments about how to actually do this DPO thing in, in, in comments to Technium.

    [01:34:00] Alex Volkov: So more of that goodness is coming, and open source does not wait, and I can't wait to see all these techniques also apply to, to the different Jemma stuff that we got, and different other, let's say, rumored, wink, from meta stuff that at some point are gonna come, and we're gonna get, hopefully the number three which, if they release today, I'm not gonna be mad honestly Mark, if you're listening to

    [01:34:23] Nisten Tahiraj: Yeah, let's close it early, otherwise we'll be here until tomorrow.

    [01:34:27] Alex Volkov: that's true.

    [01:34:28] Alex Volkov: We're going to close it early because of this next thing that I want to talk about, and I actually want to cover this a little bit. So I'm going to put on some music and then we're going to talk about this. Oh my God, I got lost in my music stuff. And we're going to talk about this week's buzz. I see that folks are enjoying me mistakenly hitting different musical buttons. Folks, welcome to this week's buzz. This is a corner here, a section here, where I talk about everything that I've learned working for Weights & Biases. Some of this is technical, some of this is just the stuff that we release, like courses.

    [01:35:00] Alex Volkov: And we released a course with Hamel Husain about enterprise model management. So if you're into this, that course is great. It's going so well, so many people are registering. I haven't actually had time to see it. I should probably see it soon, maybe tomorrow, because I'm preparing ThursdAI and working on demos with Groq and everything.

    [01:35:17] Alex Volkov: But I definitely wanted to chat about the reason I was in San Francisco this last weekend. So as we were finishing up ThursdAI last week, I think I said Swyx was here, I was recording it live from San Francisco. And that day, on Thursday, we had a meetup that I helped co-host, and I wasn't the only one there.

    [01:35:36] Alex Volkov: A16Z, Andreessen Horowitz, the biggest VC firm in the world. If you don't follow Marc Andreessen on X, you definitely should. He's a big proponent of open source, he's been talking about all these very interesting things. Shout out Marc Andreessen. He wasn't there; I definitely expect to see him next time.

    [01:35:52] Alex Volkov: But folks, Rajko and Marco Mascorro from A16Z, the guys who give out grants to open source, were there. And you know that many of our friends of the pod are grant receivers from A16Z. TheBloke received a grant, Nous Research are grant receivers. I think Axolotl, Wing is from Axolotl, is also a grant receiver.

    [01:36:09] Alex Volkov: Like a bunch of folks are getting supported by A16Z. And they had a meetup for open source AI, and I was very proud to be invited and to be a co-host, and gave out a bunch of Weights & Biases swag. And just in terms of names who went, it was mind blowing. We had Nous Research folks, so Teknium was there, and Emozilla was there, Karan, Shivani, like all the Nous folks were definitely a big help organizing.

    [01:36:33] Alex Volkov: Ollama folks were there, announced that they're now supporting Windows. LlamaIndex, we met with Jerry. LMSys folks, who I really wanted to meet and talk to, and maybe bring on ThursdAI, but I didn't get a chance to, so if anybody knows the LMSys folks, please shoot me a DM with them as well.

    [01:36:50] Alex Volkov: Replicate, who are doing great stuff, Perplexity, Mistral, there was Devendra, I think, from Mistral there as well, and there were also a bunch of friends of the pod who also receive grants. If you guys remember, we had a deep dive with Jon Durbin, of Bagel model fame, and he just recently started releasing a bunch of other stuff.

    [01:37:06] Alex Volkov: Eric Hartford, who released, I think, Laser, and now works at Abacus. Haotian Liu from LLaVA, and just a bunch of great folks in the open source community got together in San Francisco and talked to each other about techniques, about how important open source is, and they had a panel with folks from Mozilla and the Linux Foundation and Percy from Together AI as well.

    [01:37:27] Alex Volkov: That panel talked about the importance of open source, and what open source actually is. How do we treat open source in AI? Is weights only enough? Or is something like OLMo, that we've talked about, from the Allen Institute for AI, full open source, because they released the training code and datasets and Weights & Biases logs and all these things?

    [01:37:46] Alex Volkov: And so there was a great discussion about what open source actually means in this fully new AI world. Incredible to meet all these folks. Just shout out to Rajko and Marco for organizing this and inviting us. And I promised a report, and this is the report. And I will definitely add to the show notes the summary that Rajko did, because they also did a report on open source stuff.

    [01:38:07] Alex Volkov: It's worth looking into this, how many folks are downloading TheBloke's models. Maybe you saw this, LDJ, as well. So many folks download TheBloke's models that when TheBloke, I think, disappeared for three days or something, people were like, is he okay? There are no new GGUFs on Hugging Face.

    [01:38:24] Alex Volkov: What happened? Is he all right? So many people have gotten used to this. TheBloke is also a receiver of the A16Z grant. And so that's what I learned at Weights & Biases this week. I also visited the office. Those of you who follow me have probably seen my ridiculous video where I walked around the office showing Weights & Biases dashboards in virtual space.

    [01:38:44] Alex Volkov: And I really had a great time there. We also met with Swyx and some of his folks at Swyx's smol house, so shout out Swyx and Alessio from the Latent Space pod for, first of all, hosting me, and second of all, being great friends of the pod. Honestly, ThursdAI would not exist as a podcast and newsletter without Swyx and Alessio.

    [01:39:03] Alex Volkov: And also they're coming up on their one year anniversary for Latent Space. So if I can send them love and subscribers, please go check out Latent Space as well. Happy birthday, folks. And I think we're going to move on to two new things in the AI art and diffusion area, and then we're just going to do a recap.

    [01:39:20] Alex Volkov: And I think for this, I do have a transition. Let's see. No, I have a transition for this. Yes.

    [01:39:47] Alex Volkov: And Alignment just dropped; I wanted to hear what he was actually saying, but he had issues with the space even before. But we did have a transition, and folks, this week is big. This week is big. You guys know that we only talk about [01:40:00] AI when it's huge, and this week was huge. Starting off this week, ByteDance released SDXL Lightning, which takes SDXL, which we've talked about, one of the best open source diffusion models, and makes it incredible in just one step.

    [01:40:15] Alex Volkov: So if you ever use Stable Diffusion, if you ever run it yourself, the sweet spot is somewhere between 35 and 50 steps, depending on which, I forgot what it's called, the sampler? It depends on what you use there, somewhere between 35 and 50 steps.

    [01:40:33] Alex Volkov: We obviously had some advancements before, we've seen SDXL Turbo, and SDXL Lightning generates incredible images in just one or two steps. It's unbelievable how fast this is. And of course, our friends of the pod from Fal.ai are putting this in production and you can play with their demo.

    [01:40:52] Alex Volkov: The demo is called, I'm going to put this in the show notes, fastsdxl.ai. And the demo is near real time. You type and it generates images. And it's not the LCM stuff that we've talked about, if you guys remember the latent consistency models, that's something else. This is a full SDXL generation running in two or four steps.

    [01:41:12] Alex Volkov: And it looks incredible, like 1024 resolution text to image generation. ByteDance optimized the crap out of this SDXL and it's really mind blowing. I really suggest you go and play with it at fastsdxl.ai. I played with this yesterday, and what I built is added to the show notes as well.
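
    For readers who want to try the few-step trick locally rather than through the hosted demo, here is a minimal sketch roughly following the recipe on the ByteDance SDXL-Lightning model card. The checkpoint filename and the "trailing" scheduler setting are my assumptions from that card, so verify them against the current repo before relying on this.

    ```python
    # Minimal sketch: 4-step SDXL-Lightning inference with diffusers.
    # Checkpoint name and scheduler settings assumed from the model card.
    import torch
    from diffusers import StableDiffusionXLPipeline, UNet2DConditionModel, EulerDiscreteScheduler
    from huggingface_hub import hf_hub_download
    from safetensors.torch import load_file

    base = "stabilityai/stable-diffusion-xl-base-1.0"
    repo = "ByteDance/SDXL-Lightning"
    ckpt = "sdxl_lightning_4step_unet.safetensors"  # 2- and 8-step variants also exist

    # Swap the distilled Lightning UNet into a regular SDXL pipeline.
    unet = UNet2DConditionModel.from_config(base, subfolder="unet").to("cuda", torch.float16)
    unet.load_state_dict(load_file(hf_hub_download(repo, ckpt), device="cuda"))
    pipe = StableDiffusionXLPipeline.from_pretrained(
        base, unet=unet, torch_dtype=torch.float16, variant="fp16"
    ).to("cuda")

    # Lightning checkpoints are trained for trailing timesteps and no CFG.
    pipe.scheduler = EulerDiscreteScheduler.from_config(
        pipe.scheduler.config, timestep_spacing="trailing"
    )

    image = pipe(
        "a pirate ship sailing through a storm, cinematic lighting",
        num_inference_steps=4,
        guidance_scale=0,
    ).images[0]
    image.save("lightning.png")
    ```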

    [01:41:34] Alex Volkov: I wanted to see what's possible when we have an LLM that's near instant. So we had the chat today with the Groq folks, and if you're just joining us now, you can hear that chat after I publish the episode. Their LLM runs at like 500 tokens a second, so basically answers appear in near instant time, but with SDXL Lightning, SDXL diffusion also appears in near instant time.

    [01:41:56] Alex Volkov: And I played with a demo of this, and I'm gonna add the video to the show notes as well, and I was just blown away by how responsive things feel. And so the demo that I built was using Neal Agarwal's game, it's called Infinite Craft, where you just drag concepts on top of each other, and he uses AI to generate what those two concepts mean, basically.

    [01:42:17] Alex Volkov: Neal, in this Infinite Craft thing, used emojis. So if you combine Earth and, I don't know, fire or something, you get volcano, so he has the emoji of a volcano, right? He has an AI that picks out the best emoji for this one thing. And I said, hey, emoji is fun, but what if we generate a full-on SDXL image on every turn that I play this game?

    [01:42:37] Alex Volkov: And I did this with Groq. I used Mixtral behind the scenes to be the prompt engineer, to take these concepts and actually write a nice prompt for SDXL. And with two steps or four steps, overall from dragging this to getting Mixtral to be my prompt engineer, and my initial system message is around a thousand tokens, right?

    [01:42:57] Alex Volkov: So I'm sending a thousand tokens or so. Probably, maybe less than a thousand, maybe five hundred. And I get an instant answer from Groq, because their speed is ridiculous. I then send this to Fal, to their API, to do SDXL Lightning. And I get an image, and it's super fast, it's also ridiculous. I think overall, for some incredible examples, I got less than 300 milliseconds from going to an LLM, generating a prompt, taking this prompt, sending it to an image model, and getting an image back.

    [01:43:24] Alex Volkov: Under 300 milliseconds. I will remind you that folks from Google, a long time ago, did a research study showing that everything under 250 milliseconds feels almost real time to humans, imperceptible, for clicks and reactions. And now we're getting multiple models in a pipeline together responding in under 300 milliseconds.

    [01:43:43] Alex Volkov: And it's incredible. And honestly, I cannot release this demo because I didn't build the UI, so I cannot give you the UI. However, I can probably send you the extension code if you want it and you have your own API keys for Groq. I was blown away by how easy and fast this was.
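
    To make the shape of that demo concrete, here is a rough sketch of the two-stage pipeline: an LLM on Groq's OpenAI-compatible endpoint turns a pair of dragged concepts into an SDXL prompt, and a fast image endpoint renders it. The Groq base URL and the Mixtral model id are as documented at the time of writing; `FAL_ENDPOINT`, its JSON fields, and the system prompt are placeholders I made up for illustration, not a real API.

    ```python
    # Sketch of the "instant" pipeline: LLM prompt-engineer -> few-step image model.
    # Groq's OpenAI-compatible API is real; the image endpoint below is a placeholder.
    import os
    import time
    import requests
    from openai import OpenAI

    groq = OpenAI(api_key=os.environ["GROQ_API_KEY"],
                  base_url="https://api.groq.com/openai/v1")

    FAL_ENDPOINT = "https://example.invalid/sdxl-lightning"  # placeholder URL

    SYSTEM = ("You are a prompt engineer. Given two game concepts, write one vivid, "
              "concise SDXL prompt that depicts their combination.")  # illustrative

    def combine(concept_a: str, concept_b: str) -> bytes:
        t0 = time.perf_counter()

        # Stage 1: Mixtral on Groq writes the image prompt (hundreds of tokens/sec).
        chat = groq.chat.completions.create(
            model="mixtral-8x7b-32768",
            messages=[{"role": "system", "content": SYSTEM},
                      {"role": "user", "content": f"{concept_a} + {concept_b}"}],
            max_tokens=80,
        )
        image_prompt = chat.choices[0].message.content.strip()

        # Stage 2: send the prompt to a few-step SDXL Lightning endpoint.
        img = requests.post(FAL_ENDPOINT, json={"prompt": image_prompt, "steps": 4},
                            timeout=10)

        print(f"round trip: {(time.perf_counter() - t0) * 1000:.0f} ms")
        return img.content

    if __name__ == "__main__":
        combine("earth", "fire")
    ```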

    [01:43:59] Alex Volkov: And just two of these in the same week, two of these speed advancements. So SDXL Lightning, two steps for incredible image generation, and then Groq as well. So this is an answer to folks who are saying, why do we even need this speed? I saw somebody say, hey, why do you even need 400 tokens a second?

    [01:44:17] Alex Volkov: People cannot read fast enough. And this is the answer for this, because interfaces can happen in near real time. And it's incredible. And the second big thing in AI art and diffusion happened as breaking news. So we're gonna, we're gonna do this.

    [01:44:41] Alex Volkov: Folks, we have breaking news. And LDJ, you've been saying this for a while now: Emad Mostaque from Stability AI announces Stable Diffusion 3. Stable Diffusion 3 uses a new architecture that we've talked about first with Tanishq and folks from HDiT, the Hourglass Diffusion Transformers, and also from Sora, a Diffusion Transformer architecture, where they take the best of both worlds from this last generation of gen AI and combine them together.

    [01:45:11] Alex Volkov: And Stable Diffusion 3 is going to be a Diffusion Transformer. And it's impressive. So we only got a waitlist, so unlike previously, where Stable Diffusion just dropped, now it's a waitlist that you have to sign up for, but shout out to the folks at Stability because it looks incredible. It's very impressive, and some examples you can check out in the newsletter that I'm gonna send; some examples are under the hashtag SD3.

    [01:45:36] Alex Volkov: On X, it has very impressive multi-subject prompt following, and I can show you an example of this later in the show notes, but a prompt like "Painting of an astronaut riding a pig wearing a tutu, holding a pink umbrella, on the ground next to the pig is a robin bird wearing a top hat, in the corner there are the words 'stable diffusion'", and this image is perfect.

    [01:45:56] Alex Volkov: All of the subjects and the different things that I told you exist in this picture: the robin bird is on the ground and has a top hat, the astronaut is holding an umbrella, and the pig is wearing a tutu. So the understanding of the text is perfect. And understanding of multiple subjects, I think, is something we've seen done great in DALL-E, for example, but previous versions of Stable Diffusion were not nearly as good at multiple prompts, multiple subjects, and multiple colors, and this nails all of them.

    [01:46:22] Alex Volkov: The umbrella is the right color, the tutu is the right color, the bird, everything. And it looks just really awesome. And I gotta wonder what something like this, combined with the speed from the previous SDXL Lightning announcement, could mean. They're advancing very fast as well, and it's great to see.

    [01:46:39] Alex Volkov: Breaking news, shoutout to Stability for announcing this. They didn't release it yet, they announced Stable Diffusion 3. It's coming to us very soon, and it looks awesome. And I think, unless folks here on stage want to chat about some other stuff that we haven't covered yet,

    [01:46:56] Alex Volkov: this is everything we've talked about on ThursdAI. Outside of that, we had our returning hosts and co-hosts and speakers on the panel. So I want to thank Nisten, I want to thank Yam. LDJ was here, Junyang from Qwen, and a bunch of other folks. I want to shout out Matt Shumer again, and Mark Heaps from Groq, for joining and telling us all about this.

    [01:47:14] Alex Volkov: And if you missed any part of this conversation, definitely feel free to check us out. With that, I want to say thank you for joining ThursdAI as always. I think we're coming up on almost exactly two hours, and I'm gonna let you go, and then we'll see what else gets released on this crazy AI Thursday.

    [01:47:31] Alex Volkov: Thank you everyone.



  • Holy SH*T,

    These two words have been said on this episode multiple times, way more than ever before I want to say, and it's because we got 2 incredibly exciting breaking news announcements in a very, very short amount of time (in the span of 3 hours), and the OpenAI announcement came as we were recording the space, so you'll get to hear our live reaction to this insanity.

    We also had 3 deep-dives, which I am posting in this week's episode. We chatted with Yi Tay and Max Bain from Reka, who trained and released a few new foundational multimodal models this week, and with Dome and Pablo from Stability, who released a new diffusion model called Stable Cascade, and finally I had a great time hanging with Swyx (from Latent Space) and got a chance to turn the microphone back at him, with a conversation about Swyx's background, Latent Space, and AI Engineer.

    I was also very happy to be in SF today of all days, as my day is not over yet; there's still an event which we co-host together with A16Z, folks from Nous Research, Ollama and a bunch of other great folks, just look at all these logos! Open Source FTW 👏

    TL;DR of all topics covered:

    * Breaking AI News

    * 🔥 OpenAI releases SORA - text to video generation (Sora Blogpost with examples)

    * 🔥 Google teases Gemini 1.5 with a whopping 1 MILLION tokens context window (X, Blog)

    * Open Source LLMs

    * Nvidia releases Chat With RTX local models (Blog, Download)

    * Cohere open sources Aya 101 - a 12.8B model supporting 101 languages (X, HuggingFace)

    * Nomic releases Nomic Embed 1.5 + with Matryoshka embeddings (X)

    * Big CO LLMs + APIs

    * Andrej Karpathy leaves OpenAI (Announcement)

    * OpenAI adds memory to chatGPT (X)

    * This weeks Buzz (What I learned at WandB this week)

    * We launched a new course with Hamel Husain on enterprise model management (Course)

    * Vision & Video

    * Reka releases Reka-Flash, 21B & Reka Edge MM models (Blog, Demo)

    * Voice & Audio

    * WhisperKit runs on WatchOS now! (X)

    * AI Art & Diffusion & 3D

    * Stability releases Stable Cascade - new AI model based on Würstchen v3 (Blog, Demo)

    * Tools & Others

    * Goody2ai - A very good and aligned AI that does NOT want to break the rules (try it)

    🔥 Let's start with Breaking News (in the order of how they happened)

    Google teases Gemini 1.5 with a whopping 1M context window

    This morning, Jeff Dean released a thread, full of crazy multi modal examples of their new 1.5 Gemini model, which can handle up to 1M tokens in the context window. The closest to that model so far was Claude 2.1 and that was not multi modal. They also claim they are researching up to 10M tokens in the context window.

    The thread was chock full of great examples, some of which highlighted the multimodality of this incredible model, like being able to pinpoint and give a timestamp of an exact moment in an hour long movie, just by getting a sketch as input. This, honestly blew me away. They were able to use the incredible large context window, break down the WHOLE 1 hour movie to frames and provide additional text tokens on top of it, and the model had near perfect recall.

    They used Greg Kamradt's needle in a haystack analysis on text, video and audio and showed incredible, near perfect recall, which highlights how much advancement we got in the area of context windows. Just for reference, less than a year ago we had a chart from Mosaic when they released MPT, with a Y axis topping out at 60K; the graph above goes to 1 MILLION, and we're less than a year apart. Not only that, Gemini Pro 1.5 is also multimodal.
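
    If you want to run this kind of eval yourself, the needle-in-a-haystack idea is simple enough to sketch: bury a known fact at varying depths of a long filler document, ask the model to retrieve it, and score recall per (context length, depth) cell. The `ask_model` function below is a placeholder for whichever long-context API you are testing.

    ```python
    # Minimal needle-in-a-haystack harness (sketch).
    # ask_model(prompt) is a placeholder for the long-context model under test.

    NEEDLE = "The magic number Alex hid in this document is 48151623."
    QUESTION = "What is the magic number hidden in the document? Answer with digits only."
    FILLER = "The quick brown fox jumps over the lazy dog. "  # stand-in haystack text

    def build_haystack(total_words: int, depth: float) -> str:
        words = (FILLER * (total_words // 9 + 1)).split()[:total_words]
        words.insert(int(len(words) * depth), NEEDLE)  # bury the needle at a given depth
        return " ".join(words)

    def run_grid(ask_model, lengths=(1_000, 10_000, 100_000),
                 depths=(0.0, 0.25, 0.5, 0.75, 1.0)):
        results = {}
        for n in lengths:
            for d in depths:
                prompt = f"{build_haystack(n, d)}\n\n{QUESTION}"
                answer = ask_model(prompt)
                results[(n, d)] = "48151623" in answer  # did it recall the needle?
        return results

    # Example with a dummy model that always fails, just to show the scoring shape:
    print(run_grid(lambda prompt: "I don't know"))
    ```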

    I've got to give props to the Gemini team, this is quite a huge leap for them, and for the rest of the industry this is a significant jump in what users will expect going forward! No longer will we be told "hey, your context is too long" 🤞

    A friend of the pod, Enrico Shippole, joined the stage, you may remember him from our deep dive into extending Llama's context window to 128K, and showed that a bunch of new research makes all this possible for open source as well, so we're waiting for OSS to catch up to the big G.

    I will sum up with this, Google is the big dog here, they invented transformers, they worked on this for a long time, and it's amazing to see them show up like this, like they used to do, and blow us away! Kudos 👏

    OpenAI teases SORA - a new giant leap in text to video generation

    You know what? I will not write any analysis, I will just post a link to the blogpost and upload some videos that the fine folks at OpenAI just started releasing out of the blue.

    You can see a ton more videos on Sam twitter and on the official SORA website

    Honestly I was so impressed with all of them, that I downloaded a bunch and edited them all into the trailer for the show!

    Open Source LLMs

    Nvidia releases Chat With RTX

    Chat With Notes, Documents, and Video

    Using a Gradio interface and packing 2 local models, Nvidia released a bundle of open source AI, including RAG and even YouTube transcription chat!

    Chat with RTX supports various file formats, including text, pdf, doc/docx, and xml. Simply point the application at the folder containing your files and it'll load them into the library in a matter of seconds. Additionally, you can provide the url of a YouTube playlist and the app will load the transcriptions of the videos in the playlist, enabling you to query the content they cover.

    Chat for Developers

    The Chat with RTX tech demo is built from the TensorRT-LLM RAG developer reference project available from GitHub. Developers can use that reference to develop and deploy their own RAG-based applications for RTX, accelerated by TensorRT-LLM.
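
    The RAG pattern behind a demo like this is easy to sketch independently of NVIDIA's TensorRT-LLM reference code: chunk the documents, embed the chunks, retrieve the nearest ones for a question, and stuff them into the prompt. The sketch below uses sentence-transformers purely as an illustration of the pattern; it is not the Chat with RTX implementation, and the chunk size and model name are arbitrary choices.

    ```python
    # Generic RAG sketch (not NVIDIA's implementation): embed chunks, retrieve top-k,
    # and build a grounded prompt for whatever local LLM you run.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    embedder = SentenceTransformer("all-MiniLM-L6-v2")

    def chunk(text: str, size: int = 400) -> list[str]:
        words = text.split()
        return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

    def build_index(docs: list[str]):
        chunks = [c for d in docs for c in chunk(d)]
        vecs = embedder.encode(chunks, normalize_embeddings=True)
        return chunks, np.asarray(vecs)

    def retrieve(question: str, chunks, vecs, k: int = 3) -> list[str]:
        q = embedder.encode([question], normalize_embeddings=True)[0]
        scores = vecs @ q  # cosine similarity, since vectors are normalized
        return [chunks[i] for i in np.argsort(-scores)[:k]]

    def rag_prompt(question: str, chunks, vecs) -> str:
        context = "\n---\n".join(retrieve(question, chunks, vecs))
        return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    ```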

    This weeks Buzz (What I learned with WandB this week)

    We just released a new course! Hamel Husain released a course on enterprise model management!

    Course name: Enterprise Model Management

    Course link: wandb.me/emm-course

    Who is this for: The course is targeted at enterprise ML practitioners working with models: MLOps engineers, ML team leaders, ML engineers. It shows, both at a conceptual and technical level, how to get the most value out of the W&B Model Registry and automations. Attached is also a screenshot of a slide from the course on what different personas (MLOps, ML exec, etc.) get from the Model Registry.

    What can they expect: Learn how to store, version, and evaluate models like top enterprise companies do today, using an LLM training & evaluation example. Big value props: improved compliance, collaboration, and disciplined model development.

    Vision & Video

    Reka releases Reka Flash and Reka Edge multimodal models

    Reka, co-founded by Yi Tay, previously of DeepMind, trained and released 2 foundational multimodal models. I tried them and was blown away: they not only understand text and perform VERY well on benchmarks (73.5 MMLU / 65.2 on HumanEval) but also boast incredible (honestly, never before seen by me) multimodal capabilities, including understanding video!

    Here's a thread of me getting my head continuously blown away by the quality of the tonality of this multimodality (sorry...😅)

    I uploaded a bunch of video examples and was blown away, it understands tonality (with the dive dive Diiiiive example) understands scene boundaries, and does incredible OCR between scenes (the Jason/Alex example from speakers)

    AI Art & Diffusion

    Stable Cascade (link)

    Stability AI introduced a new text-to-image generation model called Stable Cascade that uses a three-stage approach to produce high-quality images with a compressed latent space, making it more efficient to train and use than previous models. It achieved better results than other models in evaluations while having faster inference speeds. The company released code to train, fine-tune, and use control models like inpainting with Stable Cascade to enable further customization and experimentation. Stability AI aims to lower barriers to AI development through models like this one.

    Nate did a comparison between a much slower SDXL and Stable Cascade here:

    Here’s the transcript for the whole episode, you definitely should check it out! It was really one of the coolest shows we had, and we had over 2K folks listening in!

    [00:00:00] Alex Volkov: Hey, this is Alex Volkov, you're on ThursdAI, and I just gotta record this intro real quick, because today marks one of the more singular days in AI that I remember since I started recording ThursdAIs, which was itself a singular day, March 14th, 11 months ago, when GPT-4 was released and announced. We've since had a few days like this; Dev Day was one such day, and today marks another one.

    [00:00:38] Alex Volkov: Google has released an update to their model, talking about 1 million tokens in the context window, basically unlimited. And then, just an hour or two later, OpenAI said, you know what, we also have something in store, and released the most incredible jump in capability of video generation, text to video generation.

    [00:01:02] Alex Volkov: It's called SORA, and what you hear is us recording live, knowing only about Google, which came out an hour and a half before we started recording, and then somewhere in the middle, I think minute 35 or something, you'll hear our live reaction to the Incredibly mind blowing advancement in text to video that OpenAI just released.

    [00:01:31] Alex Volkov: And I just wanted to record this as I'm finishing up the editing and about to start writing the newsletter, to say, days like this really are the reason why I'm all in on AI and I'm very excited about the changes and advancements.

    [00:01:49] Alex Volkov: And I'm sure there will be more days like this going forward. We've yet to see what Apple comes up with, we've yet to really see what Meta comes up with for Llama 3, etc. And, yeah, I just hope you enjoy this, and I don't have a lot of words here besides just letting you listen to the rest of the episode, and saying that I was very happy to be in San Francisco for this, the place where most of this happens, and I was very happy to be in the company of good friends, both in the virtual world, those on stage in our Twitter live recording, and in person, sitting across from Swyx, a friend of mine with whom I recorded an interview that you can hear at the end of this.

    [00:02:30] Alex Volkov: I just couldn't let go of this chance. We also had conversations, besides the updates and the breaking news, with the folks who worked on some of the stuff we talked about. I interviewed Yi Tay and Max Bain from Reka, which you'll hear later, and the deep dive into Reka's multimodal models, which blew me away just yesterday.

    [00:02:52] Alex Volkov: And so my head kept getting blown away this week. And I also interviewed The folks who built Stable Cascade, a new stability model that outperforms the existing stability models. Dome, and Pablo. And all of those were great conversations, in addition to just generally the folks who joined me from week to week, Nisten and Far El and Alignment Lab, and we had Robert Scoble join us, with whom I've been buddying up since Vision Pro was released, as he was expecting, and that blew me away just a week ago.

    [00:03:23] Alex Volkov: And I'm very excited to share with you this whole thing, and I hope that, yeah, I hope you enjoy this as much as I do, and that you enjoy listening to these as much as I enjoy making them. And if you do, just share them with a friend, it would really help. And give us a 5 star review on Apple.

    [00:03:38] Alex Volkov: That would greatly help. With that, I'll give you the ThursdAI thing.

    [00:03:43] Alex Volkov: All right, let's go. How's it going, everyone? Welcome to ThursdAI. Today is February 15th, and it's quite a day in the AI updates that we've had so far. Quite a day. Even today, this morning, we had a bunch of updates. But besides those, we had quite a crazy week as well. Very interesting show today.

    [00:04:13] Alex Volkov: My name is Alex Volkov, I'm an AI evangelist with Weights & Biases, and right now I'm getting my selfie taken by today's co-host, Swyx. Welcome,

    [00:04:23] Swyx: Hey, hey, hey. Good morning, everyone.

    [00:04:25] Alex Volkov: And we're in the Latent Space Studio in San Francisco. I flew in just last night. And as I was flying in, there was more news happening. So we're going to cover all of this.

    [00:04:34] Alex Volkov: We have a very exciting show today. We have a bunch of guests, special guests that are coming in the second hour of this. So hopefully we'll see folks from the Reka models, and hopefully we'll see some folks from Stability. We're going to get to talk about Google and everything in between. So meanwhile, settle in.

    [00:04:50] Alex Volkov: This is going to be a great show today in San Francisco. And maybe I'll also share with you why I flew in here today. That's gonna come up next. So welcome to ThursdAI, and we're gonna get started. All right there. Let's get started. Let me smoothly fade out the music and say hi to everyone here on stage. Hey, Nisten, welcome. We have Robert Scoble over here, folks. We've been more friendly lately than usual because Robert and I are both members of the Vision Pro cult. I think that's what you call it, Robert.

    [00:05:37] Alex Volkov: But today is, today's the space for AI. But Robert, you've been covering AI on your feed as well for a long time. We have, obviously, Swyx on stage, but also in front of me, which is super cool. And it's been a while, brother. It's great, you just flew back from

    [00:05:51] Swyx: Singapore.

    [00:05:52] Swyx: Yeah, Chinese New Year.

    [00:05:53] Alex Volkov: Are you jet lagged at all or are you good?

    [00:05:55] Swyx: I'm good actually. I have had very little sleep, but for some reason that always helps with the jet lag.

    [00:06:00] Alex Volkov: Yes, awesome. And I also want to say hi to Alignment Labs, Austin and Far El as well, folks who are working on open source models, and we usually cover a bunch of stuff that they're doing, and usual co hosts and experts here on ThursdAI.

    [00:06:11] Alex Volkov: So if you've never joined ThursdAI before, just a brief recap of what we're doing. As I said before, my name is Alex Volkov. I'm an AI evangelist with Weights & Biases. It's always so fun to say. And Weights & Biases is a company that is basically helping all these companies build their AI models, and it's super cool.

    [00:06:26] Alex Volkov: And I flew in, I went to the office last night, and I have some cool videos to share with you from the office as well.

    [00:06:32] Alex Volkov: And this is ThursdAI. ThursdAI is a Twitter space and newsletter and podcast that I started a year ago. And then slowly this built a community of fine folks who show up to talk about everything that happened in the world of AI for the past week.

    [00:06:46] Alex Volkov: And there haven't been many weeks like this last one, weeks that highlight how important and how cool ThursdAI actually is, because we just had so much to cover today. Usually I start the space with a roundup of the stuff that we're going to run through, just for folks who are not patient and don't have a lot of time; we run through everything we're going to talk about and then we dive deep, because we have some breaking news, and I even have, hopefully, my breaking news button.

    [00:07:16] Alex Volkov: Oh, I don't. Oh my God. Okay.

    [00:07:17] Swyx: Oh no.

    [00:07:17] Alex Volkov: I'm not set up for a breaking news button, but it's fine.

    [00:07:20] Alex Volkov: We'll imagine this. I'm going to put this in the, in the, in the post edit. With that said, are you guys ready for a brief recap? Let's go to a brief recap.

    [00:07:27] Recap and TL;DR

    [00:07:27] Alex Volkov: Alright, folks, back for the recap. Today is Thursday. ThursdAI, February 15th. This is a recap of everything we talked about. And, ooh, boy, this was one of the worst days to be caught outside of my own personal production studio because my, my breaking news button didn't make it all the way here. And there was so much breaking news.

    [00:07:57] Alex Volkov: So obviously as I woke up, the biggest breaking news of today was... actually, I cannot decide what the biggest breaking news was. So the first piece of breaking news from today was Google releasing a teaser of Gemini 1.5. And 1.5 was not only a continuation of the Gemini Pro that we got last week, 1.5 was actually teased with up to 1 million, a whopping 1 [00:08:20] million tokens in the context window, which is incredible.

    [00:08:23] Alex Volkov: Just for comparison, ChatGPT is currently at 128K, and the best, highest offering up until Gemini was 200K with Anthropic's Claude, and Google teased this out of the gate with 1 million tokens, and they claim they have up to 10 million tokens of context window in the demos, which is incredible.

    [00:08:44] Alex Volkov: And they've shown a bunch of demos. They did the needle in a haystack analysis that we've talked about, from Greg Kamradt, and it's just quite an incredible release from them. They talked about how you can put in a whole hour of a movie, a Buster Keaton movie I think it was, and then you can actually ask questions about the movie and it will give you the exact

    [00:09:03] Alex Volkov: timestamp of when something happens. They talked about it being multimodal, where you can provide a sketch and say, hey, when did this scene happen, and it will pull it out, just like magic, mind-blowing stuff. And all of this needs a lot of context, because you take this video, you turn it into images, and you send that into the context.

    [00:09:22] Alex Volkov: They also talked about how you can send 10 hours of audio within one prompt, and the quality of retrieval is very, very high. You're talking about 90-plus percent, 95-plus percent on the haystack analysis, which is incredible. Again, we had Enrico Shippole, a friend of the pod who worked on the YaRN paper and the RoPE methods for extending the Llama context.

    [00:09:46] Alex Volkov: And he brought like four papers or something that show that open source is actually unlocking this ability as well. And not only was today an incredible day generally, and not only did Google talk about a large context window, we also saw that Nat Friedman and Daniel Gross just invested 100 million in a company called Magic, which also talks about multimodality and a large context window, up to 1 million as well.

    [00:10:08] Alex Volkov: So it was very interesting to see both of them release on the same day as well. We then geeked out about Gemini. We talked about Andrej Karpathy leaving OpenAI, and invited him to come to ThursdAI and Latent Space as well. And then we also mentioned that OpenAI added memory and personalization to ChatGPT, which is super cool.

    [00:10:25] Alex Volkov: They didn't release it to many people yet, but personalization is my personal thread of 2024, because these models, especially with larger context windows and perfect recall, will become our buddies that remember everything about us, especially tied into different devices.

    [00:10:43] Alex Volkov: Like the Tab that's somewhere here behind me, which is getting built in San Francisco. We briefly mentioned that NVIDIA released Chat with RTX, local models that you can download and run on your NVIDIA GPUs. It has RAG built in, it has chat with YouTube videos, and it's super cool. We talked about Cohere releasing Aya 101, a big multilingual model.

    [00:11:01] Alex Volkov: And our friend of the pod Far El was talking about how he wasn't finding it super impressive. Unfortunately, he dropped in the middle of this. Apologies, Far El, but Cohere released a big multilingual model, which is also pretty cool. We mentioned that Nomic, our friends at Nomic, which we mentioned last week, released open source embeddings.

    [00:11:17] Alex Volkov: If you guys remember, they released an update to those embeddings, Nomic Embed 1.5, with Matryoshka embeddings. Matryoshka is obviously the name of the Russian dolls that sit one inside the other. And we're actually going to talk with the authors of the Matryoshka paper, not next Thursday, the one after that.

    [00:11:34] Alex Volkov: So we're going to cover Matryoshka, but it's what OpenAI apparently used, not apparently, confirmed they used, to reduce dimensions in their embeddings API. Super cool. We're going to dive deep into this. I'm going to learn, you're going to learn. It's going to be super cool.
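
    The practical trick with Matryoshka-trained embeddings is that you can keep just the first N dimensions and re-normalize, trading a little accuracy for a lot of storage. Here is a tiny sketch of that idea, with a random vector standing in for a real embedding; how gracefully accuracy degrades depends on the embedding model actually being trained the Matryoshka way.

    ```python
    # Sketch: shrinking Matryoshka-style embeddings by truncating dimensions.
    # These models pack the most important information into the leading
    # dimensions; always re-normalize after truncating.
    import numpy as np

    def truncate_embedding(vec: np.ndarray, dims: int) -> np.ndarray:
        small = vec[:dims]
        return small / np.linalg.norm(small)

    # Toy example: a random "768-dim embedding" standing in for a real one.
    full = np.random.randn(768)
    full = full / np.linalg.norm(full)
    for d in (768, 256, 64):
        print(d, truncate_embedding(full, d)[:3])
    ```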

    [00:11:48] Alex Volkov: And as we're talking about OpenAI I got a ping on my phone because I'm subscribed to all updates from their main account and we had a collective holy s**t moment. Everybody's jaw was on the floor because OpenAI just released Sora, which is a foundational video model, text to video model, that just blew us the F away, pardon my French, because of the consistency.

    [00:12:08] Alex Volkov: If you've seen, how should I say, the area of video generation, it has been evolving fairly quickly, but not as quickly as what we just saw. First we saw attempts at taking Stable Diffusion and rendering frame by frame, and the consistency wasn't there; moving from one frame to another, the face would change and everything.

    [00:12:30] Alex Volkov: You guys saw this, right? So we moved from those hallucinatory kinds of videos towards consistent videos, where Stability recently released SVD, which gave us one-to-three-second videos. RunwayML gives you the option to choose where the video is going to go, whether it's going to zoom in, with brushes, all these things.

    [00:12:49] Alex Volkov: And now all of them seem just so futile, because OpenAI's Sora can generate up to 60 seconds of video. And honestly, we were sitting here, all of us just opened the Sora website, and we were just blown away by the consistency and the complexity of the scenes that you can generate, the reflections.

    [00:13:06] Alex Volkov: There was one scene where a woman was walking through a very busy street in Japan, and her coat stays the same, her face stays the same. There's another where a Dalmatian dog climbs out of one window and jumps into another, and all the spots on the Dalmatian stay perfectly in place, the legs too. It's really unbelievable how high quality a thing OpenAI released. And what's also unbelievable to me is that the jump from what we saw in video before, the open source stuff, or even the Runway and Pika stuff, the jump in fidelity, in quality, in consistency, is so much bigger than the jump from 200,000 tokens to 1 million tokens that Google did.

    [00:13:44] Alex Volkov: So it does feel like some folks in OpenAI sat there and said, Hey, Google just released something. It's super cool. It's picking up attention on Twitter. Let's release something else that we have behind the scenes. It looked super polished. So shout out to the folks who worked on Sora. It's really, if you haven't seen The videos, you'll see them in show notes and definitely you'll see them everywhere because Hollywood is about to get seriously, seriously disrupted with the, just the level of quality is amazing.

    [00:14:08] Alex Volkov: Compare this with all the vision and sound stuff. I, moving back to the recap, I'm getting excited again. We also then talked about Reka, and Reka Flash and Reka Edge from a company called Reka AI. And, as I love bringing the people who actually built the thing to talk about the thing,

    [00:14:23] Alex Volkov: we had Yi Tay and we had Max Bain as well from Reka, to talk to us about their multimodal models. I was very, very impressed with Reka's multimodal understanding. And that's compared to Gemini Pro, which is probably huge and runs on all the GPUs and TPUs; this model is 21 billion parameters, and Reka Edge is even smaller.

    [00:14:41] Alex Volkov: And yet it was able to understand my videos to an extent that even surprised the guys who are the co-founders of the company. It understood tonality, understood text and audio, in a very specific and interesting way. So we had a conversation with the Reka folks. And continuing on this thread, we also had a new model from Stability called Stable Cascade that is significantly faster than SDXL and generates hands and text out of the box.

    [00:15:07] Alex Volkov: It's based on something called Würstchen, which we learned today is a little sausage, a hot dog, in German. And we had the folks behind this, Dome, and I'm blanking on the name of the other author who joined, I apologize, it was a very exciting day. So we had a conversation with the guys behind Würstchen and Stable Cascade as well.

    [00:15:24] Alex Volkov: So definitely check this out. We mentioned that WhisperKit now runs on watchOS, which is quite incredible because Siri's voice to text is still not that great. And I think that's most of what we discussed. And then I flipped the mic on my friend here who sits in front of me, and I just had a deep dive interview with Swyx.

    [00:15:41] Alex Volkov: From Latent Space; he just posted a few images as well, and it was a great conversation, so definitely worth a follow and a listen if you haven't listened to it. With that, I think we've recapped ThursdAI on one of the more seminal days that I remember in AI, one announcement after another, and we all hope that Meta will just release Llama 3

    [00:16:01] Investment updates from Swyx

    [00:16:01] Alex Volkov: Unless I missed some stuff that's very important. I'll just double check. Nisten, out of the stuff that we've sent, did I miss anything else? Swyx, did I miss anything else?

    [00:16:10] Swyx: Today there was also a LangChain Series A. True. With LangSmith.

    [00:16:13] Swyx: Yes. There was Magic.dev, Series A with Nat Friedman.

    [00:16:16] Alex Volkov: So I was thinking to cover this around the Google stuff because they also announced a longer context craziness.

    [00:16:21] Alex Volkov: But definitely, definitely both of those.

    [00:16:23] Swyx: Lambda Labs also, 300 million, Series C.

    [00:16:26] Alex Volkov: Oh, wow, yeah, I even commented, I said, hey, Mitesh, good. So we love Lambda, definitely. Most of the stuff that we play around with is happening on Lambda. And

    [00:16:34] Swyx: Lindy also had their GA launch today.

    [00:16:37] Alex Volkov: Nice. Okay.

    [00:16:38] Swyx: Today was a very bad day to launch [00:16:40] things, because everyone else launched things.

    [00:16:41] Swyx: Yes. If you're not Gemini, it's going to be a struggle

    [00:16:44] Alex Volkov: I was just thinking, Magic.dev, and I guess let's move on to discussing the breaking news of the hour, as we already are. Let's talk about Google, and Gemini 1.5.

    [00:16:55] Google teases Gemini Pro 1.5 with 1M context windows

    [00:17:27] Alex Volkov: Do we do a musical transition? Sure, let's do a musical one. This is not the breaking news music, not by even a stretch, but imagine that we have breaking news right now, because we do. Just an hour or so ago, we had an update from Jeff Dean, and then Sundar Pichai, and then a blog post, and then a whole thread and a bunch of videos from Google.

    [00:17:27] Alex Volkov: And if you guys remember some Google videos from before, these seem more authentic than the kind of quote-unquote fake video that we got previously with Gemini Ultra. So just a week after Google released Gemini Ultra, which is now available as Gemini Advanced, and just a week after they killed Bard almost entirely as a concept, they're now teasing.

    [00:17:48] Alex Volkov: Teasing, not releasing, teasing Gemini 1.5. They're teasing it and they're coming out with a bang. Honestly, folks, at least for me, that's how I expect Google to show up. Unlike before, where they were lagging behind GPT-4 by eight or nine months, what they're doing now is leading a category, or at least they're claiming they are.

    [00:18:07] Alex Volkov: And so they released Gemini 1.5, and they're teasing this with a whopping 1 million tokens in the context window in production, and up to 10 million tokens in the context window in research. And just to give context, they put out this nice animated video where they compare Gemini Pro, which they have currently, not the 1.5 version;

    [00:18:26] Alex Volkov: the Pro version is around 32K, I think, and then they have GPT-4 with 128K, and then they show Claude 2 at 200K, and then Gemini 1.5 is a whopping 1 million tokens, which is ridiculous. Not only that, they went a little bit further and released it with the needle in a haystack analysis from our friend Greg Kamradt, who usually does this.

    [00:18:50] Alex Volkov: We'll not be able to pronounce his name. I asked Greg to join us. Maybe he will. A needle in a haystack analysis that analyzes the ability of the model to recall whether or not it's able to actually process all these tokens and actually get them and understand what happens there. And quite surprisingly, they show like 99 percent recall, which is incredible.

    [00:19:10] Alex Volkov: And we all know, previously with long context windows, we had this dip in the middle. We've talked about the butter-on-toast analogy, where attention is like the butter and the context window is the toast, and you spread it and you don't have enough to spread evenly across the whole toast.

    [00:19:27] Alex Volkov: We've talked about this. It doesn't seem, at least on the face of it, that they are suffering from this problem, and that's quite exciting. It is exciting also because this model is multimodal, which is very important to talk about. They definitely show audio, and they are able to scrub through, I think they said, 10 hours of audio or so.

    [00:19:47] Alex Volkov: Which is quite incredible. Imagine dropping in 10 hours of audio and saying, hey, when did Alex talk about Gemini on ThursdAI? That would be super dope, and quite incredible. They also did video. They showed an hour of video of Buster Keaton's something, and because the model is multimodal, the cool thing they did is provide this model with a reference, with a sketch.

    [00:20:11] Alex Volkov: So they drew a sketch of something that happened during this video, not even talking about it, just a sketch. And they provided this multimodal model with an image of it and said, when did this happen in the video? And it found the right timestamp. And so I'm very, very excited about this. If you can't hear it in my voice, Swyx can probably tell you that it looks like I'm excited as well, because it's quite,

    [00:20:31] Alex Volkov: as far as I'm concerned, a breakthrough for multiple reasons. And now we're gonna have a short discussion.

    [00:20:35] Enrico talking about open source alternatives for long context

    [00:20:35] Alex Volkov: I want to say hi to Enrico here. Enrico, welcome up on stage. Enrico Shippole, one of the authors of the YaRN paper. And we've had Enrico before to talk to us about long context. Enrico, as we sent this news in DMs, you replied that there have been some breakthroughs lately that kind of point to this.

    [00:20:51] Alex Volkov: And you wanted to come up and say hi and introduce this briefly. And let's chat about long context.

    [00:20:57] Enrico Shippole: Hi, Alex. Yeah, so there actually have been a lot of research improvements within the last couple of months, even from before we submitted YaRN. You could still scale even transformers to essentially millions of context length back then. We previously, in YaRN, worked on scaling the rotary embeddings, which was a traditional issue in long context.

    [00:21:19] Enrico Shippole: So I, if you don't mind, I'll probably go through some of the research really quickly because unfortunately,

    [00:21:25] Enrico Shippole: so on January 2nd, there was one called "LLM Maybe LongLM", that's a mouthful essentially, but they were showing that you can process these long input sequences during inference using something called Self-Extend, which allows you to basically extend the context window without even fine-tuning these models.

    [00:21:48] Enrico Shippole: And then on January 7th, 2024, there was another paper released, called "Soaring from 4K to 400K", which allows you to extend the LLM's context with something called an activation beacon. With these activation beacons, they essentially condense the raw activations in these models into a very compact form, so the large language model can perceive this longer context

    [00:22:14] Enrico Shippole: even in a smaller context window. The great thing about these activation beacons, or the LLM Maybe LongLM approach, is that they essentially only take a few lines of code to modify the transformer architecture and get all these massive performance benefits for long context inference.

    [00:22:33] Alex Volkov: Are you serious? Are we getting one of those breakthroughs that take two lines of code, kind of...

    [00:22:37] Enrico Shippole: No, so basically all of these require minimal code changes to be able to scale to long token counts, whether it's audio, video, image, or text. Text is generally the shortest token count; if you look at something like RefinedWeb or SlimPajama, the average token count of a piece of text in those is only anywhere from 300 to 500 tokens.

    [00:23:02] Enrico Shippole: So this is actually generally a data-centric issue too, when you're talking about long context, even when training a standard natural language processing model. The thing about audio and video is that these have a ton of tokens in them. And the one good thing, and then the final note I'm going to put in, unfortunately, before I have to head out, I know this was a lot of information,

    [00:23:22] Enrico Shippole: I can link these.

    [00:23:24] Alex Volkov: Yeah, we're gonna add some of this, we're gonna add the links that I'm able to find, Enrico, if you can send them.

    [00:23:29] Enrico Shippole: Yeah, I'll send you all the research papers.

    [00:23:32] Alex Volkov: Yeah, you want to add one last thing before we move on? Yeah, go ahead.

    [00:23:36] Enrico Shippole: Yeah, so just the last thing: on January 13th there was this paper called "Extending LLMs' Context Window with Only 100 Samples", and they were essentially able to show that even with a very limited amount of long context samples, you're able to massively improve the context lengths of these models. I should mention these are the papers that I found did pretty rigorous evaluation overall, because there's a huge problem in long context evaluation. But I feel these authors generally applied their knowledge pretty well, and these results are really impactful, even for the open source community, because you don't need a lot of computational power to be able to scale these context windows massively now.

    [00:24:24] Enrico Shippole: And that's basically everything I wanted to say.

    [00:24:27] Alex Volkov: Thank you, Enrico. Thank you, folks. Folks, definitely give Enrico a follow. We have quite a few conversations with Enrico; if somebody in the open source community knows about long context, Enrico is that guy. And we're definitely going to follow up with links in the show notes for a bunch of this research.

    [00:24:41] Alex Volkov: And I think, just to sum up what Enrico said, there have been breakthroughs, and it doesn't look like Google is the only one showing up today. Nat Friedman and Daniel Gross, the guys who run AI Grant, who ran the Vesuvius Challenge recently, and who invest in everything AI possible, just announced an investment in Magic, a hundred million dollar investment, [00:25:00] quote unquote,

    [00:25:00] Alex Volkov: "We were so impressed with these guys that we decided to give them a hundred million dollars," from Nat Friedman. And they also talk about a model that does something like a 10 million token context window. Swyx, you wanna talk about the Magic thing?

    [00:25:12] Swyx: They first talked about this last year, like six months ago, and then went completely silent. So we didn't really know what was going on with them. So it's good to see that this is at least real because six months ago they were talking about 5 million token context model.

    [00:25:28] Swyx: But no, nothing was demoed. Not even a little teaser graphic or anything like that. But for Nat to have invested this amount, I think it's a huge vote of confidence. And it basically promises that you can do proper codebase embedding and reasoning over an entire codebase. Which, it's funny to have a code model that specifically does this, because Gemini could also potentially do this.

    [00:25:58] Alex Volkov: They showed Three.js in their examples. Did you see this?

    [00:26:01] Swyx: No, I didn't see the Three.js one, but okay, yeah. And we have a pretty consistent result from what we've seen so far that GPT-4 is simultaneously the best LLM, but also the best code model. There are a lot of open source code models, CodeLlama, DeepSeek Coder, all these things.

    [00:26:18] Swyx: They're not as good as GPT-4. So I think there's a general intelligence lesson to be learned here. It remains to be seen, because Magic did not release any other details today, whether or not it can actually do better than just a general purpose Gemini.

    [00:26:34] Alex Volkov: Yeah, and so the example that they showed is they actually took Three.js, if you folks know the Three.js library from Mr.

    [00:26:40] Alex Volkov: Doob, and they embedded all of it in the context window and then asked questions, and it was able to understand all of it, including finding things in an incredibly huge codebase. And I think I want to just move this conversation on.

    [00:26:52] Alex Volkov: Yeah, Nisten, go ahead. I see you, I see you unmuting. And folks on the stage, feel free to raise your hands if if you want to chime in. We'll hopefully get to some of you, but we have a bunch of stuff to chat about as well.

    [00:27:01] Nisten Tahiraj: I'll just quickly say that there are still some drawbacks to these systems, and by systems I mean the long context models where you dump in a whole codebase or entire components. The drawbacks, even from the demos, still seem to be that, yes, they now look like they're much better at reading and taking in the information, but they're not yet much better at producing output of similar length. They're still only going to output, I think, up to 8,000 tokens or so, and I don't know if that's a byproduct of the training, or whether they could be trained to output much longer context.

    [00:27:43] Nisten Tahiraj: However, the benefit now is that, unlike a retrieval augmented generation system, unlike RAG, where the drawback was that yes, it could search over the document, but it would only find maybe two or three points and bring them up, this is a more holistic understanding of the entire input that you've dumped in.

    [00:28:03] Nisten Tahiraj: But again, we're not quite there yet where they can just output a whole textbook. That's what I mean. So that's the thing, that's the next challenge to solve.

    [00:28:12] Alex Volkov: So I think, I think the, the immediate reaction that I had is very similar to what you had, Nisten. RAG is something everybody uses right now. And we've talked about long context versus, versus something like a RAG before, and the usual conversation we have is usually about cost. How much does it cost you pair these tokens, right?

    [00:28:30] Alex Volkov: If you send 10 million tokens and each token is like a cent, you're basically paying 10 million cents for every back and forth. Also speed and, and user experience. If your users are sitting there and waiting for 45, 60 seconds because they sent a bunch of contacts, if you can solve this with RAG, then RAG is probably a better approach for you.

    [00:28:48] Alex Volkov: However, this specifically, at least from the examples that Google showed, they were transparent that they sped up the inference in the video, but I saw something where, at least for the video question, it took them around 40 seconds to extract a frame from an hour of video. They sent an hour's worth of video as context into this thing, and it took them 40 seconds for the inference.

    [00:29:13] Alex Volkov: Folks, like I said before, and I'm going to say this again, regular ChatGPT, not even crazy context, queries took me sometimes 40 seconds. Now, you may say, okay, Alex they show the demo of their environment, and ChatGPT is in production environment. Yes, but the possibility is, if I can send I don't know, 500, 000 tokens in the context window, and then within 40 seconds get a response which is equivalent to what I get from GPT 4.

    [00:29:38] Alex Volkov: Then I think a bunch of the arguments for RAG being better just from an inference-speed perspective start to weaken. An additional thing I want to say before I get to you, Yam, just a second: the immediate response in my head was, okay, RAG is done for, or at least not done for, but definitely the crown on RAG's head is in question.

    [00:29:56] Alex Volkov: Everybody's talking about RAG. There's vector databases everywhere. We just had folks talk about ColBERT and different things. Okay, RAG is now shaky. But the other thing I started to think is: is fine-tuning also at risk? And Swyx, I think this goes back to what you just said about the general models versus the fine-tuned or very specific models, because if a general model can take a whole book, and they had an example about this where there was a very low-resource language, Kalamang or something like this, and there's only one book that's a dictionary for this language, they literally threw the book in the context window, and the model was able, from in-context learning, to generalize and understand it and perform better than fine-tuned models.

    [00:30:37] Alex Volkov: And I'm thinking here, okay, RAG is the first thing to go. Is fine-tuning second? Are we going to stop fine-tuning and just send context? So Swyx, I want to hear your reaction about the language thing, and then we're going to get to Yam, and then we're going to ask some more folks.

    [00:30:48] Discussion about effects of longer context windows

    [00:30:48] Swyx: Yeah, I think there's generalizable insights about learning about language. And it's not surprising that throwing that into the context window works, especially if it's a cognate language of something that it already knows. So then you're just learning substitutions, and don't forget that transformers are initially trained to do language translation, like this is like bread and butter stuff for transformers.

    [00:31:12] Swyx: The second thing I would respond to is, I have to keep banging this drum: long context does not kill RAG, because of cost. Imagine if every time you throw 10 million tokens of context in there, you have to pay like a thousand dollars. Because unless something is fundamentally very, very different about this paradigm, you still pay the cost to ingest those tokens.

    [00:31:39] Swyx: So ultimately, people will still want to RAG for cost, and then for attribution reasons, like debuggability and attribution, which is something that's still valuable. So long context is something that I have historically quite underweighted for these reasons. I'm looking to change those assumptions, of course, because obviously these are magical capabilities if you can use them.
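
    To make the cost argument concrete, here's a rough back-of-the-envelope sketch in Python. The per-token price below is a placeholder assumption, not an actual Gemini or OpenAI rate; the point is only that re-ingesting a huge context on every turn dwarfs the cost of retrieving a few thousand tokens.

    ```python
    # Back-of-the-envelope: re-sending a huge context every turn vs. RAG retrieval.
    # The price below is a placeholder assumption, NOT real vendor pricing.

    INPUT_PRICE_PER_1K_TOKENS = 0.01  # assumed dollars per 1,000 input tokens

    def ingestion_cost(input_tokens: int, turns: int) -> float:
        """Cost of paying to ingest `input_tokens` on each of `turns` calls."""
        return turns * (input_tokens / 1000) * INPUT_PRICE_PER_1K_TOKENS

    full_context = 10_000_000  # dump the whole codebase/corpus on every turn
    rag_context = 4_000        # retrieve only a few relevant chunks per turn
    turns = 50                 # a typical back-and-forth session

    print(f"Long context: ${ingestion_cost(full_context, turns):,.2f}")  # $5,000.00
    print(f"RAG:          ${ingestion_cost(rag_context, turns):,.2f}")   # $2.00
    ```

    Unless ingestion gets dramatically cheaper, for example via the caching idea discussed next, the economics still favor retrieval for most chat-style workloads.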

    [00:32:10] Far El: Yeah, I just want to say, on the topic of latency and ingesting a lot of context, I think there is a solution that we didn't talk about here, and it will be something that is going to be incorporated in all the flagship models, which is embedding knowledge into the KV cache, which is something that many of the inference engines today can do.

    [00:32:34] Far El: And you simply prefill the context beforehand, and then you don't need to process it through your model again. So you're not sending the whole database each time you call your model, it's just saved. Imagine that OpenAI had some sort of API where you embed the KV cache beforehand, at a reduced price of course, and then it uses that as your context.

    [00:32:59] Far El: Basically, somewhere in the middle between the two. And the reason that it's not supported now in flagship models, because the first flagship model that supports a million tokens came out today. But I think that if we see this this, if we go there, this is something that we're going to see in all of the APIs.
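
    What's being sketched here is essentially prefix caching: run the model over the shared context once, keep the attention keys and values, then answer each new question by processing only the new tokens against that cache. No such OpenAI endpoint existed at the time of this recording, so the snippet below is just a minimal local illustration with Hugging Face transformers (gpt2 as a small stand-in model); serving engines such as vLLM or TensorRT-LLM handle this kind of cache reuse for you rather than you wiring it up by hand.

    ```python
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Small stand-in model; the same idea applies to any causal LM.
    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    # 1) Pre-fill the shared context ONCE and keep the resulting KV cache.
    context = "A large reference document you want to ask many questions about. " * 40
    context_ids = tok(context, return_tensors="pt").input_ids
    with torch.no_grad():
        kv_cache = model(context_ids, use_cache=True).past_key_values

    # 2) A new question only pays for its own tokens; the context is not re-processed.
    question_ids = tok(" Question: what is this about? Answer:", return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(question_ids, past_key_values=kv_cache, use_cache=True)

    next_token_id = int(out.logits[0, -1].argmax())
    print(tok.decode([next_token_id]))
    ```

    A full chat loop would keep extending the cache as it decodes; that bookkeeping, plus sharing and evicting cached prefixes across users, is exactly what the serving engines automate.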

    [00:33:18] Far El: Moreover, I also don't think that RAG is done for, because RAG explains to you very, very clearly and very simply where the information is coming from, what the model is basing itself on. You can claim that with attention you can do it in the model as well, but it's not like RAG. With RAG, you're just showing the clients, the people, exactly where it comes from.

    [00:33:40] Far El: And there are use cases where this is absolutely a must. So I think that there will always be room for RAG for these specific use cases, and long context with KV caching is going to be, I think, the method for embedding, for example, a full database, or a book, or something big, and using it multiple times with many different prompts.

    [00:34:06] Alex Volkov: Or also multimodality, right? So thank you for this. It definitely makes sense. And I think somebody in the comments also left a similar note. So we want to dive into the KV cache stuff maybe in the next one. But I want to talk about the multimodality part of this, because we've mentioned it multiple times.

    [00:34:25] Alex Volkov: I think we did this every Thursday since GPT-4 launched, because we were waiting for the vision part of GPT-4. And we've talked about 2024 being the year of multimodal. And we're going to have to talk about a bunch of multimodal stuff today, specifically with the Reka folks and Reka Flash, which understands videos.

    [00:34:40] Alex Volkov: So I'm going to have to see whether Reka understands videos better than Gemini, but the Gemini folks talked about a specific multimodal effect on the context window: if you send videos, at least the way they did it was just frames. They broke this movie down into a bunch of frames, 500,000 or so, and just sent it in the context window.

    [00:35:04] Alex Volkov: And they basically said, we have all this video in the context window and then we have a little bit of text. And I think context window expansions like this will just allow for incredible multimodal use cases, not only video but audio, they talked about, and we've talked previously with the folks from

    [00:35:20] Alex Volkov: Prophetic about different fMRI and EEG signals that they're getting, like multimodal applications as well, and context window enlargement for these things is something Google specifically highlighted.
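
    For a rough sense of why the context size is the bottleneck here: treating video as a sequence of per-frame tokens adds up fast. The numbers below are illustrative assumptions (one sampled frame per second, a few hundred tokens per frame, the ballpark Google has described for Gemini 1.5), not official figures.

    ```python
    # Rough token budget for stuffing a whole video into the context window.
    frames_per_second = 1     # assumed sampling rate
    tokens_per_frame = 258    # assumed tokens per frame (ballpark, not official)
    minutes_of_video = 60

    frames = minutes_of_video * 60 * frames_per_second
    video_tokens = frames * tokens_per_frame
    print(f"{frames} frames ~ {video_tokens:,} tokens for one hour of video")
    # -> 3600 frames ~ 928,800 tokens: an hour of video nearly fills a 1M-token context
    ```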

    [00:35:32] Alex Volkov: And I want to highlight this as well because it's definitely coming. I'm waiting to be able to live-stream video, for example. And I know some folks from Twelve Labs are talking about almost-live livestream embedding. So definitely multimodal from Google. I think, folks, we've been at this for 30 minutes.

    [00:35:48] Andrej Karpathy leaves OpenAI

    [00:35:48] Alex Volkov: Alright, so folks, I think we're going to move on and talk about the next couple of things that we've already covered to an extent, but there's some news from OpenAI, specifically around Andrej Karpathy leaving. This was announced, I think it broke in The Information, and Karpathy, some folks here call him senpai, Karpathy is a very legit, I don't know, top 10, top 5, whatever, researcher, and could potentially have been listening to the space that we had with LDJ after he left. The Information's piece didn't have a bunch of detail, but then Andrej, as the transparent dude he is, came and said, hey, this wasn't a reaction to anything specific that happened, because speculations were flying.

    [00:36:33] Alex Volkov: And I think, at least to some extent, we were responsible for some of these speculations, because we did a whole space about this that he could have just listened to. But as speculation was flying, maybe this was Ilya related, maybe this was open source related, all of these things.

    [00:36:46] Alex Volkov: Andrej basically helped start OpenAI, then left and helped kickstart the Tesla Autopilot program, scaled that team to 1,500, then left. On the chat with Lex Fridman, Andrej said that basically he wanted to go back to hands-on coding, and at OpenAI his bio at least said that he was working on a kind of Jarvis within OpenAI. And Andrej has definitely also been talking about AI as an OS. Swyx, do you want to cover his OS approach?

    [00:37:14] Alex Volkov: I think you talked about this. He had a whole outline, I think you

    [00:37:17] Swyx: also talked about this. LLM OS.

    [00:37:18] Swyx: Yeah. He wasn't working on it so much as thinking about it.

    [00:37:21] Swyx: Thinking about it,

    [00:37:21] Swyx: yeah. And maybe now that he's independent, he might think about it. The main thing I will offer as actual alpha rather than speculation is I did speak to friends at OpenAI who reassured us that it really was nothing negative at OpenAI when he left.

    [00:37:40] Swyx: Apparently because they spoke to him before he left.

    [00:37:43] Swyx: So yeah, he's for the way I described it is he's following his own internal North Star and every time he does that the rest of us

    [00:37:51] Alex Volkov: And definitely the rest of us win.

    [00:37:53] Alex Volkov: the open source community is hoping, or I've seen many, many multiple things that say, hey, Andre will unite like the, the, the bands of open source, the different bands of open source.

    [00:38:02] Alex Volkov: Andrej posted this thing on his X, where his calendar was just free, which shows maybe part of the rationale for why he left, because it was meetings and meetings and meetings and everything, and now he can actually work. So shout out to Andrej Karpathy for all he did at OpenAI and for all he's going to continue to do.

    [00:38:16] Alex Volkov: We're going to definitely keep up to date with the stuff that he releases. Andrej, if you're listening to this, you're more than welcome to join. We're here on every Thursday. You don't have to have a calendar meeting for this. You can hop on the space and just join. Also on the topic of OpenAI, they've added memory to ChatGPT, which is super cool.

    [00:38:31] Alex Volkov: They released a teaser, this, I didn't get into the beta, so they released it to a limited amount of people. They added memory to ChatGPT, and memory is very, very cool, the way they added this as well. So I've said for a long time that 2024 is not only about multimodality, that's obviously going to come, but also it's about time we have personalization.

    [00:38:51] Alex Volkov: I'm getting tired of opening a ChatGPT chat and having to remember to say the same things; it doesn't remember the stuff I previously said. The folks at OpenAI are working on the differentiator, the moat, and different other things, especially now that Google is coming after them with the 10 million token context window.

    [00:39:08] Alex Volkov: And, they're now adding memory, where ChatGPT itself, like the model, will manage memory for you, and will try to figure out, oh, OpenAI, oh my god, breaking news. OpenAI just shared something. As I'm talking about them, you guys want to see this? Literally, I got a

    [00:39:28] Alex Volkov: notification from OpenAI as I'm talking about this.

    [00:39:30] Swyx: What?

    [00:39:32] Alex Volkov: Let's look at this. I, dude, I needed my breaking news button today. OpenAI just said: introducing Sora, our text-to-video model. Sora can create videos of up to 60 seconds.

    [00:39:44] Alex Volkov: Holy s**t, this looks incredible. Oh my god, somebody please pin this to the, to the, Nisten, you have to see, there's a video, 60 second video, folks.

    [00:39:54] Alex Volkov: Like, all of the, oh my god, breaking, I have to put the breaking news button here, holy s**t. So folks, just to describe what I'm seeing, 'cause somebody please pin this to the top of the space: every video model we had so far does 3 to 4 seconds, Pika, the other labs, I forgot their name now, Runway, all of these models,

    [00:40:16] Swyx: Oh my god, Runway.

    [00:40:18] Alex Volkov: They do three to five seconds and it looks wonky. This thing that they show generates 60 seconds featuring highly detailed scenes, and the video that they've shared, I'm going to repost, and somebody already put it up on the space, has folks walking hand in hand, and there's a zoomed-in, like behind-the-scenes camera zooming in.

    [00:40:39] Alex Volkov: There's a couple. Consistent. I cannot believe this is January. The consistency is crazy. Nothing changes. You know how previously video would jump frames and faces and things would shift?

    [00:40:52] Alex Volkov: Wow, okay, so I guess we should probably talk about this. Reactions from folks. I saw LDJ wanted to come up to see the reaction I'm

    [00:41:00] Far El: just wild. Honestly, it looks crazy. It looks really good quality. Better than most text to video models that I've seen.

    [00:41:08] Alex Volkov: Holy s**t okay, so I'm scrolling through the page, folks,

    [00:41:13] Alex Volkov: those who are listening, openai.com/sora, Sora is their text-to-video model. I'm seeing a video of a model walking through a Japanese street, whatever, the prompt is: a stylish woman walks down a Tokyo street filled with warm glowing neon animated city signage, she wears a black leather jacket, long red dress, and black boots. And the consistency here is insane.

    [00:41:35] Alex Volkov: I do

    [00:41:35] Far El: Check out the mammoths. Or actually go on their website, on the Sora page on OpenAI's website. They've got a few examples. It's crazy. It's crazy. I've never seen anything like it.

    [00:41:48] Alex Volkov: If you showed me this yesterday, Far El, if you showed me this yesterday and said this is generated, I would not believe you. So what happens is, in the same video of this woman walking, they have the camera zooming in, into her glasses, her face stays the same, the same consistency, and you can see the reflection in the sunglasses.

    [00:42:08] Far El: Alex, you have to go on the website. There's this video where, oh, literally the prompt is: reflections in the window of a train traveling through the Tokyo suburbs. And

    [00:42:19] Far El: honestly, it looks, it looks like someone captured this no way this is AI

    [00:42:23] Far El: generated. It's, it's crazy

    [00:42:27] Alex Volkov: Wow,

    [00:42:27] Alex Volkov: folks. What's the availability of this? Let's, let's see, what do we know? So we know safety. We'll be taking several important safety steps ahead of making SORA available on OpenAI's products, so it's not available yet. Working with Red Teamers, they don't want this to be used in deepfakes for porn, obviously.

    [00:42:43] Alex Volkov: That's like the first thing that the waifus are going to use it for. The C2PA metadata that, if you guys remember, we've talked about, that they started including in DALL-E, they're probably going to include here as well. And new techniques prepared for deployment, leveraging the existing safety methods.

    [00:42:56] Alex Volkov: Okay research techniques.

    [00:42:58] Far El: Crazy.

    [00:43:00] Alex Volkov: Consistency is crazy, right folks?

    [00:43:02] Swyx: Yeah, it's not available it looks like.

    [00:43:03] Swyx: Not available

    [00:43:04] Swyx: yet.

    [00:43:04] Swyx: To answer your question. They released some details about it being a diffusion model. They also talked about it having links to DALL-E 3, in the sense that honestly, I don't know if people know that there was a DALL-E 3 paper, which is very, very rare in this age of not-closed,

    [00:43:22] Swyx: Not open ai.

    [00:43:23] Alex Volkov: Yeah, not

    [00:43:24] Swyx: open AI.

    [00:43:24] Swyx: And so they're doing this synthetic data captioning thing for the DALL-E 3 model, and they're referencing the same method for Sora. I would just go read the DALL-E 3 paper.

    [00:43:37] Alex Volkov: Wow. The consistency has been the biggest kind of problem with these models. LDJ,

    [00:43:41] Alex Volkov: Go ahead, please. As I'm reading this and reacting, my mind is literally blown by the demo of the doggy. Hold on, LDJ, one second. There's a demo, there's a video of the dog walking from one window and jumping to another window, and the paws look like a real video. Folks, it literally does not look generated, like anything we've seen before.

    [00:44:02] Far El: This is going to disrupt Hollywood immediately. We're talking about text-to-video disrupting media content creation and so on. This is it, this is the Midjourney moment of text-to-video, that same feeling we had when we were able to prompt Midjourney and get some really high quality images. This is the same, but for video, essentially.

    [00:44:23] Alex Volkov: This breaks reality for me right now. Literally, I'm watching this video multiple times. I cannot believe that the dog's paws are not shifting into different shapes. The spots on this Dalmatian stay in the same place throughout the video. It doesn't make sense. Alright, LDJ, go.

    [00:44:37] Far El: Yeah so

    [00:44:38] Far El: Sam here, I'll post it on the, on the ding board. Sam said that that certain select creators have access now. And, oh, I just lost the tweet. I'll, I'll get it. But yeah, he says that some creators already have access and I guess they're going to slowly expand it out to like beta users or whatever.

    [00:44:59] Alex Volkov: Wow, so Sam asked for some prompts: we can show you what Sora can do, please reply with captions for videos you'd like to see and we'll start making some.

    [00:45:06] Alex Volkov: So

    [00:45:06] Swyx: Oh yeah, basically give him some really complicated prompt, and let's, let's go, let's go.

    [00:45:12] Alex Volkov: A bunch of podcasters sitting, watching Sora and reacting in real time and their heads are blown.

    [00:45:17] Alex Volkov: Not literally, because this is insane. How's that for a prompt? I'm gonna post it. Hopefully some will get it.

    [00:45:25] NA: Just opening a portal through Twitter, through OpenAI to the Munich and then string

    [00:45:31] Alex Volkov: Oh, there's also, I don't wanna spend the rest of ThursdAI on this, 'cause we still have a bunch to talk about, folks.

    [00:45:38] Alex Volkov: Is anybody not scrolling through examples right now? And you definitely should. There's an example of a

    [00:45:43] Swyx: there's only nine examples.

    [00:45:45] Alex Volkov: What, what

    [00:45:45] Far El: This is insane.

    [00:45:46] Alex Volkov: No, the whole website has a bunch of them, scroll down.

    [00:45:48] Alex Volkov: There's like every, every kind of example has

    [00:45:51] Alex Volkov: more scrollies. So I'm looking at an example of a chameleon, which has a bunch of spots, and guys, the spots stay in the same place. What the f**k? It doesn't move. It does not look generated. Honestly, let's do this: everybody send this to your mom and say, hey mom, is this AI generated

    [00:46:07] Alex Volkov: Or not? Like older folks will not believe this s**t, like

    [00:46:10] Swyx: I, I will

    [00:46:13] Far El: What's the most impressive

    [00:46:14] Swyx: compare this to Google

    [00:46:15] Far El: right? Like humans,

    [00:46:17] Swyx: don't know, I think you guys

    [00:46:18] Alex Volkov: hold on. Far El, I think we're talking over each other. Give us one sec. Swyx and then Far El.

    [00:46:22] Swyx: Oh, sorry, yeah, there's a bit of a lag. Oh, no, nothing. Just compare this to Google Lumiere where they release a bunch of sample videos as well.

    [00:46:29] Swyx: But you could, the, the, I was impressed by the consistency of the Lumiere demo videos. They would, they demoed sort of pouring syrup onto a pancake and then infilling the syrup and showing that, it would be pretty realistic in pouring all that syrup stuff. Didn't really see that kind of very technical test here.

    [00:46:49] Swyx: But the resolution of these videos and the consistency of some of these movements between frames, and the ability to cut from scene to scene is way better. Instantly way better. I was thinking that Lumiere was, like, state of the art a few weeks ago, and now it is completely replaced by Sora.

    [00:47:08] Swyx: This is a way better demo. I think OpenAI is showing Google how to ship.

    [00:47:12] Alex Volkov: OpenAI decided to say, you know what, Google, you think you can one-up us with the context window?

    [00:47:18] Alex Volkov: We got another thing coming, because I've

    [00:47:20] Swyx: just pull up the Lumiere page, and then pull up the Sora page, and just look at them side by side, and you can see how much better they

    [00:47:26] Alex Volkov: Lumiere was mind-blowing as well. Go ahead, Far El. Go ahead, because we're still reacting in real time to this whole ridiculously impressive thing.

    [00:47:32] Far El: Yeah, I was just saying that the the most impressive thing are, is like how alive these video shots feel, right? Humans talking action scenes like, all the text to video models that I've seen so far and I've used were very very simplistic, right? It felt like more like you're animating an image to do very minor movements.

    [00:47:55] Far El: It wasn't actually alive in any way, but Sora's text to videos is, is nuts, the quality, the consistency, the action, like the actual action of the characters. I wonder how much like granular control do you have on a scene to scene basis. I know that Google released like a paper I think a few months back where they had a basically like a script that allowed the, like for much more long form.

    [00:48:27] Far El: video content, but I'm not sure if that's the case here. It's just, it's just really impressive. It's, it's really impressive.

    [00:48:35] Alex Volkov: I want to say, one of our friends, LaChanze, just sent: at the bottom of the page, it says Sora serves as a foundation for models that can understand and simulate the real world. It's really hard for me to even internalize what I'm reading right now, because the simulation of the real world triggers something in me, tingles the simulation-hypothesis type of thing: this can regenerate the map of the world and then zoom in and then generate all the videos.

    [00:48:58] Alex Volkov: And I'm wearing this mixed slash augmented slash spatial reality headset that just generates, and this happens on the fly, and what am I actually watching here? So this says Sora serves as a foundation for models that can understand and simulate the real world, a capability we believe will be an important milestone for achieving AGI.

    [00:49:15] Alex Volkov: Yeah. Alright, folks. I will say, let's do two more minutes, 'cause I can't believe we got both of them on the same day today. Holy s**t, we got the 10 million context window announcement from Google, which is incredible, multimodal as well. My whole thing itches right now to take the videos that OpenAI generated and shove them into Gemini to understand what it sees and see if it understands. It probably will.

    [00:49:40] Alex Volkov: Wow.

    [00:49:40] Far El: The thing that would make this Thursday a tiny bit more awesome is if Meta comes out with Llama 3. Too much, too much, too much.

    [00:49:51] Alex Volkov: It's gonna be too much. We need a second to breathe. Yeah, definitely folks, this is literally a singular day. Again, we've had a few of those. We had one on March 14th, when ThursdAI started: OpenAI released GPT-4, Anthropic released Claude, I think on the same day. We had another one when OpenAI Dev Day came about, and I think there's a bunch of other stuff.

    [00:50:12] Alex Volkov: I consider this to be another monumental day. We got Gemini 1.5 with a potential 10 million token context window, including incredible results in understanding multimodality in video, up to an hour of video. And then we also have some folks from Reka that are gonna come up soon and talk about their stuff, which, with all due respect to the Reka folks, this news seems bigger, but they still launched something super, super cool we're gonna chat about. And now we're getting, it's just, the distance, we're used to jumps, we're used to state of the art every week, we're used to this model beats that model by a finetune, whatever, we're used to the Open LLM leaderboard. This is

    [00:50:53] Alex Volkov: such a

    [00:50:53] Alex Volkov: big jump on top of everything we saw.

    [00:50:55] Alex Volkov: From Stable Video Diffusion. From, what are they called again? I just said their name, Runway. I always forget their name.

    [00:51:02] Swyx: Poor guys.

    [00:51:04] Alex Volkov: Poor Runway. From Pika Labs. From folks who are generating videos. This is just such a huge jump in capability. They're talking about 60 seconds.

    [00:51:14] Alex Volkov: Oh, Meta just announced JEPA. Yeah, I don't know if JEPA is enough. People are commenting about JEPA, and I'm like, okay wait, hold

    [00:51:21] Swyx: You, you spiked my heart rate when you said Meta just announced. I was like, what the f**k?

    [00:51:25] Alex Volkov: the f**k? Meta literally just came out with an announcement, V-JEPA, self-supervised learning for videos.

    [00:51:29] Alex Volkov: But, folks, unless they come out with Llama 3 and it's multimodal and it's available right now, Meta is not participating in the thing today.

    [00:51:36] Far El: Oh wait, this is actually cool. So this is this is something,

    [00:51:39] Far El: actually a paper they came out with like about a month ago, but this is for video understanding. So this is pretty much like for input of video, while OpenAI's model is for output of video.

    [00:51:51] Alex Volkov: It just, I will say it's a research thing, right? So they're not showing anything there, unless I'm mistaken. Um So, I kinda, so I still have a bunch of stuff to give you updates for, and I still have a bunch of interviews as well, there's a new stability model, but I'm still like, blown away, and I just wanna sit here and watch the videos,

    [00:52:07] Alex Volkov: Is this what Ilya saw? Yeah, somebody reacted like, what did Ilya see? Did Ilya see a generated video and the model understanding this and that's why, that's why?

    [00:52:16] Far El: No, I think, I think, I think AGI has been achieved internally at

    [00:52:21] Far El: this rate.

    [00:52:22] Alex Volkov: Wow. I'm still blown away. If a model can generate this level of detail, and very soon, I just wanna play with this. I wish I was one of the artists who got access, and I hope that somebody in the audience here is, and that they will come talk about this on ThursdAI.

    [00:52:43] Alex Volkov: And because I'm, yeah, I'm still mind-blown. So I see quite a few folks that I invited that I wanna welcome to the stage. "V-JEPA understands the world while Sora generates one", that's the comment that some folks left. And okay, V-JEPA is going to be something we definitely cover, because Meta released this and Meta are the GOATs. Yeah, no, Meta's definitely GOATs. I'm just a little bit lost for words right now.

    [00:53:06] Nisten Tahiraj: Yeah, so if people have watched a lot of speeches from Yann LeCun, the main idea is that these AI models are not very good at understanding the world around them or thinking in 3D. So in some ways, you could reason that a cat is a lot more intelligent: even if it was blind and it couldn't smell, it could still figure out where to go and find its litter box, stuff like that.

    [00:53:30] Nisten Tahiraj: This is one part that's missing from the world model that they get purely just from word relationships or word vectors. And so this is a step in that direction, it seems. Again, I haven't read the paper, so I'm Half making stuff up here but it feels like this is a step in, in that direction towards AI models that understand what's going on like us and animals do.

    [00:53:56] Nisten Tahiraj: So that, that's the main, the gist of it for, the audience.

    [00:54:04] Alex Volkov: Oh, what, what a what A Thursday. What A Thursday. I gotta wonder how am I'm gonna summarize this, all of this. And I just wanna invite, we have here in the audience and I sent you a request to join. If you didn't get it. Make sure that you're looking at requests and then accept. And then we should have, we should have Max as well at some point.

    [00:54:20] Alex Volkov: Lemme text Max, 'cause we have guest speakers here from Reka that we wanna chat with. Meanwhile, I'm gonna continue and move forward with some of the conversations. Let's roll back. Okay, while we're still super excited and I can't wait for this to come out, this is an announcement that they did.

    [00:54:35] Alex Volkov: It's very polished. We haven't seen we didn't see any access or anything about when it's going to come out. I do feel that this is a breakthrough moment. from Google and from OpenAI. And it does look like it's reactionary to an extent. The folks in OpenAI were sitting on this and saying, Hey, what's a good time to release this?

    [00:54:52] Alex Volkov: And, actually, now, to steal some thunder from Google and their 10 million thing that also not many people can use, and to show whatever we have that not many people can use. Which is an interesting thing to think about, because, again, the pressure is on a bunch of other labs, on Meta, to release something. We know Llama 3 is coming at some point, will it be multimodal, will it be able to generate some stuff

    [00:55:16] NA: Really, really quick, sorry to interrupt

    [00:55:18] Alex Volkov: Go

    [00:55:19] NA: The thing V-JEPA seems to be good at is understanding video instructions. I guess you could point the camera at something you're doing with your hands, arts and crafts things, or repairing something, and it understands what you're doing. So that's actually very

    [00:55:36] NA: powerful for the datasets of skills that will come, because then you can generate actions. I think that will apply a lot to robotics, what they're doing.

    [00:55:48] Alex Volkov: Oh, alright, yeah. And they also have the Ego4D datasets of robotics as well, and they've talked about this.

    [00:55:55] NVIDIA releases Chat with RTX

    [00:55:55] Alex Volkov: so let's go to open source super quick. NVIDIA released Chat with RTX for local models. And it's actually very, very cool. So a few things about Chat with RTX. First of all, NVIDIA packed a few models for you. It's a 38 gigabyte or so download, and I think they have two models packed in there.

    [00:56:16] Alex Volkov: I wasn't sure which ones. And this is basically a package you download, I don't know if it's Docker or not, that runs on any desktop PC with an RTX 30 or 40 series card with at least 8 gigabytes of VRAM. And it gives you a chatbot that's fully local. And we love talking about open source and local stuff as well.

    [00:56:33] Alex Volkov: And not only that, they give you RAG built in. So you can actually run this on some of the documents that you have. They also have something that runs through YouTube: you can give it a YouTube playlist or a video link, and it will let you talk to the YouTube video. So it has built-in RAG, built-in TensorRT-LLM, which runs with their RTX acceleration.

    [00:56:56] Alex Volkov: I think it's pretty cool, like it works only on the very specific types of devices, only for like gamers or folks who run these things but I think it's pretty cool that that folks are, that NVIDIA is releasing this. They also have something for developers as well to be able to build on top of this.

    [00:57:11] Alex Volkov: And I think the last thing I'll say about this is that it's a Gradio interface, which is really funny to me, that people are shipping Gradio interfaces in production. It's super cool.
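
    For anyone wondering what "RAG built in" over your local files boils down to, here's a minimal sketch of the retrieval half using sentence-transformers. The folder name and embedding model are placeholders, and a real setup (Chat with RTX included) adds chunking, a proper vector index, and a local LLM that answers over the retrieved text.

    ```python
    # Minimal local retrieval over a folder of text files: embed, then cosine-match.
    from pathlib import Path
    import numpy as np
    from sentence_transformers import SentenceTransformer

    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # small local embedding model

    docs = {p.name: p.read_text(errors="ignore") for p in Path("my_docs").glob("*.txt")}
    names = list(docs)
    doc_vecs = embedder.encode([docs[n] for n in names], normalize_embeddings=True)

    def retrieve(question: str, k: int = 3):
        """Return the k most similar documents to the question (cosine similarity)."""
        q_vec = embedder.encode([question], normalize_embeddings=True)[0]
        scores = doc_vecs @ q_vec
        top = np.argsort(-scores)[:k]
        return [(names[i], float(scores[i])) for i in top]

    # The retrieved text is then pasted into the local LLM's prompt as context.
    print(retrieve("What did the meeting notes say about the Q3 roadmap?"))
    ```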

    [00:57:18] Cohere releases Aya 101 12.8B LLM with 101 language understanding

    [00:57:18] Alex Volkov: Cohere released an open source model called Aya 101, a 12.8 billion parameter model with understanding of 101 languages. It's honestly pretty cool because Cohere has been doing a bunch of stuff. Aya outperforms the Bloomz and mT0 models on a wide variety of automatic evaluations despite covering double the number of languages.

    [00:57:41] Alex Volkov: And what's interesting as well, they released a dataset together with Aya. And what else is interesting here? Oh, the Apache 2.0 license, which is super cool as well, an Apache 2.0 license for this model. Let me invite Yi as a co-host so maybe he can join. Far El, go ahead.

    [00:57:58] Alex Volkov: Did you see, do you want to talk about Aya?

    [00:58:00] Far El: Yeah, first off, I appreciate and commend Cohere for building a multilingual open source dataset and so on. That's awesome. We need more of that. But unfortunately, with the first few questions that I asked, in Arabic specifically, most of the answers were complete nonsense on their trained model.

    [00:58:23] Far El: Yeah. And to, to the point that it's it's laughable, right? For instance in Arabic, I asked who was the who was the first nation that

    [00:58:32] NA: had astronauts on the moon. I

    [00:58:38] Alex Volkov: Yes.

    [00:58:39] NA: think, I think you cut out for a sec.

    [00:58:43] Alex Volkov: I think he dropped. I don't see him anymore.

    [00:58:45] NA: He might have

    [00:58:46] NA: His phone might have

    [00:58:47] Alex Volkov: yeah, we're gonna have to

    [00:58:48] NA: I can briefly

    [00:58:50] NA: comment on it. Yeah, we're pretty happy now that Cohere has also started contributing

    [00:58:56] NA: To open source because datasets are very important. And yeah, I think the reason it wasn't performing so well In other languages, it's just because some languages do not have there wasn't enough data in that for it to be, to be trained.

    [00:59:12] NA: But the beautiful thing is that it is Apache 2.0. You can just add your own language's dataset and it will literally make the whole thing better. And yeah, those are my comments on it.
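
    If you want to poke at Aya 101 yourself, it's an mT5-style sequence-to-sequence model, so it loads with the seq2seq classes in transformers. This is a hedged sketch: the checkpoint id is assumed to be the one Cohere for AI published on Hugging Face (check the model card for the exact name and hardware requirements, since a ~13B model wants a sizeable GPU or quantized loading).

    ```python
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    # Assumed checkpoint id; verify against the official model card before running.
    checkpoint = "CohereForAI/aya-101"

    tok = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint, device_map="auto")

    # Aya is instruction-tuned across 101 languages, so you can prompt it directly
    # in (or about) the target language.
    prompt = "Translate to Turkish: The weather is lovely today."
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=64)
    print(tok.decode(outputs[0], skip_special_tokens=True))
    ```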

    [00:59:22] Interview with Yi Tay and Max Bain from Reka AI

    [00:59:22] Alex Volkov: Awesome. All right, folks. So now we're moving into the interview stage, and we have quite a few folks. As one of the most favorite things that I want to do in ThursdAI, and it's been an hour since we've been here, is to actually talk with the folks who released the stuff that we're talking about.

    [00:59:35] Alex Volkov: So the next thing I'm going to announce, and then we're going to talk with Yi Tay and Max, and then after that we're going to talk with Dom as well. Earlier this week, a company named Reka AI released two models, or at least released a demo of two models, right? I don't think the API is available yet.

    [00:59:51] Alex Volkov: We're going to talk about this as well. Called Reka Flash and Reka Edge. And Reka Flash and Reka Edge are both multimodal models that understand text, understand video, understand audio as well, which is like very surprising to me as well. And I had a thread where I just geeked out and my head was blown to the level of understanding of multimodality.

    [01:00:09] Alex Volkov: And I think some of the folks here had, sorry, let me reset. Some of the folks here on stage have worked on these multimodal models. And so with this I want to introduce Yi Tay and Max Bain. Please feel free to unmute and introduce yourselves briefly, and then we're going to talk about some Reka stuff.

    [01:00:25] Alex Volkov: Yi first maybe and then Max.

    [01:00:27] Yi Tay: Yeah, thanks thanks Alex for inviting me here. Can people hear me actually?

    [01:00:31] Alex Volkov: Yeah, we can hear you

    [01:00:32] Yi Tay: okay, great, great. Because this is the first time using a space, so yeah, trying to figure out how to use it. But thanks for the invite, Alex, and so I'll just introduce myself. I'm Yi Tay, and I'm one of the co-founders of Reka AI.

    [01:00:45] Yi Tay: We're like a new startup in the LLM space. We train multimodal models. Previously I worked at Google Brain, working on Flan, stuff like that. So yeah, that's just a short introduction about myself. And maybe Max, do you want to introduce yourself? Yeah,

    [01:00:59] Alex Volkov: Yeah, Max, go ahead, please.

    [01:01:00] Max Bain: thanks Yi. Yeah.

    [01:01:01] Max Bain: Thanks Alex for having me. So yeah, as you said, I'm part of Reka. I joined more recently, like six months ago. I just finished my PhD, and that was all on video, audio, and speech understanding. I've done a bit of work in open source, so if you use WhisperX, that was something I worked on. And yeah, now working as part of Reka and really enjoying it.

    [01:01:22] Max Bain: yeah, that's pretty much

    [01:01:23] Alex Volkov: First of all, let me just say, thank you for WhisperX. I did use it, and it was awesome, and I think this is how we connected before, or at least, to some extent, I think this is the reason I follow you. I was really surprised that you were at Reka. Let's talk about the models that you guys just released, because they're very impressive on the multimodality part, but also very impressive on the regular comparative benchmarks. I think you guys released comparisons on regular MMLU scores, so Reka Flash gets 73.5 on MMLU and 65 on HumanEval, and GPT-4 is at 67, at least, and Gemini Ultra, they claim, is at 74, but your model is significantly smaller. What can you tell us about, and I know you said before there's a bunch of stuff that you won't be able to talk about, what can you tell us about the performance just on the textual comparison, even though this is a multimodal model and there's a bunch more that we will talk about?

    [01:02:17] Yi Tay: Yeah, thanks. So I can't really say that much, but I can say that there's quite a lot of headroom in pre-training just for language alone, and I think we're still not near the ceiling yet for pre-training. Even for us, actually, we have a better version of Reka Flash internally right now, but we've not even published metrics for that, because while we were preparing for the launch we actually got an even better model.

    [01:02:39] Yi Tay: So I think there's still quite a lot of headroom for pushing that, and quite a lot of things to do in pre-training, but I wouldn't be able to say much more in terms of details, yeah.

    [01:02:48] Alex Volkov: About specifics. I did see the comments that you left in your thread, that you talked about the folks who do foundational models from scratch, they, there's a lot of banging a lot of creation they have to do in the process as well, and it looks like at least some of this amount, some of this amount of hard work you guys had to go through in order to train these foundational models.

    [01:03:09] Alex Volkov: So let's talk about the multimodality, what what can this model do? And I think I have a

    [01:03:15] Alex Volkov: good idea, but can you talk to us on the multimodal part? What can those models do in terms of multimodality?

    [01:03:23] Max Bain: Yeah, so in terms of multimodality, you can actually use it on chat.reka.ai, and I would say the image understanding is pretty good, so people have noticed it can recognize text pretty well. Yeah, and more nuanced details, which tended to be a big issue with VLMs, like they used to be quite biased or they'd hallucinate a lot.

    [01:03:41] Max Bain: I think in Reka Flash we noticed that dropped a lot. So I think the image understanding is, I'd say, pretty on par with Gemini Pro or a bit better. But yeah, that's up to the jury. The video understanding is also pretty good. We limit it to a one-minute input. We do have better things internally, but we're bounded by how much we can run for free. So, yeah, I'd say overall pretty good video understanding, and image. We haven't focused too much on audio right now, but that's definitely on the roadmap.

    [01:04:14] Alex Volkov: I did run into the audio stuff, and I ran a few videos through the demo, and folks definitely should check out the demo. I'll add this in the show notes, and hopefully some folks will add this to the space as well. I just started uploading like short clips, and it's great to hear that you're saying, you guys are limited, you're limiting on the demo, but you can, if I'm hearing correctly, you can The model can understand longer videos as well.

    [01:04:39] Alex Volkov: So I uploaded a video of a trip that I took to Hawaii and there's a submarine there and somebody was narrating in the submarine and he yelled something like, there, there, there's the submarine goes, dive, dive, dive, something like this. Very excitedly. And the model really understood this, and actually it said, the commenter said, Dive, dive, dive, like this, with a bunch of I's in it.

    [01:05:00] Alex Volkov: And to me, this was like the, the holy s**t moment. I uploaded this video. The narrator for this video was very excited. I did not expect the model to actually pick up on the excitement. And, It was very surprising to me because if you use something like Whisper and you just extract the audio from the, from the video, you would not get this result.

    [01:05:20] Alex Volkov: You would not get like the, the excitement in this person's voice. And while we try to get max back in, could you, so could you mention stuff about audio? Do you train this specifically for audio as much as you can share, obviously. Or is it like a, a, a byproduct of, of just this model being multimodal and understanding and can listen as well?

    [01:05:39] Yi Tay: Wait, so let me take a step back. Actually, thanks for sharing that example because I

    [01:05:43] Yi Tay: actually had to watch your example to find that, that dive, dive, dive. I actually watched the entire video to find that, that clip. So I think it was a pretty Good clip. To be honest, it also surprised me that you found this example.

    [01:05:56] Yi Tay: I, I think I was not also expecting this but I, we, we, we co trained this with many modalities. We are not sure, like, why this this specific case is like this. I think that's all I can say, but probably

    [01:06:09] Yi Tay: yeah, next one

    [01:06:09] Alex Volkov: I can definitely add one thing: this video was for sure not in your training dataset, because it was a private video of mine that didn't exist on the internet before. So it wasn't a result of this video being in a training set. Max, you rejoined. I hope you heard some of this question as well, as it was also directed at you.

    [01:06:26] Alex Volkov: Did you see this example? Did it catch you off guard as well? Do you see other examples like this that were very, very surprising in how this model performs?

    [01:06:33] Max Bain: Yeah, I saw that. I was surprised. To be honest, one thing I've noticed is that video benchmarks are quite poor. So in the question answering datasets, we don't really get a chance to see this, especially ones that use the speech information and things like that. So I'm glad you tested it a lot.

    [01:06:50] Max Bain: Cause yeah, like internally we maybe haven't had a chance to I think but it's the benefit of kind of, yeah, training everything from scratch and adding all the modalities

    [01:06:58] Yi Tay: and yeah

    [01:06:58] Alex Volkov: That's awesome. So I also want to talk about the fact that you guys released two models, and you mentioned there's a bigger one. Let's talk about the Edge model. Can you talk about, are we going to be able to use this on device? I assume, what's the play here? At least from what you can say, what's the play in terms of using the smaller models?

    [01:07:14] Alex Volkov: Obviously, smaller models, the benefit of them is using them closer on the edge and device, and that's how you named it. What's the, what's the thinking about releasing, these two models in different sizes? And and what's your plans for those?

    [01:07:26] Yi Tay: Oh yeah, sounds good. Yeah, that's a great question. So for the Edge model, the 7B model, I think it's at a size where it's possible to run it locally, but we're also thinking along the lines of, okay, it's actually faster, so it's just for latency-sensitive applications. Sometimes you just need something slightly faster than the 21B model, and it's also cheaper to host for a lot of applications. So I think that's mainly one of the reasons why 7B.

    [01:07:55] Yi Tay: We also ran lots of ablations at smaller scale, so this turns out to be just the size that we have. And I think it's mainly for latency-sensitive stuff. And then for businesses and such, they might just choose to deploy the smaller model if they don't need the larger models like

    [01:08:13] Yi Tay: Flash or the Core model. So I think that's really the idea behind it. And then from the research point of view, or at least from the playground point of view, the demo point of view, people get to get a sense of the model at the 7B scale and the 21B scale, right?

    [01:08:28] Yi Tay: So you might be able to get a sense of how this setup looks at the different scales. I think that's mainly why we deployed two models in the background, just so that people can play with two variants. Actually, not much thought here.

    [01:08:42] Yi Tay: I mean it's not like super complicated, it just happened this way, but yeah, that's all I can say, yeah.

    [01:08:48] Alex Volkov: Awesome. And so folks can go check out the demo. It looks like you guys are set up for API keys, as far as I understood, so will developers be able to build with this? What stage are you in? I think you invited folks to a Discord or something. Could you talk about how we can play with these models, what we can do, and if there's any expected open source, because we love open source here on ThursdAI.

    [01:09:08] Alex Volkov: If there's anything to talk about there as well, please, please feel free to, to tell us how to actually try these models beyond the demo. Build with them.

    [01:09:16] Yi Tay: Yeah, sounds good. So for the API, actually, we have our API system already working, and some people are already using it. We are rolling out access, currently without the billing and everything; we're just making sure everything is running very well.

    [01:09:29] Yi Tay: And then we will roll it out soon. So I think that's mainly the idea behind the slightly staged API release. Yeah, so that's for APIs. And then for open source, I'll just be candid here, we are not sure yet about whether we want to do it or we don't want to do it.

    [01:09:44] Yi Tay: It's always a question we have, but we're not promising anything, and we're also not saying no yet. It's a conversation we have very regularly about this kind of thing. So yeah, that's currently the stance we have right now. But we are

    [01:09:55] Yi Tay: writing a tech report. It's not like a paper paper, but there'll be some details in the tech report, not complete details, but some details.

    [01:10:04] Yi Tay: But yeah, so I think that's mainly like the extent of like how we're thinking about things right now, yeah.

    [01:10:09] Alex Volkov: Awesome. So first of all, I want to consider you guys friends of ThursdAI. Thanks for coming on the pod. And here, we definitely love open source. We talk about it all the time. And we're just like Champions of Open Source, so if you do release anything Open Source, you're welcome to come back as well. Yi and Max, we have Swyx here, I'm actually in Swyx's audience, so you can hear them from my microphone.

    [01:10:29] Alex Volkov: And Swyx has a few follow up questions for Yi and Max as well, so Swyx, go ahead.

    [01:10:32] Swyx: Oh, sure. Yeah. Hey I actually tried to set up a chat with you when I was in Singapore, but it didn't happen.

    [01:10:39] Swyx: So sorry about that. But I actually wanted to just chat with you more about something that you hinted at in your announcement post. You talked about how much of the infra you had to rebuild at Reka. Everything, you said: everything from robust training infra to proper human evaluation pipelines and proper RLHF setups.

    [01:11:00] Swyx: I was wondering if you can just give us like a preview of What did you miss? What does Google have? And then what do you think like the industry could innovate on?

    [01:11:09] Yi Tay: Okay. That's a very interesting question. I need to think about what I can say and what I cannot say. But definitely, I miss TPUs. Having been a Googler for all my professional life, the infra was completely new to me, and at Reka we have a lot of people from GDM and Google, Alphabet in general, so I think a lot of us feel the same way. In terms of infra, I think GPU tooling is not as robust as what I experienced with the TPU infra back at Google. So the first thing is the robustness of the training, of the accelerators themselves, right? And then also even things like file I/O is something that people take for granted. At Google, the file systems, XManager and the orchestrators and stuff like that, are just so well designed at Google.

    [01:12:02] Yi Tay: And then externally, a lot of that is just missing. So I think, yeah, that's basically it on the training infra side. The tooling for training large models is not really super robust externally; it's not easy to just pick something up and train, like,

    [01:12:26] Yi Tay: a 100B model easily without actually making sure your checkpointing, your resuming from checkpoints, your nodes failing and stuff like that are taken care of. I think those are hard things that need to be handled, but at Google some team does that for you.

    [01:12:43] Yi Tay: Yeah, TLDR of the training infrastructure, yeah.

    [01:12:48] Swyx: Does Google have the equivalent of Weights and Biases?

    [01:12:51] Yi Tay: TensorBoard, I think, yeah.

    [01:12:53] Swyx: Oh yeah, yeah, yeah, of course.

    [01:12:55] Yi Tay: Yeah yeah, yeah, yeah yeah.

    [01:12:58] Alex Volkov: So

    [01:12:58] Alex Volkov: we don't work with Google yet, but hopefully if if folks at Google are listening to us and you want to use kind of Weights Biases, definitely reach out. But at least you guys, now that you're out of Google, you definitely can. You want to follow up with Swyx, or are you,

    [01:13:10] Swyx: are you Oh,

    [01:13:10] Swyx: I don't know. Did you guys talk about Reka Core already?

    [01:13:13] Alex Volkov: Yeah, so I think, Yi, there's not a lot of stuff that you can say about the bigger model that you guys have, but give us a little teaser live for a few folks here on stage, like what can we expect from the bigger model, maybe when, what can you tell us?

    [01:13:28] Yi Tay: So the bigger model, okay, so I can just say that we ourselves are quite impressed by the results, and if you try to extrapolate from our 7B and 21B relative to other models at those scales, you can try to imagine what the metrics look like, right? But we ourselves are quite impressed by the metrics.

    [01:13:49] Yi Tay: So I think that's all we can say. I think in the post we said coming out in the coming weeks, around that ballpark. It's not like next week, that kind of thing, but it's also not one or two weeks; it's probably a couple of weeks. But we're also a bit tired after the release.

    [01:14:05] Yi Tay: We'll take a few days' light break and then start working again, that kind of thing. So yeah, I think that's basically what I can say, but we are very happy with the model as well, yeah.

    [01:14:17] Alex Volkov: All right, so we're excited to see this. I want to flip back to Max just for a second. Max, as we just covered, there's some stuff that I used that you guys were watching, like, oh, finally somebody tested this out. When folks interact with your demo, first of all, I'll just say, definitely folks should do the thumbs up, thumbs down, and reply, so you guys will get some nice RLHF.

    [01:14:35] Alex Volkov: What other venues are there for folks to give you feedback? Is there a Discord you want to call out, or anything else you want to add as we move on?

    [01:14:44] Max Bain: Yeah, thanks guys. We actually have a Discord channel, and if people post use cases where maybe our model is doing well, or could do better, you can post that, or maybe there's something you're not happy with in the current models, like GPT-4V also. And I guess, 'cause we're such a small team at an early stage,

    [01:15:02] Max Bain: we're taking a lot of that on board, and if you can point out any of that stuff, if you have it in more detail, you can put that on the Discord, and yeah, we're really happy for any feedback.

    [01:15:10] Alex Volkov: awesome. Are you guys distributed, by the way? Are you working co located? Like, where's, where's RECA located?

    [01:15:16] Max Bain: Like, all over the globe, yeah, So he's in Singapore, I'm, like London, sometimes the West Coast, but yeah, it's like a remote first

    [01:15:23] Max Bain: company.

    [01:15:25] Max Bain: And also, yeah, sorry, another thing is we have job postings. So if you guys like the sound of Reka, you can also apply to join. We have quite a few

    [01:15:35] Max Bain: positions open.

    [01:15:42] Alex Volkov: friends of the pod from now on. Yi, anything else you wanna add as we finish up and then move to the next thing?

    [01:15:49] Yi Tay: No, thanks. Yeah, really thanks for inviting. It's really nice chatting with you. And yeah, it's been great. Yeah.

    [01:15:56] Alex Volkov: I'm, I was, like, like I said, I was blown away by the performance of the multimodality. I was blown away by the tonality understanding, which I've never experienced in any model so far. I heard that it's possible and I saw some technical stuff. I never experienced this on something like my videos as well.

    [01:16:11] Alex Volkov: Definitely folks should play around with, with the demo. I'll add this in the show notes and follow Yi and Reka and, oh yeah, one last thing Yi, before you go. What's the meaning of Reka? I know this is a word in Hebrew that I know, but what's, what's the meaning of this word? Like, where, where did this come from?

    [01:16:24] Alex Volkov: I was really curious.

    [01:16:26] Yi Tay: I think one of the meanings, and it's not official, it's not canon, is that it comes from the Reka in Eureka. But it's a bit reverse-engineered: when people ask us, this is what we say, but it's not really canon, it's just one of the interpretations, yeah.

    [01:16:49] Alex Volkov: Awesome. Thank you guys for joining, and folks, definitely go check out the demo. And I think the tradition continues, because now we're moving on to the diffusion area and we have the awesome chance to have Dom here. Stability just released, or I guess we saw this week, a new release called Stable Cascade.

    [01:17:09] Alex Volkov: And Dom, I reacted to Imad's tweet about this: hey Imad, do you want to come to ThursdAI? And he said that Dom and, did you say Rodrigo was the other guy?, are the real heroes. So I want to welcome Dom to the stage. Dom, welcome, feel free to unmute yourself and give a brief introduction. Let's talk about Stable Cascade.

    [01:17:25] Dome: So yeah, my name's Dom. I joined Stability only a couple of months ago, actually, and I'm currently finishing up a degree in Germany. I met Pablo more than a year ago, and ever since then we've been working on generative models, mostly in vision, so the image modality, and also slowly moving into video stuff. And yeah, pretty early on we connected to Stability via LAION, and at some point they liked what we were doing and liked the progress of the paper that we called Würstchen, which is German and means sausage.

    [01:18:09] Dome: I can tell more about that

    [01:18:10] Alex Volkov: Oh, that's what it means! Okay.

    [01:18:13] Dome: Yeah, yeah. And so then we joined the applied team, and we were able to work on the third version of it, which in the end was called Stable Cascade, just to make it fit in more and not confuse people about where that name comes from and what this third version is about.

    [01:18:31] Dome: And yeah.

    [01:18:34] Dome: That's bad.

    [01:18:34] Alex Volkov: Awesome. So let's, let's say hi to Pablo as well. Welcome, Pablo. Feel free to unmute yourself. Brief intro from you as well. And let's talk about what makes Cascade different than SDXL or even the V2.

    [01:18:45] Pablo: Hey, hi, Alex. A bit about myself: I'm a machine learning researcher. Before working at Stability, I used to work at Disney, so I was able to bring a lot of interesting ideas from there. And then I joined Dom, and we have been working on very cool things since I met him.

    [01:19:03] Pablo: And the latest is, is our new stable cascade.

    [01:19:08] Alex Volkov: That's awesome. Let's talk about Stable Cascade. I've been able to test this out, and the things that blew me away were the speed, inference speed as well, but also that the base model already has hands built in, and they're fine. You guys said you were working on Würstchen for a couple of iterations, and this became Stable Cascade?

    [01:19:26] Alex Volkov: Like, where talk to me about the history, and why is it so good, and so fast?

    [01:19:30] Dome: Okay. Yeah. So basically the biggest difference, and I think that's what it boils down to eventually, is the space or the dimension where stuff is generated for the text-conditional part. Stable Diffusion XL has this thing called the VAE, which takes images and just compresses them down to a smaller space.

    [01:19:53] Dome: And the only reason to do that is just that you work at a smaller resolution, which then gives you faster training and faster inference. Imagine training or generating stuff at a pixel resolution of 1024, so one megapixel. This will be a lot slower than training the same model at, say, 32 by 32, for example.

    [01:20:15] Dome: So the idea is you still want to have high, high quality, high resolution images, but you don't want to generate at that very high pixel space. So you just try to find something, how you can compress it even further. And up, up until now, people always use VAEs, VQGANs, normal autoencoders and so on but they reach limits very early on.

    [01:20:34] Dome: You can get to a spatial compression of eight. So Pablo had this incredible idea of using a diffusion model to increase that compression. Long story short, by using a diffusion model on top of a normal VAE, or you could also leave the VAE away and just start at pixel space, you can achieve much, much higher compression, because the diffusion model can reconstruct iteratively, at first the lower frequencies, so the rough details, and then later on the high frequencies.

    [01:21:04] Dome: So all the details. It just has a lot more capacity to reconstruct an image, and with that it's possible to compress images a lot further. The version that we have now achieves a compression of 42, and that makes a huge difference in terms of training and inference time. And that's probably what you saw, because then

    [01:21:24] Dome: the big model, the 3.6 billion parameter one, which is quite big for images (Stable Diffusion XL is 2.2 billion, we're not in large language model territory), gets just a lot faster. And then you have this diffusion decoder, which works at a higher resolution but needs a lot fewer steps, and combining this results in making the model very fast.
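    To make those compression numbers concrete, here is a small back-of-the-envelope sketch based on the figures Dom mentions (spatial compression of 8 for a standard VAE versus roughly 42 for Stable Cascade, at a one-megapixel target); the exact latent shapes are illustrative assumptions.

```python
# Rough latent-size math for the compression factors discussed above.
# Assumption: a 1024x1024 (one megapixel) image, an SDXL-style VAE with
# spatial compression ~8, and Stable Cascade's reported compression of ~42.

image_side = 1024

vae_compression = 8        # typical VAE / VQGAN spatial compression
cascade_compression = 42   # compression reported for Stable Cascade

vae_latent_side = image_side // vae_compression               # 128
cascade_latent_side = round(image_side / cascade_compression)  # ~24

vae_positions = vae_latent_side ** 2          # 16,384 spatial positions
cascade_positions = cascade_latent_side ** 2  # ~576 spatial positions

print(f"SDXL-style latent:     {vae_latent_side}x{vae_latent_side} = {vae_positions} positions")
print(f"Stable Cascade latent: {cascade_latent_side}x{cascade_latent_side} = {cascade_positions} positions")
print(f"~{vae_positions / cascade_positions:.0f}x fewer positions for the text-conditional model to denoise")
```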

    [01:21:49] Alex Volkov: That's super cool. I want to switch back to Pablo just real quick. So I'm looking at this graph for inference speed, but also checked out some of the examples. One thing that I noticed is the real time rendering basically of how the model kind of searches through the diffusion space. And the last step just like kicks into like super high resolution.

    [01:22:09] Alex Volkov: Pablo, what can you tell us from some exciting or maybe surprising results that you've seen or people using it and Yeah, feel free to speak about your cool model a little bit more.

    [01:22:18] Pablo: Yeah, I actually I have been really surprised on how well this model could, could could be. We, we, we're not expecting it to be as good as it is. We started this more as an like a, an experimental idea of trying to achieve the same quality of existing models but focusing on, on speed on performance.

    [01:22:39] Pablo: But then somehow we ended up with a model that was like very competitive and yeah, I don't know. I think this last step as, as you mentioned, is the the, the upsampling stage. Which is this diffusion model that Dominic mentioned that can bring the image from 24 by 24 latent to a one megapixel.

    [01:23:00] Pablo: And that's why you see this like very big difference between the previous to last and the last step.

    [01:23:06] Alex Volkov: Yeah, the last step is poof, high quality. I love it.
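    In code, that two-stage split (a heavily compressed text-conditional stage followed by a diffusion decoder/upsampler) looks roughly like the sketch below. This assumes the diffusers-style integration; the pipeline class names, model ids, and the image_embeddings hand-off are assumptions here, not something the speakers confirmed, and the official Stability repository ships its own scripts.

```python
# A minimal sketch of two-stage Stable Cascade inference, assuming the
# diffusers-style integration; class names and model ids are assumptions.
import torch
from diffusers import StableCascadePriorPipeline, StableCascadeDecoderPipeline

device = "cuda"
prompt = "a photo of a corgi wearing sunglasses, studio lighting"

# Stage C ("prior"): text -> highly compressed ~24x24 image embedding
prior = StableCascadePriorPipeline.from_pretrained(
    "stabilityai/stable-cascade-prior", torch_dtype=torch.bfloat16
).to(device)
prior_out = prior(prompt=prompt, height=1024, width=1024, num_inference_steps=20)

# Stage B + A ("decoder"): embedding -> one-megapixel image in a few steps
decoder = StableCascadeDecoderPipeline.from_pretrained(
    "stabilityai/stable-cascade", torch_dtype=torch.float16
).to(device)
image = decoder(
    image_embeddings=prior_out.image_embeddings.to(torch.float16),
    prompt=prompt,
    num_inference_steps=10,
    guidance_scale=0.0,
).images[0]
image.save("stable_cascade.png")
```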

    [01:23:11] Dome: Yeah, we, we, yeah, we, we actually provided a previewer. So when we work in this very highly compressed latent space, In order to be able [01:23:20] to see what the model is doing, we have this very tiny convolutional model that can preview what's going on. That's what you're seeing, which looks pretty blurry. And then yeah, the final step does that.

    [01:23:33] Dome: And yeah, as for why the model can do this so well, we're also pretty surprised. The big

    [01:23:41] Alex Volkov: Text is also very impressive, let's not skip over this. The out-of-the-box text is so good compared to, let's say, Stable Diffusion 1.4, which when it released was bigger, right? I think it was like five gigabytes or something. This is just miles, miles better. And the text out of the box, hands out of the box, is very impressive.

    [01:23:59] Alex Volkov: Text is super cool as well. Very surprising. Yeah, go ahead, please.

    [01:24:02] Pablo: The, the, the biggest difference compared to V2, which was our previous iteration of the model was the size of the architecture of the model and the quality of the data, which I think. It shows how important that, that is, and I think probably, since, since our model is able to work on this very, very highly compressed space, it can learn much more efficiently if, if it has good data, it can learn much more efficiently these, these kind of things.

    [01:24:30] Pablo: Maybe it learns them faster than other models which is why Yeah, we're able to have this kind of results.

    [01:24:39] Alex Volkov: Awesome. Thank you guys for coming up. I really wanted to make sure that, yeah, you guys get the recognition because like really, really cool. This is under the stability membership, right? This is not like fully, fully open source, but folks are going to be able to use this model for, for their stuff and maybe keep training.

    [01:24:55] Alex Volkov: Does it support all of the, the, the fine tuning and the LoRa ecosystem as well?

    [01:24:59] Pablo: Yeah, one detail: it's not yet under the subscription, it's still for research only, but that will probably change in the following weeks. You asked about the LoRAs and ControlNets: we made sure to provide some example code for training LoRAs, ControlNets, and full fine-tunings in our repository. We also provide some pre-trained ControlNets for inpainting, for canny edges, for super resolution, which is not the best super resolution model out there, but it's interesting enough to share with the community. And we provided a tiny LoRA of Dom's dog, which is pretty, and

    [01:25:44] Alex Volkov: Nice.

    [01:25:45] Dome: yeah, and I think that's it for now, that's all the

    [01:25:49] Alex Volkov: Awesome. Thank you for joining, and folks, definitely give Dom and Pablo a follow. A really great shout-out for building this and releasing it from Stability; it looks really good, and I'm sure the community will adopt it. I've already seen a bunch of AI artists in my kind of

    [01:26:02] Alex Volkov: field getting very excited about the possibilities here. Thank you for your work and thank you for coming on ThursdAI. Please feel free to stay, because we're going to cover a bunch of other stuff as well, super quick. Meanwhile, I just want to do a quick reset: it's been an hour and thirty-five minutes since we started.

    [01:26:20] Alex Volkov: If you're just joining us, you're on the ThursdAI X space, which is a live recording for the ThursdAI podcast and newsletter. I'm your host,

    [01:26:28] Alex Volkov: Alex Volkov. I'm joined by co-hosts: Nisten is here on stage, Yam Peleg, and we have Swyx here, who dropped off the stage but is on the microphone. And I will move towards a corner that I have, and then

    [01:26:40] This week's Buzz

    [01:26:40] Alex Volkov: I have a surprise for Swyx I'm moving towards a corner that I have usually, which is called This Week's Buzz, where I talk about the stuff that we have, or I learn in Weights Biases every week, so if you are subscribed to the newsletter, you definitely already know this, I just learn as I go and talk about this.

    [01:26:55] Alex Volkov: If you're not subscribed to the newsletter, why not? You'll be up to date with everything that happens in the world of AI, so definitely check out thursdai.news, that's the URL, https://thursdai.news. And this week's buzz is all about this new course that we released with Hamel Husain about putting models in production.

    [01:27:13] Alex Volkov: I think I've spoken about this before: Weights & Biases has an academy where we release courses, and the courses are free for you. There's a bunch of knowledge there. The last one we talked about was with Jason Liu about Instructor, and we also have Hamel Husain, who released a course about model management in production as well.

    [01:27:29] Alex Volkov: And this is definitely a very illuminating one, including how to use Weights & Biases the way the best companies do: OpenAI does, and Microsoft and Meta, and hopefully we'll get Google at some point. The course is definitely worth checking out and signing up for. This will be in the show notes as well, and I'll post the link here too.

    [01:27:47] Interview with Swyx from Latent Space

    [01:27:47] Alex Volkov: And now, actually, yeah, Swyx is back on stage, and here's my surprise. If you follow Swyx's voice, you know that he's a co-host of Latent Space together with Alessio, and we're now sitting in the Latent Space pod studio, which looks incredible. The surprise is, I don't remember you being on the other side of the mic, so this is a surprise interview with Alex and Swyx, but you're going to be a guest and not a host, and I just wanted to hear about some of the stuff you guys are doing, and how Latent Space is going, all these things.

    [01:28:14] Alex Volkov: So this turns ThursdAI into a ThursdAI deep-dive interview, just a brief one.

    [01:28:19] Alex Volkov: I figured I'd use the opportunity to give you a surprise. This was not staged. Swyx told me he might not even be able to join, because you just flew back from

    [01:28:26] Swyx: Singapore, yeah. Cool, okay.

    [01:28:29] Alex Volkov: So as,

    [01:28:30] Swyx: I feel like we talk so much and you've been a guest on our pod like five times, so

    [01:28:36] Alex Volkov: And I would want to start with how you would introduce yourself to an audience that doesn't know you.

    [01:28:41] Swyx: So I'm Swyx. I mostly work on developer tooling, and I'm mostly known as the editor or podcaster of Latent Space, which has done pretty well.

    [01:28:51] Swyx: I think we're celebrating our first-year anniversary pretty soon. And the other half of my life is working on Smol AI and the AI Engineer conference, which we just announced for June 25th to 27th. Yeah.

    [01:29:05] Alex Volkov: Yeah. You've had quite a long career in DX as well. I think Netlify, you had a stint in

    [01:29:09] Swyx: Netlify, yeah, I was one of their earliest employees slash DevRel at Netlify. That's where a lot of people know me from; that's where I became quote-unquote famous in developer tooling and in React specifically, because I did a lot of content on React and serverless, speaking and writing. And then I've been head of developer experience for Temporal and Airbyte, and I also spent a year at AWS working on the same thing.

    [01:29:34] Alex Volkov: Hmm. Awesome. I also from that kind of that side of your career, you work with the Chroma guys as well.

    [01:29:40] Alex Volkov: And Chroma just announced that they've been around for a year, and it looks like millions of companies, and you probably had something to do with that. So shout out Jeff. And, I'm blanking out on the

    [01:29:53] Swyx: name, Anton. Yeah, yeah. So I consulted for them on their DevRel when they were doing their first hackathon, a year ago actually. And yeah, I think it seems like they are the leaders in open source vector databases. We did a chat, an interview, with David Hsu, the founder of Retool, and Retool did a state-of-AI survey among their customers about what they're using.

    [01:30:18] Swyx: And Chroma was, like, up and to the right in terms of the adoption and the NPS score, which I think NPS is actually a very important metric to keep tracking. Yeah. Really, really cool. Glad to be involved with Chroma.

    [01:30:30] Alex Volkov: You've also been prolific in writing; I know many people go to your blogs and the stuff that you have. How many publications in total are you publishing your content in right now?

    [01:30:46] Alex Volkov: You have your own personal

    [01:30:47] Swyx: one. Yeah, I have three blogs, but Latent Space is currently the primary active one. I have a personal blog, and then I have a developer-tools advising one, because I do a bunch of angel investing and advising for people.

    [01:31:01] Swyx: And I don't know, I think more people should blog! It helps you think through what you think, and share your knowledge with other people.

    [01:31:10] Swyx: And also, actually the most valuable thing is the most embarrassing thing, which is when you get things wrong. People will come out and correct you, and you will be embarrassed for a second, but then you'll remember the lesson forever.

    [01:31:21] Alex Volkov: Can you give me an example of something that you went wrong and people corrected you, and then this improved your thinking?

    [01:31:31] Swyx: Yesterday, or coming into today, right? Because I do a monthly recap, where, I think, what ThursdAI does is [01:31:40] recap news every week, and then other people, like NLW from The Breakdown, recap news every day. And I think the lower frequency granularity of a month means that I only get to do 12 of these a

    [01:31:53] Alex Volkov: year.

    [01:31:54] Swyx: And that forces me to think through, okay, what is really actually important when you step back and think about it. And for my January recap, January was a slow month, to be honest; today had more news than January. So I was trying to recap January, and I was like, okay, nothing super interesting this month.

    [01:32:11] Swyx: What do we, if we step back, think is important for AI progress? And I listed a bunch of things, long inference and all that. One thing I specifically said was not interesting for state-of-the-art models was long context.

    [01:32:26] Alex Volkov: Long context.

    [01:32:28] Swyx: I said that yesterday. It's published, I sent it out to 35, 000 people, including Satya Nadella, Drew Houston, and all the people who read the newsletter.

    [01:32:36] Alex Volkov: Satya doesn't read, he also participates, like he clicks on

    [01:32:39] Swyx: links,

    [01:32:39] Swyx: Yeah.

    [01:32:40] Alex Volkov: there's engagement, active engagement, from Satya with Latent Space.

    [01:32:43] Swyx: so it's, so it's embarrassing, but also it just forces me to think about okay, how much do I really believe in million token and ten million token context? And I know now, today I learned that Nat Friedman strongly disagrees.

    [01:32:58] Swyx: And that's good, that's useful to update on. And Google, of course. Yeah. Basically, it's not about that specific point, because we can always debate the pros and cons of that, but it's about the act of writing down what you believe and taking strong opinions, instead of saying that everything is awesome and celebrating every little bit of progress as equally important. You have to rank things, and being wrong in your rankings gives you information to update your rankings; if you don't give yourself the chance to be wrong, then you don't really learn.

    [01:33:37] Alex Volkov: You publish a bunch of stuff, and some of what you publish turns into more than just an article. You have essays, and the one essay that I remember specifically, obviously, is the AI Engineer essay. Talk to me about how you approached writing this. Is that stuff that you saw?

    [01:33:51] Alex Volkov: And as background for folks who are not familiar with you and where you are: you're sitting in the middle of the arena that you also helped coin, in San Francisco, right? We're in the middle of SoMa, Mission, Hayes Valley, somewhere there, if I'm not confusing things. We're in this space called Newton, where I think you're plugging in Latent Space, and tons of companies that we know from the Twittersphere are literally just behind us here.

    [01:34:15] Alex Volkov: There's Tab with Avi, and Julius with Rahul, and a bunch of other companies sitting right here building very cool things, and this is an example of one of those. So I think it was very natural to put those kinds of hubs within the bigger bubble of San Francisco. And you, as far as I'm concerned, were very plugged into this even before AI Engineer, right?

    [01:34:34] Alex Volkov: And potentially this is the reason why the AI Engineer conference had so many amazing speakers on stage, because, I think you told me back then, a lot of personal favors were pulled to get some folks to show up for that. And as somebody who's an outsider from Denver, like I said, right?

    [01:34:48] Alex Volkov: This is, this is incredible to see, but also it's very hard to penetrate and understand like what's going on and where the trends are. And this is part of the reason for ThursdAI. So you're sitting in the middle of this, you have all these connections, you said you're an angel investor as well. How does this shape your thinking about the AI engineer?

    [01:35:02] Alex Volkov: Do all these people talk at the hackathons? How do you draw on that to create something like this, that's fairly seminal, such that now people are considering themselves AI

    [01:35:11] Swyx: engineers? Okay. So there are two questions here.

    [01:35:15] Swyx: If I can do RAG on your questions. Yeah, please. One, how do you write impactful perspectives or come up with interesting ideas that will stick around? And two, how do you make sense of San Francisco, especially as an outsider? And I think people can hear in my voice that I'm not American.

    [01:35:34] Swyx: I'm Singaporean, and the last seven years of my developer career I did not spend in San Francisco. I only moved here in April of last year. You don't have to be in SF to have a background in tech. The other thing I should offer as context is that I have been blogging for quite a bit.

    [01:35:57] Swyx: I often say that you have to blog 50 times a year in order to get the one post a year that makes up the entire year, the one that people know you for. So this is my sort of fourth or fifth quote-unquote industry-defining blog post. I've done this for serverless, runtimes and cloud orchestration, and AWS, so I've done this before and I knew the work that goes into writing something like this. Rise of the AI Engineer took two months. I had a few potential collaborators

    [01:36:35] Swyx: who ultimately did not co author but were heavily involved.

    [01:36:43] Swyx: And I can talk about the writing of the post, but the main inspiration is trying to figure out what the important directions are.

    [01:36:48] Swyx: And it is not purely about coining a term, which I think is a vanity metric, but about picking directions, in terms of identifying what is wrong about the zeitgeist. If you rewind to this time one year ago, people were very much focusing on prompt engineering. People were worried about the end of jobs for engineers, for software engineers, because of AI.

    [01:37:13] Swyx: And I think both have been proven wrong in terms of the scope of the prompt engineer. Now you no longer really hear about professional prompt engineers, because that's been replaced by the AI engineer who can code. And the ability to code to wield AI makes you a thousand times more effective than people who use AI without the ability to code.

    [01:37:37] Swyx: And I think identifying this core difference in ability, understanding that this stack is starting pretty thin and small, but it's going to grow over time, understanding that it is fundamentally very different from the ML engineer stack is a part of the mix that made me convinced that AI engineer would be a category to invest in which is why I started the conference and then pivoted the newsletter and podcast.

    [01:38:04] Alex Volkov: Yeah, so let's talk about that as well. Definitely the audience that ThursdAI draws, at least in part, is AI engineers, but also folks who train and fine-tune models. And I've noticed that AI engineering is almost like the gateway drug into the larger AI stuff, because at least the folks that I'm familiar with, the folks who are JS/TS devs, that did the Netlify stint, that did React, etc.,

    [01:38:27] Alex Volkov: they started to build with these tools. The tools are like significantly easier to get into than ML, than traditional ML. You just do some API calls open AI exposes a bunch of stuff, and suddenly you're like, oh, okay. I have, I've tapped all this power, this incredible power. I'm building intuitions about how to use this power.

    [01:38:42] Alex Volkov: I'm building intuitions about how to put this power in production for my users. They give me some feedback; how do I do more of this? Am I only limited to OpenAI, or can I go to open source and try some stuff like this? Maybe I can use Ollama, which, by the way, shout out to Ollama, our friends, who just released the Windows version.

    [01:38:56] Alex Volkov: Maybe I can do this like locally on device. Maybe you can do this on Edge, on Cloudflare, for example. All these new tools are popping up, and these people are sounding like from a very limited scope of API users, are growing into API users who also have an intuition about prompting is just one of those things, embedding in RAG and better RAG systems, like we've seen some folks going there.

    [01:39:14] Alex Volkov: Definitely the scope grows, and as every category, like frontend was a very tiny scope, JavaScript, HTML, and the client, and suddenly like it became a full stack, you have prompt and like frontend, ops, and like all of these like things. So scope grows.

    [01:39:30] Alex Volkov: Where do people learn about this new and upcoming thing?

    [01:39:32] Alex Volkov: And I think the conference is one such way. So we've talked about the conference. This is actually not your first time; I just remembered I interviewed you after the conference for a full hour, we had a full conversation, but it wasn't about Swyx. So how was the conference received after the conference?

    [01:39:46] Alex Volkov: How did it shape your direction in thinking about Latent Space and kind of exposing AI in San Francisco to the world? And let's take this to the next conference, where do you want to take us? What happened to the AI engineer?

    [01:39:59] Alex Volkov: I think I asked three or four.

    [01:40:00] Swyx: Yeah, I know.

    [01:40:00] Alex Volkov: Break them down however you want.

    [01:40:02] Swyx: So the conference was really good, but I would actually classify that as the end of a process rather than the start of a process. It basically recaps the work that people are doing in the industry over the past year.

    [01:40:14] Swyx: And then I get to curate and pick and invite people to present the best of their work and their thought, and I think that's a very privileged position. And then for me, the work begins after the conference for the next thing: picking directions. So last year was a single-track conference; this year for the World's Fair we're doing nine

    [01:40:36] Alex Volkov: When is that, just for the

    [01:40:38] Swyx: June 25th to 27th. Yeah.

    [01:40:40] Alex Volkov: Make sure you sign up. It's gonna

    [01:40:42] Swyx: Yeah, yeah. We're going four times bigger this year, 2,000 people, and last year 17,000 people tuned in on the livestream, so hopefully we'll have more impact this year. But for me, actually, it's a really good way to think about, okay, who do people want to hear from, who actually did impactful work that I will be proud to showcase 10 years from now.

    [01:41:04] Swyx: I'm always thinking about the test of time. And I was very inspired by NeurIPS, where they actually had a test of time award. And I was like,

    [01:41:10] Alex Volkov: Man, did Jeremy Howard get it or something, if I remember correctly?

    [01:41:13] Alex Volkov: No, Jeff Dean. Jeff Dean.

    [01:41:14] Swyx: Jeff Dean. Yeah.

    [01:41:16] Alex Volkov: Shoutout Jeff Dean for today, by the way.

    [01:41:17] Swyx: Yeah, yeah, for Word2Vec. I always say, some people are speculating what the Test of Time will be for next year, and it's like, Ilya Sutskever, if he ever shows his face

    [01:41:25] Swyx: again.

    [01:41:26] Swyx: And then I was like, but I know what's gonna win the Test of Time for 2027. Which is attention is all you need.

    [01:41:32] Swyx: Yeah, yeah. But basically it's a flex for any conference to say, okay, the Test of Time award goes to something that was presented here 10 years ago. And NeurIPS has been going on for 37 years.

    [01:41:46] Alex Volkov: Which of the AI Engineer presentations would stand the test of time?

    [01:41:50] Swyx: Good question. I think the audience has voted. It looks like Pydantic and Jason Liu's Instructor are very, very popular. And I think he's just fundamentally correct: there are, like, six versions of every model. You have the base model when you train it, then you have the chat-tuned model.

    [01:42:07] Swyx: And now I think it's going to be table stakes that every model should have structured output, or function calling as they call it. And it's even useful if you're not actually using it to generate code or call code, because it's very good for chain of thought. And Max Woolf, minimaxir on Twitter and on Hacker News, actually wrote a really influential post about this that I'm going to try to showcase.
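    As an aside, the structured-output pattern Swyx is describing looks roughly like the sketch below, assuming the Instructor library's patched OpenAI client plus a Pydantic response model; the model name and the reasoning field are illustrative assumptions, not anything confirmed on the show.

```python
# Minimal sketch of structured output with Instructor + Pydantic, as discussed
# above; the patched-client pattern and field names are assumptions here.
import instructor
from openai import OpenAI
from pydantic import BaseModel

class Answer(BaseModel):
    reasoning: str  # a scratchpad field, the "chain of thought" use Swyx mentions
    answer: str

client = instructor.patch(OpenAI())  # adds response_model= to chat.completions.create

result = client.chat.completions.create(
    model="gpt-4",  # illustrative model name
    response_model=Answer,
    messages=[{"role": "user", "content": "What is 17 * 23? Think step by step."}],
)
print(result.answer)  # result is a validated Answer instance, not free-form text
```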

    [01:42:27] Swyx: Yeah, for me as a conference curator that's what I do. Read a lot of stuff and then I try to try to feature like the best of things and also try to make bets that are important. I do think as content creators, like we're like the end of the food chain and not the value chain.

    [01:42:45] Swyx: And it's always important to understand like even stuff that we don't pick is very important and substantial and it's

    [01:42:53] Swyx: You're, you're picking for an audience to use at work, which is a small subset of the total progress that humanity can make.

    [01:43:01] Alex Volkov: Interesting, interesting. Tell me

    [01:43:03] Swyx: I just think, if people want to engage in philosophical conversation, they go to Lex Fridman or Dwarkesh Patel.

    [01:43:11] Swyx: And then if you want to talk about things that you can use in open source, you go to ThursdAI. We have less of an open source focus; we're very much focused on enterprise and things you can use at work to code and to build products and startups with.

    [01:43:26] Swyx: And so like I, whatever you do, as, as long as you have a clear focus for the, of the audience that you serve and you know how to reach them, then they will love you because you are, you're making literally the thing for them. And you don't have to appeal to everyone. And I think that's fine.

    [01:43:40] Alex Volkov: Switching gears from the conference: how did the podcast come about? You said you're coming up on the one-year anniversary of the podcast. And you also said you moved here in April; I did not know this, I thought you were an SF native. So how did the podcast come about? How did you and Alessio meet? Let's talk about

    [01:43:54] Swyx: later. Yeah. And we should talk about doing well in San Francisco, and the tax and immigration stuff, I think, which I think is important and something I'm

    [01:44:01] Swyx: going through but have also done well at. So the podcast specifically was because I started the newsletter writing opinion pieces on just AI stuff. It was actually inspired by Stable Diffusion at the time which was sort of August 2022 ish.

    [01:44:16] Alex Volkov: My life changed after that open sourcing.

    [01:44:19] Swyx: Yeah, and then you you really run out of opinions very

    [01:44:22] Alex Volkov: and

    [01:44:24] Swyx: and then you're like, oh, I need to generate unique or new tokens.

    [01:44:29] Swyx: The only way to do that is to get source material by interviewing people and putting a microphone in front of them. When you put microphones in front of people, they get more chatty. And sometimes they break news. For us, the big breakthrough was George Hotz when he talked about GPT 4 with being a mixture of experts.

    [01:44:44] Swyx: Yeah, that was, that was a surprise, but he likes to do that sort of thing, just drop random alpha.

    [01:44:49] Alex Volkov: He dropped it and then you guys posted it, and I had no idea what Mixture of Experts was, like most of us, and then it turned out to be true, and now we

    [01:44:59] Swyx: saw it. Now Gemini is

    [01:44:59] Alex Volkov: Gemini's Mixture of Experts, the 1.5, which is quite incredible. So that was a big thing. Was it natural to you to start turning on the microphone? Did you have to go through an adjustment period?

    [01:45:09] Swyx: another thing that people don't know is that I started four podcasts before.

    [01:45:13] Swyx: So I'm not new to the conversation game, and I'm not new to like audacity and like editing and publishing, but I think, Having taken a few runs at it helps to prep you for, like, when something actually has audience fit.

    [01:45:26] Swyx: Because all the others were very small. There were maybe like a few hundred listeners each time. This one went to number 10 on the U. S. tech charts.

    [01:45:33] Alex Volkov: Yes, I saw that. That was incredible. Is that the top, top,

    [01:45:36] Swyx: I think that's the highest it's been. Recently when it was like as high as 16 over the holidays, and then now it's dropped back down again. It's very, very volatile.

    [01:45:44] Alex Volkov: But it's like very clear that you're in the top 50 like tech podcasts in the world, even though AI is Fairly niche. And the topics you discuss are fairly technical.

    [01:45:52] Alex Volkov: Like when you talk with folks, it's not a general appeal audience for like Sweden does, or the, the guys from the four guys, the VCs, right? It's very technical. So very impressive that like you broke the top 50 charts and it wasn't by chance you bring like great guests. Like, how do you, is the same approach that you have for the engineer you do for guests as well?

    [01:46:13] Alex Volkov: Or are you now getting like requests to come on the podcast from some other

    [01:46:15] Swyx: We get requests, but usually, for the people that draw the audiences, you have to go reach out to them. Obviously, that's how it is.

    [01:46:24] Alex Volkov: I heard one such person now does not work at OpenAI, so he can potentially join podcasts as

    [01:46:30] Swyx: Yeah, he's a listener, and he has said that he'll come on at some point.

    [01:46:35] Alex Volkov: We're talking about badmephisto, for folks in the know.

    [01:46:37] Swyx: Mephisto for Fortunyaga. So yeah,

    [01:46:41] Swyx: I don't think it's actually just guests. I think it's also about focus on topics and then being engaged enough with the material that you get to ask questions that no one else asks.

    [01:46:51] Swyx: Because, for example, if you have a VC asking questions, they often ask about market and business. But if you're an engineer, you're really asking about API and limitations and trade offs, stuff like that. Things that you don't really get into unless you're, like, actually evaluating it to use something at work.

    [01:47:09] Swyx: And I think that's important. And also, for a lot of guests, we try to be the first podcast that somebody has done. We were the first podcast for Phind, for Cursor, for a bunch of these guys. So they're not experienced speakers; well, some of them are good speakers.

    [01:47:25] Swyx: But they're not experienced at the whole telling their story and all that. So you have to help them. But it doesn't matter because I think that you just try to serve your audience at the end of the day, right? What do people want to know? Ask those questions and then get out of their way and let them talk.

    [01:47:38] Swyx: The other thing that we do, the reason I say it's not just guests, is that we do special episodes where we have breaking news. We haven't done one in a while because, I don't know, I think you have taken that spot of the breaking news guy. We

    [01:47:50] Alex Volkov: We got three breaking news items today, and you were here. This is kind of like that as

    [01:47:54] Swyx: that as well. And then we also do event recaps, like we did Dev Day, we did NeurIPS, and that is a really big editing-process kind of work that I really like to do, where you're basically performing the work of summarization and curation instead of doing long-form interviews, and people really like that.

    [01:48:13] Alex Volkov: The summarization part, with multiple folks; I think I participated in one, and you did one for Dev Day and NeurIPS as well. So what's [01:48:20] next, now that we're coming up on the anniversary for Latent Space? What's next for Latent Space?

    [01:48:24] Swyx: More conversations? The weird thing is, we think that we've done as well as a technical podcast can do in the general podcasting space.

    [01:48:36] Swyx: The ultimate number of people who listen to podcasts is still very low compared to the general audience that might be interested in the same kind of content. That's why I branch out into a conference, where you produce talks that are very highly polished and all that. The way to grow a podcast is not just to podcast, it's to actually write; my essays still get a lot more readers than the podcast gets listeners, rather than growing on YouTube or whatever, and that's fine.

    [01:49:05] Swyx: I think ultimately, podcasting is a mix of entertainment and Education, right? You have to be attached to some kind of story, some kind of personality, and, and then learn something along the way that might be useful at work. So I think personally, I growing as a podcaster is about just growing your influence or understanding of an industry in general and the ability to serve an audience.

    [01:49:29] Swyx: And then maybe opening up as hosts and as industry experts as we gain knowledge and understanding. So that people come to us not just for access to guests, but access to us as well, which people have when we did the end of year listener survey people actually requested for us to have more mic time.

    [01:49:47] Swyx: Alessio and I did our first just the two of us conversation in a year and that was really good.

    [01:49:52] Alex Volkov: Wow. So are you planning more of those?

    [01:49:54] Swyx: Yeah, yeah, we, so we used to do these one on one episodes where we do Introductions to a topic, like we did Datasets 101, Benchmarks 101, and we did Transformer Math 101, and then we also did RLHF 201.

    [01:50:07] Swyx: And so we want to do more of those, where it's like it's like inspired by Acquired FM. And the work for this kind of episode is so different than a normal chat, because in normal chat you just sit down and you, you, maybe you prep a bit, a bit of question, you, you research the other guy's background, and then you just have a nice conversation, and that's it.

    [01:50:23] Swyx: Whereas for a content-heavy episode like that one, you do a week of research. You compile a whole bunch of stuff, you simmer it in your mind, and then you try to rehash it and introduce it for an audience who hasn't done that amount of work. Yeah, that is a lot more work up front, but obviously it's very high value, and also I like to call it evergreen.

    [01:50:43] Swyx: Evergreen content, meaning, like You want to build up something that will still be useful and relevant in a year.

    [01:50:48] Alex Volkov: Yeah. So definitely let me, let me just take a personal position here with Latentspace.

    [01:50:53] Alex Volkov: I've been a guest host on Latent Space a couple of times, in special episodes as well. This studio is super cool, like a home away from home; I'm able to come here to the space, with Alessio, and tap into the AI scene in San Francisco. And I've learned a bunch from just the way you run

    [01:51:11] Alex Volkov: Latent Space. For folks who are listening, it is not just a podcast. If you're subscribing just on Spotify or Apple Podcasts, you're missing a big part of it, which is the newsletter that you send, which has a bunch of links and show notes and folks that you talk about.

    [01:51:23] Swyx: There's one more part: Discord.

    [01:51:26] Alex Volkov: Oh, there's also Discord.

    [01:51:27] Alex Volkov: You do paper readings as well, right? There's a whole community that you're building.

    [01:51:30] Swyx: The Discord is surprisingly good. For the zero effort that I put into it, people just show up, and they ask really good questions, they drop things that I don't know, and then I learn from the Discord and talk about it later. But yeah, the Discord has a lot of alpha.

    [01:51:47] Swyx: And it's surprising because I have this newsletter that, I have this bot, That summarizes all the top AI discords, right? Obviously the top ones are, like, Eleuther, TheBloke what else?

    [01:51:55] Swyx: Yeah, Midjourney, yeah, but that's not very technical. That's mostly just prompting.

    [01:52:00] Swyx: Midjourney is 8 million members. That's something like 13 percent of total Discord membership, which is freaking crazy. But anyway, the Discord is the community attachment to the podcast and the newsletter. And then it's people interacting with each other, some people getting jobs, some people getting investments; I have founders coming in, and VCs there also funding them.

    [01:52:22] Swyx: And like I, I really think that every every piece of content is a minimum viable community, right? People gather, they're chatting in the Twitter space comments right now. They're chatting in your newsletter comment section. But if you let people gather together live, whether it's online or in person we also have in person meetups.

    [01:52:40] Swyx: I just had one in Singapore. We have one in San Francisco, I think, monthly.

    [01:52:45] Swyx: I hope to have it monthly. And then obviously once a year you get people together for a really big conference where like they put out their best work. So I call this community annealing, right? You have cold community, like podcasts are cold.

    [01:52:58] Swyx: Newsletters are cold because they're asynchronous: there's not somebody there, you don't expect to respond to the other person. Twitter Spaces are warm because they're live and there's some chance of live feedback. Discords are live too, but they're hot when everyone is on the same call and you're looking into each other's eyes.

    [01:53:16] Swyx: And you're conversing and having a real bond and relationship there. So communities need this whole range of warm and hot and cold, and I try to build that for Latent Space.

    [01:53:28] Alex Volkov: So for folks who are just listening on the podcast, you're missing several parts of Latent Space. The newsletter is definitely worth checking out; latent.space is actually the URL.

    [01:53:38] Swyx: And that was donated by a reader. Not donated. Sold to us for cheap.

    [01:53:42] Alex Volkov: You can consider this a donation. But also, speaking of work, I think we need to wrap up, because we're past two hours and I want to let you go back to work. I also need to edit this and send it, and I want to check out the stuff that we did. Any last parting things here?

    [01:53:56] Alex Volkov: Maybe let's touch briefly on how to succeed in SF, or is that a bigger conversation for later?

    [01:54:02] Swyx: Oh yeah, yeah, yeah. Oh man. This is such an interesting topic, especially for people who are not in sf, right?

    [01:54:06] Swyx: Yeah. I think SF is a group of humans and not a place, and they are mostly available on Twitter. But they often gather in San Francisco, and yes, when you meet them in person, there are some people that are not famous online, or not fully consistently candid online, where you talk to them in person and you're like, oh, okay, I fully understand you now, and everything that you've done and everything that you're going to do, I understand where you're coming

    [01:54:33] Swyx: from.

    [01:54:34] Swyx: And to me, that is obviously a very high offer; that's why I moved here. But you don't have to go there directly, right? One of my mentors, and the last one that I want to talk about in terms of career, is Andrew Chen, who basically blogged his way into being a general partner at Andreessen Horowitz.

    [01:54:49] Swyx: He runs one of their top three funds, the consumer fund. And he consistently says: hey, just put out your best work, learn in public, tweet a lot, instead of going to all these parties. There's always a party every week in San Francisco

    [01:55:03] Alex Volkov: Every day, multiple stacks a day sometimes, yeah.

    [01:55:06] Swyx: There was one Thursday last year with 10 AI meetups in San Francisco.

    [01:55:10] Swyx: So you can go through the motions of networking, but you still end up with a smaller network than you would if you stayed at home and just wrote a lot, or thought a lot, or did quality work. So you don't have to be in San Francisco to do that. You can keep doing that online.

    [01:55:27] Swyx: And then, take advantage of a big conference or something to come into San Francisco and actually meet people in person. And that's totally fine. I don't intend to stay in San Francisco forever, right? I have, once I know enough people, I can just come here like once a quarter and people will still think that I'm in San Francisco.

    [01:55:41] Swyx: And that's fine.

    [01:55:41] Alex Volkov: I get this question quite a lot. I've been here, maybe this is the fourth or fifth time for the past six months, and I get this question, do you live here? I was

    [01:55:48] Swyx: Yeah. I think, I think people are just like borders. I, I'm, I'm a border disrespector and I think I hope more people do that. But do come into San Francisco every now and then maybe for a big conference that's happening June 25th to 27th.

    [01:56:02] Swyx: But otherwise do great work online and people will notice it and find you and chat with you. And the in person component doesn't matter so much as plugging into the mentality and the community online.

    [01:56:12] Alex Volkov: Yeah. Swyx, it's been a surprising interview. I didn't plan on this.

    [01:56:15] Alex Volkov: I just thought, we're here, I haven't interviewed you in a while, and the anniversary of Latent Space is coming up, so huge kudos for this effort. Like, huge kudos. A big, big thank you from me, because a lot of the stuff that you did, you and Alessio pulled me through. I still get a bunch of listeners for ThursdAI

    [01:56:30] Alex Volkov: from the Latent Space work on Substack. And so a huge thanks from me, because you kind of shaped what I'm doing as well. The newsletter and podcast combo that I force myself to do every [01:56:40] week was based on the Substack stuff from you as well. And I really appreciate your friendship as well.

    [01:56:45] Alex Volkov: So thank you for coming on ThursdAI, and thank you for hosting us in Latent Space. And with that, I think I'll move on to the last piece of what we have on ThursdAI, folks, which is a recap of everything we've talked about. I'll briefly run through the recap and then I'll let you go to your day. Let me just start with the music, obviously, because how else would this work?

    [01:57:02] Alex Volkov: With that, I just want to wish you a great Thursday. Thank you for joining us from week to week. I want to thank the co-hosts that I had on stage: thank you, Nisten; thank you, Yam; thank you, LDJ. Far El was here, Alignment was here, thank you. A huge thank you to Swyx, Alessio, and the Latent Space folks for hosting me here.

    [01:57:19] Alex Volkov: A shout out to a bunch of friends in Silicon Valley who I'm gonna meet. And with that, we'll see you next week. I'm gonna go and try to somehow summarize this all in the newsletter and podcast for you. And we'll see you folks next week. From San Francisco. This has been Alex Volkov. Cheers, everyone.

    [01:57:34] Alex Volkov: Not this one. Bye bye.



    This is a public episode. If you’d like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
  • Hihi, this is Alex, from Weights & Biases, coming to you live, from Yosemite! Well, actually I'm writing these words from a fake virtual Yosemite that appears above my kitchen counter, as I'm now a Vision Pro user and I will force myself to work inside this thing and tell you if it's worth it. I will also be on the lookout for anything AI related in this new spatial computing paradigm, like THIS for example!

    But back to reality for a second, we had quite the show today! We had the awesome time to have Junyang Justin Lin, a dev lead at Alibaba, join us and talk about Qwen 1.5 and QwenVL, and then we had a deep dive into quite a few acronyms I've been seeing on my timeline lately, namely DSPy, ColBERT and (the funniest one) RAGatouille, and we had a chat with Connor from Weaviate and Benjamin, the author of RAGatouille, about what it all means! Really really cool show today, hope you don't only read the newsletter but listen on Spotify, Apple or right here on Substack.

    TL;DR of all topics covered:

    * Open Source LLMs

    * Alibaba releases a BUNCH of new QWEN 1.5 models including a tiny .5B one (X announcement)

    * Abacus fine-tunes Smaug, top of the HF leaderboard, based on Qwen 72B (X)

    * LMsys adds more open source models, sponsored by Together (X)

    * Jina Embeddings fine tune for code

    * Big CO LLMs + APIs

    * Google rebranding Bard to Gemini and launching Gemini Ultra (Gemini)

    * OpenAI adds image metadata (Announcement)

    * OpenAI API usage can now be restricted per key (Announcement)

    * Vision & Video

    * Bria - RMBG 1.4 - Open Source BG removal that runs in your browser (X, DEMO)

    * Voice & Audio

    * MetaVoice, a new Apache 2.0 licensed TTS (Announcement)

    * AI Art & Diffusion & 3D

    * Microsoft added DALL-E editing with "designer" (X thread)

    * Stability AI releases update to SVD - video 1.1 launches with a webUI, much nicer videos

    * Deep Dive with Benjamin Clavie and Connor Shorten show notes:

    * Benjamin's announcement of RAGatouille (X)

    * Connor chat with Omar Khattab (author of DSPy and ColBERT) - Weaviate Podcast

    * Very helpful intro to ColBert + RAGatouille - Notion

    Open Source LLMs

    Alibaba releases Qwen 1.5 - ranges from .5 to 72B (DEMO)

    With 6 sizes, including 2 novel ones, from as little as a 0.5B parameter model to an interesting 4B, all the way to a whopping 72B, Alibaba open sources additional Qwen checkpoints. We had the honor of having friend of the pod Junyang Justin Lin on again, and he talked to us about how these sizes were selected, noted that even though this model beats Mistral Medium on some benchmarks, it remains to be seen how well it performs on human evaluations, and shared a bunch of details about open sourcing it.

    The models were released with all the latest and greatest quantizations, significantly improved context length (32K) and support for both Ollama and LM Studio (which I helped make happen, and I'm very happy with the way the ThursdAI community is growing and connecting!)
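    If you want to kick the tires locally, a minimal sketch with transformers looks something like this; the Hugging Face repo id follows Alibaba's published naming and is an assumption here, and any of the released sizes should work the same way.

```python
# Minimal sketch of running one of the new Qwen 1.5 chat checkpoints with
# transformers; the repo id is an assumption based on Alibaba's naming.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen1.5-0.5B-Chat"  # swap for 1.8B, 4B, 7B, 14B, or 72B
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Summarize this week's open source AI news in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```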

    We also had a chat about Qwen-VL Plus and Qwen-VL Max, their API-only versions of their best vision-enabled models, and had the awesome Piotr Skalski from Roboflow on stage to chat with Junyang about those models!

    To me, a measure of ThursdAI's success is when the authors of the things we talk about come on the show, and this is Junyang's second appearance, which he joined at midnight at the start of the Chinese New Year, so it's greatly appreciated, and definitely give him a listen!

    Abacus Smaug climbs to the top of the Hugging Face leaderboard

    Junyang also mentioned that Smaug is now at the top of the leaderboards. Coming from Abacus, this is a finetune of the previous Qwen-72B, not even this new one. It's the first model to achieve an average score of 80, an impressive showing from Abacus; though they haven't released any new data yet, they said they are planning to!

    They also said that they are planning to finetune Miqu, which we covered last time, the leak from Mistral that was acknowledged by Arthur Mensch the CEO of Mistral.

    The techniques that Abacus used to finetune Smaug will be released in an upcoming paper!

    Big CO LLMs + APIs

    Welcome Gemini Ultra (bye bye Bard)

    Bard is no more, get ready to meet Gemini. It's really funny, because we keep getting confusing naming from huge companies like Google and Microsoft. Just a week ago, Bard with Gemini Pro shot up the LMSYS charts, after the regular Gemini Pro API scored nowhere near as high, and now we're supposed to forget that Bard even existed? 🤔

    Anyhow, here we are, big G's answer to GPT-4, exactly 10 months, 3 weeks, 4 days and 8 hours later, but who's counting?

    So what do we actually get? A $20/month tier for Gemini Advanced (which will have Ultra 1.0); the naming confusion continues. We get a longer context (how much?) plus iOS and Android apps (though I couldn't find it on iOS, maybe it hasn't rolled out yet).

    Gemini now also replaces Google Assistant for those with Androids who opt in (MKBHD was somewhat impressed but not super impressed), and Google is leaning into their advantage, including home support!

    * Looks like Gemini is ONLY optimized for English as well

    We had quite the conversation on stage from folks who upgraded and started using, including noticing that Gemini is a better role player, and less bland, but also that they don't yet support uploading documents besides images, and that the context window is very limited, some said 8K and some 32K but definitely on the lower side.

    Also from Google: a llama.cpp wrapper called localllm (Blog)

    OpenAI watermarks DALL-E images and adds per key API limits (finally) (Blog)

    OpenAI is using something called C2PA for pictures made by DALL-E 3, whether you're chatting with ChatGPT or using their API. It's a way to show that DALL-E 3 actually created those images. But it's just for images right now, not for text or voice stuff. Adding this info can make the files up to 32% bigger, but it doesn't mess with the quality. The tags tell you if the source was DALL-E 3, ChatGPT, or the API by including special signatures and such. Just a heads up, though, this C2PA thing isn't perfect: the metadata can get wiped either on purpose or by mistake.

    They also released an update to the developer experience that lets you track usage and also restrict usage per API key! Very needed and very helpful!

    This weeks Buzz (What I learned with WandB this week)

    First part of the live series with the Growth ML team was live and AWESOME!

    Vision

    BRIA - Open-Source background removal (non commercial)

    BRIA AI (@bria_ai), Feb 6, 2024:

    Introducing Open-Source Background Removal by @BriaAI. Now live on @huggingface, RMBG v1.4 excels in separating foreground from background across diverse categories, surpassing current open models. See demo: https://t.co/DDwncjkYqi #BriaAI #OpenSource #AI
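    If you'd rather run it locally in Python than in the browser demo, the model card exposes RMBG v1.4 as a Hugging Face image-segmentation pipeline with remote code. A rough sketch follows; the model id briaai/RMBG-1.4 and the trust_remote_code requirement are my reading of the Hub listing, so double check against the card, and remember the weights are non-commercial.

    ```python
    # Sketch: local background removal with BRIA RMBG v1.4 via transformers.
    # The model ships custom pipeline code, hence trust_remote_code=True.
    from transformers import pipeline

    remover = pipeline(
        "image-segmentation",
        model="briaai/RMBG-1.4",
        trust_remote_code=True,
    )

    # Returns a PIL image with the background removed (alpha channel added).
    cutout = remover("portrait.jpg")
    cutout.save("portrait_no_bg.png")
    ```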

    Voice

    MetaVoice (hub)

    * 1.2B parameter model
    * Trained on 100K hours of data
    * Supports zero-shot voice cloning
    * Short & long-form synthesis
    * Emotional speech
    * Best part: Apache 2.0 licensed 🔥

    Powered by a simple yet robust architecture:
    * Encodec (Multi-Band Diffusion) and GPT + Encoder Transformer LM
    * DeepFilterNet to clean up MBD artefacts

    That's it for us this week, this time I bring you both the news segment AND the deepdive in one conversation, hope it's not super long, see you here next ThursdAI! 👏

    Full Transcript:

    [00:00:00] Intro and housekeeping

    [00:00:00] ​

    [00:00:00] Alex Volkov: You're on ThursdAI, and I think it's time for us to get started with the recording and the introduction.

    [00:00:26] Alex Volkov: Happy, happy Thursday everyone! Today is February 8th, 2024. I don't know, this is the second calendar year that ThursdAI is happening in, so I don't know if I need to mention the year or not, but we're well on our way into 2024 and you're here on ThursdAI. ThursdAI is the space, the newsletter, and the podcast to keep you up to date with all of the very interesting things that are happening in the very fast moving world of AI.

    [00:00:58] Alex Volkov: Hopefully by now, all of you already have ThursdAI in your podcast, wherever you get a podcast, Spotify, recently YouTube as well, which is weird. But with this introduction, I will just say, hello myself, basically. Hey everyone. My name is Alex Volkov. I'm an AI evangelist with Weights & Biases.

    [00:01:15] Alex Volkov: Weights & Biases is the reason why this comes to life to you. And there's going to be a little segment about Weights & Biases in the middle here as well, and I'm joined on stage. Often, and pretty much every week by great friends, experts in their fields. As we talk about everything AI related this week, especially we're going to have some interesting things.

    [00:01:34] Alex Volkov: Those of you who come back week after week. Thank you, and we love that you're part of the community, and it's great to see how many people just return, and those of you who are new, we're here every week and The community doesn't stop after we finish the space. There's a bunch of spaces. I think our friend AlignmentLab had the space that went on for the full week, I think.

    [00:01:55] Alex Volkov: I don't know if he ever slept. That's maybe why he's not here on stage. But we're here every week for the two hours to give you updates for the first hour and definitely some very interesting deep dives that has been happening, that have been happening for the past few Weeks, I want to say, so I just want to shout out some friends of ours that recently we were featured in the deep dives.

    [00:02:16] Alex Volkov: We've talked with Maxime Labonne, who trained the Beagle series and then also gave a deep dive with us about model merging. That was really fun. And on the last deep dive, we talked with the Lilac folks. They're building an open source tool that lets you peer into huge datasets, imagine millions of rows, and they chunk and cluster them. And we've talked about the importance of datasets in the creation of LLMs, or large language models.

    [00:02:46] Alex Volkov: And they've taken the huge datasets of the folks who usually come up on ThursdAI. Teknium from Nous Research just released their Hermes dataset, for example, and the folks at Lilac talked to us about how that would be visualized and how you can see which parts it's comprised of.

    [00:03:03] Alex Volkov: It's quite an interesting conversation about how to approach the training and fine tuning area. And we haven't often talked about dataset curation and creation, so that conversation was a very nice one. So we have deep dives. I will say that last weekend, I also interviewed, and that's probably going to come up as a separate episode.

    [00:03:24] Alex Volkov: I interviewed Sasha Zhadan from Moscow, and this was a first for me. And I just want to like, highlight where this weird thing takes me, because that's not ThursdAI, and that's not about the news. That was just literally about AI stuff. So this guy from Moscow, and this will be dropping on ThursdAI podcast soon.

    [00:03:42] Alex Volkov: This guy from Moscow built a bot that auto swipes for him on Tinder. And that bot started using GPT Instruct, then moved to ChatGPT, and then moved to GPT-4. And he talks about how this bot kept improving with the improvement of AI. And then he autoswiped a wife, basically. And this took over the Russian side of X.

    [00:04:08] Alex Volkov: I don't know if you guys are on the Russian side of X, but I definitely noticed that everybody, that's all they could talk about. This guy previously also did some shenanigans with OpenAI stuff. And so it was a very interesting conversation, unlike anything that I did previously on ThursdAI.

    [00:04:21] Alex Volkov: And definitely that's coming more as a human interest story than anything else. But it's very interesting. And also his fiance also joined and we talked about the morality of this as well. And it was really fun. So if that kind of new type of content also interests you definitely check out.

    [00:04:37] Alex Volkov: That's probably not going to end up on X.

    [00:04:40] Alex Volkov: And I think with this, it's time to get started. , The usual way we get started here is I just run through everything that we have. Just so you know what we're going to talk about.

    [00:04:52] Alex Volkov: And then we're going to start with segment by segment. So that's

    [00:04:54] TL;DR and recap of the conversation

    [00:04:54] Alex Volkov: Hey everyone, this is a recap of everything we talked about on ThursdAI for February 8th, 2024, and we had a bunch of breaking new stuff today, specifically around the fact that Google finally gave us something. But I'm gonna do this recap properly based on the categories. So let's go. So in the category of open source LLMs, we've talked about Alibaba releasing a bunch of new Qwen models, specifically under the numbering 1.5.

    [00:05:33] Alex Volkov: And we had the great pleasure again to talk with Justin, Junyang Lin, from the Qwen team, the guy who's a tech lead there and pushes for open source. And he came up and talked about why this is a 1.5 model, not a 2 model. He also talked about the fact that they released a tiny 0.5 billion one.

    [00:05:51] Alex Volkov: This is like a very tiny large language model. I think it's really funny to say a tiny large language model, but this is the case. And he talked about multiple releases for Qwen. We also had friend of the pod Piotr Skalski from Roboflow, who's a vision expert who comes up from time to time, and the author of, I forget the name of the library.

    [00:06:12] Alex Volkov: I will remember this and put this in the show notes as well. He came up and he had a bunch of plays with the vision part of the Qwen ecosystem, and we've talked about Qwen-VL Plus and Qwen-VL Max with Justin as well, and we've talked about their potential for open sourcing these models. They also released a 72 billion parameter model that's now part of the top of the Hugging Face leaderboard, which is super cool.

    [00:06:34] Alex Volkov: So definitely a great conversation. And I love it when the authors of the things that we talk about come out and talk about them on ThursdAI. We then smoothly move to the next topic, where Abacus, the company Abacus AI, has a finetune that's now top of the Hugging Face leaderboard, and that's based on Qwen 72B, and not even the new one, the previous one, so 1.0.

    [00:06:54] Alex Volkov: And that's now the top model on the Hugging Face leaderboard, and it has an average score of over 80. And I think it's the first open source model to do that. They haven't fully released the process of what they used in order to make this much better on different leaderboards, but they have mentioned that they're going to train this model on top of Miqu or Mixtral.

    [00:07:17] Alex Volkov: And it's very interesting. They're also building some other stuff at Abacus as well. Very interesting. And then we moved to talk about LMSYS Arena. LMSYS Arena is the place that we send you to see which models users actually prefer, versus just the benchmarks and evaluations on Hugging Face.

    [00:07:35] Alex Volkov: LMSYS Arena added a bunch of open source models, so shout out OpenChat again. They added another Hermes, the finetune that Teknium did for Hermes on top of Mixtral, and they also added a bunch of Qwen versions as well. LMSYS keeps adding open source models, so you can continuously see which models are better and don't have to judge for yourself, because sometimes it's not very easy.

    [00:07:55] Alex Volkov: We also covered Jina embeddings that are fine-tuned for code, from the company Jina AI; their representative Bo Wang is a friend of the pod who has talked with us about their embeddings before. He didn't show up this time, but maybe next time. Then we moved to big companies' LLMs and APIs, and the conversation definitely turned interesting, where multiple folks here on stage paid the new $20 tax, let's say, [00:08:20] for the rebranded Bard, now called Gemini, and the launch of Gemini Ultra.

    [00:08:25] Alex Volkov: And we've talked about how long we've waited for Google to actually give us something like this. And now we're getting Gemini Ultra, and Bard is no more. Bard is essentially dead as a brand, and now we're getting the Gemini brand. So if you used to go to Bard, now you go to Gemini, but also the brain behind this has improved.

    [00:08:41] Alex Volkov: So you get Gemini Pro by default for free, I think, and Gemini Ultra is going to cost you 20 bucks a month. It's free for the next two months, so you can sign up for a trial, and then you'll get Gemini Ultra. And you'll get it not only in the web interface, you also get it in iOS and Android apps. And if you're on Android, it also integrates with the Android Assistant.

    [00:09:00] Alex Volkov: That's pretty cool. It has a context length of not very much, I think we said 8K or 16K or so, and some folks contested this in the comments, so we're still figuring out the context length. It looks like the context length is restricted in the UI, less so on the API side, and Gemini Ultra did not get an API release yet.

    [00:09:17] Alex Volkov: So we've talked about Gemini Ultra and different things there. We also covered that OpenAI adds image metadata to all DALL-E generations, whether through the UI or through the API. This image metadata can be stripped, so it's not a watermark per se, but it's definitely helpful. And OpenAI also gives us a little bit of a developer experience thing where you can restrict different possibilities per API key.

    [00:09:36] Alex Volkov: So if one key gets stolen, you can lock only that one, or you can restrict it to only a specific use as well. In the vision and video category, we've talked about the new model for background removal called RMBG from BRIA AI. It's not a fully commercial license, but you can play with it now.

    [00:09:57] Alex Volkov: There's a demo I'm going to add to the show notes. And it also runs fully on your client via the efforts of friend of the pod Xenova from Transformers.js. And it's pretty cool to have a model that removes backgrounds with, like, two clicks, with no servers. And in the voice and audio category, we talked about MetaVoice, a new Apache 2.0 licensed text to speech model, not from Meta, even though it's called MetaVoice, which is funny. It's pretty decent and has zero-shot voice cloning, which means that you can provide a piece of your voice and fairly quickly get your voice speaking back to you, generated.

    [00:10:14] Alex Volkov: We also talked about breaking news from NVIDIA AI, something called NeMo Canary 1B, which is an ASR model, an Automatic Speech Recognition model, that's now top of the leaderboards on Hugging Face, and it beats Whisper on everything, specifically for four languages.

    [00:10:48] Alex Volkov: It's trained on 85,000 hours of annotated audio, and it's a very fast conformer encoder as well. We barely covered this, but Microsoft added DALL-E editing with the Designer. So if you remember, Microsoft also did a rebrand. It used to be called Bing Chat, and now it's called Copilot.

    [00:11:07] Alex Volkov: And that Copilot now adds capabilities that don't exist in other places, like ChatGPT with DALL-E. So Microsoft's DALL-E is now integrated with the Designer thing, and they have cool features where you can edit images on the fly: you can click on different segmented objects from your generated image and say, hey, redo this in a different style.

    [00:11:27] Alex Volkov: The video for this is super cool. I'm going to add this in the show notes. And it's very interesting to see that Microsoft, with their Copilots, is moving past where ChatGPT's capabilities currently are. We also briefly mentioned and glanced through this, but Stability AI released an update to Stable Video Diffusion, including a web UI that you can use now. It's not only a model, it's a web UI as well, and that web UI is pretty cool. If you didn't get access to it, I'll link to the show notes; I think it's now possible to register. Much nicer videos, and obviously it's in the open source

    [00:11:59] Alex Volkov: as much as possible. So super cool. But the web UI shows you other people's video attempts. You can actually use their prompts to create videos of your own. They have some controls. It's very nice. Then I think we talked a little bit at the end there about Vision Pro and my experience with this as it comes to AI.

    [00:12:15] Alex Volkov: We didn't dive in into Vision Pro, even though this is my new toy in life, and I'm very happy to participate in the renaissance of spatial computing. And we covered the intersection of AI and spatial computing. And I think the very interesting part of today's ThursdAI was thanks to two new guests, Benjamin Clavié and Connor from Weaviate, and we've talked about DSPy and ColBERT, or Colbert, and RAGatouille, which is a library to use ColBERT embeddings.

    [00:12:43] Alex Volkov: And we talked about what they mean, and this was a great learning kind of experience for me. And if you see these concepts on your timeline and you have no idea what we talked about, I basically played the role of, hey, I'm the village dummy, let's say. I'm gonna re ask the question about what this means, why should we use this as well.

    [00:13:01] Alex Volkov: And I think this is our show today, folks. This is the quick summary. If I missed anything super big and important, please let me know.

    [00:13:08] Open source LLMs and AI news

    [00:13:08] Alex Volkov: But otherwise, I think we'll start with open source. All right, welcome to the open source corner. And I guess because the tradition of ThursdAI is Something releases, I go in the comments and say, Hey, I'm going to talk about this on ThursdAI. Do you want to join? And sometimes people say yes. And this is how we met Justin or Junyang here on stage. Junyang is the dev lead for the Qwen team and welcome Junyang.

    [00:13:50] Alex Volkov: It's very late where you are. So I really appreciate your time here. Please feel free to unmute and introduce yourself again. Some folks already know you, but if in case some new folks are listening to us, feel free to introduce yourself. And then let's talk about the stuff that you released.

    [00:14:06] New Qwen models 1.5 from Alibaba

    [00:14:06] Junyang Lin: Yeah, thanks Alex. Nice to be at ThursdAI, it's a very great program for us to talk about AI. I am Junyang and you can call me Justin. I'm working in the team for the LLM and LMM. And we are now working on the new LLM, Qwen 1.5, and we are also upgrading our vision language model, Qwen-VL, to Qwen-VL Plus and Max.

    [00:14:33] Junyang Lin: Plus and Max are not open sourced yet, but we have demos, so you can try them in our HuggingFace organization; you can find our demos and try Plus and Max. And Max is the best one, and I am very confident in the Max demo. And about our language model: actually this week we are open sourcing Qwen 1.5.

    [00:14:58] Junyang Lin: Maybe you have previously noticed the Qwen2 code inside the Hugging Face transformers repository. Yeah, we are moving to new code for you to use our Qwen models, because in the past few months I have been interviewing our users and they found some problems with using our code, the original Qwen code, so I'm moving a step forward.

    [00:15:23] Junyang Lin: So this is why we had the Qwen2 code, but for the models themselves, in our judgment we are still at 1.5, not 2 yet. We're still training the real Qwen 2, so this time we have Qwen 1.5. For Qwen 1.5 we are actually fixing a lot of problems, because there are some models, like 7 billion and 14 billion, that a lot of people are using, but they are actually quite old.

    [00:15:50] Junyang Lin: They were released months ago and they have some problems. Qwen 14 billion is actually only supporting around 2 to 4K context length, which is far from enough for a lot of users. So this time, we have upgraded all models to support 32,000 tokens. And for the sizes, we have released more sizes.

    [00:16:15] Junyang Lin: Previously, we had 1.8, which is the smallest one. But this time, we have 0.5, only 0.5. I used to think this one is just for experimental usage, but there are some users on Twitter who found 0.5 can still be used to do something, so if you have any comments on [00:16:40] 0.5 you can share the comments with me. And we also have 4 billion, which is between 1.8 and 7 billion.

    [00:16:46] Junyang Lin: The reason why we have 4 billion is that when we first released 1.8 billion, it was actually popular because people would like to deploy the small model to some devices like cell phones, but they found just 1.8 is not good enough for their applications.

    [00:17:07] Junyang Lin: So they want something just smaller than 7 billion, but much better than 1.8. So we have 4 billion. Yeah, we have a wide range of sizes. These are for you to choose. And,

    [00:17:19] Alex Volkov: Six, six models overall, Junyang?

    [00:17:22] Junyang Lin: Yeah. Six

    [00:17:23] Alex Volkov: Six sizes overall, but definitely more models than this, because you also released, I think for the first time, you released quantized versions as well, correct?

    [00:17:32] Junyang Lin: No, but previously we have released GPTQ,

    [00:17:35] Alex Volkov: Oh yeah.

    [00:17:35] Junyang Lin: as our convention, but this time I also have AWQ and also GGUF. Maybe GGUF is the new one; admittedly, previously I didn't know too much about AWQ and GGUF. This time I tried and everything is okay. So I just released the AWQ and GGUF.

    [00:17:52] Junyang Lin: And GGUF is the new thing for me, but it is quite popular in the community, like LM Studio, which you introduced to me. And I found a lot of people using GGUF, they use it in Ollama. So I collaborated with Ollama. So you can now just run one line of code, like ollama run qwen, so you can use the Qwen models with Ollama, and you can also use them in LM Studio.

    [00:18:15] Alex Volkov: I just wanna

    [00:18:16] Junyang Lin: No

    [00:18:16] Alex Volkov: just a tiny pause here, because I think, first of all, to highlight the importance of this community: you guys are releasing a bunch of great models in open source, and it's a great testament to the community because you're listening to what folks have been saying, how they're reacting to your models. And as part of ThursdAI, I was able to just introduce you to LM Studio and you guys worked together.

    [00:18:37] Alex Volkov: And now, the second the model drops, not only are you guys already providing us quantized versions in 4-bit and GGUF, it's also very easy to start using. And I think, just a shout out to you guys for thinking about this, because a lot of models, when they release, they just release a weights file and then it's up to the community to figure out how to run them, when to run them, what the problems are.

    [00:18:57] Alex Volkov: And this was the issue with Qwen before. It was harder to use, and maybe only on Hugging Face demos. And now you guys released it with support for the most popular open source runners out there. So Ollama, if folks haven't used Ollama by now, definitely check it out, there's a CLI, just ollama install this.

    [00:19:14] Alex Volkov: And LM Studio, which we've talked about a bunch, so shout out LM Studio. Shout out JAGS. And I'm, I was very happy to introduce both of you. So it's been great. And I've used the small model, the baby model as well. How was the reception from the community? What have you seen people do? Have there been any fine tunes already that you're excited about?

    [00:19:33] Junyang Lin: Yeah, this is a very great comment for helping us to improve. Yeah, previously, like us, a lot of people just drop open source models and they just let the community use them. But this may be not right, because we can do more for the community; maybe we can do things more easily than the community users.

    [00:19:56] Junyang Lin: So this is why we are changing our style. We try to modify our code, try to adapt to the usages to make our models more popular. And recently I found people gradually fine-tuning our models. Previously, the fine-tuning users were inside mainland China, because they had chances to talk to us, so they knew more about our models and they could finally fine-tune them.

    [00:20:24] Junyang Lin: But with the support of LLaMA-Factory and especially Axolotl (Wing Lian helped me a lot; Teknium just introduced Wing Lian to me), I found some people are using Axolotl to do it. I don't know if I pronounce his name right, but he's one of the users of Qwen, and he previously got the usage of our models and then he quickly fine-tuned a series of models, named Quyen.

    [00:20:54] Alex Volkov: Oh, Stable Quan. Yeah, I think I know what the guys are talking about. Stable Quan from also Nous Research

    [00:20:59] Junyang Lin: Yeah, StableQuan, I'm quite familiar with him, I talk to him a lot, and he just directly used our models, very quickly fine-tuning a series of models, and I find the quality is quite good.

    [00:21:12] Junyang Lin: So this is quite encouraging for me, because you find people are interested in your models, they pick them up at very fast speed. And I recently found Smaug by Abacus AI, but I got no chance to talk to them because I don't know who actually built the model, but I found Smaug 72 billion is built on Qwen 72 billion

    [00:21:37] Alex Volkov: Oh, really?

    [00:21:39] Junyang Lin: On the open leaderboard.

    [00:21:40] Alex Volkov: Smaug is the next thing we're going to talk about, so you're taking us exactly there. I think, Nisten, you have a question just before, and then we're going to move to talk about Smaug. Just on the community part, just the names you mentioned. You mentioned StableQuan, definitely friend of the pod.

    [00:21:52] Alex Volkov: You mentioned Teknium introduced you to Wing Lian, the guy from Axolotl. All of this happens in the ThursdAI community, and I love it. I'll just say that I see Robert in the audience here. Smaug is from Abacus AI, and I think Robert has some connection to Bindu, so Robert, if you can introduce Junyang to Bindu, that would be great, and then we'll figure out, like, how they used the 72B model.

    [00:22:12] Alex Volkov: 72B model that you guys released is one of the more performant ones. I think it's even outperforming Mistral Medium, is that correct?

    [00:22:21] Junyang Lin: Yeah, for this version, Qwen 1.5 72 billion, that's for the chat model; for the base model it is actually quite similar, some users have found that and I admit that. But for the chat models, we have some improvements, because this time we not only SFT the model, but we also use DPO.

    [00:22:40] Junyang Lin: We have some progress in DPO. So we've reached like 8.67 in MT-Bench. This is a relatively high score, and we just did simple DPO and it improved the model. And we also sent our model to Chatbot Arena in LMSYS, supported by Together AI, because we have some friends at Together AI. They just built an API for us, and we are now in Chatbot Arena, so you can try it in Chatbot Arena to see how it really performs.

    [00:23:18] Junyang Lin: Does it really perform just like the score of MT-Bench suggests? I'm not quite sure, because I also depend on the users' feedback.

    [00:23:27] Alex Volkov: It depends on human preference. So first of all, Justin, you're taking over my job now because you're also reporting on the stuff that I wanted to mention, but definitely a shout out for getting added to LMSYS. That's not super easy. Not every model out there on the Hugging Face leaderboard gets added there.

    [00:23:41] Alex Volkov: So definitely super cool. Yeah, please go ahead. If you have anything else to

    [00:23:46] Junyang Lin: As you have mentioned Mistral Medium, I'm not sure which one is better, Mistral Medium or Qwen 72 billion. From some reviews they might be similar; Qwen 1.5 72 billion is similar to Miqu. Some of my friends, like Blade, just tested it in EQ-Bench, and the scores are very similar, but I need some more reviews to let me really know how the 72 billion model really performs, whether it is better or worse than Miqu.

    [00:24:20] Junyang Lin: They're all okay for me. I just want real reviews for me. Yeah,

    [00:24:23] Alex Volkov: Yeah,

    [00:24:24] Junyang Lin: it.

    [00:24:25] Discussion about Qwen VL with Nisten and Piotr

    [00:24:25] Alex Volkov: awesome. Junyang, thank you for joining us. And Nisten, go ahead. You have a few questions, I think, about the interesting things about VL.

    [00:24:34] Nisten Tahiraj: Yeah, so one thing is that the 0.5Bs and the small models, I know Xenova in the audience was specifically looking for one around that size, or like a 0.3, to run on WebGPU, because then even at 32 bit, which older browsers will still support, it will still only take two gigs. So that would run anywhere.

    [00:24:58] Nisten Tahiraj: But my question. [00:25:00] So shout out to Xenova for all that, I know he's going to do something with it. But my question for you was more about the Max and the larger Qwen-VL chats: are those also based off of the 72B, and did you find more improvements in going with a larger LLM? And I also wanted to know your opinion on LLaVA.

    [00:25:27] Nisten Tahiraj: The LLaVA 1.6 method where they mosaic together four CLIP models on top to get a larger image, even though it slows down inference because now it's got to output like 2000 embeddings. So yeah, what do you think of LLaVA, and is there more stuff to share about the Qwen

    [00:25:47] Junyang Lin: VL Max? Yeah, for Plus and Max, it may be, sorry, that we are not ready to open source them.

    [00:25:57] Junyang Lin: I cannot decide these things. Yeah, actually it's built on larger language models, much larger than the Plus, and you can guess whether it is 72 billion. It is not that important, and we have found that the scaling of the language model is really important for the understanding of the VL models.

    [00:26:18] Junyang Lin: We have tested it on the MMMU benchmark and we have found that the Max model is much more competitive and performs much better than Qwen-VL Plus. Although previously many people have thought that Qwen-VL Plus is strong enough, we found that the Max has much better reasoning capabilities; it can understand something like reasoning games, like poker or things like that, some complex things that people can understand through the visual information, and it can somehow understand them.

    [00:26:52] Junyang Lin: I think the performance might be a bit lower than, but approaching, Gemini Ultra or GPT-4V for the Qwen-VL Max. We were just gathering some reviews. I'm not quite sure, but

    [00:27:05] Alex Volkov: From the review perspective, I want to say hi to Piotr, our friend here on stage from Roboflow. Piotr is one of the vision experts here on stage. Piotr, welcome. Feel free to introduce yourself briefly, but I definitely know that you got excited about some of the Qwen-VL Plus stuff, so definitely feel free to share some of your insights here.

    [00:27:30] Piotr Skalski: Okay. Yeah. And first of all, awesome to meet somebody from the Qwen team. Yeah.

    [00:27:36] Piotr Skalski: So yeah, I'm from Roboflow, like you said, and I'm responsible there for computer vision and growth. So it's like in between being an ML engineer and marketing, something like this.

    [00:27:49] Piotr Skalski: And yeah, I was experimenting with Qwen-VL Plus and Max last week. Super impressed. In my opinion, I know that you tried to be humble, maybe, but in my opinion, at least on the things that I test, it performs like the best compared to other models.

    [00:28:09] Junyang Lin: Thank you very much. Thanks for the appreciation.

    [00:28:14] Piotr Skalski: Yeah. And especially the fact, so the biggest game changer for me, and I know that there were models that were capable of that before, is the fact that you can ground those predictions and you can, for example, point to a specific element on the image. So it's not only that you can ask questions and get answers and do OCR, but you can straight up do zero shot detection if you would like.

    [00:28:40] Piotr Skalski: Yeah, which is awesome. And that's something that none of the other popular models can do to that extent, at least on the things that I tested.

    [00:28:55] Piotr Skalski: My question is, do you plan to open source it? Because it's awesome that you can try it out through the API, and I highly appreciate the fact that you created the HF space and you can go there and try it.

    [00:29:07] Piotr Skalski: But is there a chance that you will open source it, even with a limiting license? A commercial license is not necessary.

    [00:29:16] Junyang Lin: Yeah, personally I would like to open source some, but I cannot decide these things. But I think there's a chance. I'm still promoting these things inside the company, but I cannot say too many things about this stuff. But we will try, because we have found out that we ourselves can also build a very good LMM.

    [00:29:37] Junyang Lin: I think the gap between us and the big corps in LMM is very small. And we have found that our techniques, our training, are quite effective. So maybe one day we'll share them with the community, but for now it is still APIs and demos, and I will try to think about these things.

    [00:29:59] Junyang Lin: And also the question about the comparison between us and LLaVA: I have tried LLaVA 1.6, not quite frequently, I just tried it. I think it's a very good model and it has very good performance in the benchmark results, but I think the limitation of these other open source models may be that they still lack sufficient pre-training. Skalski just said that Qwen can do OCR, and you can find that Qwen's reasoning capability is quite strong, because we have done a lot of pre-training work on it.

    [00:30:39] Junyang Lin: We have done a lot of data engineering on pre-training, because we have capabilities of handling different resolutions and different aspect ratios, so that we can use the curated OCR data and put it in the pre-training. And when the vision language model can understand a lot of textual, linguistic information inside the images, it may do something like, as we said, reasoning, and you will find that really powerful, very impressive, things like that.

    [00:31:13] Junyang Lin: Yeah, I think the gap between other models and us, or also Gemini Ultra and GPT-4V, is maybe still the lack of large scale data for training. Yeah, this is my opinion.

    [00:31:27] Alex Volkov: we're waiting for more data, but we're also waiting for you guys too. I just want to thank you for being the champion for open source from within the organization, and really appreciate all your releases as well. I think Piotr and Nisten, like everybody here on stage, definitely. It feels that, and thank you for coming and talking about this.

    [00:31:45] Alex Volkov: Justin, feel free to stick around, because the next thing we're gonna talk about you already mentioned, which is Smaug 72B, which is the top of the leaderboard. And I just read through the thread from Bindu Reddy from Abacus AI, and it looks like they didn't even use 1.5. I think they used the previous Qwen

    [00:32:02] Junyang Lin: Yeah, they used the previous Qwen 72B. If they are really based on the base language model, there might not be a lot of differences, because the 1.5 base language model at 72B is actually only slightly better than the original 72B base language model. Yeah.

    [00:32:22] Alex Volkov: for the base ones. And very interesting what they

    [00:32:24] Junyang Lin: the base one.

    [00:32:25] Alex Volkov: So they don't share any techniques, but they promised to open source their techniques. They're saying, like, our next goal will be to publish these techniques as a research paper and apply them to some of the best Mistral models, including Miqu.

    [00:32:37] Alex Volkov: So I got confused. I thought that they already fine-tuned Miqu, but no, they just fine-tuned on top of Qwen. And now the top Hugging Face leaderboard model is a fine-tune of Qwen, which is also super cool.

    [00:32:50] Junyang Lin: Yeah, I'm very proud of it.

    [00:32:52] Alex Volkov: Yeah, congrats.

    [00:32:53] Junyang Lin: They are using our model to reach the top of the leaderboard. I'm also really expecting their technical report, to see how they reached the top of the benchmark. But I think it is not that difficult, because you have a lot of ways to improve your performance on the benchmark, so we'll still see how it really performs in real scenarios, especially for their chat models, yeah.

    [00:33:18] Alex Volkov: Yeah, that's true, [00:33:20] that's often the case. But I just want to shout out that the world is changing super fast. We're definitely watching and monitoring the Hugging Face leaderboard, and performing better than Mistral Medium is impressive. And this looks, at least on MMLU, this is 77. I think they said they broke the average score of 80; this is the first model that broke the average score of 80 on the open source leaderboard on Hugging Face, which is super cool, based on Qwen as well, and definitely worth it.

    [00:33:46] Alex Volkov: I'm gonna add this link to the show notes and hopefully we'll find a way to connect you guys with the Bindu team there at Abacus to see how else this can be improved even for, and whether or not these techniques can be put on smaller models as well. I think in the open source, the last thing.

    [00:34:00] Junyang Lin: expecting the chat. Yeah, I'm really expecting to chat with them. Yeah, continue,

    [00:34:05] Alex Volkov: So definitely hoping that some of our friends can connect between these awesome teams and learn from each other, which I think is the benefit of speaking in the public and putting things in open source. Now, moving on, the last thing that you definitely mentioned is the update from LMSys, which is quite a few of our friends of the pod are now also part of the chatbot arena.

    [00:34:24] Alex Volkov: They just announced this yesterday. They've added three of your versions, right? They added 1.5 72B, 1.5 7B, and 1.5 4B, and they also added OpenChat. So shout out the folks from OpenChat and the Alignment Lab and some other friends of ours who shipped OpenChat's latest release, and they also added the Nous Hermes fine-tune.

    [00:34:47] Alex Volkov: So if you guys remember, we've talked about the Nous fine-tune on Mixtral, and that improved on the mixture of experts model from Mistral a little bit, based on DPO datasets. So now that's also in the LMSYS arena, and it's now powered by Together Compute, which I have no affiliation with besides the fact that they're awesome.

    [00:35:04] Alex Volkov: They're sponsoring a bunch of stuff, and we did a hackathon together. Together is great: you can easily fine-tune stuff on their platform, but now they're also sponsoring the arena, at least to some extent, which is great because we get more models and the arena keeps going. And if you guys remember, or you probably use it, the LMSYS arena is another great way for us to feel what human preference is in models.

    [00:35:27] Alex Volkov: And for many of these models, that's more important than actual performance on evaluations, on leaderboards, et cetera. So definitely a great update from LMSYS as well. And I think that, I'm gonna ask my folks here on stage, Nisten, Far El, if there's anything else in open source that's super interesting this week; I think that's mostly it.

    [00:35:44] Alex Volkov: We can talk about Gemini.

    [00:35:48] Nisten Tahiraj: There was a dataset, which I think is pretty huge, from HackerNoon that they released. And oh, there was one more thing: HuggingFace made a GPT store.

    [00:35:58] Alex Volkov: Oh,

    [00:35:59] Nisten Tahiraj: they made their own GPT store. Yes. I think that's a big,

    [00:36:03] Alex Volkov: I want to hear about this, for sure. I haven't used it yet, but I invite the Hugging Face folks that are listening to this to come and tell us about it, because I haven't used it yet, so I don't actually have many opinions. But yeah, they released their own open source GPT store, which is super cool, and we're going to add this maybe in the show notes, but I don't have a lot to say about this.

    [00:36:24] Alex Volkov: And I think, in the spirit of Yeah, go ahead.

    [00:36:27] Nisten Tahiraj: Oh, sorry. Sorry. I'll quickly say that the HackerNoon dataset of tech articles, those are some of the best, because they have a lot of guest developers, I remember over the years they had the best ones. Those articles, that dataset, is extremely great for any kind of coding or website or whatever work you're doing.

    [00:36:50] Nisten Tahiraj: That's because it's step by step instructions on how to build something and all the code for it. It's pretty awesome, and it's at the very beginning on the Jumbotron if you guys see it, from Daniel van Strien. And yeah, it's MIT licensed and it's 6.9 million articles and you can do whatever you want with it.

    [00:37:07] Nisten Tahiraj: That, shout out to them.

    [00:37:09] Alex Volkov: We'll add this again to the show notes. And as you said something about articles and code, I remembered another thing that's definitely also worth mentioning: Jina Embeddings. If you guys remember, we had a chat with Bo Wang from Jina, a deep dive into embeddings a while ago, and Jina Embeddings released a fine-tune for code.

    [00:37:25] Alex Volkov: So just a quick shout out that embeddings can be fine-tuned, embedding models can be fine-tuned for specific purposes, and definitely embeddings for code. And for those of us who follow from week to week, we talk about embeddings a lot. We've talked about Nomic Embeddings last week, fully open source, including the training datasets.

    [00:37:42] Alex Volkov: We've talked about OpenAI changing embeddings and giving us new and cheaper ones. And Jina, we had a deep dive, and I definitely welcome you to go and check out that special episode with Bo Wang from Jina; they trained their own BERT model as the backbone, the LLM backbone that produces the embeddings, and they just released an update to their embeddings fine-tuned specifically for code retrieval.

    [00:38:03] Alex Volkov: And I think for the many folks who are building RAG systems, that's something they should be aware of: embedding models can also be fine-tuned for specific purposes like Q&A and obviously code as well. So if you haven't tried that yet and you're doing a bunch of retrieval on top of code, for example using some of the datasets that Nisten just mentioned, where there's probably code in there, definitely check this out.
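    As a quick illustration of what that looks like in practice, here's a minimal retrieval sketch. It assumes the Hub id jinaai/jina-embeddings-v2-base-code and the encode() helper that Jina's v2 models expose via trust_remote_code; if either assumption is off, swap in whichever code-tuned embedding model you prefer.

    ```python
    # Sketch: ranking code snippets against a natural language query
    # with a code-tuned embedding model (model id assumed, see note above).
    import numpy as np
    from transformers import AutoModel

    model = AutoModel.from_pretrained(
        "jinaai/jina-embeddings-v2-base-code", trust_remote_code=True
    )

    query = "read a CSV file and return the rows as dictionaries"
    snippets = [
        "def load_rows(path):\n    import csv\n    with open(path) as f:\n        return list(csv.DictReader(f))",
        "def fib(n):\n    return n if n < 2 else fib(n - 1) + fib(n - 2)",
    ]

    # encode() returns one vector per input string.
    vectors = model.encode([query] + snippets)
    query_vec, snippet_vecs = vectors[0], vectors[1:]

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Higher cosine similarity means the snippet is a better match for the query.
    for snippet, vec in zip(snippets, snippet_vecs):
        print(round(cosine(query_vec, vec), 3), snippet.splitlines()[0])
    ```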

    [00:38:25] Alex Volkov: I think we're moving on to the big company thing, and I don't have a big company transition, I do have this one though.

    [00:38:43] Google finally launches Gemini Ultra

    [00:38:43] Alex Volkov: Just as we started the space, maybe an hour before, our friends from the big G, Google, finally answered the question that we've been asking for 10 months and three weeks: where is Google? So GPT-4 was released to us after ChatGPT released in, I want to say, late November, November 30th of 2022.

    [00:39:06] Alex Volkov: Then GPT-4 was released in March of 2023. And throughout this time, there was this famous video of Satya Nadella asking where is Google, where's this, like, 600 pound gorilla in the room of search, and saying we're going to make them dance. And they definitely made them dance. And we've been waiting.

    [00:39:25] Alex Volkov: Where's Google? Where's Google? And Google has released quite a bit of stuff for us since then. Just for context, I think everybody knows this already: Google is the birthplace of the transformer paper, so most of the recent GenAI explosion can be attributed to the transformer architecture that came out of Google.

    [00:39:43] Alex Volkov: Google had trained multiple models, including PaLM, and we've talked about PaLM and PaLM 2, and I don't even remember all the names of the models that they've released for us throughout the years. Google then also, at some point, gave us Bard, which is their interface, the chat interface that people used in order to play with their models, and I think some of this was powered

    [00:40:04] Alex Volkov: by PaLM, something else as well. And recently, I think around December, they said, hey, you know what, we're here and we have this thing called Gemini, after the unification of Google Brain and DeepMind under one org. And we're going to give you Gemini Pro right now, and, that was back in December,

    [00:40:23] Alex Volkov: they told us that Gemini Ultra is coming and it's going to be better than GPT-4 and you're going to get it soon. And we've been asking, when? And today is the day, is the answer to those questions. So today we're celebrating: congrats, folks at Google, who finally released an upgrade to their LLM capabilities.

    [00:40:41] Alex Volkov: Not only an upgrade, so much of an upgrade that they've killed the Bard brand completely. No more Bard. That's what I'm understanding. No more Bard, even though that's very confusing. If you guys remember, a few weeks ago we've talked about LMSYS changes, where Bard with Gemini, I think, something confusing like this, shot up to the top of the charts and was just trailing GPT-4.

    [00:41:05] Alex Volkov: So, like, the second best model in the LMSYS arena was Bard with GPT-4, or sorry, Bard with Gemini. See how confusing this is? And now there's no more Bard, but it's still in LMSYS. Anyway, the whole naming thing is confusing, but Google, including a blog post from Sundar and everything, comes out with a new update and says, hey, Bard is no more.

    [00:41:25] Alex Volkov: It's now Gemini, and the models are also Gemini. So that's confusing. And the models are Gemini Ultra. We finally get access to Google's answer to GPT-4 today, which is incredible. That answer is Ultra 1.0. [00:41:40] And we can get this as part of a paid premium tier that's called Gemini Advanced on Google.

    [00:41:46] Alex Volkov: So you can actually go right now, you can sign up, it's 20 bucks a month, and it starts 20 bucks or 30 bucks? I think it's 20

    [00:41:52] Nisten Tahiraj: It's two months free

    [00:41:54] Alex Volkov: Yeah, and you get a two month trial, because they have to prove themselves to you, because many people will decide whether or not they're going to go with Google or with ChatGPT.

    [00:42:03] Alex Volkov: And we're going to talk about which one folks will prefer. I haven't tried it yet; literally, as I woke up, I had to prepare my notes for the space. I just want to say: Google, welcome to the party, we've been waiting for you, and I counted, it's been exactly 10 months, 3 weeks and 4 days since GPT-4 released until you came with the same level, at least based on benchmarks.

    [00:42:24] Alex Volkov: And now we're gonna talk with some folks who actually tried it. Nisten, you tried it, I think Ray, you also tried it, let's talk about your first impressions from Bard, oh, or, sorry, Gemini.

    [00:42:35] Nisten Tahiraj: One, it's heavily moderated. No one's surprised by that. It does answer and reason nicely, or at least the way it communicates, it's a lot more eloquent, I would say. It feels nicer in the way it reasons stuff out. However, compared to Mistral Medium, or Mixtral, it doesn't quite obey you. I tried my standard question, which is just, like, plan out a schedule for building a city on Mars and write the code in C and JavaScript.

    [00:43:10] Nisten Tahiraj: And that's a pretty complex question that only the best models get. And I needed to re-prompt it in order for it to give the answer. And even then, it only wrote some JavaScript. But it was really good JavaScript. However, it didn't do the rest of the task. Okay, it's not bad. It is worth using. Again, very heavily moderated.

    [00:43:33] Nisten Tahiraj: As for the vision side of it, it's extremely heavily moderated. I was even telling it to count out, I had an old gaming PC on the floor with two GPUs on the side, and I told it to make me a JSON of all the parts that it sees in the picture. It won't answer questions like, that have humans in them, or even if they're like Star Wars characters or whatever.

    [00:43:58] Nisten Tahiraj: But this, I thought, would be something pretty simple, and even this one it refused to answer. It is good, I think, but as far as the vision side goes, the open source models might have it already beat, or will soon.

    [00:44:19] Ray Fernando: Yeah, I wanted to add, Ankesh from Google DeepMind actually wrote because I've been posting some of this stuff, and he says, To preempt any confusion, multimodal queries don't go through Pro slash Ultra yet, but that is coming soon too.

    [00:44:33] Ray Fernando: Which makes sense a little bit of why you're seeing some of that stuff. I've been seeing similar things when I've been doing some image analysis or even trying to generate images that have people. One of my examples I've just been posting on my my Twitter feed is like having to analyze a meme.

    [00:44:48] Ray Fernando: So it's the hot girls meme or the hot ones meme and I was like, hey, this is very popular. Can you tell me what this meme is? And it's I'm sorry I can't because there's images of people. And then I had to do some other meme analysis with Elon Musk and it's the same type of queries. But to add to what Nisten was saying, I've been doing a lot of creative writing tasks, and the writing output has been actually really nice.

    [00:45:10] Ray Fernando: And it doesn't have all that extra fluff that you normally would get from ChatGPT 4. And what I find with OpenAI's ChatGPT 4 is that they frequently say, hey, don't use purple prose, which is all that extra fluffy stuff you read that makes people sound smart. It's like, I just want a regular sounding piece.

    [00:45:27] Ray Fernando: And usually ChatGPT would do that and then revert back to its normal state but I find that Gemini Advanced just keeps going through it and, continues with the writing pieces of things. And for coding stuff, it's really strange. You actually cannot upload any CSV or any text files.

    [00:45:43] Ray Fernando: They only let you upload images right now. So you can only have a picture of a microphone and a picture of the little icon to upload an image. Because I wanted to just do a simple analysis on my tweets with a CSV file. And it's there's no place that I see to actually upload that. And I could probably upload so many lines, but there's also a character cutoff, too, that doesn't allow me to upload a lot of code for,

    [00:46:03] Ray Fernando: A code base.

    [00:46:04] Alex Volkov: What's the, I was about to say this next thing. Do we know the context length? Anybody have an idea of where Gemini Ultra is at, roughly? 'Cause we know that GPT-4 is 128K, and I think they recently opened this up in the UI as well. I've been noticing fewer restrictions; I've been able to paste, like, a lot more code.

    [00:46:21] Alex Volkov: My test is, you guys know, my test is the transcription of the ThursdAI conversation that I paste in, and Claude with the hundred K context definitely takes all of it. GPT-4, for the Pro kind of level, used to refuse, and now recently it's okay, yeah, let me summarize this for you. Have you guys been able to sense the context length of Gemini Ultra?

    [00:46:41] Alex Volkov: Is it anywhere close? Akshay, go ahead. Welcome to the stage, buddy.

    [00:46:46] Akshay Gautam: Hello, I just wanted to bring up that their official document mentions that it's 2k context length.

    [00:46:53] Alex Volkov: Akshay, we don't get greetings of the day?

    [00:46:57] Akshay Gautam: I see. Yeah. Yeah. Greetings of the day, everybody. My name is Akshay Kumar Gautam and I'm an applied AI engineer. I was a data scientist before, but now I work with modeling and stuff. And yeah, I was literally waiting for it, I tried it when it came out, I paid for it because why not? And a lot of stuff.

    [00:47:14] Akshay Gautam: First of all, it's really good at coding. By the way, the context length is 32K, at least that's what they say. Yeah, 32K. And the model is not good at keeping context, that is what I was here to talk about. It will lose the thread; for example, if you ask it to do multiple things in a single prompt, it will not.

    [00:47:33] Akshay Gautam: Unlike chatGPT, but like with coding, it's better than chatGPT in my humble opinion.

    [00:47:41] Alex Volkov: So I want to talk about some advantages that Google has, the big dog, definitely, because of an additional thing that they released. ChatGPT has apps too, but they released an iOS and Android app, and Android also has integration with the Google Assistant, right?

    [00:47:56] Alex Volkov: So you can now join this advanced or ultra tier and use this from your Android device. Now, I'm not an Android user, but I definitely understand that the ecosystem is vast and many people just use this assistant and we're waiting for Apple. We don't have anything to say about Apple specifically today, besides the fact that, they released the, maybe the next era of computing.

    [00:48:16] Alex Volkov: But. There's nothing AI series, still the same series from like 2019 with some examples, but Google has now moved everybody who wants to, who pays the 20 bucks a month and has an Android device basically towards this level of intelligence, basically a GPT 4 level of intelligence. And I saw that Marques Brownlee, MKBHD on YouTube, like one of the best tech reviewers out there.

    [00:48:38] Alex Volkov: He has been playing with the Android stuff, and he said that the Google Assistant integration even uses your smart home stuff. So you can actually ask this level of intelligence to turn on some lights, whatever, and probably with better context. Akshay, do you have any comments on this? Have you played with the Assistant version?

    [00:48:54] Akshay Gautam: Two things. First of all, Bing Chat was already available on Android devices, right? Copilot, as it's now called. Copilot uses GPT-4, so it's already really good. And you can actually use a lot of voice stuff with Copilot as well, which was surprising. The Google Assistant, to be honest, in terms of assistants, among Siri and, I have a Samsung device so it has Bixby, among all the AI assistants Google Assistant was the best one by far, in terms of how much you can use it. And I'm hoping to get access, because I have paid for the Ultra, but I still don't have access to everything.

    [00:49:29] Akshay Gautam: Also, there's no API for Ultra, so you cannot actually test anything as well.

    [00:49:34] Alex Volkov: We haven't gotten an API, developers; Sundar Pichai said the developer announcements are going to come next week. iOS hasn't updated yet. Yeah, go ahead, Nisten.

    [00:49:44] Nisten Tahiraj: I just really quickly tested it with the entire llama.cpp file. I cut it down to 15,000 tokens and it's still too long. We know it's under 16,000 that you can paste in. I will know [00:50:00] exactly in a few minutes,

    [00:50:03] Alex Volkov: So not super, super impressive in terms of like long context. I will also

    [00:50:06] Nisten Tahiraj: at least not for the UI,

    [00:50:08] Alex Volkov: for the UI. Usually, yeah, usually for some reason they restrict the UI or they forget to update this. And then the model itself is like way longer context, but for now not extremely impressive comparatively.

    [00:50:18] Alex Volkov: And again, we're comparing the two, like, main flagship models, OpenAI's GPT-4 and now Google's Gemini Ultra. And I also want to say one thing: Gemini seems to be optimized only for English as well. Even though it will answer most questions in other languages, it looks like the optimization was focused on English.

    [00:50:36] Alex Volkov: Including some of the apps as well, which is understandable, but, as we're trying to compare apples to apples, GPT-4 is incredibly versatile in multi-language operations as well. LDJ, you have some comments? Welcome to the stage, buddy. Have you played with Ultra so far?

    [00:50:55] LDJ: Yes I was actually wondering, does anybody know of plans for them to integrate this with Google Home? Because I just asked my Google Home right now are you Gemini? And it said, I'm a Virgo. And then I asked it, what AI model are you running right now? It said, sorry, I don't understand. So I don't think it's, at least mine, I don't think is running Gemini right now.

    [00:51:16] LDJ: But

    [00:51:17] Alex Volkov: No, so I think the announcement was that the integration into Google Home will come from the Google Assistant. So if you have an Android device, you'll have Google Assistant there, where you can switch on a smarter brain, and you can ask it to integrate with your home through the device. So you can ask it to do stuff in your home.

    [00:51:34] Alex Volkov: But the Google Home itself, like the Google Home devices that you have, they're not talked about upgrading them, but maybe at some point, because why not? But I haven't seen anything on this yet. Anything else here?

    [00:51:46] Junyang Lin: I think that'd be the perfect. Sorry. Yeah, go on.

    [00:51:48] Alex Volkov: Yeah, no, that would be great. I agree with you. Being able to walk around your house and just talk with GPT 4 level intelligence to do operations, I definitely agree.

    [00:51:55] Alex Volkov: That would be great. I gotta wonder anything else here on Ultra? We've talked about its code performance. We've talked about its inability to talk about people. Anything else interesting that we want to cover so far? And again, folks, it's been two hours and we're already giving you like a bunch of info, but we'll play with this going forward.

    [00:52:12] Nisten Tahiraj: It's about 8,000, the context length that you

    [00:52:14] Alex Volkov: Are you serious? Wow, that's not a lot at

    [00:52:17] Nisten Tahiraj: that's as much as I was able to paste in, like 7,500.

    [00:52:20] Alex Volkov: So yeah, folks, you heard it here first. You'll probably get more context than you previously got, but it's not a lot comparatively. It's probably a consideration of compute for Google, right? How much context to give you — the model itself probably handles more. And it's also a vision-enabled model.

    [00:52:36] Alex Volkov: But I think that we've covered this enough. Gemini Ultra. It's here, it's very impressive from Google, and yet, I want to say personally, maybe a little bit underwhelming because, they need to convince us to move, and it's going to be the same price, and I don't know, let me just ask this before we move on.

    [00:52:55] Alex Volkov: Anybody here on stage who has access to both — do you plan to pay for this and not GPT?

    [00:53:03] Nisten Tahiraj: I haven't paid for anything since September But I'm

    [00:53:08] Junyang Lin: not the right person for this question. My company pays for, like, my ChatGPT subscription. So I might keep both

    [00:53:15] Alex Volkov: Interesting.

    [00:53:16] Junyang Lin: I'm paying for mine out of pocket. I'm just going to keep both. I like the OpenAI app because it's just the multimodal picture on my phone.

    [00:53:23] Junyang Lin: I'm on the go. For Google, I'm just curious because it's two months free. That just means that, they have me hooked. We'll see.

    [00:53:30] Alex Volkov: Yeah, it's two months free. And then let's check back in two months and see how many of us kept paying. All right. So Google also released a llama.cpp wrapper called localllm. I don't know if you guys saw this. It's pretty cool. It's an open source tool from Google that helps you run LLMs locally on CPUs, and then also on Google Cloud with a super easy integration.

    [00:53:51] Alex Volkov: Very interesting choice. They also call out TheBloke — you can download models from TheBloke with their tool. And I think it's very funny that if you go to the description on the localllm blog post, in the code snippets they tell you, hey, install openai.

    [00:54:10] Alex Volkov: So I found that really funny. But yeah, they have a wrapper there that integrates with Google Cloud as well.
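    To make that "install openai" point concrete: tools like this typically expose an OpenAI-compatible endpoint, so the regular OpenAI Python client works against a locally served model. This is a generic sketch of that pattern, not localllm's documented quickstart — the port and model name below are placeholders.

    ```python
    # Hedged sketch: talking to a locally served, OpenAI-compatible model.
    # The base_url, api_key, and model name are placeholders, not localllm defaults.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:8000/v1",  # wherever the local server listens
        api_key="not-needed-for-local",       # local servers usually ignore the key
    )

    response = client.chat.completions.create(
        model="local-model",  # placeholder identifier for the locally loaded model
        messages=[{"role": "user", "content": "Summarize ThursdAI in one sentence."}],
    )
    print(response.choices[0].message.content)
    ```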

    [00:54:15] OpenAI adds DALL-E watermarking and per API key restrictions

    [00:54:15] Alex Volkov: Running through the big companies' areas super quick: OpenAI added watermarks to DALL-E images. They use this new metadata standard called C2PA, and it embeds in the image metadata.

    [00:54:27] Alex Volkov: And so basically what this means for us is not that much, but when you download images generated with DALL-E — and I assume the same will come to Microsoft Copilot — they will now have, in the metadata where the location and everything else usually sits, the fact that they have been generated

    [00:54:43] Alex Volkov: with DALL-E. This information will sit in the metadata. For now it's only images, not text or voice or anything else from OpenAI. This happens over the API and from the ChatGPT interface as well. It increases the file size a little bit because of the extra data, but it's not super interesting.

    [00:55:00] Alex Volkov: This can be stripped, so the absence of this metadata does not mean an image wasn't generated with DALL-E. But if it is present, the image was definitely generated with DALL-E. And so this is an interesting attempt from OpenAI to say, hey, we're doing as much as we can.

    [00:55:15] Alex Volkov: It's not foolproof, but an interesting attempt. And also, I just want to mention that if, for those of us who develop with OpenAI, The API keys, they keep upgrading the developer experience there and the API keys part. And now you can restrict per API key. You can restrict its usage, which many people have been waiting for a long time.

    [00:55:33] Alex Volkov: And that's something many people have been wanting. You can create one OpenAI API key for a specific purpose and restrict it to only DALL-E, for example. I don't know if you can restrict based on credits — I don't think so — but you can definitely restrict the usage-related stuff.

    [00:55:49] Alex Volkov: That's, I think, all the updates from the big companies and the LLMs and APIs,

    [00:55:53] Alex Volkov: This week's buzz is the corner and I stopped the music too prematurely. This week's buzz is the corner where I talk about the stuff that I learned in Weights & Biases this week. And I don't know how many of you were, had a chance to join our live segments, but we definitely had a build week. And I think I mentioned this before, but actually we had a live show on Monday.

    [00:56:19] Alex Volkov: We're going to have another one, probably tomorrow. Yeah, tomorrow, I think it's noon Pacific, where I interview my team, the GrowthML team in Weights & Biases, about the build-week projects that we built last December to try and see what's the latest and greatest in this world. As we build tools for you in this world, we also want to build internal tools to see what the latest techniques are, like the stuff we just talked about.

    [00:56:46] Alex Volkov: For example, it gives us a chance to play around with them. It's like an internal hackathon. And what happened was, we built those tools and presented them to the company, and then that was basically it. And I said, hey, hold on a second. I learn best publicly — the way I just learned from Connor and Benjamin.

    [00:57:02] Alex Volkov: I learned from Nisten and Far El and all the folks in the audience. And Luigi and I had a whole section where he taught me weights and biases before. I learned the best by being public and talking about what I'm learning as I'm learning this. And so I did the same thing with our folks from the GrowthML team.

    [00:57:15] Alex Volkov: We just literally folks came up on stage and I asked them about what they built and what they learned. And we're going to summarize those learnings in the live show. And that live show, if you're interested, is all over our social, so on Weights & Biases YouTube and LinkedIn. Yes, LinkedIn, I now need to also participate in that whole thing.

    [00:57:33] Alex Volkov: So if you have tips about LinkedIn, let me know. But it's live on LinkedIn, live on YouTube. I think we did X as well and nobody came; we'll probably try to send you to the live YouTube stream. But basically the second part of this is coming up tomorrow. We're interviewing three more folks, and you get to meet the incredible team that I'm part of.

    [00:57:53] Alex Volkov: Very smart folks, like Kaggle Masters, and some of them came to Kano's show as well, which is super cool. And I found the first conversation super interesting and insightful. Definitely recommended if you're into understanding how to build projects that actually work within companies and what that process looks like.

    [00:58:11] Alex Volkov: We have folks who built something from scratch, and we have somebody who runs an actual bot with retrieval and re-ranking and evaluations and all these things, and [00:58:20] has been running it for a year, basically in production. So you can actually try our bot in Discord right now, and in Slack, and on GPTs.

    [00:58:28] Alex Volkov: If you want to hear about the difference between a mature, RAG-based bot that's in production at a professional AI company and something that somebody can quickly build in a week — we've talked about those differences as well. So definitely worth checking out that live show.

    [00:58:46] Alex Volkov: Moving on from This Week's Buzz — and I learned a lot — we're moving into vision.

    [00:58:52] Bria AI releases RMBG v1.4 background removal

    [00:58:57] Alex Volkov: And Bria AI, super quick: they released a new background segmentation model, or background removal model, that's live on Hugging Face. It's called RMBG v1.4, and I think the cool thing about this is that it now runs completely in the browser, thanks to the efforts of our friend Xenova — who is no longer in the audience, I think — from Hugging Face and Transformers.js. It's super cool. You can remove backgrounds completely without sending any images anywhere, straight from your browser. That model is called, again, RMBG, and it's not licensed for commercial use, so you cannot use it for professional stuff, but it's open for you to try and play with.

    [00:59:39] Alex Volkov: Moving to the voice and audio category: we don't have a lot of audio stuff lately. I want to say Suno is the latest and greatest thing we've talked about, but we're still waiting for some cool music creation stuff from different labs, and I definitely know some of it is coming. But in the voice category — and you know we've been talking about this — here's my position, and Nisten and I share this position.

    [01:00:01] Alex Volkov: I think, personally, the faster models come out that can clone your voice, and the faster they come out in open source, the better it is generally for society. I know it's a hot take, but I also know — and I cannot reveal the source — that voice cloning tech is going to be in open source super, super quick.

    [01:00:21] Alex Volkov: And I think it's one of those break-the-dam type things: the first major lab will release voice cloning, everybody will see that nothing happened to the world, and then everybody else will release theirs — and we know everybody has one. We've known for a long time that Microsoft has, I want to say VALL-E — was that VALL-E?

    [01:00:38] Alex Volkov: That clones your voice in under three seconds. There are papers on this from every company in the world. We know that OpenAI has one — they collaborated with Spotify and they cloned Lex Fridman's voice, and it sounds exactly like Lex Fridman. We know that companies like HeyGen, for example, I think they use ElevenLabs.

    [01:00:54] Alex Volkov: ElevenLabs has voice cloning as well. None of this is open source; everything is proprietary. So we're still waiting for open source voice cloning from a big company. But for now, we got something called MetaVoice from a smaller company. Not from Meta — it's just called MetaVoice, it's confusing.

    [01:01:08] Alex Volkov: It's just a tiny model, a 1.2 billion parameter model. It's trained on 100K hours of data, which is quite significant, but not millions of hours. And it supports zero-shot voice cloning. So basically, with just a basic sample of your voice, you're going to get a clone of your voice or somebody else's, which is what scares many people in this area.

    [01:01:30] Alex Volkov: It has long-form synthesis as well, which is super cool, and it has emotional speech. If you guys remember, we've talked about how important emotion is in voice cloning. For those of you who've followed ThursdAI for a while, you may remember myself voice-cloned in kind-of Russian — I talk with a lot of excitement, while the regular voice-cloned Alex speaks in a monotone voice that's very clearly not the same kind of person.

    [01:01:56] Alex Volkov: So emotional speech is very important. And some of this is with prompt engineering and some of this happens in voice casting or voice acting. And the best part about this MetaVoice thing is Apache 2 license and it sounds pretty good. And so we've talked about multiple TTS models, and now this model is definitely out there.

    [01:02:14] Alex Volkov: So if you're building anything and you want a TTS model with voice cloning, I think this is now the best shot you have. It's called MetaVoice. I'm going to be adding this to the show notes as well. And I think we have breaking news from a friend, VB, with another model called NeMo Canary.

    [01:02:30] Alex Volkov: So let's take a look. Yeah, definitely a new model from NVIDIA. It's called Nemo. Let me actually use this. I want to use the sound as much as possible.

    [01:02:50] Alex Volkov: So I'm gonna go and try and find this tweet for you, but basically we have breaking news. Literally, VB — a friend of the pod who's in charge of, like, all the cool voice-related and TTS-related tech at Hugging Face — mentioned that NVIDIA AI released NeMo Canary.

    [01:03:07] Alex Volkov: NeMo Canary is at the top of the Open ASR Leaderboard. VB is also one of the folks who runs that leaderboard for us. ASR stands for automatic speech recognition — no, I think I'm confusing this — yes, automatic speech recognition. Cool, thank you, Nisten. So basically, if you guys remember Whisper, we've talked about Whisper a lot.

    [01:03:25] Alex Volkov: This is the leaderboard, and Whisper had been on top of it for a while. Recently, NVIDIA has done some stuff in this space, with models like Parakeet. And now we have a new contender on the ASR leaderboard called NeMo Canary 1B. 1B is not that much — the biggest Whisper, Whisper large, I think is 2.5B or something.

    [01:03:44] Alex Volkov: This is now the top of the ASR leaderboard. It beats Whisper and it beats Seamless from Meta as well. I don't know about the license here. It supports four languages — Whisper obviously supports around a hundred, which we know is the best option for many low-resource languages. It's trained on not that many hours of annotated audio, only 85,000 hours or so, and it's super fast as well.
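    For anyone who wants to poke at Canary themselves, a hedged sketch of transcription through NVIDIA's NeMo toolkit looks something like this — the class and method names follow the model card as I recall it, so double-check them before relying on this.

    ```python
    # Hedged sketch: transcribing audio with NVIDIA's Canary-1B via the NeMo toolkit.
    # Verify class/method names against the nvidia/canary-1b model card.
    from nemo.collections.asr.models import EncDecMultiTaskModel

    canary = EncDecMultiTaskModel.from_pretrained("nvidia/canary-1b")

    # Canary is multitask (transcription + translation); this is plain English ASR.
    transcripts = canary.transcribe(paths2audio_files=["meeting.wav"], batch_size=1)
    print(transcripts[0])
    ```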

    [01:04:10] Alex Volkov: It's very interesting that NVIDIA is doing multiple things in this area. We had Parakeet, now we have Canary as well. What else should we look at? I think it beats Whisper, and by a considerable margin, again, on these specific languages. Folks, we've been on this trend for a while, and I think it's clear.

    [01:04:28] Alex Volkov: Incredible automatic speech recognition comes on device very soon. Like this trend is very obvious and clear. I will add my kind of thoughts on this from somebody who used Whisper in production for a while. The faster it comes on device, the better. And specifically, I think this will help me talk about the next topic.

    [01:04:47] Alex Volkov: Let's see what else I have to cover. Yeah, I think it's pretty much it. The next topic

    [01:04:51] Nisten Tahiraj: I'm trying it right now, by the way. And it's pretty good.

    [01:04:55] Alex Volkov: Are you transcribing me in real time or what are you doing?

    [01:04:58] Nisten Tahiraj: yeah, I was transcribing your voice through the phone to my laptop but weirdly enough it doesn't output numbers, it only outputs words however

    [01:05:06] Nisten Tahiraj: It seems pretty good, huh? I don't know, it seems good to

    [01:05:09] Nisten Tahiraj: me, LGTM looks good to me.

    [01:05:11] Alex Volkov: Yeah, looks good to me. Absolutely. The word error rate for Whisper is around 8%, I think, on average for these languages, and for Canary it's less — it's around 5%, if I remember correctly. VB told us that word error rate is basically how many mistakes it makes per 100 words, so this does five mistakes versus eight, I think, on the general datasets.
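    To make that metric concrete, here's a tiny sketch of how word error rate is usually computed, using the third-party jiwer library and two made-up sentences.

    ```python
    # Word error rate: (substitutions + insertions + deletions) / reference word count.
    # jiwer is a common WER library; the sentences below are invented examples.
    from jiwer import wer

    reference = "the quick brown fox jumps over the lazy dog"
    hypothesis = "the quick brown fox jumped over a lazy dog"

    print(f"WER: {wer(reference, hypothesis):.2%}")  # 2 errors / 9 words ≈ 22%
    ```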

    [01:05:36] Alex Volkov: Quite incredible. This is coming and I think I'll use this to jump to the next thing

    [01:05:39] Alex finds a way to plug Vision Pro in spaces about AI

    [01:05:39] Alex Volkov: . The next thing, and briefly we'll cover this, is that I haven't used it for the show, but for the past, since last Friday, basically, I've been existing in reality and in augmented virtual spatial reality from Apple.

    [01:05:52] Alex Volkov: And the reason I finally have a chance to connect these two things is because. I use a lot of the hand gestures within the Vision Pro from Apple, which was released on Friday and a lot of voice as well. And obviously we've talked about Siri, we've talked about finally Google stepping up with their assistant.

    [01:06:08] Alex Volkov: Siri voice recognition and also typing is not that great. And I know because I used Whisper in production for a bunch. I also use Super Whisper, shout out Neil on my Mac to actually dictate a bunch. And all those tools, all the new tools, Whisper and now Canary and like all these things, they understand me and my accent very well.

    [01:06:26] Alex Volkov: Whereas Siri is, like, on-device. Siri actually has two automatic speech recognition passes: there's the fast one on device, and they also send your voice to the cloud and return something. So you'll [01:06:40] actually see a wrong transcription first, and then the right one replaces the wrong one. And the right one is generally okay — even with my accent it doesn't trip up as much — but the wrong one is very bad.

    [01:06:50] Alex Volkov: It's like they stopped thinking about ASR, automatic speech recognition, at Apple back in 2019, and that's what they shipped. However, there have been quite a few papers from Apple on this topic, and I know for a fact that we're getting on-device ASR. And the reason I'm excited about this in the spatial context as well is because you can talk instead of using hands on a keyboard, and that's very cool. I think that's all I had to connect with the spatial computing part — in addition, I've tried all the AI tools and games and everything, and we're still not there.

    [01:07:19] Alex Volkov: There has been one thing that I want to connect if you guys know from the diffusion model area There is a way to generate images in 360 around you and I thought this was super cool because this is essentially a holodeck moment where you can stand in full virtual embedded reality and just say, Hey, I want this thing to appear.

    [01:07:39] Alex Volkov: And we now have text-to-3D models that are coming, like, super soon. We obviously have virtual friends, but embedding them in real space would need a robot. Now, if you have this spatial computing thing, you can actually put an AI friend in the corner that you can always talk to. So there are a few attempts at this on the Apple platform,

    [01:07:57] Alex Volkov: but not a lot. And also I will ping back to this like last thing where Apple is coming. We've talked about this. Apple is coming on Friday of release of Vision Pro, which was the day after last Thursday. Apple had their uh, shareholder meeting. And in there, Tim Cook said, Hey, we launched spatial computing.

    [01:08:15] Alex Volkov: We're really happy. This is the next iteration of spatial stuff, blah, blah, blah. I definitely agree about all this. If you watch my feed for the past week, that's pretty much all I can talk about besides AI. However, going back to the AI, Tim Cook finally mentioned the word AI in the call, and he's not the only one.

    [01:08:30] Alex Volkov: It's very clear where the thing is going. Every earnings call for every major company mentioned AI. Tim Cook specifically mentioned AI finally and said, Hey. We're very excited about this technology and we're going to show you something like soon. So I expect that this WWDC is going to be Spatial and AI related and I definitely think that Apple are thinking about both just because the way Siri looks in Spatial is just incredibly like nice.

    [01:08:55] Alex Volkov: And I can see how embodying AI in your physical world, where you have spatial awareness, you can put something in the corner, it will sound like it's coming in the corner. And I'm waiting for the, for the point where that has a bot, like a Tesla Optimus bot with AI.

    [01:09:11] Alex Volkov: But before that, we'll definitely get there with spatial computing. So I'm going to have embodied AI agents around me and I'm going to ask questions. For some reason, the ChatGPT interface within the headset is horrible. And specifically because we all know that the iPhone app you can talk to, but Vision Pro only has access to iPad apps, and you can install the ChatGPT iPad app, but you cannot talk to it, which is a miss, I think, on OpenAI's part.

    [01:09:35] Alex Volkov: That's been my segment about the Vision Pro. I tried as much as possible to connect these things to AI to bring this to you. But, separately from this, my full review of Vision Pro is: holy s**t, this device is a new category of computing, and I can talk about this in a different space if you're interested.

    [01:09:50] Space reset

    [01:09:50] Alex Volkov: and I think it's time for a reset the space, as we've gone up for an hour here, folks. A little bit more than an hour. I'm just gonna play some music, reset the space, and then we're gonna have a conversation with some folks here on stage.

    [01:10:12] Deep dive into DSPy, ColBERT and RAGatouille with Ben Clavie and Connor Shorten

    [01:10:12] Alex Volkov: Welcome, everyone, to the second hour of ThursdAI, where we usually — we still have a bunch of stuff to cover from the news angle, like the Bria stuff and the MetaVoice stuff and the AI art and diffusion stuff, and also maybe we want to have some time to talk about Vision Pro. But for now, we have two guests here on stage that I want to welcome and introduce.

    [01:10:31] Alex Volkov: And we're going to talk about very interesting things that maybe some of you who follow the Twitter/X AI ecosphere have been seeing around. I really want to say thank you and welcome to Connor and Benjamin for joining us. Maybe let's unmute Connor first and then Benjamin, and just introduce yourselves.

    [01:10:49] Alex Volkov: Benjamin, I know you're going through some stuff, buddy. And as much as you can Benjamin feel free to, to talk to us, but we'll try to cover as much as possible. Conor, go ahead and then Benjamin.

    [01:10:58] Nisten Tahiraj: Hey Alex, are you able to hear me first

    [01:11:00] Alex Volkov: Yes, we can hear you loud and clear.

    [01:11:03] Connor Shorten: Awesome, cool. I think I've been like refreshing the Twitter page and all that, but awesome. So I'm Connor. I'm a research scientist at Weaviate. I also host the Weaviate podcast. And yeah, I've just been so excited about DSPy and I'm really excited to be diving

    [01:11:15] Connor Shorten: into it further.

    [01:11:16] Alex Volkov: That's awesome. And I think the Weaviate podcast was the first podcast that I came on as a guest, from NeurIPS — we had a great conversation outside the NeurIPS sign, if you guys want to check this out. Also on the Weaviate podcast, the folks from Weights & Biases had a great chat with you.

    [01:11:29] Alex Volkov: That's where I know you from. I actually researched my position and my team based on the conversation you had with them. Very knowledgeable, and thank you for that content — it's really great, and folks should definitely check it out. And I want to also say hi to Benjamin Clavie. Welcome, Benjamin.

    [01:11:44] Benjamin Clavie: Hey,

    [01:11:45] Benjamin Clavie: thank you for having me. Can you hear me?

    [01:11:47] Alex Volkov: Yes, you're coming through loud and clear.

    [01:11:50] Benjamin Clavie: Yeah. Thank you. Yeah, I've made RAGatouille, which you might have seen if you're interested in retrieval at all, which is

    [01:12:02] Benjamin Clavie: physically here, but not present in, but

    [01:12:05] Alex Volkov: So, in terms of background — could you give us a little bit of background? How did you come to build these things? What's your background? Is it AI? Give us maybe a few brief sentences there.

    [01:12:15] Benjamin Clavie: I'll say. My background

    [01:12:16] Benjamin Clavie: here is basically ai. I've done the stereotypical thing of dropping out of uni and immediately gone walking into NLP and I've been doing retrieval on NLP for 6 7 years now.

    [01:12:25] Benjamin Clavie: Very standard background.

    [01:12:27] Alex Volkov: So definitely related background. Okay. So we're here to talk about multiple multiple things, interesting things. And Conor, I think maybe let's just start with. I think the guy behind some of this work Omar Khattab is not with us, right? But definitely some of the work that we're going to talk about is attributed to him.

    [01:12:45] Alex Volkov: So maybe, Connor, can you start us with an introduction to DSPy and then Colbert? Then we're going to talk about Colbert and RAGatouille — just a brief one — and then we're going to dive into what this means for retrieval, definitely as it relates to you guys at Weaviate. RAGs are everywhere, and better RAG systems and better

    [01:13:03] Alex Volkov: Options to prompt these LLMs to better retrieve is, everybody's looking for those. So let's start maybe there.

    [01:13:12] Connor Shorten: Okay, so I'll try to keep the story going from an intro to DSPy and then taking it into retrieval. So I think the first thing about DSPy that will capture your interest is the programming model. It has this way of writing initial prompts in a really succinct way, and then you can chain together or compose these graphs of several large language model calls with tool use in the middle — and we can come into retrieval a little bit there as well. But you start off with a really coarse description of what you want it to do, like "re-rank these documents," and then it will optimize the whole description of the task as well as giving you few-shot examples to put in the prompt.

    [01:13:50] Connor Shorten: So that's the first thing that is just super interesting I'm sure everyone listening has done this like manual tweaking of the prompt to try to, get it to do your task and how irritating that can be. And so that's probably the quickest value add is it automatically will come up with the prompts.

    [01:14:03] Connor Shorten: And then when you want to switch your language model — you've been over there saying "please output JSON," four exclamation marks performing better than one — and now you switch from GPT-4 to Gemini Ultra, or say you want to see if Qwen can be few-shot prompted to do this.

    [01:14:17] Connor Shorten: You can now recompile the prompt by using DSPy, and you can switch your language model without having to then redo the prompt tuning.
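    As a rough illustration of what Connor is describing, here's a hedged sketch of the DSPy flow: declare a module from a coarse signature, then "compile" it so an optimizer writes the prompt and picks few-shot demos for whichever LM you configure. The API names follow DSPy as it looked in early 2024, and the tiny trainset and metric are stand-ins, so treat this as a sketch rather than the canonical quickstart.

    ```python
    # Hedged DSPy sketch: a declared module gets its prompt "compiled" by an optimizer.
    import dspy
    from dspy.teleprompt import BootstrapFewShot

    # Configure the LM once; swapping it later doesn't require rewriting prompts.
    dspy.settings.configure(lm=dspy.OpenAI(model="gpt-4"))

    # A coarse task description instead of a hand-tuned prompt string.
    qa = dspy.ChainOfThought("question -> answer")

    # Stand-in training data and metric for the optimizer.
    trainset = [dspy.Example(question="What is 2 + 2?", answer="4").with_inputs("question")]

    def exact_match(example, prediction, trace=None):
        return example.answer.lower() in prediction.answer.lower()

    # "Compiling" = tuning the prompt and selecting few-shot demos against the metric.
    compiled_qa = BootstrapFewShot(metric=exact_match).compile(qa, trainset=trainset)
    print(compiled_qa(question="What is 3 + 4?").answer)
    ```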

    [01:14:24] Alex Volkov: So I have to pause right here, Connor, because I'm coming to this as clean as possible with not a lot of understanding of these things . You said recompile the prompt.

    [01:14:33] Alex Volkov: I'm definitely one of the folks who've tweaked prompts, tried again, saw, okay, it works for a GPT 4. I'm definitely one of those folks. What do you mean compile the prompt, recompile the prompt? Let's talk about the compilation part of this.

    [01:14:44] Connor Shorten: Even when I met Omar, I said "compile" is overloaded. I think this analogy started with calling LLMs the new operating system, and so I think that's the line of thinking behind calling it a compiler. Really we mean automated prompt [01:15:00] tuning.

    [01:15:00] Connor Shorten: But the reason I think compiling is the right way to think about it is: let's say you have eight large language model calls — eight parts of a program. That's what I think is really exciting, and what makes LangChain so popular: people see this gallery of examples of chains where you first analyze some chunks of blog posts, extract the topics, then later aggregate the topics into a description of the topic, then maybe pass it to an editor prompt, and then maybe have a council of reviewers. Like, there's this chain, and so with each component of the chain — or I think graph is now the more common abstraction —

    [01:15:35] Connor Shorten: You have a prompt there. So let's say you have eight language, or however many, I imagine that as this, continues to evolve, we're going to see like super deep LLM the programs that will have so many LLMs in the middle of it. And so you have a prompt for each of those components.

    [01:15:49] Connor Shorten: And so that's why I think the compiling analogy is great — because you're compiling the prompts for all of these components. And yeah, so that's why I'll defend "compiling."

    [01:16:01] Alex Volkov: So I'll just say like from a perspective of a tinkerer. That's something that maybe triggers me a little bit to say, Oh, I need to compile stuff. No, I just write Python code, but you're saying developers do not fret. Compiling is not that like crazy. It's specifically very helpful and useful for larger applications and very, is very helpful for when you want to replace the brain behind the stuff that you're doing or you want to do this in a structured way.

    [01:16:24] Alex Volkov: Is that me understanding correctly of what we're talking about?

    [01:16:28] Connor Shorten: Yeah, I agree completely with that.

    [01:16:29] Alex Volkov: Awesome. So that's DSPy, and Omar Khattab — "late interaction," I think, is the nickname. We're definitely going to add him to the show notes as well; he's the author of this. DSPy has been around for a while. I definitely know that he has been posting about this quite a lot, but recently it has been picking up as well.

    [01:16:46] Alex Volkov: And maybe Colbert is one of the reasons. Can you introduce Colbert as well, Connor? Or do we have some stuff about DSPy still to cover in the introduction phase?

    [01:16:56] Connor Shorten: Okay, I can transition to Colbert.

    [01:16:58] Alex Volkov: Colbert? Colbert? How do we, how do you even pronounce this thing?

    [01:17:02] Connor Shorten: I was surprised when Omar pronounced it Colbert because it, it's Bert and then there's Stephen Colbert. I'd heard him on the podcast with I think Christopher Manning from Stanford who had, asked him about that.

    [01:17:14] Alex Volkov: So if Omar, the creator of this, pronounced it Colbert — unfortunately, even though it's BERT models — I think Colbert is what we're going with. But yeah, from Stephen Colbert. What is Colbert? Why is there excitement on my feed around this? Give us an introduction, Connor.

    [01:17:31] Connor Shorten: So the, probably the right way to start thinking about it is in search, you typically have retrieval and then re ranking and retrieval is where you have like encodings of the documents. Like you put each of the documents into an embedding model and you get a vector embedding, and then you're doing just, dot product distances between the query vector and these document vectors.

    [01:17:51] Connor Shorten: So there's no interaction between the query and the documents — the representations are encoded completely separately in retrieval. And then you'll typically pass that into a re-ranker. And so there are three kinds of re-rankers. There are point-wise re-rankers that take as input the query and the document and then output a relevance score, doing the interaction between just the query and this one document.

    [01:18:12] Connor Shorten: Then there's pair-wise, where you take two documents and the query and have a tournament of two at a time. And then there are the list-wise re-rankers, where you take all the documents as input at once. So the re-rankers are pretty effective, but you have this massive latency overhead by doing it like that.

    [01:18:28] Connor Shorten: So what Colbert introduces is this late interaction. You get the benefit of having interaction between the query and the document — most similar to the point-wise cross-encoder re-ranker — but you keep the vectors for the documents and you have this kind of interaction between the individual token vectors.

    [01:18:47] Connor Shorten: So let me, it's right now what we're doing mostly with vector search is, and this is why the BERT thing is actually really important, is because we're using these encoder only models that output that like a vector for each of the token. But then we pool all those vectors to represent the object with one vector.

    [01:19:02] Connor Shorten: But with Colbert, you keep all the vectors for the query and the document. And then — it's maybe a little hard to just talk you through the math behind this — you take the maximum similarity of each of those query vectors with all the document vectors. So say you have 100 document vectors: you're at index 0 of the query vectors and you take the maximum similarity with those 100.

    [01:19:22] Connor Shorten: Then you're at the first vector of the query, second, third, so on. And then you'll average that out. So you now have this late interaction of the vectors between the query and the document. I hope that maybe Benjamin can take the mic from here. I hope that gets the gist of it.

    [01:19:37] Benjamin Clavie: Yeah, that was pretty good. So just to clarify what max similarity is: when you're using normal vectors, like a single-vector representation, you have a single vector for the whole document.

    [01:19:48] Benjamin Clavie: When you're using Colbert, like Connor said, you've got one vector per token, and at retrieval time, what you do is you compare every single one of your query tokens, so generally not a lot, like maybe 32, and you compare that with every single token in every single document, and you make, you only keep the highest similarity, and then you sum that up, and so you compare every token to every token, you get this really fine grained comparison, instead of trying to slot everything into one massive vector, which would probably lose information.

    [01:20:17] Benjamin Clavie: Because you're doing it at the token level, it's very clear, I call this like a bag of embeddings because it's like quite close to what we do with TF IDF but with embeddings instead of like just a word count.
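    For readers who want the math as code, here's a toy illustration of that MaxSim scoring — random vectors standing in for per-token BERT embeddings, and definitely not the ColBERT library's own implementation.

    ```python
    # Toy MaxSim (late interaction) scoring: one vector per token, max over doc tokens, sum over query tokens.
    import numpy as np

    rng = np.random.default_rng(0)
    query_vecs = rng.normal(size=(32, 128))   # 32 query tokens, 128-dim each
    doc_vecs = rng.normal(size=(300, 128))    # 300 document tokens

    # Normalize so dot products behave like cosine similarities.
    query_vecs /= np.linalg.norm(query_vecs, axis=1, keepdims=True)
    doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)

    sims = query_vecs @ doc_vecs.T            # (32, 300) token-to-token similarities
    maxsim_per_query_token = sims.max(axis=1) # best-matching document token per query token
    score = maxsim_per_query_token.sum()      # ColBERT-style relevance score for this document
    print(score)
    ```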

    [01:20:29] Alex Volkov: Wow. Okay. So let me try. Connor said a bunch of stuff, then Benjamin, you simplified — let me try to simplify from my understanding. A regular, basic RAG system, without even the re-ranking step, Connor — like the basic stuff people do in the Weaviate examples, for example, or with whatever local embeddings you have. Let's say you have a vector store with a bunch of information.

    [01:20:49] Alex Volkov: You have a user asking a question, and you want to augment the LLM's information because of the knowledge cutoff. So you embed the user's query with some embedding model — we've talked about embeddings multiple times here on ThursdAI. You get some numbers back, and like Benjamin said, you get one embedding for the whole document or the whole query.

    [01:21:08] Alex Volkov: You get just one, not per token — one embedding — and then you use that to compare, and the usual similarity score is the way to compare. Then if you want to go to advanced stuff, you maybe do some re-ranking. Re-ranking is basically another LLM step, basically, right, Connor?

    [01:21:28] Alex Volkov: Or maybe some model that does re-ranking for you: you retrieve multiple examples, and you choose which one fits better, and you can do this based on several things. The downside of this is, the bigger the documents you embed, the less the concepts in that one embedding may be similar to your query.

    [01:21:47] Alex Volkov: And we've talked about this — this kind of similarity is very interesting, because an embedding definitely has dimensions, but it's hard to figure out what a huge document embedded into one vector, how should I say, averages into, with everything that happens in there. And the benefit here of Colbert —

    [01:22:06] Alex Volkov: Finally, I'm pronouncing this correctly. Colbert is that instead of embedding one time, it embeds per token. And am I getting this correctly? That sounds to me like a lot of compute. Is that correct? Embedding per token sounds okay, now we can compare each token from the query to each token of the document.

    [01:22:24] Alex Volkov: But is it significant overhead in terms of compute and indexing time? What's the downside? It sounds better on the surface.

    [01:22:32] Benjamin Clavie: So yeah,

    [01:22:33] Alex Volkov: Go ahead, Benjamin, please. Yeah.

    [01:22:35] Benjamin Clavie: clarification was quite clear in that, yeah, it's very clear, the problem with single vector representation is You've got a long document, and you're essentially asking the model to be like, I'm going to squeeze in every single thing that could be to know about this document into 500 floats or something, which is not a lot of space.

    [01:22:54] Benjamin Clavie: But, Colbert takes more storage space, to answer your question, like you will need to store more tokens even though there are compression techniques, and we'll get into that later. But compute wise, it's essentially the same, because when you're using any sort of transformer model, you'll be attending to every token anyway.

    [01:23:09] Benjamin Clavie: The only difference is Colbert actually stores those, instead of just averaging them at the end.

    [01:23:15] Alex Volkov: Oh, so the, on the output of something like Colbert, you actually get all of the [01:23:20] embeddings per token and not just one embedding per the whole document. And then you can, it's like the storage is higher, but you can actually use those for more, better, higher quality comparisons. That's what we're talking about here.

    [01:23:33] Alex Volkov: Is that correct?

    [01:23:35] Benjamin Clavie: That's the gist of it, yeah. And then after Colbert you've got Colbert V2 and PLAID, which is essentially — Omar and team found out that, yeah, that does take a lot of space, but can we compress the embeddings? So most of the time when you see Colbert used in production, it actually compresses every single token vector to just one or two bits.

    [01:23:56] Benjamin Clavie: So don't take that much space

    [01:23:58] Alex Volkov: Oh, so Colbert V2 is, what, 10x smaller or something in comparison, right? Something like this. Connor, can you speak about this? Because obviously you're in the vector database space — the more folks host, the better it is for you guys, because you get paid per token. Can you speak about the size of this and the improvement as well?

    [01:24:20] Connor Shorten: There are a couple of ways you can do this quantization. The most common is to run k-means on segments: you divide the vectors, and every two contiguous values you cluster, and then reduce the precision to, like, eight bits. So when you quantize the token vectors, you can take down the storage overhead a lot. But yeah, I think Benjamin already said it all.
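    A toy sketch of that segment-wise codebook idea, for anyone who wants to see the mechanics — this is generic product-quantization-style code using scikit-learn, not what Colbert V2/PLAID actually does internally (which quantizes residuals against centroids), and the data is random.

    ```python
    # Toy segment-wise k-means quantization: trade float32 vectors for small per-segment codes.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    token_vecs = rng.normal(size=(10_000, 128)).astype(np.float32)  # fake token embeddings

    n_segments, seg_dim = 64, 128 // 64  # split each 128-dim vector into 64 two-dim segments
    codebooks, codes = [], []
    for s in range(n_segments):
        segment = token_vecs[:, s * seg_dim:(s + 1) * seg_dim]
        km = KMeans(n_clusters=256, n_init=1, random_state=0).fit(segment)
        codebooks.append(km.cluster_centers_)      # 256 centroids per segment
        codes.append(km.labels_.astype(np.uint8))  # 1 byte per segment per token

    codes = np.stack(codes, axis=1)
    print(codes.shape, codes.dtype)  # (10000, 64) uint8 ≈ 64 bytes/vector vs 512 bytes originally
    ```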

    [01:24:43] Alex Volkov: Okay, so now let me take this into the practical realm, because the original Colbert paper came out in 2020 — I don't remember the details off the top of my head, but I have some mental documentation here that I'm using to ask you guys the proper questions. And then Colbert V2 came out with significant compression of the data, because they quantize the actual individual embeddings, and performance is essentially the same, I assume.

    [01:25:06] Alex Volkov: And it also came out a while ago. And then, Benjamin, I think you're single-handedly in charge of the resurrection, or the renewed interest, because all of what we're saying doesn't sound super easy to me. As somebody who just — okay, it's super easy for me to use a vector database, like Weaviate or other competitors or local vector stores; they all have very simple tutorials for me to just embed the query, go do a regular nearest-neighbor search or whatever, and then just do this for the user.

    [01:25:34] Alex Volkov: Now, all of what we're talking about — embedding per token, token-level comparison, all of these things — sounds complex to me, and that's where RAGatouille comes in, correct? So can you talk about this: you see all of this happening, and then what is your library doing, and why is it responsible for the resurrection of this whole concept?

    [01:25:53] Benjamin Clavie: Yeah, I don't know if I'll go as far as resurrection, but yeah, Colbert is basically used by everyone who is quite aware of search — pretty much every search startup, people at Google, etc. are using Colbert — but it never got that big outside the power-user area. And the reason, I think — it's something Omar mentioned the other day — is that I wouldn't say Colbert itself isn't usable, but it's not approachable.

    [01:26:16] Benjamin Clavie: If you go look at the repo, it's scary. There are a lot of things — how do I store those vectors, et cetera. And the point of RAGatouille is to try to bridge that gap, because we are now at the point, I think, where AI has users who aren't traditional AI power users, especially in IR. Vectors are complicated.

    [01:26:33] Benjamin Clavie: Embeddings are complicated. And the point of RAGatouille was basically: yeah, but what if you could use Colbert in just, like, four lines of code? I tried to build that, and it turned out to be quite easy to build, so that's how it came to be.

    [01:26:46] Alex Volkov: So you built it, and it was quite easy for you. What is it? Is it just a library wrapper on top of the knowledge of how to run Colbert in production? What is the library like — is this the LangChain for Colbert? Tell us what folks should expect when they open it up and say, okay, I need to use something like this.

    [01:27:03] Alex Volkov: This is super interesting. This is higher quality retrieval. How do I start?

    [01:27:09] Benjamin Clavie: Yeah, so I think there's two things here, that's where I would like it to be, and where it currently is. Where I would like it to be is to keep adding more stuff and basically bridge the gap between what's popular in IR research or retrieval, which is probably a few years ahead of what's actually popular in the mainstream because it's quite obscure.

    [01:27:26] Benjamin Clavie: And then, what it is right now: when you open the tool, there are basically two main classes. One you can use to fine-tune and train Colbert models — and hopefully more late-interaction models, but right now it's just Colbert — and it tries to abstract away all the hard stuff. There's a thing called hard negatives: when you're training for retrieval, you need to mine for hard negatives, and that's done in the background.

    [01:27:48] Benjamin Clavie: And then you've got the main one, which you can use to run Colbert as a re-ranker, or use Colbert to encode documents in memory, or use Colbert to create an optimized Colbert index, which does the compression, etc. So it's basically: give it your documents, it will process them, and then you end up with something you can play with.
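    As a hedged sketch of that "few lines of code" flow — the class and method names follow the RAGatouille README as I recall it, and the documents are made up, so verify against the current docs before copying:

    ```python
    # Hedged RAGatouille sketch: index a small collection with a pretrained Colbert, then search it.
    from ragatouille import RAGPretrainedModel

    RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

    docs = [
        "ColBERT keeps one vector per token and scores documents with MaxSim.",
        "DSPy optimizes the prompts of multi-step LLM programs.",
    ]
    RAG.index(collection=docs, index_name="thursdai_notes")  # builds the optimized index

    results = RAG.search("How does late interaction scoring work?", k=2)
    for r in results:
        print(r["score"], r["content"])
    ```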

    [01:28:04] Alex Volkov: Just from the perspective of somebody who hasn't used this model so far: let's say I already have an existing vector database. Do I need to re-embed everything in there to start using Colbert with RAGatouille? And is that what you mean by fine-tune, or is there an additional thing that's called fine-tune?

    [01:28:20] Alex Volkov: 'cause this is not like the LLM fine tune that we've talked about here on Thursday and multiple times. This is a different fine tune. What are we fine tuning? How long does it take? Does it need GPUs? If you don't mind, walk us through this. If how easy this is for the user to do.

    [01:28:36] Benjamin Clavie: Yeah, that's a good question. So it's actually quite similar to LLM fine-tuning, just on a much smaller scale, because you would actually be fine-tuning the model itself. There's another paper by Omar and team — Omar is everywhere in this, regardless — called UDAPDR, which is actually a combination of using DSP, the proto-DSPy,

    [01:28:59] Benjamin Clavie: with Colbert to fine-tune Colbert to any unknown domain. So for any new domain, you could technically get a much better retrieval model using that. Right now there's only one implementation; that's something we would like to have in RAGatouille. But yeah, the other question was: can you reuse your existing vectors with this?

    [01:29:17] Benjamin Clavie: The answer is no, and that's quite annoying. And by fine-tune, I also mean you can fine-tune the model, but you can also just use Colbert off the shelf and use that to embed your documents and create a new index. If I have to speak of the cons, I would say there's no VectorDB — except Vespa, which I don't think qualifies as a modern VectorDB in the sense we probably mean here — that can use Colbert embeddings out of the box.

    [01:29:41] Benjamin Clavie: I know there's interest, maybe Connor, you guys will support it at

    [01:29:44] Connor Shorten: some point soon. Yeah, we're definitely working on it. I think we, I think, I do think that you've maybe understated the contribution of Ragatouille before you did this, it wasn't, it was not easy to train your own Colbert model, and it definitely wasn't something that we saw as freQwently.

    [01:30:03] Connor Shorten: It was like, Yeah, I think that you've definitely evangelized it. I don't necessarily agree with the most people doing search were doing it this way. Maybe I've just opened a thing, but I think most people have been doing the kind of pooled vectors thing and this is very new, but and yeah, we are working on adding it.

    [01:30:22] Alex Volkov: From my perspective, just judging by the social feeds, I agree — Benjamin, without RAGatouille I don't think I would even have been interested. But I want to maybe ask Connor here as a follow-up. So you see RAGatouille blowing up — what piques your interest in how approachable this is?

    [01:30:36] Alex Volkov: What does fine-tuning a Colbert model mean for retrieval? You guys are researching every retrieval technology out there as much as possible in order to bring it to your users as well. Quality of retrieval is of very high importance, but so is storing these vectors in different vector databases.

    [01:30:54] Alex Volkov: What do you see in RAGatouille exploding like this, and how does this translate into people using RAG better?

    [01:31:05] Connor Shorten: Yeah, I guess it's what I opened with — this kind of retrieve-and-re-rank, collapsing it into one thing. And I think he's just explained it really well. I agree with you, Alex — I don't think I would have understood Colbert as well as I do now if it wasn't for Benjamin and RAGatouille.

    [01:31:21] Connor Shorten: So that's what I think, but under the hood, it's I think it's still like this re ranking thing where we can still use, we still use the pooled vector and like an HNSW search to surface the candidates and then we'll now bring the, the other token vectors with it.

    [01:31:35] Connor Shorten: And then, for Weaviate that just means opening up, like having a more generic type [01:31:40] for how we store vectors to, instead of just one vector now we have this, like an open interface. To, to let you still use the, because the pooled vector embedding search is still very popular as well.

    [01:31:51] Connor Shorten: The OpenAI embedding. I think the Matryoshka thing, maybe we could talk about that as well. I think that has some flavors of this. I'm not sure if it still has the same kind of hierarchy to it. But I think there's also, maybe I'm going off topic, but there's also a paper from DeepMind about semantic IDs.

    [01:32:06] Connor Shorten: And so semantic IDs, they're like this like hierarchical, discrete quantized things where it'd be like you Like at the, say you have three, three IDs and they're each eight bits and the first one would be like whether it's about sports or news or something like that. So there's definitely a, yeah, this is definitely like a newer thing, I would say.

    [01:32:25] Connor Shorten: And I hope I answered the question. I think I just did like a circle around.

    [01:32:28] Alex Volkov: No, that definitely helps. I just want to touch on a concept that may not be familiar to folks here on the ThursdAI stage. Matryoshka embeddings came on my radar just recently, after OpenAI released their new embedding models. And one of the things they added in their new embedding models is the ability to reduce dimensions via an API call.

    [01:32:45] Alex Volkov: And people started thinking, hey, how did they do this? Usually, when you get an embedding model, you get a fixed number of dimensions. And then some folks pointed out this paper called Matryoshka embeddings — Matryoshka, if you're not visualizing what this is, is the Russian dolls thing where one fits into another.

    [01:33:00] Alex Volkov: And there's this paper — and I think the author of Matryoshka embeddings is on my radar as well; maybe we'll get him on ThursdAI — that actually allows for a significantly smaller, correct me if I'm wrong, way to do this. And I think folks from Jina definitely talked about trying to train Matryoshka with some other stuff.
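    The concrete API-level version of this, for anyone following along: the text-embedding-3 models accept a `dimensions` parameter that returns a truncated, Matryoshka-style embedding. The 256 below is just an example size, not a recommendation.

    ```python
    # Requesting a full-size vs. a truncated (Matryoshka-style) embedding from OpenAI.
    from openai import OpenAI

    client = OpenAI()

    full = client.embeddings.create(model="text-embedding-3-small", input="ColBERT and DSPy")
    short = client.embeddings.create(
        model="text-embedding-3-small",
        input="ColBERT and DSPy",
        dimensions=256,  # ask the API for a shortened embedding
    )

    print(len(full.data[0].embedding), len(short.data[0].embedding))  # e.g. 1536 vs 256
    ```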

    [01:33:17] Alex Volkov: So this is a new concept we haven't touched on yet, but it could potentially be an additional competitor here. I want to scroll back real quick — we have Benjamin back. Benjamin, let's talk about the speed of this for larger document sets. What I learned about RAGatouille, but also about Colbert, is that it shines for larger document sets.

    [01:33:36] Alex Volkov: I saw something, I think from Omar about just like millions of rows or something significantly faster. Could you speak about like the speed of this whole thing? Are we getting like an improvement significantly for speed? Like why would a person who already has a setup consider switching to something like this?

    [01:33:51] Alex Volkov: And let's talk about the seconds it takes to run through a bunch of documents to find similarities.

    [01:33:59] Benjamin Clavie: Okay, so I did miss a few things, so it might have been said already, but there's a trade-off here: creating a Colbert index — as in an optimized one using quantization, like Connor said — is quite slow, pretty slow, because it has to run k-means on all your embeddings, etc. But the flip side is that once your documents are in an optimized index, querying is pretty much constant time. It doesn't matter if you've got 100 million documents or billions, it will take about 50-60 milliseconds, and that's because the indexing optimization step creates a bunch of centroids that you can use as a gateway to the documents, to simplify things.

    [01:34:40] Benjamin Clavie: So querying is pretty much constant, and that's a big pro of optimized Colbert indexes. That's worth flagging, though, because it also means that adding to and deleting from a Colbert index is very slow — you need to recompute that. And I think there's space here for some sort of hybrid approach, also using HNSW for smaller collections, because you don't need that sort of optimization if you've got, like, 10,000 documents or something.

    [01:35:04] Alex Volkov: Interesting. It's just for my understanding brain this is very similar to pre compilation of some stuff versus like runtime executions or some stuff you're saying if basically you can offload. The compilation part, and your users will not basically suffer from this, right?

    [01:35:20] Alex Volkov: You don't have to go and call different APIs for this. If you're able to do this, and then you precompile everything, and the benefit here is larger indices, like larger, like significant larger document stores. You're talking about like millions or a hundred millions or so. But then retrieval is almost near time, like instant, under like milliseconds.

    [01:35:41] Alex Volkov: That's, I think, a crazy benefit for folks, especially in enterprises and different places. Yeah, I think it's a significant improvement over regular search and vector comparison. Connor, would you say so as well? Because you guys are in the business of vector comparison and bringing people along.

    [01:36:00] Alex Volkov: Are you seeing like a significant improvement from a retrieval speed here.

    [01:36:08] Connor Shorten: Yeah, I think the latency probably isn't too bad, because the way that I understand Colbert — or ColBERT, sorry, I would agree on Colbert — is that you still have the top-100 search with HNSW, and that latency is pretty low. It's gonna be like five milliseconds at a million scale.

    [01:36:25] Connor Shorten: That's like the most hand-wavy thing ever, but then you just bring these quantized vectors into memory to re-rank. It's way faster than the cross-encoder approach, where you take those top-100 results, append them to the query, and send them to an inference container to get back the scores and sort them.

    [01:36:39] Connor Shorten: So it's way faster than that. I think maybe one thing out of what you just said that I'd want to parse is I don't think it's the same analogy as compile it or compose it at runtime. It's maybe more so like an asynchronous kind of thing where you can query the index that you currently have and then in the background, the index can start doing that k means quantization.

    [01:37:00] Connor Shorten: That's probably the slowest thing, as Benjamin just mentioned: quantizing the token vectors. I'm actually not familiar with the detail of exactly how many token vectors you keep per document, but let's say it's 512, right?

    [01:37:14] Connor Shorten: And now you're going to be running k-means over each of those, in parallel, and you're also trying to multi-thread the per-segment codebook. So I think fitting that codebook is going to be your challenge, and then keeping it fresh, because these codebooks — if that's the way you're doing it... The thing about Matryoshka is that maybe you can get the quantized vectors out of the box with one of the embedding models, but the quantization schemes are pretty dependent on your data particularly. It's not like the embedding models you get from the common APIs come with the codebooks.

    [01:37:53] Connor Shorten: You have to fit these codebooks to your data. So I think the way to think about it would be that we can fit these codebooks asynchronously in the background, and you can query what you currently have, and then the updating and refreshing of the index can happen in a cyclical kind of way.

    [01:38:10] Alex Volkov: All right. I wanna maybe move towards — okay, let's say folks are interested in trying this. Benjamin, could you speak about how to get started? Is RAGatouille the right start? And — I think you mentioned this briefly, I just want to return to it — is this only significantly better for a large set of documents?

    [01:38:28] Alex Volkov: What are the steps to getting started here and what people should know? And then I guess we'll ask about if where to find you guys and how to keep up to date with as these developments around this area happen.

    [01:38:43] Benjamin Clavie: So if you want to get started, I think RAGatouille is probably the easiest way to try Colbert. We've got a few example notebooks on the GitHub repository — if you want to contribute more, please do. That's the big thing: I need more documentation, more notebooks. But you can try re-ranking, or indexing in memory, or building your own index.

    [01:39:01] Benjamin Clavie: And I've got fine-tuning pretty much out of the box. So I'd say start there. In terms of retrieval performance, Colbert is always a really strong performer in the existing IR literature, and we do have a re-ranker, so you can just try it out — use it to re-rank before you commit to indexing your whole document set, just to see how it would perform for you.

    [01:39:21] Benjamin Clavie: So that could be an easy way to slot into any existing pipeline, basically: just retrieve documents, re-rank them, and see what the re-ranker does for you.
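
    For readers who want to try the rerank-first workflow Benjamin describes, a minimal sketch looks roughly like this. Method names follow the RAGatouille README from around the time of recording; treat the exact signatures as assumptions and check the current docs.

    ```python
    from ragatouille import RAGPretrainedModel

    RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

    query = "What is late interaction retrieval?"
    # Candidates from whatever you already use today (BM25, a vector DB, etc.)
    candidate_docs = [
        "ColBERT encodes queries and documents into per-token vectors.",
        "BM25 is a classic lexical ranking function.",
        "Late interaction scores each query token against document tokens.",
    ]

    # Drop ColBERT in as a re-ranker before committing to indexing everything.
    reranked = RAG.rerank(query=query, documents=candidate_docs, k=3)

    # If it works well for you, build a proper ColBERT index and search it instead:
    # RAG.index(collection=candidate_docs, index_name="my_docs")
    # results = RAG.search(query, k=3)
    ```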

    [01:39:29] Alex Volkov: And in that case, I think integrations with existing libraries also exist, for folks who use LangChain or LlamaIndex. I saw that they also integrate at least some parts of this, correct?

    [01:39:40] Benjamin Clavie: Yeah, and I do want to thank them for that, because they basically did this within 24 hours of me releasing RAGatouille. On LlamaIndex you can use ColBERT indexes, and on LangChain you can use ColBERT indexes and ColBERT's re-ranker as well. So if you already use LangChain you can add an extra ColBERT step using [01:40:00] RAGatouille in three more lines of code, I think.

    [01:40:02] Alex Volkov: Incredible. So for folks who are interested in trying out what the big dogs use for search, re-ranking is a fairly easy way to get started with this, without committing, and to see if you get a significant performance boost. And Connor, we barely touched on DSPy.

    [01:40:19] Alex Volkov: I do want to have a conversation about it, because that's also all over my feed, and basically Omar is all over my feed. How does this all connect with DSPy, or does it? Because DSPy is for the prompts area, and this is more for the retrieval area. Where's the connection point that I'm missing, besides Omar being everywhere?

    [01:40:39] Connor Shorten: I think Omar being everywhere is maybe the biggest connection, because to me it's kind of like DSPy is optimizing the LLM program prompt part. And then to have the optimization loop connect between that and the retrieval model, there's definitely work that pairs them.

    [01:40:59] Connor Shorten: Omar has, I think, UDAPDR, something like that, where you use the LLM to generate synthetic queries, then you fine-tune the embedding model with that. So that would be where the connection is: DSPy is like a synthetic data framework. You tell it what you want it to do, and it will use the LLMs to generate successful executions of the task, and then you use that to distill it into either smaller models, or to tune the prompts, or you could fine-tune an embedding model.

    [01:41:25] Connor Shorten: I don't think it's quite there yet, but I think that would be pretty advantageous. Benjamin can take the mic from here.

    [01:41:32] Benjamin Clavie: Yeah, I wouldn't say DSPy and ColBERT are directly related. They exist in the same space, but they're definitely very different tools. Like Connor mentioned, UDAPDR, which is the paper I mentioned earlier, actually, is where you use DSP, and hopefully soon DSPy, to fine-tune a ColBERT for any domain.

    [01:41:50] Benjamin Clavie: Even a domain it's never been exposed to before, and get it to a state-of-the-art result on that domain. That's a really good application of DSPy to ColBERT. And likewise, you can use ColBERT as a retriever in your DSPy pipeline, but it's just a component, it's not quite the DSPy thing.
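
    As a rough illustration of that synthetic-query loop (and only an illustration, not the actual UDAPDR implementation), a DSPy program for generating training queries might look something like this; the signature, model name, and passages are all made up.

    ```python
    import dspy

    # Any LM client DSPy supports; the model name here is just an example.
    lm = dspy.OpenAI(model="gpt-3.5-turbo")
    dspy.settings.configure(lm=lm)

    class GenerateQuery(dspy.Signature):
        """Write a realistic search query that this passage would answer."""
        passage = dspy.InputField()
        query = dspy.OutputField()

    generate_query = dspy.Predict(GenerateQuery)

    passages = [
        "ColBERT represents documents as bags of token-level embeddings.",
        "RAGatouille wraps ColBERT training, indexing, and retrieval in a simple API.",
    ]

    # (query, passage) pairs like these become positives for fine-tuning a retriever.
    synthetic_pairs = [(generate_query(passage=p).query, p) for p in passages]
    print(synthetic_pairs)
    ```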

    [01:42:08] Connor Shorten: I do have something, though, that is very related to retrieval generally.

    [01:42:12] Connor Shorten: We saw all these amazing LLM query router things, and I want to give LlamaIndex credit for evangelizing most of this stuff. One example is, say you have the LLM pick a metadata filter to put on the vector search. Let's say you have an index of podcast clips, and you want to search only where the speaker is Omar Khattab: you have an LLM predict that filter, and then that would be in the retrieval engine.

    [01:42:38] Connor Shorten: And so you have a prompt behind that, same with text-to-SQL. There's a prompt behind how we put these things around retrieval, and DSPy can optimize the prompts, or optimize the models that do that, to get the maximum performance out. And I don't mean to say anything negative about the existing frameworks, but right now you're locking into the prompts that they have built into the framework to do these things, whereas DSPy opens it up to optimize it for your thing.
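
    Here's a hedged sketch of that filter-prediction step written as a DSPy signature, so the prompt behind it can be optimized rather than hard-coded into a framework. The field names, model, and the speaker example are invented for illustration.

    ```python
    import dspy

    lm = dspy.OpenAI(model="gpt-3.5-turbo")  # example model, swap for whatever you use
    dspy.settings.configure(lm=lm)

    class PredictFilter(dspy.Signature):
        """Given a question about a podcast archive, output the speaker name to filter on, or 'none'."""
        question = dspy.InputField()
        speaker_filter = dspy.OutputField(desc="exact speaker name, or 'none'")

    predict_filter = dspy.ChainOfThought(PredictFilter)
    result = predict_filter(question="What did Omar Khattab say about late interaction?")

    # You would then translate result.speaker_filter into your vector DB's filter syntax,
    # e.g. something shaped like {"speaker": {"$eq": result.speaker_filter}}.
    print(result.speaker_filter)
    ```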

    [01:43:06] Alex Volkov: Interesting. Yeah, I don't think it's negative necessarily. I think people, after using some of these frameworks, understand that, and we've seen this from multiple folks: they could potentially start with something like LlamaIndex or LangChain and then quickly figure out that some more

    [01:43:20] Alex Volkov: freedom is needed, and DSPy is a potential way to do that. Okay, Connor, anything else? Very interesting. So first of all, you have a bunch of great content on this. You recently did, I think it's at the top of your tweets, I'll definitely add this to the show notes as well.

    [01:43:32] Alex Volkov: You did a deep dive into DSPy on your, was that on the podcast, or was it just a video? Definitely we'll send folks there. Anything else you want to add, like how to find you, where to find your content? Definitely folks should follow you. First of all, we'll add your things.

    [01:43:48] Connor Shorten: Thanks, Alex. Yes, I have two podcasts right now, with Omar, of course, and then with Karel D'Oosterlinck, who created this Infer-Retrieve-Rank program. It's one of the coolest examples of DSPy. And then I have one video out so far explaining the whole thing. Quickly, I wanted to point people to the update to DSPy Assertions.

    [01:44:05] Connor Shorten: Because I think this is the most important thing with these prompting frameworks. And I think it's important to also understand Instructor from Jason Liu, which is where you use Pydantic to define the schema of the outputs that you want from the language model, and then you validate the outputs to make sure that it outputted JSON with the keys that you wanted.

    [01:44:23] Connor Shorten: And so DSPy Assertions is in this similar category, and this is like the most common discussion I'm seeing in the DSPy Discord: people looking to add Instructor to DSPy, and jointly looking to do this thing of structured outputs and have this retry mechanism. There's new work from Arnav Singh, oh, sorry, Arnav Singhvi.

    [01:44:43] Connor Shorten: We haven't met yet, but go and learn more about DSPy Assertions. I'm going to link it in the description of this chat, because I highly recommend people check it out.
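
    For context on the Instructor pattern Connor describes (a Pydantic schema in, validated JSON out, with retries), a minimal sketch looks roughly like this. It uses the patch-style API from around the time of recording; the library has evolved since, so treat the exact calls as assumptions.

    ```python
    import instructor
    from openai import OpenAI
    from pydantic import BaseModel

    # Patch the OpenAI client so responses are parsed and validated against a Pydantic model.
    client = instructor.patch(OpenAI())

    class Speaker(BaseModel):
        name: str
        organization: str

    speaker = client.chat.completions.create(
        model="gpt-3.5-turbo",
        response_model=Speaker,          # the schema we want back
        max_retries=2,                   # re-ask the model if validation fails
        messages=[{"role": "user", "content": "Connor Shorten works at Weaviate."}],
    )
    print(speaker.name, speaker.organization)
    ```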

    [01:44:50] Alex Volkov: Awesome. Nisten, just before I give you a question, I will shout out that Jason Liu from the Instructor library came to the Weights & Biases courses, and there's a course that he built with us as well that's free. You can just go to wandb.ai and find the courses. I'll definitely add the link below, about structured output and how to get these LLMs to give us better structured output.

    [01:45:09] Alex Volkov: It's funny that a person named Jason is building, you know, tools to get LLMs to output JSON. But that's all I have, just super quick. Nisten, go ahead, you had a question here.

    [01:45:19] Nisten Tahiraj: I just want to say it's pretty amazing that the people we bring here are from the industry, building things we actually use. Like, from last week, I started using Lilac, and I might actually start running RAGatouille on that Hacker Neon dataset. And so I wanted to mention, mainly since some people ask in the comments what I have used: I forced myself to only use open source models.

    [01:45:45] Nisten Tahiraj: Because I feel like that's the only way they're going to start getting better, if we restrict ourselves to them. I don't recommend you do it just yet, just wait maybe another week or two. But I wanted to ask, uh, we see some limitations with retrieval augmentation systems, like in GPT-4 when people use it.

    [01:46:07] Nisten Tahiraj: It only gives three points from the document, doesn't really summarize it and stuff. What are the benefits of going with ColBERT? Sorry, is it because it's much faster? Can you feed it many more documents? I'm talking from a practical point of view, not necessarily even from a tech person's point of view: as a business that has a lot of customer data, why should they use this versus just putting it on pgvector and doing function calling?

    [01:46:41] Nisten Tahiraj: Is this faster that way? And what limitations does using, again, RAGatouille with ColBERT

    [01:46:47] Benjamin Clavie: have? That is a good and open question. Limitations, we have a lot right now; the lack of a cloud hosting offering is a big one. There's not really anywhere you can host this except doing it yourself, which is a big problem.

    [01:47:05] Benjamin Clavie: And the main reason to use it, I would say, is generalization, because the thing with any of these shared embedding models is that they look good on benchmarks, and they tend to work quite well, but they've been optimized for those benchmarks. Whereas ColBERT, for instance ColBERT v2, has never been trained on the MTEB benchmark for retrieval, etc.

    [01:47:24] Benjamin Clavie: The reason it generalizes well is because working at the token level makes it a lot easier for your model to encode information. Whereas, when you're trying to squeeze everything into a single vector, it might very well not work well, say, for your custom domain. Whereas with ColBERT, you can always assume it's going to be okay in every domain; it might not be the best, and you may need to fine-tune it later.

    [01:47:45] Benjamin Clavie: It's probably the biggest draw, I'd say.
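
    To make the token-level point concrete, here is a tiny numpy illustration of the late-interaction (MaxSim) scoring that ColBERT-style models use instead of a single-vector dot product. The vectors are random stand-ins; real models use contextual token embeddings.

    ```python
    import numpy as np

    def maxsim_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
        """query_tokens: (q, d), doc_tokens: (n, d); both L2-normalized."""
        sims = query_tokens @ doc_tokens.T       # (q, n) token-to-token similarities
        return float(sims.max(axis=1).sum())     # best document token per query token, summed

    rng = np.random.default_rng(0)
    q = rng.normal(size=(8, 128));   q /= np.linalg.norm(q, axis=1, keepdims=True)
    d = rng.normal(size=(300, 128)); d /= np.linalg.norm(d, axis=1, keepdims=True)
    print(maxsim_score(q, d))
    ```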

    [01:47:51] Alex Volkov: Awesome. So I definitely want to thank you guys for coming up and explaining these concepts that have been floating around, in very simple language. And I appreciate your patience with me re-asking things in the way that I understand, because I know this is my way to understand, but it also helps some folks in the audience.

    [01:48:06] Alex Volkov: That's how we do here on ThursdAI, so you're more than welcome to rejoin, for I now consider both of you friends of the pod. And I agree with Nisten, it's really cool to see the authors of the libraries and the tools that we use come here to ThursdAI to talk about them, [01:48:20] and obviously, about upcoming features as well.

    [01:48:22] Alex Volkov: Definitely welcome. Benjamin, thank you for doing a bunch of open source stuff, and for evangelizing the whole ColBERT thing to make it simpler for folks. Definitely, thank you. And anything you want to add here that I haven't touched yet? Please go ahead, Benjamin.

    [01:48:36] Benjamin Clavie: I do have a few shoutouts, shall we say. One of them is that LangChain and DSPy are not mutually exclusive, and I shared that in the chat. There is now a LangChain x DSPy integration, where you can define your chains in LangChain and still use DSPy to optimize things, which is pretty cool.

    [01:48:53] Benjamin Clavie: And in the embedding world, you mentioned Matryoshka embeddings, and we talked about ColBERT, and the people at Jina are actually training a ColBERT model right now using Matryoshka embeddings for compression, as a sort of let's-try-this-out, see how it works. And the final one is, you might have brought this up already, but the people at BAAI trained BGE-M3, a really cool embedding model that in a single pass outputs

    [01:49:19] Benjamin Clavie: a dense vector, a ColBERT-style multi-vector embedding, and a SPLADE-style sparse embedding. I won't go into too much detail about that,

    [01:49:26] Alex Volkov: I'm sorry. I don't think I covered that. Who was that? Sorry. Could you repeat?

    [01:49:31] Benjamin Clavie: The people at BAAI, the people who do the BGE

    [01:49:34] Alex Volkov: Oh yeah, but yeah. We've talked about their model recently. They,

    [01:49:37] Benjamin Clavie: BAAI, yeah,

    [01:49:38] Alex Volkov: Oh, I did not know.

    [01:49:39] Alex Volkov: So they now have a thing that outputs a regular embedding and also a ColBERT-style embedding.

    [01:49:46] Benjamin Clavie: Yeah, the big thing last week was M3, which has a ColBERT-style embedding, a SPLADE-style embedding, which is a sparse embedding method, and a dense embedding, all in a single model, a total of three.

    [01:49:57] Alex Volkov: Oh, that's incredible. Okay, so we're adding some knowledge here. Thank you. Let me just repeat the way that I hear this: we've talked about the BAAI BGE-M3. M3 basically stands for multiple things, one of them is multilinguality. So they upgraded their embeddings to use not only English, but also, I think, a hundred languages as well.

    [01:50:14] Alex Volkov: So now, Benjamin, you're saying they're also implementing this step for us: outputting the dense embedding, but also the ColBERT embedding, correct?

    [01:50:25] Benjamin Clavie: yeah, yeah, one of the meanings of M, I think, is

    [01:50:27] Alex Volkov: Multi-composability or some... yeah. Multifunctionality. Yes, exactly.

    [01:50:33] Benjamin Clavie: You can use it to generate different kinds of embeddings. And I think it's the first non-ColBERT, actually strong, multi-vector model. There are issues, as in the vectors are too big, etc.

    [01:50:45] Benjamin Clavie: But it's a very nice thing to see happen. Definitely, like

    [01:50:49] Alex Volkov: Oh, definitely a shout out then. We need to get the folks from BAAI here to speak about this, so if you folks know them, definitely connect them to me. I would love to hear from the authors of BGE. Yeah, and definitely shout out to Jina. I think Bo Wang, we've mentioned, is a friend of the pod.

    [01:51:03] Alex Volkov: He came on when Jina released their embeddings, and he often comes here and gives us insights about how embeddings work. Shout out to Bo and the team at Jina as well. Connor, your stage, if you want to add anywhere else folks can follow you, or shout out anything else, and then we're going to continue with some more news.

    [01:51:21] Connor Shorten: It's been so cool to be a part of the podcast. And I love how it's integrated into X because this is actually my favorite place to manage communication. So if you want to reach out, here would be great.

    [01:51:31] Alex Volkov: Yeah. So definitely give Connor a follow, and the Weaviate podcast is incredible. We've, and by we I mean Weights & Biases, we had a mutual video together and Connor hosted our folks. I learned a bunch from it before I joined Weights & Biases as well. A great source of information from both of you.

    [01:51:45] Alex Volkov: Thank you guys so much for coming up and explaining these complex-on-the-surface concepts to us, maybe complex implementation-wise as well, but making them simpler. I think it's very important to talk about them, and you are now considered friends of the ThursdAI community, and hopefully this will get more folks to learn about this, contribute, etc.

    [01:52:05] Alex Volkov: And I think with that, we're like, a bit over the top, like two hours since I started the recording. We had a great show today. Thank you everybody for listening and coming. I just wanna summarize this in a few notes that that I really enjoy my time here every week. And I really enjoy learning from folks. I think Nisten, you mentioned today that it's so cool to have the authors of the things we talked about.

    [01:52:25] Alex Volkov: So today we also had this benefit. We had Benjamin here and we had Connor who covered this. And we also had Justin again from the Qwen team to talk about the Qwen stuff that they released. And it's really cool that the community now connects different people.

    [01:52:36] Alex Volkov: So I was able to connect Justin and the Qwen team with the LM Studio folks and the Ollama folks. No, I think only LM Studio. And they were able to work together so that their release is now supported in LM Studio the second they release something. So I love how this community comes together. I encourage everybody who listens to this to also participate in this.

    [01:52:55] Alex Volkov: Either follow everybody who's on stage here, interact with our posts and boost the signal a little bit, or tell your friends: if you're working with friends and they don't listen to ThursdAI, and there's alpha in listening to ThursdAI like today, definitely tell them where this alpha can be found.

    [01:53:10] Alex Volkov: And with that, I want to thank you all and have a nice Thursday. Bye bye, everyone.



  • Hello hello everyone, welcome to another special episode (some podcasts call them just... episodes, I guess, but here you get AI news every ThursdAI, and on Sunday you get the deeper dives)

    BTW, I'm writing these words, looking at a 300 inch monitor that's hovering above my usual workstation in the Apple Vision Pro, and while this is an AI newsletter, and I've yet to find a connecting link (there's like 3 AI apps in there right now, one fairly boring chatbot, and Siri... don't get me started on Siri), I'll definitely be covering my experience in the next ThursdAI, because well, I love everything new and technological, AI is a huge part of it, but not the ONLY part!

    📖 It's all about the (big) Datasets

    Ok back to the matter at hand, if you've used, finetuned, trained or heard about an AI model, you may or may not realize how important the dataset the model was trained with is. We often talk of this model, that model, and often the only difference is additional data that folks (who I sometimes refer to as alchemists) have collected, curated and structured, and creating/curating/editing those datasets is an art and a science.

    For example, three friends of the pod, namely LDJ with Capybara, Austin with OpenChat and Teknium with Hermes, have been consistently taking off-the-shelf open source models and making them smarter, more instruction tuned, better for specific purposes. These datasets are paired with different techniques as well, for example, lately the so-called DPO (Direct Preference Optimization) is a technique that has shown promise, since it not only shows a model which answer is correct for a specific query, it shows an incorrect answer as well, and trains the model to prefer one over the other. (see the recent Capybara DPO improvement by Argilla, which improved model metrics across every evaluation)
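
    For the curious, here's a compressed sketch of what a DPO fine-tune looks like with the TRL library: a tiny toy preference set with prompt/chosen/rejected columns and a small stand-in base model. Real runs need a proper preference dataset, max-length settings, PEFT, and so on, and exact trainer arguments vary across TRL versions, so treat this as an assumption-laden outline rather than a recipe.

    ```python
    from datasets import Dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
    from trl import DPOTrainer

    model_name = "gpt2"  # toy stand-in; you'd use your actual base / SFT model
    model = AutoModelForCausalLM.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token

    # DPO wants preference pairs: the same prompt with a preferred and a rejected answer.
    train_dataset = Dataset.from_dict({
        "prompt": ["What does DPO do?"],
        "chosen": ["It trains the model to prefer the better of two answers."],
        "rejected": ["It is a new kind of GPU."],
    })

    trainer = DPOTrainer(
        model=model,
        ref_model=None,  # TRL clones the model as the frozen reference
        args=TrainingArguments(
            output_dir="dpo-out",
            per_device_train_batch_size=1,
            remove_unused_columns=False,
        ),
        beta=0.1,  # how far the policy is allowed to drift from the reference
        train_dataset=train_dataset,
        tokenizer=tokenizer,
    )
    trainer.train()
    ```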

    These datasets can range from a super high quality 16K rows to millions of rows (Teknium's recently released Hermes, one of the higher quality datasets, comes in at just a tad over 1 million rows), and oftentimes the result is an amalgamation of several other datasets into one.

    In the case of Hermes, Teknium has compiled this 1 million chats from at least 15 different datasets, some his own, some by folks like Jon Durbin, Garage bAInd, and shareGPT from LMsys.org, which was compiled by scraping the very popular sharegpt.com website, built from folks who used the shareGPT extension to share their GPT-4 conversations. It's quite remarkable how much of these datasets are just conversations that users had with GPT-4!

    Lilac brings Garden

    With that backdrop of information, today on the pod we've got the co-founders of Lilac, Nikhil Thorat and Daniel Smilkov, who came on to chat about the new thing they just released called Lilac Garden.

    Lilac is an open source tool (you can find it RIGHT HERE) which is built to help make dataset creation, curation and classification more science than art, and to help visualize the data, cluster it and make it easily available. In the case of Hermes, that means more than a million rows of data.

    On the pod, I talk with Nikhil and Daniel about the origin of what they both did at Google, working on Tensorflow.js and then something called "know your data" and how eventually they realized that in this era of LLMs, open sourcing a tool that can understand huge datasets, run LLM based classifiers on top of them, or even train specific ones, is important and needed!

    To strengthen the point, two friends of the pod (Teknium was in the crowd sending us 👍), LDJ and Austin (aka Alignment Lab) were on stage with us and basically said that "It was pretty much the dark ages before Lilac", since something like OpenOrca dataset is a whopping 4M rows of text.

    Visualizations in the Garden.

    So what does lilac actually look like? Here's a quick visualization of the top categories of texts from OpenOrca's 4 million rows, grouped by category title and showing each cluster. So you can see here, Translation requests have 66% (around 200K rows) of the translation category, and you can scroll on and on and add filters and really dissect this whole thing up and down.

    The categorization is created by running Lilac on your dataset, which uses embedding algorithms and other neat tricks to quickly chunk and put labels on the categories (AKA classifying them).
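
    Lilac's pipeline is its own thing, but the core idea, embed each row, cluster the embeddings, then inspect and label the clusters, can be sketched generically like this (illustrative only, not Lilac's actual API; the model and cluster count are arbitrary):

    ```python
    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import KMeans

    rows = [
        "Translate the following sentence to French: ...",
        "Translate 'good morning' into Spanish.",
        "Write a short story about a dragon.",
        "Summarize this article in two sentences.",
    ]

    # Embed every row, then group similar rows together.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(rows, normalize_embeddings=True)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)

    for label, row in zip(labels, rows):
        print(label, row[:60])
    ```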

    Btw, you can see this view and play around with it yourself, here

    But running this on your own local machine can be a drag, and can take hours if not days for bigger datasets, sometimes hanging and not even finishing, so the Lilac folks created Lilac Garden, a hosted solution where you provide a dataset and they can classify something like 4M rows in 4-5 hours or so.

    That is definitely not possible on local machines. If you're into that kind of thing, again, Lilac is open source, so you don't have to sign up or pay them, but if speed and this view matter to you, definitely check Lilac Garden out!

    RWKV with Eugene (Pico Creator)

    On the news segment of ThursdAI we mentioned Eagle, which is the 5th version of RWKV, an attention-free, potential alternative to Transformers that's being developed fully in the open source. Later in the show we had the honor of having PicoCreator, one of the front-running folks in the RWKV effort, which is an attempt to see if Transformers can be beaten with a different type of architecture (an RNN) that doesn't require the attention mechanisms that bring the problem of quadratic attention scaling, making LLMs hard and expensive to run the more context is provided.
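
    To make that scaling argument concrete, here's a back-of-the-envelope illustration (a toy recurrence, not RWKV's actual time-mixing math): attention compares every token with every other token, while an RNN-style model carries a fixed-size state forward, so its per-token cost and memory don't grow with context length.

    ```python
    import numpy as np

    def attention_pairs(n_tokens: int) -> int:
        return n_tokens * n_tokens      # the attention score matrix is n x n

    def recurrent_steps(n_tokens: int) -> int:
        return n_tokens                 # one fixed-size state update per token

    for n in (1_000, 10_000, 100_000):
        print(f"{n:>7} tokens: {attention_pairs(n):>15,} attention pairs vs {recurrent_steps(n):>7,} recurrent steps")

    # Toy fixed-size-state recurrence over a "long" sequence: memory stays constant.
    rng = np.random.default_rng(0)
    state = np.zeros(1024, dtype=np.float32)
    for _ in range(10_000):
        token_embedding = rng.normal(size=1024).astype(np.float32)
        state = 0.99 * state + 0.01 * token_embedding
    ```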

    Eugene had some technical issues so joined in the middle of the pod, so we didn't have a full deep-dive, however, I figured it's important to bring this info to you guys, as these efforts may yield AI that runs 10-100x cheaper and potentially faster on devices, using almost infinite context lengths.

    RWKV and other attempts like StripedHyena (Together AI) and Mamba (from Tri Dao) are attempts that are worth watching as they may supersede or join with Transformers to create the next jump in LLM capabilities.

    That's all for this Sunday, needless to say, with the Vision Pro releasing on a Friday, it's been a full weekend of future exploration, which is the main driver in my personal life!

    P.S - if you read through to here, you get a gift! A teaser: I have done something different on the pod and recorded a human interest x AI episode for the first time. I mostly bring the news and sometimes deep dives like this one, but this story I couldn't ignore, so stay tuned if you're into dating x AI, and how technology disrupts our lives and whether this is all moral or not, as I recorded an episode with Sasha Jadan and his new fiancée Karina, whom his AI bot picked out for him after swiping and matching with over 5200 girls on Tinder. The AI also... suggested he propose, which he did. It was a very interesting conversation that I plan to upload soon!

    That's it from me this week, see you all on ThursdAI and don't forget, if you liked this, do me a solid, listen to the pod and then leave a review or a 5 star (at least a 4?) on Apple podcasts 🙏



  • TL;DR of all topics covered + Show notes

    * Open Source LLMs

    * Meta releases Code-LLama 70B - 67.8% HumanEval (Announcement, HF instruct version, HuggingChat, Perplexity)

    * Together added function calling + JSON mode to Mixtral, Mistral and CodeLLama

    * RWKV (non transformer based) Eagle-7B - (Announcement, Demo, Yam's Thread)

    * Someone leaks Miqu, Mistral confirms it's an old version of their model

    * Olmo from Allen Institute - fully open source 7B model (Data, Weights, Checkpoints, Training code) - Announcement

    * Datasets & Embeddings

    * Teknium open sources Hermes dataset (Announcement, Dataset, Lilac)

    * Lilac announces Garden - LLM powered clustering cloud for datasets (Announcement)

    * BAAI releases BGE-M3 - Multi-lingual (100+ languages), 8K context, multi functional embeddings (Announcement, Github, technical report)

    * Nomic AI releases Nomic Embed - fully open source embeddings (Announcement, Tech Report)

    * Big CO LLMs + APIs

    * Bard with Gemini Pro becomes 2nd LLM in the world per LMsys beating 2 out of 3 GPT4 (Thread)

    * OpenAI launches GPT mention feature, it's powerful! (Thread)

    * Vision & Video

    * 🔥 LLaVa 1.6 - 34B achieves SOTA vision model for open source models (X, Announcement, Demo)

    * Voice & Audio

    * Argmax releases WhisperKit - super optimized (and on device) whisper for IOS/Macs (X, Blogpost, Github)

    * Tools

    * Infinite Craft - Addicting concept combining game using LLama 2 (neal.fun/infinite-craft/)

    Haaaapy first of the second month of 2024 folks, how was your Jan? Not too bad I hope? We definitely got quite a show today, the live recording turned into a procession of breaking news, authors who came up, deeper interviews and of course... news.

    This podcast episode is focusing only on the news, but you should know that we had deeper chats with Eugene (PicoCreator) from RWKV, and a deeper dive into Lilac, a dataset curation and segmentation tool, with founders Nikhil & Daniel, and also, we got a breaking news segment and Nathan (from AI2) joined us to talk about their latest open source release 👏

    Besides that, oof, what a week. It started out with the news that the new Bard API (apparently Gemini Pro + internet access) is now the 2nd best LLM in the world (according to LMSYS at least), then there was the whole thing with Miqu, which turned out to be, yes, a leak of an earlier version of a Mistral model, which they acknowledged, and finally the release of LLaVa 1.6, becoming the SOTA of open source vision models, was very interesting!

    Open Source LLMs

    Meta releases CodeLLama 70B

    Benches 67.8% on HumanEval (without fine-tuning) and is already available on HuggingChat, Perplexity, TogetherAI, quantized for MLX on Apple Silicon, and has several finetunes, including SQLCoder which beats GPT-4 on SQL

    Has 16K context window, and is one of the top open models for code

    Eagle-7B RWKV based model

    I was honestly a bit disappointed with the multilingual performance compared to the 1.8B StableLM, but the folks on stage told me not to compare this in a traditional sense to a transformer model, but rather to look at the potential here. So we had Eugene from the RWKV team join on stage and talk through the architecture, the fact that RWKV is the first AI model in the Linux Foundation and will always be open source, and that they are working on bigger models! That interview will be released soon

    Olmo from AI2 - new fully open source 7B model (announcement)

    This announcement came as Breaking News, I got a tiny ping just before Nathan dropped a magnet link on X, and then they followed up with the Olmo release and announcement.

    A fully open source 7B model, including checkpoints, weights, Weights & Biases logs (coming soon), dataset (Dolma) and just... everything that you can ask, they said they will tell you about this model. Incredible to see how open this effort is, and kudos to the team for such transparency.

    They also release a 1B version of Olmo, and you can read the technical report here

    Big CO LLMs + APIs

    Mistral handles the leak rumors

    This week the AI twittersphere went ablaze again, this time with an incredibly dubious (quantized only) version of a model called MIQU that performed incredibly well on benchmarks, which nobody expected, and I'm not linking to it on purpose. It started a set of rumors that maybe this was a leaked version of Mistral Medium. Remember, Mistral Medium was the 4th best LLM in the world per LMSYS, and it was rumored to be a Mixture of Experts, just larger than Mistral's 8x7B.

    So things didn't add up, and they kept not adding up, as folks speculated that this was a LLama 70B vocab model, etc. Eventually this drama came to an end when Arthur Mensch, the CEO of Mistral, did the thing Mistral is known for and just acknowledged that the leak was indeed an early version of a model they trained once they got access to their cluster, super quick, and that it indeed was based on LLama 70B, which they have since stopped using.

    Leaks like this suck, especially for a company that... gives us the 7th best LLM in the world, completely Apache 2 licensed, and it really shows that they dealt with this leak with honor!

    Arthur also proceeded to do a very Mistral thing and opened a pull request to the Miqu HuggingFace readme with an attribution that looks like this, with the comment "Might consider attribution" 🫳🎤

    Bard (with Gemini Pro) beats all but the best GPT4 on lmsys (and I'm still not impressed, help)

    This makes no sense, and yet, here we are. Definitely a new version of Bard (with Gemini Pro), as they call it, from January 25 on the arena, is now better than most other models, and it could potentially be because it has internet access?

    But so does Perplexity, and it's nowhere close, which is weird, and it was a weird result that got me and the rest of the team in the ThursdAI green room chat talking for hours! Including getting folks who usually don't reply, to reply 😆 It's been a great conversation, and where we finally left off is: Gemini Pro is decent, but I personally don't think it beats GPT4. However, most users don't care about which model serves what, rather which of the 2 choices LMSYS has shown them answered what they asked. And if that question has Google search power behind it, that's likely one of the reasons people prefer it.

    To be honest, when I tried the LMSYS version of Bard, it showed me a 502 response (which I don't think they include in the ELO score 🤔) but when I tried the updated Bard for a regular task, it performed worse (in my case) than a 1.6B parameter model running locally.

    Folks from Google replied and said that it's not that the model is bad, it's that I used a person's name, and the model just... refused to answer. 😵‍💫 When I removed the last name it did perform ok, nowhere near GPT-4 though.

    In other news, they updated Bard once again today, with the ability to draw images, and again, and I'm sorry if this turns out to be a negative review but, again, Google, what's going on?

    The quality of this image generation is subpar, at least to me and other folks. I'll let you judge which image was created with IMAGEN (and trust me, I cherry picked) and which one was DALLE, for the same exact prompt

    This weeks Buzz (What I learned with WandB this week)

    Folks, the growth ML team in WandB (aka the team I'm on, the best WandB team duh) is going live!

    That's right, we're going live on Monday, 2:30 PM pacific, on all our socials (X, LinkedIn, Youtube) as I'm hosting my team, and we do a recap of a very special week in December, a week where we paused other work, and built LLM powered projects for the company!

    I really wanted to highlight the incredible projects, struggles, challenges and learnings of what it takes to take an AI idea and integrate it, even for a company our size that works with AI often, and I think it's going to turn out super cool, so you all are invited to check out the live stream!

    Btw, this whole endeavor is an initiative by yours truly, not like some boring corporate thing I was forced to do, so if you like the content here, join the live and let us know how it went!

    OpenAI releases a powerful new feature, @mentions for GPTs

    This is honestly so great, and it went under the radar for many folks, so I had to record a video to explain why this is awesome: you can now @mention GPTs from the store, and they will get the context of your current conversation, so you no longer need to switch between GPT windows.

    This opens the door for powerful combinations, and I show some in the video below:

    Apple is coming to AI

    Not the Apple Vision Pro, that's coming tomorrow and I will definitely tell you how it is! (I am getting one and am very excited, it better be good)

    No, today on the Apple earnings call, Tim Cook finally said the word AI, and said that they are incredibly excited about this tech, and that we'll get to see something from them this year.

    Which makes sense, given the MLX stuff, the Neural Engine, ML-Ferret and the tons of other stuff we've seen from them this year, Apple is definitely going to step in in a big way!

    Vision & Video

    LLaVa 1.6 - SOTA in open source VLM models! (demo)

    Wow, what a present we got from Haotian Liu and the folks at LLaVA. They upgraded the LLaVA architecture and released a few more models, ranging from 7B to 34B, and created the best open source, state of the art vision models! It's significantly better at OCR (really, give it a go, it's really impressive) and they exchanged the LLM backbone with Mistral and Hermes Yi-34B.

    * Better OCR and higher res

    * Uses several bases like Mistral and NousHermes 34B

    * Uses lmsys SGlang for faster responses (which we covered a few weeks ago)

    * SoTA Performance! LLaVA-1.6 achieves the best performance compared with open-source LMMs such as CogVLM or Yi-VL. Compared with commercial ones, it catches up to Gemini Pro and outperforms Qwen-VL-Plus on selected benchmarks.

    * Low Training Cost. LLaVA-1.6 is trained with 32 GPUs for ~1 day, with 1.3M data samples in total. The compute / training data cost is 100-1000 times smaller than others.

    Honestly it's quite stunningly good, however, it does take a lot more GPU due to the resolution changes they made. Give it a try in this online DEMO and tell me what you think.

    Tools

    Infinite Craft Game (X, Game)

    This isn't a tool, but an LLM-based little game that's so addicting, I honestly didn't have time to keep playing it, and it's super simple. I especially love this, as it uses LLama, and I don't see how something like this could have been scaled without AI before, and the UI interactions are so ... tasty 😍

    All-right folks, I can go on and on, but truly, listen to the whole episode, it really was a great one, and stay tuned for the special Sunday deep dive episode with the folks from Lilac, featuring our conversation with Eugene (PicoCreator) about RWKV.

    If you scrolled all the way until here, send me the 🗝️ emoji somewhere in DM so I'll know that there's at least one person who read this through, leave a comment and tell 1 friend about ThursdAI!



  • Hey everyone, we have an exciting interview today with Maxime Labonne.

    Maxime is a senior Machine Learning Scientist at JPMorgan, the author of the Hands-On GNNs book and his own ML blog, creator of LazyMergeKit (which we cover on the pod), and holds a PhD in Artificial Intelligence from the Institut Polytechnique de Paris.

    Maxime has been mentioned on ThursdAI a couple of times before, as he released the first Phi mixture-of-experts, and he previously finetuned OpenHermes using DPO techniques, which resulted in NeuralHermes 7B

    For the past couple of months, following AI on X, it was hard not to see Maxime's efforts show up on the timeline, and one of the main reasons I invited Maxime to chat was the release of NeuralBeagle7B, which at the time of writing was the top performing 7B model on the LLM leaderboard, and was specifically a merge of a few models.

    Model merging

    Model merging has been around for a while but recently has been heating up, and Maxime has a lot to do with that: he recently checked, and his wrapper on top of MergeKit by Charles Goddard (which is the library that put model merging into the mainstream), called LazyMergeKit, was responsible for >50% of the merged models on the HuggingFace hub leaderboard.

    Maxime also authored a model merging blogpost on Hugging Face and wrote quite a few articles and shared code that helped others to put merged models out.

    Modern day Alchemy

    This blogpost is a great resource on what model merging actually does, so I won't go into depth of what the algorithms are, please refer to that if you want a deep dive, but in a nutshell, model merging is a technique to apply algorithms to the weights of a few models, even a few instances of the same model (like Mistral7B) and create a new model, that often performs better than the previous ones, without additional training!

    Since this is algorithmic, it doesn't require beefy GPUs burning power to keep training or finetuning, and since the barrier of entry is very low, we get some cool and crazy results as you'll see below.

    Yeah, as crazy as it sounds, this method can also create models of non-standard sizes, like 10B or 120B models, since it's slicing pieces of other models and stitching them together in new ways.
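
    The simplest possible version of the idea is plain linear interpolation of two checkpoints' weights, which is what the toy sketch below does. MergeKit's interesting methods (SLERP, TIES, DARE, passthrough slicing) are considerably smarter, so read this only as an illustration that merging is arithmetic on weights, not training; both models here are plain gpt2 copies so the snippet runs anywhere.

    ```python
    import torch
    from transformers import AutoModelForCausalLM

    # In practice you'd merge two finetunes that share the same base architecture
    # (e.g. two Mistral-7B finetunes); gpt2 is used twice here just so this runs.
    model_a = AutoModelForCausalLM.from_pretrained("gpt2")
    model_b = AutoModelForCausalLM.from_pretrained("gpt2")

    alpha = 0.5  # interpolation weight between the two checkpoints
    merged_state = {}
    with torch.no_grad():
        state_a, state_b = model_a.state_dict(), model_b.state_dict()
        for name, tensor_a in state_a.items():
            tensor_b = state_b[name]
            if tensor_a.is_floating_point():
                merged_state[name] = alpha * tensor_a + (1 - alpha) * tensor_b
            else:
                merged_state[name] = tensor_a  # leave integer/bool buffers untouched

    merged = AutoModelForCausalLM.from_pretrained("gpt2")
    merged.load_state_dict(merged_state)
    merged.save_pretrained("my-merged-model")
    ```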

    If you recall, we had a deep dive with Jon Durbin who released Bagel, and Jon specifically mentioned that he created Bagel (based on everything everywhere all at once) as a good base for merges, that will include all the prompt formats, you can read and listen to that episode here

    This merge frenzy made HuggingFace change the leaderboard and add a checkbox that hides model merges, because they were flooding the leaderboard, and often require much less effort than actually pre-training or even finetuning a model

    And quite often the top of the leaderboard was overrun with model merges, like in this example of Bagel and its merges by CloudYu (which are not the top ones, but still in the top 10 as I write this)


    On why it works?

    Nisten summarized this pretty well in this now famous copypasta tweet, and I've confirmed with Maxime that this is his current understanding as well: it's quite unclear why this seems to perform so well, but that of course doesn't stop the "folks who look for AI Waifus" from merging away.

    This got folks like Nathan Lambert from interconnects.ai to start paying attention, even though he didn't want to! (Still waiting on your writeup Nathan!)

    UPDATE: As of today, Monday Jan 29th, Nathan just released a super comprehensive deep dive into merges, which you can read here 👇👏

    YALL + Automated LLM Evaluation

    Maxime has also worked on so many models of his own that he built a convenient little tracking leaderboard to track their performance, which he called YALL, Yet Another LLM Leaderboard, and it's on HuggingFace. You can see that NeuralBeagle is the top dog (sorry, I literally could not resist)

    It uses the Nous evaluations, and Maxime has created an automation called LLM AutoEval that makes it really simple to run evaluations, which you can run in a Colab super easily.

    LLM AutoEval is on Github.

    Merge-aology!

    Since chatting, Maxime has released a Colab and later a HuggingFace space that takes model names and shows the genealogy, nay, Merge-aology of the models: which models each was merged from. It's pretty crazy how deep this rabbit hole goes, and crazier even still that these models perform very well after all of these lobotomies!

    Try it out here: https://huggingface.co/spaces/mlabonne/model-family-tree

    I really hope you enjoy this special deep dive, I definitely learned a BUNCH from this conversation with Maxime, and I'm very happy that he came on!



  • What A SHOW folks, I almost don't want to write anything in the newsletter, to MAKE you listen haha, but I will, I know many of you don't like listening to me babble.

    But if you chose one episode to listen to instead of just skimming the show-notes, make it this one.

    We've had 2 deep dives, one into the exciting world of multi-modality, where we chatted with the creator of Moondream1, Vik, and the co-founders of Prophetic, Wes and Eric, about their EEG/fMRI multimodal transformer (that's right!), and then we had a DEEP dive into the new Hourglass Diffusion Transformers with Tanishq from MedArc/Stability.

    More than 1300 tuned in to the live show 🔥 and I've got some incredible feedback on the fly, which I cherish so if you have friends who don't already know about ThursdAI, why not share this with them as well?

    TL;DR of all topics covered:

    * Open Source LLMs

    * Stability AI releases StableLM 1.6B params (X, Blog, HF)

    * InternLM2-Math - SOTA on math LLMs (90% GPT4 perf.) (X, Demo, Github)

    * MedArc analysis for best open source use for medical research finds Qwen-72 the best open source doctor (X)

    * Big CO LLMs + APIs

    * Google teases LUMIERE - incredibly powerful video generation (TTV and ITV) (X, Blog, ArXiv)

    * 🤗 HuggingFace announces Google partnership (Announcement)

    * OpenAi 2 new embeddings models, tweaks turbo models and cuts costs (My analysis, Announcement)

    * Google to add 3 new AI features to Chrome (X, Blog)

    * Vision & Video

    * Adept Fuyu Heavy - Third in the world MultiModal while being 20x smaller than GPT4V, Gemini Ultra (X, Blog)

    * FireLLaVa - First LLaVa model with commercial permissive license from fireworks (X, Blog, HF, DEMO)

    * Vikhyatk releases Moondream1 - tiny 1.6B VLM trained on Phi 1 (X, Demo, HF)

    * This weeks's buzz 🐝🪄 - What I learned in WandB this week

    * New course announcement from Jason Liu & WandB - LLM Engineering: Structured Outputs (Course link)

    * Voice & Audio

    * Meta W2V-BERT - Speech encoder for low resource languages (announcement)

    * 11 labs has dubbing studio (my dubbing test)

    * AI Art & Diffusion & 3D

    * Instant ID - zero shot face transfer diffusion model (Demo)

    * 🔥 Hourglass Diffusion (HDiT) paper - High Resolution Image synthesis - (X, Blog, Paper, Github)

    * Tools & Others

    * Prophetic announces MORPHEUS-1, their EEG/fMRI multimodal ultrasonic transformer for Lucid Dream induction (Announcement)

    * NSF announces NAIRR with partnership from all major government agencies & labs including, OAI, WandB (Blog)

    * Runway adds multiple motion brushes for added creativity (X, How to)

    Open Source LLMs

    Stability releases StableLM 1.6B tiny LLM

    Super super fast tiny model, I was able to run this in LMStudio, which just released an update supporting it. It punches above its weight specifically on other languages like German/Spanish/French/Italian (beats Phi)

    Has a very surprisingly decent MT-Bench score as well

    The license is not commercial per se, but requires a specific Stability AI membership

    I was able to get above 120 tok/sec with this model with LM-Studio and it was quite reasonable, and honestly, it's quite ridiculous how fast we've gotten to a point where we have an AI model that can weigh less than 1GB and has this level of performance 🤯

    Vision & Video & Multimodality

    Tiny VLM Moondream1 (1.6B) performs really well (Demo)

    New friend of the pod Vik (vikhyatk) trained Moondream1, a tiny multimodal VLM using LLaVa on top of Phi 1 (not 2 cause.. issues), and while it's not commercially viable, it's really impressive how fast and how good it is. Here's an example featuring two of my dear friends talking about startups, and you can see how well this TINY vision-enabled model can understand this scene. This is not cherry picked, this is literally the first image I tried with and my first result.

    The image features two men sitting in chairs, engaged in a conversation. One man is sitting on the left side of the image, while the other is on the right side. They are both looking at a laptop placed on a table in front of them. The laptop is open and displaying a presentation, possibly related to their discussion.

    In the background, there is a TV mounted on the wall, and a cup can be seen placed on a surface nearby. The scene suggests a casual and collaborative environment where the two men are sharing ideas or discussing a topic.

    Vik joined us on the pod to talk about why he didn't go with Phi-2. He also mentioned that Phi-1.5 was retroactively also MIT'd; its license literally says MIT now on HF 👏 Great conversation, tune in for that at around 00:31:35

    Adept is teasing Fuyu Heavy - their CHONKY VLM

    Adept previously released Persimmon, and then the Fuyu VLM (which is a type of persimmon, we see you Adept), and now teases the release of Fuyu Heavy, a much bigger model that can compete with or come close to GPT4V and Gemini Ultra on MMMU and MMLU (text) while being approximately 20x smaller.

    While we don't yet get to play with this, they show some great promise in the benchmarks

    ⭐️ Performance: Excels at multimodal reasoning and matches/exceeds text-based benchmarks.
    ❗️ Challenges Faced: Dealt with issues related to image data, model stability, and pre-training data scarcity.
    ✅ Evaluations: Outperforms Gemini Pro on MMLU and MMMU benchmarks.
    AI Summary by Arc Browser (haha see how I cheated here? I sometimes do shortcut summaries using Arc Max, it's dope, try it) https://t.co/BZi6EKhS5R

    Fireworks AI releases FireLLaVa - with a commercially available license

    FireLLaVA is the first commercially permissive open-source LLaVA model, a type of multi-modality model called a Vision-Language Model (VLM) that can understand both visual and textual inputs.

    * The original LLaVA model was limited for commercial use as it was trained on data generated by GPT-4, which has non-commercial licenses.

    * Fireworks.ai recreated the LLaVA training data using an open-source language model, CodeLlama 34B Instruct, to make a commercially viable version.

    * FireLLaVA performs comparably to the original LLaVA model on benchmarks, showing open-source models can generate high-quality data for VLM training.

    * FireLLaVA is available via HuggingFace and through Fireworks.ai's prediction API, enabling new visual capabilities for applications.

    Vik and I chatted about this, and while Fireworks didn't release datasets, they did release an example of how to start collecting them, and it's clear that everyone is clamoring after great vision / image datasets 👏

    Really hoping that many great datasets for multimodal AIs will come out in 2024, giving us increasingly better multimodal LMMs 👏

    Big CO LLMs + APIs (Blog)

    GOOGLE announces LUMIERE video generation model that shows incredible push in consistency

    Supports multiple tasks like image to video, text to video, video inpainting, video stylization and more, and looks incredible. It seems that they have cracked both spatial and temporal consistency, something that's severely lacking in previous video generation attempts, and it makes character consistency quite remarkable. Of course, as with other incredible Google papers, we never know if we'll ever see this model or be able to play with it, here's hoping 🤞

    Google will add 3 new AI features to chrome

    * Chrome is introducing 3 new experimental AI features to make browsing more efficient:

    * Tab Organizer: Chrome will automatically group similar tabs to help with multitasking

    * Custom themes: Users can generate unique browser themes using text prompts and AI image generation

    * Writing help: Chrome will offer suggestions to help users draft messages and posts on websites

    * They are currently only available to US users who opt in on the Experimental Features page

    I think this development is super super important because making AI accessible via the incredible Chrome platform to billions of people is going to put Gemini in front of grandmas, students, everyone. Quite impressive, and the compute needed to pull something like this off is also quite mindboggling! 👏

    Of course, they are not the first browser to add AI, I love the Arc Browser and it has AI previews that I use quite often!

    This weeks Buzz (What I learned with Weights & Biases this week)

    Have you, like many of us, had trouble getting structured output (JSON, other structures) from LLMs? Jason also had this problem, and that's why he authored the Instructor library, which makes it easy to guide the LLM to give structured output using Pydantic. Jason has presented at the AI Engineer conference, and recently collaborated with Weights & Biases to launch a free course on how to guide your LLM to give structured outputs!

    COURSE LINK

    Jason is also an independent consultant working with companies on their AI implementations and has many battle tested examples from implementations across the board, which he shared with us on the pod.

    Give this short course a try if you haven't yet, it's really high quality content, in addition to tons of other stuff we have there, for free 👏

    Voice & Audio

    11Labs has a new overdub studio and it's really working well

    Check out this short segment of myself, speaking in dubbed Russian! It really sounds like me; I sent it to my mom to see if she falls for it 😆 She didn't

    AI Art & Diffusion

    Hourglass Diffusion Transformers

    New high resolution diffusion architecture from K-diffusion and RoPE team (X, Blog, Paper, Github)

    The paper presents a new method called HDiT (Hourglass Diffusion Transformers) that shows promise for training models on high resolution images without incurring the significant hardware costs that come with scaling image sizes, replacing latent diffusion models while enabling O(n) complexity and scaling well.

    Utilizing tricks and best practices from transformer architectures, like RoPE (which we've covered on ThursdAI before), cosine similarity self-attention, RMSNorm, GeGLU, etc., and using something called local self-attention, this paper shows incredible promise for high resolution architectures for image creation tools.
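
    To get a feel for why high resolution in pixel space is so painful for plain transformers, here's some back-of-the-envelope arithmetic (the patch and window sizes are illustrative, not the paper's exact configuration): global self-attention cost grows with the square of the token count, while windowed/local attention grows roughly linearly.

    ```python
    def tokens(image_size: int, patch: int = 4) -> int:
        return (image_size // patch) ** 2

    def global_attention_pairs(n: int) -> int:
        return n * n                 # full pairwise score matrix

    def local_attention_pairs(n: int, window: int = 256) -> int:
        return n * window            # each token only attends within a local window

    for size in (128, 256, 512, 1024):
        n = tokens(size)
        print(f"{size}px -> {n:>6} tokens | global: {global_attention_pairs(n):>13,} | local: {local_attention_pairs(n):>11,}")
    ```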

    We had the pleasure to host Tanishq Abraham, one of the co-authors (and CEO of MedArc, Director of research with Stability + PHD at 19) to walk us through the paper, explain the problem and the solution. Additionally, friend of the pod is co-author as well 👏 and Alex Birch joined us silently from the audience 👂while giving commentary in the group chat.

    P.S - All of these co-authors attribute the bulk of the work to Katherine Crowson from k-diffusion 👏

    Tools & Others

    Prophetic introduces Morpheus-1 - multimodal foundational model trained on fMRI and EEG signals

    In a breaking news fashion, the folks behind Prophetic, a new startup that just announced MORPHEUS-1 as we were hopping into the space, came to chat with us.

    They are working on a new multimodal ultrasound transformer! That's right, multimodality is not only about images/text, folks. We've covered this before, but these chads are actually trying this out: they have trained a transformer architecture to take EEG and fMRI signals and output directions for the ultrasound to activate areas of the brain to induce lucid dreaming. And they are asking for beta testers!

    It's all quite futuristic, and if you're in NY, reach out to them (and then let us know if you had Lucid dreams!)

    Definitely worth a listen on the pod, and check out their video announcement for more details; it was really quite an incredible conversation with Wes and Eric.

    National Science Foundation launches NAIRR pilot (Blog)

    Partnering with 10 other federal agencies as well as 25 private sector, nonprofit and philanthropic organizations, the NAIRR pilot will provide access to advanced computing, datasets, models, software, training and user support to U.S.-based researchers and educators

    Basically, this is a huge governmental endeavor to provide resources for AI, make sure companies collaborate, and keep AI accessible across the board, and tons of government agencies as well as private sector companies have joined hands in this. Just look at this list, it's a veritable who's who of AI in the US (notably, Tesla/X is missing)

    And that’s all folks, that’s all she wrote (or I guess, I wrote) today! What an incredible show, really thankful for folks who came out, guests and co-hosts and see you next week!

    If you scrolled all the way to here and want to show me that you did, your emoji of the week is 🍊 (only cause persimmons don’t have emojis) so DM or reply with this and share this pod with 1 friend or tag us on social media!

    Full Transcription below:

    transcript

    [00:00:00] Alex Volkov: right, folks, it's time for the sound. Let's get it started today.

    [00:00:11] Alex Volkov: Welcome, everyone. Welcome to

    [00:00:13] Alex Volkov: this live recording of ThursdAI, the Twitter space, podcast, and newsletter that brings you everything that happened in the AI world, every Thursday, literally almost every Thursday. My name is Alex Volkov, an AI evangelist with Weights & Biases, and

    [00:00:33] Alex Volkov: this is ThursdAI

    [00:00:37] Recap & TL;DR

    [00:00:37] Alex Volkov: Alright, recap, here we go. Taking a deep breath. We've talked about an incredible amount of stuff here on ThursdAI for January 24th. The open source LLMs area was very interesting: we've talked about Stability AI releasing a tiny version of StableLM, 1.6 billion parameters, that's really good at different languages, the European languages as well.

    [00:00:58] Alex Volkov: And it's not commercially viable for open source per se, but it is available under the Stability membership, so if you have that, it's a great model for you. We've talked about InternLM2-Math, a state of the art math LLM. We briefly mentioned this, but it's getting 90 percent of GPT-4 performance on math, which was quite incredible.

    [00:01:16] Alex Volkov: We also had the pleasure of Tanishq Abraham joining us from MedArc for the analysis of open source models as it relates to the medical field. And it turns out that the model called Qwen-72 from Alibaba is the best open source doctor that we have, achieving incredible results and beating even MedPalm 1, which was trained back then by Google as one of the best medical LLMs.

    [00:01:42] Alex Volkov: We were also a very multimodal-heavy space today, like, a lot. We had the folks from Prophetic join us and talk about their multimodality, which is transformer based but not LLM based: their modalities are EEG signals and fMRI signals, as they work on hyper-focused ultrasound to induce a lucid dream state in your brain.

    [00:02:11] Alex Volkov: Their multimodal model basically takes inputs from EEG and outputs the directions for where to focus this ultrasound, which is super cool, and I definitely advise you to listen to them. It wasn't planned, I just saw the post and commented, hey, we're going to talk about this, and they jumped on. Prophetic looks like a cool multimodal attempt, nothing to do with vision, but we also talked about vision multimodality as well.

    [00:02:34] Alex Volkov: So we've covered Adept, the company that was founded by a few folks from the original Transformers paper. They have previously released the Persimmon models, and then Fuyu-8B, a multimodal model that did not use a vision encoder, like a different architecture. They released an announcement, they didn't release any code or weights or a way for us to try this yet, but they announced something called Fuyu Heavy, which is an extension of the previously released Fuyu-8B.

    [00:03:00] Alex Volkov: Significantly more trained. They talked about how difficult it is to train multimodal models, and they claim to have third place in the world after GPT-4 and Gemini Ultra on a bunch of the multimodal metrics and evaluations like MMMU and MMLU. They also talked about the process, how difficult it is to train these models at scale.

[00:03:20] Alex Volkov: So cool from Adept, and we're waiting for some way to test this. We also talked about FireLLaVA, which is, if you remember, we've talked about LLaVA before multiple times. LLaVA is an open source way to train multimodal models, like BakLLaVA from the folks on stage here, Nisten and Far El, and Obsidian from LDJ who's also here, and also Moondream.

[00:03:39] Alex Volkov: Like all of the things we've talked about are based on LLaVA. LLaVA was not commercially permissively licensed because of the dataset. FireLLaVA, from Fireworks AI, is the first LLaVA model released with a commercially permissive license. And we also had quite an interesting chat with Vik, who is the author of Moondream 1, which is a tiny

[00:03:59] Alex Volkov: 1.6 billion parameter vision language model, also on top of LLaVA, that has Phi-1.5 as the foundational kind of brain, the LLM brain in it. The conversation with Vik was very interesting, so shout out Vik. Thanks for coming up. Specifically because he also mentioned that, Microsoft, if you guys remember, Phi-2 was MIT licensed back in December.

[00:04:20] Alex Volkov: It was a surprise to all of us. And apparently they went back and also changed the license on Phi-1.5, which is super cool, and Vik told us that he saw this. So Moondream is a very capable, very tiny vision model that works quite well. Definitely worth listening to this conversation with Vik.

[00:04:36] Alex Volkov: We also announced, in the This Week's Buzz segment of ours about everything Weights & Biases, a new course in our academy from Jason Liu, the author of the Instructor library. He has a course that was released today called LLM Engineering: Structured Outputs.

[00:04:54] Alex Volkov: And as Nisten pointed out, a bunch of the folks in open source are learning from these free YouTube videos, and it's definitely worth checking out the Weights & Biases Academy because there's a bunch of knowledge there. And it's all for free, just join and just register. It's super, super cool. And then we had an incredible honor again of having one of the authors of this paper.

    [00:05:12] Alex Volkov: As always, I love when we discuss stuff and the authors of the stuff come to chat with us. So we had Tanishq Abraham. But also we had Alex Birch in the audience listening to us while he was working and sending us DMs from the new paper called Hourglass Diffusion High Resolution Image Synthesis.

[00:05:30] Alex Volkov: And this paper will be in the show notes, and Tanishq went in depth on the kind of problem they try to solve. They talked about how transformers and diffusion models were previously two separate areas; they weren't the first to integrate the two, but they definitely used a bunch of techniques to optimize transformers for the diffusion world and create pixel-space, high resolution image synthesis, which shows great promise going forward.

[00:05:59] Alex Volkov: Incredibly insightful conversation from Tanishq, definitely worth a listen. We also covered, in this area, Instant ID, which is a one-shot or zero-shot face transfer into diffusion models. So you can upload one picture of yourself and get quite incredible results in image diffusion.

[00:06:17] Alex Volkov: Or like generative images with your face or your kid's faces, which is super cool. I haven't tried my cat. I don't know if it works on cats' faces. I'll try it out. We covered a new state-of-the-art automatic speech recognition system that beats Whisper, or at least runs 30 times faster than Whisper, on different tasks.

[00:06:36] Alex Volkov: We're going to add this to the show notes as well. And a little bit about deepfake audio, with ElevenLabs having released a dubbing studio, and some conversation about whether, or how, it already affects politics. And then the last thing we've covered is the National Science Foundation, NSF, announcing a new partnership of all major labs and government agencies around AI, which includes DOD and DOE, and includes OpenAI and Anthropic, includes open source folks like Hugging Face, and Meta AI is also participating in this.

[00:07:11] Alex Volkov: And also Weights & Biases is part of that huge governmental partnership. So I think this is all the stuff that we've covered in this space.

[00:07:19] Show starts with housekeeping and structure breakdown

[00:07:19] Alex Volkov: We have quite the show for you today, and as always, there are no boring weeks in AI, are there? Some weeks start slow and then pick up, some weeks start crazy from the get go. If you remember, there was one week where one Friday had a bunch of releases, and this week we had a very full week, full of very cool innovations, but also exciting stuff.

[00:07:47] Alex Volkov: And then we have some authors of that stuff here with us today, and we're gonna talk about a bunch of multimodality, which we've been talking about for a while. Obviously the space started with the multimodal GPT-4 and then we just kicked it into high gear. I think it's time to get started with our default segments. So for those who are new to ThursdAI, we usually segment this into five or six segments, the biggest one being open source LLMs. And then we have big companies' LLMs and APIs, so we usually cover the Google stuff and OpenAI stuff.

[00:08:18] Alex Volkov: Mistral has been here and there, been [00:08:20] in the open source, and now is a big company as well. So depending on what they release, that's where the Mistral stuff falls. And then we talk about vision and video, which is basically where we'll cover the multimodality stuff, and that section is going to be, I think, the main one today.

[00:08:36] Alex Volkov: There's so much stuff. It's crazy. We also have this corner I call This Week's Buzz. I feel like I have to explain this. Maybe people don't get this dad joke that I put in there. Buzz, as in bees, right? So bees, Buzz. And Weights and Biases, the shorthand for Weights and Biases is WandB.

[00:08:54] Alex Volkov: Weights and Biases, W and B. And for a very funny reason, there's a mascot of ours that's a bee that's holding a wand, because it's WandB. And this little joke has been prevalent in many places. I think I haven't explained it yet. And so This Week's Buzz is actually the corner about everything that I've learned at Weights & Biases every week.

[00:09:13] Alex Volkov: And so in this corner we're going to chat with Jason and announce some cool stuff. The next corner we have is voice and audio, where we usually have a bunch of stuff. We have VB from Hugging Face usually join us, he's like the AI audio person over there. There's not a lot of voice and audio stuff this week.

    [00:09:29] Alex Volkov: So I actually don't have anything voice and audio related in my notes. However if you guys know like very cool things that happened. This week with voice and audio, please let me know, we're going to talk about them. We're going to move to AI art and diffusion in the next segment. We're going to talk about some cool things there.

[00:09:45] Alex Volkov: And then the last segment is like a free-for-all, it's tools and others. So I usually put agents in there, I usually put super cool things. I have two exciting things to talk about there. So this is usually the structure.

[00:09:58] Nisten Tahiraj: I do have one more thing there, and it's the W2V-BERT speech encoder. I think it's from Meta, and it's supposed to be like 30 times faster than Whisper. So yeah, it's another very efficient automatic speech recognition, ASR, model. I'll post it in the links

[00:10:20] Alex Volkov: And I think also we had ElevenLabs announce something. Yeah, I had a tweet about it actually, with ThursdAI content that I spoke in English, obviously, and then I asked it to translate to Russian. We'll cover this, ElevenLabs has a dubbing studio.

    [00:10:33] Open Source LLMS

    [00:10:33] Alex Volkov: And then, let's go to open source, folks. I think let's go to open source.

[00:10:55] Alex Volkov: All right, let's start with our open source segment here. And I think the first thing we should probably quickly mention is our dear friends at Stability AI, folks who've made a dent in the industry with Stable Diffusion, obviously, but they're training a bunch of other stuff. We've talked about multiple things they did.

[00:11:12] Stable LM 2 1.6B

[00:11:12] Alex Volkov: We've talked about Stable Video Diffusion and how open source lags behind closed source, but not by that much. And Stability released a new LLM; they had Stable LM before. I think, Nisten, have you used Stability stuff before? For the LLM stuff?

[00:11:31] Nisten Tahiraj: I have, months ago, so I'm not up to date on it.

[00:11:35] Alex Volkov: Yeah, so

[00:11:36] Nisten Tahiraj: I used it on Google Colabs and such.

[00:11:37] Alex Volkov: Yeah, so they haven't changed the industry in the LLM world as much as they have in the image diffusion world, for sure. However, there's a big however: they're working on multiple fronts. And it looks like, I had a chance to actually chat with Emad for almost 20 minutes.

[00:11:52] Alex Volkov: Emad is this very incredible person who knows a lot about a lot. And the conversation there is basically a stream of consciousness conversation, which I had no trouble following, because we talk about everything here on ThursdAI. But the folks who were with me talking to Emad, they looked at me like, how do you know all this?

[00:12:11] Alex Volkov: And I'm looking at Emad like, how does Emad know all this? That's what happens when you run Stability. So they're training a bunch of different models. This week they gave us Stable LM 2, which is a tiny model, a 1.6 billion parameter model. We've been saying this previously.

[00:12:24] Alex Volkov: It's really funny to say small LLM, right? If you expand the LLM abbreviation, it's a small large language model. But this one is tiny. It runs super fast on multiple devices. I think their point is actually edge device running. So obviously we've covered multiple small LLMs before, we've covered Phi, if you remember Phi-1; we're gonna talk about Phi with Vik in a second.

[00:12:47] Alex Volkov: We also talked about Phi-2, and I think there's a few others. This Stability release, it's pretty good. It's pretty good. I was itching to play with this; they released a GGUF. Apparently, I dunno if you knew this, but Stability has their own cpp and their own GGUF file, which is, for those who are not following all the acronyms,

[00:13:11] Alex Volkov: GGUF is a quantized format for models. Apparently Stability's cpp is incompatible with llama.cpp, and so LM Studio had to add specific support for this, and they did. So if you want to play with Stability AI's Stable LM, now you can, with LM Studio, and LM Studio, at least in my experience, gave me ridiculous performance.

[00:13:34] Alex Volkov: I got, on this MacBook M3 Max, more than 130 tokens per second, which was ridiculously fast. And the model was fairly capable for a small model. I was very impressed. So if you want to play with a small model, if you want to do some stuff with this, Stability's is definitely an interesting one.

[00:13:53] Alex Volkov: Support is in LM Studio. Yeah, go ahead.

[00:13:56] Nisten Tahiraj: Yeah, it's a 1.6B. That means it's 1.6 gigs to run at eight bit without losing much accuracy. However, that means it has a lot more applications for tiny stuff, because you can get that down to 800 megs, and so on. People did find some issues. Again, it's a tiny model, but they found issues with it being able to continue a conversation.

    [00:14:24] Nisten Tahiraj: However, for one shot answers, it was extremely capable. So just keep that in mind when using it. It is probably right now the best model for that size. Just keep in mind if you're going to do something with it. Don't expect much in terms of follow up stuff. Just if you can do it in one shot, great.

    [00:14:48] Nisten Tahiraj: Use that. And yeah that's about all I have to say.

[00:14:51] Alex Volkov: Yeah. An additional thing: it punches above its weight on other languages. So if you folks remember when we talked about Mistral, for example, getting compared to OpenAI and Anthropic, et cetera, with Mixtral and Mistral Medium the models are specifically better for the European languages, German, Spanish, French, Italian, all those.

[00:15:11] Alex Volkov: Stability is also playing in that market, it looks like, at the smaller size. And this tiny model beats the Phi versions at three billion parameters. So it beats models twice its size, even some seven billion parameter ones, specifically for European languages,

[00:15:25] Alex Volkov: and if you remember, we've talked about MPT from Mosaic, right? Yeah. So this model beats Mosaic's MPT-7B, which back in May was probably the coolest open source model. That was 7 billion; this beats that on MT-Bench and everything.

[00:15:40] Alex Volkov: It's quite incredible. It beats Falcon 40B. Really, the reason why we bring you these models is not only to say, hey, use this one. Because, as Nisten said, this one may not be exactly right for your commercial stuff. Also, it's not really commercially viable; there's a specific Stability license.

[00:15:58] Alex Volkov: Stability membership, they call it. You have to apply for the Stability AI membership, and then based on the size of your business you're able to use it; they have to make money somehow. But we bring this to you also to show how fast we're moving from a 30 billion parameter model to a 7 billion parameter model, and now to a

[00:16:13] Alex Volkov: 1.6 billion parameter model that compresses an incredible amount, trillions of words of human knowledge, into just, Nisten, did we say this can go down to less than a gig, right? If we look super quick,

[00:16:28] Nisten Tahiraj: Yep. At 4 bit, it should be 800 megs. So we're getting to the point where they'll just fit on a Raspberry Pi Zero with 512 megs and they'll be conversational [00:16:40] and useful and even multimodal. So we're almost there.
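As a quick back-of-the-envelope check of those numbers, here is a rough sketch of the arithmetic; it ignores quantization overhead like scales, metadata, and runtime KV-cache memory, so treat the output as an approximation.

```python
# Rough model-file size: params * bits / 8 bytes per parameter.
# Ignores quantization overhead (scales, metadata) and runtime memory.
params = 1.6e9

for bits in (16, 8, 4):
    gigabytes = params * bits / 8 / 1e9
    print(f"{bits}-bit: ~{gigabytes:.1f} GB")

# Prints roughly: 16-bit: ~3.2 GB, 8-bit: ~1.6 GB, 4-bit: ~0.8 GB
```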

    [00:16:43] Alex Volkov: Yeah, it's quite incredible. And then, okay, so this is stability stuff. Meanwhile, I'll say hi to a new guest of ours that I just saw on my timeline.

    [00:16:51] Prophetic announces MORPHEUS-1 an EEG/fMRI multimodal to induce lucid dreams via hyperfocused ultrasound

    [00:16:51] Alex Volkov: What's up Wes, how are you?

    [00:16:53] Wes Louis: Hey

    [00:16:54] Wes Louis: guys, how are you?

[00:16:55] Alex Volkov: Hey. Hey, welcome. Folks maybe saw my tweet, maybe didn't, that I love planning for ThursdAI, but I also love breaking news. As I was planning, I was going through my feed, and thankfully my Twitter feed is back to its old self, giving me the best AI stuff. And Wes, and I think your co-founder is also here.

    [00:17:10] Alex Volkov: Eric, yeah. Let me add you real

[00:17:12] Alex Volkov: quick. I didn't plan on this, folks, I just literally tagged them and they came. The video that you guys posted came through my timeline, and I would love to give you the stage for a minute or two to explain what Prophetic is, because the transformer stuff that you discussed with the EEG and fMRI signals, I really dig.

    [00:17:30] Alex Volkov: Could you summarize that video for us for a brief, like two sentences? That would be super cool, I think.

    [00:17:38] Wes Louis: So

    [00:17:38] Wes Louis: this has been something we've been working on for a while.

    [00:17:40] Wes Louis: It's really a, essentially,

[00:17:42] Wes Louis: a multimodal transformer model that is designed entirely for neural data. And so basically, what we've done is, we built a dataset of EEG and fMRI, and what we're designing is a neurostimulation device to basically induce lucid dreams.

[00:17:59] Wes Louis: And so we built the dataset on heightened prefrontal cortex activity. This is the neural correlate of lucid dreaming. And we basically built a model where you prompt it with your current brain state. We have a set of sensors on the device, and then we output targets for the neurostimulation.

[00:18:17] Alex Volkov: That's quite incredible. So for folks in the audience, we talk about multimodality often, and oftentimes we just mean VLMs, vision and text, which we're going to cover a bunch today. But I think the highlight of today's ThursdAI is that multimodality applies to many things. So your multimodality, there's no text in there at all, right?

    [00:18:36] Alex Volkov: This is just EEG signals and fMRI signals. Is that correct?

[00:18:41] Wes Louis: Yeah, it's purely prompted with EEG. And one thing I'll say is, everyone talks about multimodal, and so you're using, let's say, an LLM, and you're prompting it with a photo, for example. This is similar in many ways, because neural imaging data, particularly EEG, you can nicely get into an image format.

[00:19:02] Wes Louis: And then prompt the model that way. But then on the generation side of things, we use a pretty unique fMRI embedding process that we've come up with ourselves, and ultimately the idea there is that you take this heightened neural activity, and those are candidates for targets for neurostimulation.

    [00:19:20] Wes Louis: And, we

    [00:19:21] Alex Volkov: What do you, sorry, what do you mean, what do you mean by targets for folks who have no idea what this means?

[00:19:26] Wes Louis: Yeah. We're using, this is the other big technology that makes all this work, focused ultrasound. Focused ultrasound, for those that don't know, is this really cutting edge neurostimulation technique that can get quite deep into the brain. Other techniques people may be familiar with, direct current, alternating current, really only reach the surface

[00:19:47] Wes Louis: of the brain, whereas focused ultrasound can get quite deep. But there's also this ability to steer the beam and also create acoustic holograms. And so when we think of heightened neural activity, it really takes the form of these 3D figures. And the idea is that we can create these outputs of fMRI targets and then translate those over to the focused ultrasound.

[00:20:12] Alex Volkov: So this multimodal transformer takes EEG signals on the input, and on the output it prints out those targets. Those are targets for this technology to then stimulate the brain to go into a specific state.

[00:20:31] Wes Louis: Yes, and all of this is closed loop, in that once you create the stimulation, the model is prompted again with the current brain state, and this is a continuous process of learning and figuring out what sets of tokens lead to this heightened state. That heightened state is really identified as gamma frequencies, which is the fastest band of activity.

    [00:20:53] Wes Louis: So it's this continuous process until someone gets to a lucid state.

[00:20:58] Alex Volkov: That's quite incredible. So you guys announced the model today, but you're not releasing it as open source; this is just an announcement of your efforts, correct? Anything else you want to add here? And I think you started talking about how folks can join the beta if they want to.

[00:21:12] Wes Louis: Yeah, that's what I

[00:21:12] Wes Louis: would point out, that we have a beta program; that's really the purpose of this announcement, we're looking for people to sign up. We've had 200 or so in the last two hours. And so this spring we'll have it working. And if you're New York based, or you're willing to come out to New York, we'd be more than happy to have you test out the product.

    [00:21:31] Alex Volkov: That's awesome. Congrats folks. Actually, you want to add anything?

    [00:21:33] Eric Wollberg: Alex. Hey, how's it going? This is Eric. I'm a

    [00:21:36] Alex Volkov: Oh, Eric, yeah.

[00:21:37] Eric Wollberg: co-founder with Wes. Yeah. Hi, thanks for doing this. Yeah, one thing that's, I think, the sequence of how we've released these things: we showcased our prototype in October, which we designed with Card79, who notably did Neuralink for Elon, and then we also worked with Max Hodak at Science.

[00:21:52] Eric Wollberg: Max Hodak used to run Neuralink for Elon and then spun out Science. So really top consumer BCI kind of design folks. And so now we have this model, right? This ultrasonic transformer, where now we're going to be migrating that onto the technically working prototype and beginning neuromodulation.

[00:22:08] Eric Wollberg: So that's what the beta user program is all about. We've got, yeah, like 225 people signing up in the first two hours. We're really excited to have people on board and begin to do this. You have an opportunity, especially if you're early up on that list, to be the first person to achieve an ultrasonically induced lucid dream, which, you know, I think is going to be a pretty watershed moment.

    [00:22:28] Alex Volkov: That's super cool. I've tried to, to lucid dream a lot of times in my life and I never actually got to a stable one. So I'm excited to follow you guys, but also excited from the technology application of this, because we talk about transformers and a lot of this is going to LLMs.

[00:22:42] Alex Volkov: This week we're also going to talk about transformers as applied to diffusion models. And here you are doing full multimodality out of left field, so I love it. And hopefully you guys will do some cool things and keep us up to date, and you're welcome to join ThursdAI

[00:22:55] Alex Volkov: to talk about this.

    [00:22:57] Nisten Tahiraj: Awesome. Thanks, Alex. Thank you, Alex.

    [00:22:58] Alex Volkov: Thanks for hopping on, folks. And as folks, as I love breaking news here on Thursday. This is like a tiny breaking news. Thank you, Wes. Thank you, Eric, for joining folks. If you want to try, the future, sign up for the beta, because why not?

[00:23:09] Alex Volkov: And I think it feels like it's non invasive, right? You put this headset on, and then hopefully you go to sleep, and hopefully you're able to control your dreams, which is like what Vision Pro will do for the outside world, but this is inside your dream. It's super cool. All right, let's move on to, I think we're moving on to the big category for multimodality, as we're already here.

    [00:23:33] Alex Volkov: Vision and video and multimodal, or at least VLM multimodal.

    [00:23:38] Adept teases Fuyu Heavy, their flagship multimodal catching up to Gemini Ultra and GPT4V

[00:23:38] Alex Volkov: I'm gonna start with the big dog here, Adept. If you guys remember, Adept Labs was co-founded by a few folks from the original Transformer paper. I don't think they're there anymore, but I feel like I have to add this

[00:23:52] Alex Volkov: prefix every time we talk about Adept. Adept released a few models for us. If you guys remember, Persimmon was a 7B or 8B model, 8B, it was weird, but they released an 8 billion parameter model. It was very interesting back then. On top of this they then released Fuyu; Persimmon is a type of fruit, and Fuyu is a variety of persimmon.

[00:24:10] Alex Volkov: So we see you, Adept, we see your jokes here. Also, I love the LLM naming. And then they released Fuyu back then. Fuyu was interesting from the perspective that it didn't use a vision encoder, it did something else. It was very interesting that their approach to vision models allowed them to use non-standard image sizes, because they didn't train against a fixed image size.

[00:24:31] Alex Volkov: So back then, that was what was interesting. And now they've announced, they haven't released anything, they haven't said, hey, here, use this, I wasn't even able to use it. But they announced Fuyu Heavy. Fuyu Heavy, according to them, and so far Adept have been trustworthy enough for us to trust

[00:24:48] Alex Volkov: what they say, is the third in the world multimodal model, or I guess VLM. So not multimodal like Wes and Eric just told us, but multimodal in the sense of images plus text together. This is the [00:25:00] third in the world model behind GPT-4 Vision and Gemini Ultra. Gemini Ultra we haven't yet tried, obviously, we don't have access.

[00:25:08] Alex Volkov: If you have access to Gemini Ultra in the audience, and you want to help me, help a brother out, let me try and play with this, please let me know. But so Adept is announcing that Fuyu Heavy, their model, is 20 times smaller than GPT-4 Vision. I have no idea how they even know what size GPT-4 Vision is.

[00:25:28] Alex Volkov: They say around 20 to 30 times smaller, and it comes very close on the multimodality stuff. And they talk about the challenges of creating a large multimodal image-based model. The challenges stem from the fact that there aren't a lot of datasets to properly test with, and the tooling and instrumentation are really hard for images as well.

[00:25:47] Alex Volkov: And so they announced this and showed some very incredible performance. And I will remind folks that Adept specifically started with tools to let you run your computer. So their models are specifically tuned on UX, UI and web stuff. Expecting to hear more from them and finally getting to play with this.

[00:26:06] Alex Volkov: Go ahead, Far El.

    [00:26:09] Far El: I just

    [00:26:09] Far El: want to say that,

    [00:26:10] Far El: Demos are easy. I'm going to take it with a

[00:26:14] Far El: grain of salt until I actually see the model or am able to test it. The thing is that there is no indication of actual inference speed, or whether these examples were cherry picked or not, right? There's a lot of question marks about this, especially when you just come out and make a marketing announcement without actual access to the model.

    [00:26:37] Far El: Yeah, it looks cool, but I'm not, I'm not hyped just because it's not like it's not verified or validated

[00:26:43] Far El: in any way.

    [00:26:44] Alex Volkov: Yeah, I'm with you, I'm with you. Specifically I will say though, about Adept specifically, we've seen stuff from them, we've seen papers from them before, and they did, folks started asking like, Hey, where's the weights? Where's the weights? And they did say that, stuff is coming, but they want to like, keep a competitive edge.

    [00:27:00] Alex Volkov: But we see, we've seen like at least a new architecture from them, if you remember with Fuyu. And so we know

    [00:27:05] Nisten Tahiraj: Oh, of course.

[00:27:06] Alex Volkov: yeah, the Fuyu architecture is legit, like they literally were able to create a multimodal model without an image encoder back then. We're definitely going to listen to them. But based on the metrics that they released, if this actually performs as well on MMMU, which is kind of the multimodal equivalent of MMLU,

[00:27:25] Alex Volkov: for multimodal stuff, their Heavy model is going to be very exciting, definitely.

    [00:27:29] Fireworks releases FireLLaVa with a fully commercially viable license

[00:27:29] Alex Volkov: Moving on. Actually, Far El, we'd love to hear what you think about this. And actually, Vik, this leads into the next conversation with you. Fireworks AI, which I haven't actually used, released the first LLaVA model with a commercially permissive license.

[00:27:43] Alex Volkov: So LLaVA was released. We've talked about LLaVA; it's the architecture that allows many of these models to be trained in a multimodal fashion, correct? LLaVA was released without a commercial license because it was trained on a bunch of data that, I want to say, wasn't marked for commercial and open source licensing.

[00:28:01] Alex Volkov: So a lot of these models that we get, we cannot actually use in production. And FireLLaVA is announced as the first LLaVA model with commercially permissive licensing. And I think that's super cool, because finally folks will be able to build on this. And as a reminder, LLaMA, the LLM, was released without a commercial license.

[00:28:19] Alex Volkov: And then Llama 2 was released with a commercial license, and then an incredible amount of stuff started happening, because companies who wanted to use this in production actually started looking into it and using Llama 2. And so hopefully the same will start happening with FireLLaVA. I actually am not sure if they released the weights.

[00:28:36] Alex Volkov: I think they did. Yes, they released the weights: FireLLaVA 13B from Fireworks AI on Hugging Face. And yeah, Nisten, go ahead. You guys trained stuff on top of LLaVA, so please, first of all, introduce the stuff that you've trained, and then also comment on the ability to use this now in production.

[00:28:56] Nisten Tahiraj: Yeah, I just want to say that the entire vision field, open source and non open source, is extremely competitive right now. For example, here, we've released BakLLaVA, which is bak-llava, again with the naming. That was three months ago. Also, LDJ here made Obsidian, which is the 3B one, and then they made a 7B as well.

[00:29:22] Nisten Tahiraj: We also have the dev lead of Qwen, he was in the audience as well, and they made the Qwen 14B VL. Oh, and we have Vik as well, who also made a very fast and small model recently. And BakLLaVA was being used as a benchmark, which was pretty interesting, actually. Yeah, the vision LLMs are extremely competitive right now, and I think it's one part where open source can really surpass what you get from any API, because it's something you can run locally on the device and have full control over.

[00:30:01] Nisten Tahiraj: So, yeah, as for Fireworks' 13B, that's still a Llama 13B base as far as I saw, and I tried to use their inference on their site, but it wasn't working, and I can't complain too much about it, because ours is not working either. Also, to comment a little bit on Fuyu, because I do like that they're trying a completely new approach: they don't use anything similar to CLIP image models, which is what everybody else uses. They do something where they take, I think, groups of pixels and serialize them, so the image is just represented as another string of tokens. So they can scale

[00:30:48] Nisten Tahiraj: to 8k, 16k, whatever you have; they don't have that limitation that others have in terms of architecture. So it is good to see that approach is working overall; whether it will be competitive, we'll see. So yeah, I wanted to comment on that. But yeah, I haven't actually tried the Fireworks model itself, but from what I saw, the architecture is similar to LLaVA 13B. Yeah, that's about all the comments I have on that.
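For readers curious what "serializing groups of pixels" looks like in practice, here is a minimal sketch of the Fuyu-style idea: split the image into fixed-size patches and project each flattened patch straight into the decoder's embedding space, with no separate CLIP-style vision encoder in front. The patch size and hidden dimension below are illustrative assumptions, not Adept's actual configuration.

```python
import torch
import torch.nn as nn

PATCH = 30        # patch side length in pixels (assumption, for illustration)
HIDDEN = 4096     # decoder hidden size (assumption)

# A single linear layer maps a flattened RGB patch to a "token" embedding.
patch_proj = nn.Linear(3 * PATCH * PATCH, HIDDEN)

def image_to_patch_tokens(image: torch.Tensor) -> torch.Tensor:
    """image: (3, H, W) float tensor -> (num_patches, HIDDEN) patch embeddings."""
    c, h, w = image.shape
    h, w = h - h % PATCH, w - w % PATCH          # crop to a multiple of PATCH
    patches = (
        image[:, :h, :w]
        .unfold(1, PATCH, PATCH)                  # (3, H/P, W, P)
        .unfold(2, PATCH, PATCH)                  # (3, H/P, W/P, P, P)
        .permute(1, 2, 0, 3, 4)                   # (H/P, W/P, 3, P, P)
        .reshape(-1, c * PATCH * PATCH)
    )
    return patch_proj(patches)                    # fed to the decoder alongside text tokens

# Because the image becomes a variable-length token sequence, arbitrary
# resolutions and aspect ratios work without resizing to a fixed grid.
tokens = image_to_patch_tokens(torch.rand(3, 480, 640))
print(tokens.shape)   # torch.Size([336, 4096]) for the 480x630 crop above
```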

    [00:31:22] Alex Volkov: And like you said, interestingly, it's still based on Lama, right? And it's time for, it's time for new things. And I think this takes us to the next topic of conversation. And again, Vic, I want to introduce you properly this time, or at least let you introduce yourself.

    [00:31:35] Moondream1 from Vik Hyatk - 1.8B VLM

[00:31:35] Alex Volkov: But the next iteration of our conversation about multimodality, and like we said, today is a multimodal space, is the existence of very tiny vision models, vision language models, or large multimodal models; it's really hard to name these things. Vik, welcome to the space, this is your first time. Please introduce yourself and then let's talk about Moondream a little bit.

    [00:31:57] Vik Hyatk: Hey folks hey Alex, thanks for having me. Super excited. My name is Vik. I'm pretty new to the AI space, I think. Like a lot of people, I got into it when that big stable diffusion moment happened. And I was like, this is what I need to spend my life working on. So I went out, bought a workstation with 3090 and started playing around with stuff.

[00:32:15] Alex Volkov: You and me both, brother, you and me both. And, okay, so the reason why you're here, and the reason why I'm calling on you in the vision and video area, is because of Moondream 1. Can you introduce Moondream 1 a little bit to the audience?

[00:32:29] Vik Hyatk: Yeah, so it's a small language model. It's about 1.6 billion parameters. It's built on top of SigLIP from Google or DeepMind, I forget which one of the two, because that's the vision encoder, and it uses Phi-1.5 as the text model, and then it's trained using the standard LLaVA approach. So super thankful for the folks that worked on these projects, amazing models they've put together.

    [00:32:52] Vik Hyatk: It works. I'm tooting my own horn a little bit here, but it's surprising. I see people post screenshots of them asking questions and it still blows my mind that it works that well.
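Structurally, that recipe is the usual LLaVA pattern: a frozen vision encoder, a small projection that maps image features into the language model's embedding space, and a decoder-only LLM that attends over the projected image tokens plus the text. The sketch below only illustrates that wiring; the layer sizes and the two-layer MLP projector are assumptions, not Moondream's exact configuration.

```python
import torch
import torch.nn as nn

class LlavaStyleVLM(nn.Module):
    """Minimal sketch of a LLaVA-style VLM: vision encoder -> projector -> LLM."""

    def __init__(self, vision_encoder: nn.Module, language_model: nn.Module,
                 vision_dim: int = 1152, text_dim: int = 2048):
        super().__init__()
        self.vision_encoder = vision_encoder          # e.g. a SigLIP image tower, usually frozen
        self.projector = nn.Sequential(               # bridges image features into text space
            nn.Linear(vision_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )
        self.language_model = language_model          # e.g. a Phi-1.5-class decoder

    def forward(self, pixel_values: torch.Tensor,
                text_embeds: torch.Tensor) -> torch.Tensor:
        image_feats = self.vision_encoder(pixel_values)    # (B, N_img, vision_dim)
        image_tokens = self.projector(image_feats)         # (B, N_img, text_dim)
        # Prepend the projected image tokens to the text embeddings and let the
        # decoder attend over the combined sequence.
        inputs = torch.cat([image_tokens, text_embeds], dim=1)
        return self.language_model(inputs)
```

Training then typically freezes the vision tower, pre-trains the projector on image-caption pairs, and fine-tunes on visual instruction data, which is roughly the LLaVA procedure Vik references.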

[00:33:03] Alex Volkov: Let me toot the horn a little bit, because I definitely tried it out. Thank you for the Hugging Face space that you put up super quick. And the next follow-up is going to be about how to actually use this, but this is based on LLaVA, so the same non-commercial license, correct?

    [00:33:19] Vik Hyatk: [00:33:20] Correct. The top piece of feedback I've gotten from people is that they want to see this with a commercially permissive license. I'm working with, working on that. The FireLava folks didn't release the dataset, but thankfully they did talk about their process to create the the non encumbered version of the dataset.

    [00:33:37] Vik Hyatk: So I'm working on it. I'll have that out in a couple of days, the dataset at least, and then we can start training models that aren't encumbered like that.

[00:33:44] Alex Volkov: Incredible. And so the next thing that I wanted to talk to you about is Phi. So Phi is from Microsoft. Phi-1.5 was not released with a commercial license; we remember it was trained on synthetic data and TinyStories, a tiny 1.6 billion model. We saw a few releases since then. So obviously we talked just now about Stable LM.

[00:34:01] Alex Volkov: Semi-commercial, if you're a part of their membership. And also, Phi-2 was MIT licensed. It's a little bit bigger, it's three, I think, billion parameters. Have you tried Phi-2, and could you speak about that experience?

[00:34:14] Vik Hyatk: Yeah, I did, actually. So I was initially working on training Moondream 1 with Phi-2 once it came out. There are some issues with fine tuning it when you have flash attention on, I believe, and so it just takes a lot longer to train. So I went back and looked at Phi-1.5, and I saw that they updated the license for

[00:34:32] Vik Hyatk: 1.5 to MIT as well.

    [00:34:33] Alex Volkov: Oh, really?

[00:34:35] Vik Hyatk: So I figured I'd stick with what works. Yeah.

    [00:34:37] Alex Volkov: Wow. I did not know this. So it actually updated the license backwards.

    [00:34:42] Vik Hyatk: Yeah, on the Hugging Face page, at least it says MIT now.

[00:34:45] Alex Volkov: I love it. It would make sense, right? But folks, I don't think we've talked about this, so breaking news here. Thanks, Vik. Phi-1.5 is also MIT now; we'll check this, we'll double check,

[00:34:55] Nisten Tahiraj: Also Phi-2. They're both MIT licensed now. So whatever pressure we put on Microsoft's Azure side, it worked.

[00:35:03] Alex Volkov: Nice. That's incredible. Alright, so now this part of your stack for Moondream is MIT licensed. So LLaVA is the only thing that's holding this back from being used in

    [00:35:14] Vik Hyatk: Just the

[00:35:14] Vik Hyatk: data set, yeah.

[00:35:16] Alex Volkov: The dataset. Okay. So definitely there's work being done there. I will just point folks' attention to the top of the space, where I pinned my tests.

[00:35:25] Alex Volkov: I literally just pasted an image. And again, thank you for the demo, Vik. Folks will get the demo in the show notes as well. I pasted an image of two of my friends just sitting and talking, across from a TV, with some things. Literally the model said: the image features two men sitting in chairs, engaged in conversation.

    [00:35:42] Alex Volkov: One man sitting on left side, one other on the right side. That's obvious, but still cool. They're both looking at a laptop placed on the table in front of them. The laptop is open and displaying a presentation. Possibly related to their discussion. So this feels like hallucination a little bit because the model does not know what it displays, but fine.

[00:35:57] Alex Volkov: And then: in the background there's a TV mounted on the wall, a cup can be seen placed on a surface nearby, the scene suggests a casual, collaborative environment. This is ridiculous. This is a super tiny model and it describes the scene almost perfectly. And I've tested the same image in a bigger model, GPT-4, and it pretty much gives me the same information.

[00:36:17] Alex Volkov: So I was really impressed. So tooting the horn, for sure, because the tinier the model is, the better the utilization. And we've talked about different vision-enabled hardware that is possible or not possible based on whether or not it can run stuff on something like a Raspberry Pi. The smaller and smarter these models are, the better we'll be able to use them on cheaper hardware.

    [00:36:40] Alex Volkov: Really impressive. What are you planning to do with this? Like, how has the community accepted this? What type of conversations did you get into? And what are you planning to do next here? Besides training the

    [00:36:51] Vik Hyatk: I was blown away by the reception to this. I've, when I put it up, I thought like maybe it might get like a hundred likes or something and then I'd move on to my next project. But I've seen a bunch of super cool demos. Come out of this, I think the fact that it is small and it runs inference so fast makes a lot of use cases that were previously not possible, a lot more viable, like captioning a video in real time or recaptioning a billion images and whatnot.

[00:37:15] Vik Hyatk: There's a couple of things I'm working on. Obviously the top thing is getting it to a permissive license. I also could use some help on a couple of fronts. So I do want to make it easier to run: GGUF, Ollama integration and whatnot.

[00:37:30] Alex Volkov: Definitely LM Studio integration. I would love to play around with this in LM Studio, just to see how fast this runs on my machine. MLX would be a cool suggestion as well; the community is very excited about MLX, I don't know if you saw. But LM Studio is a friend of the pod, definitely.

[00:37:46] Alex Volkov: I think it's super easy to just add it there, right? Nisten, it's not difficult.

[00:37:51] Nisten Tahiraj: You just gotta add a JSON file to your model and that's it. Or just message him, 'cause he's very responsive to this stuff, and he might even write the JSON for you. And then it will be immediately available for everyone running LM Studio.

    [00:38:06] Vik Hyatk: Amazing. Another thing we have going on, by the way, is we're building an agent version of this with Open Interpreter in mind.

[00:38:13] Vik Hyatk: A version of this that's excellent at identifying UI elements, because we want Open Interpreter to have the ability to operate purely off of a local model. Open Interpreter, by the way, super cool project, check it out folks if you haven't already, is a way to have an LLM use your computer.

[00:38:31] Vik Hyatk: So you can do stuff like: just tell the LLM, here, I want to turn dark mode on, and it'll figure out what buttons to click to enable dark mode for you.

[00:38:40] Alex Volkov: For folks who follow ThursdAI closely, they remember Killian came on the pod like a week after Open Interpreter was released, and this was, I think, in 2023, our most well-received episode back then. It was a super cool conversation, so shout out Killian Lucas, and definitely Open Interpreter has since grown a very big community of people building very cool things.

[00:39:00] Alex Volkov: Recently they released the browsing area, where it controls the computer for you. And it definitely needs eyes for that. And so I think it used GPT-4 Vision, and now you're saying that Open Interpreter will get open source eyes. Is that what I'm hearing?

    [00:39:15] Vik Hyatk: Exactly. That's a goal. CogAgent is super promising in this space. They didn't release their datasets, so we're working on replicating that. CogAgent is just too big for most people to run on their computers. It's I forget, 17 billion parameters or something.

    [00:39:29] Alex Volkov: Is that CogAgent and CogVLM, right? I think we, yeah, I think we talked about this. Yeah. It's really good

[00:39:35] Vik Hyatk: but yeah, that's another place where, if folks want to get involved, the link in my bio has a Discord; we'd love to collaborate with folks on getting that dataset together and training that version of the model.

[00:39:44] Alex Volkov: So I think the thing I'm hearing from Fuyu, and from you as well, is that the datasets for vision stuff are the bottleneck to creating incredible things, right? Datasets for images, datasets for how people use different UIs, for example, all these datasets are the bottleneck for us to get over the next hurdle of making these models even smaller and even faster.

    [00:40:04] Alex Volkov: So what are we doing folks? Let's start building multimodal data sets.

[00:40:09] Nisten Tahiraj: Yeah, and at first for BakLLaVA, we were going to have the dataset also be open source, because the code for us is open source as well. So it's not just open weights, it is fully open. However, the data we couldn't release because of the licensing, so that's not available. And yeah, it's pretty hard to make datasets for vision, because with text it's very easy now to manipulate, modify, do whatever you want to the data, and you can do that at large scale. With images, there just aren't that many tools, that many ready-to-go datasets, and the open source models just started getting good at them.

[00:40:52] Nisten Tahiraj: So yeah, that's going to remain a challenge for the time being. But again, if anybody here is a grad student, or you're at a company or in academia, the biggest contribution you can make is probably in the datasets, because the models will get replaced. You'll always have better models coming and going, but the datasets are forever.

[00:41:15] Nisten Tahiraj: If you want to make an impact in this field, get your professor, university, whatever, to put some money toward datasets. We need datasets for images, with images. Yeah.

[00:41:27] Alex Volkov: And we need them at an ever increasingly bigger scale. All right, Vik, thank you so much for joining us. Thank you for taking us through how you created Moondream. And thanks for telling us what's next and how [00:41:40] the community can help, besides just providing datasets and testing.

    [00:41:45] Alex Volkov: What else would you need?

[00:41:48] Vik Hyatk: I have a

[00:41:49] Vik Hyatk: list of issues on GitHub where I'm looking for help with various things. But besides that, compute always helps. I'm currently limited on how many things I can do, because my 4090s can only do so many matrix multiplications at a given time. So if anyone has compute they can give me access to, that would be super appreciated.

[00:42:09] Alex Volkov: Yes. I've seen this time and time again on ThursdAI on stage: folks ask for sponsorship for compute. I'm actually getting DMs from different companies like, hey Alex, the space is super cool, can we sponsor someone? And I'm like, no, I already work with Weights & Biases, I don't need sponsorship.

[00:42:25] Alex Volkov: I would want to connect folks that work on super cool things and need compute to keep going with the different companies around compute specifically. So I'll definitely keep you in mind. And go ahead, Nisten, you had a thing you wanted to say?

[00:42:38] Nisten Tahiraj: Yeah, just really quickly, this is a very effective way to make projects that are impactful. For example, with BakLLaVA, Far El here, and Suntex, they just put out a readme and tweeted something out, and we got compute. And we got it from Together Computer. So they sponsored that project, and it ended up being a very impactful project that a lot of people use.

[00:43:05] Nisten Tahiraj: That works pretty well. I'd just say be careful with conditional stuff. If they're gonna start talking about an NDA, just ignore them, because then you're doing an exchange, you're basically doing work for that person, so that's just a job contract, that's not a sponsorship. If someone's sponsoring an open source model,

    [00:43:27] Alex Volkov: Better be.

[00:43:28] Nisten Tahiraj: there should not be an NDA; that's not, that's no longer a sponsorship.

[00:43:32] Alex Volkov: Better be open source after that. Yes, absolutely. So Vik, I'll keep you in mind when people reach out to me. Folks in the audience, if you work at a company that wants to be featured forever in the open source community, definitely reach out to Vik. And we want more of this.

[00:43:47] Alex Volkov: We want more tiny models that perform incredibly well. We want them to be built into different tools that we can all use, without relying on or paying anyone, just using our own machines. So definitely we'll keep that in mind. Vik, welcome to the community of ThursdAI. You're more than welcome to keep joining and participating in this.

[00:44:06] Alex Volkov: I think it's time for us to move on, folks. It's been around 40 minutes, I think we're actually good on time. I think it's time for us to move on to This Week's Buzz. I really want to do a music transition here for This Week's Buzz, with bees buzzing, etc.

    [00:44:20] Alex Volkov: But maybe for next week. Let me just play the regular music and we'll transition and talk with Jason a little bit.

    [00:44:24] This weeks buzz - Jason Liu launches a new course with Weights & Biases for free

[00:44:24] Alex Volkov: All right, welcome to This Week's Buzz, where I talk about some cool things that happened, or that I learned, at Weights & Biases. Weights & Biases is, ooh, that was an abrupt music stop. Weights & Biases is the system of record for all your LLM needs. Pretty much most of the folks up on stage who train models use Weights & Biases.

[00:44:52] Alex Volkov: It's incredible, the ubiquity; Weights & Biases is pretty much present everywhere. I just saw StableKwan, one of our friends of the pod, train something and post a Weights & Biases snapshot of his loss curve going down, and I literally just asked, hey, do you mind putting a link to the dashboard?

[00:45:08] Alex Volkov: And he did. So if you wanna check out how his model is going, I think he's training something super cool. Oh, he's training a mixture, 4x400 million parameters, so he's training a tiny MoE, a mini Mixtral. StableKwan just posted a chart with the train loss from Weights & Biases, and I just asked, hey, can we follow along with the training? And he posted a link to the Weights & Biases dashboard, which is super cool.

[00:45:34] Alex Volkov: Which got a reaction from the Weights & Biases CEO. And so I love seeing this in the wild. So folks, if you're training models, please put those dashboards up so people can follow along. It's really nice. But the other news from Weights & Biases this week: I want to say hi to Jason Liu.
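For anyone who hasn't done this before, sharing a run like that is just a matter of logging to a public W&B project; the project name, config, and simulated loss below are hypothetical, a minimal sketch rather than anyone's actual training script.

```python
import math
import wandb

# Log to a public project so anyone with the link can follow the run live.
run = wandb.init(project="tiny-moe-demo", config={"params": "4x400M", "lr": 3e-4})

for step in range(1000):
    loss = 2.5 * math.exp(-step / 300)       # stand-in for a real training loss
    wandb.log({"train/loss": loss}, step=step)

run.finish()                                  # the run URL is printed to the console
```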

    [00:45:47] Jason Liu: Yeah, Jason Liu.

[00:45:48] Alex Volkov: Jason Liu. Welcome, Jason. I've seen you around. I've seen you, I think, at the AI Engineer event from Swyx. I don't know if we ran into each other there, but you had a talk there as well. Yeah.

[00:45:58] Jason Liu: Yeah, it was Pydantic Is All You Need. It did pretty well on YouTube, so I'm pretty

[00:46:02] Alex Volkov: It did great. I also talked with a bunch of people. I think I was interviewing folks outside of the stage while you were giving the talk, but it was very well received. And this is on a similar topic to what we're going to talk about now. So please feel free to introduce yourself briefly.

    [00:46:15] Alex Volkov: And then we're going to talk about the stuff that we did together.

    [00:46:19] Jason Liu: Great. Yeah. So I'm Jason. In the past year and a half, I've been mostly doing a lot of applied AI consulting. Before that, I spent the past like eight years just doing like machine learning. So I did the big data wave, the machine learning wave, the neural networks and deep learning wave.

[00:46:32] Jason Liu: And now we get generative AI. So it's been a lot of fun. And in my spare time I work on a library called Instructor. So now we have Instructor in, I think, JavaScript, Python, and Elixir. And the general idea is that we want to bring just functions and structs into LLMs and make LLMs feel a lot more backwards compatible with existing code, rather than creating new abstractions to handle some of these things.

    [00:46:55] Jason Liu: And I think that's been pretty well received in the community.

[00:46:57] Alex Volkov: Absolutely. So Instructor is definitely where I know you from. And today we have an announcement together. So feel free to announce the cool thing that we did and that you worked on really hard.

[00:47:09] Jason Liu: Yeah, so we're starting a new series around the idea of using schemas and structures to prompt language models. And I think at the end of this week, we're going to release the first part of an LLM engineering series. And the first part really is just an introduction on how we can use things like structure to prompt LLMs a lot better, right?

[00:47:30] Jason Liu: In the past, we just begged the language model to give us JSON. Now we have things like JSON mode and function calling and tools, which give us the ability to get more structure. But we still need a lot more tools and ways of thinking about how we can reason about these structures. And so part one is going to be around justifying and motivating why we might want to do this.

[00:47:54] Jason Liu: And then I think in February or March we'll start working on part two, which uses a lot of the new Weights & Biases observability tools to look at how I've solved a lot of LLM problems in production with a lot of my consulting clients.

[00:48:07] Alex Volkov: So just to highlight for folks, Weights & Biases has a free courses area, the Weights & Biases Academy. And some very prominent folks in the industry have collaborated with Weights & Biases to basically just teach, so we teach you for free how to do these things. We have courses on training LLMs from scratch, fine tuning, et cetera.

[00:48:24] Alex Volkov: And Jason is announcing a new course today that he wrote and recorded, and we helped edit a little bit, publish, and obviously talk about and promote: a course on how to actually ask your model to give you what you need as an AI developer, in structured output, using the Instructor library.

    [00:48:42] Alex Volkov: Correct, Jason?

[00:48:43] Jason Liu: Yeah, and these ideas can be used in other libraries as well, right? So for the Python community, we're really using a library called Pydantic, and this is supported in things like Langchain and Marvin. And so even if you don't use a library like Instructor, learning how to think about prompt infrastructure is still something that's going to be really applicable and valuable for everyone listening.
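To make the idea concrete, here is a minimal sketch of the Pydantic-plus-Instructor pattern Jason describes, using the instructor.patch API as it existed around the time of this episode; the UserDetail model, the model name, and the prompt are purely illustrative, and the call assumes an OPENAI_API_KEY in your environment.

```python
import instructor
from openai import OpenAI
from pydantic import BaseModel

# The schema we want back, expressed as a plain Pydantic model.
class UserDetail(BaseModel):
    name: str
    age: int

# Patching the client adds a `response_model` argument to chat completions;
# Instructor validates the LLM output against the schema and retries on failure.
client = instructor.patch(OpenAI())

user = client.chat.completions.create(
    model="gpt-3.5-turbo",
    response_model=UserDetail,
    messages=[{"role": "user", "content": "Extract: Jason is 25 years old."}],
)

print(user.name, user.age)   # a validated UserDetail instance, not a raw string
```

Newer Instructor releases expose the same behavior through a slightly different constructor, but the shape of the code, a Pydantic model passed as the response model, is the point being made.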

[00:49:05] Alex Volkov: And you mentioned before, there's a bunch of stuff that OpenAI comes up with, like JSON mode, for example, and there was function calling back in June. But the other LLMs don't necessarily follow the same new abstractions that OpenAI releases. I think Anthropic just recently announced that they're moving to a messages-style API, things like that.

[00:49:27] Function calling in open source LLMs

[00:49:27] Alex Volkov: And also we have open source, which is all over the place. So I guess my question is, with these libraries, with the Pydantic approach and Instructor, would that apply to other LLMs? Does this apply to open source, which we talk a lot about?

[00:49:40] Jason Liu: Yeah, so right now there are only a few open source models that support function calling. So if you've looked at some of the work from the Functionary team, they have been training, I think, Mixtral now with function calling, same with the guys at Nous Research with Teknium. There's been a lot of progress in the open source world on getting things like function calling.

[00:49:58] Jason Liu: If you want more structured outputs [00:50:00] too, there's a great library called Outlines that can use something like the Hugging Face Transformers library to also do structured extraction. And again, they also support things like Pydantic. And the goal of the course really is to show you how to think about and how to model these problems in a particular way.

[00:50:15] Alex Volkov: Absolutely. And I think Jon Durbin is in the audience; I think Airoboros was trained on function calling as well, if I'm not mistaken, Jon. So folks who haven't heard our conversation with Jon, definitely go and check out the deep dive with Jon about Bagel, which now includes the Airoboros dataset, which includes function calling as well.

[00:50:33] Alex Volkov: So that's awesome. The open source also moves there. Go ahead, Nisten.

[00:50:37] Nisten Tahiraj: Also, really quick, the Nous vision model ended up being good at function calling, although it had other drawbacks. It was good at function calling because of the Airoboros, like, thousand-something functions dataset. And as far as I saw, the newer Bagel models, so Bagel 7B, are also good at that, at function calling.
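For context on what these models are actually trained to emit, "function calling" means the model returns a structured call against a JSON-schema tool definition instead of free text. The sketch below uses the OpenAI-style tools format purely as an illustration of that contract; the get_weather tool is hypothetical, and open source fine-tunes like the ones mentioned here follow the same general pattern even though their exact prompt formats differ.

```python
import json

# A tool definition the model is shown: a name, a description, and JSON-schema parameters.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",                       # hypothetical example tool
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# Instead of prose, a function-calling model answers with something like this:
model_output = {"name": "get_weather", "arguments": json.dumps({"city": "Toronto"})}

# The application then parses the arguments and runs the real function.
args = json.loads(model_output["arguments"])
print(f"calling {model_output['name']}({args})")
```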

[00:50:57] Alex Volkov: So, the Beagle model series from Maxime Labonne. Again, shout out Maxime Labonne, who came on the pod last week; the full deep dive with him will be released this Sunday, so make sure you're subscribed. We don't talk about function calling there, we talk about NeuralBeagle. NeuralBeagle is one of the top performing 7 billion parameter models; it's a merge, and it's a cool conversation about merging.

[00:51:16] Alex Volkov: But let me get back to Jason real quick. Jason, you're also doing independent consulting, you said, in multiple places, helping them build. I'd like to tap into your experience from actually doing hands-on AI building at companies.

    [00:51:32] Alex Volkov: Like with the first obvious thing that comes to mind that people like AI builders probably like already solved in their minds. What do you have to go through to not only build to them, but also educate them on as you join the company, it starts like helping them out with AI stuff.

    [00:51:47] Jason Liu: Yeah. So one of the biggest things I noticed is that when we look at something like a RAG application, really what it looks like is a recommendation system. If you went on Netflix, for example, and you watch a bunch of movies and the recommendations don't get better, it would be a really terrible experience and you probably lose a lot of customers.

    [00:52:03] Jason Liu: But for a lot of companies these days that are using things like agents or retrieval, We are in a situation where, you know, no matter how many users you get, if you don't improve your language model, if you don't improve your embeddings, the product doesn't really get any better. And so one of the big things I'm focusing on this year is helping these companies build a better feedback loop and a data flywheel.

    [00:52:22] Jason Liu: And so we can know for sure that as we get more users, there's these network effects that improve the models that we want to train. And so I think step one is being able to fine tune your own embedding models and your re-rankers, and go from there, and then see what comes up in the future.
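    Editor's note: Jason's data-flywheel point, fine-tuning your own embedding model on user feedback, can be sketched roughly like this with sentence-transformers. The feedback pairs and model name are fabricated for illustration, and the exact API may differ between library versions.

```python
# Rough sketch: fine-tune an embedding model on (query, clicked-document) pairs
# harvested from user feedback, so retrieval improves as more users arrive.
# The data below is made up for illustration.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

feedback_pairs = [
    ("how do I reset my password", "To reset your password, open Settings > Security..."),
    ("what is your refund policy", "Refunds are issued within 14 days of purchase..."),
]

train_examples = [InputExample(texts=[query, doc]) for query, doc in feedback_pairs]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=16)

model = SentenceTransformer("all-MiniLM-L6-v2")
# In-batch negatives: each query should score its clicked document
# higher than every other document in the batch.
loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_loader, loss)], epochs=1, warmup_steps=10)
model.save("embeddings-finetuned-on-feedback")
```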

    [00:52:39] Alex Volkov: Awesome. So definitely, folks, give Jason a follow. The course, I think we're releasing it today, but I haven't seen any social mentions yet, and it's really worth watching. I watched a bit of it and will follow along as well. And this is a course series now. So we're going to start with this, and then we're going to continue with the monitoring tools that Weights & Biases have.

    [00:52:56] Alex Volkov: Correct?

    [00:52:58] Jason Liu: Yeah, the first course is like 30 minutes. It's super quick. The real goal is to show you what's possible and get you thinking about some new ideas. And then the next course will be deeply integrated with the visibility tools from Weights & Biases, and specifically around the experiences I've gotten from consulting production clients.

    [00:53:13] Alex Volkov: Incredible. Thank you, Jason. Thank you for joining us. And thank you folks who worked on the course together with you. I'm excited to see this. And again, the reminder, there's a bunch of free stuff there. There's a bunch of like knowledge just drops here. And hopefully I will be able to tap into this community and also build more things.

    [00:53:29] Alex Volkov: Go ahead, Nisten, and then we'll move on.

    [00:53:31] Nisten Tahiraj: Yeah, I just want to say that a lot of us here that got good at machine learning did so from just a random YouTube series. So the Karpathy series on building one from scratch, and the Full Stack (Full Stack Deep Learning, it's just pronounced like that), their LLM one from way back in April and March. So I'm really looking forward to this one, because learning from YouTube tutorials is actually extremely efficient.

    [00:53:53] Breaking News - HuggingFace announces a collaboration with Google

    [00:53:53] Nisten Tahiraj: But on that note, we have breaking news.

    [00:53:56] Alex Volkov: Wait, we have breaking news. Hold up. You know what this means.

    [00:54:11] Alex Volkov: Yes, Nisten, go ahead now.

    [00:54:14] Nisten Tahiraj: Philipp Schmid, who is a friend of the pod and has been here.

    [00:54:18] Alex Volkov: Here, yes.

    [00:54:18] Nisten Tahiraj: definitely. Yeah, he's a dev lead at Hugging Face, and he's also the one that did the integrations, I might be wrong, but the integrations with AWS Bedrock and also with Cloudflare Workers. Yeah, so now it looks like he's been working on doing an integration.

    [00:54:35] Nisten Tahiraj: with Google, where you'll be able to just take whatever models or fine tunes and stuff you have on HuggingFace and then use Google's infrastructure, use both their TPUs and NVIDIA H100s, they're advertising this, that Google owns, to continue training, fine tuning, serving, deploying stuff via HuggingFace.

    [00:54:55] Nisten Tahiraj: This is a very interesting move. Google's jumping in more on the open source side there. I don't know what this means, but this is a very interesting development.

    [00:55:06] Alex Volkov: I know what this means. This means that if Hugging Face ever becomes public, buy their stock. That's what this means. Hugging Face is literally embedded into the infrastructure of AI and definitely worth following. And the more integrations they have, the better it is for the open source community as well.

    [00:55:25] Alex Volkov: All right, folks. Thanks, Nisten.

    [00:55:26] Nisten Tahiraj: This is not financial advice, by the

    [00:55:28] Alex Volkov: Not financial advice, and they're also not public yet. Look, I don't think this moves the needle in terms of investing in Google,

    [00:55:36] Hourglass Diffusion Transformers deep dive with Tanishq Abraham

    [00:55:36] Alex Volkov: Alright folks, we're moving forward, and where we're moving forward is into diffusion mode, and I'm very excited to introduce Tanishq.

    [00:55:45] Alex Volkov: Tanishq, have you been here before? Remind me, please. I don't think you've been here on stage before.

    [00:55:50] Tanishq Abraham: I, I don't think I've been on stage

    [00:55:52] Alex Volkov: No. All right. So I'm very excited to have you here. Thanks. Thank you for joining us. So folks, one of the coolest things that came out in at least the research area from this week was this paper from.

    [00:56:03] Alex Volkov: From multiple authors, some of them friends of the pod, like Enrico, if you remember the chat about rope scaling we did with Enrico, he's on the paper as well. Katherine Crowson, who we should mention, I don't think she's been here, but we've talked about some of the stuff that she did. Stefan Baumann, Alex Birch, Tanishq, you're on there, Daniel Kaplan, and then Enrico, a friend of ours.

    [00:56:23] Alex Volkov: Tanishq has been a friend of the pod behind the scenes, you guys didn't know this, but we met at NeurIPS, so we've met before. Tanishq, do you mind introducing yourself just briefly for the audience who haven't met you or followed you so far?

    [00:56:34] Tanishq Abraham: Yeah, sure. My name is Tanishq. I am a research director at Stability AI and also CEO of MedARC, which is a medical AI research organization. I've also been involved with fast.ai, and I've been working on diffusion models for

    [00:56:48] Tanishq Abraham: I guess past year and a half or so. Yeah, so I do all kinds of stuff.

    [00:56:53] Tanishq Abraham: Generative ai,

    [00:56:53] Tanishq Abraham: medical ai. Yeah.

    [00:56:55] Alex Volkov: You also just like a briefly skipped over the fact that you got your PhD at 19, right? Is that correct?

    [00:57:01] Tanishq Abraham: Yes, that's correct. I got

    [00:57:02] Tanishq Abraham: it. That was last year. Yes,

    [00:57:03] Alex Volkov: So if folks in the audience don't know what this means that there's not many like 19 year old PhDs and Tanishq is one of them. And also we met once. I think a year and a half ago. And then the next time we met in Europe, I just remember every detail of our conversation. But that's beside the point.

    [00:57:17] Tanishq Abraham: yes.

    [00:57:19] Alex Volkov: Thanks

    [00:57:19] Tanishq Abraham: met at the Stability AI

    [00:57:21] Alex Volkov: Launch party. That was super cool. And since then, many things have changed, and I really want to talk to you about that area, right? So this paper, shout out to all the authors, because I'm looking at this and I've seen multiple folks share this paper. The paper is talking about high resolution image synthesis.

    [00:57:39] Alex Volkov: With something called Hourglass Diffusion Transformers. And I will pin your great thread about this here on top of the space, and it will be in the show notes. Could you briefly tell us the problem this tries to solve? And then we're going to go into actually how this kind of approaches how to solve this.

    [00:57:57] Tanishq Abraham: Yeah, definitely.

    [00:57:58] Tanishq Abraham: Yeah. So first of all, I should of course preface this by saying it's mostly

    [00:58:01] Tanishq Abraham: Kat's genius work here. And we were just lucky to be able to help her on this project. But yeah, just to get started.

    [00:58:06] Alex Volkov: Just one tiny second, because it's worth a shout out. So Kat, by Kat you refer to Katherine Crowson, right? And if folks have ever used Stable Diffusion before, either in Automatic1111 or whatever, and you [00:58:20] choose anything with a k- prefix, that's this Katherine, right?

    [00:58:24] Alex Volkov: K-diffusion is her area. She's an incredibly prolific person in this area. I don't know many facts about her, but everybody who I talked to from this paper, including Enrico, refers to Kat, that's her work. So a huge shout out to Kat, and yeah, go ahead, please.

    [00:58:40] Tanishq Abraham: Yeah, she was one of the original AI art people, so she helped start the field in a way. Anyway,

    [00:58:46] Tanishq Abraham: To To provide some context of

    [00:58:48] Tanishq Abraham: what this paper is about the idea is that, if you want to do like high resolution generation, so think like 1024 by 1024 the typical approaches these days utilize some sort of multi stage approach, like the most common one, like stable diffusion, is this sort of latent diffusion where you have to encode it in with some sort of auto encoder into some latent space and you're doing diffusion on the latent space and you're not actually doing it on the actual pixels.

    [00:59:15] Tanishq Abraham: And so that comes with some disadvantages. For example, I don't know if people have been doing things like image editing with Stable Diffusion, but you realize you don't have a whole lot of fine-grained control at the actual pixel level.

    [00:59:30] Tanishq Abraham: It's difficult to do that because it's happening in the latent space rather than at the pixel space. So there are various different things where like it has its own challenges. Of course, like latent diffusion has a lot of different advantages too, but you know for some applications it may not be ideal.

    [00:59:44] Tanishq Abraham: And then on top of that, the other aspect that we wanted to look into, basically, was the fact that we're seeing people move towards transformer models for diffusion as well. And of course, in the past, most of the diffusion models have been with a U-Net architecture, a convolutional U-Net.

    [01:00:02] Tanishq Abraham: Also, Stable Diffusion uses a convolutional U-Net. But there have been a lot of papers examining the use of transformers. And, of course, the nice thing about transformers is people know how to train them, they're quite scalable, so people would rather use transformers for diffusion over something like a U-Net.

    [01:00:18] Tanishq Abraham: But again, the problem is that so far, it's mostly only been applied to the latent diffusion scenario, mainly because it would be very hard to do this at pixel scale because of the quadratic complexity of attention. So if you wanted to scale up to higher resolution, with the number of pixels you're going to have quadratic scaling.

    [01:00:40] Tanishq Abraham: So it would be very difficult to train this with, I guess, enough resources or whatever. So the problem that we're trying to solve is: what can we do to resolve the quadratic complexity of the transformer architecture that allows us to then train a diffusion transformer in pixel space?

    [01:00:58] Tanishq Abraham: So that's what the hourglass diffusion transformer tries to address.
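    Editor's note: to put the scaling argument into symbols (a back-of-the-envelope comparison, not the paper's own notation), for N image tokens and hidden size d:

```latex
% Global self-attention over N tokens vs. windowed local attention (window w).
% With w and d fixed, the local variant grows linearly in N.
\text{global: } \mathcal{O}(N^{2} d)
\qquad
\text{local, window } w: \; \mathcal{O}(N\,w\,d) = \mathcal{O}(N)
```

    As Tanishq explains later, the hourglass design keeps the cheap windowed attention at the full-resolution levels and reserves regular global attention for the heavily downsampled middle of the hourglass.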

    [01:01:02] Alex Volkov: Thank you for the brief introduction. For I will try to recap as a way I understand this. So folks who are not machine learning scientists in the audience would be able to follow along. But basically Gen AI, this whole wave of Gen AI has two, two big infrastructures so far, right?

    [01:01:15] Alex Volkov: The diffusion side, the Stability AI image models and video models, they're based on diffusion, or as you said, latent diffusion, correct? And then there's the LLM area, basically based on transformers. And we've seen a bunch of stuff going back and forth in techniques between them, right?

    [01:01:31] Alex Volkov: So LoRA, I think, is a thing that many people in the diffusion area used, training LoRAs on different concepts. And then obviously fine-tuning with LoRAs became a thing in LLMs, and back and forth, we've seen different approaches. I think you said the open source area in LLMs, in transformers specifically, has a bunch of super cool tricks and optimization techniques, flash attention, different things, right?

    [01:01:54] Alex Volkov: There's a bunch of stuff that people developed in one area that wasn't necessarily applicable to diffusion models. And so you guys set out to try and unify those two, or at least use some of the tricks, and it looks like you

    [01:02:09] Alex Volkov: succeeded to an extent. Yeah. Go ahead, please.

    [01:02:12] Tanishq Abraham: Yeah, I think that's about right. Now that we have this transformer architecture, we can try to apply some of the tricks that people have been using, things like RoPE embeddings; there are other tricks like RMSNorm. These are the sorts of tricks, for example, that are used in the Llama architecture, these similar architectural decisions, and you can take those best practices and try to see if they help with diffusion now.

    [01:02:33] Tanishq Abraham: So yeah, I think that's the idea. And that's another interesting thing about our paper: people were exploring diffusion transformers, but they were using fairly old architectures for them. And here we're also trying to apply all these tricks that we see.

    [01:02:47] Tanishq Abraham: People are applying them in the LLM space, and we're trying to apply that to diffusion. Yeah, that was also an important part of our paper as well.

    [01:02:54] Alex Volkov: And of course, you mentioned RoPE, and I want to shout out a friend of the pod, Enrico, from Nous Research. Wait, I don't actually remember if Enrico is part of Nous Research. Maybe, so he and Nous Research worked on the rope scaling paper together. And for folks who are interested in hearing about RoPE, we had a deep dive during the summer, one of the coolest episodes.

    [01:03:12] Alex Volkov: Most of it back then went above my head, but it was super cool going back there and saying, hey, oh, I learned this. RoPE is basically a way to extend context windows and do a bunch of other things for transformer-based large language models. And I wonder, how does RoPE come into play here? And Enrico is part of the authors here on the paper.

    [01:03:29] Alex Volkov: So he contributed at least part of that work, I assume. Enrico?

    [01:03:34] Tanishq Abraham: Yeah. I think the rope stuff is something where we haven't fully explored the full potential yet, I think. But at least for what we were doing, we saw improvements in performance just using rope over other sorts of position embeddings.

    [01:03:50] Tanishq Abraham: But yeah, I think there's definitely potential for allowing the model to handle larger resolutions or do things like this because of the rope embeddings that we have in the model. Yeah it's, I think, also meant for future work.
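    Editor's note: for listeners unfamiliar with the "rope embeddings" mentioned here, this is the standard rotary position embedding formulation (general background, not notation from this paper): each pair of feature dimensions is rotated by an angle proportional to the token position, so attention depends on relative rather than absolute positions.

```latex
% Standard RoPE: at position m, rotate each feature pair (x_{2i}, x_{2i+1})
% by the angle m\theta_i, with \theta_i = 10000^{-2i/d}.
\begin{pmatrix} x'_{2i} \\ x'_{2i+1} \end{pmatrix}
=
\begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix}
\begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix},
\qquad \theta_i = 10000^{-2i/d}
```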

    [01:04:02] Alex Volkov: Incredible. You guys use all these techniques. You introduce, or I guess start formally announcing, this concept of diffusion transformers, which is the mixture of these two things. And what are some of the results that you get? You've trained a few models to test.

    [01:04:15] Alex Volkov: How do you even measure that you're getting better performance? Are you just looking at algorithms, or are you actually generating images? Can you talk us through the process of validating these theories and papers?

    [01:04:26] Tanishq Abraham: Yeah, but I just want to, I guess, take a step back to clarify: we didn't necessarily invent the concept of diffusion transformers. That is something that people have already developed, but the idea that we focus on here is that, in the past, diffusion transformers were done in the latent space because of this quadratic complexity.

    [01:04:45] Tanishq Abraham: So we basically have a different type of transformer architecture, which is this hourglass transformer, that enables O(N) scaling, so linear complexity. So it will scale with the number of pixels much better; it won't blow up like you have with the quadratic complexity of attention.

    [01:05:07] Tanishq Abraham: So that was the main trick that we're using. So we have some tricks in there. That allow it to have that property. And that's what enables us to do it on the pixel space, as opposed to the latent space that the previous diffusion transformers were doing. And then on top of that, we are adding all these additional transformer tricks, which no one had tried out before with diffusion transformers.

    [01:05:27] Tanishq Abraham: So those are the main contributions of this paper. And yeah, the other thing worth mentioning is that the way this architecture is able to do this is partly because it is a very hierarchical architecture.

    [01:05:45] Tanishq Abraham: So it's actually able to process at different image resolutions. For example, at the high resolutions we use this sort of local attention, which is what gives it the linear scaling, but then at the low resolutions we were able to do the regular attention.

    [01:06:01] Tanishq Abraham: Yeah, there's also this hierarchical processing of the image resolution. That's also, I think, an important point, which also enables higher fidelity generation. And yeah, in terms of testing the

    [01:06:13] Alex Volkov: Yeah. And so the next question is how do you actually like test the architecture? How do you validate these like approaches that you tried actually better than what the field has previously been at?

    [01:06:26] Tanishq Abraham: Yeah. We looked at two datasets. One, we did ImageNet generation, so class-conditional ImageNet generation. So that is, passing in an ImageNet class, you generate images of that class. So if you pass in a zebra [01:06:40] class, you're generating zebras, or if you pass in some sort of dog class, you generate dogs.

    [01:06:43] Tanishq Abraham: So we train a model for that. We train it at a resolution of 256 by 256, and that's one of the experiments where we compare to other architectures. And the interesting thing is that, of course, we're comparing to other architectures that are using, for example, latent diffusion, where the architecture is functioning on the latent space and not on the pixel space, but we have our architecture functioning on the pixel space, using this hourglass transformer, and it's getting better results than the ones using the latent space.

    [01:07:19] Tanishq Abraham: We're beating, for example, the previous Diffusion Transformer model, which was using the latent space. And then another interesting dataset that we used was FFHQ, which is this dataset of high resolution faces. This is at a 1024 by 1024 resolution, and so it's very difficult to train on, especially in pixel space, at a scale of 1024 by 1024.

    [01:07:47] Tanishq Abraham: And actually, there are not many other diffusion models that are trained on this dataset. There are a bunch of GAN models, for example, but not really many diffusion models. There's only one or two that we actually found in the literature, because it can be a bit difficult because of the

    [01:08:01] Tanishq Abraham: pixel scale, or the resolution of the images. But yeah, we managed to train a model with our architecture. It trains quite fast. And at this point, I guess, we would basically be the best diffusion model for that dataset.

    [01:08:18] Tanishq Abraham: And we are measuring with FID. But of course, FID as a metric also has its problems. It does have some bias towards GANs, and so GANs tend to have a lower FID because of that bias. So when we look at it qualitatively, honestly, we think it's quite comparable to the GANs, might be better than the GANs, honestly.

    [01:08:41] Tanishq Abraham: So we may do more evaluations and study that further. But honestly, this may be one of the state of the art models for the FFHQ dataset, though it's a bit hard when you're using FID as a metric. But that's of course the problem: everyone's using that metric in the literature. But yeah, I think that, again, that's another really interesting result that we observed.
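    Editor's note: since FID is the headline metric here, for reference this is the standard Fréchet Inception Distance between the Inception-feature statistics of real (r) and generated (g) images; lower is better:

```latex
% Means \mu and covariances \Sigma of Inception features for real vs. generated images.
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^{2}
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,\bigl(\Sigma_r \Sigma_g\bigr)^{1/2} \right)
```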

    [01:09:01] Tanishq Abraham: And then, of course, we do

    [01:09:02] Alex Volkov: I want to follow up with a question here real quick, for folks who find it hard to follow much of this but have used something like Stable

    [01:09:09] Tanishq Abraham: oh, sorry.

    [01:09:10] Alex Volkov: No, that's all great. This is all recorded. Folks can like pause and go to, and go research and come back and listen to you.

    [01:09:15] Alex Volkov: This is great. You did the deep dive, I really appreciate it. I just want to bring this back up a little bit, towards how this will

    [01:09:21] Unknown: Sure.

    [01:09:22] Effects on the industry from Hourglass Diffusion Transformers

    [01:09:22] Alex Volkov: affect the industry, given that we have stuff like Stable Diffusion out, and that keeps getting better. Midjourney is getting reality-adjacent to the point where it's really hard to distinguish, and there are different upscalers that take the outputs and then run some upscaling. How does this affect the industry, in your mind?

    [01:09:40] Alex Volkov: Will this accelerate some stuff? Will this be applied to different areas that diffusion models have not traditionally been in? Let's say this is a building block that you've created. How does this affect us in three, six months?

    [01:09:54] Tanishq Abraham: Yeah, I think this is just a new, unique direction to explore. Of course, I think latent diffusion is still a very interesting and valuable direction, but it's always good to have different directions to explore. And honestly, this architecture can be applied to latent diffusion as well, and maybe we get even better results; for example, we could maybe do multi-megapixel level synthesis by combining this method with latent diffusion or something like that as well.

    [01:10:23] Tanishq Abraham: So it's not even limited to just the pixel space. That's what we're showing, and that's something that is interesting about this. But again, it can also be applied to latent diffusion, and of course these models could be scaled up further. There's a whole lot of future work to explore here, I think.

    [01:10:39] Tanishq Abraham: And yeah, it's computationally efficient. And I think the nice thing is moving towards the transformer architecture, because people understand the transformer architecture at this point. People understand how to scale it and the different tricks.

    [01:10:55] Tanishq Abraham: And I think, by introducing this architecture, this is a good way for us to try to bring some of those advances in transformers into the diffusion model field as well. So I think that's the other interesting aspect of this.

    [01:11:12] Alex Volkov: For me, reading this as someone who is not a machine learning scientist, the highlight was how the open source community moves between different areas, bringing over some of the learnings, bringing over some of the talent, the tooling around making things available.

    [01:11:28] Alex Volkov: And I think that's very exciting. We also have Alex Birch, is that correct? He's also in the audience, so shout out Alex. And then what else have we not covered at this stage? What is the last thing that you want to say? Or maybe shout out some of the co-authors, feel free, the stage is yours.

    [01:11:44] Tanishq Abraham: Yeah, I'm just looking at some comments that Alex has posted. He thinks, for example, that with this model there's potential to achieve more realistic textures than even Midjourney. So I think we have observed that with the model, because that's the thing about using latent diffusion: when you're not doing it at the pixel level, it's a bit

    [01:12:07] Tanishq Abraham: difficult to get those textures accurately, but if you're doing it at the pixel level, I think you're able to get those textures much better. And we've observed that with the models that we've been training. And yeah, I definitely agree with Alex there.

    [01:12:22] Tanishq Abraham: Yeah, I think it also may have the potential to achieve really realistic textures, and that's something that I guess we can look forward to, hopefully. Yeah.

    [01:12:31] Alex Volkov: That's incredible, because I think the realism comes from the imperfections, especially textures and skin, et cetera. And diffusion models, at least for many folks, are easier to identify by the smoothness of edges and different things. So definitely, more textures are there for humans in real pictures.

    [01:12:50] Alex Volkov: And then we're looking forward to more of that in diffusion models. That's incredible. So definitely, thank you for breaking this down for us, Tanishq. Thank you, and Katherine and Alex and Enrico and everybody else who worked on this. I think we have some questions from folks on stage here. Vik, go ahead, please.

    [01:13:05] Vik Hyatk: Yeah, another question.

    [01:13:06] Vik Hyatk: I just wanted to say I played around with the repository a bit. It's a great way for anyone interested in getting into diffusion models to get started. It's not your typical research code base. It's super clean.

    [01:13:19] Vik Hyatk: You're not going to run into a bunch of dependency issues and whatnot.

    [01:13:22] Vik Hyatk: So that

    [01:13:23] Vik Hyatk: was amazing. It's also super compute efficient, so you don't need a ton of compute. To start to see good results. I'd strongly recommend checking it out if anyone was feeling intimidated

    [01:13:32] Vik Hyatk: before,

    [01:13:32] Vik Hyatk: don't be.

    [01:13:34] Alex Volkov: Incredible.

    [01:13:35] Tanishq Abraham: Yeah. That comes down to, again, Kat's genius. This is a code base that she's been working on for quite some time, and I also really enjoy working with it.

    [01:13:42] Tanishq Abraham: It's like one of my favorite diffusion model code bases. So I definitely agree that anyone who's interested in playing around with diffusion models should check it out.

    [01:13:49] Alex Volkov: So that's on Kat's GitHub, we're going to add this in the show notes, it's called k-diffusion, correct? It's now

    [01:13:55] Alex Volkov: part of that existing code base, but now with the Hourglass Diffusion Transformer. Get used to saying Diffusion Transformers from now on, folks. Hourglass Diffusion Transformers, HDITs, are now a thing.

    [01:14:06] Alex Volkov: And Tanishq, thank you so much. And Alex, for joining in from the comment area. And thank you for working on this. Hopefully this will get the recognition it deserves, and definitely serve as a foundational block to get us higher performance, lower hardware requirement models that look way better.

    [01:14:22] Alex Volkov: Incredible.

    [01:14:23] Open source models in medical fields

    [01:14:23] Alex Volkov: Tanishq, I wanted to follow up with you, because MedARC is the medical research org that you're now CEO of, and you had a tweet today that I really wanted to talk to you about, specifically because Qwen was involved, and we have folks from Qwen, usually friends of the pod as well, who join us. Could you,

    [01:14:37] Alex Volkov: let's talk through this please, let's talk through how open source is catching up in the medical space.

    [01:14:42] Alex Volkov: Could you briefly summarize the recent work from you guys?

    [01:14:46] Tanishq Abraham: Yeah. Sure. Yeah. I've been

    [01:14:48] Tanishq Abraham: quite busy with all kinds of different research projects. So that was another ongoing research project that we're working on at MedARC, and I shared some progress of that this morning. So basically, at MedARC, we're of course interested in [01:15:00] developing open source medical language models.

    [01:15:03] Tanishq Abraham: So that's something that we're heavily interested in. And of course, in order to be able to do that, we wanted to understand what the current capabilities of these open source language models look like, and no one had done a very proper analysis of this as far as I could tell. So basically, what we did is we added this suite of tasks known as MultiMedQA.

    [01:15:24] Tanishq Abraham: It's a suite of a bunch of tasks, a total of nine tasks, that came from different papers, but Google put them together as their evaluation benchmark. This is the evaluation benchmark that Google was using to evaluate their MedPaLM models and whatever models they had.

    [01:15:44] Tanishq Abraham: And then the medical AI community has been using that. It's been used to evaluate GPT-4

    [01:15:49] Tanishq Abraham: and all kinds of

    [01:15:50] Tanishq Abraham: other models as well. And yeah, at MedARC we added it to the LM eval harness. That's the common evaluation framework for open source language models.

    [01:15:59] Tanishq Abraham: Everyone, I think, uses LM eval harness to evaluate models on various tasks. So now it's in there, and people can easily evaluate whatever models they have on these medical tasks. And once we added it into LM eval harness, we just wanted to do a comprehensive analysis of a whole bunch of models in the open source space, just to see how these sorts of generalist models do.

    [01:16:21] Tanishq Abraham: They're not necessarily particularly trained on medical data. Of course, they've probably seen some in their pre-training or whatever, but that's not their main purpose and that's not their main focus in their pre-training. And I was just curious what their performance would look like and how it compares to other models like GPT-4.

    [01:16:36] Tanishq Abraham: GPT-4 is also a generalist language model. It's not necessarily trained on medical data either, but it's really good at this. In fact, prompt-engineered GPT-4 is state of the art on this benchmark, actually.

    [01:16:48] Alex Volkov: I remember this. I remember where Google came up with a specific medical model, and then GPT-4, basically with prompt engineering on that benchmark, became the top one, right? This was quite incredible, that the most generic

    [01:17:00] Alex Volkov: model we have. Yeah,

    [01:17:02] Tanishq Abraham: Yeah, that's the one, it's called MedPrompt. That's the state of the art, this prompt-engineered GPT-4, it's called MedPrompt. And so they do a whole bunch of tricks like dynamic few-shot and GPT-4-written chain of thought, all kinds of tricks that they throw at GPT-4, and they got state of the art.

    [01:17:18] Tanishq Abraham: And then of course they used the same tricks to later claim that GPT-4 is better than Gemini as well. It's not just for medicine that you can use it; they use it for general prompt engineering as well. But yeah, anyway, the point is I wanted to evaluate how the open source models do on this benchmark.

    [01:17:38] Tanishq Abraham: And so I evaluated a whole bunch of models. I evaluated Llama, Mistral, Mixtral. I evaluated the Yi series of models. I evaluated Qwen. Yeah, so I evaluated a whole bunch of models here, and basically what I found out is, first of all, Llama 2 is not that great compared to all these other models, actually. And it's interesting, because in the literature people are still fine-tuning Llama 2 for medical purposes, but it actually doesn't have a very good base capability for medical knowledge.

    [01:18:09] Tanishq Abraham: So Llama 2 is not very good at medical stuff, but the models that are quite good are basically the Yi series of models, so Yi 34B is really good, as well as the Qwen series of models. Qwen 72B is the state of the art open source model here, and this is not with any sort of prompt engineering or anything like that.

    [01:18:28] Tanishq Abraham: This is just five-shot prompting, and it's beating MedPaLM version 1. MedPaLM version 1 was released in November of 2022, and that was the first model Google had there. And Qwen 72B is beating MedPaLM 1 without any sort of prompt engineering or any of these tricks.
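    Editor's note: for a rough idea of how a run like this looks with the lm-evaluation-harness Python API, here is a sketch. The MultiMedQA task identifiers below are assumptions (check the harness's task list for the names MedARC actually registered), and evaluating a 72B model obviously needs serious hardware.

```python
# Sketch of a 5-shot medical-benchmark run with lm-evaluation-harness.
# Task names are assumptions; list available tasks with `lm_eval --tasks list`.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Qwen/Qwen-72B,dtype=bfloat16",
    tasks=["pubmedqa", "medqa_4options", "medmcqa"],  # MultiMedQA-style subset
    num_fewshot=5,  # matches the 5-shot setting described above
)
print(results["results"])
```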

    [01:18:50] Tanishq Abraham: And yeah, I think that's really, honestly, quite impressive because

    [01:18:54] Alex Volkov: Yes.

    [01:18:55] Alex Volkov: I want to shout out Junyang Lin, or Justin Lin, a friend of the pod and the technical lead working on Qwen, for such an incredible achievement. And thank you for testing this. Because, Nisten, you worked on AI in medicine as well, and we've been waiting for this to happen.

    [01:19:11] Alex Volkov: Want it or not, there are several doomers that say, hey, never trust an AI doctor, but many people already go to ChatGPT to maybe get a second opinion. And Google has obviously been working on this with MedPaLM and MedPaLM 2.

    [01:19:22] Alex Volkov: I think for many people it's going to be easier to digest this idea if the model that talks to them fully runs on their computer, open source, no internet, no data sharing.

    [01:19:33] Alex Volkov: I think that's a very important piece of this as well. And it's great to see that, we're now getting like some cool comparison, but definitely open source is coming strong on this one.

    [01:19:42] Unkown: Yeah.

    [01:19:43] Nisten Tahiraj: Yeah. I had the same thing as Tanishq with the Llama models: you can train them on good medical data, but they don't perform great at the base. I'll tell you, GPT-4 is still king when it comes to this. And the product I worked on last year in March is still going, Dr.

    [01:20:04] Nisten Tahiraj: Gupta.ai is still going. It's just a very well prompt-engineered doctor product with a good RAG system too; it was one of the first. But I will say the main concern now, and why I think open source will basically completely dominate medical AI, is that if they're dependent on some kind of API endpoint, that makes hospitals' and people's medical data really vulnerable to malware and foreign intelligence groups, which have been wreaking havoc with medical data and ransomware.

    [01:20:42] Nisten Tahiraj: So that's their main concern. And the only way we're going to solve that is by having models that they run locally. So I'm really glad Tanishq actually took on the task of benchmarking some of these, because you have the entire medical safety field with all the funding and all the people, and I have yet to meet an AI safety person that even knows how to rename a file in Linux, let alone actually write some kind of benchmark.

    [01:21:07] Nisten Tahiraj: So I'm glad someone's actually taken on the challenge of making open medical yeah, medical LM benchmarks.

    [01:21:19] Tanishq Abraham: Yeah, I completely agree. I definitely think open source is the future for medical AI and medical LLMs. And I think hospitals and doctors will be more comfortable when they know they have access to the model, and this is the model that they're using, rather than when it's behind some API. Not only in the case of malware or things like this, but OpenAI

    [01:21:40] Tanishq Abraham: will just change the model or something like that too. These are all concerns that we see already happening with the models that OpenAI has. There needs to be complete transparency when working with these kinds of more crucial applications.

    [01:21:55] Tanishq Abraham: And, by doing all this open source I think that that provides that transparency that doctors and hospitals and healthcare systems will be comfortable with that. That's why I'm really excited about working in this area. And I think there's really a lot of potential here.

    [01:22:09] Alex Volkov: Incredible. Thank you for this work, Tanishq. Thank you for bringing us the answer to which of these models performs best. Surprisingly, Qwen. If you gave me all the models that we've talked about, I wouldn't have assumed that Qwen was the most performant, but hey, we'll take what we can get.

    [01:22:22] Alex Volkov: Qwen 72B, the best open source doctor, folks. You heard it here, based on this research.

    [01:22:30] Tanishq Abraham: Yeah. Thank you for letting me share all this work.

    [01:22:32] Alex Volkov: That's incredible. And as a friend behind the scenes, and now a friend of the pod, you're always welcome. Thank you for the deep dive on the Hourglass Diffusion Transformers, and thank you to the authors as well. Alex, I think, is still in the audience, and Katherine and Enrico and some other folks. And definitely, for MedARC, keep us up to date.

    [01:22:48] Alex Volkov: We'll keep reporting, and the stage is yours whenever you want it. I think, folks, we're moving forward. I think, Nisten, unless you have... or sorry, Tanishq, do you have one last thing you want to

    [01:22:57] Tanishq Abraham: I would just say please follow, first of all, all of our Hourglass Diffusion authors. They all deserve your support. And please follow MedARC as well.

    [01:23:06] Alex Volkov: 100 percent worth following, and it will definitely be in the show notes for folks who are listening to this while driving and cannot click that follow button. As we're an hour and a half into the space, let me reset [01:23:20] this a little bit for folks. If you just recently joined us, you're listening to ThursdAI, where we talk about everything.

    [01:23:26] Alex Volkov: Everything incredible and interesting in the world of AI, open source, LLMs, and the big companies we cover. And we also had a deep dive today about vision and video. My name is Alex Volkov. I'm an AI evangelist with Weights & Biases. And yeah, we're here every week, and we keep up to date so you don't have to. So if you were off Twitter, or if you don't even participate in Twitter and you're just listening to this on the podcast, we've got you; we're going to cover everything that's most important and send it to you, so definitely check it out.

    [01:23:52] Alex Volkov: There's thursdai.news for that. And I think we're moving towards the big companies area, which we haven't touched. We briefly covered in the breaking news that Hugging Face just announced a partnership with Google, so you'll be able to very easily run the models from Hugging Face on TPUs and the NVIDIA GPUs that Google has, which is incredible because Google has those but doesn't even give them away.

    [01:24:15] Alex Volkov: I think they're all reserved for Colab or something. But also, everything that I have today in the big company LLMs and APIs section is from Google.

    [01:24:25] Google teases LUMIERE, SOTA video generation models

    [01:24:25] Alex Volkov: So the next thing that we're going to talk about is Lumiere. And I don't know if you guys saw the video, but I definitely saw the video. I think, Far El, you sent this in our group chat first, but by that time it was already spreading around.

    [01:24:37] Alex Volkov: So there's obviously the whole area that we've talked about: Stable Video Diffusion releases very short videos, image to video and text to video. And then there are the front runners in the closed source, which are Runway and Pika. And there's another one, Firework, and oh, Leonardo is doing some incredible things.

    [01:24:54] Alex Volkov: All of them have very short videos, and the consistency between the frames is not incredible. And Lumiere has shown a video, and sure, you could say this could be very cherry-picked, et cetera, but it feels like this is another step in this direction that's significant.

    [01:25:13] Alex Volkov: And for folks who have not watched the video yet, it's definitely worth watching. I'm going to add it; it's already at the top of the space. Basically, you see they announced a bunch of stuff that Lumiere can do besides just generation. So video inpainting is one thing that they've announced.

    [01:25:28] Alex Volkov: They announced text to video, image to video, inpainting. And they have something they call realistic, diverse, and coherent motion, specifically around the motion of the characters, which has been lacking in all these video synthesis models. I will say it's

    [01:25:44] Alex Volkov: pretty remarkable to even discuss that, oh, this text-to-video model is not as good as that one. It's really incredible that we're at this point where we can say, all highbrow, oh yeah, I prefer this output. We're typing text and getting a video back.

    [01:25:59] Alex Volkov: It would have sounded ridiculous to us a year and a half ago that this would even be possible. But with that said, we're moving forward. Hedonic adaptation is a thing. We're getting used to these tools, we're getting them day to day, and then we're like, okay, yeah, this tool is better.

    [01:26:15] Alex Volkov: They said that existing video models synthesize distant keyframes followed by temporal super resolution, and that's probably what makes temporal consistency difficult to achieve. Temporal consistency is basically about what characters do throughout the video.

    [01:26:30] Alex Volkov: And so you've all seen these videos where the face changes from frame to frame, et cetera. And this series of videos from Lumiere looks very consistent, spatially and temporally. Definitely in terms of where the characters are in the video, but also throughout time. And they attribute this to the different methods that they've used.

    [01:26:53] Alex Volkov: I will not go into those, but I think the tasks are very interesting. They have video editing applications, image to video, inpainting, and stylized generation. Something I also liked: you'd be able to take an image and then generate videos based on that style, not necessarily that image. So very impressive from the folks at Google, as always.

    [01:27:08] Alex Volkov: I haven't played with this, and I don't think there's a way for us to play with it yet. So there's a paper, and maybe some of the ideas in the paper could be reproduced in open source, but it's a model shown in a paper from quite a lot of folks: Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, and a bunch of other folks on the paper.

    [01:27:25] Alex Volkov: A very visually appealing demo as well, so we'll definitely add this video to the show notes. And I think we have one more thing here in diffusion stuff. Yes, the one last thing that I wanted to talk about is InstantID. So we moved off from Lumiere; Lumiere is super, super cool, but we haven't gotten to try it.

    [01:27:43] Alex Volkov: Hopefully they release it; Google has a track record here, they have examples of when they released stuff, like when DreamBooth was released and everybody was using it. And I think that's pretty much it in the big companies and open source.

    [01:27:55] InstantID - 0-shot face transfer diffusion models

    [01:27:55] Alex Volkov: The other thing that I wanted to mention is InstantID. We've mentioned this briefly before, but it's been pretty much everywhere on my timeline. If you haven't played with this, I very strongly encourage you to, because InstantID is a technique to create diffusion model generations with your face.

    [01:28:11] Alex Volkov: And we've all probably tried this at once, with, like I said, DreamBooth from Nathaniel Ruiz, who's a dear friend of the pod and has been here a couple of times. There are other techniques to transfer your face into a latent diffusion model as well, and they all used to take multiple images of your face and some amount of training.

    [01:28:32] Alex Volkov: And Instant ID is basically a technique that you can try right now, super quick. With zero shot, one image. You can generate images with your face, or with your kid's face, or whatever. And literally I just want to highlight how impressively fast we're moving towards these type of tools. This used to take fine tuning.

    [01:28:52] Alex Volkov: This used to take GPUs and knowledge, and tools like Kohya; this used to take LoRAs, and before LoRAs, DreamBooths. There are actually a couple of companies that I know of that built on top of providing the fine-tuning experience around this, where you upload images and you get this huge, four gigabyte Stable Diffusion file specifically trained on you as a concept.

    [01:29:13] Alex Volkov: And now there's a zero-shot transfer thing called InstantID, with a Hugging Face demo included here, which I will attach soon. You just upload one image of yourself. Literally, for me and Nisten and Tanishq and Umesh, for the non-anons here on stage, we'd be able to use our profile pictures here and just generate us with a cowboy hat in noir style, and it will look like us.

    [01:29:36] Alex Volkov: Most of the time, anyway. I've tested InstantID on my kids, and I'm not going to post it because of privacy, but my kid loved it incredibly. He was Superman, and it looked like him. It's unbelievable that it was able to transfer this with one image. It's quite incredible how fast we've moved here.

    [01:29:52] Alex Volkov: Definitely, if you haven't tried InstantID but you have tried avatars before, try InstantID, you'll be blown away. It runs on your Mac as well, not that great, but it runs through Pinokio (pinokio.computer). Definitely worth noticing how fast we're moving in this generation. And shout out to whoever built this.

    [01:30:08] Alex Volkov: And there's quite a few technologies like this now. Highlighting how fast we're moving, and I think that's pretty much it.

    [01:30:15] Voice and Audio - New tech challenges Whisper

    [01:30:15] Alex Volkov: So we've covered our diffusion stuff. Let's move to voice and audio. Nisten, you brought us this news, so I definitely want you to pull up the tweet and let's talk about the faster encoder ASR.

    [01:30:25] Alex Volkov: And then, while you maybe pull this up, I will say that this week ElevenLabs announced a big funding raise, but ElevenLabs also released their dubbing studio. And if you've followed Twitter at all, not even the AI Twitter, for the past week and a half, two weeks, you may have seen the dubbed video of the Argentinian prime minister, or I don't know if he's a prime minister or president, probably president, right?

    [01:30:55] Alex Volkov: Yes, president. Milei. He went to the World Economic Forum and gave a speech in Spanish. And then there was a dubbed version; at these global summits of world leaders, they have

    [01:31:14] Alex Volkov: instant translation in their ear to any language, done by a human who knows both languages. And then somebody said, hey, okay, this is one example, and they posted a HeyGen version. If you remember HeyGen, we've talked about it, a quite incredible translation, dubbing, and lip-syncing service where you can upload yourself and get an instant avatar. Somebody used HeyGen on the whole speech.

    [01:31:29] Alex Volkov: And that went ridiculously viral. I think there were like 50 million views on it on X. And that was mostly a combination of [01:31:40] Milei being very viral in his opinions, stoking some controversy, but also because you literally hear the person speak in English with a Spanish accent, where this didn't happen; he literally spoke in Spanish.

    [01:31:52] Alex Volkov: Quite incredible technology, and people have been shocked and said, oh my God, this is coming for all of us in deepfakes. Fine, we've talked about this multiple times. So ElevenLabs now has an alternative to this, called ElevenLabs Dubbing Studio. And I've actually used it on a piece of content, a trailer for ThursdAI of me speaking in English, and I asked it to dub me into Russian, a language that I do speak, my mother tongue from Ukraine, and it sounded ridiculously cool.

    [01:32:18] Alex Volkov: Here's a quick snippet of me from a Thursday I show with you three weeks ago that I dubbed into Russian for your entertainment.

    [01:32:28] Gadget for children, for parents who have children who do not want to buy iPhones. Because then Instagram will destroy their brains. This is the perfect device for this.

    [01:32:36] It looks like a language. In fact, you can talk to a rabbit, it is very cute, there is one simple interface, this is a voice.

    [01:32:43] Alex Volkov: So far, how should I say, these models that emulate voice did not work on me. Specifically, my accent is not that great, but because my accent is probably Russian, the Russian version of me sounded really close to me.

    [01:32:54] Alex Volkov: For the first time, I was like, oh, okay. All right. And ElevenLabs released this dubbing studio, and hopefully these models are now coming to open source.

    [01:33:04] AI deepfake of Biden caused controversy on mass media about AI

    [01:33:04] Alex Volkov: Because there's also a thing where, I think, there's a recording of Biden saying something like "stay home" going around, and everybody in the media is making a big fuss about, oh my God, AI is coming for all of us.

    [01:33:15] Alex Volkov: And there's a big cry for folks to say we should build tools to detect this, et cetera. And my stance remains the same. Listen, I think we've talked about this multiple times: the only way through these woods is for everybody to know that their voice can very easily be faked with three or ten seconds of audio.

    [01:33:31] Alex Volkov: It's time for humanity to adapt to the situation. There's no panacea here. You should just know that trusting voice blindly without knowing the source, just don't do that, because it might as well be fake. I don't know if you want to add anything.

    [01:33:44] Alex Volkov: Yeah, go ahead.

    [01:33:45] Nisten Tahiraj: Really quick, I want to say we already have laws to deal with this. More law is not necessarily going to fix the issue, because fraud is illegal in a free market. And if you want, or at least for people that are more in politics and stuff, if you want to solve the issue, do the job you already have.

    [01:34:05] Nisten Tahiraj: You already have a list of spam callers, which have been identified without an AI. Can you shut them down? People love to imagine problems and love to think of doom or whatever in the future, and then they completely ignore the stuff in front of them. All of us do this, but yeah, again, fraud is illegal.

    [01:34:27] Nisten Tahiraj: Can you shut it down, as a government, as your job? You don't need a new law, you don't need to make speeches about AI. You just need to shut down fraud when it's identified. Otherwise, all of these tools and conferences and stuff are pointless.

    [01:34:42] Alex Volkov: As predicted.

    [01:34:43] Nisten Tahiraj: that's what I'm gonna

    [01:34:44] Alex Volkov: Yeah, no, that's great. As predicted, the first election-related deepfake type thing arrived, the media was all over it, and the doomers were like, here we go. And people were like, it came sooner than we thought. And no, we've literally been talking about this for the past year.

    [01:34:57] Alex Volkov: That elections are coming and these things are going to happen. The technology was there even before; now it's just a little bit more accessible. The laws are in place. Make it more difficult for grandmas to get spam calls, not for the open source stuff. So hopefully, the more prevalent these technologies are, and this is my stance, the better the chance that people will just get used to this being everywhere.

    [01:35:19] Alex Volkov: And definitely for those of us who have our audio out there, we're doomed, right? So my usual suggestion here is: come up with a key phrase with your loved ones that only you two know, like the Terminator scene with the dog. Come up with this and make sure that if you get a call at 3 a.

    [01:35:34] Alex Volkov: m. at night, and it sounds like a bad quality version of your relative calling from somewhere, from an unknown phone, make sure it's them by asking something like, hey, remember when we went to Hawaii? when you never went to Hawaii. And if they say, oh yeah, of course, you know something is off. But also, most of those will probably be LLMs, so you can probably

    [01:35:53] Alex Volkov: prompt-trick them, the spammy LLM calls that sound like your relative.

    [01:35:57] W2V-BERT ASR gets Whisper quality with significantly fewer parameters

    [01:35:57] Alex Volkov: Alright, moving forward, unless, Nisten, you want to add some stuff about this w2v-BERT speech encoder? I've added it to the top of the space.

    [01:36:07] Nisten Tahiraj: Yeah, just really quickly, I'm gonna do the paper reading on it 'cause

    [01:36:10] Alex Volkov: Oh, hell yeah!

    [01:36:11] Nisten Tahiraj: It's a pretty nice paper, so stay tuned for that at some point when we announce it. It's from MIT and Google, some people from Google. So it's another really nice encoder-only model, and it potentially seems to be up to 30 times faster.

    [01:36:29] Nisten Tahiraj: So this could

    [01:36:30] Alex Volkov: than Whisper,

    [01:36:31] Nisten Tahiraj: quite useful. It could be quite useful for those making assistants that run on local devices or on low-resource devices, but also for stuff on the web. Now it is officially supported by the Transformers library. We'll wait on Xenova; I think it's probably going to be available via WebGPU and stuff, I'm guessing.

    [01:36:55] Nisten Tahiraj: Yeah it's very, it's nice to see that that field also going forward. Because we already have excellent speech recognition. We know it works really well. We just needed to work on more low power devices and mobile and

    [01:37:08] Alex Volkov: Absolutely. And looking at some stats here: it covers more languages than standard Whisper, 143 languages. And you can fine-tune it on specific languages as well to make it better. VB benchmarked it on Mongolian and beat Whisper in less than 1200 steps. So a smaller model, fine-tunable, super cool, and the best part of it is the MIT license.

    [01:37:29] Alex Volkov: There have been other ASRs, but not under this license. And now we're getting a state of the art tiny model under MIT. I think that's most of the stuff that I wanted to cover.
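    Editor's note: for those who want to try the new encoder, here is a rough sketch of loading it for CTC-based ASR fine-tuning through the Transformers integration mentioned above. The checkpoint and class names reflect my understanding of that integration and may differ, and you still need to build a character vocabulary/tokenizer for your target language.

```python
# Rough sketch: load the w2v-BERT 2.0 speech encoder for ASR fine-tuning.
# Checkpoint/class names are assumptions based on the Transformers support
# mentioned above; a target-language tokenizer must be prepared separately.
from transformers import AutoFeatureExtractor, Wav2Vec2BertForCTC

checkpoint = "facebook/w2v-bert-2.0"
feature_extractor = AutoFeatureExtractor.from_pretrained(checkpoint)

model = Wav2Vec2BertForCTC.from_pretrained(
    checkpoint,
    vocab_size=64,               # size of your language's character vocab
    ctc_loss_reduction="mean",   # standard choice for CTC fine-tuning
)

# From here, fine-tune with the usual Trainer loop on (audio, transcript) pairs,
# feeding feature_extractor(audio, sampling_rate=16_000) as model inputs.
```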

    [01:37:39] NSF announces a new initiative called NAIRR

    [01:37:39] Alex Volkov: No, I wanted to cover one last thing. One last thing. National Artificial Intelligence Research Resource. N A I R R.

    [01:37:47] Alex Volkov: Which is coming to us from the National Science Foundation, the United States National Science Foundation, collaborating with different agencies. So all of these incredible three-letter agencies are collaborating in this initiative now: NSF is the science foundation, both DARPA and NASA, and NIST, which is the Institute of Standards and Technology, and DOD and DOE, and, like, all these things.

    [01:38:11] Alex Volkov: But also, the private sector is joining this: companies like Anthropic and OpenAI, and Palantir, and Google, and Eleuther, and Hugging Face, and Weights & Biases. Obviously, I saw this and thought, oh, that's cool, Weights & Biases are participating in this incredible effort. They are all joining together in this initiative to promote and support AI research and advance safe, secure and trustworthy AI.

    [01:38:33] Alex Volkov: And it's also great to see folks like Hugging Face here, and Meta as well is represented, folks who push open source, because these government organizations have to have folks who promote open source as well. And they've organized this into four focus areas, the first being to enable open AI research with access to diverse AI resources via the NAIRR pilot portal.

    [01:38:56] Alex Volkov: So definitely expect there to be government grants for GPUs for different things. I don't know how easily those will be obtainable, but we had some folks from Canada before who talked about how you could ask for grants to train or fine tune, like the stuff that Tanishq was talking about, research into which open source model is better at medical QA, could be happening through the government. They also focus on security, and I think something called NAIRR Classroom, which I had no idea about.

    [01:39:22] Alex Volkov: Oh, which is new communities for education, training and user support. A very government-like approach. However, it's definitely good to see the companies that participate in this. It's not only government, it's also the private sector as well. NVIDIA is there, AMD is there, Eleuther, like we said, open source folks are represented as well.

    [01:39:43] Alex Volkov: A huge chunk of companies. It's good to see that the government is actually moving towards some standardization, which may be needed, hopefully less regulation, more standardization. And I think with that, we are pretty much done with all the news that we had for [01:40:00] this week. Which was great.

    [01:40:01] Alex Volkov: I want to say thank you. A huge thank you again for, first of all, the listeners who come here and listen, and the folks on stage who help me from week to week to bring you the latest and greatest in AI news.

    [01:40:11] Alex Volkov: Thank you so much, and we'll let you go on this Thursday, and we'll see you next week.

    [01:40:14] Alex Volkov: Take care, everyone. Bye bye.



    This is a public episode. If you’d like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
  • 👋 Hey there, been quite a week! It started slow and whoa, the last two days were jam-packed with news, I was barely able to keep up! But thankfully, the motto of ThursdAI is, we stay up to date so you don’t have to!

    We had a milestone: 1.1K listeners tuned into the live show recording. It’s quite the number, and I’m humbled to present the conversation and updates to that many people. If you’re reading this but never joined live, welcome! We go live every week on ThursdAI, 8:30AM Pacific time.

    TL;DR of all topics covered:

    * Open Source LLMs

    * Nous Hermes Mixtral finetune (X, HF DPO version, HF SFT version)

    * NeuralBeagle14-7B - From Maxime Labonne (X, HF,)

    * It was the best-performing 7B parameter model on the Open LLM Leaderboard when released (now 4th)

    * We had a full conversation with Maxime about merging that will release as a standalone episode on Sunday!

    * LMSys - SGLang - up to 5x faster inference (X, Blog, Github)

    * NeuralMagic applying #SparseGPT to famous models to compress them to 50% sparsity (X, Paper)

    * Big CO LLMs + APIs

    * 🔥 Google DeepMind solves geometry at Olympiad level with 100M synthetic examples (Announcement, Blog)

    * Meta announces Llama3 is training, will have 350,000 H100 GPUs (X)

    * OpenAI releases guidelines for upcoming elections and removes restrictions for war use (Blog)

    * Sam Altman (in Davos) doesn't think that AGI will change things as much as people think (X)

    * Samsung S24 has AI everywhere, including real time translation of calls (X)

    * Voice & Audio

    * Meta releases MAGNet (X, HF)

    * AI Art & Diffusion & 3D

    * Stable diffusion runs 100% in the browser with WebGPU, Diffusers.js (X thread)

    * DeciAI - Deci Diffusion - A text-to-image 732M-parameter model that’s 2.6x faster and 61% cheaper than Stable Diffusion 1.5 with on-par image quality

    * Tools & Hardware

    * Rabbit R1 announces a deal with Perplexity, giving a full year of perplexity pro to Rabbit R1 users and will be the default search engine on Rabbit (link)

    Open Source LLMs

    Nous Research releases their first Mixtral Finetune, in 2 versions DPO and SFT (X, DPO HF)

    This is the first Mixtral finetune from Teknium1 and the Nous team, trained on the Hermes dataset. It comes in two variants, the SFT and SFT+DPO versions, and is a really, really capable model; they call it their flagship!

    This is the first Mixtral finetune to beat Mixtral Instruct, and it is potentially the best open source model available right now! 👏

    It’s already available at places like Together endpoints, there are GGUF versions by TheBloke, and I’ve been running this model on my Mac for the past few days. Quite remarkable considering it’s only January and this is already the best open chat model available to us.

    Make sure you use ample system prompting for it, as it was trained with system prompts in mind.
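
    As a rough illustration (the repo id is my assumption, and you should confirm on the model card that the finetune ships a ChatML-style chat template), this is roughly how you would pass a system prompt through the tokenizer’s chat template:

```python
# Hedged sketch: rendering a system prompt for the Nous Hermes Mixtral finetune.
# Assumes the tokenizer ships a ChatML-style chat template; the repo id is assumed.
from transformers import AutoTokenizer

model_id = "NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a concise assistant. Answer in bullet points."},
    {"role": "user", "content": "What is a DPO finetune?"},
]

# Renders the conversation with the template stored in the tokenizer config
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```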

    LMsys new inference 5x with SGLang & RadixAttention (Blog)

    LMSys introduced SGLang, a new interface and runtime for improving the efficiency of large language model (LLM) inference. It claims to provide up to 5x faster inference speeds compared to existing systems like Guidance and vLLM.

    SGLang was designed to better support complex LLM programs through features like control flow, prompting techniques, and external interaction. It co-designs the frontend language and backend runtime.

    - On the backend, it proposes a new technique called RadixAttention to automatically handle various patterns of key-value cache reuse, improving performance.

    - Early users like the LLaVA team reported SGLang providing significantly faster inference speeds in their applications compared to other options. The LMSys team released code on GitHub for others to try it out.
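
    To give a flavor of the frontend, here is a sketch in the style of the SGLang examples. It is reconstructed from memory, so treat the function names and the endpoint URL as assumptions and check the repo before relying on it:

```python
# Hedged sketch of an SGLang program: a decorated function builds a multi-turn
# prompt, and the runtime (with RadixAttention) can reuse the shared KV cache
# across generations. API names are from memory of the SGLang examples.
import sglang as sgl

@sgl.function
def multi_turn_question(s, question_1, question_2):
    s += sgl.system("You are a helpful assistant.")
    s += sgl.user(question_1)
    s += sgl.assistant(sgl.gen("answer_1", max_tokens=128))
    s += sgl.user(question_2)
    s += sgl.assistant(sgl.gen("answer_2", max_tokens=128))

# Assumes a local SGLang server is already running on this port
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

state = multi_turn_question.run(
    question_1="What is RadixAttention, in one sentence?",
    question_2="Why does KV cache reuse help agent workloads?",
)
print(state["answer_1"], state["answer_2"], sep="\n")
```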

    Big CO LLMs + APIs

    Meta AI announcements (link)

    These #BreakingNews came during our space, Mark Zuckerberg posted a video on Instagram saying that Llama3 is currently training, and will be open sourced!

    He also said that Meta will have 350K (that’s not a typo, 350,000) H100 GPUs by end of the year, and a total of ~600,000 H100 equivalent compute power (including other GPUs) which is… 🤯 (and this is the reason why I had to give him double GPU rich hats)

    Deepmind releases AlphaGeometry (blog)

    Solving geometry at the Olympiad gold-medalist level with 100M synthetic examples

    AlphaGeometry is an AI system developed by Google DeepMind that can solve complex geometry problems on par with human Olympiad gold medalists

    It uses a "neuro-symbolic" approach, combining a neural language model with a symbolic deduction engine to leverage the strengths of both

    The language model suggests useful geometric constructs to add to diagrams, guiding the deduction engine towards solutions

    It was trained on over 100 million synthetic geometry examples generated from 1 billion random diagrams

    On a benchmark of 30 official Olympiad problems, it solved 25 within time limits, similar to the average human medalist
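
    A conceptual sketch of that loop, to make the division of labor concrete (this is my paraphrase, not DeepMind’s code; every object here is hypothetical):

```python
# Conceptual sketch of the neuro-symbolic loop: a symbolic engine exhausts what
# it can deduce from the current diagram, and when it stalls, the language model
# proposes an auxiliary construction (a new point or line) that unlocks more.
def solve_geometry(problem, language_model, deduction_engine, max_constructions=10):
    facts = set(problem.premises)
    for _ in range(max_constructions):
        facts = deduction_engine.deduce_closure(facts)   # exhaustive symbolic deduction
        if problem.goal in facts:
            return facts                                 # proof found
        construction = language_model.propose_construction(facts, problem.goal)
        facts.add(construction)                          # add the auxiliary construct and retry
    return None                                          # no proof within the budget
```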

    OpenAI releases guidelines for upcoming elections. (Blog)

    - OpenAI is taking steps to prevent their AI tools like DALL-E and ChatGPT from being abused or used to spread misinformation around elections

    - They are refining usage policies for ChatGPT and enforcing limits on political campaigning, impersonating candidates, and discouraging voting

    - OpenAI is working on technology to detect if images were generated by DALL-E and labeling AI-generated content for more transparency

    - They are partnering with organizations in the US and other countries to provide users with authoritative voting information through ChatGPT

    - OpenAI's goal is to balance the benefits of their AI while mitigating risks around election integrity and democratic processes

    Microsoft announces copilot PRO

    Microsoft announced new options for accessing Copilot, including Copilot Pro, a $20/month premium subscription that provides access to the latest AI models and enhanced image creation.

    Copilot for Microsoft 365 is now generally available for small businesses with no user minimum, and available for additional business plans.

    This weeks Buzz (What I learned with WandB this week)

    Did you know that ThursdAI is not the FIRST podcast at Weights & Biases? (Shocking, I know!)

    Lukas, our CEO, has been a long time host of the Gradient Dissent pod, and this week, we had two of the more prolific AI investors on as guests, Elad Gil and Sarah Guo.

    It’s definitely worth a listen, it’s more of a standard 1:1 or sometimes 1:2 interview, so after you finish with ThursdAI, and seeking for more of a deep dive, definitely recommended to extend your knowledge.

    AI Art & Diffusion

    Zero shot face adapted image gen - 3 different tech approaches

    What used to take ages now takes seconds, zero-shot. There are quite a few approaches to generating images with real human faces in a zero-shot capacity, providing just a few face photos. The Gradio folks call it zero-shot face-adapted image generation, and there are 3 tools to do it:

    1⃣IPAdapter

    2⃣PhotoMaker

    3⃣InstantID

    Here’s a great summary thread from the Gradio folks on this fast-advancing field! Remember when we had to finetune on faces for a long time, with DreamBooth and then LoRAs? Now we have this exciting development. (If you want to try the IP-Adapter flavor at home, see the sketch below.)
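
    A rough sketch of the IP-Adapter route with diffusers might look like the following. The adapter repo, file names, and pipeline arguments are from memory of the community release, so double-check them against the current diffusers docs; the input photo path is hypothetical:

```python
# Hedged sketch: zero-shot face-conditioned generation with an IP-Adapter on SD 1.5.
# Repo/weight names and pipeline arguments are assumptions -- verify before use.
import torch
from diffusers import StableDiffusionPipeline
from diffusers.utils import load_image

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Load the image-prompt adapter weights (assumed community release layout)
pipe.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")
pipe.set_ip_adapter_scale(0.6)  # how strongly the reference face steers generation

face = load_image("my_face.jpg")  # hypothetical local photo of the subject
image = pipe(
    prompt="a professional portrait photo, studio lighting",
    ip_adapter_image=face,
    num_inference_steps=30,
).images[0]
image.save("portrait.png")
```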

    Tools & Hardware

    Rabbit R1 partners with Perplexity

    The R1 device that was just announced is about to sell through its first 50K units in just a few days, which is remarkable. I definitely pre-ordered one and can’t wait to get my hands on it. Jesse the founder has been all over X, getting incredible recognition, and after a few conversations with Aravind Srinivas, they agreed to make a deal right on X.

    Today they hopped on a space and announced that all the first 100K early buyers of Rabbit are going to get a full year PRO subscription of Perplexity (one of the best AI search engines out there) for free! I sure as heck didn’t expect it, but the email was sent just a few minutes after the X space, and now guess who uses perplexity pro?

    Here’s an example of Perplexity searching ThursdAI content (it doesn’t always get it right though)!

    I guess that’s it for today. As I’m writing this, other incredible stuff is getting released: Codium open sourced AlphaCodium (here’s a link to the founder talking about it), but I didn’t have a second to dive into it; hopefully I will bring Itamar to ThursdAI next time and chat about it!

    Have a great weekend all 🫡 (please give us a good review on Apple iTunes, apparently it really helps discovery!)

    Full Transcription for convenience:

    [00:00:02] Alex Volkov: Hey everyone, happy Thursday. My name is Alex Volkov. I'm an AI evangelist with Weights & Biases, and this is ThursdAI.

    [00:00:13] Alex Volkov: We had such a great show today, over 1100 of you tuned in to the live recording, which is incredible.

    [00:00:30] I also wanted to say that if you're not subscribed to thursdai.news newsletter, please go ahead and do because I send a full blog with the links to the show notes and to the speakers that we have on stage, and you should be able to follow up.

    [00:00:46] Alex Volkov: There's a bunch of multimedia, like videos, that are not coming through in the audio-only podcast format. So please subscribe to thursdai.news as well. In this live recording, we also hosted Maxime Labonne, who's a senior machine learning scientist with J.

    [00:01:04] Alex Volkov: P. Morgan, and the author of several models and merged models, lately the NeuralBeagle model that we've talked about. We had a great conversation with Maxime, and that full episode will be posted as a Sunday special evergreen content episode. So please stay tuned for that.

    [00:01:29] Alex Volkov: It's been an incredibly illuminating conversation in the world of merging and MergeKit and everything else that Maxime does, and it was a super cool conversation. So that's coming soon.

    [00:01:41] Alex Volkov: And, as I've been doing recently, the following is going to be a 7 minute segment, from the end of the live recording, summarizing everything we've talked about.

    [00:01:54] Alex Volkov: I hope you've been enjoying these TLDR intros. Please let me know in the comments if this is something that's helpful to you.

    [00:02:05] ThursdAI Jan18 TL;DR recap by Alex

    [00:02:05] Alex Volkov: Alright, we started with talking today, ThursdAI, January 18th. We were talking about Nous Hermes, the Mixtral fine tune that came out from Teknium and the folks at Nous. It was one of the first fine tunes of Mixtral, the mixture of experts model from Mistral, that came from the Nous Research folks.

    [00:02:35] Alex Volkov: And it released in two versions, the SFT-only version and the SFT plus DPO version, given the different datasets they were trained on and actually different capabilities. It looks like, based on the community, the DPO version is performing very well. I've been running this on my MacBook with LM Studio and it really performs well.

    [00:02:53] Alex Volkov: So shout out, and folks should try this. This looks like by far the best Nous Hermes model based on just benchmarks. It's trained on the best open source model, which is currently Mixtral. Mixtral is number 7 in the world based on the LMSys Arena, and that's an open source model that we all get to use.

    [00:03:10] Alex Volkov: Then we covered NeuralBeagle14-7B from Maxime Labonne. Maxime also joined us for a full interview that you can hear as part of the podcast episode. Maxime released NeuralBeagle, which is a merge plus a DPO fine tune, and it's one of the top performing 7 billion parameter models on the Open LLM Leaderboard.

    [00:03:30] Alex Volkov: When it was released a few days ago it was the top; now it's fourth. So the speed with which things change is quite incredible. We then covered LMSys' SGLang, which is a bunch of techniques together on the frontend and the backend: RadixAttention on the backend, and the SGLang way to run inference code on the frontend, which combine into almost a 5x performance improvement on inference.

    [00:03:56] Alex Volkov: 5x is incredible. Nisten mentioned that it does less than 5x on longer sequences, and then we had a conversation about where it could improve things significantly, which is agents, since agents are sending short sequences. Alignment Lab told us that this could be a significant improvement in that area.

    [00:04:13] Alex Volkov: So our agents are about to run way faster. A 5x improvement is just incredible. And we also mentioned that on the same day this was released, another optimization was shouted out by Tim Dettmers of QLoRA fame, called Marlin, that also improves inference by around 4x with some significant techniques.

    [00:04:34] Alex Volkov: And I wonder if those can be combined together in some way. Quite impressive. We also covered Neural Magic doing sparsification, and we did a short deep dive, thank you Alignment, thank you Austin, into what sparsification means. They do this for major models and compress them with sparsification to around 50% sparsity.

    [00:04:55] Alex Volkov: It's zeroing out the weights that you don't actually use, and it makes the models significantly smaller. We covered Deci a little bit. We didn't actually get to the diffusion stuff; I'll just read out those updates as well. Then we covered how OpenAI had new guidelines for upcoming elections, and they're trying to add techniques for folks to identify DALL-E generated images.

    [00:05:18] Alex Volkov: And they're adding restrictions to how their LLMs are used in the context of voter suppression, etc. We then talked about DeepMind and AlphaGeometry, where DeepMind released, and it looks like open sourced, a model called AlphaGeometry that uses a neuro-symbolic approach with two models and solves geometry at almost a gold medal level at the Olympiad.

    [00:05:42] Alex Volkov: So, geometry Olympiads, and quite an impressive release from DeepMind, shout out. It was trained on a hundred million synthetic examples, sourced from more than one billion or so random diagrams, and it's quite impressive. So shout out DeepMind as well. We also briefly mentioned Samsung, which has the Samsung S24, the flagship phone that Apple now needs to compete with, that has AI everywhere, uses the new Qualcomm chip and has AI in

    [00:06:10] Alex Volkov: pretty much summarization everywhere. There's a button with the sparkles for AI. And one cool thing that we haven't mentioned, but I saw in MKBHD's review on Twitter, is that they added real-time translation of calls. So you can literally call people who speak a different language, and on-device translation, after you download the model on device, will actually be able to translate this in real time.

    [00:06:30] Alex Volkov: So you can read what the other person said in a different language, but also hear it. And that's quite cool. Then we had a deep interview with Maxime Labonne, the author of many things. Recently, we've talked about Phixtral, the mixture of experts of Phi models. We've talked about merges.

    [00:06:46] Alex Volkov: Maxime had a great explanation on his blog, and then on the Hugging Face blog, about what merges are, what MergeKit does, and how that plays into the whole ecosystem. The top of the LLM leaderboard has now been taken over by merges, likely because merging models does not require additional compute or additional training, and it's fairly easy to do with just the code. Merging takes and combines

    [00:07:11] Alex Volkov: different models, using different algorithms like SLERP and others; it combines different weights from different models, including potentially building models of novel sizes. So we've seen 10 billion parameter models, even 120 billion parameter models, so you can use those techniques to combine or merge models in different ways.

    [00:07:31] Alex Volkov: There's also frankenmerging, which combines different models into one. So we dove into that, what the inspiration for merging is and what it actually does. Maxime also released LazyMergekit, which is a thin wrapper on top of MergeKit from Charles Goddard. So shout out to Charles.

    [00:07:47] Alex Volkov: So we had a very interesting interview about merging, and thank you, Maxime, for joining us. Definitely worth a listen as well. And then we had breaking news from Big Zuck and the Meta team: he gave an update about the number of GPUs that they have. By the end of this year, they're talking about 350,000 H100s, and overall 600,000 H100 equivalents of compute, which they're going to use for AI and the Metaverse.

    [00:08:14] Alex Volkov: Definitely a great update. They're training Llama 3 right now. The stuff that we didn't get to, but I wanted [00:08:20] to update on, and I will add in the show notes: there's Stable Diffusion code that runs 100 percent in the browser with WebGPU and Diffusers.js, a thread from ClipDrop CEO Cyril Diagne.

    [00:08:32] Alex Volkov: And there's also, we've talked about DeciAI, the company that releases a bunch of models. They released DeciDiffusion, a text to image model with only 732 million parameters. It's twice as fast and 61 percent cheaper than Stable Diffusion with the same image quality, so that's getting improved.

    [00:08:51] Alex Volkov: But I think they're talking about Stable Diffusion 1.5, so not SDXL or the new one. And DeciAI also released DeciCoder, and we also covered Stable Code from Stability, a coding model that runs locally on device, a 3 billion parameter model that beats Code Llama 7B. I think that's most of the stuff we talked about.

    [00:09:09] Alex Volkov: And then one of the major things that Umesh brought up: we've talked about corporate drama, maybe a new segment in ThursdAI, where Microsoft did some things that actually disrupted workflows and companies' actual products built on top of Microsoft, which is considerably not great and led to a fight.

    [00:09:30] Alex Volkov: Hopefully not, but potentially a legal battle as well, and that's not something that should be done by a cloud provider such as Microsoft. Very ugly. In addition to this, we also talked about Microsoft announcing the CoPilot Pro that's now open for small businesses for 20 bucks a month with no minimum seats as well.

    [00:09:46] Alex Volkov: And I think that's most of the things that we've mentioned

    [00:09:49] Alex Volkov: Let's go.

    [00:09:51] Sounds: Namaskaram to all of you.

    [00:10:07] Alex Volkov: 2024, we all need to get used to saying 2024. At this point we have a bunch of AI news. My name is Alex Volkov, I'm an AI evangelist with Weights & Biases, and I'm joined on stage here with dear friends, co-hosts of ThursdAI: podcast, newsletter, live X recording, community, I don't know, a bunch of other stuff as well.

    [00:10:29] Alex Volkov: Nisten does paper readings, which is a semi-part of this as well. Welcome everyone. Welcome.

    [00:10:33] Introduction to the Session's Structure

    [00:10:33] Alex Volkov: I will just say a few things before we get started. So first of all, for those of you who are new, who are listening to this for the first time first of all, welcome.

    [00:10:41] Alex Volkov: It's great that you have found us. Please DM me with like how you found us. I would love to know as I'm looking into the channels, et cetera. However, I will say that we've been here every week, pretty much at the same time. I don't think we've changed time since the summer. So 8.

    [00:10:55] Alex Volkov: 30 AM Pacific, and we try to do this every Thursday. I think we missed one or two. I was sick once, apologies. But other than that, we're here to talk about AI every week. And what happens often is, as we talk about things, different breaking news happens and folks announce different stuff on Thursdays, and we cover pretty much everything, a very broad spectrum of AI changes. So I know there are spaces to talk about diffusion, specifically art spaces as well. So we cover diffusion to an extent, but we try to focus on, I guess our main focus is, open source LLMs. We love those. We have a bunch of folks here on stage who are training and fine tuning the greatest open source models, and we definitely follow up on the different, how should I say, different techniques, like the merging stuff that we're going to talk about at length later. And we hopefully get to hear about them first before they take over Hugging Face, which was the case, I think, with some of the models and some of the techniques.

    [00:11:54] Alex Volkov: And I see two more folks joining us as well from different areas of the open source community. So I will say welcome LDJ and welcome alignment, LDJ. You've been missing in action. I was just saying, how are you, man? Welcome back.

    [00:12:08] Luigi Daniele: Yeah, I'm doing good. Glad to be

    [00:12:10] Alex Volkov: Yeah. And also we have Austin AKA Alignment Lab. What's up Austin?

    [00:12:16] Alignment Lab: Oh, dude, I'm doing great. I was actually just in a call with LDJ and he was like, oh, ThursdAI is starting, and I was like, let's go.

    [00:12:22] Alex Volkov: Yeah, that's exactly what I like to hear, that the calendar event is popping off and ThursdAI is starting.

    [00:12:27] Open Source AI: Nous Hermes Mixtral Finetune + DPO deep dive

    [00:12:27] Alex Volkov: So with that, I think it's time for the open source stuff.

    [00:12:44] Sounds: Open Source AI, let's get it started.

    [00:12:48] Alex Volkov: All right, so welcome to probably the biggest, the most fun, the most content-packed section of ThursdAI, where we talk about open source LLMs, and LMMs I guess we should also start mentioning, because a bunch of these models that we see are also multimodal. And I guess we'll start with

    [00:13:08] Alex Volkov: the Nous Hermes fine tune on Mixtral. We've been waiting for this; Mixtral was released, I want to say, a month or so ago, a month and a half ago, and now we're getting one of the top datasets and fine tunes trained on Mixtral, and we're getting this in multiple formats.

    [00:13:25] Alex Volkov: Again, shout out Teknium. If you guys don't follow Teknium yet, what are you even doing showing up on ThursdAI? Definitely give Teknium a follow. But the Mixtral fine tune is available and it comes in two variants, SFT plus DPO and SFT only. So SFT is supervised fine tuning and DPO is direct preference optimization.

    [00:13:45] Alex Volkov: This is not a new technique; it has definitely been around for a while. Many people are using DPO at this point. We've talked about DPO multiple times. I think we also saw, Nisten, correct me if I'm wrong, that the actual Mixtral Instruct is also DPO, right? We saw this in the paper.

    [00:14:00] Alex Volkov: So DPO is everywhere. And this is not the first time that the SFT and DPO pair is getting released separately. I think we've chatted with Jon Durbin, who's, shoutout Jon, in the audience. And that conversation is on the feed. So definitely check out the conversation with Jon.

    [00:14:16] Alex Volkov: And the Bagel models were also released separately with an SFT and a DPO version as well. And I think Jon back then mentioned that each one has different things it's good at. And I also would love to figure out which one of the new Nous Hermes Mixtral fine tunes is best at what.

    [00:14:33] Alex Volkov: Teknium has a bunch of stuff in the thread, so I'll link this below for examples. And I will say that on the comparisons to Mixtral Instruct, Teknium posted a bunch of comparisons, and it's interesting that not all of the benchmarks look like improvements.

    [00:14:51] Alex Volkov: There are a few, I think on GPT4All and HellaSwag, where the base model, at least the non-DPO version, still wins just by a little bit. But everything else, like ARC, AGIEval, and MMLU, are significant improvements. And we're probably going to continue to see those improvements. Shoutout. If you have tried it, please let me know.

    [00:15:08] Alex Volkov: I will say this last thing: finally, after setting up LM Studio again, shoutout to LM Studio, we'll get to chat with the LM Studio folks at one point, hopefully soon. Now, the first thing I do is download these models because it's super, super easy in both of them, LM Studio and Ollama. And there was a tiny, I think, quantization thing in the beginning, and now there isn't, and now it works great.

    [00:15:33] Alex Volkov: And these models, I've loaded them up on my Mac before a flight. And I was just able to chat with this AI with no internet connection, or a poor internet connection. It was really something. I know we've talked about this multiple times: hey, put this on a thumb drive and then have all of human knowledge, quote unquote.

    [00:15:51] Alex Volkov: I'm not really saying it's all human knowledge, but I've been actually able to do this before my flight and it was really cool.

    [00:15:57] Alex Volkov: And I think the last thing to mention here is that Teknium suggests making liberal use of system prompts. So all the Hermes models, and there's now a bunch of Hermes models flying around, the most famous one at least is Hermes, I think, 7B, but also the Yi version, and this seems to beat the Yi version as far as our friend Wolfram Ravenwolf from LocalLLaMA tested.

    [00:16:22] Alex Volkov: This is probably the best Nous model out of them all so far. Obviously, it's based on the best open source model, called Mixtral, and definitely make liberal use of system prompts. Yeah, roleplay is suggested, setting expectations, specifications and everything else you can think of. Very easy to do with LM Studio.

    [00:16:39] Alex Volkov: I haven't [00:16:40] dove into how to actually steer these models for exactly the tasks that I do. Luigi, LDJ, you said that you want to tell me how to use LM Studio in regards to this, so I would love to hear from you. First of all, have you had a chance to try these models specifically? And second of all, let's talk about system prompts in LM Studio a little bit, because I think it's a part that people are definitely missing.

    [00:17:02] Luigi Daniele: Yeah. A lot of the latest models, like Hermes and I think maybe Dolphin too, are trained with system prompts. So if you really want to get the best use out of it, definitely use that. It's the same thing as ChatGPT really, where you give instructions on how you maybe want it to respond to you, or maybe add in a few threats of what you would do to the AI if it does not respond correctly, and surprisingly that seems to actually sometimes

    [00:17:28] Luigi Daniele: give good results. I personally try to always say please and thank you, but yeah. And there are also prefixes and suffixes, which I think I talked to you about, Alex,

    [00:17:36] Alex Volkov: You briefly mentioned this, but maybe worth like a given a little bit of a heads up for folks.

    [00:17:41] Luigi Daniele: yeah I think it really is worth maybe just a sit down and just a video with me and you actually going through it, because,

    [00:17:47] Alex Volkov: Sure.

    [00:17:47] Luigi Daniele: it's a decent amount to go through, but, yeah on the model card of most models, if you just look at something called prefix or suffix that is usually described in the model card, then You apply that to the LM Studio settings on the right panel in the chat settings.

    [00:18:03] Luigi Daniele: And yeah, you just make sure you have those things right. If you don't, there's a good chance you're not actually using the model correctly. And it's not going to give you the best results.

    [00:18:10] Alex Volkov: And they differ per base model as well. We've seen that different base models have different things that you want to add there, and you may think you're getting the same performance but actually be underperforming a little bit. I'll also say, for folks who are using a Mac with Apple Silicon, there's a little hidden checkbox there, and I don't know if it's on by default already.

    [00:18:30] Alex Volkov: It's called use Apple Metal, and definitely make sure that's on for you; significant improvement in performance and inference. Alright, so, Nous Hermes: anything else from folks here on stage who want to talk about this model and how it was trained and the difference with DPO? Folks, feel free to chime in.

    [00:18:45] Alignment Lab: The cool thing about DPO is that it's a reinforcement learning technique. I don't know if anyone else has had a chance to read the paper about it, but essentially what occurred was that some researchers found that transformers already have a baked-in optimal reward function.

    [00:19:03] Alignment Lab: And so what DPO is really doing is just training the model on that reward function, just biasing it towards the selected good example when you give it good and bad example pairs. It's not directly unique to this model, but it is super interesting because it really opens up a whole bunch of possibilities for what you can do with the model, now that you can give it negative examples and get more performance from it.

    [00:19:27] Alex Volkov: DPO is ranking different outputs in terms of preference. So can you talk about the pairs stuff? Everybody says DPO pairs; what do they mean by pairs? Could you say a bit about this?

    [00:19:38] Alignment Lab: Instead of training on, like, typically what you would do is you would build your dataset, and that would be your good dataset. You'd have a weaker model than the one that you used to synthesize the dataset, or just bad examples of responses, for every single example in the dataset.

    [00:19:54] Alignment Lab: So if you have one that's like, how do I make a cup of tea? And then instructions about how to make a cup of tea, then you'd also have that paired with a negative example of, a response to how do I make a cup of tea? And then, the response is something else, like how to build a Lego house or whatever.

    [00:20:08] Alignment Lab: And when you go to actually train it, you show it both at once, and you tell it which one is the positive and which one's the negative, and you just bias it towards the positive. It's quite similar, conceptually, to the way that OpenChat does its C-RLFT training, although OpenChat actually has a specific token for the good and bad examples that it has weighted.

    [00:20:34] Alignment Lab: But functionally, it's, the idea is the same. You're just doing reinforcement learning which lets you take data where you may have bad examples in there, and rather than having to remove them and waste data, you can now make a good example and get more out of it than you would have been by just replacing it.

    [00:20:50] Alignment Lab: So it lets you recoup extra performance out of bad data.
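
    To make the "pairs" idea concrete for readers, here is a minimal sketch: one preference record, and the standard DPO objective computed from sequence log-probabilities. This is illustrative only, not the exact training code behind the Nous release.

```python
# Minimal sketch of the DPO idea: each training example is a prompt with a
# chosen (good) and rejected (bad) response, and the loss increases the policy's
# log-probability margin between them relative to a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Implicit "rewards": how much more (or less) likely the policy makes each
    # completion compared to the reference model
    chosen_rewards = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_rewards = beta * (policy_rejected_logp - ref_rejected_logp)
    # Standard DPO objective: negative log-sigmoid of the reward margin
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# One pair from a preference dataset might look like this (illustrative only):
pair = {
    "prompt": "How do I make a cup of tea?",
    "chosen": "Boil water, steep the tea for 3-5 minutes, then add milk if you like.",
    "rejected": "Stack the bricks and snap the Lego roof pieces into place.",
}
```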

    [00:20:54] Alex Volkov: Thanks for the explanation. And definitely, we've seen, at least in my playing around with the bigger model and the DPO version of Nous Hermes Mixtral, that the DPO version at least behaves a little better. I actually don't know how to attribute this to the technique or to the datasets, but it's really good.

    [00:21:13] Alignment Lab: Yeah, we've noticed that if we do a regular supervised fine tune first, like just normal fine tuning, and then we DPO over that, the models push just much further than either thing alone, too. I don't know if that's universally true, because we do a fairly specific kind of model when we make these big releases, but it seems, at least for the case of general reasoning skill, it helps a lot.

    [00:21:37] Alex Volkov: Yeah, it's super cool. And I guess the downside of this, not the downside, but the outcome of some of this, is that folks who want to just use a model, and are maybe tuning in to ThursdAI to know which model is good to use, or maybe reading the LocalLlama stuff,

    [00:21:53] Alex Volkov: now have so many choices, including so many configurations. So maybe we should do a recap and also a simplification, LDJ, for system messages and the prefixes, Alignment, with DPO versus SFT. Just simplify and say, hey folks, use this. Because right now there are so many: you can choose between quantization methods.

    [00:22:11] Alex Volkov: There are at least four or five different ones for you to choose from, and LM Studio says on a few of them, this one is recommended, but it says recommended for five different ones. There are different quantization providers as well, right? So TheBloke is obviously the most familiar one,

    [00:22:26] Alex Volkov: and there's now a choice between DPO or SFT or DPO plus SFT, and we haven't even begun to talk about merges, which are coming as well. So there's a lot of choice and we need to simplify this for folks. So definitely, just to simplify: the Hermes models are usually very well behaved and great for roleplay as well.

    [00:22:43] Alex Volkov: Try them out. If you have the room to run Mixtral for your stuff, Mixtral is definitely by far the best open source model that we have. Go ahead, Alignment.

    [00:22:52] Alignment Lab: Yeah, so Mixtral, that model's architecture is very similar to a really old, comparatively old architecture that's been tried and true before. And so because of that, there are a lot of efficiencies that we just haven't integrated into the modern stack, but that will come.

    [00:23:09] Alignment Lab: And there's a bunch of new ones that people have been making. And with the new quantization methods that you can do with Mixtral, because since it's a sparse MoE, it doesn't actually need all of its weights equally, so some of them are less important, it lets you quantize those quite a lot without actually hurting the model's performance very much.

    [00:23:27] Alignment Lab: And you can also offload these layers when they're not being used. And then you can do expert pre-caching, where you predict some experts ahead of time, which lets you get faster inference speed. And at the end of the day, if QuIP#, which is a 2-bit quantization method, continues to prove out that it's as performant as it claims, we could end up running Mixtral on 4 gigs of VRAM, like on a laptop.

    [00:23:58] Alex Volkov: And

    [00:23:59] Nisten Tahiraj: We will.

    [00:24:00] Alex Volkov: we will.

    [00:24:00] Nisten Tahiraj: it to perform a bit better.

    [00:24:02] Alex Volkov: So I guess this takes us to the next, I'll go ahead and stand, and it's going to take us to the next optimization stuff.

    [00:24:09] Nisten Tahiraj: We could definitely have it run on 4 gigs. I've had it running a little above 4. However, the point is to have it run well. The quantization still makes it a little bit unfit for anything other than very short conversations. And we'll get it there.

    [00:24:30] Alex Volkov: All right. So in this, in, in this

    [00:24:32] Nisten Tahiraj: we'll have Mixtro under 4 gigs very soon and it'll be good.

    [00:24:37] Nisten Tahiraj: Yes.

    [00:24:37] Alex Volkov: And that's a promise. That's a promise.

    [00:24:39] LMsys SGlang - increased inference by 5X

    [00:24:39] Alex Volkov: So what happens is, once you go and put those bigger models on slower hardware, which is possible, you then wait a painfully long time for inference to actually happen. But this takes us to the next thing from the folks from LMSys. They released fast and expressive LLM inference with RadixAttention and SGLang.

    [00:24:59] Alex Volkov: So folks from [00:25:00] LMSys, if you guys remember, from models like Vicuna that took Llama and trained it on additional datasets, and the LMSys Arena and all these places, we definitely trust them at least with some of the evaluation stuff. I think, is MMLU also in the LMSys Arena? Or at least they test on MMLU. They released an inference optimization kind of collection of techniques.

    [00:25:24] Alex Volkov: I don't think it's one specific technique because there's like Radix attention. Yeah, go ahead.

    [00:25:28] Alignment Lab: It's where all this was going in the first place between all these sort of different prompting programming frameworks and inference engines. What they've done is they built out the back end with the end goal of having an extremely controllable, steerable compiling system for programming outputs from a, from like an AI in the way, like a Pydantic or in the way that you would typically use sort of structured grammars and sampling techniques.

    [00:25:58] Alignment Lab: And way more. It's hard to explain in, in summary in a way that's very easily grokkable without getting too technical but it's a combination of many things that we've been doing individually, which were always gonna be one big thing, they just saw it first and did it first, and now, when you're looking at it, it seems very obvious that this is probably how things should look going forward

    [00:26:17] Alex Volkov: so let's actually talk about

    [00:26:18] Bluetooth: overall, just a

    [00:26:19] Alex Volkov: what they have. Yeah, they propose co-designing the backend runtime and the frontend language, which is, like Alignment said, a structured domain-specific language embedded in Python to control the inference generation process. It's called a domain-specific language, a DSL.

    [00:26:35] Alex Volkov: I think many folks have been using some of this. I think DSPy as well is being mentioned in the same breath. And then this language is executed in interpreter code or in compiler code. And on the backend they have this RadixAttention technique for automatic and efficient KV cache reuse.

    [00:26:53] Alex Volkov: I don't know if that's, for instance, MoE specific or not yet, but definitely the combination of those two plus the code that they've released shows just incredible results. Folks, we live in an age, and we've talked about multiple of those techniques, we live in the age where somebody like this can come up and say, hey, here's an example of a set of techniques that if you use them, you get

    [00:27:12] Alex Volkov: a 5x improvement on inference. In the same breath that we're saying, hey, we're going to take Mixtral and put it in 4GB, and we've seen this obviously with Stable Diffusion, which we're going to mention, that runs fully in the browser, we're now seeing releases like this from a very reputable place: a collection of techniques that have been used to some extent by some folks, and now all under one roof, under one GitHub

    [00:27:35] Alex Volkov: repo, that actually improves inference by 5x on all of the major evaluations, at least the ones they've tested, that we always talk about. So 5x on MMLU and HellaSwag, significantly more performant, all these things. Quite impressive. One thing that I would definitely want to shout out is that the maintainer of LLaVA, the LMM, the kind of visual Llama, also replied and said that the execution of LLaVA is actually written up in the report itself.

    [00:28:07] Alex Volkov: And it improves LLaVA execution by 5x as well. And by execution, I mean inference speed, basically. So without going too much into RadixAttention, because honestly, it's way too heavy for the space, it's quite incredible that we get stuff like this from places like LMSys, specifically in the area of running smaller models, sorry, running bigger models on smaller hardware.

    [00:28:33] Alex Volkov: Go ahead, Nisten.

    [00:28:36] Nisten Tahiraj: I'll say something. So it does automate a lot of the tricks that people have been pulling, and it works great for large amounts of smaller prompts. Once you go to longer prompts, the benefit is not that much compared to vLLM. I think it felt like five or ten percent faster compared to vLLM. So again, I haven't taken a very deep dive into it.

    [00:29:01] Nisten Tahiraj: Just want to just make people aware that it's fantastic for smaller prompts and stuff. But for longer ones, you don't necessarily need to switch your whole stack to it. VLLM still works fine. Yeah, I think for if you're doing like what you would normally be doing with VLLM, which is like processing like large amounts of data or serving for just general purposes.

    [00:29:24] Nisten Tahiraj: there's probably no need to switch your stack. I think, specifically, what it feels optimized for is agent frameworks, in which you have many models communicating short strings back to each other, one model wearing many hats. And the optimization work, just while we're on the topic, is crazy right now.

    [00:29:43] Nisten Tahiraj: There are still three papers with major inference optimizations for Mixtral alone, as well as for vLLM, and they seem to compose pretty well. Having an alternative to vLLM that's similarly performant is huge, because vLLM is a big bottleneck on a lot of stacks because of the way that it handles attention off on the CPU.

    [00:30:00] Nisten Tahiraj: It feels a lot like when llama.cpp got offloading the same week that speculative decoding came out with Hugging Face Transformers, and everything just got a hundred times faster, like half a year ago or so.

    [00:30:12] Alex Volkov: Yeah, it definitely felt like that. The same day LMSys released the SGLang optimization that we're just now talking about, I don't have a link for this, but folks from IST Austria released Marlin, which is a 4-bit kernel. The way I know it's cool is that Tim Dettmers of QLoRA fame retweeted this and said this is a huge step forward.

    [00:30:33] Alex Volkov: And Tim Dettmers is the guy who, in CUDA MODE, codes CUDA kernels within a night or something, after planning for 3 months, and then finishes. So I know that when Tim Dettmers says something is a huge deal, he probably knows what's up. So Marlin released the same day that SGLang released, and it's a linear kernel for LLM inference with a near-ideal

    [00:30:53] Alex Volkov: 4x speedup up to batch sizes of 16 to 32 tokens. And they came out pretty much the same day, yesterday, on January 17th. So I'm going to add this in the show notes. So Marlin is also an exciting optimization. And Nisten, I fully agree with you; we see these breakthroughs, or collections of methods, that suddenly are finally collected in the same place.

    [00:31:11] Alex Volkov: A bunch of papers that hadn't released code as well, or hadn't played with different things. And it's very exciting to see them keep coming out; we're only at the beginning of this year. And I think, to the second point that you just mentioned, with agent frameworks specifically, and RAG, Retrieval Augmented Generation, this benefit is significant like you said, because of the short strings these agents communicate back and forth with each other.

    [00:31:34] Alex Volkov: Last week we talked with one such author from CrewAI; Crew specifically is an orchestration of different agents that do different tasks and coordinate and talk to each other, and improving inference there matters. Many of them run on GPT-4, and I haven't fully gotten into how to do this yet, but SGLang also says that their LLM programming can actually work with various backends.

    [00:31:55] Alex Volkov: So OpenAI as well, and Anthropic, and Gemini, and local models. That's very interesting, if they actually improve OpenAI inference from Python. But DSPy RAG, so RAG on DSPy from Omar Khattab, is definitely mentioned in the SGLang report. I know I'm throwing a lot of acronyms at you guys.

    [00:32:14] Alex Volkov: So SGLang is the stuff we talked about, the new language from the LMSys org that speeds up some stuff. DSPy I haven't talked about yet, so we'll cover it, but one of the tasks, DSPy's RAG, so retrieval, is mentioned as getting a significant boost. Like Nisten and Austin said, not necessarily for longer context prompts.

    [00:32:35] Alex Volkov: For 30,000 tokens for summarization, maybe this technique that caches a bunch of stuff between calls is not going to be super helpful, but for fast execution of multiple things it's definitely a significant 5x. And like I think Alignment said, it's only the beginning of the optimization cycles that we'll see, and it's quite exciting to see them come out.

    [00:32:56] Alex Volkov: I think we've covered two optimization techniques, SGLang, and then Marlin as well. I'll put a link to the show notes as well.

    [00:33:03] NeuralMagic, compressing models with sparcification

    [00:33:03] Alex Volkov: And I think now it's time to move to, yeah, one thing that we're going to chat about, which is Neural Magic, and I'd definitely like folks on stage to feel free to talk about Neural Magic, because [00:33:20] somebody told me it's cool, but I have no idea how to even simplify this.

    [00:33:23] Alex Volkov: So, Austin, if you want to take a lead on this one, definitely feel free.

    [00:33:28] Alignment Lab: Okay, Neural Magic. This is actually the first conversation, I think, that me and LDJ both geeked out really hard on when we were talking, because we were each the only person the other knew who even knew about this company. Neural Magic has been making miracles in the corner for years.

    [00:33:44] Alignment Lab: I first got interested in them because they had made a BERT model that was initially, I think, nearly a gig on your computer to run, and it spoke English perfectly well and all this other stuff. And they had compressed it to the point that the full model on your computer was like 15 megabytes, and what blew my mind was, how does that even know English?

    [00:34:06] Alignment Lab: And it was at like 96 percent of the original accuracy, despite all of that. They specialize in these optimization and compression techniques. And so what they do typically is they have a stack, which they wrote a paper about a while ago, which I'll post in the comments here.

    [00:34:22] Alignment Lab: It's called the Optimal BERT Surgeon, which is basically a process in which they have a teacher model and a student model. With the student model they use distillation in the more traditional sense than I think it's more commonly used now, where you're just training on a model's output; they use the actual logits. They basically load both models in during the training run and train the smaller model to behave like the larger model. And while they're doing that, they're also pruning it, which is, essentially, you reduce the weights that are not getting used during training to zero, which lets your computer not have to calculate them, so it moves much faster.

    [00:34:58] Alignment Lab: And then they also quantize, which is where you reduce the precision. Basically, without getting too technical, you're literally summarizing the parameters of the model such that it's literally a smaller file. And they do this all at once, which takes the larger model and compresses it into the student model that's starting out smaller, and then they're quantizing the student model and pruning it, so it's both running faster and literally getting smaller. And as far as I'm aware, there's nobody who's even coming close as far as being able to compress a model so much. And recently, I think about two months ago, we first saw that they're integrating Transformers with Sparsify Alpha, which is now just out and is called Sparsify on GitHub.

    [00:35:43] Alignment Lab: Totally check it out. You can make a tiny llama and do all that stuff to it and make it microscopic. It's amazing. And

    [00:35:49] Alex Volkov: here, Austin, just real quick. So we've been talking about quantization, for folks who are not following the space super closely. Let's say there are different quantization techniques, and some of them create smaller files, but the performance, or the accuracy, gets lowered.

    [00:36:03] Alex Volkov: How is sparsification different from quantization, at least on a basic level? Are they compatible? Could you use both of them on the same file? What is this thing, sparsification?

    [00:36:15] Alignment Lab: So in reality, probably, if it were a more accessible tool, we would all likely just be doing both every single training run. But since there are always new quantization techniques, it doesn't always make sense to. With sparsification, the specific difference is, rather than taking the same model and reducing the precision of its calculations to make it smaller, the model stays the same size physically on your drive, but you're reducing the weights that aren't getting used to a zero value.

    [00:36:50] Alignment Lab: And what that does is just means your GPU has to do fewer calculations for the model to do inference, and it makes it much faster.

    [00:36:59] Alex Volkov: All

    [00:36:59] Nisten Tahiraj: Also, for the next BakLLaVA version, Neural Magic did make a CLIP model for us. So shout out to them. They were able to cut down the size to about four times smaller.

    [00:37:14] Nisten Tahiraj: So we'll have that out soon. And yeah, also, for anybody else that wants to learn about sparsity, just look up Nir Shavit on YouTube. N I R S H A V I T. He's the OG MIT professor that pioneered sparsity and has a lot of videos out, and Neural Magic is his company. And yeah, it's looking really promising for the future because they can optimize at a deep level for CPU inference.

    [00:37:45] Nisten Tahiraj: And it's not necessarily just quantization; they are also reducing the amount of unused weights. So yeah, expect to see a lot more stuff about sparsity from the GPU-poor side of the spectrum, because that's where the benefits are yet to be reaped.

    [00:38:02] Nisten Tahiraj: Anyway, shout out to Neural magic as well.

    [00:38:04] Alex Volkov: shout out to Nir Shavit and Neural Magic, it looks cool, and they just got into sparsifying fine tuned models as well. I think they sparsified some new models, and I don't know if they got to OpenChat yet, but I think some folks are waiting for Phi sparsification, definitely. The area of smaller models running on smaller hardware is advancing super, super fast.
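
    For readers, here is a toy illustration of the core idea of sparsification. This is plain unstructured magnitude pruning, much cruder than Neural Magic's pipeline, but it shows what "zeroing out the weights you don't use" means in practice:

```python
# Toy illustration (not Neural Magic's method): zero the smallest-magnitude
# weights so the layer becomes sparse, letting a sparsity-aware runtime skip
# those multiplications entirely.
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    k = int(weight.numel() * sparsity)                    # number of weights to zero
    threshold = weight.abs().flatten().kthvalue(k).values # magnitude cutoff
    mask = weight.abs() > threshold
    return weight * mask                                  # same shape, ~50% zeros

layer = torch.randn(4096, 4096)
sparse_layer = magnitude_prune(layer, sparsity=0.5)
print(f"zeros: {(sparse_layer == 0).float().mean():.0%}")
```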

    [00:38:26] Stable Code from Stability AI - 3B coding model beating CodeLlama

    [00:38:26] Alex Volkov: Let's move on, folks, because we've been in the open source area for quite a while, and we also need to get to the end of our conversations here and start doing deep dives. So Stable Code was released from Stability. A brief review: it's a 3 billion parameter language model

    [00:38:41] Alex Volkov: from Stability AI. It does code completion and obviously it runs offline, because it's a small model and you can run it locally. They claim it can run on MacBook Airs as well, and they say something like without a GPU, interestingly. Accurate completion across 18 languages at a level comparable to models twice its size.

    [00:38:57] Alex Volkov: This is an interesting comparison to Code Llama at this point, because we've seen a bunch of other models already beat, I think, Code Llama on different metrics, but people still compare themselves to the big dog. And it's very interesting. They use a multi-stage process, pre-training on natural language

    [00:39:15] Alex Volkov: and fine tuning on code datasets to improve programming language performance. It supports fill-in-the-middle and expanded context sizes compared to previous versions of Stable Code. And I think, oh yeah, Stability now has a commercial membership plan, because everybody's thinking about, okay, how is

    [00:39:33] Alex Volkov: Stability going to make money. So they have this membership where you can use their models. So it's not fully open source; I think you can use these models commercially if you participate in this membership, otherwise you can use them for research. So Stable Code, check it out. I think it's new on Hugging Face,

    [00:39:48] Alex Volkov: I think from today I believe,

    [00:39:50] Discussion on Neural Beagle 7B & Model merging

    [00:39:50] Alex Volkov: And I think the last thing that I want to chat about in open source, just briefly, is NeuralBeagle 7B from Maxime, who's in the audience and is going to come up hopefully for the interview in a few

    [00:39:59] Alex Volkov: minutes, I want to say maybe 20 minutes, Maxime. NeuralBeagle, back when I added this to my notes, was the top performing 7 billion parameter fine tune on the Open LLM Leaderboard. It's no longer the top performing; it was at least number 4.

    [00:40:14] Alex Volkov: And it's a merge plus a DPO, that's what I saw from Maxime. A merge of, actually, it's interesting what it's a merge of, so let's go into the model card and check this out.

    [00:40:24] Alex Volkov: Maxime looks like he has a bunch of models, and this NeuralBeagle14, 7 billion parameters, has an average of 60 across all the scores and 46 on AGIEval. And yeah, it's one of the top performing models and it's a merge of different things. And it already has a demo space that I'll link in the show notes as well.

    [00:40:43] Insights on LazyMergekit

    [00:40:43] Alex Volkov: Yeah, it uses LazyMergekit, which is a Colab from Maxime that we're also going to chat about, to figure out what this merging thing means. But definitely, I think this model triggered one of the Nathans in AI to say, hey, I wanted to ignore this merge business for a while, but I guess I can't anymore, because merges are not to be ignored at this point.

    [00:41:04] Alex Volkov: And this is a merge of UNA-TheBeagle and a distilabeled Marcoro slerp, which is itself also a merge. So if you hear me and you're confused, like, what do all these things mean? Hopefully we'll be able to clarify this with Maxime. Maxime also had a tweet where there's now a Colab where you can take a model like this and basically map out the genealogy of these models.

    [00:41:25] Alex Volkov: What is based on what? And it's quite cool to see. What else should I say about this model? I think that's pretty much it. It performs very well. I actually haven't had a chance to use it, but it's right up there and it's a merge model. There's the checkbox, like we said, on the Open LLM Leaderboard.

    [00:41:42] Alex Volkov: If for some reason you don't want to see the merge models and only want to see trained models, you can uncheck that. But definitely the merge models are competing for the top of the LLM leaderboards right now. I haven't seen a lot of them on the LMSys arena, so it's going to be interesting to see how they treat the merge models.
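
    Since "merging" can sound like magic: below is a toy sketch of SLERP (spherical linear interpolation), one of the interpolation methods merge tools like mergekit / LazyMergekit offer, applied to a single tensor. Real merges do this layer by layer with per-layer interpolation factors and a config file; this only shows the core math.

```python
# Toy SLERP between two weight tensors: interpolate along the arc between
# them rather than along the straight line (plain averaging).
import torch

def slerp(t: float, a: torch.Tensor, b: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    a_flat, b_flat = a.flatten().float(), b.flatten().float()
    a_dir = a_flat / (a_flat.norm() + eps)
    b_dir = b_flat / (b_flat.norm() + eps)
    omega = torch.arccos(torch.clamp(a_dir @ b_dir, -1.0, 1.0))  # angle between the tensors
    if omega.abs() < eps:                      # nearly parallel -> fall back to plain lerp
        return (1 - t) * a + t * b
    so = torch.sin(omega)
    out = (torch.sin((1 - t) * omega) / so) * a_flat + (torch.sin(t * omega) / so) * b_flat
    return out.reshape(a.shape).to(a.dtype)

w_a = torch.randn(1024, 1024)   # stand-in for a layer from model A
w_b = torch.randn(1024, 1024)   # the same layer from model B
merged = slerp(0.5, w_a, w_b)   # one merged layer; a real merge repeats this per layer
```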

    [00:42:02] Alex Volkov: And I think that's most of open source, and we've given this corner almost 40 minutes, so I think it's time to move on a little bit here, folks. So yeah, I don't have breaking news here, so I'll just do a small transition so I can take a breath, haha.

    [00:42:17] Sounds: Namaskaram to all of

    [00:42:22] DeepMind's AlphaGeometry

    [00:42:22] Alex Volkov: LLMs and APIs, and I think the biggest player in this whole area is DeepMind. DeepMind released a Nature article, which they always do, they always publish in Nature. This time the link to the Nature article didn't really work, but hopefully they've fixed it by now. And they released AlphaGeometry. They've released a bunch of stuff before, AlphaFold, if you remember, AlphaGo, AlphaZero, they had a model that self-trains to play anything, not only chess or Go. And now they've released AlphaGeometry, which solves geometry at almost a gold medal level at the Olympiad. So they have this, how should I say, this nice chart that says the previous state of the art, against this Olympiad gold medalist standard, got to ten problems solved. There are time limits, I'm not sure what the time limits actually are, I don't have it in my notes. But you have to solve these very difficult geometry problems. Folks compete for the gold medals in this Olympiad. And AlphaGeometry now comes very close to the gold medalist standard.

    [00:43:29] Alex Volkov: So the gold medalist standard is 25.9 problems solved, and AlphaGeometry now solves 25, and they claim that the previous state of the art solved 10, just 10. So they more than doubled it and they're getting close to the Olympiad gold. I think I saw a tweet from Nat Friedman or somebody that said they would offer a $1,000,000 prize for somebody who solves the Geometry Olympiad at the gold medal level, and now we're getting there.

    [00:43:53] Alex Volkov: They use a neuro-symbolic approach: they combine a language model with a symbolic deduction engine to leverage the strengths of both. Some folks compare this to Thinking, Fast and Slow, where you have system 1 and system 2 thinking, or at least it outlines system 1, system 2 thinking.

    [00:44:09] Alex Volkov: In this case, this does actually help. I don't think I've seen this neuro-symbolic approach used quite like this before. And I think the most interesting part: it was trained on over a hundred million synthetic geometry examples generated from one billion random diagrams.

    [00:44:27] Alex Volkov: Completely, solely synthetic geometry examples. The whole dataset for training this model that beats humans at geometry, which was previously very difficult, is fully synthetic. And I think that's super cool. The year has only just begun, but definitely this is going to be the year where fully synthetic datasets are going to rule.
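
    To give a rough mental model of what "combining a language model with a symbolic deduction engine" means in a loop, here is a conceptual sketch. This is not DeepMind's code; the function names and flow are illustrative assumptions, with the language model only proposing auxiliary constructions and the symbolic engine doing all the verifiable deduction.

```python
# Conceptual sketch of an AlphaGeometry-style neuro-symbolic loop.
# `deduce` and `propose_construction` are stand-ins you would supply:
# a symbolic engine and a trained language model respectively.
from typing import Callable

def solve(premises: set[str], goal: str,
          deduce: Callable[[set[str]], set[str]],                 # symbolic engine: new facts derivable from known facts
          propose_construction: Callable[[set[str], str], str],   # LM: suggest an auxiliary construction when stuck
          max_constructions: int = 16) -> bool:
    facts = set(premises)
    for _ in range(max_constructions):
        facts |= deduce(facts)              # fast, exact, machine-checkable step
        if goal in facts:
            return True                     # proof found; every step is verifiable
        facts.add(propose_construction(facts, goal))  # creative, learned step (e.g. "add midpoint D of AB")
    return False
```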

    [00:44:49] Alex Volkov: And yeah, opinions, folks here on stage, have you read about this? What's interesting to you? I would love to hear folks chime in on this as well, because I think it's super cool, and kudos to them for releasing this. Also, I saw somebody say, I think Bindu said, that they released this open source, but I haven't seen anything.

    [00:45:06] Alex Volkov: Definitely. Luigi, go ahead, and then Nisten.

    [00:45:09] Luigi Daniele: Yeah, it's funny that you brought up Nat Friedman having that bet up, because I remember that too, and now I'm thinking, I wonder if he'd be willing to give up the million dollars, or whatever the money is, to DeepMind. Ha.

    [00:45:20] Luigi Daniele: It was done by Google DeepMind, so that'd be funny.

    [00:45:25] Nisten Tahiraj: How has Google not discovered AGI yet and fallen so behind?

    [00:45:30] Nisten Tahiraj: This almost feels like an internal illness or something. Something's going on. Because yeah.

    [00:45:40] Alignment Lab: I don't think that Google needs to compete is the thing. I just don't think they're incentivized to release anything into the space because they don't have to. There's really not anything here except money to lose for them.

    [00:45:51] Alignment Lab: They already have all the data and stuff. Yeah, and back to the geometry problems, I can't wait to test this, if they release it, as to how it does when given really random, very long numbers. If it still solves the problem, then that will be extremely impressive. And yeah, I've done those Math Olympiads with geometry questions and they're not easy at all.

    [00:46:18] Alignment Lab: You have to picture stuff in 3D, 4D, and whatever in your head. They're very tricky problems. So yeah, this is pretty huge. That's all. Yeah.

    [00:46:26] Alex Volkov: Quite, quite huge and kudos on them. Umesh, I think you actually found the source, right? I just

    [00:46:32] Umesh Rajiani: Yeah, so there is a GitHub repo under Google DeepMind. So if you go to Google DeepMind on GitHub and then alphageometry, you can find the code repo for that. So Nisten, if you want to test it out, it's there for you. So I'm taking your

    [00:46:47] Alex Volkov: harp on this just for a little bit. Did Google release code for us, finally? Did Google open source something? Welcome back, Google.

    [00:46:54] Umesh Rajiani: Yeah, so this is like a first release kind of thing coming out of Google. So yeah, it is quite interesting.

    [00:47:01] Alex Volkov: Definitely moves us towards like more generalist

    [00:47:04] Bluetooth: I'll have it up in a sec.

    [00:47:05] Alex Volkov: Yeah, please post this and we'll add it to the show notes as well. Definitely the question: how have they not solved AGI yet? Solving math at the Olympiad level seems like it's moving us forward, definitely. This neuro-symbolic approach, where they combine language models with a symbolic deduction engine, and I have no idea what symbolic deduction means in this case,

    [00:47:24] Alex Volkov: but leveraging the strengths of both seems like going down the right path. We've seen, I think, similar things with vision as well, where you combine vision heads into one model so it can understand images. I don't think this model was multimodal at all, it doesn't look like it, but maybe I'm wrong here.

    [00:47:42] Alex Volkov: And yeah, the solutions for this thing are verifiable by machines. I saw this one tweet that will go down in history. Somebody said, computers have always been good at calculations, so I don't understand what the big deal is here. And I think it's really funny to keep this tweet around.

    [00:48:04] Alex Volkov: Alright, so shout out to DeepMind for this fairly incredible release. Hopefully some of the techniques they used will then be used by folks in other areas as well, to get us AIs that are significantly better at geometry and other things. Oh yeah, Umesh, just before we continue, you want to talk about this neuro-symbolic thing? Because we've talked about this; I think Daniel Jeffries talked about this last time, when we talked about Rabbit.

    [00:48:27] Alex Volkov: If you guys remember, this was at the end of the last space, when we talked about Rabbit's LAM, the Large Action Model. And Umesh, you just mentioned that they also use a neuro-symbolic approach to an extent, right?

    [00:48:39] Umesh Rajiani: Yeah, so the LAM, the Large Action Model, is basically based on neuro-symbolic programming. Specifically, when they talk about training the model from the actions that you perform, they are using neuro-symbolic programming to train the model, or to capture the actions, basically.

    [00:48:55] Umesh Rajiani: So that's what they're trying to do, in theory, they are saying. We have to see what comes out in practice.

    [00:48:59] Alex Volkov: Yeah, and based at least on their examples, it looks very compelling, potentially being able to solve a bunch of stuff, or to remember things based on your actions. So neuro-symbolic is not a new approach. I apologize, I will edit this. Definitely Rabbit said this, you're right, and hopefully we're going to get to see this LAM thing.

    [00:49:19] Alex Volkov: So back to OpenAI. As elections are happening right now and everybody was fearing, hey, what's going to happen with deepfakes, et cetera, OpenAI released their guidelines as they prepare for elections; obviously they're aware that they're happening. And there are a few interesting things there: they're taking steps to prevent their tools like DALL-E and ChatGPT from being abused.

    [00:49:38] Alex Volkov: I don't know. We have open source, so I don't know if folks will go to GPT-4 to generate, let's say, propaganda. But DALL-E, for example, is starting to integrate some cryptography into their images, which is very interesting. Cryptographic provenance, which, again, in case you download the actual file and then send it, could be a thing.

    [00:49:58] Alex Volkov: But if somebody takes a screenshot of a DALL-E generation, I don't know if that will apply at all. There are definitely usage policies for stuff like ChatGPT, enforcing limits on political campaigning, impersonating candidates, and discouraging voting. And they want to run ahead of what happened with Facebook and Cambridge Analytica, all these things they want to get ahead of, which makes sense.
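
    On the screenshot point, here is a toy illustration (emphatically not OpenAI's actual provenance scheme) of why credentials attached to the original file stop matching once the pixels are re-encoded, which is exactly what a screenshot does.

```python
# Toy demo: any signature or credential committed to the original file bytes
# no longer verifies after the image is re-encoded (e.g. screenshotted).
import hashlib
from io import BytesIO
from PIL import Image

img = Image.new("RGB", (64, 64), "white")                 # stand-in for a generated image
original = BytesIO()
img.save(original, format="PNG")
fingerprint = hashlib.sha256(original.getvalue()).hexdigest()   # what a signed credential would commit to

# "Screenshot": visually the same content, but a brand new file encoding
screenshot = BytesIO()
Image.open(BytesIO(original.getvalue())).convert("RGB").save(screenshot, format="JPEG")

print(hashlib.sha256(screenshot.getvalue()).hexdigest() == fingerprint)  # False: file-level provenance is gone
```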

    [00:50:18] Alex Volkov: As for the technology they use to detect whether images were generated by DALL-E, I haven't seen any release from them that says, hey, we'll build a tool for you to actually identify whether those are generated images or not. It's going to be interesting, because like with LLM writing, all of these tools where you dump AI text in to check it can be defeated with another LLM.

    [00:50:38] Alex Volkov: I don't know if it's a futile attempt, but it's definitely a worthwhile one. And at least in the basic UI, blocking some attempts at destabilizing democracy, I think that's a good idea. And I think that's mostly it. There's one other mention: somehow they silently removed the terms and conditions clause saying their outputs are not to be used for warfare or weapons development.

    [00:51:04] Alex Volkov: And I think they removed that, and I think they also signed something with the Department of Defense, but I think that's all for OpenAI.

    [00:51:11] Microsoft announces Copilot Pro

    [00:51:11] Alex Volkov: And then I wanted to mention Microsoft, and Umesh, definitely feel free to chime in here as well, because this underlines the benefit of open source. But quickly: Microsoft announced Copilot Pro. We've talked about Copilot, previously Bing Chat, Copilot everywhere.

    [00:51:25] Alex Volkov: So they've announced different paid plans for Copilot Pro, 20 bucks a month for premium, and it does enhanced image creation, beyond what we even get in DALL-E by default. And it's now generally available for small businesses with no user minimum. If you guys remember, we've talked about Copilot before, when Microsoft announced it for large enterprises; it integrates into Microsoft 365 everywhere.

    [00:51:49] Alex Volkov: And now the Copilots are also open to smaller businesses. And soon there's going to be this Copilot Studio to build custom GPTs. Very cool for small businesses; we'll see how much folks actually use this. And there's also a bit of a Microsoft saga, where they've changed some stuff in their pipeline.

    [00:52:04] Corporate Drama - Microsoft Azure changing moderation flows and breaking products

    [00:52:04] Alex Volkov: So Umesh, you mentioned this in the beginning. We'd love to hear from you what's been going on as you guys are big Azure users through Microsoft.

    [00:52:11] Umesh Rajiani: Ooh, this happened the

    [00:52:15] Umesh Rajiani: day before yesterday, actually. We got a call from one of our clients, which is a very big financial institution. We have a deterministic pipeline, which was constructed using Azure AI Studio, in fact, and we worked together with a very core Microsoft team, actually, to make sure that it is

    [00:52:36] Umesh Rajiani: properly deterministic, because there are some legal implications and everything. And then the tool started failing, because we had some function calling which would actually go into the knowledge base of the company. And that function calling was getting triggered using, what you'd call, deterministic intent from the user's prompts, basically.

    [00:52:56] Umesh Rajiani: And that entire function calling was failing. Now, we carried out all types of work and everything; it was very frantic, because it was a front-end tool and it started having some impact. And remember, it had been working for six months. It worked without any problems for six months and suddenly it just stopped working.

    [00:53:14] Umesh Rajiani: And the reason was that there were two words in the definition of the tool. That definition of the tool was actually informing the pipeline what the tool is all about, and that's how the tool was getting invoked, and those two words were basically getting flagged by the OpenAI API.

    [00:53:32] Umesh Rajiani: Well, the Azure OpenAI API, not OpenAI's direct API. We are routing it through Azure, and it's a separate instance of GPT-4, and there are separate guidelines. They mimic some of the guidelines that are there in OpenAI, but Microsoft has its own guidelines, and they changed the guidelines without actually informing the clients. That basically triggered it. So we literally had legal people in and literally had a fight, an open fight, literally, with Microsoft. If you were in that room, you would have seen it. It was really bad. And then eventually there were talks about legal cases and stuff like that.
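
    To make the failure mode concrete, here is a minimal, hypothetical sketch of the kind of call involved, assuming the openai Python SDK's AzureOpenAI client. The endpoint, API version, deployment name, and tool definition are placeholders, not the actual pipeline discussed; the point is that the tool description travels with every request, so a provider-side content filter that starts flagging words in it breaks the whole flow.

```python
# Hypothetical sketch: an Azure OpenAI chat call with a function/tool
# definition, and handling for the request being rejected upstream
# (e.g. by a content filter) rather than failing silently.
import os
from openai import AzureOpenAI, BadRequestError

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",                 # placeholder; use whatever version your deployment supports
)

tools = [{
    "type": "function",
    "function": {
        "name": "search_knowledge_base",      # illustrative tool, not the real one
        "description": "Look up internal policy documents relevant to the user's question.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

try:
    resp = client.chat.completions.create(
        model="gpt-4",                         # your Azure deployment name
        messages=[{"role": "user", "content": "What is our leave policy?"}],
        tools=tools,
    )
    print(resp.choices[0].message)
except BadRequestError as e:
    # The tool description is part of the request payload, so a changed
    # moderation policy can reject the whole call. Surface it loudly.
    print("Request rejected, possibly by a content filter:", e)
```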

    [00:54:08] Umesh Rajiani: And eventually, this company is actually modifying its contract with Microsoft, where Microsoft will be liable to inform the company before they change any kind of guidelines. And what happened after that is the beauty of it, because at the beginning of my startup, like the beginning of the year, we implemented some solutions where we have a direct contract with Microsoft, and we implemented those solutions on the backing of those contracts.

    [00:54:34] Umesh Rajiani: So in the last two days, actually, I've gone back to those clients for whom we've implemented solutions, so that they have a direct contract with Microsoft, because we don't want to be a party involved as far as the SLAs are concerned, because this is very dangerous if you're developing solutions for

    [00:54:49] Umesh Rajiani: people, and the core solution through which you are driving the entire application pipeline is getting changed without any kind of data-contract backing, so to say. Yeah, this is a great learning for us, and I've always been a proponent of open source solutions, and I think this has given one more booster to us, because now we can go back to new clients and say, hey guys, if possible, if we can give you the kind of solution that you're looking for, then let's go with an open source solution rather than a closed source solution.

    [00:55:20] Umesh Rajiani: So

    [00:55:20] Alex Volkov: And this is a huge reason why, right? It's very interesting. In this area, like we mentioned, and definitely feel free to chime in on this a little bit more, the outputs of LLMs are usually non-deterministic, and so that has to be built into your understanding when you build tools on top of them.

    [00:55:36] Alex Volkov: But this is not that. This is them adding, not a different model or something you can switch, but something in between, some policy thing, without announcing it to the customers. And supposedly, if you go to Azure instead of OpenAI, for example, you would go for the most stability, as underlined by the fact that when OpenAI had downtime after Dev Day, the Microsoft Azure GPT-4 endpoints were all fine.

    [00:56:02] Alex Volkov: They were all green, right? So supposedly you would go for the stability and the corporate backing. There are also different ISO certifications and HIPAA compliance, all these things that Microsoft Azure offers on top of OpenAI. But here we have a case that underlines how

    [00:56:17] Alex Volkov: important open models that you host yourself are, even if you host them maybe on Azure as well, because then nobody can change the moderation endpoints on you and suddenly decide that a few words in your prompt are not to be used anymore.

    [00:56:32] Umesh Rajiani: Yeah, but Alex, this had nothing to do with the prompt, actually. It was the definition of the function that was there. And the key is, I would draw an analogy to what you call data contracts. I don't know how many people are aware of data contracts, but when you have

    [00:56:47] Umesh Rajiani: ownership of data within a very large organization, let's say 20,000 or 30,000 people and up, you have data contracts, where the data originates from a particular source and some other division is using that data. So you have a contract between those two, and that data contract details the data definitions, and the signatory of the contract is responsible for ensuring that if they change any kind of data structure or data definition,

    [00:57:14] Umesh Rajiani: then the receiver of the data, or the client of the data contract, is supposed to be informed. That is part of your data contract, and that's how these large organizations function. And what we need is that kind of framework, where you have a data contract with the service provider.

    [00:57:30] Umesh Rajiani: So even if you're going with an open source solution, if your open source solution is hosted by someone, then you need to have that kind of contract in place. So it's not that open source is a solution for everything; it's about who is providing the inference. If you are controlling the inference, then you are secure, because you are not going to make changes without understanding the repercussions of those changes.

    [00:57:52] Umesh Rajiani: But if you are, let's say, hosting an open source model on Amazon Bedrock, for example, and they have a system prompt that sits in front of your prompt before it goes to the model, then you have to make sure that Amazon adheres to its responsibility in terms of giving you the required inference.
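
    As a thought experiment on what such an "inference contract" could look like in code, here is a hedged sketch; the field names are illustrative assumptions, not any existing standard, product, or what Umesh's clients actually signed.

```python
# Hypothetical machine-readable "inference contract": pin the model and
# moderation-policy versions and record the provider's notification duty.
from dataclasses import dataclass, field
from datetime import timedelta

@dataclass
class InferenceContract:
    provider: str                        # e.g. "Azure OpenAI" or a self-hosted cluster
    consumer: str                        # the downstream application relying on the endpoint
    model_version: str                   # pinned model / deployment snapshot
    moderation_policy_version: str       # content-filter or guideline version in force
    change_notice_period: timedelta      # how far ahead the provider must announce changes
    breaking_change_requires_signoff: bool = True
    contacts: list[str] = field(default_factory=list)

contract = InferenceContract(
    provider="Azure OpenAI",
    consumer="frontline-assistant",
    model_version="gpt-4-0613",
    moderation_policy_version="2024-01-10",
    change_notice_period=timedelta(days=30),
    contacts=["platform-team@example.com"],
)
```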

    [00:58:12] Alex Volkov: Absolutely. Thanks for walking us through it. First of all, it sucks that it happened, and hopefully now Microsoft, like you said, has changed their approach here. Nisten, go ahead if you want to follow up.

    [00:58:26] Nisten Tahiraj: Yeah. So for us, this has been amazing. I already have clients lining up to pay for the BakLLaVA API, so I'll just say that first, before it's even out. However, it is extremely unfortunate for those that built, let's say, apps for a hospital or for a therapist, because those kinds of applications just had a moderation engine added, apparently for their safety, and now whoever was relying on these applications, they just stop working.

    [00:59:02] Nisten Tahiraj: Out of nowhere. And this is an extremely immature thing to do; this is something you expect from a random startup with kids, not from freaking Microsoft. And it is pretty worrisome that this safety hysteria has gotten to the point where you're literally breaking medical applications in production without notifying people.

    [00:59:27] Nisten Tahiraj: That's just, you've lost people's trust now. You're not going to gain that back for a couple of years. And I hope they realize that and don't do this again. Don't break production and make changes on people in prod who are relying on this for things like SOC 2, or, as in Umesh's case, who have signed service level agreements.

    [00:59:49] Nisten Tahiraj: Because now those people lose all their money if they can't provide the service. And it's really bad. That's all I have to say. It's pretty bad.

    [00:59:58] Alex Volkov: Yep, very bad look from Microsoft. I remember even OpenAI, when they talked about sunsetting some models, there was a developer outcry that said, hey, we use those, we haven't had time to change how we work with different prompts, et cetera, for the newer models.

    [01:00:15] Alex Volkov: And so OpenAI actually went back and said, hey, we heard you, and deprecations are going to be pre-announced well in advance, the window is going to be way longer. Umesh, yeah, let's go ahead.

    [01:00:27] Umesh Rajiani: Yeah, very quickly, I think you've raised a very valid point, Alex. All the models that they actually put out of service, they should make open source. I think that's the best solution.

    [01:00:39] Alex Volkov: Nah, I wish that were the case. We're still waiting for a potentially open source GPT 2.5. We haven't seen any open sourcing from OpenAI for a while, besides some GitHub code. I agree with you, there should be a way for folks to keep doing the same exact thing they're doing.

    [01:00:52] Alex Volkov: I don't know; in my example, I use Whisper, and no matter what their API says, whatever they deem inappropriate to translate, the Whisper that I use is self-hosted and it will stay the same version until I decide to change it and test everything. All right, folks, we're moving forward, I think, just quickly.
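
    As an aside on that point about hosting a pinned Whisper yourself: with the openai-whisper package this is a couple of lines; a minimal sketch, with the checkpoint choice and the audio path as placeholders.

```python
# Self-hosting a pinned Whisper checkpoint: the model weights are cached
# locally, so no upstream policy or model change can alter your transcripts.
import whisper

model = whisper.load_model("large-v2")            # pinned checkpoint of your choosing
result = model.transcribe("thursdai_episode.mp3") # placeholder audio file
print(result["text"][:200])
```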

    [01:01:10] Alex Volkov: There's not a lot of stuff in the vision area. I will mention briefly that we've been here for more than an hour, so I'll recap the space a little bit. If you're just joining, let me play the music and then I'll recap, and then we'll get into the interview. So, an hour and fifteen in, you're listening to ThursdAI. Those of you who just joined us, welcome. If you haven't been here before, this is a weekly space all about AI and open source, and as our friend of the pod, Yam, just tweeted out, everybody and anybody in the LLM and open source space is in here, which is very great to see.

    [01:01:45] Alex Volkov: We've covered open source stuff, we've covered corporate drama right now, and then we're moving on to an interview. Thank you.

    [01:01:53] This weeks Buzz from Weights & Biases

    [01:01:53] Alex Volkov: And then we're going to talk about AI, art, and diffusion, if we're going to have time at the end of this. There's a brief mention that I want to say, but basically, let me just reintroduce myself.

    [01:02:01] Alex Volkov: My name is Alex Volkov. I'm the AI Evangelist with Weights & Biases, and we have a small segment here for Weights & Biases that I want to bring. I just came back a few days ago from a San Francisco hackathon that we sponsored together with Together AI and LangChain. It was a pretty cool hackathon.

    [01:02:20] Alex Volkov: It was very brief, like a few hours, at AGI House. But basically the theme was RAG versus fine-tune, and I think the key word was versus, and I promised I'd bring some learnings from this. So there were a bunch of projects that did different things. They used Together's endpoint for fine-tuning.

    [01:02:35] Alex Volkov: If you can fine-tune on your own models and your own GPUs, that's one thing, but for many AI engineers that's very difficult to do. So there are a bunch of startups, Together is one, that offer very simple fine-tuning. I'll definitely add a link to the show notes to the presentation I gave there, which talks about how easy it is to fine-tune using their endpoints.

    [01:02:56] Alex Volkov: And the folks that won the hackathon, some folks won different prizes, basically used both RAG and fine-tuning. And it looks like there was also a paper released afterwards from some folks trying to identify what's better: is it doing RAG on top of fine-tuned models, or just doing basic RAG?

    [01:03:13] Alex Volkov: And I don't think we have a clear answer yet. This hackathon definitely wasn't the be-all and end-all of answers. However, it does look like doing RAG on top of a fine-tuned model improves just a little bit over basic RAG. And it looks like RAG wins over a regular fine-tune for information retrieval tasks as well.
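
    For readers who want the shape of the thing being compared: a minimal RAG sketch, where retrieval happens outside the model and the retrieved context is prepended to the prompt regardless of whether the generator underneath is a base or fine-tuned model. The embedding model and the toy documents are arbitrary choices for illustration, not what any hackathon team used.

```python
# Minimal RAG: embed documents, retrieve the closest ones to the query,
# and stuff them into the prompt that goes to whichever LLM you're testing.
from sentence_transformers import SentenceTransformer, util

docs = [
    "Weights & Biases lets you log and compare fine-tuning runs.",
    "RAG retrieves relevant documents and adds them to the prompt.",
    "SLERP merges interpolate two models' weights on a hypersphere.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")     # small, common default embedder
doc_emb = embedder.encode(docs, convert_to_tensor=True)

query = "How do I compare fine-tuning experiments?"
query_emb = embedder.encode(query, convert_to_tensor=True)
top = util.cos_sim(query_emb, doc_emb)[0].topk(2)      # indices of the 2 most similar docs

context = "\n".join(docs[i] for i in top.indices.tolist())
prompt = f"Answer using the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
# `prompt` then goes to the generator under test: base model, fine-tuned model, or both.
print(prompt)
```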

    [01:03:30] Alex Volkov: So definitely do not skip RAG. And I think from the open source perspective, which we love here on ThursdAI, getting more RAG-related models is definitely going to happen. I think we saw some from Jon Durbin, and I think I saw Teknium mention something about function calling.

    [01:03:47] Alex Volkov: Datasets are coming from Nous as well. So that area is definitely still to be explored. But it looks like the combination of fine-tuning and RAG wins just a little bit over basic RAG; I think that's the outcome of that hackathon. Next week in this W&B corner there's going to be an interview with Jason.

    [01:04:06] Alex Volkov: Stay tuned for that.

    [01:04:07] BREAKING NEWS - Meta announces Llama 3 is training and will be open source

    [01:04:07] Alex Volkov: I think now we have, and many folks have been DMing me because right now we have breaking news. Breaking news actually happening right now.

    [01:04:17] Sounds: AI breaking news coming at you, only on ThursdAI.

    [01:04:27] Alex Volkov: You know I love to use this sound, everyone. We have some updates from Big Zuck. I don't know if you guys saw this, because it's over on Threads, and I don't know how many of us are on Threads; I definitely know that I barely go there. We have some updates from Big Zuck, specifically around training Llama 3.

    [01:04:43] Alex Volkov: There are key updates about the long-term vision. I think the summary is: they have an insane amount of GPUs this year. Literally, he says that at the end of this year they'll have around 350,000 NVIDIA H100s. I'm going to repeat this slowly for the people in the back: 350,000 NVIDIA H100s, and overall 600,000 H100s or equivalents of compute if you include other GPUs.

    [01:05:13] Alex Volkov: You remember those hats that people wear, the GPU-poor, GPU-rich hats? I think Zuck can stack the GPU-rich hats one on top of the other and it still won't be enough, because 600,000 H100s of compute is just ridiculous. And he talks about two major parts of their vision: AI and the Metaverse are connected.

    [01:05:32] Alex Volkov: I love how it was the Metaverse, and then suddenly AI started being a thing, and now, oh, they're connected. I definitely expect AI to exist in some form of virtual world, et cetera. But definitely, he talks about Llama 3, and Llama 3 is coming. They're currently training it, per Big Zuck.

    [01:05:48] Alex Volkov: We knew that was coming, or at least we expected this, but I think now it's more of a confirmation. And I'm very excited about Llama 3. I will just mention that it hasn't been a year since Llama 1 yet. We're in January; Llama was released around February 12th or 13th or so.

    [01:06:06] Alex Volkov: It hasn't even been a year, and here we are training the third Llama model. We've had just an incredible amount of innovation on top of it. So definitely expecting more, and we're obviously going to cover this as much as possible. So that's, I think, most of it.

    [01:06:23] Alex Volkov: Oh, and the last thing Zuck added, and I think it's relevant to ThursdAI as well, where we have to start talking about hardware, is that he says, I think lots of people will talk to AIs frequently throughout the day using smart glasses, like what we're building with Ray-Ban Meta.

    [01:06:38] Alex Volkov: And I think we've talked about their smart glasses, that they're multimodal glasses. They have a camera built into them; you can press a button and actually pass the image into the LLM. They're making improvements in speed as well. I will say just one additional thing: we've talked about how Meta is adding a bunch of AI into every chat and nobody necessarily used them.

    [01:06:58] Alex Volkov: Recently a friend of mine, maybe because I'm an AI evangelist so he felt free to do this in our chats, just added an AI bot to our chat. Literally, just my DM with a friend who has nothing to do with AI, it's not part of his world, he does something else. Recently he's like, hey, let me add this thing.

    [01:07:14] Alex Volkov: So Meta is definitely letting folks experiment with AI more than some other places. And he just added the AI to our chat; it was super cool. So here's the update from Big Zuck: Llama 3 is training, and they have a lot of GPUs. They're super GPU-rich, and hopefully we'll get the benefit.

    [01:07:30] Alex Volkov: Go ahead, Nisten. Yeah,

    [01:07:36] Nisten Tahiraj: H100s? Yeah, they're going to need that if they're going to have visual stuff coming in from people's glasses. But it's an insane amount, that's all. Yeah, I just ran some quick calculations and got roughly similar numbers. I'm running some numbers based off the alleged GPT-4 leaks of the amount of GPU hours it might take, let's say if they used all those Meta GPUs.

    [01:08:08] Nisten Tahiraj: To do a GPT-4 level model, I'm getting numbers where it would take less than a week, pretty much, to train. Yeah, this is an insane amount of GPUs, for people that don't have good references for this. Yeah.
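
    A rough version of that back-of-the-envelope math, where every input is either an alleged leak figure or a round assumption (the cluster size, the duration, and the H100-vs-A100 speedup are all unconfirmed):

```python
# Back-of-the-envelope only: all inputs are alleged or assumed, not confirmed.
alleged_a100s = 25_000      # alleged GPT-4 training cluster size
alleged_days = 95           # alleged training duration in days
h100_speedup = 3.0          # assumed effective H100-vs-A100 throughput multiplier

a100_days_total = alleged_a100s * alleged_days        # ~2.4M A100-days of work
meta_h100s = 350_000
a100_equivalents = meta_h100s * h100_speedup          # ~1.05M A100-equivalents running in parallel

days_needed = a100_days_total / a100_equivalents
print(f"{days_needed:.1f} days")                      # ~2.3 days under these assumptions
```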

    [01:08:18] Alex Volkov: I think it's insane enough to maybe open a new category on top of GPU-rich. It's just quite incredible, and hopefully they're committed to open sourcing Llama 3. Umesh, you had a comment as well?

    [01:08:32] Umesh Rajiani: Yeah, what if Llama 3 is going to be multimodal? Then they will need those GPUs.

    [01:08:37] Alex Volkov: I'm really hoping it will be. They're training the models, and multimodality is something they've talked about. It's time to move towards the LMM world and multimodality, and they will need all those GPUs to crank out the vision part of this. Hopefully it's multimodal in other areas too; as a reminder, Meta has released a bunch of attempts at multimodality in other areas, not only images.

    [01:08:59] Alex Volkov: IMU motion units, and they've talked about fMRI signals, like incredible stuff. But definitely other modalities, like sound, like audio. Live video would be super cool; I think this year is the year of live video. So hopefully not only vision, and if it's vision, then hopefully it's live video.

    [01:09:18] Alex Volkov: Alright folks, we're coming up on two hours,

    [01:09:20] Alex Volkov: and with that, I think this is the summary of today's ThursdAI. Thank you everyone for joining. If you haven't subscribed yet, definitely feel free to subscribe at thursdai.news. I appreciate everyone's time and attention here. Thank you so much to the co-hosts and guests for today's pod, and shalom to everyone.

    [01:09:36] Alex Volkov: And I have to end this on the very happy note of the alchemy thing, because one thing that came out of the conversation with Maxime, who merges models, and Nisten and everyone, is that a lot of this is alchemy: a lot of this is trying to see how things work when you combine models instead of continuing to train them, and they still perform better.

    [01:09:55] Alex Volkov: So I have to end on this very happy tune, which will represent the alchemy that we're all doing. And we love it. Thank you everyone for joining this Thursday. I will see you next week. Cheers. And we'll add this banger to the show notes as well. Bye everyone.



    This is a public episode. If you’d like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe