Episodes
-
In this episode, we discuss Transformers need glasses! Information over-squashing in language tasks by Federico Barbero, Andrea Banino, Steven Kapturowski, Dharshan Kumaran, João G. M. Araújo, Alex Vitvitskyi, Razvan Pascanu, Petar Veličković. The paper explores how information propagates in decoder-only Transformers, revealing a phenomenon where different input sequences can result in nearly identical final token representations. This issue, worsened by low-precision floating-point formats, impairs the model’s ability to distinguish between these sequences, leading to errors in specific tasks. The authors provide theoretical and empirical evidence of this problem and suggest simple solutions to mitigate it.
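As a toy illustration of the precision issue described above (not the paper's experiment), the sketch below averages a sequence of ones followed by a single zero and rounds the result to half precision. Uniform averaging as a stand-in for attention, and the helper names `to_float16`, `mean_representation`, and `distinguishable`, are assumptions made for this example.

```python
import struct


def to_float16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def mean_representation(num_ones: int) -> float:
    """Average of `num_ones` ones followed by a single zero.

    Uniform averaging is an assumed stand-in for attention.
    """
    return num_ones / (num_ones + 1)


def distinguishable(n: int) -> bool:
    """True if sequences with n and n + 1 ones still differ after float16 rounding."""
    return to_float16(mean_representation(n)) != to_float16(mean_representation(n + 1))


# Short sequences stay apart (0.75 vs 0.8); long ones collapse to the same float16 value.
short_ok = distinguishable(3)
long_ok = distinguishable(1000)
```

In double precision the two long sequences still differ; only the rounding to float16 makes them identical, which mirrors the role of low-precision formats in the summary.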
-
In this episode, we discuss Show, Don't Tell: Aligning Language Models with Demonstrated Feedback by Omar Shaikh, Michelle Lam, Joey Hejna, Yijia Shao, Michael Bernstein, Diyi Yang. The paper introduces Demonstration ITerated Task Optimization (DITTO), a method for customizing language model outputs using fewer than ten demonstrations as feedback. DITTO, based on online imitation learning, aligns the model's outputs to user-specific behavior by generating comparison data iteratively. DITTO outperforms existing methods like few-shot prompting and supervised fine-tuning by an average of 19% in matching fine-grained styles and tasks.
-
In this episode, we discuss TextGrad: Automatic "Differentiation" via Text by Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guestrin, James Zou. The paper introduces TEXTGRAD, a novel framework that automates the optimization of compound AI systems by utilizing textual feedback from large language models (LLMs). TEXTGRAD treats text feedback as a form of "differentiation" to improve the components of these AI systems across various applications, working out-of-the-box without requiring specific tuning. Demonstrating its effectiveness, TEXTGRAD enhances performance in diverse tasks such as question answering, coding problem solutions, molecule design, and treatment planning, marking a significant step forward for the development of advanced AI technologies.
-
In this episode, we discuss SaySelf: Teaching LLMs to Express Confidence with Self-Reflective Rationales by Tianyang Xu, Shujin Wu, Shizhe Diao, Xiaoze Liu, Xingyao Wang, Yangyi Chen, Jing Gao. The paper introduces SaySelf, a framework for training large language models (LLMs) to produce accurate, fine-grained confidence estimates and self-reflective rationales explaining their uncertainties. This is achieved by analyzing inconsistencies in multiple reasoning chains, summarizing uncertainties in natural language, and applying supervised fine-tuning alongside reinforcement learning to calibrate confidence levels. Experimental results show that SaySelf effectively reduces confidence calibration errors and maintains task performance, enhancing LLMs' reliability by mitigating overconfidence in erroneous outputs.
-
In this episode, we discuss Open-Endedness is Essential for Artificial Superhuman Intelligence by Edward Hughes, Michael Dennis, Jack Parker-Holder, Feryal Behbahani, Aditi Mavalankar, Yuge Shi, Tom Schaul, Tim Rocktäschel. The paper argues that the development of open-ended, self-improving AI systems is achievable using current foundation models trained on extensive internet data. It provides a formal definition of open-endedness based on novelty and learnability and suggests a path to artificial superhuman intelligence (ASI) through such systems. The paper emphasizes the importance of considering safety in the development of these highly capable and open-ended AI systems.
-
In this episode, we discuss To Believe or Not to Believe Your LLM by Yasin Abbasi Yadkori, Ilja Kuzborskij, András György, Csaba Szepesvári. The study investigates uncertainty quantification in large language models (LLMs), focusing on identifying when epistemic uncertainty is large, which flags unreliable outputs and potential hallucinations. By employing an information-theoretic metric and a method of iterative prompting based on prior responses, the approach detects high-uncertainty scenarios, particularly in distinguishing between cases with single and multiple possible answers. The proposed method outperforms standard strategies and highlights how iterative prompting influences the probability assignments of LLM outputs.
-
In this episode, we discuss Similarity is Not All You Need: Endowing Retrieval Augmented Generation with Multi Layered Thoughts by Chunjing Gan, Dan Yang, Binbin Hu, Hanxiao Zhang, Siyuan Li, Ziqi Liu, Yue Shen, Lin Ju, Zhiqiang Zhang, Jinjie Gu, Lei Liang, Jun Zhou. The paper introduces METRAG, a novel Multi-layered Thought enhanced Retrieval-Augmented Generation framework designed to improve the performance of LLMs in knowledge-intensive tasks. Unlike traditional models that solely rely on similarity for document retrieval, METRAG combines similarity-oriented, utility-oriented, and compactness-oriented thoughts to enhance the retrieval and generation process. The framework has shown superior results in various experiments, addressing concerns about knowledge update delays, cost, and hallucinations in LLMs.
-
In this episode, we discuss Contextual Position Encoding: Learning to Count What's Important by Olga Golovneva, Tianlu Wang, Jason Weston, Sainbayar Sukhbaatar. The paper introduces Contextual Position Encoding (CoPE), a new position encoding method for Large Language Models (LLMs) that incrementally alters position based on context rather than just token count. This approach enables more sophisticated addressing, such as targeting specific types of words or sentences, beyond the capabilities of current token-based methods. Through experiments, CoPE demonstrates improved performance on tasks like selective copy, counting, and Flip-Flop, as well as enhancements in language modeling and coding task perplexity.
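One way to picture a position that depends on context rather than token count is to let each token contribute a gate value and measure position as the sum of gates between a token and the current one. The sketch below assumes hard 0/1 gates that fire on sentence-ending periods; the function name `contextual_positions` and the gate rule are illustrative assumptions, not the paper's learned gating.

```python
def contextual_positions(gates):
    """Position of each token relative to the last token in `gates`.

    Position of token j is the sum of gates[j], ..., gates[-1], so only
    tokens whose gate is "on" advance the count.
    """
    positions = []
    running = 0
    for gate in reversed(gates):
        running += gate
        positions.append(running)
    positions.reverse()
    return positions


tokens = ["a", "b", ".", "c", ".", "d"]
# Assumed gate rule for illustration: count only sentence-ending periods.
gates = [1 if tok == "." else 0 for tok in tokens]
positions = contextual_positions(gates)  # [2, 2, 2, 1, 1, 0]
```

Here every token in the same sentence shares a position, so attending to "position 1" means attending to the previous sentence regardless of how many tokens it contains.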
-
In this episode, we discuss Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis by Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, Peixian Chen, Yanwei Li, Shaohui Lin, Sirui Zhao, Ke Li, Tong Xu, Xiawu Zheng, Enhong Chen, Rongrong Ji, Xing Sun. The paper introduces Video-MME, a comprehensive benchmark for evaluating Multi-modal Large Language Models (MLLMs) in video analysis, which assesses capabilities across diverse video types, durations, and data modalities with high-quality annotations. Their experiments show commercial models like Gemini 1.5 Pro outperform open-source counterparts and highlight the significant impact of subtitles and audio on video understanding, along with a noted drop in model performance with longer videos. The findings emphasize the need for improvements in handling extended sequences and multi-modal data, driving future advancements in MLLM capabilities.
-
In this episode, we discuss VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos by Ziyang Wang, Shoubin Yu, Elias Stengel-Eskin, Jaehong Yoon, Feng Cheng, Gedas Bertasius, Mohit Bansal. The paper introduces VideoTree, a novel framework that enhances the efficiency and accuracy of long-video question answering by selectively extracting and hierarchically organizing frames based on their relevance to the query. Unlike traditional methods that rely on dense and often redundant sampling of frames for LLM-based reasoning, VideoTree employs a dynamic, adaptive approach to identify and caption keyframes, forming a tree structure that reflects varying levels of detail where needed. Experiments demonstrate significant performance improvements and reduced inference times on benchmarks like EgoSchema, NExT-QA, and IntentQA.
-
In this episode, we discuss CinePile: A Long Video Question Answering Dataset and Benchmark by Ruchit Rawal, Khalid Saifullah, Ronen Basri, David Jacobs, Gowthami Somepalli, Tom Goldstein. CinePile is a new dataset and benchmark designed for authentic long-form video understanding, addressing the limitations of current datasets. It comprises 305,000 multiple-choice questions (MCQs) spanning various visual and multimodal aspects. The evaluation of recent state-of-the-art video-centric large language models (LLMs) shows a significant gap between machine and human performance in these complex tasks.
-
In this episode, we discuss Dataset Decomposition: Faster LLM Training with Variable Sequence Length Curriculum by Hadi Pouransari, Chun-Liang Li, Jen-Hao Rick Chang, Pavan Kumar Anasosalu Vasu, Cem Koc, Vaishaal Shankar, Oncel Tuzel. The paper introduces a novel variable sequence length training technique called dataset decomposition to address inefficiencies in training large language models (LLMs) with fixed-length token sequences. It divides the dataset into buckets of sequences of the same size from unique documents and samples from these buckets with a curriculum during training, leading to computational savings and higher efficiency. This approach achieves target accuracy three times faster than traditional methods and enhances performance on standard language evaluations and long-context benchmarks.
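The bucketing step can be sketched as follows. The summary only says that buckets hold same-size sequences drawn from single documents; the power-of-two chunk sizes, the `max_len` cap, and the names `decompose` and `build_buckets` are assumptions made for this illustration, and the curriculum over buckets is not shown.

```python
from collections import defaultdict


def decompose(tokens, max_len=8):
    """Split one document into chunks whose lengths are powers of two.

    `max_len` must itself be a power of two (assumed cap on chunk length).
    """
    chunks = []
    start = 0
    remaining = len(tokens)
    size = max_len
    while remaining > 0:
        if size <= remaining:
            chunks.append(tokens[start:start + size])
            start += size
            remaining -= size
        else:
            size //= 2  # fall back to the next smaller power of two
    return chunks


def build_buckets(documents, max_len=8):
    """Group chunks by length; every chunk comes from exactly one document."""
    buckets = defaultdict(list)
    for doc in documents:
        for chunk in decompose(doc, max_len):
            buckets[len(chunk)].append(chunk)
    return dict(buckets)


docs = [list(range(13)), list(range(100, 103))]
buckets = build_buckets(docs)
chunk_lengths = [len(c) for c in decompose(docs[0])]  # [8, 4, 1]
```

Because no chunk spans two documents, a batch drawn from one bucket needs no padding and no cross-document attention masking.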
-
In this episode, we discuss SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering by John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, Ofir Press. The paper introduces SWE-agent, an autonomous system leveraging a language model to tackle software engineering tasks through a specialized agent-computer interface (ACI). SWE-agent significantly improves task completion rates, solving 12.5% of issues on SWE-bench compared to the previous best of 3.8%. The study also examines the impact of ACI design on agent performance, offering insights into effective interface design.
-
In this episode, we discuss Octo: An Open-Source Generalist Robot Policy by Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, Jianlan Luo, You Liang Tan, Pannag Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, Sergey Levine. The paper introduces Octo, a large transformer-based policy pretrained on 800k trajectories from the Open X-Embodiment dataset, designed to be a generalist policy for robotic manipulation. Octo can be instructed via language commands or goal images and can be efficiently finetuned to new sensory inputs and action spaces on various robotic platforms. Experimental results demonstrate Octo's versatility across 9 different robotic platforms and provide detailed analyses to guide future development of generalist robot models.
-
In this episode, we discuss Layer-Condensed KV Cache for Efficient Inference of Large Language Models by Haoyi Wu, Kewei Tu. The paper addresses the significant memory consumption issue in deploying large language models by proposing a novel method that computes and caches key-value pairs for only a small number of layers, thereby saving memory and enhancing inference throughput. Experiments demonstrate that this approach achieves up to 26× higher throughput compared to standard transformers while maintaining competitive performance. Additionally, the method can be integrated with existing memory-saving techniques for further efficiency improvements.
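To see why caching fewer layers saves memory, the standard KV-cache size estimate can be worked through in a few lines. The model dimensions below (32 layers, 32 KV heads, head size 128, 4096 tokens, 2-byte values) are hypothetical and not taken from the paper, and `kv_cache_bytes` is a helper defined only for this example.

```python
def kv_cache_bytes(cached_layers, seq_len, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * cached_layers * seq_len * num_kv_heads * head_dim * bytes_per_value


# Hypothetical configuration, chosen only to make the arithmetic concrete.
full = kv_cache_bytes(cached_layers=32, seq_len=4096, num_kv_heads=32, head_dim=128)
condensed = kv_cache_bytes(cached_layers=2, seq_len=4096, num_kv_heads=32, head_dim=128)
savings_ratio = full // condensed  # 16
```

The cache grows linearly with the number of cached layers, so caching 2 layers instead of 32 cuts it by a factor of 16 in this setup. This is a memory estimate only and says nothing about the throughput figure quoted in the summary.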
-
In this episode, we discuss Observational Scaling Laws and the Predictability of Language Model Performance by Yangjun Ruan, Chris J. Maddison, Tatsunori Hashimoto. The paper introduces an observational approach to building scaling laws for language models by utilizing approximately 80 publicly available models, bypassing the need for extensive model training. It discovers that despite variations in model efficiencies, performance can be predicted using a generalized scaling law based on a low-dimensional capability space. This method demonstrates the predictability of complex scaling behaviors and the impact of interventions such as Chain-of-Thought and Self-Consistency.
-
In this episode, we discuss Pack of LLMs: Model Fusion at Test-Time via Perplexity Optimization by Costas Mavromatis, Petros Karypis, George Karypis. The paper presents PackLLM, a method for fusing knowledge from multiple Large Language Models (LLMs) during test-time by optimizing the importance of each LLM based on the input prompt to minimize perplexity. It introduces two variants: PackLLMsim, which validates perplexity as an expertise indicator, and PackLLMopt, which uses a greedy algorithm for perplexity minimization. Experiments with over 100 LLMs show that PackLLM outperforms existing test-time fusion approaches and learning-based fusers, demonstrating significant accuracy improvements.
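A minimal sketch of perplexity-based weighting, assuming each model has already scored the prompt and returned per-token log-probabilities. The softmax over negative log-perplexity and the names `perplexity` and `fusion_weights` are a simple stand-in chosen for illustration, not necessarily the rule used by either PackLLM variant.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of a prompt from its per-token natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def fusion_weights(logprobs_per_model, temperature=1.0):
    """Softmax over negative log-perplexity: lower perplexity gives a higher weight.

    This weighting rule is an assumption made for the example.
    """
    scores = [-math.log(perplexity(lp)) / temperature for lp in logprobs_per_model]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up log-probabilities for two hypothetical models on the same prompt.
confident_model = [-0.1, -0.2, -0.3]
unsure_model = [-1.0, -2.0, -3.0]
weights = fusion_weights([confident_model, unsure_model])
```

The resulting weights sum to one and could be used to mix the models' next-token distributions, with the model that finds the prompt less surprising contributing more.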
-
In this episode, we discuss The Platonic Representation Hypothesis by Minyoung Huh, Brian Cheung, Tongzhou Wang, Phillip Isola. The paper argues that representations in AI models, particularly deep networks, are converging across various domains and data modalities. This convergence suggests a movement towards a shared statistical model of reality, termed the "platonic representation." The authors explore selective pressures driving this trend and discuss its implications, limitations, and counterexamples.
-
In this episode, we discuss Many-Shot In-Context Learning in Multimodal Foundation Models by Yixing Jiang, Jeremy Irvin, Ji Hun Wang, Muhammad Ahmed Chaudhry, Jonathan H. Chen, Andrew Y. Ng. The paper examines whether fitting many more demonstration examples into the context windows of multimodal foundation models improves in-context learning (ICL). It specifically looks at the transition from few-shot to many-shot ICL, studying the impact of this scale-up using different datasets across various domains and tasks. Key findings reveal that using up to 2000 multimodal examples significantly boosts performance, indicating the potential of many-shot ICL in enhancing model adaptability for new applications and improving efficiency, with Gemini 1.5 Pro reported to achieve better results than GPT-4o.
-
In this episode, we discuss Naturalistic Music Decoding from EEG Data via Latent Diffusion Models by Emilian Postolache, Natalia Polouliakh, Hiroaki Kitano, Akima Connelly, Emanuele Rodolà, Taketo Akama. The paper explores the use of latent diffusion models to decode complex musical compositions from EEG data, focusing on music that includes varied instruments and vocal harmonics. The researchers implemented an end-to-end training method directly on raw EEG without manual preprocessing, using the NMED-T dataset and new neural embedding-based metrics for assessment. This research demonstrates the potential of EEG data in reconstructing intricate auditory information, contributing significantly to advancements in neural decoding and brain-computer interface technology.