Episodes
-
In this episode, we discuss Chameleon: Mixed-Modal Early-Fusion Foundation Models by Chameleon Team. The paper introduces Chameleon, a family of models designed to understand and generate images and text in any sequence. It achieves state-of-the-art performance in several tasks, including image captioning and text generation, and demonstrates competence in mixed-modal outputs. Notably, Chameleon is competitive with or superior to larger models like Gemini Pro and GPT-4V in various evaluations, highlighting its significance in multimodal document processing.
-
In this episode, we discuss Goldfish: Vision-Language Understanding of Arbitrarily Long Videos by Kirolos Ataallah, Xiaoqian Shen, Eslam Abdelrahman, Essam Sleiman, Mingchen Zhuge, Jian Ding, Deyao Zhu, Jürgen Schmidhuber, Mohamed Elhoseiny. The paper introduces Goldfish, a methodology designed to efficiently comprehend videos of any length by employing a retrieval mechanism that selects the top-k relevant video clips for processing. To evaluate its effectiveness, the authors present the TVQA-long benchmark aimed at long video understanding and demonstrate significant improvements over existing methods, achieving a 41.78% accuracy rate. Their MiniGPT4-Video model also excels in short video comprehension, outperforming current state-of-the-art methods on multiple benchmarks.
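The top-k retrieval idea can be illustrated with a minimal sketch. It assumes the question and each clip have already been turned into embedding vectors; the toy vectors and the `top_k_clips` helper are hypothetical and are not taken from the authors' code.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k_clips(query, clip_embeddings, k):
    """Return the indices of the k clips most similar to the query (hypothetical helper)."""
    ranked = sorted(
        range(len(clip_embeddings)),
        key=lambda i: cosine(query, clip_embeddings[i]),
        reverse=True,
    )
    return ranked[:k]

# Toy embeddings, assumed for illustration only.
clips = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
query = [1.0, 0.0]
selected = top_k_clips(query, clips, k=2)
print(selected)
```

Only the selected clips would then be passed to the video-language model, which is what keeps the cost independent of the total video length.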
-
In this episode, we discuss Masked Generative Video-to-Audio Transformers with Enhanced Synchronicity by Santiago Pascual, Chunghsin Yeh, Ioannis Tsiamas, Joan Serrà. The paper introduces MaskVAT, a video-to-audio generative model that utilizes a masked generative model alongside a high-quality general audio codec to achieve superior audio quality, semantic matching, and temporal synchronization. MaskVAT effectively addresses the synchronization issues in previous V2A models without compromising on audio quality. Empirical results demonstrate its capability to generate well-synchronized and high-quality audio that aligns with visual actions, competing with state-of-the-art non-codec generative models.
-
In this episode, we discuss Human-like Episodic Memory for Infinite Context LLMs by Zafeirios Fountas, Martin A Benfeghoul, Adnan Oomerjee, Fenia Christopoulou, Gerasimos Lampouras, Haitham Bou-Ammar, Jun Wang. The paper introduces EM-LLM, an approach that enhances large language models (LLMs) by incorporating principles of human episodic memory and event cognition, enabling them to manage extensive contexts efficiently. EM-LLM uses Bayesian surprise and graph-theoretic boundary refinement to organize token sequences into episodic events and employs a two-stage memory process for effective retrieval. Experiments demonstrate that EM-LLM outperforms existing models on various tasks and that its event segmentation aligns well with human event perception, suggesting potential for interdisciplinary AI and cognitive science research.
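A simplified sketch of surprise-based segmentation follows. It assumes per-token probabilities are already available from a language model and places a boundary wherever the surprise (negative log-probability) exceeds the mean plus `gamma` standard deviations over the whole sequence. The threshold rule, the `segment_by_surprise` name, and the toy probabilities are assumptions for illustration; the graph-theoretic refinement and the retrieval stages are omitted.

```python
import math
import statistics

def segment_by_surprise(token_probs, gamma=1.0):
    """Split token indices into events at high-surprise tokens (illustrative only)."""
    surprise = [-math.log(p) for p in token_probs]
    threshold = statistics.mean(surprise) + gamma * statistics.pstdev(surprise)
    events, current = [], []
    for idx, s in enumerate(surprise):
        # A surprising token starts a new event, unless the current event is empty.
        if s > threshold and current:
            events.append(current)
            current = []
        current.append(idx)
    if current:
        events.append(current)
    return events

# Toy probabilities: tokens 3 and 6 are much less expected than the rest.
probs = [0.9, 0.8, 0.9, 0.01, 0.9, 0.85, 0.02, 0.9]
events = segment_by_surprise(probs)
print(events)
```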
-
In this episode, we discuss Learning to (Learn at Test Time): RNNs with Expressive Hidden States by Yu Sun, Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Xiaolong Wang, Sanmi Koyejo, Tatsunori Hashimoto, Carlos Guestrin. The paper introduces Test-Time Training (TTT) layers, a new type of sequence modeling layer combining the efficiency of RNNs with the long-context performance of self-attention mechanisms. TTT layers make use of a machine learning model as their hidden state, updated through self-supervised learning iterations even on test sequences. The proposed TTT-Linear and TTT-MLP models demonstrate competitive or superior performance to both advanced Transformers and modern RNNs like Mamba, with TTT-Linear proving more efficient in certain long-context scenarios.
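The idea of a hidden state that is itself a model can be sketched in a few lines. This toy version assumes the hidden state is a square weight matrix `W`, updated with one gradient step per token on a plain reconstruction loss `||W x - x||^2`. The loss, the learning rate, and the `ttt_linear_scan` name are assumptions for illustration; the paper's actual self-supervised task and update schedule are not reproduced here.

```python
def ttt_linear_scan(tokens, lr=0.1):
    """Process tokens in order, training the hidden-state matrix W on each one."""
    dim = len(tokens[0])
    W = [[0.0] * dim for _ in range(dim)]  # hidden state: a linear model
    outputs = []
    for x in tokens:
        pred = [sum(W[i][j] * x[j] for j in range(dim)) for i in range(dim)]
        err = [pred[i] - x[i] for i in range(dim)]
        # Gradient of ||W x - x||^2 with respect to W is 2 * err * x^T.
        for i in range(dim):
            for j in range(dim):
                W[i][j] -= lr * 2.0 * err[i] * x[j]
        # The layer output uses the freshly updated hidden state.
        outputs.append([sum(W[i][j] * x[j] for j in range(dim)) for i in range(dim)])
    return outputs

# Seeing the same token repeatedly lets the hidden state reconstruct it better.
outs = ttt_linear_scan([[1.0, 0.0]] * 3)
print([round(o[0], 3) for o in outs])
```

As in an RNN, the cost per token is constant, because only `W` is carried forward rather than the full history.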
-
In this episode, we discuss Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting Region Captions by Yu-Guan Hsieh, Cheng-Yu Hsieh, Shih-Ying Yeh, Louis Béthune, Hadi Pour Ansari, Pavan Kumar Anasosalu Vasu, Chun-Liang Li, Ranjay Krishna, Oncel Tuzel, Marco Cuturi. The paper introduces a new annotation strategy termed graph-based captioning (GBC) that uses labelled graph structures to describe images more richly than plain text. GBC combines object detection and dense captioning to create a hierarchical graph of nodes and edges detailing entities and their relationships. The authors demonstrate the effectiveness of GBC by creating a large dataset, GBC10M, which significantly improves performance in vision-language models, and they propose a novel attention mechanism that uses the graph's structure for further benefits.
-
In this episode, we discuss Evaluating Human Alignment and Model Faithfulness of LLM Rationale by Mohsen Fayyaz, Fan Yin, Jiao Sun, Nanyun Peng. The paper investigates how effectively large language models (LLMs) can explain their decisions through rationales extracted from input texts. It compares two types of rationale extraction methods—attribution-based and prompting-based—finding that prompting-based rationales better align with human-annotated rationales. The study also explores the faithfulness limitations of prompting-based methods and shows that fine-tuning models on specific datasets can improve the faithfulness of both rationale extraction approaches.
-
In this episode, we discuss Detection and Measurement of Syntactic Templates in Generated Text by Chantal Shaib, Yanai Elazar, Junyi Jessy Li, Byron C. Wallace. The paper investigates syntactic features in text generated by large language models (LLMs), revealing higher rates of templated text in these models compared to human-generated text. It finds that a significant portion of these templates originate from pre-training data and remain unchanged during fine-tuning. The study demonstrates that syntactic templates can distinguish between different models and tasks, and that they serve as an effective tool for evaluating style memorization in LLMs.
-
In this episode, we discuss From Artificial Needles to Real Haystacks: Improving Retrieval Capabilities in LLMs by Finetuning on Synthetic Data by Zheyang Xiong, Vasilis Papageorgiou, Kangwook Lee, Dimitris Papailiopoulos. This paper addresses the challenge Large Language Models (LLMs) face with long-context information retrieval and reasoning. The authors propose finetuning LLMs using a synthetic dataset designed for numerical key-value retrieval tasks, resulting in significant improvements. Experiments demonstrate enhanced performance on longer-context tasks without compromising general benchmark performance, unlike other long-context augmentation methods that can provoke hallucination.
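To make the idea of a synthetic numerical key-value retrieval task concrete, here is a minimal sketch of how one training example might be generated. The prompt wording, the five-digit number range, and the `make_kv_example` helper are hypothetical; the paper's actual templates are not reproduced here.

```python
import random

def make_kv_example(num_pairs, seed=0):
    """Build one (prompt, answer) pair for key-value retrieval (hypothetical format)."""
    rng = random.Random(seed)
    keys = rng.sample(range(10_000, 100_000), num_pairs)  # unique numeric keys
    pairs = {key: rng.randrange(10_000, 100_000) for key in keys}
    target = rng.choice(keys)
    lines = [f"{key}: {value}" for key, value in pairs.items()]
    prompt = (
        "Dictionary:\n"
        + "\n".join(lines)
        + f"\nWhat is the value for key {target}?"
    )
    return prompt, str(pairs[target])

prompt, answer = make_kv_example(num_pairs=5)
print(prompt)
print("Expected answer:", answer)
```

Raising `num_pairs` lengthens the context, which is how such data can be made to stress long-context retrieval.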
-
In this episode, we discuss MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning by Xiangyu Zhao, Xiangtai Li, Haodong Duan, Haian Huang, Yining Li, Kai Chen, Hua Yang. The study presents MG-LLaVA, a multi-modal large language model designed to process both low-resolution and high-resolution images along with object-centric features for improved perception tasks. It includes a high-resolution visual encoder and a Conv-Gate fusion network to amalgamate fine-grained details with base features, enhancing object recognition using bounding box-derived data from offline detectors. Extensive benchmarking demonstrates MG-LLaVA's superior performance over comparable MLLMs, validated by evaluations using various language encoders ranging from 3.8B to 34B parameters.
-
In this episode, we discuss 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities by Roman Bachmann, Oğuzhan Fatih Kar, David Mizrahi, Ali Garjani, Mingfei Gao, David Griffiths, Jiaming Hu, Afshin Dehghan, Amir Zamir. The paper presents a novel any-to-any model that significantly extends the capabilities of existing multimodal and multitask foundation models by training on tens of highly diverse modalities, including images, text, geometric data, and more. Through discrete tokenization of various data types and co-training on large-scale datasets, the model can address three times more tasks/modalities than current models without sacrificing performance. The authors demonstrate this with a three billion parameter model, providing open access to the models and training code.
-
In this episode, we discuss VideoLLM-online: Online Video Large Language Model for Streaming Video by Joya Chen, Zhaoyang Lv, Shiwei Wu, Kevin Qinghong Lin, Chenan Song, Difei Gao, Jia-Wei Liu, Ziteng Gao, Dongxing Mao, Mike Zheng Shou. The paper discusses the development of the Learning-In-Video-Stream (LIVE) framework, which improves large multimodal models' ability to handle real-time streaming video inputs. The framework includes a training objective for continuous input, data generation for streaming dialogue, and an optimized inference pipeline, leading to enhanced performance and speed. This innovation, demonstrated through the VideoLLM-online model built on Llama-2/Llama-3, shows significant improvements in handling streaming videos and achieves state-of-the-art performance in various video-related tasks.
-
In this episode, we discuss EvTexture: Event-driven Texture Enhancement for Video Super-Resolution by Dachun Kai, Jiayao Lu, Yueyi Zhang, Xiaoyan Sun. The paper introduces EvTexture, the first video super-resolution (VSR) method using event signals specifically for enhancing texture details. The proposed method employs a new texture enhancement branch and an iterative module to progressively refine textures, leveraging the high-frequency details from event data. Experimental results demonstrate that EvTexture achieves state-of-the-art performance, significantly improving resolution and detail on datasets especially rich in textures.
-
In this episode, we discuss MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model by Muyao Niu, Xiaodong Cun, Xintao Wang, Yong Zhang, Ying Shan, Yinqiang Zheng. MOFA-Video is a novel image animation technique that produces videos from a single image using various control signals like human landmarks, manual trajectories, or another video. Unlike previous methods limited to specific motion domains or with weak control capabilities, MOFA-Video employs domain-aware motion field adapters (MOFA-Adapters) to manage generated motions. These adapters ensure temporal motion consistency by converting sparse control inputs into dense motion flows at multiple scales.
-
In this episode, we discuss An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels by Duy-Kien Nguyen, Mahmoud Assran, Unnat Jain, Martin R. Oswald, Cees G. M. Snoek, Xinlei Chen. This paper questions the necessity of locality inductive bias in modern computer vision architectures by showing that vanilla Transformers can treat each individual pixel as a token and still achieve high performance. The authors demonstrate this across three tasks: object classification, self-supervised learning via masked autoencoding, and image generation with diffusion models. Despite its computational inefficiency, this finding suggests reconsidering design principles for future neural architectures in computer vision.
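The computational inefficiency mentioned above comes from sequence length. The arithmetic below is a rough sketch that assumes a 224x224 input (an assumed size chosen for illustration, not one taken from the paper) and uses the fact that self-attention compares every token with every other token, so its cost grows with the square of the token count.

```python
def token_count(height, width, patch):
    """Number of tokens when the image is cut into patch x patch squares."""
    return (height // patch) * (width // patch)

patch_tokens = token_count(224, 224, 16)  # 16x16 patches
pixel_tokens = token_count(224, 224, 1)   # every pixel is its own token

# Pairwise attention cost scales with the square of the sequence length.
attention_ratio = (pixel_tokens / patch_tokens) ** 2

print(patch_tokens, pixel_tokens, attention_ratio)
```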
-
In this episode, we discuss Graphic Design with Large Multimodal Model by Yutao Cheng, Zhao Zhang, Maoke Yang, Hui Nie, Chunyuan Li, Xinglong Wu, Jie Shao. The paper introduces Hierarchical Layout Generation (HLG) for graphic design, which creates compositions from unordered sets of design elements, addressing limitations of the existing Graphic Layout Generation (GLG). The authors develop Graphist, a novel layout generation model that uses large multimodal models to translate RGB-A images into a JSON draft protocol specifying the design layout's details. Graphist demonstrates superior performance compared to prior models and establishes a new baseline for HLG, complemented by the introduction of multiple evaluation metrics.
-
In this episode, we discuss LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning by Dantong Niu, Yuvan Sharma, Giscard Biamby, Jerome Quenum, Yutong Bai, Baifeng Shi, Trevor Darrell, Roei Herzig. The paper introduces LLARVA, a model improved with a novel instruction-tuning method to unify various robotic tasks using structured prompts. The model utilizes 2-D visual traces to better align vision and action spaces, pre-trained on 8.5M image-visual trace pairs from the Open X-Embodiment dataset. Experiments on the RLBench simulator and a physical robot demonstrate that LLARVA outperforms several baselines and generalizes well across different robotic environments.
-
In this episode, we discuss Transformers need glasses! Information over-squashing in language tasks by Federico Barbero, Andrea Banino, Steven Kapturowski, Dharshan Kumaran, João G. M. Araújo, Alex Vitvitskyi, Razvan Pascanu, Petar Veličković. The paper explores how information propagates in decoder-only Transformers, revealing a phenomenon where different input sequences can result in nearly identical final token representations. This issue, worsened by low-precision floating-point formats, impairs the model’s ability to distinguish between these sequences, leading to errors in specific tasks. The authors provide theoretical and empirical evidence of this problem and suggest simple solutions to mitigate it.
-
In this episode, we discuss Show, Don't Tell: Aligning Language Models with Demonstrated Feedback by Omar Shaikh, Michelle Lam, Joey Hejna, Yijia Shao, Michael Bernstein, Diyi Yang. The paper introduces Demonstration ITerated Task Optimization (DITTO), a method for customizing language model outputs using fewer than ten demonstrations as feedback. DITTO, based on online imitation learning, aligns the model's outputs to user-specific behavior by generating comparison data iteratively. DITTO outperforms existing methods like few-shot prompting and supervised fine-tuning by an average of 19% in matching fine-grained styles and tasks.
-
In this episode, we discuss TextGrad: Automatic "Differentiation" via Text by Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guestrin, James Zou. The paper introduces TEXTGRAD, a novel framework that automates the optimization of compound AI systems by utilizing textual feedback from large language models (LLMs). TEXTGRAD treats text feedback as a form of "differentiation" to improve the components of these AI systems across various applications, working out-of-the-box without requiring specific tuning. Demonstrating its effectiveness, TEXTGRAD enhances performance in diverse tasks such as question answering, solving coding problems, molecule design, and treatment planning, marking a significant step forward for the development of advanced AI technologies.