Episodes
-
OpenAI's o1 is a generative pre-trained transformer (GPT) model designed for enhanced reasoning, especially in science and math. It uses a 'chain of thought' approach, spending more time "thinking" before answering, which makes it better at complex tasks. While not a successor to GPT-4o, o1 excels on scientific and mathematical benchmarks and is trained with a new optimization algorithm. Different versions, such as o1-preview and o1-mini, are available. Limitations include high computational cost, occasional "fake alignment", a hidden reasoning process, and potential replication of training data.
-
GPT-4o is a multilingual, multimodal model that can process and generate text, images, and audio, and it represents a significant advancement over previous models such as GPT-4 and GPT-3.5. GPT-4o is faster and more cost-effective, improves performance in multiple areas, and natively supports voice-to-voice interaction. Its knowledge is limited to what was available up to October 2023, and it has a context length of 128k tokens. Training GPT-4 reportedly cost more than $100 million, and the model is estimated to have around 1 trillion parameters.
-
Kimi k1.5 is a multimodal LLM trained with reinforcement learning (RL). Key aspects include: long-context scaling to 128k tokens, with performance improving as context length increases; improved policy optimization using a variant of online mirror descent; and a simple framework that enables planning and reflection without complex methods. It uses a reference policy in its off-policy RL approach, and long2short methods such as model merging and DPO to transfer knowledge from long-CoT to short-CoT models, achieving state-of-the-art reasoning performance. The model is jointly trained on text and vision data.
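As a rough illustration of one long2short technique, the sketch below merges a long-CoT and a short-CoT checkpoint by linearly interpolating their weights; the tiny model classes and the 0.5 merge ratio are placeholders for illustration, not values from the paper.

```python
# Minimal sketch of "long2short" model merging: average the weights of a
# long-CoT checkpoint and a short-CoT checkpoint. Model classes and the
# merge ratio are illustrative placeholders.
import torch
import torch.nn as nn

def merge_state_dicts(long_cot_model: nn.Module, short_cot_model: nn.Module, alpha: float = 0.5) -> dict:
    """Linearly interpolate parameters: alpha * long + (1 - alpha) * short."""
    merged = {}
    long_sd = long_cot_model.state_dict()
    short_sd = short_cot_model.state_dict()
    for name, long_param in long_sd.items():
        merged[name] = alpha * long_param + (1.0 - alpha) * short_sd[name]
    return merged

# Usage with tiny stand-in models (merging requires identical architectures).
long_model = nn.Linear(16, 16)
short_model = nn.Linear(16, 16)
merged_model = nn.Linear(16, 16)
merged_model.load_state_dict(merge_state_dicts(long_model, short_model, alpha=0.5))
```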
-
DeepSeek-R1 is a language model focused on enhanced reasoning, employing reinforcement learning (RL) and building upon the DeepSeek-V3-Base model. It uses Group Relative Policy Optimization (GRPO) to reduce computational cost by eliminating the separate critic model commonly used in algorithms such as PPO. The model uses a multi-stage training pipeline: initial fine-tuning on cold-start data, reasoning-oriented RL, supervised fine-tuning (SFT) on data collected via rejection sampling, and a final RL stage. A rule-based reward system helps avoid reward hacking, and a language-consistency reward during RL addresses language mixing. The model's reasoning capabilities are then distilled into smaller models. DeepSeek-R1 achieves performance comparable to, and sometimes surpassing, OpenAI's o1 series on various reasoning, math, and coding tasks.
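The following is a minimal sketch of the group-relative advantage that lets GRPO drop the critic: rewards for a group of sampled responses to the same prompt are normalized within the group. The reward values here are invented for illustration.

```python
# Sketch of the group-relative advantage used by GRPO: sample a group of
# responses per prompt, score each with a rule-based reward, and normalize
# rewards within the group instead of learning a separate critic/value model.
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Advantage of each response = (reward - group mean) / group std."""
    mean = rewards.mean()
    std = rewards.std()
    return (rewards - mean) / (std + eps)

# One prompt, a group of 8 sampled responses, rule-based rewards (e.g. 1.0 if
# the final answer is correct and the output follows the required format).
rewards = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0])
advantages = group_relative_advantages(rewards)
print(advantages)  # positive for correct responses, negative for incorrect ones
```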
-
Claude 3 is a family of large multimodal AI models developed by Anthropic, with a focus on safety, interpretability, and user alignment. The models, which include Opus, Sonnet, and Haiku, excel in reasoning, math, coding, and multilingual understanding. They are designed to be helpful, honest, and harmless assistants and can process text and visual inputs. Claude 3 models use Constitutional AI principles, aiming for more ethical and reliable responses. They have improved long-context comprehension and have shown strong performance in various tests, often outperforming previous Claude models and sometimes matching or exceeding GPT models on some benchmarks.
-
GPT-4, or Generative Pre-trained Transformer 4, is a large multimodal language model created by OpenAI and the fourth in the GPT series. It is a significant advancement over previous models such as GPT-3, with improvements in model size, performance, contextual understanding, and safety. GPT-4 uses the Transformer architecture, a deep learning design that has revolutionized natural language processing. It can process both text and images, and it has a larger context window than GPT-3, enabling it to handle longer documents and more complex tasks. GPT-4 was trained on a combination of publicly available data and licensed third-party data, then fine-tuned using reinforcement learning from human feedback (RLHF). It also has improved reasoning and generalization abilities, making it more reliable for advanced and specialized applications.
-
Training large language models (LLMs) is challenging because of the large amount of GPU memory and the long training times required. Several parallelism paradigms enable training across multiple GPUs, and various model-architecture and memory-saving designs make it possible to train very large neural networks. Techniques such as data parallelism, model parallelism, pipeline parallelism, and tensor parallelism distribute the training workload; other strategies, including CPU offloading, activation recomputation, mixed-precision training, and compression, save memory. On the compute-optimal side, model size and the number of training tokens should be scaled in equal proportion, with a doubling of model size requiring a doubling of training tokens; by this measure, current large language models are significantly under-trained.
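As a toy illustration of one of these paradigms, the sketch below mimics tensor (intra-layer) parallelism by splitting a weight matrix column-wise across two hypothetical devices and checking that the concatenated partial outputs match the unsharded result. Shapes are arbitrary.

```python
# Toy illustration of tensor (intra-layer) parallelism: a linear layer's weight
# matrix is split column-wise across two "devices"; each device computes a slice
# of the output, and an all-gather reconstructs the full result.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))        # batch of 4 activations, hidden size 8
W = rng.normal(size=(8, 6))        # full weight matrix: 8 -> 6

# Shard the weight matrix column-wise across two workers.
W_shard_0, W_shard_1 = np.split(W, 2, axis=1)

# Each worker holds only its shard and computes a partial output.
y_shard_0 = x @ W_shard_0          # (4, 3) on "device 0"
y_shard_1 = x @ W_shard_1          # (4, 3) on "device 1"

# Concatenating the partial outputs matches the unsharded computation.
y_parallel = np.concatenate([y_shard_0, y_shard_1], axis=1)
assert np.allclose(y_parallel, x @ W)
```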
-
MiniMax-01 is a series of large language and vision-language models that use lightning attention for long-context processing and a mixture of experts (MoE) for efficient scaling. The models, MiniMax-Text-01 and MiniMax-VL-01, match the performance of top-tier models such as GPT-4o and Claude-3.5-Sonnet while offering 20 to 32 times longer context windows, reaching up to 4 million tokens at inference. They use a hybrid architecture that mixes linear and softmax attention mechanisms and are trained on large datasets of text, code, and image-caption pairs. A multi-stage training process with supervised fine-tuning and reinforcement learning optimizes their capabilities in long-context and real-world scenarios.
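For intuition about the hybrid design, the sketch below contrasts standard softmax attention with a generic linear-attention formulation; this is a textbook form, not MiniMax's actual lightning-attention kernel, and the feature map and shapes are illustrative assumptions.

```python
# Rough sketch contrasting softmax attention (quadratic in sequence length)
# with a generic linear-attention formulation (linear in sequence length).
import numpy as np

def softmax_attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])          # (n, n): quadratic cost
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0) + 1e-6):
    # Apply a positive feature map, then reassociate: phi(Q) @ (phi(K)^T V).
    # The (d, d) state phi(K)^T V is built once, so cost is linear in n.
    kv_state = phi(K).T @ V                                      # (d, d)
    normalizer = phi(Q) @ phi(K).T.sum(axis=1, keepdims=True)    # (n, 1)
    return (phi(Q) @ kv_state) / normalizer

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(16, 8)) for _ in range(3))
print(softmax_attention(Q, K, V).shape, linear_attention(Q, K, V).shape)
```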
-
DeepSeek-V3 is a large Mixture-of-Experts (MoE) language model with 671 billion total parameters, of which 37 billion are activated per token, trained at roughly a tenth of the cost of comparable models. It uses Multi-head Latent Attention (MLA) and the DeepSeekMoE architecture. Key features of DeepSeek-V3 are its auxiliary-loss-free load-balancing strategy and its multi-token prediction training objective. The model was pre-trained on 14.8 trillion tokens and then underwent supervised fine-tuning and reinforcement learning. It has demonstrated strong performance on various benchmarks, achieving results comparable to leading closed-source models while keeping training costs economical.
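The sketch below shows the general idea behind MoE layers, where only a fraction of parameters is active per token; it is a generic top-k router with toy dimensions, not DeepSeek-V3's actual implementation (which adds MLA and auxiliary-loss-free balancing).

```python
# Minimal sketch of MoE routing: each token is sent to its top-k experts, so
# only a small subset of expert parameters is activated per token.
import torch
import torch.nn as nn

class ToyMoELayer(nn.Module):
    def __init__(self, dim=32, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                       # x: (tokens, dim)
        gate_logits = self.router(x)            # (tokens, num_experts)
        weights, idx = gate_logits.topk(self.top_k, dim=-1)
        weights = torch.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):             # route each token to its k-th chosen expert
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask][:, k : k + 1] * expert(x[mask])
        return out

tokens = torch.randn(10, 32)
print(ToyMoELayer()(tokens).shape)   # torch.Size([10, 32])
```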
-
The Tree of Thoughts (ToT) framework enhances problem-solving in large language models (LLMs) by using a structured, hierarchical approach to explore multiple solutions. ToT breaks down problems into smaller steps called "thoughts", generated via sampling or proposing. These "thoughts" are evaluated using value or voting strategies, and search algorithms like breadth-first or depth-first search navigate the solution space. This allows LLMs to backtrack and consider alternative paths, improving performance in complex decision-making tasks.
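A compact way to see the mechanics is a breadth-first search over candidate "thoughts"; in the sketch below, the generation and evaluation functions are hypothetical placeholders standing in for LLM calls, so only the search loop is real.

```python
# Minimal breadth-first Tree-of-Thoughts sketch. `propose_thoughts` and
# `score_thought` stand in for LLM calls (thought generation and value-based
# evaluation); here they are placeholders so the search loop runs.
from typing import List

def propose_thoughts(state: str, k: int = 3) -> List[str]:
    # In practice: ask the LLM for k candidate next steps given the partial solution.
    return [f"{state} -> step{i}" for i in range(k)]

def score_thought(state: str) -> float:
    # In practice: ask the LLM to rate how promising this partial solution is.
    return -len(state)  # placeholder heuristic

def tree_of_thoughts_bfs(problem: str, depth: int = 3, beam_width: int = 2) -> str:
    frontier = [problem]
    for _ in range(depth):
        candidates = [t for state in frontier for t in propose_thoughts(state)]
        # Keep only the most promising partial solutions (the "beam");
        # discarded branches are where backtracking to alternatives happens.
        frontier = sorted(candidates, key=score_thought, reverse=True)[:beam_width]
    return frontier[0]

print(tree_of_thoughts_bfs("solve: make 24 from [4, 9, 10, 13]"))
```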
-
Large language models (LLMs) demonstrate some reasoning abilities, though it is debated whether they truly reason or merely rely on information retrieval. Prompt engineering enhances reasoning through techniques such as Chain-of-Thought (CoT), which elicits intermediate reasoning steps; multi-stage prompts, problem decomposition, and external tools are also used. Multi-agent discussion may not surpass a well-prompted single LLM. Research also explores knowledge graphs and symbolic solvers to improve LLM reasoning, as well as methods to make LLMs more robust against irrelevant context.
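As a minimal illustration of multi-stage prompting with problem decomposition, the sketch below chains several calls to a hypothetical llm() placeholder: plan sub-questions, answer them in order, then compose a final answer.

```python
# Sketch of multi-stage prompting with problem decomposition. The `llm`
# function is a hypothetical stand-in for a model call; here it just echoes
# the prompt so the control flow is runnable.
def llm(prompt: str) -> str:
    return f"<model answer to: {prompt[:40]}...>"

def decompose_and_solve(question: str) -> str:
    # Stage 1: ask the model to break the problem into sub-questions.
    plan = llm(f"Break this problem into 2-3 sub-questions:\n{question}")
    # Stage 2: answer each sub-question, feeding earlier answers forward.
    context = ""
    for sub_question in plan.splitlines():
        context += llm(f"{context}\nAnswer this step: {sub_question}") + "\n"
    # Stage 3: compose a final answer from the intermediate results.
    return llm(f"Question: {question}\nIntermediate results:\n{context}\nFinal answer:")

print(decompose_and_solve("If a train travels 60 km in 45 minutes, what is its speed in km/h?"))
```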
-
LangChain is an open-source framework that simplifies the development of applications using large language models (LLMs). It offers tools and abstractions to enhance the customization, accuracy, and relevancy of LLM-generated information. LangChain allows developers to connect LLMs to external data sources, and create applications like chatbots, question-answering systems, and virtual agents. Key components include model interfaces, prompt templates, chains, agents, retrieval modules, and memory. LangChain enables the creation of complex, context-aware applications by combining different components.
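A minimal chain might look like the sketch below, assuming recent langchain-core and langchain-openai packages and an OpenAI API key in the environment; the model name and prompt are illustrative, and import paths differ across older versions.

```python
# A minimal LangChain-style chain: prompt template -> chat model -> output parser.
# Assumes `langchain-core` and `langchain-openai` are installed and
# OPENAI_API_KEY is set; the model name is illustrative.
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_template(
    "Answer the question using only this context:\n{context}\n\nQuestion: {question}"
)
model = ChatOpenAI(model="gpt-4o-mini")
chain = prompt | model | StrOutputParser()   # components composed into a chain

answer = chain.invoke({
    "context": "LangChain composes prompts, models, and parsers into chains.",
    "question": "What does LangChain compose?",
})
print(answer)
```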
-
LlamaIndex is an open-source framework for building LLM applications by connecting custom data to LLMs. It excels in Retrieval-Augmented Generation (RAG), data storage, and retrieval. It works by ingesting data from various sources, indexing it (often into vector embeddings), and querying it with a language model. LlamaIndex has tools to evaluate the quality of retrieval and responses. It supports AI agents for automated tasks. The framework facilitates the creation of custom knowledge bases for querying with LLMs.
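The canonical minimal flow looks roughly like the sketch below, assuming the llama-index package, a local data/ directory of documents, and an OpenAI API key for the default embedding model and LLM; exact module paths vary between versions.

```python
# Minimal LlamaIndex flow: ingest documents, build a vector index, and query it.
# Assumes `llama-index` (0.10+ layout), a ./data directory with text files,
# and OPENAI_API_KEY for the default embedding/LLM backend.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data").load_data()   # ingest
index = VectorStoreIndex.from_documents(documents)      # embed + index
query_engine = index.as_query_engine()                  # retrieval + LLM
print(query_engine.query("What does this knowledge base say about RAG?"))
```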
-
Chain of Thought (CoT) is a prompting technique that enhances the reasoning capabilities of large language models (LLMs) by encouraging them to articulate their reasoning process step by step. Instead of providing a direct answer, the model breaks down complex problems into smaller, more manageable parts, simulating human-like thought processes. This method is particularly beneficial for tasks requiring complex reasoning, such as math problems, logical puzzles, and multi-step decision-making. CoT can be implemented through prompting, where the model is guided to "think step by step," or it can be an automatic internal process in some models. CoT improves accuracy and transparency by providing a view into the model's decision-making.
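A typical few-shot CoT prompt pairs a worked example (showing the intermediate steps) with a "think step by step" trigger, as in the sketch below; the llm() function is a hypothetical placeholder for a real model call.

```python
# Illustrative chain-of-thought prompt: a worked example demonstrates the
# reasoning steps, and the trigger phrase asks the model to reason before
# answering. `llm` is a hypothetical stand-in for an actual model call.
def llm(prompt: str) -> str:
    return "<model output>"

cot_prompt = """\
Q: A cafeteria had 23 apples. They used 20 for lunch and bought 6 more. How many apples do they have?
A: They started with 23 apples. After using 20, they had 23 - 20 = 3. Buying 6 more gives 3 + 6 = 9. The answer is 9.

Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. How many balls does he have now?
A: Let's think step by step."""

print(llm(cot_prompt))
```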
-
Retrieval-augmented generation (RAG) enhances large language models (LLMs) by connecting them to external knowledge sources. It works by retrieving relevant documents based on a user's query: an embedding model converts both the query and the documents into numerical vectors, and a vector database finds the closest matching content. The retrieved data is then passed to the LLM for response generation. This process improves accuracy and reduces "hallucinations" by grounding the LLM in factual, up-to-date information. RAG also increases user trust by providing source attribution, so users can verify the information.
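A minimal end-to-end sketch of this pipeline appears below; the bag-of-words "embedding" and the llm() placeholder are toy stand-ins for a real embedding model, vector database, and LLM.

```python
# Minimal RAG sketch: embed documents and the query, retrieve the closest
# document by cosine similarity, and ground the prompt in it.
import numpy as np

VOCAB = sorted({"rag", "retrieval", "llm", "grounding", "vector", "database", "hallucination"})

def embed(text: str) -> np.ndarray:
    words = text.lower().split()
    return np.array([words.count(w) for w in VOCAB], dtype=float)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def llm(prompt: str) -> str:
    return f"<answer grounded in: {prompt.splitlines()[1]}>"

documents = [
    "rag pairs retrieval with an llm to reduce hallucination",
    "a vector database stores document embeddings for retrieval",
]
query = "how does rag reduce hallucination"

doc_vectors = [embed(d) for d in documents]
best = max(range(len(documents)), key=lambda i: cosine(embed(query), doc_vectors[i]))
prompt = f"Context:\n{documents[best]}\n\nQuestion: {query}\nAnswer using only the context."
print(llm(prompt))
```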
-
Fine-tuning is a machine learning technique that adapts a pre-trained model to a specific task or domain. Instead of training a model from scratch, fine-tuning uses a pre-trained model as a starting point and further trains it on a smaller, task-specific dataset. This process can improve the model's performance on specialized tasks, reduce computational costs, and broaden its applicability across various fields. The goal of fine-tuning can be knowledge injection, alignment, or both. Fine-tuning is often used in natural language processing. There are many ways to approach it, including supervised fine-tuning, few-shot learning, transfer learning, and domain-specific fine-tuning, among others.
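In its simplest supervised form, fine-tuning is just continued training on task data at a low learning rate, as in the toy sketch below; the tiny model and random data stand in for a real pretrained LLM and a task-specific corpus.

```python
# Bare-bones supervised fine-tuning loop: start from "pretrained" weights and
# continue training on a small task-specific dataset with a low learning rate.
import torch
import torch.nn as nn

pretrained = nn.Sequential(nn.Embedding(100, 32), nn.Flatten(), nn.Linear(32 * 8, 100))
task_inputs = torch.randint(0, 100, (64, 8))     # 64 task-specific examples
task_labels = torch.randint(0, 100, (64,))       # e.g. next-token targets

optimizer = torch.optim.AdamW(pretrained.parameters(), lr=1e-5)  # small LR preserves prior knowledge
loss_fn = nn.CrossEntropyLoss()

for epoch in range(3):                            # few epochs to avoid overfitting
    logits = pretrained(task_inputs)
    loss = loss_fn(logits, task_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"epoch {epoch}: loss {loss.item():.3f}")
```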
-
Scaling laws describe how language model performance improves with increased model size, training data, and compute. These improvements often follow a power law, with predictable but diminishing gains as resources scale up. Optimal training balances model size, data, and compute, and may favor training large models on less data and stopping before convergence. To prevent overfitting, the dataset size should increase sublinearly with model size. Scaling laws are relatively independent of model architecture. Current large models are often under-trained, suggesting a need for more balanced resource allocation.
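As a worked example in the Chinchilla style, assume compute C ≈ 6·N·D and scale tokens in proportion to parameters (roughly D ≈ 20·N); the sketch below solves for the compute-optimal parameter count N and token count D at a few illustrative budgets.

```python
# Worked example of compute-optimal scaling: with C = 6 * N * D and
# D = tokens_per_param * N, solve N = sqrt(C / (6 * tokens_per_param)).
# The budgets and the 20 tokens-per-parameter rule of thumb are illustrative.
def compute_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

for budget in (1e21, 1e23, 1e25):
    n, d = compute_optimal(budget)
    print(f"compute {budget:.0e} FLOPs -> ~{n:.2e} params, ~{d:.2e} tokens")
```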
-
LLaMA-3 is a series of foundation language models that support multilinguality, coding, reasoning, and tool use. The models come in several sizes, the largest with 405B parameters and a 128K-token context window. Development of Llama 3 focused on optimizing data, scale, and managing complexity, using a combination of web data, code, and mathematical text with dedicated pipelines for each. The models underwent pre-training, supervised fine-tuning, and direct preference optimization to enhance their performance and safety. Llama 3 models have demonstrated strong performance on various benchmarks and aim to balance helpfulness with harmlessness.
-
LLaMA-2 is a collection of large language models (LLMs), with pretrained and fine-tuned versions ranging from 7 billion to 70 billion parameters. The fine-tuned models, called Llama 2-Chat, are designed for dialogue and outperform open-source models on various benchmarks. The models were trained on 2 trillion tokens of publicly available data and were optimized for both helpfulness and safety using techniques such as supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). Llama 2 also introduces a novel technique, Ghost Attention (GAtt), to maintain dialogue flow over multiple turns.
-
LLaMA-1 is a collection of large language models ranging from 7B to 65B parameters, trained on publicly available datasets. LLaMA models achieve competitive performance compared with other LLMs such as GPT-3, Chinchilla, and PaLM: the 13B model outperforms GPT-3 on most benchmarks despite being much smaller, and the 65B model is competitive with the best large language models. The document also discusses the training approach, architecture, optimization, and evaluations of LLaMA on common-sense reasoning, question answering, reading comprehension, mathematical reasoning, code generation, and massive multitask language understanding, as well as its biases and toxicity. The models are intended to democratize access to and study of LLMs, with some able to run on a single GPU, and to serve as a basis for further research.