LLM Collection

Overview

This section collects and summarizes notable and foundational Large Language Models (LLMs) released from 2018 to 2023, covering their specifications (size, release date, available checkpoints), capabilities, and use cases.

Model Collection

2023 Models

| Model | Release Date | Size (B) | Checkpoints | Description |
| --- | --- | --- | --- | --- |
| Falcon LLM | Sep 2023 | 7, 40, 180 | Falcon-7B, Falcon-40B, Falcon-180B | A family of foundational LLMs released by the Technology Innovation Institute (TII); the largest, Falcon-180B, has 180 billion parameters and was trained on 3.5 trillion tokens. |
| Mistral-7B-v0.1 | Sep 2023 | 7 | Mistral-7B-v0.1 | A pretrained generative text model with 7 billion parameters, based on a transformer architecture with Grouped-Query Attention, a byte-fallback BPE tokenizer, and Sliding-Window Attention. |
| CodeLlama | Aug 2023 | 7, 13, 34 | CodeLlama-7B, CodeLlama-13B, CodeLlama-34B | The Code Llama family is designed for general code synthesis and understanding, with variants tuned for instruction following and safer deployment. The models are auto-regressive, use an optimized transformer architecture, and are intended for commercial and research use in English and relevant programming languages. |
| Llama-2 | Jul 2023 | 7, 13, 70 | Llama-2-7B, Llama-2-13B, Llama-2-70B | Developed by Meta AI and released in July 2023 in 7B, 13B, and 70B sizes. It keeps an architecture similar to LLaMA-1 but uses 40% more training data, includes both foundational and dialogue-fine-tuned models (LLaMA-2 Chat), and is available for many commercial uses, with some restrictions. |
| XGen-7B-8K | Jul 2023 | 7 | XGen-7B-8K | A 7B-parameter language model developed by Salesforce AI Research and trained with an 8K sequence length. |
| Claude-2 | Jul 2023 | 130 | - | A foundational LLM built by Anthropic, designed to be safer and more "steerable" than its previous version. It is conversational, suited to tasks such as customer support and Q&A, and can process large amounts of text, making it well-suited to documents, emails, FAQs, and chat transcripts. |
| Tulu | Jun 2023 | 7, 13, 30, 65 | Tulu-7B, Tulu-13B, Tulu-30B, Tulu-65B | A family of LLaMA models from the Allen Institute for AI, fine-tuned on a mixture of instruction datasets (FLAN V2, CoT, Dolly, Open Assistant 1, GPT4-Alpaca, Code-Alpaca, and ShareGPT) and designed to follow complex instructions across various NLP tasks. |
| ChatGLM2-6B | Jun 2023 | 6 | ChatGLM2-6B | The second-generation version of the open-source bilingual (Chinese-English) chat model ChatGLM-6B, with improved performance, longer context, more efficient inference, and an open license for academic and commercial use. It uses a hybrid objective function, was trained on 1.4T bilingual tokens, and shows substantial improvements over its first-generation counterpart on various datasets. |
| Nous-Hermes-13B | Jun 2023 | 13 | Nous-Hermes-13B | A language model fine-tuned by Nous Research on over 300,000 instructions. |
| Baize-v2 | May 2023 | 7, 13 | Baize-v2-13B | An open-source chat model developed by UCSD and Sun Yat-Sen University, fine-tuned with LoRA and trained with supervised fine-tuning (SFT) and self-distillation with feedback (SDF). |
| RWKV-4-Raven | May 2023 | 1.5, 3, 7, 14 | RWKV-4-Raven | A series of chat models fine-tuned on datasets such as Alpaca, CodeAlpaca, Guanaco, GPT4All, and ShareGPT, built on the RWKV architecture, a 100% RNN language model. |
| Guanaco | May 2023 | 7, 13, 33, 65 | Guanaco-7B, Guanaco-13B, Guanaco-33B, Guanaco-65B | Open-source chatbots obtained by 4-bit QLoRA tuning of LLaMA base models on the OASST1 dataset. Intended for research purposes, they allow cheap, local experimentation with high-quality chatbot systems. |
| PaLM 2 | May 2023 | - | - | A language model with better multilingual and reasoning capabilities that is more compute-efficient than its predecessor PaLM. |
| Gorilla | May 2023 | 7 | Gorilla | Gorilla: Large Language Model Connected with Massive APIs |
| RedPajama-INCITE | May 2023 | 3, 7 | RedPajama-INCITE | A family of models including base, instruction-tuned, and chat models. |
| LIMA | May 2023 | 65 | - | A 65B-parameter LLaMA model fine-tuned with the standard supervised loss on only 1,000 carefully curated prompts and responses, without any reinforcement learning or human preference modeling. |
| Replit Code | May 2023 | 3 | Replit Code | The replit-code-v1-3b model is a 2.7B-parameter LLM trained on 20 programming languages from the Stack Dedup v1.2 dataset. |
| h2oGPT | May 2023 | 7, 12, 20, 40 | h2oGPT | An LLM fine-tuning framework and chatbot UI with document question-answering capabilities. |
| CodeGen2 | May 2023 | 1, 3, 7, 16 | CodeGen2 | Code models for program synthesis. |
| CodeT5 and CodeT5+ | May 2023 | 16 | CodeT5 | CodeT5 and CodeT5+ models for code understanding and generation from Salesforce Research. |
| StarCoder | May 2023 | 15 | StarCoder | StarCoder: A State-of-the-Art LLM for Code |
| MPT | May 2023 | 7, 30 | MPT-7B, MPT-30B | MosaicML's MPT models are open-source, commercially licensed Large Language Models, offering customizable AI solutions optimized for various NLP tasks. |
| DLite | May 2023 | 0.124 - 1.5 | DLite-v2-1.5B | Lightweight instruction-following models that exhibit ChatGPT-like interactivity. |
| WizardLM | Apr 2023 | 13, 30, 70 | WizardLM-13B, WizardLM-30B, WizardLM-70B | A family of large language models from the WizardLM Team designed to follow complex instructions. The models perform well in coding, mathematical reasoning, and open-domain conversation, are license-friendly, and adopt the Vicuna prompt format for multi-turn conversation. |
| FastChat-T5-3B | Apr 2023 | 3 | FastChat-T5-3B | An open-source chatbot trained by fine-tuning Flan-T5-XL (3B parameters) on user-shared conversations collected from ShareGPT. It is based on an encoder-decoder transformer architecture and autoregressively generates responses to user inputs. |
| GPT4All-13B-Snoozy | Apr 2023 | 13 | GPT4All-13B-Snoozy | A GPL-licensed chatbot from Nomic AI, fine-tuned from LLaMA 13B on a massive curated corpus of assistant interactions including word problems, multi-turn dialogue, code, poems, songs, and stories. It is designed for assistant-style interaction data and is primarily in English. |
| Koala-13B | Apr 2023 | 13 | Koala-13B | A chatbot created by Berkeley AI Research (BAIR), fine-tuned from Meta's LLaMA on dialogue data scraped from the web, including conversations with highly capable closed-source models such as ChatGPT. It aims to balance performance and cost, providing a lighter, open-source alternative to models like ChatGPT. |
| OpenAssistant (Llama family) | Apr 2023 | 30, 70 | Llama2-30b-oasst, Llama2-70b-oasst | Language models from OpenAssistant's work on the Llama models. They support CPU + GPU inference using the GGML format and aim to provide an open-source alternative for instruction-following tasks. |
| Dolly | Apr 2023 | 3, 7, 12 | Dolly-v2-3B, Dolly-v2-7B, Dolly-v2-12B | An instruction-following LLM fine-tuned on a human-generated instruction dataset licensed for research and commercial use. |
| StableLM | Apr 2023 | 3, 7 | StableLM-Alpha-3B, StableLM-Alpha-7B | Stability AI's StableLM series of language models. |
| Pythia | Apr 2023 | 0.070 - 12 | Pythia | A suite of 16 LLMs all trained on public data seen in the exact same order, ranging in size from 70M to 12B parameters. |
| Open Assistant (Pythia Family) | Mar 2023 | 12 | Open Assistant | A chat-based assistant that understands tasks, can interact with third-party systems, and can retrieve information dynamically to do so. |
| Med-PaLM 2 | Mar 2023 | - | - | Towards Expert-Level Medical Question Answering with Large Language Models |
| ChatGLM-6B | Mar 2023 | 6 | ChatGLM-6B | An open-source, Chinese-English bilingual dialogue model based on the General Language Model (GLM) architecture with 6.2 billion parameters. Despite some factual and mathematical-reasoning issues caused by its small size, it is well suited to Chinese question answering, summarization, and conversational tasks thanks to training on over 1 trillion English and Chinese tokens. |
| GPT-3.5-turbo | Mar 2023 | 175 | - | OpenAI's language model optimized for chat that also works well for traditional completion tasks. It offers better performance than GPT-3 across the board and is 10 times cheaper per token. |
| Vicuna | Mar 2023 | 7, 13, 33 | Vicuna-7B, Vicuna-13B | A family of auto-regressive language models based on the transformer architecture, fine-tuned from LLaMA and primarily intended for research on large language models and chatbots. Developed by LMSYS under a non-commercial license. |
| Alpaca-13B | Mar 2023 | 13 | - | An instruction-following language model fine-tuned from Meta's LLaMA, designed for academic research on issues such as misinformation and toxicity. Alpaca is trained on 52K instruction-following demonstrations and aims to be a more accessible option for academic study; it is not intended for commercial use due to licensing and safety concerns. |
| Claude-1 | Mar 2023 | 137 | - | A foundational LLM built by Anthropic, designed to be a helpful, honest, and harmless AI assistant. It can perform a wide variety of conversational and text-processing tasks and is accessible through a chat interface and API. |
| Cerebras-GPT | Mar 2023 | 0.111 - 13 | Cerebras-GPT | Cerebras-GPT: Open Compute-Optimal Language Models Trained on the Cerebras Wafer-Scale Cluster |
| BloombergGPT | Mar 2023 | 50 | - | BloombergGPT: A Large Language Model for Finance |
| PanGu-Σ | Mar 2023 | 1085 | - | PanGu-Σ: Towards Trillion Parameter Language Model with Sparse Heterogeneous Computing |
| GPT-4 | Mar 2023 | - | - | GPT-4 Technical Report |
| LLaMA | Feb 2023 | 7, 13, 33, 65 | LLaMA | LLaMA: Open and Efficient Foundation Language Models |
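
Many of the open checkpoints listed above are distributed through the Hugging Face Hub and can be loaded with the transformers library. The snippet below is a minimal sketch, assuming the transformers, torch, and accelerate packages are installed; the repository id is taken from the Mistral-7B-v0.1 model card as an example and should be swapped for whichever open checkpoint you want to try (verify the exact id on its model card). Chat-tuned models (Vicuna, Llama-2 Chat, etc.) additionally expect their own prompt templates.

```python
# Minimal sketch: loading one open checkpoint from the collection with Hugging Face
# transformers. The repository id below is an example taken from the Mistral model card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-v0.1"  # swap in any other open checkpoint's Hub id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision: roughly 2 bytes per parameter
    device_map="auto",          # needs `accelerate`; spreads layers across GPU/CPU memory
)

prompt = "Notable large language models released in 2023 include"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```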

2022 and Earlier Models

| Model | Release Date | Size (B) | Checkpoints | Description |
| --- | --- | --- | --- | --- |
| ChatGPT | Nov 2022 | - | - | A model that interacts in a conversational way. The dialogue format makes it possible for ChatGPT to answer follow-up questions, admit its mistakes, challenge incorrect premises, and reject inappropriate requests. |
| Galactica | Nov 2022 | 0.125 - 120 | Galactica | Galactica: A Large Language Model for Science |
| mT0 | Nov 2022 | 13 | mT0-xxl | Crosslingual Generalization through Multitask Finetuning |
| BLOOM | Nov 2022 | 176 | BLOOM | BLOOM: A 176B-Parameter Open-Access Multilingual Language Model |
| U-PaLM | Oct 2022 | 540 | - | Transcending Scaling Laws with 0.1% Extra Compute |
| UL2 | Oct 2022 | 20 | UL2, Flan-UL2 | UL2: Unifying Language Learning Paradigms |
| Sparrow | Sep 2022 | 70 | - | Improving alignment of dialogue agents via targeted human judgements |
| Flan-T5 | Oct 2022 | 11 | Flan-T5-xxl | Scaling Instruction-Finetuned Language Models |
| AlexaTM | Aug 2022 | 20 | - | AlexaTM 20B: Few-Shot Learning Using a Large-Scale Multilingual Seq2Seq Model |
| GLM-130B | Oct 2022 | 130 | GLM-130B | GLM-130B: An Open Bilingual Pre-trained Model |
| OPT-IML | Dec 2022 | 30, 175 | OPT-IML | OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization |
| OPT | May 2022 | 175 | OPT-13B, OPT-66B | OPT: Open Pre-trained Transformer Language Models |
| PaLM | Apr 2022 | 540 | - | PaLM: Scaling Language Modeling with Pathways |
| Tk-Instruct | Apr 2022 | 11 | Tk-Instruct-11B | Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks |
| GPT-NeoX-20B | Apr 2022 | 20 | GPT-NeoX-20B | GPT-NeoX-20B: An Open-Source Autoregressive Language Model |
| Chinchilla | Mar 2022 | 70 | - | Shows that for a given compute budget, the best performance is achieved not by the largest models but by smaller models trained on more data (a back-of-the-envelope illustration follows this table). |
| InstructGPT | Mar 2022 | 175 | - | Training language models to follow instructions with human feedback |
| CodeGen | Mar 2022 | 0.350 - 16 | CodeGen | CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis |
| AlphaCode | Feb 2022 | 41 | - | Competition-Level Code Generation with AlphaCode |
| MT-NLG | Jan 2022 | 530 | - | Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model |
| LaMDA | Jan 2022 | 137 | - | LaMDA: Language Models for Dialog Applications |
| GLaM | Dec 2021 | 1200 | - | GLaM: Efficient Scaling of Language Models with Mixture-of-Experts |
| Gopher | Dec 2021 | 280 | - | Scaling Language Models: Methods, Analysis & Insights from Training Gopher |
| WebGPT | Dec 2021 | 175 | - | WebGPT: Browser-assisted question-answering with human feedback |
| Yuan 1.0 | Oct 2021 | 245 | - | Yuan 1.0: Large-Scale Pre-trained Language Model in Zero-Shot and Few-Shot Learning |
| T0 | Oct 2021 | 11 | T0 | Multitask Prompted Training Enables Zero-Shot Task Generalization |
| FLAN | Sep 2021 | 137 | Flan-T5 | Finetuned Language Models Are Zero-Shot Learners |
| HyperCLOVA | Sep 2021 | 82 | - | What Changes Can Large-scale Language Models Bring? Intensive Study on HyperCLOVA: Billions-scale Korean Generative Pretrained Transformers |
| ERNIE 3.0 Titan | Jul 2021 | 10 | - | ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training for Language Understanding and Generation |
| Jurassic-1 | Aug 2021 | 178 | - | Jurassic-1: Technical Details and Evaluation |
| ERNIE 3.0 | Jul 2021 | 10 | - | ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation |
| Codex | Jul 2021 | 12 | - | Evaluating Large Language Models Trained on Code |
| GPT-J-6B | Jun 2021 | 6 | GPT-J-6B | A 6 billion parameter, autoregressive text generation model trained on The Pile. |
| CPM-2 | Jun 2021 | 198 | CPM | CPM-2: Large-scale Cost-effective Pre-trained Language Models |
| PanGu-α | Apr 2021 | 13 | PanGu-α | PanGu-α: Large-scale Autoregressive Pretrained Chinese Language Models with Auto-parallel Computation |
| mT5 | Oct 2020 | 13 | mT5 | mT5: A massively multilingual pre-trained text-to-text transformer |
| BART | Jul 2020 | - | BART | Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension |
| GShard | Jun 2020 | 600 | - | GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding |
| GPT-3 | May 2020 | 175 | - | Language Models are Few-Shot Learners |
| CTRL | Sep 2019 | 1.63 | CTRL | CTRL: A Conditional Transformer Language Model for Controllable Generation |
| ALBERT | Sep 2019 | 0.235 | ALBERT | A Lite BERT for Self-supervised Learning of Language Representations |
| XLNet | Jun 2019 | - | XLNet | Generalized Autoregressive Pretraining for Language Understanding |
| T5 | Oct 2019 | 0.06 - 11 | Flan-T5 | Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer |
| GPT-2 | Nov 2019 | 1.5 | GPT-2 | Language Models are Unsupervised Multitask Learners |
| RoBERTa | Jul 2019 | 0.125 - 0.355 | RoBERTa | A Robustly Optimized BERT Pretraining Approach |
| BERT | Oct 2018 | - | BERT | Bidirectional Encoder Representations from Transformers |
| GPT | Jun 2018 | - | GPT | Improving Language Understanding by Generative Pre-Training |
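
The Chinchilla entry above is the origin of a widely used compute-optimal rule of thumb: train on roughly 20 tokens per parameter. The sketch below is only an illustration of that ratio, an approximation drawn from the Chinchilla work by Hoffmann et al. (2022), not a value stated in the table.

```python
# Back-of-the-envelope illustration of the Chinchilla-style rule of thumb
# (~20 training tokens per parameter for a compute-optimal run).
def compute_optimal_tokens_b(params_billion: float, tokens_per_param: float = 20.0) -> float:
    """Approximate compute-optimal training-token count, in billions."""
    return params_billion * tokens_per_param

for name, size_b in [("Chinchilla-70B", 70), ("LLaMA-65B", 65), ("Falcon-180B", 180)]:
    print(f"{name}: ~{compute_optimal_tokens_b(size_b):,.0f}B tokens")
```

For comparison, Chinchilla-70B was actually trained on about 1.4 trillion tokens and Falcon-180B on 3.5 trillion, close to what the rule predicts.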

Key Insights

  • Early Era (2018-2020): Models ranged from hundreds of millions to roughly ten billion parameters (e.g., BERT, GPT-2, T5)
  • Growth Era (2020-2022): Models expanded to billions and hundreds of billions of parameters
  • Current Era (2022-2023): The largest models now exceed a trillion parameters (e.g., PanGu-Σ with 1085B); a rough weight-memory estimate for these sizes is sketched below
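
To make these parameter counts concrete, the sketch below (an illustration, not data from the tables above) converts a model size into the approximate memory needed just to hold its weights at different numeric precisions.

```python
# Rough weight-memory estimate: parameters (in billions) x bytes per parameter = gigabytes.
# Ignores activations, optimizer state, and the KV cache, so real usage is higher.
def weight_memory_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Approximate GB needed to store the weights alone.

    bytes_per_param: 4.0 for float32, 2.0 for float16/bfloat16, 1.0 for int8, 0.5 for 4-bit.
    """
    return params_billion * bytes_per_param

for name, size_b in [("Mistral-7B", 7), ("LLaMA-65B", 65), ("GPT-3 175B", 175), ("PanGu-Σ 1085B", 1085)]:
    print(f"{name}: ~{weight_memory_gb(size_b):.0f} GB at float16, "
          f"~{weight_memory_gb(size_b, 0.5):.0f} GB at 4-bit")
```

This is why 4-bit methods such as QLoRA (used for the Guanaco models above) make it feasible to fine-tune a 65B-parameter model on a single 48 GB GPU.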

Architecture Evolution

  • Transformer-based: Most models use transformer architecture with various optimizations
  • Mixture-of-Experts: Models like GLaM and PanGu-Σ use sparse MoE layers for efficient scaling (a minimal routing sketch follows this list)
  • Multimodal: Recent models such as GPT-4 accept image inputs in addition to text
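
The mixture-of-experts point can be made concrete with a small sketch, assuming PyTorch: each token is routed to only k of the available expert feed-forward networks, so the total parameter count grows with the number of experts while per-token compute stays roughly constant. This is a generic illustration of top-k routing, not the specific routing used by GLaM or PanGu-Σ.

```python
# Minimal top-k Mixture-of-Experts layer: a gate scores the experts per token and only
# the k best experts run on that token, with their outputs combined by the gate weights.
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    def __init__(self, d_model: int = 64, d_ff: int = 256, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.gate = nn.Linear(d_model, n_experts)
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
        scores = self.gate(x)                             # (tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)        # each token's top-k experts
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                  # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

y = TopKMoE()(torch.randn(10, 64))  # 10 tokens through a sparse feed-forward layer
```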

Specialization Areas

  • Code Generation: CodeGen, CodeT5, StarCoder, AlphaCode
  • Multilingual: mT5, BLOOM, ChatGLM, HyperCLOVA
  • Instruction Following: FLAN, T0, Alpaca, Vicuna
  • Domain-Specific: BloombergGPT (Finance), Med-PaLM (Medical), Galactica (Science)

Data Sources

This section is under development. Data adapted from Papers with Code and the recent work by Zhao et al. (2023).

References

  • Zhao, W. X., et al. (2023). "A Survey of Large Language Models." arXiv preprint arXiv:2303.18223.
  • Papers with Code: Large Language Models