How large are large language models? (2025)
This aims to be factual information about the size of large language models. None of this document was written by AI, and I do not include any information from leaks or rumors. The focus of this document is on base models (the raw text continuation engines, not 'helpful chatbot/assistants'). This is a view, from a few years ago to today, of one very tiny fraction of the larger LLM story that is happening.
History
- GPT-2, -medium, -large, -xl (2019): 137M, 380M, 812M, 1.61B parameters. Source: openai-community/gpt2. Trained on the unreleased WebText dataset, said to be 40GB of Internet text, which I estimate to be roughly 10B tokens. You can see a list of the websites that went into that dataset here: domains.txt.
- GPT-3 aka davinci, davinci-002 (2020): 175B parameters. There is a good breakdown of how those parameters are 'spent' here: How does GPT-3 spend its 175B parameters? (see the sketch after this list). Trained on around 400B tokens composed of CommonCrawl, WebText2, Books1, Books2 and Wikipedia. Source: Language Models are Few-Shot Learners. These training runs took months on a data center full of tens of thousands of A100 GPUs (source).
- GPT-3.5, GPT-4 (2022, 2023): No official factual information on architecture or training data available.
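As a rough illustration of that parameter breakdown, here is a minimal counting sketch for a GPT-3-style dense transformer, using the architecture figures from the GPT-3 paper (96 layers, d_model = 12288, 2048-token context, ~50k BPE vocabulary). Biases, layer norms and attention-head bookkeeping are ignored, so the result is approximate, not an exact reproduction of the linked breakdown.

```python
# Rough parameter count for a GPT-3-style dense transformer (approximate:
# biases and layer norms are ignored, MLP hidden size assumed to be 4x d_model).
def dense_params(n_layers, d_model, vocab_size, n_ctx):
    attn = 4 * d_model * d_model            # Q, K, V and output projections
    mlp = 2 * d_model * (4 * d_model)       # up- and down-projection
    per_block = attn + mlp                  # ~12 * d_model^2 per layer
    embeddings = vocab_size * d_model + n_ctx * d_model
    return n_layers * per_block + embeddings

# GPT-3 "davinci": 96 layers, d_model = 12288, 2048-token context, 50257-token vocab.
print(dense_params(96, 12288, 50257, 2048) / 1e9)  # ~174.6 (billions)
```

Two thirds of each block's budget (8 of the ~12 * d_model^2) sits in the MLP, which matches the linked breakdown; at this scale the embeddings are a rounding error.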
Llama
- Llama 7B, 13B, 33B, 65B (2023): The 65B model was pretrained on 1.4T (trillion) tokens. LLaMA was officially stated to use Books3 (source) as a dataset - a very important dataset which has been pivotal in lawsuits over the training of AIs on large amounts of copyrighted and potentially pirated material.
- Llama-3.1 405B (2024): The 405B Llama model was released. This is a dense transformer model, meaning all parameters are used in every inference pass. Initial pretraining: 2.87T tokens, long context: 800B, annealing: 40M - about 3.67T total. Source: The Llama 3 Herd of Models. By this point Meta had learned to say less about what data goes into the models - "We create our dataset for language model pre-training from a variety of data sources containing knowledge" - so I can't say as much about what goes into the training data here.
Empirically, we find that annealing (see Section 3.4.3) on small amounts of high-quality code and mathematical data can boost the performance of pre-trained models on key benchmarks
The emerging trend of annealing pretrained models to 'benchmax' is unfortunate, in that it biases the base language models somewhat away from being pure text continuation engines. This should really be part of post-training, which aims to make the models role-play as some kind of helpful AI-chatbot assistant character. But these companies care very much about metrics and scores.
- Llama-4 (2025): The largest model in the Llama 4 family is a 2T total parameter MoE model - A288B, 16E (288B active parameters, 16 experts). It is unreleased. The smaller Llama 4 models, Maverick and Scout, are distilled from this large one and are generally considered to be of low intelligence. There was also a scandal: Facebook decided to mislead people by gaming the lmarena benchmark site - they served one version of Llama 4 Maverick there and released a different version, for some reason. This misconduct reduced trust in the Llama team, which seems to have imploded shortly after, and it's unclear whether the 2T Behemoth model will ever be released after what happened.
The desert
For a long time, there weren't really any large language models available to download. There was certainly nothing comparable with GPT-3 for a few years. There were projects that tried to match it, but they generally operated by fine-tuning things like small (70B) Llama models on a bunch of GPT-3-generated text (synthetic data, which can result in degeneration when AI outputs are fed back into AI training inputs).
The release of 405B was a turning point here. Just before that, Mistral released Mixtral 8x7B (Dec 2023), a MoE model, and then Mixtral-8x22B (April 2024), a sparse MoE with 141B total and A39B active parameters. Even though it is not a dense model like GPT-3 (175B), it is comparable in total parameter size. The MoE architecture enabled larger models to be trained, and to be used by more people - people without access to thousands of interconnected GPUs.
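To make the total-versus-active distinction concrete, here is a back-of-the-envelope sketch (numbers in billions) that recovers an approximate per-expert size from Mixtral-8x22B's published 141B-total / A39B figures. It assumes every expert is the same size and that everything outside the experts (attention, embeddings, router) is shared between tokens, which is a simplification of the real architecture.

```python
# Back-of-the-envelope MoE parameter accounting, in billions of parameters.
# Assumes equal-sized experts and that all non-expert weights are shared.
def moe_params(shared, expert, n_experts, top_k):
    total = shared + n_experts * expert    # every expert is stored on disk
    active = shared + top_k * expert       # only top_k experts run per token
    return total, active

# Mixtral-8x22B publishes 141B total / 39B active, with 8 experts and top-2 routing.
# Solving those two equations gives rough per-expert and shared sizes:
expert = (141 - 39) / (8 - 2)   # ~17B per expert
shared = 39 - 2 * expert        # ~5B shared (attention, embeddings, router)
print(moe_params(shared, expert, n_experts=8, top_k=2))  # -> (141.0, 39.0)
```

The same accounting applies to the models below: a huge total count on disk, a much smaller active count per token.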
Current MoE-wave
DeepSeek V3 Base
This was released the day after Christmas 2024. In the words of the DeepSeek webpage:
🎉 What’s new in V3
🧠 671B MoE parameters
🚀 37B activated parameters
📚 Trained on 14.8T high-quality tokens
Source: deepseek-ai/DeepSeek-V3-Base, paper.
This was a gigantic leap forward in model size, and when R1 (the reasoning model built on top of this base model) was released, it impressed a lot of people. I think this may have been the first time a truly GPT-4-level model was available to download and use. For reasons that remain unclear, this temporarily tanked the NVDA stock price.
This really opened the door to new large MoE language models being trained, especially in China, and released freely for people to use. Note that the following models are also starting to be multi-modal, as well as multilingual, so they have been fed large amounts of new types of data during training.
Databricks (March 2024)
Compared to other open MoE models like Mixtral-8x7B and Grok-1, DBRX is fine-grained, meaning it uses a larger number of smaller experts. DBRX has 16 experts and chooses 4, while Mixtral-8x7B and Grok-1 have 8 experts and choose 2.
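Below is a minimal, illustrative sketch of what "N experts, choose k" means for a single token. This is not DBRX's actual implementation - real MoE layers add load-balancing losses, normalization choices and fused GPU kernels that are omitted here - but it shows how only the selected experts ever run, and how "fine-grained" just means more, smaller experts with a larger top-k.

```python
import numpy as np

def moe_layer(x, router_w, experts, top_k):
    """Route a single token vector x to its top_k experts and mix their outputs."""
    logits = x @ router_w                          # one routing score per expert
    top = np.argsort(logits)[-top_k:]              # indices of the chosen experts
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()                                   # softmax over the chosen scores
    # Only the chosen experts are evaluated; the rest cost nothing for this token.
    return sum(wi * experts[i](x) for wi, i in zip(w, top))

# Toy setup: the difference between 16 experts / top-4 (DBRX-style routing shape)
# and 8 experts / top-2 (Mixtral-style) is just these two numbers.
rng, d, n_experts = np.random.default_rng(0), 8, 16
experts = [(lambda v, W=rng.standard_normal((d, d)) * 0.1: v @ W) for _ in range(n_experts)]
router_w = rng.standard_normal((d, n_experts)) * 0.1
print(moe_layer(rng.standard_normal(d), router_w, experts, top_k=4).shape)  # (8,)
```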
Minimax (Jan 2025)
- https://huggingface.co/MiniMaxAI/models#repos
- https://huggingface.co/MiniMaxAI/MiniMax-Text-01
- https://arxiv.org/pdf/2501.08313
- 456B total, A45.9B. Architecture: Lightning Attention, Softmax Attention and Mixture-of-Experts (MoE).
Building upon the architecture design and computation optimizations, we train our foundational language model, MiniMax-Text-01. To assess document quality at a granular level, we utilize our previous-generation model as the reward labeler (a MoE model with 5B activations and 60B total parameters).
Dots (June 2025)
- https://huggingface.co/rednote-hilab/dots.llm1.base
- https://www.arxiv.org/pdf/2506.05767
- 143B total parameters, A14B, trained on 11.2T tokens, 32,768-token context length
dots.llm1 achieves performance comparable to Qwen2.5-72B after being pretrained on a high-quality corpus without synthetic data. Architecture: multi-head attention with QK-Norm in the attention layer, fine-grained MoE utilizing top-6 out of 128 routed experts, plus 2 shared experts.
Hunyuan (June 2025)
During the training stage, the shared expert remains perpetually active, while only 8 non-shared experts are activated simultaneously.
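The shared-expert idea that both dots.llm1 and Hunyuan describe is a small variation on the routing sketch above: some experts run for every token unconditionally, and the router only chooses among the rest. A minimal illustrative sketch, using the dots.llm1 shape (2 shared experts plus top-6 of 128 routed) since those counts are stated above; it is not either model's actual implementation.

```python
import numpy as np

def shared_plus_routed(x, shared, routed, router_w, top_k):
    """One MoE layer where `shared` experts always run and only top_k routed experts fire."""
    out = sum(e(x) for e in shared)               # always-active path, every token
    logits = x @ router_w                         # routed path: score the remaining experts
    top = np.argsort(logits)[-top_k:]
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()
    return out + sum(wi * routed[i](x) for wi, i in zip(w, top))

# Illustrative sizes: 2 shared experts, top-6 of 128 routed (the dots.llm1 shape).
rng, d = np.random.default_rng(0), 8
make = lambda: (lambda v, W=rng.standard_normal((d, d)) * 0.1: v @ W)
shared, routed = [make() for _ in range(2)], [make() for _ in range(128)]
router_w = rng.standard_normal((d, 128)) * 0.1
print(shared_plus_routed(rng.standard_normal(d), shared, routed, router_w, top_k=6).shape)  # (8,)
```

Hunyuan's quoted setup is the same shape with one shared expert and eight routed experts active per token.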
Ernie (June 2025)
It is not clear to me how many tokens the base model here was trained on, but the Hugging Face page says "trillions".
Conclusion
For a long time there were very, very few LLMs available at the same scale as GPT-3. Attempts to match GPT-3-level performance with downloadable weights were hindered by this, and genuinely, I do not think people understood that a raw model size comparable to 175B was required. All that was available were the <=70B Llama models, and people tried to work with them.
405B is the latest large dense base model available that I'm aware of, but it's annealed and contains recent data in its pretraining (meaning there will be people discussing LLMs and sharing logs and transcripts of LLMs in that data), so it's a little more like an 'assistant' than previous base models. The same flaws apply to the recent wave of MoE models. They also have some aspects of Chinese culture baked into them.
It's not completely clear how to compare MoE models with dense models. Perhaps there are aspects of LLM-intelligence that can only be achieved with sufficient depth/density. I don't think the current automated benchmarks are able to capture this, so everyone is just going all in on MoEs now.
Newer models might be trained with new architectures (RWKV, byte-latent, bitnet) or new techniques of synthetic data generation (to avoid lawsuits, and to get good scores on benchmarks), but it's unclear how important these things really are for making a good raw text continuation engine - which I believe is the foundation for whatever capabilities fine-tuning elicits from these neural networks. Currently the trend is to make chatbots that role-play as 'AI assistants' - I really hope that more people investigate alternatives.