Datasets: LLM Data Hub; The Pile; RedPajama Data; SlimPajama (600+GB dedup'd); RefinedWeb (Falcon's); The Stack (StarCoder's)
(Note: Training sets may or may not be legal to use if they include copyrighted works. The copyright aspects of LLMs are still being debated, though. Use what you believe is legal.)
I'm going to list the models with highly permissive licenses first. They can usually be used by businesses.
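Most of the datasets above are on the Hugging Face Hub, so you can inspect a sample by streaming instead of downloading hundreds of gigabytes. A minimal sketch with the `datasets` library; the dataset ID and the "text" field name are my best guesses at the Hub listing and may differ from the official one:

```python
# Sketch: stream a few records from SlimPajama instead of downloading the full ~600GB.
# Assumes the Hugging Face `datasets` library; the dataset ID and "text" field
# are my best guesses at the Hub listing.
from datasets import load_dataset

ds = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)
for i, record in enumerate(ds):
    print(record["text"][:200])  # print the start of each document
    if i >= 2:
        break
```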
LLaMA-2: LLaMA 2 is here; 70B 4-bit GPTQ (works on 2x24GB VRAM); LLaMA 2 70B GPTQ full context on 2 3090s; LLongMA: LLaMA 2 8k; LLaMA-2 7B Uncensored QLoRA; LLaMA-2-7B-32K; LLongMA-2 16k
(Note: This is the most recent model from Facebook/Meta, with sizes ranging from consumer-grade up to high-end. While quite capable, it seems to have many issues from whatever alignment/morality they built into it. Remove the default system prompt if you use it.)
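If you do use the chat variants, the prompt template matters. Here's a minimal sketch of the LLaMA-2 chat format with the system block dropped entirely, which is one way to remove Meta's default system prompt; the [INST]/<<SYS>> template is the documented chat format, but exact handling depends on your loader:

```python
# Sketch: build a LLaMA-2-chat prompt with no system prompt.
# Omitting the <<SYS>> block drops the default alignment preamble entirely.
def llama2_chat_prompt(user_message: str, system_prompt: str = "") -> str:
    if system_prompt:
        return f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{user_message} [/INST]"
    return f"[INST] {user_message} [/INST]"  # no system block at all

print(llama2_chat_prompt("List three uses for a 24GB GPU."))
```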
Falcon: project page; HuggingFace article (7B/40B); HuggingFace article (180B)
(Note: Falcon-40B outperformed other open models for a long time. Recently, they released a 180B version.)
(Note: MPT was trained on a lot of data using their own training platform, which you can use too. It's open-sourced for business use, and they built a business around it. That's a great business model that I'd like to see others imitate.)
Warning: The rest of the models on this page may have license restrictions. I'm including them both as examples and for any use you can get out of them.
Instruction-following Models: WizardLM; MPT-Instruct 30B; Databricks Dolly-12B; LLaMA-2-70B-Instruct2
Coding models: StarCoder 15B; WizardCoder 15B; CodeT5 16B
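These all run through the standard `transformers` text-generation pipeline. A minimal sketch using StarCoder; note that bigcode/starcoder is gated behind a license acceptance on the Hub, and a 15B model realistically needs a large GPU (swap in a smaller checkpoint to try the same call on CPU):

```python
# Sketch: code completion with a coding model via the transformers pipeline.
# bigcode/starcoder is ~15B parameters and gated; device_map="auto" needs `accelerate`.
from transformers import pipeline

generator = pipeline("text-generation", model="bigcode/starcoder", device_map="auto")
prompt = "def fibonacci(n):\n    "
print(generator(prompt, max_new_tokens=64)[0]["generated_text"])
```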
Chatbots: Guanaco description, model links, and LLaMA-2-70B version; MPT-Chat 30B; LLaMA-2-Chat; FastChat-T5-3B
(Note: The Guanaco models are among the highest-performing according to many people who run them both with and without GPUs. A CPU-only sketch follows below.)
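Running them without a GPU usually means grabbing a quantized GGML/GGUF file and using llama.cpp. A minimal sketch with the llama-cpp-python bindings; the model path is a placeholder for whichever quantized Guanaco (or other Llama-family) file you download, and the "### Human/### Assistant" format is Guanaco's usual prompt style:

```python
# Sketch: CPU-only inference on a quantized Llama-family model via llama-cpp-python.
# model_path is a placeholder; point it at a GGML/GGUF file you've downloaded.
from llama_cpp import Llama

llm = Llama(model_path="./guanaco-7b.q4_k_m.gguf", n_ctx=2048)
out = llm("### Human: What is a LoRA adapter?\n### Assistant:", max_tokens=128)
print(out["choices"][0]["text"])
```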
Uncensored Models: Rationale; WizardLM-Uncensored (with examples); list of uncensored models
(Note: The WizardLM-Uncensored link is one of my favorites because you can see the definite political bias in these models. The uncensored model, by contrast, just says yes to everything.)
OpenOrca 13B: Preview; Chat Preview
BLOOM (open w/ 1000+ collaborators): BLOOM-176B-LORA-8-bit (353GB -> 180GB)
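The 353GB -> 180GB drop is roughly what loading the weights in 8-bit buys you. A minimal sketch of the same idea at a size most machines can handle, using transformers with bitsandbytes and a small BLOOM checkpoint standing in for the 176B one (8-bit loading needs a CUDA GPU):

```python
# Sketch: load a BLOOM checkpoint with 8-bit weights to roughly halve memory use.
# Requires bitsandbytes and a CUDA GPU; bloom-560m stands in for the 176B model here.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bigscience/bloom-560m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_8bit=True, device_map="auto")

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=10)[0]))
```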
GLM family (Chinese): ChatGLM-6B article
T5 Family: t5-large; LaMini-Flan-T5-248M
(Note: Several small models, including sub-1B-parameter ones, were built this way.)
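T5-style models are sequence-to-sequence, so they load with the Seq2Seq classes rather than the causal-LM ones above, and the small ones run fine on CPU. A minimal sketch; the model ID is my best guess at the Hub name for LaMini-Flan-T5-248M:

```python
# Sketch: run a small instruction-tuned T5 variant on CPU.
# The model ID is my best guess at the Hub name for LaMini-Flan-T5-248M.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "MBZUAI/LaMini-Flan-T5-248M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

inputs = tokenizer("Explain what an embedding is in one sentence.", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```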
Smaller coding models: GGML for Falcoder 7B, SantaCoder 1B, and TinyStarCoder 160M; Replit Code 3B
Tiny models: Baby LLaMA-2; TinyStarCoder (164M, trained for 6 epochs on 100GB of data total); GPT2023 (a 124M GPT-2 model trained on 2.23B tokens)