# A GPT of One's Own

May 2024

David Eger


Though I'm a programmer, I have studiously avoided learning JavaScript and CSS.


And until a few weeks ago, I'd largely ignored the GPT ChatBots: When I first tried them, they hallucinated badly and led me astray when I asked them how to use various programming libraries. But [Simon Willison](http://simonwillison.net/) and others have been using them to get things done with surprising ease. So two weeks ago, I asked Google's Gemini 1.5 the following:

```
I have an HTML table, and for each row I have a pair of numbers [low, high] in the first column. I would like to make this first column more visual by rendering the interval as a rectangle aligned along the x axis within each cell, so that you can quickly see how the intervals in each row relate to each other. Could you show an example HTML table that renders this way?
```

And I was shocked at just how good the answer was. It taught me about explicitly attaching [data fields](https://developer.mozilla.org/en-US/docs/Learn/HTML/Howto/Use_data_attributes) to tags (woah, when did that happen?), and gave me idiomatic CSS and JavaScript to accomplish what I wanted. Copy, paste, ask one follow-up, and I was done. What would have been hours for me turned into a couple of minutes.

# Llama 3, almost as good as GPT-4 / Gemini Pro

April 2024 saw a flurry of AI model announcements: [Reka Core](https://www.reka.ai/news/reka-core-our-frontier-class-multimodal-language-model), Microsoft's [Phi 3](https://export.arxiv.org/abs/2404.14219), Cohere's [Command R+](https://cohere.com/blog/command-r-plus-microsoft-azure). But standing out was Meta's release of its open [Llama 3](https://llama.meta.com/llama3/), a model scoring 1213 in the [LMSYS Chatbot Arena Leaderboard](https://chat.lmsys.org/?leaderboard), less than 50 Elo points away from the best AI models, Gemini 1.5 and GPT-4.

Llama 3 has a couple of shortcomings: It is text only ([Gemini 1.5](https://ai.google.dev/gemini-api/docs/models/gemini) can process text, images, and video), and has a relatively modest context window of 8k tokens. Vanilla transformer inference cost grows as the square of the context, though advances such as [Ring Attention are making those costs linear](https://learnandburn.ai/p/how-to-build-a-10m-token-context).

But for short conversations, the release of Llama 3 means: *you can now run a cutting-edge AI at home*!

# Using Chatbots already running in someone's cloud

Many sites now let you chat with AI Chatbots for free (often smaller versions of the best models) on some fairly beefy hardware (e.g. NVIDIA H100s, LPUs, TPUs, etc.). And before you get too excited about what you might do on your own hardware, give these AI tools a try online.

## Open chatbot playgrounds

+ [LMSYS leaderboard](https://chat.lmsys.org/?leaderboard): try models blind and vote for the best AI.
+ [Groq](https://groq.com/): blazing-fast inference for public models (Llama 70B, Gemma, Mixtral) from an [AI hardware startup](https://www.youtube.com/watch?v=Z0jqIk7MUfE).
+ [HuggingChat](https://huggingface.co/chat/models/microsoft/Phi-3-mini-4k-instruct): offers access to 8 open LLMs.
+ [Vercel.ai](https://sdk.vercel.ai/): another LLM comparison site.

## Proprietary chatbots

The very best chatbots are *quite* large. Google hasn't reported just how big Gemini 1.5 Pro is. GPT-4 was rumored to be a [1.7T parameter Mixture of Experts](https://twitter.com/swyx/status/1671272883379908608). But many of the leading contenders have dedicated chat sites:

+ [GPT-4 Turbo](https://chat.openai.com/) (OpenAI). 128k token context window. Multimodal.
+ [Claude](https://claude.ai/chat) (Anthropic) and the [Anthropic Workbench](https://console.anthropic.com/workbench).
+ [Gemini 1.5 Pro](https://gemini.google.com/) (Google). 128k token context window. Multimodal ([report](https://storage.googleapis.com/deepmind-media/gemini/gemini_v1_5_report.pdf)).
+ [Coral or Command R+](https://coral.cohere.com/) (Cohere).
+ [Reka AI](https://chat.reka.ai/).
+ [Perplexity](https://perplexity.ai/).
+ [Arctic](https://arctic.streamlit.app/) (Snowflake), a 480B parameter model.

# Running Llama 3 at home

Really really, you can run a cutting-edge chatbot at home. Here are three easy options for running a Large Language Model [locally](https://www.reddit.com/r/LocalLLaMA/) on your own hardware:

+ [llamafile](https://github.com/Mozilla-Ocho/llamafile)
llamafile is a single-file executable format for LLMs: one download bundles a model together with an inference engine, for example [Meta-Llama-3-70B-Instruct-llamafile](https://huggingface.co/jartine/Meta-Llama-3-70B-Instruct-llamafile).
This is by far the easiest method. Download a single [portable executable](https://justine.lol/ape.html) made by [@JustineTunney](https://twitter.com/JustineTunney) that includes the [llama.cpp](https://github.com/ggerganov/llama.cpp) inference engine and a language model. It will run on Windows, Linux, or Mac.
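A minimal sketch of the workflow (the exact file name below is an assumption; check the linked Hugging Face page for the current release):
 ```sh
 # Download a llamafile, mark it executable, and run it.
 # It starts a local chat UI in your browser (by default at http://localhost:8080).
 wget https://huggingface.co/jartine/Meta-Llama-3-70B-Instruct-llamafile/resolve/main/Meta-Llama-3-70B-Instruct.Q4_0.llamafile
 chmod +x Meta-Llama-3-70B-Instruct.Q4_0.llamafile
 ./Meta-Llama-3-70B-Instruct.Q4_0.llamafile
 ```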

+ [ollama](https://ollama.com/)
 ollama is a Go app wrapping llama.cpp. Tell it which model you want by name, and it will download the weights and serve them from a local server (on Linux, models are saved in `/usr/share/ollama/.ollama/models`).
 ```sh
 # install ollama
 curl -fsSL https://ollama.com/install.sh | sh
 # Run the 8B llama3 model
 ollama run llama3
 ```
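 Once the server is running you can also talk to it programmatically. A small sketch, assuming ollama's default local HTTP endpoint on port 11434:
 ```sh
 # Ask the locally served llama3 model a question over ollama's REST API
 curl http://localhost:11434/api/generate \
   -d '{"model": "llama3", "prompt": "Why is the sky blue?", "stream": false}'
 ```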

+ [llama.cpp](https://github.com/ggerganov/llama.cpp)
llama.cpp is a from-scratch LLM inference engine by [@ggerganov](https://twitter.com/ggerganov).
This engine and its sister [whisper.cpp](https://github.com/ggerganov/whisper.cpp) impressed the world with how easy it was to run cutting-edge AI models on consumer CPUs. It forms the core of the above two options and has attracted major contributions from across the industry to make it run well on a wide array of hardware.
 ```sh
 git clone https://github.com/ggerganov/llama.cpp
 cd llama.cpp
 # See the readme for more advanced build options
 cmake -B build
 cmake --build build --config Release
 # Download a model
 wget -O Phi-3-mini-4k-instruct-q4.gguf "https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/resolve/main/Phi-3-mini-4k-instruct-q4.gguf?download=true"
 # Run the chat (the cmake build puts binaries under build/bin/)
 ./build/bin/main -m ./Phi-3-mini-4k-instruct-q4.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e
 ```
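llama.cpp also ships a small HTTP server with a built-in web chat UI. A sketch, assuming the same cmake build as above (binary and flag names as of spring 2024):
 ```sh
 # Serve the model locally and chat in the browser at http://localhost:8080
 ./build/bin/server -m ./Phi-3-mini-4k-instruct-q4.gguf -c 4096 --port 8080
 ```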

# But... which chatbot *should* you run? Which *can* you run?

[Llama 3](https://llama.meta.com/llama3/) was released with two variants: 8B parameters and 70B parameters. Which one to run depends critically on two questions.

## Quantization: How much RAM do you need?

"Parameters" in a model refer to the learned floating point weights in the neural net. The more bits you use per parameter, the more accurate your math will be, but the more memory you will need. And as Finbarr Timbers describes in [How is LLaMa.cpp possible?](https://finbarr.ca/how-is-llama-cpp-possible/), running "inference" in a language model, especially for a single chat session, is *dominated by the amount of RAM being accessed*. Modern CPUs and GPUs can do a lot of multiplies while waiting on system memory.
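That observation gives a useful rule of thumb (a sketch that ignores compute and overhead): generating one token requires streaming every weight from memory once, so tokens/sec ≈ memory bandwidth ÷ model size in bytes.

```sh
# e.g. a 4-bit Llama 3 8B is roughly 4.7 GB; on a desktop with ~50 GB/s of memory bandwidth:
python3 -c "print(50 / 4.7)"   # ≈ 10.6 tokens/sec, close to the measured speeds later in this post
```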

Researchers have developed [AWQ](https://arxiv.org/abs/2306.00978) and [QuaRot](https://arxiv.org/pdf/2404.00456), two quantization methods that produce models retaining 99% of the perplexity of the original 16-bit models while using only 4 bits per parameter. Tim Dettmers also found [4 bits per parameter optimal](https://arxiv.org/abs/2212.09720), and in a blind test, [users found the differences between 4- and 16-bit models insignificant](https://github.com/ggerganov/llama.cpp/discussions/5962), though results rapidly degraded when trying to quantize to fewer than 4 bits per parameter.
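If you want to produce a 4-bit model yourself, llama.cpp ships its own quantization formats (its "k-quants" are distinct from AWQ and QuaRot). A rough sketch, run from the llama.cpp checkout built above; the script and tool names are as of spring 2024, and the input path is a placeholder:

```sh
# One-time: install the conversion script's Python dependencies
pip install -r requirements.txt
# Convert a locally downloaded Hugging Face checkpoint to a 16-bit GGUF file...
python3 convert-hf-to-gguf.py path/to/Meta-Llama-3-8B-Instruct --outtype f16 --outfile llama3-8b-f16.gguf
# ...then quantize it down to ~4 bits per weight
./build/bin/quantize llama3-8b-f16.gguf llama3-8b-Q4_K_M.gguf Q4_K_M
```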

So, assuming 4-bit quantization:
+ For Llama 3 70B you need at least 35 GB of RAM.
+ For Llama 3 8B you need at least 4 GB of RAM.
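Where do those numbers come from? A back-of-the-envelope sketch, ignoring the KV cache and other runtime overhead (which add a few more GB):

```sh
# RAM (GB) ≈ parameters (billions) × bits per parameter / 8
echo $(( 70 * 4 / 8 ))   # Llama 3 70B at 4-bit → 35 GB
echo $((  8 * 4 / 8 ))   # Llama 3 8B at 4-bit  →  4 GB
```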

## How much better is a larger model?

An RTX 4090 costs about $2k and has 24 GB of VRAM. To run the larger [Llama 3](https://llama.meta.com/llama3/) model on consumer GPUs, you would need two such cards to cover the 35 GB of VRAM needed to load it.

So is the larger model worth the extra cost? What about the other models out there? How good are they?

| Model | License | \# 4090s | Params (B) | [ARC-C](https://huggingface.co/datasets/allenai/ai2_arc) | [MMLU](https://huggingface.co/datasets/cais/mmlu) | [GSM8K](https://huggingface.co/datasets/gsm8k) | [GPQA](https://huggingface.co/datasets/Idavidrein/gpqa) | [HumanEval+MBPP+ Avg](https://evalplus.github.io/leaderboard.html) | [HumanEval](https://arxiv.org/pdf/2404.14219) |
| --------------------------------------------------------------------------------------------------- | -------------- | -------- | ---------- | ---------------- | ----- | ----- | ---- | ------------------- | ---- |
| [gpt-4-turbo-2024-04-09](https://github.com/openai/simple-evals) | $ | many | ? | | 86.5 | 93 | 49.1 | 86.6 | 87.6 |
| [Claude-3-Opus](https://www.anthropic.com/news/claude-3-family) | $ | many | ? | 96.4 | 86.8 | 95 | 49.7 | 76.8 | 84.9 |
| [Gemini Pro 1.5](https://storage.googleapis.com/deepmind-media/gemini/gemini_v1_5_report.pdf) | $ | many | ? | | 81.9 | 91.7 | 41.5 | 61 | 71.9 |
| [meta-llama/Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct) | Llama 3 CLA \* | 2 | 70 | 85.3 | 80.06 | 85.44 | 39.5 | 70.7 | 68.3 |
| [CohereForAI/c4ai-command-r-plus](https://huggingface.co/CohereForAI/c4ai-command-r-plus) | CC-BY-NC \* | 3 | 104 | | 75.73 | 70.74 | | 56.7 | 64 |
| [microsoft/Phi-3-mini-128k-instruct](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct) | MIT | 1 | 3.8 | 84.9 | 68.7 | 69.52 | 32.8 | | 57.9 |
| [mistralai/Mixtral-8x22B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1) | Apache 2.0 | 4 | 176 | | 77.77 | 82.03 | | 34.1 | 39.6 |
| [meta-llama/Meta-Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B) | Llama 3 CLA \* | 1 | 8 | 80.5 | 66.49 | 45.34 | 34.2 | 29.3 | 60.4 |
| [google/gemma-7b-it](https://huggingface.co/google/gemma-7b-it) | Gemma \* | 1 | 7 | 78.3 | 53.52 | 29.19 | | 24.4 | 17.7 |

The "frontier models" like Gemini Pro 1.5 are *really good*. Like, slight tweaks of Gemini get [a score of 91.1% on the US Medical Licensing Exam](https://arxiv.org/pdf/2404.18416). You can be a licensed doctor in the US with a score of [about 60%](https://www.usmle.org/bulletin-information/scoring-and-score-reporting). [Llama 3 70B gets 88% on the USMLE](https://ai-assisted-healthcare.com/2024/04/21/llama3-70b-performs-on-par-with-gpt-4-turbo-on-answering-usmle-questions/) while its smaller 8B sibling gets about [66%](https://ai-assisted-healthcare.com/2024/04/21/llama3-70b-performs-on-par-with-gpt-4-turbo-on-answering-usmle-questions/). If you need medical advice, do you really want to ask a doctor who barely passed their licensing exam?

How about programming ability? The final two columns cover that. [HumanEval](https://arxiv.org/pdf/2404.14219) was an early basic Python programming test suite, and HumanEval+MBPP+ Avg is a [much expanded](https://openreview.net/forum?id=1qvx610Cu7) version. GSM8K is a set of grade school math word problems. From the scores above, you can see: if you need coherent reasoning or the ability to write *good code*, use a paid chatbot (Gemini, GPT-4, Claude-Opus) or run Llama 3 70B.

If you're generating a horoscope reading, any of the above models is probably fine.

> *About the above standardized test scores*
>
> The above table's numbers are taken from either (a) the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard), (b) the model's technical report, or (c) a leaderboard associated with the linked test set.
>
> Hugging Face's Open LLM Leaderboard team found that getting scores for supposedly standardized problem sets is [fraught with subtleties](https://huggingface.co/blog/open-llm-leaderboard-mmlu). How exactly do you prompt the model? How do you evaluate whether the free-form output is the "right answer"? If giving examples is part of your prompt, how many examples do you give? Do you do [Chain of Thought](https://github.com/logikon-ai/awesome-deliberative-prompting/#readme) prompting? Further complicating things, new general methods for training a base model to reason better appear every week: [RLHF](https://huggingface.co/blog/rlhf), [DPO](https://openreview.net/forum?id=HPuSIXJaa9), and [IPO](https://arxiv.org/pdf/2404.19733) are just the latest.
>
> \* The licenses even for relatively "open" models often come with prohibitions against using them for illegal or societally negative activities, making too much money from them, or using them to improve your own model, and with requirements to give prominent credit to the model.

# Do you... even need a fancy GPU?

Though you may think you need an expensive, powerful RTX 4090 or an H100 to run a chatbot, you can get reasonable speed on small models even with a mobile CPU, and decent performance if you have a fancy Mac.

| Model | CPU / GPU | [Memory Bandwidth](https://finbarr.ca/how-is-llama-cpp-possible/) | generation speed |
| ----- | -------- | ------ | ------ |
| Llama 2 7B Q4_0 | Pixel 5 | 13 GB/s | [1 tok/s](https://twitter.com/rgerganov/status/1635604465603473408) |
| Llama 3 8B Q4_0 | Ryzen 5900X + 64GB DDR4 | 50 GB/s | 10 tok/s |
| Llama 3 8B Q4_0 | M1 + 16GB LPDDR4X | 66 GB/s | 11.5 tok/s |
| Llama 3 8B Q4_0 | Ryzen 5900X + 64GB DDR4 + GTX 1080 8GB | 320 GB/s | 31 tok/s |
| Llama 2 7B Q4_0 | M1 Pro | 200 GB/s | [36 tok/s](https://github.com/ggerganov/llama.cpp/discussions/4167) |
| Llama 2 7B Q4_0 | M2 Ultra | 800 GB/s | [94 tok/s](https://github.com/ggerganov/llama.cpp/discussions/4167) |
| Llama 2 7B Q4_0 | 7900 XTX | 960 GB/s | [131 tok/s](https://blog.mlc.ai/2023/08/09/Making-AMD-GPUs-competitive-for-LLM-inference) |
| Llama 2 7B Q4_0 | RTX 4090 | 1008 GB/s | [159 tok/s](https://blog.mlc.ai/2023/08/09/Making-AMD-GPUs-competitive-for-LLM-inference) |
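You can measure tokens/sec on your own machine, too. A small sketch using llama.cpp's bundled benchmark tool (assuming the cmake build from earlier; the tool name is as of spring 2024):

```sh
# Reports prompt-processing and text-generation tokens/sec for your hardware
./build/bin/llama-bench -m ./Phi-3-mini-4k-instruct-q4.gguf
```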

For a single interactive chat session, running on a CPU with lots of memory is about as fast as running on a GPU with similar memory bandwidth. Running across multiple GPUs [can be made to work](https://blog.mlc.ai/2023/10/19/Scalable-Language-Model-Inference-on-Multiple-NVDIA-AMD-GPUs), but is [not necessarily trivial](https://github.com/ggerganov/llama.cpp/issues/3051). For the big models, you really *must* have a high-end Mac, use the cloud, or install several GPUs to get reasonable speed.

| Model | CPU / GPU | Power | Memory Bandwidth | generation speed |
| ----- | -------- | ------ | ------ | ----- |
| Llama 3 70B Q5_KM | Ryzen 5900X + 64GB DDR4 | 105 W | 50 GB/s | 0.96 tok/s |
| Llama 3 70B Q5_KM | Ryzen 5900X + 64GB DDR4 + GTX 1080 8GB (-ngl 8) | 285 W | 320 GB/s | 0.99 tok/s |
| Llama 3 70B f16 | AMD Threadripper Pro 7995WX + DDR5 | 350 W | 500 GB/s | [5.9 tok/s](https://huggingface.co/jartine/Meta-Llama-3-70B-Instruct-llamafile) |
| Llama 3 70B Q4_0 | M2 Ultra 128GB | 90 W | 800 GB/s | [14 tok/s](https://huggingface.co/jartine/Meta-Llama-3-70B-Instruct-llamafile) |
| Llama 3 70B Q4_0 | 2x RTX 3090 24GB or 2x RTX 4090 24GB | 1050 W | 2016 GB/s | [20 tok/s](https://www.reddit.com/r/LocalLLaMA/comments/18dbnxg/anyone_done_the_numbers_on_the_new_threadripper/) |
| Llama 2 70B FP8 | Nx H100s @ perplexity.ai | Nx 700 W | Nx 2,000 GB/s | [30 tok/s](https://github.com/ray-project/llmperf-leaderboard) |
| Llama 3 70B FP16 | Groq LPUs @ groq.com | Nx 215 W | Nx 80,000 GB/s | [185 tok/s](https://github.com/ray-project/llmperf-leaderboard) |
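The `-ngl 8` in the second row is llama.cpp's `--n-gpu-layers` flag: when a model doesn't fit in VRAM, you can offload just some of its layers to the GPU and keep the rest in system RAM. A sketch (the GGUF file name is a placeholder):

```sh
# Offload the first 8 transformer layers to the GPU; the remaining layers run on the CPU
./build/bin/main -m ./Meta-Llama-3-70B-Instruct.Q5_K_M.gguf -ngl 8 -p "Hello" -n 128
```

As the table shows, offloading only a handful of a 70B model's layers barely moves the needle (0.96 vs 0.99 tok/s); the GPU starts to pay off once most or all of the model fits in VRAM.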

Note: NVIDIA claims much faster inference on their hardware with their new [ChatRTX](https://www.nvidia.com/en-us/ai-on-rtx/chatrtx/) app and its underlying [TensorRT-LLM](https://developer.nvidia.com/blog/turbocharging-meta-llama-3-performance-with-nvidia-tensorrt-llm-and-nvidia-triton-inference-server/) library, but [it's tricky to get running as of May 2024](https://www.reddit.com/r/LocalLLaMA/comments/1b4iy16/is_there_any_benchmark_data_comparing_performance/).
| 171 | + |
| 172 | +# Conclusion as of 2 May 2024 |
| 173 | + |
| 174 | ++ Running a high end AI model at home (Llama 3 70B) is doable but painfully slow for $2k, and reasonable at $6k. |
| 175 | ++ If you have GPUs but they don't have enough VRAM for your model, they don't buy you much! |
| 176 | ++ A high end Mac is extremely competitive and uses much less power than a multiple-GPU PC for LLM inference. |
| 177 | ++ [groq](https://groq.com/) is *so much faster* than running it yourself, that if you're just talking to Llama 3... just use groq. |
| 178 | + |
0 commit comments