Running Llama 3 Locally
Oct. 2, 2024 by @anthonynsimon
I recently tried out Llama 3.2 on my laptop and was positively surprised that you can run a rather capable model on modest hardware (without a GPU), so I thought I'd share a brief guide on how to run it locally.
System requirements
Generally, the larger the model, the more "knowledge" it has, but also the more resources it needs to run. The choice usually comes down to a trade-off between cost, speed, and model size.
It's important to note that while you can run Llama 3 on a CPU, using a GPU will typically be far more efficient (but also more expensive).
Also, how much memory a model needs depends on several factors, such as the number of parameters, the data type used (e.g. F16, F32), and optimization techniques.
As a rule of thumb (assuming F16 half-precision):
- 1B parameters: ~2GB memory
- 3B parameters: ~6GB memory
- 7B parameters: ~14GB memory
- 70B parameters: ~140GB memory
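To make the arithmetic behind this rule of thumb explicit, here's a rough back-of-the-envelope sketch in Python (my own illustration, not something from ollama). It simply multiplies the parameter count by the bytes per parameter for a given data type, and ignores the extra memory needed for the KV cache and runtime overhead, so treat the result as a lower bound.

```python
# Rough estimate of model memory from parameter count and data type.
# Ignores KV cache, activations, and runtime overhead, so this is a
# lower bound rather than an exact requirement.
BYTES_PER_PARAM = {
    "F32": 4,  # full precision
    "F16": 2,  # half precision (the rule of thumb above)
    # Quantized formats (e.g. 8-bit or 4-bit) need even less per parameter.
}

def estimate_memory_gb(n_params_billion: float, dtype: str = "F16") -> float:
    """Approximate memory in GB for a model with the given parameter count."""
    bytes_total = n_params_billion * 1e9 * BYTES_PER_PARAM[dtype]
    return bytes_total / 1e9  # decimal GB keeps the numbers round

if __name__ == "__main__":
    for size in (1, 3, 7, 70):
        print(f"{size}B params @ F16 ≈ {estimate_memory_gb(size):.0f} GB")
```

Running it reproduces the numbers above: roughly 2, 6, 14, and 140 GB.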
In this guide I'll be using Llama 3.2 with 1B parameters, which is not too resource-intensive and surprisingly capable, even without a GPU.
Install ollama
First, install ollama. It's a CLI tool to easily download, run, and serve LLMs from your machine.
For Mac and Windows, you should follow the instructions on the ollama website.
If you're using Linux, there's a convenient installation script:
curl -fsSL https://ollama.com/install.sh | sh
Once you're done, you should have the ollama CLI available in your terminal:
ollama
Large language model runner
Usage:
  ollama [flags]
  ollama [command]

Available Commands:
  serve       Start ollama
  create      Create a model from a Modelfile
  show        Show information for a model
  run         Run a model
  stop        Stop a running model
  pull        Pull a model from a registry
  push        Push a model to a registry
  list        List models
  ps          List running models
  cp          Copy a model
  rm          Remove a model
  help        Help about any command

Flags:
  -h, --help      help for ollama
  -v, --version   Show version information
Use "ollama [command] --help" for more information about a command.
Next, let's download an LLM to try it out.
Download model
You can download any supported model with the following command:
ollama pull <model>
Here are some examples for popular models:
| Model | Parameters | Size | Command |
|---|---|---|---|
| Llama 3.2 | 1B | 1.3 GB | ollama pull llama3.2:1b |
| Llama 3.2 | 3B | 2.0 GB | ollama pull llama3.2 |
| Mistral | 7B | 4.1 GB | ollama pull mistral |
| Code Llama | 7B | 3.8 GB | ollama pull codellama |
You can view all the available models at the ollama library.
For this guide, I'll download the model with 1B parameters since it's the smallest available:
ollama pull llama3.2:1b
pulling manifest
pulling dde5aa3fc5ff... 100% ▕██████████████████████████████▏ 2.0 GB
pulling 966de95ca8a6... 100% ▕██████████████████████████████▏ 1.4 KB
pulling fcc5a6bec9da... 100% ▕██████████████████████████████▏ 7.7 KB
pulling a70ff7e570d9... 100% ▕██████████████████████████████▏ 6.0 KB
pulling 56bb8bd477a5... 100% ▕██████████████████████████████▏ 96 B
pulling 34bb5ab01051... 100% ▕██████████████████████████████▏ 561 B
verifying sha256 digest
writing manifest
success
Once downloaded, the model is ready to use.
Run Llama from the terminal
Once you have downloaded a model, you can chat with it from the terminal:
ollama run llama3.2:1b
>>> What do you think about ChatGPT?
...
We are both chatbots, but I was created by Meta, while ChatGPT was developed by OpenAI.
Our training data, language understanding, and overall tone are unique, so we each have
different strengths and capabilities.
Serve Llama over HTTP
You can also run ollama as a server that exposes your downloaded models via a REST API:
ollama serve
Note that ollama serve doesn't take a model name; you specify the model in each request.
This will start a server on http://localhost:11434 so you can interact with the model via HTTP requests.
For example, send a POST request with the same prompt as before:
curl http://localhost:11434/api/generate -d '{
"model": "llama3.2:1b",
"stream": false,
"prompt": "What do you think about ChatGPT?"
}'
You can view the full API docs in the ollama repository on GitHub.
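Since it's just HTTP, any client works. As a quick sketch, here's the same request sent from Python with the requests library (an assumption on my part; install it with pip install requests). The next section covers the dedicated ollama Python client, but this shows there's nothing special about the endpoint.

```python
import requests

# Same request as the curl example above, sent from Python.
# Assumes the ollama server is running locally on the default port.
payload = {
    "model": "llama3.2:1b",
    "stream": False,
    "prompt": "What do you think about ChatGPT?",
}

resp = requests.post("http://localhost:11434/api/generate", json=payload)
resp.raise_for_status()

# With "stream": false, the full answer comes back in the "response" field.
print(resp.json()["response"])
```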
Programmatic access with Python
Now that you have the LLM running, you can also interact with it using Python.
First, install the ollama library:
pip install ollama
Now, you can access the model that's running locally like this:
import ollama

# Ask the locally running model a question via the ollama Python client.
response = ollama.chat(
    model='llama3.2:1b',
    messages=[
        {
            'role': 'user',
            'content': 'What do you think about ChatGPT?'
        },
    ]
)

# The model's reply is in the message content of the response.
print(response['message']['content'])
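If you want the output to appear token by token, like in the terminal, the client can also stream the response. Here's a minimal sketch, assuming a reasonably recent version of the ollama package where stream=True returns an iterator of chunks:

```python
import ollama

# Stream the reply chunk by chunk instead of waiting for the full response.
stream = ollama.chat(
    model='llama3.2:1b',
    messages=[{'role': 'user', 'content': 'What do you think about ChatGPT?'}],
    stream=True,
)

for chunk in stream:
    # Each chunk carries a small piece of the message content.
    print(chunk['message']['content'], end='', flush=True)
print()
```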
Conclusion
That's all - you now have Llama 3 running locally on your machine. You can chat with it from the terminal, serve it via HTTP, or access it programmatically using Python.
As a next step, you could try out other models such as Mistral, or set up Open WebUI to chat with the model from your browser.
You could also try running the model on a GPU for better performance.