Running Llama 3 Locally

Oct. 2, 2024 by @anthonynsimon

I recently tried out Llama 3.2 on my laptop and was pleasantly surprised that you can run a rather capable model on modest hardware (without a GPU), so I thought I'd share a brief guide on how to run it locally.

System requirements

Generally, the larger the model, the more "knowledge" it has, but also the more resources it needs to run. The choice usually comes down to a trade-off between cost, speed, and model size.

It's important to note that while you can run Llama 3 on a CPU, using a GPU will typically be far more efficient (but also more expensive).

Also, how much memory a model needs depends on several factors, such as the number of parameters, the data type used (e.g. F16, F32), and optimization techniques like quantization.

As a rule of thumb (assuming F16 half-precision, which uses 2 bytes per parameter; see the quick calculation after this list):

  • 1B parameters: ~2GB memory
  • 3B parameters: ~6GB memory
  • 7B parameters: ~14GB memory
  • 70B parameters: ~140GB memory
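
These estimates follow from multiplying the parameter count by the bytes per parameter (2 bytes at F16, 4 bytes at F32), ignoring runtime overhead. Here's a quick back-of-the-envelope sketch in Python; the values are approximations, and real memory usage also depends on context length and any quantization applied:

# Approximate bytes per parameter for common data types
BYTES_PER_PARAM = {"F32": 4, "F16": 2}

def estimate_memory_gb(params_billions, dtype="F16"):
    """Rough estimate: parameter count x bytes per parameter, ignoring overhead."""
    return params_billions * BYTES_PER_PARAM[dtype]

for size in (1, 3, 7, 70):
    print(f"{size}B parameters at F16: ~{estimate_memory_gb(size)} GB")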

In this guide I'll be using Llama 3.2 with 1B parameters, which is not too resource-intensive and surprisingly capable, even without a GPU.

Install ollama

First, install ollama. It's a CLI tool to easily download, run, and serve LLMs from your machine.

For Mac and Windows, you should follow the instructions on the ollama website.

If you're using Linux, there's a convenient installation script:

curl -fsSL https://ollama.com/install.sh | sh

Once you're done, you should have the ollama CLI available in your terminal:

ollama

Large language model runner

Usage:
  ollama [flags]
  ollama [command]

Available Commands:
  serve       Start ollama
  create      Create a model from a Modelfile
  show        Show information for a model
  run         Run a model
  stop        Stop a running model
  pull        Pull a model from a registry
  push        Push a model to a registry
  list        List models
  ps          List running models
  cp          Copy a model
  rm          Remove a model
  help        Help about any command

Flags:
  -h, --help      help for ollama
  -v, --version   Show version information

Use "ollama [command] --help" for more information about a command.

Next, let's download an LLM to try it out.

Download model

You can download any supported LLM with the following command:

ollama pull <model>

Here are some examples for popular models:

Model        Parameters   Size     Command
Llama 3.2    1B           1.3 GB   ollama pull llama3.2:1b
Llama 3.2    3B           2.0 GB   ollama pull llama3.2
Mistral      7B           4.1 GB   ollama pull mistral
Code Llama   7B           3.8 GB   ollama pull codellama

You can view all the available models at the ollama library.

For this guide, I'll download the model with 1B parameters since it's the smallest available:

ollama pull llama3.2:1b

pulling manifest
pulling dde5aa3fc5ff... 100% ▕██████████████████████████████▏ 2.0 GB
pulling 966de95ca8a6... 100% ▕██████████████████████████████▏ 1.4 KB
pulling fcc5a6bec9da... 100% ▕██████████████████████████████▏ 7.7 KB
pulling a70ff7e570d9... 100% ▕██████████████████████████████▏ 6.0 KB
pulling 56bb8bd477a5... 100% ▕██████████████████████████████▏   96 B
pulling 34bb5ab01051... 100% ▕██████████████████████████████▏  561 B
verifying sha256 digest
writing manifest
success

Once downloaded, the model is ready to use.

Run Llama from the terminal

Once you have downloaded a model, you can chat with it from the terminal:

ollama run llama3.2:1b

>>> What do you think about ChatGPT?
We are both chatbots, but I was created by Meta, while ChatGPT was developed by OpenAI.
Our training data, language understanding, and overall tone are unique, so we each have
different strengths and capabilities.

Serve Llama over HTTP

You can also have ollama serve models over a REST API:

ollama serve

This starts a server on http://localhost:11434 so you can interact with your downloaded models via HTTP requests. (Depending on how you installed ollama, this server may already be running in the background.)

For example, send a POST request with the same prompt as before:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2:1b",
  "stream": false,
  "prompt": "What do you think about ChatGPT?"
}'

You can view the full API docs in the ollama GitHub repository.
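
Since it's a plain HTTP API, you can call it from any client, not just curl. Here's a quick sketch using Python's requests library; with "stream": false the server returns a single JSON object whose "response" field holds the generated text:

import requests

# Call the local ollama server's generate endpoint (non-streaming)
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2:1b",
        "stream": False,
        "prompt": "What do you think about ChatGPT?",
    },
    timeout=120,
)
resp.raise_for_status()

# The non-streaming response contains the generated text in the "response" field
print(resp.json()["response"])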

Programmatic access with Python

Now that you have the LLM running, you can also interact with it using Python.

First, install the ollama library:

pip install ollama

Now, you can access the model that's running locally like this:

import ollama

# Send a chat request to the locally running model
response = ollama.chat(
    model='llama3.2:1b',
    messages=[
        {
            'role': 'user',
            'content': 'What do you think about ChatGPT?'
        },
    ]
)

# Print the model's reply
print(response['message']['content'])
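
If you'd rather print tokens as they are generated instead of waiting for the full reply, the ollama library also supports streaming. A brief sketch using the same model:

import ollama

# Stream the reply chunk by chunk instead of waiting for the full response
stream = ollama.chat(
    model='llama3.2:1b',
    messages=[{'role': 'user', 'content': 'What do you think about ChatGPT?'}],
    stream=True,
)

for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)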

Conclusion

That's all: you now have Llama 3 running locally on your machine. You can chat with it from the terminal, serve it via HTTP, or access it programmatically using Python.

As a next step, you could try out other models such as Mistral, or set up Open WebUI to chat with the model from your browser.

You could also try running the model on a GPU for better performance.