How to run Gemma 3 locally

Gemma 3 is a family of open-weight language models from Google, designed to be lightweight and efficient while still performing strongly for their size. It handles code generation, reasoning, and instruction following well, which makes it a versatile tool for developers and researchers alike.

This guide will walk you through setting up and running Gemma 3 locally with Ollama.

Table of Contents

Choosing a Model & Hardware Requirements
Setting up Ollama
Getting Gemma 3
Starting a Conversation
REST API Integration
Python Development
Next Steps

Choosing a Model & Hardware Requirements

Gemma 3 comes in several sizes, so you can pick the best fit for your hardware. The 4B, 12B, and 27B models are multimodal (handling both text and images) and have a 128K-token context window; the 1B model is text-only and has a smaller 32K-token context window.

Memory requirements by model size:

Gemma 3 1B: Requires at least 2GB of RAM.
Gemma 3 4B: Requires at least 4GB of RAM. A great starting point for most modern laptops.
Gemma 3 12B: Requires at least 9GB of RAM.
Gemma 3 27B: Requires at least 18GB of RAM. The most capable model in the family.
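
As a rough illustration of how the figures above might be used, here is a minimal Python sketch that picks the largest model whose minimum RAM fits a given budget. The function name `largest_model_that_fits` is hypothetical, and the numbers are simply copied from the list above; real memory use also depends on context length and whatever else is running.

```python
# Minimum RAM figures (in GB), copied from the list above.
# Keys are the Ollama tags used later in this guide.
MIN_RAM_GB = {
    "gemma3:1b": 2,
    "gemma3:4b": 4,
    "gemma3:12b": 9,
    "gemma3:27b": 18,
}


def largest_model_that_fits(available_gb):
    """Return the largest tag whose minimum RAM fits, or None if none do."""
    fitting = [(ram, tag) for tag, ram in MIN_RAM_GB.items() if ram <= available_gb]
    return max(fitting)[1] if fitting else None


print(largest_model_that_fits(8))
print(largest_model_that_fits(16))
```

With 8GB available this selects gemma3:4b, and with 16GB it selects gemma3:12b. Treat the result as a starting point rather than a guarantee.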

CPU vs. GPU:

CPU: All models can run on the CPU, but generation will be slower. That is usually acceptable for non-interactive tasks.
GPU: For the fastest response times, especially with larger models, running on a compatible GPU is highly recommended. It can result in roughly a 3-10x speedup, depending on the hardware and model size.

The 4B model often hits the sweet spot, delivering strong performance on consumer hardware without demanding excessive resources.

Setting up Ollama

First, get Ollama. It's a powerful tool for running large language models locally.

macOS/Windows: Download from ollama.com.
Linux: Install with a single command:

curl -fsSL https://ollama.com/install.sh | sh

After installation, verify it's working by checking the version:

ollama --version
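
If you would rather do this check from a script, the following is a small sketch using only the Python standard library. The helper name `ollama_version` is made up for this example; it assumes the `ollama` executable is on your PATH and simply runs the same command shown above.

```python
import shutil
import subprocess


def ollama_version():
    """Return the output of `ollama --version`, or None if the CLI is not found."""
    if shutil.which("ollama") is None:
        return None
    result = subprocess.run(
        ["ollama", "--version"], capture_output=True, text=True
    )
    # Some tools print version info to stderr, so fall back to it if stdout is empty.
    return result.stdout.strip() or result.stderr.strip()


version = ollama_version()
print(version if version else "ollama not found on PATH")
```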

Getting Gemma 3

With Ollama installed, you can pull a Gemma 3 model. We'll start with the versatile 4B model.

ollama pull gemma3:4b

This command downloads the model file, which is about 3.3GB.

Available Model Variants

You can choose the model that best suits your needs:

gemma3:1b: The smallest, ideal for simple tasks and resource-constrained devices.
gemma3:4b: The recommended default for most users.
gemma3:12b: A more powerful option for complex reasoning.
gemma3:27b: The largest and most capable model for demanding tasks.

Quantized Models for Efficiency

Gemma 3 also offers Quantization-Aware Trained (QAT) versions. These models provide nearly the same quality as their half-precision counterparts but use up to 3x less memory than the non-quantized weights, making them even more accessible.

To pull a QAT model, append -it-qat to the tag. For example, for the 4B model:

ollama pull gemma3:4b-it-qat
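
The tag naming scheme can be captured in a few lines of Python. This is only a sketch: `gemma3_tag` is a hypothetical helper, and it assumes a QAT tag is published for every size, which is worth confirming on the Ollama model page before relying on it.

```python
# Sizes listed under "Available Model Variants".
SIZES = ("1b", "4b", "12b", "27b")


def gemma3_tag(size, qat=False):
    """Build an Ollama tag such as 'gemma3:4b' or 'gemma3:4b-it-qat'."""
    if size not in SIZES:
        raise ValueError(f"unknown Gemma 3 size: {size!r}")
    base = f"gemma3:{size}"
    # Assumption: the QAT variant follows the '-it-qat' suffix for every size.
    return f"{base}-it-qat" if qat else base


print(gemma3_tag("4b"))
print(gemma3_tag("4b", qat=True))
```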

Starting a Conversation

To chat with Gemma 3 directly from your terminal, use the run command:

ollama run gemma3:4b

Now you can test it with a prompt, for example a small coding task:

>>> Write a Python function to check if a number is prime.

Certainly! Here is a Python function to check if a number is prime, along with an explanation.  

def is_prime(n):  
    """  
    Checks if a number is prime.  
    A prime number is a natural number greater than 1 that has no positive  
    divisors other than 1 and itself.  
    """  
    if n <= 1:  
        return False  
    if n <= 3:  
        return True  
    if n % 2 == 0 or n % 3 == 0:  
        return False  
    i = 5  
    while i * i <= n:  
        if n % i == 0 or n % (i + 2) == 0:  
            return False  
        i += 6  
    return True  

# Example usage:  
print(is_prime(29))   # Output: True  
print(is_prime(15))   # Output: False

REST API Integration

Ollama exposes a local REST API on port 11434, allowing you to integrate Gemma 3 into your applications.

Here’s how to send a request using curl:

curl http://localhost:11434/api/generate -d '{
  "model": "gemma3:4b",
  "prompt": "Explain the concept of recursion with a simple example",
  "stream": false
}'
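
The same request can be sent from Python without any third-party packages. The sketch below uses the standard library's `urllib`; the helper names `build_generate_request` and `generate` are made up for this example, and it assumes Ollama is listening on the default port on localhost.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default local address used in this guide


def build_generate_request(prompt, model="gemma3:4b"):
    """Build (but do not send) a POST request for /api/generate."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, model="gemma3:4b"):
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_generate_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]
```

With the Ollama server running, `print(generate("Explain the concept of recursion with a simple example"))` should print the model's answer. Building the request separately from sending it makes the payload easy to inspect or test.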

For conversational interactions, use the chat endpoint (/api/chat):

curl http://localhost:11434/api/chat -d '{
  "model": "gemma3:4b",
  "messages": [
    {"role": "user", "content": "What are the benefits of using Gemma 3 over other language models?"}
  ],
  "stream": false
}'
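
If you leave out "stream": false, the chat endpoint streams its answer as one JSON object per line, with the final object marked "done": true. Here is a minimal sketch of how such a stream might be stitched back together in Python. The function name `join_chat_stream` is hypothetical, and the sample lines are hand-written for illustration, not real model output.

```python
import json


def join_chat_stream(lines):
    """Concatenate the message content from newline-delimited JSON chunks."""
    parts = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines between chunks
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break  # the final chunk signals the end of the reply
    return "".join(parts)


# Hand-written sample chunks in the shape the API is expected to return.
sample = [
    '{"message": {"role": "assistant", "content": "Hel"}, "done": false}',
    '{"message": {"role": "assistant", "content": "lo"}, "done": false}',
    '{"message": {"role": "assistant", "content": ""}, "done": true}',
]
print(join_chat_stream(sample))
```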

See the Ollama API reference for the full list of endpoints and options.

Python Development

To use Gemma 3 in your Python projects, install the official Ollama Python library:

pip install ollama

Quick Start

Here’s how you can generate text or have a conversation:

import ollama

# Simple text generation
response = ollama.generate(
    model='gemma3:4b',
    prompt='Create a simple web scraper in Python using requests and BeautifulSoup'
)
print(response['response'])

# Conversational interface
chat_response = ollama.chat(
    model='gemma3:4b',
    messages=[
        {'role': 'user', 'content': 'Explain the difference between a list and a tuple in Python'}
    ]
)
print(chat_response['message']['content'])

Real-time Streaming

For a more interactive experience, you can stream the response token by token:

stream = ollama.chat(
    model='gemma3:4b',
    messages=[{'role': 'user', 'content': 'Write a short, creative story about an AI discovering music.'}],
    stream=True,
)

for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
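
Often you want to show the text as it arrives and also keep the full reply for later use. One way to do that is sketched below. `collect_stream` is a hypothetical helper, and the `fake_stream` list stands in for the iterator returned by `ollama.chat(..., stream=True)` so the example runs without a server.

```python
def collect_stream(stream):
    """Print each chunk as it arrives and return the full text."""
    parts = []
    for chunk in stream:
        text = chunk['message']['content']
        print(text, end='', flush=True)
        parts.append(text)
    print()  # finish the line once the stream ends
    return ''.join(parts)


# Stand-in for a real stream; chunks have the same shape as in the loop above.
fake_stream = [
    {'message': {'content': 'Once '}},
    {'message': {'content': 'upon a time'}},
]
story = collect_stream(fake_stream)
```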

Customizing with System Prompts

You can guide the model's behavior by providing a system prompt:

response = ollama.chat(
    model='gemma3:4b',
    messages=[
        {'role': 'system', 'content': 'You are a helpful coding assistant who provides concise, working code examples.'},
        {'role': 'user', 'content': 'How do I handle exceptions in Python?'}
    ]
)
print(response['message']['content'])
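
Each call to the chat function is stateless, so a multi-turn conversation means resending the earlier messages along with the new one. The sketch below shows one possible way to keep that history. The `Conversation` class and `fake_chat` function are invented for this example; in real use you would pass `ollama.chat` as `chat_fn` instead of the stand-in.

```python
class Conversation:
    """Keeps the message history so each request includes earlier turns."""

    def __init__(self, chat_fn, model='gemma3:4b', system=None):
        self.chat_fn = chat_fn  # e.g. ollama.chat
        self.model = model
        self.messages = []
        if system:
            self.messages.append({'role': 'system', 'content': system})

    def ask(self, text):
        self.messages.append({'role': 'user', 'content': text})
        response = self.chat_fn(model=self.model, messages=self.messages)
        reply = response['message']['content']
        # Store the assistant's reply so the next turn has full context.
        self.messages.append({'role': 'assistant', 'content': reply})
        return reply


def fake_chat(model, messages):
    """Stand-in for ollama.chat that echoes the last user message."""
    return {'message': {'role': 'assistant',
                        'content': f"echo: {messages[-1]['content']}"}}


convo = Conversation(fake_chat, system='You are a helpful coding assistant.')
print(convo.ask('How do I handle exceptions in Python?'))
```

Long conversations grow the prompt on every turn, so you may eventually need to trim or summarize older messages to stay within the context window.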

Next Steps

You now have Google's powerful Gemma 3 model running locally. Here are a few ideas to explore next:

Experiment with Different Models: Try the gemma3:1b model for speed or upgrade to gemma3:12b or gemma3:27b if you have the hardware and need maximum performance.
Explore Multimodality: The 4B, 12B, and 27B models can process images. Try passing image data through the API for vision-related tasks.
Build an Application: Use a web framework like Flask or FastAPI to build a custom application around the Ollama API, or integrate it with a tool like Open WebUI for a local ChatGPT-like experience.
Optimize Performance: If you have a dedicated GPU, ensure Ollama is configured to use it for significantly faster inference speeds.
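
For the multimodality idea above, the chat endpoint accepts base64-encoded images in an "images" list on a message. The following sketch only builds such a payload; `build_image_chat_payload` is a hypothetical helper, and the placeholder bytes would be replaced by the contents of a real image file.

```python
import base64
import json


def build_image_chat_payload(image_bytes, question, model="gemma3:4b"):
    """Build a /api/chat payload that attaches one base64-encoded image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [
            {"role": "user", "content": question, "images": [encoded]}
        ],
        "stream": False,
    }


# Placeholder bytes; in practice, read them with open(path, "rb").read().
payload = build_image_chat_payload(b"not a real image", "What is in this picture?")
print(json.dumps(payload, indent=2))
```

The resulting dictionary can be sent as the JSON body of a POST request to http://localhost:11434/api/chat. Remember that the 1B model does not accept images.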

Gemma 3’s blend of top-tier performance and open accessibility makes it a fantastic tool for developers looking to build with state-of-the-art AI without relying on cloud APIs.
