How to run Gemma 4 locally

Gemma 4 is Google's latest open model family, built on Gemini 3 research and technology to maximize intelligence per parameter. It accepts text, audio, and image input and offers a 256K-token context window, a major step up from its predecessor.

This guide will walk you through setting up and running Gemma 4 locally with Ollama.

Table of Contents

Choosing a Model & Hardware Requirements
Setting up Ollama
Getting Gemma 4
Starting a Conversation
REST API Integration
Python Development
Next Steps

Choosing a Model & Hardware Requirements

Gemma 4 comes in four sizes, split into two tiers: compact edge models (E2B, E4B) designed for mobile and IoT devices, and larger models (26B, 31B) optimized for consumer GPUs and workstations.

Model sizes:

Gemma 4 E2B: The smallest model, designed for mobile and edge devices. Runs with minimal resources.
Gemma 4 E4B: Compact but capable. A good starting point for most modern laptops.
Gemma 4 26B: A Mixture-of-Experts model with 26B total parameters and 4B active. Strong reasoning at lower compute cost.
Gemma 4 31B: The largest and most capable model. Delivers frontier-level intelligence on consumer GPUs.

CPU vs. GPU:

CPU: All models can run on a CPU, but generation is slower. That is usually acceptable for non-interactive tasks.
GPU: For the fastest responses, especially with the 26B and 31B models, a compatible GPU is highly recommended. Running on a GPU can deliver a 3-10x speedup over CPU inference.

The E4B model is a great starting point, delivering strong performance on consumer hardware without demanding excessive resources. If you have a dedicated GPU with 16GB+ VRAM, the 26B model offers excellent reasoning capabilities.
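
As a rough sanity check on these recommendations, the memory needed for a model's weights can be approximated as parameter count times bits per weight. The sketch below assumes 4-bit quantized weights (an assumption; the quantization behind each Ollama tag is not stated in this guide) and ignores the KV cache and runtime overhead, so real usage will be higher. The function name is hypothetical.

```python
def estimate_weight_memory_gb(params_billion: float, bits_per_weight: float = 4.0) -> float:
    """Approximate memory for the model weights alone, in GB (10^9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    # Parameter counts taken from the model list above; 4-bit is an assumption.
    for name, size in [("26B", 26), ("31B", 31)]:
        estimate = estimate_weight_memory_gb(size)
        print(f"Gemma 4 {name}: roughly {estimate:.1f} GB of weights at 4-bit")
```

The estimate uses the total parameter count for the 26B model because a Mixture-of-Experts model typically keeps all experts in memory, even though only 4B parameters are active per token.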

Setting up Ollama

First, get Ollama. It's a powerful tool for running large language models locally.

macOS/Windows: Download from ollama.com.
Linux: Install with a single command:

curl -fsSL https://ollama.com/install.sh | sh

After installation, verify it's working by checking the version:

ollama --version
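
The CLI check confirms the binary is installed. To confirm that the background server is also running, one option is to query its version endpoint from Python. This is a minimal sketch using only the standard library; it assumes the server listens on the default address http://localhost:11434, and the helper name is hypothetical.

```python
import json
import urllib.request


def ollama_version(base_url: str = "http://localhost:11434", timeout: float = 2.0):
    """Return the server's version string, or None if it cannot be reached."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/version", timeout=timeout) as resp:
            return json.load(resp).get("version")
    except (OSError, ValueError):
        # OSError covers refused connections and timeouts; ValueError covers bad JSON.
        return None


if __name__ == "__main__":
    version = ollama_version()
    if version:
        print(f"Ollama server version: {version}")
    else:
        print("Ollama server not reachable")
```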

Getting Gemma 4

With Ollama installed, you can pull a Gemma 4 model. We'll start with the default variant.

ollama pull gemma4

Available Model Variants

You can choose the model that best suits your needs:

gemma4:e2b: The smallest, ideal for edge devices and resource-constrained environments.
gemma4:e4b: Compact and capable, a solid default for laptops and desktops.
gemma4:26b: Mixture-of-Experts architecture with strong reasoning at lower compute cost.
gemma4:31b: The largest and most capable model for demanding tasks.
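
After pulling one or more variants, it can be handy to check which tags are actually installed. The CLI command for this is ollama list; the sketch below does the same from Python through the local /api/tags endpoint. It assumes the default server address and that each entry in the response carries a "name" field, and the function names are hypothetical.

```python
import json
import urllib.request


def model_names(tags_payload: dict) -> list[str]:
    """Extract model names from the JSON body returned by /api/tags."""
    return [model["name"] for model in tags_payload.get("models", [])]


def fetch_installed(base_url: str = "http://localhost:11434") -> list[str]:
    """Ask the local Ollama server which models are installed."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return model_names(json.load(resp))


if __name__ == "__main__":
    try:
        print(fetch_installed())
    except OSError as exc:
        print(f"Could not reach Ollama: {exc}")
```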

Starting a Conversation

To chat with Gemma 4 directly from your terminal, use the run command:

ollama run gemma4

Now you can test it with a prompt. Gemma 4 excels at code generation, reasoning, agentic tool use, and following instructions.

>>> Write a Python function to check if a number is prime.

Certainly! Here is a Python function to check if a number is prime, along with an explanation.

def is_prime(n):
    """
    Checks if a number is prime.
    A prime number is a natural number greater than 1 that has no positive
    divisors other than 1 and itself.
    """
    if n <= 1:
        return False
    if n <= 3:
        return True
    if n % 2 == 0 or n % 3 == 0:
        return False
    i = 5
    while i * i <= n:
        if n % i == 0 or n % (i + 2) == 0:
            return False
        i += 6
    return True

# Example usage:
print(is_prime(29))   # Output: True
print(is_prime(15))   # Output: False

Image Input

The Gemma 4 models support visual input. You can pass an image path directly in the terminal:

ollama run gemma4 "caption this image /path/to/image.png"

REST API Integration

Ollama exposes a local REST API on port 11434, allowing you to integrate Gemma 4 into your applications.

Here's how to send a request using curl:

curl http://localhost:11434/api/generate -d '{
  "model": "gemma4",
  "prompt": "Explain the concept of recursion with a simple example",
  "stream": false
}'
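
The same request can be sent from Python without any third-party packages. The following is a minimal sketch that mirrors the curl call above using urllib; the helper names are hypothetical, and the 300-second timeout is an arbitrary choice to allow for slow CPU inference.

```python
import json
import urllib.request


def build_generate_request(prompt: str, model: str = "gemma4",
                           base_url: str = "http://localhost:11434") -> urllib.request.Request:
    """Build a non-streaming POST request for /api/generate."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str) -> str:
    """Send the request and return the generated text from the "response" field."""
    with urllib.request.urlopen(build_generate_request(prompt), timeout=300) as resp:
        return json.load(resp)["response"]


if __name__ == "__main__":
    try:
        print(generate("Explain the concept of recursion with a simple example"))
    except OSError as exc:
        print(f"Request failed: {exc}")
```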

For conversational interactions, use the chat endpoint (/api/chat):

curl http://localhost:11434/api/chat -d '{
  "model": "gemma4",
  "messages": [
    {"role": "user", "content": "What are the benefits of using Gemma 4 over other language models?"}
  ],
  "stream": false
}'
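
The chat endpoint does not keep a session between requests, so a multi-turn conversation works by resending the full message history each time. The sketch below shows one way to manage that history. The Conversation class and send_to_ollama helper are hypothetical names, and the transport is passed in as a function so the bookkeeping can be tried without a running server.

```python
import json
import urllib.request
from typing import Callable

Message = dict[str, str]


def send_to_ollama(messages: list[Message], model: str = "gemma4",
                   base_url: str = "http://localhost:11434") -> str:
    """POST the whole history to /api/chat and return the assistant's reply text."""
    body = json.dumps({"model": model, "messages": messages, "stream": False}).encode("utf-8")
    req = urllib.request.Request(f"{base_url}/api/chat", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.load(resp)["message"]["content"]


class Conversation:
    """Keeps the message list that must accompany every chat request."""

    def __init__(self, send: Callable[[list[Message]], str] = send_to_ollama):
        self.send = send
        self.messages: list[Message] = []

    def ask(self, user_text: str) -> str:
        self.messages.append({"role": "user", "content": user_text})
        reply = self.send(self.messages)
        # Store the reply so the next turn includes it as context.
        self.messages.append({"role": "assistant", "content": reply})
        return reply
```

With a server running, Conversation().ask("...") can be called repeatedly and each call will include the earlier turns.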

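To send an image, pass a list of base64-encoded images in the "images" field. This request omits "stream": false, so the response arrives as a stream of JSON objects: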

curl http://localhost:11434/api/generate -d '{
  "model": "gemma4",
  "prompt": "Describe what you see in this image",
  "images": ["<base64-encoded-image>"]
}'
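
Producing the base64 string by hand is awkward, so here is a small sketch of how the request body could be built in Python. The function name is hypothetical, the addition of "stream": False is a choice made here to get a single JSON reply, and the encoding is assumed to be plain base64 without a data: URI prefix.

```python
import base64
import json


def build_image_payload(image_bytes: bytes, prompt: str, model: str = "gemma4") -> dict:
    """Build a /api/generate body containing one base64-encoded image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {"model": model, "prompt": prompt, "images": [encoded], "stream": False}


if __name__ == "__main__":
    # Placeholder bytes for illustration only; in practice read a real file,
    # for example with pathlib.Path("/path/to/image.png").read_bytes().
    payload = build_image_payload(b"not a real image", "Describe what you see in this image")
    print(json.dumps(payload, indent=2))
```

The resulting dictionary can be serialized with json.dumps and posted to /api/generate in the same way as a text-only request.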

Check out the complete API reference for more options.

Python Development

To use Gemma 4 in your Python projects, install the official Ollama Python library:

pip install ollama

Quick Start

Here's how you can generate text or have a conversation:

import ollama

# Simple text generation
response = ollama.generate(
    model='gemma4',
    prompt='Create a simple web scraper in Python using requests and BeautifulSoup'
)
print(response['response'])

# Conversational interface
chat_response = ollama.chat(
    model='gemma4',
    messages=[
        {'role': 'user', 'content': 'Explain the difference between a list and a tuple in Python'}
    ]
)
print(chat_response['message']['content'])

Real-time Streaming

For a more interactive experience, you can stream the response token by token:

stream = ollama.chat(
    model='gemma4',
    messages=[{'role': 'user', 'content': 'Write a short, creative story about an AI discovering music.'}],
    stream=True,
)

for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
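
Streaming yields only fragments, so if the complete reply is needed afterwards (for logging, or to append to a chat history) the pieces have to be collected. A minimal sketch, assuming each chunk exposes its text under ['message']['content'] as in the loop above; the helper name is hypothetical.

```python
def collect_stream(chunks) -> str:
    """Print each streamed fragment as it arrives and return the full text."""
    parts = []
    for chunk in chunks:
        piece = chunk["message"]["content"]
        print(piece, end="", flush=True)
        parts.append(piece)
    print()  # end the line once the stream is exhausted
    return "".join(parts)


# Usage with the stream created above (requires a running Ollama server):
#   full_text = collect_stream(stream)
```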

Customizing with System Prompts

You can guide the model's behavior by providing a system prompt:

response = ollama.chat(
    model='gemma4',
    messages=[
        {'role': 'system', 'content': 'You are a helpful coding assistant who provides concise, working code examples.'},
        {'role': 'user', 'content': 'How do I handle exceptions in Python?'}
    ]
)
print(response['message']['content'])

Next Steps

You now have Google's Gemma 4 model running locally. Here are a few ideas to explore next:

Experiment with Different Models: Try the gemma4:e2b model for speed on edge devices, or upgrade to gemma4:26b or gemma4:31b if you have the hardware and need maximum performance.
Explore Multimodality: All Gemma 4 models support image and audio input. Try passing image data through the API for vision-related tasks.
Build Agentic Workflows: Gemma 4 has native support for function calling, making it well-suited for building autonomous agents that plan and execute tasks.
Build an Application: Use a web framework like Flask or FastAPI to build a custom application around the Ollama API, or integrate it with a tool like Open WebUI for a local ChatGPT-like experience.
Optimize Performance: If you have a dedicated GPU, ensure Ollama is configured to use it for significantly faster inference speeds.
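
To compare models or check whether a GPU is actually helping, it is useful to measure generation speed. The sketch below derives tokens per second from the eval_count and eval_duration fields that Ollama's API reference describes for completed responses, with the duration reported in nanoseconds. Treat those field names as an assumption if your version differs; the function name is hypothetical.

```python
def tokens_per_second(response) -> float:
    """Compute generation speed from the timing fields of a finished response."""
    eval_count = response["eval_count"]            # number of generated tokens
    eval_duration_ns = response["eval_duration"]   # generation time in nanoseconds
    return eval_count / (eval_duration_ns / 1e9)


# Usage (requires a running Ollama server and the ollama package):
#   import ollama
#   result = ollama.generate(model='gemma4', prompt='Say hello')
#   print(f"{tokens_per_second(result):.1f} tokens/s")
```

Running the same prompt against different variants, or on CPU and GPU, gives a concrete basis for choosing between them.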

Gemma 4's frontier-level intelligence-per-parameter and open accessibility make it a fantastic tool for developers looking to build with state-of-the-art AI without relying on cloud APIs.
