How to run Gemma 4 locally
April 2, 2026 by @anthonynsimon
Gemma 4 is Google's latest open model family, built from Gemini 3 research and technology to maximize intelligence-per-parameter. It supports text, audio, and image input with a large 256K context window, making it a major step up from its predecessor.
This guide will walk you through setting up and running Gemma 4 locally with Ollama.
Choosing a Model & Hardware Requirements
Gemma 4 comes in four sizes, split into two tiers: compact edge models (E2B, E4B) designed for mobile and IoT devices, and larger models (26B, 31B) optimized for consumer GPUs and workstations.
Model sizes:
- Gemma 4 E2B: The smallest model, designed for mobile and edge devices. Runs with minimal resources.
- Gemma 4 E4B: Compact but capable. A good starting point for most modern laptops.
- Gemma 4 26B: A Mixture-of-Experts model with 26B total parameters and 4B active. Strong reasoning at lower compute cost.
- Gemma 4 31B: The largest and most capable model. Delivers frontier-level intelligence on consumer GPUs.
CPU vs. GPU:
- CPU: Every model can run on the CPU alone, but generation is noticeably slower. That is usually acceptable for batch jobs and other non-interactive tasks.
- GPU: A compatible GPU is strongly recommended for interactive use, especially with the 26B and 31B models. It can deliver a 3-10x speedup over CPU-only inference, depending on the hardware.
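One way to check the speedup on your own machine is to compare generation throughput between runs. The sketch below assumes the JSON body returned by Ollama's local REST API (port 11434) includes `eval_count` (tokens generated) and `eval_duration` (nanoseconds), as its non-streaming responses do. The helper name and the sample numbers are hypothetical and not measured results.

```python
def tokens_per_second(response: dict) -> float:
    """Compute generation throughput from an Ollama response body.

    Assumes `eval_count` is the number of generated tokens and
    `eval_duration` is the generation time in nanoseconds.
    """
    duration_ns = response.get("eval_duration", 0)
    if duration_ns <= 0:
        return 0.0
    return response.get("eval_count", 0) * 1e9 / duration_ns


# Hypothetical numbers for illustration only, not a benchmark.
sample = {"eval_count": 300, "eval_duration": 6_000_000_000}
print(f"{tokens_per_second(sample):.1f} tokens/s")
```

Running the same prompt once on CPU and once on GPU and comparing the two figures gives a rough, hardware-specific estimate of the speedup.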
The E4B model is a great starting point, delivering strong performance on consumer hardware without demanding excessive resources. If you have a dedicated GPU with 16GB+ VRAM, the 26B model offers excellent reasoning capabilities.
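A rough way to sanity-check the VRAM guidance is to estimate the memory taken by the weights alone: parameters times bits per weight, divided by 8. The sketch below is back-of-the-envelope arithmetic, not a statement about how any particular Ollama tag is quantized. The bit widths are assumptions, and the KV cache and runtime overhead come on top of these figures.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GB (10^9 bytes)."""
    return params_billions * bits_per_weight / 8


# Bit widths are assumed for illustration; check the model tag you pull
# for its actual quantization.
for name, params in [("26B", 26), ("31B", 31)]:
    for bits in (4, 8, 16):
        gb = weight_memory_gb(params, bits)
        print(f"Gemma 4 {name} at {bits}-bit: ~{gb:.1f} GB")
```

Under the 4-bit assumption the 26B model needs about 13 GB for weights, which is consistent with the 16GB+ VRAM suggestion. Note that a Mixture-of-Experts model typically still needs all of its weights in memory, even though only a fraction is active per token.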
Setting up Ollama
First, install Ollama, a tool for downloading and running large language models locally.
- macOS/Windows: Download from ollama.com.
- Linux: Install with a single command:
curl -fsSL https://ollama.com/install.sh | sh
After installation, verify it's working by checking the version:
ollama --version
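The command above checks the CLI. If you also want to confirm from code that the background server is reachable, a minimal sketch follows. It assumes the server listens on the default port 11434 and exposes a `/api/version` endpoint that returns a JSON object with a `version` field. The function name is hypothetical.

```python
import json
import urllib.error
import urllib.request


def ollama_version(base_url: str = "http://localhost:11434", timeout: float = 2.0):
    """Return the server's version string, or None if it cannot be reached."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/version", timeout=timeout) as resp:
            return json.load(resp).get("version")
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a body that is not valid JSON.
        return None


version = ollama_version()
print(f"Ollama server {version}" if version else "Ollama server is not reachable")
```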
Getting Gemma 4
With Ollama installed, you can pull a Gemma 4 model. We'll start with the default variant.
ollama pull gemma4
Available Model Variants
You can choose the model that best suits your needs:
- gemma4:e2b: The smallest, ideal for edge devices and resource-constrained environments.
- gemma4:e4b: Compact and capable, a solid default for laptops and desktops.
- gemma4:26b: Mixture-of-Experts architecture with strong reasoning at lower compute cost.
- gemma4:31b: The largest and most capable model for demanding tasks.
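After pulling one or more variants, you may want to confirm which ones are installed. The sketch below assumes a running Ollama server on the default port and relies on its `/api/tags` endpoint, which lists local models under a `models` key. The helper names and the sample body are illustrative.

```python
import json
import urllib.request


def model_names(tags_response: dict) -> list[str]:
    """Extract model names from the JSON body returned by GET /api/tags."""
    return [m["name"] for m in tags_response.get("models", [])]


def installed_models(base_url: str = "http://localhost:11434") -> list[str]:
    """Ask a running Ollama server which models are available locally."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return model_names(json.load(resp))


# Illustrative response shape only; the real body carries more fields per model.
sample = {"models": [{"name": "gemma4:e4b"}, {"name": "gemma4:26b"}]}
print(model_names(sample))
```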
Starting a Conversation
To chat with Gemma 4 directly from your terminal, use the run command:
ollama run gemma4
Now you can test it with a prompt. Gemma 4 excels at code generation, reasoning, agentic tool use, and following instructions.
>>> Write a Python function to check if a number is prime.
Certainly! Here is a Python function to check if a number is prime, along with an explanation.
def is_prime(n):
    """
    Checks if a number is prime.
    A prime number is a natural number greater than 1 that has no positive
    divisors other than 1 and itself.
    """
    if n <= 1:
        return False
    if n <= 3:
        return True
    if n % 2 == 0 or n % 3 == 0:
        return False
    i = 5
    while i * i <= n:
        if n % i == 0 or n % (i + 2) == 0:
            return False
        i += 6
    return True

# Example usage:
print(is_prime(29))  # Output: True
print(is_prime(15))  # Output: False
Image Input
The Gemma 4 models support visual input. You can pass an image path directly in the terminal:
ollama run gemma4 "caption this image /path/to/image.png"
REST API Integration
Ollama exposes a local REST API on port 11434, allowing you to integrate Gemma 4 into your applications.
Here's how to send a request using curl:
curl http://localhost:11434/api/generate -d '{
  "model": "gemma4",
  "prompt": "Explain the concept of recursion with a simple example",
  "stream": false
}'
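The same request can be sent from Python using only the standard library, which is handy when you prefer not to add a dependency. This is a minimal sketch that mirrors the curl call above. The function names are hypothetical, and it assumes the non-streaming response carries the generated text in a `response` field.

```python
import json
import urllib.request


def build_generate_request(prompt, model="gemma4", base_url="http://localhost:11434"):
    """Build a POST request equivalent to the curl example."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request and return the generated text."""
    with urllib.request.urlopen(build_generate_request(prompt, **kwargs)) as resp:
        return json.load(resp)["response"]


# Usage (requires a running Ollama server with the model pulled):
#   print(generate("Explain the concept of recursion with a simple example"))
```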
For conversational interactions, use the chat endpoint:
curl http://localhost:11434/api/chat -d '{
  "model": "gemma4",
  "messages": [
    {"role": "user", "content": "What are the benefits of using Gemma 4 over other language models?"}
  ],
  "stream": false
}'
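The chat endpoint does not remember earlier turns, so a multi-turn conversation means resending the full message history with each request. The sketch below shows one way to build that history. The helper names are hypothetical, and the assistant reply is a placeholder for what the server would return in the `message` field of its response.

```python
def add_turn(history, role, content):
    """Return a new history list with one more message appended."""
    return history + [{"role": role, "content": content}]


def chat_payload(history, model="gemma4"):
    """Build the JSON body for POST /api/chat from the accumulated history."""
    return {"model": model, "messages": history, "stream": False}


history = add_turn([], "user", "What is recursion?")
# Placeholder reply; in practice, append the assistant message from the response.
history = add_turn(history, "assistant", "<model reply>")
history = add_turn(history, "user", "Show a short example.")

payload = chat_payload(history)
print(len(payload["messages"]), "messages in the next request")
```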
To send an image, include a list of base64-encoded images:
curl http://localhost:11434/api/generate -d '{
  "model": "gemma4",
  "prompt": "Describe what you see in this image",
  "images": ["<base64-encoded-image>"]
}'
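Producing the base64 string by hand is tedious, so here is a small sketch that reads an image file and builds the request body shown above. The function name is hypothetical, and adding `"stream": False` is a choice made here so the reply arrives as a single JSON object.

```python
import base64
from pathlib import Path


def image_payload(image_path, prompt, model="gemma4"):
    """Build a /api/generate body with one base64-encoded image attached."""
    encoded = base64.b64encode(Path(image_path).read_bytes()).decode("ascii")
    return {"model": model, "prompt": prompt, "images": [encoded], "stream": False}


# Usage (the path is a placeholder):
#   body = image_payload("/path/to/image.png", "Describe what you see in this image")
#   ...then POST `body` as JSON to http://localhost:11434/api/generate
```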
Check out the complete API reference for more options.
Python Development
To use Gemma 4 in your Python projects, install the official Ollama Python library:
pip install ollama
Quick Start
Here's how you can generate text or have a conversation:
import ollama

# Simple text generation
response = ollama.generate(
    model='gemma4',
    prompt='Create a simple web scraper in Python using requests and BeautifulSoup'
)
print(response['response'])

# Conversational interface
chat_response = ollama.chat(
    model='gemma4',
    messages=[
        {'role': 'user', 'content': 'Explain the difference between a list and a tuple in Python'}
    ]
)
print(chat_response['message']['content'])
Real-time Streaming
For a more interactive experience, you can stream the response token by token:
stream = ollama.chat(
    model='gemma4',
    messages=[{'role': 'user', 'content': 'Write a short, creative story about an AI discovering music.'}],
    stream=True,
)
for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
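Streaming prints text as it arrives, but you often need the complete reply afterwards too, for example to store it in the chat history. The sketch below gathers the chunks while optionally echoing them. The helper name is hypothetical, and the stand-in chunks only mimic the shape that the loop above reads.

```python
def collect_stream(chunks, echo=True):
    """Print each streamed piece as it arrives and return the full text."""
    parts = []
    for chunk in chunks:
        text = chunk["message"]["content"]
        if echo:
            print(text, end="", flush=True)
        parts.append(text)
    return "".join(parts)


# Stand-in chunks for illustration; with the real library you would pass the
# iterator returned by ollama.chat(..., stream=True).
fake_stream = [{"message": {"content": "Hello"}}, {"message": {"content": ", world"}}]
full_text = collect_stream(fake_stream, echo=False)
```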
Customizing with System Prompts
You can guide the model's behavior by providing a system prompt:
response = ollama.chat(
    model='gemma4',
    messages=[
        {'role': 'system', 'content': 'You are a helpful coding assistant who provides concise, working code examples.'},
        {'role': 'user', 'content': 'How do I handle exceptions in Python?'}
    ]
)
print(response['message']['content'])
Next Steps
You now have Google's Gemma 4 model running locally. Here are a few ideas to explore next:
- Experiment with Different Models: Try the gemma4:e2b model for speed on edge devices, or upgrade to gemma4:26b or gemma4:31b if you have the hardware and need maximum performance.
- Explore Multimodality: All Gemma 4 models support image and audio input. Try passing image data through the API for vision-related tasks.
- Build Agentic Workflows: Gemma 4 has native support for function calling, making it well-suited for building autonomous agents that plan and execute tasks.
- Build an Application: Use a web framework like Flask or FastAPI to build a custom application around the Ollama API, or integrate it with a tool like Open WebUI for a local ChatGPT-like experience.
- Optimize Performance: If you have a dedicated GPU, confirm that Ollama is actually using it, since GPU inference is significantly faster. Running ollama ps shows whether a loaded model is on the CPU or the GPU.
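As a starting point for the agentic-workflow idea above, here is a sketch of the client side of function calling. It assumes Ollama's chat API accepts a `tools` list of JSON-schema function definitions and returns any requested calls under `message.tool_calls`, each with a function name and an arguments object. The `add_numbers` tool, the registry, and the sample message are all hypothetical, and the exact fields expected in tool-result messages may vary by Ollama version.

```python
def add_numbers(a: float, b: float) -> float:
    """Hypothetical example tool the model may ask to call."""
    return a + b


# Schema you would pass in the request's "tools" list (assumed format).
TOOL_SCHEMA = {
    "type": "function",
    "function": {
        "name": "add_numbers",
        "description": "Add two numbers",
        "parameters": {
            "type": "object",
            "properties": {"a": {"type": "number"}, "b": {"type": "number"}},
            "required": ["a", "b"],
        },
    },
}

REGISTRY = {"add_numbers": add_numbers}


def run_tool_calls(message: dict, registry: dict) -> list[dict]:
    """Execute each requested tool and return tool-result messages."""
    results = []
    for call in message.get("tool_calls") or []:
        function = call["function"]
        handler = registry.get(function["name"])
        if handler is None:
            output = f"unknown tool: {function['name']}"
        else:
            output = str(handler(**function["arguments"]))
        results.append({"role": "tool", "content": output})
    return results


# Stand-in for an assistant message that requests one tool call.
fake_message = {
    "role": "assistant",
    "content": "",
    "tool_calls": [{"function": {"name": "add_numbers", "arguments": {"a": 2, "b": 3}}}],
}
print(run_tool_calls(fake_message, REGISTRY))
```

In a full loop, the returned tool messages would be appended to the conversation history and sent back to the model so it can produce a final answer.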
Gemma 4's frontier-level intelligence-per-parameter and open accessibility make it a fantastic tool for developers looking to build with state-of-the-art AI without relying on cloud APIs.