How to run Gemma 3 locally
July 18, 2025 by @anthonynsimon
Gemma 3 is Google's latest family of open-weight language models, designed to be lightweight and efficient while delivering strong performance for its size. It handles code generation, reasoning, and instruction following well, which makes it a versatile tool for developers and researchers alike.
This guide will walk you through setting up and running Gemma 3 locally with Ollama.
Choosing a Model & Hardware Requirements
Gemma 3 comes in several sizes, so you can pick the best fit for your hardware. The 4B, 12B, and 27B models are multimodal (handling both text and images) and support a 128K-token context window; the 1B model is text-only with a 32K-token context window.
Memory requirements by model size:
- Gemma 3 1B: Requires at least 2GB of RAM.
- Gemma 3 4B: Requires at least 4GB of RAM. A great starting point for most modern laptops.
- Gemma 3 12B: Requires at least 9GB of RAM.
- Gemma 3 27B: Requires at least 18GB of RAM. The most capable model in the family.
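The memory list above can be turned into a small lookup helper. This is only a sketch: the figures are the minimums quoted in this guide, the function name `largest_model_for` is made up for illustration, and in practice you will want some headroom for the operating system and other applications.

```python
from typing import Optional

# Minimum RAM in GB per model tag, copied from the list above.
MIN_RAM_GB = {
    "gemma3:1b": 2,
    "gemma3:4b": 4,
    "gemma3:12b": 9,
    "gemma3:27b": 18,
}


def largest_model_for(ram_gb: float) -> Optional[str]:
    """Return the largest Gemma 3 tag whose minimum RAM fits, or None."""
    fitting = [tag for tag, need in MIN_RAM_GB.items() if need <= ram_gb]
    # Rank by memory requirement, which grows with model size.
    return max(fitting, key=MIN_RAM_GB.get, default=None)


print(largest_model_for(16))  # gemma3:12b
```

A machine with 16GB of RAM clears the 12B minimum but not the 27B one, so the helper returns gemma3:12b.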
CPU vs. GPU:
- CPU: All models can run on the CPU, but performance will be slower, which is acceptable for non-interactive tasks.
- GPU: For interactive use, especially with the larger models, running on a compatible GPU is highly recommended. It can be 3-10x faster than CPU inference.
The 4B model often hits the sweet spot, delivering strong performance on consumer hardware without demanding excessive resources.
Setting up Ollama
First, get Ollama. It's a powerful tool for running large language models locally.
- macOS/Windows: Download from ollama.com.
- Linux: Install with a single command:
curl -fsSL https://ollama.com/install.sh | sh
After installation, verify it's working by checking the version:
ollama --version
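If you want to run the same check from a script, a sketch like the following may help. It assumes the CLI prints a dotted X.Y.Z version number somewhere in its output; the helper names are illustrative and not part of Ollama.

```python
import re
import shutil
import subprocess
from typing import Optional, Tuple


def parse_version(output: str) -> Optional[Tuple[int, int, int]]:
    """Pull the first X.Y.Z version number out of the CLI output."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", output)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def installed_ollama_version() -> Optional[Tuple[int, int, int]]:
    """Return the installed version, or None if ollama is not on PATH."""
    if shutil.which("ollama") is None:
        return None
    result = subprocess.run(
        ["ollama", "--version"], capture_output=True, text=True, check=False
    )
    # Search both streams, since warnings may be printed alongside the version.
    return parse_version(result.stdout + result.stderr)
```

Calling `installed_ollama_version()` returns a tuple of three integers when Ollama is installed and `None` when the binary cannot be found.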
Getting Gemma 3
With Ollama installed, you can pull a Gemma 3 model. We'll start with the versatile 4B model.
ollama pull gemma3:4b
This command downloads the 3.3GB model file.
Available Model Variants
You can choose the model that best suits your needs:
- gemma3:1b: The smallest, ideal for simple tasks and resource-constrained devices.
- gemma3:4b: The recommended default for most users.
- gemma3:12b: A more powerful option for complex reasoning.
- gemma3:27b: The largest and most capable model for demanding tasks.
Quantized Models for Efficiency
Gemma 3 also offers Quantization-Aware Trained (QAT) versions. These models provide nearly the same quality as their standard counterparts but use up to 3x less memory, making them even more accessible.
To pull a QAT model, append -it-qat to the tag:
# Example for the 4B quantized model
ollama pull gemma3:4b-it-qat
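To get a feel for why quantization matters, here is some back-of-the-envelope arithmetic for the weights alone. It assumes a 16-bit baseline, 4-bit quantized weights, and the nominal parameter counts in the model names; none of these figures come from measurements. Real memory use is higher because of the context cache and runtime overhead, and some tensors may be stored at higher precision, which is one reason the practical saving can be smaller than this estimate suggests.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_per_weight = bits_per_weight / 8
    return params_billion * bytes_per_weight


# Nominal sizes taken from the model names; the bit widths are assumptions.
for name, size in [("1b", 1), ("4b", 4), ("12b", 12), ("27b", 27)]:
    full = weight_memory_gb(size, 16)
    quant = weight_memory_gb(size, 4)
    print(f"gemma3:{name}: ~{full:.1f} GB at 16-bit, ~{quant:.1f} GB at 4-bit")
```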
Starting a Conversation
To chat with Gemma 3 directly from your terminal, use the run command:
ollama run gemma3:4b
Now you can test it with a prompt, for example a small coding task:
>>> Write a Python function to check if a number is prime.
Certainly! Here is a Python function to check if a number is prime, along with an explanation.
def is_prime(n):
    """
    Checks if a number is prime.
    A prime number is a natural number greater than 1 that has no positive
    divisors other than 1 and itself.
    """
    if n <= 1:
        return False
    if n <= 3:
        return True
    if n % 2 == 0 or n % 3 == 0:
        return False
    i = 5
    while i * i <= n:
        if n % i == 0 or n % (i + 2) == 0:
            return False
        i += 6
    return True

# Example usage:
print(is_prime(29))  # Output: True
print(is_prime(15))  # Output: False
REST API Integration
Ollama exposes a local REST API on port 11434, allowing you to integrate Gemma 3 into your applications.
Here’s how to send a request using curl:
curl http://localhost:11434/api/generate -d '{
  "model": "gemma3:4b",
  "prompt": "Explain the concept of recursion with a simple example",
  "stream": false
}'
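With "stream": false, the endpoint returns a single JSON object. Alongside the generated text it includes timing fields such as eval_count (generated tokens) and eval_duration (in nanoseconds), which give a quick way to compare CPU and GPU speed on your own machine. The sketch below uses made-up sample values; the helper name is illustrative.

```python
def tokens_per_second(response: dict) -> float:
    """Generation speed from the timing fields of a non-streaming response."""
    eval_count = response["eval_count"]            # number of generated tokens
    eval_duration_ns = response["eval_duration"]   # generation time in nanoseconds
    return eval_count / (eval_duration_ns / 1e9)


# Hypothetical response values, for illustration only.
sample = {"eval_count": 120, "eval_duration": 4_000_000_000}
print(f"{tokens_per_second(sample):.1f} tokens/s")  # 30.0 tokens/s
```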
For conversational interactions, use the chat endpoint:
curl http://localhost:11434/api/chat -d '{
  "model": "gemma3:4b",
  "messages": [
    {"role": "user", "content": "What are the benefits of using Gemma 3 over other language models?"}
  ],
  "stream": false
}'
Check out the complete API reference for more options.
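The same chat request can be sent from Python with only the standard library. This is a minimal sketch that mirrors the curl call above: the helper names and the 120-second timeout are arbitrary choices, and it assumes Ollama is listening on the default port.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"


def build_chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /api/chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def chat(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["message"]["content"]
```

With the server running, `chat("gemma3:4b", "Explain recursion in one sentence.")` returns the reply as a string. Building the request separately from sending it makes the payload easy to inspect without a running server.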
Python Development
To use Gemma 3 in your Python projects, install the official Ollama Python library:
pip install ollama
Quick Start
Here’s how you can generate text or have a conversation:
import ollama

# Simple text generation
response = ollama.generate(
    model='gemma3:4b',
    prompt='Create a simple web scraper in Python using requests and BeautifulSoup'
)
print(response['response'])

# Conversational interface
chat_response = ollama.chat(
    model='gemma3:4b',
    messages=[
        {'role': 'user', 'content': 'Explain the difference between a list and a tuple in Python'}
    ]
)
print(chat_response['message']['content'])
Real-time Streaming
For a more interactive experience, you can stream the response token by token:
stream = ollama.chat(
    model='gemma3:4b',
    messages=[{'role': 'user', 'content': 'Write a short, creative story about an AI discovering music.'}],
    stream=True,
)
for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
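Streaming prints text as it arrives, but you often want the complete reply afterwards as well, for logging or for adding it to a conversation history. One way to do that is to accumulate the pieces while printing them. The helper below is a sketch; it is exercised with stand-in chunks shaped like the ones in the loop above so it runs without a model.

```python
from typing import Iterable


def collect_stream(chunks: Iterable) -> str:
    """Print each chunk as it arrives and return the complete reply."""
    parts = []
    for chunk in chunks:
        piece = chunk["message"]["content"]
        print(piece, end="", flush=True)
        parts.append(piece)
    print()  # finish the line once the stream ends
    return "".join(parts)


# Stand-in chunks with hypothetical text, shaped like the streamed ones.
fake_stream = [
    {"message": {"content": "Hello"}},
    {"message": {"content": ", world"}},
]
full_reply = collect_stream(fake_stream)
```

In real use you would pass the `stream` object returned by `ollama.chat(..., stream=True)` in place of `fake_stream`.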
Customizing with System Prompts
You can guide the model's behavior by providing a system prompt:
response = ollama.chat(
    model='gemma3:4b',
    messages=[
        {'role': 'system', 'content': 'You are a helpful coding assistant who provides concise, working code examples.'},
        {'role': 'user', 'content': 'How do I handle exceptions in Python?'}
    ]
)
print(response['message']['content'])
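The model only sees the messages you send with each call, so a multi-turn chat needs to resend the system prompt and the earlier turns every time. A small wrapper can manage that list. The class below is an illustrative sketch, not part of the Ollama library; the model call is passed in as a function so the example runs offline with a stand-in.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


class Conversation:
    """Keeps the message history so each turn sees the earlier ones."""

    def __init__(self, system_prompt: str, chat_fn: Callable[[List[Message]], str]):
        # chat_fn takes the full message list and returns the reply text.
        self.chat_fn = chat_fn
        self.messages: List[Message] = [{"role": "system", "content": system_prompt}]

    def ask(self, user_text: str) -> str:
        self.messages.append({"role": "user", "content": user_text})
        reply = self.chat_fn(self.messages)
        self.messages.append({"role": "assistant", "content": reply})
        return reply


# Offline stand-in for the model: reports how many messages it received.
def fake_chat(messages: List[Message]) -> str:
    return f"seen {len(messages)} messages"


convo = Conversation("You are a helpful coding assistant.", fake_chat)
print(convo.ask("How do I handle exceptions in Python?"))  # seen 2 messages
print(convo.ask("Show a try/finally example."))            # seen 4 messages
```

To talk to Gemma 3 instead of the stand-in, pass a function such as `lambda msgs: ollama.chat(model='gemma3:4b', messages=msgs)['message']['content']` as `chat_fn`.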
Next Steps
You now have Google's powerful Gemma 3 model running locally. Here are a few ideas to explore next:
- Experiment with Different Models: Try the gemma3:1b model for speed or upgrade to gemma3:12b or gemma3:27b if you have the hardware and need maximum performance.
- Explore Multimodality: The 4B, 12B, and 27B models can process images. Try passing image data through the API for vision-related tasks.
- Build an Application: Use a web framework like Flask or FastAPI to build a custom application around the Ollama API, or integrate it with a tool like Open WebUI for a local ChatGPT-like experience.
- Optimize Performance: If you have a dedicated GPU, confirm that Ollama is actually using it for significantly faster inference. Running ollama ps while a model is loaded shows whether it is on the CPU or the GPU.
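For the multimodality idea above, Ollama's chat endpoint accepts an images list on a message, with each image given as a base64-encoded string. The helper below sketches how such a message could be built; the function name is illustrative, and placeholder bytes stand in for a real image file.

```python
import base64
from typing import Dict, List, Union


def build_image_message(image_bytes: bytes, question: str) -> Dict[str, Union[str, List[str]]]:
    """Build a user message with one base64-encoded image attached."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {"role": "user", "content": question, "images": [encoded]}


# Placeholder bytes stand in for a real image read with open(path, "rb").
message = build_image_message(b"abc", "What is in this picture?")
print(message["images"])  # ['YWJj']
```

The resulting dictionary can go into the messages array of a chat request to one of the multimodal models (4B, 12B, or 27B).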
Gemma 3’s blend of top-tier performance and open accessibility makes it a fantastic tool for developers looking to build with state-of-the-art AI without relying on cloud APIs.