How to run Mistral locally

Mistral AI builds efficient language models that punch above their weight class, from Ministral 3B for edge devices to Mistral Small 3.2 with 24B parameters. This guide walks through running Mistral locally with Ollama.

Table of Contents

Choosing a Model & Hardware Requirements
Setting up Ollama
Getting Mistral
Starting a Conversation
REST API Integration
Python Development
Next Steps

Choosing a Model & Hardware Requirements

Mistral models are optimized for efficiency, but hardware requirements vary by variant. Here's what to expect:

Model sizes:

Ministral 3B: The smallest model, designed for edge and on-device inference. Runs on almost anything.
Mistral 7B: The original lightweight model. Runs comfortably on most modern laptops.
Mistral Nemo 12B: Built with NVIDIA, offers 128K context and stronger reasoning than the 7B.
Codestral 22B: Specialized for code generation across 80+ programming languages.
Mistral Small 3.2 24B: The flagship small model. Strong instruction following, function calling, and multilingual support.

CPU vs. GPU:

CPU: Every model can run on CPU alone, but generation is noticeably slower. That is usually acceptable for batch jobs and other non-interactive tasks.
GPU: For the best response times, especially with the 22B and 24B models, a compatible GPU is highly recommended. Depending on the card and the model size, it can give roughly a 3-10x speedup over CPU.

Mistral Small 3.2 is the recommended starting point if you have 32GB+ RAM or a GPU with 16GB+ VRAM. On more constrained hardware, the 7B model still delivers solid results with 8GB RAM. Ministral 3B is the pick for edge devices or environments with under 4GB of available memory.
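The recommendation above can be written down as a small helper. This is only a sketch of the thresholds quoted in this guide: the function name pick_model is hypothetical, the tags are the Ollama model names used in this guide, and treating everything below 8GB RAM as Ministral territory is an assumption, since the guide does not cover the 4-8GB range explicitly.

```python
def pick_model(ram_gb: float, vram_gb: float = 0.0) -> str:
    """Map available memory to a model tag using this guide's rough thresholds."""
    # 32GB+ RAM or a GPU with 16GB+ VRAM: the 24B flagship.
    if ram_gb >= 32 or vram_gb >= 16:
        return "mistral-small3.2"
    # 8GB RAM is enough for the original 7B model.
    if ram_gb >= 8:
        return "mistral"
    # Assumption: anything smaller falls back to the 3B edge model.
    return "ministral"


print(pick_model(ram_gb=16))              # laptop without a large GPU
print(pick_model(ram_gb=16, vram_gb=24))  # same laptop with a 24GB GPU
```

Treat the result as a starting point. Actual memory use also depends on quantization and context length, so it is worth testing the next size up if you have headroom.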

Setting up Ollama

First, get Ollama. It's a tool for running large language models locally.

macOS/Windows: Download from ollama.com.
Linux: Install with a single command:

curl -fsSL https://ollama.com/install.sh | sh

After installation, verify it's working by checking the version:

ollama --version
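If you would rather run this check from a script, the sketch below uses only the Python standard library. It assumes the Ollama server listens on its default address, http://localhost:11434, and answers a plain GET on the root path with HTTP 200. The helper names ollama_on_path and server_is_up are made up for this example.

```python
import shutil
import urllib.request


def ollama_on_path() -> bool:
    """True if an `ollama` executable can be found on PATH."""
    return shutil.which("ollama") is not None


def server_is_up(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """True if something answers with HTTP 200 at base_url (assumed default address)."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, connection refused, and timeouts are all OSError subclasses.
        return False
```

Calling ollama_on_path() tells you whether the install succeeded, and server_is_up() tells you whether the background service is actually running.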

Getting Mistral

With Ollama installed, you can pull a Mistral model. We'll start with Mistral Small 3.2, the most capable small model.

ollama pull mistral-small3.2

Available Model Variants

You can choose the model that best suits your hardware and use case:

ministral: 3B edge model (~2GB download). The smallest option, with 128K context.
mistral: The original 7B model (~4GB download). Ideal for resource-constrained environments.
mistral-nemo: 12B model with 128K context window (~7GB download). A step up in reasoning.
codestral: 22B code-focused model (~12GB download). Best for programming tasks.
mistral-small3.2: 24B flagship model (~14GB download). The best all-around choice.
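Before pulling one of the larger variants, it can help to check free disk space against the approximate download sizes listed above. The snippet below is a rough sketch: the sizes are the rounded figures from this list, the 2GB headroom is an arbitrary assumption, and the home directory is only a guess at which disk holds your Ollama models.

```python
import shutil
from pathlib import Path

# Approximate download sizes in GB, copied from the list above.
APPROX_DOWNLOAD_GB = {
    "ministral": 2,
    "mistral": 4,
    "mistral-nemo": 7,
    "codestral": 12,
    "mistral-small3.2": 14,
}


def models_that_fit(free_gb: float, headroom_gb: float = 2.0) -> list[str]:
    """Return the tags whose download, plus some headroom, fits in free_gb."""
    return [
        tag
        for tag, size_gb in APPROX_DOWNLOAD_GB.items()
        if size_gb + headroom_gb <= free_gb
    ]


# Assumption: models are stored on the same disk as the home directory.
free_gb = shutil.disk_usage(Path.home()).free / 1e9
print(f"{free_gb:.0f} GB free, candidates: {models_that_fit(free_gb)}")
```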

Starting a Conversation

To chat with Mistral directly from your terminal, use the run command:

ollama run mistral-small3.2

Now you can test it with a prompt. Mistral Small 3.2 excels at instruction following, code generation, and multilingual tasks.

>>> Explain the difference between async and sync programming

Async programming allows other tasks to run while waiting for I/O operations,
improving efficiency in I/O-bound applications. Sync programming blocks execution
until each operation completes, which is simpler but can be less efficient.

Here's a quick comparison:
- Async: Non-blocking, concurrent, better for I/O-heavy tasks
- Sync: Blocking, sequential, simpler to understand and debug

REST API Integration

Ollama exposes a local REST API on port 11434, allowing you to integrate Mistral into your applications.

Here's how to send a request using curl:

curl http://localhost:11434/api/generate -d '{
  "model": "mistral-small3.2",
  "prompt": "Write a Python function to calculate fibonacci numbers",
  "stream": false
}'

For conversational interactions, use the chat endpoint:

curl http://localhost:11434/api/chat -d '{
  "model": "mistral-small3.2",
  "messages": [
    {"role": "user", "content": "What are the trade-offs between SQL and NoSQL databases?"}
  ],
  "stream": false
}'
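The same chat request can be sent from Python without any third-party packages. The following is a minimal sketch that mirrors the curl call above. The helper names build_chat_request and chat_once are hypothetical, and the 120 second timeout is an arbitrary choice for slow first responses while the model loads.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"


def build_chat_request(prompt, model="mistral-small3.2", base_url=OLLAMA_URL):
    """Build (but do not send) a POST request for the /api/chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one JSON object instead of a stream
    }
    return urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat_once(prompt, **kwargs):
    """Send a single user message and return the assistant's reply text."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as resp:
        body = json.load(resp)
    return body["message"]["content"]
```

With the Ollama server running, chat_once("What are the trade-offs between SQL and NoSQL databases?") returns the reply as a string. Splitting request construction from sending keeps the payload easy to inspect and test.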

For structured outputs, pass a JSON schema in the format parameter to get a JSON response that follows that schema:

curl http://localhost:11434/api/chat -d '{
  "model": "mistral-small3.2",
  "messages": [
    {"role": "user", "content": "List 3 programming languages with their primary use cases"}
  ],
  "format": {
    "type": "object",
    "properties": {
      "languages": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "name": {"type": "string"},
            "use_case": {"type": "string"}
          }
        }
      }
    }
  },
  "stream": false
}'
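On the client side, the structured reply still arrives as a JSON string inside message.content, so it needs to be parsed. Because the schema above does not mark any field as required, it is also worth validating the result before using it. Here is one way that might look; parse_languages is a hypothetical helper and the sample response is hand-written for illustration, not real model output.

```python
import json


def parse_languages(chat_response):
    """Extract and validate the 'languages' array from an /api/chat response dict."""
    # With "format" set, the model's JSON is a string in message.content.
    data = json.loads(chat_response["message"]["content"])
    items = data.get("languages") if isinstance(data, dict) else None
    if not isinstance(items, list):
        raise ValueError("expected an object with a 'languages' array")
    for item in items:
        if not isinstance(item, dict):
            raise ValueError(f"expected an object, got: {item!r}")
        for key in ("name", "use_case"):
            if not isinstance(item.get(key), str):
                raise ValueError(f"entry is missing string field {key!r}: {item!r}")
    return items


# Hand-written sample shaped like a response, for illustration only.
sample = {
    "message": {
        "content": '{"languages": [{"name": "Python", "use_case": "data analysis"}]}'
    }
}
for lang in parse_languages(sample):
    print(f"{lang['name']}: {lang['use_case']}")
```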

Check out the complete API reference for more options.

Python Development

To use Mistral in your Python projects, install the official Ollama Python library:

pip install ollama

Quick Start

Here's how you can generate text or have a conversation:

import ollama

# Simple text generation
response = ollama.generate(
    model='mistral-small3.2',
    prompt='Explain machine learning in simple terms'
)
print(response['response'])

# Conversational interface
chat_response = ollama.chat(
    model='mistral-small3.2',
    messages=[
        {'role': 'user', 'content': "Help me debug this Python error: NameError: name 'x' is not defined"}
    ]
)
print(chat_response['message']['content'])

Real-time Streaming

For a more interactive experience, you can stream the response token by token:

import ollama

stream = ollama.chat(
    model='mistral-small3.2',
    messages=[{'role': 'user', 'content': 'Write a haiku about coding'}],
    stream=True,
)

for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
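Streaming prints text as it arrives, but you often also want the complete reply afterwards, for example to log it or append it to a chat history. A small helper along these lines does both. The name print_and_collect is hypothetical, and the fake chunks below only imitate the chunk shape used in the loop above so the sketch runs without a model.

```python
def print_and_collect(stream) -> str:
    """Print each streamed piece as it arrives and return the full text."""
    parts = []
    for chunk in stream:
        piece = chunk["message"]["content"]
        print(piece, end="", flush=True)
        parts.append(piece)
    print()  # finish the line once the stream ends
    return "".join(parts)


# Stand-in chunks with the same shape as the streamed chunks above.
fake_stream = [
    {"message": {"content": "Hello"}},
    {"message": {"content": ", world"}},
]
full_text = print_and_collect(fake_stream)
```

In real use, pass the object returned by ollama.chat(..., stream=True) in place of fake_stream.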

Customizing with System Prompts

You can guide the model's behavior by providing a system prompt:

import ollama

response = ollama.chat(
    model='mistral-small3.2',
    messages=[
        {'role': 'system', 'content': 'You are a helpful coding assistant who provides concise, working code examples.'},
        {'role': 'user', 'content': 'How do I handle exceptions in Python?'}
    ]
)
print(response['message']['content'])
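A system prompt only keeps shaping the conversation if it is sent along with every request, together with the earlier turns. One way to manage that is a small wrapper that owns the message list. The Conversation class below is a hypothetical sketch. It takes the chat function as an argument, so you can pass ollama.chat in real use or a stub while testing.

```python
class Conversation:
    """Keeps a system prompt plus the running message history."""

    def __init__(self, chat_fn, system_prompt, model="mistral-small3.2"):
        self.chat_fn = chat_fn  # e.g. ollama.chat
        self.model = model
        self.messages = [{"role": "system", "content": system_prompt}]

    def ask(self, text):
        """Send a user message with the full history and record the reply."""
        self.messages.append({"role": "user", "content": text})
        # Pass a copy so the callee cannot mutate our history.
        response = self.chat_fn(model=self.model, messages=list(self.messages))
        reply = response["message"]["content"]
        self.messages.append({"role": "assistant", "content": reply})
        return reply


# Stub that imitates the response shape, so the sketch runs without a model.
def fake_chat(model, messages):
    return {"message": {"content": f"seen {len(messages)} messages"}}


convo = Conversation(fake_chat, "You are a concise coding assistant.")
print(convo.ask("How do I handle exceptions in Python?"))
```

Swapping fake_chat for ollama.chat gives a multi-turn chat where every request carries the system prompt and the previous turns.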

Next Steps

You now have Mistral running locally. Here are a few ideas to explore next:

Experiment with Different Models: Try ministral for edge devices, mistral for speed on constrained hardware, mistral-nemo for 128K context tasks, or codestral for dedicated code generation.
Use Structured Outputs: Pass a JSON schema via the format parameter to get typed, parseable responses from the model.
Build Agentic Workflows: Mistral Small 3.2 has strong function calling support, making it well-suited for building agents that plan and execute tasks with external tools.
Build an Application: Use a web framework like Flask or FastAPI to build a custom application around the Ollama API, or integrate it with a tool like Open WebUI for a local ChatGPT-like experience.
Optimize Performance: If you have a dedicated GPU, ensure Ollama is configured to use it for significantly faster inference speeds. Consider cloud GPU instances for larger models like Mixtral.
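As a starting point for the agentic idea above, here is a rough sketch of the dispatch step: taking the tool calls a model asks for and running matching local functions. It assumes the assistant message is a plain dict whose tool_calls entries look like {"function": {"name": ..., "arguments": {...}}}, which is how the Ollama REST API reports them at the time of writing. The tool get_word_count, the TOOLS registry, and run_tool_calls are all made up for this example, and the sample message is hand-written.

```python
def get_word_count(text: str) -> int:
    """Toy tool: count whitespace-separated words."""
    return len(text.split())


# Hypothetical registry mapping tool names to local Python functions.
TOOLS = {"get_word_count": get_word_count}


def run_tool_calls(message, registry=TOOLS):
    """Run each requested tool and return 'tool' messages with the results."""
    results = []
    for call in message.get("tool_calls") or []:
        function = call["function"]
        name = function["name"]
        arguments = function.get("arguments") or {}
        if name not in registry:
            results.append({"role": "tool", "content": f"unknown tool: {name}"})
            continue
        output = registry[name](**arguments)
        results.append({"role": "tool", "content": str(output)})
    return results


# Hand-written assistant message shaped like a tool-call response.
sample_message = {
    "role": "assistant",
    "content": "",
    "tool_calls": [
        {"function": {"name": "get_word_count", "arguments": {"text": "one two three"}}}
    ],
}
print(run_tool_calls(sample_message))
```

In a full loop you would append these tool messages to the conversation and call the chat endpoint again so the model can use the results. Check the Ollama API reference for the exact tool message fields before relying on this shape.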
