Skip to main content

Command Palette

Search for a command to run...

How to Run Open-Source AI Models Locally for Free on Mac and Windows

Updated
12 min read
How to Run Open-Source AI Models Locally for Free on Mac and Windows
A
a TypeScript full stack developer shipping scalable web apps and adding AI powered workflows on top.

Today, you do not need to pay for an AI API just to test AI models.

You can run many open-source or open-weight AI models directly on your own laptop or PC. This means the model runs on your machine, not on OpenAI, Gemini, Claude, or any cloud server.

You can use models like:

Llama
Gemma
Qwen
Mistral
DeepSeek
Phi
TinyLlama

This is useful for:

learning AI
testing models
building small chatbots
private offline chatting
experimenting with fine-tuning
testing RAG locally
coding assistant experiments

The best part is that after downloading the model, you can chat with it locally without paying per request.

But there is one important thing:

Local AI is free in API cost, but it uses your laptop/PC hardware.

So speed depends on your RAM, GPU, and model size.


What does “running AI locally” mean?

Normally when you use ChatGPT or Gemini, your message goes to a cloud server.

The flow is like this:

Your message
   ↓
Internet
   ↓
Company server
   ↓
AI model runs there
   ↓
Answer comes back

But local AI works like this:

Your message
   ↓
Your laptop/PC
   ↓
Model runs on your machine
   ↓
Answer appears locally

So your data stays on your device.

This is good for privacy and learning.


Open-source vs open-weight AI models

People often say “open-source AI model”, but many popular models are actually “open-weight” models.

Simple difference:

Open-source = code, training method, data, weights may be open
Open-weight = model weights are available, but not everything is fully open

For normal users, the practical meaning is:

You can download the model
You can run it locally
You can test it for free
You must still check the model license before commercial use

For learning and local testing, open-weight models are enough.


What hardware do you need?

This depends on model size.

Simple guide:

Your machine Good model size to start
8 GB RAM laptop 1B to 3B model
16 GB RAM laptop 3B to 8B quantized model
24 GB / 32 GB RAM 7B to 14B model
64 GB RAM or strong GPU 14B to 32B model
High-end GPU setup 32B+ models

For beginners, start with small models.

Good beginner models:

gemma:2b / gemma3:1b
qwen2.5:3b
llama3.2:3b
phi3
mistral:7b
qwen2.5:7b

Do not start with a 70B model on a normal laptop. It will be too heavy.


Best tools to run AI locally

There are many tools, but for beginners these are the best:

Ollama
LM Studio
MLX-LM
llama.cpp

Use this simple rule:

Tool Best for
Ollama Easiest terminal command
LM Studio Easiest GUI app
MLX-LM Best for Apple Silicon Mac developers
llama.cpp Best terminal tool for GGUF models

For most people:

Beginner terminal: Ollama
Beginner GUI: LM Studio
Mac developer: MLX-LM
Windows terminal advanced: llama.cpp

Method 1: Run local AI using Ollama

Ollama is one of the easiest tools for running AI models locally.

It works on:

Mac
Windows
Linux

Step 1: Install Ollama

Download and install Ollama for your operating system.

After installing, open terminal.

On Mac, use Terminal.

On Windows, use PowerShell or CMD.

Check if Ollama is installed:

ollama --version

Step 2: Run your first model

Run a small model:

ollama run llama3.2:3b

or:

ollama run qwen2.5:3b

or:

ollama run gemma3:1b

Ollama will download the model first.

After download, it opens chat mode.

Now you can type:

Explain LoRA fine tuning in simple words

And the model will answer locally.


Good Ollama commands

Run Llama

ollama run llama3.2:3b

Run Qwen

ollama run qwen2.5:3b

Run Gemma

ollama run gemma3:1b

List downloaded models

ollama list

Remove a model

ollama rm model-name

Example:

ollama rm llama3.2:3b

Best Ollama models for normal laptop

Start with these:

llama3.2:3b
qwen2.5:3b
gemma3:1b
phi3

For 16 GB RAM, you can also try:

mistral:7b
qwen2.5:7b
llama3.1:8b

If the model is slow, use a smaller model.


Method 2: Run local AI using LM Studio

LM Studio is the easiest GUI option.

You do not need to remember many terminal commands.

It works on:

Mac
Windows
Linux

Use LM Studio when you want:

simple chat UI
model search
download models from Hugging Face
easy settings
local offline chatting

Step 1: Install LM Studio

Install LM Studio for Mac or Windows.

Open the app.

Step 2: Download a model

Go to the model search or discover section.

Search for models like:

Llama
Gemma
Qwen
Mistral
Phi

For beginners, choose a small quantized model.

Good format:

GGUF

Good quantization:

Q4_K_M
Q5_K_M

Simple meaning:

Q4 = less RAM, faster, slightly lower quality
Q5 = more RAM, better quality
Q8 = high RAM, better quality, heavier

For 16 GB RAM, start with Q4.

Step 3: Load the model

After downloading, load the model.

Then start chatting.

That is it.

No API key needed.

No cloud billing.

The model runs on your own machine.


LM Studio vs Ollama

Both are good.

Use this simple comparison:

Need Use
Simple terminal chat Ollama
Easy visual app LM Studio
Download Hugging Face models easily LM Studio
Use local model from command line Ollama
Beginner friendly Both

For a non-technical user, LM Studio feels easier.

For a developer, Ollama feels faster.


Method 3: Run AI on Mac using MLX-LM

MLX-LM is very useful for Apple Silicon Mac.

Use it if you have:

M1 Mac
M2 Mac
M3 Mac
M4 Mac
M5 Mac

MLX is made for Apple Silicon, so it can use Mac unified memory better.

This is good for developers who want terminal commands.

Step 1: Create Python environment

python3 -m venv .venv
source .venv/bin/activate

Step 2: Install MLX-LM

pip install -U mlx-lm

Step 3: Chat with a Hugging Face MLX model

Example:

.venv/bin/mlx_lm.chat \
  --model mlx-community/Qwen2.5-3B-Instruct-4bit \
  --max-tokens 500

Another example:

.venv/bin/mlx_lm.chat \
  --model mlx-community/gemma-2-2b-it-4bit \
  --max-tokens 500

First run downloads the model.

After that, it loads from cache.


MLX-LM single command style

For terminal chat:

.venv/bin/mlx_lm.chat \
  --model MODEL_NAME \
  --max-tokens 500

Example:

.venv/bin/mlx_lm.chat \
  --model mlx-community/Qwen2.5-3B-Instruct-4bit \
  --max-tokens 500 \
  --temp 0.7

This opens chat mode in terminal.

You type your message.

The model replies.


Method 4: Run AI on Windows using llama.cpp

On Windows, llama.cpp is a strong option for GGUF models.

GGUF is a model file format used by llama.cpp and many local AI apps.

You can download GGUF models from Hugging Face.

Example model file names may look like:

model-Q4_K_M.gguf
model-Q5_K_M.gguf
model-Q8_0.gguf

For normal laptop, use:

Q4_K_M

Basic Windows llama.cpp command

llama-cli.exe `
  -m .\models\model-Q4_K_M.gguf `
  --conversation `
  --ctx-size 4096 `
  -n 500 `
  --temp 0.7

Meaning:

-m = model file path
--conversation = chat mode
--ctx-size = context length
-n = max output tokens
--temp = creativity

If the answer is too random:

--temp 0.5

If the answer is too boring:

--temp 0.8

Where to download models?

You can download models from:

Ollama model library
LM Studio model search
Hugging Face

For beginners:

Use Ollama or LM Studio first.

For advanced use:

Use Hugging Face directly.

When using Hugging Face, check these things:

model size
license
format
quantization
RAM requirement

Good search keywords:

Qwen 3B GGUF
Llama 3B GGUF
Gemma GGUF
Mistral 7B GGUF
MLX 4bit model

Which model should you choose?

Do not pick the biggest model first.

Pick model based on your machine.

For 8 GB RAM

Try:

1B model
2B model
3B quantized model

Examples:

Gemma small model
Qwen 1.5B / 3B
Llama 3.2 1B / 3B
TinyLlama

For 16 GB RAM

Try:

3B model
7B model
8B quantized model

Examples:

Qwen 2.5 3B
Qwen 2.5 7B
Llama 3.1 8B
Mistral 7B
Gemma small models

For 32 GB RAM

Try:

7B
8B
14B

Examples:

Qwen 14B
Mistral small models
DeepSeek distilled models
Llama 8B

For 64 GB RAM or more

You can test larger models:

14B
27B
32B

But bigger model means slower output.


What is quantization?

Quantization means making the model smaller.

Example:

Original model = big size, better quality, heavy
Quantized model = smaller size, faster, lower memory

Common quant names:

Q4
Q5
Q8
4bit
8bit

Simple rule:

Q4 = best for normal laptop
Q5 = better quality, still manageable
Q8 = heavy but better

For beginners:

Use Q4_K_M

This is a good balance.


Can local AI work without internet?

Yes.

You need internet to download the model first.

After the model is downloaded, many local tools can run offline.

This is useful for:

private notes
personal experiments
coding help
learning
offline AI assistant

But remember:

Local model does not know live internet news unless you give it data.

It will not automatically know today’s latest price, news, or website content.

For that, you need RAG, browser access, or external search.


Is local AI fully free?

Local AI has no per-token API cost.

But it still uses:

electricity
storage
RAM
GPU/CPU power
your laptop battery

So it is free from API billing, but not free from hardware usage.

For learning, it is great.

For production with many users, cloud APIs or GPU servers may be better.


Local AI limitations

Local AI is powerful, but it has limits.

1. Smaller models are less smart

A 3B local model will not be as smart as GPT-4-level cloud models.

It can still be useful for:

summaries
simple coding help
chatbots
classification
rewriting
small RAG apps
style experiments

2. Speed depends on hardware

On weak laptop:

small model = okay
big model = slow

3. It can hallucinate

Local models can also make mistakes.

Always verify important answers.

4. Storage can fill quickly

Models can be large.

A few models can take many GBs.

Keep only the models you actually use.


Best setup for beginners

For Mac or Windows beginner:

Install Ollama
Run a 3B model
Start chatting

Command:

ollama run llama3.2:3b

or:

ollama run qwen2.5:3b

For GUI:

Install LM Studio
Search Qwen / Llama / Gemma
Download Q4 model
Load and chat

That is the easiest path.


Best setup for Mac developers

Use MLX-LM.

Command:

python3 -m venv .venv
source .venv/bin/activate
pip install -U mlx-lm

Then chat:

.venv/bin/mlx_lm.chat \
  --model mlx-community/Qwen2.5-3B-Instruct-4bit \
  --max-tokens 500

This is clean and good for Apple Silicon.


Best setup for Windows developers

Use Ollama first.

Command:

ollama run qwen2.5:3b

For advanced GGUF control, use llama.cpp:

llama-cli.exe `
  -m .\models\model-Q4_K_M.gguf `
  --conversation `
  --ctx-size 4096 `
  -n 500 `
  --temp 0.7

Example local AI use cases

You can use local AI for:

personal chatbot
offline coding helper
text summarizer
resume rewriting
YouTube script writing
RAG over documents
email draft helper
classification model
customer support bot testing
fine-tuning experiments
Hinglish chatbot testing

Local AI is especially good when you are learning because you can break things, test models, and understand how LLMs work without paying API cost.


Final recommendation

For most beginners:

Use Ollama first.

Install it and run:

ollama run llama3.2:3b

or:

ollama run qwen2.5:3b

For people who want a GUI:

Use LM Studio.

For Mac developers:

Use MLX-LM.

For Windows advanced terminal users:

Use llama.cpp with GGUF models.

Simple final rule:

Small model first.
Quantized model first.
Q4 first.
Then move to bigger models only when your machine handles it.

Running AI locally is one of the best ways to learn LLMs.

You understand models better.

You avoid API bills.

You keep your data on your own machine.

And you can test many open-source or open-weight models freely on your Mac or Windows PC.