Local AI on Windows series · Part 1 of 3 · Next: Part 2: Hermes AI agent on Ubuntu via Hyper-V →
Running AI models in the cloud means your prompts, documents, and conversations pass through someone else’s servers. A local AI stack keeps everything on your own machine — no API costs, no data leaving your network, and no internet connection required once models are downloaded.
This guide walks you through setting up LM Studio as a local model server and AnythingLLM as a chat and RAG interface on top of it. The setup works on a mid-range Windows 11 machine — you don’t need a high-end GPU. A 6 GB VRAM card runs 3B parameter models comfortably.
What you need
- Windows 11 (64-bit, Home or Pro)
- GPU with at least 4 GB VRAM — 6 GB or more recommended. Models will fall back to CPU if VRAM is insufficient, which works but is much slower.
- 16 GB RAM — 8 GB is the minimum, but 16 GB gives you headroom to run the model alongside other apps
- 10–30 GB free disk space depending on which models you download
- LM Studio (free) and AnythingLLM Desktop (free)
Step 1: Install LM Studio
- Go to lmstudio.ai and download the Windows installer
- Run the installer — the default installation path is fine
- Open LM Studio
LM Studio is both a model manager and a local OpenAI-compatible API server. You use it to download models from Hugging Face and serve them over a local HTTP endpoint that other tools can talk to.
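"OpenAI-compatible" means the server accepts the same JSON request shape as OpenAI's chat completions API. As a rough sketch (the model name below is a placeholder — use whatever name LM Studio shows for your loaded model):

```python
import json

def build_chat_request(model, user_message, system_prompt=None):
    """Build an OpenAI-style /v1/chat/completions payload.

    LM Studio's local server accepts the same JSON shape as the
    OpenAI chat API: a model name plus a list of role/content
    messages, optionally led by a system prompt.
    """
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_message})
    return {"model": model, "messages": messages, "temperature": 0.7}

payload = build_chat_request("qwen2.5-3b-instruct", "What is 2 + 2?")
print(json.dumps(payload, indent=2))
```

Any tool that can POST this payload to the endpoint — AnythingLLM included — can use the local model as a drop-in replacement for a cloud API.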
Step 2: Download a model
- In LM Studio, go to the Discover tab (magnifying glass icon)
- Search for Qwen2.5-3B — this is a capable 3B parameter model that fits comfortably in 6 GB VRAM
- Look for a Q4_K_M quantized version — this is the best balance of quality and size for consumer GPUs
- Click Download and wait for it to complete (~2 GB)
If you have more VRAM (8 GB+), you can go up to a 7B model. For 4 GB VRAM, stick to 3B or smaller. If you’re unsure, start with Qwen2.5-3B — it handles English and Dutch reasonably well for a model of its size.
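You can sanity-check these recommendations with a back-of-the-envelope calculation. Q4_K_M quantization stores roughly 4.5 bits per weight on average (an approximation — actual GGUF files vary), and you should budget an extra 1–2 GB of VRAM for the KV cache and runtime overhead:

```python
def approx_model_size_gb(params_billion, bits_per_weight=4.5):
    """Rough size of a quantized model in decimal GB.

    Assumes ~4.5 bits per weight for Q4_K_M (approximate; real
    GGUF files vary by model architecture and quant details).
    """
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

print(f"3B @ Q4_K_M ~ {approx_model_size_gb(3):.1f} GB")
print(f"7B @ Q4_K_M ~ {approx_model_size_gb(7):.1f} GB")
```

A 3B model at ~1.7 GB plus overhead fits easily in 6 GB of VRAM; a 7B model at ~3.9 GB plus overhead is why 8 GB+ is the sensible threshold for stepping up.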
Step 3: Start the local server
- In LM Studio, go to the Developer tab (the </> icon in the left sidebar)
- Select your downloaded model from the dropdown at the top
- Click Start Server
- You should see: “Server listening on http://localhost:1234”
Leave LM Studio open and the server running — AnythingLLM connects to it over this local port.
Step 4: Install AnythingLLM Desktop
- Go to anythingllm.com/desktop and download the Windows installer
- When prompted, install for Current User only — not “All Users”. Installing to Program Files causes a known spawn error that breaks the app on startup.
- Open AnythingLLM Desktop
Step 5: Connect AnythingLLM to LM Studio
- In AnythingLLM, open Settings (gear icon, bottom left)
- Go to LLM Preference
- Select LM Studio as the provider
- Set the base URL to http://localhost:1234
- Click Save changes
- Go to Embedding Model in the same settings panel and set it to AnythingLLM built-in (or LM Studio if you prefer)
You can verify the connection by returning to the LLM settings — AnythingLLM will show a green indicator if it can reach the server.
Step 6: Create a workspace
- Back on the main screen, click New Workspace
- Give it a name — for example Personal assistant or Project notes
- Optionally set a system prompt: “You are a helpful assistant. Always respond in English.”
- Click Save
Workspaces are isolated from each other — documents uploaded to one workspace are not visible in another. This lets you keep separate contexts for different projects or topics.
Step 7: Add your own documents (RAG)
This is where the local setup becomes genuinely useful. You can upload your own notes, reports, or reference documents and ask questions about them — all processed locally.
- Inside your workspace, click the paperclip icon or drag and drop a file into the chat area
- Supported formats: .txt, .pdf, .md, .docx, .csv
- AnythingLLM splits the document into chunks and creates embeddings locally — this may take a few seconds depending on file size
- Once indexed, you’ll see the document listed under the workspace
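The splitting step above works on the idea of overlapping chunks: the document is cut into fixed-size pieces that share some text at the boundaries, so a sentence straddling a cut remains retrievable from either side. A simplified character-based sketch (AnythingLLM's actual splitter uses different sizes and token-aware boundaries):

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping character chunks.

    A simplified stand-in for a RAG document splitter: each chunk
    shares `overlap` characters with its predecessor so content at
    a boundary appears in both neighboring chunks.
    """
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks

doc = "word " * 100  # a 500-character stand-in document
pieces = chunk_text(doc)
print(len(pieces), "chunks")
```

Each chunk is then run through the embedding model to produce a vector, which is what gets stored in the workspace's index.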
Now ask a question directly related to the document content. The model retrieves the relevant chunks and answers using your document as context rather than its training data alone.
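Retrieval itself boils down to comparing the embedding of your question against the stored chunk embeddings and keeping the closest matches. A toy illustration with 3-dimensional vectors (real embeddings have hundreds of dimensions, and production stores use approximate nearest-neighbor indexes rather than a full sort):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_chunks(query_vec, chunk_vecs, k=2):
    """Return indices of the k chunk vectors most similar to the query."""
    scored = sorted(enumerate(chunk_vecs),
                    key=lambda iv: cosine(query_vec, iv[1]),
                    reverse=True)
    return [i for i, _ in scored[:k]]

# Toy 3-dimensional "embeddings" standing in for real chunk vectors.
chunks = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
print(top_chunks([1.0, 0.0, 0.0], chunks))
```

The text of the top-ranked chunks is pasted into the model's context window alongside your question — that is the "retrieval" in retrieval-augmented generation.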
Step 8: Test the full stack
With LM Studio running the model server and AnythingLLM connected, try these prompts in order to confirm everything works:
- “What is 2 + 2?” — confirms the LLM connection is working
- Upload a short text file with a few sentences about a topic you choose
- “Based on the document I uploaded, what does it say about [topic]?” — confirms RAG is working
If the model responds but ignores the document, check that the document is actually indexed (green indicator next to the filename in the workspace sidebar) and that the embedding model is active in settings.
Troubleshooting
AnythingLLM won’t start — spawn error on launch
This happens when AnythingLLM is installed under Program Files (All Users). Uninstall and reinstall using the “Current User” option. The app stores its data in AppData\Roaming\anythingllm-desktop, which is only accessible per-user.
LM Studio server starts but AnythingLLM can’t connect
Confirm the base URL in AnythingLLM settings is exactly http://localhost:1234 — no trailing slash. Also confirm the LM Studio server is running (green dot in the Developer tab) and a model is loaded.
Responses are very slow
The model is probably running on CPU instead of GPU. In LM Studio’s Developer tab, check the GPU offload slider — set it to maximum to push as many layers as possible to VRAM. With 6 GB VRAM and a 3B Q4 model you should get 10–20 tokens per second.
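To put those numbers in perspective, a quick calculation shows what generation speed means for response latency (300 tokens is roughly a couple of paragraphs):

```python
def generation_time_s(tokens, tokens_per_second):
    """Seconds needed to generate a response of a given token length."""
    return tokens / tokens_per_second

# A ~300-token answer at the speeds quoted above:
for tps in (10, 20):
    print(f"{tps} tok/s -> {generation_time_s(300, tps):.0f} s")
```

Anything well below 10 tokens per second on a 3B Q4 model is a strong hint that layers are spilling to CPU.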
Model output quality is poor or responses are incoherent
Try a different model. Qwen2.5-3B-Instruct (not the base version) and Llama-3.2-3B-Instruct are good starting points. Avoid base models — you want the instruction-tuned variant for chat use.
Conclusion
LM Studio handles the model layer — download, GPU offload, OpenAI-compatible API. AnythingLLM handles the application layer — workspaces, RAG, document indexing, system prompts. Together they give you a local AI stack that runs entirely on your own hardware, works offline, and keeps your documents private.
The setup takes about 20 minutes. Once it’s running, adding new documents or switching models is a matter of drag-and-drop or a single dropdown selection. From here you can expand the stack with n8n for workflow automation, or explore voice interfaces that connect to the same LM Studio server endpoint.
Featured image by Ales Nesetril on Unsplash.