USB-AI: Running an Offline LLM From a USB Drive — Architecture, Trade-offs, and Lessons

What if your AI assistant had no home?

No cloud account. No subscription. No server. No WiFi needed. Just a USB drive you plug into any machine, and a Python process that wakes up and starts thinking.

That’s USB-AI: a portable, fully offline AI assistant designed to run from removable storage. You carry it. You own it. You control it.


The Premise

I had a specific scenario in mind: you’re in an environment where internet access is restricted, monitored, or simply unavailable. A corporate network. An exam lab. A flight. A remote deployment site. And you need an AI assistant — not for generating tweets, but for real technical work: code, math, terminal commands, analysis.

Every cloud-based LLM solution fails immediately in this scenario. I wanted something that didn’t.

The constraint I set for myself: the entire system — model weights, inference engine, and UI — must live on a single USB drive, launchable on any Windows machine with a single command.


The Model Decision: Gemma-3-1B-IT

Choosing the right model was the most critical engineering decision. The constraints were severe:

| Constraint | Why |
| --- | --- |
| Must fit on a USB drive | ≤ 32GB practically, target ≤ 16GB |
| Must run on 8GB RAM | Can’t assume high-spec machines |
| Must run without a GPU | Many laptops, no CUDA |
| Must be actually useful | Small models often aren’t |

I evaluated several candidates:

  • TinyLLaMA (1.1B): On paper, ideal. In practice, the instruction-following is unreliable. Fine-tuning noise makes output inconsistent for technical work.
  • DeepSeek-Coder-6.7B: Exceptional for code. Too large for 8GB RAM, and painfully slow on CPU-only inference.
  • Gemma-3-1B-IT (Google, Instruction Tuned): The sweet spot. 1B parameters, Google’s instruction tuning, reasonable RAM footprint, and genuinely useful output quality for its size.
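The RAM constraint can be sanity-checked with back-of-envelope arithmetic. The sketch below is an illustration only: the function names and the 2 GB headroom are assumptions, and it counts weights alone, ignoring the KV cache and runtime overhead.

```python
def estimate_weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    total_bits = params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


def fits_in_ram(params_billion: float, bits_per_weight: float,
                ram_gb: float, headroom_gb: float = 2.0) -> bool:
    """True if the weights fit after reserving headroom for the OS and runtime.

    headroom_gb is an assumed figure, not a measured one.
    """
    return estimate_weights_gb(params_billion, bits_per_weight) <= ram_gb - headroom_gb


if __name__ == "__main__":
    # Unquantised 16-bit weights for a 1B and a 6.7B model.
    for name, size in [("1B", 1.0), ("6.7B", 6.7)]:
        print(name, round(estimate_weights_gb(size, 16), 2), "GB at 16 bits per weight")
```

At 16 bits per weight, a 6.7B model needs roughly 13.4 GB for weights alone, which is already past an 8GB machine before any quantisation is considered.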

Gemma-3-1B-IT is the only fully operational model in USB-AI today. TinyLLaMA and DeepSeek-Coder are staged for future integration but require additional work to get stable inference.


Architecture: What’s Actually on the USB Drive

USB Drive (mounted as E:/)
├── usb_ai_launcher.py        ← Entry point, interface selector
├── main.py                   ← Direct launch alternative
├── requirements.txt          ← All Python deps
├── models/
│   └── gemma-3-1b-it/        ← Model weights (GGUF format)
├── engine/
│   ├── inference.py          ← Core LLM inference wrapper
│   ├── math_eval.py          ← Mathematical expression evaluator
│   └── code_gen.py           ← Code generation utilities
├── interfaces/
│   ├── cli.py                ← Command-line interface
│   ├── gui.py                ← Tkinter GUI (recommended)
│   └── voice.py              ← Speech-to-text + TTS
└── logs/                     ← Error and session logs

The launcher detects which Python is available on the host machine, installs dependencies into a local venv on the USB drive itself (not the host machine’s global Python), then starts the selected interface.
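A minimal sketch of that venv step, using only the standard library. The `.venv` directory name, the optional `wheels` folder, and the function names are hypothetical; the actual launcher may organise this differently.

```python
import subprocess
import sys
from pathlib import Path
from typing import List, Optional


def venv_python(usb_root: Path, windows: bool = True) -> Path:
    """Path of the interpreter inside the drive-local venv."""
    venv_dir = usb_root / ".venv"  # hypothetical directory name
    if windows:
        return venv_dir / "Scripts" / "python.exe"
    return venv_dir / "bin" / "python"


def pip_install_args(usb_root: Path, wheel_dir: Optional[Path] = None) -> List[str]:
    """Arguments for installing requirements.txt into the venv."""
    args = ["-m", "pip", "install", "-r", str(usb_root / "requirements.txt")]
    if wheel_dir is not None:
        # Install only from wheels stored on the drive, never from the network.
        args += ["--no-index", "--find-links", str(wheel_dir)]
    return args


def ensure_venv(usb_root: Path, host_python: str = sys.executable,
                wheel_dir: Optional[Path] = None) -> Path:
    """Create the venv on first run, install dependencies, return its interpreter."""
    interpreter = venv_python(usb_root, windows=sys.platform == "win32")
    if not interpreter.exists():
        subprocess.run([host_python, "-m", "venv", str(usb_root / ".venv")], check=True)
        subprocess.run([str(interpreter), *pip_install_args(usb_root, wheel_dir)],
                       check=True)
    return interpreter
```

Passing a `wheel_dir` keeps the install step offline, which matters when the host has no network access.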


The Inference Engine

I built a wrapper around llama-cpp-python (Python bindings for llama.cpp, the C++ inference engine that powers most local LLM tools). The key configuration for CPU-only inference:

# engine/inference.py
import os

from llama_cpp import Llama

class GemmaEngine:
    def __init__(self, model_path: str, n_ctx: int = 2048):
        self.llm = Llama(
            model_path=model_path,
            n_ctx=n_ctx,
            n_threads=os.cpu_count(),   # Use all CPU cores
            n_gpu_layers=0,             # CPU-only by default
            verbose=False,
            use_mmap=True,              # Memory-mapped IO — critical for USB
        )

    def generate(self, prompt: str, max_tokens: int = 512) -> str:
        output = self.llm(
            prompt,
            max_tokens=max_tokens,
            stop=["<end_of_turn>", "<eos>"],
            echo=False,
        )
        return output["choices"][0]["text"].strip()

The use_mmap=True flag is critically important for USB operation. Memory-mapped IO lets the OS load model chunks from disk on demand rather than pulling the entire model into RAM upfront. On a slow USB 3.0 drive, this reduces launch time from 45+ seconds to under 10.
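The idea behind memory mapping can be shown with Python's own mmap module. This is an illustration of the concept, not a reproduction of what llama.cpp does internally; the helper names and the throwaway demo file are assumptions.

```python
import mmap
import os
import tempfile


def read_slice(path: str, offset: int, length: int) -> bytes:
    """Read `length` bytes at `offset` through a read-only memory map."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Only the pages backing this slice need to come off the disk.
            return mm[offset:offset + length]


def make_demo_file(size: int = 1 << 20) -> str:
    """Write a throwaway file standing in for a large weights file."""
    fd, path = tempfile.mkstemp(suffix=".bin")
    with os.fdopen(fd, "wb") as f:
        f.write(bytes(i % 256 for i in range(size)))
    return path


if __name__ == "__main__":
    demo = make_demo_file()
    print(read_slice(demo, 256, 4))
    os.remove(demo)
```

Mapping the file is cheap regardless of its size; the cost is paid page by page as regions are touched, which is why a slow drive hurts less at launch.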

CUDA support is optional: if a CUDA-compatible GPU is detected, n_gpu_layers is set to a positive value to offload matrix operations to the GPU. On GPU hardware, latency drops from ~8s to ~1s per response.
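How the GPU is detected is not spelled out here, so the following is one possible sketch rather than the launcher's actual logic. It assumes an NVIDIA driver install puts nvidia-smi on PATH, and it uses -1, which llama-cpp-python treats as "offload all layers"; pass a positive count instead if you want partial offload. The function name is hypothetical.

```python
import shutil
from typing import Callable, Optional


def pick_gpu_layers(
    which: Callable[[str], Optional[str]] = shutil.which,
    layers_if_gpu: int = -1,
) -> int:
    """Return an n_gpu_layers value: 0 for CPU-only, `layers_if_gpu` otherwise."""
    # Assumption: the presence of nvidia-smi on PATH signals a usable NVIDIA GPU.
    has_nvidia = which("nvidia-smi") is not None
    return layers_if_gpu if has_nvidia else 0


if __name__ == "__main__":
    print("n_gpu_layers =", pick_gpu_layers())
```

Offloading only takes effect if llama-cpp-python was built with CUDA support; on a CPU-only build the setting changes nothing.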


Three Interfaces for Different Contexts

CLI (Stealth Mode)

The simplest interface: just a > prompt and the output. The interface code itself needs nothing beyond the Python standard library. Works over SSH. Zero visual footprint.

> explain AES-GCM authentication
AES-GCM (Galois/Counter Mode) is an authenticated encryption mode that provides
both confidentiality and integrity...
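A loop like this needs very little code. The sketch below assumes a generate callable such as GemmaEngine.generate; the function names and the exit/quit commands are illustrative choices, not taken from cli.py.

```python
from typing import Callable, Iterable, Iterator, List


def run_cli(generate: Callable[[str], str], lines: Iterable[str]) -> List[str]:
    """Feed each non-empty input line to `generate` until 'exit' or 'quit'."""
    replies = []
    for line in lines:
        prompt = line.strip()
        if prompt.lower() in {"exit", "quit"}:
            break
        if not prompt:
            continue  # Ignore blank lines instead of sending them to the model
        reply = generate(prompt)
        print(reply)
        replies.append(reply)
    return replies


def stdin_lines(prompt_text: str = "> ") -> Iterator[str]:
    """Yield lines typed by the user; stop cleanly at end of input."""
    while True:
        try:
            yield input(prompt_text)
        except EOFError:
            return
```

In use, this would be wired up as run_cli(engine.generate, stdin_lines()). Taking the input lines as a parameter keeps the loop testable without a terminal.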

GUI (Daily Driver)

A Tkinter-based window with a conversation history panel, input box, and model status indicator. Tkinter ships with Python on Windows, so no extra install required. It’s not beautiful, but it’s functional and requires no browser.

Voice (Hands-Free)

Speech-to-text via speech_recognition with a Vosk backend → LLM → TTS via pyttsx3. Works entirely offline once the Vosk model is in place on the drive.

The voice interface is the most experimental — wake-word detection is not currently implemented, so you press a key to begin a sentence. Full always-on wake-word support is on the roadmap.


The “Run on Any Machine” Problem

This was harder than it sounds. Windows machines in the wild have wildly different Python configurations: Python 3.8, 3.11, 3.13, Microsoft Store wrappers, conda envs, no Python at all.

The launcher handles this via a discovery waterfall:

# usb_ai_launcher.py (simplified)
import os
import subprocess
import sys

PYTHON_CANDIDATES = [
    sys.executable,          # Whatever's running this script
    "python3", "python",
    r"C:\Python313\python.exe",
    r"C:\Python311\python.exe",
    # Per-user install; %LOCALAPPDATA% resolves to C:\Users\<user>\AppData\Local
    os.path.expandvars(r"%LOCALAPPDATA%\Programs\Python\Python313\python.exe"),
]

def find_python():
    for candidate in PYTHON_CANDIDATES:
        try:
            result = subprocess.run([candidate, "--version"], capture_output=True)
            if result.returncode == 0:
                return candidate
        except OSError:      # Covers FileNotFoundError and PermissionError
            continue
    raise RuntimeError("Python not found. Please install Python 3.8+")

If the host machine has no Python at all, the launcher stops here: the interpreter is the one dependency USB-AI doesn’t ship. For environments where Python is missing or prohibited, the planned fix is a PyInstaller-compiled executable (a single .exe, no Python required).
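The discovery waterfall above accepts any interpreter that answers --version, while the error message asks for Python 3.8+. A version check could close that gap. This is a sketch with hypothetical helper names, not part of the launcher as described.

```python
import re
import subprocess
import sys
from typing import Optional, Tuple


def parse_python_version(text: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from output such as 'Python 3.11.4'."""
    match = re.search(r"Python (\d+)\.(\d+)", text)
    return (int(match.group(1)), int(match.group(2))) if match else None


def is_supported(candidate: str = sys.executable,
                 minimum: Tuple[int, int] = (3, 8)) -> bool:
    """True if `candidate --version` runs and reports at least `minimum`."""
    try:
        result = subprocess.run([candidate, "--version"],
                                capture_output=True, text=True, timeout=10)
    except (OSError, subprocess.TimeoutExpired):
        return False
    # Very old interpreters print the version on stderr, so check both streams.
    version = parse_python_version(result.stdout + result.stderr)
    return result.returncode == 0 and version is not None and version >= minimum
```

Swapping the returncode check in find_python() for is_supported(candidate) would make the launcher skip interpreters that are too old instead of failing later during dependency installation.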


Performance Reality Check

On a mid-range laptop (Intel Core i7, 16GB RAM, no discrete GPU, USB 3.0 SSD):

| Metric | Value |
| --- | --- |
| Cold launch time | ~8 seconds |
| First token latency | ~3 seconds |
| Token generation speed | ~12 tokens/sec |
| Peak RAM usage | ~2.1 GB |
| Peak USB read bandwidth | ~180 MB/s |

On a lower-end machine (4-core CPU, 8GB RAM):

  • First token latency: ~7 seconds
  • Token generation speed: ~5 tokens/sec
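These figures translate into wall-clock time with simple arithmetic: first-token latency plus tokens divided by generation speed. The function name is an assumption, and the model is deliberately rough (it treats the speed as constant across the whole response).

```python
def response_seconds(first_token_s: float, tokens: int, tokens_per_s: float) -> float:
    """Estimated time to produce a full response of `tokens` tokens."""
    return first_token_s + tokens / tokens_per_s


if __name__ == "__main__":
    # A full 512-token response, using the measurements quoted above.
    print(round(response_seconds(3, 512, 12), 1))  # mid-range laptop: 45.7
    print(round(response_seconds(7, 512, 5), 1))   # lower-end machine: 109.4
```

Short answers feel responsive on both machines; it is the long, max_tokens-length generations where the lower-end hardware becomes a wait of nearly two minutes.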

It’s not ChatGPT. But for writing a script, debugging code, or explaining a concept while offline — it’s genuinely useful.


What I’d Build Next

PyInstaller executable. The single biggest friction point is requiring Python to already exist. A compiled .exe that bundles the Python runtime would make USB-AI truly zero-dependency.

Quantisation options. Gemma-3-1B-IT in Q4_K_M quantisation is about 800MB. Q8_0 is 1.1GB but noticeably sharper. An interface option to swap quantisation levels would let users tune the quality/speed tradeoff.

TinyLLaMA and DeepSeek-Coder integration. Both models are staged. TinyLLaMA needs better prompt engineering to produce consistent output. DeepSeek-Coder needs optimised chunked loading for 8GB machines.

Encrypted model storage. If the drive is lost, anyone who finds it can read the model and your conversation history. AES-encrypted storage would address this.

The Voice Stack in Detail

The voice interface is the most technically layered of the three. It chains four components:

  1. Wake signal: currently a key press (full wake-word detection is planned).
  2. Speech-to-Text: the speech_recognition library with a Vosk backend for fully offline transcription. The Vosk model (vosk-model-small-en-us) lives on the USB drive. No Google Cloud Speech, no network request.
  3. Inference: the transcribed text is passed directly to the GemmaEngine.generate() method.
  4. Text-to-Speech: pyttsx3 with the system’s built-in TTS voice. On Windows this is SAPI5; on macOS it falls back to say. Zero network dependency.

flowchart LR
    MIC["Microphone"] -->|key press| STT
    subgraph VOICE["Voice Pipeline — All Offline"]
        STT["Vosk STT\nvosk-model-small-en-us\non USB drive"]
        LLM["GemmaEngine\nllama-cpp-python\nuse_mmap=True"]
        TTS["pyttsx3\nSAPI5 / macOS say"]
        STT --> LLM --> TTS
    end
    TTS -->|spoken response| SPK["Speaker"]
    subgraph USB["USB Drive"]
        M["gemma-3-1b-it\nGGUF weights"]
        V["Vosk model"]
    end
    M --> LLM
    V --> STT

The result: a fully voice-driven AI assistant that operates with zero internet access, runs from removable storage, and leaves no conversation history on the host machine.
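The chain of components can be sketched as a single function that takes each stage as a callable. The function and parameter names are hypothetical, and the real voice.py may be structured differently; the point is only that each stage hands its output to the next.

```python
from typing import Callable


def voice_turn(
    record: Callable[[], bytes],
    transcribe: Callable[[bytes], str],
    generate: Callable[[str], str],
    speak: Callable[[str], None],
) -> str:
    """Run one key-press-to-speech turn and return the model's reply."""
    audio = record()                    # Stage 1: capture after the key press
    text = transcribe(audio).strip()    # Stage 2: offline speech-to-text
    if not text:
        return ""                       # Nothing recognised, so stay silent
    reply = generate(text)              # Stage 3: LLM inference
    speak(reply)                        # Stage 4: text-to-speech
    return reply
```

With the stack described above, transcribe would wrap the Vosk recogniser, generate would be GemmaEngine.generate, and speak would wrap a pyttsx3 engine's say() and runAndWait() calls. Keeping the stages injectable also means each one can be swapped or tested without a microphone.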

The Inference Rules File

One design choice I like: the settings that control inference behaviour live in config/inference.rules, a JSON file, rather than being hard-coded in Python. This means you can tune the model without touching source code:

{
  "context_window": 2048,
  "temperature": 0.7,
  "top_p": 0.9,
  "max_tokens": 512,
  "threads": 4,
  "system_prompt": "You are a concise, technically accurate assistant. Prefer code examples over prose. Be honest when you are uncertain.",
  "stop_sequences": ["<end_of_turn>", "<eos>"]
}

Swapping system_prompt to "You are a CTF challenge-solving assistant specialised in binary exploitation and reverse engineering." gives you a different assistant personality instantly. The same model, different context framing, meaningfully different output character.
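How the system prompt reaches the model is not shown here, so the following is an assumption-laden sketch. Gemma's chat format uses <start_of_turn> and <end_of_turn> markers and has no separate system role, so a common approach is to prepend the system text to the user turn. The function name is hypothetical.

```python
def build_gemma_prompt(system_prompt: str, user_message: str) -> str:
    """Wrap one user message in Gemma's turn markers, system text first."""
    system = system_prompt.strip()
    user = user_message.strip()
    # No system role in this format: fold the system text into the user turn.
    content = f"{system}\n\n{user}" if system else user
    return f"<start_of_turn>user\n{content}<end_of_turn>\n<start_of_turn>model\n"


if __name__ == "__main__":
    print(build_gemma_prompt("You are a concise assistant.", "Explain AES-GCM."))
```

The prompt ends with an open model turn, so generation stops naturally at the <end_of_turn> stop sequence listed in the rules file.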

This pattern — externalising inference configuration — is something I’ve carried into every subsequent LLM project.
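A loader for such a file can stay small. In this sketch, the defaults mirror the JSON shown earlier, while the function names, the empty default system prompt, and the strict rejection of unknown keys are assumptions rather than the project's actual behaviour.

```python
import json
from pathlib import Path
from typing import Any, Dict

DEFAULT_RULES: Dict[str, Any] = {
    "context_window": 2048,
    "temperature": 0.7,
    "top_p": 0.9,
    "max_tokens": 512,
    "threads": 4,
    "system_prompt": "",
    "stop_sequences": ["<end_of_turn>", "<eos>"],
}


def load_rules(path: Path) -> Dict[str, Any]:
    """Merge the rules file over the defaults; reject keys we don't know."""
    rules = dict(DEFAULT_RULES)
    if path.exists():
        loaded = json.loads(path.read_text(encoding="utf-8"))
        unknown = set(loaded) - set(DEFAULT_RULES)
        if unknown:
            # Fail loudly on typos such as "temprature" instead of ignoring them.
            raise ValueError(f"Unknown keys in {path.name}: {sorted(unknown)}")
        rules.update(loaded)
    return rules


def sampling_kwargs(rules: Dict[str, Any]) -> Dict[str, Any]:
    """Translate rule names into keyword arguments for a llama-cpp-python call."""
    return {
        "max_tokens": rules["max_tokens"],
        "temperature": rules["temperature"],
        "top_p": rules["top_p"],
        "stop": rules["stop_sequences"],
    }
```

The engine could then call self.llm(prompt, **sampling_kwargs(rules)), with context_window and threads feeding n_ctx and n_threads in the constructor.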


USB-AI is the project that made me genuinely appreciate how much complexity is hiding inside “just call the API.” Running inference locally means owning every layer of the stack: quantisation formats, memory maps, tokenisers, context windows. It’s harder. And it teaches you more.

Your AI, on your hardware, under your control. No subscription required.

— Vasanth