Models

Self-hosting

The weights are Apache-2.0 licensed and public on Hugging Face. Run them on your own hardware: no API key is needed, no usage data leaves your network, and a lapsed plan never bricks a local model.

Get the weights

Every model lives at flywheel-ai/<niche> on Hugging Face, in two builds: a Q4_K_M GGUF (~20 GB, CPU/laptop) and full bf16 safetensors (~65 GB, GPU). Pick the runner below that matches your hardware. Both expose an OpenAI-compatible /v1 API; only the default port differs (8000 for vLLM, 8080 for llama.cpp).
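The two build sizes can be cross-checked against each other, which is a quick sanity check on a download. The sketch below is a back-of-envelope estimate only. It assumes 1 GB = 10^9 bytes and that weights dominate the file size; the parameter count it derives is inferred from the ~65 GB figure and is not a published number.

```python
# Back-of-envelope check of the two build sizes quoted above.
# Assumptions (not from the model card): 1 GB = 1e9 bytes, and the
# weights account for essentially the whole file.

BF16_SIZE_GB = 65          # full-precision build, from the text
GGUF_SIZE_GB = 20          # Q4_K_M build, from the text
BYTES_PER_BF16_PARAM = 2   # bf16 stores each parameter in 16 bits


def estimate_params(bf16_size_gb: float) -> float:
    """Approximate parameter count implied by a bf16 checkpoint size."""
    return bf16_size_gb * 1e9 / BYTES_PER_BF16_PARAM


def bits_per_weight(quant_size_gb: float, n_params: float) -> float:
    """Average number of bits stored per parameter in a quantized file."""
    return quant_size_gb * 1e9 * 8 / n_params


n_params = estimate_params(BF16_SIZE_GB)
bpw = bits_per_weight(GGUF_SIZE_GB, n_params)
print(f"~{n_params / 1e9:.1f}B parameters, ~{bpw:.1f} bits per weight in the GGUF")
```

Under those assumptions this works out to roughly 32.5B parameters and about 4.9 bits per weight, which is the range you would expect from a 4-bit quantization that also stores per-block scales.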

Serve with vLLM (GPU)

For production GPU serving, vLLM gives you high throughput and native streaming straight from the safetensors:

shell

pip install vllm

# bf16 safetensors on a GPU (24GB+ recommended):
vllm serve flywheel-ai/fitness --served-model-name fitness

# → OpenAI-compatible endpoint at http://localhost:8000/v1
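Loading the safetensors takes a while, so it can help to wait for the server before sending traffic. The following is a minimal sketch that polls the standard OpenAI-style `GET /v1/models` route using only the standard library. The helper names `list_models` and `wait_until_ready`, and the retry counts, are illustrative choices, not part of vLLM.

```python
import json
import time
import urllib.request


def list_models(base_url: str, timeout: float = 5.0) -> list[str]:
    """Return the model ids reported by GET {base_url}/models."""
    url = f"{base_url.rstrip('/')}/models"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        payload = json.load(resp)
    return [entry["id"] for entry in payload.get("data", [])]


def wait_until_ready(base_url: str, attempts: int = 60, delay: float = 5.0) -> list[str]:
    """Poll the server until it answers, then return its model ids."""
    last_error = None
    for _ in range(attempts):
        try:
            return list_models(base_url)
        except OSError as exc:  # covers connection refused, timeouts, URLError
            last_error = exc
            time.sleep(delay)
    raise RuntimeError(f"{base_url} did not come up: {last_error}")
```

Calling `wait_until_ready("http://localhost:8000/v1")` should return a list containing `fitness` once the server started above is up, since that is the name passed to `--served-model-name`.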

Run with llama.cpp (laptop / CPU)

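No GPU? The Q4_K_M GGUF runs well on a modern laptop because the base is a sparse mixture-of-experts model: only a fraction of the parameters activate per token, so per-token compute is much lower than the total size suggests. The full file still has to fit in memory.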

shell

# laptop / CPU with the Q4_K_M GGUF (~20GB):
huggingface-cli download flywheel-ai/fitness model-q4_k_m.gguf --local-dir .
llama-server -m model-q4_k_m.gguf -c 8192

# → OpenAI-compatible endpoint at http://localhost:8080/v1
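To smoke-test the laptop server without installing an SDK, you can post a chat request directly. This sketch uses only the Python standard library and the standard OpenAI-style `/v1/chat/completions` route; the helper names `build_chat_request` and `chat` are hypothetical, and the 120-second timeout is an arbitrary allowance for slow CPU generation.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build the POST request an OpenAI-compatible server expects."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send one user message and return the assistant's reply text."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]
```

With `llama-server` running as above, `chat("http://localhost:8080/v1", "fitness", "Beginner full-body workout?")` returns the reply as a string.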

Point your SDK at it

Either runner is OpenAI-compatible, so any OpenAI SDK works: set base_url to the local endpoint (port 8000 for vLLM, 8080 for llama.cpp), pass any placeholder key, and set model to the niche slug. The example below targets vLLM:

python

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="fitness",
    messages=[{"role": "user", "content": "Beginner full-body workout?"}],
)
print(resp.choices[0].message.content)
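Streaming uses the same call with `stream=True`. The sketch below wraps it in a generator; `stream_reply` is a hypothetical helper name, and it accepts any client object that exposes the OpenAI SDK's `chat.completions.create` method, such as the `client` created above.

```python
from typing import Any, Iterator


def stream_reply(client: Any, model: str, prompt: str) -> Iterator[str]:
    """Yield text fragments as the server produces them."""
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,  # request incremental chunks instead of one response
    )
    for chunk in stream:
        # Some chunks carry no text, e.g. the role header or the final stop chunk.
        if chunk.choices and chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content
```

To print tokens as they arrive, iterate over `stream_reply(client, "fitness", "Beginner full-body workout?")` and call `print(piece, end="", flush=True)` on each piece.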

Tip. Once the weights are downloaded, self-hosting is fully offline: nothing is sent anywhere, and there’s no hosted API key. Usage data flows back only if you later opt into the training flywheel, and only with your explicit consent.

Hardware

Build	Runner	Recommended hardware
Q4_K_M GGUF (~20 GB)	llama.cpp	Modern laptop / CPU, 32 GB+ RAM
bf16 safetensors (~65 GB)	vLLM	One 24 GB+ GPU (more for higher concurrency)

Start with the GGUF to try a model locally; move to vLLM on a GPU when you need throughput and streaming. See the per-model card in the catalog for exact build sizes.
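The table can be turned into a small lookup if you provision machines from a script. This is a sketch only: `Build` and `recommend_build` are hypothetical names, the thresholds are the recommendations from the table rather than hard limits, and the URLs are each runner's default local endpoint as shown earlier on this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Build:
    name: str
    runner: str
    base_url: str  # default local endpoint for this runner


GGUF = Build("Q4_K_M GGUF", "llama.cpp", "http://localhost:8080/v1")
BF16 = Build("bf16 safetensors", "vLLM", "http://localhost:8000/v1")


def recommend_build(ram_gb: float, gpu_vram_gb: float = 0.0) -> Optional[Build]:
    """Map a machine's memory to the build recommended in the table.

    Returns None when the machine is below both recommendations; the model
    may still run there, but that is outside what the table covers.
    """
    if gpu_vram_gb >= 24:
        return BF16
    if ram_gb >= 32:
        return GGUF
    return None
```

For example, `recommend_build(ram_gb=32)` returns the GGUF build, whose `base_url` can be passed straight to the SDK client.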
