Ollama opened the door for local inference. But many builders want something simpler and more transparent: plain files, visible flags, no blob store, no cloud dependency.

Anvil is for that group.

Why Anvil exists

The inference tooling landscape has a pattern: start local, add convenience, then pivot to a cloud API. The local binary becomes a thin client. Your models end up in a blob store you can’t inspect. You need an account to download what you already bought.

Anvil is built for people who want a different set of tradeoffs — tools that keep their models, hardware, and network under their own control.

What Anvil does

Anvil is a transparent wrapper around llama.cpp and llama-server. It doesn’t reinvent the inference engine. It manages the runtime, pulls models from Hugging Face, applies sensible defaults from GGUF metadata, and exposes an OpenAI-compatible endpoint at http://localhost:11434/v1.

Plain GGUFs, transparent runtime

Models live as plain GGUF files in normal directories. You can ls them, cp them, rsync them to another machine. There’s no opaque hash-based store hiding your models in a hidden directory. No proprietary format. No account. No cloud dependency.

Anvil doesn’t abstract away llama.cpp. It exposes the flags. If you want to tune --mlock, --num-thread, --ctx-size, or any other parameter, you set it directly. The tool applies smart defaults from the GGUF metadata — tensor type, recommended thread count, context length — so you don’t have to guess for common models, but nothing is hidden.

Anvil can install and manage the llama-server runtime for you. One command sets up the binary, checks your GPU, and prepares the environment.

curl -fsSL https://raw.githubusercontent.com/sovereignty-labs/anvil/main/install.sh | sh
anvil runtime install
anvil pull unsloth/Qwen3-8B-GGUF:Q4_K_M
anvil serve && anvil load Qwen3-8B-Q4_K_M.gguf

That’s the full path from nothing to a serving model. Point Open WebUI, Cursor, Continue, or any client at http://localhost:11434/v1 and it works.

Fleet management without mystery

Anvil isn’t just a single-machine tool. It supports multi-model, multi-GPU, and fleet-style workflows. You can split a large model across GPUs on one machine, or register multiple machines on your network and manage them as a fleet.

MCP for agents and operators

Fleet management is exposed through MCP tools, so agents can query available models, check GPU utilization, and route requests — without you building custom APIs.

Try it

The local AI space has matured enough that the bottleneck isn’t running a model anymore. It’s managing the friction: downloading the right GGUF, figuring out the flags, keeping the server running, scaling across machines, and making it all discoverable by your tools.

Anvil addresses that friction for people who aren’t going to hand their inference to a cloud provider. It’s built for homelab users, developers running Open WebUI, and anyone who has grown tired of tools that promise local and quietly migrate you to SaaS.

Getting started

Try it today. The Getting Started guide walks through installation, model selection, and connecting tools. The source is on GitHub.

Anvil is early. It’s transparent. And it’s built for people who believe your models should run on your iron, your memory should live in your database, and your agents should operate under your authority.


A Sovereignty Labs project. Own your intelligence.