The shortest path between a model and silicon. One binary routes any AI model to any hardware. 19 backends. Zero dependencies. ESP32 to datacenter.
Three things to know. Nothing more.
305 kilobytes. Smaller than a photo on your phone. This file lets your computer run AI — any AI — without the internet. Download it, run it. That's it.
When you use AI online, your questions travel to a distant server. Someone can read them. With Inference-X, nothing leaves your machine. Ever. It's just you and your computer.
Old laptop, new phone, Raspberry Pi, datacenter. Same file. It detects your hardware and uses it. No configuration. No expertise needed.
Your AI runs locally. No internet. No account. Free forever.
Smaller than you think.
The entire engine — all 19 hardware targets, all 23 formats — fits in less space than a single photo.
Your question leaves your device, crosses the internet, reaches a server in another country, gets processed, stored, and analyzed. You pay per word.
Your question stays on your desk. The answer is computed by your own processor. Nothing leaves. Nothing is stored. You pay nothing.
12,571 lines of C++17. Six architectures. The model describes itself. The engine reads.
Dequantization and matrix multiply in one instruction loop. No intermediate buffer. Output closer to the model's theoretical maximum.
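The fusion idea can be sketched in a few lines. This is an illustrative toy in Python, not the engine's C++ kernels: the block-quantized int8 format, block size, and function names here are hypothetical stand-ins.

```python
def fused_dequant_dot(q_weights, scales, block_size, x):
    """Dot product of block-quantized int8 weights with activations x.
    Dequantization happens inside the accumulation loop, so no
    intermediate float buffer is ever materialized."""
    acc = 0.0
    for i, q in enumerate(q_weights):
        scale = scales[i // block_size]   # one float scale per block
        acc += (q * scale) * x[i]         # dequantize and multiply in one step
    return acc

def two_pass_dot(q_weights, scales, block_size, x):
    """Reference path: dequantize to a full float buffer, then multiply."""
    dequant = [q * scales[i // block_size] for i, q in enumerate(q_weights)]
    return sum(w * xi for w, xi in zip(dequant, x))
```

Both paths compute the same number; the fused one simply skips the temporary buffer, which is where the memory-bandwidth savings come from.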
Only active experts exist in memory. A 1-trillion-parameter model runs on 64 GB RAM. Prefetch next layer while current computes.
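The memory claim is back-of-envelope arithmetic. A sketch, where every number (expert fraction, expert count, active-expert count, ~4-bit storage) is an assumption chosen for illustration, not a documented configuration:

```python
def resident_gb(total_params, expert_frac, n_experts, n_active, bytes_per_param):
    """Estimate resident memory when only the active experts are loaded."""
    shared = total_params * (1 - expert_frac)            # always-resident params
    expert = total_params * expert_frac                  # spread across experts
    active = shared + expert * (n_active / n_experts)    # what actually lives in RAM
    return active * bytes_per_param / 1e9

# Hypothetical 1T-parameter MoE: 95% of weights in 128 experts,
# 8 active per token, ~4-bit quantization (0.5 bytes/param).
full_model_gb = 1e12 * 0.5 / 1e9          # all weights resident: 500 GB
active_gb = resident_gb(1e12, 0.95, 128, 8, 0.5)
```

Under these assumptions the active working set comes to roughly 55 GB, which is how a trillion-parameter model can fit a 64 GB machine.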
kernel_dispatch.h routes computation to 19 backends through one abstraction. Same source, same call, automatic detection.
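The single-abstraction pattern behind `kernel_dispatch.h` can be mimicked in Python. Everything here is a hypothetical sketch of the idea, not the header's actual API: backends register with a hardware predicate, the first matching one wins, and a portable scalar fallback always exists.

```python
import platform

BACKENDS = {}  # name -> (hardware predicate, kernel)

def register(name, predicate):
    def wrap(fn):
        BACKENDS[name] = (predicate, fn)
        return fn
    return wrap

@register("arm_neon", lambda: platform.machine().lower() in ("arm64", "aarch64"))
def matmul_neon(a, b):
    raise NotImplementedError("stand-in for a NEON kernel")

@register("x86_avx2", lambda: platform.machine().lower() in ("x86_64", "amd64"))
def matmul_avx2(a, b):
    raise NotImplementedError("stand-in for an AVX2 kernel")

@register("scalar", lambda: True)   # portable fallback, always available
def matmul_scalar(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def select_backend():
    """One call site: probe registered backends in order, take the first hit."""
    for name, (pred, fn) in BACKENDS.items():
        if pred():
            return name, fn
```

The caller never names a backend; the same call works everywhere because selection happens once, at startup, from detected hardware.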
Simple questions get compressed early layers. Complex reasoning gets full precision. The engine adapts to each query.
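The engine's actual routing heuristic is not documented here, but the idea can be shown with a deliberately naive toy. The marker words and thresholds below are entirely made up for illustration:

```python
def pick_precision(prompt: str) -> str:
    """Toy heuristic (hypothetical): route short factual queries to
    compressed low-bit layers, long or multi-step prompts to full precision."""
    reasoning_markers = ("prove", "derive", "step by step", "why")
    text = prompt.lower()
    if len(prompt.split()) > 64 or any(m in text for m in reasoning_markers):
        return "fp16"   # full-precision path for complex reasoning
    return "q4"         # compressed early layers for simple lookups
```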
No network calls. No telemetry. No phone-home. Models are local files. Works on a plane, in a submarine, on the moon.
Chat templates, EOS tokens, architecture — all auto-detected from GGUF metadata. Download a model and run.
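Auto-detection works because GGUF files carry their own metadata as key-value pairs in the header. The sketch below builds and parses a minimal synthetic header following the GGUF v3 layout as published (magic, version, tensor count, KV count, then length-prefixed keys and typed values); it handles only string values, and the key names are the conventional ones, not an exhaustive list.

```python
import struct

GGUF_STRING = 8  # string value-type id in the GGUF spec

def build_header(kv):
    """Build a minimal synthetic GGUF header (string values only, 0 tensors)."""
    out = b"GGUF" + struct.pack("<IQQ", 3, 0, len(kv))  # version 3
    for k, v in kv.items():
        kb, vb = k.encode(), v.encode()
        out += struct.pack("<Q", len(kb)) + kb
        out += struct.pack("<IQ", GGUF_STRING, len(vb)) + vb
    return out

def read_metadata(buf):
    """Walk the KV section the way an engine would to auto-detect
    architecture, chat template, and EOS token."""
    assert buf[:4] == b"GGUF", "not a GGUF file"
    version, n_tensors, n_kv = struct.unpack_from("<IQQ", buf, 4)
    off, meta = 24, {}
    for _ in range(n_kv):
        (klen,) = struct.unpack_from("<Q", buf, off); off += 8
        key = buf[off:off + klen].decode(); off += klen
        (vtype,) = struct.unpack_from("<I", buf, off); off += 4
        assert vtype == GGUF_STRING, "sketch handles string values only"
        (vlen,) = struct.unpack_from("<Q", buf, off); off += 8
        meta[key] = buf[off:off + vlen].decode(); off += vlen
    return meta
```

The model describes itself through keys like `general.architecture`; the engine just reads them.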
The Makefile detects your hardware. You don't configure it — it configures itself.
Runs any GGUF model. Here are a few we've benchmarked.
Start with --serve 8080. A drop-in replacement for the OpenAI API. Any OpenAI-compatible client library works.
Endpoints: POST /v1/chat/completions · POST /v1/completions · GET /v1/models · GET /health
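A client needs nothing but the standard library. The sketch below targets the endpoints listed above; the model name and port are placeholders for whatever you loaded, and no API key header is sent because none is required.

```python
import json
from urllib import request

def build_payload(prompt, model="your-model"):
    """OpenAI-style chat completion request body."""
    return {"model": model,
            "messages": [{"role": "user", "content": prompt}]}

def chat(prompt, base="http://localhost:8080", model="your-model"):
    """POST to a local Inference-X server and return the reply text."""
    req = request.Request(
        base + "/v1/chat/completions",
        data=json.dumps(build_payload(prompt, model)).encode(),
        headers={"Content-Type": "application/json"},  # no API key needed
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Point any existing OpenAI client at `http://localhost:8080/v1` and it should work the same way.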
Using AI 1 hour per day, every day, for a year.
No API key. No subscription. No limit. Your hardware, your AI.
Pick your system.
macOS
git clone https://git.inference-x.com/salka/inference-x.git && cd inference-x && make
./inference-x your-model.gguf
AI on your Mac.

Linux
sudo apt install build-essential git
git clone https://git.inference-x.com/salka/inference-x.git && cd inference-x && make
./inference-x your-model.gguf --serve 8080
Open localhost:8080.

Windows
git clone https://git.inference-x.com/salka/inference-x.git; cd inference-x; make
inference-x.exe your-model.gguf

Raspberry Pi
sudo apt install build-essential git
git clone https://git.inference-x.com/salka/inference-x.git && cd inference-x && make
./inference-x smollm2-135m.gguf
AI on a $35 board.

No tricks. No limits. The engine is the same everywhere.
Extract, measure, and transplant components between AI models. Like organ transplants — for neural networks.
Analyze model architecture — layers, attention heads, FFN dimensions, expert topology. Non-invasive. Complete.
Isolate individual layers, attention mechanisms, or expert networks. Clean cuts. Preserves signal integrity.
Transplant components between compatible models. A reasoning layer from one, creativity from another. Chimeric intelligence.
Build custom AI models from components. Select a base, choose precision, optimize for your hardware. No training required.
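The extract-and-transplant step reduces to copying named tensors between models when shapes line up. A sketch, where models are plain dicts of tensor name to data and the `blk.N.` prefix follows GGUF-style naming conventions (an assumption about the real tool's internals):

```python
def transplant_layer(base, donor, layer_idx):
    """Graft every tensor of one transformer layer from donor into base.
    The shape check is the 'organ compatibility' test: mismatched
    dimensions mean the transplant is rejected."""
    prefix = f"blk.{layer_idx}."
    grafted = dict(base)  # leave the base model untouched
    for name, tensor in donor.items():
        if name.startswith(prefix):
            assert name in grafted, f"base model has no tensor {name}"
            assert len(grafted[name]) == len(tensor), f"shape mismatch at {name}"
            grafted[name] = tensor
    return grafted
```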
Choose from 7+ GGUF models. Each pre-analyzed for organ compatibility.
Set quantization (Q2→FP32), precision strategy, expert selection. 23 formats supported.
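What the quantization choice costs in disk and RAM is simple arithmetic. The bits-per-weight values below are nominal (real k-quant formats carry small per-block scale overhead on top), and the format names are a subset of the 23 supported:

```python
# Nominal bits per weight for a few formats (ignores per-block scale overhead).
BITS_PER_WEIGHT = {"Q2": 2, "Q4": 4, "Q8": 8, "FP16": 16, "FP32": 32}

def model_bytes(n_params, fmt):
    """Back-of-envelope model size for a given quantization format."""
    return n_params * BITS_PER_WEIGHT[fmt] / 8

q4_gb = model_bytes(7e9, "Q4") / 1e9   # a 7B model at 4-bit: ~3.5 GB
```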
One binary. Your hardware. Adaptive precision matches model to silicon automatically.
Pre-configured models for specific industries. Healthcare, agriculture, legal, finance. Deploy in seconds.
Q2 2026 · Medical diagnosis, drug interaction, radiology AI. Privacy-first. Runs locally.
Q2 2026 · Crop disease detection, irrigation optimization, yield prediction. Edge-ready.
Q2 2026 · Contract analysis, compliance checking, case research. Your data stays yours.
Q2 2026 · Risk assessment, market analysis, regulatory compliance. Zero cloud dependency.
Q2 2026 · Code generation, CAD analysis, technical documentation. Runs on your workstation.
Q2 2026 · Tutoring, curriculum generation, assessment. Works offline. Perfect for schools.
Clone. Build. Run. No signup. No API key. No cloud.
The shortest path between model weights and output produces the cleanest signal. Every buffer removed, every conversion eliminated — that is Inference-X.
Built in Morocco for the world.