Universal Inference Protocol

Intelligence everywhere.
Permission from no one.

The shortest path between a model and silicon. One binary routes any AI model to any hardware. 19 backends. Zero dependencies. ESP32 to datacenter.

# Install and run in 30 seconds
git clone https://git.inference-x.com/salka/inference-x.git
cd inference-x && make
./inference-x model.gguf --serve 8080
305 KB · Binary Size
0 · Dependencies
23 · Quant Formats
19 · Backends
01

What is this, exactly?

Three things to know. Nothing more.

📦

It's a tiny file

305 kilobytes. Smaller than a photo on your phone. This file lets your computer run AI — any AI — without the internet. Download it, run it. That's it.

🔒

Your words stay yours

When you use AI online, your questions travel to a distant server. Someone can read them. With Inference-X, nothing leaves your machine. Ever. It's just you and your computer.

🌍

It runs on anything

Old laptop, new phone, Raspberry Pi, datacenter. Same file. It detects your hardware and uses it. No configuration. No expertise needed.

02

What can YOUR computer do?

Move the slider to your RAM. See what's possible.

RAM slider: 1 GB to 128 GB (shown: 8 GB)

Your AI runs locally. No internet. No account. Free forever.
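The slider's math is easy to sketch. Below is a hypothetical rule of thumb (not from the Inference-X source): weight memory is parameters times bits per weight, plus a flat overhead factor for the KV cache and activations. Using roughly 4.5 bits per weight for Q4_K_M and 8.5 for Q8_0, the results line up with the RAM figures in the benchmark table in section 07.

```cpp
// Hypothetical back-of-envelope sizing (illustrative, not engine code).
// params_b: parameters in billions; bits_per_weight: ~4.5 for Q4_K_M,
// ~8.5 for Q8_0; overhead: flat factor for KV cache and activations.
double estimate_ram_gb(double params_b, double bits_per_weight,
                       double overhead = 1.2) {
    // billions of params * bits / 8 gives gigabytes of weights directly
    return params_b * bits_per_weight / 8.0 * overhead;
}
```

For example, an 8B model at Q4_K_M comes out near 5.4 GB, which is why it sits in the 8 GB RAM tier.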

03

How small is 305 KB?

Smaller than you think.

The entire engine — all 19 hardware targets, all 23 formats — fits in less space than a single photo.

04

Where do your words go?

Cloud AI

☁️ → 🏢 → 👁️

Your question leaves your device, crosses the internet, reaches a server in another country, gets processed, stored, and analyzed. You pay per word.

Inference-X

💻 → 🧠 → ✅

Your question stays on your desk. The answer is computed by your own processor. Nothing leaves. Nothing is stored. You pay nothing.

05

One binary to run them all

12,571 lines of C++17. Six architectures. The model describes itself. The engine reads.

FUSED

Zero-Copy Inference

Dequantization and matrix multiply in one instruction loop. No intermediate buffer. Throughput closer to the model's theoretical maximum.
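The fused idea can be shown with a toy symmetric int8 scheme (not a real GGUF block format; function names here are illustrative). The fused loop dequantizes and multiplies in the same iteration, so no full-precision weight buffer is ever written to memory; the unfused version pays for a temporary and a second pass.

```cpp
#include <cstdint>
#include <vector>

// Fused: dequantize and accumulate in one loop, no intermediate buffer.
float fused_dot(const int8_t* q, float scale, const float* x, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; ++i)
        acc += static_cast<float>(q[i]) * scale * x[i];  // dequant + MAC fused
    return acc;
}

// Unfused: materialize the weights first, then multiply (extra n*4 bytes).
float unfused_dot(const int8_t* q, float scale, const float* x, int n) {
    std::vector<float> w(n);
    for (int i = 0; i < n; ++i) w[i] = static_cast<float>(q[i]) * scale;
    float acc = 0.0f;
    for (int i = 0; i < n; ++i) acc += w[i] * x[i];
    return acc;
}
```

Both return the same value; only the memory traffic differs.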

MoE

Trillion-Parameter Native

Only active experts exist in memory. A 1-trillion-parameter model runs on 64 GB RAM. Prefetch next layer while current computes.
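A hypothetical sizing sketch makes the claim concrete (the split below is illustrative, not the engine's accounting): only the routed experts need to be resident per token, so the expert share of the parameters scales with top_k / n_experts. With 95% of a 1-trillion-parameter model in experts, 2 of 64 experts routed, and 4.5-bit quantization, the resident set lands around 45 GB, inside 64 GB of RAM.

```cpp
// Illustrative MoE residency model (assumed numbers, not engine internals).
// total_b: total parameters in billions; expert_frac: fraction of params
// in experts; top_k of n_experts are routed; bits: bits per weight.
double resident_gb(double total_b, double expert_frac, int top_k,
                   int n_experts, double bits) {
    double shared_b = total_b * (1.0 - expert_frac);  // attention, norms, ...
    double active_b = total_b * expert_frac *
                      static_cast<double>(top_k) / n_experts;
    return (shared_b + active_b) * bits / 8.0;        // billions -> GB
}
```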

DISPATCH

19 Silicon Targets

kernel_dispatch.h routes computation to 19 backends through one abstraction. Same source, same call, automatic detection.
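The pattern can be sketched in miniature (the real kernel_dispatch.h is not shown here; every name below is hypothetical): each backend registers behind one function-pointer interface, and the first available backend in priority order wins. Only a CPU reference kernel is implemented in this sketch.

```cpp
// Hypothetical one-abstraction dispatch, modeled on the description above.
using matmul_fn = void (*)(const float* a, const float* b, float* c,
                           int m, int n, int k);

// Reference CPU kernel: c[m x n] = a[m x k] * b[k x n].
void matmul_cpu(const float* a, const float* b, float* c,
                int m, int n, int k) {
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p)
                acc += a[i * k + p] * b[p * n + j];
            c[i * n + j] = acc;
        }
}

struct Backend {
    const char* name;
    bool (*available)();
    matmul_fn matmul;
};

bool always_true() { return true; }

// CUDA, Metal, Vulkan, ... would sit above CPU in this priority list.
static const Backend kBackends[] = {
    {"cpu", always_true, matmul_cpu},
};

const Backend& select_backend() {
    for (const Backend& b : kBackends)
        if (b.available()) return b;
    return kBackends[0];  // CPU always matches, so this is unreachable
}
```

Same source, same call: the caller only ever sees `select_backend().matmul`.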

ADAPTIVE

Smart Precision

Simple questions get compressed early layers. Complex reasoning gets full precision. The engine adapts to each query.
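One way such a policy could look, as a purely illustrative sketch (the engine's actual heuristic is not public, and both the threshold and the signal are assumptions): short factual prompts route through compressed layers, long or reasoning-heavy prompts get full precision.

```cpp
// Illustrative precision policy only; not the engine's real heuristic.
enum class Precision { Q4, FP16 };

Precision pick_precision(int prompt_tokens, bool needs_reasoning) {
    if (needs_reasoning || prompt_tokens > 256)
        return Precision::FP16;   // complex query: full precision
    return Precision::Q4;         // simple query: compressed early layers
}
```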

AIR-GAP

No Cloud. Ever.

No network calls. No telemetry. No phone-home. Models are local files. Works on a plane, in a submarine, on the moon.

AUTO

Zero Configuration

Chat templates, EOS tokens, architecture — all auto-detected from GGUF metadata. Download a model and run.
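Metadata-driven configuration can be sketched like this. GGUF files carry key/value metadata; the key name below follows the GGUF convention ("general.architecture"), while the map simply stands in for a parsed file header, since the engine's actual parser is not shown here.

```cpp
#include <map>
#include <string>

// Look up a metadata key with a fallback, as a stand-in for GGUF parsing.
std::string meta_or(const std::map<std::string, std::string>& meta,
                    const std::string& key, const std::string& fallback) {
    auto it = meta.find(key);
    return it != meta.end() ? it->second : fallback;
}

// The architecture is read from the file, never asked of the user.
std::string detect_architecture(
        const std::map<std::string, std::string>& meta) {
    return meta_or(meta, "general.architecture", "unknown");
}
```

The same lookup pattern covers chat templates and EOS tokens: every decision the user would normally make is already written in the model file.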

06

Any silicon. One binary.

The Makefile detects your hardware. You don't configure it — it configures itself.

CPU AVX2/512 · Intel, AMD
CUDA · NVIDIA
ROCm · AMD GPU
Metal · Apple Silicon
Vulkan · Cross-platform
ARM NEON · ARM / RPi
Snapdragon · Qualcomm
Hexagon HVX · Qualcomm DSP
TPU · Google
Inferentia · AWS
Gaudi · Intel HPU
Maia · Microsoft
SambaNova · RDU
Graphcore · IPU
Groq · LPU
Cerebras · 850K cores
FPGA · Xilinx
WebGPU · Browser
OpenCL · Universal
07

Tested models

Runs any GGUF model. Here are a few we've benchmarked.

SmolLM2 135M · Q8_0
130 tok/s · 2 GB RAM
Quick answers. Tiny device. Lightning fast.
Phi-3 Mini 3.8B · Q4_K_M
~4 tok/s · 4 GB RAM
Smart conversations, code help, translations.
Llama 3.2 3B · Q4_K_M
3.82 tok/s · 4 GB RAM
Meta's compact model. Great reasoning.
Mistral 7B · Q4_K_M
2.06 tok/s · 8 GB RAM
Creative writing, analysis, multilingual.
Llama 3.1 8B · Q4_K_M
1.75 tok/s · 8 GB RAM
Full-featured assistant. Code. Math. Logic.
DeepSeek R1 14B · Q4_K_M
0.97 tok/s · 16 GB RAM
Advanced reasoning. Expert-level answers.
08

OpenAI-compatible API

Start with --serve 8080. Drop-in replacement. Any client library works.

# Works with any OpenAI-compatible client
from openai import OpenAI

client = OpenAI(
  base_url="http://localhost:8080/v1",
  api_key="none"
)

resp = client.chat.completions.create(
  model="local",
  messages=[{"role": "user", "content": "Hello!"}],
  stream=True
)

Endpoints: POST /v1/chat/completions · POST /v1/completions · GET /v1/models · GET /health

09

How much does AI cost?

Using AI 1 hour per day, every day, for a year.

Cloud API (GPT-4 class)
~$360
/year
Inference-X (your hardware)
$0
forever · electricity only

No API key. No subscription. No limit. Your hardware, your AI.

10

Ready? Three steps.

Pick your system.

macOS
  1. Open Terminal: git clone https://git.inference-x.com/salka/inference-x.git && cd inference-x && make
  2. Download a model from HuggingFace (any .gguf file)
  3. Run: ./inference-x your-model.gguf — AI on your Mac.

Linux
  1. Install: sudo apt install build-essential git
  2. Build: git clone https://git.inference-x.com/salka/inference-x.git && cd inference-x && make
  3. Run: ./inference-x your-model.gguf --serve 8080 — open localhost:8080

Windows
  1. Install Git and a C++ compiler (MinGW or Visual Studio)
  2. PowerShell: git clone https://git.inference-x.com/salka/inference-x.git; cd inference-x; make
  3. Run: inference-x.exe your-model.gguf

Raspberry Pi
  1. On your Pi: sudo apt install build-essential git
  2. Build: git clone https://git.inference-x.com/salka/inference-x.git && cd inference-x && make
  3. Run: ./inference-x smollm2-135m.gguf — AI on a $35 board.
11

Free for those who need it. Fair for those who profit.

No tricks. No limits. The engine is the same everywhere.

Community
$0
forever
Individuals, researchers, students, open-source projects, businesses under $1M.
  • Full engine
  • All models
  • Community support
  • All 19 backends
Get Started
Business
Custom
annual
Companies with $1M+ revenue using IX in production.
  • Commercial license
  • Priority SLA
  • Custom optimization
  • Hardware consulting
Contact
OEM
Custom
per unit
Hardware manufacturers embedding IX in products.
  • Binary redistribution
  • Custom backends
  • On-site integration
  • Co-branding
Contact
12

Neural Surgery

Extract, measure, and transplant components between AI models. Like organ transplants — for neural networks.

🔬

Scan

Analyze model architecture — layers, attention heads, FFN dimensions, expert topology. Non-invasive. Complete.

7 models scannable
✂️

Extract

Isolate individual layers, attention mechanisms, or expert networks. Clean cuts. Preserves signal integrity.

Precision: layer-level
🧬

Graft

Transplant components between compatible models. A reasoning layer from one, creativity from another. Chimeric intelligence.

Families: auto-detected
13

Model Forge

Build custom AI models from components. Select a base, choose precision, optimize for your hardware. No training required.

1

Select Base

Choose from 7+ GGUF models. Each pre-analyzed for organ compatibility.

2

Configure

Set quantization (Q2→FP32), precision strategy, expert selection. 23 formats supported.

3

Deploy

One binary. Your hardware. Adaptive precision matches model to silicon automatically.

14

Model Store

Pre-configured models for specific industries. Healthcare, agriculture, legal, finance. Deploy in seconds.

🏥

Healthcare

Medical diagnosis, drug interaction, radiology AI. Privacy-first. Runs locally.

Q2 2026
🌾

Agriculture

Crop disease detection, irrigation optimization, yield prediction. Edge-ready.

Q2 2026
⚖️

Legal

Contract analysis, compliance checking, case research. Your data stays yours.

Q2 2026
💰

Finance

Risk assessment, market analysis, regulatory compliance. Zero cloud dependency.

Q2 2026
🔧

Engineering

Code generation, CAD analysis, technical documentation. Runs on your workstation.

Q2 2026
🎓

Education

Tutoring, curriculum generation, assessment. Works offline. Perfect for schools.

Q2 2026


Start in 30 seconds.

Clone. Build. Run. No signup. No API key. No cloud.


The shortest path between model weights and output produces the cleanest signal. Every buffer removed, every conversion eliminated — that is Inference-X.

Built in Morocco for the world.