hyperlite.org
  ___ ___                             .____    .__  __         
 /   |   \ ___.__.______   ___________|    |   |__|/  |_  ____  
/    ~    <   |  |\____ \_/ __ \_  __ \    |   |  \   __\/ __ \ 
\    Y    /\___  ||  |_> >  ___/|  | \/    |___|  ||  | \  ___/ 
 \___|_  / / ____||   __/ \___  >__|  |_______ \__||__|  \___  >
       \/  \/     |__|        \/              \/             \/ 
terminal-native · local-only · blazing fast · no cloud · no api key · no telemetry
Linux · macOS · Windows · RPi5
$ npm install -g hyperlite-ai
then run
$ hl
FEATURES
offline_first

Zero cloud

No API keys. No usage tracking. No network after the initial model download. Everything runs on your hardware.

performance

Rust all the way down

TUI visible in milliseconds. Sessions load instantly. Streamed tokens hit the screen with no runtime overhead.

agentic

Tools built in

26 tools — files, shell, web, git, RAG, memory. The model reads, writes, searches, and chains multi-step tasks inside your terminal.

rag · new

Codebase indexing

Index an entire repo with local embeddings. Relevant chunks are retrieved and injected before each message — semantic search over your whole project, fully offline.
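The retrieve-then-inject flow described above can be sketched in a few lines. This is a minimal illustration, not HyperLite's implementation: the function names, the prompt layout, and the toy 3-dimensional vectors (standing in for real local embeddings) are all assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_vec, index, k=2):
    """Return the k chunk texts whose embeddings are closest to the query."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(message, chunks):
    """Prepend the retrieved chunks to the user message (hypothetical layout)."""
    context = "\n\n".join(chunks)
    return f"Relevant code:\n{context}\n\nUser: {message}"

# Toy index: (chunk text, embedding). Real embeddings have hundreds of dimensions.
INDEX = [
    ("fn open_db() -> Connection", [0.9, 0.1, 0.0]),
    ("fn render_markdown(src: &str)", [0.0, 0.2, 0.9]),
    ("fn run_migrations(db: &Connection)", [0.8, 0.3, 0.1]),
]

chunks = retrieve([1.0, 0.2, 0.0], INDEX)
print(build_prompt("How is the database opened?", chunks))
```

Only the two database-related chunks reach the prompt; the markdown chunk scores far lower and is left out.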

memory · new

Persistent memory

Save facts the model carries across every session — preferences, project details, recurring patterns. Stored in SQLite, retrieved by semantic similarity.

git_agent · new

Git-native agent

Opt-in git integration. Injects current branch, staged changes, diff stats, and unpushed commits into every prompt. Plus git_status, git_log, git_diff, and git_blame tools.
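As a rough sketch of how that context could be gathered, the snippet below shells out to standard git commands and formats the result as a prompt preamble. The helper names and the "[git context]" layout are hypothetical; only the git subcommands themselves are real.

```python
import subprocess

def git(*args, cwd="."):
    """Run a git command; return its stdout, or "" on any failure."""
    try:
        result = subprocess.run(["git", *args], cwd=cwd, capture_output=True, text=True)
    except OSError:  # git is not installed or cwd does not exist
        return ""
    return result.stdout.strip() if result.returncode == 0 else ""

def collect_git_context(cwd="."):
    """Gather the four pieces of state named above."""
    return {
        "branch": git("rev-parse", "--abbrev-ref", "HEAD", cwd=cwd),
        "staged": git("diff", "--cached", "--name-only", cwd=cwd),
        "diff_stats": git("diff", "--shortstat", cwd=cwd),
        "unpushed": git("log", "--oneline", "@{upstream}..HEAD", cwd=cwd),
    }

def format_git_context(ctx):
    """Render non-empty fields as a short preamble (hypothetical format)."""
    lines = [f"{key}: {value.replace(chr(10), ', ')}" for key, value in ctx.items() if value]
    return "[git context]\n" + "\n".join(lines) if lines else ""
```

Calling `format_git_context(collect_git_context())` inside a repository yields a few lines that can be prepended to the prompt; outside a repository every field is empty and nothing is injected.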

any_backend

Works with everything

llamafile (auto-managed), llama.cpp, Ollama, LM Studio, KoboldCpp, LocalAI, vLLM, Jan, text-generation-webui, GPT4All.

rendering

Rich TUI

Full markdown, syntax highlighting, collapsible reasoning blocks, fuzzy session search, and a reorganized command palette with Sessions / Agent / Display / Options tabs.

persistence

Session history

Every conversation in SQLite with WAL mode and 256 MB mmap. Instant load. Fuzzy search across all sessions.
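For readers unfamiliar with these settings, both are ordinary SQLite pragmas. The sketch below applies them with Python's standard sqlite3 module; the table schema and function name are assumptions made for illustration, not HyperLite's actual schema.

```python
import os
import sqlite3
import tempfile

def open_session_db(path):
    """Open a session database with WAL journaling and a 256 MB mmap window."""
    conn = sqlite3.connect(path)
    # WAL lets readers proceed while a writer appends to the log.
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    # Ask SQLite to memory-map up to 256 MB of the database file.
    conn.execute(f"PRAGMA mmap_size={256 * 1024 * 1024}")
    # Hypothetical schema, for illustration only.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        "id INTEGER PRIMARY KEY, session_id TEXT, role TEXT, content TEXT)"
    )
    return conn, mode

workdir = tempfile.mkdtemp()
conn, mode = open_session_db(os.path.join(workdir, "sessions.db"))
conn.execute("INSERT INTO messages (session_id, role, content) VALUES (?, ?, ?)",
             ("s1", "user", "hello"))
conn.commit()
print(mode)
conn.close()
```

WAL only applies to on-disk databases, which is why the example writes to a temporary file rather than using an in-memory database.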

IN ACTION
[screenshot: HyperLite chat interface]
chat_interface

Talk to any local model — no daemon required

Tokens stream directly from the inference server. Hardware — CPU, RAM, GPU — visible in the sidebar. Model and provider shown at a glance. Agent tab in the command palette controls everything AI-related.

Switching models is Ctrl+M. Switching agents is Ctrl+A. On an RTX 4090 with a 14B model, expect 50+ tok/s.

Direct GGUF · SSE streaming · hardware detection · multi-session
[screenshot: HyperLite model picker]
model_picker

Download and manage models without leaving the terminal

Built-in model picker lists recommended GGUF models filtered to your hardware. Select and press Enter to download directly from HuggingFace CDN with a live progress bar.

Models are saved to ~/.hyperlite/models/ and are available immediately. Recommended models range from SmolLM2 1.7B to Llama 3.3 70B.
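How the picker decides what fits is not documented here. One plausible approach, sketched below, estimates a quantised model's size from its parameter count and compares it with available RAM. The figure of about 4.85 bits per weight for Q4_K_M and the 2 GB headroom are assumptions, and the function names are hypothetical.

```python
def estimated_size_gb(params_billion, bits_per_weight=4.85):
    """Approximate file size in GB: parameters x bits per weight / 8."""
    return params_billion * bits_per_weight / 8

def models_that_fit(catalog, ram_gb, headroom_gb=2.0):
    """Keep models whose estimated size plus headroom fits in RAM."""
    return [name for name, params in catalog
            if estimated_size_gb(params) + headroom_gb <= ram_gb]

# Parameter counts taken from the models named on this page.
CATALOG = [
    ("SmolLM2", 1.7),
    ("Qwen2.5", 3.0),
    ("Mistral", 7.0),
    ("Llama 3.1", 8.0),
    ("Llama 3.3", 70.0),
]

print(models_that_fit(CATALOG, ram_gb=8))
# prints: ['SmolLM2', 'Qwen2.5', 'Mistral', 'Llama 3.1']
```

A real filter would also need to account for context length and GPU memory, which this estimate ignores.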

HuggingFace CDN · hardware filtering · live progress · Any Model
[screenshot: HyperLite command palette]
command_palette

Everything keyboard driven — no mouse required

Press Ctrl+K for the command palette. Four tabs — Sessions, Agent, Display, Options. Agent tab is home to model switching, RAG indexing, memory, and git context controls.

Tab between panels. Arrow keys navigate. Enter runs. Esc closes.

session management · fork & compact · Agent tab · Ctrl+K
HYPERLITE-PI

A purpose-built variant for the Raspberry Pi 5 and ARM64 single-board computers. Stripped down and optimised — no RAG embedding overhead, no memory embedding model, no ONNX runtime. Just a native ARM64 binary and the fastest possible inference for the hardware.

Performance on Pi 5 16 GB · Q4_K_M · native ARM64 (no QEMU)
Model        Params   Tokens/sec
SmolLM2      1.7B     35–50
Qwen2.5      3B       22–32
Phi-4 Mini   3.8B     18–28
Llama 3.2    3B       20–30
Mistral      7B       10–14
Llama 3.1    8B       9–13
Raspberry Pi 5 · ARM64
npm install -g @hyperlite-ai/hyperlite-pi
hl
compiles llama-server natively on first run (~15 min)
GGML_NATIVE=ON · NEON SIMD · Cortex-A76

Native ARM64 — no QEMU

The standard llamafile is an x86_64 binary. Running it on a Pi triggers QEMU emulation — 5–10× slower. HyperLite-PI compiles llama-server natively from source on first launch, targeting Cortex-A76 directly.

🧠

GGML_NATIVE=ON

Compiler auto-detects the CPU and enables every available instruction set — NEON SIMD, int8 dot product, hardware AES. All the gains from the silicon already in the Pi.

💾

KV cache quantisation + mlock

KV cache stored at Q8 instead of F16 — halves memory bandwidth per token. Model weights locked in RAM with --mlock — no page faults during inference.
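The "halves" claim can be checked with a little arithmetic. The sketch below assumes "Q8" means llama.cpp's q8_0 layout (32 int8 values plus one 16-bit scale per block, so 34 bytes per 32 elements) and uses the Llama 3.1 8B shape with an 8192-token context; the context length is an arbitrary choice for illustration.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem):
    """Total KV cache size; the factor 2 covers one K and one V tensor per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem

F16 = 2.0        # 16 bits per element
Q8_0 = 34 / 32   # 32 int8 values plus one 16-bit scale per block

# Llama 3.1 8B: 32 layers, 8 KV heads, head dimension 128.
f16 = kv_cache_bytes(32, 8, 128, 8192, F16)
q8 = kv_cache_bytes(32, 8, 128, 8192, Q8_0)

print(f"F16: {f16 / 2**20:.0f} MiB, Q8_0: {q8 / 2**20:.0f} MiB, ratio {q8 / f16:.2f}")
# prints: F16: 1024 MiB, Q8_0: 544 MiB, ratio 0.53
```

Under these assumptions the cache shrinks to about 53% of its F16 size, so "halves" is a fair approximation; the per-block scale accounts for the extra 3%.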

🪶

Lightweight by design

No ONNX runtime, no fastembed, no embedding model download. RAG, persistent memory, and the git agent are intentionally excluded — a Pi needs every bit of RAM for the LLM, not infrastructure overhead.

TOOL SYSTEM
hyperlite --list-tools
26 tools · native function calling (OpenAI format) + tag-based XML (any model)
filesystem
read_file
batch_read
list_dir
tree
glob
grep
file_info
make_plan
create_dir
move_file
copy_file
append_file
write_file
edit_file
delete_file
shell
web
search
http_fetch
git · new
git_status
git_log
git_diff
git_blame
rag · new
index_dir
search_index
clear_index
list_indexes
⚡ permission gate: tools that modify files or run commands show a confirmation dialog. approve once · approve all · deny.
◆ new: git and RAG tools added in v0.2.27.
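The three choices in the permission gate map naturally onto a small state machine. The sketch below is one way it might work, not the actual implementation: the set of tools treated as modifying, the class and method names, and the assumption that "approve all" lasts for the rest of the session are all hypothetical.

```python
from enum import Enum

class Decision(Enum):
    APPROVE_ONCE = "approve once"
    APPROVE_ALL = "approve all"
    DENY = "deny"

# Hypothetical split: which tools count as modifying is an assumption here.
MUTATING_TOOLS = {"write_file", "edit_file", "append_file", "delete_file",
                  "move_file", "copy_file", "create_dir", "shell"}

class PermissionGate:
    def __init__(self, ask):
        self.ask = ask            # callback: tool name -> Decision
        self.approve_all = False  # set once the user picks "approve all"

    def allows(self, tool):
        """Return True if the tool call may proceed."""
        if tool not in MUTATING_TOOLS or self.approve_all:
            return True
        decision = self.ask(tool)
        if decision is Decision.APPROVE_ALL:
            self.approve_all = True
        return decision is not Decision.DENY

# Usage: a scripted "user" who denies, approves once, then approves all.
answers = iter([Decision.DENY, Decision.APPROVE_ONCE, Decision.APPROVE_ALL])
gate = PermissionGate(lambda tool: next(answers))
print([gate.allows(t) for t in ["read_file", "write_file", "write_file", "shell", "delete_file"]])
```

Read-only tools never trigger the dialog, and after "approve all" the callback is no longer consulted.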
BACKENDS

All backends probed concurrently at startup. Only reachable servers appear in the model picker.

backend                  port    formats                                     mode
Direct GGUF              18080   GGUF · GGML                                 auto-managed
Ollama · new             11434   GGUF · GGML · SafeTensors                   external
llama.cpp                8080    GGUF · GGML                                 external
LM Studio                1234    GGUF · EXL2                                 external
KoboldCpp                5001    GGUF · GGML                                 external
text-generation-webui    5000    GGUF · GPTQ · AWQ · EXL2 · SafeTensors      external
LocalAI                  8080    GGUF · GPTQ · SafeTensors · ONNX            external
vLLM                     8000    SafeTensors · GPTQ · AWQ · EXL2             external
Jan.ai                   1337    GGUF                                        external
GPT4All                  4891    GGUF                                        external
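Concurrent probing of the ports in the table above might look like the following sketch. It only checks whether a TCP connection succeeds on localhost; the host, timeout, and function names are assumptions, and a real probe would presumably also query each server's API, since llama.cpp and LocalAI share port 8080 and cannot be told apart by port alone.

```python
import socket
from concurrent.futures import ThreadPoolExecutor

# Default ports copied from the table above.
BACKENDS = {
    "Direct GGUF": 18080, "Ollama": 11434, "llama.cpp": 8080,
    "LM Studio": 1234, "KoboldCpp": 5001, "text-generation-webui": 5000,
    "LocalAI": 8080, "vLLM": 8000, "Jan.ai": 1337, "GPT4All": 4891,
}

def is_reachable(port, host="127.0.0.1", timeout=0.5):
    """True if something accepts a TCP connection on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def probe(backends, host="127.0.0.1"):
    """Check every backend in parallel; return the names that responded."""
    with ThreadPoolExecutor(max_workers=max(1, len(backends))) as pool:
        results = list(pool.map(lambda port: is_reachable(port, host), backends.values()))
    return [name for name, ok in zip(backends, results) if ok]

if __name__ == "__main__":
    print(probe(BACKENDS))
```

Because every check runs in its own thread, the total wait is bounded by a single timeout rather than the sum of all ten.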
INSTALL
Linux x64 · macOS · Windows
npm install -g hyperlite-ai
hl
or: hyperlite
native binary selected automatically
Raspberry Pi 5 (ARM64)
npm install -g @hyperlite-ai/hyperlite-pi
hl
builds llama-server natively on first run
GGML_NATIVE=ON · NEON SIMD · Cortex-A76
requirements
Node.js 16+
4 GB RAM minimum · 8 GB recommended
internet for first model download only
models stored in ~/.hyperlite/models/
first launch
setup wizard opens
hardware detected — models filtered to fit
download models from HuggingFace
runtime downloaded automatically
offline forever after that
KEYBINDINGS
Send message          Enter
Insert newline        Alt+Enter
Model picker          Ctrl+M
Command palette       Ctrl+K
New session           Ctrl+N
Session list          Ctrl+S
Agent picker          Ctrl+A
Toggle sidebar        Ctrl+\
Interrupt             Ctrl+C
Copy last response    Ctrl+L
Scroll messages       ↑ / ↓
Quit                  Ctrl+Q