hyperlite.org
  ___ ___                             .____    .__  __         
 /   |   \ ___.__.______   ___________|    |   |__|/  |_  ____  
/    ~    <   |  |\____ \_/ __ \_  __ \    |   |  \   __\/ __ \ 
\    Y    /\___  ||  |_> >  ___/|  | \/    |___|  ||  | \  ___/ 
 \___|_  / / ____||   __/ \___  >__|  |_______ \__||__|  \___  >
       \/  \/     |__|        \/              \/             \/ 
terminal-native · local-only · blazing fast · no cloud · no api key · no telemetry
Linux · macOS · Windows · RPi5
$ npm install -g hyperlite-ai
then run
$ hl
FEATURES
offline_first

Zero cloud

No API keys. No usage tracking. No network after the initial model download. Everything runs on your hardware — inference, search, memory, git operations.

performance

GPU acceleration built in

On Linux and WSL2, HyperLite detects your GPU at startup and routes inference through Ollama automatically — full CUDA on NVIDIA, ROCm on AMD. A 7B model on an RTX 4090 runs at 40–60 tok/s. On CPU the same model does 3–5. If Ollama isn't installed, HyperLite installs it. Context window size is derived from your actual VRAM — not a hardcoded default.

agentic

Tools built in

36 tools — files, shell, web, git, documents, system, RAG, memory. The model reads, writes, searches, and chains multi-step tasks. Every file write is intercepted and shown as a diff before anything touches disk.

git

Full git workflow

Read and write — status, log, diff, blame, add, commit, push, pull, branch, stash. When a push or pull fails because credentials aren't set up, a guided dialog walks you through creating a GitHub token and stores it via git's own credential system. Nothing is stored by HyperLite directly.

documents

Work with real files

Read PDFs, parse CSVs with per-column stats, scrape web pages into clean readable text, manage Markdown task lists, check what's using your ports, see what's consuming RAM. Tools that handle the things people actually run into.

sandbox

Isolated shell execution

Enable sandbox mode and shell commands run inside bubblewrap — the working directory is mounted read-write, everything else is isolated. Changes to home, tmp, and root don't persist. If bubblewrap isn't installed, HyperLite installs it.

rag

Codebase indexing

Index an entire repo with local embeddings. Relevant chunks are retrieved and injected before each message — semantic search over your whole project, fully offline.

memory

Persistent memory

Save facts the model carries across every session — preferences, project details, recurring patterns. Stored in SQLite, retrieved by semantic similarity at the start of each conversation.

persistence

Sessions that survive everything

Every conversation in SQLite. Switch models mid-session and HyperLite compacts the full history into a clean summary before handing off — the new model starts with accurate context, not a confused transcript.

IN ACTION
HyperLite chat interface
chat_interface

Talk to any local model — no daemon required

Tokens stream directly from the inference server. Hardware — CPU, RAM, GPU — visible in the sidebar. Model and provider shown at a glance. Agent tab in the command palette controls everything AI-related.

Switching models is Alt+P. Switching agents is Ctrl+A. On an RTX 4090 with a 14B model, expect 50+ tok/s. Switching models mid-conversation compacts the session history first so the new model starts clean.

CUDA · ROCm · Metal SSE streaming hardware detection multi-session
HyperLite model picker
model_picker

Download and manage models without leaving the terminal

Built-in model picker lists recommended GGUF models filtered to your hardware. Select and press Enter to download directly from HuggingFace CDN with a live progress bar.

Models saved to ~/.hyperlite/models/ and available immediately. SmolLM2 1.7B to Llama 3.3 70B.

HuggingFace CDN hardware filtering live progress Any Model
HyperLite command palette
command_palette

Everything keyboard driven — no mouse required

Press Ctrl+K for the command palette. Four tabs — Sessions, Agent, Display, Options. Agent tab is home to model switching, RAG indexing, memory, sandbox mode, and git context controls.

Tab between panels. Arrow keys navigate. Enter runs. Esc closes.

session management fork & compact Agent tab Ctrl+K
ARCHITECTURE
deployment

One binary. No runtime. No setup.

HyperLite ships as a single statically-linked binary — hl. No Python environment. No Node.js. No Docker. No daemon running in the background. Copy it to any machine and it runs.

The entire application — TUI, inference routing, tool execution, RAG, memory, streaming — is ~15 MB on disk.

inference

Works with any backend. Switching is free.

A unified provider layer speaks to Ollama, llama-server, LM Studio, Jan, GPT4All, KoboldCpp, LocalAI, vLLM, and TextGen WebUI through one interface. Switching backends is a model selection — nothing else changes. The same conversation continues, the same tools work, the same agents run.

When Ollama is present it takes priority for GPU acceleration. When it's not, HyperLite spawns llama-server with parameters derived from your actual hardware — VRAM, core count, architecture — not hardcoded defaults.

tool_system

The agentic loop runs on any model.

Native function calling (OpenAI tool-use format) requires specific model support. HyperLite's tool system doesn't. It parses <tool_call> XML blocks from any model's output in real-time during streaming, executes the tool, and feeds the result back.

The result: file reads, writes, shell execution, web search, git operations, and 36 other tools work with SmolLM 1.7B the same way they work with a 70B model. The model's capability determines quality. The architecture doesn't impose a ceiling.

writes

File writes require approval. By design.

Every file write is intercepted before hitting disk. The proposed change renders as a syntax-highlighted diff — green for additions, red for deletions — with the confirmation prompt appearing after all content so you read before you decide.

This isn't a setting. It's the architecture. The tool dispatcher routes write_file and edit_file through a pending diff queue before execution. Approve or discard. The AI never writes without explicit confirmation.

data

Local-first. All the way down.

No telemetry. No API keys required. No cloud. Every piece of the stack runs on your machine: inference on your GPU, semantic search via local ONNX embeddings indexed into SQLite, conversation history in SQLite — session-branching capable, never leaves your machine.

The only outbound requests are ones you explicitly trigger: web search, http_fetch, model downloads from HuggingFace. Everything else is air-gapped by default.

context

Context sized to your hardware. Not a default.

The context window is calculated at startup from real hardware detection — not a hardcoded value. A 24 GB GPU gets 32 768 tokens. A 10 GB card gets 16 384. The system reads what you have and configures accordingly, per-request.

When you switch models mid-conversation, HyperLite compacts the full history using the current model before handing off — a clean factual summary the new model can work from without confusion.

HYPERLITE-PI

A purpose-built variant for the Raspberry Pi 5 and ARM64 single-board computers. Stripped down and optimised — no RAG embedding overhead, no memory embedding model, no ONNX runtime. Just a native ARM64 binary and the fastest possible inference for the hardware.

Performance on Pi 5 16 GB · Q4_K_M · native ARM64 (no QEMU)
ModelParamsTokens/sec
SmolLM21.7B35–50
Qwen2.53B22–32
Phi-4 Mini3.8B18–28
Llama 3.23B20–30
Mistral7B10–14
Llama 3.18B9–13
Raspberry Pi 5 · ARM64
npm install -g @hyperlite-ai/hyperlite-pi
hl
compiles llama-server natively on first run (~15 min)
GGML_NATIVE=ON · NEON SIMD · Cortex-A76

Native ARM64 — no QEMU

The standard llamafile is an x86_64 binary. Running it on a Pi triggers QEMU emulation — 5–10× slower. HyperLite-PI compiles llama-server natively from source on first launch, targeting Cortex-A76 directly.

🧠

GGML_NATIVE=ON

Compiler auto-detects the CPU and enables every available instruction set — NEON SIMD, int8 dot product, hardware AES. All the gains from the silicon already in the Pi.

💾

KV cache quantisation + mlock

KV cache stored at Q8 instead of F16 — halves memory bandwidth per token. Model weights locked in RAM with --mlock — no page faults during inference.

🪶

Lightweight by design

No ONNX runtime, no fastembed, no embedding model download. RAG, persistent memory, and the git agent are intentionally excluded — a Pi needs every bit of RAM for the LLM, not infrastructure overhead.

TOOL SYSTEM
hyperlite --list-tools
36 tools · native function calling (OpenAI format) + tag-based XML (any model)
filesystem
read_file
batch_read
list_dir
tree
glob
grep
file_info
make_plan
create_dir
move_file
copy_file
append_file
write_file
edit_file
delete_file
shell
web
search
http_fetch
scrape_page
git — read
git_status
git_log
git_diff
git_blame
git — write
git_add
git_commit
git_push
git_pull
git_branch
git_stash
documents & system
read_pdf
analyze_csv
read_notes
write_note
system_status
check_ports
rag & memory
index_dir
search_index
clear_index
list_indexes
⚡ permission gate — tools that modify files or run commands show a diff or confirmation before executing. approve once · approve all · deny.   ◆ read-only — these tools never modify state without a separate write call.
BACKENDS

All backends probed concurrently at startup. Only reachable servers appear in the model picker. On Linux and WSL2, Ollama is detected first and used as the GPU inference path when available.

backendportformats
Ollama11434GGUF · GGML · SafeTensorsGPU preferred
Direct GGUF18080GGUF · GGMLauto-managed
llama.cpp8080GGUF · GGMLexternal
LM Studio1234GGUF · EXL2external
KoboldCpp5001GGUF · GGMLexternal
text-generation-webui5000GGUF · GPTQ · AWQ · EXL2 · SafeTensorsexternal
LocalAI8080GGUF · GPTQ · SafeTensors · ONNXexternal
vLLM8000SafeTensors · GPTQ · AWQ · EXL2external
Jan.ai1337GGUFexternal
GPT4All4891GGUFexternal
INSTALL
Linux x64 · macOS · Windows
npm install -g hyperlite-ai
hl
or: hyperlite
native binary selected automatically
Raspberry Pi 5 (ARM64)
npm install -g @hyperlite-ai/hyperlite-pi
hl
builds llama-server natively on first run
GGML_NATIVE=ON · NEON SIMD · Cortex-A76
requirements
Node.js 16+
4 GB RAM minimum · 8 GB recommended
internet for first model download only
models stored in ~/.hyperlite/models/
first launch
hardware detected — GPU, VRAM, core count
Ollama installed automatically if GPU found
models filtered to fit your hardware
download from HuggingFace with live progress
offline forever after that
KEYBINDINGS
input
Send messageEnter
New line in inputAlt+Enter
Paste clipboardCtrl+V
Copy last responseCtrl+C
Undo last messageCtrl+Z
RedoCtrl+Y
scroll
Scroll output↑ / ↓ or j / k
Half pageCtrl+D / Ctrl+U
Full pageCtrl+F / Ctrl+B
Jump between messages[ / ]
Top / bottomg / G
sessions
New sessionCtrl+N
Session listCtrl+S
Rename sessionCtrl+R
Close sessionCtrl+W
Stash draftCtrl+D
Open folderCtrl+O
model & agents
Model pickerAlt+P
Cycle modelAlt+M
Switch agentCtrl+A
display
Command paletteCtrl+K
Toggle sidebarAlt+\
Toggle reasoning blocksCtrl+T
Tool call detailsAlt+H
Cycle themeAlt+T
Help?
QuitCtrl+X