HyperLite — terminal-native local AI

FEATURES

offline_first

Zero cloud

No API keys. No usage tracking. No network after the initial model download. Everything runs on your hardware — inference, search, memory, git operations.

performance

GPU acceleration built in

On Linux and WSL2, HyperLite detects your GPU at startup and routes inference through Ollama automatically — full CUDA on NVIDIA, ROCm on AMD. A 7B model on an RTX 4090 runs at 40–60 tok/s. On CPU the same model does 3–5. If Ollama isn't installed, HyperLite installs it. Context window size is derived from your actual VRAM — not a hardcoded default.

agentic

Tools built in

39 tools — files, shell, web, git, documents, system, RAG, memory. The model reads, writes, searches, and chains multi-step tasks. Every file write is intercepted and shown as a diff before anything touches disk.

git

Full git workflow

Read and write — status, log, diff, blame, add, commit, push, pull, branch, stash. When a push or pull fails because credentials aren't set up, a guided dialog walks you through creating a GitHub token and stores it via git's own credential system. Nothing is stored by HyperLite directly.

documents

Work with real files

Read PDFs, parse CSVs with per-column stats, scrape web pages into clean readable text, manage Markdown task lists, check what's using your ports, see what's consuming RAM. Tools that handle the things people actually run into.

sandbox

Isolated shell execution

Enable sandbox mode and shell commands run inside bubblewrap — the working directory is mounted read-write, everything else is isolated. Changes to home, tmp, and root don't persist. If bubblewrap isn't installed, HyperLite installs it.

rag

Codebase indexing

Index an entire repo with local embeddings. Relevant chunks are retrieved and injected before each message — semantic search over your whole project, fully offline.

memory

Persistent memory

Save facts the model carries across every session — preferences, project details, recurring patterns. Stored in SQLite, retrieved by semantic similarity at the start of each conversation.

persistence

Sessions that survive everything

Every conversation in SQLite. Switch models mid-session and HyperLite compacts the full history into a clean summary before handing off — the new model starts with accurate context, not a confused transcript.

IN ACTION

●

HyperLite chat with the model calling write_file, the sidebar tracking a +7 / -0 change

chat_interface

Talk to any local model — no daemon required

Tokens stream straight from the inference server, and tool calls run inline — here the model wrote a file and the sidebar tracks the change as +7 −0. Hardware (CPU, RAM, GPU), model, and provider stay visible at a glance.

Switching models is Alt+P; switching agents is Ctrl+A. Change models mid-conversation and HyperLite compacts the history first, so the new model starts with clean context.

CUDA · ROCm · Metal SSE streaming hardware detection multi-session

●

model_picker

Browse and download models, sorted to your hardware

The picker ships with 30 curated GGUF models in three tiers — SBC for Raspberry Pi and CPU boxes, MID for 8–16 GB GPUs, and HIGH for 24–32 GB cards. Each row shows its size and quantization, and the list is filtered to what your machine can actually run.

Pick one, press Enter, and it pulls from the HuggingFace CDN with a live progress bar into ~/.hyperlite/models/ — everything from Qwen3 0.6B up to Llama 3.3 70B, including vision and dedicated reasoning models.

SBC · MID · HIGH hardware-filtered quant labels vision + reasoning

●

HyperLite session browser with searchable, timestamped sessions

session_history

Sessions persist — searchable, forkable, resumable

Every conversation is saved to a local SQLite database. Reopen the browser with Ctrl+S to search by title and jump back into any session — nothing is lost when you quit.

Fork a session to branch an idea without losing the original, or compact a long history into a clean summary to free up context.

SQLite fork & compact searchable Ctrl+S

MODEL CATALOG

30 models are built into the downloader, grouped by the hardware they're meant for. Sizes shown are the Q4_K_M download unless the quant says otherwise. The picker filters the list to what fits your machine and labels each model with its quantization — pick one, press Enter, and it pulls straight from HuggingFace.

SBC — Raspberry Pi · mini-PCs · CPU · laptops

Model	Size	Quant	Good at
Qwen3 0.6B	0.6 GB	Q8_0	tiniest capable · hybrid reasoning
Gemma 3 1B	0.8 GB	Q4_K_M	quick chat on edge devices
Llama 3.2 1B	0.8 GB	Q4_K_M	fast on CPU
SmolLM2 1.7B	1.0 GB	Q4_K_M	runs anywhere
Qwen3 1.7B	1.8 GB	Q8_0	compact reasoning
SmolLM3 3B	1.9 GB	Q4_K_M	fully open · dual reasoning
Llama 3.2 3B	2.0 GB	Q4_K_M	solid all-rounder
Qwen2.5 3B	2.0 GB	Q4_K_M	multilingual
Gemma 3 4B	2.5 GB	Q4_K_M	text + image
Qwen3 4B	2.5 GB	Q4_K_M	reasoning + tools
Phi-4 Mini 3.8B	2.5 GB	Q4_K_M	reasoning per parameter

MID — 8–16 GB GPUs · RTX 3060 / 4060 / 4070 / 4080

Model	Size	Quant	Good at
Mistral 7B	4.1 GB	Q4_K_M	fast general-purpose
Qwen2.5-Coder 7B	4.7 GB	Q4_K_M	code
Llama 3.1 8B	4.7 GB	Q4_K_M	instruction following
Qwen3 8B	5.0 GB	Q4_K_M	reasoning · 100+ languages
Gemma 3 12B	7.3 GB	Q4_K_M	text + image · writing
Qwen3 14B	9.0 GB	Q4_K_M	reasoning
Qwen2.5 14B	9.0 GB	Q4_K_M	balanced all-rounder
Qwen2.5-Coder 14B	9.0 GB	Q4_K_M	code
DeepSeek-R1 14B	9.0 GB	Q4_K_M	chain-of-thought
gpt-oss 20B	12.1 GB	MXFP4	OpenAI open MoE · agentic
Mistral Small 3.2 24B	14.3 GB	Q4_K_M	vision + tools
Devstral 24B	14.3 GB	Q4_K_M	agentic coding
Qwen3-Coder 30B	18.6 GB	Q4_K_M	MoE coder · 3B active

HIGH — 24–32 GB GPUs · RTX 3090 / 4090 / 5080 / 5090

Model	Size	Quant	Good at
Gemma 3 27B	16.5 GB	Q4_K_M	multimodal · best dense on a 24 GB card
Qwen3 32B	19.8 GB	Q4_K_M	flagship reasoning
Qwen2.5-Coder 32B	19.9 GB	Q4_K_M	top open coder
Qwen2.5 32B	20.0 GB	Q4_K_M	general
DeepSeek-R1 32B	20.0 GB	Q4_K_M	reasoning
Llama 3.3 70B	43.0 GB	Q4_K_M	near-frontier · needs 40 GB+ VRAM or big RAM

Not limited to this list — point HyperLite at any GGUF file in ~/.hyperlite/models/, or connect an existing Ollama / LM Studio / llama.cpp server and use whatever you already have.

REVIEW & APPROVE

●

HyperLite showing a syntax-highlighted diff of a new file with apply / skip / apply-all controls

file_writes

Nothing touches disk until you say so

When the model wants to write or edit a file, HyperLite stops and shows the change as a syntax-highlighted diff — additions in green, deletions in red. You read it, then apply, skip, or apply everything at once.

Multi-file edits open a review list first, so you can step through each file before committing to any of them. It runs through the tool dispatcher itself — not a setting you can forget to switch on.

syntax-highlighted per-file review apply · skip · apply all

ARCHITECTURE

deployment

One binary. No runtime. No setup.

HyperLite ships as a single statically-linked binary — hl. No Python environment. No Node.js. No Docker. No daemon running in the background. Copy it to any machine and it runs.

The entire application — TUI, inference routing, tool execution, RAG, memory, streaming — is ~35 MB on disk.

inference

Works with any backend. Switching is free.

A unified provider layer speaks to Ollama, llama-server, LM Studio, Jan, GPT4All, KoboldCpp, LocalAI, vLLM, and TextGen WebUI through one interface. Switching backends is a model selection — nothing else changes. The same conversation continues, the same tools work, the same agents run.

When Ollama is present it takes priority for GPU acceleration. When it's not, HyperLite spawns llama-server with parameters derived from your actual hardware — VRAM, core count, architecture — not hardcoded defaults.

tool_system

The agentic loop runs on any model.

Native function calling (OpenAI tool-use format) requires specific model support. HyperLite's tool system doesn't. It parses <tool_call> XML blocks from any model's output in real-time during streaming, executes the tool, and feeds the result back.

The result: file reads, writes, shell execution, web search, git operations — all 39 tools — work with SmolLM 1.7B the same way they work with a 70B model. The model's capability determines quality. The architecture doesn't impose a ceiling.

writes

File writes require approval. By design.

Every file write is intercepted before hitting disk. The proposed change renders as a syntax-highlighted diff — green for additions, red for deletions — with the confirmation prompt appearing after all content so you read before you decide.

This isn't a setting. It's the architecture. The tool dispatcher routes write_file and edit_file through a pending diff queue before execution. Approve or discard. The AI never writes without explicit confirmation.

data

Local-first. All the way down.

No telemetry. No API keys required. No cloud. Every piece of the stack runs on your machine: inference on your GPU, semantic search via local ONNX embeddings indexed into SQLite, conversation history in SQLite — session-branching capable, never leaves your machine.

The only outbound requests are ones you explicitly trigger: web search, http_fetch, model downloads from HuggingFace. Everything else is air-gapped by default.

context

Context sized to your hardware. Not a default.

The context window is calculated at startup from real hardware detection — not a hardcoded value. A 24 GB GPU gets 32 768 tokens. A 10 GB card gets 16 384. The system reads what you have and configures accordingly, per-request.

When you switch models mid-conversation, HyperLite compacts the full history using the current model before handing off — a clean factual summary the new model can work from without confusion.

COMMAND PALETTE

One shortcut — Ctrl+K — opens everything. Four tabs cover your sessions, the AI agent and its tools, display options, and editor actions. Tab between them, arrow keys to navigate, Enter to run. No menus to hunt through, no mouse required.

● Sessions — new · fork · compact · drafts

● Agent — model · git · sandbox · RAG · memory

● Display — theme · sidebar · thinking · tools

● Options — editor · copy · undo · help

THEMES

●

HyperLite theme picker with a live color-swatch preview

appearance

21 built-in themes, with a live preview

Cycle with Alt+T, or pick from the palette and watch the color swatches update before you commit. Panels, diffs, and syntax highlighting all follow the active theme.

cyberpunkdraculatokyonight catppuccinnordgruvbox monokaione-darksynthwave84 matrixrosepineeverforest solarizedkanagawavesper aurapalenightnightowl kawaiigothsepia

AGENTS

●

agents

Pick how the AI behaves — or define your own

Three agents ship in the box. General is a conversational assistant with full tool access. Build is a coding agent that reads, writes, and runs shell commands. Plan is read-only — it explores and searches your code but never writes or executes.

Need something specific? Press Ctrl+A → New Agent to create one with its own system prompt and a restricted set of tools.

General Build Plan custom agents

FIRST RUN

●

setup

One check on first launch, then you're offline

The first time you run hl, HyperLite takes stock of your machine — GPU, VRAM, RAM — counts any models you already have, mounts the local session database, and confirms no uplink is required.

If you're missing an inference runtime it installs one, then offers a set of models filtered to your hardware. After that it opens straight to chat and runs fully offline.

hardware detection auto-installs runtime offline after setup

HYPERLITE-PI

A purpose-built variant for the Raspberry Pi 5 and ARM64 single-board computers. Stripped down and optimised — no RAG embedding overhead, no memory embedding model, no ONNX runtime. Just a native ARM64 binary and the fastest possible inference for the hardware.

Performance on Pi 5 16 GB · Q4_K_M · native ARM64 (no QEMU)

Model	Params	Tokens/sec
SmolLM2	1.7B	35–50
Qwen2.5	3B	22–32
Phi-4 Mini	3.8B	18–28
Llama 3.2	3B	20–30
Mistral	7B	10–14
Llama 3.1	8B	9–13

● Raspberry Pi 5 · ARM64

npm install -g @hyperlite-ai/hyperlite-pi

hl

compiles llama-server natively on first run (~15 min)

GGML_NATIVE=ON · NEON SIMD · Cortex-A76

⚡

Native ARM64 — no QEMU

The standard llamafile is an x86_64 binary. Running it on a Pi triggers QEMU emulation — 5–10× slower. HyperLite-PI compiles llama-server natively from source on first launch, targeting Cortex-A76 directly.

🧠

GGML_NATIVE=ON

Compiler auto-detects the CPU and enables every available instruction set — NEON SIMD, int8 dot product, hardware AES. All the gains from the silicon already in the Pi.

💾

KV cache quantisation + mlock

KV cache stored at Q8 instead of F16 — halves memory bandwidth per token. Model weights locked in RAM with --mlock — no page faults during inference.

🪶

Lightweight by design

No ONNX runtime, no fastembed, no embedding model download. RAG, persistent memory, and the git agent are intentionally excluded — a Pi needs every bit of RAM for the LLM, not infrastructure overhead.

TOOL SYSTEM

hyperlite --list-tools

39 tools · native function calling (OpenAI format) + tag-based XML (any model)

filesystem

read_file

batch_read

list_dir

tree

glob

grep

file_info

make_plan

create_dir

move_file

copy_file

append_file

write_file

edit_file

delete_file

shell

web

search

http_fetch

scrape_page

git — read

git_status

git_log

git_diff

git_blame

git — write

git_add

git_commit

git_push

git_pull

git_branch

git_stash

documents & system

read_pdf

analyze_csv

read_notes

write_note

system_status

check_ports

rag & memory

index_dir

search_index

clear_index

list_indexes

⚡ permission gate — tools that modify files or run commands show a diff or confirmation before executing. approve once · approve all · deny. ◆ read-only — these tools never modify state without a separate write call.

BACKENDS

All backends probed concurrently at startup. Only reachable servers appear in the model picker. On Linux and WSL2, Ollama is detected first and used as the GPU inference path when available.

backend	port	formats
Ollama	11434	GGUF · GGML · SafeTensors	GPU preferred
Direct GGUF	18080	GGUF · GGML	auto-managed
llama.cpp	8080	GGUF · GGML	external
LM Studio	1234	GGUF · EXL2	external
KoboldCpp	5001	GGUF · GGML	external
text-generation-webui	5000	GGUF · GPTQ · AWQ · EXL2 · SafeTensors	external
LocalAI	8080	GGUF · GPTQ · SafeTensors · ONNX	external
vLLM	8000	SafeTensors · GPTQ · AWQ · EXL2	external
Jan.ai	1337	GGUF	external
GPT4All	4891	GGUF	external

INSTALL

● Linux x64 · macOS · Windows

npm install -g hyperlite-ai

hl

or: hyperlite

native binary selected automatically

● Raspberry Pi 5 (ARM64)

npm install -g @hyperlite-ai/hyperlite-pi

hl

builds llama-server natively on first run

GGML_NATIVE=ON · NEON SIMD · Cortex-A76

● requirements

Node.js 16+

4 GB RAM minimum · 8 GB recommended

internet for first model download only

models stored in ~/.hyperlite/models/

● first launch

hardware detected — GPU, VRAM, core count

Ollama installed automatically if GPU found

models filtered to fit your hardware

download from HuggingFace with live progress

offline forever after that

KEYBINDINGS

input

Send messageEnter

New line in inputAlt+Enter

Paste clipboardCtrl+V

Copy last responseCtrl+C

Undo last messageCtrl+Z

RedoCtrl+Y

scroll

Scroll output↑ / ↓ or j / k

Half pageCtrl+D / Ctrl+U

Full pageCtrl+F / Ctrl+B

Jump between messages[ / ]

Top / bottomg / G

sessions

New sessionCtrl+N

Session listCtrl+S

Rename sessionCtrl+R

Close sessionCtrl+W

Stash draftCtrl+D

Open folderCtrl+O

model & agents

Model pickerAlt+P

Cycle modelAlt+M

Switch agentCtrl+A

display

Command paletteCtrl+K

Toggle sidebarAlt+\

Toggle reasoning blocksCtrl+T

Tool call detailsAlt+H

Cycle themeAlt+T

Help?

QuitCtrl+X

IN PROGRESS

A look at what's actively being built — not finished yet, shown here so you know where it's headed.

PenTest mode IN DEVELOPMENT

A mode for authorized security testing. It's gated behind an explicit authorization step, checks which security tools are already installed, and offers to install what's missing — then lets the assistant help drive the assessment instead of you wiring tools together by hand.

● authorization gate

PenTest mode authorization screen requiring explicit confirmation of authorized use

● tool pre-flight check

PenTest pre-flight check listing detected tools and workflow readiness

The goal is a set of orchestrated workflows the assistant can run end to end:

Network Discovery Web Application Scan SMB Assessment Credential Attack OSINT Recon Vulnerability Scan Full Engagement

For authorized penetration testing and security research only — every target requires documented permission. Which workflows are available depends on the tools installed on your system. This mode is still in active development.