AI inference where you need it.

Models too big for your hardware. Remote sites with no cloud. We make it run.

[ Platform ]

One platform. Every environment.

One install. Any hardware. Any network. We show up and make it work.

Runs on any hardware

GPU, CPU, TPU, or mixed. Runtime probes the box and configures itself for what is there. A CPU-only field server, a single Jetson at a ground station, or a rack of H100s in a SCIF. Same install, same API.

Deploys to any environment

Cloud, on-prem, edge, air-gapped, or fully disconnected. Install over a clean network or an empty one. Same Runtime. Same Hub. Same API.

Stays on your side of the wire

Your data never leaves your network. Zero egress on Pro and Enterprise. No metered tokens. Prompts, responses, weights, and traces stay on the hardware you installed it on, from a ground station on a disconnected network to a SCIF behind an air gap.


[ Runtime ]

Runtime manages the model on your hardware.

Probes the box, picks the engine, tiers memory across what you have, and serves an OpenAI-compatible API. One install. Any hardware.

Tiered memory orchestration

Model weights and cache move across the memory tiers on your machine: GPU VRAM, system RAM, and disk. Large models run on hardware that would not normally hold them.
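
To make the idea concrete, here is a minimal sketch of greedy tier placement, fastest tier first. It is illustrative only: the capacities, layer sizes, and the place_layers helper are assumptions for the example, not Runtime's actual policy.

# Illustrative sketch, not Runtime's implementation.
# Greedy placement: each layer goes to the fastest tier with room left.
TIERS = [("vram", 24.0), ("ram", 64.0), ("ssd", 512.0)]  # GB, fastest first

def place_layers(layer_sizes_gb):
    free = dict(TIERS)
    placement = {}
    for i, size in enumerate(layer_sizes_gb):
        for tier, _ in TIERS:
            if free[tier] >= size:
                free[tier] -= size
                placement[i] = tier
                break
        else:
            raise MemoryError(f"layer {i} fits in no tier")
    return placement

# Roughly a 4-bit 70B-class model: 80 layers of about 0.5 GB each.
print(place_layers([0.5] * 80))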

Picks the engine for you

Runtime wraps llama.cpp, vLLM, TensorRT-LLM, and whatever ships next. You pick the model. Runtime picks the engine. When a faster one arrives, you inherit it without rewriting a single line.
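
For a sense of what "Runtime picks the engine" means, here is one way a selector could look. The probe fields, model fields, and the rule itself are hypothetical stand-ins for this sketch, not Runtime's real heuristics.

from dataclasses import dataclass

@dataclass
class Probe:
    cuda: bool
    vram_gb: float

@dataclass
class Model:
    fmt: str           # e.g. "gguf" or "safetensors"
    min_vram_gb: float

def pick_engine(probe: Probe, model: Model) -> str:
    # Hypothetical rule of thumb; Runtime's actual selection is not shown here.
    if model.fmt == "gguf":
        return "llama.cpp"  # GGUF loads on CPU or GPU
    if probe.cuda and probe.vram_gb >= model.min_vram_gb:
        return "vLLM"       # fits in VRAM: GPU-resident serving
    return "llama.cpp"      # fall back to CPU / offload

print(pick_engine(Probe(cuda=True, vram_gb=24.0), Model(fmt="safetensors", min_vram_gb=16.0)))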

OpenAI-compatible API

Drop-in for the OpenAI endpoint. Point existing software at a local URL instead of api.openai.com. Embeddings, classification, extraction, retrieval, tool calls, and yes, chat. The model runs where the data is.
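
In practice the switch is one line. A minimal sketch using the OpenAI Python SDK, assuming the endpoint shown in the demo below; the api_key value is a placeholder because the SDK requires one, not because Runtime meters tokens.

from openai import OpenAI

# Point the stock OpenAI client at the local Runtime endpoint.
client = OpenAI(base_url="http://localhost:8088/v1", api_key="placeholder")

resp = client.chat.completions.create(
    model="Llama-3-70B-Q4_K_M",
    messages=[{"role": "user", "content": "Summarize the last shift log."}],
)
print(resp.choices[0].message.content)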

s88 serve --model Llama-3-70B-Q4_K_M

INITIALIZING
Model               Llama-3-70B-Q4_K_M (GGUF Q4_K_M, 70B params) detected
Backend selection   Auto: llama.cpp | vLLM | TensorRT-LLM | Triton
Memory hierarchy    PASS
  VRAM (Tier 1)       16.8 / 24 GB
  RAM (Tier 2)        42.3 / 64 GB
  SSD cache (Tier 3)   128 / 512 GB

SERVING
Endpoint            localhost:8088/v1/chat/completions
Throughput          7.8 tok/s
Latency             118 ms
OOM events          0
Uptime              0s

Sector88 Hub    Live
Nodes 4    Serving 3    Fleet uptime 99.9%    OOM events 0

Active Deployments

ground-station-08   Svalbard, Norway      Llama-3-70B   llama.cpp      VRAM 16.8/24 GB   7.8 tok/s    22d up
ops-center-03       Edwards AFB, CA       Mistral-7B    vLLM           VRAM 5.2/16 GB    24.1 tok/s   8d up
rig-platform-11     North Sea, Offshore   Llama-3-8B    llama.cpp      VRAM 6.1/8 GB     18.6 tok/s   45d up
datacenter-sg-02    Singapore, APAC       Qwen2-72B     TensorRT-LLM   VRAM --/--        -- tok/s     0s up (warming)

Activity

2m ago     Model Llama-3-70B serving on ground-station-08
5m ago     Preflight passed on datacenter-sg-02. Loading Qwen2-72B.
18m ago    Tier swap on rig-platform-11. 2 layers RAM → VRAM.

[ Hub ]

Hub operates the fleet from one place.

One control plane for every Runtime. Deploy, monitor, and manage across your entire fleet from a single interface. No SSH. No spreadsheets.

Live fleet view

Every node, every model, every region. GPU, memory, throughput, latency, and power, refreshed every second. Thirty days of history by default.

Deploy, hot-swap, rollback

Push a model to one machine or the whole fleet. Canary nodes, health checks, automatic rollback on failure. No SSH into individual machines.

Every action logged, nothing stored

Every deploy, rollback, and policy change logged against the user who did it. Prompt and response content is never captured. Exports for your security team, not ours.

[ Engineers ]

Engineers who come to your site.

When a site needs hands-on help, our forward-deployed engineers audit the hardware, install the platform, benchmark on site, and harden it for production. We embed, ship, and leave when it runs.

Audit and install

Remote or on-site review of the hardware, network, and constraints. The deployment plan is written and signed off before anything is installed.

Benchmarks on your hardware

Throughput, latency, and cost measured on your actual hardware. Exportable scorecards. Production-grade proof.

Hardened for your environment

Air-gapped, classified, regulated. Supervisor restart, secrets management, and network posture locked in for your security regime before we leave.

Sector88 Hub    ENG-2847

Svalbard Deployment    In Progress
Site: Svalbard, Norway    Hardware: Jetson AGX Orin 64GB    Network: Disconnected

Deployment progress: Phase 1 of 5
Audit → Install → Benchmark → Harden → Live

Hardware Audit (Phase 1)
Hardware detection    Scanning...
Memory probe          Probing...
Network policy        Checking...

[ Capabilities ]

What the platform does.

Single-node and developer use is open. Fleet operations, identity, air-gapped postures, and forward-deployed install are scoped per deployment with our team.

Talk to the team

Fig 1.1

Offline and air-gapped by default

Zero outbound calls. No license pings. No phone-home. Install over any medium, run on an empty network, and keep running when the satcom link drops.

Fig 1.2

Fleet control plane

Deploy, monitor, hot-swap, and roll back every node from one place. Canary to the fleet, rollback on failure. No SSH into individual machines.

Fig 1.3

Regulated and forward-deployed

ITAR-aware. Deployable into classified facilities and sovereign postures. Engineers install and harden on site, inside your perimeter, to your security regime.

Fig 1.4

OpenAI-compatible API

Drop-in for the OpenAI endpoint. Point existing software at a local URL instead of api.openai.com. Embeddings, classification, extraction, retrieval, tool calls, and chat.

Fig 1.5

Identity and audit

SAML and OIDC out of the box. Hub roles map to your IdP groups. Every action is logged. Prompt and response content is never collected.

Fig 1.6

Tiered memory orchestration

Weights and cache move across the memory tiers you have: GPU VRAM, system RAM, and disk. Large models run on hardware that would not normally hold them.

Hardware Agnostic

Any GPU, any backend, any model, anywhere.

Hardware Platforms

NVIDIA CUDA (popular)
AMD ROCm
Intel Gaudi / Xeon
Google TPU
Qualcomm AI
Apple Silicon
CPU Servers

Inference Backends

PyTorch (Supported): Native inference
vLLM (Supported): PagedAttention optimization
llama.cpp (Supported): GGUF models, CPU/GPU
TensorRT-LLM (Roadmap): NVIDIA optimization
Triton (Roadmap): NVIDIA inference server
Ollama (Roadmap): Developer tooling

Run it on your own hardware.

Bring in our forward-deployed engineers, or install it yourself. Either way it runs on your hardware, in your network, on your terms.