Mimical-Sovereign

01 Hardware Cluster — Physical Architecture

MIMICAL-1 — PRIMARY

Framework Desktop · AMD Ryzen AI Max+ 395

ORCHESTRATOR
WINDOWS VM HOST

CPU — Zen 5

16 Cores / 32 Threads

KAIROS daemon · OS scheduling

iGPU — RDNA 3.5

Radeon 8060S

~60 TFLOPS FP16 · DeepSeek V4

NPU — XDNA 2

50 TOPS

Gemma 4 E4B classifier · Embeddings · Whisper

Unified RAM

128 GB LPDDR5X-8000

256 GB/s · iGPU+CPU shared pool

NVMe M.2

4 TB PCIe 4.0

V4 overflow · On-disk KV cache

RTX eGPU

16 GB GDDR7

via OCuLink · VFIO → Windows VM

DPU — NVIDIA BlueField-2 BF2M515A (PCIe x4→x16 riser)

200GbE QSFP56 · 8× ARM A72 · DOCA Linux · MCP DATA SPECIALIST · Postgres · Qdrant · NVMe-oF · Zero-trust enforcement

Windows 10 Pro VM (VFIO passthrough): PowerDirector video editing · RTX CUDA inference · MCP Windows executor · Full GPU bandwidth, zero host interference

Node interconnect: Direct DAC (QSFP56) · 200 GbE RDMA (RoCE v2) · ~5–10 μs latency

MIMICAL-2 — INFERENCE

Framework Desktop · AMD Ryzen AI Max+ 395

DAILY DRIVER
KAIROS HOST

CPU — Zen 5

16 Cores / 32 Threads

LiteLLM Router · Qdrant DB

iGPU — RDNA 3.5

Radeon 8060S

Gemma 4 26B — ALWAYS WARM

NPU — XDNA 2

50 TOPS

Embedding generation · parallel prefetch

Unified RAM

128 GB LPDDR5X-8000

Gemma resident · V4-Pro shard

NVMe M.2 #1

4 TB PCIe 4.0

OS · V4-Pro weights · Qdrant index

NVMe M.2 #2

4 TB PCIe 4.0

Engram cold tier · Kopia backups

DPU — NVIDIA BlueField-2 BF2M515A (PCIe x4→x16 riser)

200GbE QSFP56 · 8× ARM A72 · DOCA Linux · MCP API SPECIALIST · LiteLLM routing · OpenWebUI host · Cloud egress policy

Pure inference node: No Windows VM · Full 128 GB + both M.2 slots dedicated to inference 24/7. PXE-provisioned via Ansible AWX — zero-touch deployment.

Total Unified RAM

256 GB

LPDDR5X-8000 combined

Total NVMe Storage

12 TB

PCIe 4.0 · 3 × 4 TB

iGPU Compute

~120 TF

2 × Radeon 8060S FP16

NPU Compute

100 TOPS

2 × XDNA 2 INT8

RDMA Fabric

200 GbE

RoCE v2 · ~5–10 μs

DPU ARM Cores

16 Cores

2 × BF2M515A A72
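
The cluster totals above follow directly from the per-node figures. A minimal sketch of that roll-up, assuming the per-node values listed earlier in this section; the `Node` type and the 256-bit memory bus width used for the bandwidth figure are illustrative assumptions, not values stated in this document:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    ram_gb: int
    nvme_tb: tuple           # one entry per M.2 drive
    igpu_tflops_fp16: float
    npu_tops: int
    dpu_arm_cores: int


# Per-node figures copied from the hardware cards above.
NODES = (
    Node("mimical-1", 128, (4,), 60.0, 50, 8),
    Node("mimical-2", 128, (4, 4), 60.0, 50, 8),
)


def peak_bandwidth_gbs(mt_per_s: int, bus_bits: int) -> float:
    """Theoretical peak in GB/s: transfers per second times bytes per transfer."""
    return mt_per_s * 1e6 * (bus_bits / 8) / 1e9


totals = {
    "ram_gb": sum(n.ram_gb for n in NODES),
    "nvme_tb": sum(sum(n.nvme_tb) for n in NODES),
    "nvme_drives": sum(len(n.nvme_tb) for n in NODES),
    "igpu_tflops_fp16": sum(n.igpu_tflops_fp16 for n in NODES),
    "npu_tops": sum(n.npu_tops for n in NODES),
    "dpu_arm_cores": sum(n.dpu_arm_cores for n in NODES),
}
print(totals)

# Assumption: LPDDR5X-8000 on a 256-bit bus (bus width is not stated above).
print(peak_bandwidth_gbs(8000, 256))
```

These are theoretical peaks; sustained bandwidth and compute measured on the bench will come in lower.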

02 AI Inference Routing — Confidence-Gated Hierarchy

Step 1 · NPU

NPU Pre-Assess

Gemma 4 E4B
Intent classify · Domain tag
Class A / B / C dispatch
~instant · 0 iGPU load

→

Step 2 · System 1
Gemma 4 26B MoE
mimical-2 · always warm
3.8B active params
55–75 t/s
≥ 0.85 conf → done

→

Step 3 · System 1.5

Gemma 4 31B Dense

mimical-1 or mimical-2
All 31B active
25–35 t/s
≥ 0.85 conf → done

→

Step 4 · System 2

DeepSeek V4-Flash

mimical-1 primary
13B active / 284B total
8–18 t/s · 1M ctx
≥ 0.85 conf → done

→

Step 5 · System 2 Max

DeepSeek V4-Pro

Both nodes · TP=2
49B active / 1.6T total
3–6 t/s · 1M ctx
last local resort

→

Step 6 · Cloud

External API

Claude / Gemini / GPT
via LiteLLM · DPU egress
only when every local tier scores < 0.85
Full response crystallized → Postgres

Confidence Gate: Each tier returns an answer and a confidence score (0.0–1.0). A score ≥ 0.85 delivers the answer immediately; a lower score escalates the query to the next tier up the stack.
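
The gate can be sketched as a loop over tiers. This is a minimal illustration, not the KAIROS implementation: the `route` function, the tier names, and the stub scorers are hypothetical, and the document does not say how each tier derives its score.

```python
from typing import Callable, List, Tuple

THRESHOLD = 0.85  # gate value from the text

# A tier is (name, callable); the callable returns (answer, confidence in 0.0-1.0).
Tier = Tuple[str, Callable[[str], Tuple[str, float]]]


def route(prompt: str, local_tiers: List[Tier], cloud: Tier,
          threshold: float = THRESHOLD):
    """Try local tiers in order; call the cloud tier only if none clears the gate."""
    trace = []
    for name, infer in local_tiers:
        answer, conf = infer(prompt)
        trace.append((name, conf))
        if conf >= threshold:
            return name, answer, trace
    name, infer = cloud
    answer, conf = infer(prompt)
    trace.append((name, conf))
    return name, answer, trace


def stub(answer: str, conf: float):
    """Hypothetical stand-in for a real model call."""
    return lambda prompt: (answer, conf)


tiers = [
    ("gemma-4-26b-moe", stub("draft", 0.62)),
    ("gemma-4-31b-dense", stub("better", 0.91)),
    ("deepseek-v4-flash", stub("unused", 0.99)),
]
winner, answer, trace = route("example query", tiers, ("cloud", stub("external", 1.0)))
print(winner, trace)
```

The trace records every tier that was consulted, which is the information the escalation log would need.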

Crystallization: Every external API answer — prompt, token count, model, raw response — is serialized as JSONB to Postgres. The same class of problem never requires an external call twice.
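
One way the crystallized record could be assembled before it is written. The table name `crystallized_responses`, its columns, and the function name are hypothetical; the text only specifies the four payload fields and that they are stored as JSONB. The `%s` placeholders follow the common Python DB-API "format" parameter style, and no database connection is opened here.

```python
import json
from datetime import datetime, timezone

# Hypothetical table and columns; shown only to indicate where the payload would go.
INSERT_SQL = (
    "INSERT INTO crystallized_responses (created_at, payload) "
    "VALUES (%s, %s::jsonb)"
)


def crystallize(prompt, model, token_count, raw_response, now=None):
    """Return (timestamp, JSON text) ready to bind to INSERT_SQL."""
    now = now or datetime.now(timezone.utc)
    payload = {
        "prompt": prompt,
        "token_count": token_count,
        "model": model,
        "raw_response": raw_response,
    }
    return now.isoformat(), json.dumps(payload, ensure_ascii=False, sort_keys=True)


params = crystallize(
    prompt="example prompt",
    model="external-model",          # placeholder identifier
    token_count=412,
    raw_response={"text": "example response"},
)
print(params[1])
```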

Sovereignty Guarantee: Cloud escalation is physically gated through the mimical-2 DPU egress policy. No query reaches external APIs without passing through the BlueField-2 firewall.

03 AI Model Roster — Resident Inference Stack

System 1 · Daily Driver

Gemma 4 26B MoE (A3.8B)

FP8/BF16 · ~14 GB VRAM · 3.8B active params per token
55–75 t/s · always warm · never evicted from GTT pool

KAIROS observation loops, tool dispatch, routing decisions, conversational chat, consensus scoring. Handles ~80% of all daily requests locally.

System 2 · Deep Reasoner

DeepSeek V4-Flash

MXFP4 · ~142 GB total weights · 13B active per token
8–18 t/s · on-disk KV cache · 1M token context window

Hard problems, long-horizon agentic sessions, Engram-aware tiered memory loader. CSA/HCA attention architecture.

System 2 Max · Frontier Local

DeepSeek V4-Pro (TP=2)

MXFP4 · ~800 GB total · 49B active · distributed both nodes
3–6 t/s · RDMA tensor parallel · 1M token context

Maximum local reasoning. Weights split across mimical-1 and mimical-2 via 200GbE RDMA fabric. Last resort before cloud escalation.
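
The weight sizes quoted in this roster can be checked with parameters × bits ÷ 8. A minimal sketch, assuming 4 bits per parameter for MXFP4 and ignoring the shared block scales and other metadata; the function names are illustrative:

```python
def weight_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate weight size in GB (10^9 bytes), ignoring scale and metadata overhead."""
    return params_billion * bits_per_param / 8


def implied_bits(params_billion: float, size_gb: float) -> float:
    """Bits per parameter implied by a quoted size."""
    return size_gb * 8 / params_billion


print(weight_gb(284, 4))               # V4-Flash, 284B total parameters
print(weight_gb(1600, 4))              # V4-Pro, 1.6T total parameters
print(round(implied_bits(26, 14), 2))  # Gemma 4 26B at the quoted ~14 GB
```

The first two results reproduce the ~142 GB and ~800 GB figures. By the same arithmetic, ~14 GB for 26B parameters works out to roughly 4.3 bits per parameter, so the resident build would need a 4-bit-class quantization; FP8 would be about 26 GB and BF16 about 52 GB.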

04 Memory Allocation — mimical-1 (128 GB)

Region	Size	Purpose & Notes
amdgpu GTT pool	120 GB	`amdgpu.gttsize=131072` in GRUB. Holds V4-Flash hot experts + KV cache.
CPU / OS / KAIROS	8 GB	Ubuntu 26.04 LTS kernel, Go daemon, Postgres, system headroom.
NVMe Overflow	~37 GB	Cold MoE experts + on-disk KV cache for 1M token contexts via mmap/io_uring.

Critical Kernel Parameters (GRUB): iommu=pt amdgpu.gttsize=131072 ttm.pages_limit=32505856 — applied before any other configuration step. Required for iGPU pool access and VFIO passthrough stability.
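
A quick unit check on the two sizing parameters, assuming `amdgpu.gttsize` is expressed in MiB and `ttm.pages_limit` counts 4 KiB pages (worth confirming against the kernel documentation for the installed kernel). The helper names are illustrative:

```python
MIB = 1024 ** 2
GIB = 1024 ** 3
PAGE_BYTES = 4096  # assumption: 4 KiB pages, the x86-64 default


def gttsize_to_gib(gttsize_mib: int) -> float:
    return gttsize_mib * MIB / GIB


def pages_limit_to_gib(pages: int, page_bytes: int = PAGE_BYTES) -> float:
    return pages * page_bytes / GIB


def gib_to_params(gib: int, page_bytes: int = PAGE_BYTES):
    """Return (amdgpu.gttsize, ttm.pages_limit) values that both describe `gib` GiB."""
    return gib * GIB // MIB, gib * GIB // page_bytes


print(gttsize_to_gib(131072))        # 128.0
print(pages_limit_to_gib(32505856))  # 124.0
print(gib_to_params(120))            # (122880, 31457280)
```

Under those unit assumptions the two values describe 128 GiB and 124 GiB respectively, while the table budgets 120 GB for the pool. If the intent is a 120 GiB pool with 8 GiB left for the host, the last line shows the pair of values that would express it consistently.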

05 Memory Allocation — mimical-2 (128 GB)

Region	Size	Purpose & Notes
amdgpu GTT pool	120 GB	Gemma 4 26B MoE (permanently resident) + 31B Dense on demand.
Gemma 26B MoE	~14 GB	Always resident. Primary KAIROS engine. Never evicted.
V4-Pro Shard	~400 GB	mimical-2 half of V4-Pro weights. Larger than the 128 GB of RAM, so it lives on NVMe M.2 #1 and is streamed via llama.cpp RPC over RDMA.
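
A rough headroom check for this table, using only the sizes quoted above. The 31B Dense footprint is not given in this document, so it is left out; the names are illustrative:

```python
POOL_GB = 120.0          # amdgpu GTT pool
GEMMA_26B_GB = 14.0      # always resident
V4_PRO_SHARD_GB = 400.0  # mimical-2 half of V4-Pro


def headroom_gb(pool_gb: float, resident_gb: list) -> float:
    """Pool capacity left after the permanently resident models."""
    return pool_gb - sum(resident_gb)


free_gb = headroom_gb(POOL_GB, [GEMMA_26B_GB])
max_resident_fraction = free_gb / V4_PRO_SHARD_GB

print(free_gb)                          # GB left for on-demand models
print(round(max_resident_fraction, 3))  # share of the shard that could sit in RAM at once
```

With Gemma resident, at most about a quarter of the V4-Pro shard can be in memory at any moment, before KV cache or the 31B Dense model are accounted for. The rest has to be paged from NVMe during inference.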

06 OS & Software Stack Architecture

OS Layer

Ubuntu 26.04 LTS (Both Hosts)
Kernel 7.0 (in-tree)
ROCm 7.2.2 (gfx1151)
DOCA 2.x (DPUs)
Windows 10 Pro (mimical-1 VM)

Inference

llama.cpp (ROCm build)
llama.cpp RPC (RDMA TP=2)
vLLM ≥ 0.7.0
Ollama (Windows VM)
Unsloth + FSDP (LoRA fine-tune)

Agentic

KAIROS Go Daemon
MCP Servers (DPU-isolated)
LiteLLM Router
Agent Zero (Docker)
Ansible AWX

Data

Postgres (behavioral_events · methodology_rules)
Qdrant (RAG vector DB)
Tiller Financial Ingest
MaaS / petaCMS Telemetry
OBD-II Vehicle Telemetry
BookStack (auto-updated runbooks)

07 KAIROS — Proactive Observer Daemon

Always-on compiled Go application. Watches without being asked. Speaks when it has something worth saying. Named for the Greek concept of the right moment. It observes, logs, and proposes — but never executes without explicit human approval. The anti-sycophancy gate is immutable at the code level.
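
The propose-then-approve rule can be illustrated with a small state machine. This sketch is in Python for brevity rather than Go, and every name in it (`ApprovalGate`, `Proposal`, `Status`) is hypothetical; it shows the shape of the gate, not the daemon's code.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class Status(Enum):
    PROPOSED = "proposed"
    APPROVED = "approved"
    DENIED = "denied"
    EXECUTED = "executed"


@dataclass
class Proposal:
    observer: str
    summary: str
    action: Callable[[], str]   # side effect deferred until approval
    status: Status = Status.PROPOSED


class ApprovalGate:
    """Observers may propose; only an explicit human decision unlocks execution."""

    def __init__(self) -> None:
        self.audit_log: List[str] = []

    def propose(self, observer: str, summary: str, action: Callable[[], str]) -> Proposal:
        self.audit_log.append(f"{observer} proposed: {summary}")
        return Proposal(observer, summary, action)

    def decide(self, proposal: Proposal, approved: bool) -> None:
        if proposal.status is not Status.PROPOSED:
            raise ValueError("proposal has already been decided")
        proposal.status = Status.APPROVED if approved else Status.DENIED
        self.audit_log.append(f"human {proposal.status.value}: {proposal.summary}")

    def execute(self, proposal: Proposal) -> str:
        if proposal.status is not Status.APPROVED:
            raise PermissionError("execution requires explicit human approval")
        result = proposal.action()
        proposal.status = Status.EXECUTED
        self.audit_log.append(f"executed: {proposal.summary}")
        return result


gate = ApprovalGate()
p = gate.propose("maas-fleet", "queue reboot playbook", lambda: "playbook queued")
try:
    gate.execute(p)            # blocked: nobody has approved yet
except PermissionError as err:
    print(err)
gate.decide(p, approved=True)
print(gate.execute(p))
```

Keeping the action as a deferred callable means an observer cannot trigger a side effect merely by constructing a proposal.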

Observer 1

Financial Watch

Tiller daily transaction ingestion, burn-rate anomaly detection, debt acceleration modeling, NV Energy demand arbitrage, and bank alert email parsing. Morning briefing generated automatically.
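
The document does not say how burn-rate anomalies are detected. One simple possibility, shown here purely as an assumption, is a z-score against a trailing window of daily spend; the function name, window length, threshold, and sample data are all illustrative.

```python
from statistics import mean, stdev


def burn_rate_anomalies(daily_spend, window=7, z_threshold=3.0):
    """Return indices of days whose spend deviates from the trailing
    window's mean by more than z_threshold standard deviations."""
    flagged = []
    for i in range(window, len(daily_spend)):
        history = daily_spend[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            continue  # flat history: z-score undefined, skip the day
        if abs(daily_spend[i] - mu) / sigma > z_threshold:
            flagged.append(i)
    return flagged


spend = [100, 110, 95, 105, 98, 102, 107, 104, 480, 101]
print(burn_rate_anomalies(spend))  # [8]
```

Only the spike on day 8 is flagged. The following day is compared against a window that now contains the spike, which inflates the standard deviation; a median-based variant would be less sensitive to that.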

Observer 2

MaaS Fleet Telemetry

petaCMS bare-metal fleet streaming telemetry for cloud hosting customers. Node health trend analysis, Kopia/Velero backup state verification, Ansible AWX playbook queuing. Approve/deny execution gate.

Observer 3

Calendar & Communications

Email triage with HTML sanitization via Go's golang.org/x/net/html package, conflict detection, meeting prep briefs, call screening via MacroDroid + NPU conditional forwarding. Zero disruption to existing contacts.
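
Calendar conflict detection reduces to finding overlapping time ranges. A minimal sketch, assuming events arrive as (title, start, end) tuples; that event shape and the function name are illustrative, not taken from the KAIROS code:

```python
from datetime import datetime
from typing import List, Tuple

Event = Tuple[str, datetime, datetime]  # (title, start, end)


def find_conflicts(events: List[Event]):
    """Return pairs of titles whose time ranges overlap.
    Events that merely touch (one ends as the next starts) do not conflict."""
    ordered = sorted(events, key=lambda e: e[1])
    conflicts = []
    for i, (title_a, _start_a, end_a) in enumerate(ordered):
        for title_b, start_b, _end_b in ordered[i + 1:]:
            if start_b >= end_a:
                break  # sorted by start: nothing later can overlap event a
            conflicts.append((title_a, title_b))
    return conflicts


day = datetime(2026, 1, 5)
events = [
    ("Design review", day.replace(hour=10), day.replace(hour=11)),
    ("Standup", day.replace(hour=9), day.replace(hour=9, minute=30)),
    ("Vendor call", day.replace(hour=9, minute=15), day.replace(hour=10)),
]
print(find_conflicts(events))  # [('Standup', 'Vendor call')]
```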

Observer 4

Vehicle & Environment

OBD-II Bluetooth telemetry ingestion, fault code monitoring, maintenance window prediction, solar array output correlation with EV charging schedule optimization.
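
Fault code monitoring starts with decoding the two-byte trouble codes an OBD-II adapter returns. The sketch below uses the standard encoding (top two bits select the system letter, then four digits); how KAIROS reads bytes from the Bluetooth dongle is not described here, so the input is simply assumed to be the raw code bytes with any response header already stripped.

```python
_SYSTEMS = "PCBU"  # powertrain, chassis, body, network


def decode_dtc(byte_a: int, byte_b: int) -> str:
    """Decode one two-byte diagnostic trouble code into its text form."""
    if not (0 <= byte_a <= 0xFF and 0 <= byte_b <= 0xFF):
        raise ValueError("DTC bytes must be in the range 0-255")
    system = _SYSTEMS[byte_a >> 6]        # top two bits pick the letter
    first_digit = (byte_a >> 4) & 0x3     # next two bits: 0-3
    return f"{system}{first_digit}{byte_a & 0xF:X}{byte_b:02X}"


def decode_dtc_pairs(data: bytes):
    """Decode consecutive byte pairs, skipping 00 00 padding."""
    if len(data) % 2:
        raise ValueError("expected an even number of bytes")
    codes = []
    for i in range(0, len(data), 2):
        a, b = data[i], data[i + 1]
        if a == 0 and b == 0:
            continue
        codes.append(decode_dtc(a, b))
    return codes


print(decode_dtc_pairs(bytes([0x01, 0x33, 0xC1, 0x00, 0x00, 0x00])))
# ['P0133', 'U0100']
```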

08 Build Sequence — Phased Engineering Plan

WEEK 1

Phase 1 · Physical Build

Bench test both mainboards. Confirm POST, 128GB recognition per node, Noctua thermal validation. Hardware in hand — begin immediately.

WEEK 1

Phase 2 · OS & iGPU

Install Ubuntu 26.04 LTS. Apply GRUB parameters: iommu=pt amdgpu.gttsize=131072 ttm.pages_limit=32505856. Validate ROCm 7.2.2 with rocm-smi on gfx1151.

WEEK 2

Phase 3 · VFIO + Windows VM

Bind RTX via vfio-pci. Boot Windows 10 Pro VM. Validate PowerDirector GPU access. Confirm zero host memory bandwidth contention during simultaneous LLM inference.

WEEK 3

Phase 4 · Inference Init

Deploy Gemma 4 26B MoE on mimical-2 via llama.cpp ROCm build. Benchmark tokens/sec. Validate always-warm residency in GTT pool.
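
When benchmarking, it helps to know the ceiling that memory bandwidth imposes on decode speed: each generated token has to read every active weight at least once. A rough sketch under stated assumptions: 256 GB/s is the theoretical peak for LPDDR5X-8000 on an assumed 256-bit bus, and 4 bits per active parameter is an assumed quantization. Substitute measured bandwidth and the real quantization when available.

```python
def decode_ceiling_tps(bandwidth_gbs: float, active_params_b: float,
                       bits_per_param: float) -> float:
    """Upper bound on decode tokens/s if every active weight is read once per token."""
    gb_read_per_token = active_params_b * bits_per_param / 8
    return bandwidth_gbs / gb_read_per_token


def efficiency(measured_tps: float, ceiling_tps: float) -> float:
    """Fraction of the bandwidth ceiling that a benchmark run achieved."""
    return measured_tps / ceiling_tps


BANDWIDTH_GBS = 256.0  # assumption, see lead-in

gemma_ceiling = decode_ceiling_tps(BANDWIDTH_GBS, 3.8, 4)   # Gemma 4 26B MoE, 3.8B active
flash_ceiling = decode_ceiling_tps(BANDWIDTH_GBS, 13.0, 4)  # V4-Flash, 13B active

print(round(gemma_ceiling, 1), round(flash_ceiling, 1))
print(round(efficiency(75, gemma_ceiling), 2))
```

Both target ranges in this document (55–75 t/s and 8–18 t/s) sit below these ceilings, which is what one would expect once KV-cache reads, expert routing, and NVMe paging are included.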

WEEKS 4–5

Phase 5 · RDMA Cluster

Connect DPUs via DAC cable. Validate 200GbE RDMA throughput. Deploy llama.cpp RPC. Test TP=2 distribution of DeepSeek V4-Flash across both nodes.
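
Before validating throughput, it is worth working out which link sets the ceiling. The DPUs are described as sitting on x4→x16 risers; assuming that means a PCIe 4.0 x4 electrical link to the host (an assumption, not stated outright), the host-to-DPU path is narrower than the 200 GbE wire. The function names are illustrative.

```python
def ethernet_gbytes_per_s(gbit_per_s: float) -> float:
    """Line rate in GB/s, before Ethernet and RoCE framing overhead."""
    return gbit_per_s / 8


def pcie_gbytes_per_s(gt_per_s: float, lanes: int, encoding: float = 128 / 130) -> float:
    """Raw PCIe ceiling in GB/s, before protocol overhead (128b/130b encoding)."""
    return gt_per_s * lanes * encoding / 8


line_rate = ethernet_gbytes_per_s(200)   # DPU-to-DPU wire
host_link = pcie_gbytes_per_s(16.0, 4)   # assumed PCIe 4.0 x4 host-to-DPU link
bottleneck = min(line_rate, host_link)

print(line_rate)                 # 25.0
print(round(host_link, 2))       # 7.88
print(round(bottleneck * 8, 1))  # ceiling in Gb/s for traffic touching host memory
```

Under that assumption, transfers between the two hosts' memory top out near 63 Gb/s rather than 200 Gb/s. Traffic that terminates on the DPU's own ARM cores is not limited by the riser.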

WEEK 6

Phase 6 · KAIROS Foundation

Deploy LiteLLM router. Stand up KAIROS Go daemon. Expose core MCP tools on DPUs. Initialize Postgres behavioral_events schema.

WEEK 7

Phase 7 · Routing & Escalation

Three-class dispatch logic. Confidence gating at 0.85 threshold. Cloud API key configuration. Full escalation logging to Postgres JSONB.

WEEK 8

Phase 8 · MaaS & Telemetry

Connect Ansible AWX to KAIROS MCP server. Implement Tiller financial ingest pipeline. NV Energy solar telemetry. OBD-II dongle integration. BookStack auto-documentation active.

09 Mimical-Sovereign — Founding Principles

Personal, Not General

Not a model that knows everything about everyone. A model that knows everything about you. Your history, your fleet topology, your decision methodology, your financial reality.

Compounding Intelligence

The Postgres knowledge graph never resets. Every model generation inherits all accumulated knowledge. When a new model drops, swap the weights — the sovereign continues without interruption.

Data Never Leaves

The BlueField-2 DPUs enforce the network boundary in physical silicon. Financial telemetry, health data, and communications never touch hyperscaler infrastructure. Zero-trust at the hardware level.

Model Agnosticism

Every component treats the LLM as configuration, not code. The agentic loop never knows or cares what model answers. New generation released — config change and weight download. Zero downtime.
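
A sketch of what "LLM as configuration" can look like in practice: the agentic loop asks for a tier, and a registry resolves it to whatever model is currently configured. The config keys, model identifiers, endpoint URLs, and class name are placeholders, not the project's actual configuration.

```python
import json

# Hypothetical configuration; in practice this would live in a file or the router.
CONFIG_JSON = """
{
  "system1": {"model": "gemma-4-26b-moe",   "endpoint": "http://mimical-2.example:4000/v1"},
  "system2": {"model": "deepseek-v4-flash", "endpoint": "http://mimical-1.example:4000/v1"}
}
"""


class ModelRegistry:
    """Maps a routing tier to the model currently configured for it."""

    def __init__(self, raw_config: str) -> None:
        self._tiers = json.loads(raw_config)

    def resolve(self, tier: str) -> dict:
        if tier not in self._tiers:
            raise KeyError(f"no model configured for tier {tier!r}")
        return dict(self._tiers[tier])   # copy so callers cannot mutate the registry

    def swap(self, tier: str, model: str) -> None:
        self.resolve(tier)               # raises if the tier is unknown
        self._tiers[tier]["model"] = model


def agent_step(registry: ModelRegistry, tier: str, prompt: str) -> dict:
    """The loop names a tier, never a model."""
    target = registry.resolve(tier)
    return {"model": target["model"], "endpoint": target["endpoint"], "prompt": prompt}


registry = ModelRegistry(CONFIG_JSON)
before = agent_step(registry, "system1", "summarize today")
registry.swap("system1", "next-generation-model")   # placeholder name
after = agent_step(registry, "system1", "summarize today")
print(before["model"], "->", after["model"])
```

`agent_step` is unchanged across the swap; only the registry entry differs.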