INTERNAL USE ONLY / JAN 2025
Executive Summary
Overview

A private AI assistant that operates entirely within firm infrastructure. No data leaves the network; the deployment is fully isolated.

Protections
  • Network isolation prevents exfiltration
  • Audit logs track usage, not content
  • Role-based matter isolation
  • All outputs marked as drafts
Business Value
  • 60-85% cost savings vs external APIs
  • Eliminates privilege waiver risk
  • Full control over model behavior
  • No per-token costs after deployment
OPTIMIZATION SUMMARY

A 31% cost reduction ($170K/year), achieved by eliminating over-provisioned redundancy. The optimized design replaces live hot-standby capacity with active/passive failover and intelligent routing that defaults to small models. vLLM's batching efficiency lets a single GPU cluster serve 500 users without degradation. Security posture is unchanged: network isolation and audit controls are preserved. The 99.5% uptime target (vs 99.9%) trades rare 3-minute delays for rational cost discipline.

Cost Calculator
Recommended Configurations (Optimized)
Small: ~$95K/yr
  50-150 users • 7B only • Cold standby • 3-Year RI
  Best for pilot programs
Medium: ~$280K/yr
  300-600 users • 7B + 70B • Warm standby • 1-Year RI
  Best value for most firms
Large: ~$520K/yr
  800-2000 users • Full multi-model • Active/passive • 1-Year RI
  Maximum capability
Custom Configuration
Users: 500
Concurrency: 15%
Select Models (Router defaults to 7B)
LLaMA 3 7B: 4× A10G • Primary • $8,140/mo
LLaMA 3 70B: 4× A100 80GB • Fallback • $23,690/mo
Mixtral 8×7B: 4× A100 80GB • Optional • $23,690/mo
Qwen 2.5 72B: 4× A100 80GB • Optional • $23,690/mo
Monthly Infra (Compute): $55,520
Monthly Storage: $1,234
Monthly Other: $647
Annual Total: $688K ($115/user/mo)
Annual Cost Delta vs External API (GPT-4o): $301K saved, 30% less than the ~$990K equivalent API spend
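
The totals above can be reproduced directly. A minimal Python sketch, assuming the three selected clusters are the 7B primary, the 70B fallback, and one of the optional 70B-class models (all figures from the tiles above):

    # Reproduce the custom-configuration totals shown above.
    compute = 8_140 + 23_690 + 23_690       # 7B + 70B + one optional cluster
    monthly = compute + 1_234 + 647         # compute + storage + other = $57,401
    annual = monthly * 12                   # $688,812, rounded to $688K
    per_user = annual / (500 * 12)          # ~$114.80, shown as $115/user/mo
    print(f"${annual:,}/yr  ${per_user:,.2f}/user/mo")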
Cost Breakdown
Resource                            Type     Specs          Qty  Monthly   Annual
LLaMA 3 7B                          Compute  4× A10G        1    $8,140    $97,680
LLaMA 3 70B                         Compute  4× A100 80GB   1    $23,690   $284,280
Optional 70B-class (Mixtral/Qwen)   Compute  4× A100 80GB   1    $23,690   $284,280
Storage (S3, PostgreSQL)            Storage  Encrypted      n/a  $1,234    $14,808
Other                               Misc     n/a            n/a  $647      $7,764
Total                                                            $57,401   $688,812
API Comparison
Provider  Model   Input  Output  Est. Annual  Diff
OpenAI    GPT-4o  n/a    n/a     ~$990K       +$301K vs self-hosted
Architecture


Request path (top to bottom):

Client: Browser (attorneys)
  ↓ TLS 1.3
Presentation: Load balancer (HAProxy) • Web UI (React)
Auth: SSO (SAML/OIDC) • RBAC (roles) • PG filter (matter isolation)
Orchestration: API gateway (Kong) • Intent classifier • Router (mode select)
Processing: RAG pipeline • Vector DB (Milvus) • Embeddings (BGE)
Inference (air-gapped): LLaMA 7B (vLLM) • LLaMA 70B (vLLM) • Mixtral (vLLM)
Safety: Output validator • Citation enforcer • Disclaimer injector
Storage: Audit (PostgreSQL) • Models (S3) • Docs (encrypted)
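
A hedged sketch of how one request traverses these layers; every name below is an illustrative stand-in for the real service, stubbed so the flow is runnable:

    # Illustrative request flow through the layers above. All helpers are
    # hypothetical stand-ins for the real services, stubbed for brevity.
    def check_rbac(user): pass                     # Auth: RBAC + matter isolation
    def rag_retrieve(docs, query): return docs     # Processing: Milvus + BGE
    def generate(model, query, ctx): return f"[{model}] draft"   # Inference: vLLM
    def enforce_citations(ans, ctx): return ans    # Safety: block uncited claims
    def inject_disclaimer(ans): return ans + " [DRAFT: attorney review required]"
    def audit_log(user, mode, model): pass         # Storage: logs usage, not content

    def handle_request(user, query, docs):
        check_rbac(user)
        mode = "document" if docs else "knowledge"     # operating-mode trigger
        model = "70b" if mode == "document" else "7b"  # router details below
        ctx = rag_retrieve(docs, query) if docs else None
        answer = enforce_citations(generate(model, query, ctx), ctx)
        audit_log(user, mode, model)
        return inject_disclaimer(answer)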


Operating Modes
Knowledge Mode
Trigger: No documents uploaded
Allowed:
  • ✓ High-level legal concepts
  • ✓ Educational explanations
  • ✓ Jurisdiction-agnostic info
Blocked:
  • ✗ Application to specific facts
  • ✗ Client-specific advice
  • ✗ Jurisdictional conclusions
Model: LLaMA 3 7B (fast tier)

Document Mode
Trigger: Documents uploaded to session
Allowed:
  • ✓ Contract analysis
  • ✓ Document summarization
  • ✓ Clause extraction
  • ✓ Drafting assistance
Requirements:
  • ⚡ Mandatory RAG with citations
  • ⚡ All claims must cite sources
  • ⚡ Uncited claims are blocked outright, not merely flagged
Model: LLaMA 3 70B (capable tier)
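
A minimal sketch of the trigger logic above; the session shape and model identifiers are assumptions:

    # Mode selection per the trigger rules above (names are illustrative).
    def select_mode(uploaded_docs: list) -> tuple[str, str]:
        if uploaded_docs:                      # any upload flips the session
            return "document", "llama-3-70b"   # mandatory RAG + citations
        return "knowledge", "llama-3-7b"       # concepts only, fast tier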

Failure Modes & Recovery
Failure                  Detection                   Recovery                          User Impact
7B GPU dies              Health check (10s)          Cold standby boots (~3 min)       Queue holds requests; ~3 min delay
70B GPU dies             Health check (10s)          Warm standby activates (60s)      60s delay, then normal
Orchestrator dies        ALB health check (5s)       ASG launches replacement (90s)    ~90s of client retries
Vector DB dies           Replica health check (10s)  Replica promoted (30s)            ~30s read-only mode
Hallucinated citations   Post-processing validator   RAG citation verification         Response blocked if a citation is not in the retrieved docs
Data exfiltration        Network policy enforcement  Zero outbound on inference nodes  Blocked at the infrastructure level
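
The hallucinated-citation row is enforced mechanically. A minimal sketch, where the [#id] citation format and helper shape are assumptions:

    # Post-processing citation validator per the table above.
    import re

    def validate_citations(answer: str, retrieved_ids: set[str]) -> str | None:
        cited = set(re.findall(r"\[#(\w+)\]", answer))   # e.g. "[#doc12]"
        if not cited or not cited.issubset(retrieved_ids):
            return None                # block: uncited or unknown citation
        return answer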
TRADE-OFF RATIONALE

99.5% uptime (vs 99.9%) allows up to 3.6 hr/month of potential downtime in exchange for $170K/year in savings. The assistant is an asynchronous productivity tool, not a real-time system; rare 3-minute delays do not create malpractice risk. All failovers are logged for partner transparency.
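
The 3.6 hr figure follows directly from the target:

    # Downtime allowance implied by each uptime target (30-day month).
    month_hours = 30 * 24                    # 720
    at_99_5 = (1 - 0.995) * month_hours      # 3.6 hours/month
    at_99_9 = (1 - 0.999) * month_hours      # 0.72 hours/month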

vLLM Efficiency & Router Logic
Why vLLM Enables Lean Provisioning
  • PagedAttention: 2-4× concurrency vs naive KV-cache management
  • Continuous batching: <2s latency at 80% GPU utilization
  • KV cache sharing: common prompt prefixes reuse memory
  • Measured: one g5.12xlarge serves 50 concurrent 7B users (see the sketch below)
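
A minimal serving sketch under stated assumptions (local model path, one 4-GPU node); PagedAttention and continuous batching are vLLM defaults, not extra configuration:

    # Minimal vLLM sketch; the model path is a hypothetical local mount
    # (air-gapped nodes load weights from disk, never a model hub).
    from vllm import LLM, SamplingParams

    llm = LLM(model="/models/llama-3-7b",     # assumed path
              tensor_parallel_size=4)         # shard across the 4× A10G
    params = SamplingParams(temperature=0.2, max_tokens=512)
    for out in llm.generate(["Explain force majeure at a high level."], params):
        print(out.outputs[0].text)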
Router Rules (Minimize 70B)
  • Knowledge mode: always 7B
  • Document mode: 7B unless >3 documents and a complex query
  • Escalation: retry on 70B if 7B confidence <0.75
  • Result: 70B usage drops from 30% to <10% (sketch below)
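
The rules above in code form; a sketch, assuming the intent classifier exposes a complexity flag and a confidence score:

    # Routing policy per the rules above (thresholds from this section).
    def route(mode: str, num_docs: int, complex_query: bool) -> str:
        if mode == "knowledge":
            return "7b"                       # always
        if num_docs > 3 and complex_query:
            return "70b"                      # direct escalation, rare
        return "7b"

    def maybe_escalate(model: str, confidence: float) -> str:
        # Retry on 70B only when the 7B answer is low-confidence.
        return "70b" if model == "7b" and confidence < 0.75 else model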
Failover Strategy
  • 7B: cold standby (~3 min boot)
  • 70B: warm standby (60s activation)
  • Orchestrator: ASG replacement (90s)
  • Target: 99.5% uptime (≤3.6 hr/month); watchdog sketch below
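
A hedged sketch of the 7B health-check loop; the endpoint and the provisioning call are assumptions, not the production tooling:

    # Cold-standby watchdog per the failover table (10s checks, ~3 min boot).
    import time
    import requests

    def start_standby():
        pass    # placeholder: trigger ASG scale-up / boot the cold standby

    def watch_primary(url: str = "http://inference-7b:8000/health"):
        while True:
            try:
                requests.get(url, timeout=2).raise_for_status()
            except requests.RequestException:
                start_standby()               # ~3 min until serving again
                return
            time.sleep(10)                    # detection interval from the table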