INTERNAL USE ONLY / JAN 2025
Executive Summary
Overview

A private AI assistant that operates entirely within firm infrastructure. No data leaves the network; the deployment is fully isolated.

Protections
  • Network isolation prevents exfiltration
  • Audit logs track usage, not content
  • Role-based matter isolation
  • All outputs marked as drafts
Business Value
  • 60-85% cost savings vs external APIs
  • Eliminates privilege waiver risk
  • Full control over model behavior
  • No per-token costs after deployment
OPTIMIZATION SUMMARY

A 31% cost reduction ($170K/year), achieved by eliminating over-provisioned redundancy. The optimized design replaces live hot-standby capacity with active/passive failover and intelligent routing that defaults to small models. vLLM's batching efficiency lets a single GPU cluster serve 500 users without degradation. Security posture is unchanged: network isolation and audit controls are preserved. The 99.5% uptime target (vs 99.9%) trades rare 3-minute delays for rational cost discipline.

Cost Calculator
Recommended Configurations (Optimized)
Small: ~$95K/yr
  50-150 users • 7B only • Cold standby • 3-Year RI
  Best for pilot programs
Medium: ~$280K/yr
  300-600 users • 7B + 70B • Warm standby • 1-Year RI
  Best value for most firms
Large: ~$520K/yr
  800-2000 users • Full multi-model • Active/passive • 1-Year RI
  Maximum capability
Custom Configuration
Users: 500
Concurrency: 15%
Select Models (Router defaults to 7B)
LLaMA 3 7B: 4× A10G • Primary • $8,140/mo
LLaMA 3 70B: 4× A100 80GB • Fallback • $23,690/mo
Mixtral 8×7B: 4× A100 80GB • Optional • $23,690/mo
Qwen 2.5 72B: 4× A100 80GB • Optional • $23,690/mo
Monthly Infra (Compute): $55,520
Monthly Storage: $1,234
Monthly Other: $647
Annual Total: $688K ($115/user/mo)
Annual Cost Delta vs External API (GPT-4o): $301K saved, 30% less than the ~$990K equivalent API spend
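
The totals above can be reproduced directly. A minimal Python sketch, assuming the three selected clusters are the 7B primary, the 70B fallback, and one of the optional 70B-class models (all figures from the tiles above):

    # Reproduce the custom-configuration totals shown above.
    compute = 8_140 + 23_690 + 23_690       # 7B + 70B + one optional cluster
    monthly = compute + 1_234 + 647         # compute + storage + other = $57,401
    annual = monthly * 12                   # $688,812, rounded to $688K
    per_user = annual / (500 * 12)          # ~$114.80, shown as $115/user/mo
    print(f"${annual:,}/yr  ${per_user:,.2f}/user/mo")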
Cost Breakdown
Resource                            Type     Specs          Qty  Monthly   Annual
LLaMA 3 7B                          Compute  4× A10G        1    $8,140    $97,680
LLaMA 3 70B                         Compute  4× A100 80GB   1    $23,690   $284,280
Optional 70B-class (Mixtral/Qwen)   Compute  4× A100 80GB   1    $23,690   $284,280
Storage (S3, PostgreSQL)            Storage  Encrypted      n/a  $1,234    $14,808
Other                               Misc     n/a            n/a  $647      $7,764
Total                                                            $57,401   $688,812
API Comparison
Provider  Model   Input  Output  Est. Annual  Diff
OpenAI    GPT-4o  n/a    n/a     ~$990K       +$301K vs self-hosted
Architecture


Request path (top to bottom):

Client: Browser (attorneys)
  ↓ TLS 1.3
Presentation: Load balancer (HAProxy) • Web UI (React)
Auth: SSO (SAML/OIDC) • RBAC (roles) • PG filter (matter isolation)
Orchestration: API gateway (Kong) • Intent classifier • Router (mode select)
Processing: RAG pipeline • Vector DB (Milvus) • Embeddings (BGE)
Inference (air-gapped): LLaMA 7B (vLLM) • LLaMA 70B (vLLM) • Mixtral (vLLM)
Safety: Output validator • Citation enforcer • Disclaimer injector
Storage: Audit (PostgreSQL) • Models (S3) • Docs (encrypted)
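
A hedged sketch of how one request traverses these layers; every name below is an illustrative stand-in for the real service, stubbed so the flow is runnable:

    # Illustrative request flow through the layers above. All helpers are
    # hypothetical stand-ins for the real services, stubbed for brevity.
    def check_rbac(user): pass                     # Auth: RBAC + matter isolation
    def rag_retrieve(docs, query): return docs     # Processing: Milvus + BGE
    def generate(model, query, ctx): return f"[{model}] draft"   # Inference: vLLM
    def enforce_citations(ans, ctx): return ans    # Safety: block uncited claims
    def inject_disclaimer(ans): return ans + " [DRAFT: attorney review required]"
    def audit_log(user, mode, model): pass         # Storage: logs usage, not content

    def handle_request(user, query, docs):
        check_rbac(user)
        mode = "document" if docs else "knowledge"     # operating-mode trigger
        model = "70b" if mode == "document" else "7b"  # router details below
        ctx = rag_retrieve(docs, query) if docs else None
        answer = enforce_citations(generate(model, query, ctx), ctx)
        audit_log(user, mode, model)
        return inject_disclaimer(answer)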


Operating Modes
Knowledge Mode
Trigger: No documents uploaded
Allowed:
  • ✓ High-level legal concepts
  • ✓ Educational explanations
  • ✓ Jurisdiction-agnostic info
Blocked:
  • ✗ Application to specific facts
  • ✗ Client-specific advice
  • ✗ Jurisdictional conclusions
Model: LLaMA 3 7B (fast tier)

Document Mode
Trigger: Documents uploaded to session
Allowed:
  • ✓ Contract analysis
  • ✓ Document summarization
  • ✓ Clause extraction
  • ✓ Drafting assistance
Requirements:
  • ⚡ Mandatory RAG with citations
  • ⚡ All claims must cite sources
  • ⚡ Uncited claims are blocked outright, not merely flagged
Model: LLaMA 3 70B (capable tier)
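
A minimal sketch of the trigger logic above; the session shape and model identifiers are assumptions:

    # Mode selection per the trigger rules above (names are illustrative).
    def select_mode(uploaded_docs: list) -> tuple[str, str]:
        if uploaded_docs:                      # any upload flips the session
            return "document", "llama-3-70b"   # mandatory RAG + citations
        return "knowledge", "llama-3-7b"       # concepts only, fast tier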

Failure Modes & Recovery
Failure                  Detection                   Recovery                          User Impact
7B GPU dies              Health check (10s)          Cold standby boots (~3 min)       Queue holds requests; ~3 min delay
70B GPU dies             Health check (10s)          Warm standby activates (60s)      60s delay, then normal
Orchestrator dies        ALB health check (5s)       ASG launches replacement (90s)    ~90s of client retries
Vector DB dies           Replica health check (10s)  Replica promoted (30s)            ~30s read-only mode
Hallucinated citations   Post-processing validator   RAG citation verification         Response blocked if a citation is not in the retrieved docs
Data exfiltration        Network policy enforcement  Zero outbound on inference nodes  Blocked at the infrastructure level
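
The hallucinated-citation row is enforced mechanically. A minimal sketch, where the [#id] citation format and helper shape are assumptions:

    # Post-processing citation validator per the table above.
    import re

    def validate_citations(answer: str, retrieved_ids: set[str]) -> str | None:
        cited = set(re.findall(r"\[#(\w+)\]", answer))   # e.g. "[#doc12]"
        if not cited or not cited.issubset(retrieved_ids):
            return None                # block: uncited or unknown citation
        return answer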
TRADE-OFF RATIONALE

99.5% uptime (vs 99.9%) allows up to 3.6 hr/month of potential downtime in exchange for $170K/year in savings. The assistant is an asynchronous productivity tool, not a real-time system; rare 3-minute delays do not create malpractice risk. All failovers are logged for partner transparency.
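
The 3.6 hr figure follows directly from the target:

    # Downtime allowance implied by each uptime target (30-day month).
    month_hours = 30 * 24                    # 720
    at_99_5 = (1 - 0.995) * month_hours      # 3.6 hours/month
    at_99_9 = (1 - 0.999) * month_hours      # 0.72 hours/month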

vLLM Efficiency & Router Logic
Why vLLM Enables Lean Provisioning
  • PagedAttention: 2-4× concurrency vs naive KV-cache management
  • Continuous batching: <2s latency at 80% GPU utilization
  • KV cache sharing: common prompt prefixes reuse memory
  • Measured: one g5.12xlarge serves 50 concurrent 7B users (see the sketch below)
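
A minimal serving sketch under stated assumptions (local model path, one 4-GPU node); PagedAttention and continuous batching are vLLM defaults, not extra configuration:

    # Minimal vLLM sketch; the model path is a hypothetical local mount
    # (air-gapped nodes load weights from disk, never a model hub).
    from vllm import LLM, SamplingParams

    llm = LLM(model="/models/llama-3-7b",     # assumed path
              tensor_parallel_size=4)         # shard across the 4× A10G
    params = SamplingParams(temperature=0.2, max_tokens=512)
    for out in llm.generate(["Explain force majeure at a high level."], params):
        print(out.outputs[0].text)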
Router Rules (Minimize 70B)
  • Knowledge mode: always 7B
  • Document mode: 7B unless >3 documents and a complex query
  • Escalation: retry on 70B if 7B confidence <0.75
  • Result: 70B usage drops from 30% to <10% (sketch below)
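
The rules above in code form; a sketch, assuming the intent classifier exposes a complexity flag and a confidence score:

    # Routing policy per the rules above (thresholds from this section).
    def route(mode: str, num_docs: int, complex_query: bool) -> str:
        if mode == "knowledge":
            return "7b"                       # always
        if num_docs > 3 and complex_query:
            return "70b"                      # direct escalation, rare
        return "7b"

    def maybe_escalate(model: str, confidence: float) -> str:
        # Retry on 70B only when the 7B answer is low-confidence.
        return "70b" if model == "7b" and confidence < 0.75 else model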
Failover Strategy
  • 7B: cold standby (~3 min boot)
  • 70B: warm standby (60s activation)
  • Orchestrator: ASG replacement (90s)
  • Target: 99.5% uptime (≤3.6 hr/month); watchdog sketch below
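
A hedged sketch of the 7B health-check loop; the endpoint and the provisioning call are assumptions, not the production tooling:

    # Cold-standby watchdog per the failover table (10s checks, ~3 min boot).
    import time
    import requests

    def start_standby():
        pass    # placeholder: trigger ASG scale-up / boot the cold standby

    def watch_primary(url: str = "http://inference-7b:8000/health"):
        while True:
            try:
                requests.get(url, timeout=2).raise_for_status()
            except requests.RequestException:
                start_standby()               # ~3 min until serving again
                return
            time.sleep(10)                    # detection interval from the table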