Architecture
A technical deep-dive into the NeMo AI Platform architecture: 12 distributed services, 37 specialized ML engines, and a full Kubernetes deployment with an Istio service mesh, all running on a single consumer GPU.
How It Works
A layered architecture designed for GPU efficiency and real-time processing.
12 Microservices
10 application services and 2 infrastructure services working in concert, all Dockerized.
Central entry point handling JWT auth, request routing, rate limiting, and WebSocket connections for real-time updates.
Conversational AI powered by Gemma 3 4B (Q4 quantized). Integrates RAG context, emotional awareness, and business analysis.
VRAM orchestration via Redis semaphores. Manages model loading, memory allocation, and preemptive pausing between Gemma and ASR.
Real-time ASR using Parakeet RNNT 0.6b. Includes speaker diarization (Sortformer) and streaming audio processing.
Retrieval-augmented generation with FAISS vector storage. Handles conversation memory and knowledge bases.
Multi-dimensional emotion analysis using fine-tuned DistilRoBERTa. Provides sentiment scoring and temporal tracking.
"System 2" thinking with 37 specialized engines. Titan AutoML, Oracle Causal, Newton Symbolic, Salesforce CRM analytics, and more.
Analytics and AutoML experiments. Revenue forecasting, anomaly detection, and business intelligence visualization.
HTTPS reverse proxy with TLS 1.3 termination. Handles SSL certificates, security headers, and load balancing.
Voice command integration and smart home automation. Connects to Voice Monkey for Alexa control.
Banking hub integrating with Fiserv API for account data, transactions, and financial analytics.
Persistence layer. Redis for high-speed caching and locks; PostgreSQL for durable data and task queues.
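The VRAM orchestration described above uses Redis locks so that the language model and the ASR model do not claim GPU memory at the same time. The following is a minimal sketch of that idea, reduced to a single-slot lock (a binary semaphore). The class `GpuSlot`, the key `gpu:vram:lock`, and the TTL are hypothetical and not taken from the platform's code. `InMemoryClient` is a stand-in that mimics only the redis-py calls used here (`set` with `nx`/`ex`, `get`, `delete`) so the sketch runs without a server; a `redis.Redis` instance could be passed in its place.

```python
import time
import uuid


class InMemoryClient:
    """Stand-in for the redis-py calls used below (set with nx/ex, get, delete)."""

    def __init__(self):
        self._data = {}  # key -> (value, expiry timestamp or None)

    def _live(self, key):
        entry = self._data.get(key)
        if entry and entry[1] is not None and entry[1] <= time.monotonic():
            del self._data[key]  # expired: behave as if the key is gone
            return None
        return entry

    def set(self, name, value, nx=False, ex=None):
        if nx and self._live(name) is not None:
            return None  # redis-py returns None when NX blocks the write
        expiry = time.monotonic() + ex if ex is not None else None
        self._data[name] = (value, expiry)
        return True

    def get(self, name):
        entry = self._live(name)
        return entry[0] if entry else None

    def delete(self, name):
        return 1 if self._data.pop(name, None) is not None else 0


class GpuSlot:
    """Single-slot GPU lock (hypothetical key name and TTL)."""

    def __init__(self, client, key="gpu:vram:lock", ttl_seconds=30):
        self.client = client
        self.key = key
        self.ttl_seconds = ttl_seconds

    def acquire(self, owner):
        # SET NX EX: succeeds only if nobody holds the slot; the TTL frees
        # the slot if the holder crashes without releasing it.
        token = f"{owner}:{uuid.uuid4().hex}"
        if self.client.set(self.key, token, nx=True, ex=self.ttl_seconds):
            return token
        return None

    def release(self, token):
        current = self.client.get(self.key)
        if isinstance(current, bytes):  # redis-py returns bytes by default
            current = current.decode()
        if current != token:
            return False  # slot expired or is held by someone else
        # Not atomic: against a real server, do this check-and-delete
        # in a Lua script so another holder's lock is never removed.
        self.client.delete(self.key)
        return True


slot = GpuSlot(InMemoryClient())
asr_token = slot.acquire("asr")       # ASR takes the GPU slot
blocked = slot.acquire("gemma")       # None while ASR holds it
slot.release(asr_token)               # ASR pauses and frees the slot
gemma_token = slot.acquire("gemma")   # Gemma can now load
```

The unique token per acquisition is a design choice: it keeps a service whose lock has already expired from releasing a slot that another service has since taken.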
37 Specialized ML Engines
A comprehensive suite of analytical engines powered by the ML Service.
Core AI Models
AutoML & Predictive (Titan Series)
Financial Analysis
Advanced Analytics
Salesforce CRM Analytics
Kubernetes Ready
Production-ready K8s manifests with NVIDIA GPU passthrough
Manifests
- namespace.yaml
- secrets.yaml (12 keys)
- services.yaml (ClusterIP)
- deployments.yaml (842 lines)
- ingress.yaml (NGINX)
- nvidia-device-plugin.yaml
GPU Passthrough
- NVIDIA Device Plugin DaemonSet
- Direct device mounts (/dev/nvidia*)
- CUDA runtime in containers
- Resource limits: nvidia.com/gpu: 1
- Shared GPU across pods
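As an illustration of the `nvidia.com/gpu: 1` resource limit listed above, the sketch below builds a Deployment manifest as a Python dictionary and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. The function `gpu_deployment`, the service name `asr-service`, and the image reference are hypothetical; the actual contents of `deployments.yaml` are not shown in this document.

```python
import json


def gpu_deployment(name, image, gpus=1):
    """Build a Deployment manifest (as a dict) whose container requests NVIDIA GPUs."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            # Extended resource advertised by the NVIDIA device plugin
                            "resources": {"limits": {"nvidia.com/gpu": gpus}},
                        }
                    ]
                },
            },
        },
    }


# Hypothetical service name and image, for illustration only
manifest = gpu_deployment("asr-service", "example.local/asr-service:dev")
print(json.dumps(manifest, indent=2))
```

With the device plugin's default configuration, a `nvidia.com/gpu: 1` limit assigns a whole GPU to one container. Sharing one GPU across pods, as listed above, therefore needs an additional mechanism, such as the direct device mounts mentioned or the plugin's time-slicing option.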
Kustomize Overlays
- base/ (common resources)
- overlays/local/ (dev)
- overlays/install/ (prod)
- Health check probes
- Init containers for setup
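The health check probes listed above need an endpoint inside each service for the kubelet to call. A minimal sketch of such an endpoint using only the Python standard library follows. The paths `/healthz` and `/readyz`, the handler name, and the readiness flag are assumptions for illustration; the platform's actual probe paths are not given here.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    # Set once startup work (for example, model loading) has finished
    ready = threading.Event()

    def do_GET(self):
        if self.path == "/healthz":
            # Liveness: the process is up and able to answer
            self._reply(200, {"status": "alive"})
        elif self.path == "/readyz":
            # Readiness: answer 503 until the service can take traffic
            ok = self.ready.is_set()
            self._reply(200 if ok else 503, {"ready": ok})
        else:
            self._reply(404, {"error": "not found"})

    def _reply(self, code, payload):
        body = json.dumps(payload).encode()
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep probe traffic out of the logs


def start_health_server(host="127.0.0.1", port=0):
    """Serve probes on a background thread; port=0 picks a free port.

    Inside a pod the host would be "0.0.0.0" so the kubelet can reach it.
    """
    server = HTTPServer((host, port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Separating liveness from readiness lets a GPU service stay out of rotation while it loads a model without being restarted for it: a failing readiness probe only removes the pod from service endpoints, whereas a failing liveness probe restarts the container.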
Istio Service Mesh
- VirtualService routing
- DestinationRule policies
- Gateway TLS termination
- mTLS between services
- Traffic management