Architecture
A technical deep-dive into the NeMo AI Platform: 10 services, 29 ML engines, full Kubernetes deployment, all orchestrated on a single consumer GPU.
How It Works
A layered architecture designed for GPU efficiency and real-time processing
10 Microservices
8 application services + 2 infrastructure services working in concert
Central entry point handling JWT auth, request routing, rate limiting, and WebSocket connections for real-time updates.
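As a rough sketch of that entry point, the snippet below wires JWT verification and a WebSocket route together with FastAPI; the framework choice, route paths, and signing key are assumptions made for illustration, not details taken from the platform.

```python
# Minimal gateway-style sketch: JWT verification on HTTP routes plus a
# WebSocket endpoint for real-time updates. FastAPI, PyJWT, and every name
# below are illustrative assumptions, not the platform's actual code.
import jwt
from fastapi import Depends, FastAPI, HTTPException, WebSocket
from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer

SECRET_KEY = "replace-me"  # hypothetical signing key
app = FastAPI()
bearer = HTTPBearer()

def current_user(creds: HTTPAuthorizationCredentials = Depends(bearer)) -> dict:
    """Decode and validate the bearer token before routing the request."""
    try:
        return jwt.decode(creds.credentials, SECRET_KEY, algorithms=["HS256"])
    except jwt.PyJWTError:
        raise HTTPException(status_code=401, detail="invalid token")

@app.get("/api/chat/history")
def chat_history(user: dict = Depends(current_user)) -> dict:
    # In a real gateway this would be proxied to the chat service.
    return {"user": user.get("sub"), "messages": []}

@app.websocket("/ws/updates")
async def updates(ws: WebSocket) -> None:
    # Push channel for real-time updates; echoes messages here for illustration.
    await ws.accept()
    async for message in ws.iter_text():
        await ws.send_text(message)
```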
Conversational AI powered by Gemma 3 4B (Q4 quantized). Integrates RAG context, emotional awareness, and business analysis.
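A minimal sketch of how a Q4-quantized Gemma 3 4B checkpoint could be served and fed retrieved context, assuming a local GGUF file and llama-cpp-python as the runtime; the file name, prompt layout, and emotion field are illustrative guesses rather than the service's real interface.

```python
# Illustrative only: serving a Q4-quantized Gemma 3 4B checkpoint with
# llama-cpp-python and injecting retrieved (RAG) context plus an emotion
# signal into the system prompt. File name and prompt layout are assumptions.
from llama_cpp import Llama

llm = Llama(
    model_path="gemma-3-4b-it-Q4_K_M.gguf",  # hypothetical local GGUF file
    n_ctx=8192,
    n_gpu_layers=-1,                          # offload all layers to the GPU
)

def answer(question: str, retrieved_chunks: list[str], emotion: str) -> str:
    context = "\n".join(retrieved_chunks)
    messages = [
        {"role": "system",
         "content": f"Use this context:\n{context}\nUser emotion: {emotion}"},
        {"role": "user", "content": question},
    ]
    result = llm.create_chat_completion(messages=messages, max_tokens=512)
    return result["choices"][0]["message"]["content"]

print(answer("Summarise last week's sales.", ["Week 42 revenue grew 4%."], "neutral"))
```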
VRAM orchestration via Redis semaphores. Manages model loading, memory allocation, and preemptive pausing.
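The Redis-semaphore idea can be sketched as an atomic counter over a fixed VRAM budget: a model loads only while the reserved total stays under the limit. The key name, the 8 GB budget, and the polling loop are assumptions, not the service's actual implementation.

```python
# Sketch of VRAM accounting with a Redis counter used as a semaphore.
# Key names, the 8 GB budget, and the retry loop are illustrative assumptions.
import time
import redis

r = redis.Redis()
VRAM_BUDGET_MB = 8192           # hypothetical budget for a consumer GPU
USED_KEY = "vram:used_mb"

# Atomic check-and-reserve so two services cannot overcommit concurrently.
ACQUIRE = r.register_script("""
local used = tonumber(redis.call('GET', KEYS[1]) or '0')
local want = tonumber(ARGV[1])
local budget = tonumber(ARGV[2])
if used + want <= budget then
    redis.call('INCRBY', KEYS[1], want)
    return 1
end
return 0
""")

def acquire_vram(mb: int, timeout_s: float = 30.0) -> bool:
    """Block until `mb` of VRAM is reserved or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if ACQUIRE(keys=[USED_KEY], args=[mb, VRAM_BUDGET_MB]) == 1:
            return True
        time.sleep(0.2)         # another model holds the budget; wait and retry
    return False

def release_vram(mb: int) -> None:
    r.decrby(USED_KEY, mb)

if acquire_vram(3500):          # e.g. before loading the chat model
    try:
        pass                    # load the model / run inference here
    finally:
        release_vram(3500)
```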
Real-time ASR using Parakeet TDT 0.6B. Includes speaker diarization (Sortformer) and streaming audio processing.
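For the ASR piece, batch transcription with the publicly released Parakeet TDT 0.6B checkpoint looks roughly like this under the NeMo toolkit; the audio path is a placeholder, and the platform's streaming and Sortformer diarization pipeline is more involved than this offline call.

```python
# Sketch of batch transcription with NVIDIA NeMo and the public Parakeet TDT
# 0.6B checkpoint. The audio file is a placeholder; real-time streaming and
# speaker diarization require additional plumbing not shown here.
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained(
    model_name="nvidia/parakeet-tdt-0.6b-v2"
)
outputs = asr_model.transcribe(["meeting_clip.wav"])   # hypothetical file
# Newer NeMo versions return Hypothesis objects; older ones return strings.
print(outputs[0].text if hasattr(outputs[0], "text") else outputs[0])
```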
Retrieval-augmented generation with FAISS vector storage. Handles conversation memory and knowledge bases.
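The retrieval half of that service can be sketched with FAISS; the embedding model used here (all-MiniLM-L6-v2 via sentence-transformers) is an assumption made to keep the example self-contained, and the platform may chunk and embed text differently.

```python
# Sketch of the retrieval side of RAG with FAISS. The embedding model is an
# illustrative assumption; normalised embeddings + an inner-product index
# give cosine-similarity search.
import faiss
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")
documents = [
    "Refund policy: customers may return items within 30 days.",
    "Q3 revenue grew 12% quarter over quarter.",
    "Support hours are 9am to 5pm on weekdays.",
]

vectors = encoder.encode(documents, normalize_embeddings=True)
index = faiss.IndexFlatIP(vectors.shape[1])
index.add(vectors)

query = encoder.encode(["when can I return a purchase?"], normalize_embeddings=True)
scores, ids = index.search(query, 2)
for score, doc_id in zip(scores[0], ids[0]):
    print(f"{score:.3f}  {documents[doc_id]}")
```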
Multi-dimensional emotion analysis using fine-tuned DistilRoBERTa. Provides sentiment scoring and temporal tracking.
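A sketch of multi-label emotion scoring with a DistilRoBERTa classifier follows; the public j-hartmann/emotion-english-distilroberta-base checkpoint stands in for the platform's own fine-tuned model, and temporal tracking is not shown.

```python
# Sketch of multi-dimensional emotion scoring with a DistilRoBERTa classifier.
# The public checkpoint below is a stand-in for the platform's fine-tuned model;
# scores are per-emotion probabilities from the classification head.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="j-hartmann/emotion-english-distilroberta-base",
    top_k=None,          # return scores for every emotion, not just the top one
)

results = classifier("The rollout went smoothly, but I'm worried about churn.")
scores = results[0] if isinstance(results[0], list) else results  # handle either nesting
for item in sorted(scores, key=lambda s: s["score"], reverse=True):
    print(f"{item['label']:>10}  {item['score']:.3f}")
```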
"System 2" thinking with 29 specialized engines. Titan AutoML, Oracle Causal, Newton Symbolic, and more.
Analytics and AutoML experiments. Revenue forecasting, anomaly detection, and business intelligence.
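Purely as an illustration of the anomaly-detection capability, the sketch below flags outlying days in a synthetic revenue series with scikit-learn's IsolationForest; it is a generic stand-in, not the platform's actual AutoML or forecasting engine.

```python
# Illustrative anomaly-detection sketch on synthetic daily revenue figures.
# IsolationForest is a stand-in example, not the platform's actual engine.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(seed=7)
revenue = rng.normal(loc=10_000, scale=500, size=90)   # 90 days of synthetic revenue
revenue[[20, 55]] = [3_000, 25_000]                    # two injected anomalies

model = IsolationForest(contamination=0.05, random_state=7)
labels = model.fit_predict(revenue.reshape(-1, 1))     # -1 marks an outlier

anomalous_days = np.flatnonzero(labels == -1)
print("anomalous days:", anomalous_days.tolist())
```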
29 ML Engines
Specialized processing engines for analytical thinking, grouped into four categories:
- Core AI Models
- AutoML and Predictive
- Financial Analysis
- Advanced Analytics
Kubernetes Ready
Production-ready K8s manifests with NVIDIA GPU passthrough
Manifests
- namespace.yaml
- secrets.yaml (12 keys)
- services.yaml (ClusterIP)
- deployments.yaml (842 lines)
- ingress.yaml (NGINX)
- nvidia-device-plugin.yaml
GPU Passthrough
- NVIDIA Device Plugin DaemonSet
- Direct device mounts (/dev/nvidia*)
- CUDA runtime in containers
- Resource limits: nvidia.com/gpu: 1
- Shared GPU across pods
Kustomize Overlays
- base/ (common resources)
- overlays/local/ (dev)
- overlays/install/ (prod)
- Health check probes
- Init containers for setup
Technology
Production-grade tools and frameworks