Architecture

A technical deep-dive into the NeMo AI Platform: 10 services, 29 ML engines, and a full Kubernetes deployment, all orchestrated on a single consumer GPU.

How It Works

A layered architecture designed for GPU efficiency and real-time processing

Clients

  • Web Frontend (20+ pages)
  • Flutter Mobile (cross-platform)

Gateway

  • API Gateway :8000 (JWT Auth + Rate Limiting + WebSocket)

Services

  • Gemma :8001 (LLM)
  • Transcription :8003 (ASR)
  • RAG :8004 (Context)
  • Emotion :8005 (Analysis)
  • ML Service :8006 (29 Engines)

GPU

  • GPU Coordinator :8002 (Semaphore Locks + Preemptive Scheduling)

Infra

  • Redis :6379 (Cache + Locks)
  • PostgreSQL :5432 (Task Queue)
  • FAISS (Vector Store)
  • SQLCipher (Encrypted DB)

10 Microservices

8 application services + 2 infrastructure services working in concert

api-gateway :8000

Central entry point handling JWT auth, request routing, rate limiting, and WebSocket connections for real-time updates.
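
As a rough illustration of the rate-limiting part, here is a minimal token-bucket sketch. The page does not say which algorithm the gateway uses, so the token bucket, the `TokenBucket` class, and its `rate` and `capacity` parameters are all assumptions made for this example.

```python
import time
from typing import Optional


class TokenBucket:
    """Hypothetical per-client limiter: refills `rate` tokens per second, stores at most `capacity`."""

    def __init__(self, rate: float, capacity: float, now: Optional[float] = None) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start with a full bucket
        self.updated = time.monotonic() if now is None else now

    def allow(self, now: Optional[float] = None) -> bool:
        """Return True and spend one token if the request may proceed."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        # Refill for the time that has passed, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


bucket = TokenBucket(rate=1.0, capacity=2, now=0.0)
print([bucket.allow(now=0.0) for _ in range(3)])
```

A gateway would typically keep one bucket per client or per token subject and answer with HTTP 429 when `allow()` returns False. The `now` argument exists only to make the sketch deterministic to test.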

gemma-service :8001

Conversational AI powered by Gemma 3 4B (Q4 quantized). Integrates RAG context, emotional awareness, and business analysis.
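
One way to picture how retrieved context and an emotion label could reach the model is a simple prompt-assembly step. The sketch below is an assumption, not the service's actual template: `build_prompt`, its parameters, and the section labels are hypothetical, and the sample inputs are invented for illustration.

```python
from typing import List


def build_prompt(question: str, rag_chunks: List[str], emotion: str, max_chunks: int = 3) -> str:
    """Assemble a plain-text prompt from retrieved context and an emotion label (hypothetical layout)."""
    # Keep only the first `max_chunks` retrieved passages to bound prompt length.
    context = "\n".join(f"- {chunk}" for chunk in rag_chunks[:max_chunks])
    if not context:
        context = "- (no context retrieved)"
    return (
        "Context:\n"
        f"{context}\n\n"
        f"Detected user emotion: {emotion}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


print(build_prompt("Why did revenue dip?", ["Q3 revenue fell", "Churn rose"], "concerned"))
```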

gpu-coordinator :8002

VRAM orchestration via Redis semaphores. Manages model loading, memory allocation, and preemptive pausing.
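
The coordination idea can be sketched with an in-memory stand-in. This example assumes a fixed VRAM budget and integer priorities; it replaces the Redis semaphores with a plain dictionary, and the `VramCoordinator` class, the sizes, and the priority values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Lease:
    owner: str
    vram_mb: int
    priority: int  # higher value wins (assumed convention)


class VramCoordinator:
    """In-memory sketch of budgeted VRAM leases with preemptive pausing."""

    def __init__(self, total_mb: int) -> None:
        self.total_mb = total_mb
        self.active: Dict[str, Lease] = {}
        self.paused: List[str] = []

    def free_mb(self) -> int:
        return self.total_mb - sum(l.vram_mb for l in self.active.values())

    def acquire(self, owner: str, vram_mb: int, priority: int) -> bool:
        """Grant a lease, pausing lower-priority leases only if that frees enough VRAM."""
        victims = sorted(
            (l for l in self.active.values() if l.priority < priority),
            key=lambda l: l.priority,
        )
        reclaimable = self.free_mb()
        chosen: List[Lease] = []
        for lease in victims:
            if reclaimable >= vram_mb:
                break
            chosen.append(lease)
            reclaimable += lease.vram_mb
        if reclaimable < vram_mb:
            return False  # request denied, nothing is paused
        for lease in chosen:
            del self.active[lease.owner]
            self.paused.append(lease.owner)
        self.active[owner] = Lease(owner, vram_mb, priority)
        return True

    def release(self, owner: str) -> None:
        self.active.pop(owner, None)


coordinator = VramCoordinator(total_mb=8000)
coordinator.acquire("ml-service", 3000, priority=1)
coordinator.acquire("gemma-service", 4000, priority=2)
coordinator.acquire("transcription-service", 2000, priority=3)
print(coordinator.paused)
```

In a multi-process deployment the same bookkeeping would need to live in shared storage with atomic updates, which is presumably what the Redis semaphores provide.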

transcription-service :8003

Real-time ASR using Parakeet TDT 0.6B. Includes speaker diarization (Sortformer) and streaming audio processing.

rag-service :8004

Retrieval-augmented generation with FAISS vector storage. Handles conversation memory and knowledge bases.
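
The retrieval step can be illustrated without FAISS itself. The sketch below is a pure-Python stand-in that does exhaustive inner-product search over normalized vectors, which is the kind of exact search a flat FAISS index performs; the `FlatIndex` class and the two-dimensional toy vectors are assumptions for the example, and real embeddings would come from a model such as MiniLM.

```python
import math
from typing import List, Sequence, Tuple


def normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


class FlatIndex:
    """Brute-force stand-in for a vector store (hypothetical, not the FAISS API)."""

    def __init__(self) -> None:
        self.texts: List[str] = []
        self.vectors: List[List[float]] = []

    def add(self, text: str, vector: Sequence[float]) -> None:
        self.texts.append(text)
        self.vectors.append(normalize(vector))

    def search(self, query: Sequence[float], k: int) -> List[Tuple[str, float]]:
        """Return the k stored texts with the highest cosine similarity to the query."""
        q = normalize(query)
        scored = [
            (text, sum(a * b for a, b in zip(q, vec)))
            for text, vec in zip(self.texts, self.vectors)
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


index = FlatIndex()
index.add("refund policy", [1.0, 0.0])
index.add("shipping times", [0.0, 1.0])
index.add("refund window", [0.9, 0.1])
print(index.search([1.0, 0.05], k=2))
```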

emotion-service :8005

Multi-dimensional emotion analysis using fine-tuned DistilRoBERTa. Provides sentiment scoring and temporal tracking.
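
Temporal tracking can be approximated by smoothing per-utterance scores over time. The sketch below uses an exponential moving average; this is an assumed technique chosen for illustration, and the `smooth_sentiment` function and its `alpha` parameter are hypothetical rather than part of the service.

```python
from typing import List, Optional, Sequence


def smooth_sentiment(scores: Sequence[float], alpha: float = 0.3) -> List[float]:
    """Exponential moving average: higher alpha reacts faster to the latest score."""
    smoothed: List[float] = []
    current: Optional[float] = None
    for score in scores:
        # The first score seeds the average; later scores are blended in.
        current = score if current is None else alpha * score + (1 - alpha) * current
        smoothed.append(current)
    return smoothed


print(smooth_sentiment([1.0, 0.0], alpha=0.5))
```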

ml-service :8006

"System 2" thinking with 29 specialized engines. Titan AutoML, Oracle Causal, Newton Symbolic, and more.

insights-service :8010

Analytics and AutoML experiments. Revenue forecasting, anomaly detection, and business intelligence.
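
As a simple illustration of anomaly detection on a business metric, the sketch below flags values whose z-score exceeds a threshold. The page does not describe the method the service uses, so the z-score rule, the `zscore_anomalies` function, and the sample figures are assumptions.

```python
from statistics import mean, stdev
from typing import List, Sequence


def zscore_anomalies(values: Sequence[float], threshold: float = 3.0) -> List[int]:
    """Return indices of values more than `threshold` sample standard deviations from the mean."""
    if len(values) < 2:
        return []
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


daily_revenue = [100, 102, 98, 101, 99, 100, 250]  # invented sample data
print(zscore_anomalies(daily_revenue, threshold=2.0))
```

With only seven points, no z-score can exceed (n - 1) / sqrt(n), about 2.27, so the example lowers the threshold to 2.0. Longer series can use the more common cutoff of 3.0.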

29 ML Engines

Specialized processing engines for analytical thinking

Core AI Models

Gemma 3 4B
Parakeet ASR
TitaNet Speaker
DistilRoBERTa
MiniLM Embeddings
Sortformer Diar

AutoML and Predictive

Titan AutoML
Oracle Causal
Newton Symbolic
Chronos Temporal
Galileo Geometric
Scout Discovery

Financial Analysis

Revenue Forecast
Cash Flow
Budget Variance
Profit Margin
Cost Optimization
Pricing Strategy

Advanced Analytics

Chaos Non-Linear
Anomaly Detector
Clustering Engine
Customer LTV
Market Basket
Spend Pattern
FAISS Retrieval
Universal Graph
Flash Inference
Mirror Synthetic
RAG Evaluation

Kubernetes Ready

Production-ready K8s manifests with NVIDIA GPU passthrough

Manifests

  • namespace.yaml
  • secrets.yaml (12 keys)
  • services.yaml (ClusterIP)
  • deployments.yaml (842 lines)
  • ingress.yaml (NGINX)
  • nvidia-device-plugin.yaml

GPU Passthrough

  • NVIDIA Device Plugin DaemonSet
  • Direct device mounts (/dev/nvidia*)
  • CUDA runtime in containers
  • Resource limits: nvidia.com/gpu: 1
  • Shared GPU across pods
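
The `nvidia.com/gpu: 1` resource limit can be shown in context with a small manifest builder. This is a sketch only: it emits a bare Pod rather than the platform's Deployments, and the `gpu_pod_manifest` helper, the pod name, and the image reference are hypothetical. The resulting JSON is a form that `kubectl apply -f` accepts.

```python
import json


def gpu_pod_manifest(name: str, image: str, gpus: int = 1) -> dict:
    """Build a minimal Pod manifest that requests GPUs through the NVIDIA device plugin."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [
                {
                    "name": name,
                    "image": image,  # hypothetical image reference
                    # Extended resources such as GPUs are declared under limits.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ]
        },
    }


manifest = gpu_pod_manifest("gemma-service", "example.local/gemma-service:dev")
print(json.dumps(manifest, indent=2))
```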

Kustomize Overlays

  • base/ (common resources)
  • overlays/local/ (dev)
  • overlays/install/ (prod)
  • Health check probes
  • Init containers for setup

Technology

Production-grade tools and frameworks

AI / ML

PyTorch 2.0
NVIDIA NeMo
Hugging Face Transformers
FAISS
llama.cpp (GGUF)
scikit-learn

Backend

Python 3.11
FastAPI + Uvicorn
PostgreSQL 15
Redis
WebSocket
SQLCipher (AES-256)

Infrastructure

Docker + Compose
Kubernetes + Kustomize
CUDA 12.1
NGINX Ingress
Ubuntu 22.04
systemd