Architecture

A technical deep dive into the NeMo AI Platform architecture: 12 distributed services, 37 specialized ML engines, and a full Kubernetes deployment with an Istio service mesh, all running on a single consumer GPU.

System Overview

How It Works

A layered architecture designed for GPU efficiency and real-time processing.

Clients

Web Frontend (20+ pages)
Flutter Mobile (Cross-platform)

Gateway

NGINX (TLS Termination) :443
API Gateway (Auth / Limits) :8000

Services

Gemma (LLM AI) :8001
Transcription (ASR) :8003
RAG (Context) :8004
Emotion (Analysis) :8005
ML Service (37 Engines) :8006
Insights (Analytics) :8010
Fiserv (Banking) :8015
N8N (Automation) :8011

Infra

GPU Coordinator (Queue Service) :8002
Redis (Cache/Locks) :6379
PostgreSQL (Persistence) :5432

Core Services

12 Microservices

10 application services + 2 infrastructure services working in concert, all Dockerized.

api-gateway :8000

Central entry point handling JWT auth, request routing, rate limiting, and WebSocket connections for real-time updates.
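The rate-limiting duty described above can be illustrated with a token bucket. This is a minimal sketch under stated assumptions: the page does not say which algorithm the gateway uses, and the class name, capacity, and refill rate are hypothetical. The injectable `clock` parameter exists only to make the behaviour easy to test.

```python
import time


class TokenBucket:
    """Per-client token bucket (hypothetical; not the gateway's actual code)."""

    def __init__(self, capacity: int, refill_per_second: float, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock
        self.tokens = float(capacity)  # start with a full bucket
        self.updated_at = clock()

    def allow(self) -> bool:
        """Return True if the request may proceed, consuming one token."""
        now = self.clock()
        elapsed = now - self.updated_at
        self.updated_at = now
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A gateway would typically keep one bucket per user or API key and answer with HTTP 429 when `allow()` returns False.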

gemma-service :8001

Conversational AI powered by Gemma 3 4B (Q4 quantized). Integrates RAG context, emotional awareness, and business analysis.

queue-service (GPU Coord) :8002

VRAM orchestration via Redis semaphores. Manages model loading, memory allocation, and preemptive pausing between Gemma and ASR.
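One way to picture the coordination is a single GPU lease held in a key-value store, which acts as a binary semaphore. The sketch below is an assumption, not the service's implementation: `GpuLease`, the key name, and the TTL are hypothetical, and preemptive pausing is not modelled. `InMemoryStore` is a stand-in that mimics only the `set(..., nx=True, ex=...)`, `get`, and `delete` calls of a redis-py client so the example runs without a Redis server.

```python
import time
import uuid


class InMemoryStore:
    """Stand-in for the small subset of the redis-py client API used below."""

    def __init__(self, clock=time.monotonic):
        self._data = {}
        self._clock = clock

    def _purge(self, name):
        # Drop the key if its expiry time has passed.
        entry = self._data.get(name)
        if entry and entry[1] is not None and entry[1] <= self._clock():
            del self._data[name]

    def set(self, name, value, ex=None, nx=False):
        self._purge(name)
        if nx and name in self._data:
            return None  # like redis-py: None when NX prevents the write
        expires = self._clock() + ex if ex is not None else None
        self._data[name] = (value, expires)
        return True

    def get(self, name):
        self._purge(name)
        entry = self._data.get(name)
        return entry[0] if entry else None

    def delete(self, name):
        return 1 if self._data.pop(name, None) is not None else 0


class GpuLease:
    """Hypothetical single-holder GPU lease (a binary semaphore with a TTL)."""

    def __init__(self, store, key="gpu:lease", ttl_seconds=30):
        self.store = store
        self.key = key
        self.ttl_seconds = ttl_seconds

    def acquire(self, owner):
        """Return a lease token, or None if another service holds the GPU."""
        token = f"{owner}:{uuid.uuid4().hex}"
        if self.store.set(self.key, token, ex=self.ttl_seconds, nx=True):
            return token
        return None

    def release(self, token):
        """Release only if the caller still holds the lease."""
        # Note: get-then-delete is not atomic. Against a real Redis server this
        # check would normally run in a Lua script, and get() returns bytes
        # unless the client is created with decode_responses=True.
        if self.store.get(self.key) == token:
            self.store.delete(self.key)
            return True
        return False
```

The TTL means a crashed holder cannot block the GPU forever: the lease simply expires and the next caller can acquire it.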

transcription-service :8003

Real-time ASR using Parakeet RNNT 0.6B. Includes speaker diarization (Sortformer) and streaming audio processing.

rag-service :8004

Retrieval-augmented generation with FAISS vector storage. Handles conversation memory and knowledge bases.
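The retrieval step can be sketched as an exhaustive nearest-neighbour search over normalized vectors, which is the kind of search a flat inner-product FAISS index performs. To stay dependency-free, this sketch uses plain Python lists rather than FAISS; `TinyVectorStore`, the toy two-dimensional vectors, and the sample texts are all hypothetical.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so the dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


class TinyVectorStore:
    """Brute-force stand-in for a vector index (illustrative only)."""

    def __init__(self):
        self._items = []  # list of (text, unit_vector)

    def add(self, text, vector):
        self._items.append((text, normalize(vector)))

    def search(self, query, k=3):
        """Return the k most similar texts as (text, score), best first."""
        q = normalize(query)
        scored = [
            (sum(a * b for a, b in zip(q, vec)), text)
            for text, vec in self._items
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [(text, score) for score, text in scored[:k]]
```

In a RAG pipeline the returned texts would be added to the LLM prompt as context. Real embeddings would come from an embedding model such as the MiniLM model listed under Core AI Models.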

emotion-service :8005

Multi-dimensional emotion analysis using fine-tuned DistilRoBERTa. Provides sentiment scoring and temporal tracking.
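Temporal tracking can be pictured as smoothing per-utterance sentiment scores over time. The page does not describe the method used, so the exponential moving average below, the `SentimentTracker` name, and the `alpha` default are assumptions for illustration.

```python
class SentimentTracker:
    """Exponential moving average over sentiment scores (hypothetical sketch)."""

    def __init__(self, alpha: float = 0.3):
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self.value = None  # no observations yet

    def update(self, score: float) -> float:
        """Fold a new score into the running average and return it."""
        if self.value is None:
            self.value = score  # first observation seeds the average
        else:
            self.value = self.alpha * score + (1.0 - self.alpha) * self.value
        return self.value
```

A higher `alpha` reacts faster to mood changes; a lower one gives a steadier trend line.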

ml-service :8006

"System 2" thinking with 37 specialized engines, including Titan AutoML, Oracle Causal, Newton Symbolic, and Salesforce CRM analytics.

insights-service :8010

Analytics and AutoML experiments. Revenue forecasting, anomaly detection, and business intelligence visualization.

nginx :443

HTTPS reverse proxy with TLS 1.3 termination. Handles SSL certificates, security headers, and load balancing.

n8n-service :8011

Voice command integration and smart home automation. Connects to Voice Monkey for Alexa control.

fiserv-service :8015

Banking hub integrating with Fiserv API for account data, transactions, and financial analytics.

redis + postgres Infra

Persistence layer. Redis for high-speed caching and locks; PostgreSQL for durable data and task queues.

ML Capabilities

37 Specialized ML Engines

A comprehensive suite of analytical engines powered by the ML Service.

Core AI Models

Gemma 3 4B

Parakeet ASR

TitaNet Speaker

DistilRoBERTa (Emotion)

MiniLM Embeddings

Sortformer Diarization

AutoML & Predictive (Titan Series)

Titan AutoML

Oracle Causal

Newton Symbolic

Chronos Temporal

Galileo Geometric

Scout Discovery

Predictive Engine

Statistical Engine

Trend Engine

Financial Analysis

Revenue Forecast

Cash Flow

Budget Variance

Profit Margin

Cost Optimization

Pricing Strategy

ROI Prediction

Inventory Optimization

Advanced Analytics

Chaos Non-Linear

Anomaly Detector

Clustering Engine

Customer LTV

Market Basket

Spend Pattern

Deep Feature

Universal Graph

Flash Inference

Mirror Synthetic

RAG Evaluation

Quality Insights

Salesforce CRM Analytics

Churn Prediction

Next Best Action

Deal Velocity

Competitive Intel

Customer 360

Lead Scoring

Opportunity Score

Deployment

Kubernetes Ready

Production-ready K8s manifests with NVIDIA GPU passthrough.

Manifests

namespace.yaml
secrets.yaml (12 keys)
services.yaml (ClusterIP)
deployments.yaml (842 lines)
ingress.yaml (NGINX)
nvidia-device-plugin.yaml

GPU Passthrough

NVIDIA Device Plugin DaemonSet
Direct device mounts (/dev/nvidia*)
CUDA runtime in containers
Resource limits: nvidia.com/gpu: 1
Shared GPU across pods
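The resource-limit item above can be sketched as a Deployment manifest built in Python and printed as JSON, which `kubectl apply -f` accepts alongside YAML. This is an illustration, not an excerpt from deployments.yaml: the `gpu_deployment` helper, the `nemo` namespace, and the image name are hypothetical. It covers only the `nvidia.com/gpu: 1` limit, not the direct device mounts or the GPU-sharing setup.

```python
import json


def gpu_deployment(name, image, port, namespace="nemo"):
    """Build a minimal Deployment that requests one NVIDIA GPU (hypothetical helper)."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "namespace": namespace, "labels": labels},
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            "ports": [{"containerPort": port}],
                            # Extended resource advertised by the NVIDIA device plugin.
                            "resources": {"limits": {"nvidia.com/gpu": 1}},
                        }
                    ]
                },
            },
        },
    }


# Service name and port come from the page; the image is a placeholder.
manifest = gpu_deployment("gemma-service", "example.local/gemma-service:dev", 8001)
print(json.dumps(manifest, indent=2))
```

By default `nvidia.com/gpu` is allocated as a whole device per container, so sharing one GPU across pods needs additional configuration that the page does not detail.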

Kustomize Overlays

base/ (common resources)
overlays/local/ (dev)
overlays/install/ (prod)
Health check probes
Init containers for setup

Istio Service Mesh

VirtualService routing
DestinationRule policies
Gateway TLS termination
mTLS between services
Traffic management
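The mTLS item is usually expressed in Istio as a PeerAuthentication resource. The sketch below builds one in Python and prints it as JSON. The `strict_mtls` helper and the `nemo` namespace are hypothetical, and the page does not state whether the mesh runs in STRICT or PERMISSIVE mode, so STRICT is an assumption.

```python
import json


def strict_mtls(namespace):
    """Namespace-wide PeerAuthentication requiring mTLS (assumed STRICT mode)."""
    return {
        "apiVersion": "security.istio.io/v1beta1",
        "kind": "PeerAuthentication",
        "metadata": {"name": "default", "namespace": namespace},
        "spec": {"mtls": {"mode": "STRICT"}},
    }


policy = strict_mtls("nemo")
print(json.dumps(policy, indent=2))
```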