Complete Voice-Controlled Life Automation System
Enterprise AI + AR Glasses + IoT Hub + Smart Home Integration
"From voice command to smart home action in milliseconds"
This is not just a Python server - this is a complete AI-powered life automation product.
The NeMo AI Ecosystem integrates enterprise-grade voice intelligence, AR wearables, IoT orchestration, and multi-platform connectivity into a unified system. Built for real-world use with scalable architecture, security, and enterprise patterns.
This system connects 8+ platforms to create a unified AI-powered automation ecosystem:
FastAPI 0.110, Python 3.10, Uvicorn ASGI, WebSocket support
PyTorch 2.3, NVIDIA NeMo 2.5, Transformers 4.53, llama-cpp-python 0.2.90
Parakeet-TDT 0.6B (ASR), TitaNet (Speaker ID), Gemma 3 4B (LLM), MiniLM-L6-v2 (Embeddings), DistilRoBERTa (Emotion)
Docker multi-stage builds, CUDA 12.1, GPU passthrough, Linux optimization
Bcrypt password hashing, AES-256 encryption, RBAC, JWT sessions, audit logging
FAISS CPU 1.8, sentence embeddings, semantic search, RAG context retrieval
Deepgram (ASR backup), OpenAI GPT (advanced NLP), VoiceMonkey (Alexa)
Flutter (Android), Even Reality SDK (AR glasses), IoT Hub orchestration, REST APIs
Optimized for 6GB VRAM (GTX 1660 Ti)
100% data separation by speaker ID
Contextual memory with semantic search
Enterprise-grade authentication & encryption
Unified smart home control hub
Even Reality G1 wearable interface
Android, Web, Desktop, AR Wearables
Docker deployment with GPU support
Full REST and WebSocket APIs with comprehensive documentation:
POST /transcribe - Real-time audio transcription with NeMo Parakeet
POST /speaker/enroll - Speaker voice enrollment for identification
POST /speaker/identify - Identify speaker from audio sample
POST /emotion/analyze - Emotion detection from text or transcript
POST /gemma/chat - LLM inference with RAG context injection
WS /gemma/stream - WebSocket streaming for real-time chat
POST /rag/search - Semantic search across conversation history
GET /memories - Retrieve user memories with filtering
POST /memories - Store new contextual memory
GET /transcripts - Access historical transcriptions
POST /iot/command - Execute smart home device commands
GET /iot/status - Query device states
POST /voicemonkey/trigger - Trigger Alexa routines
POST /auth/login - User authentication with session creation
POST /auth/logout - Session invalidation
GET /admin/users - User management (admin only)
GET /health - System health check with GPU status
Deterministic GPU allocation prevents resource contention. By reserving the GPU exclusively for the Gemma LLM, we guarantee consistent inference performance. NeMo Parakeet and TitaNet run efficiently on CPU with acceptable latency for non-real-time transcription.
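A minimal client-side sketch against two of the endpoints listed above. The host, port, and request field names here are assumptions for illustration, not the documented schema:

# Hypothetical client calls; adjust host, port, and field names to the real API.
import requests

BASE = "http://localhost:8000"

# Real-time transcription: upload a local WAV file
with open("sample.wav", "rb") as f:
    resp = requests.post(f"{BASE}/transcribe", files={"file": f}, timeout=60)
resp.raise_for_status()
print(resp.json())

# LLM chat with server-side RAG context injection
resp = requests.post(
    f"{BASE}/gemma/chat",
    json={"message": "Turn off the living room lights if nobody is home."},
    timeout=120,
)
resp.raise_for_status()
print(resp.json())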
4-bit quantization (Q4_K_M) enables running Gemma 3 4B on 6GB VRAM with minimal quality loss (<3% perplexity increase). Reduces memory footprint from 8GB to ~3.5GB while maintaining coherent responses. Essential for consumer-grade hardware deployment.
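A rough back-of-the-envelope check of that footprint (a sketch only; exact figures depend on the GGUF build, context length, and KV-cache settings):

# Approximate VRAM estimate for Gemma 3 4B at different precisions.
PARAMS = 4e9                  # ~4 billion parameters
BITS_FP16 = 16
BITS_Q4_K_M = 4.8             # Q4_K_M averages roughly 4.5-5 bits per weight
KV_AND_BUFFERS_GB = 1.0       # assumed overhead for 8K context + CUDA buffers

fp16_gb = PARAMS * BITS_FP16 / 8 / 1e9
q4_gb = PARAMS * BITS_Q4_K_M / 8 / 1e9

print(f"fp16 weights:   {fp16_gb:.1f} GB")                          # ~8 GB
print(f"Q4_K_M weights: {q4_gb:.1f} GB")                            # ~2.4 GB
print(f"Q4_K_M + cache: {q4_gb + KV_AND_BUFFERS_GB:.1f} GB on a 6 GB card")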
Separate services for ASR, speaker ID, emotion, LLM, and RAG enable independent scaling and testing. Each service has clear interfaces and responsibilities. Failure in one service doesn't cascade to others. Easy to swap models or providers without refactoring.
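One way to keep those service boundaries explicit is a per-service interface, so local and cloud backends stay interchangeable. The class and function names below are illustrative, not the project's actual modules:

# Illustrative service boundary: the rest of the system depends only on
# this interface, so ASR backends can be swapped without refactoring.
from typing import Protocol

class ASRService(Protocol):
    def transcribe(self, audio: bytes, sample_rate: int = 16000) -> str: ...

class LocalParakeetASR:
    """Hypothetical wrapper around the on-premises NeMo Parakeet model."""
    def transcribe(self, audio: bytes, sample_rate: int = 16000) -> str:
        raise NotImplementedError("call into NeMo here")

class DeepgramASR:
    """Hypothetical wrapper around the Deepgram cloud API."""
    def transcribe(self, audio: bytes, sample_rate: int = 16000) -> str:
        raise NotImplementedError("call the Deepgram SDK here")

def handle_voice_command(asr: ASRService, audio: bytes) -> str:
    # Callers never import a concrete model; a failure stays inside one service.
    return asr.transcribe(audio)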
Privacy-focused design runs ASR, speaker ID, and LLM locally. Only falls back to Deepgram/OpenAI APIs when local resources are unavailable. User data stays on-premises by default. Reduces cloud costs and API dependencies.
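A minimal sketch of that local-first decision, with the concrete backends passed in as callables (their implementations are assumed):

# Local-first inference with cloud fallback: the on-premises path is tried
# first; the cloud API is only used when local resources are unavailable.
import logging
from typing import Callable

logger = logging.getLogger("asr")

def transcribe_with_fallback(
    audio: bytes,
    local_asr: Callable[[bytes], str],   # e.g. NeMo Parakeet wrapper
    cloud_asr: Callable[[bytes], str],   # e.g. Deepgram wrapper
) -> str:
    try:
        return local_asr(audio)          # preferred: data stays on-premises
    except Exception as exc:
        logger.warning("Local ASR unavailable (%s); falling back to cloud", exc)
        return cloud_asr(audio)          # fallback only, incurs API cost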
Centralized orchestration layer abstracts device-specific protocols. Unified interface for VoiceMonkey, curl commands, and direct HTTP APIs. Enables complex automation routines that span multiple devices. Simplifies error handling and retry logic.
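A sketch of how that unified interface might route a single command to different backends (VoiceMonkey versus a direct HTTP device API). The device registry, URLs, and payload fields are illustrative assumptions:

# Unified IoT command dispatch: callers name a device and an action,
# the hub picks the protocol.
import requests

DEVICE_REGISTRY = {
    "living_room_lights": {"backend": "voicemonkey", "routine": "lights-off"},
    "thermostat":         {"backend": "http", "url": "http://192.168.1.50/api/state"},
}

def send_command(device: str, action: str) -> bool:
    entry = DEVICE_REGISTRY[device]
    if entry["backend"] == "voicemonkey":
        # Trigger an Alexa routine through the hub's VoiceMonkey endpoint.
        resp = requests.post(
            "http://localhost:8000/voicemonkey/trigger",
            json={"routine": entry["routine"], "action": action},
            timeout=10,
        )
    else:
        # Talk to the device's own HTTP API directly.
        resp = requests.post(entry["url"], json={"action": action}, timeout=10)
    return resp.ok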
This pattern ensures Gemma LLM gets deterministic GPU access while preventing other services from competing for VRAM:
# GPU-exclusive enforcement for Gemma LLM
import llama_cpp
from llama_cpp import Llama

from src.utils.gpu_utils import enforce_gpu_only, clear_gpu_cache, get_vram_usage

# Clear any residual VRAM before loading
clear_gpu_cache(force=True)

# Enforce GPU-only context (sets CUDA_VISIBLE_DEVICES)
with enforce_gpu_only(device_id=0):
    model = Llama(
        model_path=GEMMA_MODEL_PATH,  # Q4_K_M GGUF path, defined in the project config
        n_gpu_layers=-1,              # Offload all layers to GPU
        n_ctx=8192,                   # 8K context window
        n_batch=512,
        verbose=False,
    )

# Verify GPU offload succeeded
assert llama_cpp.llama_supports_gpu_offload(), "GPU offload failed!"
print(f"[GEMMA] Loaded {GEMMA_MODEL_PATH} with all layers on GPU")
print(f"[GEMMA] VRAM allocated: {get_vram_usage()['allocated_gb']:.2f} GB")

# Other services run with CUDA_VISIBLE_DEVICES=""
# ensuring they never see the GPU
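For readers wondering what enforce_gpu_only does under the hood, here is one plausible shape for such a context manager (a sketch only; the actual src.utils.gpu_utils implementation may differ): it pins CUDA_VISIBLE_DEVICES while the model loads and restores the previous value afterwards.

# Illustrative only: one way a context manager like enforce_gpu_only could work.
import os
from contextlib import contextmanager

@contextmanager
def enforce_gpu_only(device_id: int = 0):
    previous = os.environ.get("CUDA_VISIBLE_DEVICES")
    os.environ["CUDA_VISIBLE_DEVICES"] = str(device_id)  # expose exactly one GPU
    try:
        yield
    finally:
        # Restore whatever the process had before
        # (possibly "" for the CPU-only services).
        if previous is None:
            os.environ.pop("CUDA_VISIBLE_DEVICES", None)
        else:
            os.environ["CUDA_VISIBLE_DEVICES"] = previous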
This system is built for real-world deployment with enterprise-grade practices:
12,000+ lines of tested Python, comprehensive error handling, logging, and monitoring
OWASP compliance, encryption at rest, secure session management, audit trails
Microservices, async I/O, connection pooling, rate limiting, load-ready
Docker containerization, GPU passthrough, health checks (see the sketch below), automated testing, CI/CD pipelines
Android app, AR glasses, web dashboard, REST API for integration
API docs, architecture diagrams, deployment guides, troubleshooting wikis
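As an illustration of the health checks mentioned above (the GET /health endpoint with GPU status), a minimal FastAPI sketch; the exact fields in the real response are assumptions:

# Minimal health endpoint reporting GPU status.
# The real route lives in the project's FastAPI app; field names are illustrative.
import torch
from fastapi import FastAPI

app = FastAPI()

@app.get("/health")
def health() -> dict:
    gpu_ok = torch.cuda.is_available()
    vram_gb = torch.cuda.memory_allocated() / 1e9 if gpu_ok else 0.0
    return {
        "status": "ok",
        "gpu_available": gpu_ok,
        "vram_allocated_gb": round(vram_gb, 2),
    }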
This ecosystem demonstrates full-stack integration: enterprise-grade AI engineering, IoT orchestration, and end-to-end product development. The architecture, code quality, and integration depth showcase scalable software engineering practices.