Complete Voice-Controlled Life Automation System
Enterprise AI + AR Glasses + IoT Hub + Smart Home Integration
"From voice command to smart home action in milliseconds"
This is not just a Python server - this is a complete AI-powered life automation product.
The NeMo AI Ecosystem integrates enterprise-grade voice intelligence, AR wearables, IoT orchestration, and multi-platform connectivity into a unified system. Built for real-world use with scalable architecture, security, and enterprise patterns.
This system connects 8+ platforms to create a unified AI-powered automation ecosystem:
FastAPI 0.110, Python 3.10, Uvicorn ASGI, WebSocket support
PyTorch 2.3, NVIDIA NeMo 2.5, Transformers 4.53, llama-cpp-python 0.2.90
Parakeet-TDT 0.6B (ASR), TitaNet (Speaker ID), Gemma 3 4B (LLM), MiniLM-L6-v2 (Embeddings), DistilRoBERTa (Emotion)
Docker multi-stage builds, CUDA 12.1, GPU passthrough, Linux optimization
Bcrypt password hashing, AES-256 encryption, RBAC, JWT sessions, audit logging
FAISS CPU 1.8, sentence embeddings, semantic search, RAG context retrieval
Deepgram (ASR backup), OpenAI GPT (advanced NLP), VoiceMonkey (Alexa)
Flutter (Android), Even Reality SDK (AR glasses), IoT Hub orchestration, REST APIs
Optimized for 6GB VRAM (GTX 1660 Ti)
100% data separation by speaker ID
Contextual memory with semantic search
Enterprise-grade authentication & encryption
Unified smart home control hub
Even Reality G1 wearable interface
Android, Web, Desktop, AR Wearables
Docker deployment with GPU support
Full REST and WebSocket APIs with comprehensive documentation:
POST /transcribe - Real-time audio transcription with NeMo Parakeet
POST /speaker/enroll - Speaker voice enrollment for identification
POST /speaker/identify - Identify speaker from audio sample
POST /emotion/analyze - Emotion detection from text or transcript
POST /gemma/chat - LLM inference with RAG context injection
WS /gemma/stream - WebSocket streaming for real-time chat
POST /rag/search - Semantic search across conversation history
GET /memories - Retrieve user memories with filtering
POST /memories - Store new contextual memory
GET /transcripts - Access historical transcriptions
POST /iot/command - Execute smart home device commands
GET /iot/status - Query device states
POST /voicemonkey/trigger - Trigger Alexa routines
POST /auth/login - User authentication with session creation
POST /auth/logout - Session invalidation
GET /admin/users - User management (admin only)
GET /health - System health check with GPU status
Deterministic GPU allocation prevents resource contention. By reserving the GPU exclusively for the Gemma LLM, we guarantee consistent inference performance. NeMo Parakeet and TitaNet run efficiently on CPU with acceptable latency for non-real-time transcription.
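A minimal client-side sketch against two of the endpoints listed above. The host, port, and request field names here are assumptions for illustration, not the documented schema:

# Hypothetical client calls; adjust host, port, and field names to the real API.
import requests

BASE = "http://localhost:8000"

# Real-time transcription: upload a local WAV file
with open("sample.wav", "rb") as f:
    resp = requests.post(f"{BASE}/transcribe", files={"file": f}, timeout=60)
resp.raise_for_status()
print(resp.json())

# LLM chat with server-side RAG context injection
resp = requests.post(
    f"{BASE}/gemma/chat",
    json={"message": "Turn off the living room lights if nobody is home."},
    timeout=120,
)
resp.raise_for_status()
print(resp.json())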
4-bit quantization (Q4_K_M) enables running Gemma 3 4B on 6GB VRAM with minimal quality loss (<3% perplexity increase). Reduces memory footprint from 8GB to ~3.5GB while maintaining coherent responses. Essential for consumer-grade hardware deployment.
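A rough back-of-the-envelope check of that footprint (a sketch only; exact figures depend on the GGUF build, context length, and KV-cache settings):

# Approximate VRAM estimate for Gemma 3 4B at different precisions.
PARAMS = 4e9                  # ~4 billion parameters
BITS_FP16 = 16
BITS_Q4_K_M = 4.8             # Q4_K_M averages roughly 4.5-5 bits per weight
KV_AND_BUFFERS_GB = 1.0       # assumed overhead for 8K context + CUDA buffers

fp16_gb = PARAMS * BITS_FP16 / 8 / 1e9
q4_gb = PARAMS * BITS_Q4_K_M / 8 / 1e9

print(f"fp16 weights:   {fp16_gb:.1f} GB")                          # ~8 GB
print(f"Q4_K_M weights: {q4_gb:.1f} GB")                            # ~2.4 GB
print(f"Q4_K_M + cache: {q4_gb + KV_AND_BUFFERS_GB:.1f} GB on a 6 GB card")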
Separate services for ASR, speaker ID, emotion, LLM, and RAG enable independent scaling and testing. Each service has clear interfaces and responsibilities. Failure in one service doesn't cascade to others. Easy to swap models or providers without refactoring.
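One way to keep those service boundaries explicit is a per-service interface, so local and cloud backends stay interchangeable. The class and function names below are illustrative, not the project's actual modules:

# Illustrative service boundary: the rest of the system depends only on
# this interface, so ASR backends can be swapped without refactoring.
from typing import Protocol

class ASRService(Protocol):
    def transcribe(self, audio: bytes, sample_rate: int = 16000) -> str: ...

class LocalParakeetASR:
    """Hypothetical wrapper around the on-premises NeMo Parakeet model."""
    def transcribe(self, audio: bytes, sample_rate: int = 16000) -> str:
        raise NotImplementedError("call into NeMo here")

class DeepgramASR:
    """Hypothetical wrapper around the Deepgram cloud API."""
    def transcribe(self, audio: bytes, sample_rate: int = 16000) -> str:
        raise NotImplementedError("call the Deepgram SDK here")

def handle_voice_command(asr: ASRService, audio: bytes) -> str:
    # Callers never import a concrete model; a failure stays inside one service.
    return asr.transcribe(audio)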
Privacy-focused design runs ASR, speaker ID, and LLM locally. Only falls back to Deepgram/OpenAI APIs when local resources are unavailable. User data stays on-premises by default. Reduces cloud costs and API dependencies.
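A minimal sketch of that local-first decision, with the concrete backends passed in as callables (their implementations are assumed):

# Local-first inference with cloud fallback: the on-premises path is tried
# first; the cloud API is only used when local resources are unavailable.
import logging
from typing import Callable

logger = logging.getLogger("asr")

def transcribe_with_fallback(
    audio: bytes,
    local_asr: Callable[[bytes], str],   # e.g. NeMo Parakeet wrapper
    cloud_asr: Callable[[bytes], str],   # e.g. Deepgram wrapper
) -> str:
    try:
        return local_asr(audio)          # preferred: data stays on-premises
    except Exception as exc:
        logger.warning("Local ASR unavailable (%s); falling back to cloud", exc)
        return cloud_asr(audio)          # fallback only, incurs API cost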
Centralized orchestration layer abstracts device-specific protocols. Unified interface for VoiceMonkey, curl commands, and direct HTTP APIs. Enables complex automation routines that span multiple devices. Simplifies error handling and retry logic.
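A sketch of how that unified interface might route a single command to different backends (VoiceMonkey versus a direct HTTP device API). The device registry, URLs, and payload fields are illustrative assumptions:

# Unified IoT command dispatch: callers name a device and an action,
# the hub picks the protocol.
import requests

DEVICE_REGISTRY = {
    "living_room_lights": {"backend": "voicemonkey", "routine": "lights-off"},
    "thermostat":         {"backend": "http", "url": "http://192.168.1.50/api/state"},
}

def send_command(device: str, action: str) -> bool:
    entry = DEVICE_REGISTRY[device]
    if entry["backend"] == "voicemonkey":
        # Trigger an Alexa routine through the hub's VoiceMonkey endpoint.
        resp = requests.post(
            "http://localhost:8000/voicemonkey/trigger",
            json={"routine": entry["routine"], "action": action},
            timeout=10,
        )
    else:
        # Talk to the device's own HTTP API directly.
        resp = requests.post(entry["url"], json={"action": action}, timeout=10)
    return resp.ok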
This pattern ensures Gemma LLM gets deterministic GPU access while preventing other services from competing for VRAM:
# GPU-exclusive enforcement for Gemma LLM
import llama_cpp
from llama_cpp import Llama

from src.utils.gpu_utils import enforce_gpu_only, clear_gpu_cache, get_vram_usage

# Clear any residual VRAM before loading
clear_gpu_cache(force=True)

# Enforce GPU-only context (sets CUDA_VISIBLE_DEVICES)
with enforce_gpu_only(device_id=0):
    model = Llama(
        model_path=GEMMA_MODEL_PATH,  # Q4_K_M GGUF path, defined in the project config
        n_gpu_layers=-1,              # Offload all layers to GPU
        n_ctx=8192,                   # 8K context window
        n_batch=512,
        verbose=False,
    )

# Verify GPU offload succeeded
assert llama_cpp.llama_supports_gpu_offload(), "GPU offload failed!"
print(f"[GEMMA] Loaded {GEMMA_MODEL_PATH} with all layers on GPU")
print(f"[GEMMA] VRAM allocated: {get_vram_usage()['allocated_gb']:.2f} GB")

# Other services run with CUDA_VISIBLE_DEVICES=""
# ensuring they never see the GPU
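For readers wondering what enforce_gpu_only does under the hood, here is one plausible shape for such a context manager (a sketch only; the actual src.utils.gpu_utils implementation may differ): it pins CUDA_VISIBLE_DEVICES while the model loads and restores the previous value afterwards.

# Illustrative only: one way a context manager like enforce_gpu_only could work.
import os
from contextlib import contextmanager

@contextmanager
def enforce_gpu_only(device_id: int = 0):
    previous = os.environ.get("CUDA_VISIBLE_DEVICES")
    os.environ["CUDA_VISIBLE_DEVICES"] = str(device_id)  # expose exactly one GPU
    try:
        yield
    finally:
        # Restore whatever the process had before
        # (possibly "" for the CPU-only services).
        if previous is None:
            os.environ.pop("CUDA_VISIBLE_DEVICES", None)
        else:
            os.environ["CUDA_VISIBLE_DEVICES"] = previous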
This system is built for real-world deployment with enterprise-grade practices:
12,000+ lines of tested Python, comprehensive error handling, logging, and monitoring
OWASP compliance, encryption at rest, secure session management, audit trails
Microservices, async I/O, connection pooling, rate limiting, load-ready
Docker containerization, GPU passthrough, health checks (see the sketch below), automated testing, CI/CD pipelines
Android app, AR glasses, web dashboard, REST API for integration
API docs, architecture diagrams, deployment guides, troubleshooting wikis
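As an illustration of the health checks mentioned above (the GET /health endpoint with GPU status), a minimal FastAPI sketch; the exact fields in the real response are assumptions:

# Minimal health endpoint reporting GPU status.
# The real route lives in the project's FastAPI app; field names are illustrative.
import torch
from fastapi import FastAPI

app = FastAPI()

@app.get("/health")
def health() -> dict:
    gpu_ok = torch.cuda.is_available()
    vram_gb = torch.cuda.memory_allocated() / 1e9 if gpu_ok else 0.0
    return {
        "status": "ok",
        "gpu_available": gpu_ok,
        "vram_allocated_gb": round(vram_gb, 2),
    }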
This ecosystem demonstrates full-stack integration: enterprise-grade AI engineering, IoT orchestration, and end-to-end product development. The architecture, code quality, and integration depth showcase scalable software engineering practices.