Unified multimodal pipeline
Text, image, and reference-video inputs flow through a single server-side orchestrator that picks the right foundation model for each scene, normalizes parameters across providers, and returns one consistent output format. You write one prompt; we handle model selection, aspect-ratio adaptation, and audio sync.
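To make the flow concrete, here is a minimal sketch of how per-scene routing and parameter normalization could work. It is an illustration under stated assumptions, not the actual orchestrator: the model names, the `MODELS` table, the provider parameter keys, and every function name below are hypothetical, and audio sync is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scene:
    prompt: str
    has_image: bool = False
    has_reference_video: bool = False
    aspect_ratio: str = "16:9"


# Hypothetical model table: names, supported inputs, ratios, and the
# provider-specific key used for the ratio are all invented for this sketch.
MODELS = {
    "text-model": {"inputs": {"text"}, "ratios": ["16:9", "9:16", "1:1"], "ratio_key": "aspect_ratio"},
    "image-model": {"inputs": {"text", "image"}, "ratios": ["16:9", "9:16"], "ratio_key": "ratio"},
    "video-ref-model": {"inputs": {"text", "image", "video"}, "ratios": ["16:9"], "ratio_key": "size"},
}


def required_inputs(scene):
    """Collect the input types a scene actually uses."""
    needed = {"text"}
    if scene.has_image:
        needed.add("image")
    if scene.has_reference_video:
        needed.add("video")
    return needed


def pick_model(scene):
    """Choose the least general model that accepts every input in the scene."""
    needed = required_inputs(scene)
    candidates = [name for name, spec in MODELS.items() if needed <= spec["inputs"]]
    return min(candidates, key=lambda name: len(MODELS[name]["inputs"]))


def ratio_value(ratio):
    """Convert a 'W:H' string to a width/height number."""
    width, height = ratio.split(":")
    return int(width) / int(height)


def adapt_ratio(requested, supported):
    """Keep the requested ratio if supported, else fall back to the closest one."""
    if requested in supported:
        return requested
    target = ratio_value(requested)
    return min(supported, key=lambda r: abs(ratio_value(r) - target))


def plan_scene(scene):
    """Return one consistent plan shape regardless of which model is chosen."""
    name = pick_model(scene)
    spec = MODELS[name]
    ratio = adapt_ratio(scene.aspect_ratio, spec["ratios"])
    return {"model": name, "request": {"prompt": scene.prompt, spec["ratio_key"]: ratio}}
```

In this sketch, a caller only builds a `Scene` and calls `plan_scene`; the differences between providers (which inputs they accept, which ratios they support, what they call the ratio parameter) stay inside the table and the helper functions.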