LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation
By improving on-policy distillation, LiveTalk enables real-time multimodal interactive video generation that matches the visual quality of full-step diffusion models at 20x lower inference cost, outperforming state-of-the-art models like Sora in multi-turn video coherence and latency.
Real-time video generation via diffusion is essential for building general-purpose multimodal interactive AI systems. However, the iterative, simultaneous denoising of all video frames with bidirectional attention in diffusion models prevents real-time interaction. While existing distillation methods can make the model autoregressive and reduce sampling steps to mitigate this, they focus primarily on text-to-video generation, leaving human-AI interaction unnatural and inefficient. This paper targets real-time interactive video diffusion conditioned on a multimodal context, including text, image, and audio, to bridge the gap. Observing that the leading on-policy distillation approach, Self Forcing, encounters challenges with multimodal conditioning (visual artifacts such as flickering, black frames, and quality degradation), we investigate an improved distillation recipe that emphasizes the quality of condition inputs as well as the initialization and schedule of the on-policy optimization. On benchmarks for multimodal-conditioned (audio, image, and text) avatar video generation, including HDTF, AVSpeech, and CelebV-HQ, our distilled model matches the visual quality of full-step, bidirectional baselines of similar or larger size at 20x lower inference cost and latency. Further, we integrate our model with audio language models and the long-form video inference technique Anchor-Heavy Identity Sinks to build LiveTalk, a real-time multimodal interactive avatar system. System-level evaluation on our curated multi-turn interaction benchmark shows that LiveTalk outperforms state-of-the-art models (Sora2, Veo3) in multi-turn video coherence and content quality while reducing response latency from 1-2 minutes to real-time generation, enabling seamless human-AI multimodal interaction.
Ethan Chern, Zhulin Hu, Bohao Tang +4 more
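The abstract above hinges on chunk-wise autoregressive generation with only a few denoising steps per chunk, conditioned on previously generated chunks and the current audio. Below is a minimal sketch of that inference loop; the toy student stand-in, the noise schedule, and the conditioning interface are illustrative assumptions, not the authors' implementation.

# Minimal sketch of chunk-wise, few-step autoregressive denoising (illustrative
# only: the `student` stand-in, schedule, and conditioning API are assumptions).
import torch

FRAMES_PER_CHUNK, STEPS = 16, 4                   # few-step schedule after distillation
sigmas = torch.linspace(1.0, 0.0, STEPS + 1)      # toy descending noise levels

def denoise_chunk(student, latent_shape, context, audio_feat):
    """Generate one latent chunk from noise, conditioned on previously
    generated chunks (context) and the audio features for this chunk."""
    x = torch.randn(latent_shape)                 # start from pure noise
    for i in range(STEPS):
        # student predicts the clean latent; re-noise it to the next level
        x0_hat = student(x, sigmas[i], context=context, audio=audio_feat)
        x = x0_hat + sigmas[i + 1] * torch.randn_like(x0_hat)
    return x

# Toy stand-in for the distilled student (a real model would be a causal video DiT).
student = lambda x, sigma, context=None, audio=None: 0.9 * x

context = []                                      # grows causally, chunk by chunk
for audio_feat in [torch.randn(16) for _ in range(3)]:   # three dummy audio windows
    chunk = denoise_chunk(student, (FRAMES_PER_CHUNK, 4, 32, 32), context, audio_feat)
    context.append(chunk)                         # past chunks condition future ones

Because each chunk needs only a handful of student evaluations and never revisits earlier frames, latency stays bounded per chunk, which is what makes streaming interaction feasible.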
GamiBench: Evaluating Spatial Reasoning and 2D-to-3D Planning Capabilities of MLLMs with Origami Folding Tasks
GamiBench, a new benchmark using origami folding tasks, reveals that even state-of-the-art multimodal large language models like GPT-5 and Gemini-2.5-Pro exhibit significant weaknesses in spatial reasoning, particularly in cross-view consistency and impossible-fold detection.
Logic Sketch Prompting (LSP): A Deterministic and Interpretable Prompting Method
Logic Sketch Prompting (LSP) improves LLM reasoning accuracy to 89% on pharmacologic compliance tasks by introducing typed variables, rule-based validation, and deterministic condition evaluators, outperforming chain-of-thought and zero-shot prompting.
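The LSP summary above names typed variables, rule-based validation, and deterministic condition evaluators. Below is a minimal sketch of what such an evaluator could look like; the Rule dataclass, operator set, and example values are illustrative assumptions, not the paper's specification.

# Minimal sketch of a deterministic condition evaluator with typed variables
# and rule-based validation (illustrative; not the paper's definitions).
from dataclasses import dataclass
from typing import Callable

OPS: dict[str, Callable[[float, float], bool]] = {
    ">=": lambda a, b: a >= b,
    "<=": lambda a, b: a <= b,
    "==": lambda a, b: a == b,
}

@dataclass
class Rule:
    variable: str      # typed variable name, e.g. "daily_dose_mg"
    op: str            # one of OPS
    threshold: float

def validate(variables: dict[str, float], rules: list[Rule]) -> dict[str, bool]:
    """Evaluate every rule deterministically; missing variables or unknown
    operators fail validation instead of being guessed by the LLM."""
    results = {}
    for r in rules:
        key = f"{r.variable} {r.op} {r.threshold}"
        if r.variable not in variables or r.op not in OPS:
            results[key] = False
            continue
        results[key] = OPS[r.op](variables[r.variable], r.threshold)
    return results

# Toy pharmacologic-compliance style check (values are made up).
print(validate({"daily_dose_mg": 40.0}, [Rule("daily_dose_mg", "<=", 80.0)]))
# -> {'daily_dose_mg <= 80.0': True}

The point of pushing condition checks into plain code like this is that the outcome is the same on every run, in contrast to free-form chain-of-thought where the model may evaluate the same condition differently across samples.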
DarkPatterns-LLM: A Multi-Layer Benchmark for Detecting Manipulative and Harmful AI Behavior
DarkPatterns-LLM, a new benchmark with 401 examples and fine-grained annotations, reveals that even state-of-the-art LLMs like GPT-4 struggle to detect subtle manipulative behaviors that undermine user autonomy, highlighting the need for improved safety measures.
Open-Source Multimodal Moxin Models with Moxin-VLM and Moxin-VLA
Moxin introduces a fully transparent, open-source 7B language model, along with its vision-language and vision-language-action variants, trained on open data and frameworks to foster collaborative AI development.
Exploring the Vertical-Domain Reasoning Capabilities of Large Language Models
A new study benchmarks large language models on accounting reasoning tasks, revealing that while GPT-4 shows promise, significant improvements are needed before LLMs can reliably handle enterprise-level accounting scenarios.
Fragile Knowledge, Robust Instruction-Following: The Width Pruning Dichotomy in Llama-3.2
Structured pruning of Llama-3.2 reveals that reducing the GLU-MLP expansion ratio improves instruction-following by up to 75% while degrading factual knowledge, highlighting a critical architectural trade-off between knowledge capacity and behavioral alignment.
Conformal Prediction Sets for Next-Token Prediction in Large Language Models: Balancing Coverage Guarantees with Set Efficiency
Vocabulary-Aware Conformal Prediction (VACP) drastically reduces the size of prediction sets for next-token prediction in large language models by 197x while maintaining coverage guarantees, using semantic masking and temperature scaling to refine the prediction space.
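The VACP summary above builds on standard split conformal prediction for next-token sets. Below is a minimal sketch of the generic construction: calibrate a threshold from held-out nonconformity scores, then keep every token whose probability clears it. The vocab_mask argument is a hypothetical stand-in for VACP's semantic masking, and all names here are illustrative, not the paper's method.

# Minimal split-conformal sketch for next-token prediction sets, assuming we
# have calibration softmax rows and the observed next-token index for each row.
import numpy as np

def calibrate_threshold(cal_probs: np.ndarray, cal_labels: np.ndarray, alpha: float = 0.1) -> float:
    """LAC-style scores: 1 - p(observed token); return the conformal quantile."""
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    q_level = np.ceil((n + 1) * (1 - alpha)) / n
    return float(np.quantile(scores, min(q_level, 1.0), method="higher"))

def prediction_set(probs: np.ndarray, q_hat: float, vocab_mask=None) -> np.ndarray:
    """Keep every token whose probability is at least 1 - q_hat, optionally
    after masking out tokens deemed irrelevant (stand-in for semantic masking)."""
    p = probs.copy()
    if vocab_mask is not None:
        p = np.where(vocab_mask, p, 0.0)
        p = p / p.sum()                        # renormalize over the reduced vocabulary
    return np.flatnonzero(p >= 1.0 - q_hat)

# Toy usage with a 6-token vocabulary and random calibration data.
rng = np.random.default_rng(0)
cal_probs = rng.dirichlet(np.ones(6), size=200)
cal_labels = np.array([rng.choice(6, p=row) for row in cal_probs])
q_hat = calibrate_threshold(cal_probs, cal_labels, alpha=0.1)
print(prediction_set(rng.dirichlet(np.ones(6)), q_hat))

Shrinking the candidate vocabulary before thresholding is what lets the sets get dramatically smaller while the calibrated threshold still provides the coverage guarantee.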
Unbiased Visual Reasoning with Controlled Visual Inputs
VISTA, a modular vision-language framework using reinforcement learning and a controlled information bottleneck, significantly reduces spurious correlation bias in VLMs, achieving a 16% improvement on SpuriVerse with only 641 training examples.
Towards Signboard-Oriented Visual Question Answering: ViSignVQA Dataset, Method and Benchmark
The ViSignVQA dataset, featuring over 10,000 images and 25,000 question-answer pairs, enables significant performance gains (up to 209% F1-score improvement) in Vietnamese visual question answering by integrating OCR and a multi-agent reasoning framework.
Shape of Thought: When Distribution Matters More than Correctness in Reasoning Tasks
Training language models on chain-of-thought reasoning traces from stronger models, even when those traces lead to incorrect answers, surprisingly boosts performance by aligning the training data distribution with the model's inherent biases and leveraging partially correct reasoning steps.
HiFi-RAG: Hierarchical Content Filtering and Two-Pass Generation for Open-Domain RAG
HiFi-RAG achieves state-of-the-art open-domain RAG performance by using a hierarchical filtering approach with Gemini 2.5 Flash for efficient retrieval and Gemini 2.5 Pro for high-quality answer generation, improving ROUGE-L by up to 57.4% on post-cutoff knowledge questions.
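The HiFi-RAG summary above pairs a cheaper model for filtering retrieved passages with a stronger model for answer generation (Gemini 2.5 Flash and Gemini 2.5 Pro in the paper). Below is a minimal, model-agnostic sketch of that two-pass flow; call_llm and the model names in it are placeholders to be wired to an actual LLM client, not the paper's code.

# Minimal sketch of a hierarchical, two-pass RAG flow: a cheap model filters
# passages, a strong model answers from the filtered context. `call_llm` and
# both model-name strings are assumptions for illustration.
def call_llm(model: str, prompt: str) -> str:
    raise NotImplementedError("wire this to your LLM client of choice")

def two_pass_answer(question: str, passages: list[str]) -> str:
    # Pass 1: cheap model keeps only passages it judges relevant to the question.
    kept = []
    for p in passages:
        verdict = call_llm("cheap-filter-model",
                           f"Question: {question}\nPassage: {p}\nRelevant? Answer yes or no.")
        if verdict.strip().lower().startswith("yes"):
            kept.append(p)
    # Pass 2: strong model answers from the filtered, higher-precision context.
    context = "\n\n".join(kept) if kept else "\n\n".join(passages[:3])
    return call_llm("strong-answer-model",
                    f"Answer using only this context.\n\nContext:\n{context}\n\nQuestion: {question}")

The design trade-off is standard for hierarchical pipelines: the expensive model only ever sees the filtered context, so generation cost stays low while answer quality benefits from higher-precision evidence.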
Tiny-YOLOSAM: Fast Hybrid Image Segmentation
Tiny-YOLOSAM achieves significantly faster full-scene segmentation by using YOLOv12 to generate box prompts for TinySAM, supplemented with sparse point prompts in uncovered regions, resulting in a 4.7x speedup compared to dense prompting while substantially improving mask coverage.
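The Tiny-YOLOSAM summary above describes a detect-then-segment pipeline: a YOLO detector proposes boxes, and each box prompts a SAM-style segmenter. Below is a minimal sketch of that flow using the standard ultralytics and segment_anything APIs as stand-ins for YOLOv12 and TinySAM; the checkpoint and image paths are placeholders, and this is not the paper's implementation.

# Minimal detect-then-segment sketch: YOLO boxes become SAM box prompts.
# Stand-ins: ultralytics YOLOv8 for YOLOv12, original SAM for TinySAM.
import cv2
from ultralytics import YOLO
from segment_anything import sam_model_registry, SamPredictor

detector = YOLO("yolov8n.pt")                                   # stand-in for YOLOv12
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")   # stand-in for TinySAM
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("scene.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

masks = []
for box in detector("scene.jpg")[0].boxes.xyxy.cpu().numpy():   # one box prompt per detection
    m, _, _ = predictor.predict(box=box, multimask_output=False)
    masks.append(m[0])
# (Per the summary, the paper additionally adds sparse point prompts in regions
# that no detection box covers, to recover mask coverage there.)

Prompting the segmenter only where the detector found objects is what avoids the dense grid of prompts that makes everything-mode segmentation slow.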
Toward Equitable Recovery: A Fairness-Aware AI Framework for Prioritizing Post-Flood Aid in Bangladesh
An adversarial debiasing model reduces statistical parity difference in post-flood aid allocation by 41.6% while maintaining strong predictive accuracy, ensuring more equitable distribution to vulnerable populations in Bangladesh.
With Great Capabilities Come Great Responsibilities: Introducing the Agentic Risk & Capability Framework for Governing Agentic AI Systems
The Agentic Risk & Capability (ARC) Framework offers a structured, capability-centric approach to identifying, assessing, and mitigating risks in agentic AI systems by linking risk sources to specific threats and technical controls.
Tyee: A Unified, Modular, and Fully-Integrated Configurable Toolkit for Intelligent Physiological Health Care
The Tyee toolkit introduces a unified data interface, modular architecture, and end-to-end configuration for physiological signal analysis, achieving state-of-the-art results on 12 of 13 benchmark datasets.