FrontierPapers
Daily AI Research from Top Labs
200 papers
Hugging Face · Vision · OpenAI · Featured

LiveTalk: Real-Time Interactive Video Diffusion Achieves Sora-Level Quality, 20x Faster

LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation

By improving on-policy distillation, LiveTalk enables real-time multimodal interactive video generation that matches the visual quality of full-step diffusion models at 20x lower inference cost, and outperforms state-of-the-art models such as Sora2 and Veo3 in multi-turn video coherence and latency.

Real-time video generation via diffusion is essential for building general-purpose multimodal interactive AI systems. However, the iterative, simultaneous denoising of all video frames with bidirectional attention in diffusion models prevents real-time interaction. While existing distillation methods can make the model autoregressive and reduce sampling steps to mitigate this, they focus primarily on text-to-video generation, leaving human-AI interaction unnatural and inefficient. This paper targets real-time interactive video diffusion conditioned on a multimodal context, including text, image, and audio, to bridge the gap. Observing that the leading on-policy distillation approach, Self Forcing, encounters challenges (visual artifacts such as flickering, black frames, and quality degradation) under multimodal conditioning, we investigate an improved distillation recipe with emphasis on the quality of condition inputs as well as the initialization and schedule for the on-policy optimization. On benchmarks for multimodal-conditioned (audio, image, and text) avatar video generation, including HDTF, AVSpeech, and CelebV-HQ, our distilled model matches the visual quality of full-step, bidirectional baselines of similar or larger size at 20x lower inference cost and latency. Further, we integrate our model with audio language models and the long-form video inference technique Anchor-Heavy Identity Sinks to build LiveTalk, a real-time multimodal interactive avatar system. System-level evaluation on our curated multi-turn interaction benchmark shows LiveTalk outperforms state-of-the-art models (Sora2, Veo3) in multi-turn video coherence and content quality, while reducing response latency from 1-2 minutes to real-time generation, enabling seamless human-AI multimodal interaction.
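The distilled generator described above is autoregressive over short chunks and uses only a few denoising steps per chunk. Below is a minimal sketch of what such a rollout loop could look like, assuming a hypothetical `student` model API (cache, denoise, decode) and precomputed text, image, and audio embeddings; it is illustrative only, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code): an autoregressive, few-step
# denoising rollout over video chunks, conditioned on text, a reference
# image, and streaming audio. All module names are hypothetical.
import torch

@torch.no_grad()
def rollout(student, text_emb, image_emb, audio_chunks, num_steps=4, chunk_frames=8):
    """Generate video chunk by chunk; each chunk is denoised in a few steps
    while attending causally to previously generated chunks via a KV cache."""
    kv_cache = student.init_cache(text_emb, image_emb)    # hypothetical API
    video = []
    for audio_emb in audio_chunks:                        # one chunk per audio window
        x = torch.randn(1, chunk_frames, *student.latent_shape)
        for t in student.timesteps(num_steps):            # a handful of distilled steps
            x = student.denoise(x, t, audio_emb, kv_cache)
        kv_cache = student.update_cache(kv_cache, x)      # causal context for next chunk
        video.append(student.decode(x))                   # latent -> frames
    return torch.cat(video, dim=1)
```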

Ethan Chern, Zhulin Hu, Bohao Tang +4 more

2d ago · Read Paper
arXiv · DeepMind · 1d ago

GamiBench Exposes MLLM Spatial Reasoning Flaws: Even GPT-5 Struggles with Origami

GamiBench: Evaluating Spatial Reasoning and 2D-to-3D Planning Capabilities of MLLMs with Origami Folding Tasks

GamiBench, a new benchmark using origami folding tasks, reveals that even state-of-the-art multimodal large language models like GPT-5 and Gemini-2.5-Pro exhibit significant weaknesses in spatial reasoning, particularly in cross-view consistency and impossible-fold detection.

arXiv · DeepMind · 1d ago

Logic Sketch Prompting: Boost LLM Accuracy by 30% with Deterministic Rules

Logic Sketch Prompting (LSP): A Deterministic and Interpretable Prompting Method

Logic Sketch Prompting (LSP) improves LLM reasoning accuracy to 89% on pharmacologic compliance tasks by introducing typed variables, rule-based validation, and deterministic condition evaluators, outperforming chain-of-thought and zero-shot prompting.
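The summary above describes deterministic, rule-based condition evaluation over typed variables extracted by the LLM. A minimal sketch of that pattern follows; the variable names and compliance rules are invented for illustration, since the actual LSP rule set is not given here.

```python
# Sketch of deterministic rule evaluation over typed variables: the LLM
# extracts structured facts, and compliance is decided by explicit rules
# rather than free-form reasoning. Names and rules are hypothetical.
from dataclasses import dataclass

@dataclass
class PatientFacts:          # typed variables extracted by the LLM
    age: int
    egfr: float              # renal function, mL/min/1.73m^2
    on_anticoagulant: bool

RULES = [
    ("adult_dose_only", lambda f: f.age >= 18),
    ("renal_clearance_ok", lambda f: f.egfr >= 30.0),
    ("no_anticoagulant_interaction", lambda f: not f.on_anticoagulant),
]

def evaluate(facts: PatientFacts) -> dict:
    """Deterministically evaluate every rule; the verdict is auditable."""
    results = {name: rule(facts) for name, rule in RULES}
    results["compliant"] = all(results.values())
    return results

print(evaluate(PatientFacts(age=54, egfr=72.0, on_anticoagulant=False)))
```

Because every rule is an explicit predicate, the same inputs always produce the same verdict, and each failed condition can be reported back to the user.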

arXiv · OpenAI · 1d ago

DarkPatterns-LLM Exposes LLM Manipulation: GPT-4 Fails Autonomy Test

DarkPatterns-LLM: A Multi-Layer Benchmark for Detecting Manipulative and Harmful AI Behavior

DarkPatterns-LLM, a new benchmark with 401 examples and fine-grained annotations, reveals that even state-of-the-art LLMs like GPT-4 struggle to detect subtle manipulative behaviors that undermine user autonomy, highlighting the need for improved safety measures.

Yesterday

arXiv · LLMs · OpenAI
1d ago

Moxin: New Fully Open-Source 7B LLM Rivals Proprietary Models

Open-Source Multimodal Moxin Models with Moxin-VLM and Moxin-VLA

Moxin introduces a fully transparent, open-source 7B language model, along with its vision-language and vision-language-action variants, trained on open data and frameworks to foster collaborative AI development.

Source
arXiv · LLMs · OpenAI
1d ago

GPT-4 Aces Accounting Exam, But Still Needs Work for Real-World Use

Exploring the Vertical-Domain Reasoning Capabilities of Large Language Models

A new study benchmarks large language models on accounting reasoning tasks, revealing that while GPT-4 shows promise, significant improvements are needed before LLMs can reliably handle enterprise-level accounting scenarios.

Source
arXiv · LLMs · Meta AI
1d ago

Llama-3.2 Pruning Paradox: Knowledge Fades, Instruction-Following Soars (+75%)

Fragile Knowledge, Robust Instruction-Following: The Width Pruning Dichotomy in Llama-3.2

Structured pruning of Llama-3.2 reveals that reducing GLU-MLP expansion ratio improves instruction-following by up to 75% while degrading factual knowledge, highlighting a critical architectural trade-off between knowledge capacity and behavioral alignment.
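For readers unfamiliar with width pruning: reducing the GLU-MLP expansion ratio means shrinking the intermediate dimension of each gated MLP block. A hedged PyTorch sketch of one common way to do this appears below; the channel-scoring criterion and block layout are illustrative, not necessarily the paper's exact recipe.

```python
# Sketch of width pruning a gated (SwiGLU-style) MLP block: shrink the
# intermediate dimension by keeping the channels with the largest weight
# norms across the gate, up, and down projections.
import torch
import torch.nn as nn

def prune_glu_mlp(gate: nn.Linear, up: nn.Linear, down: nn.Linear, keep_ratio=0.5):
    inter = gate.out_features
    k = int(inter * keep_ratio)
    # Score each intermediate channel by its combined weight norm
    scores = gate.weight.norm(dim=1) + up.weight.norm(dim=1) + down.weight.norm(dim=0)
    idx = scores.topk(k).indices.sort().values
    new_gate = nn.Linear(gate.in_features, k, bias=False)
    new_up = nn.Linear(up.in_features, k, bias=False)
    new_down = nn.Linear(k, down.out_features, bias=False)
    with torch.no_grad():
        new_gate.weight.copy_(gate.weight[idx])
        new_up.weight.copy_(up.weight[idx])
        new_down.weight.copy_(down.weight[:, idx])
    return new_gate, new_up, new_down
```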

Source
arXiv · LLMs · DeepMind
1d ago

LLMs Predict Next Tokens More Precisely with Vocabulary-Aware Conformal Prediction

Conformal Prediction Sets for Next-Token Prediction in Large Language Models: Balancing Coverage Guarantees with Set Efficiency

Vocabulary-Aware Conformal Prediction (VACP) reduces the size of prediction sets for next-token prediction in large language models by a factor of 197 while maintaining coverage guarantees, using semantic masking and temperature scaling to refine the prediction space.
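A short sketch of the general split-conformal recipe this builds on, with temperature scaling and a hypothetical semantic mask standing in for VACP's vocabulary-aware restriction; the exact VACP procedure may differ.

```python
# Split-conformal prediction sets for next-token prediction: calibrate a
# threshold on held-out (logits, true-token) pairs, then include every token
# whose (optionally masked, temperature-scaled) probability clears it.
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

def calibrate(cal_logits, cal_labels, alpha=0.1, temperature=1.0):
    """Quantile of nonconformity scores 1 - p(true token) on calibration data."""
    scores = [1.0 - softmax(l, temperature)[y] for l, y in zip(cal_logits, cal_labels)]
    n = len(scores)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return np.quantile(scores, level)

def prediction_set(logits, qhat, temperature=1.0, semantic_mask=None):
    """All token ids whose probability clears the calibrated threshold."""
    p = softmax(logits, temperature)
    if semantic_mask is not None:        # restrict to plausible candidates
        p = np.where(semantic_mask, p, 0.0)
    return np.flatnonzero(p >= 1.0 - qhat)
```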

Source
arXiv · LLMs · Meta AI
1d ago

VISTA: RL Fine-tuning Cuts VLM Bias by 16% on SpuriVerse

Unbiased Visual Reasoning with Controlled Visual Inputs

VISTA, a modular vision-language framework using reinforcement learning and a controlled information bottleneck, significantly reduces spurious correlation bias in VLMs, achieving a 16% improvement on SpuriVerse with only 641 training examples.

Source
arXiv · Vision · OpenAI
1d ago

ViSignVQA: New Vietnamese Signboard Dataset Supercharges OCR-Integrated Visual Question Answering

Towards Signboard-Oriented Visual Question Answering: ViSignVQA Dataset, Method and Benchmark

The ViSignVQA dataset, featuring over 10,000 images and 25,000 question-answer pairs, enables significant performance gains (up to 209% F1-score improvement) in Vietnamese visual question answering by integrating OCR and a multi-agent reasoning framework.

Source
arXiv · LLMs · DeepMind
1d ago

Wrong Answers, Right Reasoning: Training AI on Flawed Logic Improves Performance

Shape of Thought: When Distribution Matters More than Correctness in Reasoning Tasks

Training language models on chain-of-thought reasoning traces from stronger models, even when those traces lead to incorrect answers, surprisingly boosts performance by aligning the training data distribution with the model's inherent biases and leveraging partially correct reasoning steps.
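Concretely, the data-construction contrast is between filtering reasoning traces by final-answer correctness and keeping all traces from the stronger model. A minimal sketch, with a hypothetical example schema:

```python
# Sketch of the contrast described above: build an SFT set from a stronger
# model's chain-of-thought traces, optionally keeping traces whose final
# answer is wrong. Field names are hypothetical.
def build_sft_set(examples, keep_incorrect=True):
    data = []
    for ex in examples:  # ex: {"question", "trace", "answer", "gold"}
        if keep_incorrect or ex["answer"] == ex["gold"]:
            data.append({"prompt": ex["question"],
                         "completion": ex["trace"] + "\nAnswer: " + ex["answer"]})
    return data
```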

Source
arXiv · LLMs · DeepMind
1d ago

HiFi-RAG: Gemini 2.5 Powers Cheaper, More Accurate Open-Domain RAG

HiFi-RAG: Hierarchical Content Filtering and Two-Pass Generation for Open-Domain RAG

HiFi-RAG achieves state-of-the-art open-domain RAG performance by using a hierarchical filtering approach with Gemini 2.5 Flash for efficient retrieval and Gemini 2.5 Pro for high-quality answer generation, improving ROUGE-L by up to 57.4% on post-cutoff knowledge questions.
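A hedged sketch of the filter-then-two-pass pattern the summary describes, with `call_llm` as a placeholder client and prompts invented for illustration; the paper's actual hierarchy and prompts may differ.

```python
# Cheap model prunes retrieved passages; stronger model drafts and then
# refines the answer against the filtered context.
def call_llm(model: str, prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def hifi_rag(question: str, passages: list[str]) -> str:
    # Pass 0: cheap filtering of retrieved content
    kept = []
    for p in passages:
        verdict = call_llm("gemini-2.5-flash",
                           f"Question: {question}\nPassage: {p}\nRelevant? yes/no")
        if verdict.strip().lower().startswith("yes"):
            kept.append(p)
    context = "\n\n".join(kept)
    # Pass 1: draft answer with the stronger model
    draft = call_llm("gemini-2.5-pro",
                     f"Answer using only this context:\n{context}\n\nQ: {question}")
    # Pass 2: refine the draft against the same context
    return call_llm("gemini-2.5-pro",
                    f"Context:\n{context}\n\nQ: {question}\nDraft: {draft}\n"
                    "Revise the draft, fixing any unsupported claims.")
```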

Source
arXiv · Vision · Meta AI
1d ago

Tiny-YOLOSAM: YOLOv12 and TinySAM Team Up for 4.7x Faster Segmentation

Tiny-YOLOSAM: Fast Hybrid Image Segmentation

Tiny-YOLOSAM achieves significantly faster full-scene segmentation by using YOLOv12 to generate box prompts for TinySAM, supplemented with sparse point prompts in uncovered regions, resulting in a 4.7x speedup compared to dense prompting while substantially improving mask coverage.
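A minimal sketch of the hybrid prompting pipeline, with `detect_boxes` and `segment_with_prompt` as placeholders for YOLOv12 detection and TinySAM prompted segmentation; the grid spacing and coverage logic are illustrative.

```python
# Detector boxes become segmentation box prompts; sparse point prompts are
# issued only in regions no box-prompted mask has covered.
import numpy as np

def hybrid_segment(image, detect_boxes, segment_with_prompt, grid_step=64):
    masks, covered = [], np.zeros(image.shape[:2], dtype=bool)
    for box in detect_boxes(image):                 # (x1, y1, x2, y2) per object
        m = segment_with_prompt(image, box=box)     # boolean mask
        masks.append(m)
        covered |= m
    # Sparse point prompts only where no detector box produced a mask
    for y in range(grid_step // 2, image.shape[0], grid_step):
        for x in range(grid_step // 2, image.shape[1], grid_step):
            if not covered[y, x]:
                m = segment_with_prompt(image, point=(x, y))
                masks.append(m)
                covered |= m
    return masks
```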

Source
arXiv · Safety
1d ago

AI Fights Flood Aid Bias in Bangladesh, Boosts Fairness by 42%

Toward Equitable Recovery: A Fairness-Aware AI Framework for Prioritizing Post-Flood Aid in Bangladesh

An adversarial debiasing model reduces statistical parity difference in post-flood aid allocation by 41.6% while maintaining strong predictive accuracy, ensuring more equitable distribution to vulnerable populations in Bangladesh.
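Statistical parity difference, the fairness metric behind the 41.6% figure, is simply the gap in positive-decision rates across groups. A small worked example follows; the data is invented for illustration.

```python
# Statistical parity difference (SPD): difference in the rate of positive
# decisions (e.g. aid allocated) between unprivileged and privileged groups.
import numpy as np

def statistical_parity_difference(y_pred, group):
    """SPD = P(aid = 1 | unprivileged) - P(aid = 1 | privileged)."""
    y_pred, group = np.asarray(y_pred), np.asarray(group)
    rate_unpriv = y_pred[group == 0].mean()
    rate_priv = y_pred[group == 1].mean()
    return rate_unpriv - rate_priv

# Example: an allocator with equal positive rates across groups has SPD = 0.
print(statistical_parity_difference([1, 0, 1, 1, 0, 1], [0, 0, 0, 1, 1, 1]))
```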

Source
arXiv · Agents
1d ago

ARC Framework: Taming Agentic AI Risks with Capability-Centric Governance

With Great Capabilities Come Great Responsibilities: Introducing the Agentic Risk & Capability Framework for Governing Agentic AI Systems

The Agentic Risk & Capability (ARC) Framework offers a structured, capability-centric approach to identifying, assessing, and mitigating risks in agentic AI systems by linking risk sources to specific threats and technical controls.

Source
arXiv · LLMs
1d ago

Tyee Toolkit Unifies Physiological Data, Crushes Benchmarks on 12 Datasets

Tyee: A Unified, Modular, and Fully-Integrated Configurable Toolkit for Intelligent Physiological Health Care

The Tyee toolkit introduces a unified data interface, modular architecture, and end-to-end configuration for physiological signal analysis, achieving state-of-the-art results on 12 of 13 benchmark datasets.

Source

FrontierPapers — Daily AI research from the world's top labs