Commit 503c989
Parent(s): c75822a
Update README for HF Space (shorten description)
README.md
CHANGED
|
@@ -1,14 +1,26 @@
| 1 |
<div align="center">
|
| 2 |
|
| 3 |
-
##
|
| 4 |
[Project Page](https://microsoft.github.io/VibeVoice)
|
| 5 |
[Hugging Face collection](https://huggingface.co/collections/microsoft/vibevoice-68a2ef24a875c44be47b034f)
|
| 6 |
[arXiv paper](https://arxiv.org/pdf/2508.19205)
|
| 7 |
|
| 8 |
|
| 9 |
</div>
|
| 10 |
|
| 11 |
-
|
| 12 |
<div align="center">
|
| 13 |
<picture>
|
| 14 |
<source media="(prefers-color-scheme: dark)" srcset="Figures/VibeVoice_logo_white.png">
|
|
@@ -16,40 +28,46 @@
|
|
| 16 |
</picture>
|
| 17 |
</div>
|
| 18 |
|
| 19 |
-
<div align="left">
|
| 20 |
|
| 21 |
-
|
| 22 |
|
| 23 |
-
|
| 24 |
-
|
| 25 |
|
| 26 |
-
|
| 27 |
|
| 28 |
-
|
| 29 |
|
| 30 |
-
|
| 31 |
-
|
| 32 |
|
| 33 |
-
|
| 34 |
|
| 35 |
-
|
| 36 |
|
| 37 |
-
|
| 38 |
|
| 39 |
-
|
| 40 |
|
| 41 |
|
| 42 |
-
|
| 43 |
|
| 44 |
-
|
| 45 |
|
| 46 |
-
|
| 47 |
|
| 48 |
-
|
| 49 |
-
- **[Realtime streaming TTS model](docs/vibevoice-realtime-0.5b.md)**: Produces initial audible speech in ~**300 ms** and supports **streaming text input** for single-speaker **real-time** speech generation; designed for low-latency generation.
|
| 50 |
|
| 51 |
-
|
|
|
|
| 52 |
|
|
|
|
| 53 |
|
| 54 |
<p align="left">
|
| 55 |
<img src="Figures/MOS-preference.png" alt="MOS Preference Results" height="260px">
|
|
@@ -59,65 +77,26 @@ A core innovation of VibeVoice is its use of continuous speech tokenizers (Acous
|
|
| 59 |
|
| 60 |
### 🎵 Demo Examples
|
| 61 |
|
| 62 |
-
|
| 63 |
-
**Video Demo**
|
| 64 |
-
|
| 65 |
-
We produced this video with [Wan2.2](https://github.com/Wan-Video/Wan2.2). We sincerely appreciate the Wan-Video team for their great work.
|
| 66 |
-
|
| 67 |
**English**
|
| 68 |
<div align="center">
|
| 69 |
-
|
| 70 |
https://github.com/user-attachments/assets/0967027c-141e-4909-bec8-091558b1b784
|
| 71 |
-
|
| 72 |
-
</div>
|
| 73 |
-
|
| 74 |
-
|
| 75 |
-
**Chinese**
|
| 76 |
-
<div align="center">
|
| 77 |
-
|
| 78 |
-
https://github.com/user-attachments/assets/322280b7-3093-4c67-86e3-10be4746c88f
|
| 79 |
-
|
| 80 |
</div>
|
| 81 |
|
| 82 |
**Cross-Lingual**
|
| 83 |
<div align="center">
|
| 84 |
-
|
| 85 |
https://github.com/user-attachments/assets/838d8ad9-a201-4dde-bb45-8cd3f59ce722
|
| 86 |
-
|
| 87 |
-
</div>
|
| 88 |
-
|
| 89 |
-
**Spontaneous Singing**
|
| 90 |
-
<div align="center">
|
| 91 |
-
|
| 92 |
-
https://github.com/user-attachments/assets/6f27a8a5-0c60-4f57-87f3-7dea2e11c730
|
| 93 |
-
|
| 94 |
-
</div>
|
| 95 |
-
|
| 96 |
-
|
| 97 |
-
**Long Conversation with 4 people**
|
| 98 |
-
<div align="center">
|
| 99 |
-
|
| 100 |
-
https://github.com/user-attachments/assets/a357c4b6-9768-495c-a576-1618f6275727
|
| 101 |
-
|
| 102 |
</div>
|
| 103 |
|
| 104 |
For more examples, see the [Project Page](https://microsoft.github.io/VibeVoice).
|
| 105 |
|
|
|
|
| 106 |
|
| 107 |
-
|
| 108 |
-
|
| 109 |
-
|
| 110 |
-
While efforts have been made to optimize it through various techniques, it may still produce outputs that are unexpected, biased, or inaccurate. VibeVoice inherits any biases, errors, or omissions produced by its base model (specifically, Qwen2.5 1.5b in this release).
|
| 111 |
-
Potential for Deepfakes and Disinformation: High-quality synthetic speech can be misused to create convincing fake audio content for impersonation, fraud, or spreading disinformation. Users must ensure transcripts are reliable, check content accuracy, and avoid using generated content in misleading ways. Users are expected to use the generated content and to deploy the models in a lawful manner, in full compliance with all applicable laws and regulations in the relevant jurisdictions. It is best practice to disclose the use of AI when sharing AI-generated content.
|
| 112 |
-
|
| 113 |
-
English and Chinese only: Transcripts in languages other than English or Chinese may result in unexpected audio outputs.
|
| 114 |
-
|
| 115 |
-
Non-Speech Audio: The model focuses solely on speech synthesis and does not handle background noise, music, or other sound effects.
|
| 116 |
-
|
| 117 |
-
Overlapping Speech: The current model does not explicitly model or generate overlapping speech segments in conversations.
|
| 118 |
|
| 119 |
We do not recommend using VibeVoice in commercial or real-world applications without further testing and development. This model is intended for research and development purposes only. Please use responsibly.
|
| 120 |
|
| 121 |
## Star History
|
| 122 |
|
| 123 |
-

|
|
|
|
| 1 |
+
---
|
| 2 |
+
title: VibeVoice
|
| 3 |
+
emoji: 🌍
|
| 4 |
+
colorFrom: purple
|
| 5 |
+
colorTo: yellow
|
| 6 |
+
sdk: docker
|
| 7 |
+
pinned: false
|
| 8 |
+
license: mit
|
| 9 |
+
short_description: VibeVoice-Realtime-0.5B - Real-time neural voice generation
|
| 10 |
+
---
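The front matter added above is what tells Hugging Face that this Space builds from a Dockerfile (`sdk: docker`). As a minimal sketch, the snippet below parses those flat `key: value` pairs with the standard library and checks the shortened description. The parser is a simplified stand-in for a real YAML loader, and the 60-character limit on `short_description` is an assumption to confirm against the Spaces configuration reference.

```python
# Simplified front-matter reader: handles only flat `key: value` lines,
# which is all this README uses. It is not a general YAML parser.
def parse_front_matter(readme_text: str) -> dict[str, str]:
    lines = readme_text.strip().splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":  # closing delimiter ends the block
            break
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


# Assumed limit on short_description; verify against the Spaces docs.
MAX_SHORT_DESCRIPTION = 60

# Front matter copied from the added lines of this commit.
README = """---
title: VibeVoice
emoji: 🌍
colorFrom: purple
colorTo: yellow
sdk: docker
pinned: false
license: mit
short_description: VibeVoice-Realtime-0.5B - Real-time neural voice generation
---
"""

config = parse_front_matter(README)
description_length = len(config["short_description"])
print(config["sdk"], description_length, description_length <= MAX_SHORT_DESCRIPTION)
```

A check like this can run before pushing, so an overlong description is caught locally rather than by the Hub.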
|
| 11 |
+
|
| 12 |
<div align="center">
|
| 13 |
|
| 14 |
+
## 🎙️ VibeVoice: Open-Source Frontier Voice AI
|
| 15 |
[Project Page](https://microsoft.github.io/VibeVoice)
|
| 16 |
[Hugging Face collection](https://huggingface.co/collections/microsoft/vibevoice-68a2ef24a875c44be47b034f)
|
| 17 |
[arXiv paper](https://arxiv.org/pdf/2508.19205)
|
| 18 |
|
| 19 |
+
**Real-time neural voice synthesis system powered by Microsoft VibeVoice-Realtime-0.5B.**
|
| 20 |
+
*This Space demonstrates how to run the model using Docker on Hugging Face Spaces.*
|
| 21 |
|
| 22 |
</div>
|
| 23 |
|
|
|
|
| 24 |
<div align="center">
|
| 25 |
<picture>
|
| 26 |
<source media="(prefers-color-scheme: dark)" srcset="Figures/VibeVoice_logo_white.png">
|
|
|
|
| 28 |
</picture>
|
| 29 |
</div>
|
| 30 |
|
|
|
|
| 31 |
|
| 32 |
+
## 🚀 Space Features
|
| 33 |
|
| 34 |
+
- ⚡ **Real-time voice generation** (~300ms latency)
|
| 35 |
+
- 🧠 **Lightweight 0.5B parameter model**
|
| 36 |
+
- 🐳 **Docker-based deployment** (downloaded at runtime)
|
| 37 |
+
- 🌐 **Runs on CPU** (Zero GPU supported)
|
| 38 |
|
| 39 |
+
---
|
| 40 |
|
| 41 |
+
## 📦 Model Details
|
| 42 |
|
| 43 |
+
- **Model:** [`microsoft/VibeVoice-Realtime-0.5B`](https://huggingface.co/microsoft/VibeVoice-Realtime-0.5B)
|
| 44 |
+
- **Type:** Text-to-Speech / Voice Generation
|
| 45 |
+
- **Inference:** Real-time Streaming
|
| 46 |
+
- **Source:** Microsoft Research
|
| 47 |
|
| 48 |
+
---
|
| 49 |
|
| 50 |
+
## 🏗️ Technical Overview (From Original Repo)
|
| 51 |
|
| 52 |
+
<div align="left">
|
| 53 |
|
| 54 |
+
<h3>📰 News</h3>
|
| 55 |
|
| 56 |
+
<img src="https://img.shields.io/badge/Status-New-brightgreen?style=flat" alt="New" />
|
| 57 |
+
<img src="https://img.shields.io/badge/Feature-Realtime_TTS-blue?style=flat&logo=soundcharts" alt="Realtime TTS" />
|
| 58 |
|
| 59 |
+
<strong>2025-12-03: 📣 We open-sourced <a href="docs/vibevoice-realtime-0.5b.md"><strong>VibeVoice‑Realtime‑0.5B</strong></a>.</strong>
|
| 60 |
|
| 61 |
+
To mitigate deepfake risks and ensure low latency for the first speech chunk, voice prompts are provided in an embedded format. For users requiring voice customization, please reach out to our team.
|
| 62 |
|
| 63 |
+
</div>
|
| 64 |
|
| 65 |
+
### Overview
|
|
|
|
| 66 |
|
| 67 |
+
VibeVoice is a novel framework designed for generating **expressive**, **long-form**, **multi-speaker** conversational audio.
|
| 68 |
+
It addresses significant challenges in traditional Text-to-Speech (TTS) systems, particularly in scalability, speaker consistency, and natural turn-taking.
|
| 69 |
|
| 70 |
+
**[Realtime streaming TTS model](docs/vibevoice-realtime-0.5b.md)**: Produces initial audible speech in ~**300 ms** and supports **streaming text input** for single-speaker **real-time** speech generation.
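The "~300 ms" figure above describes time to first audible chunk, not total synthesis time. As a rough sketch of how that metric could be measured for any chunked audio stream, the helper below times the first item from an iterator. `fake_tts_stream` is a hypothetical stand-in that yields silent bytes after a fixed delay; it is not the VibeVoice API, and the chunk size and delay are arbitrary.

```python
import time
from typing import Iterable, Iterator, Tuple


def fake_tts_stream(n_chunks: int = 3, delay_s: float = 0.05) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS call; yields silent chunks."""
    for _ in range(n_chunks):
        time.sleep(delay_s)  # simulate per-chunk model compute
        yield b"\x00" * 480


def time_to_first_chunk(stream: Iterable[bytes]) -> Tuple[float, int]:
    """Return (seconds until the first chunk arrived, total chunks received)."""
    start = time.perf_counter()
    first_latency = None
    count = 0
    for _ in stream:
        if first_latency is None:
            first_latency = time.perf_counter() - start
        count += 1
    if first_latency is None:
        raise ValueError("stream produced no audio chunks")
    return first_latency, count


latency_s, n_chunks = time_to_first_chunk(fake_tts_stream())
print(f"first chunk after {latency_s * 1000:.0f} ms, {n_chunks} chunks total")
```

Swapping the fake generator for a real streaming call would give a comparable number, though results on a CPU-only Space may differ from the figure quoted in the README.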
|
| 71 |
|
| 72 |
<p align="left">
|
| 73 |
<img src="Figures/MOS-preference.png" alt="MOS Preference Results" height="260px">
|
|
|
|
| 77 |
|
| 78 |
### 🎵 Demo Examples
|
| 79 |
|
| 80 |
**English**
|
| 81 |
<div align="center">
|
|
|
|
| 82 |
https://github.com/user-attachments/assets/0967027c-141e-4909-bec8-091558b1b784
|
| 83 |
</div>
|
| 84 |
|
| 85 |
**Cross-Lingual**
|
| 86 |
<div align="center">
|
|
|
|
| 87 |
https://github.com/user-attachments/assets/838d8ad9-a201-4dde-bb45-8cd3f59ce722
|
| 88 |
</div>
|
| 89 |
|
| 90 |
For more examples, see the [Project Page](https://microsoft.github.io/VibeVoice).
|
| 91 |
|
| 92 |
+
## Risks and Limitations
|
| 93 |
|
| 94 |
+
While efforts have been made to optimize it through various techniques, it may still produce outputs that are unexpected, biased, or inaccurate.
|
| 95 |
+
**Potential for Deepfakes and Disinformation:** High-quality synthetic speech can be misused. Users must ensure transcripts are reliable and avoid using generated content in misleading ways.
|
| 96 |
+
**Non-Speech Audio:** The model focuses solely on speech synthesis and does not handle background noise, music, or other sound effects.
|
| 97 |
|
| 98 |
We do not recommend using VibeVoice in commercial or real-world applications without further testing and development. This model is intended for research and development purposes only. Please use responsibly.
|
| 99 |
|
| 100 |
## Star History
|
| 101 |
|
| 102 |
+

|