Commit 503c989 by ChaitanyaChandra · 1 Parent(s): c75822a

Update README for HF Space (shorten description)

  1. README.md +43 -64
README.md CHANGED
@@ -1,14 +1,26 @@
1
  <div align="center">
2
 
3
- ## 🎙️ VibeVoice: Open-Source Frontier Voice AI
4
  [![Project Page](https://img.shields.io/badge/Project-Page-blue?logo=microsoft)](https://microsoft.github.io/VibeVoice)
5
  [![Hugging Face](https://img.shields.io/badge/HuggingFace-Collection-orange?logo=huggingface)](https://huggingface.co/collections/microsoft/vibevoice-68a2ef24a875c44be47b034f)
6
  [![Technical Report](https://img.shields.io/badge/Technical-Report-red?logo=adobeacrobatreader)](https://arxiv.org/pdf/2508.19205)
7
 
 
 
8
 
9
  </div>
10
 
11
-
12
  <div align="center">
13
  <picture>
14
  <source media="(prefers-color-scheme: dark)" srcset="Figures/VibeVoice_logo_white.png">
@@ -16,40 +28,46 @@
16
  </picture>
17
  </div>
18
 
19
- <div align="left">
20
 
21
- <h3>📰 News</h3>
22
 
23
- <img src="https://img.shields.io/badge/Status-New-brightgreen?style=flat" alt="New" />
24
- <img src="https://img.shields.io/badge/Feature-Realtime_TTS-blue?style=flat&logo=soundcharts" alt="Realtime TTS" />
 
 
25
 
26
- <strong>2025-12-03: 📣 We open-sourced <a href="docs/vibevoice-realtime-0.5b.md"><strong>VibeVoice‑Realtime‑0.5B</strong></a>, a real‑time text‑to‑speech model that supports streaming text input and robust long-form speech generation. Try it on [Colab](https://colab.research.google.com/github/microsoft/VibeVoice/blob/main/demo/vibevoice_realtime_colab.ipynb).</strong>
27
 
28
- <strong>2025-12-09: 📣 We’ve added experimental speakers in nine languages (DE, FR, IT, JP, KR, NL, PL, PT, ES) for exploration—welcome to try them out and share your feedback.</strong>
29
 
30
- To mitigate deepfake risks and ensure low latency for the first speech chunk, voice prompts are provided in an embedded format. For users requiring voice customization, please reach out to our team. We will also be expanding the range of available speakers.
31
- <br>
 
 
32
 
33
- https://github.com/user-attachments/assets/0901d274-f6ae-46ef-a0fd-3c4fba4f76dc
34
 
35
- > (Launch your own realtime demo via the websocket example in [Usage](docs/vibevoice-realtime-0.5b.md#usage-1-launch-real-time-websocket-demo)).
36
 
37
- </div>
38
 
39
- 2025-09-05: VibeVoice is an open-source research framework intended to advance collaboration in the speech synthesis community. After release, we discovered instances where the tool was used in ways inconsistent with the stated intent. Since responsible use of AI is one of Microsoft’s guiding principles, we have disabled this repo until we are confident that out-of-scope use is no longer possible.
40
 
 
 
41
 
42
- ### Overview
43
 
44
- VibeVoice is a novel framework designed for generating **expressive**, **long-form**, **multi-speaker** conversational audio, such as podcasts, from text. It addresses significant challenges in traditional Text-to-Speech (TTS) systems, particularly in scalability, speaker consistency, and natural turn-taking.
45
 
46
- VibeVoice currently includes two model variants:
47
 
48
- - **Long-form multi-speaker model**: Synthesizes conversational/single-speaker speech up to **90 minutes** with up to **4 distinct speakers**, surpassing the typical 1–2 speaker limits of many prior models.
49
- - **[Realtime streaming TTS model](docs/vibevoice-realtime-0.5b.md)**: Produces initial audible speech in ~**300 ms** and supports **streaming text input** for single-speaker **real-time** speech generation; designed for low-latency generation.
50
 
51
- A core innovation of VibeVoice is its use of continuous speech tokenizers (Acoustic and Semantic) operating at an ultra-low frame rate of 7.5 Hz. These tokenizers efficiently preserve audio fidelity while significantly boosting computational efficiency for processing long sequences. VibeVoice employs a [next-token diffusion](https://arxiv.org/abs/2412.08635) framework, leveraging a Large Language Model (LLM) to understand textual context and dialogue flow, and a diffusion head to generate high-fidelity acoustic details.
 
52
 
 
53
 
54
  <p align="left">
55
  <img src="Figures/MOS-preference.png" alt="MOS Preference Results" height="260px">
@@ -59,65 +77,26 @@ A core innovation of VibeVoice is its use of continuous speech tokenizers (Acous
59
 
60
  ### 🎵 Demo Examples
61
 
62
-
63
- **Video Demo**
64
-
65
- We produced this video with [Wan2.2](https://github.com/Wan-Video/Wan2.2). We sincerely appreciate the Wan-Video team for their great work.
66
-
67
  **English**
68
  <div align="center">
69
-
70
  https://github.com/user-attachments/assets/0967027c-141e-4909-bec8-091558b1b784
71
-
72
- </div>
73
-
74
-
75
- **Chinese**
76
- <div align="center">
77
-
78
- https://github.com/user-attachments/assets/322280b7-3093-4c67-86e3-10be4746c88f
79
-
80
  </div>
81
 
82
  **Cross-Lingual**
83
  <div align="center">
84
-
85
  https://github.com/user-attachments/assets/838d8ad9-a201-4dde-bb45-8cd3f59ce722
86
-
87
- </div>
88
-
89
- **Spontaneous Singing**
90
- <div align="center">
91
-
92
- https://github.com/user-attachments/assets/6f27a8a5-0c60-4f57-87f3-7dea2e11c730
93
-
94
- </div>
95
-
96
-
97
- **Long Conversation with 4 people**
98
- <div align="center">
99
-
100
- https://github.com/user-attachments/assets/a357c4b6-9768-495c-a576-1618f6275727
101
-
102
  </div>
103
 
104
  For more examples, see the [Project Page](https://microsoft.github.io/VibeVoice).
105
 
 
106
 
107
-
108
- ## Risks and limitations
109
-
110
- While efforts have been made to optimize it through various techniques, it may still produce outputs that are unexpected, biased, or inaccurate. VibeVoice inherits any biases, errors, or omissions produced by its base model (specifically, Qwen2.5 1.5b in this release).
111
- Potential for Deepfakes and Disinformation: High-quality synthetic speech can be misused to create convincing fake audio content for impersonation, fraud, or spreading disinformation. Users must ensure transcripts are reliable, check content accuracy, and avoid using generated content in misleading ways. Users are expected to use the generated content and to deploy the models in a lawful manner, in full compliance with all applicable laws and regulations in the relevant jurisdictions. It is best practice to disclose the use of AI when sharing AI-generated content.
112
-
113
- English and Chinese only: Transcripts in languages other than English or Chinese may result in unexpected audio outputs.
114
-
115
- Non-Speech Audio: The model focuses solely on speech synthesis and does not handle background noise, music, or other sound effects.
116
-
117
- Overlapping Speech: The current model does not explicitly model or generate overlapping speech segments in conversations.
118
 
119
  We do not recommend using VibeVoice in commercial or real-world applications without further testing and development. This model is intended for research and development purposes only. Please use responsibly.
120
 
121
  ## Star History
122
 
123
- ![Star History Chart](https://api.star-history.com/svg?repos=Microsoft/vibevoice&type=date&legend=top-left)
 
1
+ ---
2
+ title: VibeVoice
3
+ emoji: 🌍
4
+ colorFrom: purple
5
+ colorTo: yellow
6
+ sdk: docker
7
+ pinned: false
8
+ license: mit
9
+ short_description: VibeVoice-Realtime-0.5B - Real-time neural voice generation
10
+ ---
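
The front matter added above is the Space's configuration block. As a rough sketch of how one might sanity-check it before pushing, the snippet below parses the flat `key: value` lines with the standard library and checks the description length. The 60-character cap on `short_description` is an assumption inferred from the commit message ("shorten description"), and `parse_front_matter` is a hypothetical helper, not part of any Hugging Face tooling.

```python
FRONT_MATTER = """\
---
title: VibeVoice
emoji: 🌍
colorFrom: purple
colorTo: yellow
sdk: docker
pinned: false
license: mit
short_description: VibeVoice-Realtime-0.5B - Real-time neural voice generation
---
"""

MAX_SHORT_DESCRIPTION = 60  # assumed limit, not confirmed by the source


def parse_front_matter(text: str) -> dict[str, str]:
    """Parse flat `key: value` pairs between the two `---` fences."""
    lines = text.strip().splitlines()
    if len(lines) < 2 or lines[0] != "---" or lines[-1] != "---":
        raise ValueError("front matter must be fenced by '---' lines")
    fields = {}
    for line in lines[1:-1]:
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"not a key: value line: {line!r}")
        fields[key.strip()] = value.strip()
    return fields


config = parse_front_matter(FRONT_MATTER)
description_ok = len(config["short_description"]) <= MAX_SHORT_DESCRIPTION
```

A real check would use a YAML parser; the hand-rolled split is only enough for the flat block shown in this diff.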
11
+
12
  <div align="center">
13
 
14
+ ## 🎙️ VibeVoice: Open-Source Frontier Voice AI
15
  [![Project Page](https://img.shields.io/badge/Project-Page-blue?logo=microsoft)](https://microsoft.github.io/VibeVoice)
16
  [![Hugging Face](https://img.shields.io/badge/HuggingFace-Collection-orange?logo=huggingface)](https://huggingface.co/collections/microsoft/vibevoice-68a2ef24a875c44be47b034f)
17
  [![Technical Report](https://img.shields.io/badge/Technical-Report-red?logo=adobeacrobatreader)](https://arxiv.org/pdf/2508.19205)
18
 
19
+ **Real-time neural voice synthesis system powered by Microsoft VibeVoice-Realtime-0.5B.**
20
+ *This Space demonstrates how to run the model using Docker on Hugging Face Spaces.*
21
 
22
  </div>
23
 
 
24
  <div align="center">
25
  <picture>
26
  <source media="(prefers-color-scheme: dark)" srcset="Figures/VibeVoice_logo_white.png">
 
28
  </picture>
29
  </div>
30
 
 
31
 
32
+ ## 🚀 Space Features
33
 
34
+ - ⚡ **Real-time voice generation** (~300 ms to first audible speech)
35
+ - 🧠 **Lightweight 0.5B parameter model**
36
+ - 🐳 **Docker-based deployment** (model downloaded at runtime)
37
+ - 🌐 **Runs on CPU** (Zero GPU supported)
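
The ~300 ms figure in the list above describes how long it takes for the first audio to arrive, not the total generation time. A minimal sketch of how that could be measured on any streaming generator follows; `fake_stream` is a hypothetical stand-in for the model's output stream and does not reflect the actual VibeVoice API.

```python
import time
from typing import Iterable, Iterator


def fake_stream(n_chunks: int = 3, delay_s: float = 0.01) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS generator."""
    for _ in range(n_chunks):
        time.sleep(delay_s)  # simulate synthesis time per chunk
        yield b"\x00" * 320  # placeholder audio bytes


def time_to_first_chunk(stream: Iterable[bytes]) -> tuple[float, int]:
    """Return (seconds until the first chunk arrived, total chunk count)."""
    start = time.perf_counter()
    first = None
    count = 0
    for _chunk in stream:
        if first is None:
            first = time.perf_counter() - start
        count += 1
    if first is None:
        raise ValueError("stream produced no chunks")
    return first, count


latency_s, n_chunks = time_to_first_chunk(fake_stream())
print(f"first chunk after {latency_s * 1000:.1f} ms, {n_chunks} chunks total")
```

Swapping `fake_stream()` for a real audio iterator would give a comparable number, though results on a CPU-only Space may differ from the figure quoted here.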
38
 
39
+ ---
40
 
41
+ ## 📦 Model Details
42
 
43
+ - **Model:** [`microsoft/VibeVoice-Realtime-0.5B`](https://huggingface.co/microsoft/VibeVoice-Realtime-0.5B)
44
+ - **Type:** Text-to-Speech / Voice Generation
45
+ - **Inference:** Real-time Streaming
46
+ - **Source:** Microsoft Research
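
The commit does not show the Space's Dockerfile or startup script, so the following is only one plausible way to fetch the model listed above when the container starts. It relies on `snapshot_download` from `huggingface_hub`; the `MODEL_DIR` environment variable, the `/tmp/models` default, and both helper functions are assumptions made for this sketch.

```python
import os
from pathlib import Path

MODEL_ID = "microsoft/VibeVoice-Realtime-0.5B"


def resolve_model_dir(env=None) -> Path:
    """Pick a local directory for the weights.

    MODEL_DIR is a hypothetical environment variable used only in this sketch.
    """
    env = os.environ if env is None else env
    return Path(env.get("MODEL_DIR", "/tmp/models")) / MODEL_ID.split("/")[-1]


def ensure_model(env=None) -> Path:
    """Download the model snapshot on first start, reuse it afterwards."""
    target = resolve_model_dir(env)
    if not target.exists() or not any(target.iterdir()):
        # Imported lazily so the path logic can be used without the dependency.
        from huggingface_hub import snapshot_download

        snapshot_download(repo_id=MODEL_ID, local_dir=str(target))
    return target
```

Calling `ensure_model()` once before serving requests keeps the image small at the cost of a slower first start.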
47
 
48
+ ---
49
 
50
+ ## 🏗️ Technical Overview (From Original Repo)
51
 
52
+ <div align="left">
53
 
54
+ <h3>📰 News</h3>
55
 
56
+ <img src="https://img.shields.io/badge/Status-New-brightgreen?style=flat" alt="New" />
57
+ <img src="https://img.shields.io/badge/Feature-Realtime_TTS-blue?style=flat&logo=soundcharts" alt="Realtime TTS" />
58
 
59
+ <strong>2025-12-03: 📣 We open-sourced <a href="docs/vibevoice-realtime-0.5b.md"><strong>VibeVoice‑Realtime‑0.5B</strong></a>.</strong>
60
 
61
+ To mitigate deepfake risks and ensure low latency for the first speech chunk, voice prompts are provided in an embedded format. For users requiring voice customization, please reach out to our team.
62
 
63
+ </div>
64
 
65
+ ### Overview
 
66
 
67
+ VibeVoice is a novel framework designed for generating **expressive**, **long-form**, **multi-speaker** conversational audio.
68
+ It addresses significant challenges in traditional Text-to-Speech (TTS) systems, particularly in scalability, speaker consistency, and natural turn-taking.
69
 
70
+ **[Realtime streaming TTS model](docs/vibevoice-realtime-0.5b.md)**: Produces initial audible speech in ~**300 ms** and supports **streaming text input** for single-speaker **real-time** speech generation.
71
 
72
  <p align="left">
73
  <img src="Figures/MOS-preference.png" alt="MOS Preference Results" height="260px">
 
77
 
78
  ### 🎵 Demo Examples
79
80
  **English**
81
  <div align="center">
 
82
  https://github.com/user-attachments/assets/0967027c-141e-4909-bec8-091558b1b784
83
  </div>
84
 
85
  **Cross-Lingual**
86
  <div align="center">
 
87
  https://github.com/user-attachments/assets/838d8ad9-a201-4dde-bb45-8cd3f59ce722
88
  </div>
89
 
90
  For more examples, see the [Project Page](https://microsoft.github.io/VibeVoice).
91
 
92
+ ## Risks and Limitations
93
 
94
+ While efforts have been made to optimize the model through various techniques, it may still produce outputs that are unexpected, biased, or inaccurate.
95
+ **Potential for Deepfakes and Disinformation:** High-quality synthetic speech can be misused. Users must ensure transcripts are reliable and avoid using generated content in misleading ways.
96
+ **Non-Speech Audio:** The model focuses solely on speech synthesis and does not handle background noise, music, or other sound effects.
97
 
98
  We do not recommend using VibeVoice in commercial or real-world applications without further testing and development. This model is intended for research and development purposes only. Please use responsibly.
99
 
100
  ## Star History
101
 
102
+ ![Star History Chart](https://api.star-history.com/svg?repos=Microsoft/vibevoice&type=date&legend=top-left)