Voicebox: AI Speech Model of the Future Unveiled by Mark Zuckerberg

Voicebox, introduced by Mark Zuckerberg, is an advanced text-to-speech (TTS) AI model that produces realistic speech from text.

Photo Source - Google

Like ChatGPT and DALL-E, Voicebox can generalize to tasks it was not explicitly trained to perform.

Zuckerberg announced Voicebox on his Meta Channel, demonstrating its text-to-speech capabilities and noise handling.

Voicebox was trained on more than 50,000 hours of diverse audio in multiple languages, using a method called Flow Matching.

With this extensive training, Voicebox delivers conversationally fluid speech. Meta reports that speech recognition models trained on Voicebox-generated speech perform nearly as well as models trained on real speech.

Voicebox can also edit audio clips, removing noise and replacing misspoken words, much as photo editing software retouches an image.

Unlike most TTS generators, Voicebox can mimic a speaker from minimal source material. It does this zero-shot, without task-specific training, a capability that builds on its Flow Matching training method.

Due to concerns about potential misuse, Meta has not released Voicebox to the public yet.
