  • February 16, 2024
  • Shahala VP
Largest Text-to-Speech AI Model Reveals "Emergent Abilities"

Researchers at Amazon have trained the largest text-to-speech (TTS) model to date, and they report that it shows "emergent" qualities that improve its ability to speak complex sentences naturally. The development may mark a significant step toward escaping the uncanny valley in synthetic speech. Steady growth and improvement of TTS models was expected, but the researchers were hoping for a leap in capability like the one seen when language models passed a certain size. Beyond that threshold, Large Language Models (LLMs) become noticeably more robust and versatile, handling tasks they were not explicitly trained for. The Amazon AGI team, whose name signals its ambitions toward Artificial General Intelligence, hypothesized that text-to-speech models would follow a similar progression, and its findings suggest the hypothesis holds.

The model is named "Big Adaptive Streamable TTS with Emergent abilities," or BASE TTS. Its largest version was trained on 100,000 hours of public domain speech, predominantly in English, with the remainder in German, Dutch, and Spanish. At 980 million parameters, BASE-large appears to be the largest model in its category. For comparison, the researchers also trained 400M- and 150M-parameter models, and emergent behaviors were already evident in the medium-sized model. The advance lies not only in improved speech quality but in these observed emergent abilities. Examples from the research include handling compound nouns, expressing emotions, pronouncing foreign words, voicing readable non-words, interpreting punctuation, asking questions, and parsing syntactically complex sentences. BASE TTS was not explicitly trained on any of these tasks, yet it handled them well.

BASE TTS still makes mistakes, but it outperformed comparable models such as Tortoise and VALL-E. It is also streamable: it generates speech moment by moment at a relatively low bitrate rather than producing whole sentences at once. The team also experimented with packaging speech metadata separately, so that features like emotionality and prosody can accompany the audio in a low-bandwidth stream. If 2024 proves to be a breakout year for text-to-speech models, accessibility is one area that stands to benefit substantially. Wary of potential misuse, the research team chose not to release the model's source code or other data. Future research will try to determine the inflection point at which emergent abilities appear and to optimize the training and deployment processes. The BASE TTS project points toward more natural and versatile conversational AI.