- Core Features
- Multi-voice text-to-speech program with highly realistic prosody and intonation
- Uses both an autoregressive decoder and a diffusion decoder, both of which are slow to sample from
- Generates about one medium-sized sentence every 2 minutes on a K80 GPU
- New Features in v2.1
- Added random voice generation capability
- Added the ability to download voice conditioning latents and to use user-provided conditioning latents
- Added support for using your own pretrained models
- Refactored directory structures and improved performance
- Usage and Limitations
- Local installation requires an NVIDIA GPU
- Works best on books and poetry; results on other types of speech are weaker
- Training data consists mostly of audiobooks and lacks diverse voices
- Includes a classifier that detects whether audio was generated by Tortoise
- Technical Details
- Built from 5 separate models trained on roughly 50k hours of speech data
- Inspired by OpenAI's DALLE, applied to speech data and using a better decoder
- Currently 20x smaller than the original DALLE transformer
- Training methodology and configurations have not been released
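
The throughput figure under Core Features (about one medium-sized sentence every 2 minutes on a K80) makes it easy to budget generation time before starting a long job. The following is a minimal sketch under that assumption only; it does not call Tortoise itself. The names `SECONDS_PER_SENTENCE_K80`, `split_sentences`, and `estimate_minutes` are hypothetical helpers introduced for illustration, and the sentence splitter is deliberately naive.

```python
import re

# Assumption taken from the summary above: roughly one medium-sized sentence
# every 2 minutes on a K80. Real timings depend on sentence length, GPU, and
# generation settings, so treat the result as a rough budget.
SECONDS_PER_SENTENCE_K80 = 120.0


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def estimate_minutes(text, seconds_per_sentence=SECONDS_PER_SENTENCE_K80):
    """Estimate total generation time in minutes for the given text."""
    return len(split_sentences(text)) * seconds_per_sentence / 60.0


if __name__ == "__main__":
    sample = "The tortoise is slow. It is also steady! Will it finish?"
    print(f"{len(split_sentences(sample))} sentences, "
          f"about {estimate_minutes(sample):.0f} minutes on a K80")
```

Passing a different `seconds_per_sentence` adapts the estimate to faster hardware once you have measured a few sentences on your own GPU.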