This Vietnamese text-to-speech model is based on F5-TTS, a Diffusion Transformer with ConvNeXt V2, and was trained on ~4 hours of Vietnamese audio for 41k steps. Training and inference are fast; however, the synthesized speech may have noticeable imperfections, such as choppiness or unnatural intonation.