Helpful Tools to Make Your First Voice Clone Dataset Easy to Build

May 20, 2023
DALL-E 2 generated image: Transformer robots playing guitar and drums in a rock band onstage, band poster

Voice clones are really, really good now. And there are a bunch of companies that sell them. Do yourself a favor and spend five minutes playing with a voice clone from any of these services. ElevenLabs is not listed in that article but they have a generous “free forever” plan.

These tools are magic and unbelievably fun. I wanted to understand how they worked. Luckily, after a hackathon where I tried (vainly) to recreate NVIDIA’s Riva Studio and train a voice model, my friend suggested that I check out Coqui.ai. Four of the team members built Mozilla’s TTS engine before they started Coqui. Their work is available on GitHub. I’ll say that again. The eldritch spellbook for state-of-the-art voice synthesis is available on GitHub AND the warlocks will answer your questions publicly and on their Discord channel for free!

I have a hard time learning a system without working with it. Thankfully, Coqui community members have created excellent content for working with the software. NanoNomad’s Voice Cloning Tutorial and Google Colab notebook helped me build my first voice clone. Once I had skin in the game, reading the papers behind Coqui’s tech seemed like a wise investment. This is gonna sound nuts but believe me, after you follow a couple tutorials, you’ll giggle like an idiot while reading VITS and YourTTS. The magic behind the magic is, yet again, attention and the transformer architecture.

Attention and the Transformers! One night only, Dec 06 2017 at Long Beach, CA.

Band poster of Transformer robots playing guitar and drums in a rock band onstage. The joke is that Coqui uses the now-ubiquitous Transformer architecture to train voice clones. Generated by DALL-E 2

DIY voice clone

Of course, the community could not have had so much fun with Coqui’s work if it weren’t so easy to use. Here’s a Python quickstart. Do you hate installing dependencies? Here’s a Docker image. You could run it locally, put it behind an API, even build a product around it. Seriously, it’s licensed under MPL 2.0. Do your kids wake you up at night? Clone your voice and hide a Raspberry Pi in their closet. Teach them French with a multilingual model. Unlimited fun, for free.
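If you want a taste before committing, here’s roughly what that Python quickstart looks like. This is a minimal sketch; the model name and output path are just examples, so check the Coqui TTS README for the current API:

```python
# Minimal sketch: download a pretrained VITS model trained on LJ Speech
# and synthesize a clip. The model name and file path are examples.
from TTS.api import TTS

tts = TTS(model_name="tts_models/en/ljspeech/vits")
tts.tts_to_file(
    text="Voice clones are really, really good now.",
    file_path="hello.wav",
)
```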

Yet, as with all magic, a sacrifice must be made. To create a voice clone, you’ll need a voice dataset. You won’t need the 24 hours of Keith Ito’s LJ Speech Dataset or the 110 speakers of the CSTR VCTK Corpus to make something cool, but if you want to train a model with convincing prosody, you will have to record 30 minutes of audio. (You can fine-tune a model with 3–10 seconds of audio using YourTTS and a process called zero-shot TTS. And OH MY GOODNESS THAT’S COOL; use this notebook or Coqui Studio. BUT! Such a spell will not capture the you-ness of your voice.) I hope you’ll find that the resources in this article make the dataset-building process a little easier.

What should the voice dataset look like?

Basically:

  1. A bunch of recordings
  2. The words that were said in each recording

PLEASE someone record their dog and map those recordings to a list of transcriptions

You could even translate human text into animal sounds!
Photo by Isabel Vittrup-Pallier on Unsplash

Really, you could format your dataset however you want, as long as you write something that can normalize your dataset into a format that your speech synthesis training program can use. Right now, May 20, 2023, Coqui’s TTS supports 28 different formats. Champion TTS hobbyist Thorsten Müller asserts that Keith Ito’s LJ Speech format is the best-known format. Let’s use that one.

Here’s the format, stolen directly from Keith Ito’s website:

Metadata is provided in transcripts.csv. This file consists of one record per line, delimited by the pipe character (0x7c). The fields are:

1. ID: this is the name of the corresponding .wav file
2. Transcription: words spoken by the reader (UTF-8)
3. Normalized Transcription: transcription with numbers, ordinals, and monetary units expanded into full words (UTF-8).

Each audio file is a single-channel 16-bit PCM WAV with a sample rate of 22050 Hz.
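For a concrete (made-up) example, two rows of that metadata file might look like this; note how the normalized transcription spells out the number and the ordinal:

```
clip_0001|I paid $7 for coffee on May 3rd.|I paid seven dollars for coffee on May third.
clip_0002|The quick brown fox jumps over the lazy dog.|The quick brown fox jumps over the lazy dog.
```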

How can I make a dataset that will train well?

Eren Gölge of Coqui names six qualities of a good dataset. Read the link for details.

Here’re the headings:

  1. Gaussian like distribution on clip and text lengths
  2. Mistake free
  3. Noise free
  4. Compatible tone and pitch among voice clips
  5. Good phoneme coverage
  6. Naturalness of recordings

Wait, that sounds like boring work!

Need: We want to make a high-quality dataset, probably several times

Problem: High-quality implies careful, diligent, painstaking preparation and review

Resolution: Tools that automate boring work

Photo by Alex Knight on Unsplash

Voice dataset builders

I’m aware of two open source tools for building voice datasets. They will both give you a dataset in the LJ Speech format. We’ll use EchoKeeper in this article but I suggest that you look into VoiceDatasetCreation before you begin recording.

EchoKeeper

I forked Sahar Mor’s whisper-playground to build EchoKeeper. You can use it by either running the container from ghcr.io/harrolee/echokeeper:latest or by cloning the repo. Follow Quickstart 2 if you’ve brought your own prompts or Quickstart 1 if you just want to record and transcribe some audio.

VoiceDatasetCreation

Rio Harper wrote this excellent guide for building a voice clone with Coqui TTS, featuring his own tool for building datasets. I wish that I had found his work before I started my own. It seems very good, and I believe that he is pressing forward with the excellently named VocalForge, “An End-to-End Toolkit for Voice Datasets”. Check out his work.

Doing the thing!

Let’s use EchoKeeper to build a dataset that matches Gölge’s spec.

Gaussian like distribution on clip and text lengths

Use this notebook to generate a set of guidelines. If you don’t want to write 180 sentences of varying length, paste the output of that notebook into ChatGPT or your LLM of choice. That is how I generated these prompts in EchoKeeper.
Do you want to learn how to skip the line on ChatGPT’s UI and use OpenAI’s API directly? Aaron Alexander’s got you covered. Do you want to know how they work? AARON ALEXANDER DELIVERS!
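Before you record anything, it’s worth sanity-checking that your prompt lengths actually look Gaussian-ish. Here’s a minimal sketch, assuming your prompts live in a prompts.json file that is a flat list of strings (adjust to however you stored them):

```python
# Sanity check: rough distribution of prompt lengths (in words).
# Assumes prompts.json is a flat JSON list of prompt strings.
import json
import statistics

with open("prompts.json") as f:
    prompts = json.load(f)

lengths = [len(p.split()) for p in prompts]
print(f"{len(prompts)} prompts")
print(f"mean words per prompt: {statistics.mean(lengths):.1f}")
print(f"standard deviation:    {statistics.stdev(lengths):.1f}")

# Crude text histogram, bucketed by 5 words
for bucket in range(0, max(lengths) + 5, 5):
    count = sum(1 for n in lengths if bucket <= n < bucket + 5)
    print(f"{bucket:>3}-{bucket + 4:<3} words | {'#' * count}")
```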

Mistake Free

Humans excel at making mistakes. Robots are less good at that. In EchoKeeper, after you record a clip, you’ll see a transcription of your speech. Click into the textarea to edit the transcription so that it matches what you said. Your edited transcription will be saved and associated with your recording. If you didn’t like your recording, you can delete it and try again.

Noise Free

Refrigerators, computer fans, the Geddup Noise… all sounds are existential threats to a Noise-Free dataset. The Coqui.ai docs recommend the tool rnnoise for denoising audio. Here’s the repo.
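If you’d rather script the cleanup, here’s a rough sketch of a denoising pass. It assumes you’ve built the examples/rnnoise_demo binary from the rnnoise repo, that ffmpeg is on your PATH, and that your clips live in a wavs/ folder; rnnoise_demo expects raw 16-bit mono PCM at 48 kHz, so we round-trip through ffmpeg:

```python
# Rough sketch: denoise every clip in ./wavs with rnnoise's example binary.
# Assumes ffmpeg is installed and ./rnnoise_demo has been built from the repo.
import pathlib
import subprocess

for wav in pathlib.Path("wavs").glob("*.wav"):
    raw_in = wav.with_suffix(".raw")
    raw_out = wav.with_name(wav.stem + "_denoised.raw")
    clean_wav = wav.with_name(wav.stem + "_denoised.wav")

    # wav -> raw 16-bit mono PCM at 48 kHz (what rnnoise_demo expects)
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(wav),
         "-f", "s16le", "-ac", "1", "-ar", "48000", str(raw_in)],
        check=True,
    )
    # denoise the raw audio
    subprocess.run(["./rnnoise_demo", str(raw_in), str(raw_out)], check=True)
    # raw -> wav, resampled back down to the LJ Speech rate of 22050 Hz
    subprocess.run(
        ["ffmpeg", "-y", "-f", "s16le", "-ac", "1", "-ar", "48000",
         "-i", str(raw_out), "-ar", "22050", str(clean_wav)],
        check=True,
    )
```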

Compatible tone and pitch among voice clips

This one’s on you. Here’s Eren’s explanation:

For instance, if you are using an audiobook recording for your project, it might have impersonations of different characters in the book. This kind of divergence between instances degrades the model performance.

Good phoneme coverage

The ubiquitous Thorsten Müller composed this notebook to help you analyze the phoneme coverage of your dataset. It is well documented and the Python is very clean.

After you have composed your prompts, format them for use with the notebook. Then update the LLM prompt you generated with my sentence_distribution notebook using the feedback from Thorsten’s notebook, and use the updated prompt to generate sentences with better phoneme coverage.
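If you just want a quick sanity check before opening the notebook, here’s a sketch using the phonemizer package (pip install phonemizer, with espeak-ng installed). It assumes the same flat prompts.json list as before and counts individual symbols, so treat it as a rough approximation rather than a replacement for Thorsten’s notebook:

```python
# Rough phoneme-coverage check. Assumes prompts.json is a flat list of strings
# and that the phonemizer package + espeak-ng backend are installed.
import json
from collections import Counter

from phonemizer import phonemize

with open("prompts.json") as f:
    prompts = json.load(f)

# Convert each prompt to an IPA string
phones = phonemize(prompts, language="en-us", backend="espeak", strip=True)

# Count symbols (approximate: some IPA phonemes span multiple characters)
counts = Counter(ch for line in phones for ch in line if not ch.isspace())
for phone, n in counts.most_common():
    print(f"{phone}\t{n}")
print(f"{len(counts)} distinct symbols seen")
```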

Normalize audio clip volume

A prompt might have excited you into shouting or lulled you into a hushed tone. You’ll want to normalize the volume of your dataset lest you teach your model that some phonemes are expressed exclusively in a low volume. EchoKeeper doesn’t support this yet. Until it does, this superuser thread offers a bunch of clever ways to use ffmpeg to normalize audio output.
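Until EchoKeeper handles this for you, here’s one way to script it, based on the loudnorm approach from that thread (folder and file names are hypothetical):

```python
# Sketch: loudness-normalize every clip with ffmpeg's loudnorm filter.
import pathlib
import subprocess

for wav in pathlib.Path("wavs").glob("*.wav"):
    out = wav.with_name(wav.stem + "_norm.wav")
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(wav),
         "-af", "loudnorm=I=-23:LRA=7:TP=-2",  # EBU R128-style loudness target
         "-ar", "22050", "-ac", "1",           # keep the LJ Speech rate and channel count
         str(out)],
        check=True,
    )
```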

Naturalness of recordings

Imagine the character of your voice clone. Does it croak like a skater or coo like a child? Does it seek to annoy or to inform? What is its timbre and pitch? To get an idea of the prosody of your voice clone, imagine the sorts of things its character would say. Write these down. If you’re like me and over-analyze everything, relax: you do not need to completely understand the rhythmic and tonal patterns of your desired voice clone in order to train it. It is enough to work by feel and write phrases that allow your desired speech patterns to emerge. Only after you let your creative brain generate several phrases should you try to categorize them, to ensure that every pattern you imagined is represented in your prompt sentences.

Technical advice

Prompt assembly

Here’s a bonus section! Follow this four-step process to get your prompts into EchoKeeper (a short Python sketch of the resulting layout follows the list).

  1. Generate prompts with this notebook, either by yourself or with ChatGPT.
  2. Create a folder named [whatever_you_want_here]
  3. Create a folder named prompts inside of that
  4. Create a file named prompts.json inside that prompts folder, i.e. [whatever_you_want_here]/prompts/prompts.json
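Here’s a small Python sketch of that layout. Note that the exact JSON schema is defined by EchoKeeper; a flat list of prompt strings is an assumption here, so check the repo’s examples before relying on it:

```python
# Sketch of the folder layout above. The prompts.json schema (a flat list of
# strings) is an assumption -- check the EchoKeeper repo for the real format.
import json
import pathlib

project = pathlib.Path("my_voice_project")   # stand-in for [whatever_you_want_here]
prompts_dir = project / "prompts"
prompts_dir.mkdir(parents=True, exist_ok=True)

prompts = [
    "First prompt sentence.",
    "A second, somewhat longer prompt sentence for variety.",
]
(prompts_dir / "prompts.json").write_text(json.dumps(prompts, indent=2))
```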

Recording configuration

You’ll need to configure your microphone in your operating system’s audio settings to match the channel count and sample rate of the data that the model you are fine-tuning was pretrained on. For example, if you want to fine-tune the VITS model pretrained on LJ Speech, you’ll need to configure your microphone to match how the LJ Speech files were recorded.

From the spec:
Each audio file is a single-channel 16-bit PCM WAV with a sample rate of 22050 Hz.

You can resample and even combine the channels of a recording with ffmpeg.
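For example, a stereo 44.1 kHz recording can be brought down to the LJ Speech spec with a single command; here’s a sketch (file names are hypothetical):

```python
# Sketch: convert a recording to mono, 16-bit PCM, 22050 Hz with ffmpeg.
import subprocess

subprocess.run(
    ["ffmpeg", "-y", "-i", "raw_recording.wav",
     "-ac", "1",             # downmix to a single channel
     "-ar", "22050",         # resample to 22050 Hz
     "-c:a", "pcm_s16le",    # 16-bit PCM
     "clip_0001.wav"],
    check=True,
)
```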

Checking that the thing is good!

You’ve recorded 30 minutes of audio and exported your EchoKeeper project. Nice! Before you spend time and/or money training a model on your dataset, you will want to make sure that it will train effectively.

Coqui helps us out here too. Use the notebooks in TTS/notebooks/dataset_analysis to analyze your data.

Next steps

Congratulations! You have all the tools you need to build a dataset. Your next task is to read the Coqui docs for Fine-tuning a 🐸 TTS model and create your own voice clone. I’ll refer you again to Rio Harper’s comprehensive guide and NanoNomad’s video. Reach out on LinkedIn, GitHub, or the comments section here if you want help building something cool.