Disguising the real voice of De Mol among fakes

By Sophie De Coppel

If you've been caught up in the latest season of the beloved Flemish TV sensation, De Mol, you might have noticed a twist that sent shockwaves through the show's audience: the introduction of AI. Yes, you read that right—artificial intelligence has joined the ranks of contestants and conspirators, adding an extra layer of intrigue to an already captivating show.

With just weeks on the clock, our team at Dataroots took on a mission: to hide the real voice of De Mol among a cacophony of synthesized echoes, leaving De Mol's identity shrouded in uncertainty. It became a journey of innovation and deception, blurring the lines between reality and fabrication.

The result? A symphony of intrigue where contestants and viewers alike struggle to decipher real from fake. Join us as we uncover the secrets behind crafting these deceptive voices.

Following the traditional route

Our journey to fabricate the voices of the candidates began with the traditional method. This path involved gathering audio data, transcribing it into text, and then utilizing a text-to-speech model for training. While we could have opted for established models like Tacotron and VITS, we decided to explore newer options: YourTTS and TortoiseTTS. These models boasted significant performance enhancements compared to their predecessors, sparking our curiosity. However, as these models were relatively new at the time of our endeavor, their training codebases remained largely experimental, with minimal documentation on their myriad hyperparameters. Consequently, our experimentation process largely relied on trial and error as we navigated through uncharted territory.

TortoiseTTS

TortoiseTTS stands out as a text-to-speech framework renowned for its focus on multi-voice functionalities and exceptionally lifelike prosody and intonation. Its user-friendly web interface enables effortless model training with just a few clicks. While initially captivating, it was restricted to English during our exploration (although it now supports additional languages, as evidenced by this example). Consequently, we shifted our focus to the multi-lingual capabilities of YourTTS.
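For reference, zero-shot generation with the open-source tortoise-tts package looks roughly like the sketch below. This is an illustration rather than the exact workflow we used; the voice name, paths, and preset are placeholders.

```python
# Hedged sketch: zero-shot English generation with the open-source
# tortoise-tts package (neonbjb/tortoise-tts). Paths, voice name and
# preset are illustrative, not our exact setup.
import torchaudio
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_voice

tts = TextToSpeech()

# "my_voice" is a folder of short reference clips placed in the voices directory.
voice_samples, conditioning_latents = load_voice("my_voice")

# Generate speech conditioned on the reference clips.
generated = tts.tts_with_preset(
    "This sentence is spoken in a cloned voice.",
    voice_samples=voice_samples,
    conditioning_latents=conditioning_latents,
    preset="fast",
)

# Tortoise outputs audio at 24 kHz.
torchaudio.save("cloned.wav", generated.squeeze(0).cpu(), 24000)
```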

YourTTS

YourTTS gained prominence initially due to its impressive zero-shot voice conversion capabilities. As a multi-lingual text-to-speech model with voice conversion functionalities, it boasts the ability to make anyone utter any text. This feature seemed tailor-made for our candidate voices. However, despite its promise, our first results were subpar, largely due to the absence of Dutch in its training dataset. Nonetheless, YourTTS provided a promising foundation for our endeavors, given its proficiency in English, French, and Portuguese, and its ability to handle diverse male and female voices effectively.
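Both zero-shot synthesis and zero-shot voice conversion with the pretrained YourTTS model are exposed through the Coqui TTS Python API. A minimal sketch (file paths are placeholders; the language tag must be one of the model's training languages, which is exactly why Dutch was a problem out of the box):

```python
# Minimal sketch of zero-shot YourTTS usage via the Coqui TTS API.
# File paths are placeholders; supported languages are en, fr-fr and pt-br.
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/your_tts")

# Zero-shot TTS: speak a sentence in the voice of the reference clip.
tts.tts_to_file(
    text="This is a zero-shot cloned voice.",
    speaker_wav="reference_speaker.wav",
    language="en",
    file_path="zero_shot_tts.wav",
)

# Zero-shot voice conversion: re-voice an existing recording.
tts.voice_conversion_to_file(
    source_wav="something_someone_said.wav",
    target_wav="reference_speaker.wav",
    file_path="converted.wav",
)
```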

Thus, we commenced by gathering audio snippets from each candidate, primarily sourcing recordings from their selection interviews. Subsequently, we embarked on preparing this data for training, employing a pre-processing pipeline that entailed:

  1. Noise removal to enhance data clarity.
  2. Speaker diarization to isolate candidate data from interviewer segments.
  3. Audio normalization to ensure uniform volume and sampling rate.
  4. Silence removal to trim prolonged silent intervals from the training dataset.
  5. Automatic transcription facilitated by Whisper, streamlining data preparation.

While the process was largely automated, manual oversight was still necessary for transcript verification and speaker identification.
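To make this concrete, here is a condensed sketch of such a pipeline. The tooling choices (noisereduce, pydub, openai-whisper) and thresholds are illustrative rather than our exact setup, and the speaker diarization step (e.g. with pyannote.audio) is left out for brevity.

```python
# Illustrative preprocessing sketch: denoise, normalize, strip long silences,
# and transcribe with Whisper. Tool choices and thresholds are examples only.
import noisereduce as nr
import soundfile as sf
import whisper
from pydub import AudioSegment, effects
from pydub.silence import split_on_silence

# 1. Noise removal (assumes a mono wav; stereo is averaged down first).
audio, sr = sf.read("candidate_raw.wav")
if audio.ndim > 1:
    audio = audio.mean(axis=1)
sf.write("denoised.wav", nr.reduce_noise(y=audio, sr=sr), sr)

# 2. (Speaker diarization would go here, to keep only the candidate's speech.)

# 3. Normalize volume and resample to a uniform rate.
segment = effects.normalize(AudioSegment.from_wav("denoised.wav")).set_frame_rate(22050)

# 4. Drop prolonged silent intervals.
chunks = split_on_silence(segment, min_silence_len=700, silence_thresh=-40, keep_silence=200)
cleaned = sum(chunks, AudioSegment.empty())
cleaned.export("candidate_clean.wav", format="wav")

# 5. Automatic transcription with Whisper (manually verified afterwards).
model = whisper.load_model("medium")
transcript = model.transcribe("candidate_clean.wav", language="nl")["text"]
print(transcript)
```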

Having prepped the data, our focus shifted to configuring the training parameters. With scant documentation available, this phase entailed a trial-and-error approach. Ultimately, we settled on a setup inspired by NanoNomad's tutorials, which involved:

  • Unfreezing the text encoder and duration predictor for approximately 11 thousand steps, followed by freezing them for the remaining fine-tuning steps.
  • Disabling the usage of language embedding throughout the training process and labeling the language as the unknown string "nl".
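In Coqui-TTS terms, this boils down to toggling a few flags on the VITS model arguments between the two phases. A rough sketch follows; the field names are taken from the Coqui VITS/YourTTS implementation as we recall it, so treat them as an assumption and verify against your installed version.

```python
# Rough sketch of the two-phase fine-tuning setup (Coqui-TTS style).
# Field names are assumptions based on the VITS/YourTTS model arguments;
# verify them before training.
from TTS.tts.configs.vits_config import VitsConfig
from TTS.tts.models.vits import VitsArgs

# Phase 1 (~11k steps): text encoder and duration predictor stay trainable.
phase1 = VitsConfig(
    model_args=VitsArgs(
        freeze_encoder=False,
        freeze_DP=False,
        use_language_embedding=False,  # language embedding disabled
    ),
)

# Phase 2 (remaining steps): freeze both and keep fine-tuning the rest.
phase2 = VitsConfig(
    model_args=VitsArgs(
        freeze_encoder=True,
        freeze_DP=True,
        use_language_embedding=False,
    ),
)

# Dataset samples are simply tagged with the string "nl", which the model
# treats as an unknown language since Dutch was never part of its training.
```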

Apart from working on the candidates' voices, we also took a shot at creating a model for the famous presenter, Gilles De Coster. His voiceover recordings were clean and already transcribed, giving us a good chance to see how well our method worked.

Following the outlined training regimen, our outcomes varied significantly. For Gilles De Coster's voice, characterized by clean voiceovers, the results were notably favorable. His model consistently produced clear, well-pronounced Dutch outputs—a testament to the quality of the source audio. However, the picture was less rosy for the candidates' voices. While some outputs showed promise, the majority suffered from noise contamination and exhibited less refined Dutch and intonation. Subsequent analysis of the training dataset revealed a recurring challenge: the models struggled with sentences containing words or sounds that were underrepresented in the training data. This discrepancy underscored the importance of data diversity in achieving optimal results.

Audio sample: Gilles' voice after 11k training steps
Audio sample: Michael's voice after 11k training steps (good example)
Audio sample: Michael's voice after 11k training steps (bad example)

We explored various methods to enhance the model's performance, such as extending training duration, fine-tuning using Gilles De Coster's model, employing voice conversion, and adding Dutch data. However, these efforts yielded limited improvement. The primary reasons behind this were the quality and quantity disparities between Gilles' voiceovers (over 3 hours) and the candidates' interview recordings (less than an hour each).

Furthermore, despite occasionally clear pronunciation, the model outputs did not sound natural. The intonation and liveliness of the voice remained deficient, making it evident that the voice was artificial. To address this, we tried the model's voice conversion capability, but it struggled to fully transform the voice, leaving room for improvement.

Trying something new

After grappling with YourTTS for an extended period, we ultimately opted to transition to a more recent technique known as Retrieval-based Voice Conversion (RVC). This decision stemmed from our realization that voice conversion was crucial for achieving natural-sounding speech. The community-built codebase of RVC proved ideal for this purpose. Leveraging smart feature extraction (influenced by so-vits-svc and HuBERT), a generative adversarial network, and a fast nearest-neighbour mapping (based on Faiss), RVC excels at transforming one voice into another with remarkable efficiency and accuracy.
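The retrieval step is the conceptual core: each frame of HuBERT content features extracted from the input speech is blended with its nearest neighbours from a Faiss index built over the target speaker's training features, before being passed to the generator. A simplified sketch of that idea (not the actual RVC-WebUI code; function names and the blend parameter are illustrative):

```python
# Simplified sketch of the retrieval idea behind RVC. Not the actual
# RVC-WebUI implementation; names and parameters are illustrative.
import faiss
import numpy as np

def build_index(target_feats: np.ndarray) -> faiss.Index:
    """Index the target speaker's content features (shape: frames x dim)."""
    index = faiss.IndexFlatL2(target_feats.shape[1])
    index.add(target_feats.astype("float32"))
    return index

def retrieve_blend(source_feats: np.ndarray,
                   index: faiss.Index,
                   target_feats: np.ndarray,
                   index_rate: float = 0.75) -> np.ndarray:
    """Replace each source frame by a mix of itself and its nearest target frame."""
    _, ids = index.search(source_feats.astype("float32"), k=1)
    retrieved = target_feats[ids[:, 0]]
    return index_rate * retrieved + (1.0 - index_rate) * source_feats

# Pipeline (pseudocode): hubert(source_wav) -> retrieve_blend(...) ->
# GAN generator conditioned on pitch -> converted waveform in the target voice.
```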

Further reading: "Decoding the Sound of Virality: A Deep Dive into Adversarial AI for Voice Conversion Tasks", an in-depth explanation and reverse engineering of the Retrieval-based-Voice-Conversion-WebUI software for local preprocessing. A must-read for people who want to know more about RVC.

This remarkable codebase enables training a voice model with just 20 minutes of data in approximately one hour, a truly impressive feat. After experimenting with this approach for one of the candidates, we were astounded by the quality of the output.

Audio sample: Fake RVC voice of Bernard


This codebase focuses exclusively on voice conversion, eliminating the need to devote time and effort to teaching the model Dutch. Instead, the model focuses solely on voice characteristics, seamlessly mapping new audio to the learned voice, irrespective of intonation, naturalness, or language, since these components are inherently present in the input audio. This approach enabled us to achieve natural-sounding results, even accommodating dialects!

Moreover, this codebase offers the distinct advantage of minimal data requirements. With just 20 minutes of high-quality data, we could train the model—no preprocessing or corrections of Whisper transcripts were necessary. Additionally, during inference, it's possible to fine-tune parameters such as pitch, volume, or noise reduction, affording further control over the output.

Unmasking De Mol

Finally, after all this work, we are done, right? Well, no. To make sure it isn't too easy for viewers of the show to unmask the real identity of De Mol, we need to test whether we can easily find the real voice among the fakes. To do this, we devised four techniques:

  1. The test by ear: can we distinguish the fake voices from the real one by listening carefully for hissing, naturalness, background noise, differences in dialect and so on? This test was also performed by somebody who hadn't heard the generated voices beforehand, to avoid bias.
  2. The test by spectrograms: are there any visible differences in the spectrograms? Some deepfake models are known to cut off high frequencies or to produce harsh transitions in their spectrograms (a sketch of this check and the next follows the list).
  3. The test with pretrained models: there are already some models out there to detect deepfake audio. Although most of these models are trained on outdated data (from the ASVspoof challenges of 2019 and 2021), we thought it was worth a try, as these are probably the first models that watchers of the show will reach for. Here we tested Resemblyzer and whisper-deepfake-detection.
  4. The test with custom-trained models: finally, what better way to test our results than to assume a watcher of the show has expert knowledge? Suppose they have already guessed which model we used and know that the discriminator part of the GAN can be used to detect fake audio. To simulate this scenario, we used each candidate's voice audio from the episodes to retrain the RVC models, after which we used the discriminators to score the voices under test.
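
For the second and third tests, here is a rough sketch of what a curious viewer might do: compare spectrograms of a real and a generated clip, and compare their Resemblyzer speaker embeddings. File names are placeholders, and this is an illustration of the idea rather than our exact evaluation code.

```python
# Rough sketch of two detection checks: visual spectrogram comparison and
# Resemblyzer embedding similarity. File names are placeholders.
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

real, fake = "real_clip.wav", "generated_clip.wav"

# Test 2: look for missing high frequencies or harsh transitions.
fig, axes = plt.subplots(2, 1, figsize=(10, 6), sharex=True)
for ax, path in zip(axes, (real, fake)):
    y, sr = librosa.load(path, sr=None)
    db = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
    librosa.display.specshow(db, sr=sr, x_axis="time", y_axis="hz", ax=ax)
    ax.set_title(path)
plt.tight_layout()
plt.savefig("spectrogram_comparison.png")

# Test 3 (Resemblyzer): cosine similarity between speaker embeddings.
encoder = VoiceEncoder()
emb_real = encoder.embed_utterance(preprocess_wav(real))
emb_fake = encoder.embed_utterance(preprocess_wav(fake))
print("cosine similarity:", float(np.dot(emb_real, emb_fake)))  # embeddings are L2-normalized
```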

Upon analyzing the results of these four tests, we observed a lack of consensus regarding the identity of De Mol—a favorable outcome for the producers of the show. The first two tests, which are more subjective in nature, leaned towards Michael and Stephanie, while the latter two tests favored Michael, Senne, and Bernard.

While we hope that our fake voices remained undetected and that no one developed a deepfake detection model capable of unmasking De Mol, we acknowledge the rapid evolution of technology and the ingenuity of individuals in this field. Hence, we kept a low profile about this project, limited the availability of the data, and applied post-processing effects to the audio. This strategy maximizes De Mol's chances of remaining concealed for as long as possible.

Conclusion

In conclusion, we extend our sincere gratitude to everyone involved in the development of the fake AI voices for De Mol! It was an incredibly rewarding collaboration, and we're proud of the results achieved. However, we also recognize the broader implications of this technology and the imperative need for strict governance and robust deepfake detection algorithms. Our hope is that our efforts serve as a catalyst for greater awareness and action in addressing the ethical challenges posed by deepfake technology.