By Zoë Van Noppen

This blog post is part of a series in which we explain how we wrote “Song of the Machines”, the song with which we participated in the AI Song Contest 2022.

AI songcontest

As shared in the previous blog post, our vision for this year's song contest was to use AI tools as “creative partners in crime, music-making”. Rather than letting one model create the entire song, we kept a human in the loop for a limited part of the songwriting process. However, in the spirit of the “AI Song Contest”, we still wanted to use as much AI as possible. To do so, we created multiple (fine-tuned) models, each responsible for a different part of the songwriting process. These four models were DoriVAE, CocoZoë, GPT-Sandy and Arhurton. This blog post focuses on CocoZoë, which we used to harmonise samples generated by DoriVAE. If you want more information on our songwriting process or DoriVAE, please visit our previous blog post.

Harmonisation with Coconet

CocoZoë is, as its name indicates, a Coconet model fine-tuned on contemporary music. Coconet is the model behind Google’s Bach doodle. Playing with the Bach doodle gives you a good feeling for what the underlying model is trying to accomplish: given a sequence of notes in one voice, it generates three additional voices that sound good together when played simultaneously. We will call this harmonisation in the rest of this post. More specifically, Coconet is a convolutional neural network (CNN) that generates music with (blocked) Gibbs sampling, which means it repeatedly erases part of its own output and fills it in again. The model was originally trained on the music of Johann Sebastian Bach, so it harmonises in that particular style.
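To give a feeling for what blocked Gibbs sampling looks like in practice, here is a minimal sketch in plain Python. It is not Coconet's code: `toy_predict` is a hypothetical stand-in for the trained network (it simply picks a random pitch), the pitch range is made up, and the shrinking block size is a simplification we chose for illustration. The point is only the loop structure: keep the melody fixed, erase a block of notes in the other voices, and resample them given everything else.

```python
import random

NUM_VOICES = 4                  # the given melody plus three generated voices
PITCHES = list(range(48, 73))   # hypothetical MIDI pitch range for the toy model


def toy_predict(grid, voice, step, rng):
    """Hypothetical stand-in for the trained network: sample a pitch for one cell."""
    return rng.choice(PITCHES)


def harmonise(melody, iterations=20, seed=0):
    rng = random.Random(seed)
    # Voice 0 is the given melody and stays fixed; the other voices start random.
    grid = [list(melody)] + [
        [rng.choice(PITCHES) for _ in melody] for _ in range(NUM_VOICES - 1)
    ]
    free_cells = [(v, t) for v in range(1, NUM_VOICES) for t in range(len(melody))]
    for i in range(iterations):
        # Erase a block of cells; the block shrinks as sampling progresses.
        fraction = max(0.1, 1.0 - i / iterations)
        block = rng.sample(free_cells, max(1, int(fraction * len(free_cells))))
        for v, t in block:
            grid[v][t] = None
        # Resample every erased cell given the current context.
        for v, t in block:
            grid[v][t] = toy_predict(grid, v, t, rng)
    return grid


melody = [60, 62, 64, 65, 67, 65, 64, 62]
result = harmonise(melody)
```

With a real model in place of `toy_predict`, each resampling step would condition on the notes that are still present, which is what lets the voices converge towards something that sounds coherent.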

Coconet is trained in a way that resembles how a composer writes a piece of music. A composer does not write a piece linearly from start to finish. It is rather an iterative process, in which part of the piece gets scrapped and rewritten during each iteration. Coconet is trained in a similar manner: during training, notes are randomly erased from a piece and the model learns to fill in the correct notes. According to the creators of Coconet, this training procedure is equivalent to training many different models at once, where each model is applicable in a different scenario. More information on why this training process works so well can be found in Magenta’s blog post on the Coconet model.
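The erase-and-fill-in idea can be sketched in a few lines. The snippet below is our own illustration, not the Magenta implementation: the function name `make_training_pair` is hypothetical, and erasing each note independently with a fixed probability is an assumption made to keep the example short. It builds one training example from a piece: the masked input the model would see, the mask itself, and the original notes it should reconstruct.

```python
import random


def make_training_pair(piece, erase_prob, rng):
    """Erase random notes; return (masked input, mask, target).

    `piece` is a list of voices, each a list of MIDI pitches (one per time step).
    Erased positions are set to None in the masked input and True in the mask.
    """
    masked, mask = [], []
    for voice in piece:
        masked_voice, mask_voice = [], []
        for pitch in voice:
            erase = rng.random() < erase_prob
            masked_voice.append(None if erase else pitch)
            mask_voice.append(erase)
        masked.append(masked_voice)
        mask.append(mask_voice)
    return masked, mask, piece


piece = [[60, 62, 64, 65], [55, 57, 59, 60]]
masked, mask, target = make_training_pair(piece, 0.5, random.Random(0))
```

During training, the loss would only be computed on the positions where the mask is set, since those are the notes the model has to fill in.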

Architecturally, Coconet is a rather simple convolutional model with batch normalization and residual connections. Its input has the dimensions time x pitch x number of voices: each voice is a 2D time x pitch feature map, and each point in time within a voice is represented as a one-hot pitch vector. This means that the original model accepts only one note per voice per time step. This limitation can be removed, but doing so would come at the cost of a vast increase in the number of variables.
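As an illustration of this representation, the sketch below turns a list of monophonic voices into a time x pitch x voice structure with one-hot pitch vectors. It uses nested Python lists rather than a tensor library to stay self-contained, and the function name `to_pianoroll` and the choice of the full 128-pitch MIDI range are our own assumptions for the example.

```python
NUM_PITCHES = 128  # full MIDI pitch range, chosen here for simplicity


def to_pianoroll(voices, num_pitches=NUM_PITCHES):
    """Return a nested list indexed [time][pitch][voice] with 0/1 entries.

    `voices` is a list of voices, each a list with one MIDI pitch per time step.
    """
    num_steps = len(voices[0])
    roll = [
        [[0] * len(voices) for _ in range(num_pitches)] for _ in range(num_steps)
    ]
    for v, voice in enumerate(voices):
        for t, pitch in enumerate(voice):
            roll[t][pitch][v] = 1  # exactly one active pitch per voice and step
    return roll


voices = [[60, 62], [55, 57]]
roll = to_pianoroll(voices)
```

Because every (time, voice) pair has exactly one active pitch, a chord within a single voice cannot be expressed in this format, which is the limitation described above.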

Fine-tuning with the Lakh dataset

We fine-tuned Coconet on the Lakh MIDI dataset to make it sound more contemporary. The Lakh dataset contains about 176,000 MIDI files of contemporary music. MIDI files contain information on which notes are played, when, and for how long. In contrast to standard MP3 files, MIDI files do not contain audio information. MIDI files are thus lightweight and have far fewer dimensions than audio. Nevertheless, they also have limitations: they cannot capture human voices, nor many of the subtle timbres, dynamics and expressivity in the music. About 45,000 of the Lakh MIDI files are aligned to entries in the Million Song Dataset, a collection of audio features and metadata for contemporary music tracks. Each alignment comes with a confidence score. Moreover, the Million Song Dataset provides metadata such as key, artist, genre, danceability...

The Lakh MIDI dataset was preprocessed so it could be used to retrain Coconet. To train CocoZoë we used no more than four instrumental tracks per song: piano, strings, guitar and bass. Moreover, we had to make sure each song had the right tempo and key and that only one note was played per track at each time step.
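The kind of preprocessing described here could look roughly like the sketch below. This is not our actual pipeline: the data layout (a dict from instrument name to a list of `(time step, pitch)` notes), the function names, and the rule of keeping the highest note when several sound at once are all assumptions made for illustration. Tempo handling is left out.

```python
KEPT_INSTRUMENTS = ("piano", "strings", "guitar", "bass")


def make_monophonic(notes, num_steps):
    """Keep only the highest pitch at each time step (an assumed rule)."""
    line = [None] * num_steps
    for step, pitch in notes:
        if line[step] is None or pitch > line[step]:
            line[step] = pitch
    return line


def transpose(line, semitones):
    """Shift every note by a number of semitones to reach the target key."""
    return [None if p is None else p + semitones for p in line]


def preprocess(song, num_steps, semitones=0):
    """Keep at most the four chosen instruments, one note per step, transposed."""
    return {
        name: transpose(make_monophonic(song[name], num_steps), semitones)
        for name in KEPT_INSTRUMENTS
        if name in song
    }


song = {
    "piano": [(0, 60), (0, 64), (1, 62)],  # two notes at step 0: a chord
    "strings": [(0, 67)],
    "bass": [(0, 36), (1, 38)],
    "drums": [(0, 35)],                    # not one of the kept instruments
}
tracks = preprocess(song, num_steps=2, semitones=2)
```

In this toy example the drum track is dropped, the piano chord at step 0 is reduced to its top note, and everything is moved up two semitones.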

Once CocoZoë was retrained, we fed it the melodies generated by DoriVAE, which it could then harmonise. Again, we cherry-picked the best results. A couple of these harmonised samples could then be put together in a DAW and passed on to a final enrichment step.

Below you can find the melody which was generated with DoriVAE and used as input for both the Coconet and CocoZoë models, followed by the resulting outputs.


Coconet delivers impressive results when harmonising a piece of music. In a similar fashion, the fine-tuned Coconet generates more contemporary-sounding harmonisations. Now we only need to put the outputs of the different models together and put a nice beat under them. And of course, you cannot have a great song without some good lyrics. Stay tuned for the upcoming blog post on these topics.
