It was only five years ago that electronic punk band YACHT entered the recording studio with a daunting task: They would train an AI on 14 years of their music, then synthesize the results into the album “Chain Tripping.”
“I’m not interested in being a reactionary,” YACHT member and tech writer Claire L. Evans said in a documentary about the album. “I don’t want to return to my roots and play acoustic guitar because I’m so freaked out about the coming robot apocalypse, but I also don’t want to jump into the trenches and welcome our new robot overlords either.”
But our new robot overlords are making a whole lot of progress in the space of AI music generation. Even though the Grammy-nominated “Chain Tripping” was released in 2019, the technology behind it is already becoming outdated. Now, the startup behind the open source AI image generator Stable Diffusion is pushing us forward again with its next act: making music.
Harmonai is an organization with financial backing from Stability AI, the London-based startup behind Stable Diffusion. In late September, Harmonai released Dance Diffusion, an algorithm and set of tools that can generate clips of music by training on hundreds of hours of existing songs.
“I started my work on audio diffusion around the same time as I started working with Stability AI,” Zach Evans, who heads development of Dance Diffusion, told TechCrunch in an email interview. “I was brought on to the company due to my development work with [the image-generating algorithm] Disco Diffusion and I quickly decided to pivot to audio research. To facilitate my own learning and research, and make a community that focuses on audio AI, I started Harmonai.”
Dance Diffusion remains in the testing stages — at present, the system can only generate clips a few seconds long. But the early results provide a tantalizing glimpse at what could be the future of music creation, while at the same time raising questions about the potential impact on artists.
The emergence of Dance Diffusion comes several years after OpenAI, the San Francisco-based lab behind DALL-E 2, detailed its grand experiment with music generation, dubbed Jukebox. Given a genre, artist and a snippet of lyrics, Jukebox could generate relatively coherent music complete with vocals. But the songs Jukebox produced lacked larger musical structures like choruses that repeat and often contained nonsense lyrics.
Google’s AudioLM, detailed for the first time earlier this week, shows more promise, with an uncanny ability to generate piano music given a short snippet of playing. But it hasn’t been open sourced.
Dance Diffusion aims to overcome the limitations of previous open source tools by borrowing technology from image generators such as Stable Diffusion. The system is what’s known as a diffusion model, which generates new data (e.g., songs) by learning how to destroy and recover many existing samples of data. As it’s fed the existing samples — say, the entire Smashing Pumpkins discography — the model gets better at recovering all the data it had previously destroyed to create new works.
Kyle Worrall, a Ph.D. student at the University of York in the U.K. studying the musical applications of machine learning, explained the nuances of diffusion systems in an interview with TechCrunch:
“In the training of a diffusion model, training data such as the MAESTRO data set of piano performances is ‘destroyed’ and ‘recovered,’ and the model improves at performing these tasks as it works its way through the training data,” he said via email. “Eventually the trained model can take noise and turn that into music similar to the training data (i.e., piano performances in MAESTRO’s case). Users can then use the trained model to do one of three tasks: Generate new audio, regenerate existing audio that the user chooses or interpolate between two input tracks.”
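To make the “destroy” half of that description concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not Harmonai's code: the toy sine wave stands in for real audio, and the cosine blend between signal and noise is one common choice of noise schedule, assumed here rather than taken from Dance Diffusion. The function names are hypothetical.

```python
import math
import random


def make_tone(num_samples=256, freq=4.0):
    """A toy 'recording': one sine wave standing in for real audio."""
    return [math.sin(2 * math.pi * freq * i / num_samples)
            for i in range(num_samples)]


def destroy(clean, t, rng):
    """Blend a clean signal with Gaussian noise.

    t = 0 leaves the signal intact and t = 1 leaves only noise.
    The cosine/sine weights are an assumed schedule, chosen for illustration.
    """
    signal_weight = math.cos(t * math.pi / 2)
    noise_weight = math.sin(t * math.pi / 2)
    noise = [rng.gauss(0.0, 1.0) for _ in clean]
    noisy = [signal_weight * c + noise_weight * n
             for c, n in zip(clean, noise)]
    return noisy, noise


def correlation(a, b):
    """Pearson correlation, used only to show how much of the tone survives."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


if __name__ == "__main__":
    rng = random.Random(0)
    tone = make_tone()
    for t in (0.0, 0.25, 0.5, 0.75, 1.0):
        noisy, noise = destroy(tone, t, rng)
        # During training, a network would see (noisy, t) and be asked to
        # predict `noise`; that learned "recovery" step is not shown here.
        print(f"t={t:.2f}  similarity to original: {correlation(tone, noisy):.3f}")
```

The similarity falls toward zero as `t` approaches 1. A trained model runs the process in reverse, starting from pure noise and removing a little of it at each step until something resembling the training data emerges.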
It’s not the most intuitive idea. But as DALL-E 2, Stable Diffusion and other such systems have shown, the results can be remarkably realistic.
For example, check out this Dance Diffusion model fine-tuned on Daft Punk music:
Or this style transfer of the Pirates of the Caribbean theme to flute:
Or this style transfer of Smash Mouth vocals to the Tetris theme (yes, really):
Or these models, which were fine-tuned on copyright-free dance music:
Jona Bechtolt of YACHT was impressed by what Dance Diffusion can create.
“Our initial reaction was like, ‘Okay, this is a leap forward from where we were at before with raw audio,’” Bechtolt told TechCrunch.
Unlike popular image-generating systems, Dance Diffusion is somewhat limited in what it can create — at least for the time being. While it can be fine-tuned on a particular artist, genre or even instrument, the system isn’t as general as Jukebox. The handful of Dance Diffusion models available — a hodgepodge from Harmonai and early adopters on the official Discord server, including models fine-tuned with clips from Billy Joel, The Beatles, Daft Punk and musician Jonathan Mann’s Song A Day project — stay within their respective lanes. That is to say, the Jonathan Mann model always generates songs in Mann’s musical style.
And Dance Diffusion-generated music won’t fool anyone today. While the system can “style transfer” songs by applying the style of one artist to a song by another, essentially creating covers, it can’t generate clips longer than a few seconds or lyrics that aren’t gibberish (see the below clip). That’s the result of technical hurdles Harmonai has yet to overcome, says Nicolas Martel, a self-taught game developer and member of the Harmonai Discord.
“The model is only trained on short 1.5-second samples at a time so it can’t learn or reason about long-term structure,” Martel told TechCrunch. “The authors seem to be saying this isn’t a problem, but in my experience — and logically anyway — that hasn’t been very true.”
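A rough sketch shows why such short windows limit what a model can learn about structure. The 1.5-second window comes from Martel's description. Everything else is an assumption for illustration: the toy sample rate, the non-overlapping slicing and the hypothetical function name are not taken from Harmonai's training pipeline.

```python
def slice_into_windows(samples, sample_rate, window_seconds=1.5):
    """Cut a recording into fixed-length, non-overlapping training windows.

    Any trailing partial window is dropped. Each window is handed to the
    model independently, with no record of where in the song it came from.
    """
    window_len = int(window_seconds * sample_rate)
    return [samples[start:start + window_len]
            for start in range(0, len(samples) - window_len + 1, window_len)]


if __name__ == "__main__":
    toy_sample_rate = 1_000                   # assumed; real audio uses far higher rates
    song = [0.0] * (180 * toy_sample_rate)    # a silent three-minute "song"
    windows = slice_into_windows(song, toy_sample_rate)
    print(len(windows), "windows of", len(windows[0]), "samples each")
```

Under these assumptions, a three-minute song becomes 120 separate snippets. A verse and the chorus that follows it never appear in the same training example, so the model has no chance to learn how one leads into the other.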
YACHT’s Evans and Bechtolt are concerned about the ethical implications of AI — they are working artists, after all — but they observe that these “style transfers” are already part of the natural creative process.
“That’s something that artists are already doing in the studio in a much more informal and sloppy way,” Evans said. “You sit down to write a song and you’re like, I want a Fall bass line and a B-52’s melody, and I want it to sound like it came from London in 1977.”
But Evans isn’t interested in writing the dark, post-punk rendition of “Love Shack.” Rather, she thinks that interesting music comes from experimentation in the studio — even if you take inspiration from the B-52’s, your final product may not bear the signs of those influences.
“In trying to achieve that, you fail,” Evans told TechCrunch. “One of the things that attracted us to machine learning tools and AI art was the ways in which it was failing, because these models aren’t perfect. They’re just guessing at what we want.”
Evans describes artists as “the ultimate beta testers,” using tools outside of the ways in which they were intended to create something new.
“Oftentimes, the output can be really weird and damaged and upsetting, or it can sound really strange and novel, and that failure is delightful,” Evans said.
Assuming Dance Diffusion one day reaches the point where it can generate coherent whole songs, it seems inevitable that major ethical and legal issues will come to the fore. They already have, albeit around simpler AI systems. In 2020, Jay-Z’s record label filed copyright strikes against a YouTube channel, Vocal Synthesis, for using AI to create Jay-Z covers of songs like Billy Joel’s “We Didn’t Start the Fire.” After initially removing the videos, YouTube reinstated them, finding the takedown requests were “incomplete.” But deepfaked music still stands on murky legal ground.
Perhaps anticipating legal challenges, OpenAI for its part open sourced Jukebox under a non-commercial license, prohibiting users from selling any music created with the system.
“There is little work into establishing how original the output of generative algorithms are, so the use of generative music in advertisements and other projects still runs the risk of accidentally infringing on copyright and as such damaging the property,” Worrall said. “This area needs to be further researched.”
An academic paper authored by Eric Sunray, now a legal intern at the Music Publishers Association, argues that AI music generators like Dance Diffusion violate music copyright by creating “tapestries of coherent audio from the works they ingest in training, thereby infringing the United States Copyright Act’s reproduction right.” Following the release of Jukebox, critics have also questioned whether training AI models on copyrighted musical material constitutes fair use. Similar concerns have been raised around the training data used in image-, code- and text-generating AI systems, which is often scraped from the web without creators’ knowledge.
Technologists Mat Dryhurst and Holly Herndon have founded Spawning AI, a set of AI tools built for artists, by artists. One of their projects, “Have I Been Trained,” allows users to search for their artwork and see if it has been incorporated into an AI training set without their consent.
“We are showing people what exists within popular datasets used to train AI image systems and are initially offering them tools to opt out or opt in to training,” Herndon told TechCrunch via email. “We are also talking to many of the biggest research organizations to convince them that consensual data is beneficial for everyone.”
But these standards are — and will likely remain — voluntary. Harmonai hasn’t said whether it’ll adopt them.
“To be clear, Dance Diffusion is not a product and it is currently only research,” said Zach Evans of Stability AI. “All of the models that are officially being released as part of Dance Diffusion are trained on public domain data, Creative Commons-licensed data and data contributed by artists in the community. The method here is opt-in only and we look forward to working with artists to scale up our data through further opt-in contributions, and I applaud the work of Holly Herndon and Mat Dryhurst and their new Spawning organization.”
YACHT’s Evans and Bechtolt see parallels between the emergence of AI-generated art and other new technologies.
“It’s especially frustrating when we see the same patterns play out across all disciplines,” Evans told TechCrunch. “We’ve seen the way that people being lazy about security and privacy on social media can lead to harassment. When tools and platforms are designed by people who aren’t thinking about the long-term consequences and social effects of their work like that, things happen.”
Jonathan Mann — the same Mann whose music was used to train one of the early Dance Diffusion models — told TechCrunch that he has mixed feelings about generative AI systems. While he believes that Harmonai has been “thoughtful” about the data they’re using for training, others like OpenAI have not been as conscientious.
“Jukebox was trained on thousands of artists without their permission — it’s staggering,” Mann said. “It feels weird to use Jukebox knowing that a lot of folks’ music was used without their permission. We are in uncharted territory.”
From a user perspective, Waxy’s Andy Baio speculates in a blog post that new music generated by an AI system would be considered a derivative work, in which case only the original elements would be protected by copyright. Of course, it’s unclear what might be considered “original” in such music. Using this music commercially means entering uncharted waters. It’s a simpler matter if generated music is used for purposes protected under fair use, like parody and commentary, but Baio expects that courts would have to make case-by-case judgments.
According to Herndon, copyright law is not structured to adequately regulate AI art-making. Evans also points out that the music industry has been historically more litigious than the visual art world, which is perhaps why Dance Diffusion was explicitly trained on a dataset of copyright-free or voluntarily submitted material, while DALL-E mini will easily spit out a Pikachu if you input the term “Pokémon.”
“I have no illusion that that’s because they thought that was the best thing to do ethically,” Evans said. “It’s because copyright law in music is very strict and more aggressively enforced.”
Gordon Tuomikoski, an arts major at the University of Nebraska-Lincoln who moderates the official Stable Diffusion Discord community, believes that Dance Diffusion has immense artistic potential. He notes that some members of the Harmonai server have created models trained on dubstep “wubs,” kicks and snare drums and backup vocals, which they’ve strung together into original songs.
“As a musician, I definitely see myself using something like Dance Diffusion for samples and loops,” Tuomikoski told TechCrunch via email.
Martel sees Dance Diffusion one day replacing VSTs, the digital standard used to connect synthesizers and effect plugins with recording systems and audio editing software. For example, he says, a model trained on ’70s jazz rock and Canterbury music will intelligently introduce new “textures” in the drums, like subtle drum rolls and “ghost notes,” in the same way that artists like John Marshall might — but without the manual engineering work normally required.
Take this Dance Diffusion model of Senegalese drumming, for instance:
And this model of snares:
And this model of a male choir singing in the key of D across three octaves:
And this model of Mann’s songs fine-tuned with royalty-free dance music:
“Normally, you’d have to lay down notes in a MIDI file and sound-design really hard. Achieving a humanized sound this way is not only very time-consuming but requires a deeply intimate understanding of the instrument you’re sound designing,” Martel said. “With Dance Diffusion, I look forward to feeding the finest ’70s prog rock into AI, an infinite unending orchestra of virtuoso musicians playing Pink Floyd, Soft Machine and Genesis, trillions of new albums in different styles, remixed in new ways by injecting some Aphex Twin and Vaporwave, all performing at the peak of human creativity — all in collaboration with your own personal tastes.”
Mann has greater ambitions. He’s currently using a combination of Jukebox and Dance Diffusion to play around with music generation and plans to release a tool that’ll allow others to do the same. But he hopes to one day use Dance Diffusion — possibly in conjunction with other systems — to create a “digital version” of himself capable of continuing the Song A Day project after he passes away.
“The exact form it’ll take hasn’t quite become clear yet … [but] thanks to folks at Harmonai and some others I’ve met in the Jukebox Discord, over the last few months I feel like we’ve made bigger strides than any time in the last four years,” Mann said. “I have over 5,000 Song A Day songs, complete with their lyrics as well as rich metadata, with attributes ranging from mood, genre, tempo, key, all the way to location and beard (whether or not I had a beard when I wrote the song). My hope is that given all this data, we can create a model that can reliably create new songs as if I had written them myself. A Song A Day, but forever.”
If AI can successfully make new music, where does that leave musicians?
YACHT’s Evans and Bechtolt point out that new technology has upended the art scene before, and the results weren’t as catastrophic as expected. In the 1980s, the U.K. Musicians Union attempted to ban the use of synthesizers, arguing that they would replace musicians and put them out of work.
“With synthesizers, a lot of artists took this new thing and instead of refusing it, they invented techno, hip hop, post punk and new wave music,” Evans said. “It’s just that right now, the upheavals are happening so quickly that we don’t have time to digest and absorb the impact of these tools and make sense of them.”
Still, YACHT worries that AI could eventually challenge work that musicians do in their day jobs, like writing scores for commercials. But like Herndon, they don’t think AI can quite replicate the creative process just yet.
“It is divisive and a fundamental misunderstanding of the function of art to think that AI tools are going to replace the importance of human expression,” Herndon said. “I hope that automated systems will raise important questions about how little we as a society have valued art and journalism on the internet. Rather than speculate about replacement narratives, I prefer to think about this as a fresh opportunity to revalue humans.”