On the latest episode of Creativity Squared, Harmonai founder Zach Evans walks us through the fascinating evolution of A.I. for music production via his own journey from Microsoft developer to amateur EDM musician to now leading a team of artists and programmers developing the next generation of A.I. models for music.
Harmonai is a community-led lab “by artists and for artists” under the umbrella of Stability AI. Stability AI is a company that offers a range of open-source A.I. products, such as their well-known Stable Diffusion text-to-image generator.
Through the episode, Zach touches on the importance of the open-source development community, the biggest challenges in improving A.I. music models, and his goal of developing A.I. technology that empowers artists.
Some of us got into making sourdough bread, and others filled their time welcoming a new puppy into their home, but when social restrictions were implemented early in the Covid-19 pandemic, Zach Evans started making electronic dance music (EDM). As a frequent raver before the pandemic, Zach was already an EDM enthusiast.
In an effort to accelerate his learning, Zach started actively participating in a few communities of EDM producers on Discord, where members would swap tips, share their tracks, and solicit feedback. Through these communities, Zach got the opportunity to interact with some of the producers he’d already admired, such as Kill The Noise. In December 2020, Zach was co-hosting a Twitch stream with Kill the Noise, where the producer was experimenting with one of the first A.I. music projects, Jukebox from OpenAI.
As a software developer at Microsoft, Zach had dabbled in machine learning before, but the potential applications for music gave him a new passion to pursue. Just as when he wanted to level up his music-making, Zach tried to find the “movers and shakers” in the A.I. music scene. That’s how he first encountered Dadabots, a death metal duo from Boston. They caught Zach’s attention with a YouTube livestream they started in 2019 that uses a neural network to produce endless death metal (it’s still going to this day). Dadabots weren’t exactly machine learning scientists at that point, but their experience working with A.I. in music production put them in a position to be advising PhDs. Realizing he could make an impact on the space without an extensive background in machine learning, Zach accepted an invitation to mingle with the larger machine learning community at the NeurIPS conference.
Following the conference, he tried to find online communities for people interested in making music with A.I., only to find that none really existed.
In 2021, Zach performed his first and only DJ set to date, opening for Au5, among others. Attending shows and getting backstage with artists, Zach grew his network. Meanwhile, he started participating in communities such as EleutherAI, where members were sharing their experiments with earlier iterations of text-to-image generators available on Google’s Colab platform. Colab, short for Colaboratory, is a service that gives developers a browser-based notebook environment where they can run machine learning code on Google’s massive computing infrastructure, free for basic use or for a fee at higher tiers. The service has made the machine learning space much more accessible by letting developers run code without investing thousands of dollars in their own graphics processing unit (GPU). GPUs are processors purpose-built for the kind of parallel computation machine learning requires, and the same hardware fills the massive data centers that power our cloud-based programs.
Using a Colab notebook for an image generator called Disco Diffusion that could produce an infinite zoom effect, Zach started making audio-reactive music videos for some of his producer friends who lacked the resources to hire videographers or digital animators. From there, he got himself onto the development team for Disco Diffusion.
It was through his participation in the A.I. image generation communities that Zach was introduced to the founder of Stability AI, Emad Mostaque. At the time, Mostaque was supporting A.I. research by offering access to expensive, high-powered GPUs. Zach was deep in the machine learning community at that point, even receiving some mentorship for optimizing his models from Katherine Crowson, who he calls the “Oracle” of the space. Eventually, Zach had the “glass-shattering, world-breaking” realization that his tweaking and fine-tuning in Colab notebooks was actually cutting-edge research.
After a conversation with Mostaque, Zach got the support he needed to leave his job at Microsoft, as well as a directive from Mostaque to go out and build a community of artists and developers that could help them apply the technology they were using for image generation to generate music instead. And so, Harmonai was born.
Harmonai’s most significant contribution to A.I. for music is their Dance Diffusion model. Dance Diffusion can generate new variations of music that it’s been trained on. Enter Jonathan Mann, who holds the Guinness World Record for writing and recording a song every day for over 5,000 consecutive days. Mann was looking to join the young Midjourney server on Discord, so he wrote a song about it, posted it to YouTube, and got the invite. Zach was a member of the server already since some of the people he’d worked with on Disco Diffusion had joined Midjourney.
Comparing himself to Danny Ocean in Ocean’s 11, Zach enjoyed collaborating with Mann, who reintroduced him to Dadabots. Zach ended up recruiting CJ Carr from Dadabots to be on the Harmonai team. Mann’s biggest impact, however, was allowing Zach to use his collection of thousands of songs to train the Dance Diffusion model. The “J Mann” model was the first trained version of Dance Diffusion, but anybody can get their own version of the model and train it on their own music or owned samples to produce new samples in the same style.
Open-source distribution is a core tenet of Harmonai and Stability AI’s business. Zach says that free access to their models is critical to unleashing the full potential of artistic expression that A.I. technology can enable.
Regulating access to A.I., and preventing bad actors from abusing A.I.-enabled voice cloning or deepfakes is currently a hot topic for stakeholders at every level. Zach sees less risk for A.I.-generated music though, reasoning that “you could make bad music, but you could do that without A.I. too.”
The thornier question for A.I.’s application to the music industry is how it will affect individual artists’ equity. The technology isn’t advanced enough to replace artists for now, but could that be a risk one day? At a high level, Zach says that A.I. could reduce the skill barriers and empower more people to create more diverse music. As for the existing musicians, Zach says much of the fear about A.I. imagines a dynamic of severe corporate control that already exists to a certain extent.
Zach sees A.I. as a new tool for musicians with the potential to expand the depth of the music landscape we know today. He says he wants to see creatives use the technology “wrong” to explore what else is possible in music. He compares it to the night sky: all the stars we can see is all the music that already exists, and all the space between the stars is where he and his colleagues want to explore. He thinks a lot about an A.I. model that can mimic the natural ability of a brilliant musician to hear and compose new sounds that exist between and outside the sounds we’ve already heard.
In terms of the technical music production process, Zach thinks A.I. can serve as a creative collaborator to handle the aspects of the process that might not interest every musician. For instance, a songwriter might not be a great instrumentalist, but maybe A.I. can generate beats or melodies to accompany their lyrics. As Zach sees it, A.I. could help musicians focus on what they do best and delegate the tasks that might slow down their process.
While there’s already a lot of great work being made with A.I. right now, one of the biggest challenges in developing more powerful models is the size of the data and the amount of computing power it takes to process high-quality audio. Digital audio is commonly recorded at a sample rate of 48kHz, which refers to how many times per second the analog sound wave is measured and stored as a number. 48kHz means 48,000 numbers per second for every channel of audio, and processing that much data demands computing power that might be financially out of reach for those looking to dabble in music production.
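To put those numbers in perspective, here’s a rough back-of-the-envelope sketch (the 16-bit depth and stereo channel count are illustrative assumptions, not specifics from the episode):

```python
# Rough data-size arithmetic for uncompressed digital audio.
SAMPLE_RATE = 48_000   # 48 kHz: samples captured per second, per channel
CHANNELS = 2           # stereo (assumed)
BYTES_PER_SAMPLE = 2   # 16-bit audio (assumed)

numbers_per_second = SAMPLE_RATE * CHANNELS
bytes_per_minute = numbers_per_second * BYTES_PER_SAMPLE * 60

print(numbers_per_second)            # 96000 numbers every second
print(bytes_per_minute / 1_000_000)  # 11.52 MB per minute, uncompressed
```

Even a short track is millions of numbers, which is why a model that must attend to every sample quickly becomes expensive.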
That’s why Harmonai is supporting volunteer research projects and hosting production challenges for artists, trying to push the limits of the technology in pursuit of longer sequence lengths. The long-term goal for Zach and others in his space is a music version of text-to-image generators such as Stable Diffusion or Midjourney.
Thank you, Zach, for being our guest on Creativity Squared.
This show is produced and made possible by the team at PLAY Audio Agency: https://playaudioagency.com.
Creativity Squared is brought to you by Sociality Squared, a social media agency who understands the magic of bringing people together around what they value and love: http://socialitysquared.com.
Because it’s important to support artists, 10% of all revenue Creativity Squared generates will go to ArtsWave, a nationally recognized non-profit that supports over 150 arts organizations, projects, and independent artists.
Zach Evans: I can keep listing — probably every genre starts with some new technology, pop coming from radio or whatever. And so I think about what the most impactful things one can do in music are. I don’t think that by writing songs individually I’m going to change the world. I don’t think that I have a story to tell.
This is my story to tell, right? You know, I don’t think I have some heartfelt song to write, to change America, to change people’s whatever. That’s not gonna be my impact on art. But as a, you know, technological person, this has a strong potential to create new scenes, create new genres, create something different.
And that, I think is what I’m most excited about is just what creative people will do with these tools.
Helen Todd: Zach Evans helped found and is the head of Harmonai, a StabilityAI research lab and online community. StabilityAI is the world’s leading open source artificial intelligence company. As an integral part of StabilityAI, Harmonai is dedicated to creating open source generative audio models and pushing forward creative uses of generative AI in music production.
Harmonai is by artists and for artists. Zach is a programmer and musician and is driven by the computational possibilities of the future of music and the belief that community is everything. Zach knows that generative AI is a nuclear bomb for the music industry, and his mission is to turn it into nuclear energy.
He cares deeply about artistic freedom and the musicians, DJs and producers that he’s empowering with Harmonai. He leads with substance and is most excited about how people will use these tools beyond what he can imagine and hopes people make music that’s weirder and different from what exists today.
Pushing the boundaries of the technology, Zach’s passion for community, music, and artistic freedom is infectious. He landed in a dream role, and his story is intertwined with the evolution of generative AI music. Today you’ll hear his story and how he went from being a Microsoft developer to becoming an EDM musician, to now leading a dynamic team of artists and programmers to develop the next generation of AI music models.
Dolly Parton and AI also came up and you’ll hear how Zach demystifies diffusion, shares his perspective on open source development, discusses the challenges to improve AI music models, and thoughtfully articulates his vision of AI technology empowering artists.
Theme: But have you ever thought, what if this is all just a dream?
Helen Todd: Welcome to Creativity Squared. Discover how creatives are collaborating with artificial intelligence in your inbox, on YouTube, and on your preferred podcast platform. Hi, I’m Helen Todd, your host, and I’m so excited to have you join the weekly conversations I’m having with amazing pioneers in this space.
The intention of these conversations is to ignite our collective imagination at the intersection of AI and creativity to envision a world where artists thrive.
Zach, it is so nice to have you on the Creativity Squared show. Welcome.
Zach Evans: Thanks for having me. Happy to be here. Stoked to talk about some cool stuff today.
Helen Todd: I am so excited, especially based on our pre-call. You have so much knowledge and passion and excitement around the topic of AI and music.
But let’s start with, out the gate, what’s the most important thing you want our viewers and listeners to know at this moment in time?
Zach Evans: Yeah, so many big things to say about AI and music. I think the first one is, don’t be afraid. And then, you know, I think that it’s a very powerful technology. And a thing I’ve been saying is, you know, it’s almost like a nuclear bomb for the music industry. And I think that it’s, you know, it’s our job to try to turn that into nuclear energy, you know, find a way to take the power of this technology and really support artists, have it help art thrive and artists thrive and be a strong but beneficial force in the music scene.
Helen Todd: Ever since you said that, I’ve repeated that so many times about the nuclear bomb and how to convert it to nuclear energy.
So let’s dive in. So you’re the head of Harmonai at StabilityAI. And for those who don’t know StabilityAI, I know they’re a major player in the AI space, but can you just kind of introduce our viewers to StabilityAI first?
Zach Evans: Yeah, so StabilityAI is a company started a few years back by a man named Emad Mostaque, and it’s essentially about building the blocks to help activate humanity’s potential, focused on generative AI systems. But really, at a high level, it’s an organization to organize and support a bunch of people in the open source AI space. The definition and goals of it have changed a little bit over time, but it’s one of the main players in the generative AI space — for images, for audio, for all the different modalities.
Helen Todd: That’s great. And then StabilityAI, as I understand it, really introduced diffusion, and that was a game changer in generative AI. Is that correct?
Zach Evans: It’s a bit of a longer story, but I guess I can go into it a bit — kind of the history of diffusion text-to-image and how we got to where we are today.
So the idea of these diffusion models — let’s talk about this in terms of images first, since that’s where the first big thing popped off with generative diffusion models. The idea is being able to add noise to something and then remove the noise from it, but removing noise in a way that makes it something — as opposed to removing noise from noise and getting all one color or something.
But the idea here is that it creates things by turning noise into signal. The original paper on generative models with diffusion came out in 2015; the lead author was Jascha Sohl-Dickstein. I think he worked for Google maybe.
The big breakthrough was a release in 2020, I wanna say, by Ho et al. on these image-generation diffusion models. OpenAI followed with the first big diffusion release of their own: their unconditional diffusion model — so that’s not text-to-image, that’s just noise to something.
They also, before releasing that, released their Dall-E 1 — we knew it at the time just as Dall-E — the first time people really saw good text-based image synthesis. You know, the classic examples were like an astronaut riding a horse or a chair shaped like an avocado.
Those kind of became their default examples when they released the paper and the blog post about it, and it was super cool. For the first time, you could put in a description and get a really good image, and people were fascinated by that.
But OpenAI didn’t release that model. They shared the blog post and examples but kept the model internal. And that led to a lot of people outside of OpenAI being like, hey, wait — that seems really cool and I want to try it.
But of course, making a thing like that requires a whole bunch of data and a whole bunch of compute, and a bunch of people on Discord didn’t really have access to that.
So there’s this community effort, and this community growing around it, of: how can we make an open source version of that? How can we get something where you can put in text and get an image — a thing you don’t have to work at OpenAI to play with?
And that led to a few more releases from OpenAI that came out after that, that really enabled this. The first one was a model called CLIP: Contrastive Language-Image Pre-training. At its most basic, you give it some text and you give it an image, and it tells you how well that text matches that image. That’s what that model is meant to do.
So the community took that model and a different model called BigGAN. A guy who now goes by advadnoun on Twitter put those two things together and realized that with some special math you could take this GAN generator and this CLIP model and have it kind of alter the image until it fit the prompt better.
That was called The Big Sleep. And that was kind of the first big release of open source text-to-image. And it was, you know, abstract and weird and incoherent, and led to people posting things on Twitter like, I’ll pay you a hundred bucks if you can name anything in this picture.
And it was all, of course, incoherent nothingness. You couldn’t name anything. But that kind of led to some liminal-space, weird, creative images. And then Katherine Crowson, another independent researcher — now the generative media lead at StabilityAI — put together that CLIP model and that unconditional diffusion model, the one that’ll just de-noise things into data.
And with some of the same math — what’s called CLIP guidance — CLIP-guided diffusion let you put in a prompt and optimize the image to match that prompt better.
A lot of machine learning is just finding a way to make some number smaller. The number to make smaller there is, essentially, how much the image doesn’t match the prompt.
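Zach’s “make some number smaller” summary is literally what gradient-based training does. Here’s a toy sketch of the idea — minimizing a simple mismatch score by gradient descent (illustrative only; real CLIP guidance minimizes a learned image–text mismatch over millions of pixel values):

```python
# Toy version of "find a way to make some number smaller":
# gradient descent on a simple mismatch score.
def mismatch(x, target=3.0):
    return (x - target) ** 2      # the number we want to shrink

def gradient(x, target=3.0):
    return 2 * (x - target)       # direction in which the mismatch grows

x = 0.0                           # start from a bad guess
for _ in range(200):
    x -= 0.1 * gradient(x)        # nudge x the other way, shrinking the score

print(round(x, 4))                # 3.0: the value that minimizes the mismatch
```

CLIP-guided diffusion does the same nudging, except the “guess” is an image and the score comes from CLIP.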
And so that led to the real explosion of growing communities on Discord and on Twitter of people using these Google Colab notebooks — Google Colab being a service Google provides where you can get access to really good high-end GPUs for a comparatively very low price.
It’s gotten more expensive over time as it’s become more popular — GPUs are super expensive. To be able to use a thing like this, it was $2,000 to $4,000 for a good GPU, as opposed to ten bucks a month on Google Colab. Pretty easy decision to make there.
So that led to this really big explosion of online communities of trying to make this better, trying to work in other kinds of math, other kinds of models to improve that.
And that’s kind of where I came into it. So that was Katherine Crowson figuring out how to use these two OpenAI models to make text-to-image in public. At the same time, of course, there were other research labs trying to find ways to do similar things with fewer resources.
One of them was the computer vision group at the University of Heidelberg, now at LMU Munich. They put together a model called latent diffusion: instead of trying to noise and de-noise raw images — which we would say is in pixel space — you noise and de-noise encodings of images, where it’s just less data, and you have some other model that can compress and expand images to make everything easier to work with. That’s the idea of latent diffusion.
And so essentially — I’ll get into a bit more about Stability — Stability supported the next level of that model. It became a large bunch of groups working together: Stability providing the computers, some of the people at Stability working on different parts, and a bunch of different communities coming together to make Stable Diffusion, with Stability providing a lot of the funding and compute for that.
And the big deal is that when that came out, it was open source, so anyone can use it locally. We have the Stable Diffusion model — you can download it, run it on your phone now. People have optimized it to be super, super fast, super efficient; it could run on a CPU, maybe not as quick as on a GPU, of course.
But, so, yeah, that was the big game changer. While we were seeing a lot of these text-to-image technologies, with things like Dall-E or Midjourney you had to sign up on their websites and pay credits to them to use them.
Whereas with this — there is a place you can go and pay credits to use Stable Diffusion, but you can also, if you have the right computer for it and the right stuff, download it locally and make this stuff for, I wouldn’t say for free, but the cost of running your GPU — the cost of electricity, essentially.
And not only that, but because anyone could download and use this, anyone can improve it. So very quickly — within just days of Stable Diffusion being released to the public — you saw people learning how to fine-tune it: take it and, say, make images of your dog specifically, just by giving it ten images of your dog, running some code, and waiting an hour or two on your own GPU.
You can fine-tune it and make it more unique and personalized. People were able to find optimizations to make it all run faster. So that was the big deal of Stable Diffusion: it was the first really high-quality, publicly available, free and open source text-to-image model.
Helen Todd: Yeah, that’s really cool. I think it was another podcast I was listening to, from BlackRock, talking about that game-changer moment. So thank you for explaining that. And one conversation that does come up when talking about how powerful these AI tools are is the pros and cons of open source, especially with StabilityAI working in all the modalities. Because with the internet and humanity always come the lovely bad actors, and you’re opening up a very powerful tool that could potentially end up in bad actors’ hands. So I’m curious what Stability’s take is on this conversation.
Zach Evans: Yeah. So I can’t really speak for the company as a whole on that; I can only speak for my work and my team. And, you know, working in audio, working in music, I think it’s harder to do bad things with it. You can make bad music, sure, but you can do that without AI. I think there’s always gonna be adjusting for that, and that requires a lot of forethought about what models can do and what you put out there.
But I think at the end of the day, the alternative is that the progression of these models is held back and guided only by the investments of large corporations. And particularly for creative tools, I think that’s not a good and healthy thing for creativity and art in general, right?
Like when Dall-E 2 first came out — this was right after the Russian invasion of Ukraine, so kind of an intense political atmosphere — they banned the word Ukraine from being put into Dall-E 2. And not even in a political context, but like, beautiful wheat fields in Ukraine.
It’s like, oh, Ukraine is too much, you can’t say that here. And so, okay, that means they have control over not only the service but expression. I made a tweet back then: if your art synthesis program won’t let you make protest art, it’s not for the people.
Helen Todd: Mm-hmm.
Zach Evans: Right?
Helen Todd: Yeah.
Zach Evans: And I mean, if you don’t have it open source and available to people, then all of the outputs and the expression are controlled by corporations, and their impetus is going to be to keep it as banal and safe as possible so that it doesn’t blow up on Twitter. It’s more about PR than about enabling expression.
And that’s gonna happen for any system that’s behind a paywall, behind a service. And I’m not saying they should allow anything to be done, because they are a company with PR — same with Stability. We have our interfaces for creating things with Stable Diffusion.
And we won’t allow you to put in some words, because we think there’s not really much good that can come from that, as well as doing some classification on the images, checking if they’re NSFW, blurring them, whatever. But in general, with creative tools like this, I think you need to give artists the freedom to do what they want to do.
I strongly believe in artistic freedom, and I think that trying to mitigate any possible negative output at the level of a service makes sense from a business perspective. But if that is the only way to access these tools, then that is incredibly limiting on just the future of creativity in this context.
I mean, you can pick up a pencil, you can pick up a guitar or whatever, but as I want these to be useful tools for people to express themselves, have fun, build a community, do whatever — I think it’s important that they can be customizable and decentralized.
I’m not a big everything-should-be-decentralized person — I’m not a crypto person — but I am an artist and I do come from that community. So yeah, that’s my long-winded answer: yes, there are some bad things one could potentially do with some of these models, but I think the alternative is just a gray corporate mess, where it all becomes just a toy, a niche thing, or just churning out corporate art or whatever.
And I think it’s just better for us to have, you know, to give the individuals more power over what they do with these things.
Helen Todd: Yeah, I appreciate that perspective. One of the missions of Creativity Squared is to envision a world where artists not only coexist with AI but thrive.
And you know, I agree with you that thriving means artistic expression as well. And to a certain extent, from the lens of creativity, a lot of the ethical questions we’re dealing with with AI do apply — but there are a lot of applications outside this lens too, where there might be bigger consequences.
Zach Evans: Yeah. I think there’s a different bar for responsibility and ethics between for-profit corporations and individual artists, and that’s a complicated thing to navigate. But I think that’s a good principle: I’m certainly not in any position to tell artists what they should or should not do with their art.
Yeah, so — kind of going back to the same point here, but I think that’s key for what I do: whatever we make, I want there to be a version of it that people can take and create how they want with it.
Because, I mean, and then this also ties to things where it’s like you have these models that are improving over time and get better quality outputs over time. But better quality is very subjective and personal and cultural. And music is subjective and personal and cultural.
So I think that you need to have some sort of ability to personalize these models and their outputs because if you know, if people are trying to make this better and better and better, that might be going further and further away from one individual person’s taste.
Helen Todd: Well, I feel like this is a good segue into Harmonai, because it’s by musicians, for musicians, in regards to music and AI, and you’re the head of it. So why don’t you tell us a little bit about what you guys are up to and the tools available — share what’s exciting you about the space right now?
Zach Evans: Yeah, so Harmonai is a Discord server first and foremost. It’s the name of the community and the research lab — fully part of StabilityAI, but also open to whoever wants to join the server and be a part of it.
In terms of actual releases, back last August we released our first model, referred to as Dance Diffusion. And that’s more in the style of that original OpenAI unconditional diffusion model — it’s not text-to-audio, it’s just a de-noiser, essentially. So it’ll make random outputs based on what you train it on.
So it’s really less about here is this base foundation model that anyone can use out of the box — out of the box, it’s all trained on music from one guy; I’ll get to that in a little bit. Instead, you can take it, put it onto a Colab notebook or your local GPU, point it at your sample library of drum loops or bass sounds or whatever, fine-tune it, and it’ll start spitting out random drum loops and bass sounds and things.
Also — one of the big challenges in generating stuff with AI in general is just the size of the actual data itself. The sequence length matters a lot. That’s why with transformer models like the ones behind ChatGPT, you’ll get maybe a max sequence length of 4,000 words, something like that.
Now they can get up to a hundred thousand or whatever, but it’s very expensive. With high-fidelity audio — 48K stereo audio — that’s 48,000 numbers per second. So at that same sequence length, you get much less output as we perceive it, right?
So, you know, Dance Diffusion — the high-fidelity version of that model is 48K stereo, about 65,000 samples. That’s a second and a half, right? That’s not gonna get you a full song; it’s not even gonna get you a full drum loop right now. It’s better for sound design and things. There’s also a three-second model at a lower sample rate that gets you longer outputs, but at lower quality, of course.
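The arithmetic behind those figures can be sketched as follows (assuming the window is exactly 65,536 samples — a power of two consistent with “about 65,000” — and treating the 16 kHz comparison as an illustrative assumption):

```python
# How much audio fits in a fixed-length sample window?
SAMPLE_RATE = 48_000  # samples per second, per channel
WINDOW = 65_536       # model output length in samples (assumed exact value)

print(round(WINDOW / SAMPLE_RATE, 2))  # 1.37 seconds: "a second and a half"

# The same window at a lower sample rate covers more time, at lower fidelity:
print(round(WINDOW / 16_000, 2))       # 4.1 seconds
```

This is the trade-off Zach describes: for a fixed sequence length, fidelity and duration pull against each other.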
That’s what we released — a Colab notebook, a GitHub repo, and some pre-trained models. And the idea there was to try to kick off a bit of the same energy I saw in the text-to-image space, but in the audio space. Times have changed since then, so it didn’t quite go the same way.
But it’s gotten a great reception. I’ve shared it with artists I look up to and have met over the last few years on Discord, and it’s gotten some great responses — talking to artists like Mr. Bill or Au5 or Virtual Riot. I come from the electronic music space, and those are big producers.
Those are well-respected sound designers. And my goal from the beginning has been: all right, there’s a lot of room for improvement that I could see in music machine learning. So it’s very rewarding to be working on these things, sharing them with the artists I look up to, and having them say, this is really cool and I see how it’s useful.
So we released that back in August or September of last year. Since then we’ve been supporting the community around people using it to make music, running some production challenges: make a song that’s up to four minutes long using Dance Diffusion for at least two of the elements.
Tell us how you did it, submit it, we’ll have an event. Everyone comes on stage in the Discord server and talks about their track. So that’s what we’ve released. And since then we’ve been cooking. There’s been a lot of advancements in music generation since then, things like MusicLM from Google, MusicGen from Meta. There’s been a whole bunch of papers this year on music generation.
But essentially, we’re trying to follow the image space, and that’s true for a lot of audio ML stuff: follow the image space. Okay, cool, get high quality text-to-audio working. Try to get longer sequence lengths.
Can we get a whole song out of this? Can we get maybe 10 seconds for a loop? That’s what we’re working on now: longer form generation, supporting volunteer research projects. And hopefully more to share soon.
Helen Todd: Lots of teasers on what’s to come. Well, you know, in the opening, you did say you’re sitting on a nuclear bomb and want to figure out how to turn that into nuclear energy.
And can you expand on like what, what you mean by that relative to the music industry? Like what AI and how transformative it can be for the music industry?
Zach Evans: Yeah, so I guess I’ll bring this back to the revelations that got me started on “I’m gonna start Harmonai.” I’ll take this as an opportunity to tell a bit more of the longer story and build up to that.
Helen Todd: Yeah, please do. Because your story is so tied into, you know, the evolution of music and AI. So I think that’s great.
Zach Evans: Yeah, yeah, yeah. And it all flows from one thing to the next. I keep trying to start in the middle, and then I gotta go backwards and forwards and jump around a lot. So back in the beginning of lockdown, when the pandemic started, my life had just gone through a big shift.
My social life had totally changed, end of a big relationship, and suddenly it was like, all right, I basically rebuilt my life on Discord. This was like March 2020. My life in person was weird. And so it was nice to have an escape: going on Discord, finding these communities of artists.
You know, I’m a raver. I’m in the electronic music scene, and I was finding these communities of artists online, ’cause that was one of my big interests: I wanna talk about EDM and these artists and bass music and things. Trivecta, an artist I also looked up to a lot, started a Discord server, and I joined that and became super active in there.
Made a bunch of friends, was hanging out in voice chat all day every day with these new friends. And then from there, I joined a few different servers, for the artists Kill the Noise, Tasha Baxter, some other electronic music artists that I looked up to, and essentially got really ingrained in those communities.
I had just been starting to learn how to make music on my own for a few years before that, but never really got into it. I didn’t have anyone to do it with. I could share it with friends, but they didn’t actually have that many opinions on it.
So, diving into these Discord communities, learning to make music, being a part of it: it’s very different when you’re trying to learn a skill like this with people than just by yourself with YouTube videos and things.
So I got super into that, made some great friends. That’s where I met Noah, our mutual friend Drew’s music production partner. And it was just this wonderful new life: okay, now I’m in this creative community. I was working at Microsoft at the time, a much more technical, objective job, and then working with creative people.
It was a whole new world. It was amazing. Running production challenges, learning to make music, talking with producers. And around December of that year, 2020, there were some big AI releases like Jukebox from OpenAI, which was the first good high fidelity long form song generator using transformers, language modeling stuff.
And it took like a day to get a song-length output. It took a really, really long time, but it was like, this is really cool. This is clearly something. And I was kind of co-hosting these Twitch streams with Kill The Noise.
I was on an episode with him where he was messing around with these tools. He’s a very experienced electronic music producer, and diving into these new tools, his conclusion after a few hours was: this is cool, this is neat, but not really up to the standard of quality for me to use it in my music as a tool.
And I was like, well, that’s a thing to have as a goal; that’s really neat. And so I took that opportunity to really start diving into machine learning. I had dabbled a bit before, here and there, but it was always like the code was too hard to read, or you needed more compute, or whatever.
But at that time, late 2020, I really dove into the scene: who are the movers and shakers in music ML? And I found this duo, Dadabots, who had this YouTube channel with these infinite generated death metal streams. That channel’s still going; it’s been making infinite death metal for like four years now.
I found ’em on Twitter, and I found this Jukebox Discord community, and I got really into the stuff. I found this new release, DDSP, by Google Magenta, which is their audio and music ML team, and started talking to the Dadabots guys.
CJ was the one I was talking to (it’s two guys, CJ and Zach), and he was super great to talk to. I was like, you know, I’m just diving into this space now. I wanna be a part of it. I don’t have a PhD in this stuff; can I do it? He’s like, oh yeah, we don’t have PhDs. We started doing stuff from hackathons, and now we actually advise PhDs. And I’m like, cool.
It’s actually possible. All right. Just super, super excited, and I’m like, how can I get more into this? He was saying, well, this Saturday is actually the NeurIPS conference’s workshop on machine learning and design. And I’m like, well, that’s incredibly convenient. The big ML design workshop is happening in four days? That’s perfect timing. Okay.
So I went there, and I met some people that were the authors of the things that I had been using. I was excited, trying to get everyone on board: I wanna find the music ML community, a place like the communities that I’d been in, where people are equally excited, equally learning things together, equally pushing forward on this.
And I couldn’t find one. There wasn’t one. There was a Slack community, The Sound of AI, but they weren’t talking about the things I was interested in. And outside of that, I realized: all right, I’m gonna have to go work for OpenAI or Google, change my job, go get a PhD to do this stuff.
So that kind of waned, and I was like, well, that’s unfortunate, but I’ll keep my job working at Microsoft. I’ll keep working on music, whatever. This isn’t the time; it’s still a little bit ivory tower.
So during that year, 2021, I was still focusing more on music: building communities, meeting artists, going to shows. I ended up getting backstage for some shows, meeting more artists. And that kind of culminated in the one and only DJ set that I have played, in December of that year.
I got to open for some of my favorite artists. It was such a cool time. Au5 was one of the ones I opened for, a legendary sound designer who I’d wanted to meet for a while and finally got to. So my network was really built up of these producers.
And then around that same time, Noah, who I talked about earlier, introduced me to all that text-to-image stuff I was talking about earlier. There were these ongoing communities that started around the beginning of 2021 around CLIP-guided diffusion and all these Colab notebooks. There was one called Pity, and the server was called V clips.
And that was kind of the melting pot that started a lot of this. And there were a couple of different communities as well: EleutherAI, which was a Discord community focusing on large language models, and LAION, who did large data sets and a lot of the image model stuff. So yeah, I got into this text-to-image space, and I was just enamored by it.
’Cause I am not a visual artist. I can’t draw to save my life, but I am a programmer. I can dive into some code, and in these Colab notebooks you have all the nice little interface things. You can double click and you get the actual code itself. You can change it, manipulate it, figure out how it works, try to change code without breaking it too badly.
And so I was actually able to get a bit of an intuition on how this stuff works: diving into the code, being able to change it, make it better. I found this new notebook someone made called Disco Diffusion. And that was taking that same text-to-image thing, but using some special tricks to turn it into animation.
And so that was: if you zoom in on a picture, you get some zoom-in artifacts, right? But if you then noise it and de-noise it a little bit with these diffusion models, that gets rid of the artifacts and makes new details. So keep doing that over and over again, and you get kind of a fake infinite zoom effect.
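The zoom-and-renoise loop Zach describes can be sketched roughly like this. The box-blur “denoiser” below is a deliberately crude stand-in for the diffusion model, just to show the shape of the loop; a real pipeline would run the partially noised frame back through the diffusion sampler, which invents new detail as it denoises:

```python
import numpy as np

def zoom_in(img, factor=2):
    """Crop the center 1/factor of the image and scale it back up with
    nearest-neighbor repetition, which creates the blocky zoom artifacts."""
    h, w = img.shape
    ch, cw = h // factor, w // factor
    top, left = (h - ch) // 2, (w - cw) // 2
    crop = img[top:top + ch, left:left + cw]
    return np.kron(crop, np.ones((factor, factor)))

def renoise_denoise(img, rng, noise_level=0.3):
    """Stand-in for the diffusion step: add noise, then "denoise" with a
    crude box blur. A real pipeline would use the diffusion sampler here."""
    noisy = img + rng.normal(0.0, noise_level, img.shape)
    padded = np.pad(noisy, 1, mode="edge")
    h, w = img.shape
    return sum(padded[dy:dy + h, dx:dx + w]
               for dy in range(3) for dx in range(3)) / 9.0

rng = np.random.default_rng(0)
frame = rng.random((64, 64))
frames = [frame]
for _ in range(10):  # each pass yields one more video frame
    frame = renoise_denoise(zoom_in(frame), rng)
    frames.append(frame)
```

Each pass crops the center, scales it back up (creating the zoom artifacts), then noises and “denoises” the frame so fresh detail replaces those artifacts.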
And it was just cool animations, and like, oh, this would be great for music videos. I’ve got a bunch of friends who make music and are DJs and producers, but they don’t have a whole lot of art assets, or it’s hard for them to get stuff for a music video or whatever. So I used that, and I was making audio-reactive videos for friends.
And so I got really into that. I got into the dev team for Disco Diffusion, and this whole time it was still just me messing around. I was seeing it as: I’m messing around in the DAW to make music here; I’m messing around in Colab to make some fun art stuff. And I think a really important part about that was the community.
And that’s the thing I really want to drive home in general for this stuff: community is everything. Something I’ve thought about a lot is that art can be a lot of things. Art as commodity, art as expression, art as community, art as comedy. And that’s really what spoke to me the most, and what I really loved about the AI art community at the time, and still now, but particularly then. A lot of the vitriol that there is now wasn’t around then, because it was still kind of niche; people didn’t know about it as much. I was just seeing people doing good work, making these cool innovative changes, sharing them on Twitter. And it was all just independent nerds, you know, excited, usually heavy-ADHD technical programmers and artists, who were just like, hey, I changed this and I made this little feature for it.
Like, oh, it’s only a couple lines of code, I’m not really much of a programmer, but it did this. And I’m like, that’s groundbreaking. This is such an opportunity where even a little bit of improvement is huge.
So I was part of that community, and Emad Mostaque, the founder of Stability AI, was part of it too. He was one of the many people who was kind of a regular in these communities, a regular and known person in the EleutherAI community. And yeah, again, I was working on the Disco Diffusion stuff. And then Noah started doing a music video based on that and talked to one of the guys that was building it, Gandamu.
He brought me into that; that was how I met Gandamu, one of the other authors of Disco Diffusion. We were talking earlier about how these dominoes fall, how all this stuff happens, so I’ll try to explain the domino setup here. After talking to him, I was sharing kind of the research that I was doing, getting into some in-painting stuff, some improvements.
’Cause I was making music videos, but it took like five, ten minutes per frame to render. So I would try to make a ten-second video at 15 fps, and 12 hours later I’d get my results, and it was misaligned. I’m like, okay, gotta do that again. This is not working. I’m gonna get more bang for my buck if I learn how to make this stuff run faster.
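The render-time math there checks out:

```python
# A 10-second clip at 15 fps, at roughly 5 to 10 minutes per rendered frame.
FPS = 15
CLIP_SECONDS = 10
frames = FPS * CLIP_SECONDS        # 150 frames to render

hours_low = frames * 5 / 60        # at 5 minutes per frame
hours_high = frames * 10 / 60      # at 10 minutes per frame
print(f"{frames} frames -> {hours_low:.1f} to {hours_high:.1f} hours")
```

So “12 hours later” is right at the optimistic end of the range, and a single misalignment means starting the whole render over.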
I had seen Katherine Crowson; her name is at the top of all these Colab notebooks. She was kind of the big brain in the space. And I’m like, I want to make this faster. I see her posting on Twitter about these performance improvements, this different model. I’m gonna just message her and ask her: how do you do that?
I was feeling a little bit nervous about talking to her. I don’t see her talking around much, but she’s like the oracle in this space, you know? But I’m like, okay, I’ve been on Discord servers for the last few years DMing artists I look up to. People are just people, and I should do that.
And so I did, and she was very kind and very helpful, and I learned a whole bunch from her, while trying not to rely too much on someone else, eventually learning things so I wouldn’t need her help.
And so, yeah, at this point I was pretty deep into doing research, and Gandamu said to me, hey, you’re doing actual practical research. You should talk to Emad, who is giving out access to A100s for people doing research. My first thought was: research? I’m just playing, I’m making art. Oh, this is research, because it’s code and it’s new.
And so that’s technically research, which was a huge, glass-shattering, world-breaking thing for me, where I realized how many of these super technical musicians that I’m friends with are working every day on things like the lookahead milliseconds for a compressor to make your kick stronger.
I’m like, they’re also doing research. You just don’t call it that, you call it music production or whatever, but they’re doing post-graduate DSP work. So I’m realizing, all right, okay, this is research. Cool. I should move forward with this. This is really fun.
Maybe I can have a job with this. Maybe this is better than my current job. And so I ended up in a conversation with Emad, and he basically offered to support me doing this. At the time, this is kind of how Stability AI started. They had been doing some stuff in their office in London for about a year before that.
But in terms of expanding it to Discord, it was essentially Emad going around and seeing who were the people who were really making stuff happen and saying: what do you need to keep doing that? Do you need a computer? Do you need resources or time?
And I’m like, I need both, because I am losing sleep over having my full-time job. I was working from home, doing my full-time job and then staying up until 2:00 AM working on this AI stuff. So yes, I need time. I need an actual job.
And I took March off from my job at Microsoft and started building communities, basically. Emad was saying, all right, you need to build tech, build community, and go for it.
And I’m just sitting here like, all right, well, alright, I will. At the time I was still working on the image stuff, but I was like, by the way, my real passion is music; that’s what I actually like doing. And he’s like, cool. I was working on some in-painting thing for images and realized that the same technology could be used to make infinite drum loops, or to help with modifying samples instead of images. And I’m like, oh wait, this is way cooler, and I know way more about this stuff, the technology.
The image stuff was getting into 3-D video, which is not my space; I wasn’t really motivated to learn about it. And I realized: wait, what am I doing? I should be working on the audio version of all of this, stuff that will help my friends, that will help me make music.
That will be really neat. And it’s really under-explored, and a lot of the people doing audio ML right now don’t have the same bar for quality that I do. A lot of it is like 16K mono piano music. And part of that’s because there’s a lot of available data for that and it’s easier: lower sample rates, smaller sequence lengths, whatever.
Or things in the MIDI space, doing symbolic generation: generate MIDI and then run that through whatever VST preset thing. Which I thought was fine, but pretty well explored. And I was more interested in this Dadabots stuff of raw generation, like, how can it apply to sound design?
So at this point I was like, all right, I need to find who is doing this stuff for music and join them. And like I had done a year before, I looked around and tried to find it; this must exist now.
Looking around at EleutherAI, there was nothing for audio. Looking around at LAION, there was one channel for audio data sets, which was largely speech. And I’m like, all right, I think I’m the one who’s gonna make this. Looking at my position: I’ve got a huge network of top-of-the-line producers. I’m good at creating Discord servers and good communities.
I’ve got the mathematical and technical knowledge to start executing on creating this stuff. And I’ve got a person saying he will give me all the compute and money I need to do that. So I’m like, this is not opportunity knocking; this is opportunity breaking down my door with a bulldozer, right?
This is the most laid-out-on-a-platter, like, hey, do this.
Helen Todd: Oh my God, this is the universe being like, Zach, here, go.
Zach Evans: Yeah, exactly.
Helen Todd: Like, talk about manifesting!
Zach Evans: Like, here, do something, right?
Helen Todd: This is like where you are exactly supposed to be. And I love this.
Zach Evans: Exactly, right place, right time, right network, right knowledge set. And I was like, I am not an entrepreneur. I’ve always been averse to startups. I worked at Microsoft, a very stable job. Even if it wasn’t the most fulfilling, I had a good team and I was doing good work. And I liked it.
But I was like, this is my calling. This is the first time I’ve felt something greater than me that I actually have the ability to go and change. So I’m like, all right, this is my life now. This is what I do.
So I started the Harmonai community. Back then it was a private server as I tried to build it up, and I reached out to the guys that I had met a few years before.
Anyway, I wanted to reach out to the Dadabots guys, because on the FAQ on their website, which is a great FAQ, you should totally go read it, they talk about how, hey, we have ideas of things we could do, we just need more money for it, more compute for it.
If you’ve got a bunch of money and a bunch of computers and wanna help us, reach out. And I’m like, I do. I do have that. That’s me.
Helen Todd: I love it. It’s just like music nerds in a candy shop, right? And it’s like, let’s go play. I love it so much.
Zach Evans: Yeah, exactly. And so that was essentially my March into April 2022. Time is weird. I was like, all right. I felt like, you know, Danny Ocean: you gotta go put a team together. Basically Emad being like, all right, go build your team. What are you gonna do? You’re in charge of this. Go. And I’m like, okay, okay, okay. So I’m cold-DMing people on Twitter, finding people on GitHub who have some cool stuff that they’re doing.
Basically, it felt weird just being like, you wanna join my Discord server? But I’m like, we’ve actually got a lot of computers and we’re gonna be doing cool things. I promise, I swear. Just join, check it out.
And so I hit up Dadabots again, and I don’t think I got a response from them. But then I was in the Midjourney server. This was while it was still in beta; Emad was supporting that as well. It was all kind of part of the same community. A lot of the developers on that came from the same communities that started a lot of the other research.
So that kind of splintered off from the same pool of people; like, the guy who made Disco Diffusion ended up going to work for Midjourney as one of their developers. So I was in there, and I see someone had written this song about Midjourney. It was still invite-only, and the song was like, I want to get in, how do I get in? He posted it to YouTube, and the guy who ran it, David H, was like, okay, yeah, I’ll send him an invite. That’s kind of fun.
The guy’s name was Jonathan Mann, and he has the Guinness World Record for the most consecutive days of writing and releasing a song. He was about to hit 5,000 back then; he passed that back in August. He’s been doing it since like 2009 or something.
And he came in and he was like, hey, I’m Jonathan Mann. Thanks for the invite. I’m a musician, and I’m doing this project with Dadabots to give me immortality in my music, to be able to keep creating a song a day after I’m gone. And I’m like, that’s awesome. I want to talk to you. Also, hey, you’re talking to Dadabots; could you help me get in contact with them?
I’m trying to talk to ’em. I’ve got a thing for them.
And so through him I got Dadabots, and yada yada yada, CJ from Dadabots is now working as part of the Harmonai team. And talking with Jonathan Mann, he’s clearly very into AI music, into this stuff, and I’m like, can we use your music to train these models?
And he was like, yeah, sure. So he sent me all 5,000 songs and said, go for it. So the main Dance Diffusion model is the J-Mann model, the Jonathan Mann model, where by default that model will just put out little clips of this one guy singing and playing, no coherent vocals or whatever.
So again, by itself not terribly usable for most people who aren’t Jonathan Mann, but there’s enough knowledge in there to take that, fine-tune it on your own stuff, and have it actually still work. So that was kind of how Harmonai started: me realizing, this has to happen. There’s no good place for this stuff being coordinated.
In general, music is not supported as well as the rest of AI research. And I care a whole lot about this, and I have now been given everything I need to make it work. So then it was basically just kind of cheerleading: hey, let’s get going. Let’s put teams together, let’s start working on stuff.
And yeah, that’s how Harmonai came together, how it became part of Stability.
Emad asked me very early on, like, do you want Harmonai to be independent or part of Stability? And I’m like, definitely part of Stability. I am not a business guy. I don’t wanna think about tax IDs and all of that. So it’s just been an absolutely incredible opportunity for me to be given the freedom and trust to use a whole bunch of resources and try to start up this whole community and research initiative.
And you know, that’s really what Stability has been for me is, you know, find people who can actually do the stuff. And enable them. Find people in the community who are just the actual ones who will go and do the work and have good ideas and execute on them and support them to be able to push forward this whole space.
Helen Todd: I love that so much. And you did mention how small of a community the music AI world is, and you mentioned Drew, so shout out to Drew. He’s actually helping produce the show today and is part of the Play Audio team. And I didn’t realize that we had overlap with Dadabots, because Harry Yeff, also known as Reeps One, was the first guest on Creativity Squared.
So I love how small the community is, and the overlap, and how we even came to our conversation. Well, I kind of wanted to go back to, and I just love your passion, it just comes out so much, but I wanna go back to the question of the nuclear bomb that we’re sitting on.
Zach Evans: Right, right. Yeah. That’s what I was building back up to. Thank you; I know I’m gonna forget the actual question. So, basically, when I started realizing what was possible here… sorry, is there more…
Helen Todd: No, no, no, I was just gonna say, well, I feel like, you know, the whole divergence really set up like why you’re so well-equipped to answer this question. Like you helped found and create Harmonai, and you’re so immersed into the community. So you, like, out of anyone, like from the seat where you’re sitting, tell us, tell us more about what you mean by that nuclear bomb?
Zach Evans: Yeah, the background there was also to show where I come from: my community is the producers, the DJs, the musicians who are touring the country and still making no money, right?
And so when I realized this applied to audio, I was like, this is gonna be awesome, really fun tools. We can get to where Jukebox was supposed to be and make what we want. But then, what does that do to the industry? What does that do to how people consume music?
You know, I think that in the short term it won’t change that much at all. It’ll be like, you know, adding new production tools. But say it does get to the point where it’s like you can make whatever song you want personalized for you. You know, what does it look like when Spotify has a bunch of generated music on it?
And a lot of this has changed since I was first thinking about it. As it’s become a larger conversation, there have been more statements from streaming services saying they won’t host AI-generated music or whatever. But basically, this is going to be a huge thing.
It’s going to give a lot of creative power to people who aren’t the current, the current artists and producers. And at a high level, I’m in favor of that, not the just, you know, reduction of power of current producers, but, you know, there being more artists. There being more people who are able to express themselves or get out what they want in terms of audio.
But what will this do to the value of music, and all of that? What does this do to streaming revenues for people? And I don’t have any answers on that. I have no idea. But I don’t think the most naive way of doing this helps artists by default. I think it takes extra care and thought to think about: how does this help current artists and enable other people to be artists?
So I guess the nuclear bomb in this case is: anyone can make anything, so who will listen to your music? And there’s plenty of answers to that. I think the counter to it is: when was music popularity ever about the music?
You know? It’s not about the actual artifact you put out or the end result. After a certain point, there’s no real correlation between subjective quality and fame and fortune. It’s all business at that point: marketing, branding, whatever.
But the nuclear energy I see is that now artists are able to get out their ideas more easily. There’s an increase in the overall diversity of sounds and diversity of genres, the exploration of music at whatever length. I see that as a whole lot of empowerment: more people are able to make more stuff, better. I was talking to Mr. Bill, a producer I look up to a lot, a really technical producer, and sharing with him what I’ve been working on, and he was saying: this is great, because I want to be able to speed up my process of making Mr. Bill music so that my fans have more Mr. Bill music to enjoy. And I’m like, there you go.
That’s what I want to hear. That’s the goal, right? The artists can take these tools and supercharge their own workflows, make new things they couldn’t before, get inspiration, get help with writer’s block, and be empowered to do more in whatever aspect they want.
And I don’t think any of this is prescriptive, in terms of every artist should do X and that will make you better, ’cause people are different, artists are different. But the difference I see is: how much do you work artists into the process? Let’s take a straw man big company.
I don’t want to name an individual company, ’cause they’re all doing actually pretty good work right now. But take a straw man big corporation: if they controlled the entire production of AI-generated music, and all of the profits, and the whole ecosystem, then that could be great for them and their bottom line, but possibly not great for the artistic community.
And this kind of comes back to the original thing I was saying: if there is all of that control by large corporations, it won’t necessarily lend itself to artists. It’ll be more corporate, whatever effects that’ll have.
And so I think the nuclear energy there is everyone being able to have access to these tools and using ’em in the ways that they see fit and that they think will help them achieve their goals.
Helen Todd: It’s like the democratization of creativity applied to music. Yeah. And one thing that you had said too, which is how I was also thinking about it, is that these tools are gonna open up not only access for more diverse voices to make music, but the ability to make almost any sound possible, any type of music.
Can you share a little bit more about what types of tools you’re enabling them to use and make things with?
Zach Evans: So let’s assume it’s something similar to Stable Diffusion, right? Where you’ve got some sort of text prompting, you know, text-to-audio. Let’s take it in that context.
Assuming you’ve got data that has enough information in there that you can type in what you want and get out something similar, assuming all of that, at that point it’s so open-ended.
It’s hard to think about what exactly people could use, but you can use it to explore different combinations of genres and ideas, or take those same things and put in your own music and remix it. Or there’s things like stem separation, to take out individual tracks.
I don’t think we’re currently working on that, but, you know, put in a drum loop and get out a slightly different drum loop, or a totally different drum loop. So one of the cases I was thinking about when I was first starting this: I was working on this Cyran track, trying to be super fastidious with it.
Trying to apply all of the knowledge that I had to this more technical track. And I was adding in drum fills, because you need a bunch of drum fills for the transitions between different sections, and I had this KSHMR sample pack from Splice that had some really good tom fills I really liked.
But I’d already used the one or two that I liked from the pack. And I’m like, all right, how can I get more drum fills? I could go on Splice and find more tom fills, and hope they aren’t completely tonally different from the ones that I have.
I could learn how to record drums and make my own. Yeah, that’s not gonna happen. I could chop up these fills and try to make them a little bit different, which I tried, but it made it worse. It wasn’t the same thing.
At the same time, I’m working with Disco Diffusion, where I can put in an image, noise and de-noise it, and get a variation on that image. And I’m like, that’s what I need: a thing where I can put in my drum fills and get variations on them. Slightly different drum fills that still semantically do the same thing, rhythmically do the same thing, but aren’t the same sound four times in my song. So that’s one of those cases: variations on sounds, as well as things like interpolations.
You can actually do that in Dance Diffusion already, without having to have text prompting: give it two sounds and find what is between those two sounds in some sort of semantic space.
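[Editor’s note: To illustrate the idea of finding sounds “between” two points in a semantic space, here is a toy sketch of spherical interpolation (slerp) between two latent vectors, a common way diffusion models blend latents. This is illustrative only, not Dance Diffusion’s actual implementation; the vectors stand in for encoded sounds.]

```python
import math
import random

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def slerp(a, b, t):
    """Spherical interpolation between latent vectors a and b at 0 <= t <= 1.

    Interpolating along the hypersphere (rather than a straight line) keeps
    the intermediate latents closer to the statistics a diffusion model
    expects, which is why it is often preferred for blending latents.
    """
    cos_omega = max(-1.0, min(1.0, dot(a, b) / (norm(a) * norm(b))))
    omega = math.acos(cos_omega)
    if math.isclose(omega, 0.0):
        # Vectors are nearly parallel; fall back to linear interpolation.
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    sa = math.sin((1 - t) * omega) / math.sin(omega)
    sb = math.sin(t * omega) / math.sin(omega)
    return [sa * x + sb * y for x, y in zip(a, b)]

# Two stand-in "latent encodings" of two sounds.
random.seed(0)
sound_a = [random.gauss(0, 1) for _ in range(64)]
sound_b = [random.gauss(0, 1) for _ in range(64)]

# Sample a few points "between" the two sounds; a decoder would then
# turn each intermediate latent back into audio.
between = [slerp(sound_a, sound_b, t) for t in (0.0, 0.25, 0.5, 0.75, 1.0)]
```

At t=0 and t=1 the interpolation returns the original sounds exactly; the intermediate points are the “in between” sounds Zach describes.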
Yeah. And then things like sound design: can you type in “Skrillex growl bass” and get out some cool FM-sounding bass?
You know, I’m a big fan of the idea of having some base model for that, and then fine-tuning it on your library. One of the cases I think about a lot: a musician’s been working on music for 15 years, they’ve got a terabyte hard drive of samples, and they never use most of it ’cause they don’t go into the library that much.
But give them the option to say: here is a base model, on your computer, point this program at the root folder of your samples, hit go, and you will now have an explorable version of your own library.
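[Editor’s note: A sketch of what “point this program at your root folder” could look like at the data-gathering step, before any fine-tuning happens. This is an illustrative stand-in, not Harmonai’s tooling; the function name and extension list are assumptions.]

```python
import tempfile
from pathlib import Path

# Extensions treated as audio; adjust for your own library.
AUDIO_EXTS = {".wav", ".mp3", ".flac", ".aif", ".aiff", ".ogg"}

def collect_samples(root):
    """Recursively gather audio file paths under `root`.

    The resulting list is the dataset you would hand to whatever
    fine-tuning pipeline you use on top of the base model.
    """
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in AUDIO_EXTS
    )

# Demo on a throwaway folder standing in for a sample library.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "drums").mkdir()
    (root / "drums" / "kick.wav").touch()
    (root / "drums" / "fill.aiff").touch()
    (root / "notes.txt").touch()  # non-audio file, should be skipped
    names = [p.name for p in collect_samples(root)]
```

Only the two audio files are collected; the text file is skipped.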
And one of the things that CJ talks about when he gives research talks at conferences as Dadabots: he’ll show this picture of space, like the Hubble Deep Field with all the different galaxies, and say, these visible galaxies are the points right now in music, the songs that currently exist.
We want to explore the spaces in between them. That, I think, is what excites me the most about a lot of this technology: using it as exploration, creating a map of music and asking, where are the holes, and what would be there?
Helen Todd: I love that.
Zach Evans: I see a lot of creativity. It’s the kind of thing I was thinking about before, in the context of artists who will really specifically try to copy one artist’s sound for their outputs. And I say it’s not necessarily about “don’t copy one artist.”
It’s not necessarily about having a sound that is totally unique. It’s more like find 10 different artists you like, combine them and that’s you.
You know, we’re all influenced by things that we listen to and things that we like. No art is created in an absolute vacuum. And I’m really interested in seeing what that looks like computationally, what does that look like mathematically?
What does that look like when it’s not someone thinking, what can the computer do? I’m fascinated by this concept of: can these models learn some abstraction of music theory that is different from what we learn? Not specifically Western music theory or Eastern music theory.
It’ll learn something else, right? And being able to explore that kind of mind-model of creativity, exploring that space, is what really interests me.
Yeah, I think in general with these tools, one thing I try to push is that it’s already possible to automate the entire process of songwriting.
Not necessarily for a great chart-topping hit, but I could go into GarageBand, pull together some loops that have the same name, grab an a cappella in the same key, put it over the top, run it through automated mixing and mastering, and put it on Spotify. Right? Will I become popular and have a fan base?
Probably not from doing that, but you still can. But people don’t do that because as an artist, you want to have some sort of impact on your work. You want to feel like it is yours. You want to be able to impart something of yourself into that work. But what that something is, is different for everybody.
There are plenty of people for whom songwriting is their favorite part, but they don’t want to do mixing and mastering or recording. And plenty of people for whom songwriting is, yeah, whatever; they’re much more interested in the fine touches of the timbre of the sounds and the final mixdown.
And I think it’s important in general, AI or not, to find out which parts of the creative workflow you like and want to throw yourself into, and which parts you see as rote and want to automate, to get yourself to the parts that you still like.
And in general, that is my goal with these tools, in terms of creativity and tools for artists. I’m gonna make no claim that this is gonna change everything and you gotta use it or you’re gonna fall behind.
I think it’s just obnoxious when people say stuff like that. This is a new tool, like the synthesizer, like the guitar amplifier, like the 808 drum machine, and it will be what you make of it. That’s kind of my point to artists, in terms of what is the point of this: the point is what you wanna do with it.
You don’t have to use these tools. For someone whose favorite thing is songwriting, then playing acoustic guitar and singing at an open mic, I don’t have a whole lot for them right now. But maybe I do. Maybe if they can put in a nice happy acoustic guitar song and get some inspiration on what chords to play, then that’s great.
But yeah, I try not to bring in too much hype or FOMO. I don’t think that’s productive in a creative space like this. I try to lead with substance and be like, here’s this new tool, go use it in ways that I wouldn’t think of. I’m really excited by people taking this technology and using it, quote unquote, wrong.
Right? Using it in ways that I wouldn’t expect. It would be hubris for me to think that I could predict what these tools will be used for. But historically, the big changes, the new genres and new scenes in music, very frequently come from advancements in technology.
So if you look at things like the digital sampler enabling sampling in hip-hop; DJ Kool Herc at a party, or wherever he was, hooking up two turntables, scratching and mixing things, and inventing DJing, which is not the intended use of record players; the 808 kick; the 808’s and the 303’s resonance knobs being so widely abused; Auto-Tune being turned all the way down to zero, creating T-Pain’s career and the whole SoundCloud rap scene.
You know, heavy metal coming from advancements in amplifiers and guitars. I could keep listing; probably every genre starts with some new technology, pop coming from radio or whatever. So I think this is one of the most impactful things one can do in music. I don’t think that, writing songs individually, I’m going to change the world. I don’t think I have that kind of “this is my story to tell” story, right?
You know, I don’t think I have some heartfelt song to write that’s going to change America, change people. That’s not gonna be my impact on art. But as a technological person, this has a strong potential to create new scenes, create new genres, create something different.
And that, I think is what I’m most excited about, is just what creative people will do with these tools.
Helen Todd: I love that. So well-said. Well, one question that I have: you mentioned a gentleman who was talking about his legacy, the guy that trained the system with 5,000 of his songs.
Zach Evans: Jonathan Mann.
Helen Todd: Yeah. Jonathan Mann. Because, and this question actually comes from also listening to Dolly Parton’s America, a podcast. I grew up in Dolly Parton’s hometown.
She’s a dream guest for the podcast. Interestingly, she’s actually very forward-thinking, and it was hinted in that series that she’s already thinking posthumously, about how her legacy could live on, potentially through, I don’t know, unreleased tracks or whatever. But now with all these AI tools, I’m really curious if she’s cloning her voice or opening up her music.
So if anyone listening can hook us up with Dolly Parton, let me know. I’m so curious how she’s thinking about her legacy in relation to technology. And with Jonathan Mann and legacy, I’m curious if you’ve heard of any other interesting artists thinking about legacy in relation to these tools, or how they’re thinking about it.
’Cause I think it opens up a whole new way, aside from just putting your songs out there, of what you can do in terms of legacy too.
Zach Evans: Yeah, I don’t have any more notable examples of that. I mean, I know that Grimes has been super active in terms of AI voice technology and things like that.
I’m sure others have mentioned it, but I haven’t seen as many things like that. I am curious to hear more about Dolly Parton and what she wants to do. Actually, what I just heard about Dolly Parton ties into one of my thoughts about AI creativity.
People will say, oh, this makes it too easy, it’s too fast, it’s too quick to make these things now; art takes time, and if it’s too easy then it’s not real art. Which I reject, because there is no correlation between the time a project takes and its quality.
I have spent weeks on terrible music. Dolly Parton wrote Jolene, and I Will Always Love You in the same car trip on the same night.
Helen Todd: Yep.
Zach Evans: And All I Want For Christmas Is You was written in 45 minutes. So speed is not the thing. But there is taste. That’s the point I wanna make: taste is still a thing. If you think AI-generated stuff is bad, that’s okay.
If you think it’s low quality, you’re allowed to. If you are a person using these tools, making things, and wanting to put them out there, you’re still applying your taste, and you will be judged for it, for better or for worse. So I think that in general, taste will still apply, and maybe the process of creation increasingly becomes the process of curation.
But we’re already kind of on that slope with things like sample packs, and the fact that there’s only so much you can actually do melodically in music. It all kind of ends up being curating different things into some sort of complete piece.
Helen Todd: Yeah, and one other interesting thing I’ve been thinking a lot about with Dolly Parton, related to AI and music: she’s such a prolific writer and, you know, a general national treasure.
But she actually has thousands of traditional mountain-melody songs she’s written that have never seen the light of day. And I’ve been thinking about the value of things that aren’t online now, how much more valuable those could be if and when she ever releases them, and how they could be released.
Which I think opens up a whole other can of worms. And this is actually something that Harry spoke to in our interview too: using the machines and training them with your own data for your own music creation, and how that improves your art because you’re collaborating with yourself, which I found super fascinating.
So I don’t know if you have any reactions to that or not.
Zach Evans: No, I totally agree. And he’s an example I would’ve given on this: he used Dadabots to make a model on his own beatboxing, and it put out samples and flows where he was like, oh, that’s actually a pretty good idea.
I should start doing that. We saw the same thing with the game of Go. AlphaGo was a model trained by DeepMind, and it famously beat the world’s best Go player at a game people thought a computer could never learn.
Well, it did, and it didn’t end the game of Go. It made players better: by working with the bot, working with the program, they were able to get better at their craft.
So yeah, I really look forward to that. I think it’s a great move, particularly if you have a sizable library: take one of these models, train it on your own stuff, and see what comes out.
If it’s nothing inspiring, then okay, sure. But I think that’s unlikely. I dunno, maybe I don’t wanna make that claim; I don’t know if it’s true or not. But I think there’s a lot of potential in that, given the right data and trained on your own stuff, these models can make things that are similar enough but different.
And that’s pretty much what people want, for a lot of things: similar but different. So I think it’s a great tool for those kinds of things. Even back with Jukebox, I had a song that I wrote, and I fed it in there, and it continued the song after the part where I cut it off, and it added these really cool drum breaks and rhythms.
I’m like, that’s actually pretty neat. I took that, recreated it, and used it in a different song. So all these things can give you the little bits of inspiration you need to take something and run from there.
Helen Todd: Well, since you mentioned the game of Go, I’ve gotta use this opportunity to mention the movie AlphaGo.
I highly recommend it, and it’s got this amazing scene, spoiler alert, where the best Go player at the time actually did win one game against the machine. And the guys behind the computer said there was like a one-in-a-million chance that a human, or anyone, would select that move.
And when they interviewed the player, he was like, that’s the only move I saw. They call it the God move. I just love that scene in that movie so much.
Zach Evans: Yeah. I’m really excited to see what happens with music, with these technologies, what people will embrace.
And, you know, when you put out creative technology like this, you cannot predict what’s going to happen. People will do things that are a thousand times more clever and creative, and a thousand times more cringe, than I thought possible with whatever we put out. And that’s impact, you know?
Yeah. You know, Serum, Massive, the big synths that are out there: there’s plenty of incredible music made with those tools, and plenty of incredibly bad music too. That’s humanity; that’s not an AI thing. There’s a range of quality that’s different for every person, and people will do things.
Helen Todd: Well, I promise, I’ve been kicked off of karaoke stages ’cause I’m so bad at singing, so I promise I will not add to the bad music. I’ll stay in my lane. And we’ll definitely check back in, ’cause you’ve given a few teasers of some things that you can’t announce yet. So we’d love to check back in and see what announcements will be coming, other cool projects, or massively cringe projects to talk about too.
Zach Evans: Hopefully the massively cringe ones aren’t mine. I’ll do my best.
Helen Todd: Oh, well, before we sign off, we always like to do wishes or predictions, any final thoughts you’d like to let everyone know before signing off.
Zach Evans: Do things different. Maybe that’s more of an ask than a wish, but I hope that things get weirder in music. I hope that people use these tools to make new things, not try to replicate existing things.
Helen Todd: I love that. And how can people get in touch with you or do you wanna plug any links or anything?
Zach Evans: Harmonai.org, that’s h-a-r-m-o-n-a-i dot org, is our pretty barren home page. There’s a link on there to our Discord server. Join the community and stick around. Every Thursday we have office hours; we’ll all be on a Discord stage answering questions, chatting about what I’ve been reading about or working on lately.
And we do research presentations and production challenges. So come through and hang out.
Helen Todd: Amazing. Well, Drew, who’s on the phone producing this, thank you for putting us in touch. Zach, it has been an absolute pleasure. I know I’ve learned so much from you, and I know all of our listeners and viewers have too.
So thank you so much for your time, and for leading the charge against big-company control of creativity in music and AI. We need more people like you in this space.
Zach Evans: I’m happy to come on, and here’s to all of us working together to make sure this stays cool.
Helen Todd: I love that. Well, here’s to staying cool. Thank you for spending some time with us today. We’re just getting started and would love your support. Subscribe to Creativity Squared on your preferred podcast platform and leave a review. It really helps and I’d love to hear your feedback. What topics are you thinking about and want to dive into more?
I invite you to visit creativitysquared.com to let me know. And while you’re there, be sure to sign up for our free weekly newsletter so you can easily stay on top of all the latest news at the intersection of AI and creativity.
Because it’s so important to support artists, 10% of all revenue Creativity Squared generates will go to ArtsWave, a nationally recognized nonprofit that supports over a hundred arts organizations. Become a premium newsletter subscriber, or leave a tip on the website, to support this project and ArtsWave. Premium newsletter subscribers will receive NFTs of episode cover art and more extras as a thank-you for helping bring my dream to life.
And a big, big thank you to everyone who’s offered their time, energy, and encouragement and support so far. I really appreciate it from the bottom of my heart.
This show is produced and made possible by the team at Play Audio Agency. Until next week, keep creating.
Theme: Just a dream, dream, AI, AI, AI