The Ethics of AI Audio

Posted on Fri 16 December 2022 in Blog

Some thoughts on AI ethics

A hot-topic issue on social media at the moment is the ethical implications of AI. This has long been taught and discussed in AI and machine learning courses, but it seems to have hit mainstream consciousness thanks to the recent popularity of technologies like OpenAI's DALL-E 2, Stability AI's Stable Diffusion 2 and OpenAI's ChatGPT, which have brought the potential of deep learning to a viral audience. Twitter is awash with anime characters with weird hands, generic-looking fantasy landscapes and incredibly convincing essays and short stories, all produced entirely by AI from a simple text prompt.

All of this content has sparked discussion and protest, with artists organising against the very idea of AI-generated art in their own communities. The sentiment is (in my opinion) entirely justified: these models only work because they are trained on huge swathes of information scraped from across the web. Take Stable Diffusion as an example: it is a generative model trained on the LAION-5B dataset, a scraped collection of 5.85 billion image-text pairs. The dataset attempts to dodge the issue of copyright by claiming that it only provides 'indexes to the internet', washing its hands of any potential problems, while Stable Diffusion itself states that there was no opt-out process for artists because the dataset was 'intended to be a general representation of the language-image connection of the Internet'. The result is a model that produces art based directly on the work of artists, without giving any hint of credit or financial payback to the artists who made the model possible, and with both the dataset provider and the model builder refusing to take responsibility.

I have no solutions to this problem, and it's been really good reading opinions from people far smarter than me on the subject. I try to take an optimistic view: the only reason these models are good at producing images and text is that they're built on top of the work of incredibly talented artists and writers who have published content on the internet. AI art does strike me as incredibly soulless. While the text ChatGPT produces can be eerily human, I personally suspect that there is just enough text on the internet now for it to 'steal' a good answer to almost any prompt or question and reproduce it as its own.

Is a similar problem coming for audio?

All of this discourse has made me think about a potential ethical crossroads coming for audio. I wonder if there are more stumbling blocks in the way of training huge models for generating audio: it's much harder, in terms of storage and practice, to 'mass-scrape' the internet for audio the way these models do for images and text. For generative music in particular, I doubt the large record labels and publishers would let a generative AI model trained on their artists' music get far along in development at all (judging by how harshly copyright law is enforced around playing music on YouTube, Twitch, etc.).

Generative audio models are also typically more compute-heavy than image-based and text-based models. However, the pace of deep learning development suggests to me that within the next few years we will have state-of-the-art generative audio models that produce great results.

One feasible near-future example I can think of is Foley work: instead of needing to record new soundscapes of a rainforest (for example), one could go to an AI audio generator, type in 'wet humid tropical rainforest', and the generative model would produce a soundscape matching that description. There are free sources of audio (like Freesound.org) with a huge number of freely recorded soundscapes and sound effects that could be exploited for machine learning. Is that ethical? Or is it a continuation of the issues playing out in other domains? What happens to the Foley artists, soundscape designers and audio experts who produced the work that these models could then imitate to a near-perfect degree?

It's an interesting aspect of audio to think about, and it makes me believe that all artists should be uniting against this kind of blanket 'scraping' of their work. While audio isn't a field dealing with these issues at the moment, I think it's only a few years before we're struggling with the same problems and the same dismissiveness that visual artists are facing. In the same way that people on Twitter seem to believe that 'anyone can draw this AI stuff', it will soon become 'anyone can record this audio stuff', and the value of the knowledge and expertise people bring to their art will be diminished.