Victoria Atkin — who played Evie Frye in 2015’s Assassin’s Creed Syndicate — tells IGN how the video game industry needs to change to protect its performers.
I’d call mods one of the best applications for AI-generated voice samples. It is extremely unlikely that a fan-made mod will ever get the original voice actor on board, and without those voice actors, a mod can’t fit in seamlessly with the rest of the game.
We could have a world where modding just doesn’t happen at all. But if we’re going to have mods, voice synthesis makes it practical to extend games that otherwise could not realistically be extended seamlessly by third parties.
It’s possible to make textures or models that fit in with the original environments, or to write new text. But people are pretty good at distinguishing voices, so without computer-synthesized voices, one can’t really create new speech for existing characters.
The problem isn’t the mods themselves, but that they serve as a proof of concept for the publishers to do the same.
I’ll also add that I’m skeptical that the US, at least, is going to treat AI models trained on something as intrinsically creating copyright-infringing derivative works, though I don’t know for sure what the EU will do. However, even if one assumes that some jurisdiction does decide to treat models as derivative works, there’s a fairly straightforward way to keep distributing mods that I’d expect to remain legal, and that has been used in the past to avoid distributing copyrighted assets: distribute them as a patch against the original work.
It is legal for the end user to modify a copyrighted work that he owns. So if I distribute a patch that takes Voice Actor X’s base-game audio as input and uses it to generate new lines, that’s not a copyright problem. Copyright only deals with distribution from one person to another. I can create all the derivative works I want myself – as the end user – as long as I don’t distribute them.
In fact, while it’s probably not a very CPU-efficient way to distribute it – going to waste the world’s electricity, do another Bitcoin – one approach might be to just distribute Tortoise TTS, or whatever it is that people are using to generate the audio, along with the marked-up text to regenerate, and have the regeneration run on the end user’s computer using the original voice assets; something like the sketch below. Tortoise TTS has expensive generation, but unlike, say, Stable Diffusion, where adapting to new content requires a lot of training compute, it picks up a new voice very quickly. It would be bandwidth-efficient, at any rate.
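To make that concrete, here’s a rough sketch of what the “regenerate on the player’s machine” approach could look like with the tortoise-tts package linked further down. The manifest format, file names, and game paths are all made up for illustration; only the Tortoise API calls are real, and those are based on my reading of the repo, so treat them as an assumption.

```python
# Sketch: regenerate a mod's voice lines on the end user's machine,
# conditioning Tortoise TTS on audio files already present in the player's
# own copy of the game. Paths and the manifest format are hypothetical.
import json
import torchaudio
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_audio

GAME_AUDIO_DIR = "/path/to/installed/game/sound/voice"  # player's own install
MANIFEST = "mod_voice_manifest.json"                     # shipped with the mod

# Hypothetical manifest: which base-game clips to condition on, and which new
# lines to generate. Example:
# {
#   "reference_clips": ["evie_001.wav", "evie_017.wav", "evie_042.wav"],
#   "lines": [{"text": "New dialogue written by the mod.", "out": "mod_line_001.wav"}]
# }
with open(MANIFEST) as f:
    manifest = json.load(f)

# Tortoise conditions on short reference clips loaded at 22.05 kHz.
voice_samples = [
    load_audio(f"{GAME_AUDIO_DIR}/{name}", 22050)
    for name in manifest["reference_clips"]
]

tts = TextToSpeech()
for line in manifest["lines"]:
    # Generation is slow, but it runs entirely on the player's machine, so the
    # mod itself never has to include any generated audio.
    audio = tts.tts_with_preset(
        line["text"], voice_samples=voice_samples, preset="fast"
    )
    torchaudio.save(line["out"], audio.squeeze(0).cpu(), 24000)
```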
But the point is, that approach is unquestionably legal, and it still winds up in a place where the end user has the mod, with the same new voice data, on his computer.
And given that, I don’t really see the point in trying to prohibit distributing the AI-generated speech files, from the standpoint of someone trying to block players from playing a mod that uses AI-generated voices, because those players are going to wind up in basically the same place regardless of which route they take. The full-regeneration route is maybe marginally more obnoxious – you might have to run the “regenerate the mod voices” process overnight – but it’s not generally going to stop the player from getting and playing the mod.
If you want to kill modding you can fuck right off. Mods are not the issue here. Corporations not paying you because they used AI instead of a voice actor is the issue. Modders were never going to pay Victoria here for anything.
Why protect? This is a huge boon for the industry that will allow all games to become more accessible and feature-rich.
There’ll be fewer voice actors in the future for sure, just like there are fewer radio stars and telephone operators. The world moves on.
Even trying to control it seems hopeless when the voice generation is getting so good with just a minute of audio etc.
“I’ve done so many games I’ve lost count, and there’s so much of my voice out there that I would never be able to keep track of… There’s credits that are not even on my IMDb that I’ve done. It’s just frightening… It’s kind of dangerous what they can do with it without my say.”
This isn’t really relevant to the larger point of the article, but a technical nitpick: I seriously doubt that anyone wants to generate a voice that sounds like the voice actor so much as one that sounds like a character in a specific game. It’s not that someone is likely to take all the voice acting for different characters and produce some aggregate from that.
Take Mel Blanc. He’s a famous voice actor.
https://en.wikipedia.org/wiki/Mel_Blanc
He voiced Bugs Bunny, Foghorn Leghorn, and Barney Rubble, among others.
There is not now, and I suspect will not be for the foreseeable future, any useful voice model that you could get by merging information from the voices of those three characters.
Purely theoretically, okay, yeah, you could maybe statistically infer some data about the physical characteristics of the speaker that spans multiple characters, like I don’t know, the size of vocal cords. Though I suspect that post-processing specific to individual characters probably mucks with even that. Most of what defines those characters is character-specific. The amount of useful information that you can derive across characters is gonna be pretty limited.
So I doubt that the number of different works has much impact on the accuracy of a voice model.
I’ll also add that my experience playing around with Tortoise TTS, and what I’ve seen of “voice cloning” online services, suggests that the training set for a new voice doesn’t need to be all that large: the information these systems can learn about a voice doesn’t presently extend much beyond what’s present in a relatively small set of samples.
https://github.com/neonbjb/tortoise-tts
Cut your clips into ~10 second segments. You want at least 3 clips. More is better, but I only experimented with up to 5 in my testing.
Now, I will believe that maybe that’s a limitation of Tortoise TTS, and that a future, more sophisticated generative AI could find useful data spanning larger datasets (I recall someone British complaining that Tortoise TTS tended to produce American-sounding voices, presumably because it was trained on American speakers). But as things stand, I don’t think the difference between many hours of speech and a relatively small amount has a massive impact. That is, most of the useful information comes from the model’s training on pre-existing voices, and the new samples mostly determine where the new voice lies relative to those.
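For what it’s worth, here’s roughly what “training” a new voice looks like in Tortoise: as far as I can tell there’s no per-voice training run at all, just conditioning latents computed from a few short reference clips, which is why it’s so fast. A minimal sketch, assuming the repo linked above; the clip filenames are made up, and the exact API details are my reading of the code rather than anything official.

```python
# Sketch: "learning" a new voice in Tortoise is just computing conditioning
# latents from a few ~10 second reference clips; the heavy lifting was already
# done when the base model was trained on its corpus of pre-existing voices.
# Clip filenames below are hypothetical.
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_audio

clips = ["clip_01.wav", "clip_02.wav", "clip_03.wav"]  # 3-5 short clips suffice
voice_samples = [load_audio(c, 22050) for c in clips]

tts = TextToSpeech()
# This takes seconds, not hours: it only positions the new voice relative to
# the voices the base model already knows.
conditioning_latents = tts.get_conditioning_latents(voice_samples)

# Reuse the latents for as many new lines as you like.
audio = tts.tts_with_preset(
    "A line this speaker never actually recorded.",
    conditioning_latents=conditioning_latents,
    preset="fast",
)
```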
“I was kind of shocked, really, that it’s been used without my say, without my consent, it’s just out there,”
I mean, it’s fucked up, yeah. But you have to treat any public info as a permanent record that anyone can see, modify, and use…
Once something is “out there” as she says, there is no taking control back. So be mindful of what you post to the internet.
As for actors, they give the rights for their voice to be used and so on. I’m not saying people have the right to do whatever they want with that, but once again, you can’t take back control of publicly available data. People WILL do immoral things with your work. And you have to be okay with that, because that is the present, and the future.