Victoria Atkin — who played Evie Frye in 2015’s Assassin’s Creed Syndicate — tells IGN how the video game industry needs to change to protect its performers.

  • tal@kbin.social · 1 year ago

    “I’ve done so many games I’ve lost count, and there’s so much of my voice out there that I would never be able to keep track of… There’s credits that are not even on my IMDb that I’ve done. It’s just frightening… It’s kind of dangerous what they can do with it without my say.”

    This isn’t really relevant to the larger point of the article, but a technical nitpick: I seriously doubt that anyone wants to generate a voice that sounds like a particular voice actor so much as like a character in a specific game. It’s not likely that someone is going to take all the voice acting for different characters and produce some aggregate from it.

    Take Mel Blanc. He’s a famous voice actor.

    https://en.wikipedia.org/wiki/Mel_Blanc

    He voiced Bugs Bunny, Foghorn Leghorn, and Barney Rubble, among others.

    There is not now, and I suspect will not be for the foreseeable future, any useful voice model you could get by merging information from the voices of those three characters.

    Purely theoretically, okay, yeah, you could maybe statistically infer some data about the physical characteristics of the speaker that spans multiple characters, like, I don’t know, the size of their vocal cords. Though I suspect that post-processing specific to individual characters probably mucks with even that. Most of what defines those characters is character-specific. The amount of useful information you can derive across characters is going to be pretty limited.

    So I doubt that the number of different works has much impact on the accuracy of a voice model.

    I’ll also add that my experience playing around with Tortoise TTS, and what I’ve seen of “voice cloning” online services, suggests that the training set for a new voice doesn’t need to be all that large; the information these systems can learn about a voice doesn’t presently extend much beyond what’s present in a relatively small set of samples.

    https://github.com/neonbjb/tortoise-tts

    Cut your clips into ~10 second segments. You want at least 3 clips. More is better, but I only experimented with up to 5 in my testing.
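    To give a sense of scale, here’s roughly what cloning a voice from a handful of short clips looks like with the Tortoise TTS Python API, as I understand it from the repo’s README; the file names and output path are just placeholders, and the exact signatures may have shifted between versions:

    ```python
    import torchaudio

    from tortoise.api import TextToSpeech
    from tortoise.utils.audio import load_audio

    # A handful of ~10 second clips is all the "training data" the new voice gets;
    # everything else comes from the pretrained checkpoints.
    clip_paths = ["evie_01.wav", "evie_02.wav", "evie_03.wav"]  # placeholder filenames
    voice_samples = [load_audio(p, 22050) for p in clip_paths]

    tts = TextToSpeech()

    # The clips only condition the pretrained model; nothing is retrained on them.
    gen = tts.tts_with_preset(
        "Hello there, this is a cloned voice speaking.",
        voice_samples=voice_samples,
        preset="fast",
    )

    torchaudio.save("cloned_output.wav", gen.squeeze(0).cpu(), 24000)
    ```

    In other words, everything the model “knows” about the new voice comes from a few tens of seconds of audio used to steer a model that was already trained on lots of other speakers.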

    Now, I’ll grant that maybe that’s a limitation of Tortoise TTS, and that a future, more sophisticated generative AI could find useful data spanning larger datasets – I recall once seeing someone British complaining that Tortoise TTS tended to produce American-sounding voices, presumably because it was trained on American speakers – but as things stand, I don’t think the difference between many hours of speech and a relatively small amount has a massive impact. That is, most of the useful information comes from the model’s training on pre-existing voices, and the samples of the new voice mostly determine where it lies relative to those.