Tue, May 21 2024

Although OpenAI created a voice cloning technology, it is not yet usable.

March 30, 2024
10 Min Reads

OpenAI is improving the technology used to clone voices as deepfakes become more common, but the company says it's doing it ethically.

AI

OpenAI's Voice Engine, an extension of the company's existing text-to-speech API, debuts in preview today. Voice Engine, which has been in development for around two years, lets users upload any 15-second voice sample to generate a synthetic copy of that voice. However, no date has been set for public availability, which gives the company time to respond to how the model is used and abused.

In an interview with TechCrunch, Jeff Harris, an OpenAI product team member, stated, "We want to make sure that everyone feels good about how it's being deployed—that we understand the landscape of where this tech is dangerous and we have mitigations in place for that."

Training the model


According to Harris, the generative AI model behind Voice Engine has been hiding in plain sight for a while.

The same model powers the voice and "read aloud" features in ChatGPT, OpenAI's AI-powered chatbot, as well as the preset voices available in OpenAI's text-to-speech API. Spotify has also been using it since early September to translate podcasts for high-profile hosts like Lex Fridman.

It's a sensitive topic, so I asked Harris where the model's training data came from. All he would say was that the Voice Engine model was trained on a mix of licensed and publicly available data.

Models like the one that powers Voice Engine are trained on examples, in this case speech recordings, that are often taken from public websites and online data sets. Many generative AI vendors keep training data and related details close to the vest because they see them as a competitive advantage. But training data details are also a potential source of IP-related lawsuits, which is another disincentive to reveal much.

OpenAI is already being sued over allegations that it violated intellectual property law by training its AI on copyrighted material, including images, artwork, code, articles and e-books, without crediting or paying the original creators.

OpenAI has licensing agreements in place with several content providers, including Shutterstock and the news publisher Axel Springer, and it lets webmasters block its web crawler from collecting training data from their sites. OpenAI also lets artists "opt out" of, and remove their work from, the datasets the company uses to train its image-generating models, including the most recent, DALL-E 3.
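As a rough illustration of the crawler block mentioned above, the sketch below checks a robots.txt policy with Python's standard `urllib.robotparser`. It assumes the crawler identifies itself as `GPTBot`, the user agent OpenAI documents for its web crawler; the article itself does not name it, and `example.com` and `SomeOtherBot` are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for example.com (not taken from the article).
# Assumption: OpenAI's crawler announces itself with the user agent "GPTBot".
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

page = "https://example.com/articles/voice-engine"

# The GPTBot rule disallows the whole site; every other crawler is allowed.
gptbot_allowed = parser.can_fetch("GPTBot", page)
other_allowed = parser.can_fetch("SomeOtherBot", page)

print("GPTBot allowed:", gptbot_allowed)
print("SomeOtherBot allowed:", other_allowed)
```

Note that robots.txt is a voluntary convention: it only works if the crawler chooses to honor it, and it does nothing about material that was collected before the rule was added.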

However, OpenAI offers no such opt-out scheme for its other products. And in a recent statement to the U.K.'s House of Lords, OpenAI suggested that it is “impossible” to create useful AI models without copyrighted material, asserting that fair use, the legal doctrine that permits the use of copyrighted works to make a secondary creation as long as it is transformative, shields it where model training is concerned.

Synthesizing voice

Surprisingly, Voice Engine is not trained or fine-tuned on user data. That is partly because of the ephemeral way the model, which combines a diffusion process and a transformer, generates speech.

"We generate realistic speech that matches the original speaker using a small audio sample and text," Harris said. "Once the request is fulfilled, the audio that was used is removed."

As he explained it, the model analyzes the speech sample it draws from and the text to be read aloud at the same time, which lets it generate a matching voice without building a custom model for each speaker.

It's not new technology. For years, a variety of firms, including ElevenLabs, Replica Studios, Papercup, Deepdub, and Respeecher, have been offering voice cloning technologies. Big Tech heavyweights like Amazon, Google, and Microsoft have also done so. Incidentally, Microsoft is one of OpenAI's biggest investors.

According to Harris, OpenAI's method produces speech that is generally of superior quality.

We also know that it will be priced aggressively. Although OpenAI removed pricing from the marketing materials it published today, documents seen by TechCrunch list Voice Engine at $15 per one million characters, or around 162,500 words. That would fit Charles Dickens' "Oliver Twist" with plenty of room to spare. (An "HD" quality option costs twice that, but an OpenAI representative told TechCrunch that there is no difference between HD and non-HD voices. Make of that what you will.)

That works out to around 18 hours of audio, which puts the price at a little under $1 per hour. That is indeed cheaper than ElevenLabs, one of the better-known rival vendors, which charges $11 for 100,000 characters per month. But the lower price comes at the expense of some customization.
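The arithmetic behind those figures can be checked with a short sketch. It uses only the numbers quoted above; the constant and function names are illustrative, and the ElevenLabs figure is a monthly plan price, so the per-million-character comparison is approximate rather than like-for-like.

```python
# Figures quoted in the article; everything else is derived from them.
VOICE_ENGINE_USD_PER_MILLION_CHARS = 15.0
WORDS_PER_MILLION_CHARS = 162_500        # article's estimate
AUDIO_HOURS_PER_MILLION_CHARS = 18.0     # article's estimate
ELEVENLABS_USD_PER_100K_CHARS = 11.0     # monthly plan price


def usd_per_audio_hour(usd: float, hours: float) -> float:
    """Cost of one hour of generated audio."""
    return usd / hours


def usd_per_million_chars(usd: float, chars: int) -> float:
    """Scale a price for `chars` characters up to one million characters."""
    return usd * 1_000_000 / chars


voice_engine_hourly = usd_per_audio_hour(
    VOICE_ENGINE_USD_PER_MILLION_CHARS, AUDIO_HOURS_PER_MILLION_CHARS
)
elevenlabs_per_million = usd_per_million_chars(
    ELEVENLABS_USD_PER_100K_CHARS, 100_000
)
# Speaking rate implied by the article's word and hour estimates.
implied_words_per_minute = WORDS_PER_MILLION_CHARS / (
    AUDIO_HOURS_PER_MILLION_CHARS * 60
)

print(f"Voice Engine: ${voice_engine_hourly:.2f} per audio hour")
print(f"ElevenLabs:   ${elevenlabs_per_million:.2f} per million characters")
print(f"Implied pace: {implied_words_per_minute:.0f} words per minute")
```

Under these assumptions, Voice Engine comes to about $0.83 per hour of audio, and the quoted ElevenLabs rate scales to $110 per million characters against Voice Engine's $15. The implied pace of roughly 150 words per minute is in line with ordinary narration speed, which suggests the 18-hour estimate is reasonable.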

Voice Engine does not offer controls for adjusting a voice's pitch, tone or cadence. In fact, it currently has no fine-tuning knobs or dials at all, although Harris notes that any expressiveness in the 15-second voice sample carries over into everything generated from it (for instance, if you speak in an excited tone, the resulting synthetic voice will sound consistently excited). We will see how the reading quality stacks up against other models once a direct comparison is possible.

Voice talent as a commodity


Voice actor pay listed on ZipRecruiter ranges from $12 to $79 per hour, which is far more expensive than Voice Engine even at the low end (and actors with agents will command a much higher price per project). If it catches on, OpenAI's tool could commoditize voice work. So where does that leave actors?

The talent industry wouldn't exactly be caught off guard; it has been grappling with the existential threat of generative AI for some time. Voice actors are increasingly being asked to sign away the rights to their voices so that clients can use AI to generate synthetic versions that could eventually replace them. Voice work, particularly cheap, entry-level work, is at risk of being eliminated in favor of AI-generated speech.

Some AI voice platforms are now trying to strike a balance.

Last year, Replica Studios signed a somewhat contentious deal with SAG-AFTRA to create and license copies of the voices of the media artist union's members. The organizations said the arrangement established fair and ethical terms and conditions to ensure performer consent when negotiating the use of synthetic voices in new works, including video games.

ElevenLabs, meanwhile, hosts a marketplace for synthetic voices where users can create a voice, verify it and share it publicly. When others use a voice, its original creator is compensated with a set amount per 1,000 characters.

OpenAI will establish no such labor union deals or marketplaces, at least not in the near term. Instead, it requires only that users obtain “explicit consent” from the people whose voices are cloned, disclose which voices are AI-generated, and agree not to use the voices of minors, deceased people or political figures in the speech they generate.

"We're keeping a careful eye on this and are quite interested in how it relates to the voice actor industry," Harris said. "I believe that with this kind of technology, voice actors will have many opportunities to sort of expand their reach. However, we will discover all of this when people use and experiment with the technology more."

Ethics and deepfakes


Apps for voice cloning can be exploited in ways that go well beyond endangering performers' careers.

The infamous message board 4chan, known for its conspiratorial content, used ElevenLabs' platform to share hateful messages mimicking celebrities like Emma Watson. The Verge's James Vincent was able to use AI tools to clone voices quickly and maliciously, generating samples that ranged from violent threats to transphobic remarks. And Joseph Cox, a reporter at Vice, documented generating a voice clone convincing enough to fool a bank's authentication system.

There are fears that bad actors will try to sway elections with voice cloning, and those fears are not unfounded. In January, a phone campaign used a deepfaked President Biden to deter New Hampshire citizens from voting, which prompted the FCC to move to make future campaigns of that kind illegal.

Apart from banning deepfakes at the policy level, what steps is OpenAI taking to prevent Voice Engine from being misused? Harris listed a few.

First, Voice Engine is initially only being made accessible to a very limited number of developers (about 10). OpenAI is focusing on "low risk" and "socially beneficial" use cases, such as healthcare and accessibility, in addition to experimenting with "responsible" synthetic media, according to Harris.

Early adopters include Age of Learning, an edtech company that uses the tool to generate voice-overs from previously cast actors, and HeyGen, a storytelling app. Livox and Lifespan are using Voice Engine to create voices for people with speech impairments and disabilities, and Dimagi is building a Voice Engine-based tool that gives health workers feedback in their primary languages.

Second, clones created with Voice Engine are watermarked using a technique OpenAI developed that embeds inaudible identifiers in recordings. (Other vendors, including Resemble AI and Microsoft, use similar watermarks.) Harris described the watermark as "tamper resistant," but he did not promise that there are no ways to circumvent it.

"It's really easy for us to look at an audio clip that's out there and determine that it was generated by our system and the developer who actually did that generation," Harris said. "We're keeping it internally for the time being; it's not open sourced yet. Although there are undoubtedly additional dangers associated with making it public, we are interested in doing so."

Third, OpenAI plans to give Voice Engine access to members of its red teaming network, a contracted group of experts who help inform the company's AI model risk assessment and mitigation strategies, so they can probe for malicious uses.

Some experts argue that AI red teaming is not exhaustive enough and that it is incumbent on vendors to develop tools to defend against the harms their AI might cause. OpenAI is not going quite that far with Voice Engine, but Harris says the company's "top principle" is releasing the technology safely.

Public release


Depending on how the preview goes and how the public receives Voice Engine, OpenAI may release the tool to its wider developer base, but for now the company is reluctant to commit to anything concrete.

Harris did, however, offer a look at Voice Engine's roadmap, revealing that OpenAI is developing a security mechanism that has users read randomly generated text aloud as proof that they are present and aware of how their voice is being used. That could give OpenAI the confidence it needs to bring Voice Engine to more people, Harris said, or it might be just the beginning.

"What we learn from the pilot, the safety issues that are uncovered, and the mitigations that we have in place are really going to determine what's going to keep pushing us forward in terms of the actual voice matching technology," he stated. "We want people to be able to distinguish between real human voices and artificial voices."
