Alongside cooking for myself and walking laps around the house, Japanese cartoons (or “anime,” as the kids are calling it) are something I’ve learned to love during quarantine.
The problem with watching anime, though, is that short of learning Japanese, you become dependent on human translators and voice actors to port the content to your language. Sometimes you get the subtitles (“subs”) but not the voicing (“dubs”). Other times, entire seasons of shows aren’t translated at all, and you’re left on the edge of your seat with only Wikipedia summaries and 90s web forums to ferry you through the darkness.
So what are you supposed to do? The answer is obviously not to ask a computer to transcribe, translate, and voice-act entire episodes of a TV show from Japanese to English. Translation is a careful art that can’t be automated and requires the loving touch of a human hand. Besides, even if you did use machine learning to translate a video, you couldn’t use a computer to dub… I mean, who would want to listen to machine voices for an entire season? It’d be awful. Only a real sicko would want that.
So in this post, I’ll show you how to use machine learning to transcribe, translate, and voice-act videos from one language to another, i.e. “AI-Powered Video Dubs.” It might not get you Netflix-quality results, but you can use it to localize online talks and YouTube videos in a pinch. We’ll start by transcribing audio to text using Google Cloud’s Speech-to-Text API. Next, we’ll translate that text with the Translate API. Finally, we’ll “voice act” the translations using the Text-to-Speech API, which produces voices that are, according to the docs, “humanlike.”
(By the way, before you flame-blast me in the comments, I should tell you that YouTube will automatically and for free transcribe and translate your videos for you. So you can treat this project like your new hobby of baking sourdough from scratch: a really inefficient use of 30 hours.)
AI-dubbed videos: Do they usually sound good?
Before you embark on this journey, you probably want to know what you have to look forward to. What quality can we realistically expect to achieve from an ML-video-dubbing pipeline?
Here’s one example dubbed automatically from English to Spanish (the subtitles are also automatically generated in English). I haven’t done any tuning or adjusting on it:
As you can see, the transcriptions are decent but not perfect, and the same goes for the translations. (Ignore the fact that the speaker sometimes speaks too fast; more on that later.) Overall, you can easily get the gist of what’s going on from this dubbed video, but it’s not exactly near human quality.
What makes this project trickier (read: more fun) than most is that there are at least three possible points of failure:
- The video can be incorrectly transcribed from audio to text by the Speech-to-Text API
- That text can be incorrectly or awkwardly translated by the Translation API
- Those translations can be mispronounced by the Text-to-Speech API
In my experience, the most successful dubbed videos were those that featured a single speaker over a clear audio stream and that were dubbed from English to another language. This is largely because the quality of transcription (Speech-to-Text) was much higher in English than in other source languages.
Dubbing from non-English languages proved substantially more challenging. Here’s one particularly unimpressive dub from Japanese to English of one of my favorite shows, Death Note:
If you want to leave translation/dubbing to humans, well, I can’t blame you. But if not, read on!
Building an AI Translating Dubber
As always, you can find all of the code for this project in the Making with Machine Learning GitHub repo. To run the code yourself, follow the README to configure your credentials and enable APIs. Here in this post, I’ll just walk through my findings at a high level.
First, here are the steps we’ll follow:
- Extract audio from video files
- Convert audio to text using the Speech-to-Text API
- Split transcribed text into sentences/segments for translation
- Translate text
- Generate spoken audio versions of the translated text
- Speed up the generated audio to align with the original speaker in the video
- Stitch the new audio on top of the old audio/video
I admit that when I first set out to build this dubber, I was full of hubris. All I had to do was plug a few APIs together; what could be easier? But as a programmer, all hubris must be punished, and boy, was I punished.
The challenging bits are the ones that mainly come from having to align translations with video: splitting the transcribed text into segments, and speeding up the generated audio to match the original speaker. But more on that in a bit.
Using the Google Cloud Speech-to-Text API
The first step in translating a video is transcribing its audio to words. To do this, I used Google Cloud’s Speech-to-Text API. This tool can recognize audio spoken in 125 languages, but as I mentioned above, the quality is highest in English. For our use case, we’ll want to enable a couple of special features, like:
- Enhanced models. These are Speech-to-Text models that have been trained on specific data types (“video,” “phone_call”) and are usually higher-quality. We’ll use the “video” model, of course.
- Profanity filters. This flag prevents the API from returning any naughty words.
- Word time offsets. This flag tells the API that we want transcribed words returned along with the times that the speaker said them. We’ll use these timestamps to help align our subtitles and dubs with the source video.
- Speech Adaptation. Typically, Speech-to-Text struggles most with uncommon words or phrases. If you know certain words or phrases are likely to appear in your video (e.g. “gradient descent,” “support vector machine”), you can pass them to the API in an array that will make them more likely to be transcribed:
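As a rough illustration, here's a minimal sketch of how those four options could be bundled into a recognition config. It builds a plain Python dict using the field names from the Speech-to-Text v1 REST API (`useEnhanced`, `model`, `profanityFilter`, `enableWordTimeOffsets`, `speechContexts`). The function name `build_recognition_config` and the default language are made up for this example and are not taken from the project repo.

```python
import json


def build_recognition_config(language_code="en-US", phrases=None):
    """Return a Speech-to-Text v1 RecognitionConfig as a JSON-ready dict.

    Hypothetical helper for illustration only.
    """
    config = {
        "languageCode": language_code,
        "useEnhanced": True,            # ask for an enhanced model
        "model": "video",               # the enhanced model trained on video audio
        "profanityFilter": True,        # mask naughty words
        "enableWordTimeOffsets": True,  # return start/end times for every word
    }
    if phrases:
        # Speech adaptation: hint at phrases that are likely to appear.
        config["speechContexts"] = [{"phrases": list(phrases)}]
    return config


if __name__ == "__main__":
    hints = ["gradient descent", "support vector machine"]
    print(json.dumps(build_recognition_config(phrases=hints), indent=2))
```

The resulting dict is what would go in the `config` field of a recognize request, next to the audio itself.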
The API returns the transcribed text along with word-level timestamps as JSON. As an example, I transcribed this video. You can see the JSON returned by the API in this gist. The output also lets us do a quick quality sanity check:
What I actually said:
“Software Developers. We’re not known for our rockin’ style, are we? Or are we? Today, I’ll show you how I used ML to make me trendier, taking inspiration from influencers.”
What the API thought I said:
“Software developers. We’re not known for our Rock and style. Are we or are we today? I’ll show you how I use ml to make new trendier taking inspiration from influencers.”
In my experience, this is about the quality you can expect when transcribing high-quality English audio. Note that the punctuation is a little off. If you’re happy with viewers getting the gist of a video, this is probably good enough, although it’s easy to manually correct the transcripts yourself if you speak the source language.
At this point, we can use the API output to generate (non-translated) subtitles. In fact, if you run my script with the `--srt` flag, it will do exactly that for you (srt is a file type for closed captions):
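To give a feel for what goes into an srt file, here's a minimal sketch that turns timed text segments into SRT-formatted captions. It assumes the transcript has already been grouped into `(start_seconds, end_seconds, text)` tuples. The function names are invented for this example and aren't the ones used in the repo.

```python
def format_srt_time(seconds):
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments):
    """Build SRT text from (start_seconds, end_seconds, text) tuples."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        # Each caption: a counter, a time range, the text, then a blank line.
        blocks.append(
            f"{index}\n{format_srt_time(start)} --> {format_srt_time(end)}\n{text}\n"
        )
    return "\n".join(blocks)


if __name__ == "__main__":
    print(to_srt([(0.0, 1.2, "Software developers."), (1.2, 3.0, "Or are we?")]))
```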
Now that we have the video transcripts, we can use the Translate API to… uh… translate them.
This is where things start to get a little 🤪.
Our objective is this: we want to be able to translate words in the original video and then play them back at roughly the same point in time, so that my “dubbed” voice is speaking in alignment with my actual voice.
The problem, though, is that translations aren’t word-for-word. A sentence translated from English to Japanese may have its word order jumbled. It may contain fewer words, more words, different words, or (as is the case with idioms) completely different wording.
One way we can get around this is by translating entire sentences and then trying to align the time boundaries of those sentences. But even this becomes complicated, because how do you denote a single sentence? In English, we can split words by punctuation mark, i.e.:
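As a rough illustration of that kind of splitting, here's a minimal sketch using Python's built-in `re` module. The function name and the sample sentence are invented for this example.

```python
import re


def split_sentences(text):
    """Split text after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


if __name__ == "__main__":
    print(split_sentences("Hi! My name is Dale. What's up?"))
```

Note that this only knows about three English punctuation marks, which is exactly the limitation discussed next.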
But punctuation differs by language (there’s no ¿ in English), and some languages don’t separate sentences by punctuation marks at all.
Plus, in real-life speech, we often don’t talk in complete sentences. Y’know?
Another wrinkle that makes translating transcripts difficult is that, in general, the more context you feed into a translation model, the higher quality translation you can expect. So for example, if I translate the following sentence into French:
“I’m feeling blue, but I like pink too.”
I’ll get the translation:
“Je me sens bleu, mais j’aime aussi le rose.”
This is accurate. But if I split that sentence in two (“I’m feeling blue” and “But I like pink too”) and translate each part separately, I get:
“Je me sens triste, mais j’aime aussi le rose”, i.e. “I’m feeling sad, but I like pink too.”
This is all to say that the more we chop up text before sending it to the Translate API, the worse quality the translations will be (though it’ll be easier to temporally align them with the video).
Ultimately, the strategy I chose was to split up spoken words every time the speaker took a greater-than-one-second pause in speaking. Here’s an example of what that looked like:
This naturally led to some awkward translations (e.g. “or are we” is a weird fragment to translate), but I found it worked well enough. Here’s what that logic looks like in code.
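A minimal sketch of the pause-based splitting described above, not the exact code from the repo. It assumes the word-level timestamps have already been parsed into dicts with `word`, `start`, and `end` keys (times in seconds). That input shape and the function name `split_on_pauses` are assumptions made for this example.

```python
def split_on_pauses(words, max_pause=1.0):
    """Group timed words into segments, breaking on pauses longer than max_pause.

    `words` is a list of dicts with "word", "start" and "end" keys (seconds).
    Returns a list of dicts with "text", "start" and "end" keys.
    """
    segments = []
    current = []
    for word in words:
        # Start a new segment when the gap since the previous word is too long.
        if current and word["start"] - current[-1]["end"] > max_pause:
            segments.append(current)
            current = []
        current.append(word)
    if current:
        segments.append(current)
    return [
        {
            "text": " ".join(w["word"] for w in seg),
            "start": seg[0]["start"],
            "end": seg[-1]["end"],
        }
        for seg in segments
    ]


if __name__ == "__main__":
    demo = [
        {"word": "Software", "start": 0.0, "end": 0.5},
        {"word": "developers.", "start": 0.5, "end": 1.1},
        {"word": "Or", "start": 2.5, "end": 2.7},
        {"word": "are", "start": 2.7, "end": 2.9},
        {"word": "we?", "start": 2.9, "end": 3.2},
    ]
    for segment in split_on_pauses(demo):
        print(segment)
```

Each segment keeps the start time of its first word and the end time of its last word, which is what the later alignment step needs.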
Side bar: I also noticed that the accuracy of the timestamps returned by the Speech-to-Text API was significantly lower for non-English languages, which further decreased the quality of non-English-to-English dubbing.
And one last thing. If you already know how you want certain words to be translated (e.g. my name, “Dale,” should always be translated simply to “Dale”), you can improve translation quality by taking advantage of the “glossary” feature of the Translation API Advanced. I wrote a blog post about that here.
The Media Translation API
As it happens, Google Cloud is working on a new API to handle exactly the problem of translating spoken words. It’s called the Media Translation API, and it runs translation on audio directly (i.e. no transcribed text intermediary). I wasn’t able to use that API in this project because it doesn’t yet return timestamps (the tool is currently in beta), but I think it’d be great to use in future iterations!
Now for the fun bit: picking out computer voices! If you read about my PDF-to-Audiobook converter, you know that I love me a funny-sounding computer voice. To generate audio for dubbing, I used the Google Cloud Text-to-Speech API. The TTS API can generate lots of different voices in different languages with different accents, which you can find and play with here. The “Standard” voices might sound a bit, er, tinny, if you know what I mean, but the WaveNet voices, which are generated by high-quality neural networks, sound decently human.
Here I ran into another problem I didn’t foresee: what if a computer voice speaks a lot slower than a video’s original speaker does, so that the generated audio file is too long? Then the dubs would be impossible to align to the source video. Or, what if a translation is more verbose than the original wording, leading to the same problem?
To deal with this issue, I played around with the speakingRate parameter available in the Text-to-Speech API. This allows you to speed up or slow down a computer voice:
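For illustration, here's a minimal sketch of where `speakingRate` sits in a synthesis request. It builds the request body as a plain dict using the field names from the Text-to-Speech v1 REST API, where a rate of 1.0 is normal speed and the accepted range is 0.25 to 4.0. The helper name `build_tts_request` and the default language are made up for this example.

```python
def build_tts_request(text, language_code="es-ES", voice_name=None, speaking_rate=1.0):
    """Return a Text-to-Speech v1 synthesize request body as a JSON-ready dict.

    Hypothetical helper for illustration only.
    """
    if not 0.25 <= speaking_rate <= 4.0:
        raise ValueError("speakingRate must be between 0.25 and 4.0")
    voice = {"languageCode": language_code}
    if voice_name:
        voice["name"] = voice_name  # a specific voice, if one is wanted
    return {
        "input": {"text": text},
        "voice": voice,
        "audioConfig": {"audioEncoding": "MP3", "speakingRate": speaking_rate},
    }


if __name__ == "__main__":
    # Ask for the same sentence spoken 25% faster than normal.
    print(build_tts_request("Hola, me llamo Dale.", speaking_rate=1.25))
```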
So, if it took the computer longer to speak a sentence than it did for the video’s original speaker, I increased the speakingRate until the computer and human took up about the same amount of time.
Sound a little complicated? Here’s what the code looks like:
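A minimal sketch of the idea, not the exact code from the repo. It speaks the text once at normal speed, measures how long the audio runs, and if it overruns the original speaker's time slot, speaks it again at a rate equal to the ratio of the two durations. Two assumptions are baked in: `synthesize(text, rate)` is a hypothetical callable that returns `(audio_bytes, duration_seconds)`, and the audio duration shrinks roughly in proportion to the speaking rate.

```python
MAX_SPEAKING_RATE = 4.0  # upper bound accepted by the Text-to-Speech API


def fit_speaking_rate(spoken_seconds, target_seconds):
    """Return the speakingRate needed to fit the audio into target_seconds."""
    if spoken_seconds <= 0 or target_seconds <= 0:
        raise ValueError("durations must be positive")
    if spoken_seconds <= target_seconds:
        return 1.0  # already fits, so leave the voice at normal speed
    return min(spoken_seconds / target_seconds, MAX_SPEAKING_RATE)


def speak_to_fit(synthesize, text, target_seconds):
    """Synthesize text, speeding the voice up if it overruns target_seconds.

    `synthesize(text, rate)` is a hypothetical callable returning
    (audio_bytes, duration_seconds).
    """
    audio, duration = synthesize(text, 1.0)
    rate = fit_speaking_rate(duration, target_seconds)
    if rate > 1.0:
        audio, duration = synthesize(text, rate)
    return audio, rate


if __name__ == "__main__":
    # Stand-in for a real Text-to-Speech call: 6 seconds of audio at normal speed.
    def fake_synthesize(text, rate):
        return b"audio", 6.0 / rate

    print(speak_to_fit(fake_synthesize, "Hola, me llamo Dale.", target_seconds=4.0))
```

Because the rate is capped at the API maximum, a very long translation can still overrun its slot; this sketch doesn't try to handle that case.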
This solved the problem of aligning audio to video, but it did sometimes mean the computer speakers in my dubs were a little awkwardly fast. But that’s a problem for V2.
Was it worth it?
You know the expression, “Play stupid games, win stupid prizes?” It feels like every ML project I build here is something of a labor of love, but this time, I love my stupid prize: the ability to generate an unlimited number of weird, robotic, awkward anime dubs that are sometimes kinda decent.
Check out my results here: