Voice Translation using Hugging Face

Voice translation converts speech in one language into speech in another. It works like an AI-powered interpreter: it listens to speech, transcribes it, translates the text and speaks the result in the target language. At a technical level, voice translation chains three main components:

Speech Recognition (ASR): Converts spoken audio into text
Machine Translation (MT): Translates the extracted text into a target language
Text to Speech (TTS): Converts the translated text back into natural-sounding audio

Figure: Voice translation pipeline
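The three-stage flow described above can be sketched as plain function composition. This is only an illustration of the data flow: `translate_voice` and the `fake_*` stand-ins are hypothetical names invented for this sketch, not part of any library, and the stand-ins do no real speech processing.

```python
from typing import Callable

# Each stage is modeled as a plain callable so real models can be swapped in later.
Recognizer = Callable[[bytes], str]    # ASR: audio -> source-language text
Translator = Callable[[str], str]      # MT: source text -> target text
Synthesizer = Callable[[str], bytes]   # TTS: target text -> audio


def translate_voice(audio: bytes, asr: Recognizer, mt: Translator, tts: Synthesizer) -> bytes:
    """Run the three stages in order: ASR, then MT, then TTS."""
    source_text = asr(audio)
    target_text = mt(source_text)
    return tts(target_text)


# Toy stand-ins that only demonstrate how data moves between stages.
def fake_asr(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_mt(text: str) -> str:
    return {"hello": "नमस्ते"}.get(text, text)


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


result = translate_voice(b"hello", fake_asr, fake_mt, fake_tts)
print(result.decode("utf-8"))
```

Keeping each stage behind a simple callable makes it easier to replace one model (for example, the translation model) without touching the other two.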

Implementation

This implementation converts English speech into Hindi speech using three models:

Whisper: A transformer-based automatic speech recognition (ASR) model by OpenAI that converts spoken audio into text. It is multilingual and robust to background noise.
Helsinki-NLP OPUS-MT: A sequence-to-sequence transformer model for translating text between languages. This tutorial uses the English-to-Hindi checkpoint.
MMS-TTS: A multilingual text-to-speech model from Meta's Massively Multilingual Speech (MMS) project that converts text into speech using the VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) architecture.

Step 1: Set Up the Environment

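First, install the required libraries by running the following command in a terminal. In addition to the model and audio libraries, requests is needed for the download in Step 3 and sentencepiece is needed by the translation model's tokenizer.

pip install transformers torch torchaudio soundfile requests sentencepiece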
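As an optional sanity check after installation, the sketch below reports which of the four core packages cannot be found by the current Python interpreter. The helper name `missing_packages` is made up for this example; it relies only on the standard library's `importlib.util.find_spec`.

```python
import importlib.util

# Core packages used throughout this tutorial.
REQUIRED = ["transformers", "torch", "torchaudio", "soundfile"]


def missing_packages(names):
    """Return the module names that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```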

Step 2: Import Required Libraries

These libraries provide:

Pretrained transformer models from Hugging Face
Audio processing utilities
Waveform generation and saving functionality

Python

from transformers import WhisperProcessor, WhisperForConditionalGeneration
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from transformers import AutoProcessor, VitsModel
import torch
import torchaudio
import soundfile as sf

Step 3: Download Sample Audio Input

This audio file serves as the input to our speech recognition system.

Python

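import requests

url = "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac"
r = requests.get(url, timeout=60)
r.raise_for_status()  # stop here if the download failed

# Save the audio next to the script or notebook
with open("input_audio.flac", "wb") as f:
    f.write(r.content)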
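A failed download can leave an HTML error page saved under the `.flac` name, which only surfaces later as a confusing audio-loading error. The sketch below is a lightweight check, assuming a native FLAC file, which begins with the four bytes `fLaC`. The helper name `looks_like_flac` is hypothetical, and files with extra metadata prepended to the stream may not pass this simple test.

```python
from pathlib import Path

FLAC_MAGIC = b"fLaC"  # marker at the start of a native FLAC stream


def looks_like_flac(path):
    """Return True if the file exists and starts with the FLAC stream marker."""
    p = Path(path)
    if not p.is_file():
        return False
    with p.open("rb") as f:
        return f.read(4) == FLAC_MAGIC


if looks_like_flac("input_audio.flac"):
    print("input_audio.flac looks like a FLAC file.")
else:
    print("input_audio.flac is missing or is not FLAC; check the download.")
```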

Step 4: Converting Speech to Text

Load Whisper Model

Python

whisper_processor = WhisperProcessor.from_pretrained("openai/whisper-small")
whisper_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

Load and Resample Audio

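Here we load the audio file saved in Step 3 and check its sampling rate. Whisper expects 16 kHz mono audio, so the code averages the channels if the file is not mono, resamples the waveform if needed and updates the sampling rate. The processor then converts the waveform into the input features the model expects.

Python

# Load the file saved in Step 3; shape is (channels, samples)
audio, sr = torchaudio.load("input_audio.flac")

# Downmix to mono so the processor receives a 1-D waveform
if audio.shape[0] > 1:
    audio = audio.mean(dim=0, keepdim=True)

# Whisper expects 16 kHz input
if sr != 16000:
    resampler = torchaudio.transforms.Resample(orig_freq=sr, new_freq=16000)
    audio = resampler(audio)
    sr = 16000

inputs = whisper_processor(
    audio.squeeze().numpy(),
    sampling_rate=sr,
    return_tensors="pt"
)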
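Whisper works on 30-second windows of 16 kHz audio, and with default settings the processor pads or truncates the input to a single window. The sketch below estimates how many windows a clip spans, so you can tell whether a simple single-pass transcription would cover all of it. `clip_summary` is a hypothetical helper, and the sample count of 208,000 is an illustrative value, not a measurement of the tutorial's audio file; in practice you could pass `audio.shape[-1]`.

```python
WHISPER_SAMPLE_RATE = 16_000   # Hz, the rate Whisper expects
WHISPER_WINDOW_SECONDS = 30    # length of one Whisper input window


def clip_summary(num_samples, sample_rate):
    """Return (duration_seconds, number_of_30_second_windows) for a clip."""
    duration = num_samples / sample_rate
    window_samples = WHISPER_WINDOW_SECONDS * sample_rate
    # Integer ceiling division: windows needed to cover every sample.
    windows = max(1, -(-num_samples // window_samples))
    return duration, windows


duration, windows = clip_summary(num_samples=208_000, sample_rate=WHISPER_SAMPLE_RATE)
print(f"{duration:.1f} s of audio, {windows} window(s)")
```

If the clip spans more than one window, it would need to be split into chunks, or transcribed with a long-form approach, before being passed to the model.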

Convert Audio to Text

This code passes the input features prepared by whisper_processor to the model, which generates token IDs. batch_decode then converts those token IDs into readable text. Finally, the transcribed speech is printed.

Python

with torch.no_grad():
    predicted_ids = whisper_model.generate(inputs["input_features"])

speech_text = whisper_processor.batch_decode(
    predicted_ids,
    skip_special_tokens=True
)[0]

print("Recognized Text:", speech_text)

Output:

Recognized Text: I have a dream that one day this nation will rise up and live out the true meaning of its creed.
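By default Whisper detects the spoken language on its own. When the source language is known, it can be stated explicitly. The sketch below wraps that in a hypothetical helper, `transcribe`; it assumes a recent transformers release in which Whisper's `generate()` accepts the `language` and `task` keyword arguments, so treat it as a starting point rather than a guaranteed interface for every version.

```python
def transcribe(model, processor, input_features, language="en"):
    """Transcribe with an explicit language instead of relying on auto-detection.

    `model` and `processor` are expected to behave like the Whisper model and
    processor loaded earlier. The `language` and `task` keyword arguments are
    an assumption about the installed transformers version.
    """
    predicted_ids = model.generate(
        input_features,
        language=language,
        task="transcribe",
    )
    text = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
    return text.strip()


# Possible usage with the objects created in this step:
# speech_text = transcribe(whisper_model, whisper_processor, inputs["input_features"])
```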

Step 5: Machine Translation (English to Hindi)

Load Translation Model

Python

model_name = "Helsinki-NLP/opus-mt-en-hi"

tokenizer = AutoTokenizer.from_pretrained(model_name)
translation_model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

Translate Text

This code converts the recognized speech text into tokens using the tokenizer. The translation model then generates translated token IDs, and decode() converts them back into readable text. Finally, it prints the translated sentence.

Python

inputs = tokenizer(speech_text, return_tensors="pt")

with torch.no_grad():
    outputs = translation_model.generate(**inputs)

translated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print("Translated Text:", translated_text)

Output:

Translated Text: मैं एक सपना देखा है कि एक दिन इस राष्ट्र उठकर अपने धर्म - सिद्धांत के वास्तविक अर्थ से बाहर जी जाएगा।
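Translation models of this kind have a limited input length and tend to work best on sentence-sized inputs, so longer transcripts are often split before translation. The sketch below shows one naive way to do that with a regular expression. `split_sentences` is a hypothetical helper, and the heuristic will mis-split abbreviations such as "Dr." or "e.g."; a dedicated sentence segmenter would be more reliable.

```python
import re


def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace (naive heuristic)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


sample = "I have a dream. It is deeply rooted in the American dream! Will it come true?"
for sentence in split_sentences(sample):
    print(sentence)
```

Each sentence could then be passed through the tokenizer and translation model in turn, with the translated pieces joined by spaces before speech synthesis.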

Step 6: Text to Speech Translation

Load Hindi Text To Speech Model

Python

tts_processor = AutoProcessor.from_pretrained("facebook/mms-tts-hin")
tts_model = VitsModel.from_pretrained("facebook/mms-tts-hin")

Generate Speech from Hindi Text

This code converts the translated text into speech. The processor prepares the text for the model, the TTS model generates the audio waveform and the waveform is converted to a NumPy array. Finally, the audio is saved as a WAV file using the model’s sampling rate.

Python

inputs = tts_processor(text=translated_text, return_tensors="pt")

with torch.no_grad():
    output = tts_model(**inputs)

audio_output = output.waveform.squeeze().cpu().numpy()

sf.write(
    "translated_audio.wav",
    audio_output,
    tts_model.config.sampling_rate
)
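To confirm that a WAV file was written as expected, its header can be inspected with Python's built-in `wave` module. The sketch below assumes 16-bit PCM, which is what soundfile normally writes for `.wav` files by default. `wav_info` is a hypothetical helper, and the demo writes its own one-second silent file, `demo_silence.wav`, so that it runs without any model.

```python
import wave


def wav_info(path):
    """Return (sample_rate, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        frames = wav.getnframes()
        return rate, wav.getnchannels(), frames / rate


# Self-contained demo: write one second of mono silence, then read it back.
with wave.open("demo_silence.wav", "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)                     # 2 bytes per sample = 16-bit PCM
    wav.setframerate(16000)
    wav.writeframes(b"\x00\x00" * 16000)    # 16000 silent frames

print(wav_info("demo_silence.wav"))
```

Calling `wav_info("translated_audio.wav")` on the file produced in this step should report the TTS model's sampling rate, a single channel and a duration that roughly matches the length of the spoken sentence.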
