A deep neural network architecture that can translate speech in one language directly into text in another is being developed by Google researchers.

The study, titled Sequence-to-Sequence Models Can Directly Transcribe Foreign Speech, describes using a modified sequence-to-sequence model, which has had previous success in speech recognition, to create a powerful encoder-decoder network for machine translation.

The paper explains that the new model does not explicitly transcribe the speech into text in the source language, nor does it require supervision from the source language transcription during training.

In testing, the research team reported ‘state-of-the-art performance’ on conversational Spanish-to-English speech translation tasks. The experiments used the Fisher Callhome Spanish-English dataset and found that the proposed model could outperform cascades of separate speech recognition and machine translation systems.

On the BLEU (bilingual evaluation understudy) metric, which evaluates the quality of machine-translated text by comparing it with reference translations, the proposed system scored 1.8 points higher than other translation models.
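To give a sense of what a BLEU point measures, the sketch below computes a simplified, sentence-level BLEU score against a single reference translation. It is an illustration only, not the evaluation code used in the study: the function names `ngrams` and `bleu` are hypothetical, tokenisation is a plain whitespace split, and no smoothing is applied. Published results are typically computed at corpus level, often against several references.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Simplified single-reference, sentence-level BLEU on a 0-100 scale."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clipped matches: a candidate n-gram is credited at most as many
        # times as it occurs in the reference.
        overlap = sum(min(count, ref_counts[gram])
                      for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # assumption: no smoothing in this sketch
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: candidates shorter than the reference are penalised.
    if len(cand) >= len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions, scaled to 0-100.
    return 100 * brevity_penalty * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    print(bleu("the cat sat on the mat", reference))  # identical sentences
    print(bleu("the cat sat on a mat", reference))    # one word differs
```

A translation identical to the reference receives the maximum score, while changing a single word lowers the precision at every n-gram length. That is why gains of a point or two on a full test set are usually treated as meaningful.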

According to the study, using Spanish transcripts as extra supervision during training, with independent automatic speech recognition (ASR) and speech translation (ST) decoders, yielded further improvements of at least 1.4 BLEU points.

In future work, the Google researchers plan to construct a multilingual speech translation system in which a single decoder is shared across multiple languages.