A new voice-transcription program named Trint can listen to an audio recording or a video of two or more speakers engaged in natural conversation, then provide a written transcript of what each person said.
Trint’s technology is still nascent, but it could eventually give new life to vast swaths of non-text-based media on the internet, like videos and podcasts, by making them readable to both humans and search engines. People could read podcasts they lack the time or ability to listen to. YouTube videos could be indexed with a time-coded transcript, then searched for text terms. There are other applications too: Filmmakers could index their footage for better organization, and journalists, researchers, and lawyers could save the many hours it takes to transcribe long interviews.
As machine learning and automation technologies continue to transform the 21st century, voice recognition remains a pesky speed bump. Transcription in particular is a technology that some have spent decades pursuing and others deemed outright impossible in our lifetimes. While news organizations and social media outlets alike have invested heavily in video content, the ability to optimize those clips for search engines remains elusive. And with younger readers still preferring print to video anyway, the value of transcribed text remains high.
Based in London and launched in autumn 2016, Trint is a web app built on two separate but entwined elements. The company’s transcription algorithm feeds text into a browser interface for editing, which links the words in a transcript directly to the corresponding points in the recording. While the accuracy is hardly perfect (as Trint’s founders are the first to admit), the system almost always produces a transcript that’s clean enough for searching and editing. At roughly 25 cents per minute (or $15 per hour), Trint’s software-as-a-service costs a quarter of the $1-per-minute rate offered by competitors. There’s a reason Trint is so cheap: Those other services, like Casting Words and 3Play, use humans to clean up automated transcripts or to do the actual transcribing. Trint is all machines.
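To make the idea of a time-coded transcript concrete, here is a minimal sketch in Python. It is not Trint's actual data model or API, which the article does not describe; the `TimedWord` class, its fields, and the `find_term` helper are all hypothetical names chosen for illustration. The sketch assumes each transcribed word carries a start time and a speaker label, so a text search can return positions in the recording.

```python
from dataclasses import dataclass


@dataclass
class TimedWord:
    """One transcribed word (hypothetical structure, not Trint's format)."""
    text: str
    start: float   # seconds from the start of the recording
    speaker: str   # label for whoever said the word


def find_term(transcript, term):
    """Return the start time of every word that matches `term`.

    Matching ignores case and surrounding punctuation.
    """
    needle = term.lower()
    return [
        word.start
        for word in transcript
        if word.text.lower().strip(".,!?\"'") == needle
    ]


# A tiny made-up transcript with two speakers.
transcript = [
    TimedWord("Welcome", 0.0, "A"),
    TimedWord("to", 0.4, "A"),
    TimedWord("the", 0.5, "A"),
    TimedWord("podcast.", 0.7, "A"),
    TimedWord("Great", 1.6, "B"),
    TimedWord("podcast", 1.9, "B"),
]

# Each hit is a point in the recording a player could jump to.
hits = find_term(transcript, "podcast")
```

Because every word keeps its own timestamp, the same structure could serve both uses the article mentions: a search index over the text and an editor that jumps to the matching moment in the audio.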
Microsoft has released voice recognition toolkits for programmers to experiment with, and Google just last week added multi-voice recognition to its Google Home smart speaker. But Trint’s software was the first public-facing commercial product to serve this space.
According to lead engineer Simon Turvey, Trint users report an error rate of between five and 10 percent for cleanly recorded audio. Though this is close to the eight percent industry standard estimated last year by veteran Microsoft scientist Xuedong Huang, the Trint founders consider their product’s editing function to be their real competitive edge. Trint’s time-coded transcript and web-based editor allow users to quickly find and work on the quotes they need.
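The article does not say how these error rates are measured. A common yardstick in speech recognition is word error rate: the number of word substitutions, insertions, and deletions needed to turn the machine transcript into a correct reference, divided by the length of the reference. The sketch below assumes that metric; the function name and the sample sentences are illustrative only.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One wrong word out of ten in the reference.
reference = "the quick brown fox jumps over the lazy dog today"
hypothesis = "the quick brown box jumps over the lazy dog today"
rate = word_error_rate(reference, hypothesis)
```

By this measure, an error rate of five to 10 percent means roughly one word in every 10 to 20 needs correcting, which is where a time-coded editor earns its keep.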
Trint can currently understand 13 languages, including several varieties of English accents. Since it’s a cloud-based application, Trint’s voice transcription algorithm can be updated frequently to add new languages, new accents (Cuban-accented English is tough), and fresh batches of proper nouns.