Last year, IBM announced a major milestone in conversational speech recognition: a system that achieved a 6.9 percent word error rate. Since then, the company has continued to push the boundaries of speech recognition, and has now reached a new industry record: a word error rate of 5.5 percent.

This was measured on a very difficult speech recognition task: recorded conversations between humans discussing day-to-day topics like “buying a car.” This recorded corpus, known as the “SWITCHBOARD” corpus, has been used for over two decades to benchmark speech recognition systems.
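
For readers unfamiliar with the metric, word error rate is conventionally the number of word substitutions, deletions and insertions needed to turn the system's transcript into the reference transcript, divided by the number of words in the reference. The Python sketch below illustrates that standard definition only; it is not IBM's scoring pipeline, which the announcement does not describe, and the function name and example sentences are made up for illustration. Real benchmark scoring also normalizes the text (casing, punctuation, hesitations) before counting, which this sketch skips.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative helper (hypothetical name); assumes whitespace-separated
    words and performs no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one substitution ("about" -> "of") and one deletion ("a")
# against a 7-word reference gives 2 / 7, roughly 28.6 percent.
wer = word_error_rate(
    "i am thinking about buying a car",
    "i am thinking of buying car",
)
print(f"{wer:.1%}")
```

On this scale, a 5.5 percent word error rate corresponds to roughly one word error for every 18 words spoken.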

To reach this 5.5 percent breakthrough, IBM researchers focused on extending the application of deep learning technologies. They combined LSTM (Long Short-Term Memory) and WaveNet language models with three strong acoustic models. Of the acoustic models, the first two were six-layer bidirectional LSTMs: one takes multiple feature inputs, while the other is trained with speaker-adversarial multi-task learning. The third model is distinctive in that it learns not only from positive examples but also from negative examples, so it gets smarter as it goes and performs better where similar speech patterns are repeated.

Reaching human parity, meaning an error rate on par with that of two humans speaking, has long been the ultimate industry goal. Others in the industry are chasing this milestone alongside IBM, and some have recently claimed that reaching 5.9 percent is equivalent to human parity. However, in the course of reaching the 5.5 percent milestone, it was determined that human parity is actually lower than what anyone has yet achieved: 5.1 percent.