Most people who use speech recognition services either love them or hate them - there tends to be no middle ground. Where users are given no choice - many banks, for example, will not let you speak to a human until you have spent 20 minutes getting nowhere with their voice assistant - tolerance is understandably lower. But even among those who use a speech assistant by choice - Alexa or Siri, for example - there is often a similar spread of opinion. And Siri and Alexa are just two high-profile systems that deploy automatic speech recognition (ASR) in home automation; there are many more ASR applications and cloud services used by other service providers.
The problem with ASR is that there is rarely any measurement of performance available - i.e. how much of what you say does it correctly recognise? You would not buy a motor car without knowing its fuel consumption first, yet if a company decides to offer ASR interaction to its customers, there is very little information on which to judge whether it will enhance customer satisfaction. Sadly, many companies offering speech as a method of interaction simply don't care. A human operator will cost them £50 per hour, whereas a voice-operated bot will cost them a few pounds per hour. They don't care if the customer spends 20 minutes inputting information that would take a human 2 minutes: it's the customer's time that is being wasted, not theirs.
A further complication is the unfortunate trend of confusing voice recognition with speech recognition, a trap that even some journalists and service providers fall into. Speech recognition is about establishing what someone says, voice recognition is about who is saying it. Getting 90% of speakers’ identities correct is not the same as getting 90% of the words they say correct.
That said, ASR performance is genuinely difficult to measure, because it depends on so much: the quality of the speech, the communication channel, the pronunciation and motivation of the speaker, the size of the vocabulary, the language of use - the list is endless. While one highly motivated speaker using a high-quality microphone may achieve 95% recognition accuracy, the same phrase uttered during a telephone call may see only 20% of the words recognised.
Add to that the fact that ASR service providers are reluctant to provide any sort of performance measurement when they don’t have to. For them, it is much better to let the prospective customer assume the best.
Faced with this issue while developing a call transcription service, we have always felt it is important to establish customers' expectations before any commitment to ASR is made. It's all very well saying "if you don't like it, you can have your money back", but that does not compensate for the time and money spent putting the infrastructure in place before call transcription can start improving the bottom line. Used in the proper way, the benefits of ASR can be massive. This is all the more important where the customer has a choice of ASR engines, each with different trade-offs between cost, speed and accuracy.
Given the dearth of information from the ASR service providers and the fact that the performance of ASR is so variable, we devised a framework in which prospective customers can test the performance of ASR with:
- Their choice of ASR services
- Their own staff
- Their own telephony equipment
- Their own dialogues
To make things easier, we provide a standard dialogue in 5 European languages which prospective customers can recite and get an exact measurement of the words correctly and incorrectly recognised. That will not necessarily reflect the operational performance, but it does avoid wasting time and money on a technology not yet fit for their purpose.
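For illustration, the scoring behind such a measurement can be sketched in a few lines of Python. This is not the framework's actual code: it is simply the standard word error rate (WER) calculation, an edit-distance alignment between a reference transcript and an ASR hypothesis, with function names and example sentences invented for the sketch. Word accuracy is then 1 minus WER.

```python
# Illustrative sketch only: word error rate (WER) via Levenshtein alignment
# between a reference transcript and an ASR hypothesis. Names are ours,
# not part of any vendor API.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()

    # dp[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i          # delete all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j          # insert all j hypothesis words

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]               # match
            else:
                dp[i][j] = 1 + min(dp[i - 1][j - 1],      # substitution
                                   dp[i - 1][j],          # deletion
                                   dp[i][j - 1])          # insertion
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)


if __name__ == "__main__":
    ref = "please transfer fifty pounds to my savings account"
    hyp = "please transfer fifteen pounds to my savings"
    wer = word_error_rate(ref, hyp)
    print(f"WER: {wer:.1%}, word accuracy: {1 - wer:.1%}")  # WER 25%, accuracy 75%
```

Note that because insertions count as errors, WER can exceed 100% on a very noisy recognition, which is one reason a controlled dialogue with a known reference transcript makes the comparison fair.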
An interesting spin-off from this work is that it gives us the ability to see how ASR systems are evolving and improving with time. We can submit recordings made months or even years ago and compare the performance now with the performance then.
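In practice that comparison amounts to re-scoring archived audio against its stored human reference and watching the figure move over time. A minimal sketch follows, reusing the word_error_rate function above; the recognise call and the file layout are placeholders for whichever ASR service and storage scheme are actually in use.

```python
# Hypothetical sketch: re-score archived recordings to track how an ASR
# engine's accuracy changes over time. `recognise` stands in for the cloud
# ASR client under test; the file naming convention is invented here.

import glob

def recognise(audio_path: str) -> str:
    """Placeholder for a call to the ASR service being evaluated."""
    raise NotImplementedError

def accuracy_by_recording(recordings_dir: str) -> dict[str, float]:
    """Word accuracy for each archived recording, keyed by audio file path."""
    results = {}
    for audio_path in sorted(glob.glob(f"{recordings_dir}/*.wav")):
        # Each recording is assumed to have a human reference transcript
        # stored alongside it, e.g. call_2021-03-04.wav + call_2021-03-04.txt
        ref_path = audio_path.replace(".wav", ".txt")
        with open(ref_path) as f:
            reference = f.read()
        hypothesis = recognise(audio_path)
        results[audio_path] = 1.0 - word_error_rate(reference, hypothesis)
    return results
```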
It also shows the importance of context in ASR. People are less tolerant of ASR systems when they compare them to humans, because they do not realise that the computer often has none of the other clues we humans use to recognise speech. In terms of pure word recognition rates, computers are often better than humans; what computers lack is knowledge of the subject, the speaker, the language and a plethora of other contextual information we take for granted. In short, computers are good at recognising words and humans are good at understanding sentences. Our framework extracts context from other kinds of digital message, such as emails, to improve ASR performance, and it is only by comparing a machine transcription with a human transcription that the effect of this can be properly assessed.
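As a simplified illustration of the idea (not the framework's actual mechanism), one way to exploit such context is to harvest domain terms from a customer's emails and supply them to the recogniser as biasing phrases; many cloud ASR services accept hints of this sort, though under different names and APIs. Everything here, including recognise_with_hints, is a hypothetical placeholder.

```python
# Simplified illustration of context biasing: harvest frequent domain terms
# from a customer's emails and pass them to the recogniser as hint phrases.
# `recognise_with_hints` is a placeholder; real services expose this idea
# under names such as phrase lists or speech contexts, with differing APIs.

import re
from collections import Counter

# A tiny stop-word list for the sketch; a real system would use a fuller one.
COMMON_WORDS = {"the", "and", "for", "you", "with", "that", "this", "have"}

def hint_phrases_from_emails(emails: list[str], top_n: int = 50) -> list[str]:
    """Pick the most frequent uncommon words as candidate hint phrases."""
    counts = Counter()
    for body in emails:
        for word in re.findall(r"[A-Za-z']{3,}", body.lower()):
            if word not in COMMON_WORDS:
                counts[word] += 1
    return [word for word, _ in counts.most_common(top_n)]

def recognise_with_hints(audio_path: str, hints: list[str]) -> str:
    """Placeholder for an ASR call that accepts biasing phrases."""
    raise NotImplementedError
```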
So the answer to the question "How good are Siri and Alexa?" is "it all depends!". Potential users need to compare ASR engines in their own environment to decide.
If you are interested in the science and technology behind our speech recognition performance measurement framework, have a read of our white paper on the subject: "A Framework for automatically measuring performance of Automatic Speech Recognition Systems".