3 February 2020

A tale told by an idiot: the trouble with speech recognition


I’ve spent much of the last year in my day job investigating the applicability of artificial intelligence. The question I’m always being asked is, ‘but does it work?’ The answer is… ‘sort of’.

The biggest promise of AI is cognitive machine intelligence. ‘Cognitive’ is a word that is used rather freely, even though the formal definition is quite strict. To be truly ‘cognitive’, machines should be adaptive (using changing data and responding to changing goals), iterative (asking questions and searching for new data), and contextual (understanding how meaning can change according to time, place and person).

The ultimate test of cognitive computing is the Turing Test – whether a machine can emulate a human purely on the basis of the answers it gives to any question. Now, anyone who has used a chatbot – that sort of artificial agent that you get stuck with when you have to deal with banks and utility companies – knows that they do not pass the Turing Test.

Speech recognition is the engine of the chatbot, and recognition technologies are among the most advanced AI applications today. But even the best chatbot still has a long way to go before it fully processes what you say. It cannot make the leap from sound to meaning.

That’s partly because the corporate chatbot is probably using technology from last year, or the year before – speech recognition technology is evolving very fast. Microsoft claims that its latest speech recognition applications can now interpret human speech with an error rate of only 5.1%, while Google claims 4.9%.

These performance figures are important, because speech recognition is a central plank of the AI proposition (and by the way speech recognition is not the same as voice recognition, which is about identifying individuals rather than spoken words).

Speech recognition is already a roughly $50 billion a year market according to Forbes, and it is also the essential component of a host of corporate governance applications – for example detecting whether employees are adhering to guidelines in trading, or selling, or in screening interactions with customers. It is also a prime tool for security services looking for leads in conversations of interest.

So back to the first question: does it work? Time for a reality check.

I decided to put the best current speech recognition system to a simple test. Simple, because I did not demand any understanding of contextual meaning from Google’s most advanced non-statistical ‘neural-only’ speech recognition system – all I asked of it was that it accurately transcribe very different speakers working from the same script.

First, I downloaded recordings of six speakers reciting the ‘Tomorrow and tomorrow’ soliloquy from Macbeth. For those unfamiliar, these are the Bard’s words:

Tomorrow, and tomorrow, and tomorrow,
Creeps in this petty pace from day to day,
To the last syllable of recorded time;
And all our yesterdays have lighted fools
The way to dusty death. Out, out, brief candle!
Life’s but a walking shadow, a poor player,
That struts and frets his hour upon the stage,
And then is heard no more. It is a tale
Told by an idiot, full of sound and fury,
Signifying nothing.

I used US professionals, US amateurs, UK professionals, UK amateurs, and a professional Chinese actor. I fed the recordings into the dictation tool that is hidden in Google Docs but which gives access to Google’s latest speech recognition technology.

Speech recognition works – but only on its own terms. Like so much automated intervention, it tells you what it thinks you want to know.

My first speaker was a professional US actor. Google reproduced his recital with something close to the accuracy claimed, and the errors were superficial orthographical ones. Then, the amateur US actor: here there was an error rate of more like 20%. Several words were misheard (‘scandal’ for ‘candle’, ‘poor Clare’ for ‘poor player’). The text was readable, and it’s easy to see what the algorithm intends – but only if you already know what the algorithm intends. This is the first warning flag that AI is strong on confirmation bias.

Then on to the Brits. As we know, the Americans love a British accent, but only up to a point. When a British amateur spoke Shakespeare’s words into the Google machine – well, there was a transatlantic situation. “A walking shadow, a poor player, that struts” became “a walking shadow of polo plaid struts …”

Hmm. Polo plaid struts are not so good.
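For anyone who wants to put a number on a mishearing like this, the conventional yardstick for speech recognition is word error rate: the count of substituted, dropped and added words divided by the number of words in the original script. The figures quoted in this piece are not broken down by method, so treat what follows as a minimal sketch under that assumption rather than a description of how Google, Microsoft or these tests were scored. The function names `words` and `wer` are illustrative, and the simple lower-casing and punctuation stripping is an assumption too.

```python
import re


def words(text):
    """Lowercase the text and keep only the words, dropping punctuation."""
    return re.findall(r"[a-z']+", text.lower())


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / words in the reference."""
    ref, hyp = words(reference), words(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edits needed to turn the reference words seen so far
    # into the first j words of the hypothesis (standard Levenshtein table,
    # kept one row at a time).
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # word dropped by the recogniser
                curr[j - 1] + 1,         # word added by the recogniser
                prev[j - 1] + (r != h),  # exact match, or a substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


script = "A walking shadow, a poor player, that struts"
heard = "a walking shadow of polo plaid struts"
print(f"{wer(script, heard):.0%}")  # prints 50%
```

On this eight-word fragment the recogniser swaps three words and drops one, which comes out at an error rate of 50%. A short fragment exaggerates the effect, so a fair comparison with the headline claims would score the whole soliloquy.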

Next, a real test: Sir Ian McKellen, one of the great Shakespeareans of the age and a voice known to America as well as to the world – but speaking in a seventeenth-century Warwickshire accent, something close to Shakespeare’s own speech.

“Turbina!” the Google transcription begins. And it ends: “This hard but no more, it is a tale told by an idiot, signifying battery counts.”

And finally on to the Chinese actor (an actor with a far from strong Chinese accent – if anything the speaker sounded as if an English public school was a big part of his story). Google’s transcription was no more than 20% accurate, and this is how Google heard Macbeth wrapping things up:

“I’m not an exclusive way to dusty death, Ford Kendall lice, but a walking shadow, poems by an idiot, free stickers.”

Free stickers sound good, although maybe not in this context. AI really needs to deliver rather more if it is to usher us into the fourth industrial age.

These were simple tests. They did not look at how speech recognition is used in organisational settings, or how it could be tuned to particular situations. Like other AI applications, speech recognition is only a tool. But tools need to be sharp and this one seems somewhat blunt, considering the ambitious claims that are made for AI recognition technologies.

Other AI applications also come to mind. Companies and consultants talk a lot about the age of ‘cognitive computing’, but it is up for debate whether artificial intelligence that passes the cognitive tests actually exists. Most of what is used in the real world consists of things like billing and claims processing, audits and logistics design (essentially robotic process applications), and machine learning that comprises statistical analysis, fraud detection, spam filters and recommendation software. These are all a level below cognitive computing.

In short, the machines are not autonomous and contain the human biases of their makers. Google speech recognition speaks Silicon Valley, and if you also speak Silicon Valley it will recognise what you say.

But if you are William Shakespeare you might find you still need to write it all down in capital letters for the foreseeable future, if not to the last syllable of recorded time.


Richard Walker is a journalist and communications adviser to financial companies