Count Smorltalk speculates on WER
“Well there you are, four candles!”
“No, fork ’andles! ’Andles for forks!”
If you haven’t ever seen the Two Ronnies sketch The Hardware Shop, do it now.
It immortalises the kind of chaos that ensues when one human says one thing and another human hears something else. Genius.
Of course, we interpreters listen more carefully than Ronnie Corbett. We take in the context and assess the probability of one thing over another. Interpreters wouldn’t get these things wrong now, would we?
To pick up on my earlier post on Artful Intelligence, it is common currency in interpreter/boffin discourse that interpreters are the gold standard against which computers are measured. I dared to suggest that it is a gold standard whose days might be more numbered than we like to think.
Today I’m going to take a closer look at the first bit of this whole interpreting malarkey: hearing. And I’m going to equate human hearing with the speech recognition component of an automated interpreting system. And for the purposes of this contest, I am going to use one simple metric: WER, that is to say Were Dare Aerate.
What? WHAT??? Were Dare Aerate? That can’t be right, that doesn’t make sense. And Were Dare Aerate would be WDA not WER…Sorry, just a sec, WER…WER…. Ahh! Bingo. Not Were Dare Aerate but Word Error Rate! Finally got it. Hang on, WAIT I missed the next sentence. STOP!
Let’s pull that apart a bit and see why I misheard Word Error Rate as Were Dare Aerate.
Most of us never pause to reflect on how we know where the boundaries are between words when listening to someone speak. They seem completely obvious. That’s down to an interplay between acoustics and semantics where our brains effortlessly attach meaning to the sounds we hear. Mostly there is no confusion because we’re used to attaching meaning to seemingly random noise. We use our knowledge of the world to do this. But if you look at an audio wave of someone speaking, you’ll see that there is no separation between individual words – it all runs together in a sort of fuzzy concertina. We have to understand language and we have to know what things are to be able to perform this amazing feat.
But what happens when we don’t know something is a thing? For example, until you started reading this did you know that WER is a metric used to measure the performance of speech recognition software? If you did, then it’s probably because you are vastly clever. Or possibly because you are interested in language processing. Or both. WER, being meaningless to the majority of us, would trip us up. WER? WER?? Did he mean WEU? Did he misspeak? Is it even English? If we hit WER whilst interpreting general speech we’d be a bit sunk. Choices are: leave it out, repeat it phonetically (if you can), go for the nearest logical thing, guess, say “um”, pause and hope it’ll come to you… Decisions, decisions. Careful not to miss the rest of the sentence.
Oh four fugs ache!
Come on, I hear you say, that’s not fair. People don’t just drop things like that into general speech. We need fair warning of WERs on the line. Yes, point taken. The time we’re most likely to encounter WER is whilst interpreting a conference on speech recognition, or at the Dragon Software Works Council. We’d have time to read up beforehand. A nice colleague might forward us their terminology list for the meeting. There it would be on page 2 under W. All we have to do now is remember we have the list, find the list on the desk or screen, scan down the list to the Ws, and hey presto. Or we could just be vastly clever and remember stuff like that. How difficult can it be?
Remember you are the gold standard. Be proud of that.
So that can happen if we don’t know a thing is a thing. What happens when we don’t know a thing has a word? For example, do you know the word nudiustertian?* God you are vastly clever! For the rest of us ignoramuses, we have a problem. New deo sturgeon? Nudy us tersh on? What??? And what to do with the sentence, “There was nothing left but a tittynope”?** A what?? A tit teen hope??
LEAVE IT OUT!
Back to WER. As I said, WER is a metric commonly used to measure the performance of speech recognition systems.
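For the numerically inclined, the arithmetic behind WER is simple: line up the machine’s transcript against a reference transcript, count the substitutions, deletions and insertions needed to turn one into the other, and divide by the number of words in the reference. Here’s a minimal Python sketch of my own (the function name and examples are mine, not from any particular toolkit):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = word-level edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                              # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                              # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# My earlier mishearing scores a perfect 100% error rate:
print(wer("word error rate", "were dare aerate"))   # → 1.0
# Ronnie Corbett does rather better, two words wrong out of four:
print(wer("the fork handles please", "the four candles please"))  # → 0.5
```

So a WER of 4.9% means that, on average, about one word in twenty comes out wrong.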
Let me first ask: is a Covid vaccine that is 95% efficacious OK? Pfizer, Moderna? Anyone up for that? Need something a bit better than 95%? What about speech recognition software that is 95% accurate? Good or bad?
So if you Google “Google Word Error Rate” you get a caption that proudly announces “Google CEO Sundar Pichai today announced that the company’s speech recognition technology has now achieved a 4.9 percent word error rate. Put another way, Google transcribes every 20th word incorrectly”. This is dated 2017.
Every 20th word wrong??? That’s why we’re the gold standard, not Google. That’s terrible.
In that same year IBM announced that they had set new performance records on CTS transcription. CTS? CTS?? What?? Got you again. CTS stands for Conversational Telephone Speech: recordings of spontaneous speech made over a telephone channel, complete with various channel distortions and a wide range of speaking styles. They said they had achieved a WER for CTS of 5.5%.
Nah, no good, I hear you say. Rubbish in fact. An interpreter making mistakes at that rate wouldn’t last a week.
I’ll leave you with this. IBM’s WER for Broadcast News was 5.9% on one benchmark in 2019. The human equivalent for Broadcast News was 2.8%.
Phew. We’re saved!!
Not so fast!
Ten years ago IBM’s Word Error Rate for Broadcast News was about 17%. And now it’s 5.9%.
I’ll just say that last bit again in case you missed it.
“Tenja sago I bee yams were dare aerate fob raw car sinews what’s up outs Evan teen purr scent an ow wits fife poy nigh purse end.”
From 17% to 5.9% in a decade? With that rate of progress, the human gold standard will have to be abandoned around 2030 – more or less at the same time the UK will outlaw the sale of new petrol and diesel cars.
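And that extrapolation isn’t plucked entirely from thin air. Here’s my own back-of-envelope version, assuming (a big assumption, granted) that the relative rate of improvement simply holds steady:

```python
import math

# Figures from the post: WER fell from 17% to 5.9% over ten years.
start, end, years = 17.0, 5.9, 10

# Constant annual multiplier: roughly a 10% relative improvement every year.
annual_factor = (end / start) ** (1 / years)

# How long until machine WER dips below the 2.8% human benchmark?
human = 2.8
years_to_parity = math.log(human / end) / math.log(annual_factor)

print(f"annual improvement factor: {annual_factor:.3f}")
print(f"years until the 2.8% human benchmark is matched: {years_to_parity:.1f}")
```

On those assumptions, the machines cross the human line about seven years after the 5.9% figure. Comfortably before anyone has to give up their petrol car.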
*nudiustertian = the day before yesterday
**tittynope = small amount of leftover food