Count Smorltalk cogitates on the machinery of translation
Je pense, donc je suis. These famous words, penned by Descartes in his 1637 Discours de la Méthode, are a cornerstone of Western philosophy: the fact of thought means that the thinker is real. The phrase was later rendered in Latin as Cogito ergo sum. In English, “I think, therefore I am” seems like a perfectly reasonable translation. But why not “I am thinking, therefore I exist”? Same sort of thing, but different.
Two translations of the same thing. Is one better than the other? If humans are going to remain the gold standard for interpretation, with computers destined to bring up the rear, we’re going to need a metric to prove our superiority. So how do we do that?
This is all part of the investigation I started in my earlier posts on AI and interpreting (‘Artful Intelligence’ and ‘Were Dare Aerate’). My conjecture is that human interpreters may not be as secure as we like to think. In today’s post I turn my attention to machine translation, this being the meat in the automated interpretation sandwich.
“I think, therefore I am” is fine. It’s the standard translation. “I am thinking, therefore I exist” is also fine. It’s the version chosen in an “authoritative and comprehensive” English translation by John Cottingham, published by Cambridge University Press in 1991. So I guess we’d say that if our translation metric were scored from 0 to 100, with 100 being best, each of these would get a score of 100.
“I lay an egg, therefore I am” (Je ponds, donc je suis) would be a clear mistranslation. One part of the expression is wrong. But one part is right. What’s the score for that? 50? But the trouble is that the whole of the idea is corrupted. So what then? 0?
The trouble with metrics for translation evaluation is that the best way to score one translation against another is to take an experienced translator and ask them to compare and award a score. That’s fine at school and university, but it’s not fine when you have Google Translate churning out translation by the ton. So machine translation researchers have come up with a range of automated metrics, such as BLEU, NIST, METEOR, and TER, which use computing power to calculate a score with an algorithm. New, ever more complex metrics are created all the time; there’s even one called RATATOUILLE.
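To make the idea concrete, here is a toy sketch of the BLEU approach: count how many n-grams (word sequences of length 1 to 4) the candidate shares with a reference, take a geometric mean, and penalise output that is too short. This is a caricature, not the official corpus-level BLEU with its standard smoothing schemes; the toy_bleu function and the add-one smoothing are my own shorthand for illustration.

```python
from collections import Counter
import math

def toy_bleu(candidate, reference, max_n=4):
    """Toy sentence-level BLEU: a geometric mean of smoothed n-gram
    precisions, times a brevity penalty. A sketch of the idea only,
    not the official corpus-level BLEU."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # Clipped overlap: a candidate n-gram counts only as often as it
        # appears in the reference.
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        precisions.append((overlap + 1) / (total + 1))  # add-one smoothing
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: don't reward output for being suspiciously short.
    brevity = min(1.0, math.exp(1 - len(ref) / max(len(cand), 1)))
    return 100 * geo_mean * brevity

print(round(toy_bleu("I think , therefore I am", "I think , therefore I am")))       # 100
print(round(toy_bleu("I lay an egg , therefore I am", "I think , therefore I am")))  # 47
```

And there is the rub from the egg example above: the clear mistranslation scores about 47 out of 100, because surface word overlap is all the algorithm sees. The corrupted idea, the thing a human marker might fail outright, is invisible to it.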
But these metrics still need a reference translation to compare with and that’s where the problems start because for most of the world’s utterances there are no reference translations. So, measuring quality in translation is difficult and the scores attained by machine translation need to be viewed critically.
In the findings of the 2019 Conference on Machine Translation (WMT19), held alongside the Association for Computational Linguistics annual meeting, German-to-English machine translation was said to be “tied with human performance”. English-to-German machine translation by Facebook-FAIR “achieves super-human translation performance”. Several other English-to-German systems are “tied with human performance”. Of course, this is the industry choosing its own metrics and vaunting its own performance, so these findings need to be taken with a pinch of salt. In 2020, a paper posted on arXiv (the preprint server hosted by Cornell University) disputed some of the claims made in the WMT19 findings, though it did concede that the claim of human parity for English-to-German translation held up. If that is right, then the best machines are now a match for humans translating news from English text into German text.
So, in some situations, for some language pairs, computers can already match humans at translation. Now add the fact that a computer translates a page of text in seconds or less, whereas we humans need minutes or more.
But really, I’m barking up the wrong tree here, because interpreters do not, of course, produce translations; we produce interpretations. When our output is webcast, as is increasingly the case, it comes with a disclaimer along the lines of “The interpretation does not constitute an authentic record of proceedings… Only the original speech or the revised written translation of that speech is authentic.” That’s a way of saying the interpretation may be wrong. Interpretation isn’t usually subject to quality controls of any kind, so, frankly, nobody really knows. The only real evidence that something has gone wrong is a delegate complaining. When the interpreters are experienced this mostly doesn’t happen, but that is not the same as saying the interpretation is necessarily of high quality.
The assumption is that human interpretation is the gold standard, but we simply don’t know how good or bad we are. My conjecture is that sometimes, in some situations, even the best of us are not actually perfect.
But hey, it’s interpretation, you say. It’s not supposed to be perfect; it’s supposed to help people understand each other. Well, the bad news is that computers aren’t perfect either. The difference is that we are quantifying how good or bad computers are, with WER (word error rate) metrics for speech-to-text and quality metrics for machine translation, so we have some idea of their level. We are not yet in a position to compare automated interpretation output with human interpretation output in the same way.
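WER, at least, is easy to pin down: it is the word-level edit distance (substitutions, insertions, and deletions) between what was said and what the machine transcribed, divided by the number of words in the reference. A minimal sketch, assuming plain whitespace tokenisation and made-up example sentences:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by the
    number of words in the reference transcript."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Dynamic-programming table of edit distances between prefixes.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("I think therefore I am", "I sink therefore I am"))  # 0.2
```

One substituted word in a five-word sentence gives a WER of 0.2. There is no number like that which we routinely attach to a human interpreter’s output.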

The proof of the pudding will be in the eating. In other words, one day a commercial undertaking will put an automated interpreting product on the market that is deemed, by users, to be good enough. No metrics required. How’s the pudding? Yeah, not bad actually.
I think, therefore I am an interpreter. That appears to be our motto. The corollary, computers don’t think, therefore they are not interpreters, also seems to be accepted as fact. But what is thinking? Deep-learning architectures such as the deep neural networks used in machine translation engage in self-supervised learning. This throws overboard the long-standing assumption that humans need to teach computers by labelling everything before they can get a grasp of how sentence structure works. It turns out computers are better left to work it out for themselves. This way of going about things is much closer to natural language acquisition in children.
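Self-supervision in miniature: nobody labels anything; the “answers” are simply the next words in the raw text itself. Real systems are deep neural networks trained on billions of sentences, so this little bigram counter is only a caricature of the principle, with a made-up two-sentence corpus:

```python
from collections import defaultdict, Counter

# The training signal comes from the text itself: for every word,
# the "label" is just the word that happens to follow it.
corpus = "je pense donc je suis . i think therefore i am .".split()

next_word = defaultdict(Counter)
for current, following in zip(corpus, corpus[1:]):
    next_word[current][following] += 1  # no human annotation anywhere

# The model has picked up a scrap of structure on its own:
print(next_word["je"].most_common())  # [('pense', 1), ('suis', 1)]
```

Scale that trick up from counting bigrams to predicting words with a billion-parameter network and you have, in spirit, the self-supervised learning described above.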
Did you notice the expression super-human performance above? Yes, that’s when a computer outperforms a human. I think, therefore I am human. I compute, therefore I am super-human?