It’s a fascinating and possibly pointless exercise, trying to work out how search engines work. Although this article was inspired by a news story on beating (so-called) plagiarism detectors, I found myself more interested in what the story told us about Google and (presumably) other search engines.
The story starts with an article in Hoax-Alert: Forget Russian Bots: Fake Native Americans Are Using Russian Characters To Avoid Fake News and Plagiarism Detectors. The story relates how a number of websites which appear to be promoted by Native Americans are in fact sites originating in Kosovo and other countries. It seems that they are stealing content, disguising it (to escape similarity detectors) and getting away with it. The way they disguise the content is to substitute Cyrillic characters which look like Latin alphabet characters in text, in order to beat text-matching software. The HoaxAlert story shows this illustration:
The only difference between the two sentences appears to be the lower-case “a” in “army” used in the first example and the upper-case “A” used in the second example. In fact, the top example substitutes Russian look-alike characters for some of the vowels.
HoaxAlert shows how this becomes apparent if you copy-paste the two sentences into Word and then change the font. Here are the two sentences, first the true version (the lower of the original sentences) and then the fake version:
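Changing the font is one way to expose the swap; another is to ask each character which alphabet it belongs to. The sketch below is only an illustration of that idea, not anything HoaxAlert or a detector is known to use. The function names and the sample strings are my own assumptions, and the script label is read off the start of each character's Unicode name, which is a rough shortcut.

```python
import unicodedata


def script_of(ch: str) -> str:
    """Return a rough script label taken from the character's Unicode name."""
    if not ch.isalpha():
        return "OTHER"
    name = unicodedata.name(ch, "")
    if name.startswith("LATIN"):
        return "LATIN"
    if name.startswith("CYRILLIC"):
        return "CYRILLIC"
    return "OTHER"


def mixed_script_words(text: str) -> list[str]:
    """List the words that contain both Latin and Cyrillic letters."""
    flagged = []
    for word in text.split():
        scripts = {script_of(ch) for ch in word}
        if {"LATIN", "CYRILLIC"} <= scripts:
            flagged.append(word)
    return flagged


# Illustrative samples (assumed, not copied from the HoaxAlert images):
# the second one uses Cyrillic "a" (U+0430) in place of the Latin "a".
genuine = "army helicopters"
disguised = "\u0430rmy helicopters"

print(mixed_script_words(genuine))
print(mixed_script_words(disguised))
```

The two printed words look identical on screen, which is the whole point of the trick; only the second list contains a flagged word.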
The notion is that this will beat Turnitin and other plagiarism detectors. HoaxAlert goes on to show how Google searches made by copy-pasting the two sentences yield very different results.
I’ll come back to that. Before I do, I want to take a closer look at the notion that one can beat Turnitin simply by replacing Latin characters with look-alike characters from a different alphabet. It’s an ingenious notion for beating Turnitin, granted, BUT… (1) there is nothing new in this notion and (2) Turnitin can (it claims) handle character substitution.
Here’s The Awl, in a post dated August 5, 2010
and here’s a YouTube video showing how to do it:
though it is worth noting that there is an on-screen warning in the first few seconds:
Turnitin confirms that they are wise to this trick too, in a blog post Can Students “Trick” Turnitin? dated 23 May 2013.
Of course, not all Turnitin’s claims are the truth, the whole truth and nothing but the truth – but in this instance I am prepared to believe them; they have spent a lot of time proclaiming this. (What’s more, the HoaxAlert post does not actually say that this trick will beat Turnitin. It claims it might beat some “plagiarism detectors.” None is mentioned by name.)
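How Turnitin actually handles character substitution is not described here. One plausible approach, offered purely as an assumption, is to fold look-alike characters back to their Latin counterparts before comparing texts. The table below is a small hypothetical sample of such pairs, and `fold_lookalikes` is an illustrative name, not a real Turnitin function.

```python
# Hypothetical look-alike table: Cyrillic letter -> the Latin letter it resembles.
# A real system would need a far larger table; this one is only a sample.
LOOKALIKES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0456": "i",  # CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
}
FOLD_TABLE = str.maketrans(LOOKALIKES)


def fold_lookalikes(text: str) -> str:
    """Replace known Cyrillic look-alikes with their Latin equivalents."""
    return text.translate(FOLD_TABLE)


# Illustrative sample: several vowels swapped for Cyrillic look-alikes.
disguised = "\u0430rmy h\u0435l\u0456c\u043ept\u0435rs"
original = "army helicopters"

print(disguised == original)                   # the raw strings differ
print(fold_lookalikes(disguised) == original)  # after folding they match
```

Once the text is folded in this way, ordinary text-matching can proceed as if the substitution had never been made.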
And so we come to Google.
The HoaxAlert post showed that when copy-pasting the Cyrillic version, Google came up with zero hits.
while a search for the true version found nearly half-a-million hits
Fair enough. There are very few words which use a mix of Cyrillic and Latin alphabets. Frankly, off the top of my head, I can’t think of a single one (apart from those used in the article) – and I’m pretty sure that you can’t either.
I did note the lack of quotation marks around either of HoaxAlert’s searches, and wondered how many hits would come up with quotation marks. Far fewer, I thought.
[Quotation marks are used to require that the search engine search only for the words used and in that particular order, the quotation word-for-word, as it were. When no quotation marks are used in a search phrase, the search engine will look for documents which use the words in the search-term but not necessarily adjacent or in that order. Any document which uses all those terms in any order will be found – thus the half-million hits.]
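The difference between the two kinds of search can be shown with a toy model. This is a deliberately simplified sketch and certainly not how Google works internally; the function names and the two sample documents are assumptions made up for the illustration.

```python
def tokens(text: str) -> list[str]:
    """Split text into lower-case words (a crude stand-in for real tokenising)."""
    return text.lower().split()


def matches_any_order(document: str, query: str) -> bool:
    """Unquoted search: every query word appears somewhere in the document."""
    doc_words = set(tokens(document))
    return all(term in doc_words for term in tokens(query))


def matches_exact_phrase(document: str, query: str) -> bool:
    """Quoted search: the query words appear adjacent and in the same order."""
    doc, phrase = tokens(document), tokens(query)
    return any(
        doc[i:i + len(phrase)] == phrase
        for i in range(len(doc) - len(phrase) + 1)
    )


# Two made-up documents that share the same vocabulary.
doc_a = "army helicopters are named after native tribes"
doc_b = "native tribes named their helicopters after the army"
query = "named after native"

for doc in (doc_a, doc_b):
    print(matches_any_order(doc, query), matches_exact_phrase(doc, query))
```

Both documents satisfy the unquoted search, but only the first contains the words as an unbroken phrase, which is why adding quotation marks cuts the number of hits so sharply.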
Here we go: copying the Cyrillic version and pasting it into the Google search box between quotation marks, I got “About 6 results.”
HoaxAlert got 0 results without quotation marks – but it is clear that the first hit is the HoaxAlert article (and there is no way in which Google could find this before HoaxAlert posted it, is there?) and the others are reposts and other versions of the HoaxAlert story.
Where it gets interesting is at the foot of the results page. At the top of the page, Google reports “About 6 results.” At the foot of the page there is a note:
In order to show you the most relevant results, we have omitted some entries very similar to the 11 already displayed.
If you like, you can repeat the search with the omitted results included.
I should have counted how many results were actually displayed, but I missed that. I was too eager to see how many results there were in total. Answer, about 20.
What about the sentence in Latin alphabet? I didn’t get half-a-million hits, I got “About 29 results.”
That’s not a surprise – I had used quotation marks; for this search, Google was searching for the exact sentence, not for any use of the words in any order.
Once again, there was a note at the foot of the page, and once again, the numbers did not quite match.
Once again, the number shown was only slightly more than the number of hits originally shown – and I am not convinced that all of them really do include the target sentence.
What if I don’t use quotation marks – replicating the HoaxAlert search?
Interesting … four words are underlined, as if they have been spelled wrongly.
Interesting … the results page shows that this is so…
I am asked:
Did you mean: Thе rеаsоn аrmy hеlіcоptеrs аrе nаmеd аftеr nаtіvе trіbеs wіll mаkе yоu smіlе
Google recognises that some of the words are problematic? But why not all of the words? HoaxAlert shows ALL the words include Cyrillic characters, not just these four.
I get the same result, using Gill Sans:
All the vowels come up as replaced. I couldn’t think of any words which use a mix of Latin and Cyrillic alphabets, but Google apparently can?
Just to round things off, how many hits do I get using the real sentence, the all-Latin version? HoaxAlert got just short of half-a-million. I got …
… “about 84 results.” What?
Sure enough, the clue is at the foot of the page, and again there is a mismatch between the number shown as found and the number claimed to be displayed:
In order to show you the most relevant results, we have omitted some entries very similar to the 97 already displayed.
If you like, you can repeat the search with the omitted results included.
And here we are, repeating the search, Latin alphabet, no quotation marks, “About 499,000 results” found.
That means “about” 498,900 results were not displayed because some part of the Google algorithm considers them “similar to the (results) already displayed.”
As I said, it’s probably pointless, trying to work out how search engines work – except we can be misled in oh so many ways if we accept what they tell us at face value and do not delve. How many of those “similar” results could be worth looking at more deeply if only we knew about them?
Given recent thoughts about needing to read laterally and think laterally when considering what we read, we do need to remind ourselves that we still do not know (for sure) how search engines decide which results to push to the top, how many they hide in the heap of results – and how many they don’t tell us about.
We just cannot afford to stop thinking, can we?
What comes to mind vis à vis students attempting to fool a plagiarism checker is, if they spent half as much time actually writing an essay properly, they would make far better use of their time. Interesting article, John. (With no letters substituted!)
Hasn’t that always been the way, Susan? At least, the way with students who set out knowing they are cheating and hoping to get away with it? They work really hard at taking short cuts!