The Voynich Manuscript is one of the most famous mysteries in the world. It’s a book from the 15th century, but no one has been able to identify what language it’s written in, or even what alphabet it uses. So many crazy theories have been proposed that one writer invented the Voynich Bullshit Index to score them. Of course, I haven’t solved the mystery, but I’ve spent a few weeks thinking about it over the last couple of years.
After weighing the evidence, it seems extremely likely that the Voynich is simply written in an unknown natural language, rather than a cipher, a code, or more exotic options listed by Wikipedia. The first major reason is that Voynich writing passes most known statistical tests for natural languages, such as Zipf’s Law. Since Zipf’s Law wasn’t discovered until the 20th century, it would have been impossible to deliberately fake. The second reason is prior probabilities: the number of manuscripts written in languages that now can’t be read (such as Etruscan or Linear A) is pretty large, while the number of manuscripts written entirely in ciphers is very small. The third major reason is one of information asymmetry. In 2015, cryptography is vitally important to the world economy; hence, we know far more about cryptography (and associated disciplines like steganography) than the ancients did. On the other hand, since lost languages are unimportant economically, very little is known about many of them, and what information exists is usually locked up in obscure manuscripts that aren’t available online, whereas a native speaker would obviously know their language well. The Voynich is very mysterious to us, but probably not mysterious to its writer, who (from handwriting analysis) is known to have written it quickly and fluidly; hence, the information asymmetry matches a natural language and does not match a code. This paper summarizes some of this evidence, and concludes from machine learning analysis that the Voynich is most likely written in an abjad, a script that records consonants but not vowels (like Arabic or Hebrew).
My own best guess is that the Voynich is written in the Cuman language. To the best of my and Nick Pelling’s knowledge, no one has ever proposed this theory, which seems shocking considering the sheer extent of Voynich hypotheses (this long page just lists some of the more popular ones). Cuman is, by the standards of lost languages, quite well-understood; it has a Wikipedia page, substantial surviving literature, and it’s very clearly related to modern languages like Kazakh. That seems like a strong indicator that this class of theories is under-explored. If nobody’s thought of Cuman before, there are surely many other less-known languages that haven’t been looked at either.
Evidence in favor of Cuman:
– Cuman, unlike almost every language spoken in Europe, is non-Indo-European. (It’s related to Turkish, Mongolian, and Kazakh.) This would explain the Voynich’s lack of typical Indo-European language features.
– Cuman (like Turkish and Manchu, another proposed language) employs vowel harmony, which several have observed in the Voynich glyphs.
– The Voynich’s first known owner was Emperor Rudolf II, who was also king of Hungary. Cuman was spoken widely in Hungary in the early 1400s (per Wikipedia, the last known Cuman speakers died in Hungary in the 18th century). The Golden Horde used Cuman extensively, and it was spoken widely in the area they conquered (eastern Europe through central Asia) during the 14th century. Hence, it makes geographical sense in the relevant time period.
– Like many central Asian languages of the time, Cuman appears to have lacked a written script. We know the 14th century Church wrote dictionaries for translating it into Latin, to help convert the Cumans to Catholicism. Hence, it makes sense that a script would be invented for it.
– Computer analysis of the letter frequency distribution of the Voynich shows the best match to Voynichese is Moldavian (Moldavia, next to Hungary, was also home to Cumans), followed by two other languages of the former Golden Horde area (eastern Europe and what is now Kazakhstan).
“After weighing the evidence, it seems extremely likely that the Voynich is simply written in an unknown natural language, rather than a cipher, a code, or more exotic options listed by Wikipedia. The first major reason is that Voynich writing passes most known statistical tests for natural languages, such as Zipf’s Law.”
Many encoding schemes for a language that followed Zipf’s Law would also follow Zipf’s Law. For a trivial example, rot13.
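The rot13 point can be checked directly. Below is a minimal Python sketch, assuming a made-up sample sentence (not Voynich text) and treating the "rank-frequency curve" simply as the sorted list of word counts; the helper names are illustrative only.

```python
import codecs
from collections import Counter

def rank_frequencies(tokens):
    """Return token counts sorted from most to least common (the rank-frequency curve)."""
    return sorted(Counter(tokens).values(), reverse=True)

def rot13_words(text):
    """Apply rot13 to the text and split the result into words."""
    return codecs.encode(text, "rot_13").split()

# Made-up sample sentence, purely for illustration.
sample = "the cat and the dog and the bird saw the cat"
plain_words = sample.split()
cipher_words = rot13_words(sample)

# The words themselves change, but the rank-frequency curve does not,
# because rot13 maps each distinct word to exactly one distinct ciphertext word.
print(plain_words[:3], rank_frequencies(plain_words))
print(cipher_words[:3], rank_frequencies(cipher_words))
```

The same holds for any one-to-one substitution, since it only relabels the tokens being counted.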
“The second reason is prior probabilities: the number of manuscripts written in languages that now can’t be read (such as Etruscan or Linear A) is pretty large, while the number of manuscripts written entirely in ciphers is very small.”
That seems like way too broad a reference class. As far as I know, the number of manuscripts in cipher is much higher than the number of manuscripts in an otherwise unknown language we don’t even recognize.
Regarding the Cuman hypothesis, why would Cumans be drawing pictures of nonexistent plants?
Re: the first point, a scheme like rot13, Caesar cipher, or simple substitution ciphers in general wouldn’t make the manuscript any harder to read. The Voynich is already in an unknown alphabet, so if the “plaintext” is glyph 1 = A, glyph 2 = B, etc., the “ciphertext” would just be (e.g.) glyph 1 = N, glyph 2 = O, etc., which from our perspective looks exactly the same. By Occam’s Razor, we can assume that such a cipher wasn’t used, because if it was there’d be no way to tell, even after the manuscript was fully decoded. Any cipher that we’d have to worry about cracking must be a) more complex than a simple substitution cipher, and b) not break Zipf’s Law or other statistical features of natural language, and c) usable without computers or 20th-century math, and d) fast enough to write an entire book in (just copying an existing book written in Latin took months before printing presses), and e) not cracked by any of the 20th-century cryptographers who’ve looked at the Voynich. The combination of all five seems extremely unlikely.
Re: the second point, are there any book-length documents from before 1800 written entirely in code? There are cases like the Babington Plot of individual letters or short messages being encoded, but the Voynich is way longer than that at ~35,000 words.
Re: the third point, why would anyone else be drawing pictures of nonexistent plants? We know that someone drew them 🙂
Consider for example a book code, where you list all the words in (say) the Bible in order, then match them to the list of the most common English words in order of use. Letters will obey a Zipfian distribution, words will obey a Zipfian distribution, but it’s still very hard to break.
Voynich probably isn’t that, since frequency analysis would let you decode the letters into obvious Bible words, but I’m just saying that even some very complicated codes can maintain distributions even if you’re not optimizing for those.
Actually, it would be interesting to check whether conlangs follow Zipf’s Law. I bet they would, especially if the conlanger just makes it “sound” like their own language.
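One rough way to run such a check on any tokenized text, conlang or otherwise, is to fit a straight line to log(frequency) against log(rank) and see whether the slope is near -1. The Python sketch below is only that: a least-squares slope, with a hypothetical function name, and not a rigorous statistical test of Zipf’s Law.

```python
import math
from collections import Counter

def zipf_slope(tokens):
    """Least-squares slope of log(frequency) against log(rank).

    A slope near -1 is the classic Zipfian signature; this is a rough check only.
    """
    counts = sorted(Counter(tokens).values(), reverse=True)
    if len(counts) < 2:
        raise ValueError("need at least two distinct tokens")
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance

# Synthetic tokens whose counts are exactly proportional to 1/rank (120, 60, 40, 30, 24).
synthetic = []
for rank in range(1, 6):
    synthetic.extend([f"w{rank}"] * (120 // rank))
print(zipf_slope(synthetic))
```

A real comparison would feed in the word list of a conlang corpus and of a natural-language corpus of similar size, since the fitted slope is sensitive to sample size.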
Re: the third point. I feel like the sort of person who makes up nonexistent plants is the same sort of person who might make up their own language. We know that the author was doing something weird. The attractiveness of the Cuman theory is that it seems like it takes away the weirdness (some guy wrote a book in his own language, making up a script for it the same way the Cherokee did, what’s so weird about that?) but it actually still leaves a lot of the weirdness intact.
Why would monks be drawing pictures of nonexistent animals? Well, they thought they were drawing (for example) lions, but they didn’t know what lions looked like. There hadn’t been any lions around for a while.
Maybe the Voynich manuscript is some Cumans sitting down, going “gee, you know, we ought to record what things were like in the homeland”, and realizing they have no idea what the plants there look like.
The general point that it is probably a natural language near Europe that didn’t have an established script is pretty good, and Cuman is a pretty good candidate for that.
The examples of Linear A and Etruscan are very old. Are there recent examples where people were surprised to discover a script for a language that they thought lacked one? Why doesn’t the Codex Cumanicus mention the script? Did the authors not know about it? This seems to me like a strike against Cuman.
You mention suggestions that the Voynich script is an abjad and that it has vowel harmony. I don’t know much about these structures, but they do not seem to me to be compatible claims about a script.
The last item does not seem to me like much evidence. (Most of these complaints apply to the paper and not just to your application of it to Cuman.) The property is not stable between related languages, as seen in the fact that the top-scoring languages are unrelated. If there were several Turkic languages at the top, that would be evidence for a Turkic language. And if it’s not stable between Turkic languages, is there reason to believe Cuman would score highly? It can’t hurt to try it, though. Suggesting that Moldavian and Kabardian were influenced by Cuman seems like grasping at straws. And the method seems based on assuming that the writing system is not an abjad; i.e., if you used this method to compare a language written in an abjad to the same language written in an alphabet, it would declare them very different.
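The abjad objection can be illustrated with a toy comparison. The Python sketch below is not the paper’s method (which isn’t reproduced here); it assumes a plain letter-frequency distribution, uses total variation distance as a stand-in similarity measure, and imitates an abjad by deleting the five Latin vowel letters from an English sentence.

```python
from collections import Counter

VOWELS = set("aeiou")  # assumption: Latin-alphabet text with five vowel letters

def letter_distribution(text):
    """Map each letter to its share of all letters in the text."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    total = len(letters)
    if total == 0:
        return {}
    return {ch: count / total for ch, count in Counter(letters).items()}

def strip_vowels(text):
    """Crude abjad imitation: drop every vowel letter."""
    return "".join(ch for ch in text if ch.lower() not in VOWELS)

def total_variation(p, q):
    """Total variation distance between two letter distributions (0 = identical)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

sample = "the quick brown fox jumps over the lazy dog"
full = letter_distribution(sample)
abjad_like = letter_distribution(strip_vowels(sample))

# Same language, same sentence, but the two spellings are far apart.
print(total_variation(full, abjad_like))
```

For vowel stripping specifically, this distance works out to the share of letters that are vowels, so the same text written two ways can look as different as two unrelated languages.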
PS – It doesn’t affect your argument, but Turkic and Mongolic are not considered to be related by mainstream historical linguistics. Though the micro-Altaic hypothesis that they are related is probably the most conservative extension.
Thanks for your comments!
Is this theory consistent with all the word-level repetitions in the text?
“The second reason is prior probabilities: the number of manuscripts written in languages that now can’t be read (such as Etruscan or Linear A) is pretty large, while the number of manuscripts written entirely in ciphers is very small.”
The number of manuscripts in unreadable languages is not the same thing as the number of unreadable languages used in manuscripts. Unless other documents are found that seem to be written in the same language as the Voynich Manuscript, you want to compare to the latter, not the former.
Good day!
My name is Nikolai.
On the question of the key to the Voynich manuscript:
Today I have the following to add on this matter.
The manuscript is written not with letters, but with signs that stand for the letters of the alphabet of one of the ancient languages. Moreover, the text has 2 more levels of encryption, which virtually eliminate the possibility of computer-assisted translation, even after the signs are replaced with letters.
I have worked out the key, with which I was able to read the following words in the first section: hemp, hemp clothing; food, food (sheet 20 in the numbering used on the Internet); cleaned (intestines), knowledge, perhaps desire, to drink, a sugary drink (nectar), maturation (maturity), to consider, to think (sheet 107); drink; six; flourishing; growing; rich; peas; sweet drink, nectar; and others. These are only short words of 2-3 signs. To translate words consisting of more than 2-3 characters, it is necessary to know this ancient language.
If you are interested, I am ready to send more detailed information, including scans of pages indicating the translated words.
Sincerely, Nicholas.
The Altaic hypothesis is, AFAIK, no longer taken seriously, so Turkic can’t be said to be related to Mongolic.
I’m not sure what that paper means by “Moldavian” — probably the state language of Moldova, Romanian written in Cyrillic. Romanian doesn’t have much to do with Turkic: it’s a Romance language (descended from Latin) in the Balkan sprachbund (an area of mutual linguistic influence that also includes South Slavic, Albanian, and Greek). If “Moldavian” ranks so much higher than Romanian, either “Moldavian” is something other than the official language of Moldova or there’s a problem with their methodology. (It’s probably differences in text sample size. To generate the text samples, they pulled 100 random articles from Wikipedia and concatenated them. The Moldovan Wikipedia is much smaller than the Romanian one.)
But, most importantly, how similar is Cuman to the other Turkic languages? The Common Turkic languages (a category which includes Cuman and all living Turkic languages except Chuvash) diverged fairly recently and still retain some mutual intelligibility. My guess is that it’s within a “modern-day Frisian trying to read Chaucer” distance of Tatar or Kyrgyz. If it’s actually Turkic, it may be possible for somebody who knows something about Turkic to brute-force it by looking for common words that fit the pattern.
Oh, and if the script works like Old Turkic, that could square “abjad” with “vowel harmony” — you have one set of letters for consonants in words with front vowels, and another set of letters for consonants in words with back vowels. A bit like C/K/Q in early Latin, but for many more letters than that. The Old Turkic Voynich hypothesis is testable by statistical analysis, isn’t it? Hell, I’d be surprised if it hasn’t been done.
From the paper:
This is… bizarre. How are Swedish and French richer in vowels than Serbian and “Moldavian”? Swedish and French have much larger vowel inventories than Serbian, true, but they’re not using phonemic transcriptions; they’re just using text. Are they going for a vowel/consonant ratio?
OK, back-of-the-envelope calculation: I’ll use the articles for Barack Obama. Romanian is about 39% vowels, French and Serbo-Croatian are about 37% vowels (although the count for Serbo-Croatian is artificially low since I’m not bothering to distinguish between vocalic and consonantal l and r), and Swedish is about 34% vowels. (This isn’t strictly accurate, since I’m using the entire contents of one page, but it’s a rough estimate.)
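The back-of-the-envelope count is easy to reproduce. Here is a minimal Python sketch, assuming the article text has already been pasted in as a string and that the caller supplies the vowel letters for the language in question; the function name and the placeholder input are illustrative only.

```python
def vowel_fraction(text, vowels):
    """Fraction of alphabetic characters in `text` that belong to `vowels`."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    if not letters:
        raise ValueError("text contains no letters")
    return sum(ch in vowels for ch in letters) / len(letters)

# Placeholder input; a real run would use the full article text for each language
# and a vowel set suited to that language's orthography.
print(vowel_fraction("Barack Obama", set("aeiou")))
```

The result depends heavily on which letters are counted as vowels (for example, whether y or syllabic l and r are included), which is the same caveat raised above for Serbo-Croatian.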
A better source to use for a vowel-heavy language would be a Polynesian language — ideally one with no consonant digraphs, such as Hawaiian. Anything with a (C)V syllable structure (which AFAIK all Polynesian languages have) and no consonant digraphs will necessarily be at least 50% vowels.
I have been deciphering the Voynich manuscript and have obtained positive results.
There is a key to the cipher of the Voynich manuscript.
The key to the cipher is placed in the manuscript itself, spread throughout the text. Some of the hints to the key are on sheet 14. With its help I was able to translate a few dozen words that are fully relevant to the themes of their sections.
The Voynich manuscript is not written with letters. It is written in signs. The signs replace the letters of the alphabet of one of the ancient languages. Moreover, the text has 2 levels of encryption. I worked out the key, with which I could read the following words in the first section: hemp, wearing hemp; food, food (sheet 20 in the numbering used on the Internet); to clean (gut), knowledge, perhaps the desire, to drink, sweet beverage (nectar), maturation (maturity), to consider, to believe (sheet 107); to drink; six; flourishing; increasing; intense; peas; sweet drink, nectar, etc. These are just the short words, of 2-3 signs. To translate words with more than 2-3 characters requires knowledge of this ancient language. The fact is that some symbols represent two letters. As a result, a word consisting of three characters can contain up to six letters, three of which are superfluous. In the end, out of six letters you need to determine the meaningful word of three letters. Of course, without knowledge of this language that is very difficult, even with a dictionary.
And most important. In the manuscript there is information about “the Holy Grail”.
I’m willing to share information.
Nikolai.
Suggesting that the Voynich manuscript might be written in Cuman has a long but typically desultory history. I mean ‘typically desultory’ in that the same box of bits-and-pieces is constantly rummaged through, and things found and dropped are later ‘found’ again, and so on, over and over. When this happens and nothing moves, it is usually a sign that there is some fundamental error: an error in the basic premises informing a discussion. However, the Codex Cumanicus was certainly mentioned by 2004, when Leonard Fox talked about it on the first Voynich mailing list.
I quoted it in a comment to one of the Voynich forums just last year, after Emma Smith had made the same suggestion (again – but from a more informed linguistic point of view).
My comment to the forum:
2004. First ‘Cuman-related’ item. Not sure what got Leonard Fox onto that track, but on November 6th of that year he wrote to Jim’s [first] Voynich mailing list about his own friend and colleague Peter Golden, who had already written several essays on the Codex Cumanicus and who had already pointed out that the text’s Turkic language was “quite closely related to Karaim”… ‘Karaim’ is how Golden and Fox speak of the Karaite dialect spoken by Jews of the Crimea. Fox said in that message that he could confirm the similarity, because Karaim (or Karaite) was the language of his own childhood.
On another matter. You seem to have a few language buffs commenting on your post and I wonder if any would be willing to offer some thoughts about the ‘Tartar language’ supposedly represented in Johann Schildberger’s account of his time spent among Turks and Mongols in the late fourteenth-to-early fifteenth century?
The clip of his ‘Pater Noster in the Tatar language’ can be seen in the header to this post (if you permit a link)
https://voynichrevisionist.com/2019/05/01/light-relief-inventing-a-german-byzantine-turkish-mongol-solution-theory/