During the autumn I was engaged in a machine-learning project, trying to predict pronunciation from English spelling, but I took the uncharted approach of seeing whether syllable stress patterns can be predicted from patterns of consonant and vowel letters. The results have been patchy so far, but some of the graphics have turned out to be pretty and interesting, to me at least, and I thought it would be nice to share some info and graphics. So what follows is an analysis of the number of letters in English words, the number of syllables and, slightly more novel, the number of letters per syllable.
The first graph is nothing new. It shows the number of letters in distinct English words and you can find the same basically graph on Peter Norvig’s page. Letter length forms a rather well-behaved distribution, with most words falling between three and thirteen(ish) letters, with a neat peak at seven-letter-words. This leaves just a handful of dwarfwords (a, be, do, em, go, he, I ), and a long tail of sesquipedalian behemoths hanging out the back of it (your chlorofluorocarbons and your incomprehensibilitys).[ii]
We should note that the focus here is on distinct word types, so that equal recognition is given to the words thesaurus and the. The same goes for a and aardvark. Elsewhere, I have looked at The 5000 most common words of English, which shows how the dwarfwords are as common as muck and dominate the written landscape. Any analysis of words has to recognise this distinction from the very beginning.
Meanwhile, the number of syllables in distinct English words forms a rather different-looking distribution, with bisyllables being by far the most numerous (37.5%), followed by trisyllables (27.4%) and, in roughly equal parts, quadrisyllables (14.4%) and monosyllables (13.3%). Again, though, we are talking about distinct words, so that righteousness counts as much as right. Pentasyllables also provide a sizeable input at 5.8% of the data but after that, the numbers fall off a cliff, with sesquipedalian words accounting for just 1.39% of the total (i.e. 592 actual words) and there are less than a hundred outlengthers (e.g. counterrevolutionary, heterosexuality, totalitarianism… ).
Of course, the choice of data is an issue here. I have been working with the Carnegie Mellon Pronouncing Dictionary (CMU), which contains 130,000+ words and names along with their phonological forms in American English. CMU also includes syllable stress pattern information, making it easy to count syllables. I have merged CMU with the Enable2k dataset, a simple list of words used by Scrabble fiends, and other wordgame enthusiasts. The merger filters out the names as well as yoinking out oddities like supercalifragilisticexpialidocious, which appear in only one of the datasets. Neither dataset contains large amounts of technical vocabulary, so stupidly long chemical names are not included here, hence the lack of hyperpolysyllabic words.
Letters versus syllables
The fun task has been to look at how the distributions interact. We are now talking about the number of letters per syllable in words, and the mosaic below is identical to the graphic at the top of the page, only with a white background this time, labels for the number of letters, and much better recognition of outliers. Most words fall into the central diagonal, meaning words with two syllables (orange) or three (red), and these mostly have between six and nine letters. Shit be getting weird, howandever, when we look at the edge cases. This is where it finally comes in handy to know that strength has eight letters but only one syllable (that’s the tiny sliver of yellow underneath the ‘8’), while a and I have just one obviously letter (that’s the long, near-invisible, leftmost line). Bisyllables have a slightly larger range, going from three letters (ago, bio, ivy) all the way up to twelve (breakthrough, transgressed). Trisyllables range from four letters (area, iota) up to fifteen (straightforward). Six-letter quadrisyllables (purplish) force the mouth into some vowel-tastic verbal gymnastics (idiocy, egoism, myopia) but the longest quadrisyllable is the rather dull example of straightforwardly, which we have already seen, only with a two-letter suffix <ly> tacked on. For what it’s worth, pentasyllables (violet) range between nine letters (academia, ideology) and sixteen (disenfranchisement).
The strange thing about the plot is that all the colours seem to be scrambling to the top right corner, making to plot feel like it is rotating slowly anticlockwise. This observation made me dig deeper and I did a few sums and a spot of boxplottery:
The boxplot shows the number of letters per syllable in English words. That’s basically what I discussed the last paragraph, except I have now divided the number of letters in words by their number of syllables. One pattern predominates: as words increase in their syllabicity, the number of letters per syllable decreases, bottoming out at two letters. Below, I shall put forth some under-tested ideas on the matter.
Ten in the bed
The first thing to appreciate is that the distribution doesn’t have to be this way. Yes, the variance should decrease due to regression-to-the-mean (I’ll talk about this in a follow-up post), but the medians don’t have to reduce. The problem is rather similar to squeezing your children into bedrooms. If you have one child, they will quite likely have their own bedroom. If you have two, they may have to share, and if you have six, it is highly unlikely that you will have enough bedrooms for the lot. The reasons are not entirely numerical, they are spatial, financial and social. How many families can afford seven bedrooms? If they could, would all of the kids want to file away alone at night? Nonethenevertheless, there do remain spatial constraints, as all human beings are somewhat voluminous and, while privacy may feel like a luxury for the rich, not even the poorest families can squash sixteen siblings into one small bedroom.
The piano players: long words, short syllables
Similarly, with words, there are physical, articulatory constraints on speech, and syllables generally have at least one consonant and vowel sound, even though this constraint is relaxed somewhat in English spelling. Either way, the graph bottoms out at just over two letters per syllable: one consonant letter and one vowel letter. This is clear when we look at the handful of octosyllables in the dataset (the tiny green dot at the bottom right of the black plot). Their spellings are presented here both horizontally and vertically, to show the words and their syllabic breakdown (formed by applying the onset-maximisation principle).
Each box represents one syllable and there are eighty syllables in total (ten octosyllables). A whopping 58 of the syllables are represented with exactly two letters, seven of them have just one letter, twelve have three letters and just three have four letters. If you need a quick graphic, here you go:
The piano shifters: short words, bulky spellings
Meanwhile, at the other end of the spectrum we have the monosyllables with eight letters:
and bisyllables with eleven or twelve letters, most of which are compound words:
These bad boys are fattened up by two different, but easily confused factors: consonant clusters and polygraphs.
Consonant cluster analysis
Several of these words contain clusters of consonants such as the /str/ of homestretch and the /ltz/ of schmaltz. This is a common feature of English which allows a lot of information to be squeezed into one syllable. Many languages don’t allow for complex consonant clusters this so the information has to be spread over more syllables. That’s why the Japanese word for softball is sofutobōru. Another titillating side-effect is that English spellings often have shorter words nested inside them, as with the nap inside schnapps and the malt in schmaltz.
But enough of such efficiencies. English spelling is notorious for its erratic boobytraps and few are odder than the preponderance of polygraphs, those bulky strings of multiple letters which only represent one sound, and exemplified above by <sch, pp, aigh, th, sh, tch, igh, eigh, wh, wr, ng>. Compare the inefficiency of <sch> and <sh>, which both represent the same one phoneme, versus the cluster <str> which represents three (and could not represent any others). None of the above polygraphs occur in Latin-derived words (except for <pp>), which suggests that their existence may be due to the never-fully-resolved problem of adapting the Roman alphabet to our Germanic language. It is therefore no coincidence that the overweight English spellings above derive mostly from Old English or, in the case of Schnapps and Schmaltz, modern German and Yiddish, two closely related languages. Incomplete script adaptation has also plagued the Gaelic languages and their writing systems are infested with inelegant, makeshift solutions formed by cobbling together ancient Roman letters which are not fit for Modern purposes. By contrast, the octosyllabic (English) examples above are all, ultimately, from Latin or ancient Greek, albeit many have been sieved via French.
[i] This is of course a discrete distribution, since you can only have whole-number-letters words, and it appears to follow a poisson distribution of sorts. See NZ link. The same for the Syllables
[ii] These numbers feel right to me because when I was a child we had a clever teacher whose idea of a good punishment was to ask the kids to list fifteen words with fifteen or more letters. Good man Nev Wilkinson.