The size and shape of English words

The graph above shows the ‘size’ of words in the English language, along with their frequencies. The number of letters in each word’s spelling (y-axis) is plotted against the number of syllables in the spoken form (x-axis). The two obviously correlate strongly, which causes the dots to drift diagonally up to the right. The bigger the dot, the more common the pattern. Most words have either one, two or three syllables, and the number of letters for each syllable count follows a regular distribution around a median. The outliers are greater among shorter words, with monosyllabic ‘strengths’ having nine letters and bisyllabic ‘breakthroughs’ having twelve. Pentasyllables range from nine letters (e.g. ‘ideology’) to seventeen (‘disenfranchisement’), etc. An interactive version of the graph could highlight all the words at each dot (watch this space).

The problem

This weekend, as soon as Ireland had been dispatched from the rugby world cup in Japan, I kept seeing the word sayonara, a word that I don’t remember encountering before, either in writing or speech. I therefore had to go through the necessary process of working out how to pronounce this new sequence of letters, a process which takes several steps. First you have to divvy up the letters into consonants and vowels and then you map these groupings into syllables, an easy job for a Japanese word, hence <sa.yo.na.ra>. The hard part then is to decide which syllables to stress. This gives you the shape of the word, perhaps <(SA.yo.)(NA.ra)> or <sa.(>? After that, you can worry about whether the first syllable should sound like the word say or sigh.
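As a rough illustration of that first step, here is a minimal Python sketch that splits a spelling into alternating runs of consonant and vowel letters. The function name cv_groups and the decision to treat <y> as a consonant are my own assumptions for the example, not a settled part of the project.

```python
from itertools import groupby

# Assumption: only a, e, i, o, u count as vowel letters; 'y' is treated as a consonant.
VOWELS = set("aeiou")

def cv_groups(word):
    """Split a spelling into alternating runs of consonant (C) and vowel (V) letters."""
    runs = []
    for is_vowel, letters in groupby(word.lower(), key=lambda ch: ch in VOWELS):
        runs.append(("V" if is_vowel else "C", "".join(letters)))
    return runs

def vowel_group_count(word):
    """Count the vowel runs, a crude first guess at the number of syllables."""
    return sum(1 for kind, _ in cv_groups(word) if kind == "V")

print(cv_groups("sayonara"))
print(vowel_group_count("sayonara"))
```

For a word like sayonara every vowel run lines up with one syllable, which is why the job is easy here. English spellings with silent <e> or vowel digraphs would need more care than this sketch takes.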

This problem arises every time we have a new or strange word. Last week, we had Typhoon Hagibis to contend with and most punters opted for <HA.gi.bis> (shaped like nemesis and curiously similar in sound to the Scottish word haggis). Never to let popular pronunciation get in their way, some RTE newsreaders elected for the <ha.GI.bis> option (shaped like arthritis and sounding like a lovechild of Jeeves and hibiscus). Meanwhile both Ryan Tubridy and I were discussing the shape of the word Samoa.

The problem, therefore, is to formalise and simplify the process of mapping new words into sound.

The beginnings of a solution

My latest spelling project is an attempt to simplify the reading process. I plan to use machine learning to help me work out the rules for reading words aloud and, hopefully, I can then spread some of this knowledge to students, teachers and punters, so that more people can read with more confidence. I have discovered from my previous research that the outstanding question in the field is to predict the stress pattern of each word from the spelling. Doing this simplifies the reading process from reading whole words to reading a series of stressed and unstressed syllables. Thus the makey-uppy word supercalifragilisticexpialidocious can be reduced to seven pairs of syllables.


Ideally all polysyllabic words can be reduced to groups of one, two or three syllables: thus ankylosaurus reduces down to <( / (sau.rus)> and sesquipedalian to <( / (>. Diplodocus can go two ways: <(di.plo) / (do.cus)> or <di (>. But how do we know which one to choose? That’s the crux of the problem.
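To get a feel for how many candidate groupings there are, here is a small Python sketch that lists every way of cutting a syllable sequence into consecutive groups of one, two or three. The function names are hypothetical, and the sketch treats a lone syllable as a group of one, which is a simplification on my part.

```python
def groupings(syllables, max_size=3):
    """Yield every split of a syllable list into consecutive groups of 1..max_size."""
    if not syllables:
        yield []
        return
    for size in range(1, min(max_size, len(syllables)) + 1):
        head = syllables[:size]
        for rest in groupings(syllables[size:], max_size):
            yield [head] + rest

def show(grouping):
    """Render a grouping in bracket notation, e.g. (di.plo) (do.cus)."""
    return " ".join("(" + ".".join(group) + ")" for group in grouping)

diplodocus = ["di", "plo", "do", "cus"]
all_ways = [show(g) for g in groupings(diplodocus)]
for way in all_ways:
    print(way)
```

A four-syllable word already has seven possible groupings, and the count grows quickly with length, so the real task is choosing among them rather than generating them.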

A spelling paints a thousand words

A paired example comes with the words picture and pictorial. If we know that the first word has two syllables and the second has four, then we have solved a large part of the problem. We then need to decide which syllables are stressed and only then can we work out the vowel sounds. Now, the pronunciations of <(> and <(pic) /(> are not random but in fact part of micro-networks of pronunciation: in the first instance, we see similarities in fracture, nature, fissure, action, vicious and even -expialidocious. In the latter group we have factorial, tutorial, alluvial, original, etc. If we can isolate the micro-networks, then we could potentially draw a whole map of the English spelling system.

Keep it simple.

For now, though, the project is in its infancy and I am still exploring patterns in the data. The first thing I did was to get a giant corpus of wordforms and names along with their pronunciations and thence I extracted their stress patterns (what I am calling their ‘shape’). Last week I wrote about the most common shapes in English, and I was surprised by how common certain ones were. This has reduced my task enormously because instead of mapping from 120,000 spellings to 120,000 pronunciations, we can reduce the latter to just 285 shapes, with 95% of all words being accounted for by just twenty shapes.
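Here is a minimal Python sketch of how a ‘shape’ can be pulled out of a pronunciation and how coverage by the most common shapes can be measured. It assumes ARPAbet-style transcriptions in which each vowel carries a stress digit (1 primary, 2 secondary, 0 unstressed), as in the CMU Pronouncing Dictionary. That format, the function names and the five toy entries are assumptions for the example, not a description of my actual corpus.

```python
from collections import Counter

def shape(pronunciation):
    """Keep only the stress digits of an ARPAbet-style transcription."""
    return "".join(ch for phone in pronunciation.split() for ch in phone if ch.isdigit())

def coverage(shapes, top_n):
    """Share of words accounted for by the top_n most common shapes."""
    counts = Counter(shapes)
    covered = sum(n for _, n in counts.most_common(top_n))
    return covered / sum(counts.values())

# Toy entries written by hand in the assumed format.
toy = {
    "picture": "P IH1 K CH ER0",
    "nature": "N EY1 CH ER0",
    "action": "AE1 K SH AH0 N",
    "tutorial": "T UW0 T AO1 R IY0 AH0 L",
    "word": "W ER1 D",
}

shapes = {word: shape(pron) for word, pron in toy.items()}
print(shapes)
print(coverage(shapes.values(), top_n=1))
```

On the toy data the single most common shape (stressed then unstressed) covers three words out of five. The same calculation over the full corpus is what gives the figure of twenty shapes covering 95% of words.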

How many letters and syllables per word?

There are many ways to divvy up the spellings but the first task has been to map out the number of letters and syllables in all words.

The graph above shows that there are relatively few English words with one, two or three letters. So if we open up a dictionary, these will be hard to find. Of course, their occurrence in actual writing will be very high because they include many of the most common words of the language (in this paragraph alone we have already had the, are, few, one, two, or, so, if, we, up, a, etc.). But if we count the number of letters in each wordform, the mode is seven. My dataset includes plurals and inflected forms (<ed> and <ing> forms, etc.) so this skews the data to the right. To check, you could ‘stem’ the data as is done by most other analysts of spelling.
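The counting behind the graph can be sketched in a few lines of Python. The toy dictionary below, which maps a handful of wordforms to their syllable counts, is made up for the example, and the function names are hypothetical. The table gives the number of wordforms at each (syllables, letters) point, which corresponds to the dot sizes, and the second function returns the modal letter count.

```python
from collections import Counter

# Hypothetical mini-dataset: wordform -> number of syllables.
toy = {
    "the": 1,
    "word": 1,
    "strengths": 1,
    "picture": 2,
    "letters": 2,
    "pattern": 2,
    "tutorial": 4,
}

def size_table(syllable_counts):
    """Count the wordforms at each (syllables, letters) point, i.e. the dot sizes."""
    return Counter((n_syll, len(word)) for word, n_syll in syllable_counts.items())

def modal_length(wordforms):
    """Return the most common number of letters among the wordforms."""
    return Counter(len(w) for w in wordforms).most_common(1)[0][0]

table = size_table(toy)
print(table)
print(modal_length(toy))
```

Running the same two functions on a stemmed copy of the data would show how much of the rightward skew comes from plurals and inflected forms.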


It is worth having a quick look at two other intermediary plots I created in order to understand the data. The first is a boxplot of syllables v letters, which shows the spread of letter counts for each syllable count. This plot does not show the frequency of each group so the boxes are relatively uniform in size (unlike the dots in the top graph). This plot was useful for helping me to isolate problems in the dataset. Many of the (upper) outliers are due to how the words are stored in the dataset and the lower ones show the inclusion of initialisms or non-words such as <ABCD> which has four letters and four syllables.
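For anyone who wants the outliers as a list rather than as marks on a plot, here is a minimal Python sketch. It uses the conventional boxplot rule of flagging anything more than 1.5 times the interquartile range beyond the quartiles. That rule, the function name and the toy monosyllables are assumptions for the example and may not match exactly what a given plotting library draws.

```python
from collections import defaultdict
from statistics import quantiles

def length_outliers(syllable_counts, k=1.5):
    """Flag wordforms whose letter count lies beyond k * IQR from the quartiles
    of their syllable-count group."""
    by_syllables = defaultdict(list)
    for word, n_syll in syllable_counts.items():
        by_syllables[n_syll].append(word)

    outliers = {}
    for n_syll, words in by_syllables.items():
        if len(words) < 4:  # too few wordforms to estimate quartiles sensibly
            continue
        q1, _, q3 = quantiles([len(w) for w in words], n=4, method="inclusive")
        low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
        flagged = sorted(w for w in words if not low <= len(w) <= high)
        if flagged:
            outliers[n_syll] = flagged
    return outliers

# Hypothetical toy data: nine monosyllables.
toy = {w: 1 for w in ["word", "size", "dots", "shape", "sound",
                      "graph", "plots", "stress", "strengths"]}
print(length_outliers(toy))
```

With these nine words the quartiles are four and five letters, so the upper fence sits at six and a half letters and only ‘strengths’ is flagged.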

The violin plot below contains the same information except that it shows the distribution by letter number. Among monosyllables, we can see that five letters is the most common length. For bisyllables, it is a tight call between six and seven letters, etc. Once we get above six syllables, this kind of plot breaks down because there are not enough examples.

So what are the takeaways?

Firstly, there now exists a simple map of the size of English wordforms and names, based on one particular database. The next job is to match the size against the shapes and explore each of the subpatterns. We have seen that bisyllables can have five stress patterns (that’s the red part of the treemap above), so I will explore how many letters map into each syllable, stressed or unstressed. I don’t know how yet, and any help with finding the natural language processing literature would be much appreciated. There is an idea in the spelling literature (Evertz 2014) that words with final stress have more letters and I will start to examine this. I am not ready yet to concern myself with etymology but I will soon look at the relationships between morphology and word stress as in <(o.ri.gin)>, <o(> and <o(>.

Furthermore, the graphs above have given me some access points to attack the data. Instead of having 120,000 words to deal with, I now have a few dozen letter-to-syllable patterns to manage. I can isolate the outliers and I will start to investigate these more. I also suspect that longer words will be easier to subdivide into consonant-vowel patterns but harder to work out which syllables have primary and secondary stress. And that is the central goal of the project.

I don’t know yet which machine learning techniques to use because I don’t yet know what my sub-questions are. So anyone who has ideas, please let me know. I will be writing about this between now and my deadline of December 3. So watch this space.

In the meantime, here is a skinny version of the first graphic:

