Warning: This is a first draft, so bear with me.
The plot is busy. Very very busy. That’s because English spelling is messy. Very very messy.
But among the mess, there are very strong patterns and a large number of subpatterns, some of which are very helpful, and others which are obstructive and could probably do with being trimmed. Now, I’m not going to try to answer the question of whether it would be possible to clean up English spelling; instead, I’ll look at what might be the best candidates for change.
The goal here is that people can get a feel for the size – and shape – of the problem.
The plot shows the vowel letters used to spell 13 different vowel sounds of ‘General’ American English, and you can hover over each bar to see examples. The data is taken from this useful database (http://www.speech.cs.cmu.edu/cgi-bin/cmudict), which I’ve worked with a lot, although it’s far from perfect. For now, the focus is on words with one syllable, and I’ll expand to longer words once I work out how to get the data into something semi-intelligible in a graphic. I also need to clean up this plot a lot, and work on the colour scheme, but for now, there is enough to keep us busy. For simplicity, I’ve also ignored ‘silent’ < e > until the end.
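For anyone who wants to poke at the data themselves, here is a minimal sketch of how one-syllable words might be pulled out of CMUdict-style entries and grouped by vowel. It is not the code behind the plot. The function name `one_syllable_by_vowel` and the handful of sample entries are made up for illustration, and the sketch assumes the usual dictionary layout of a word followed by ARPAbet phonemes, with a stress digit on every vowel.

```python
from collections import defaultdict

# ARPAbet vowel symbols used by CMUdict (compared after stripping stress digits).
ARPABET_VOWELS = {
    "AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER",
    "EY", "IH", "IY", "OW", "OY", "UH", "UW",
}

# A few illustrative entries standing in for the full dictionary file.
SAMPLE = """\
NICE  N AY1 S
MIGHT  M AY1 T
HEIGHT  HH AY1 T
TOY  T OY1
TOIL  T OY1 L
NATION  N EY1 SH AH0 N
"""

def one_syllable_by_vowel(lines):
    """Group one-syllable words by their single vowel phoneme."""
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";;;"):  # skip blanks and comment lines
            continue
        word, *phones = line.split()
        word = word.split("(")[0].lower()  # drop variant markers such as "(1)"
        # Vowels carry a trailing stress digit (0, 1 or 2); strip it to compare.
        vowels = [p.rstrip("012") for p in phones
                  if p.rstrip("012") in ARPABET_VOWELS]
        if len(vowels) == 1:  # exactly one vowel phoneme means one syllable
            groups[vowels[0]].append(word)
    return dict(groups)

groups = one_syllable_by_vowel(SAMPLE.splitlines())
print(groups)
```

In this sketch the AY group corresponds to the PRICE box and the OY group to CHOICE; two-syllable entries such as nation are simply skipped.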
Each box represents a vowel sound, so all of the words in the PRICE box have the same vowel, and all of the words in the KIT box do too. I’ve ordered the plot roughly in the order < I, E, A, O, U >, and the vowels can be subgrouped as follows:
I. PRICE, KIT
E. FLEECE, DRESS
A. FACE, TRAP/BATH
O. LOT/CLOTH, PALM/THOUGHT, GOAT (this section needs to be disentangled and de-Americanized).
U. STRUT, FOOT, GOOSE
The best-behaved group is CHOICE, in the bottom right corner. These words are spelt with < oi > if the vowel is inside the word, or with < oy > if it is at the end of the word. Hence toil v toy. The worst-behaved group is GOOSE, where < oo > is the most common spelling and there are another dozen variants of concern.
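The CHOICE rule is simple enough to write down as a check. The sketch below is only an illustration: the function name `follows_choice_rule` is hypothetical, and it assumes the word is an uninflected one-syllable word that really does contain the CHOICE vowel.

```python
def follows_choice_rule(word):
    """True if the word spells CHOICE as < oy > word-finally, else < oi >."""
    word = word.lower()
    if word.endswith("oy"):
        return True  # vowel at the end of the word, spelt < oy >
    if "oi" in word and not word.endswith("oi"):
        return True  # vowel inside the word, spelt < oi >
    return False

# Collect any words from a small hand-picked list that break the rule.
exceptions = [w for w in ["toil", "toy", "coin", "joy"]
              if not follows_choice_rule(w)]
print(exceptions)
```

Inflected forms such as toys would need their endings stripped before a check like this is applied.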
Some of the variation can be explained by the fact that a lot of vowels are paired, as can be heard in the following:
I. divine – divinity
E. serene – serenity
A. nation – national
O. cone – conical
U. reduction – … – reduce
And as a result, you’ll see the same colours repeating in adjacent columns: the blue < i > and < e > of the first four columns, the purple < a > and < o >, and the cream < u > and brown < oo >.
In the top left corner of the plot are words with the vowel that occurs in the word PRICE, as in nice, mice, might, fight, lie, sly, buy, height. If we focus on the outliers, or simply count the number of ways that this vowel can be spelt, then it feels like there is a frightening amount of needless variation. But the vast majority of the words are spelt with the same sequences of letters, namely < i..e >, < y..e > or < igh >, all of which map neatly to the PRICE vowel. The focus of reform, therefore, ought to be on the spellings which do not map predictably onto sound, like height and, to a lesser extent, buy. The word height is easy to fix: it could just drop the < e > and become < hight > without there being any obvious knock-on effects. Buy is more problematic because fixing it is blocked by several things. The obvious candidates are < by > and < bye >, but those spellings are already taken by other words, as is the less Englishy spelling < bi >. Already, we are into the problem that spelling and meaning are intimately connected, and that it is very hard to change a spelling which has the useful visual function of distinguishing two words which sound the same. Changing spellings, if something has to give, then often
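To make the PRICE reasoning concrete, here is a rough sketch of the two checks involved: sorting words into the regular patterns versus outliers, and testing whether a proposed respelling is already taken. Everything named here is an assumption for illustration, including the regular expressions, the function names `price_pattern` and `already_taken`, and the tiny `LEXICON`, which stands in for a full word list.

```python
import re

# Regular PRICE spellings named in the text, checked in order.
# < igh > is only counted as regular when no < e > comes directly before it,
# so that "height" falls out as an outlier.
REGULAR_PATTERNS = [
    ("igh", re.compile(r"(?<!e)igh")),
    ("i..e", re.compile(r"i[^aeiouy]+e$")),
    ("y..e", re.compile(r"y[^aeiouy]+e$")),
]

def price_pattern(word):
    """Name the regular pattern a PRICE word follows, or 'other'."""
    for name, pattern in REGULAR_PATTERNS:
        if pattern.search(word):
            return name
    return "other"

def already_taken(candidates, lexicon):
    """Return the candidate respellings that are existing words."""
    return [c for c in candidates if c in lexicon]

# Hypothetical stand-in for a real word list.
LEXICON = {"nice", "mice", "might", "fight", "height", "buy", "by", "bye", "bi"}

price_words = ["nice", "mice", "might", "fight", "height", "buy"]
outliers = [w for w in price_words if price_pattern(w) == "other"]
print(outliers)                                     # words needing a closer look
print(already_taken(["hight"], LEXICON))            # respelling for height
print(already_taken(["by", "bye", "bi"], LEXICON))  # respellings for buy
```

With this toy word list, height and buy come out as the outliers, < hight > is free, and all three candidates for buy are blocked. A real check would need a full lexicon, since a rare or archaic word could already occupy a spelling that looks free here.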
extract <gh> words
extract word-medial <y>
Log scale version