This is just a short note from some graphs I’ve been creating this week:
The above plot shows roughly 60,000 bisyllabic words and names of English. The columns tell you how many letters are in the first syllable (2-7 letters), while the rows tell you the number in the second syllable (2-8 letters). The first row and column got eliminated by the dodgy algorithm I downloaded for free off the Jimi Internet.
The bars do not tell you the absolute numbers (see plot below) but simply the proportions in each box:
The colours tell you the stress patterns:
01 = Japan (red)
10 = China (green)
12 = Greenland (blue)
21 = Bahrain (purple).
The plot shows how the minor groups become more common in certain regions. If you have loads of letters in both syllables, then the word is much more likely to have two stresses (e.g. breakthrough). That would be the blue group, which is common in the middle boxes. If there are lots of letters in the second syllable, but not many in the first, then the red stress pattern is more likely (e.g. beNEATH).
Here are the absolute frequencies. Most words have 2-4 letters in the first syllable and 2-4 in the second. While the rest may seem like outliers, closer examination of the first plot shows up the micro-patterns.
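For anyone who wants to see how the two views relate, here is a minimal Python sketch of the bookkeeping: count the stress patterns in each (first syllable length, second syllable length) box, then turn each box into proportions. The word list, the syllable splits, and the function names `stress_table` and `proportions` are all made up for illustration; this is not the actual pipeline or data behind the plots.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (first syllable, second syllable, stress code).
# The splits are hand-made guesses; codes follow the legend: 01, 10, 12, 21.
words = [
    ("be", "neath", "01"),
    ("ja", "pan", "01"),
    ("ta", "ble", "10"),
    ("chi", "na", "10"),
    ("green", "land", "12"),
    ("break", "through", "12"),
    ("bah", "rain", "21"),
]


def stress_table(entries):
    """Absolute counts of each stress code per (len(syll1), len(syll2)) box."""
    counts = defaultdict(Counter)
    for first, second, stress in entries:
        counts[(len(first), len(second))][stress] += 1
    return counts


def proportions(box):
    """Convert one box of absolute counts into proportions that sum to 1."""
    total = sum(box.values())
    return {stress: n / total for stress, n in box.items()}


table = stress_table(words)
for box in sorted(table):
    print(box, dict(table[box]), proportions(table[box]))
```

The counts correspond to the absolute-frequency plot and the proportions to the coloured bars, so a sparsely populated box can still show a striking split between stress patterns.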
I’ll write more when I get the chance. I’m trying to use machine learning to predict stress patterns from the spelling, and I’m still prepping the data, which remains very rough.