Glyph pairs/adjacencies

Briän M Zick
Joined: 8 Nov 2008 - 9:38pm

Prompted by curiosity and previous threads, I've made a script that counts glyph pairs and lists them. I know it's been done before, but I wanted one with a little more flexibility. Unfortunately, because this is a shell script that uses old Unix tools like grep, there isn't really any Unicode support. I'll have to rewrite it in something else to get that working properly.
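For anyone who wants Unicode handling in the meantime, here is a rough Python sketch of the same idea. It is not the attached zsh script. The function names (count_sequences, format_report) are made up for illustration, and it assumes that a "word" is any run of non-whitespace characters with case preserved, which is only a guess based on entries like "don't" and "<Knghtbrd" in the output.

```python
from collections import Counter
import unicodedata


def count_sequences(text, length):
    """Count every run of `length` consecutive characters inside each word."""
    # NFC normalization (an assumption, not something the original script does)
    # turns "a" + combining diaeresis into the single character "ä".
    text = unicodedata.normalize("NFC", text)
    counts = Counter()
    # str.split() with no arguments splits on any whitespace, so punctuation
    # stays attached to its word and case is left untouched.
    for word in text.split():
        # Slide the window one character at a time so overlapping runs count.
        for start in range(len(word) - length + 1):
            counts[word[start:start + length]] += 1
    return counts


def format_report(counts, top=5):
    """Return 'sequence count percentage' lines for the most common entries."""
    total = sum(counts.values())
    return [
        f"{seq} {n} {100 * n / total:.10f}"
        for seq, n in counts.most_common(top)
    ]


if __name__ == "__main__":
    sample = "the then Briän"
    for line in format_report(count_sequences(sample, 2)):
        print(line)
```

Python strings are sequences of Unicode code points, so "ä" counts as one glyph here once it has been normalized. Ligatures and other multi-code-point clusters would still need more care.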

So here's some sample output. The first column is the glyph pair. Second column: count in file. Third: percentage.

First 5 1 glyph pairs
e 74725 10.9311960023
t 52313 7.6526417727
o 50175 7.3398830300
a 47443 6.9402306047
n 43578 6.3748365258

First 5 2 glyph pairs
th 12424 4.0538644514
in 7265 2.3705187733
an 5687 1.8556283914
er 4862 1.5864366518
to 4462 1.4559194448

First 5 3 glyph pairs
the 7791 4.4071976875
and 2737 1.5482608228
you 2725 1.5414726862
tha 1659 0.9384598849
for 1461 0.8264556310

First 5 4 glyph pairs
that 1379 1.2400633071
with 824 0.7409805403
have 791 0.7113053487
your 766 0.6888241430
thin 642 0.5773173627

First 5 5 glyph pairs
there 341 0.4782675774
thing 331 0.4642421352
peopl 274 0.3842971150
which 261 0.3660640402
don't 256 0.3590513191

First 5 6 glyph pairs
people 273 0.5722910509
progra 175 0.3668532377
should 171 0.3584680209
always 160 0.3354086745
person 140 0.2934825902

First 5 7 glyph pairs
program 175 0.5386935911
because 129 0.3970941329
somethi 126 0.3878593856
nothing 119 0.3663116419
compute 116 0.3570768947

First 5 8 glyph pairs
somethin 126 0.5967604433
computer 115 0.5446623094
stardate 72 0.3410059676
anything 72 0.3410059676
differen 66 0.3125888036

First 5 9 glyph pairs
something 126 0.9353425878
understan 65 0.4825180016
everythin 56 0.4157078168
programme 54 0.4008611090
<Knghtbrd 49 0.3637443397

By the way, the file I ran through this was the output of the BSD fortune command run 5000 times (fortune.txt in the attachments), so some of it might seem odd. The attachments also include output like the above for the first 40 pairs and for all pairs. The first attachment is the script itself.

To use the script, open up a terminal and cd to the directory you've put the script in. I haven't tested this on my Mac yet, but it should work. Note that the script requires zsh and probably GNU coreutils. You'll want to rename it first. ('%' represents the prompt.)
% mv pairs.zsh.txt pairs.zsh

And make it executable:
% chmod +x pairs.zsh

Finally, to run it:
% ./pairs.zsh LENGTH FILE
(Where LENGTH is the length of pairs, and FILE is what you wish to run the script on.)

Craig Eliason
Joined: 19 Mar 2004 - 1:44pm

I don't think your script is counting right. Under 8 glyph "pairs," if "somethin" appears 126 times, shouldn't "omething" be tied with the same number?
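The expectation can be checked with a few lines of Python. This is only an illustration of the reasoning, with a made-up helper name (windows). Every occurrence of "something" contains both 8-glyph runs, and since the 9-glyph list shows "something" 126 times, "omething" should appear at least 126 times as well.

```python
def windows(word, length):
    """Return every run of `length` consecutive characters in `word`."""
    return [word[i:i + length] for i in range(len(word) - length + 1)]


# Both 8-character runs come out of the one 9-character word.
print(windows("something", 8))  # ['somethin', 'omething']
```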

Briän M Zick

You make a good point. I think it must be cutting off bits of words with the regular expression somehow. I'll try to fix that.
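One plausible cause, and this is a guess since the script itself isn't quoted in the thread, is that a fixed-width pattern returns only non-overlapping matches: after matching "somethin" the search resumes at the "g", which is too short to match again. The Python sketch below reproduces that behaviour and shows a lookahead variant that reports overlapping runs. The patterns are illustrative, not taken from pairs.zsh.

```python
import re

word = "something"

# Plain pattern: each match consumes its characters, so the scan resumes
# after "somethin" and the leftover "g" cannot form another 8-character run.
non_overlapping = re.findall(r"\S{8}", word)

# Zero-width lookahead with a capture group: nothing is consumed, so the
# scan advances one character at a time and reports every overlapping run.
overlapping = re.findall(r"(?=(\S{8}))", word)

print(non_overlapping)  # ['somethin']
print(overlapping)      # ['somethin', 'omething']
```

If that is what's happening, it would also explain why truncated entries like "differen" and "understan" show up while their shifted counterparts don't.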

I might think about trying to come up with a better term than "pairs" as well.

Nick Curtis
Joined: 21 Apr 2005 - 8:16am

Maybe the English language has changed, or perhaps it’s my memory, but the old reckoning listed e t a i n as the five most frequently used letters, and in that order. On the other hand, if “stardate” appears more frequently than “anything,” the text from which this data was mined is probably not terribly old school itself…

Briän M Zick

Right, I didn't exactly choose the best text to mine this data from, just to test the script. Also, according to Wikipedia (not that Wikipedia is the best authority), the most used letters in English are e a i n o.