Glyph pairs/adjacencies

brianskywalker's picture

Prompted by curiosity and previous threads, I've made a script that counts glyph pairs and lists them. I know it's been done before, but I wanted one with a little more flexibility. Unfortunately, because this is a shell script that uses old Unix tools like grep, there isn't really any Unicode support. I'll have to rewrite it in something else to get that working better.

So here's some sample output. The first column is the glyph pair, the second is its count in the file, and the third is its percentage.

First 5 1 glyph pairs
e	74725	10.9311960023
t	52313	7.6526417727
o	50175	7.3398830300
a	47443	6.9402306047
n	43578	6.3748365258

First 5 2 glyph pairs
th	12424	4.0538644514
in	7265	2.3705187733
an	5687	1.8556283914
er	4862	1.5864366518
to	4462	1.4559194448

First 5 3 glyph pairs
the	7791	4.4071976875
and	2737	1.5482608228
you	2725	1.5414726862
tha	1659	0.9384598849
for	1461	0.8264556310

First 5 4 glyph pairs
that	1379	1.2400633071
with	824	0.7409805403
have	791	0.7113053487
your	766	0.6888241430
thin	642	0.5773173627

First 5 5 glyph pairs
there	341	0.4782675774
thing	331	0.4642421352
peopl	274	0.3842971150
which	261	0.3660640402
don't	256	0.3590513191

First 5 6 glyph pairs
people	273	0.5722910509
progra	175	0.3668532377
should	171	0.3584680209
always	160	0.3354086745
person	140	0.2934825902

First 5 7 glyph pairs
program	175	0.5386935911
because	129	0.3970941329
somethi	126	0.3878593856
nothing	119	0.3663116419
compute	116	0.3570768947

First 5 8 glyph pairs
somethin	126	0.5967604433
computer	115	0.5446623094
stardate	72	0.3410059676
anything	72	0.3410059676
differen	66	0.3125888036

First 5 9 glyph pairs
something	126	0.9353425878
understan	65	0.4825180016
everythin	56	0.4157078168
programme	54	0.4008611090
<Knghtbrd	49	0.3637443397

By the way, the file I ran through this was the output of the BSD fortune command run 5000 times (fortune.txt in the attachments), so some of it might seem odd. The attachments also include output similar to the above for the first 40 pairs, and for all pairs. The first attachment is the script itself.
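
For anyone who wants to reproduce the three columns without zsh, here is a minimal sketch in Python. It is not the attached script: the names count_ngrams and report are made up for illustration, and it assumes that runs never span whitespace and that the percentage is the count divided by the total number of runs of that length. Python strings are sequences of Unicode code points, so accented letters count as single glyphs here, although combining sequences would still be split.

```python
from collections import Counter


def count_ngrams(text, n):
    """Count every run of n adjacent non-whitespace characters."""
    counts = Counter()
    for token in text.split():                # assumption: runs never cross whitespace
        for i in range(len(token) - n + 1):   # slide the window one character at a time
            counts[token[i:i + n]] += 1
    return counts


def report(text, n, top=5):
    """Return (run, count, percentage) rows, most frequent first."""
    counts = count_ngrams(text, n)
    total = sum(counts.values())
    # assumption: percentage = count / total runs of this length
    return [(gram, c, 100.0 * c / total) for gram, c in counts.most_common(top)]


if __name__ == "__main__":
    sample = "the thing that they think"
    for gram, c, pct in report(sample, 2):
        # tab-separated, like the sample output in this post
        print(f"{gram}\t{c}\t{pct:.10f}")
```

Because the window moves one character at a time, every run inside a word is counted, which is a deliberate choice in this sketch and may not match what the shell script does.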

To use the script, open up a terminal and cd to the directory you've put the script in. I haven't tested this on my Mac yet, but it should work. Note that the script requires zsh and probably GNU coreutils. You'll want to rename it first. ('%' represents a prompt.)
% mv pairs.zsh.txt pairs.zsh

And make it executable:
% chmod +x pairs.zsh

Finally, to run it:
% ./pairs.zsh LENGTH FILE
(Where LENGTH is the length of the pairs, and FILE is the file you wish to run the script on.)

Attachments:
pairs.zsh.txt (650 bytes)
first40glyphpairs.txt (8.28 KB)
allglyphpairs.txt (1.76 MB)
fortune.txt (825.84 KB)

eliason's picture

I don't think your script is counting right. Under 8 glyph "pairs," if "somethin" appears 126 times, shouldn't "omething" be tied with the same number?
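
That check can be automated. A rough sketch in Python, with a made-up helper name (inconsistent): any run of length n must appear at least as often as each longer run that contains it, so a longer run with a higher count than one of its parts signals a counting bug. The figures below are copied from the top-5 lists in the first post. A run missing from a top-5 list is treated as count 0, which is safe here because its real count can be at most that of the fifth entry (66).

```python
from collections import Counter


def inconsistent(longer, shorter):
    """List (run, count, part, part_count) where a run outnumbers one of its parts.

    `longer` holds counts for runs of length n, `shorter` for length n - 1.
    """
    bad = []
    for gram, c in longer.items():
        # the two (n-1)-glyph runs contained in an n-glyph run
        for part in (gram[:-1], gram[1:]):
            if shorter[part] < c:     # Counter returns 0 for missing keys
                bad.append((gram, c, part, shorter[part]))
    return bad


if __name__ == "__main__":
    # top-5 list of 8-glyph runs, copied from the sample output
    eight = Counter({"somethin": 126, "computer": 115, "stardate": 72,
                     "anything": 72, "differen": 66})
    # the most frequent 9-glyph run from the sample output
    nine = Counter({"something": 126})
    for gram, c, part, pc in inconsistent(nine, eight):
        print(f"{gram} ({c}) contains {part}, which is listed with only {pc}")
```

Run against the full count tables rather than the top-5 excerpts, an empty result would suggest the counts are at least internally consistent.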

brianskywalker's picture

You make a good point. I think it must be cutting off bits of words with the regular expression somehow. I'll try to fix that.
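
One way a regular expression can cut words short, shown in Python since the script's actual pattern isn't quoted in this thread: a plain fixed-length match is non-overlapping, so the search resumes after the end of each match and skips every run that starts inside it. grep -o also reports only non-overlapping matches. A zero-width lookahead is one common way around this. The function names below are hypothetical.

```python
import re


def nonoverlapping(token, n):
    """Fixed-length matches; each new match starts where the previous one ended."""
    return re.findall(r"\S{%d}" % n, token)


def overlapping(token, n):
    """A zero-width lookahead allows a match to start at every position."""
    return re.findall(r"(?=(\S{%d}))" % n, token)


if __name__ == "__main__":
    print(nonoverlapping("something", 8))   # "omething" is never seen
    print(overlapping("something", 8))      # both 8-glyph runs are found
```

With the non-overlapping version, "somethin" is counted every time "something" appears and "omething" never is, which is the kind of gap eliason spotted.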

I might think about trying to come up with a better term than "pairs" as well.

oldnick's picture

Maybe the English language has changed, or perhaps it’s my memory, but the old reckoning listed e t a i n as the five most frequently used letters, and in that order. On the other hand, if “stardate” appears more frequently than “anything,” the text from which this data was mined is probably not terribly old school itself…

brianskywalker's picture

Right, I didn't exactly choose the best text to mine this data from, just to test the script. Also, according to Wikipedia (not that Wikipedia is the best authority), the most used letters in English are e a i n o.
