Prompted by curiosity and previous threads, I've made a script that counts glyph pairs and lists them. I know it's been done before, but I wanted one with a little more flexibility. Unfortunately, because this is a shell script that uses old Unix tools like grep, there isn't really any Unicode support. I'm going to have to rewrite it in something else to get that working better.
So here's some sample output. The first column is the glyph pair. Second column: count in file. Third: percentage.
First 5 1 glyph pairs
e 74725 10.9311960023
t 52313 7.6526417727
o 50175 7.3398830300
a 47443 6.9402306047
n 43578 6.3748365258

First 5 2 glyph pairs
th 12424 4.0538644514
in 7265 2.3705187733
an 5687 1.8556283914
er 4862 1.5864366518
to 4462 1.4559194448

First 5 3 glyph pairs
the 7791 4.4071976875
and 2737 1.5482608228
you 2725 1.5414726862
tha 1659 0.9384598849
for 1461 0.8264556310

First 5 4 glyph pairs
that 1379 1.2400633071
with 824 0.7409805403
have 791 0.7113053487
your 766 0.6888241430
thin 642 0.5773173627

First 5 5 glyph pairs
there 341 0.4782675774
thing 331 0.4642421352
peopl 274 0.3842971150
which 261 0.3660640402
don't 256 0.3590513191

First 5 6 glyph pairs
people 273 0.5722910509
progra 175 0.3668532377
should 171 0.3584680209
always 160 0.3354086745
person 140 0.2934825902

First 5 7 glyph pairs
program 175 0.5386935911
because 129 0.3970941329
somethi 126 0.3878593856
nothing 119 0.3663116419
compute 116 0.3570768947

First 5 8 glyph pairs
somethin 126 0.5967604433
computer 115 0.5446623094
stardate 72 0.3410059676
anything 72 0.3410059676
differen 66 0.3125888036

First 5 9 glyph pairs
something 126 0.9353425878
understan 65 0.4825180016
everythin 56 0.4157078168
programme 54 0.4008611090
<Knghtbrd 49 0.3637443397
By the way, the file I ran through this was the output of the BSD fortune command run 5000 times (fortune.txt in the attachments), so some of it might seem odd. The attachments also include output like the above for the first 40 pairs and for all pairs. The first attachment is the script itself.
To use the script, open up a terminal and cd to the directory you've put the script in. I haven't tested this on my Mac yet, but it should work. Note that the script requires zsh and probably GNU coreutils. You'll want to rename it first. ('%' represents a prompt.)
% mv pairs.zsh.txt pairs.zsh
And make it executable:
% chmod +x pairs.zsh
Finally, to run it:
% ./pairs.zsh LENGTH FILE
(Where LENGTH is the number of glyphs in each "pair", and FILE is the file you wish to run the script on.)
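For anyone who wants Unicode support in the meantime, here is a minimal sketch of the same kind of count in Python. It is not a port of pairs.zsh. The names `count_ngrams` and `report` are made up for illustration, and the word rule (runs of letters, digits, and apostrophes, lowercased) is my assumption, so the numbers will not necessarily match the shell script's output. The percentage is taken against the total number of sequences of that length.

```python
import re
from collections import Counter


def count_ngrams(text, n):
    """Count every run of n consecutive characters inside each word."""
    counts = Counter()
    # Assumption: a "word" is a run of letters, digits, or apostrophes.
    # In Python 3, str.lower() and \w both handle non-ASCII letters.
    for word in re.findall(r"[\w']+", text.lower()):
        # Slide the window one character at a time so that overlapping
        # sequences (e.g. "somethin" and "omething") are all counted.
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += 1
    return counts


def report(text, n, top=5):
    """Return (sequence, count, percentage) rows, most frequent first."""
    counts = count_ngrams(text, n)
    total = sum(counts.values())
    return [(g, c, 100.0 * c / total) for g, c in counts.most_common(top)]


if __name__ == "__main__":
    # Same three columns as the sample output: sequence, count, percentage.
    for gram, count, pct in report("the thing that they think", 2):
        print(gram, count, f"{pct:.10f}")
```

To run it on a real file, read the file with an explicit encoding, for example `open("fortune.txt", encoding="utf-8").read()`, and pass the result to `report`.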
| Attachment | Size |
|---|---|
| pairs.zsh.txt | 650 bytes |
| first40glyphpairs.txt | 8.28 KB |
| allglyphpairs.txt | 1.76 MB |
| fortune.txt | 825.84 KB |
22 Apr 2012 — 1:10pm
I don't think your script is counting right. Under 8 glyph "pairs," if "somethin" appears 126 times, shouldn't "omething" be tied with the same number?
22 Apr 2012 — 1:19pm
You make a good point. I think it must be cutting off bits of words with the regular expression somehow. I'll try to fix that.
I might think about trying to come up with a better term than "pairs" as well.
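One way this kind of undercount can happen, and this is only a guess at the mechanism, not a confirmed diagnosis of the script: a regular expression that matches a fixed number of characters consumes them, so the next match starts after the previous one instead of one character further along. A small Python sketch of the difference, using a zero-width lookahead to get overlapping matches:

```python
import re

text = "something"

# Non-overlapping: each match consumes its 8 characters, so the scan resumes
# at the final "g" and never sees "omething".
consumed = re.findall(r"\w{8}", text)

# Overlapping: the lookahead captures 8 characters without consuming them,
# so the scan advances one character at a time.
overlapping = re.findall(r"(?=(\w{8}))", text)

print(consumed)     # ['somethin']
print(overlapping)  # ['somethin', 'omething']
```

With the overlapping version, every "something" contributes both "somethin" and "omething", so the second count can never be lower than the first.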
22 Apr 2012 — 1:29pm
Maybe the English language has changed, or perhaps it’s my memory, but the old reckoning listed e t a i n as the five most frequently used letters, and in that order. On the other hand, if “stardate” appears more frequently than “anything,” the text from which this data was mined is probably not terribly old school itself…
22 Apr 2012 — 1:59pm
Right, I didn't exactly choose the best text to mine this data from, just to test the script. Also, according to Wikipedia (not that Wikipedia is the best authority), the most used letters in English are e a i n o.