Linguistic Adjacencies

hrant's picture

While I was compiling the linguistic kerning pair data, I realized that it might also be useful (in fact probably more so, when you think about it) to have adjacency data - to see which letters are likely to appear to the left and right of a given letter. This can help not only in optimizing spacing (and kerning), but also in designing the letterforms themselves, for example by allowing fine-tuning of the whitespace relationships between/within the glyph bodies. As with the kerning data, this is based on an English corpus, so it's largely (but not entirely) limited to optimizing English setting.

(109K) http://www.themicrofoundry.com/other/kf_adj.pdf

Notes:
1) For each letter (center column), the letters most likely to occur on each side are listed in decreasing frequency away from the center. The spaces separate frequency groups, and beyond the dash it's pretty slim pickings in terms of frequency. An asterisk on one side of a letter indicates that that side's row is only about 1/10 as frequent as the overall table norm*. As a reference, the two most frequent adjacencies are "th" and "he", both at above 100K instances**; then it drops to about 50K (for "an" and "in"), and the rest is mostly bunched up.
2) I used UC letters to avoid apparent-leading issues - obviously lc is the real name of the game.
3) Sorry it's so ugly...

* So for example you can tell that b, p and v occur mostly as initial letters.

** The corpus has about 4.5 million pairs.
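
For anyone who wants to build a similar table from their own corpus, a rough Python sketch could look like the following (the filename is just a placeholder, and only A-Z pairs are counted, as in the PDF):

# Tally, for each letter, which letters occur immediately to its left and
# right in a corpus, then print them ranked by frequency.
from collections import Counter, defaultdict
import string

left = defaultdict(Counter)   # left[c][x]  = times x appears just before c
right = defaultdict(Counter)  # right[c][x] = times x appears just after c

with open("corpus.txt", encoding="utf-8") as f:   # placeholder filename
    text = f.read().upper()

letters = set(string.ascii_uppercase)
for a, b in zip(text, text[1:]):
    if a in letters and b in letters:
        right[a][b] += 1
        left[b][a] += 1

for c in string.ascii_uppercase:
    # the most frequent neighbors sit closest to the center letter
    lefts = [x for x, _ in left[c].most_common(10)]
    rights = [x for x, _ in right[c].most_common(10)]
    print("".join(reversed(lefts)).rjust(12), c, "".join(rights))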

hhp

rs_donsata's picture

I have made a temporary pause in my thesis project but I

brianskywalker's picture

Héctor: You cut off mid-sentence; what were you trying to say?*

Anyway, I've been thinking about this lately, and I'm glad the link to this old thread came up.† Working on Alpine, I thought it would be useful to make some dummy text. Adhesion Text works well enough for this, but 1: I'd like to be able to generate dummy text offline, and 2: I've been thinking about some other ways of actually generating the text.

A good start is to make pairs of all the letters in a list, but that's a little unrealistic: many of the combinations will probably never occur. Linux usually comes with dictionary files for spell checking‡, so I could make a script to record each combination that actually appears; a sketch of that idea is below.
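
Something like this rough Python sketch could do it (assuming the usual /usr/share/dict/words location; adjust the path for your system):

# Collect the letter pairs that actually occur in a system word list.
pairs = set()
with open("/usr/share/dict/words", encoding="utf-8") as f:
    for word in f:
        word = word.strip().lower()
        for a, b in zip(word, word[1:]):
            if a.isalpha() and b.isalpha():
                pairs.add(a + b)

print(len(pairs), "distinct pairs")
print(sorted(pairs)[:20])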

In line with this thread, with a little modification the same script could count the occurrences, and the data could be used in the same way as the information in your PDF. A count of pairs in a dictionary wouldn't really tell you about actual usage, but the same script could also count the pairs in books from several periods, and it would work for other languages too.
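
For example, swapping the set for a counter and feeding it running text gives actual frequencies (the book filenames here are placeholders):

# Count pair occurrences in running text, e.g. a few public-domain books.
from collections import Counter

counts = Counter()
for path in ["book1.txt", "book2.txt"]:   # placeholder filenames
    with open(path, encoding="utf-8") as f:
        text = f.read().lower()
    for a, b in zip(text, text[1:]):
        if a.isalpha() and b.isalpha():
            counts[a + b] += 1

for pair, n in counts.most_common(20):
    print(pair, n)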

Is there any more information on this? I've tried googling but wasn't able to find much, and most of it relates directly to kerning, which is related, but only slightly. pp. 204-205 of The Elements of Typographic Style display part of what appears to be an excellent test file. I wonder if the rest is available.

* Of course, I'm asking 8 years late.
† Is it bad to revive an old thread?
‡ They are slightly inadequate for spelling in that many words are missing, for instance those with prefixes or suffixes added.

hrant's picture

To me there's nothing better than reviving an old thread. Recently we revived one that was about 10 years old!

The Brown corpus must be available online. It's a bit long in the tooth, but it's very large, so it still gets respect. That said, I for one would love to hear of alternatives, especially for non-Latin scripts.
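
One convenient way to get it these days, assuming the NLTK package is installed (pip install nltk), is to dump it to plain text and feed that to the scripts above - just one route among many:

import nltk
nltk.download("brown", quiet=True)
from nltk.corpus import brown

# Write the roughly 1M-word Brown corpus out as a single plain-text file.
with open("brown.txt", "w", encoding="utf-8") as f:
    f.write(" ".join(brown.words()))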

About Héctor's post: during one of Typophile's past system transitions something went wrong, possibly having to do with how upper-ASCII(?) characters are handled, and some posts just got truncated. I don't know whether the text data is actually gone, or whether it's still there but just isn't being rendered.

hhp

rs_donsata's picture

The post was from 2004. I was fresh out of university and wanted to do a thesis on type design for Spanish, under the assumption that there could be language-specific characteristics, such as letter-pair frequency, that could be put to practical use.

Well, the topic was too big; it ended up as just a rough sketch of a thesis.

I remember concluding that the details were probably irrelevant to improving readability, but that there were enough aesthetic opportunities in the design of diacritics, punctuation marks and ligatures.

brianskywalker's picture

Right, and those aesthetic opportunities are what I'm interested in. But I do think legibility can be improved slightly by looking at the letter pairs that actually occur - at the least, one can solve collision problems, which can be distracting. In a way that too is aesthetic; but then readability is at least a little bit aesthetic.

hrant's picture

Please stop reading my mind. :-)

hhp
