Resources on letter pair/diacritic pair frequency?

agisaak's picture

Does anyone know of any resources which would provide information on which letter *pairs* frequently occur in different languages? It would be especially useful if this included information on diacritics.

I'm currently dealing with some consonant-vowel ligatures, and want to figure out if there are diacritical combinations which can be safely omitted. I'd tried googling for various diacritical combinations, but the useful data ends up buried amid results drawn from a miscellany of legacy CJK encodings.

André

TheOtherNick's picture

Chthonic, Django Reinhardt, Jzanus, Ljubljana, llama... you get the idea: too many possibilities...

Michel Boyer's picture

One source of information I have used for similar purposes is the OpenOffice dictionaries and other dictionaries for spelling checkers. For instance, if you click on the link for Canadian English (zip file) you get a folder containing a file with extension .dic with 62341 entries (including "derived" entries). Other dictionaries can be much larger. The .dic file is plain text. If you remove what follows the slash after each word, you get a file on which you can run programs to extract pairs, count them, etc. Of course, that gives no information on the frequency with which those pairs occur in actual texts, but it does tell you which pairs are possible in the language you chose. Some dictionaries are utf-8 encoded, others are latin1, and so on. The encoding is given on the first line of a second file with extension .aff. Some programming ability is thus required.
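Just as a rough sketch of that processing (the file name en_CA.dic and the utf-8 encoding here are only assumptions; the actual encoding is on the first line of the .aff file), a few lines of Python are enough to strip the affix flags and tally the letter pairs:

# sketch: tally adjacent letter pairs in an OpenOffice .dic word list
# (en_CA.dic and utf-8 are assumptions; check the .aff file for the real encoding)
import codecs
from collections import Counter

pairs = Counter()
dic = codecs.open("en_CA.dic", "r", "utf-8")
dic.readline()                          # the first line is just the number of entries
for line in dic:
    word = line.split("/")[0].strip()   # drop what follows the slash
    for i in range(len(word) - 1):
        if word[i].isalpha() and word[i+1].isalpha():
            pairs[word[i] + word[i+1]] += 1
dic.close()

for pair, count in pairs.most_common():
    print("%s\t%d" % (pair, count))

That gives a raw list of the possible pairs, with how many dictionary entries contain each.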

agisaak's picture

Well, yes, there will be lots of possibilities, but some pairs are still going to be cross-linguistically more common than others, and diacritics which are not commonly used may not occur adjacent to others -- for example, I *think* that if one had an sa ligature in a font, it would be more important to also implement sä than șä. But I'm basing that on the fact that ä doesn't occur in Rumanian and AFAIK that's the only language which uses ș. Even within a language which contains a variety of diacritics, it's not necessarily the case that all of those diacritics will occur adjacent to one another, and while it's relatively easy to find information on which diacritics are used in which languages, I haven't found information on diacritic pairs.

André

agisaak's picture

Thanks Michel -- I'd tried using the Mac OS built-in dictionary for those languages I've installed, but it doesn't support wildcards (or if it does, the asterisk isn't used for this). Never thought, though, to try opening the actual file (a senior moment).

André

Michel Boyer's picture

I use terminal windows and unix utilities to find those files and process them. Maybe you can do better with Mac utilities, I don't know. For dictionaries installed by Firefox, I type the command "cd $HOME/Library/Ap*ort/Firefox" in a terminal window and then

find . -name "*.dic"

gives me the list of those dictionaries. They can be copied into some temporary folder and batch processed.

Michel

blank's picture

I always wish there were some linguistics textbook that covered this stuff. Maybe Steve Peters will chime in here with some help.

If you have time to figure out the syntax to sift through text file wordlists it’s pretty easy to put this stuff together using Python or just Bash scripting (e.g. grep "öö" file.txt | wc -l). The OpenWall wordlists disk is worth its low price if you don’t need to analyze actual text. Ask around in the netsec world and I’m sure even more dictionaries exist. Project Gutenberg and similar resources probably have real texts covering many of the languages you need to analyze.
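For example, a tiny Python sketch like this one (file.txt and the öö pair are just placeholders) counts how many times a given pair actually occurs in a utf-8 text, rather than how many lines contain it:

# sketch: count occurrences of one specific letter pair in a utf-8 text
import codecs

pair = u"\u00f6\u00f6"   # the pair to look for, here "öö" (placeholder)
text = codecs.open("file.txt", "r", "utf-8").read()
print(text.count(pair))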

Michel Boyer's picture

I have somewhere a Python script that counts bigrams in a utf-8 encoded source. To get the list of words, I just use "awk 'BEGIN{FS="/"}{print $1}' *.dic". If that can be useful, I'll try to find the script; it's just a few lines of code.

agisaak's picture

James wrote: I always wish there were some linguistics textbook that covered this stuff.

Linguistics texts generally aren't that concerned with orthography, so this isn't a likely source. You'll find lots of information on the pairings of various sounds, but any statistics presented will likely involve IPA rather than orthographic representations.

André

gaultney's picture

Frequency analysis is what you really need - a dictionary would not be enough. This would require some long texts in all the languages of interest. I don't know of a good general source for these, but someone must have compiled such collections.

Some years ago Luc(as) de Groot (http://www.lucasfonts.com/) did some good work on compiling resources for kerning and building some tools for it. I think he called it Kernologica. He should be able to point you in some useful directions.

Michel Boyer's picture

Frequency analysis is what you really need.

Most obviously. To get frequencies (absolute or relative) of bigrams, all you need is a very basic script that can be run on some utf-8 encoded input. To get such a script (for alphabetic bigrams), just copy what is between the cut lines below and paste it into a terminal window; you will get an executable file named bigrams in your current folder.

----
cat >bigrams <<'EOF'
#!/usr/bin/python

# M. Boyer 2009
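
# This script reads the utf-8 file given as its argument, tallies every
# pair of adjacent alphabetic characters, and prints one
# "pair;count;percentage of all bigrams" line per pair found.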

import codecs, sys
infile=codecs.open(sys.argv[1],"r","utf-8")
text=infile.read(); infile.close()

tallies={};  nbdata=0;  prev=' '
def tallyq(c):
  return c.isalpha()

for char in text:
  if (tallyq(prev) and tallyq(char)):
    datum=prev+char # ; datum=datum.lower()
    nbdata=nbdata+1
    if datum in tallies:
      tallies[datum]=tallies[datum]+1
    else:
      tallies[datum]=1
  prev=char

for d in tallies:
  print('%s;%d;%.3f%%' %
     (d.encode('utf-8'), tallies[d], 100.0*tallies[d]/nbdata))
EOF
chmod 755 bigrams
----

Then you decide what you want to run it on. For instance, if you want to run it on Chekhov's text Дама с собачкой (The lady with the little dog), you can type (or copy and paste) the line

   lynx -dump http://lib.ru/LITRA/CHEHOW/d.txt > dama.txt

and then run (maybe after removing some html references at the bottom)

   ./bigrams dama.txt | sort

Here is a copy-paste of part of the output:

то;372;1.927%
тп;1;0.005%
тр;85;0.440%
тс;31;0.161%
ту;26;0.135%
тф;2;0.010%
тх;1;0.005%
тч;7;0.036%

There were 372 occurrences of то, which represents 1.927% of all bigrams (after cleaning the text).

With the internet, there are now many sources of texts in all languages. There is also nothing to prevent you from running the script on a dictionary to find the possible combinations; in that case you don't really need the frequencies, but it may still be interesting to see which words contain the bigrams with very low frequencies. A simple grep answers the question.
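If you prefer to stay in Python, a short sketch along these lines does the same job (the file name words.txt and the pair тх are only placeholders):

# sketch: list the dictionary words that contain a given (rare) bigram
import codecs

bigram = u"\u0442\u0445"   # the pair to look for, here the rare тх (placeholder)
for line in codecs.open("words.txt", "r", "utf-8"):
    word = line.strip()
    if bigram in word:
        print(word)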

Michel

[added] I guess the Mac does not come with lynx installed. I must have installed it myself. That example may be more for Linux than Mac users. Sorry.

ebensorkin's picture

Nice!

kaibernau's picture

Ohai.

LetterMeter, from Peter Bilak and Just van Rossum, can analyze a text for single-letter and letter-pair occurrence. Then it is just a matter of feeding it the texts you deem appropriate.

Says the website:
LetterMeter is a text analysis tool, used in the Type&Media classes (postgraduate course of type design) at the Royal Academy of Art in The Hague. LetterMeter is designed for comparing multilingual texts and measuring the frequency of particular glyphs.

Because it is Unicode based, it will work with the majority of languages. The current version will recognize Latin, Greek and Cyrillic glyphs, and sort them according to their formal attributes. LetterMeter's results include statistics for the incidences of round/square/open/diagonal left and right sides of glyphs, ratios of vowels/consonants, and counts of glyphs with accents, ascenders and descenders, in any given text(s).

LetterMeter was developed jointly by Peter Bilak and Just van Rossum, whom I would like to thank for the Python programming. Vera Evstafieva helped with the Cyrillic specifications, and Panos Haratzopoulos with the Greek.

LetterMeter is created using Python and works only on Mac OS X. Although it is available for free, it is copyrighted, and you may not redistribute it. All rights reserved, © 2003, Peter Bilak, Just van Rossum.

Download it from Typotheque.

Michel Boyer's picture

Here is another tool I made using the above code (I replaced semicolons by tabs, and added basic choices). It can be used from absolutely any computer (well... you tell me if it works on an iPhone). Link.

On a PC, if you save the resulting statistics as a text file, you can then import it into Excel for further processing. On the Mac, I have found no way to import utf-8 text into Excel. Hard to believe!

Michel

qu1j0t3's picture

Here are some results I got for English. http://groups.google.ca/group/comp.lang.postscript/msg/34c2bb049b42f668?...

I used to use a C program to count the most common digrams, then augment the list with punctuation, to generate kerning-pair lists for URW Kernus.

Michel Boyer's picture

I guess there are indeed good references for English.

Before continuing, let me say that Lynx for Mac OS X can be downloaded from http://www.apple.com/downloads/macosx/unix_open_source/lynxtextwebbrowse.... To use it at the command line, you add /Applications to your path. I assume this is done, and that "Terminal > Window Settings > Display" is set to Unicode (UTF-8). What follows is then good for Linux and Mac users who are used to unix commands.

Now, some digrams may cause more than kerning problems. For instance, in the Typophile thread f + umlauts, Florian Hardwig mentions that the digrams fä, fö and fü may cause a clash between the umlauts and the f. Those combinations occur in German. How often? Let's check.

On the Project Gutenberg Catalog, I find Kant's Kritik der reinen Vernunft. On that page I see no html version and no utf-8 version, only a plain-text iso-8859-1 file. If I right-click the "main site" link and copy the address, I find that the iso-8859-1 text has the url

http://www.gutenberg.org/dirs/etext04/8ikc210.txt

I will thus need to tell lynx to expect iso8859-1 text; I will save the result in kritik.txt as follows (on the command line):

lynx --dump -assume_charset=ISO8859-1 http://www.gutenberg.org/dirs/etext04/8ikc210.txt > kritik.txt

The resulting file kritik.txt now contains the utf8 text (lynx did the reencoding).

Now I look at the digrams in kritik.txt. I do not try to be efficient; the bigrams code above is not either, and as long as I get my answer in reasonable time, that's fine with me. I'll just find all bigrams in the text and then egrep those containing fä, fö or fü (I replaced the semicolons by tabs in the bigrams code):

./bigrams kritik.txt | egrep "f[äöü]"

and I get the output

fö 27 0.003%
fü 697 0.079%
fä 255 0.029%

which means there is a total of 27+697+255 = 979 possible clashes in Kant's text. My library copy of the book is 847 pages, so on average that is more than one possible clash per page. A few simple and inefficient scripts, unix commands and pipes often give answers faster than sophisticated programs.

Michel
