character frequencies?

paul d hunt's picture
Joined: 5 May 2005 - 8:44pm

Does anyone know of a good resource for character frequency information for various languages? Preferably online, but anything else is fine too.
I have found a few sites, like this: letter frequencies (rankings for various languages). But I'm looking for something fairly extensive, perhaps with statistics, and definitely Unicode compliant. I've also found a tool that compiles letter frequencies. Has anyone used this, or does anyone know of a better one that might do what I've outlined above (generate character frequencies and stats, keeping Unicode intact)? Just thought I'd put a few feelers out there...
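
For anyone who wants to roll their own in the meantime, here is a minimal sketch of the kind of Unicode-aware counter described above. It is not the tool linked in the post. The function name `char_frequencies` and the `letters_only` switch are made up for illustration, and it assumes the input is already decoded text that should be NFC-normalized, so that a base letter plus a combining accent counts as one character wherever a precomposed form exists.

```python
import unicodedata
from collections import Counter


def char_frequencies(text, letters_only=True):
    """Return {character: relative frequency}, most frequent first.

    Hypothetical helper for illustration, not part of any existing tool.
    """
    # NFC composes sequences such as "e" + U+0301 into a single "é"
    # when a precomposed code point exists.
    text = unicodedata.normalize("NFC", text)
    if letters_only:
        # General categories starting with "L" are letters (Lu, Ll, Lo, ...).
        chars = [c for c in text if unicodedata.category(c).startswith("L")]
    else:
        chars = list(text)
    counts = Counter(chars)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {c: n / total for c, n in counts.most_common()}
```

Calling `char_frequencies(open("sample.txt", encoding="utf-8").read())` on a file of your own would give the ranking for that text. Whether to fold case or keep digits and punctuation depends on what the numbers are for, so those choices are left to the caller.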

Bert Vanderveen's picture
Joined: 13 Jun 2004 - 8:19am

Didn't Luc(as) de Groot collect extensive information regarding this? Couldn't find it on his site, but you could contact him directly.

. . .
Bert Vanderveen BNO

Gus Winterbottom's picture
Joined: 19 Oct 2006 - 11:46am

You could try the codebreakers at the NSA. No, seriously -- this NSA website lists a number of declassified printed documents you might be able to request regarding letter or digraph frequencies in Russian, French, Polish, and Japanese.

This site has a letter frequency list for English, French, German, and Spanish, and Wikipedia adds Esperanto. And Appendix A of Army field manual 34-40-2, Basic Cryptanalysis, has a list of English digraph frequencies. The whole manual is available as a zipped collection of PDFs here.

(Later edit: I also found this NSA document (PDF), an introduction to cryptanalysis, that has some interesting information on pages 11 through 17. Unfortunately, it's a book that was scanned into PDF, and dates back to 1938, so it's not likely to be Unicode friendly -- but it does have a claim to being authoritative.)

Simon Daniels's picture
Joined: 11 Apr 2002 - 6:37pm

>Didn’t Luc(as) de Groot collect extensive information regarding this?

Luc's data extends to common pairs for various languages. Helps plan kerning.

Cheers, Si
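
Pair data of the sort mentioned above can be approximated from any text sample. The following is only a sketch and not a description of how Luc's data was gathered. The name `pair_counts` is hypothetical, and the code assumes that only letter-to-letter pairs inside a word are of interest. For real kerning work you would probably also want pairs involving punctuation and spaces.

```python
import unicodedata
from collections import Counter


def pair_counts(text):
    """Count adjacent letter pairs that occur inside words.

    Hypothetical helper: anything that is not a letter resets the pair,
    so pairs never span a space or punctuation mark.
    """
    text = unicodedata.normalize("NFC", text)
    counts = Counter()
    prev = None
    for ch in text:
        if unicodedata.category(ch).startswith("L"):
            if prev is not None:
                counts[prev + ch] += 1
            prev = ch
        else:
            prev = None  # word boundary: start a new run of letters
    return counts
```

Case is kept as-is, since "AV" and "av" are different kerning problems. `pair_counts(text).most_common(50)` gives the pairs to look at first.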

Linda Cunningham's picture
Joined: 26 Jul 2006 - 3:55pm

When I lived in DC, a friend of mine worked for the NSA, and their material isn't released until, as has been noted, a substantial number of years after it's been collected, so it's probably not all that useful.

Russell McGorman's picture
Joined: 25 May 2006 - 10:01am

I doubt the language would have changed all that much.

-=®=-

Linda Cunningham's picture
Joined: 26 Jul 2006 - 3:55pm

You'd be surprised -- between the start of WWII and now, there's been some serious Anglicization of most other languages in the world, and that radically alters character frequency.

(Except for France, of course, where they are quite rude wrt "English" invading "their" language. Fold them in with many non-Latin languages that are using English words written in their own character forms and all bets are off....)

Tim Ahrens's picture
Joined: 28 Sep 2004 - 9:15am

I have done some extensive analysis in this field. The texts generated by my test text generator are synthesized on the basis of triplet frequency lists, which I obtained by a very thorough analysis of texts.
For example, for English, as an input I used several texts, 5 literature, 5 scientific and 2 economic, each 5-10 MB in size. In the end, I did not take the arithmetic average but something similar to the median so as to make sure subject-specific key words in a certain text do not spoil the overall result.
As you can see, I have data for 22 languages but some of them are only based on 4-5 texts.
I could convert my frequency lists to character frequencies or pair frequencies if there is a general interest. Btw, why were you interested in the first place, Paul? What would you use them for?
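
The approach described above (per-text triplet frequencies, a median-like combination across texts, then a conversion down to character frequencies) might look roughly like the sketch below. This is an illustration under stated assumptions, not the actual code behind the test text generator. All three function names are invented, the plain median stands in for the unspecified "something similar to the median", and triplets here include spaces and punctuation, which may differ from the original analysis.

```python
from collections import Counter
from statistics import median


def trigram_freqs(text):
    """Relative frequency of every 3-character window in one text."""
    counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
    total = sum(counts.values())
    return {t: n / total for t, n in counts.items()} if total else {}


def median_freqs(per_text):
    """Combine per-text frequency dicts by taking the median per key.

    A triplet missing from a text counts as 0.0 there, so a key word
    that is common in only one text ends up with a low combined value.
    """
    keys = set().union(*per_text)
    return {k: median(f.get(k, 0.0) for f in per_text) for k in keys}


def char_freqs_from_trigrams(tri):
    """Approximate character frequencies from triplet frequencies.

    Sums each triplet's weight onto its first character, then
    renormalizes (medians across texts need not sum to 1).
    """
    weights = Counter()
    for t, f in tri.items():
        weights[t[0]] += f
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()} if total else {}
```

Usage would be along the lines of `char_freqs_from_trigrams(median_freqs([trigram_freqs(t) for t in texts]))`. Summing on the first character ignores the last two characters of each text, which is negligible for inputs of several MB. Pair frequencies could be derived the same way by summing on the first two characters.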

Dan Reynolds's picture
Joined: 20 Jul 2002 - 11:00am

Paul, you can use Typotheque's Letter Frequency Meter to an extent, even with non-Latin scripts. The first column of the results it gives seems to me at first glance to list glyph occurrences correctly in any Unicode-encoded text. It is Mac only, but you could get on one of your roommates' machines…