Adobe Devanagari Font

Uli's picture

It seems that the Adobe Devanagari font was not yet discussed at Typophile.

I had a closer technical look at AdobeDevanagari-Regular.otf, version 1.105 (2011), and here are my findings:

1. The set of Latin diacritics required for transliterating Indic (Hindi, Sanskrit, etc.) texts is incomplete. The diacritic for "sh" (both lowercase and uppercase), very frequently used in Indic words, e.g. in Shiva etc., is missing.

2. Many frequently used ligatures are missing, even ligatures which have a frequency of much more than 0.01 %, e.g. "ddhv" (frequency 0.215 %).

For details see http://www.sanskritweb.net/itrans/adobe-ligatures.pdf

For comparison see http://www.sanskritweb.net/itrans/itmanual2003.pdf (page 29 seq.)

3. The Adobe Devanagari font does not work with older Windows and older Word.

For example, Adobe Devanagari does not work with old Microsoft Word, version 10, in conjunction with Windows XP.

For comparison, Mangal and all the other Devanagari Unicode fonts known to me work with older Word and older Windows, provided the Uniscribe system library for foreign language support was installed with Windows.

John Hudson's picture

The first thing that should be noted is that Adobe Devanagari was designed specifically for modern Hindi use, and not for Sanskrit; it may be of limited use even for other modern languages such as Marathi and Nepali. The design brief was specifically to target use of Hindi in a modern business environment (the font was originally made to bundle with Acrobat), and not scholarly use.

1. That could be an oversight. Thanks for bringing it to our attention.

2. See note above re. target language support. The ligature set was based on a mixed approach: referencing the set employed in Linotype Devanagari (on which Fiona also worked) and also frequency analysis of modern Hindi text.

3. Correct. The font uses only the newer 'dev2' script tag and layout behaviour, not the older, deprecated shaping. This is what Adobe spec'd. Hence the font will not work in pre-Vista versions of Windows or other environments that only support 'deva'.

Uli's picture

A) > Adobe Devanagari was designed specifically for modern Hindi use, and not for Sanskrit

That's okay, but a circumspect designer could, by adding only a few ligatures, make the font suitable for Marathi and Sanskrit as well. For example, only 11 additional ligatures would be required to make the Adobe Devanagari font suitable for Classical Sanskrit (as opposed to Vedic Sanskrit).

see file adobe-ligatures.pdf (see the missing ligatures marked with "!!!")

B) > That could be an oversight

The oversight was due to the fact that most Indic diacritics are in the Unicode range 1E0C through 1E96, with the exception of

015A Sacute
015B sacute

which are used in Polish texts, but which happen to be used also as Indic diacritics.
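
A quick way to see which transliteration letters fall outside the range quoted above is to test each code point directly. The sketch below assumes the usual lowercase IAST inventory (an illustrative list of my own, not one taken from this thread), and `outside_block` is just a hypothetical helper name.

```python
import unicodedata

# Assumed lowercase IAST letters with diacritics (illustrative list, not from the thread):
# a/i/u with macron, r/rr/l/ll with dot below, m and h with dot below, n with dot above,
# n with tilde, t/d/n with dot below, s with acute, s with dot below.
IAST_LETTERS = (
    "\u0101\u012B\u016B\u1E5B\u1E5D\u1E37\u1E39\u1E43"
    "\u1E25\u1E45\u00F1\u1E6D\u1E0D\u1E47\u015B\u1E63"
)

def outside_block(text, lo=0x1E0C, hi=0x1E96):
    """Return the characters of text whose code points lie outside lo..hi."""
    return [ch for ch in text if not lo <= ord(ch) <= hi]

for ch in outside_block(IAST_LETTERS):
    print(f"U+{ord(ch):04X} {ch} {unicodedata.name(ch)}")
```

On this assumed list, s with acute is not the only letter outside the range: the macron vowels and n with tilde also live in the older Latin blocks, so a coverage check probably should not rely on the 1E0C through 1E96 range alone.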

C) > The design brief was specifically to target use of Hindi

If that is the case, then design errors must have been made.

(Cave: I am not a Hindi expert, so a scholar should check the following.)

(a) On the one hand, it seems to me (see cave above) that the Adobe Devanagari font INCLUDES ligatures, which are (to my knowledge) NOT used in Hindi (and also NOT used in Sanskrit etc.), for example "kspr":

see http://www.sanskritweb.net/temporary/kspr-kspl.jpg

Did the Adobe designers invent the ligature "kspr" just for fun?

(b) On the other hand, it seems to me (see cave above) that the Adobe Devanagari font LACKS ligatures, which are (to my knowledge) USED in Hindi, for example "dg" in the Hindi word "khadga" (= "sword" in English):

see http://www.sanskritweb.net/temporary/khadga.jpg

In the Oxford Hindi-English dictionary by McGregor published in 1993, the word "khadga" was typeset with virama, i.e. without ligature, due to the fact that professional Devanagari fonts, such as Siddhanta.ttf and Sanskrit2003.ttf, were not available at that time. But today in the Unicode era, it is possible to design professional Devanagari fonts suited for Hindi.

I wonder whether a Hindi scholar ever checked the Adobe Devanagari font for ligatures
(a) required for Hindi and
(b) not used in Hindi

Instead of including ligatures required for Hindi, the Adobe Devanagari font includes a plethora of fancy ikara variants which would only make sense in a font containing the required ligatures.

see http://www.sanskritweb.net/temporary/ikara.jpg

Przemysław's picture

> The font uses only the newer 'dev2' script tag and layout behaviour, not the older, deprecated shaping.

Why then does it have the "script deva;" declaration for nukt, akhn etc. etc.?

> 015A Sacute
> 015B sacute

What? The font has both.

Michel Boyer's picture

Uli, you mention McGregor. Here is the list provided by McGregor in his Outline of Hindi Grammar, 1972.




I don't see your ligature.

Uli's picture

Mr. Przemysław:

>015A Sacute, 015B sacute. What? The font has both.

But they do not show up on my old Windows XP machine.
With Windows XP, the Adobe Devanagari font has quirks.
Other diacritics show up on my machine, but for S/s acute,
Wordpad of Windows XP defaults to Verdana (see PDF file).

Mr. Boyer:

> I don't see your ligature.

I have this book too, but which ligature do you mean?

Michel Boyer's picture

> which ligature do you mean?

The one in "sword", "khadga" as you wrote it. Here is a grab of your link:

By the way, I doubt the font did not have it, because this one was available:

Uli's picture

Mr. Boyer:

dga and nga are two different conjuncts
(watch out for the "dot" to the right of the
Devanagari glyph). Both Devanagari letters
d and n only differ by the "dot" to the right.
nga is available in Adobe Devanagari.

In the McGregor textbook cited by you,
see page xxiii.

Michel Boyer's picture

Herrn Ulrich Stiehl,

I know very well that those two are different. I was pointing out that one (nga) is listed among the commonest conjuncts, and thus was available in the font, while the one you point out (dga) is not listed. I see no reason why a font used in the seventies that had nga would not also have dga, yet dga was not listed by McGregor. I conclude that the omission was intentional on McGregor's part (and not due to a missing glyph).

Uli's picture

Mr. Boyer:

> commonest conjuncts

As far as Sanskrit is concerned, dg has a frequency of 0.167 %, which means that it is very common in Sanskrit.

As far as Hindi is concerned, you will have to supply your own frequency research results, because I am no Hindi expert.

If we discuss the Adobe Devanagari font on the basis of the "commonest conjuncts", I think we should forget the Adobe font, because who wants to use a font which includes only the commonest conjuncts? "Mangal" was such a makeshift font including only the "commonest conjuncts".

Michel Boyer's picture

I am no expert of Hindi either, and I don't know where I could find a good corpus of Hindi texts for reliable statistics. The best I could do is to download the aspell Hindi dictionary from ftp://ftp.gnu.org/gnu/aspell/dict/0index.html, untar it with

   tar -jxpvf aspell6-hi-0.02-0.tar.bz2

and, in the directory aspell6-hi-0.02-0, execute

   preunzip hi.cwl

to get the utf-8 encoded dictionary hi.wl. It contains 83514 entries, which compares favorably to my en_US dictionary (62115 entries) or my French dictionary (61305 entries). Here is now a trace of execution (on OS X 10.6)

512 % grep ड्ग hi.wl
खड्ग
खड्गकोश
खड्गधारी
खड्गधेनु
खड्गमुष्टि
खड्गी
षड्ग
षड्गुण
513 %

Those are all the entries containing the ड्ग combination. I guess there are contexts where those entries can be used more frequently than others but I doubt a representative corpus would give the combination ड्ग a very high frequency.
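
For readers without grep at hand, the same search can be sketched in Python. The three sample words below are typed in by hand for illustration (two of them appear in the grep output above); `entries_containing` is a hypothetical helper name, and nothing here reads the real hi.wl file.

```python
def entries_containing(words, sequence):
    """Return the words that contain the given character sequence."""
    return [w for w in words if sequence in w]

# d + virama + g, the sequence searched for above
DG = "\u0921\u094D\u0917"

# Hand-typed sample, standing in for the dictionary file
sample = [
    "\u0916\u0921\u094D\u0917",              # khadga
    "\u0937\u0921\u094D\u0917\u0941\u0923",  # shadguna
    "\u0924\u0932\u0935\u093E\u0930",        # talvar
]

for word in entries_containing(sample, DG):
    print(word)
```

To run it on the real word list, one would presumably pass `open("hi.wl", encoding="utf-8").read().split()` instead of `sample`.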

Uli's picture

Thanks for performing the search for "dg".

Now search for ligature "kspr", and you will see that it does not occur at all.

Michel Boyer's picture

If you mean क्स्प्र then you are right.

Uli's picture

> If you mean क्स्प्र then you are right.

That is what I mean. The Adobe Devanagari font contains conjuncts which cannot occur in Hindi for linguistic reasons. In Sanskrit, theoretically, "kspr" could occur by combining a word ending in "k" with a word beginning with "spr", just like combining in English "ink" with "spray" to "inkspray" which would/could contain the "kspr-conjunct" (in-kspr-ay). But these are "wordplays" which do not have a linguistic basis. Therefore, I think, several conjuncts contained in the Adobe Devanagari font were invented just for fun.

John Hudson's picture

Some of the conjuncts in the Adobe Devanagari font were inherited from the Linotype Devanagari font, which was used a lot in newspapers and whose glyph set was in part developed in response to requests from newspaper editors in India. It is entirely possible that some conjuncts were intended for commonly transliterated words from other languages.

I do think a more systematic, analytical approach should be taken to defining the conjunct set for Hindi fonts, as Uli has done for Sanskrit. In the case of the Adobe Devanagari, we compiled a set from a variety of sources. We had intended to cover Marathi too, but the development schedule wasn't very long because it needed to be bundled with Acrobat, so we had to prioritise for Adobe's principal targets. We did do testing of extended Hindi texts -- including an entire novel -- and found the font to perform well.

Uli, it is worth bearing in mind that modern Hindi typography has broken quite significantly with the Sanskrit tradition. I understand your comment re. fonts that support only the more common conjunct ligatures, but that seems to me very much the perspective of a Sanskritist: Hindi readers have been used to extensive use of half forms and, for some letters, even explicit halants ever since the hot metal typesetting days. Indeed, when Fiona first reintroduced some of the conjunct ligature forms in the digital LT Devanagari (for the Linotron 202) there was resistance from some Indian customers to what were perceived as 'Sanskrit' forms. A lot of the custom encoding solutions for Hindi fonts, e.g. the Modular Infotech system, remain based around extensive half form use. None of this is to say that we couldn't, with more time and research, have come up with a better Hindi conjunct set for Adobe Devanagari, but that the expectation that any given conjunct, however rare, should be presented in its ligature form is not one that Hindi readers share with you.

Thank you, by the way, for the very helpful review documents, which I'll certainly examine carefully and I hope will help us to do better in future.

John Hudson's picture

I've confirmed that the ś diacritic is indeed in the Adobe Devanagari fonts, as Przemysław reports. I've no idea why it isn't displaying in your WinXP environment, Uli. Have you checked to see whether it displays correctly with other PostScript OT fonts?

Uli, can you share some information about the corpus on which you based your frequency analysis? This looks like very useful information, and I'd like to get an idea of the scope of the analysed text.

Michel Boyer's picture

> Michel, I'm working on another Hindi font project at the moment and wonder if you might be able to assist me with some conjunct frequency analysis?

That sounds interesting. I'd be glad to help.

John Hudson's picture

Thanks, Michel. I'll contact you with more details.

Uli's picture

1) Mr. Boyer:

I discovered that Edwin Greaves, in his 1921 Hindi Grammar

see http://archive.org/details/hindigrammar00greauoft

reckons the ligature "dg" as a "principal compound".

see http://www.sanskritweb.net/temporary/dga.jpg

This book by Greaves has also an interesting introduction describing "High Hindi" as a language "for those who delight to cram their pages with high-sounding Sanskrit words" (see page 4 of the scanned book). But this was 90 years ago.

2) Mr. Hudson:

I did not say that I have "the expectation that any given conjunct, however rare, should be presented in its ligature form". On the contrary, I said that "only 11 additional ligatures would be required to make the Adobe Devanagari font suitable for Classical Sanskrit".

What I criticize is that the Adobe Devanagari font contains many extremely infrequent and perhaps entirely unattestable ligatures, whereas this font lacks very frequent ligatures, at least as far as Sanskrit is concerned.

3) Mr. Boyer and Mr. Hudson:

I made a complete compound analysis of the Adobe Devanagari font, which will be of interest to Mr. Boyer and to Mr. Hudson:

http://www.sanskritweb.net/itrans/adobe-ligatures-analysis.pdf

The frequency counts are based on Sanskrit texts developed for Sanskrit fonts, but these statistics may be also of help for Hindi-only fonts.

The most infrequent or rarest, though attestable, Sanskrit compounds have a frequency of 0.001 %. This means that in 100000 (one hundred thousand) lines of Sanskrit texts, such a compound occurs only ONCE on average.

Now, if Mr. Boyer applies his Gnu Hindi dictionary word count utility to the compounds listed on pages 27 through 37 of the above file (adobe-ligatures-analysis.pdf) containing the rarest or most infrequent Sanskrit ligatures (0.001%), I predict that Mr. Boyer is sure to discover that innumerable ligatures (compounds), which are indeed contained in the Adobe Devanagari font cannot be attested (i.e. located or found) in his Hindi dictionary. If this be true (let's see), this would mean that the Adobe Devanagari font contains innumerable unattested compounds (ligatures) with a frequency of 0.000 % (nil, nothing).

John Hudson's picture

Uli, as noted, Sanskrit was not a target language for the Adobe Devanagari font. It is entirely possible that Adobe will want to extend its glyph set coverage for other languages, but I suspect that modern languages such as Marathi, Konkani and Nepali would be their priorities. I'm a little surprised that you reckon only eleven more conjunct ligatures would be needed for classical Sanskrit; I would think more.

Fiona confirms that some of the conjuncts inherited from the Linotype set were for transliteration of foreign words; we included them in Adobe Devanagari because the Linotype fonts are a recognised standard among some Indian customers, and to deviate from the set would be to invite criticism and require explanation. Instead, we get criticism from a German Sanskritist and have to explain to him. :)

The core sets on which the Adobe Devanagari set is based were the Linotype Devanagari (for compatibility with a perceived 'standard') and that provided by Rupert Snell in his Hindi grammar, which is the most up-to-date source for the modern language. Recently, we had occasion to add some extra variants for one of Adobe's customers who wanted different forms for certain conjuncts.

Greaves' comment about 'High Hindi' is worth bearing in mind. You cite खड्ग as an example of the 'dg' conjunct in Hindi, but the common modern Hindi word for sword is तलवार.

Clearly Michel is not going to find 'innumerable' conjuncts in Adobe Devanagari that do not occur in Hindi words: he is going to find a precise number, and these will all be conjuncts that also appeared in the Linotype set. Far from being 'invented just for fun', they were included in a set whose size was limited by technical restraints at the request of Linotype's Indian customers, in order to cleanly (without explicit halant) render transliterations of common foreign words in Hindi newspapers. In the context of those newspapers, it is entirely likely that some foreign words would be more common than those Hindi words that contain low frequency conjuncts. [Transliteration of foreign words, especially proper names, was a major factor for newspaper typesetting in the Subcontinent. Full-word ligatures needed to be added to Urdu fonts whenever the Soviet Union had a change of leaders.]

quadibloc's picture

And thus, I take it, the "kspr" ligature saw heavy use in 1995, perhaps?

(This was back in the days when there were two world chess champions, a PCA one as well as a FIDE one... of course, there was also the infamous Fischer-Spassky rematch.)

Uli's picture

From the scholarly point of view, I think that it is ridiculous to invent unique ligatures for proper names.

But from the marketing point of view, I understand that Linotype wants to cash in on selling unique ligatures, when for instance McDonald’s comes along and says:

"We shell out a lot of dough, if you invent for us the unique Devanagari ligature "mcd" and also the unique Devanagari ligature "lds" so that our trademark "McDonald's" has two unique ligatures ("Mcd-ona-lds") for advertisements in Indian newspapers."

hrant's picture

Invention rules.

hhp

John Hudson's picture

Uli: From the scholarly point of view, I think that it is ridiculous to invent unique ligatures for proper names.

From a scholarly point of view, I agree, but these conjunct ligatures were made at the request of newspaper publishers, who are perennially concerned with column width and word length, and for whom being able to reduce the width of commonly occurring words through ligation was a practical benefit (especially if you bear in mind that newsprint was rationed in India for long periods of the 20th Century). Linotype were not 'cashing in on selling unique ligatures': you are entirely misunderstanding the nature of the commercial relationship between newspapers and typesetting machine manufacturers at that time. Newspapers invested in machines, and the makers of those machines provided fonts according to the specifications of the newspaper publishers. It is not as if Linotype invented these conjunct ligatures and offered them for sale to the newspapers; rather, the newspaper publishers and editors who had purchased the Linotype typesetting machinery requested the addition of these ligatures to their fonts.

I think we're at a point in the development of Indic fonts now where we should reasonably consider whether these legacy transliteration ligatures, which were developed at a particular point in time for a particular technology and particular customer base, should be ignored. As I reported earlier, we included them in Adobe Devanagari because of concerns that users in India familiar with the Linotype sets might consider the Adobe set deficient if it did not include them. This sort of thing happens in situations in which the only reference that many people have for judging the quality of a new font is comparison with previous fonts. We encounter this regularly. I think, Uli, that you and Michel are on entirely the right track with frequency analysis: this is a much better basis on which to plan glyph sets, and in line with what we're doing these days (bear in mind that although Adobe Devanagari was only recently released we actually made it more than four years ago).

Michel Boyer's picture

I don't have the Adobe Devanagari font and can't comment on it. Uli's analysis contains 821 compound glyphs. For the aspell dictionary, I made my analysis by first replacing every letter followed by a nukta with the corresponding precomposed character. Out of 475108 bigrams, there were only two occurrences of "u092F nukta"; the counts for the other letters are as follows (number of occurrences in parentheses):

u0915 (316), u0916 (411), u0917 (157), u091C (673), u0921 (1789), u0922 (395), u092B (391)
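
The nukta replacement step can be sketched as follows. This is not the script actually used for the analysis; the mapping covers only the eight base letters mentioned above with their precomposed Unicode counterparts (U+0958 through U+095F), and `precompose_nukta` is a hypothetical name.

```python
NUKTA = "\u093C"

# Base letter -> precomposed letter with nukta (U+0958..U+095F)
PRECOMPOSED = {
    "\u0915": "\u0958",  # ka
    "\u0916": "\u0959",  # kha
    "\u0917": "\u095A",  # ga
    "\u091C": "\u095B",  # ja
    "\u0921": "\u095C",  # dda
    "\u0922": "\u095D",  # ddha
    "\u092B": "\u095E",  # pha
    "\u092F": "\u095F",  # ya
}

def precompose_nukta(text):
    """Replace 'base letter + nukta' with the precomposed letter where one is mapped."""
    out = []
    for ch in text:
        if ch == NUKTA and out and out[-1] in PRECOMPOSED:
            out[-1] = PRECOMPOSED[out[-1]]
        else:
            out.append(ch)  # unmapped bases keep their separate nukta
    return "".join(out)
```

An explicit table is used here because, as far as I know, Unicode NFC normalization does not produce these precomposed letters (they are composition exclusions), so `unicodedata.normalize("NFC", ...)` would leave the sequences decomposed.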

I then just searched for the longest sequences of the form

    letter virama letter virama … virama letter 

and found 627 glyphs to be precompounded to eliminate occurrences of virama that are not at the end of a word. From the analysis, one sees that 211 compounds would be required by only one occurrence in the dictionary and only 356 compounds occur in more than two entries. You can find my results in the files on my blog:
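
The search and the per-entry counts described above can be sketched like this. It is an illustration rather than the original script: it uses the consonant ranges John Hudson gives later in this thread, the four sample words are typed in by hand, and `compound_entry_counts` is a hypothetical name.

```python
import re
from collections import Counter

CONSONANT = "[\u0915-\u0939\u0958-\u095F]"
VIRAMA = "\u094D"

# One or more (consonant + virama) pairs, closed by a final consonant.
# The greedy + makes each match the longest possible sequence.
COMPOUND = re.compile(f"(?:{CONSONANT}{VIRAMA})+{CONSONANT}")

def compound_entry_counts(words):
    """Count, for each compound, the number of entries that contain it."""
    counts = Counter()
    for word in words:
        counts.update(set(COMPOUND.findall(word)))
    return counts

# Hand-typed sample standing in for the dictionary
sample = [
    "\u0916\u0921\u094D\u0917",                    # khadga
    "\u0937\u0921\u094D\u0917\u0941\u0923",        # shadguna
    "\u0930\u093E\u0937\u094D\u091F\u094D\u0930",  # rashtra
    "\u092A\u094D\u0930\u0915\u093E\u0936",        # prakash
]

for compound, n in compound_entry_counts(sample).most_common():
    print(compound, n)
```

A virama at the very end of a word is not followed by a consonant, so it is left out of the matches, which fits the description above.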

/files/compounds_20120812b.pdf

I think I see bugs with vowels (bottom of page 10). I would need the right "regular expression" that characterizes possible conjuncts (or is it a problem with the input dictionary file (txt, 1.67M) ?).

Michel

New version on my blog.

John Hudson's picture

This is very useful, Michel. I think the problems at the bottom of page 10 and on page 11 must be encoding issues in the dictionary: whenever you see Uniscribe inserting a dotted circle into the leftmost column, that indicates a character sequence considered invalid, e.g. a vowel followed by a virama. These entries should be ignored.

I wonder if you might be able to run the same test on the Hunspell Hindi dictionary? And perhaps a combined test of the two dictionaries (having removed duplicate word entries)?

I'd very much appreciate having your analysis in a spreadsheet format: when I try to copy and paste from the PDF, some Devanagari letters show up as unknown characters.

Michel Boyer's picture

The latex source is on my blog for that purpose, utf-8 encoded (I used xeLaTeX).

http://typophile.com/files/compounds_20120812b.txt

It should be as good as a csv if you replace the character & by a comma or a semicolon and remove the \\ at the end of lines. If you want a particular format, send me a message with the specs.
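
The conversion just described could be sketched as below. The exact layout of the table rows in the .txt file is an assumption on my part (cells separated by & and each row closed by \\), the sample row is made up, and `latex_rows_to_csv` is a hypothetical name.

```python
def latex_rows_to_csv(lines, delimiter=";"):
    """Turn LaTeX table rows ('a & b \\\\') into delimiter-separated lines."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line.endswith("\\\\"):
            continue  # assumed: anything without a trailing \\ is not a table row
        cells = [cell.strip() for cell in line[:-2].split("&")]
        rows.append(delimiter.join(cells))
    return rows

# Made-up sample in the assumed layout
sample = [
    "\\begin{tabular}{ll}",
    "\u092A\u094D\u0930 & 42 \\\\",
    "\\end{tabular}",
]
print(latex_rows_to_csv(sample))
```

A semicolon is the default here only because a comma could clash with commas inside a cell; either works if the cells are plain.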

I'll have a look at the Hunspell dictionary.

Michel

Michel Boyer's picture

For the Hunspell dictionary, the file contains only 15990 entries, and at most 343 compounds are found.

/files/hunspell20120812.pdf
/files/hunspell20120812.txt

PS. Here are weird bigrams that gave nuktas that could not be precomposed:

u0924;u093C; 1 occurrence
u092C;u093C; 2 occurrences
u0939;u093C; 1 occurrence

Added: new version on my blog http://typophile.com/node/91095

quadibloc's picture

I, for one, certainly don't dispute the appropriateness of Adobe using real-world marketing considerations as a guide to the design of its fonts.

This is not to say, however, that it would not also be nice, if by the addition of a limited number of additional glyphs, the font could be made usable for other languages using the Devanagari script besides Hindi, to add those glyphs. After all, as things stand now, it should really be called a Hindi font and not a Devanagari font - just as a font not including the characters for typesetting, say, Serbian ought to be called a Russian font and not a Cyrillic font.

Thus, to be able to set Sanskrit after a fashion, and Hindi well - by including ligatures for foreign words commonly transliterated in Hindi - would be good. And the other point raised, that a limited number of glyphs are missing that are at least shown by dictionaries as used by Hindi - if, indeed, the number is limited, then perhaps they should be added as well. This, though, is less important - as here we are talking about ligatures, to which virama is always an alternative.

As for the example above, yes, dga is different from nga, since the former doesn't have a dot - but since omitting a dot is much easier than drawing a new character from scratch, if nga is included, then since dga comes cheaply from a design point of view, unless the size of the font comes at a premium, dga ought to be included if the language could call for it.

So I think that while the criticism of Adobe adding useless ligatures just for fun can be thrown out, some of the other objections seem to have enough merit that, if, on examination, they really do call for nothing more than the addition of a very limited number of glyphs, they could be highly constructive.

Michel Boyer's picture

I added the files with both dictionaries put together. I applied sort -u on the result, to remove duplicates. Some resulting entries look weird, but I could confirm with Google that they exist. I am nevertheless surprised that so many entries (7692) were added to the aspell dictionary that already had 83514 entries while hunspell has only 15990 entries. I don't understand how such a thing can happen.

/files/together20120812.pdf
/files/together20120812.txt

Added: new versions on my blog http://typophile.com/node/91095

John Hudson's picture

Michel, I see some instances in your analysis of letter + virama + independent vowel, e.g ल्उ. I'm somewhat surprised by this spelling, but in any case this would not constitute a conjunct, which is a sequence of consonant letters separated by virama.

Michel Boyer's picture

John, I have no doubt you are right but I prefer having more and clean up than missing some. If I define consonants to be range(0x0915,0x0940) + range(0x0958,0x0960) am I missing something?

John Hudson's picture

The consonant letter ranges for Hindi are 0x0915–0x0939 and 0x0958–0x095F inclusive.
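
To see what the difference between the two definitions amounts to, here is a small sketch comparing the ranges proposed in the question with the inclusive ranges given in this answer. The names `HINDI_CONSONANT_RANGES` and `is_consonant` are made up for the illustration.

```python
# Inclusive ranges, as stated above
HINDI_CONSONANT_RANGES = [(0x0915, 0x0939), (0x0958, 0x095F)]

def is_consonant(ch):
    """True if ch falls in one of the inclusive consonant ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in HINDI_CONSONANT_RANGES)

# The definition proposed in the question (Python ranges exclude the upper bound)
proposed = list(range(0x0915, 0x0940)) + list(range(0x0958, 0x0960))

# Code points the proposal includes that are not consonant letters
extras = [cp for cp in proposed if not is_consonant(chr(cp))]
print([f"U+{cp:04X}" for cp in extras])
```

The second range is the same in both; the first one overshoots by U+093A through U+093F, which include the nukta, the avagraha and vowel signs rather than consonants.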

Michel Boyer's picture

Thanks, that now gives a clean output. New versions of all files are on my blog for all cases (hunspell, aspell and the dictionaries put together).

http://typophile.com/files/hunspell20120812c.txt
http://typophile.com/files/hunspell20120812c.pdf
http://typophile.com/files/aspell20120812c.pdf
http://typophile.com/files/aspell20120812c.txt
http://typophile.com/files/together20120812b.pdf
http://typophile.com/files/together20120812b.txt

The lines causing trouble disappeared and the others were kept.

Michel

Uli's picture

Mr. Boyer:

Thank you very much for compiling the conjunct files.
This will help Mr. Hudson to improve his Adobe font.

I am pleased to see that your PDF files (together20120812b.pdf etc.) were typeset using the great Siddhanta font made by my correspondent Mihail Bayaryn.

see http://www.sanskritweb.net/cakram/index.html
and http://svayambhava.blogspot.de/p/siddhanta-devanagariunicode-open-type.html

For lovers of foreign-language Bibles, I should mention that the Sanskrit version of St. John's Gospel was typeset using Mihail Bayaryn's great font Siddhanta:

see http://www.sanskritweb.net/sansdocs/john.pdf

John Hudson's picture

Uli, if I provide you with a list of conjuncts that do not occur in either the Aspell or Hunspell Hindi dictionaries, would you be able to confirm which occur in your Sanskrit corpus? Of the various attested conjuncts, I am now trying to sort out which are standard Hindi, which would be used for Sanskrit, and which were introduced for transliteration of foreign words.

John Hudson's picture

Michel: I am nevertheless surprised that so many entries (7692) were added to the aspell dictionary that already had 83514 entries while hunspell has only 15990 entries. I don't understand how such a thing can happen.

This troubles me too, and makes me wonder how the dictionaries were compiled. I suspect the Aspell collection might be based on a simple corpus analysis, in which case there is a strong likelihood that it contains transliterated foreign words and, as we can see from your analysis, misspellings and incorrect encodings.

Uli's picture

Mr. Hudson:

"Uli, if I provide you with a list of conjuncts that do not occur in either the Aspell or Hunspell Hindi dictionaries, would you be able to confirm which occur in your Sanskrit corpus?"

Since everything was documented by me, you can quite easily do this yourself.

At my subsite http://www.sanskritweb.net/itrans/index.html#SANS2003

please download http://www.sanskritweb.net/itrans/itmanual2003.pdf

On pages 28 through 42, you will find the complete list of attested Sanskrit compounds sorted by Indic alphabet.

Another document of interest for font developers is the list of attested orthographic syllables sorted by Indic alphabet and contained in the file itmanual2003.pdf, pages 76 through 103. The same list sorted by frequency is contained in another file:

http://www.sanskritweb.net/itrans/ortho2003.pdf

For additional Hindi ligatures, see itmanual2003.pdf, pages 110 through 130.

John Hudson's picture

Thanks, Uli. I'll take a look at your documents and let you know if I have any questions. It would be great to have versions of these tables that could be used as test documents, e.g. plain text files.

Michel Boyer's picture

> I am nevertheless surprised that so many entries (7692) were added

In fact, there were even duplicate entries in the aspell dictionary, so that it is not 7692 but 7819 entries that were added to it by adding entries from the hunspell dictionary. I saved them to a file that I posted on my blog.

http://typophile.com/files/added_20120813.txt

Among those 7819 entries, I can see many proper names, but I have no idea what the proportion is. That there are errors in the aspell dictionary does not explain why it would have missed 7819 entries out of 15990; that is 48.9%. That means that only 51.1% of hunspell entries, i.e. 8171 entries in hunspell, are also in aspell (which contains over 83000 entries).
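
The overlap figures can be checked with a few lines of set arithmetic. This is a sketch, not the commands actually used: `overlap_stats` is a hypothetical helper, and the only real data in it are the two counts quoted above.

```python
def overlap_stats(base_words, extra_words):
    """Compare two word lists: how many entries of extra_words are new to base_words."""
    base, extra = set(base_words), set(extra_words)
    added = extra - base
    return {
        "extra_total": len(extra),
        "added": len(added),
        "shared": len(extra) - len(added),
        "added_pct": round(100 * len(added) / len(extra), 1),
    }

# Checking the reported figures
reported_added, reported_total = 7819, 15990
reported_shared = reported_total - reported_added
added_pct = round(100 * reported_added / reported_total, 1)
shared_pct = round(100 * reported_shared / reported_total, 1)
print(reported_shared, added_pct, shared_pct)
```

With the two dictionaries loaded as lists of words, `overlap_stats(aspell_words, hunspell_words)` should give the same kind of summary, provided both lists are normalized in the same way first.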

Uli's picture

Mr. Hudson:

"Thanks, Uli. I'll take a look at your documents and let you know if I have any questions. It would be great to have versions of these tables that could be used as test documents, e.g. plain text files."

As a companion to adobe-ligatures-analysis.pdf (see above), I uploaded the file

http://www.sanskritweb.net/itrans/adobe-ligatures-analysis.txt

containing nothing but the Devanagari ligatures in plain 16 bit Unicode encoding.

This 16-bit Unicode file has the following internal structure:

FF FE - (16 bit Unicode file identification signature)

2A 09 - प (Devanagari p)
4D 09 - ् (Devanagari virama)
30 09 - र (Devanagari r)
0D 00 - CR (carriage return)
0A 00 - LF (line feed)

etc. etc.

So, the first line of this file is the Devanagari ligature प्र.

MS Word recognises adobe-ligatures-analysis.txt as 16-bit Unicode file.
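
For those who prefer to check the byte structure programmatically, here is a minimal Python sketch that decodes exactly the bytes listed above. It does not read the actual file; the byte string is typed in from the listing.

```python
# Signature (FF FE) followed by p, virama, r, CR, LF, each as a little-endian 16-bit unit
raw = bytes.fromhex("FF FE 2A 09 4D 09 30 09 0D 00 0A 00")

# The "utf-16" codec reads the signature to pick the byte order and then drops it
text = raw.decode("utf-16")

first_line = text.splitlines()[0]
print([f"U+{ord(ch):04X}" for ch in first_line])
```

Reading the real file should work the same way with `open("adobe-ligatures-analysis.txt", encoding="utf-16")`, assuming the whole file follows the structure shown.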

Uli's picture

Mr. Hudson:

And here are the 460-odd attested Hindi ligatures drawn from Hindi dictionaries and compiled and exemplified by Ernst Tremel and contained on the pages 110 through 130 of the Itranslator manual:

http://www.sanskritweb.net/itrans/itmanual2003.pdf

This ligature list compiled by Ernst Tremel is downloadable as a plain Unicode file

http://www.sanskritweb.net/itrans/hindi-ligatures.txt

This file contains nothing but the Devanagari ligatures in plain 16 bit Unicode encoding in the same manner as the companion Unicode file adobe-ligatures-analysis.txt.

John Hudson's picture

Many thanks, Uli. I have correlated my most recent draft glyph set to Michel's analysis of the Hunspell and Aspell dictionaries, and am now in the process of correlating to your Sanskrit ligature list and Ernst Tremel's Hindi list. The number of ligatures unattested in these sources is gradually being whittled down, and most that remain are either what I classify as systematic inclusions -- i.e. representative of core aspects of the writing systems qua system such as merged -R forms of each letter -- or are fairly obvious transliteration sequences.

With regard to the latter, I've come to the conclusion that their inclusion needs to be a matter of individual font and its intended purpose. Coincidentally, Fiona just communicated to me a request from a major Bengali language newspaper publisher to include an SPL conjunct ligature for the transliteration of the English loan word 'splinter', which apparently occurs often enough in the context of politics to be afforded such treatment. Recently, we worked on user interface fonts for Hindi, and analysed the localised strings for the operating system and other software, finding such unexpected transliterated conjuncts as TZV (barmitzva). Not all of these end up with ligature solutions -- if I can get the half form shaping for rare sequences to look good, that's what I'll use --, but they illustrate the reality of transliterated loan words in modern Hindi.

At present, I am working on a font specifically for pre-modern Hindi texts, so being able to identify the source and attestation for different groups of conjunct ligatures is very helpful.

Uli's picture

Mr. Hudson:

I wish you good luck in finishing your Adobe Devanagari font.

Ligatures are a never-ending story. Look at your own name "Hudson".

Sanskrit and Hindi do not have "ds" as a sound combination. This is because the soft dental "d" would be assimilated to the hard dental "t" before "s", resulting in "ts", which would be available as a Hindi ligature. But you would not want your name transliterated in Devanagari as "Hutson", would you?

Therefore, transliterating your name in Devanagari would require a new ligature, namely for "ds".

Namaste

Michel Boyer's picture

finding such unexpected transliterated conjuncts as TZV (barmitzva)

Knowing that barmitzva is itself a transliteration from the Hebrew* בַּר מִצְוָה where I expect the letter צ (tsadi) would normally give rise to the sound ts and not tz, I find that surprising too (even if some might voice it before a voiced consonant, and I wonder who).

*(מִצְוָה is certainly Hebrew; בַּר is Aramaic)

John Hudson's picture

One of our Indian associate designers has taken a look at the Hunspell and Aspell Hindi lists, and confirms my suspicion that they are heavily loaded with multiple transliterations or transcriptions of foreign loan words, e.g.

इंग्लिश इंग्लैंड इग्लैंड ग्लव्स
inglish, ingland, iglaind, and glov(e)s.

This, of course, makes them useful for their purpose of spellchecking for modern Hindi usage, but I'm finding Ernst Tremel's list more reliable in terms of being sourced from *mostly* Hindi words only. I find the presence of conjuncts beginning with ङ that are not attested in either Hunspell or Aspell unnerving, and note that even in his list there are transliterations or transcriptions of naturalised terminology, e.g. 'anglo-indiyan' or 'kongres'.

Michel Boyer's picture

For those who wonder how the statistics based on the dictionaries can be obtained, here is the method I used (using Python regular expressions).

As John said, a compound is a sequence of virama-separated consonants. A consonant can be defined by the Python regular expression

   [\u0915-\u0939\u0958-\u095F]

(I just rewrote in Python what John wrote in words above). The Virama is \u094D. A compound is thus a sequence of 1 or more consonants followed by \u094D (whence the + operator in the code below), with a consonant appended to it. For grouping, I used (?:expression) to avoid the back referencing mechanism.
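To see what this expression picks out, here is a minimal sketch that applies it to a few words quoted earlier in this thread. It assumes Python 3 (where strings are Unicode by default) rather than the Python 2 of the program below, and the names CONSONANT, VIRAMA, COMPOUND and find_compounds are only illustrative.

```python
import re

# Consonant class as given above: KA..HA plus the precomposed nukta letters.
CONSONANT = r'[\u0915-\u0939\u0958-\u095F]'
VIRAMA = r'\u094D'

# One or more (consonant + virama) pairs, closed by a final consonant.
# (?:...) groups without creating a back reference.
COMPOUND = re.compile(r'(?:' + CONSONANT + VIRAMA + r')+' + CONSONANT)


def find_compounds(word):
    """Return every compound candidate in the word, in order of appearance."""
    return COMPOUND.findall(word)


if __name__ == "__main__":
    for word in ["इंग्लिश", "ग्लव्स", "संख्या", "सङ्ख्या"]:
        print(word, find_compounds(word))
```

For ग्लव्स this finds the two candidates ग्ल and व्स. The anusvara spelling संख्या yields only ख्य, whereas सङ्ख्या yields the three-consonant ङ्ख्य, which is why the choice of spelling affects the counts.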

The dictionary contains one word per line. The following program reads it line by line, finds the compounds in each line, and outputs them directly.

import re, sys

compound = ur'(?:[\u0915-\u0939\u0958-\u095F]\u094D)+[\u0915-\u0939\u0958-\u095F]'
f=open(sys.argv[1])
word = f.readline().decode('utf-8')
while word:
  for comb in re.findall(compound, word):
    print comb.encode('utf-8')
  word = f.readline().decode('utf-8')

If we call this stub compounds.py and if the dictionary is aspell.txt, then

python compounds.py aspell.txt

outputs the compounds (I should rather say the candidates for producing compounds), one per line, as many times as they occur. That should work on any platform.

To get a more sophisticated output, you can make a more involved Python program, or just use standard unix commands if you are on Linux or OS X. The first thing to do is to sort the compounds so that they are grouped together in the output and then use the unix command uniq with the option -c to count them.

python compounds.py aspell.txt | sort | uniq -c

Here are the first lines of the output:

223 क्क
1 क्क्क
1 क्क्ड़
32 क्ख

So there are 223 occurrences of क्क. It would be interesting now to have those numbers in descending order. Again, all that is needed is to sort those last lines according to the numerical value (option -n) and I'll choose the reverse order (option -r). Here is the full command.

python compounds.py aspell.txt | sort | uniq -c | sort -n -r

Here are the first lines of the output:

3200 प्र
1614 त्र
1599 क्ष
961 स्त
924 र्ण

Once you know the compounds, you can search for the words containing them in the dictionary using the unix command grep. Very little programming is thus required.
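For those without grep, the same lookup can be sketched in Python. This is only an illustration, assuming Python 3 and a list of words already read from the dictionary; both function names are hypothetical. Note that a plain substring search, like grep, also returns words in which the compound is part of a longer one (searching for क्क also finds क्क्क), so a stricter variant is included.

```python
import re

COMPOUND = re.compile(
    r'(?:[\u0915-\u0939\u0958-\u095F]\u094D)+[\u0915-\u0939\u0958-\u095F]')


def words_containing(words, compound):
    """Substring match, like grep: also hits longer compounds containing it."""
    return [w for w in words if compound in w]


def words_with_exact_compound(words, compound):
    """Keep only words in which the regular expression yields exactly this compound."""
    return [w for w in words if compound in COMPOUND.findall(w)]
```

The stricter variant is the one that agrees with the counts above, since each maximal compound is counted as a unit.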

Of course, to produce a table for LaTeX, the compounds found were put in a Python dictionary and the processing was all done in Python. The full source is 39 lines of Python after removing comments and blank lines (but including the 8 lines above).
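The full 39-line source is not reproduced here. As a rough sketch of the counting step only, assuming Python 3 and using collections.Counter in place of a hand-built dictionary (the function names are illustrative), the equivalent of sort | uniq -c | sort -n -r could look like this:

```python
import re
from collections import Counter

COMPOUND = re.compile(
    r'(?:[\u0915-\u0939\u0958-\u095F]\u094D)+[\u0915-\u0939\u0958-\u095F]')


def count_compounds(words):
    """Count each compound candidate over an iterable of words."""
    counts = Counter()
    for word in words:
        counts.update(COMPOUND.findall(word))
    return counts


def ranked(counts):
    """(compound, count) pairs, most frequent first; ties sorted by compound."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

The order of compounds with equal counts may differ from what sort -n -r prints, since the tie-breaking rule here is an arbitrary choice.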

Michel
Rem: I removed duplicates in aspell after I produced my last tex files; the number of occurrences for स्त has decreased from 962 to 961.

Michel Boyer's picture

The 8 lines of Python in the previous post can be replaced by the following 6 lines (that first read the full dictionary as a list of words, which may cause trouble on very large dictionaries but works fine for me on aspell.txt). For closing the input file, additional lines of code would be required.

---
import re, sys, codecs
compound = ur'(?:[\u0915-\u0939\u0958-\u095F]\u094D)+[\u0915-\u0939\u0958-\u095F]'

listwords=codecs.open(sys.argv[1],"r","utf-8").read().split()
for word in listwords:
  for comb in re.findall(compound, word):
    print comb.encode('utf-8')
---

The advantage is that the processing loop is extremely simple.
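As for closing the input file, here is one possible sketch. It assumes Python 3 instead of the Python 2 used above, and the function name compounds_in_file is hypothetical. The with statement closes the file automatically, and reading line by line avoids loading a very large dictionary into memory.

```python
import re

COMPOUND = re.compile(
    r'(?:[\u0915-\u0939\u0958-\u095F]\u094D)+[\u0915-\u0939\u0958-\u095F]')


def compounds_in_file(path):
    """Yield every compound candidate in a UTF-8 word list, one word per line."""
    # The file is closed when the with block is left, i.e. once the
    # generator is exhausted.
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            yield from COMPOUND.findall(line)

# Usage, with the same dictionary file as above:
#     for comb in compounds_in_file("aspell.txt"):
#         print(comb)
```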

paul d hunt's picture

(listening)

John Hudson's picture

Here's a summary of my thinking on this, after taking time to correlate my Devanagari glyph set spreadsheet with the data Michel derived from the Hunspell and Aspell dictionaries and with Ernst Tremel's list of Hindi conjunct ligatures.

I am suspicious of both the Hunspell and Aspell dictionaries as attestation sources for many Hindi conjuncts, simply because they seem to include high numbers of transliterated or transcribed foreign loan words. Tremel's list makes more sense to me, based on his sources, although this too contains some loan words and, importantly in terms of glyph set design, includes large numbers of conjuncts that almost all Hindi writers would instead write with anusvara, e.g. संख्या instead of सङ्ख्या. Also, Tremel's list does not include frequency data.

In terms of the Adobe Devanagari set, there are about 100 conjunct ligatures that are not attested in any of the lists, of which a significant number are of Sanskrit origin, their presence in the Linotype list presumably reflecting those Sanskrit conjuncts that were most frequently encountered in words quoted by Linotype's Indian customers of the time. A few are what I consider 'systematic inclusions', i.e. those whose existence is implied by the writing system rather than its application to a particular language, e.g. a merged rakar form of the nukta letters. The remainder are presumed to be transliteration and transcription forms for loan words, as requested by Linotype's customers. As noted by Uli, once you start including such forms, it becomes an open ended set, and the only way to constrain it is by frequency. But, of course, the frequency of foreign loan words depends entirely on the nature of the texts examined, which accounts for the differences between the Hunspell, Aspell and Linotype/Adobe sets.

What I'm left with is a set of data that includes a fairly large number of Hunspell and Aspell transliteration conjuncts that are not covered in the Adobe Devanagari set, and vice versa. The most useful aspect of this, at the moment, is that it suggests ways in which I might reduce the draft glyph set of a font I am working on for pre-modern Hindi texts, in which few loan words are expected to be encountered. For fonts targeting modern Hindi usage, the usefulness is less clear, but perhaps there might be candidates among the more common transliteration conjuncts from Hunspell and Aspell that could be added. On the other hand, I'm pleased to see just how many of the conjuncts not supported by ligatures in the Adobe Devanagari font shape very well with half forms (which we carefully kerned to that purpose).

Uli, you suggested early in this thread that the Adobe Devanagari set would need only about a dozen more ligatures in order to adequately support classical Sanskrit. This surprised me. The first document to which you linked in this discussion seems to be a subset of Sanskrit conjuncts, certainly much shorter than the complete list in your manual. Can you explain the basis of this subset? Is it based on frequency, or on a particular set of texts?

Uli's picture

Mr. Hudson:

"Uli, you suggested early in this thread that the Adobe Devanagari set would need only about a dozen more ligatures in order to adequately support classical Sanskrit. This surprised me. The first document to which you linked in this discussion seems to be a subset of Sanskrit conjuncts, certainly much shorter than the complete list in your manual. Can you explain the basis of this subset? Is it based on frequency, or on a particular set of texts?"

The first document, namely this document:

http://www.sanskritweb.net/itrans/adobe-ligatures.pdf

lists attested Sanskrit conjuncts in descending frequency order.

I cut off this partial list at the frequency of 0.010 %, namely here:

ङ्घ्र्य ṅghry (!!!) 0.010%

If the Adobe Devanagari font contained all attested Sanskrit ligatures down to a frequency of 0.010 %, it would be a very good Sanskrit font, although it would not include the rarest ligatures with a frequency below 0.010%.

These rarest ligatures are only covered by our own highly specialized Itranslator fonts "Sanskrit2003.ttf", "Chandas.ttf" and "Siddhanta.ttf", downloadable here

http://www.sanskritweb.net/itrans/

However, the Adobe Devanagari font, primarily designed for Hindi, could also be a satisfactory Sanskrit font if it contained a subset of Sanskrit ligatures, provided that at least the most frequent Sanskrit ligatures were covered by the subset.

It is up to you where you cut off the frequency list adobe-ligatures.pdf.

For example, given the present version of the Adobe Devanagari font, it contains all Sanskrit ligatures down to a frequency of 0.215%, because the first ligature not covered by this font is the ligature द्ध्व, namely

द्ध्व ddhv (!!!) 0.215%

If you were to include also ड्ग, the Adobe Devanagari font would cover all Sanskrit ligatures down to a frequency of 0.167 %, namely

ड्ग ḍg (!!!) 0.167%

and so on and so on. You could proceed and cut off here

द्द्व ddv (!!!) 0.119%, or here:

ङ्घ्र ṅghr (!!!) 0.109%, or here:

द्द्र ddr (!!!) 0.109%, or here:

ड्य ḍy (!!!) 0.097%, or here:

...

ङ्घ्र्य ṅghry (!!!) 0.010%

If the Adobe Devanagari font included all ligatures down to ङ्घ्र्य, id est down to a frequency of 0.010%, it would be a very good Sanskrit font. But I repeat, you may cut off the list much earlier.
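The cut-off idea can be sketched in a few lines of Python 3. The table holds only the six missing ligatures quoted above, in descending frequency order (the entries between 0.097% and 0.010% are left out here, as they are above), and the function names are merely illustrative.

```python
# (ligature, transliteration, frequency in percent), in descending order.
# Only the six missing ligatures quoted in this post; the list continues
# in adobe-ligatures.pdf.
MISSING = [
    ("द्ध्व", "ddhv", 0.215),
    ("ड्ग", "ḍg", 0.167),
    ("द्द्व", "ddv", 0.119),
    ("ङ्घ्र", "ṅghr", 0.109),
    ("द्द्र", "ddr", 0.109),
    ("ड्य", "ḍy", 0.097),
]


def needed_for_cutoff(missing, cutoff):
    """Transliterations of missing ligatures at or above the cut-off frequency."""
    return [translit for _, translit, freq in missing if freq >= cutoff]


def most_frequent_gap(missing, added):
    """Frequency of the most frequent ligature still missing after adding `added`.

    Returns None if every ligature in the table has been added.
    """
    for _, translit, freq in missing:
        if translit not in added:
            return freq
    return None
```

With nothing added, the most frequent gap is ddhv at 0.215%; once ddhv is added, it moves down to 0.167%. A cut-off at 0.109% would call for the first five ligatures of the table.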

As regards the attestations, I made frequency analyses of original electronic Sanskrit files.

The German University of Göttingen hosts this huge collection of electronic Sanskrit files:

http://gretil.sub.uni-goettingen.de/gretil.htm

Ten years ago, I started with the proofread entire Mahabharata:

http://bombay.indology.info/mahabharata/statement.html

However, it only makes sense to analyze proofread electronic texts:

http://bombay.indology.info/mahabharata/history.html

because text files crammed with typos result in erroneous frequency counts.

To make a good Hindi font, someone would have to analyze proofread electronic Hindi files and make frequency counts. It would then become clear which ligatures should be included in a good Hindi font and which should be omitted.

I should like to mention that the German University of Cologne here

http://www.sanskrit-lexicon.uni-koeln.de/scans/MWScan/tamil/index.html

hosts a huge electronic Sanskrit dictionary with more than 150,000 entries, which I also completely analysed for my own ligature frequency counts ten years ago.

A similar electronic dictionary should also be available for Hindi. Mr. Boyer's short Hindi word files mentioned above in this thread are not sufficient for large-scale frequency counts.
