New, Unicode 5.2 Arabic joining groups

John Hudson's picture

The Unicode Standard v5.2 has just been published, and includes revisions to the table of dual-joining characters that clarify the behaviour of two extended Arabic characters: the Farsi (Persian) yeh U+06CC, and the Jawi (Malay) nya U+06BD. Note that this change does not involve new characters, but only clarification of how these existing characters should be displayed in various joining conditions. This clarification is necessary, because these characters do not inherit dot positioning across the various joining forms. For details, see the table on page 16 of this PDF:

http://www.unicode.org/versions/Unicode5.2.0/ch08.pdf

Note also joining class for the Burushaski yeh barree letters U+077A and U+077B, which was added in Unicode v5.1.

Thomas Milo's picture

This introduction is remarkable and misleading:

"U+06CC arabic letter farsi yeh is used in Persian, Urdu, Pashto, Azerbaijani, Kurdish, and various minority languages written in the Arabic script, and also Koranic Arabic. It behaves differently from most Arabic letters, in a way surprising to native Arabic language speakers. The letter has two horizontal dots below the skeleton in initial and medial forms, but no dots in final and isolated forms."

Whoever wrote this apparently never set foot in an Arab country and never read the Koran. Most native Arabic speakers have done both. In most Middle Eastern countries, any text not produced by Apple of Microsoft software still has the dotless final yeh - including millions of cars :-)

Farsi Yeh and Arabic Yeh are also viable and vital regional variants and belong to be treated as such. The choice should be given to the user. When all Egyptian non-digital text production still prefers dotless final yeh, that should be given as an option. It would surprise no Arab if he had the choice to use or omit the dots.

http://www.yfoto.de/en/foto/1652/0/1/license_plate%20order%3Dvalue_desc

John Hudson's picture

It seems a typical example of Unicode's approach to style and language usage variation in the Arabic script. There are three yeh characters -- not counting those with different dot patterns for other languages, nor the yeh barree forms that are treated as distinct characters instead of style-specific forms --:

U+0649, alef maksura, which is dotless in all positions;

U+064A, yeh, which is dotted in all positions;

U+06CC, ‘Farsi’ yeh, which is dotless in isolated and final position but dotted in initial and medial position.

I don't think the text you cite is suggesting that a final dotless yeh is surprising to native Arabic language speakers, since U+0649 is expected to be used for this in Arabic text. I think it is suggesting that the non-inheritance of the dot pattern across all joining forms of a single character is surprising. Hmm, not so surprising as finding a single letter encoded as multiple characters. :)

Thomas Milo's picture

You are probably right, John - but only an engineer without knowledge of the world can put it that clumsily. UNICODE's failure to understand that dots are characters (and only added to the letters in Western typography, in original Middle Eastern type they are kept separate) is of course the real glitch in this argument. This Yeh story is representative. Anyone outside Unicodistan would be flabbergasted why a simple combination of a letter and a dot pattern gets three separate complex encodings - and still doesn't cover actual practice. If "native Arabic speakers" are surprised, I guess that's the reason :-)

Same is true for Teh Marbuta. For the user it's just Heh with dots. I have had plenty of raw text where Heh and Teh Marbuta either randomly or systematically alternate. The best example is a map of Iraq where dots are only placed on Heh when the Heh in the place name is pronounced as Teh (in so-called Statu Constructo).

Thomas Milo's picture

It's telling that particularly Teh Marbuta alternates with Heh in a way that reflects the independent status of the dots. Presently UNICODE misrepresents this information, and as a result, a search for Nabiha نبهه / نبهة or Khadija خدیجه / خدیجة with or without dots yields totally different results, in spite of the fact that they are clearly identical from a native Arabic speaker's point of view.

John Hudson's picture

I don't think Unicode ‘misrepresents this information’; rather, Unicode is essentially agnostic about this information, just as it is agnostic about j being interchangeable for consonantal i in ecclesiastical Latin. A search for Jesu produces different results from a search for Iesu. Or, less obscurely, one can point out that Unicode is agnostic about e.g. ‘organise’ being interchangeable with ‘organize’.

What Unicode has done is to make a decision about at what level to make a distinction between ة and ه. I tend to agree with you that it would have been better to have made the distinction at a different level, i.e. not at the character encoding level, but in my experience regardless of where one makes the distinction the slack needs to be picked up elsewhere. Let's say the distinction were made at the glyph level, as a layout choice; then users for whom it was important to maintain the distinction, e.g. because they are transcribing a particular manuscript text, would be faced with the problem of losing this distinction in plain text interchange. As it currently stands, the slack needs to be picked up on the searching and collating end -- as it needs to be for the Dutch IJ/ij letters --, which requires knowledge, time and resources, but once it's done it is done. The difficulty is persuading software makers that they need to do it.

Google seems pretty good at things like regional spelling variations in English. It seems to me that if they understood the rules, it would be consistent with their approach to text searching to handle Arabic dot variation similarly.

Thomas Milo's picture

I agree, the Iesu - Jesu comparison cannot be handled competently by an agnostic system - if alone because it language or orthography-dependent. The case for Arabic is, however, different.

In Arabic, the difference between ة and ه (both heh, with or without dots) is like the difference between كَتَبَ and كتب: (both: kataba, with or without vowels).

Therefore, Unicode does misrepresent Arabic. The basic flaw is that Unicode fails to reflect the independent nature of the dots. [This follows European typographic hacks, that, unlike Middle Eastern typography by Müterferrika and Mühendisoğlu, creates ligatures out of combinations of archigraphemes (neutral skeleton elements) and dot patterns (distinctive diacritics), which then leads to excessive numbers of glyphs and eventually to a clamor for "simplification" - for all clarity, letters like ب ت ث were in Middle Eastern typography, unlike in Western typography, not treated as ligatures of skeleton+dots].

Because Unicode today encodes ة and ه as separate, distinct letters, the difference between خدیجة and خدیجه (both: Khadija), erroneously, becomes analogous to your example of to ‘organise’ vs.‘organize’. But in the real world this is emphatically not the case, because Khadija occurs with both spellings in the same consistent corpora from one location, whereas ‘organise’ vs.‘organize’ are shattered over different parts of the English-speaking world. It is an objective fact that for masses of native Arabic speakers, ة can freely be substituted by ه (not vice versa), within the same area, orthography, style, you name it. I have observed this phenomenon also in book production (where I felt I had to correct it), and in bureaucratic usage.

Regarding the ة and ه issue, Google follows Unicode perception and is, consequently, clueless.

John Hudson's picture

Tom: In Arabic, the difference between ة and ه (both heh, with or without dots) is like the difference between كَتَبَ and كتب: (both: kataba, with or without vowels).

But this variance between ة and ه only occurs in a final position, which means that at the character level these two forms are not mutually interchangeable. ة can always be represented by ه, but not every instance of ه can be represented by ة. I agree that this is not directly analogous to the Latin i/j or British/American s/z variance, which make or do not make phonemic distinctions, and joining analysis of Arabic text easily provides the means to determine when the variance is permissable, without needing language-specific spelling information.

I entirely agree that an archigraphemic Arabic encoding with combining dots would be better representative of how the writing system actually works; it would also be more elegant and flexible. But Unicode can't be entirely blamed for the letter-based encoding they inherited from earlier standards. At this stage, it is all moot, because introducing a second encoding model on top of what they already have isn't a possibility, not least because of stability agreements that prevent new canonical decompositions of existing characters. So, as I wrote, one needs to look at where the slack can be and needs to be taken up, which is in things like search algorithms, spellcheckers, layout engines, etc.

Remember, there's nothing that says that a character encoding model has to represent how a writing system works, let alone do so in a way that is optimal.

Thomas Milo's picture

John, of course the alternation ة and ه only occurs in a final position, because positioning of dots on heh is limited to final position. But the composite Teh Marbuta also occurs only in final position - and that is taken as a character. In spite of the fact that no grammar of Arabic ever lists it as a member of the alphabet. Remember, 28 letters :-) Or 29, but then hamza is added, another one that doesn't really count: an amphibious letter.

Anyhow, I am not really blaming Unicode - there are far too many blessings and I have been a staunch supporter from day one. Let's indeed focus on weeding out the slack, incuding side-effects of bad analysis. As experts it's not our concern to back off because of stability compromises. That's not the spirit. Unicode got where it is now, exactly because it trashed many of such impediments, and found elegant alternatives.

Indeed for archaic Arabic, dots must be encoded separately. It's impossible to encode scholarly texts without compromising them righ now. In Korans, dots were usually added later to manuscripts. I have some very clear evidence for this, including cases where a single archigrapheme gets two contadictory sets of dots added to it.

John Hudson's picture

I agree regarding dots for archaic Arabic, but these will need to be encoded in a way that makes it clear that they cannot be used generatively to produce multi-character letters that any software would be expected to treat as equivalent to existing letter+dot characters. That's where stability comes in, and it isn't a compromise: it's a contractual agreement between Unicode and other industrial standards organisations (notably IETF) that need to be able to rely on stability in certain areas of Unicode text processing in order to do what they do. Canonical decomposition is the principle area affected by stability agreements, because if you change character equivalences for existing characters you not only break existing software but you introduce security problems.

The Arabic ‘pedagogical symbol’ dots have been approved by Unicode, and are now making their way through final stage national balloting for ISO 10646 (we, Canada, confirmed yesterday that we will be approving the revision that includes these characters, along with various new scripts and additional character for existing scripts). But, of course, these are explicitly spacing symbols, and no software can be expected to treat them as combining marks. Personally, I think the case for encoding these dots-as-symbols was very weak. The original proposer was pretty clearly trying to do an end-run around the letter-encoding model in a way that would enable a generative archigrapheme + combining dot encoding. He failed in this, of course, and what we're left with is a confusing set of spacing dot characters for which hardly anyone is going to find a practical use.

If anything, the encoding of these spacing dots will only make it harder to make the case for combining dots for archaic Arabic. But it needs to be done, and with a lot of scholarly support from experts in that field.

John Hudson's picture

...of course the alternation ة and ه only occurs in a final position, because positioning of dots on heh is limited to final position.

Right, but that's a grammar rule that some software needs to learn regardless of how the text is encoded, displayed, searched, sorted, etc. As a grammar rule, it is different in kind from the ecclesiastical Latin one that says an i can only be written as a j when it represents a palatal approximant, but it is a grammar rule in the same sense. If ه were encoded as a single character regardless of the presence of absence of dots, then it would be display software and fonts that would need to learn this rule. Since it is encoded as separate characters distinguished by the presence or absence of dots, searching and sorting software needs to learn this rule. As I wrote earlier, no matter how you encode something, someone, somewhere needs to pick up the slack that results from not encoding it in a different way.

Jongseong's picture

The original proposer was pretty clearly trying to do an end-run around the letter-encoding model in a way that would enable a generative archigrapheme + combining dot encoding. He failed in this, of course, and what we're left with is a confusing set of spacing dot characters for which hardly anyone is going to find a practical use.

If anything, the encoding of these spacing dots will only make it harder to make the case for combining dots for archaic Arabic.

Thanks for the explanation. I had had the mistaken impression that the new additions would be combining symbols based on the original detailed proposal.

This is frustrating, since combining dots would be the obvious solution not just for archaic Arabic, but adaptations of the Arabic script for many different languages, including the numerous Ajami scripts for African languages. Is there a workable solution that would avoid introducing new canonical decompositions for existing characters? Other than brute-force documentation and encoding of all archigrapheme-dot combinations that have historically been used?

John Hudson's picture

Is there a workable solution that would avoid introducing new canonical decompositions for existing characters?

Well, Unicode can simply avoid assigning canonical decompositions, but one is left with the security, phishing issue: the more ways you have to encode something that will be visually identical to something else in most fonts, the more security holes you open. I don't see a way forward with generative archigrapheme + combining dot(s) encoding for any modern Arabic scripted language. We need to proceed as Unicode have to date: identifying letters including dots and encoding them based on their joining behaviours (even if, as in the case of ة and ه, this means encoding a single letter as multiple characters). Like it or not, its generally more important for an encoding model to be internally consistent than for it to be consistent with any external analysis of how a writing system works. Unicode has already inherited internal inconsistencies in the encoding of e.g. European languages with both precomposed and combining mark diacritics -- and, for that matter, Hangul as both jamo and syllables --, and we're well aware of the technical problems and inefficiencies that such inconsistencies produce for software and font developers. I'm pretty sure that there is zero interest in the UTC for introducing such inconsistencies for Arabic. Like it or not, Unicode's Arabic encoding is a letter/joining behaviour model, and will remain so.

I believe combining archigrapheme disambiguators for transcribing archaic Arabic are something that could be encoded as such, because a) they can be visually distinguished from the evolved style of dots, making it less likely that there will be visual correspondence in many fonts (indeed, their classification as archaic characters means that most fonts will not support them: they will be specialist characters requiring specialist fonts), and b) their use can safely, explicitly be prevented in high-risk security situations such as URLs.

Thomas Milo's picture

"...of course the alternation ة and ه only occurs in a final position, because positioning of dots on heh is limited to final position.

Right, but that's a grammar rule"

Its not a grammar rule, it is optional use of dots.

Thomas Milo's picture

This neatly sums up the discussion: Unicode has a contractual obligation to maintain status quo, even though the actual solution could be a no-brainer, because

1. It does not handle Arabic in the cultural sense, by marginalizing its literary legacy, and particularly the Koran,

2. there are more and more characters added that are in fact just new combinations of the basic construction elements.

Hm. I have been am wasting my time trying to sort these challenges. I overlooked the contract :-)

Cheers,

t

John Hudson's picture

Its not a grammar rule, it is optional use of dots.

I mean it is a grammar rule that this optional use of dots is limited to the letter ه in the final position.

Syndicate content Syndicate content