Call for entries about Unicode

mojuba
28.Oct.2003 3.18am

The Unicode Standard, on which the consortium has been labouring for some ten to twelve years, will be of essential significance for the future of all handling of text and type throughout the world. Of course a matching of all the world

Type designers, typographers and others are therefore vigorously discussing a number of issues, notably missing characters or even missing scripts. Further, many professionals consider the encoding policy on the Unicode side somewhat questionable.

Can you give some examples of 'missing characters'? I've been involved with Unicode for several years, and most often when people start talking about missing characters I find that they do not understand the character/glyph distinction. There is also a tendency to base opinions on what is missing on what is present, ignoring the fact that there are many characters in Unicode that are there only for backwards compatibility with pre-existing character sets, and do not meet the character criteria of the Unicode character/glyph model. Of course, this does not mean that there are not genuine missing characters, but there are procedures for proposing such characters and, if they meet the Unicode character criteria, they will be encoded.

This is not to say that the Unicode Standard is perfect: there are things which I think the UTC got wrong, some of which can be and have been changed, but much of which cannot be changed because of stability agreements with other standards bodies.

I would be interested to know what, exactly, 'many professionals' find 'somewhat questionable' about the Unicode encoding policy.


I agree with John's comments above. I'm particularly curious about what "missing scripts" there are? I mean, sure, it seems like they have old Nordic runes without having all the old Hungarian runes, but in terms of modern language usage, what's missing?

As for the encoding policy, there are really three policies that govern Unicode:

1) compatibility with pre-existing encoding standards

2) encode characters, not glyphs (except when required by 1 above)

3) use only one slot for a given character, even though it may have somewhat different presentation forms in different languages

I'm curious about which of these principles people are objecting to, and on what grounds.

I'll note that the existence of principles 2 and 3 strongly implies the need for a technology like OpenType to bridge the character/glyph divide, to handle things such as small caps, alternate forms, etc.

Regards,

T


To begin with: I am criticizing neither the work on Unicode nor the need for it.
This call is about exchanging views and suggestions, mainly on signographic problems with Unicode.

For example:
The Bulgarian code page includes all the accented vowels that are really necessary for typesetting the Bulgarian language, and these are not included in the Cyrillic code page. Different glyphs and the same Unicode value...

0x0432 # CYRILLIC SMALL LETTER VE
0x0436 # CYRILLIC SMALL LETTER ZHE
0x043A # CYRILLIC SMALL LETTER KA
0x044E # CYRILLIC SMALL LETTER YU

I very probably do understand the difference between a character and a glyph. But if I wanted to make an extensive Cyrillic italic OT font, I would have problems with Russian and Serbian (0x043F CYRILLIC SMALL LETTER PE). In italics this character has completely different forms under the same Unicode value. Why?

Why codepage for Dingbats (Range: 2700


The Serbian italics and the typographic glyphs issue are both consequences of the character/glyph model (principles 2 and 3 I outlined above).

Neither of these things is a "problem" in a smart font format such as OpenType. For example, all of our Cyrillic fonts support the Serbian forms in the italic.

The general reason for not encoding typographic glyphs as if they were characters is that it is harmful to plain text operations, and it is an unbounded task. What ligatures would you encode? How about the "Th" ligature that is commonly found in newer Adobe text fonts? What about people who want to have two sizes of small caps in the same font? Or figures that are halfway between oldstyle and lining?
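To make the "harmful to plain text operations" point concrete, here is a minimal Python sketch using only the standard `unicodedata` module. It uses the fi ligature at U+FB01, one of the presentation forms that exists in Unicode for compatibility reasons. The helper names `plain_text_contains` and `folded_contains` are made up for this illustration and are not part of any library.

```python
import unicodedata

# U+FB01 LATIN SMALL LIGATURE FI is a compatibility character:
# a glyph-level ligature that was given its own code point.
text_plain = "define"
text_with_ligature = "de\ufb01ne"  # looks like "define", but holds 5 characters


def plain_text_contains(haystack: str, needle: str) -> bool:
    """Naive substring search, the way a simple plain-text tool would do it."""
    return needle in haystack


def folded_contains(haystack: str, needle: str) -> bool:
    """Search after NFKC normalization, which maps U+FB01 back to 'f' + 'i'."""
    return needle in unicodedata.normalize("NFKC", haystack)


if __name__ == "__main__":
    print(plain_text_contains(text_plain, "fin"))          # plain letters
    print(plain_text_contains(text_with_ligature, "fin"))  # ligature character
    print(folded_contains(text_with_ligature, "fin"))      # after normalization
```

The naive search finds "fin" in the first string but not in the second, even though both would look the same on the page. Every ligature encoded as a character adds another case that searching, sorting and spell-checking have to fold away, which is the cost being described above.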

Finally, the question about which dingbats are in Unicode is a question of which ones got in because of association with existing standards and codepages. I think you'll find that the black telephone is a standard character in one of the Asian standard character sets, for example.

Regards,

T


First, a cheer to John Hudson and Thomas Phinney for having started the discussion.

Missing scripts -: I was pleased to learn recently that there are serious efforts to add relevant historic or minority scripts. I regard this as very important; in the case of historic scripts it will let us keep in touch with our cultural heritage in the future. Many publishing projects need to use things like Etruscan, Anglo-Saxon, Gaelic, Gothic, Hittite and the like.

Missing characters -: It is not so strange that some characters are missing, but in some cases I cannot help wondering about the contents of a particular encoded set. An example of this: Currency symbols (20A0 - 20B1. Apart from the well known $,


Andreas,

Missing scripts. New scripts are being added all the time, and there is a significant roadmap document that preserves spaces for new scripts in both the Basic Multilingual Plane (16 bit) and the extension planes (32 bit). Of the scripts you mention, Etruscan and Gothic are already encoded. I'm not sure what you mean by Gaelic script.
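As a rough illustration of the plane distinction, the following Python sketch computes which plane a character sits in and how many 16-bit code units it needs in UTF-16. The function names `plane_of` and `utf16_code_units` are hypothetical helpers written for this example. The sample code points are U+044F (a Cyrillic letter in the Basic Multilingual Plane), U+10300 (the start of the Old Italic block, which covers Etruscan) and U+10330 (the start of the Gothic block).

```python
def plane_of(ch: str) -> int:
    """Return the Unicode plane number (0 = Basic Multilingual Plane)."""
    return ord(ch) >> 16


def utf16_code_units(ch: str) -> int:
    """Return how many 16-bit code units the character needs in UTF-16."""
    return len(ch.encode("utf-16-le")) // 2


samples = {
    "Cyrillic": "\u044f",         # U+044F, in the BMP
    "Old Italic": "\U00010300",   # U+10300, beyond the BMP
    "Gothic": "\U00010330",       # U+10330, beyond the BMP
}

if __name__ == "__main__":
    for label, ch in samples.items():
        print(f"{label}: U+{ord(ch):04X} "
              f"plane={plane_of(ch)} utf16_units={utf16_code_units(ch)}")
```

Characters outside the Basic Multilingual Plane still fit in the standard; they simply take a surrogate pair (two 16-bit units) in UTF-16 rather than one.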

Missing characters. If you know of a character that you think needs to be encoded, I encourage you to document it and submit a proposal. Sometimes, if you make enough noise about a particular character and provide a convincing enough argument, someone else will do the work of submitting a proposal, but you can't rely on this. Better to do it yourself. Example: last year at ATypI in Rome, Peter Lofting from Apple made a pilgrimage to the Isle of Tiberius to photograph the earliest known carving of the Rod of Asclepius in order to demonstrate that this should not be confused with the caduceus and should be separately encoded. He documented his proposal well, and the Rod of Asclepius has been accepted for encoding in the Unicode Miscellaneous Symbols block. I also notice that the Guarani currency symbol has been accepted, which means that someone has done the work of proposing it.

Regarding scribal abbreviations, I am not convinced that these need to be encoded. Unicode encodes characters for plain text, but not everything that occurs in text needs to occur in plain text. Scribal abbreviations are a form of ligature, and it is possible for them to be handled with markup and with smart font features (including selection from multiple abbreviations of the same character sequence).

Regarding generic informational signs, there is a move to encode more of them, although this has raised the philosophical question of where, exactly, does one draw the line. A case in point is the 'Man cleaning up after his dog' sign, which has recently been debated by members of the UTC (not sure what they decided). Such things will always -- and probably should -- raise philosophical issues for a text encoding standard.

Regarding Zapf Dingbats, those are included only for backwards compatibility with Apple's ZD codepage. They might not have been accepted if proposed today, but they went in very early, and Apple was a founding member of the consortium. The circled numbers are for compatibility with East Asian standards.

Regarding the graphic representation of characters in the codepages, these are intended to be informational only, and are clearly identified as such. The charts are not claimed to be normative, inerrant, consistent, or anything else; the Unicode Standard is not the charts, it is the abstract characters behind the charts. The charts are made from available fonts, in many cases provided by the individuals who proposed specific characters for inclusion. Yes, there is room for improvement, and yes it is possible to suggest improvements or, even better, provide replacement fonts. The Unicode Consortium is not interested, however, in a 'universal standard of writing': the glyphs in the code charts need to be sufficient to present a recognisable and acceptable form for each character: any recognisable and acceptable form. Unicode is a character encoding standard for plain text: it is intended to be the backbone of text processing, not as a project to catalogue shapes.


> Peter Lofting from Apple made a pilgrimage to the Isle of Tiberius
> 'Man cleaning up after his dog' sign

You've just made me realize I did the same thing last May. Here's my documentation from Montjuic:

bowowhaus

hhp


Dear Thomas, John et al.,

for me it is all about the following:
Until the beginning of 2003, the encoding of SmallCaps was evidently governed by the Adobe Glyph List (AGL). It assigned each base glyph exactly one name and exactly one Unicode value. All earlier 'expert fonts' are based on this allocation. Since the beginning of 2003 there has been the Adobe Glyph List For New Fonts (AGLFN). It specifies that SmallCaps are encoded entirely in the PUA, with no obligatory Unicode value. That was already the case before, but the designer now has 'total' freedom: he can assign both the name and the Unicode value freely within the PUA.

The situation is now as follows: A typographer sets a text with extensive use of SmallCaps. The customer doesn't like the font used. The typographer changes it. The new font is encoded (per AGLFN) completely differently from the old one. Text garbage arises ...

Foundry A encodes SmallCaps in the PUA. Foundry B encodes SmallCaps differently from Foundry A. How can there be any lasting consistency if there is no binding rule for how to encode them?

Why can't a provision be made in Unicode for SmallCaps (extended Latin), PetitCaps (extended Latin) and Titlings (extended Latin)? Why are the 'usual' ligatures not given obligatory codes as well? With stand-by. Sufficient space would be available in the Unicode tables anyway.

[preusss].


[this message by Andreas Stötzner was generously posted by Ingo Preuss because of technical problems with posting]

Dear John,

I always enjoy learning that things are getting better. And I believe, some critical queries aside, that the completion of the standard is really well on its way.

It would actually be no problem (for me) to make one proposal per week, but as a freelancer hunting for jobs to feed a family, and spending the rest of my time on collecting and signographic research, it's a little hard to devote even more resources to a proposal procedure if it goes beyond some four or five characters. I don't know whether others think so too. Well, but that's perhaps a personal problem rather than Unicode's.

You have stated one aspect which seems particularly worth considering: that Unicode is a standard for text processing and not a catalogue of graphic shapes. Certainly this needs to be stressed. Yet, what is text, and what is not?

I ask this for a couple of reasons. I guess that we, at least as people of Western culture, are still sometimes preoccupied with the concept of 'text' consisting just of 'script', and 'script' consisting of alphabetical characters. But this is obviously only a part of what we have to deal with.


Regarding the smallcaps mess, this has nothing to do with Unicode, which does not encode smallcaps. This has to do with pre-Unicode software and fonts using 8-bit character sets to include smallcaps, and Adobe's very unfortunate decision to map from those smallcap character sets to the Unicode Private Use Area, i.e. to non-standard codepoints. The result is that any document produced using these codepoints for smallcaps will be garbage unless displayed in appropriate fonts that use Adobe's convention (a convention that even Adobe have realised was a bad idea).

Smallcaps should not be encoded as such. A smallcap letter is a glyph variant (of either a lowercase or uppercase character, depending on context), and should be handled in markup, i.e. at a higher level above plain text.
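A small Python sketch, using only the standard `unicodedata` module, may help show why Private Use Area codepoints turn into garbage outside the font that defines them. The codepoint U+F761 below is simply an arbitrary value inside the BMP Private Use Area (U+E000 to U+F8FF), standing in for a font-specific smallcap; the names `is_private_use` and `describe` are hypothetical helpers for this example.

```python
import unicodedata

# Assumption: an arbitrary Private Use Area codepoint standing in for a
# font-specific smallcap glyph. The standard assigns it no meaning.
HYPOTHETICAL_SMALLCAP = "\uf761"


def is_private_use(ch: str) -> bool:
    """True if the character's general category is Co (private use)."""
    return unicodedata.category(ch) == "Co"


def describe(ch: str) -> str:
    """Return the codepoint with its standard name, if it has one."""
    name = unicodedata.name(ch, "<no standard name>")
    return f"U+{ord(ch):04X} {name}"


if __name__ == "__main__":
    for ch in ("a", "A", HYPOTHETICAL_SMALLCAP):
        print(describe(ch), "| private use:", is_private_use(ch))
    # Case mapping has no data for a private-use codepoint, so it is
    # returned unchanged and cannot be matched against "a" or "A".
    print(HYPOTHETICAL_SMALLCAP.lower() == HYPOTHETICAL_SMALLCAP)
```

Nothing in the text itself says that the private-use codepoint is "a smallcap a"; only one particular font knows that. Keeping the character as a plain "a" and requesting the smallcap form through markup or a font feature leaves the text searchable and portable between fonts.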

I'll respond to your other comments later.