Unicode vs. "liga" feature

Theunis de Jong:

Technically it is possible to add both a ligature glyph with the common "f_f_i" name (and its accompanying feature code) and a separate Unicode-encoded glyph for the same character -- U+FB03 for "ffi" -- to the same font.
Is this a good idea? I'm pretty sure it's not!

I only tested this with InDesign, but I vaguely recalled that it would automatically use "fi" and "fl" ligatures, if a font had these characters in the correct Unicode positions. However, a quick test with a freshly created OTF font shows it does not (in CS4, at least). Perhaps this is (or was) only true for Plain Old Type 1 fonts without any further OTF enhancements.

Regardless: if I add both an "f_f_i" glyph and a Unicode-encoded "ffi" glyph to a font, what sort of issues would I encounter? Maybe all it comes down to is the order in which the underlying software examines ligature candidates: some programs might parse the OTF features first (which -- again, I think -- would be preferable), whereas others might check the character map first.
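
To make the two routes concrete, here is a minimal inspection sketch with fontTools (the filename is a placeholder; it only reports what is in the font, and doesn't answer the ordering question):

```python
# A minimal inspection sketch, assuming fontTools is installed;
# "MyFont.otf" is a hypothetical placeholder for the font under test.
from fontTools.ttLib import TTFont

font = TTFont("MyFont.otf")

# Route 1: a glyph reached directly from the Unicode presentation form U+FB03.
cmap = font["cmap"].getBestCmap()
print("U+FB03 maps to glyph:", cmap.get(0xFB03))

# Route 2: a {liga} feature in the GSUB table that substitutes f f i -> f_f_i.
has_liga = False
if "GSUB" in font and font["GSUB"].table.FeatureList:
    for record in font["GSUB"].table.FeatureList.FeatureRecord:
        if record.FeatureTag == "liga":
            has_liga = True
print("Font has a {liga} feature:", has_liga)
```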

Or maybe the entire font drawing machinery crashes.

John Hudson:

The only reason to include both is if you want not only to correctly display legacy texts that use the Unicode presentation form codepoints to hard-code ligatures, but also to preserve that encoding in PDFs with text reconstructed from parsed glyph names. Personally, I think you are doing users a favour if you don't do this, and instead let parsed glyph names map back to the letter sequence ffi, rather than to the presentation form codepoint. So my recommendation is to use the 'f_f_i' format name, but map that glyph to the Unicode presentation form.
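
For concreteness, a minimal sketch of that mapping with fontTools (the filename is a placeholder, and the font is assumed to already contain a glyph named 'f_f_i'):

```python
# A minimal sketch of the recommended mapping, assuming fontTools is installed
# and the hypothetical "MyFont.otf" already contains a glyph named "f_f_i".
from fontTools.ttLib import TTFont

font = TTFont("MyFont.otf")

# Point the presentation form codepoint U+FB03 at the 'f_f_i' glyph in every
# Unicode cmap subtable, so legacy text hard-coded with U+FB03 still displays.
for subtable in font["cmap"].tables:
    if subtable.isUnicode():
        subtable.cmap[0xFB03] = "f_f_i"

font.save("MyFont-remapped.otf")
```

Because the glyph keeps the 'f_f_i' production name, glyph-name parsing still reconstructs the letters ffi, even though the cmap also reaches that glyph from U+FB03.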

ahyangyi:

Or maybe the entire font drawing machinery crashes.

Like this?

charles ellertson:

John & Theunis --

Kind of academic for me, since we only set books, which means new manuscripts & no legacy issues. So when I have to work on fonts, I only anticipate Unicode.

But if you are trying to provide for reading in ASCII files with a new typeface, wouldn't those files perforce use *names* from the Adobe Glyph List, with any somehow-program-generated Unicode being the presentation form?

In that case, liga should use the Unicode format, f_f_i, & no number. That is, not map it to the presentation form.

In fact, if I were doing this, I'd try to map any legacy characters (i.e., PostScript = 7-bit ASCII + fibs) to the now-correct Unicode, so any new document would have the correct encoding.
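
In case it helps, a tiny sketch of the kind of remapping I mean -- the legacy code points below are made-up placeholders, not positions from any actual 8-bit encoding:

```python
# A hypothetical remapping sketch: the legacy byte values used here are
# placeholders, not the actual slots of any real 8-bit expert encoding.
LEGACY_TO_UNICODE = {
    0xA1: "fi",   # placeholder legacy slot for the fi ligature
    0xA2: "fl",   # placeholder legacy slot for the fl ligature
    0xA3: "ffi",  # placeholder legacy slot for the ffi ligature
}

def modernise(legacy_bytes):
    # Map each legacy byte either straight through as ASCII or via the table,
    # producing plain Unicode letters rather than presentation forms.
    return "".join(LEGACY_TO_UNICODE.get(b, chr(b)) for b in legacy_bytes)

print(modernise(b"di\xa3cult"))  # -> "difficult"
```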

What's missed with this thinking?

EDIT -- could one use {ccmp} to deconstruct legacy ligatures?

John Hudson:

In that case, liga should use the Unicode format, f_f_i, & no number. That is, not map it to the presentation form.

{liga} is an OpenType Layout feature, which means it maps from glyphs to glyphs independent of the characters in the text string. So let's say you have the word 'office', encoded as a sequence of six Unicode Latin alphabetic characters; with the {liga} feature active it will display with the ffi ligature. Now let's say you have a document from some legacy process in which the word 'office' is encoded as three alphabetic characters with a Unicode presentation form ligature character in the middle. In my scheme, both encodings are displayed with the 'f_f_i' ligature glyph, one via the {liga} feature and one via the cmap table. The only difference between the two is that the use of the 'f_f_i' glyph naming means that the Unicode presentation form ligature character will not be preserved in PDFs in which text strings are reconstructed from parsed glyph names; instead, Acrobat will parse the 'f_f_i' ligature glyph in the PDF as representing the alphabetic character sequence ffi. And I think that is a Good Thing.
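
As a rough illustration of that glyph-name parsing (this is not Acrobat's actual code, and AGL_SAMPLE below is a tiny stand-in for the full Adobe Glyph List):

```python
# A rough sketch of the glyph naming convention used to reconstruct text from
# glyph names: a ligature name is split on underscores and each component is
# mapped back to a character via the Adobe Glyph List.
AGL_SAMPLE = {"f": "f", "i": "i", "l": "l"}  # tiny stand-in for the full list

def parse_glyph_name(name):
    return "".join(AGL_SAMPLE.get(part, "") for part in name.split("_"))

print(parse_glyph_name("f_f_i"))  # -> "ffi", the alphabetic sequence, not U+FB03
```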

I suspect we might be talking about different kinds of legacy encodings here. I'm talking about mechanisms that used Unicode presentation form codepoints to represent common Latin ligatures. This is different from mechanisms such as Adobe's expert set PS Type 1 fonts, which used 8-bit encodings for ligatures. Such documents require character transform operations to be cleaned up and standardised as Unicode encoded text, but in that case my recommendation would be that legacy encodings of a ligature such as ffi should always be converted to the underlying alphabetic character sequence and never to the Unicode presentation form ligature codepoint.
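
The cleanup step I have in mind is essentially what Unicode compatibility normalisation does; a minimal Python illustration:

```python
# A minimal illustration of that cleanup step: Unicode compatibility
# normalisation (NFKC) decomposes U+FB03 back to the plain letters "ffi".
import unicodedata

legacy = "o\ufb03ce"                      # 'office' with a hard-coded U+FB03
clean = unicodedata.normalize("NFKC", legacy)
print(clean)                              # -> office
print(len(legacy), len(clean))            # -> 4 6
```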

John Hudson:

could one use {ccmp} to deconstruct legacy ligatures?

Deconstruct to what? Remember, OTL features affect only glyph display, not text encoding. There wouldn't be any point in decomposing a legacy ligature glyph to underlying alphabetic glyphs only to recompose to a ligature glyph, all the while preserving that legacy codepoint in the text string.

charles ellertson:

Ah.

John, practically speaking, and as far as I see (which may not be far enough), the only situation where that would arise is when you're taking text from a PDF file. As you say, with ligatures, anything that relied on the older Mac or Windows OS had fibs for character encodings. The old TeX used proper ligaturing, though I'm not sure how that wound up in a PDF file; the ASCII file would just have the base characters.

Where I'm lacking, aside from what might be wrong with the above, is just what can get into a PDF file, especially the older ones. From time to time we do get in jobs where we have to extract the text from a PDF and then reset. But the extraction of the text is in the "not my job" department. I would think Theunis has done as much or more of this than most of us, so I was trying to figure out how the situation he described could arise.

Khaled Hosny:

So my recommendation is to use the 'f_f_i' format name, but map that glyph to the Unicode presentation form.

This will not work the way you think it will; if the glyph is encoded, that code point will be used regardless of the glyph name (which is the right thing to do; if you have a code point, why second-guess it?). At least the PDF readers and text extraction utilities I tested do so. However, you might still get back a plain alphabetic sequence in this particular case, because some tools are smart enough to understand those legacy presentation forms and decompose them.

Theunis de Jong:

Charles, you are right. I do get text extracted from a PDF at times, and when it contains a single "ffi" glyph I get that back as a separate character. It doesn't matter for the display of the text, but text search fails when I search for "difficult", and one of my side jobs involves re-purposing text for XML and ePub export.
In addition, I'd like my own PDFs to be as "clean" as possible – copying text out of them again should return plain text.
So when I encounter these glyphs, I find-and-change them to regular text.

Of course this assumes that the software used to create the document is aware of the {liga} feature. Older software may not, and may rely on the presence of the actual Unicode /ffi glyph.

The problem in constructing a font lies here: if you add an OTF f_f_i glyph, you can map the same glyph to the assigned Unicode codepoint. That way you are sure that whatever the software does, it will result in the same glyph. However, it is also possible to assign a different glyph to the Unicode position -- one that is "unaware" of the {liga} code and its associated glyph.

I am working on a basic font creation program and was asked whether it should (a) allow the above anyway, as it is merely a question of interpretation and should be left to the font creator, (b) remove the few Unicode codepoints for "hardcoded" ligatures so the user can only create them through the {liga} feature, or (c) do basically (b), but with the program automatically adding the ligatures as single Unicode-encoded glyphs.

From a technical point of view I'd say (b), because the Unicode ligatures were actually a very bad idea – in my informed opinion :-) – and this effectively prohibits the user from making bad mapping decisions.
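
As a sketch of the kind of sanity check such a program could run (using fontTools purely for illustration; the filename is a placeholder), it could compare any encoded Latin presentation forms against what the ligature substitutions in GSUB actually build:

```python
# A hedged sketch of a consistency check: warn when an encoded Latin
# presentation form points at a different glyph than the one the ligature
# substitutions in GSUB build for the same letter sequence.
import unicodedata
from fontTools.ttLib import TTFont

font = TTFont("MyFont.otf")               # placeholder filename
cmap = font["cmap"].getBestCmap()

# Collect ligature substitutions (GSUB LookupType 4):
# (first glyph, component glyphs...) -> ligature glyph.
liga_targets = {}
if "GSUB" in font and font["GSUB"].table.LookupList:
    for lookup in font["GSUB"].table.LookupList.Lookup:
        for subtable in lookup.SubTable:
            for first, ligatures in getattr(subtable, "ligatures", {}).items():
                for lig in ligatures:
                    liga_targets[(first,) + tuple(lig.Component)] = lig.LigGlyph

for codepoint in range(0xFB00, 0xFB07):   # ff, fi, fl, ffi, ffl, long s t, st
    if codepoint in cmap:
        letters = unicodedata.normalize("NFKC", chr(codepoint))
        key = tuple(cmap.get(ord(c), "") for c in letters)
        built = liga_targets.get(key)
        if built and built != cmap[codepoint]:
            print(f"U+{codepoint:04X} is mapped to {cmap[codepoint]}, "
                  f"but the ligature substitution builds {built}")
```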

John Hudson:

Charles, what gets into a PDF -- and hence what you can get out of a PDF -- depends on how the PDF is created. For quite a while now it has been possible to store the original document character strings in the PDF, which is what a good PDF-export tool will do. This should leave nothing to chance: the characters used to encode the original document correspond to the characters stored in the PDF (in practice, this might mean normalisation of some kind, and I've had mixed results when dealing with complex scripts).

The whole glyph name mapping thing is only important when you're dealing with a PDF that has been distilled from a print stream: either a print-to-PDF function or distilled from a .ps file. In that case, the original document character strings are not available, because all Acrobat can get from the print stream is the embedded font and the glyph IDs. This is where Acrobat tries to parse the glyph names in the font to reconstruct the text.

John Hudson:

Khaled, are you saying that if a) a document is originally encoded using alphabetic characters only, but displayed using {liga}, and b) the 'f_f_i' glyph mapped in the {liga} feature also maps to the Unicode presentation form ligature, and c) a PDF is distilled from a print stream (rather than having the original character strings written to the PDF), then Acrobat will reconstruct the text as containing the presentation form codepoint and not the alphabetic sequence?

Khaled Hosny:

That is my experience (though not with legacy presentation forms since Adobe Reader decomposes them regardless of glyph name).

On second thought, it might be a problem in the PDF creation rather than in the text extraction; the tools I tried don't embed the names of encoded glyphs.

charles ellertson:

@Theunis:

From a technical point of view I'd say (b), because the Unicode ligatures were actually a very bad idea – in my informed opinion :-) – and this effectively prohibits the user from making bad mapping decisions.

I agree completely. And we'll have Unicode around a lot longer than OpenType. The XML files, if they fulfill their intent, will be the electronic form of the archived document. Acid-free paper, please.

"Legacy" issues should be resolved as quickly in that document's life as possible. When things stop being used, people forget what they mean, even if that meaning can still be checked. Who, for example, remembers EBCDIC -- one convention for high-order ASCII. Yes, I know presentation forms are different, but in this case, it seems to me to amount to a kind of competing encoding, and that's bad.

I'm aware that the needs of complex scripts go beyond ligaturing in Latin-based languages. I'm arguing that ligatures are different: ligatures per se aren't desirable, just as the two-character logotypes (e.g., Vo) on the Linecaster weren't desirable. One is a property of a particular font (why have ligatures in Palatino, for example?) and the other a property of a particular manufacturing technology. Nothing to do with the language in either case.

Edit:

I just had a thought about

Older software may not, and may rely on the presence of the actual Unicode /ffi glyph.

Don't really think so. What software both used Unicode & had non-OpenType ligaturing capabilities?

John Hudson:

Adobe Reader decomposes them regardless of glyph name

Which implies that my method is sound. The Unicode presentation form codepoints are mapped in the cmap table to representative glyphs -- e.g. 'f_f_i' -- so that if such codepoints are encountered in text then they will display correctly. On the PDF side, it doesn't matter whether decomposition takes place by parsing glyph names or by normalisation of the presentation form codepoints to alphabetic characters: the result is the same, and there seems no point in providing duplicate sets of ligatures in a font with different names and mappings.
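
A trivial illustration of that equivalence, purely for the sake of argument:

```python
# A trivial check: parsing the glyph name 'f_f_i' and normalising the
# presentation form U+FB03 both yield the same plain text.
import unicodedata

via_glyph_name = "".join("f_f_i".split("_"))
via_codepoint = unicodedata.normalize("NFKC", "\ufb03")
print(via_glyph_name == via_codepoint == "ffi")  # -> True
```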

Khaled Hosny:

Which implies that my method is sound.

If you restrict yourself to Adobe products, yes. But I already have combinations of PDF creation and text extraction tools where the one-glyph solution does not work and you always get back the presentation form code point.

John Hudson:

Well, I've never been one for complicating font production in order to accommodate what I consider bugs in other people's software. :)
