Searching characters in Acrobat

Primary tabs

11 posts / 0 new
Last post
Ken Krugh's picture
Offline
Joined: 10 Apr 2007 - 9:05am
Searching characters in Acrobat
0

I'm adding some glyphs to an existing font and am trying to better undestand how the set up of the OTF font effects finding characters in PDF files generated by InDesign.

When adding glyphs I use Adobe's list (http://www.adobe.com/devnet-archive/opentype/archives/glyphlist.txt) to name the glyphs.

Currently I'm adding the ae dipthong character with a macron (unicode code point 01E3) as well as a the thorn with a cross stroke (unicode code point A765). The ae IS in Adobe's list, the thorn is NOT. In my liga feature I added the following:
sub f i by fi;
sub Thorn macron by Thornmacron;
sub thorn macron by thornmacron;
sub AE macron by AEmacron;
sub ae macron by aemacron;

As far as I knew this provided everything necessary so that when I export from InD to PDF I can search for a regular ae (with no macron) or a regular thorn (with no stroke) and my new glyphs will get found as well as the "regular" ones.

If I insert either glyph using the InDesign glyph palette, InD inserts them as a single character. If I insert the main character (the ae or the thorn WITHOUT the macron/stroke) followed by the macron, InDesign puts in the two characters and dutifully uses the glyph as it should.

When searching an exported PDF, however, ALL the occurances of ae are found regardless of how they were inserted, but the thorn characters that were inserted by using the glyph palette (and are a single glyph in InD) are NOT found. Only the "regular" thorns (with no cross stroke) and the ones that are in InD as two characters are found.

Does anyone know whether this is because InD and possibly Acrobat don't "know" what the thorn with the stroke because it's not in Adobe's list? Other than being from two different unicode sets, I don't why there would be a difference with the ae versus the thorn.

Thanks for any insight anyone can offer and thanks for reading a marathon post!

Best,
Ken

Ken Krugh's picture
Offline
Joined: 10 Apr 2007 - 9:05am
0

OK, been kicking this around some and a bunch of other glyphs I have in the font (ggotaccent, gbreve, egonek) all get round when searching for the "regular" character (with no accent).

So is it ALL just the search engine being used and has nothing to do with the font??

Thanks again!
Ken

André G. Isaak's picture
Offline
Joined: 31 May 2008 - 10:05am
0

You're glyph names aren't valid. You should use the following:

sub Thorn macron by uniA764;
sub thorn macron by uniA765;
sub AE macron by uni01E2;
sub ae macron by uni01E3;

If you export a .pdf file from InDesign, the unicode values will be embedded, so it will be searching for those. When you insert your Thornmacron character from the glyph palette, you're giving it something which has no unicode value and to which it cannot assign one since it doesn't have a meaningful name. If you had named it either Thorn_uni0304 or uni00DE0304 you might have better luck since these names are at least interpretable.

Whether a program will find (e.g.) eogonek when searching for e will depend entirely on how unicode savvy the program is. Some programs will decompose eogonek into e + uni0328 when searching. Others won't.

André

André G. Isaak's picture
Offline
Joined: 31 May 2008 - 10:05am
0

Just two quick additional notes:

First, I'd put these sorts of substitutions (minus the f i) in the ccmp feature rather than the liga feature -- that is, after all, what it is for.

Also, if you're going to allow your thorn with stroke character to be entered as a sequence, I wouldn't use the combining macron character. I'd instead use uni0335 or uni0337, neither of which is entirely appropriate but either of which is a better choice than uni0304. The macron sits above the character whereas the other two intersect it.

André

David W. Goodrich's picture
Joined: 14 May 2008 - 2:23pm
0

As a general point about Acrobat's searching it may be worth mentioning that Acrobat offers a preference setting for Search that can be set to "Ignore Diacritics and Accents."

David

Ken Krugh's picture
Offline
Joined: 10 Apr 2007 - 9:05am
0

Still using the macron I tried straight unicode names (uniA765) as well as combination names (uni00FE00AF), niether of which InD/Acrobat decomposed as they weren't found by searching or the thorn.

I also tried various combinations using the stroke character (uni0335) named per Adobe's glyph list but was still unable to find the new stroked character searching for the non-stroked one.

Yes, "Ignore Diacritics and Accents" was on, I think it is on by default.

My concern is that when this material/fonts go to the web and will be in need of being searched. As ther is not yet a concrete plan of what the web end of things will entail, I guess I'll just stick with the recommended naming conventions and hope for the best!

Any other info would be welcome and I'll be sure to post back if I come up with any revallations.

Many thanks for the input guys,
Ken

André G. Isaak's picture
Offline
Joined: 31 May 2008 - 10:05am
0

Part of the problem may be that Latin Extended D is relatively new and not all software might support it. As a name, uni00FE00AF is less than ideal since u00AF isn't a combining mark but a spacing character. Acrobat's 'ignore diacritics' might not view slashed thorn as a thorn plus a diacritic since it's really a scribal abbreviation rather than an accented character.

If you need to be able to search for these, I'd just make sure to enter the character as a two character sequence in InDesign (Thorn + uni0304 or uni0335 or whatever you choose) since ID will then preserve the two unicode values in your exported .pdf.

André

Michel Boyer's picture
Offline
Joined: 2 Jun 2007 - 1:01pm
0

From http://typophile.com/node/31039#comment-182353 I understand that

sub Thorn macron by Thorn_macron;
sub thorn macron by thorn_macron;
sub AE macron by AE_macron;
sub ae macron by ae_macron;

might solve your problem. I don't know if there can be any bad side effect. Have you tried? Any opinion from those who know?

André G. Isaak's picture
Offline
Joined: 31 May 2008 - 10:05am
0

Actually, if you're really concerned about searchability, why not name your crossed thorn as a three-character ligature and then use something like the following:

feature dlig {
    sub Thorn [AE ae] [T t] by Thorn_ae_t;
    sub thorn ae t by thorn_ae_t;
} dlig;

André

Ken Krugh's picture
Offline
Joined: 10 Apr 2007 - 9:05am
0

Am I right in thinking that the dlig would pretty much end up working like the liga or ccmp, where the characters would be inserted seperately? In that case I think I'd prefer the more basic liga or ccmp, simple concepts for my simply mind, which is quickly turning to mush under the weight of Opentype features! :o)

Well, I'm now convinced (mostly) that it is simply the unicode characters recognized by either InD, Acrobat or both. In the font there are glyphs that were not in any kind of substitution feature, nor were they even named according to Adobe's list yet searching for the plain character found the accented character. A G with an overdot (unicode 0120) is one example. That seems to indicate that the a substitution table isn't necessary for this type of searching. Something I'd love confirmed if anyone can do so.

I think I'll take André's good advice and add the short stroke and name the thorn glyph using the two unicode values that comprise it. I'm guessing that other search engines may have the best shot at decomping it when the time comes.

Thanks again for everyone's time. This forum is da' bomb (sorry, its late).

All best,
Ken

André G. Isaak's picture
Offline
Joined: 31 May 2008 - 10:05am
0

The behaviour will be determined by the rule type you use; the choice of feature differs in the following way:

liga = standard ligatures, which in most applications will be turned on be default.
dlig = discretionary ligatures, which will generally be turned off unless the user turns it on.
ccmp = unicode composition, which will be turned on by default and generally cannot be turned off by the user.

So which of these you should use will really depend on what you want the default behaviour to be. If you're using thorn + stroke (uni0335) then ccmp would probably be the best choice, though liga would also work.

If you went with my second suggestion of a three-character combination (thorn_ae_t) then dlig would be the best choice (the use of barred thorn as an abbreviation for þæt is the only usage I'm familiar with -- if it has other uses this choice would be less than optimal).

André