Dumping a TTF to editable text in Windows?

jasonmenzies's picture

Does anyone have any free or paid Windows XP utilities they can recommend that will dump a TTF font file to editable text?

Ideally, I want an extensive Asian Unicode font (like SimHei for example) to be dumped to a text file that looks just like the Character Map (a grid that's 10 or so characters wide and several hundred rows tall), but in editable text (DOC, RTF, TXT, whatever works).

I'm trying to do it using the copy and paste function in Character Map but it's taking way too long. Is there a utility out there that will do something like this for me in an automated manner? I'm willing to pay a decent price.

aric's picture

What do you plan to do with the editable text file?

Antonio Cavedoni's picture

This is a job for TTX.

Arno Enslin's picture

@jasonmenzies

There is a command "fontplot" contained in the Adobe Font Development Kit for OpenType. It might help:

fontplot

Make a pdf file showing all the glyphs in the font. This fills a page with many glyphs, and shows just the filled outlines at a small point size, with some info around each glyph. Works with OpenType fonts and Type 1 fonts.

"charplot", "digiplot", "fontplot", "hintplot", and "showfont" are all command files that call the ProofPDF script with different options. ProofPDF takes as options a list of fonts, and an optional list of glyphs, and prints a PDF file for the specified font, showing the glyphs as specified by options.

Do you want to print the character map or do you want to dump the font file? If you want to dump (decompile) it, TTX is the right tool. Otherwise I don't see how TTX would help.

I am not sure, whether fontplot works with TrueType fonts. I assume, it does not, although OpenType fonts can be TrueType flavored.

jasonmenzies's picture

Thanks for the suggestions and assistance everyone. None of these ideas helped but maybe with a bit more refinement of my original question, someone will have a different suggestion.

Aric, I work in home video subtitling so all my text needs to be rendered as a bitmap or other graphic image for client delivery. I need all the characters dumped from a font to text so I can load them into my rendering engine (which runs on proprietary Unicode text-based files), render them out to graphics, and QC the outputs.

Antonio, thanks for the suggestion. TTX seems helpful, but I think it's giving me all the stuff behind the scenes to make the characters work (for programming, compiling, etc), whereas I just need the characters themselves.

Arno, thanks to you for the suggestion as well. I think I found a couple of other programs that do the same thing yesterday (dump to a printable PDF), but again since I can't actually use the text from the PDF in my editing software, it isn't exactly what I need.

Ideally I just need content like this:

!"#$%&'()*+,-./01234

That's the first row of the character map from Arial, copied one-by-one with the select button and then copied to the clipboard, and pasted here.

For Arial, it won't be such a big deal, but SimHei has 16,000 characters, so I was hoping to find something that would do it a bit quicker.

Any further ideas would be warmly welcomed. I thank you all for your time thus far.

Arno Enslin's picture

@jasonmenzies

I think I have an idea:

Decompile the CMAP-table of the font file with TTX (-t cmap). Then use a search&replace tool.

Example:
<cmap>
<tableVersion version="0"/>
<cmap_format_4 platformID="0" platEncID="3" language="0">
<map code="0x20" name="space"/><!-- SPACE -->
<map code="0x21" name="exclam"/><!-- EXCLAMATION MARK -->
<map code="0x22" name="quotedbl"/><!-- QUOTATION MARK -->

should look like this (without underscores):
&_#x20;
&_#x21;
&_#x22;

OpenOffice Writer wants to have them at least in a html element. So you should have this in the end (without underscores):
<p>&_#x20;&_#x21;&_#x22;</p>

Store it as html file and open it in Word or OpenOffice Writer or anything comparable and save it as doc or rtf file. Word and Writer will display the glyphs then (and naturally your browser), as long as the font is installed and selected (with the font menu or a CSS declaration).

If this does not work as is, it may at least point you in the right direction. I am not sure whether the cmap table necessarily contains the Unicode positions of all characters in the font. But I assume that characters missing from the CMAP table could not be reached through the clipboard either.

And probably there is a more elegant way.
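The search-and-replace step Arno describes can also be scripted. What follows is only a rough sketch, assuming the cmap was dumped with TTX and that every entry has the `<map code="0x..." .../>` form shown in his example; the function name ttx_cmap_to_html is made up for illustration.

```python
import re

def ttx_cmap_to_html(ttx_text):
    """Turn each <map code="0x..."> entry of a TTX cmap dump into an HTML
    numeric character reference, all wrapped in one <p> element."""
    # Keep only the hex digits that follow code="0x
    codes = re.findall(r'<map code="0x([0-9A-Fa-f]+)"', ttx_text)
    refs = "".join("&#x%s;" % code for code in codes)
    return "<p>%s</p>" % refs

# The three entries from the example above
sample = '''<cmap_format_4 platformID="0" platEncID="3" language="0">
<map code="0x20" name="space"/><!-- SPACE -->
<map code="0x21" name="exclam"/><!-- EXCLAMATION MARK -->
<map code="0x22" name="quotedbl"/><!-- QUOTATION MARK -->
</cmap_format_4>'''

print(ttx_cmap_to_html(sample))  # <p>&#x20;&#x21;&#x22;</p>
```

Saving the printed line as an .html file gives the same result as the manual replacement.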

Theunis de Jong's picture

Arno, you are correct: Windows' Charmap utility does not display "all" characters in a font, just those with a valid Unicode mapping (it does not show OTF substitution results). I agree that it's a moot point, since the input of Jason's program should be Unicode :-)

A one-step solution could be a program that immediately dumps the Unicode CMAP to a file (there might be more than one encoding in any font). I wrote lots of little programs to examine TTFs and OTFs -- I'll have a quick look-see if it's possible to re-purpose one to do this. Hold on a minute.

Theunis de Jong's picture

There you go.
Download "cmaptofile.zip" from my site. Extract to a suitable place.

It's a console program; usage:

cmaptofile font file name [ -utf8 | -buc ] [ -llnumber ]

Default output is Unicode, little endian. Use -buc to toggle to big endian. Use -utf8 to output as UTF8 encoded file (not thoroughly tested ;-). Use -llnumber to set the line length in characters -- default is -ll32.

Output goes to the same folder as the program is in, to a file with the same name as the font plus ".txt" appended.

Ideally, it oughta use the largest CMAP it finds -- any unicode one. If the font doesn't have a unicode map .. well, anyone's guess what happens then. I think it fails silently. OTOH, if it does work, you'll see a complete dump of the UC characters it writes to the file.

Tested with a handful of PFB, OTF, and TTFs -- still, no guarantees, it was a real quick hack.
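For readers without access to cmaptofile, the same idea can be sketched in Python. This is not Theunis's program, only an approximation of what he describes: it assumes the third-party fontTools package is installed, and the names codepoints_to_rows and dump_font are invented for this example. Surrogate code points are skipped because they cannot be encoded as text.

```python
def codepoints_to_rows(codepoints, per_line=32):
    """Sort the code points, drop duplicates and surrogates, and group the
    resulting characters into rows of per_line characters."""
    chars = [chr(cp) for cp in sorted(set(codepoints))
             if not 0xD800 <= cp <= 0xDFFF]
    return ["".join(chars[i:i + per_line])
            for i in range(0, len(chars), per_line)]

def dump_font(font_path, out_path, encoding="utf-16-le", per_line=32):
    """Write every Unicode-mapped character of a font to a text file."""
    # Imported here so the helper above works without fontTools installed
    from fontTools.ttLib import TTFont
    font = TTFont(font_path)
    cmap = font.getBestCmap()  # dict of code point -> glyph name, or None
    if cmap is None:
        raise ValueError("no Unicode cmap found in %s" % font_path)
    rows = codepoints_to_rows(cmap.keys(), per_line)
    # newline="" keeps the explicit CRLF line endings untouched
    with open(out_path, "w", encoding=encoding, newline="") as f:
        f.write("\r\n".join(rows) + "\r\n")

print(codepoints_to_rows(range(0x21, 0x35), per_line=10))
```

A call such as dump_font("simhei.ttf", "simhei.txt", per_line=10) would then give a grid ten characters wide. Note that "utf-16-le" writes no byte order mark; pass "utf-16" if the target program expects one.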

Michel Boyer's picture

Here are things that work best (and most easily) on Linux, that I use on a Mac and that may work on a PC with Cygwin.

I always use small scripts. The following one outputs the unicode numbers (in hex format) in a font; it requires Fontforge, the Python Fontforge module and, of course, Python. Copy what is between the cut lines and paste in a terminal window (Cygwin, mac or Linux), and this creates the executable file lstucodes in the current directory.


----
cat > lstucodes <<'EOF'
#!/usr/bin/env python

import fontforge,sys
fnt=fontforge.open(sys.argv[1],1)

# glyphs without an encoding have unicode == -1; controls and space are skipped too
for g in fnt.glyphs():
   if (g.unicode >= 0x21):
      print("%04X" % (g.unicode))
EOF
chmod 755 lstucodes
----

If simhei.ttf is also in the current directory then

./lstucodes simhei.ttf |sort > simhei.txt

outputs all the unicode values of the characters in simhei.ttf, sorts them and puts the result in simhei.txt.

To format, I usually use sed and awk but young people use other things now. Anyway, here is my script (if I want 10 characters per line); again, copy and paste in the terminal window to get hex2html:


---
cat > hex2html <<'EOF'
#!/bin/sh

awk 'BEGIN{print "<html><body>"}{
  printf "&#x%s;\n", $0
  n = n+1
  if (n == 10) {
    printf "<br>\n"
    n=0
  }
}END{ printf "</body></html>\n"}'  $1
EOF
chmod 755 hex2html
---

Now ./hex2html simhei.txt > simhei.html gives the file that you can find here: simhei.html.

Characters with a unicode number above FFFF cause problems. If you find other problems, I'd like to be told.

Michel

jasonmenzies's picture

Theunis, you are my hero! This is absolutely FANTASTIC! This is exactly what I needed and will save me hours of work.

Please feel free to contact me through my profile if you are seeking any level of compensation for this. I would be happy to pay you for your efforts due to the amount of labor this will save me.

I am so happy that I came to this site for assistance. You have all been so helpful and I really appreciate it!

Michel Boyer's picture

Characters with a unicode number above FFFF cause problems.

They were causing problems (due to Python) on the Mac (not on Linux) in different circumstances but not here, or so it seems. I replaced 04X by 05X in lstucodes (just so that those characters come at the end of the file) and tried on STHeiti and, when viewed with the STHeiti font, the resulting file STHeiti.html looks fine to me (but I don't know Chinese).
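Why the switch from 04X to 05X moves those characters to the end: sort compares the hex strings as text, so a five-digit code starting with 2 lands before a four-digit code starting with 4. A small illustration with made-up sample codes (padding to five digits is enough for the fonts discussed here; codes up to 10FFFF would need six):

```python
codes = ["4E00", "20000", "0021", "2A6D6"]

# Plain text sort, as the sort command does by default
print(sorted(codes))    # ['0021', '20000', '2A6D6', '4E00']

# Padding every code to the same width makes text order match numeric order
padded = ["%05X" % int(h, 16) for h in codes]
print(sorted(padded))   # ['00021', '04E00', '20000', '2A6D6']

# Alternatively, sort on the numeric value and leave the strings alone
print(sorted(codes, key=lambda h: int(h, 16)))  # ['0021', '4E00', '20000', '2A6D6']
```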

Theunis de Jong's picture

Michel,

Good point. My quick hack does not support codes >0FFFFh -- I cannot write these to a regular UC file as hi/lo bytes (but I think there are code extensions for that). The UTF8 web page I used to implement this did show how to write them. Then again, I don't know how these huge codes are coded in the CMAP, so I cannot read them anyway.

I don't think I even have fonts with such huge codes. Does this STHeiti thingy have them? (Not that I have that font either.) (Ah -- stupid remark: your HTML shows the last 5 characters to be in the x20000 range.)
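The "code extension" for writing codes above FFFF to a 16-bit Unicode file is the UTF-16 surrogate pair: subtract 10000h, then put the top ten bits into a unit starting at D800h and the bottom ten bits into a unit starting at DC00h. A short sketch (the function names are invented for this example):

```python
import struct

def utf16_units(cp):
    """Return the 16-bit code units that encode code point cp in UTF-16."""
    if cp < 0x10000:
        return [cp]
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return [high, low]

def utf16le_bytes(cp):
    """Pack the code units as little endian byte pairs."""
    return b"".join(struct.pack("<H", unit) for unit in utf16_units(cp))

print(["%04X" % u for u in utf16_units(0x20000)])  # ['D840', 'DC00']
# Python's own encoder agrees with the manual calculation
print(utf16le_bytes(0x20000) == chr(0x20000).encode("utf-16-le"))  # True
```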

jasonc's picture

The font may have a format 12 cmap subtable, which will complicate things further.
Hmm, not that I have a solution for you however.

Michel Boyer's picture

On OS X 10.5.8 (on my MacBook Pro) there is a font named STXihei in /System/Library/Fonts that appears to be seen as "STHeiti Light" in the character palette. According to the FontForge dump, that font contains 4241 characters in CJK Unified Ideographs Ext.B range 20000-2F7FF. That font is /System/Library/Fonts/华文细黑.ttf. The characters I checked (just a few, of course) are seen in the character palette and in the dump to be seen here: stxihei.html.

My guess is that FontForge does its job well.

Michel

Michel Boyer's picture

The lstucodes script above is something I had in my ~/bin folder.

I now took the time to compare with Arno's method. On tens of thousands of characters, comparing outputs can be quite instructive. Looking at stxihei.ttx, I see only one cmap table, of type 12; there are 34962 character codes (and 31 multiply encoded glyphs).

For each line containing 'map code' we want the string enclosed between the first and the second quote character. That can be done with awk, telling it to use quote as a field separator. In any unix-like shell, typing


grep 'map code' stxihei.ttx  | awk -F\" '{print $2}' | sed 's/0x//' > stxihei.txt

does it (and removes the leading 0x). Then (using the hex2html script above)

./hex2html stxihei.txt > stxihei.html

gives the desired html file, 10 characters per line.

After comparing the output from FontForge and the above, one can see that on multiply encoded glyphs in stxihei, FontForge outputs only one unicode value per glyph (I checked with a script). Arno's method based on the ttx output gives all the unicode values in that font and that may be what you want.

On other ttx files, with multiple cmap tables, I don't know what may occur. For some other font, I saw unicode points with .notdef glyphs (that would need to be discarded with grep -v). What else?
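One way to cope with several cmap tables and with .notdef entries is to parse the TTX file as XML, collect every code into a set, and skip unwanted glyph names. A hedged sketch, assuming the file was dumped with -t cmap so that all map elements belong to the cmap table; the sample XML is made up, and entries without a code attribute (variation-selector subtables use other attributes) are ignored.

```python
import xml.etree.ElementTree as ET

def unique_codes(ttx_xml, skip_names=(".notdef",)):
    """Collect the distinct code points from every cmap subtable."""
    root = ET.fromstring(ttx_xml)
    found = set()
    for entry in root.iter("map"):
        code = entry.get("code")
        if code is None or entry.get("name") in skip_names:
            continue
        found.add(int(code, 16))  # int() accepts the leading 0x with base 16
    return sorted(found)

# Invented two-subtable sample; "A" appears in both subtables
sample = '''<ttFont><cmap>
<cmap_format_4 platformID="0" platEncID="3" language="0">
<map code="0x0" name=".notdef"/>
<map code="0x41" name="A"/>
</cmap_format_4>
<cmap_format_12 platformID="3" platEncID="10" language="0">
<map code="0x41" name="A"/>
<map code="0x20000" name="uni20000"/>
</cmap_format_12>
</cmap></ttFont>'''

print(["%04X" % cp for cp in unique_codes(sample)])  # ['0041', '20000']
```

For a real file, ET.parse("stxihei.ttx").getroot() would replace the ET.fromstring call.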

Michel

Theunis de Jong's picture

Just for fun, added a few more options to the quick-and-dirty proggie (most recent version: cmaptofile101.zip).

  • -html Write an HTML file, with the character codes in full decimals
  • -cr, -lf, -crlf and -lfcr change line ending mode
  • -ofilename set output file name
  • -name (just for fun) show the name of each character

I removed the worthless long listing of the character names while processing.
I changed the default line length to 10 characters (should've done that right away).

If you feed it a TTC, it writes out one file per sub-font (although I seem to get the same files -- my few TTCs must all have the same CMAP).

I added a GetWindowsFolder call -- if you need a font from your Windows folder, you don't have to explicitly prepend "c:\windows\fonts", it'll automatically check in there if the file is not found in your current folder.

Oh -- and I added CMAP12 support, for those *huge* codes. If present, the program uses this. Couldn't really check; it seems to work with LastResort though.

The program is hardwired not to add null characters, U+FFFE, U+FFFF, and U+D800..U+DFFF (but these really should not be present in the file anyway).
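The exclusion rule just described is easy to reproduce in a script of your own. The sketch below only mirrors the list given above; it is not taken from the program's source, and the name is_writable is invented.

```python
def is_writable(cp):
    """Return False for the code points the post says are never written."""
    if cp == 0x0000:                # null character
        return False
    if cp in (0xFFFE, 0xFFFF):      # noncharacters
        return False
    if 0xD800 <= cp <= 0xDFFF:      # surrogate range, not real characters
        return False
    return True

candidates = [0x0000, 0x0041, 0xD800, 0xDFFF, 0xE000, 0xFFFE, 0xFFFF, 0x20000]
print(["%04X" % cp for cp in candidates if is_writable(cp)])
# ['0041', 'E000', '20000']
```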

Theunis de Jong's picture

Just for fun, I compiled a Mac version!
Direct link: cmap2file_mac.zip

Download it and put it somewhere you can reach using Terminal. Then run, using

./cmap2file font_file_name[*] [ options ... see previous mails]

[*] Typically, your installed fonts can be found in either /Library/Fonts or in ~/Library/Fonts.

I added support for the old(!) versions of TTF that start with 'true', rather than "ttcf", 0x10000 (Windows TTF), or "OTTO". dfonts are not supported!
I also added another option: -name inserts the name per character, taken from the file if present, synthesized otherwise.
Be sure to use the -ofilename option to put the output somewhere you can find it later -- remember, default output is where the font file was found.

Disclaimer: It appears to work just fine on my Mac OSX 10.6, but I set the minimal target to 10.4. Still, your mileage may vary.

quadibloc's picture

I know of a utility that can dump an Adobe Type Manager font to an editable text file, and read it back in again - it was on the Garbo CD ROM from Walnut Creek - but not one for TrueType.
