The Unicode of accented characters in Word

trine rask's picture

We work with packaging for many countries and have to deal with many languages, many of which we cannot read and whose alphabets we do not recognize. To check our work, we use software that verifies that the Unicode values are the same in the input (text in Word documents) and the output (PDF generated from QuarkXPress). We are now struggling with differences between input and output for accented characters in Vietnamese text. So I wonder whether Word can store an accented character as the input (the characters you type) rather than as the result (the accented character).
Or is there another explanation for the different Unicode values for the same character? (The visual character is correct; this has been verified by a Vietnamese speaker.)

Michel Boyer's picture

With Vietnamese, I have found that even a simple copy and paste with TextEdit into a .txt file on the Macintosh did not preserve the sequence of Unicode characters. Both texts looked the same, but they were only canonically equivalent in Unicode terms, not identical code point for code point. For instance, the Unicode character

  1EBF  LATIN SMALL LETTER E WITH CIRCUMFLEX AND ACUTE

is equivalent to the three characters (its NFD decomposition)

  0065  LATIN SMALL LETTER E
  0302  COMBINING CIRCUMFLEX ACCENT
  0301  COMBINING ACUTE ACCENT

To compare two strings str1 and str2, you then need to compare their normalized forms (both in NFC, or both in NFD). With Python, you can use the unicodedata module; with Java, there is the java.text.Normalizer class.
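
Here is a minimal Python sketch of that comparison. The helper name same_text is only illustrative, and the choice of NFC as the default form is an assumption; any form works as long as both strings use the same one.

```python
import unicodedata

def same_text(str1: str, str2: str, form: str = "NFC") -> bool:
    """Return True if the two strings are equal after normalization."""
    return unicodedata.normalize(form, str1) == unicodedata.normalize(form, str2)

composed = "\u1ebf"            # one precomposed code point (U+1EBF)
decomposed = "e\u0302\u0301"   # e + combining circumflex + combining acute

print(composed == decomposed)           # False: the code points differ
print(same_text(composed, decomposed))  # True: canonically equivalent

# List the code points of the NFD decomposition of U+1EBF.
print([f"{ord(c):04X}" for c in unicodedata.normalize("NFD", composed)])
# ['0065', '0302', '0301']
```

A checking tool that compares raw code points will report a mismatch here even though the two strings display identically.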

On the Mac or on Linux, given two UTF-8 encoded .txt files, you can simply normalize both line by line and then compare the normalized outputs with diff.
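
If you would rather stay inside Python than call diff, the same idea can be sketched with the standard difflib module. The function names and the NFC default below are assumptions made for illustration; the files are assumed to be UTF-8 text.

```python
import difflib
import unicodedata

def normalized_lines(path, form="NFC"):
    """Read a UTF-8 text file and normalize each line to the given form."""
    with open(path, encoding="utf-8") as handle:
        return [unicodedata.normalize(form, line) for line in handle]

def diff_normalized(path_a, path_b, form="NFC"):
    """Return unified-diff lines for the two files after normalization.

    An empty list means the files are equal once both are normalized.
    """
    lines_a = normalized_lines(path_a, form)
    lines_b = normalized_lines(path_b, form)
    return list(difflib.unified_diff(lines_a, lines_b,
                                     fromfile=path_a, tofile=path_b))
```

Calling diff_normalized on the text exported from Word and the text extracted from the PDF (the file names are up to you) then shows only the lines that still differ after normalization, which are the real discrepancies.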
