Difference between character set and codepage?

cursusductus's picture

Hi everybody, I'm trying to translate these two terms to spanish; They are often used interchangeably. As I can understand, character set is a collection of characters, "suerte" in spanish, and a codepage is a coded character set (or multiple sets) used by an operative system.
So, UNICODE is a codepage or an encoding? I'm getting crazy!
thanks

Mark Simonson's picture

A character set is all the characters in a font. It might be a few hundred or many thousands.

A codepage is the set of characters (or a subset in a large font) that can be typed directly from the keyboard for a particular keyboard layout.

A codepage and "an encoding" are essentially the same thing. Most codepages correspond to character sets in older 8-bit (256-character) font encoding schemes, such as ASCII.

Unicode is a standard system for assigning unique codes to semantically distinct characters in most of the world's languages.

In modern OpenType Unicode-based fonts: Unicode > character sets >= codepages

(This is a bit of a simplification, and I have not even talked about glyphs vs. characters...)

Tim Ahrens's picture

Jukka Korpela's Tutorial on character code issues is quite an extensive explanation of the subject: http://www.cs.tut.fi/~jkorpela/chars.html

j.hadley's picture

As to Unicode in particular I'd recommend The Unicode Consortium FAQ at http://www.unicode.org/faq/ as well as the "What is Unicode" discussion at http://www.unicode.org/standard/WhatIsUnicode.html.

cursusductus's picture

Thank you, I'm reading all this material (I knew Unicode, but somethings are not very well explained). I apreciate specially the simplification of Mark's answer, it helps to clarify. I'll go on studing.

Thomas Phinney's picture

All codepages are character sets, but not all character sets are codepages.

A character set is any specific collection of characters. You could consider any given font to have its own character set, which may or may not be the same as some externally-defined one.

A codepage is a character set used by a computer, usually OS specific, usually to support a specific language or set of languages. For example, MacRoman is a codepage. Windows codepage 1250 (Eastern European) is a codepage. Many codepages are single-byte character sets - that is, they contain no more than 256 characters.

Regards,

T

cursusductus's picture

Thanks, Thomas, so UNICODE is a codepage? or a meta-codepage?

Mark Simonson's picture

Unicode is not a codepage, and I'm not sure if "meta-codepage" would be a useful description, either.

Think of Unicode as a set of all possible character codes. A codepage is a subset of character codes in a particular order, usually limited to 256 characters.

A little background: A "page" is a block or section of computer memory. In early personal computer systems, a page of memory was 256 bytes, the largest number that can be represented with 8 bits. Before Unicode, most standard character code systems used 8-bit encoding, so it was not possible to have more than 256 characters. So, an 8-bit character set could be thought of as a "page", and a codepage usually refers to any standard pre-Unicode 8-bit character set.

Mark Simonson's picture

On second thought, maybe "meta-codepage" works, if you take it to mean "beyond codepage".

Artur Schmal's picture

Maybe it helps understanding how codepages work if you are aware that a single character can be represented in various codepages. So for example, the lowercase 'a' is represented the MacRoman codepage as well as in the Windows 1252 codepage (and in many more).

Think of Unicode as a label attached to the character via which the character can be accessed by app's and OS's. Some apps and OS's address characters through their name, some through their unicode.

Hope this helps,

Artur

Thomas Phinney's picture

Unicode is a single very large (and still growing) character set and encoding, which encompasses essentially all the standard computer character sets that predated it.

Most any computer codepage can be mapped to Unicode and back. However, in computer systems Unicode is largely replacing codepage based approaches, and for good reasons. Instead of having dozens of codepages each using (and re-using) the same numbered slots for different characters, each character gets its own unique numbered slot in Unicode.

T

piccic's picture

So, basically, as we get more support from applications, Unicode will replace the limited access of 8-bit codepages?

A thing which is always difficult for me to get is related to the Keyboard: as long as we have a keyboard won't we always have the need to access the glyphs via a limited "codepage"? if I have to type (mostly) Arabic or Greek, a Latin-based keyboard is of little use, isn't it?
I mean, after all, I will always need to have more direct access to the glyphs pertinent to the language I am typing, isn't it?

Thanks for explaining such things in a simple way…

behnam's picture

Keyboards are not always encoded in 8-bit, relative to a specific codepage. Not on the Mac anyway. But you are right. This keyboard issue is a source of a lot of confusion and incompatibility amongst programs and platforms, particularly for non Roman languages.
I'm guessing keyboards on Windows are always 8-bit encoded, then 'translated' to Unicode based on language setting along the way. This causes many cases of 'lost in translation' issues, for example for mirroring characters in rtl settings.

Dr jack's picture

That's a great line Artur...
"Think of Unicode as a label attached to the character via which the character can be accessed by app’s and OS’s"

The more things are explained simply, the easier it becomes for newbies.

Cheers

Thomas Phinney's picture

"I’m guessing keyboards on Windows are always 8-bit encoded...."

If that were true, Chinese and Japanese keyboards would be impossible, no? And whether or not it is true, there's be no reason for that to cause problems with mirroring characters in RTL settings.

Cheers,

T

behnam's picture

I see your point Thomas. But somehow it does. RTL imported form Windows to Mac is OK. RTL imported from Mac to Windows is not. I'm guessing it's because 'opening parenthesis' is directly encoded on the Mac RTL keyboards, but on Windows keyboards, it is always left parenthesis, subject to interpretation based on the language setting. But it may well be done with Unicode encoding as well I suppose. But isn't language setting related to codepages?
What is the role of codepage in Unicode era? Backward compatibility? Or just smaller codes?
This subject was always very confusing and beyond my comprehension. But apparently many applications agree with me!

Thomas Phinney's picture

Keyboards have no effect at all on imported text in any application I've ever worked with.

> But isn’t language setting related to codepages?

What do you mean by "language setting," *precisely*? On which OS or in which app is this setting?

Codepages have little role at all these days in Unicode apps, and that's a good thing. Mostly any role they do have, has to do with supporting pre-existing legacy fonts and text.

Regards,

T

behnam's picture

I have to get back to you on this when I get a chance to test it extensively. But here is what I gather now:
In Windows, you can produce an rtf Arabic (or Persian whatever) text, send it to a Mac, and reproduce the text in the Mac. No problem. When you do the same in the Mac, whether it is Persian or Arabic, there is no 'codepage' per se, only the appropriate keyboard. And when it is imported to Windows, it defaults the content to 'Arabic' if it does at all, and also it looses its mirroring effect. I know a lot of Mac users have trouble keeping their page layout intact when imported to Windows. I'll give you more detail when I get a chance.
Unicode doesn't identify a text as 'Arabic' (I wish it did!) only rtl. The rest is within the encoding. But it doesn't seem to work that way. If you look at the keyboard layout of an rtl Mac keyboard, the opening parenthesis is encoded as left parenthesis (which will mirror in rtl and become right parenthesis). If you look at a Windows rtl keyboard layout, the parenthesis are exactly the same as for ltr. The encoding has to be modified first, then mirrored in the process when the rtl content is recognized by the application. When it imports directly from a Mac for example, one of these two, modifying or mirroring is not happening.

Thomas Phinney's picture

> If you look at the keyboard layout of an rtl Mac keyboard

Totally irrelevant to most anything that happens with text when you're not actually typing.

> If you look at the keyboard layout of an rtl Mac keyboard,
> the opening parenthesis is encoded as left parenthesis

From a Unicode perspective, there's no such thing as left parenthesis, just opening and closing parentheses. What they look like depends on whether you're in RTL or LTR.

Now, it's possible that the Windows Arabic keyboard takes the thing that looks like a "left" paren on the keyboard and treats that input as an opening paren in the English keyboard and as a closing paren when you're using the Arabic keyboard. But whether or not it does so, this is unrelated to what happens with Arabic text moving between Mac and Windows....

Somebody who is really familiar with Arabic typesetting might be able to help decode the particular problems you're seeing, but you'll need to be a lot more specific about two things:

1) How are you producing the RTF Arabic text on the Mac? (Application, font, font format)

2) In what Windows application are you opening that Mac-created RTF text file?

It's in one of those steps that the problem likely lies.

I'll just note on the side that Mac Word doesn't support Arabic last I heard.

Regards,

T

Syndicate content Syndicate content