Difference between character set and codepage?

cursusductus
4.Dec.2007 3.34am

Hi everybody, I’m trying to translate these two terms into Spanish; they are often used interchangeably. As far as I understand, a character set is a collection of characters (“suerte” in Spanish), and a codepage is a coded character set (or multiple sets) used by an operating system.
So, is Unicode a codepage or an encoding? I’m going crazy!
Thanks



Mark Simonson
4.Dec.2007 5.30am

A character set is all the characters in a font. It might be a few hundred or many thousands.

A codepage is the set of characters (or a subset in a large font) that can be typed directly from the keyboard for a particular keyboard layout.

A codepage and “an encoding” are essentially the same thing. Most codepages correspond to character sets in older 8-bit (256-character) font encoding schemes, such as MacRoman or Windows 1252, which extend 7-bit ASCII.

Unicode is a standard system for assigning unique codes to semantically distinct characters in most of the world’s languages.

In modern OpenType Unicode-based fonts: Unicode > character sets >= codepages

(This is a bit of a simplification, and I have not even talked about glyphs vs. characters...)


Tim Ahrens
4.Dec.2007 6.23am

Jukka Korpela’s Tutorial on character code issues is quite an extensive explanation of the subject: http://www.cs.tut.fi/~jkorpela/chars.html


j.hadley
4.Dec.2007 2.57pm

As to Unicode in particular I’d recommend The Unicode Consortium FAQ at http://www.unicode.org/faq/ as well as the “What is Unicode” discussion at http://www.unicode.org/standard/WhatIsUnicode.html.


cursusductus
5.Dec.2007 9.28am

Thank you, I’m reading all this material (I knew about Unicode, but some things are not very well explained). I especially appreciate the simplification in Mark’s answer; it helps to clarify things. I’ll go on studying.


Thomas Phinney
6.Dec.2007 8.36pm

All codepages are character sets, but not all character sets are codepages.

A character set is any specific collection of characters. You could consider any given font to have its own character set, which may or may not be the same as some externally-defined one.

A codepage is a character set used by a computer, usually OS specific, usually to support a specific language or set of languages. For example, MacRoman is a codepage. Windows codepage 1250 (Eastern European) is a codepage. Many codepages are single-byte character sets - that is, they contain no more than 256 characters.
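To make the single-byte point concrete, here is a rough Python sketch that pushes every possible byte value through one codepage and counts the characters that come out. It assumes Python's built-in `cp1250` codec is a fair stand-in for Windows codepage 1250; the helper name `characters_in_codepage` is made up for this example.

```python
def characters_in_codepage(codec_name):
    """Return the set of characters a single-byte codec can represent."""
    chars = set()
    for value in range(256):  # one byte can hold 256 distinct values
        try:
            chars.add(bytes([value]).decode(codec_name))
        except UnicodeDecodeError:
            # Some slots may be left unassigned in a given codepage.
            pass
    return chars


cp1250_chars = characters_in_codepage("cp1250")
print(len(cp1250_chars) <= 256)   # a single-byte codepage cannot exceed 256
print("\u0159" in cp1250_chars)   # the Czech letter r-with-caron is covered
```

Whatever codec name is passed in, the set can never grow past 256 entries, because the loop only has 256 byte values to try.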

Regards,

T


cursusductus
12.Dec.2007 1.33am

Thanks, Thomas. So is Unicode a codepage? Or a meta-codepage?


Mark Simonson
12.Dec.2007 8.57am

Unicode is not a codepage, and I’m not sure if “meta-codepage” would be a useful description, either.

Think of Unicode as a set of all possible character codes. A codepage is a subset of character codes in a particular order, usually limited to 256 characters.

A little background: A “page” is a block or section of computer memory. In early personal computer systems, a page of memory was 256 bytes; 256 is the number of distinct values that can be represented with 8 bits. Before Unicode, most standard character code systems used 8-bit encoding, so it was not possible to have more than 256 characters. So, an 8-bit character set could be thought of as a “page”, and a codepage usually refers to any standard pre-Unicode 8-bit character set.


Mark Simonson
12.Dec.2007 9.01am

On second thought, maybe “meta-codepage” works, if you take it to mean “beyond codepage”.


Artur Schmal
12.Dec.2007 9.42am

It may help in understanding how codepages work to know that a single character can be represented in various codepages. So for example, the lowercase ’a’ is represented in the MacRoman codepage as well as in the Windows 1252 codepage (and in many more).
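A small Python sketch of that idea, assuming Python's built-in `mac_roman` and `cp1252` codecs match the MacRoman and Windows 1252 codepages closely enough for illustration. The accented letter is an extra example that is not from the post above, and the helper name `slot_in_codepage` is invented here.

```python
def slot_in_codepage(char, codec_name):
    """Return the byte value (as hex) that a codepage assigns to a character."""
    return char.encode(codec_name).hex()


for codec in ("mac_roman", "cp1252"):
    # "\u00e9" is the lowercase e with acute accent.
    print(codec, slot_in_codepage("a", codec), slot_in_codepage("\u00e9", codec))

# mac_roman 61 8e
# cp1252 61 e9
```

The plain ’a’ sits in the same slot in both, since both codepages keep the ASCII range, while the accented letter is present in both but lives in a different slot in each.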

Think of Unicode as a label attached to the character, through which apps and OSes can access it. Some apps and OSes address characters by name, some by their Unicode value.

Hope this helps,

Artur


Thomas Phinney
19.Dec.2007 11.50pm

Unicode is a single very large (and still growing) character set and encoding, which encompasses essentially all the standard computer character sets that predated it.

Almost any computer codepage can be mapped to Unicode and back. However, in computer systems Unicode is largely replacing codepage-based approaches, and for good reasons. Instead of having dozens of codepages each using (and re-using) the same numbered slots for different characters, each character gets its own unique numbered slot in Unicode.
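Here is a minimal Python sketch of the slot re-use problem and of the round trip through Unicode. It assumes Python's built-in `cp1252` and `mac_roman` codecs as stand-ins for the real codepages; the slot number 0xE9 and the helper name `decode_slot` are simply picked for this example.

```python
def decode_slot(value, codec_name):
    """Return the character that a codepage puts in a given numbered slot."""
    return bytes([value]).decode(codec_name)


for codec in ("cp1252", "mac_roman"):
    char = decode_slot(0xE9, codec)
    # The same slot number yields a different character per codepage,
    # but each character has exactly one Unicode code point.
    print(f"{codec}: slot 0xE9 -> {char} (U+{ord(char):04X})")

# Codepage -> Unicode -> codepage gives back the original byte.
round_trip = decode_slot(0xE9, "mac_roman").encode("mac_roman")
print(round_trip == bytes([0xE9]))
```

Without knowing which codepage a file was written in, the byte alone is ambiguous; the Unicode code point is not.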

T