
December 9, 2009

Character set, character encoding, and the XML encoding declaration

hanlray@gmail.com

Revision: 0.9 Date: 2005/08/28


1. Character set and character encoding

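I had always assumed that character set and character encoding were the same thing, and only recently realized that they are really two different concepts. Taken literally, a character set is just a collection of characters, such as the 26 English letters, all Chinese characters, or a combination of the two. A computer can only process numbers, though, so a character has to be represented as a number before a computer can handle it. A character set therefore also has to define a mapping from characters to numbers, for example character A maps to 65 and character Z maps to 90. These numbers are called code points.

A character encoding defines how those code points are represented as bytes. Code point 90, for example, could be encoded as a single byte, as a little-endian unsigned short, as a big-endian integer, or in some more complex form. In other words, a character set is only a table that maps characters to numbers; it does not say how those numbers are written out as bytes. That is the job of a character encoding. A character encoding therefore belongs to one character set, while one character set can have several character encodings: UTF-8 and UTF-16, for example, are both encodings of the Unicode character set.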
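The split between code point and encoding can be seen in a few lines of Python. This is only an illustrative sketch: the codec names are Python's own, chosen here to line up with the "single byte", "little-endian unsigned short", and "big-endian integer" forms mentioned above.

```python
# One character, one code point, several byte representations.
ch = "Z"
code_point = ord(ch)  # the number the character set assigns to "Z"

# Each codec is a different character encoding of the same code point.
encodings = ["utf-8", "utf-16-le", "utf-16-be", "utf-32-be"]
encoded = {name: ch.encode(name) for name in encodings}

print(code_point)  # 90
for name, data in encoded.items():
    print(name, data.hex(" "))
# utf-8 5a
# utf-16-le 5a 00
# utf-16-be 00 5a
# utf-32-be 00 00 00 5a
```

The code point stays 90 throughout; only the byte layout changes with the encoding.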

2. XML encoding declaration

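To process a piece of encoded text correctly, you have to know in advance which encoding it uses, so this information usually lives outside the text itself. Some editors, for example, put a few invisible bytes at the start of a file to indicate the encoding of its content; these bytes are called the BOM (Byte Order Mark). Some network protocols carry a field that indicates the encoding of the text they transport. This approach is intuitive and many systems and technologies use it. Most XML applications also prefer such external encoding information when it is available.

But what if no external encoding information is available? An XML document can declare its own encoding in an XML declaration such as <?xml version="1.0" encoding="UTF-8"?>. At first glance this looks like it cannot work: the declaration is itself text and is part of the XML document, so it seems unable to dictate the encoding needed to read it. How can a processor understand the encoding named in the XML declaration without already knowing the encoding of the document?

Restrictions on the position and content of the encoding declaration make autodetection possible: the declaration must appear at the very start of the document, may contain only ASCII characters, and must begin with the characters <?xml. An XML processor can therefore read the first few bytes of the document, infer which family of encodings is in use, use that knowledge to read the XML declaration, and take the value of its encoding attribute as the encoding of the document. For example, if the first 4 bytes are 3C 3F 78 6D, the document must use some ASCII-compatible encoding, so the XML processor can read the encoded XML declaration with any ASCII-compatible encoding (such as UTF-8): the declaration contains only ASCII characters, and every ASCII-compatible encoding produces the same bytes for them. Once it has the encoding named in the XML declaration, the XML processor switches to that encoding to process the document. The following table, from the W3C XML Recommendation, lists the autodetection rules for a document that does not start with a BOM: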

00 00 00 3C
3C 00 00 00
00 00 3C 00
00 3C 00 00
    UCS-4 or other encoding with a 32-bit code unit and ASCII characters encoded as ASCII values, in respectively big-endian (1234), little-endian (4321) and two unusual byte orders (2143 and 3412). The encoding declaration must be read to determine which of UCS-4 or other supported 32-bit encodings applies.

00 3C 00 3F
    UTF-16BE or big-endian ISO-10646-UCS-2 or other encoding with a 16-bit code unit in big-endian order and ASCII characters encoded as ASCII values (the encoding declaration must be read to determine which)

3C 00 3F 00
    UTF-16LE or little-endian ISO-10646-UCS-2 or other encoding with a 16-bit code unit in little-endian order and ASCII characters encoded as ASCII values (the encoding declaration must be read to determine which)

3C 3F 78 6D
    UTF-8, ISO 646, ASCII, some part of ISO 8859, Shift-JIS, EUC, or any other 7-bit, 8-bit, or mixed-width encoding which ensures that the characters of ASCII have their normal positions, width, and values; the actual encoding declaration must be read to detect which of these applies, but since all of these encodings use the same bit patterns for the relevant ASCII characters, the encoding declaration itself may be read reliably

4C 6F A7 94
    EBCDIC (in some flavor; the full encoding declaration must be read to tell which code page is in use)

Other
    UTF-8 without an encoding declaration, or else the data stream is mislabeled (lacking a required encoding declaration), corrupt, fragmentary, or enclosed in a wrapper of some kind
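The two-step detection described above (guess the encoding family from the leading bytes, then read the declaration) can be sketched roughly as follows. This is a simplified illustration, not how any particular XML processor does it: the function name `sniff_xml_encoding`, the 512-byte read window, and the regular expression are all assumptions made for this example, and the two unusual 32-bit byte orders and the EBCDIC row are deliberately left out.

```python
import re

# Leading-byte patterns from the table (documents without a BOM), each paired
# with a provisional Python codec that is good enough to read the declaration.
# Assumption: the 2143/3412 byte orders and EBCDIC are not handled here.
_LEADING_BYTES = [
    (b"\x00\x00\x00\x3c", "utf-32-be"),
    (b"\x3c\x00\x00\x00", "utf-32-le"),
    (b"\x00\x3c\x00\x3f", "utf-16-be"),
    (b"\x3c\x00\x3f\x00", "utf-16-le"),
    (b"\x3c\x3f\x78\x6d", "ascii"),
]

# Hypothetical, simplified pattern for the encoding part of the declaration.
_ENCODING_DECL = re.compile(
    r"""<\?xml[^>]*?\sencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def sniff_xml_encoding(data: bytes, default: str = "utf-8") -> str:
    """Return the declared encoding name, or a fallback guess."""
    # Step 1: infer the encoding family from the first four bytes.
    provisional = next(
        (codec for prefix, codec in _LEADING_BYTES if data.startswith(prefix)),
        None,
    )
    if provisional is None:
        # The "Other" row: assume UTF-8 without an encoding declaration.
        return default

    # Step 2: decode only the start of the document with the provisional
    # codec; the declaration is pure ASCII, so this reads it reliably.
    head = data[:512].decode(provisional, errors="replace")
    match = _ENCODING_DECL.match(head)
    if match:
        return match.group(1)
    # No encoding named: fall back to the default for the ASCII-compatible
    # family, otherwise keep the codec inferred from the byte pattern.
    return default if provisional == "ascii" else provisional


doc = '<?xml version="1.0" encoding="ISO-8859-1"?><a/>'.encode("latin-1")
print(sniff_xml_encoding(doc))  # ISO-8859-1
```

The returned name is whatever the document declares, so a caller would still have to check that it is an encoding it actually supports before switching to it.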

posted @ 2009-12-09 20:27 TonyZhangtl