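Chinese characters or Unicode

Converting between Chinese characters and Unicode escapes

(July 17, 2006, 11:07:58)

I. Overview:

If a project uses GBK encoding throughout, converting Chinese characters is not a problem. If it uses UTF-8 instead, handling Chinese characters is somewhat more troublesome. The approach here is to turn every character into an ASCII-only Unicode escape before writing it out, and to decode the escapes again when reading.

II. Implementation:

The code is as follows: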

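import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Convert the string to Unicode escapes and write it out
public static void writeUnicode(final DataOutputStream out, final String value) {
    try {
        final String unicode = gbEncoding(value);
        // The escaped text is pure ASCII, so name the charset explicitly
        // instead of relying on the platform default
        final byte[] data = unicode.getBytes(StandardCharsets.US_ASCII);
        final int dataLength = data.length;
        System.out.println("Data Length is: " + dataLength);
        System.out.println("Data is: " + value);
        out.writeInt(dataLength);       // first write the length of the string
        out.write(data, 0, dataLength); // then write the converted string
    } catch (IOException e) {
        // The exception was silently swallowed before; at least report it
        e.printStackTrace();
    }
}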

public static String gbEncoding(final String gbString) {
    final char[] utfBytes = gbString.toCharArray();
    final StringBuilder unicodeBytes = new StringBuilder();
    for (int byteIndex = 0; byteIndex < utfBytes.length; byteIndex++) {
        String hexB = Integer.toHexString(utfBytes[byteIndex]);
        // Pad to exactly four hex digits. Prepending a fixed "00" only works
        // for two-digit values; one-digit and three-digit values would yield
        // escapes that decodeUnicode cannot read back.
        while (hexB.length() < 4) {
            hexB = "0" + hexB;
        }
        unicodeBytes.append("\\u").append(hexB);
    }
    // System.out.println("unicodeBytes is: " + unicodeBytes);
    return unicodeBytes.toString();
}

/**
 * This method will decode the String to a recognized String in ui.
 * Function: converts backslash-u escape codes back into the characters
 * they represent.
 * @author javajohn
 * @param dataStr text that may contain four-digit backslash-u escapes
 * @return the decoded text
 */
public static StringBuffer decodeUnicode(final String dataStr) {
    final StringBuffer buffer = new StringBuffer();
    String tempStr = "";
    String operStr = dataStr;
    // No escape at all: return the input unchanged
    if (operStr != null && operStr.indexOf("\\u") == -1) return buffer.append(operStr);
    // Copy any plain text that precedes the first escape
    if (operStr != null && !operStr.equals("") && !operStr.startsWith("\\u")) {
        tempStr = operStr.substring(0, operStr.indexOf("\\u"));
        // from here on operStr is guaranteed to start with an escape
        operStr = operStr.substring(operStr.indexOf("\\u"));
    }
    buffer.append(tempStr);
    // Loop; the string being processed always starts with an escape
    while (operStr != null && !operStr.equals("") && operStr.startsWith("\\u")) {
        tempStr = operStr.substring(0, 6); // backslash, 'u', and four hex digits
        operStr = operStr.substring(6);
        String charStr = tempStr.substring(2);
        char letter = (char) Integer.parseInt(charStr, 16); // parse the hex digits as an integer
        buffer.append(letter);
        if (operStr.indexOf("\\u") == -1) {
            // no further escapes: append the rest as-is
            buffer.append(operStr);
        } else {
            // advance operStr so that it starts with the next escape
            tempStr = operStr.substring(0, operStr.indexOf("\\u"));
            operStr = operStr.substring(operStr.indexOf("\\u"));
            buffer.append(tempStr);
        }
    }
    return buffer;
}
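
A round trip through the two conversions can be sketched as follows. This is a minimal, self-contained sketch rather than the listing above: the class name EscapeRoundTrip and the method names escape and unescape are hypothetical, and it assumes every escape carries exactly four valid hex digits.

```java
public class EscapeRoundTrip {
    // Hypothetical helper: write every char as backslash, 'u', four hex digits.
    public static String escape(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            sb.append(String.format("\\u%04x", (int) c));
        }
        return sb.toString();
    }

    // Hypothetical helper: decode four-digit escapes and copy all other text.
    // Assumes the four characters after the 'u' are valid hex digits.
    public static String unescape(String text) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            if (text.startsWith("\\u", i) && i + 6 <= text.length()) {
                sb.append((char) Integer.parseInt(text.substring(i + 2, i + 6), 16));
                i += 6;
            } else {
                sb.append(text.charAt(i));
                i++;
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String original = "a\u6211";          // the letter a followed by U+6211
        String escaped = escape(original);     // ASCII-only text
        String restored = unescape(escaped);
        System.out.println(escaped);
        System.out.println(restored.equals(original));
    }
}
```

Because the escaped form contains only ASCII, it survives any byte encoding that is ASCII-compatible, which is the point of the technique.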

III. Conclusion:

posted on 2006-07-17 11:07 by javajohn. Category: My Memories

Feedback

# re: Chinese characters or Unicode 2006-07-18 17:11 小猪

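On understanding code units and code points:
1. A code point may consist of one or two code units.
2. In my test program, "我" also occupies only one code unit, because it is U+6211 and lies inside the Basic Multilingual Plane. So for this string the number of code points equals the number of code units.
Below are some notes on Chinese, Korean, and Japanese in Unicode that I found on the official Unicode website: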
Q: I have heard that UTF-8 does not support some Japanese characters. Is this correct?

A: There is a lot of misinformation floating around about the support of Chinese, Japanese and Korean (CJK) characters. The Unicode Standard supports all of the CJK characters from JIS X 0208, JIS X 0212, JIS X 0221, or JIS X 0213, for example, and many more. This is true no matter which encoding form of Unicode is used: UTF-8, UTF-16, or UTF-32.

Unicode supports over 70,000 CJK characters right now, and work is underway to encode further additions. The International Standard ISO/IEC 10646 and the Unicode Standard are completely synchronized in repertoire and content. And that means that Unicode has the same repertoire as GB 18030, since that also is synchronized with ISO 10646 — although with a different ordering and byte format.
So does every encoding form (UTF-8, UTF-16, or UTF-32) fully support Chinese?
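
A quick way to check this for a single character is to encode it in each form and decode it again. The sketch below is only an illustration: the class name EncodingForms and its helper methods are hypothetical, and it assumes the runtime provides the UTF-16BE and UTF-32BE charsets by those names, which standard JDKs do as far as I know.

```java
import java.nio.charset.Charset;

public class EncodingForms {
    // Hypothetical helper: encode the text in the named charset and decode it again.
    public static String roundTrip(String text, String charsetName) {
        Charset cs = Charset.forName(charsetName);
        return new String(text.getBytes(cs), cs);
    }

    // Hypothetical helper: number of bytes the text needs in the named charset.
    public static int byteCount(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName)).length;
    }

    public static void main(String[] args) {
        String text = "\u6211"; // U+6211, written as an escape so the source file stays ASCII
        for (String name : new String[] {"UTF-8", "UTF-16BE", "UTF-32BE"}) {
            System.out.println(name + ": " + byteCount(text, name)
                    + " bytes, round trip ok = " + roundTrip(text, name).equals(text));
        }
    }
}
```

The byte counts differ (3, 2, and 4 for this character), but each form decodes back to the same character, which is what the FAQ answer says.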

My test program is as follows:
public class test0 {
    public static void main(String[] args) {
        String a = "我 ";
        // length() counts UTF-16 code units
        int cuCount = a.length();
        System.out.println("the number of code units required for string \"我 \" in the UTF-16 encoding is " + cuCount);
        // codePointCount() counts Unicode code points
        int cpCount = a.codePointCount(0, a.length());
        System.out.println("the number of code points is " + cpCount);
        System.out.println("the end of string \"我 \" is " + a.charAt(a.length() - 1));
    }
}

The output is:
the number of code units required for string "我 " in the UTF-16 encoding is 2
the number of code points is 2
the end of string "我 " is [space]
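
The test above only uses characters inside the Basic Multilingual Plane, so the two counts agree. A character outside it shows the difference. The sketch below is an illustration with hypothetical names (CodeUnitsVsCodePoints, codeUnits, codePoints); U+20000 is used as an example of a supplementary CJK character.

```java
public class CodeUnitsVsCodePoints {
    // Number of UTF-16 code units in the string.
    public static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    public static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String bmp = "\u6211";                                          // U+6211, inside the BMP
        String supplementary = new String(Character.toChars(0x20000)); // U+20000, outside the BMP
        System.out.println(codeUnits(bmp) + " / " + codePoints(bmp));                     // prints 1 / 1
        System.out.println(codeUnits(supplementary) + " / " + codePoints(supplementary)); // prints 2 / 1
    }
}
```

This also matters for the escape approach in the post: a supplementary character is escaped as two four-digit escapes, one per surrogate code unit.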

I found the Set Encoding option in Eclipse, where the encoding can be configured.
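
Why the encoding setting matters can be seen by decoding the same bytes with two different charsets. This is a small sketch with hypothetical names (CharsetMismatch, encodeUtf8DecodeAs), not something taken from Eclipse itself.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetMismatch {
    // Hypothetical helper: encode as UTF-8, then decode with the given charset.
    public static String encodeUtf8DecodeAs(String text, Charset decodeWith) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        String text = "\u6211"; // U+6211
        String right = encodeUtf8DecodeAs(text, StandardCharsets.UTF_8);
        String wrong = encodeUtf8DecodeAs(text, StandardCharsets.ISO_8859_1);
        System.out.println(right.equals(text)); // prints true
        System.out.println(wrong.length());     // prints 3: each UTF-8 byte became its own char
    }
}
```

When the reader and the writer agree on the charset the text comes back intact; when they disagree, one Chinese character turns into three unrelated ones.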
