love fish: The great Peng rises one day with the wind, soaring straight up ninety thousand li

Converting Between Chinese Characters and Unicode Encoding (Repost)

(July 17, 2006, 11:07:58)

I. Overview:

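If a project uses the GBK encoding, converting Chinese characters is not a problem. If it uses UTF-8, however, handling Chinese characters is somewhat more troublesome.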
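To make the difference between the two encodings concrete, here is a minimal sketch that prints the bytes of one Chinese character under each of them. The class name `EncodingSizes` and the helper `hexBytes` are hypothetical names chosen for this illustration. GBK is not one of the charsets every JVM is required to provide, so the sketch checks for it before using it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingSizes {
    // Returns the encoded bytes of text as space-separated upper-case hex.
    static String hexBytes(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+6211, written as an escape so the source file itself stays ASCII.
        String wo = "\u6211";
        System.out.println("UTF-8: " + hexBytes(wo, StandardCharsets.UTF_8)); // E6 88 91
        // Assumption: the running JVM ships the GBK charset; skip it otherwise.
        if (Charset.isSupported("GBK")) {
            System.out.println("GBK:   " + hexBytes(wo, Charset.forName("GBK"))); // CE D2
        }
    }
}
```

The same character takes three bytes in UTF-8 and two in GBK, which is why code that assumes one byte layout tends to break when the project encoding changes.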

II. Implementation:

The code is as follows:

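// Convert to Unicode escapes and write them to the stream
// Needs: java.io.DataOutputStream, java.io.IOException, java.nio.charset.StandardCharsets
public static void writeUnicode(final DataOutputStream out, final String value) {
    try {
        final String unicode = gbEncoding(value);
        // The escaped text is pure ASCII; naming the charset avoids the platform default
        final byte[] data = unicode.getBytes(StandardCharsets.US_ASCII);
        final int dataLength = data.length;
        System.out.println("Data Length is: " + dataLength);
        System.out.println("Data is: " + value);
        out.writeInt(dataLength); // write the length of the string first
        out.write(data, 0, dataLength); // then write the converted string
    } catch (IOException e) {
        // The original swallowed this exception silently; report it at least
        e.printStackTrace();
    }
}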

public static String gbEncoding(final String gbString) {
    final char[] utfBytes = gbString.toCharArray();
    final StringBuilder unicodeBytes = new StringBuilder();
    for (int byteIndex = 0; byteIndex < utfBytes.length; byteIndex++) {
        // Always pad to four hex digits. The original only prepended "00" when the
        // hex string had at most two digits, so one-digit and three-digit values
        // produced escapes of the wrong width.
        final String hexB = String.format("%04x", (int) utfBytes[byteIndex]);
        unicodeBytes.append("\\u").append(hexB);
    }
    // System.out.println("unicodeBytes is: " + unicodeBytes);
    return unicodeBytes.toString();
}

/**
 * This method decodes the string into one the UI can display.
 * Function: turn Unicode escapes back into the characters they stand for
 * (the original note names UTF-8 as the target format).
 * @author javajohn
 * @param dataStr text that may contain escapes (backslash, 'u', four hex digits)
 * @return the decoded text
 */
public static StringBuffer decodeUnicode(final String dataStr) {
    final StringBuffer buffer = new StringBuffer();
    String tempStr = "";
    String operStr = dataStr;
    // No escape anywhere: return the input unchanged
    if (operStr != null && operStr.indexOf("\\u") == -1) return buffer.append(operStr);
    if (operStr != null && !operStr.equals("") && !operStr.startsWith("\\u")) {
        // Keep the plain text that precedes the first escape
        tempStr = operStr.substring(0, operStr.indexOf("\\u"));
        operStr = operStr.substring(operStr.indexOf("\\u")); // operStr now always starts with an escape
    }
    buffer.append(tempStr);
    // Loop; operStr always starts with an escape here.
    // Assumes each escape carries four hex digits; malformed input may throw.
    while (operStr != null && !operStr.equals("") && operStr.startsWith("\\u")) {
        tempStr = operStr.substring(0, 6);
        operStr = operStr.substring(6);
        final String charStr = tempStr.substring(2);
        final char letter = (char) Integer.parseInt(charStr, 16); // parse the hex digits as an integer
        buffer.append(letter);
        if (operStr.indexOf("\\u") == -1) {
            buffer.append(operStr); // no more escapes: copy the rest
        } else { // advance operStr so that it starts with the next escape
            tempStr = operStr.substring(0, operStr.indexOf("\\u"));
            operStr = operStr.substring(operStr.indexOf("\\u"));
            buffer.append(tempStr);
        }
    }
    return buffer;
}
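The listing only shows the writing side. As a rough sketch of the whole round trip, the class below writes a length-prefixed frame into memory, reads it back, and decodes it. It is not the author's code: `EscapeRoundTrip` and its methods are hypothetical names, the decoder uses a regular expression instead of the substring loop above, and the stream is a `ByteArrayOutputStream` chosen only so the example runs on its own.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EscapeRoundTrip {
    // Matches a backslash, the letter u, and exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    // Escapes every UTF-16 code unit as backslash-u plus four hex digits.
    static String encode(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            sb.append(String.format("\\u%04x", (int) c));
        }
        return sb.toString();
    }

    // Replaces each well-formed escape with its character; other text is copied as is.
    static String decode(String text) {
        Matcher m = ESCAPE.matcher(text);
        StringBuilder sb = new StringBuilder();
        int last = 0;
        while (m.find()) {
            sb.append(text, last, m.start());
            sb.append((char) Integer.parseInt(m.group(1), 16));
            last = m.end();
        }
        sb.append(text, last, text.length());
        return sb.toString();
    }

    // Writes a 4-byte length followed by the ASCII bytes of the escaped text.
    static byte[] writeFrame(String value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            byte[] data = encode(value).getBytes(StandardCharsets.US_ASCII);
            out.writeInt(data.length);
            out.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads the length prefix, then exactly that many bytes, and decodes them.
    static String readFrame(byte[] frame) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(frame))) {
            byte[] data = new byte[in.readInt()];
            in.readFully(data);
            return decode(new String(data, StandardCharsets.US_ASCII));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // U+6C49 U+5B57 (two Chinese characters), a space, then ASCII text.
        String original = "\u6c49\u5b57 abc";
        byte[] frame = writeFrame(original);
        System.out.println(frame.length);                      // 4 + 6 * 6 = 40
        System.out.println(readFrame(frame).equals(original)); // true
    }
}
```

Because every character becomes six ASCII bytes, the frame is larger than a direct UTF-8 encoding would be; the benefit is that the payload survives any channel that only handles ASCII.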

III. Conclusion:

Posted on 2006-07-17 11:07 by javajohn. Category: 我的记忆 (My Memories)

FeedBack:

# re: Chinese characters, or Unicode?

2006-07-18 17:11 | 小猪

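My understanding of code units and code points:
1. A code point may consist of one or two code units.
2. In my test program, each character of "我 " also takes only one code unit, so the number of code points equals the number of code units.
Below are some notes on Chinese, Korean and Japanese in Unicode that I found on the official Unicode website: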
Q: I have heard that UTF-8 does not support some Japanese characters. Is this correct?

A: There is a lot of misinformation floating around about the support of Chinese, Japanese and Korean (CJK) characters. The Unicode Standard supports all of the CJK characters from JIS X 0208, JIS X 0212, JIS X 0221, or JIS X 0213, for example, and many more. This is true no matter which encoding form of Unicode is used: UTF-8, UTF-16, or UTF-32.

Unicode supports over 70,000 CJK characters right now, and work is underway to encode further additions. The International Standard ISO/IEC 10646 and the Unicode Standard are completely synchronized in repertoire and content. And that means that Unicode has the same repertoire as GB 18030, since that also is synchronized with ISO 10646 — although with a different ordering and byte format.
So whichever encoding form is used (UTF-8, UTF-16, or UTF-32), Chinese is fully supported?
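One way to check this for a given character is to encode it in each form and decode it again. The sketch below does that for a single character; `EncodingForms` and its methods are hypothetical names, and it assumes the running JDK provides a charset named `UTF-32BE`, which is not among the charsets every JVM must support.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingForms {
    // Number of bytes the text occupies in the given charset.
    static int byteLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    // Encodes the text, decodes it again, and reports whether nothing was lost.
    static boolean roundTrips(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset).equals(text);
    }

    public static void main(String[] args) {
        String text = "\u6211"; // U+6211, written as an escape to keep the source ASCII
        // Assumption: this charset name is available on the running JDK.
        Charset utf32 = Charset.forName("UTF-32BE");

        System.out.println(byteLength(text, StandardCharsets.UTF_8));    // 3
        System.out.println(byteLength(text, StandardCharsets.UTF_16BE)); // 2
        System.out.println(byteLength(text, utf32));                     // 4

        System.out.println(roundTrips(text, StandardCharsets.UTF_8)
                && roundTrips(text, StandardCharsets.UTF_16BE)
                && roundTrips(text, utf32));                             // true
    }
}
```

The byte counts differ, but all three forms carry the same code point, which matches what the quoted FAQ says.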

My test program is as follows:
public class test0 {
    public static void main(String[] args) {
        String a = "我 ";
        int cuCount = a.length();
        System.out.println("the number of code units required for string \"test\" in the UTF-16 encoding is " + cuCount);
        int cpCount = a.codePointCount(0, a.length());
        System.out.println("the number of code points is " + cpCount);
        System.out.println("the end of string \"我 \" is " + a.charAt(a.length() - 1));
    }
}

The output is:
the number of code units required for string "test" in the UTF-16 encoding is 2
the number of code points is 2
the end of string "我 " is [space]
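The test string above holds two characters from the Basic Multilingual Plane ("我" and a space), so both counts come out as 2. The two counts only diverge for characters outside that plane. The following sketch, with the hypothetical class name `CodeUnitsVsCodePoints`, compares U+6211 with U+20000, a CJK ideograph that UTF-16 stores as a surrogate pair.

```java
final class CodeUnitsVsCodePoints {
    // Number of UTF-16 code units in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String bmp = "\u6211";                                         // U+6211, inside the BMP
        String supplementary = new String(Character.toChars(0x20000)); // U+20000, outside the BMP

        System.out.println(codeUnits(bmp) + " " + codePoints(bmp));                     // 1 1
        System.out.println(codeUnits(supplementary) + " " + codePoints(supplementary)); // 2 1
        // The first code unit of the pair is a high surrogate, not a character on its own.
        System.out.println(Character.isHighSurrogate(supplementary.charAt(0)));         // true
    }
}
```

This also shows a limit of the escape scheme in the article: it works per `char`, so a supplementary character is written as two escapes, one for each surrogate.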

I found the Set Encoding option in Eclipse, where the encoding can be set.

Posted on 2006-07-21 01:46 by liaojiyong. Category: Java
