Unless otherwise noted, all articles are original. If you quote them, please credit the source.
Email:liangtianyu@gmail.com
MSN:terry.liangtianyu@hotmail.com

Lucene 2.1 Study: Character Classification

posted on 2007-07-02 08:14 Terry Liang Views(1641) Comments(5) Category: Lucene 2.1 Study

FeedBack:

# re: Lucene 2.1 Study: Character Classification 2007-07-02 14:02 xmlspy

I can't work out how this is meant to be used. My test code is below.

Whatever I pass in, it returns false.

import org.apache.oro.text.regex.MalformedPatternException;
import org.apache.oro.text.regex.Pattern;
import org.apache.oro.text.regex.PatternCompiler;
import org.apache.oro.text.regex.PatternMatcher;
import org.apache.oro.text.regex.Perl5Compiler;
import org.apache.oro.text.regex.Perl5Matcher;

// Regular expressions
public class RegxLan {

    // Matches a Unicode letter:
    private static final String UNICODE_LETTER_PATTERN = "[(\u0041-\u005a)|"
            + "(\u0061-\u007a)|(\u00c0-\u00d6)|(\u00d8-\u00f6)|(\u00f8-\u00ff)|"
            + "(\u0100-\u1fff)]";

    // Matches an Asian-language character (Chinese, Japanese, Korean):
    private static final String UNICODE_CJP_PATTERN = "[(\u3040-\u318f)|(\u3300-\u337f)|"
            + "(\u3400-\u3d2d)|(\u4e00-\u9fff)|(\uf900-\ufaff)|(\uac00-\ud7af)]";

    // Matches a Unicode digit:
    private static final String UNICODE_DIGIT_PATTERN = "[(\u0030-\u0039)|"
            + "(\u0660-\u0669)|(\u06f0-\u06f9)|(\u0966-\u096f)|(\u09e6-\u09ef)|"
            + "(\u0a66-\u0a6f)|(\u0ae6-\u0aef)|(\u0b66-\u0b6f)|(\u0be7-\u0bef)|"
            + "(\0c66-\u0c6f)|(\u0ce6-\u0cef)|(\u0d66-\u0d6f)|(\u0e50-\u0e59)|"
            + "(\u0ed0-\u0ed9)|(\u1040-\u1049)]";

    /**
     * Tests whether the input is a Unicode letter.
     */
    public static final boolean isUnicodeLetter(String str) {
        return testString(str, UNICODE_LETTER_PATTERN);
    }

    /**
     * Tests whether the input is a Unicode digit.
     */
    public static final boolean isUnicodeDigit(String str) {
        return testString(str, UNICODE_DIGIT_PATTERN);
    }

    /**
     * Tests whether the input is an Asian-language (CJK) character.
     */
    public static final boolean isUnicodeCPJ(String str) {
        return testString(str, UNICODE_CJP_PATTERN);
    }

    public static void main(String[] args) {
        String x = "123";
        boolean is = isUnicodeLetter(x);
        System.out.println(is);
        is = isUnicodeDigit(x);
        System.out.println(is);
        is = isUnicodeCPJ(x);
        System.out.println(is);
    }

    // Compiles the pattern with Jakarta ORO and requires the whole input to match it.
    private static final boolean testString(String str, String pattern) {
        PatternCompiler cpl = new Perl5Compiler();
        Pattern p = null;
        try {
            p = cpl.compile(pattern);
        } catch (MalformedPatternException e) {
            e.printStackTrace();
        }
        PatternMatcher matcher = new Perl5Matcher();
        return matcher.matches(str, p);
    }
}


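# re: Lucene 2.1 Study: Character Classification 2007-07-02 14:16 Terry Liang

@xmlspy
What I defined are regular-expression patterns. I tested them in C#, and I did state that they are meant for testing a single character; if you pass in a multi-character string, they will of course only return false.
For example, matching "我" against UnicodeCJPattern returns true.
My apologies for not including an example of using them with Java regular expressions.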
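The single-character point above can be illustrated with a minimal sketch. It uses the JDK's `java.util.regex` rather than Jakarta ORO, and the class name `SingleCharDemo` and the helpers `isSingle` and `allMatch` are hypothetical names introduced only for this illustration. The CJK ranges are the ones quoted in the thread, written without parentheses or `|`.

```java
import java.util.regex.Pattern;

class SingleCharDemo {
    // ASCII digits only, to keep the sketch short.
    static final Pattern ASCII_DIGIT = Pattern.compile("[0-9]");

    // CJK ranges as quoted in the thread, listed directly inside the brackets.
    static final Pattern CJK = Pattern.compile(
            "[\u3040-\u318f\u3300-\u337f\u3400-\u3d2d\u4e00-\u9fff\uf900-\ufaff\uac00-\ud7af]");

    // True only when the whole input is exactly one character from the class.
    static boolean isSingle(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    // True when the input is non-empty and every character is in the class.
    static boolean allMatch(Pattern p, String s) {
        if (s.isEmpty()) {
            return false;
        }
        for (int i = 0; i < s.length(); i++) {
            if (!isSingle(p, String.valueOf(s.charAt(i)))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isSingle(ASCII_DIGIT, "123")); // false: three characters
        System.out.println(isSingle(ASCII_DIGIT, "1"));   // true
        System.out.println(allMatch(ASCII_DIGIT, "123")); // true
        System.out.println(isSingle(CJK, "我"));           // true
    }
}
```

A whole-string `matches` call against a one-character class can only succeed for one-character input, which is consistent with "123" returning false in the test above.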

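# re: Lucene 2.1 Study: Character Classification 2007-07-02 21:37 xmlspy

Thanks :)

Feel free to fix up my code; it can then serve as the example :)

# re: Lucene 2.1 Study: Character Classification 2007-07-02 22:24 xmlspy

I ran some tests and there are still problems; the patterns are not rigorous.

Please take a look :)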

import org.apache.oro.text.regex.MalformedPatternException;
import org.apache.oro.text.regex.Pattern;
import org.apache.oro.text.regex.PatternCompiler;
import org.apache.oro.text.regex.PatternMatcher;
import org.apache.oro.text.regex.Perl5Compiler;
import org.apache.oro.text.regex.Perl5Matcher;

// Regular expressions
// JDK version: jdk1.5.0_09
// Library: jakarta-oro-2.0.8.jar
// Operating system: win2003 standard
public class RegxLan {

    // Matches a Unicode letter:
    private static final String UNICODE_LETTER_PATTERN = "[(\u0041-\u005a)|"
            + "(\u0061-\u007a)|(\u00c0-\u00d6)|(\u00d8-\u00f6)|(\u00f8-\u00ff)|"
            + "(\u0100-\u1fff)]";

    // Matches an Asian-language character (Chinese, Japanese, Korean):
    private static final String UNICODE_CJP_PATTERN = "[(\u3040-\u318f)|(\u3300-\u337f)|"
            + "(\u3400-\u3d2d)|(\u4e00-\u9fff)|(\uf900-\ufaff)|(\uac00-\ud7af)]";

    // Matches a Unicode digit:
    private static final String UNICODE_DIGIT_PATTERN = "[(\u0030-\u0039)|"
            + "(\u0660-\u0669)|(\u06f0-\u06f9)|(\u0966-\u096f)|(\u09e6-\u09ef)|"
            + "(\u0a66-\u0a6f)|(\u0ae6-\u0aef)|(\u0b66-\u0b6f)|(\u0be7-\u0bef)|"
            + "(\0c66-\u0c6f)|(\u0ce6-\u0cef)|(\u0d66-\u0d6f)|(\u0e50-\u0e59)|"
            + "(\u0ed0-\u0ed9)|(\u1040-\u1049)]";

    /**
     * Tests whether the input is a Unicode letter.
     */
    public static final boolean isUnicodeLetter(String str) {
        return testString(str, UNICODE_LETTER_PATTERN);
    }

    /**
     * Tests whether the input is a Unicode digit.
     */
    public static final boolean isUnicodeDigit(String str) {
        return testString(str, UNICODE_DIGIT_PATTERN);
    }

    /**
     * Tests whether the input is an Asian-language (CJK) character.
     */
    public static final boolean isUnicodeCPJ(String str) {
        return testString(str, UNICODE_CJP_PATTERN);
    }

    // Testing shows there are still problems; symbols in particular are classified incorrectly.
    // In addition, English letters are treated as digits.
    // The full-width characters ， and ． return false for all three checks,
    // whereas the full-width character × returns false, true, false.
    public static void main(String[] args) {
        // The last three are full-width characters
        char[] test = "`~!@#$%^&*()_-+=|\\,.<>/?;:'\"[]{}w2这×，．".toCharArray();

        for (char t : test) {
            String x = String.valueOf(t);
            System.out.println("========== Results for character: " + t + " ==========");

            boolean is = isUnicodeLetter(x);
            System.out.println("isUnicodeLetter == " + is);
            is = isUnicodeDigit(x);
            System.out.println("isUnicodeDigit == " + is);
            is = isUnicodeCPJ(x);
            System.out.println("isUnicodeCPJ == " + is);
        }
    }

    // Compiles the pattern with Jakarta ORO and requires the whole input to match it.
    private static final boolean testString(String str, String pattern) {
        PatternCompiler cpl = new Perl5Compiler();
        Pattern p = null;
        try {
            p = cpl.compile(pattern);
        } catch (MalformedPatternException e) {
            e.printStackTrace();
        }
        PatternMatcher matcher = new Perl5Matcher();
        return matcher.matches(str, p);
    }
}
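One possible reading of these results, offered as a sketch rather than a confirmed diagnosis: in the digit pattern as posted, `(\0c66-\u0c6f)` begins with `\0`, which a Java string literal treats as an octal escape, not as `\u0c66`. The literal then contributes NUL, `c`, `6`, and the range from `6` to U+0C6F, which spans the English letters and `×`. Separately, `(`, `)` and `|` inside `[...]` are ordinary class members, not grouping or alternation. The demo below uses `java.util.regex` instead of Jakarta ORO, so it assumes ORO treats character classes the same way; `OctalEscapeDemo` and `inClass` are hypothetical names.

```java
import java.util.regex.Pattern;

class OctalEscapeDemo {
    // "\0" is an octal escape for U+0000, so this literal is four chars: NUL, 'c', '6', '6'.
    static final String TYPO = "\0c66";
    // The presumably intended single character U+0C66 (Telugu digit zero).
    static final String INTENDED = "\u0c66";

    // True when the one-character input is a member of the given character class.
    static boolean inClass(String charClass, String oneChar) {
        return Pattern.compile(charClass).matcher(oneChar).matches();
    }

    public static void main(String[] args) {
        System.out.println(TYPO.length());     // 4
        System.out.println(INTENDED.length()); // 1

        // After the typo, the '6' before the hyphen starts one very wide range.
        System.out.println(inClass("[6-\u0c6f]", "w"));      // true
        System.out.println(inClass("[6-\u0c6f]", "\u00d7")); // true (the × sign)

        // With the intended escape, the range covers only the ten Telugu digits.
        System.out.println(inClass("[\u0c66-\u0c6f]", "w")); // false

        // Parentheses and '|' inside brackets are literal members of the class.
        System.out.println(inClass("[(\u0030-\u0039)|]", "|")); // true
        System.out.println(inClass("[(\u0030-\u0039)|]", "(")); // true
    }
}
```

If this reading is right, replacing `\0c66` with `\u0c66` and dropping the parentheses and `|` from each class would make the digit pattern stop accepting English letters and `×`, and would stop all three patterns from accepting `(`, `)` and `|`.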


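# re: Lucene 2.1 Study: Character Classification 2007-07-18 12:43 Terry Liang

@xmlspy
I don't know how Java and .NET differ in the way they handle regular expressions.
I have only tested the regular-expression patterns above in .NET.
@xmlspy, could you tell me specifically what is not rigorous about them?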
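As a point of comparison on the Java side, `java.util.regex` also offers Unicode property classes, which avoid hand-typed code-point ranges altogether. The sketch below assumes Java 7 or later (for the `\p{IsHan}`-style script names); `PropertyClassDemo` and `classify` are hypothetical names. These classes are not equivalent to the hand-written ranges in this thread: for example, `\p{L}` also covers CJK ideographs, which is why the CJK check runs before the letter check.

```java
import java.util.regex.Pattern;

class PropertyClassDemo {
    // Any Unicode letter (general category L).
    static final Pattern LETTER = Pattern.compile("\\p{L}");
    // Any Unicode decimal digit (general category Nd).
    static final Pattern DIGIT = Pattern.compile("\\p{Nd}");
    // Han, Hiragana, Katakana and Hangul scripts (script names need Java 7+).
    static final Pattern CJK = Pattern.compile(
            "[\\p{IsHan}\\p{IsHiragana}\\p{IsKatakana}\\p{IsHangul}]");

    // Classifies a one-character string; anything unmatched is "other".
    static String classify(String oneChar) {
        if (DIGIT.matcher(oneChar).matches()) {
            return "digit";
        }
        if (CJK.matcher(oneChar).matches()) {
            return "cjk";
        }
        if (LETTER.matcher(oneChar).matches()) {
            return "letter";
        }
        return "other";
    }

    public static void main(String[] args) {
        // Inputs: w, 2, the ideograph U+8FD9, the multiplication sign, a full-width comma.
        String[] samples = {"w", "2", "\u8fd9", "\u00d7", "\uff0c"};
        for (String s : samples) {
            System.out.println(s + " -> " + classify(s));
        }
    }
}
```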