这些天偶然翻起了算法书,看到KMP算法的时候,突然发现多年来似乎一直没有完全搞明白,于是恶补了一下。写了个程序,“基本”明白了fail函数和KMP的那些事~~~~
具体程序见下
1package test;
2/** *//**
3 * @author Jia Yu
4 * @date 2010-9-28
5 */
6public class StringMatch {
7
8 private int []f;
9 /** *//**
10 * KMP fail function to calculate f[]
11 * f[j] = k < j where k is the maximum of pat[0..k] == pat[j-k..j]
12 * = -1 otherwise
13 * @param pat
14 */
15 public void fail(String pat){
16 int lenP = pat.length();
17 f = new int[lenP];
18 f[0] = -1;
19 for(int j=1;j<lenP;j++){
20 int i = f[j-1];
21 while((pat.charAt(j)==pat.charAt(i+1))&&(i>=0)) i = f[i];
22 if(pat.charAt(j)!=pat.charAt(i+1)) f[j] = i + 1;
23 else f[j] = -1;
24 }
25 }
26
27 /** *//**
28 * Implementation of KMP algorithm.
29 * @param s string which is source string
30 * @param pat string pattern
31 * @return
32 */
33 public int kmp_find(String s,String pat){
34 int lenS,lenP;
35 lenS = s.length();
36 lenP = pat.length();
37 int i,j;
38 i=j=0;
39 while(i<lenS&&j<lenP){
40 if(s.charAt(i)==pat.charAt(j)){
41 i++;
42 j++;
43 }
44 else{
45 if(j==0) i++;
46 else j=f[j-1] +1;
47 }
48 }
49 if(j<lenP||lenP==0) return -1;
50 else return i-lenP;
51 }
52
53 /** *//**
54 * @param args
55 */
56 public static void main(String[] args) {
57 // TODO Auto-generated method stub
58 StringMatch ss = new StringMatch();
59 String s = "abcdedabc";
60 String pat = "dab";
61 ss.fail(pat);
62 System.out.println(ss.kmp_find(s, pat));
63 }
64
65}
66
完事后忍不住想和String的indexOf比比性能。一直以为jdk 的src里String的indexOf是用Naïve string search algorithm实现的。没有任何技巧,当然也有好多人去拿这个说事,但是逛过一些论坛,记得人们只是自己去实现KMP,然后说KMP有多好,但是很少有拿出数据来比对的。所以今天,在投简历闲暇,还是抽出些时间,看了看String match相关的一些东西。写了一些代码,发现了一些东东~~~~
不多扯闲话了。首先进行了一个KMP和String indexOf的比较,看看结果吧。
preprocess using 7549ns
=====================================================
Cycles : 30000
String find pat in pos -1
Used Time is 87134ns
KMP find pat in pos -1
Used Time is 1071829ns
=====================================================
Cycles : 90000
String find pat in pos -1
Used Time is 150480ns
KMP find pat in pos -1
Used Time is 2277475ns
=====================================================
Cycles : 270000
String find pat in pos -1
Used Time is 380815ns
KMP find pat in pos -1
Used Time is 848257ns
=====================================================
Cycles : 810000
String find pat in pos 119457
Used Time is 509997ns
KMP find pat in pos 119457
Used Time is 1141992ns
=====================================================
Cycles : 2430000
String find pat in pos 459895
Used Time is 1845130ns
KMP find pat in pos 459895
Used Time is 4180643ns
测试代码如下:
1/** *//**
2 *
3 */
4package test;
5
6import java.util.Random;
7
8/** *//**
9 * @author Jia Yu
10 * @date 2010-9-28
11 */
12public class StringMatchTest2 {
13
14 private static String src;
15 private static String pat;
16 private static long cycles = 10000L;
17 private static Random rand = new Random(47);
18
19 public static void generateSource() {
20 StringBuilder sb = new StringBuilder();
21 int iter = (int) cycles;
22 for (int i = 0; i < iter; i++) {
23 sb.append((char) ('a' + (rand.nextInt(6))));
24 }
25 src = sb.toString();
26 }
27
28 public static void generatePattern() {
29 StringBuilder sb = new StringBuilder();
30 for (int i = 0; i < 7; i++) {
31 sb.append((char) ('a' + (rand.nextInt(6))));
32 }
33 pat = sb.toString();
34 }
35
36 /** *//**
37 * @param args
38 */
39 public static void main(String[] args) {
40 // TODO Auto-generated method stub
41 generatePattern();
42 StringMatch sm = new StringMatch();
43 long start, pre, dur;
44 start = System.nanoTime();
45 sm.fail(pat);
46 pre = System.nanoTime() - start;
47 System.out.println("preprocess using " + pre + "ns");
48 for (int i = 0; i < 5; i++) {
49 generateSource();
50 cycles *= 3;
51 System.out
52 .println("=====================================================");
53 System.out.printf("%s : %10d\n", "Cycles\t", cycles);
54 start = System.nanoTime();
55 System.out.println("String find pat in pos " + src.indexOf(pat));
56 dur = System.nanoTime() - start;
57 System.out.println("Used Time is " + dur + "ns");
58 start = System.nanoTime();
59 System.out.println("KMP find pat in pos " + sm.kmp_find(src, pat));
60 dur = System.nanoTime() - start;
61 System.out.println("Used Time is " + dur + "ns");
62 }
63 }
64
65}
66
总是说KMP算法的时间复杂度是O(n)的,但是看看效果,又是什么样子呢?看到这里我觉得自己哪里好像做的不对,所以厚颜的把代码拿出来,请大家帮忙看看到底哪里出了问题。我的思路就是用同一个pat串去查询规模逐渐变大的src串,就是为了能节省kmp的预处理时间,毕竟KMP要计算一个fail函数,拿出额外的O(m)空间做这个事情的,其中m是pat的长度。但是从结果上看,不管是查不到(返回-1)还是在中间位置附近查到。我实现的kmp完全没有任何优势可言。
对了,也得把java的String的indexOf方法贴上吧~~~
1static int indexOf(char[] source, int sourceOffset, int sourceCount,
2 char[] target, int targetOffset, int targetCount,
3 int fromIndex) {
4 if (fromIndex >= sourceCount) {
5 return (targetCount == 0 ? sourceCount : -1);
6 }
7 if (fromIndex < 0) {
8 fromIndex = 0;
9 }
10 if (targetCount == 0) {
11 return fromIndex;
12 }
13
14 char first = target[targetOffset];
15 int max = sourceOffset + (sourceCount - targetCount);
16
17 for (int i = sourceOffset + fromIndex; i <= max; i++) {
18 /**//* Look for first character. */
19 if (source[i] != first) {
20 while (++i <= max && source[i] != first);
21 }
22
23 /**//* Found first character, now look at the rest of v2 */
24 if (i <= max) {
25 int j = i + 1;
26 int end = j + targetCount - 1;
27 for (int k = targetOffset + 1; j < end && source[j] ==
28 target[k]; j++, k++);
29
30 if (j == end) {
31 /**//* Found whole string. */
32 return i - sourceOffset;
33 }
34 }
35 }
36 return -1;
37 }
感觉笨笨的java的方法,却达到了高的性能。那跟别的比比呢?
之前有用过一个Rope结构的,模仿了String,但是内部不是字符数组实现,而是用了R*树(如果我没记错的话)。Rope也有类似的indexOf实现,那么来比比吧,我为了增加算法的多样性,还拿stringsearch这个包来调用BoyerMooreHorspool算法实验一下。
具体的结果如下:
=====================================================
Cycles : 10000
String : 15991ns in pos 5000
Rope : 125542ns in pos 5000
KMP : 448211ns in pos 5000
BMH : 743977ns in pos 5000
=====================================================
Cycles : 30000
String : 27135ns in pos 15000
Rope : 122409ns in pos 15000
KMP : 1554487ns in pos 15000
BMH : 175424ns in pos 15000
=====================================================
Cycles : 90000
String : 77267ns in pos 45000
Rope : 284037ns in pos 45000
KMP : 266580ns in pos 45000
BMH : 906169ns in pos 45000
=====================================================
Cycles : 270000
String : 235725ns in pos 135000
Rope : 112890ns in pos 135000
KMP : 796252ns in pos 135000
BMH : 971998ns in pos 135000
=====================================================
Cycles : 810000
String : 698246ns in pos 405000
Rope : 328063ns in pos 405000
KMP : 2652067ns in pos 405000
BMH : 12013728ns in pos 405000
=====================================================
Cycles : 2430000
String : 2082535ns in pos 1215000
Rope : 884518ns in pos 1215000
KMP : 7261113ns in pos 1215000
BMH : 4895766ns in pos 1215000
这个实验中,我每次去产生一个随机的字符串src,然后模式串pat只选用了src串的中间位置开始的200个字符。这样可以保证每次都是可以查找到的,因此从某种角度上讲没有测试最坏情况,因为返回-1这种情况实在是太easy 了。
实验代码如下:
1package test;
2
3import org.ahmadsoft.ropes.Rope;
4import com.eaio.stringsearch.*;
5import java.util.*;
6
7/** *//**
8 * @author Jia Yu
9 * @date 2010-9-28
10 */
11public class StringMatchTest {
12
13 static String src;
14 static String pat;
15
16 public static void generateString() {
17 StringBuilder sb = new StringBuilder();
18 int iter = (int) Tester.cycles;
19 Random rand = new Random(47);
20 for (int i = 0; i < iter; i++) {
21 sb.append((char)('a' + (rand.nextInt(26))));
22 }
23 src = sb.toString();
24 pat = sb.substring(iter / 2, iter / 2 + 200);
25 }
26
27 static void test() {
28 System.out
29 .println("=====================================================");
30 System.out.printf("%s : %10d\n", "Cycles\t", Tester.cycles);
31 generateString();
32 BaseLine baseLine = new BaseLine(src, pat);
33 RopeTest ropeTest = new RopeTest(src, pat);
34 KMPTest kmpTest = new KMPTest(src, pat);
35 BMHTest bmhTest = new BMHTest(src, pat);
36
37 baseLine.timedTest();
38 ropeTest.timedTest();
39 kmpTest.timedTest();
40 bmhTest.timedTest();
41 }
42
43 /** *//**
44 * @param args
45 */
46 public static void main(String[] args) {
47 // TODO Auto-generated method stub
48 int iter = 6;
49 for (int i = 0; i < iter; i++) {
50 test();
51 Tester.cycles *= 3;
52 }
53 }
54
55}
56
57abstract class Tester {
58 public static long cycles = 10000L;
59 protected String id = "error";
60 protected String src, pat;
61
62 public Tester(String s, String p) {
63 this.src = s;
64 this.pat = p;
65 }
66
67 public abstract int match();
68
69 public void timedTest() {
70 long start = System.nanoTime();
71 int pos = this.match();
72 long duriation = System.nanoTime() - start;
73 System.out.println(id + "\t : \t" + duriation + "ns in pos " + pos);
74 }
75}
76
77class BaseLine extends Tester {
78 public BaseLine(String s, String p) {
79 super(s, p);
80 id = "String";
81 }
82
83 @Override
84 public int match() {
85 // TODO Auto-generated method stub
86 return src.indexOf(pat);
87 }
88}
89
90class RopeTest extends Tester {
91
92 private Rope rope;
93
94 public RopeTest(String s, String p) {
95 super(s, p);
96 // TODO Auto-generated constructor stub
97 id = "Rope";
98 rope = Rope.BUILDER.build(s.toCharArray());
99 }
100
101 @Override
102 public int match() {
103 // TODO Auto-generated method stub
104 return rope.indexOf(pat);
105 }
106
107}
108
109class KMPTest extends Tester {
110
111 StringMatch sm = new StringMatch();
112
113 public KMPTest(String s, String p) {
114 super(s, p);
115 // TODO Auto-generated constructor stub
116 id = "KMP";
117 sm.fail(pat);
118 }
119
120 @Override
121 public int match() {
122 // TODO Auto-generated method stub
123 return sm.kmp_find(src, pat);
124 }
125
126}
127
128class BMHTest extends Tester{
129
130 private StringSearch ss;
131 public BMHTest(String s, String p) {
132 super(s, p);
133 // TODO Auto-generated constructor stub
134 id = "BMH";
135 ss = new BoyerMooreHorspool();
136 }
137
138 @Override
139 public int match() {
140 // TODO Auto-generated method stub
141 return ss.searchChars(src.toCharArray(), pat.toCharArray());
142 }
143
144}
各种悲剧,在搞了一段时间后,我已经被这些东西的时间差异搞晕了。看了看这篇文章http://en.wikipedia.org/wiki/String_searching_algorithm,里面讲的比较清楚。罗里吧嗦的写了这么多,没有研究清楚String search,反倒是更迷糊了。希望有人能给我讲讲,针对我的实验情况。总归有些收获的,KMP现在印在脑子里了,虽然对其性能没啥信心。最后还要把用到的Stringsearch的下载地址附上,毕竟是别人的结果。
Stringsearch:http://johannburkard.de/software/stringsearch/。
另外关于BMH算法,可以看http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm。
P.S. 有个补充的东西,在rope实现的过程中,如果令cycles变为20000,有兴趣的可以试试结果。难不成我发现了些什么?如果令cycles不是3,而是以2的倍数指数的改变呢?