这些天偶然翻起了算法书,看到KMP算法的时候,突然发现多年来似乎一直没有完全搞明白,于是恶补了一下。写了个程序,“基本”明白了fail函数和KMP的那些事~~~~
具体程序见下
1
package test;
2
/** *//**
3
* @author Jia Yu
4
* @date 2010-9-28
5
*/
6
public class StringMatch
{
7
8
private int []f;
9
/** *//**
10
* KMP fail function to calculate f[]
11
* f[j] = k < j where k is the maximum of pat[0..k] == pat[j-k..j]
12
* = -1 otherwise
13
* @param pat
14
*/
15
public void fail(String pat)
{
16
int lenP = pat.length();
17
f = new int[lenP];
18
f[0] = -1;
19
for(int j=1;j<lenP;j++)
{
20
int i = f[j-1];
21
while((pat.charAt(j)==pat.charAt(i+1))&&(i>=0)) i = f[i];
22
if(pat.charAt(j)!=pat.charAt(i+1)) f[j] = i + 1;
23
else f[j] = -1;
24
}
25
}
26
27
/** *//**
28
* Implementation of KMP algorithm.
29
* @param s string which is source string
30
* @param pat string pattern
31
* @return
32
*/
33
public int kmp_find(String s,String pat)
{
34
int lenS,lenP;
35
lenS = s.length();
36
lenP = pat.length();
37
int i,j;
38
i=j=0;
39
while(i<lenS&&j<lenP)
{
40
if(s.charAt(i)==pat.charAt(j))
{
41
i++;
42
j++;
43
}
44
else
{
45
if(j==0) i++;
46
else j=f[j-1] +1;
47
}
48
}
49
if(j<lenP||lenP==0) return -1;
50
else return i-lenP;
51
}
52
53
/** *//**
54
* @param args
55
*/
56
public static void main(String[] args)
{
57
// TODO Auto-generated method stub
58
StringMatch ss = new StringMatch();
59
String s = "abcdedabc";
60
String pat = "dab";
61
ss.fail(pat);
62
System.out.println(ss.kmp_find(s, pat));
63
}
64
65
}
66
完事后忍不住想和String的indexOf比比性能。一直以为jdk 的src里String的indexOf是用Naïve string search algorithm实现的。没有任何技巧,当然也有好多人去拿这个说事,但是逛过一些论坛,记得人们只是自己去实现KMP,然后说KMP有多好,但是很少有拿出数据来比对的。所以今天,在投简历闲暇,还是抽出些时间,看了看String match相关的一些东西。写了一些代码,发现了一些东东~~~~
不多扯闲话了。首先进行了一个KMP和String indexOf的比较,看看结果吧。
preprocess using 7549ns
=====================================================
Cycles : 30000
String find pat in pos -1
Used Time is 87134ns
KMP find pat in pos -1
Used Time is 1071829ns
=====================================================
Cycles : 90000
String find pat in pos -1
Used Time is 150480ns
KMP find pat in pos -1
Used Time is 2277475ns
=====================================================
Cycles : 270000
String find pat in pos -1
Used Time is 380815ns
KMP find pat in pos -1
Used Time is 848257ns
=====================================================
Cycles : 810000
String find pat in pos 119457
Used Time is 509997ns
KMP find pat in pos 119457
Used Time is 1141992ns
=====================================================
Cycles : 2430000
String find pat in pos 459895
Used Time is 1845130ns
KMP find pat in pos 459895
Used Time is 4180643ns
测试代码如下:
1
/** *//**
2
*
3
*/
4
package test;
5
6
import java.util.Random;
7
8
/** *//**
9
* @author Jia Yu
10
* @date 2010-9-28
11
*/
12
public class StringMatchTest2
{
13
14
private static String src;
15
private static String pat;
16
private static long cycles = 10000L;
17
private static Random rand = new Random(47);
18
19
public static void generateSource()
{
20
StringBuilder sb = new StringBuilder();
21
int iter = (int) cycles;
22
for (int i = 0; i < iter; i++)
{
23
sb.append((char) ('a' + (rand.nextInt(6))));
24
}
25
src = sb.toString();
26
}
27
28
public static void generatePattern()
{
29
StringBuilder sb = new StringBuilder();
30
for (int i = 0; i < 7; i++)
{
31
sb.append((char) ('a' + (rand.nextInt(6))));
32
}
33
pat = sb.toString();
34
}
35
36
/** *//**
37
* @param args
38
*/
39
public static void main(String[] args)
{
40
// TODO Auto-generated method stub
41
generatePattern();
42
StringMatch sm = new StringMatch();
43
long start, pre, dur;
44
start = System.nanoTime();
45
sm.fail(pat);
46
pre = System.nanoTime() - start;
47
System.out.println("preprocess using " + pre + "ns");
48
for (int i = 0; i < 5; i++)
{
49
generateSource();
50
cycles *= 3;
51
System.out
52
.println("=====================================================");
53
System.out.printf("%s : %10d\n", "Cycles\t", cycles);
54
start = System.nanoTime();
55
System.out.println("String find pat in pos " + src.indexOf(pat));
56
dur = System.nanoTime() - start;
57
System.out.println("Used Time is " + dur + "ns");
58
start = System.nanoTime();
59
System.out.println("KMP find pat in pos " + sm.kmp_find(src, pat));
60
dur = System.nanoTime() - start;
61
System.out.println("Used Time is " + dur + "ns");
62
}
63
}
64
65
}
66
总是说KMP算法的时间复杂度是O(n)的,但是看看效果,又是什么样子呢?看到这里我觉得自己哪里好像做的不对,所以厚颜的把代码拿出来,请大家帮忙看看到底哪里出了问题。我的思路就是用同一个pat串去查询规模逐渐变大的src串,就是为了能节省kmp的预处理时间,毕竟KMP要计算一个fail函数,拿出额外的O(m)空间做这个事情的,其中m是pat的长度。但是从结果上看,不管是查不到(返回-1)还是在中间位置附近查到。我实现的kmp完全没有任何优势可言。
对了,也得把java的String的indexOf方法贴上吧~~~
1
static int indexOf(char[] source, int sourceOffset, int sourceCount,
2
char[] target, int targetOffset, int targetCount,
3
int fromIndex)
{
4
if (fromIndex >= sourceCount)
{
5
return (targetCount == 0 ? sourceCount : -1);
6
}
7
if (fromIndex < 0)
{
8
fromIndex = 0;
9
}
10
if (targetCount == 0)
{
11
return fromIndex;
12
}
13
14
char first = target[targetOffset];
15
int max = sourceOffset + (sourceCount - targetCount);
16
17
for (int i = sourceOffset + fromIndex; i <= max; i++)
{
18
/**//* Look for first character. */
19
if (source[i] != first)
{
20
while (++i <= max && source[i] != first);
21
}
22
23
/**//* Found first character, now look at the rest of v2 */
24
if (i <= max)
{
25
int j = i + 1;
26
int end = j + targetCount - 1;
27
for (int k = targetOffset + 1; j < end && source[j] ==
28
target[k]; j++, k++);
29
30
if (j == end)
{
31
/**//* Found whole string. */
32
return i - sourceOffset;
33
}
34
}
35
}
36
return -1;
37
}
感觉笨笨的java的方法,却达到了高的性能。那跟别的比比呢?
之前有用过一个Rope结构的,模仿了String,但是内部不是字符数组实现,而是用了R*树(如果我没记错的话)。Rope也有类似的indexOf实现,那么来比比吧,我为了增加算法的多样性,还拿stringsearch这个包来调用BoyerMooreHorspool算法实验一下。
具体的结果如下:
=====================================================
Cycles : 10000
String : 15991ns in pos 5000
Rope : 125542ns in pos 5000
KMP : 448211ns in pos 5000
BMH : 743977ns in pos 5000
=====================================================
Cycles : 30000
String : 27135ns in pos 15000
Rope : 122409ns in pos 15000
KMP : 1554487ns in pos 15000
BMH : 175424ns in pos 15000
=====================================================
Cycles : 90000
String : 77267ns in pos 45000
Rope : 284037ns in pos 45000
KMP : 266580ns in pos 45000
BMH : 906169ns in pos 45000
=====================================================
Cycles : 270000
String : 235725ns in pos 135000
Rope : 112890ns in pos 135000
KMP : 796252ns in pos 135000
BMH : 971998ns in pos 135000
=====================================================
Cycles : 810000
String : 698246ns in pos 405000
Rope : 328063ns in pos 405000
KMP : 2652067ns in pos 405000
BMH : 12013728ns in pos 405000
=====================================================
Cycles : 2430000
String : 2082535ns in pos 1215000
Rope : 884518ns in pos 1215000
KMP : 7261113ns in pos 1215000
BMH : 4895766ns in pos 1215000
这个实验中,我每次去产生一个随机的字符串src,然后模式串pat只选用了src串的中间位置开始的200个字符。这样可以保证每次都是可以查找到的,因此从某种角度上讲没有测试最坏情况,因为返回-1这种情况实在是太easy 了。
实验代码如下:
1
package test;
2
3
import org.ahmadsoft.ropes.Rope;
4
import com.eaio.stringsearch.*;
5
import java.util.*;
6
7
/** *//**
8
* @author Jia Yu
9
* @date 2010-9-28
10
*/
11
public class StringMatchTest
{
12
13
static String src;
14
static String pat;
15
16
public static void generateString()
{
17
StringBuilder sb = new StringBuilder();
18
int iter = (int) Tester.cycles;
19
Random rand = new Random(47);
20
for (int i = 0; i < iter; i++)
{
21
sb.append((char)('a' + (rand.nextInt(26))));
22
}
23
src = sb.toString();
24
pat = sb.substring(iter / 2, iter / 2 + 200);
25
}
26
27
static void test()
{
28
System.out
29
.println("=====================================================");
30
System.out.printf("%s : %10d\n", "Cycles\t", Tester.cycles);
31
generateString();
32
BaseLine baseLine = new BaseLine(src, pat);
33
RopeTest ropeTest = new RopeTest(src, pat);
34
KMPTest kmpTest = new KMPTest(src, pat);
35
BMHTest bmhTest = new BMHTest(src, pat);
36
37
baseLine.timedTest();
38
ropeTest.timedTest();
39
kmpTest.timedTest();
40
bmhTest.timedTest();
41
}
42
43
/** *//**
44
* @param args
45
*/
46
public static void main(String[] args)
{
47
// TODO Auto-generated method stub
48
int iter = 6;
49
for (int i = 0; i < iter; i++)
{
50
test();
51
Tester.cycles *= 3;
52
}
53
}
54
55
}
56
57
abstract class Tester
{
58
public static long cycles = 10000L;
59
protected String id = "error";
60
protected String src, pat;
61
62
public Tester(String s, String p)
{
63
this.src = s;
64
this.pat = p;
65
}
66
67
public abstract int match();
68
69
public void timedTest()
{
70
long start = System.nanoTime();
71
int pos = this.match();
72
long duriation = System.nanoTime() - start;
73
System.out.println(id + "\t : \t" + duriation + "ns in pos " + pos);
74
}
75
}
76
77
class BaseLine extends Tester
{
78
public BaseLine(String s, String p)
{
79
super(s, p);
80
id = "String";
81
}
82
83
@Override
84
public int match()
{
85
// TODO Auto-generated method stub
86
return src.indexOf(pat);
87
}
88
}
89
90
class RopeTest extends Tester
{
91
92
private Rope rope;
93
94
public RopeTest(String s, String p)
{
95
super(s, p);
96
// TODO Auto-generated constructor stub
97
id = "Rope";
98
rope = Rope.BUILDER.build(s.toCharArray());
99
}
100
101
@Override
102
public int match()
{
103
// TODO Auto-generated method stub
104
return rope.indexOf(pat);
105
}
106
107
}
108
109
class KMPTest extends Tester
{
110
111
StringMatch sm = new StringMatch();
112
113
public KMPTest(String s, String p)
{
114
super(s, p);
115
// TODO Auto-generated constructor stub
116
id = "KMP";
117
sm.fail(pat);
118
}
119
120
@Override
121
public int match()
{
122
// TODO Auto-generated method stub
123
return sm.kmp_find(src, pat);
124
}
125
126
}
127
128
class BMHTest extends Tester
{
129
130
private StringSearch ss;
131
public BMHTest(String s, String p)
{
132
super(s, p);
133
// TODO Auto-generated constructor stub
134
id = "BMH";
135
ss = new BoyerMooreHorspool();
136
}
137
138
@Override
139
public int match()
{
140
// TODO Auto-generated method stub
141
return ss.searchChars(src.toCharArray(), pat.toCharArray());
142
}
143
144
}
各种悲剧,在搞了一段时间后,我已经被这些东西的时间差异搞晕了。看了看这篇文章http://en.wikipedia.org/wiki/String_searching_algorithm,里面讲的比较清楚。罗里吧嗦的写了这么多,没有研究清楚String search,反倒是更迷糊了。希望有人能给我讲讲,针对我的实验情况。总归有些收获的,KMP现在印在脑子里了,虽然对其性能没啥信心。最后还要把用到的Stringsearch的下载地址附上,毕竟是别人的结果。
Stringsearch:http://johannburkard.de/software/stringsearch/。
另外关于BMH算法,可以看http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm。
P.S. 有个补充的东西,在rope实现的过程中,如果令cycles变为20000,有兴趣的可以试试结果。难不成我发现了些什么?如果令cycles不是3,而是以2的倍数指数的改变呢?