I’ll translate this into Chinese when I have time.
The 7th version of the Java Development Kit (aka JDK 7) delivers quite a speed boost over JDK 6 for array accesses. For us, this is huge. It’s like another year and a half of Moore’s law for free. Only in software. And you don’t even have to write multi-threaded code.
I’ve been profiling my new K-Means++ implementation for the next LingPipe release on some randomly generated data. It’s basically a stress test for array gets, array sets, and simple multiply-add arithmetic. Many LingPipe modules are like this at run-time: named entity, part-of-speech tagging, language modeling, LM-based classifiers, and much more.
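To make "array gets, array sets, and simple multiply-add arithmetic" concrete, here is a minimal sketch of the kind of inner loop a k-means assignment step spends its time in. This is not the LingPipe implementation; the class name, method names, data sizes, and random seed are all made up for illustration.

```java
import java.util.Random;

// Hypothetical sketch, not LingPipe code: the assignment step of k-means
// reduced to its array-access and multiply-add core.
class ArrayStressSketch {

    // Squared Euclidean distance; assumes x and y have the same length.
    // Each dimension costs two array gets, a subtract, and a multiply-add.
    static double squaredDistance(double[] x, double[] y) {
        double sum = 0.0;
        for (int i = 0; i < x.length; ++i) {
            double diff = x[i] - y[i];
            sum += diff * diff;
        }
        return sum;
    }

    // Index of the centroid closest to the point (first one wins on ties).
    static int nearestCentroid(double[] point, double[][] centroids) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int k = 0; k < centroids.length; ++k) {
            double dist = squaredDistance(point, centroids[k]);
            if (dist < bestDist) {
                bestDist = dist;
                best = k;
            }
        }
        return best;
    }

    // Fill a rows-by-cols matrix with uniform random values in [0,1).
    static double[][] randomMatrix(Random random, int rows, int cols) {
        double[][] m = new double[rows][cols];
        for (int i = 0; i < rows; ++i)
            for (int j = 0; j < cols; ++j)
                m[i][j] = random.nextDouble(); // array set
        return m;
    }

    public static void main(String[] args) {
        int numPoints = 20000, numDims = 50, numClusters = 20; // arbitrary sizes
        Random random = new Random(42L);
        double[][] points = randomMatrix(random, numPoints, numDims);
        double[][] centroids = randomMatrix(random, numClusters, numDims);
        // Repeat the pass so the JIT compiler has a chance to warm up.
        for (int pass = 0; pass < 10; ++pass) {
            int[] counts = new int[numClusters];
            long start = System.nanoTime();
            for (double[] point : points)
                ++counts[nearestCentroid(point, centroids)];
            long elapsedMs = (System.nanoTime() - start) / 1000000L;
            System.out.println("pass " + pass + ": " + elapsedMs
                               + "ms, cluster 0 size=" + counts[0]);
        }
    }
}
```

Running the same class under two different JDKs and comparing the later passes gives a rough feel for the difference, though a loop this crude is no substitute for profiling the real code.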
While I was waiting for a run using JDK 1.6 to finish, I installed the following beta release of JDK 7:
> java -version
java version "1.7.0-ea"
Java(TM) SE Runtime Environment (build 1.7.0-ea-b52)
Java HotSpot(TM) 64-Bit Server VM (build 15.0-b03, mixed mode)
You can get it, too:
I believe much of the reason it’s faster is the work of these fellows:
Java’s always suffered relative to C in straight matrix multiplication because Java does range checks on every array access (set or get). With some clever static and run-time analysis, Würthinger et al. are able to eliminate most of the array bounds checks. They show on matrix benchmarks that this one improvement doubles the speed of the LU matrix factorization benchmark in SciMark 2, the benchmark suite from the U.S. National Institute of Standards and Technology (NIST), which, like our clustering algorithm, is basically just a stress test for array access and arithmetic.
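As a rough illustration of what a range check is and when it looks redundant, consider the sketch below. The class and method names are hypothetical, and whether a given VM actually removes the check in the first loop depends on its compiler; the code only shows the language-level behavior and the loop shape that makes the check provably unnecessary.

```java
// Hypothetical sketch of array range checks in Java.
class BoundsCheckSketch {

    // The loop condition already guarantees 0 <= i < a.length, so the
    // implicit check on a[i] can never fail. This is the kind of access
    // a compiler may be able to prove safe and leave unchecked.
    static double sum(double[] a) {
        double total = 0.0;
        for (int i = 0; i < a.length; ++i)
            total += a[i];
        return total;
    }

    // Here the index into a comes from data, so nothing in the loop
    // structure proves it is in range; the check on a[indexes[i]]
    // has to stay in some form.
    static double gather(double[] a, int[] indexes) {
        double total = 0.0;
        for (int i = 0; i < indexes.length; ++i)
            total += a[indexes[i]];
        return total;
    }

    public static void main(String[] args) {
        double[] a = { 1.0, 2.0, 3.0 };
        System.out.println(sum(a));
        System.out.println(gather(a, new int[] { 2, 0 }));
        try {
            gather(a, new int[] { 3 }); // index 3 is out of range for length 3
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("range check fired");
        }
    }
}
```

In C, the out-of-range read in the last call would simply be undefined behavior with no check at all, which is where the raw speed advantage on matrix code has traditionally come from.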
So far, my tests have only been on a Thinkpad Z61P notebook running Windows Vista (64-bit) with an Intel Core 2 CPU (T7200; 2.0GHz), and 4GB of reasonably zippy memory. I don’t know if the speedups will be as great for other OSes or for 32-bit JDKs.
I’m pretty excited about the new fork-join concurrency, too, as it’s just what we’ll need to parallelize the inner loops without too much work for us or the operating system.
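Here is a minimal sketch of how fork-join could split a k-means assignment loop across cores. It assumes the API as it appears in `java.util.concurrent` in the final JDK 7 release (`ForkJoinPool`, `RecursiveAction`); an early-access build may package it differently. The class names, the `THRESHOLD` value, and the `assign` helper are hypothetical, not LingPipe code.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

// Hypothetical sketch: parallel nearest-centroid assignment with fork-join.
class ForkJoinAssignSketch {

    static final int THRESHOLD = 1000; // arbitrary cutoff for sequential work

    // Assigns points[from..to) to their nearest centroids, splitting the
    // range in half until it is small enough to process directly.
    static final class AssignTask extends RecursiveAction {
        private final double[][] points;
        private final double[][] centroids;
        private final int[] assignments;
        private final int from, to;

        AssignTask(double[][] points, double[][] centroids,
                   int[] assignments, int from, int to) {
            this.points = points;
            this.centroids = centroids;
            this.assignments = assignments;
            this.from = from;
            this.to = to;
        }

        @Override
        protected void compute() {
            if (to - from <= THRESHOLD) {
                for (int n = from; n < to; ++n)
                    assignments[n] = nearest(points[n], centroids);
                return;
            }
            int mid = (from + to) >>> 1;
            // Each half writes a disjoint slice of assignments, so no locking.
            invokeAll(new AssignTask(points, centroids, assignments, from, mid),
                      new AssignTask(points, centroids, assignments, mid, to));
        }
    }

    // Sequential inner loop: index of the closest centroid by squared distance.
    static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int k = 0; k < centroids.length; ++k) {
            double dist = 0.0;
            for (int i = 0; i < point.length; ++i) {
                double diff = point[i] - centroids[k][i];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = k;
            }
        }
        return best;
    }

    static int[] assign(double[][] points, double[][] centroids) {
        int[] assignments = new int[points.length];
        ForkJoinPool pool = new ForkJoinPool(); // defaults to one worker per core
        try {
            pool.invoke(new AssignTask(points, centroids, assignments,
                                       0, points.length));
        } finally {
            pool.shutdown();
        }
        return assignments;
    }

    public static void main(String[] args) {
        double[][] centroids = { { 0.0, 0.0 }, { 10.0, 10.0 } };
        double[][] points = { { 1.0, 1.0 }, { 9.0, 8.0 }, { 0.5, 2.0 } };
        for (int a : assign(points, centroids))
            System.out.println(a);
    }
}
```

A real implementation would likely reuse one pool across iterations rather than creating a pool per call, and would tune the threshold to the data.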
Update: 2:30 PM, 30 March 2009. JDK 7 is only about 15% faster than Sun’s JDK 6 on my quad Xeons (E5410, 2.33GHz) at work running the same code. I’ll have to check the exact specs on both of my memory buses. The notebook has surprisingly fast memory, and the Xeon’s running ECC registered memory that I don’t think is quite as fast.
Update: 11:00 AM, 31 March 2009. Like other matrix algorithms, k-means clustering is extremely sensitive to the front-side bus (the connection between memory and the CPU), because the bottleneck is often between memory and the CPU’s L2 cache. Memory’s significantly slower than the CPU these days.
The dual quad-core Intel Xeon E5410s have 12MB of L2 cache at 2.33GHz, whereas the Thinkpad Z61P’s Intel Core 2 Mobile T7200 has only 4MB of L2 cache at 2GHz. The Core 2 has a 667 MHz front-side bus whereas the Xeon reports a 1333 MHz front-side bus (is that just a difference in how the specs are reported?). I actually don’t know what kind of memory’s in the workstation; I’ll have to crack it open and look. I’ve got 4GB of RAM in the notebook, but the motherboard can only see 3GB (that is, it’s not Windows; the same thing happened when I installed Ubuntu on the notebook, and it’s a known design limitation in many motherboards). I have 16GB of RAM in the workstation and the motherboard can see all of it. But it has two physical chips that share the memory, so the motherboard’s architecture is very different. There are so many confounding factors that I can’t tease apart what’s speeding up in JDK 7 so much on the notebook.
Anyway, go forth and test. If you’re using a machine like my notebook to do number crunching, JDK 7 really is twice as fast as JDK 6 for matrix algorithms.