Today I downloaded Hadoop and worked through the tutorial in its documentation, then implemented a word count example by following a linked tutorial.
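For reference, here is a minimal sketch of the classic WordCount job; it follows the standard tutorial shape and the org.apache.hadoop.mapreduce API (Job.getInstance needs Hadoop 2.x; older releases use new Job(conf, "word count") instead), and is not necessarily identical to the code in the linked tutorial. The mapper emits (word, 1) for every token, and the reducer sums the counts for each word.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emit (word, 1) for every token in the input line.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sum the counts for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged into a jar, it runs with something like hadoop jar wordcount.jar WordCount <input dir> <output dir>, where the two paths are placeholders.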
Below are some notes from the theory I studied:
The storage is provided by HDFS, and analysis by MapReduce.
MapReduce is a good fit for problems
that need to analyze the whole dataset, in a batch fashion, particularly for ad hoc analysis.
An RDBMS is good for point queries or updates, where the dataset has been indexed
to deliver low-latency retrieval and update times of a relatively small amount of
data.
MapReduce suits applications where the data is written once, and read many
times, whereas a relational database is good for datasets that are continually updated.
MapReduce tries to colocate the data with the compute node, so data access is fast
since it is local. This feature, known as data locality, is at the heart of MapReduce and
is the reason for its good performance.
Hadoop divides the input to a MapReduce job into fixed-size pieces called input
splits, or just splits. Hadoop creates one map task for each split, which runs the user-defined
map function for each record in the split.
Smaller splits make it easier to load-balance the work across the cluster; on the other hand, if splits are too small, then the overhead of managing the splits and
of map task creation begins to dominate the total job execution time. For most jobs, a
good split size tends to be the size of an HDFS block, 64 MB by default.
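As a toy illustration of that overhead argument (plain Java, not Hadoop internals; the 1 GB file and the 64 KB "tiny split" are made-up figures), here is how the split size determines the number of map tasks a job has to create and track:

public class SplitMath {
  public static void main(String[] args) {
    long fileSize = 1024L * 1024 * 1024;       // 1 GB of input (illustrative figure)
    long[] splitSizes = {64L * 1024 * 1024,    // one split per 64 MB block
                         64L * 1024};          // pathologically small 64 KB splits

    for (long split : splitSizes) {
      // one map task per split, rounding up for the last partial split
      long mapTasks = (fileSize + split - 1) / split;
      System.out.printf("split = %8d bytes -> %6d map tasks%n", split, mapTasks);
    }
  }
}

With 64 MB splits the 1 GB file becomes 16 map tasks; with 64 KB splits it becomes 16,384 tasks, and the fixed cost of creating and tracking each task starts to dominate the job's runtime.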
Reduce tasks don’t have the advantage of data locality—the input to a single reduce
task is normally the output from all mappers.
Many MapReduce jobs are limited by the bandwidth available on the cluster, so it pays
to minimize the data transferred between map and reduce tasks. Hadoop allows the
user to specify a combiner function to be run on the map output—the combiner function’s
output forms the input to the reduce function.
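In a WordCount-style job the reduce function just sums integers, which is associative and commutative, so the reducer class itself can be reused as the combiner. A one-line addition to the driver sketched earlier (Job.setCombinerClass is the real API method) would be:

// In the WordCount driver, before waitForCompletion():
// run IntSumReducer over each map task's local output, so only partial
// per-word sums are shuffled across the network to the reducers.
job.setCombinerClass(IntSumReducer.class);

This works here only because the combiner's input and output types (Text, IntWritable) match the map output types.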
Why Is a Block in HDFS So Large?
HDFS blocks are large compared to disk blocks, and the reason is to minimize the cost
of seeks. By making a block large enough, the time to transfer the data from the disk
can be made to be significantly larger than the time to seek to the start of the block.
Thus the time to transfer a large file made of multiple blocks operates at the disk transfer
rate.
A quick calculation shows that if the seek time is around 10ms, and the transfer rate is
100 MB/s, then to make the seek time 1% of the transfer time, we need to make the
block size around 100 MB. The default is actually 64 MB, although many HDFS installations
use 128 MB blocks. This figure will continue to be revised upward as transfer
speeds grow with new generations of disk drives.
This argument shouldn’t be taken too far, however. Map tasks in MapReduce normally
operate on one block at a time, so if you have too few tasks (fewer than nodes in the
cluster), your jobs will run slower than they could otherwise.
What this means: with a large block, the time spent seeking is small relative to the time spent transferring, so almost all of the cost is the transfer itself. With a small block, the seek time becomes comparable to the transfer time, so you end up paying roughly twice the transfer time, which is a bad deal. Concretely, take 100 MB of data: if the block size is 100 MB, the transfer takes about 1 s (at 100 MB/s); if the block size is 1 MB, the transfer still takes 1 s in total, but the seeks now cost 10 ms × 100 = 1 s, so the whole read takes about 2 s. Does that mean bigger is always better? No. If blocks are too large, a file may occupy so few blocks that it is not really spread across the cluster, MapReduce cannot exploit much parallelism, and the job may actually run slower.
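The arithmetic above can be checked with a small standalone calculation (plain Java; the 10 ms seek, 100 MB/s transfer rate, and 100 MB file are the figures from the passage):

public class SeekVsTransfer {
  public static void main(String[] args) {
    double seekMs = 10.0;              // seek time per block
    double transferMBps = 100.0;       // sequential transfer rate
    double fileMB = 100.0;             // total file size

    double[] blockMB = {100.0, 1.0};   // large block vs. small block
    for (double block : blockMB) {
      double blocks = Math.ceil(fileMB / block);       // number of blocks to read
      double seekSec = blocks * seekMs / 1000.0;       // one seek per block
      double transferSec = fileMB / transferMBps;      // same bytes either way
      System.out.printf("block = %5.1f MB: %3.0f seeks, seek = %.2f s, transfer = %.2f s, total = %.2f s%n",
          block, blocks, seekSec, transferSec, seekSec + transferSec);
    }
  }
}

It prints roughly 1.01 s total for a single 100 MB block versus 2.00 s for one hundred 1 MB blocks, which is the factor-of-two penalty described above.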