Reference:
http://hadoop.apache.org/common/docs/r0.15.2/streaming.html
Note
Streaming currently does not support Linux pipes (that is, pipelines such as cat | wc -l), but this does not stop us from using perl or python one-liners!
The original wording is:
Can I use UNIX pipes? For example, will -mapper "cut -f1 | sed s/foo/bar/g" work?
Currently this does not work and gives an "java.io.IOException: Broken pipe" error.
This is probably a bug that needs to be investigated.
If you are nevertheless a die-hard fan of Linux shell pipes, see the following:
$> perl -e 'open( my $fh, "grep -v null tt |sed -n 1,5p |");while ( <$fh> ) {print;} '
# However, I did not get this to work in my test!
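A commonly used alternative, sketched here under assumptions rather than verified on this cluster: put the pipeline into a small wrapper script so that streaming only has to launch a single command, and ship that script with -file. The script name pipe_mapper.sh and the sample rows are hypothetical, the input is assumed to be tab-separated, and the hadoop invocation is left as a comment because it was not run.

```shell
#!/bin/sh
# Sketch: wrap a shell pipeline in one script so that streaming
# only sees a single command.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# pipe_mapper.sh is a hypothetical name. The pipeline is the one from the
# note above, reading stdin instead of the file tt.
cat > pipe_mapper.sh <<'EOF'
#!/bin/sh
grep -v null | sed -n 1,5p
EOF
chmod +x pipe_mapper.sh

# Local check with three sample rows (tab-separated is an assumption).
printf 'null\tfalse\t3702\t208100\n6005100\tfalse\t70\t13220\n6005127\tfalse\t24\t4640\n' > tt_sample
mapper_out=$(sh pipe_mapper.sh < tt_sample)
printf '%s\n' "$mapper_out"

# Untested cluster invocation, using only options listed by -info:
# hadoop jar hadoop-0.18.3-streaming.jar \
#   -input file:///data/hadoop/lky/jar/tt \
#   -mapper pipe_mapper.sh -file pipe_mapper.sh \
#   -reducer /bin/cat \
#   -output <output dir>
```

The local check only shows that the wrapper behaves like the pipeline; whether it avoids the Broken pipe error on a given Hadoop version still has to be tested there.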
Environment: hadoop-0.18.3
$> find . -type f -name "*streaming*.jar"
./contrib/streaming/hadoop-0.18.3-streaming.jar
Test data:
-bash-3.00$ head tt
null false 3702 208100
6005100 false 70 13220
6005127 false 24 4640
6005160 false 25 4820
6005161 false 20 3620
6005164 false 14 1280
6005165 false 37 7080
6005168 false 104 20140
6005169 false 35 6680
6005240 false 169 32140
......
Run:
c1=" perl -ne 'if(/.*\t(.*)/){\$sum+=\$1;}END{print \"\$sum\";}' "
# Note: inside the double quotes, $ must be written as \$ and " as \"
echo "$c1"; # prints: perl -ne 'if(/.*\t(.*)/){$sum+=$1;}END{print "$sum";}'
hadoop jar hadoop-0.18.3-streaming.jar \
  -input file:///data/hadoop/lky/jar/tt \
  -mapper "/bin/cat" \
  -reducer "$c1" \
  -output file:///tmp/lky/streamingx8
Result:
cat /tmp/lky/streamingx8/*
1166480
Output when run locally:
perl -ne 'if(/.*\t(.*)/){$sum+=$1;}END{print $sum;}' < tt
1166480
The result is correct!
Built-in command documentation:
-bash-3.00$ hadoop jar hadoop-0.18.3-streaming.jar -info
09/09/25 14:50:12 ERROR streaming.StreamJob: Missing required option -input
Usage: $HADOOP_HOME/bin/hadoop [--config dir] jar \
          $HADOOP_HOME/hadoop-streaming.jar [options]
Options:
  -input          <path>                   DFS input file(s) for the Map step
  -output         <path>                   DFS output directory for the Reduce step
  -mapper         <cmd|JavaClassName>      The streaming command to run
  -combiner       <JavaClassName>          Combiner has to be a Java class
  -reducer        <cmd|JavaClassName>      The streaming command to run
  -file           <file>                   File/dir to be shipped in the Job jar file
  -dfs            <h:p>|local              Optional. Override DFS configuration
  -jt             <h:p>|local              Optional. Override JobTracker configuration
  -additionalconfspec specfile             Optional.
  -inputformat    TextInputFormat(default)|SequenceFileAsTextInputFormat|JavaClassName Optional.
  -outputformat   TextOutputFormat(default)|JavaClassName Optional.
  -partitioner    JavaClassName            Optional.
  -numReduceTasks <num>                    Optional.
  -inputreader    <spec>                   Optional.
  -jobconf        <n>=<v>                  Optional. Add or override a JobConf property
  -cmdenv         <n>=<v>                  Optional. Pass env.var to streaming commands
  -mapdebug       <path>                   Optional. To run this script when a map task fails
  -reducedebug    <path>                   Optional. To run this script when a reduce task fails
  -cacheFile      fileNameURI
  -cacheArchive   fileNameURI
  -verbose
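As an illustration of how some of the optional flags above combine, here is a dry-run sketch that only assembles and prints a command line; nothing is submitted. The jar path comes from the find output earlier, while the output directory and job name are hypothetical. Only flags from the listing are used, and mapred.job.name is passed through -jobconf.

```shell
#!/bin/sh
# Dry run: the command is only built and printed, not executed.
STREAMING_JAR=./contrib/streaming/hadoop-0.18.3-streaming.jar
OUTPUT_DIR=file:///tmp/lky/streaming_example   # hypothetical output directory

cmd="hadoop jar $STREAMING_JAR"
cmd="$cmd -input file:///data/hadoop/lky/jar/tt"
cmd="$cmd -output $OUTPUT_DIR"
cmd="$cmd -mapper /bin/cat"
cmd="$cmd -reducer /bin/cat"
# A single reducer keeps all records in one output file.
cmd="$cmd -numReduceTasks 1"
# Hypothetical job name, set as a JobConf property.
cmd="$cmd -jobconf mapred.job.name=streaming_example"

echo "$cmd"
```

A single reducer matters for a global aggregate such as the sum computed earlier, since several reducers would each print a partial total.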
Compiled by www.blogjava.net/Good-Game