Hadoop stream sorting
Could anyone help with a Hadoop streaming sort problem? Thanks for any suggestions in advance.
I am a newbie on Hadoop, and I need to implement a sort function on a 500 GB tab-delimited text file. As the following example input shows, there are 3 fields in each line, e.g. "reada14 chr14 50989". I need a numeric sort on the 2nd and 3rd columns, but unless I set the number of reducers to 1, I never get a correctly ordered result.
Example input:
reada14 chr14 50989
readb18 chr18 517043
readc22 chr22 88345
readd10 chr10 994183
reade19 chr19 232453
readf20 chr20 42912
readf9 chr9 767396
readg22 chr22 783469
readg16 chr16 522257
readh9 chr9 826357
readh16 chr16 555098
readh21 chr21 128309
readh4 chr4 719890
readh18 chr18 944551
readh22 chr22 530068
readh9 chr9 212247
readh11 chr11 574930
readh22 chr22 664833
readh2 chr2 908178
readh22 chr22 486178
readh7 chr7 533343
readh6 chr6 109022
readh15 chr15 316353
readh20 chr20 439938
readh21 chr21 731912
readh11 chr11 81162
readh2 chr2 670838
readh15 chr15 729549
readh3 chr3 196626
readh14 chr14 841104
My code for the streaming sort:
hadoop jar /home/hadoop-0.20.2-cdh3u5/contrib/streaming/hadoop-streaming-0.20.2-cdh3u5.jar \
    -input /user/luoqin/projects/samsort/number \
    -output /user/luoqin/projects/samsort/number_sort \
    -mapper "cat" \
    -reducer "sort -k 2.5 -n -k 3" \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
    -jobconf map.output.key.field.separa="\t" \
    -jobconf num.key.fields.for.partition=1 \
    -jobconf mapred.data.field.separator="\t" \
    -jobconf map.output.key.value.fields.spec="2:0-" \
    -jobconf reduce.output.key.value.fields.spec="2:0-" \
    -jobconf mapred.reduce.tasks=50
The results are partitioned into 50 parts because mapred.reduce.tasks is set to 50. I view the results as follows, and the ordering is not correct unless mapred.reduce.tasks is set to 1:
hadoop fs -cat /user/projects/samsort/number_sort/*
By default, Hadoop uses the hash partitioner, in which the key output by the mapper is 'hashed' to determine which reducer that key should be sent to. This hashing is what's causing the 'incorrect' results when you use more than a single reducer.
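To see why hashing scrambles the order, here is a minimal Python sketch of a hash partitioner. It assumes the partition key is the chrN field, and it uses a Java String-style hash as a stand-in (the hash Hadoop applies to its own key types may differ). The names java_string_hash and hash_partition are made up for illustration and are not Hadoop APIs. Only the final "mask the sign bit, then take the remainder" step mirrors what Hadoop's HashPartitioner does with a key's hash code.

```python
def java_string_hash(s):
    """Stand-in hash modelled on Java's String.hashCode (an assumption)."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF  # keep 32 bits, like a Java int
    # Reinterpret the 32-bit value as a signed int.
    return h - (1 << 32) if h >= (1 << 31) else h


def hash_partition(key, num_reducers):
    """Pick a reducer the way a hash partitioner would:
    clear the sign bit of the hash, then take it modulo the reducer count."""
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reducers


if __name__ == "__main__":
    for key in ["chr1", "chr2", "chr9", "chr10", "chr22"]:
        print(key, "-> part", hash_partition(key, 50))
```

Under this stand-in hash with 50 reducers, chr10 lands in a lower-numbered part than chr1, so reading the part files back in index order cannot give a numerically ordered result, even though every record for a given key ends up in the same part.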
You should note that each output part is sorted, and you just need to interleave the different parts to get a single sorted output.
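As a rough illustration of that interleaving, the sketch below merges already-sorted parts with Python's heapq.merge. It assumes every part is sorted by (integer after "chr", position) and that the second field always looks like chrN with a numeric N. The helper names sort_key and merge_sorted_parts are hypothetical, and in practice the parts would be file handles over the part-* files rather than in-memory lists.

```python
import heapq


def sort_key(line):
    """Return (chromosome number, position) for a 'read chrN pos' record."""
    _read, chrom, pos = line.split()
    return int(chrom[3:]), int(pos)  # assumes a numeric suffix after "chr"


def merge_sorted_parts(parts):
    """Lazily merge iterables of lines that are each sorted by sort_key."""
    return heapq.merge(*parts, key=sort_key)


if __name__ == "__main__":
    # Two small stand-ins for reducer output parts, each already sorted.
    part_0 = ["readh2 chr2 670838", "readh2 chr2 908178", "reada14 chr14 50989"]
    part_1 = ["readh3 chr3 196626", "readh9 chr9 212247", "readh22 chr22 486178"]
    for line in merge_sorted_parts([part_0, part_1]):
        print(line)
```

Because heapq.merge is lazy, it only keeps one pending line per part in memory, which matters for an input of this size.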
You can solve this problem by implementing your own partitioner and sending the key,value pairs to a reducer depending on the chrX value in the second field. You'll need to couple the partitioner implementation with the number of reducers, otherwise you'll still get results similar to the ones you have at the moment.
So if you know the domain or range of values of the second column (let's say chr0 to chr255), you can run a 256-reducer job with a custom partitioner based upon the int value after the chr string.
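The partitioning rule itself is tiny. The following Python sketch only illustrates the logic (in a real job it would live in a Java Partitioner class passed to -partitioner); chr_partition and simulate_job are hypothetical names, and the code assumes the second field is always chrN with 0 <= N < number of reducers. The small simulation shows why the rule works: once each reducer sorts its own records by position, reading the parts in index order is already globally sorted.

```python
from collections import defaultdict


def chr_partition(line, num_reducers):
    """Map a record to the reducer whose index is the integer after 'chr'."""
    chrom = line.split()[1]   # e.g. "chr14"
    index = int(chrom[3:])    # assumes a numeric suffix after "chr"
    if not 0 <= index < num_reducers:
        raise ValueError(f"{chrom} does not fit in {num_reducers} reducers")
    return index


def simulate_job(lines, num_reducers):
    """Partition records, sort each partition by position (3rd field),
    then concatenate the partitions in reducer-index order."""
    buckets = defaultdict(list)
    for line in lines:
        buckets[chr_partition(line, num_reducers)].append(line)
    ordered = []
    for index in sorted(buckets):
        ordered.extend(sorted(buckets[index], key=lambda rec: int(rec.split()[2])))
    return ordered


if __name__ == "__main__":
    records = [
        "reada14 chr14 50989",
        "readh2 chr2 908178",
        "readh22 chr22 486178",
        "readh2 chr2 670838",
        "readh9 chr9 212247",
    ]
    for line in simulate_job(records, 256):
        print(line)
```

The range check is the "coupling" with the reducer count: if the job were started with fewer reducers than there are chrN values, the rule would have nowhere to send the extra keys.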