Hadoop stream sorting

February 15, 2015

Could anyone help with a Hadoop streaming sort problem? Thanks in advance for any suggestions.

I am a newbie on Hadoop and need to implement a sort function on a 500 GB tab-delimited text file. As in the example input that follows, there are 3 fields per line, e.g. "reada14 chr14 50989". I need a numeric sort on the 2nd and 3rd columns, but unless I set the number of reducers to 1, I never get a correctly ordered result.

Example input:

  reada14 chr14   50989
  readb18 chr18   517043
  readc22 chr22   88345
  readd10 chr10   994183
  reade19 chr19   232453
  readf20 chr20   42912
  readf9  chr9    767396
  readg22 chr22   783469
  readg16 chr16   522257
  readh9  chr9    826357
  readh16 chr16   555098
  readh21 chr21   128309
  readh4  chr4    719890
  readh18 chr18   944551
  readh22 chr22   530068
  readh9  chr9    212247
  readh11 chr11   574930
  readh22 chr22   664833
  readh2  chr2    908178
  readh22 chr22   486178
  readh7  chr7    533343
  readh6  chr6    109022
  readh15 chr15   316353
  readh20 chr20   439938
  readh21 chr21   731912
  readh11 chr11   81162
  readh2  chr2    670838
  readh15 chr15   729549
  readh3  chr3    196626
  readh14 chr14   841104

My streaming sort code:

  hadoop jar /home/hadoop-0.20.2-cdh3u5/contrib/streaming/hadoop-streaming-0.20.2-cdh3u5.jar \
    -input /user/luoqin/projects/samsort/number \
    -output /user/luoqin/projects/samsort/number_sort \
    -mapper "cat" \
    -reducer "sort -k 2.5 -n -k 3" \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
    -jobconf map.output.key.field.separa="\t" \
    -jobconf num.key.fields.for.partition=1 \
    -jobconf mapred.data.field.separator="\t" \
    -jobconf map.output.key.value.fields.spec="2:0-" \
    -jobconf reduce.output.key.value.fields.spec="2:0-" \
    -jobconf mapred.reduce.tasks=50

The results are partitioned into 50 parts because mapred.reduce.tasks is set to 50. I view the results as follows, and the ordering is not correct unless the number of reduce tasks is set to 1:

   hadoop fs -cat /user/projects/samsort/number_sort/*

By default Hadoop uses a hash partitioner: the key output by the mapper is hashed to determine which reducer the key should be sent to. This hashing is what's causing the 'incorrect' results when you use more than a single reducer.
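To make the effect concrete, here is a minimal Python sketch of hash partitioning on a few records from the question. It is only a toy model: `toy_hash` and `partition_by_hash` are hypothetical names, the hash is a simple byte sum rather than whatever hash Hadoop applies to its keys, and the chromosome field is assumed to be the partition key.

```python
# Toy model of hash partitioning. toy_hash is a hypothetical stand-in,
# not the hash function Hadoop itself uses.
records = [
    ("reada14", "chr14", 50989),
    ("readb18", "chr18", 517043),
    ("readc22", "chr22", 88345),
    ("readd10", "chr10", 994183),
    ("readf9", "chr9", 767396),
    ("readh2", "chr2", 908178),
]


def sort_key(record):
    # Numeric order on the chromosome number, then on the position.
    _, chrom, pos = record
    return (int(chrom[3:]), pos)


def toy_hash(key):
    # Sum of the key's bytes: deterministic, but unrelated to key order.
    return sum(key.encode("utf-8"))


def partition_by_hash(recs, num_reducers):
    parts = [[] for _ in range(num_reducers)]
    for rec in recs:
        parts[toy_hash(rec[1]) % num_reducers].append(rec)
    # Each simulated reducer sorts only the records it received.
    return [sorted(part, key=sort_key) for part in parts]


parts = partition_by_hash(records, 3)
concatenated = [rec for part in parts for rec in part]
globally_sorted = sorted(records, key=sort_key)

for index, part in enumerate(parts):
    print(index, [rec[1] for rec in part])
print("concatenation is globally sorted:", concatenated == globally_sorted)
```

With this toy hash the three parts hold chr10 and chr22, chr2 and chr14, and chr9 and chr18. Each part is sorted on its own, but reading them back to back (as `hadoop fs -cat .../*` does) is not.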

You should note that each output part is sorted, and you just need to interleave the different parts into a single sorted output.
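One way to do that interleaving outside Hadoop is a k-way merge. The sketch below uses Python's `heapq.merge` on two small in-memory parts; `parse`, `sort_key` and `merge_sorted_parts` are hypothetical helpers, and in practice each part would be an iterator over a downloaded part file that the reducer has already sorted in the same order.

```python
import heapq


def parse(line):
    # Split one tab-delimited record into (read id, chromosome, position).
    read_id, chrom, pos = line.rstrip("\n").split("\t")
    return (read_id, chrom, int(pos))


def sort_key(record):
    # Numeric order on the chromosome number, then on the position.
    _, chrom, pos = record
    return (int(chrom[3:]), pos)


def merge_sorted_parts(parts):
    # heapq.merge lazily interleaves inputs that are each already sorted,
    # so the full data set never has to be held in memory at once.
    return heapq.merge(*parts, key=sort_key)


# Two parts, each already sorted the way a reducer would leave them.
part_a = [parse(line) for line in ["readh2\tchr2\t908178", "reada14\tchr14\t50989"]]
part_b = [parse(line) for line in ["readf9\tchr9\t767396", "readb18\tchr18\t517043"]]

merged = list(merge_sorted_parts([part_a, part_b]))
for record in merged:
    print("\t".join(str(field) for field in record))
```

The merge only works if every part was sorted with the same key as the one passed to `heapq.merge`.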

You can solve this problem by implementing your own partitioner and sending the key,value pairs to a reducer depending on the chrX value in the second field. You need to couple the partitioner implementation with the number of reducers, otherwise you'll still get results similar to the ones you have at the moment.

So if you know the domain or range of values of the second column (let's say chr0 to chr255), you can run a 256-reducer job with a custom partitioner based upon the int value after the chr string.
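The partitioning rule itself is small. The following Python sketch only models the logic; a real Hadoop job would need it written as a Java partitioner class, and `chrom_partition` and `simulate_job` are hypothetical names. It assumes every chromosome label is `chr` followed by an integer in the range 0 to 255.

```python
NUM_REDUCERS = 256  # must match the number of reduce tasks the job runs with


def chrom_partition(chrom, num_reducers=NUM_REDUCERS):
    # Reducer index is the integer after the "chr" prefix, so reducer i
    # receives exactly the records for chromosome i.
    index = int(chrom[3:])
    if not 0 <= index < num_reducers:
        raise ValueError("chromosome %r is outside the partition range" % chrom)
    return index


def simulate_job(records, num_reducers=NUM_REDUCERS):
    parts = [[] for _ in range(num_reducers)]
    for rec in records:
        parts[chrom_partition(rec[1], num_reducers)].append(rec)
    # Within one reducer every record shares a chromosome, so sorting by
    # position is enough.
    return [sorted(part, key=lambda rec: rec[2]) for part in parts]


records = [
    ("reada14", "chr14", 50989),
    ("readh9", "chr9", 826357),
    ("readh14", "chr14", 841104),
    ("readh9", "chr9", 212247),
    ("readh2", "chr2", 908178),
]

parts = simulate_job(records)
# Reading the parts in reducer order now yields a globally sorted result.
concatenated = [rec for part in parts for rec in part]
for record in concatenated:
    print("\t".join(str(field) for field in record))
```

Because reducer indexes follow chromosome order, concatenating the output parts in order gives the global ordering without a separate merge step. The trade-off is that load is only as balanced as the chromosomes are.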
