A join operation using Hadoop MapReduce

April 15, 2010

How do I take the join of 2 record sets using MapReduce? Most of the solutions, including posted answers, suggest that I emit the records based on a common key and, in the reducer, add them to a HashMap and then take the cross product (e.g. "Join of 2 datasets in MapReduce/Hadoop").
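For readers who have not seen it, the approach described above can be sketched in plain Python. This is only an illustration of the idea, not Hadoop API code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `join`), the "A"/"B" tags, and the assumption that the join key is the first field of each record are all made up for the example.

```python
from collections import defaultdict
from itertools import product


def map_phase(records_a, records_b, key_of):
    """Tag each record with its source set and emit (join_key, (tag, record))."""
    for rec in records_a:
        yield key_of(rec), ("A", rec)
    for rec in records_b:
        yield key_of(rec), ("B", rec)


def shuffle(pairs):
    """Group emitted values by key, standing in for the framework's shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Buffer both sides for one key in memory, then emit their cross product."""
    buffered = {"A": [], "B": []}
    for tag, rec in values:
        buffered[tag].append(rec)
    for a, b in product(buffered["A"], buffered["B"]):
        yield key, a, b


def join(records_a, records_b, key_of=lambda rec: rec[0]):
    """Run the three phases in-process; the join key is assumed to be field 0."""
    groups = shuffle(map_phase(records_a, records_b, key_of))
    return [row for key in sorted(groups) for row in reduce_phase(key, groups[key])]
```

The `buffered` dictionary in `reduce_phase` plays the role of the HashMap: every record for a key is held in memory before any output is produced.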

This solution is good and works for the majority of cases, but in my case the issue is rather different. I am dealing with data that has billions of records, and taking the cross product of the 2 sets is impossible because in many cases the HashMap will end up holding a few million objects. So I encounter a heap space error.

I need a much more efficient solution. The whole point of MR is to deal with a very high amount of data, so I want to know if there is any solution that can help me avoid this issue.

I don't know if this is still relevant for anyone, but I am facing a similar issue these days. My intention is to use a key-value store, like Cassandra, and use it for the cross product. This means:

When running on a line of type A, look for the key in Cassandra. If it exists, merge the A records into the existing value (the B elements). If not, create the key and add the A elements as the value.

When running on a line of type B, look for the key in Cassandra. If it exists, merge the B records into the existing value (the A elements). If not, create the key and add the B elements as the value.
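The two rules above could look roughly like this. It is a minimal sketch in which a plain Python dict stands in for the key-value store, so nothing here is Cassandra driver code; `merge_line`, `joined_rows`, and the layout of the stored value (one list per line type) are assumptions made for the illustration.

```python
from itertools import product


def merge_line(store, key, line_type, elements):
    """Merge the elements of one input line into the value stored under its key.

    `store` is a dict standing in for the key-value store (hypothetical layout:
    each value keeps the A elements and the B elements in separate lists).
    """
    if line_type not in ("A", "B"):
        raise ValueError("line_type must be 'A' or 'B'")
    # If the key does not exist yet, create it with an empty value first.
    entry = store.setdefault(key, {"A": [], "B": []})
    entry[line_type].extend(elements)


def joined_rows(store):
    """Lazily yield the cross product of A and B elements for every key."""
    for key, entry in store.items():
        for a, b in product(entry["A"], entry["B"]):
            yield key, a, b
```

With a real store, the read-then-write in `merge_line` would need to be safe against concurrent tasks touching the same key, which this sketch does not address.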

This will require an additional server for Cassandra, and some disk space, but since I'm running in the cloud (Google's bdutil Hadoop framework), I don't think it should be much of a problem.
