A join operation using Hadoop MapReduce
How do I take a join of two record sets using MapReduce? Most of the solutions I have seen posted suggest emitting records based on a common key and then, in the reducer, adding them to a HashMap and taking a cross product (e.g. "Join of 2 datasets in MapReduce/Hadoop").
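To make the commonly suggested approach concrete, here is a minimal sketch of a reduce-side join, simulated in plain Python rather than written against the Hadoop API. The function names (`map_phase`, `shuffle`, `reduce_phase`, `reduce_side_join`) and the `"A"`/`"B"` tags are assumptions made for illustration only; in a real job the shuffle is done by the framework.

```python
from collections import defaultdict
from itertools import product


def map_phase(records, tag):
    """Emit (join_key, (tag, payload)) so the reducer can tell the sides apart."""
    for key, payload in records:
        yield key, (tag, payload)


def shuffle(pairs):
    """Group mapper output by key (stand-in for the framework's shuffle)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Buffer both sides for one key in memory, then emit their cross product."""
    buffers = {"A": [], "B": []}
    for tag, payload in values:
        buffers[tag].append(payload)
    for a, b in product(buffers["A"], buffers["B"]):
        yield key, a, b


def reduce_side_join(set_a, set_b):
    """Run the three simulated phases and collect the joined rows."""
    pairs = list(map_phase(set_a, "A")) + list(map_phase(set_b, "B"))
    joined = []
    for key, values in shuffle(pairs).items():
        joined.extend(reduce_phase(key, values))
    return joined
```

Note that `reduce_phase` holds every record for a key in `buffers` before emitting anything, so its memory use grows with the number of records that share that key.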
This solution works in the majority of cases, but in my case the issue is rather different. I am dealing with data that has billions of records, and taking a cross product of the two sets is impossible because in many cases the HashMap ends up holding a few million objects, so I encounter a heap space error.
I need a more efficient solution. The whole point of MR is to deal with very large amounts of data, so I want to know if there is a solution that can help me avoid this issue.
I don't know if this is still relevant to anyone, but I am facing a similar issue these days. My intention is to use a key-value store, such as Cassandra, and use it for the cross product. This means:
When running on a line of type A, look for the key in Cassandra. If it exists, merge the A records into the existing value (the B elements). If not, create the key and add the A elements as its value.
When running on a line of type B, look for the key in Cassandra. If it exists, merge the B records into the existing value (the A elements). If not, create the key and add the B elements as its value.
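The two steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: `KeyValueStore` is a hypothetical in-memory stand-in, not the Cassandra driver API, and `upsert`, `load`, and `joined_rows` are names invented for this example.

```python
class KeyValueStore:
    """Hypothetical in-memory stand-in for an external store such as Cassandra."""

    def __init__(self):
        self._rows = {}

    def get(self, key):
        # Returns None when the key has not been created yet.
        return self._rows.get(key)

    def put(self, key, value):
        self._rows[key] = value


def upsert(store, key, side, payload):
    """Merge one record into the value for its key, creating the key if needed."""
    row = store.get(key)
    if row is None:
        row = {"A": [], "B": []}
    row[side].append(payload)
    store.put(key, row)


def load(store, records, side):
    """Process every line of one type ("A" or "B")."""
    for key, payload in records:
        upsert(store, key, side, payload)


def joined_rows(store, key):
    """Lazily yield the cross product of the A and B elements stored for one key."""
    row = store.get(key) or {"A": [], "B": []}
    for a in row["A"]:
        for b in row["B"]:
            yield key, a, b
```

With a real store, the read-then-write in `upsert` would need to be atomic, or replaced by an append-style write, because several tasks may update the same key at once.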
This would require an additional server for Cassandra, and some disk space, but since I'm running in the cloud (Google's bdutil Hadoop framework), I don't think that should be much of a problem.