Returning values from InputFormat via the Hadoop Configuration object


Consider a running Hadoop job in which a custom InputFormat needs to communicate ("return", callback-style) a few simple values to the driver class (i.e., the class that launched the job) from within its overridden getSplits() method, using the new mapreduce API (as opposed to mapred).

These values should ideally be returned in-memory (as opposed to saving them to HDFS or to the DistributedCache).

If these values were numbers, one would be tempted to use Hadoop counters. However, in numerous tests counters do not seem to be available at the getSplits() phase, and in any case they are restricted to numbers.
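
(For reference, the sketch below shows the channel counters would normally provide, using the new mapreduce API; the group and counter names, "MyGroup" and "splitsValue", are made up for illustration. The issue is that no task context of this kind is available inside getSplits().)

// In a map (or reduce) task, where a task Context is available.
context.getCounter("MyGroup", "splitsValue").increment(value);

// In the driver class (org.apache.hadoop.mapreduce.Counters), typically once the job has completed.
Counters counters = job.getCounters();
long value = counters.findCounter("MyGroup", "splitsValue").getValue();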

An alternative would be to use the Configuration object of the job, which, as the source code reveals, should be the same object in memory for both getSplits() and the driver class.

In such a scenario, if the InputFormat wants to "return" (say) a positive long value to the driver class, the code would look something like:

// In the custom InputFormat.
public List<InputSplit> getSplits(JobContext job) throws IOException {
    ...
    long value = ... // value >= 0
    job.getConfiguration().setLong("value", value);
    ...
}

// In the Hadoop driver class.
Job job = ... // the job to be launched
...
job.submit(); // start running the job
...
while (!job.isComplete()) {
    ...
    if (job.getConfiguration().getLong("value", -1) >= 0)
    {
        ...
    }
    else
    {
        continue; // wait for the value to be set by getSplits()
    }
    ...
}

The above works in my tests, but is it a "safe" way of communicating values?

Or is there a better approach for such in-memory "callbacks"?

Update

the "in-memory callback" technique may not work in hadoop distributions, so, mentioned above, safer way is, instead of saving values passed in configuration object, create custom object, serialize (e.g., json), saved (in hdfs or in distributed cache) , have read in driver class. have tested approach , works expected.

Using the Configuration is a suitable solution (admittedly for a problem I'm not sure I fully understand), but once the job has actually been submitted to the job tracker, you will not be able to amend this value (client side or task side) and expect to see the change on the opposite side of the comms (setting configuration values in a map task, for example, will not be persisted to the other mappers, nor to the reducers, nor be visible to the job tracker).

So communicating information from within getSplits() to the client polling loop (to see when the job has actually finished defining the input splits) is fine in your example.

What's the greater aim or use case for using this?

