Returning values from InputFormat via the Hadoop Configuration object
Consider running a Hadoop job in which a custom InputFormat needs to communicate ("return", as a callback of sorts) a few simple values to the driver class (i.e., the class that launched the job) from within its overridden getSplits() method, using the new mapreduce API (as opposed to mapred).
These values should ideally be returned in memory (as opposed to saving them to HDFS or to the DistributedCache).
If these values were only numbers, one would be tempted to use Hadoop counters. However, in numerous tests counters do not seem to be available at the getSplits() phase, and in any case they are restricted to numbers.
An alternative would be to use the Configuration object of the job, which, as the source code reveals, should be the same object in memory for both getSplits() and the driver class.
In such a scenario, if the InputFormat wants to "return" (say) a positive long value to the driver class, the code would look something like:
// In the custom InputFormat.
public List<InputSplit> getSplits(JobContext job) throws IOException {
    ...
    long value = ... // some value >= 0
    job.getConfiguration().setLong("value", value);
    ...
}

// In the Hadoop driver class.
Job job = ... // the job to be launched
...
job.submit(); // start running the job
...
while (!job.isComplete()) {
    ...
    if (job.getConfiguration().getLong("value", -1) >= 0) {
        ...
    } else {
        continue; // wait for the value to be set by getSplits()
    }
    ...
}
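The polling above relies on a sentinel default: Configuration.getLong("value", -1) returns -1 until getSplits() has published a non-negative value. A minimal stand-alone sketch of that sentinel pattern, with a ConcurrentHashMap standing in for the shared in-memory Configuration (the class and method names below are illustrative, not part of Hadoop):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class SentinelPoll {
    // Stand-in for the shared in-memory Configuration object.
    static final ConcurrentMap<String, Long> conf = new ConcurrentHashMap<>();

    // Mirrors Configuration.getLong(name, defaultValue): returns the
    // caller-supplied default when the key has not been set yet.
    static long getLong(String name, long defaultValue) {
        return conf.getOrDefault(name, defaultValue);
    }

    // The published value is >= 0 by contract, so the sentinel -1
    // unambiguously means "getSplits() has not run yet".
    static boolean valuePublished() {
        return getLong("value", -1L) >= 0;
    }
}
```

The sentinel only works because the contract restricts the real value to non-negative longs; if any long could be published, a separate "value-is-set" flag key would be needed.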
The above works in my tests, but is it a "safe" way of communicating values? Or is there a better approach for such in-memory "callbacks"?
Update
The "in-memory callback" technique may not work with all Hadoop distributions, so, as mentioned above, the safer way is, instead of saving the values to be passed in the Configuration object, to create a custom object holding them, serialize it (e.g., as JSON), save it (in HDFS or in the distributed cache), and have it read in the driver class. I have tested this approach and it works as expected.
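A sketch of the serialization half of that approach, using a simple key/value payload via java.util.Properties rather than a JSON library (an assumption for illustration; in the real job, getSplits() would write the encoded bytes to an agreed HDFS path with FileSystem.create(path), and the driver would poll FileSystem.exists(path) before reading them back):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Properties;

public class SplitResultCodec {
    // Encode a value computed in getSplits() into a byte payload
    // suitable for writing to HDFS (or the distributed cache).
    static byte[] encode(long value) throws IOException {
        Properties p = new Properties();
        p.setProperty("value", Long.toString(value));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        p.store(out, "values published by getSplits()");
        return out.toByteArray();
    }

    // Decode the payload in the driver class; -1 signals "not present".
    static long decodeValue(byte[] payload) throws IOException {
        Properties p = new Properties();
        p.load(new ByteArrayInputStream(payload));
        return Long.parseLong(p.getProperty("value", "-1"));
    }
}
```

A custom object with several fields would simply add more properties (or switch to a JSON library such as Jackson); the write-then-poll protocol on the driver side stays the same.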
Using the Configuration is a suitable solution (admittedly for a problem I'm not sure I fully understand), but once the job has actually been submitted to the job tracker, you will not be able to amend the value (client side or task side) and expect to see the change on the opposite side of the communication (setting configuration values in a map task, for example, is not persisted to the other mappers, nor to the reducers, nor is it visible to the job tracker).
So communicating information back from within getSplits to your client polling loop (to see when the job has actually finished defining the input splits) is fine in your example.
What's your greater aim or use case for using this?