Is there a mathematical model to describe the relationship between running time and input data size for Hadoop?
In a Hadoop cluster, is there a mathematical model that describes the curve relating transmission time to the input data size of the mappers?

For example, suppose the original data size is N, there are M mappers, and the total transmission time from the mappers to the reducers is T. If I double the data size to 2N in the mappers, is there an approximation or estimate of the new transmission time T'? (I think T' must be less than 2T.) My idea is to use a log curve to describe the relationship, but I am not sure that is correct.
I assume the input comes from HDFS, i.e. the input data has already been placed on HDFS, so we are not talking about the time to transmit the input data from a local file store to HDFS. I assume the input size N is the total size of all the input files combined, and that M is the number of map tasks (based on the number of input splits the input files are broken into). If we are talking about transmission between map tasks and reduce tasks, then what we need to know is the size of the output of the map operations. In general, the size of that output is unrelated to the size of the input N.
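To make that point concrete, here is a minimal back-of-the-envelope sketch. It assumes a job-specific "selectivity" factor `selectivity` (the ratio of map-output bytes to map-input bytes), which is an invented parameter you would have to measure for your own job from counters such as "Map output bytes"; it cannot be derived from N alone:

```python
def shuffle_bytes(input_bytes, selectivity):
    """Estimate the total map-output bytes sent to reducers.

    selectivity is a per-job, empirically measured ratio:
    a filtering job may emit far less than it reads (selectivity < 1),
    while a sort or join may emit roughly as much as it reads
    (selectivity close to 1), or even more.
    """
    return input_bytes * selectivity

# Two jobs reading the same 1 GB input can shuffle wildly
# different amounts of data:
gb = 1 << 30
filter_job = shuffle_bytes(gb, 0.01)  # heavy filtering
sort_job = shuffle_bytes(gb, 1.0)     # identity mapper
```

Under this (linear) assumption, doubling N would roughly double the shuffled bytes for the *same* job, but comparing across jobs the relationship to N breaks down entirely.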
Even if we knew how much total data needs to be transmitted between the map tasks and reduce tasks, asking for the transmission time is not very meaningful, because transmission can happen at the same time the map and reduce tasks are executing, and the series of individual transmissions between individual map tasks and reduce tasks each happen at different points in time. A goal of a well-written Hadoop application is to hide the transmission time by overlapping computation and communication.
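A toy model of that overlap, under the simplifying (and hypothetical) assumption that shuffle transfers are fully pipelined with computation: the phase then lasts as long as the slower of the two activities, so transmission only adds wall-clock time when it exceeds the computation it overlaps with:

```python
def phase_wall_time(compute_seconds, transfer_seconds):
    """Wall-clock time of a phase where data transfer is fully
    overlapped with computation: the phase is bounded by whichever
    activity is slower, not by their sum."""
    return max(compute_seconds, transfer_seconds)

def exposed_transfer_time(compute_seconds, transfer_seconds):
    """The portion of transfer time that is NOT hidden by
    computation, i.e. what actually shows up in the job's runtime."""
    return max(0, transfer_seconds - compute_seconds)
```

In this model, doubling the shuffled data can leave the runtime unchanged (if transfer was already hidden under computation) or increase it by anything up to the extra transfer time, which is another reason a single curve in N cannot predict T'.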