Hadoop - map tasks continue after reduce tasks completed
I'm running Hadoop version 1.0.0 on a cluster of 500 nodes. The job has 3000 map tasks and 10 reduce tasks. The map tasks complete after 4 hours (as expected). The reduce tasks each complete after that, and the results are available in the output directory. However, the jobtracker thinks some of the map tasks have failed and starts re-executing them. The number of executing and pending reduce tasks remains at zero. 8 hours later the last of these map tasks completes and the job is marked as completed.
Any ideas?
The following is an extract of the jobtracker log file:
// map tasks complete, e.g.:
2013-05-20 10:50:59,742 info org.apache.hadoop.mapred.jobinprogress: task 'attempt_201305131710_0007_m_000430_0' has completed task_201305131710_0007_m_000430 successfully.
// reduce tasks complete:
2013-05-20 13:38:34,040 info org.apache.hadoop.mapred.jobinprogress: task 'attempt_201305131710_0007_r_000009_0' has completed task_201305131710_0007_r_000009 successfully.
2013-05-20 13:38:34,142 info org.apache.hadoop.mapred.jobinprogress: task 'attempt_201305131710_0007_r_000004_0' has completed task_201305131710_0007_r_000004 successfully.
2013-05-20 13:38:34,204 info org.apache.hadoop.mapred.jobinprogress: task 'attempt_201305131710_0007_r_000008_0' has completed task_201305131710_0007_r_000008 successfully.
2013-05-20 13:38:34,745 info org.apache.hadoop.mapred.jobinprogress: task 'attempt_201305131710_0007_r_000002_0' has completed task_201305131710_0007_r_000002 successfully.
2013-05-20 13:38:35,521 info org.apache.hadoop.mapred.jobinprogress: task 'attempt_201305131710_0007_r_000003_0' has completed task_201305131710_0007_r_000003 successfully.
2013-05-20 13:38:36,196 info org.apache.hadoop.mapred.jobinprogress: task 'attempt_201305131710_0007_r_000007_0' has completed task_201305131710_0007_r_000007 successfully.
2013-05-20 13:38:36,276 info org.apache.hadoop.mapred.jobtracker: adding tracker tracker_hn301-1657.labs.edu.au:127.0.0.1/127.0.0.1:1295 host hn301-1657.labs.edu.au
2013-05-20 13:38:36,469 info org.apache.hadoop.mapred.jobinprogress: task 'attempt_201305131710_0007_r_000005_0' has completed task_201305131710_0007_r_000005 successfully.
2013-05-20 13:38:36,598 info org.apache.hadoop.mapred.jobinprogress: task 'attempt_201305131710_0007_r_000006_0' has completed task_201305131710_0007_r_000006 successfully.
2013-05-20 13:38:36,612 info org.apache.hadoop.mapred.jobinprogress: task 'attempt_201305131710_0007_r_000000_0' has completed task_201305131710_0007_r_000000 successfully.
2013-05-20 13:38:40,388 info org.apache.hadoop.mapred.jobinprogress: task 'attempt_201305131710_0007_r_000001_0' has completed task_201305131710_0007_r_000001 successfully.
2013-05-20 13:44:12,795 info org.apache.hadoop.mapred.jobtracker: lost tracker 'tracker_hn301-1657.labs.edu.au:127.0.0.1/127.0.0.1:3896'
// as the reduce tasks are reporting success, the job tracker detects that one of the task trackers has died, and restarts it.
// each of the tasks completed by that task tracker is re-executed
2013-05-20 13:44:12,795 info org.apache.hadoop.mapred.taskinprogress: error attempt_201305131710_0007_m_000430_0: lost task tracker: tracker_hn301-1657.labs.edu.au:127.0.0.1/127.0.0.1:3896
2013-05-20 13:44:12,795 info org.apache.hadoop.mapred.taskinprogress: error attempt_201305131710_0007_m_000571_0: lost task tracker: tracker_hn301-1657.labs.edu.au:127.0.0.1/127.0.0.1:3896
2013-05-20 13:44:12,796 info org.apache.hadoop.mapred.taskinprogress: error attempt_201305131710_0007_m_001612_0: lost task tracker: tracker_hn301-1657.labs.edu.au:127.0.0.1/127.0.0.1:3896
2013-05-20 13:44:12,796 info org.apache.hadoop.mapred.taskinprogress: error attempt_201305131710_0007_m_001629_0: lost task tracker: tracker_hn301-1657.labs.edu.au:127.0.0.1/127.0.0.1:3896
2013-05-20 13:44:12,796 info org.apache.hadoop.mapred.taskinprogress: error attempt_201305131710_0007_m_001892_0: lost task tracker: tracker_hn301-1657.labs.edu.au:127.0.0.1/127.0.0.1:3896
2013-05-20 13:44:12,796 info org.apache.hadoop.mapred.taskinprogress: error attempt_201305131710_0007_m_002424_0: lost task tracker: tracker_hn301-1657.labs.edu.au:127.0.0.1/127.0.0.1:3896
2013-05-20 13:44:12,796 info org.apache.hadoop.mapred.taskinprogress: error attempt_201305131710_0007_m_002437_0: lost task tracker: tracker_hn301-1657.labs.edu.au:127.0.0.1/127.0.0.1:3896
2013-05-20 13:44:12,796 info org.apache.hadoop.mapred.taskinprogress: error attempt_201305131710_0007_m_002696_0: lost task tracker: tracker_hn301-1657.labs.edu.au:127.0.0.1/127.0.0.1:3896
2013-05-20 13:44:12,796 info org.apache.hadoop.mapred.taskinprogress: error attempt_201305131710_0007_m_003130_0: lost task tracker: tracker_hn301-1657.labs.edu.au:127.0.0.1/127.0.0.1:3896
2013-05-20 13:44:12,796 info org.apache.hadoop.mapred.taskinprogress: error attempt_201305131710_0007_m_003149_0: lost task tracker: tracker_hn301-1657.labs.edu.au:127.0.0.1/127.0.0.1:3896
2013-05-20 13:44:12,796 info org.apache.hadoop.mapred.taskinprogress: error attempt_201305131710_0007_m_003187_0: lost task tracker: tracker_hn301-1657.labs.edu.au:127.0.0.1/127.0.0.1:3896
2013-05-20 13:44:12,797 info org.apache.hadoop.mapred.taskinprogress: error attempt_201305131710_0007_m_003275_0: lost task tracker: tracker_hn301-1657.labs.edu.au:127.0.0.1/127.0.0.1:3896
2013-05-20 13:44:12,797 info org.apache.hadoop.mapred.taskinprogress: error attempt_201305131710_0007_m_003358_0: lost task tracker: tracker_hn301-1657.labs.edu.au:127.0.0.1/127.0.0.1:3896
2013-05-20 13:44:12,797 info org.apache.hadoop.mapred.taskinprogress: error attempt_201305131710_0007_m_003437_0: lost task tracker: tracker_hn301-1657.labs.edu.au:127.0.0.1/127.0.0.1:3896
2013-05-20 13:44:12,797 info org.apache.hadoop.mapred.taskinprogress: error attempt_201305131710_0007_m_003451_0: lost task tracker: tracker_hn301-1657.labs.edu.au:127.0.0.1/127.0.0.1:3896
2013-05-20 13:44:12,797 info org.apache.hadoop.mapred.taskinprogress: error attempt_201305131710_0007_m_003478_0: lost task tracker: tracker_hn301-1657.labs.edu.au:127.0.0.1/127.0.0.1:3896
2013-05-20 13:44:12,797 info org.apache.hadoop.mapred.taskinprogress: error attempt_201305131710_0007_m_003506_0: lost task tracker: tracker_hn301-1657.labs.edu.au:127.0.0.1/127.0.0.1:3896
2013-05-20 13:44:12,797 info org.apache.hadoop.mapred.taskinprogress: error attempt_201305131710_0010_m_000021_0: lost task tracker: tracker_hn301-1657.labs.edu.au:127.0.0.1/127.0.0.1:3896
2013-05-20 13:44:12,797 info org.apache.hadoop.mapred.jobtracker: removing task 'attempt_201305131710_0010_m_000021_0'
2013-05-20 13:44:12,797 info org.apache.hadoop.mapred.jobtracker: removing task 'attempt_201305131710_0007_m_000430_0'
2013-05-20 13:44:12,797 info org.apache.hadoop.mapred.jobtracker: removing task 'attempt_201305131710_0007_m_000571_0'
2013-05-20 13:44:12,797 info org.apache.hadoop.mapred.jobtracker: removing task 'attempt_201305131710_0007_m_001612_0'
2013-05-20 13:44:12,797 info org.apache.hadoop.mapred.jobtracker: removing task 'attempt_201305131710_0007_m_001629_0'
2013-05-20 13:44:12,797 info org.apache.hadoop.mapred.jobtracker: removing task 'attempt_201305131710_0007_m_001892_0'
2013-05-20 13:44:12,797 info org.apache.hadoop.mapred.jobtracker: removing task 'attempt_201305131710_0007_m_002424_0'
2013-05-20 13:44:12,797 info org.apache.hadoop.mapred.jobtracker: removing task 'attempt_201305131710_0007_m_002437_0'
2013-05-20 13:44:12,797 info org.apache.hadoop.mapred.jobtracker: removing task 'attempt_201305131710_0007_m_002696_0'
2013-05-20 13:44:12,797 info org.apache.hadoop.mapred.jobtracker: removing task 'attempt_201305131710_0007_m_003130_0'
2013-05-20 13:44:12,797 info org.apache.hadoop.mapred.jobtracker: removing task 'attempt_201305131710_0007_m_003149_0'
2013-05-20 13:44:12,797 info org.apache.hadoop.mapred.jobtracker: removing task 'attempt_201305131710_0007_m_003187_0'
2013-05-20 13:44:12,797 info org.apache.hadoop.mapred.jobtracker: removing task 'attempt_201305131710_0007_m_003275_0'
2013-05-20 13:44:12,797 info org.apache.hadoop.mapred.jobtracker: removing task 'attempt_201305131710_0007_m_003358_0'
2013-05-20 13:44:12,797 info org.apache.hadoop.mapred.jobtracker: removing task 'attempt_201305131710_0007_m_003437_0'
2013-05-20 13:44:12,797 info org.apache.hadoop.mapred.jobtracker: removing task 'attempt_201305131710_0007_m_003451_0'
2013-05-20 13:44:12,797 info org.apache.hadoop.mapred.jobtracker: removing task 'attempt_201305131710_0007_m_003478_0'
2013-05-20 13:44:12,797 info org.apache.hadoop.mapred.jobtracker: removing task 'attempt_201305131710_0007_m_003506_0'
2013-05-20 13:44:12,917 info org.apache.hadoop.mapred.jobtracker: adding task (task_cleanup) 'attempt_201305131710_0010_m_000021_0' tip task_201305131710_0010_m_000021, tracker 'tracker_hn301-1654.labs.edu.au:127.0.0.1/127.0.0.1:1100'
2013-05-20 13:44:13,760 info org.apache.hadoop.mapred.jobinprogress: choosing failed task task_201305131710_0007_m_000430
2013-05-20 13:44:13,761 info org.apache.hadoop.mapred.jobtracker: adding task (map) 'attempt_201305131710_0007_m_000430_1' tip task_201305131710_0007_m_000430, tracker 'tracker_zc329-0001.labs.edu.au:127.0.0.1/127.0.0.1:1113'
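To see at a glance which attempts were invalidated by a lost tracker, a log extract like the one above could be summarised with a short script. This is only a sketch: the function name `attempts_by_lost_tracker` and the regular expression are hypothetical and are based purely on the entry format shown in this extract. Matching is case-insensitive because the extract appears lowercased, which may not be how the real log file is capitalised.

```python
import re
from collections import defaultdict

# Pattern derived from the "lost task tracker" entries in the extract above.
# Assumption: the attempt ID is followed by ": lost task tracker: <tracker name>".
LOST_ATTEMPT = re.compile(
    r"(attempt_\d+_\d+_[mr]_\d+_\d+): lost task tracker: (tracker_\S+)",
    re.IGNORECASE,
)


def attempts_by_lost_tracker(log_lines):
    """Group the attempt IDs reported as lost, keyed by tracker name."""
    lost = defaultdict(list)
    for line in log_lines:
        # findall copes with several entries flattened onto one line.
        for attempt, tracker in LOST_ATTEMPT.findall(line):
            lost[tracker].append(attempt)
    return dict(lost)
```

Calling `attempts_by_lost_tracker(open(path))` on a jobtracker log (where `path` is wherever that log lives on your system) returns a dictionary whose values list the attempts that will be rescheduled for each lost tracker.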
You might want to check the configuration of each cluster node:
tracker_hn301-1657.labs.edu.au:127.0.0.1/127.0.0.1:3896
The jobtracker is trying to contact the tasktracker on this node via the loopback address. Check the contents of each node's /etc/hosts file and make sure they are correct (and preferably that each node knows about every other node in the cluster, so you can avoid DNS lookup costs).
I'm not saying this is the cause of your problem, but it isn't right, and you should track it down.
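As a rough way to run the /etc/hosts check on each node, something like the following sketch could flag hostnames that are mapped to a loopback address. The function name `loopback_hostnames` and the set of names treated as legitimately local are assumptions made for illustration, not part of Hadoop; adjust them to your own conventions.

```python
import ipaddress

# Names that are normally expected to map to loopback (an assumption; edit as needed).
EXPECTED_LOOPBACK_NAMES = {
    "localhost",
    "localhost.localdomain",
    "ip6-localhost",
    "ip6-loopback",
}


def loopback_hostnames(hosts_text):
    """Return (address, name) pairs where a non-localhost name maps to loopback."""
    suspicious = []
    for raw in hosts_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        address, *names = line.split()
        try:
            is_loopback = ipaddress.ip_address(address).is_loopback
        except ValueError:
            continue  # first field is not a parseable IP address; skip the line
        if is_loopback:
            suspicious.extend(
                (address, name)
                for name in names
                if name.lower() not in EXPECTED_LOOPBACK_NAMES
            )
    return suspicious
```

Running `loopback_hostnames(open("/etc/hosts").read())` on a node and getting back that node's own cluster hostname would be consistent with the 127.0.0.1 addresses seen in the tracker names, though it does not prove that this is the cause of the re-executed map tasks.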