command_runner doesn't always restart cleanly

RESOLVED FIXED

Status

Opened 5 years ago
Last updated 3 months ago

People

(Reporter: catlee, Unassigned)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Reporter)

Description

5 years ago
command_runner sometimes dies with an exception like this:
Traceback (most recent call last):
  File "tools/buildbot-helpers/command_runner.py", line 199, in <module>
    main()
  File "tools/buildbot-helpers/command_runner.py", line 196, in main
    runner.loop()
  File "tools/buildbot-helpers/command_runner.py", line 112, in loop
    self.monitor()
  File "tools/buildbot-helpers/command_runner.py", line 100, in monitor
    self.q.remove(job.item_id)
  File "/builds/buildbot/queue/tools/lib/python/mozilla_buildtools/queuedir.py", line 191, in remove
    os.unlink(os.path.join(self.cur_dir, item_id))
OSError: [Errno 2] No such file or directory: '/dev/shm/queue/commands/cur/1367978378-0-22524RDEZrh'
(Reporter)

Comment 1

5 years ago
I suspect this is due to running it with -j4, not due to restarting it.
(Reporter)

Comment 2

5 years ago
http://hg.mozilla.org/build/tools/rev/b339c1d70d4f seems to fix it

The problem was that with -j1, we would end up in this block of code when waiting for a job to finish:
http://hg.mozilla.org/build/tools/file/b339c1d70d4f/buildbot-helpers/command_runner.py#l114

No problems there: it's a nice, simple busy loop.

If -j > 1, then we get into this part of the code while waiting for jobs to finish:
http://hg.mozilla.org/build/tools/file/b339c1d70d4f/buildbot-helpers/command_runner.py#l124

Without pyinotify, we would wait up to 1000s, or until a new job came along to wake us up. That wait could exceed 5 minutes, which is enough time for the job files to be cleaned up by various processes.

Now we wait only 1 second, so we can go back and touch all the job files we have active.
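
For anyone reading this later, here is a rough sketch of the idea behind the fix: poll on a short interval and refresh the mtime of every active job file on each pass, so that age-based cleanup does not remove them while a job is still running. This is not the code from the changeset above. The names touch, wait_for_jobs, active_paths, is_done, poll_interval and max_wait are all hypothetical and chosen only for illustration.

```python
import os
import time


def touch(path):
    """Refresh a file's mtime so age-based cleanup leaves it alone."""
    try:
        os.utime(path, None)
    except FileNotFoundError:
        # The file is already gone; there is nothing left to refresh.
        pass


def wait_for_jobs(active_paths, is_done, poll_interval=1.0, max_wait=10.0):
    """Poll until is_done() returns true or max_wait seconds elapse.

    Every pass touches each path in active_paths (hypothetical list of
    job files owned by running jobs). Returns the number of passes made.
    """
    deadline = time.monotonic() + max_wait
    passes = 0
    while not is_done() and time.monotonic() < deadline:
        for path in active_paths:
            touch(path)
        passes += 1
        # A short sleep keeps the gap between touches small, unlike a
        # single long wait that could outlast the cleanup threshold.
        time.sleep(poll_interval)
    return passes
```

The point of the short interval is only that the gap between touches stays well below whatever age the cleanup processes use, assuming they decide by file age.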
Status: NEW → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
(Assignee)

Updated

5 years ago
Product: mozilla.org → Release Engineering
(Assignee)

Updated

3 months ago
Component: General Automation → General
Product: Release Engineering → Release Engineering