Fuzzing jobs are hitting file-in-use errors

RESOLVED INVALID

Status

Priority: P3
Severity: normal
Status: RESOLVED INVALID
Opened: 6 years ago
Last updated: 5 months ago

People

(Reporter: jruderman, Unassigned)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [fuzzer])

(Reporter)

Description

6 years ago
One of the first things bot.py does is clean up any leftover wtmp1/ directory from previous runs. On Windows, this cleanup frequently fails with an error saying that a file in wtmp1/ is in use.

Do these machines get rebooted between jobs? Is there indexing, antivirus, or something else running on the machines that would interfere with the file system? Do you know how to debug problems like this?

From /mnt/pvt_builds/fuzzing/tinderbox-builds/idle-win32fuzzer-win32-bm13-build1-build2560.txt.gz on pvtbuilds2:

wtmp1 shouldn't exist now. killing it.
Traceback (most recent call last):
  File "fuzzing/dom/automation/bot.py", line 15, in <module>
    bot.main()
  File "e:\builds\moz2_slave\fuzzer-win32\fuzzing\bot.py", line 208, in main
    shutil.rmtree("wtmp1")
  File "d:\mozilla-build\python25\lib\shutil.py", line 174, in rmtree
    onerror(os.remove, fullname, sys.exc_info())
  File "d:\mozilla-build\python25\lib\shutil.py", line 172, in rmtree
    os.remove(fullname)
WindowsError: [Error 13] The process cannot access the file because it is being used by another process: 'wtmp1\\w10-err.txt'
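
For illustration only, here is a minimal sketch of a cleanup step that retries when a file is still locked. It assumes a modern Python 3 rather than the Python 2.5 shown in the traceback, and `remove_tree_with_retries`, `attempts`, and `delay` are hypothetical names that do not come from bot.py. Retrying can only help if the lock is transient; if a leftover process holds the file open indefinitely, every attempt will fail.

```python
import os
import shutil
import time


def remove_tree_with_retries(path, attempts=5, delay=1.0):
    """Try to delete `path`, retrying if a file in it cannot be removed.

    Returns True if the directory is gone, False if every attempt failed.
    """
    for attempt in range(attempts):
        if not os.path.exists(path):
            return True
        try:
            shutil.rmtree(path)
            return True
        except OSError as exc:
            # On Windows, a file held open by another process raises an
            # OSError subclass here, as in the traceback above.
            print("attempt %d: could not remove %s: %s" % (attempt + 1, path, exc))
            time.sleep(delay)
    return not os.path.exists(path)
```

A caller could use `remove_tree_with_retries("wtmp1")` in place of the bare `shutil.rmtree("wtmp1")` call and fail the job with a clearer message when it returns False.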

You can find more instances with:
  zcat * | grep "because it is being used"
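
If it helps to see which logs (and therefore which machines) are affected, the same search can be sketched in Python so that matches are grouped per file. This assumes Python 3 and gzip-compressed text logs; `count_matches_per_log` is a hypothetical helper name.

```python
import glob
import gzip


def count_matches_per_log(pattern, needle="because it is being used"):
    """Map each matching .gz log path to the number of lines containing `needle`."""
    counts = {}
    for log_path in sorted(glob.glob(pattern)):
        # Open as text; replace undecodable bytes instead of failing.
        with gzip.open(log_path, "rt", errors="replace") as log_file:
            hits = sum(1 for line in log_file if needle in line)
        if hits:
            counts[log_path] = hits
    return counts
```

Calling `count_matches_per_log("*.txt.gz")` in the log directory would return only the logs that contain the error.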
I found a couple of machines where the js executable was still running after the buildbot job finished. It's a known issue that buildbot isn't always successful at cleaning up on Windows, but I think I saw this on Linux too, so there's something fishy going on.

On a Windows box, wtmp1/ was using 93G of space and had filled up the builds partition. The w1-err.txt file was the problem: by the looks of it, it contains lots of repetitions of
 e:\builds\moz2_slave\fuzzer-win32\fuzzing\js\jsfunfuzz.js:660: strict warning: jgipjb is read-only
See pvtbuilds2:/tmp/fuzz.tar.gz for a copy of fuzzer-win32/. tar truncated the err log to the first 145MB; the fuzzing repo rev was 2f9ea46c14ff.
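
A rough sketch of how that kind of check could be scripted, to spot a runaway log before it fills the partition. It assumes Python 3; `summarize_directory` is a hypothetical helper and is not something bot.py provides.

```python
import os


def summarize_directory(path):
    """Return (total_bytes, largest_file_path, largest_file_bytes) for `path`."""
    total = 0
    largest_path, largest_size = None, 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            try:
                size = os.path.getsize(full)
            except OSError:
                continue  # file vanished or is inaccessible; skip it
            total += size
            if size > largest_size:
                largest_path, largest_size = full, size
    return total, largest_path, largest_size
```

Running `summarize_directory("wtmp1")` would report the total size and the single largest file, which in the case above would presumably have pointed at w1-err.txt.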
I don't think the machines get rebooted between fuzzing jobs. That might be an easy way to fix this issue.
(Reporter)

Comment 3

6 years ago
Was the still-running js executable left over from a job that finished normally, or a job that hit a buildbot timeout?
(In reply to Chris AtLee [:catlee] from comment #2)
> I don't think the machines get rebooted between fuzzing jobs. That might be
> an easy way to fix this issue.

I think we should be doing that anyway.

I'm taking a stab at the platform based on the pattern in bug 692715.
Component: Release Engineering → Release Engineering: Automation (General)
OS: All → Windows 7
QA Contact: catlee
Hardware: All → x86
Whiteboard: [fuzzer]
Chris, actually our logs seem to show Windows Server (I think 2003).
OS: Windows 7 → Windows Server 2003

Updated

6 years ago
Priority: -- → P3
(Assignee)

Updated

5 years ago
Product: mozilla.org → Release Engineering
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → INVALID
(Assignee)

Updated

5 months ago
Component: General Automation → General
Product: Release Engineering → Release Engineering