Closed Bug 719377 Opened 13 years ago Closed 12 years ago

Android reftests fail with a bunch of "error pulling file '/mnt/sdcard/tests/reftest/reftest.log': No such file or directory"

Categories

(Testing :: General, defect)

Hardware: ARM
OS: Android
Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: mak, Unassigned)

References

Details

(Keywords: intermittent-failure)

Attachments

(1 file)

https://tbpl.mozilla.org/php/getParsedLog.php?id=8662106&tree=Mozilla-Inbound

pushing directory: /tmp/tmpwo0Ora to /mnt/sdcard/tests/reftest/profile
pushing directory: /tmp/tmpwo0Ora to /mnt/sdcard/tests/reftest/profile
REFTEST INFO | runreftest.py | Running tests: start.
FIRE PROC: '"MOZ_CRASHREPORTER=1,XPCOM_DEBUG_BREAK=stack,MOZ_CRASHREPORTER_NO_REPORT=1,NO_EM_RESTART=1,MOZ_PROCESS_LOG=/tmp/tmpZ5RpHmpidlog,XPCOM_MEM_BLOAT_LOG=/tmp/tmpwo0Ora/runreftest_leaks.log" org.mozilla.fennec -no-remote -profile /mnt/sdcard/tests/reftest/profile/'
INFO | automation.py | Application pid: 1489
DeviceManager: error pulling file '/mnt/sdcard/tests/reftest/reftest.log': No such file or directory
DeviceManager: error pulling file '/mnt/sdcard/tests/reftest/reftest.log': No such file or directory
DeviceManager: error pulling file '/mnt/sdcard/tests/reftest/reftest.log': No such file or directory
...
DeviceManager: error pulling file '/mnt/sdcard/tests/reftest/reftest.log': No such file or directory
INFO | automation.py | Application ran for: 1:01:20.529375
INFO | automation.py | Reading PID log: /tmp/tmpZ5RpHmpidlog
getting files in '/mnt/sdcard/tests/reftest/profile/minidumps/'
WARNING | automationutils.processLeakLog() | refcount logging is off, so leaks can't be detected!
REFTEST INFO | runreftest.py | Running tests: end.
DeviceManager: error pulling file '/mnt/sdcard/tests/reftest/reftest.log': No such file or directory
program finished with exit code 0
elapsedTime=3688.847225
TinderboxPrint: jsreftest-1<br/><em class="testfail">T-FAIL</em>
It seems that sometimes we can't pull back the reftest log with the results. For RelEng this means the job should have been turned purple.

372   if (self.remoteLogFile):
373     self._devicemanager.getFile(self.remoteLogFile, self.localLogName)

Perhaps a try/except with sys.exit(5)? (A rough sketch follows the log below.)

[1] http://mxr.mozilla.org/mozilla-central/source/layout/tools/reftest/remotereftest.py#370

python reftest/remotereftest.py --deviceIP 10.250.49.12 --xre-path ../hostutils/xre --utility-path ../hostutils/bin --app org.mozilla.fennec --http-port 30025 --ssl-port 31025 --pidfile /builds/tegra-025/test/../remotereftest.pid --enable-privilege --bootstrap --total-chunks 3 --this-chunk 1 reftest/tests/testing/crashtest/crashtests.list --symbols-path=../http://stage.mozilla.org/pub/mozilla.org/mobile/tinderbox-builds/mozilla-central-android/1331649309/fennec-13.0a1.en-US.android-arm.crashreporter-symbols.zip
 in dir /builds/tegra-025/test/build/tests (timeout 2400 secs)
watching logfiles {}
argv: ['python', 'reftest/remotereftest.py', '--deviceIP', '10.250.49.12', '--xre-path', '../hostutils/xre', '--utility-path', '../hostutils/bin', '--app', 'org.mozilla.fennec', '--http-port', '30025', '--ssl-port', '31025', '--pidfile', '/builds/tegra-025/test/../remotereftest.pid', '--enable-privilege', '--bootstrap', '--total-chunks', '3', '--this-chunk', '1', 'reftest/tests/testing/crashtest/crashtests.list', '--symbols-path=../http://stage.mozilla.org/pub/mozilla.org/mobile/tinderbox-builds/mozilla-central-android/1331649309/fennec-13.0a1.en-US.android-arm.crashreporter-symbols.zip']
environment:
 MINIDUMP_SAVE_PATH=/builds/tegra-025/test/minidumps
 MINIDUMP_STACKWALK=/builds/tegra-025/test/tools/breakpad/osx/minidump_stackwalk
 PATH=/opt/local/bin:/opt/local/sbin:/opt/local/Library/Frameworks/Python.framework/Versions/2.6/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin
 PWD=/builds/tegra-025/test/build/tests
 SUT_IP=10.250.49.12
 SUT_NAME=tegra-025
 __CF_USER_TEXT_ENCODING=0x1F6:0:0
closing stdin
using PTY: False
unable to execute ADB: ensure Android SDK is installed and adb is in your $PATH
restarting as root failed
reconnecting socket
args: ['../hostutils/bin/xpcshell', '-g', '/builds/tegra-025/test/build/hostutils/xre', '-v', '170', '-f', '/builds/tegra-025/test/build/tests/reftest/reftest/components/httpd.js', '-e', "const _PROFILE_PATH = '/tmp/tmpMjgE9k';const _SERVER_PORT = '30025'; const _SERVER_ADDR ='10.250.48.200';", '-f', '/builds/tegra-025/test/build/tests/reftest/server.js']
INFO | remotereftests.py | Server pid: 25579
pushing directory: /tmp/tmplmBwig to /mnt/sdcard/tests/reftest/profile
pushing directory: /tmp/tmplmBwig to /mnt/sdcard/tests/reftest/profile
REFTEST INFO | runreftest.py | Running tests: start.
FIRE PROC: '"MOZ_CRASHREPORTER=1,XPCOM_DEBUG_BREAK=stack,MOZ_CRASHREPORTER_NO_REPORT=1,NO_EM_RESTART=1,MOZ_PROCESS_LOG=/tmp/tmpFdkHrBpidlog,XPCOM_MEM_BLOAT_LOG=/tmp/tmplmBwig/runreftest_leaks.log" org.mozilla.fennec -no-remote -profile /mnt/sdcard/tests/reftest/profile/'
INFO | automation.py | Application pid: 1539
DeviceManager: error pulling file '/mnt/sdcard/tests/reftest/reftest.log': No such file or directory
DeviceManager: error pulling file '/mnt/sdcard/tests/reftest/reftest.log': No such file or directory
...
DeviceManager: error pulling file '/mnt/sdcard/tests/reftest/reftest.log': No such file or directory
INFO | automation.py | Application ran for: 1:01:19.236892
INFO | automation.py | Reading PID log: /tmp/tmpFdkHrBpidlog
getting files in '/mnt/sdcard/tests/reftest/profile/minidumps/'
WARNING | automationutils.processLeakLog() | refcount logging is off, so leaks can't be detected!
REFTEST INFO | runreftest.py | Running tests: end.
DeviceManager: error pulling file '/mnt/sdcard/tests/reftest/reftest.log': No such file or directory
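For what it's worth, here is a rough sketch of the kind of guard being proposed around that pull. The function wrapper, the exception handling, and the assumption that getFile raises on failure are all illustrative, not the actual remotereftest.py code:

    import sys

    def pull_reftest_log(dm, remote_log, local_log):
        """Try to pull reftest.log off the device; exit with a distinct code if it is missing."""
        try:
            dm.getFile(remote_log, local_log)
        except Exception:
            # The browser never got far enough to write reftest.log on the sdcard.
            # Bail out with an exit code buildbot could map to something other than
            # an ordinary test failure.
            print("ERROR: unable to pull %s from the device" % remote_log)
            sys.exit(5)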
Blocks: 723667
Comment on attachment 605744 [details] [diff] [review]
try exception for missing remote reftest.log

Looks good. I am thinking we can optimize our tests a bit more: if we don't get a log file within 10 minutes, then fail. That way we are not spending an hour waiting for nothing. Also, a 60-minute maximum timeout doesn't seem useful; since things are split up, we should make this a 30-minute limit. Maybe in another bug.
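A minimal sketch of the early bail-out being suggested here, assuming a fileExists-style check on the device manager and a 30-second polling interval (both are assumptions for illustration, not existing harness code):

    import time

    def wait_for_log(dm, remote_log, max_wait=600, poll_interval=30):
        """Poll the device for reftest.log; give up after max_wait seconds (10 minutes)."""
        deadline = time.time() + max_wait
        while time.time() < deadline:
            if dm.fileExists(remote_log):
                return True
            time.sleep(poll_interval)
        # No log after max_wait: stop here instead of burning the rest of the hour.
        return False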
Attachment #605744 - Flags: review?(jmaher) → review+
Whiteboard: [orange] → [orange][autoland-try: -b do -p android,android-xul -u all -t all]
philor, how often do we see the reftest.log missing problem? do you know?
Whiteboard: [orange][autoland-try: -b do -p android,android-xul -u all -t all] → [orange][autoland-try:-b do -p android,android-xul -u all -t all]
Autoland Failure: There are no patches to run.
Whiteboard: [orange][autoland-try:-b do -p android,android-xul -u all -t all] → [orange]
Attachment #605744 - Attachment is patch: true
Whiteboard: [orange] → [orange][autoland-try:-b do -p android,android-xul -u all -t all]
Whiteboard: [orange][autoland-try:-b do -p android,android-xul -u all -t all] → [orange][autoland-in-queue]
Autoland Patchset:
Patches: 605744
Branch: mozilla-central => try
Destination: http://hg.mozilla.org/try/pushloghtml?changeset=062184b8c098
Try run started, revision 062184b8c098. To cancel or monitor the job, see: https://tbpl.mozilla.org/?tree=Try&rev=062184b8c098
(In reply to Armen Zambrano G. [:armenzg] - Release Engineer from comment #9)
> philor, how often do we see the reftest.log missing problem? do you know?

With code from Sunday, or with code from Monday? That's just the reftest harness flavor of bug 722166: either the browser not starting, or crashing on startup before we notice it crashed. Something landed Monday on inbound which caused us to hit that around 2-5 times per run (plus another few of the Talos runs, which fail with their own messages).
We should turn the job red. I believe these runs show up orange, which is absolutely not good.

> reftest-1: T-FAIL
> program finished with exit code 0

Orange :S ==> https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=beb93f812874&jobname=Android Tegra 250 mozilla-inbound opt test reftest-1
See Also: → 722166
Blocks: 736090
So this shouldn't be red, because it can be:
* the browser hanging
* the browser not starting properly
* losing network connectivity

I suspect network connectivity is a big offender here, which would be a 'red' offense, but the others should be orange.
Even if it is "red" the developer has no log or minidumps to fix it. My patch would turn it "purple" which means infra problem and retry. Perhaps, that is incorrect since we might get in an infinite loop of retried jobs if the crash is a permanent one. jmaher, is there a bug file where we can handle these crashes better and recuperate those logs and minidumps? IIUC that is the root problem and we're focusing on the make up for the corpse.
We rarely get minidumps; I would say 1 in 1000 failures. There is no log file to recover: we usually pull the log file if one exists. For example, if the run stops halfway through, we see this message a lot, but the TBPL log up to that point already contains the contents of reftest.log, so there is no other information. If the run shows no output other than 'FIRE PROC' and 'unable to find file', then there simply is no file. I would like to detect missing logs faster and terminate faster to free up tegra time!
Actually, your patch would turn it the same orange it is now, which is a good thing :)

Automatic retry is blue rather than purple, and it's not nearly that easy to set. But if it were, the question wouldn't be "might we get into an infinite loop" - this is a symptom of a browser that won't start up or crashes on startup, which is something we do all the time in code pushes - it would be "how many days will it be before we get an infinite loop from this, and will it be on a tree philor watches, so he might see it and kill the loop, or will it be on try, where it will just continue forever unless a reconfig or a catastrophic master restart or something kills it?"

You're returning an exit code of 5 in http://mxr.mozilla.org/build/source/buildbotcustom/steps/unittest.py#954, which gets its evaluateCommand from http://mxr.mozilla.org/build/source/buildbotcustom/steps/unittest.py#369. So first your super_class (eventually a ShellCommand by way of ShellCommandReportTimeout in http://mxr.mozilla.org/build/source/buildbotcustom/steps/unittest.py#384) evaluates the status and says "exit code was non-zero, I'll set the status for the step as FAILURE", and then in http://mxr.mozilla.org/build/source/buildbotcustom/steps/unittest.py#287 evaluateReftest turns that FAILURE into WARNINGS and the job will still be orange.
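A simplified illustration of that flow; these functions are stand-ins for buildbot's status handling, not the actual buildbotcustom code:

    # Buildbot result constants (SUCCESS=0, WARNINGS=1, FAILURE=2).
    SUCCESS, WARNINGS, FAILURE = 0, 1, 2

    def shell_command_evaluate(exit_code):
        # ShellCommand's default evaluation: any non-zero exit code is a FAILURE.
        return SUCCESS if exit_code == 0 else FAILURE

    def evaluate_reftest(exit_code):
        status = shell_command_evaluate(exit_code)
        if status == FAILURE:
            # evaluateReftest downgrades the harness FAILURE to WARNINGS,
            # so the job still shows up orange rather than red or purple.
            status = WARNINGS
        return status

    # sys.exit(5) in the harness -> non-zero exit -> FAILURE -> WARNINGS (orange).
    assert evaluate_reftest(5) == WARNINGS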
Try run for 062184b8c098 is complete.
Detailed breakdown of the results available here:
https://tbpl.mozilla.org/?tree=Try&rev=062184b8c098
Results (out of 58 total builds):
    exception: 2
    success: 51
    warnings: 4
    failure: 1
Builds (or logs if builds failed) available at:
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/autolanduser@mozilla.com-062184b8c098
Whiteboard: [orange][autoland-in-queue] → [orange]
No longer blocks: 723667
Blocks: 738716
No longer blocks: 738716
Whiteboard: [orange]
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → WORKSFORME
