Closed Bug 1457452 Opened 7 years ago Closed 6 years ago

Intermittent Error copying results /home/cltbld/workspace/build/local.json to upload dir /home/cltbld/workspace/build/blobber_upload_dir/perfherder-data.json

Categories

(Testing :: Talos, defect, P5)

Version 3
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: intermittent-bug-filer, Unassigned)

References

Details

(Keywords: intermittent-failure, Whiteboard: [retriggered][stockwell fixed:other])

Filed by: csabou [at] mozilla.com

https://treeherder.mozilla.org/logviewer.html#?job_id=175914124&repo=autoland
https://queue.taskcluster.net/v1/task/Z08qwmiOSG27N9rm4EV57Q/runs/0/artifacts/public/logs/live_backing.log

00:11:35 INFO - Test is using firstNonBlankPaint, browser pref will be turned on
00:11:35 INFO - Test is using firstNonBlankPaint, browser pref will be turned on
00:11:35 INFO - Test is using firstNonBlankPaint, browser pref will be turned on
00:11:35 INFO - mozversion application_buildid: 20180427062038
00:11:35 INFO - mozversion application_changeset: 8229360171458d5e2a36fd562e9ba1da6af9b6ba
00:11:35 INFO - mozversion application_display_name: Firefox Nightly
00:11:35 INFO - mozversion application_id: {ec8030f7-c20a-464f-9b0e-13a3a9e97384}
00:11:35 INFO - mozversion application_name: Firefox
00:11:35 INFO - mozversion application_remotingname: firefox
00:11:35 INFO - mozversion application_repository: https://hg.mozilla.org/integration/autoland
00:11:35 INFO - mozversion application_vendor: Mozilla
00:11:35 INFO - mozversion application_version: 61.0a1
00:11:35 INFO - mozversion platform_buildid: 20180427062038
00:11:35 INFO - mozversion platform_changeset: 8229360171458d5e2a36fd562e9ba1da6af9b6ba
00:11:35 INFO - mozversion platform_repository: https://hg.mozilla.org/integration/autoland
00:11:35 INFO - mozversion platform_version: 61.0a1
00:11:35 INFO - using testdate: 1524813095
00:11:35 INFO - actual date: 1524813095
00:11:35 INFO - starting webserver on 'localhost:46359'
00:11:35 INFO - Starting mitmproxy playback using env path: /home/cltbld/workspace/build/application/firefox
00:11:35 INFO - Starting mitmproxy playback using command: /home/cltbld/workspace/build/tests/talos/talos/mitmproxy/mitmdump -k -s /home/cltbld/workspace/build/tests/talos/talos/mitmproxy/alternate-server-replay.py /home/cltbld/workspace/build/tests/talos/talos/mitmproxy/mitmproxy-recording-google.mp /home/cltbld/workspace/build/tests/talos/talos/mitmproxy/mitmproxy-recording-youtube.mp /home/cltbld/workspace/build/tests/talos/talos/mitmproxy/mitmproxy-recording-amazon.mp /home/cltbld/workspace/build/tests/talos/talos/mitmproxy/mitmproxy-recording-facebook.mp
00:11:35 INFO - Error starting proxy server: OSError(98, 'Address already in use')
00:11:45 INFO - Aborting: mitmproxy playback process failed to start, poll returned: 1
00:11:45 INFO - Return code: 0
00:11:45 CRITICAL - PERFHERDER_DATA was seen 0 times, expected 1.
00:11:45 CRITICAL - Error copying results /home/cltbld/workspace/build/local.json to upload dir /home/cltbld/workspace/build/blobber_upload_dir/perfherder-data.json
00:11:45 INFO - Running post-action listener: _package_coverage_data
00:11:45 INFO - Running post-action listener: _resource_record_post_action
00:11:45 INFO - [mozharness: 2018-04-27 07:11:45.648690Z] Finished run-tests step (failed)
00:11:45 FATAL - Uncaught exception: Traceback (most recent call last):
00:11:45 FATAL - File "/home/cltbld/workspace/mozharness/mozharness/base/script.py", line 2076, in run
00:11:45 FATAL - self.run_action(action)
00:11:45 FATAL - File "/home/cltbld/workspace/mozharness/mozharness/base/script.py", line 2015, in run_action
00:11:45 FATAL - self._possibly_run_method(method_name, error_if_missing=True)
00:11:45 FATAL - File "/home/cltbld/workspace/mozharness/mozharness/base/script.py", line 1955, in _possibly_run_method
00:11:45 FATAL - return getattr(self, method_name)()
00:11:45 FATAL - File "/home/cltbld/workspace/mozharness/mozharness/mozilla/testing/talos.py", line 768, in run_tests
00:11:45 FATAL - self._artifact_perf_data(dest)
00:11:45 FATAL - File "/home/cltbld/workspace/mozharness/mozharness/mozilla/testing/talos.py", line 663, in _artifact_perf_data
00:11:45 FATAL - parser.update_worst_log_and_tbpl_levels(CRITICAL, TBPL_FAILURE)
00:11:45 FATAL - NameError: global name 'parser' is not defined
00:11:45 FATAL - Running post_fatal callback...
00:11:45 FATAL - Exiting -1
00:11:45 INFO - Running post-run listener: _resource_record_post_run
00:11:45 INFO - Total resource usage - Wall time: 23s; CPU: 8.0%; Read bytes: 0; Write bytes: 282296320; Read time: 0; Write time: 38552
00:11:45 INFO - TinderboxPrint: CPU usage<br/>7.6%
00:11:45 INFO - TinderboxPrint: I/O read bytes / time<br/>0 / 0
00:11:45 INFO - TinderboxPrint: I/O write bytes / time<br/>282,296,320 / 38,552
00:11:45 INFO - TinderboxPrint: CPU idle<br/>169.8 (92.4%)
00:11:45 INFO - TinderboxPrint: CPU user<br/>13.4 (7.3%)
00:11:45 INFO - TinderboxPrint: Swap in / out<br/>0 / 0
00:11:45 INFO - install - Wall time: 11s; CPU: 13.0%; Read bytes: 0; Write bytes: 275693568; Read time: 0; Write time: 37384
00:11:45 INFO - setup-mitmproxy - Wall time: 1s; CPU: Can't collect data; Read bytes: 0; Write bytes: 0; Read time: 0; Write time: 0
00:11:45 INFO - run-tests - Wall time: 11s; CPU: 1.0%; Read bytes: 0; Write bytes: 6602752; Read time: 0; Write time: 1168
00:11:45 INFO - Running post-run listener: _upload_blobber_files
00:11:45 WARNING - Blob upload gear skipped. Missing cmdline options.
00:11:45 INFO - Running post-run listener: copy_logs_to_upload_dir
00:11:45 INFO - Copying logs to upload dir...
00:11:45 INFO - mkdir: /home/cltbld/workspace/build/upload/logs
cleanup
+ cleanup
+ local rv=255
+ [[ -s /home/cltbld/.xsession-errors ]]
+ cp /home/cltbld/.xsession-errors /home/cltbld/artifacts/public/xsession-errors.log
+ false
+ exit 255
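Note that the FATAL traceback is a second, independent bug: in `_artifact_perf_data` the name `parser` is never bound in scope, so the error-reporting path itself crashes with `NameError` instead of logging the failure cleanly. A minimal sketch of that failure pattern and one way to guard against it (the function and class names here are illustrative stand-ins, not the actual mozharness code):

```python
class LogParser:
    """Hypothetical stand-in for the mozharness output parser."""
    def __init__(self):
        self.worst_level = "INFO"

    def update_worst_log_and_tbpl_levels(self, level):
        self.worst_level = level


def artifact_perf_data_buggy(copy_failed):
    # Bug pattern: 'parser' is not bound anywhere in this scope (or
    # globally), so the failure branch raises NameError at runtime.
    if not copy_failed:
        return "ok"
    parser.update_worst_log_and_tbpl_levels("CRITICAL")  # NameError


def artifact_perf_data_fixed(copy_failed, parser):
    # Fix: pass the parser in explicitly (or reference an attribute
    # guaranteed to exist), so the error path can actually run.
    if not copy_failed:
        return "ok"
    parser.update_worst_log_and_tbpl_levels("CRITICAL")
    return "failed"
```

Python only resolves names when a line executes, which is why this kind of bug hides in rarely-taken error branches until an intermittent failure exercises them.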
Flags: needinfo?(jmaher)
Whiteboard: [retriggered]
Wow, never seen anything like this before. I don't get how that change could break mitmproxy:

04:02:42 INFO - Starting mitmproxy playback using command: /home/cltbld/workspace/build/tests/talos/talos/mitmproxy/mitmdump -k -s /home/cltbld/workspace/build/tests/talos/talos/mitmproxy/alternate-server-replay.py /home/cltbld/workspace/build/tests/talos/talos/mitmproxy/mitmproxy-recording-google.mp /home/cltbld/workspace/build/tests/talos/talos/mitmproxy/mitmproxy-recording-youtube.mp /home/cltbld/workspace/build/tests/talos/talos/mitmproxy/mitmproxy-recording-amazon.mp /home/cltbld/workspace/build/tests/talos/talos/mitmproxy/mitmproxy-recording-facebook.mp
04:02:43 INFO - Error starting proxy server: OSError(98, 'Address already in use')

It almost looks as though mitmproxy was still running from a previous test - like the machine hadn't been cleaned or reset after a previous run - or something else has been introduced that is grabbing the mitmproxy port.
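The `OSError(98, 'Address already in use')` is EADDRINUSE: something already holds the port the proxy wants to listen on. One cheap way to confirm the stale-listener hypothesis before launching the proxy is a pre-flight bind check; this is a sketch for diagnosis, not part of the talos harness:

```python
import errno
import socket

def port_is_free(port, host="127.0.0.1"):
    """Try to bind the port; a bind failure with EADDRINUSE (errno 98
    on Linux) means a stale listener, e.g. a leftover mitmdump, holds it."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        return True
    except OSError as exc:
        if exc.errno == errno.EADDRINUSE:
            return False
        raise
    finally:
        sock.close()
```

If the check reports the port busy before the proxy ever starts, the problem is machine state left over from an earlier job, not the current push.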
Flags: needinfo?(rwood)
This is Linux-specific, and it looks to be related to a merge from inbound: https://hg.mozilla.org/integration/autoland/pushloghtml?changeset=f551d1f94952bd23073a75244fba738e6e3080c9

I did some retriggers on mozilla-inbound in case this helps us figure it out: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&fromchange=47db02b5cdc9&tochange=1472f261db23&filter-searchStr=linux%20h2%20talos&selectedJob=175784623

In response to rwood: possibly something changed in our server creation or cleanup code, or maybe this is infrastructure-related?
oh, these all seem to be t-linux64-ms-239: https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc1/t-linux64-ms-239 seems like a pattern. :markco, can you take a look at this machine? I have asked CIduty to remove it from the working pool.
Flags: needinfo?(jmaher) → needinfo?(mcornmesser)
Depends on: 1458196
Dhouse is more familiar with the linux nodes. Passing the NI to him.
Flags: needinfo?(mcornmesser) → needinfo?(dhouse)
I checked over the system logs and I don't see any failures or warnings around the mitmproxy startups. I have rebooted and un-quarantined the machine to see if the next job succeeds.
Flags: needinfo?(dhouse)
I'd describe the pattern of https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1457452&entireHistory=true&tree=all as less "it's all about 239" and more "something breaks them and then they stay broken until something." Given how 060 stayed broken from the evening of the 27th to midday on the 29th, if the until something is a reboot, uh, do we really only reboot them every two days? And I'm really missing the way buildbot would let us see history for an arbitrary number of jobs, since the something breaks them could quite easily be what it has been in the past for similar things, "John does a try push with his patch for that one bug, and whatever machine does the run of the foo suite is then broken on every run of talos h2 and tp6 it does until it is next rebooted."
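If the "stays broken until reboot" pattern really is a leftover mitmdump surviving across jobs, a CI health check could look for such a survivor before each talos run. A hedged sketch, assuming `pgrep` is available on the Linux test workers (the process name matches the logs; everything else is illustrative):

```python
import subprocess

def find_leftover_pids(pattern="mitmdump"):
    """Return PIDs of processes whose command line matches `pattern`,
    via `pgrep -f`. Returns [] when nothing matches or pgrep is absent."""
    try:
        result = subprocess.run(
            ["pgrep", "-f", pattern],
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            text=True,
        )
    except (FileNotFoundError, OSError):
        return []
    # pgrep prints one PID per line; exit status 1 just means "no match".
    return [int(pid) for pid in result.stdout.split()]
```

A harness could log (or kill) whatever this returns before starting its own proxy, which would both unbreak the machine and record which earlier job leaked the process.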
:coop, is there a similar tool in the new land of taskcluster that allows us to see a larger history of jobs run/status for a given machine? Please see :philor's concern in comment 8
Flags: needinfo?(coop)
(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #9)
> :coop, is there a similar tool in the new land of taskcluster that allows us
> to see a larger history of jobs run/status for a given machine? Please see
> :philor's concern in comment 8

Just the worker type explorer that you mentioned already in comment #5. If you need a deeper history there for hardware workers in particular, we could consider adding more and paginating as we do elsewhere in the tool already. Let's get a separate tools bug on file if you do want this.
Flags: needinfo?(coop)
In the last 7 days, there have been 30 failures on Linux 64 pgo and opt.

Recent failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=186948997&repo=autoland&lineNumber=834

18:50:22 INFO - Starting mitmproxy playback using env path: /home/cltbld/workspace/build/application/firefox
18:50:22 INFO - Starting mitmproxy playback using command: /home/cltbld/workspace/build/tests/talos/talos/mitmproxy/mitmdump -k -s /home/cltbld/workspace/build/tests/talos/talos/mitmproxy/alternate-server-replay.py /home/cltbld/workspace/build/tests/talos/talos/mitmproxy/mitmproxy-recording-google.mp /home/cltbld/workspace/build/tests/talos/talos/mitmproxy/mitmproxy-recording-youtube.mp /home/cltbld/workspace/build/tests/talos/talos/mitmproxy/mitmproxy-recording-amazon.mp /home/cltbld/workspace/build/tests/talos/talos/mitmproxy/mitmproxy-recording-facebook.mp
18:50:23 INFO - Error starting proxy server: OSError(98, 'Address already in use')
18:50:32 INFO - Aborting: mitmproxy playback process failed to start, poll returned: 1
18:50:33 INFO - Return code: 0
18:50:33 CRITICAL - PERFHERDER_DATA was seen 0 times, expected 1.
18:50:33 CRITICAL - Error copying results /home/cltbld/workspace/build/local.json to upload dir /home/cltbld/workspace/build/blobber_upload_dir/perfherder-data.json
18:50:33 CRITICAL - # TBPL FAILURE #
Whiteboard: [retriggered] → [retriggered][stockwell needswork]
In the last 7 days, there have been 63 failures on Linux 64 and Linux64-qr pgo and opt.

Recent failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=190184340&repo=autoland&lineNumber=835

22:59:56 INFO - starting webserver on 'localhost:57678'
22:59:56 INFO - Starting mitmproxy playback using env path: /home/cltbld/workspace/build/application/firefox
22:59:56 INFO - Starting mitmproxy playback using command: /home/cltbld/workspace/build/tests/talos/talos/mitmproxy/mitmdump -k -s /home/cltbld/workspace/build/tests/talos/talos/mitmproxy/alternate-server-replay.py /home/cltbld/workspace/build/tests/talos/talos/mitmproxy/mitmproxy-recording-google.mp /home/cltbld/workspace/build/tests/talos/talos/mitmproxy/mitmproxy-recording-youtube.mp /home/cltbld/workspace/build/tests/talos/talos/mitmproxy/mitmproxy-recording-amazon.mp /home/cltbld/workspace/build/tests/talos/talos/mitmproxy/mitmproxy-recording-facebook.mp
22:59:57 INFO - Error starting proxy server: OSError(98, 'Address already in use')
23:00:06 INFO - Aborting: mitmproxy playback process failed to start, poll returned: 1
23:00:07 INFO - Return code: 0
23:00:07 CRITICAL - PERFHERDER_DATA was seen 0 times, expected 1.
23:00:07 CRITICAL - Error copying results /home/cltbld/workspace/build/local.json to upload dir /home/cltbld/workspace/build/blobber_upload_dir/perfherder-data.json
23:00:07 CRITICAL - # TBPL FAILURE #
23:00:07 WARNING - setting return code to 2

rwood, can you please take a look?
Flags: needinfo?(rwood)
There are 70 failures in the last 7 days. They occur on linux64 (opt) and linux64-qr (opt and pgo). Recent failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=193589337&repo=autoland&lineNumber=835
Whiteboard: [retriggered][stockwell unknown] → [retriggered][stockwell unknown][stockwell needswork]
Update: there have been 172 failures in the last 7 days on linux64 and linux64-qr.
Raptor is still running tp6 as tier-2 on mozilla-central. We were delayed the last 3 weeks working on Android, so I think we are looking at the end of September before we make Raptor tier-1; possibly we could do this sooner, make tp6 tier-1 for Firefox only, and then turn off tp6 in Talos. It is worth considering.
Over the last 7 days there are 66 failures on this bug. These happen on linux64 and linux64-qr.

This is the most recent log example: https://treeherder.mozilla.org/logviewer.html#?job_id=196589055&repo=autoland&lineNumber=837

23:19:03 CRITICAL - PERFHERDER_DATA was seen 0 times, expected 1.
23:19:03 CRITICAL - Error copying results /home/cltbld/workspace/build/local.json to upload dir /home/cltbld/workspace/build/blobber_upload_dir/perfherder-data.json
23:19:03 CRITICAL - # TBPL FAILURE #
There have been 36 failures in the last week on linux64 and linux64-qr platforms. Affected builds: opt & pgo

Summary: Intermittent Error copying results /home/cltbld/workspace/build/local.json to upload dir /home/cltbld/workspace/build/blobber_upload_dir/perfherder-data.json

Recent relevant log file: https://treeherder.mozilla.org/logviewer.html#?job_id=198502576&repo=mozilla-inbound&lineNumber=831
This bug failed 60 times in the last 7 days. Failures occur on linux64 and linux64-qr on pgo and opt build types.

Recent log: https://treeherder.mozilla.org/logviewer.html#?job_id=200328196&repo=autoland&lineNumber=833

INFO - Calling ['/home/cltbld/workspace/build/venv/bin/python', '/home/cltbld/workspace/build/tests/talos/talos/run_tests.py', '--title', 't-linux64-ms-278', '--symbolsPath', 'https://queue.taskcluster.net/v1/task/UIZK47OCTJmnk14GEVkdFw/artifacts/public/build/target.crashreporter-symbols.zip', '--mitmproxy', u'mitmproxy-recording-google.mp mitmproxy-recording-youtube.mp mitmproxy-recording-amazon.mp mitmproxy-recording-facebook.mp', '--suite', 'tp6-e10s', '--executablePath', '/home/cltbld/workspace/build/application/firefox/firefox', '--mitmdumpPath', '/home/cltbld/workspace/build/tests/talos/talos/mitmproxy/mitmdump', '--log-tbpl-level=debug', '--log-errorsummary=/home/cltbld/workspace/build/blobber_upload_dir/tp6-e10s_errorsummary.log', '--log-raw=/home/cltbld/workspace/build/blobber_upload_dir/tp6-e10s_raw.log'] with output_timeout 3600
18:04:22 INFO - Test is using firstNonBlankPaint, browser pref will be turned on
18:04:22 INFO - Test is using firstNonBlankPaint, browser pref will be turned on
18:04:22 INFO - Test is using firstNonBlankPaint, browser pref will be turned on
18:04:22 INFO - mozversion application_buildid: 20180919232655
18:04:22 INFO - mozversion application_changeset: 8e1af99ab1adb64ba631d81f67942dc9b6c48df5
18:04:22 INFO - mozversion application_display_name: Firefox Nightly
18:04:22 INFO - mozversion application_id: {ec8030f7-c20a-464f-9b0e-13a3a9e97384}
18:04:22 INFO - mozversion application_name: Firefox
18:04:22 INFO - mozversion application_remotingname: firefox
18:04:22 INFO - mozversion application_repository: https://hg.mozilla.org/integration/autoland
18:04:22 INFO - mozversion application_vendor: Mozilla
18:04:22 INFO - mozversion application_version: 64.0a1
18:04:22 INFO - mozversion platform_buildid: 20180919232655
18:04:22 INFO - mozversion platform_changeset: 8e1af99ab1adb64ba631d81f67942dc9b6c48df5
18:04:22 INFO - mozversion platform_repository: https://hg.mozilla.org/integration/autoland
18:04:22 INFO - mozversion platform_version: 64.0a1
18:04:22 INFO - using testdate: 1537405462
18:04:22 INFO - actual date: 1537405462
18:04:22 INFO - starting webserver on 'localhost:34400'
18:04:22 INFO - Starting mitmproxy playback using env path: /home/cltbld/workspace/build/application/firefox
18:04:22 INFO - Starting mitmproxy playback using command: /home/cltbld/workspace/build/tests/talos/talos/mitmproxy/mitmdump -k -s /home/cltbld/workspace/build/tests/talos/talos/mitmproxy/alternate-server-replay.py /home/cltbld/workspace/build/tests/talos/talos/mitmproxy/mitmproxy-recording-google.mp /home/cltbld/workspace/build/tests/talos/talos/mitmproxy/mitmproxy-recording-youtube.mp /home/cltbld/workspace/build/tests/talos/talos/mitmproxy/mitmproxy-recording-amazon.mp /home/cltbld/workspace/build/tests/talos/talos/mitmproxy/mitmproxy-recording-facebook.mp
18:04:22 INFO - Error starting proxy server: OSError(98, 'Address already in use')
18:04:32 INFO - Aborting: mitmproxy playback process failed to start, poll returned: 1
18:04:32 INFO - Return code: 0
18:04:32 CRITICAL - PERFHERDER_DATA was seen 0 times, expected 1.
18:04:32 CRITICAL - Error copying results /home/cltbld/workspace/build/local.json to upload dir /home/cltbld/workspace/build/blobber_upload_dir/perfherder-data.json
18:04:32 CRITICAL - # TBPL FAILURE #
18:04:32 WARNING - setting return code to 2
18:04:32 INFO - Running post-action listener: _package_coverage_data
18:04:32 INFO - Running post-action listener: _resource_record_post_action
18:04:32 INFO - Running post-action listener: process_java_coverage_data
18:04:32 INFO - [mozharness: 2018-09-20 01:04:32.645815Z] Finished run-tests step (success)
18:04:32 INFO - Running post-run listener: _resource_record_post_run
18:04:32 INFO - Total resource usage - Wall time: 24s; CPU: 8.0%; Read bytes: 0; Write bytes: 3694592; Read time: 0; Write time: 24
18:04:32 INFO - TinderboxPrint: CPU usage<br/>7.9%
18:04:32 INFO - TinderboxPrint: I/O read bytes / time<br/>0 / 0
18:04:32 INFO - TinderboxPrint: I/O write bytes / time<br/>3,694,592 / 24
18:04:32 INFO - TinderboxPrint: CPU idle<br/>176.6 (92.1%)
18:04:32 INFO - TinderboxPrint: CPU user<br/>14.7 (7.7%)
18:04:32 INFO - TinderboxPrint: Swap in / out<br/>0 / 0
18:04:32 INFO - install - Wall time: 13s; CPU: 13.0%; Read bytes: 0; Write bytes: 1687552; Read time: 0; Write time: 16
18:04:32 INFO - setup-mitmproxy - Wall time: 1s; CPU: 12.0%; Read bytes: 0; Write bytes: 0; Read time: 0; Write time: 0
18:04:32 INFO - run-tests - Wall time: 11s; CPU: 0.0%; Read bytes: 0; Write bytes: 2007040; Read time: 0; Write time: 8
18:04:32 INFO - Running post-run listener: copy_logs_to_upload_dir
18:04:32 INFO - Copying logs to upload dir...
18:04:32 INFO - mkdir: /home/cltbld/workspace/build/upload/logs
18:04:32 INFO - Copying logs to upload dir...
18:04:32 WARNING - returning nonzero exit status 2

rwood: Can you please take a look at this bug?
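The headline check in these logs, "PERFHERDER_DATA was seen 0 times, expected 1", is a simple occurrence count over the harness output: when the proxy never starts, no perf data is emitted, and everything downstream (the copy to perfherder-data.json) fails as a consequence. A sketch of that style of validation (the marker string comes from the logs above; the function and return shape are illustrative, not the mozharness implementation):

```python
def check_perfherder_marker(log_lines, expected=1, marker="PERFHERDER_DATA"):
    """Count lines carrying the perf-data marker and report a log level,
    mirroring how the talos logs escalate to CRITICAL on a wrong count."""
    seen = sum(1 for line in log_lines if marker in line)
    if seen != expected:
        return ("CRITICAL",
                f"{marker} was seen {seen} times, expected {expected}.")
    return ("INFO", "perf data present")
```

Framed this way, the "Error copying results" line is a symptom, not a cause: fixing the port clash at proxy startup would make both CRITICAL lines disappear together.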
Whiteboard: [retriggered][stockwell disable-recommended] → [retriggered][stockwell disable-recommended][stockwell needswork]]
There are 10 failures in the last 2 weeks. jmaher: what would be the right whiteboard tag for this (comment 38)? Clearing the two-month-old ni for rwood.
Flags: needinfo?(rwood) → needinfo?(jmaher)
ok, marking this as fixed:other
Status: NEW → RESOLVED
Closed: 6 years ago
Flags: needinfo?(jmaher)
Resolution: --- → FIXED
Whiteboard: [retriggered][stockwell disable-recommended][stockwell needswork]] → [retriggered][stockwell fixed:other]