Windows 10 xperf tasks are sometimes reporting all zeros for results (but still passes)
Categories
(Testing :: Talos, defect, P1)
Tracking
(Not tracked)
People
(Reporter: standard8, Assigned: alexandrui)
References
Details
Attachments
(3 files)
I noticed in running xperf tests on the Talos servers, that sometimes a run will report every result as zero, but it still passes
This is throwing off the average/standard deviation calculations when doing the comparison, and this can lead to some quite big deltas (e.g. 20%).
It seems only to happen on Windows 10 Quantum Renderer builds, with shippable and non-shippable.
On some try runs I had one or two "all-zeros" per 5 runs.
From the log it looks like we know there's something wrong, but it doesn't get reported as an actual error.
Really there's probably two issues here:
- The fact the error isn't reported as a failure.
- The error that caused this in the first place.
Example log:
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=288241114&repo=try&lineNumber=1820
[task 2020-02-10T18:29:39.624Z] 18:29:39 INFO - Merged Etl: test.etl
[task 2020-02-10T18:31:21.234Z] 18:31:21 INFO - reading etl filename: test.etl
[task 2020-02-10T18:31:21.234Z] 18:31:21 INFO - etlparser: in readfile: test.etl.csv
[task 2020-02-10T18:31:21.234Z] 18:31:21 ERROR - Traceback (most recent call last):
[task 2020-02-10T18:31:21.234Z] 18:31:21 INFO - File "Z:\task_1581358975\build\tests\talos\talos/xtalos/parse_xperf.py", line 115, in <module>
[task 2020-02-10T18:31:21.234Z] 18:31:21 INFO - main()
[task 2020-02-10T18:31:21.234Z] 18:31:21 INFO - File "Z:\task_1581358975\build\tests\talos\talos/xtalos/parse_xperf.py", line 109, in main
[task 2020-02-10T18:31:21.234Z] 18:31:21 INFO - **args.__dict__)
[task 2020-02-10T18:31:21.235Z] 18:31:21 INFO - File "Z:\task_1581358975\build\tests\talos\talos/xtalos/parse_xperf.py", line 90, in stop_from_config
[task 2020-02-10T18:31:21.235Z] 18:31:21 INFO - processID=kwargs['processID'])
[task 2020-02-10T18:31:21.235Z] 18:31:21 INFO - File "Z:\task_1581358975\build\tests\talos\talos\xtalos\etlparser.py", line 573, in etlparser_from_config
[task 2020-02-10T18:31:21.235Z] 18:31:21 INFO - etlparser(**args)
[task 2020-02-10T18:31:21.235Z] 18:31:21 INFO - File "Z:\task_1581358975\build\tests\talos\talos\xtalos\etlparser.py", line 355, in etlparser
[task 2020-02-10T18:31:21.235Z] 18:31:21 INFO - trackThreadNetIO(row, io, stage)
Reporter | ||
Updated•5 years ago
|
Updated•5 years ago
|
Reporter | ||
Comment 1•5 years ago
|
||
This isn't just Quantum Renderer builds, this also happens on non-QR builds as well, e.g. https://treeherder.mozilla.org/#/jobs?repo=try&revision=648731c016dec86f16285e4207b22b57c4a2903d&selectedJob=289549501
Comment 2•5 years ago
|
||
:bebe, can someone from your team see if this error has an easy solution? If it doesn't, can you provide details for how/why it isn't simple to fix?
Updated•5 years ago
|
Updated•5 years ago
|
Updated•5 years ago
|
Assignee | ||
Comment 3•5 years ago
•
|
||
The trackThreadNetIO(row, io, stage)
error doesn't seem to be consistent.
The pattern seems to look like this:
Windows 7 Shippable opt
: zeroes on
tp5n main_normal_netio opt e10s stylo
INTERMITTENT,tp5n main_startup_netio opt e10s stylo
Windows 10 x64 Shippable opt
: zeroes on
tp5n main_normal_netio opt e10s stylo
INTERMITTENT,tp5n main_startup_netio opt e10s stylo
,tp5n mainthread_writebytes opt e10s stylo
,tp5n mainthread_writecount opt e10s stylo
Windows 10 x64 QuantumRender Shippable opt
: zeroes on
tp5n main_normal_netio opt e10s stylo
INTERMITTENT,tp5n main_startup_netio opt e10s stylo
,tp5n mainthread_writebytes opt e10s stylo
,tp5n mainthread_writecount opt e10s stylo
Ongoing investigation. Have to look further on all subtests to determine any more intermittens.
Assignee | ||
Comment 4•5 years ago
•
|
||
Mark, I didn't find any more xperf jobs passing with the error. Also I can access the log in your description, looks like it's been removed from storage. Could you point me to a log that is actually available?
Also, seems like those talos tests were designed to report zeroes. The graph contains zeroes since 1 year ago (on autoland) for almost all xperf tests on all winfows platforms.
Aaron, I know you're the owner of those tests. Any suggestion?
Reporter | ||
Comment 5•5 years ago
|
||
The netio
variants seem to always produce zero (or nearly always), so you can ignore those.
You need to look at the main_normal_fileio
and main_startup_fileio
values:
These are the direct links to the latest two failures:
https://treeherder.mozilla.org/#/jobs?repo=autoland&selectedJob=292924730&group_state=expanded&revision=cb1605f13d5f56252b6dc89e2934f4c4aedac298
https://treeherder.mozilla.org/#/jobs?repo=autoland&selectedJob=292721403&group_state=expanded&revision=d28ceb16f6839c79662e1eb355584d87a9810d60
Assignee | ||
Updated•5 years ago
|
Assignee | ||
Comment 6•5 years ago
|
||
Waiting for :aklotz reply in order to proceed.
Comment 7•5 years ago
|
||
Comment 8•5 years ago
|
||
Two issues here:
- Obviously silently warning about those failures is bad and that should be fixed.
- Re: the regexp failure. I have attached the event names that came from my machine running Windows 10 Fall 2019 Update. Note that sometimes xperf event names and behaviours change between Windows versions. Is there any sort of correlation between the xperf failures and the version of the OS on which the test is running?
Assignee | ||
Comment 10•5 years ago
•
|
||
I wanted to know if these silent failures are intended or not. I am going to investigate further to find out how can we fix this and then to answer the question about correlation between the xperf failures and the version of OS.
Updated•5 years ago
|
Assignee | ||
Comment 11•5 years ago
|
||
(In reply to Aaron Klotz [:aklotz] from comment #8)
Two issues here:
- Obviously silently warning about those failures is bad and that should be fixed.
Aaron, should I understand that the zero-values reported are fine and just the failures should be fixed?
Comment 12•5 years ago
|
||
The zero values are not fine, but if you fix the failures they should go away.
Assignee | ||
Comment 13•5 years ago
|
||
(In reply to Aaron Klotz [:aklotz] from comment #8)
- Re: the regexp failure. I have attached the event names that came from my machine running Windows 10 Fall 2019 Update. Note that sometimes xperf event names and behaviours change between Windows versions. Is there any sort of correlation between the xperf failures and the version of the OS on which the test is running?
I don't know. What can I do is to add a print to the event that's not matching the pattern and follow the logs when the intermittent appears again.
Assignee | ||
Comment 14•5 years ago
•
|
||
Aaron, were does etl_output.csv come from? The filename is hard-coded and seems to be no track of it anywhere else. It contains the zeroes for writecount
and writebytes
. Could you please give me more insights about how is this file created?
Comment 15•5 years ago
|
||
Assignee | ||
Comment 16•5 years ago
•
|
||
Aaron, looks like the conditions in the lambda function cover only the events with no DiskWriteCount
and DiskWriteBytes
:
[task 2020-04-08T07:30:15.003Z] 07:30:15 INFO - ('C:\\Users\\task_1586329312\\AppData\\Local\\Temp\\tmpmpvoz_\\profile\\favicons.sqlite', '6732 (main)', 'startup') {'DiskReadBytes': 196808, 'DiskWriteCount': 0, 'DiskWriteBytes': 0, 'DiskReadCount': 8}
[task 2020-04-08T07:30:15.003Z] 07:30:15 INFO - ('Z:\\task_1586329312\\build\\application\\firefox\\defaults\\pref\\channel-prefs.js', '6732 (main)', 'startup') {'DiskReadBytes': 866, 'DiskWriteCount': 0, 'DiskWriteBytes': 0, 'DiskReadCount': 2}
[task 2020-04-08T07:30:15.003Z] 07:30:15 INFO - ('C:\\Users\\task_1586329312\\AppData\\Local\\Temp\\tmpmpvoz_\\profile\\permissions.sqlite', '6732 (main)', 'startup') {'DiskReadBytes': 4392, 'DiskWriteCount': 0, 'DiskWriteBytes': 0, 'DiskReadCount': 12}
[task 2020-04-08T07:30:15.003Z] 07:30:15 INFO - ('Z:\\task_1586329312\\AppData\\Roaming\\Mozilla\\firefox\\Crash Reports\\InstallTime20200408054848', '6732 (main)', 'startup') {'DiskReadBytes': 20, 'DiskWriteCount': 0, 'DiskWriteBytes': 0, 'DiskReadCount': 2}
[task 2020-04-08T07:30:15.003Z] 07:30:15 INFO - ('C:\\Users\\task_1586329312\\AppData\\Local\\Temp\\tmpmpvoz_\\profile\\key4.db', '6732 (main)', 'startup') {'DiskReadBytes': 196936, 'DiskWriteCount': 0, 'DiskWriteBytes': 0, 'DiskReadCount': 16}
[task 2020-04-08T07:30:15.003Z] 07:30:15 INFO - ('C:\\Users\\task_1586329312\\AppData\\Local\\Temp\\tmpmpvoz_\\profile\\favicons.sqlite-shm', '6732 (main)', 'startup') {'DiskReadBytes': 65536, 'DiskWriteCount': 0, 'DiskWriteBytes': 0, 'DiskReadCount': 2}
[task 2020-04-08T07:30:15.003Z] 07:30:15 INFO - ('Z:\\task_1586329312\\build\\application\\firefox\\dependentlibs.list', '6732 (main)', 'startup') {'DiskReadBytes': 16384, 'DiskWriteCount': 0, 'DiskWriteBytes': 0, 'DiskReadCount': 4}
[task 2020-04-08T07:30:15.003Z] 07:30:15 INFO - ('C:\\Users\\task_1586329312\\AppData\\Local\\Temp\\tmpmpvoz_\\profile\\prefs.js', '6732 (main)', 'startup') {'DiskReadBytes': 28970, 'DiskWriteCount': 0, 'DiskWriteBytes': 0, 'DiskReadCount': 2}
[task 2020-04-08T07:30:15.003Z] 07:30:15 INFO - ('C:\\Users\\task_1586329312\\AppData\\Local\\Temp\\tmpmpvoz_\\profile\\cookies.sqlite', '6732 (main)', 'startup') {'DiskReadBytes': 131272, 'DiskWriteCount': 0, 'DiskWriteBytes': 0, 'DiskReadCount': 6}
[task 2020-04-08T07:30:15.004Z] 07:30:15 INFO - ('C:\\Windows\\System32\\InputHost.dll', '6732 (main)', 'startup') {'DiskReadBytes': 32768, 'DiskWriteCount': 0, 'DiskWriteBytes': 0, 'DiskReadCount': 2}
[task 2020-04-08T07:30:15.004Z] 07:30:15 INFO - ('C:\\Users\\task_1586329312\\AppData\\Local\\Temp\\tmpmpvoz_\\profile\\user.js', '6732 (main)', 'startup') {'DiskReadBytes': 18288, 'DiskWriteCount': 0, 'DiskWriteBytes': 0, 'DiskReadCount': 2}
[task 2020-04-08T07:30:15.004Z] 07:30:15 INFO - ('C:\\Windows\\System32\\spool\\drivers\\color\\sRGB Color Space Profile.icm', '6732 (main)', 'startup') {'DiskReadBytes': 8192, 'DiskWriteCount': 0, 'DiskWriteBytes': 0, 'DiskReadCount': 2}
[task 2020-04-08T07:30:15.004Z] 07:30:15 INFO - ('C:\\Users\\task_1586329312\\AppData\\Local\\Temp\\tmpmpvoz_\\profile\\cert9.db', '6732 (main)', 'startup') {'DiskReadBytes': 131336, 'DiskWriteCount': 0, 'DiskWriteBytes': 0, 'DiskReadCount': 10}
[task 2020-04-08T07:30:15.004Z] 07:30:15 INFO - ('C:\\Windows\\System32\\setupapi.dll', '6732 (main)', 'startup') {'DiskReadBytes': 88064, 'DiskWriteCount': 0, 'DiskWriteBytes': 0, 'DiskReadCount': 8}
[task 2020-04-08T07:30:15.004Z] 07:30:15 INFO - ('C:\\Users\\task_1586329312\\AppData\\Local\\Temp\\tmpmpvoz_\\profile\\xulstore\\data.mdb', '6732 (main)', 'startup') {'DiskReadBytes': 1216, 'DiskWriteCount': 0, 'DiskWriteBytes': 0, 'DiskReadCount': 8}
[task 2020-04-08T07:30:15.004Z] 07:30:15 INFO - ('C:\\Users\\task_1586329312\\AppData\\Local\\Temp\\tmpmpvoz_\\profile\\places.sqlite-shm', '6732 (main)', 'startup') {'DiskReadBytes': 65536, 'DiskWriteCount': 0, 'DiskWriteBytes': 0, 'DiskReadCount': 2}
[task 2020-04-08T07:30:16.992Z] 07:30:16 INFO - ('C:\\Users\\task_1586329312\\AppData\\Local\\Temp\\tmpmpvoz_\\profile\\places.sqlite', '6732 (main)', 'startup') {'DiskReadBytes': 262344, 'DiskWriteCount': 0, 'DiskWriteBytes': 0, 'DiskReadCount': 10}
[task 2020-04-08T07:30:16.992Z] 07:30:16 INFO - ('C:\\Users\\task_1586329312\\AppData\\Local\\Temp\\tmpmpvoz_\\profile\\pkcs11.txt', '6732 (main)', 'startup') {'DiskReadBytes': 16384, 'DiskWriteCount': 0, 'DiskWriteBytes': 0, 'DiskReadCount': 4}
[task 2020-04-08T07:30:16.993Z] 07:30:16 INFO - ('C:\\Users\\task_1586329312\\AppData\\Local\\Temp\\tmpmpvoz_\\profile\\cookies.sqlite-shm', '6732 (main)', 'startup') {'DiskReadBytes': 65536, 'DiskWriteCount': 0, 'DiskWriteBytes': 0, 'DiskReadCount': 2}
[task 2020-04-08T07:30:16.993Z] 07:30:16 INFO - ('C:\\Users\\task_1586329312\\AppData\\Local\\Temp\\tmpmpvoz_\\profile\\compatibility.ini', '6732 (main)', 'startup') {'DiskReadBytes': 446, 'DiskWriteCount': 0, 'DiskWriteBytes': 0, 'DiskReadCount': 2}
We probably need to revisit that logic in order to get rid of the zero-values.
There's this lambda function that filters those values and looks like this:
filekeys = filter(lambda x: (all_stages or x[2] == stages[0]) and
(all_threads or x[1].endswith("(main)")) and
(all_stages and x[2] != stages[0] or
not checkWhitelist(x[0], whitelist)),
files.iterkeys())
If we take favicons.sqlite
and look at the unfiltered values from etl_output.csv
it looks like this:
('C:\\Users\\task_1586182132\\AppData\\Local\\Temp\\tmpajjk25\\profile\\favicons.sqlite', '9912', 'all'):
{'DiskReadBytes': 458952, 'DiskWriteCount': 94, 'DiskWriteBytes': 4915200, 'DiskReadCount': 16},
('C:\\Users\\task_1586182132\\AppData\\Local\\Temp\\tmpajjk25\\profile\\favicons.sqlite', '11180 (main)', 'all'):
{'DiskReadBytes': 196808, 'DiskWriteCount': 0, 'DiskWriteBytes': 0, 'DiskReadCount': 8},
('C:\\Users\\task_1586182132\\AppData\\Local\\Temp\\tmpajjk25\\profile\\favicons.sqlite', '9912', 'normal'):
{'DiskReadBytes': 458952, 'DiskWriteCount': 94, 'DiskWriteBytes': 4915200, 'DiskReadCount': 16},
('C:\\Users\\task_1586182132\\AppData\\Local\\Temp\\tmpajjk25\\profile\\favicons.sqlite', '11180 (main)', 'startup'):
{'DiskReadBytes': 196808, 'DiskWriteCount': 0, 'DiskWriteBytes': 0, 'DiskReadCount': 8},
('C:\\Users\\task_1586182132\\AppData\\Local\\Temp\\tmpajjk25\\profile\\favicons.sqlite', '9912', 'startup'):
{'DiskReadBytes': 262344, 'DiskWriteCount': 0, 'DiskWriteBytes': 0, 'DiskReadCount': 10},
DiskWriteCount
and DiskWriteBytes
are greater than 0 for all
and normal
stages. Looking at the code for debug mode:
if debug:
# in debug, we want stages = [startup+normal] and all threads, not just (main)
# we will use this data to upload fileIO info to blobber only for debug mode
outputData = filter(lambda x: (all_stages or x[2] in [stages[0], stages[1]]) and
(all_stages and x[2] not in [stages[0], stages[1]] or
not checkWhitelist(x[0], whitelist)),
files.iterkeys())
we can see that it includes stages[1]
in x[2]
, which is exactly the normal
stage.
I don't have thorough enough knowledge about this code to tell if this is the right fix, but is the closest fix I found so far (inclusing stages[1]
in the expression against x[2]
).
Thoughts?
Comment 17•5 years ago
|
||
Since this code landed 7 years ago, so I am having trouble believing that the emergence of this issue is with the filtering. I am happy to be convinced otherwise.
I'm confused, though: based on the failures in comment 5, we're seeing AttributeError
s because of a regex that isn't matching; the script is bailing out, hence we aren't producing good data. My expectation was that once we fixed that, these zeros would go away. Is this data from comment 16 coming from code that includes the regex fix?
Also, keep in mind what this test is actually seeking: I/O reads and writes on the main thread are bad, especially when they occur during startup. We segment the timeline into the startup
, normal
, shutdown
stages to assist us with determining how "bad" the main thread I/O is. Changes in non-main thread I/O will happen all the time, and we don't really care about that unless there is some kind of weird, unreasonable spike. Keeping this in mind, in an ideal world where we spectacularly pass the test with flying colours, we would want zeros for all stages on the main thread. In practice that is not 100% possible, hence we have the whitelist. And of course, 0 I/O across all threads and all stages is clearly an error, since a web browser that doesn't do any I/O is not much of a web browser at all.
Assignee | ||
Comment 18•5 years ago
•
|
||
Aaron, the failures in comment 5 are indeed due to e regex but are very rare and what I did is to add a try catch to be able to see the string that is failing. However, the zero-values aren't caused by that as they are permanent (they happen also when the matching error doesn't). So this issue has 2 culprits: silent failure because of the re.match and the zero-values whose occurrence I explained in comment 15. With the mention that the zero-values happen since ever while the match failure is relatively recent.
As we can't really see what event name is causing the silent fail, my proposal is to land a patch adding a try/except
to the re.match
as the intermittent occurrence is fairly low.
I wouldn't wait one more week, as this patch is fairly easy to review, and it could buy us time to advance with the investigations about zero-values.
Thanks for your answer, Aaron.
Assignee | ||
Comment 19•5 years ago
|
||
Updated•5 years ago
|
Updated•5 years ago
|
Assignee | ||
Updated•5 years ago
|
Assignee | ||
Comment 20•5 years ago
|
||
Note to self to remove the leave-open
keyword after this is fixed.
Comment 21•5 years ago
|
||
Comment 22•5 years ago
|
||
bugherder |
Assignee | ||
Comment 23•5 years ago
|
||
(In reply to Mark Banner (:standard8) from comment #5)
The
netio
variants seem to always produce zero (or nearly always), so you can ignore those.You need to look at the
main_normal_fileio
andmain_startup_fileio
values:These are the direct links to the latest two failures:
https://treeherder.mozilla.org/#/jobs?repo=autoland&selectedJob=292924730&group_state=expanded&revision=cb1605f13d5f56252b6dc89e2934f4c4aedac298
https://treeherder.mozilla.org/#/jobs?repo=autoland&selectedJob=292721403&group_state=expanded&revision=d28ceb16f6839c79662e1eb355584d87a9810d60
Mark, from what you're saying, this bug was open because of the zero-values reported by the silent re.match
error and we don't care about others. Is that right?
Reporter | ||
Comment 24•5 years ago
|
||
(In reply to (PTO until April 20th) Alexandru Ionescu (needinfo me) :alexandrui from comment #23)
Mark, from what you're saying, this bug was open because of the zero-values reported by the silent
re.match
error and we don't care about others. Is that right?
The ones we care in this bug are the ones where we have all-zeros reported across all tests. I don't know if anything should be done about the sub-tests that report zeros all the time - probably nothing based on the previous discussions.
Assignee | ||
Comment 25•5 years ago
|
||
Meanwhile I landed a patch meant to log more details about the regex matching error and found this job reporting that the event name causing the error looks like this:
Error matching event Microsoft-Windows-TCPIP//win:Info: 'NoneType' object has no attribute 'group'
The event name with prefix and suffix is Microsoft-Windows-TCPIP//win:Info
and the event name alone seems to be an empty string .
I am not sure what I should understand from here. Aaron, could you please help on this?
Comment 26•5 years ago
|
||
Microsoft-Windows-TCPIP//win:Info
is different from the other event names that I posted in the attachment.
The regex in trackThreadNetIO
tries to capture the content between the slashes, but in this case the content between the slashes is empty.
Assignee | ||
Comment 27•5 years ago
•
|
||
(In reply to Aaron Klotz [:aklotz] from comment #26)
Microsoft-Windows-TCPIP//win:Info
is different from the other event names that I posted in the attachment.The regex in
trackThreadNetIO
tries to capture the content between the slashes, but in this case the content between the slashes is empty.
I mean, in the context of why is this failing the regular expression matching, is this supposed to happen? Is this normal? Should the code cover this empty match continue the execution or fail it as not normal?
(In reply to Aaron Klotz [:aklotz] from comment #17)
Also, keep in mind what this test is actually seeking: I/O reads and writes on the main thread are bad, especially when they occur during startup.
Considering this, the only concerning zeroes are indeed the ones provoked by the matching error during startup for windows10-64-shippable, windows10-64 and windows10-64-qr. So once we get to the same page with the questions above, we can identify a solution.
Assignee | ||
Comment 28•5 years ago
|
||
Comment 29•5 years ago
|
||
(In reply to Alexandru Ionescu (needinfo me) :alexandrui from comment #27)
I mean, in the context of why is this failing the regular expression matching, is this supposed to happen? Is this normal? Should the code cover this empty match continue the execution or fail it as not normal?
We already filter events based on the text between the slashes, so an empty string would be treated identically to any other string that we don't care about. So yes, we should just produce an empty string and continue execution.
Assignee | ||
Comment 30•5 years ago
|
||
Aaron, there's a piece of code here which is continuing the execution if the while-list file is not found, just like it was for the events not fitting in the regular expression. It is prone to false results and can misslead. I would raise an exception and fail the execution of the code instead of returning.
Comment 31•5 years ago
|
||
Comment 33•5 years ago
|
||
bugherder |
Assignee | ||
Updated•5 years ago
|
Updated•5 years ago
|
Description
•