Closed Bug 1614805 Opened 1 year ago Closed 1 year ago

Windows 10 xperf tasks are sometimes reporting all zeros for results (but still passes)

Categories

(Testing :: Talos, defect, P1)

Version 3
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: standard8, Assigned: alexandrui)

References

Details

Attachments

(3 files)

I noticed in running xperf tests on the Talos servers, that sometimes a run will report every result as zero, but it still passes

This is throwing off the average/standard deviation calculations when doing the comparison, and this can lead to some quite big deltas (e.g. 20%).

It seems only to happen on Windows 10 Quantum Renderer builds, with shippable and non-shippable.

On some try runs I had one or two "all-zeros" per 5 runs.

From the log it looks like we know there's something wrong, but it doesn't get reported as an actual error.

Really there's probably two issues here:

  1. The fact the error isn't reported as a failure.
  2. The error that caused this in the first place.

Example log:
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=288241114&repo=try&lineNumber=1820

[task 2020-02-10T18:29:39.624Z] 18:29:39     INFO -  Merged Etl: test.etl
[task 2020-02-10T18:31:21.234Z] 18:31:21     INFO -  reading etl filename: test.etl
[task 2020-02-10T18:31:21.234Z] 18:31:21     INFO -  etlparser: in readfile: test.etl.csv
[task 2020-02-10T18:31:21.234Z] 18:31:21    ERROR -  Traceback (most recent call last):
[task 2020-02-10T18:31:21.234Z] 18:31:21     INFO -    File "Z:\task_1581358975\build\tests\talos\talos/xtalos/parse_xperf.py", line 115, in <module>
[task 2020-02-10T18:31:21.234Z] 18:31:21     INFO -      main()
[task 2020-02-10T18:31:21.234Z] 18:31:21     INFO -    File "Z:\task_1581358975\build\tests\talos\talos/xtalos/parse_xperf.py", line 109, in main
[task 2020-02-10T18:31:21.234Z] 18:31:21     INFO -      **args.__dict__)
[task 2020-02-10T18:31:21.235Z] 18:31:21     INFO -    File "Z:\task_1581358975\build\tests\talos\talos/xtalos/parse_xperf.py", line 90, in stop_from_config
[task 2020-02-10T18:31:21.235Z] 18:31:21     INFO -      processID=kwargs['processID'])
[task 2020-02-10T18:31:21.235Z] 18:31:21     INFO -    File "Z:\task_1581358975\build\tests\talos\talos\xtalos\etlparser.py", line 573, in etlparser_from_config
[task 2020-02-10T18:31:21.235Z] 18:31:21     INFO -      etlparser(**args)
[task 2020-02-10T18:31:21.235Z] 18:31:21     INFO -    File "Z:\task_1581358975\build\tests\talos\talos\xtalos\etlparser.py", line 355, in etlparser
[task 2020-02-10T18:31:21.235Z] 18:31:21     INFO -      trackThreadNetIO(row, io, stage)
Summary: Windows 10 Quantum xperf tasks are sometimes reporting all zeros for results (but still passes) → Windows 10 Quantum Renderer xperf tasks are sometimes reporting all zeros for results (but still passes)
Priority: -- → P2
Whiteboard: [perftest:triage]

This isn't just Quantum Renderer builds, this also happens on non-QR builds as well, e.g. https://treeherder.mozilla.org/#/jobs?repo=try&revision=648731c016dec86f16285e4207b22b57c4a2903d&selectedJob=289549501

Summary: Windows 10 Quantum Renderer xperf tasks are sometimes reporting all zeros for results (but still passes) → Windows 10 xperf tasks are sometimes reporting all zeros for results (but still passes)

:bebe, can someone from your team see if this error has an easy solution? If it doesn't, can you provide details for how/why it isn't simple to fix?

Flags: needinfo?(fstrugariu)
Whiteboard: [perftest:triage]
Assignee: nobody → aionescu
Status: NEW → ASSIGNED
Flags: needinfo?(fstrugariu)
Priority: P2 → P1

The trackThreadNetIO(row, io, stage) error doesn't seem to be consistent.
The pattern seems to look like this:
Windows 7 Shippable opt: zeroes on

  • tp5n main_normal_netio opt e10s stylo INTERMITTENT,
  • tp5n main_startup_netio opt e10s stylo

Windows 10 x64 Shippable opt: zeroes on

  • tp5n main_normal_netio opt e10s stylo INTERMITTENT,
  • tp5n main_startup_netio opt e10s stylo,
  • tp5n mainthread_writebytes opt e10s stylo,
  • tp5n mainthread_writecount opt e10s stylo

Windows 10 x64 QuantumRender Shippable opt: zeroes on

  • tp5n main_normal_netio opt e10s stylo INTERMITTENT,
  • tp5n main_startup_netio opt e10s stylo,
  • tp5n mainthread_writebytes opt e10s stylo,
  • tp5n mainthread_writecount opt e10s stylo

Ongoing investigation. Have to look further on all subtests to determine any more intermittens.

Mark, I didn't find any more xperf jobs passing with the error. Also I can access the log in your description, looks like it's been removed from storage. Could you point me to a log that is actually available?
Also, seems like those talos tests were designed to report zeroes. The graph contains zeroes since 1 year ago (on autoland) for almost all xperf tests on all winfows platforms.

Aaron, I know you're the owner of those tests. Any suggestion?

Flags: needinfo?(aklotz)
Priority: P1 → P2

Waiting for :aklotz reply in order to proceed.

Two issues here:

* Obviously silently warning about those failures is bad and that should be fixed.
* Re: the regexp failure. Here is a dump of the fields generated by the `Microsoft-Windows-TCPIP` provider:

```

```

Two issues here:

  • Obviously silently warning about those failures is bad and that should be fixed.
  • Re: the regexp failure. I have attached the event names that came from my machine running Windows 10 Fall 2019 Update. Note that sometimes xperf event names and behaviours change between Windows versions. Is there any sort of correlation between the xperf failures and the version of the OS on which the test is running?
Flags: needinfo?(aklotz)

Ni for comment #8

Flags: needinfo?(aionescu)

I wanted to know if these silent failures are intended or not. I am going to investigate further to find out how can we fix this and then to answer the question about correlation between the xperf failures and the version of OS.

Flags: needinfo?(aionescu)
Priority: P2 → P1

(In reply to Aaron Klotz [:aklotz] from comment #8)

Two issues here:

  • Obviously silently warning about those failures is bad and that should be fixed.

Aaron, should I understand that the zero-values reported are fine and just the failures should be fixed?

Flags: needinfo?(aklotz)

The zero values are not fine, but if you fix the failures they should go away.

Flags: needinfo?(aklotz)

(In reply to Aaron Klotz [:aklotz] from comment #8)

  • Re: the regexp failure. I have attached the event names that came from my machine running Windows 10 Fall 2019 Update. Note that sometimes xperf event names and behaviours change between Windows versions. Is there any sort of correlation between the xperf failures and the version of the OS on which the test is running?

I don't know. What can I do is to add a print to the event that's not matching the pattern and follow the logs when the intermittent appears again.

Aaron, were does etl_output.csv come from? The filename is hard-coded and seems to be no track of it anywhere else. It contains the zeroes for writecount and writebytes. Could you please give me more insights about how is this file created?

Flags: needinfo?(aklotz)

Aaron, looks like the conditions in the lambda function cover only the events with no DiskWriteCount and DiskWriteBytes:

[task 2020-04-08T07:30:15.003Z] 07:30:15     INFO -  ('C:\\Users\\task_1586329312\\AppData\\Local\\Temp\\tmpmpvoz_\\profile\\favicons.sqlite', '6732 (main)', 'startup') {'DiskReadBytes': 196808, 'DiskWriteCount': 0, 'DiskWriteBytes': 0, 'DiskReadCount': 8}
[task 2020-04-08T07:30:15.003Z] 07:30:15     INFO -  ('Z:\\task_1586329312\\build\\application\\firefox\\defaults\\pref\\channel-prefs.js', '6732 (main)', 'startup') {'DiskReadBytes': 866, 'DiskWriteCount': 0, 'DiskWriteBytes': 0, 'DiskReadCount': 2}
[task 2020-04-08T07:30:15.003Z] 07:30:15     INFO -  ('C:\\Users\\task_1586329312\\AppData\\Local\\Temp\\tmpmpvoz_\\profile\\permissions.sqlite', '6732 (main)', 'startup') {'DiskReadBytes': 4392, 'DiskWriteCount': 0, 'DiskWriteBytes': 0, 'DiskReadCount': 12}
[task 2020-04-08T07:30:15.003Z] 07:30:15     INFO -  ('Z:\\task_1586329312\\AppData\\Roaming\\Mozilla\\firefox\\Crash Reports\\InstallTime20200408054848', '6732 (main)', 'startup') {'DiskReadBytes': 20, 'DiskWriteCount': 0, 'DiskWriteBytes': 0, 'DiskReadCount': 2}
[task 2020-04-08T07:30:15.003Z] 07:30:15     INFO -  ('C:\\Users\\task_1586329312\\AppData\\Local\\Temp\\tmpmpvoz_\\profile\\key4.db', '6732 (main)', 'startup') {'DiskReadBytes': 196936, 'DiskWriteCount': 0, 'DiskWriteBytes': 0, 'DiskReadCount': 16}
[task 2020-04-08T07:30:15.003Z] 07:30:15     INFO -  ('C:\\Users\\task_1586329312\\AppData\\Local\\Temp\\tmpmpvoz_\\profile\\favicons.sqlite-shm', '6732 (main)', 'startup') {'DiskReadBytes': 65536, 'DiskWriteCount': 0, 'DiskWriteBytes': 0, 'DiskReadCount': 2}
[task 2020-04-08T07:30:15.003Z] 07:30:15     INFO -  ('Z:\\task_1586329312\\build\\application\\firefox\\dependentlibs.list', '6732 (main)', 'startup') {'DiskReadBytes': 16384, 'DiskWriteCount': 0, 'DiskWriteBytes': 0, 'DiskReadCount': 4}
[task 2020-04-08T07:30:15.003Z] 07:30:15     INFO -  ('C:\\Users\\task_1586329312\\AppData\\Local\\Temp\\tmpmpvoz_\\profile\\prefs.js', '6732 (main)', 'startup') {'DiskReadBytes': 28970, 'DiskWriteCount': 0, 'DiskWriteBytes': 0, 'DiskReadCount': 2}
[task 2020-04-08T07:30:15.003Z] 07:30:15     INFO -  ('C:\\Users\\task_1586329312\\AppData\\Local\\Temp\\tmpmpvoz_\\profile\\cookies.sqlite', '6732 (main)', 'startup') {'DiskReadBytes': 131272, 'DiskWriteCount': 0, 'DiskWriteBytes': 0, 'DiskReadCount': 6}
[task 2020-04-08T07:30:15.004Z] 07:30:15     INFO -  ('C:\\Windows\\System32\\InputHost.dll', '6732 (main)', 'startup') {'DiskReadBytes': 32768, 'DiskWriteCount': 0, 'DiskWriteBytes': 0, 'DiskReadCount': 2}
[task 2020-04-08T07:30:15.004Z] 07:30:15     INFO -  ('C:\\Users\\task_1586329312\\AppData\\Local\\Temp\\tmpmpvoz_\\profile\\user.js', '6732 (main)', 'startup') {'DiskReadBytes': 18288, 'DiskWriteCount': 0, 'DiskWriteBytes': 0, 'DiskReadCount': 2}
[task 2020-04-08T07:30:15.004Z] 07:30:15     INFO -  ('C:\\Windows\\System32\\spool\\drivers\\color\\sRGB Color Space Profile.icm', '6732 (main)', 'startup') {'DiskReadBytes': 8192, 'DiskWriteCount': 0, 'DiskWriteBytes': 0, 'DiskReadCount': 2}
[task 2020-04-08T07:30:15.004Z] 07:30:15     INFO -  ('C:\\Users\\task_1586329312\\AppData\\Local\\Temp\\tmpmpvoz_\\profile\\cert9.db', '6732 (main)', 'startup') {'DiskReadBytes': 131336, 'DiskWriteCount': 0, 'DiskWriteBytes': 0, 'DiskReadCount': 10}
[task 2020-04-08T07:30:15.004Z] 07:30:15     INFO -  ('C:\\Windows\\System32\\setupapi.dll', '6732 (main)', 'startup') {'DiskReadBytes': 88064, 'DiskWriteCount': 0, 'DiskWriteBytes': 0, 'DiskReadCount': 8}
[task 2020-04-08T07:30:15.004Z] 07:30:15     INFO -  ('C:\\Users\\task_1586329312\\AppData\\Local\\Temp\\tmpmpvoz_\\profile\\xulstore\\data.mdb', '6732 (main)', 'startup') {'DiskReadBytes': 1216, 'DiskWriteCount': 0, 'DiskWriteBytes': 0, 'DiskReadCount': 8}
[task 2020-04-08T07:30:15.004Z] 07:30:15     INFO -  ('C:\\Users\\task_1586329312\\AppData\\Local\\Temp\\tmpmpvoz_\\profile\\places.sqlite-shm', '6732 (main)', 'startup') {'DiskReadBytes': 65536, 'DiskWriteCount': 0, 'DiskWriteBytes': 0, 'DiskReadCount': 2}
[task 2020-04-08T07:30:16.992Z] 07:30:16     INFO -  ('C:\\Users\\task_1586329312\\AppData\\Local\\Temp\\tmpmpvoz_\\profile\\places.sqlite', '6732 (main)', 'startup') {'DiskReadBytes': 262344, 'DiskWriteCount': 0, 'DiskWriteBytes': 0, 'DiskReadCount': 10}
[task 2020-04-08T07:30:16.992Z] 07:30:16     INFO -  ('C:\\Users\\task_1586329312\\AppData\\Local\\Temp\\tmpmpvoz_\\profile\\pkcs11.txt', '6732 (main)', 'startup') {'DiskReadBytes': 16384, 'DiskWriteCount': 0, 'DiskWriteBytes': 0, 'DiskReadCount': 4}
[task 2020-04-08T07:30:16.993Z] 07:30:16     INFO -  ('C:\\Users\\task_1586329312\\AppData\\Local\\Temp\\tmpmpvoz_\\profile\\cookies.sqlite-shm', '6732 (main)', 'startup') {'DiskReadBytes': 65536, 'DiskWriteCount': 0, 'DiskWriteBytes': 0, 'DiskReadCount': 2}
[task 2020-04-08T07:30:16.993Z] 07:30:16     INFO -  ('C:\\Users\\task_1586329312\\AppData\\Local\\Temp\\tmpmpvoz_\\profile\\compatibility.ini', '6732 (main)', 'startup') {'DiskReadBytes': 446, 'DiskWriteCount': 0, 'DiskWriteBytes': 0, 'DiskReadCount': 2}

We probably need to revisit that logic in order to get rid of the zero-values.

There's this lambda function that filters those values and looks like this:

filekeys = filter(lambda x: (all_stages or x[2] == stages[0]) and
                            (all_threads or x[1].endswith("(main)")) and
                            (all_stages and x[2] != stages[0] or
                             not checkWhitelist(x[0], whitelist)),
                  files.iterkeys())

If we take favicons.sqlite and look at the unfiltered values from etl_output.csv it looks like this:

('C:\\Users\\task_1586182132\\AppData\\Local\\Temp\\tmpajjk25\\profile\\favicons.sqlite', '9912', 'all'): 
    {'DiskReadBytes': 458952, 'DiskWriteCount': 94, 'DiskWriteBytes': 4915200, 'DiskReadCount': 16},
('C:\\Users\\task_1586182132\\AppData\\Local\\Temp\\tmpajjk25\\profile\\favicons.sqlite', '11180 (main)', 'all'): 
    {'DiskReadBytes': 196808, 'DiskWriteCount': 0, 'DiskWriteBytes': 0, 'DiskReadCount': 8},
('C:\\Users\\task_1586182132\\AppData\\Local\\Temp\\tmpajjk25\\profile\\favicons.sqlite', '9912', 'normal'): 
    {'DiskReadBytes': 458952, 'DiskWriteCount': 94, 'DiskWriteBytes': 4915200, 'DiskReadCount': 16},
('C:\\Users\\task_1586182132\\AppData\\Local\\Temp\\tmpajjk25\\profile\\favicons.sqlite', '11180 (main)', 'startup'): 
    {'DiskReadBytes': 196808, 'DiskWriteCount': 0, 'DiskWriteBytes': 0, 'DiskReadCount': 8},
('C:\\Users\\task_1586182132\\AppData\\Local\\Temp\\tmpajjk25\\profile\\favicons.sqlite', '9912', 'startup'): 
    {'DiskReadBytes': 262344, 'DiskWriteCount': 0, 'DiskWriteBytes': 0, 'DiskReadCount': 10},

DiskWriteCount and DiskWriteBytes are greater than 0 for all and normal stages. Looking at the code for debug mode:

if debug:
    # in debug, we want stages = [startup+normal] and all threads, not just (main)
    # we will use this data to upload fileIO info to blobber only for debug mode
    outputData = filter(lambda x: (all_stages or x[2] in [stages[0], stages[1]]) and
                                  (all_stages and x[2] not in [stages[0], stages[1]] or
                                   not checkWhitelist(x[0], whitelist)),
                        files.iterkeys())

we can see that it includes stages[1] in x[2], which is exactly the normal stage.
I don't have thorough enough knowledge about this code to tell if this is the right fix, but is the closest fix I found so far (inclusing stages[1] in the expression against x[2]).

Thoughts?

Flags: needinfo?(aklotz)

Since this code landed 7 years ago, so I am having trouble believing that the emergence of this issue is with the filtering. I am happy to be convinced otherwise.

I'm confused, though: based on the failures in comment 5, we're seeing AttributeErrors because of a regex that isn't matching; the script is bailing out, hence we aren't producing good data. My expectation was that once we fixed that, these zeros would go away. Is this data from comment 16 coming from code that includes the regex fix?

Also, keep in mind what this test is actually seeking: I/O reads and writes on the main thread are bad, especially when they occur during startup. We segment the timeline into the startup, normal, shutdown stages to assist us with determining how "bad" the main thread I/O is. Changes in non-main thread I/O will happen all the time, and we don't really care about that unless there is some kind of weird, unreasonable spike. Keeping this in mind, in an ideal world where we spectacularly pass the test with flying colours, we would want zeros for all stages on the main thread. In practice that is not 100% possible, hence we have the whitelist. And of course, 0 I/O across all threads and all stages is clearly an error, since a web browser that doesn't do any I/O is not much of a web browser at all.

Flags: needinfo?(aklotz)

Aaron, the failures in comment 5 are indeed due to e regex but are very rare and what I did is to add a try catch to be able to see the string that is failing. However, the zero-values aren't caused by that as they are permanent (they happen also when the matching error doesn't). So this issue has 2 culprits: silent failure because of the re.match and the zero-values whose occurrence I explained in comment 15. With the mention that the zero-values happen since ever while the match failure is relatively recent.

As we can't really see what event name is causing the silent fail, my proposal is to land a patch adding a try/except to the re.match as the intermittent occurrence is fairly low.

I wouldn't wait one more week, as this patch is fairly easy to review, and it could buy us time to advance with the investigations about zero-values.
Thanks for your answer, Aaron.

Attachment #9140114 - Attachment description: Bug 1614805 Part 1: catch re.match error → Bug 1614805 Part 1: catch re.match error r?bebe,#perftest
Attachment #9140114 - Attachment description: Bug 1614805 Part 1: catch re.match error r?bebe,#perftest → Bug 1614805 Part 1: catch re.match error r?#perftest,bebe

Note to self to remove the leave-open keyword after this is fixed.

Flags: needinfo?(aionescu)
Pushed by aionescu@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/f2c1f62b1f65
Part 1: catch re.match error r=perftest-reviewers,Bebe

(In reply to Mark Banner (:standard8) from comment #5)

The netio variants seem to always produce zero (or nearly always), so you can ignore those.

You need to look at the main_normal_fileio and main_startup_fileio values:

https://treeherder.mozilla.org/perf.html#/graphs?highlightAlerts=1&selected=1922959,1074880601&series=autoland,1922967,1,1&series=autoland,1922959,1,1&timerange=2592000

These are the direct links to the latest two failures:

https://treeherder.mozilla.org/#/jobs?repo=autoland&selectedJob=292924730&group_state=expanded&revision=cb1605f13d5f56252b6dc89e2934f4c4aedac298
https://treeherder.mozilla.org/#/jobs?repo=autoland&selectedJob=292721403&group_state=expanded&revision=d28ceb16f6839c79662e1eb355584d87a9810d60

Mark, from what you're saying, this bug was open because of the zero-values reported by the silent re.match error and we don't care about others. Is that right?

Flags: needinfo?(aionescu) → needinfo?(standard8)

(In reply to (PTO until April 20th) Alexandru Ionescu (needinfo me) :alexandrui from comment #23)

Mark, from what you're saying, this bug was open because of the zero-values reported by the silent re.match error and we don't care about others. Is that right?

The ones we care in this bug are the ones where we have all-zeros reported across all tests. I don't know if anything should be done about the sub-tests that report zeros all the time - probably nothing based on the previous discussions.

Flags: needinfo?(standard8)

Meanwhile I landed a patch meant to log more details about the regex matching error and found this job reporting that the event name causing the error looks like this:
Error matching event Microsoft-Windows-TCPIP//win:Info: 'NoneType' object has no attribute 'group'
The event name with prefix and suffix is Microsoft-Windows-TCPIP//win:Info and the event name alone seems to be an empty string .

I am not sure what I should understand from here. Aaron, could you please help on this?

Flags: needinfo?(aklotz)

Microsoft-Windows-TCPIP//win:Info is different from the other event names that I posted in the attachment.

The regex in trackThreadNetIO tries to capture the content between the slashes, but in this case the content between the slashes is empty.

Flags: needinfo?(aklotz)

(In reply to Aaron Klotz [:aklotz] from comment #26)

Microsoft-Windows-TCPIP//win:Info is different from the other event names that I posted in the attachment.

The regex in trackThreadNetIO tries to capture the content between the slashes, but in this case the content between the slashes is empty.

I mean, in the context of why is this failing the regular expression matching, is this supposed to happen? Is this normal? Should the code cover this empty match continue the execution or fail it as not normal?

(In reply to Aaron Klotz [:aklotz] from comment #17)

Also, keep in mind what this test is actually seeking: I/O reads and writes on the main thread are bad, especially when they occur during startup.

Considering this, the only concerning zeroes are indeed the ones provoked by the matching error during startup for windows10-64-shippable, windows10-64 and windows10-64-qr. So once we get to the same page with the questions above, we can identify a solution.

Flags: needinfo?(aklotz)

(In reply to Alexandru Ionescu (needinfo me) :alexandrui from comment #27)

I mean, in the context of why is this failing the regular expression matching, is this supposed to happen? Is this normal? Should the code cover this empty match continue the execution or fail it as not normal?

We already filter events based on the text between the slashes, so an empty string would be treated identically to any other string that we don't care about. So yes, we should just produce an empty string and continue execution.

Flags: needinfo?(aklotz)

Aaron, there's a piece of code here which is continuing the execution if the while-list file is not found, just like it was for the events not fitting in the regular expression. It is prone to false results and can misslead. I would raise an exception and fail the execution of the code instead of returning.

Flags: needinfo?(aklotz)
Pushed by aionescu@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/d3c0d7256d20
Part 2: modify the re.match expression to cover the empty event r=perftest-reviewers,sparky

That's fine.

Flags: needinfo?(aklotz)
Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.