sessionrestore_no_auto_restore generating NaNs

NEW
Unassigned

Status

defect
3 years ago
3 years ago

People

(Reporter: ekyle, Unassigned)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment)

sessionrestore_no_auto_restore has been producing NaNs for a long time.  Here is an example.

https://archive.mozilla.org/pub/firefox/tinderbox-builds/autoland-win32/1476200087/autoland_xp_ix_test-other-e10s-bm127-tests1-windows-build162.txt.gz

An updated list of the most recent PERFHERDER_DATA that has NaNs, or infinities, can be seen here:

http://activedata.allizom.org/tools/query.html#query_id=qDxoKaWP

Currently, ActiveData is a couple days behind on ingestion, so it shows data from a couple days ago.  It will catch up.
in perfherder we do not ingest replicates, so we don't see this :)  But that doesn't mean we shouldn't address this.

it would be good to know:
1) why this is happening
2) what platforms this happens on
3) find a way to not allow this to happen, or if not possible, ignore NaN results and not include in perfherder data (ideally ignored at the test level)

for sessionrestore* tests, they are a ts style test meaning the Talos harness measures the time based on __start_timestamp __end_timestamp, and related messages it sees in stdout.  Given that, we should be able to detect the NaN or other non int/float numbers in talos and ignore them.

here is where we are outputting the duration which is reporting as NaN:
https://dxr.mozilla.org/mozilla-central/source/testing/talos/talos/startup_test/sessionrestore/main.js#27

here is where we parse the results from stdout:
https://dxr.mozilla.org/mozilla-central/source/testing/talos/talos/results.py#258

Alison, would you like to investigate this?
Flags: needinfo?(ashiue)
Yes, please, keep the ni for reminding.
According to the logs from activedata, this issue happened on all platforms (both on e10s and non-e10s). 
This symptom seems exists for a while, the logs from this March (the earliest log I can query) already have NaN value. 
http://activedata.allizom.org/tools/query.html#query_id=_tPcG4jB

It's interesting that this issue ONLY happened on the first 2 times when browser launch (sessionrestore_no_auto_restore only), the "startup_info.sessionRestored" is not yet defined when we are trying to get the sessionRestored number (we could get the value if adding delay).

From the logs, we can find that the NaN numeber always shows on the running cycle 1/10 for sessionrestore_no_auto_restore test...
So, the first time is when initialized browser and the second time is the 1/10 test run.

Therefore, I think the easiest way to avoid this issue is to initialize browser twice before sessionrestore_no_auto_restore tests, and then all the sessionrestore_no_auto_restore test runs will get the duration number successfully.
Flags: needinfo?(ashiue)
Alison, thanks for the details related to the scope of this.  In talos we have the option to ignore_first(X), where X is the number of data points to ignore.  For example:
https://dxr.mozilla.org/mozilla-central/source/testing/talos/talos/test.py?q=path%3Atesting%2Ftalos%2Ftalos%2Ftest.py&redirect_type=single#136

I would play with ignore_first(2), and increasing the cycles to add 2 extra runs.  All of that can be done in test.py.
(In reply to Joel Maher ( :jmaher) from comment #4)
> Alison, thanks for the details related to the scope of this.  In talos we
> have the option to ignore_first(X), where X is the number of data points to
> ignore.  For example:
> https://dxr.mozilla.org/mozilla-central/source/testing/talos/talos/test.
> py?q=path%3Atesting%2Ftalos%2Ftalos%2Ftest.py&redirect_type=single#136
> 
> I would play with ignore_first(2), and increasing the cycles to add 2 extra
> runs.  All of that can be done in test.py.

I feel like the correct thing to do here is figure out why the test is generating NaNs, and fix the root cause (going back to Joel's original comment) -- rather than papering over the problem by adding more runs/cycles/ignores. This bug at least doesn't seem to be causing any immediate issues, so there's no need to rush a fix.
Hi Joel, I've checked the codes about filters, the test sessionrestore_no_auto_restore already inherited filters from https://dxr.mozilla.org/mozilla-central/source/testing/talos/talos/test.py#32.
(Currently, the first data "NaN" already been ignored.)
Therefore, I think using ignore_first(2) still cannot avoid showing NaN in log.

Hi William, OK, since the NaN already been ignored, there is no harm for testing, I will try to figure out why this happened.
I wonder if we print all the values in the log, but do not use the first one for the summary calculation.  If that is the case, then ignore_first(2) would be ok- although this problem will still exist for other log parsers.

Alison, with your method to run the warmup run from ffsetup.py twice- do we have NaN in the raw log?

Also ask :Yoric about it as he wrote the test originally.
(In reply to Joel Maher ( :jmaher) from comment #8)
> I wonder if we print all the values in the log, but do not use the first one
> for the summary calculation.  If that is the case, then ignore_first(2)
> would be ok- although this problem will still exist for other log parsers.
> 
> Alison, with your method to run the warmup run from ffsetup.py twice- do we
> have NaN in the raw log?
Yes, the NaN may still show on the warmup runs, but we can avoid it in test result PERFHERDER_DATA:{.....}

> Also ask :Yoric about it as he wrote the test originally.
do you have a try run with the two warmup runs?  It would be nice to examine the logs of many retriggers to see if that solves the problem- it might hint at why we have this to begin with.
Here is the try run: https://treeherder.mozilla.org/#/jobs?repo=try&revision=d63a9917b12d0676dce011d58a2cbb8cd14ba787

It looks like this workaround could avoid this NaN issue.
:wlach, do you have concerns with the approach (running the warmup twice via a flag) Alison has implemented here?
Flags: needinfo?(wlachance)
(In reply to Joel Maher ( :jmaher) from comment #12)
> :wlach, do you have concerns with the approach (running the warmup twice via
> a flag) Alison has implemented here?

I don't think we should go with this approach, as it doesn't address the root cause (whatever that may be). My preference would be to figure that out and address that.
Flags: needinfo?(wlachance)
Comment on attachment 8804591 [details]
Bug 1309956 - Avoid sessionrestore_no_auto_restore generating NaNs

thanks :wlach- I was leaning that way, I suspect there is an issue with profile creation- something we are making happen when we warmup the profile a second time.

possibly we could diff the profile after warmup run 1 and 2?
Attachment #8804591 - Flags: review?(jmaher)
You need to log in before you can comment on or make changes to this bug.