792212 - on the tegras we have found that reftests fail a lot, a whole lot on the newer tegras- this is due to the shared_prefs files

Reporter

Description

•

12 years ago

On the new tegras when we fail, we seem to fail on a fresh install of fennec.  This is because a fresh install has no files in /data/data/org.mozilla.fennec.  After a first run, we can run again and tests pass.  For reference I was using border-radius as a test case where 23 tests fail on first install and all pass on second run.

After a lot of trial and error, I found that the two files in /data/data/org.mozilla.fennec/shared_prefs are the culprits:
telnet tegra-306 20701
Trying 10.250.51.146...
Connected to tegra-306.
Escape character is '^]'.
$>cd /data/data/org.mozilla.fennec/shared_prefs
$>ls
ProfileMigrator.xml
GeckoApp.xml
$>cat ProfileMigrator.xml
<?xml version='1.0' encoding='utf-8' standalone='yes' ?>
<map>
<boolean name="move_profile_done" value="true" />
<boolean name="history_done" value="true" />
<boolean name="bookmarks_done" value="true" />
</map>
$>cat GeckoApp.xml
<?xml version='1.0' encoding='utf-8' standalone='yes' ?>
<map>
<boolean name="wasStopped" value="true" />
<boolean name="OOMException" value="false" />
</map>
$>

I tested deleting each file individually and if either one (or both) are missing the tests fail.

Armen [:armenzg]

Comment 1

•

12 years ago

Good finding!

Is this related to bug 784278?

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 2

•

12 years ago

yep, this appears to be closer to the root cause.

Blocks: 784278

Geoff Brown [:gbrown]

Assignee

Updated

•

12 years ago

Assignee: nobody → gbrown

Geoff Brown [:gbrown]

Assignee

Comment 3

•

12 years ago

:blassey and :jmaher suggest that if we cannot find an easy solution to this, we might work-around it by launching a dummy run to initialize the environment before running tests.

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 4

•

12 years ago

Attached patch do a firstrun registration for reftests on mobile only (1.0) (obsolete) — Details — Splinter Review

this is looking really good on try for all platforms.

Assignee: gbrown → jmaher

Status: NEW → ASSIGNED

Attachment #663031 - Flags: review?(gbrown)

Geoff Brown [:gbrown]

Assignee

Comment 5

•

12 years ago

Comment on attachment 663031 [details] [diff] [review]
do a firstrun registration for reftests on mobile only (1.0)

Review cancelled. As discussed, doFirstRun runs a full reftest cycle, at least for "make reftest-remote" with TEST_PATH specified.

Attachment #663031 - Flags: review?(gbrown)

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 6

•

12 years ago

Attached patch do a first run only (WIP) — Details — Splinter Review

so we pass in the -reftest and uri by setting a pref before pushing the profile.  I have a patch that sets the pref to something invalid, then after that run, it sets it to the proper value.

What makes this difficult is that this approach generates an error and we need to ignore it 100% or else log parsers will catch it.  Also this patch in its current form doesn't seem to go onto the test, on try server it stops after the first run and prints the error.

I am putting this up here so others can hack on it as I won't have much time to do this.

Attachment #663031 - Attachment is obsolete: true

William Lachance (:wlach)

Comment 7

•

12 years ago

Attached patch Patch to fix some potential issues around shared preferences (obsolete) — Details — Splinter Review

So I kind of got inspired to look into this problem more deeply today together with :jmaher. I came up with a prototype patch which disables some code that touches shared profile preferences when using a custom profile (such as when we're running mochitest) and makes some other code synchronous that seemed to be threaded for no good reason (that I could tell).

The theory is that modifying shared preferences from two different processes simultaneously was triggering some kind of race condition on the tegras that was causing the reftests to fail.

With the attached patch applied, we no longer reproduce the problem with the exsiting harness code (i.e. no custom patching to do a "first run" of fennec or anything). I do feel somewhat bad that I haven't been able to determine the exact root cause with any certainty, but I do think this is a good start at such an investigation. 

The main question for me is whether the changes are legitimate. They "feel" ok to me, but I'll leave the judgement of that up to the mobile folks.

Attachment #664638 - Flags: feedback?(gbrown)

Attachment #664638 - Flags: feedback?(blassey.bugs)

Geoff Brown [:gbrown]

Assignee

Comment 8

•

12 years ago

Comment on attachment 664638 [details] [diff] [review]
Patch to fix some potential issues around shared preferences

When I saw the initial bug report, I was suspicious of profile migration and investigated that a little. I found that profile migration is initiated with each test since it is a fresh install and the profile is on /sdcard...but it soon finds there is nothing to migrate and finishes cleanly -- I could find nothing going wrong there.

Regarding your removal of the background threads: We are very sensitive to doing anything more than necessary on the main thread, perhaps especially at startup (as this code is) -- your change may regress startup time.

feedback+ for the effort and discovering a way to avoid the failures - there's a clue to the cause here, I'm sure - but feedback- overall. If we cannot find and address the cause of the failures, I would prefer to solve this with a dummy first-run; after all, I think the non-first-run case is more relevant for reftests.

Attachment #664638 - Flags: feedback?(gbrown) → feedback-

William Lachance (:wlach)

Comment 9

•

12 years ago

(In reply to Geoff Brown [:gbrown] from comment #8)
> Comment on attachment 664638 [details] [diff] [review]
> Patch to fix some potential issues around shared preferences
> 
> When I saw the initial bug report, I was suspicious of profile migration and
> investigated that a little. I found that profile migration is initiated with
> each test since it is a fresh install and the profile is on /sdcard...but it
> soon finds there is nothing to migrate and finishes cleanly -- I could find
> nothing going wrong there.
> 
> Regarding your removal of the background threads: We are very sensitive to
> doing anything more than necessary on the main thread, perhaps especially at
> startup (as this code is) -- your change may regress startup time.

I haven't checked yet, but my suspicion would be that this patch would fix the issue even if we put the background thread stuff for the GeckoApp shared preferences back in. Based on jmaher's original diagnosis and the resutls we're seeing here, I'm pretty sure that this bug is due to an *interaction* between the different shared prefs files being modified. If it's only the GeckoApp stuff that we're modifying, we might well be ok.

Would that be more acceptable? There should be no performance issues then.

It also occurred to me that we could try to get closer to the root cause by selectively removing my patches to disable the different types of profile migration -- maybe there's one type of migration in particular which is causing issues on the tegras.

Brad Lassey [:blassey] (use needinfo?)

Updated

•

12 years ago

Attachment #664638 - Flags: feedback?(blassey.bugs)

William Lachance (:wlach)

Comment 10

•

12 years ago

Attached patch Don't attempt to migrate profiles if using a custom profile — Details — Splinter Review

This removes the code that takes reading/modifying geckoapp preferences off the main thread. So we're now just avoiding profile migration in the case that a custom profile is passed in.

Joel tested this patch and found it made this problem go away. This still only addresses symptoms -- we don't yet truly understand the underlying cause. But maybe this will suffice as a stopgap until we do.

Assignee: jmaher → wlachance

Attachment #664638 - Attachment is obsolete: true

Attachment #665042 - Flags: review?(gbrown)

Brad Lassey [:blassey] (use needinfo?)

Updated

•

12 years ago

Attachment #665042 - Flags: review?(gpascutto)

Geoff Brown [:gbrown]

Assignee

Comment 11

•

12 years ago

Comment on attachment 665042 [details] [diff] [review]
Don't attempt to migrate profiles if using a custom profile

Review of attachment 665042 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for cleaning this up - my concerns are addressed. I'm still trying to better determine the cause of the failures, but don't want that to further delay this fix.

Attachment #665042 - Flags: review?(gbrown) → review+

William Lachance (:wlach)

Comment 12

•

12 years ago

(In reply to Geoff Brown [:gbrown] from comment #11)
> Comment on attachment 665042 [details] [diff] [review]
> Don't attempt to migrate profiles if using a custom profile
> 
> Review of attachment 665042 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Thanks for cleaning this up - my concerns are addressed. I'm still trying to
> better determine the cause of the failures, but don't want that to further
> delay this fix.

Cool. Pushed patch to try:

https://tbpl.mozilla.org/?tree=Try&rev=1ecd559ebefd

Gian-Carlo Pascutto [:gcp]

Comment 13

•

12 years ago

I have questions:

Comment 1 says:

"I tested deleting each file individually and if either one (or both) are missing the tests fail."

Your patch avoids reads on ProfileMigrator.xml, but not on GeckoApp.xml. This seems inconsistent. I'd like to understand if this is indeed a generic shared_prefs race issue, or if Migration is interacting with tests.

>The theory is that modifying shared preferences from two different processes simultaneously 
>was triggering some kind of race condition on the tegras that was causing the reftests to 
>fail.

What would the two processes be that modify ProfileMigrator.xml?

Note:
ProfileMigrator.xml contains a mark that indicates whether the profile has been tested for being on a SDcard, and if not it will move it to internal storage. see GeckoProfile.getDir() and ProfileMigrator.MoveProfileTask().

Does only killing that part change the problem? (I still don't see how that can race as it's synchronized, but it seems much more likely to be involved in these failures than the rest of the code).

Do we have ADB output or anything from both a successful launch + reftest run and a failed one? How does the output for Profile Migration look in both cases?

I don't really have a problem r+'ing this patch, but isolating the failure even more would be very helpful here.

William Lachance (:wlach)

Comment 14

•

12 years ago

(In reply to Gian-Carlo Pascutto (:gcp) from comment #13)
> I have questions:
> 
> Comment 1 says:
> 
> "I tested deleting each file individually and if either one (or both) are
> missing the tests fail."
> 
> Your patch avoids reads on ProfileMigrator.xml, but not on GeckoApp.xml.
> This seems inconsistent. I'd like to understand if this is indeed a generic
> shared_prefs race issue, or if Migration is interacting with tests.
> 
> >The theory is that modifying shared preferences from two different processes simultaneously 
> >was triggering some kind of race condition on the tegras that was causing the reftests to 
> >fail.
> 
> What would the two processes be that modify ProfileMigrator.xml?
> 
> Note:
> ProfileMigrator.xml contains a mark that indicates whether the profile has
> been tested for being on a SDcard, and if not it will move it to internal
> storage. see GeckoProfile.getDir() and ProfileMigrator.MoveProfileTask().
> 
> Does only killing that part change the problem? (I still don't see how that
> can race as it's synchronized, but it seems much more likely to be involved
> in these failures than the rest of the code).
> 
> Do we have ADB output or anything from both a successful launch + reftest
> run and a failed one? How does the output for Profile Migration look in both
> cases?
> 
> I don't really have a problem r+'ing this patch, but isolating the failure
> even more would be very helpful here.

Very good questions, these are things I'm wondering about myself. I believe :gbrown is looking to answer them.

That said, if there's no objections I'd like to commit my patch as my understanding is that this will allow us to move a whole bunch of new tegras into production.

Gian-Carlo Pascutto [:gcp]

Comment 15

•

12 years ago

Comment on attachment 665042 [details] [diff] [review]
Don't attempt to migrate profiles if using a custom profile

Review of attachment 665042 [details] [diff] [review]:
-----------------------------------------------------------------

r+ because it doesn't seem to break anything. Caveats to taking such a workaround in the comments above.

Attachment #665042 - Flags: review?(gpascutto) → review+

William Lachance (:wlach)

Comment 16

•

12 years ago

(In reply to William Lachance (:wlach) from comment #12)
> (In reply to Geoff Brown [:gbrown] from comment #11)
> > Comment on attachment 665042 [details] [diff] [review]
> > Don't attempt to migrate profiles if using a custom profile
> > 
> > Review of attachment 665042 [details] [diff] [review]:
> > -----------------------------------------------------------------
> > 
> > Thanks for cleaning this up - my concerns are addressed. I'm still trying to
> > better determine the cause of the failures, but don't want that to further
> > delay this fix.
> 
> Cool. Pushed patch to try:
> 
> https://tbpl.mozilla.org/?tree=Try&rev=1ecd559ebefd

Try run looks good.

William Lachance (:wlach)

Comment 17

•

12 years ago

Pushed to inbound: https://hg.mozilla.org/integration/mozilla-inbound/rev/4a4733ddb26e

Please leave open for further investigation.

Whiteboard: [leave open]

Ryan VanderMeulen [:RyanVM]

Comment 18

•

12 years ago

https://hg.mozilla.org/mozilla-central/rev/4a4733ddb26e

Geoff Brown [:gbrown]

Assignee

Comment 19

•

12 years ago

I tried re-creating the failures on foopy26/tegra-314 using border-radius tests. 

First problem: 3 tests failed to load - invalidate-1a, invalidate-1b, and scroll-1. These fail with or without :wlach's patch, after a substantial timeout. For now, I've disabled these 3 tests.

Without :wlach's patch, there are 22 failures following a fresh install; with the patch, there are none. 

From initial logcats, I note that "LayerView paint state set to 2" is missing from the failure case.

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Updated

•

12 years ago

Blocks: 790685

Geoff Brown [:gbrown]

Assignee

Comment 20

•

12 years ago

endDrawing is not being called. LayerManagerOGL::EndTransaction *is* being called and LayerManagerOGL::Render *is* being called -- something seems to be going wrong in Render.

Geoff Brown [:gbrown]

Assignee

Comment 21

•

12 years ago

LayerManagerOGL::Render is returning prematurely because width == 0 and height == 0.

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 22

•

12 years ago

mbrubeck was looking at a change for talos where it was getting 0,0 and causing pageloader tests to fail.  It was a change that landed and they found a workaround for it.  Maybe it is the same cause.

Geoff Brown [:gbrown]

Assignee

Comment 23

•

12 years ago

In the normal/successful case, it looks like this width and height derive from nsBaseWidget::mBounds which is normally already set when nsBaseWidget::CreateCompositor is called; on the failing tegras, CreateCompositor finds mBounds set to 0, 0.

Geoff Brown [:gbrown]

Assignee

Comment 24

•

12 years ago

GeckoLayerClient::sendResizeEventIfNecessary sends the size from Java to Gecko as a SIZE_CHANGED event. In the reftests, Java consistently has the correct sizes and sends them. However, if the SIZE_CHANGED event is received after the compositor is created, we seem to continue to use the size in effect when the compositor was created (0,0), causing Render to abort and the reftests to fail.

Chris Lord [:cwiiis]

Comment 25

•

12 years ago

(In reply to Geoff Brown [:gbrown] from comment #24)
> GeckoLayerClient::sendResizeEventIfNecessary sends the size from Java to
> Gecko as a SIZE_CHANGED event. In the reftests, Java consistently has the
> correct sizes and sends them. However, if the SIZE_CHANGED event is received
> after the compositor is created, we seem to continue to use the size in
> effect when the compositor was created (0,0), causing Render to abort and
> the reftests to fail.

Could you test this after bug 797393 lands? Perhaps the size isn't propagating because of the FrameMetrics incorrectly not syncing. This may have nothing to do with it, but worth looking.

Geoff Brown [:gbrown]

Assignee

Comment 26

•

12 years ago

Thanks for the pointer. I pulled the patch from the bug and re-tested: the reftest problem persists.

Matt Brubeck (:mbrubeck)

Comment 27

•

12 years ago

(In reply to Joel Maher (:jmaher) from comment #22)
> mbrubeck was looking at a change for talos where it was getting 0,0 and
> causing pageloader tests to fail.  It was a change that landed and they
> found a workaround for it.  Maybe it is the same cause.

See bug 793419 for details on this issue.

Depends on: 793419

Geoff Brown [:gbrown]

Assignee

Updated

•

12 years ago

Assignee: wlachance → gbrown

Geoff Brown [:gbrown]

Assignee

Updated

•

12 years ago

Depends on: 797942

Geoff Brown [:gbrown]

Assignee

Comment 28

•

12 years ago

:wlach - when bug 797942 is resolved, you should be able to back out your work-around, if you like.

Justin Wood (:Callek)

Comment 29

•

12 years ago

Comment on attachment 665042 [details] [diff] [review]
Don't attempt to migrate profiles if using a custom profile

Review of attachment 665042 [details] [diff] [review]:
-----------------------------------------------------------------

Nomming for Gecko17 (feel free to flap over to approval-beta if need be)

This will make our tegra love much better on the reftest side, as well as is a pre-req for separately approved Bug 795053

Attachment #665042 - Flags: approval-mozilla-aurora?

Justin Wood (:Callek)

Comment 30

•

12 years ago

(In reply to Geoff Brown [:gbrown] (PTO Oct 5 - Oct 15) from comment #28)
> :wlach - when bug 797942 is resolved, you should be able to back out your
> work-around, if you like.

Geoff, given c#29 do we really intend to keep this [leave open] or a backout?

Alex Keybl [:akeybl]

Updated

•

12 years ago

Attachment #665042 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+

Wesley Johnston (:wesj)

Comment 31

•

12 years ago

https://hg.mozilla.org/releases/mozilla-aurora/rev/c8fd11e70c0f

Wesley Johnston (:wesj)

Comment 32

•

12 years ago

Missed this when unbitrotting:
https://hg.mozilla.org/releases/mozilla-aurora/rev/5e1ac36ca717

Geoff Brown [:gbrown]

Assignee

Comment 33

•

12 years ago

reftests seem to run fine on the new tegras now. Looks like all the problems were caused by bug 797942.

Status: ASSIGNED → RESOLVED

Closed: 12 years ago

Resolution: --- → DUPLICATE

Geoff Brown [:gbrown]

Assignee

Comment 34

•

12 years ago

(In reply to Justin Wood (:Callek) from comment #30)
> (In reply to Geoff Brown [:gbrown] (PTO Oct 5 - Oct 15) from comment #28)
> > :wlach - when bug 797942 is resolved, you should be able to back out your
> > work-around, if you like.
> 
> Geoff, given c#29 do we really intend to keep this [leave open] or a backout?

Given c#29, let's keep it in.

Whiteboard: [leave open]

Nobody; OK to take it and work on it

Updated

•

3 years ago

Product: Firefox for Android → Firefox for Android Graveyard

do a firstrun registration for reftests on mobile only (1.0) 12 years ago Joel Maher ( :jmaher ) (UTC -8) 1.91 KB, patch		Details \| Diff \| Splinter Review
do a first run only (WIP) 12 years ago Joel Maher ( :jmaher ) (UTC -8) 7.26 KB, patch		Details \| Diff \| Splinter Review
Patch to fix some potential issues around shared preferences 12 years ago William Lachance (:wlach) 5.40 KB, patch	gbrown : feedback-	Details \| Diff \| Splinter Review
Don't attempt to migrate profiles if using a custom profile 12 years ago William Lachance (:wlach) 3.32 KB, patch	gbrown : review+ gcp : review+ akeybl : approval-mozilla-aurora+	Details \| Diff \| Splinter Review