Closed Bug 792212 Opened 12 years ago Closed 12 years ago

on the tegras we have found that reftests fail a lot, a whole lot on the newer tegras- this is due to the shared_prefs files

Categories

(Firefox for Android Graveyard :: General, defect)

ARM
Android
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 797942

People

(Reporter: jmaher, Assigned: gbrown)

References

Details

Attachments

(2 files, 2 obsolete files)

On the new tegras when we fail, we seem to fail on a fresh install of fennec.  This is because a fresh install has no files in /data/data/org.mozilla.fennec.  After a first run, we can run again and tests pass.  For reference I was using border-radius as a test case where 23 tests fail on first install and all pass on second run.

After a lot of trial and error, I found that the two files in /data/data/org.mozilla.fennec/shared_prefs are the culprits:
telnet tegra-306 20701
Trying 10.250.51.146...
Connected to tegra-306.
Escape character is '^]'.
$>cd /data/data/org.mozilla.fennec/shared_prefs
$>ls
ProfileMigrator.xml
GeckoApp.xml
$>cat ProfileMigrator.xml
<?xml version='1.0' encoding='utf-8' standalone='yes' ?>
<map>
<boolean name="move_profile_done" value="true" />
<boolean name="history_done" value="true" />
<boolean name="bookmarks_done" value="true" />
</map>
$>cat GeckoApp.xml
<?xml version='1.0' encoding='utf-8' standalone='yes' ?>
<map>
<boolean name="wasStopped" value="true" />
<boolean name="OOMException" value="false" />
</map>
$>

I tested deleting each file individually and if either one (or both) are missing the tests fail.
Good finding!

Is this related to bug 784278?
yep, this appears to be closer to the root cause.
Blocks: 784278
Assignee: nobody → gbrown
:blassey and :jmaher suggest that if we cannot find an easy solution to this, we might work-around it by launching a dummy run to initialize the environment before running tests.
this is looking really good on try for all platforms.
Assignee: gbrown → jmaher
Status: NEW → ASSIGNED
Attachment #663031 - Flags: review?(gbrown)
Comment on attachment 663031 [details] [diff] [review]
do a firstrun registration for reftests on mobile only (1.0)

Review cancelled. As discussed, doFirstRun runs a full reftest cycle, at least for "make reftest-remote" with TEST_PATH specified.
Attachment #663031 - Flags: review?(gbrown)
so we pass in the -reftest and uri by setting a pref before pushing the profile.  I have a patch that sets the pref to something invalid, then after that run, it sets it to the proper value.

What makes this difficult is that this approach generates an error and we need to ignore it 100% or else log parsers will catch it.  Also this patch in its current form doesn't seem to go onto the test, on try server it stops after the first run and prints the error.

I am putting this up here so others can hack on it as I won't have much time to do this.
Attachment #663031 - Attachment is obsolete: true
So I kind of got inspired to look into this problem more deeply today together with :jmaher. I came up with a prototype patch which disables some code that touches shared profile preferences when using a custom profile (such as when we're running mochitest) and makes some other code synchronous that seemed to be threaded for no good reason (that I could tell).

The theory is that modifying shared preferences from two different processes simultaneously was triggering some kind of race condition on the tegras that was causing the reftests to fail.

With the attached patch applied, we no longer reproduce the problem with the exsiting harness code (i.e. no custom patching to do a "first run" of fennec or anything). I do feel somewhat bad that I haven't been able to determine the exact root cause with any certainty, but I do think this is a good start at such an investigation. 

The main question for me is whether the changes are legitimate. They "feel" ok to me, but I'll leave the judgement of that up to the mobile folks.
Attachment #664638 - Flags: feedback?(gbrown)
Attachment #664638 - Flags: feedback?(blassey.bugs)
Comment on attachment 664638 [details] [diff] [review]
Patch to fix some potential issues around shared preferences

When I saw the initial bug report, I was suspicious of profile migration and investigated that a little. I found that profile migration is initiated with each test since it is a fresh install and the profile is on /sdcard...but it soon finds there is nothing to migrate and finishes cleanly -- I could find nothing going wrong there.

Regarding your removal of the background threads: We are very sensitive to doing anything more than necessary on the main thread, perhaps especially at startup (as this code is) -- your change may regress startup time.

feedback+ for the effort and discovering a way to avoid the failures - there's a clue to the cause here, I'm sure - but feedback- overall. If we cannot find and address the cause of the failures, I would prefer to solve this with a dummy first-run; after all, I think the non-first-run case is more relevant for reftests.
Attachment #664638 - Flags: feedback?(gbrown) → feedback-
(In reply to Geoff Brown [:gbrown] from comment #8)
> Comment on attachment 664638 [details] [diff] [review]
> Patch to fix some potential issues around shared preferences
> 
> When I saw the initial bug report, I was suspicious of profile migration and
> investigated that a little. I found that profile migration is initiated with
> each test since it is a fresh install and the profile is on /sdcard...but it
> soon finds there is nothing to migrate and finishes cleanly -- I could find
> nothing going wrong there.
> 
> Regarding your removal of the background threads: We are very sensitive to
> doing anything more than necessary on the main thread, perhaps especially at
> startup (as this code is) -- your change may regress startup time.

I haven't checked yet, but my suspicion would be that this patch would fix the issue even if we put the background thread stuff for the GeckoApp shared preferences back in. Based on jmaher's original diagnosis and the resutls we're seeing here, I'm pretty sure that this bug is due to an *interaction* between the different shared prefs files being modified. If it's only the GeckoApp stuff that we're modifying, we might well be ok.

Would that be more acceptable? There should be no performance issues then.

It also occurred to me that we could try to get closer to the root cause by selectively removing my patches to disable the different types of profile migration -- maybe there's one type of migration in particular which is causing issues on the tegras.
Attachment #664638 - Flags: feedback?(blassey.bugs)
This removes the code that takes reading/modifying geckoapp preferences off the main thread. So we're now just avoiding profile migration in the case that a custom profile is passed in.

Joel tested this patch and found it made this problem go away. This still only addresses symptoms -- we don't yet truly understand the underlying cause. But maybe this will suffice as a stopgap until we do.
Assignee: jmaher → wlachance
Attachment #664638 - Attachment is obsolete: true
Attachment #665042 - Flags: review?(gbrown)
Attachment #665042 - Flags: review?(gpascutto)
Comment on attachment 665042 [details] [diff] [review]
Don't attempt to migrate profiles if using a custom profile

Review of attachment 665042 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for cleaning this up - my concerns are addressed. I'm still trying to better determine the cause of the failures, but don't want that to further delay this fix.
Attachment #665042 - Flags: review?(gbrown) → review+
(In reply to Geoff Brown [:gbrown] from comment #11)
> Comment on attachment 665042 [details] [diff] [review]
> Don't attempt to migrate profiles if using a custom profile
> 
> Review of attachment 665042 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Thanks for cleaning this up - my concerns are addressed. I'm still trying to
> better determine the cause of the failures, but don't want that to further
> delay this fix.

Cool. Pushed patch to try:

https://tbpl.mozilla.org/?tree=Try&rev=1ecd559ebefd
I have questions:

Comment 1 says:

"I tested deleting each file individually and if either one (or both) are missing the tests fail."

Your patch avoids reads on ProfileMigrator.xml, but not on GeckoApp.xml. This seems inconsistent. I'd like to understand if this is indeed a generic shared_prefs race issue, or if Migration is interacting with tests.

>The theory is that modifying shared preferences from two different processes simultaneously 
>was triggering some kind of race condition on the tegras that was causing the reftests to 
>fail.

What would the two processes be that modify ProfileMigrator.xml?

Note:
ProfileMigrator.xml contains a mark that indicates whether the profile has been tested for being on a SDcard, and if not it will move it to internal storage. see GeckoProfile.getDir() and ProfileMigrator.MoveProfileTask().

Does only killing that part change the problem? (I still don't see how that can race as it's synchronized, but it seems much more likely to be involved in these failures than the rest of the code).

Do we have ADB output or anything from both a successful launch + reftest run and a failed one? How does the output for Profile Migration look in both cases?

I don't really have a problem r+'ing this patch, but isolating the failure even more would be very helpful here.
(In reply to Gian-Carlo Pascutto (:gcp) from comment #13)
> I have questions:
> 
> Comment 1 says:
> 
> "I tested deleting each file individually and if either one (or both) are
> missing the tests fail."
> 
> Your patch avoids reads on ProfileMigrator.xml, but not on GeckoApp.xml.
> This seems inconsistent. I'd like to understand if this is indeed a generic
> shared_prefs race issue, or if Migration is interacting with tests.
> 
> >The theory is that modifying shared preferences from two different processes simultaneously 
> >was triggering some kind of race condition on the tegras that was causing the reftests to 
> >fail.
> 
> What would the two processes be that modify ProfileMigrator.xml?
> 
> Note:
> ProfileMigrator.xml contains a mark that indicates whether the profile has
> been tested for being on a SDcard, and if not it will move it to internal
> storage. see GeckoProfile.getDir() and ProfileMigrator.MoveProfileTask().
> 
> Does only killing that part change the problem? (I still don't see how that
> can race as it's synchronized, but it seems much more likely to be involved
> in these failures than the rest of the code).
> 
> Do we have ADB output or anything from both a successful launch + reftest
> run and a failed one? How does the output for Profile Migration look in both
> cases?
> 
> I don't really have a problem r+'ing this patch, but isolating the failure
> even more would be very helpful here.

Very good questions, these are things I'm wondering about myself. I believe :gbrown is looking to answer them.

That said, if there's no objections I'd like to commit my patch as my understanding is that this will allow us to move a whole bunch of new tegras into production.
Comment on attachment 665042 [details] [diff] [review]
Don't attempt to migrate profiles if using a custom profile

Review of attachment 665042 [details] [diff] [review]:
-----------------------------------------------------------------

r+ because it doesn't seem to break anything. Caveats to taking such a workaround in the comments above.
Attachment #665042 - Flags: review?(gpascutto) → review+
(In reply to William Lachance (:wlach) from comment #12)
> (In reply to Geoff Brown [:gbrown] from comment #11)
> > Comment on attachment 665042 [details] [diff] [review]
> > Don't attempt to migrate profiles if using a custom profile
> > 
> > Review of attachment 665042 [details] [diff] [review]:
> > -----------------------------------------------------------------
> > 
> > Thanks for cleaning this up - my concerns are addressed. I'm still trying to
> > better determine the cause of the failures, but don't want that to further
> > delay this fix.
> 
> Cool. Pushed patch to try:
> 
> https://tbpl.mozilla.org/?tree=Try&rev=1ecd559ebefd

Try run looks good.
Pushed to inbound: https://hg.mozilla.org/integration/mozilla-inbound/rev/4a4733ddb26e

Please leave open for further investigation.
Whiteboard: [leave open]
I tried re-creating the failures on foopy26/tegra-314 using border-radius tests. 

First problem: 3 tests failed to load - invalidate-1a, invalidate-1b, and scroll-1. These fail with or without :wlach's patch, after a substantial timeout. For now, I've disabled these 3 tests.

Without :wlach's patch, there are 22 failures following a fresh install; with the patch, there are none. 

From initial logcats, I note that "LayerView paint state set to 2" is missing from the failure case.
Blocks: 790685
endDrawing is not being called. LayerManagerOGL::EndTransaction *is* being called and LayerManagerOGL::Render *is* being called -- something seems to be going wrong in Render.
LayerManagerOGL::Render is returning prematurely because width == 0 and height == 0.
mbrubeck was looking at a change for talos where it was getting 0,0 and causing pageloader tests to fail.  It was a change that landed and they found a workaround for it.  Maybe it is the same cause.
In the normal/successful case, it looks like this width and height derive from nsBaseWidget::mBounds which is normally already set when nsBaseWidget::CreateCompositor is called; on the failing tegras, CreateCompositor finds mBounds set to 0, 0.
GeckoLayerClient::sendResizeEventIfNecessary sends the size from Java to Gecko as a SIZE_CHANGED event. In the reftests, Java consistently has the correct sizes and sends them. However, if the SIZE_CHANGED event is received after the compositor is created, we seem to continue to use the size in effect when the compositor was created (0,0), causing Render to abort and the reftests to fail.
(In reply to Geoff Brown [:gbrown] from comment #24)
> GeckoLayerClient::sendResizeEventIfNecessary sends the size from Java to
> Gecko as a SIZE_CHANGED event. In the reftests, Java consistently has the
> correct sizes and sends them. However, if the SIZE_CHANGED event is received
> after the compositor is created, we seem to continue to use the size in
> effect when the compositor was created (0,0), causing Render to abort and
> the reftests to fail.

Could you test this after bug 797393 lands? Perhaps the size isn't propagating because of the FrameMetrics incorrectly not syncing. This may have nothing to do with it, but worth looking.
Thanks for the pointer. I pulled the patch from the bug and re-tested: the reftest problem persists.
(In reply to Joel Maher (:jmaher) from comment #22)
> mbrubeck was looking at a change for talos where it was getting 0,0 and
> causing pageloader tests to fail.  It was a change that landed and they
> found a workaround for it.  Maybe it is the same cause.

See bug 793419 for details on this issue.
Depends on: 793419
Assignee: wlachance → gbrown
Depends on: 797942
:wlach - when bug 797942 is resolved, you should be able to back out your work-around, if you like.
Comment on attachment 665042 [details] [diff] [review]
Don't attempt to migrate profiles if using a custom profile

Review of attachment 665042 [details] [diff] [review]:
-----------------------------------------------------------------

Nomming for Gecko17 (feel free to flap over to approval-beta if need be)

This will make our tegra love much better on the reftest side, as well as is a pre-req for separately approved Bug 795053
Attachment #665042 - Flags: approval-mozilla-aurora?
(In reply to Geoff Brown [:gbrown] (PTO Oct 5 - Oct 15) from comment #28)
> :wlach - when bug 797942 is resolved, you should be able to back out your
> work-around, if you like.

Geoff, given c#29 do we really intend to keep this [leave open] or a backout?
Attachment #665042 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
reftests seem to run fine on the new tegras now. Looks like all the problems were caused by bug 797942.
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → DUPLICATE
(In reply to Justin Wood (:Callek) from comment #30)
> (In reply to Geoff Brown [:gbrown] (PTO Oct 5 - Oct 15) from comment #28)
> > :wlach - when bug 797942 is resolved, you should be able to back out your
> > work-around, if you like.
> 
> Geoff, given c#29 do we really intend to keep this [leave open] or a backout?

Given c#29, let's keep it in.
Whiteboard: [leave open]
Product: Firefox for Android → Firefox for Android Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: