Closed Bug 652050 Opened 13 years ago Closed 13 years ago

Some non-code change on 2011-04-21 caused four SVG reftests to fail on Android, due to "intl.accept_languages" being unset & "systemLanguage" checks always passing

Categories

(Core :: IPC, defect, P3)

ARM
Android
defect

Tracking

()

RESOLVED FIXED
mozilla12

People

(Reporter: philor, Assigned: dholbert)

References

Details

(Whiteboard: [android][multi-locale][test marked as fails-on-android][android_tier_2])

Attachments

(3 files)

Typing that summary made me doubt it yet again, but tbpl seems to agree with me: independent of which push or what it did before, every mobile reftest-2 which started after 14:17 today (the last one before that having started at 11:04) has failed in conditions-01.svg, dynamic-conditions-01.svg, dynamic-switch-01.svg, and switch-01.svg, though I have no clue what failing those four particular tests means.

http://dev.philringnalda.com/tbpl/?tree=Mobile&rev=7cdcae5dee49 had a green R2, then one from the nightly which failed to start (and a third which I retriggered, which comes up later). After that, we had exceptions and timeouts and skips, then the fun range, http://dev.philringnalda.com/tbpl/?tree=Mobile&fromchange=4e01eac2e447&tochange=da6f72c9e672

The first thing in that range had a timeout the first time it ran R2, (at 11:20, because it got a much slower build than the next push), then the next push regressed three *different* SVG tests, without hitting the other four, in runs starting at 10:37 and 11:04, then the last push in that range started at 15:31, and hit both those three for the last time before the cause was backed out, and also these four. In between, because I was getting suspicious about the three, I retriggered a second run on 636763da2e9a, which started at 14:17 and did not hit the three (it was from the push before the cause of them), but did hit the four.

I missed seeing that at the time, so when I started getting suspicious about it being non-code, I went back further to the skipped and timed out one, all of which hit the four on my retriggers, and back to 7cdcae5dee49 which had been green before, but on a retrigger also hit the four, so independent of what code was in the build, any run starting after somewhere between 11:04 and 14:17 will fail those four tests, even if a run on the same build and same test package passed them before.
There was an android automation that went into production today

http://hg.mozilla.org/build/buildbotcustom/rev/a8404b0bcbe4
Attached file reftest failure log
The failure in the tests indicates that a system language of "foo" is now a match!. Is it possible to check one of the devices and see whether somehow Options->Content->Languages has foo in the list somehow?
(Just to confirm what Robert said -- in all four failing tests, the problem appears to be the same -- a red rect with systemLanguage="foo" is rendering when we don't expect it to.)

As a quick diagnostic hack, we could potentially check in a cset to replace "foo" with "zomgzomg" or some other nonsense-string in one of the tests, and see if that fixes it.   (That would tell us whether we're matching "foo" in particular vs. matching everything somehow.)

Best to check the devices' languages directly as Robert suggested, though, if that's possible.
(In reply to comment #4)
> As a quick diagnostic hack, we could potentially check in a cset to replace
> "foo" with "zomgzomg" or some other nonsense-string in one of the tests

Done -- tweaked the first failing test, conditions-01.svg:
   http://hg.mozilla.org/mozilla-central/rev/ee7fef67c370
We'll see if it passes now...
Nope -- the first reftest run to complete with my tweak still failed conditions-01.svg, with the same broken rendering as in the attached reftest-log:
http://tinderbox.mozilla.org/showlog.cgi?log=Mobile/1303494555.1303496286.14526.gz

So on the test box, we're rendering this element:
> <rect systemLanguage="XXXdholbertThisIsNotALanguage" x="50" y="100" width="50" height="50" fill="red"/>

I don't know how we implement systemLanguage, but maybe the system API that we plug into is just unconditionally returning true?
Hm, I wonder if the nightly automatically tested the multi or en-US apk.
If the multi, maybe the language selection bit failed back to en-US?

Then Philor's re-run might have been en-US, which failed because it doesn't have the multi language setting.

All just a guess, but maybe we should re-try the rev before the nightly.
Aha! So we don't plug into a system API - we plug into the "intl.accept_languages" pref.  And if that pref is empty, then we accept all languages (including foo and XXXdholbertThisIsNotALanguage).  I can reproduce this locally if I clear that pref.

Reference:
http://mxr.mozilla.org/mozilla-central/source/content/svg/content/src/nsSVGFeatures.cpp?mark=221-222,225-225,228-230#218

So I'm guessing that pref is empty for some reason -- perhaps because these are yet-to-be-localized builds or something?
The pref is defined in chrome://global/locale/intl.properties, btw.  So I think this is likely because it's an un-localized build.

I think we want to be testing localized builds, right? Otherwise, we can't test locale-dependent functionality like the "systemLanguage" attribute.
I think we're testing en-US across the board.
There are no localized android builds-on-change; there are only multilocale nightlies.

Also, ignore comment 7, as we tested the en-US build in the green run.
Hm, so the log in comment 6 does definitely seem to be using an en-US build (fennec-6.0a1.en-US.eabi-arm.apk).

I actually downloaded that exact build to my phone (based on the URL in the log), and I confirmed that
 - "intl.accept_languages" is set correctly
 - it works correctly on conditions-01.svg

So the fennec-6.0a1.en-US.eabi-arm.apk builds that are purportedly being run (according to the logs) seem to be fine - this appears to be due to other aspect of the configuration.

Also: I backed out the diagnostic from comment 5:
  http://hg.mozilla.org/mozilla-central/rev/f7019527f489
(In reply to comment #1)
> There was an android automation that went into production today
> 
> http://hg.mozilla.org/build/buildbotcustom/rev/a8404b0bcbe4

Bear says that this change just made us start creating PID files.

Is there *anything* else that might have changed, automation-wise? (particularly l10n-related)

This point of philor's in particular seems to indicate that this is a regression in some part of our testing automation system:
> independent of what code
> was in the build, any run starting after somewhere between 11:04 and 14:17 will
> fail those four tests, even if a run on the same build and same test package
> passed them before.
Just to be absolutely sure about Comment 8, I propose we push this temporary diagnostic (adding 2 printf's) and then back it out once we've gotten reftest logs back.

I'm assuming we'll trigger the "intl.accept_languages is EMPTY" printf, but I think it's worth making absolutely sure, since the cause of this is remaining elusive.
Attachment #527848 - Flags: review?
Attachment #527848 - Attachment description: diagnostic #2: add printfs → temporary trivial diagnostic #2: add printfs
Attachment #527848 - Flags: review?(longsonr)
Attachment #527848 - Flags: review?(jwatt)
Attachment #527848 - Flags: review?
Attachment #527848 - Flags: review?(longsonr)
Attachment #527848 - Flags: review?(jwatt)
Attachment #527848 - Flags: review+
(pinged dougt for a quick rubber-stamp on the diagnostic, since I'd like to get this patch in & out today if possible.  Thanks dougt!)
Landed diagnostic #2: http://hg.mozilla.org/mozilla-central/rev/7eb2de708edb
Attachment #527848 - Flags: review+
d'oh... apparently on Android, printf doesn't work, and we don't have any other way to print things on Gecko that end up appearing in the reftest logs.

At dougt's suggestion, I'm going to tweak Comment 14 from printf --> printf_stderr, and then I'll try to get the logcat output (where printf_stderr apparently shows up)
[18:12]	<aki>	i'm going to guess that this is caused by installing a multilocale build and then an en-US build
[18:12]	<dholbert>	ok - that sounds reasonable
[18:12]	<aki>	not the crashing, the lang thing
[18:12]	<dholbert>	right
[18:13]	<aki>	it would be good to know how to clean up, other than reimaging

If that is the cause,
a) reliably cleaning up would be the best fix
b) reimaging/cleaning up one time, and then never ever running multilocale apks on the tegras, is another potential "fix"
(In reply to comment #15)
> At dougt's suggestion, I'm going to tweak Comment 14 from printf -->
> printf_stderr, and then I'll try to get the logcat output (where printf_stderr
> apparently shows up)

Turns out we can't reliably get logcat info right now, because telnetting into a tegra while it's running reftests seems to crash it within seconds, and telneting in afterwards is impossible because it reboots (clearing its logcat buffer) 9 seconds after reftests complete. :(

I backed out the diagnostic printf from Comment 14 -- I'm just going to assume that Comment 8 was correct, since I apparently have no way of testing it :) -- and I marked these tests as failing-on-android for now, to stop the permaorange:
  http://hg.mozilla.org/mozilla-central/rev/3589bf459f85
You could write debug text directly into the reftests using a small lime font positioned within the red failing squares.
Of course you'd have to put something in the reftest with an unlikely to be duplicated id so that the backend could find it and manipulate it to do its easter egg style stuff. Too complicated perhaps?
since reftest uses messages manager to load pages, can we just send debug messages to message manager and listen for them in the reftest harness?  On Android, we print all output from the harness to a log file (inside the chrome extension).  It shouldn't be too hard to do that from inside a reftest as well.
Prooobably too complicated / too much work given that we're already pretty sure about comment 8 & comment 16, and given that I have higher priority bugs I'm supposed to be working on anyway. :) (and given that this bug's perma-orange pain has now been silenced with known-fail annotations)

If someone else wants to do something along the lines of comment 19/20, though, feel free!
Whiteboard: [android][multi-locale]
Summary: Some non-code change on 2011-04-21 caused four SVG reftests to fail on Android → Some non-code change on 2011-04-21 caused four SVG reftests to fail on Android, due to "systemLanguage" checks always passing
Whiteboard: [android][multi-locale] → [android][multi-locale][test marked as fails-on-android]
aki,

is this something that your still working on or should we bounce it over to ateam for their followup?
Assignee: nobody → aki
(In reply to comment #22)
> aki,
> 
> is this something that your still working on or should we bounce it over to
> ateam for their followup?

Aki: any follow-up here?
(In reply to comment #23)
> (In reply to comment #22)
> > aki,
> > 
> > is this something that your still working on or should we bounce it over to
> > ateam for their followup?
> 
> Aki: any follow-up here?

Sorry about that, I must have missed the original comment.

To directly answer the question, I'm not actively working on this.

However, per comment 16, I think that having ever installed a multilocale apk on a tegra is the culprit here.

Bug 657413 may have fixed this, though someone needs to verify.  Assuming that uninstalling the previous apk removes the profile, and assuming that the previous multilocale apk install is the culprit, this bug should be fixed.

If uninstalling the previous apk doesn't remove the profile, we should probably manually nuke the profile every test run.

If a previous multilocale apk install is not the culprit, we're potentially back at square one here in determining the root cause.
> Bug 657413 may have fixed this, though someone needs to verify.

The tests involved are marked as "fails-on-Android" since comment 17, so verification can consist of watching for orange UNEXPECTED-PASS in Android reftest runs on Mobile tbpl.

(If we don't see that, then this isn't fixed.)
(In reply to comment #25)
> > Bug 657413 may have fixed this, though someone needs to verify.
> 
> The tests involved are marked as "fails-on-Android" since comment 17, so
> verification can consist of watching for orange UNEXPECTED-PASS in Android
> reftest runs on Mobile tbpl.
> 
> (If we don't see that, then this isn't fixed.)

Assuming that hasn't happened, since there's been no traction in this bug.  So what's our state?  This is now a known failure?  The test harnesses themselves should clean up the profile we put on the device in their cleanup step.
We expect the profile to contain at least one language. It does on desktop and used to on mobile but now it doesn't. Options->Content->Languages.
(In reply to comment #27)
> We expect the profile to contain at least one language. It does on desktop
> and used to on mobile but now it doesn't. Options->Content->Languages.

(To clarify -- as noted in comment 11, everything does seem to be fine when *manually* running our builds on a mobile phone.  This is something specific to running under our test-harness.)
What does intl.properties look like on the devices? Can we upload that file from one?
Although that file is probably shipped with each build downloaded to the device.
Assignee: aki → nobody
Priority: -- → P3
Whiteboard: [android][multi-locale][test marked as fails-on-android] → [android][multi-locale][test marked as fails-on-android][android_tier_1]
Assignee: nobody → blassey.bugs
I guess the obvious path to success is for somebody who's familiar with how to debug Fennec to set a breakpoint in nsSVGFeatures::PassesConditionalProcessingTests and see what happens there.
(In reply to Ehsan Akhgari [:ehsan] from comment #31)
> I guess the obvious path to success is for somebody who's familiar with how
> to debug Fennec to set a breakpoint

Note: Per comment 11, I was unable to reproduce the test-failure locally (with the same build that the test box ran & failed on) -- so I'm not sure that local debugging will get anywhere.

Rather, I think we need to debug this _on the test boxes_ with added diagnostic output.  (which I tried, but gave up in comment 17 since at that point there was no reliable way to capture output from device during test-runs tests.  Maybe the situation has improved now?)
(In reply to comment #32)
> (In reply to Ehsan Akhgari [:ehsan] from comment #31)
> > I guess the obvious path to success is for somebody who's familiar with how
> > to debug Fennec to set a breakpoint
> 
> Note: Per comment 11, I was unable to reproduce the test-failure locally (with
> the same build that the test box ran & failed on) -- so I'm not sure that local
> debugging will get anywhere.
> 
> Rather, I think we need to debug this _on the test boxes_ with added diagnostic
> output.  (which I tried, but gave up in comment 17 since at that point there
> was no reliable way to capture output from device during test-runs tests. 
> Maybe the situation has improved now?)

Can we get you access to a test slave?
Sounds good to me - I'd be open to taking another look at this, given a dedicated test slave.  Aki, would you be able to loan me one? (or, do you know who I should talk to about that?)
tegra-030. I'll file a bug for IT to get that to you.
Depends on: 679890
Ok, I've got the tegra running a locally-made debug fennec build, and I've figured out how to get it to run reftests.

So far, I've found:
 - I can reproduce (TEST-EXPECTED-FAIL) when running remotereftest.py
 - When I load the testcases directly, in my build on the device, I get correct/expected behavior.

So this is definitely some artifact of the test harness itself.
I applied the diagnostic from Comment 14 locally (with s/printf/printf_stderr/), and I verified in logcat that intl.accept_languages is empty in the reftest process.
Summary: Some non-code change on 2011-04-21 caused four SVG reftests to fail on Android, due to "systemLanguage" checks always passing → Some non-code change on 2011-04-21 caused four SVG reftests to fail on Android, due to "intl.accept_languages" being unset & "systemLanguage" checks always passing
Assignee: blassey.bugs → dholbert
One more data point: these properties (all strings-valued, all defined in all.js) all are empty during the reftest process, too:
  network.proxy.no_proxies_on
  intl.charsetmenu.browser.more1
  intl.accept_languages
  intl.menuitems.alwaysappendaccesskeys
  intl.menuitems.insertseparatorbeforeaccesskeys
  intl.charsetmenu.mailedit
  intl.charset.detector
  font.language.group

But these three are nonempty:
  intl.charsetmenu.browser.static
  intl.charset.default
  intl.ellipsis

The first nonempty one (intl.charsetmenu.browser.static) stood out to me in particular, since it's got the same value (a properties pointer) in all.js as intl.accept_languages:
> 1079 pref("intl.accept_languages",               "chrome://global/locale/intl.properties");
[...]
> 1082 pref("intl.charsetmenu.browser.static",     "chrome://global/locale/intl.properties");

HOWEVER: it's actually overridden by a different value in mobile.js:
> pref("intl.charsetmenu.browser.static", "chrome://browser/locale/browser.properties");

And if I tweak that mobile.js line, to match the line from all.js, then that (.static) is reported as being empty, too.

So: It appears that the prefs service is able to get prefs from  chrome://browser/locale/browser.properties, but not from chrome://global/locale/intl.properties, for some reason.  (note the "//browser/" vs "//global/" difference)
The other two nonempty prefs that I pointed out (intl.charset.default &   intl.ellipsis ) both point to "chrome://global-platform/locale/intl.properties" -- so our reftest process has access to that, too.
Sorry, disregard Comment 39/40 -- I was using Preferences::GetLocalizedString() to look up all those prefs, which only makes sense for prefs that are indirectly set via a chrome:// URI.

I've got some more-useful data, though -- it appears that we're failing due to an early return in nsStringBundle::LoadProperties().  Specifically, that function thinks it's failed in the past, so it returns early -- BUT if I let it continue, it works fine!  I commented it out here on TryServer, and it passed the 4 SVG tests in question:
  http://hg.mozilla.org/try/rev/c415cfcdcdfc

(nsStringBundle::LoadProperties is the function that we use to actually read chrome://global/locale/intl.properties, BTW.)


Investigating more...  Getting close now, I think.
OK - so the nsStringBundle::LoadProperties call ultimately ends up relying on a call to nsChromeRegistryContent::GetFlagsFromPackage("global"), which fails the first time we call it, because that object's 'mPackagesHash' is still empty.

A bit later on, we register a bunch of packages with the nsChromeRegistryContent.  However, we've already failed our first nsStringBundle::LoadProperties call, so we insist on making all future calls fail.

I'm guessing the package-registration is supposed to happen before pretty much anything else.  But we're somehow getting an earlier call to nsStringBundle::LoadProperties, which is fails, and that failure is sticky.
Looking at this with dholbert, it appears the issue is that ContentParent launches the child process, then immediately fires off SendRegisterChrome(..). However the child tries to access the registry *before* that message is received and processed - looks like it loads XPCOM services and during that it looks for the registry.

I am not sure if there is a way for our IPC code to guarantee that something is received and handled before anything else happens in the child process (or if it even makes sense to do that). Perhaps the best solution is not to rely on IPC for this? We can send the information for SendRegisterChrome(..) by putting it in a file or in the environment if it's small enough, both of which would not rely on message timing.
Here's a snippet of printfs-in-logcat output, fleshing out comment 42 / comment 43 a bit.

It shows the content parent spawning the child process and calling 
SendRegisteredChrome -- but before the child receives that message, it hits nsHttpHandler::Init, which tries to access the intl.accept_languages pref, and that fails (at the line flagged "***BADNESS***") because we haven't received its properties-bundle yet.  (That happens in the final line of the attached log, inside of RecvRegisterChrome)
(In reply to Daniel Holbert [:dholbert] from comment #44)
> and that fails (at the line flagged "***BADNESS***") because we haven't
> received its properties-bundle yet.
(er, by "properties-bundle" I meant "package", I guess; that is, the discrete bundles of prefs stored by nsChromeRegistryContent, registered in nsChromeRegistryContent::RegisterPackage)
bsmedberg: would you be ok with the file solution in comment 43? Or is there a better way to get information to the child process that is guaranteed to arrive before the child process starts running things?
(sorry, forgot the cc for comment 46)
Actually, couldn't we just put off accessing the pref? That seems like a much easier solution to me.
That might or might not be easier, I'm not sure, but it sounds like a workaround. In my opinion changing things to ensure the proper delivery order sounds better. And transferring this data via a file sounds very straightforward. Unless of course this is already the right delivery order, and accessing the pref that early is in violation of how things are expected to work.
No update here in over 2 weeks - any progress on this android_tier_1 bug?
No progress -- I was waiting on an answer to Comment 46 before proceeding.

(To be clear, this bug doesn't affect any "real" users on Android as far as we can tell; it only seems to reproduce when firefox is run in reftest-mode.)
It's kindof a hack that we can't just send that data when we set things up: I think our good options here are:

* send the data when we create the channel (no IPDL support for this, though)
* don't start anything (XPCOM/etc) until the RegisterChrome message has been received (probably pretty easy)
* don't read the pref until later (also pretty easy)

Transferring the data via file sounds pretty horrid to me, actually, since it introduces yet another communication mechanism.
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #52)
> * don't start anything (XPCOM/etc) until the RegisterChrome message has been
> received (probably pretty easy)

This seems like the most robust & future-proof strategy to me. (since it seems like this could crop up with other prefs, too, at some point)

Is there some master "start xpcom/etc" function that we could move inside of ContentChild::RecvRegisterChrome?
We'll want to make sure that we don't regress first-content-tab loading time with this change. We might be safe because we pre-init the content process on the start screen, IIRC.
dhobert, what's the status of this bug? are you still working on it?
I've been working on other bugs while tentatively waiting on a response to comment 53, to implement bsmedberg's second suggestion. (been intending to bug someone more actively about it)

(If anyone more familiar with the mobile startup / cross-process code is interested in taking this bug, I'm happy to have it stolen from me. :) I originally took it because it was SVG-related, but it looks like the fix isn't SVG-related at all & is in code that I'm less familiar with.  Otherwise, I can follow up on this in the next week or two, though.)
Component: Release Engineering → IPC
Product: mozilla.org → Core
QA Contact: release → ipc
Version: other → Trunk
Summary: Some non-code change on 2011-04-21 caused four SVG reftests to fail on Android, due to "intl.accept_languages" being unset & "systemLanguage" checks always passing → Mobile startup race condition w/ nsChromeRegistryContent leaves pref "intl.accept_languages" empty during mobile reftests & causes 4 SVG reftest failures
Whiteboard: [android][multi-locale][test marked as fails-on-android][android_tier_1] → [android][multi-locale][test marked as fails-on-android][android_tier_2]
pushed https://hg.mozilla.org/integration/mozilla-inbound/rev/0341759ffb4c to mark tests as passing now.
Depends on: 607854
Target Milestone: --- → mozilla12
(In reply to Robert Longson from comment #57)
> pushed https://hg.mozilla.org/integration/mozilla-inbound/rev/0341759ffb4c
> to mark tests as passing now.

Oh, awesome!  Hmm... So that was from Bug 607854 landing, which (among other things) changed how we check for SVG requiredFeatures & systemLanguage, which is the way that this bug ended up tickling the underlying android issue.

IIUC, it seems like we might still have the underlying problem (potential race condition with pref-reading at startup, in lightweight environments like the reftest harness), and it's just not exposed by these SVG tests anymore.

On the other hand, if it's not causing problems with automated tests anymore (and doesn't affect actual users -- which it doesn't, as of comment 51), then I'm more than happy to tentatively consider this fixed-ish.
(In reply to Daniel Holbert [:dholbert] from comment #58)
> IIUC, it seems like we might still have the underlying problem (potential
> race condition with pref-reading at startup, in lightweight environments
> like the reftest harness), and it's just not exposed by these SVG tests
> anymore.

That is indeed still true.
https://hg.mozilla.org/mozilla-central/rev/0341759ffb4c

For some reason, we still don't have a DONTCAREUNLESSITBITESMEAGAIN resolution, settling for FIXED.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
(In reply to Phil Ringnalda (:philor) from comment #60)
> For some reason, we still don't have a DONTCAREUNLESSITBITESMEAGAIN
> resolution, settling for FIXED.

Haha. :) Restoring an earlier bug-summary, to be clear about what's actually FIXED here.
Summary: Mobile startup race condition w/ nsChromeRegistryContent leaves pref "intl.accept_languages" empty during mobile reftests & causes 4 SVG reftest failures → Some non-code change on 2011-04-21 caused four SVG reftests to fail on Android, due to "intl.accept_languages" being unset & "systemLanguage" checks always passing
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: