Closed Bug 1039881 Opened 10 years ago Closed 10 years ago

use an empty directory tiles data source pref for beta 32

Categories

(Firefox :: New Tab Page, defect)

defect
Not set
normal

Tracking

()

VERIFIED FIXED
Firefox 33
Iteration:
34.2
Tracking Status
firefox30 --- unaffected
firefox31 --- unaffected
firefox32 + verified
firefox33 --- unaffected

People

(Reporter: clarkbw, Assigned: Mardak)

References

Details

Attachments

(1 file, 2 obsolete files)

Similar to bug 993581 we need to use the empty set of directory tiles for the coming uplift into Beta so we don't turn on Directory Tiles just yet.
Ed - Should be a simple fix. Are you going to handle this with review from adw as you did in bug 993581?
Attached patch v1 (obsolete) — Splinter Review
Assignee: nobody → edilee
Status: NEW → ASSIGNED
Attachment #8458053 - Flags: review?(adw)
Flags: needinfo?(edilee)
Attachment #8458053 - Flags: review?(adw) → review+
Attached patch for checkin (obsolete) — Splinter Review
Approval Request Comment
[Feature/regressing bug #]: revert bug 1000459 (reverting bug 993581)
[User impact if declined]: directory tiles/sponsored will move to beta
[Describe test coverage new/current, TBPL]: https://tbpl.mozilla.org/?tree=Try&rev=0768f53e2850
[Risks and why]: low risk - just a pref change to turn off data
[String/UUID change made/needed]: none
Attachment #8458500 - Flags: approval-mozilla-aurora?
Attachment #8458500 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
https://hg.mozilla.org/releases/mozilla-aurora/rev/7da62c51fa97
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 32
Backed out per bug 1038997 comment 17.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: Firefox 32 → ---
https://hg.mozilla.org/mozilla-central/rev/9365ece2ed90
Status: REOPENED → RESOLVED
Closed: 10 years ago10 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Target Milestone: --- → Firefox 33
The mozilla-central patch was just the tests that were part of this empty tiles for 32 patch. I'm still investigating if this is causing the tart crashes per bug 1038997 comment 17.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
This patch is indeed the reason for the tart crash:
https://tbpl.mozilla.org/?tree=Try&rev=ec96d835ce7e

This is very odd as 31 does not crash and already has the pref set to {}:

https://hg.mozilla.org/releases/mozilla-release/annotate/cd52a7f89548/browser/app/profile/firefox.js#l1455

A separate bug 1026561 that does a simple change of a pref also seems to cause these tart crashes.

Original push to fx-team:
https://tbpl.mozilla.org/?rev=ea0e9e117349&tree=Mozilla-Aurora
mac: https://tbpl.mozilla.org/php/getParsedLog.php?id=44125722&tree=Mozilla-Aurora
lin: https://tbpl.mozilla.org/php/getParsedLog.php?id=44126001&tree=Mozilla-Aurora
Ed, do you understand the reason for the crash? It seems a TART crash happens both here and in bug 1026561, could you please summarize these cases in plain English (i.e. without me having to read the patch and interpret what they do)?
Flags: needinfo?(edilee)
See Also: → 1026561
I don't understand why it's crashing. This bug 1039881 turns off directory tiles (which would have shown content instead of empty). Bug 1026561 increases the number of columns shown up from 3, so more tiles are shown.

I suppose the common thing here is more empty tiles will be shown where the former makes all tiles empty and the latter increases the number of tiles more than the data we have available.

The thing is this bug's patch is effectively already in 31 (31 has the pref already set to {}), so I would assume the same tart is running for 31. And because it's not crashing for 31, there must be some new code introduced in 32 that is also in 33 causing the crash.
Flags: needinfo?(edilee)
(In reply to Ed Lee :Mardak from comment #10)
> I don't understand why it's crashing. This bug 1039881 turns off directory
> tiles (which would have shown content instead of empty). Bug 1026561
> increases the number of columns shown up from 3, so more tiles are shown.
> 
> I suppose the common thing here is more empty tiles will be shown where the
> former makes all tiles empty and the latter increases the number of tiles
> more than the data we have available.

Of the two bugs, this one (bug 1039881) looks to me the less obtrusive one, at least as far as preload goes, and therefore probably the better one to investigate WRT the crash.

We didn't have TART related crashes before DT were added nor after DT were added, but now disabling them appears to be triggering a crash.

It looks to me important to get to the bottom of this crash even if we can somehow make it go away with little effort.
I'm running TART with the build from https://tbpl.mozilla.org/?rev=ea0e9e117349&tree=Mozilla-Aurora, and it seems to be running and showing numbers just fine.

jmaher noted in bug 1026561 comment 33 that "Sometimes I see a failure to shutdown (timeout) and we then kill the process which looks like a crash."

So it somewhat seems like a crash introduced with some new newtab functionality and the talos infrastructure, yet oddly it doesn't crash until either this bug 1039881's patch to remove the data or bug 1026561's patch to increase the number of columns.
Ed, any chance you could own this issue and lead the investigation actively till it's fixed? Talk to others who were involved in the past with the newtab page or investigate it yourself, just.. find it :)

I'm not going away, but since TART doesn't do "weird stuff" with the browser, I'm currently suspecting it's not TART's fault. I'm sticking around and will help where I can, but I also don't want to lead this investigation because the newtab page is really not my code.
(In reply to Avi Halachmi (:avih) from comment #13)
> Ed, any chance you could own this issue and lead the investigation actively
> till it's fixed?
Yes, I've been investigating what code could have introduced this crash for 33/32 but not crash for 31.
Flags: firefox-backlog?
Attachment #8458500 - Flags: approval-mozilla-aurora+
(In reply to Ed Lee :Mardak from comment #14)
> (In reply to Avi Halachmi (:avih) from comment #13)
> > Ed, any chance you could own this issue and lead the investigation actively
> > till it's fixed?
> Yes, I've been investigating what code could have introduced this crash for
> 33/32 but not crash for 31.

@Ed,

Any progress with the 32/33 investigation? Did it also crash on 31 - which is the main version already?

What's the general plan for directory tiles? i.e. do we keep delaying it "as much as we need"? or is there some goal for a version we want it to finally ship?

@Lawrence,

Regardless, I think that for features which we know will not ride the trains for a while, like directory tiles, or OMTC, we should also regularly test the variant which WILL ride the trains.

In the case of this bug, it would mean also regularly testing the directory-tiles off variant BEFORE the trains start moving, because while we regularly test the DT-on variant, the off variant only gets tested when an uplift patch lands - which is generally very bad timing to discover regressions.

What do you think of this, and in general for other features which we know will not ride the trains for a while?
Flags: needinfo?(lmandel)
Flags: needinfo?(edilee)
Summary: use an empty directory tiles data source pref before uplift → use an empty directory tiles data source pref for beta 32
(In reply to Avi Halachmi (:avih) from comment #15)
> Regardless, I think that for features which we know will not ride the trains
> for a while, like directory tiles, or OMTC, we should also regularly test
> the variant which WILL ride the trains.
>
> In the case of this bug, it would mean also regularly testing the
> directory-tiles off variant BEFORE the trains start moving, because while we
> regularly test the DT-on variant, the off variant only gets tested when an
> uplift patch lands - which is generally very bad timing to discover
> regressions.
> 
> What do you think of this, and in general for other features which we know
> will not ride the trains for a while?

In many cases the variant that will ride the trains has already been in the product for a while and is fairly well tested. However, there are times that we should do what you've proposed. Telemetry experiments provides the facility to do this and I'm pleased to see that we're already making use of it with some features. I think that it's reasonable to have new features enabled for all on Nightly. If we don't intend to ship a feature in a specific cycle, we should determine whether we can test it on Aurora and Beta with experiments rather than with the whole population for the reason you cited.
Flags: needinfo?(lmandel)
(In reply to Avi Halachmi (:avih) from comment #15)
> What's the general plan for directory tiles? i.e. do we keep delaying it "as
> much as we need"? or is there some goal for a version we want it to finally
> ship?

We're targeting 33 for Directory Tiles. We have a set of fixes landing now that should uplift into Aurora to target Beta on Sept. 2nd.
I did some debugging of the tart crash, and it seems to happen when there aren't enough directory tiles to fill up the 9 tiles. E.g., remove 3 sponsored tiles from the packaged 10 and it crashes because there's space for 2 non-directory tiles. This would also explain why bug 1026561 caused the same crash -- more columns means more empty tiles. Same thing for the purpose of this bug: remove all directory tiles, so all 9 tiles are available for history.

Digging into the differences between crash and not, the first difference in the full log:

exception getting privileged access, defaulting to XUL_FENNEC__browserInfo
(148 instances)

This line comes from http://hg.mozilla.org/build/talos/file/tip/talos/getInfo.html

The odd thing here is just a few lines higher, it reports the end of tsvgr_opacity and the correct browserInfo:

INFO -  Completed test tsvgr_opacity:
..
INFO -  __browserInfo
INFO -  browser_name:Firefox

I also see a bunch MessageChannel errors:
INFO -  ###!!! [Parent][MessageChannel] Error: Channel error: cannot send/recv


A wild guess right now is perhaps something is appearing as a history tile and maybe triggering a thumbnail fetch. It may be some unrelated change and the 9 directory tiles happened to mask the issue... ???
Flags: needinfo?(edilee)
(In reply to Ed Lee :Mardak from comment #18)
> A wild guess right now is perhaps something is appearing as a history tile
> and maybe triggering a thumbnail fetch. It may be some unrelated change and
> the 9 directory tiles happened to mask the issue... ???
This might not be too far off it seems.

https://tbpl.mozilla.org/?tree=Try&rev=f5d21d53fe28
https://hg.mozilla.org/try/rev/f5d21d53fe28
     1.2 +++ b/toolkit/components/thumbnails/BackgroundPageThumbs.jsm
    1.11    captureIfMissing: function (url, options={}) {
    1.12 +return;

No crash during tart when background page thumbs capture is turned off.


I also tried applying the patch to m-c, and there's no crash either.
https://tbpl.mozilla.org/?tree=Try&rev=038a9480c01a
Nice debugging!  Background thumbnailing is disabled on mochitests, reftests, and crashtests.  Maybe it should be disabled on TART, too?
(In reply to Drew Willcoxon :adw from comment #20)
> Nice debugging!  Background thumbnailing is disabled on mochitests,
> reftests, and crashtests.  Maybe it should be disabled on TART, too?

Why would background thumbnailing cause issues during talos runs?

Also, while it does make some sense to disable it when we're testing talos stuff (like scroll or tab animations), if taking thumbnails affects animations, then I'd argue that we should _somehow_ not ignore it completely.

We could do this either by keeping it working during tests, or have some tests which explicitly measures the effect of thumbnailing on animations.
I don't know.  Background thumbnailing uses a remote browser.  The crash is probably related to that somehow.  But I'm unclear -- is this a real crash, or is it a timeout after which the harness intentionally crashes the browser?

I agree we shouldn't disable it if it affects talos.  Ideally we'd get to the bottom of the crash and fix it.  If that's difficult to do by the time we need to land this patch, then I think it would be fine to disable background thumbnailing on talos and file a follow-up for finding the source of the crash, and once that's done, re-enable it.

newtab starts background-capturing missing thumbnails when it becomes visible, i.e., when it moves out of the preloader, so it's conceivable that it affects talos.  However, it loads pages and captures their windows in an OOP browser, and it does its I/O in a chrome worker, so I would expect the effect to be small.  But it lazily creates the OOP browser the first time a capture happens (and also the first time after 60s without a capture), and that probably has a larger effect.
(In reply to Drew Willcoxon :adw from comment #22)
> I don't know.  Background thumbnailing uses a remote browser.  The crash is
> probably related to that somehow.  But I'm unclear -- is this a real crash,
> or is it a timeout after which the harness intentionally crashes the browser?

I _think_ it's a real crash, though Joel would know more.

Joel - does talos "crash" or otherwise terminate the browser which appears as crash at the logs - in case of timeouts?


> I agree we shouldn't disable it if it affects talos.  Ideally we'd get to
> the bottom of the crash and fix it.

Agreed.

> If that's difficult to do by the time
> we need to land this patch, then I think it would be fine to disable
> background thumbnailing on talos and file a follow-up for finding the source
> of the crash, and once that's done, re-enable it.

Agreed.
Flags: needinfo?(jmaher)
bholley, we're getting a bunch of talos/tart crashes when landing this patch on mozilla-beta.

PROCESS-CRASH | tart | application crashed [@ nsGlobalWindow::SetNewDocument(nsIDocument*, nsISupports*, bool)]

The test failure was masked because it seems to only happen if background thumbnails would be captured on the new tab page, but because there were directory tiles filling up the new tab, the codepath for empty tiles -> fetch thumbnail was not triggered.

I bisected by switching to an empty directory tiles (the patch for this bug) for various revisions, and the last green:
https://tbpl.mozilla.org/?tree=Try&rev=9b973ab273ab
Bug 997440 - Stop associating XPCWrappedNativeScopes with an XPCContext. r=bz 
https://hg.mozilla.org/releases/mozilla-beta/rev/f22f1c4043f9

And the first red:
https://tbpl.mozilla.org/?tree=Try&rev=a9151191b853
Bug 997440 - Remove cx pushing in SetNewDocument. r=bz
https://hg.mozilla.org/releases/mozilla-beta/rev/f135d7122262

mozilla-central seems to be green when applying the patch for this bug, but we're trying to land this patch on mozilla-beta.

Should backing out that one revision (f135d7122262) be okay for beta?
Depends on: 997440
Flags: needinfo?(bobbyholley)
Not sure if these are relevant, but the main difference in the full log between the last green and first red:

###!!! [Parent][MessageChannel] Error: Channel error: cannot send/recv
160 of these

 0  libxul.so!nsGlobalWindow::SetNewDocument(nsIDocument*, nsISupports*, bool) [nsGlobalWindow.cpp:a9151191b853 : 2358 + 0x0]
 1  libxul.so!nsDocumentViewer::InitInternal(nsIWidget*, nsISupports*, nsIntRect const&, bool, bool, bool) [nsDocumentViewer.cpp:a9151191b853 : 891 + 0x13]
 2  libxul.so!nsDocumentViewer::Init(nsIWidget*, nsIntRect const&) [nsDocumentViewer.cpp:a9151191b853 : 632 + 0x15]
 3  libxul.so!nsDocShell::SetupNewViewer(nsIContentViewer*) [nsDocShell.cpp:a9151191b853 : 8574 + 0x15]
 4  libxul.so!nsDocShell::Embed(nsIContentViewer*, char const*, nsISupports*) [nsDocShell.cpp:a9151191b853 : 6584 + 0xb]
 5  libxul.so!nsDocShell::CreateContentViewer(char const*, nsIRequest*, nsIStreamListener**) [nsDocShell.cpp:a9151191b853 : 8369 + 0x14]
 6  libxul.so!nsDSURIContentListener::DoContent(char const*, bool, nsIRequest*, nsIStreamListener**, bool*) [nsDSURIContentListener.cpp:a9151191b853 : 122 + 0x14]
 7  libxul.so!nsDocumentOpenInfo::TryContentListener(nsIURIContentListener*, nsIChannel*) [nsURILoader.cpp:a9151191b853 : 714 + 0x20]
 8  libxul.so!nsDocumentOpenInfo::DispatchContent(nsIRequest*, nsISupports*) [nsURILoader.cpp:a9151191b853 : 386 + 0x15]
 9  libxul.so!nsDocumentOpenInfo::OnStartRequest(nsIRequest*, nsISupports*) [nsURILoader.cpp:a9151191b853 : 262 + 0xc]
10  libxul.so!nsBaseChannel::OnStartRequest(nsIRequest*, nsISupports*) [nsBaseChannel.cpp:a9151191b853 : 715 + 0x10]
11  libxul.so!nsInputStreamPump::OnStateStart() [nsInputStreamPump.cpp:a9151191b853 : 517 + 0xc]
12  libxul.so!nsInputStreamPump::OnInputStreamReady(nsIAsyncInputStream*) [nsInputStreamPump.cpp:a9151191b853 : 431 + 0x8]
13  libxul.so!nsInputStreamReadyEvent::Run() [nsStreamUtils.cpp:a9151191b853 : 88 + 0x8]
14  libxul.so!nsThread::ProcessNextEvent(bool, bool*) [nsThread.cpp:a9151191b853 : 715 + 0x5]
15  libxul.so!NS_ProcessNextEvent(nsIThread*, bool) [nsThreadUtils.cpp:a9151191b853 : 263 + 0xf]
16  libxul.so!nsThread::Shutdown() [nsThread.cpp:a9151191b853 : 559 + 0xb]
17  libxul.so!mozilla::dom::EncodingCompleteEvent::Run() [ImageEncoder.cpp:a9151191b853 : 58 + 0x8]
18  libxul.so!nsThread::ProcessNextEvent(bool, bool*) [nsThread.cpp:a9151191b853 : 715 + 0x5]
19  libxul.so!NS_ProcessNextEvent(nsIThread*, bool) [nsThreadUtils.cpp:a9151191b853 : 263 + 0xf]
20  libxul.so!mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) [MessagePump.cpp:a9151191b853 : 136 + 0xb]
21  libxul.so!MessageLoop::RunInternal() [message_loop.cc:a9151191b853 : 229 + 0x6]
22  libxul.so!MessageLoop::Run() [message_loop.cc:a9151191b853 : 222 + 0x7]
23  libxul.so!nsBaseAppShell::Run() [nsBaseAppShell.cpp:a9151191b853 : 164 + 0xd]
24  libxul.so!XRE_RunAppShell [nsEmbedFunctions.cpp:a9151191b853 : 690 + 0x8]
25  libxul.so!mozilla::ipc::MessagePumpForChildProcess::Run(base::MessagePump::Delegate*) [MessagePump.cpp:a9151191b853 : 253 + 0x4]
26  libxul.so!MessageLoop::RunInternal() [message_loop.cc:a9151191b853 : 229 + 0x6]
27  libxul.so!MessageLoop::Run() [message_loop.cc:a9151191b853 : 222 + 0x7]
28  libxul.so!XRE_InitChildProcess [nsEmbedFunctions.cpp:a9151191b853 : 527 + 0xe]
29  plugin-container!main [MozillaRuntimeMain.cpp:a9151191b853 : 149 + 0xa]

In particular line 17 seems like the thing related to the background thumbnailing?
17  libxul.so!mozilla::dom::EncodingCompleteEvent::Run() [ImageEncoder.cpp:a9151191b853 : 58 + 0x8]
Per IRC discussion, I think we just want to add:

NS_ENSURE_TRUE(GetContextInternal(), NS_ERROR_NOT_INITIALIZED);

To the top of that function.
Flags: needinfo?(bobbyholley)
Attached patch v2Splinter Review
Thanks so much for the help bholley! No more crash when applying this patch to beta!
https://tbpl.mozilla.org/?tree=Try&rev=4ba61f0d4d3d
Attachment #8458053 - Attachment is obsolete: true
Attachment #8458500 - Attachment is obsolete: true
Attachment #8474081 - Flags: review?(bobbyholley)
Comment on attachment 8474081 [details] [diff] [review]
v2

Review of attachment 8474081 [details] [diff] [review]:
-----------------------------------------------------------------

r=me on the change in nsGlobalWindow.cpp.
Attachment #8474081 - Flags: review?(bobbyholley) → review+
Comment on attachment 8474081 [details] [diff] [review]
v2

Approval Request Comment
[Feature/regressing bug #]: revert bug 1000459 (reverting bug 993581)
[User impact if declined]: directory tiles/sponsored will be in 32 release
[Describe test coverage new/current, TBPL]: https://tbpl.mozilla.org/?tree=Try&rev=4ba61f0d4d3d
[Risks and why]: low risk - just a pref change to turn off data (same as on 33 release)
[String/UUID change made/needed]: none
Attachment #8474081 - Flags: approval-mozilla-beta?
(In reply to Ed Lee :Mardak from comment #29)
> pref change to turn off data (same as on 33 release)
I meant 31 release.
Ed, Bobby great work! thanks :)

> NS_ENSURE_TRUE(GetContextInternal(), NS_ERROR_NOT_INITIALIZED);

Bobby, do we know why it crashes without the patch? Do we also need this patch in m-c?

If we need it on m-c, then it's probably better to land and uplift this patch as its own bug ("Prevent crash <something> thumbs <something>" ?) regardless of this bug - though these are technicalities.

Thanks for the good work guys.
great work:)
Flags: needinfo?(jmaher)
Pretty sure this isn't needed on nightly as there's no crash there. Although I didn't bisect to find what changeset fixed the problem.
(In reply to Ed Lee :Mardak from comment #33)
> Pretty sure this isn't needed on nightly as there's no crash there.
Forgot to mention "anymore." It was fixed since June 15 when the columns patched tried to land but triggered the same crash in bug 1026561 comment 27.
Comment on attachment 8474081 [details] [diff] [review]
v2

As Ed said, we need to turn off directory tiles in 32. beta+
Attachment #8474081 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
https://hg.mozilla.org/releases/mozilla-beta/rev/6790f9333fec
Status: REOPENED → RESOLVED
Closed: 10 years ago10 years ago
Resolution: --- → FIXED
Iteration: --- → 34.2
Flags: qe-verify?
Flags: firefox-backlog?
Flags: firefox-backlog+
Flags: qe-verify? → qe-verify+
(In reply to Avi Halachmi (:avih) from comment #31)
> Ed, Bobby great work! thanks :)
> 
> > NS_ENSURE_TRUE(GetContextInternal(), NS_ERROR_NOT_INITIALIZED);
> 
> Bobby, do we know why it crashes without the patch? Do we also need this
> patch in m-c?

Hm, looking at the code, I don't see any reason why the same crash couldn't theoretically happen on m-c. We might as well be safe rather than handling this reactively. Mardak, would you mind landing your patch on m-c and aurora as well?
Flags: needinfo?(edilee)
Blocks: 1055150
(In reply to Bobby Holley (:bholley) from comment #37)
> Mardak, would you mind landing your patch on m-c and aurora as well?
I filed bug 1055150 and as smaug suggested it might be a dupe of bug 1031303.

I tried applying the {} pref from this bug to the commit just before bug 1031303 landing, and it did indeed cause the tart crash; and for the next commit with the fix, tart stopped crashing.
Flags: needinfo?(edilee)
QA Contact: cornel.ionce
I can confirm that an empty directory tiles data source pref for beta 32 is used, as the "browser.newtabpage.directory.source" pref ise set to "data:application/json,{}" by default on Firefox 32 beta 8, build ID: 20140818191513.
Status: RESOLVED → VERIFIED
(In reply to (PTO 8/22 - 9/1) from comment #37)
> (In reply to Avi Halachmi (:avih) from comment #31)
> > Ed, Bobby great work! thanks :)
> > 
> > > NS_ENSURE_TRUE(GetContextInternal(), NS_ERROR_NOT_INITIALIZED);
> > 
> > Bobby, do we know why it crashes without the patch? Do we also need this
> > patch in m-c?
> 
> Hm, looking at the code, I don't see any reason why the same crash couldn't
> theoretically happen on m-c. We might as well be safe rather than handling
> this reactively. Mardak, would you mind landing your patch on m-c and aurora
> as well?

Apparently, uplifting this 1-line patch causes some i10n Firefox builds to incorrectly display mixed-content warning icon instead of lock icon for HTTPS sites (bug 1046645).

Since one of the options right now is to backout the uplift of this line (bug 1046645 comment 32), I'll repeat my question from comment 31:

Do we know why keeping this line causes the crash?

And I'm adding:

- What's the exact scenario to reproduce the crash?
- Can this be reproduced outside of talos?
- Would users be exposed to this crash somehow?
No longer depends on: 1046645
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: