If you think a bug might affect users in the 57 release, please set the correct tracking and status flags for Release Management.

Secure necko IPDL usage

RESOLVED FIXED in Firefox 19

Status

()

Core
IPC
P1
normal
RESOLVED FIXED
5 years ago
5 years ago

People

(Reporter: jduell, Assigned: jduell)

Tracking

unspecified
B2G C3 (12dec-1jan)
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(blocking-basecamp:+, firefox19 fixed, firefox20 fixed, b2g18 fixed)

Details

(Whiteboard: [LOE:M])

Attachments

(15 attachments, 15 obsolete attachments)

3.99 KB, patch
ted
: review+
Details | Diff | Splinter Review
4.12 KB, text/plain
Details
3.35 KB, patch
jduell
: review+
Details | Diff | Splinter Review
13.52 KB, patch
jduell
: review+
Justin Lebar (not reading bugmail)
: feedback+
Details | Diff | Splinter Review
14.98 KB, patch
jduell
: review+
Details | Diff | Splinter Review
16.70 KB, patch
jduell
: review+
Details | Diff | Splinter Review
11.54 KB, patch
jduell
: review+
Details | Diff | Splinter Review
12.49 KB, patch
jduell
: review+
Details | Diff | Splinter Review
9.64 KB, patch
jduell
: review+
Details | Diff | Splinter Review
12.70 KB, patch
jdm
: review+
Details | Diff | Splinter Review
4.77 KB, patch
ted
: review+
jduell
: feedback+
Details | Diff | Splinter Review
3.07 KB, patch
jduell
: review+
Details | Diff | Splinter Review
886 bytes, patch
ted
: review+
Details | Diff | Splinter Review
792 bytes, patch
ted
: review+
Details | Diff | Splinter Review
75.17 KB, patch
jduell
: review+
Details | Diff | Splinter Review
Necko IPDL channels are opened from the child.  Right now we trust whatever appid/inBrowserElement they tell us they are.  We must verify that the correct AppID is used for the channel using parent-side verification.  From my understanding so far, this will probably be achieved by making sure all necko IPDL channels are constructed passing a TabParent/child instance which we can use to get the appID on the parent side (cjones tells me we already have the code to detect if a bogus TabChild is passed from the child).

There may be some cases where necko channels do not have a TabChild and/or a nsILoadContext.  (smaug and jdm inform me that "frame scripts" and addons are cases for child-side code w/o a LoadContext:  we get the nsITabChild from a channel's callbacks, so I'm not sure that's always set either.)  We need to figure out what appID/inbrowser/etc credentials they need/get in that case (in particular what are valid channels, what might be rogue processes trying to get more permission than they ought to, and how to detect/enforce the difference).

This bug also include the IPDL protocols for Cookies and the AppCache, and we may have similar issues with ensuring we always have the Tabparent/loadcontext in those cases too.
(Assignee)

Comment 1

5 years ago
Most of the other bugs blocking bug 776652 are blockers, so I assume this is too.

Hoping this is LOE:S, but likely to be M to deal with all the security holes/issues here.  We can split off into sub-issues and categorize as blocking or not as needed.
blocking-basecamp: --- → ?
Whiteboard: [LOE:M]
Assignee: nobody → jduell.mcbugs
blocking-basecamp: ? → +
Whiteboard: [LOE:M] → [LOE:M][WebAPI:P2]
Keywords: feature
(Assignee)

Comment 2

5 years ago
cjones, mounir, jlebar, and other b2g folks,

I'm wondering if there's an easier way to fix this.

The plan so far has been that channels (which are created from the child-side) would need to have callbacks that implement both TabParent and nsILoadContext, so that we can 1) send the {appId, inBrowserElement} for the load (and also usePrivateBrowsing for nextgen private browsings) to the parent, and 2) provide the parent with a TabParent that can be securely used to get the mozIApplication.localId, which will tell us if the child is passing in a bogus appId.

The problem with this plan is that lots of necko channels seem to not have callbacks set that provide nsILoadContext and/or TabParent.  There are potentially a lot of call sites to change to fix that.  We'll need to fix them eventually to make nextGen private browsing work (and ehsan has been valiantly fixing some of them--see the dependencies for bug 787743), but it's unclear if we'll get to them all soon enough.

Here's a different plan, which may be less work?  
1) Change IPDL inheritance so that PNecko is managed by PBrowser instead of PContent.
2) Change 'mozbrowser' processes (which is just firefox now?) to use it's own unique appID instead of sharing an appID with it's app (IIUC the mozbrowser process is currently using 0, and so is already not sharing B2G firefox's appID, correct?  I'd be proposing we change it to some non-zero unique appid).

Change #1 would let me get the appID for a channel (via IPDL GetParent()) without relying on the child to pass in the TabParent.   Change #2 would let me get the isInBrowserElement value without the child passing in a nsILoadGroup (assuming we provide some kind of mapping of which appIDs are mozbrowsers).

Thoughts?
I think #1 would be very hard to do since you'd have to find the right PBrowser any time we make a network request so that you could then find the PNecko belonging to that browser. That seems basically just as hard as finding the appropriate {appid,browserflag}
> 2) Change 'mozbrowser' processes (which is just firefox now?) to use it's own unique appID instead 
> of sharing an appID with it's app (IIUC the mozbrowser process is currently using 0, and so is 
> already not sharing B2G firefox's appID, correct?  I'd be proposing we change it to some non-zero 
> unique appid).

Our story with these IDs is already so complicated, I'm really hesitant to complicate it further by having one part of the system tell a different story about the ID than another part of the system.

We use the current system of (app-id, is-browser) everywhere, including in Necko itself for managing cookies, right?

> Change #2 would let me get the isInBrowserElement value without the child passing in a 
> nsILoadGroup (assuming we provide some kind of mapping of which appIDs are mozbrowsers).

If you can get the app-id off the tabparent, can't you also get the is-browser-element value off the tab-parent, or does that not work for some reason?
(Assignee)

Comment 5

5 years ago
> I think #1 would be very hard to do...

Ah, ok.  I was hoping it would be easier.  

> can't you also get the is-browser-element value off the tab-parent...

No, it comes from the nsILoadContext. 

I've got :baku working on figuring out how many sites lack TabParent/nsILoadContext info, so we should soon have more info on how many call sites need to be fixed, and hopefully we can fix them quickly.

thanks for fast replies!
This should probably never have had the feature keyword.  Sorry for the noise.
Keywords: feature
Priority: -- → P2
Whiteboard: [LOE:M][WebAPI:P2] → [LOE:M]
How's this going, Jason?
Flags: needinfo?(jduell.mcbugs)

Comment 8

5 years ago
Milestoning for C2 (deadline of 12/10), as this meets the criteria of "known P2 bugs found before or during C1".
Target Milestone: --- → B2G C2 (20nov-10dec)
No updates since *September*. We need an update on where this is at, and whether it truly needs to block the release.
(Assignee)

Comment 10

5 years ago
Talked to jlebar, have design, am writing patch right now.  Should have in a day or so.

Whether it should block: we'd allow apps to read other app's cookies, so probably needs to block.
Flags: needinfo?(jduell.mcbugs)
Thanks Jason. Any update on a patch for this? We're 7 days out from our C2 milestone, so need to get this landed as soon as possible.
(Assignee)

Comment 12

5 years ago
I'm close here, but have context-switched to work on bug 815523, which seems of equal or higher priority and was higher risk.  I think I've got a handle on both now.
Still have a handle on both, Jason?  Pressure is high.  :)
(Assignee)

Comment 14

5 years ago
Created attachment 689839 [details] [diff] [review]
part1: add new network.disable.ipc.security pref and make NeckoParent observe it

We need to be able to disable IPC checks for xpcshell tests, which don't have the TabParent needed to pass them.
Attachment #689839 - Flags: review?(josh)
(Assignee)

Comment 15

5 years ago
Created attachment 689840 [details] [diff] [review]
part2: also make NeckoChild observe

Same thing but for NeckoChild, so we can fail in debug mode, etc, when we know on child side that we lack info needed to pass checks on parent.
Attachment #689840 - Flags: review?(josh)
Comment on attachment 689839 [details] [diff] [review]
part1: add new network.disable.ipc.security pref and make NeckoParent observe it

Review of attachment 689839 [details] [diff] [review]:
-----------------------------------------------------------------

The vast majority of these changes can be avoided by just using a static global var and Preferences::AddUintVarCache.
Attachment #689839 - Flags: review?(josh) → review-
Comment on attachment 689840 [details] [diff] [review]
part2: also make NeckoChild observe

Review of attachment 689840 [details] [diff] [review]:
-----------------------------------------------------------------

Likewise.
Attachment #689840 - Flags: review?(josh) → review-
(Assignee)

Comment 18

5 years ago
Created attachment 689845 [details] [diff] [review]
part2.5: disable necko IPC security by default for xpcshell tests

ted: the alternative would be to add head.js files to directories that need it, but that's essentially all xpcshell tests that do networking, so seemed easier to make it opt-in.  We'll get test coverage with checks on in mochitests.
Attachment #689845 - Flags: review?(ted)
Comment on attachment 689845 [details] [diff] [review]
part2.5: disable necko IPC security by default for xpcshell tests

Review of attachment 689845 [details] [diff] [review]:
-----------------------------------------------------------------

::: testing/xpcshell/head.js
@@ +40,5 @@
>  try {
> +  runningInParent = Components.classes["@mozilla.org/xre/runtime;1"].
> +                    getService(Components.interfaces.nsIXULRuntime).processType
> +                    == Components.interfaces.nsIXULRuntime.PROCESS_TYPE_DEFAULT;
> +} 

nit: trailing whitespace
Attachment #689845 - Flags: review?(ted) → review+
(Assignee)

Comment 20

5 years ago
Created attachment 689874 [details] [diff] [review]
part4: infrastructure for sec checks, and HTTP channel checks

jlebar:  could you take a look at the checks I do in NeckoParent::CreateChannelLoadContext() and tell me if they seem right to you?

This is a WIP:  it handles HTTP channels (it compiles and run w/o crashing, not extensively tested yet).  We need to add same TabParent param and checks to rest of necko channels (PFtpChannel, PwyciwygChannel, PWebSocket.ipdl, PCookieSvc, and POfflineCacheUpdate).  jdm, if you're up for working on this while I finish bug 815523 you're my hero.
Attachment #689874 - Flags: feedback?(justin.lebar+bug)
(Assignee)

Comment 21

5 years ago
Created attachment 689879 [details]
IRC chats on design
Created attachment 689920 [details] [diff] [review]
Parts 1+2: Control Necko IPC security checks via a pref.
(Assignee)

Comment 23

5 years ago
Comment on attachment 689920 [details] [diff] [review]
Parts 1+2: Control Necko IPC security checks via a pref.

Review of attachment 689920 [details] [diff] [review]:
-----------------------------------------------------------------

Yes, AddTYPEVarCache is much easier than registering observers.  Yet we've somehow never used it before in necko-land.  Good to know.

::: netwerk/ipc/NeckoChild.cpp
@@ +21,5 @@
>  namespace mozilla {
>  namespace net {
>  
> +static bool gDisableIPCSecurity = false;
> +static const char kPrefDisableIPCSecurity[] = "network.disable.ipc.security";

I'd rather see this as a #define in NeckoCommon.h so it's centralized.  Linker might also be more likely to coalesce two identical string constants than two static vars with same value?
Attachment #689920 - Flags: review+
(Assignee)

Updated

5 years ago
Attachment #689839 - Attachment is obsolete: true
(Assignee)

Updated

5 years ago
Attachment #689840 - Attachment is obsolete: true
Status update: I have completely untested patches that add validation for PCookieService requests, PFTPChannel, PWyciwygChannel, PWebSocket, and POfflineCacheUpdate. I have also completely broken the private browsing-related code, so I'm going to come up with a plan to fix that up.
Created attachment 690727 [details] [diff] [review]
Part 3: Use validated app data for remote HTTP channels.
Attachment #690727 - Flags: review?(jduell.mcbugs)
Created attachment 690728 [details] [diff] [review]
Part 4: Use validated app data for remote FTP channels.
Attachment #690728 - Flags: review?(jduell.mcbugs)
Created attachment 690729 [details] [diff] [review]
Part 5: Use validated app data for remote cookie requests.
Attachment #690729 - Flags: review?(jduell.mcbugs)
Created attachment 690730 [details] [diff] [review]
Part 6: Use verified app data for remote offline cache requests instead of insecure values from the child.
Attachment #690730 - Flags: review?(jduell.mcbugs)
Created attachment 690731 [details] [diff] [review]
Part 7: Use validated app data for remote web socket channels.
Attachment #690731 - Flags: review?(jduell.mcbugs)
Created attachment 690732 [details] [diff] [review]
Part 8: Add PBrowser context to remote wyciwyg channels.
Attachment #690732 - Flags: review?(jduell.mcbugs)

Updated

5 years ago
Attachment #690732 - Attachment description: Add PBrowser context to remote wyciwyg channels. → Part 8: Add PBrowser context to remote wyciwyg channels.
I thinking of making a utility GetPBrowserChildCallback method that could replace the goop that's present in every patch.
Created attachment 690736 [details] [diff] [review]
Part 8: Use validated app data for remote wyciwyg channels.

Sorry, last minute changes caused a build error.
Attachment #690736 - Flags: review?(jduell.mcbugs)

Updated

5 years ago
Attachment #690732 - Attachment is obsolete: true
Attachment #690732 - Flags: review?(jduell.mcbugs)
Created attachment 690738 [details] [diff] [review]
Part 7: Use validated app data for remote web socket channels.

Sorry, too many patches to keep track of.
Attachment #690738 - Flags: review?(jduell.mcbugs)

Updated

5 years ago
Attachment #690731 - Attachment is obsolete: true
Attachment #690731 - Flags: review?(jduell.mcbugs)
Created attachment 690740 [details] [diff] [review]
Part 8: Use validated app data for remote wyciwyg channels.

Last update, I swear.
Attachment #690740 - Flags: review?(jduell.mcbugs)

Updated

5 years ago
Attachment #690736 - Attachment is obsolete: true
Attachment #690736 - Flags: review?(jduell.mcbugs)
(Assignee)

Updated

5 years ago
Attachment #689845 - Attachment description: part3: disable necko IPC security by default for xpcshell tests → part2.5: disable necko IPC security by default for xpcshell tests
(Assignee)

Updated

5 years ago
Attachment #689874 - Attachment is obsolete: true
Attachment #689874 - Flags: feedback?(justin.lebar+bug)
(Assignee)

Comment 35

5 years ago
Comment on attachment 690727 [details] [diff] [review]
Part 3: Use validated app data for remote HTTP channels.

Review of attachment 690727 [details] [diff] [review]:
-----------------------------------------------------------------

I don't think we actually need all the nsIPrivateBrowsingChannel support here and in the rest of the patch series, but it's harmless to include for now.  Filed bug 820304 to remove it.

jdm: re: our IRC chat about using UNKNOWN_APP vs NO_APP as default for xpcshell tests: looks like you were right and we should go with NO_APP.

::: docshell/base/LoadContext.h
@@ +31,5 @@
>  public:
>    NS_DECL_ISUPPORTS
>    NS_DECL_NSILOADCONTEXT
>  
> +#ifndef CHECK_WE_ONLY_USE_THIS_FROM_NEW_CODE

We need to get rid of this, or the code it surrounds.  I'll file a followup patch with this and other changes from my comments.
Attachment #690727 - Flags: review?(jduell.mcbugs) → review+
(Assignee)

Updated

5 years ago
Attachment #690728 - Flags: review?(jduell.mcbugs) → review+
(Assignee)

Comment 36

5 years ago
Comment on attachment 690729 [details] [diff] [review]
Part 5: Use validated app data for remote cookie requests.

Review of attachment 690729 [details] [diff] [review]:
-----------------------------------------------------------------

::: netwerk/ipc/NeckoParent.cpp
@@ +119,5 @@
> +    }
> +    if (!aSerialized.IsNotNull()) {
> +      return "SerializedLoadContext from child is null";
> +      }
> +    }

Above checks are duplicated in call to GetValidatedAppInfo, so we should remove them here.
Attachment #690729 - Flags: review?(jduell.mcbugs) → review+
(Assignee)

Comment 37

5 years ago
Comment on attachment 690730 [details] [diff] [review]
Part 6: Use verified app data for remote offline cache requests instead of insecure values from the child.

Review of attachment 690730 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/ipc/TabParent.cpp
@@ +1125,5 @@
>  {
>    nsRefPtr<mozilla::docshell::OfflineCacheUpdateParent> update =
>      new mozilla::docshell::OfflineCacheUpdateParent();
>  
> +  nsresult rv = update->Schedule(aManifestURI, aDocumentURI, stickDocument, this);

Minor change: I think it's cleaner to have the OfflineCacheUpdate constructor take appID/inBrowser and call it from here with (OwnOrContainingAppId(), IsBrowserElement).
Attachment #690730 - Flags: review?(jduell.mcbugs) → review+
(Assignee)

Updated

5 years ago
Attachment #690738 - Flags: review?(jduell.mcbugs) → review+
(Assignee)

Comment 38

5 years ago
Comment on attachment 690740 [details] [diff] [review]
Part 8: Use validated app data for remote wyciwyg channels.

Review of attachment 690740 [details] [diff] [review]:
-----------------------------------------------------------------

This patch is different than all the other IPDL protocols in that we don't do the security checks right at IPDL construction time, which means we have to be more careful that a malicious child process can't send us IPDL messages in some sequence that bypasses the checks we do in AsyncOpen.  That said, I don't see any useful attacks here, because there's not much that can be done until AsyncOpen is called.  Also wyciwyg channels only touch the HTTP cache, which doesn't have per-app isolation (yet), so there's not much to protect.  So I think this is good for now.
Attachment #690740 - Flags: review?(jduell.mcbugs) → review+
(Assignee)

Comment 39

5 years ago
Created attachment 690792 [details] [diff] [review]
Part 9: various fixups

jdm: let me know if you find any of these changes objectionable.
Attachment #690792 - Flags: review?(josh)
(Assignee)

Comment 40

5 years ago
Comment on attachment 690727 [details] [diff] [review]
Part 3: Use validated app data for remote HTTP channels.

Review of attachment 690727 [details] [diff] [review]:
-----------------------------------------------------------------

jlebar: it would still be nice to have feedback as to whether the sec checks (see NeckoParent::CreateChannelLoadContext in this patch) look right to you.
Attachment #690727 - Flags: feedback?(justin.lebar+bug)
(Assignee)

Comment 41

5 years ago
try build:

  https://tbpl.mozilla.org/?tree=Try&rev=4eb7c5a23f42

Patches don't apply cleanly to beta.  Haven't had a chance to try on phone yet.  Will get back to this when I've got patches for bug 815523 done.

Updated

5 years ago
Attachment #690792 - Flags: review?(josh) → review+
Comment on attachment 690727 [details] [diff] [review]
Part 3: Use validated app data for remote HTTP channels.

Review of attachment 690727 [details] [diff] [review]:
-----------------------------------------------------------------

This looks right to me.  Sorry for leaving you hanging!
Attachment #690727 - Flags: feedback?(justin.lebar+bug) → feedback+
(Assignee)

Comment 43

5 years ago
Created attachment 690960 [details] [diff] [review]
Combined patch as landed on m-c

Merged into big combo patch for easier approvals/backouts/etc.

   https://hg.mozilla.org/integration/mozilla-inbound/rev/34c9ccee8058
Attachment #690960 - Flags: review+

Comment 44

5 years ago
Backed out because this broke some mochitests: https://hg.mozilla.org/integration/mozilla-inbound/rev/cdc37128bd0d

https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=34c9ccee8058
https://tbpl.mozilla.org/?tree=Try&rev=88b081780424
Created attachment 691155 [details] [diff] [review]
Part 10: Disable security tests for browser-element tests.
Assignee: jduell.mcbugs → josh
Attachment #691155 - Flags: review?(ted)
(Assignee)

Comment 47

5 years ago
Comment on attachment 691155 [details] [diff] [review]
Part 10: Disable security tests for browser-element tests.

Review of attachment 691155 [details] [diff] [review]:
-----------------------------------------------------------------

// our  test harness tests browser elements
// without sticking them in apps, and the security checks dislike that.

jlebar told me we weren't using this configuration in production yet, and I didn't know about the tests.  I've filed bug 820712 for the work to figure this out--doesn't need to block this bug IMO.  Meanwhile disabling security for test runs is good workaround.
Attachment #691155 - Flags: feedback+
(Assignee)

Comment 48

5 years ago
Comment on attachment 690960 [details] [diff] [review]
Combined patch as landed on m-c

I've got a beta version of the combined patch ready (aurora doesn't require any mods).  akeybl told me on IRC we can land these w/o +a process since this is all e10s.
Comment on attachment 691155 [details] [diff] [review]
Part 10: Disable security tests for browser-element tests.

Review of attachment 691155 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/browser-element/mochitest/browserElementTestHelpers.js
@@ +70,5 @@
> +  },
> +
> +  setIPCSecurityDisabledPref: function(value) {
> +    return this._setBoolPref("network.disable.ipc.security", value);
> +  },

I don't understand the extra layers of indirection here, but it's consistent with what's already here.
Attachment #691155 - Flags: review?(ted) → review+
https://tbpl.mozilla.org/?tree=Try&rev=fc7c3e025098
Created attachment 691351 [details] [diff] [review]
Secure necko IPDL usage.

Rollup for inbound.

Updated

5 years ago
Attachment #690960 - Attachment is obsolete: true
(Assignee)

Comment 52

5 years ago
Created attachment 691565 [details] [diff] [review]
Part 11: turn off sec checks for 3 more OOP browser-but-not-app tests

Hopefully this is the last of the test fixed.  Pushed to try

  https://tbpl.mozilla.org/?tree=Try&rev=f5dc5426ebe5
Attachment #691565 - Flags: review?(ted)
(Assignee)

Updated

5 years ago
Attachment #691155 - Attachment description: Disable security tests for browser-element tests. → Part 10: Disable security tests for browser-element tests.
Comment on attachment 691565 [details] [diff] [review]
Part 11: turn off sec checks for 3 more OOP browser-but-not-app tests

You should clear the pref once test_child_process_shutdown_message.html finishes running.
(Assignee)

Comment 54

5 years ago
Created attachment 691693 [details] [diff] [review]
Part 11: turn off sec checks for 3 more OOP browser-but-not-app tests
Attachment #691565 - Attachment is obsolete: true
Attachment #691565 - Flags: review?(ted)
Attachment #691693 - Flags: review?(ted)
Comment on attachment 691693 [details] [diff] [review]
Part 11: turn off sec checks for 3 more OOP browser-but-not-app tests

Review of attachment 691693 [details] [diff] [review]:
-----------------------------------------------------------------

::: content/base/test/test_child_process_shutdown_message.html
@@ +113,5 @@
>    SpecialPowers.addPermission("browser", true, window.document);
>    SpecialPowers.addPermission("embed-apps", true, window.document);
> +
> +  // TODO: remove in bug 820712
> +  SpecialPowers.setBoolPref("network.disable.ipc.security", true);

Given the way this test is structured, you could pretty easily change it to use SpecialPowers.pushPrefEnv, which does the cleanup for you:
http://mxr.mozilla.org/mozilla-central/source/testing/specialpowers/content/specialpowersAPI.js#538

But since the test already has a cleanup function it's not super important.

::: dom/devicestorage/ipc/test_ipc.html
@@ +152,5 @@
>            ["device.storage.enabled", true],
>            ["device.storage.testing", true],
>            ["device.storage.prompt.testing", true],
>  
> +          // TODO: remove this as part of bug 820712 

nit: trailing whitespace here

::: dom/indexedDB/ipc/test_ipc.html
@@ +167,5 @@
>      addEventListener("load", function() {
>        SpecialPowers.addPermission("browser", true, document);
>        SpecialPowers.pushPrefEnv({
>          "set": [
> +          // TODO: remove this as part of bug 820712 

nit: here too
Attachment #691693 - Flags: review?(ted) → review+
(Assignee)

Comment 56

5 years ago
Created attachment 692001 [details] [diff] [review]
part 12: another test needs checks disabled.

apparently this is going to be an iterative process...

  https://tbpl.mozilla.org/?tree=Try&rev=9385fe428974
Assignee: josh → jduell.mcbugs
Status: NEW → ASSIGNED
Attachment #692001 - Flags: review?(ted)
(Assignee)

Comment 57

5 years ago
Created attachment 692007 [details] [diff] [review]
part 11: test changes w/ted's nits fixed
Attachment #691693 - Attachment is obsolete: true
Attachment #692007 - Flags: review+
(Assignee)

Comment 58

5 years ago
Created attachment 692012 [details] [diff] [review]
part 12: v2 another test needs checks disabled.

fixed typo
Attachment #692001 - Attachment is obsolete: true
Attachment #692001 - Flags: review?(ted)
Attachment #692012 - Flags: review?(ted)
(Assignee)

Comment 59

5 years ago
https://tbpl.mozilla.org/?tree=Try&rev=9b5d82dc4dff
(Assignee)

Comment 60

5 years ago
Comment on attachment 692012 [details] [diff] [review]
part 12: v2 another test needs checks disabled.

Hmm, test fails reliably on all platforms on try, but works fine on my desktop.  Probably some interaction issue with other mochitests--investigating.
Attachment #692012 - Flags: review?(ted)
(Assignee)

Comment 61

5 years ago
Created attachment 692255 [details] [diff] [review]
part 12: v3 another test needs checks disabled.

ok, this seems to be working:

  https://tbpl.mozilla.org/?tree=Try&rev=bb0efd6090e6

Started full try run:

  https://tbpl.mozilla.org/?tree=Try&rev=4336a651a3d4
Attachment #692012 - Attachment is obsolete: true
Attachment #692255 - Flags: review?(ted)
Attachment #692255 - Flags: review?(ted) → review+
(Assignee)

Comment 62

5 years ago
Created attachment 692697 [details] [diff] [review]
omnibus patch as landed on inbound

https://hg.mozilla.org/integration/mozilla-inbound/rev/3e4e600adc3b
Attachment #691351 - Attachment is obsolete: true
Attachment #692697 - Flags: review+
The try build for this got this Linux Opt Cipc failure:

TEST-UNEXPECTED-FAIL | http://localhost:4444/1355488614395/2/330010-1.xul | application timed out after 330 seconds with no output

It was thought to be random, however every build on inbound since this landed has resulted in this error, so it would seem this is a real issue.
Backed out for crashtest-ipc timeouts:
https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=3e4e600adc3b

https://hg.mozilla.org/integration/mozilla-inbound/rev/463b740c3e2f

Why did this land if the try run was broken?

Whilst people not using Try is frustrating, I can at least empathise that in many cases it's just not worth the infra load to test a particular changeset before landing. However, using our (scarce) Try resources and then still landing after it was broken on Try (without asking #developers whether they thought the failure was real, or even just doing a retrigger on Try), seems like the worst of both worlds...
> Why did this land if the try run was broken?

Because the Cpic failure matched the signature of a known randomorange?
(In reply to Justin Lebar [:jlebar] from comment #65)
> > Why did this land if the try run was broken?
> 
> Because the Cpic failure matched the signature of a known randomorange?

But it doesn't...
(In reply to Ed Morley [UTC+0; email:edmorley@moco] from comment #66)
> > Because the Cpic failure matched the signature of a known randomorange?
> 
> But it doesn't...

The Cpic failure in comment 61 says:

> TEST-UNEXPECTED-FAIL | http://localhost:4444/1355488614395/2/330010-1.xul | application timed out after 330 seconds with no output
> Bug 742455 - Intermittent crashtest content/xul/templates/src/crashtests/397148-1.xul, 330010-1.xul, 330012-1.xul | Exited with code -1073741819 during test run [@ RDFContentSinkImpl::HandleEndElement] PROCESS-CRASH | http://localhost:4444/1355488614395/2/330010-1.xul | application crashed [@ linux-gate.so + 0x424]
> Bug 742455 - Intermittent crashtest content/xul/templates/src/crashtests/397148-1.xul, 330010-1.xul, 330012-1.xul | Exited with code -1073741819 during test run [@ RDFContentSinkImpl::HandleEndElement] Thread 0 (crashed)

So you're right; if you look at this closely, the failure in the try push is a timeout, whereas the known orange is a crash.

But given that at least three people (jduell, whoever was sheriffinb m-i, and now me) looked at this TBPL output and believed the failure to be a random orange, it seems reasonable to me to suggest that the tool is more at fault here than the users.
(In reply to Justin Lebar [:jlebar] from comment #67)
> But given that at least three people (jduell, whoever was sheriffinb m-i,
> and now me) looked at this TBPL output and believed the failure to be a
> random orange, it seems reasonable to me to suggest that the tool is more at
> fault here than the users.

It's a weekend, no one is really sheriffing inbound, and it's not clear at this point whether jduell saw the cipc orange.

That said, if you can think if ways in which to make it easier to spot that string 1 != string 2 (given that in this case you didn't even need to look at the log, and string matches are bolded, so most of this string wasn't in bold, indicating it didn't match... so it seems pretty clear already), then I'll gladly try and improve the annotated summary :-)
Comment 63 made it sound like someone believed that the failure was random on m-i.  Maybe it was jduell himself looking at the m-i failures, but anyway it seems unlikely that I'm the only one who looked at this TBPL log and misinterpreted it.

> if you can think if ways in which to make it easier to spot that string 1 != string 2

How about not requiring the user to do a strstr (computer-aided or otherwise)?  tbpl could explicitly separate matches into "likely your failure" and "likely not your failure" instead of relying on users to look for typographic hints.

If TBPL can't do this collation correctly based on boldness, then surely naive users also can't make this determination of "likely/not likely a known failure" based on boldness, and then the claim that "likely a known failure" is a straightforward determination isn't true.
I wrote comment 63 and was giving you the benefit of the doubt as to why you may have landed despite the obvious try server test issue.  Don't try to make my comment justify this action.
(In reply to Justin Lebar [:jlebar] from comment #69)
> instead of relying on users to look
> for typographic hints.

Calling them 'hints' is a bit of a stretch. The two failure types shown were really quite different. 

Perhaps we just need to have a day a week where inbound is closed and devs self-sheriff, so people don't lose touch with how to see if their push was successful?

> If TBPL can't do this collation correctly based on boldness, then surely
> naive users also can't make this determination of "likely/not likely a known
> failure" based on boldness, and then the claim that "likely a known failure"
> is a straightforward determination isn't true.

Using a simple % of string bolded is just not practical, due to additional content often needed in bug summaries. Unless we have structured logging (and a dozen accompanying bugzilla fields to be able to perform definitive matching against filed intermittents), then we're always going to have to match against bug summaries.

There are many many ways in which TBPL could be improved (and hopefully will be over the coming months, with a possible TBPLv2 on the cards next year), but ultimately at some point, devs need to accept they may not always get a dumbed down red/green indication, and actually have to read 20 characters of text now and again ;-)
Again, I do not think blaming users is a productive course of action here.

> devs need to accept they may not always get a dumbed down red/green indication, and 
> actually have to read 20 characters of text now and again ;-)

That's fine if you can accept that we'll get it wrong occasionally.  But comment 64 suggests that you're expecting a high degree of accuracy from a process carried out by busy non-experts many times a week.  I think is unrealistic and ultimately unfair to developers.

Indeed, if I'm going to get this kind of flack when I push to try and miss an orange, then I think a rational reaction would be to push to try less often and let experts sort out the potential failures on m-i.  That's obviously counter to your goal, so I again suggest that blaming users here is counter-productive.

> Perhaps we just need to have a day a week where inbound is closed and devs self-sheriff, 
> so people don't lose touch with how to see if their push was successful?

If the goal here is to maximize developer productivity, it seems unlikely to me that this is a better course of action than the status quo.

I maintain that it's unreasonable to expect that engineers, who may push to try or m-i ten times in a week and see five random oranges per push, should examine every orange / red in their push beyond clicking it and seeing if tbpl suggests a match (and even that is pushing it, imo).  Indeed, the presence of a TBPL match is strongly indicative of whether the failure is a known orange, so you're asking us to look carefully at each of these fifty failures to catch what is in reality an unlikely event, making the expectation that we will carry this out with a high degree of accuracy even more unrealistic.

If you want fewer csets pushed to m-i after failing try pushes, let's figure out what can be done automatically.  For example, maybe we can automatically re-trigger oranges on try.  Or we could push for increased test capacity so we could stop coalescing tests on m-i so pushing an orange wouldn't be as big a deal.
I don't get you position here.  My comment 63 was trying to give you an excuse for how you obviously screwed up by landing something that failed on try.
(In reply to Bill Gianopoulos [:WG9s] from comment #73)
> I don't get you position here.  My comment 63 was trying to give you an
> excuse for how you obviously screwed up by landing something that failed on
> try.

I didn't land anything; Jason did.  I'm defending his oversight, which I myself also made when looking at the try results after I read comment 64, on the grounds that this oversight is an easy one to make and that it's unreasonable to expect a high degree of accuracy from developers interpreting their try results when that process requires so many manual steps.  

My position is that if the accuracy we're seeing from developers interpreting try results is too low, we should focus on ways to improve the system, rather than blaming users of that system for being human and screwing up.
(In reply to Justin Lebar [:jlebar] from comment #72)
> I maintain that it's unreasonable to expect that engineers, who may push to
> try or m-i ten times in a week and see five random oranges per push, should
> examine every orange / red in their push beyond clicking it and seeing if
> tbpl suggests a match (and even that is pushing it, imo). 

I really don't think it's pushing it to ask devs to click on 3-4 failures on a run to even see if there are suggestions - I suspect we're going to have to agree to disagree on this front :-)

Both in this discussion (and previous on dev.{platform,planning}) you have advocated pushing most of the burden onto sheriffs - which is great in principal (and a place where I would like us to be at eventually, given it's an inefficient use of dev time for them to be doing some of these tasks), but is just not practical given the single employed sheriff, the tooling deficit we've built up over the years & our hard-working but understaffed releng team who don't have the cycles to fix our capacity problems as fast as we would like.

Thankfully, we're almost about to fill another sheriff position, which will reduce the load on me slightly (and actually give me more time to work on things like TBPL improvements), but even then, the world you (and myself too!) would like us to be in, just isn't where we're at - so (IMO) both the devs and the sheriffs need to share the burden for the short term.

> If you want fewer csets pushed to m-i after failing try pushes, let's figure
> out what can be done automatically.  For example, maybe we can automatically
> re-trigger oranges on try.  Or we could push for increased test capacity so
> we could stop coalescing tests on m-i so pushing an orange wouldn't be as
> big a deal.

The recently stated Futurama project has some great suggestions so far for these kind of things - and I'm really excited to see what comes out of it :-D
https://wiki.mozilla.org/Auto-tools/Projects/Futurama
> Both in this discussion (and previous on dev.{platform,planning}) you have advocated 
> pushing most of the burden onto sheriffs

I'd love it if you guys could fix the problems we've identified with TBPL, but that's only half of my argument.  The other half is that given the current state of affairs -- given that you don't have the resources to fix problems in TBPL, for example -- you shouldn't blame devs when we get bitten by deficiencies in the system.  I further contend that blaming devs for this kind of mistake may cause us to use try less often, and may thus be counter to your goals.

In contrast, I think you were/are explicitly suggesting that the burdens here should be carried by developers -- that developers should spend time carefully combing through their pushes' oranges so as to notice when TBPL incorrectly suggests a bug, for example -- and that it's not acceptable for developers to screw up when carrying out these tasks.  In other words, you were/are arguing that the failure here was the user's fault.  Again I object to this notion as inefficient, counter-productive, and ultimately unrealistic.

When you spend your whole day interacting with one tool, it's easy to think that people are careless for missing aspects about that tool which are obvious to you.  For example, you said in comment 71 that calling the boldness a "typographic hint" was a stretch, because the types of failures "really [are] quite different".

Instead, I'd encourage you to take an alternative point of view and use the fact that this indicator was missed not as evidence of carelessness but as a pointer to an area of the tool which can be improved.  When you blame a user instead of looking at things from the user's point of view, you lose important pieces of evidence for how to improve and essentially trade them for an ad hom.  This is not productive.
(Assignee)

Comment 77

5 years ago
FWIW here's what I was thinking (and I'm totally open to hearing from the sheriffs if this approach is a terrible strategy).  I assumed the oranges in my try build were random (I tend to make that assumption more especially with timeout reports that only happened on one platform and where there's already some sort of bad exit report for the test).  I pushed to inbound.  I starred what were obvious known oranges and stared at the rest.  I was actually worried about a different orange (browser_819510_perwindowpb.js failed, without suggestions, which seemed likely to be a problem: turns out it's not).  I considered backing my patch out, but I figured that it wasn't burning the tree, so one or two following pushes to inbound would provide essentially more try runs to reveal if there was really a problem here.  I had rolled my patch up into one commit so it's easy to backout, so... no problem?  (I also figured I'd get up early and check/backout it myself, but instead slept through my alarm and woke up instead to find my bug comments section morphed into a poor cousin of dev.platform :)  

Unless I hear otherwise from the sheriffs, I'm going to assume that I should have just backed out at the least sign of trouble and pushed to try a few more times (unless there was a sheriff on IRC who I could let know what was going on, and who was OK with letting my patch bake for a while).  There's a trade-off here between sheriff resources, developer time, and try server consumption, obviously.  Maybe we can talk about that larger issue somewhere other than in more comments in this bug?  Unless you're brief :)
Are we going to try for a re-land today? This is one of a small handful of remaining LOE:M work, and needs to land as early this week as possible.
Priority: P2 → P1
Target Milestone: B2G C2 (20nov-10dec) → B2G C3 (12dec-1jan)
(Assignee)

Comment 79

5 years ago
I can't repro the issue on my box.  Try builds are taking forever to return 'Cipc' test results:

   https://tbpl.mozilla.org/?tree=Try&rev=9a532b225c42
   https://tbpl.mozilla.org/?tree=Try&rev=a73db2de25f0
   https://tbpl.mozilla.org/?tree=Try&rev=3b9367e2ed7c

One of these builds is using a different strategy--to simply allow the "mozbrowser w/o app" to pass security checks.  If that passes and the others don't I'll land that tomorrow.
(Assignee)

Updated

5 years ago
Depends on: 822998
(Assignee)

Comment 80

5 years ago
So we're reliably failing Cipc on the try servers regardless of whether we turn on mozbrowser-w/o-app checks or not.  And I can't repro the failure on my desktop :(  

I'm installing a 32bit Fedora on a VM in case that shows the issue for me.  I've also filed bug 822998 to get access to an actual buildbot in case that doesn't work.
(Assignee)

Comment 81

5 years ago
Created attachment 694328 [details] [diff] [review]
Part 13: disable RDF test when in e10s mode

Maybe patch 13 is my lucky patch!

This skip-if's a crashtest in e10s mode because it uses RDFService which doesn't provide TabChild in HTTP channel callbacks (and thus crashes).

I filed bug 823470 about the larger issue of whether we'll ever need RDFService in child. 

Try build:

  https://tbpl.mozilla.org/?tree=Try&rev=34bab1c0e437
Attachment #694328 - Flags: review?(ted)
Comment on attachment 694328 [details] [diff] [review]
Part 13: disable RDF test when in e10s mode

Review of attachment 694328 [details] [diff] [review]:
-----------------------------------------------------------------

::: content/xul/templates/src/crashtests/crashtests.list
@@ +1,4 @@
>  load 257752-1-recursion.xul
>  load 329335-1.xul
>  load 329884-1.xul
> +skip-if(winWidget||browserIsRemote) HTTP load 330010-1.xul # bug 742455

Can you add the new bug # that you filed here?
Attachment #694328 - Flags: review?(ted) → review+
(Assignee)

Comment 83

5 years ago
Created attachment 694534 [details] [diff] [review]
omnibus patch as landed on inbound, v3

> Can you add the new bug # that you filed here?

done
Attachment #694534 - Flags: review+
(Assignee)

Updated

5 years ago
Attachment #692697 - Attachment is obsolete: true
(Assignee)

Comment 84

5 years ago
https://hg.mozilla.org/integration/mozilla-inbound/rev/6fe383b92b05
(Assignee)

Comment 85

5 years ago
https://hg.mozilla.org/releases/mozilla-aurora/rev/e1d28c3dc1aa
https://hg.mozilla.org/releases/mozilla-b2g18/rev/6962039806d6
(Assignee)

Updated

5 years ago
status-b2g18: --- → fixed
status-firefox19: --- → fixed
(Assignee)

Comment 86

5 years ago
Note: if your B2G child process starts crashing after this lands with a message that it's failed necko security checks, it's because you have a bug (you need to pass security info to all necko channels), so backing this patch out is probably not your best first response.  If we run into enough trouble that cjones, etc, thinks this needs backing out, then by all means, back out :)  So far we're passing automated tests at least...
Depends on: 823962
Sorry folks, this code may be an innocent bystander but it caused bug 823962.  Need to back out and investigate.
Aaaand I just saw comment 86.  However, I don't see any necko errors.
I/Gecko   ( 3436): [Parent 3436] WARNING: NeckoParent::AllocPHttpChannel: FATAL error: missing required PBrowser argument: KILLING CHILD PROCESS

Clock is running though.
Hahahahahahahahhahaha

(gdb) printf "%s", PrintJSStack()
0 anonymous() ["jar:file:///system/b2g/omni.ja!/components/contentSecurityPolicy.js":338]
    this = [object Object]
1 anonymous() ["jar:file:///system/b2g/omni.ja!/components/contentSecurityPolicy.js":518]
    this = function () {
    [sourceless code]
}
Depends on: 824170
Depends on: 824177
Sorry guys, we have some nontrivial regressions from this.  Clock is up :/.  Will help with regressions where I can.
Sniffle

https://hg.mozilla.org/integration/mozilla-inbound/rev/81f530168b39
https://hg.mozilla.org/releases/mozilla-aurora/rev/15286fbd15a7
https://hg.mozilla.org/releases/mozilla-b2g18/rev/dbba0daf07b2
Depends on: 824200

Updated

5 years ago
Depends on: 824179

Updated

5 years ago
status-b2g18: fixed → ---
status-firefox19: fixed → ---
No longer depends on: 824179
(Assignee)

Comment 93

5 years ago
Created attachment 695193 [details] [diff] [review]
inbound minibus patch (v4) w/p parts that landed in bug 815523

ready to land once we fix the dependencies.
Attachment #694534 - Attachment is obsolete: true
(Assignee)

Comment 94

5 years ago
Created attachment 695238 [details] [diff] [review]
inbound minibus patch (v5) w/o parts that landed in bug 815523

Missed a '+' from merging patch.
Attachment #695193 - Attachment is obsolete: true
Attachment #695238 - Flags: review+
* bug 823962 has landed on aurora and b2g18.
* bug 824177 has landed on inbound.
* jdm told me that bug 824200 is likely fixed by bug 753981
* jdm _also_ told me that he's going to try to land this again :)
https://hg.mozilla.org/integration/mozilla-inbound/rev/6cd44513d285
https://hg.mozilla.org/releases/mozilla-aurora/rev/46e707cbe7b3
status-firefox19: --- → fixed
status-firefox20: --- → fixed
https://hg.mozilla.org/releases/mozilla-b2g18/rev/2cd25cde565b
status-b2g18: --- → fixed
If any further problems are encountered, let's please just flip the pref. The full patch is a pain to rebase.
Nevermind, there was an actual test failure :(

https://hg.mozilla.org/integration/mozilla-inbound/rev/8aa341f3e3a6
https://hg.mozilla.org/releases/mozilla-aurora/rev/831232d1e065
https://hg.mozilla.org/releases/mozilla-b2g18/rev/c707d415f13b

I'll fix it up tonight and try again.
https://hg.mozilla.org/releases/mozilla-aurora/rev/6d0533934f90
https://hg.mozilla.org/integration/mozilla-inbound/rev/ef2a48aded5b
https://hg.mozilla.org/releases/mozilla-b2g18/rev/6d41d4a75478

Comment 101

5 years ago
https://hg.mozilla.org/mozilla-central/rev/ef2a48aded5b
Status: ASSIGNED → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
Depends on: 825698
You need to log in before you can comment on or make changes to this bug.