Closed Bug 1640745 Opened 5 months ago Closed 3 months ago

Can no longer select individual requests in Developer tools / network tab

Categories

(DevTools :: Netmonitor, defect, P3)

76 Branch
defect

Tracking

(firefox81 fixed)

RESOLVED FIXED
81 Branch
Tracking Status
firefox81 --- fixed

People

(Reporter: kremit, Assigned: platon7pronko, NeedInfo)

References

Details

(Keywords: regressionwindow-wanted)

Attachments

(2 files)

User Agent: Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0

Steps to reproduce:

Open the Web Developer tools, then Network tab. Attempt to select an individual HTTP request.

Actual results:

In some cases (see screenshot), multiple entries are selected. You can double-click on an entry to resend the request in a new tab, but you cannot select one to view details of that request as other entries are also selected. This appears to be a regression around FF 72.

Expected results:

Previous versions, and other dev tools like Chrome, allow you to choose individual requests and view their details in the sidebar as usual.

Bugbug thinks this bug should belong to this component, but please revert this change in case of error.

Component: Untriaged → Netmonitor
Product: Firefox → DevTools
Severity: -- → S3
Priority: -- → P3
See Also: → 1628162

I was testing this using VirtualBox and Fedora ISO downloaded from here:
https://download.fedoraproject.org/pub/fedora/linux/releases/32/Workstation/x86_64/iso/Fedora-Workstation-Live-x86_64-32-1.6.iso

And I still can't reproduce the multi-selection problem.

Wesley, can you please try to see the problem in Firefox Nightly?
https://www.mozilla.org/en-US/firefox/channel/desktop/

Honza

Flags: needinfo?(kremit)

I can't reproduce this issue either, tested in Ubuntu 18.04 using the latest Nightly 79.0a1 and Firefox 77.

Wesley, in your previous attempts to reproduce this issue, did you ever use a clean profile? If not, could you please give it a try?
https://support.mozilla.org/en-US/kb/profile-manager-create-remove-switch-firefox-profiles

I seem to suffer from the same issue. I tried using the Fedora Live ISO (version 32-1.6) inside VirtualBox, and wasn't able to reproduce the error. However, I can still readily reproduce the issue on my machine, using the very same Firefox version (78.0b4). Can I provide any more information to aid in the debugging?

I'll try building from source code on my machine and poking around the code. Any advice on where I should look to provide more information for the developers?

We seem to have found STR that relate to this issue.

Could you try https://bugzilla.mozilla.org/show_bug.cgi?id=1628162#c13

Yes, I can see the issue using this STR, both on my machine (Arch Linux) and Fedora in VirtualBox.

Tried to reproduce this issue again in order to provide a regression window - used the steps indicated in Comment 6 and also the steps from the duplicate reports of Bug 1628162. I still cannot reproduce it on my machine. Considering this, I'm going to add the whiteboard [qa-regression-triage]; please feel free to remove it if additional information that could lead to finding the regression window is added.

QA Whiteboard: [qa-regression-triage]

I suspect that this isn't a regression inside Firefox, but that it's somehow related to changes in lower-level components (for example, the kernel). Why I think that: testing with "stable" distributions (Ubuntu, Fedora) in VirtualBox doesn't reproduce the issue, even though exactly the same Firefox build is used.

I'm still trying to build Firefox on my machine to try drilling deep into the failure, but unfortunately I'm pressed for time and compiling Firefox turned out to be harder than I thought (currently I'm hitting some codegen bugs inside one of the Rust components).

I finished building Firefox on my machine (the "tip" branch). The bug is still present in the freshly built instance.

I'll try searching for the bug myself, but given that I'm unfamiliar with the codebase and Firefox is a huge project, that will probably take a lot of time. Meanwhile, is there anything I can do on my machine (tweak some code, provide debug output, etc.) that would help you diagnose the cause of the error?

This is bizarre, but just as I sat down and started drilling into the netmonitor code, the bug stopped reproducing - both on the freshly built version and on the stock one on my machine. I'm quite sure that I didn't make any drastic changes to my environment during that time - no reboots, no package updates (I verified that using system logs and bash history).

I'll keep an eye out in case the problem reoccurs.

Hooray, the bug came back!

I still can't repro this issue (running Windows + Firefox Nightly/Beta/Release)

Is anyone else able to see this and provide baby-steps to reproduce it?

Honza

I can reproduce the bug reliably at the moment, but suspect that this is somehow related to network activity outside Firefox (it stopped reproducing yesterday, came back today, all without any relevant changes on the machine).

Your analysis in bug 1628162 seems to be on point. I'm currently working through the network monitor code to verify your analysis in the second comment, will get back with additional details.

At the moment I can definitely say that several (different) requests in a row get exactly the same actor ids:

// console.log("server/network-monitor sends networkEvent", this.parentID, actor.form().actor, actor.form().url);
server/network-monitor sends networkEvent server1.conn0.consoleActor3 server1.conn0.netEvent239 http://localhost:5501/test.png/000
server/network-monitor sends networkEvent server0.conn0.child2/consoleActor2 server0.conn0.netEvent24 http://localhost:5501/test.png/000
server/network-monitor sends networkEvent server1.conn0.consoleActor3 server1.conn0.netEvent239 http://localhost:5501/test.png/001
server/network-monitor sends networkEvent server0.conn0.child2/consoleActor2 server0.conn0.netEvent24 http://localhost:5501/test.png/001
server/network-monitor sends networkEvent server1.conn0.consoleActor3 server1.conn0.netEvent239 http://localhost:5501/test.png/002
server/network-monitor sends networkEvent server0.conn0.child2/consoleActor2 server0.conn0.netEvent24 http://localhost:5501/test.png/002
server/network-monitor sends networkEvent server1.conn0.consoleActor3 server1.conn0.netEvent280 http://localhost:5501/test.png/003
server/network-monitor sends networkEvent server0.conn0.child2/consoleActor2 server0.conn0.netEvent65 http://localhost:5501/test.png/003
server/network-monitor sends networkEvent server1.conn0.consoleActor3 server1.conn0.netEvent294 http://localhost:5501/test.png/004
server/network-monitor sends networkEvent server0.conn0.child2/consoleActor2 server0.conn0.netEvent79 http://localhost:5501/test.png/004

I'm not familiar with the codebase - is "channel" in network-observer supposed to be shared between several different requests? Currently the bad events originate from the same channelId, and that channelId is used to look up the actor, and that in turn leads to request id collisions.
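To illustrate the failure mode I mean, here is a minimal sketch (actorFor and the map shape here are illustrative stand-ins, not the real NetworkObserver internals): if two different requests arrive carrying the same collided channelId, a lookup keyed by that id hands both of them the same actor, so the client sees one request id twice.

```javascript
// Illustrative sketch only - not Firefox code. A map keyed by channelId
// silently reuses the first actor when a second request collides on the id.
const openRequests = new Map();
let nextActorId = 0;

function actorFor(channelId) {
  if (!openRequests.has(channelId)) {
    openRequests.set(channelId, "netEvent" + nextActorId++);
  }
  return openRequests.get(channelId);
}

console.log(actorFor("chan-A")); // netEvent0
console.log(actorFor("chan-A")); // different request, collided id: netEvent0 again
console.log(actorFor("chan-B")); // netEvent1
```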

P.S. Maybe it will be more appropriate to ask newbie questions like above in some Matrix room - which one?

(In reply to Platon Pronko from comment #14)

Your analysis in bug 1628162 seems to be on point. I'm currently working through the network monitor code to verify your analysis in the second comment, will get back with additional details.

Excellent, thanks for doing that!

(In reply to Platon Pronko from comment #15)

I'm not familiar with the codebase - is "channel" in network-observer supposed to be shared between several different requests? Currently the bad events originate from the same channelId, and that channelId is used to look up the actor, and that in turn leads to request id collisions.

channel is supposed to be used for one request only

The problem might be in the way we track these channels.

  1. We use openRequests map to track all open channels
    https://searchfox.org/mozilla-central/rev/72a1334982cadde0ca8b3d3583877afea80f5639/devtools/server/actors/network-monitor/network-observer.js#157

  2. We are adding new channels into the map here
    https://searchfox.org/mozilla-central/rev/72a1334982cadde0ca8b3d3583877afea80f5639/devtools/server/actors/network-monitor/network-observer.js#923

  3. We are sending an event to the client side (to the Network panel) here
    https://searchfox.org/mozilla-central/rev/72a1334982cadde0ca8b3d3583877afea80f5639/devtools/server/actors/network-monitor/network-observer.js#839

It might be that we are sending this event multiple times for the same channel. Perhaps we are removing the channel from this.openRequests at some point, then re-adding it and sending it to the client again - just as if it were a new request...

  4. Also note that we are using this ChannelMap for openRequests
    https://searchfox.org/mozilla-central/rev/72a1334982cadde0ca8b3d3583877afea80f5639/devtools/server/actors/network-monitor/utils/channel-map.js#26

P.S. Maybe it will be more appropriate to ask newbie questions like above in some Matrix room - which one?
DevTools or DevTools Network Monitor

Honza

Unfortunately the bug stopped reproducing again. Could it possibly be somehow time-related? It stopped reproducing yesterday at about the same time. Will keep trying.

Possibly interesting: I had left one tab open from two hours ago, and the bug still reproduces inside it. But if I open a new tab, or start a new browser instance the bug doesn't reproduce.

It doesn't look like activity objects or the channel map are causing the issue. The channels already arrive with the same channelId but different names. For example, here's a log from NetworkObserver._onRequestHeader (the log from NetworkObserver._createNetworkEvent is similar):

// console.log("NetworkObserver._onRequestHeader", channel.channelId, channel.name);
NetworkObserver._onRequestHeader 16502183474233400 http://localhost:5501/
NetworkObserver._onRequestHeader 16502183474233400 http://localhost:5501/
NetworkObserver._onRequestHeader 16502183474233404 http://localhost:5501/test.png/000
NetworkObserver._onRequestHeader 16502183474233404 http://localhost:5501/test.png/000
NetworkObserver._onRequestHeader 16502183474233404 http://localhost:5501/test.png/001
NetworkObserver._onRequestHeader 16502183474233404 http://localhost:5501/test.png/001
NetworkObserver._onRequestHeader 16502183474233404 http://localhost:5501/test.png/002
NetworkObserver._onRequestHeader 16502183474233404 http://localhost:5501/test.png/002
NetworkObserver._onRequestHeader 16502183474233406 http://localhost:5501/test.png/003
NetworkObserver._onRequestHeader 16502183474233406 http://localhost:5501/test.png/003
NetworkObserver._onRequestHeader 16502183474233408 http://localhost:5501/test.png/004
NetworkObserver._onRequestHeader 16502183474233408 http://localhost:5501/test.png/004

Note how channel *404 is shown several times with different urls.

So, the bug is back and I managed to make significant progress.

The problem is that nsHttpChannel stores channelId as uint64_t. But JavaScript uses double-precision floats for numbers, so if channelId is bigger than 2^53 the information in the lower bits is lost.

I haven't looked into the channelId generation code yet, but it seems that ids are generated sequentially, at least if the requests are all made at the same time. So while the ids are correct in C++ code, by the time they get to NetworkObserver several neighbouring ids collapse into one. Here's an example (ids are taken from my actual request logs):

$ node
Welcome to Node.js v12.18.1.
Type ".help" for more information.
> 16125102525513871
16125102525513872
> 16125102525513872
16125102525513872
> 16125102525513873
16125102525513872

I'm off to look at the channelId generation code. Just a shot in the dark - could it be that the channelId generator uses the tab process's PID?

Indeed, my guess was correct - ids are generated sequentially and use the PID. Here's the actual code:

nsresult nsHttpHandler::NewChannelId(uint64_t& channelId) {
  channelId =
      ((static_cast<uint64_t>(mProcessId) << 32) & 0xFFFFFFFF00000000LL) |
      mNextChannelId++;
  return NS_OK;
}

Thus if we get a PID greater than 2^21 (2,097,152), the generated channel ids will overflow the "safe" integer range of JavaScript floats and the errors will surface.
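A quick way to check that threshold (channelIdFits is my own helper, not Firefox code) is to mimic the C++ uint64_t arithmetic with BigInt and test whether the id survives a round-trip through a JavaScript Number:

```javascript
// Mimic nsHttpHandler::NewChannelId with BigInt (exact 64-bit arithmetic),
// then check whether the id round-trips through a JS Number unchanged.
function channelIdFits(pid, counter) {
  const id = (BigInt(pid) << 32n) | BigInt(counter);
  return BigInt(Number(id)) === id;
}

console.log(channelIdFits(32768, 1));            // classic pid_max range: true
console.log(channelIdFits(2097151, 0xFFFFFFFF)); // exactly 2^53 - 1: true
console.log(channelIdFits(2097152, 1));          // 2^53 + 1: false, low bit lost
```

The last pid whose ids always fit is 2^21 - 1; from pid 2^21 onwards, odd counter values start collapsing into their even neighbours.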

Usually PIDs are less than 32768, but this is configurable by writing to /proc/sys/kernel/pid_max. The actual maximum allowed PID value is PID_MAX_LIMIT, 2^22.
And systemd from version 243 sets pid_max to 2^22 on 64-bit systems by default (https://github.com/systemd/systemd/blob/master/NEWS#L1182).

So the fix is simple - we're actually only one bit short, so we just need to shift the pid by one bit less. This works on the whole range of possible PIDs:

@@ -2743,7 +2743,7 @@ void nsHttpHandler::ShutdownConnectionManager() {
 
 nsresult nsHttpHandler::NewChannelId(uint64_t& channelId) {
   channelId =
-      ((static_cast<uint64_t>(mProcessId) << 32) & 0xFFFFFFFF00000000LL) |
+      ((static_cast<uint64_t>(mProcessId) << 31) & 0xFFFFFFFF80000000LL) |
       mNextChannelId++;
   return NS_OK;
 }

(I tested it by forking lots of processes to cycle through the PID range, and the bug goes away for all pids from 0 to PID_MAX_LIMIT.)
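The same BigInt check (again a sketch of my own, not Firefox code) shows why one bit is enough: even the worst case - the largest pid below PID_MAX_LIMIT with all 32 counter bits set - lands exactly on Number.MAX_SAFE_INTEGER under the new shift, while the old shift overflows it:

```javascript
// Worst-case channelId under the old (32) and new (31) shift, computed
// with BigInt so no precision is lost in the check itself.
function newChannelId(pid, counter, shift) {
  return (BigInt(pid) << BigInt(shift)) | BigInt(counter);
}

const maxPid = 2 ** 22 - 1;                   // largest pid below PID_MAX_LIMIT
const maxCounter = 0xFFFFFFFF;                // all 32 counter bits set
const safe = BigInt(Number.MAX_SAFE_INTEGER); // 2^53 - 1

console.log(newChannelId(maxPid, maxCounter, 32) <= safe); // false: overflows
console.log(newChannelId(maxPid, maxCounter, 31) <= safe); // true: fits exactly
```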

What is the purpose of & 0xFFFFFFFF00000000LL in nsHttpHandler::NewChannelId? The left-hand side is already uint64_t, so we are not truncating any upper bits there - and we just did the left shift, so the lower 32 bits are already zero; masking them out doesn't make sense either.
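For what it's worth, a quick BigInt sketch (maskIsNoOp is my own helper, not Firefox code) supports the observation that the mask is a no-op for any 32-bit mProcessId:

```javascript
// Check that (pid << 32) & 0xFFFFFFFF00000000 equals plain (pid << 32)
// across the 32-bit pid range - i.e. the mask changes nothing.
function maskIsNoOp(pid) {
  const shifted = BigInt(pid) << 32n;
  return (shifted & 0xFFFFFFFF00000000n) === shifted;
}

console.log(maskIsNoOp(0) && maskIsNoOp(32768) && maskIsNoOp(0xFFFFFFFF)); // true
```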

Thanks Platon for investigating this.
This clearly looks like a platform issue?

Would you like to open a patch with the fix?

Also adding someone from the networking platform team.

Honza,
Not sure who is best to help with Platon's observations and question in comment 20.

Thanks

Flags: needinfo?(honzab.moz)

Ok, I'll start preparing the patch. The "How to submit a patch" tutorial states that the patch submitter should be the assignee of the bug - can somebody assign me to this bug?

Assignee: nobody → platon7pronko
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true

On Linux PID_MAX_LIMIT is 2^22. By default pid_max was set to 2^15, but
systemd since version 243 sets it to 2^22.
nsHttpHandler::NewChannelId uses the PID shifted left by 32 bits to
generate channelId, and for large PIDs that results in values greater
than 2^53. Such values are fine for uint64_t, but when we pass channelId
to devtools it gets converted to a JavaScript float, and values greater
than 2^53 lose their lowest bits. This in turn leads to much sorrow,
like request id collisions or "no headers for this request".

This patch shifts the PID by one bit less, resulting in channelIds
that always fit into safe JavaScript integer range for all PIDs.

Platon, you've done great work, thanks for the patch!
Let's see what the Necko team says.

Pushed by valentin.gosu@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/3a060213050c
fix NewChannelId so that ids fit into safe JS integer range r=necko-reviewers,valentin

Platon, Valentin, thanks for working on this. This is a great improvement!

Honza

Flags: needinfo?(honzab.moz)
Status: ASSIGNED → RESOLVED
Closed: 3 months ago
Resolution: --- → FIXED
Target Milestone: --- → 81 Branch
Duplicate of this bug: 1628162
Duplicate of this bug: 1658426
Duplicate of this bug: 1661069
Duplicate of this bug: 1664239
Duplicate of this bug: 1664187