Closed Bug 1640745 Opened 5 months ago Closed 3 months ago

Can no longer select individual requests in Developer tools / network tab

Categories

(DevTools :: Netmonitor, defect, P3)

76 Branch
defect

Tracking

(firefox81 fixed)

RESOLVED FIXED
81 Branch
Tracking Status
firefox81 --- fixed

People

(Reporter: kremit, Assigned: platon7pronko, NeedInfo)

References

Details

(Keywords: regressionwindow-wanted)

Attachments

(2 files)

User Agent: Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0

Steps to reproduce:

Open the Web Developer tools, then Network tab. Attempt to select an individual HTTP request.

Actual results:

In some cases (see screenshot), multiple entries are selected. You can double-click on an entry to resend the request in a new tab, but you cannot select one to view details of that request as other entries are also selected. This appears to be a regression around FF 72.

Expected results:

Previous versions, and other dev tools like Chrome, allow you to choose individual requests and view their details in the sidebar as usual.

Bugbug thinks this bug should belong to this component, but please revert this change in case of error.

Component: Untriaged → Netmonitor
Product: Firefox → DevTools
Severity: -- → S3
Priority: -- → P3
See Also: → 1628162

I was testing this using VirtualBox and Fedora ISO downloaded from here:
https://download.fedoraproject.org/pub/fedora/linux/releases/32/Workstation/x86_64/iso/Fedora-Workstation-Live-x86_64-32-1.6.iso

And I still can't reproduce the multi-selection problem.

Wesley, can you please try to see the problem in Firefox Nightly?
https://www.mozilla.org/en-US/firefox/channel/desktop/

Honza

Flags: needinfo?(kremit)

I can't reproduce this issue either, tested in Ubuntu 18.04 using the latest Nightly 79.0a1 and Firefox 77.

Wesley, in your previous attempts to reproduce this issue, did you ever use a clean profile? If not, could you please give it a try?
https://support.mozilla.org/en-US/kb/profile-manager-create-remove-switch-firefox-profiles

I seem to suffer from the same issue. I tried using the Fedora Live ISO (version 32-1.6) inside VirtualBox, and wasn't able to reproduce the error. However, I can still readily reproduce the issue on my machine, using the very same Firefox version (78.0b4). Can I provide any more information to aid in the debugging?

I'll try building from source code on my machine and poking around the code. Any advice on where I should look to provide more information for the developers?

We seem to have found STR that relate to this issue.

Could you try https://bugzilla.mozilla.org/show_bug.cgi?id=1628162#c13

Yes, I can see the issue using this STR, both on my machine (Arch Linux) and Fedora in VirtualBox.

Tried to reproduce this issue again in order to provide a regression window - used the steps indicated in Comment 6 and also the steps from the duplicate reports of Bug 1628162. I still cannot reproduce it on my machine. Considering this, I'm going to add the whiteboard [qa-regression-triage]; please feel free to remove it if additional information that could lead to finding the regression window is added.

QA Whiteboard: [qa-regression-triage]

I suspect that this isn't a regression inside Firefox, but that it's somehow related to changes in lower-level components (for example, the kernel). Why I think that: testing with "stable" distributions (Ubuntu, Fedora) in VirtualBox doesn't reproduce the issue, even though exactly the same Firefox build is used.

I'm still trying to build Firefox on my machine to try drilling deep into the failure, but unfortunately I'm pressed for time and compiling Firefox turned out to be harder than I thought (currently I'm hitting some codegen bugs inside one of the Rust components).

I finished building Firefox on my machine (the "tip" branch). The bug is still present in the freshly built instance.

I'll try searching for the bug myself, but given that I'm unfamiliar with the codebase and Firefox is a huge project, that will probably take a lot of time. Meanwhile, is there anything I can do on my machine (tweak some code, provide debug output, etc.) that would help you diagnose the cause of the error?

This is bizarre, but just as I sat down and started drilling into the netmonitor code, the bug stopped reproducing - both on the freshly built version and on the stock one on my machine. I'm quite sure that I didn't make any drastic changes to my environment during that time - no reboots, no package updates (I verified that using system logs and bash history).

I'll keep an eye out in case the problem reoccurs.

Hooray, the bug came back!

I still can't repro this issue (running Windows + Firefox Nightly/Beta/Release)

Is anyone else able to see this and provide baby-steps to reproduce it?

Honza

I can reproduce the bug reliably at the moment, but suspect that this is somehow related to network activity outside Firefox (it stopped reproducing yesterday, came back today, all without any relevant changes on the machine).

Your analysis in bug 1628162 seems to be on point. I'm currently working through the network monitor code to verify your analysis in the second comment, will get back with additional details.

At the moment I can definitely say that several (different) requests in a row get exactly the same actor ids:

// console.log("server/network-monitor sends networkEvent", this.parentID, actor.form().actor, actor.form().url);
server/network-monitor sends networkEvent server1.conn0.consoleActor3 server1.conn0.netEvent239 http://localhost:5501/test.png/000
server/network-monitor sends networkEvent server0.conn0.child2/consoleActor2 server0.conn0.netEvent24 http://localhost:5501/test.png/000
server/network-monitor sends networkEvent server1.conn0.consoleActor3 server1.conn0.netEvent239 http://localhost:5501/test.png/001
server/network-monitor sends networkEvent server0.conn0.child2/consoleActor2 server0.conn0.netEvent24 http://localhost:5501/test.png/001
server/network-monitor sends networkEvent server1.conn0.consoleActor3 server1.conn0.netEvent239 http://localhost:5501/test.png/002
server/network-monitor sends networkEvent server0.conn0.child2/consoleActor2 server0.conn0.netEvent24 http://localhost:5501/test.png/002
server/network-monitor sends networkEvent server1.conn0.consoleActor3 server1.conn0.netEvent280 http://localhost:5501/test.png/003
server/network-monitor sends networkEvent server0.conn0.child2/consoleActor2 server0.conn0.netEvent65 http://localhost:5501/test.png/003
server/network-monitor sends networkEvent server1.conn0.consoleActor3 server1.conn0.netEvent294 http://localhost:5501/test.png/004
server/network-monitor sends networkEvent server0.conn0.child2/consoleActor2 server0.conn0.netEvent79 http://localhost:5501/test.png/004

I'm not familiar with the codebase - is "channel" in network-observer supposed to be shared between several different requests? Currently the bad events originate from the same channelId, and that channelId is used to look up the actor, and that in turn leads to request id collisions.
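To illustrate the failure mode I mean, here is a minimal sketch (actorFor and the map shape here are illustrative stand-ins, not the real NetworkObserver internals): if two different requests arrive carrying the same collided channelId, a lookup keyed by that id hands both of them the same actor, so the client sees one request id twice.

```javascript
// Illustrative sketch only - not Firefox code. A map keyed by channelId
// silently reuses the first actor when a second request collides on the id.
const openRequests = new Map();
let nextActorId = 0;

function actorFor(channelId) {
  if (!openRequests.has(channelId)) {
    openRequests.set(channelId, "netEvent" + nextActorId++);
  }
  return openRequests.get(channelId);
}

console.log(actorFor("chan-A")); // netEvent0
console.log(actorFor("chan-A")); // different request, collided id: netEvent0 again
console.log(actorFor("chan-B")); // netEvent1
```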

P.S. Maybe it will be more appropriate to ask newbie questions like above in some Matrix room - which one?

(In reply to Platon Pronko from comment #14)

Your analysis in bug 1628162 seems to be on point. I'm currently working through the network monitor code to verify your analysis in the second comment, will get back with additional details.

Excellent, thanks for doing that!

(In reply to Platon Pronko from comment #15)

I'm not familiar with the codebase - is "channel" in network-observer supposed to be shared between several different requests? Currently the bad events originate from the same channelId, and that channelId is used to look up the actor, and that in turn leads to request id collisions.

channel is supposed to be used for one request only

The problem might be in the way we track these channels.

  1. We use openRequests map to track all open channels
    https://searchfox.org/mozilla-central/rev/72a1334982cadde0ca8b3d3583877afea80f5639/devtools/server/actors/network-monitor/network-observer.js#157

  2. We are adding new channels into the map here
    https://searchfox.org/mozilla-central/rev/72a1334982cadde0ca8b3d3583877afea80f5639/devtools/server/actors/network-monitor/network-observer.js#923

  3. We are sending an event to the client side (to the Network panel) here
    https://searchfox.org/mozilla-central/rev/72a1334982cadde0ca8b3d3583877afea80f5639/devtools/server/actors/network-monitor/network-observer.js#839

It might be that we are sending this event multiple times for the same channel. Perhaps we are removing the channel from this.openRequests at some point, then re-adding it and sending it to the client again - just as if it were a new request...

  4. Also note that we are using this ChannelMap for openRequests
    https://searchfox.org/mozilla-central/rev/72a1334982cadde0ca8b3d3583877afea80f5639/devtools/server/actors/network-monitor/utils/channel-map.js#26

P.S. Maybe it will be more appropriate to ask newbie questions like above in some Matrix room - which one?
DevTools or DevTools Network Monitor

Honza

Unfortunately the bug stopped reproducing again. Could it possibly be somehow time-related? It stopped reproducing yesterday at about the same time. Will keep trying.

Possibly interesting: I had left one tab open from two hours ago, and the bug still reproduces inside it. But if I open a new tab, or start a new browser instance the bug doesn't reproduce.

It doesn't look like activity objects or the channel map are causing the issue. The channels already arrive with the same channelId but different names. For example, here's a log from NetworkObserver._onRequestHeader (the log from NetworkObserver._createNetworkEvent is similar):

// console.log("NetworkObserver._onRequestHeader", channel.channelId, channel.name);
NetworkObserver._onRequestHeader 16502183474233400 http://localhost:5501/
NetworkObserver._onRequestHeader 16502183474233400 http://localhost:5501/
NetworkObserver._onRequestHeader 16502183474233404 http://localhost:5501/test.png/000
NetworkObserver._onRequestHeader 16502183474233404 http://localhost:5501/test.png/000
NetworkObserver._onRequestHeader 16502183474233404 http://localhost:5501/test.png/001
NetworkObserver._onRequestHeader 16502183474233404 http://localhost:5501/test.png/001
NetworkObserver._onRequestHeader 16502183474233404 http://localhost:5501/test.png/002
NetworkObserver._onRequestHeader 16502183474233404 http://localhost:5501/test.png/002
NetworkObserver._onRequestHeader 16502183474233406 http://localhost:5501/test.png/003
NetworkObserver._onRequestHeader 16502183474233406 http://localhost:5501/test.png/003
NetworkObserver._onRequestHeader 16502183474233408 http://localhost:5501/test.png/004
NetworkObserver._onRequestHeader 16502183474233408 http://localhost:5501/test.png/004

Note how channel *404 is shown several times with different urls.

So, the bug is back and I managed to make significant progress.

The problem is that nsHttpChannel stores channelId as uint64_t. But JavaScript uses double-precision floats for numbers, so if channelId is bigger than 2^53 the information in the lower bits is lost.

I haven't looked into the channelId generation code yet, but it seems that ids are generated sequentially, at least if the requests are all made at the same time. So while the ids are correct in C++ code, by the time they get to NetworkObserver several neighbouring ids collapse into one. Here's an example (ids are taken from my actual request logs):

$ node
Welcome to Node.js v12.18.1.
Type ".help" for more information.
> 16125102525513871
16125102525513872
> 16125102525513872
16125102525513872
> 16125102525513873
16125102525513872

I'm off to look at the channelId generation code. Just a shot in the dark - could it be that the channelId generator uses the tab process's PID?

Indeed, my guess was correct - ids are generated sequentially and use the PID. Here's the actual code:

nsresult nsHttpHandler::NewChannelId(uint64_t& channelId) {
  channelId =
      ((static_cast<uint64_t>(mProcessId) << 32) & 0xFFFFFFFF00000000LL) |
      mNextChannelId++;
  return NS_OK;
}

Thus if we get a PID greater than 2^21 (2,097,152), the generated channel ids will overflow the "safe" integer range of JavaScript floats and the errors will surface.
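A quick way to check that threshold (channelIdFits is my own helper, not Firefox code) is to mimic the C++ uint64_t arithmetic with BigInt and test whether the id survives a round-trip through a JavaScript Number:

```javascript
// Mimic nsHttpHandler::NewChannelId with BigInt (exact 64-bit arithmetic),
// then check whether the id round-trips through a JS Number unchanged.
function channelIdFits(pid, counter) {
  const id = (BigInt(pid) << 32n) | BigInt(counter);
  return BigInt(Number(id)) === id;
}

console.log(channelIdFits(32768, 1));            // classic pid_max range: true
console.log(channelIdFits(2097151, 0xFFFFFFFF)); // exactly 2^53 - 1: true
console.log(channelIdFits(2097152, 1));          // 2^53 + 1: false, low bit lost
```

The last pid whose ids always fit is 2^21 - 1; from pid 2^21 onwards, odd counter values start collapsing into their even neighbours.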

Usually PIDs are less than 32768, but this is configurable by writing to /proc/sys/kernel/pid_max. The actual maximum allowed PID value is PID_MAX_LIMIT, 2^22.
And systemd from version 243 sets pid_max to 2^22 on 64-bit systems by default (https://github.com/systemd/systemd/blob/master/NEWS#L1182).

So the fix is simple - we're actually only one bit short, so we just need to shift the pid by one bit less. This works on the whole range of possible PIDs:

@@ -2743,7 +2743,7 @@ void nsHttpHandler::ShutdownConnectionManager() {
 
 nsresult nsHttpHandler::NewChannelId(uint64_t& channelId) {
   channelId =
-      ((static_cast<uint64_t>(mProcessId) << 32) & 0xFFFFFFFF00000000LL) |
+      ((static_cast<uint64_t>(mProcessId) << 31) & 0xFFFFFFFF80000000LL) |
       mNextChannelId++;
   return NS_OK;
 }

(I tested it by forking lots of processes to cycle through the PID range, and the bug goes away for all pids from 0 to PID_MAX_LIMIT.)
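The same BigInt check (again a sketch of my own, not Firefox code) shows why one bit is enough: even the worst case - the largest pid below PID_MAX_LIMIT with all 32 counter bits set - lands exactly on Number.MAX_SAFE_INTEGER under the new shift, while the old shift overflows it:

```javascript
// Worst-case channelId under the old (32) and new (31) shift, computed
// with BigInt so no precision is lost in the check itself.
function newChannelId(pid, counter, shift) {
  return (BigInt(pid) << BigInt(shift)) | BigInt(counter);
}

const maxPid = 2 ** 22 - 1;                   // largest pid below PID_MAX_LIMIT
const maxCounter = 0xFFFFFFFF;                // all 32 counter bits set
const safe = BigInt(Number.MAX_SAFE_INTEGER); // 2^53 - 1

console.log(newChannelId(maxPid, maxCounter, 32) <= safe); // false: overflows
console.log(newChannelId(maxPid, maxCounter, 31) <= safe); // true: fits exactly
```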

What is the purpose of & 0xFFFFFFFF00000000LL in nsHttpHandler::NewChannelId? The left-hand side is already uint64_t, so we are not truncating any upper bits there - and we just did the left shift, so the lower 32 bits are already zero; masking them out doesn't make sense either.
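For what it's worth, a quick BigInt sketch (maskIsNoOp is my own helper, not Firefox code) supports the observation that the mask is a no-op for any 32-bit mProcessId:

```javascript
// Check that (pid << 32) & 0xFFFFFFFF00000000 equals plain (pid << 32)
// across the 32-bit pid range - i.e. the mask changes nothing.
function maskIsNoOp(pid) {
  const shifted = BigInt(pid) << 32n;
  return (shifted & 0xFFFFFFFF00000000n) === shifted;
}

console.log(maskIsNoOp(0) && maskIsNoOp(32768) && maskIsNoOp(0xFFFFFFFF)); // true
```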

Thanks Platon for investigating this.
This clearly looks like a platform issue?

Would you like to open a patch with the fix?

Also adding someone from the networking platform team.

Honza,
Not sure who is best to help with Platon's observations and question in comment 20.

Thanks

Flags: needinfo?(honzab.moz)

Ok, I'll start preparing the patch. The "How to submit a patch" tutorial states that the patch submitter should be the assignee of the bug - can somebody assign me to this bug?

Assignee: nobody → platon7pronko
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true

On Linux PID_MAX_LIMIT is 2^22. By default pid_max was set to 2^15, but
systemd since version 243 sets it to 2^22.
nsHttpHandler::NewChannelId uses the PID shifted left by 32 bits to
generate channelId, and for large PIDs that results in values greater
than 2^53. Such values are fine for uint64_t, but when we pass channelId
to devtools it gets converted to a JavaScript float, and values greater
than 2^53 lose their lowest bits. This in turn leads to much sorrow,
like request id collisions or "no headers for this request".

This patch shifts the PID by one bit less, resulting in channelIds
that always fit into safe JavaScript integer range for all PIDs.

Platon, you've done great work, thanks for the patch!
Let's see what the Necko team says.

Pushed by valentin.gosu@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/3a060213050c
fix NewChannelId so that ids fit into safe JS integer range r=necko-reviewers,valentin

Platon, Valentin, thanks for working on this. This is a great improvement!

Honza

Flags: needinfo?(honzab.moz)
Status: ASSIGNED → RESOLVED
Closed: 3 months ago
Resolution: --- → FIXED
Target Milestone: --- → 81 Branch
Duplicate of this bug: 1628162
Duplicate of this bug: 1658426
Duplicate of this bug: 1661069
Duplicate of this bug: 1664239
Duplicate of this bug: 1664187