Closed Bug 1947617 Opened 1 month ago Closed 22 days ago

Mesa fails to create the shader cache even if disabled (Fedora 41)

Categories

(Core :: Widget: Gtk, defect, P3)

All
Linux
defect

RESOLVED MOVED

People

(Reporter: andre.ocosta, Unassigned)

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:135.0) Gecko/20100101 Firefox/135.0

Steps to reproduce:

Watch any AV1-encoded YouTube video (e.g. https://www.youtube.com/watch?v=EDvik1MAtr4)

Actual results:

Video plays just fine, but these lines keep appearing in the logs:

fev 11 18:18:36 org.mozilla.firefox.desktop[3997]: Failed to create /home for shader cache (Permission denied)---disabling.
fev 11 18:18:36 org.mozilla.firefox.desktop[3997]: Failed to create /home for shader cache (Permission denied)---disabling.
fev 11 18:18:37 org.mozilla.firefox.desktop[3997]: Failed to create /home for shader cache (Permission denied)---disabling.
...

Expected results:

The shader cache should have been created (actually, the directories already exist, but for some reason Firefox tries to recreate them and fails).

The Bugbug bot thinks this bug should belong to the 'Core::Widget: Gtk' component, and is moving the bug to that component. Please correct in case you think the bot is wrong.

Component: Untriaged → Widget: Gtk
Product: Firefox → Core

(In reply to André Costa from comment #0)

Watch any AV1-encoded Youtube video (eg. https://www.youtube.com/watch?v=EDvik1MAtr4)

Sorry, my bad: I confused the codecs. The above line should have been "Watch any VP9-encoded YouTube video" (I have media.av1.enabled set to false).

This issue was also filed as the bug linked below. I hope the assigned developers can coordinate the change in logic needed to fix this three-month-old issue:

https://bugzilla.mozilla.org/show_bug.cgi?id=1933040

I believe there are a couple of problems in disk_cache.c that need to be addressed:

  • mkdir_if_needed() needs better handling of the possible outcomes of the stat() call it makes early on, e.g. it shouldn't attempt to call mkdir() if stat() failed with EACCES (actually, mkdir() should probably only be called if stat() failed with ENOENT)
  • make_cache_file_directory() doesn't check if mkdir_if_needed() succeeded
  • should mkdir_if_needed() be called with /home as an argument? Non-root processes wouldn't be able to create anything there anyway. For the cache problem at hand, it seems to only make sense to pass it /home/<user>/.cache
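
To make the first bullet concrete, here is a minimal sketch of the behaviour I have in mind. It is not Mesa's code: mkdir_only_if_missing() is a hypothetical name, and the 0700 mode and the EEXIST handling are my own assumptions. The only point is that mkdir() is attempted solely when stat() fails with ENOENT.

```c
#include <errno.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical variant of mkdir_if_needed(): returns true if `path` is a
 * usable directory when the call returns, false otherwise. */
bool mkdir_only_if_missing(const char *path)
{
    struct stat sb;

    if (stat(path, &sb) == 0) {
        /* Something already exists here; succeed only if it is a directory. */
        return S_ISDIR(sb.st_mode);
    }

    /* stat() failed. Only ENOENT means "nothing there yet". For EACCES and
     * every other error, report failure without attempting mkdir(). */
    if (errno != ENOENT)
        return false;

    /* Assumed policy: tolerate another process creating the directory
     * between our stat() and mkdir() calls. */
    if (mkdir(path, 0700) == 0 || errno == EEXIST)
        return true;

    return false;
}
```

With this shape, a stat("/home") that fails with EACCES would end in a quiet "false" instead of a doomed mkdir("/home", 0700).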

Andre, thank you for reporting and looking into it. I have two questions for you.

Do you think the call path causing the error includes make_cache_file_directory()? If so, is there any evidence to support that?

The path is "/home", so it is unlikely to have been produced by make_cache_file_directory(), although it could have been. Have you checked your environment variables? For example, MESA_GLSL_CACHE_DIR.

Flags: needinfo?(andre.ocosta)

Hi Thinker, thank you for joining the discussion.

I just mentioned make_cache_file_directory() because I searched for all the places where mkdir_if_needed() is called during my investigation, but I don't think it is part of this specific call path. I can see now it was misleading, sorry for that.

Also, for the record, I did not debug it myself, I assumed this report on the other thread was accurate.

Following your suggestion, I decided to repeat my previous tests:

  1. set MESA_GLSL_CACHE_DIR. This issued a warning saying it's deprecated ("*** MESA_GLSL_CACHE_DIR is deprecated; use MESA_SHADER_CACHE_DIR instead ***"), so I set MESA_SHADER_CACHE_DIR instead
  2. with either of the above env vars set, I could NOT reproduce the problem this time (which was strange, because I had tested it previously with MESA_SHADER_CACHE_DIR)

I then suspected my previous tests were flawed, and, guess what, they were. It turns out I had set the env var but forgot to export it (I'm using fish shell, and in my previous tests I mistakenly used set -U instead of set -Ux, so the var was not being exported to child processes).

So, I am glad (and embarrassed) to say that, with MESA_SHADER_CACHE_DIR properly set, the problem doesn't manifest, which is good news.

If the others following this thread could also confirm this, we can close this issue as NOTABUG.

BTW should MESA_SHADER_CACHE_DIR point to ~/.cache or ~/.cache/mesa_shader_cache? Using the latter creates mesa_shader_cache_db inside mesa_shader_cache instead of beside it.

Flags: needinfo?(andre.ocosta)

André, Thank you very much for your update. Yes, setting MESA_SHADER_CACHE_DIR to ~/.cache also CIRCUMVENTS the problem for me. However, CIRCUMVENTING a problem is not FIXING it.

Newbie Linux users who are just in the process of replacing Windoze with an easy-to-install Linux distribution (where the command-line is never necessary) must not be expected to suddenly learn how to use a line command such as "export MESA_SHADER_CACHE_DIR=~/.cache", and then start Firefox from the command-line instead of from a desktop menu item, in order to get Firefox to work properly.

Firefox and Mesa should work properly "out of the box" without any such interventions. In this case that means that Firefox and Mesa must figure out where to put their shader cache all on their own, which they certainly did in the past. Closing this Bug as NOTABUG (without properly fixing it) may be OK for Linux enthusiasts, but it is most certainly not acceptable for newbie Linux users. Bob.

That's a fair point, Bob. Firefox should indeed work out of the box as much as possible. There is logic in disk_cache.c to handle this specific case where no env var is set, and it should be fixed. This is the code snippet from disk_cache_create() that handles the case where neither MESA_GLSL_CACHE_DIR nor XDG_CACHE_HOME is set:

   if (path == NULL) {
      char *buf;
      size_t buf_size;
      struct passwd pwd, *result;

      buf_size = sysconf(_SC_GETPW_R_SIZE_MAX);
      if (buf_size == -1)
         buf_size = 512;

      /* Loop until buf_size is large enough to query the directory */
      while (1) {
         buf = ralloc_size(local, buf_size);

         getpwuid_r(getuid(), &pwd, buf, buf_size, &result);
         if (result)
            break;

         if (errno == ERANGE) {
            ralloc_free(buf);
            buf = NULL;
            buf_size *= 2;
         } else {
            goto path_fail;
         }
      }

      path = concatenate_and_mkdir(local, pwd.pw_dir, ".cache");
      if (path == NULL)
         goto path_fail;

      path = concatenate_and_mkdir(local, path, CACHE_DIR_NAME);
      if (path == NULL)
         goto path_fail;
   }

mkdir_if_needed() is called indirectly through concatenate_and_mkdir(), so at first glance it seems one of the two calls should be the culprit. Still, neither of them calls mkdir_if_needed() with "/home" as an argument, so I'm clearly missing something here...

@André Firefox uses system Mesa on Linux. The older in-tree copy is irrelevant.
https://elixir.bootlin.com/mesa/mesa-24.3.1/source/src/util/disk_cache_os.c#L190

Thanks @Masatoshi, I suspected I was looking at the wrong source code since I could not find the "*** MESA_GLSL_CACHE_DIR is deprecated; use MESA_SHADER_CACHE_DIR instead ***" error message. Now it's making sense.

So it seems this snippet of disk_cache_generate_cache_dir() is what we're looking for, since $HOME is normally defined on Linux.

   if (!path) {
      char *home = getenv("HOME");

      if (home) {
         path = concatenate_and_mkdir(mem_ctx, home, ".cache");
         if (!path)
            return NULL;

         path = concatenate_and_mkdir(mem_ctx, path, cache_dir_name);
         if (!path)
            return NULL;
      }
   }

In Mesa, concatenate_and_mkdir() calls mkdir_with_parents_if_needed(), which then calls mkdir_if_needed() for each leading part of the path it receives, and it will of course start with "/home".
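
To illustrate the order in which the path components get visited, here is a rough sketch of that kind of prefix walk. It is not the Mesa implementation: next_prefix_len() and print_mkdir_order() are hypothetical helpers that only show which directories would be handed to a mkdir_if_needed()-style function, and in what order.

```c
#include <stddef.h>
#include <stdio.h>

/* Returns the length of the next directory prefix of `path` that is longer
 * than `prev_len`, or 0 when no further prefix remains. */
size_t next_prefix_len(const char *path, size_t prev_len)
{
    size_t i = prev_len;

    /* Skip the separator(s) that ended the previous prefix. */
    while (path[i] == '/')
        i++;
    if (path[i] == '\0')
        return 0;

    /* Advance to the end of the next component. */
    while (path[i] != '\0' && path[i] != '/')
        i++;
    return i;
}

/* Prints each prefix on its own line, shortest first. */
void print_mkdir_order(const char *path)
{
    size_t len = 0;

    while ((len = next_prefix_len(path, len)) != 0)
        printf("%.*s\n", (int)len, path);
}
```

Calling print_mkdir_order("/home/costa/.cache") lists "/home" first, then "/home/costa", then "/home/costa/.cache", which is why "/home" is the directory named in the error message.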

mkdir_if_needed() theoretically should handle this just fine provided this stat() call returns 0. If, however, it is failing with EACCES, then we will try to mkdir("/home", 0700) on line 131 and will fail, leading to the "Failed to create shader cache..." error message.

Does this make sense? If so, why is stat("/home") failing?

(In reply to André Costa from comment #11)

Does this make sense? If so, why is stat("/home") failing?

Do you have read permission on the root directory?
Sometimes people remove read permission but keep execute permission on the root directory for security reasons.
That lets you access subdirectories but not read the root itself. In this case, stat("/xxx") fails.

Since this is in Mesa, there is not much we can do from Firefox. One potential workaround is to translate the paths specified in env vars to relative paths when this happens, to prevent Mesa from visiting directories outside of $HOME. Or we could just tell the user how to work around it.

I do have read permission:

❯ ls -ld / /home
dr-xr-xr-x. 1 root root 182 fev 22 13:12 //
drwxr-xr-x. 1 root root  10 jul 16  2024 /home/

I agree, there's not much Firefox can do if this is handled inside Mesa. One of Mesa's developers argued that the Firefox sandbox could be interfering with these filesystem queries; could it?

One thing still doesn't add up, though: when MESA_SHADER_CACHE_DIR is set, it also points to an absolute path ("/home/<user>/.cache/mesa_shader_cache"), and this path is passed to the same concatenate_and_mkdir(). Why does passing "/home" fail while passing "/home/<user>/.cache/mesa_shader_cache" succeeds, considering that in both cases it walks an absolute path with the same prefix, trying to create intermediate dirs?

(In reply to André Costa from comment #13)

One thing still doesn't add up, though: when MESA_SHADER_CACHE_DIR is set, it also points to an absolute path ("/home/<user>/.cache/mesa_shader_cache"), and this path is passed to the same concatenate_and_mkdir(). Why does passing "/home" fail while passing "/home/<user>/.cache/mesa_shader_cache" succeeds, considering that in both cases it walks an absolute path with the same prefix, trying to create intermediate dirs?

Are you sure MESA_SHADER_CACHE_DIR has an expanded value (/home/costa/.cache instead of ~/.cache)?
I'm asking because I found this rejected PR: https://github.com/flathub/io.gitlab.librewolf-community/pull/145/files

(In reply to André Costa from comment #13)

I do have read permission:

❯ ls -ld / /home
dr-xr-xr-x. 1 root root 182 fev 22 13:12 //
drwxr-xr-x. 1 root root  10 jul 16  2024 /home/

I agree, there's not much Firefox can do if this is handled inside Mesa. One of Mesa's developers argued that the Firefox sandbox could be interfering with these filesystem queries; could it?

It could be. Could you check whether the process id in the messages belongs to a Firefox content process (one started with -contentproc)?
/proc/<pid>/status should include a line like this:

 Seccomp:        2
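
If it helps, the same check can be scripted. Below is a small sketch that pulls the Seccomp field out of the status text; seccomp_mode_from_status() and seccomp_mode_of_pid() are hypothetical helper names, and the 8 KiB buffer is just an assumed upper bound for a status file. Per proc(5), 0 means seccomp is disabled, 1 is strict mode, and 2 is filter mode.

```c
#include <stdio.h>
#include <string.h>

/* Extracts the value of the "Seccomp:" field from the text of a
 * /proc/<pid>/status file. Returns -1 if the field is absent. */
int seccomp_mode_from_status(const char *status_text)
{
    const char *line = status_text;

    while (line != NULL && *line != '\0') {
        int mode;

        /* The literal "Seccomp:" must match exactly, so the longer
         * "Seccomp_filters:" key is rejected at the underscore. */
        if (sscanf(line, "Seccomp: %d", &mode) == 1)
            return mode;

        /* Move to the start of the next line, if there is one. */
        line = strchr(line, '\n');
        if (line != NULL)
            line++;
    }
    return -1;
}

/* Reads /proc/<pid>/status and returns its Seccomp mode, or -1 on error. */
int seccomp_mode_of_pid(long pid)
{
    char path[64];
    char text[8192];  /* assumed to be large enough for a status file */
    size_t n;
    FILE *f;

    snprintf(path, sizeof(path), "/proc/%ld/status", pid);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    n = fread(text, 1, sizeof(text) - 1, f);
    fclose(f);
    text[n] = '\0';
    return seccomp_mode_from_status(text);
}
```

A result of 2 for the pid shown in the log message would indicate the process runs under a seccomp filter, i.e. it is sandboxed.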

(In reply to Masatoshi Kimura [:emk] from comment #14)

Are you sure MESA_SHADER_CACHE_DIR has an expanded value (/home/costa/.cache instead of ~/.cache)?
I'm asking because I found this rejected PR: https://github.com/flathub/io.gitlab.librewolf-community/pull/145/files

Good catch, but I double-checked, and it has the expanded value. Actually, while testing this, I realized that set -Ux only works for subprocesses launched from fish; these variables are not available to my desktop environment. That means Firefox launched by gnome-shell was still showing the shader cache creation error, as shown by journalctl. I moved the variable configuration to /etc/environment, and now it is all good.

❯ grep MESA /etc/environment 
MESA_SHADER_CACHE_DIR=/home/costa/.cache/mesa_shader_cache/

So, the mystery remains...

(In reply to Thinker Li [:sinker] from comment #15)

It could be. Could you check whether the process id in the messages belongs to a Firefox content process (one started with -contentproc)?
/proc/<pid>/status should include a line like this:

 Seccomp:        2

Sorry, is -contentproc a Firefox argument? I tried it and it caused a SIGSEGV:

❯ firefox -contentproc
fish: Job 1, 'firefox -contentproc' terminated by signal SIGSEGV (Address boundary error)

Here's what else I tried:

❯ pgrep -l firefox
8928 firefox

❯ grep Seccomp /proc/8928/status
Seccomp:	0
Seccomp_filters:	0

❯ pidof firefox
10235 10203 9500 9449 9363 9340 9252 9178 9128 9125 9072 9068 9041 8928

❯ grep Seccomp /proc/10235/status
Seccomp:	2
Seccomp_filters:	1

(not sure if it helps, please feel free to point me in the right direction)

For example, quoting from your log in comment #0:

fev 11 18:18:36 org.mozilla.firefox.desktop[3997]: Failed to create /home for shader cache (Permission denied)---disabling.

3997 is the pid, so you should look at /proc/3997/status in this case.
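
For anyone who wants to automate that lookup, here is a small sketch that extracts the pid from a log line of this shape. pid_from_log_line() is a hypothetical helper, and it assumes the first "[" in the line opens the pid field, as in the journal output quoted above.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Returns the pid found between the first '[' and the following ']' of a
 * syslog-style line, or -1 if the line has no such field. */
long pid_from_log_line(const char *line)
{
    const char *open = strchr(line, '[');
    char *end;
    long pid;

    if (open == NULL)
        return -1;

    errno = 0;
    pid = strtol(open + 1, &end, 10);

    /* Reject empty fields, trailing junk before ']', overflow, and
     * non-positive values. */
    if (end == open + 1 || *end != ']' || errno != 0 || pid <= 0)
        return -1;
    return pid;
}
```

The returned number is what goes into the /proc/<pid>/status path.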

I have checked /proc/<pid>/smaps for Firefox processes. Only the chrome/parent process links to Mesa, so it should not be hindered by the Firefox sandbox. Firefox doesn't sandbox the chrome process.

Your environment may be different, so you should check whether the pid in the error message belongs to the main/parent process. If it is the parent process, the error should not be caused by the Firefox sandbox.

I can independently confirm that these messages are coming from a content process.

env[3805]: Failed to create /home for shader cache (Permission denied)---disabling.
$ cat /proc/3805/status |grep Sec
Seccomp:        2
$ ps -p 3805 --no-headers -o cmd |cat
/usr/lib/firefox/firefox -contentproc -parentBuildID 20250130195129 -prefsHandle 0 -prefsLen 32777 -prefMapHandle 1 -prefMapSize 269484 -sandboxReporter 2 -chrootClient 3 -ipcHandle 4 -initialChannelId xxx -parentPid 3648 -crashReporter 5 -appDir /usr/lib/firefox/browser 3 rdd

I have checked /proc/<pid>/smaps for Firefox processes. Only the chrome/parent process links to Mesa, so it should not be hindered by the Firefox sandbox. Firefox doesn't sandbox the chrome process.

Your environment may be different, so you should check whether the pid in the error message belongs to the main/parent process. If it is the parent process, the error should not be caused by the Firefox sandbox. (Seccomp should be 0 for the parent process.)

(In reply to Colin S from comment #20)

$ ps -p 3805 --no-headers -o cmd |cat
/usr/lib/firefox/firefox -contentproc -parentBuildID 20250130195129 -prefsHandle 0 -prefsLen 32777 -prefMapHandle 1 -prefMapSize 269484 -sandboxReporter 2 -chrootClient 3 -ipcHandle 4 -initialChannelId xxx -parentPid 3648 -crashReporter 5 -appDir /usr/lib/firefox/browser 3 rdd

Thank you for the confirmation. We may need someone familiar with the media subsystem to look into this issue.

(In reply to André Costa from comment #16)

So, the mystery remains...

Ah, if MESA_SHADER_CACHE_DIR is set, disk_cache_delete_old_cache, which in turn calls disk_cache_generate_cache_dir, will not be called. The other caller is gated by disk_cache_enabled. In content processes, disk_cache_enabled returns false, so disk_cache_generate_cache_dir should not be called. Therefore this MR is the proper fix of this bug IMO.
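
The gating described here can be written down as a tiny model. This is not Mesa source: struct cache_env and both functions are hypothetical, and the "after" variant assumes the MR adds the disk_cache_enabled gate to the old-cache cleanup path, which this thread implies but does not quote.

```c
#include <stdbool.h>

/* Simplified model of the state discussed in this thread; not Mesa's types. */
struct cache_env {
    bool cache_dir_env_set;  /* MESA_SHADER_CACHE_DIR is set */
    bool cache_enabled;      /* what disk_cache_enabled would return */
};

/* Before the fix (as described above): the old-cache cleanup computes the
 * default $HOME/.cache path whenever no directory override is set, even if
 * the cache itself is disabled. */
bool probes_default_dir_before_fix(struct cache_env env)
{
    return !env.cache_dir_env_set;
}

/* After the fix (assumed shape): the cleanup path is skipped as well when
 * the cache is disabled. */
bool probes_default_dir_after_fix(struct cache_env env)
{
    return !env.cache_dir_env_set && env.cache_enabled;
}
```

In this model the RDD process (no override, cache disabled) touches the default directory only in the "before" variant, which matches the log spam, and setting MESA_SHADER_CACHE_DIR avoids it in both.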

(In reply to Colin S from comment #20)

/usr/lib/firefox/firefox -contentproc -parentBuildID 20250130195129 -prefsHandle 0 -prefsLen 32777 -prefMapHandle 1 -prefMapSize 269484 -sandboxReporter 2 -chrootClient 3 -ipcHandle 4 -initialChannelId xxx -parentPid 3648 -crashReporter 5 -appDir /usr/lib/firefox/browser 3 rdd

The -contentproc flag is poorly named; the process type is passed as the last argument, and this is the RDD (Remote Data Decoder) process, which matches comment #0 mentioning AV1 decoding.

But, in theory we turn off Mesa's shader cache for that process. (My understanding is that it's not really doing graphics rendering that would need shaders, or at least not nontrivial shaders, but it does need an EGL context for hardware-assisted decoding and/or for efficiently passing the decoded frames to the compositor.) So I don't know why Mesa is still trying to access the cache directory.

(In reply to Jed Davis [:jld] ⟨⏰|UTC-8⟩ ⟦he/him⟧ from comment #24)

So I don't know why Mesa is still trying to access the cache directory.

Comment #23 hadn't been posted yet when I wrote this. That Mesa MR looks like it explains why we're getting this error message even though the cache should be disabled.

As far as we know, the Mesa MR mentioned in comment #23 has been merged, and this error message doesn't cause any obvious problem beyond the message itself. I think this bug should be WONTFIX once someone can test with the fixed Mesa, right?

(In reply to Masatoshi Kimura [:emk] from comment #23)

Ah, if MESA_SHADER_CACHE_DIR is set, disk_cache_delete_old_cache, which in turn calls disk_cache_generate_cache_dir, will not be called. The other caller is gated by disk_cache_enabled. In content processes, disk_cache_enabled returns false, so disk_cache_generate_cache_dir should not be called. Therefore this MR is the proper fix of this bug IMO.

Thanks for the explanation @Masatoshi, now it makes sense to me. The tricky part (which I would never have guessed on my own) is that disk_cache_enabled() returns false for content processes even if MESA_SHADER_CACHE_DISABLE is not set (probably because in this case __normal_user() returns false).

I agree with you and @Thinker, the proposed Mesa MR properly addresses the problem. Until it is available, to avoid log spamming one should set MESA_SHADER_CACHE_DIR.

FYI Mesa 25.0.0 contains the fix.

Summary: Firefox fails to create directory for the shader cache on Linux (Fedora 41) → Mesa fails to create the shader cache even if disabled (Fedora 41)
Status: UNCONFIRMED → NEW
Ever confirmed: true
Severity: -- → S4
OS: Unspecified → Linux
Priority: -- → P3
Hardware: Unspecified → All
Version: Firefox 135 → Trunk
Duplicate of this bug: 1921742
Status: NEW → RESOLVED
Closed: 22 days ago
Resolution: --- → MOVED