Firefox connection issues were due to Google change and library bug
Firefox was largely unusable for a few hours in mid-January. As it turned out, a Google service had switched to HTTP/3 without announcement. This exposed a bug in which a library used by the browser converted the uppercase letters in a header name to lowercase, after which the HTTP/3 code could no longer find the header.
Some Firefox users were unable to visit websites for more than four hours on January 13, although disabling HTTP/3 turned out to help. Firefox Tech Lead and Senior Staff Security Engineer Christian Holler explains on a Mozilla blog what caused this. The problem arose when Google Cloud Platform made HTTP/3 its default HTTP version. Google had not announced this beforehand, and Firefox adopted the change automatically. Firefox uses cloud services like GCP for updates, telemetry, certificate management, crash reports, and other functions.
Mozilla soon learned that the switch to HTTP/3 had led to the outage. Once HTTP/3 was turned off and browsers started using HTTP/2 again, the problem disappeared. However, it was not yet clear why HTTP/3 caused problems.
The cause turned out to lie in the parts of Firefox that are written in Rust, specifically the Telemetry component. All HTTP/3 connections go through Necko, Firefox’s network stack. Rust components, however, do not talk to Necko directly but go through the intermediate library viaduct first. That is where things went wrong.
The lower-level HTTP/3 code requires the Content-Length header, which Necko adds automatically if a request does not already have it. However, when checking whether the header is present, Necko ignores the distinction between lowercase and uppercase letters. This turned out to be a problem, because viaduct converts all uppercase letters in header names to lowercase. Content-Length thus becomes content-length, without capital letters.
For Rust code that had not set the header itself, nothing went wrong. Viaduct does not create headers; it only lowercases the ones that exist. Such requests therefore reached Necko without the header, and Necko generated the correctly capitalized Content-Length header.
Telemetry was the only Rust-based Firefox component that was setting the Content-Length header itself at the time of the outage. Viaduct turned this into content-length, which Necko accepted as present and left unchanged. Because the lower-level HTTP/3 code looks for the header with its capitalized spelling, it could not find the header that Necko said was there. This resulted in an infinite loop without an error message. That loop, in turn, blocked other network communication and rendered the browser unusable. Rust components that did not set the header kept working, because in their case the header was only added by Necko. As a result, users who had disabled Telemetry could still use the browser.
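The chain of events can be illustrated with a small model. This is only a sketch of the mismatch described above, not Firefox code: the function names `viaduct_forward`, `necko_prepare`, `http3_find_length` and `send` are hypothetical, and the real failure was an infinite loop, where this model simply returns `None`.

```python
# Illustrative model only; none of these functions are real Firefox APIs.

def viaduct_forward(headers):
    """Lowercase every header name, as viaduct is described as doing."""
    return {name.lower(): value for name, value in headers.items()}


def necko_prepare(headers, body):
    """Add Content-Length unless one is present (case-insensitive check)."""
    if not any(name.lower() == "content-length" for name in headers):
        headers = {**headers, "Content-Length": str(len(body))}
    return headers


def http3_find_length(headers):
    """Case-sensitive lookup, standing in for the lower-level HTTP/3 code."""
    return headers.get("Content-Length")


def send(headers, body):
    """Pass a request through viaduct, then Necko, then the HTTP/3 layer."""
    headers = viaduct_forward(headers)
    headers = necko_prepare(headers, body)
    return http3_find_length(headers)


body = b"ping"

# A component that sets no header: Necko adds the capitalized one.
without_header = send({}, body)

# A component that sets the header itself (like Telemetry): viaduct
# lowercases it, Necko sees it as present, and the lookup finds nothing.
with_header = send({"Content-Length": "4"}, body)

print(without_header)  # 4
print(with_header)     # None
```

The two layers disagree about whether the header exists only because one compares names case-insensitively and the other does not. Making both comparisons use the same rule would remove the mismatch in this model.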
Mozilla says it is now in contact with Google to prevent such unexpected changes. The company acknowledges that an announcement does not completely eliminate the risk of such an incident, but says that it allows for more testing. In addition, Mozilla will review all service configurations to see whether other services also simply adopt another service’s default settings, as happened with the switch to HTTP/3. Finally, Mozilla wants to run more tests with different HTTP versions and speed up its incident response.