Audio dongles and the ghost of USB 1

USB-C to 3.5mm headphone dongles, a new (but also old) problem.

I’ve got some fuel I need to throw onto a fire: USB-C for headphones on phones. I’m not the first person to complain about it, but I’ve run into something people don’t talk about, and weirdly it goes all the way back to the 1990s and USB 1.
Brief recap.
In 2016 Apple dropped the 3.5mm headphone socket from the iPhone; the main idea is that you should use their wireless AirPods instead (currently selling for about £129 a set). Android devices have followed this example; Samsung started dropping 3.5mm in 2019. If you want to keep using wired headphones, you need a Lightning or USB-C to 3.5mm adapter, or a pair of headphones with a USB-C connector. (Lightning was essentially Apple’s USB-C predecessor.)
It’s easy to think these are just wires that join two different ports together: a bit of an annoyance, but nothing important. In most cases that’s wrong.
I’ll assume you already know a few things; if terms like ADC and DAC aren’t familiar, try this Android Authority article from 2016: https://www.androidauthority.com/3-5mm-audio-usb-type-c-701507/.

The ghost of USB1

USB started at version 1 and the most recent is version 4. Most new phones will be USB 2 or USB 3, but what really matters are the actual protocols. USB 1 had Low Speed (LS) at 1.5Mb/s and Full Speed (FS) at 12Mb/s, USB 2 added High Speed (HS) at 480Mb/s, and USB 3.0 introduced Super Speed (SS) at 5Gb/s; later versions add faster protocols. USB 3 devices can still talk Low Speed or Full Speed. USB-C is not a USB version, it’s a connector. When USB-C was introduced the USB 2 standard was updated to include it, so LS (1.5Mb/s) or FS (12Mb/s) devices with a USB-C connector are considered USB 2, not USB 1.
[Figure: bar charts comparing Low, Full, High and Super Speed USB bandwidth.]
^ High Speed and Super Speed modes are far faster than the original USB Full Speed.

How full is full speed?

You’ve probably guessed where this is going: nearly all USB-C to 3.5mm headphone adapters use USB FS to communicate. Is 12Mb/s still plenty for an audio device? Maybe…
USB audio interfaces date back to USB 1 days, as a way of easily attaching extra (often quite specialised) sound devices to a computer in the late 1990s. A headphone dongle packages this up into a little chip hidden inside one of its plugs.
Because big numbers are easier to market than quality, these devices come with crazy sample rates, generally 96kHz 24 bit, the highest USB 1 can do in stereo. And because they are meant to cosplay as just a wire between your phone and headphones, there is usually no way to configure them.
Digital stereo sound at 96kHz 24 bit works out at 4.6Mb/s (96k samples per second, 24 bits per sample, 2 channels). This is uncompressed sound and uses a surprising amount of bandwidth, but even USB 1 could deal with this.
UAC (“USB Audio Class”) transfers are “isochronous”, which just means scheduled at regular intervals. On a USB FS bus only 80% of the bandwidth can be used for isochronous transfers, to leave room for other communication types, giving a limit of 9.6Mb/s.
Next there are protocol overheads: raw data has to be put into packets for transfer, which increases its size a bit. This is where we really begin to have problems. For example, 192kHz 24bit stereo would be 9.2Mb/s of data, but adding the protocol overhead takes it over the 9.6Mb/s limit, so USB FS devices cannot do 192kHz 24bit stereo (you might see 192kHz 16bit, which will work).
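The arithmetic above can be checked with a few lines of Python. This is a rough sketch, not a real bus-time calculation: it takes the 80% (9.6Mb/s) isochronous limit from the text as given and models protocol overhead only as worst-case bit stuffing (a stuffed bit after every six consecutive 1s, so at most 7/6 of the raw bits). Real USB bus-time formulas also add a fixed cost per packet, so treat these numbers as lower bounds. The function names are illustrative.

```python
# Assumptions (see lead-in): 80% isochronous share, overhead = bit stuffing only.
FS_ISO_LIMIT_MBPS = 12.0 * 0.8  # 9.6 Mb/s
BIT_STUFFING = 7 / 6            # worst case: one stuffed bit per six data bits


def pcm_mbps(sample_rate_hz, bits_per_sample, channels):
    """Raw (uncompressed) PCM payload rate in Mb/s, no USB overhead."""
    return sample_rate_hz * bits_per_sample * channels / 1_000_000


def fs_wire_mbps(sample_rate_hz, bits_per_sample, channels):
    """Payload plus worst-case bit stuffing; ignores fixed per-packet costs."""
    return pcm_mbps(sample_rate_hz, bits_per_sample, channels) * BIT_STUFFING


def fits_full_speed(sample_rate_hz, bits_per_sample, channels):
    """True if the stream stays under the assumed isochronous limit."""
    return fs_wire_mbps(sample_rate_hz, bits_per_sample, channels) <= FS_ISO_LIMIT_MBPS


if __name__ == "__main__":
    formats = [(96_000, 24, 2), (192_000, 16, 2), (192_000, 24, 2), (192_000, 24, 1)]
    for rate, bits, ch in formats:
        raw = pcm_mbps(rate, bits, ch)
        wire = fs_wire_mbps(rate, bits, ch)
        verdict = "fits" if fits_full_speed(rate, bits, ch) else "does not fit"
        print(f"{rate // 1000}kHz {bits}bit x{ch}: {raw:.3f} raw, {wire:.2f} Mb/s -> {verdict}")
```

Even this optimistic model puts 192kHz 24bit stereo at about 10.75Mb/s, over the limit, while 192kHz 16bit stereo (about 7.17Mb/s) and 192kHz 24bit mono fit. The 96kHz 24bit stereo case comes out at about 5.4Mb/s, a little under the 5.5Mb/s quoted later for a real device, which is what you would expect from leaving out the per-packet cost.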
Then add the microphone input. Headset mics are generally mono, and mostly don’t bother with 24 bit. A 16bit, 48kHz mono stream is 0.768Mb/s. There’s also a human interface device (HID) stream to support a call button. This uses interrupt rather than isochronous transfers, but those are also periodic, so although it carries very little data it still needs to be scheduled in. Still looking fine.
At this point it’s important to remember that all of this counts towards the 12Mb/s FS limit for the whole USB bus, and within that the 9.6Mb/s limit on isochronous transfers. HS or faster devices on the same bus can transfer more than this, but they can’t talk while FS communication happens, so the percentage of FS bandwidth in use takes out the same percentage of HS bandwidth. Of course, on a phone there’s usually only one external USB port.

Booking it in

Finally we have scheduling. Full Speed supports 12Mb/s, but there is a fundamental difference between USB 1’s LS and FS modes and the faster USB modes. USB 1 transfers take place in a series of 1 millisecond (1ms) frames. 1ms is a long time at 480Mb/s or faster, so High Speed and above use 125 microsecond (125us) microframes, eight to each frame. LS and FS transfers need to be packed into the microframes in a certain way, with space left for transfers to complete. Basically the USB controller has to slow down to talk to LS and FS devices. (USB-1-only OHCI and UHCI controllers have less complex restrictions.)
[Illustration: a row of 1ms USB 1 frames above a row of 125us USB 2+ microframes, eight microframes to each frame.]

Breaking point

I first really noticed a problem with all this while trying to use an Anker adaptor to plug a headset into a laptop. It has a 24bit microphone stream and (for some reason) 2 microphone channels, a bandwidth of 2.3Mb/s, three times our earlier estimate of 0.768Mb/s. Add 4.6Mb/s for the headphones (even at 48kHz the USB stream still needs to reserve enough for 96kHz), making 6.9Mb/s. The call start/end button on this dongle reserves about 0.5Mb/s. Protocol overheads push the bandwidth requirement for output up to 5.5Mb/s and for the microphone to 2.8Mb/s, so the total is about 8.8Mb/s. We’re starting to nudge up against the limits of USB Full Speed.
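Because Full Speed moves 12 bits every microsecond, a bandwidth in Mb/s converts directly into microseconds of bus time per 1ms frame (multiply by 1000, divide by 12). A small Python sketch using the Anker figures quoted above; the 9.6Mb/s limit is the one used in this article, and the helper and constant names are illustrative.

```python
FS_BITS_PER_US = 12   # Full Speed: 12 Mb/s is 12 bits every microsecond
ISO_LIMIT_MBPS = 9.6  # isochronous limit used in this article

# Bandwidth reserved by each Anker stream, overheads included (figures from the text).
ANKER_MBPS = {"output": 5.5, "microphone": 2.8, "call button": 0.5}


def mbps_to_frame_us(mbps):
    """Microseconds of Full Speed bus time needed in every 1ms frame."""
    bits_per_ms = mbps * 1000  # 1 Mb/s is 1000 bits per millisecond
    return bits_per_ms / FS_BITS_PER_US


if __name__ == "__main__":
    for name, mbps in ANKER_MBPS.items():
        print(f"{name:12s} {mbps:3.1f} Mb/s ~ {mbps_to_frame_us(mbps):3.0f} us per frame")
    total = sum(ANKER_MBPS.values())
    print(f"total        {total:3.1f} Mb/s ~ {mbps_to_frame_us(total):3.0f} us "
          f"({100 * total / ISO_LIMIT_MBPS:.0f}% of the isochronous limit)")
```

The per-stream results (about 458, 233 and 42us) land within a few microseconds of the 458, 234 and 39us blocks in the EHCI breakdown that follows; the differences come from rounding the Mb/s figures. The total, about 733us or 92% of the isochronous limit, is what leaves so little slack for scheduling.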

Booked out

Looking at scheduling, here’s how Linux’s EHCI driver sees the situation. Every millisecond there are 8 microframes of 125 microseconds, and into each microframe it can fit this many microseconds of isochronous transfer:
125 125 125 125 125 125 30 0 us
(total 780 microseconds, or 9.36Mb/s)
If we break the different streams up into these blocks, we have output:
125 125 125 83 us
Microphone:
125 109 us
Call button:
39 us
Line them up and we get:
125 125 125 83 125 109 39 0 us

That last 39 doesn’t fit into the 30us allowed for the 7th microframe.
In fact, this won’t work on a Linux system that has an EHCI controller (USB 2 era). Try to use this device in duplex mode and you will get the error:
cannot submit urb 0, error -28: not enough bandwidth
The scheduling rules prevent simply squashing all the transfers together: transfers of 125us or more have to start on a fresh microframe to allow earlier ones to complete.
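The lining-up step can be sketched in Python. This is a deliberately simplified model of the description above (each stream is cut into pieces of at most 125us, and every stream starts on a fresh microframe), not the actual algorithm in Linux's EHCI scheduler. The budget list is the one quoted above; `pieces` and `line_up` are hypothetical names.

```python
# Per-microframe budget (microseconds) for Full Speed periodic transfers,
# as quoted above for Linux's EHCI driver: 780us in total per 1ms frame.
FS_BUDGET_US = [125, 125, 125, 125, 125, 125, 30, 0]
MICROFRAME_US = 125


def pieces(total_us):
    """Cut one stream's per-frame bus time into microframe-sized pieces."""
    out = []
    while total_us > 0:
        piece = min(total_us, MICROFRAME_US)
        out.append(piece)
        total_us -= piece
    return out


def line_up(streams, budget=FS_BUDGET_US):
    """Simplified model: place each piece in the next microframe, in order.

    Returns (usage, problems): microseconds placed in each microframe, and a
    message for every piece that overruns its microframe's budget.
    """
    usage = [0] * len(budget)
    problems = []
    slot = 0
    for name, total_us in streams:
        for piece in pieces(total_us):
            if slot >= len(budget):
                problems.append(f"{name}: no microframe left for {piece}us")
                continue
            usage[slot] = piece
            if piece > budget[slot]:
                problems.append(f"{name}: {piece}us does not fit in microframe "
                                f"{slot + 1} (budget {budget[slot]}us)")
            slot += 1
    return usage, problems


if __name__ == "__main__":
    # Anker figures from the text: output, microphone, call button.
    usage, problems = line_up([("output", 458), ("microphone", 234), ("call button", 39)])
    print(usage)     # [125, 125, 125, 83, 125, 109, 39, 0]
    print(problems)  # ['call button: 39us does not fit in microframe 7 (budget 30us)']
```

Feeding in rough UGreen numbers instead (about 458, 83 and 25us, converting the 5.5, 1.0 and 0.3Mb/s figures given later at 12 bits per microsecond) leaves every piece inside its budget, which matches that dongle working on the same laptop.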
Why does it mostly work on my phone, then? More recent devices have an xHCI controller rather than an EHCI one, even if they only support USB 2, and the hardware in xHCI controllers seems to do more of the scheduling work for LS/FS transfers. But even on a newer USB 3.2 laptop, where the dongle does work in full duplex, I can see the thrashing going on in the background if the USB bus has other devices on it (common on laptops, where webcams and Bluetooth are internally connected to USB):
[12089.631824] usb 3-1: Not enough bandwidth for new device state.
[12089.631826] usb 3-1: Not enough bandwidth for altsetting 1
It’s hard to be sure, but I think pushing USB FS right to its limit is probably also responsible for the buzzing the Anker dongle sometimes produces. Maybe the phone dealing with a weak signal or changing cells is enough to push it over the edge.

Was it acceptable in the 90s?

Why is this a problem now if it wasn’t in the 90s? There are a few pieces to the answer. The first is that it sometimes was a problem, but hardware that hit these sample rates and bit depths was mostly used by professionals, devices were often used only to record or only to play back at any one time (not duplex), and sample rates and depths could be adjusted if it became a problem.
USB 2 also changed things. In USB 1 days it was more common to have separate controllers for each port; USB 2’s High Speed mode is so much faster that this was no longer necessary, and it became common for multiple ports to share one controller. The scheduling needed to run Full Speed and Low Speed on the faster High Speed (and above) buses also causes problems. Even after USB 2 arrived, computers might have had a dedicated USB 1 controller (UHCI or OHCI), or even one per port, working alongside the USB 2 EHCI controller to avoid the scheduling problems.

A tale of three dongles

I now have three of these things.

Anker USB-C 3.5mm adaptor
UGreen USB-C to 3.5mm adapter
Cubilux Hi-Res DAC Black and red (and, yes, the colour is important!)

They all look very similar (right down to the braided cable), and yet they all act a little differently. How is this even possible?
[Photo: four USB to 3.5mm dongles with braided cables. From top to bottom: Cubilux (USB-C, black with red trim on the plug), Anker (USB-C, black with a silver 3.5mm barrel), UGreen (USB-C, silver) and UGreen (USB-A, silver, with a longer cord).]

The personalities of my three dongles.

The Anker is the first one I had; for about a year it was the only one. Mostly it worked without problems (for calls, listening to audio and Duolingo; we’ll come back to those activities). But I noticed that sometimes, most frequently when using it on a train or in a car, I got a very consistent buzzing when using Duolingo. As far as I can tell it’s at a fixed volume and pitch, and doesn’t go away if I unplug and reconnect either end of the dongle. Weird, but livable. The second quirk I only discovered when I tried to use it with a laptop: it could play sound, it could record, but it couldn’t do both at once (using Teams or Zoom).
I now know why the Anker has problems on my laptop, and after I figured it out I got the Cubilux and the UGreen to investigate a bit more. They don’t have the laptop problem, but I was surprised to find they both have their own issues.
First, both have more microphone noise than the Anker; on recording quality alone the Anker is better. Second, the Cubilux had slightly noticeable latency issues when using Duolingo. This was interesting because Bluetooth latency issues (and microphone quality) are among the reasons I sometimes use these dongles.
Lastly, they don’t have the occasional buzzing the Anker does; to test that, I’ve taken them with me on journeys and swapped from the Anker to the other two when the buzzing happens.

The rest?

The Anker’s problem is simply trying to use too much bandwidth. What about the Cubilux and UGreen devices? Well, they are different, and in different ways.
First the UGreen. It’s very similar to the Anker, but the microphone input is 16bit 48kHz mono, which makes a big difference. It defaults to 24bit output and 16bit microphone, taking (including overheads) 5.5Mb/s + 1.0Mb/s = 6.5Mb/s, plus 0.3Mb/s for the control switch, which works fine. Amazon and the UGreen website say nothing about the 16bit input, even though it’s actually an advantage here; instead you get “The USB C Audio Adapter is offered up to 24bit/96Khz resolution with internal DAC chip while others are 16bit/48Khz.”, which gives you a taste of the claims usually made. As a bonus, I also have the Type-A version and it has exactly the same chipset.
The Cubilux. I picked this device because it has 192kHz 24bit playback (they claim this “could restore the details of the sound”...). As mentioned, this should need High Speed operation. Does it? Yes! It connects at High Speed and supports 16 and 24bit, with output at 192kHz, 96kHz, 48kHz and 44.1kHz and input at 48kHz and 44.1kHz (stereo for some reason, like the Anker). I was specific about the colour, “Cubilux Hi-Res DAC Black and red”, because the other colours they sell are mostly 96kHz; there is also a 384kHz model, and a red and black “hifi” one which is 96kHz.
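If you want to check what speed a dongle actually connects at, Linux exposes it in sysfs: each USB device directory under /sys/bus/usb/devices has a `speed` file giving the negotiated rate in Mb/s, so a Full Speed dongle reports 12 and a High Speed one like the Cubilux reports 480. A minimal Python sketch, assuming a Linux system with sysfs mounted in the usual place; `usb_speeds` and `SPEED_NAMES` are illustrative names.

```python
from pathlib import Path

# Strings the kernel writes to a device's "speed" attribute, in Mb/s.
SPEED_NAMES = {"1.5": "Low Speed", "12": "Full Speed", "480": "High Speed", "5000": "Super Speed"}


def usb_speeds(root="/sys/bus/usb/devices"):
    """Yield (device, product, speed) for every entry under root with a speed file."""
    base = Path(root)
    if not base.is_dir():
        return  # not Linux, or sysfs not mounted: yield nothing
    for dev in sorted(base.iterdir()):
        speed_file = dev / "speed"
        if not speed_file.is_file():
            continue  # skip entries without a speed attribute
        speed = speed_file.read_text().strip()
        product_file = dev / "product"
        product = product_file.read_text().strip() if product_file.is_file() else "?"
        yield dev.name, product, SPEED_NAMES.get(speed, f"{speed} Mb/s")


if __name__ == "__main__":
    for name, product, speed in usb_speeds():
        print(f"{name:12s} {speed:12s} {product}")
```

Root hubs show up in the list too, reporting their own controller's speed, so look for the entry whose product string matches the dongle.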
As a result both work fine on the older laptop with EHCI: one doesn’t need as much bandwidth, and the other uses USB HS, so more bandwidth is available. However, they both have more microphone noise than the Anker. As for the Duolingo latency that sometimes occurs with the Cubilux but doesn’t affect the other two, I’m not clear on the cause, although it could be something to do with the higher playback sample rate.

Conclusion

All of this is a pity. The microphone clarity of a cheap (£12) wired headset still easily beats that of all my budget (£20+) Bluetooth ones. Even a pair of Sony WH-1000XM3 is only just about comparable. And I don’t mean “in perfect listening conditions”; I mean calls involving sentences like “You sound like you’re underwater” or “I’ll call you back.”
Most available dongles seem to be USB FS and can still have these issues if plugged in alongside other devices (the specs for Apple, Samsung and Google dongles, as well as most generic ones, strongly suggest FS). Guess what else is often USB FS? Internally connected Bluetooth interfaces, so you can find a laptop’s Bluetooth module fighting with an audio dongle.
There’s no reason USB audio devices can’t use High Speed or faster modes, which would completely avoid this situation. However, most listings appear to be USB FS (based on showing 96kHz 24 bit or 192kHz 16 bit as their maximum rate), and almost none tell you this in their listed specs. I think this is the main issue with these things: most boast about their highest headphone sample rate, but almost none say whether they are USB FS or not, what their microphone interface is, or even what audio modes they support. The few adaptors that actually advertise as being UAC-2 are often expensive audiophile ones without microphone support. One sign might be a 24bit 192kHz mode (not 16bit 192kHz), which FS can only support in mono, and this is how I found the Cubilux. It’s a crazy sample rate, but it could be a useful indicator that the device can do HS.
xHCI controllers, found in most new devices (phones or computers) even if they only support USB 2, appear to deal a bit better with the scheduling situation than EHCI, but they don’t remove the problem.

Coda

...actually, it’s even weirder than that

As I’ve mentioned, my main issue with the Anker was trying to use it on a laptop, where you’ve got a bit more control over the device than on a phone. So my first thought was: can I just run it in a lower bandwidth mode?
USB devices like these dongles can offer several “alt” modes (alternate settings) for each interface, which take different amounts of bandwidth. They’re defined by the manufacturer. On the Anker the input and output each have separate alt modes for 16bit and 24bit operation. (Not for sample rates: 44.1kHz up to 96kHz are in the same alt mode, meaning it always needs the bandwidth for full 96kHz output or 48kHz input operation.)
By default the largest sample depth (24bit) is often chosen, but the 16bit modes should require less bandwidth. On Linux (Fedora 39) with PipeWire and WirePlumber it’s possible to force the system to use the 16bit modes.
Forcing 16bit operation with WirePlumber still had the same problem: playback and recording worked separately, but not in duplex. It turns out the 16bit output interface is misconfigured in firmware and claims to require 768 byte packets (this should be the amount for 1ms worth of samples at the highest rate, which is actually 384 bytes), so the USB driver thinks it needs to allocate 7.3Mb/s for playback (compared to 5.5Mb/s for 24bit playback). The microphone stream goes down to 1.9Mb/s, but this is outweighed by the playback stream.
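The expected packet size is simple to work out: one millisecond of samples at the highest rate the alt mode covers, times the number of channels, times bytes per sample. A Python sketch (the function name is illustrative) comparing that with the 768 bytes the Anker’s firmware declares for 16bit output:

```python
def bytes_per_ms(max_rate_hz, bits_per_sample, channels):
    """Bytes of audio in one 1ms Full Speed frame at the alt mode's top rate."""
    return max_rate_hz // 1000 * channels * bits_per_sample // 8


if __name__ == "__main__":
    expected_16 = bytes_per_ms(96_000, 16, 2)  # 384
    expected_24 = bytes_per_ms(96_000, 24, 2)  # 576
    claimed_16 = 768  # what the Anker's 16bit output alt mode declares (from the text)
    print(f"16bit stereo: expected {expected_16} bytes, firmware claims {claimed_16} "
          f"({claimed_16 / expected_16:.0f}x)")
    print(f"24bit stereo: expected {expected_24} bytes")
```

So the misconfigured 16bit mode asks for a third more than the 24bit mode it is supposed to undercut (768 against 576 bytes), which lines up with the 7.3Mb/s against 5.5Mb/s figures above.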
If only the microphone stream is forced to 16bit then things actually work, although not much space is left on the bus.

What about the other two? The UGreen has a similar layout of alt modes: one 16bit mono alt mode for input (44.1kHz and 48kHz), and 16bit and 24bit stereo alt modes for output (44.1kHz, 48kHz, 96kHz). Unlike the Anker, its 16bit output alt mode is set for 384 byte packets.
The Cubilux, on the other hand, has a separate stereo alt mode for every supported sample rate and bit depth combination on input and output. That means it should use the minimum bandwidth needed for whatever configuration is used, although being USB HS, even the highest bitrates only need a fraction of the available bandwidth. However, the packet size for each mode is about 30% larger than it should be, and the stereo input needs twice as much bandwidth as mono would.
If there’s a take-home message, this is another reason USB-C audio dongles are a terrible compromise. They need to be cheap and simple, but they also have to contain audio adaptors which make quite high demands on the USB bus. (And because they have to be simple, you usually have little or no way to configure them.)

Electric Penguinland