Amazon reduces Alexa latency by a quarter by switching to its own Inferentia chips


Amazon now runs the majority of queries submitted to its Alexa voice assistant on its own Inferentia chips. The company wants to move away from Nvidia GPUs in its data centers and rely on its own hardware instead.

Amazon writes in a blog post that it wants to handle the machine learning behind Alexa queries on its own chips. It does this with Amazon EC2 Inf1 instances, which are powered by the Inferentia chips used in Amazon Web Services. Inferentia was built by AWS specifically to accelerate machine learning inference. Each chip contains four NeuronCores and a large amount of on-chip cache, which keeps model data close to the compute units. According to Amazon, this results in lower latency, among other benefits.
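For context, deploying a model to Inf1 typically means compiling it ahead of time with the AWS Neuron SDK so it can run on the NeuronCores. The sketch below is a minimal, illustrative example assuming the SDK's PyTorch integration (torch-neuron) and a stock ResNet-50; the model and file names are placeholders, not anything Amazon has described for Alexa.

```python
import torch
import torch_neuron  # AWS Neuron SDK integration for PyTorch (pip package: torch-neuron)
import torchvision.models as models

# A standard pretrained model as a stand-in for an arbitrary inference workload.
model = models.resnet50(pretrained=True)
model.eval()

# Compile the model ahead of time so it targets the Inferentia NeuronCores.
example_input = torch.zeros([1, 3, 224, 224])
neuron_model = torch.neuron.trace(model, example_inputs=[example_input])

# Save the compiled artifact; on an Inf1 instance it can be loaded with
# torch.jit.load() and called like a regular TorchScript module.
neuron_model.save("resnet50_neuron.pt")
```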

Amazon says that “the vast majority” of Alexa workloads now run on those Inferentia chips. According to the company, the switch has so far resulted in a 25 percent reduction in latency and a 30 percent reduction in costs. Until now, Amazon used Nvidia’s T4 GPUs for these calculations, but it plans to phase those out in the long run.

Incidentally, the switch only concerns the text-to-speech part of handling Alexa commands. That was the only aspect of the technology behind the voice assistant that still ran on dedicated GPUs. Other parts of the pipeline, including Automatic Speech Recognition and Natural Language Understanding, were already handled on other chips.

According to Amazon, its image analysis service Rekognition, which includes facial recognition, is also being migrated to Inferentia chips. There, latency is said to be eight times lower than with traditional GPU calculations. However, Amazon will not say which hardware the service ran on previously.
