While many people still think of academic research when it comes to deep learning, Snap Inc. has been applying deep learning models to improve its recommendation engines on a daily basis. Using Google’s Cloud Tensor Processing Units (TPUs), Snap has accelerated its pace of innovation and model improvement to enhance the user experience.

Snap’s blog post, “Training Large-Scale Recommendation Models with TPUs,” tells the story of how the Snap ad ranking team leveraged Google’s leading-edge TPUs to train deep learning models quickly and efficiently. But there’s a lot more to the story than the how, and that’s what we’re sharing here.

Faster leads to better

Snap’s ad ranking team is charged with training the models that make sure the right ad is served to the right Snapchatter at the right time. With 300+ million users daily and millions of ads to rank, training models quickly and efficiently is a large part of a Snap ML engineer’s daily workload. It’s simple, really: the more models Snap’s engineers can train, the more likely they are to find the models that perform better—and the less it costs to do so. Better ad recommendation models translate to more relevant ads for users, driving greater engagement and improving conversion rates for advertisers.

Over the past decade, there has been tremendous evolution in the hardware accelerators used to train large ML models like those Snap uses for ad ranking, from general-purpose multicore central processing units (CPUs) to graphics processing units (GPUs) to TPUs.

TPUs are Google’s custom-developed application-specific integrated circuits (ASICs) used to accelerate ML workloads. TPUs are designed from the ground up to minimize time to accuracy when training large models. Models that previously took weeks to train on other hardware platforms can now be trained in hours on TPUs, a product of Google’s leadership and experience in machine learning (dig into the technology in Snap’s blog).

Benchmarking success

Snap wanted to understand for itself what kind of improvements in training speed it might see using TPUs. So, the Snap team benchmarked model training using TPUs versus both GPUs and CPUs, and the results were impressive. GPUs underperformed TPUs on both throughput and cost: compared with TPUs, GPU-based training delivered 67 percent lower throughput at 52 percent higher cost. Similarly, TPU-based training drastically outperformed CPU-based training for Snap’s most common models. For example, on Snap’s standard ad recommendation model, TPUs slashed processing costs by as much as 74 percent while increasing throughput by as much as 250 percent, all with the same level of accuracy.
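To make those percentages easier to compare, here is a small arithmetic sketch that converts them into multipliers. It assumes that the 250 percent and 74 percent figures are measured against the CPU baseline and that the 67 percent and 52 percent figures are measured against the TPU baseline; the article does not spell out the baselines, so treat the variable names and the interpretation as illustrative rather than as Snap’s own methodology.

```python
# Illustrative arithmetic only: converts the percentages quoted in the
# article into multipliers. The baselines are assumptions (see lead-in).

def multiplier_from_increase(percent):
    """An increase of `percent` percent, as a multiple of the baseline."""
    return 1 + percent / 100


def multiplier_from_reduction(percent):
    """A reduction of `percent` percent, as a multiple of the baseline."""
    return 1 - percent / 100


# TPU vs. CPU on the standard ad recommendation model ("as much as" figures).
tpu_throughput_vs_cpu = multiplier_from_increase(250)   # 3.5x the CPU throughput
tpu_cost_vs_cpu = multiplier_from_reduction(74)         # about 0.26x the CPU cost

# GPU vs. TPU, assuming the TPU run is the baseline.
gpu_throughput_vs_tpu = multiplier_from_reduction(67)   # about 0.33x the TPU throughput
gpu_cost_vs_tpu = multiplier_from_increase(52)          # 1.52x the TPU cost

# The same GPU comparison, seen from the TPU's side.
tpu_throughput_vs_gpu = 1 / gpu_throughput_vs_tpu       # about 3.03x
tpu_cost_vs_gpu = 1 / gpu_cost_vs_tpu                   # about 0.66x

print(f"TPU vs CPU: {tpu_throughput_vs_cpu:.2f}x throughput, {tpu_cost_vs_cpu:.2f}x cost")
print(f"TPU vs GPU: {tpu_throughput_vs_gpu:.2f}x throughput, {tpu_cost_vs_gpu:.2f}x cost")
```

Read this way, a 250 percent increase means 3.5 times the baseline throughput, not 2.5 times, and a GPU running at 67 percent lower throughput means the TPU is roughly three times faster.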

Because the TPU embedding API is a native, optimized solution for embedding-based operations, it performs embedding-based computations and lookups more efficiently. This is particularly valuable to recommenders, which have additional requirements such as fast embedding lookups and high memory bandwidth.
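For readers unfamiliar with the term, the sketch below shows what an embedding lookup is in plain Python. It is not the TPU embedding API and says nothing about how Snap’s models are built; the function names, table size, and feature IDs are all hypothetical. The point is only that each lookup is a memory read with very little arithmetic, which is why memory bandwidth matters so much for recommenders.

```python
import random

# Conceptual sketch of an embedding lookup, NOT the TPU embedding API.
# All names and sizes here are hypothetical.

def build_table(vocab_size, dim, seed=0):
    """Create a table with one randomly initialized vector per categorical ID."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]


def lookup(table, ids):
    """Fetch the vector for each ID: a pure memory read, no arithmetic."""
    return [table[i] for i in ids]


def sum_pool(vectors):
    """Combine several vectors into one by summing each dimension."""
    return [sum(column) for column in zip(*vectors)]


# A tiny table: 10 possible IDs, 4 dimensions per vector.
table = build_table(vocab_size=10, dim=4)

# Suppose one example carries two categorical feature IDs (made up here).
feature_ids = [1, 3]
vectors = lookup(table, feature_ids)
pooled = sum_pool(vectors)

print(len(vectors), len(pooled))
```

Production tables can hold vastly more rows than this toy example, so the cost of a training step is often dominated by fetching rows rather than by the math that follows.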

Benefits across the board

For Snap’s ad ranking team, those improvements translate into tangible workflow advantages. It’s not unusual for Snap to have a month’s worth of data that includes all the logs of users who were shown particular ads and a record of whether they interacted with an ad or not. That means it has millions of data points to process, and Snap wants to model them as quickly as possible so it can make better recommendations going forward. It’s an iterative process, and the faster Snap can get the results from one experiment, the faster its engineers can spin up another with even better results—and they’d much prefer to do that in hours rather than days.

Increased efficiency and velocity benefit Snapchatters, too. The better the models are, the more likely they are to correctly predict the likelihood that a given user will interact with a particular ad, improving the user experience and boosting engagement. Improved engagement leads to higher conversion rates and greater advertiser value—and given the volumes of ads and users Snap deals with, even a one percent improvement has real monetary impact.

Working at the leading edge

Snap is working hard to improve its recommendation quality with the goal of delivering greater value to advertisers and a better experience for Snapchatters. That includes going all-in on leading-edge solutions like Google TPUs that allow its talented ML engineers to shine.

Now that you know the whole story, see how Snap got there with the help of Google: Training Large-Scale Recommendation Models with TPUs.



By: Aymeric Damien (Machine Learning Engineer, Snap Inc.) and Samir Ahmed (Software Engineer, Snap Inc.)
Source: Google Cloud Blog
