Publication

Scalable Distributed Training of Recommendation Models: An ASTRA-SIM + NS3 case-study with TCP/IP transport

IEEE Symposium on High Performance Interconnects (HOTI)


Abstract

Recommendation model DNNs have gained significant attention due to their vital role in recommending the best content to the user. However, to further increase accuracy, these DNNs are becoming more complex and are trained on more data, making training on a single node infeasible. Distributed training tackles this problem by employing multiple nodes. The importance of recommendation models necessitates designing customized HW/SW platforms for training such networks in order to minimize the communication overheads among nodes. However, exploring this design space is difficult because of the many HW/SW parameters involved and the limited ability to change HW parameters in real systems.

In this paper, we port the previously proposed ASTRA-SIM simulation platform on top of the versatile NS3 network simulator by introducing a portable network interface for ASTRA-SIM. Using NS3 enables modeling a wide variety of networks with much better accuracy. Furthermore, we enhance NS3 with detailed modeling of TCP/IP.

Finally, we study various HW/SW platforms for the DLRM recommendation model with TCP/IP as the network protocol and analyze the communication overheads in the presence of various interconnect configurations.

Code: https://github.com/astra-sim/astra-sim

Related Publications

Workshop on Online Abuse and Harms (WOAH) at ACL - November 30, 2021

Findings of the WOAH 5 Shared Task on Fine Grained Hateful Memes Detection

Lambert Mathias, Shaoliang Nie, Bertie Vidgen, Aida Davani, Zeerak Waseem, Douwe Kiela, Vinodkumar Prabhakaran

Journal of Big Data - November 6, 2021

A graphical method of cumulative differences between two subpopulations

Mark Tygert

NeurIPS - December 6, 2021

Parallel Bayesian Optimization of Multiple Noisy Objectives with Expected Hypervolume Improvement

Samuel Daulton, Maximilian Balandat, Eytan Bakshy

arXiv - January 29, 2020

fastMRI: An Open Dataset and Benchmarks for Accelerated MRI

Jure Zbontar, Florian Knoll, Anuroop Sriram, Tullie Murrell, Zhengnan Huang, Matthew J. Muckley, Aaron Defazio, Ruben Stern, Patricia Johnson, Mary Bruno, Marc Parente, Krzysztof J. Geras, Joe Katsnelson, Hersh Chandarana, Zizhao Zhang, Michal Drozdzal, Adriana Romero, Michael Rabbat, Pascal Vincent, Nafissa Yakubova, James Pinkerton, Duo Wang, Erich Owens, Larry Zitnick, Michael P. Recht, Daniel K. Sodickson, Yvonne W. Lui
