Accelerating AI using distributed model training at Stitch Fix
Dec 07, 2023
- Stitch Fix uses a multi-tiered recommender system stack that spans feature generation, scoring, ranking, and business policy decision-making. This presentation covers the training architecture of the scoring model, a deep learning model that predicts the likelihood that a user will purchase an item.
- Walk through our transition from training on a single GPU to training on multiple GPUs.
- Highlight the benefits of the Distributed Data Parallel (DDP) training methodology.
- Present empirical results from scaling training from 1 to N GPUs.
- Discuss the system design considerations that informed our decisions.
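
For readers new to the data-parallel idea behind DDP, the following is a toy sketch in plain Python, not the Stitch Fix training code and not the PyTorch API. It assumes a hypothetical one-parameter linear model, a hand-written squared-error gradient, and equal-sized shards; the helper names (`shard`, `local_gradient`, `all_reduce_mean`, `ddp_step`) are invented for illustration. Each simulated worker computes a gradient on its own shard, the gradients are averaged (the role an all-reduce plays in real DDP), and every worker applies the same update.

```python
# Toy simulation of data-parallel training. All names here are hypothetical;
# real DDP runs one process per GPU and averages gradients with an all-reduce.

def shard(data, num_workers):
    """Split data into num_workers equal-sized contiguous shards."""
    if len(data) % num_workers != 0:
        raise ValueError("data must divide evenly across workers")
    size = len(data) // num_workers
    return [data[i * size:(i + 1) * size] for i in range(num_workers)]

def local_gradient(w, examples):
    """Mean gradient of the squared error (w*x - y)**2 with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in examples) / len(examples)

def all_reduce_mean(values):
    """Stand-in for all-reduce: every worker ends up with the mean value."""
    return sum(values) / len(values)

def ddp_step(w, data, num_workers, lr):
    """One simulated data-parallel step: local gradients, average, update."""
    grads = [local_gradient(w, s) for s in shard(data, num_workers)]
    return w - lr * all_reduce_mean(grads)

def single_step(w, data, lr):
    """One ordinary full-batch gradient step on a single worker."""
    return w - lr * local_gradient(w, data)

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w_single = single_step(0.0, data, lr=0.01)
w_ddp = ddp_step(0.0, data, num_workers=2, lr=0.01)
print(w_single, w_ddp)
```

Under these assumptions the averaged update matches the single-worker full-batch update, which is why data parallelism can split the per-step work across GPUs without changing the result of each step. Real systems add communication cost and other overheads that this sketch ignores.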