Sparse Mixture-of-Experts (MoE) models are gaining traction because they can improve accuracy without a proportional increase in computational cost. Traditionally, significant computational ...
In a new paper, Upcycling Large Language Models into Mixture of Experts, an NVIDIA research team introduces a “virtual group” initialization technique that facilitates the transition of dense models ...
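To make the general idea concrete, below is a minimal PyTorch sketch of the basic upcycling recipe that such methods build on: copy a trained dense MLP into every expert and attach a freshly initialized router. This is an illustrative sketch, not the paper's exact virtual-group scheme; the names (DenseMLP, MoELayer) and the top-k softmax routing are assumptions for the example.

```python
# Hypothetical sketch of naive dense-to-MoE upcycling in PyTorch.
# Class names and routing details are illustrative, not from the paper.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseMLP(nn.Module):
    """A standard transformer feed-forward block."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.fc2(F.gelu(self.fc1(x)))

class MoELayer(nn.Module):
    """An MoE layer upcycled from a trained dense MLP."""
    def __init__(self, dense_mlp: DenseMLP, num_experts: int, top_k: int = 2):
        super().__init__()
        d_model = dense_mlp.fc1.in_features
        # Each expert starts as an exact copy of the dense MLP, so the
        # upcycled layer reproduces the dense layer's behavior at init.
        self.experts = nn.ModuleList(
            copy.deepcopy(dense_mlp) for _ in range(num_experts)
        )
        # The router is freshly initialized; training differentiates experts.
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x):  # x: (num_tokens, d_model)
        logits = self.router(x)                        # (tokens, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)              # normalize over chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out
```

Because every expert is an identical copy and the routing weights sum to one, this layer's output matches the original dense MLP at initialization; the paper's virtual-group initialization addresses subtler issues that arise when refining this recipe, such as how weights interact with softmax routing.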