[Short Review] Fully Sharded Data Parallel: faster AI training with fewer GPUs
Abstract: Fully Sharded Data Parallel (FSDP) is the newest tool we're introducing. It shards an AI model's parameters across data-parallel workers and can optionally offload part of the training computation to the CPUs. As its name suggests, FSDP is a type of data-parallel training algorithm: although the parameters are sharded across GPUs, the computation for each microbatch of data is still local to each GPU worker. This conceptual simplicity makes FSDP easier to understand and applicable to a wider range of usage scenarios than intra-layer parallelism or pipeline parallelism. Compared with data-parallel methods that shard only the optimizer state and gradients, FSDP shards parameters more uniformly and can achieve better performance by overlapping communication with computation during training. With FSDP, it is now possible to train models that are orders of magnitude larger, and to do so more efficiently with fewer GPUs. FSDP is implemented in the FairScale library and lets engineers and developers scale and optimize the training of their models with simple APIs. At Facebook, FSDP has already been integrated and tested for training some of our NLP and vision models.

My notes: https://blog.olewave.com/olewaves-tech-review-fully-sharded-data-parallel-faster-ai-training-with-fewer-gpus/
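To make the "simple APIs" claim concrete, below is a minimal sketch of wrapping a PyTorch model with FairScale's FullyShardedDataParallel. The toy model, hyperparameters, and training loop are invented for illustration, and some constructor flags (e.g. for CPU offload) have been renamed across FairScale versions, so treat this as a sketch rather than the canonical recipe.

```python
import os

import torch
import torch.nn as nn
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

# Launched with e.g. `torchrun --nproc_per_node=8 train.py`; the launcher
# sets LOCAL_RANK (and the rendezvous variables) for each worker process.
torch.distributed.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# A toy model standing in for a real NLP or vision model.
model = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.ReLU(),
    nn.Linear(4096, 1024),
).cuda()

# Wrapping shards the parameters across all data-parallel workers. The
# forward/backward pass for each microbatch still runs locally on this GPU;
# full parameters are gathered per layer just in time and re-sharded after
# use. (CPU offload is available via a constructor flag whose name differs
# between FairScale versions, so it is omitted here.)
model = FSDP(model)

# Build the optimizer after wrapping, so it sees the sharded parameters.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for step in range(10):
    # Hypothetical random batch; a real job would read from a DataLoader.
    x = torch.randn(32, 1024, device="cuda")
    loss = model(x).pow(2).mean()
    loss.backward()  # gradients are reduce-scattered back to their shards
    optimizer.step()
    optimizer.zero_grad()
```

Because the computation stays local to each GPU, the training loop is unchanged from ordinary data-parallel code; only the wrapping line differs from a DistributedDataParallel setup.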
[Long Review] Axial Attention in Multidimensional Transformers
[Long Review] GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
[Long Review] Fully Sharded Data Parallel: faster AI training with fewer GPUs
Understand Google's 'Yi Jin Jing' BERT in Ten Minutes
[Long Review] Switch Transformers: Scaling to Trillion Parameter Models with
[Long Review] Xception: Deep Learning with Depthwise Separable Convolutions
[Short Review] Conformer: Convolution-augmented Transformer for Speech Recognition
Understand Facebook's 'Tai Chi' Wav2Vec2.0 in Ten Minutes -- Speech Pre-Training Is Like Walter White Teaching Jesse in Breaking Bad
Understand Google's 'Golden Bell' Transformer and the LAS Speech Model in Ten Minutes
[Long Review] Transfer Learning from Speaker Verification to Multispeaker TTS
[Long Review] Kullback-Leibler Divergence: Listen, Attend, Spell and Adapt ASR
[Long Review] Cascaded Diffusion Models for High Fidelity Image Generation
Microsoft's Zero-Shot Speech Synthesis VALL-E in Three Minutes
Speech & Text Paper Reading: Exploring Wav2vec 2.0 fine-tuning for improved speech emotion recognition
Speech & Text Paper Reading: RNN-T: Sequence Transduction with Recurrent Neural Networks
[Long Review] Deduplicating Training Data Makes Language Models Better
[Short Review] Xception: Deep Learning with Depthwise Separable Convolutions
Speech & Text Paper Reading: Branchformer: Parallel MLP-Attention Architectures and E-Branchformer
[Long Review] Conformer: Convolution-augmented Transformer for Speech Recognition
In-Depth Look at OpenAI GPT-3: Language Models are Few-Shot Learners (2/3)
Ten Minutes on Why OpenAI's Whisper Speech Recognition Isn't as Handy as ChatGPT [Speech & Language Paper Reading]
Speech & Text Paper Reading: Improving Speech Recognition Accuracy of Local POI Using Geographical
Speech & Text Paper Reading: SNRi Target Training for Joint Speech Enhancement and Recognition
Speech & Text Paper Reading: Scaling Laws for Neural Language Models
Speech & NLP Paper Reading: Token-level Sequence Labeling for SLU using Compositional E2E Models
[Long Review] Towards Zero-Label Language Learning
Speech & Text Paper Reading: UniSpeech-SAT - Universal Speech Representation Learning with Speaker
[Short Review] Towards Zero-Label Language Learning
Understand Google's 'Iron Shirt' BigSSL in Ten Minutes: Exploring the Frontier of Large-Scale Semi-Supervised ...
Unlocking the Alien-Grade Technology of ChatGPT
[Long Review] Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using
Boris Johnson's Resignation Speech - with an Analysis of His Dual-Microphone Setup
Speech & Text Paper Reading: One-Edit-Distance Network (OEDN) in Mispronunciation Detection & ASR
Speech & Text Paper Reading: Will OpenAI's Latest Whisper ASR Catch On Like GPT-3 Did?
In-Depth Look at AudioLM: a Language Modeling Approach to Audio Generation
In-Depth Look at Microsoft's Zero-Shot Speech Synthesis VALL-E
Comprehensive and Easy! Learn All Eight Major Deep Learning Architectures in One Sitting: CNN, RNN, GAN, GNN, DQN, Transformer, LSTM, and DBN! Save This - It's Much Faster Than Grinding Through Textbooks!
Understand Google's W2v-BERT in Ten Minutes: Combining Contrastive Learning and Masked Language Modeling
[Olewave's Long Review] Efficient Training of Neural Transducer for Speech Recognition
[Short Review] Transfer Learning from Speaker Verification to Multispeaker TTS