PyTorch Crash Log: Pitfalls of Moving from Single-GPU to Multi-GPU!

知行编程网, 2022-07-26 20:00


Author | 哟林小平

Reposted from | 夕小瑶的卖萌屋

First, some background. I am currently hacking on the code for the following paper:

https://github.com/QipengGuo/GraphWriter-DGL

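Each experiment takes 5 hours to finish for the baseline, and my own model takes even longer (about 2x). That makes tuning hyperparameters and spotting problems painful, so I started trying to speed things up with multiple GPUs.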

torch.nn.DataParallel ==> DP for short

torch.nn.parallel.DistributedDataParallel ==> DDP for short

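I first tried to speed things up with DP. That did not work because of how DGL is implemented: all the nodes in a batch are packed into a single batched graph that cannot be split, whereas torch.nn.DataParallel works by cutting a batch into smaller pieces. DP's speedup is also worse than DDP's, so I set out to convert the code to DDP.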

Also, the author's Sampler subclasses torch.utils.data.Sampler, because the text lengths in the AGENDA dataset are severely imbalanced.

To make the model train faster, texts of similar length are packed into the same batch (friendly reminder: torchtext has a related class, BucketIterator^[1]).
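The idea of packing similar-length texts into one batch can be sketched without any framework. This is only a minimal illustration, not the GraphWriter-DGL sampler and not torchtext's BucketIterator; the function name `bucket_batches` and the toy lengths are assumptions made up for the example.

```python
import random

def bucket_batches(lengths, batch_size, shuffle=True, seed=0):
    """Group sample indices so that each batch holds texts of similar length."""
    # Sort indices by text length so that neighbours have similar lengths.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    # Cut the sorted order into consecutive chunks of batch_size.
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    if shuffle:
        # Shuffle whole batches (not samples) so lengths stay grouped.
        random.Random(seed).shuffle(batches)
    return batches

# Hypothetical text lengths, one per sample.
lengths = [5, 120, 7, 118, 6, 300, 310, 9]
batches = bucket_batches(lengths, batch_size=2, shuffle=False)
print(batches)
```

Because each batch only contains texts of similar length, little computation is wasted on padding short texts up to the longest one in the batch.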

That is the background.

Writing bugs, step one: a DistributedSampler subclass full of holes

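At first I naively copy-pasted the author's sampler source code and changed essentially one thing: the class now inherited from DistributedSampler.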

I then ran into a couple of problems:

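the dataloader did not split up the data;
the dataloader handed every process the complete dataset, when by rights each process should get 1/n of it, where n is the number of GPUs you configured.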

So I promptly went to read the source code^[2].

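What is the key issue here? First, in torch.utils.data.distributed.DistributedSampler the dataset attribute is called self.dataset, not data_source. Second, torch.utils.data.Sampler simply requires you to override __iter__, but DistributedSampler is different.

The DistributedSampler parent class already implements part of the logic itself. If you do not take that part into account when overriding, every process naturally ends up with the entire dataset.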
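The part of the parent class that is easy to lose when overriding `__iter__` is the per-rank slicing. The sketch below mimics, in plain Python and without shuffling, the strided sharding that DistributedSampler applies by default (`indices[rank:total_size:num_replicas]` after padding to an even size). The function name `shard_indices` is hypothetical; it is a simplified stand-in, not the library code.

```python
import math

def shard_indices(dataset_len, num_replicas, rank):
    """Return the indices one rank would see under strided sharding."""
    # Every rank gets the same number of samples, rounded up.
    num_samples = math.ceil(dataset_len / num_replicas)
    total_size = num_samples * num_replicas
    indices = list(range(dataset_len))
    # Pad by wrapping around to the start so the list divides evenly.
    # (Assumes dataset_len >= num_replicas, so the padding is short.)
    indices += indices[:total_size - len(indices)]
    # Strided slice: rank r takes positions r, r + n, r + 2n, ...
    return indices[rank:total_size:num_replicas]

shards = [shard_indices(10, num_replicas=4, rank=r) for r in range(4)]
print(shards)
```

An overridden `__iter__` that iterates over all indices and skips this slicing step gives every process the full dataset, which matches the symptom described above.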

So I rewrote my DDPBaseBucketSampler class.

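After that, each process could finally run on its own share of the data (1/n, where n = number of processes = number of GPUs, on a single machine).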
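As a rough picture of what a rank-aware bucket sampler has to do, here is a framework-free sketch that first takes this rank's shard and then buckets by length inside it. It is not the author's DDPBaseBucketSampler (that code is not shown in this copy); the class name `ShardedBucketSampler`, its parameters, and the toy lengths are all assumptions for illustration.

```python
import math
import random

class ShardedBucketSampler:
    """Yield length-bucketed batches drawn only from this rank's shard."""

    def __init__(self, lengths, batch_size, num_replicas, rank, seed=0):
        self.lengths = lengths
        self.batch_size = batch_size
        self.num_replicas = num_replicas
        self.rank = rank
        self.seed = seed
        self.epoch = 0

    def set_epoch(self, epoch):
        # Changing the epoch changes the shared permutation below.
        self.epoch = epoch

    def _shard(self):
        # Every rank builds the same permutation (same seed and epoch),
        # so the strided slices taken by different ranks do not overlap.
        indices = list(range(len(self.lengths)))
        random.Random(self.seed + self.epoch).shuffle(indices)
        total = math.ceil(len(indices) / self.num_replicas) * self.num_replicas
        indices += indices[:total - len(indices)]  # pad to an even size
        return indices[self.rank:total:self.num_replicas]

    def __iter__(self):
        # Bucket by length inside the shard, then shuffle whole batches.
        shard = sorted(self._shard(), key=lambda i: self.lengths[i])
        batches = [shard[i:i + self.batch_size]
                   for i in range(0, len(shard), self.batch_size)]
        random.Random(self.seed + self.epoch).shuffle(batches)
        return iter(batches)

    def __len__(self):
        per_rank = math.ceil(len(self.lengths) / self.num_replicas)
        return math.ceil(per_rank / self.batch_size)

lengths = [5, 120, 7, 118, 6, 300, 310, 9, 11, 250]
samplers = [ShardedBucketSampler(lengths, 2, num_replicas=2, rank=r) for r in range(2)]
seen = [sorted(i for batch in s for i in batch) for s in samplers]
```

In this sketch every rank receives a shard of the same size, so every rank also yields the same number of batches. In PyTorch, an object like this would typically be passed to DataLoader through its `batch_sampler` argument.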

Then the next problem showed up: after training finished normally, the main process could not get out of mp.spawn().

Writing bugs, step two: the master process cannot exit normally

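With num_workers set, DDP in PyTorch would not terminate normally. Specifically, the function passed to mp.spawn ran to completion, but the master process kept holding the GPU and never exited. At first I suspected the way the sampler hands out batches. What does that mean? Each process gets different data, and since I require texts of similar length to be packed together, each process running the sampler could end up with a different number of batches: the master might have 100 iterations while the slave has only 80. So I immediately checked this with print statements.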

▲ Everything printed normally, which shows that the __iter__ function is fine

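The counts differed only slightly, and the program got past all of these prints in the end, so mismatched batch counts should not be the cause. (It is worth mentioning that the sampler packs the batches very early on.)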

Adding a call to destroy the process group did not help either.

After that, all I could do was force quit.

Code references: 基于Python初探Linux下的僵尸进程和孤儿进程(三) (A First Look at Zombie and Orphan Processes on Linux with Python, Part 3)^[3] and Multiprocessing in python blocked^[4]

Clearly the PyTorch master process had deadlocked and turned into a zombie process.

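Digging further, I found that when I set the dataloader's num_workers to 0, the program exited normally. After a round of commenting things out, I found that even if I replaced everything inside for _i, batch in enumerate(dataloader) with pass, the master still failed to exit normally. So the problem was pinned on the dataloader. Reference: nero: PyTorch DataLoader初探 (A First Look at PyTorch DataLoader)^[5]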

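Another idea was that mp.spawn itself was the problem. A process started with the spawn method only executes code related to the target argument or the run() method. Windows can only use this method, and it is in fact the default start method on that platform. Compared with the other two start methods, spawn is the slowest way to start a process. Reference: Python设置进程启动的3种方式 (Three Ways to Set the Process Start Method in Python)^[6]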
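The start methods mentioned above can be inspected from the Python standard library. This short sketch only queries them and does not start any process; which methods are listed depends on the platform.

```python
import multiprocessing as mp

# Start methods supported on this platform, e.g. fork, spawn, forkserver.
available = mp.get_all_start_methods()

# A context object bound to the spawn method; processes created through
# it start a fresh interpreter instead of forking the parent.
ctx = mp.get_context("spawn")

print(available, ctx.get_start_method())
```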

Now let me try bypassing mp.spawn and launching DDP from a shell script instead, to see whether the problem goes away.

Parameter notes:

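nnodes: this is a single machine with multiple GPUs, so it is set to 1, which means node_rank can only be 0
local_rank: at run time, the launcher injects a local_rank argument into args to identify each process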
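On the training-script side, receiving that injected argument usually looks like the sketch below. The launch script itself is not shown in this copy; a typical single-node invocation has the form `python -m torch.distributed.launch --nproc_per_node=2 --nnodes=1 --node_rank=0 train.py`, where the script name `train.py` and the process count are assumptions for illustration.

```python
import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="hypothetical DDP training script")
    # torch.distributed.launch appends --local_rank=<n> to each worker's
    # command line; n identifies the process (and usually the GPU) on this node.
    parser.add_argument("--local_rank", type=int, default=0)
    return parser.parse_args(argv)

# Simulate what the second worker process would receive.
args = parse_args(["--local_rank=1"])
print(args.local_rank)
```

Each worker can then use `args.local_rank` to pick its GPU before initializing the process group.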

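After some changes, things improved. The most noticeable effect was that training ran much faster!! I no longer had the parent-process problem, but once the whole program had run, it still could not exit normally.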

By this point execution was in the main function, where both processes (master and slave) got past the barrier. The slave exited cleanly, but the master never did.

Terminating with ctrl+c at this point produced a traceback, and I followed its path to torch/distributed/launch.py, line 239, to look at the code.

Damn it, what on earth does the master have to do with the dataloader...

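This problem was finally solved yesterday (2020/12/22). Funnily enough, in my left hand was the GraphWriter DDP implementation, which could not exit normally, and in my right hand was a minimal MNIST DDP example, which could. So I started deleting things. I swapped out the dataset and the model, then let the dataloader spin idle, and found nothing. Closing in step by step, I finally got a clean exit once I commented out one line of my own code: the torch.multiprocessing.set_sharing_strategy('file_system') call.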

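Why had I added that line in the first place? While tuning num_workers (I was young then and assumed bigger was better, so I set num_workers to the CPU count), the system raised an error saying the limit on the number of open files had been exceeded. In torch.multiprocessing, the default sharing strategy (see the PyTorch Chinese documentation^[7]) is file_descriptor, which uses file descriptors as shared memory handles. When a storage is moved to shared memory, a file descriptor obtained from shm_open is cached. The documentation also said: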

If your system has a limit on the number of open file descriptors and you cannot raise it, you should use the file_system strategy.

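So I switched to torch.multiprocessing.set_sharing_strategy('file_system'), but overlooked the shared memory leak warning in the documentation. Perhaps the leak is not meant to be a serious problem, since the documentation describes a daemon named torch_shm_manager that torch.multiprocessing spawns to keep track of shared memory allocations.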

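It is also possible that the "master process" I keep talking about is actually this torch_shm_manager, because destroying the process group never managed to end process 0.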
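Going back to the open-file limit that started all this: the quoted documentation recommends file_system only when the limit cannot be raised. Where it can be raised, one alternative (not what the author did) is to lift the soft limit from inside the process with the standard library on Unix-like systems. The helper name `raise_nofile_soft_limit` is hypothetical, and whether raising the limit is permitted depends on the system configuration.

```python
import resource

def raise_nofile_soft_limit(target):
    """Try to raise the soft open-file limit; return the soft limit in effect."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # The soft limit can never exceed a finite hard limit.
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    # Nothing to do if the current soft limit is already high enough.
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    except (ValueError, OSError):
        # The system refused the change; keep the old limit.
        return soft
    return target

soft_before, hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
# Asking for the current value is a no-op; pass a larger number
# (for example 4096) to actually raise the limit.
soft_after = raise_nofile_soft_limit(soft_before)
print(soft_before, soft_after)
```

Keeping num_workers modest rather than equal to the CPU count also reduces how many descriptors the dataloader workers hold open at once.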

That is the end of this bug. I am so happy, and I look forward to the next bug arriving soon.

[1] BucketIterator (https://pytorch.org/text/stable/data.html#bucketiterator)

[2] Source code (https://github.com/pytorch/pytorch/blob/master/torch/utils/data/distributed.py)

[3] 基于Python初探Linux下的僵尸进程和孤儿进程(三) (http://dwz.date/dUmd)

[4] Multiprocessing in python blocked (https://stackoverflow.com/questions/13649625/multiprocessing-in-python-blocked)

[5] nero: PyTorch DataLoader初探 (https://zhuanlan.zhihu.com/p/91521705)

[6] Python设置进程启动的3种方式 (http://c.biancheng.net/view/2633.html)

[7] PyTorch Chinese documentation (https://pytorch-cn.readthedocs.io/zh/latest/package_references/torch-multiprocessing/)

- End -


This article comes from: 深度学习这件小事

This is an original article; copyright belongs to 知行编程网. You are welcome to share it, but please credit the source when reposting!
