Problems encountered when training models
Q1: CUDA out of memory
What CUDA memory usage depends on:
- the number of model parameters
- the amount of input data (i.e. the batch_size: that many samples are loaded into CUDA memory at once)
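Both factors can be inspected directly. The sketch below is a minimal example, not a recipe: the `nn.Sequential` model is just a placeholder for your own network, and it assumes a CUDA device is available.

```python
import torch
import torch.nn as nn

# Placeholder model; replace with your own network.
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10)).cuda()

# Parameter count; each float32 parameter occupies 4 bytes on the GPU.
num_params = sum(p.numel() for p in model.parameters())
print(f"parameters: {num_params}, approx. {num_params * 4 / 1024**2:.1f} MB (fp32)")

# Memory PyTorch has currently allocated / reserved on the default device.
print(torch.cuda.memory_allocated() / 1024**2, "MB allocated")
print(torch.cuda.memory_reserved() / 1024**2, "MB reserved")
```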
So in general, it is usually enough to:
- reduce the batch size (see the sketch after this list)
- reduce the number of network parameters
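A minimal sketch of the batch-size option, assuming a CUDA device; the dataset and model here are dummies for illustration only. If a smaller batch hurts optimization, gradient accumulation keeps the effective batch size unchanged while only the small mini-batch has to fit in memory.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy data just for illustration; replace with your own dataset.
dataset = TensorDataset(torch.randn(1000, 1024), torch.randint(0, 10, (1000,)))

# A smaller batch_size directly reduces the activations held in CUDA memory.
loader = DataLoader(dataset, batch_size=16, shuffle=True)

model = torch.nn.Linear(1024, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss()

accum_steps = 4  # effective batch size = 16 * 4
optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    x, y = x.cuda(), y.cuda()
    loss = criterion(model(x), y) / accum_steps
    loss.backward()                      # gradients accumulate across mini-batches
    if (step + 1) % accum_steps == 0:
        optimizer.step()                 # update once every accum_steps mini-batches
        optimizer.zero_grad()
```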
Q2: Multi-GPU training for speed-up [not figured out yet]
I learned about one approach, which is actually a single GPU with multiple subprocesses; however, the processes seem to each work on their own task rather than cooperating on the whole task.
I don't know how to use multi-GPU training yet; it fails with the following error:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
1
1
Traceback (most recent call last):
File "ddp_train.py", line 24, in <module>
torch.distributed.init_process_group(backend='Gloo', rank=args.local_rank)
File "/data/liumengyin/.conda/envs/basic_env/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 500, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
File "/data/liumengyin/.conda/envs/basic_env/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 190, in _env_rendezvous_handler
store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: Address already in use
Killing subprocess 4596
Killing subprocess 4597
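From the traceback, the failure happens in the TCPStore rendezvous: the port the master process listens on (MASTER_PORT, 29500 by default) is already taken, typically because a previous launch is still running or did not exit cleanly, or another job on the same machine uses the same port. Below is a hedged sketch of a working setup; the file name ddp_train.py, the two-process launch, and the explicit --master_port value are assumptions for illustration, not the original script.

```python
# Minimal DDP initialization sketch (assumes the torchrun launcher sets the
# LOCAL_RANK / RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT environment variables).
import os
import torch
import torch.distributed as dist

def main():
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    # Backend names are lowercase: "nccl" for GPUs, "gloo" as a CPU fallback.
    dist.init_process_group(backend="nccl" if torch.cuda.is_available() else "gloo")
    if torch.cuda.is_available():
        torch.cuda.set_device(local_rank)
    print(f"rank {dist.get_rank()} / world_size {dist.get_world_size()} ready")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()

# Launch (PyTorch >= 1.9), picking an unused port explicitly:
#   torchrun --nproc_per_node=2 --master_port=29501 ddp_train.py
# Older versions:
#   python -m torch.distributed.launch --nproc_per_node=2 --master_port=29501 ddp_train.py
```

On the "each process handles its own task" point: the launcher spawns one process per GPU and each process does train on its own shard of the data, but wrapping the model in DistributedDataParallel synchronizes gradients after every backward pass, so the processes cooperate on training a single model.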
Q3: The model does not converge during training
- To be resolved