Q1: CUDA out of memory

What does CUDA memory usage depend on:

  • Model parameter count
  • Input data volume (the batch_size, i.e. how much data is loaded into CUDA memory at once)

So in general, you usually only need to (see the sketch after this list):

  • Reduce the batch size
  • Reduce the number of network parameters
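
A minimal sketch of how to check where the memory goes, assuming a generic PyTorch setup; MyModel and train_dataset are hypothetical placeholders. Parameter memory is roughly element count times bytes per element, and the peak allocation of a forward/backward pass can be read back with torch.cuda.max_memory_allocated():

import torch
from torch.utils.data import DataLoader

model = MyModel().cuda()          # hypothetical model

# Parameter memory: element count * bytes per element (4 bytes for float32).
param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
print(f'parameters: {param_bytes / 1024**2:.1f} MiB')

# If the forward/backward pass runs out of memory, lower batch_size here first.
loader = DataLoader(train_dataset, batch_size=16)   # hypothetical dataset returning (x, y) pairs

x, y = next(iter(loader))
loss = model(x.cuda()).sum()      # placeholder loss, just to trigger backward
loss.backward()
print(f'peak CUDA memory: {torch.cuda.max_memory_allocated() / 1024**2:.1f} MiB')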

Q2: Speeding up training with multiple GPUs [not figured out yet]

I learned of one approach, but it actually uses a single GPU with multiple subprocesses, and the subprocesses seem to each handle their own task rather than cooperating on the whole job.

I don't know how to get multi-GPU training to work; it fails with the following error:

*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
1
1
Traceback (most recent call last):
  File "ddp_train.py", line 24, in <module>
    torch.distributed.init_process_group(backend='Gloo', rank=args.local_rank)
  File "/data/liumengyin/.conda/envs/basic_env/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 500, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/data/liumengyin/.conda/envs/basic_env/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 190, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: Address already in use
Killing subprocess 4596
Killing subprocess 4597
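
The last line of the traceback, RuntimeError: Address already in use, usually means the TCP port that the rendezvous store tries to bind (MASTER_PORT, 29500 by default) is already occupied, for example by a previous training run whose processes were never killed, or by another job on the same machine. Freeing that port or passing a different --master_port to the launcher normally gets past it; the backend name is also conventionally written in lowercase ('gloo', or 'nccl' for GPU training). A minimal sketch of the usual torch.distributed.launch-style DDP setup, one process per GPU; MyModel and train_dataset are hypothetical placeholders:

import argparse
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)   # filled in by the launcher
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend='nccl')                    # lowercase backend; env vars come from the launcher

model = MyModel().cuda(args.local_rank)                    # hypothetical model
model = DDP(model, device_ids=[args.local_rank])

sampler = DistributedSampler(train_dataset)                # hypothetical dataset
loader = DataLoader(train_dataset, batch_size=32, sampler=sampler)
# ...training loop: each process sees a different shard of the data...

Note that although each DDP process does work on its own shard of the data, the gradients are all-reduced across processes every step, so they are cooperating on training one shared model. Launched, for example, with an explicit port to avoid the address conflict (exact flags depend on the PyTorch version):

python -m torch.distributed.launch --nproc_per_node=2 --master_port=29501 ddp_train.py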

Q3: Model training does not converge

  • To be resolved