In PyTorch, you can use the torch.nn.parallel.DistributedDataParallel class for distributed training. The steps are as follows:
Initialize the process group. Each process sets the master address and port, calls dist.init_process_group, and then runs the training function:

```python
import os
import torch
import torch.distributed as dist
from torch.multiprocessing import Process

def init_process(rank, size, fn, backend='gloo'):  # use backend='nccl' for multi-GPU training
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '1234'
    dist.init_process_group(backend, rank=rank, world_size=size)
    fn(rank, size)
```

Define the training function. Inside it, create the model and the data loader, and wrap the model with torch.nn.parallel.DistributedDataParallel:

```python
def train(rank, size):
    # Model, num_epochs and loss_function are placeholders for your own model, epoch count and loss
    # Create the model and move it to this rank's device before wrapping it in DDP
    model = Model().to(rank)
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[rank])

    # Create the data loader (see the DistributedSampler sketch below)
    train_loader = DataLoader(...)

    # Define the optimizer
    optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

    # Train the model
    for epoch in range(num_epochs):
        for batch_idx, (data, target) in enumerate(train_loader):
            data, target = data.to(rank), target.to(rank)
            optimizer.zero_grad()
            output = model(data)
            loss = loss_function(output, target)
            loss.backward()
            optimizer.step()
```

Launch multiple processes to run the training function (here with torch.multiprocessing.Process; a spawn-based alternative is sketched below):

```python
if __name__ == '__main__':
    num_processes = 4
    size = num_processes
    processes = []
    for rank in range(num_processes):
        p = Process(target=init_process, args=(rank, size, train))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
```

The above is a simple distributed-training example; you can modify and extend it to fit your own setup. PyTorch also provides other tools and features for distributed training, such as the torch.distributed module and the torch.distributed.rpc module, so you can pick whichever fits your needs.
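The `DataLoader(...)` above is a placeholder. In practice each process should only see its own shard of the dataset, which is what torch.utils.data.distributed.DistributedSampler does. A minimal sketch, where `train_dataset` stands in for your own Dataset and `batch_size=32` is an arbitrary choice:

```python
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def make_loader(train_dataset, rank, size, batch_size=32):
    # Each rank receives a non-overlapping shard of the dataset
    sampler = DistributedSampler(train_dataset, num_replicas=size, rank=rank)
    return DataLoader(train_dataset, batch_size=batch_size, sampler=sampler)
```

When you use a DistributedSampler, call `sampler.set_epoch(epoch)` at the start of each epoch so the shuffling order differs between epochs.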
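As an alternative to creating Process objects manually, torch.multiprocessing.spawn can launch and join the worker processes for you. A minimal sketch, assuming the init_process and train functions defined above:

```python
import torch.multiprocessing as mp

if __name__ == '__main__':
    world_size = 4
    # mp.spawn passes the worker index (the rank) as the first argument to init_process
    mp.spawn(init_process, args=(world_size, train), nprocs=world_size, join=True)
```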




