Combining Multi-GPU Inference Results in PyTorch

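Single Node (DataParallel)

Using DataParallel

Characteristics: the simplest way to parallelize, but limited to a single node.
Automatic gathering: the outputs from every GPU are gathered automatically on the main GPU.

Code example: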

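import torch
import torch.nn as nn

model = nn.DataParallel(model).to(device)  # splits each batch across GPUs and gathers the outputs
with torch.no_grad():
    outputs = model(inputs)  # outputs holds the results from all GPUs, already gathered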
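
Conceptually, DataParallel splits each batch along its first dimension, runs one model replica per GPU on its chunk, and concatenates the per-GPU outputs in the original order. The plain-Python sketch below imitates only that split-and-gather flow using lists. The names scatter, gather, and toy_model are hypothetical helpers for illustration; the real implementation works on tensors and CUDA devices.

```python
import math

def scatter(batch, num_devices):
    """Split a batch into contiguous chunks, one per device (the last may be smaller)."""
    chunk_size = max(1, math.ceil(len(batch) / num_devices))
    return [batch[i:i + chunk_size] for i in range(0, len(batch), chunk_size)]

def gather(chunk_outputs):
    """Concatenate per-device outputs back into one batch, preserving order."""
    return [y for chunk in chunk_outputs for y in chunk]

def toy_model(chunk):
    """Stand-in for a model replica: doubles every input value."""
    return [2 * x for x in chunk]

batch = [1, 2, 3, 4, 5]
chunks = scatter(batch, num_devices=2)            # [[1, 2, 3], [4, 5]]
outputs = gather([toy_model(c) for c in chunks])  # [2, 4, 6, 8, 10]
```

Because the gathered output keeps the input order, it can be compared against the labels directly, which is why no extra aggregation step is needed in the single-node case.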

Evaluating the results

Because the outputs are already gathered, the usual evaluation code works unchanged:

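labels = labels.to(outputs.device)  # compare on the same device as the outputs
_, predicted = torch.max(outputs, 1)  # index of the highest score is the predicted class
accuracy = (predicted == labels).sum().item() / len(labels)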

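Distributed Environment (DistributedDataParallel)

Using DistributedDataParallel

Characteristics: high performance in multi-node environments, but initialization and synchronization must be handled manually.

Initialization: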

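import torch.distributed as dist

dist.init_process_group(backend='nccl')  # initialize the distributed environment
torch.cuda.set_device(local_rank)  # bind this process to its own GPU
model = nn.parallel.DistributedDataParallel(model.to(local_rank), device_ids=[local_rank])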
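
The snippets in this section use rank, local_rank, and world_size without defining them. Below is a minimal sketch of one common way to obtain them, assuming the script is launched with torchrun, which exports RANK, LOCAL_RANK, and WORLD_SIZE as environment variables. The helper name read_dist_env and the single-process defaults are assumptions made for illustration.

```python
import os

def read_dist_env(env=None):
    """Return (rank, local_rank, world_size) from torchrun-style environment variables."""
    env = os.environ if env is None else env
    rank = int(env.get("RANK", 0))              # global index of this process
    local_rank = int(env.get("LOCAL_RANK", 0))  # index of this process on its node
    world_size = int(env.get("WORLD_SIZE", 1))  # total number of processes
    return rank, local_rank, world_size

# Example: the third process overall, running as the first process on its node.
rank, local_rank, world_size = read_dist_env(
    {"RANK": "2", "LOCAL_RANK": "0", "WORLD_SIZE": "4"}
)
```

In a real script the function would be called without arguments so that it reads os.environ.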

Inference dataloader

For multi-GPU inference, use a DistributedSampler so that each GPU receives its own shard of the data.

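from torch.utils.data import DataLoader, DistributedSampler

sampler = DistributedSampler(val_dataset, rank=rank, shuffle=False, drop_last=False)  # no shuffling needed for evaluation
val_dataloader = DataLoader(val_dataset, batch_size=batch_size // world_size, sampler=sampler, num_workers=num_workers)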
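
With drop_last=False, DistributedSampler pads the index list with repeated samples so that every rank receives the same number of items. The plain-Python sketch below mirrors that index assignment for the unshuffled case, to show which samples end up evaluated twice. shard_indices is a hypothetical helper written for this illustration, not a PyTorch function.

```python
import math

def shard_indices(dataset_len, world_size, rank):
    """Indices one rank would see with shuffle=False and drop_last=False."""
    num_samples = math.ceil(dataset_len / world_size)  # items per rank
    total_size = num_samples * world_size
    indices = list(range(dataset_len))
    # Pad by repeating indices from the start until the list divides evenly.
    while len(indices) < total_size:
        indices += indices[:total_size - len(indices)]
    # Each rank takes every world_size-th index, starting at its own rank.
    return indices[rank:total_size:world_size]

shards = [shard_indices(dataset_len=10, world_size=4, rank=r) for r in range(4)]
# shards == [[0, 4, 8], [1, 5, 9], [2, 6, 0], [3, 7, 1]]
```

In this example samples 0 and 1 are processed twice, so the summed total is 12 rather than 10. When the dataset size is not divisible by the number of processes, exact metrics would require excluding those duplicates.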

Inference

Each process runs inference as usual on its own shard of the data. Combining the results is a separate step.

correct = 0
total = 0

model.eval()  # switch to inference mode

with torch.no_grad():
    for inputs, labels in val_dataloader:
        inputs, labels = inputs.to(device), labels.to(device)

        # Run inference
        outputs = model(inputs)

        # Count correct predictions for this process's shard
        _, predicted = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

Aggregating the results

Aggregation method: use torch.distributed.all_reduce to sum the per-GPU counts.

Code example:

# Convert this process's counts to tensors on its own GPU
correct_tensor = torch.tensor(correct, device=device)
total_tensor = torch.tensor(total, device=device)

# Sum the values across all GPUs (every rank receives the totals)
dist.all_reduce(correct_tensor, op=dist.ReduceOp.SUM)
dist.all_reduce(total_tensor, op=dist.ReduceOp.SUM)

if rank == 0:  # report the accuracy from the main process only
    accuracy = correct_tensor.item() / total_tensor.item()
    print(f'Accuracy: {accuracy * 100:.2f}%')
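
Summing the raw counts before dividing matters when ranks process different numbers of samples, because averaging per-rank accuracies would weight every rank equally regardless of shard size. A small plain-Python illustration follows; the counts and the helper names are made up for this example.

```python
def pooled_accuracy(per_rank_counts):
    """Accuracy from summed counts, the quantity the summing approach yields."""
    correct = sum(c for c, _ in per_rank_counts)
    total = sum(t for _, t in per_rank_counts)
    return correct / total

def mean_of_accuracies(per_rank_counts):
    """Unweighted mean of per-rank accuracies, shown only for comparison."""
    return sum(c / t for c, t in per_rank_counts) / len(per_rank_counts)

counts = [(9, 10), (1, 2)]             # (correct, total) for rank 0 and rank 1
pooled = pooled_accuracy(counts)       # 10 / 12
averaged = mean_of_accuracies(counts)  # (0.9 + 0.5) / 2
```

The two values differ whenever shard sizes differ, and only the pooled one equals the accuracy over the whole dataset.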
