分佈式深度學習框架horovod

horovod分佈式深度學習框架

horovod是一個針對TensorFlow/Keras/PyTorch的分佈式訓練框架,目標是構建一個快速的易用的分佈式深度學習。

安裝:

1.安裝Open MPI或其他MPI.

2.安裝horovod包.

$ pip install horovod

如果想用docker,則參見https://github.com/uber/horovod/blob/master/docs/docker.md

使用:

參見如下使用的例子。

import tensorflow as tf

import horovod.tensorflow as hvd

# Initialize Horovod

hvd.init()

# Pin GPU to be used to process local rank (one GPU per process)

config = tf.ConfigProto()

config.gpu_options.visible_device_list = str(hvd.local_rank())

# Build model...

loss = ...

opt = tf.train.AdagradOptimizer(0.01 * hvd.size())

# Add Horovod Distributed Optimizer

opt = hvd.DistributedOptimizer(opt)

# Add hook to broadcast variables from rank 0 to all other processes during

# initialization.

hooks = [hvd.BroadcastGlobalVariablesHook(0)]

# Make training operation

train_op = opt.minimize(loss)

# Save checkpoints only on worker 0 to prevent other workers from corrupting them.

checkpoint_dir = '/tmp/train_logs' if hvd.rank() == 0 else None

# The MonitoredTrainingSession takes care of session initialization,

# restoring from a checkpoint, saving to a checkpoint, and closing when done

# or an error occurs.

with tf.train.MonitoredTrainingSession(checkpoint_dir=checkpoint_dir,

config=config,

hooks=hooks) as mon_sess:

while not mon_sess.should_stop():

# Perform synchronous training.

mon_sess.run(train_op)

運行:

1.在帶有4個GPU的機器上運行。

$ mpirun -np 4 \

-H localhost:4 \

-bind-to none -map-by slot \

-x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \

-mca pml ob1 -mca btl ^openib \

python train.py


分享到:


相關文章: