Set Up a Distributed TensorFlow Cluster on 1&1 Cloud Servers

Introduction

Learn how to set up a cluster of 1&1 Cloud Servers to run Distributed TensorFlow. Spreading a TensorFlow job across several servers greatly increases training throughput and processing speed, because multiple machines work on the same model in parallel. Google recommends a cluster as the strategy for handling very large TensorFlow models and data sets.

Requirements

  • A 1&1 Cloud Server account with root access

Create the Master Server

For this project we will create a server, install Pip and TensorFlow on it, and then clone this server to create as many parameter servers and worker nodes as needed.

Log in to the Cloud Panel. Click Servers then +Create to create a new Cloud Server.

Select the desired server size, choose Ubuntu 16.04 64-bit as the operating system, and use a standard installation base.

After the server is built, connect to it with SSH. Follow the instructions in our article Install TensorFlow with Pip to use Pip to install TensorFlow.
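For reference, the installation boils down to a handful of commands. The following is a rough sketch assuming Ubuntu 16.04 with Python 2.7 and a CPU-only build; the linked article remains the authoritative guide:

# Install Pip and a CPU-only TensorFlow build (Python 2.7 assumed)
sudo apt-get update
sudo apt-get install -y python-pip python-dev
sudo pip install --upgrade pip
sudo pip install tensorflow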

Create the Cluster Servers

Log in to the Cloud Panel and click Servers. Click the master server, then click Actions > Clone to clone this server.

Create as many clones as you wish to have nodes in the cluster. For easier management, you may wish to rename each cloned server. To do this, click the pencil icon next to the server name.

Create a Firewall Policy

You will need to allow access to port 22 for SSH, and to ports 2222 and 2223, which the TensorFlow tasks in this example listen on. You may also wish to allow access to ports 80, 8080, and any other ports you wish to use to access the cluster servers.

Log in to the Cloud Panel and click Network > Firewall Policies.

Click +Create to create a new Firewall Policy. Add the ports you wish to open (whitelist), then assign all of the cluster servers to the Firewall Policy. For step-by-step instructions on this process, see our article Configure the Firewall Policy.
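Once the Firewall Policy is assigned, you can optionally confirm from one node that the TensorFlow port on another node is reachable. A quick check with netcat might look like this (10.0.0.2 is a placeholder for the other node's IP address):

# Test whether port 2222 is reachable on another cluster node
nc -zv 10.0.0.2 2222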

Update DNS

For easier management, you may wish to assign a subdomain to each Cloud Server. Log in to the Domain Manager and click the domain you wish to use.

Create a subdomain for each Cloud Server (for example, ps0.example.com, ps1.example.com, worker0.example.com, and so on).

Point each subdomain to the IP address of the corresponding Cloud Server. Click the Actions icon for the subdomain you wish to modify, then click DNS Settings.

In the A/AAAA and CNAME Records section, click Other IP address, then enter the IP address of the Cloud Server.

Click Save to save the DNS changes. Repeat this process for each subdomain.
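Once the DNS changes have propagated, you can verify that each subdomain resolves to the correct server. From any machine with the dig utility installed (the dnsutils package on Ubuntu), a quick check might look like this:

# Each command should print the public IP address of the matching Cloud Server
dig +short ps0.example.com
dig +short worker0.example.com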

Run the Distributed TensorFlow Code

As an example, you can use the trainer.py test script from the Distributed TensorFlow documentation. You will need to create the trainer.py script on each node in the cluster and then run it on every node in turn.

For example, if you set up the following four cluster nodes:

  • ps0.example.com
  • ps1.example.com
  • worker0.example.com
  • worker1.example.com

You would use the following tf.train.ClusterSpec declaration:

tf.train.ClusterSpec({
    "worker": [
        "worker0.example.com:2222",
        "worker1.example.com:2222",
    ],
    "ps": [
        "ps0.example.com:2222",
        "ps1.example.com:2222"
    ]})
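For orientation, the skeleton of trainer.py builds this same ClusterSpec from the --ps_hosts and --worker_hosts flags that the commands below pass in, then starts a tf.train.Server for the local task. A minimal sketch using the TensorFlow 1.x API, with the model-building and training code omitted, looks roughly like this:

import argparse
import sys

import tensorflow as tf

FLAGS = None


def main(_):
    ps_hosts = FLAGS.ps_hosts.split(",")
    worker_hosts = FLAGS.worker_hosts.split(",")

    # Build the cluster description from the parameter server and worker hosts.
    cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})

    # Create and start a gRPC server for the local task.
    server = tf.train.Server(cluster,
                             job_name=FLAGS.job_name,
                             task_index=FLAGS.task_index)

    if FLAGS.job_name == "ps":
        # Parameter servers only serve variables to the workers, so just wait.
        server.join()
    elif FLAGS.job_name == "worker":
        # The model definition and training loop from the tutorial go here.
        server.join()


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--ps_hosts", type=str, default="")
    parser.add_argument("--worker_hosts", type=str, default="")
    parser.add_argument("--job_name", type=str, default="")
    parser.add_argument("--task_index", type=int, default=0)
    FLAGS, unparsed = parser.parse_known_args()
    tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)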

Add this cluster specification to the relevant section of the test script. SSH to each server in turn and create the trainer.py script, then run it with the command that matches the node's role and task index:

On ps0.example.com:

python trainer.py --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 --job_name=ps --task_index=0

On ps1.example.com:

python trainer.py --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 --job_name=ps --task_index=1

On worker0.example.com:

python trainer.py --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 --job_name=worker --task_index=0

On worker1.example.com:

python trainer.py --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 --job_name=worker --task_index=1

If everything is set up correctly, each server will respond with messages like:

2018-03-21 17:38:21.070861: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> worker0.example.com:2222, 1 -> worker1.example.com:2222}
2018-03-21 17:38:21.072122: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:2222

Tags: TensorFlow