I am running the Inception v4 training code from the model zoo on an Atlas 800-9000, and I have some questions about configuring distributed training (up to 8 devices on a single server).
According to the documentation, I need to prepare a resource configuration file (the ranktable file) that contains the NIC IP addresses of the devices available in the training server.
However, I noticed that even if the IP addresses in the ranktable file are not valid, the training runs successfully and uses all 8 devices (I checked the device utilization with the npu-smi and ascend-dmi tools). So my questions are:
Is the ranktable file required for distributed training in the single-server scenario?
If the ranktable file is not required in that scenario, how is the training job scheduled across the devices?
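
For context, here is a rough sketch of the kind of ranktable file I am talking about, written as Python that builds the JSON for one server with 8 devices. The field names (`server_list`, `device_id`, `device_ip`, `rank_id`, and so on) follow the version 1.0 layout as I understand it and should be treated as an assumption, and the IP addresses and `server_id` are made-up placeholders. The helper `invalid_device_ips` is hypothetical and only catches malformed addresses; it cannot tell whether a well-formed address actually matches a device's NIC configuration.

```python
import ipaddress
import json


def build_ranktable(server_id, device_ips):
    """Build a single-server ranktable dict (assumed v1.0-style layout)."""
    devices = [
        {"device_id": str(i), "device_ip": ip, "rank_id": str(i)}
        for i, ip in enumerate(device_ips)
    ]
    return {
        "version": "1.0",
        "server_count": "1",
        "server_list": [{"server_id": server_id, "device": devices}],
        "status": "completed",
    }


def invalid_device_ips(ranktable):
    """Return (rank_id, device_ip) pairs whose IP is not a well-formed address."""
    bad = []
    for server in ranktable["server_list"]:
        for dev in server["device"]:
            try:
                ipaddress.ip_address(dev["device_ip"])
            except ValueError:
                bad.append((dev["rank_id"], dev["device_ip"]))
    return bad


# Placeholder addresses, not taken from a real server.
ips = [f"192.168.{100 + i % 4}.{101 + i // 4}" for i in range(8)]
table = build_ranktable("10.0.0.1", ips)
print(json.dumps(table, indent=2))
print("malformed entries:", invalid_device_ips(table))
```

In my case the file has this general shape, but the `device_ip` values do not correspond to the real device NICs, and the job still runs on all 8 devices.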
