I am running the codebase at GitHub - salesforce/MUST: PyTorch code for MUST as is, on 16 V100 GPUs. After training, during evaluation, I get the following error:

Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800172 milliseconds before timing out.
Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800768 milliseconds before timing out.
Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1801901 milliseconds before timing out.
Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1801862 milliseconds before timing out.
Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1801679 milliseconds before timing out.
Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1802173 milliseconds before timing out.
Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1802073 milliseconds before timing out.
Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1801667 milliseconds before timing out.
Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1801777 milliseconds before timing out.
Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1801821 milliseconds before timing out.
Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1802399 milliseconds before timing out.
Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1802025 milliseconds before timing out.
Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1802703 milliseconds before timing out.
WARNING.api:Sending process 1546 closing signal SIGTERM
WARNING.api:Sending process 1549 closing signal SIGTERM
WARNING.api:Sending process 1553 closing signal SIGTERM
WARNING.api:Sending process 1554 closing signal SIGTERM
WARNING.api:Sending process 1555 closing signal SIGTERM
WARNING.api:Sending process 1556 closing signal SIGTERM
WARNING.api:Sending process 1558 closing signal SIGTERM
WARNING.api:Sending process 1562 closing signal SIGTERM
WARNING.api:Sending process 1563 closing signal SIGTERM
WARNING.api:Sending process 1564 closing signal SIGTERM
WARNING.api:Sending process 1565 closing signal SIGTERM
WARNING.api:Sending process 1566 closing signal SIGTERM
WARNING.api:Sending process 1570 closing signal SIGTERM

The error occurs with the ImageNet dataset. With a smaller dataset such as Caltech101, it does not happen. Increasing the timeout limit beyond 1800 seconds only delays the iteration at which the error occurs during evaluation.
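In case it helps anyone comparing runs, here is a minimal sketch for summarizing the watchdog lines in the log above. It is not part of the MUST codebase; the names `WATCHDOG_RE` and `summarize_timeouts` are hypothetical, and the regex assumes the message format is exactly the one shown in this post.

```python
import re

# Assumed message format, copied from the watchdog lines in the log above.
WATCHDOG_RE = re.compile(
    r"WorkNCCL\(OpType=(?P<op>\w+), Timeout\(ms\)=(?P<timeout>\d+)\) "
    r"ran for (?P<elapsed>\d+) milliseconds"
)


def summarize_timeouts(log_text):
    """Return one (op, timeout_ms, elapsed_ms, overrun_ms) tuple per watchdog message."""
    rows = []
    for match in WATCHDOG_RE.finditer(log_text):
        timeout_ms = int(match.group("timeout"))
        elapsed_ms = int(match.group("elapsed"))
        rows.append((match.group("op"), timeout_ms, elapsed_ms, elapsed_ms - timeout_ms))
    return rows


# Two messages taken verbatim from the log in this post.
sample = (
    "Watchdog caught collective operation timeout: "
    "WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800172 milliseconds before timing out.\n"
    "Watchdog caught collective operation timeout: "
    "WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1802703 milliseconds before timing out.\n"
)

for op, timeout_ms, elapsed_ms, overrun_ms in summarize_timeouts(sample):
    print(f"{op}: timeout={timeout_ms} ms, elapsed={elapsed_ms} ms, overrun={overrun_ms} ms")
```

Run over the full log, this reports ALLGATHER for every message, with each one exceeding the 1800000 ms limit by between 172 and 2703 ms.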