FlowPulse: Catching Network Failures in ML Clusters
| Title: | FlowPulse: Catching Network Failures in ML Clusters |
|---|---|
| Authors: | Krebs, Jakob; Gavrilenko, Dimitry; Amir, Daniel; Landau Feibish, Shir; Silberstein, Mark |
| Source: | Proceedings of the 24th ACM Workshop on Hot Topics in Networks, pp. 139-148 |
| Availability: | http://dl.acm.org/doi/10.1145/3772356.3772384 |
| Database: | ACM Full-Text Collection |