Created by: davides
Patch Description
- Add trailing wildcard support to PathManager.ls() to support the changes below.
- Update checkpoint caching:
  - Remove the double call to get_local_path here, which may have been causing a race condition. Passing force=True should get the intended effect.
  - Add a utility to stress test file locking.
  - Fix load_checkpoint_to_cpu() to support remote checkpoints when DP>1 (see the stacktrace I got here). I think this only worked before because get_local_path() is a no-op for local paths and the other shard files are already nearby. Updated to ensure we cache all shards locally before attempting to load, using the new wildcard support in PathManager.ls().
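For reviewers, here is a minimal sketch of the idea behind the shard caching change: list every shard with a trailing wildcard, then copy each one into a local cache before loading. It is not the code in this patch. `ls_with_trailing_wildcard` and `cache_all_shards` are hypothetical names, the local filesystem stands in for the remote store, and the `<prefix>-shard<N>.pt` naming is an assumption made for illustration.

```python
import os
import shutil


def ls_with_trailing_wildcard(path):
    """List a directory; a trailing '*' filters entries by name prefix.

    Hypothetical stand-in for a wildcard-aware PathManager.ls().
    """
    if path.endswith("*"):
        # "dir/ckpt-shard*" -> directory "dir", prefix "ckpt-shard"
        directory, prefix = os.path.split(path[:-1])
        names = os.listdir(directory or ".")
        return sorted(n for n in names if n.startswith(prefix))
    return sorted(os.listdir(path))


def cache_all_shards(checkpoint_path, cache_dir, fetch=shutil.copyfile):
    """Copy every sibling shard into cache_dir before loading.

    Assumes shards are named "<prefix>-shard<N>.pt" (illustrative only).
    Returns the local path of the shard that was requested.
    """
    directory, filename = os.path.split(checkpoint_path)
    if "-shard" in filename:
        prefix = filename.split("-shard")[0]
        pattern = os.path.join(directory, prefix + "-shard*")
        names = ls_with_trailing_wildcard(pattern)
    else:
        # Consolidated checkpoint: only one file to fetch.
        names = [filename]

    os.makedirs(cache_dir, exist_ok=True)
    for name in names:
        local = os.path.join(cache_dir, name)
        if not os.path.exists(local):
            fetch(os.path.join(directory, name), local)
    return os.path.join(cache_dir, filename)
```

With this shape, a loader that opens shard 0 and then looks for the other shards next to it finds them in the cache directory, which is what the DP>1 case needs.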
Testing steps
Multiple eval runs on OPT 125M:
- remote path + consolidated
- remote path + DP>1
- local path + consolidated
- local path + DP>1
Running the stress test:
python -m tests.file_io.async_download_test
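To illustrate what a file-locking stress test checks, here is a small self-contained sketch. It is not the contents of tests.file_io.async_download_test. `FileLock`, `cached_download`, and `stress_test` are hypothetical names, and the lock uses atomic exclusive file creation as an assumed mechanism. Many threads race to populate the same cache entry, and the test verifies that exactly one of them performs the copy and that every reader sees the complete file.

```python
import os
import tempfile
import threading
import time


class FileLock:
    """Minimal lock built on atomic exclusive file creation (illustrative)."""

    def __init__(self, path, poll_interval=0.001):
        self.path = path
        self.poll_interval = poll_interval

    def __enter__(self):
        while True:
            try:
                # O_EXCL makes creation fail if the lock file already exists.
                fd = os.open(self.path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
                os.close(fd)
                return self
            except FileExistsError:
                time.sleep(self.poll_interval)

    def __exit__(self, *exc):
        os.remove(self.path)


def cached_download(src, dst, downloads):
    """Copy src to dst at most once, guarded by a lock file next to dst."""
    with FileLock(dst + ".lock"):
        if not os.path.exists(dst):
            downloads.append(1)  # record that a real copy happened
            with open(src, "rb") as fin, open(dst, "wb") as fout:
                fout.write(fin.read())
    return dst


def stress_test(num_workers=16):
    """Return (number of copies performed, whether all readers saw full data)."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "remote.bin")
        dst = os.path.join(tmp, "cached.bin")
        payload = os.urandom(1 << 16)
        with open(src, "wb") as f:
            f.write(payload)

        downloads, results = [], []

        def worker():
            path = cached_download(src, dst, downloads)
            with open(path, "rb") as f:
                results.append(f.read() == payload)

        threads = [threading.Thread(target=worker) for _ in range(num_workers)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return len(downloads), len(results) == num_workers and all(results)
```

A passing run reports a single copy and no partial reads. If the existence check were moved outside the lock, several workers could pass it before the first copy completes, which is the kind of race this test is meant to surface.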