bug: renaming checkpoints
A file name conflict occurs if training is run more than once within a minute:
- In a first training campaign, three checkpoints are saved: one `last.ckpt` and two best checkpoints.
- On a second run within a minute, the last checkpoint is renamed with a suffix indicating its creation date (`YYYY-mm-dd_HH-MM`), and a new `last.ckpt` file is added (along with two new best checkpoints). No error occurs at this point, but the new `last.ckpt` file has the same last-modified date as the previous one.
- If the training is run again, at any point in time, the file overwrite error occurs, because `last.ckpt` cannot be renamed by adding the date suffix: a file with that name already exists.
---------------------------------------------------------------------------
FileExistsError Traceback (most recent call last)
Cell In[47], line 1
----> 1 model.fit(datamodule, name_run='train_semiramis', max_epochs=20, accelerator='cpu', flag_wandb=False)
File ~\aixd\src\aixd\mlmodel\architecture\cond_ae_model.py:642, in CondAEModel.fit(self, datamodule, name_run, max_epochs, callbacks, loggers, accelerator, flag_early_stop, criteria, flag_wandb, wandb_entity, **kwargs)
640 if os.path.exists(os.path.join(self.save_dir, self.CHECKPOINT_DIR, "last.ckpt")):
641 date_f = self._get_file_creation_date(os.path.join(self.save_dir, self.CHECKPOINT_DIR, "last.ckpt"))
--> 642 os.rename(os.path.join(self.save_dir, self.CHECKPOINT_DIR, "last.ckpt"), os.path.join(self.save_dir, self.CHECKPOINT_DIR, "last_" + date_f + ".ckpt"))
644 callbacks.append(
645 ModelCheckpoint(
646 monitor=criteria,
(...)
653 )
654 )
656 # Setup early stopping callback
FileExistsError: [WinError 183] Cannot create a file when that file already exists:
'c:\\..\\checkpoints\\last.ckpt' -> 'c:\\..\\checkpoints\\last_2025-02-18_13-56.ckpt'
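The name collision behind this error can be illustrated in isolation. This is only a sketch: `MINUTE_FORMAT` and `minute_suffix` are hypothetical stand-ins for whatever `_get_file_creation_date` does internally, assumed here to format a timestamp at minute resolution as described above.

```python
from datetime import datetime

# Hypothetical stand-in for the minute-resolution suffix described in this issue.
MINUTE_FORMAT = "%Y-%m-%d_%H-%M"


def minute_suffix(timestamp: datetime) -> str:
    """Format a timestamp the way the archived checkpoint name appears to be built."""
    return timestamp.strftime(MINUTE_FORMAT)


# Two checkpoints written 35 seconds apart, within the same minute.
first_run = datetime(2025, 2, 18, 13, 56, 5)
second_run = datetime(2025, 2, 18, 13, 56, 40)

first_name = "last_" + minute_suffix(first_run) + ".ckpt"
second_name = "last_" + minute_suffix(second_run) + ".ckpt"

print(first_name)                 # last_2025-02-18_13-56.ckpt
print(first_name == second_name)  # True: both runs map to the same file name
```

On Windows, `os.rename` raises `FileExistsError` when the destination already exists, which matches the traceback. On POSIX systems the same call would normally replace the existing file silently, so an older archived checkpoint could be lost without any error.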
Also, for multiple runs within a minute, the best checkpoints share the same prefix for all runs.
Suggestion: change the suffix to include seconds, e.g. `YYYY-mm-dd_HH-MM-SS`?
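A minimal sketch of how that could look, under stated assumptions: `archive_last_checkpoint` is a hypothetical helper, not the existing code in `cond_ae_model.py`, and it uses the file's modification time where the real code calls `_get_file_creation_date`. Besides adding seconds, it falls back to a numeric counter, since two checkpoints could still share the same second.

```python
import os
import tempfile
from datetime import datetime
from typing import Optional


def archive_last_checkpoint(checkpoint_dir: str) -> Optional[str]:
    """Rename last.ckpt to last_<timestamp>.ckpt without overwriting an existing file.

    Hypothetical helper for illustration. Returns the new path, or None if
    there is no last.ckpt to archive.
    """
    source = os.path.join(checkpoint_dir, "last.ckpt")
    if not os.path.exists(source):
        return None

    # Assumption: the modification time is an acceptable timestamp source.
    modified = datetime.fromtimestamp(os.path.getmtime(source))
    stamp = modified.strftime("%Y-%m-%d_%H-%M-%S")  # now includes seconds

    target = os.path.join(checkpoint_dir, f"last_{stamp}.ckpt")
    counter = 1
    # Seconds can still collide, so append a counter until the name is free.
    while os.path.exists(target):
        target = os.path.join(checkpoint_dir, f"last_{stamp}_{counter}.ckpt")
        counter += 1

    os.rename(source, target)
    return target


# Demo: archive two checkpoints that carry the exact same modification time.
fixed_time = 1739886965  # arbitrary fixed epoch timestamp for the demo
with tempfile.TemporaryDirectory() as checkpoint_dir:
    archived = []
    for _ in range(2):
        path = os.path.join(checkpoint_dir, "last.ckpt")
        open(path, "w").close()                    # create an empty checkpoint
        os.utime(path, (fixed_time, fixed_time))   # force identical timestamps
        archived.append(os.path.basename(archive_last_checkpoint(checkpoint_dir)))

print(archived)  # two distinct names; the second ends in "_1.ckpt"
```

The counter fallback keeps the rename safe even when timestamps are identical, so the fix does not depend on the timestamp resolution alone. The best checkpoints, which share a prefix across runs, would need a similar per-run distinction; that part is not covered here.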
Edited by Ania Apolinarska