Checkpoints on v0.5
We need to save checkpoints during training; this isn't done on v0.5. The initial parameters should be saved as well as every X iterations.
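A minimal sketch of the requested behavior, assuming a simple iterative optimizer (all names here are hypothetical, not the project's actual API): the initial parameters are recorded before the first step, and a checkpoint is taken every `checkpoint_interval` iterations.

```python
import json

def train(parameters, step_fn, n_iterations, checkpoint_interval=100,
          checkpoint_file='checkpoints.json'):
    """Save the initial parameters before the first step and a
    checkpoint every `checkpoint_interval` iterations.
    All names are hypothetical, not the project's actual API."""
    checkpoints = {0: list(parameters)}  # initial parameters, saved up front
    for iteration in range(1, n_iterations + 1):
        parameters = step_fn(parameters)  # one optimization step
        if iteration % checkpoint_interval == 0:
            checkpoints[iteration] = list(parameters)  # current values
    with open(checkpoint_file, 'w') as f:
        json.dump(checkpoints, f)
    return parameters
```

Saving the initial parameters up front also makes it possible to compare any later checkpoint against the starting point.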
Comments (10)
-
reporter - changed status to resolved
Fixed in commit 09faede.
-
reporter - changed status to open
Actually, I think this does not work: it keeps saving the initial parameters at each checkpoint. You can see this in the code -- the parameters are saved, but this routine has no knowledge of what is in `vector` at this stage. I confirmed this was happening by also writing out the initial parameters and running `diff` on the two files, which were identical.
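In miniature, the failure mode described above is a checkpoint routine that snapshots the parameters once, before optimization begins, and keeps writing that stale copy (a hypothetical sketch, not the project's code):

```python
import copy

def buggy_optimize(x0, step_fn, n_steps, interval):
    """Reproduce the bug: the checkpoint writes a copy of the initial
    vector taken before optimization, so every checkpoint is identical
    to the initial parameters. Names are hypothetical."""
    snapshot = copy.deepcopy(x0)  # taken once and never refreshed: the bug
    x = list(x0)
    checkpoints = []
    for step in range(1, n_steps + 1):
        x = step_fn(x)  # the optimizer does update x...
        if step % interval == 0:
            checkpoints.append(list(snapshot))  # ...but we save the stale copy
    return x, checkpoints

x, cps = buggy_optimize([0.0], lambda v: [vi + 1.0 for vi in v], 10, 5)
# every entry of cps equals the initial parameters, which is exactly
# what the diff between the two files showed
```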
reporter - assigned issue to
-
assigned issue to
-
reporter - As @akhorshi pointed out, this also applies to the parameters saved in the untrained-parameters file produced at the end of training.
-
This is now solved in commit 5ef4381.
Right now we are saving one file for checkpoints and one file for trained/untrained parameters. I was wondering if we should also save the initial parameters separately (right now the initial parameters are saved as a checkpoint, but that checkpoint is overwritten at the 100th optimization step)?
-
reporter - Probably a good idea. The only issue is that this has the potential to create a lot of clutter. I wonder if we should put these in a "checkpoints" directory? Or somehow allow the user to specify how much checkpointing to do?
-
Initial parameters are now saved as of commit a5a8e90. To save parameter checkpoints separately (and not overwrite them), one option is to save all checkpoint files in a "checkpoints" folder. We will have hundreds of checkpoint files inside the folder, but they should not be huge files. @andrewpeterson what do you think?
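A sketch of the folder idea, assuming one JSON file per checkpoint (the directory layout and filenames are hypothetical): each checkpoint is keyed by iteration number, so no checkpoint ever overwrites another.

```python
import json
import os

def save_checkpoint(parameters, iteration, directory='checkpoints'):
    """Write each checkpoint to its own file inside `directory`, named
    by iteration number, so no checkpoint overwrites another.
    The layout and filenames are hypothetical."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, 'parameters-%06d.json' % iteration)
    with open(path, 'w') as f:
        json.dump({'iteration': iteration, 'parameters': parameters}, f)
    return path
```

Zero-padding the iteration number keeps the files sorted in training order when the directory is listed.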
-
Just saw your comment :)
-
- changed status to resolved
Issue #94 was marked as a duplicate of this issue.