Checkpoints on v0.5

Issue #76 resolved
andrew_peterson repo owner created an issue

We need to save checkpoints during training; this isn't done on v0.5. The initial parameters should be saved as well as every X iterations.

Comments (10)

  1. andrew_peterson reporter
    • changed status to open

    Actually, I think this does not work: it keeps saving the initial parameters at each checkpoint. You can see this in the code -- the parameters are written out, but at that stage the routine has no knowledge of what is in the current parameter vector. I confirmed this was happening by also writing out the initial parameters and running diff on the two files, which were identical.
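
    For illustration, the failure mode described above is what you get when the checkpoint writer closes over the initial vector instead of using the vector the optimizer hands it each iteration. A minimal sketch (the factory names and file format here are hypothetical, not the actual Amp code):

    ```python
    import numpy as np

    def buggy_callback_factory(initial_parameters, filename):
        # BUG: closes over the initial vector, so every checkpoint
        # rewrites the same (initial) values.
        def callback(parameters):
            np.savetxt(filename, initial_parameters)
        return callback

    def fixed_callback_factory(filename):
        # FIX: write the vector the optimizer passes at each step,
        # so the checkpoint tracks the live parameters.
        def callback(parameters):
            np.savetxt(filename, parameters)
        return callback
    ```

    Running diff on the checkpoint file versus the initial parameters, as described above, distinguishes the two cases immediately.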

  2. andrew_peterson reporter

    As @akhorshi pointed out, this also applies to the parameters saved in the untrained parameters file produced at the end of training.

  3. Alireza Khorshidi

    This is now solved in commit 5ef4381.

    Right now, we are saving one file for checkpoints and one file for trained/untrained parameters. I was wondering if we also need to save the initial parameters separately (right now, the initial parameters are saved as a checkpoint, but that file is soon overwritten, at the 100th optimization step)?

  4. andrew_peterson reporter

    Probably a good idea. The only issue is that this has the potential to create a lot of clutter. I wonder if we should put these in a "checkpoints" directory? Or somehow allow the user to specify how much checkpointing to do?

  5. Alireza Khorshidi

    The initial parameters are now saved as of commit a5a8e90. To save parameter checkpoints separately (and not overwrite them), one option is to save all checkpoint files in a "checkpoints" folder. We will have hundreds of checkpoint files inside the folder, but they should not be huge files. @andrewpeterson what do you think?
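
    A rough sketch of what such a non-clobbering checkpoint writer could look like; `make_checkpointer`, `directory`, and `interval` are hypothetical names for illustration, not existing Amp keywords:

    ```python
    import os
    import numpy as np

    def make_checkpointer(directory='checkpoints', interval=100):
        """Write the parameter vector into `directory` every
        `interval` calls, one numbered file per checkpoint."""
        if not os.path.isdir(directory):
            os.makedirs(directory)
        state = {'calls': 0}

        def callback(parameters):
            state['calls'] += 1
            if state['calls'] % interval == 0:
                # Numbered filenames keep earlier checkpoints
                # from being overwritten.
                path = os.path.join(
                    directory, 'parameters-%06i.txt' % state['calls'])
                np.savetxt(path, parameters)

        return callback
    ```

    An `interval` knob like this would also address the clutter concern above: the user chooses how much checkpointing to do, and everything lands in one folder.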
