Bug description

I am trying to use PyTorch Lightning's EarlyStopping callback with my PyTorch model, with early_stop_patience = 3. If the best val_loss is at epoch 8, I expect the checkpoint from epoch 8 to be the best model. But according to the library, best_model_path points to epoch=11. Am I doing something wrong? Can someone confirm the behavior or suggest a fix?
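To make the expectation concrete, here is a minimal standalone sketch of the patience counting. It is not Lightning's implementation: `simulate_early_stopping` is a hypothetical helper, and the losses are the val_loss values from the log in this report (using the reported best score 1.461 for epoch 8).

```python
# Hypothetical helper, not part of PyTorch Lightning: a plain-Python sketch of
# patience-based early stopping in "min" mode.
def simulate_early_stopping(val_losses, patience=3, min_delta=1e-4):
    """Return (best_epoch, stopped_epoch) for a dict {epoch: val_loss}."""
    best_score = float("inf")
    best_epoch = None
    wait_count = 0
    stopped_epoch = None
    for epoch in sorted(val_losses):
        loss = val_losses[epoch]
        if loss < best_score - min_delta:
            # Improvement by more than min_delta: record it, reset the counter.
            best_score, best_epoch, wait_count = loss, epoch, 0
        else:
            wait_count += 1
            if wait_count >= patience:
                # Training stops here, `patience` epochs after the best one.
                stopped_epoch = epoch
                break
    return best_epoch, stopped_epoch


# val_loss per epoch, copied from the log in this report.
losses = {8: 1.461, 9: 2.190, 10: 2.220, 11: 1.560}
best_epoch, stopped_epoch = simulate_early_stopping(losses)
print(best_epoch, stopped_epoch)
```

With these values the best epoch is 8 and training stops at epoch 11, so a checkpoint that simply tracks the last finished epoch would be named epoch=11.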

Here is the partial log showing the training progression, the val_loss and the best_model_path.

How to reproduce the bug

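import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping, LearningRateMonitor

pl.seed_everything(42)

# Stop once val_loss has not improved by at least min_delta for 3 validation checks.
early_stop_callback = EarlyStopping(
    monitor="val_loss", min_delta=1e-4, patience=3, verbose=True, mode="min"
)
lr_logger = LearningRateMonitor()

# Note: no ModelCheckpoint callback is passed explicitly here.
trainer = pl.Trainer(
    max_epochs=20,
    gpus=0,
    gradient_clip_val=0.1,
    callbacks=[early_stop_callback, lr_logger],
    limit_train_batches=100,
)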

Error messages and logs

Training for an epoch...
Epoch 8: 100%|███████████████████████████████| 111/111 [01:58<00:00,  1.07s/it, loss=2.15, v_num=4, train_loss_step=2.520, val_loss=1.460, train_loss_epoch=2.400Finished training an epoch.                                                                                                                                        
Metric val_loss improved by 0.763 >= min_delta = 0.0001. New best score: 1.461
Epoch 9: 100%|███████████████████████████████| 111/111 [01:58<00:00,  1.07s/it, loss=2.15, v_num=4, train_loss_step=2.520, val_loss=1.460, train_loss_epoch=2.060]FitLoop: advancing loop
Training for an epoch...
Epoch 9: 100%|███████████████████████████████| 111/111 [02:11<00:00,  1.19s/it, loss=1.68, v_num=4, train_loss_step=1.320, val_loss=2.190, train_loss_epoch=2.060Finished training an epoch.                                                                                                                                        
Epoch 10: 100%|██████████████████████████████| 111/111 [02:12<00:00,  1.19s/it, loss=1.68, v_num=4, train_loss_step=1.320, val_loss=2.190, train_loss_epoch=1.640]FitLoop: advancing loop
Training for an epoch...
Epoch 10: 100%|██████████████████████████████| 111/111 [02:25<00:00,  1.31s/it, loss=2.61, v_num=4, train_loss_step=1.220, val_loss=2.220, train_loss_epoch=1.640Finished training an epoch.                                                                                                                                        
Epoch 11: 100%|██████████████████████████████| 111/111 [02:25<00:00,  1.31s/it, loss=2.61, v_num=4, train_loss_step=1.220, val_loss=2.220, train_loss_epoch=2.110]FitLoop: advancing loop
Training for an epoch...
Epoch 11: 100%|██████████████████████████████| 111/111 [02:38<00:00,  1.43s/it, loss=1.68, v_num=4, train_loss_step=0.655, val_loss=1.560, train_loss_epoch=2.110Finished training an epoch.                                                                                                                                        
Monitored metric val_loss did not improve in the last 3 records. Best score: 1.461. Signaling Trainer to stop.
Epoch 11: 100%|██████████████████████████████| 111/111 [02:38<00:00,  1.43s/it, loss=1.68, v_num=4, train_loss_step=0.655, val_loss=1.560, train_loss_epoch=2.600]FitLoop: train run ended
Epoch 11: 100%|██████████████████████████████| 111/111 [02:38<00:00,  1.43s/it, loss=1.68, v_num=4, train_loss_step=0.655, val_loss=1.560, train_loss_epoch=2.600]
Trainer: trainer tearing down
Trainer: calling teardown hooks

best_model_path
.../lightning_logs/version_4/checkpoints/epoch=11-step=1199.ckpt