I am trying to use PyTorch Lightning's EarlyStopping callback with my PyTorch model, with a patience of 3 (early_stop_patience = 3). If the best validation loss occurs at epoch 8, I expect the checkpoint from epoch 8 to be the best model. However, according to the library, the best_model_path points to epoch=11. Am I doing something wrong? Can someone confirm this behavior or suggest a fix?
Here is my training setup, followed by a partial log showing the training progression, the val_loss, and the resulting best_model_path.
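import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping, LearningRateMonitor

pl.seed_everything(42)  # fix random seeds for reproducibility
# Stop once val_loss has failed to improve by at least min_delta for 3 checks in a row
early_stop_callback = EarlyStopping(
    monitor="val_loss", min_delta=1e-4, patience=3, verbose=True, mode="min"
)
lr_logger = LearningRateMonitor()  # log the learning rate
trainer = pl.Trainer(
    max_epochs=20,
    gpus=0,  # train on CPU
    gradient_clip_val=0.1,
    callbacks=[early_stop_callback, lr_logger],
    limit_train_batches=100,  # use at most 100 training batches per epoch
)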
Training for an epoch...
Epoch 8: 100%|███████████████████████████████| 111/111 [01:58<00:00, 1.07s/it, loss=2.15, v_num=4, train_loss_step=2.520, val_loss=1.460, train_loss_epoch=2.400]
Finished training an epoch.
Metric val_loss improved by 0.763 >= min_delta = 0.0001. New best score: 1.461
Epoch 9: 100%|███████████████████████████████| 111/111 [01:58<00:00, 1.07s/it, loss=2.15, v_num=4, train_loss_step=2.520, val_loss=1.460, train_loss_epoch=2.060]
FitLoop: advancing loop
Training for an epoch...
Epoch 9: 100%|███████████████████████████████| 111/111 [02:11<00:00, 1.19s/it, loss=1.68, v_num=4, train_loss_step=1.320, val_loss=2.190, train_loss_epoch=2.060]
Finished training an epoch.
Epoch 10: 100%|██████████████████████████████| 111/111 [02:12<00:00, 1.19s/it, loss=1.68, v_num=4, train_loss_step=1.320, val_loss=2.190, train_loss_epoch=1.640]
FitLoop: advancing loop
Training for an epoch...
Epoch 10: 100%|██████████████████████████████| 111/111 [02:25<00:00, 1.31s/it, loss=2.61, v_num=4, train_loss_step=1.220, val_loss=2.220, train_loss_epoch=1.640]
Finished training an epoch.
Epoch 11: 100%|██████████████████████████████| 111/111 [02:25<00:00, 1.31s/it, loss=2.61, v_num=4, train_loss_step=1.220, val_loss=2.220, train_loss_epoch=2.110]
FitLoop: advancing loop
Training for an epoch...
Epoch 11: 100%|██████████████████████████████| 111/111 [02:38<00:00, 1.43s/it, loss=1.68, v_num=4, train_loss_step=0.655, val_loss=1.560, train_loss_epoch=2.110]
Finished training an epoch.
Monitored metric val_loss did not improve in the last 3 records. Best score: 1.461. Signaling Trainer to stop.
Epoch 11: 100%|██████████████████████████████| 111/111 [02:38<00:00, 1.43s/it, loss=1.68, v_num=4, train_loss_step=0.655, val_loss=1.560, train_loss_epoch=2.600]
FitLoop: train run ended
Trainer: trainer tearing down
Trainer: calling teardown hooks
best_model_path
.../lightning_logs/version_4/checkpoints/epoch=11-step=1199.ckpt
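To make my expectation concrete, here is a minimal sketch of the patience rule as I understand it, applied to the val_loss values from the log above. This is a simplified stand-in and not Lightning's actual implementation; the function name simulate_early_stopping is hypothetical. It shows the best score at epoch 8 and the stop signal at epoch 11, which is also the epoch in the reported checkpoint path.

```python
def simulate_early_stopping(val_losses, patience=3, min_delta=1e-4):
    """Simplified 'min'-mode patience rule (illustrative, not Lightning's code).

    val_losses: iterable of (epoch, val_loss) pairs in training order.
    Returns (best_epoch, stop_epoch); stop_epoch is None if never triggered.
    """
    best_score = float("inf")
    best_epoch = None
    wait = 0  # consecutive checks without a sufficient improvement
    for epoch, loss in val_losses:
        if loss < best_score - min_delta:
            # Improvement of more than min_delta: record it and reset the counter
            best_score, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return best_epoch, epoch
    return best_epoch, None


# (epoch, val_loss) pairs read off the progress bars in the log above
logged = [(8, 1.460), (9, 2.190), (10, 2.220), (11, 1.560)]
best_epoch, stop_epoch = simulate_early_stopping(logged)
print(best_epoch, stop_epoch)  # 8 11
```

So the early stopping itself seems to behave as expected. One thing I have not configured is a ModelCheckpoint callback with monitor="val_loss" and mode="min". If the default checkpointing simply keeps the most recent epoch rather than tracking val_loss, that would explain why best_model_path points to epoch=11 instead of epoch 8.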