Several metrics will be used to evaluate submissions:

  • Sørensen–Dice coefficient
  • Volume Difference
  • Simple Lesion Count
  • Lesionwise F1 Score

You can find the Docker image used for evaluation on the NPNL GitHub.


Sørensen–Dice Coefficient

The Sørensen–Dice coefficient is computed as follows:

2*|Prediction ∩ Truth| / (|Prediction| + |Truth|)

The coefficient measures the overlap between the prediction and the ground truth and normalizes the overlap by the size of the combined regions. Other equivalent formulations can be found on the relevant Wikipedia page.
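As a minimal sketch, the coefficient can be computed for a pair of binary masks with NumPy (the function name is illustrative, not the challenge's own implementation):

```python
import numpy as np

def dice_coefficient(prediction: np.ndarray, truth: np.ndarray) -> float:
    """Sørensen–Dice coefficient: 2|P ∩ T| / (|P| + |T|) for binary masks."""
    prediction = prediction.astype(bool)
    truth = truth.astype(bool)
    intersection = np.logical_and(prediction, truth).sum()
    denominator = prediction.sum() + truth.sum()
    if denominator == 0:
        # Both masks empty; treating this as perfect agreement is one
        # common convention (the evaluation code may differ).
        return 1.0
    return 2.0 * intersection / denominator
```

A prediction covering half of a two-voxel lesion plus one spurious voxel would score 0.5.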

Volume Difference

The difference between the total true lesion volume and the total predicted lesion volume. Overlap is not considered. Measured in voxels.
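A minimal sketch, assuming the absolute difference is taken (the source does not specify signed vs. absolute):

```python
import numpy as np

def volume_difference(prediction: np.ndarray, truth: np.ndarray) -> int:
    """Absolute difference in total lesion volume, measured in voxels."""
    return abs(int(prediction.astype(bool).sum()) - int(truth.astype(bool).sum()))
```

Note that a prediction can score 0 here while overlapping the truth not at all, which is why this metric is reported alongside Dice.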

Simple Lesion Count

The difference in the number of lesions between the ground truth and the prediction.
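One plausible sketch, assuming lesions are identified as connected components (the exact connectivity used by the evaluation code is an assumption here):

```python
import numpy as np
from scipy import ndimage

def lesion_count_difference(prediction: np.ndarray, truth: np.ndarray) -> int:
    """Absolute difference in the number of connected components (lesions)."""
    _, n_pred = ndimage.label(prediction.astype(bool))
    _, n_truth = ndimage.label(truth.astype(bool))
    return abs(n_pred - n_truth)
```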

Lesionwise F1 Score

Considers whether a given lesion has been detected. The metric is relatively forgiving: a lesion counts as "detected" if even a single predicted voxel overlaps it.
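A hedged sketch of one way to implement this: label the ground-truth lesions as connected components, count a lesion as a true positive if any predicted voxel overlaps it, and count predicted components touching no true lesion as false positives. The actual evaluation code may define components and matches differently.

```python
import numpy as np
from scipy import ndimage

def lesionwise_f1(prediction: np.ndarray, truth: np.ndarray) -> float:
    """Lesionwise F1: a lesion is 'detected' if any predicted voxel overlaps it."""
    pred_mask = prediction.astype(bool)
    truth_mask = truth.astype(bool)
    truth_labels, n_truth = ndimage.label(truth_mask)
    pred_labels, n_pred = ndimage.label(pred_mask)

    # True positives: ground-truth lesions touched by at least one predicted voxel.
    tp = sum(1 for i in range(1, n_truth + 1) if np.any(pred_mask[truth_labels == i]))
    fn = n_truth - tp
    # False positives: predicted components that touch no true lesion.
    fp = sum(1 for j in range(1, n_pred + 1) if not np.any(truth_mask[pred_labels == j]))

    if 2 * tp + fp + fn == 0:
        return 1.0  # no lesions in either mask
    return 2 * tp / (2 * tp + fp + fn)
```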


The overall rank is the mean rank across the metrics. Note that there are two prediction scores: one computed on the public test data (automated evaluation) and one computed on a completely hidden generalizability test dataset (Docker submission). The final ranking for the MICCAI challenge will weight performance on the hidden dataset 4x higher than the public test ranking, in order to discourage participants from overfitting or retraining on the public test data. The final scores will be announced at MICCAI.
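As an illustration only, one natural reading of the 4x weighting is a weighted mean of the two ranks; the organizers' exact aggregation may differ:

```python
def final_rank(public_rank: float, hidden_rank: float) -> float:
    """Hypothetical weighted mean rank, with the hidden-set rank weighted 4x."""
    return (public_rank + 4 * hidden_rank) / 5
```

Under this reading, a team ranked 5th publicly but 1st on the hidden set would receive a combined rank of 1.8, far better than its public rank alone.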