RAMP scoring
Local submissions
When testing a submission locally (i.e. with ramp_test_submission), a number of scores will be calculated and printed to standard output. The scores will look like this:
Testing Titanic survival classification
Reading train and test files from ./data ...
Reading cv ...
Training ./submissions/random_forest_20_5 ...
CV fold 0
train auc = 0.84
valid auc = 0.89
test auc = 0.83
CV fold 1
train auc = 0.85
valid auc = 0.86
test auc = 0.83
CV fold 2
train auc = 0.85
valid auc = 0.83
test auc = 0.82
CV fold 3
train auc = 0.84
valid auc = 0.91
test auc = 0.83
CV fold 4
train auc = 0.85
valid auc = 0.87
test auc = 0.83
CV fold 5
train auc = 0.84
valid auc = 0.89
test auc = 0.84
CV fold 6
train auc = 0.84
valid auc = 0.88
test auc = 0.84
CV fold 7
train auc = 0.85
valid auc = 0.86
test auc = 0.84
----------------------------
Mean CV scores
----------------------------
train auc = 0.85 ± 0.005
valid auc = 0.87 ± 0.023
test auc = 0.83 ± 0.006
----------------------------
Bagged scores
----------------------------
score auc
valid 0.875
test 0.834
Locally, there should be a training dataset and a testing dataset, usually within a folder named data/. We will call these datasets the ‘public’ training data and the ‘public’ test data. This is because, for a RAMP challenge, there will also be private training and test data (see Preparing your data for more).
Eight-fold cross-validation (CV) is performed, whereby the public training data is split into ‘training’ and ‘validation’ subsets 8 times. The subsets are different each time. For each CV fold, the model is trained on the training subset, then used to predict targets for the training subset, the validation subset and the public test data. The scores are computed for the training, validation and test datasets, for each fold. The mean of these 8 scores is calculated and printed under Mean CV scores. In the example above there is only one score metric, ‘auc’. If more than one score metric is defined in problem.py (see score types), scores for all the score metrics will be printed.
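As an illustration, the Mean CV scores line can be reproduced from the per-fold scores printed above. Below is a minimal sketch, assuming the ± value is the standard deviation of the fold scores (which is consistent with the example output); the fold_scores values are simply copied from the valid auc lines above:

import numpy as np

# per-fold 'valid auc' scores copied from the example output above
fold_scores = [0.89, 0.86, 0.83, 0.91, 0.87, 0.89, 0.88, 0.86]

# assuming the "±" in the Mean CV scores block is the standard deviation
# of the per-fold scores
print(f"valid auc = {np.mean(fold_scores):.2f} ± {np.std(fold_scores):.3f}")
# prints: valid auc = 0.87 ± 0.023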
Bagged scores are calculated by combining the predictions of the 8 folds and using the combined prediction to calculate the score. For regression problems, the combined prediction is the mean of the fold predictions; for classification problems, it is the mean probability of each class. For detection problems, the combined prediction calculation is more complex. See the source code for more details.
For example, the Titanic challenge aims to predict whether or not each passenger survived. For each CV fold, different survival predictions are made for the test data, because each fold’s model was trained on different data. The probability of each class (survived or did not survive), computed from the 8 CV models, is averaged for every sample in the test dataset, and the classification label is then computed from these averaged probabilities. This ‘combined prediction’ is used to calculate the ‘bagged’ score. The bagged validation score is calculated similarly, with one difference: the validation datasets differ between CV folds, so a given validation sample may appear in several folds or in only one. If a validation sample was present in only one CV fold, there is only one prediction for that sample, and its combined prediction is simply that single prediction.
Note that technically this is not what bagging means, but the name is used for historical reasons.
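To make this concrete, here is a minimal sketch of how bagged classification scores could be computed: average the per-fold predicted probabilities, then score the averaged prediction once. The toy data and names (y_test, fold_probas) are purely illustrative and not RAMP’s actual implementation:

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_test = rng.integers(0, 2, size=200)  # toy 'survived' labels for the test set

# toy predicted probabilities of survival from each of the 8 CV models
fold_probas = [np.clip(0.7 * y_test + rng.normal(0.15, 0.2, size=200), 0, 1)
               for _ in range(8)]

# per-fold scores (these are what the Mean CV scores block averages)
per_fold_auc = [roc_auc_score(y_test, proba) for proba in fold_probas]
print("mean CV test auc =", round(np.mean(per_fold_auc), 3))

# bagged score: average the probabilities across folds, then score once
bagged_proba = np.mean(fold_probas, axis=0)
print("bagged test auc =", round(roc_auc_score(y_test, bagged_proba), 3))

For a regression problem, the same idea applies with the raw fold predictions averaged instead of class probabilities.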
RAMP event submissions
The above scores are also calculated when you make a submission to a RAMP event on RAMP Studio. However, only the mean CV validation score (i.e., valid auc = 0.87 ± 0.023 in the example above) is shown on the public leaderboard. The mean CV test score is not shown because we wish to assess whether the participants’ submissions generalise to the private test data. Showing the test score would give participants a score to optimise directly, which may result in models that perform well on the test data only because they overfit it. Typically, the test scores are used to officially rank the participants and are made public at the end of a RAMP event.