The average performance of our model is 87.5, with a standard deviation of about 0.85.
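A mean and standard deviation like these come from the per-run scores of repeated evaluation. As a minimal sketch (the scores below are hypothetical, not the actual runs behind the figures above), the standard library's `statistics` module is enough:

```python
import statistics

# Hypothetical accuracy scores from five evaluation runs (assumed data).
scores = [87.0, 88.5, 86.5, 88.0, 87.5]

mean = statistics.mean(scores)          # average performance across runs
stdev = statistics.stdev(scores)        # sample standard deviation (n - 1)

print(f"mean={mean:.1f} stdev={stdev:.2f}")
```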
A problem with multiple split tests is that it is possible that some data instances are never included for training or testing, whereas others may be selected multiple times.
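This effect is easy to see by counting how often each instance lands in a test set across repeated random splits. A minimal sketch using only the standard library (the dataset size, number of splits, and test size are all assumed for illustration):

```python
import random

random.seed(1)
n_instances = 20   # hypothetical dataset size
n_splits = 5       # number of random train/test splits
test_size = 6      # instances held out for testing in each split

# Count how many times each instance is selected for a test set.
test_counts = {i: 0 for i in range(n_instances)}
for _ in range(n_splits):
    indices = list(range(n_instances))
    random.shuffle(indices)
    for i in indices[:test_size]:
        test_counts[i] += 1

never_tested = [i for i, c in test_counts.items() if c == 0]
multi_tested = [i for i, c in test_counts.items() if c > 1]
print(f"instances never in a test set: {len(never_tested)}")
print(f"instances in a test set more than once: {len(multi_tested)}")
```

Because 30 test selections are drawn from only 20 instances, at least some instances must be tested more than once, and others may never be tested at all.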
If you have one mean and standard deviation for algorithm A and another mean and standard deviation for algorithm B and they differ (for example, algorithm A has a higher accuracy), how do you know if the difference is meaningful?
This only matters if you want to compare the results between algorithms.
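One common way to answer that question is a paired significance test on the per-fold scores of the two algorithms. The sketch below implements a paired t-test by hand with the standard library; the accuracy scores are hypothetical, and the critical value is the standard two-tailed value for 4 degrees of freedom at a 0.05 significance level:

```python
import math

# Hypothetical per-fold accuracy scores for two algorithms (assumed data).
scores_a = [0.88, 0.87, 0.89, 0.86, 0.88]
scores_b = [0.85, 0.86, 0.84, 0.87, 0.85]

# Paired t-test on the per-fold differences.
diffs = [a - b for a, b in zip(scores_a, scores_b)]
n = len(diffs)
mean_d = sum(diffs) / n
var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)
t_stat = mean_d / math.sqrt(var_d / n)

# Two-tailed critical value for n - 1 = 4 degrees of freedom, alpha = 0.05.
t_critical = 2.776
print(f"t = {t_stat:.3f}")
print("significant" if abs(t_stat) > t_critical else "not significant")
```

If `|t|` exceeds the critical value, the difference between the two algorithms is unlikely to be due to the randomness of the splits alone. In practice a library routine such as `scipy.stats.ttest_rel` does the same computation and also reports a p-value.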
The effect is that this may skew results and may not give a meaningful idea of the accuracy of the algorithm.
The randomness may be explicit in the algorithm or may be in the sample of the data selected to train the algorithm.

This will split the dataset into 10 parts (10 folds) and the algorithm will be run 10 times. Each time the algorithm is run, it will be trained on 90% of the data and tested on 10%, and each run of the algorithm will change which 10% of the data the algorithm is tested on.

If you have a dataset, you may want to train the model on the dataset and then report the results of the model on that dataset. The problem with this approach of evaluating algorithms is that you will indeed know the performance of the algorithm on the dataset, but you have no indication of how the algorithm will perform on data that the model was not trained on (so-called unseen data). This matters only if you want to use the model to make predictions on unseen data.
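The 10-fold procedure described above can be sketched with the standard library alone. The `k_fold_indices` helper below is hypothetical, written for illustration; in practice a library class such as scikit-learn's `KFold` serves the same purpose:

```python
def k_fold_indices(n_instances, k=10):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_instances))
    fold_size = n_instances // k
    for fold in range(k):
        start, stop = fold * fold_size, (fold + 1) * fold_size
        # Test on one fold, train on the remaining k - 1 folds.
        yield indices[:start] + indices[stop:], indices[start:stop]

n = 100  # hypothetical dataset size
folds = list(k_fold_indices(n, k=10))
for train_idx, test_idx in folds:
    print(f"train on {len(train_idx)} instances, test on {len(test_idx)}")
```

Each of the 10 runs trains on 90% of the data and tests on a different 10%, so every instance is used for testing exactly once.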