I had access to surfing data from surf-stats.com, a site that collects data for fantasy surfing. It contains average scores for 55 surfers in the 2014 World Surf League under different conditions: wind, wave size, surf break, weather and more. Scores are continuous, varying from 0 to 20.
This is how the WSL scoring works in more detail:
Events are comprised of rounds and those rounds are made up of heats with anywhere from two to four surfers looking to lock in their two highest scoring waves, both out of a possible 10 points for a possible 20-point heat total. A panel of five judges scores each wave on a scale of one to ten. For every scoring ride, the highest and lowest scores (of the five judges) are discounted and the surfer receives the average of the remaining three scores. There is no limit on the number of waves that will be scored, but the two best scoring waves (each out of a possible 10) are added together to become a surfer's heat total (out of a possible 20).
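To make the scoring rule concrete, here is a minimal sketch in R. The function names and the judge scores are made up for illustration; only the rule itself (drop the highest and lowest of five scores, average the rest, add the two best waves) comes from the description above.

```r
# Score one wave: drop the highest and lowest of the five judges' scores,
# then average the remaining three.
wave_score <- function(judge_scores) {
  stopifnot(length(judge_scores) == 5)
  s <- sort(judge_scores)
  mean(s[2:4])
}

# Heat total: sum of the surfer's two best wave scores (maximum 20).
heat_total <- function(wave_scores) {
  sum(head(sort(wave_scores, decreasing = TRUE), 2))
}

# Hypothetical judge scores for three waves ridden in one heat.
waves <- list(
  c(7.0, 7.5, 8.0, 8.5, 9.0),
  c(5.0, 5.5, 6.0, 6.0, 7.0),
  c(3.0, 4.0, 4.0, 4.5, 5.0)
)

scores <- sapply(waves, wave_score)  # one score per wave
total  <- heat_total(scores)         # two best waves added together
print(scores)
print(total)
```

The third wave is scored but does not count towards the heat total, because only the two best waves are added.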
How to visualise the performance of each surfer? One way is to examine correlations between scores in a scatterplot.
The problem with the scatterplots above is obvious: they are messy and convey little useful information.
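A rough sketch of why pairwise scatterplots get out of hand, using simulated data as a stand-in for the real scores. The matrix `sim`, its column names and the shared `skill` term are assumptions made for the example, not the actual data set.

```r
set.seed(1)

# Simulated stand-in for the real data: 55 surfers, 22 condition columns.
# A shared "skill" term makes the columns positively correlated.
skill <- rnorm(55)
sim <- sapply(1:22, function(j) 12 + 2 * skill + rnorm(55))
colnames(sim) <- paste0("cond", 1:22)

# Number of distinct variable pairs a full scatterplot matrix would show.
n_pairs <- choose(ncol(sim), 2)
print(n_pairs)

# Scatterplot matrix for the first four columns only.
pairs(sim[, 1:4])
```

Even four columns produce six distinct pairs; with all 22 columns the count grows to 231, which is hard to read on one page.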
A good way to reduce the number of dimensions while still capturing important information in a dataset is to apply Principal Component Analysis (PCA). In R, one function that creates principal components from a data matrix is princomp:
# Principal Components Analysis
# cor = TRUE runs the PCA on the correlation matrix, which already
# standardises the variables, so no separate scaling argument is needed
surf.pca <- princomp(surf2, cor = TRUE, scores = TRUE)
summary(surf.pca)
The number of components created is the same as the number of variables: 22. However, the first two components alone capture 69% of the variance in the data.
Importance of components:
                          Comp.1     Comp.2
Standard deviation     3.5986606 1.48119937
Proportion of Variance 0.5886527 0.09972507
Cumulative Proportion  0.5886527 0.68837773
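The proportions in this summary can be recovered from the standard deviations. The sketch below assumes the PCA was run on the correlation matrix, so each of the 22 variables contributes one unit of variance; the variable names are mine.

```r
# Standard deviations of the first two components, copied from the summary.
sdev <- c(3.5986606, 1.48119937)

# With cor = TRUE every variable is standardised to unit variance,
# so the total variance equals the number of variables.
total_var <- 22

prop_var <- sdev^2 / total_var  # proportion of variance per component
cum_prop <- cumsum(prop_var)    # cumulative proportion

print(round(prop_var, 4))
print(round(cum_prop, 4))
```

The second cumulative value is the 69% quoted above.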
We can now plot each surfer on a scatterplot, with Component 2 on the x axis and Component 1 on the y axis, using ggplot2. Note how the top surfers, those with the most points at the end of the season such as Gabriel Medina and Kelly Slater, sit at the bottom centre of the plot, while the lower-ranked surfers are spread over the top.
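A plot of this kind could be built roughly as follows. This is a sketch on simulated data, not the code behind the original figure: `sim`, `plot_df` and the surfer labels are placeholders, and the ggplot2 part only runs if the package is installed.

```r
set.seed(42)

# Simulated stand-in for the real data: 55 surfers, 22 condition columns.
skill <- rnorm(55)
sim <- as.data.frame(sapply(1:22, function(j) 12 + 2 * skill + rnorm(55)))
rownames(sim) <- paste("Surfer", 1:55)

pca <- princomp(sim, cor = TRUE, scores = TRUE)

# One row per surfer, holding the scores on the first two components.
plot_df <- data.frame(
  surfer = rownames(sim),
  comp1  = pca$scores[, 1],
  comp2  = pca$scores[, 2]
)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # Component 2 on the x axis, Component 1 on the y axis, as in the text.
  p <- ggplot(plot_df, aes(x = comp2, y = comp1, label = surfer)) +
    geom_point() +
    geom_text(vjust = -0.6, size = 3) +
    labs(x = "Component 2", y = "Component 1")
  print(p)
}
```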
So, what should a surfer do in order to get to the top? For example, Jordy Smith finished in 6th place in 2014 with a total final score of 721.61. What should he focus on in the coming season in order to close the distance between him and Medina (assuming the other surfers perform much as they did last season)? The answer lies in the loadings of each component.
# analyse factor loadings
factor.loading <- cbind(factor1 = surf.pca$loadings[, 1],
                        factor2 = surf.pca$loadings[, 2])
print(factor.loading)
variable factor1 factor2
wave1_4ft -0.2374149 0.24349535
wave4_6ft -0.2081297 0.15736879
wave6_8ft -0.2071007 -0.28506647
wave8_10ftplus -0.2112475 -0.32208835
Offshore -0.2254676 -0.15282571
Light -0.1800410 0.01206050
Cross -0.2247514 0.14418875
Onshore -0.2253189 0.17502168
Sunny -0.1923999 -0.10751846
Overcast -0.2367012 0.07289387
Rainy -0.2127804 -0.02572388
Testing -0.2353646 0.02678095
Poor -0.2056338 0.24198318
Good -0.1915139 -0.21389770
Very.Good -0.2285273 -0.19721732
Excellent -0.2078707 -0.35093832
Left -0.2170075 0.06510453
Right -0.2300637 0.30083087
RLD 0.1270319 0.00356751
Point -0.2242095 0.16569986
Beach -0.2300637 0.30083087
Reef -0.2019777 -0.39836895
The biggest immediate gains come from the variables with the most negative loadings on each component, since Medina sits at the bottom of the plot and slightly to the left. Jordy Smith should therefore focus first on improving his score on small waves (1 to 4 ft). Secondly, he should improve his score in heats when the weather is overcast. Other variables that would move Jordy downwards in the plot are heats in testing conditions, rights (wave direction) and beach breaks. As for component 2, Jordy should also work on improving his score on reef breaks, which would move him towards the left of the plot.
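To see how a loading translates into movement on the plot: a component score is a weighted sum of the standardised variables, so improving one variable by some amount shifts the score by that amount times the loading. The sketch below copies three rows from the loadings table; the size of the improvements (`delta_z`, in standard deviations) is purely hypothetical.

```r
# Loadings copied from the table above (component 1, component 2).
loadings <- rbind(
  wave1_4ft = c(-0.2374149,  0.24349535),
  Overcast  = c(-0.2367012,  0.07289387),
  Reef      = c(-0.2019777, -0.39836895)
)
colnames(loadings) <- c("comp1", "comp2")

# Hypothetical improvement, in standard deviations of each variable.
delta_z <- c(wave1_4ft = 1, Overcast = 0, Reef = 0.5)

# Component scores are linear in the standardised variables,
# so the shift is the loading-weighted sum of the improvements.
shift <- drop(delta_z %*% loadings)
print(shift)
```

In this made-up scenario the surfer moves down on component 1, while the small-wave and reef effects nearly cancel on component 2, which shows why the two components have to be read together.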
The R code can be found on my GitHub page.