I had access to surfing data from surf-stats.com, a site that collects data for fantasy surfing. It contains average scores for 55 surfers in the 2014 World Surfing League under different conditions: wind, wave size, surf break, weather and more. Scores are continuous, varying from 0 to 20.

This is how the WSL scoring works in more detail:

Events are comprised of rounds and those rounds are made up of heats with anywhere from two to four surfers looking to lock in their two highest scoring waves, both out of a possible 10 points for a possible 20-point heat total. A panel of five judges scores each wave on a scale of one to ten. For every scoring ride, the highest and lowest scores (of the five judges) are discounted and the surfer receives the average of the remaining three scores. There is no limit on the number of waves that will be scored, but the two best scoring waves (each out of a possible 10) are added together to become a surfer's heat total (out of a possible 20).
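
The scoring rule above can be sketched in a few lines of R. This is only an illustration: the function names `wave_score` and `heat_total` and the judges' scores are made up for the example and do not come from the WSL or from surf-stats.com.

```r
# Score a single wave: drop the highest and lowest of the five judges'
# scores and average the remaining three.
wave_score <- function(judge_scores) {
  stopifnot(length(judge_scores) == 5)
  trimmed <- sort(judge_scores)[2:4]
  mean(trimmed)
}

# Heat total: sum of the surfer's two best wave scores (maximum 20).
heat_total <- function(wave_scores) {
  best_two <- head(sort(wave_scores, decreasing = TRUE), 2)
  sum(best_two)
}

# Hypothetical judges' scores for three waves ridden in one heat.
waves <- list(
  c(6.0, 6.5, 7.0, 7.5, 8.0),
  c(3.0, 4.0, 4.0, 4.5, 5.0),
  c(8.0, 8.5, 9.0, 9.0, 9.5)
)

scores <- sapply(waves, wave_score)
total <- heat_total(scores)
print(scores)
print(total)
```

Only the first and third waves count towards the heat total here, because the second wave is the lowest of the three.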

How to visualise the performance of each surfer? One way is to examine correlations between scores in a scatterplot.

The problem with the scatterplots above is obvious: they are messy and convey little useful information.

A good way to reduce the number of dimensions while still capturing important information in a dataset is to apply Principal Component Analysis (PCA). In R, the function used to create principal components from a data matrix is `princomp`.

    # Principal Components Analysis
    # cor = TRUE uses the correlation matrix, so the variables are standardised
    surf.pca <- princomp(surf2, cor = TRUE, scores = TRUE)
    summary(surf.pca)

The number of components created is the same as the number of variables: 22. However, the first two components capture 69% of the variance in the data.

    Importance of components:
                              Comp.1     Comp.2
    Standard deviation     3.5986606 1.48119937
    Proportion of Variance 0.5886527 0.09972507
    Cumulative Proportion  0.5886527 0.68837773
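
The proportions in the summary can be reproduced by hand from the standard deviations. The sketch below assumes the PCA was run on the correlation matrix (`cor = TRUE`), in which case each of the 22 variables contributes a variance of 1 and the total variance is 22. The variable names `sdev`, `n_vars`, `prop_var` and `cum_var` are mine. With the fitted object at hand, the same numbers should come from `surf.pca$sdev^2 / sum(surf.pca$sdev^2)`.

```r
# Standard deviations of the first two components, copied from the summary.
sdev <- c(Comp.1 = 3.5986606, Comp.2 = 1.48119937)

# With cor = TRUE every standardised variable has variance 1,
# so the total variance equals the number of variables.
n_vars <- 22

# Variance of a component is its standard deviation squared.
prop_var <- sdev^2 / n_vars
cum_var <- cumsum(prop_var)

print(round(prop_var, 4))
print(round(cum_var, 4))
```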

We can now plot each surfer on a scatterplot with Component 2 on the x axis and Component 1 on the y axis, using `ggplot2`. Note how the top surfers, those with the most points at the end of the season such as Gabriel Medina and Kelly Slater, sit at the bottom centre of the plot, while the lower-ranked surfers are spread across the top.
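
A plot of this kind could be built roughly as follows. This is a sketch, not the original plotting code: the data frame `fake` is random stand-in data (55 hypothetical surfers, four made-up conditions) because the surf-stats.com data is not reproduced here, and the names `pca`, `plot_data` and `p` are my own.

```r
library(ggplot2)

# Hypothetical stand-in for the real data: 55 surfers x 4 conditions,
# with random scores between 0 and 20.
set.seed(1)
fake <- as.data.frame(matrix(runif(55 * 4, 0, 20), nrow = 55))
rownames(fake) <- paste("Surfer", 1:55)

# PCA on the correlation matrix, keeping the component scores.
pca <- princomp(fake, cor = TRUE, scores = TRUE)

# One row per surfer with the scores on the first two components.
plot_data <- data.frame(
  surfer = rownames(fake),
  comp1 = pca$scores[, 1],
  comp2 = pca$scores[, 2]
)

# Component 2 on the x axis, Component 1 on the y axis, as in the text.
p <- ggplot(plot_data, aes(x = comp2, y = comp1, label = surfer)) +
  geom_point() +
  geom_text(vjust = -0.8, size = 3) +
  labs(x = "Component 2", y = "Component 1")
print(p)
```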

So, what should a surfer do in order to get to the top? For example, Jordy Smith finished in 6th place in 2014 with a total final score of 721.61. What should he focus on in the coming season in order to close the distance between him and Medina (assuming the other surfers perform much as they did last season)? The answer lies in the loadings of each component.

    # analyse factor loadings
    factor.loading <- cbind(factor1 = surf.pca$loadings[, 1],
                            factor2 = surf.pca$loadings[, 2])
    print(factor.loading)

    variable           factor1      factor2
    wave1_4ft       -0.2374149   0.24349535
    wave4_6ft       -0.2081297   0.15736879
    wave6_8ft       -0.2071007  -0.28506647
    wave8_10ftplus  -0.2112475  -0.32208835
    Offshore        -0.2254676  -0.15282571
    Light           -0.1800410   0.01206050
    Cross           -0.2247514   0.14418875
    Onshore         -0.2253189   0.17502168
    Sunny           -0.1923999  -0.10751846
    Overcast        -0.2367012   0.07289387
    Rainy           -0.2127804  -0.02572388
    Testing         -0.2353646   0.02678095
    Poor            -0.2056338   0.24198318
    Good            -0.1915139  -0.21389770
    Very.Good       -0.2285273  -0.19721732
    Excellent       -0.2078707  -0.35093832
    Left            -0.2170075   0.06510453
    Right           -0.2300637   0.30083087
    RLD              0.1270319   0.00356751
    Point           -0.2242095   0.16569986
    Beach           -0.2300637   0.30083087
    Reef            -0.2019777  -0.39836895

Because Medina sits at the bottom of the plot and slightly to the left, the most immediate benefits lie in the variables with the most negative loadings on each component: a higher score on such a variable lowers the surfer's component score. On component 1, Jordy Smith should focus first on improving his score in small waves (1 to 4 ft) and second on heats when the weather is overcast. Other variables that would move Jordy downwards in the plot are heats in testing conditions, right-breaking waves and beach breaks. With regard to component 2, Jordy should also work on improving his score at reef breaks, which would move him towards the left of the plot.
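
To make the reasoning concrete: a component score is a weighted sum of the standardised variables, with the loadings as weights, so raising one variable by one standard deviation shifts the score by that variable's loading. The sketch below uses three loadings copied from the table above. The `improvement` vector is a hypothetical scenario of my own, not something estimated from the data, and it assumes all other variables stay fixed.

```r
# Loadings for three variables, copied from the table above.
loadings <- rbind(
  wave1_4ft = c(factor1 = -0.2374149, factor2 = 0.24349535),
  Overcast  = c(factor1 = -0.2367012, factor2 = 0.07289387),
  Reef      = c(factor1 = -0.2019777, factor2 = -0.39836895)
)

# Hypothetical improvement, in standard deviations of each variable:
# one standard deviation better in small waves and at reef breaks.
improvement <- c(wave1_4ft = 1, Overcast = 0, Reef = 1)

# Resulting shift in the two component scores (other variables held fixed).
shift <- improvement %*% loadings
print(shift)
```

In this scenario both shifts are negative, that is, down and to the left in the plot. Note that improving in small waves alone would also push a surfer to the right on component 2, since that loading is positive.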

The R code can be found on my GitHub page.