This data set is part of the #UCSIA15 course by Keith Lyons. The Google doc shared by Keith can be found here. More info about the data set can be found here.
I have downloaded the sheet, converted it to .CSV, and removed #REF! from rows 81-85. This creates missing values (NAs) in the data set that need to be dealt with appropriately. More on this later.
Anyway, here is the data set:
suppressPackageStartupMessages(library(googleVis))
# Load the data
data <- read.csv("AFL Data.csv", header = TRUE, stringsAsFactors = FALSE)
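As an aside, the manual removal of "#REF!" is not strictly necessary: the same result could be achieved at import time by treating those strings as NA (a sketch, assuming the exported CSV still contained them):
# Alternative import (not run): convert "#REF!" cells straight to NA
# data <- read.csv("AFL Data.csv", header = TRUE, stringsAsFactors = FALSE,
#                  na.strings = c("NA", "#REF!"))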
The dimensions of the data are 84x200. Since there are more features (columns) than observations (rows), this is a high-dimensional data set, which is problematic for some analyses.
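A quick way to confirm this (output not shown):
# Check the dimensions: observations (rows) x features (columns)
dim(data)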
The data represents GPS analysis of one AFL game, split into quarters. The scores by quarter in the game were:
33 v 22 (Q1)
33 v 22 (Q2)
33 v 26 (Q3)
33 v 37 (Q4)
The data doesn't contain any player IDs, so it is impossible to perform any type of single-subject analysis; the analysis has to be done at the team level.
The goal of the analysis is to find out whether the GPS data can explain the quarter scores.
Before data can be used it needs to be cleaned and prepared, and usually that is the longest and toughest part of data analysis.
As mentioned previously, some of the rows contain missing values. Let's see which columns contain missing values
nacols <- function(df) {
  colnames(df)[unlist(lapply(df, function(x) any(is.na(x))))]
}
nacols(data)
## [1] "IMA.High.CoD.min" "mins" "mins.coverted."
## [4] "IMA.High.CoD.R.."
And let's see which quarters contain the most missing data
table(data$Period.Name, rowSums(is.na(data)))
##
## 0 4
## Qtr 1 21 0
## Qtr 2 21 0
## Qtr 3 21 0
## Qtr 4 16 5
Apparently, all of the missing data is in Quarter 4. Removing these players (rows) might remove interesting data from Quarter 4. One option is to impute the missing values, either using column averages or a KNN algorithm. For this analysis I will remove the columns containing missing data.
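For reference, a minimal sketch of what imputation could look like (not used in this analysis):
# Sketch (not run): impute the missing values instead of dropping the columns
# Simple option - replace NAs in a numeric column with its mean
# (IMA.High.CoD.min used purely as an illustration):
# data$IMA.High.CoD.min[is.na(data$IMA.High.CoD.min)] <- mean(data$IMA.High.CoD.min, na.rm = TRUE)
# KNN option - impute all numeric columns at once with DMwR::knnImputation:
# library(DMwR)
# numeric.cols <- sapply(data, is.numeric)
# data[numeric.cols] <- knnImputation(data[numeric.cols], k = 5)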
# Remove the columns with missing data
data <- data[!(names(data) %in% nacols(data))]
Another issue we need to deal with is the time format in some of the columns: they contain strings such as "00:10:13", which we want to convert to seconds.
suppressPackageStartupMessages(library(lubridate))
suppressPackageStartupMessages(library(dplyr))
# Convert to factors the columns we are certain don't contain time data
data$Position <- factor(data$Position)
data$Period.Name <- factor(data$Period.Name)
# Create a function that returns the class of each column
allClass <- function(x) {unlist(lapply(unclass(x),class))}
# Get the column classes
column.classes <- allClass(data)
# What columns contain time as character
time.columns.positions <- which(column.classes == "character")
# Select those columns
data.time <- data %>% select(time.columns.positions)
# Convert to seconds
data.time <- sapply(data.time, function(x) {period_to_seconds(hms(x))} )
# Put them back in the original data frame
data[colnames(data.time)] <- data.time
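A quick sanity check of the conversion (an illustrative example):
# "00:10:13" should map to 10*60 + 13 = 613 seconds
period_to_seconds(hms("00:10:13"))
## [1] 613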
Since not all players played full quarters, the GPS metrics need to be normalized per minute of playing time. There is one problem with that: some metrics are already normalized (e.g. Distance / min), and some shouldn't be normalized at all (e.g. % in Zone 5, max velocity). This is why it is important to have domain knowledge of the features.
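For a single metric, per-minute normalization would look something like this (the distance column name is hypothetical, purely for illustration):
# Sketch (not run): express a raw total as a per-minute value
# data$Total.Distance.per.min <- data$Total.Distance / data$Player.Game.Time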
Since there are ~200 features, it is hard to go over each one (especially because their names changed on import, with spaces and special characters removed) to see which ones need to be normalized.
What we can do instead is remove players based on their playing time, so that the quarters become comparable. This is VERY IMPORTANT, since these modifications and assumptions can affect the results of the analysis.
Let’s see the distribution of playing time across quarters
aggregate(Player.Game.Time~Period.Name, data,
function(x){c(MEAN = mean(x), SD = sd(x), N = length(x))})
## Period.Name Player.Game.Time.MEAN Player.Game.Time.SD Player.Game.Time.N
## 1 Qtr 1 26.568571 3.484331 21.000000
## 2 Qtr 2 27.078095 3.138332 21.000000
## 3 Qtr 3 27.249524 3.073313 21.000000
## 4 Qtr 4 29.755714 4.115769 21.000000
# The distribution of play time across quarters
library(ggplot2)
gg <- ggplot(data, aes(Player.Game.Time, fill = Period.Name))
gg <- gg + geom_density(alpha = 0.5)
gg <- gg + theme_bw()
gg
If we use the data as they are, without normalization for time, then any effects might simply be related to the longer fourth quarter. But let's see if the differences are statistically significant (I haven't checked the normality of the data) using ANOVA.
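As an aside, a quick normality check per quarter could look like this (a sketch; output not shown):
# Sketch: Shapiro-Wilk normality test of playing time within each quarter
by(data$Player.Game.Time, data$Period.Name, shapiro.test)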
summary(aov(Player.Game.Time ~ Period.Name, data))
## Df Sum Sq Mean Sq F value Pr(>F)
## Period.Name 3 127.9 42.63 3.525 0.0186 *
## Residuals 80 967.5 12.09
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can see from the ANOVA that we cannot use the quarter data as they are, since the differences in playing time would confound the effects. We need to either normalize the data or apply a play-time cut-off to make the quarters comparable. Since I cannot normalize the data (I don't know which of the ~200 columns need to be divided by playing time), I will remove some of the players (observations) to make playing time more equal across quarters.
It is very important to note that this procedure might/will affect the analysis, but without a list of the features (columns) that need to be normalized I cannot proceed otherwise.
# Keep observations with less than 32 minutes of play
data.mod <- data %>% filter(Player.Game.Time < 32)
# Check the differences between quarters
aggregate(Player.Game.Time~Period.Name, data.mod,
function(x){c(MEAN = mean(x), SD = sd(x), N = length(x))})
## Period.Name Player.Game.Time.MEAN Player.Game.Time.SD Player.Game.Time.N
## 1 Qtr 1 26.568571 3.484331 21.000000
## 2 Qtr 2 27.078095 3.138332 21.000000
## 3 Qtr 3 26.714737 2.706460 19.000000
## 4 Qtr 4 27.675714 3.409556 14.000000
# plot
gg <- ggplot(data.mod, aes(Player.Game.Time, fill = Period.Name))
gg <- gg + geom_density(alpha = 0.5)
gg <- gg + theme_bw()
gg
# Check the statistical significance
summary(aov(Player.Game.Time ~ Period.Name, data.mod))
## Df Sum Sq Mean Sq F value Pr(>F)
## Period.Name 3 11.8 3.941 0.387 0.763
## Residuals 71 722.8 10.180
So using 32 minutes as a cutoff creates more equal groups (but cuts the number of observations in Quarter 4).
Once we have cleaned and prepared data to work with, we can start doing the fun stuff. Once again, the way we prepared the data will affect the analysis, and that is important to keep in mind.
Since we have ~200 features, it would be futile to inspect them all visually (creating a trellis plot, using mean +/- SD across quarters, and maybe performing an ANOVA for every feature, but that rings my multiple-comparisons alarm, since we will always find some significant differences based purely on sampling variation). But, just for the sake of an example (and to show the power of ggplot2), I will do it.
The first trellis plot shows the distribution of every GPS variable across quarters using density plots. We want to visually inspect whether some GPS metrics differ based on the quarter.
suppressPackageStartupMessages(library(reshape2))
data.mod.long <- melt(data.mod, id.vars = 1:5)
gg <- ggplot(data.mod.long, aes(value))
gg <- gg + geom_density(alpha = 0.2, aes(fill = Period.Name))
gg <- gg + facet_wrap(~variable, scales="free", ncol = 5)
gg <- gg + theme_bw()
gg
The next graph shows the same data, but using boxplots
gg <- ggplot(data.mod.long, aes(y = value, x = Period.Name))
gg <- gg + geom_boxplot(alpha = 0.2, aes(fill = Period.Name))
gg <- gg + facet_wrap(~variable, scales="free", ncol = 5)
gg <- gg + theme_bw()
gg
As a simpler solution, I will visualize this using a dot-plot, since we have a lot of variables. To do that I need to "normalize" each feature (using Z-scores) so they can be plotted on the same axis. Each dot represents the mean of the observations for each quarter and each GPS variable. This might help us visually identify variables that differ based on the game quarter.
# Scale the data
data.mod.scaled <- as.data.frame(scale(data.mod[-(1:5)]))
data.mod.scaled$Quarter <- data.mod$Period.Name
# Convert data to long format
data.mod.scaled.long <- melt(data.mod.scaled, id.vars = "Quarter")
gg <- ggplot(data.mod.scaled.long, aes (x = variable, y = value, color=Quarter))
gg <- gg + stat_summary(geom = "point",
alpha = 0.8,
fun.y = mean)
gg <- gg + theme_bw()
gg <- gg + coord_flip()
gg
As can be seen, there is a lot of overlap. Let's run an ANOVA for every GPS metric to try to find the metrics that differ (statistically significantly) across quarters. NOTE: no normality or outlier tests were performed.
Quarters <- data.mod$Period.Name
ANOVA <- function(GPS.Metric)
{
# Extract p-value
return(summary(aov(GPS.Metric ~ Quarters) )[[1]][["Pr(>F)"]][1])
}
# Perform ANOVA across quarters for every GPS variable
GPS.Metric.p.values <- sapply(data.mod[-(1:5)], ANOVA)
# Find meaningful ones (p<0.05)
significant.GPS.metrics <- GPS.Metric.p.values[GPS.Metric.p.values < 0.05]
round(significant.GPS.metrics, 3)
## Meters...min Player.Load...min
## 0.000 0.009
## Vel.Zone.4.Dist Vel.Zone.1.Dist..
## 0.046 0.002
## Vel.Zone.1.Time Vel.Zone.4.Time
## 0.001 0.042
## Vel.Zone.1.Time.. Vel.Zone.2.Time..
## 0.000 0.010
## Vel.Zone.3.Time.. Vel.Zone.4.Time..
## 0.034 0.008
## Vel.Zone.4.Efforts Vel.Zone.3.Min.Effort.Time
## 0.031 0.036
## Meta..Power.Zone.3.Time Meta..Power.Zone.1.Time..
## 0.046 0.006
## Meta..Power.Zone.2.Time.. Meta..Power.Zone.3.Time..
## 0.043 0.009
## Meta..Power.Zone.1.Avg.Power IMA.CoD.Right.Low
## 0.000 0.038
## IMA.Free.Running.Total.Event.Count Distance.Z3.to.Z6
## 0.039 0.016
The above are the GPS variables that might differ across quarters. It is important to note that we are dealing with a multiple-comparisons problem here, so we need to be cautious.
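One way to be more cautious is to adjust the p-values for multiple comparisons, for example with the Benjamini-Hochberg procedure (a sketch, not part of the original analysis; output not shown):
# Adjust the per-metric ANOVA p-values for multiple comparisons (Benjamini-Hochberg / FDR)
GPS.Metric.p.adjusted <- p.adjust(GPS.Metric.p.values, method = "BH")
# Metrics that remain significant after the adjustment
round(GPS.Metric.p.adjusted[GPS.Metric.p.adjusted < 0.05], 3)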
Anyway, let's visualize only those variables
data.mod.long.subset <- data.mod.long[data.mod.long$variable %in% names(significant.GPS.metrics),]
gg <- ggplot(data.mod.long.subset, aes(y = value, x = Period.Name))
gg <- gg + geom_boxplot(alpha = 0.2, aes(fill = Period.Name))
gg <- gg + facet_wrap(~variable, scales="free", ncol = 5)
gg <- gg + theme_bw()
gg
Visualizing multidimensional data is always tricky and, as can be seen above, it also takes a lot of room. We can reduce the dimensionality of the data using PCA (Principal Component Analysis) and visualize all of our ~200 GPS metrics on a 2D chart (using PC1 and PC2). This results in some information loss, but a much more appealing visualization.
# Scale the data
data.mod.scaled <- scale(data.mod[-(1:5)])
data.PCA <- prcomp(data.mod.scaled)
data.plot <- data.frame(Quarter = data.mod$Period.Name,
PC1 = data.PCA$x[,1],
PC2 = data.PCA$x[,2])
gg <- ggplot(data.plot, aes(x = PC1, y = PC2, color = Quarter))
gg <- gg + geom_point(alpha = 0.8, size = 3)
gg <- gg + theme_bw()
gg
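To get an idea of how much information the 2D view retains, we can check the proportion of variance explained by the first two components (a quick check; output not shown):
# Proportion of the total variance explained by PC1 and PC2
round(summary(data.PCA)$importance["Proportion of Variance", 1:2], 3)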
What we are looking for are clusters, and we would 'expect' the quarters to form clusters (more on this later), but they don't. We can see two clusters, and we can probably dig deeper - maybe they are related to the position played?
data.plot <- data.frame(Position = data.mod$Position,
PC1 = data.PCA$x[,1],
PC2 = data.PCA$x[,2])
gg <- ggplot(data.plot, aes(x = PC1, y = PC2, color = Position))
gg <- gg + geom_point(alpha = 0.8, size = 3)
gg <- gg + theme_bw()
gg
It doesn't seem so. Let's run some analytics. First, let's figure out how many clusters are actually in the data, using the silhouette method.
suppressPackageStartupMessages(library(fpc))
(number.of.clusters <- pamk(data.mod.scaled)$nc)
## [1] 2
So there are two clusters in the data. Let’s perform K-means clustering
cluster.model <- kmeans(data.mod.scaled,
centers = number.of.clusters,
nstart = 200)
data.plot <- data.frame(Cluster = factor(cluster.model$cluster),
PC1 = data.PCA$x[,1],
PC2 = data.PCA$x[,2])
gg <- ggplot(data.plot, aes(x = PC1, y = PC2, color = Cluster))
gg <- gg + geom_point(alpha = 0.8, size = 3)
gg <- gg + theme_bw()
gg
Using a decision tree, we can figure out which GPS variables differentiate the clusters.
suppressPackageStartupMessages(library(rpart))
suppressPackageStartupMessages(library(DMwR))
data.mod$Cluster <- factor(cluster.model$cluster)
rpart.model <- rpart(Cluster ~ ., data.mod)
prettyTree(rpart.model)
Apparently only High Intensity Accelerations (with a cutoff of 41) differentiates these two clusters. Let's check by plotting only that variable.
gg <- ggplot(data.mod, aes(x = Cluster, y = High.Intensity.Accels, fill = Cluster))
gg <- gg + geom_boxplot(alpha = 0.8)
gg <- gg + theme_bw()
gg
We can see from the graph above that some players performed a lot of High Intensity Accelerations and some didn't. Without further knowledge we can't say who those players are. From the picture above, it might be that PC1 is what differentiates the clusters.
PCA.data <- as.data.frame(data.PCA$x)
PCA.data$Cluster <- factor(cluster.model$cluster)
rpart.model1 <- rpart(Cluster ~ ., PCA.data)
prettyTree(rpart.model1)
Let's inspect the GPS metrics with the highest (absolute) loadings on PC1, since this component differentiates the clusters.
PC1.loadings <- sort(abs(data.PCA$rotation[,1]))
PC1.loadings <- data.frame(GPS.Variable = factor(names(PC1.loadings),
levels = names(PC1.loadings)),
PC1.Loading = PC1.loadings)
gg <- ggplot(PC1.loadings, aes(x = GPS.Variable, y = PC1.Loading))
gg <- gg + geom_bar(stat='identity', color = "blue", alpha = 0.8) + coord_flip()
gg <- gg + theme_bw()
gg
## Warning: position_stack requires constant width: output may be incorrect
Due to correlation between the GPS variables, the decision tree only selected High Intensity Accelerations, but other variables are also important in differentiating the clusters.
I am not sure why the players form clusters - the tree model included information about position, quarter, and time played, but these were not important enough to differentiate the clusters.
Let's fit a decision tree to see which GPS metrics can predict the quarter and hence tell us which metrics differ across quarters.
# Remove the cluster data
data.mod$Cluster <- NULL
rpart.model2 <- rpartXse(Period.Name ~ ., data.mod[-(1:4)])
prettyTree(rpart.model2)
So, these are the GPS variables that differ across quarters. Here is the list
# Variable importance from the tree
(significant.GPS.metrics.rtree <- rpart.model2$variable.importance)
Please compare these to the significant variables selected by the ANOVA.
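A quick way to make that comparison (a sketch; output not shown):
# GPS metrics flagged both by the per-metric ANOVA and by the tree's variable importance
intersect(names(significant.GPS.metrics), names(significant.GPS.metrics.rtree))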
Since the last quarter was lost, the variable that seems to matter here is Metabolic Power Zone 1 Average Power.
Another way to find which variables were different when the quarter was lost is to code the quarters as Win/Lost and see if they differ.
quarter.outcome <- rep("Win", nrow(data.mod))
quarter.outcome[data.mod$Period.Name == "Qtr 4"] <- "Lost"
quarter.outcome <- factor(quarter.outcome)
data.mod$Quarter.Outcome <- quarter.outcome
rpart.model3 <- rpart(Quarter.Outcome ~ ., data.mod[-(1:5)])
prettyTree(rpart.model3)
It seems to be the same variable: Metabolic Power Zone 1 Average Power. When the quarter was lost, Metabolic Power Zone 1 Average Power was < 4.705.
gg <- ggplot(data.mod, aes(x = Quarter.Outcome,
y = Meta..Power.Zone.1.Avg.Power,
fill = Quarter.Outcome))
gg <- gg + geom_boxplot(alpha = 0.8)
gg <- gg + theme_bw()
gg
I haven't performed model diagnostics (e.g. ROC curves, a confusion matrix) for the sake of time. Please note that the rpart trees use internal k-fold cross-validation (for pruning) anyway.
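For illustration, an in-sample confusion matrix for the Win/Lost tree could be obtained along these lines (a sketch; output not shown, and note it is not a proper out-of-sample evaluation):
# Sketch: in-sample confusion matrix for the Win/Lost decision tree
predicted.outcome <- predict(rpart.model3, data.mod[-(1:5)], type = "class")
table(Predicted = predicted.outcome, Actual = data.mod$Quarter.Outcome)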
To be honest, I am not sure this is the correct way to analyze the GPS metrics to determine which ones matter for the quarter outcome. This is only one game, and we coded the quarter outcome onto each individual athlete, so we are effectively trying to predict the game (or quarter) outcome from single-player performance. What we should do instead is aggregate across multiple games and provide a "summary" of the metrics across the team (e.g. average or median), and then see which GPS metrics (of team performance) are related to quarter outcome. I am not sure whether this is a problem for mixed/multi-level/hierarchical modeling. These types of models are something I am currently reading about, so maybe in the months to come I will be able to provide more food for thought.