An exercise in plyr and ggplot2 using triathlon results

I ran my last triathlon for this year a couple of weeks ago, in the beautiful town of Stratford-upon-Avon. The results were online the day after so I decided to have a look at my fellow competitors’ times, which gave me an opportunity to flex my plyr and ggplot2 muscles.

The data itself was in PDF, so it was a bit of a pain to extract in a usable format. After some copying and pasting into a spreadsheet, I finally got it into CSV, which was easy to parse with R. Given the amount of effort required, you’ll forgive me for having only extracted the race I was in, that is, the male sprint. It consists of a 400m swim, followed by a 23km ride, and ends with a 5km run.

So first, let’s read the data:

times<-read.table(
     "stratford.csv",
     header=TRUE,
     sep="\t",
     stringsAsFactors=FALSE,
     na.strings="")
head(times)
  Position StartingPosition StartingTime Age Category     Swim    Cycle      Run    Total
1        1              441     08:44:45  32        F 00:06:04 00:36:46 00:19:11 01:02:01
2        2                5     08:46:00  35        G 00:05:55 00:37:23 00:20:18 01:03:36
3        3               26     08:56:00  23        D 00:06:28 00:37:39 00:19:30 01:03:37
4        4              443     10:35:30  31        F     <NA>     <NA> 00:20:51 01:04:09
5        5              445     10:36:00  27        E 00:06:43 00:37:26 00:21:36 01:05:45
6        6               32     08:52:45  32        F 00:06:25 00:39:27 00:21:08 01:07:00

A technical point about the data: the times were manually reported and some are missing. On top of that, the transition times (e.g. between the end of the swim and the start of the cycling) were not recorded but were added to one of the disciplines. The starting position can be seen as an ID.

Next are a few lines that transform the times from character strings to numeric (in minutes). First we define a column-wise function which does exactly that:

library("plyr")     # colwise and ddply come from plyr
library("ggplot2")
library("stringr")
stringToMinutes<-colwise(
  function(s)
     unlist(
          lapply(
               str_split(s,":"),
               function(hms) sum(as.numeric(hms)*c(60,1,1/60))
               )
          ),
  .(Swim,Cycle,Run,Total))


Then apply it to the times data frame, using ddply to carry along the identifying columns (note that StartingTime and Age are dropped in the process):

times<-ddply(times,.(Position,StartingPosition,Category),stringToMinutes)
head(times)
  Position StartingPosition Category     Swim    Cycle      Run    Total
1        1              441        F 6.066667 36.76667 19.18333 62.01667
2        2                5        G 5.916667 37.38333 20.30000 63.60000
3        3               26        D 6.466667 37.65000 19.50000 63.61667
4        4              443        F       NA       NA 20.85000 64.15000
5        5              445        E 6.716667 37.43333 21.60000 65.75000
6        6               32        F 6.416667 39.45000 21.13333 67.00000

With the benefit of hindsight, I can tell you that there are a couple of obvious errors (like an hour-long swim). Let’s clean up the data a bit:

times<-subset(times,Swim+Cycle+Run<Total+5 | is.na(Swim+Cycle+Run))

Ideally we would expect Swim+Cycle+Run=Total, but given the precision of the recording, we have to allow for some discrepancy. The is.na(Swim+Cycle+Run) condition is there to keep the rows where at least one of the disciplines is missing: in those cases the sum is NA, so the first test evaluates to NA and the row would otherwise be dropped. None of the Total times is NA.
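To see why the extra condition is needed, here is a minimal illustration (with made-up numbers) of how NA propagates through arithmetic and comparisons, and of the fact that subset() keeps only rows where the condition is TRUE, not NA:

```r
x <- c(6, NA, 7)
x + 1            # NA propagates through arithmetic: 7 NA 8
x + 1 < 10       # and through comparisons: TRUE NA TRUE
d <- data.frame(x = x)
nrow(subset(d, x + 1 < 10))              # 2: the NA row is silently dropped
nrow(subset(d, x + 1 < 10 | is.na(x)))   # 3: the NA row is kept
```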

In fact, let’s only consider the runners with all times defined (most of them):

times<-subset(times,Swim+Cycle+Run<Total+5)

We can now have a first look at the average times for each age category, with a mixture of ddply and summarise.

digest<-ddply(
     times,
     "Category",
     summarise,
          median=median(Total),
          average=mean(Total), 
          headCount=length(Total)
     )
print(digest)
  Category    median   average headCount
1         A  93.66667  93.33519         9
2         C  79.02500  79.19583         4
3         D  78.31667  80.78333        12
4         E  81.81667  85.83667        25
5         F  84.88333  88.81955        52
6         G  83.27500  87.47064        88
7         H  85.78333  88.57263        81
8         I  84.81667  85.50847        61
9         J  83.90000  85.85541        37
10        K  88.11667  90.53333        11
11        L  99.15833 100.03889         6
12        M  98.01667  98.01667         2
13        N 100.01667 100.01667         1
14     <NA>  90.55833  90.55833         2

As usual, the median is more robust than the average, which is useful here given how uneven the groups are and the existence of outliers. A graphical representation shows that the total time is actually fairly flat across age categories.

ggplot(times,aes(x=Category,y=Total,group=1))+
     geom_jitter(position=position_jitter(width=0.05))+
     geom_smooth()+
     ylim(70,125)

Now let’s have a look at the distributions of times for each discipline. For this, we’re going to melt the data frame (using the reshape package) so as to have Swim, Cycle, Run and Total as levels of a new factor column called Discipline.

library("reshape")   # melt comes from the reshape package
meltedTimes<-melt(
  times,
  c("StartingPosition","Category","Position"),
  c("Swim","Cycle","Run","Total"),
  variable_name="Discipline")
head(meltedTimes)
  StartingPosition Category Position Discipline    value
1              441        F        1       Swim 6.066667
2                5        G        2       Swim 5.916667
3               26        D        3       Swim 6.466667
4              445        E        5       Swim 6.716667
5               32        F        6       Swim 6.416667
6                2        H        7       Swim 6.033333

We can now show the four distributions we’re interested in with a single command, thanks to the faceting option:

ggplot(meltedTimes,aes(x=value))+
    geom_density()+
    facet_wrap(~Discipline,scales="free")+
    xlab("Time (mn)")

scales="free" is useful here, because the times are on quite different scales across disciplines.

You can also show the same distributions split across age categories:

ggplot(meltedTimes,aes(x=value,fill=Category))+
     geom_density(alpha=0.3)+
     facet_wrap(~Discipline,scales="free")+
     xlab("Time (mn)")


But it’s a bit misleading given the small size of some categories: the kernel smoothing can exaggerate local density. Things can be improved a bit by limiting the plot to age categories of at least 10 athletes, but not much.
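The filtering code isn’t shown in the post; a possible sketch, reusing the headCount column from digest above:

```r
# keep only the age categories with at least 10 athletes
bigCategories <- digest$Category[digest$headCount >= 10]
bigMeltedTimes <- subset(meltedTimes, Category %in% bigCategories)
ggplot(bigMeltedTimes, aes(x = value, fill = Category)) +
     geom_density(alpha = 0.3) +
     facet_wrap(~Discipline, scales = "free") +
     xlab("Time (mn)")
```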

In an attempt at investigating whether some athletes are better in some disciplines than others, we can plot their trajectories:

ggplot(meltedTimes,aes(x=Discipline,y=value,group=StartingPosition))+
     geom_line(alpha=0.05,colour="#0000FF",size=1)+
     ylab("Time (mn)")


Unfortunately, the graph is too busy to reveal anything. A better way is to plot their ranking for each discipline and see in which discipline they rank best. Once again, plyr makes things very easy for us.

First we add a column ranking to meltedTimes. This will be the rank of each athlete within a discipline, i.e. either swim, cycle, run or total.

meltedTimes<-ddply(
     meltedTimes,
     .(Discipline),
     transform,
          ranking=rank(value,ties.method="random")
)
head(meltedTimes)
  StartingPosition Category Position Discipline    value ranking
1              441        F        1       Swim 6.066667       7
2                5        G        2       Swim 5.916667       4
3               26        D        3       Swim 6.466667      17
4              445        E        5       Swim 6.716667      25
5               32        F        6       Swim 6.416667      14
6                2        H        7       Swim 6.033333       6

I.e., athlete #441 ranked 7th in the swim.
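An aside on the ties.method="random" choice above: with manually recorded times, ties are common, and random tie-breaking guarantees that every athlete gets a distinct rank. A small sketch with made-up times:

```r
set.seed(1)                      # make the random tie-breaking reproducible
v <- c(6.1, 5.9, 6.1, 6.5)
rank(v, ties.method = "random")  # the two 6.1s get ranks 2 and 3, in random order
rank(v)                          # default "average": both 6.1s get rank 2.5
```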

Equipped with the intra-discipline ranking, we can summarise each athlete by his best discipline:

bestDiscipline<-ddply(
  meltedTimes,
  .(StartingPosition),
  summarise,
  Discipline=Discipline[order(ranking)[1]],
  Position=Position[order(ranking)[1]])
head(bestDiscipline)
 StartingPosition Discipline Position
1                1       Swim      204
2                2       Swim        7
3                4       Swim       12
4                5      Cycle        2
5                9        Run      147
6               10       Swim       40

I.e., athlete #5's strong point is the cycling: that is the discipline he ranked best in. Position is the final overall position, which I’ll need in a minute.

So let’s see what the athletes’ strong points are:

ggplot(bestDiscipline)+
     geom_bar(aes(x=Discipline,y=..count..))+
     xlab("Strong point")

But what about the best athletes? Say, those in the first quartile? Well, let’s see.

bestDiscipline$isInFirstQuartile<-bestDiscipline$Position<nrow(bestDiscipline)/4
ggplot(bestDiscipline)+
     geom_bar(aes(x=Discipline,y=..count..,fill=isInFirstQuartile))+
     xlab("Strong point")


Or in numbers:

summaryTable<-table(bestDiscipline[,c("Discipline","isInFirstQuartile")])
print(summaryTable)
          isInFirstQuartile
Discipline FALSE TRUE
     Swim    113   18
     Cycle    83   30
     Run      79   20
     Total     1   16

The over-representation of top athletes in Total makes sense: by definition, their rank for Total is already low, so they are less likely to rank even better in any individual discipline.

But top athletes do have a different distribution of strong points from the rest of the competitors:

# divide each column by its total to get percentages within each group
print(100*summaryTable/rep(colSums(summaryTable),1,each=4),digits=2)
          isInFirstQuartile
Discipline FALSE  TRUE
     Swim  40.94 21.43
     Cycle 30.07 35.71
     Run   28.62 23.81
     Total  0.36 19.05
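As a side note, a more idiomatic way to get the same column percentages is prop.table, which divides each cell by its column total when given margin = 2:

```r
# same column percentages as the rep() trick above; each column sums to 100
print(100 * prop.table(summaryTable, margin = 2), digits = 2)
```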

There are many other things one could do with this data (what’s the biggest catch-up? do you have to be in the top 25% of every discipline to finish in the overall top 25%? etc.) but I hope it gave you a flavour of what’s possible with plyr and ggplot2, two excellent packages which might require some getting used to but are definitely worth the effort.
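For instance, the biggest catch-up could be sketched as follows (assuming "catch-up" means places gained between the swim ranking and the final ranking; the column names with .swim/.total suffixes come from the merge call, not from the original data):

```r
# for each athlete, how many places did they gain between the swim and the finish?
swimRanks  <- subset(meltedTimes, Discipline == "Swim")
totalRanks <- subset(meltedTimes, Discipline == "Total")
catchUp <- merge(swimRanks, totalRanks, by = "StartingPosition",
                 suffixes = c(".swim", ".total"))
catchUp$gain <- catchUp$ranking.swim - catchUp$ranking.total
# the athlete with the biggest catch-up
catchUp[which.max(catchUp$gain),
        c("StartingPosition", "ranking.swim", "ranking.total", "gain")]
```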

A sequel to this post can be found here.


3 Responses to An exercise in plyr and ggplot2 using triathlon results

  1. Julyan says:

    Hi Christophe,
    this is a great post, all the more so as I love stats on triathlons :) ggplot2 is a perfect fit for the distributions by age. You can get databases directly in CSV for a lot of (French) races on http://www.ipitos.com/

  2. CL says:

    Thanks!

    I’ll have a look at that website; it could be interesting to see whether races are different or how it’s possible to aggregate them to get better statistical power.

  3. Pingback: Triathlon data with ggplot2 « Statisfaction
