Winning from losing

By following twitter’s #rstats hashtag (rss feed), I recently came across a very interesting R-related blog: datanalytics.com.

The first post I read from it was about setting up an on-line reading group to go through the excellent “The Elements of Statistical Learning“. It is going on and you can find related posts here. Something you should know, however: it’s in Spanish.

Thanks to that blog, I’ve learned at least one thing: the meaning of ‘L’ as in x<-10L, which I've seen in some code before but never investigated further. It turns out that it's used to cast variables as integers instead of doubles. You can see that in their code, which also has other unexpected examples:

Select All Code:
1
2
3
4
5
6
7
8
9
10
11
12
a <- 1:10
typeof( a )
object.size( a )
 
a[1] <- 10
typeof( a )
object.size( a )
 
a <- 1:10
a[1] <- 10L
typeof( a )
object.size( a )

The blog also presented a puzzle last week. The gist of it is that you play three different gambling games:

  • Game 1: at each step, you win 1 point with probability 0.49 and lose 1 point with probability 0.51.
  • Game 2: at each step, you look at how many points you have accumulated so far. If it's a multiple of 3, you win 1 point with probability 0.095 and lose 1 point with probability 0.905. Otherwise, you win 1 point with probability 0.745 and lose 1 point with probability 0.255.
  • Game 3: at each step, start with tossing a fair coin: head you play game 1, tail you play game 2.

If you simulate this (or write down the maths), you'll see that the expectation for the score after 1000 steps is negative for both game 1 and game 2; you will lose with both games. But mixing the two games randomly as in Game 3 produces a winning game!

Like datanalytics, I invite you to think about why that is.

Here is what the score distribution looks like for each game after 5000 iterations:

EDIT: Two probabilities were inverted (see comment by jkomi below). Text has been corrected now.

Posted in Uncategorized | Tagged , , , | 4 Comments

A couple of nerdy troll quotes

Troll quotes amuse me and I decided to do a couple of those on a whim. Not sure if they worked. In any case, here they are.

If you have other ideas, let me know!

Posted in Uncategorized | Tagged , , , | 3 Comments

The mysterious case of the misbehaving writeLines() (or: a cat saves the day)

Dear readers and R experts, I submit to you a mysterious R quirk which has been baffling me for the best part of a week. I found a work-around but I’d love it if someone could explain this strangest of behaviour.

I was using writeLines() to dump the results of a large number of classification results and noticed that the output file was not as expected: some records were missing, some repeated. It took me a while to figure out exactly what the problem was (the multicore library? network latency? an unclosed sink()? etc.) but once I finally stripped the script bare, it turned out that I could reproduce the bug with a combination of multicore and writeLines().

Here is how my script works: a function (accuracyWithSomeFeatures()) is applied a number of times (numberFeatures). For each application, I output numberFeatures rows, each with 3 columns: a identifier for this iteration, the value passed to the function (from 1 to numberFeatures) and the number generated by the function. This process is repeated numberExperiments times.
When it’s finished, I should therefore end up with a text file of numberFeatures x numberExperiments.
Well this doesn’t happen if the function is applied in parallel AND I use writeLines() to write to the file.

Only this combination reproduces the bug. If I use multicore (or parallel in 2.14) and cat() or don’t use multicore but writeLines() etc., the output file is of the right shape.

What’s even weirder is that when the bug happens, the number of lines of the output file is not constant; the repeat/mangling is not consistent. For example, with numberFeatures<-317; numberExperiments<-467, a wc -l gives 217175, 217417, 217357 lines after 3 consecutive runs. None of these numbers are multiples of 317 or 467 so something is going seriously wrong.

The effect cannot be due to parallel writing either: the output file is populated after the parallelised generation of the data.

I’m leaving the code here if you fancy solving a bit of R mystery. Please let me know if you find the reason for this apparent bug, I’m sure there’s something to learn here. (Or it’s staring at me in the face. I’m ready for that too).

Select All Code:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
 
if(as.numeric(version$major)*100+as.numeric(version$minor)<214) library(multicore) else library(parallel)
 
classifyInParallel<-TRUE
withCat<-FALSE
 
numberFeatures<-317
numberExperiments<-467
 
outputFile<-"~/debugWriteLines.csv"
unlink(outputFile)
if(!withCat) outputFile<-file(outputFile,'a')
 
replicate(numberExperiments,{
  experimentID<-paste(sample(letters,6,replace=TRUE),collapse='')
 
  accuracyWithSomeFeatures<-function(k) 100*runif(1)
 
    if(classifyInParallel){
      accuracy<-mclapply(1:numberFeatures,accuracyWithSomeFeatures)
      accuracy<-unlist(accuracy)
    } else{
      accuracy<-sapply(1:numberFeatures,accuracyWithSomeFeatures)
    }
 
  content<-paste(experimentID, 1:numberFeatures,accuracy,sep='\t')
 
  if(withCat)
    cat(content,
      sep='\n',
      file=outputFile,
      append=TRUE)  
  else
    writeLines(
      content,
      con=outputFile)  
 
  NULL
})
 
if (!withCat) close(outputFile)
 
cat("\nclassifyInParallel:",classifyInParallel)
cat("\nwithCat:",withCat)
cat("\nThere should be",numberFeatures*numberExperiments,"lines in total.")

My system:
platform x86_64-pc-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
status
major 2
minor 14.0
year 2011
month 10
day 31
svn rev 57496
language R
version.string R version 2.14.0 (2011-10-31)

(original image by Kaptain Kobold)

Update (17 Nov 2011):
I now have an even barer script. writeLines() gets it wrong only if the output file is opened and closed once, i.e. before and after the main loop. It works if the file is opened and closed at each iteration. Again, this is not an issue of writing to a file from multiple processors, since writeLines() is outside mclapply().

Select All Code:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
if(as.numeric(version$major)*100+as.numeric(version$minor)<214) library(multicore) else library(parallel)
 
numberFeatures<-17
numberExperiments<-13
 
outputFilename<-"~/debugWriteLines.csv"
 
 
cat("\nThere should be",numberFeatures*numberExperiments,"lines in total.")
 
# opening and closing at each save
cat("\n\tOpening and closing at each iteration.")
unlink(outputFilename)
 
for(experiment in 1:numberExperiments){
  outputFile<-file(outputFilename,'a')
  mclapply(1:numberFeatures,function(k) runif(1))
  content<-paste(experiment, 1:numberFeatures,sep='\t')
  writeLines(content,con=outputFile)
  close(outputFile)
  }
cat(" Number of lines:",length(readLines(outputFilename)))
 
# opening and closing once
cat("\n\tOpening and closing once.")
 
unlink(outputFilename)
outputFile<-file(outputFilename,'a')
 
for(experiment in 1:numberExperiments){
  mclapply(1:numberFeatures,function(k) runif(1)) # do something in parallel
  content<-paste(experiment, 1:numberFeatures,sep='\t')
  writeLines(content,con=outputFile)}
 
close(outputFile)
 
numberLines<-length(readLines(outputFilename))
cat(" Number of lines:",numberLines)
cat("\n\t\tNumber of lines / number of experiments:",numberLines/numberExperiments)
cat("\n\t\tNumber of lines / number of features:",numberLines/numberFeatures)
cat("\n\t\tNumber of lines / (number of experiments * number of features):",numberLines/(numberExperiments*numberFeatures))

produces:

There should be 221 lines in total.
Opening and closing at each save. Number of lines: 221
Opening and closing once. Number of lines: 2465
Number of lines / number of experiments: 189.6
Number of lines / number of features: 145
Number of lines / (number of experiments * number of features): 11.15

Posted in Uncategorized | Tagged , , , , | 2 Comments

Anarchy Golf! And that’s your Sunday gone.

I like to follow good practice when I program. I want my code to be readable, properly indented, modular and re-usable. And I want my variables to have descriptive names. There’s nothing that I hate moderately dislike more than arbitrary abbreviations and inconsistent style. I have to say that R is not the best example when it comes to style. Even base functions often have weird names, and their arguments are either camelCased, period.separated, abbrvtd, all willy-nilly with no consistency, as if saving a few keystrokes was so important. It’s like learning PHP again. But the weirdest thing I’ve come across yet is the possibility of using a partial name for an argument (e.g. co for collapse). I’m at lost to find a rationale for this; it seems designed to engineer impenetrable code.

Good and consistent style helps you code better. Long, descriptive names make your code more readable and tab-completion will save you those precious keystrokes. So go for it. That’s not to say it can’t a problem sometimes. For example, this week I was adding some bells and whistles to a function I’d written. One statement involved subsetting a data frame on a hard coded value, like subset(result,association==”firstKind”)
A new argument for my function was, you guessed it, association. You see where it’s heading to; the statement turned to:

Select All Code:
1
subset(result,association==association)

And of course it failed; the condition is always true, because both instances of association are interpreted to be referring the column name of the data frame. So all rows are selected, whatever the argument is.

How to get out of this? Well, one could change the argument or the column name but I was already using them all over the function and elsewhere, and didn’t fancy tinkering the code too much at that point. Besides, I was reluctant to rename them in the first place, for reasons that should be obvious now. So what I’ve done is to read up the documentation on scoping, which is what the problem is, and came up with this:

Select All Code:
1
2
e<-environment()
subset(result,association==get('association',e)

The association on the right hand side is now correctly interpreted as the function’s argument. It’s a bit clumsy, but I get to use my beloved descriptive variable names and don’t need to go off on a replacement frenzy and its associated new bugs.

If you don’t see what I mean, here is some code I left on a related stackOverflow thread:

Select All Code:
1
2
3
4
5
6
7
8
9
10
x<-data.frame(
  start=sample(3,20,replace=TRUE),
  someValue=runif(20))
 
e<-environment()
start<-3
cat("\nDefaut scope:")
print(subset(x,start==start)) # all entries, as start==start is evaluated to TRUE
cat("\nSpecific environment:")
print(subset(x,start==get('start',e)))  # second start is replaced by its value in former environment.

However, bad practice has its perks and can be a lot of fun! I recently came across this very addictive online game: anarchy golf. There are more than 500 programming tasks to choose from. Each of them is very easy to code, like printing out the Fibonacci sequence, or just ‘Hello World’ but that’s not where the challenge is.

As the name suggests, the real challenge is to do it in as few bytes as possible! And that’s where obscure and horribly nested code come in handy. Variables names have to be 1 letter max., if you use variables at all, that is.

My current records are:

R-bloggers and readers, I challenge you to beat that!

Careful with the possible invisible line breaks at the end of your file. This bit of perl will get rid of it if your editor insists on adding it: perl -pe ‘chomp if eof’ . And no cheating! Your code must be pure R, so no using system() please.

It’s a great and terribly addictive game, and teaches you some of the weirdest and more obscure R commands and shortcuts. And partial matching suddenly become useful and even recommended.

(Original image: Paradise Valley Golf Course, Fairfield, CA, by David Bastyr, license: CC BY-NC-SA 2.0)

Posted in Uncategorized | Tagged , , , | 8 Comments

Customised terminals on Ubuntu

In order to quickly differentiate between the different machines I’m logged in, I use to set the corresponding terminals in different colours. I decided to go one step further and set each machine with its own wallpaper. Two of the machines I work with are named ‘Crick’ and ‘Mendel’ (it’s a genetic epidemiology lab). I’ve downloaded one picture for each and made the following two wallpapers in Gimp:

To add a profile, open a terminal then Edit/Profiles/New. Give it a name and set the wallpaper in Background. You also want to change the colours, and set ‘Black on White’ for the Mendel Profile.

The result:

I also set up two shortcuts that open a terminal and an ssh connection. Go to ‘keyboard’ and add a shortcut with the following command:
/usr/bin/gnome-terminal -e ‘ssh USERNAME@crick’ –window-with-profile=Crick.

(if you like biology-themed customisation, this post might be of interest.)

Posted in Uncategorized | Tagged , , , , | Comments Off on Customised terminals on Ubuntu

plyr, ggplot2 and triathlon results, part II

I ended my previous post by mentioning how one could imagine other ways of looking at the triathlon data with plyr and ggplot2. I couldn’t help but carry on playing with it so here are more stats and graphs from the same dataset: the results of a local sprint triathlon. This post will have a slightly more statistical bent to it.
Continue reading

Posted in Uncategorized | Tagged , , , , | 2 Comments

An exercise in plyr and ggplot2 using triathlon results

I ran my last triathlon for this year a couple of weeks ago, in the beautiful town of Stratford-upon-Avon. The results were online the day after so I decided to have a look at my fellow competitors’ times, which gave me an opportunity to flex my plyr and ggplot2 muscles.

The data itself was in pdf, so it was a bit of a pain to extract in usable format. After some copying and pasting in a spreadsheet, I finally got it in csv, which was easy to parse with R. Given the amount of effort required, you’ll forgive me to have only extracted the race I was in, that is, the sprint male. It consists of a 400m swim, followed by a 23km ride and ends with a 5km run.

So first, let’s read the data:

Select All Code:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
times<-read.table(
     "stratford.csv",
     header=TRUE,
     sep="\t",
     stringsAsFactors=FALSE,
     na.strings="")
head(times)
  Position StartingPosition StartingTime Age Category     Swim    Cycle      Run    Total
1        1              441     08:44:45  32        F 00:06:04 00:36:46 00:19:11 01:02:01
2        2                5     08:46:00  35        G 00:05:55 00:37:23 00:20:18 01:03:36
3        3               26     08:56:00  23        D 00:06:28 00:37:39 00:19:30 01:03:37
4        4              443     10:35:30  31        F     <NA>     <NA> 00:20:51 01:04:09
5        5              445     10:36:00  27        E 00:06:43 00:37:26 00:21:36 01:05:45
6        6               32     08:52:45  32        F 00:06:25 00:39:27 00:21:08 01:07:00

A technical point about the data: the times were manually reported and some are missing. On top of that, the transition times (e.g. between the end of the swim and the start of the cycling) were not recorded but were added to one of the disciplines. The starting position can be seen as an ID.

Next are a few lines that transform the times from character strings to numeric (in minutes). First we define a column-wise function which does exactly that:

Select All Code:
1
2
3
4
5
6
7
8
9
10
11
library("ggplot2")
library("stringr")
stringToMinutes<-colwise(
  function(s) 
     unlist(
          lapply(
               str_split(s,":"),function(s)
                    sum(as.numeric(s)*c(60,1,1/60))
               )
          ),
  .(Swim,Cycle,Run,Total))

Continue reading

Posted in Uncategorized | Tagged , , , , , | 3 Comments

A first go at ‘manipulate’ in RStudio

Something I’m missing from R (especially coming from Mathematica) is the ability to quickly build interactive graphs, which I find very useful for getting a good intuition of the impact of parameters on a mathematical function.
Richie Cotton’s post about interactive plots in R gave me an incentive to have a go at the manipulate package in RStudio.

I adapted Richie’s example (go to his page to download his data and example) to manipulate and I have to say I have been impressed by how easy and fast it is to put something together.
Here is the whole script that replicates the example:

Select All Code:
# Adapting Richie Cotton's gWidgettcltk example to RStudio's manipulate
# updated 10/11/2012 to reflect changes in ggplot2
# C. Ladroue
 
library("ggplot2")
library("manipulate")
 
chromium <- read.csv("chromium.csv")
nickel <- read.csv("nickel.csv")
 
manipulate({
  p<- ggplot(data, aes(air, bm)) + geom_point()+scale_x_continuous(trans=xScale)+scale_y_continuous(trans=yScale)
  if(facet!="None") p<-p+facet_grid(facet)
  p
},
           yScale=picker("Linear"="identity","Log"="log10",label="Y Scale Transformation"),
           xScale=picker("Linear"="identity","Log"="log10",label="X Scale Transformation"),
           facet=picker( "None" = "None", "RPE" = ". ~ rpe","Welding type" = ". ~ welding.type","RPE and Welding type" = "rpe ~ welding.type",initial="None",label="Faceting"),
           data=picker("Chromium"=chromium,"Nickel"=nickel,label="Datasets")
)

And the result is as expected: a graph with added controls. (click to view the whole image)

As you can see from the code, this is extremely readable. The only control type I’ve had to use is picker, which takes a list of labels, followed by an optional value. It’s this value that will be passed on to the assigned variable. The label appearing on the form can also be changed with label=. Other available controls are slider and checkbox. There is no text control at the moment, so I couldn’t replicate the ‘title’ box from Richie’s example.

I’ve used the TclTk package before and while it worked, it was a bit too cumbersome to be used routinely. By contrast, manipulate allows for a very easy and fast setup, with little to no overhead. This comes at the expense of being restricted to RStudio, but since this is my IDE of choice, I’m fine with that.

This is only a quick and dirty attempt at manipulate and there are things I’d like to change; in particular, I’m fairly sure that the line

Select All Code:
data=picker("Chromium"=chromium,"Nickel"=nickel,label="Datasets")

is an unnecessary memory hog, due to the passing of the 2 data frames to picker. A better choice would be to only pass a string and check its value in the expression that generates the plot.

Posted in Uncategorized | Tagged , , , , | 4 Comments

Match Point

The solution of the latest Futility Closet puzzle left me a bit unsatisfied. I understand that the blog usually doesn’t use formulae but in this case, I found the explanation given uncharacteristically hand-wavy. The problem itself is not difficult and I’ll show my solution after the fold.

First things first, here’s the problem:

Henrietta wants a yacht. Her parents think she’s too young. Like all rich people, they settle disagreements by playing competitive lawn tennis.

Henrietta must play three singles matches against her parents. If she wins two matches in a row she gets the yacht. Her mother is a better player than her father. Should she play mother-father-mother or father-mother-father?

Continue reading

Posted in Uncategorized | Tagged , , | Comments Off on Match Point

Darwin’s diagram as a wallpaper

I’ve always been fascinated by this diagram in one of Darwin’s notebook. Everything is there: the tree of life, with its ever-expanding branches from one single origin and the personal humble note from the scientist: ‘I think’. I also like the fact that the drawing conveys the idea much faster than words and one can imagine Darwin starting writing ‘I think’ then switching from words to the sketch.

The diagram popped up again recently as I was watching David Attenborough’s “The Tree of Life” and I decided to make it my wallpaper. I kept the same colour scheme as the default Ubuntu Natty Narwhal and made 3 wallpapers with Gimp:

for my laptop (1366×768), my desktop at work (1680×1050) and my phone (940×980)

(As an aside and as Gould pointed out in “A Wonderful Life”, a tree is a good analogy for the evolution of animals but not so much, ironically, for that of plants, for which evolutionary branches can meet again at later stage.)

Posted in Uncategorized | Tagged , , , , , | 2 Comments