Another take on building a multi-lingual shiny app

I was reading this interesting post about how to build a multi-lingual Shiny app. I'm also building a multi-lingual Shiny app and came up with a slightly different take on it.

First, I don't use a function for finding the translation, but a 2D list. This way I can get to the translation directly with a simple list lookup.

translation <- list(
  "youhaveselected" = list("en" = "You have selected:", "fr"="Vous avez sélectionné:"),
  "greetings" = list("en" = "Hello", "fr"="Bonjour")
  )
# then:
translation[['greetings']][['en']] # Hello
translation[['greetings']][['fr']]  # Bonjour

Second, I don’t use observe, as I didn’t find it necessary. I simply have a radio button for switching between languages, and a function tr() to translate a phrase or a list of phrases. Like in the original post, the UI is built from server.R using renderUI().

  tr <- function(text){ # translates text into current language
    sapply(text,function(s) translation[[s]][[input$language]], USE.NAMES=FALSE)
  }
 
  # UI
  output$uiObs <- renderUI({
    sliderInput("obs", tr("numberOfObservations"),  
                  min = 1, max = 100, value = 50)
  })
 
  output$uiWeekdays <- renderUI({
    # Using a named list in order to pass the same value regardless of the label (which changes with the language)
    daysValue <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")
    days <- structure(daysValue, .Names = tr(daysValue))
 
    selectInput(inputId   = "weekdays",
                label     = tr("Selection:"),
                choices   = days,
                multiple  = TRUE)
  })
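
For reference, here is a minimal ui.R sketch of the language switch (my own illustration; the exact widget in the repository may differ). The input id "language" is what tr() reads:

# ui.R (sketch): the selected value ("en" or "fr") is the key tr() uses in the 2D list
radioButtons("language", "",
             choices  = c("English" = "en", "Français" = "fr"),
             selected = "en")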

To make things easier for the translators, the dictionary is stored as a csv file (tab-separated, with a key column and one column per language), which is easy to edit. A small R script turns the csv into the expected 2D list and saves it in a binary file, to avoid re-processing the file every time the user switches language.

# update the processed translation file translation.bin
# run this every time dictionary.csv is updated 
# it reads the look-up table in dictionary.csv and turns it into a 2D list
 
library(plyr)
translationContent <- read.delim("dictionary.csv", header = TRUE, sep = "\t", as.is = TRUE) 
translation <- dlply(translationContent ,.(key), function(s) key = as.list(s))
 
save(translation, file = "translation.bin")
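
On the app side, server.R then just loads the pre-processed dictionary once at start-up (a one-line sketch, assuming translation.bin sits next to server.R):

load("translation.bin") # restores the 'translation' 2D list used by tr()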

You can consult the whole code on the github repository and run it directly from R using:

shiny::runGitHub("multilingualShinyApp","chrislad")

How to pick 3 numbers from a uniform distribution in a transparent manner?

Over in my previous post, I’m giving away 3 copies of my video course on ggplot2 and shiny. To win a copy, you just need to leave a comment and I will select 3 winners among the n participants at random after a deadline.

But how do I pick 3 winners such that:

  • all players are equally likely to win.
  • no-one can contest the fairness of the selection.

The first thing that comes to mind is to run sample(n,3, replace = FALSE) and report the winners. But how do you know I actually ran this code? I could have decided on the winners well in advance and just pretended to run the code.

A way to approach this issue could be to set the random seed to some value so that anyone suspecting foul play can run the code themselves and get the same answer as me:

set.seed(someSeed)
sample(n, 3, replace=FALSE)

I see at least two problems with it: 1) I could still have selected a seed that gives me the sample I eventually want, and 2) even using a function (e.g. of n, the number of participants) as the seed doesn't guarantee a uniform distribution for each player.
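
To make problem 1 concrete, here is a small sketch (my own illustration, with made-up numbers) of how easily one could search for a seed until sample() returns a pre-chosen set of winners:

n <- 50               # hypothetical number of participants
wanted <- c(4, 8, 15) # the winners I secretly decided on
seed <- 0
repeat {
  seed <- seed + 1
  set.seed(seed)
  if (setequal(sample(n, 3, replace = FALSE), wanted)) break
}
seed # an innocent-looking seed that produces my pre-chosen winners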

I came up with a plan which I think addresses both the uniform distribution over the players, and the incontestability of the selection.

First, I simplify the problem of selecting 3 winners among n participants to selecting 1 integer from a uniform distribution. This is easy: instead of choosing 3 items among n, I’m selecting 1 of the choose(n,3) possible combinations. Once I’ve sampled 1 number i, I simply use combn(n,3) to generate all the combinations and pick the ith item:
combn(n,3, simplify=FALSE)[[i]].
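
As a tiny illustration with made-up numbers: for n = 5 and i = 7, the winning triple is the 7th combination in lexicographic order.

n <- 5
i <- 7
combn(n, 3, simplify = FALSE)[[i]] # 2 3 4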

Second, I have a way to pick a number from a uniform distribution that's completely transparent. To do this, I pick a number uniformly at random in a much bigger set (1:N, with N >> choose(n,3)) and project this number back to the interval I'm interested in (1:choose(n,3)). That is, once I have a number j between 1 and N, I use

i <- ceiling(choose(n,3) * j /N)

to find a random value uniformly distributed from 1 to choose(n,3).

Ideally you’d want N to be a multiple of choose(n,3) for every outcome to be exactly equally likely, but if N is much larger than choose(n,3), the slight difference in probability for each outcome is negligible (of the order of 1/N).
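
As a quick sanity check (my own sketch, with a made-up n): the counts below give how many of the N lottery outcomes project onto each of the choose(n,3) winning combinations, and the induced probabilities indeed differ by at most 1/N.

n <- 10            # hypothetical number of participants
k <- choose(n, 3)  # 120 possible winning triples
N <- choose(49, 7) # size of the big set
counts <- sapply(1:k, function(i) floor(i * N / k) - floor((i - 1) * N / k))
range(counts)                     # either floor(N/k) or ceiling(N/k)
max(counts / N) - min(counts / N) # at most 1/N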

Now, how do I pick a number in a bigger set in a transparent manner? We saw that using the seed is fraught with difficulty, so I need something else. I'm going to use something which neither I nor the players have any control over: the UK national lottery, which is a combination of 7 integers from the set {1,…,49}. More precisely, I'm doing this:

  • declare in advance which future lottery draw I’m using.
  • use the lexicographic index of the combination drawn as my number j; j comes from a uniform distribution between 1 and N=choose(49,7), which is a relatively big number.

And this is it: this way, I pick 3 winners among n and there is no way for me (or the players) to rig the selection. Here is the code I'll run:

lotteryResult <- c( , , , , , , ) # to be filled in by the actual lottery draw
nPlayers <- 200 # to be updated with the number of participants
nWinners <- 3
index <- ceiling(choose(nPlayers, nWinners) * lexicographicIndex(lotteryResult, 49) / choose(49,7))
winners <- combn(nPlayers,nWinners,simplify = FALSE)[[index]]
cat("\n The winners are:", winners)

The deadline for the competition is Wednesday 09th July at midnight UK time. The next lottery (UK lotto) draw after that is on Saturday 12th July, and I’ll use that draw to decide on the winners, using the code I’ve presented here.

What do you think? Can you poke holes in my strategy? Can you come up with something simpler?

Note about the lexicographic index
It is not terribly difficult to find the index of a combination without generating them all. All you need to do is to count the number of combinations that appeared before. For example, if the combination starts with 3, you know it comes after all the combinations that start with 1 and 2. Here is the code I wrote to go from a combination to its lexicographic index. There’s also a test function after it.

# gives the index of the combination if sorted in lexicographic order, starting at 1
# lexicographicIndex(c(2,3,4), 5) = 7
# because [2,3,4] is the 7th item in the lexicographically ordered combinations of length 3 using 5 letters: 
# 1 2 3
# 1 2 4
# 1 2 5
# 1 3 4
# 1 3 5
# 1 4 5
# 2 3 4
# 2 3 5
# 2 4 5
# 3 4 5
# C. Ladroue
# combination is a sequence of unique integers between 1 and alphabetSize
 
lexicographicIndex <- function(combination, alphabetSize){
  combination <- sort(combination)
 
  combinationLength <- length(combination)
 
  index <- 1 
  for(p in 1:combinationLength){
 
    starting  <- ifelse(p == 1, 1 , combination[p-1] + 1)
    finishing <- combination[p] - 1
 
    if(starting <= finishing)
      index <- index + sum(sapply(starting:finishing, function(j) choose(alphabetSize - j, combinationLength - p)))
  }
    index 
}
 
 
lexicographicIndexTest <- function(){
  alphabetSize <- 10
  combinationLength <- 3
  x <- combn(alphabetSize, combinationLength, simplify = FALSE)
  cat("\n test all combinations with alphabet size = ",alphabetSize,"and combination length = ",combinationLength,": ",
      all(sapply(1:length(x), function(index) lexicographicIndex(x[[index]], alphabetSize) == index ))
      )
  cat("\n")
}

Win a free copy of a new video course on ggplot2 and Shiny!

Noticed all these posts on r-bloggers about ggplot2 and shiny? Do you want in? My course "Building Interactive Graphs with ggplot2 and Shiny" (published by Packt Publishing) covers those 2 packages in a series of 40 videos, each one dedicated to one concept at a time. After this course, you'll be able to make fancy customised plots and publish them on the internet on some cool interactive webpages to be shared with the world!

See the content of the course here, and my previous post about it for an introduction and some reviews.

Now, I am pleased to announce that I have teamed up with Packt Publishing to organise a giveaway! Three (3) of you beautiful readers are going to get this course for free!

Here’s how it works:

  • Simply leave a comment below (e.g. “I want it!”, “You rock!” or whatever takes your fancy) together with your email address*.
  • After the deadline (Wed. 09th July at midnight, UK time), I’ll pick 3 winners at random.

Winners will be contacted by email, so be sure to use your real email address when you comment!

(By the way here’s a question for you to ponder over: how do you pick three numbers between 1 and n at random, transparently so that no-one can contest the fairness of the selection?)

Good luck!

* Use the field “email” in the form; the address won’t show in the comment.

EDIT: Don’t worry if I don’t approve your comment straight away. I’ll get to it as soon as I can. Thanks.

UPDATE: The comments are now closed. Thanks for the great response! I’ll announce the 3 winners early next week, after selecting them with the strategy explained here.

SECOND UPDATE: We have the winners!

I have followed the protocol I designed here and selected 3 lucky winners. I got the list of unique participants sorted by their posting date directly from the blog’s database and ran the following code:

players <- read.csv2("wp_comments.csv", sep=",", header=FALSE, col.names=c("ID","PlayerName"))
lotteryResult <- c(3,19,23,28,37,43,34) # UK national lottery results on the 12th July 2014 https://www.national-lottery.co.uk/player/lotto/results/results.ftl
nPlayers <- nrow(players)
nWinners <- 3
 
allCombinations <- combn(nPlayers,nWinners,simplify = FALSE) 
index <- ceiling(choose(nPlayers, nWinners) * lexicographicIndex(lotteryResult, 49) / choose(49,7))
winners <- allCombinations[[index]]
cat("\n The winners are:", winners)
cat("\n that is:")
print(players[winners,])

So the winners are:

Eduardo, David and Jeff: Packt will contact you soon about this.

Thanks again everyone for playing, it was great to see such a strong response to the competition!


A two-hour online course on ggplot2 and Shiny


I’ve just published a video course with Packt Publishing about ggplot2 and Shiny!
In just two hours, you’ll get to learn the popular R packages ggplot2 and Shiny, as well as how to put them together to build interactive webpages. And all that from R.

The course consists of short videos (around 2 or 3 minutes) that explain one concept at a time. Each video comes with the relevant code, and pointers to go further in your own time.

It's divided into 8 chapters:

  1. Getting Started with ggplot2 [15:00 minutes]
  2. Understanding Basic Plots [11:33 minutes]
  3. Using Conditional Plots [09:32 minutes]
  4. Using Statistics in Our Plot [09:49 minutes]
  5. Customizing Your Graphs [11:18 minutes]
  6. Shiny – Part 1 [14:39 minutes]
  7. Shiny – Part 2 [12:25 minutes]
  8. Putting Everything Together [12:17 minutes]

In the last chapter, we build a multi-page dashboard, with adaptive controls, showing some ggplot2 graphs produced on the fly.

Here is a sample video “Customizing the Color Palette for Continuous Variables” with ggplot2:

If you’ve been meaning to learn ggplot2 (for making nice looking graphs) or Shiny (for building interactive websites using R alone), but never got round to it, this is the course for you!

You can buy it from here. Download the scripts by clicking ‘Support‘ on this page.

PS: Arthur Zubarev, from compudicted, wrote a review of the whole course here: https://compudicted.wordpress.com/2014/05/11/building-interactive-graphs-with-ggplot2-and-shiny-r-by-christophe-ladroue-packt-publishing-video-review/.

PPS: More reviews:

  1. By Mark van der Loo: Review of “Building interactive graphs with ggplot2 and shiny”
  2. By MilanoR: Building Interactive Graphs with ggplot2 and Shiny

A very quick introduction to ggplot2

I gave a very brief 10-minute introduction to ggplot2 at the Birmingham R user group meeting on Monday. The aim was to give a head start to R users who've heard of ggplot2 but never got around to trying it.

I made the talk with deck.js, with a couple of css customisations. You can browse the slides here. Use the right and left arrows to navigate and 'm' for an overview. Most code snippets are clickable and will show the resulting plot. Click on the plot to make it disappear.

Download all the files in one go with this zip file if you’re interested. Unzip and open ggplot2Intro/introductionGGplot2/index.html with a modern browser.


An exercise in R using local open data

Last week I went to the “Government Open Data Hack Day” ([@,#]godhd on twitter) in Birmingham (UK), organised by Gavin Broughton and James Catell. The idea was to get hold of local open data and try and make use of them in just one day. You can see some of the work done on that day presented here. It was good fun and I’ve learned a few tricks and resources in the process, which I’m going to go through in this post. I’ll refrain from any data analysis because I know next to nothing about this type of data. Rather, I’m going to explain how to go from the raw format (in this case an Excel sheet) to something useful and exploitable.

(all files here)

The data I was given come from nomis and consist of job vacancies in the West Midlands for the years 2011 and 2012, broken down by job type. The spreadsheet lists 353 job types for 59 constituencies, one after the other:

From the spreadsheet to an R data frame
The first thing to do is to turn this into an R data frame for easier manipulation and reshaping. Luckily, each dataset follows the exact same pattern, so it's easy to extract the name of each constituency and each job type into two files with a simple grep, and to combine both files into a data frame in R. The lines starting with "area name" contain the name of the constituency; those starting with 4 digits contain the job type and the numbers we want. (nomis_jobs_wm_2011_2012.csv is the tab-separated version of the spreadsheet)

In a terminal:

grep "^area name" nomis_jobs_wm_2011_2012.csv > areaname.csv
grep "^[0-9]\{4\}" nomis_jobs_wm_2011_2012.csv > jobtypes.csv

In R:

# reading data
areas<-read.table("areaname.csv",header=FALSE,sep='\t')
jobs<-read.table("jobtypes.csv",header=FALSE,sep='\t')
 
# combine them
areas<-areas$V2
jobs$region<-rep(areas,each=nrow(jobs)/length(areas))
 
names(jobs)<-c("JobType","Vacancies2011","Vacancies2012","change","changePercent","region")
jobs<-subset(jobs,select=c("JobType","Vacancies2011","Vacancies2012","region"))
 
# A subtlety here: Excel formatted the number with a comma e.g. 1,234 for 1234. 
# So the comma has to be removed for the cast to work properly
jobs$Vacancies2011<-as.numeric(gsub(",","",jobs$Vacancies2011))
jobs$Vacancies2012<-as.numeric(gsub(",","",jobs$Vacancies2012))
 head(jobs)
                                                       JobType Vacancies2011 Vacancies2012              region
1               1111 : Senior officials in national government             0             0 Aldridge-Brownhills
2 1112 : Directors and chief executives of major organisations             0             0 Aldridge-Brownhills
3                  1113 : Senior officials in local government             0             0 Aldridge-Brownhills
4    1114 : Senior officials of special interest organisations             0             0 Aldridge-Brownhills
5            1121 : Production, works and maintenance managers            10             4 Aldridge-Brownhills
6                              1122 : Managers in construction            47             9 Aldridge-Brownhills

Now that we have the data in one single data frame, it’s much easier to do something with it.

Aggregating the job types
There are 353 job types in total, which is too fine a granularity for us. It turns out that the 4-digit numbers that were so useful for parsing the data come from the SOC (Standard Occupational Classification) code and follow a hierarchical pattern, with 1xxx meaning "Managers and Senior Officials", 2xxx "Professional Occupations" etc. Somewhere on the internet (I can't remember where) I tracked down an exploitable list (as in, not a b. pdf!) of those SOC numbers, which I promptly turned into a tab-separated file.

soc<-read.table("soc.csv",sep='\t',header=TRUE,stringsAsFactors=FALSE)
> head(soc)
  Major.Group Sub.Major.Group Minor.Group Unit...Group                              Group.Title
1           1              NA          NA           NA MANAGERS, DIRECTORS AND SENIOR OFFICIALS
2          NA              11          NA           NA         CORPORATE MANAGERS AND DIRECTORS
3          NA              NA         111           NA    Chief Executives and Senior Officials
4          NA              NA          NA         1115    Chief executives and senior officials
5          NA              NA          NA         1116     Elected officers and representatives
6          NA              NA         112           NA        Production Managers and Directors

Now that we have the description of each level in the SOC code, we can aggregate the 353 jobs into, for example, the 9 job types of level 1 (‘Major Group’).

# select the job types in the major group
level1<-subset(soc,!is.na(Major.Group),select=c("Major.Group","Group.Title"))
# build a look-up table to go from a digit to a job type
lookup<-1:nrow(level1)
lookup[level1$Major.Group]<-level1$Group.Title
lookup<-factor(lookup,levels=lookup)
> lookup
[1] MANAGERS, DIRECTORS AND SENIOR OFFICIALS         PROFESSIONAL OCCUPATIONS                        
[3] ASSOCIATE PROFESSIONAL AND TECHNICAL OCCUPATIONS ADMINISTRATIVE AND SECRETARIAL OCCUPATIONS      
[5] SKILLED TRADES OCCUPATIONS                       CARING, LEISURE AND OTHER SERVICE OCCUPATIONS   
[7] SALES AND CUSTOMER SERVICE OCCUPATIONS           PROCESS, PLANT AND MACHINE OPERATIVES           
[9] ELEMENTARY OCCUPATIONS                          
9 Levels: MANAGERS, DIRECTORS AND SENIOR OFFICIALS PROFESSIONAL OCCUPATIONS ... ELEMENTARY OCCUPATIONS
# add a column 'level1' to jobs which contains one of the 9 possible job titles
jobs$level1<-lookup[
  as.numeric(sapply(jobs$JobType,function(s) substr(s,1,1),USE.NAMES=FALSE))]
# Build a new data frame byLevel1, the aggregated data
library(plyr) # for ddply
byLevel1<-ddply(jobs,.(region,level1),summarise,Vacancies2011=sum(Vacancies2011),Vacancies2012=sum(Vacancies2012))
> head(byLevel1)
               region                                           level1 Vacancies2011 Vacancies2012
1 Aldridge-Brownhills         MANAGERS, DIRECTORS AND SENIOR OFFICIALS           173           134
2 Aldridge-Brownhills                         PROFESSIONAL OCCUPATIONS            97           100
3 Aldridge-Brownhills ASSOCIATE PROFESSIONAL AND TECHNICAL OCCUPATIONS           548           190
4 Aldridge-Brownhills       ADMINISTRATIVE AND SECRETARIAL OCCUPATIONS           288           202
5 Aldridge-Brownhills                       SKILLED TRADES OCCUPATIONS           693          1470
6 Aldridge-Brownhills    CARING, LEISURE AND OTHER SERVICE OCCUPATIONS           477           566

We now have a smaller data frame, with 59×9=531 (constituencies x job types) rows. An obvious graph to do is looking at the distribution of vacancies in each constituency:

library(ggplot2)
# reverse the order of the constituency levels, so they read alphabetically from top to bottom after coord_flip()
byLevel1$region<-factor(byLevel1$region,levels=rev(levels(byLevel1$region)))
p<-ggplot(byLevel1)
p<-p+geom_bar(aes(x=region,Vacancies2011,fill=level1),position='fill')
p<-p+coord_flip()+scale_fill_brewer(type='qual',palette='Set1')
print(p)

This representation shows the relative proportion of job types within each constituency. It would be misleading to try and compare the number of vacancies in one constituency with another, for example, since they might not represent the same population etc. I don't have that data, so I can't normalise in a sensible manner.

Maps!
Since we’re dealing with regional data, wouldn’t it be cool to plot that on a map? geom_map from ggplot2 can help with that, but we first need to find the boundaries of all the 59 constituencies to get started. My office mate helpfully pointed me to mapit, a great service from mysociety.org. If you know the id of an area, mapit can give you its boundaries in a JSON object, which you can easily turn into a data frame with the package rjson. Here’s how I did it:

library(rjson)
library(plyr)
 
# list of all the areas we need
areas<-read.table("areaname.csv",header=FALSE,sep='\t')
areas<-areas$V2
areas<-as.character(areas)
 
# All UK Parliament Constituencies
WMC<-fromJSON(file='http://mapit.mysociety.org/areas/WMC')
# Extract name and id
constituencies<-sapply(names(WMC),function(id) WMC[[c(id,'name')]])
# Select only those we need
constituencies<-constituencies[which(constituencies %in% areas)]
# id and name
areas<-data.frame(group=names(constituencies),region=constituencies)
 
# boundaries to all West Midlands constituencies
WestMidlands<-ddply(areas,.(group,region),.fun=function(row){
  x<-fromJSON(file=paste('http://mapit.mysociety.org/area/',row$group,'.geojson',sep=''))
  x<-unlist(x$coordinates)
  n<-length(x)/2
  data.frame(long=x[2*(1:n)-1],lat=x[2*(1:n)])  
})

(ignore the warnings; they're all due to a missing end-of-line.)

> head(WestMidlands)
  group                region      long      lat
1 65563 Shrewsbury and Atcham -3.044652 52.66554
2 65563 Shrewsbury and Atcham -3.044531 52.66568
3 65563 Shrewsbury and Atcham -3.044481 52.66573
4 65563 Shrewsbury and Atcham -3.044355 52.66585
5 65563 Shrewsbury and Atcham -3.044110 52.66605
6 65563 Shrewsbury and Atcham -3.043950 52.66621

Let's see what the relative change in vacancies looks like for each constituency, per job title:

 
# Compute relative change
byLevel1$percentChange<-100*(byLevel1$Vacancies2012-byLevel1$Vacancies2011)/byLevel1$Vacancies2011
 
# Connect map and data
p<-ggplot(byLevel1, aes(map_id = region)) 
p<-p+geom_map(aes(fill = percentChange), map = WestMidlands)
p<-p+expand_limits(x = WestMidlands$long, y = WestMidlands$lat)
# Colour scale from red to blue, cropped at -100 and 100. "Lab" is nicer on the eyes than RGB.
p<-p+scale_fill_gradient2(limits=c(-100,100),name="% change",space="Lab")
# mercator
p<-p+coord_map()
# one plot per job level
p<-p+facet_wrap(~level1)
# remove grid etc.
p<-p+opts(
  axis.title.x=theme_blank(),
  axis.title.y=theme_blank(),
  panel.grid.major=theme_blank(),
  panel.grid.minor=theme_blank(),
  axis.text.x=theme_blank(),
  axis.text.y=theme_blank(),
  axis.ticks=theme_blank()
  )
print(p) # might take some time!

which is rather nice for a first go. Some areas appear gray because they're off the scale; their relative change is over 100%. This hard limit is completely arbitrary, but setting it to 220 (and getting rid of the gray areas) results in very low contrast for the rest of the plot. One could fix that with a capping colour scale, for example.
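
For instance, a crude capping scale can be had by clamping the values before plotting (my own sketch, not in the original code), so the handful of extreme constituencies show the strongest colour instead of falling off the scale:

# cap the relative change at 100% and keep the same limits = c(-100, 100)
byLevel1$percentChangeCapped <- pmin(byLevel1$percentChange, 100)
# then map fill = percentChangeCapped in the geom_map() call above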

> summary(byLevel1$percentChange)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-74.580 -18.200   2.732   7.292  26.780 218.600 
> subset(byLevel1,percentChange>100)
                      region                                           level1 Vacancies2011 Vacancies2012 percentChange
5                Wyre Forest                       SKILLED TRADES OCCUPATIONS           693          1470      112.1212
38  Wolverhampton North East                         PROFESSIONAL OCCUPATIONS            63           173      174.6032
83                    Warley                         PROFESSIONAL OCCUPATIONS            91           209      129.6703
114               The Wrekin    CARING, LEISURE AND OTHER SERVICE OCCUPATIONS           872          2236      156.4220
129                 Tamworth ASSOCIATE PROFESSIONAL AND TECHNICAL OCCUPATIONS           370           743      100.8108
135                 Tamworth                           ELEMENTARY OCCUPATIONS           634          1331      109.9369
190   Stoke-on-Trent Central         MANAGERS, DIRECTORS AND SENIOR OFFICIALS           188           599      218.6170
195   Stoke-on-Trent Central    CARING, LEISURE AND OTHER SERVICE OCCUPATIONS           667          1427      113.9430
204  Staffordshire Moorlands    CARING, LEISURE AND OTHER SERVICE OCCUPATIONS           363           730      101.1019
312       Mid Worcestershire    CARING, LEISURE AND OTHER SERVICE OCCUPATIONS           734          1869      154.6322
315       Mid Worcestershire                           ELEMENTARY OCCUPATIONS          1245          2744      120.4016
380             Dudley North                         PROFESSIONAL OCCUPATIONS            97           246      153.6082
384             Dudley North    CARING, LEISURE AND OTHER SERVICE OCCUPATIONS          1465          3117      112.7645
389           Coventry South                         PROFESSIONAL OCCUPATIONS            90           193      114.4444
521    Birmingham, Edgbaston            PROCESS, PLANT AND MACHINE OPERATIVES          1245          2548      104.6586

We can spot a 200% increase in managerial positions in Stoke-on-Trent from 2011 to 2012! I’ll leave it to the professionals to explain those numbers.

I'm stopping here, but there's obviously quite a lot you can do with this data; it all depends on what question you want to ask. Again, this data is available for free, which is rather nice, and crossed with other datasets (like the geographical data here) it lets you do quite a lot, after some preprocessing work, in a few lines of code. See for example what Andy Pryke and Sarah Mount did with similar datasets on that day: video.


spam evolution

Despite some rather modest protection (like a simple captcha), I still receive spammy comments on this blog every now and again. They’re easily spotted and actually never appear on the website.

There's obviously an incentive for the spammer to post something as convincing as possible: either you're taken in and think it's a genuine comment, or it takes you so much time to decide whether it's genuine or not that you just give up. To that end, I've noticed a new generation of comments that simply copy text from somewhere else on the web. The text is more readable than a Markov-chain generated blurb and thus more taxing for the blogger to identify. There's a twist though: there's usually a word that seems deliberately misspelt. Here is an example:

Hi Louis apparently my honstig company have had a few issues today. As far as I can see, the images are there now. Have they returned for you as well? If not, I can try tweaking a few things and seeing what happens

I wondered why the spelling mistake was introduced, and my current, unsubstantiated guess is that it's a way for the spammer to detect which comments have gone through and so identify blogs that are weak on security.

Today I've started receiving an even more pernicious kind of spammy comment on my blog: the comments are genuine comments copied from R-related blogs and thus even more difficult to spot, since they seem, at least superficially, somewhat related to the post they appear under. Here is an example:

Lattice and ggplot add a lot of value in that they pruocde objects with which you can do things. Also, the whole reason lattice (trellis) was created in the first place was to provide a powerful system that takes care of a lot of tedious things. For example, if you want a histogram conditional on some categorical variable, you’ve got it immediately. Just because it also works in the simple case presented above does not mean it is an equivalent alternative to hist(). I would say that having many options does not make R look like legacy at all. If you need something simple, use something simple (like hist()). If you need something more powerful and flexible, use that.

It threw me at first, because my original post was indeed about ggplot2, but the comment was completely off-topic and I got suspicious. I found its origin in a 2009 blog post. Notice that the spelling mistake does not appear in the original (?) comment.

I filed the comment as spam, slightly amused by the attempt, and what do you know? A few hours later, I received another spammy comment, which is exactly the reply to that comment in the original thread.

to whom it may concern I was never in doubt, that havnig graphic objects and conditioning is an advantage (sorry, when I was unclear at this point) but as you already pointed out, there are already two packages which are mostly equivalent from an ordinary user’s perspective.My concern regards havnig many packages in parallel with very much overlap and little structured and coordinated progress.

Again, with added misspelt words. This type of spam definitely requires more time to identify and I guess it’s achieving its purpose. I wonder how widespread this is. One unintended consequence of this might be fewer off-topic comments though!


A graphical overview of your MySQL database

If you use MySQL, there's a default schema called 'information_schema' which contains lots of information about your schemas and tables, among other things. Recently I wanted to know whether a table I use for storing the results of a large number of experiments was anywhere near maxing out. To cut a brief story even shorter, the answer was "not even close" and could be found in 'information_schema.TABLES'. Not being one to avoid any opportunity to procrastinate, I went on to write a short script to produce a global overview of the entire database.

information_schema.TABLES contains the following fields: TABLE_SCHEMA, TABLE_NAME, TABLE_ROWS, AVG_ROW_LENGTH and MAX_DATA_LENGTH (and a few others). We can first have a look at the relative sizes of the schemas with the MySQL query "SELECT TABLE_SCHEMA,SUM(DATA_LENGTH) SCHEMA_LENGTH FROM information_schema.TABLES WHERE TABLE_SCHEMA!='information_schema' GROUP BY TABLE_SCHEMA".

library("ggplot2") # You'll need ggplot2 0.9 for this.
library("reshape2")
library("RMySQL")
 
connection<-dbConnect(MySQL(), user="username", password="XXXXXX",host="127.0.0.1",port=3306,dbname='')
 
  query<-"SELECT TABLE_SCHEMA,SUM(DATA_LENGTH) SCHEMA_LENGTH FROM information_schema.TABLES WHERE TABLE_SCHEMA!='information_schema' GROUP BY TABLE_SCHEMA"
  result<-dbGetQuery(connection,query)
  result$TABLE_SCHEMA<-reorder(result$TABLE_SCHEMA,result$SCHEMA_LENGTH)
  p<-ggplot(result)+geom_bar(aes(x=TABLE_SCHEMA,y=SCHEMA_LENGTH),stat='identity')+coord_flip()
  p<-p+xlab("Size")+ylab("")
  p<-p+opts(title="Schemas' size")
  print(p)

And for the whole overview, let’s break each schema down by tables:

  query<-"SELECT TABLE_SCHEMA,TABLE_NAME,TABLE_ROWS,DATA_LENGTH FROM information_schema.TABLES WHERE TABLE_SCHEMA!='information_schema'"
  result<-dbGetQuery(connection,query)
  result<-within(result,TABLE_NAME<-factor(TABLE_NAME,levels=sort(TABLE_NAME,decreasing=TRUE)))
  p<-ggplot(result)+geom_bar(aes(x=TABLE_NAME,y=DATA_LENGTH),stat='identity')+coord_flip()+facet_wrap(~TABLE_SCHEMA,scales='free')
  p<-p+xlab("Size")+ylab("")
  p<-p+opts(title="Tables' size")
  print(p)


Also, using AVG_ROW_LENGTH and MAX_DATA_LENGTH, and assuming a relatively constant row length, we can derive the maximum number of rows that a table can hold, which gives us an estimate of how much space is left:

  query<-"SELECT TABLE_SCHEMA,TABLE_NAME,100*TABLE_ROWS/FLOOR(MAX_DATA_LENGTH/AVG_ROW_LENGTH) AS USED FROM information_schema.TABLES WHERE TABLE_SCHEMA!='information_schema'"
#   query<-"SELECT TABLE_SCHEMA,TABLE_NAME,RAND(42)*100 AS USED FROM information_schema.TABLES WHERE TABLE_SCHEMA!='information_schema'"
 
  result<-dbGetQuery(connection,query)
  result$LEFTOVER<-100-result$USED
  result<-within(result,TABLE_NAME<-factor(TABLE_NAME,levels=sort(TABLE_NAME,decreasing=TRUE)))
  result<-melt(result,id.vars=c("TABLE_SCHEMA","TABLE_NAME"),variable.name='TYPE',value.name='PROPORTION',na.rm=TRUE)
  p<-ggplot(result)
  p<-p+geom_bar(aes(x=TABLE_NAME,y=PROPORTION,fill=TYPE),stat='identity')
  p<-p+coord_flip()+facet_wrap(~TABLE_SCHEMA,scales='free')
  p<-p+scale_fill_manual(values=c("USED"='#DD0000',LEFTOVER='#AAAAAA'))
  p<-p+xlab('')+ylab('')+opts(title="Tables' usage")
  print(p)
 
  query<-"SELECT TABLE_SCHEMA, MAX(100*TABLE_ROWS/FLOOR(MAX_DATA_LENGTH/AVG_ROW_LENGTH)) AS USED FROM information_schema.TABLES WHERE TABLE_SCHEMA!='information_schema' GROUP BY TABLE_SCHEMA"
#   query<-"SELECT TABLE_SCHEMA, MAX(100*RAND(42)) AS USED FROM information_schema.TABLES WHERE TABLE_SCHEMA!='information_schema' GROUP BY TABLE_SCHEMA"
 
  result<-dbGetQuery(connection,query)
  result$LEFTOVER<-100-result$USED
  result$TABLE_SCHEMA<-reorder(result$TABLE_SCHEMA,result$USED)
  result<-melt(result,id.vars=c("TABLE_SCHEMA"),variable.name='TYPE',value.name='PROPORTION',na.rm=TRUE)
  p<-ggplot(result)
  p<-p+geom_bar(aes(x=TABLE_SCHEMA,y=PROPORTION,fill=TYPE),stat='identity')
  p<-p+coord_flip()
  p<-p+scale_fill_manual(values=c("USED"='#DD0000',LEFTOVER='#AAAAAA'))
  p<-p+xlab("")+ylab("")+opts(title="Largest Usage")
  print(p)
dbDisconnect(connection)

Unless you are using very large tables, those last two graphs should come out pretty much all gray. You can check that the colouring works by using the commented out queries instead, which use random values for the estimates.

About dbConnect(): I left it here to make things easier to replicate, but I normally call a simple function which is just a wrapper for it, with my username and password in it. This way my credentials are in one single place instead of scattered all over my scripts.
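
Something along these lines (a sketch; the function name and credentials are made up):

myConnection <- function(){ # wraps dbConnect() so the credentials live in one place
  dbConnect(MySQL(),
            user     = "username",
            password = "XXXXXX",
            host     = "127.0.0.1",
            port     = 3306,
            dbname   = '')
}
# connection <- myConnection()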

PS: This is my first anniveRsary! I’ve been using R for a year now. And I’m certainly planning to carry on.


Parallelising plink (or anything else) the easy way

plink is the Swiss-army knife of genome association studies. Its impressive tool set can be seen here. I am currently running some experiments for which I need to compute associations between 30'000 SNPs and 130 assays. This calculation is only the first step of the experiments, which I want to run as many times as possible. So to save time, the most direct approach is to try and parallelise the whole process.

Enter GNU parallel, an amazing Unix command which makes parallelisation a piece of cake. The best way to learn it is to go through the numerous examples. As you can see, it's used pretty much like a normal xargs (see this recent post about xargs on Getting Genetics Done).

To compute the associations, I created a .phen file which contains all the assays, as well as each subject's family and individual IDs. This is just a long tab-separated text file; its header starts with FID IID nameAssay1 nameAssay2 etc.

A normal use of plink would look like this:

plink --manyOptions --pheno allAssays.phen --all-pheno --linear --out analyses

This will calculate a linear regression for each pair (SNP, assay) and store the results in the directory analyses. Time it took in my case: about 80 minutes.

Using GNU parallel however, the change is minimal. I just need:

  • to parse the header in order to extract all the assays’ names.
  • to tell plink which phenotype I want to process. This is done with --pheno-name

The first bit is done with a simple combination of usual unix tools:

head -n1 allAssays.phen |cut -f 3- |sed 's/\t/\n/g'

This will produce the list of assays.

Now combine this with --pheno-name and parallel:

head -n1 allAssays.phen |cut -f 3- |sed 's/\t/\n/g'|parallel plink --manyOptions --pheno allAssays.phen --pheno-name {} --linear --out analyses/experimentID.{}

And this is it! I've just piped the list of assays to parallel plink. This now runs #cores copies of plink, each processing one phenotype. Each instance of {} is replaced by what is piped in, in this case the name of a phenotype. You really can't make it easier. How satisfying it is to run htop and watch all the processors being used!
The whole thing is now done in 10-15 minutes, with very little extra effort to make it work.
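
For what it's worth, the same idea can be sketched from within R itself with parallel::mclapply and system(). This is not what I used (the GNU parallel one-liner is hard to beat); it only reuses the plink options already shown above, with --manyOptions standing in for the real ones:

library(parallel)
header <- strsplit(readLines("allAssays.phen", n = 1), "\t")[[1]]
assays <- header[-(1:2)] # drop FID and IID, keep the assay names
invisible(mclapply(assays, function(assay)
  system(paste0("plink --manyOptions --pheno allAssays.phen --pheno-name ", assay,
                " --linear --out analyses/experimentID.", assay)),
  mc.cores = detectCores()))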

Installation
The official website provides the sources and some binaries. If you use Ubuntu, there's a PPA available here and it's straightforward to install. Note that there's an Ubuntu package called 'moreutils' which contains a parallel command, but it's different from GNU parallel.


polar histogram: pretty and useful

Do you have tens of histograms to show but no room to put them all on the page? As I was reading this paper in Nature Genetics, I came across a simple and clever way of packing all this information in a small space: arrange them all around a circle, and add some guides to help their cross-comparison.

It didn't look too difficult to implement in ggplot2 thanks to polar coordinates, and after a busy Saturday afternoon I ended up with the following image with my data (*) (and a poster-ready pdf, after 2 seconds of prettying up with Inkscape):

The graph shows the proportion of some SNP scores (‘first’, ‘second’ and ‘third’) for a number of phenotypes, which are grouped by themes. I’m quite happy with the result. It’s pretty and useful: it’s very easy to compare one histogram with any of the other 60.

The code is still a bit rough around the edges; a few things are not terribly elegant or are hard-coded. An improved version will be shipped with our graphical package next month. In the meantime, here it is, if you want to try it with your own data.

It returns a ggplot object containing the graph. You can either display it with print(), save it as a pdf with ggsave("myPlot.pdf"), or modify it with the usual ggplot2 commands. I've called it a polar histogram, which, I think, is self-explanatory. If you know what it's actually called, please let me know. (No, I will not call it polR histogram.)

And here is some fake data to get you going:

# fake data for polarHistogram()
# Christophe Ladroue
library(plyr)
library(ggplot2)
source("polarHistogram.R")
 
# a little helper that generates random names for families and items.
randomName<-function(n=1,syllables=3){
  vowels<-c("a","e","i","o","u","y")
  consonants<-setdiff(letters,vowels)
  replicate(n,
            paste(
              rbind(sample(consonants,syllables,replace=TRUE),
                    sample(vowels,syllables,replace=TRUE)),
              sep='',collapse='')
            )
}
 
  set.seed(42)
 
  nFamily<-20
  nItemPerFamily<-sample(1:6,nFamily,replace=TRUE)
  nValues<-3
 
  df<-data.frame(
    family=rep(randomName(nFamily),nItemPerFamily),
    item=randomName(sum(nItemPerFamily),2))
 
df<-cbind(df,as.data.frame(matrix(runif(nrow(df)*nValues),nrow=nrow(df),ncol=nValues)))
 
 
  df<-melt(df,c("family","item"),variable_name="score") # from wide to long
  p<-polarHistogram(df,familyLabel=FALSE)
  print(p)

Options:
Many defaults can already be changed; look at the code for the complete list. The two you are most likely to want to change are familyLabel (logical), which controls whether the name of each group is displayed, and direction, which is either 'inwards' or 'outwards'.
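
For example, with the fake data frame df built above (option names as used in the example and described here):

p <- polarHistogram(df, familyLabel = TRUE, direction = 'outwards') # show group names, bars growing outwards
print(p)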

Coding notes:
It wasn’t terribly difficult but it did take me a bit longer than expected, for a few reasons:

  1. coord_polar() doesn't affect the orientation of geom_text(), so the label angles had to be calculated manually (see the sketch after this list).

  2. You'll notice that the label orientation flips between 6 and 9 o'clock; otherwise those labels would end up upside down and be difficult to read.
  3. There are some scoping issues with plyr and ggplot2 which can be a bit annoying once you encapsulate your code in a function. For example:

    
    df<-data.frame(
      x=runif(10),
      y=runif(10))
     
    z<-10
    ggplot(df)+geom_point(aes(x=x+z,y=y)) # works
     
    rm(z)
    fakeFunction<-function(df){
      z<-10
      ggplot(df)+geom_point(aes(x=x+z,y=y))
      }
     
    fakeFunction(df) # error
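
Here is a rough sketch of the kind of manual angle calculation mentioned in points 1 and 2 (my own illustration, not the actual package code): each label is rotated to follow its bar, and labels in the second half of the circle are flipped by 180 degrees so they don't end up upside down.

labelAngle <- function(position, nLabels){
  angle <- 90 - 360 * (position - 0.5) / nLabels # rotate each label to follow its bar
  ifelse(angle < -90, angle + 180, angle)        # flip labels that would be upside down
}
# e.g. df$angle <- labelAngle(seq_len(nrow(df)), nrow(df)); then geom_text(aes(angle = angle), ...)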

Happy plotting!

(*) The numbers are fudged, don’t spend time reverse-engineering them.

Update (24/03/2012):
Christos Hatzis has modified my original code to plot a collection of un-normalised bar charts, like this.

He’s happy to share his code here: PolarBarchart.zip, together with a test file.

Update (02/06/2012):
You can find a better version in my R package ‘phorest‘.

Update (24/04/2015):
I’ve finally updated the code for the new version of ggplot2. It’s here:
https://github.com/chrislad/phenotypicForest/
Untested and provided as is. Enjoy!
