The mysterious case of the misbehaving writeLines() (or: a cat saves the day)

Dear readers and R experts, I submit to you a mysterious R quirk which has been baffling me for the best part of a week. I found a work-around but I’d love it if someone could explain this strangest of behaviour.

I was using writeLines() to dump the results of a large number of classification results and noticed that the output file was not as expected: some records were missing, some repeated. It took me a while to figure out exactly what the problem was (the multicore library? network latency? an unclosed sink()? etc.) but once I finally stripped the script bare, it turned out that I could reproduce the bug with a combination of multicore and writeLines().

Here is how my script works: a function (accuracyWithSomeFeatures()) is applied a number of times (numberFeatures). For each application, I output numberFeatures rows, each with 3 columns: a identifier for this iteration, the value passed to the function (from 1 to numberFeatures) and the number generated by the function. This process is repeated numberExperiments times.
When it’s finished, I should therefore end up with a text file of numberFeatures x numberExperiments.
Well this doesn’t happen if the function is applied in parallel AND I use writeLines() to write to the file.

Only this combination reproduces the bug. If I use multicore (or parallel in 2.14) and cat() or don’t use multicore but writeLines() etc., the output file is of the right shape.

What’s even weirder is that when the bug happens, the number of lines of the output file is not constant; the repeat/mangling is not consistent. For example, with numberFeatures<-317; numberExperiments<-467, a wc -l gives 217175, 217417, 217357 lines after 3 consecutive runs. None of these numbers are multiples of 317 or 467 so something is going seriously wrong.

The effect cannot be due to parallel writing either: the output file is populated after the parallelised generation of the data.

I’m leaving the code here if you fancy solving a bit of R mystery. Please let me know if you find the reason for this apparent bug, I’m sure there’s something to learn here. (Or it’s staring at me in the face. I’m ready for that too).

Select All Code:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
 
if(as.numeric(version$major)*100+as.numeric(version$minor)<214) library(multicore) else library(parallel)
 
classifyInParallel<-TRUE
withCat<-FALSE
 
numberFeatures<-317
numberExperiments<-467
 
outputFile<-"~/debugWriteLines.csv"
unlink(outputFile)
if(!withCat) outputFile<-file(outputFile,'a')
 
replicate(numberExperiments,{
  experimentID<-paste(sample(letters,6,replace=TRUE),collapse='')
 
  accuracyWithSomeFeatures<-function(k) 100*runif(1)
 
    if(classifyInParallel){
      accuracy<-mclapply(1:numberFeatures,accuracyWithSomeFeatures)
      accuracy<-unlist(accuracy)
    } else{
      accuracy<-sapply(1:numberFeatures,accuracyWithSomeFeatures)
    }
 
  content<-paste(experimentID, 1:numberFeatures,accuracy,sep='\t')
 
  if(withCat)
    cat(content,
      sep='\n',
      file=outputFile,
      append=TRUE)  
  else
    writeLines(
      content,
      con=outputFile)  
 
  NULL
})
 
if (!withCat) close(outputFile)
 
cat("\nclassifyInParallel:",classifyInParallel)
cat("\nwithCat:",withCat)
cat("\nThere should be",numberFeatures*numberExperiments,"lines in total.")

My system:
platform x86_64-pc-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
status
major 2
minor 14.0
year 2011
month 10
day 31
svn rev 57496
language R
version.string R version 2.14.0 (2011-10-31)

(original image by Kaptain Kobold)

Update (17 Nov 2011):
I now have an even barer script. writeLines() gets it wrong only if the output file is opened and closed once, i.e. before and after the main loop. It works if the file is opened and closed at each iteration. Again, this is not an issue of writing to a file from multiple processors, since writeLines() is outside mclapply().

Select All Code:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
if(as.numeric(version$major)*100+as.numeric(version$minor)<214) library(multicore) else library(parallel)
 
numberFeatures<-17
numberExperiments<-13
 
outputFilename<-"~/debugWriteLines.csv"
 
 
cat("\nThere should be",numberFeatures*numberExperiments,"lines in total.")
 
# opening and closing at each save
cat("\n\tOpening and closing at each iteration.")
unlink(outputFilename)
 
for(experiment in 1:numberExperiments){
  outputFile<-file(outputFilename,'a')
  mclapply(1:numberFeatures,function(k) runif(1))
  content<-paste(experiment, 1:numberFeatures,sep='\t')
  writeLines(content,con=outputFile)
  close(outputFile)
  }
cat(" Number of lines:",length(readLines(outputFilename)))
 
# opening and closing once
cat("\n\tOpening and closing once.")
 
unlink(outputFilename)
outputFile<-file(outputFilename,'a')
 
for(experiment in 1:numberExperiments){
  mclapply(1:numberFeatures,function(k) runif(1)) # do something in parallel
  content<-paste(experiment, 1:numberFeatures,sep='\t')
  writeLines(content,con=outputFile)}
 
close(outputFile)
 
numberLines<-length(readLines(outputFilename))
cat(" Number of lines:",numberLines)
cat("\n\t\tNumber of lines / number of experiments:",numberLines/numberExperiments)
cat("\n\t\tNumber of lines / number of features:",numberLines/numberFeatures)
cat("\n\t\tNumber of lines / (number of experiments * number of features):",numberLines/(numberExperiments*numberFeatures))

produces:

There should be 221 lines in total.
Opening and closing at each save. Number of lines: 221
Opening and closing once. Number of lines: 2465
Number of lines / number of experiments: 189.6
Number of lines / number of features: 145
Number of lines / (number of experiments * number of features): 11.15

This entry was posted in Uncategorized and tagged , , , , . Bookmark the permalink.

2 Responses to The mysterious case of the misbehaving writeLines() (or: a cat saves the day)

  1. Keith says:

    Hi, I didn’t analise your code closely so this could be wrong, but I was having a similar problem with using multicore & writing to a connection. It turned out that calling flush on the connection fixed the problem. I think this SO question explains what’s happening since the multicore package uses fork:

    http://stackoverflow.com/questions/3513242/working-of-fork-in-linux-gcc

    • CL says:

      Very interesting, thank you. It does seem to explain why I get more than I asked for, exactly like in the OP’s example. And I guess closing the connection after each save effectively flushes the buffer (cf. my update), hence makes the script work.
      From this, I’d infere that cat(), which is not affected by that, does the flushing itself. So I think I’ll keep on using cat() from now on, one less thing to worry about.

      Cheers!

Leave a Reply

Your email address will not be published. Required fields are marked *

*


× 5 = forty

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong> <pre lang="" line="" escaped="">