reading raster data using library(parallel)

Recently, I have been doing some analysis for a project I am involved in. In particular, I was interested what role pacific sea surface temperatures play with regard to rainfall in East Africa. I spare you the details as I am currently writing all this up into a paper which you can have a look at once published.

For this analysis, however, I am processing quite an amount of raster files. This led me to investigate the possibilities of the parallel package to speed up the process.

Here's a quick example on how to read in raster data (in this case 460 global sea surface temperature files of 1° x 1° degree resolution) using parallel

First, lets do it the conventional way and see how long that takes

library(raster)
library(rgdal)

### Input preparation ########################################################
inputpath <- "/media/tims_ex/sst_kili_analysis"
ptrn <- "*sst_anom_pcadenoise_*_R2.rst"

### list files in direcotry ##################################################
fnames_sst_r2 <- list.files(inputpath, 
                            pattern = glob2rx(ptrn), 
                            recursive = T)

### read into raster format ##################################################
system.time({
  sst.global <- lapply(seq(fnames_sst_r2), function(i) {
    raster(paste(inputpath, fnames_sst_r2[i], sep = "/"))
    }
                       )
  })
##    user  system elapsed 
##  61.584   0.412  68.104

Now using library(parallel)

library(parallel)

system.time({
  ### set up cluster call ######################################################
  cl <- makePSOCKcluster(4)

  clusterExport(cl, varlist = c("inputpath", "fnames_sst_r2"), 
                envir=environment())
  junk <- clusterEvalQ(cl, c(library(raster),
                             library(rgdal)))

  ### read into raster format using parallel version of lapply #################
  sst.global.p <- parLapply(cl, seq(fnames_sst_r2), function(i) {
    raster(paste(inputpath, fnames_sst_r2[i], sep = "/"))
    }
                          )

  ### stop the cluster #########################################################
  stopCluster(cl)
  })
##    user  system elapsed 
##   0.152   0.080  25.670

Not a crazy speed enhancement, but we need to keep in mind that the raster command does not read into memory. Hence, the speed improvements should be a lot higher once we start the calculations or plotting.

Finally, let's test whether the two methods produce identical results.

identical(sst.global.p, sst.global)
## [1] TRUE

to be continued…

sessionInfo()
## R version 2.15.3 (2013-03-01)
## Platform: x86_64-pc-linux-gnu (64-bit)
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=C                 LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] rgdal_0.8-5   raster_2.0-41 sp_1.0-5      knitr_1.1    
## 
## loaded via a namespace (and not attached):
## [1] digest_0.6.3    evaluate_0.4.3  formatR_0.7     grid_2.15.3    
## [5] lattice_0.20-13 stringr_0.6.2   tools_2.15.3
About these ads
This entry was posted in R. Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s