Channel: R-bloggers

Bio7 2.1 for Linux 64-bit and Windows 32-bit released


(This article was first published on » R, and kindly contributed to R-bloggers)

19.05.2015

Following the Windows 64-bit release, the Linux 64-bit and Windows 32-bit releases can now be downloaded at: http://bio7.org


Linux installation:

The installation of Bio7 is similar to the installation of the Eclipse environment. Simply decompress the downloaded *.zip file in a preferred location on your file system. After decompressing with a standard zip tool (like WinZip or WinRAR), the typical file structure of an Eclipse-based application is created. To start the application, simply double-click the Bio7 binary file.

R and Rserve installation:

To use R from within Bio7, please install R with a Linux package manager.
The Rserve library is also required. Rserve has to be compiled and installed into the local R installation with the shell command:

sudo PKG_CPPFLAGS=-DCOOPERATIVE R CMD INSTALL Rserve_1.8-2.tar.gz

The flag before R CMD INSTALL… is necessary to enable a shared workspace when switching from a local Rserve connection to the native Bio7 R console and vice versa. After the installation of R, the path to the R application (if not using the default path!) has to be adjusted inside of Bio7 (Preferences ▷ Preferences Bio7). In addition, the path to the (add-on) package install location has to be adjusted, too (Preferences ▷ Preferences Bio7 ▷ RServe Preferences).
Please also set the user rights for the folder. This is sometimes necessary if you would like to install packages with the Bio7 interface but don't have the required permissions.
Since Bio7 1.4, default R paths are set which are usually correct for a Linux distribution.
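
As a quick check after installation (a minimal sketch, not taken from the Bio7 documentation), you can load the package and start an Rserve daemon from a plain R session; Bio7 manages its own connection, so this only verifies that Rserve installed correctly:

## Verify that the Rserve package is available in the local R library
library(Rserve)
## Start a local Rserve daemon; "--no-save" is passed to the underlying R process
Rserve(args = "--no-save")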

 


TidyR Challenge: Data.Table Solution


(This article was first published on Jeffrey Horner, and kindly contributed to R-bloggers)

Arun Srinivasan is the man! Once he saw that his data.table solution to the TidyR Challenge had an issue, he fixed it!

His solution is below, along with a quick equivalence test against my original solution. Also check out this StackOverflow question for a more engaging discussion of the strengths and weaknesses of both dplyr/tidyr and data.table.

Fake Data

library(wakefield)
library(tidyr)
library(dplyr)

d <- r_data_frame(
      n=100,
      id,
      r_series(date_stamp,15,name='foo_date'),
      r_series(level,15,name='foo_supply'),
      r_series(date_stamp,10,name='bar_date'),
      r_series(level,10,name='bar_supply'),
      r_series(date_stamp,3,name='baz_date'),
      r_series(level,3,name='baz_supply')
  )

Test Function for Equivalence

# Create a true ordered data frame and drop any extraneous classes for each column
true_ordered_df <- function(x){
  x$ID <- as.character(x$ID); class(x$ID) <- 'character'
  x$med_date <- as.Date(x$med_date); class(x$med_date) <- 'Date'
  x$med_supply <- as.integer(x$med_supply); class(x$med_supply) <- 'integer'
  x$med_name <- as.character(x$med_name); class(x$med_name) <- 'character'
  x <- data.frame(
          ID=x$ID, 
          med_date=x$med_date,  
          med_supply=x$med_supply, 
          med_name=x$med_name,
          stringsAsFactors=FALSE
  )
  x <- x[with(x,order(ID,med_date,med_supply,med_name)),]
  row.names(x) <- NULL
  x
}

Data.Table Solution, thanks to Arun Srinivasan

require(data.table) # v1.9.5
dt = as.data.table(d)

pattern = c("date", "supply")
mcols = lapply(pattern, grep, names(dt), value=TRUE)
dt.m = melt(dt, id="ID", measure=mcols, variable.name="med_name", 
value.name = paste("med", pattern, sep="_"))
setattr(dt.m$med_name, 'levels', gsub("_.*$", "", mcols[[1L]]))

scripts2 <- true_ordered_df(dt.m)

My Original Solution

# foo
med_dates <- d %>% 
    select(ID,foo_date_1:foo_date_15) %>% 
    gather(med_seq, med_date, foo_date_1:foo_date_15)
med_dates$med_seq <- as.integer(sub('^foo_date_','',med_dates$med_seq))
med_supply <- d %>% 
    select(ID,foo_supply_1:foo_supply_15) %>% 
    gather(med_seq, med_supply, foo_supply_1:foo_supply_15)
med_supply$med_seq <- as.integer(sub('^foo_supply_','',med_supply$med_seq))
foo <- left_join(med_dates,med_supply, by=c('ID','med_seq')) %>% 
    select(ID,med_date,med_supply)
foo$med_name <- 'foo'

# bar
med_dates <- d %>% 
    select(ID,bar_date_1:bar_date_10) %>% 
    gather(med_seq, med_date, bar_date_1:bar_date_10)
med_dates$med_seq <- as.integer(sub('^bar_date_','',med_dates$med_seq))
med_supply <- d %>% 
    select(ID,bar_supply_1:bar_supply_10) %>% 
    gather(med_seq, med_supply, bar_supply_1:bar_supply_10)
med_supply$med_seq <- as.integer(sub('^bar_supply_','',med_supply$med_seq))
bar <- left_join(med_dates,med_supply, by=c('ID','med_seq')) %>% 
    select(ID,med_date,med_supply)
bar$med_name <- 'bar'

# baz
med_dates <- d %>% 
    select(ID,baz_date_1:baz_date_3) %>% 
    gather(med_seq, med_date, baz_date_1:baz_date_3)
med_dates$med_seq <- as.integer(sub('^baz_date_','',med_dates$med_seq))
med_supply <- d %>% 
    select(ID,baz_supply_1:baz_supply_3) %>% 
    gather(med_seq, med_supply, baz_supply_1:baz_supply_3)
med_supply$med_seq <- as.integer(sub('^baz_supply_','',med_supply$med_seq))
baz <- left_join(med_dates,med_supply, by=c('ID','med_seq')) %>% 
    select(ID,med_date,med_supply)
baz$med_name <- 'baz'

scripts <- true_ordered_df(rbind(foo,bar,baz))
all.equal(scripts,scripts2)
## [1] TRUE

Huzzah!


Fast parallel computing with Intel Phi coprocessors


(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Andrew Ekstrom
Recovering physicist, applied mathematician and graduate student in applied Stats and systems engineering

We know that R is a great system for performing statistical analysis. The price is quite nice too ;-) . As a graduate student, I need a cheap replacement for Matlab and/or Maple. Well, R can do that too. I’m running a large program that benefits from parallel processing. RRO 8.0.2 with the MKL works exceedingly well.

For a project I am working on, I need to generate a really large matrix (10,000x10,000) and raise it to really high powers (like 10^17). This is part of my effort to model chemical kinetics reactions, specifically polymers. I'm using a Markov matrix of 5,000x5,000, and now 10,000x10,000, to simulate polymer chain growth at femtosecond timescales.

At the beginning of this winter semester, I originally used Maple 18. I was running my program on a Windows 7 Pro computer with an Intel i7-3700K (3.5GHz) quad-core processor and 32GB of DDR3 RAM. My full program took, well, WWWWWAAAAAAAYYYYYYYY TTTTTTTTOOOOOOOO LLLLLOOOONNNNGGGGGG!!!!!!!!

After a week, my computer would still be running. I also noticed that it was using only 12%-13% of the processor power. With that in mind, I went to the local computer parts superstore and consulted with the sales staff. I ended up getting a "Gamer" rig when I purchased a new AMD FX9590 processor (4.7GHz on 8 cores) and dropped it into a new mobo. This new computer ran the same Maple program with slightly better results: it took 4-5 days to complete... assuming no one else used the computer and turned it off.

After searching for a better method (meaning better software) for running my program, I decided to try R. After looking around for a few hours, I was able to rewrite my program in R. YEAH! Using basic R (version 3.1.2), my new program only took a few days (2-3). A nice feature of R is its improved BLAS and LAPACK implementation relative to Maple 18. But even though R 3.1.2 is faster than Maple 18, R still used only 12%-13% of my processor.

Why do I keep bringing up the 12%-13% CPU usage?  Well, it means that on my 8 core processor, only 1 core is doing all the work. (1/8 = 0.125) Imagine you go out and buy a new car. This car has a big V8 engine but, only 1 cylinder runs at a time. Even though you have 7 other cylinders in the car, they are NOT used. If that was your car, you would be furious. For a computer program, this is standard protocol. A cure for this type of silliness is to use parallel programming.   

Unfortunately, I AM NOT A PROGRAMMER! I make things happen with a minimal amount of typing. I'm very likely to use "default settings" because I'm likely to mistype something and spend an hour trying to figure out, "Is that a colon or a semicolon?" So when I looked around at other websites discussing how to compile and/or install a different BLAS and LAPACK for R, I started thinking, "I wish I was taking QED right now." (QED = Quantum Electro-Dynamics.) I also use Windows, while most of the websites I saw discussed doing this on Linux.

That led me to Revolution Analytics RRO. I installed RRO version 8.0.2 and the MKL available from here: http://mran.revolutionanalytics.com/download/#download

RRO uses Intel’s Math Kernel Library, which is updated and upgraded to run certain types of calculations in parallel. Yes, parallel processing in Windows, which is step one of HPC (High Performance Computing) and something many of my comp sci friends and faculty said was difficult to do.

A big part of my project is raising a matrix to a power. This is a highly parallelizable process. By that I mean that calculating element A(i,j) of the new matrix does not depend upon the other elements of the new matrix; it only depends on what is in the old matrix. Using the old style (serial) computing, you calculate A(1,1), then A(1,2), A(1,3) … A(n,n). With parallel programming, on my 8 core AMD processor, I can calculate A(1,1), A(1,2), A(1,3) … A(1,8) at the same time. If these calculations were "perfectly parallel" I would get my results 8 times faster. For those of us that have read other blog posts on RevolutionAnalytics.com, you know that the speed boost for parallel programming is great, but not perfect. (Almost like it follows the laws of thermodynamics.) By using RRO, I was able to run my program in R and get results for all of my calculations in 6-8 hours. That got me thinking.

If parallel processing on 8 cores instead of serial processing on 1 core is a major step up, can I push the parallelism even further? Yes. GPU processors like the Tesla and FirePro are nice and all, but:

1) Using them with R requires programming and using Linux. Two things I don't have time to do.

2) Entry-level Tesla and good FirePro GPUs cost a lot of money. Something I don't have a lot of right now.

The other option is using an Intel Phi coprocessor, or two. Fortunately, when I started looking, I could pick up a Phi coprocessor for cheap. Like $155 cheap, for a brand new coprocessor from an authorized retailer. The video card in my computer cost more than my 2 Phi's. The big issue is getting a motherboard that can handle the Phi's. Phi coprocessors have 6+ GB of RAM, and most mobos can't handle more than 4GB of RAM through a PCI-E 3.0 slot. So, I bought a second mobo for a "hobby" project computer. This new mobo is intended for "workstations" and has 4 PCI-E 3.0 slots. That gives me enough room for a good video card and 2 Phi's. This new Workstation PC has an Intel Xeon E5-2620V3 (2.4GHz, 6-core, 12-thread) processor, 2 Intel Xeon Phi 31S1P coprocessors (57 cores with 4 threads per core at 1.1GHz, for a total of 456 threads) and 48GB of DDR4 RAM.

The Intel Phi coprocessors work well with the Intel MKL, the same MKL that RRO uses. Which means that if I use RRO with my Phi's, after they are properly set up, I should be good to go….. Intel doesn't make this easy. (I cobbled together the information from 6-7 different sources. Each source had a small piece of the puzzle.) The Phi's are definitely not "Plug and Play". I used MPSS version 3.4 for Windows 7. I downloaded the drivers from here:

https://software.intel.com/en-us/articles/intel-manycore-platform-software-stack-mpss#wn34rel

I had to go into the command prompt and follow some of the directions available here. (Helpful hint, use micinfo to check your Phi coprocessors after step 9 in section 2.2.3 “Updating the Flash”.)  

http://registrationcenter.intel.com/irc_nas/6252/readme-windows.pdf

After many emails to Revolution Analytics staff, I was able to get the Phi’s up and running! Now, my Phi’s work harmoniously with MKL.   Most of the information I needed is available here. https://software.intel.com/sites/default/files/11MIC42_How_to_Use_MKL_Automatic_Offload_0.pdf

https://software.intel.com/en-us/articles/performance-tips-of-using-intel-mkl-on-intel-xeon-phi-coprocessor

Following the paper and website above, I needed to create some environment variables. The generic ones are:

MKL_MIC_ENABLE=1
OFFLOAD_DEVICES=<list>
MKL_MIC_MAX_MEMORY=2GB
MIC_ENV_PREFIX=MIC
MIC_OMP_NUM_THREADS=###
MIC_KMP_AFFINITY=balanced

Since I have 2 Phi coprocessors, my <list> is 0, 1. (At least this is the list that worked.) I set MKL_MIC_MAX_MEMORY to 8GB (I have the RAM to do it, so why not) and MIC_OMP_NUM_THREADS to 456.
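
As an aside, and purely as a sketch that is not from the original post, the same values could in principle be mirrored from within an R session using Sys.setenv(); whether that takes effect early enough for MKL's automatic offload may depend on the setup, so setting them system-wide remains the safer route:

## Hypothetical example: mirror the environment variables listed above from R
## (values follow the text: two coprocessors, 8GB of MIC memory, 456 threads)
Sys.setenv(MKL_MIC_ENABLE      = "1",
           OFFLOAD_DEVICES     = "0,1",
           MKL_MIC_MAX_MEMORY  = "8GB",
           MIC_ENV_PREFIX      = "MIC",
           MIC_OMP_NUM_THREADS = "456",
           MIC_KMP_AFFINITY    = "balanced")
Sys.getenv("MKL_MIC_ENABLE")   # confirm the value is visible to the session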

Below is a sample program I used to benchmark Maple 2015, R and RRO on my Gamer computer and my Workstation. Between the time I started this project and now, Maple upgraded their program to Maple 2015. The big breakthrough is that Maple now does parallel processing. So, I ran the program below using Maple 2015 to see how it compares to R and RRO. (I uninstalled Maple 18 in anger.) I also ran the same program on my Workstation PC to see how well the Phi coprocessors worked. Once I had everything enabled, I didn't want to disable anything, so I just have the one, VERY IMPRESSIVE, time for my Workstation.

require("expm")
options(digits=22)
a=10000
b=0.000000001
c=matrix(0,a,a)
for ( i in 1:a){c[i,i] = 1-1.75*b}
for ( i in 1:a){c[i-1,i] = b}
for ( i in 2:a){c[i,i-1] = 0.75*b}
c[1,1]=1-b
c[a,a]=1
c[a,a-1]=0
system.time(e=c%^%100)

[Figure: "Power of parallel revised" benchmark timings]

By using RRO instead of R, I got my results 3.12 hours faster. Considering the fact that I have several dozen more calcs like this one, saving 3hrs per calc is wonderful ;-) By using RRO instead of Maple 2015, I saved about 41 mins. By using RRO with the Phi’s on my Workstation PC, I was done in 187.3s. I saved an additional 39 mins over my Gamer Computer! When I ran my full program, it took under an hour. Compared to the days/weeks for my smaller calculations, an hour is awesome!

An interesting note on the Intel MKL: it only uses cores, not threads, on the main processor. I'm not sure how it handles the threads on the Phi coprocessors. So, my Intel Xeon showed only 50% usage of the main processor.

Now, your big question is, "Why should I care?" I ran a 10,000x10,000 matrix and raised it to unbelievably high powers, using a brute force method. Suppose that you are doing "Big Data" analysis and you have 30 columns by 2,000,000 rows. If you run a linear regression on that data, your software will use a pseudoinverse to calculate the coefficients of your regression. Part of the pseudoinverse involves multiplying your 30x2,000,000 matrix by a 2,000,000x30 matrix, and it's all parallelizable! Squaring my matrix uses about 1.00x10^12 operations (assuming I have my Big O calculation correct). The pseudoinverse of your matrix uses a mere 1.80x10^9 operations.
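
To make that concrete, here is a small sketch (an illustration only, with far smaller dimensions than the 30-column-by-2,000,000-row example above) of the cross-product step of the normal equations that a multi-threaded BLAS such as the MKL can parallelize:

## Illustrative only: tall, skinny design matrix and the X'X step
set.seed(1)
n <- 200000; p <- 30
X <- matrix(rnorm(n * p), n, p)
y <- X %*% rnorm(p) + rnorm(n)
XtX  <- crossprod(X)                 # (p x n) %*% (n x p) -> 30 x 30, parallelizable
beta <- solve(XtX, crossprod(X, y))  # regression coefficients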

Some of my friends who do these sorts of "Big Data" calculations using the serial method built into basic R or SAS tell me that they take hours (1-2) to complete. With my workstation, I have the computational power of 17 servers that use my same Xeon processor. That calculation would take me way less than a minute.

Behold, the power of parallel processing!


Exact computation of sums and means


(This article was first published on Radford Neal's blog » R Programming, and kindly contributed to R-bloggers)

A while ago, I came across a mention of the Python math.fsum function, which sums a set of floating-point values exactly, then rounds to the closest floating point value. This seemed useful. In particular, I thought that if it’s fast enough it could be used instead of R’s rather primitive two-pass approach to trying to compute the sample mean more accurately (but still not exactly). My initial thought was to just implement the algorithm Python uses in pqR. But I soon discovered that there were newer (and faster) algorithms. And then I thought that I might be able to do even better…

The result is a new paper of mine on Fast exact summation using small and large superaccumulators (also available from arxiv.org).

A superaccumulator is a giant fixed-point number that can exactly represent any floating-point value (and then some, to allow for bigger numbers arising from doing sums). This concept has been used before to implement exact summation methods. But if done in software in the most obvious way, it would be pretty slow. In my paper, I introduce two new variations on this method. The “small” superaccumulator method uses a superaccumulator composed of 64-bit “chunks” that overlap by 32 bits, allowing carry propagation to be done infrequently. The “large” superaccumulator method has a separate chunk for every possible combination of the sign bit and exponent bits in a floating-point number (4096 chunks in all). It has higher overhead for initialization than the small superaccumulator method, but takes less time per term added, so it turns out to be faster when summing more than about 1000 terms.

Here is a graph of performance on a Dell Precision T7500 workstation, with a 64-bit Intel Xeon X5680 processor:

[Figure: time per term (ns) versus number of terms summed, on the T7500]

The horizontal axis is the number of terms summed, the vertical axis the time per term in nanoseconds, both on logarithmic scales. The time is obtained by repeating the same summation many times, so the terms summed will be in cache memory if it is large enough (vertical lines give sizes of the three cache levels).

The red lines are for the new small (solid) and large (dashed) superaccumulator methods. The blue lines are for the iFastSum (solid) and OnlineExact (dashed) methods of Zhu and Hayes (2010), which appear to be the fastest previous exact summation methods. The black lines are for the obvious (inexact) simple summation method (solid) and a simple out-of-order summation method, which adds terms with even and odd indexes separately and then adds these two partial sums together. Out-of-order summation provides more potential for instruction-level parallelism, but may not produce the same result as simple ordered summation, illustrating the reproducibility problems with trying to speed up non-exact summation.
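
(As a toy illustration of the out-of-order idea in R, not the C implementations benchmarked in the paper: splitting a vector into odd- and even-indexed halves, summing each, and adding the partial sums may round differently than a single ordered sum.)

## Toy illustration only; the benchmarked methods are C implementations
x <- runif(1e6)
ordered_sum  <- sum(x)
odd_even_sum <- sum(x[c(TRUE, FALSE)]) + sum(x[c(FALSE, TRUE)])
ordered_sum == odd_even_sum   # may be FALSE: a different order can round differently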

One can see that my new superaccumulator methods are about twice as fast as the best previous methods, except for sums of less than 100 terms. For large sums (10000 or more terms), the large superaccumulator method is about 1.5 times slower than the obvious simple summation method, and about three times slower than out-of-order summation.

These results are all for serial implementations. One advantage of exact summation is that it can easily be parallelized without affecting the result, since the exact sum is the same for any summation order. I haven’t tried a parallel implementation yet, but it should be straightforward. For large summations, it should be possible to perform exact summation at the maximum rate possible given limited memory bandwidth, using only a few processor cores.

For small sums (eg, 10 terms), the exact methods are about ten times slower than simple summation. I think it should be possible to reduce this inefficiency, using a method specialized to such small sums.

However, even without such an improvement, the new superaccumulator methods should be good enough for replacing R’s “mean” function with one that computes the exact sample mean, since for small vectors the overhead of calling “mean” will be greater than the overhead of exactly summing the vector. Summing all the data points exactly, then rounding to 64-bit floating point, and then dividing by the number of data points wouldn’t actually produce the exactly-rounded mean (due to the effect of two rounding operations). However, it should be straightforward to combine the final division with the rounding, to produce the exactly-correct rounding of the sample mean. This should also be faster than the current inexact two-pass method.

Modifying “sum” to have an “exact=TRUE” option also seems like a good idea. I plan to implement these modifications to both “sum” and “mean” in a future version of pqR, though perhaps not the next version, which may be devoted to other improvements.

 



The R Foundation announces new mailing list ‘R-package-devel’


At last week's monthly meeting, the R Foundation decided to
create a new mailing list in order to help R package authors in
their package development and testing.

The idea is that some experienced R programmers (often those
currently helping on R-devel or also R-help) will help package
authors and thus take some of the burden off the CRAN team
members.

We also expect an impact on R-devel: I'm expecting somewhat less
traffic there, with the focus returning to the implementation of R
and future features of R itself.

Please read the description of the mailing list here
https://stat.ethz.ch/mailman/listinfo/r-package-devel
or below, subscribe and start using it!

For the R foundation,
Martin Maechler,  Secretary General

——————- “About R-package-devel”  (from above URL): ———

This list is to get help about package development in R. The goal of the list is to provide a forum for learning about the package development process. We hope to build a community of R package developers who can help each other solve problems, and reduce some of the burden on the CRAN maintainers. If you are having problems developing a package or passing R CMD check, this is the place to ask!

Please note that while R-package-devel contributors will do their best to provide you accurate and authoritative information, the final arbiter of CRAN submissions is the CRAN team.

  • Please keep it civil. It's easy to get frustrated when building a package, or when answering the same question for what feels like the thousandth time. But everyone involved in the process is a volunteer.
  • Include a reproducible example. We can't help if we don't know what the problem is. For packages, if possible, include a link to the package source. If you're having a problem with R CMD check, include the relevant message inline.
  • If you're in violation of this code, one of the moderators will send you a gentle admonishment off-list.
  • For more about such "Netiquette", read the Debian code of conduct.

Note that there may be some overlap of topics with the R-devel mailing list, notably because, before the existence of R-package-devel, many package developers used R-devel for questions that are now meant to be asked on this list. Beware that cross-posting, i.e., posting to both, is generally considered impolite, with rare exceptions, e.g., if a thread is being moved from one list to the other for good reasons.

Revolution R Open 3.2.0 now available for download


(This article was first published on Revolutions, and kindly contributed to R-bloggers)

The latest update to Revolution R Open, RRO 3.2.0, is now available for download from MRAN. In addition to new features, this release now tracks the version number of the underlying R engine (so this is the release following RRO 8.0.3).

Revolution R Open 3.2.0 includes:

  • The latest R engine, R 3.2.0. This includes many improvements, including faster processing, reduced memory usage, support for bigger in-memory objects, and an improved byte compiler.
  • Multi-threaded math processing, reducing the time for some numerical operations on multi-core systems.
  • A focus on reproducibility, with access to a fixed CRAN snapshot taken on May 1, 2015. Many new and updated packages are available since the previous release of RRO -- see the latest Package Spotlight for details. CRAN packages released since May 1 can be easily (and reproducibly!) accessed with the checkpoint function (see the short sketch after this list).
  • Binary downloads for Windows, Mac and Linux systems.
  • 100% compatibility with R 3.2.0, RStudio and all other R-based applications.
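
For example (a minimal sketch, assuming the checkpoint package is installed), pinning package installation to the May 1, 2015 snapshot looks like this:

## Use the CRAN snapshot date that this RRO release is built against
library(checkpoint)
checkpoint("2015-05-01")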

You can download Revolution R Open now from the link below, and we welcome comments, suggestions and other discussion on the RRO Google Group. If you're new to Revolution R Open, here are some tips to get started, and there are many data sources you can explore with RRO. Thanks go as always to the contributors to the R Project upon which RRO is built.

MRAN: Download Revolution R Open


sjmisc – package for working with (labelled) data #rstats


(This article was first published on Strenge Jacke! » R, and kindly contributed to R-bloggers)

The sjmisc-package

My last posting was about reading and writing data between R and other statistical packages like SPSS, Stata or SAS. After that, I decided to bundle all functions that are not directly related to plotting or printing tables into a new package called sjmisc.

Basically, this package covers three domains of functionality:

  • reading and writing data between other statistical packages (like SPSS) and R, based on the haven and foreign packages; hence, sjmisc also includes functions to work with labelled data.
  • frequently used statistical tests, or at least convenient wrappers for such test functions
  • frequently applied recoding and variable conversion tasks

In this posting, I want to give a quick and short introduction into the labeling features.

Labelled Data

In software like SPSS, it is common to have value and variable labels as variable attributes. Variable values, even if categorical, are mostly numeric. In R, however, you may use labels as values directly:

> factor(c("low", "high", "mid", "high", "low"))
[1] low  high mid  high low 
Levels: high low mid

Reading SPSS data (with haven, foreign or sjmisc) keeps the numeric values for variables and adds the value and variable labels as attributes. See the following example from the sample dataset efc, which is part of the sjmisc-package:

library(sjmisc)
data(efc)
str(efc$e42dep)

> atomic [1:908] 3 3 3 4 4 4 4 4 4 4 ...
> - attr(*, "label")= chr "how dependent is the elder? - subjective perception of carer"
> - attr(*, "labels")= Named num [1:4] 1 2 3 4
>  ..- attr(*, "names")= chr [1:4] "independent" "slightly dependent" "moderately dependent" "severely dependent"

While all plotting and table functions of the sjPlot-package make use of these attributes (see many examples here), many packages and/or functions do not consider these attributes, e.g. R base graphics:

library(sjmisc)
data(efc)
barplot(table(efc$e42dep, efc$e16sex), 
        beside = T, 
        legend.text = T)

[Figure: bar plot without value labels]

Adding value labels as factor values

to_label is a sjmisc function that converts a numeric variable into a factor and sets the attribute value labels as factor levels. Using factors with labelled levels, the bar plot is labelled.

library(sjmisc)
data(efc)
barplot(table(to_label(efc$e42dep),
              to_label(efc$e16sex)), 
        beside = T, 
        legend.text = T)

[Figure: bar plot with value labels as factor levels]

to_fac is a convenient replacement for as.factor, which converts a numeric vector into a factor but keeps the value and variable label attributes.
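
A quick sketch of that difference (an illustrative example, not from the original post, using the efc data; the attribute names follow the str() output shown earlier):

library(sjmisc)
data(efc)
x1 <- as.factor(efc$e42dep)   # plain factor: label attributes are dropped
x2 <- to_fac(efc$e42dep)      # to_fac: label attributes are kept
attr(x1, "label")             # NULL
attr(x2, "label")             # variable label should still be attached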

Getting and setting value and variable labels

There are four functions that let you easily set or get value and variable labels of either a single vector or a complete data frame:

  • get_var_labels() to get variable labels
  • get_val_labels() to get value labels
  • set_var_labels() to set variable labels (add them as vector attribute)
  • set_val_labels() to set value labels (add them as vector attribute)

library(sjmisc)
data(efc)
barplot(table(to_label(efc$e42dep),
              to_label(efc$e16sex)), 
        beside = T, 
        legend.text = T,
        main = get_var_labels(efc$e42dep))

[Figure: bar plot with the variable label used as the plot title]

get_var_labels(efc) would return all of the data frame's variable labels, and get_val_labels(efc) would return a list with the value labels of all the data frame's variables.

Restore labels from subsetted data

The base subset function as well as dplyr’s (at least up to 0.4.1) filter and select functions omit label attributes (or vector attributes in general) when subsetting data. In the current development-snapshot of sjmisc at GitHub (which will most likely become version 1.0.3 and released in June or July), there are handy functions to deal with this problem: add_labels and remove_labels.

add_labels adds back labels to a subsetted data frame based on the original data frame. And remove_labels removes all label attributes (this might be necessary when working with dplyr up to 0.4.1, dplyr sometimes throws an error when working with labelled data – this issue should be addressed for the next dplyr-update).

Losing labels during subset

library(sjmisc)
data(efc)
efc.sub <- subset(efc, subset = e16sex == 1, select = c(4:8))
str(efc.sub)

> 'data.frame':	296 obs. of  5 variables:
> $ e17age : num  74 68 80 72 94 79 67 80 76 88 ...
> $ e42dep : num  4 4 1 3 3 4 3 4 2 4 ...
> $ c82cop1: num  4 3 3 4 3 3 4 2 2 3 ...
> $ c83cop2: num  2 4 2 2 2 2 1 3 2 2 ...
> $ c84cop3: num  4 4 1 1 1 4 2 4 2 4 ...

Add back labels

efc.sub <- add_labels(efc.sub, efc)
str(efc.sub)

> 'data.frame':	296 obs. of  5 variables:
>  $ e17age : atomic  74 68 80 72 94 79 67 80 76 88 ...
>   ..- attr(*, "label")= Named chr "elder' age"
>   .. ..- attr(*, "names")= chr "e17age"
>  $ e42dep : atomic  4 4 1 3 3 4 3 4 2 4 ...
>   ..- attr(*, "label")= Named chr "how dependent is the elder? - subjective perception of carer"
>   .. ..- attr(*, "names")= chr "e42dep"
>   ..- attr(*, "labels")= Named chr  "1" "2" "3" "4"
>   .. ..- attr(*, "names")= chr  "independent" "slightly dependent" "moderately dependent" "severely dependent"

# truncated output

So, when working with labelled data, especially with data sets imported from other software packages, it comes in very handy to make use of the label attributes. The sjmisc package supports this feature and offers some useful functions for these tasks…




CONCOR in R


(This article was first published on Bad Hessian » R, and kindly contributed to R-bloggers)

In network analysis, blockmodels provide a simplified representation of a more complex relational structure. The basic idea is to assign each actor to a position and then depict the relationship between positions. In settings where relational dynamics are sufficiently routinized, the relationship between positions neatly summarizes the relationship between sets of actors. How do we go about assigning actors to positions? Early work on this problem focused in particular on the concept of structural equivalence. Formally speaking, a pair of actors is said to be structurally equivalent if they are tied to the same set of alters. Note that by this definition, a pair of actors can be structurally equivalent without being tied to one another. This idea is central to debates over the role of cohesion versus equivalence.

In practice, actors are almost never exactly structurally equivalent to one another. To get around this problem, we first measure the degree of structural equivalence between each pair of actors and then use these measures to look for groups of actors who are roughly comparable to one another. Structural equivalence can be measured in a number of different ways, with correlation and Euclidean distance emerging as popular options. Similarly, there are a number of methods for identifying groups of structurally equivalent actors. The equiv.clust routine included in the sna package in R, for example, relies on hierarchical cluster analysis (HCA). While the designation of positions is less cut and dried, one can use multidimensional scaling (MDS) in a similar manner. MDS and HCA can also be used in combination, with the former serving as a form of pre-processing. Either way, once clusters of structurally equivalent actors have been identified, we can construct a reduced graph depicting the relationship between the resulting groups.

Yet the most prominent examples of blockmodeling are built not on HCA or MDS, but on an algorithm known as CONCOR. The algorithm takes its name from the simple trick on which it is based, namely the CONvergence of iterated CORrelations. We are all familiar with the idea of using correlation to measure the similarity between columns of a data matrix. As it turns out, you can also use correlation to measure the degree of similarity between the columns of the resulting correlation matrix. In other words, you can use correlation to measure the similarity of similarities. If you repeat this procedure over and over, you eventually end up with a matrix whose entries take on one of two values: 1 or -1. The final matrix can then be permuted to produce blocks of 1s and -1s, with each block representing a group of structurally equivalent actors. Dividing the original data accordingly, each of these groups can be further partitioned to produce a more fine-grained solution.
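
As a toy sketch of that iteration (an illustration of the idea only, not the concoR package introduced below), one can repeatedly apply cor() to a correlation matrix and then split the columns using the sign pattern of the converged matrix:

## Toy CONCOR-style split: iterate correlations of correlations, then use the
## sign pattern of the (approximately +1/-1) result to form two blocks.
## Assumes the input has column names and that the iteration yields two blocks.
concor_split <- function(x, iter = 50) {
  m <- cor(x)                          # similarity between columns (actors)
  for (i in seq_len(iter)) m <- cor(m) # similarity of similarities, repeated
  split(colnames(m), m[1, ] > 0)       # block membership from the sign of row 1
}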

Insofar as CONCOR uses correlation as both a measure of structural equivalence and a means of identifying groups of structurally equivalent actors, it is easy to forget that blockmodeling with CONCOR entails the same basic steps as blockmodeling with HCA. The logic behind the two procedures is identical. Indeed, Breiger, Boorman, and Arabie (1975) explicitly describe CONCOR as a hierarchical clustering algorithm. Note, however, that when it comes to measuring structural equivalence, CONCOR relies exclusively on the use of correlation, whereas HCA can be made to work with most common measures of (dis)similarity.

Since CONCOR wasn’t available as part of the sna or igraph libraries, I decided to put together my own CONCOR routine. It could probably still use a little work in terms of things like error checking, but there is enough there to replicate the wiring room example included in the piece by Breiger et al. Check it out! The program and sample data are available on my GitHub page. If you have devtools installed, you can download everything directly using R. At the moment, the concor_hca command is only set up to handle one-mode data, though this can be easily fixed. In an earlier version of the code, I included a second function for calculating tie densities, but I think it makes more sense to use concor_hca to generate a membership vector which can then be passed to the blockmodel command included as part of the sna library.

#REPLICATE BREIGER ET AL. (1975)
#INSTALL CONCOR
devtools::install_github("aslez/concoR")

#LIBRARIES
library(concoR)
library(sna)

#LOAD DATA
data(bank_wiring)
bank_wiring

#CHECK INITIAL CORRELATIONS (TABLE III)
m0 <- cor(do.call(rbind, bank_wiring))
round(m0, 2)

#IDENTIFY BLOCKS USING A 4-BLOCK MODEL (TABLE IV)
blks <- concor_hca(bank_wiring, p = 2)
blks

#CHECK FIT USING SNA (TABLE V)
#code below fails unless glabels are specified
blk_mod <- blockmodel(bank_wiring, blks$block, 
     glabels = names(bank_wiring),
     plabels = rownames(bank_wiring[[1]]))
blk_mod
plot(blk_mod)

The results are shown below. If you click on the image, you should be able to see all the labels.

[Figure: blockmodel plot of the bank wiring room data]


Tutorial Recap: Analyzing Census Data in R


(This article was first published on Just an R Blog » R, and kindly contributed to R-bloggers)

A big thanks to Gabriela de Queiroz for organizing the San Francisco R-Ladies Meetup, where I spent a few hours yesterday introducing people to my census-related R packages. A special thanks to Sharethrough as well, for letting us use their space for the event.

It was my first time running a tutorial like this, and I spent a while thinking about how to structure it. I decided to have everyone analyze the demographics of the state, county and ZIP Code where they are from, and share the results with their neighbor. I think that this format worked well – it kept people engaged and made the material relevant to them.

Several people wanted to participate but were unable to make it. A common question was whether the event was going to be live streamed. While it was not streamed, the slides are now available on the github repo I created for the talk (see the bottom of the README file). As always, if you have any questions about the packages you can ask on the choroplethr google group.

Personally, I'd like to see more live tutorial sessions in the R community. There are tons of interesting niches in the R ecosystem that I would like to know more about. If you have a tutorial that you'd like to run, I suggest contacting the organizer of your local R meetup. If you live in the SF Bay Area, that is probably the Bay Area R User Group or R-Ladies.

The main challenge I found was getting everyone’s computer set up properly for the event. I created a script that I asked people to run before the event which installed all the necessary packages. That script helped a lot, but there were still some problems that required attention throughout the night.

Need help with an R related project? Send me an email at arilamstein@gmail.com.



Old is New: XML and rvest


(This article was first published on Jeffrey Horner, and kindly contributed to R-bloggers)

Huh… I didn’t realize just how similar rvest was to XML until I did a bit of digging.

After my wonderful experience using dplyr and tidyr recently, I decided to revisit some of my old RUNNING code and see if it could use an upgrade by swapping out the XML dependency with rvest.

Ultra Signup: Treasure Trove of Ultra Data

If you're into ultra running, then you probably know about Ultra Signup and the kinds of data you can find there: current and historical race results, lists of entrants for each upcoming race, results by runner, etc. I've done quite a bit of web scraping on their pages, and you can see some of the fun things I've done with the data over on my running blog.

rvest versus XML

This post will discuss the mechanics of using rvest vs. XML on scraping the entrants list for the upcoming Rock/Creek StumpJump 50k.

library(magrittr)
library(RCurl)
library(XML)
library(rvest)

# Entrants Page for the Rock/Creek StumpJump 50k Race
URL <- "http://ultrasignup.com/entrants_event.aspx?did=31114"

Downloading and Parsing the URL

rvest is definitely compact, using only one function. I like it.

rvest_doc <- html(URL)

XML gets its work done with the help of RCurl’s getURL function.

XML_doc   <- htmlParse(getURL(URL),asText=TRUE)

And come to find out they return the exact same classed object. I didn’t know that!

class(rvest_doc)
## [1] "HTMLInternalDocument" "HTMLInternalDocument" "XMLInternalDocument" 
## [4] "XMLAbstractDocument"
all.equal( class(rvest_doc), class(XML_doc) )
## [1] TRUE

Searching for the HTML Table

rvest seems to poo poo using xpath for selecting nodes in a DOM. Rather, they recommend using CSS selectors instead. Still, the code is nice and compact.

rvest_table_node <- html_node(rvest_doc,"table.ultra_grid")

XML here uses xpath, which I don’t think is that hard to understand once you get used to it. The only other hitch here is that we have to choose the first node returned from getNodeSet.

XML_table_node <- getNodeSet(XML_doc,'//table[@class="ultra_grid"]')[[1]]

But each still returns the exact same classed object.

class(rvest_table_node)
## [1] "XMLInternalElementNode" "XMLInternalNode"       
## [3] "XMLAbstractNode"
all.equal( class(rvest_table_node), class(XML_table_node) )
## [1] TRUE

From HTML Table to Data Frame

rvest returns a nice stringy data frame here.

rvest_table <- html_table(rvest_table_node)

While XML must submit to the camelHumpDisaster of an argument name and the factor-reviled convention of stringsAsFactors=FALSE.

XML_table <- readHTMLTable(XML_table_node, stringsAsFactors=FALSE)

Still, they return almost equal data frames.

all.equal(rvest_table,XML_table)
## [1] "Component "Results": Modes: numeric, character"              
## [2] "Component "Results": target is numeric, current is character"
all.equal( rvest_table$Results, as.integer(XML_table$Results) )
## [1] TRUE

Magrittr For More Elegance

Adding in the way cool magrittr pipe system makes rvest really shine in compactness.

rvest_table <- html(URL) %>% html_node("table.ultra_grid") %>% html_table()

While XML is not as elegant, having to use named arguments in getNodeSet and exposing the internal function .subset2.

XML_table <- htmlParse(getURL(URL),asText=TRUE) %>% 
                getNodeSet(path='//table[@class="ultra_grid"]') %>%
                .subset2(n=1) %>% 
                readHTMLTable(stringsAsFactors=FALSE)

Summing Things Up

rvest is definitely elegant and compact syntactic sugar, which I'm drawn to these days. But scraping web pages reveals the dirtiest data among dirty data, and for now I think I'll stick with the power of XML over syntactic sugar.

Meh… who am I kidding, I’m just lazy. And old.


My New Book and Other Matters


I haven’t posted for a while, so here are some news items:

  • My new book, Parallel Computation for Data Science, will be out in June or July. I believe it will be useful to anyone doing computationally intensive work.
  • After a few months being busy with the book and other things, I have returned to my Snowdoop project and my associated package, partools. I recently gave a talk on Snowdoop to the Berkeley R Group, and you will be hearing more here on this blog.
  • I’ve begun a new book on regression/predictive analytics, finishing two chapters so far.
  • I’ve been doing some research on the missing-value problem, and am developing a package for it too.

Lots of fun stuff to do these days. :-)

R Now Contains 150 Times as Many Commands as SAS


(This article was first published on r4stats.com » R, and kindly contributed to R-bloggers)

by Bob Muenchen

In my ongoing quest to analyze the world of analytics, I’ve updated the Growth in Capability section of The Popularity of Data Analysis Software. To save you the trouble of foraging through that tome, I’ve pasted it below.

Growth in Capability

The capability of analytics software has grown significantly over the years. It would be helpful to be able to plot the growth of each software package’s capabilities, but such data are hard to obtain. John Fox (2009) acquired them for R’s main distribution site http://cran.r-project.org/, and I collected the data for later versions following his method.

Figure 9 shows the number of R packages on CRAN for the last version released in each year. The growth curve follows a rapid parabolic arc (quadratic fit with R-squared=.995). The right-most point is for version 3.1.2, the last version released in late 2014.


Figure 9. Number of R packages available on its main distribution site for the last version released in each year.

To put this astonishing growth in perspective, let us compare it to the most dominant commercial package, SAS. In version 9.3, SAS contained around 1,200 commands that are roughly equivalent to R functions (procs, functions, etc. in Base, Stat, ETS, HP Forecasting, Graph, IML, Macro, OR, QC). In 2014, R added 1,357 packages, counting only CRAN, or approximately 27,642 functions. During 2014 alone, R added more functions/procs than SAS Institute has written in its entire history.

Of course, while SAS and R commands solve many of the same problems, they are certainly not perfectly equivalent. Some SAS procedures have many more options to control their output than R functions do, so one SAS procedure may be equivalent to many R functions. On the other hand, R functions can nest inside one another, creating nearly infinite combinations. SAS is now out with version 9.4, and I have not repeated the arduous task of recounting its commands. If SAS Institute would provide the figure, I would include it here. While the comparison is far from perfect, it does provide an interesting perspective on the size and growth rate of R.

As rapid as R's growth has been, these data represent only the main CRAN repository. R has eight other software repositories, such as Bioconductor, that are not included in Figure 9. A program run on 5/22/2015 counted 8,954 R packages at all major repositories, 6,663 of which were at CRAN. (I excluded the GitHub repository since it contains duplicates of CRAN packages that I could not easily remove.) So the growth curve for the software at all repositories would be approximately 34.4% higher on the y-axis than the one shown in Figure 9. Therefore, the estimated total growth in R functions for 2014 was 28,260 * 1.344, or 37,981.

As with any analysis software, individuals also maintain their own separate collections typically available on their web sites. However, those are not easily counted.

What's the total number of R functions? The Rdocumentation site shows the latest counts of both packages and functions on CRAN. It indicates that there is an average of 20.37 functions per package. Since a program run on 5/22/2015 counted 8,954 R packages at all major repositories, on that date there were approximately 182,393 total functions in R. In total, R has over 150 times as many commands as SAS.
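
For transparency, the arithmetic behind that estimate is simply (a trivial sketch using the figures quoted above):

pkgs_all    <- 8954    # packages at all major repositories (5/22/2015)
fns_per_pkg <- 20.37   # average functions per package (Rdocumentation)
round(pkgs_all * fns_per_pkg)   # roughly 182,393 total R functions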

I invite you to follow me here or at http://twitter.com/BobMuenchen. If you’re interested in learning R, DataCamp.com offers my 16-hour interactive workshop, R for SAS, SPSS and Stata Users for $25. That’s a monthly fee, but it definitely won’t take you a month to take it!  For students & academics, it’s $9. I also do R training on-site.



visNetwork, Currencies, and Minimum Spanning Trees


(This article was first published on Timely Portfolio, and kindly contributed to R-bloggers)
Just because I'm ignorant doesn't mean I won't try things. Feel free to correct any ignorance that follows. More than anything, I would like to feature the new htmlwidget visNetwork. I thought the example from Minimum Spanning Trees in R, applied to currency data (similar to the research paper Minimum Spanning Tree Application in the Currency Market), would be a good way to demonstrate it.


Simulation-based power analysis using proportional odds logistic regression


(This article was first published on BioStatMatt » R, and kindly contributed to R-bloggers)

Consider planning a clinical trial where patients are randomized in permuted blocks of size four to either a 'control' or 'treatment' group. The outcome is measured on an 11-point ordinal scale (e.g., the numerical rating scale for pain). It may be reasonable to evaluate the results of this trial using a proportional odds cumulative logit model (POCL), that is, if the proportional odds assumption is valid. The POCL model uses a series of 'intercept' parameters, denoted $\alpha_1 \leq \ldots \leq \alpha_{r-1}$, where $r$ is the number of ordered categories, and 'slope' parameters $\beta_1, \ldots, \beta_m$, where $m$ is the number of covariates. The intercept parameters encode the 'baseline', or control group, frequencies of each category, and the slope parameters represent the effects of covariates (e.g., the treatment effect).

A Monte-Carlo simulation can be implemented to study the effects of the control group frequencies, the odds ratio associated with treatment allocation (i.e., the 'treatment effect'), and sample size on the power or precision associated with a null hypothesis test or confidence interval for the treatment effect.

In order to simulate this process, it's necessary to specify each of the following:

  1. control group frequencies
  2. treatment effect
  3. sample size
  4. testing or confidence interval procedure

Ideally, the control group frequencies would be informed by preliminary data, but expert opinion can also be useful. Once specified, the control group frequencies can be converted to intercepts in the POCL model framework. There is an analytical solution for this; see the link above. But, a quick and dirty method is to simulate a large sample from the control group population, and then fit an intercept-only POCL model to those data. The code below demonstrates this, using the polr function from the MASS package.


## load MASS for polr()
library(MASS)
## specify frequencies of 11 ordered categories
prbs <- c(1,5,10,15,20,40,60,80,80,60,40)
prbs <- prbs/sum(prbs)
## sample 1000 observations with probabilities prbs
resp <- factor(replicate(1000, sample(0:10, 1, prob=prbs)),
               ordered=TRUE, levels=0:10)
## fit POCL model; extract intercepts (zeta here)
alph <- polr(resp~1)$zeta

As in most other types of power analysis, the treatment effect can represent the minimum effect that the study should be designed to detect with a specified degree of power; or in a precision analysis, the maximum confidence interval width in a specified fraction of samples. In this case, the treatment effect is encoded as a log odds ratio, i.e., a slope parameter in the POCL model.

Given the intercept and slope parameters, observations from the POCL model can be simulated with permuted block randomization in blocks of size four to one of two treatment groups as follows:


## convenience functions
logit <- function(p) log(1/(1/p-1))
expit <- function(x) 1/(1/exp(x) + 1)

## block randomization
## n - number of randomizations
## m - block size
## levs - levels of treatment
block_rand <- function(n, m, levs=LETTERS[1:m]) {
  if(m %% length(levs) != 0)
    stop("length(levs) must be a factor of 'm'")
  k <- if(n%%m > 0) n%/%m + 1 else n%/%m
  l <- m %/% length(levs)
  factor(c(replicate(k, sample(rep(levs,l),
    length(levs)*l, replace=FALSE))),levels=levs)
}

## simulate from POCL model
## n - sample size
## a - alpha
## b - beta
## levs - levels of outcome
pocl_simulate <- function(n, a, b, levs=0:length(a)) {
  dat <- data.frame(Treatment=block_rand(n,4,LETTERS[1:2])) 
  des <- model.matrix(~ 0 + Treatment, data=dat)
  nlev <- length(a) + 1
  yalp <- c(-Inf, a, Inf)
  xbet <- matrix(c(rep(0, nrow(des)),
                   rep(des %*% b , nlev-1),
                   rep(0, nrow(des))), nrow(des), nlev+1)
  prbs <- sapply(1:nlev, function(lev) {
    yunc <- rep(lev, nrow(des))
    expit(yalp[yunc+1] - xbet[cbind(1:nrow(des),yunc+1)]) - 
      expit(yalp[yunc]   - xbet[cbind(1:nrow(des),yunc)])
  })
  colnames(prbs) <- levs
  dat$y <- apply(prbs, 1, function(p) sample(levs, 1, prob=p))
  dat$y <- unname(factor(dat$y, levels=levs, ordered=TRUE))
  return(dat)
}

The testing procedure we consider here is a likelihood ratio test with 5% type-I error rate:


## Likelihood ratio test with 0.05 p-value threshold
## block randomization in blocks of size four to one
## of two treatment groups
## dat - data from pocl_simulate
pocl_test <- function(dat) {
  fit <- polr(y~Treatment, data=dat)
  anova(fit, update(fit, ~.-Treatment))$"Pr(Chi)"[2] < 0.05
}

The code below demonstrates the calculation of statistical power associated with a sample of size 50 and odds ratio 0.25, where the control group frequencies of each category are as specified above. When executed, which takes some time, this gives about 80% power.


## power: n=50, OR=0.25
mean(replicate(10000, pocl_test(pocl_simulate(50, a=alph, b=c(0, log(0.25))))))

The figure below illustrates the power associated with a sequence of odds ratios. The dashed line represents the nominal type-I error rate 0.05.

[Figure: power curve across a sequence of odds ratios; the dashed line marks the nominal type-I error rate 0.05]
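
A hedged sketch (not part of the original post) of how such a curve could be produced with the functions defined above, using an illustrative grid of odds ratios and fewer Monte Carlo replicates to keep the run time manageable:

## Illustrative power curve at n=50, reusing alph, pocl_simulate() and pocl_test()
ors <- c(0.15, 0.25, 0.4, 0.6, 0.8, 1)
pow <- sapply(ors, function(or)
  mean(replicate(1000, pocl_test(pocl_simulate(50, a=alph, b=c(0, log(or)))))))
plot(ors, pow, type="b", xlab="Odds ratio (treatment)", ylab="Power")
abline(h=0.05, lty=2)  # nominal type-I error rate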

Simulation-based power and precision analysis is a very powerful technique, which ensures that the reported statistical power reflects the intended statistical analysis (oftentimes in research proposals, the proposed statistical analysis is not the same as that used to evaluate statistical power). In addition to the simple analysis described above, it is also possible to evaluate an adjusted analysis, i.e., the power to detect a treatment effect after adjustment for covariate effects. Of course, this requires that the latter effects be specified, and that there is some mechanism to simulate covariates. This can be a difficult task, but it makes clear that there are many assumptions involved in a realistic power analysis.

Another advantage to simulation-based power analysis is that it requires implementation of the planned statistical procedure before the study begins, which ensures its feasibility and provides an opportunity to consider details that might otherwise be overlooked. Of course, it may also accelerate the 'real' analysis, once the data are collected.

Here is the complete R script:


## load MASS for polr()
library(MASS)
## specify frequencies of 11 ordered categories
prbs <- c(1,5,10,15,20,40,60,80,80,60,40)
prbs <- prbs/sum(prbs)
## sample 1000 observations with probabilities prbs
resp <- factor(replicate(1000, sample(0:10, 1, prob=prbs)),
               ordered=TRUE, levels=0:10)
## fit POCL model; extract intercepts (zeta here)
alph <- polr(resp~1)$zeta


## convenience functions
logit <- function(p) log(1/(1/p-1))
expit <- function(x) 1/(1/exp(x) + 1)

## block randomization
## n - number of randomizations
## m - block size
## levs - levels of treatment
block_rand <- function(n, m, levs=LETTERS[1:m]) {
  if(m %% length(levs) != 0)
    stop("length(levs) must be a factor of 'm'")
  k <- if(n%%m > 0) n%/%m + 1 else n%/%m
  l <- m %/% length(levs)
  factor(c(replicate(k, sample(rep(levs,l),
                               length(levs)*l, replace=FALSE))),levels=levs)
}

## simulate from POCL model
## n - sample size
## a - alpha
## b - beta
## levs - levels of outcome
pocl_simulate <- function(n, a, b, levs=0:length(a)) {
  dat <- data.frame(Treatment=block_rand(n,4,LETTERS[1:2])) 
  des <- model.matrix(~ 0 + Treatment, data=dat)
  nlev <- length(a) + 1
  yalp <- c(-Inf, a, Inf)
  xbet <- matrix(c(rep(0, nrow(des)),
                   rep(des %*% b , nlev-1),
                   rep(0, nrow(des))), nrow(des), nlev+1)
  prbs <- sapply(1:nlev, function(lev) {
    yunc <- rep(lev, nrow(des))
    expit(yalp[yunc+1] - xbet[cbind(1:nrow(des),yunc+1)]) - 
      expit(yalp[yunc]   - xbet[cbind(1:nrow(des),yunc)])
  })
  colnames(prbs) <- levs
  dat$y <- apply(prbs, 1, function(p) sample(levs, 1, prob=p))
  dat$y <- unname(factor(dat$y, levels=levs, ordered=TRUE))
  return(dat)
}

## Likelihood ratio test with 0.05 p-value threshold
## block randomization in blocks of size four to one
## of two treatment groups
## dat - data from pocl_simulate
pocl_test <- function(dat) {
  fit <- polr(y~Treatment, data=dat)
  anova(fit, update(fit, ~.-Treatment))$"Pr(Chi)"[2] < 0.05
}

## power: n=50, OR=0.25
mean(replicate(10000, pocl_test(pocl_simulate(50, a=alph, b=c(0, log(0.25))))))

To leave a comment for the author, please follow the link and comment on his blog: BioStatMatt » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Parametric Inference: Likelihood Ratio Test Problem 2


(This article was first published on Analysis with Programming, and kindly contributed to R-bloggers)
More on the Likelihood Ratio Test: the following problem is originally from Casella and Berger (2001), exercise 8.12.

Problem

For samples of size $n=1,4,16,64,100$ from a normal population with mean $\mu$ and known variance $\sigma^2$, plot the power function of the following LRTs (Likelihood Ratio Tests). Take $\alpha = .05$.
  1. $H_0:\mu\leq 0$ versus $H_1:\mu>0$
  2. $H_0:\mu=0$ versus $H_1:\mu\neq 0$

Solution

  1. The LRT statistic is given by $$ \lambda(\mathbf{x})=\frac{\displaystyle\sup_{\mu\leq 0}\mathcal{L}(\mu|\mathbf{x})}{\displaystyle\sup_{-\infty<\mu<\infty}\mathcal{L}(\mu|\mathbf{x})}, \quad\text{since }\sigma^2\text{ is known}. $$ The denominator can be expanded as follows: $$ \begin{aligned} \sup_{-\infty<\mu<\infty}\mathcal{L}(\mu|\mathbf{x})&=\sup_{-\infty<\mu<\infty}\prod_{i=1}^{n}\frac{1}{\sqrt{2\pi}\sigma}\exp\left[-\frac{(x_i-\mu)^2}{2\sigma^2}\right]\\ &=\sup_{-\infty<\mu<\infty}\frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left[-\sum_{i=1}^{n}\frac{(x_i-\mu)^2}{2\sigma^2}\right]\\ &=\frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left[-\sum_{i=1}^{n}\frac{(x_i-\bar{x})^2}{2\sigma^2}\right], \quad\text{since }\bar{x}\text{ is the MLE of }\mu,\\ &=\frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left[-\frac{(n-1)s^2}{2\sigma^2}\right], \end{aligned} $$ while the numerator is evaluated as follows: $$ \begin{aligned} \sup_{\mu\leq 0}\mathcal{L}(\mu|\mathbf{x})&=\sup_{\mu\leq 0}\prod_{i=1}^{n}\frac{1}{\sqrt{2\pi}\sigma}\exp\left[-\frac{(x_i-\mu)^2}{2\sigma^2}\right]\\ &=\sup_{\mu\leq 0}\frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left[-\sum_{i=1}^{n}\frac{(x_i-\mu)^2}{2\sigma^2}\right]. \end{aligned} $$ This expression attains its maximum when the value inside the exponential is small; for negative values of $\mu\in(-\infty,0)$ the quantity $(x_i-\mu)^2$ becomes large, which makes the exponential term small. Therefore, the value that gives the supremum of the likelihood is $\mu=\mu_0=0$. Hence, $$ \begin{aligned} \sup_{\mu\leq 0}\mathcal{L}(\mu|\mathbf{x})&=\frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left[-\sum_{i=1}^{n}\frac{(x_i-\mu_0)^2}{2\sigma^2}\right]\\ &=\frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left[-\sum_{i=1}^{n}\frac{(x_i-\bar{x}+\bar{x}-\mu_0)^2}{2\sigma^2}\right]\\ &=\frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left\{-\sum_{i=1}^{n}\left[\frac{(x_i-\bar{x})^2+2(x_i-\bar{x})(\bar{x}-\mu_0)+(\bar{x}-\mu_0)^2}{2\sigma^2}\right]\right\}\\ &=\frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left[-\frac{(n-1)s^2+n(\bar{x}-\mu_0)^2}{2\sigma^2}\right], \quad\text{since the cross term sums to zero},\\ &=\frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left[-\frac{(n-1)s^2+n\bar{x}^2}{2\sigma^2}\right], \quad\text{since }\mu_0=0. \end{aligned} $$ So that $$ \lambda(\mathbf{x})=\frac{\exp\left[-\frac{(n-1)s^2+n\bar{x}^2}{2\sigma^2}\right]}{\exp\left[-\frac{(n-1)s^2}{2\sigma^2}\right]} =\exp\left[-\frac{n\bar{x}^2}{2\sigma^2}\right]. $$ And we reject the null hypothesis if $\lambda(\mathbf{x})\leq c$, that is, $$ \begin{aligned} \exp\left[-\frac{n\bar{x}^2}{2\sigma^2}\right]&\leq c\\ -\frac{n\bar{x}^2}{2\sigma^2}&\leq \log c\\ \frac{\lvert\bar{x}\rvert}{\sigma/\sqrt{n}}&\geq\sqrt{-2\log c}=c'. \end{aligned} $$
    Figure 1: Plot of Likelihood Ratio Test Statistic for $n = 4,\sigma = 1$.

    Hence, rejecting the null hypothesis when $\lambda(\mathbf{x})\leq c$ is equivalent to rejecting $H_0$ when $\frac{\bar{x}}{\sigma/\sqrt{n}}\geq c'$ for some $c'\in[0,\infty)$. Figure 1 depicts the LRT statistic; the shaded (rejection) region lies on the positive side because that is where the alternative $H_1:\mu>0$ lives: if the LRT is small enough to reject $H_0$, then parameter values in the alternative explain the sample better than the null does, and so the sample mean $\bar{x}$, an unbiased estimator of the population mean $\mu$ and a function of the LRT statistic, should fall on the alternative's side (the shaded region).

    So that the power function, i.e., the probability of rejecting the null hypothesis as a function of $\mu$, is $$ \begin{aligned} \beta(\mu)&=\mathrm{P}\left[\frac{\bar{x}-\mu_0}{\sigma/\sqrt{n}}\geq c'\right],\quad\mu_0=0\\ &=1-\mathrm{P}\left[\frac{\bar{x}+\mu-\mu-\mu_0}{\sigma/\sqrt{n}}< c'\right]\\ &=1-\mathrm{P}\left[\frac{\bar{x}-\mu}{\sigma/\sqrt{n}}+\frac{\mu-\mu_0}{\sigma/\sqrt{n}}< c'\right]\\ &=1-\mathrm{P}\left[\frac{\bar{x}-\mu}{\sigma/\sqrt{n}}< c'+\frac{\mu_0-\mu}{\sigma/\sqrt{n}}\right]\\ &=1-\Phi\left[c'+\frac{\mu_0-\mu}{\sigma/\sqrt{n}}\right]. \end{aligned} $$ As $\mu$ increases the argument of $\Phi$ decreases, so $\beta(\mu)$ is an increasing function of $\mu$. Hence for $\alpha=.05$, $$ \begin{aligned} \alpha&=\sup_{\mu\leq \mu_0}\beta(\mu)\\ .05&=\beta(\mu_0)=1-\Phi(c')\\ .95&=\Phi(c')\Rightarrow c'=1.645, \end{aligned} $$ since $$ \Phi(1.645)=\int_{-\infty}^{1.645}\frac{1}{\sqrt{2\pi}}\exp\left[-\frac{x^2}{2}\right]\operatorname{d}x=.9500151. $$ Therefore, for $c'=1.645,\mu_0=0,\sigma=1$, the plot of the power function as a function of $\mu$ for different sample sizes $n$ is shown in Figure 2. For example, for $n=1$ we compute $$ \begin{aligned} \beta(\mu)&=1-\Phi\left[c'+\frac{\mu_0-\mu}{\sigma/\sqrt{n}}\right]\\ &=1-\Phi\left[1.645+\frac{0-\mu}{1/\sqrt{1}}\right]\\ &=1-\int_{-\infty}^{1.645+\frac{0-\mu}{1/\sqrt{1}}}\frac{1}{\sqrt{2\pi}}\exp\left[-\frac{x^2}{2}\right]\operatorname{d}x, \end{aligned} $$ the obtained values being the $y$-coordinates of the curve. For $n=64$, $$ \beta(\mu)=1-\Phi\left[1.645+\frac{0-\mu}{1/\sqrt{64}}\right] =1-\int_{-\infty}^{1.645+\frac{0-\mu}{1/\sqrt{64}}}\frac{1}{\sqrt{2\pi}}\exp\left[-\frac{x^2}{2}\right]\operatorname{d}x, $$ and so on.
    Figure 2: Power Function for Different Values of $n$.

  2. The LRT statistic is given by $$ \lambda(\mathbf{x})=\frac{\displaystyle\sup_{\mu=0}\mathcal{L}(\mu|\mathbf{x})}{\displaystyle\sup_{-\infty<\mu<\infty}\mathcal{L}(\mu|\mathbf{x})}. $$ The numerator is simply the likelihood evaluated at $\mu=\mu_0=0$, so the algebra of part 1 carries over unchanged: again $\lambda(\mathbf{x})=\exp\left[-\frac{n\bar{x}^2}{2\sigma^2}\right]$, with a rejection region of the form $\frac{\lvert\bar{x}\rvert}{\sigma/\sqrt{n}}\geq c'$.
    Figure 3: Plot of Likelihood Ratio Test Statistic for $n = 4,\sigma = 1$.
    So that the power function is $$ \begin{aligned} \beta(\mu)&=\mathrm{P}\left[\frac{\lvert\bar{x}\rvert}{\sigma/\sqrt{n}}\geq c'\right]\\ &=1-\mathrm{P}\left[\frac{\lvert\bar{x}\rvert}{\sigma/\sqrt{n}}< c'\right]\\ &=1-\mathrm{P}\left[-c'<\frac{\bar{x}-\mu_0}{\sigma/\sqrt{n}}< c'\right]\\ &=1-\left\{\Phi\left[c'+\frac{\mu_0-\mu}{\sigma/\sqrt{n}}\right]-\Phi\left[-c'+\frac{\mu_0-\mu}{\sigma/\sqrt{n}}\right]\right\}. \end{aligned} $$
    Figure 4: Two-Sided Power Function for Different $n$.
    The points in the plots are computed by substituting the values $\mu_0=0$, $\sigma=1$, and each $n$ into the power function, just as we did above for the one-sided case; a short R sketch reproducing both power curves follows.
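
For completeness, here is a short R sketch of mine (not from the original post) that draws both power curves; for the two-sided test I take c' = 1.96, the usual two-sided .05 critical value.

## power curves for n = 1, 4, 16, 64, 100, sigma = 1, mu_0 = 0
mu <- seq(-1.5, 2.5, by=0.01); sigma <- 1; n_all <- c(1, 4, 16, 64, 100)

## one-sided LRT, c' = 1.645 (Figure 2)
plot(NULL, xlim=range(mu), ylim=c(0, 1),
     xlab=expression(mu), ylab=expression(beta(mu)))
for (n in n_all)
  lines(mu, 1 - pnorm(1.645 + (0 - mu)/(sigma/sqrt(n))))
abline(h=0.05, lty=2)  # alpha = .05

## two-sided LRT, c' = 1.96 (Figure 4)
plot(NULL, xlim=range(mu), ylim=c(0, 1),
     xlab=expression(mu), ylab=expression(beta(mu)))
for (n in n_all)
  lines(mu, 1 - (pnorm( 1.96 + (0 - mu)/(sigma/sqrt(n))) -
                 pnorm(-1.96 + (0 - mu)/(sigma/sqrt(n)))))
abline(h=0.05, lty=2)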

Reference

  1. Casella, G. and Berger, R.L. (2001). Statistical Inference. Thomson Learning, Inc.
  2. Felix Schönbrodt. Shading regions of the normal: The Stanine scale. Retrieved May 2015.

To leave a comment for the author, please follow the link and comment on his blog: Analysis with Programming.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

My favorite R bug


In this note I am going to recount “my favorite R bug.” It isn’t a bug in R. It is a bug in some code I wrote in R. I call it my favorite bug, as it is easy to commit and (thanks to R’s overly helpful nature) takes longer than it should to find.


The original problem I was working on was generating a training set as a subset of a simple data frame.

# read our data
tf <- read.table('tf.csv.gz',header=TRUE,sep=',')
print(summary(tf))
##        x                y          
##  Min.   :-0.05075   Mode :logical  
##  1st Qu.:-0.01739   FALSE:37110    
##  Median : 0.01406   TRUE :2943     
##  Mean   : 0.00000   NA's :0        
##  3rd Qu.: 0.01406                  
##  Max.   : 0.01406
# Set our random seed to our last state for 
# reproducibility.  I initially did not set the
# seed, I was just using R version 3.2.0 (2015-04-16) -- "Full of Ingredients"
# on OSX 10.10.3
# But once I started seeing the effect, I saved the state for
# reproducibility.
.Random.seed = readRDS('Random.seed')

# For my application tf was a data frame with a modeling
# variable x (floating point) and an outcome y (logical).
# I wanted a training sample that was non-degenerate
# (has variation in both x and y) and I thought I would
# find such a sample by using rbinom(nrow(tf),1,0.5)
# to pick random training sets and then inspect I had 
# a nice training set (and had left at least one row out
# for test)
goodTrainingSample <- function(selection) {
  (sum(selection)>0) && (sum(selection)<nrow(tf)) &&
    (max(tf$x[selection])>min(tf$x[selection])) &&
    (max(tf$y[selection])>min(tf$y[selection]))
}

# run my selection
sel <- rbinom(nrow(tf),1,0.5)
summary(sel)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.4987  1.0000  1.0000
sum(sel)
## [1] 19974

Now I used rbinom(nrow(tf),1,0.5) (which gives a sample that should be about half the data) instead of sample.int(nrow(tf),floor(nrow(tf)/2)) because I had been intending to build a model on the training data and then score that on the hold-out data. So I thought referring to the test set as !sel instead of setdiff(seq_len(nrow(tf)),sel) would be convenient.

# and it turns out to not be a good training set
print(goodTrainingSample(sel))
## [1] FALSE
# one thing that failed is y is a constant on this subset
print(max(tf$y[sel])>min(tf$y[sel]))
## [1] FALSE
print(summary(tf[sel,]))
##        x               y          
##  Min.   :0.01406   Mode :logical  
##  1st Qu.:0.01406   FALSE:19974    
##  Median :0.01406   NA's :0        
##  Mean   :0.01406                  
##  3rd Qu.:0.01406                  
##  Max.   :0.01406
# Whoops! everything is constant on the subset!

# okay no, problem that is why we figured we might have to
# generate and test multiple times.

But wait, let's bound the odds of failing. Even missing only the “y varies” condition is so unlikely that we should never expect to see it happen: y is TRUE 2943 times, so the probability of missing all of the TRUE values when picking each row with 50/50 probability is exactly 2^(-2943), or about one chance in 10^885.
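
A quick sanity check of that arithmetic on the log10 scale (2^-2943 itself underflows to zero in double precision):

## log10 of the probability of a 50/50 sample missing all 2943 TRUE rows
-2943 * log10(2)   # about -885.9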

We have a bug. Here is some excellent advice on debugging:

“Finding your bug is a process of confirming the many things that you believe are true — until you find one which is not true.” —Norm Matloff

We saved the state of the pseudo-random number generator, as it would be treacherous to try to debug something it is involved with without first having saved its state. But that doesn't mean we are accusing the pseudo-random number generator (though one does wonder; it is common for some poor pseudo-random generators to alternate the lower bit in some situations). Let's instead work through our example carefully. Other people have used R and our code is new, so we really want to look at our own assumptions and actions. Our big assumption was that we called rbinom() correctly and got a usable selection. We even called summary(sel) to check that sel was near 50/50. But wait: that summary doesn't look quite right. You can sum() logicals, but they have a slightly different summary.

str(sel)
##  int [1:40053] 1 1 0 1 1 0 0 1 1 1 ...

Aha! sel is a vector of integers, not logicals. That makes sense: it represents how many successes you get in 1 trial for each row. So using it as an index doesn't give us a sample of 19974 rows, but instead 19974 copies of the first row. But what about the zeros?

tf[0,]
## [1] x y
## <0 rows> (or 0-length row.names)

Ah, yet another gift from R’s irregular bracket operator. I admit, I messed up and gave a vector of integers where I meant to give a vector of logicals. However, R didn’t help me by signaling the problem, even though many of my indices were invalid. Instead of throwing an exception, or warning, or returning NA, it just does nothing (which delayed our finding our own mistake).
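
A tiny standalone illustration of the difference (a toy vector, not the original data):

v <- c(10, 20, 30)
v[c(1, 1, 0)]            # integer index: element 1 twice, the 0 silently dropped
v[c(TRUE, TRUE, FALSE)]  # logical index: the subset we actually wanted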

The fix is to calculate sel as one of:

Binomial done right.

sel <- rbinom(nrow(tf),1,0.5)>0
test <- !sel
summary(sel)
##    Mode   FALSE    TRUE    NA's 
## logical   19860   20193       0
summary(test)
##    Mode   FALSE    TRUE    NA's 
## logical   20193   19860       0

Cutting a uniform sample.

sel <- runif(nrow(tf))>=0.5
test <- !sel
summary(sel)
##    Mode   FALSE    TRUE    NA's 
## logical   20061   19992       0
summary(test)
##    Mode   FALSE    TRUE    NA's 
## logical   19992   20061       0

Or, set of integers.

sel <- sample.int(nrow(tf),floor(nrow(tf)/2))
test <- setdiff(seq_len(nrow(tf)),sel)
summary(sel)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       2   10030   20030   20050   30100   40050
summary(test)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1   10000   20020   20000   29980   40050

Wait, does that last example say that sel and test have the same max (40050) and therefore share an element? They were supposed to be disjoint.

max(sel)
## [1] 40053
max(test)
## [1] 40051
str(sel)
##  int [1:20026] 23276 3586 32407 33656 21518 14269 22146 25252 12882 4564 ...
str(test)
##  int [1:20027] 1 3 4 5 8 12 14 15 17 18 ...

Oh, it is just summary() displaying our numbers to only four significant figures, even though they are in fact integers, and without warning us by switching to scientific notation.
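
The rounding is easy to reproduce on a toy vector (summary() passes its results through signif() with four significant digits under the default options):

x <- c(1L, 40053L)
summary(x)   # Max. prints as 40050
max(x)       # 40053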

Don’t get me wrong: I love R and it is my first choice for analysis. But I wish it had simpler-to-explain semantics (not so many weird cases in the bracket operator), signaled errors much closer to where you make them (cutting down how far you have to look and how many obvious assumptions you have to test when debugging), and was a bit more faithful in how it displays data (I don’t like it claiming a vector of integers has a maximum value of 40050 when 40053 is in fact in the list).

One could say “just be more careful and don’t write bugs.” I am careful and I do write few bugs, but more importantly I find them quickly because I check a lot of my intermediate results. I write about them as I research new ways to prevent and detect them quickly.

You are going to have to write and debug code to work as a data scientist; just understand that time spent debugging is not time spent on analysis. So you want to make bugs hard to write, and easy to find and fix.

(Original knitr source here)

another viral math puzzle


(This article was first published on Xi'an's Og » R, and kindly contributed to R-bloggers)

After the Singapore Maths Olympiad birthday problem that went viral, here is a Vietnamese primary school puzzle that made the front page of The Guardian. The question is: fill the empty slots with all integers from 1 to 9 for the equality to hold. In other words, find a,b,c,d,e,f,g,h,i such that

a+13xb:c+d+12xe-f-11+gxh:i-10=66.

With presumably the operation ordering corresponding to

a+(13xb:c)+d+(12xe)-f-11+(gxh:i)-10=66

although this is not specified in the question. Which amounts to

a+(13xb:c)+d+(12xe)-f+(gxh:i)=87

and implies that c divides b and i divides gxh. Rather than pursuing this analytical quest further, I resorted to R coding, checking by brute force whether or not a given sequence was working.

baoloc=function(ord=sample(1:9)){
if (ord[1]+(13*ord[2]/ord[3])+ord[4]+
12*ord[5]-ord[6]-11+(ord[7]*ord[8]/
ord[9])-10==66) return(ord)}

I then applied this function to all permutations of {1,…,9} [with the help of the permn() function from the combinat package] and found 128 distinct solutions, including some for which b:c is not an integer. (None of this obviously gives a hint as to how an 8-year-old could solve the puzzle.)
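
For completeness, the brute-force enumeration can be written as below (a sketch of mine, using permn() from the combinat package rather than the original code):

library(combinat)
sols <- Filter(Negate(is.null), lapply(permn(1:9), baoloc))
length(sols)                  # number of solutions found this way
do.call(rbind, sols)[1:5, ]   # inspect a few of them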


Filed under: Books, Kids, R, University life Tagged: mathematical puzzle, permutation, primary school, The Guardian, Vietnam

To leave a comment for the author, please follow the link and comment on his blog: Xi'an's Og » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Geomorph update 2.1.5 Now Available!


(This article was first published on geomorph, and kindly contributed to R-bloggers)
Geomorph users,

We have uploaded version 2.1.5 of geomorph* to CRAN. The Windows and Mac binaries have been compiled and the tarball is available.

New Features:

  • New Auto Mode allows users to include pre-digitized landmarks when using build.template() and digitsurface()
  • New gridPar() function to customize the plots produced by plotRefToTarget()
  • New digit.curves() function to calculate equidistant semilandmarks along 2D and 3D curves (based on the tpsDIG algorithm for 2D curves)
  • define.sliders() is a new interactive function for defining sliding semilandmarks for 2D and 3D curves, plus an automatic mode when given a sequence of semilandmarks along a curve
  • plotGMPhyloMorphoSpace() now has options to customise the plots

Important Bug Fixes:

  • Corrected an error in plotAllometry() where verbose=T did not return

Other Changes:

  • pairwiseD.test() and pairwise.slope.test() deprecated and replaced by advanced.procD.lm()
  • Read functions now allow both tab and space delimited files
  • define.sliders.2d() and define.sliders.3d() deprecated and replaced by define.sliders()

Emma

* geomorph: Geometric Morphometric Analyses of 2D/3D Landmark Data

Read, manipulate, and digitize landmark data, generate shape variables via Procrustes analysis for points, curves and surfaces, perform shape analyses, and provide graphical depictions of shapes and patterns of shape variation.

To leave a comment for the author, please follow the link and comment on his blog: geomorph.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Paper Helicopter experiment, part II


(This article was first published on Wiekvoet, and kindly contributed to R-bloggers)
Last week I created a JAGS model combining data from two paper helicopter datasets. This week, I will use the model to find the longest flying one.

Predicting

The JAGS/RJAGS system has no predict() function that I know of. What I therefore did was adapt the model so that the predictions were made during estimation of the parameters. Using this adapted model, two prediction steps were made.
In step one, predictions were made across the whole design space. To keep the number of predictions at least somewhat limited, only a few levels were used for the continuous variables. This step was used to select the best region within the whole space. Step two focuses on that best region and provides more detailed predictions.
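
The mechanism, stripped to its essentials in a toy linear model of my own (an illustration, not the helicopter model): stack the prediction grid under the observed data, define mu[i] for every row, but give a likelihood only to the observed rows; monitoring the trailing mu values then yields posterior predictions.

## toy illustration of predicting by monitoring mu (not the helicopter model)
library(R2jags)
x_obs <- runif(20); y_obs <- 1 + 2*x_obs + rnorm(20, 0, 0.2)
x_new <- seq(0, 1, by=0.25)                    # prediction grid
datain <- list(x=c(x_obs, x_new), y=y_obs,
    n=length(x_obs), m=length(x_new))
toymodel <- function() {
  for (i in 1:(n+m)) { mu[i] <- a + b*x[i] }   # mu defined for all rows
  for (i in 1:n) { y[i] ~ dnorm(mu[i], tau) }  # likelihood only for observed rows
  a ~ dnorm(0, 0.01)
  b ~ dnorm(0, 0.01)
  tau <- pow(sdev, -2)
  sdev ~ dunif(0, 10)
  for (j in 1:m) { pred[j] <- mu[n+j] }
}
fit <- jags(model.file=toymodel, data=datain,
    parameters.to.save=c('a', 'b', 'pred'),
    n.chains=2, n.iter=2000)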

Step 1 

After predicting from the whole experimental space, the mean and lower 5% limits of predicted times were plotted.

It was decided to focus on the region at the top right: at least 2.7 for the lower 5% limit and at least 3.2 for the mean time. The associated settings are summarized below.
        PaperType    WingLength       BodyWidth       BodyLength     TapedBody
 bond        :72   Min.   : 7.408   Min.   :2.540   Min.   : 3.810   No :114  
 regular1    :72   1st Qu.:12.065   1st Qu.:3.387   1st Qu.: 6.562   Yes: 54  
 construction:24   Median :12.065   Median :4.233   Median : 6.562            
                   Mean   :11.871   Mean   :4.163   Mean   : 6.955            
                   3rd Qu.:12.065   3rd Qu.:5.080   3rd Qu.: 9.313            
                   Max.   :12.065   Max.   :5.080   Max.   :12.065            
 TapedWing PaperClip PaperClip2 Fold     test       Time        
 No : 68   No :84    No: 20     No:168   WH:168   Mode:logical  
 Yes:100   Yes:84    WH:148                       NA's:168      
                     RH:  0                                     
                                                                
                                                                
                                                                
      Mean            u95             l05       
 Min.   :3.203   Min.   :3.574   Min.   :2.701  
 1st Qu.:3.327   1st Qu.:3.808   1st Qu.:2.776  
 Median :3.465   Median :3.975   Median :2.847  
 Mean   :3.486   Mean   :4.108   Mean   :2.873  
 3rd Qu.:3.636   3rd Qu.:4.388   3rd Qu.:2.952  
 Max.   :3.877   Max.   :5.044   Max.   :3.165  

Phase 2

The second prediction varied only PaperType, BodyWidth, BodyLength and TapedWing. All others were set to their most common setting. As can be seen, there is a bit of a trade-off: it is possible to select the longest mean time, but that incurs some chance of a much lower time because of model uncertainty. On the other hand, for a slightly lower mean time, we can have that certainty.
It is my choice to avoid the more uncertain region, hence I will base my selection on the lower limit. Here we can see that there is a trade-off: bond paper needs a slightly longer BodyLength, while regular paper can have a short BodyLength. BodyWidth should be maximized, but it is not a sensitive parameter.
For completeness, the mean prediction: it shows hardly any interaction, so the apparent need for a higher BodyLength with bond paper is due to a lack of experiments in this region. A few confirming final experiments seem to be in order. Within those, we could also include a low BodyWidth, since the models are unclear on whether this should be maximized or minimized.

Code used

The code for reading the actual data is in the previous post; this code starts after those data have been read in.
helis <- rbind(h1,h2)
helis$test <- factor(helis$test)

helis$PaperClip2 <- factor(ifelse(helis$PaperClip=='No','No',as.character(helis$test)),
    levels=c('No','WH','RH'))

library(R2jags)
library(ggplot2)

helispred <- expand.grid(
    PaperType=c('bond','regular1','construction'),
    WingLength=seq(min(helis$WingLength),max(helis$WingLength),length.out=4),
    BodyWidth=seq(min(helis$BodyWidth),max(helis$BodyWidth),length.out=4),
    BodyLength=seq(min(helis$BodyLength),max(helis$BodyLength),length.out=4),
    TapedBody=c('No','Yes'),
    TapedWing=c('No','Yes'),
    PaperClip=c('No','Yes'),
    PaperClip2=c('No','WH','RH'),
    Fold='No',
    test='WH',
    Time=NA)

helisboth <- rbind(helis,helispred)


#################################
datain <- list(
    PaperType=c(2,1,3,1)[helisboth$PaperType],
    WingLength=helisboth$WingLength,
    BodyLength=helisboth$BodyLength,
    BodyWidth=helisboth$BodyWidth,
    PaperClip=c(1,2,3)[helisboth$PaperClip2],
    TapedBody=c(0,1)[helisboth$TapedBody],
    TapedWing=c(0,1)[helisboth$TapedWing],
    test=c(1,2)[helisboth$test],
    Time=helisboth$Time,
    n=nrow(helis),
    m=nrow(helispred))

parameters <- c('Mul','WL','BL','PT','BW','PC','TB','TW','StDev',
    'WLBW','WLPC',            'WLWL',
    'BLPT'       ,'BLPC',     'BLBL',
    'BWPC',                   'BWBW',  'other','pred')

jmodel <- function() {
  for (i in 1:(n+m)) {     
    premul[i] <- (test[i]==1)+Mul*(test[i]==2)
    mu[i] <- premul[i] * (
          WL*WingLength[i]+
          BL*BodyLength[i] + 
          PT[PaperType[i]] +
          BW*BodyWidth[i] +
          PC[PaperClip[i]] +
          TB*TapedBody[i]+
          TW*TapedWing[i]+
          
          WLBW*WingLength[i]*BodyWidth[i]+
          WLPC[1]*WingLength[i]*(PaperClip[i]==2)+
          WLPC[2]*WingLength[i]*(PaperClip[i]==3)+
          
          BLPT[1]*BodyLength[i]*(PaperType[i]==2)+
          BLPT[2]*BodyLength[i]*(PaperType[i]==3)+
          BLPC[1]*BodyLength[i]*(PaperClip[i]==2)+
          BLPC[2]*BodyLength[i]*(PaperClip[i]==3)+
          
          BWPC[1]*BodyWidth[i]*(PaperClip[i]==2)+
          BWPC[2]*BodyWidth[i]*(PaperClip[i]==3) +
          
          WLWL*WingLength[i]*WingLength[i]+
          BLBL*BodyLength[i]*BodyLength[i]+
          BWBW*BodyWidth[i]*BodyWidth[i]       
          )
  }
  for (i in 1:n) {
    Time[i] ~ dnorm(mu[i],tau[test[i]])
  }
#    residual[i] <- Time[i]-mu[i]
  for (i in 1:2) {
    tau[i] <- pow(StDev[i],-2)
    StDev[i] ~dunif(0,3)
    WLPC[i] ~dnorm(0,1)
    BLPT[i] ~dnorm(0,1)
    BLPC[i] ~dnorm(0,1) 
    BWPC[i] ~dnorm(0,1)      
  }
  for (i in 1:3) {
    PT[i] ~ dnorm(PTM,tauPT)
  }
  tauPT <- pow(sdPT,-2)
  sdPT ~dunif(0,3)
  PTM ~dnorm(0,0.01)
  WL ~dnorm(0,0.01) 
  BL ~dnorm(0,0.01)
  BW ~dnorm(0,0.01)
  PC[1] <- 0
  PC[2]~dnorm(0,0.01)
  PC[3]~dnorm(0,0.01) 
  TB ~dnorm(0,0.01)
  TW ~dnorm(0,0.01)
  
  WLBW~dnorm(0,1)
  WLTW~dnorm(0,1)
  
  WLWL~dnorm(0,1)
  BLBL~dnorm(0,1) 
  BWBW~dnorm(0,1)
  
  other~dnorm(0,1)
  Mul ~ dnorm(1,1) %_% I(0,2)
  for (i in 1:m) {
    pred[i] <- mu[i+n]
  }
}

jj <- jags(model.file=jmodel,
    data=datain,
    parameters=parameters,
    progress.bar='gui',
    n.chain=5,
    n.iter=4000,
    inits=function() list(Mul=1.3,WL=0.15,BL=-.08,PT=rep(1,3),
          PC=c(NA,0,0),TB=0,TW=0))
#jj

predmat <- jj$BUGSoutput$sims.matrix[,grep('pred',dimnames(jj$BUGSoutput$sims.matrix)[[2]],value=TRUE)]
helispred$Mean <- colMeans(predmat)
helispred$u95 <- apply(predmat,2,function(x) quantile(x,.95))
helispred$l05 <- apply(predmat,2,function(x) quantile(x,.05))
png('select1.png')
qplot(y=Mean,x=l05,data=helispred)
dev.off()
select <- helispred[helispred$Mean>3.2 & helispred$l05>2.7,]
summary(select)

########

helispred <- expand.grid(
    PaperType=c('bond','regular1'),
    WingLength=12.065,
    BodyWidth=seq(2.5,5,length.out=11),
    BodyLength=seq(3.8,12,length.out=11),
    TapedBody=c('No'),
    TapedWing=c('No','Yes'),
    PaperClip='No',
    PaperClip2=c('WH'),
    Fold='No',
    test='WH',
    Time=NA)

helisboth <- rbind(helis,helispred)
datain <- list(
    PaperType=c(2,1,3,1)[helisboth$PaperType],
    WingLength=helisboth$WingLength,
    BodyLength=helisboth$BodyLength,
    BodyWidth=helisboth$BodyWidth,
    PaperClip=c(1,2,3)[helisboth$PaperClip2],
    TapedBody=c(0,1)[helisboth$TapedBody],
    TapedWing=c(0,1)[helisboth$TapedWing],
    test=c(1,2)[helisboth$test],
    Time=helisboth$Time,
    n=nrow(helis),
    m=nrow(helispred))

jj <- jags(model.file=jmodel,
    data=datain,
    parameters=parameters,
    progress.bar='gui',
    n.chain=5,
    n.iter=4000,
    inits=function() list(Mul=1.3,WL=0.15,BL=-.08,PT=rep(1,3),
          PC=c(NA,0,0),TB=0,TW=0))
#jj

predmat <- jj$BUGSoutput$sims.matrix[,grep('pred',dimnames(jj$BUGSoutput$sims.matrix)[[2]],value=TRUE)]
helispred$Mean <- colMeans(predmat)
helispred$u95 <- apply(predmat,2,function(x) quantile(x,.95))
helispred$l05 <- apply(predmat,2,function(x) quantile(x,.05))

#
png('select2.png')
qplot(y=Mean,x=l05,data=helispred)
dev.off()


png('l05.png')
v <- ggplot(helispred, aes(BodyLength, BodyWidth, z = l05))
v + stat_contour(aes(colour= ..level.. )) +
    scale_colour_gradient(name='Time' )+
    facet_grid(PaperType ~ TapedWing )+
    ggtitle('Lower 95% predicion') 
dev.off()

png('mean.png')

v <- ggplot(helispred, aes(BodyLength, BodyWidth, z = Mean))
v + stat_contour(aes(colour= ..level.. )) +
    scale_colour_gradient(name='Time' )+
    facet_grid(PaperType ~ TapedWing ) +
    ggtitle('Mean prediction')
dev.off()

To leave a comment for the author, please follow the link and comment on his blog: Wiekvoet.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Interactive maps of Crime data in Greater London


(This article was first published on R tutorial for Spatial Statistics, and kindly contributed to R-bloggers)
In the previous post we looked at ways to perform some introductory point pattern analysis of open data downloaded from Police.uk. As you remember we subset the dataset of crimes in the Greater London area, extracting only the drug related ones. Subsequently, we looked at ways to use those data with the package spatstat and perform basic statistics.
In this post I will briefly discuss ways to create interactive plots of the results of the point pattern analysis using the Google Maps API and Leaflet from R.

Number of Crimes by Borough
In the previous post we looped through the GreaterLondonUTM shapefile to extract the area of each borough and then counted the number of crimes within its border. To show the results we used a simple barplot. Here I would like to use the same method I presented in my post Interactive Maps for the Web to plot these results on Google Maps.

This post is intended to be a continuation of the previous, so I will not present again the methods and objects we used in the previous experiment. To make this code work you can just copy and paste it below the code you created before and it should work just fine.

First of all, let's create a new object including only the names of the boroughs from the GreaterLondonUTM shapefile. We need to do this because otherwise, when we click on a polygon on the map, it will show us a long list of useless data.

GreaterLondon.Google <- GreaterLondonUTM[,"name"]

The new object has only one column with the name of each borough.
Now we can create a loop to iterate through these names and calculate the intensity of the crimes:

Borough <- GreaterLondonUTM[,"name"]
 
for(i in unique(GreaterLondonUTM$name)){
sub.name <- Local.Intensity[Local.Intensity[,1]==i,2]
 
Borough[Borough$name==i,"Intensity"] <- sub.name
 
Borough[Borough$name==i,"Intensity.Area"] <- round(sub.name/(GreaterLondonUTM[GreaterLondonUTM$name==i,]@polygons[[1]]@area/10000),4)
}

As you can see, this loop selects one name at a time, then subsets the object Local.Intensity (which we created in the previous post) to extract the number of crimes for each borough. The next line attaches this intensity to the object Borough as a new column named Intensity. However, the code does not stop here. We also create another column named Intensity.Area in which we calculate the number of crimes per unit area. Since the area from the shapefile is in square meters and the numbers were very high, I divided it by 10,000, so that the intensity is expressed per 10,000 square meters (one hectare). This column therefore shows the number of crimes per hectare in each borough, which should correct for the fact that certain boroughs have a relatively high number of crimes only because their area is larger than others.

Now we can use again the package plotGoogleMaps to create a beautiful visualization of our results and save it in HTML so that we can upload it to our website or blog.
The code for doing that is very simple and it is presented below:

plotGoogleMaps(Borough,zcol="Intensity",filename="Crimes_Boroughs.html",layerName="Number of Crimes", fillOpacity=0.4,strokeWeight=0,mapTypeId="ROADMAP")

I decided to plot the polygons on top of the roadmap and not on top of the satellite image, which is the default for the function. Thus I added the option mapTypeId="ROADMAP".
The result is the map shown below and at this link: Crimes on GoogleMaps



In the post Interactive Maps for the Web in R I received a comment from Gerardo Celis, whom I thank, telling me that the package leafletR, which allows us to create interactive maps based on Leaflet, is now also available in R. So for this new experiment I decided to try it out!

I started from the sample of code presented here: https://github.com/chgrl/leafletR and adapted it to my data with very few changes.
The function leaflet does not work directly with Spatial data; we first need to transform them into GeoJSON with another function in leafletR:

Borough.Leaflet <- toGeoJSON(Borough)

Extremely simple!!

Now we need to set the style to use for plotting the polygons using the function styleGrad, which is used to create a list of colors based on a particular attribute:

map.style <- styleGrad(pro="Intensity",breaks=seq(min(Borough$Intensity),max(Borough$Intensity)+15,by=20),style.val=cm.colors(10),leg="Number of Crimes", fill.alpha=0.4, lwd=0)

In this function we need to set several options:
pro = is the name of the attribute (as the column name) to use for setting the colors
breaks = this option is used to create the ranges of values for each color. In this case, as in the example, I just created a sequence of values from the minimum to the maximum. As you can see from the code, I added 15 to the maximum value. This is because the breaks vector needs one more element than the vector of colors: for example, if we set 10 breaks we would need to supply 9 colors. If the sequence of breaks ended before the maximum, the polygons with the maximum number of crimes would be drawn in grey.
This is important!! (A small numeric check follows this list.)

style.val = this option takes the color scale to be used to present the polygons. We can select among one of the default scales or we can create a new one with the function color.scale in the package plotrix, which I already discussed here: Downloading and Visualizing Seismic Events from USGS

leg = this is simply the title of the legend
fill.alpha = is the opacity of the colors in the map (ranges from 0 to 1, where 1 is the maximum)
lwd =  is the width of the line between polygons
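
As a small numeric check of that bookkeeping (reusing the Borough object created above; the "+ 15" and the step of 20 mirror the styleGrad call):

brks <- seq(min(Borough$Intensity), max(Borough$Intensity) + 15, by=20)
length(brks)        # number of break points
length(brks) - 1    # number of colors needed, e.g. cm.colors(length(brks) - 1)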

After we set the style we can simply call the function leaflet to create the map:

leaflet(Borough.Leaflet,popup=c("name","Intensity","Intensity.Area"),style=map.style)

In this function we need to input the name of the GeoJSON object we created before, the style of the map and the names of the columns to use for the popups.
The result is the map shown below and available at this link: Leaflet Map



I must say this function is very neat. First of all, the function plotGoogleMaps, if you do not set the name of the HTML file, creates a series of temporary files stored in your temp folder, which is not great. Then, even if you set the name of the file, the legend is saved into different image files every time you call the function, which you may do many times until you are fully satisfied with the result.
The package leafletR, on the other hand, creates a new folder inside the working directory where it stores both the GeoJSON and the HTML file, and every time you modify the visualization the function overwrites the same files.
However, I noticed that I cannot see the map if I open the HTML files from my PC. I had to upload the file to my website every time I changed it to actually see these changes and how they affected the plot. This may be something related to my PC, however.


Density of Crimes in raster format
As you may remember from the previous post, one of the steps included in a point pattern analysis is the computation of the spatial density of the events. One of the techniques to do that is the kernel density, which basically calculates the density continuously across the study area, thus creating a raster.
We already looked at the kernel density in the previous post so I will not go into details here, the code for computing the density and transform it into a raster is the following:

Density <- density.ppp(Drugs.ppp, sigma = 500,edge=T,W=as.mask(window,eps=c(100,100)))
Density.raster <- raster(Density)
projection(Density.raster)=projection(GreaterLondonUTM)

The first line is basically the same as the one we used in the previous post. The only difference is that here I added the option W to set the resolution of the map, with eps at 100x100 m.
Then I simply transformed the first object into a raster and assigned to it the same UTM projection as the object GreaterLondonUTM.
Now we can create the map. As far as I know (and from what I tested), leafletR is not yet able to plot raster objects, so the only way we have of doing it is again to use the function plotGoogleMaps:

plotGoogleMaps(Density.raster,filename="Crimes_Density.html",layerName="Number of Crimes", fillOpacity=0.4,strokeWeight=0,colPalette=rev(heat.colors(10)))

When we use this function to plot a raster we clearly do not need to specify the zcol option. Moreover, here I changed the default color scale using the option colPalette to a reversed heat.colors, which I think is more appropriate for such a map. The result is the map below and at this link: Crime Density




Density of Crimes as contour lines
The raster presented above can also be represented as contour lines. The advantage of this type of visualization is that it is less intrusive, compared to a raster, and can also be better suited to pinpoint problematic locations.
Doing this in R is extremely simple, since there is a dedicated function in the package raster:

Contour <- rasterToContour(Density.raster,maxpixels=100000,nlevels=10)

This function transforms the raster above into a series of 10 contour lines (we can change the number of lines by changing the option nlevels).

Now we can plot these lines to an interactive web map. I first tested again the use of plotGoogleMaps but I was surprised to see that for contour lines it does not seem to do a good job. I do not fully know the reason, but if I use the object Contour with this function it does not plot all the lines on the Google map and therefore the visualization is useless.
For this reason I will present below the lines to plot contour lines using leafletR:

Contour.Leaflet <- toGeoJSON(Contour)
 
colour.scale <- color.scale(1:(length(Contour$level)-1),color.spec="rgb",extremes=c("red","blue"))
map.style <- styleGrad(pro="level",breaks=Contour$level,style.val=colour.scale,leg="Number of Crimes", lwd=2)
leaflet(Contour.Leaflet,style=map.style,base.map="tls")

As mentioned, the first thing to do to use leafletR is to transform our Spatial object into a GeoJSON; the object Contour belongs to the class SpatialLinesDataFrame, so it is supported in the function toGeoJSON.
The next step is again to set the style of the map and then plot it. In this code I changed a few things just to show some more options. The first is the custom color scale I created using the function color.scale in the package plotrix. The only thing the function styleGrad needs to set the colors in the option style.val is a vector of colors, which must be one element shorter than the vector used for the breaks. In this case the object Contour has only one property, namely "level", which is a vector of class factor. The function styleGrad can use it to create the breaks, but the function color.scale cannot use it to create the list of colors. We can work around this problem by setting the length of the color.scale vector using another vector, 1:(length(Contour$level)-1), which basically creates a vector of integers from 1 to the length of Contour$level minus one. The result of this function is a vector of colors ranging from red to blue, which we can plug into the following function.
In the function leaflet the only thing I changed is the base.map option, in which I use "tls". From the help page of the function we can see that the following options are available:

"One or a list of "osm" (OpenStreetMap standard map), "tls" (Thunderforest Landscape), "mqosm" (MapQuest OSM), "mqsat" (MapQuest Open Aerial),"water" (Stamen Watercolor), "toner" (Stamen Toner), "tonerbg" (Stamen Toner background), "tonerlite" (Stamen Toner lite), "positron" (CartoDB Positron) or "darkmatter" (CartoDB Dark matter). "

These lines create the following image, available as a webpage here: Contour






To leave a comment for the author, please follow the link and comment on his blog: R tutorial for Spatial Statistics.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...