From JSON to Tables

(This article was first published on My Data Atelier » R, and kindly contributed to R-bloggers)
“First things first”. Tautological, always true. However, some data scientists seem to ignore this: you can use the most sophisticated and trendy algorithm, come up with brilliant ideas, and imagine the most creative visualizations, but if you do not know how to get the data and handle it in exactly the way you need it, all of this becomes worthless. In other words, first things first.
In my professional experience, I have heard thousands of potentially brilliant ideas which could not even be tested because the data scientist in question did not know how to handle data according to his/her needs. This has become particularly problematic with the popularisation of JSON: despite the undeniable advantages that this data structure has in terms of data storage, replication, etc., it presents a challenge for data scientists, as most algorithms require the input data to be passed in tabular form. I have faced this problem myself and, after creating a couple of temporary solutions, felt it was high time to come up with a general one. Of course, the solution I have come up with is just one approach.
In order to understand it, two things about JSON structure must be considered. Firstly, a JSON document is basically an array of objects. This means that, unlike in traditional tabular databases, there is no indication of the fields an object does not have. This might seem trivial, but if you want to tabulate JSONs it is certainly something to be solved. Secondly, this array of objects can contain objects which are arrays themselves, so when transforming to a table such values have to be converted somehow to a format appropriate for a table. The function in question handles both scenarios: it is intended to tabulate not only objects with identical structure but also objects whose internal structure differs.
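As a minimal illustration of these two points (a toy example added here, not from the original post), two objects need not declare the same fields, and a field can itself be an array:

obj1 <- list(name = "A", goals = list(10, 12))  # "goals" is itself an array of values
obj2 <- list(name = "B", club = "X")            # no "goals" field at all; nothing marks its absence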
Before deep diving into the code of the function, let us propose a scenario where this could be useful. Imagine we would like to tabulate certain information about professional football players. In this mini-example, we have chosen Lionel Messi, Cristiano Ronaldo and George Best. The information is retrieved from DBpedia, which exposes the corresponding JSON of their Wikipedia entries. In the code below, the first steps:

library(rjson)
library(RCurl)
library(plyr)   # needed later for ddply(), used inside tabulateJSON()

players <- c("Lionel_Messi","George_Best","Cristiano_Ronaldo")

players.info <- lapply(players, function(pl) fromJSON(getURL(paste0("http://dbpedia.org/data/",pl,".json"))))

players.info <- lapply(1:length(players), function(x) players.info[[x]][[paste0("http://dbpedia.org/resource/",players[x])]])

players.unlist <- unlist(players.info)


I have to admit that this is not the simplest example, as the JSONs retrieved are extremely complex and need much more pre-processing than usual (for example, than most API calls). The problem is that nowadays most API calls require an API key, which would make the example less reproducible.
Back to the example: firstly, the data for each player is retrieved using getURL() from RCurl and then converted to a list by using fromJSON() from rjson, which converts JSON strings or files to lists. The whole call is done inside an lapply statement, so the object returned is basically a list of JSONs converted to lists, i.e., a list of lists. After that, from each of the sub-lists (in this case three, as there were three players) we select only the piece of information that refers to the player himself; in a few words, a bit of cleanup. Finally, the result is unlisted. unlist() is a function which turns a list (a list of lists is a list itself) into a named vector of a type to which all the elements can be coerced; in this case, character.

This is the point of manual intervention, where the variable selection has to be done. At this stage, it is highly recommended to have at least one example loaded in some JSON parser (JSON Viewer, or any browser’s JSONView add-in). The names of the vector generated by unlist() (“players.unlist” in the example) are a concatenation of the names of the lists each value was nested in, separated by “.”.
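A tiny sketch (added here for illustration) of how unlist() builds those dot-separated names:

toy <- list(player = list(surname = list(value = "Messi", lang = "es")))
names(unlist(toy))
## [1] "player.surname.value" "player.surname.lang"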

The elements in the character vector are ordered in the same way they appeared in the lists. So the first thing to do is to establish where each case starts; the challenge is to find a common pattern among the vector names that identifies the start of each element. This element does not necessarily have to end up in the final table. In this case I chose “surname\.value$”. However, in most cases I would recommend choosing the first element that appears in the object.

Similarly, the challenge now is to find common patterns in the names of the vector elements that will define each of the variables. In order to do that, the wisest thing is to take a look at the names you have. In this case:


> unique(names(players.unlist))
  [1] "http://www.w3.org/1999/02/22-rdf-syntax-ns#type.type"   
  [2] "http://www.w3.org/1999/02/22-rdf-syntax-ns#type.value"  
  [3] "http://www.w3.org/2002/07/owl#sameAs.type"              
  [4] "http://www.w3.org/2002/07/owl#sameAs.value"             
  [5] "http://www.w3.org/2000/01/rdf-schema#label.type"        
  [6] "http://www.w3.org/2000/01/rdf-schema#label.value"       
  [7] "http://www.w3.org/2000/01/rdf-schema#label.lang"        
  [8] "http://purl.org/dc/terms/subject.type"                  
  [9] "http://purl.org/dc/terms/subject.value"                 
 [10] "http://xmlns.com/foaf/0.1/homepage.type"                
 [11] "http://xmlns.com/foaf/0.1/homepage.value"               
 [12] "http://xmlns.com/foaf/0.1/depiction.type"               
 [13] "http://xmlns.com/foaf/0.1/depiction.value"              
 [14] "http://purl.org/dc/elements/1.1/description.type"       
 [15] "http://purl.org/dc/elements/1.1/description.value"      
 [16] "http://purl.org/dc/elements/1.1/description.lang"       
 [17] "http://xmlns.com/foaf/0.1/givenName.type"               
 [18] "http://xmlns.com/foaf/0.1/givenName.value"              
 [19] "http://xmlns.com/foaf/0.1/givenName.lang"               
 [20] "http://xmlns.com/foaf/0.1/name.type" 
By taking a look at the JSON of any of these examples, we can see that “type” refers to the type of the value stored in “value”. So, in this case, we know that we are going to need only the names that end with “.value”. However, the JSONs are too large to do a complete scan of all their elements, so the wisest thing is to grep for the desired values. For example, if we would like to find the date of birth:

> grep("(birth|Birth).*\\.value",unique(names(players.unlist)),value=T)
[1] "http://dbpedia.org/ontology/birthName.value"   
[2] "http://dbpedia.org/property/birthDate.value"   
[3] "http://dbpedia.org/property/birthPlace.value"  
[4] "http://dbpedia.org/property/dateOfBirth.value" 
[5] "http://dbpedia.org/property/placeOfBirth.value"
[6] "http://dbpedia.org/ontology/birthDate.value"   
[7] "http://dbpedia.org/ontology/birthPlace.value"  
[8] "http://dbpedia.org/ontology/birthYear.value"   
[9] "http://dbpedia.org/property/birthName.value"

Now we know that there are several different birth-date fields. In this case, we should take a look manually and, based upon that, choose the one that best fits our needs. In this example, I chose 6 arbitrary variables, extracted their patterns and chose suitable variable names for them. Please notice that, for example, date of death should be empty for Messi and Ronaldo, while number is empty in the case of George Best (players did not use to have a fixed number in the past). Apart from that, “goals” has multiple entries per player, as the JSON has one value per club. This is the end of the script:

st.obj <- "^.+surname\\.value$"

columns <- c("fullname\\.value","ontology/height\\.value",
             "dateOfBirth\\.value","dateOfDeath\\.value",
             "ontology/number\\.value","property/goals.value$")

colnames <- c("full.name","height","date.of.birth","date.of.death","number","goals")

players.table <- tabulateJSON(players.unlist,st.obj,columns,colnames)

And this is the final result


> players.table
     full.name                             height  date.of.birth      date.of.death     
[1,] "Lionel Andrés Messi"                 "1.69"  "1987-06-24+02:00" NA                
[2,] "George Best"                         "1.524" "1946-05-22+02:00" "2005-11-25+02:00"
[3,] "Cristiano Ronaldo dos Santos Aveiro" "1.85"  "1985-02-05+02:00" NA                
     number goals                        
[1,] "10"   "6;5;242"                    
[2,] NA     "6;2;3;0;1;15;12;8;33;137;21"
[3,] "7"    "3;84;176"       

Finally, we get to the function. tabulateJSON() expects four parameters: an unlisted JSON (or any character vector with characteristics similar to those produced by unlisting a JSON), a string that represents the pattern marking the starting position of each element, a vector of characters (normally regex patterns) with the element names to be sought, and finally the names to assign to the generated columns.

Now let’s take a look at how tabulateJSON() works. Below is the entire code of the function:


tabulateJSON <- function (json.un, start.obj, columns, colnames) 
{
  if (length(columns) != length(colnames)) {
    stop("'columns' and 'colnames' must be the same length")
  }
  start.ind <- grep(start.obj, names(json.un))
  
  col.indexes <- lapply(columns, grep, names(json.un))
  col.position <- lapply(1:length(columns), function(x) findInterval(col.indexes[[x]], start.ind))
  
  
    temp.frames <- lapply(1:length(columns), function(x) data.frame(pos = col.position[[x]], ind = json.un[col.indexes[[x]]], stringsAsFactors = F))
    
    collapse.cols <- which(sapply(temp.frames, nrow) > length(start.ind))
  
  if(length(collapse.cols) > 0){
    temp.frames[collapse.cols] <- lapply(temp.frames[collapse.cols], function(x) 
      ddply(.data = x, .(pos), summarise, value = paste0(ind, collapse = ";")))
    
  }
  
  matr <- Reduce(function(...) merge(...,all=T,by="pos"),temp.frames)
  matr$pos <- NULL
  names(matr) <- colnames
  matr <- as.matrix(matr)
  colnames(matr) <- colnames
  return(matr)
}

How does it work? Firstly, it looks for the items that define the start of an object, based upon the pattern passed, and returns their indexes. These will be the delimiters.

  start.ind <- grep(start.obj, names(json.un))

After that, it looks for the name values that match the patterns passed in “columns” and returns their indexes. Those indexes have to be assigned to a position (i.e., a row number), which is done with findInterval(). This function maps a number to an interval given a set of cut points.
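As a quick illustration of findInterval() (an added example, not from the original post): given a vector of cut points, it returns the interval each value falls into, which is exactly how the column matches are assigned to row numbers here:

findInterval(c(2, 5, 9), c(1, 4, 8))
## [1] 1 2 3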

For each of the variables, a data frame with two columns is created, where the first one is the position and the second one the value itself, obtained by indexing the character vector with the column indexes. As will be seen later, this is done because a sought pattern (which becomes a variable) might have more than one match within a row, which is problematic when trying to generate a tabulated dataset. For that reason, temp.frames may contain data frames with different numbers of rows.

  col.indexes <- lapply(columns, grep, names(json.un))
  col.position <- lapply(1:length(columns), function(x) findInterval(col.indexes[[x]], start.ind))
    temp.frames <- lapply(1:length(columns), function(x) data.frame(pos = col.position[[x]], ind = json.un[col.indexes[[x]]], stringsAsFactors = F))

After this, it is necessary to collapse those variables which contain multiple values per row. Firstly, the function checks whether any of the elements in the list has more rows than there are delimiters (which define the number of rows). If there is any, the function uses plyr’s ddply() to collapse by the row specifier (pos). After this process all the data frames in temp.frames will be of equal length:

    collapse.cols <- which(sapply(temp.frames, nrow) > length(start.ind))
  
  if(length(collapse.cols) > 0){
    temp.frames[collapse.cols] <- lapply(temp.frames[collapse.cols], function(x) 
      ddply(.data = x, .(pos), summarise, value = paste0(ind, collapse = ";")))
    
  }
Finally, all the elements in temp.frames are merged with one another, the column names are assigned and “pos” is dropped, as it does not belong in the final dataset:

  matr <- Reduce(function(...) merge(...,all=T,by="pos"),temp.frames)
  matr$pos <- NULL
  names(matr) <- colnames
  matr <- as.matrix(matr)
  colnames(matr) <- colnames
  return(matr)


Of course, using this function directly is a bit “uncomfortable”. For that reason, and depending on your particular needs, you can include tabulateJSON() as part of a higher-level function. In this particular example, the function could receive the names of the players and a list of high-level variables, which are then mapped to particular patterns, for example:

getPlayerInfo <- function(players,variables){
  
  players.info <- lapply(players, function(pl) fromJSON(getURL(paste0("http://dbpedia.org/data/",pl,".json"))))
  
  players.info <- lapply(1:length(players), function(x) players.info[[x]][[paste0("http://dbpedia.org/resource/",players[x])]])
  
  players.unlist <- unlist(players.info)
  
  st.obj <- "^.+surname\\.value$"
  
  columns.to.grep <- paste0(variables,"\\.value$")
  
  #Check if there is a multiple match with different types
  
  col.grep <- lapply(columns.to.grep, grep, x=unique(names(players.unlist)))
  
  columns <- sapply(col.grep, function(x) unique(names(players.unlist))[x[1]])
  
  #Convert names to a regex
  
  columns <- gsub("\\.","\\\\.",columns)
  columns <- paste0("^",columns,"$")
  
  players.table <- tabulateJSON(players.unlist,st.obj,columns,variables)
  
  return(players.table)
  
}

So, the call will be:


getPlayerInfo(c("Lionel_Messi","David_Beckham","Zinedine_Zidane","George_Best","George_Weah"),c("fullname","height","dateOfBirth","dateOfDeath","number","goals"))
     fullname                                  height   dateOfBirth       
[1,] "Lionel Andrés Messi"                     "1.69"   "1987-06-24+02:00"
[2,] "David Robert Joseph Beckham"             "1.8288" "1975-05-02+02:00"
[3,] "Zinedine Yazid Zidane"                   "1.85"   "1972-06-23+02:00"
[4,] "George Best"                             "1.524"  "1946-05-22+02:00"
[5,] "George Tawlon Manneh;Oppong Ousman Weah" "1.84"   "1966-10-01+02:00"
     dateOfDeath        number goals                        
[1,] NA                 "10"   "6;5;242"                    
[2,] NA                 NA     "62;2;0;13;18"               
[3,] NA                 NA     "6;37;28;24"                 
[4,] "2005-11-25+02:00" NA     "6;2;3;0;1;15;12;8;33;137;21"
[5,] NA                 NA     "46;47;32;24;14;13;7;5;3;1"  

I hope you enjoyed it and/or found it useful at least ;)

As usual, if you have any comments, suggestions or criticism, please drop me a line.



Because it’s Friday: Love in the land of Facebook

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Today is my 11th wedding anniversary with my wonderful husband Jay, so it's a love-themed Friday post today. Jay and I met before Facebook was a thing, but we've been touched by the congratulations on our timelines today.

Those timeline posts reveal a lot about you and your relationships, and last year the Facebook data science team published a series of analyses with anonymized data (done with R, of course) on what your Facebook activity says about your love life. For example, the way to tell whether two people are destined to be a couple (as marked by a joint "In a Relationship" status on Facebook) is to watch for a steadily increasing rate of shared timeline posts:

Relationship posts

After the relationship is declared the rate of joint posts drops off dramatically (as, presumably, the couple is otherwise occupied), but those posts they do share become sweetly sentimental:

Sentiment

Read about other Facebook analyses of relationship status in the 6-part series.

That's all for this week. Have a great weekend, and we'll see you back here on Monday!


How to correctly set color in the image() function?

(This article was first published on One Tip Per Day, and kindly contributed to R-bloggers)
Sometimes we want to make our own heatmap using the image() function. I recently found it tricky to set the color option there, as the manual has very little information on col:


col
a list of colors such as that generated by rainbow, heat.colors, topo.colors, terrain.colors or similar functions.

I posted my question on BioStars. The short answer is: unless breaks is set, the range of Z is evenly cut into N intervals (where N = the length of col) and each value in Z is assigned the color of the corresponding interval. For example, when x=c(3,1,2,1) and col=c("blue","red","green","yellow"), the minimum of x is assigned the first color and the maximum the last color. Any value in between is mapped proportionally to a color. In this case, 2 falls on the boundary of the second interval and, according to the principle that intervals are closed on the right and open on the left, it is assigned to "red". That is why we see the colors yellow --> blue --> red --> blue.
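A small sketch of that explanation, using the toy values from the answer above (example added here):

x <- c(3, 1, 2, 1)
collist <- c("blue", "red", "green", "yellow")
# range(x) = [1, 3] is cut into 4 intervals, one per color:
# 3 -> "yellow", 1 -> "blue", 2 -> "red" (intervals are right-closed)
image(matrix(x), col = collist)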

In practice, unless we want to manually define the color break points, we can just set the first and last color and it will automatically find colors for the values in Z.

collist<-c(0,1)
image(1:ncol(x),1:nrow(x), as.matrix(t(x)), col=collist, asp=1)

If we want to manually define the color break points, we need to build the color ramp ourselves:

x=matrix(rnorm(100),nrow=10)*100
xmin=0; xmax=100;
x[x<xmin]=xmin; x[x>xmax]=xmax;
collist<-c("#053061","#2166AC","#4393C3","#92C5DE","#D1E5F0","#F7F7F7","#FDDBC7","#F4A582","#D6604D","#B2182B","#67001F")
ColorRamp<-colorRampPalette(collist)(10000)
ColorLevels<-seq(from=xmin, to=xmax, length=10000)
ColorRamp_ex <- ColorRamp[round(1+(min(x)-xmin)*10000/(xmax-xmin)) : round( (max(x)-xmin)*10000/(xmax-xmin) )]
par(mar=c(2,0,2,0), oma=c(3,3,3,3))
layout(matrix(seq(2),nrow=2,ncol=1),widths=c(1),heights=c(3,0.5))
image(t(as.matrix(x)), col=ColorRamp_ex, las=1, xlab="",ylab="",cex.axis=1,xaxt="n",yaxt="n")
image(as.matrix(ColorLevels),col=ColorRamp, xlab="",ylab="",cex.axis=1,xaxt="n",yaxt="n")
axis(1,seq(xmin,xmax,10),seq(xmin,xmax,10))


R Recipe: Reordering Columns in a Flexible Way

(This article was first published on Exegetic Analytics » R, and kindly contributed to R-bloggers)

Suppose you have a data frame with a number of columns.

> names(trading)
 [1] "OpenDate"   "CloseDate"  "Symbol"     "Action"     "Lots"       "SL"         "TP"         "OpenPrice" 
 [9] "ClosePrice" "Commission" "Swap"       "Pips"       "Profit"     "Gain"       "Duration"   "Trader"    
[17] "System"

You want to put the Trader and System columns first but you also want to do this in a flexible way. One approach would be to specify column numbers.

> trading = trading[, c(16:17, 1:15)]
> names(trading)
 [1] "Trader"     "System"     "OpenDate"   "CloseDate"  "Symbol"     "Action"     "Lots"       "SL"        
 [9] "TP"         "OpenPrice"  "ClosePrice" "Commission" "Swap"       "Pips"       "Profit"     "Gain"      
[17] "Duration"

This does the job but it's not very flexible. After all, the number of columns might change. Rather do it by specifying column names.

> refcols <- c("Trader", "System")
> #
> trading <- trading[, c(refcols, setdiff(names(trading), refcols))]
> names(trading)
 [1] "Trader"     "System"     "OpenDate"   "CloseDate"  "Symbol"     "Action"     "Lots"       "SL"        
 [9] "TP"         "OpenPrice"  "ClosePrice" "Commission" "Swap"       "Pips"       "Profit"     "Gain"      
[17] "Duration"

The post R Recipe: Reordering Columns in a Flexible Way appeared first on Exegetic Analytics.


Learning about classes in R with plot.bike()

(This article was first published on Robin Lovelace - R, and kindly contributed to R-bloggers)

A useful feature of R is its ability to implement a function differently depending on the ‘class’ of the object acted on. This article explores this behaviour with reference to a playful modification of the ‘generic’ function plot() to allow plotting of cartoon bicycles. Although the example is quite simple and fun, the concepts it touches on are complex and serious.

The example demonstrates several of the programming language paradigms that R operates under. R is simultaneously object-orientated, functional and polymorphic. The example also demonstrates the paradigm of inheritance, through the passing of arguments from plot.bike() to plot() via the ... symbol. There has been much written about programming paradigms and R’s adherence to (or flouting of!) them. Two useful references on the subject are a Wikibook page on programming language paradigms and Hadley Wickham's Advanced R book. There is a huge amount of information on these topics. For the purposes of the examples presented here suffice to say that R uses multiple paradigms and is extremely flexible.

Context: an advanced R course

The behaviour of ‘generic functions’ such as plot was taught during a 2 day course by Colin Gillespie in Newcastle. In it we learned about some of the nuts and bolts that underlie R’s uniquely flexible and sometimes bizarre syntax. Environments, start-up, functions and classes were some of the topics covered. These and more issues are described in multiple places on-line and in R’s own documentation, and neatly synthesised in Hadley Wickham’s penultimate book, Advanced R. However, nothing beats face-to-face learning and I learned plenty about R’s innards during the course, despite having read around the topics covered previously.

Colin has made his materials available on-line on github for the benefit of people worldwide. http://rcourses.github.io/ contains links to pages which introduce a number of courses which can, to a large extent, be conducted from the safety of one’s home. There are also R packages for each of the courses. The package for the Advanced R course, for example, can be installed with the following code:
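The installation snippet was not captured in this copy of the post; a minimal sketch, assuming the package is installed from GitHub and that the repository sits under the rcourses organisation (the exact repository path is an assumption):

# install.packages("devtools")                      # if devtools is not yet installed
devtools::install_github("rcourses/nclRadvanced")   # repository path assumed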

Once the package has been installed and loaded, with library(nclRadvanced), a number of vignettes and solutions sheets can be accessed, e.g. via:

vignette(package = "nclRadvanced")
vignette(package = "nclRadvanced", "practical2")

Creating a new S3 class for bikes

The S3 class system is very flexible. Any object can be allocated to a class of any name, without restriction. S3 classes only become meaningful when objects allocated to a particular class are passed to a function that recognises classes. Functions that behave differently depending on the class of the object they act on are known as generic.

We can find out which object system (S3, S4, etc.) an object uses with the pryr package. A good example of an S3 class is “lm”: objects of this class plot in a different way because plot() dispatches to plot.lm() for objects of the lm S3 class.

x <- 1:9
y <- x^2
m <- lm(y ~ x)
class(m)
## [1] "lm"
pryr::otype(m) # requires pryr to be installed
## [1] "S3"

Note that the object system is flexible, so any class name can be allocated to any object, such as class(x) <- "lm". Note that if we enter this, plot(x) will try to dispatch x to plot.lm() and fail.

Classes only become useful when they have a series of generic methods associated with them. We will illustrate this by defining a list as a ‘bike’ object and creating a plot.bike(), a class-specific method of the generic plot function for plotting S3 objects of that class. Let’s define the key components of a bike:
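The original definition is not reproduced in this copy of the post; a minimal sketch, assuming a bike is stored as a plain list with a wheel size ws (referred to later on) and a frame size fs (my own addition):

x <- list(ws = 700, fs = 550)  # wheel size and frame size, units assumed to be mm
class(x) <- "bike"             # allocate the object to the (so far meaningless) 'bike' class
class(x)
## [1] "bike"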

Note that there are no strict rules. We could allocate the class to any object, and we could replace bike with almost any name. The S4 class system, used for spatial data for example, is much stricter.

The bike class becomes useful when it comes to method dispatch, such as plotting.

Creating a plot method for bikes

Suppose that every bike object has the same components as those contained in the object x created above. We can specify how it should be plotted as follows:
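The original plot.bike() is likewise not shown here; the sketch below is a stand-in that only assumes the ws and fs components used above, drawing two wheels and a crude frame:

plot.bike <- function(x, ...) {
  # empty canvas sized from the frame; further arguments are passed on to plot()
  plot(c(-x$ws, x$fs + x$ws), c(-x$ws, x$fs * 1.5), type = "n",
       asp = 1, xlab = "", ylab = "", ...)
  # front and rear wheels as circles of diameter ws
  symbols(c(0, x$fs), c(0, 0), circles = rep(x$ws / 2, 2),
          add = TRUE, inches = FALSE)
  # a very crude triangular frame joining the wheels to the seat post
  segments(c(0, x$fs, 0), c(0, 0, 0),
           c(x$fs / 2, x$fs / 2, x$fs), c(x$fs * 0.75, x$fs * 0.75, 0))
}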

Now that a new method has been added to the generic plot() function, the fun begins. Any object assigned to the class ‘bike’ will now automatically be dispatched to plot.bike() when plot() is called.

And, as the plots below show, a plot of a bicycle is produced.

plot(x)

Try playing with the wheel size - some bikes with quite strange dimensions can be produced!

x$ws <- 1500 # a bike with large wheels
plot(x)

x$ws <- 150 # a bike with small wheels
plot(x)

It would be interesting to see how the dimensions of the last bicycle compare with a Brompton!

Discussion

The bike class demonstrates that the power of S3 classes lies not in the class’s objects but in the generic functions which take on new methods. It is precisely this behaviour which makes the family of Spatial* classes defined by the sp package so powerful. sp adds new methods for plot(), aggregate() and even the subsetting function "[".

This can be seen by calling methods() before and after sp is loaded:

methods(aggregate)
## [1] aggregate.data.frame aggregate.default*   aggregate.formula*
## [4] aggregate.ts
## see '?methods' for accessing help and source code
library(sp) # load the sp library, which creates new methods
methods(aggregate) # the new method is now shown
## [1] aggregate.data.frame aggregate.default*   aggregate.formula*
## [4] aggregate.Spatial*   aggregate.ts
## see '?methods' for accessing help and source code

Note that Spatial classes are different from the bike class because they use the S4 class system. We will be covering the nature and behaviour of Spatial objects in the “Spatial data analysis with R” course in Newcastle, 2nd - 3rd June, which is still open for registration.

The bike class is not ‘production’ ready but there is no reason why someone who understands bicycles inside out could not create a well-defined (perhaps S4) class for a bicycle, with all the essential dimensions defined. This could really be useful, including in efforts at making R more useful for transport planning, such as my package under development to provide tools for transportation research and analysis, stplanr.

Having learned about classes, I’m wondering whether origin-destination ‘flow’ data, used in stplanr, would benefit from its own class, or if its current definition as SpatialLinesDataFrame is sufficient. Any ideas welcome!

Conclusion

Classes are an advanced topic in R that usually just ‘works’. However, if you want to modify existing functions to behave differently on new object types, understanding how to create classes and class-specific methods can be very useful. The example of the bike class created above is not intended for production, but provides a glimpse into what is possible. At the very least, this article should help provide readers with new insight into the inner workings of R and its impressive functional flexibility.

Post script

If you are interested in using R for transport research, please check out my under-development package stplanr and let me know via GitHub of any features you’d like it to have before submission to CRAN and rOpenSci:

https://github.com/Robinlovelace/stplanr

Or tweet me on @robinlovelace


The paper helicopter experiment

(This article was first published on Wiekvoet, and kindly contributed to R-bloggers)
The paper helicopter is one of the classic devices used to explain design of experiments. The aim is to create the longest-flying paper helicopter by means of experimental design.
Paper helicopters are a nice example because they are cheap to make, it is easy to measure their landing time, and there are enough variables to make the problem non-obvious.
Rather than make and measure my own helicopters, I decided to use data from the internet. In this post I use data from williamghunter.net and http://www.rose-hulman.edu. There is more data on the internet, but these two are fairly similar. Both use a fractional factorial design of 16 runs and they have the same variables. However, a quick check showed that these were different results and, very importantly, that the aliasing structure was different.

Data

Data were taken from the locations given above. Rather than using the coded units, the data were converted to sizes in cm, and time to land was converted to seconds.
Since these were separate experiments, it has to be assumed that they used different paper and different heights from which to drop the helicopters. It even seems that different ways were found to attach a paperclip to the helicopters.

Simple analysis

To confirm the data, an analysis on coded units was performed. These results were the same as given by the websites (results not shown here). My own analysis starts with real-world units and uses regression. A disadvantage of real-world units is that one cannot directly compare the sizes of the effects; however, given the designs used, the t-statistic can be used for this purpose.
The first data set shows WingLength and BodyLength to have the largest effects.
Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)        1.92798    0.54903   3.512 0.009839 ** 
PaperTyperegular1 -0.12500    0.13726  -0.911 0.392730    
WingLength         0.17435    0.03088   5.646 0.000777 ***
BodyLength        -0.08999    0.03088  -2.914 0.022524 *  
BodyWidth          0.01312    0.07205   0.182 0.860634    
PaperClipYes       0.05000    0.13726   0.364 0.726403    
FoldYes           -0.10000    0.13726  -0.729 0.489918    
TapedBodyYes      -0.15000    0.13726  -1.093 0.310638    
TapedWingYes       0.17500    0.13726   1.275 0.242999  
The second data set shows WingLength, PaperClip and PaperType to have the largest effects. 
Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)        0.73200    0.21737   3.368  0.01196 *  
PaperTyperegular2  0.28200    0.06211   4.541  0.00267 ** 
WingLength         0.16654    0.01223  13.622  2.7e-06 ***
BodyLength        -0.02126    0.01630  -1.304  0.23340    
BodyWidth         -0.03307    0.04890  -0.676  0.52058    
PaperClipYes      -0.35700    0.06211  -5.748  0.00070 ***
FoldYes            0.04500    0.06211   0.725  0.49222    
TapedBodyYes      -0.14700    0.06211  -2.367  0.04983 *  
TapedWingYes       0.06600    0.06211   1.063  0.32320
It seems, then, that the two experiments show somewhat different effects. WingLength is certainly important, BodyLength maybe. Regarding paper, both have regular paper, but one adds bond paper and the other construction paper. It is not difficult to imagine that these are quite different.

Combined analysis

The combined analysis is programmed in JAGS. To capture a different falling distance, a Mul parameter is used, which defines a multiplicative effect between the two experiments. In addition, both sets have their own measurement error. There are four types of paper, two from each data set, and three levels of paperclip (“no paperclip” is assumed to be the same in both experiments). In addition to the parameters given earlier, residuals are estimated in order to have some idea about the quality of the fit.
The model, then, looks like this:
jmodel <- function() {
  for (i in 1:n) {     
    premul[i] <- (test[i]==1)+Mul*(test[i]==2)
    mu[i] <- premul[i] * (
          WL*WingLength[i]+
          BL*BodyLength[i] + 
          PT[PaperType[i]] +
          BW*BodyWidth[i] +
          PC[PaperClip[i]] +
          FO*Fold[i]+
          TB*TapedBody[i]+
          TW*TapedWing[i]
          )
    Time[i] ~ dnorm(mu[i],tau[test[i]])
    residual[i] <- Time[i]-mu[i]
  }
  for (i in 1:2) {
    tau[i] <- pow(StDev[i],-2)
    StDev[i] ~dunif(0,3)
  }
  for (i in 1:4) {
    PT[i] ~ dnorm(PTM,tauPT)
  }
  tauPT <- pow(sdPT,-2)
  sdPT ~dunif(0,3)
  PTM ~dnorm(0,0.01)
  WL ~dnorm(0,0.01)
  BL ~dnorm(0,1000)
  BW ~dnorm(0,1000)
  PC[1] <- 0
  PC[2]~dnorm(0,0.01)
  PC[3]~dnorm(0,0.01)
  
  FO ~dnorm(0,1000)
  TB ~dnorm(0,0.01)
  TW ~dnorm(0,0.01)

  Mul ~ dnorm(1,1) %_% I(0,2)
}

Inference for Bugs model at "C:/Users/Kees/AppData/Local/Temp/Rtmp4o0rhh/model16f468e854ce.txt", fit using jags,
 4 chains, each with 3000 iterations (first 1500 discarded)
 n.sims = 6000 iterations saved
             mu.vect sd.vect    2.5%     25%     50%     75%  97.5%  Rhat n.eff
BL            -0.029   0.014  -0.056  -0.038  -0.028  -0.019 -0.001 1.001  4400
BW            -0.005   0.025  -0.052  -0.023  -0.006   0.011  0.044 1.002  1900
FO             0.005   0.028  -0.050  -0.014   0.005   0.023  0.058 1.001  6000
Mul            1.166   0.149   0.819   1.087   1.176   1.254  1.433 1.028   130
PC[1]          0.000   0.000   0.000   0.000   0.000   0.000  0.000 1.000     1
PC[2]          0.066   0.141  -0.208  -0.021   0.061   0.147  0.360 1.002  2300
PC[3]         -0.362   0.070  -0.501  -0.404  -0.362  -0.319 -0.225 1.001  6000
PT[1]          1.111   0.397   0.516   0.864   1.059   1.286  2.074 1.021   150
PT[2]          1.019   0.379   0.437   0.783   0.974   1.186  1.925 1.019   160
PT[3]          0.728   0.170   0.397   0.615   0.728   0.840  1.068 1.002  2900
PT[4]          0.991   0.168   0.655   0.885   0.993   1.103  1.309 1.002  1600
StDev[1]       0.133   0.039   0.082   0.108   0.127   0.150  0.225 1.005   540
StDev[2]       0.304   0.075   0.192   0.251   0.292   0.343  0.488 1.003  1300
TB            -0.144   0.059  -0.264  -0.181  -0.144  -0.108 -0.025 1.001  4100
TW             0.084   0.059  -0.033   0.045   0.084   0.122  0.203 1.001  4400
WL             0.164   0.013   0.138   0.156   0.164   0.172  0.188 1.004   810
residual[1]    0.174   0.146  -0.111   0.079   0.173   0.268  0.464 1.002  1700
residual[2]    0.466   0.158   0.162   0.361   0.463   0.567  0.780 1.004   730
residual[3]    0.150   0.170  -0.173   0.041   0.147   0.253  0.499 1.003  1100
residual[4]   -0.416   0.162  -0.733  -0.523  -0.418  -0.308 -0.099 1.001  3800
residual[5]   -0.087   0.168  -0.419  -0.198  -0.084   0.026  0.238 1.005   560
residual[6]   -0.085   0.156  -0.397  -0.184  -0.084   0.016  0.221 1.003  1200
residual[7]   -0.056   0.159  -0.371  -0.156  -0.055   0.047  0.251 1.003   910
residual[8]   -0.203   0.157  -0.527  -0.304  -0.198  -0.100  0.095 1.001  6000
residual[9]    0.150   0.150  -0.139   0.052   0.148   0.247  0.451 1.001  6000
residual[10]   0.103   0.156  -0.200   0.003   0.101   0.206  0.415 1.004   720
residual[11]   0.133   0.160  -0.176   0.027   0.131   0.237  0.454 1.002  2100
residual[12]   0.335   0.177  -0.006   0.218   0.332   0.451  0.689 1.004   830
residual[13]  -0.436   0.156  -0.747  -0.536  -0.436  -0.337 -0.128 1.002  2100
residual[14]   0.098   0.162  -0.227  -0.007   0.099   0.205  0.410 1.004   670
residual[15]  -0.018   0.160  -0.340  -0.118  -0.015   0.084  0.292 1.003   920
residual[16]  -0.127   0.155  -0.441  -0.224  -0.125  -0.027  0.173 1.001  3600
residual[17]   0.037   0.088  -0.135  -0.018   0.037   0.093  0.215 1.002  1600
residual[18]  -0.088   0.090  -0.274  -0.141  -0.086  -0.031  0.081 1.002  2500
residual[19]  -0.074   0.088  -0.248  -0.129  -0.072  -0.018  0.100 1.002  1900
residual[20]  -0.079   0.088  -0.259  -0.133  -0.076  -0.023  0.091 1.001  3800
residual[21]  -0.037   0.087  -0.201  -0.093  -0.039   0.016  0.141 1.002  3000
residual[22]   0.051   0.087  -0.128  -0.001   0.053   0.107  0.221 1.001  4800
residual[23]  -0.008   0.084  -0.177  -0.061  -0.009   0.046  0.159 1.001  5500
residual[24]   0.129   0.086  -0.047   0.076   0.130   0.185  0.294 1.002  1900
residual[25]   0.196   0.087   0.030   0.141   0.196   0.249  0.370 1.003  1400
residual[26]  -0.027   0.084  -0.195  -0.081  -0.026   0.029  0.138 1.001  6000
residual[27]   0.070   0.088  -0.101   0.016   0.070   0.124  0.247 1.001  3700
residual[28]  -0.166   0.089  -0.355  -0.221  -0.163  -0.108  0.004 1.001  3700
residual[29]  -0.052   0.087  -0.223  -0.107  -0.053   0.002  0.124 1.001  4300
residual[30]   0.039   0.089  -0.139  -0.016   0.038   0.095  0.218 1.002  2500
residual[31]  -0.079   0.089  -0.245  -0.135  -0.080  -0.026  0.103 1.002  2300
residual[32]   0.048   0.085  -0.122  -0.006   0.049   0.102  0.214 1.002  2300
deviance     -15.555   7.026 -26.350 -20.655 -16.487 -11.540  0.877 1.004   750

For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

DIC info (using the rule, pD = var(deviance)/2)
pD = 24.6 and DIC = 9.0
DIC is an estimate of expected predictive error (lower deviance is better).
Striking in the results are the big residuals, for instance for observations 2, 4 and 13. The residuals for observations 4 and 13 are also big when a similar classical model is used, hence this is a clear indication of some kind of interaction.

Model with interactions

Adding the most obvious interactions, such as WingLength*TapedBody, did not really provide a suitable answer. Indeed, large residuals at observations 4 and 13, which are at opposite sides of the fractional factorial design, cannot be resolved with one interaction.
Hence I proceeded by adding all two-way interactions. Since this was expected to result in a model without clear estimates, all interactions had a strong prior: mean 0 and precision (tau) 1000. This model was subsequently reduced by giving the interactions which clearly differed from 0 a lower precision, while interactions which were clearly zero were removed. During this process the parameter Fold was removed from the parameter set. Finally, quadratic effects were added. There is one additional parameter, other; it has no function in the model, but it shows the properties of the prior used for the interactions. Parameters with a standard deviation smaller than that of other have had information added from the data.
jmodel <- function() {
  for (i in 1:n) {     
    premul[i] <- (test[i]==1)+Mul*(test[i]==2)
    mu[i] <- premul[i] * (
          WL*WingLength[i]+
          BL*BodyLength[i] + 
          PT[PaperType[i]] +
          BW*BodyWidth[i] +
          PC[PaperClip[i]] +
          TB*TapedBody[i]+
          TW*TapedWing[i]+
          
          WLBW*WingLength[i]*BodyWidth[i]+
          WLPC[1]*WingLength[i]*(PaperClip[i]==2)+
          WLPC[2]*WingLength[i]*(PaperClip[i]==3)+
          
          BLPT[1]*BodyLength[i]*(PaperType[i]==2)+
          BLPT[2]*BodyLength[i]*(PaperType[i]==3)+
          BLPC[1]*BodyLength[i]*(PaperClip[i]==2)+
          BLPC[2]*BodyLength[i]*(PaperClip[i]==3)+
          
          BWPC[1]*BodyWidth[i]*(PaperClip[i]==2)+
          BWPC[2]*BodyWidth[i]*(PaperClip[i]==3) +
          
          WLWL*WingLength[i]*WingLength[i]+
          BLBL*BodyLength[i]*BodyLength[i]+
          BWBW*BodyWidth[i]*BodyWidth[i]
          
          
          )
    Time[i] ~ dnorm(mu[i],tau[test[i]])
    residual[i] <- Time[i]-mu[i]
  }
  for (i in 1:2) {
    tau[i] <- pow(StDev[i],-2)
    StDev[i] ~dunif(0,3)
    WLPC[i] ~dnorm(0,1)
    BLPT[i] ~dnorm(0,1)
    BLPC[i] ~dnorm(0,1) 
    BWPC[i] ~dnorm(0,1)      
  }
  for (i in 1:3) {
    PT[i] ~ dnorm(PTM,tauPT)
  }
  tauPT <- pow(sdPT,-2)
  sdPT ~dunif(0,3)
  PTM ~dnorm(0,0.01)
  WL ~dnorm(0,0.01) 
  BL ~dnorm(0,0.01)
  BW ~dnorm(0,0.01)
  PC[1] <- 0
  PC[2]~dnorm(0,0.01)
  PC[3]~dnorm(0,0.01) 
  TB ~dnorm(0,0.01)
  TW ~dnorm(0,0.01)
  
  WLBW~dnorm(0,1)
  WLTW~dnorm(0,1)
  
  WLWL~dnorm(0,1)
  BLBL~dnorm(0,1) 
  BWBW~dnorm(0,1)
  
  other~dnorm(0,1)
  Mul ~ dnorm(1,1) %_% I(0,2)
}
Inference for Bugs model at "C:/Users/Kees/AppData/Local/Temp/Rtmp4o0rhh/model16f472b05364.txt", fit using jags,
 5 chains, each with 4000 iterations (first 2000 discarded), n.thin = 2
 n.sims = 5000 iterations saved
             mu.vect sd.vect    2.5%     25%     50%     75%  97.5%  Rhat n.eff
BL             0.021   0.197  -0.367  -0.080   0.027   0.121  0.396 1.021   590
BLBL          -0.001   0.015  -0.027  -0.009  -0.003   0.006  0.031 1.015  1200
BLPC[1]       -0.099   0.105  -0.295  -0.125  -0.086  -0.053  0.021 1.100   560
BLPC[2]       -0.110   0.111  -0.334  -0.134  -0.094  -0.060  0.018 1.130   250
BLPT[1]       -0.038   0.190  -0.503  -0.124   0.001   0.069  0.286 1.005   600
BLPT[2]        0.058   0.038  -0.031   0.045   0.063   0.078  0.113 1.063   400
BW            -0.430   0.558  -1.587  -0.657  -0.389  -0.143  0.463 1.045   960
BWBW           0.009   0.094  -0.160  -0.031   0.009   0.052  0.176 1.053  1300
BWPC[1]       -0.224   0.173  -0.615  -0.295  -0.209  -0.133  0.064 1.011  5000
BWPC[2]       -0.093   0.101  -0.285  -0.137  -0.091  -0.044  0.085 1.040  5000
Mul            1.053   0.145   0.680   0.997   1.069   1.139  1.281 1.098   290
PC[1]          0.000   0.000   0.000   0.000   0.000   0.000  0.000 1.000     1
PC[2]          1.459   2.367  -3.571   0.333   1.565   2.617  6.138 1.019   420
PC[3]          0.401   0.732  -0.619   0.032   0.309   0.629  1.954 1.074   320
PT[1]          1.353   1.437  -1.364   0.556   1.318   2.088  4.128 1.032   480
PT[2]          1.906   1.767  -1.087   0.828   1.726   2.814  5.879 1.013  1300
PT[3]          0.731   1.419  -1.864  -0.058   0.682   1.444  3.535 1.032   520
StDev[1]       0.108   0.082   0.045   0.067   0.088   0.120  0.302 1.023   450
StDev[2]       0.267   0.156   0.122   0.177   0.229   0.301  0.706 1.021   390
TB            -0.146   0.051  -0.247  -0.172  -0.145  -0.119 -0.048 1.011  5000
TW             0.086   0.054  -0.007   0.055   0.082   0.112  0.204 1.010  1700
WL             0.209   0.380  -0.496   0.007   0.188   0.394  1.035 1.014   670
WLBW           0.051   0.062  -0.013   0.026   0.043   0.062  0.167 1.159   220
WLPC[1]        0.057   0.210  -0.304  -0.063   0.024   0.152  0.556 1.004  1600
WLPC[2]        0.020   0.027  -0.031   0.010   0.021   0.033  0.066 1.044  2400
WLWL          -0.014   0.026  -0.072  -0.026  -0.011   0.001  0.032 1.014  5000
other          0.002   1.007  -1.973  -0.680   0.000   0.684  1.952 1.002  2200
residual[1]    0.227   0.272  -0.178   0.066   0.190   0.334  0.935 1.041   390
residual[2]    0.035   0.231  -0.447  -0.084   0.037   0.160  0.503 1.007  2500
residual[3]    0.026   0.269  -0.404  -0.118  -0.002   0.131  0.587 1.039   430
residual[4]   -0.123   0.279  -0.542  -0.276  -0.157  -0.018  0.530 1.053   370
residual[5]   -0.046   0.241  -0.535  -0.168  -0.043   0.083  0.422 1.008  5000
residual[6]   -0.094   0.241  -0.568  -0.221  -0.095   0.035  0.390 1.005  2600
residual[7]    0.284   0.268  -0.139   0.140   0.263   0.392  0.861 1.046   430
residual[8]    0.018   0.240  -0.460  -0.107   0.022   0.144  0.494 1.006  5000
residual[9]    0.121   0.299  -0.310  -0.042   0.079   0.223  0.827 1.054   300
residual[10]   0.038   0.237  -0.428  -0.086   0.034   0.155  0.518 1.006  3100
residual[11]  -0.077   0.251  -0.562  -0.204  -0.073   0.046  0.401 1.020  5000
residual[12]   0.153   0.262  -0.286   0.013   0.133   0.267  0.711 1.035   610
residual[13]  -0.024   0.244  -0.466  -0.160  -0.035   0.095  0.493 1.008  5000
residual[14]  -0.019   0.244  -0.537  -0.140  -0.013   0.111  0.456 1.006  5000
residual[15]  -0.159   0.250  -0.663  -0.281  -0.156  -0.038  0.302 1.026   860
residual[16]   0.034   0.273  -0.531  -0.076   0.056   0.178  0.486 1.037   410
residual[17]   0.001   0.115  -0.185  -0.057  -0.008   0.047  0.232 1.047   890
residual[18]   0.016   0.105  -0.187  -0.038   0.017   0.067  0.211 1.014  3300
residual[19]  -0.068   0.108  -0.262  -0.118  -0.068  -0.017  0.127 1.036  5000
residual[20]   0.067   0.114  -0.138   0.017   0.067   0.115  0.270 1.046  4500
residual[21]   0.003   0.117  -0.223  -0.046   0.007   0.057  0.203 1.044  3200
residual[22]  -0.004   0.113  -0.202  -0.059  -0.007   0.044  0.211 1.035  2000
residual[23]  -0.039   0.134  -0.313  -0.081  -0.023   0.027  0.145 1.097   300
residual[24]   0.009   0.114  -0.197  -0.042   0.009   0.061  0.223 1.039  5000
residual[25]   0.045   0.110  -0.170  -0.005   0.046   0.095  0.248 1.028  5000
residual[26]  -0.044   0.108  -0.252  -0.096  -0.043   0.007  0.165 1.024  4000
residual[27]   0.046   0.112  -0.164  -0.005   0.046   0.100  0.264 1.022  3600
residual[28]  -0.062   0.115  -0.296  -0.104  -0.053  -0.004  0.112 1.047  1400
residual[29]  -0.025   0.143  -0.321  -0.064  -0.006   0.042  0.153 1.110   230
residual[30]  -0.016   0.118  -0.228  -0.066  -0.015   0.037  0.196 1.042  1400
residual[31]  -0.025   0.115  -0.239  -0.072  -0.021   0.028  0.174 1.047  1300
residual[32]   0.020   0.111  -0.176  -0.033   0.017   0.066  0.233 1.041  2600
deviance     -32.864  19.923 -62.354 -46.843 -35.763 -22.807 16.481 1.014   420

For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

DIC info (using the rule, pD = var(deviance)/2)
pD = 196.7 and DIC = 163.8
DIC is an estimate of expected predictive error (lower deviance is better).

Model discussion

This model does not have the big residuals. In addition, it seems that some parameters, e.g. WLWL and WLBW, have small mean values and small standard deviations. To me this suggests that they are indeed estimated and found to be close to 0. After all, if the data contained no information, their standard deviations would be similar to that of the prior, which is much larger, as seen from the other parameter.
The quadratic effects were added to allow detection of a maximum. There is not much evidence of these effects, except perhaps for WingLength (parameter WLWL).
For descriptive purposes, I will leave these parameters in. However, for predictive purposes, it may be better to remove them or shrink them closer to zero.
Given the complex way in which the parameters were chosen, it is quite possible that a different model would be better. In hindsight, I might have used BMA to do a more thorough selection. Thus the model needs some more validation. Since I found two additional data sets online, these might be used for that purpose.

Code

h1 <- read.table(header=TRUE,text='
        PaperType WingLength BodyLength BodyWidth PaperClip Fold TapedBody TapedWing Time
regular1 7.62 7.62 3.175 No No No No 2.5
bond 7.62 7.62 3.175 Yes No Yes Yes 2.9
regular1 12.065 7.62 3.175 Yes Yes No Yes 3.5
bond 12.065 7.62 3.175 No Yes Yes No 2.7
regular1 7.62 12.065 3.175 Yes Yes Yes No 2
bond 7.62 12.065 3.175 No Yes No Yes 2.3
regular1 12.065 12.065 3.175 No No Yes Yes 2.9
bond 12.065 12.065 3.175 Yes No No No 3
regular1 7.62 7.62 5.08 No Yes Yes Yes 2.4
bond 7.62 7.62 5.08 Yes Yes No No 2.6
regular1 12.065 7.62 5.08 Yes No Yes No 3.2
bond 12.065 7.62 5.08 No No No Yes 3.7
regular1 7.62 12.065 5.08 Yes No No Yes 1.9
bond 7.62 12.065 5.08 No No Yes No 2.2
regular1 12.065 12.065 5.08 No Yes No No 3
bond 12.065 12.065 5.08 Yes Yes Yes Yes 3
')

h2 <- read.table(header=TRUE,text='
        PaperType BodyWidth BodyLength WingLength PaperClip Fold TapedBody TapedWing Time
regular2 2.54 3.81 5.08 No No No No 1.74
construction 2.54 3.81 5.08 No Yes Yes Yes 1.296
regular2 3.81 3.81 5.08 Yes No Yes Yes 1.2
construction 3.81 3.81 5.08 Yes Yes No No 0.996
regular2 2.54 7.62 5.08 Yes Yes Yes No 1.056
construction 2.54 7.62 5.08 Yes No No Yes 1.104
regular2 3.81 7.62 5.08 No Yes No Yes 1.668
construction 3.81 7.62 5.08 No No Yes No 1.308
regular2 2.54 3.81 10.16 Yes Yes No Yes 2.46
construction 2.54 3.81 10.16 Yes No Yes No 1.74
regular2 3.81 3.81 10.16 No Yes Yes No 2.46
construction 3.81 3.81 10.16 No No No Yes 2.184
regular2 2.54 7.62 10.16 No No Yes Yes 2.316
construction 2.54 7.62 10.16 No Yes No No 2.208
regular2 3.81 7.62 10.16 Yes No No No 1.98
construction 3.81 7.62 10.16 Yes Yes Yes Yes 1.788
')

l1 <- lm(Time ~ PaperType + WingLength + BodyLength + BodyWidth +
        PaperClip + Fold + TapedBody + TapedWing, data=h1)
summary(l1)
residuals(l1)

l2 <- lm(Time ~ PaperType + WingLength + BodyLength + BodyWidth +
        PaperClip + Fold + TapedBody + TapedWing, data=h2)
summary(l2)

h1$test <- 'WH'
# WingLength, BodyLength
h2$test <- 'RH'
# WingLength, PaperClip, PaperType

helis <- rbind(h1,h2)
helis$test <- factor(helis$test)

helis$PaperClip2 <- factor(ifelse(helis$PaperClip=='No','No',as.character(helis$test)),
    levels=c('No','WH','RH'))

library(R2jags)
datain <- list(
    PaperType=c(1:4)[helis$PaperType],
    WingLength=helis$WingLength,
    BodyLength=helis$BodyLength,
    BodyWidth=helis$BodyWidth,
    PaperClip=c(1,2,3)[helis$PaperClip2],
    Fold=c(0,1)[helis$Fold],
    TapedBody=c(0,1)[helis$TapedBody],
    TapedWing=c(0,1)[helis$TapedWing],
    test=c(1,2)[helis$test],
    Time=helis$Time,
    n=nrow(helis))
parameters <- c('Mul','WL','BL','PT','BW','PC','FO','TB','TW','StDev','residual')

jmodel <- function() {
  for (i in 1:n) {     
    premul[i] <- (test[i]==1)+Mul*(test[i]==2)
    mu[i] <- premul[i] * (
          WL*WingLength[i]+
          BL*BodyLength[i] + 
          PT[PaperType[i]] +
          BW*BodyWidth[i] +
          PC[PaperClip[i]] +
          FO*Fold[i]+
          TB*TapedBody[i]+
          TW*TapedWing[i]
          )
    Time[i] ~ dnorm(mu[i],tau[test[i]])
    residual[i] <- Time[i]-mu[i]
  }
  for (i in 1:2) {
    tau[i] <- pow(StDev[i],-2)
    StDev[i] ~dunif(0,3)
  }
  for (i in 1:4) {
    PT[i] ~ dnorm(PTM,tauPT)
  }
  tauPT <- pow(sdPT,-2)
  sdPT ~dunif(0,3)
  PTM ~dnorm(0,0.01)
  WL ~dnorm(0,0.01)
  BL ~dnorm(0,1000)
  BW ~dnorm(0,1000)
  PC[1] <- 0
  PC[2]~dnorm(0,0.01)
  PC[3]~dnorm(0,0.01)
  
  FO ~dnorm(0,1000)
  TB ~dnorm(0,0.01)
  TW ~dnorm(0,0.01)

  Mul ~ dnorm(1,1) %_% I(0,2)
}

jj <- jags(model.file=jmodel,
    data=datain,
    parameters=parameters,
    progress.bar='gui',
    n.chain=4,
    n.iter=3000,
    inits=function() list(Mul=1.3,WL=0.15,BL=-.08,PT=rep(1,4),
          BW=0,PC=c(NA,0,0),FO=0,TB=0,TW=0))
jj

#################################
datain <- list(
    PaperType=c(2,1,3,1)[helis$PaperType],
    WingLength=helis$WingLength,
    BodyLength=helis$BodyLength,
    BodyWidth=helis$BodyWidth,
    PaperClip=c(1,2,3)[helis$PaperClip2],
    TapedBody=c(0,1)[helis$TapedBody],
    TapedWing=c(0,1)[helis$TapedWing],
    test=c(1,2)[helis$test],
    Time=helis$Time,
    n=nrow(helis))

parameters <- c('Mul','WL','BL','PT','BW','PC','TB','TW','StDev','residual',
    'WLBW','WLPC',            'WLWL',
    'BLPT'       ,'BLPC',     'BLBL',
    'BWPC',                   'BWBW',  'other')

jmodel <- function() {
  for (i in 1:n) {     
    premul[i] <- (test[i]==1)+Mul*(test[i]==2)
    mu[i] <- premul[i] * (
          WL*WingLength[i]+
          BL*BodyLength[i] + 
          PT[PaperType[i]] +
          BW*BodyWidth[i] +
          PC[PaperClip[i]] +
          TB*TapedBody[i]+
          TW*TapedWing[i]+
          
          WLBW*WingLength[i]*BodyWidth[i]+
          WLPC[1]*WingLength[i]*(PaperClip[i]==2)+
          WLPC[2]*WingLength[i]*(PaperClip[i]==3)+
          
          BLPT[1]*BodyLength[i]*(PaperType[i]==2)+
          BLPT[2]*BodyLength[i]*(PaperType[i]==3)+
          BLPC[1]*BodyLength[i]*(PaperClip[i]==2)+
          BLPC[2]*BodyLength[i]*(PaperClip[i]==3)+
          
          BWPC[1]*BodyWidth[i]*(PaperClip[i]==2)+
          BWPC[2]*BodyWidth[i]*(PaperClip[i]==3) +
          
          WLWL*WingLength[i]*WingLength[i]+
          BLBL*BodyLength[i]*BodyLength[i]+
          BWBW*BodyWidth[i]*BodyWidth[i]
          
          
          )
    Time[i] ~ dnorm(mu[i],tau[test[i]])
    residual[i] <- Time[i]-mu[i]
  }
  for (i in 1:2) {
    tau[i] <- pow(StDev[i],-2)
    StDev[i] ~dunif(0,3)
    WLPC[i] ~dnorm(0,1)
    BLPT[i] ~dnorm(0,1)
    BLPC[i] ~dnorm(0,1) 
    BWPC[i] ~dnorm(0,1)      
  }
  for (i in 1:3) {
    PT[i] ~ dnorm(PTM,tauPT)
  }
  tauPT <- pow(sdPT,-2)
  sdPT ~dunif(0,3)
  PTM ~dnorm(0,0.01)
  WL ~dnorm(0,0.01) 
  BL ~dnorm(0,0.01)
  BW ~dnorm(0,0.01)
  PC[1] <- 0
  PC[2]~dnorm(0,0.01)
  PC[3]~dnorm(0,0.01) 
  TB ~dnorm(0,0.01)
  TW ~dnorm(0,0.01)
  
  WLBW~dnorm(0,1)
  WLTW~dnorm(0,1)
  
  WLWL~dnorm(0,1)
  BLBL~dnorm(0,1) 
  BWBW~dnorm(0,1)
  
  other~dnorm(0,1)
  Mul ~ dnorm(1,1) %_% I(0,2)
}

jj <- jags(model.file=jmodel,
    data=datain,
    parameters=parameters,
    progress.bar='gui',
    n.chain=5,
    n.iter=4000,
    inits=function() list(Mul=1.3,WL=0.15,BL=-.08,PT=rep(1,3),
          PC=c(NA,0,0),TB=0,TW=0))
jj


infuser: a template engine for R

(This article was first published on FishyOperations, and kindly contributed to R-bloggers)

Version 0.1 of infuser was just released on CRAN.

infuser is a very basic templating engine. It allows you to replace parameters in character strings or text files with specified values.

I often include SQL code in my R scripts and want to make parameters in that SQL code variable. But I was always a bit hesitant about the cleanliness of this. It irked me so much that I finally wrote this very simple package. The infuse() function allows me to read in a text file (or a character string, for that matter) and infuse it with the requested values.

This means that I can now put my SQL code in a separate .sql file :), load it into a character string with all the desired values in place, and use it to my liking. Of course, the same thing can be done with any text file (e.g. an .html file).
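A small usage sketch (my example, assuming infuser's default {{variable}} placeholder syntax and a made-up query):

library(infuser)
sql <- "SELECT * FROM matches WHERE season = {{season}} AND club = '{{club}}'"
infuse(sql, season = 2015, club = "Barcelona")
# the placeholders are replaced, giving: SELECT * FROM matches WHERE season = 2015 AND club = 'Barcelona'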

Information on usage and examples can be found here: http://github.com/Bart6114/infuser


random 0.2.4

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A new release of our random package for truly (hardware-based) random numbers as provided by random.org is now on CRAN.

The R 3.2.0 release brought the change to use an internal method="libcurl", which we now use if available; otherwise the curl::curl() method added in release 0.2.3 is used. We are also a little more explicit about closing connections, and added really basic regression tests -- as it is hard to test draws from hardware-based RNGs.

Courtesy of CRANberries comes a diffstat report for this release. Current and previous releases are available here as well as on CRAN.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on his blog: Thinking inside the box .


DataCamp R Certifications – Now Available on Your LinkedIn Profile


(This article was first published on The DataCamp Blog » R, and kindly contributed to R-bloggers)

You can now quickly and easily integrate your DataCamp.com course completion certificates with your LinkedIn profile. Showcase any completed DataCamp course on your profile, including our free Introduction to R course.


Share with managers, recruiters and potential clients the skills and knowledge you’ve acquired at DataCamp. In just a few clicks, you can show off your R and data science skills to your entire professional network. Just follow these easy steps:

  1. Select “My Accomplishments” in the top-right corner, and click “Add to LinkedIn profile”.
  2. You will go automagically to your LinkedIn profile, where you just have to click “Add to Profile”.
  3. Scroll down to Certifications, and there it is! Your personalized certification.

The post DataCamp R Certifications – Now Available on Your LinkedIn Profile appeared first on The DataCamp Blog .

To leave a comment for the author, please follow the link and comment on his blog: The DataCamp Blog » R.


Introductory Time-Series analysis of US Environmental Protection Agency (EPA) pollution data


(This article was first published on R Video tutorial for Spatial Statistics, and kindly contributed to R-bloggers)
Download EPA air pollution data
The US Environmental Protection Agency (EPA) provides tons of free data about air pollution and other weather measurements through their website. An overview of their offer is available here: http://www.epa.gov/airdata/

The data are provided in hourly, daily and annual averages for the following parameters:
Ozone, SO2, CO, NO2, PM2.5 FRM/FEM Mass, PM2.5 non-FRM/FEM Mass, PM10, Wind, Temperature, Barometric Pressure, RH and Dewpoint, HAPs (Hazardous Air Pollutants), VOCs (Volatile Organic Compounds) and Lead.

All the files are accessible from this page:
http://aqsdr1.epa.gov/aqsweb/aqstmp/airdata/download_files.html

The web links to download the zip files are very similar to each other, they have an initial starting URL: http://aqsdr1.epa.gov/aqsweb/aqstmp/airdata/
and then the name of the file has the following format: type_property_year.zip
The type can be: hourly, daily or annual. The properties are sometimes written as text and sometimes using a numeric ID. Everything is separated by an underscore.

Since these files are identified by consistent URLs, I created a function in R that takes year, property and type as arguments, downloads and unzips the data (into the working directory) and reads the csv.
To complete this experiment we would need the following packages: sp, raster, xts and plotGoogleMaps.
The code for this function is the following:

download.EPA <- function(year, property = c("ozone","so2","co","no2","pm25.frm","pm25","pm10","wind","temp","pressure","dewpoint","hap","voc","lead"), type=c("hourly","daily","annual")){
  # map the property name onto the code used in the EPA file names
  if(property=="ozone"){PROP="44201"}
  if(property=="so2"){PROP="42401"}
  if(property=="co"){PROP="42101"}
  if(property=="no2"){PROP="42602"}
 
  if(property=="pm25.frm"){PROP="88101"}
  if(property=="pm25"){PROP="88502"}
  if(property=="pm10"){PROP="81102"}
 
  if(property=="wind"){PROP="WIND"}
  if(property=="temp"){PROP="TEMP"}
  if(property=="pressure"){PROP="PRESS"}
  if(property=="dewpoint"){PROP="RH_DP"}
  if(property=="hap"){PROP="HAPS"}
  if(property=="voc"){PROP="VOCS"}
  if(property=="lead"){PROP="lead"}
 
  # build the URL, download and unzip the file in the working directory, then read the csv
  URL <- paste0("http://aqsdr1.epa.gov/aqsweb/aqstmp/airdata/",type,"_",PROP,"_",year,".zip")
  download.file(URL,destfile=paste0(type,"_",PROP,"_",year,".zip"))
  unzip(paste0(type,"_",PROP,"_",year,".zip"),exdir=paste0(getwd()))
  read.table(paste0(type,"_",PROP,"_",year,".csv"),sep=",",header=T)
}

This function can be used as follow to create a data.frame with exactly the data we are looking for:

data <- download.EPA(year=2013,property="ozone",type="daily")

This creates a data.frame object with the following characteristics:

> str(data)
'data.frame': 390491 obs. of 28 variables:
$ State.Code : int 1 1 1 1 1 1 1 1 1 1 ...
$ County.Code : int 3 3 3 3 3 3 3 3 3 3 ...
$ Site.Num : int 10 10 10 10 10 10 10 10 10 10 ...
$ Parameter.Code : int 44201 44201 44201 44201 44201 44201 44201 44201 44201 44201 ...
$ POC : int 1 1 1 1 1 1 1 1 1 1 ...
$ Latitude : num 30.5 30.5 30.5 30.5 30.5 ...
$ Longitude : num -87.9 -87.9 -87.9 -87.9 -87.9 ...
$ Datum : Factor w/ 4 levels "NAD27","NAD83",..: 2 2 2 2 2 2 2 2 2 2 ...
$ Parameter.Name : Factor w/ 1 level "Ozone": 1 1 1 1 1 1 1 1 1 1 ...
$ Sample.Duration : Factor w/ 1 level "8-HR RUN AVG BEGIN HOUR": 1 1 1 1 1 1 1 1 1 1 ...
$ Pollutant.Standard : Factor w/ 1 level "Ozone 8-Hour 2008": 1 1 1 1 1 1 1 1 1 1 ...
$ Date.Local : Factor w/ 365 levels "2013-01-01","2013-01-02",..: 59 60 61 62 63 64 65 66 67 68 ...
$ Units.of.Measure : Factor w/ 1 level "Parts per million": 1 1 1 1 1 1 1 1 1 1 ...
$ Event.Type : Factor w/ 3 levels "Excluded","Included",..: 3 3 3 3 3 3 3 3 3 3 ...
$ Observation.Count : int 1 24 24 24 24 24 24 24 24 24 ...
$ Observation.Percent: num 4 100 100 100 100 100 100 100 100 100 ...
$ Arithmetic.Mean : num 0.03 0.0364 0.0344 0.0288 0.0345 ...
$ X1st.Max.Value : num 0.03 0.044 0.036 0.042 0.045 0.045 0.045 0.048 0.057 0.059 ...
$ X1st.Max.Hour : int 23 10 18 10 9 10 11 12 12 10 ...
$ AQI : int 25 37 31 36 38 38 38 41 48 50 ...
$ Method.Name : Factor w/ 1 level " - ": 1 1 1 1 1 1 1 1 1 1 ...
$ Local.Site.Name : Factor w/ 1182 levels ""," 201 CLINTON ROAD, JACKSON",..: 353 353 353 353 353 353 353 353 353 353 ...
$ Address : Factor w/ 1313 levels " Edgewood Chemical Biological Center (APG), Waehli Road",..: 907 907 907 907 907 907 907 907 907 907 ...
$ State.Name : Factor w/ 53 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
$ County.Name : Factor w/ 631 levels "Abbeville","Ada",..: 32 32 32 32 32 32 32 32 32 32 ...
$ City.Name : Factor w/ 735 levels "Adams","Air Force Academy",..: 221 221 221 221 221 221 221 221 221 221 ...
$ CBSA.Name : Factor w/ 414 levels "","Adrian, MI",..: 94 94 94 94 94 94 94 94 94 94 ...
$ Date.of.Last.Change: Factor w/ 169 levels "2013-05-17","2013-07-01",..: 125 125 125 125 125 125 125 125 125 125 ...

The csv file contains a long series of columns that should again be consistent among all the datasets cited above, even though it changes slightly between hourly, daily and annual averages.
A complete list of the meaning of all the columns is available here:
aqsdr1.epa.gov/aqsweb/aqstmp/airdata/FileFormats.html

Some of the columns are self explanatory, such as the various geographical names associated with the location of the measuring stations. For this analysis we are particularly interested in the address (that we can use to extract data from individual stations), event type (that tells us if extreme weather events are part of the averages), the date and the actual data (available in the column Arithmetic.Mean).
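
For instance, a quick way to keep only those columns (names as shown in the str() output above) is:

data.sub <- data[, c("Address", "Event.Type", "Date.Local", "Arithmetic.Mean")]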


Extracting data for individual stations
The data.frame we loaded using the function download.EPA contains Ozone measurements from all over the country. To perform any kind of analysis we first need a way to identify and then subset the stations we are interested in.
To do so, I thought about using one of the interactive visualizations I presented in the previous post. To use that we first need to transform the csv into a spatial object. We can use the following loop to achieve that:

library(sp)      # provides coordinates() and CRS()
library(raster)  # provides projection()
 
locations <- data.frame(ID=numeric(),LON=numeric(),LAT=numeric(),OZONE=numeric(),AQI=numeric())
for(i in unique(data$Address)){
  dat <- data[data$Address==i,]
  locations[which(i==unique(data$Address)),] <- data.frame(which(i==unique(data$Address)),unique(dat$Longitude),unique(dat$Latitude),round(mean(dat$Arithmetic.Mean,na.rm=T),2),round(mean(dat$AQI,na.rm=T),0))
}
 
locations$ADDRESS <- unique(data$Address)
 
coordinates(locations)=~LON+LAT
projection(locations)=CRS("+init=epsg:4326")

First of all we create an empty data.frame declaring the type of variable for each column. With this loop we can eliminate all the information we do not need from the dataset and keep the one we want to show and analyse. In this case I kept Ozone and the Air Quality Index (AQI), but you can clearly include more if you wish.
In the loop we iterate through the addresses of each EPA station; for each one we first subset the main dataset to keep only the data related to that station, and then we fill the data.frame with the coordinates of the station and the mean values of Ozone and AQI.
When the loop is over (it may take a while!), we can add the addresses to it and transform it into a SpatialObject. We also need to declare the projection of the coordinates, which is WGS84.
Now we are ready to create an interactive map using the package plotGoogleMaps and the Google Maps API. We can simply use the following line:

library(plotGoogleMaps)
map <- plotGoogleMaps(locations,zcol="AQI",filename="EPA_GoogleMaps.html",layerName="EPA Stations")

This creates a map with a marker for each EPA station, coloured with the mean AQI. If we click on a marker we can see the ID of the station, the mean Ozone value and the address (below). The EPA map I created is shown at this link: EPA_GoogleMaps


From this map we can obtain information regarding the EPA stations, which we can use to extract values for individual stations from the dataset.
For example, we can extract values using the ID we created in the loop or the address of the station, which is also available on the Google Map, using the code below:

ID = 135
Ozone <- data[paste(data$Address)==unique(data$Address)[ID]&paste(data$Event.Type)=="None",]
 
ADDRESS = "966 W 32ND"
Ozone <- data[paste(data$Address)==ADDRESS&paste(data$Event.Type)=="None",]

Once we have extracted only data for a single station we can proceed with the time-series analysis.


Time-Series Analysis
There are two ways to tell R that a particular vector or data.frame is in fact a time-series. We have the function ts, available in base R, and the function xts, available in the package xts.
I will first analyse how to use xts, since this is probably the best way of handling time-series.
The first thing we need to do is make sure that our data have a column of class Date. This is done by transforming the current date values into the proper class. The EPA dataset has a Date.Local column that R reads as a factor:

> str(Ozone$Date.Local)
Factor w/ 365 levels "2013-01-01","2013-01-02",..: 90 91 92 93 94 95 96 97 98 99 ...

We can transform this into the class Date using the following line, which creates a new column named DATE in the Ozone object:

Ozone$DATE <- as.Date(Ozone$Date.Local)

Now we can use the function xts to create a time-series object:

Ozone.TS <- xts(x=Ozone$Arithmetic.Mean,order.by=Ozone$DATE)
plot(Ozone.TS,main="Ozone Data",sub="Year 2013")

The first line creates the time-series using the Ozone data and the DATE column we created above. The second line plots the time-series and produces the image below:



To extract the dates of the object Ozone.TS we can use the function index, and the function coredata to extract the ozone values.

index(Ozone.TS)
Date[1:183], format: "2013-03-31" "2013-04-01" "2013-04-02" "2013-04-03" ...
 
coredata(Ozone.TS)
num [1:183, 1] 0.044 0.0462 0.0446 0.0383 0.0469 ...


Subsetting the time-series is super easy in the package xts, as you can see from the code below:

Ozone.TS['2013-05-06'] #Selection of a single day
 
Ozone.TS['2013-03'] #Selection of March data
 
Ozone.TS['2013-05/2013-07'] #Selection by time range

The first line extracts values for a single day (remember that the format is year-month-day); the second extracts values from the month of March. We can use the same method to extract values from one particular year, if we have time-series with multiple years.
The last line extracts values in a particular time range, notice the use of the forward slash to divide the start and end of the range.

We can also extract values by attributes, using the functions index and coredata. For example, if we need to know which days the ozone level was above 0.03 ppm we can simply use the following line:

index(Ozone.TS[coredata(Ozone.TS)>0.03,])

The package xts features some handy functions to apply custom functions to specific time intervals along the time-series. These functions are: apply.weekly, apply.monthly, apply.quarterly and apply.yearly.

The use of these functions is similar to the use of the apply function. Let us look at the example below to clarify:

apply.weekly(Ozone.TS,FUN=mean)
apply.monthly(Ozone.TS,FUN=max)

The first line calculates the mean value of ozone for each week, while the second computes the maximum value for each month. As for the function apply we are not constrained to apply functions that are available in R, but we can define our own:

apply.monthly(Ozone.TS,FUN=function(x) {sd(x)/sqrt(length(x))})

In this case, for example, we define a function to calculate the standard error of the mean for each month.

We can use these functions to create a simple plot that shows averages for defined time intervals with the following code:

plot(Ozone.TS,main="Ozone Data",sub="Year 2013")
lines(apply.weekly(Ozone.TS,FUN=mean),col="red")
lines(apply.monthly(Ozone.TS,FUN=mean),col="blue")
lines(apply.quarterly(Ozone.TS,FUN=mean),col="green")
lines(apply.yearly(Ozone.TS,FUN=mean),col="pink")

These lines return the following plot:


From this image it is clear that ozone presents a general decreasing trend over 2013 for this particular station. However, in R there are more precise ways of assessing the trend and seasonality of time-series.

Trends
Let us create another example where we use again the function download.EPA to download NO2 data over 3 years and then assess their trends.

NO2.2013.DATA <- download.EPA(year=2013,property="no2",type="daily")
NO2.2012.DATA <- download.EPA(year=2012,property="no2",type="daily")
NO2.2011.DATA <- download.EPA(year=2011,property="no2",type="daily")
 
ADDRESS = "2 miles south of Ouray and south of the White and Green River confluence" #Copied and pasted from the interactive map
NO2.2013 <- NO2.2013.DATA[paste(NO2.2013.DATA$Address)==ADDRESS&paste(NO2.2013.DATA$Event.Type)=="None",]
NO2.2012 <- NO2.2012.DATA[paste(NO2.2012.DATA$Address)==ADDRESS&paste(NO2.2012.DATA$Event.Type)=="None",]
NO2.2011 <- NO2.2011.DATA[paste(NO2.2011.DATA$Address)==ADDRESS&paste(NO2.2011.DATA$Event.Type)=="None",]
 
 
NO2.TS <- ts(c(NO2.2011$Arithmetic.Mean,NO2.2012$Arithmetic.Mean,NO2.2013$Arithmetic.Mean),frequency=365,start=c(2011,1))

The first lines should be clear from what we said before. The only change is that the time-series is created using the function ts, available in base R. With ts we do not have to create a column of class Date in our dataset; we can just specify the starting point of the time series (using the option start, which in this case is January 2011) and the number of samples per year with the option frequency. In this case the data were collected daily so the number of times per year is 365; if we had a time-series with data collected monthly we would specify a frequency of 12.
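
For instance, a monthly series starting in January 2011 would be created like this (x is a hypothetical vector of 36 monthly values):

x <- rnorm(36)
monthly.TS <- ts(x, frequency=12, start=c(2011,1))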

We can decompose the time-series using the function decompose, which is based on moving averages:

dec <- decompose(NO2.TS)
plot(dec)

The related plot is presented below:



There is also another method, based on the loess smoother (for more info: Article) that can be accessed using the function stl:

STL <- stl(NO2.TS,"periodic")
plot(STL)

This function is able to calculate the trend along the whole length of the time-series:



Conclusions
This example shows how to download and access the open pollution data for the US available from the EPA directly from R.
Moreover we have seen here how to map the locations of the stations and subset the dataset. We also looked at ways to perform some introductory time-series analysis on pollution data.
For more information and material regarding time-series analysis please refer to the following references:

A Little Book of R For Time Series

Analysis of integrated and cointegrated time series with R

Introductory time series with R




To leave a comment for the author, please follow the link and comment on his blog: R Video tutorial for Spatial Statistics.


Data Science Tweet Analysis – What tools are people talking about?


(This article was first published on Mango Solutions, and kindly contributed to R-bloggers)

By Chris Musselle PhD, Mango UK

At Mango we use a variety of tools in-house to address our clients’ business needs and when these fall within the data science arena, the main candidates we turn to are either the R or Python programming languages.

The question as to which is the “best” language for doing data science is a hotly debated topic ([link] [link] [link] [link]), with both languages having their pros and cons. However the capabilities of each are expanding all the time thanks to continuous open source development in both areas.

With both languages becoming increasingly popular for data analysis, we thought it would be interesting to track current trends and see what people are saying about these and other tools for data science on Twitter.

This post is the first of three that will look into the results of our analysis, but first a bit of background.

 

Twitter Analysis

Today many companies are routinely drawing on social media data sources such as Twitter and Facebook to enhance their business decision making in a number of ways. This type of analysis can be a component of market research, an avenue for collecting customer feedback or a way to promote campaigns and conduct targeted advertising.

To facilitate this type of analysis, Twitter offer a variety of Application Programming Interfaces or APIs that enable an application to programmatically interact with the services provided by Twitter. These APIs currently come in three main flavours.

  • REST API – Allows automated access to searching, reading and writing tweets
  • Streaming API – Allows tracking of multiple users and/or search terms in near real time, though results may only be a sample
  • Twitter Firehose – Allows tracking of all tweets past and future, no limits on search results returned.

These different approaches have different trade-offs. The REST API can only search past tweets, and is limited in how far back you can search as Twitter only keeps the last couple of weeks of data. The Streaming API tracks tweets as they happen, but Twitter only guarantees a sample of all current tweets will be collected [link]. This means that if your search term is very generic and matches a lot of tweets, then not all of these tweets will be returned [link].

The Twitter Firehose addresses the shortcomings of the previous two APIs, but at quite a substantial cost, whereas the other two are free to use. There are also a growing number of third party intermediaries that have access to the Twitter Firehose, and sell on the Twitter data they collect [link].

 

Our Approach

We chose to use the Streaming API to collect tweets containing the hashtags “python” and/or “rstats” and/or “datascience” over a 10 day period.

To harvest the data, a python script was created to utilize the API and append tweets to a single file. Command line tools such as csvkit and jq were then used to clean and preprocess the data, with the analysis done in Python using the pandas library.

 

Preliminary Results: Hashtag Counts and Co-occurrence

From Figure 1, it is immediately obvious that “python” and “datascience” were more popular hashtags than “rstats” over the time period sampled. Though interestingly, there was little overlap between these groups.


Figure 1: Venn diagram of tweet counts by hashtag

 

This suggests that the majority of tweets that mentioned these subjects either did so in isolation or alongside other hashtags that were not tracked. We can get a sense of which is the case by looking at a count of the total number of unique hashtags that occurred alongside each tracked hashtag, this is shown in Table 1.


Table 1: Total unique hashtags used per tracked subset

 

These counts show that the “python” hashtag is mentioned alongside a lot more other topics/hashtags than “rstats” and “datascience”. This makes sense when you consider that Python is a general purpose programming language, and as such has a broader range across application domains than R, which is more statistically focused. In between these is the “datascience” hashtag, a term that relates to many different skillsets and technologies, and so we would expect the number of unique hashtag co-occurrences to be quite high.

 

So what are people mentioning alongside these hashtags if not these technologies?

Table 2 shows the top hashtags mentioned alongside the three tracked hashtags. Here the numbers in the header are the total number of tweets that contained the tracked hashtag term plus at least one other hashtag; so the vast majority of tweets occur with multiple hashtags. As can be seen, all three subjects were commonly mentioned alongside other hashtags.


Table 2: Table of most frequent co-occurring hashtags with tracked keywords. Numbers in the header are the total number of tweets containing at least one other hashtag to the one tracked.

As we may expect, many co-occurring hashtags are closely related, though in general it’s interesting to see that “datascience” frequently co-occurs with more general concepts and/or ‘buzzwords’, with technologies mentioned further down the list.

Python on the other hand occurs frequently alongside other web technologies, as well as “careers” and “hiring”, which may reflect a high demand for jobs that use Python and these related technologies for web development. On the other hand it may simply be that many good web developers are active on Twitter, and as such recruitment companies favor this medium of advertising when trying to fill web development positions.

It’s interesting that tweets with the “Rstats” hashtag mentioned “datascience” and “bigdata” more than any other, likely reflecting the increasing trend of using R in this arena. The other co-occurring hashtags for R can be grouped into: those that relate to its domain specific use (“statistics”, “analytics”, “machinelearning” etc.); possible ways of integrating it with other languages (“python”, “excel”, “d3js”); and other ways of referencing R itself (“r”, “rlang”)!

 

Summary

So from looking at the counts of hashtags and their co-occurrences, it looks like:

  • Tweets containing Python or data science were roughly 5 times more frequent than those containing Rstats. There was also little relative overlap in the three hashtags tracked.
  • Tweets containing Python also mention a broader range of other topics, while R is more focused around data science, statistics and analytics.
  • Tweets mentioning data science most commonly include hashtags for general analytics concepts and ‘buzzwords’, with specific technologies only occasionally mentioned.
  • Tweets mentioning Python most commonly include hashtags for web development technologies and are likely the result of a high volume of recruitment advertising.

 

Future Work

So far we have only looked at the hashtag contents of the tweet and there is much more data contained within that can be analysed. Two other key components are the user mentions and the URLs in the message. Future posts will look into both of these to investigate the content being shared, along with who is retweeting/being retweeted by whom.

 

To leave a comment for the author, please follow the link and comment on his blog: Mango Solutions.


Scraping jQuery DataTable Programmatic JSON with R


(This article was first published on rud.is » R, and kindly contributed to R-bloggers)

School of Data had a recent post on how to copy “every item” from a multi-page list. While their post did provide a neat hack, their “words of warning” are definitely missing some items and the overall methodology can be improved upon with some basic R scripting.

First, the technique they outlined relies heavily on how parameters are passed and handled by the server the form is connected to. The manual technique is not guaranteed to work across all types of forms nor even those with a “count” popup. I can see this potentially frustrating many budding data janitors.

Second, this particular technique and example really centers around jQuery DataTables. While their display style can be highly customized, it’s usually pretty easy to determine if they are being used both visually:

(Screenshot: the “List of Netflix Movies and TV Shows” table on AllFlicks)

(i.e. by the controls & style of the controls available) and in the source:

(Screenshot: the page source of www.allflicks.net)

The URLs might be local or on a common content delivery network, but it should be pretty easy to determine when a jQuery DataTable is in use. Once you do, you should also be able to tell if it’s calling out to a URL for some JSON to populate the structure.

(Screenshot: Chrome Developer Tools on www.allflicks.net)

Here, I just used Chrome’s Developer Tools to look at the responses coming back from the server. That’s a pretty ugly GET request, but we can see the query parameters a bit better if we scroll down:

These definitely track well with the jQuery DataTable server-side documentation so we should be able to use this to our advantage to avoid the pitfalls of overwhelming the browser with HTML entities and doing cut & paste to save out the list.

Getting the Data With R

The R code to get this same data is about as simple as it gets. All you need is the data source URL, with a modified length query parameter. After that, it’s just a few lines of code:

library(httr)
library(jsonlite)
library(dplyr) # for glimpse
 
url <- "http://www.allflicks.net/wp-content/themes/responsive/processing/processing_us.php?draw=1&columns%5B0%5D%5Bdata%5D=box_art&columns%5B0%5D%5Bname%5D=&columns%5B0%5D%5Bsearchable%5D=true&columns%5B0%5D%5Borderable%5D=false&columns%5B0%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B0%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B1%5D%5Bdata%5D=title&columns%5B1%5D%5Bname%5D=&columns%5B1%5D%5Bsearchable%5D=true&columns%5B1%5D%5Borderable%5D=true&columns%5B1%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B1%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B2%5D%5Bdata%5D=year&columns%5B2%5D%5Bname%5D=&columns%5B2%5D%5Bsearchable%5D=true&columns%5B2%5D%5Borderable%5D=true&columns%5B2%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B2%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B3%5D%5Bdata%5D=rating&columns%5B3%5D%5Bname%5D=&columns%5B3%5D%5Bsearchable%5D=true&columns%5B3%5D%5Borderable%5D=true&columns%5B3%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B3%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B4%5D%5Bdata%5D=category&columns%5B4%5D%5Bname%5D=&columns%5B4%5D%5Bsearchable%5D=true&columns%5B4%5D%5Borderable%5D=true&columns%5B4%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B4%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B5%5D%5Bdata%5D=available&columns%5B5%5D%5Bname%5D=&columns%5B5%5D%5Bsearchable%5D=true&columns%5B5%5D%5Borderable%5D=true&columns%5B5%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B5%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B6%5D%5Bdata%5D=director&columns%5B6%5D%5Bname%5D=&columns%5B6%5D%5Bsearchable%5D=true&columns%5B6%5D%5Borderable%5D=true&columns%5B6%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B6%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B7%5D%5Bdata%5D=cast&columns%5B7%5D%5Bname%5D=&columns%5B7%5D%5Bsearchable%5D=true&columns%5B7%5D%5Borderable%5D=true&columns%5B7%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B7%5D%5Bsearch%5D%5Bregex%5D=false&order%5B0%5D%5Bcolumn%5D=5&order%5B0%5D%5Bdir%5D=desc&start=0&length=7448&search%5Bvalue%5D=&search%5Bregex%5D=false&movies=true&shows=true&documentaries=true&rating=netflix&_=1431945465056"
 
resp <- GET(url)

Normally we would be able to do:

content(resp, as="parsed")

but this server did not set the Content-Type of the response well, so we have to do it by hand with the jsonlite package:

recs <- fromJSON(content(resp, as="text"))

The recs variable is now an R list with a structure that (thankfully) fully represents the expected server response:

## List of 4
##  $ draw           : int 1
##  $ recordsTotal   : int 7448
##  $ recordsFiltered: int 7448
##  $ data           :'data.frame':  7448 obs. of  9 variables:
##   ..$ box_art  : chr [1:7448] "<img src="http://cdn1.nflximg.net/images/9159/12119159.jpg" width="55" alt="Thumbnail">" "<img src="http://cdn1.nflximg.net/images/6195/20866195.jpg" width="55" alt="Thumbnail">" "<img src="http://cdn1.nflximg.net/images/3735/2243735.jpg" width="55" alt="Thumbnail">" "<img src="http://cdn0.nflximg.net/images/2668/21112668.jpg" width="55" alt="Thumbnail">" ...
##   ..$ title    : chr [1:7448] "In the Bedroom" "Wolfy: The Incredible Secret" "Bratz: Diamondz" "Tinker Bell and the Legend of the NeverBeast" ...
##   ..$ year     : chr [1:7448] "2001" "2013" "2006" "2015" ...
##   ..$ rating   : chr [1:7448] "3.3" "2.5" "3.6" "4" ...
##   ..$ category : chr [1:7448] "<a href="http://www.allflicks.net/category/thrillers/">Thrillers</a>" "<a href="http://www.allflicks.net/category/children-and-family-movies/">Children & Family Movies</a>" "<a href="http://www.allflicks.net/category/children-and-family-movies/">Children & Family Movies</a>" "<a href="http://www.allflicks.net/category/children-and-family-movies/">Children & Family Movies</a>" ...
##   ..$ available: chr [1:7448] "17 May 2015" "17 May 2015" "17 May 2015" "17 May 2015" ...
##   ..$ cast     : chr [1:7448] "Tom Wilkinson, Sissy Spacek, Nick Stahl, Marisa Tomei, William Mapother, William Wise, Celia Weston, Karen Allen, Frank T. Well"| __truncated__ "Rafael Marin, Christian Vandepas, Gerald Owens, Yamile Vasquez, Pilar Uribe, James Carrey, Rebecca Jimenez, Joshua Jean-Baptist"| __truncated__ "Olivia Hack, Soleil Moon Frye, Tia Mowry-Hardrict, Dionne Quan, Wendie Malick, Lacey Chabert, Kaley Cuoco, Charles Adler" "Ginnifer Goodwin, Mae Whitman, Rosario Dawson, Lucy Liu, Pamela Adlon, Raven-Symoné, Megan Hilty" ...
##   ..$ director : chr [1:7448] "Todd Field" "Éric Omond" "Mucci Fassett, Nico Rijgersberg" "Steve Loter" ...
##   ..$ id       : chr [1:7448] "60022258" "70302834" "70053695" "80028529" ...

We see there is a data.frame in there with the expected # of records. We can also use glimpse from dplyr to see the data table a bit better:

glimpse(recs$data)
 
## Observations: 7448
## Variables:
## $ box_art   (chr) "<img src="http://cdn1.nflximg.net/images/9159/12...
## $ title     (chr) "In the Bedroom", "Wolfy: The Incredible Secret", ...
## $ year      (chr) "2001", "2013", "2006", "2015", "1993", "2013", "2...
## $ rating    (chr) "3.3", "2.5", "3.6", "4", "3.5", "3.1", "3.3", "4....
## $ category  (chr) "<a href="http://www.allflicks.net/category/thril...
## $ available (chr) "17 May 2015", "17 May 2015", "17 May 2015", "17 M...
## $ cast      (chr) "Tom Wilkinson, Sissy Spacek, Nick Stahl, Marisa T...
## $ director  (chr) "Todd Field", "Éric Omond", "Mucci Fassett, Nico R...
## $ id        (chr) "60022258", "70302834", "70053695", "80028529", "8...

Now, we can use that in any R workflow or write it out as a CSV (or other format) for other workflows to use. No browsers were crashed and we have code we can run again to scrape the site (i.e. when they add more movies to the database) vs a manual cut & paste workflow.
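
For example, a minimal export step (the file name is arbitrary) could be:

write.csv(recs$data, "allflicks.csv", row.names=FALSE)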

Many of the concepts in this post can be applied to other data table displays (i.e. those not based on jQuery DataTable), but you’ll have to get comfortable with the developer tools view of your favorite browser.

To leave a comment for the author, please follow the link and comment on his blog: rud.is » R.


A Basic Logical Invest Global Market Rotation Strategy


(This article was first published on QuantStrat TradeR » R, and kindly contributed to R-bloggers)

This may be one of the simplest strategies I’ve ever presented on this blog, but nevertheless, it works, for some definition of “works”.

Here’s the strategy: take five global market ETFs (MDY, ILF, FEZ, EEM, and EPP), along with a treasury ETF (TLT), and every month, fully invest in the security that had the best momentum. While I’ve tried various other tweaks, none have given the intended high return performance that the original variant has.

Here’s the link to the original strategy.

While I’m not quite certain of how to best go about programming the variable lookback period, this is the code for the three month lookback.

require(quantmod)
require(PerformanceAnalytics)

symbols <- c("MDY", "TLT", "EEM", "ILF", "EPP", "FEZ")
getSymbols(symbols, from="1990-01-01")
prices <- list()
for(i in 1:length(symbols)) {
  prices[[i]] <- Ad(get(symbols[i]))
}
prices <- do.call(cbind, prices)
colnames(prices) <- gsub("\\.[A-z]*", "", colnames(prices))
returns <- Return.calculate(prices)
returns <- na.omit(returns)

logicInvestGMR <- function(returns, lookback = 3) {
  ep <- endpoints(returns, on = "months") 
  weights <- list()
  for(i in 2:(length(ep) - lookback)) {
    retSubset <- returns[ep[i]:ep[i+lookback],]
    cumRets <- Return.cumulative(retSubset)
    rankCum <- rank(cumRets)
    weight <- rep(0, ncol(retSubset))
    weight[which.max(cumRets)] <- 1
    weight <- xts(t(weight), order.by=index(last(retSubset)))
    weights[[i]] <- weight
  }
  weights <- do.call(rbind, weights)
  stratRets <- Return.portfolio(R = returns, weights = weights)
  return(stratRets)
}

gmr <- logicInvestGMR(returns)
charts.PerformanceSummary(gmr)

And here’s the performance:

> rbind(table.AnnualizedReturns(gmr), maxDrawdown(gmr), CalmarRatio(gmr))
                          portfolio.returns
Annualized Return                  0.287700
Annualized Std Dev                 0.220700
Annualized Sharpe (Rf=0%)          1.303500
Worst Drawdown                     0.222537
Calmar Ratio                       1.292991

With the resultant equity curve:

While I don’t get the 34% advertised, the risk-to-reward ratio over the duration of the backtest is fairly solid for something so simple, and I just wanted to put this out there.

Thanks for reading.


To leave a comment for the author, please follow the link and comment on his blog: QuantStrat TradeR » R.


What’s new in Revolution R Enterprise 7.4


(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Bill Jacobs, Director Technical Sales, Microsoft Advanced Analytics

Without missing a beat, the engineers at Revolution Analytics have brought another strong release to users of Revolution R Enterprise (RRE). Just a few weeks after the acquisition of Revolution Analytics by Microsoft, RRE 7.4 was released to customers on May 15, adding new capabilities, enhanced performance and security, and faster and simpler Hadoop editions.

New features in version 7.4 include:

  • Addition of Naïve Bayes Classifiers to the ScaleR library of algorithms
  • Optional coefficient tracking for stepwise regressions.  Coefficient tracking makes stepwise less of a “black box” by illustrating how model features are selected as the algorithm iterates toward a final reduced model.
  • Faster import of wide data sets, faster computation of data summaries, and faster fitting of tree-based models and predictions of decision forests and gradient boosted tree algorithms.
  • Support for HDFS file caching on Hadoop that speeds analysis of most files, especially when applying multi-step and iterative algorithms.
  • Improved tools for distributing R packages across Cloudera Hadoop clusters.
  • An updated edition of R, version 3.1.3.
  • Certification of RRE 7.4 on Cloudera CDH 5.2 and 5.3, Hortonworks HDP 2.2 and MapR 4.0.2, along with certification of the much requested CentOS as a supported Linux platform.

For RRE users integrating R with enterprise apps, RRE's included DeployR integration server now includes:

  • New R Session Process Controls that provide fine-grained file access controls on Linux.
  • Support for external file repositories including git and svn for managing scripts and metadata used by DeployR.
  • Strengthened password handling to resist recent attack vectors.
  • Updates to the Java, JavaScript and .NET broker frameworks and corresponding client libraries, and,
  • A new DeployR command line tool that will grow in capability with subsequent releases.

More broadly, the release of version 7.4 so shortly after the acquisition of Revolution by Microsoft underscores our commitment to delivering an expanding array of enterprise-capable R platforms. It also demonstrates Microsoft’s growing commitment to advanced analytics facilities that leverage and extend open source technologies such as the R language.

Details of the new features in RRE 7.4 can be found in the release notes here.

Details of improvements to DeployR integration server in RRE 7.4 can be found here.

To leave a comment for the author, please follow the link and comment on his blog: Revolutions.


Bio7 2.1 for Windows 64 bit released


(This article was first published on » R, and kindly contributed to R-bloggers)

18.05.2015

I released a new version of the Integrated Development Environment Bio7 with new functions and many visual layout improvements of the Bio7 Graphical User Interface.

Bio7 2.1 can be downloaded here:

http://bio7.org

Release notes Bio7 2.1:

R:

  • Updated R to version 3.2.0.
  • Improved the R perspective layout (see below).

(Screenshot: the improved R perspective layout)

  • Improved the syntax coloring and grammar (assignment, multiline string, infix operator for package ‘data.table’)
  • Improved the layout of the R-Shell view for higher dpi’s
  • Added code folding for ‘if’,’while’, ‘repeat’ and ‘for’ expressions.
  • Added options to enable or disable code folding, code context (mark words, info popup) and code completion.
  • The line numbering of the R editor is now enabled by default.
  • Improved the “Install package(s)” GUI and other dialogs.

ImageJ

  • Updated ImageJ to 1.49t
  • Added Bio7 ImageJ preferences for special dialogs to stay on top (ROI Manager, Results Table, Macro Recorder).
  • Added extra action panels for the histogram and profile plot to enable the new ImageJ actions.

  • Resized the layout of the ‘ImageJ-Toolbar’ and ‘Image-Methods’ view.
  • Improved the compatibility for the 3D viewer plugin and the OMERO (The Open Microscopy Environment) client.
  • Added options to resize and store the dimensions of the ‘Image-Methods’ dialog and the ‘ImageJ-Toolbar’ in the preferences.
  • Added more tooltips for context information.

WorldWind

  • Improved the layout for the different actions and layers.
  • Added an option to load GEOTIFF image data in parallel in ImageJ when it is added as a layer to WorldWind.
  • Added an easier to use alpha value function for transparent image regions (from ImageJ – see below).

  • Greyscale and float images can now be displayed as RGBA if enabled (transparent regions).
  • Added a ‘Location’ action for loaded shapefiles.
  • Improved the GDAL loading. Now GDAL (Java) can be called from within Bio7 dynamically.

3D

  • Improved the layout and actions for the different 3D panels.

Python

  • Added Py4J library for the communication between Java and Cpython.
  • Added a server start/stop action for Py4J available in the Scripts menu.
  • Added a Py4J example (ImageJ measurement) in the Bio7 documentation.
  • As an alternative you can now install the Eclipse Python editor PyDev and execute the python script within the Bio7 connection or in the PyDev editor process.
  • Added support to eval Python3.x scripts (can be enabled in the preferences).

Bio7 GUI

  • Improved the startup layout for Bio7. The application now starts maximized.
  • Increased the Bio7 splashscreen.

Java

  • Updated the embedded Java Runtime Environment to 1.8.45.
  • Updated the integrated JavaFX SceneBuilderKit.

(Screenshot: the integrated JavaFX SceneBuilder)

  • Added support for Java3D built on JOGL.
  • Added a default ‘close’ action in the base (abstract) Model class.
  • The ‘close’ method is called automatically if a custom view is closed.

Examples:

  • Added and fixed some examples for Bio7

Installation:

The installation of Bio7 is similar to the installation of the Eclipse environment. Simply decompress the downloaded *.zip file in a preferred location on your file system. After decompressing with a standard zip-tool (like WinZip, Win Rar) the typical file structure of an Eclipse based application will be created. To start the application simply double click on the Bio7.exe file.

For more information about Bio7:

Documentation

YouTube Videos

To leave a comment for the author, please follow the link and comment on his blog: » R.


Basic text string functions in R


(This article was first published on lukemiller.org » R-project, and kindly contributed to R-bloggers)
To get the length of a text string (i.e. the number of characters in the string): Using length() would just give you the length of the vector containing the string, which will be 1 if the string is just a single string. To get the position of a regular expression match(es) in a text string […]
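
For reference, the base R functions alluded to here are most likely nchar() for the number of characters and regexpr()/gregexpr() for match positions:

s <- "hello world"
nchar(s)          # number of characters: 11
length(s)         # length of the vector holding the string: 1
regexpr("o", s)   # position of the first match: 5
gregexpr("o", s)  # positions of all matches: 5 and 8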

To leave a comment for the author, please follow the link and comment on his blog: lukemiller.org » R-project.


Query Multiple Google Analytics View IDs with R


(This article was first published on analytics for fun, and kindly contributed to R-bloggers)

Extracting Google Analytics data from one website is pretty easy, and there are several options to do it quickly. But what if you need to extract data from multiple websites or, to be more precise, from multiple Views? And perhaps you also need to summarize it within a single data frame?

Not long ago I was working on a reporting project, where the client owned over 60 distinct websites. All of them tracked using Google Analytics.

Given the high number of sites managed and the nature of their business, it did not make sense for them to report & analyse data for each single website. It was much more effective to group those websites into categories (let's say category 1, category 2, category 3, etc.) and report/analyse data at a category level rather than at a website level.

In other words, they needed to:
1) Collect data from each website
2) Categorize websites data according to specific internal business criteria
3) Report and visualize data at a category level (through an internal reporting system)

Very soon I realized that steps 1 & 2 were critical both in terms of the time needed for extracting data and of the risk of copy/paste errors, especially if the extraction process was executed directly from the Google Analytics platform.

But luckily that's where R and the RGoogleAnalytics package came in handy, allowing me to automate the extraction process with a simple for loop.

Let's quickly go through the different options I had to tackle points 1) and 2):

a) Download data from Google Analytics platform as Excel format
This would have meant doing the same operation for each one of the 60 sites! Too long. Plus subsequent manual copy/paste work to group site data into different categories. Boring and too risky! Moreover, given the segmentation required by the client, I could not find the info directly in the Google Analytics standard reports.

b) Google Analytics Query Explorer
Google Analytics Query Explorer is very very handy and I use it a lot. You can connect to the Google Analytics API and build complex queries quickly thanks to its easy-to-use interface. So I could obtain the required segmentation of data quite fast.

However, the current Query Explorer version allows you to query only one View ID at a time. Despite its plural nomenclature (ids), the View ID is a unique value as explained in the Core Reporting API documentation, and you will have to run your request several times in order to query multiple websites.

Hence, even if you use Query Explorer, you will have to query one website/view at a time, download the data and merge it into your "website category".

c) Google Spreadsheet Add-on
Thanks to the Google Analytics Spreadsheet Add-on, it's easy to run a query via the Google Analytics API and obtain your web data. You can also run more than one query at a time, which means you can query more than one View ID at a time.

I love the Google Sheets Add-on, though in this particular case (querying and categorizing over 60 websites), you would still have some manual copy/paste work to do once you have extracted the data into the spreadsheet.

d) Automate the extraction process with R
There are a few packages in R that let you connect to the Google Analytics API. One of them is RGoogleAnalytics. But R is also a powerful programming language which allows you to automate complex operations.

So, I thought that combining the RGoogleAnalytics package with a simple R control structure like a for loop could do the job quickly and with a low margin of error.

Below I provide a bit more detail on how I ran multiple queries in R and, obviously, the code!


For loop to query multiple Google Analytics View IDs with R


What I did was run a simple for loop that iterates over each View ID of my category and retrieves the corresponding data using the query, each time appending the new data to a data frame that will eventually become the final category data frame.

Let's break it down in a few steps to make it clearer.

Step 1: Authenticate to Google Analytics via RGoogleAnalytics package

I assume you are familiar with the RGoogleAnalytics package. If not, please check out this brilliant post which explains in detail how to connect Google Analytics with R.

What you have to do first of all is create a new project using the Google Developers Console. Once it is created, you grab your credentials ("client.id" and "client.secret" variables in the code), and use them to create and validate your token.

Of course you need to have the RGoogleAnalytics library loaded to do all of this.

library(RGoogleAnalytics)
client.id <- "yourClientID"
client.secret <- "yourClientSecret"
 
# if no token is found within your working directory, a new token will be created. Otherwise the existing one will be loaded
 
if (!file.exists("./oauth_token")) {
  token <- Auth(client.id,client.secret)
  save(token,file="./oauth_token")
} else {
  load("./oauth_token")
}
 
ValidateToken(token)


Step 2:  Create the View IDs category

Using the "GetProfiles" command, you can get a list with all the Views (or profiles) you have access to with your token, and the corresponding View IDs too, which are actually the parameters you need to build your query.

From that list you can easily select the ones you need to build your category. Or otherwise you can create your category directly by entering the IDs manually. As an example, below I create 3 categories, each containing a certain number of IDs.

Each category will be a vector of character class.

viewID<-GetProfiles(token)
viewID
 
category1<- c("79242136", "89242136", "892421","242136","242138","242140","242141")
category2<- c("54120", "54121", "54125","54126")
category3<- c("60123", "60124", "60125")

Step 3: Initialize an empty data frame

Before executing the loop, I create an empty data frame named "df". I will need this to store the data extracted through the multiple queries.


As you will see in the next step, each time a new query is run for a specific View ID, the resulting data will be appended below the last row of the previous data frame using the function rbind.
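
The initialization itself is not shown here; a minimal version is simply an empty data frame (the first rbind will then set the column structure):

df <- data.frame()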


Step 4: Run the for loop over each category

Now that we have the website categories set up and a data frame ready to store data, we can finally run the loop. What I do here is use a variable called "v" and iterate it over a specific category, let's say "category1". In other words, the Google Analytics query is run for each single View ID included in the category.

The resulting object of each query is a data frame called "ga.data". To collect the result of each query in the same data frame, each time the loop is run the "df" data frame created previously is extended vertically using the rbind function.

for (v in category1){
  start.date <- "2015-04-01"
  end.date <- "2015-04-30"
  view.id <- paste("ga:",v,sep="") # the View ID parameter needs to have "ga:" in front of the ID
 
  # note: goal completion metrics use the ga:goalXXCompletions form (here goal 1)
  query.list <- Init(start.date = start.date, end.date = end.date, dimensions = "ga:date, ga:deviceCategory, ga:channelGrouping", metrics = "ga:sessions, ga:users, ga:bounceRate, ga:goal1Completions", table.id = view.id)
  ga.query <- QueryBuilder(query.list)
  ga.data <- GetReportData(ga.query, token, paginate_query = F)
 
  df <- rbind(df,ga.data)
}

(Screenshot: query output for multiple Google Analytics View IDs)


This for loop would query data only for category 1. To query websites belonging to category 2, you would need to run the same loop again, this time iterating over category 2. Remember to re-initialize the "df" data frame when you change category, otherwise all new results will be appended below your previous data frame.

Step 5: Do whatever you want with your data!

At this point, you should have all the Google Analytics data available in your R workspace. And most importantly, categorized!

You might now need to perform some cleaning on your data, visualize it or export it into another format. Fortunately R offers so many functions and packages that you can do basically whatever you want with those data.

If you need, for example, to export your data frame into a .csv file, you can do it very quickly using the write.csv command:

write.csv(df,file="category1.csv")

Another data munging operation you might want to do on your Google Analytics data is converting dates into a more friendly format. In fact, the dates you extract from Google Analytics come into R as the character data type, in "yyyyMMdd" format. You can do this with the following code:

class(ga.data$date) # dates come as character
newDate <- as.Date(ga.data$date,"%Y%m%d") # convert into the Date data type ("%m" is month; "%M" would be minutes)
newFormat <- format(newDate,"%d/%m/%y") # change the display format, but this converts it back to character class
newFormat <- as.Date(newFormat,"%d/%m/%y") # convert it back to the Date data type

In general I suggest you use the dplyr package for any data manipulation operation you might need to perform on your data frame.

And of course, you could include all the data cleaning/manipulation commands inside the above for loop if you like. By doing that, you would automate your process even more, and end up with a data frame ready to be reported or visualized for your audience.

To leave a comment for the author, please follow the link and comment on his blog: analytics for fun.


Posterior predictive output with Stan


(This article was first published on mages' blog, and kindly contributed to R-bloggers)
I continue my Stan experiments with another insurance example. Here I am particularly interested in the posterior predictive distribution from only three data points. Or, to put it differently, I have a customer of three years and I'd like to predict the expected claims cost for the next year to set or adjust the premium.

The example is taken from section 16.17 in Loss Models: From Data to Decisions [1]. Some time ago I used the same example to get my head around a Bayesian credibility model.

Suppose the claims likelihood distribution is believed to follow an exponential distribution for a given parameter \(\Theta\). The prior parameter distribution on \(\Theta\) is assumed to be a gamma distribution with parameters \(\alpha=4, \beta=1000\):
\[\begin{aligned}
\Theta & \sim \mbox{Gamma}(\alpha, \beta)\\
\ell_i & \sim \mbox{Exp}(\Theta), \; \forall i \in N
\end{aligned}\]
In this case the predictive distribution is a Pareto II distribution with density \(f(x) = \frac{\alpha \beta^\alpha}{(x+\beta)^{\alpha+1}}\) and a mean of \(\frac{\beta}{\alpha-1}=\$333.33\).

I have three independent observations, namely losses of $100, $950 and $450. The posterior predictive expected loss is $416.67 and can be derived analytically, as shown in my previous post. Now let me reproduce the answer with Stan as well.

Implementing the model in Stan is straightforward and I follow the same steps as in my simple example of last week. However, here I am also interested in the posterior predictive distribution, hence I add a generated quantities code block.
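
A minimal sketch of such a model follows; this is not the author's exact code, the variable names and sampler settings are my own, and it assumes Stan's shape/rate parameterisation of the gamma distribution and rate parameterisation of the exponential:

library(rstan)
 
stancode <- "
data {
  int<lower=1> N;           // number of observed losses
  vector<lower=0>[N] loss;  // observed losses
  real<lower=0> alpha;      // gamma prior shape
  real<lower=0> beta;       // gamma prior rate
}
parameters {
  real<lower=0> theta;      // exponential claims rate
}
model {
  theta ~ gamma(alpha, beta);  // prior
  loss ~ exponential(theta);   // likelihood
}
generated quantities {
  real pred_loss;
  pred_loss <- exponential_rng(theta);  // posterior predictive draw
}
"
 
dat <- list(N=3, loss=c(100, 950, 450), alpha=4, beta=1000)
fit <- stan(model_code=stancode, data=dat, iter=10000, chains=4)
print(fit, probs=c(0.5, 0.75, 0.9))
mean(extract(fit, "pred_loss")$pred_loss)  # should be close to the analytical $416.67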



The output shows a simulated predictive mean of $416.86, close to the analytical answer. I can also read off that the 75%ile of the posterior predictive distribution is a loss of $542 vs. $414 from the prior predictive. That means every four years I shouldn't be surprised to observe a loss in excess of $500. Further, I can read off that 90% of losses are expected to be less than $950, or in other words the observation in my data may reflect the outcome of an event with a 1 in 10 return period.

Comparing the sampling output from Stan with the analytical output gives me some confidence that I am doing the 'right thing'.

References

[1] Klugman, S. A., Panjer, H. H. & Willmot, G. E. (2004), Loss Models: From Data to Decisions, Wiley Series in Probability and Statistics.

Session Info

R version 3.2.0 (2015-04-16)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.3 (Yosemite)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] lattice_0.20-31 actuar_1.1-8 rstan_2.6.0 inline_0.3.14
[5] Rcpp_0.11.6

loaded via a namespace (and not attached):
[1] tools_3.2.0 codetools_0.2-11 grid_3.2.0 stats4_3.2.0

To leave a comment for the author, please follow the link and comment on his blog: mages' blog.


How Predictable is the English Premier League?


(This article was first published on DiffusePrioR » R, and kindly contributed to R-bloggers)

[Figure: fitted smoothed trends of match-outcome uncertainty by division, 2000-2015]

The reason why football is so exciting is uncertainty. The outcome of any match or league is unknown, and you get to watch the action unfold without knowing what’s going to happen. Watching matches where you know the score is never exciting.

This weekend the English Premier League season will conclude with little fanfare. Bar one relegation place, the league positions have already been determined. In fact, these positions were, for the most part, decided weeks ago. The element of uncertainty seems to have been reduced this season.

With this in mind, I wanted to look at uncertainty over the long run in English football. To do this I used the data provided by http://www.football-data.co.uk/ and analyzed these with R. These data consist of 34,740 matches played in the top 5 divisions of English football between 2000 and 2015, containing information about both the result and the odds offered by bookies on this result.

To measure the uncertainty of any given match I used the following strategy. First, I averaged across all bookies’ odds for the three possible outcomes: home win, draw, and away win. Next, I mapped these aggregated odds into probabilities by inverting each of the odds and then dividing by the sum of the inverted odds. This takes care of the overround that helps bookies make a profit. For example, if the odds were 2.1/1 that an event happens and 2.1/1 that it doesn't, then the probability of the event occurring is:

(1/2.1) / (1/2.1 + 1/2.1) = 0.4761905 / (0.4761905 + 0.4761905) = 0.5.

Finally, to measure the uncertainty of each match, I subtract the probability of the outcome that actually occurred from 1, to calculate a “residual” score. Imagine a home win occurs. The “residual” in this case will be 1 − P(home win). If P(home win) = 1, then there is no uncertainty, and this uncertainty score will be zero. Since there are 3 outcomes, we would expect the measure to be bounded between 0 (no uncertainty) and 0.67 (pure uncertainty), where we get 1 out of 3 right just by guessing.
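
As a quick toy illustration of this calculation before the full code below (the odds here are made up, not taken from the data set):

# made-up decimal odds for a single match: home / draw / away
odds <- c(home = 2.10, draw = 3.40, away = 3.80)

# strip the overround: invert the odds, then normalise
probs <- (1 / odds) / sum(1 / odds)
round(probs, 3)

# suppose the home side wins: the uncertainty "residual" is 1 - P(home win)
1 - probs["home"]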

After importing these data into R and calculating the uncertainty measure, I looked at how it evolves over time. The plot above shows fitted smoothed trend lines of uncertainty, stratified by division. These trends are striking. Going by this graph, the Premier League has gotten more predictable over the analysis period. In 2000, the uncertainty measure was around 0.605. Given that we expect this measure to be bounded between 0 (complete certainty) and 0.67 (completely random), this tells us that the average league game was very unpredictable. Over time, however, this measure has decreased by about 5%, which does not seem like much. Despite the somewhat unexciting end to the 2014/15 season, the outcome of the average game is still not very predictable.

Noticeably, in lower league games there is even greater uncertainty. In fact, the average uncertainty measure of League 2 games approached a value of 0.65 in 2014. This indicates that the average League 2 game is about as unpredictable as playing rock-paper-scissors. Interestingly, and unlike the Premier League, there does not appear to be any discernible change over time. The games are just as unpredictable now as they were in 2000. Please see my R code below.

# clear
rm(list=ls())

# libraries
library(ggplot2)

# build the result-file URLs for each season and division on football-data.co.uk
# (note: the 2003/04 season is absent from this list, and the Conference ("EC")
#  files are only included from 2005/06 onwards)

years = c(rep("0001",4), rep("0102",4), rep("0203",4), rep("0405",4),
          rep("0506",5), rep("0607",5), rep("0708",5), rep("0809",5),
          rep("0910",5), rep("1011",5), rep("1112",5), rep("1213",5),
          rep("1314",5), rep("1415",5))
divis = c(rep(c("E0","E1","E2","E3"),4), rep(c("E0","E1","E2","E3","EC"),10))

urls = paste(years, divis, sep="/")
urls = paste("http://www.football-data.co.uk/mmz4281", urls, sep="/")


# bookmaker odds column names (home/draw/away triplets for each bookie)
odds = c("B365H","B365D","B365A",
         "BSH","BSD","BSA",
         "BWH","BWD","BWA",
         "GBH","GBD","GBA",
         "IWH","IWD","IWA",
         "LBH","LBD","LBA",
         "PSH","PSD","PSA",
         "SOH","SOD","SOA",
         "SBH","SBD","SBA",
         "SJH","SJD","SJA",
         "SYH","SYD","SYA",
         "VCH","VCD","VCA",
         "WHH","WHD","WHA")
home = odds[seq(1,length(odds),3)]
draw = odds[seq(2,length(odds),3)]
away = odds[seq(3,length(odds),3)]

# load all data in a loop
full.data = NULL
for(i in 1:length(urls)){
  temp = read.csv(urls[i])
  # calculate average odds
  temp$homeodds = apply(temp[,names(temp) %in% home], 1, function(x) mean(x,na.rm=T))
  temp$drawodds = apply(temp[,names(temp) %in% draw], 1, function(x) mean(x,na.rm=T))
  temp$awayodds = apply(temp[,names(temp) %in% away], 1, function(x) mean(x,na.rm=T))
  temp = temp[,c("Div","Date","FTHG","FTAG","FTR","homeodds","drawodds","awayodds")]
  full.data = rbind(full.data, temp)
}

full.data$homewin = ifelse(full.data$FTR=="H", 1, 0)
full.data$draw = ifelse(full.data$FTR=="D", 1, 0)
full.data$awaywin = ifelse(full.data$FTR=="A", 1, 0)

# convert odds to probabilities, normalising away the bookies' overround
full.data$homeprob = (1/full.data$homeodds)/(1/full.data$homeodds+1/full.data$drawodds+1/full.data$awayodds)
full.data$drawprob = (1/full.data$drawodds)/(1/full.data$homeodds+1/full.data$drawodds+1/full.data$awayodds)
full.data$awayprob = (1/full.data$awayodds)/(1/full.data$homeodds+1/full.data$drawodds+1/full.data$awayodds)

# bookie residual
full.data$bookieres = 1-full.data$homeprob
full.data$bookieres[full.data$FTR=="D"] = 1-full.data$drawprob[full.data$FTR=="D"]
full.data$bookieres[full.data$FTR=="A"] = 1-full.data$awayprob[full.data$FTR=="A"]

# parse dates (the files mix %d/%m/%y and %d/%m/%Y formats), then plot over time
full.data$time = ifelse(nchar(as.character(full.data$Date))==8, 
                         as.Date(full.data$Date,format='%d/%m/%y'),
                         as.Date(full.data$Date,format='%d/%m/%Y'))
full.data$date = as.Date(full.data$time, origin = "1970-01-01") 

full.data$Division = "Premier League" 
full.data$Division[full.data$Div=="E1"] = "Championship" 
full.data$Division[full.data$Div=="E2"] = "League 1" 
full.data$Division[full.data$Div=="E3"] = "League 2" 
full.data$Division[full.data$Div=="EC"] = "Conference" 

full.data$Division = factor(full.data$Division, levels = c("Premier League", "Championship", "League 1",
                                                           "League 2","Conference"))

ggplot(full.data, aes(date, bookieres, colour=Division)) +
  stat_smooth(size = 1.25, alpha = 0.2) +
  labs(x = "Year", y = "Uncertainty") + 
  theme_bw() +
  theme(legend.position="bottom") +
  theme(axis.text=element_text(size=20),
        axis.title=element_text(size=20),
        legend.title = element_text(size=20),
        legend.text = element_text(size=20))



Interactive charts in R


(This article was first published on Benomics » R, and kindly contributed to R-bloggers)

I’m giving a talk tomorrow at the Edinburgh R user group (EdinbR) on how to get started building interactive charts in R. I’ll talk about rCharts as a great general entry point for quickly generating interactive charts, and also about the newer htmlwidgets movement, which allows interactive charts to be integrated more easily with RMarkdown and Shiny. I also tried to throw in a decent number of Edinburgh-related examples along the way.

Current slides are here:

Click through for HTML slide deck.


I’ve since spun out what started as a simple example for the talk into a live web app, viewable at blackspot.org.uk. Here I’m looking at Edinburgh open data on vehicle collisions in the city, published by the council. It’s still under development and will be my first real project in Shiny, but it has already started to come together quite nicely.


Blackspot Shiny web app. Code available on github. NB. The UI currently retains a lot of code borrowed from Joe Cheng’s beautiful SuperZip shiny example.

The other speaker for the session is Alastair Kerr (head of bioinformatics at the Wellcome Trust Centre for Cell Biology here in Edinburgh), and he’ll be giving a beginner’s guide to the Shiny web framework. All in all it should be a great meeting; if you’re nearby, do come along!


