
Paper Helicopter Experiment, part III


(This article was first published on Wiekvoet, and kindly contributed to R-bloggers)
As the final part of my paper helicopter experiment analysis (part I, part II) I reanalyse one more data set. In 2002 Erik Erhardt and Hantao Mai ran an extensive experiment, see The Search for the Optimal Paper Helicopter. They went through a number of steps, including variable screening, steepest ascent and a confirmatory experiment. For my part, I have combined all those data in one data set and checked what kind of model would result.

Data

The extracted data contain 45 observations. Some of these observations are replicated; for instance, the central composite design has a replicated centre and the optimum found was tested repeatedly.
After creating a factor that combines all variables it is easy to examine the replications, shown below. The first eight variables are the experimental settings, allvl is the factor combining all levels, Time is the response and Freq the number of occurrences of allvl:
   RotorLength RotorWidth BodyLength FootLength FoldLength FoldWidth
1         8.50       4.00        3.5       1.25          8       2.0
2         8.50       4.00        3.5       1.25          8       2.0
3         8.50       4.00        3.5       1.25          8       2.0
4         8.50       4.00        3.5       1.25          8       2.0
5         8.50       4.00        3.5       1.25          8       2.0
6         8.50       4.00        3.5       1.25          8       2.0
7        11.18       2.94        2.0       2.00          6       1.5
8        11.18       2.94        2.0       2.00          6       1.5
9        11.18       2.94        2.0       2.00          6       1.5
10       11.18       2.94        2.0       2.00          6       1.5
11       11.18       2.94        2.0       2.00          6       1.5
12       11.18       2.94        2.0       2.00          6       1.5
13       11.50       2.83        2.0       1.50          6       1.5
14       11.50       2.83        2.0       1.50          6       1.5
15       11.50       2.83        2.0       1.50          6       1.5
   PaperWeight DirectionOfFold                                 allvl  Time Freq
1        heavy         against  8.5.4.0.3.5.1.2. 8.2.0.heavy.against 13.88    3
2        heavy         against  8.5.4.0.3.5.1.2. 8.2.0.heavy.against 15.91    3
3        heavy         against  8.5.4.0.3.5.1.2. 8.2.0.heavy.against 16.08    3
4        light         against  8.5.4.0.3.5.1.2. 8.2.0.light.against 10.52    3
5        light         against  8.5.4.0.3.5.1.2. 8.2.0.light.against 10.81    3
6        light         against  8.5.4.0.3.5.1.2. 8.2.0.light.against 10.89    3
7        light         against 11.2.2.9.2.0.2.0. 6.1.5.light.against 17.29    6
8        light         against 11.2.2.9.2.0.2.0. 6.1.5.light.against 19.41    6
9        light         against 11.2.2.9.2.0.2.0. 6.1.5.light.against 18.55    6
10       light         against 11.2.2.9.2.0.2.0. 6.1.5.light.against 15.54    6
11       light         against 11.2.2.9.2.0.2.0. 6.1.5.light.against 16.40    6
12       light         against 11.2.2.9.2.0.2.0. 6.1.5.light.against 19.67    6
13       light         against 11.5.2.8.2.0.1.5. 6.1.5.light.against 16.35    3
14       light         against 11.5.2.8.2.0.1.5. 6.1.5.light.against 16.41    3
15       light         against 11.5.2.8.2.0.1.5. 6.1.5.light.against 17.38    3

Transformation

It is also possible to regress Time against allvl and examine the residuals. Since it is easy to imagine that the error is proportional to the elapsed time, this is done both for the original data and for log10-transformed data.
It appears that larger values have larger errors, but this is not corrected very much by a log transformation. To examine this a bit further, a Box-Cox transformation is used. It suggests that a square root is close to optimal, although a log transformation and no transformation should also work. A square root transformation was chosen.
Given the square root transformation, the residual error should not be lower than about 0.02, since that is what the replications give; on the other hand, a residual error much higher than 0.02 is a clear sign of underfitting.
Analysis of Variance Table

Response: sqrt(Time)
          Df  Sum Sq Mean Sq F value    Pr(>F)    
allvl      3 1.84481 0.61494      26 2.707e-05 ***
Residuals 11 0.26016 0.02365                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model selection

Given the desired residual variance, a model that is linear in the variables is not sufficient.
Analysis of Variance Table

Response: sTime
                Df Sum Sq Mean Sq F value    Pr(>F)    
RotorLength      1 3.6578  3.6578 18.4625 0.0001257 ***
RotorWidth       1 1.0120  1.0120  5.1078 0.0299644 *  
BodyLength       1 0.1352  0.1352  0.6823 0.4142439    
FootLength       1 0.2719  0.2719  1.3725 0.2490708    
FoldLength       1 0.0060  0.0060  0.0302 0.8629331    
FoldWidth        1 0.0189  0.0189  0.0953 0.7592922    
PaperWeight      1 0.6528  0.6528  3.2951 0.0778251 .  
DirectionOfFold  1 0.4952  0.4952  2.4994 0.1226372    
Residuals       36 7.1324  0.1981                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Adding interactions and quadratic effects via stepwise regression did not improve much.
Analysis of Variance Table

Response: sTime
                       Df Sum Sq Mean Sq F value    Pr(>F)    
RotorLength             1 3.6578  3.6578 29.5262 3.971e-06 ***
RotorWidth              1 1.0120  1.0120  8.1687  0.007042 ** 
FootLength              1 0.3079  0.3079  2.4851  0.123676    
PaperWeight             1 0.6909  0.6909  5.5769  0.023730 *  
I(RotorLength^2)        1 2.2035  2.2035 17.7872  0.000159 ***
I(RotorWidth^2)         1 0.3347  0.3347  2.7018  0.108941    
FootLength:PaperWeight  1 0.4291  0.4291  3.4634  0.070922 .  
RotorWidth:FootLength   1 0.2865  0.2865  2.3126  0.137064    
Residuals              36 4.4598  0.1239                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Just adding the quadratic effects did not help either. However, using both linear and quadratic as a starting point did give a more extensive model.
Analysis of Variance Table

Response: sTime
                            Df Sum Sq Mean Sq  F value    Pr(>F)    
RotorLength                  1 3.6578  3.6578 103.8434 5.350e-10 ***
RotorWidth                   1 1.0120  1.0120  28.7293 1.918e-05 ***
FootLength                   1 0.3079  0.3079   8.7401 0.0070780 ** 
FoldLength                   1 0.0145  0.0145   0.4113 0.5276737    
FoldWidth                    1 0.0099  0.0099   0.2816 0.6007138    
PaperWeight                  1 0.7122  0.7122  20.2180 0.0001633 ***
DirectionOfFold              1 0.5175  0.5175  14.6902 0.0008514 ***
I(RotorLength^2)             1 1.7405  1.7405  49.4119 3.661e-07 ***
I(RotorWidth^2)              1 0.3160  0.3160   8.9709 0.0064635 ** 
I(FootLength^2)              1 0.1216  0.1216   3.4525 0.0760048 .  
I(FoldLength^2)              1 0.0045  0.0045   0.1272 0.7245574    
RotorLength:RotorWidth       1 0.4181  0.4181  11.8693 0.0022032 ** 
RotorLength:PaperWeight      1 0.3778  0.3778  10.7247 0.0033254 ** 
RotorWidth:FootLength        1 0.6021  0.6021  17.0947 0.0004026 ***
PaperWeight:DirectionOfFold  1 0.3358  0.3358   9.5339 0.0051968 ** 
RotorWidth:FoldLength        1 1.5984  1.5984  45.3778 7.167e-07 ***
RotorWidth:FoldWidth         1 0.3937  0.3937  11.1769 0.0028207 ** 
RotorWidth:PaperWeight       1 0.2029  0.2029   5.7593 0.0248924 *  
RotorWidth:DirectionOfFold   1 0.0870  0.0870   2.4695 0.1297310    
RotorLength:FootLength       1 0.0687  0.0687   1.9517 0.1757410    
FootLength:PaperWeight       1 0.0732  0.0732   2.0781 0.1629080    
Residuals                   23 0.8102  0.0352                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This model is quite extensive. For prediction purposes I would probably drop a few terms. For instance, FootLength:PaperWeight, with a p-value close to 0.15, could be removed; this would lessen the fit slightly yet might improve predictions. As it stands the model does have some issues; for instance, quite a few points have high leverage.
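As a sketch only (not part of the original analysis; s2 is the stepwise model fitted in the "Code used" section below), dropping that term and checking leverage could look like this:

s3 <- update(s2, . ~ . - FootLength:PaperWeight)  # drop the weak interaction
anova(s3, s2)                                     # compare the reduced and full fits
lev <- hatvalues(s2)                              # leverage of each observation
which(lev > 2 * mean(lev))                        # flag high-leverage points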

Conclusion

The paper helicopter needs quite a complex model to capture all effects on flying time. This somewhat validates the complex models found in part I.

Code used

library(dplyr)
library(car)
h3 <- read.table(sep='',header=TRUE,text='
        RotorLength RotorWidth BodyLength FootLength FoldLength FoldWidth PaperWeight DirectionOfFold Time
        5.5 3 1.5 0 5 1.5 light against 11.8
        5.5 3 1.5 2.5 11 2.5 heavy against 8.29
        5.5 3 5.5 0 11 2.5 light with 9
        5.5 3 5.5 2.5 5 1.5 heavy with 7.21
        5.5 5 1.5 0 11 1.5 heavy with 6.65
        5.5 5 1.5 2.5 5 2.5 light with 10.26
        5.5 5 5.5 0 5 2.5 heavy against 7.98
        5.5 5 5.5 2.5 11 1.5 light against 8.06
        11.5 3 1.5 0 5 2.5 heavy with 9.2
        11.5 3 1.5 2.5 11 1.5 light with 19.35
        11.5 3 5.5 0 11 1.5 heavy against 12.08
        11.5 3 5.5 2.5 5 2.5 light against 20.5
        11.5 5 1.5 0 11 2.5 light against 13.58
        11.5 5 1.5 2.5 5 1.5 heavy against 7.47
        11.5 5 5.5 0 5 1.5 light with 9.79
        11.5 5 5.5 2.5 11 2.5 heavy with 9.2
        8.5 4 3.5 1.25 8 2 light against 10.52
        8.5 4 3.5 1.25 8 2 light against 10.81
        8.5 4 3.5 1.25 8 2 light against 10.89
        8.5 4 3.5 1.25 8 2 heavy against 15.91
        8.5 4 3.5 1.25 8 2 heavy against 16.08
        8.5 4 3.5 1.25 8 2 heavy against 13.88
        8.5 4 2 2 6 2 light against 12.99
        9.5 3.61 2 2 6 2 light against 15.22
        10.5 3.22 2 2 6 2 light against 16.34
        11.5 2.83 2 2 6 1.5 light against 18.78
        12.5 2.44 2 2 6 1.5 light against 17.39
        13.5 2.05 2 2 6 1.5 light against 7.24
        10.5 2.44 2 1.5 6 1.5 light against 13.65
        12.5 2.44 2 1.5 6 1.5 light against 13.74
        10.5 3.22 2 1.5 6 1.5 light against 15.48
        12.5 3.22 2 1.5 6 1.5 light against 13.53
        11.5 2.83 2 1.5 6 1.5 light against 17.38
        11.5 2.83 2 1.5 6 1.5 light against 16.35
        11.5 2.83 2 1.5 6 1.5 light against 16.41
        10.08 2.83 2 1.5 6 1.5 light against 12.51
        12.91 2.83 2 1.5 6 1.5 light against 15.17
        11.5 2.28 2 1.5 6 1.5 light against 14.86
        11.5 3.38 2 1.5 6 1.5 light against 11.85
        11.18 2.94 2 2 6 1.5 light against 15.54
        11.18 2.94 2 2 6 1.5 light against 16.4
        11.18 2.94 2 2 6 1.5 light against 19.67
        11.18 2.94 2 2 6 1.5 light against 19.41
        11.18 2.94 2 2 6 1.5 light against 18.55
        11.18 2.94 2 2 6 1.5 light against 17.29
        ')
names(h3)

h3 <- h3 %>% 
    mutate(.,
        FRL=factor(format(RotorLength,digits=2)),
        FRW=factor(format(RotorWidth,digits=2)),
        FBL=factor(format(BodyLength,digits=2)),
        FFt=factor(format(FootLength,digits=2)),
        FFd=factor(format(FoldLength,digits=2)),
        FFW=factor(format(FoldWidth,digits=2)),
        allvl=interaction(FRL,FRW,FBL,FFt,FFd,FFW,PaperWeight,DirectionOfFold,drop=TRUE)
    )

h4 <- xtabs(~allvl,data=h3) %>% 
    as.data.frame %>%
    filter(.,Freq>1) %>%
    merge(.,h3) %>%
    select(.,RotorLength,
        RotorWidth,BodyLength,FootLength,
        FoldLength,FoldWidth,PaperWeight,
        DirectionOfFold,allvl,Time,Freq) %>%
    print
lm(Time~allvl,data=h4) %>% anova

par(mfrow=c(1,2))
aov(Time~allvl,data=h3) %>% residualPlot(.,main='Untransformed')
aov(log10(Time)~allvl,data=h3) %>% residualPlot(.,main='Log10 Transform')

lm(Time ~   RotorLength + RotorWidth + BodyLength +
            FootLength + FoldLength + FoldWidth + 
            PaperWeight + DirectionOfFold,
        data=h3) %>%
    boxCox(.)
dev.off()

lm(sqrt(Time)~allvl,data=h4) %>% anova

h3 <- mutate(h3,sTime=sqrt(Time))

lm(sTime ~  RotorLength + RotorWidth + BodyLength +
            FootLength + FoldLength + FoldWidth + 
            PaperWeight + DirectionOfFold,
        data=h3) %>%
    anova

s1 <- lm(sTime ~ 
            RotorLength + RotorWidth + BodyLength +
            FootLength + FoldLength + FoldWidth + 
            PaperWeight + DirectionOfFold ,
        data=h3) %>%
    step(.,scope=~(RotorLength + RotorWidth + BodyLength +
              FootLength + FoldLength + FoldWidth + 
              PaperWeight + DirectionOfFold)*
            (RotorLength + RotorWidth + BodyLength +
              FootLength + FoldLength + FoldWidth)+
            I(RotorLength^2) + I(RotorWidth^2) + I(BodyLength^2) +
            I(FootLength^2) + I(FoldLength^2) + I(FoldWidth^2) +
            PaperWeight*DirectionOfFold)
anova(s1)


s2 <- lm(sTime ~ 
            RotorLength + RotorWidth + BodyLength +
            FootLength + FoldLength + FoldWidth + 
            PaperWeight + DirectionOfFold +
            I(RotorLength^2) + I(RotorWidth^2) + I(BodyLength^2) +
            I(FootLength^2) + I(FoldLength^2) + I(FoldWidth^2) ,
        data=h3) %>%
    step(.,scope=~(RotorLength + RotorWidth + BodyLength +
              FootLength + FoldLength + FoldWidth + 
              PaperWeight + DirectionOfFold)*
            (RotorLength + RotorWidth + BodyLength +
              FootLength + FoldLength + FoldWidth)+
            I(RotorLength^2) + I(RotorWidth^2) + I(BodyLength^2) +
            I(FootLength^2) + I(FoldLength^2) + I(FoldWidth^2) +
            PaperWeight*DirectionOfFold)

anova(s2)
par(mfrow=c(2,2))
plot(s2)


A new R package for detecting unusual time series


(This article was first published on Hyndsight » R, and kindly contributed to R-bloggers)

The anomalous package provides some tools to detect unusual time series in a large collection of time series. This is joint work with Earo Wang (an honours student at Monash) and Nikolay Laptev (from Yahoo Labs). Yahoo is interested in detecting unusual patterns in server metrics.

The basic idea is to measure a range of features of the time series (such as the strength of seasonality, an index of spikiness, first-order autocorrelation, etc.). Then a principal component decomposition of the feature matrix is calculated, and outliers are identified in the two-dimensional space of the first two principal component scores.
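The anomalous package takes care of all of this for you; purely to illustrate the idea, a rough base-R sketch could look like the following. The feature definitions and the series_list object are made up for illustration and are not the package's implementation:

# Illustrative sketch: build a small feature matrix for a list of seasonal ts objects
# and look at the first two principal components.
features <- function(x) {
  stl_fit <- stl(x, s.window = "periodic")
  c(acf1      = acf(x, plot = FALSE)$acf[2],                     # first-order autocorrelation
    season    = var(stl_fit$time.series[, "seasonal"]) / var(x), # crude strength of seasonality
    spikiness = max(abs(scale(x))))                              # crude spikiness index
}
feat_mat <- t(sapply(series_list, features))  # series_list: hypothetical list of ts objects
pc <- prcomp(feat_mat, scale. = TRUE)
plot(pc$x[, 1:2])                             # unusual series sit far from the main cloud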

We use two methods to identify outliers.

  1. A bivariate kernel density estimate of the first two PC scores is computed, and the points are ordered based on the value of the density at each observation. This gives us a ranking of most outlying (least density) to least outlying (highest density).
  2. A series of α-convex hulls are computed on the first two PC scores with decreasing α, and points are classified as outliers when they become singletons separated from the main hull. This gives us an alternative ranking with the most outlying having separated at the highest value of α, and the remaining outliers with decreasing values of α.

I explained the ideas in a talk last Tuesday given at a joint meeting of the Statistical Society of Australia and the Melbourne Data Science Meetup Group. Slides are available here. A link to a video of the talk will also be added there when it is ready.

The density-ranking of PC scores was also used in my work on detecting outliers in functional data. See my 2010 JCGS paper and the associated rainbow package for R.

There are two versions of the package: one under an ACM licence, and a limited version under a GPL licence. Eventually we hope to make the GPL version contain everything, but we are currently dependent on the alphahull package which has an ACM licence.


rfoaas 0.1.6


(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

After a few quiet months, a new version of rfoaas is now on CRAN. As before, it shadows the upstream release of FOAAS and carries over its version number, 0.1.6.

The rfoaas package provides an interface for R to the most excellent FOAAS service, a modern, scalable and RESTful web service for the frequent need to tell someone to f$#@ off.

Release 0.1.6 of FOAAS builds on the initial support for filters and now adds working internationalization. This is best shown by example:

R> library(rfoaas)
R> off("Tom", "Everyone")                                      # standard upstream example
[1] "Fuck off, Tom. - Everyone"
R> off("Tom", "Everyone", language="fr")                       # now in French
[1] "Va te faire foutre, Tom. - Everyone"
R> off("Tom", "Everyone", language="de", filter="shoutcloud")  # or in German and LOUD
[1] "VERPISS DICH, TOM. - EVERYONE"
R> 

We start with a standard call to off(), add the (now fully functional upstream) language support and finally illustrate the shoutcloud.io filter added in the preceding 0.1.3 release.

As usual, CRANberries provides a diff to the previous CRAN release. Questions, comments etc should go to the GitHub issue tracker.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


New shinyjs version: Useful tools for any Shiny app developer + easily call JavaScript functions as R code


(This article was first published on Dean Attali's R Blog, and kindly contributed to R-bloggers)

About a month ago I made an announcement about the initial release of shinyjs. After some feedback, a few feature requests, and numerous hours of work, I’m excited to say that a new version of shinyjs, v0.0.6.2, was made available on CRAN this week. The package’s main objective is to make shiny app development better and easier by allowing you to perform many useful functions with simple R code that would normally require JavaScript coding. Some of the features include hiding/showing elements, enabling/disabling inputs, resetting an input to its original value, and many others.


Availability

shinyjs is available through both CRAN (install.packages("shinyjs")) and GitHub (devtools::install_github("daattali/shinyjs")). Use the GitHub version to get the latest version with the newest features.

Quick overview of new features

This post will only discuss new features in shinyjs. You can find out more about the package in the initial post or in the package README on GitHub. Remember that in order to use any function, you need to add a call to useShinyjs() in the shiny app’s UI.

Two major new features:

  • reset function allows inputs to be reset to their original value
  • extendShinyjs allows you to add your own JavaScript functions and easily call them from R as regular R code

Two major improvements:

  • Enabling and disabling of input widgets now works on all types of shiny inputs (many people asked how to disable a slider/select input/date range/etc, and shinyjs now handles all of them)
  • The toggle functions gained an additional condition argument, which can be used to show/hide or enable/disable an element based on a condition. For example, instead of writing code such as if (test) enable(id) else disable(id), you can simply write toggleState(id, test)

Three new features available on the GitHub version but not yet on CRAN:

  • hidden (used to initialize a shiny tag as hidden) can now accept any number of tags or a tagList rather than just a single tag
  • hide/show/toggle can be run on any JQuery selector, not only on a single ID, so that you can hide multiple elements simultaneously
  • hide/show/toggle have a new argument, delay, which can be used to perform the action later rather than immediately. This can be useful if you want to show a message and have it disappear after a few seconds

Two major new features

There were two major features that I wanted to include in the CRAN release.

reset - allows inputs to be reset to their original value

Being able to reset the value of an input has been a frequently asked question on StackOverflow and the shiny Google Group, but a general solution was never available. Now with shinyjs it’s possible and very easy: if you have an input with id name and you want to reset it to its original value, simply call reset("name"). It doesn’t matter what type of input it is - reset works with all shiny inputs.

The reset function only takes one argument, an HTML id, and resets all inputs inside that element. This makes reset very flexible, because you can either give it a single input widget to reset, or a form that contains many inputs and reset them all. Note that reset can only work on inputs that are generated from the app’s ui and it will not work for inputs generated dynamically using uiOutput/renderUI.

Here is a simple demo of reset in action

Reset demo
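The original demo is embedded above; a minimal sketch of such an app (not the author's demo code, the input ids are hypothetical) could be:

library(shiny)
library(shinyjs)

runApp(shinyApp(
  ui = fluidPage(
    useShinyjs(),
    div(id = "form",
        textInput("name", "Name", "Dean"),
        sliderInput("num", "Number", min = 0, max = 10, value = 5)),
    actionButton("resetAll", "Reset form")
  ),
  server = function(input, output, session) {
    observeEvent(input$resetAll, {
      reset("form")   # resets every input inside the "form" div
    })
  }
))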

extendShinyjs - allows you to easily call your own JavaScript functions from R

The main idea behind shinyjs when I started working on it was to make it extremely easy to call JavaScript functions that I used commonly from R. Now whenever I want to add a new function to shinyjs (such as the reset function), all I have to do is write the JavaScript function, and the integration between shiny and JavaScript happens seamlessly thanks to shinyjs. My main goal after the initial release was to also allow anyone else to use the same smooth R –> JS workflow, so that anyone can add a JavaScript function and call it from R easily. With the extendShinyjs function, that is now possible.

Very simple example

Using extendShinyjs is very simple and makes defining and calling JavaScript functions painless. Here is a very basic example of using extendShinyjs to define a (fairly useless) function that changes the colour of the page.

library(shiny)
library(shinyjs)

jsCode <- "shinyjs.pageCol = function(params){$('body').css('background', params);}"

runApp(shinyApp(
  ui = fluidPage(
    useShinyjs(),
    extendShinyjs(text = jsCode),
    selectInput("col", "Colour:",
                c("white", "yellow", "red", "blue", "purple"))
  ),
  server = function(input,output,session) {
    observeEvent(input$col, {
      js$pageCol(input$col)
    })
  }
))

Running the code above produces this shiny app:

Extendshinyjs demo

See how easy that was? All I had to do was make the JavaScript function shinyjs.pageCol, pass the JavaScript code as an argument to extendShinyjs, and then I can call js$pageCol(). That’s the basic idea: any JavaScript function named shinyjs.foo will be available to call as js$foo(). You can either pass the JS code as a string to the text argument, or place the JS code in a separate JavaScript file and use the script argument to specify where the code can be found. Using a separate file is generally preferred over writing the code inline, but in these examples I will use the text argument to keep it simple.

Passing arguments from R to JavaScript

Any shinyjs function that is called will pass a single array-like parameter to its corresponding JavaScript function. If the function in R was called with unnamed arguments, then it will pass an Array of the arguments; if the R arguments are named then it will pass an Object with key-value pairs. For example, calling js$foo("bar", 5) in R will call shinyjs.foo(["bar", 5]) in JS, while calling js$foo(num = 5, id = "bar") in R will call shinyjs.foo({num : 5, id : "bar"}) in JS. This means the shinyjs.foo function needs to be able to deal with both types of parameters.

To assist in normalizing the parameters, shinyjs provides a shinyjs.getParams() function which serves two purposes. First of all, it ensures that all arguments are named (even if the R function was called without names). Secondly, it allows you to define default values for arguments. Here is an example of a JS function that changes the background colour of an element and uses shinyjs.getParams().

shinyjs.backgroundCol = function(params) {
  var defaultParams = {
    id : null,
    col : "red"
  };
  params = shinyjs.getParams(params, defaultParams);

  var el = $("#" + params.id);
  el.css("background-color", params.col);
}

Note the defaultParams that we defined and the call to shinyjs.getParams. It ensures that calling js$backgroundCol("test", "blue") and js$backgroundCol(id = "test", col = "blue") and js$backgroundCol(col = "blue", id = "test") are all equivalent, and that if the colour parameter is not provided then “red” will be the default. All the functions provided in shinyjs make use of shinyjs.getParams, and it is highly recommended to always use it in your functions as well. Notice that the order of the arguments in defaultParams in the JavaScript function matches the order of the arguments when calling the function in R with unnamed arguments. This means that js$backgroundCol("blue", "test") will not work because the arguments are unnamed and the JS function expects the id to come before the colour.

For completeness, here is the code for a shiny app that uses the above function (it’s not a very practical example, but it’s great for showing how to use extendShinyjs with parameters):

library(shiny)
library(shinyjs)

jsCode <- '
shinyjs.backgroundCol = function(params) {
  var defaultParams = {
    id : null,
    col : "red"
  };
  params = shinyjs.getParams(params, defaultParams);

  var el = $("#" + params.id);
  el.css("background-color", params.col);
}'

runApp(shinyApp(
  ui = fluidPage(
    useShinyjs(),
    extendShinyjs(text = jsCode),
    p(id = "name", "My name is Dean"),
    p(id = "sport", "I like soccer"),
    selectInput("col", "Colour:",
                c("white", "yellow", "red", "blue", "purple")),    
    textInput("selector", "Element", ""),
    actionButton("btn", "Go")
  ),
  server = function(input,output,session) {
    observeEvent(input$btn, {
      js$backgroundCol(input$selector, input$col)
    })
  }
))

And the resulting app:

Extendshinyjs params demo

Note that I chose to define the JS code as a string for illustration purposes, but in reality I would prefer to place the code in a separate file and use the script argument instead of text.

Two major improvements

Among the many small improvements made, there are two that will be the most useful.

Enabling/disabling works on all inputs

The initial release of shinyjs had a disable/enable function which worked on the major input types that I was commonly using, but not all. Several people noticed that various inputs could not be disabled, so I made sure to fix all of them in the next version. The reasons why not all inputs were easy to disable are very technical, so I won’t go into them. Now calling disable(id) or enable(id) will work on any type of shiny input.

Use a condition in toggle functions

I’ve noticed that some users of shinyjs had to often write code such as if (test) enable(id) else disable(id). This seemed inefficient and verbose, especially since there was already a toggleState function that enables disabled elements and vice versa. The toggleState function now has a new condition parameter to address exactly this problem. The code above can now be rewritten as toggleState(id, test).

Similarly, code that previously used if (test) show(id) else hide(id) can now use toggle(id = id, condition = test), and code that was doing a similar thing with addClass/removeClass can use the toggleClass(id, class, condition) function.
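For example, inside a server function one might now write something like this (a sketch with hypothetical input ids "name" and "submit"):

observe({
  # enable the submit button only when a name has been entered
  toggleState(id = "submit", condition = nchar(input$name) > 0)
})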

Three new features available on the GitHub version but not yet on CRAN

Since submitting shinyjs to CRAN, there were a few more features added. They will go into the next CRAN submission in about a month, but for now they can be used if you download the GitHub version.

hidden now accepts multiple tags

The hidden function is the only shinyjs function that’s used in the UI rather than in the server. It’s used to initialize a tag as hidden, and you can later reveal it using show(tag). The initial release only allows single tags to be given to hidden, but now it can accept any number of tags or a tagList. For example, you can now add a button and some text to the UI and have them both hidden with this code: hidden(actionButton("btn", "Button"), p(id = "text", "text")). You can then call show("btn") or show("text") to unhide them.

Visibility functions can be run on any selector

Previously, the only way to tell the hide, show, and toggle functions what element to act on was to give them an ID. That becomes very limiting when you want to hide or show elements in batch, or even if you just want to show/hide an element without an ID. The visibility functions now have a new optional parameter selector that accepts any CSS-style selector. For example, to hide all hyperlinks on a page that have the class “hide-me-later” you can now call hide(selector = "a.hide-me-later"). This makes the visibility functions much more powerful.

Visibility functions can be delayed

In a shiny app that I’m currently developing for my graduate work there are many different “Update” buttons that the user can click on. After an update is successful, I wanted to show a “Done!” message that would disappear after a few seconds. Using shinyjs I was already able to show the message when I wanted to, but I needed an easy way to make it disappear later. So I added the delay parameter to show/hide/toggle, which tells the function to act only after x seconds instead of immediately. Now if I want to show a message and hide it after 5 seconds, I can call show("doneMsg"); hide(id = "doneMsg", delay = 5). It’s not a big deal, but it can be handy.

Feedback + suggestions

If you have any feedback on shinyjs, I’d love to hear about it! I really do hope that it’s as easy to use as possible and that many of you will find it useful. If you have any suggestions, please do open a GitHub issue or let me know in any other way.


“Mail merge” with RMarkdown


(This article was first published on Citizen-Statistician » R Project, and kindly contributed to R-bloggers)

The term “mail merge” might not be familiar to those who have not worked in an office setting, but here is the Wikipedia definition:

Mail merge is a software operation describing the production of multiple (and potentially large numbers of) documents from a single template form and a structured data source. The letter may be sent out to many “recipients” with small changes, such as a change of address or a change in the greeting line.

Source: http://en.wikipedia.org/wiki/Mail_merge

The other day I was working on creating personalized handouts for a workshop. That is, each handout contained some standard text (including some R code) and some fields that were personalized for each participant (login information for our RStudio server). I wanted to do this in RMarkdown so that the R code on the handout could be formatted nicely. Googling “rmarkdown mail merge” didn’t yield much (that’s why I’m posting this), but I finally came across this tutorial which called the process “iterative reporting”.

Turns out this is a pretty straightforward task. Below is a very simple minimal working example. You can obviously make your markdown document a lot more complicated. I’m thinking holiday cards made in R…

All relevant files for this example can also be found here.

Input data: meeting_times.csv

This is a 20 x 2 csv file, an excerpt is shown below. I got the names from here.

name meeting_time
Peggy Kallas 9:00 AM
Ezra Zanders 9:15 AM
Hope Mogan 9:30 AM
Nathanael Scully 9:45 AM
Mayra Cowley 10:00 AM
Ethelene Oglesbee 10:15 AM

R script: mail_merge_script.R


## Packages
library(knitr)
library(rmarkdown)

## Data
personalized_info <- read.csv(file = "meeting_times.csv")

## Loop
for (i in 1:nrow(personalized_info)){
  rmarkdown::render(input = "mail_merge_handout.Rmd",
                    output_format = "pdf_document",
                    output_file = paste("handout_", i, ".pdf", sep = ''),
                    output_dir = "handouts/")
}

RMarkdown: mail_merge_handout.Rmd

---
output: pdf_document
---

```{r echo=FALSE}
personalized_info <- read.csv("meeting_times.csv", stringsAsFactors = FALSE)
name <- personalized_info$name[i]
time <- personalized_info$meeting_time[i]
```

Dear `r name`,

Your meeting time is `r time`.

See you then!

Save the Rmd file and the R script in the same folder (or specify the path to the Rmd file accordingly in the R script), and then run the R script. This will call the Rmd file within the loop and output 20 PDF files to the handouts directory. Each of these files looks something like this

mail_merge_sample

with the name and time fields being different in each one.

If you prefer HTML or Word output, you can specify this in the output_format argument in the R script.
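For instance, a hedged Word variant of the render call inside the loop would only change the format and the file extension:

rmarkdown::render(input = "mail_merge_handout.Rmd",
                  output_format = "word_document",
                  output_file = paste("handout_", i, ".docx", sep = ''),
                  output_dir = "handouts/")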


Simple Data Science To Maximize Return On Lottery Investment


(This article was first published on Ripples, and kindly contributed to R-bloggers)

Every finite game has an equilibrium point (John Nash, Non-Cooperative Games, 1950)

I recently read this amazing book, where I discovered that we humans are not capable of generating random sequences of numbers by ourselves when we play the lottery. John Haigh demonstrates this fact by analyzing a sample of 282 raffles of the 6/49 UK Lotto. Once I read this, I decided to check whether this inability is a property of the British population only or whether Spanish people share it as well. I am Spanish, so this experiment may bring painful results for me, but here I go.

The Spanish equivalent of the 6/49 UK Lotto is called “Lotería Primitiva” (or “Primitiva” for short). This is a Primitiva lotto ticket:

JugarPrimitiva

As you can see, one ticket gives the chance to place 8 bets. Each bet consists of 6 numbers between 1 and 49, chosen on a grid of 10 rows by 5 columns. People tend to choose well-separated numbers because we think they are more likely to come up than combinations with some consecutive numbers. We think we have a better chance of getting rich choosing 4-12-23-25-31-43 rather than 3-17-18-19-32-33, for instance. To be honest, I must admit I am one of those people.

Primitiva lotto is managed by Sociedad Estatal Loterías y Apuestas del Estado, a public business entity belonging to the Spanish Ministry of Finance and Public Administrations. They know what people choose, and they could run this experiment more precisely than I can. They could analyze just the human bets (those made by players themselves) and discard the machine ones (those made automatically by vending machines), but it is nevertheless possible to confirm the thesis above with some public data.

I analysed 432 raffles of Primitiva carried out between 2011 and 2015; for each raffle I have this information:

  • The six numbers that form the winning combination
  • Total number of bets
  • Number of bets which hit the six numbers (Observed Winners)

The idea is to compare observed winners of raffles with the expected number of them, estimated as follows:

\text{Expected Winners} = \frac{\text{Total Bets}}{C_{6}^{49}}, \quad \text{where } C_{6}^{49} = \binom{49}{6} = \frac{49!}{43!\,6!}
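In R this is a one-liner; for example (total_bets is an illustrative name for the number of bets placed in a raffle):

expected_winners <- total_bets / choose(49, 6)  # choose(49, 6) = 13,983,816 possible combinations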

This table compares the number of expected and observed winners for raffles whose winning combination contains consecutive numbers and for those where it does not:

Table1

There are 214 raffles without consecutive numbers, with 294 winners, while the expected number of winners was 219. In other words, a winner of a non-consecutive raffle must, on average, share the prize with a third of another person. On the other hand, the number of observed winners of raffles with consecutive numbers is 17% lower than expected. Simple and conclusive. Spaniards are like the British, at least as far as this particular issue is concerned.

Let’s go further. I can do the same for any particular number. For example, there were 63 raffles containing the number 45 in the winning combination, with 57 observed winners, although 66 were expected. After doing this for every number, I can draw the following plot, where I paint in blue the numbers whose ratio of observed to expected winners is lower than 0.9:

LotteryBlues

It seems that blue numbers are concentrated on the right side of the grid. Do we prefer small numbers to big ones? There are 15 primes between 1 and 49 (a rate of 30%) but only 3 primes among the blue numbers (a rate of 23%). Are we attracted to primes?

Let’s combine both previous results. This table compares the number of expected and observed winners for raffles whose winning combination contains both consecutive numbers and at least one blue number, and for all other raffles:

Table2

Now, winning combinations with some consecutive and some blue numbers show 20% fewer observed winners than expected. After this, which combination would you choose for your next bet: 27-35-36-41-44-45 or 2-6-13-15-26-28? I would choose the first one. Both have the same probability of coming up, but you will probably become richer with the first one if it does.

This is the code for this experiment. If anyone needs the data set to do their own experiments, feel free to ask me (you can find my email here):

library("xlsx")  
library("sqldf")
library("Hmisc")
library("lubridate")
library("ggplot2")
library("extrafont")
library("googleVis")
windowsFonts(Garamond=windowsFont("Garamond"))
setwd("YOUR WORKING DIRECTORY HERE")
file = "SORTEOS_PRIMITIVA_2011_2015.xls"
data=read.xlsx(file, sheetName="ALL", colClasses=c("numeric", "Date", rep("numeric", 21)))  
#Impute null values to zero
data$C1_EUROS=with(data, impute(C1_EUROS, 0))
data$CE_WINNERS=with(data, impute(CE_WINNERS, 0))
#Expected winners for each raffle
data$EXPECTED=data$BETS/(factorial(49)/(factorial(49-6)*factorial(6)))
#Consecutives indicator
data$DIFFMIN=apply(data[,3:8], 1, function (x) min(diff(sort(x))))
#Consecutives vs non-consecutives comparison
df1=sqldf("SELECT CASE WHEN DIFFMIN=1 THEN 'Yes' ELSE 'No' END AS CONS, 
      COUNT(*) AS RAFFLES,
      SUM(EXPECTED) AS EXP_WINNERS, 
      SUM(CE_WINNERS+C1_WINNERS) AS OBS_WINNERS
      FROM data GROUP BY CONS")
colnames(df1)=c("Contains consecutives?", "Number of  raffles", "Expected Winners", "Observed Winners")
Table1=gvisTable(df1, formats=list('Expected Winners'='#,###'))
plot(Table1)
#Heat map of each number
results=data.frame(BALL=numeric(0), EXP_WINNER=numeric(0), OBS_WINNERS=numeric(0))
for (i in 1:49)
{
  data$TF=apply(data[,3:8], 1, function (x) i %in% x + 0)
  v=data.frame(BALL=i, sqldf("SELECT SUM(EXPECTED) AS EXP_WINNERS, SUM(CE_WINNERS+C1_WINNERS) AS OBS_WINNERS FROM data WHERE TF = 1"))
  results=rbind(results, v)
}
results$ObsByExp=results$OBS_WINNERS/results$EXP_WINNERS
results$ROW=results$BALL%%10+1
results$COL=floor(results$BALL/10)+1
results$ObsByExp2=with(results, cut(ObsByExp, breaks=c(-Inf,.9,Inf), right = FALSE))
opt=theme(legend.position="none",
          panel.background = element_blank(),
          panel.grid = element_blank(),
          axis.ticks=element_blank(),
          axis.title=element_blank(),
          axis.text =element_blank())
ggplot(results, aes(y=ROW, x=COL)) +
  geom_tile(aes(fill = ObsByExp2), colour="gray85", lwd=2) +
  geom_text(aes(family="Garamond"), label=results$BALL, color="gray10", size=12)+
  scale_fill_manual(values = c("dodgerblue", "gray98"))+
  scale_y_reverse()+opt
#Blue numbers
Bl=subset(results, ObsByExp2=="[-Inf,0.9)")[,1]
data$BLUES=apply(data[,3:8], 1, function (x) length(intersect(x,Bl)))
#Combination of consecutives and blues
df2=sqldf("SELECT CASE WHEN DIFFMIN=1 AND BLUES>0 THEN 'Yes' ELSE 'No' END AS IND, 
      COUNT(*) AS RAFFLES,
      SUM(EXPECTED) AS EXP_WINNERS, 
      SUM(CE_WINNERS+C1_WINNERS) AS OBS_WINNERS
      FROM data GROUP BY IND")
colnames(df2)=c("Contains consecutives and blues?", "Number of  raffles", "Expected Winners", "Observed Winners")
Table2=gvisTable(df2, formats=list('Expected Winners'='#,###'))
plot(Table2)


Analysing the Twitter Mentions Network


(This article was first published on Mango Solutions, and kindly contributed to R-bloggers)

By Douglas Ashton, Consultant

One of the big successes of data analytics is the cultural change in how business decisions are being made. There is now widespread acceptance of the role that data science has to play in decision making. With the explosion in the quantity of data available, the task for the modern analyst is to filter through it to find the information that is most relevant.

Twitter represents a classic case. The volume and velocity of Twitter data is staggering, and as discussed in the first part of this series it is within reach to obtain large, clean datasets. This puts the pressure on the analyst to ask the right questions of the data. In the remaining parts of this series we’ll look at a variety of ways to view the data from Twitter. The view you wish to take will ultimately depend on the question being asked. In this post we’re tackling the question: “Who are the big players in the data science Twitter community?”.

The dataset that we will be using is a sample of all tweets tagged with the hashtags #datascience, #rstats and #python (snake related tweets cleaned out) between the 7th and 17th December 2014. Each tweet contains a number of pieces of useful information. The view that we’re going to take in this post is the mentions network.

Mentions Network

There are a number of ways users can interact on Twitter. Users can “follow” other users to receive regular updates of their tweets, and users may “mention” other users in their own tweets.

DougTwitterBlog1

In the tweet above I mentioned @MangoTheCat in the text of the tweet, and so we draw a directed link from me, @dougashton, to @MangoTheCat. A retweet is another way for one user to mention another. We then went through each of the 22545 tweets in our data set and formed links for every mention. The resulting network contained 8220 nodes (users) and 20461 edges (mentions); in network language an “edge” is the same as a link.

A useful tool for dealing with networks in R is the feature-rich igraph package (also available for Python and C). Once you have created your network as an igraph object, many of the standard network analysis tools become easily available.

DougTwitterBlog2
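The code above is shown as an image; a minimal sketch of the idea (mentions below is a hypothetical two-column data frame of from/to screen names, one row per mention) would be:

library(igraph)
g <- graph_from_data_frame(mentions, directed = TRUE)  # mentions: hypothetical edge list
vcount(g)  # number of users (nodes)
ecount(g)  # number of mentions (edges)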

While igraph has nice built-in plotting tools, for large graphs I also like the cross-platform, open source, Gephi. Gephi is an interactive network visualisation and analysis tool with many available plugins. We can easily export our graph from igraph to an open format, such as graphml, and read it into Gephi. The layout below shows the full mentions network and was created using the Force Atlas 2 layout.

DougTwitterBlog3.jpg

Even with the limited sample that we used, this is a big network. This visualisation is useful as a high-level view of the network. For instance, we have a connected core of tweeters and a disconnected periphery, which is to be expected with this type of sampling technique. We can also see a common motif of clusters of nodes around one other node; these look like parachutes in the visualisation. To gain further insight we must dig a little deeper into the data.

Broadcasters and Receivers

As noted above, this network appears to contain some nodes that are surrounded by large clusters of other nodes. We can quantify this by looking at the degree centrality. The in-degree of a node is the total number of tweets that mention that user. Similarly, the out-degree is the number of mentions made by that user. The top ten nodes for both in- and out-degree are listed below:

DougTwitterBlog4
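The tables above are from the original post; a hedged sketch of how such rankings can be computed with igraph (using the graph object g from the earlier sketch) is:

sort(degree(g, mode = "in"),  decreasing = TRUE)[1:10]   # receivers: most mentioned
sort(degree(g, mode = "out"), decreasing = TRUE)[1:10]   # broadcasters: most mentions made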

The two lists are completely different. We have a group of users with a large out-degree who retweet at a high rate. These are our broadcasters. In general they tend to pass on content. The users with a large in-degree have their tweets retweeted many times by different users. These are our receivers.

It might be that you can stop there. The nodes with a high in-degree are likely important nodes in this network. However, how do we know how far their influence really goes? It is possible to get a high in-degree score by being retweeted many times by a small group of broadcasters, which is not a guarantee of influence. For a broader view we must go beyond this nearest-neighbour approach and look to more sophisticated network measures.

Centrality Measures

There are many standard measures of network structure available and the choice of which one to use really comes down to exactly what you are interested in. Here I’ll go through two, Page Rank and Betweenness Centrality. In igraph it’s as easy as running

DougTwitterBlog5
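The snippet above is shown as an image; the calls are presumably along these lines (a sketch, not the author's exact code):

pr <- page_rank(g)$vector   # Page Rank score for every node
bt <- betweenness(g)        # betweenness centrality for every node
sort(pr, decreasing = TRUE)[1:10]
sort(bt, decreasing = TRUE)[1:10]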

Page Rank is the basis of Google’s search engine. Roughly speaking, it tells you which nodes you are likely to land on if you spend some time surfing Twitter feeds, randomly following links. If mentions tend to flow towards you, you will get a high score. It also shows that you are connected to other influential nodes.

Betweenness Centrality is a useful measure for networks with a strong community structure. If you work out all of the shortest paths between all of the nodes, betweenness tells you how many of those paths go through each node. If you are the bridge between two communities then you will get a high score.

DougTwitterBlog6

This time we see many of the same users but now some familiar names begin to appear in these lists. For instance, we know that @hadleywickham is an influential figure in the R community, and while Hadley only tweeted 23 times in this period he features high up the Page Rank centrality list.

@MangoTheCat only tweeted four times in this period yet the cat’s betweenness score is relatively high. This implies that the connections formed in those tweets were bridging connections between different types of node. We can see this a little better if we look at a much smaller version of our network. The strongly connected component is the part of the network where you can travel from any node to any node along the links. We get this by finding the clusters and keeping the largest.

DougTwitterBlog7
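Again, the original code is an image; a hedged igraph sketch for extracting the largest strongly connected component would be:

comp <- components(g, mode = "strong")   # strongly connected components
core <- induced_subgraph(g, which(comp$membership == which.max(comp$csize)))
plot(core)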

DougTwitterBlog8.png

While a little way out from the core, we see that @MangoTheCat does indeed sit between the core and a group on the edge of the cluster.

Summing up

We’ve seen that the tools available in R make acquiring and analysing the Twitter network an easily accessible task. In this post we’ve dipped our toe into the vast array of methods available in network analysis, and we’ve found that digging a little deeper than simply counting connections can lead to a deeper insight into the network’s true function. With so much data available it is important to ask the right questions. Whether your goal is to improve your social media presence or to search more effectively for influential users, a little bit of the right network tooling can go a long way.


Cluster analysis on earthquake data from USGS


(This article was first published on R tutorial for Spatial Statistics, and kindly contributed to R-bloggers)
Theoretical Background
In some cases we would like to classify the events in our dataset based on their spatial location or on some other data. As an example we can return to the epidemiological scenario in which we want to determine whether the spread of a certain disease is affected by the presence of a particular source of pollution. With the G function we are able to determine quantitatively that our dataset is clustered, which means that the events are not driven by chance but by some external factor. Now we need to verify that there is indeed a cluster of points located around the source of pollution; to do so we need a form of classification of the points.
Cluster analysis refers to a series of techniques that allow the subdivision of a dataset into subgroups, based on their similarities (James et al., 2013). There are various clustering methods, but probably the most common is k-means clustering. This technique aims at partitioning the data into a specific number of clusters, defined a priori by the user, by minimizing the within-cluster variation. The within-cluster variation measures how much each event in a cluster k differs from the others in the same cluster. The most common way to compute the differences is the squared Euclidean distance (James et al., 2013), calculated as follows:

W_k = \frac{1}{n_k} \sum_{i,i' \in k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2
where W_k is the within-cluster variation for cluster k, n_k is the total number of elements in cluster k, p is the total number of variables we are considering for clustering, and x_ij is the value of variable j for event i in cluster k. This equation seems complex, but it is actually quite easy to understand. To see what it means in practice we can take a look at the figure below.



For the sake of the argument we can assume that all the events in this point pattern are located in one unique cluster k, so n_k is 15. Since we are clustering events based on their geographical location we are working with two variables, i.e. latitude and longitude, so p is equal to two. To calculate the variance for one single pair of points in cluster k, we simply compute the difference between the first point's value of the first variable, i.e. its latitude, and the second point's value of the same variable, and we do the same for the second variable. So the variance between points 1 and 2 is calculated as follows:

V_{1:2} = (x_{1,\mathrm{lat}} - x_{2,\mathrm{lat}})^2 + (x_{1,\mathrm{lon}} - x_{2,\mathrm{lon}})^2
where V_{1:2} is the variance between the two points. Clearly the geographical position is not the only factor that can be used to partition events in a point pattern; for example, we can divide earthquakes based on their magnitude. The two equations can therefore be adapted to take more variables, and the only difference is in the number of terms that need to be summed to calculate the variation between two points. The only potential problem is the number of computations needed to obtain a solution; this, however, is something that the k-means algorithm solves very efficiently.

The algorithm starts by randomly assigning each event to a cluster, then it calculates the mean centre of each cluster (we looked at what the mean centre is in the post Introductory Point Pattern Analysis of Open Crime Data in London). At this point it calculates the Euclidean distance between each event and each mean centre and reassigns the events to the cluster with the closest mean centre; then it recalculates the mean centres, and it keeps going until the cluster assignments stop changing. As an example we can look at the figure below, assuming we want to divide the events into two clusters.



In Step 1 the algorithm assigns each event to a cluster at random. It then computes the mean centres of the two clusters (Step 2), which are the large black and red circles. Then the algorithm calculates the Euclidean distance between each event and the two mean centres and reassigns the events to new clusters based on the closest mean centre (Step 3): if a point was first in cluster one but is closer to the mean centre of cluster two, it is reassigned to the latter. Subsequently the mean centres are computed again for the new clusters (Step 4). This process keeps going until the cluster assignments stop changing.
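In R this whole procedure is available through the base function kmeans(). As a hedged illustration (xy here is a hypothetical two-column matrix of longitude/latitude values, not an object defined in this post):

# xy: hypothetical two-column matrix of coordinates (longitude, latitude)
cl <- kmeans(xy, centers = 2)           # ask for k = 2 clusters, as in the example above
plot(xy, col = cl$cluster, pch = 16)    # events coloured by assigned cluster
points(cl$centers, pch = 8, cex = 2)    # the final mean centres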



Practical Example
In this experiment we will look at a very simple exercise in cluster analysis of seismic events downloaded from the USGS website. To complete this exercise you will need the following packages: sp, raster, plotrix, rgeos, rgdal and scatterplot3d.
I already mentioned in the post Downloading and Visualizing Seismic Events from USGS how to download the open data from the United States Geological Survey, so I will not repeat the process. The code for that is the following.

URL <- "http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_month.csv"
Earthquake_30Days <- read.table(URL, sep = ",", header = T)
 
 
#Download, unzip and load the polygon shapefile with the countries' borders
download.file("http://thematicmapping.org/downloads/TM_WORLD_BORDERS_SIMPL-0.3.zip",destfile="TM_WORLD_BORDERS_SIMPL-0.3.zip")
unzip("TM_WORLD_BORDERS_SIMPL-0.3.zip",exdir=getwd())
polygons <- shapefile("TM_WORLD_BORDERS_SIMPL-0.3.shp")

I also included the code to download the shapefile with the borders of all countries.

For the cluster analysis I would like to try to divide the seismic events by origin. In other words, I would like to see if there is a way to distinguish between events close to plates, volcanoes or other faults. In many cases the distinction is hard to make, since many volcanoes originate from subduction, e.g. the Andes, where plates and volcanoes are close to one another and the algorithm may find it difficult to distinguish the origins. In any case I would like to explore the use of cluster analysis and see what the algorithm is able to do.

Clearly the first thing we need to do is download data on the location of plates, faults and volcanoes. We can find shapefiles with this information at the following website: http://legacy.jefferson.kctcs.edu/techcenter/gis%20data/

The data are provided in zip files, so we need to extract them and load them into R. There are some legal restrictions on the use of these data. They are distributed by ESRI and can be used in conjunction with the book "Mapping Our World: GIS Lessons for Educators." Details of the license and other information may be found here: http://legacy.jefferson.kctcs.edu/techcenter/gis%20data/World/Earthquakes/plat_lin.htm#getacopy

If you have the rights to download and use these data for your studies, you can download them directly from the web with the following code. We already looked at code to do this in previous posts, so I will not go into details here:

dir.create(paste(getwd(),"/GeologicalData",sep=""))
 
#Faults
download.file("http://legacy.jefferson.kctcs.edu/techcenter/gis%20data/World/Zip/FAULTS.zip",destfile="GeologicalData/FAULTS.zip")
unzip("GeologicalData/FAULTS.zip",exdir="GeologicalData")
 
faults <- shapefile("GeologicalData/FAULTS.SHP")
 
 
#Plates
download.file("http://legacy.jefferson.kctcs.edu/techcenter/gis%20data/World/Zip/PLAT_LIN.zip",destfile="GeologicalData/plates.zip")
unzip("GeologicalData/plates.zip",exdir="GeologicalData")
 
plates <- shapefile("GeologicalData/PLAT_LIN.SHP")
 
 
#Volcano
download.file("http://legacy.jefferson.kctcs.edu/techcenter/gis%20data/World/Zip/VOLCANO.zip",destfile="GeologicalData/VOLCANO.zip")
unzip("GeologicalData/VOLCANO.zip",exdir="GeologicalData")
 
volcano <- shapefile("GeologicalData/VOLCANO.SHP")

The only piece of code I have not presented before is the first line, which creates a new folder. It is pretty self-explanatory: we just pass a string with the name of the folder and R creates it. The rest of the code downloads the data from the address above, unzips the archives and loads them into R.

We have not yet transformed the object Earthquake_30Days, which is currently a data.frame, into a SpatialPointsDataFrame. The data from USGS contain seismic events that are not only earthquakes but also events related to mining and other sources. For this analysis we want to keep only the events classified as earthquakes, which we can do with the following code:

Earthquakes <- Earthquake_30Days[paste(Earthquake_30Days$type)=="earthquake",]
coordinates(Earthquakes)=~longitude+latitude

This extracts only the earthquakes and transforms the object into a SpatialObject.


We can create a map that shows the earthquakes alongside all the other geological elements we downloaded using the following code, which saves the image directly as a JPEG:

jpeg("Earthquake_Origin.jpg",4000,2000,res=300)
plot(plates,col="red")
plot(polygons,add=T)
title("Earthquakes in the last 30 days",cex.main=3)
lines(faults,col="dark grey")
points(Earthquakes,col="blue",cex=0.5,pch="+")
points(volcano,pch="*",cex=0.7,col="dark red")
legend.pos <- list(x=20.97727,y=-57.86364)
 
legend(legend.pos, legend=c("Plates","Faults","Volcanoes","Earthquakes"),
       pch=c("-","-","*","+"), col=c("red","dark grey","dark red","blue"),
       bty="n", bg=c("white"), y.intersp=0.75, title="Days from Today", cex=0.8)
 
text(legend.pos$x,legend.pos$y+2,"Legend:")
dev.off()

This code is very similar to what I used here, so I will not explain it in detail. We just added more elements to the plot, and therefore we need to remember that R plots in layers, one on top of the other, in the order in which they appear in the code. For example, as you can see from the code, the first thing we plot is the plates, which will therefore be drawn below everything else, even the borders of the polygons, which come second. You can change this just by changing the order of the lines. Just remember to use the option add=T correctly.
The result is the image below:


Before proceeding with the cluster analysis we first need to fix the projections of the SpatialObjects. Luckily the object polygons was created from a shapefile with the projection data attached to it, so we can use it to tell R that the other objects have the same projection:

projection(faults)=projection(polygons)
projection(volcano)=projection(polygons)
projection(Earthquakes)=projection(polygons)
projection(plates)=projection(polygons)

Now we can proceed with the cluster analysis. As I said, I would like to try to classify earthquakes based on their distance from the various geological features. To calculate this distance we can use the function gDistance in the package rgeos.
These shapefiles are all unprojected and their coordinates are in degrees. We cannot use them directly with gDistance, because it works only with projected data, so we need to transform them using the function spTransform (in the package rgdal). This function takes two arguments: the first is the SpatialObject, which needs to have projection information attached, and the second is a CRS object describing the projection to transform the object into. The code for doing that is the following:

volcanoUTM <- spTransform(volcano,CRS("+init=epsg:3395"))
faultsUTM <- spTransform(faults,CRS("+init=epsg:3395"))
EarthquakesUTM <- spTransform(Earthquakes,CRS("+init=epsg:3395"))
platesUTM <- spTransform(plates,CRS("+init=epsg:3395"))

The projection we are going to use is the World Mercator projection (EPSG:3395); details here: http://spatialreference.org/ref/epsg/wgs-84-world-mercator/

NOTE:
the plates object also includes lines along the borders of the image above. This is something R cannot deal with here, so I had to remove them manually in ArcGIS. If you want to replicate this experiment you will have to do the same. I do not know of any method in R to do that quickly; if you do, please let me know in the comment section.


We are going to create a matrix of distances between each earthquake and the geological features with the following loop:

#NOTE: because we set coordinates(Earthquakes) = ~longitude+latitude, sub@coords
#returns the projected X (easting) first and Y (northing) second, in metres
distance.matrix <- matrix(0, nrow(EarthquakesUTM), 7,
                          dimnames=list(c(), c("X","Y","Mag","Depth","DistV","DistF","DistP")))
for(i in 1:nrow(EarthquakesUTM)){
  sub <- EarthquakesUTM[i,]
  dist.v <- gDistance(sub, volcanoUTM)   #distance to the nearest volcano
  dist.f <- gDistance(sub, faultsUTM)    #distance to the nearest fault
  dist.p <- gDistance(sub, platesUTM)    #distance to the nearest plate boundary
  distance.matrix[i,] <- matrix(c(sub@coords, sub$mag, sub$depth, dist.v, dist.f, dist.p), ncol=7)
}
 
 
distDF <- as.data.frame(distance.matrix)


In this code we first create an empty matrix, which is usually wise to do since R allocates all the RAM it needs up front, and filling a pre-allocated matrix is faster than growing one from inside the loop. In the loop we iterate through the earthquakes and, for each one, calculate its distance to the three geological features. Finally we convert the matrix into a data.frame.
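
As a side note, gDistance also has a byid argument that returns the whole matrix of pairwise distances in a single call, so the per-earthquake minima can then be taken with apply. I have not run this on these data (the matrix can get large with many features), so treat it as a possible alternative rather than part of the original workflow:

#Possible vectorised alternative for the volcano distances (untested here)
dv <- gDistance(EarthquakesUTM, volcanoUTM, byid=TRUE)
#take the minimum over the volcano dimension, whichever way the matrix is oriented
dist.v.all <- if(nrow(dv) == length(EarthquakesUTM)) apply(dv, 1, min) else apply(dv, 2, min)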

The next step is finding the correct number of clusters. To do that we can follow the approach suggested by Matthew Peeples here: http://www.mattpeeples.net/kmeans.html and also discussed in this stackoverflow post: http://stackoverflow.com/questions/15376075/cluster-analysis-in-r-determine-the-optimal-number-of-clusters

The code for that is the following:

mydata <- scale(distDF[,5:7])
wss <- (nrow(mydata)-1)*sum(apply(mydata, 2, var))
for (i in 2:15) wss[i] <- sum(kmeans(mydata, centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares")

We basically compute the clustering for 2 to 15 clusters and plot the number of clusters against the "within groups sum of squares", which is the quantity that is minimised during the clustering process. Generally this quantity decreases very quickly up to a point and then more or less stops decreasing. We can see this behaviour in the plot below, generated from the earthquake data:


As you can see, for 1 and 2 clusters the sum of squares is high and decreases quickly, then it keeps decreasing between 3 and 5 clusters, and after that it becomes erratic. So the best number of clusters is probably 5, but clearly this is an empirical method, so we would need to check other numbers and test whether they make more sense.

To create the clusters we can simply use the function kmeans, which takes two arguments: the data and the number of clusters:

clust <- kmeans(mydata,5)
distDF$Clusters <- clust$cluster
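
One caveat worth adding (my addition, not part of the original workflow): kmeans starts from random centres, so repeated runs can give slightly different clusters. Setting a seed and using the nstart argument, which runs several random starts and keeps the best solution, makes the result reproducible and usually more stable:

set.seed(123)                            #reproducible cluster labels
clust <- kmeans(mydata, 5, nstart=25)    #keep the best of 25 random starts
distDF$Clusters <- clust$cluster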

We can check the physical meaning of the clusters by plotting them against the distance from the geological features using the function scatterplot3d, in the package scatterplot3d:

scatterplot3d(x=distDF$DistV, xlab="Distance to Volcano",
              y=distDF$DistF, ylab="Distance to Fault",
              z=distDF$DistP, zlab="Distance to Plate",
              color=clust$cluster, pch=16, angle=120, scale=0.5, grid=T, box=F)

This function is very similar to the standard plot function, but it takes three coordinate arguments instead of two. I wrote the call so that each coordinate is immediately followed by its axis label, to make the pairing clearer: the variable for x, then the x-axis label, and so on for each axis. Then we set the colours based on the clusters and the symbol with pch, as we would do in plot. The remaining options are specific to scatterplot3d: the angle between the x and y axes, the scale of the y axis compared to the other two, whether to draw a grid on the xy plane, and whether to draw a box around the plot. The result is the following image:



It seems that the red and green clusters are very similar; they differ only in that red is closer to volcanoes than to faults, and vice versa for green. The black cluster seems only to be farther away from volcanoes. Finally, the blue and light blue clusters seem to be close to volcanoes and far away from the other two features.

We can create an image with the clusters using the following code:

clustSP <- SpatialPointsDataFrame(coords=Earthquakes@coords,data=data.frame(Clusters=clust$cluster))
 
jpeg("Earthquake_Clusters.jpg",4000,2000,res=300)
plot(plates,col="red")
plot(polygons,add=T)
title("Earthquakes in the last 30 days",cex.main=3)
lines(faults,col="dark grey")
points(volcano,pch="x",cex=0.5,col="yellow")
legend.pos <- list(x=20.97727,y=-57.86364)
 
points(clustSP,col=clustSP$Clusters,cex=0.5,pch="+")
legend(legend.pos, legend=c("Plates","Faults","Volcanoes","Earthquakes"),
       pch=c("-","-","x","+"), col=c("red","dark grey","dark red","blue"),
       bty="n", bg=c("white"), y.intersp=0.75, title="Days from Today", cex=0.6)
 
text(legend.pos$x,legend.pos$y+2,"Legend:")
 
dev.off()

I created the object clustSP from the original WGS84 coordinates so that I can plot everything as before. I also plotted the volcanoes in yellow, so that they differ from the red cluster. The result is the following image:



To conclude this experiment I would also like to explore the relation between the distance to the geological features and the magnitude of the earthquakes. To do that we need to identify the events that lie within a certain distance of each geological feature. We can use the function gBuffer, again from the package rgeos, for this job.

volcano.buffer <- gBuffer(volcanoUTM,width=1000)
volcano.over <- over(EarthquakesUTM,volcano.buffer)
 
plates.buffer <- gBuffer(platesUTM,width=1000)
plates.over <- over(EarthquakesUTM,plates.buffer)
 
faults.buffer <- gBuffer(faultsUTM,width=1000)
faults.over <- over(EarthquakesUTM,faults.buffer)

This function takes a minimum of two arguments: the SpatialObject and the buffer distance (in metres, because it requires projected data), given by the option width. The result is a SpatialPolygons object containing a buffer around the starting features; for example, if we start with a point we end up with a circle of radius equal to width. In the code above we first created these buffer areas and then overlaid EarthquakesUTM on them to find the events located within their borders. The over function returns one of two values for each event: NA if it lies outside the buffer area and 1 if it lies inside. We can use this information to subset EarthquakesUTM later on.
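
For example, a quick way to see how many events fall within the 1000 m buffers of each feature type is simply to count the non-NA values returned by over:

sum(!is.na(volcano.over))    #events within 1 km of a volcano
sum(!is.na(faults.over))     #events within 1 km of a fault
sum(!is.na(plates.over))     #events within 1 km of a plate boundary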

Now we can include the overlays in EarthquakesUTM as follows:

EarthquakesUTM$volcano <- as.numeric(volcano.over)
EarthquakesUTM$plates <- as.numeric(plates.over)
EarthquakesUTM$faults <- as.numeric(faults.over)

To determine whether there is a relation between the distance from each feature and the magnitude of the earthquakes, we can simply plot the magnitude distributions of the events included in the buffer areas created above, with the following code:

plot(density(EarthquakesUTM[paste(EarthquakesUTM$volcano)=="1",]$mag),
     ylim=c(0,2), xlim=c(0,10), main="Earthquakes by Origin", xlab="Magnitude")
lines(density(EarthquakesUTM[paste(EarthquakesUTM$faults)=="1",]$mag), col="red")
lines(density(EarthquakesUTM[paste(EarthquakesUTM$plates)=="1",]$mag), col="blue")
legend(3, 0.6, title="Mean magnitude per origin",
       legend=c(paste("Volcanic", round(mean(EarthquakesUTM[paste(EarthquakesUTM$volcano)=="1",]$mag), 2)),
                paste("Faults", round(mean(EarthquakesUTM[paste(EarthquakesUTM$faults)=="1",]$mag), 2)),
                paste("Plates", round(mean(EarthquakesUTM[paste(EarthquakesUTM$plates)=="1",]$mag), 2))),
       pch="-", col=c("black","red","blue"), cex=0.8)

which creates the following plot:


It seems that earthquakes close to plates have higher magnitude on average.







To leave a comment for the author, please follow the link and comment on his blog: R tutorial for Spatial Statistics.


A comparison of high-performance computing techniques in R


(This article was first published on Revolutions, and kindly contributed to R-bloggers)

When it comes to speeding up "embarrassingly parallel" computations (like for loops with many iterations), the R language offers a number of options:

  • An R looping operator, like mapply (which runs in a single thread)
  • A parallelized version of a looping operator, like mcmapply (which can use multiple cores)
  • Explicit parallelization, via the parallel package or the ParallelR suite (which can use multiple cores, or distribute the problem across nodes in a cluster)
  • Translating the loop to C++ using Rcpp (which runs as compiled and optimized machine code)
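
As a toy illustration of the first two options above (my own sketch, not the airport example from the post), here is how a serial looping operator compares with its multicore counterpart on a deliberately slow function; on Windows, set mc.cores = 1, since forking is not available there:

library(parallel)
 
slow_square <- function(x) { Sys.sleep(0.01); x^2 }       # pretend this is expensive
 
system.time(mapply(slow_square, 1:100))                   # single thread
system.time(mcmapply(slow_square, 1:100, mc.cores = 2))   # two cores (Unix-alikes only)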

Data scientist Tony Fischetti tried all of these methods and more in an attempt to find the distance between every pair of airports (a problem that grows polynomially in time as the number of airports increases, but which is embarrassingly parallel). Here's a chart comparing the time taken by the various methods as the number of airports grows:

Parallel

The clear winner is Rcpp — the orange line at the bottom of the chart. The line looks like it's flat, but while the time does increase as the problem gets larger, it's much, much faster than all the other methods tested. Ironically, Rcpp doesn't use any parallelization at all and so doesn't benefit from the quad-processor system used for testing, but again: it's just that much faster.

Check out the blog post linked below for a detailed comparison of the methods used, and some good advice for using Rcpp effectively (pro-tip: code the whole loop, not just the body, with Rcpp).

On the lambda: Lessons learned in high-performance R

To leave a comment for the author, please follow the link and comment on his blog: Revolutions.


Back from R/Finance in Chicago


(This article was first published on mages' blog, and kindly contributed to R-bloggers)
I had a great time at the R/Finance conference in Chicago last Friday/Saturday. Some brief takeaways for me were:

From Emanuel Derman's talk: It is important to distinguish between theories and models. Theories live in an abstract world and, for a given set of axioms, they can be proven right. However, models live in the real world, are built on simplifying assumptions and are only useful until experiments/data prove them wrong.

'Pornography is hard to define, but I know it when I see it.' Matt Dowle from h2o had the laughs on his side when he started his talk with this Justice Potter Stewart quote to illustrate the value of his data.table package to its users.

Bryan W. Lewis showed why inverting a matrix is tricky, particularly when it contains entries close to zero and what you can do about it.

Marius Hofert gave a stimulating talk on simsalapar a package for parallel simulations, which I need to study in more detail.

Following a brief conversation with Dirk about drat I finally got the punch line of the package, but not so much the joke of 'drat' as a fairly mild expression of anger or annoyance; I had never heard the expression in the UK. Perhaps drat is better explained as Dirk's R Archive Template?

The audience seemed to have appreciated my talk on Communicating Risk. My chart of visualising profitability using a Whale Chart appeared to have resonated with a few.

Furthermore, I learned that the weather in Chicago is even more unstable than in London. After an amazing conference dinner at the Trump Tower, spending most of the time outside and admiring the sunset, we experienced a very cold and rainy Saturday. But then again, there is always time for a jazz club and a drink. Talking about drinks, thanks to Q Ethan McCallum I had a true American breakfast experience, including bottomless coffee.

Yet, the last word should go to ShabbyChef, who took a photo of a slide during Louis Marascio's keynote and tweeted:


Amen.

To leave a comment for the author, please follow the link and comment on his blog: mages' blog.


Using Azure as an R datasource: Part 2 – Pulling data from MySQL/MariaDB


(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Gregory Vandenbrouck
Software Engineer, Microsoft

This post is the second in a series that covers pulling data from various Windows Azure hosted storage solutions (such as MySQL, or Microsoft SQL Server) to an R client on Windows or Linux.

Last time we covered pulling data from SQL Azure to an R client on Windows. This time we’ll be pulling data from MariaDB (compatible with MySQL) hosted on a Linux VM in Azure to an R client on Windows.

Creating the database

The Azure Management site changes quite often, therefore these instructions are valid “at the time of this writing” :o)

  1. Log on the Azure Management site with your Azure subscription (see previous post),
  2. Create a VM: select “New”, “Compute”, “Virtual Machine”, “Quick Create” and then fill-in the information. In our case we chose “SUSE Linux Enterprise” for the image, but most images should work, both Windows and Linux.
  3. Once the creation is completed, click on “Virtual Machines”, your machine name, and then “endpoints”. You should already have SSH (port 22) listed if you chose a Linux VM (and if not, add it). Add MySQL (port 3306) to the list.
  4. Connect to your VM via ssh.
    • Putty is a free lightweight ssh client for Windows,
    • MobaXterm is a more advanced solution that also provides a free version.
  5. Once logged on to the VM via ssh, install MySQL or MariaDB: sudo zypper install mariadb*.
    Note: this command and subsequent ones may vary depending on the choice of Linux distribution. For example the equivalent of “zypper” may instead be called: “yum”, “apt-get”, “synaptic”, “aptitude”, “dpkg-install”, etc.
  6. Start the server. sudo rcmysql start
  7. Secure the server: sudo mysql_secure_installation
  8. Add a user that’s authorized to connect remotely. Below is the full transcript of a session where we add user MyUser with password MyPassword that can connect remotely from any location with all privileges (not recommended), and we also create the MyDatabase database:

    sshuser@MyServer:~> mysql -s --user=root -p
    Enter password:
    MariaDB [(none)]> create user 'MyUser'@'%' identified by 'MyPassword';
    MariaDB [(none)]> grant all privileges on *.* to 'MyUser'@'%' with grant option;
    MariaDB [(none)]> create database MyDatabase;
    MariaDB [(none)]> quit
    sshuser@MyServer:~>

Connecting to the database from outside of R

Optional step, but can be useful for troubleshooting. For example to solve firewall, port and credential issues.

  • Install mysql.exe client or MariaDB client.
  • Below an example of a successful test session in a command prompt (search for “cmd” in the Start menu or screen to start a command prompt):

    C:>"c:Program FilesMySQLMySQL Workbench 6.2 CEmysql.exe" -s -hMyServer.cloudapp.net -uMyUser -pMyPassword -DMyDatabase
    Warning: Using a password on the command line interface can be insecure.
    mysql> select 1+1 as Answer;
    Answer
    2
    mysql> quit
    C:\>

(replace MyServer, MyUser, MyPassword and MyDatabase with your values)

Note: there’s no space between the flags and values; e.g. it’s “-pMyPassword” and not “-p MyPassword”.

Connecting to the database from R on Windows

Using RODBC’s odbcDriverConnect function

MySQL’s ODBC drivers need to be installed. To see if you have these installed and to get the driver’s name, open the “ODBC Data Source” administrator (see previous post for instructions) and look in the “Drivers” tab. Below is an example with MySQL drivers installed:

  ODBC3

If the drivers aren’t installed, you can get them by installing MySQL’s or MariaDB’s client. There are also ODBC-drivers-only installation options.

library(RODBC)

myServer <- "MyServer.cloudapp.net"
myUser <- "MyUser"
myPassword <- "MyPassword"
myDatabase <- "MyDatabase"
myDriver <- "MySQL ODBC 5.3 Unicode Driver"

connectionString <- paste0(
    "Driver=", myDriver,
    ";Server=", myServer,
    ";Database=", myDatabase,
    ";Uid=", myUser,
    ";Pwd=", myPassword)
conn <- odbcDriverConnect(connectionString)
sqlQuery(conn, "SELECT 142857 * 3 AS Cyclic")
##   Cyclic
## 1 428571
close(conn) # don't leak connections !

Using RODBC’s odbcConnect function

Similar to the Microsoft SQL Server case, you can persist a Data Source Name (DSN). This time I created a “System DSN” (available to everyone):

ODBC4

Unlike the SQL Server driver, MySQL allows for saving credentials. This is handy as credentials no longer need to be present in clear text in the R script. Unfortunately, the credentials are saved in clear text in Windows’ registry. Choose your poison! 

For the drivers we used, the registry locations are HKEY_CURRENT_USER\Software\ODBC\ODBC.INI for “User DSN” and HKEY_LOCAL_MACHINE\Software\ODBC\ODBC.INI for “System DSN”.

When using a DSN with saved credentials, the R code becomes much simpler.

library(RODBC)

# No need to specify uid and pwd: these are part of the DSN
conn <- odbcConnect("MyMariaDBAzure")
sqlQuery(conn, "SELECT 10001 - 73 * 137 AS Zero")
##   Zero
## 1    0
close(conn)

Using RJDBC

To use RJDBC to connect to MySQL or MariaDB, you need to:

  • Install:
    • the JDK (RJDBC relies on Java via rJava), and
    • the MySQL JDBC driver (Connector/J), which provides the jar file referenced below.
  • Know the following:
    • driverClass: the class name. See the driver-specific documentation.
    • classPath: the location of the jar file. If you don’t know where it is, try running dir /s /b %systemdrive%\*sql*.jar from a command prompt.
    • url connector prefix. Again, driver-specific.

In my specific setup:

library(RJDBC)
drv <- JDBC(
    driverClass = "com.mysql.jdbc.Driver",
    classPath = "C:/Program Files (x86)/MySQL/MySQL Connector J/mysql-connector-java-5.1.35-bin.jar")
conn <- dbConnect(drv, "jdbc:mysql://MyServer.cloudapp.net", "MyUser", "MyPassword")
dbGetQuery(conn, "SELECT 1+1")
dbDisconnect(conn)

Using RMySQL

RMySQL also needs the JDK installed.

It’s slightly easier to use than RJDBC:

library(RMySQL)
conn <- dbConnect(RMySQL::MySQL(), host = "MyServer.cloudapp.net", user="MyUser", password="MyPassword")
dbGetQuery(conn, "SELECT 1+1")
dbDisconnect(conn)

Summary

  • We’ve tried RODBC, RJDBC and RMySQL to connect to a MariaDB database hosted on a Suse VM in Azure.
  • In all cases, we had to install something on the client machine (besides R packages): drivers and/or JDK.
  • When it comes to performance, our experience was similar to Microsoft SQL Server’s case: RODBC was faster, both to connect and to get back the query results.
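
If you want to check that last point on your own setup (this is a rough sketch, not the benchmark behind the statement above), timing the same query through each interface is straightforward:

library(RODBC)
system.time({
  conn <- odbcConnect("MyMariaDBAzure")   # the DSN created earlier
  print(sqlQuery(conn, "SELECT 1+1 AS two"))
  close(conn)
})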

To leave a comment for the author, please follow the link and comment on his blog: Revolutions.


Hacks for thinking about high-dimensional space


(This article was first published on isomorphismes, and kindly contributed to R-bloggers)

High-dimensional Euclidean space is ℝ×ℝ×ℝ×ℝ×ℝ×…. Cartesian product of many continuous quantities.

physical sliders · a css slider

You are already familiar with the concept via “an arbitrary number of sliders” or “an arbitrary number of columns in a spreadsheet”. Pivot tables are the normal way business people navigate a low-dimensional subset of a high-dimensional space.

OLAP · OLAP · OLAP · database schema

One database or slider-config would be one point in a high-dimensional space. But what do these spaces look like overall?

Lots of Corners

Think about the graph-skeleton of cubes and squares. Zoom in on a corner.

skeleton of a cube

Look at how the number of neighbours increases as you increase dimension. The corner of a box in 2-D has two sticks coming out of it; in 3-D three sticks come out of it. In 30 dimensions there would be 30 sticks coming out of each ball. (Toward the interior. You’d need 2•30 sticks if we’re talking about a grid rather than a box.)

3d grid

More on the outside, less on the inside.

Think about a Matryushka of nested spheres ⬤⊂⬤⊂⬤⊂… or, ok, even dolls.

Beatles Matryushka

The 10th one requires more plastic than the 1st one.

Inverse square law.svg
Picture by Borb. Licensed under CC BY-SA 3.0 via Wikimedia Commons.

In 3-D a two-dimensional surface gets larger as radius². In 30 dimensions surface areas will go up like radius²⁹.

Fatter at the equator.

Circles of latitude are wider in the middle of the ball than at the top.

bigger circles of latitude at the equator

But this is a consequence of the coördinatisation (lat,long), not of the object itself. A hectare in Finland is as big as a hectare in Kenya. And I’m not suddenly faster running in Finland either.

David J. Wright's picture of walking in a Kleinian group

(Manifolds solve the problem by patching together different charts, each of which shows one area more faithfully.)

arctic circle

Cuman-Kipchak

southern hemisphere

The sphere is tiny or the cube’s mass is in its spikes.

This is justified by comparing spheres with cubes.

the spiky picture

In the 3-D the corners of the cube stick out beyond the sphere. I could measure the size of these corners by putting another sphere outside the corners, and calculating volume(circumsphere) − volume(insphere). The Earth has more volume in its mantle than in its inner core.

manticore, I mean mantle + core

Combine this with what I said above in the Beatles Matryushka section. I’m seeing the mantle and core as successions of nested shells of volume ≈ ∫radius², each bigger than the last. Once again this pattern will only get more extreme in high dimensions.
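
A quick numeric check of this (my own addition, using the standard formula for the volume of a unit d-ball): the fraction of the cube [−1,1]^d occupied by the inscribed unit ball collapses rapidly as d grows.

d <- c(2, 3, 10, 30)
ball <- pi^(d/2) / gamma(d/2 + 1)   # volume of the unit d-ball
cube <- 2^d                         # volume of [-1,1]^d
signif(ball / cube, 3)              # roughly 0.785, 0.524, 0.00249, 2e-14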

Simplices

These are the spaces on which probability happens. In 3 variables I can think about (⅓,⅓,⅓), (⅔,0,⅓), (0,1,0), and (0,0,1).

3-simplex

Do they have to get more complicated when we pad them with zeroes? (0,0,⅓,0,0,⅓,0,0,0,⅓) and (0,0,⅔,0,0,0,0,0,0,⅓) tell me that a 10-simplex has 10 choose 3 = 120 3-simplices embedded in it. But (.1,.1,.1,.1,.1,.1,.1,.1,.1,.1) tells me it contains more than that.

I can still think about the 10-simplex as connections between (1,0,0,0,0,0,0,0,0,0), (0,1,0,0,0,0,0,0,0,0), … (0,0,0,0,0,0,0,0,0,1), but now there are more ways to connect—since each of those points has up to 9 buddies it can be heading toward at a time.

complete graph of order 10

When 2 entries are nonzero, that’s a 2-D surface. When 7 entries are nonzero, that’s a 7-D surface. (Just think of 7 sliders.)

sliders

3-Spheres on Up

All of this mention of spheres, it’d be nice to have an alternative description to w²+x²+y²+z²=1 and sin θ₁ sin θ₂ sin θ₃ … cos θₙ, which I’ve always found hard to digest.

wut

I know that S¹×S¹ isn’t S², because that would be a torus instead. So what do I combine, and how, to get S²?

torus shown as product of circles

I like the Drawstring Bag model: drawstring bag model of the sphere

That takes me from the open disk to S² (= the surface of the Earth—sans magma). This is a topological construction so small perturbations are ok. Calling the geoid a “sphere” is ok. The disk doesn’t have to be exactly circular.

the Earth’s surface ≠ S²

To get to the 3-sphere I then imagine a solid sphere (= Earth + magma), attach another point ∞, and identify all of the Earth’s crust with ∞.

Going back to the open disk, here’s another imagination-exercise I do to clear this up for myself. In some of the early Mario games there was a glitch that let you put half of Mario’s body on one side of the screen and half on the other. This is what I think of when I imagine two sides of a square being identified.

a polygon with some sides identified makes a 2-manifold—something I bet the ancients could not have imagined you could do.

(You can play inside an octagon-with-identifications too. Thicken this octagon and imagine yourself flying around inside it. Imagine waving your arm through eg a̅₁. Where does it come out? Turn your head and watch what it’s doing. BTW, just like Mario and Pac–Man, this identification manifold was considered by entertainment artists around the same time period—remember Scooby-Doo and the Gang running in and out of hotel doors? But the animators will change the rules either over time or for different people or introduce noncommutativity—part of the humour, which proves that people do intuitively understand identification polyhedra (as well as, obviously, object permanence), or else those changes wouldn’t be funny.)


DunceHatSpace.png

DunceHatSpace”. Licensed under Public Domain via Wikimedia Commons.

some identifications of the square that make different 2-dimensional spaces

To do this with the drawstring-bag, I imagine a circle of friends. Even though they look like they’re standing all the way across the disk from each other, they’re actually immediately next to each other at ★, and they could make a ring and touch hands and dance around it.

Or, a solitary explorer at ★ could take small steps and look, in the disk model, like he’s jumped all the way across the Earth. He stretches his hand out and, like Mario, it shows up on the other side. But this is just an aberrance of the disk model, not of S². (This also tells you how the gridding should go on the disk.)

So in symbols, T² = ◯×◯ ≠ S² = ⬤/◯. (The pattern Im ƒ / ker ƒ is much broader, including the +c of integration.)

Specific dimension sometimes matters.

A lot of what I’m walking you through is dimension → ∞ kind of stuff. But sometimes odd or even matters. And sometimes there are structures that show up in like dimension 7 or even seemingly random high numbers. Just a warning.

magic triangle for Lie groups by Predrag Cvitanovic

Computing on the surface of a high-dimensional sphere.

Persi Diaconis explains how to do this. The trick is to think of “matrices with determinant one” as the group SO(n), which represents either a point on a sphere or the action of twisting a sphere so ★ moves to that point. (The “noun” and “verb” versions are so easily interchangeable that mathematicians will usually elide the two.) If you allow determinant ±1, that’s O(n)—so allowing reflection—and O(n) is what the Gram-Schmidt algorithm reduces matrices to in order to solve [A]⋅x⃗ = b⃗.

To do what Dr Diaconis says in R you would run this 2-liner:

library(magrittr)   # provides the %>% pipe used below
n = 3
rnorm(n**2) %>% matrix(n,n) %>% qr %>% qr.Q

which you can verify has determinant ±1 and is orthogonal with things like:

rnorm(n**2) %>% matrix(n,n) %>% qr %>% qr.Q -> diaconis
diaconis %>% eigen
{diaconis %>% eigen}$values %>% Mod     #eigenvalues of an O(n) are length 1
diaconis %>% det
diaconis %>% crossprod %>% zapsmall     #diaconis %*% t(diaconis)

Use rotations.

A heap of linear algebra—and therefore the logic of high-dimensional Euclidean space—is simplified if you can imagine quotienting by SO(n).


Having the drawstring-bag model and the stick+ball model and the ability to compute examples with matrices and the rectangular-grid model helps me imagine high-dimensional space. Hope that was useful for you.

Next steps

∂D³

To leave a comment for the author, please follow the link and comment on his blog: isomorphismes.


Air Pollution (PM10 and PM2.5) in Different Cities using Interactive Charts


(This article was first published on Adventures in Analytics and Visualization, and kindly contributed to R-bloggers)
Gardiner Harris, a South Asia correspondent for the New York Times, shared a personal story of his son’s breathing troubles in New Delhi, India, in a recent dispatch titled Holding Your Breath in India. In this post, I use data from the World Health Organization’s website to identify and map cities where the air quality is worse than acceptable levels, as measured by the annual mean concentration of particulate matter (PM10 and PM2.5). A link to these data was provided in the New York Times article. I use many packages from the R-Studio, Ramnath Vaidyanathan and Kenton Russel team, among others, in this process. Please visit http://patilv.com/airpollution/ for the post.

To leave a comment for the author, please follow the link and comment on his blog: Adventures in Analytics and Visualization.


New package commonmark: yet another markdown parser?


(This article was first published on OpenCPU, and kindly contributed to R-bloggers)
opencpu logo

Last week the commonmark package was released on CRAN. The package implements some very thin R bindings to John Macfarlane’s (author of pandoc) cmark library. From the cmark readme:

cmark is the C reference implementation of CommonMark, a rationalized version of Markdown syntax with a spec. It provides a shared library (libcmark) with functions for parsing CommonMark documents to an abstract syntax tree (AST), manipulating the AST, and rendering the document to HTML, groff man, CommonMark, or an XML representation of the AST.

Each of the R wrapping functions parses markdown and renders it to one of the output formats:

md <- "
## Test
My list:
  - foo
  - bar"

The markdown_html function converts markdown to HTML:

library(commonmark)
cat(markdown_html(md))
<h2>Test</h2>
<p>My list:</p>
<ul>
<li>foo</li>
<li>bar</li>
</ul>

The markdown_xml function gives the parse tree in xml format:

cat(markdown_xml(md))
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE CommonMark SYSTEM "CommonMark.dtd">
<document>
  <header level="2">
    <text>Test</text>
  </header>
  <paragraph>
    <text>My list:</text>
  </paragraph>
  <list type="bullet" tight="true">
    <item>
      <paragraph>
        <text>foo</text>
      </paragraph>
    </item>
    <item>
      <paragraph>
        <text>bar</text>
      </paragraph>
    </item>
  </list>
</document>

Most of the value in commonmark is probably in the latter. There already exist a few nice markdown converters for R, including the popular rmarkdown package, which uses pandoc to convert markdown to several presentation formats.

The formal commonmark spec makes markdown suitable for more strict documentation purposes, where we might currently be inclined to use json or xml. For example we could use it to parse NEWS.md files from R packages in a way that allows for archiving and indexing individual news items, without ambiguity over indentation rules and such.
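
As a rough illustration of that idea (my own sketch, assuming the xml2 package; the node names are those shown in the AST output above), the individual items of a bullet list can be pulled out of the XML rendering like this:

library(commonmark)
library(xml2)
 
news <- "## News\n  - first item\n  - second item"
ast <- read_xml(markdown_xml(news))
# local-name() sidesteps any default namespace the AST may declare
xml_text(xml_find_all(ast, "//*[local-name()='item']"))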

To leave a comment for the author, please follow the link and comment on his blog: OpenCPU.


R vs Autobox vs ForecastPro vs …


(This article was first published on Hyndsight » R, and kindly contributed to R-bloggers)

Every now and then a commercial software vendor makes claims on social media about how their software is so much better than the forecast package for R, but no details are provided.

There are lots of reasons why you might select a particular software solution, and R isn’t for everyone. But anyone claiming superiority should at least provide some evidence rather than make unsubstantiated claims.

The M3 forecasting competition was organized by Spyros Makridakis and Michèle Hibon. Entrants had to forecast 3003 time series and the results were compared to a test set that was withheld from participants. All the data (both training and test sets) and the forecasts of the original participants are publicly available in the Mcomp package for R. The competition results are also publicly available in an IJF paper published in 2000. The best performing methods overall were the Theta method, ForecastPro and ForecastX, as measured by the symmetric MAPE (sMAPE) that was favoured by Makridakis and Hibon. The following table shows some of the results from the original competition including results from the main commercial software vendors. The first sMAPE column is taken from the original paper. My own recalculation of the sMAPE results usually gives values slightly less than those published (I don’t know why). The MAPE column shows the mean absolute percentage error and the MASE column shows the mean absolute scaled errors.

Method         Average sMAPE   Average sMAPE (recalculated)   MAPE    MASE
Theta          13.01           12.76                          17.42   1.39
ForecastPro    13.19           13.06                          18.00   1.47
ForecastX      13.49           13.09                          17.35   1.42
BJ automatic   14.01           13.72                          19.13   1.54
Autobox2       14.41           13.82                          18.23   1.51
Autobox1       15.23           15.20                          20.36   1.69
Autobox3       15.33           15.46                          19.31   1.57

BJ automatic was produced by ForecastPro but with the forecasts restricted to ARIMA models. For some reason, Autobox was allowed three separate submissions (a practice normally not allowed as it leads to over-fitting on the test set).

Any good forecasting software should be aiming to get close to (or better than) Theta on this test. After all, the M3 competition was held more than 15 years ago. Presumably all of the software companies have tried to improve their results since then. Unfortunately, none of them to my knowledge has published any updated figures. I wish they would (preferably independently verified). It would provide some evidence that they are improving their algorithms.

My aim with the forecast package for R is to make freely available state-of-the-art algorithms for some forecasting models. I do not attempt to offer a comprehensive suite of algorithms, but what I do provide gives forecasts that are in the same ballpark as the best methods in the M3 competition. Here is the evidence.

Method                   Average sMAPE   MAPE    MASE
ETS                      13.13           17.38   1.43
AutoARIMA                13.85           19.12   1.47
Combined ETS/AutoARIMA   12.88           17.63   1.40

The last method is a simple average of the forecasts from ets and auto.arima. If you only want point forecasts, that is the best approach available in the forecast package. It is also better than any of the commercial software (at least as far as they have been prepared to subject their algorithms to independent testing).
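
For readers who want to try the combination themselves, a minimal sketch on a single M3 series looks like this (the series index is arbitrary; the full evaluation runs over all 3003 series and averages the error measures):

library(forecast)
library(Mcomp)
 
series <- M3[[1500]]                       # one series, chosen arbitrarily
h  <- series$h                             # forecast horizon used in the competition
f1 <- forecast(ets(series$x), h = h)$mean
f2 <- forecast(auto.arima(series$x), h = h)$mean
combined <- (f1 + f2) / 2                  # simple average of the point forecasts
accuracy(combined, series$xx)              # compare against the withheld test set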

Unlike the commercial vendors, you don’t have to take my word for it. My algorithms are open source, and the code that produced the above tables can be downloaded.

To leave a comment for the author, please follow the link and comment on his blog: Hyndsight » R.


My aversion to pipes


(This article was first published on From the Bottom of the Heap - R, and kindly contributed to R-bloggers)

At the risk of coming across as even more of a curmudgeonly old fart than people already think I am, I really do dislike the current vogue in R that is the pipe family of binary operators; e.g. %>%. Introduced by Hadley Wickham and popularised and advanced via the magrittr package by Stefan Milton Bache, the basic idea brings the forward pipe of the F# language to R. At first, I was intrigued by the prospect and initial examples suggested this might be something I would find useful. But as time has progressed and I've seen the use of these pipes spread, I've grown to dislike the idea altogether. Here I outline why.

The forward pipe operator is designed, in R at least (I’m not familiar with F#), to avoid the sort of nested/inline R code of the type shown below

the_data <- head(transform(subset(read.csv('/path/to/data/file.csv'),
                                  variable_a > x),
                           variable_c = variable_a/variable_b),
                 100)

replacing that awful mess with

the_data <-
  read.csv('/path/to/data/file.csv') %>%
  subset(variable_a > x) %>%
  transform(variable_c = variable_a/variable_b) %>%
  head(100)

And when compared against one another like that, who wouldn’t rejoice at the prospect of a pipe to banish such awful R code to distant memory? The problem with this comparison though is, who writes code like that in the first code block? I don’t think I’ve ever written code like that, even when I was a very green useR around the turn of the century.

When you compare the pipe version with how I’d lay out the R code

the_data <- read.csv('/path/to/data/file.csv')
the_data <- subset(the_data, variable_a > x)
the_data <- transform(the_data, variable_c = variable_a/variable_b)
the_data <- head(the_data, 100) # I'm perplexed as to why this would be a good thing to do?

the benefits of the pipe remain but they aren't, at least in my opinion, as compelling. My version is verbose; I repeatedly overwrite the the_data object with subsequent operations. Rather than writing the_data once, as in the pipe version, I'd write it 7 times! But that said, I could pass my version to a relative novice useR and they'd have a reasonable grasp of what the code did. I don't think the same could be said for the pipe version.

But all that really doesn’t matter does it. It’s personal preference as to how you choose to write your data analysis and manipulation R script code. If you find it easier to write code and then read it back using the pipe operator all power to you.

Where I think it does make a difference is where you are

  1. writing code to go into an R package for general consumption on say CRAN, or
  2. writing example material for your package in a vignette or similar document.

I don’t claim that these are the only problem areas nor that these are universally accepted. I wager I’m in the majority position at the moment, but that is probably down to the relatively recent arrival of the pipe on the R scene.

Why is the pipe a problem if you are writing code to go into a general purpose R package that you expect users to abuse with their own data in their own code? Two reasons. The pipe operator involves the standard non-standard evaluation (NSE) paradigm. The pipe captures expressions on each side of the %>% operator and then arranges for the thing on the left of %>% to be injected into the expression on the right of %>%, usually as the first argument but not always. This all involves capturing the expressions and evaluating them within the %>%() function.
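
For anyone who has not met the operator, the rewriting it performs is roughly this (a small illustration of the mechanics just described, nothing more):

library(magrittr)
 
head(mtcars, 3)      # these two lines are equivalent: the left-hand side
mtcars %>% head(3)   # is injected as the first argument of the call
 
3 %>% head(mtcars, n = .)   # the dot placeholder puts it somewhere else instead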

OK, isn’t that what all functions using a formula do, or what transform(), subset(), et al do? Well yes, and this is where my spider sense starts tingling. Who among us hasn’t had those things fail on us when we dropped them into an lapply() inside an anonymous function? Or wrapped those function as part of a package function only for some user to execute your function in a way you didn’t envisage? Now Hadley assures us that there is a correct way to do NSE and he even has a package for that, lazyeval. But still I have my reservations, despite Stefan’s attempts to allay my fears

OK, let’s assume Stefan and Hadley know what they are doing (and I invariably do) and the NSE used here really is safe. That still leaves the major problem I have with writing R code like this in package functions; how do you read it, parse it, and understand what it does? How do you track down a bug in the code and where it occurs if several steps are conflated into a single pipe chain? I’m not a pipe smoker so I’ll have to guess; you undo the chain and see where things break. Wouldn’t it have been easier to just write out the steps in the first place? That way the debugger can just step through the statements line by line as you’ve written them. I’m not alone in having concerns in this general area

I suppose a lot of this will come down to how well you grok pipes and how well you understand your actual code.

OK, enough of that; on to problem area number 2. I was recently helping a StackOverflow user massage some output from a vegan function into a format suitable for plotting with ggplot2. There, the aim was to go from this:

Group.1                S.obs     se.obs    S.chao1   se.chao1
Cliona celata complex  499.7143  59.32867  850.6860  65.16366
Cliona viridis         285.5000  51.68736  462.5465  45.57289
Dysidea fragilis       358.6667  61.03096  701.7499  73.82693
Phorbas fictitius      525.9167  24.66763  853.3261  57.73494

to this:

                Group.1   var        S       se
1 Cliona celata complex chao1 850.6860 65.16366
2 Cliona celata complex   obs 499.7143 59.32867
3        Cliona viridis chao1 462.5465 45.57289
4        Cliona viridis   obs 285.5000 51.68736
5      Dysidea fragilis chao1 701.7499 73.82693
6      Dysidea fragilis   obs 358.6667 61.03096
7     Phorbas fictitius chao1 853.3261 57.73494
8     Phorbas fictitius   obs 525.9167 24.66763

(or at least something pretty close to it) so that the required dynamite plot (yes, yes, I know!) could be produced.

A little fiddling with reshape2 suggested this wasn't something it would handle gracefully (I may well be wrong here; I'm not familiar with that particular package), and having recalled some details of Hadley's tidyr package I felt that it would be better suited to the problem at hand. Not having used tidyr, I proceeded to CRAN to grab the manual and look at any vignettes that might help me understand how to solve this particular problem. Thankfully, Hadley is a conscientious R package maintainer and there was a rather nice HTML-rendered version of the vignette right there on CRAN for me to peruse. The only downside was that all the example code used pipes.

The very first usage example is (or was, depending on when you are reading this)

library(tidyr)
library(dplyr)
preg2 <- preg %>% 
  gather(treatment, n, treatmenta:treatmentb) %>%
  mutate(treatment = gsub("treatment", "", treatment)) %>%
  arrange(name, treatment)
preg2

Innocuous enough I guess, until you realise that I'm also reading the manual, which has usage that doesn't involve pipes, and that Hadley isn't naming the arguments in the calls here. Now I have to grok what is being passed, and where, by the pipes, whilst trying to match the usage shown in the example snippet with the arguments in the manual. I might be old-school but yes, I do read the manual.

The point I’m trying to make here with my little anecdote is this; what point did the use of the pipe serve here? How am I as a user new to the package helped by Hadley also using the pipe? In my case I wasn’t; in fact it made it somewhat trickier to understand what went where, what the actual tidyr calls were etc. Now I fully understand that Hadley finds the pipe operator to be very expressive for data analysis, and who am I to argue with that? Where I would raise an issue is that if you are writing introductory example code, don’t force your users to have to grapple with two new concepts at once, at least not in the first few examples.

I don’t want to beat on Hadley over this; it’s just that this was a prime example of where the use of the pipe was obfuscatory not revelatory, for me at least.

So yes, I am a curmudgeonly old fart, but this old dog can learn new tricks. Convince me I’m wrong here cause I really do want to like the pipe; my Granddad smoked one and I have fond memories of the smell and, well, all the cool kids are using the pipe so it must be good, right?

To leave a comment for the author, please follow the link and comment on his blog: From the Bottom of the Heap - R.


Computing with GPUs in R


(This article was first published on Revolutions, and kindly contributed to R-bloggers)

On Monday, we compared the performance of several different ways of calculating a distance matrix in R. Now there's another method to add to the list: using GPU acceleration in R.

A GPU is a dedicated, high-performance chip available on many computers today. Unlike the CPU, it's not used for general computations, but rather for specialized tasks that benefit from a massively multi-threaded architecture. Video-game graphics is the usual target for GPUs, but in recent years they've been used for certain high-performance computing tasks as well. The problem is that GPUs require specialized programming, and because they have limited access to RAM, they're generally not well suited to tasks that require a lot of data throughput. But for simulations and other tasks that require a lot of computing on limited data, they can offer huge performance benefits.

The rpud package for R implements a few algorithms in R that will use a CUDA-compatible NVIDIA GPU for the computations. The algorithms include support vector machines, Bayesian classification, and hierarchical linear models. On the NVIDIA CUDA Zone blog, Gord Sissons tested the rpud package for hierarchical clustering, which involves calculating a distance matrix. Here's a comparison of the performance using regular R functions (blue) and GPU-accelerated functions (orange):

Pruhclust_performance-624x425

Note the Y axis is on a log-10 scale: in most cases the GPU-based functions ran 10x faster than the standard CPU-based functions.

GPU programming doesn't help with everything, but if your problem happens to be one that has a GPU-based implementation, and you have the appropriate GPU hardware, the results can be dramatic. Check the link below for details of the tests, and how you can spin up a cloud-based GPU server to run them on.
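
For a sense of what is being accelerated, this is the CPU-only equivalent of such a task in base R (a toy example of my own, not the benchmark from the linked post); the GPU route swaps these calls for the corresponding rpud functions:

set.seed(1)
m <- matrix(rnorm(2000 * 50), ncol = 50)   # 2000 observations, 50 variables
system.time({
  d  <- dist(m)                            # pairwise Euclidean distance matrix
  hc <- hclust(d, method = "ward.D2")      # hierarchical clustering on the CPU
})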

Parallel Forall: GPU-Accelerated R in the Cloud with Teraproc Cluster-as-a-Service

To leave a comment for the author, please follow the link and comment on his blog: Revolutions.


Discovered Two Great Web Sites Today


(This article was first published on Mad (Data) Scientist, and kindly contributed to R-bloggers)

Today is my lucky day.  I learned of two very interesting Web pages, both of them quite informative and the first of them rather provocative (yay!). I have some comments on both, in some cases consisting of mild disagreement, which I may post later, but in any event, I highly recommend both.  Here they are:

Tip: Read Drew’s page as soon as possible, before he removes the F-bombs. :-)


To leave a comment for the author, please follow the link and comment on his blog: Mad (Data) Scientist.


Processing Punycode and IDNA Domain Names in R


(This article was first published on Data Driven Security, and kindly contributed to R-bloggers)

When fighting the good cyber-fight, one often has to process domain names. Our good friend @alexcpsec was in need of Punycode/IDNA processing in R which begat the newly-minted punycode R package. Much of the following has been culled from open documentation, so if you are already “in the know” about Punycode & IDNA, skip to the R code bits.

What is ‘Punycode’?

Punycode is a simple and efficient transfer encoding syntax designed for use with Internationalized Domain Names in Applications. It uniquely and reversibly transforms a Unicode string into an ASCII string. ASCII characters in the Unicode string are represented literally, and non-ASCII characters are represented by ASCII characters that are allowed in host name labels (letters, digits, and hyphens).

What is ‘IDNA’?

Until now, there has been no standard method for domain names to use characters outside the ASCII repertoire. The IDNA document defines internationalized domain names (IDNs) and a mechanism called IDNA for handling them in a standard fashion. IDNs use characters drawn from a large repertoire (Unicode), but IDNA allows the non-ASCII characters to be represented using only the ASCII characters already allowed in so-called host names today. This backward-compatible representation is required in existing protocols like DNS, so that IDNs can be introduced with no changes to the existing infrastructure. IDNA is only meant for processing domain names, not free text.

Why domain validation?

Organizations that manage some Top Level Domains (TLDs) have published tables with characters they accept within the domain. The reason may be to reduce complexity that come from using the full Unicode range, and to protect themselves from future (backwards incompatible) changes in the IDN or Unicode specifications. Libidn (and, hence, this package) implements an infrastructure for defining and checking strings against such tables.

Working with punycode

All three functions in the package are vectorized at the C-level.

For encoding and decoding operations, you pass in vectors of domain names and the functions return encoded or decoded character vectors. If there are any issues during the conversion of a particular domain name, the function will substitute Invalid for the domain name.

For the TLD validation function, any character set or conversion issue will cause FALSE to be returned. Otherwise the function will return TRUE.

Usage

devtools::install_github("hrbrmstr/punycode")
library(punycode)

ascii_doms <- c("xn------qpeiobbci9acacaca2c8a6ie7b9agmy.net",
"xn----0mcgcx6kho30j.com",
"xn----9hciecaaawbbp1b1cd.net",
"xn----9sbmbaig5bd2adgo.com",
"xn----ctbeewwhe7i.com",
"xn----ieuycya4cyb1b7jwa4fc8h4718bnq8c.com",
"xn----ny6a58fr8c8rtpsucir8k1bo62a.net",
"xn----peurf0asz4dzaln0qm161er8pd.biz",
"xn----twfb7ei8dwjzbf9dg.com",
"xn----ymcabp2br3mk93k.com")

intnl_doms <- c("ثبت-دومین.com",
"טיול-לפיליפינים.net",
"бизнес-тренер.com",
"новый-год.com",
"東京ライブ-バルーンスタンド.com",
"看護師高収入-求人.net",
"ユベラ-贅沢ポリフェノール.biz",
"เด็ก-ภูเก็ต.com",
"ایران-هاست.com")


for_valid <- c("gr€€n.no", "זגורי-אימפריה-לצפייה-ישירה.net", "ثبت-دومین.com",
"טיול-לפיליפינים.net", "xn------qpeiobbci9acacaca2c8a6ie7b9agmy.net", "xn----0mcgcx6kho30j.com",
"xn----9hciecaaawbbp1b1cd.net", "rudis.net")

# encoding

puny_encode(ascii_doms)

##  [1] "זגורי-אימפריה-לצפייה-ישירה.net"  "ثبت-دومین.com"                   "טיול-לפיליפינים.net"            
##  [4] "бизнес-тренер.com"               "новый-год.com"                   "東京ライブ-バルーンスタンド.com"
##  [7] "看護師高収入-求人.net"           "ユベラ-贅沢ポリフェノール.biz"   "เด็ก-ภูเก็ต.com"                      
## [10] "ایران-هاست.com"

# decoding

puny_decode(intnl_doms)

## [1] "xn----0mcgcx6kho30j.com"                   "xn----9hciecaaawbbp1b1cd.net"             
## [3] "xn----9sbmbaig5bd2adgo.com"                "xn----ctbeewwhe7i.com"                    
## [5] "xn----ieuycya4cyb1b7jwa4fc8h4718bnq8c.com" "xn----ny6a58fr8c8rtpsucir8k1bo62a.net"    
## [7] "xn----peurf0asz4dzaln0qm161er8pd.biz"      "xn----twfb7ei8dwjzbf9dg.com"              
## [9] "xn----ymcabp2br3mk93k.com"

# validation

puny_tld_check(for_valid)

## [1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

Fin

If you find any errors or need more functionality, post an issue on github and/or drop a note in the comments.

To leave a comment for the author, please follow the link and comment on his blog: Data Driven Security.


New online R tutorial by DataCamp: Intermediate R programming


(This article was first published on The DataCamp Blog » R, and kindly contributed to R-bloggers)

Today the course creation team at DataCamp released a new online R tutorial called Intermediate R. It is the sequel to our infamous Introduction to R course that has been taken by over 60,000 R enthusiasts. This new tutorial combines short videos with in-browser coding exercises to increase your R knowledge even more.

Start the new Intermediate R course for free, today.

What you’ll learn

The 6-hour Intermediate R tutorial is the logical next step on your journey through the R programming language. In five chapters, each with short videos and challenging coding exercises inside your browser, you learn how to write better and more advanced R code for your daily data analysis tasks. Get a taste of conditionals and control flow in R, discover how and when to make use of for and while loops, and familiarise yourself with writing your own functions in R.

laptop_interface

After mastering the above topics, the tutorial continues with the apply family. Functions like lapply, sapply and vapply might seem hard to understand in the beginning, but once you master them they will become part of your daily routine for sure. This tutorial will take you step-by-step through the different members of the apply family and you will learn how to use them efficiently and effectively.

Having a solid knowledge of a wide range of R functions will definitely give you a competitive edge. That's why the final part of the tutorial focuses less on programming concepts and more on which functions to use when facing common data science challenges. For example, learn how to make use of regular expressions in R to filter out certain observations, or discover how you can use R to analyze times and dates.

intermediate_R

For those of you interested in earning a certificate, we have some good news: once you have finished the tutorial, you will earn a statement of accomplishment that you can add directly to your LinkedIn profile.

Start the course!


The post New online R tutorial by DataCamp: Intermediate R programming appeared first on The DataCamp Blog .

To leave a comment for the author, please follow the link and comment on his blog: The DataCamp Blog » R.
