
Using Wikipediatrend

(This article was first published on Automated Data Collection with R Blog - rstats, and kindly contributed to R-bloggers)

What do Wikipedia's readers care about? Is Britney Spears more popular than Brittany? Is Asia Carrera more popular than Asia? How many people looked at the article on Santa Claus in December? How many looked at the article on Ron Paul?
What can you find?
Source: http://stats.grok.se/

The wikipediatrend package provides convenient access to daily page view counts (Wikipedia article traffic statistics) stored at http://stats.grok.se/ .

If you want to know how often an article has been viewed over time and work with that data from within R, this package is for you. If you want to compare how much attention articles in different languages received, and when, this package is for you. Are you into policy studies or epidemiology? Have a look at the page counts for Flu, Ebola, Climate Change or Millennium Development Goals and maybe build a model or two. Again, this package is for you.

If you simply want to browse Wikipedia page view statistics without all that coding, visit http://stats.grok.se/ and have a look around.

If anything smaller than big data is not an option, get the raw data in its entirety at http://dumps.wikimedia.org/other/pagecounts-raw/ .

If days are too crude a measure of time for you, seconds will do if need be, and you do not care which articles the views belong to, go to http://datahub.io/dataset/english-wikipedia-pageviews-by-second.

For further information on the data source (Who? When? How? How good?) there is a Wikipedia article for that: http://en.wikipedia.org/wiki/Wikipedia:Pageviewstatistics and another one: http://en.wikipedia.org/wiki/Wikipedia:Aboutpageviewstatistics .

Installation

stable CRAN version (http://cran.rstudio.com/web/packages/wikipediatrend/)

install.packages("wikipediatrend")

development version (https://github.com/petermeissner/wikipediatrend)

devtools::install_github("petermeissner/wikipediatrend")

... and load it via:

library(wikipediatrend)

A first try

The workhorse of the package is the wp_trend() function that allows you to get page view counts as neat data frames like this:

page_views <- wp_trend("main_page")

page_views
##    date       count    lang page      rank month  title    
## 3  2015-05-01 15195088 en   Main_page 2    201505 Main_page
## 2  2015-05-02 13800408 en   Main_page 2    201505 Main_page
## 1  2015-05-03 19462469 en   Main_page 2    201505 Main_page
## 7  2015-05-04 21295053 en   Main_page 2    201505 Main_page
## 6  2015-05-05 21338940 en   Main_page 2    201505 Main_page
## 5  2015-05-06 21198056 en   Main_page 2    201505 Main_page
## 4  2015-05-07 20128000 en   Main_page 2    201505 Main_page
## 11 2015-05-08 17191834 en   Main_page 2    201505 Main_page
## 10 2015-05-09 19560505 en   Main_page 2    201505 Main_page
## 26 2015-05-10 21168444 en   Main_page 2    201505 Main_page
## 27 2015-05-11 22101221 en   Main_page 2    201505 Main_page
## 28 2015-05-12 22320344 en   Main_page 2    201505 Main_page
## 29 2015-05-13 20420337 en   Main_page 2    201505 Main_page
## 22 2015-05-14 20174520 en   Main_page 2    201505 Main_page
## 23 2015-05-15 17176625 en   Main_page 2    201505 Main_page
## 24 2015-05-16 15845474 en   Main_page 2    201505 Main_page
## 25 2015-05-17 21462364 en   Main_page 2    201505 Main_page
## 20 2015-05-18 23386371 en   Main_page 2    201505 Main_page
## 21 2015-05-19 22999646 en   Main_page 2    201505 Main_page
## 9  2015-05-20 22486802 en   Main_page 2    201505 Main_page
## 8  2015-05-21 19693422 en   Main_page 2    201505 Main_page
## 19 2015-05-22 16096041 en   Main_page 2    201505 Main_page
## 13 2015-05-24 43322132 en   Main_page 2    201505 Main_page
## 12 2015-05-25 21908990 en   Main_page 2    201505 Main_page
## 15 2015-05-26 21954108 en   Main_page 2    201505 Main_page
## 14 2015-05-27 22918926 en   Main_page 2    201505 Main_page
## 16 2015-05-28 19572988 en   Main_page 2    201505 Main_page
## 31 2015-05-29 15049919 en   Main_page 2    201505 Main_page
## 30 2015-05-30 20611835 en   Main_page 2    201505 Main_page
## 
## ... 2 rows of data not shown

... that can easily be turned into a plot ...

library(ggplot2)

ggplot(page_views, aes(x=date, y=count)) + 
  geom_line(size=1.5, colour="steelblue") + 
  geom_smooth(method="loess", colour="#00000000", fill="#001090", alpha=0.1) +
  scale_y_continuous( breaks=seq(5e6, 50e6, 5e6) , 
  label= paste(seq(5,50,5),"M") ) +
  theme_bw()

wp_trend() options

wp_trend() has several options and most of them are set to defaults:

  • page
  • from = Sys.Date() - 30
  • to = Sys.Date()
  • lang = "en"
  • file = ""
  • ~~friendly~~ deprecated
  • ~~requestFrom~~ deprecated
  • ~~userAgent~~ deprecated

page

The page option allows you to specify one or more article titles for which data should be retrieved.

These titles should be in the same format as shown in the address bar of your browser to ensure that the pages are found. If we want to get page views for the United Nations Millennium Development Goals and the article is found at "http://en.wikipedia.org/wiki/Millennium_Development_Goals", the page title to pass to wp_trend() should be Millennium_Development_Goals, not Millennium Development Goals or Millenniumdevelopmentgoals or any other 'mostly-like-the-original' variation.

To ease data gathering, the page argument of wp_trend() accepts whole vectors of page titles and will retrieve the data for each of them in turn.

page_views <- 
  wp_trend( 
    page = c( "Millennium_Development_Goals", "Climate_Change") 
  )
library(ggplot2)

ggplot(page_views, aes(x=date, y=count, group=page, color=page)) + 
  geom_line(size=1.5) + theme_bw()

from and to

These two options determine the time frame for which data shall be retrieved. The defaults are set to gather the last 30 days but can be set to cover larger time frames as well. Note that there is no data prior to December 2007, so any earlier date will be set to this minimum.

page_views <- 
  wp_trend( 
    page = "Millennium_Development_Goals" ,
    from = "2000-01-01",
    to   = prev_month_end()
  )
library(ggplot2)

ggplot(page_views, aes(x=date, y=count, color=wp_year(date))) + 
  geom_line() + 
  stat_smooth(method = "lm", formula = y ~ poly(x, 22), color="#CD0000a0", size=1.2) +
  theme_bw() 

lang

This option determines which Wikipedia the page views are retrieved from: English, German, Chinese, Spanish, and so on. The default is "en" for the English Wikipedia. Either pass a single language shorthand, which is then used for all pages, or pass one corresponding shorthand per page.

page_views <- 
  wp_trend( 
    page = c("Objetivos_de_Desarrollo_del_Milenio", "Millennium_Development_Goals") ,
    lang = c("es", "en"),
    from = Sys.Date()-100
  )
library(ggplot2)

ggplot(page_views, aes(x=date, y=count, group=lang, color=lang, fill=lang)) + 
  geom_smooth(size=1.5) + 
  geom_point() +
  theme_bw() 

file

This last option allows storing the data retrieved by a call to wp_trend() in a file, e.g. file = "MyCache.csv". MyCache.csv will be created if it does not exist already, but it will never be overwritten by wp_trend(), so data from subsequent calls to wp_trend() accumulates in the file. To get the stored data back into R use wp_load(file = "MyCache.csv").

wp_trend("Cheese", file="cheeeeese.csv")
wp_trend("Käse", lang="de", file="cheeeeese.csv")

cheeeeeese <- wp_load( file="cheeeeese.csv" )
cheeeeeese
##    date       count lang page      rank month  title 
## 34 2015-05-01  275  de   K%C3%A4se 6057 201505 Käse  
## 32 2015-05-03  261  de   K%C3%A4se 6057 201505 Käse  
## 38 2015-05-04  297  de   K%C3%A4se 6057 201505 Käse  
## 36 2015-05-06  326  de   K%C3%A4se 6057 201505 Käse  
## 57 2015-05-10  239  de   K%C3%A4se 6057 201505 Käse  
## 58 2015-05-11  297  de   K%C3%A4se 6057 201505 Käse  
## 59 2015-05-12  356  de   K%C3%A4se 6057 201505 Käse  
## 60 2015-05-13  332  de   K%C3%A4se 6057 201505 Käse  
## 39 2015-05-21  366  de   K%C3%A4se 6057 201505 Käse  
## 50 2015-05-22  236  de   K%C3%A4se 6057 201505 Käse  
## 44 2015-05-24  450  de   K%C3%A4se 6057 201505 Käse  
## 43 2015-05-25  295  de   K%C3%A4se 6057 201505 Käse  
## 45 2015-05-27  308  de   K%C3%A4se 6057 201505 Käse  
## 47 2015-05-28  313  de   K%C3%A4se 6057 201505 Käse  
## 62 2015-05-29  301  de   K%C3%A4se 6057 201505 Käse  
## 1  2015-05-03 1642  en   Cheese    705  201505 Cheese
## 6  2015-05-05 2053  en   Cheese    705  201505 Cheese
## 4  2015-05-07 2983  en   Cheese    705  201505 Cheese
## 11 2015-05-08 2015  en   Cheese    705  201505 Cheese
## 10 2015-05-09 4963  en   Cheese    705  201505 Cheese
## 23 2015-05-15 1756  en   Cheese    705  201505 Cheese
## 25 2015-05-17 1421  en   Cheese    705  201505 Cheese
## 20 2015-05-18 1809  en   Cheese    705  201505 Cheese
## 9  2015-05-20 1947  en   Cheese    705  201505 Cheese
## 15 2015-05-26 1877  en   Cheese    705  201505 Cheese
## 14 2015-05-27 1917  en   Cheese    705  201505 Cheese
## 16 2015-05-28 2029  en   Cheese    705  201505 Cheese
## 30 2015-05-30 1413  en   Cheese    705  201505 Cheese
## 17 2015-05-31 1481  en   Cheese    705  201505 Cheese
## 
## ... 33 rows of data not shown

Caching

Session caching

When using wp_trend() you will notice that subsequent calls to the function might take considerably less time than earlier ones, provided that they cover data that has already been downloaded. This is due to the caching system running in the background, which keeps track of everything downloaded so far. You can tell whether wp_trend() had to download something because it reports one or more links to the stats.grok.se server, e.g. ...

wp_trend("Cheese")
## http://stats.grok.se/json/en/201505/Cheese
wp_trend("Cheese")

... but ...

wp_trend("Cheese", from = Sys.Date()-60)
## http://stats.grok.se/json/en/201504/Cheese

The current cache in memory can be accessed via:

wp_get_cache()
##      date       count    lang page             rank month 
## 2965 2014-12-22    19833 en   Islamic_Stat ... -1   201412
## 763  2008-04-23      473 en   Millennium_D ... 7435 200804
## 1207 2009-07-15      488 en   Millennium_D ... 7435 200907
## 1268 2009-09-17      899 en   Millennium_D ... 7435 200909
## 1469 2010-04-04      554 en   Millennium_D ... 7435 201004
## 1712 2010-12-18      829 en   Millennium_D ... 7435 201012
## 2021 2011-10-17     1977 en   Millennium_D ... 7435 201110
## 2054 2011-11-19     1142 en   Millennium_D ... 7435 201111
## 2146 2012-02-19     1375 en   Millennium_D ... 7435 201202
## 2314 2012-08-19     1010 en   Millennium_D ... 7435 201208
## 2333 2012-08-21     1481 en   Millennium_D ... 7435 201208
## 2345 2012-09-03     1702 en   Millennium_D ... 7435 201209
## 2678 2013-08-19     1950 en   Millennium_D ... 7435 201308
## 2778 2013-11-02     1423 en   Millennium_D ... 7435 201311
## 2813 2013-12-06     2263 en   Millennium_D ... 7435 201312
## 269  2014-06-29     1000 en   Millennium_D ... 7435 201406
## 448  2014-12-13      994 en   Millennium_D ... 7435 201412
## 6200 2015-05-29     1311 en   Millennium_D ... 7435 201505
## 4248 2010-02-10     4415 en   Syria            1802 201002
## 4232 2010-02-21     3856 en   Syria            1802 201002
## 4783 2011-08-16     7471 en   Syria            1802 201108
## 5066 2012-05-10     7631 en   Syria            1802 201205
## 5098 2012-06-11    10616 en   Syria            1802 201206
## 5430 2013-05-04     9450 en   Syria            1802 201305
## 5823 2014-06-04     5082 en   Syria            1802 201406
## 5847 2014-07-28     5104 en   Syria            1802 201407
## 5881 2014-08-04     5876 en   Syria            1802 201408
## 3970 2014-09-09     4769 ru   %D0%98%D1%81 ... -1   201409
## 4060 2014-12-18     2527 ru   %D0%98%D1%81 ... -1   201412
##      title           
## 2965 Islamic_Stat ...
## 763  Millennium_D ...
## 1207 Millennium_D ...
## 1268 Millennium_D ...
## 1469 Millennium_D ...
## 1712 Millennium_D ...
## 2021 Millennium_D ...
## 2054 Millennium_D ...
## 2146 Millennium_D ...
## 2314 Millennium_D ...
## 2333 Millennium_D ...
## 2345 Millennium_D ...
## 2678 Millennium_D ...
## 2778 Millennium_D ...
## 2813 Millennium_D ...
## 269  Millennium_D ...
## 448  Millennium_D ...
## 6200 Millennium_D ...
## 4248 Syria           
## 4232 Syria           
## 4783 Syria           
## 5066 Syria           
## 5098 Syria           
## 5430 Syria           
## 5823 Syria           
## 5847 Syria           
## 5881 Syria           
## 3970 <U+0418><U+0441><U+043B><U+0430><U+043C><U+0441><U+043A><U+043E><U+0435>_<U+0433><U+043E> ...
## 4060 <U+0418><U+0441><U+043B><U+0430><U+043C><U+0441><U+043A><U+043E><U+0435>_<U+0433><U+043E> ...
## 
## ... 6294 rows of data not shown

... and emptied by a call to wp_cache_reset().

Caching across sessions 1

While everything that is downloaded during a session is cached in memory, it can come in handy to also save the cache to disk so it can be reused in the next R session. To activate disk caching for a session simply use:

wp_set_cache_file( file = "myCache.csv" )

The function reloads whatever is stored in the file, and subsequent calls to wp_trend() will automatically add data to it as it is downloaded. The file used for disk caching can be changed by another call to wp_set_cache_file( file = "myOtherCache.csv" ), or disk caching can be turned off completely by leaving the file argument empty.

Caching across sessions 2

If disk caching should be enabled by default, one can define a path in the system/environment variable WP_CACHE_FILE. When the package is loaded it will look for this variable via Sys.getenv("WP_CACHE_FILE") and use the path for caching as if ...

wp_set_cache_file( Sys.getenv("WP_CACHE_FILE") )

... had been typed in by the user.
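
As a minimal sketch (the file path below is only a placeholder), the variable can be set from R before the package is loaded, or put into your .Renviron file:

Sys.setenv(WP_CACHE_FILE = "~/wikipediatrend_cache.csv")  # hypothetical path
library(wikipediatrend)  # picks up WP_CACHE_FILE via Sys.getenv() on load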

Counts for other languages

If comparing languages is important, one needs to specify the exact article title for each language: while the article about the Millennium Goals has an English title in the English Wikipedia, it is of course named differently in Spanish, German, Chinese, and so on. One might look these titles up by hand or use the handy wp_linked_pages() function like this:

titles <- wp_linked_pages("Islamic_State_of_Iraq_and_the_Levant", "en")
titles <- titles[titles$lang %in% c("en", "de", "es", "ar", "ru"),]
titles 
##   page             lang title           
## 1 Islamic_Stat ... en   Islamic_Stat ...
## 2 %D8%AF%D8%A7 ... ar   <U+062F><U+0627><U+0639><U+0634>
## 3 Islamischer_ ... de   Islamischer_ ...
## 4 Estado_Isl%C ... es   Estado_Islám ...
## 5 %D0%98%D1%81 ... ru   <U+0418><U+0441><U+043B><U+0430><U+043C><U+0441><U+043A><U+043E><U+0435>_<U+0433><U+043E> ...

... then we can use the information to get data for several languages ...

page_views <- 
  wp_trend(
    page = titles$page[1:5], 
    lang = titles$lang[1:5],
    from = "2014-08-01"
  )
library(ggplot2)

for(i in unique(page_views$lang) ){
  iffer <- page_views$lang==i
  page_views[iffer, ]$count <- scale(page_views[iffer, ]$count)
}

ggplot(page_views, aes(x=date, y=count, group=lang, color=lang)) + 
  geom_line(size=1.2, alpha=0.5) + 
  ylab("standardized countn(by lang: m=0, var=1)") +
  theme_bw() + 
  scale_colour_brewer(palette="Set1") + 
  guides(colour = guide_legend(override.aes = list(alpha = 1)))

Going beyond Wikipediatrend -- Anomalies and mean shifts

Identifying anomalies with AnomalyDetection

Currently the AnomalyDetection package is not available on CRAN, so we have to use install_github() from the devtools package (or the drat repository shown in the commented line below) to get it.

# install.packages( "AnomalyDetection", repos="http://ghrr.github.io/drat",  type="source")
library(AnomalyDetection)
library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)

The package is a little picky about the data it accepts for processing, so we have to build a new data frame. It should contain only the date and count variables. Furthermore, date should be renamed to timestamp and transformed to type POSIXct.

page_views <- wp_trend("Syria", from = "2010-01-01")
## http://stats.grok.se/json/en/201505/Syria
page_views_br <- 
  page_views  %>% 
  select(date, count)  %>% 
  rename(timestamp=date)  %>% 
  unclass()  %>% 
  as.data.frame() %>% 
  mutate(timestamp = as.POSIXct(timestamp))

Having transformed the data, we can detect anomalies via AnomalyDetectionTs(). The function offers various options, e.g. the significance level for rejecting normal values (alpha); the maximum fraction of the data that is allowed to be flagged as anomalies (max_anoms); whether upward deviations, downward deviations, or irregularities in both directions may form the basis of anomaly detection (direction); and, last but not least, whether the time frame for detection is longer than one month (longterm).

Let's choose a greedy set of parameters and detect possible anomalies:

res <- 
AnomalyDetectionTs(
  x         = page_views_br, 
  alpha     = 0.05, 
  max_anoms = 0.40,
  direction = "both",
  longterm  = T
)$anoms

res$timestamp <- as.Date(res$timestamp)

head(res)
##    timestamp anoms
## 1 2010-02-02  5567
## 2 2010-02-04  5191
## 3 2010-01-23     0
## 4 2010-02-03  4322
## 5 2010-02-01  3918
## 6 2010-02-08     0

... and play back the detected anomalies to our page_views data set:

page_views <- 
  page_views  %>% 
  mutate(normal = !(page_views$date %in% res$timestamp))  %>% 
  mutate(anom   =   page_views$date %in% res$timestamp )

class(page_views) <- c("wp_df", "data.frame")

Now we can plot counts and anomalies ...

(
  p <-
    ggplot( data=page_views, aes(x=date, y=count) ) + 
      geom_line(color="steelblue") +
      geom_point(data=filter(page_views, anom==T), color="red2", size=2) +
      theme_bw()
)

... as well as compare running means:

p + 
  geom_line(stat = "smooth", size=2, color="red2", alpha=0.7) + 
  geom_line(data=filter(page_views, anom==F), 
  stat = "smooth", size=2, color="dodgerblue4", alpha=0.5) 

It seems that upward and downward anomalies cancel each other out most of the time, since the two smooth lines (with and without anomalies) do not differ much. Nonetheless, keeping the anomalies in would bias the counts slightly upward, so we proceed with a cleaned-up data set:

page_views_clean <- 
  page_views  %>% 
  filter(anom==F)  %>% 
  select(date, count, lang, page, rank, month, title)

page_views_br_clean <- 
  page_views_br  %>% 
  filter(page_views$anom==F)

Identifying mean shifts with BreakoutDetection

BreakoutDetection is a package for finding mean level shifts in data, in the presence of seasonal noise, by dividing it into timespans of change and timespans of stability. Like AnomalyDetection, the BreakoutDetection package is not available on CRAN but has to be obtained from GitHub.

# install.packages(  "BreakoutDetection",   repos="http://ghrr.github.io/drat", type="source")
library(BreakoutDetection)
library(dplyr)
library(ggplot2)
library(magrittr)

... again the workhorse function (breakout()) is picky and requires "a data.frame which has 'timestamp' and 'count' components" like our page_views_br_clean.

The function has two general options: one tweaks the minimum length of a timespan (min.size), the other determines how many mean level changes may occur over the whole time frame (method). There are also several method-specific options, e.g. degree, beta, and percent, which control how readily further breakpoints are added. In the following call the last option tells the function that adding a breakpoint must improve overall model fit by at least 5 percent.

br <- 
  breakout(
    page_views_br_clean, 
    min.size = 30, 
    method   = 'multi', 
    percent  = 0.05,
    plot     = TRUE
  )
br
## $loc
##  [1]  53 105 137 174 263 306 389 426 458 488 518 566 601 640 670 751 784
## 
## $time
## [1] 1.19
## 
## $pval
## [1] NA
## 
## $plot

In the following snippet we combine the break information with our page views data and can have a look at the dates at which the breaks occurred.

breaks <- page_views_clean[br$loc,]
breaks
##           date count lang  page rank  month title
## 53  2010-02-13  3327   en Syria 1802 201002 Syria
## 105 2010-04-08  5210   en Syria 1802 201004 Syria
## 137 2010-06-03  5176   en Syria 1802 201006 Syria
## 174 2010-07-14  3874   en Syria 1802 201007 Syria
## 263 2010-11-22  6090   en Syria 1802 201011 Syria
## 306 2010-12-05  6182   en Syria 1802 201012 Syria
## 389 2011-04-15  6113   en Syria 1802 201104 Syria
## 426 2011-05-12  8217   en Syria 1802 201105 Syria
## 458 2011-06-07  9442   en Syria 1802 201106 Syria
## 488 2011-07-04  4745   en Syria 1802 201107 Syria
## 518 2011-08-07  7506   en Syria 1802 201108 Syria
## 566 2011-10-22  7449   en Syria 1802 201110 Syria
## 601 2011-11-27  8492   en Syria 1802 201111 Syria
## 640 2012-01-31 11496   en Syria 1802 201201 Syria
## 670 2012-02-22 23197   en Syria 1802 201202 Syria
## 751 2012-06-04  9461   en Syria 1802 201206 Syria
## 784 2012-07-27 17383   en Syria 1802 201207 Syria

Next, we add a span variable capturing which page_view observations belong to which span, allowing us to aggregate data.

page_views_clean$span <- 0
for (d in breaks$date ) {
  page_views_clean$span[ page_views_clean$date > d ] %<>% add(1)
}

page_views_clean$mcount <- 0
for (s in unique(page_views_clean$span) ) {
  iffer <- page_views_clean$span == s
  page_views_clean$mcount[ iffer ] <- mean(page_views_clean$count[iffer])
}

spans <- 
  page_views_clean  %>% 
  as_data_frame() %>% 
  group_by(span) %>% 
  summarize(
    start      = min(date), 
    end        = max(date), 
    length     = end-start,
    mean_count = round(mean(count)),
    min_count  = min(count),
    max_count  = max(count),
    var_count  = var(count)
  )
spans
## Source: local data frame [18 x 8]
## 
##    span      start        end length mean_count min_count max_count
## 1     0 2010-01-01 2010-02-13     43       3662         0      4734
## 2     1 2010-02-14 2010-04-08     53       4768         0      8179
## 3     2 2010-04-10 2010-06-03     54       4900      3849      5741
## 4     3 2010-06-04 2010-07-14     40       3454         0      5270
## 5     4 2010-07-20 2010-11-22    125       5760      4172      8711
## 6     5 2010-11-24 2010-12-05     11       5488      4752      6182
## 7     6 2010-12-06 2011-04-15    130       7158      3725     22825
## 8     7 2011-04-16 2011-05-12     26      10075      6713     19661
## 9     8 2011-05-13 2011-06-07     25       6374      3672      9442
## 10    9 2011-06-08 2011-07-04     26       7135      4194     11989
## 11   10 2011-07-05 2011-08-07     33       4940      3395      9729
## 12   11 2011-08-08 2011-10-22     75       5729      3510     12574
## 13   12 2011-10-24 2011-11-27     34       8060      5195     13217
## 14   13 2011-11-28 2012-01-31     64       6479         0     11496
## 15   14 2012-02-01 2012-02-22     21      18015      7005     36378
## 16   15 2012-02-23 2012-06-04    102       9042         0     24728
## 17   16 2012-06-05 2012-07-27     52      12042      6464     25414
## 18   17 2012-07-28 2015-05-31   1037       7287         0    111331
## Variables not shown: var_count (dbl)

Also, we can now plot the shifting mean.

ggplot(page_views_clean, aes(x=date, y=count) ) + 
  geom_line(alpha=0.5, color="steelblue") + 
  geom_line(aes(y=mcount), alpha=0.5, color="red2", size=1.2) + 
  theme_bw()

17 new R jobs (2015-06-08) – from @NewYork to @London

This is the bimonthly post (for 2015-06-08) for new R Jobs from R-users.com.

Employers: visit this link to post a new R job to the R community (it’s free and quick).

Job seekers: please follow the links below to learn more and apply for your job of interest (or visit previous R jobs posts).

 

 

  1. Freelance
    Data scientist (3-month mission @France)
    BCA Expertise – Posted by BCA_Expertise
    La Celle-Sous-Gouzon, Limousin, France
    8 Jun 2015
  2. Full-Time
    Capital Markets Data Scientist (for Freddie Mac @NewYork over $100k/year)
    Freddie Mac – Posted by lgodfrey
    New York, New York, United States
    5 Jun 2015
  3. Full-Time
    Data Scientist for wildlife conservation in Africa and Asia (@Nairobi)
    CITES Secretariat — Monitoring the Illegal Killing of Elephants (MIKE) – Posted by Julian Blanc
    Nairobi, Nairobi, Kenya
    4 Jun 2015
  4. Full-Time
    Data Analyst for Advantage Waypoint (@USA)
    Advantage Waypoint – Posted by matthewbward
    Coffeyville, Kansas, United States
    3 Jun 2015
  5. Full-Time
    Senior Analyst Data & Analytics (@Alberta)
    WestJet – Posted by lmarra@westjet.com
    Alberta, Canada
    3 Jun 2015
  6. Freelance
    R Programmer/Modeler (@California)
    Cramer Fish Sciences – Posted by CFS_California
    Sanger, California, United States
    3 Jun 2015
  7. Full-Time
    Web Analyst (@London)
    uSwitch – Posted by uSwitch
    London, England, United Kingdom
    3 Jun 2015
  8. Full-Time
    SAS Analyst (with some bonus skills, such as R) @Virginia
    United Network for Organ Sharing (UNOS) – Posted by rkhockma
    Anywhere
    2 Jun 2015
  9. Freelance
    Developer – R / Mahout / Matlab (@London)
    Damia Group – Posted by Chris Bardoe
    Anywhere
    1 Jun 2015
  10. Full-Time
    STATISTICIAN RD (@London)
    zest – Posted by envazen
    London, England, United Kingdom
    31 May 2015
  11. Full-Time
    Data Scientist (@Washington)
    Booz Allen Hamilton – Posted by boozallen
    Washington, District of Columbia, United States
    30 May 2015
  12. Full-Time
    Data Analyst / Healthcare Statistician (@London)
    Loc8tor – Posted by arichards
    Borehamwood, England, United Kingdom
    29 May 2015
  13. Full-Time
    Bioinformatician for IMB (@Mainz @Germany)
    Institute of Molecular Biology – Posted by IMB_Mainz
    Mainz, Rheinland-Pfalz, Germany
    29 May 2015
  14. Full-Time
    Statistician for The Food and Agriculture Organization of the UN (@Rome)
    Food and Agriculture Organization of the United Nations – Posted by joshua.browning
    Roma, Lazio, Italy
    26 May 2015
  15. Part-Time
    R/Full Stack Developer (@London – over 100K/year)
    oinfinite
    London, England, United Kingdom
    26 May 2015
  16. Full-Time
    Statistician in R&D
    melissa44
    Singen (Hohentwiel), Baden-Württemberg, Germany
    25 May 2015
  17. Full-Time
    R + Javascript Full Stack Developer (@Boston)
    Boston Children’s Hospital – Posted by katiekerr7
    Boston, Massachusetts, United States
    25 May 2015

 

 


Notebooks, knitr and the Language-Markdown View Source Option…

(This article was first published on OUseful.Info, the blog... » Rstats, and kindly contributed to R-bloggers)

One of the foundational principles of the web, though I suspect ever fewer people know it, is that you can “View Source” on a web page to see what bits of HTML, Javascript and CSS are used to create it.

In the WordPress editor I’m currently writing in, I’m using a Text view that lets me write vanilla HTML; but there is also a WYSIWYG (what you see is what you get) view that shows how the interpreted HTML text will look when it is rendered in the browser as a web page.

[Image: viewtext]

Reflecting on IPython Markdown Opportunities in IPython Notebooks and Rstudio, it struck me that the Rmd (Rmarkdown) view used in RStudio, the HTML preview of “executed” Rmd documents generated from Rmd by knitr and the interactive Jupyter (IPython, as was) notebook view can be seen as standing in this sort of relation to each other:

[Image: rmd-wysiwyg]

From that, it’s not too hard to imagine RStudio offering the following sort of RStudio/IPython notebook hybrid interface – with an Rmd “text” view, and with a notebook “visual” view (e.g. via an R notebook kernel):

[Image: viewrmd]

And from both, we can generate the static HTML preview view.

In terms of underlying machinery, I guess we could have something like this:

[Image: rmdviewarch]

I’m looking forward to it :-)


R in a 64 bit world

(This article was first published on Win-Vector Blog » R, and kindly contributed to R-bloggers)

32 bit data structures (pointers, integer representations, single precision floating point) have been past their “best before date” for quite some time. R itself moved to a 64 bit memory model some time ago, but still has only 32 bit integers. This is going to get more and more awkward going forward. What is R doing to work around this limitation?


We discuss this in the present article, the first of a new series discussing aspects of “R as it is.”

Currently R’s only integer data type is a 32 bit signed integer. Such integers can only count up to about 2 billion, a range that is in fact ridiculously small and unworkable for plenty of modern data.
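
For example, the ceiling is easy to see right at the R prompt (integer overflow yields NA with a warning):

.Machine$integer.max
## [1] 2147483647
.Machine$integer.max + 1L
## [1] NA
## Warning message:
## In .Machine$integer.max + 1L : NAs produced by integer overflow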

What can we do about this in R?

Note we are not talking about big-data issues (which are addressed in R with tools like ff, data.table, dplyr, databases, streaming algorithms, Hadoop, see Revolutions big data articles for some tools and ideas). Fortunately, when working with data you really want to do things like aggregate, select, partition, and compute derived columns. With the right primitives you can (and should) perform all of these efficiently without ever referring to explicit row indices.

What we are talking about is the possible embarrassment of:

  • Not being able to represent 64 bit IDs as anything other than strings.
  • Not being able to represent a 65536 by 65536 matrix in R (because R matrices are actually views over single vectors).
  • Not being able to index into 3 billion doubles (or about $300 worth of memory).

Now R actually dodged these bullets, but without introducing a proper 64 bit integer type. Let’s discuss how.

First, a lot of people think R has a 64 bit integer type. This is because R’s notation for integer constants, “L”, looks like Java’s notation for longs (64 bit), but it probably derives from C’s notation for long (“at least 32 bits”). But feast your eyes on the following R code:

 

3000000000L
## [1] 3e+09
Warning message:
non-integer value 3000000000L qualified with L; using numeric value 

 

Yes, “L” only means integer in R.

What the R designers did to delay paying the price of having only 32 bit integers was to allow doubles to be used as array indices (and as the return value for length())! Take a look at the following R code:

 

c(1,2,3)[1.7]
## [1] 1

 

It looks like this was one of the big changes in moving from R 2.15.3 to R 3.0.1 in 2013 (see here). However, it feels dirty. In a more perfect world the above code would throw an error. This puts R in league with languages that force everything to be represented in way too few base types (JavaScript, Tcl, and Perl). IEEE 754 doubles have a 53 bit mantissa (separate from the sign and exponent), so with a proper floating point implementation we expect a double to faithfully represent an integer range of -2^53 through 2^53. But only as long as you don’t accidentally convert to, or round-trip through, a string/character type.
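
A quick check of that 2^53 boundary (standard IEEE 754 double behavior):

2^53 == 2^53 + 1
## [1] TRUE
2^52 == 2^52 + 1
## [1] FALSE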

One of the issues is that underlying C and Fortran code (often used to implement R packages/libraries) are not going to be able to easily use longs as indices. However, I still would much prefer the introduction of a proper 64 bit integer type.

Of course Java is in a much worse position going forward than R. Because of Java’s static type signatures, any class that implements the Collection interface is stuck with “int size()” pretty much forever (this includes Array, Vector, List, Set, and many more). In much better shape is Python, which has been working on unifying ints and longs since 2001 (PEP 237) and uses only 64 bit integers in Python 3 (just a matter of moving people from Python 2 to Python 3).

Enough about sizes and indexing; let’s talk a bit about representation. What should we do if we try to import data and we know one of the columns is 64 bit integers (assuming we are lucky enough to detect this and the column doesn’t get converted in a non-reversible way to doubles)?

R has always been a bit “loosey-goosey” with ints. That is why you see weird stuff in summary:

 

summary(55555L)
##   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  55560   55560   55560   55560   55560   55560 

 

Where we are told that the integer 55555 is in the range 55560 to 55560 (in hindsight we can see R is printing as if the data were floating point and then, adding insult to injury, it is not signaling its use of four significant figures by having the decency to switch into scientific notation). This is also why I don’t really trust R numerics to reliably represent integers like 2^50: some function accidentally round-trips your value into a string representation and back (such as reading/writing data to a CSV/TSV table) and you may not get back the value you started with (for the worst possible reason: you never wrote the correct value out).
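
Here is a sketch of that round-trip hazard (as.character() keeps roughly 15 significant digits for doubles, which is not enough for the full 2^53 range; exact results may vary slightly by platform):

x <- 2^50 + 3
y <- as.numeric(as.character(x))  # a naive write/read round trip through text
x == y
## [1] FALSE
x - y
## [1] -3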

In fact it becomes a bit of a bother to even check if a floating point number represents an integer. Some standard advice is “check if floor(a)==ceiling(a)”, which works until it fails:

 

a <- 2^50
b <- 2^50 + 0.5
c <- 5^25+1.5
for(v in c(a,b,c)) {
  print(floor(v)==ceiling(v))
}
## [1] TRUE
## [1] FALSE
## [1] TRUE

 

What went wrong for “c” is that “c” is an integer; it just isn’t the number we typed in (due to the use of floating point). The true value is seen as follows:

 

# Let's try the "doubles are good enough" path
# 5^25 < 2^60, so would fit in a 64 bit integer
format(5L^25L,digits=20)
## [1] "298023223876953152"
# doesn't even end in a 5 as powers of 5 should
format(5^25+1.5,digits=20)
## [1] "298023223876953152"
# and can't see our attempt at addition

# let's get the right answer
library('Rmpfr')
mpfr(5,precBits=200)^25
## [1] 298023223876953125
# ends in 5 as we expect
mpfr(5,precBits=200)^25 + 1.5
# obviously not an integer!
## [1] 298023223876953126.5

 

Something like the following is probably pretty close to the correct test function:

 

is.int <- function(v) {
  is.numeric(v) & 
    v>-2^53 & v<2^53 & 
    (floor(v)==ceiling(v))
}

 

But that (or whatever IEEE math library function actually does this) is hard to feel very good about. The point is we should not have to study What Every Computer Scientist Should Know About Floating-Point Arithmetic when merely trying to index into arrays. However, every data scientist should read that paper to understand some of the issues of general numeric representation and manipulation!

What are our faithful representation options in R?

Our advice is to first try representing 64 bit integers as strings. For functions like read.table() this means setting as.is to TRUE for the appropriate columns, and not converting a column back to string after it has already been damaged by the reader.
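
As a hedged sketch of that advice (the file name and column names below are made up), one way to keep 64 bit IDs intact is to read everything as character and convert only the columns that are safe to convert:

d <- read.table("loans.tsv", header = TRUE, sep = "\t",
                colClasses = "character")
d$balance <- as.numeric(d$balance)  # an ordinary numeric column: safe to convert
head(d$loan_id)                     # the 64 bit IDs stay as strings, undamaged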

And this is our first taste of “R as it is.”

(Thank you to Joseph Rickert and Nina Zumel for helpful comments on earlier drafts of this article.)

The curl package: a modern R interface to libcurl

(This article was first published on OpenCPU, and kindly contributed to R-bloggers)

TL;DR: Check out the vignette or the development version of httr.

[Image: OpenCPU logo]

The package I put the most time and effort into this year is curl. Last week version 0.8 was published on CRAN, which fixes the last outstanding bug for Solaris. The package is pretty much done at this point: stable, well tested, and does everything it needs to; nothing more, nothing less…

From the description:

The curl() and curl_download() functions provide highly configurable drop-in replacements for base url() and download.file() with better performance, support for encryption (https://, ftps://), ‘gzip’ compression, authentication, and other ‘libcurl’ goodies. The core of the package implements a framework for performing fully customized requests where data can be processed either in memory, on disk, or streaming via the callback or connection interfaces.
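
For a quick flavor of those two functions, here is a minimal sketch (the URL is only a placeholder):

library(curl)

# connection interface: read a remote resource, https included
con <- curl("https://httpbin.org/get")
readLines(con)

# drop-in replacement for download.file()
tmp <- tempfile(fileext = ".json")
curl_download("https://httpbin.org/get", tmp)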

The initial motivation for the package was to implement a connection interface with SSL (https) support, something R has always been lacking (see also json streaming). But since then the package has matured into a full featured HTTP client. By now it has become exactly what I promised it would not be: a complete replacement of RCurl.

What about RCurl?

Good question. The RCurl package by all-star R-core member Duncan Temple-Lang is one of the most widely used R packages. The first CRAN release was about 11 years ago and it has since then been the standard networking client for R. The paper shows that Duncan was (as with most of his work) ahead of his time, describing tools and technology that are now part of the standard data-science workflow.

The RCurl package was also the basis of Hadley’s popular httr package, which started to reveal some shortcomings, including memory leaks, build problems, performance regressions and mysterious errors. Now a bug or two we can fix, but from the RCurl source code it becomes obvious that a lot has changed over the past 10 years. Both R and libcurl have matured a lot, and the internet has largely converged to (REST style) HTTP and SSL, with other protocols slowly being phased out. Also Duncan is a busy guy and seems to have largely moved on to other projects. And so we are going to need a rewrite from scratch…

The curl package is inspired by the good parts of RCurl but with an implementation that takes advantage of modern features in R, such as the connection interface and external pointers with proper finalizers. This allows for a much simpler interface to libcurl, with better performance, streaming support, and handles that automatically clean up after themselves. Moreover, curl is deliberately very minimal and only contains the essential foundations for interacting with libcurl. High-level logic and utilities can be provided by other packages that build on curl, such as httr. The result is a small, clean and powerful package that takes 2 seconds to compile and will hopefully prove to be reliable and low maintenance.

Getting started with curl and httr

The best introduction to the curl package is the vignette which has some nice examples to get you started. Moreover the development version of httr has already been migrated from RCurl to curl. To install using devtools:

library(devtools)
install_github("hadley/httr")

Note that devtools itself depends on httr so you might need to restart R after updating httr. If you are seeing some ERROR: loading failed error (especially on Windows) just restart R and try again.

analyze the trends in international mathematics and science study (timss) with r

(This article was first published on asdfree by anthony damico, and kindly contributed to R-bloggers)
an underexplored chest of international quantitative aptitude testing, the trends in international mathematics and science study (timss) would make any pearson stockholder smile.  examining the mathematical mastery of more than 600,000 students from over sixty countries, this survey would illuminate an aggressive analyst (that's you) about any and everything you'd ever want to know about how your country's grade schoolers compare to the mathemagicians in finland and south korea.  created in amsterdam by the international association for the evaluation of educational achievement (iea) and administered in boston by boston college (bc) alongside its bookwormish little brother pirls, this microdata has everything you could possibly want to know about the learning and retention of fourth and eighth graders in algebra and chemistry class.  this new github repository contains three scripts:

download import and design.R
  • loop through and download every available extract onto your local disk
  • convert and import each individual country-level data set into an r-readable format bamn
  • construct replicate-weighted survey designs equivalent to the unfathomably inefficient sas, spss, and the horrific iea idb analyzer provided by the otherwise delightful data administrators

analysis examples.R
  • run the well-documented block of code reviewing most of the syntax configurations you'll need for the lion's share of your research

replication.R


click here to view these three scripts



for more detail about the trends in international mathematics and science study (timss), visit:


notes:

before analyzing your first record of microdata, confirm you don't actually want to invest your energies on the programme for international student assessment (pisa).

r users have published this toolkit specifically for timss, pirls, pisa, and piaac, but i am skeptical that learning a framework separate from the survey package is worth your time if you ever wish to analyze surveys other than this narrow set of four.  these surveys each have plausible value variables which are computationally equivalent to any other multiply-imputed item.  since the survey package smartly collaborates with mitools, just use the system that you already know and be done with it.  but if you don't know either survey or intsvy, decide based on this:  intsvy works on four data sets, the survey package works on all of the microdata listed here, notably including those intsvy four.  my example syntax uses the more broadly applicable set of tools, but that doesn't mean there isn't anything to learn from sniffing around the intsvy documentation.


confidential to sas, spss, stata, and sudaan users: you just jumped out of an airplane with a bungee cord instead of a parachute.  time to trade in your equipment.  time to transition to r.  :D

Using system and web fonts in R plots

(This article was first published on mages' blog, and kindly contributed to R-bloggers)
The forthcoming R Journal has an interesting article on the showtext package by Yixuan Qiu. The package allows me to use system and web fonts directly in R plots, reminding me a little of the approach taken by XeLaTeX. But "unlike other methods to embed fonts into graphics, showtext converts text into raster images or polygons, and then adds them to the plot canvas. This method produces platform-independent image files that do not rely on the fonts that create them." [1]

Here is an example with fonts from my local system:


library(showtext)
png("System-Fonts.png", width=550, height=350);
par(mfrow=c(2,2))
plot(1 ~ 1, main="Lucida Bright", family = "Lucida Bright")
plot(1 ~ 1, main="Courier", family = "Courier")
plot(1 ~ 1, main="Helvetica Neue Light", family = "Helvetica Neue Light")
plot(1 ~ 1, main="Lucida Handwriting Italic", family = "Lucida Handwriting Italic")
dev.off()
Additionally showtext allows me to use fonts hosted online, e.g. Google web fonts:

font.add.google("Alegreya Sans", "aleg");
font.add.google("Permanent Marker", "marker")
font.add.google("Gruppo", "gruppo")
font.add.google("Lobster", "lobster")
png("Google-Fonts.png", width=550, height=350)
showtext.begin()
par(mfrow=c(2,2))
plot(1 ~ 1, main="Alegreya Sans", family = "aleg")
plot(1 ~ 1, main="Permanent Marker", family = "marker")
plot(1 ~ 1, main="Gruppo", family = "gruppo")
plot(1 ~ 1, main="Lobster", family = "lobster")
showtext.end()
dev.off()

For more information read the article and/or visit the project site.

References

[1] Yixuan Qiu. showtext: Using System Fonts in R Graphics. The R Journal, 7(1), 2015.

Session Info

R version 3.2.0 (2015-04-16)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.3 (Yosemite)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] showtext_0.4-2 sysfonts_0.5

loaded via a namespace (and not attached):
[1] RCurl_1.95-4.6 showtextdb_1.0 jsonlite_0.9.16 bitops_1.0-6

Mortgages Are About Math: Open-Source Loan-Level Analysis of Fannie and Freddie

(This article was first published on Category: R | Todd W. Schneider, and kindly contributed to R-bloggers)
[M]ortgages were acknowledged to be the most mathematically complex securities in the marketplace. The complexity arose entirely out of the option the homeowner has to prepay his loan; it was poetic that the single financial complexity contributed to the marketplace by the common man was the Gordian knot giving the best brains on Wall Street a run for their money. Ranieri’s instincts that had led him to build an enormous research department had been right: Mortgages were about math.

The money was made, therefore, with ever more refined tools of analysis.

—Michael Lewis, Liar's Poker (1989)

Fannie Mae and Freddie Mac began reporting loan-level credit performance data in 2013 at the direction of their regulator, the Federal Housing Finance Agency. The stated purpose of releasing the data was to “increase transparency, which helps investors build more accurate credit performance models in support of potential risk-sharing initiatives.”

The so-called government-sponsored enterprises went through a nearly $200 billion government bailout during the financial crisis, motivated in large part by losses on loans that they guaranteed, so I figured there must be something interesting in the loan-level data. I decided to dig in with some geographic analysis, an attempt to identify the loan-level characteristics most predictive of default rates, and more. As part of my efforts, I wrote code to transform the raw data into a more useful PostgreSQL database format, and some R scripts for analysis. The code for processing and analyzing the data is all available on GitHub.

Default rate by month

At the time of Chairman Bernanke’s statement, it really did seem like agency loans were unaffected by the problems observed in subprime loans, which were expected to have higher default rates. About a year later, defaults on Fannie and Freddie loans increased dramatically, and the government was forced to bail out both companies to the tune of nearly $200 billion.

The “medium data” revolution

It should not be overlooked that in the not-so-distant past, i.e. when I worked as a mortgage analyst, an analysis of loan-level mortgage data would have cost a lot of money. Between licensing data and paying for expensive computers to analyze it, you could have easily incurred costs north of a million dollars per year. Today, in addition to Fannie and Freddie making their data freely available, we’re in the midst of what I might call the “medium data” revolution: personal computers are so powerful that my MacBook Air is capable of analyzing the entire 215 GB of data, representing some 38 million loans, 1.6 billion observations, and over $7.1 trillion of origination volume. Furthermore, I did everything with free, open-source software. I chose PostgreSQL and R, but there are plenty of other free options you could choose for storage and analysis.

Both agencies released data for 30-year, fully amortizing, fixed-rate mortgages, which are considered standard in the U.S. mortgage market. Each loan has some static characteristics which never change for the life of the loan, e.g. geographic information, the amount of the loan, and a few dozen others. Each loan also has a series of monthly observations, with values that can change from one month to the next, e.g. the loan’s balance, its delinquency status, and whether it prepaid in full.

The PostgreSQL schema then is split into 2 main tables, called loans and monthly_observations. Beyond the data provided by Fannie and Freddie, I also found it helpful to pull in some external data sources, most notably the FHFA’s home price indexes and Freddie Mac’s mortgage rate survey data.

A fuller glossary of the data is available in an appendix at the bottom of this post.

What can we learn from the loan-level data?

I started by calculating simple cumulative default rates for each origination year, defining a “defaulted” loan as one that became at least 60 days delinquent at some point in its life. Note that not all 60+ day delinquent loans actually turn into foreclosures where the borrower has to leave the house, but missing at least 2 payments typically indicates a serious level of distress.

Loans originated from 2005-2008 performed dramatically worse than loans that came before them! That should be an extraordinarily unsurprising statement to anyone who was even slightly aware of the U.S. mortgage crisis that began in 2007:

Cumulative default rates by vintage

About 4% of loans originated from 1999 to 2003 became seriously delinquent at some point in their lives. The 2004 vintage showed some performance deterioration, and then the vintages from 2005 through 2008 show significantly worse performance: more than 15% of all loans originated in those years became distressed.

From 2009 through present, the performance has been much better, with fewer than 2% of loans defaulting. Of course part of that is that it takes time for a loan to default, so the most recent vintages will tend to have lower cumulative default rates while their loans are still young. But as we’ll see later, there was also a dramatic shift in lending standards so that the loans made since 2009 have been much higher credit quality.

Geographic performance

Default rates increased everywhere during the bubble years, but some states fared far worse than others. I took every loan originated between 2005 and 2007, broadly considered to be the height of reckless mortgage lending, bucketed loans by state, and calculated the cumulative default rate of loans in each state. Mouse over the map to see individual state data:

4 states in particular jump out as the worst performers: California, Florida, Arizona, and Nevada. Just about every state experienced significantly higher than normal default rates during the mortgage crisis, but these 4 states, often labeled the “sand states”, experienced the worst of it.

I also used the data to make more specific maps at the county-level; default rates within different metropolitan areas can show quite a bit of variation. California jumps out as having the most interesting map: the highest default rates in California came from inland counties, most notably in the Central Valley and Inland Empire regions. These exurban areas, like Stockton, Modesto, and Riverside, experienced the largest increases in home prices leading up to the crisis, and subsequently the largest collapses.

The map clearly shows the central parts of California with the highest default rates, and the coastal parts with generally better default rates:

The major California metropolitan areas with the highest default rates were:

  1. Modesto – 40%
  2. Stockton – 37%
  3. Riverside-San Bernardino-Ontario (Inland Empire) – 33%

And the major metropolitan areas with the lowest default rates:

  1. San Francisco – 4.3%
  2. San Jose – 7.6%
  3. Santa Ana-Anaheim-Irvine (Orange County) – 11%

It’s less than 100 miles from San Francisco to Modesto and Stockton, and only 35 miles from Anaheim to Riverside, yet we see such dramatically different default rates between the inland regions and their relatively more affluent coastal counterparts.

The inland cities, with more land available to allow expansion, experienced the most overbuilding, the most aggressive lenders, the highest levels of speculators looking to get rich quick by flipping houses, and so perhaps it’s not that surprising that when the housing market turned south, they also experienced the highest default rates. Not coincidentally, California has also led the nation in “housing bubble” searches on Google Trends every year since 2004.

The county-level map of Florida does not show as much variation as the California map:

Although the regions in the panhandle had somewhat lower default rates than central and south Florida, there were also significantly fewer loans originated in the panhandle. The Tampa, Orlando, and Miami/Fort Lauderdale/West Palm Beach metropolitan areas made up the bulk of Florida mortgage originations, and all had very high default rates. The worst performing metropolitan areas in Florida were:

  1. Miami – 40%
  2. Port St. Lucie – 39%
  3. Cape Coral/Fort Myers – 38%

Arizona and Nevada have very few counties, so their maps don’t look very interesting, and each state is dominated by a single metropolitan area: Phoenix experienced a 31% cumulative default rate, and Las Vegas a 42% cumulative default rate.

Modeling mortgage defaults

The dataset includes lots of variables for each individual loan beyond geographic location, and many of these variables seem like they should correlate to mortgage performance. Perhaps most obviously, credit scores were developed specifically for the purpose of assessing default risk, so it would be awfully surprising if credit scores weren’t correlated to default rates.

Some of the additional variables include the amount of the loan, the interest rate, the loan-to-value ratio (LTV), debt-to-income ratio (DTI), the purpose of the loan (purchase, refinance), the type of property, and whether the loan was originated directly by a lender or by a third party. All of these things seem like they might have some predictive value for modeling default rates.

We can also combine loan data with other data sources to calculate additional variables. In particular, we can use the FHFA’s home price data to calculate current loan-to-value ratios for every loan in the dataset. For example, say a loan started at an 80 LTV, but the home’s value has since declined by 25%. If the balance on the loan has remained unchanged, then the new current LTV would be 0.8 / (1 – 0.25) = 106.7. An LTV over 100 means the borrower is “underwater” — the value of the house is now less than the amount owed on the loan. If the borrower does not believe that home prices will recover for a long time, the borrower might rationally decide to “walk away” from the loan.
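
That calculation is easy to express directly; here is a small sketch of the arithmetic described above (not code from the accompanying repository):

current_ltv <- function(original_ltv, home_price_change) {
  # original_ltv in percent; home_price_change as a fraction, e.g. -0.25 for a
  # 25% decline; the loan balance is assumed unchanged
  original_ltv / (1 + home_price_change)
}
current_ltv(80, -0.25)
## [1] 106.6667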

Another calculated variable is called spread at origination (SATO), which is the difference between the loan’s interest rate, and the prevailing market rate at the time of origination. Typically borrowers with weaker credit get higher rates, so we’d expect a larger value of SATO to correlate to higher default rates.

Even before formulating any specific model, I find it helpful to look at graphs of aggregated data. I took every monthly observation from 2009-11, bucketed along several dimensions, and calculated default rates. Note that we’re now looking at transition rates from current to defaulted, as opposed to the cumulative default rates in the previous section. Transition rates are a more natural quantity to model, since when we make future projections we have to predict not only how many loans will default, but when they’ll default.

Here are graphs of annualized default rates as a function of credit score and current LTV:

Default rate by FICO

The above graph shows FICO credit score on the x-axis, and annualized default rate on the y-axis. For example, loans with FICO score 650 defaulted at a rate of about 12% per year, while loans with FICO 750 defaulted at around 4.5% per year.

Default rate by current LTV

Clearly both of these variables are highly correlated with default rates, and in the directions we would expect: higher credit scores correlate to lower default rates, and higher loan-to-value ratios correlate to higher default rates.

The dataset cannot tell us why any borrowers defaulted. Some probably came upon financial hardship due to the economic recession and were unable to pay their bills. Others might have been taken advantage of by unscrupulous mortgage brokers, and could never afford their monthly payments. And, yes, some also “strategically” defaulted — meaning they could have paid their mortgages, but chose not to.

The fact that current LTV is so highly correlated to default rates leads me to suspect that strategic defaults were fairly common in the depths of the recession. But why might some people walk away from loans that they’re capable of paying?

As an example, say a borrower has a $300,000 loan at a 6% interest rate against a home that had since declined in value to $200,000, for an LTV of 150. The monthly payment on such a mortgage is $1,800. Assuming a price/rent ratio of 18, approximately the national average, then the borrower could rent a similar home for $925 per month, a savings of over $10,000 per year. Of course strategically defaulting would greatly damage the borrower’s credit, making it potentially much more difficult to get new loans in the future, but for such a large monthly savings, the borrower might reasonably decide not to pay.
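Here is a rough sketch of that back-of-the-envelope comparison in R, using the standard annuity formula for the payment; the 30-year term and the price/rent ratio of 18 are the assumptions stated above:

monthly_payment <- function(principal, annual_rate, n_months = 360) {
  # standard fixed-rate mortgage payment (annuity formula)
  r <- annual_rate / 12
  principal * r / (1 - (1 + r)^(-n_months))
}

payment <- monthly_payment(300000, 0.06)  # about $1,799 per month
rent    <- 200000 / 18 / 12               # about $926 per month at a price/rent ratio of 18
12 * (payment - rent)                     # over $10,000 per year saved by renting instead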

A Cox proportional hazards model helps give us a sense of which variables have the largest relative impact on default rates. The model assumes that there’s a baseline default rate (the “hazard rate”), and that the independent variables have a multiplicative effect on that baseline rate. I calibrated a Cox model on a random subset of loans using R’s coxph() function:

library(survival)

formula = Surv(loan_age - 1, loan_age, defaulted) ~
          credit_score + ccltv + dti + loan_purpose + channel + sato

cox_model = coxph(formula, data = monthly_default_data)

summary(cox_model)

> summary(cox_model)
Call:
coxph(formula = Surv(loan_age - 1, loan_age, defaulted) ~ credit_score +
    ccltv + dti + loan_purpose + channel + sato, data = monthly_default_data)

  n= 17866852, number of events= 94678

                    coef  exp(coef)   se(coef)        z Pr(>|z|)
credit_score  -9.236e-03  9.908e-01  8.387e-05  -110.12   <2e-16
ccltv          2.259e-02  1.023e+00  1.582e-04   142.81   <2e-16
dti            2.092e-02  1.021e+00  4.052e-04    51.62   <2e-16
loan_purposeR  4.655e-01  1.593e+00  9.917e-03    46.94   <2e-16
channelTPO     1.573e-01  1.170e+00  9.682e-03    16.25   <2e-16
sato           3.563e-01  1.428e+00  1.284e-02    27.75   <2e-16

The categorical variables, loan_purpose and channel, are the easiest to interpret because we can just look at the exp(coef) column to see their effect. In the case of loan_purpose, loans that were made for refinances multiply the default rate by 1.593 compared to loans that were made for purchases. For channel, loans that were made by third party originators, e.g. mortgage brokers, increase the hazard rate by 17% compared to loans that were originated directly by lenders.

The coefficients for the continuous variables are harder to compare because they each have their own independent scales: credit scores range from roughly 600 to 800, LTVs from 30 to 150, DTIs from 20 to 60, and SATO from -1 to 1. Again, I find graphs the easiest way to interpret them. We can use R’s predict() function to generate hazard rate multipliers for each independent variable, while holding all the other variables constant:
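As a rough sketch of that predict() step (not the author's exact code), something along these lines would trace out the current LTV curve while holding the other covariates at fixed, plausible values; the factor levels "P" and "R" used below are assumed reference categories:

grid <- data.frame(
  credit_score = 720,
  ccltv        = seq(30, 150, by = 5),
  dti          = 35,
  loan_purpose = "P",  # assumed level for purchase loans
  channel      = "R",  # assumed level for retail originations
  sato         = 0,
  loan_age     = 24,   # placeholder response columns so the model frame can be built;
  defaulted    = 0     # they do not affect type = "risk" predictions
)

# type = "risk" returns exp(linear predictor), i.e. a multiplier of the baseline hazard
# relative to the average loan in the estimation sample
multipliers <- predict(cox_model, newdata = grid, type = "risk")
plot(grid$ccltv, multipliers, type = "l",
     xlab = "Current LTV", ylab = "Hazard rate multiplier")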

Hazard rate multipliers

Remember that the y-axis here shows a multiplier of the base default rate, not the default rate itself. So, for example, the average current LTV in the dataset is 82, which has a multiplier of 1. If we were looking at two loans, one of which had current LTV 82, the other a current LTV of 125, then the model predicts that the latter loan’s monthly default rate is 2.65 times the default rate of the former.
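Because the Cox model is multiplicative in exp(coef), that 2.65x figure can be checked directly from the ccltv coefficient in the output above:

exp(2.259e-02 * (125 - 82))  # roughly 2.64, the multiplier for moving current LTV from 82 to 125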

All of the variables behave directionally as we’d expect: higher LTV, DTI, and SATO are all associated with higher hazard rates, while higher credit scores are associated with lower hazard rates. The graph of hazard rate multipliers shows that current LTV and credit score have larger magnitude impact on defaults than DTI and SATO. Again the model tells us nothing about why borrowers default, but it does suggest that home price-adjusted LTVs and credit scores are the most important predictors of default rates.

There is plenty of opportunity to develop more advanced default models. Many techniques, including Cox proportional hazards models and logistic regression, are popular because they have relatively simple functional forms that behave well mathematically, and there are existing software packages that make it easy to calibrate parameters. On the other hand, these models can fall short because they have no meaningful connection to the actual underlying dynamics of mortgage borrowers.

So-called agent-based models attempt to model the behavior of individual borrowers at the micro-level, then simulate many agents interacting and making individual decisions, before aggregating into a final prediction. The agent-based approach can be computationally much more complicated, but at least in my opinion it seems like a model based on traditional statistical techniques will never explain phenomena like the housing bubble and financial crisis, whereas a well-formulated agent-based model at least has a fighting chance.

Why are defaults so much lower today?

We saw earlier that recently originated loans have defaulted at a much lower rate than loans originated during the bubble years. For one thing, home prices bottomed out sometime around 2012 and have rebounded some since then. The partial home price recovery causes current LTVs to decline, which as we’ve seen already, should correlate to lower default rates.

Perhaps more importantly, though, it appears that Fannie and Freddie have adopted significantly stricter lending standards starting in 2009. The average FICO score used to be 720, but since 2009 it has been more like 765. Furthermore, if we look 2 standard deviations from the mean, we see that the low end of the FICO spectrum used to reach down to about 600, but since 2009 there have been very few loans with FICO less than 680.

Tighter agency standards, coupled with a complete shutdown in the non-agency mortgage market, including both subprime and Alt-A lending, mean that there is very little credit available to borrowers with low credit scores (a far more difficult question is whether this is a good or bad thing!).

Average FICO by origination year

Since 2009, Fannie and Freddie have made significantly fewer loans to borrowers with credit scores below 680. There is some discussion about when, if ever, Fannie and Freddie will begin to loosen their credit standards and provide more loans to borrowers with lower credit scores.

What next?

There are many more things we could study in the dataset. Long before investors worried about default rates on agency mortgages, they worried about voluntary prepayments due to refinancing and housing turnover. When interest rates go down, many mortgage borrowers refinance their loans to lower their monthly payments. For mortgage investors, investment returns can depend heavily on how well they project prepayments.

I’m sure some astronomical number of human-hours have been spent modeling prepayments, dating back to the 1970s when mortgage securitization started to become a big industry. Historically the models were calibrated against aggregated pool-level data, which was okay, but does not offer as much potential as loan-level data. With more loan-level data available, and faster computers to process it, I’d imagine that many on Wall Street are already hard at work using this relatively new data to refine their prepayment models.

Fannie and Freddie continue to improve their datasets, recently adding data for actual losses suffered on defaulted loans. In other words, when the bank has to foreclose and sell a house, how much money do the agencies typically lose? This loss severity number is itself a function of many variables, including home prices, maintenance costs, legal costs, and others. Severity will also be extremely important for mortgage investors in the proposed new world where Fannie and Freddie might no longer provide full guarantees against loss of principal.

Beyond Wall Street, I’d hope that the open-source nature of the data helps provide a better “early detection” system than we saw in the most recent crisis. A lot of people were probably generally aware that the mortgage market was in trouble as early as 2007, but unless you had access to specialized data and systems to analyze it, there was no way for most people to really know what was going on.

There’s still room for improvement: Fannie and Freddie could expand their datasets to include more than just 30-year fixed-rate loans. There are plenty of other types of loans, including 15-year terms and loans with adjustable interest rates. 30-year fixed-rate loans continue to be the standard of the U.S. mortgage market, but it would still be good to release data for all of Fannie and Freddie’s loans.

It’d also be nice if Fannie and Freddie released the data in a more timely manner instead of lagged by several months to a year. The lag before releasing the data reduces its effectiveness as a tool for monitoring the general health of the economy, but again it’s much better than only a few years ago when there was no readily available data at all. In the end, the trend toward free and open data, combined with the ever-increasing availability of computing power, will hopefully provide a clearer picture of the mortgage market, and possibly even prevent another financial crisis.

Appendix: data glossary

Mortgage data is available to download from Fannie Mae and Freddie Mac’s websites, and the full scripts I used to load and process the data are available on GitHub.

Each loan has an origination record, which includes static data that will never change for the life of the loan. Each loan also has a set of monthly observations, which record values at every month of the loan’s life. The PostgreSQL database has 2 main tables: loans and monthly_observations.
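As a hedged sketch of how the two tables might be queried together from R (the database name and connection details are assumptions; the column names follow the glossary below):

library(DBI)
library(RPostgres)

con <- dbConnect(RPostgres::Postgres(), dbname = "agency_mortgages")  # hypothetical database name

# join a few loan-level fields onto their monthly observations
dbGetQuery(con, "
  SELECT l.credit_score, l.property_state, m.loan_age, m.dq_status
  FROM loans l
  JOIN monthly_observations m ON m.loan_id = l.id
  LIMIT 10
")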

Beyond the data provided by Fannie and Freddie, I found it helpful to add columns to the loans table for what we might call calculated characteristics. For example, I found that it was helpful to have a column on the loans table called first_serious_dq_date. This column would be populated with the first month in which a loan was 60 days delinquent, or null if the loan has never been 60 days delinquent. There’s no new information added by the column, but it’s convenient to have it available in the loans table as opposed to the monthly_observations table because loans is a significantly smaller table, and so if we can avoid database joins to monthly_observations for some analysis then that makes things faster and easier.

I also collected home price data from the FHFA, and mortgage rate data from Freddie Mac.

Selected columns from the loans table:

  • credit_score, also referred to as FICO
  • original_upb, short for original unpaid balance; the amount of the loan
  • oltv and ocltv, short for original (combined) loan-to-value ratio. Amount of the loan divided by the value of the home at origination, expressed as a percentage. Combined loan-to-value includes any additional liens on the property
  • dti, debt-to-income ratio. From Freddie Mac’s documentation: the sum of the borrower’s monthly debt payments […] divided by the total monthly income used to underwrite the borrower
  • sato, short for spread at origination, the difference between the loan’s interest rate and the prevailing market rate at the time the loan was made
  • property_state
  • msa, metropolitan statistical area
  • hpi_index_id, references the FHFA home price index (HPI) data. If the loan’s metropolitan statistical area has its own home price index, use the MSA index, otherwise use the state-level index. Additionally if the FHFA provides a purchase-only index, use purchase-only, otherwise use purchase and refi
  • occupancy_status (owner, investor, second home)
  • channel (retail, broker, correspondent)
  • loan_purpose (purchase, refinance)
  • mip, mortgage insurance premium
  • first_serious_dq_date, the first date on which the loan was observed to be at least 60 days delinquent. Null if the loan was never observed to be delinquent
  • id and loan_sequence_number: loan_sequence_number is the unique string ID assigned by Fannie and Freddie; id is a unique integer designed to save space in the monthly_observations table

Selected columns from the monthly_observations table:

  • loan_id, for joining against the loans table: loans.id = monthly_observations.loan_id
  • date
  • current_upb, current unpaid balance
  • previous_upb, the unpaid balance in the previous month
  • loan_age
  • dq_status and previous_dq_status

More info is available in the documentation provided by Fannie Mae and Freddie Mac.

To leave a comment for the author, please follow the link and comment on his blog: Todd W. Schneider.

Estimating Analytics Software Market Share by Counting Books


(This article was first published on r4stats.com » R, and kindly contributed to R-bloggers)

Below is the latest update to The Popularity of Data Analysis Software.

Books

The number of books published on each software package or language reflects its relative popularity. Amazon.com offers an advanced search method which works well for all the software except R and the general-purpose languages such as Java, C, and MATLAB. I did not find a way to easily search for books on analytics that used such general purpose languages, so I’ve excluded them in this section. While “R” is a vague term to search for, I was able to obtain a count of its books from http://www.r-project.org/doc/bib/R-books.html.

The Amazon.com advanced search configuration that I used was (using SAS as an example):

Title: SAS -excerpt -chapter -changes -articles 
Subject: Computers & Technology
Condition: New
Format: All formats
Publication Date: After January, 2000

The “title” parameter allowed me to focus the search on books that included the software names in their titles. Other books may use a particular software in their examples, but they’re impossible to search for easily.  SAS has many manuals for sale as individual chapters or excerpts. They contain “chapter” or “excerpt” in their title so I excluded them using the minus sign, e.g. “-excerpt”. SAS also has short “changes and enhancements” booklets that the developers of other packages release only in the form of flyers and/or web pages, so I excluded “changes” as well. Some software listed brief “articles” which I also excluded. I did the search on June 1, 2015, and I excluded excerpts, chapters, changes, and articles from all searches.

The results are shown in Table 1, where it’s clear that a very small number of analytics software packages dominate the world of book publishing. SAS has a huge lead with 576 titles, followed by SPSS with 339. SAS and SPSS both have many versions of the same book or manual still for sale, so their numbers are both inflated as a result. R was the only other package with more than 100 titles, though JMP came close. Although I obtained counts on all 27 of the domain-specific (i.e. not general-purpose) analytics software packages or languages shown in Figure 2a, I cut the table off at software that had 8 or fewer books to save space.

Software        Number of Books 
SAS                  576
SPSS Statistics      339
R                    172
JMP                   97
Hadoop                89
Stata                 62
Minitab               33
Enterprise Miner      32

Table 1. The number of books whose titles contain the name of each software package.


To leave a comment for the author, please follow the link and comment on his blog: r4stats.com » R.


List of user-installed R packages and their versions


(This article was first published on Heuristic Andrew, and kindly contributed to R-bloggers)

This R command lists all the packages installed by the user (ignoring packages that come with R such as base and foreign) and the package versions.


ip <- as.data.frame(installed.packages()[,c(1,3:4)])  # Package, Version, Priority
rownames(ip) <- NULL
ip <- ip[is.na(ip$Priority),1:2,drop=FALSE]  # packages shipped with R have a non-NA Priority
print(ip, row.names=FALSE)

Example output


Package Version
bitops 1.0-6
BradleyTerry2 1.0-6
brew 1.0-6
brglm 0.5-9
car 2.0-25
caret 6.0-47
coin 1.0-24
colorspace 1.2-6
crayon 1.2.1
devtools 1.8.0
dichromat 2.0-0
digest 0.6.8
earth 4.4.0
evaluate 0.7
[..snip..]

Tested with R 3.2.0.

This is a small step towards managing package versions: for a better solution, see the checkpoint package. You could also use the first column to reinstall user-installed R packages after an R upgrade.
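A hedged sketch of that reinstall step (the file name is arbitrary; run the first part before upgrading R and the second part afterwards):

# before upgrading R: save the list of user-installed packages
write.csv(ip, "user_packages.csv", row.names = FALSE)

# after upgrading R: reinstall anything from the list that is not already present
pkgs <- read.csv("user_packages.csv", stringsAsFactors = FALSE)
install.packages(setdiff(pkgs$Package, rownames(installed.packages())))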

To leave a comment for the author, please follow the link and comment on his blog: Heuristic Andrew.


advanced.procD.lm for pairwise tests and model comparisons


(This article was first published on geomorph, and kindly contributed to R-bloggers)
In geomorph 2.1.5, we decided to deprecate the functions pairwiseD.test and pairwise.slope.test.  Our reason for this was two-fold.  First, recent updates by CRAN rendered these functions unstable.  These functions depended on the model.frame family of base functions, which were updated by CRAN.  We tried to find solutions, but the updated pairwise functions were worse than non-functioning functions, as they sometimes provided incorrect results (owing to strange sorting of rows in design matrices).  We realized that we were in a position that required a complete overhaul of these functions, if we wanted to maintain them.  Second, because advanced.procD.lm was already capable of pairwise tests and did not suffer from the same issues, we realized we did not have to update the other functions, but could instead help users understand how to use advanced.procD.lm.  Basically, this blog post is a much better use of time than trying again and again to fix broken functions.  Before reading on, if you have not already read the blog post on ANOVA in geomorph, it would probably be worth your time to read that post first.
There are three topics covered in this post: 1) advanced.procD.lm as a model comparison test, 2) "full" randomization versus randomized residual permutation procedure (RRPP), and 3) pairwise tests.  These topics are not independent.  It is hoped that users realize that pairwise tests are nothing more than using the same resampling experiment for ANOVA for all pairwise test statistics.  Thus, understanding the first two topics makes the pairwise options of advanced.procD.lm obvious to manipulate.  (Note that the term ANOVA is used here to include univariate and multivariate ANOVA.  The ANOVA results in geomorph functions are not obtained differently for univariate and multivariate data.)

Model comparisons: As explained in the ANOVA blog post, we use comparisons of model (sum of squared) error to calculate effects (sum of squares, SS).  An "effect" is described by the variables contained in one model and lacked in another.  For example, using the plethodon data,

> library(geomorph)
> data(plethodon)
> gpa <- gpagen(plethodon$land)
> Y <- two.d.array(gpa$coords)

> species <- plethodon$species
> site <- plethodon$site
> fit1 <- lm(Y ~ 1) # effects = intercept
> fit2 <- lm(Y ~ species) # effects = intercept + species


we have created two models, fit1 and fit2.  It is easiest to appreciate the difference between these models by looking at the model matrices.

> model.matrix(fit1)
   (Intercept)
1            1
2            1
3            1
4            1
5            1
6            1
7            1
8            1
9            1
10           1
11           1
12           1
13           1
14           1
15           1
16           1
17           1
18           1
19           1
20           1
21           1
22           1
23           1
24           1
25           1
26           1
27           1
28           1
29           1
30           1
31           1
32           1
33           1
34           1
35           1
36           1
37           1
38           1
39           1
40           1
attr(,"assign")
[1] 0
> model.matrix(fit2)
   (Intercept)           speciesTeyah
1            1                      0
2            1                      0
3            1                      0
4            1                      0
5            1                      0
6            1                      0
7            1                      0
8            1                      0
9            1                      0
10           1                      0
11           1                      1
12           1                      1
13           1                      1
14           1                      1
15           1                      1
16           1                      1
17           1                      1
18           1                      1
19           1                      1
20           1                      1
21           1                      0
22           1                      0
23           1                      0
24           1                      0
25           1                      0
26           1                      0
27           1                      0
28           1                      0
29           1                      0
30           1                      0
31           1                      1
32           1                      1
33           1                      1
34           1                      1
35           1                      1
36           1                      1
37           1                      1
38           1                      1
39           1                      1
40           1                      1
attr(,"assign")
[1] 0 1
attr(,"contrasts")
attr(,"contrasts")$species`
[1] "contr.treatment"


The fit2 model is more "complex" but contains the elements of the fit1 model.  (It can also be seen that the species "effect" allows two means instead of one.  The first species becomes the intercept [second column = 0] and the second species changes the value of the intercept [second column = 1].  More on this here.)  If one wanted to evaluate the species effect, he could compare the error produced by the two models, since "species" is contained in one but lacked in the other.  This is what is done with advanced.procD.lm, e.g.

> advanced.procD.lm(fit1, fit2)

ANOVA with RRPP

        df     SSE       SS      F      Z     P
        39 0.19694                           
species 38 0.16768 0.029258 6.6304 4.9691 0.001


This is how this ANOVA should be interpreted: first, there are 40 salamanders in these data.  The intercept model (fit1) estimates the mean.  Therefore there are 40 - 1 = 39 degrees of freedom.  The species model (fit2) estimates two means.  Therefore there are 40 - 2 = 38 degrees of freedom.  From the "predicted" values (one multivariate mean or two multivariate means, depending on model), residuals are obtained, their Procrustes distances to predicted values are calculated (the procD part of the function name), and the squared distances are summed to find the sum of squared error (SSE) for each model.  Error can only decrease with more model effects.  By adding the species effect, the change in SSE was 0.029258, which is the model effect.  For descriptive purposes, this can be converted to an F value by dividing the effect SS by the difference in degrees of freedom (1 df), then dividing this value by the SSE of the full model divided by its degrees of freedom (38 df).  This is a measure of effect size.  More importantly, the effect size can also be evaluated by the position of the observed SS in the distribution of random SS.  Since we did not specify the number of permutations, the default is used (999 plus the observed).  The observed SS is 4.813 standard deviations from the expected value of 0 under the null hypothesis, with a probability of being exceeded of 0.001 (the P-value).  ANOVA with RRPP indicates that in every random permutation, the residuals of the "reduced" model are randomized.

For comparison, here are the results from procD.lm on fit2, using RRPP

> procD.lm(fit2, RRPP=TRUE)

Type I (Sequential) Sums of Squares and Cross-products

Randomized Residual Permutation Procedure used

          df       SS        MS     Rsq      F      Z P.value
species    1 0.029258 0.0292578 0.14856 6.6304 4.8448   0.001
Residuals 38 0.167682 0.0044127                             
Total     39 0.196940 
 
 

Notice that the summary statistics are the exact same (except Z and maybe P.value, as these depend on random outcomes).  This is the case if the reduced model is the most basic "null" model (contains only an intercept) and there is only one effect.  This time, let's do the same thing without RRPP.

> procD.lm(fit2, RRPP=FALSE)

Type I (Sequential) Sums of Squares and Cross-products

Randomization of Raw Values used

          df       SS        MS     Rsq      F     Z P.value
species    1 0.029258 0.0292578 0.14856 6.6304 4.772   0.002
Residuals 38 0.167682 0.0044127                            
Total     39 0.196940 


Notice that the Z and P-values are so similar as to suggest that nothing different was done.  In this case, this is the truth.  RRPP performed with residuals from a model with only an intercept is tantamount to a "full" randomization of the observed values.  (This is explained better here.)

Full randomization versus RRPP: To better appreciate the difference between RRPP and full randomization, let's make a different model

fit3 <- lm(Y ~ species + site)

Now we have options!  First, let's just use procD.lm and forget about model comparisons... sort of.  We will do this with both full randomization and RRPP.


> procD.lm(fit3, RRPP=FALSE)

Type I (Sequential) Sums of Squares and Cross-products

Randomization of Raw Values used

          df       SS       MS     Rsq      F       Z P.value
species    1 0.029258 0.029258 0.14856 10.479  4.7407   0.001
site       1 0.064375 0.064375 0.32688 23.056 10.1732   0.001
Residuals 37 0.103307 0.002792                              
Total     39 0.196940     

                                  
> procD.lm(fit3, RRPP=TRUE)

Type I (Sequential) Sums of Squares and Cross-products

Randomized Residual Permutation Procedure used

          df       SS       MS     Rsq      F       Z P.value
species    1 0.029258 0.029258 0.14856 10.479  4.7368   0.002
site       1 0.064375 0.064375 0.32688 23.056 11.9331   0.001
Residuals 37 0.103307 0.002792                              
Total     39 0.196940  
 


Notice that the test stats are the same for both but the effect sizes (Z scores) are different.  The species Z scores are similar, and similar to what was observed before.  The site Z scores are slightly different though.  There are cases where one method might produce significant results and the other does not.  The reason for this is that procD.lm performs a series of model comparisons.  The first model has an intercept.  The second model adds species.  The third model adds site to the second model.  There are two model comparisons: first to second and second to third.  Where these methods differ is that with RRPP, the residuals from the "reduced" model are used in each comparison; with full randomization, the residuals from the intercept model (only and always) are used.

This gets even more complex with more effects added to the model; e.g., an interaction:

> fit4 <- lm(Y ~ species * site)
> procD.lm(fit4, RRPP=FALSE)

Type I (Sequential) Sums of Squares and Cross-products

Randomization of Raw Values used

             df       SS       MS     Rsq      F      Z P.value
species       1 0.029258 0.029258 0.14856 14.544 4.7214   0.001
site          1 0.064375 0.064375 0.32688 32.000 9.9451   0.001
species:site  1 0.030885 0.030885 0.15682 15.352 4.9743   0.001
Residuals    36 0.072422 0.002012                             
Total        39 0.196940    

                                  
> procD.lm(fit4, RRPP=TRUE)

Type I (Sequential) Sums of Squares and Cross-products

Randomized Residual Permutation Procedure used

             df       SS       MS     Rsq      F       Z P.value
species       1 0.029258 0.029258 0.14856 14.544  4.7512   0.001
site          1 0.064375 0.064375 0.32688 32.000 11.5108   0.001
species:site  1 0.030885 0.030885 0.15682 15.352  9.8574   0.001
Residuals    36 0.072422 0.002012                              
Total        39 0.196940  
 


It should be appreciated that because full randomization ignores the effects already added to the model (e.g., species and site, before the interaction is added), spurious results can occur.  This could be significant effects rendered non-significant, or non-significant effects rendered significant.  The reason advanced.procD.lm is useful is that it allows creativity rather than the simple on/off switch for RRPP in procD.lm.

Model comparisons redux: Here is a simple exercise.  Let's do all logical model comparisons of the model fits we have thus far, using advanced.procD.lm.

> advanced.procD.lm(fit1, fit2)

ANOVA with RRPP

        df     SSE       SS      F      Z     P
        39 0.19694                            
species 38 0.16768 0.029258 6.6304 4.8135 0.001


> advanced.procD.lm(fit2, fit3)

ANOVA with RRPP

             df     SSE       SS      F      Z     P
species      38 0.16768                            
species+site 37 0.10331 0.064375 23.056 11.459 0.001


> advanced.procD.lm(fit3, fit4)

ANOVA with RRPP

                          df      SSE       SS      F      Z     P
species+site              37 0.103307                            
species+site+species:site 36 0.072422 0.030885 15.352 9.4351 0.001


The Z values in each test are approximately the same as procD.lm with RRPP.  The SS values are exactly the same!  (The F values are not, as the error SS changes in each case.)  So, it might not be clear why advanced.procD.lm is useful.  Here is something that cannot be done with procD.lm.

> advanced.procD.lm(fit2, fit4)

ANOVA with RRPP

                          df      SSE      SS      F      Z     P
species                   38 0.167682                          
species+site+species:site 36 0.072422 0.09526 23.676 9.6226 0.001


The usefulness is that advanced.procD.lm can be used with any comparison of nested models.  The models do not have to differ by one effect.  Also, a "full" model evaluation can be done as

> advanced.procD.lm(fit1, fit4)

ANOVA with RRPP

                          df      SSE      SS      F      Z     P
                          39 0.196940                           
species+site+species:site 36 0.072422 0.12452 20.632 7.3273 0.001


Understanding that any two nested models can be compared, and that advanced.procD.lm uses RRPP exclusively, one can use the resampling experiment to perform pairwise comparisons for model effects that describe groups.

Pairwise tests: When one or more model effects are factors (categorical), pairwise statistics can be calculated and statistically evaluated with advanced.procD.lm.  This is accomplished with the groups operator within the function.  E.g.,

> advanced.procD.lm(fit1, fit4, groups = ~species*site)
$anova.table

ANOVA with RRPP

                          df      SSE      SS      F      Z     P
                          39 0.196940                           
species+site+species:site 36 0.072422 0.12452 20.632 7.4109 0.001

$Means.dist
            Jord:Allo  Jord:Symp Teyah:Allo Teyah:Symp
Jord:Allo  0.00000000 0.09566672 0.02432519  0.1013670
Jord:Symp  0.09566672 0.00000000 0.09193082  0.1069432
Teyah:Allo 0.02432519 0.09193082 0.00000000  0.0994980
Teyah:Symp 0.10136696 0.10694324 0.09949800  0.0000000

$Prob.Means.dist
           Jord:Allo Jord:Symp Teyah:Allo Teyah:Symp
Jord:Allo      1.000     0.001      0.708      0.001
Jord:Symp      0.001     1.000      0.001      0.001
Teyah:Allo     0.708     0.001      1.000      0.001
Teyah:Symp     0.001     0.001      0.001      1.000


In addition to the ANOVA, the pairwise Procrustes distances between all possible means (as defined) were calculated.  The P-values below these indicate the probability of finding a greater distance, by chance, from the resampling experiment.  Because fit1 contains only an intercept, the resampling experiment was a full randomization of shape values.  To account for species and site main effects, this analysis could be repeated with the model that contains the main effects, but no interaction.  E.g.,

> advanced.procD.lm(fit3, fit4, groups = ~species*site)
$anova.table

ANOVA with RRPP

                          df      SSE       SS      F      Z     P
species+site              37 0.103307                            
species+site+species:site 36 0.072422 0.030885 15.352 9.9593 0.001

$Means.dist
            Jord:Allo  Jord:Symp Teyah:Allo Teyah:Symp
Jord:Allo  0.00000000 0.09566672 0.02432519  0.1013670
Jord:Symp  0.09566672 0.00000000 0.09193082  0.1069432
Teyah:Allo 0.02432519 0.09193082 0.00000000  0.0994980
Teyah:Symp 0.10136696 0.10694324 0.09949800  0.0000000

$Prob.Means.dist
           Jord:Allo Jord:Symp Teyah:Allo Teyah:Symp
Jord:Allo      1.000     0.014      0.999      0.589
Jord:Symp      0.014     1.000      0.597      0.001
Teyah:Allo     0.999     0.597      1.000      0.001
Teyah:Symp     0.589     0.001      0.001      1.000


This has a profound effect.  Many of the previous significant pairwise differences in means are now not significant, after accounting for general species and site effects.  One should always be careful when interpreting results to understand the null hypothesis.  The former test assumes a null hypothesis of no differences among means; the latter test assumes a null hypothesis of no difference among means, given species and site effects.  These are two different things!

When using advanced.procD.lm, one can add covariates or other factors that might be extraneous sources of variation.  For example, if we wanted to repeat the last test but also account for body size, the following could be done.  (Also notice via this example, that making model fits beforehand is not necessary.)

> advanced.procD.lm(Y ~ log(CS) + species + site, ~ log(CS) + species*site, groups = ~species*site)
$anova.table

ANOVA with RRPP

                                  df      SSE       SS      F      Z     P
log(CS)+species+site              36 0.098490                            
log(CS)+species+site+species:site 35 0.068671 0.029819 15.198 9.6757 0.001

$Means.dist
            Jord:Allo  Jord:Symp Teyah:Allo Teyah:Symp
Jord:Allo  0.00000000 0.09566672 0.02432519  0.1013670
Jord:Symp  0.09566672 0.00000000 0.09193082  0.1069432
Teyah:Allo 0.02432519 0.09193082 0.00000000  0.0994980
Teyah:Symp 0.10136696 0.10694324 0.09949800  0.0000000

$Prob.Means.dist
           Jord:Allo Jord:Symp Teyah:Allo Teyah:Symp
Jord:Allo      1.000     0.013      1.000      0.591
Jord:Symp      0.013     1.000      0.584      0.001
Teyah:Allo     1.000     0.584      1.000      0.001
Teyah:Symp     0.591     0.001      0.001      1.000


The ANOVA and pairwise stats change a bit (but means do not), as log(CS) accounts for variation in shape in both models.  Also note that "~" is needed in all operator parts that are formulaic.  This is essential for proper functioning.  (Technical note:  this test is not quite appropriate, as the means are not appropriate.  This will be explained below, after further discussion about slopes.)

One can also compare slopes for a covariate among groups (or account for slopes).  This involves comparing a model with a common slope to one allowing different slopes (factor-slope interactions).  E.g.,

> advanced.procD.lm(Y ~ log(CS) + species*site, ~ log(CS)*species*site, groups = ~species*site, slope = ~log(CS))
$anova.table

ANOVA with RRPP

                                                                                    df      SSE        SS      F      Z     P
log(CS)+species+site+species:site                                                   35 0.068671                             
log(CS)+species+site+log(CS):species+log(CS):site+species:site+log(CS):species:site 32 0.061718 0.0069531 1.2017 1.2817 0.099

$LS.Means.dist
            Jord:Allo  Jord:Symp Teyah:Allo Teyah:Symp
Jord:Allo  0.00000000 0.09396152 0.02473040 0.10148562
Jord:Symp  0.09396152 0.00000000 0.08999162 0.10547891
Teyah:Allo 0.02473040 0.08999162 0.00000000 0.09949284
Teyah:Symp 0.10148562 0.10547891 0.09949284 0.00000000

$Prob.Means.dist
           Jord:Allo Jord:Symp Teyah:Allo Teyah:Symp
Jord:Allo      1.000     0.605      0.871      0.639
Jord:Symp      0.605     1.000      0.629      0.629
Teyah:Allo     0.871     0.629      1.000      0.638
Teyah:Symp     0.639     0.629      0.638      1.000

$Slopes.dist
           Jord:Allo Jord:Symp Teyah:Allo Teyah:Symp
Jord:Allo  0.0000000 0.2188238  0.1780151  0.1231082
Jord:Symp  0.2188238 0.0000000  0.2718850  0.2354029
Teyah:Allo 0.1780151 0.2718850  0.0000000  0.1390140
Teyah:Symp 0.1231082 0.2354029  0.1390140  0.0000000

$Prob.Slopes.dist
           Jord:Allo Jord:Symp Teyah:Allo Teyah:Symp
Jord:Allo      1.000     0.374      0.091      0.165
Jord:Symp      0.374     1.000      0.134      0.174
Teyah:Allo     0.091     0.134      1.000      0.259
Teyah:Symp     0.165     0.174      0.259      1.000

$Slopes.correlation
              Jord:Allo   Jord:Symp   Teyah:Allo Teyah:Symp
Jord:Allo   1.000000000  0.01344439 -0.006345334  0.1577696
Jord:Symp   0.013444387  1.00000000 -0.288490065 -0.3441474
Teyah:Allo -0.006345334 -0.28849007  1.000000000  0.3397718
Teyah:Symp  0.157769562 -0.34414737  0.339771753  1.0000000

$Prob.Slopes.cor
           Jord:Allo Jord:Symp Teyah:Allo Teyah:Symp
Jord:Allo      1.000     0.257      0.108      0.090
Jord:Symp      0.257     1.000      0.058      0.007
Teyah:Allo     0.108     0.058      1.000      0.424
Teyah:Symp     0.090     0.007      0.424      1.000


A couple of things.  First, instead of means there is a comparison of least-squares (LS) means.  These are predicted values at the average value of the covariate (slope variable).  These can be different from means, as groups can be comprised of different ranges of covariates.  The slope distance is the difference in amount of shape change (as a Procrustes distance) per unit of covariate change.  The slope correlation is the vector correlation of slope vectors.  This indicates if vectors point in different directions in the tangent space (or other data space).  Note that these pairwise stats should not be considered in this case, as the ANOVA reveals a non-significant difference between models.

Returning to the incorrect pairwise test between LS means, the correct method is as follows.

> advanced.procD.lm(Y ~ log(CS) + species + site, ~ log(CS)+ species*site, groups = ~species*site, slope = ~log(CS))
$anova.table

ANOVA with RRPP

                                  df      SSE       SS      F      Z     P
log(CS)+species+site              36 0.098490                            
log(CS)+species+site+species:site 35 0.068671 0.029819 15.198 9.6675 0.001

$LS.Means.dist
            Jord:Allo  Jord:Symp Teyah:Allo Teyah:Symp
Jord:Allo  0.00000000 0.09396152 0.02473040 0.10148562
Jord:Symp  0.09396152 0.00000000 0.08999162 0.10547891
Teyah:Allo 0.02473040 0.08999162 0.00000000 0.09949284
Teyah:Symp 0.10148562 0.10547891 0.09949284 0.00000000

$Prob.Means.dist
           Jord:Allo Jord:Symp Teyah:Allo Teyah:Symp
Jord:Allo       1.00     0.020      1.000      0.580
Jord:Symp       0.02     1.000      0.578      0.001
Teyah:Allo      1.00     0.578      1.000      0.002
Teyah:Symp      0.58     0.001      0.002      1.000

$Slopes.dist
           Jord:Allo Jord:Symp Teyah:Allo Teyah:Symp
Jord:Allo  0.0000000 0.2188238  0.1780151  0.1231082
Jord:Symp  0.2188238 0.0000000  0.2718850  0.2354029
Teyah:Allo 0.1780151 0.2718850  0.0000000  0.1390140
Teyah:Symp 0.1231082 0.2354029  0.1390140  0.0000000

$Prob.Slopes.dist
           Jord:Allo Jord:Symp Teyah:Allo Teyah:Symp
Jord:Allo      1.000     0.675      0.340      0.425
Jord:Symp      0.675     1.000      0.418      0.486
Teyah:Allo     0.340     0.418      1.000      0.511
Teyah:Symp     0.425     0.486      0.511      1.000

$Slopes.correlation
              Jord:Allo   Jord:Symp   Teyah:Allo Teyah:Symp
Jord:Allo   1.000000000  0.01344439 -0.006345334  0.1577696
Jord:Symp   0.013444387  1.00000000 -0.288490065 -0.3441474
Teyah:Allo -0.006345334 -0.28849007  1.000000000  0.3397718
Teyah:Symp  0.157769562 -0.34414737  0.339771753  1.0000000

$Prob.Slopes.cor
           Jord:Allo Jord:Symp Teyah:Allo Teyah:Symp
Jord:Allo      1.000     0.317      0.204      0.193
Jord:Symp      0.317     1.000      0.116      0.045
Teyah:Allo     0.204     0.116      1.000      0.447
Teyah:Symp     0.193     0.045      0.447      1.000


In this case, the LS means comparison is meaningful and the rest can be ignored.  (Sorry, it is too daunting to make a program that anticipates every intention of the user.  Sometimes excessive output is required and it is reliant upon the user to know what he is doing and which output to interpret.)  Likewise, if a significant slope interaction was observed (i.e., heterogeneity of slopes), then it would be silly to compare LS means.  It is imperative that the user understand the models that are used and which output to interpret.  Here is a little guide.

1. If there is a covariate involved, compare two models: covariate + groups and covariate*groups.  If the ANOVA returns a significant result, re-perform and assign groups = ~groups and slope = ~covariate.  Focus on the slope distance and slope correlations (or angles, if one of the angle options is chosen).  If ANOVA does not return a significant result, go to 2.

2. If there is a covariate involved, compare two models: covariate and covariate + groups.  If the ANOVA returns a significant result, re-perform and assign groups = ~groups and slope = ~covariate.  Focus on the LS means.  If ANOVA does not return a significant result, groups are not different.

3. If groups represent a factorial interaction (e.g., species*site), one should also consider main effects in the reduced model.  If not, then an intercept can comprise the reduced model.

Ultimately, the advanced.procD.lm function has the capacity to compare a multitude of different reduced and full models and perform specialized pairwise tests.  For example, one could do this:

advanced.procD.lm(Y ~ log(CS) + site, ~ log(CS)*site*species, groups = ~ species, slope = ~log(CS), angle.type = "rad")

Doing this would require a lucid reason to think the residuals of the reduced model are the appropriate exchangeable units under a null hypothesis and that comparing species only, despite an interaction with site, is an appropriate thing to do.  Although advanced.procD.lm can handle it, it is the user's responsibility to validate the output as legitimate.

We hope that this tutorial and the Q & A that will result will be more edifying than the previous pairwiseD.test and pairwise.slope.test, which although more straightforward at first, were less flexible.  With a little patience and practice, this function will become clear.

More analytical details found here.

To leave a comment for the author, please follow the link and comment on his blog: geomorph.


Worrying About my Cholesterol Level


(This article was first published on Econometrics Beat: Dave Giles' Blog, and kindly contributed to R-bloggers)
The headline, "Don't Get Wrong Idea About Cholesterol", caught my attention in the 3 May, 2015 Times-Colonist newspaper here in Victoria, B.C. In fact, the article came from a syndicated column, published about a week earlier. No matter - it's always a good time for me to worry about my cholesterol!

The piece was written by a certain Dr. Gifford-Jones (AKA Dr. Ken Walker).

Here's part of what he had to say:

"Years ago, Dr. John Judkin, formerly emeritus professor of physiology at the University of London, was ridiculed after after he reported that a high dietary intake of animal fat and the eating of foods containing cholesterol were not the cause of coronary heart disease. 
But Judkin pointed to a greater correlation between the intake of sucrose (ordinary sugar) and coronary attack. For instance a study in 15 countries showed that as the population consumed more sugar, there was a dramatic increase in heart attacks. 
More impressive is a prison study by Milton Winitz, a U.S. biochemist, in 1964. Eighteen prisoners, kept behind bars for six months, were given food that was regulated. In this controlled  environment, it was proven that when the prisoner diet was high in sugar, blood cholesterol increased and when dietary sugar was decreased there was a huge drop in blood cholesterol."
I've got nothing against the good doctor, but you'll notice that I've highlighted a few key words in the material quoted above. I'm sure I don't need to explain why!

What he's referring to is research reported by Winitz and his colleagues in the 1964 paper, "The effect of dietary carbohydrate on serum cholesterol levels" (Archives of Biochemistry and Biophysics, 108, 576-579). Interestingly, the findings outlined in that paper were a by-product of the main research that was being undertaken with NASA sponsorship - research into the development of diets for astronauts!

In his famous book, How to Live Longer and Feel Better, the Nobel laureate Linus Pauling refers to this study by Winitz et al.:
"These investigators studied 18 subjects, who were kept in a locked institution, without access to other food, during the whole period of study (about 6 months). 


After a preliminary period with ordinary food, they were placed on a chemically well-defined small molecule diet (seventeen amino acids, a little fat, vitamins, essential minerals, and glucose as the only carbohydrate).

The only significant physiological change that was found was in the concentration of cholesterol in the blood serum, which decreased rapidly for each of the 18 subjects.


The average concentration in the initial period, on ordinary food, was 227 milligrams per deciliter. After two weeks on the glucose diet it had dropped to 173, and after another two weeks it was 160.


The diet was then changed by replacing one quarter of the glucose with sucrose, with all the other dietary constituents the same. Within one week the average cholesterol concentration had risen from 160 to 178, and after two more weeks to 208.

The sucrose was then replaced by glucose. Within one week the average cholesterol concentration had dropped to 175, and it continued dropping, leveling off at 150, 77 less than the initial value." (p.42)

Does any of this constitute proof? Of course not!

But let's take a look at the actual data and undertake our own statistical analysis of this very meagre set of information. From Winitz et al., p.577, we have:


(The data are available in a text file on the data page for this blog. As you can see, the sample size is extremely small - only 18 people were "treated".)

The authors summarize their key findings as follows (p. 578):
"On the basis of a statistical analysis of the mean values of the serum cholesterol levels at the end of the 4th, 7th, 8th, and 19th weeks (see Table I), the following conclusions were drawn (95% confidence level): (a) each of the two progressive decreases in serum cholesterol level with the diet containing glucose as the sole sugar is statistically significant, and (b) the progressive increase in serum cholesterol level upon partial substitution of the glucose with sucrose is also statistically significant."  
Interestingly, exactly what they mean by, "a statistical analysis", is not explained!

(Don't try getting away with that lack of specificity, kids!)

So, the claim is that there's a significant difference between the "before treatment" and "after treatment" results. I must admit to being somewhat skeptical about this, given the tiny number of candidates being treated. However, let's keep an open mind.

Crucially, we're not told by the researchers what statistical tests were actually performed!

However, given that we have a group of people with "before treatment" and "after treatment" scenarios, paired t-tests for the equality of the means provide a natural way to proceed. (e.g., see here.) This requires that we have simple random sampling, and that the population is Normal. More specifically, the differences between the "before" and "after" data have to be normal.
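As a minimal sketch of one such paired test in R rather than EViews (the file and column names are assumptions; the data are in the text file on the blog's data page):

chol <- read.table("winitz_cholesterol.txt", header = TRUE)  # hypothetical file name

# e.g. "0 Weeks" vs "2 Weeks" of Phase I, plus an informal normality check on the differences
d02 <- chol$week2 - chol$week0
qqnorm(d02); qqline(d02)

t.test(chol$week2, chol$week0, paired = TRUE)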

Just for funzies, let's use a variety of resources in our statistical analysis.

My EViews workfile can be found on the code page for this blog. Take a look at the "README" text object in that file for some details of what I did.

Some basic Q-Q plots support the normality assumption for the data differences. Here's just one typical example - it's for the difference ("D67") between the "Week 7" and "Week 6" data in Table 1, above:
Moreover, the Anderson-Darling test for normality (with a sample size adjustment) produces p-values ranging from 0.219 to 0.925. This is very strong support for the normality assumption, especially given the known desirable power properties of this test.

The assumption of random sampling is somewhat more problematic. Were the 18 inmates selected randomly from the population at large? I don't think so! There's nothing we can do about this, except to be cautious if we try to extrapolate the results of the study to a more general population. Which is, of course, what Dr. Gifford-Jones and others are doing.

Now, what about the results for the paired t-tests themselves?

Here they are, with p-values in parentheses. The naming of the t-statistics and their p-values follows the column headings in Table 1, above, with "1½" and "2½" abbreviated to "1" and "2" respectively. For example, "t02" and "p02" refer to the test between the "0 Weeks" and "2 Weeks" data. Similarly, "t819" and "p819" refer to the test between the "8 Weeks" and "19 Weeks" data, etc.

Table 2

                    Phase I                            Phase II           Phase III
   t01     t02     t04     t12     t14     t24      t56     t57     t67      t819
 -7.92   -7.33   -9.82   -1.10   -2.88   -2.21     4.92    7.54    3.71     -4.12
(0.00)  (0.00)  (0.00)  (0.15)  (0.01)  (0.02)   (0.00)  (0.00)  (0.00)    (0.00)

Take a look back at footnote "b" in Table 1 above. You'll see that negative t-statistics are expected everywhere in Table 2 except during "Phase II" of the trials, if we believe the hypothesis that (prolonged) high sucrose intakes are associated with high cholesterol levels.

In all but one instance, the paired t-tests give results that are significant at the 5% level.

Now, it's all very well to have obtained these results, but we might ask - "How powerful is the paired t-test when we're working with such small samples?"

To answer this question I decided to adapt some code that uses the "pwr" package in R, kindly provided by Jake Westfall on the Cross Validated site. The code requires the value of the so-called "effect size", which I computed for our data to be equal to 1, using this online resource. The "tweaked" R code that I used is available on my code page.

For a particular sample correlation between the paired data, the code computes the (minimum) number of pairs needed for a paired t-test to have a desired power when the significance level is 5%.
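A hedged sketch of that calculation using the pwr package (the effect size of 1 and the correlation of 0.898 come from the surrounding text; the conversion from the between-condition effect size to the paired effect size via the correlation is my assumption about how the adapted code worked, so this is an adaptation rather than the exact tweaked code):

library(pwr)

d <- 1        # between-condition effect size
r <- 0.898    # sample correlation between the Week 1 and Week 2 measurements

dz <- d / sqrt(2 * (1 - r))  # assumed conversion to the effect size of the paired differences

# minimum number of pairs for 99% power at a 5% significance level
pwr.t.test(d = dz, power = 0.99, sig.level = 0.05, type = "paired")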
Table 3


The pair-wise sample correlations in the data set we're examining (the relevant columns in Table 1) range between 0.696 and 0.964. So, in Table 3, it turns out that even for the sample sizes that we have, the powers of the paired t-tests are actually quite respectable. For example, the sample correlation for the data for Weeks 1 and 2 is 0.898, so a sample size of at least 5 is needed for the test of equality of the corresponding means to have a power of 99%. This is for a significance level of 5%. This minimum sample size increases to 6 if the significance level is 1% - you can re-run the R code to verify this.

At the end of the day, the small number of people included in the experiment was probably not a big problem. However, don't forget that (questionable) assumption of independent sampling.

In any case, I'm going to cut back on my sugar intake and get more exercise!


© 2015, David E. Giles

To leave a comment for the author, please follow the link and comment on his blog: Econometrics Beat: Dave Giles' Blog.


Case Study: DataCamp, dplyr, and blended learning


(This article was first published on The DataCamp Blog » R, and kindly contributed to R-bloggers)
Editorial Note: This is a guest blog post by Professor Matthew J. Salganik (Princeton University) in which he describes his experiences using the DataCamp interactive learning platform for blended learning. The article was first published on Wheels on the Bus. Want to use DataCamp in your class as well? Contact us via teach@datacamp.com.

DataCamp, dplyr, and blended learning

As I’ve written about in previous posts (here, here, and here), this semester I taught a course called Advanced Data Analysis for the Social Science, which is the second course in our department’s required sequence for Ph.D. students. I’ve taught this course in the past, and in teaching the course this time, I tried to modernize it both in content and in form. Therefore, I partnered with DataCamp to make their dplyr course, taught by Garrett Grolemund, available to my students. This combination of face-to-face teaching and online content is called blended learning, and it’s something that I’d like to explore more in future classes. For a first attempt, I think it worked pretty well, and the people at DataCamp were very helpful. Here’s more about what happened.

DataCamp is an online learning platform focused on data science. They offer courses on a variety of topics, but most are focused on R and a variety of R packages. I was happy to see that they offer a course on dplyr, a wonderful data manipulation package by Hadley Wickham and colleagues. dplyr is very well designed, but it takes some getting used to because it works differently (some might say better) than the way things work in base R. So, like a traditional class, we had a face-to-face lab session on dplyr and a homework assignment that required students to use it (here’s our class syllabus). But, I knew that the students would need more practice if they were going to become truly fluent in dplyr. So, in addition to our traditional class activities, I offered the students the chance to take Garrett Grolemund’s dplyr course. The course consists of 5 chapters, each with videos and instantly graded exercises. I had taken Garrett’s class myself, and I found the exercises to be really, really helpful for practicing dplyr’s style of thinking.

How did my students respond?  About half the students started Garrett’s course, and, of the students that started, a bit less than half pretty much completed the whole thing.

[Figure: datacamp_results, per-student progress through the DataCamp dplyr course]

Is that a success or failure? I’d say a success because Garrett’s course was not required and it was quite long (about 4 hours). In other words, by offering this enrichment about a quarter of the class ended up spending more time learning. I’ve redacted names from the plot, but the students that were most engaged with Garrett’s course were an interesting mix; it was not just the strongest students or the struggling students. If this kind of enrichment were offered more regularly, I wonder which students would be the most likely to take it up.

Finally, I’d like to thank Martijn Theuwissen from DataCamp for helping to make this experiment in blended learning possible. I’d definitely like to try this again next time I teach a course on data analysis.


The post Case Study: DataCamp, dplyr, and blended learning appeared first on The DataCamp Blog .

To leave a comment for the author, please follow the link and comment on his blog: The DataCamp Blog » R.


Big Data and Chess: What are the Predictive Point Values of Chess Pieces?


(This article was first published on Publishable Stuff, and kindly contributed to R-bloggers)

Who doesn’t like chess? Me! Sure, I like the idea of chess – intellectual masterminds battling each other using nothing but pure thought – the problem is that I tend to lose, probably because I don’t really know how to play well, and because I never practice. I do know one thing: How much the different pieces are worth, their point values:

This was among the first things I learned when I was taught chess by my father. Given these point values it should be as valuable to have a knight on the board as having three pawns, for example. So where do these values come from? The point values are not actually part of the rules for chess, but are rather just used as a guideline when trading pieces, and they seem to be based on the expert judgment of chess authorities. (According to the guardian of truth there are many alternative valuations, all in the same ballpark as above.) As I recently learned that it is very important to be able to write Big Data on your CV, I decided to see if these point values could be retrieved using zero expert judgement in favor of huge amounts of chess game data.

The method

How to allocate point values to the chess pieces using only data? One way of doing this is to calculate the predictive values of the chess pieces. That is, given that we only know the current number of pieces in a game of chess and use that information to predict the outcome of that game, how much does each type of piece contribute to the prediction? We need a model to predict the outcome of chess games where we have the following restrictions:

  1. Each type of piece has a single point value that directly contributes to the predicted outcome of the game, so no interaction effects between the pieces.
  2. The value of a piece does not change over the course of the game.
  3. Use no contextual or positional information.

Now these restrictions might feel a bit restrictive, especially if we actually wanted to predict the outcome of chess games as well as possible, but they are there because the original point values follow the same restrictions. As the original point values don’t change with context, neither should ours. Now, as my colleague Can Kabadayi (with an ELO well above 2000) remarked: “But context is everything in Chess!”. Absolutely, but I’m not trying to do anything profound here, this is just a fun exercise! :)

Given the restrictions there is one obvious model: Logistic regression, a vanilla statistical model that calculates the probability of a binary event, like a win or a loss. To get it going I needed data, and the biggest Big Data data set I could find was the Million Base 2.2, which contains over 2.2 million chess games. I had to do a fair deal of data munging to get it into a format that I could work with, but the final result was a table with a lot of rows that looked like this:

pawn_diff rook_diff knight_diff bishop_diff queen_diff white_win 
    1         0           1         -1           0       TRUE

Here each row is from a position in a game where a positive number means White has more of that piece. For the position above White has one more pawn and knight, but one less bishop than Black. Last in each row we get to know whether White won or lost in the end; as logistic regression assumes a binary outcome, I discarded all games that ended in a draw. My résumé is unfortunately not going to look that good as I never really solved the Big Data problem well. Two million chess games are a lot of games and it took my poor old laptop over a day to process only the first 100,000 games. Then I had the classic Big Data problem that I couldn’t fit it all into working memory, so I simply threw away data until it worked. Still, for the analysis I ended up using a sample of 1,000,000 chess positions from the first 100,000 games in the Million Base 2.2. Big enough data for me.

The result

Using the statistical language R I first fitted the following logistic model using maximum likelihood (here described by R’s formula language):

white_win ~ 1 + pawn_diff + knight_diff + bishop_diff + rook_diff + queen_diff
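
The post does not reproduce the fitting call itself; a minimal sketch, assuming the sampled positions sit in a hypothetical data frame called chess_positions with the columns shown earlier, could look like this:

# Sketch only (not the author's script): fit the logistic regression by maximum
# likelihood; chess_positions is a hypothetical data frame holding the *_diff
# columns and the logical white_win outcome shown above
fit <- glm(white_win ~ 1 + pawn_diff + knight_diff + bishop_diff +
             rook_diff + queen_diff,
           family = binomial(link = "logit"),
           data = chess_positions)
coef(fit)  # coefficients are the piece values in log-odds; the intercept is White's advantage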

This resulted in the following piece values:

Three things to note: (1) In addition to the piece values, the model also included a coefficient for the advantage of going first, called White’s advantage above. (2) The predictive piece values ranks the pieces in the same order as the original piece values does. (3) The piece values are given in log-odds, which can be a bit tricky to interpret but that can be easily transformed into probabilities as this graph shows:

Here White’s advantage translates to a 56% chance of White winning (everything else being equal), being two knights and one rook ahead but one pawn behind gives a 92% chance of winning, while being one queen behind gives only an 8% chance of winning. While log-odds are useful if you want to calculate probabilities, the original piece values are not given in log-odds; instead they are set relative to the value of a pawn, which is fixed at 1.0. Let’s scale our log-odds so that the pawn is given a piece value of 1.0:
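
In code, the log-odds to probability conversion is just plogis(); a sketch, reusing the hypothetical fit object from the sketch above:

# Sketch: turn fitted log-odds into win probabilities (plogis is the logistic CDF)
pv <- coef(fit)
plogis(pv["(Intercept)"])                        # White's advantage alone, about 56%
plogis(pv["(Intercept)"] + 2 * pv["knight_diff"] +
         pv["rook_diff"] - pv["pawn_diff"])      # two knights and a rook up, a pawn down: ~92%
plogis(pv["(Intercept)"] - pv["queen_diff"])     # a queen down: ~8%

# Rescale the piece values so that a pawn is worth exactly 1.0
pv[-1] / pv["pawn_diff"]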

We see now that, while the ranking is roughly the same, the scale is compressed compared to the original piece values. A queen is usually considered as nine times more valuable than a pawn, yet when it comes to predicting the outcome of a game a queen advantage counts the same as only a four pawn advantage. A second thing to notice is that bishops are valued slightly higher than knights. If you look at the Wiki page for Chess piece relative value you find that some alternative valuations value the bishop slightly higher than the knight, others add ½ point for a complete bishop pair. We can add that to the model!

white_win ~ 1 + pawn_diff + knight_diff + bishop_diff + rook_diff + queen_diff + bishop_pair_diff

Now with a pair of bishops getting their own value, the values of a knight and a single bishop are roughly equal. There is still the “mystery” regarding the low valuation of all the pieces compared to the pawn. (This doesn’t really have to be a mystery as there is no reason why predictive piece values necessarily should be the same as the original piece values.) Instead of anchoring the value of a pawn to 1.0 we could anchor the value of another piece to its original piece value. Let’s anchor the knight to 3.0:

Now the value of the pieces (excluding the pawn) line up very nicely with the original piece values! So, as I don’t really play chess, I don’t know why a pawn advantage is such a strong predictor of winning (compared to the original piece values, that is). My colleague Can Kabadayi (the ELO > 2000 guy!) had the following to say:

In a high-class game the pawns can be more valuable – they are often the decisive element of the outcome – one can even say that the whole plan of the middle game is to munch a pawn somewhere and then convert it in the endgame by exchanging the pieces, thus increasing the value of the pawn. Grandmaster games tend to go to a rook endgame where both sides have a rook but one side has an extra pawn. It is not easy to convert these positions into a win, but you see the general idea. A passed pawn (a pawn that does not have any enemy pawns blocking its march to become a queen) is also an important asset in chess, as they are a constant threat to become a queen.

Can also gave me two quotes from legendary chess Grandmasters José Capablanca and Paul Keres relating to the value of pawns:

The winning of a pawn among good players of even strength often means the winning of the game.Jose Capablanca

The older I grow, the more I value pawns.Paul Keres

Another thing to keep in mind is that the predictive piece values might have looked very different if I had used a different data set. For example, the players in the current data set are all very skilled, having a median ELO of 2400 and with 95% of the players having an ELO between 2145 and 2660. Still I think it is cool that the predictive piece values matched the original piece values as well as they did!

Final notes

  • Again, I don’t know much about chess, but the Million Base 2.2 is a fun database to work with, so if you have any suggestions for other things to look at, leave a comment below or tweet me (@rabaath)!
  • If you want to dabble with the database yourself you can find the scripts I used to convert the data in the Million Base into an analysis friendly format, and the code for recreating the predictive piece value analysis, here:
    • Python script to convert the Million Base 2.2 to a json-format:
      • https://gist.github.com/rasmusab/07f1823cb4bd0bc7352d
    • R scripts that recreates the analysis and plots in this post:
      • https://gist.github.com/rasmusab/fb98cced046d4c675d74
      • https://gist.github.com/rasmusab/b29bb53cfc3fe25f3f80
  • While “researching” this post I learned about this really fun chess variant called Knightmare Chess, which plays like normal chess but with the addition that each player can play action cards with “special effects”. These effects are often spectacular, like the card Fireball that “explodes” a piece, killing all adjacent pieces. This adds a (large) element of randomness to the game, which might irritate chess purists, but makes it possible for me to win once in a while :)

To leave a comment for the author, please follow the link and comment on his blog: Publishable Stuff.


Padé approximants: CRAN package


(This article was first published on Strange Attractors » R, and kindly contributed to R-bloggers)

While working on the previous post about Padé approximants, I found in a search of CRAN that only one package calculated the coefficients, given the appropriate Taylor series: the pracma package. The method it uses seems rather sophisticated, but does allow for calculating coefficients “beyond” that which the Taylor series would allow. Therefore, I put together a new R package, Pade, which uses the simpler system of linear equations to calculate the coefficients and does not permit calculating coefficients beyond the order of the supplied Taylor series. Any and all feedback is appreciated, as always!
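
For intuition, here is a from-scratch sketch of that linear-system approach; it is purely illustrative and is not the Pade package's actual interface, so check the package documentation for the real function and its signature.

# Illustrative sketch only (not the Pade package's API): solve the standard linear
# system for the [L/M] Pade approximant from Taylor coefficients c0, c1, ...
pade_coef <- function(taylor, L, M) {
  c_at <- function(i) ifelse(i < 0, 0, taylor[pmax(i, 0) + 1])  # c_i, with c_i = 0 for i < 0
  # Denominator b_1..b_M (b_0 = 1):  sum_j b_j * c_{L+k-j} = -c_{L+k},  k = 1..M
  A <- outer(1:M, 1:M, function(k, j) c_at(L + k - j))
  b <- c(1, solve(A, -c_at(L + 1:M)))
  # Numerator a_0..a_L:  a_i = sum_{j=0}^{min(i,M)} b_j * c_{i-j}
  a <- sapply(0:L, function(i) sum(b[1:(min(i, M) + 1)] * c_at(i - 0:min(i, M))))
  list(p = a, q = b)
}

# [2/2] approximant to exp(x); expect p = (1, 1/2, 1/12) and q = (1, -1/2, 1/12)
pade_coef(1 / factorial(0:4), L = 2, M = 2)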

To leave a comment for the author, please follow the link and comment on his blog: Strange Attractors » R.


Identifying the OS from R


(This article was first published on Just the kind of thing you were expecting » R, and kindly contributed to R-bloggers)

Sometimes a bit of R code needs to know what operating system it’s running on. Here’s a short account of where you can find this information and a little function to wrap the answer up neatly.

Operating systems are a platform issue, so let’s start with the constants in the list .Platform. For Windows the OS.type is just “windows”, but confusingly it’s “unix” for Unix, Linux, and my Mac OSX laptop. To be fair that’s because OSX is built on a base of tweaked BSD Unix. But it does seem that .Platform won’t distinguish OSX from a more traditional Unix or Linux machine.

Nor, by the way, will the value of GUI. When we’re running inside RStudio this is just “RStudio” and from the R.app that’s bundled with the R distribution it’s “AQUA” on OSX and “Rgui” on Windows. Worse, from my command line (Terminal) GUI is “X11”, even though it uses Aqua for the graphics instead. Presumably it’s this same value on Unix and Linux where we really would be using X11.

We can also ask about the R that we’re running by looking at the R.version list. On my machine this has os as “darwin13.4.0”, which is much the same information that Sys.info() presents more neatly.

Amusingly, the help page for R.version can’t quite decide whether or not we should use os to determine which operating system we’re running on: The ‘Note’ tells us not to use it, and the second piece of example code uses it anyway with a comment about it being a “good way to detect OSX”.

Another source of information is the Sys.info() function. On my machine it says that sysname is “Darwin”, which for various not very interesting reasons confirms that I’m running OSX. More straightforwardly, on Windows it’s “Windows” and on Linux it’s “Linux”. That all looks like what we want, except for the cryptic note in the Help pages suggesting that this function is not always implemented.

I’ve never used a machine where Sys.info() returned NULL. Have you?

To sum up then. The easiest way to find out what we want to know is to check Sys.info()["sysname"], remembering that “Darwin” means it’s OSX.
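
As a quick one-liner (a sketch that assumes Sys.info() is implemented on the machine at hand):

# Map sysname onto a friendlier label; anything unexpected falls through to "unknown"
switch(Sys.info()[["sysname"]],
       Windows = "windows",
       Darwin  = "osx",
       Linux   = "linux",
       "unknown")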

If we’re more cautious or we’re on a mystery machine where Sys.info really isn’t implemented, then we can first check whether .Platform$OS.type is “windows” or “unix” and if it’s “unix” look carefully at the R.version$os string. This backup route is a bit more involved, so I’ve wrapped everything up in a function that tries to return the lower case name of the operating system we’re running on.

get_os <- function(){
  sysinf <- Sys.info()
  if (!is.null(sysinf)){
    os <- sysinf['sysname']
    if (os == 'Darwin')
      os <- "osx"
  } else { ## mystery machine
    os <- .Platform$OS.type
    if (grepl("^darwin", R.version$os))
      os <- "osx"
    if (grepl("linux-gnu", R.version$os))
      os <- "linux"
  }
  tolower(os)
}

Corrections or improvements are welcome.

To leave a comment for the author, please follow the link and comment on his blog: Just the kind of thing you were expecting » R.


In case you missed it: May 2015 roundup


(This article was first published on Revolutions, and kindly contributed to R-bloggers)

In case you missed them, here are some articles from May of particular interest to R users.

RStudio 0.99 released with improved autocomplete and data viewer features.

A tutorial on the new Naive Bayes classifier in the RevoScaleR package.

R is the most popular Predictive Analytics / Data Mining / Data Science software in the latest KDnuggets poll.

A Shiny application predicts the winner of baseball games mid-game using R.

A list of over 100 open data sources you can use with R.

Revolution R Open 3.2.0 now available, following RRO 8.0.3.

A review of talks at the Extremely Large Databases conference, featuring Stephen Wolfram and John Chambers.

My TechCrunch article on the impact of open source software on business features several R examples.

You can improve performance of R even further by using Revolution R Open with Intel Phi coprocessors.

New features in Revolution R Enterprise 7.4, now available.

The next release of SQL Server will run R in-database.

Create embeddable, interactive graphics in R with htmlwidgets.

Computerworld reviews R packages for data wrangling.

A tutorial on using data stored in the Azure cloud with R.

Using histograms as points in scatterplots, and other embedded plots in R.

A comparison of data frames, data.table, and dplyr with a random walks problem.

A video on using R for human resources optimization.

How to call R and Python from base SAS.

General interest stories (not related to R) in the past month included: a song written by an iPhone, a Facebook algorithm that tells when “like” becomes “love”, a map of light pollution and a machine-learning application that tells you how old you look.

As always, thanks for the comments and please send any suggestions to me at davidsmi@microsoft.com. Don't forget you can follow the blog using an RSS reader, via email using blogtrottr, or by following me on Twitter (I'm @revodavid). You can find roundups of previous months here.

To leave a comment for the author, please follow the link and comment on his blog: Revolutions.


Spotting Potential Battles in F1 Races


(This article was first published on OUseful.Info, the blog... » Rstats, and kindly contributed to R-bloggers)

Over the last couple of races, I’ve started trying to review a variety of battlemaps for various drivers in each race. Prompted by an email request for more info around the battlemaps, I generated a new sketch charting the on track gaps between each driver and the lap leader for each lap of the race (How the F1 Canadian Grand Prix Race Evolved on Track).

Colour is used to identify cars on the lead lap compared to lapped drivers. For lapped drivers, a count of how many laps they are behind the leader is displayed. I additionally overplot with a highlight for a specified driver, as well as adding a mark that shows the on-track position of the leader of the next lap, along with their driver code.

[Figure: on-track gap to leader (x-axis) versus lead lap (y-axis), with lapped drivers labelled by the number of laps they are behind]

Battles can be identified through the close proximity of two or more drivers within a lap, across several laps. The ‘next-lap-leader’ time at the far right shows how close the leader on the next lead lap is to the backmarker (on track) on the current lead lap.

By highlighting two particular drivers, we could compare how their races evolved, perhaps highlighting different strategies used within a race that eventually bring the drivers into a close competitive battle in the last few laps of a race.

The unchanging leader-on-track-delta-of-0 line is perhaps missing an informational opportunity? For example, should we set the leader’s time to be the delta compared to the lap time for the leader laps from the previous lead lap? Or a delta compared to the fastest laptime on the previous lead lap? And if we do start messing about with an offset to the leader’s lap time, we presumably need to apply the same offset to the laptime of everyone else on the lap so we can still see the comparative on-track gaps to leader?

On the to-do list are various strategies for automatically identifying potential battles based on a variety of in-lap and across-lap heuristics.

Here’s the code:

#Required packages: plyr provides ddply() and ggplot2 provides ggplot();
#lapsData.df() and battlemap_encoder() are helper functions that come with the
#Wrangling F1 Data With R code, not base R
library(plyr)
library(ggplot2)

#Grab some data
lapTimes = lapsData.df(2015, 7)

#Process the laptimes
lapTimes = battlemap_encoder(lapTimes)

#Find the accumulated race time at the start of each leader's lap
lapTimes=ddply(lapTimes,.(leadlap),transform,lstart=min(acctime))

#Find the on-track gap to leader
lapTimes['trackdiff']=lapTimes['acctime']-lapTimes['lstart']

#Construct a dataframe that contains the difference between the 
#leader accumulated laptime on current lap and next lap
#i.e. how far behind current lap leader is next-lap leader?
ll=data.frame(t=diff(lapTimes[lapTimes['position']==1,'acctime']))
#Generate a de facto lap count
ll['n']=1:nrow(ll)
#Grab the code of the lap leader on the next lap
ll['c']=lapTimes[lapTimes['position']==1 & lapTimes['lap']>1,'code']

#Plot the on-track gap to leader versus leader lap
g = ggplot(lapTimes) 
g = g + geom_point(aes(x=trackdiff,y=leadlap,col=(lap==leadlap)), pch=1)
g= g + geom_point(data=lapTimes[lapTimes['driverId']=='vettel',],
                  aes(x=trackdiff,y=leadlap), pch='+')
g=g + geom_text(data=lapTimes[lapTimes['lapsbehind']>0,],
                  aes(x=trackdiff,y=leadlap, label=lapsbehind),size=3)
g = g + geom_point(data=ll,aes(x=t, y=n), pch='x')
g = g + geom_text(data=ll,aes(x=t+3, y=n,label=c), size=2)
g = g + geom_vline(aes(xintercept=17), linetype=3)
g

This chart will be included in a future update to the Wrangling F1 Data With R book. I hope to do a sprint on that book to tidy it up and get it into a reasonably edited state in the next few weeks. At that point the text will probably be frozen, a print-on-demand version generated, and, if it ends up on Amazon, the minimum price hiked considerably.


To leave a comment for the author, please follow the link and comment on his blog: OUseful.Info, the blog... » Rstats.


GPU-Accelerated R in the Cloud with Teraproc Cluster-as-a-Service


(This article was first published on Teraproc - Application Cluster-as-a-Service » R-blog, and kindly contributed to R-bloggers)

Analysis of statistical algorithms can generate workloads that run for hours, if not days, tying up a single computer. Many statisticians and data scientists write complex simulations and statistical analysis using the R statistical computing environment. Often these programs have a very long run time. Given the amount of time R programmers can spend waiting for results, it makes sense to take advantage of parallelism in the computation and the available hardware.

In a previous post on the Teraproc blog, I discussed the value of parallelism for long-running R models, and showed how multi-core and multi-node parallelism can reduce run times. In this blog I’ll examine another way to leverage parallelism in R, harnessing the processing cores in a general-purpose graphics processing unit (GPU) to dramatically accelerate commonly used clustering algorithms in R. The most widely used GPUs for GPU computing are the NVIDIA Tesla series. A Tesla K40 GPU has 2,880 integrated cores, 12 GB of memory with 288 GB/sec of bandwidth delivering up to 5 trillion floating point calculations per second.

The examples in this post build on the excellent work of Mr. Chi Yau available at r-tutor.com. Chi is the author of the CRAN open-source rpud package as well as rpudplus, R libraries that make it easy for developers to harness the power of GPUs without programming directly in CUDA C++. To learn more about R and parallel programming with GPUs you can download Chi’s e-book. For illustration purposes, I’ll focus on an example involving distance calculations and hierarchical clustering, but you can use the rpud package to accelerate a variety of applications.

Hierarchical Clustering in R

Cluster analysis, or clustering, is the process of grouping objects such that objects in the same cluster are more similar (by a given metric) to each other than to objects in other clusters. Cluster analysis is a problem with significant parallelism. In a post on the Teraproc blog we showed an example that involved clustering analysis using k-means. In this post we’ll look at hierarchical clustering in R with hclust, a function that makes it simple to create a dendrogram (a tree diagram as in Figure 1) based on differences between observations. This type of analysis is useful in all kinds of applications from taxonomy to cancer research to time-series analysis of financial data.

Figure 1: Dendrogram created using hierarchical clustering in R.
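
For anyone who has not used hclust before, a tiny CPU-only example on a built-in dataset (not part of the original post) shows the basic dist-then-hclust workflow:

# Small illustration of the dist() + hclust() workflow on a built-in dataset
d  <- dist(USArrests)           # Euclidean distances between the 50 state rows
hc <- hclust(d)                 # complete linkage by default
plot(hc, hang = -1, cex = 0.6)  # draw the dendrogram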

Similar to our k-means example, grouping observations in a hierarchical fashion depends on being able to quantify the differences (or distances) between observations. This means calculating the Euclidean distance between pairs of observations (think of this as the Pythagorean Theorem extended to more dimensions). Chi Yau explains this in his two posts Distance Matrix by GPU and Hierarchical Cluster Analysis, so we won’t attempt to cover all the details here.

R’s hclust function accepts a matrix of previously computed distances between observations. The dist function in R computes the distances between rows in a dataset, supporting multiple methods including Euclidean distance (the default). If I have a set of M observations (rows), each with N attributes (columns), for each distance calculation I need to compute the length of a vector in N-dimensional space between the observations. There are M * (M-1) / 2 discrete distance calculations between all rows. Thus, computation scales as the square of the number of observations: for 10 observations I need 45 distance calculations, for 100 observations I need 4,950, and for 100,000 observations I need 4,999,950,000 (almost 5 billion) distance calculations. As you can see, dist can get expensive for large datasets.
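
The quadratic growth is easy to confirm directly in R:

# Number of pairwise distances among M rows: M * (M - 1) / 2, i.e. choose(M, 2)
n_dist <- function(M) M * (M - 1) / 2
n_dist(c(10, 100, 100000))   # 45, 4950 and 4,999,950,000, as in the text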

Deploying a GPU Environment for Our Test

Before I can start running applications, first I need access to a system with a GPU. Fortunately, for a modest price I can rent a machine with GPUs for a couple of hours. In preparing this blog I used two hours of machine time on the Teraproc R Analytics Cluster-as-a-Service. The service leverages Amazon EC2 and the total cost for machine time was $1.30; quite a bit cheaper than setting up my own machine! The reason I was able to use so little time is because the process of installing the cluster is fully automated by the Teraproc service. Teraproc’s R Cluster-as-a-Service provides CUDA, R, R Studio and other required software components pre-installed and ready to use. OpenLava and NFS are also configured automatically, giving me the option to extend the cluster across many GPU capable compute nodes and optionally use Amazon spot pricing to cut costs.

I deployed a one-node cluster on Teraproc.com using the Amazon g2.2xlarge machine type as shown below. I could have installed the g2.2xlarge instance myself from the Amazon EC2 console, but in this case I would have needed to install R, R Studio and configure the environment myself resulting in spending more time and money. You can learn how to set up an R cluster yourself on different node types (including free machines) at the Teraproc R Analytics Cluster-as-a-Service website. If you already have an Amazon EC2 account you can set up a cluster in as little as five minutes.

The g2.2xlarge machine instance is a Sandy Bridge based machine with 8 cores / vCPUs on a Xeon E5-2670 processor, 15 GB of memory, a solid state disk drive and an NVIDIA GRID K520 GPU. The on-demand price for this machine is $0.65 per hour. The NVIDIA GRID K520 has two GK104 graphics processors, each with 1,536 cores, on a single card with 8 GB of RAM.

Running Our Example

First we use the teraproc.com R-as-a-cluster service to provision the R environment, making sure that we select the correct machine type (G22xlarge) and install a one-node cluster, as Figure 2 shows. This automatically deploys a single-node cluster complete with R Studio and provides us with a URL to access the R Studio environment.

Figure 2: Using Teraproc R Analytics Cluster-as-a-Service to start a GPU-Accelerated R Studio Cluster.

Using the shell function within R Studio (under the Tools menu), I can run an operating system command to make sure that the GPU is present on the machine.

gordsissons@ip-10-0-93-199:~$ lspci | grep -i nvidia
00:03.0 VGA compatible controller: NVIDIA Corporation GK104GL [GRID K520] (rev a1)

To use the rpud package to access GPU functions we need to install it first. Run the command below from inside R-Studio to install rpud from CRAN.

> install.packages("rpud")

Next, use the library command to access rpud functions.

> library(rpud)
Rpudplus 0.5.0
http://www.r-tutor.com
Copyright (C) 2010-2015 Chi Yau. All Rights Reserved.
Rpudplus is free for academic use only. There is absolutely NO warranty.

If everything is working properly, we should be able to see the GPU on our Amazon instance from the R command prompt by calling rpuGetDevice.

> rpuGetDevice()
GRID K520 GPU
[1] 0

The following listing shows a sample R program that compares performance of a hierarchical clustering algorithm with and without GPU acceleration. The first step is to create a suitable dataset where we can control the number of observations (rows) as well as the number of dimensions (columns) for each observation. The test.data function returns a matrix of random values based on the number of rows and columns provided.

The run_cpu function calculates all the distances between the observations (rows) using R’s dist function, and then runs R’s native hclust function against the computed distances stored in dcpu to create a dendrogram. The run_gpu function performs exactly the same computations using the GPU-optimized versions of dist and hclust (rpuDist and rpuHclust) from the rpud package.

The R script creates a matrix m of a particular size by calling test.data and then measures and displays the time required to create a hierarchical cluster using both the CPU and GPU functions.

library("rpud")
#
# function to populate a data matrix
#
test.data <- function(dim, num, seed=10) {
    set.seed(seed) 
    matrix(rnorm(dim * num), nrow=num)
}

run_cpu <- function(matrix) {
    dcpu <- dist(matrix)
    hclust(dcpu)
}

run_gpu <- function(matrix) {
    dgpu <- rpuDist(matrix)
    rpuHclust(dgpu)
}

#
# create a matrix with 20,000 observations each with 100 data elements
#
m <- test.data(100, 20000)

#
# Run dist and hclust to calculate hierarchical clusters using CPU
#

print("Calculating hclust with Sandy Bridge CPU")
print(system.time(cpuhclust <-run_cpu(m)))

#
# Run dist and hclust to calculate hierarchical clusters using GPU
#

print("Calculating hclust with NVIDIA K520 GPU")
print(system.time(gpuhclust <- run_gpu(m)))

Running the script yields the following results:

>source('~/examples/rgpu_hclust.R')
[1] "Calculating hclust with Sandy Bridge CPU"   
user     system elapsed
294.760   0.746 295.314
[1] "Calculating hclust with NVIDIA K520 GPU"   
user     system elapsed 
19.285    3.160  22.431

To explore the GPU vs. CPU speedup, we ran the script on datasets with a varying number of rows and plotted the results. The distance calculation is highly parallel on the GPU, while much of the GPU-optimized hclust calculation runs on the CPU. For this reason the calculation scales well as the dataset gets larger and the time required for the distance calculations dominates.

Number of rows   Number of dimensions   Total elements   # distance calculations   CPU time (seconds)   GPU time (seconds)   Speed-up
         1,000                    100          100,000                 1,998,000                 0.50                 0.04       11.8
         2,000                    100          200,000                 7,996,000                 2.06                 0.17       12.1
         5,000                    100          500,000                49,990,000                13.42                 1.17       11.5
        10,000                    100        1,000,000               199,980,000                59.83                 5.03       11.9
        15,000                    100        1,500,000               449,970,000               141.15                11.61       12.2
        20,000                    100        2,000,000               799,960,000               295.31                22.43       13.2
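
The benchmarking loop itself is not shown in the post; a sketch of how the scan over dataset sizes might be reproduced, reusing the test.data, run_cpu and run_gpu functions defined above:

# Sketch (not the original benchmarking code): time both versions over a range of sizes
sizes <- c(1000, 2000, 5000, 10000, 15000, 20000)
timings <- t(sapply(sizes, function(n) {
  m <- test.data(100, n)
  c(rows = n,
    cpu  = system.time(run_cpu(m))[["elapsed"]],
    gpu  = system.time(run_gpu(m))[["elapsed"]])
}))
timings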

Looking at the run times side by side, we see that running the multiple steps with the GPU is over ten times faster than running on the CPU alone.

[Figure: CPU vs. GPU elapsed time for the clustering runs, plotted against the number of rows]

The result of our analysis is a hierarchical cluster that we can display as a dendrogram like Figure 1 using R’s plot command.

> plot(gpuhclust,hang = -1)

Conclusion

Our results clearly show that running this type of analysis on a GPU makes a lot of sense. Not only can we complete calculations ten times faster, but just as importantly we can reduce the cost of resources required to do our work. We can use these efficiencies to do more thorough analysis and explore more scenarios. By using the Teraproc service, we make GPU computing much more accessible to R programmers who may not otherwise have access to GPU-capable nodes.

In a future post we’ll show how you can tackle very large analysis problems with clusters of GPU-capable machines. Try out Teraproc R Analytics Cluster-as-a-Service today! To learn about other ways to accelerate your R code with GPUs, check out the post Accelerate R Applications with CUDA by NVIDIA’s Patric Zhao.

To leave a comment for the author, please follow the link and comment on his blog: Teraproc - Application Cluster-as-a-Service » R-blog.


R User Groups are Everywhere


(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Joseph Rickert

In a little over three weeks useR! 2015 will convene in Aalborg, Denmark and I am looking forward to being there and learning and talking about R user groups. The following map shows the big picture for R User Groups around the world.

However, it is very difficult to keep it up to date. Just after the map "went to press" I learned that a new user group formed in Norfolk Virginia last month. In fact, at least 11 new R user groups have formed so far this year.

          RUG	               City	         Country	  Date Founded	No. Members	Website
Berlin R Users Group	       Berlin	         Germany	1/5/2015	133	http://www.meetup.com/Berlin-R-Users-Group/
Rhus - useR Group	       Aarhus	         Denmark	1/19/2015	87	http://www.meetup.com/Rhus-useR-group
Trenton R Users (TRU)	       Trenton	         US	        1/27/2015	50	http://www.meetup.com/TRUgroup/
Honolulu R Users Group	       Honolulu	         US	        1/30/2015	36	http://www.meetup.com/Honolulu-R-Users-Group/
Oslo useR! Group	       Oslo	         Norway	        2/26/2015	89	http://www.meetup.com/Oslo-useR-Group/
St. Petersburg R User Group    St Petersburg	 Russia	        3/13/2015	50	http://www.meetup.com/St-Petersburg-R-User-Group/
Nijmegen eveRybody	       Nijmegen	         Netherlands	4/11/2015	23	http://www.meetup.com/Nijmegen-eveRybody/
AthensR	                       Athens	         Greece	        4/16/2015	30	http://www.meetup.com/AthensR/
757 R Users Group	       Norfolk	         US	        4/22/2015	18	http://www.meetup.com/757-R-Users-Group/
LubbockR Meetup	               Lubbock	         US	        5/5/2015	23	http://www.meetup.com/lubbockR-Meetup/
R Kazan	                       Kazan	         Russia	        5/10/2015	11	http://www.meetup.com/R-Kazan/

Moreover, judging by the more than 4,000 users that showed up for the R China conference this week, China is probably severely underrepresented. There are very likely a few more user groups there and in the rest of Asia for us to learn about.

[Photo from the R China conference, via a tweet by Kun Ren]

The data set used to build the map may be obtained here: Download All_RUGS_6_9_15. In addition to the name and address information it contains fields for the date the user group was founded and an estimate of the number of members. For sites that use meetup.com as their web platform I collected the relevant information directly from the group home pages. For user groups using other platforms I estimated start date and number of members from other information available on the website.
                     RUG      City State       Country Date.Founded Num.Members                                       Website Platform
1 Adelaide R-users group  Adelaide    SA     Australia    10/1/2011          94 http://www.meetup.com/Adelaide-R-users-group/   Meetup
2   Albany R Users Group    Albany    NY United States    3/20/2014         104   http://www.meetup.com/Albany-R-Users-Group/   Meetup
3             amst-R-dam Amsterdam         Netherlands     9/9/2010         483             http://www.meetup.com/amst-R-dam/   Meetup
4 Turkish Community of R    Ankara              Turkey     3/8/2013         181              http://www.meetup.com/tcr-users/   Meetup
5      Rhus - useR group    Aarhus             Denmark    1/19/2015          87         http://www.meetup.com/Rhus-useR-group   Meetup
6                AthensR    Athens              Greece    4/16/2015          30                http://www.meetup.com/AthensR/   Meetup

We would like this information as well as our R User Group Directory to be as accurate as possible and would very much appreciate corrections, additions and subtractions from user group organizers.

If you are going to Aalborg and would like to chat about R User Groups please come by the Revolution Analytics / Microsoft table.

Here is the code used to draw the map: Download Code_for_RUGs_Map.

To leave a comment for the author, please follow the link and comment on his blog: Revolutions.
