
Some Impressions from R Finance 2015


(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Joseph Rickert

The R/Finance 2015 Conference wrapped up last Saturday at UIC. It has been seven years already, but R/Finance still has the magic: mostly very high-quality presentations and the opportunity to interact and talk shop with some of the most accomplished R developers, financial modelers and even a few industry legends, such as Emanuel Derman and Blair Hull.

Emanuel Derman led off with a provocative but extraordinary keynote talk. Derman began way out there, somewhere well beyond the left-field wall, recounting the struggle of Johannes Kepler to formulate his three laws of planetary motion, and closed with some practical advice on how to go about the business of financial modeling. Along the way he shared some profound, original thinking in an attempt to provide a theoretical context for evaluating and understanding the limitations of financial models. His argument hinged on making and defending the distinction between theories and models. Theories, such as the physical theories of Kepler, Newton and Einstein, are ontological: they attempt to say something about how the world is. A theory attempts to provide "absolute knowledge of the world". A model, on the other hand, "tells you about what some aspect of the world is like". Theories can be wrong, but they are not the kinds of things you can interrogate with "why" questions.

Models work through analogies and similarities: they compare something we understand to something we don't. Spinoza's theory of the emotions is a theory, not a model, because it attempts to explain human emotions axiomatically, from first principles.

[Figure: Spinoza's axiomatic treatment of the emotions]

The Black-Scholes equation, by contrast, is a model that tries to provide insight through an analogy with Brownian motion. As I understood it, the practical advice from all of this is to avoid the twin traps of attempting to axiomatize financial models as if they directly captured reality, and of believing that analyzing data, no matter how many terabytes you plow through, is a substitute for an educated intuition about how the world is.

The following list gives the remaining talks in alphabetical order by speaker, along with the associated package and its location (CRAN, GitHub, etc.) where one was featured.

1. Rohit Arora: Inefficiency of Modified VaR and ES
2. Kyle Balkissoon: A Framework for Integrating Portfolio-level Backtesting with Price and Quantity Information. Package: PortfolioAnalytics
3. Mark Bennett: Gaussian Mixture Models for Extreme Events
4. Oleg Bondarenko: High-Frequency Trading Invariants for Equity Index Futures
5. Matt Brigida: Markov Regime-Switching (and some State Space) Models in Energy Markets. Package: code for regime switching (GitHub)
6. John Burkett: Portfolio Optimization: Price Predictability, Utility Functions, Computational Methods, and Applications. Package: DEoptim (CRAN)
7. Matthew Clegg: The partialAR Package for Modeling Time Series with both Permanent and Transient Components. Package: partialAR (CRAN)
8. Yuanchu Dang: Credit Default Swaps with R (with Zijie Zhu). Package: CDS (GitHub)
9. Gergely Daróczi: Network analysis of the Hungarian interbank lending market
10. Sanjiv Das: Efficient Rebalancing of Taxable Portfolios
11. Sanjiv Das: Matrix Metrics: Network-Based Systemic Risk Scoring
12. Emanuel Derman: Understanding the World
13. Matthew Dixon: Risk Decomposition for Fund Managers
14. Matt Dowle: Fast automatic indexing with data.table. Package: data.table (CRAN)
15. Dirk Eddelbuettel: Rblpapi: Connecting R to the data service that shall not be named. Package: Rblpapi (GitHub)
16. Markus Gesmann: Communicating risk - a perspective from an insurer
17. Vincenzo Giordano: Quantifying the Risk and Price Impact of Energy Policy Events on Natural Gas Markets Using R (with Soumya Kalra)
18. Chris Green: Detecting Multivariate Financial Data Outliers using Calibrated Robust Mahalanobis Distances. Package: CerioliOutlierDetection (CRAN)
19. Rohini Grover: The informational role of algorithmic traders in the option market
20. Marius Hofert: Parallel and other simulations in R made easy: An end-to-end study. Package: simsalapar (CRAN)
21. Nicholas James: Efficient Multivariate Analysis of Change Points. Package: ecp (CRAN)
22. Kresimir Kalafatic: Financial network analysis using SWIFT and R
23. Michael Kapler: Follow the Leader - the application of time-lag series analysis to discover leaders in the S&P 500. Package: SIT (other)
24. Ilya Kipnis: Flexible Asset Allocation With Stepwise Correlation Rank
25. Rob Krzyzanowski: Building Better Credit Models through Deployable Analytics in R
26. Bryan Lewis: More thoughts on the SVD and Finance
27. Yujia Liu and Guy Yollin: Fundamental Factor Model DataBrowser using Tableau and R. Package: factorAnalytics (R-Forge)
28. Louis Marascio: An Outsider's Education in Quantitative Trading
29. Doug Martin: Nonparametric vs Parametric Shortfall: What are the Differences?
30. Alexander McNeil: R Tools for Understanding Credit Risk Modelling
31. William Nicholson: Structured Regularization for Large Vector Autoregression. Package: BigVAR (GitHub)
32. Steven Pav: Portfolio Cramer-Rao Bounds (why bad things happen to good quants). Package: SharpeR (CRAN)
33. Jerzy Pawlowski: Are High Frequency Traders Prudent and Temperate? Package: HighFreq (GitHub)
34. Bernhard Pfaff: The sequel of cccp: Solving cone constrained convex programs. Package: cccp (CRAN)
35. Stephen Rush: Information Diffusion in Equity Markets
36. Mark Seligman: The Arborist: a High-Performance Random Forest Implementation. Package: Rborist (CRAN)
37. Majeed Simaan: Global Minimum Variance Portfolio: a Horse Race of Volatilities
38. Anthoney Tsou: Implementation of Quality Minus Junk. Package: qmj (GitHub)
39. Marjan Wauters: Characteristic-based equity portfolios: economic value and dynamic style allocation
40. Hadley Wickham: Data ingest in R. Package: readr (CRAN)
41. Eric Zivot: Price Discovery Share - An Order Invariant Measure of Price Discovery with Application to Exchange-Traded Funds

I particularly enjoyed Sanjiv Das' talks on Efficient Rebalancing of Taxable Portfolios and Matrix Metrics: Network-Based Systemic Risk Scoring, both of which are approachable by non-specialists. Sanjiv became the first person to present two talks at an R/Finance conference, and he also won one of the best presentation prizes, with the judges unwilling to say which of his two presentations secured the award.

Bryan Lewis' talk: More thoughts on the SVD and Finance was also notable for its exposition. Listening to Bryan you can almost fool yourself into believing that you could develop a love for numerical analysis and willingly spend an inordinate amount of your time contemplating the stark elegance of matrix decompositions.

Alexander McNeil's talk: R Tools for Understanding Credit Risk Modeling was a concise and exceptionally coherent tutorial on the subject, an unusual format for a keynote talk, but something that I think will be valued by students when the slides for all of the presentations become available.

Going out on a limb a bit, I offer a few un-researched but strong impressions of the conference. This year, to a greater extent than I remember in previous years, talks were built around particular packages; talks 5, 7 and 8, for example. Also, it seemed that authors were more comfortable highlighting and sharing packages that are works in progress, residing not on CRAN but on GitHub, R-Forge and other platforms. This may reflect a larger trend in R culture.

This is the year that cointegration replaced correlation as the operative concept in many models. The quants are way out ahead of the statisticians and data scientists on this one. Follow the money!

Speaking of data scientists: if you are a Random Forests fan do check out Mark Seligman's Rborist package, a high-performance and extensible implementation of the Random Forests algorithm.

Network analysis also seemed to be an essential element of many presentations. Gergely Daróczi's Shiny app for his analysis of the Hungarian interbank lending network is a spectacular example of how interactive graphics can enhance an analysis.

Finally, I'll finish up with some suggested reading in preparation for studying the slides of the presentations when they become available.

Sanjiv Das: Efficient Rebalancing of Taxable Portfolios
Sanjiv Das: Matrix Metrics: Network-Based Systemic Risk Scoring
Emanuel Derman: Models.Behaving.Badly
Jurgen A. Doornik and R.J. O'Brien: Numerically Stable Cointegration Analysis (A recommendation from Bryan Lewis)
Arthur Koestler: The Sleepwalkers (I am certain this is the book whose title Derman forgot.)
Alexander J. McNeil and Rüdiger Frey: Quantitative Risk Management: Concepts, Techniques and Tools
Bernhard Pfaff: Analysis of Integrated and Cointegrated Time Series with R (Use R!)

To leave a comment for the author, please follow the link and comment on his blog: Revolutions.


R Recipe: RStudio and UNC Paths


(This article was first published on Exegetic Analytics » R, and kindly contributed to R-bloggers)

RStudio does not like Uniform Naming Convention (UNC) paths. This can be a problem if, for example, you install it under Citrix. The solution is to create a suitable environment file. This is what worked for me: I created an .Renviron file in my Documents folder on the Citrix remote drive. The file had the following contents:

R_LIBS_USER="H:/myCitrixFiles/Documents/R/win-library/3.1"
R_USER="H:/myCitrixFiles/Documents"

Simple. After that, the flurry of error and warning messages at startup disappeared and I was able to install packages without any problems.
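A quick way to confirm that the new environment file is being picked up is to query the values from the R console; this check is my own addition, not part of the original recipe:

# Sanity check (not from the original post): confirm the .Renviron values
# were read and that the user library path is the one we configured.
Sys.getenv(c("R_USER", "R_LIBS_USER"))
.libPaths()  # the R_LIBS_USER directory is listed here once it exists on disk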

The post R Recipe: RStudio and UNC Paths appeared first on Exegetic Analytics.

To leave a comment for the author, please follow the link and comment on his blog: Exegetic Analytics » R.


Welcome to AriLamstein.com!


(This article was first published on » R, and kindly contributed to R-bloggers)

Today I am happy to announce that I have migrated my blog from JustAnRBlog.wordpress.com to AriLamstein.com. I thought I would give an inside peek into the change in case others have an interest. Also, I would personally like to see more R packages and blogs, and I hope that this post can encourage others to take the plunge.

I created JustAnRBlog in March in order to solve a particular problem. I was releasing my first R package (choroplethrZip) that would not be available on CRAN, and needed to announce and document it. In that sense I expected my blog to be like a tweet, where I wrote something and moved on. I wound up writing about other package updates I did, as well as a tutorial that I ran. Again, though, I viewed what I wrote as announcements that I would “make and move on”.

Fast forward three months and you can see that I misunderstood how blogs work. Traffic steadily grew, to a level I didn’t expect for an R blog:

[Figure: blog traffic, showing steady growth since the blog launched in March]

The main reason for the increase, I think, is that the blog joined R-bloggers in April. A lot of people read that blog, and appearing there generates a lot of traffic. In short, it turns out that I didn’t really understand how much interest there is in new R packages. I may only know a few people in person who have an interest in this material, but Tal (the owner of R-bloggers) knows a ton.

The lesson? If you have an R package that you want to develop, I highly recommend doing so. Additionally, if you want to communicate with others about the package, I highly recommend creating a blog about it and getting the blog listed in R-bloggers. Due to the targeted traffic there, you will almost certainly connect with people who have an interest in your work.

The other reason for the move is that I wanted to do more consulting work. Normally when people start consulting they take out a URL for their name, create a site outlining the type of work they can help with, and add a contact form and blog to the site. Since one area I want to consult in is R, it made sense to merge the blog with the site.

As for the technical details of the blog and site, they use WordPress and are hosted on BlueHost. The original blog, of course, was hosted on WordPress.com. I am a big fan of WordPress, but I migrated to BlueHost largely because of their cost structure. For example, adding Google Analytics to a site on WordPress.com seems to require purchasing their Business plan, which is $299/yr. But on BlueHost you can add it as part of their basic plan which is only $3.49/month.

As for what to expect from this blog going forward, I mostly expect to do what I’ve been doing: announcing new versions of my open source projects, publishing my own analyses using software I’ve written, as well as announcing any tutorials that I schedule. The main difference is that I’ve picked a particular time to publish, which is Thursday mornings.

Looking for help with a software or data project? Contact me here.

To leave a comment for the author, please follow the link and comment on his blog: » R.


Can Bradley Wiggins Do It? Welcome to the Thunder-Drome!


(This article was first published on Graph of the Week, and kindly contributed to R-bloggers)
Many have tried. Most have failed.
Bradley Wiggins knows this. He also knows the ordeal he faces, knows the pain he will endure and knows the scrutiny he will face. It's nothing he hasn't experienced before, having raced and won the world's most prestigious cycling event: the Tour de France. This is a different animal, however. The demands placed upon his body will be very different from those of any road race in which he has competed. He will exert maximum effort under controlled conditions for exactly one hour, after which the distance he's covered will be measured.

There will be no other riders to chase nor any to attack. There will be no feed stations nor assistance of any kind. He will pedal within himself, in his own head, or as he calls it: his "escape" zone.

Welcome to the Thunder "drome", Sir Bradley Wiggins. Welcome to the World Hour Record.

This article was written by Patrick Rhodes and published on June 4, 2015. Click here to read the rest of the article.

To leave a comment for the author, please follow the link and comment on his blog: Graph of the Week.


A Practical Example of Calculating Padé Approximant Coefficients Using R


(This article was first published on Strange Attractors » R, and kindly contributed to R-bloggers)

Introduction

I recently had the opportunity to use Padé approximants. There is a lot of good information available online on the theory and applications of Padé approximants, but I had trouble finding a good example explaining just how to calculate the coefficients.

Basic Background

Hearken back to undergraduate calculus for a moment. For a given function, the Taylor series is the "best" polynomial representation of that function. If the function is being evaluated at 0, the Taylor series representation is also called the Maclaurin series. The error is proportional to the first "left-off" term. Also, the series is only a good estimate in a small radius around the point for which it is calculated (e.g. 0 for a Maclaurin series).

Padé approximants estimate functions as the quotient of two polynomials. Specifically, given a Taylor series expansion of a function T(x) of order m + n, there are two polynomials, P(x) of order m and Q(x) of order n, such that \frac{P(x)}{Q(x)}, called the Padé approximant of order [m/n], “agrees” with the original function in order m + n. You may ask, “but the Taylor series from whence it is derived is also of order m + n?” And you would be correct. However, the Padé approximant seems to consistently have a wider radius of convergence than its parent Taylor series, and, being a quotient, is composed of lower-degree polynomials. With the normalization that the first term of Q(x) is always 1, there is a set of linear equations which will generate the unique Padé approximant coefficients. Letting a_n be the coefficients for the Taylor series, one can solve:

(1)
\begin{align*}
a_0 &= p_0\\
a_1 + a_0 q_1 &= p_1\\
a_2 + a_1 q_1 + a_0 q_2 &= p_2\\
a_3 + a_2 q_1 + a_1 q_2 + a_0 q_3 &= p_3\\
a_4 + a_3 q_1 + a_2 q_2 + a_1 q_3 + a_0 q_4 &= p_4\\
&\vdots\\
a_{m+n} + a_{m+n-1} q_1 + \ldots + a_0 q_{m+n} &= p_{m+n}
\end{align*}

remembering that all p_k, k > m and q_k, k > n are 0.

There is a lot of research on the theory of Padé approximants and Padé tables, how they work, their relationship to continued fractions, and why they work so well. For example, the interested reader is directed to Baker (1975), Van Assche (2006), Wikipedia, and MathWorld for more.

Practical Example

The function \log(1 + x) will be used as the example. This function has a special implementation in almost all computer languages, often called log1p(x), because the naïve implementation log(1 + x) suffers catastrophic floating-point error for x near 0.

The Maclaurin series expansion for \log(1 + x) is:

    \[ \sum_{k=1}^\infty (-1)^{k+1}\frac{x^k}{k} = x - \frac{1}{2}x^2 + \frac{1}{3}x^3 - \frac{1}{4}x^4 + \frac{1}{5}x^5 - \frac{1}{6}x^6\ldots \]

The code below will compare the Padé [3/3] approximant with the 6-term Maclaurin series, which is itself the Padé [6/0] approximant. First, we calculate the coefficients. We know the Maclaurin coefficients: they are 0, 1, -\frac{1}{2}, \frac{1}{3}, -\frac{1}{4}, \frac{1}{5}, -\frac{1}{6}. Therefore, the system of linear equations looks like this:

(2)
\begin{align*}
0 &= p_0\\
1 + 0q_1 &= p_1\\
-\frac{1}{2} + q_1 + 0q_2 &= p_2\\
\frac{1}{3} - \frac{1}{2}q_1 + q_2 + 0q_3 &= p_3\\
-\frac{1}{4} + \frac{1}{3}q_1 - \frac{1}{2}q_2 + q_3 &= 0\\
\frac{1}{5} - \frac{1}{4}q_1 + \frac{1}{3}q_2 - \frac{1}{2}q_3 &= 0\\
-\frac{1}{6} + \frac{1}{5}q_1 - \frac{1}{4}q_2 + \frac{1}{3}q_3 &= 0
\end{align*}

Rewriting in terms of the known a_n coefficients, we get:

(3)
\begin{align*}
-p_0 &= 0\\
0q_1 - p_1 &= -1\\
q_1 + 0q_2 - p_2 &= \frac{1}{2}\\
-\frac{1}{2}q_1 + q_2 + 0q_3 - p_3 &= -\frac{1}{3}\\
\frac{1}{3}q_1 - \frac{1}{2}q_2 + q_3 &= \frac{1}{4}\\
-\frac{1}{4}q_1 + \frac{1}{3}q_2 - \frac{1}{2}q_3 &= -\frac{1}{5}\\
\frac{1}{5}q_1 - \frac{1}{4}q_2 + \frac{1}{3}q_3 &= \frac{1}{6}
\end{align*}

We can solve this in R using solve:

# Rows follow system (3); columns are the unknowns (q1, q2, q3, p0, p1, p2, p3),
# and the data vector fills the matrix column by column.
A <- matrix(c(0, 0, 1, -.5, 1 / 3, -.25, .2,
              0, 0, 0, 1, -.5, 1 / 3, -.25,
              0, 0, 0, 0, 1, -.5, 1 / 3,
              -1, 0, 0, 0, 0, 0, 0,
              0, -1, 0, 0, 0, 0, 0,
              0, 0, -1, 0, 0, 0, 0,
              0, 0, 0, -1, 0, 0, 0), ncol = 7)
B <- c(0, -1, .5, -1/3, .25, -.2, 1/6)
P_Coeff <- solve(A, B)
print(P_Coeff)

## [1] 1.5000000 0.6000000 0.0500000 0.0000000 1.0000000 1.0000000 0.1833333
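The same coefficients can be generated mechanically for any order [m/n] directly from the Taylor coefficients. The helper below is my own sketch, not part of the original post; it simply automates the system written out above:

# Sketch (not from the original post): solve for the Pade [m/n] coefficients
# given Taylor coefficients a = c(a_0, ..., a_{m+n}), with q_0 normalized to 1.
pade_coef <- function(a, m, n) {
  stopifnot(m >= 0, n >= 1, length(a) >= m + n + 1)
  a <- a[1:(m + n + 1)]
  N <- m + n + 1
  A <- matrix(0, N, N)                              # unknowns: (q_1..q_n, p_0..p_m)
  for (k in 0:(m + n)) {
    for (j in 1:n) {
      if (k - j >= 0) A[k + 1, j] <- a[k - j + 1]   # coefficient of q_j in row k
    }
    if (k <= m) A[k + 1, n + k + 1] <- -1           # coefficient of p_k in row k
  }
  sol <- solve(A, -a)                               # row k: sum_j a_{k-j} q_j - p_k = -a_k
  list(p = sol[(n + 1):(n + m + 1)], q = c(1, sol[1:n]))
}

# For log(1 + x) with m = n = 3, this reproduces the solution above:
# q = (1, 1.5, 0.6, 0.05) and p = (0, 1, 1, 0.1833...).
pade_coef(c(0, 1, -1/2, 1/3, -1/4, 1/5, -1/6), m = 3, n = 3)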

Now we can create the estimating functions:
ML <- function(x){x - .5 * x ^ 2 + x ^ 3 / 3 - .25 * x ^ 4 + .2 * x ^ 5 - x ^ 6 / 6}

PD33 <- function(x){
  NUMER <- x + x ^ 2 + 0.1833333333333333 * x ^ 3
  DENOM <- 1 + 1.5 * x + .6 * x ^ 2 + 0.05 * x ^ 3
  return(NUMER / DENOM)
}

Let’s compare the behavior of these functions around 0 with the naïve and sophisticated implementations of \log(1+x) in R.
library(dplyr)
library(ggplot2)
library(tidyr)
D <- seq(-1e-2, 1e-2, 1e-6)
RelErr <- tbl_df(data.frame(X = D, Naive = (log(1 + D) - log1p(D)) / log1p(D), MacL = (ML(D) - log1p(D)) / log1p(D), Pade = (PD33(D) - log1p(D)) / log1p(D)))
RelErr2 <- gather(RelErr, Type, Error, -X)
RelErr2 %>% group_by(Type) %>% summarize(MeanError = mean(Error, na.rm = TRUE)) %>% knitr::kable(digits = 18)
Type MeanError
Naïve -4.3280e-15
MacL -2.0417e-14
Pade -5.2000e-17

Graphing the relative error in a small area around 0 shows the differing behaviors. First, against the naïve implementation, both estimates do much better.

ggplot(RelErr2, aes(x = X)) + geom_point(aes(y = Error, colour = Type), alpha = 0.5)

[Figure: relative error of the naïve, Maclaurin, and Padé implementations near 0]

But when compared one against the other, the Padé approximant (blue) shows better behavior than the Maclaurin (red), and its relative error stays below EPS for a wider swath.

ggplot(RelErr, aes(x = X)) + geom_point(aes(y = MacL), colour = 'red', alpha = 0.5) + geom_point(aes(y = Pade), colour = 'blue', alpha = 0.5)

[Figure: relative error of the Maclaurin (red) and Padé (blue) approximations near 0]

Just for fun, restricting the y-axis to the same range as above, overlaying the naïve formulation (green) looks like this:

ggplot(RelErr, aes(x = X)) + geom_point(aes(y = Naive), colour = 'green', alpha = 0.5) + scale_y_continuous(limits = c(-1.5e-13, 0)) + geom_point(aes(y = MacL), colour = 'red', alpha = 0.5) + geom_point(aes(y = Pade), colour = 'blue', alpha = 0.5)

[Figure: the same plot with the naïve implementation (green) overlaid]

There are certainly more efficient and elegant ways to calculate Padé approximants, but I found this exercise helpful, and I hope you do as well!

References

  • Baker, G. A. Essentials of Padé Approximants. Academic Press, 1975.
  • Van Assche, W. Padé and Hermite-Padé approximation and orthogonality. ArXiv Mathematics e-prints, 2006.

To leave a comment for the author, please follow the link and comment on his blog: Strange Attractors » R.


Beautiful table-outputs: Summarizing mixed effects models #rstats


(This article was first published on Strenge Jacke! » R, and kindly contributed to R-bloggers)

The current version 1.8.1 of my sjPlot package has two new functions to easily summarize mixed effects models as HTML tables: sjt.lmer and sjt.glmer. Both are very similar, so I focus on showing how to use sjt.lmer here.

# load required packages
library(sjPlot) # table functions
library(sjmisc) # sample data
library(lme4) # fitting models

Linear mixed models summaries as HTML table

The sjt.lmer function prints summaries of linear mixed models (fitted with the lmer function of the lme4 package) as nicely formatted HTML tables. First, some sample models are fitted:

# load sample data
data(efc)
# prepare grouping variables
efc$grp = as.factor(efc$e15relat)
levels(x = efc$grp) <- get_val_labels(efc$e15relat)
efc$care.level <- as.factor(rec(efc$n4pstu, "0=0;1=1;2=2;3:4=4"))
levels(x = efc$care.level) <- c("none", "I", "II", "III")

# data frame for fitted model
mydf <- data.frame(neg_c_7 = as.numeric(efc$neg_c_7),
                   sex = as.factor(efc$c161sex),
                   c12hour = as.numeric(efc$c12hour),
                   barthel = as.numeric(efc$barthtot),
                   education = as.factor(efc$c172code),
                   grp = efc$grp,
                   carelevel = efc$care.level)

# fit sample models
fit1 <- lmer(neg_c_7 ~ sex + c12hour + barthel + (1|grp), data = mydf)
fit2 <- lmer(neg_c_7 ~ sex + c12hour + education + barthel + (1|grp), data = mydf)
fit3 <- lmer(neg_c_7 ~ sex + c12hour + education + barthel +
              (1|grp) +
              (1|carelevel), data = mydf)

The simplest way of producing the table output is by passing the fitted models as parameters. By default, estimates (B), confidence intervals (CI) and p-values (p) are reported. The models are named Model 1 and Model 2. The resulting table is divided into three parts:

  • Fixed parts – the model’s fixed effects coefficients, including confidence intervals and p-values.
  • Random parts – the model’s group count (the number of random intercepts) as well as the intraclass correlation coefficient, ICC (cross-checked by hand in the sketch after the first table below).
  • Summary – observations, AIC etc.
sjt.lmer(fit1, fit2)

Note that, due to the WordPress CSS, the resulting HTML table looks different in this blog post compared to the usual output in R!

Model 1 Model 2
B CI p B CI p
Fixed Parts
(Intercept) 14.14 13.15 – 15.12 <.001 13.75 12.63 – 14.87 <.001
sex2 0.48 -0.07 – 1.03 .087 0.67 0.10 – 1.25 .020
c12hour 0.00 -0.00 – 0.01 .233 0.00 -0.00 – 0.01 .214
barthel -0.05 -0.06 – -0.04 <.001 -0.05 -0.06 – -0.04 <.001
education2 0.19 -0.43 – 0.80 .098
education3 0.80 0.03 – 1.58 .098
Random Parts
Ngrp 8 8
ICCgrp 0.022 0.021
Observations 872 815
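The ICC reported in the Random parts section can be verified by hand from the fitted model; the following is my own sketch and not part of sjPlot:

# By-hand check (not part of sjPlot): the ICC is the random-intercept variance
# divided by the sum of the random-intercept and residual variances.
vc <- as.data.frame(VarCorr(fit1))
tau00  <- vc$vcov[vc$grp == "grp"]       # between-group (random intercept) variance
sigma2 <- vc$vcov[vc$grp == "Residual"]  # residual variance
tau00 / (tau00 + sigma2)                 # roughly the ICC(grp) of about 0.02 shown above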

Customizing labels

Here is an example how to change the labels. Note that showHeaderStrings makes the two labels on top and top left corner appear in the table.

sjt.lmer(fit1,
         fit2,
         showHeaderStrings = TRUE,
         stringB = "Estimate",
         stringCI = "Conf. Int.",
         stringP = "p-value",
         stringDependentVariables = "Response",
         stringPredictors = "Coefficients",
         stringIntercept = "Konstante",
         labelDependentVariables = c("Negative Impact",
                                     "Negative Impact"))
Coefficients Response
Negative Impact Negative Impact
Estimate Conf. Int. p-value Estimate Conf. Int. p-value
Fixed Parts
Konstante 14.14 13.15 – 15.12 <.001 13.75 12.63 – 14.87 <.001
sex2 0.48 -0.07 – 1.03 .087 0.67 0.10 – 1.25 .020
c12hour 0.00 -0.00 – 0.01 .233 0.00 -0.00 – 0.01 .214
barthel -0.05 -0.06 – -0.04 <.001 -0.05 -0.06 – -0.04 <.001
education2 0.19 -0.43 – 0.80 .098
education3 0.80 0.03 – 1.58 .098
Random Parts
Ngrp 8 8
ICCgrp 0.022 0.021
Observations 872 815

Custom variable labels

To change variable labels in the plot, use the labelPredictors parameter:

sjt.lmer(fit1, fit2,
         labelPredictors = c("Carer's Sex",
                             "Hours of Care",
                             "Elder's Dependency",
                             "Mid Educational Level",
                             "High Educational Level"))
Model 1 Model 2
B CI p B CI p
Fixed Parts
(Intercept) 14.14 13.15 – 15.12 <.001 13.75 12.63 – 14.87 <.001
Carer’s Sex 0.48 -0.07 – 1.03 .087 0.67 0.10 – 1.25 .020
Hours of Care 0.00 -0.00 – 0.01 .233 0.00 -0.00 – 0.01 .214
Elder’s Dependency -0.05 -0.06 – -0.04 <.001 -0.05 -0.06 – -0.04 <.001
Mid Educational Level 0.19 -0.43 – 0.80 .098
High Educational Level 0.80 0.03 – 1.58 .098
Random Parts
Ngrp 8 8
ICCgrp 0.022 0.021
Observations 872 815

Changing table style

You can change the table style with specific parameters, e.g. to include CI into the same table cell as the estimates, print asterisks instead of numeric p-values etc.

sjt.lmer(fit1, fit2,
         separateConfColumn = FALSE, # ci in same cell as estimates
         showStdBeta = TRUE,         # also show standardized beta values
         pvaluesAsNumbers = FALSE)   # "*" instead of numeric values
                Model 1                                            Model 2
                B (CI)                     std. Beta (CI)          B (CI)                     std. Beta (CI)
Fixed Parts
(Intercept)     14.14 (13.15 – 15.12) ***                          13.75 (12.63 – 14.87) ***
sex2            0.48 (-0.07 – 1.03)        0.05 (-0.01 – 0.11)     0.67 (0.10 – 1.25) *       0.07 (0.01 – 0.14)
c12hour         0.00 (-0.00 – 0.01)        0.04 (-0.03 – 0.12)     0.00 (-0.00 – 0.01)        0.05 (-0.03 – 0.12)
barthel         -0.05 (-0.06 – -0.04) ***  -0.37 (-0.44 – -0.30)   -0.05 (-0.06 – -0.04) ***  -0.37 (-0.44 – -0.30)
education2                                                         0.19 (-0.43 – 0.80)        0.02 (-0.05 – 0.10)
education3                                                         0.80 (0.03 – 1.58)         0.08 (0.00 – 0.16)
Random Parts
Ngrp            8                                                  8
ICCgrp          0.022                                              0.021
Observations    872                                                815
Notes           * p<.05   ** p<.01   *** p<.001

Models with different random intercepts

When models have different random intercepts, the sjt.lmer function tries to detect this information from each model. Information on the multiple grouping levels and ICCs is then printed in the Random parts section of the table.

sjt.lmer(fit1, fit2, fit3)
Model 1 Model 2 Model 3
B CI p B CI p B CI p
Fixed Parts
(Intercept) 14.14 13.15 – 15.12 <.001 13.75 12.63 – 14.87 <.001 13.76 12.63 – 14.88 <.001
sex2 0.48 -0.07 – 1.03 .087 0.67 0.10 – 1.25 .020 0.65 0.08 – 1.22 .026
c12hour 0.00 -0.00 – 0.01 .233 0.00 -0.00 – 0.01 .214 0.00 -0.00 – 0.01 .205
barthel -0.05 -0.06 – -0.04 <.001 -0.05 -0.06 – -0.04 <.001 -0.05 -0.06 – -0.04 <.001
education2 0.19 -0.43 – 0.80 .098 0.16 -0.46 – 0.79 .103
education3 0.80 0.03 – 1.58 .098 0.79 0.01 – 1.57 .103
Random Parts
Ngrp 8 8 8
Ncarelevel 4
ICCgrp 0.022 0.021 0.021
ICCcarelevel 0.000
Observations 872 815 807

Note that in certain cases, depending on the order of fitted models with several random intercepts, the group label might be incorrect.

Further examples

More details on the sjt.lmer function can be found in this online-manual.


Tagged: data visualization, mixed effects, multilevel, R, rstats, sjPlot, Statistik, table output

To leave a comment for the author, please follow the link and comment on his blog: Strenge Jacke! » R.


Momentum, Markowitz, and Solving Rank-Deficient Covariance Matrices — The Constrained Critical Line Algorithm


(This article was first published on QuantStrat TradeR » R, and kindly contributed to R-bloggers)

This post will feature the differences between my implementation of the constrained critical line algorithm and that of Dr. Clarence Kwan. The constrained critical line algorithm is a form of gradient descent that incorporates elements of momentum. My implementation includes a volatility-targeting binary search algorithm.

First off, rather than try to explain the algorithm piece by piece, I'll defer to Dr. Clarence Kwan's paper and Excel spreadsheet, from which I obtained my original implementation. Since that paper and spreadsheet explain the functionality of the algorithm, I won't repeat that process here. Essentially, the constrained critical line algorithm incorporates its lambda constraints into the structure of the covariance matrix itself. This innovation actually allows the algorithm to invert previously rank-deficient matrices.

Now, while Markowitz mean-variance optimization may be a bit of old news for some, the ability to use a short lookback for momentum with monthly data has allowed me and my two coauthors (Dr. Wouter Keller, who came up with flexible and elastic asset allocation, and Adam Butler, of GestaltU) to perform a backtest on a century’s worth of assets, with more than 30 assets in the backtest, despite using only a 12-month formation period. That paper can be found here.

Let’s look at the code for the function.

CCLA <- function(covMat, retForecast, maxIter = 1000, 
                 verbose = FALSE, scale = 252, 
                 weightLimit = .7, volThresh = .1) {
  if(length(retForecast) > length(unique(retForecast))) {
    sequentialNoise <- seq(1:length(retForecast)) * 1e-12
    retForecast <- retForecast + sequentialNoise
  }
  
  #initialize original out/in/up status
  if(length(weightLimit) == 1) {
    weightLimit <- rep(weightLimit, ncol(covMat))
  }
  rankForecast <- length(retForecast) - rank(retForecast) + 1
  remainingWeight <- 1 #have 100% of weight to allocate
  upStatus <- inStatus <- rep(0, ncol(covMat))
  i <- 1
  while(remainingWeight > 0) {
    securityLimit <- weightLimit[rankForecast == i]
    if(securityLimit < remainingWeight) {
      upStatus[rankForecast == i] <- 1 #if we can't invest all remaining weight into the security
      remainingWeight <- remainingWeight - securityLimit
    } else {
      inStatus[rankForecast == i] <- 1
      remainingWeight <- 0
    }
    i <- i + 1
  }
  
  #initial matrices (W, H, K, identity, negative identity)
  covMat <- as.matrix(covMat)
  retForecast <- as.numeric(retForecast)
  init_W <- cbind(2*covMat, rep(-1, ncol(covMat)))
  init_W <- rbind(init_W, c(rep(1, ncol(covMat)), 0))
  H_vec <- c(rep(0, ncol(covMat)), 1)
  K_vec <- c(retForecast, 0)
  negIdentity <- -1*diag(ncol(init_W))
  identity <- diag(ncol(init_W))
  matrixDim <- nrow(init_W)
  weightLimMat <- matrix(rep(weightLimit, matrixDim), ncol=ncol(covMat), byrow=TRUE)
  
  #out status is simply what isn't in or up
  outStatus <- 1 - inStatus - upStatus
  
  #initialize expected volatility/count/turning points data structure
  expVol <- Inf
  lambda <- 100
  count <- 0
  turningPoints <- list()
  while(lambda > 0 & count < maxIter) {
    
    #old lambda and old expected volatility for use with numerical algorithms
    oldLambda <- lambda
    oldVol <- expVol
    
    count <- count + 1
    
    #compute W, A, B
    inMat <- matrix(rep(c(inStatus, 1), matrixDim), nrow = matrixDim, byrow = TRUE)
    upMat <- matrix(rep(c(upStatus, 0), matrixDim), nrow = matrixDim, byrow = TRUE)
    outMat <- matrix(rep(c(outStatus, 0), matrixDim), nrow = matrixDim, byrow = TRUE)
    
    W <- inMat * init_W + upMat * identity + outMat * negIdentity
    
    inv_W <- solve(W)
    modified_H <- H_vec - rowSums(weightLimMat* upMat[,-matrixDim] * init_W[,-matrixDim])
    A_vec <- inv_W %*% modified_H
    B_vec <- inv_W %*% K_vec
    
    #remove the last elements from A and B vectors
    truncA <- A_vec[-length(A_vec)]
    truncB <- B_vec[-length(B_vec)]
    
    #compute in Ratio (aka Ratio(1) in Kwan.xls)
    inRatio <- rep(0, ncol(covMat))
    inRatio[truncB > 0] <- -truncA[truncB > 0]/truncB[truncB > 0]
    
    #compute up Ratio (aka Ratio(2) in Kwan.xls)
    upRatio <- rep(0, ncol(covMat))
    upRatioIndices <- which(inStatus==TRUE & truncB < 0)
    if(length(upRatioIndices) > 0) {
      upRatio[upRatioIndices] <- (weightLimit[upRatioIndices] - truncA[upRatioIndices]) / truncB[upRatioIndices]
    }
    
    #find lambda -- max of up and in ratios
    maxInRatio <- max(inRatio)
    maxUpRatio <- max(upRatio)
    lambda <- max(maxInRatio, maxUpRatio)
    
    #compute new weights
    wts <- inStatus*(truncA + truncB * lambda) + upStatus * weightLimit + outStatus * 0
    
    #compute expected return and new expected volatility
    expRet <- t(retForecast) %*% wts
    expVol <- sqrt(wts %*% covMat %*% wts) * sqrt(scale)
    
    #create turning point data row and append it to turning points
    turningPoint <- cbind(count, expRet, lambda, expVol, t(wts))
    colnames(turningPoint) <- c("CP", "Exp. Ret.", "Lambda", "Exp. Vol.", colnames(covMat))
    turningPoints[[count]] <- turningPoint
    
    #binary search for volatility threshold -- if the first iteration is lower than the threshold,
    #then immediately return, otherwise perform the binary search until convergence of lambda
    if(oldVol == Inf & expVol < volThresh) {
      turningPoints <- do.call(rbind, turningPoints)
      threshWts <- tail(turningPoints, 1)
      return(list(turningPoints, threshWts))
    } else if(oldVol > volThresh & expVol < volThresh) {
      upLambda <- oldLambda
      dnLambda <- lambda
      meanLambda <- (upLambda + dnLambda)/2
      while(upLambda - dnLambda > .00001) {
        
        #compute mean lambda and recompute weights, expected return, and expected vol
        meanLambda <- (upLambda + dnLambda)/2
        wts <- inStatus*(truncA + truncB * meanLambda) + upStatus * weightLimit + outStatus * 0
        expRet <- t(retForecast) %*% wts
        expVol <- sqrt(wts %*% covMat %*% wts) * sqrt(scale)
        
        #if new expected vol is less than threshold, mean becomes lower bound
        #otherwise, it becomes the upper bound, and loop repeats
        if(expVol < volThresh) {
          dnLambda <- meanLambda
        } else {
          upLambda <- meanLambda
        }
      }
      
      #once the binary search completes, return those weights, and the corner points
      #computed until the binary search. The corner points aren't used anywhere, but they're there.
      threshWts <- cbind(count, expRet, meanLambda, expVol, t(wts))
      colnames(turningPoint) <- colnames(threshWts) <- c("CP", "Exp. Ret.", "Lambda", "Exp. Vol.", colnames(covMat))
      turningPoints[[count]] <- turningPoint
      turningPoints <- do.call(rbind, turningPoints)
      return(list(turningPoints, threshWts))
    }
    
    #this is only run for the corner points during which binary search doesn't take place
    #change status of security that has new lambda
    if(maxInRatio > maxUpRatio) {
      inStatus[inRatio == maxInRatio] <- 1 - inStatus[inRatio == maxInRatio]
      upStatus[inRatio == maxInRatio] <- 0
    } else {
      upStatus[upRatio == maxUpRatio] <- 1 - upStatus[upRatio == maxUpRatio]
      inStatus[upRatio == maxUpRatio] <- 0
    }
    outStatus <- 1 - inStatus - upStatus
  }
  
  #we only get here if the volatility threshold isn't reached
  #can actually happen if set sufficiently low
  turningPoints <- do.call(rbind, turningPoints)
  
  threshWts <- tail(turningPoints, 1)
  
  return(list(turningPoints, threshWts))
}

Essentially, the algorithm can be divided into three parts:

The first part is the initialization, which does the following:

It creates three status vectors: in, up, and out. The up vector denotes which securities are at their weight-constraint cap, the in status marks securities that are held but not at their weight cap, and the out status marks securities that receive no weighting on that iteration of the algorithm.

The rest of the algorithm essentially does the following:

It takes a gradient descent approach by changing the status of the security that minimizes lambda, which by extension minimizes the volatility at the local point. As long as lambda is greater than zero, the algorithm continues to iterate. Letting the algorithm run until convergence effectively provides the volatility-minimization portfolio on the efficient frontier.

However, one change that Dr. Keller and I made to it is the functionality of volatility targeting, allowing the algorithm to stop between iterations. As the SSRN paper shows, a higher volatility threshold, over the long run (the *VERY* long run) will deliver higher returns.

In any case, the algorithm takes into account several main arguments:

A return forecast, a covariance matrix, a volatility threshold, and weight limits, which can be either one number that will result in a uniform weight limit, or a per-security weight limit. Another argument is scale, which is 252 for days, 12 for months, and so on. Lastly, there is a volatility threshold component, which allows the user to modify how aggressive or conservative the strategy can be.
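Before moving on to a full backtest, here is a minimal toy call of my own devising (not from the original post, with made-up numbers) just to show the interface and the shape of the returned list:

# Toy example (mine, not from the original post): three simulated assets,
# hypothetical return forecasts, a 70% per-asset cap and a 15% vol threshold.
set.seed(42)
toyRets <- matrix(rnorm(300 * 3, sd = 0.01), ncol = 3,
                  dimnames = list(NULL, c("A", "B", "C")))
toyCov <- cov(toyRets)
toyForecast <- c(0.05, 0.02, 0.03)
res <- CCLA(covMat = toyCov, retForecast = toyForecast,
            weightLimit = 0.7, scale = 252, volThresh = 0.15)
res[[1]]  # the corner (turning) points visited
res[[2]]  # the row at the volatility threshold (or the last corner point):
          # CP, Exp. Ret., Lambda, Exp. Vol., and the asset weights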

In any case, to demonstrate this function, let’s run a backtest. The idea in this case will come from a recent article published by Frank Grossmann from SeekingAlpha, in which he obtained a 20% CAGR but with a 36% max drawdown.

So here’s the backtest:

symbols <- c("AFK", "ASHR", "ECH", "EGPT",
             "EIDO", "EIRL", "EIS", "ENZL",
             "EPHE", "EPI", "EPOL", "EPU",
             "EWA", "EWC", "EWD", "EWG",
             "EWH", "EWI", "EWJ", "EWK",
             "EWL", "EWM", "EWN", "EWO",
             "EWP", "EWQ", "EWS", "EWT",
             "EWU", "EWW", "EWY", "EWZ",
             "EZA", "FM", "FRN", "FXI",
             "GAF", "GULF", "GREK", "GXG",
             "IDX", "MCHI", "MES", "NORW",
             "QQQ", "RSX", "THD", "TUR",
             "VNM", "TLT"
)

getSymbols(symbols, from = "2003-01-01")

prices <- list()
entryRets <- list()
for(i in 1:length(symbols)) {
  prices[[i]] <- Ad(get(symbols[i]))
}
prices <- do.call(cbind, prices)
colnames(prices) <- gsub("\\.[A-Za-z]*", "", colnames(prices))  # strip the ".Adjusted" suffix

returns <- Return.calculate(prices)
returns <- returns[-1,]

sumIsNa <- function(col) {
  return(sum(is.na(col)))
}

appendZeroes <- function(selected, originalSetNames) {
  zeroes <- rep(0, length(originalSetNames) - length(selected))
  names(zeroes) <- originalSetNames[!originalSetNames %in% names(selected)]
  all <- c(selected, zeroes)
  all <- all[originalSetNames]
  return(all)
}

computeStats <- function(rets) {
  stats <- rbind(table.AnnualizedReturns(rets), maxDrawdown(rets), CalmarRatio(rets))
  return(round(stats, 3))
}

CLAAbacktest <- function(returns, lookback = 3, volThresh = .1, assetCaps = .5, tltCap = 1,
                         returnWeights = FALSE, useTMF = FALSE) {
  if(useTMF) {
    returns$TLT <- returns$TLT * 3
  }
  ep <- endpoints(returns, on = "months")
  weights <- list()
  for(i in 2:(length(ep) - lookback)) {
    retSubset <- returns[(ep[i]+1):ep[i+lookback],]
    retNAs <- apply(retSubset, 2, sumIsNa)
    validRets <- retSubset[, retNAs==0]
    retForecast <- Return.cumulative(validRets)
    covRets <- cov(validRets)
    weightLims <- rep(assetCaps, ncol(covRets))
    weightLims[colnames(covRets)=="TLT"] <- tltCap
    weight <- CCLA(covMat = covRets, retForecast = retForecast, weightLimit = weightLims, volThresh = volThresh)
    weight <- weight[[2]][,5:ncol(weight[[2]])]
    weight <- appendZeroes(selected = weight, colnames(retSubset))
    weight <- xts(t(weight), order.by=last(index(validRets)))
    weights[[i]] <- weight
  }
  weights <- do.call(rbind, weights)
  stratRets <- Return.portfolio(R = returns, weights = weights)
  if(returnWeights) {
    return(list(weights, stratRets))
  }
  return(stratRets)
}

In essence, we take the returns over a specified monthly lookback period, specify a volatility threshold, specify asset caps, specify the bond asset cap, and whether or not we wish to use TLT or TMF (a 3x leveraged variant, which just multiplies all returns of TLT by 3, for simplicity). The output of the CCLA (Constrained Critical Line Algorithm) is a list that contains the corner points, and the volatility threshold corner point which contains the corner point number, expected return, expected volatility, and the lambda value. So, we want the fifth element onward of the second element of the list.

Here are some results:

config1 <- CLAAbacktest(returns = returns)
config2 <- CLAAbacktest(returns = returns, useTMF = TRUE)
config3 <- CLAAbacktest(returns = returns, lookback = 4)
config4 <- CLAAbacktest(returns = returns, lookback = 2, useTMF = TRUE)

comparison <- na.omit(cbind(config1, config2, config3, config4))
colnames(comparison) <- c("Default", "TMF instead of TLT", "Lookback 4", "Lookback 2 and TMF")
charts.PerformanceSummary(comparison)
computeStats(comparison)

With the following statistics:

> computeStats(comparison)
                          Default TMF instead of TLT Lookback 4 Lookback 2 and TMF
Annualized Return           0.137              0.146      0.133              0.138
Annualized Std Dev          0.126              0.146      0.125              0.150
Annualized Sharpe (Rf=0%)   1.081              1.000      1.064              0.919
Worst Drawdown              0.219              0.344      0.186              0.357
Calmar Ratio                0.625              0.424      0.714              0.386

The variants that use TMF instead of TLT suffer far worse drawdowns. Not much of a hedge, apparently.

Here’s the equity curve:

Taking the 4 month lookback configuration (strongest Calmar), we’ll play around with the volatility setting.

Here’s the backtest:

config5 <- CLAAbacktest(returns = returns, lookback = 4, volThresh = .15)
config6 <- CLAAbacktest(returns = returns, lookback = 4, volThresh = .2)

comparison2 <- na.omit(cbind(config3, config5, config6))
colnames(comparison2) <- c("Vol10", "Vol15", "Vol20")
charts.PerformanceSummary(comparison2)
computeStats(comparison2)

With the results:

> computeStats(comparison2)
                          Vol10 Vol15 Vol20
Annualized Return         0.133 0.153 0.180
Annualized Std Dev        0.125 0.173 0.204
Annualized Sharpe (Rf=0%) 1.064 0.886 0.882
Worst Drawdown            0.186 0.212 0.273
Calmar Ratio              0.714 0.721 0.661

In this case, more risk, more reward, lower risk/reward ratios as you push the volatility threshold. So for once, the volatility puzzle doesn’t rear its head, and higher risk indeed does translate to higher returns (at the cost of everything else, though).

Here’s the equity curve.

Lastly, let’s try toggling the asset cap limits with the vol threshold back at 10.

config7 <- CLAAbacktest(returns = returns, lookback = 4, assetCaps = .1)
config8 <- CLAAbacktest(returns = returns, lookback = 4, assetCaps = .25)
config9 <- CLAAbacktest(returns = returns, lookback = 4, assetCaps = 1/3)
config10 <- CLAAbacktest(returns = returns, lookback = 4, assetCaps = 1)

comparison3 <- na.omit(cbind(config7, config8, config9, config3, config10))
colnames(comparison3) <- c("Cap10", "Cap25", "Cap33", "Cap50", "Uncapped")
charts.PerformanceSummary(comparison3)
computeStats(comparison3)

With the resulting statistics:

> computeStats(comparison3)
                          Cap10 Cap25 Cap33 Cap50 Uncapped
Annualized Return         0.124 0.122 0.127 0.133    0.134
Annualized Std Dev        0.118 0.122 0.123 0.125    0.126
Annualized Sharpe (Rf=0%) 1.055 1.002 1.025 1.064    1.070
Worst Drawdown            0.161 0.185 0.186 0.186    0.186
Calmar Ratio              0.771 0.662 0.680 0.714    0.721

Essentially, in this case, there was very little actual change from simply tweaking weight limits. Here’s an equity curve:

To conclude: while this strategy did not exactly achieve the same aggregate returns or Sharpe ratio that the SeekingAlpha article did, it did highlight a probable cause of that strategy's major drawdown, and it also demonstrated the levers of how to apply the constrained critical line algorithm, the mechanics of which are detailed in the papers linked to earlier.

Thanks for reading.


To leave a comment for the author, please follow the link and comment on his blog: QuantStrat TradeR » R.


Any R code as a cloud service: R demonstration at BUILD


(This article was first published on Revolutions, and kindly contributed to R-bloggers)

At last month's BUILD conference for Microsoft developers in San Francisco, R was front-and-center on the keynote stage.


In the keynote, Microsoft CVP Joseph Sirosh introduced the "language of data": open source R. Sirosh encouraged the audience to learn R, saying "if there is a single language that you choose to learn today .. let it be R". 

The keynote featured a demonstration of genomic data analysis using R. The analysis was based on the 1000 Genomes data set stored in the HDInsight Hadoop-in-the-cloud service. Revolution R Enterprise, running on eight Hadoop clusters distributed around the globe (about 1,600 cores in total), together with R's Bioconductor suite (specifically the VariantTools and gmapR packages), was used to perform 'variant calling' and to calculate, in parallel, the disease risks indicated by a subset of the 1000 genomes. The result was an interactive heat map showing the disease risks for each individual.

[Figure: interactive heat map of disease risks across individuals]

The heat map was created by Winston Chang and Joe Cheng from RStudio as an htmlwidget using the D3heatmap package. (You can interact with a variant of the heatmap from the demo here.)

The next part of the demo was to compare an individual's disease risks — as indicated by his or her DNA — to the population. Joseph Sirosh had his own DNA sequence for this purpose, which he submitted via a Windows Phone app to an Azure service running R. This is easy to do with Azure ML Studio: just put your R code as part of a workflow, and an API will automatically be generated on request. In this way you can publish any R code as an API to the cloud, which is then callable by any connected application.

[Figure: Azure ML Studio workflow exposing the R code as a web service API]

You can watch the entire keynote presentation below, and the R demo begins at around the 23 minute mark.

 

To leave a comment for the author, please follow the link and comment on his blog: Revolutions.


Macros in R


(This article was first published on Mad (Data) Scientist, and kindly contributed to R-bloggers)

In programming, it is sometimes useful to write a macro rather than a function. (Don't worry if you've never heard the term before.) In this post, I'll give an example of the use of macros in R, using the gtools package on CRAN.

I wanted to write some utility code to help me reuse my earlier R commands during an interactive R session. Most (though not all) of what I wanted is already provided in the excellent R user interface systems such as ESS, RStudio and vim-r, but for various reasons I generally use the command line directly, especially for short sessions. (I also have developed my own vim editing mappings for R.) Specifically, I wanted to develop utilities to perform the following tasks:

  • Display all my recent commands, possibly restricting to those matching or excluding certain character strings.
  • Choose one of the recent commands for re-execution.
  • Re-execute a recent command by number or matching string.

One nice thing about R is that one can easily form some command programmatically in a character string, say using paste(), and then execute the string as a command, using eval(). Thus it would be easy to code up the above-listed tasks into functions that I can call when needed, except for one problem: Within the body of a function, one has a different environment than at the caller’s level.
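As a tiny illustration of that idea (my own, not from the post), a command can be assembled as a string and then executed; note that the string has to be run through parse() before eval() can evaluate it:

# Not from the original post: build a command as a string, then parse and
# evaluate it as if it had been typed at the prompt.
cmd <- paste0("summary(rnorm(", 100, "))")
eval(parse(text = cmd))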

Take for instance (a slightly modified version of) the first example in the online help for defmacro() in gtools, in which the goal is to write a function to recode as NA all entries in a data frame column that have a certain value. The following will NOT work:

setNA <- function(df,var,val) 
   df$var[df$var == val] <- NA

The problem is that within setNA(), df will be a data frame that is only a copy of the one in your call. The recoding to NAs will be made only to the copy. The defmacro() function in gtools would do what you really want:

library(gtools)

setNA <- defmacro(df, var, val,
   expr = {df$var[df$var == val] <- NA})

With this, df really will be the desired data frame. I urge you to try a little test with both of the above code snippets.
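Here is a quick test of my own (not from the post) showing that the macro version really does modify the caller's data frame:

# Quick test (not from the original post): recode the value 99 in column x to NA.
d <- data.frame(x = c(1, 99, 3), y = c(99, 2, 99))
setNA(d, x, 99)
d$x  # 1 NA 3 -- the data frame in the calling environment was modified in place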

The code I wrote for my command-history utilities is rather short, but too long to place in a blog post; instead, you can access it here. There is a short sample session at the end of the file, which again I urge you to execute by hand to fully understand it. Take note of the tasks I coded as macros instead of functions, and think about why I needed to do so.

How is all this magic accomplished? The defmacro() function still makes use of eval(), substitute() etc., but it already does that work for you, so that you can write your macro while thinking of it as a function. Your code is a function — defmacro() builds the function and returns it, and by the way, note that this means it is debuggable — but again, the point is that you don't have to deal with all the calls to eval() etc.

If you are a C/C++ programmer, note that this differs from macros in that language, which are straight substitutions made by the compiler preprocessor.

 


To leave a comment for the author, please follow the link and comment on his blog: Mad (Data) Scientist.


New package for image processing in R


(This article was first published on dahtah » R, and kindly contributed to R-bloggers)

I’ve written a package for image processing in R, with the goal of providing a fast API in R that lets you do things in C++ if you need to. The package is called imager, and it’s on GitHub.
The whole thing is based on CImg, a very nice C++ library for image processing by David Tschumperlé.

Features:

  • Handles images in up to 4 dimensions, meaning you can use it for volumetric/hyperspectral data or short videos
  • Facilities for taking subsets of images, pixel neighbourhoods, etc.
  • All the usual image processing stuff (filters, morphology, transformations, interpolation, etc.)
  • Easy access to the C++ API via Rcpp

The package is still in an early phase but it can already do a lot of useful things as you’ll see from the documentation.

Example code:

library(imager)
library(plyr)      # provides l_ply(), used below
library(magrittr)  # provides the %>% pipe, used below
im <- load.image(system.file('extdata/parrots.png',package='imager'))
layout(t(1:3)) 
plot(im,main="Original image")
grad <- grayscale(im) %>% get_gradient("xy")
names(grad) <- paste("Gradient along",c("x","y")) 
l_ply(names(grad),function(n) plot(grad[[n]],main=n))

[Figure: the original parrots image alongside its gradients along x and y]

Visit the website for more information.


To leave a comment for the author, please follow the link and comment on his blog: dahtah » R.


PERFORMANCE: Calling R_CheckUserInterrupt() every 256 iterations is actually faster than every 1,000,000 iterations


(This article was first published on jottR, and kindly contributed to R-bloggers)

If your native code takes more than a few seconds to finish, it is a nice courtesy to the user to check for user interrupts (Ctrl-C) once in a while, say, every 1,000 or 1,000,000 iterations. The C-level API of R provides R_CheckUserInterrupt() for this (see 'Writing R Extensions' for more information on this function). Here's what the code would typically look like:

for (int ii = 0; ii < n; ii++) {
  /* Some computationally expensive code */
  if (ii % 1000 == 0) R_CheckUserInterrupt();
}

This uses the modulo operator % and tests whether the result is zero, which happens every 1,000 iterations. When that occurs, the code calls R_CheckUserInterrupt(), which will interrupt the processing and “return to R” whenever an interrupt is detected.

Interestingly, it turns out that it is significantly faster to do this check every k = 2^m iterations, e.g. instead of doing it every 1,000 iterations, it is faster to do it every 1,024 iterations. Similarly, instead of, say, doing it every 1,000,000 iterations, do it every 1,048,576 - not one less (1,048,575) or one more (1,048,577). The difference is so large that it is even 2-3 times faster to call R_CheckUserInterrupt() every 256 iterations rather than, say, every 1,000,000 iterations, which at least to me was a bit counterintuitive the first time I observed it.

Below are some benchmark statistics supporting the claim that testing / calculating ii % k == 0 is faster for k = 2^m (blue) than for other choices of k (red).

Note that the times are on the log scale (the results are also tabulated at the end of this post). Now, will it make a big difference to the overall performance of your code if you choose, say, 1,048,576 instead of 1,000,000? Probably not, but on the other hand, it does not hurt to pick an interval that is a 2^m integer. This observation may also be useful in algorithms that make heavy use of the modulo operator.

So why is ii % k == 0 a faster test when k = 2^m? I can only speculate. For instance, the integer 2^m is a binary number with all bits but one set to zero. It might be that this is faster to test for than other bit patterns, but I don't know if this is because of how the native code is optimized by the compiler and/or if it goes down to the hardware/CPU level. I'd be interested in feedback and to hear your thoughts on this.
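One way to make that speculation concrete: for non-negative i and k = 2^m, the test i %% k == 0 is equivalent to a bitwise AND with k - 1, which is far cheaper than a general division-based modulo and is a substitution an optimizing compiler can make. A quick check of the identity in R (my addition, not from the original post):

# Not from the original post: for k = 2^m and non-negative i,
# `i %% k == 0` is the same test as `bitwAnd(i, k - 1) == 0`.
i <- c(0:5, 255, 256, 257, 1023, 1024, 1025, 1048575, 1048576, 1048577)
all((i %% 1024 == 0) == (bitwAnd(i, 1023L) == 0))  # TRUE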

Details on how the benchmarking was done

I used the inline package to generate a set of C-level functions with varying interrupt intervals k. I'm not passing k as a parameter to these functions. Instead, I use it as a constant value so that the compiler can optimize as far as possible, but also in order to imitate how most code is written. This is why I generate multiple C functions. I benchmarked across a wide range of interval choices using the microbenchmark package. The C functions (with corresponding R functions calling them) and the corresponding benchmark expressions to be called were generated as follows:

## The interrupt intervals to benchmark
## (a) Classical values
ks <- c(1, 10, 100, 1000, 10e3, 100e3, 1e6)
## (b) 2^k values and the ones before and after
ms <- c(2, 5, 8, 10, 16, 20)
as <- c(-1, 0, +1) + rep(2^ms, each=3)

## List of unevaluated expressions to benchmark
mbexpr <- list()

for (k in sort(c(ks, as))) {
  name <- sprintf("every_%d", k)

  ## The C function
  assign(name, inline::cfunction(c(length="integer"), body=sprintf("
    int i, n = asInteger(length);
    for (i=0; i < n; i++) {
      if (i %% %d == 0) R_CheckUserInterrupt();
    }
    return ScalarInteger(n);
  ", k)))

  ## The corresponding expression to benchmark
  mbexpr <- c(mbexpr, substitute(every(n), list(every=as.symbol(name))))
}

The actual benchmarking of the 25 cases was then done by calling:

n <- 10e6  ## Number of iterations
stats <- microbenchmark::microbenchmark(list=mbexpr)
expr                    min      lq    mean  median      uq     max
every_1(n)           174.05  178.77  184.68  180.76  183.97  262.69
every_3(n)            66.78   69.16   72.10   70.20   72.42  114.75
every_4(n)            53.80   55.31   56.98   56.32   57.26   69.71
every_5(n)            46.17   47.52   49.42   48.83   49.99   66.98
every_10(n)           33.31   34.32   36.58   35.12   36.66   54.83
every_31(n)           23.78   24.45   25.74   25.10   25.83   58.10
every_32(n)           17.81   18.25   18.91   18.82   19.22   25.25
every_33(n)           22.90   23.58   24.90   24.59   25.26   34.45
every_100(n)          18.14   18.55   19.47   19.15   19.63   27.42
every_255(n)          19.96   20.56   21.67   21.16   21.98   42.53
every_256(n)           7.07    7.18    7.54    7.40    7.63   10.73
every_257(n)          19.32   19.72   20.60   20.36   20.85   29.66
every_1000(n)         16.37   16.98   17.81   17.53   18.08   24.24
every_1023(n)         19.54   20.16   20.94   20.50   21.25   28.20
every_1024(n)          6.32    6.40    6.81    6.60    6.83   13.32
every_1025(n)         18.58   19.05   19.91   19.74   20.08   30.51
every_10000(n)        15.92   16.76   17.40   17.38   17.82   24.10
every_65535(n)        18.92   19.60   20.41   20.10   20.80   27.69
every_65536(n)         6.08    6.16    6.62    6.39    6.57   13.40
every_65537(n)        22.08   22.70   23.79   23.69   24.35   31.57
every_100000(n)       16.16   16.55   17.20   17.05   17.61   24.54
every_1000000(n)      16.02   16.42   17.17   16.85   17.42   21.84
every_1048575(n)      18.88   19.23   20.27   19.85   20.52   30.21
every_1048576(n)       6.08    6.18    6.53    6.47    6.58   12.64
every_1048577(n)      22.88   23.23   24.28   23.83   24.63   31.84

I get similar results across various operating systems (Windows, OS X and Linux), all using the GNU Compiler Collection (GCC).

Feedback and comments are welcome!

To reproduce these results, do:

> path <- 'https://raw.githubusercontent.com/HenrikBengtsson/jottr.org/master/blog/20150604%2CR_CheckUserInterrupt'
> html <- R.rsp::rfile('R_CheckUserInterrupt.md.rsp', path=path)
> !html ## Open in browser


RcppArmadillo 0.5.200.1.0


(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

Conrad put out a new minor release 5.200.1 of Armadillo yesterday. Armadillo is a powerful and expressive C++ template library for linear algebra, aiming towards a good balance between speed and ease of use, with a syntax deliberately close to Matlab.

Our corresponding RcppArmadillo release 0.5.200.1.0 is now on CRAN and on its way into Debian. See below for the brief list of changes.

Changes in RcppArmadillo version 0.5.200.1.0 (2015-06-04)

  • Upgraded to Armadillo release 5.200.1 ("Boston Tea Smuggler")

    • added orth() for finding the orthonormal basis of the range space of a matrix

    • expanded element initialisation to handle nested initialiser lists (C++11)

    • workarounds for bugs in GCC, Intel and MSVC C++ compilers

  • Added another example to inst/examples/fastLm.r
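As a quick illustration of the new orth() listed above, here is a minimal sketch (assuming Rcpp and RcppArmadillo are installed; the wrapper name orthBasis is made up for this example):

library(Rcpp)
cppFunction(depends = "RcppArmadillo", code = '
  arma::mat orthBasis(const arma::mat& X) {
    // orthonormal basis of the range space of X (new in Armadillo 5.200)
    return arma::orth(X);
  }
')
X <- matrix(rnorm(20), 5, 4)
B <- orthBasis(X)
round(crossprod(B), 10)   # approximately the identity: the columns are orthonormal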

Courtesy of CRANberries, there is also a diffstat report for the most recent CRAN release. As always, more detailed information is on the RcppArmadillo page. Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


A Better ZigZag


(This article was first published on Quintuitive » R, and kindly contributed to R-bloggers)

There are a lot of “winning” strategies for bull markets floating around. “Buy the pullbacks” is certainly one of them. Does this sound simple enough to implement to you? While I am no Sheldon Cooper (although I have a favorite couch seat), I still like to live in a somewhat well-defined world, a world in which there is much more information attached to a tip like “Buy the pullbacks”. Let’s start with a chart of the recent history of the S&P 500 ETF:

SPY

The indicator on the chart is the ZigZag indicator. It’s certainly one way to look for the proverbial pullbacks. One can use an absolute dollar amount, or a percentage, to define the minimum amount the price needs to travel in the opposite direction to constitute a correction. I used a setting of 2% in the above chart. Notice that we won’t enter at the bottoms; most likely the position is taken as soon as the pullback is “registered”. Using our 2% setup, we enter the pullback as soon as the market is down 2% from the most recent peak. The rest of the road to the actual bottom is the pain we are going to suffer. :)

The chart indicates that over the last one year, there were 4 points to enter a position. That’s somewhat low if we think of buying each pullback and then closing the trade (these are not buy and hold entries).

Pretty easy you say – go to 2.5%, or even 3%, pullback. Well, before we do that, let’s look at the same chart just over a different period in time:

spy04

To me it seems that we actually have way too many “corrections” (a.k.a. pullbacks) during the 2011 turbulence.

By now it’s pretty clear what I am getting at – the pullback threshold shouldn’t be a hard number, neither an absolute amount nor a percentage. It seems a better approach to get volatility involved.

To address these issues, I modified the ZigZag to actually use volatility to measure the pullbacks. With this improvement, the charts look somewhat different.

SPY 2014

Notice, we have a few more “corrections” generated by the modified ZigZag (the red line in the charts) in the low-volatility environment. Similarly, keeping the settings the same, there are a few less “corrections” generated by the volatility ZigZag in the high-volatility environment.

SPY 2011

How useful is all this? Frankly, I don’t know yet, but it doesn’t really matter, in a way. I consider the first goal achieved – I have a simple indicator, with a single setting, to model what my strategy would consider a pullback. In the process, I have added a few useful features to my ZigZag. My R implementation is, in an amusing way, totally the opposite of the TTR ZigZag. The TTR ZigZag simply returns the indicator line; thus, one can easily plot it, but it’s hard to tell from the line when the move was “registered” (in other words, when the 2% pullback was first hit). My implementation, on the other hand (available from the btutils package), does not compute the lines connecting the extremes (thus, one cannot draw it at all), but it computes three other columns which can easily be used to implement a trading strategy based on the modified ZigZag (the “age” column is 0 at the inflection points). I guess we all have different goals. :) Here is how to call the btutils version:

require(btutils)
require(quantmod)
spy = getSymbols("SPY", from="1900-01-01", auto.assign=F)
spy.zz = zig.zag(Cl(spy), 2*runSD(ROC(Cl(spy),type="discrete"),10), percent=T)
tail(spy.zz)
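For comparison, the stock TTR ZigZag with a fixed 2% threshold can be overlaid on the price chart (a minimal sketch, not part of btutils; quantmod, which loads TTR, is already attached above):

zz.fixed = ZigZag(Cl(spy), change=2, percent=TRUE)   # classic fixed-percentage ZigZag from TTR
chartSeries(spy, subset="2014::", TA=NULL, theme="white")
addTA(zz.fixed, on=1, col="red")                     # overlay on the price panel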

Lastly, the btutils package currently cannot be installed on Windows directly from R-Forge. You can still download the source code (the easiest way is to create a new project in RStudio and check it out from version control) and build the package yourself, though.

The post A Better ZigZag appeared first on Quintuitive.


Aggregate player preference for the first 20 building created in Illyriad


(This article was first published on The Shape of Code » R, and kindly contributed to R-bloggers)

I was at the Microsoft Gaming data hackathon today. Gaming is very big business and companies rarely publish detailed game data. Through contacts, one of the organizers was able to obtain two gaming datasets, both containing just under 300M of compressed data.

Illyriad supplied a random snapshot of anonymised data on 50,000 users and Mediatonic supplied three months of player data.

Being a Microsoft event there were lots of C# developers, with data analysis people being thin on the ground. While there were plenty of gamers present I could not find any that knew the games for which we had data (domain experts are always in short supply at hackathons).

I happened to pick the Illyriad data to investigate first and stayed with it. The team sitting next to us worked on the Mediatonic data and while I got to hear about this data and kicked a few ideas around with them, I did not look at it.

The first thing to do with any dataset is to become familiar with what data it actually contains and the relationships between different items. I was working with two people new to data science who wanted to make the common beginner mistake of talking about interesting things we could do; it took a while for my message of “no point of talking about what we could do with the data until we know what data we have” to have any effect. Of course it is always worth listening to what a domain expert is interested in before looking at the data, as a source of ideas to keep in mind; it is not worth keeping in mind ideas from non-domain experts.

Quick Illyriad game overview: Players start with a settlement and construct/upgrade buildings until they have a legendary city. These buildings can generate resources such as food and iron; towns/cities can be conquered and colonized… you get the picture.

My initial investigation of the data did not uncover any of the obvious simple patterns, but did manage to find a way of connecting some pairs of players in a transaction relationship (the data for each player included a transaction list which gave one of 255 numeric locations and the transaction amount; I reasoned that the location/amount pair was likely to be unique).

The data is a snapshot in time, which appeared to rule out questions involving changes over time. Finally, I realized that time data was present in the form of the order in which each player created buildings in their village/town/city.

Buildings are the mechanism through which players create resources. What does the data have to say about gamers’ preferred building construction order? Do different players with different playing strategies use different building construction orders?

A search of the Illyriad website located various beginners’ guides containing various strategy suggestions, depending on player preferences for action.

Combining the order of the first 16 different buildings, created by all 50,000 players, into an aggregate preference building order we get:

Library
Storehouse
Lumberjack
Farmyard
Marketplace
Iron Mine
Clay Pit
Quarry
Barracks
Consulate
Mage Tower
Common Ground
Tavern
Paddock
Brewery
Spearmaker

A couple of technical points: it’s impractical to get an exact preference order for more than about 10 players, so a Monte Carlo approach is used by RankAggreg, and multiple instances of the same kind of building were treated as a single instance (some form of weighting might be used to handle this behavior).

The order of the top three ranked buildings is very stable, but some of the buildings in lower ranks could switch places with adjacent buildings with little impact on ranking error.

Do better players use different building orders than poor players? The data does not include player ability data as such; it included game ranking (a high ranking might be achieved quickly by a strong player or slowly over a longer period by a weaker player) and various other rankings (some of which could be called sociability).

Does the preference for buildings change as a players’ village becomes a town becomes a city? At over 200 minutes of cpu time per run I have not yet had the time to find out. Here is the R code for you to try out some ideas:

library("plyr")
library("RankAggreg")
 
get_build_order=function(df)
{
# Remove duplicates for now
dup=duplicated(df$building_id)
 
# Ensure there are at least 20
build_order=c(df$building_id[!dup], -1:-20)
return(build_order[1:20])
}
 
# town_id,building_id,build_order_for_town
#1826159E-976D-4743-8AEB-0001281794C2,7,1
build=read.csv("~/illyriad/town_buildings.csv", as.is=TRUE)
 
build_order=daply(build, .(town_id), get_build_order)
 
build_rank=RankAggreg(build_order, 20)
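Once run, the aggregate order and the objective value can be inspected from the returned object (a minimal sketch, based on the documented structure of RankAggreg results; the optimal ordering is stored in top.list):

build_rank$top.list       # the aggregate building order
build_rank$optimal.value  # value of the rank-aggregation objective
plot(build_rank)          # diagnostic plot provided by the package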

What did other teams discover in the data? My informal walk around on Saturday evening found everybody struggling to find anything interesting to talk about (I missed the presentation on Sunday afternoon; perhaps a night’s sleep turned things around for people, we will have to check other blogs for news).

If I was to make one suggestion to the organizers of the next gaming data hackathon (I hope there is one), it would be to arrange to have some domain experts (i.e., people very familiar with playing the games) present.

ps. Thanks to Richard for organizing chicken for the attendee who only eats pizza when truly starving.


European debt and interest


(This article was first published on Wiekvoet, and kindly contributed to R-bloggers)
I was told the Eurostat package would be interesting for me. This is indeed true, and now I want to use it to plot some data related to the core of some European policies: debt.
In these plots I only show individual countries, not aggregates over the EU or the Euro zone. In addition, Norway is dropped, because it has less data, is not an EU country and has a fairly different financial position than the rest of Europe. This resulted in 28 countries, which can be displayed in a grid of 4 by 7.

Lending and borrowing

In lending and borrowing we see the troubles of the crisis. Especially in Ireland there is a spike showing increased borrowing; going from net lender to borrowing 30% of GDP in a few years is massive. But the same can be seen in Spain and the UK. In fact, only a few countries did not borrow more in that period.
The plot has an additional line: in red, 3% borrowing is depicted, 3% being the limit from the Stability and Growth Pact. It can be seen that quite a few countries are under or near that 3%, or edging closer to it.

Debt

The consequence of all that borrowing is debt. All countries have debt. The red line is placed at 60%, which is the upper limit in the Euro zone Stability and Growth Pact. Many countries had debt under 60% before the crisis, for example Spain, the Netherlands and Ireland. Italy and Belgium were well above those 60% and saw some further increase. Germany, Europe's economic miracle, had debt over 60%. There seems to be no obvious link between debt before the crisis and debt after the crisis.

Interest

The consequence of debt is interest. This is probably the most remarkable set of data. There are no targets here. What is visible is the decrease in interest for many countries at the end of the last century. Here the positive influence of the Euro is visible. But the strange thing is that many countries hardly suffered increased interest payments during the crisis at all. Italy has paid approximately 5% for more than 10 years now. The Netherlands has a decreasing curve, which was flattened by the crisis and is now decreasing again. In summary, many countries currently pay historically low interest.
Part of this is obviously the work of various policies to contain the crisis. But it can also mean that debt may not be the biggest problem Europe faces at this moment.

Code

The code is fairly simple. What is needed is to look up what the codes mean. For countries these are extracted from the database. For the other codes it is easiest to just open the table on the Eurostat website and see which codes are interesting.
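For the non-country codes, a quick hedged sketch of the same dictionary lookup that is used below for 'geo' (assuming the dictionary name matches the column name, e.g. na_item):

library(eurostat)
head(get_eurostat_dic('na_item'))   # code/label pairs for the na_item coding list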
The dplyr package allowed chaining selection and plotting in one call, thereby eliminating the chance of not updating intermediate data frames.
library(eurostat)
library(dplyr)
library(ggplot2)
library(scales)

r1 <- get_eurostat('gov_10dd_edpt1')

# add country names
r2 <- get_eurostat_dic('geo') %>%
    mutate(.,
        geo=V1,
        country=V2,
        country=gsub('\\(.*$', '', country)) %>%
    select(.,geo,country) %>%
    merge(.,r1) %>%
# filter countries
    filter(.,
        !grepl('EA.*',geo),
        !grepl('EU.*',geo),
        geo!='NO')

filter(r2,
        sector=='S13', # general government
        na_item=='B9', # Net lending (+) /net borrowing (-)
        unit=='PC_GDP' # % GDP
    ) %>%
    ggplot(.,aes(x=time,y=values)) + 
    geom_line()+
    ylab('% GDP')+
    facet_wrap(~country,nrow=4) +
    ggtitle('Net lending (+) /net borrowing (-)') +
    xlab('Year') +
    geom_hline(yintercept=-3,colour='red') +
    scale_x_date(
        breaks=c(as.Date("2000-01-01"),as.Date("2010-01-01") )
        ,labels = date_format("%Y"))

#########
filter(r2,
        sector=='S13', # general government
        na_item=='GD', # Gross debt
        unit=='PC_GDP' # % GDP
    ) %>%
    ggplot(.,aes(x=time,y=values)) + 
    geom_line()+
    ylab('% GDP')+
    facet_wrap(~country,nrow=4) +
    ggtitle('Gross Debt') +
    xlab('Year') +
    geom_hline(yintercept=60,colour='red') +
    scale_x_date(
        breaks=c(as.Date("2000-01-01"),as.Date("2010-01-01") )
        ,labels = date_format("%Y"))

#########
filter(r2,
        sector=='S13', # general government
        na_item=='D41PAY', # Interest, payable
        unit=='PC_GDP' # % GDP
    ) %>%
    ggplot(.,aes(x=time,y=values)) +
    geom_line()+
    ylab('% GDP')+
    facet_wrap(~country,nrow=4) +
    ggtitle('Interest, payable') +
    xlab('Year') +
    scale_x_date(
        breaks=c(as.Date("2000-01-01"),as.Date("2010-01-01") )
        ,labels = date_format("%Y"))


Heteroscedasticity in Regression — It Matters!


(This article was first published on Mad (Data) Scientist, and kindly contributed to R-bloggers)

R’s main linear and nonlinear regression functions, lm() and nls(), report standard errors for parameter estimates under the assumption of homoscedasticity, a fancy word for a situation that rarely occurs in practice. The assumption is that the (conditional) variance of the response variable is the same at any set of values of the predictor variables.

Take for instance my favorite introductory example, predicting human weight from height. (It is unlikely we’d need to make such a prediction, but it serves as a quick and easy illustration.) The homoscedasticity assumption says that tall people have no more variation in weight than short people, certainly not true.

So, confidence intervals and p-values calculated by lm() and nls() are generally invalid. In this blog post, I’ll show how to remedy this problem — including some new code that I’ll provide here — and give an example showing just how far wrong one can be without a remedy.

In the linear case, the remedy is simple, but probably not widely known. Years ago, researchers such as Eicker and White derived large-sample theory for the heteroscedastic variance case, and it is coded in CRAN’s car and sandwich packages in the functions hccm() and vcovHC(), respectively. I’ll use the latter here, as its name is similar to that of R’s vcov() function.

The vcov() function inputs the result of a call to lm() or nls(), and outputs the estimated covariance matrix of your estimated parameter vector. Thus the standard errors of the estimated parameters are the square roots of the diagonal elements of the matrix returned by vcov(). Let’s check that, using the example from the online help for lm():

> ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
> trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
> group <- gl(2, 10, 20, labels = c("Ctl","Trt"))
> weight <- c(ctl, trt)
> lm.D9 <- lm(weight ~ group)
> summary(lm.D9)
...
Coefficients:
 Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.0320 0.2202 22.850 9.55e-15
groupTrt -0.3710 0.3114 -1.191 0.249
> diag(sqrt(vcov(lm.D9)))
(Intercept) groupTrt 
 0.2202177 0.3114349 

Sure enough, the standard errors match.  (Were you doubting it? :-) )

So, the linear case, i.e. lm(), is taken care of.  But unfortunately, there doesn’t seem to be anything for the nonlinear case, nls(). The following code solves that problem:

library(minpack.lm)
library(sandwich)

nlsvcovhc <- function(nlslmout) {
 b <- coef(nlslmout)
 m <- nlslmout$m
 resid <- m$resid()
 hmat <- m$gradient()
 fakex <- hmat
 fakey <- resid + hmat %*% b
 lmout <- lm(fakey ~ fakex - 1)
 vcovHC(lmout)
}

(The names such as fakex, which I should remove, are explained in the last paragraph below.) You can download this code (complete with comments, which I’ve omitted here for brevity) here. The function is applied to the output of nlsLM(), which is a modified version of nls() in the package minpack.lm, which has better convergence behavior.

Here’s an example, using the enzyme data set vmkmki from the CRAN package nlstools. (I removed the last 12 observations, which for reasons I won’t go into here seem to be anomalous.) The nonlinear model here is the one given in the online help for vmkmki.

> regftn <- function(t,u,b1,b2,b3) b1 * t / (t + b2 * (1 + u/b3))
> bstart <- list(b1=1,b2=1, b3=1)
> z <- nls(v ~ regftn(S,I,b1,b2,b3),data=vmkmki,start=list(b1=1,b2=1, b3=1))
> z
Nonlinear regression model
 model: v ~ regftn(S, I, b1, b2, b3)
 data: vmkmki
 b1 b2 b3 
18.06 15.21 22.28 
...
> vcov(z)
 b1 b2 b3
b1 0.4786776 1.374961 0.8930431
b2 1.3749612 7.568837 11.1332821
b3 0.8930431 11.133282 29.1363366

Compare that to using Eicker-White:

> # get new z from nlsLM(), not shown
> nlsvcovhc(z)
 fakex1 fakex2 fakex3
fakex1 0.4708209 1.706591 2.410712
fakex2 1.7065910 10.394496 20.314688
fakex3 2.4107117 20.314688 53.086958

This is rather startling! Except for the estimated variance of the first parameter estimate, the estimated variances and covariances from Eicker-White are much larger than what nls() found under the assumption of homoscedasticity. In other words, nls() was way off the mark. In fact, this simulation code shows that the standard errors reported by nls() can lead, for instance, to confidence intervals having only 60% coverage probability rather than 90%, due to heteroscedasticity, while the above code fixes the problem.

How does that code work? (Optional reading.) Most nonlinear least-squares procedures use a local linear approximation, such as in the Gauss-Newton algorithm. This results computationally in a fake lm() setting. As such, we are already set up to use the delta method. We can then apply vcovHC().
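For completeness, here is a hedged sketch of the nlsLM() refit alluded to in the comment above, and of pulling out the Eicker-White standard errors (dropping the last 12 rows of vmkmki follows the text; the rest is an assumption, not code from the original post):

library(minpack.lm)
library(nlstools)
data(vmkmki)
vmk <- vmkmki[1:(nrow(vmkmki) - 12), ]   # drop the anomalous final 12 observations
zlm <- nlsLM(v ~ regftn(S, I, b1, b2, b3), data=vmk, start=bstart)
sqrt(diag(nlsvcovhc(zlm)))               # heteroscedasticity-consistent standard errors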



Animated US Hexbin Map of the Avian Flu Outbreak


(This article was first published on rud.is » R, and kindly contributed to R-bloggers)

The recent announcement of the start of egg rationing in the U.S. made me curious enough about the avian flu outbreak to try to dig into the numbers a bit. I finally stumbled upon a USDA site that had an embedded HTML table of flock outbreak statistics by state, county and date (also flock type and whether it was a commercial enterprise or “backyard” farm). Just looking at the sum of flock sizes on that page shows that nearly 50 million birds have been impacted since December, 2014.

We can scrape the data with R & rvest and then use the shapefile hexbins from previous posts to watch the spread week-over-week.

The number of packages I ended up relying on was a bit surprising. Let’s get them out of the way before focusing on the scraping and hexbin-making:

library(rvest)     # scraping
library(stringr)   # string manipulation
library(lubridate) # date conversion
library(dplyr)     # data munging
library(zoo)       # for locf
library(ggplot2)   # plotting
library(rgdal)     # map stuff
library(rgeos)     # map stuff

We also end up using magrittr and tidyr but only for one function, so you’ll see those with :: in the code.

Grabbing the USDA page is pretty straightforward:

url <- "http://www.aphis.usda.gov/wps/portal/aphis/ourfocus/animalhealth/sa_animal_disease_information/sa_avian_health/ct_avian_influenza_disease/!ut/p/a1/lVJbb4IwFP41e1qwFZDLI-oUnGgyswm8kAMUaAaFQNG4X7-ibnEPYtakDz3nO_kupyhAHgoYHGgGnFYMiv4daOFqa8vjKZad5c58wc7mY-Eaa13Z2qoA-AKA7xwL_53fvjpaP_-Gp_Z8jHcK2qMABTHjNc-RD3VO2zCuGCeMhwWNGmhOT7iFsOqaMK3irj2_gNESijAnUPD8tpLQlkBLQsrSqinPJi7tAwX2i4_5tSBgRUfYF_wM9mLqmCbIj2QzxZpMJMUYg6TGkSLBBCaSPEnSJIljXVH0q_kBdw_CO5sXkNnSslV9LQJTDRk7czGumy7GjnYFDOTrCw36XRJTRbt_mlo9VD1FgfuctqrVByA37szNBAPwXOpzR97gPi7tm30gb2AfQkxWVJH4ifvZLavFIsUQrA1JSUOaUV61HHnH43HUtQmMsuqA6vK9NJST9JluNlKwyPr7DT6YvRs!/?1dmy&urile=wcm%3apath%3a%2Faphis_content_library%2Fsa_our_focus%2Fsa_animal_health%2Fsa_animal_disease_information%2Fsa_avian_health%2Fsa_detections_by_states%2Fct_ai_pacific_flyway"
 
#' read in the data, extract the table and clean up the fields
#' also clean up the column names to since they are fairly nasty
 
pg <- html(url)

If you poke at the source for the page you’ll see there are two tables in the code and we only need the first one. Also, if you scan the rendered table on the USDA page by eye you’ll see that the column names are horrible for data analysis work and they are also inconsistent in the values used for various columns. Furthermore, there are commas in the flock counts and it would be handy to have the date as an actual date type. We can extract the table we need and clean all that up in a reasonably-sized dplyr pipe:

pg %>%
  html_nodes("table") %>%
  magrittr::extract2(1) %>%
  html_table(header=TRUE) %>%
  filter(`Flock size`!="pending") %>%
  mutate(Species=str_replace(tolower(Species), "s$", ""),
         `Avian influenza subtype*`=str_replace_all(`Avian influenza subtype*`, " ", ""),
         `Flock size`=as.numeric(str_replace_all(`Flock size`, ",", "")),
         `Confirmation date`=as.Date(mdy(`Confirmation date`))) %>%
  rename(state=State, county=County, flyway=Flyway, flock_type=`Flock type`,
         species=Species, subtype=`Avian influenza subtype*`, date=`Confirmation date`,
         flock_size=`Flock size`) -> birds

Let’s take a look at what we have:

glimpse(birds)
 
## Observations: 202
## Variables:
## $ state      (chr) "Iowa", "Minnesota", "Minnesota", "Iowa", "Minnesota", "Iowa",...
## $ county     (chr) "Sac", "Renville", "Renville", "Hamilton", "Kandiyohi", "Hamil...
## $ flyway     (chr) "Mississippi", "Mississippi", "Mississippi", "Mississippi", "M...
## $ flock_type (chr) "Commercial", "Commercial", "Commercial", "Commercial", "Comme...
## $ species    (chr) "turkey", "chicken", "turkey", "turkey", "turkey", "turkey", "...
## $ subtype    (chr) "EA/AM-H5N2", "EA/AM-H5N2", "EA/AM-H5N2", "EA/AM-H5N2", "EA/AM...
## $ date       (date) 2015-06-04, 2015-06-04, 2015-06-04, 2015-06-04, 2015-06-03, 2...
## $ flock_size (dbl) 42200, 415000, 24800, 19600, 37000, 26200, 17200, 1115700, 159...

To make an animated map of cumulative flock totals by week, we’ll need to

  • group the birds data frame by week and state
  • calculate the cumulative sums
  • fill in the gaps where there are missing state/week combinations
  • carry the last observations by state/week forward in this expanded data frame
  • make breaks for data ranges so we can more intelligently map them to colors

This ends up being a longer dplyr pipe than I usually like to code (I think very long ones are hard to follow) but it gets the job done and is still pretty readable:

birds %>%
  mutate(week=as.numeric(format(birds$date, "%Y%U"))) %>%
  arrange(week) %>%
  group_by(week, state) %>%
  tally(flock_size) %>%
  group_by(state) %>%
  mutate(cum=cumsum(n)) %>%
  ungroup %>%
  select(week, state, cum) %>%
  mutate(week=as.Date(paste(week, 1), "%Y%U %u")) %>%
  left_join(tidyr::expand(., week, state), .) %>%
  group_by(state) %>%
  do(na.locf(.)) %>%
  mutate(state_abb=state.abb[match(state, state.name)],
         cum=as.numeric(ifelse(is.na(cum), 0, cum)),
         brks=cut(cum,
                  breaks=c(0, 200, 50000, 1000000, 10000000, 50000000),
                  labels=c("1-200", "201-50K", "50k-1m",
                           "1m-10m", "10m-50m"))) -> by_state_and_week

Now, we perform the standard animation steps:

  • determine where we’re going to break the data up
  • feed that into a loop
  • partition the data in the loop
  • render the plot to a file
  • combine all the individual images into an animation

For this graphic, I’m doing something a bit extra. The color ranges for the hexbin choropleth go from very light to very dark, so it would be helpful if the titles for the states went from very dark to very light, matching the state colors. The lines that do this check for state breaks that fall in the last two values and appropriately assign "black" or "white" as the color.

i <- 0
 
for (wk in unique(by_state_and_week$week)) {
 
  # filter by week
 
  by_state_and_week %>% filter(week==wk) -> this_wk
 
  # hack to let us color the state labels in white or black depending on
  # the value of the fill
 
  this_wk %>%
    filter(brks %in% c("1m-10m", "10m-50m")) %>%
    .$state_abb %>%
    unique -> white_states
 
  centers %>%
    mutate(txt_col="black") %>%
    mutate(txt_col=ifelse(id %in% white_states, "white", "black")) -> centers
 
  # setup the plot
 
  gg <- ggplot()
  gg <- gg + geom_map(data=us_map, map=us_map,
                      aes(x=long, y=lat, map_id=id),
                      color="white", fill="#dddddd", size=2)
  gg <- gg + geom_map(data=this_wk, map=us_map,
                      aes(fill=brks, map_id=state_abb),
                      color="white", size=2)
  gg <- gg + geom_text(data=centers,
                       aes(label=id, x=x, y=y, color=txt_col), size=4)
  gg <- gg + scale_color_identity()
  gg <- gg + scale_fill_brewer(name="Combined flock size\n(all types)",
                               palette="RdPu", na.value="#dddddd", drop=FALSE)
  gg <- gg + guides(fill=guide_legend(override.aes=list(colour=NA)))
  gg <- gg + coord_map()
  gg <- gg + labs(x=NULL, y=NULL,
                  title=sprintf("U.S. Avian Flu Total Impact as of %s\n", wk))
  gg <- gg + theme_bw()
  gg <- gg + theme(plot.title=element_text(face="bold", hjust=0, size=24))
  gg <- gg + theme(panel.border=element_blank())
  gg <- gg + theme(panel.grid=element_blank())
  gg <- gg + theme(axis.ticks=element_blank())
  gg <- gg + theme(axis.text=element_blank())
  gg <- gg + theme(legend.position="bottom")
  gg <- gg + theme(legend.direction="horizontal")
  gg <- gg + theme(legend.title.align=1)
 
  # save the image
 
  # i'm using "quartz" here since I'm on a Mac. Use what works for your system to ensure you
  # get the best looking output png
 
  png(sprintf("output/%03d.png", i), width=800, height=500, type="quartz")
  print(gg)
  dev.off()
 
  i <- i + 1
 
}

We could use one of the R animation packages to actually make the animation, but I know ImageMagick pretty well so I just call it as a system command:

system("convert -delay 60 -loop 1 output/*png output/avian.gif")

All that results in:

avian

If that’s a static image, open it in a new tab/window (or just click on it). I really didn’t want to do a looping gif but if you do just make the -loop 1 into -loop 0.
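As an aside, the same assembly step could be done from R with the animation package instead of a manual convert call (a hedged sketch: saveGIF() still shells out to ImageMagick, and make_week_plot() is a hypothetical wrapper around the ggplot code in the loop above):

library(animation)
saveGIF({
  for (wk in unique(by_state_and_week$week)) {
    print(make_week_plot(wk))   # hypothetical helper returning the gg object for week wk
  }
}, movie.name="avian.gif", interval=0.6)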

Now, we can just re-run the code when the USDA refreshes the data.

The code, data and sample bitmaps are on github.


Last Post For A While, And Two Premium (Cheap) Databases


(This article was first published on QuantStrat TradeR » R, and kindly contributed to R-bloggers)

This will be my last post on this blog for an indefinite length of time. I will also include an algorithm to query Quandl’s SCF database, which is an update on my attempt to use free futures data from Quandl’s CHRIS database, which suffered from data integrity issues, even after attempts to clean it. Also provided is a small tutorial on using Quandl’s EOD database, for those that take issue with Yahoo’s daily data.

So, first off, the news…as some of you may have already heard from meeting me in person at R/Finance 2015 (which was terrific…interesting presentations, good people, good food, good drinks), effective June 8th, I will be starting employment in the New York area as a quantitative research analyst, and part of the agreement is that this blog becomes an archive of my work, so that I may focus my full energies on my full-time position. That is, it’s not going anywhere, so for those that are recent followers, you now have a great deal of time to catch up on the entire archive, which including this post, will be 62 posts. Topics covered include:

Quantstrat — its basics, and certain strategies coded using it, namely those based off of John Ehlers’s digital signal processing algorithms, along with custom order-sizing functions. A small aside on pairs trading is included as well.
Asset rotation — flexible asset allocation, elastic asset allocation, and most recently, classic asset allocation (aka the constrained critical line algorithm).
Seeking Alpha ideas — both Logical Invest and Harry Long’s work, along with Cliff Smith’s Quarterly Tactical Strategy (QTS). The Logical Invest algorithms do what they set out to do, but in some cases, are dependent on dividends to drive returns. By contrast, Harry Long’s ideas are more thought processes and proofs of concept, as opposed to complete strategies, often depending on ETFs with inception dates after the financial crisis passed (which I used some creativity for to backtest to pre-crisis timelines). I’ve also collaborated with Harry Long privately, and what I’ve seen privately has better risk/reward than anything he has revealed in public, and I find it to be impressive given the low turnover rate of such portfolios.
Volatility trading — XIV and VXX, namely, and a strategy around these two instruments that has done well out of sample.
Other statistical ideas, such as robustness heuristics and change point detection.

Topics I would have liked to have covered but didn’t roll around to:

Most Japanese trading methods — Ichimoku and Heiken Ashi, among other things. Both are in my IKTrading package, I just never rolled around to showing them off. I did cover a hammer trading system which did not perform as I would have liked it to.
Larry Connors’s mean reversion strategies — he has a book on trading ETFs, and another one he wrote before that. The material provided on this blog is sufficient for anyone to use to code those strategies.
The PortfolioAnalytics package — what quantstrat is to signal-based individual instrument trading strategies, PortfolioAnalytics is this (and a lot more) to portfolio management strategies. Although strategies such as equal-weight ranking perform well by some standards, this is only the tip of the iceberg. PortfolioAnalytics is, to my awareness, cutting edge portfolio management technology that can run the gauntlet from quick classic quadratic optimization to cutting-edge random-search global optimization portfolios (albeit those take more time to compute).

Now, onto the second part of this post, which is a pair of premium databases. They’re available from Quandl, and cost $50/month. As far as I’ve been able to tell, the futures database (SCF) data quality is vastly better than the CHRIS database, which can miss (or corrupt) chunks of data. The good news, however, is that free users can actually query these databases (or maybe all databases total, not sure) 150 times in a 60 day period. The futures script sends out 40 of these 150 queries, which may be all that is necessary if one intends to use it for some form of monthly turnover trading strategy.

Here’s the script for the SCF (futures) database. There are two caveats here:

1) The prices are on a per-contract rate. Notional values in futures trading, to my understanding, are vastly larger than one contract, to the point that getting integer quantities is no small assumption.

2) According to Alexios Ghalanos (AND THIS IS REALLY IMPORTANT), R’s GARCH guru, and one of the most prominent quants in the R/Finance community, for some providers, the Open, High, and Low values in futures data may not be based off of U.S. traditional pit trading hours in the same way that OHLC in equities/ETFs are based off of the 9:30 AM – 4:00 PM hours, but rather, extended trading hours. This means that there’s very low liquidity around open in futures, and that the high and low are also based off of these low-liquidity times as well. I am unsure if Quandl’s SCF database uses extended hours open-high-low data (low liquidity), or traditional pit hours (high liquidity), and I am hoping a representative from Quandl will clarify this in the comments section for this post. In any case, I just wanted to make sure that readers are aware of this issue.

In any case, here’s the data fetch for the Stevens Continuous Futures (SCF) database from Quandl. All of these are for front-month contracts, unadjusted prices on open interest cross. Note that in order for this script to work, you must supply quandl with your authorization token, which takes the form of something like this:

Quandl.auth("yourTokenHere")
require(Quandl)

Quandl.auth("yourTokenHere")
authCode <- "yourTokenHere"

quandlSCF <- function(code, authCode, from = NA, to = NA) {
  dataCode <- paste0("SCF/", code)
  out <- Quandl(dataCode, authCode = authCode)
  out <- xts(out[, -1], order.by=out$Date)
  colnames(out)[4] <- "Close"
  colnames(out)[6] <- "PrevDayOpInt"
  if(!is.na(from)) {
    out <- out[paste0(from, "::")]
  }
  if(!is.na(to)) {
    out <- out[paste0("::", to)]
  }
  return(out)
}

#Front open-interest cross
from <- NA
to <- NA

#Energies
CME_CL1 <- quandlSCF("CME_CL1_ON", authCode = authCode, from = from, to = to) #crude
CME_NG1 <- quandlSCF("CME_NG1_ON", authCode = authCode, from = from, to = to) #natgas
CME_HO1 <- quandlSCF("CME_HO1_ON", authCode = authCode, from = from, to = to) #heatOil
CME_RB1 <- quandlSCF("CME_RB1_ON", authCode = authCode, from = from, to = to) #RBob
ICE_B1 <- quandlSCF("ICE_B1_ON", authCode = authCode, from = from, to = to) #Brent
ICE_G1 <- quandlSCF("ICE_G1_ON", authCode = authCode, from = from, to = to) #GasOil

#Grains
CME_C1 <- quandlSCF("CME_C1_ON", authCode = authCode, from = from, to = to) #Chicago Corn
CME_S1 <- quandlSCF("CME_S1_ON", authCode = authCode, from = from, to = to) #Chicago Soybeans
CME_W1 <- quandlSCF("CME_W1_ON", authCode = authCode, from = from, to = to) #Chicago Wheat
CME_SM1 <- quandlSCF("CME_SM1_ON", authCode = authCode, from = from, to = to) #Chicago Soybean Meal
CME_KW1 <- quandlSCF("CME_KW1_ON", authCode = authCode, from = from, to = to) #Kansas City Wheat
CME_BO1 <- quandlSCF("CME_BO1_ON", authCode = authCode, from = from, to = to) #Chicago Soybean Oil

#Softs
ICE_SB1 <- quandlSCF("ICE_SB1_ON", authCode = authCode, from = from, to = to) #Sugar No. 11
ICE_KC1 <- quandlSCF("ICE_KC1_ON", authCode = authCode, from = from, to = to) #Coffee
ICE_CC1 <- quandlSCF("ICE_CC1_ON", authCode = authCode, from = from, to = to) #Cocoa
ICE_CT1 <- quandlSCF("ICE_CT1_ON", authCode = authCode, from = from, to = to) #Cotton

#Other Ags
CME_LC1 <- quandlSCF("CME_LC1_ON", authCode = authCode, from = from, to = to) #Live Cattle
CME_LN1 <- quandlSCF("CME_LN1_ON", authCode = authCode, from = from, to = to) #Lean Hogs

#Precious Metals
CME_GC1 <- quandlSCF("CME_GC1_ON", authCode = authCode, from = from, to = to) #Gold
CME_SI1 <- quandlSCF("CME_SI1_ON", authCode = authCode, from = from, to = to) #Silver
CME_PL1 <- quandlSCF("CME_PL1_ON", authCode = authCode, from = from, to = to) #Platinum
CME_PA1 <- quandlSCF("CME_PA1_ON", authCode = authCode, from = from, to = to) #Palladium

#Base
CME_HG1 <- quandlSCF("CME_HG1_ON", authCode = authCode, from = from, to = to) #Copper

#Currencies
CME_AD1 <- quandlSCF("CME_AD1_ON", authCode = authCode, from = from, to = to) #Ozzie
CME_CD1 <- quandlSCF("CME_CD1_ON", authCode = authCode, from = from, to = to) #Canadian Dollar
CME_SF1 <- quandlSCF("CME_SF1_ON", authCode = authCode, from = from, to = to) #Swiss Franc
CME_EC1 <- quandlSCF("CME_EC1_ON", authCode = authCode, from = from, to = to) #Euro
CME_BP1 <- quandlSCF("CME_BP1_ON", authCode = authCode, from = from, to = to) #Pound
CME_JY1 <- quandlSCF("CME_JY1_ON", authCode = authCode, from = from, to = to) #Yen
ICE_DX1 <- quandlSCF("ICE_DX1_ON", authCode = authCode, from = from, to = to) #Dollar Index

#Equities
CME_ES1 <- quandlSCF("CME_ES1_ON", authCode = authCode, from = from, to = to) #Emini
CME_MD1 <- quandlSCF("CME_MD1_ON", authCode = authCode, from = from, to = to) #Midcap 400
CME_NQ1 <- quandlSCF("CME_NQ1_ON", authCode = authCode, from = from, to = to) #Nasdaq 100
ICE_RF1 <- quandlSCF("ICE_RF1_ON", authCode = authCode, from = from, to = to) #Russell Smallcap
CME_NK1 <- quandlSCF("CME_NK1_ON", authCode = authCode, from = from, to = to) #Nikkei

#Bonds/rates
CME_FF1  <- quandlSCF("CME_FF1_ON", authCode = authCode, from = from, to = to) #30-day fed funds
CME_ED1 <- quandlSCF("CME_ED1_ON", authCode = authCode, from = from, to = to) #3 Mo. Eurodollar/TED Spread
CME_FV1  <- quandlSCF("CME_FV1_ON", authCode = authCode, from = from, to = to) #Five Year TNote
CME_TY1  <- quandlSCF("CME_TY1_ON", authCode = authCode, from = from, to = to) #Ten Year Note
CME_US1  <- quandlSCF("CME_US1_ON", authCode = authCode, from = from, to = to) #30 year bond
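Once fetched, the series are ordinary xts objects, so they can be combined for whatever comes next (a minimal sketch, not from the original script; the column names here are just examples):

require(quantmod)
energyCloses <- merge(Cl(CME_CL1), Cl(CME_NG1), Cl(CME_HO1))
colnames(energyCloses) <- c("CL", "NG", "HO")
tail(energyCloses)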

In this case, I just can’t give away my token. You’ll have to replace that with your own, which every account has. However, once again, individuals not subscribed to these databases need to pay $50/month.

Lastly, I’d like to show the Quandl EOD database. This is identical in functionality to Yahoo’s, but may be (hopefully!) more accurate. I have never used this database on this blog because the number one rule has always been that readers must be able to replicate all analysis for free, but for those who doubt the quality of Yahoo’s data, they may wish to look at Quandl’s EOD database.

This is how it works, with an example for SPY.

out <- Quandl("EOD/SPY", start_date="1999-12-31", end_date="2005-12-31", type = "xts")

And here’s some output.

> head(out)
             Open   High    Low  Close   Volume Dividend Split Adj_Open Adj_High  Adj_Low Adj_Close Adj_Volume
1999-12-31 146.80 147.50 146.30 146.90  3172700        0     1 110.8666 111.3952 110.4890  110.9421    3172700
2000-01-03 148.25 148.25 143.88 145.44  8164300        0     1 111.9701 111.9701 108.6695  109.8477    8164300
2000-01-04 143.50 144.10 139.60 139.80  8089800        0     1 108.3828 108.8359 105.4372  105.5882    8089800
2000-01-05 139.90 141.20 137.30 140.80 12177900        0     1 105.6631 106.6449 103.6993  106.3428   12177900
2000-01-06 139.60 141.50 137.80 137.80  6227200        0     1 105.4327 106.8677 104.0733  104.0733    6227200
2000-01-07 140.30 145.80 140.10 145.80  8066500        0     1 105.9690 110.1231 105.8179  110.1231    8066500

Note that this data is not automatically assigned to “SPY” the way quantmod’s getSymbols function fetching from Yahoo would do. Also, note that when calling the Quandl function on its EOD database, you automatically obtain both adjusted and unadjusted prices. One aspect I am not sure is as easily done through Quandl’s API is adjusting prices for splits but not dividends. But, for what it’s worth, there it is. So for those that take issue with the quality of Yahoo data, you may wish to look at Quandl’s EOD database for $50/month.

So…that’s it. From this point on, this blog is an archive of my work that will stay up; it’s not going anywhere. However, I won’t be updating it or answering questions on this blog. For those that have any questions about functionality, I highly recommend posting questions to the R-SIG Finance mailing list. It’s been a pleasure sharing my thoughts and work with you, and I’m glad I’ve garnered the attention of many intelligent individuals, from those that have provided me with data, to those that have built upon my work, to those that have hired me for consulting (and now a full-time) opportunity. I also hope that some of my work displayed here made it to other trading and/or asset management firms. I am very grateful for all of the feedback, comments, data, and opportunities I’ve received along the way.

Once again, thank you so much for reading. It’s been a pleasure.



A little growth curve analysis Q&A


(This article was first published on Minding the Brain, and kindly contributed to R-bloggers)
I had an email exchange with Jeff Malins, who asked several questions about growth curve analysis. I often get questions of this sort and Jeff agreed to let me post excerpts from our (email) conversation. The following has been lightly edited for clarity and to be more concise.


Jeff asked:
I’ve fit some curves for accuracy data using both linear and logistic approaches and in both versions, one of the conditions acts strangely. As is especially evident in the linear plots, the green line is not a line! Is this an issue with the fitted() function you’ve come across before? Or is this is a signal something is amiss with the model?
I answered:
In the logistic model, some curvature is reasonable because the model is linear on the logit scale, but that is curved when projected back to the proportions scale. Since all of the model fits look curved for the logistic model, that seems like a reasonable explanation.
I am not sure what is going wrong in your linear model, but one possibility is that it is some odd consequence of unequal numbers of observations (if that is even relevant here).
Unequal number of trials turned out to be part of the problem, which Jeff fixed, then followed up with a few more questions:
(1) If I create a first-order orthogonal time term and then use this in the model (ot1 in your code), my understanding is this is centered in the distribution as opposed to starting at the origin. So for linear models fit using ot1, an intercept term to me seems to be indexing global amplitude differences in the model fits (translation in the y-direction) rather than a y-intercept. Is this correct?
 (2) My understanding is that one only needs to generate orthogonal time terms if fitting second order models or higher. However, I performed a logistic GCA which was first order and it failed to converge when I used my raw time variable and only converged when I transformed it to an orthogonal polynomial with the same number of steps.
(3)  I am unclear as to when to include a “1” in the random effects structure for conditions nested within subjects. For example, what is the difference between (1+ot1 | Subject:Condition) and (ot1 | Subject:Condition)? I have the former in the linear GCA and the latter in the logistic GCA.
My answers:
(1) Yes, that is correct. I might quibble slightly with your terminology and say that you've moved the origin to the center of your time window rather than the baseline time point, but the concept is the same. Also, I find that having the intercept represent the overall average is often a helpful property because traditional "area under the curve" analyses are then represented by the intercept term.
(2) Well, centering the linear term does affect interpretation of the intercept, which is sometimes worth doing. However, I suspect you were thinking about de-correlating the time terms, in which case you are correct, that only matters when there are multiple time terms (starting with second-order polynomials). Your point about convergence is a slightly trickier issue. Convergence can be touchy and generally works better when the predictors are on similar scales. Raw time variables typically go from 0 to 10 or 20, but other predictors are often 0/1, so there is an order of magnitude difference there. The orthogonal linear time term should be in the -1 to 1 range, which can help with convergence.
(3) There is no difference between those two random effect definitions: the "random intercepts" (1 | ...) are included by default even if you omit the 1. Sometimes I include the 1 to make my code more transparent when I am teaching about GCA, but I almost never use it in my own code. Including the 1 can also make it easier to think about how to de-correlate random effects, but that's probably too tangential for now.
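To make points (1)-(3) concrete, here is a hedged sketch (the data frame dat and its columns are hypothetical, and lme4 is assumed):

library(lme4)
# dat: hypothetical data with columns Accuracy, Time, Subject, Condition
ot <- poly(unique(dat$Time), 1)                      # centered, orthogonal linear time term
dat$ot1 <- ot[match(dat$Time, unique(dat$Time)), 1]
# the two random-effect specifications below are identical: (1 | ...) is implied
m1 <- lmer(Accuracy ~ Condition * ot1 + (1 + ot1 | Subject:Condition), data=dat)
m2 <- lmer(Accuracy ~ Condition * ot1 + (ot1 | Subject:Condition), data=dat)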


Why has R, despite quirks, been so successful?


(This article was first published on Revolutions, and kindly contributed to R-bloggers)

I was on a panel back in 2009 where Bow Cowgill said, "The best thing about R is that it was written by statisticians. The worst thing about R is that it was written by statisticians." R is undeniably quirky — especially to computer scientists — and yet it has attracted a huge following for a domain-specific language, with more than two million users worldwide.

So why has R become so successful, despite being outside the mainstream of programming languages? John Cook adeptly tackles that question in a 2013 lecture, "The R Language: The Good The Bad And The Ugly" (embedded below). His insight is that to understand a domain-specific language, you have to understand the domain, and statistical data analysis is a very different domain  than systems programming. 

I think R sometimes gets a bit of an unfair rap from its quirks, but in fact these design decisions — made in the interest of making R extensible rather than fast — have enabled some truly important innovations in statistical computing:

  • The fact that R has lazy evaluation allowed for the development of the formula syntax, so useful for statistical modeling of all kinds.
  • The fact that R supports missing values as a core data value allowed R to handle real-world, messy data sources without resorting to dangerous hacks (like using zeroes to represent missing data).
  • R's package system — a simple method of encapsulating user-contributed functions for R — enabled the CRAN system to flourish. The pass-by-value system and naming notation for function arguments also made it easy for R programmers to create R functions that could easily be used by others.
  • R's graphics system was designed to be extensible, which allowed the ggplot2 system to be built on top of the "grid" framework (and influencing the look of statistical graphics everywhere).
  • R is dynamically typed and allows functions to "reach outside" of scope, and everything is an object — including expressions in the R language itself. These language-level programming features allowed for the development of the reactive programming framework underlying Shiny.
  • The fact that every action in R is a function — including operators — allowed for the development of new syntax models, like the %>% pipe operator in magrittr.
  • R gives programmers the ability to control the REPL loop, which allowed for the development of IDEs like ESS and RStudio.
  • The "for" loops can be slow in R which ... well, I can't really think of an upside for that one, except that it encouraged the development of high-performance extension frameworks like Rcpp.

Some languages have some of these features, but I don't know of any language that has all of these features — probably with good reason. But there's no doubt that without these qualities, R would not have been able to advance the state of the art in statistical computing in so many ways, and attract such a loyal following in the process.

 
