
Translating SPSS to R: Descriptives

Before my students do any inferential statistics, they learn to do descriptives. Rule #1 in my class is “always plot your data.” So they first learn to get means and other descriptives, generate box plots and histograms, and do things like z-score transforms. The good news in SPSS is that most of these things can be done with a single procedure called EXAMINE. The z-score transforms are done in a strange way in SPSS, essentially as a side effect of the DESCRIPTIVES procedure. Here’s a snippet of SPSS from a recent homework assignment, based on wide receiver statistics from the 2011 NFL season. I asked students to compute some summary measures, then compute an index of player “goodness” by combining z-scores computed from other measures, and finally look at that goodness variable and identify the high- and low-scoring players.

SPSS Version
Here’s the SPSS code that I provided as part of the answer key:

/* add labels to the values for the ‘conference’ variable */
/* this creates the three new variables */
COMPUTE TotalTD = PassTD + RushTD + RecTD + RetTD.
COMPUTE TotalYards = PassYards + RecYards + RushYards + RetYards.
COMPUTE Turnovers = Int + FumLost.
/* this generates z-scores */
DESCRIPTIVES TotalTD (TDZ) TotalYards (YardZ) Turnovers (TOZ).
/* this computes the new goodness variable */
COMPUTE Goodness = (TDZ + YardZ - TOZ)/3.
/* this will generate the box plot, list the top and bottom five */
/* cases, and label them with names */

This produces all kinds of output, including a box plot, a histogram, all kinds of descriptives, and the most extreme cases labeled by the name of the player associated with each case. Here’s what it looks like:

That’s a lot of information from not very much code.

R Version
R doesn’t really have anything quite the same as EXAMINE, but it offers essentially the same functionality, just in individual pieces. However, right away, we have to deal with some issues in R that never come up in SPSS, namely loading external packages, namespace collisions, and data type coercion.

Computing variables and producing plots is, fortunately, quite straightforward:

## add labels
Conf = factor(Conf, labels=c("AFC","NFC"))

## compute new variables
TotalTD = PassTD + RushTD + RecTD + RetTD
TotalYards = PassYards + RecYards + RushYards + RetYards
Turnovers = Int + FumLost

## compute z-scores
TDz = scale(TotalTD)
Yardz = scale(TotalYards)
TOz = scale(Turnovers)

## compute goodness
Goodness = (TDz + Yardz - TOz)/3

## generate plots
hist(Goodness, col="lightgray")
boxplot(Goodness, ylab="Goodness score", col="lightgray")
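
For reference, scale() with its default arguments subtracts the mean and divides by the sample standard deviation (the n - 1 version that sd() computes). A minimal check in base R, using made-up numbers rather than the NFL data, might look like this:

```r
# Made-up values for illustration; not taken from the homework data
x <- c(2, 4, 4, 5, 7, 9)

z_scale  <- as.numeric(scale(x))      # scale() output, flattened to a plain vector
z_manual <- (x - mean(x)) / sd(x)     # the same transform written out by hand

all.equal(z_scale, z_manual)          # TRUE
```

The as.numeric() call is only there so the two results can be compared as plain vectors.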

R generally produces slightly more aesthetic histograms and box plots than SPSS, and this case is no exception:

One really annoying thing about R plots is the amount of dead whitespace generated around each figure. I didn’t do anything with this output, just copied and pasted it here; the extra whitespace came for free.

There’s also the annoying inconsistency in whether the graph gets a title or not. Yes, you can turn off the title on the histogram or add one to the box plot, but the real question is why are the defaults different in the first place?
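
Both complaints can be worked around with a few extra arguments. The sketch below is one possible approach, not a recommendation from the original homework: it shrinks the margins with par(mar = ...) (the default is c(5, 4, 4, 2) + 0.1) and sets main explicitly on both plots so the titles match. The data are random stand-ins and the output file is a temporary PDF chosen only for illustration.

```r
# Random stand-in for the Goodness variable; not the real homework data
set.seed(1)
goodness <- rnorm(50)

# Write both plots side by side into a temporary PDF (hypothetical output location)
out_file <- tempfile(fileext = ".pdf")
pdf(out_file, width = 8, height = 4)

# Tighter margins than the defaults; par() returns the old settings so they can be restored
old_par <- par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))

# Give both plots an explicit title so the defaults no longer differ
hist(goodness, col = "lightgray", main = "Goodness", xlab = "Goodness score")
boxplot(goodness, col = "lightgray", main = "Goodness", ylab = "Goodness score")

par(old_par)
dev.off()
```

Passing main = "" to hist() instead would drop the title from the histogram, which matches the box plot default.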

The next problem is getting out meaningful descriptives like the kind of things SPSS produces with EXAMINE. R doesn’t have a good function for this in the base installation, but there are some reasonable functions in other packages. The “describe” function in the “psych” package is pretty good:


and it produces this for output:

  var  n mean   sd median trimmed  mad   min  max range skew kurtosis   se
1   1 50    0 0.57  -0.08   -0.02 0.45 -1.51 1.46  2.97 0.36     0.72 0.08

That’s pretty close in terms of what information is reported. However, it requires installing a package that isn’t part of the default installation, so students have to be walked through how to install packages. That is not really hard, but it certainly seems like unnecessary extra work to any new student trying to decide which stats package to use.

The other problem is that this doesn’t show us the most extreme values of the distribution like SPSS does. The good news is that there is a function that will do this easily. The bad news is that it is also called “describe” and it lives in a different package, “Hmisc,” so now students have to be introduced to the idea of namespaces and how collisions are resolved. Furthermore, the “scale” function used to compute z-scores earlier does not return vectors for vector input; it returns single-column matrices, and Hmisc’s “describe” function doesn’t work on those, so the data have to be coerced before this will work. Explaining coercion of data types is not exactly a thrill and is something that almost never needs to be done in SPSS. Anyway, here’s the code:
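
The coercion and namespace steps described above can be sketched roughly as follows. This is an illustration with made-up numbers, not the original answer-key listing. It assumes only base R, and the Hmisc call runs only if that package happens to be installed.

```r
# Made-up turnover counts for illustration; not the NFL data
turnovers <- c(0, 1, 3, 2, 4)

z_matrix <- scale(turnovers)        # scale() returns a 5 x 1 matrix, not a vector
z_vector <- as.numeric(z_matrix)    # drops the matrix attributes, leaving a plain numeric vector

is.matrix(z_matrix)   # TRUE
is.matrix(z_vector)   # FALSE

# The package::function form names the intended describe() explicitly,
# which sidesteps the collision between psych and Hmisc
if (requireNamespace("Hmisc", quietly = TRUE)) {
  print(Hmisc::describe(z_vector))
}
```

Writing Hmisc::describe() or psych::describe() makes it clear which function is meant, no matter which package was loaded last.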


There is a lot of output when the “Hmisc” library is loaded, at least for me, that I’m going to skip. The meat of it is here:

 n missing unique      Mean      .05      .10      .25      .50     .75     .90     .95
50       0     50 6.916e-17 -0.80783 -0.58925 -0.30460 -0.07741 0.32005 0.63412 0.98251

lowest : -1.5066 -0.9281 -0.9104 -0.6824 -0.6370, highest: 0.7287 0.8394 1.0996 1.4618 1.4620

So, now students can go view the data frame and determine which players have those high and low values. I’m sure it’s possible to write a function that will find the lowest/highest value and report the value of the name variable for that value, but I promise you that most new Psychology graduate students aren’t going to do it that way; “write a function to do something” just isn’t how most of them think about the world.
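
For what it’s worth, listing the extreme cases alongside their names does not strictly require writing a function. One possible approach in base R is to sort the data frame and take the top and bottom rows. The data frame, column names, and scores below are hypothetical stand-ins, not the homework data:

```r
# Hypothetical stand-in for the homework data frame; names and scores are invented
players <- data.frame(
  Name = c("Player A", "Player B", "Player C", "Player D", "Player E", "Player F"),
  Goodness = c(0.42, -1.10, 1.35, -0.25, 0.08, -0.60),
  stringsAsFactors = FALSE
)

# Sort rows by Goodness in ascending order
ranked <- players[order(players$Goodness), ]

lowest  <- head(ranked, 2)   # the two lowest-scoring rows, names included
highest <- tail(ranked, 2)   # the two highest-scoring rows, names included

print(lowest)
print(highest)
```

Changing the 2 to a 5 would give the five lowest and five highest cases, which is closer to what EXAMINE reports.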

If there’s a better way to do this in R, I’d love to know what it is. The problem that this illustrates, however, is that there’s also not really good documentation for R. Yes, every individual function has documentation, but if you don’t know what function you need, that’s not very helpful. Mostly I figure out how to do things in R via Googling and hoping someone has written decent documentation on a Web site or someone else has already asked on a Web forum like StackOverflow. Yes, there are books, but graduate students aren’t going to pay for any books that they don’t have to, and of course there are many, many books out there on R; to which one should I direct them? I have no idea.

For this task, SPSS wins by a long shot. There are way too many hoops to jump through in R just to get simple descriptive statistics. I know that if I were a new student who had to do this in R and I had fellow students who got to do it in SPSS, I’d be peeved.
