Visualizing your data gives you clues about how two variables relate to each other. Ignoring those clues can lead you to inaccurate conclusions.

Last week Education Sector, a nonprofit education think tank, announced something they are calling “Higher Ed Data Central.” They have taken a number of publicly available data sets and combined them into a single database.

On their blog, the Quick and the Ed, they started showing examples of what they could do with this data. On Friday they published a post including the graph below, which plots the number of administrators who make over \$100k per 1,000 students against tuition at private non-profit four-year universities.

Nice graph. Then they say this:

“Each additional highly compensated administrator per 1,000 students is correlated with \$1,120 higher tuition (R^2 = .42). (Of course, this doesn’t prove that higher administrative staffing causes higher tuition, merely that they are correlated – a deeper analysis would be needed to determine if there is causation.)”

Oh boy. A big reason to visualize your data in a scatterplot is so you can see things like the fact that this data is nonlinear. See how the points rise steeply and then bend over to the right? The pattern looks more like a lowercase letter “r” than a straight line. Even so, they have used a straight line to summarize the data.

At first I thought Education Sector was making Data Central publicly available. Unfortunately, when I went to look for it this morning, I found that it isn’t. They also don’t give the original sources of the data used to create their graph. So I created some simulated data in the pattern of the data they display. Here is a scatterplot of my data, with a linear regression line and a logarithmic curve added. Which one seems closest to the pattern of the data?
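For readers who want to try the same comparison, here is a minimal sketch of how it could be set up. Everything in it is an assumption made for illustration: the sample size, the range of administrators, the coefficients, and the noise level are invented, and this is not the simulated data behind the plot described above. It generates points that follow a logarithmic pattern, fits a straight line and a logarithmic curve by ordinary least squares, and compares the two R^2 values.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(ys, predictions):
    """Share of the variance in ys explained by the predictions."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical simulation settings (not taken from the original post).
rng = random.Random(0)
admins = [rng.uniform(1, 40) for _ in range(300)]  # administrators per 1,000 students
tuition = [5000 + 8800 * math.log(a) + rng.gauss(0, 3000) for a in admins]

# Straight-line model: tuition = intercept + slope * admins
lin_slope, lin_intercept = fit_line(admins, tuition)
r2_linear = r_squared(tuition, [lin_intercept + lin_slope * a for a in admins])

# Logarithmic model: tuition = intercept + slope * ln(admins)
log_admins = [math.log(a) for a in admins]
log_slope, log_intercept = fit_line(log_admins, tuition)
r2_log = r_squared(tuition, [log_intercept + log_slope * x for x in log_admins])

print(f"linear fit:      R^2 = {r2_linear:.3f}")
print(f"logarithmic fit: R^2 = {r2_log:.3f}")
```

Because the simulated points were generated from a logarithmic pattern, the logarithmic fit should come out ahead here. With real data, the scatterplot is what tells you which shape to try in the first place.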

What this curve tells us is that we can’t give a single value for how much tuition goes up per highly compensated administrator, because the relationship varies depending on where you are on the scale. Based on the equation for this curve, moving from 1 highly compensated administrator to 2 would be related to a \$6,114.76 increase, while moving from 25 to 26 would “just” be a \$345.99 increase.
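To see where numbers like these come from, assume the curve has the form tuition = a + b * ln(administrators). The post does not state the fitted equation, so this form and the coefficient below are assumptions. Under that form, the change from x to x + 1 administrators is b * ln((x + 1) / x), which shrinks as x grows. The sketch backs b out of the \$6,114.76 figure quoted above and then applies the same formula at other points on the scale.

```python
import math


def step_increase(x, b):
    """Change in predicted tuition when moving from x to x + 1 administrators,
    assuming a curve of the form tuition = a + b * ln(x). The intercept a
    cancels out of the difference, so it is not needed here."""
    return b * math.log((x + 1) / x)


# Assumption: b is inferred from the quoted 1 -> 2 increase, not taken from
# the fitted equation, which the post does not give.
b = 6114.76 / math.log(2)

for x in (1, 5, 10, 25):
    print(f"{x:>2} -> {x + 1:<2}: {step_increase(x, b):10.2f}")
```

With b inferred this way, the 25 to 26 step works out to roughly \$346, in line with the second figure quoted above. A straight line, by contrast, would report the same increase at every point on the scale.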

This model still isn’t great; it will underestimate tuition at those schools in the upper left, for example. It would be interesting to see if there are other things in common among the schools out to the far right on number of highly compensated administrators (remember, all these schools are private, non-profit). We might also want to think about transforming these non-normal variables prior to the analyses. However, this is as far as I’ll go with my “simulated” data for now.

One of the emphases of the research center I’m part of at Pearson is data visualization. This isn’t just for purposes of communication to others, but for your own understanding of the data. It should lead you to better models of relationships.