If We Assume: September 2012

US Population By Longitude & Latitude

My friend Eddie reminded me of this nice set of figures that made the rounds a few months ago on Visual.ly (and also Reddit).

I thought it would be cool to zoom in on the USA in high detail using data from the 2010 US Census... so that's what I did. As a bonus, the pixels represent the density of ZIP codes in small bins, which essentially track population density. Enjoy!
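For anyone curious how pixels like these could be built, here is a minimal sketch that bins points on a longitude/latitude grid. The coordinates are random stand-ins rather than the real ZIP code locations, and the bin counts and bounding box are assumptions chosen only for illustration.

```python
import numpy as np

# Hypothetical stand-in for ZIP code centroids (longitude, latitude in degrees).
rng = np.random.default_rng(0)
lon = rng.uniform(-125.0, -66.0, size=5000)
lat = rng.uniform(24.0, 50.0, size=5000)

# Count ZIP codes in small longitude/latitude bins; each bin becomes one pixel.
counts, lon_edges, lat_edges = np.histogram2d(
    lon, lat, bins=[300, 150], range=[[-125.0, -66.0], [24.0, 50.0]]
)

# Transpose so rows run along latitude, the orientation image plotters expect.
image = counts.T
print(image.shape, int(image.sum()))
```

With real coordinates in place of the random ones, `image` could be handed to any image-plotting routine.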

NFL Replacement Referee Bias

1 comment: Topics: sports, statistics

Today I'm happy to feature the first guest post on If We Assume, written by fellow astronomer Peter Yoachim! He's discussing the now-famous debacle by the replacement referees (see also here) that occurred in last night's Seattle Seahawks game (some are calling it the "worst call in NFL history"). Take it away, Peter...

Getty Images

After watching the refs botch Monday Night Football (Go Seahawks!?), I was wondering if there's a way to quantify just how bad the NFL replacement referees are.

One thing that stood out in the game was how many calls went the Seahawks' way on the final drive--which reminded me of the discussion of home-field advantage in Scorecasting. The authors concluded that referee bias is the primary driver of home-field advantage across sports. They even note that in the NFL from 1985-1998, the home team won 58.5% of the time, but after instant replay was introduced, the home team won only 56% of the games (1999-2008).

If the replacement refs are much worse than the regulars, we might expect the home-field advantage to grow. My logic: if the refs are botching more calls, and those botched calls tend to favor the home team, then the home team gains an advantage and should win more.

How have home teams fared so far? After 48 games this NFL season, the home teams have a record of 31 wins and 17 losses, for a whopping 64.6% win rate! But is that significantly more than 56%? 31 wins is actually only 4 more wins than we would have expected with the regular refs (56% of 48 games is about 27 wins). As always happens when I try to calculate the statistical significance of something, I got bogged down in an arcane Wikipedia page, which told me to look up some value from a table. Whenever a statistician tells me to look something up in a table, I reply, "Fuck that, I can Monte Carlo this in 5 lines of Python." So I did:

import numpy as np

# play 10,000 seasons of football with 48 games each
hg = np.random.rand(10000, 48)
# home team wins 56% of the time; total up the wins per season
wins = np.sum(hg < 0.56, axis=1)
# percentage of simulated seasons with 31 or more home wins
pct = np.mean(wins >= 31) * 100
print('probability of home team winning 31 or more games with 1999-2008 refs = %.2f%%' % pct)

If you run that, you find out that we would expect 31 (or more) home team wins only 15% of the time. To turn it around: 85% of the time the home teams have fewer wins at this point in the season. We normally say something is significant when we reach the 5% level, so we're not there yet. If the home teams keep winning at a 65% rate (or higher) for 3-4 more weeks we should make it to significance! That's about the only reason I've found to root for the replacement refs sticking around--damned scabs!
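As a cross-check on the simulation, the same tail probability can be summed exactly from the binomial distribution, no table required. This is a sketch rather than part of the original analysis: `binomial_tail` is a helper name made up here, and it assumes every game is an independent coin flip with a 56% home win chance.

```python
from math import comb

def binomial_tail(n, k, p):
    """Probability of k or more successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

games, home_wins, p_home = 48, 31, 0.56
expected_wins = games * p_home
p_value = binomial_tail(games, home_wins, p_home)
print('expected home wins: %.1f' % expected_wins)
print('P(31 or more home wins) = %.1f%%' % (100 * p_value))
```

The exact answer should agree with the Monte Carlo estimate to within the simulation's sampling noise.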

NCAA Football Coach Salary vs. Wins

3 comments: Topics: academia, costs, sports, statistics

A question was posed to me the other day: "Is Steve Sarkisian (head coach for UW's football team) worth the money we're paying him?" For the record, his salary is currently around $2.25 million, though he's not paid with taxpayer money.

The question of an employee's worth intrigues me. No doubt people have strong opinions/feelings on the matter. How do we quantify this to answer it objectively?

In the case of a factory worker, we might say that the number of gizmos he/she produces per hour without error determines their value. I don't think this kind of metric works for things like teachers...

Still, football coaches are often judged based on their team's performance. So I decided that the best way to answer the question was to compare the salaries and win/loss records for NCAA FBS (aka Division I-A) coaches.

Detailed data on coach salaries wasn't super easy to find. I would have liked a neat & tidy table with salary broken down by year for each coach, but alas. I did find this nice compilation by USA Today, and I grabbed win/loss stats here. Note: for my analysis I have not followed up on any of these stats/teams individually, so no doubt there have been hires/fires and raises/cuts which will affect the specific details.

The correlation between higher pay and better winning percentage is promising. The median salary is $1.46 million. Texas is doing well, but boy they're paying for it! I then subtracted a linear fit (dashed line) from the winning percentages to determine the typical scatter.

The standard deviation in winning percentage at a given salary is +/- 12%. All the coaches that fall within this "region of acceptable performance" are highlighted in purple. I believe these coaches are "worth it". Twice the standard deviation is gold/yellow. Coaches in this region should either be asking for a raise, or watching for the hammer to fall.
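The fit-and-scatter step can be sketched in a few lines. The salary and winning-fraction numbers below are invented for illustration and are not the table from this post; `np.polyfit` stands in for whatever fitting routine was actually used.

```python
import numpy as np

# Hypothetical (salary in $ millions, winning fraction) pairs, not the real table.
salary = np.array([0.4, 0.9, 1.2, 1.5, 2.0, 2.3, 3.0, 3.8, 4.5, 5.2])
win_frac = np.array([0.35, 0.48, 0.40, 0.55, 0.50, 0.72, 0.60, 0.66, 0.58, 0.80])

# Straight-line fit of winning fraction against salary.
slope, intercept = np.polyfit(salary, win_frac, 1)

# Residuals measure how far each coach sits above or below the trend.
residual = win_frac - (slope * salary + intercept)
sigma = residual.std()

# Bin each coach by how many standard deviations they are from the fit.
n_sigma = np.abs(residual) / sigma
band = np.where(n_sigma <= 1, 'within 1 sigma',
                np.where(n_sigma <= 2, 'within 2 sigma', 'outlier'))
for s, r, b in zip(salary, residual, band):
    print('$%.1fM  residual %+.2f  %s' % (s, r, b))
```

A positive residual means more wins than the salary trend predicts, and a negative one means fewer.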

There are a few major outliers that bear mention. Boise State is getting a whopper of a deal (as noted in Fig 1), as is Ohio State. On the unfortunate side, Duke is very far below par: the sole outcast in the negative 3rd standard deviation. This doesn't bode well for an athletics program under scrutiny to dial back costs.

So this has all been in good fun, and I certainly hope no one is actually fired on my account! Looking at Washington in particular, Sark seems to be just below the standard deviation, but in fairness he's only been coach since 2009. After our victory over Portland State this weekend, I'm hopeful he'll make up some lost ground this year!

The full table of data is below the fold...
Update: due to demand from the wise folks on Reddit, I have updated the table to be sorted by School name ~~and added helmet thumbnails.~~ (I took them down, it seemed to be causing havoc with his website)

Astronomy Programming

1 comment: Topics: Astronomy, statistics, visualization

Today I'm attending a workshop on "astroinformatics" at Microsoft Research, and one question that has come up all morning is: how much computer science do we need to teach astronomy students?

Here is my summary slide on the problem:

Clearly there is need for people at all levels of expertise, but how much do you need to know to actually do research? What do you think?

The Pace of NSF Funded Research

8 comments: Topics: academia, Astronomy, costs, statistics

Recently on Facebook I came across a note by Chris Erdmann that some handy folks at Harvard put together statistics on (nearly) every astronomy paper from 1995 to present that was funded through an NSF AST grant. This seemed like a really interesting dataset, especially for a young (read: financially uncertain) researcher such as myself.

So parsing through all 29,042 papers listed, here are two interesting things I've learned...

1. A typical AST grant produces < 10 papers

This is a simple histogram of the number of papers each unique Grant Number produced. Many people have only produced 1 paper with a grant, but the average is about 8.75 (and the median is 3).

Distribution of number of papers per grant, with a mean of 8.75 papers per grant (blue dashed line)
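Counting papers per grant comes down to tallying how often each grant number appears. Here is a minimal sketch using made-up grant numbers (the `AST-000x` labels are placeholders, not real awards):

```python
from collections import Counter
from statistics import mean, median

# Hypothetical grant numbers, one entry per paper (the real list has 29,042 rows).
paper_grants = ['AST-0001', 'AST-0001', 'AST-0002', 'AST-0003', 'AST-0003',
                'AST-0003', 'AST-0003', 'AST-0003', 'AST-0003', 'AST-0004']

# Papers produced by each unique grant number.
papers_per_grant = Counter(paper_grants)
counts = list(papers_per_grant.values())

print('grants: %d' % len(counts))
print('mean papers per grant: %.2f' % mean(counts))
print('median papers per grant: %.1f' % median(counts))
```

Even in this toy list one prolific grant pulls the mean above the median, which is the same skew that separates the 8.75 mean from the median of 3 in the real data.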

2. A grant has its peak output of papers at 3.1 years

This is a more intriguing figure to me. I've plotted (Year Published - Year Grant Awarded) as a function of Year Grant Awarded for all 29K papers, and then binned it up with pixels. You can clearly see the peak productivity between 2-4 years. I've marked the mean (solid orange) and standard deviation (dashed orange) lines for each year.

"nsfastgrantbib" data from the Astronomy Dataverse
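The lag statistics behind a figure like this can be sketched as follows. The award and publication years are invented for illustration, and the binning choices are assumptions, not the ones used for the plot.

```python
import numpy as np

# Hypothetical (year awarded, year published) pairs, one per paper.
awarded = np.array([2000, 2000, 2000, 2000, 2001, 2001, 2001, 2002, 2002, 2002])
published = np.array([2001, 2003, 2003, 2005, 2003, 2004, 2005, 2004, 2005, 2006])

# Lag between the grant award and each resulting paper, in years.
lag = published - awarded

# Mean and standard deviation of the lag for each award year.
for year in np.unique(awarded):
    in_year = lag[awarded == year]
    print('%d: mean lag %.2f yr, stddev %.2f yr' % (year, in_year.mean(), in_year.std()))

# Counts per (award year, lag) cell, the quantity the pixels show.
grid, _, _ = np.histogram2d(awarded, lag, bins=[3, 6], range=[[2000, 2003], [0, 6]])
```

Each row of `grid` is one award year and each column is one year of lag, so the brightest cell in a row marks that year's peak output.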

Conclusion

Following this second figure, we expect that the grants from 2009 onward have produced maybe only half of their useful (publishable) results so far, even though funding for many of them is dwindling.

Likewise, it's very encouraging that the grants awarded today will still be producing usable science well into the next decade!

Finally, we are reassured that Astronomy is an exciting and fast-paced field, and that continued strong funding is required to preserve this fact.

Data Reference
Christopher Erdmann; Louise Rubin, "Compiled List of NSF Grants to ADS Records from 1995 to August 2012", hdl:10904/10152 V2 [Version]

Voyager Data

No comments: Topics: Astronomy

The Astronomy Picture of the Day (aka APOD) today features a super cool plot of the dramatic up-turn in cosmic ray detection rate over the last year that Voyager 1 has measured, which may indicate that the spacecraft is entering true interstellar space.

Here for comparison is the full data (publicly available, thanks NASA!!) from the cosmic ray instrument over the lifetime of both twin Voyager probes.

The upturn (highlighted by the red bar) is very noticeable from 2011 onwards. Sinusoidal modulations, measured by both spacecraft, seem to occur on roughly an 11-year cycle. I think this figure is even more remarkable than the 1-year subset shown on APOD, as it gives you a flavor for both the history of the Voyager mission and the magnitude of the discovery.

I frequently think how the seemingly menial data we collect today might be used in the future, and this figure is a grand example. If we want to discover the subtle things, things that take a long time to find, or only change noticeably on timescales longer than a human lifespan, then we need to take the best data possible now. This is our investment in the future of astronomy, and why you mind your p's and q's when reporting and storing your observations.