If We Assume: April 2012

I live in Seattle, the self-proclaimed "coffee capital" of the USA. We have a Starbucks on every corner downtown, independent cafes all over, and even some Starbucks stores operating incognito as indie coffee shops. It's a wonderfully caffeinated culture we're brewing out here.

A long-standing joke among my friends at the University of Washington (UW): we drink so much coffee that we put a cafe in nearly every building on campus. This isn't quite true, of course, but we do have many!

While driving home from school today, it occurred to me to turn the joke around: how far from a cafe can you get on UW's campus? Are you ever more than 2 minutes away from a coffee stand?

So I set about finding the answer!

I found a link while browsing reddit this afternoon (from r/dataisbeautiful) that pointed to a community data visualization challenge. The source for the data was a "Global Water Experiment", which provided some basic measurements of water characteristics, such as pH.

Though the data challenge had passed its deadline by a couple weeks, I was still intrigued, and so I used a couple free hours while some code was running on my work machine to play with this "Global Experiment".

Aside: It is becoming clear to me that I need to learn some new data visualization software, and I haven't been very impressed with most graphics I've seen from Python. R sounds like a cool option: it's free and widely used... we'll see what I get into.

I downloaded the .xls file, and while looking through the data on pH levels a question popped into my head!

Question: does the "financial prowess" of a country correlate with the quality of its drinking water as tracked by pH levels?

This really reminded me of the (now famous) TED talk by Hans Rosling, and his cute data visualization that everyone gushed over a few years ago - and it really is a cool talk, btw. Some quick searching online suggests that this sort of effect has been investigated before, but results have been largely inconclusive, or consistent with no correlation.

The challenge data provided me with over 2000 pH measurements from almost a hundred countries, though the sampling seemed sporadic between countries. I had to spend some time "cleaning" the data file by making names of countries uniform (e.g. USA became United States) and getting rid of non-standard characters.
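For the curious, here is roughly what that kind of cleanup could look like in Python. This is only a sketch, not the code I ran: the `ALIASES` table and the `clean_country` helper are made-up names for illustration, and the real spreadsheet surely had more variants than the few listed here.

```python
import unicodedata

# Hypothetical alias table: maps lower-cased variants to one canonical name.
# The actual variants present in the challenge spreadsheet are not listed here.
ALIASES = {
    "usa": "United States",
    "u.s.a.": "United States",
    "uk": "United Kingdom",
}

def clean_country(raw):
    """Return a uniform, plain-ASCII country name."""
    # Split accented characters into base letter + accent mark,
    # then drop everything that is not plain ASCII (the accent marks).
    decomposed = unicodedata.normalize("NFKD", raw)
    ascii_name = decomposed.encode("ascii", "ignore").decode("ascii")
    # Collapse runs of whitespace and trim the ends.
    name = " ".join(ascii_name.split())
    # Swap known variants for the canonical spelling.
    return ALIASES.get(name.lower(), name)
```

Calling `clean_country(" USA ")` gives `"United States"`, and accented names lose their accents rather than turning into garbage characters.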

The financial numbers came from Gross Domestic Product (GDP) data, provided courtesy of the CIA as it so happened!

So armed with my two sources of data, I set about the ever-fun task of string matching. It's somewhat of a clunky operation in IDL, and I always recommend people try a few things when doing this:

trim any leading/trailing spaces using STRTRIM(inputstring,2)
convert everything to lower case using STRLOWCASE(inputstring)
if things vary too much, you can try matching over only part of the string using the vector functionality of STRMID()
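The same three steps translate almost directly to other languages. Below is a rough Python analogue, assuming each data set is just a list of country name strings; `match_key` and `match_countries` are hypothetical helper names, and the `width` parameter plays the role of the length argument you would hand to STRMID().

```python
def match_key(name, width=None):
    """Build a comparison key from a raw country name."""
    # Trim both ends and lower-case, like STRTRIM(s, 2) and STRLOWCASE(s).
    key = name.strip().lower()
    # Optionally keep only the first `width` characters, like STRMID(s, 0, width).
    return key if width is None else key[:width]

def match_countries(ph_names, gdp_names, width=None):
    """Map each pH-table name to the GDP-table name with the same key."""
    gdp_by_key = {match_key(g, width): g for g in gdp_names}
    matches = {}
    for p in ph_names:
        key = match_key(p, width)
        if key in gdp_by_key:
            matches[p] = gdp_by_key[key]
    return matches
```

A word of caution on the partial match: short prefixes can collide (two different countries starting with the same few letters end up with the same key), so it is worth eyeballing the matches whenever `width` is set.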

I matched the GDP and pH data, and then selected only the freshwater samples. I was a bit hasty with the matching, and probably threw out some of the pH samples by accident. No doubt I also missed some country name matches.

For each country with GDP and pH data, I measured the mean (average) and standard deviation of the pH samples, with Std Dev only calculated for countries with 2 or more samples. The immediate red flag that should go up in your mind (or certainly did for me) was: don't some countries have drastically different water environments that are being tested? Absolutely! As I mentioned above, this study didn't seem to guarantee any certain degree of accuracy or completeness. Furthermore, as a good friend/mentor of mine once reminded me when I was an undergrad, science requires error bars! These data have none that I could find.
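The per-country statistics are simple enough to sketch in a few lines of Python. This assumes the samples arrive as (country, pH) pairs and uses the sample standard deviation (dividing by n - 1); both are my assumptions for the sketch, and `summarize_ph` is a made-up name.

```python
from collections import defaultdict
from statistics import mean, stdev

def summarize_ph(samples):
    """Return {country: (mean_pH, std_dev_or_None, n_samples)}."""
    # Group the pH readings by country.
    by_country = defaultdict(list)
    for country, ph in samples:
        by_country[country].append(ph)

    summary = {}
    for country, values in by_country.items():
        # A standard deviation needs at least 2 samples; otherwise leave it undefined.
        spread = stdev(values) if len(values) >= 2 else None
        summary[country] = (mean(values), spread, len(values))
    return summary
```

Countries with a single sample come back with `None` for the spread, which makes it easy to plot them without error bars.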

The Std Dev may or may not be terribly useful, but I felt the means were quite illuminating. Here is the figure:

Figure 1: Top) Mean temperature reported vs GDP. Bottom) Mean water pH level reported, with standard deviations for each country shown as error bars. A linear least squares fit is provided in red.

As you can see from this figure, there is a weak trend in pH with GDP. Also curious to me was the apparent anti-correlation between temperature and GDP. Evidently, if you want to live in a well-to-do country, live somewhere cooler!
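If you want to try a fit like the red line yourself, here is a minimal least squares sketch in Python. It assumes the fit is done against log10(GDP), which seems natural for data spanning several orders of magnitude but is an assumption on my part, and it also returns the Pearson correlation coefficient as a rough gauge of how weak a "weak trend" is. The function name `fit_line` is hypothetical.

```python
import math

def fit_line(gdp, mean_ph):
    """Ordinary least squares fit of mean pH against log10(GDP).

    Returns (slope, intercept, r), where r is the Pearson correlation.
    Assumes at least two countries with differing GDP and differing pH.
    """
    x = [math.log10(g) for g in gdp]
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(mean_ph) / n

    # Sums of squared deviations and cross-deviations about the means.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in mean_ph)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, mean_ph))

    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r
```

A negative slope means pH drops as GDP grows, and an r close to zero means the scatter swamps the trend. Note that this simple version weights every country equally and ignores the per-country standard deviations.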

That furthest-right data point: the good ol' USA of course! Several other name-brand western countries make up the right-most portion of this figure (such as the UK, Canada, Australia). Noticeably absent from this data set: freshwater measurements from China.

Conclusion: There seems to be a weak trend, with water pH levels for more "developing" countries preferentially basic. Intermediate economies seem to be quite spread in pH levels. The most powerful nations continue the rough trend seen over four orders of magnitude in GDP: a decline from alkaline toward the neutral pH of pure water.

The rough trend may seem encouraging for politicians, but the significant scatter represents both the intrinsic noise in the data, and a wider ecological issue. Consider that the USA has the highest GDP listed by a wide margin, as well as one of the widest standard deviations in pH. Several countries with GDPs smaller than those of most American states have markedly better water quality. This supports the indication (mentioned in the abstract linked near the beginning) that it is not the GDP which contributes most to water quality. I am left thinking of a few better candidates: social/political forces, environmental regulations, natural resources, or maybe just number of trees...

That's enough wild (data-less) speculation for tonight I'd say.

Coffee: Two Minutes from Anywhere

Water Water