In this post I am going to go through a crowdsourced data project for determining the radius of the Earth. It’s part of an exercise I gave to my Physics lab students as an introduction to data science, and here I am going to go through the theory and evaluation of the data to see how we can use it to determine the radius of the Earth.
First, all credit where credit is due: this is the brain child of Lauren Weiss, over at the RockyWorlds blog. She envisioned a world-wide initiative to measure the shadow of a meter stick a noon, mimicking the classical measurement of Eratosthenes in ~300 BC. It’s a great exercise for a middle school/high school science class, and I thought it would really great if we really did have data from all over the world to try and mine it as much as possible. This was the assignment I gave to my laboratory students.
Unfortunately, it didn’t end up being “world-wide”. There were 33 entries, and although the US was well-covered, the only international data came from Mexico and the Canary Islands. However, that should not dissuade us from treating the data like data and seeing what the results are. After all, this project is really about the process of science – that’s why I gave it to my students, and that’s why I wanted to try and do a careful analysis of my own.
A quick description of what we are doing – measuring the shadow of a meter stick at as many locations around the globe at the moment the Sun is highest in the sky. It should not be hard to see that because the Earth is a curved surface, the length of the shadow will be different depending on where your observation was made. To see carefully how one can get the radius of the Earth from a set of these measurements, we need a little geometry.
The Sun’s rays travel in parallel lines to the surface of the Earth, since the Sun is so far away. When they fall on a meter stick as shown, they make shadows which form little triangles. The angle between the Sun’s rays and the meter sticks can be found using . This can be be done at two different locations to find two different angles, and . Looking at the picture, we can use simple geometry to relate those angles to the angle formed by the two radial lines from the meter sticks to the center of the Earth, . Using this and the basic definition of the angle as , we can look up the distance between the two points on the Earth’s surface and determine the radius of the Earth .
At least, in principle. There are several complications. The major one is that the above analysis really only works if the longitudes of the two locations are the same. Apparently that is how the “classical” measurement was done. In my class, I had the students pick groups of measurements which were done at approximately the same longitude, and that worked well. But I want to be able to use all available data to try and determine the best value for the radius of the Earth that I can, so I need to take the different longitudes into account.
Let me introduce the theoretical tool that we can use to do this: the Haversine formula:
(This isn’t quite the form of the Haversine function you’ll see in the Wikipedia link above, but it’s the form I used and equivalent to the Wikipedia one). Here and is the difference in latitude, is the difference in longitude, is the arclength distance between the two points and is the radius of the Earth.
My first plan failed: I used this formula to determine the distance between the two points, assuming the radius of the Earth was the known value. Then I used a second application of the Haverside formula to determine the distance along the latitude line, which I then used in with my measured values of the shadow angle . The measured radii were all over the place, with an almost flat distribution from 0 up to (and over!) 100 000 km. The radius of the Earth is approximately 6370 km…
I believe the problem was this approach directly compares the known latitude change with the measured change in the shadow angle , and a little study of that diagram will tell you they should be the same. But when you look at the data, the measured shadow angles were often very far off from the change in latitude (factors of 3,5, even 10 times!), which caused the radius to be correspondingly erroneous.
So my next thought was to use the Haverside formula twice; once with the known values of the latitude, once with the measured values for the shadow. We would get two different measurements of the distance along the arc, and assuming they are supposed to be the same (like, we could have gotten out our meter stick and walked between all the points, and we should have gotten the same result in both cases), we could determine the radius of the Earth implied by the difference. In a little more detail: Call the right hand side of the Haverside formula above. We have two formula for this ratio:
where is the known value of the radius of the Earth. Dividing these two equations by each other we can solve for the radius implied by our shadow measurements:
In this way, we are not really doing a truely independent measurement for the radius of the Earth, but we are certainly checking what the value our shadow measurements imply for the radius of the Earth.
UPDATE: There is an additional complication here, which I missed in the original version of this post: the latitude is measured from the equator, but the equator is tilted with respect to the Earth’s orbit around the Sun (this path is called the ecliptic). This tilt is around 23 degrees, but the angle of the Sun changes throughout the year because the axis is fixed with respect to the background stars (this is what gives us the seasons!) But on this particular day, we can determine what the angle was between the equator and the ecliptic by simply for each value, and averaging. When I do that, I get . That will change the Haversine formula to the following:
Ok, so what happens when we try to apply this approach to the real-life data set? You can find the raw data here, and the cleaned versions discussed below here or here (fair warning – some of the angle calculations in the raw data were incorrect, so we had to recalculate them! Thanks great lab students!). I used Sagemath to do the analysis; I’ll discuss what I did but you can find the notebook here.
For our first pass, we find all the pairs (for 33 independent points we get 528 points – combinatorics!), and calculate the radius of the Earth implied as outlined above. The result was:
So obviously, a few things jump out at us about this plot. First, there is an overwhealming number of points that are very small – actually, these are zero. This happened because many of the data points were taken at the exact same location – Middleburg, VA. Obviously, these pairs cannot be used because they are not separated by any distance on the Earth’s surface. My solution to this was to average all the points that were close together and count it as a single measurement. These kind of decisions are generally called “applying data cuts”; you have some reason to remove some part of the data for some good reason.
But how do we know exactly what cuts to apply? It’s pretty convincing to average the points from exactly the same location, which removes those zeros, but what about that long tail from 0 to around 3000-4000 km? If the errors associated with this measurement are randomly distributed, we would expect this distribution to be normal – specifically, symmetric about the most common value. So we have some reason to suspect that long tail – the underestimation of the radius of the Earth – is due to additional points which are too close together to be considered in our analysis.
Looking back at the data, we have a set that looks like
|San Francisco, CA, United States of America||37.775369||-122.421869|
|Hanford, California U.S.A.||36.370196||-119.646648|
|Los Altos, CA||37.392||-122.109|
|Hanford California USA||36.368977||-119.646669|
So these measurements all came from California, and not particularly close to each other at that. But if you look at all the data less than 3000 km (what I would call “clear underestimates”), all the pairs associated with this list appear there. That tells me that these locations should probably be averaged together as well.
In the end, I combined a bunch of points into 2: Virginia and Maryland (18 points), and California (4 points). It ended up being differentiated by state, but that was just a coincidence – I asked “which pairs are well outside of the expected normal distribution?”, and the answer was consistently the pairs within Virginia and Maryland and within California. The resulting plot looks like
So this is still not obviously a normal distribution, but I didn’t see any geographical reason to combine any of the points further, so this represents “the best I can do”. The final result for the radius of the Earth in this dataset is
(that error is indeed the error on the average as it should be, and not the error of the sample. And I usually teach my students to report your errors to 1 significant figure, and the best value should match. So maybe I should really report this as
So the average radius of the Earth is around 6370 km, so this worked very well! Not only is the real value within one standard deviation of the average, it is actually extremely close to the real value (within 0.5%!). Of course, our measurement is not really an independent estimate for the radius of the Earth, because we were not able to go out and measure the distance between all the points – we had to use the Haversine formula to do that, which included the radius of the Earth! I don’t really see how this could lead to the overestimate we see here, but it is worth keeping in mind.
(As a sidenote: A previous version of this post did not include the difference in angle between the ecliptic and the equator, and the result was a constant overestimation of the radius by about 1000 km. That’s called a systematic error, and I was able to deal with it by finding the angle on that day within the data itself. )
So the last thing we should do, as responsible data scientists, is to look for additional patterns in the data. The simplest way of doing this is probably simply graphing all the variables against each other. Unfortunately, we only have a few choices here. For instance, here is the radius of the Earth as a function of the change in latitude between the two observation points:
That actually doesn’t seem to tell us much. I don’t really think there is any argument for the clustering to be more apparent at any particular latitude changes. If I plot the change in the shadow angle against the calculated radius of the Earth, I basically get the same thing as the plot above. The longitude plot is a bit more illuminating:
So here we can see a relatively clear trend towards both smaller estimates for the radius of the Earth and more variation in the radius as the change in longitude becomes small. That echos what we found when cleaning the data – points that are nearby each other (although apparently only nearby in longitude!) tend towards great underestimations for the radius of the Earth.
One further plot we can look at is difference in shadow angle against difference in latitude angle.
That looks confusing, but I think it is showing us exactly what we expect. As the change in latitude gets larger, the change in shadow angle gets larger too, in the same direction and with some scatter (you can imagine a straight line running from the lower left to upper right of this plot). Since we expect all the points to lie right on that line (with some scatter), I think this is just confirming our expectations.
That’s pretty much all the additional analysis we can really do. We can’t really do anything with the “time” column in the raw data, because I ended up having to merge some points taken at different times. When I started this, I was hoping that column might provide some interesting analysis because the instructions were “do this at noon”, but of course that’s not when the Sun is actually highest in the sky. For instance, “Solar noon” occurred at 12:30 in my location. Alas, we’ve lost most of that information, and now I’m not sure it would be much use to us.
And of course, we have to decide when to stop; the nature of this field is that you are never really done analyzing a particular data set, even one as small as this. We could try to tease some information out of the observation time, or go back and evaluate individual measurements, like try to determine the altitude of each observation. In reality, at some stage you have to determine the “level of truth” at which you are going to say “at this point, I am going to accept this information as fact”. I’ve referred to this basic concept as The Axiom of Measurement, and argued that it should be incorporated into our scientific lexicon.
So, that was quite a journey for something which was done reasonably accurately over 2000 years ago! But I think it’s a great illustration of crowdsourcing a scientific project, and also gave us a chance to play around with some of the tools and philosophies surrounding the field of data science.