#MeasureEarth: Crowdsourced Data for the Radius of the Earth

In this post I am going to go through a crowdsourced data project for determining the radius of the Earth. It’s part of an exercise I gave to my Physics lab students as an introduction to data science, and here I am going to go through the theory and evaluation of the data to see how we can use it to determine the radius of the Earth.

First, all credit where credit is due: this is the brain child of Lauren Weiss, over at the RockyWorlds blog. She envisioned a world-wide initiative to measure the shadow of a meter stick a noon, mimicking the classical measurement of Eratosthenes in ~300 BC. It’s a great exercise for a middle school/high school science class, and I thought it would really great if we really did have data from all over the world to try and mine it as much as possible. This was the assignment I gave to my laboratory students.

Unfortunately, it didn’t end up being  “world-wide”. There were 33 entries, and although the US was well-covered, the only international data came from Mexico and the Canary Islands. However, that should not dissuade us from treating the data like data and seeing what the results are. After all, this project is really about the process of science – that’s why I gave it to my students, and that’s why I wanted to try and do a careful analysis of my own.

A quick description of what we are doing – measuring the shadow of a meter stick at as many locations around the globe at the moment the Sun is highest in the sky. It should not be hard to see that because the Earth is a curved surface, the length of the shadow will be different depending on where your observation was made. To see carefully how one can get the radius of the Earth from a set of these measurements, we need a little geometry.

The Sun’s rays travel in parallel lines to the surface of the Earth, since the Sun is so far away. When they fall on a meter stick as shown, they make shadows which form little triangles. The angle between the Sun’s rays and the meter sticks can be found using \tan(\theta)=l_s/l_m. This can be be done at two different locations to find two different angles, \theta_1 and \theta_2. Looking at the picture, we can use simple geometry to relate those angles to the angle formed by the two radial lines from the meter sticks to the center of the Earth, \Delta \theta=\theta_2-\theta_1.  Using this and the basic definition of the angle as \Delta \theta=s/R, we can look up the distance s between the two points on the Earth’s surface and determine the radius of the Earth R.

At least, in principle. There are several complications. The major one is that the above analysis really only works if the longitudes of the two locations are the same. Apparently that is how the “classical” measurement was done. In my class, I had the students pick groups of measurements which were done at approximately the same longitude, and that worked well. But I want to be able to use all available data to try and determine the best value for the radius of the Earth that I can, so I need to take the different longitudes into account.

Let me introduce the theoretical tool that we can use to do this: the Haversine formula:

\sin^2(d/R)=\sin^2(\Delta \phi/2)+\frac{1}{2}[\cos(\phi_1-\phi_2)+cos(\phi_1+\phi_2)]\sin^2(\Delta \lambda/2)

(This isn’t quite the form of the Haversine function you’ll see in the Wikipedia link above, but it’s the form I used and equivalent to the Wikipedia one). Here \Delta \phi=\phi_2-\phi_1 and is the difference in latitude, \Delta \lambda is the difference in longitude, d is the arclength distance between the two points and R is the radius of the Earth.

My first plan failed: I used this formula to determine the distance between the two points, assuming the radius of the Earth was the known value. Then I used a second application of the Haverside formula to determine the distance s along the latitude line, which I then used in R=s/\Delta \theta with my measured values of the shadow angle \Delta \theta=\theta_2-\theta_1. The measured radii were all over the place, with an almost flat distribution from 0 up to (and over!) 100 000 km. The radius of the Earth is approximately 6370 km…

I believe the problem was this approach directly compares the known latitude change \Delta \phi with the measured change in the shadow angle \Delta \theta, and a little study of that diagram will tell you they should be the same. But when you look at the data, the measured shadow angles were often very far off from the change in latitude (factors of 3,5, even 10 times!), which caused the radius to be correspondingly erroneous.

So my next thought was to use the Haverside formula twice; once with the known values of the latitude, once with the measured values for the shadow. We would get two different measurements of the distance along the arc, and assuming they are supposed to be the same (like, we could have gotten out our meter stick and walked between all the points, and we should have gotten the same result in both cases), we could determine the radius of the Earth implied by the difference. In a little more detail: Call h(\phi,\lambda) the right hand side of the Haverside formula above. We have two formula for this ratio:

d/R_0=2\arcsin(\sqrt{h(\phi,\lambda)}),\qquad d/R=2\arcsin(\sqrt{h(\theta,\lambda)})

where R_0 is the known value of the radius of the Earth. Dividing these two equations by each other we can solve for the radius R implied by our shadow measurements:

R=R_0\frac{\arcsin \sqrt{h(\\phi,\lambda)}}{\arcsin \sqrt{h(\theta,\lambda)}}

In this way, we are not really doing a truely independent measurement for the radius of the Earth, but we are certainly checking what the value our shadow measurements imply for the radius of the Earth.

UPDATE: There is an additional complication here, which I missed in the original version of this post: the latitude is measured from the equator, but the equator is tilted with respect to the Earth’s orbit around the Sun (this path is called the ecliptic). This tilt is around 23 degrees, but the angle of the Sun changes throughout the year because the axis is fixed with respect to the background stars (this is what gives us the seasons!) But on this particular day, we can determine what the angle was between the equator and the ecliptic by simply \alpha=\theta-\phi for each value, and averaging. When I do that, I get \alpha=11.3^\circ. That will change the Haversine formula to the following:

h(\theta,\lambda)=\sin^2(\Delta \theta/2)+\frac{1}{2}[\cos(\theta_1-\theta_2)+cos(\theta_1+\theta_2+2\alpha)]\sin^2(\Delta \lambda/2)

Ok, so what happens when we try to apply this approach to the real-life data set? You can find the raw data here, and the cleaned versions discussed below here or here (fair warning – some of the angle calculations in the raw data were incorrect, so we had to recalculate them! Thanks great lab students!). I used Sagemath to do the analysis; I’ll discuss what I did but you can find the notebook here.

For our first pass, we find all the pairs (for 33 independent points we get 528 points – combinatorics!), and calculate the radius of the Earth implied as outlined above. The result was:

So obviously, a few things jump out at us about this plot. First, there is an overwhealming number of points that are very small – actually, these are zero. This happened because many of the data points were taken at the exact same location – Middleburg, VA. Obviously, these pairs cannot be used because they are not separated by any distance on the Earth’s surface. My solution to this was to average all the points that were close together and count it as a single measurement. These kind of decisions are generally called “applying data cuts”; you have some reason to remove some part of the data for some good reason.

But how do we know exactly what cuts to apply? It’s pretty convincing to average the points from exactly the same location, which removes those zeros, but what about that long tail from 0 to around 3000-4000 km? If the errors associated with this measurement are randomly distributed, we would expect this distribution to be normal – specifically, symmetric about the most common value. So we have some reason to suspect that long tail – the underestimation of the radius of the Earth – is due to additional points which are too close together to be considered in our analysis.

Looking back at the data, we have a set that looks like

San Francisco, CA, United States of America 37.775369 -122.421869
Hanford, California U.S.A. 36.370196 -119.646648
Los Altos, CA 37.392 -122.109
Hanford California USA 36.368977 -119.646669

So these measurements all came from California, and not particularly close to each other at that. But if you look at all the data less than 3000 km (what I would call “clear underestimates”), all the pairs associated with this list appear there. That tells me that these locations should probably be averaged together as well.

In the end, I combined a bunch of points into 2: Virginia and Maryland (18 points), and California (4 points). It ended up being differentiated by state, but that was just a coincidence – I asked “which pairs are well outside of the expected normal distribution?”, and the answer was consistently the pairs within Virginia and Maryland and within California. The resulting plot looks like

So this is still not obviously a normal distribution, but I didn’t see any geographical reason to combine any of the points further, so this represents “the best I can do”. The final result for the radius of the Earth in this dataset is

R=6402 \pm 126~km

(that error is indeed the error on the average as it should be, and not the error of the sample. And I usually teach my students to report your errors to 1 significant figure, and the best value should match. So maybe I should really report this as

R=6400 \pm 100~km)

So the average radius of the Earth is around 6370 km, so this worked very well! Not only is the real value within one standard deviation of the average, it is actually extremely close to the real value (within 0.5%!). Of course, our measurement is not really an independent estimate for the radius of the Earth, because we were not able to go out and measure the distance between all the points – we had to use the Haversine formula to do that, which included the radius of the Earth! I don’t really see how this could lead to the overestimate we see here, but it is worth keeping in mind.

(As a sidenote: A previous version of this post did not include the difference in angle between the ecliptic and the equator, and the result was a constant overestimation of the radius by about 1000 km. That’s called a systematic error, and I was able to deal with it by finding the angle on that day within the data itself. )

So the last thing we should do, as responsible data scientists, is to look for additional patterns in the data. The simplest way of doing this is probably simply graphing all the variables against each other. Unfortunately, we only have a few choices here. For instance, here is the radius of the Earth as a function of the change in latitude between the two observation points:

That actually doesn’t seem to tell us much. I don’t really think there is any argument for the clustering to be more apparent at any particular latitude changes. If I plot the change in the shadow angle against the calculated radius of the Earth, I basically get the same thing as the plot above. The longitude plot is a bit more illuminating:

So here we can see a relatively clear trend towards both smaller estimates for the radius of the Earth and more variation in the radius as the change in longitude becomes small. That echos what we found when cleaning the data – points that are nearby each other (although apparently only nearby in longitude!) tend towards great underestimations for the radius of the Earth.

One further plot we can look at is difference in shadow angle against difference in latitude angle.

That looks confusing, but I think it is showing us exactly what we expect. As the change in latitude gets larger, the change in shadow angle gets larger too, in the same direction and with some scatter (you can imagine a straight line running from the lower left to upper right of this plot). Since we expect all the points to lie right on that line (with some scatter), I think this is just confirming our expectations.

That’s pretty much all the additional analysis we can really do. We can’t really do anything with the “time” column in the raw data, because I ended up having to merge some points taken at different times.  When I started this, I was hoping that column might provide some interesting analysis because the instructions were “do this at noon”, but of course that’s not when the Sun is actually highest in the sky. For instance, “Solar noon” occurred at 12:30 in my location. Alas, we’ve lost most of that information, and now I’m not sure it would be much use to us.

And of course, we have to decide when to stop; the nature of this field is that you are never really done analyzing a particular data set, even one as small as this. We could try to tease some information out of the observation time, or go back and evaluate individual measurements, like try to determine the altitude of each observation. In reality, at some stage you have to determine the “level of truth” at which you are going to say “at this point, I am going to accept this information as fact”. I’ve referred to this basic concept as The Axiom of Measurement, and argued that it should be incorporated into our scientific lexicon.

So, that was quite a journey for something which was done reasonably accurately over 2000 years ago! But I think it’s a great illustration of crowdsourcing a scientific project, and also gave us a chance to play around with some of the tools and philosophies surrounding the field of data science.



Chaos and SageMath

This semester I’m teaching our Analytical Mechanics course, and I just finished a day on Introduction to Chaos. In this class, we are using the SageMath language to do some simple numerical analysis, so I used Sage to demonstrate some of the interesting behavior you find in chaotic systems. Since I’m learning Sage myself, I thought I would post the result of that class here, to demonstrate some of the Codes and the kinds of plots you can get with simple tools like Sage.

Before getting into the science, SageMath is a free, open-source mathematics software which includes things like Maxima, Python, and the GSL. It’s great because it’s extremely powerful and can be used right in a web browser, thanks for the Sage Cell Server. So I did all of this right in front of my students, to demonstrate how easy this tool is to use.

For the scientific background, I am going to do the same example of the driven, damped pendulum found in Classical Mechanics by John Taylor (although the exact same system can be found in Analytic Mechanics, by Hand and Finch). So, I didn’t create any of this science, I’m just demonstrating how to study it using Sage.

First, some very basic background. The equation of motion for a driven, damped pendulum of length L and mass m being acted upon by a driving force F(t)=F_0\cos\omega t is

\ddot{\phi}+2\gamma\dot{\phi}-\omega_0^2\sin(\phi)=f\omega_0^2\cos(\omega t)

\gamma here is the damping term and f=F_0/(mg), the ratio of the forcing amplitude to the weight of the pendulum. In order to get this into Sage, I’m going to rewrite it as a system of first-order linear differential equations,

\dot y=x

\dot x=f\omega_0^2\cos(\omega t)+\omega_0^2\sin(y)-2\gamma x

This is a typical trick to use numerical integrators, basically because it’s easy to integrate first-order equations, even if they are nonlinear.

It’s easiest to find chaos right near resonance, so let’s pick the parameters \omega_0=3\pi and \omega=2\pi. This means the t-axis will display in units of the period, 1 s. We also take \gamma=3/4\pi. The first plot will be this system when the driving force is the same as the weight. That is, f=1, and code + result is Figure 1 shown below.

from sage.calculus.desolvers import desolve_system_rk4
x,y,t=var('x y t')
Q=[[i,k] for i,j,k in P]

Figure 2 is a plot with the driving force slightly bigger than the weight, f=1.06.

Figure 1, f=1

Figure 2, f=1.06







This demonstrates an attractor, meaning the steady-state solution eventually settles down to oscillate around \phi\approx 2\pi. We can check this is actually still periodic by asking Sage for the value of \phi at t=30 s, t=31 s, etc., by calling this line instead of the plot command above

[intP(i) for i in range(30,40)]

(Note that we also have to change the range of integration from [0,15] to [30,40].) The output is shown in Figure 3; the period is clearly 1.0 s out to four significant figures.


Figure 3: The value of \phi at integer steps between t=30 and t=40 for \gamma=1.06.


Figure 4: The value of \phi at integer steps between t=30 and t=40 for \gamma=1.073


Figure 6: The value of \phi at integer steps between t=30 and t=40 for \gamma=1.077.









Next, let’s increase the forcing to \gamma=1.073. The result is shown in Figure 7. The attractor is still present (now with a value of around \phi=-2\pi), but the behavior is much more dramatic. In fact, you might not even be convinced that the period is still 1.0 s, since the peaks look to be different values. We can repeat our experiment from above, and ask Sage to print out the value of \phi for integer timesteps between t=30 and t=40. The result is shown in Figure 4. The actual period appears to be 2.0 s, since the value of \phi does not repeat exactly after 1.0 s. This is called Period Doubling.


Figure 7: f=1.073


Figure 8: f=1.077









In Figure 8, I’ve displayed a plot with f=1.077, and it’s immediately obvious that the oscillatory motion now has period 3.0 s. We can check this by playing the same game, shown in Figure 6.

Now we are in a position to see some unique behavior. I am going to overlay a new solution onto this one, but give the second solution a different initial value, \phi(0)=-\pi/2 instead of \phi(0)=0. The code I am adding is

Q2=[[i,k] for i,j,k in P2]
plot(intP,0,15)+plot(intP2,0,15,linestyle=":", color=''red'')

The result is shown in Figure 8. Here we can see the first example of the sensitivity to initial conditions. The two solutions diverge markedly once you have a slightly different initial condition, heading towards two very different attractors. Let’s plot the difference between the two oscillators,
but with only a very small difference in the initial conditions, \Delta\phi(0)=1\times 10^{-4}. The code follows:

#plot(intP,0,15)+plot(intP2,0,15,linestyle=":", color=''red'')
plot(lambda x: abs(intP(x)-intP2(x)),0,15)



Figure 8: Two oscillators with f=1.073 but with different initial values, \Delta\phi(0)=-\pi/2.


Figure 9: A plot of the difference in the oscillators over time, \Delta\phi(t), for f=1.077 and a very small initial separation, \Delta \phi(0)=1\times 10^-4.









This is shown in Figure 9. It clearly decays to zero, but that’s hard to see so let’s plot it on a log scale, shown in Figure 10.

#plot(intP,0,15)+plot(intP2,0,15,linestyle=":", color=''red'')
plot_semilogy(lambda x: abs(intP(x)-intP2(x)),0,15)

Now, let’s see what happens if we do this same thing, but make the force parameter over the critical value of f=1.0829. This is displayed in Figure 11, for f=1.105. We get completely the opposite behavior, the differences in the oscillators are driven away from each other due to their small initial separations. This is the essence of “Jurrasic Park Chaos” – a small change in the initial conditions (like a butterfly flapping it’s wings in Malaysa) causes a large change in the final outcome (a change in the weather pattern over California).


Figure 10: A log plot of two oscillators with f=1.073 but with very small differences in their initial conditions, \Delta\phi(0)=1\times 10^{-4}.


Figure 11: Finally, a log plot of two overcritical oscillators (f=1.105) and very small differences in their initial conditions, \Delta\phi(0)=1\times 10^{-4}

Loops in the Digits of Pi

Is \pi special?

Of course, the concept of \pi as the ratio between the diameter and circumference of a circle is more than important – a cursory glance through the arXiv suggests it appears in near 85% of all papers on theoretical physics. What I mean is are the digits of \pi special? Is there anything actually significant hidden in the seemingly random digits of this all-important transcendental number?

This is a well-trodden topic among pseudo-intellectuals and science fiction writers alike. No less then the great Carl Sagan afforded a special significance to the digits of our friend \pi at the very end of Contact (a part which didn’t make it into the movie). But is there any truth to this? Or even any evidence for it? In fact, how would you even go about trying to figure it out?

What got me thinking about this was a website I came across a few weeks back, talking about finding strings of specific numbers in \pi – you can see it here. It’s a very cool page, which lets you do cool things like search for your SS number in the digits of \pi (no joy for me there, I only get 8/9 numbers). However, down at the bottom they define something which I formalize as follows:

Loop Sequences: A loop sequence \mathcal{L} in a string of single-digits integers \mathcal{S} is a set of integers \mathcal{L}=\{n_1,n_2,...,n_N\} such that as a string of single digits, the integer n_j is found at the position n_{j+1} in the string \mathcal{S}, and n_N is found at position n_1.

This is perhaps best illustrated by an example. Let’s start with n_1=169. Turns out, starting at the 35th digit of \pi (counting starting after the decimal point), we have …841971693993…, which contains 169 starting at the 40th position, so n_2=40. I continue to do this and find n_3=70, n_4=96, and so on until you find that you are looking 169 again! This is a loop sequence.

That web page gives a single loop sequence, found by one Dan Sikorski. I wondered if there are more – how common are these loops, and what would it take to find them? Sounding like an interesting computational project (rather than tackling this theoretically, which might be possible but struck me as more difficult), I though I would look for loops in some mathematical constants, along with random numbers, to see if there was any evidence for \pi being special.

In short, no, there is not.

Of course, since these numbers are infinitely long, every number you start with must loop back at some point. What I’m after is how common these loops are. So let’s start with a million digits of \pi, and try to find loops that contain any of the numbers 1 to 100000 (rather arbitrary, but my PC can handle this in under an hour so it seems appropriate). The results are as follows:

For \pi, I found the following loops:

\mathcal{L}_1(\pi)=\{1\} (this is self-referencing)

\mathcal{L}_2(\pi)=\{40,70,96,180,3664,24717,15492,84198,65489,3725,16974,41702,3788,5757,1958,14609,62892,44745,9385,169\} (this might be called “the Sikorski loop”. And thanks to Jesse Chamberlain for pointing out a typographical error in an old version of this loop!)

\mathcal{L}_3(\pi)=\{19,37,46\} (this is a new loop, but who knows if I’m the first to notice it!)

For e, I found the following loops:

\mathcal{L}_1(e)=\{3624, 810\}

\mathcal{L}_2(e)=\{586, 1439, 4227, 19689, 76567, 4846, 16317, 32036\}

\mathcal{L}_3(e)=\{2543\} (self-referencing)

\mathcal{L}_4(e)=\{170596\} (another self-referencing)

(the original version of this post had an error in the digits of e, which was found by some students who are working on this project with me, D Vadala and J Gormley.)

To see if this distribution is at all unusual, I generated 100 random strings of a million integers and did the same kind of search. The distribution for the random numbers, plotted with the strings I found in \pi and e is:

The random distribution probably looks exactly how one would expect it – smaller loops are far more common, and larger loops (>5 or so) are part of the statistical variation. Due to small number statistics, its very hard to convince yourself that \pi and e are particularly special in terms of the distribution of their loops. It might be tempting to say that the length 20 loop in \pi lies outside the statistical variation, but you can see that I found loops of lengths 17, 18 and 31 in the random sample. For this reason, I would say this study does not suggest anything about the special character of the digits of \pi and e.

I suppose one should go further to try and deal with the statistics, and perhaps I’ll just run my laptop for a week and do the 10 million digit version of this, but it’s a little hard to imagine that I will find any evidence to suggest that there are any cyclic patterns in the digits of \pi or e.