The Big Day Algorithm

October 1, 2021
kylebradyfischer

The rules of big day birding, roughly, are quite simple: find as many bird species as possible, by sight or sound, in 24 hours within a specific geographic area. Besides the action day-of, the real task comes in planning a route that maximizes a team’s chances of finding birds. Most serious big day teams spend weeks scouting specific locations for specific bird species. The goal, in essence, is to make every possible bird a lock – to ensure that the chance of finding each bird the day of the event is as near 100% as one can hope for from a thing with wings.

But I’m not on a serious big day team, and likely neither are you. We’re not ornithologists or fanatical retirees, but working stiffs without sufficient time or resources to spend scouting arcane backwaters for birds. Or, perhaps, we’re doing a big day (or weekend or week or year) in a place we haven’t been to before. Is there a way to generate a decent big day route without all the legwork?

Let’s reconsider the problem: we have some set number of locations we can visit and at each spot there is a probability that a bird will or will not be present. If the probability is unknown, it would make sense to scout for locations where the probability of that bird being present approaches 1. But if we knew all the probabilities, it may be a better strategy to group spots with complementary probabilities such that, when combined, the probabilities approach 1.

The gist is this: If we knew the underlying probability of a bird being at each location, we could optimize which sets of locations would yield the highest likelihood of seeing the most birds. And where would we find such underlying probabilities? Why, eBird, of course.

Example bar chart for the San Diego River mouth eBird Hotspot.

These charts plot the frequency at which each species was reported in checklists submitted in each week of the year, compiled across all records (1900-2021). If we’re willing to stomach a few assumptions (that bird populations and migratory patterns don’t change; that all eBird checklist submitters are equally good at finding birds) and set some standards (that the data must be generated from some minimum number of checklists to be believable), then we can treat these data points as probabilities that a given species will be found at a given location in a given week.

This data is all available to download as Excel or CSV files. With a bit of manipulation, we can combine the species probabilities for any given set of hotspots for a specific week.

Snippet of our bird probability table for each species (rows) in each location (columns). To make the math we’ll perform later work (and, I’d argue, improve our probabilistic model), probabilities of zero have been replaced with 1/10000, or 0.0001.
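As a rough sketch of what that table-building step might look like (this is not the code used for the post), the Python below lines up per-hotspot frequencies for one week by species and swaps zeros and missing entries for 0.0001. The function name, the dictionary layout, and the example numbers are all made up for illustration.

```python
FLOOR = 0.0001  # the post's stand-in for a probability of zero (1/10000)

def build_probability_table(hotspot_freqs, floor=FLOOR):
    """Return {species: {hotspot: probability}} with zeros and gaps set to `floor`.

    hotspot_freqs is assumed to look like {hotspot: {species: frequency}},
    where frequency is the fraction of checklists reporting the species.
    """
    all_species = sorted({sp for freqs in hotspot_freqs.values() for sp in freqs})
    table = {}
    for sp in all_species:
        row = {}
        for spot, freqs in hotspot_freqs.items():
            p = freqs.get(sp, 0.0)          # species never reported at this spot
            row[spot] = p if p > 0 else floor
        table[sp] = row
    return table

# Hypothetical frequencies for a single week at two invented hotspots.
week_freqs = {
    "River Mouth": {"Willet": 0.82, "Brant": 0.40},
    "Mountain Trail": {"Acorn Woodpecker": 0.91, "Willet": 0.0},
}
table = build_probability_table(week_freqs)
for species, row in table.items():
    print(species, row)
```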

So here’s the plan:

Use this data to calculate the probability of finding a species having visited a specific set of hotspots.
Use these new joint probabilities to figure out the probability of seeing any given number of species for this set of hotspots.
Find the set of hotspots from all possible sets that gives the highest likelihood of seeing the most species.

To illustrate how this all works we’re going to examine a fantastical scenario. Enter…

The Omnipotent Birder

The solution to finding the most birds in a geographical region in a single day is simple: be everywhere at once. (Big Sit birders pull off this trick by making their geographical region as small as possible).

Let’s ignore for a moment the universe’s unbending insistence that a birder be in but one space at one time. What if you could bird everywhere at once within a geographical region? How many birds could the omnipotent birder expect to find on a given day? What would a big day count look like on “God Mode” – or, put differently, how many birds are there to find, and what is the probability of finding them?

Let’s consider how our probabilistic model behaves. Like a series of coin flips, there is a probability of finding or not finding a bird at each hotspot our omnipotent birder visits. Unlike a coin flip, that probability is not the same every time (i.e., not always 50/50). This is called a Poisson binomial distribution and, like all statistics, it is governed by a specific set of rules. For instance, if we want to figure out the probability of getting a specific number of successes from a series of events in a Poisson binomial distribution (or, put plainly, the probability of seeing a specific bird species at a specific number of hotspots we visit looking for it), we need to use one of the following formulas:

Poisson binomial distribution wiki, for those who can’t get enough math.

Where k is the number of successful events and p_i is the probability of success for the i-th of n events. The top equation to the right of the bracket gives the probability of zero successes (that is, you don’t find the species). We want to know the probability of finding at least 1, or not zero, which is simply one minus this equation.

Plugging the probability of finding a species at each hotspot into this formula, and repeating this for each species, we can calculate the probability our omnipotent birder finds each species having visited every hotspot. Check step 1 off our list.
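Step 1 boils down to one minus the product of the "miss" probabilities. Here is a minimal sketch, assuming the stops are independent of one another (which is what the model above takes for granted); the table contents, hotspot labels, and function name are hypothetical.

```python
from math import prod

def prob_at_least_once(probs):
    """P(species found at one or more stops) = 1 - product of (1 - p_i)."""
    return 1.0 - prod(1.0 - p for p in probs)

# Hypothetical per-hotspot probabilities for two species at three hotspots.
table = {
    "Willet": {"A": 0.5, "B": 0.6, "C": 0.0001},
    "Brant": {"A": 0.7, "B": 0.0001, "C": 0.0001},
}
route = ["A", "B"]

# Joint probability of finding each species somewhere along the route.
joint = {
    species: prob_at_least_once(row[spot] for spot in route)
    for species, row in table.items()
}
for species, p in joint.items():
    print(f"{species}: {p:.4f}")
```

Note how two so-so spots for Willet (0.5 and 0.6) combine to 0.8, which is the "complementary probabilities" idea from earlier in action.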

Snippet of the table of joint probabilities across all hotspots for each species (first column).

Now we want to take these joint probabilities and use them to find the probability of seeing 1, 2, 3, 4…. all the possible species. For this we’ll need the bottom formula to the right of the curly bracket. I won’t explain how this formula works because, frankly, it is beyond my mathematical understanding (“The recursive formula is not numerically stable!” sounds like something screamed in a Sci-Fi movie). Luckily, a greater mind has produced a bit of code to do it for us. What a time to be alive.

We use this to calculate the probability of each possible number of species, from 1 to the total number of species in our dataset. Plotting this, we get a cumulative distribution function, or CDF.
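For the curious, the distribution can also be built without the recursive formula. The sketch below is not the borrowed code mentioned above; it swaps in a plain dynamic-programming pass that adds one species at a time, then accumulates the result into a "chance of at least k species" curve, which is how the plots in this post appear to be read. Function names and the three example probabilities are invented.

```python
def poisson_binomial_pmf(probs):
    """pmf[k] = P(exactly k species found), for independent per-species probabilities."""
    pmf = [1.0]  # with no species considered, we are certain to have found zero
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - p)   # this species was missed
            nxt[k + 1] += mass * p       # this species was found
        pmf = nxt
    return pmf

def prob_at_least(pmf):
    """curve[k] = P(at least k species found)."""
    curve = [0.0] * len(pmf)
    running = 0.0
    for k in range(len(pmf) - 1, -1, -1):
        running += pmf[k]
        curve[k] = running
    return curve

# Hypothetical joint probabilities for three species.
joint = [0.9, 0.5, 0.2]
pmf = poisson_binomial_pmf(joint)
curve = prob_at_least(pmf)
for k, p in enumerate(curve):
    print(f"at least {k} species: {p:.2f}")
```

Each species costs one pass over the list, so a few hundred species is no trouble, and there are no alternating signs to lose precision on.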

Here is the CDF for all San Diego County hotspots that cut the mustard for the 1st week of January.

CDF for the first week of January for all SD hotspots with at least 5 checklists for this week.

The Y axis is the probability while the X is the number of species. If you draw a horizontal line from 0.5 on the Y axis until you hit the curve, then draw a vertical line until you hit the X axis, you’ll find that, having visited all the hotspots in San Diego County in the 1st week of January, our omnipotent birder would have a coinflip’s chance of finding 224 birds. Similarly, they’d have a 95% chance of finding 215 and a 5% chance of finding 223 birds.
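The line-drawing exercise can be done in code, too. A small sketch, assuming `pmf` is a list in which `pmf[k]` is the probability of finding exactly k species, and assuming the curve is read as "chance of at least k"; the function name and numbers are illustrative only.

```python
def species_at_probability(pmf, level=0.5):
    """Largest k for which P(at least k species) is still >= level."""
    running = 0.0
    for k in range(len(pmf) - 1, -1, -1):   # walk down from the highest count
        running += pmf[k]
        if running >= level:
            return k
    return 0  # only reached if rounding keeps the total a hair under `level`

# Hypothetical distribution over 0, 1, 2, or 3 species.
pmf = [0.04, 0.41, 0.46, 0.09]
for level in (0.95, 0.5, 0.05):
    print(f"{level:.0%} chance of at least {species_at_probability(pmf, level)} species")
```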

We now know the probability of seeing a specific number of bird species from a set of hotspots. Check step 2 off our list.

Before moving on, let’s see what else we can learn from this dataset.

As you might imagine, these values look different for different times of year. Here are the CDFs for the second weeks of May and October, typically when eBird’s Global Big Day and October Big Day are held.

In San Diego, the second week of October just edges out the second week of May: the 50th percentiles for these two weeks are 207 and 200, respectively.

Planning a big day and want to know the birdiest time of year in San Diego? Let’s look at the 50% value for every week of the year:

50th percentile of bird species in San Diego County for every week of the year.

This plot tells us the birdiest time of year in San Diego is the first week of March (week 8), while the least birdy is the 3rd week of July (week 27). Here are the CDFs for each of those weeks.

Points from top to bottom: 95%, 50%, and 5% probabilities.

Things Get Darwinian

But you can’t be in every hotspot in one, let alone two, state-sized counties at once, can you? No, we must return to the physical realm. Having figured out how to calculate the probability of finding a given number of species from a set of hotspots, we now just need to complete step 3: figuring out which set of hotspots has the highest probability of yielding the most birds.

What we have is an optimization problem, and a version of a famously hard one at that. Until now we’ve ignored the route part of our big day – not only do we have to find an optimal set of hotspots to visit, they have to be in an order that can be visited and birded within 24 hours (or ~12 hours, if we want to stick to daylight hours; more on that later).

To do this we’re going to need travel time data for driving between all hotspots. I chose Microsoft’s distance matrix API from the many available. You can read more about it here. I won’t go into the details of how this is implemented, but suffice it to say we can generate a distance matrix for all the hotspots we may use in our route.

How do we find the best set of hotspots with the best route? Well, we can measure how good a set of hotspots is for strictly bird finding by using the number of birds we have a 50% probability of seeing, as outlined above. To measure how good a route we have, we can take that number and divide it by the time it would take to travel to all those hotspots in order. In essence, we’ll rank routes by birds/hour. The more birds per hour, the more birds we can cram into a day.
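A bare-bones version of that birds/hour score might look like the following. It assumes the travel times are already in hand as a nested dictionary of minutes and that the 50% species count for the route has been worked out separately; the names and numbers are hypothetical, and time spent birding at each stop is left out, as in the description above.

```python
def travel_hours(route, travel_minutes):
    """Total driving time, in hours, for visiting the stops in the given order."""
    legs = zip(route, route[1:])
    return sum(travel_minutes[a][b] for a, b in legs) / 60.0

def birds_per_hour(median_species, route, travel_minutes):
    """Fitness of a route: 50% species count divided by travel time."""
    hours = travel_hours(route, travel_minutes)
    if hours <= 0:
        raise ValueError("route needs at least two stops with nonzero travel time")
    return median_species / hours

# Hypothetical driving times in minutes between three invented hotspots.
travel_minutes = {
    "A": {"B": 30, "C": 60},
    "B": {"A": 30, "C": 15},
    "C": {"A": 60, "B": 15},
}
fast = birds_per_hour(30, ["A", "B", "C"], travel_minutes)  # 45 minutes of driving
slow = birds_per_hour(30, ["B", "A", "C"], travel_minutes)  # 90 minutes of driving
print(fast, slow)
```

Same three stops, same birds, but the order alone halves the score.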

All that’s left to do is run this calculation on all possible permutations of hotspots, each with a unique order and without repeats. Let’s say we want to visit 20 hotspots and there are 500 that have sufficient data. That would be only… 648903261808795135455388694539771793327246045184000000 different routes. Even on the fastest computers in the world, this sort of exhaustive search would take centuries, if not eons, to calculate.
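That count is the number of ordered ways to pick 20 hotspots out of 500 (500 × 499 × … × 481), which Python can work out directly. The evaluation speed below is a made-up figure, used only to put the number in perspective.

```python
from math import perm

n_routes = perm(500, 20)            # ordered picks of 20 from 500, no repeats
routes_per_second = 1e9             # hypothetical: a billion routes scored per second
seconds_per_year = 60 * 60 * 24 * 365
years = n_routes / routes_per_second / seconds_per_year

print(n_routes)
print(f"about {years:.1e} years at {routes_per_second:.0e} routes per second")
```

Even at that generous speed the search comes out on the order of 10^37 years, so "eons" is underselling it.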

So don’t hold your breath.

Luckily mathematicians have been dealing with problems like these for a long time. And it is my hope, should any such mathematicians be reading this, that they might have their interest piqued enough to come up with a better optimization strategy than the one laid out here. But, dear reader, your author isn’t a mathematician. Nope, you’re stuck with a biologist. And biologists are really only familiar with one tried-and-true way of optimizing a difficult problem for which there is a near-endless set of permutations: evolution.

To find an optimal set of hotspots for our big day we’re going to use a genetic algorithm. It works something like this:

A bunch of sets of hotspots are randomly generated, which we’ll refer to as our population.
The 50th percentile value divided by the travel time required will be calculated for each member of our population, which we’ll refer to as its fitness.
Those routes with the highest fitness will be selected and “bred”, in which subsets of a route will be swapped between pairs of routes to make a new population.
At some set rate, each member may undergo a mutation, in which a hotspot in the route is swapped out for one from the entire pool of hotspots.
This population is fed back to step 2, and the cycle will continue for some set number of generations.
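The steps above can be sketched in a few dozen lines. This is a generic illustration rather than the code behind this post: the fitness function is passed in, the crossover (keep a slice of one parent, fill the rest from the other) is just one reasonable way to swap subsets without creating repeats, and every name and default value is an assumption.

```python
import random

def crossover(mom, dad, rng):
    """Keep a slice of mom in place; fill the other slots with dad's stops, no repeats."""
    i, j = sorted(rng.sample(range(len(mom) + 1), 2))
    middle = mom[i:j]
    fill = [h for h in dad if h not in middle][: len(mom) - len(middle)]
    return fill[:i] + middle + fill[i:]

def mutate(route, hotspots, rng):
    """Swap one stop for a hotspot from the full pool that is not already in the route."""
    unused = [h for h in hotspots if h not in route]
    if unused:
        route[rng.randrange(len(route))] = rng.choice(unused)

def evolve(hotspots, fitness, n_stops, pop_size=40, generations=50,
           mutation_rate=0.2, keep=0.5, seed=None):
    """Return the fittest route seen across all generations."""
    rng = random.Random(seed)
    # Step 1: a random starting population of routes.
    population = [rng.sample(hotspots, n_stops) for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        # Steps 2 and 3: score everyone, keep the fittest as parents, breed them.
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: max(2, int(pop_size * keep))]
        children = []
        while len(children) < pop_size:
            mom, dad = rng.sample(parents, 2)
            child = crossover(mom, dad, rng)
            # Step 4: occasional mutation.
            if rng.random() < mutation_rate:
                mutate(child, hotspots, rng)
            children.append(child)
        # Step 5: the children become the next generation.
        population = children
        champion = max(population, key=fitness)
        if fitness(champion) > fitness(best):
            best = champion
    return best

# Toy run: hotspots are just numbers and fitness is their sum (purely illustrative).
pool = list(range(20))
best = evolve(pool, sum, n_stops=5, seed=1)
print(best, sum(best))
```

In a real run the fitness function would be the birds/hour score for the route, and the hotspots would be whatever identifiers the probability table and distance matrix share.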

After deciding upon the number of spots we want to visit and how long we want to bird them, we choose a reasonable population size, mutation rate, and number of generations, and feed in our bird data and distance matrix. Once all set, we hit the go button and let the algorithm run. Each generation, depending on the parameters, can take a couple of seconds to calculate (on my beefy, if somewhat dated, desktop computer with code not yet optimized for parallel processing). Each generation the fitness of the best route, in birds/hour, is printed out so we can see our evolutionary progress.

Plot of the fitness of the best route in each generation for the 1st week of October, 2021, for routes with 30 minutes to bird at each of 15 stops.

When we’ve run our last generation and the cream of the crop has risen to the top, the algorithm spits out its route along with the travel time, birds/hour, and anticipated species number with a 50% probability for this route.

Why stakeout Grace’s Warbler? Only the algorithm knows…

To give you a sense of how this route stacks up against all the possible birds present for this time of year, we can look at the CDF for this route compared to the CDF of all hotspots for this week.

50% probability of 140 species for the suggested route; 237 for the whole county. Oh, to be omnipotent…

No doubt you’ll now go back, tweak some of the genetic algorithm parameters, maybe run it overnight constantly varying the number of spots, see if you can clean up the data a little more… but you have a route! Now, not all the spots, or their order, make sense, and you don’t necessarily know what exactly you’ll be looking for when you get there, but a route it is, and a statistically good one at that.

Now comes the hard part: actually doing it.

Start your engines…

As of this writing, this algorithm, or a version of it, has been used to conduct two big days in San Diego County, on October 17th, 2020 and May 8th, 2021. My only rule was that I could not visit any spot outside the route, but that I could bird from the car en route. I stuck to these rules (mostly…).

In October 2020, I thought it would be smart to have the algorithm react in real time to what I had already seen, so as to not include seen birds in future calculations to make a better route. The only direction I gave the algorithm was to start at dawn in Borrego Springs. In application this strategy was nerve wracking. Every time I finished birding a spot, I’d type in what I had found, the laptop would whir as it calculated the best route, and it would spit out the next place to go, without reason. This turned out to be incredibly inefficient, as the algorithm sent me (or attempted to send me) across the county multiple times. Not knowing where I’d be next, I wasted time trying to pick up species en route, unsure if I’d get another chance. When the dust had settled I’d found 119 species, and the algorithm had given me a 50% chance of 125. I was thrilled – it wasn’t a particularly high count, but it gave me great confidence in the accuracy of the data.

The blue dot above shows my count for this route, which was the ~15th percentile. Not bad for a hectic first try.

(To wit: According to eBird’s tally for San Diego, 119 was the second highest big day count for San Diego County. The highest count, 120, came from a big sit, if I recall. If only I could sit still…)

Taking what I had learned from the October experience I rewrote the algorithm into the form you find described here for the 2021 May Global Big Day. I also included a nocturnal section (see below for more info), which was stitched in just before go time. At 10pm the night before the Big Day, the route was spat out. I started at San Elijo at 3am and returned home from Ramona at 9pm. I learned a few valuable lessons on this run:

TRUST THE PROCESS. Any deviations, including driving slow with your head out the window trying to pick up stuff en route, will only hurt you. If you believe the data the route was built on, it’ll work.
Always go over the route beforehand. I got lost out of cell service and missed two spots and several key species.
Plan out where you’re going to find a bathroom…

Despite getting lost I managed to find 148 species. This was well below the predicted 50% probability for the original route. When I went back and ran the data for the hotspots I managed to visit I was surprised to see I’d done above average. Again, the data and route were way more reliable than I anticipated.

Blue dot shows my count of 148 on the CDF for the hotspots visited, just shy of the 70th percentile.
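Placing a count on the curve is a one-liner once the distribution is in hand. The sketch below assumes "percentile" means the chance of ending the day with that count or fewer, and that `pmf[k]` holds the probability of exactly k species; the tiny distribution is made up and has nothing to do with the real San Diego numbers.

```python
def percentile_of_count(pmf, count):
    """P(final tally <= count), expressed as a percentile from 0 to 100."""
    count = max(0, min(count, len(pmf) - 1))   # clamp to the range of possible tallies
    return 100.0 * sum(pmf[: count + 1])

# Hypothetical distribution over 0, 1, 2, or 3 species.
pmf = [0.04, 0.41, 0.46, 0.09]
for tally in range(len(pmf)):
    print(f"{tally} species: {percentile_of_count(pmf, tally):.0f}th percentile")
```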

Don’t take my word for it!

But let’s be frank: you should never trust the numbers coming from a big day team of one. Brains have the knack of seeing what they want to see, especially when there aren’t any other brains present to keep them in check. It is my hope that what I’ve laid out here gives some weight to my claims and, more importantly, encourages others to try.

I’ll post routes here for San Diego and am working to make the code publicly available for those who want to try elsewhere. Please reach out, I’m happy to answer any questions and would love to hear how things go.

Final Thoughts and Errata

I’ve sped over many details in the interest of clarity (if not brevity) in this write up. Here are a few final thoughts and omissions:

Rare birds are a staple of many big day routes and this algorithm is no exception. Historical data for any run is combined with recent confirmed “notable sightings”, as eBird calls them, for a certain number of days back. These sightings are given some static (and dangerously high) probability of being present where they were sighted. However, if a notable bird is not at a hotspot, the probability for all other birds for this stop will be 0.0001. As a result, rare birds of this nature typically don’t make it into the final route, as the combined probability is more greatly maximized by visiting a more diverse spot. This problem could be fixed by “guessing” the probability of other bird species at these locations (see eBird’s range maps, which combine checklist data with habitat data to predict where birds may be). For the time being, this problem remains unsolved – the simplest solution is to handpick which trusted notable species get in and which don’t.
Birding at night is essential to any serious (or fun) big day route. Unfortunately, likely due to the diurnal nature of most birders, there is little eBird data specific to nocturnal birding. My solution to this was to run a separate genetic algorithm only using data for hand-annotated nocturnal species. This worked to some extent, but due to the very limited dataset, full CDFs could not be calculated, requiring a new way to measure the fitness of each potential route. This is still being worked on.
Speaking of nocturnal: as should be apparent, no temporal information (dawn, dusk, tide, etc.) is included in this algorithm. You might be told to go to a migrant spot in the heat of the day or, worse, La Jolla Point at 2pm on a sunny Saturday in May (was 30min lost to parking worth the Heermann’s Gull?). It is possible that this kind of info could be incorporated in future iterations. For instance, if one were to look at checklist submission times for each hotspot, one could probably figure out the best time to visit each spot, and the probabilities could be weighted by whether you’re there at the appropriate time. Similarly, each species with temporal constraints could be flagged in one way or another, and their probability weighted depending on where in the route a hotspot finds itself. Either way, this would take a bit of work and may do more harm than good – more thought is needed.
One of the strangest and more exciting parts of this endeavor has been going to bird spots I’ve never visited as fast as possible. Could this approach be applied to a place one has truly never been to before? In theory, yes. However, some basic knowledge of the hotspots definitely helps. For instance, pelagic hotspots have been removed from consideration in the San Diego dataset (but if you have a speedboat and want to give it a go, I’m all ears). Some spots are less obvious. For example, the Tijuana River Mouth is a 30 minute walk from the car and Fort Rosecrans National Cemetery is only open to birding at certain hours. Both are premier sites, but have to be omitted or treated in a special manner so as to avoid potentially messing up a route.
How much time should one allow to bird a spot? This is a very important question, as the less time you need, the more spots you can fit in. At the same time, the less time you spend, the fewer birds you see. From experience, a 20 minute average is fast in application (can I mention again that the algorithm does not include bathroom breaks?) but is likely necessary for higher counts. For the uninitiated (or less masochistic) a 30, 45, or even 60 minute average would be more appropriate.
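To put rough numbers on that last trade-off, here is a back-of-the-envelope sketch. It assumes a 12-hour birding day and a flat 15-minute average drive between stops, both invented figures; real routes will vary with the distance matrix.

```python
def max_stops(day_minutes, dwell_minutes, avg_drive_minutes):
    """Most stops that fit in the day: n stops need n dwells and n - 1 drives."""
    return (day_minutes + avg_drive_minutes) // (dwell_minutes + avg_drive_minutes)

DAY_MINUTES = 12 * 60      # assumed daylight budget
AVG_DRIVE_MINUTES = 15     # hypothetical average drive between stops

for dwell in (20, 30, 45, 60):
    stops = max_stops(DAY_MINUTES, dwell, AVG_DRIVE_MINUTES)
    print(f"{dwell} min per stop: up to {stops} stops")
```

Under these assumptions, dropping from 30 to 20 minutes per stop buys five extra stops in a day.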