A couple of weeks ago, I encountered a market on whether Trump would win majorities in all five states voting on April 26th. Trump winning more than 50% in all five states seemed highly unlikely to me at the time, so I bet against it and wrote up my reasoning. Some errors in my thinking were later pointed out to me, but heading into April 26th I still felt good about the fundamental bet.
Then the April 26th primaries happened, and Trump dominated, taking over 50% of the vote in all five states. A few nights later, Trump swept Indiana, Cruz suspended his campaign (as did Kasich soon after), and the Republican primary was effectively over. All of this triggered an avalanche of angsty reevaluations of data journalism, and it wiped out all of my prediction market winnings (I lost a lot on the "majorities in all 5 states" bet, and my "brokered convention" shares cratered from a $0.60 purchase price to near zero).
So, what happened here? The challenge is assessing how much of my April 26th loss was attributable to poor reasoning, and how much of it was a genuine surprise (such that it would have been difficult to predict even if I had reasoned perfectly with all the information available prior to April 26th).
My take is that, on reflection, there are a couple of places where my reasoning about the bet could have been better, but even with those improvements, the magnitude of Trump's April 26th victory would have seemed quite unlikely heading into the primaries. If my reasoning had been better, I'm not sure if I would have made the bet – I'm guessing that I still would have, though likely would have wagered a smaller amount.
Let's take a closer look at how my reasoning could have been better.
538's polls-plus vs. polls-only
538, my go-to for primary forecasting, has been building two types of models for its 2016 primary forecasts: polls-plus and polls-only. Polls-only incorporates all of the polling from the state into a model. Polls-plus incorporates all of the in-state polling, as well as national polling (which is a contrarian indicator) and endorsements.
I had been favoring polls-plus forecasts over polls-only, under the naïve assumption that more information and a more nuanced model would generate more accurate forecasts. I hadn't dug into the details of how the polls-plus model was put together before making the judgment, even though there was reason to think that polls-plus might not be strictly better than polls-only (when backtested, polls-only was more accurate 43% of the time).
Operating on this naïve assumption, I favored polls-plus in my decision-making around the "Will Trump win majorities in all 5?" bet. Additionally, polls-plus was better aligned with my prior, as it gave lower probabilities of Trump victories, which made the chance of a five-state sweep seem very low.
If I had done my homework about polls-plus (or had followed a different heuristic, like "try not to rely on black boxes you don't understand"), I would likely have put more weight on the polls-only model, and that would have made Trump's 5-state majority victory seem less outlandish (using my method with polls-only data and a 3x fudge factor would have yielded a 21% chance of a sweep, rather than a 5% chance). I would likely have been less enthused about the bet.
Dealing with dependence
I first assumed that the primaries were all independent of each other. This was a mistake. I ended up calculating the joint probability of majority victories in all five states as if the primaries were independent, then multiplying by a 3x fudge factor to account for inter-primary dependence.
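To make that original method concrete, here is a minimal Python sketch. The function name and the per-state probabilities are hypothetical placeholders, not 538's numbers or the figures I actually used; only the shape of the calculation (independent product, then a 3x multiplier) comes from the description above.

```python
from math import prod


def naive_sweep_probability(per_state, fudge=3.0):
    """Multiply per-state probabilities as if independent, then scale by a fudge factor."""
    independent = prod(per_state)
    # Cap at 1.0 so the multiplier can never produce an impossible probability.
    return min(1.0, independent * fudge)


# Hypothetical per-state chances of a Trump majority (placeholders only).
example = [0.5, 0.5, 0.4, 0.4, 0.4]
print(round(naive_sweep_probability(example), 3))
```

The weakness is visible in the code: the multiplier is a single arbitrary constant, so it says nothing about how strongly the primaries actually move together.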
This approach probably did not adjust sufficiently for dependence (after adjusting, I estimated a 5% chance of Trump winning majority victories in all five). An alternate method, which at the time I dismissed as too complicated, might have yielded a better estimate. Let's take a look at that method now.
Starting with one primary (our "seed primary"), we condition on it, forming a chain of conditional probabilities:
p(Trump wins >50% in Maryland) *
p(Trump wins >50% in Pennsylvania | Maryland >50%) *
p(Trump wins >50% in Connecticut | Maryland >50% and Pennsylvania >50%) *
p(Trump wins >50% in Delaware | Maryland >50% and Pennsylvania >50% and Connecticut >50%) *
p(Trump wins >50% in Rhode Island | Maryland >50% and Pennsylvania >50% and Connecticut >50% and Delaware >50%)
Plugging in some numbers (and drawing the seed probability from 538's polls-only model) gives us:
0.20 * 0.80 * 0.99 * 0.99 * 0.99 = 0.155
That's 15.5%, substantially higher than the 5% my original method gave after its dependence adjustment. Note that apart from the seed primary, the probabilities here are entirely subjective. You have to imagine what chance you'd give Trump of winning a majority in Pennsylvania, having learned that he won a majority in Maryland. Then you have to imagine your probability for Connecticut, having learned of Trump majority victories in Pennsylvania and Maryland, and so on.
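The Maryland-seeded chain can be checked with a few lines of Python. This is only a sketch: the function name is my own invention for illustration, and the inputs are the seed probability plus the subjective conditional estimates quoted above.

```python
from math import prod

# Seed probability (Maryland, from 538's polls-only model) followed by the
# subjective conditional estimates, in the order MD, PA, CT, DE, RI.
maryland_chain = [0.20, 0.80, 0.99, 0.99, 0.99]


def chain_probability(factors):
    """Chain rule: multiply the seed probability by each conditional probability."""
    for p in factors:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"not a probability: {p}")
    return prod(factors)


print(f"{chain_probability(maryland_chain):.3f}")  # 0.155
```

Unlike the fudge-factor approach, every factor after the first is an explicit statement about dependence, so each one can be argued with individually.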
Above, I arbitrarily chose Maryland to be the seed primary. To mitigate anchoring effects, we can take the average of all the outcomes of this exercise, starting with each primary as the seed primary (if we wanted to go a step further, we could take the average of the outcomes from all possible orderings of each probability tree, but I'm not going to do that here; instead, I'll keep the order constant except for swapping out the seed primary in each iteration).
Maryland (as above): 0.20 * 0.80 * 0.99 * 0.99 * 0.99 = 0.155
Pennsylvania: 0.40 * 0.70 * 0.99 * 0.99 * 0.99 = 0.272
Connecticut: 0.98 * 0.50 * 0.70 * 0.99 * 0.99 = 0.336
Delaware: 0.95 * 0.45 * 0.99 * 0.75 * 0.99 = 0.314
Rhode Island: 0.95 * 0.45 * 0.99 * 0.99 * 0.80 = 0.335
Average: 0.283
So, this method yields a 28.3% chance of Trump winning majority victories in all five primaries. Much higher than the method I used (28.3% vs. 5%), higher even than my original method would have yielded if I used polls-only data (28.3% vs. 21.2%), and more in line with the market at the time.
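As a rough check on the averaging step, the sketch below takes the five per-seed results as given (rounded to three decimals, as listed above) and averages them. The variable names are illustrative only.

```python
from statistics import mean

# Per-seed chain results as listed above, rounded to three decimals.
per_seed = {
    "Maryland": 0.155,
    "Pennsylvania": 0.272,
    "Connecticut": 0.336,
    "Delaware": 0.314,
    "Rhode Island": 0.335,
}

# Simple unweighted average across the five choices of seed primary.
average = mean(per_seed.values())
print(f"{average:.2f}")  # 0.28
```

Averaging the rounded values gives roughly 0.282, within rounding of the 0.283 quoted above. Averaging over every possible ordering, rather than just every seed, would need a separate subjective estimate for each conditional, which is why I stopped at swapping the seed.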
If I had been more careful with my reasoning, I likely would have arrived at a much higher joint probability than I did. It's hard to say, but I might not even have made the bet, given that the probability arrived at above was basically in line with the market at the time (though I likely would have low-balled the subjective estimates if I had been using the above method before April 26th, not having the benefit of hindsight and operating under the prior that Trump majority victories are very unlikely).
That said, there is some reason to think that Trump's majority victories were genuinely surprising.
Prior to April 26th, Trump had won only one majority victory, in his home state of New York. The New York majority victory could be attributed to a fundamental shift in Trump's prospects, or it could be attributed to home state advantage. When thinking about the April 26th bet, I favored the home state advantage interpretation. In retrospect, it seems like Trump actually made substantial gains unrelated to any home field advantage he may have had in New York.
Also, Nate Silver seemed to find the April 26th result surprising, so I feel I'm in good company.