
Rugby Moneyball: Predicting Wins With Machine Learning

The same Kintetsu 2025 dataset, run through logistic regression and a random forest. The model surfaced something the pairwise analysis missed, and it was honestly worse than the naive baseline on accuracy. Both findings matter.

In part 1 of this series I split 14 Kintetsu matches into wins and losses and asked which metrics differed most. Yellow cards, turnover differential, and points per ball carry came out on top. The tackle paradox, that winners miss more tackles than losers, sat there as a counterintuitive footnote. To tell which of those signals are real and which are confounded, you need a model that sees them together.

This post is that model.

The headline up front. The eye test mostly held, with one twist. The model agreed with Pt 1's ranking of yellow cards, points-per-carry, and turnover differential as real predictors. But the largest positive coefficient went to missed tackle percentage, putting numerical weight behind the tackle paradox in a way pairwise comparison couldn't. And the AUC, for full honesty, is 0.60 with a 95% confidence interval that includes "useless."

Let me unpack all of that.

What I'd want to know if I were the head coach reading this

Three things, in order.

  1. When you put all the metrics in a single model, which ones still pay rent?
  2. How well does the model actually predict?
  3. Should I trust it?

The honest answer to (3) shapes everything else, so I want to lead with it.

On sample size and class imbalance, before any numbers

Fourteen matches. Drop the draw and 13 remain: ten wins, three losses. That's not a fifty-fifty class split. It's heavily imbalanced toward wins.

Two things follow.

First, the textbook rule for logistic regression is at least ten events per variable. With three losses as the minority class and eight features, we're at less than half of one event per variable. By any standard, this is too little data to fit a stable predictive model.

Second, naive baselines matter when classes are imbalanced. A model that simply predicts "win" for every match would be correct ten times out of thirteen, about 77% accuracy. Any model that beats that on accuracy is doing real work. Any model that doesn't is a vanity project. Spoiler: ours doesn't beat it on accuracy. It beats chance on AUC, which is a different and more useful measure of discrimination, but I want that comparison on the table from the start.

So why fit a model at all? Two reasons.

Illustration. The point of this post isn't to deploy a production model. It's to show what the multivariate question looks like, what tools to use for it, and how to read the output honestly. The methodology transfers; the specific coefficients here don't.

And honest reporting of small-data ML is itself rare in this field. Most "we built a model" posts in sport oversell. I'd rather show you the wide confidence intervals and the inversion of feature rankings than pretend they aren't there.

With that calibrated, here's what I did and what I got.

The setup

Eight features, drawn from Pt 1's findings plus possession % kept as a control (a derivation sketch follows the list):

  1. Yellow card count
  2. Turnover differential (won minus conceded)
  3. Points per ball carry
  4. Penalties conceded (total)
  5. Line breaks per defender beaten
  6. Possession %
  7. Dominant carry % (dominant carries / total carries)
  8. Missed tackle %
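
For concreteness, here's roughly how that feature table could be assembled from per-match counts. This is a sketch only: the raw column names and the filename are placeholders, not the actual data export, and the exact denominators follow Pt 1's definitions, which I'm approximating here.

```python
import pandas as pd

# Placeholder raw per-match counts -- column names and filename are
# illustrative, not the actual export format. The drawn match is assumed
# to have been dropped already, leaving 13 rows.
raw = pd.read_csv("kintetsu_2025_matches.csv")

features = pd.DataFrame({
    "yellow_cards":       raw["yellow_cards"],
    "turnover_diff":      raw["turnovers_won"] - raw["turnovers_conceded"],
    "points_per_carry":   raw["points_scored"] / raw["ball_carries"],
    "penalties_conceded": raw["penalties_conceded"],
    "breaks_per_beaten":  raw["line_breaks"] / raw["defenders_beaten"],
    "possession_pct":     raw["possession_pct"],
    "dominant_carry_pct": raw["dominant_carries"] / raw["ball_carries"],
    "missed_tackle_pct":  raw["missed_tackles"] / raw["tackle_attempts"],
})
y = (raw["result"] == "W").astype(int)   # 1 = win, 0 = loss
```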

Two models. Logistic regression with L2 (ridge) regularisation and class-balanced weighting. Regularisation is mandatory at this n. Class-balancing pulls the model out of the trivial "always W" attractor. Random forest, 500 trees, max depth 3, also class-balanced, in case any signal is non-linear.
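
A minimal scikit-learn sketch of those two specs. The penalty strength C and the random seed are assumptions, not tuned values; wrapping the logistic regression in a StandardScaler pipeline is what makes the later full-data coefficients comparable across features.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Ridge (L2) logistic regression with class-balanced weights; standardising
# first puts the coefficients on a comparable scale across features.
logit = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l2", C=1.0, class_weight="balanced", max_iter=1000),
)

# Shallow, class-balanced random forest, in case any signal is non-linear.
forest = RandomForestClassifier(
    n_estimators=500, max_depth=3, class_weight="balanced", random_state=0
)
```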

Cross-validation: leave-one-match-out. Each match becomes the test set once; the model trains on the other twelve. Repeat thirteen times, take the out-of-fold predictions. With small data this is the most honest split I know.
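
Continuing the sketch, leave-one-match-out is just LeaveOneOut cross-validation with the out-of-fold win probabilities collected for scoring:

```python
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Each of the 13 matches is the held-out test set exactly once; the model
# trains on the other 12, and the out-of-fold win probabilities are what
# get scored downstream.
oof_probs = cross_val_predict(
    logit, features.values, y, cv=LeaveOneOut(), method="predict_proba"
)[:, 1]
```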

Reported: AUC with bootstrap 95% CI, confusion matrix at the 0.5 threshold, standardised coefficients from a full-data fit, SHAP values for per-match attribution.
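
The reporting step, sketched for the logistic model: bootstrap the out-of-fold predictions for an AUC interval, threshold them at 0.5 for the confusion matrix, and read standardised coefficients off a full-data fit. The 2,000 resamples and the skipping of single-class resamples are my assumptions, not necessarily what was run.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

auc = roc_auc_score(y, oof_probs)

# Bootstrap the 13 out-of-fold predictions for a 95% CI on the AUC; resamples
# that happen to draw only one class can't be scored and are skipped.
rng = np.random.default_rng(0)
boot_aucs = []
for _ in range(2000):
    idx = rng.integers(0, len(y), len(y))
    if len(np.unique(y.values[idx])) < 2:
        continue
    boot_aucs.append(roc_auc_score(y.values[idx], oof_probs[idx]))
ci_low, ci_high = np.percentile(boot_aucs, [2.5, 97.5])

# Confusion matrix at the 0.5 threshold, on the same out-of-fold predictions.
cm = confusion_matrix(y, (oof_probs >= 0.5).astype(int))

# Standardised coefficients from a fit on all 13 matches.
logit.fit(features.values, y)
coefs = dict(zip(features.columns,
                 logit.named_steps["logisticregression"].coef_[0].round(3)))
```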

The headline numbers

Logistic regression: AUC = 0.60 (95% CI: 0.08 to 1.00). Random forest: AUC = 0.60 (95% CI: 0.18 to 1.00).

Read those confidence intervals before you read the point estimates. Both lower bounds sit well below 0.5, the line where the model is doing no better than chance. With thirteen matches and three losses, the model could plausibly be useless, and the data alone can't rule that out.

Confusion matrix from the logistic model's out-of-fold predictions at threshold 0.5:

              Predicted W    Predicted L
  Actual W         8              2
  Actual L         2              1

Nine out of thirteen called correctly, about 69% accuracy. Naive "always predict win" baseline: ten out of thirteen, about 77%. The model gets fewer matches right than the trivial classifier, even though it has marginally better discrimination (AUC 0.60 vs 0.50). That gap matters: the model is ranking matches in roughly the right order but flipping borderline calls in a way "always W" doesn't.

This is exactly the small-data, heavy-imbalance reality I wanted to show. The metric that looks good (eight of ten wins predicted correctly) hides that the model is calling two wins as losses, which the trivial classifier wouldn't do.

What survived in the model

Standardised logistic coefficients, ranked by magnitude. Positive sign means more of the feature pushes prediction toward win.

  1. Missed tackle %: +1.024
  2. Points per ball carry: +0.660
  3. Yellow cards: −0.619
  4. Turnover differential: +0.475
  5. Line breaks per defender beaten: +0.434
  6. Dominant carry %: −0.349
  7. Possession %: −0.317
  8. Penalties conceded: −0.274

Three things to take from this.

The biggest positive coefficient is missed tackle %. The Pt 1 tackle paradox, that winners miss more tackles than losers, is not just a pairwise oddity. The multivariate model, controlling for the other seven features, says the missed-tackle signal is the strongest single discriminator of wins from losses in this dataset. Which sounds insane until you remember the Pt 1 explanation: in matches Kintetsu won, the opposition was forced into broken-field attack, where tackles are harder to complete. The model is picking up that situational signature.

Discipline (yellow cards) and conversion (points per carry) survived in the expected direction. Both are real signals. The yellow cards coefficient is smaller than Pt 1's pairwise effect would suggest, because some of the discipline signal is correlated with the broken-field-attack signature: a yellow card tends to come in matches that already look like the kind you lose, and the model attributes that pattern to features other than the card itself.

The "no signal" features stayed no-signal. Possession % has a coefficient of −0.32, slightly negative and well within bootstrap noise of zero. Dominant carry % is also negative, confirming Pt 1's inverse-signal finding: Kintetsu's losing matches had more dominant carries than its winning ones.

SHAP: why the model thought what it thought

SHAP values let you ask, for any specific match, what features pushed the prediction toward win or loss. Two contrasting examples are worth showing in detail.
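
Here's roughly how those per-match attributions can be produced. The post doesn't say which explainer was used, so this assumes shap's LinearExplainer applied to the ridge logistic model on standardised features; attributions come out in log-odds units.

```python
import shap

# Refit on all 13 matches (as for the coefficients above), then attribute
# each match's log-odds prediction to its standardised features.
logit.fit(features.values, y)
scaler = logit.named_steps["standardscaler"]
clf = logit.named_steps["logisticregression"]
X_std = scaler.transform(features.values)

explainer = shap.LinearExplainer(clf, X_std)
shap_values = explainer.shap_values(X_std)   # one row of attributions per match

# Attributions for a single match; the row index depends on how matches are
# ordered in the file, so index 2 here is purely illustrative.
print(dict(zip(features.columns, shap_values[2].round(2))))
```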

Round 3 (Kintetsu lost 17 to 36, model predicted p̂(W)=0.02). The model called this one correctly with high confidence. Top SHAP attributions: missed tackle % (−1.54), yellow cards (−1.13), dominant carry % (−0.86). A textbook "looked like a loss profile" match: low missed tackle %, a yellow card, lots of dominant carries that didn't convert. Model and eye test agreed.

Round 10 (Kintetsu won 33 to 5, model predicted p̂(W)=0.17). The model called this one wrong. Predicted L by a clear margin; reality was a 28-point win. SHAP shows why: a yellow card (−1.13), the worst turnover differential in the entire dataset (−0.82), low dominant carry %. High missed tackle % pushed strongly toward W (+1.92) but couldn't outweigh the rest of the loss-profile features.

Round 10 is the most honest data point in the analysis. Kintetsu won by 28 points despite descriptive stats that looked like a losing performance. Either the team transcended its underlying numbers that day, or the underlying numbers are missing variables that decided the match. The model tells you it doesn't know which, and a single match can't either.

That's the genuine value of doing this exercise. ML in small-data sport contexts isn't going to surface hidden patterns invisible to a coach. With 13 matches it can't. What it does do is force the multivariate question, "is this signal still real after controlling for everything else," and produce honest answers, including "I don't know."

What's next

Two implications.

For deployment, this dataset is not enough. The natural next step is part 5 of this series: applying the same model to Div 2 league data with many more teams and games, where AUC confidence intervals can actually narrow.

For utility, what's already useful is the discipline of asking the multivariate question at all. If you're a club making selection or recruitment decisions, the value is not the specific coefficients I've shown. It's the rigour of asking "do these features still matter when the model sees them together, and how confident am I in the answer." Most clubs don't ask their data that question.

In part 3, I'll wrap this model into a public win-probability tool you can paste your own match stats into. With a 13-match training set, the predictions will be illustrative, not deployable, and the tool will say so clearly. The point is to make the methodology visible, not to sell a black box.

Coming next: the tool.
