Learning a Mixture of Heterogeneous Experts for Multi-Horizon Price Movement on Regulated Prediction-Market Trades

In this project, we study whether short-horizon movements in prediction-market contracts (Kalshi) can be anticipated from trade-level microstructure. The core modeling story is a sequence of commitments: we started with a minimal convolutional detector on aggregated bars, watched where it failed, then reframed the prediction target, and only then brought in mixture-of-experts ideas to combine models with different inductive biases on the same task.¹,²

Four collaborators worked alongside me (Harvard SEAS): Gianluca Pisa, Andres Blanco Prada, Moritz Wassermann, and Vishwesh Venkatramani (all MIT Sloan). We split ownership by subsystem (data preprocessing and splits, per-expert training, gating and fusion, diagnostics and reporting) and reviewed anything that touched label definitions, masks, or chronology in our working paper.

This started as a final project for CS 1090B. At some point it stopped being just that. We burned through multiple Colab compute allocations (thanks Colab students!), rewrote the preprocessing pipeline twice after catching leakage, and spent more late nights on this than any of us planned.

But hey, I'm glad we did this. Lots of fun!

Data construction

We sourced raw trade records and market snapshots from Kalshi through Jon Becker's prediction-market-analysis tooling, then materialized those execution records and contract snapshots into columnar stores for downstream modeling.⁴ At full resolution the trade history covers 517,520 contracts and 63.8 million trades from January through November 2025; all downstream models consume filtered, time-sorted rows with explicit handling of thin prices and calendar gaps. The chronological split yields 34.96 million training rows and 8.59 million held-out test rows. Chronological splits are non-negotiable: any procedure that shuffles contracts across time would admit information leakage from settlement paths not realized at quote time.
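As a minimal sketch of the chronological split described above (not the project's actual pipeline code), the function below sorts rows by time and cuts at a single timestamp. The function name, the integer timestamps, and the cutoff value are illustrative assumptions; in the post the test set starts on November 6, 2025.

```python
import numpy as np

def chronological_split(timestamps, cutoff):
    """Return (train_idx, test_idx), both in time order, split at `cutoff`.

    Rows strictly before `cutoff` go to train; rows at or after it go to test.
    """
    timestamps = np.asarray(timestamps)
    order = np.argsort(timestamps, kind="stable")  # time-sort, never shuffle
    is_test = timestamps[order] >= cutoff
    return order[~is_test], order[is_test]

# Toy example: trade times in epoch seconds (hypothetical values).
ts = np.array([50, 10, 40, 20, 30, 60])
train_idx, test_idx = chronological_split(ts, cutoff=40)
print(train_idx, test_idx)
```

Because every training row precedes every test row, statistics fitted on the training indices (thresholds, scalers) cannot see the held-out period.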

We first built exploratory tables that paired each execution with coarse tail flags from backward-looking returns, while enforcing volume and price floors. We treated those tables as diagnostics rather than final labels: they helped us see which parts of the trade data were reliable enough to model before we froze the production labeler used in the mixture stage.

Stage I: one-dimensional CNN, binary jump prediction

The first modeling decision was deliberately narrow. We aggregated trades to one-minute bars per ticker, required a minimum activity threshold, and discarded windows whose real clock span exceeded ninety minutes, so that overnight gaps were not smuggled into a "minute" tensor. On these windows we trained a three-block 1D CNN on tensors of shape (B, F=6, T=32): six strictly causal features (normalized close, short returns, log-volume, cyclical time) over thirty-two lags.

The head emitted a single logit for jump vs. no jump at horizons H in {5, 15, 30, 60} minutes. Forward returns were matched per ticker with searchsorted to the closest subsequent trade within a tolerance band around t + H, so sparse markets do not receive synthetic labels and no return crosses from one market into another. Jump labels used fixed cent thresholds derived from the training fold's 10th/90th percentiles of forward returns (±5¢, ±6¢, ±8¢, ±9¢ at 5/15/30/60 min), targeting a positive rate of roughly 20% per horizon. Training used BCEWithLogitsLoss with positive-class weighting; validation enforced a precision floor on predicted jumps before we reported test-set metrics (AUC-PR, MCC, confusion matrix).
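The forward-return matching can be sketched with `numpy.searchsorted`. This is an illustration under stated assumptions, not the project's labeler: the function names are hypothetical, times are in seconds, and "closest subsequent trade within a tolerance band" is interpreted here as the first trade at or after t + H that arrives no later than t + H + tol.

```python
import numpy as np

def forward_returns(times, prices, horizon, tol):
    """Forward price change for one ticker; NaN where no trade lands in [t+H, t+H+tol].

    `times` must be sorted ascending (seconds); `prices` are in cents.
    """
    times = np.asarray(times, dtype=np.int64)
    prices = np.asarray(prices, dtype=np.float64)
    target = times + horizon
    j = np.searchsorted(times, target, side="left")   # first trade at or after t + H
    in_range = j < len(times)
    j_safe = np.where(in_range, j, len(times) - 1)     # clamp so indexing stays valid
    ok = in_range & (times[j_safe] - target <= tol)
    return np.where(ok, prices[j_safe] - prices, np.nan)

def forward_returns_by_ticker(tickers, times, prices, horizon, tol):
    """Run the match inside each ticker so no return spans two markets."""
    tickers = np.asarray(tickers)
    times = np.asarray(times)
    prices = np.asarray(prices, dtype=float)
    out = np.full(len(tickers), np.nan)
    for tk in np.unique(tickers):
        idx = np.flatnonzero(tickers == tk)
        idx = idx[np.argsort(times[idx], kind="stable")]
        out[idx] = forward_returns(times[idx], prices[idx], horizon, tol)
    return out

# Toy data (hypothetical): two tickers, 5-minute horizon, 60-second tolerance.
r = forward_returns_by_ticker(
    ["A", "A", "A", "B", "B"], [0, 300, 1000, 0, 310], [50, 56, 40, 20, 21],
    horizon=300, tol=60,
)
print(r)
```

Rows whose target time has no trade inside the band come back as NaN, so they can be dropped instead of receiving a synthetic label.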

Figure: Stage I, binary jump classifier per horizon. Input (B, 6, 32): 6 features × 32 lookback 1-min windows. Output: one jump logit per window.
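The window construction for Stage I, including the ninety-minute clock-span filter, might look like the sketch below. The function name and the choice of minutes as the time unit are assumptions made for illustration; only the shapes (F=6, T=32) and the ninety-minute limit come from the text.

```python
import numpy as np

def make_windows(features, bar_times, T=32, max_span_min=90):
    """Stack causal lookback windows of shape (F, T), dropping gappy ones.

    features: (N, F) per-bar features for one ticker, in time order.
    bar_times: (N,) bar timestamps in minutes.
    Returns (windows of shape (B, F, T), index of each window's last bar).
    """
    features = np.asarray(features, dtype=np.float32)
    bar_times = np.asarray(bar_times)
    windows, ends = [], []
    for end in range(T - 1, len(bar_times)):
        start = end - T + 1
        span = bar_times[end] - bar_times[start]    # real clock time covered
        if span > max_span_min:
            continue                                # e.g. an overnight gap inside the window
        windows.append(features[start:end + 1].T)   # (T, F) -> (F, T)
        ends.append(end)
    if not windows:
        empty = np.empty((0, features.shape[1], T), dtype=np.float32)
        return empty, np.array([], dtype=int)
    return np.stack(windows), np.array(ends)

# Toy series (hypothetical): 40 bars with a 600-minute gap after bar 35.
bar_times = np.arange(40)
bar_times[36:] += 600
feats = np.random.default_rng(0).random((40, 6))
X, ends = make_windows(feats, bar_times)
print(X.shape, ends)
```

Every window that straddles the gap is discarded, so each surviving tensor covers at most ninety minutes of real time.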

When we checked what the CNN was actually using, including ablations where we removed one input channel at a time, one feature dominated: close_norm, the normalized price level. The short-term return features added much less than we expected. Read plainly, the network was learning a level and liquidity partition of the state space more than a temporally structured event detector. In thinly traded contracts, a long gap followed by one trade could make the model look better than it really was, so we started checking results separately by gap length. That diagnosis is what forced the next move: the binary CNN was a useful compressor of local context, but not the right outer loop for the scientific question we cared about.
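A leave-one-channel-out ablation of the kind mentioned above can be sketched as follows. The scoring function here is a hypothetical stand-in that only reads channel 0, chosen so the effect is easy to see; it is not the trained CNN, and filling the ablated channel with a constant is one of several possible ablation choices.

```python
import numpy as np

def channel_ablation(score_fn, X, y, fill=0.0):
    """Score drop when each input channel of X (B, F, T) is replaced by `fill`."""
    base = score_fn(X, y)
    drops = []
    for c in range(X.shape[1]):
        X_abl = X.copy()
        X_abl[:, c, :] = fill              # remove one channel, keep the rest intact
        drops.append(base - score_fn(X_abl, y))
    return base, np.array(drops)

def toy_score(X, y):
    """Hypothetical model: predicts from the last value of channel 0 only."""
    pred = (X[:, 0, -1] > 0.5).astype(int)
    return float((pred == y).mean())

rng = np.random.default_rng(0)
X = rng.random((200, 6, 32))
y = (X[:, 0, -1] > 0.5).astype(int)
base, drops = channel_ablation(toy_score, X, y)
print(base, drops)
```

A single large drop with near-zero drops elsewhere is the pattern the text describes for close_norm.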

Stage II: pivot to train-thresholded ternary targets

The binary jump setup was too blunt: it mixed direction, size, and base rate into one yes/no label. So we changed the task to three-way classification at the same horizons. For each row, the model predicts whether the future move in yes_price is in the lower tail, the middle band, or the upper tail. We computed those cutoffs using only the training period, then reused the same cutoffs for validation and test so future data could not leak into the label definition. At the five-minute horizon, the middle class is by far the largest: about 83% of rows are flat, while down and up are about 8% each. Because of that imbalance, raw accuracy is not very informative; balanced accuracy, macro-F1, and per-class confusion became the metrics we cared about.
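A minimal sketch of train-thresholded ternary labeling, assuming the 10th and 90th percentiles named elsewhere in the post. The function names, the integer class codes, and the -1 code for rows without a matched forward return are illustrative choices, not the project's conventions.

```python
import numpy as np

DOWN, FLAT, UP, MISSING = 0, 1, 2, -1

def fit_thresholds(train_returns, lo=0.10, hi=0.90):
    """Quantile cutoffs computed from the training period only."""
    r = np.asarray(train_returns, dtype=float)
    r = r[~np.isnan(r)]
    return np.quantile(r, lo), np.quantile(r, hi)

def ternary_labels(returns, q_lo, q_hi):
    """Apply frozen cutoffs: down if r < q_lo, up if r > q_hi, flat otherwise."""
    r = np.asarray(returns, dtype=float)
    labels = np.full(r.shape, FLAT)
    labels[r < q_lo] = DOWN
    labels[r > q_hi] = UP
    labels[np.isnan(r)] = MISSING          # no matched forward return
    return labels

# Toy returns in cents (hypothetical values).
q_lo, q_hi = fit_thresholds(np.arange(-10, 11))
test_labels = ternary_labels([-9, -8, 0, 8, 12, np.nan], q_lo, q_hi)
print(q_lo, q_hi, test_labels)
```

The cutoffs are fitted once on the training returns and then reused unchanged on validation and test rows, which is what keeps future data out of the label definition.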

Figure: Modeling trajectory: binary convolutional baseline, relabeling, heterogeneous experts, gated mixture, held-out evaluation. Schematic timeline from raw Kalshi data and ETL through binary CNN, ternary pivot, specialist models, MoE gate, and evaluation.

Stage III: heterogeneous experts on a common simplex

Before committing to a mixture head, we trained separate models chosen to stress different inductive biases on identical rows and labels. The hypothesis was empirical: if tail errors were structured differently under tree boosting than under recurrent or attention-backed sequence encoders, then no single winner would dominate every stratum of the covariate space. Concretely, we fit six experts: LightGBM (num_leaves=255, lr=0.05) on 41 tabular features, including eight-dimensional ticker embeddings (SentenceTransformer + UMAP), backward returns at each horizon, microstructure counters, and cyclical time encodings;⁵ an LSTM baseline;⁶ a Mamba block for long-range recurrence without quadratic attention cost;⁷ a foundation-style time-series encoder (Moirai);¹¹ an FT-Transformer treating features as tokens;⁸ and a Convolutional Transformer Time Series model (CTTS) combining convolution and self-attention.⁹ Each expert outputs a full softmax over {down, flat, up}. Aligning label definitions and masking rules across experts took more implementation time than architecture search.

Six heterogeneous models


At this point, we could have simply averaged the six experts' predicted probabilities. That would already be a basic ensemble. Instead, we wanted the model to learn how much to trust each expert for each row. The held-out gate weights later made this choice look justified: at short horizons, the mixture leaned heavily on LightGBM and CTTS rather than treating all experts equally.

Stage IV: mixture-of-experts as a learned convex combination

Classical mixture-of-experts systems assign each input a soft partition among competing submodels whose specialties are carved out by a gating network.¹,² That template predates contemporary sparsely-gated trillion-parameter variants;¹⁰ what we import is the statistical idea (per-example soft competition between submodels), not routing hardware.

For each expert e, we took its class probabilities p_e and concatenated them with 40 trade-level context features x. We fed that vector into a three-layer gating MLP (hidden dims 192 → 96 → 48, LayerNorm + GELU, dropout 0.3, ~35k trainable parameters), which produced gate logits z and weights g = softmax(z). When an expert was missing for a row, we masked it out and renormalized the remaining weights. We also added an entropy bonus (λ = 0.01) so the gate would not collapse onto one expert during training; optimization used AdamW with cosine annealing. The final prediction is p* = Σ_e g_e p_e, a weighted average of expert probability vectors. In other words, the gate learns from both the raw trade context and the experts' own uncertainty, instead of just picking one model or using fixed weights.
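The masking, renormalization, and convex combination can be sketched in NumPy. This is an illustration, not the training code: the gate logits are random stand-ins for the MLP output, every row is assumed to have at least one available expert, and the loss at the end shows one plausible form of the entropy bonus (cross-entropy minus λ times mean gate entropy).

```python
import numpy as np

def masked_softmax(z, mask):
    """Softmax over experts; experts with mask == False get exactly zero weight."""
    z = np.where(mask, z, -np.inf)
    z = z - z.max(axis=1, keepdims=True)           # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)        # renormalizes over available experts

def mixture(gate_logits, expert_probs, mask):
    """p* = sum_e g_e p_e for logits (B, E), probabilities (B, E, C), mask (B, E)."""
    g = masked_softmax(gate_logits, mask)
    p = np.where(mask[:, :, None], expert_probs, 0.0)   # ignore missing experts' outputs
    return g, np.einsum("be,bec->bc", g, p)

def gate_entropy(g, eps=1e-12):
    """Shannon entropy of each row of gate weights."""
    return -(g * np.log(g + eps)).sum(axis=1)

rng = np.random.default_rng(0)
B, E, C = 4, 6, 3                                   # rows, experts, classes
logits = rng.normal(size=(B, E))                    # stand-in for the gating MLP output
probs = rng.dirichlet(np.ones(C), size=(B, E))      # each expert's softmax
mask = np.ones((B, E), dtype=bool)
mask[0, 5] = False                                  # expert 5 is missing for row 0
probs[0, 5] = np.nan

g, p_star = mixture(logits, probs, mask)
y = np.array([0, 1, 2, 1])                          # hypothetical labels
cross_entropy = -np.log(p_star[np.arange(B), y]).mean()
loss = cross_entropy - 0.01 * gate_entropy(g).mean()
print(g[0], p_star.sum(axis=1), loss)
```

Because each p_e lies on the simplex and g is a convex weight vector, every row of p* is itself a valid probability vector.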

Figure: Gated mixture. Each expert outputs p_e ∈ Δ²; the gate learns per-trade convex weights g ∈ Δ⁵.

Gating MLP: hidden 192 → 96 → 48, LayerNorm + GELU, dropout = 0.3, ~35k parameters, entropy bonus λ = 0.01, AdamW + cosine LR.

g = softmax(Wx + b), g ∈ Δ⁵

p*(y | x) = Σ_e g_e(x) p_e(y | x)

Class from train-fold thresholds:

down: r_t,H < q₁₀^train(H)

flat: q₁₀^train(H) ≤ r_t,H ≤ q₉₀^train(H)

up: r_t,H > q₉₀^train(H)
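The ~35k parameter figure for the gate can be sanity-checked with a few lines of arithmetic. The layout below is an assumption: the input is taken to be six experts × three class probabilities plus 40 context features (58 values), each hidden layer has a bias and an affine LayerNorm, and the output layer emits one logit per expert.

```python
def mlp_param_count(in_dim, hidden, out_dim, layer_norm=True):
    """Trainable parameters of an MLP with optional affine LayerNorm per hidden layer."""
    total, prev = 0, in_dim
    for h in hidden:
        total += prev * h + h              # Linear weights + bias
        if layer_norm:
            total += 2 * h                 # LayerNorm scale + shift
        prev = h
    total += prev * out_dim + out_dim      # output layer producing gate logits
    return total

n_experts, n_classes, n_context = 6, 3, 40
in_dim = n_experts * n_classes + n_context   # 18 + 40 = 58 (assumed input layout)
n_params = mlp_param_count(in_dim, [192, 96, 48], n_experts)
print(n_params)  # 35478
```

Under these assumptions the count comes to 35,478, which agrees with the quoted ~35k.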

At H = 5 minutes, LightGBM absorbs ~50% of gate mass and CTTS ~27%; FT-Transformer, Mamba, and LSTM share the next ~21%; Moirai receives only ~1.5%. That non-uniformity is the qualitative signature we wanted from a mixture rather than a collapse to a single model. The entropy regularizer was what kept the gate from collapsing during training.

Held-out behavior

After joining cached expert outputs, the main training tensor contained 34.96 million rows with a chronologically separated test set of 8.59 million rows (November 6, 2025 onward). Balanced accuracy climbed from 0.495 at five minutes to 0.548 at sixty, consistent with microstructure noise averaging out over longer windows. At five minutes, overall accuracy was 0.676 because the flat class is easy, while macro-F1 of 0.473 better reflects residual difficulty on directional tails.

The five-minute breakdown makes the imbalance obvious. The model handles the flat class reasonably well (precision / recall / F1 = 0.877 / 0.774 / 0.822), but the directional tails remain much harder: down F1 is 0.300 and up F1 is 0.297. That gap is exactly why raw accuracy looked better than the actual tail performance.
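The reported five-minute figures are internally consistent, which a short check makes explicit: the flat-class F1 follows from its precision and recall, and the macro-F1 is the unweighted mean of the three per-class F1 scores. The numbers are the ones quoted above; only the helper name is introduced here.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

f1_flat = f1(0.877, 0.774)                   # flat class at H = 5 min
macro_f1 = (f1_flat + 0.300 + 0.297) / 3     # unweighted mean over flat, down, up
print(round(f1_flat, 3), round(macro_f1, 3))  # 0.822 0.473
```

Macro-F1 weights the two tail classes as heavily as the flat class, which is why it sits well below the 0.676 raw accuracy.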

Figure: Held-out test metrics for the MoE ensemble. Panels: MoE balanced accuracy; mean gate weight at H = 5 on the test split.

Gate weights are averages over held-out rows at H = 5 min; e.g. 0.503 means the gate assigns LightGBM about 50.3% of the mixture weight on average. LightGBM + CTTS hold 77% of gate mass.

Limitations and scope

The labels are only as good as the price data behind them. Some Kalshi contracts trade constantly; others sit quiet for long stretches and then move on a single print. That means one aggregate score can hide weak behavior on thin markets, so we checked results separately by liquidity and gap length. Even with six different architectures, the models can still lean on price-level shortcuts. The CNN stage made that failure mode obvious, and it remains a caveat for the mixture.

The reported numbers are historical results on this fixed split, not a claim about future Kalshi markets. Large artifacts such as frozen tables and checkpoints live in Google Drive because they are too large for the repo. The repo keeps the notebooks and pinned requirements file; rerunning them on the same data reproduces the reported metrics.

Conclusion

The main lesson was that the target definition mattered as much as the architecture. The first CNN was useful because it showed us what could go wrong: a model can look competent while mostly learning price level, liquidity, and gaps in trading activity. Rebuilding the task around train-only down / flat / up thresholds made the evaluation harder, but also more honest.

The mixture-of-experts stage was valuable for a different reason. It did not magically solve the class imbalance, but it gave us a controlled way to combine models that fail differently. The next version should push harder on market-level diagnostics: evaluate by contract family, liquidity bucket, time-to-expiry, and event type; compare the learned gate against simple equal-weight and validation-tuned ensembles; and test calibration, not just classification. I would also like to move from fixed percentile cutoffs toward labels that account for fees, spread, and tradability, because those are closer to the decisions a real system would have to make.

References

1. Jacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hinton, G. E. (1991). Adaptive mixtures of local experts. Neural Computation, 3(1), 79–87. https://doi.org/10.1162/neco.1991.3.1.79
2. Jordan, M. I., & Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6(2), 181–214. https://doi.org/10.1162/neco.1994.6.2.181
3. Wolfers, J., & Zitzewitz, E. (2004). Prediction markets. Journal of Economic Perspectives, 18(2), 107–126. https://doi.org/10.1257/0895330041371321
4. Becker, J. (2024). prediction-market-analysis (public repository for Kalshi trade and market snapshots). https://github.com/jon-becker/prediction-market-analysis
5. Ke, G., et al. (2017). LightGBM: A highly efficient gradient boosting decision tree. Advances in Neural Information Processing Systems, 30.
6. Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
7. Gu, A., & Dao, T. (2024). Mamba: Linear-time sequence modeling with selective state spaces. arXiv:2312.00752. https://arxiv.org/abs/2312.00752
8. Gorishniy, Y., Rubachev, I., Khrulkov, V., & Babenko, A. (2021). Revisiting deep learning models for tabular data (FT-Transformer). Advances in Neural Information Processing Systems, 34. https://arxiv.org/abs/2106.11959
9. Gulati, A., et al. (2020). Conformer: Convolution-augmented Transformer for speech recognition. Proc. INTERSPEECH. https://arxiv.org/abs/2005.08100
10. Shazeer, N., et al. (2017). Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv:1701.06538. https://arxiv.org/abs/1701.06538
11. Woo, G., et al. (2024). Unified training of universal time series forecasting transformers (Moirai). arXiv:2402.02592. https://arxiv.org/abs/2402.02592