The Unreasonable Effectiveness of Non‑Overlapping Failures in LLM Prover Ensembles
Improving the reliability of mathematical reasoning in large language models (LLMs) is critical for applications in education, automated theorem proving, and formal verification. This paper investigates whether the functional diversity of prover models, specifically their non-overlapping failures, can be harnessed through ensembling to improve collective performance. We present a theoretical risk decomposition framework for an OR-aggregated ensemble of theorem provers, demonstrating that the ensemble's risk equals the average individual risk minus an 'ambiguity effect' that quantifies the diversity of the provers. Because proofs are machine-checkable, the OR-ensemble succeeds whenever any single member succeeds, and our analysis formalizes the resulting intuition that diversity, defined as non-overlapping failures, is strictly beneficial in this setting. We hypothesize that such ensembles not only surpass the accuracy of any individual model but can also produce proofs for statements previously unprovable by any single prover. Finally, we aim to investigate whether these techniques can be applied to current state-of-the-art models to push performance on difficult, unsaturated benchmarks such as PutnamBench.
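As a minimal sketch of the claimed decomposition (the notation is ours, not necessarily the paper's): let $f_i(x) \in \{0, 1\}$ indicate that prover $i$ of $M$ fails on statement $x$, so the OR-ensemble fails only when every member fails, i.e. $F(x) = \min_{1 \le i \le M} f_i(x)$. Taking expectations over statements then gives

\[
  R_{\mathrm{ens}}
  = \mathbb{E}_x\!\left[\min_{1 \le i \le M} f_i(x)\right]
  = \underbrace{\frac{1}{M}\sum_{i=1}^{M} \mathbb{E}_x\!\left[f_i(x)\right]}_{\bar{R} \text{ (average individual risk)}}
  \;-\;
  \underbrace{\mathbb{E}_x\!\left[\frac{1}{M}\sum_{i=1}^{M} f_i(x) \;-\; \min_{1 \le i \le M} f_i(x)\right]}_{A \,\ge\, 0 \text{ (ambiguity)}}.
\]

Since a pointwise average is never smaller than a pointwise minimum, $A \ge 0$, and $A$ is strictly positive exactly when, on a set of statements with positive probability, at least one prover succeeds while another fails, which is the non-overlapping-failure condition described above.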