iDEF

Benchmarks & Engineering

← Back to Chat

IMO-Inspired Geometry Benchmarks

Performance on International Mathematical Olympiad Style Problems

Verified Success Rate

100%

75%

50%

25%

74%

Model A5

78%

Model A6

📥 Download Complete Dataset

How It Was Tested

Geometry Proof Benchmark Protocol ════════════════════════════════════════════════════════════════════════════════════ Algebraic Elimination Is a Real Chain of Reasoning, but displayed different way. ────────────────────────────────────────────────────────────────────────────────────── 1) Same theory, different weapons. Elementary Euclidean geometry (incidence, perpendicularity, parallelism, equal lengths, tangency, cyclicity, ratios) can be translated into polynomial equalities and inequalities over real coordinates. Some examples like: Tarski’s program, Hilbert's Nullstellensatz and real-closed-field quantifier elimination formalize this bridge. 2) Stepwise derivation and Human-auditable structure. The above items our models generate—coordinate model, generators, goal polynomials, reduction steps, and NDGs—are a transparent chain of reasoning. Each step is a definable, finite transformation justified by algebraic identities; nothing relies on opaque heuristics. In short: it is a *proof narrative* in algebraic calculus. Where This Process Already Matters ────────────────────────────────────────────────────────────────────────────────────── Many Complex problems in Euclidean geometry (and physics-flavored geometry) admit a precise optimization/variational reformulation. The geometry predicates become constraints, the target property becomes either a feasibility objective (“drive residual to 0”) or a true extremum (length/area/energy...). Its just like solving geometric problems is solved by several methods such as KKT/Lagrange multiplier systems (finite-dimensional) or Euler–Lagrange equations (infinite-dimensional). When constraints/objectives are polynomial, sum-of-squares (SOS)/moment and semidefinite programming (SDP) provide global certificates of optimality. SOS/SDP gives certificates under algebraic hypotheses. This is standard, mathematically rigorous practice. *Dynamic geometry & education - Automated reasoning modules translate constructions into polynomials and derive consequences symbolically, but they mostly hide their chain of reasoning and Is not available for humans to check, as that requires whole another layer-level of engeneering. *Computer vision & robotics - Here solvers (PnP, relative pose, essential matrix) are designed by forming polynomial systems (and eliminating variables). *CAD/geometry processing - Constraint solvers reduce incidence/metric constraints to algebraic systems and analyze feasibility symbolically. Formal methods & control - Nonlinear real constraints are tackled by algebraic methods closely related to elimination, real-algebraic geometry, and quantifier elimination. Benchmark Design, How to validate (assistance of LLMs) & Important Caveat ( How Comparable to LLMs and neuro-symbol geometry provers) ────────────────────────────────────────────────────────────────────────────────────── Problem's solution is said to be right and accepted when it generates full chain of reasoning and shows steps it made. As it generates full chain of reasoning you can check it yourself, or to sped up that, you can use modern LLMs such as GPT-5 or Claude 4.5 Sonnet for that, paste that scheme of problem and output given by idef and ask "Is it mathematically rigorous and accurate steps and solution, why?" it can explain even more and from our experience it mostly gives positive answers, however don't trust at LLMs only, rather use it to expand that solution into simpler look at steps, as not everyone needs that level of details that IDEF intentionally generates. Benchmark was tested on IDEF's API of model A5(available for use for developers), on dataset consisting of nerfed IMO inspired problems, as because of hardware limitations of this startup, processing full IMO problems remain very time-consuming and can take thousands of seconds, but eventually will still give solution. Therefore decided to test against a bit nerfed-IMO, inspired geometric problems fused with some other Olympiads. During evaluation, crash cases for A5 were not penalized in the solve-rate metric, since termination occurred before the model could deploy its full inference pipeline; not to mention the fact that, in addition, A5 employs backward-reasoning refinements to improve accuracy and proof rigor. Dataset is available and anyone can check it by themselves, and we're 90% sure, it will give more or less (+- 5%) same results written here. However, because of IDEF's limited budget and pure hardware limitations, models hallucinate often when you give it geometry problem in natural way. As you already might have seen, IDEF model can talk to you with anything, but in terms of geometry problems, if you want for it to use full reasoning power, it's better if you just give your problem transformed into specific : For example, take this IMO problem: "description": "Let ABC be an acute triangle with circumcircle Γ. Let ℓ be a tangent line to Γ. Let ℓ_a, ℓ_b, ℓ_c be the reflections of ℓ across BC, CA, and AB, respectively. Prove that the circumcircle of the triangle formed by ℓ_a, ℓ_b, ℓ_c is tangent "goal": "Show that the circumcircle of the triangle determined by ℓ_a, ℓ_b, ℓ_c is tangent to Γ." to Γ" .This is natural way of forming your geometric question, but directly-problem format of that specific problem we talked about would look like this: GIVEN: POINTS A,B,C,O,T,L,T_A,L_A,T_B,L_B,T_C,L_C,M_TA,M_LA,M_TB,M_LB,M_TC,M_LC,P_A,P_B,P_C,M_PAB,M_PBC,O2,K EQLEN O A O B EQLEN O B O C EQLEN O T O A PERP O T T L COL B C M_TA EQLEN T M_TA M_TA T_A PERP B C T T_A COL B C M_LA EQLEN L M_LA M_LA L_A PERP B C L L_A COL C A M_TB EQLEN T M_TB M_TB T_B PERP C A T T_B COL C A M_LB EQLEN L M_LB M_LB L_B PERP C A L L_B COL A B M_TC EQLEN T M_TC M_TC T_C PERP A B T T_C COL A B M_LC EQLEN L M_LC M_LC L_C PERP A B L L_C COL T_B L_B P_A COL T_C L_C P_A COL T_C L_C P_B COL T_A L_A P_B COL T_A L_A P_C COL T_B L_B P_C COL P_A P_B M_PAB EQLEN P_A M_PAB M_PAB P_B COL P_B P_C M_PBC EQLEN P_B M_PBC M_PBC P_C PERP P_A P_B M_PAB O2 PERP P_B P_C M_PBC O2 GOAL: EQLEN O K O A EQLEN O2 K O2 P_A COL O O2 K You might have recalled that Google DeepMind's AlphaGeometry2 also uses similar principle, but we will talk about that with great detail in IDEF's engineering part.

Engineering Details

IDEF AI is a closed-source startup project, so we cannot disclose implementation details. The core codebase comprises roughly 21,000 dense lines focused on optimization and general reasoning. We think that, verifier-augmented generation as the most credible path to improving LLM accuracy on Math specific benchmarks. Foundational work on LLM training as well as recent OpenAI research, shows that models are optimized and artificially motivated to produce answers rather than to abstain. This is done on purpose, so models can generalize, they are incentivized to respond and to rationalize even when a query is out of distribution(Way off its fine tune or pre-training, therefore Model doesn't knows whether its true or not), which leaves correctness uncertain. If one instructs a model to refrain whenever it is unsure, it will not be able to answer tasks users mostly care about. So we can guess that this fundamental part of LLM's architecture will remain more or less same. Therefore, at this point, the practical remedy is external verification—e.g., Z3 and Lean 4—that checks intermediate steps and gates continuation, especially for multi-modal mathematical reasoning. This method can be characterized as externally enhancing the model’s reasoning—akin to a lecturer guiding a student mid-solution by indicating whether the line of thought is on the right track or not. We adopted this approach in IDEF, and in our experiments it improved accuracy on models such as Mistral-7B (evaluated on Google Colab T4 GPUs). For improving reasoning internally as well, IDEF incorporates internal lightweight extension of Virtual Parameters System (VPS), which injects “virtual” parameters that behave like added capacity(mimics added additional parameters) without increasing the explicit parameter count. Its like adding extra IQ points temporarily during thinking, while size of model's parameter count remains same. This line of work is early-stage but showed encouraging results on small Qwen models (e.g., 1.5B), which is one reason IDEF also ships a Qwen-3B-based configuration. Only the A5 model includes the VPS addition; A6 does not. A VPS “lite” research prototype is available on GitHub. As of this addition, the engineering is non-trivial and A5 can still crash in edge cases; we plan to harden this. IDEF is the synthesis of several years of open-source research and practical lessons. For the assistant stack, we favor Qwen variants, further adapted with a Qwen-72B-Math-TIR(Tool-integrated reasoning)-style configuration that has demonstrated strong efficiency at its scale and remains state-of-the-art in its class, outperforming many larger models. Google DeepMind’s AlphaGeometry inspired the system’s geometry-centric design; while our approach is distinct, we adopted some lessions from DeepMind's project and for example a scheme-based problem representation similar in spirit to AlphaGeometry2, as illustrated in the accompanying figure(and not just that). ╔═══════════════════════════╦════════════════════════╦════════════════════════════════╗ ║Concept ║IDEF DSL ║AG2/AG1 ║ ╠═══════════════════════════╬════════════════════════╬════════════════════════════════╣ ║Equal length ║`EQLEN A B C D` ║`cong a b c d` ║ ║Perpendicular ║`PERP A B C D` ║`perp a b c d` ║ ║Parallel ║`PAR A B C D` ║`para a b c d` ║ ║Collinear ║`COL A B C` ║`coll a b c` ║ ║Concyclic ║`CYCLIC A B C D` ║`cyclic a b c d` ║ ║Directed angle equality ║`EQANG A B C D E F` ║`eqangle a b c d e f` ║ ║Ratio equality ║(various forms) ║`eqratio a b c d e f g h` ║ ║Fixed angle/ratio ║your constants/params ║`aconst … x; rconst … y` ║ ╚═══════════════════════════╩════════════════════════╩════════════════════════════════╝ To raise proof accuracy, we integrate classical engines—Wu’s method and Gröbner-basis reasoning as primary paths—augmented and minorly assisted by an also other SMT backend for lightweight assistance. The principal limitation today is resourcing: with a startup-scale budget and hardware, the system is closer to a “GPT-2/3-era” research stage; it still hallucinates at times, and solving harder instances can require long wall-clock times (occasionally on the order of thousands of seconds). Despite global, problem-agnostic optimization passes, we believe much of the remaining gap is hardware-bound rather than algorithmic. Our development draws on, among others: * Point geometry / self-evident identities (Zhang–Peng–Chen, 2019). * Complex-number identity method (Peng et al., 2023). * Readable polynomial-equality proofs (Jiang–Zhang–Wang, 2008). * PointGeo / Coq point-geometry system (Lei et al., 2023). * Earlier vector / full-angle / invariant work (ACM Digital Library). * “Towards Automated Readable Proofs of Ruler and Compass Constructions” (2024). * ArgoCLP (coherent-logic–based geometry prover), which produces both machine-verifiable and human-readable proofs across multiple geometry axiom systems. ...