Can LLMs truly reason, or do they just string together patterns until the math works? The debate heats up around ProofOfThought's approach, which wires LLMs to the Z3 theorem prover and to LEAN for autoformalization. The thread anchors the discussion in how LLMs handle uncertainty, how they produce structured outputs, and whether their reasoning is robust enough to trust [1].
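To make the Z3 half concrete, here is a minimal sketch, assuming the LLM has already translated a claim into a handful of Boolean constraints. The predicate names are hypothetical and the code illustrates the general pattern rather than ProofOfThought's actual pipeline; it uses the z3-solver Python bindings and treats unsatisfiability of the negated claim as verification.

```python
# Minimal sketch: checking an LLM-emitted logical claim with Z3.
# Requires the z3-solver package (pip install z3-solver).
# The rule/fact names are hypothetical; a real pipeline would generate
# them from the model's structured output.
from z3 import Bool, Implies, Not, Solver, unsat

socrates_is_human = Bool("socrates_is_human")
socrates_is_mortal = Bool("socrates_is_mortal")

solver = Solver()
solver.add(Implies(socrates_is_human, socrates_is_mortal))  # rule emitted by the LLM
solver.add(socrates_is_human)                               # fact emitted by the LLM
solver.add(Not(socrates_is_mortal))                         # negate the claim under test

# If the negated claim is unsatisfiable, the claim follows from the
# premises, so the inference is verified rather than merely trusted.
if solver.check() == unsat:
    print("claim verified: socrates_is_mortal follows from the premises")
else:
    print("claim not entailed; counterexample:", solver.model())
```

The design point the thread keeps returning to is the separation of duties: even if the LLM's translation is wrong, Z3 only certifies what was actually encoded, which is exactly where the autoformalization gap discussed below comes in.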
What’s being tried

- Autoformalization with LEAN: LLMs convert internal docs and policies into LEAN, and the prover is then run to check for consistency. Engineers review the results, keeping a human in the loop [1].
- Structured outputs beat raw text: many pipelines still parse free text, but modern tools can emit schema-compliant output directly (e.g., via function calling). OpenAI's structured outputs are cited as a way to reduce hallucinated keys; a sketch follows this list [1].
- Uncertainty quantification: the team reports uncertainty quantification for autoformalization over well-defined grammars in their NeurIPS 2025 paper (arXiv:2505.20047) [1].
- Domain fit: the approach matters most in domains with tight legal or financial compliance requirements [1].
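A minimal sketch of the structured-output idea, assuming a hypothetical clause-extraction step: the field names, `CLAUSE_SCHEMA`, and `parse_clause` are illustrative, not ProofOfThought's actual format. The schema pins down exactly which keys the model may emit, `additionalProperties: false` is what rules out hallucinated keys, and a local `jsonschema` check catches anything malformed before it reaches the solver.

```python
# Minimal sketch: constraining an LLM extraction step with a JSON Schema
# instead of parsing free text. Field names are illustrative.
# Requires the jsonschema package (pip install jsonschema).
import json
from jsonschema import ValidationError, validate

# The schema a structured-output / function-calling API would be asked to
# enforce. "additionalProperties": False is what rejects hallucinated keys.
CLAUSE_SCHEMA = {
    "type": "object",
    "properties": {
        "subject": {"type": "string"},
        "predicate": {"type": "string"},
        "negated": {"type": "boolean"},
    },
    "required": ["subject", "predicate", "negated"],
    "additionalProperties": False,
}

def parse_clause(raw_model_output: str) -> dict:
    """Validate a model response against the schema before handing it to a solver."""
    clause = json.loads(raw_model_output)
    validate(instance=clause, schema=CLAUSE_SCHEMA)  # raises ValidationError on bad shape
    return clause

good = '{"subject": "socrates", "predicate": "mortal", "negated": false}'
bad = '{"subject": "socrates", "predicate": "mortal", "negated": false, "confidence_vibe": "high"}'

print(parse_clause(good))
try:
    parse_clause(bad)
except ValidationError as err:
    print("rejected hallucinated key:", err.message)
```

With a provider that supports strict structured outputs, the same schema can be attached to the request so the decoder is constrained to schema-valid tokens in the first place; the local check then only guards against transport or versioning issues.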
Why this matters

- Critics ask whether LLMs truly reason or merely simulate reasoning; the tension centers on how far to trust the chain from model to formal verifier [1].
- Proponents argue that coupling LLMs with formal tools like Z3 and LEAN makes results verifiable, while skeptics point to the autoformalization gap: a wrong formalization can still be "verified" while misrepresenting the original claim [1].
Keep an eye on ProofOfThought’s ongoing work and upcoming NeurIPS/arXiv results for clearer answers [1].
References
[1] ProofOfThought: LLM-based reasoning using Z3 theorem proving. Discusses using LLMs for autoformalization with LEAN, Z3 verification, uncertainty quantification, structured outputs, and the debate over whether LLMs can reason properly.