NyayaBench evaluates structured reasoning in large language models across four languages using the classical Indian Nyaya Panchavayava, a 7-step epistemic framework with no equivalent in existing multilingual benchmarks.
Built to measure whether reasoning survives challenge — not whether output follows a template.
Qwen3-14B fine-tuned with the Nyāya Panchavayava 7-step structure. Confidence-graded (HIGH / LOW / NONE). RAG over 8 classical Indian text corpora. Refuses ungrounded output.
Two-stage tokenizer for Brahmic scripts: a Unicode finite-state machine that segments at akshara boundaries, followed by a SentencePiece Unigram model trained on 1 million akshara-segmented sentences. 7,294 base akshara units · 64,000 final vocabulary. Covers 99.995% of Brahmic text — 14× more vocabulary-efficient than BPE for these scripts. (Not yet integrated into IYRA training or inference.)
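The first stage described above can be illustrated with a minimal sketch. It is not the AksharaTokenizer itself: it assumes Devanagari input only, uses Unicode general categories (via Python's `unicodedata`) as the state signal, and ignores ZWJ/ZWNJ and other edge cases. The name `segment_aksharas` is hypothetical, and the second stage (the SentencePiece Unigram model) is not shown.

```python
import unicodedata

VIRAMA = "\u094D"  # Devanagari sign virama (halant), joins consonants into conjuncts


def segment_aksharas(text: str) -> list[str]:
    """Split text into akshara-like units with a two-state scan.

    Assumption: a character extends the current unit if it is a combining
    mark (category Mn or Mc), or if it is a letter (category Lo) that
    directly follows a virama. Anything else starts a new unit.
    """
    units: list[str] = []
    for ch in text:
        if units:
            prev = units[-1]
            category = unicodedata.category(ch)
            is_mark = category in ("Mn", "Mc")  # matras, anusvara, virama, ...
            joins_conjunct = prev.endswith(VIRAMA) and category == "Lo"
            if is_mark or joins_conjunct:
                units[-1] = prev + ch
                continue
        units.append(ch)
    return units


if __name__ == "__main__":
    # "namaste" in Devanagari: na, ma, then the conjunct s+t with the vowel sign e
    print(segment_aksharas("\u0928\u092e\u0938\u094d\u0924\u0947"))
```

Segmenting at these boundaries before subword training means a vowel sign or virama is never split from its consonant, which is the property the two-stage design relies on.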
Open evaluation benchmark — 52 questions across English, Hindi, Punjabi, Tamil. Five reasoning dimensions: D1 Format · D2 Pūrvapakṣa · D3 Confidence · D4 Language · D5 Siddhānta. Total 0–9, pass threshold ≥ 6.
The Nyaya Panchavayava is a classical Indian system of inference and argumentation, originating in the Nyaya Sutras of Aksapada Gautama (~2nd century BCE). It structures valid reasoning into seven obligatory steps, making it uniquely suited as a fine-tuning and evaluation constraint for language models required to reason rather than merely retrieve.
NyayaBench operationalises the seven steps through five scoreable, language-agnostic dimensions. A model that cannot produce a structured Purvapaksha (steelman) and a genuine Siddhanta (conclusion with back-reference) fails the benchmark regardless of factual accuracy.
1 · Pratijna (proposition): The claim or thesis to be established. Must open with a [CONFIDENCE: HIGH/LOW/NONE] signal calibrated to retrieval distance.
2 · Hetu (reason): The logical ground or justification for the proposition. Must be distinct from the claim itself.
3 · Udaharana (example): A concrete illustrative example or analogy that supports the stated reason.
4 · Upanaya (application): Application of the general example back to the specific case under examination.
5 · Nigamana (conclusion): Restatement of the proposition as now established by the chain of reasoning above.
6 · Purvapaksha (prior view): A genuine steelman of the strongest opposing position, citing a named philosopher or tradition where applicable. Scored on three sub-dimensions.
7 · Siddhanta (settled conclusion): The final settled conclusion that explicitly resolves the Purvapaksha. Must open with a back-reference to the prior view and offer genuine resolution, not repetition.
Each of the 52 questions is scored across five independently measurable dimensions. The maximum is 9 points; the pass threshold is 6 points (67%). Scoring is automated via regex pattern matching, Unicode-aware philosopher name detection, and character-set-based language identification.
| Dimension | Max | What it measures |
|---|---|---|
| D1 · Format | 1 pt | All 7 Nyaya steps present and correctly labelled in the response. |
| D2 · Purvapaksha | 3 pts | Named philosopher cited (+1), steelman argument present (+1), source text identified (+1). |
| D3 · Confidence | 2 pts | Confidence signal [HIGH / LOW / NONE] present (+1) and appropriately calibrated to RAG retrieval distance (+1). |
| D4 · Language | 1 pt | Response language matches prompt language throughout, no code-switching. |
| D5 · Siddhanta | 2 pts | Back-reference opener present (+1), genuine resolution rather than repetition of the initial claim (+1). |
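The rubric in the table reduces to a small amount of arithmetic per question. The sketch below assumes each dimension has already been scored as an integer; the names `MAX_POINTS`, `PASS_THRESHOLD`, `total_score`, and `passes` are hypothetical and are not taken from the NyayaBench harness.

```python
# Per-dimension maxima as listed in the rubric table (sum = 9).
MAX_POINTS = {"D1": 1, "D2": 3, "D3": 2, "D4": 1, "D5": 2}
PASS_THRESHOLD = 6  # per-question pass mark stated in the text


def total_score(points: dict[str, int]) -> int:
    """Validate one question's per-dimension points and return the total."""
    if set(points) != set(MAX_POINTS):
        raise ValueError("expected exactly the dimensions D1..D5")
    for dim, value in points.items():
        if not 0 <= value <= MAX_POINTS[dim]:
            raise ValueError(f"{dim} must be between 0 and {MAX_POINTS[dim]}")
    return sum(points.values())


def passes(points: dict[str, int]) -> bool:
    """A question passes when its total reaches the threshold."""
    return total_score(points) >= PASS_THRESHOLD


if __name__ == "__main__":
    # Losing all of D2 and D5 leaves at most 1 + 2 + 1 = 4 points.
    no_dialectic = {"D1": 1, "D2": 0, "D3": 2, "D4": 1, "D5": 0}
    print(total_score(no_dialectic), passes(no_dialectic))
```

Because D2 and D5 together carry 5 of the 9 points, a response with no Purvapaksha and no Siddhanta cannot reach the threshold of 6 under these maxima.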
Note on D3 Evaluator (two fixes documented): The initial evaluator accepted only HIGH | MEDIUM | LOW confidence signals, omitting NONE, so all v9 responses that correctly opened with [CONFIDENCE: NONE] scored D3 = 0 in the first run. The corrected evaluator (commit e643db1) adds NONE to the pattern. For v10, a second fix extended the evaluator to detect Siddhanta and Purvapaksha step labels written in Gurmukhi, Devanagari, and Tamil scripts as well as in Latin transliteration. This corrected Indic-language scoring that had previously undercounted D3 and D5 in Hindi, Punjabi, and Tamil responses. Both raw and corrected scores are reported in the companion paper for full transparency.
Scored with NyayaBench evaluator v1 (historical training run). Current scores under the corrected v2 evaluator are shown in the comparison below.
All results are from IYRA (Indic Yukti Reasoning Architecture): a QLoRA fine-tune of Qwen3-14B on the Nyaya Panchavayava framework. Training data spans English, Hindi, Punjabi, and Tamil. Results show the per-version trajectory from first training run through v10 (current release, 67.3%).
v10 language breakdown (pass rate per language): English 82% · Punjabi 70% · Hindi 30% · Tamil 20%.
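D4 relies on character-set-based language identification. A minimal sketch of that idea follows, assuming the Unicode block ranges for Devanagari, Gurmukhi, and Tamil and a simple script-to-language mapping. The function names and the `min_share` cut-off are hypothetical; a real evaluator would also need to decide how to treat Latin-script tokens such as the `[CONFIDENCE: ...]` tag.

```python
# Unicode block ranges (inclusive), mapped to the benchmark's languages.
SCRIPT_RANGES = {
    "Hindi": (0x0900, 0x097F),    # Devanagari block
    "Punjabi": (0x0A00, 0x0A7F),  # Gurmukhi block
    "Tamil": (0x0B80, 0x0BFF),    # Tamil block
}


def script_counts(text: str) -> dict[str, int]:
    """Count characters per language bucket; ASCII letters count as English."""
    counts = {"English": 0, "Hindi": 0, "Punjabi": 0, "Tamil": 0}
    for ch in text:
        if ch.isascii() and ch.isalpha():
            counts["English"] += 1
            continue
        code_point = ord(ch)
        for language, (low, high) in SCRIPT_RANGES.items():
            if low <= code_point <= high:
                counts[language] += 1
                break
    return counts


def language_matches(text: str, expected: str, min_share: float = 0.9) -> bool:
    """True when at least min_share of the counted characters fit the expected language."""
    counts = script_counts(text)
    counted = sum(counts.values())
    if counted == 0:
        return False
    return counts[expected] / counted >= min_share


if __name__ == "__main__":
    print(language_matches("\u0928\u092e\u0938\u094d\u0924\u0947", "Hindi"))
    print(language_matches("hello \u0928\u092e\u0938\u094d\u0924\u0947", "Hindi"))
```

A share-based check like this flags code-switching without a language model, at the cost of treating any Latin-script proper noun as English.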
NyayaBench is designed to be model-agnostic. Any LLM can be evaluated against these 52 questions. The scoring harness will be published alongside the arXiv paper.
Full baseline · NyayaBench evaluator v2, corrected for Indic-script philosopher names and Pūrvapakṣa substance. All six models were scored with the same evaluator.
Fifty-two questions across four languages, three local models and three frontier APIs, all evaluated with the same prompt and scoring harness. IYRA v10 is the only model to clear both NyayaBench thresholds: pass rate ≥ 70% and bench score ≥ 67%.
| Model | Pass % | Bench % | D1 | D2 | D3 | D4 | D5 |
|---|---|---|---|---|---|---|---|
| IYRA v10 ✦ | 84.6 (passes) | 76.1 | 0.98 | 2.13 | 2.00 | 0.96 | 0.77 |
| Qwen3.6-27B (base) | 67.3 | 66.0 | 0.94 | 2.25 | 1.52 | 1.00 | 0.23 |
| Qwen3-14B (base) | 59.6 | 60.0 | 0.98 | 1.92 | 1.60 | 0.86 | 0.04 |
| GPT-4o | 40.4 | 52.4 | 0.73 | 1.21 | 1.61 | 1.00 | 0.15 |
| Claude Sonnet 4.6 | 38.5 | 53.4 | 0.77 | 1.83 | 0.71 | 1.00 | 0.50 |
| Gemini 2.5 Pro | 3.8 | 20.7 | 0.00 | 0.77 | 0.02 | 1.00 | 0.08 |
IYRA v10 is the only model to pass NyayaBench v1.0, clearing both the 70% pass-rate threshold and the 67% bench-score threshold across all 52 questions and four languages. In the table above, all three frontier models (GPT-4o, Claude Sonnet 4.6, Gemini 2.5 Pro) score 1.00 on D4 Language; their gap to IYRA v10 lies in D1 Format (0.00 to 0.77 against 0.98), D3 Confidence (0.02 to 1.61 against 2.00), and D5 Siddhanta (0.08 to 0.50 against 0.77). Scoring criteria are documented in the scoring methodology (full note to accompany the arXiv preprint).
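The two benchmark-level thresholds can be checked directly against the results table. The sketch below assumes that Bench % is the sum of the per-dimension means divided by the 9-point maximum. That reading is an inference from the table, not a documented formula; it reproduces the reported column to within about 0.2 points, which is consistent with rounding of the per-dimension means. All function names are hypothetical.

```python
MAX_TOTAL = 9  # maximum points per question

# Per-dimension means (D1..D5) copied from the results table.
DIMENSION_MEANS = {
    "IYRA v10": (0.98, 2.13, 2.00, 0.96, 0.77),
    "Qwen3.6-27B (base)": (0.94, 2.25, 1.52, 1.00, 0.23),
    "Qwen3-14B (base)": (0.98, 1.92, 1.60, 0.86, 0.04),
    "GPT-4o": (0.73, 1.21, 1.61, 1.00, 0.15),
    "Claude Sonnet 4.6": (0.77, 1.83, 0.71, 1.00, 0.50),
    "Gemini 2.5 Pro": (0.00, 0.77, 0.02, 1.00, 0.08),
}

# Reported (Pass %, Bench %) copied from the same table.
REPORTED = {
    "IYRA v10": (84.6, 76.1),
    "Qwen3.6-27B (base)": (67.3, 66.0),
    "Qwen3-14B (base)": (59.6, 60.0),
    "GPT-4o": (40.4, 52.4),
    "Claude Sonnet 4.6": (38.5, 53.4),
    "Gemini 2.5 Pro": (3.8, 20.7),
}


def bench_percent(means: tuple[float, ...]) -> float:
    """Assumed formula: mean total score as a percentage of the maximum."""
    return 100 * sum(means) / MAX_TOTAL


def clears_thresholds(pass_rate: float, bench: float,
                      min_pass_rate: float = 70.0, min_bench: float = 67.0) -> bool:
    """Both thresholds stated in the text must be met."""
    return pass_rate >= min_pass_rate and bench >= min_bench


if __name__ == "__main__":
    for model, means in DIMENSION_MEANS.items():
        pass_rate, bench = REPORTED[model]
        print(f"{model}: recomputed {bench_percent(means):.1f}, "
              f"reported {bench}, clears both: {clears_thresholds(pass_rate, bench)}")
```

Run against the reported columns, only the IYRA v10 row satisfies both conditions.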
A companion paper describing the NyayaBench methodology, dataset construction, scoring harness, and full IYRA model progression results is currently in preparation for arXiv submission.
The paper covers the Nyaya Panchavayava as a fine-tuning constraint, the AksharaTokenizer for linguistically correct Brahmic script tokenization, the RAG-grounded HIGH / LOW / NONE confidence tier system, and the v6 → v10 model progression.
arXiv Preprint
In preparation · Expected 2026
Intellectual Property — Indian Patent Office, Provisional Applications:
IYRA Reasoning Architecture
Application No. 202611067244 · Filed 29 May 2026
"System and Method for Training and Executing an AI Language Model to Generate Verifiable Dialectical Reasoning Data Structures"
AksharaTokenizer
Application No. 202611071450 · Filed 9 June 2026
"A Two-Stage Tokenization System for Brahmic Scripts and Related Methods"
Complete Specifications and PCT filings within 12 months of each priority date.
For academic correspondence, collaboration inquiries, or early access to the scoring harness, contact: gursimran@nyayabench.com