NyayaBench: Epistemic Reasoning Benchmark for Multilingual LLMs

What We Built

Three-Component System

IYRA: Reasoning Engine

Qwen3-14B fine-tuned with the Nyāya Panchavayava 7-step structure. Confidence-graded (HIGH / LOW / NONE). RAG over 8 classical Indian text corpora, flagging NONE rather than fabricating when retrieval is empty.
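As a rough illustration of how a confidence tier could be tied to retrieval distance, the sketch below maps the closest retrieved passage to HIGH, LOW, or NONE. The function name and both thresholds are hypothetical; the cut-offs IYRA actually uses are not stated here.

```python
from typing import Sequence

# Hypothetical cut-offs: IYRA's real thresholds are not published in this text.
HIGH_MAX_DISTANCE = 0.35
LOW_MAX_DISTANCE = 0.60


def confidence_tier(distances: Sequence[float]) -> str:
    """Map retrieval distances (smaller = closer match) to a confidence tier."""
    if not distances:
        # Empty retrieval: flag NONE rather than answer from nothing.
        return "NONE"
    best = min(distances)
    if best <= HIGH_MAX_DISTANCE:
        return "HIGH"
    if best <= LOW_MAX_DISTANCE:
        return "LOW"
    return "NONE"
```

The design point is the first branch: an empty result set short-circuits to NONE before any threshold is consulted.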

AksharaTokenizer

Two-stage tokenizer for Brahmic scripts: a Unicode finite-state machine that segments at akshara boundaries (six scripts, all 23 boundary tests passing), followed by a SentencePiece Unigram model trained on 5.99 million akshara-segmented sentences with a 7,000-token vocabulary. On FLORES-200 devtest it uses roughly 3 to 6× fewer tokens per word than Qwen3-14B across the six Indic scripts. (Not yet integrated into IYRA training or inference.)
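To make the first stage concrete, here is a deliberately simplified sketch of akshara segmentation for Devanagari only. It is not the AksharaTokenizer state machine: the function name is hypothetical, the rule set is an assumption, and it ignores joiner characters (ZWJ/ZWNJ) and the other scripts the real tokenizer covers.

```python
import unicodedata

VIRAMA = "\u094D"  # Devanagari virama (halant)


def split_aksharas(word: str) -> list[str]:
    """Split a Devanagari word into approximate aksharas.

    Simplified rule: a character joins the current cluster if it is a
    combining mark (vowel sign, virama, anusvara, nukta, ...) or if the
    previous character was a virama, which binds the next consonant
    into a conjunct.
    """
    clusters: list[str] = []
    for ch in word:
        is_mark = unicodedata.category(ch).startswith("M")
        after_virama = bool(clusters) and clusters[-1].endswith(VIRAMA)
        if clusters and (is_mark or after_virama):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters
```

For example, split_aksharas("नमस्ते") yields three clusters: न, म, and स्ते. A byte-level tokenizer would instead see each code point as several bytes, which is where the token-count gap comes from.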

NyayaBench

Open evaluation benchmark: 52 questions across English, Hindi, Punjabi, Tamil. Five reasoning dimensions: D1 Format · D2 Pūrvapakṣa · D3 Confidence · D4 Language · D5 Siddhānta. Total 0–9, pass threshold ≥ 6.

The Framework

Nyaya Panchavayava

The Nyaya Panchavayava is a classical Indian system of inference and argumentation, originating in the Nyaya Sutras of Aksapada Gautama (~2nd century BCE). It structures valid reasoning into five obligatory limbs (Panchavayava means "five-membered"). IYRA adds two dialectical steps, Purvapaksha and Siddhanta, for seven steps in total. Because every step is explicit and checkable, the structure works well as a fine-tuning and evaluation constraint for language models required to reason rather than merely retrieve.

NyayaBench scores the seven steps through five language-agnostic dimensions. A model that cannot produce a structured Purvapaksha (steelman) and a genuine Siddhanta (conclusion with back-reference) fails the benchmark regardless of factual accuracy: those two dimensions carry 5 of the 9 points, so the 6-point pass mark is out of reach without them.

Figure: The seven-step flow of IYRA's Nyaya reasoning output. Purvapaksha (counter-view) leads into the classical five limbs, Pratijna (claim), Hetu (reason), Udaharana (example), Upanaya (application), and Nigamana (conclusion), and the flow closes with Siddhanta (settled view). Purvapaksha and Siddhanta are the two dialectical additions to the classical five-limb syllogism.

Pratijna - Proposition

The claim or thesis to be established. Must open with a [CONFIDENCE: HIGH/LOW/NONE] signal calibrated to retrieval distance.

Hetu - Reason

The logical ground or justification for the proposition. Must be distinct from the claim itself.

Udaharana - Example

A concrete illustrative example or analogy that supports the stated reason.

Upanaya - Application

Application of the general example back to the specific case under examination.

Nigamana - Conclusion

Restatement of the proposition as now established by the chain of reasoning above.

Purvapaksha - Prior View

A genuine steelman of the strongest opposing position, citing a named philosopher or tradition where applicable. Scored on three sub-dimensions.

Siddhanta - Established Doctrine

The final settled conclusion that explicitly resolves the Purvapaksha. Must open with a back-reference to the prior view and offer genuine resolution, not repetition.
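A structural check for the seven steps could look like the sketch below. It assumes each step begins a line with its Latin label followed by a colon, in the order given by the seven-step flow; that output format and the function name are assumptions for illustration, not the published harness.

```python
import re

# Order follows the seven-step flow. "Label:" at the start of a line is an
# assumed output format, not IYRA's documented one.
STEPS = ["Purvapaksha", "Pratijna", "Hetu", "Udaharana",
         "Upanaya", "Nigamana", "Siddhanta"]


def has_all_steps_in_order(response: str) -> bool:
    """Return True if every step label starts a line, in the expected order."""
    position = 0
    for step in STEPS:
        pattern = re.compile(rf"^{step}\s*:", re.MULTILINE | re.IGNORECASE)
        match = pattern.search(response, position)
        if match is None:
            return False
        # Continue searching after this label so order is enforced.
        position = match.end()
    return True
```

A check like this only confirms that the labels are present and ordered. Whether the Siddhanta genuinely resolves the Purvapaksha needs a separate judgment.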

Evaluation

Five Scoring Dimensions

Each of the 52 questions is scored across five independently measurable dimensions. The maximum is 9 points; the pass threshold is 6 points (67%). Scoring is automated via regex pattern matching, Unicode-aware philosopher name detection, and character-set-based language identification.

Dimension	Max	What it measures
D1 · Format	1 pt	All 7 Nyaya steps present and correctly labelled in the response.
D2 · Purvapaksha	3 pts	Named philosopher cited (+1), steelman argument present (+1), source text identified (+1).
D3 · Confidence	2 pts	Confidence signal [HIGH / LOW / NONE] present (+1) and appropriately calibrated to RAG retrieval distance (+1).
D4 · Language	1 pt	Response language matches prompt language throughout, no code-switching.
D5 · Siddhanta	2 pts	Back-reference opener present (+1), genuine resolution rather than repetition of the initial claim (+1).
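The table can be read as a small data structure. The sketch below encodes the per-dimension maxima and the 6-point pass mark from the text; the class and field names are hypothetical.

```python
from dataclasses import dataclass

# Per-dimension maxima from the table above (sum = 9).
MAX_POINTS = {"D1": 1, "D2": 3, "D3": 2, "D4": 1, "D5": 2}
PASS_THRESHOLD = 6


@dataclass
class QuestionScore:
    d1: int
    d2: int
    d3: int
    d4: int
    d5: int

    def __post_init__(self) -> None:
        # Reject values outside each dimension's range.
        for name, cap in MAX_POINTS.items():
            value = getattr(self, name.lower())
            if not 0 <= value <= cap:
                raise ValueError(f"{name} must be in 0..{cap}, got {value}")

    @property
    def total(self) -> int:
        return self.d1 + self.d2 + self.d3 + self.d4 + self.d5

    @property
    def passed(self) -> bool:
        return self.total >= PASS_THRESHOLD
```

For instance, a response with a perfect format, confidence, and language score but no Purvapaksha or Siddhanta credit totals 4 and fails.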

Note on D3 Evaluator (two fixes documented): The initial evaluator accepted only HIGH | MEDIUM | LOW confidence signals, omitting NONE. All v9 responses that correctly opened with [CONFIDENCE: NONE] scored D3 = 0 in the first run. The corrected evaluator (commit e643db1) adds NONE to the pattern. For v10, a second fix extended the evaluator to detect Siddhanta and Purvapaksha step labels written in Gurmukhi, Devanagari, and Tamil scripts, not just Latin transliteration, correcting Indic-language scoring that had previously undercounted D3 and D5 in Hindi, Punjabi, and Tamil responses. Both raw and corrected scores are reported in the companion paper for full transparency.
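The first fix is easy to see in miniature. The two patterns below are reconstructions based on the description above, assuming the tag is written as [CONFIDENCE: ...]; they are not copied from the evaluator source.

```python
import re
from typing import Optional

# Assumed tag format. The original pattern lacks NONE, so a response that
# correctly opens with [CONFIDENCE: NONE] is not detected at all.
ORIGINAL = re.compile(r"\[CONFIDENCE:\s*(HIGH|MEDIUM|LOW)\]")
CORRECTED = re.compile(r"\[CONFIDENCE:\s*(HIGH|MEDIUM|LOW|NONE)\]")


def confidence_signal(response: str, pattern: re.Pattern) -> Optional[str]:
    """Return the confidence tier found in the response, or None if absent."""
    match = pattern.search(response)
    return match.group(1) if match else None


response = "[CONFIDENCE: NONE] No supporting passage was retrieved."
print(confidence_signal(response, ORIGINAL))   # None
print(confidence_signal(response, CORRECTED))  # NONE
```

Under the original pattern the signal is missed, which is how correctly calibrated NONE answers ended up with D3 = 0.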

Model Progression

IYRA Baseline Results

Rows v6 to v9 are historical training-run scores under the v1 evaluator; the v10 row and the comparison below use the corrected v2 evaluator.

All results are from IYRA (Indic Yukti Reasoning Architecture): a QLoRA fine-tune of Qwen3-14B on the Nyaya Panchavayava framework. Training data spans English, Hindi, Punjabi, and Tamil. Results show the per-version trajectory from first training run through v10 (benchmarked release, 76.9% pass rate and 80.3% bench score at the fair 8,192-token budget).

v10 pass rate by language: English 77.3% · Hindi 80.0% · Punjabi 70.0% · Tamil 80.0%. The Qwen base model and GPT-4o collapse on the Indic languages (0% Punjabi); Claude and Gemini stay competitive across languages once not truncated.

NyayaBench is designed to be model-agnostic. Any LLM can be evaluated against these 52 questions. The scoring harness will be published alongside the arXiv paper.

Model D1 D2 D3 D4 D5 Score Pass Rate

Competitive Evaluation

How IYRA Compares

Full baseline · NyayaBench evaluator v2, corrected for Indic-script philosopher names and Pūrvapakṣa substance. All five models were scored with the same procedure, using greedy decoding at temperature 0.

Fifty-two questions across four languages, two local models and three frontier APIs, all evaluated with the same prompt and scoring harness, retrieval-free, greedy at temperature 0, with an 8,192-token budget. IYRA v10 leads on both pass rate and bench score; given a budget large enough to avoid truncation, Claude and Gemini also clear both NyayaBench thresholds (pass rate ≥ 70% and bench score ≥ 67%), while the base model and GPT-4o do not.

Model	Pass %	Bench %	D1	D2	D3	D4	D5
IYRA v10 ✦	76.9 (passes)	80.3	1.00	1.87	2.00	0.98	1.39
Claude Sonnet 4.6	71.2 (passes)	70.5	0.98	1.83	1.60	1.00	0.94
Gemini 2.5 Pro	71.2 (passes)	70.7	1.00	1.96	1.60	1.00	0.81
Qwen3-14B (base)	44.2	56.0	0.96	1.35	1.56	1.00	0.17
GPT-4o	38.5	51.5	0.73	1.19	1.65	1.00	0.06

IYRA v10 leads the comparison, with the highest pass rate and bench score and the strongest counterargument (D2) and resolution (D5) scores. Given a token budget large enough to avoid truncation, Claude Sonnet 4.6 and Gemini 2.5 Pro also clear both thresholds; the Qwen base model and GPT-4o do not. GPT-4o answers tersely without the full structure, and the base model emits the steps without a genuine Siddhanta. Scoring criteria are documented in the methodology section below (full note to accompany the arXiv preprint).
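The Bench % column appears to be the mean total score divided by the 9-point maximum. That reading is an inference from the table, not a stated definition, and the sketch below checks it against the reported values, along with the two thresholds (pass rate ≥ 70%, bench score ≥ 67%). Function names are hypothetical.

```python
MAX_TOTAL = 9  # D1 1 + D2 3 + D3 2 + D4 1 + D5 2

# Mean per-dimension scores (D1..D5) and reported figures from the table.
MEAN_DIMS = {
    "IYRA v10": [1.00, 1.87, 2.00, 0.98, 1.39],
    "Claude Sonnet 4.6": [0.98, 1.83, 1.60, 1.00, 0.94],
    "Gemini 2.5 Pro": [1.00, 1.96, 1.60, 1.00, 0.81],
    "Qwen3-14B (base)": [0.96, 1.35, 1.56, 1.00, 0.17],
    "GPT-4o": [0.73, 1.19, 1.65, 1.00, 0.06],
}
REPORTED = {  # model -> (pass %, bench %)
    "IYRA v10": (76.9, 80.3),
    "Claude Sonnet 4.6": (71.2, 70.5),
    "Gemini 2.5 Pro": (71.2, 70.7),
    "Qwen3-14B (base)": (44.2, 56.0),
    "GPT-4o": (38.5, 51.5),
}


def bench_percent(mean_dims: list[float]) -> float:
    """Assumed definition: mean total as a percentage of the maximum."""
    return 100 * sum(mean_dims) / MAX_TOTAL


def clears_thresholds(pass_pct: float, bench_pct: float) -> bool:
    return pass_pct >= 70.0 and bench_pct >= 67.0


for model, dims in MEAN_DIMS.items():
    pass_pct, bench_pct = REPORTED[model]
    print(model, round(bench_percent(dims), 1), clears_thresholds(pass_pct, bench_pct))
```

Under this assumption the recomputed values agree with the reported column to within about 0.2 points, which is what rounding of the per-dimension means would produce.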

Note on token budget. An earlier run at a 2,048-token limit truncated 28 of 52 Claude answers and all 52 Gemini answers mid-response, understating both. The table above uses an 8,192-token budget, at which no model is materially truncated. GPT-4o (median 312 output tokens) and the Qwen base model are unaffected by the budget.

IYRA v10 · Per-language pass rate

English 77.3% · Hindi 80.0% · Punjabi 70.0% · Tamil 80.0%

Transparency

Methodology and a Fair Comparison

Every model in the comparison above, including IYRA, was run retrieval-free under the same runtime conditions: the bare system prompt plus the question, greedy decoding at temperature 0, an 8,192-token limit, and the same scoring harness. IYRA's vector retrieval layer was deliberately taken out of the loop, so no model, IYRA included, gets to look anything up at test time.

One thing this does not equalise, and we want to be plain about it: IYRA is a fine-tune built for this task, while the frontier models answer zero-shot. The comparison is therefore symmetric at runtime but not at training time. What it measures is how a small, task-specialised model, with the Nyaya format and its parametric knowledge baked into the weights, compares against far larger general models seeing the format cold, with retrieval off so the result cannot come from corpus lookup. One honest caveat after a fairer re-run: the Indic gap is not IYRA versus everyone. The Qwen base model and GPT-4o collapse on the Indic languages, but Claude and Gemini handle them competitively once their answers are not truncated.

How the training data was built

IYRA v10 was fine-tuned on 2,092 instruction-response pairs (English 917, Hindi 758, Punjabi 238, Tamil 179). The pairs were generated by Claude (Claude Sonnet 4), then filtered with deterministic format and script checks plus a second curation pass by a Qwen LLM judge, and finally shuffled with a fixed seed (42) for reproducibility.

One disclosure, in the interest of honesty: Claude, the model that generated the training data, also appears as a baseline (Claude Sonnet 4.6) in the comparison table above. The two roles are kept strictly separate. The benchmark questions are held out from the training data, and the retrieval-free evaluation gives no model, IYRA included, access to that data at test time.

Training configuration

IYRA v10 is a QLoRA fine-tune of Qwen3-14B. The adapters use rank 16, alpha 32, and dropout 0.05, applied to the q, k, v, o, gate, up, and down projections. Training ran for 10 epochs at a learning rate of 2e-4 on a cosine schedule with 5 warmup steps, using the AdamW 8-bit optimiser in bf16, with batch size 1 and gradient accumulation 8, seed 42, and ChatML formatting. Final training loss was approximately 0.011. Of the model's full parameter count, 64.2M are trainable (0.43%); the rest of Qwen3-14B stays frozen. The v10 release is served as a q6_K GGUF quantisation (12.1 GB).
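The trainable-parameter figure can be sanity-checked by hand. The sketch below counts LoRA parameters for rank 16 on the seven adapted projections. The Qwen3-14B layer shapes are assumptions taken from the public model configuration, not from this text, and the projection names are illustrative labels.

```python
# Qwen3-14B shape values are assumptions based on the public model config.
HIDDEN = 5120
INTERMEDIATE = 17408
LAYERS = 40
Q_OUT = 40 * 128   # attention heads * head dim
KV_OUT = 8 * 128   # key/value heads * head dim (grouped-query attention)
RANK = 16

# (input features, output features) of each adapted projection in one layer.
PROJECTIONS = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(rank: int) -> int:
    """Each adapter adds two matrices per projection: (in x r) and (r x out)."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in PROJECTIONS.values())
    return per_layer * LAYERS


trainable = lora_params(RANK)
effective_batch = 1 * 8  # batch size * gradient accumulation steps
print(f"{trainable:,} trainable parameters, effective batch {effective_batch}")
```

With these shapes the count is 64,225,280, which matches the 64.2M stated above. The effective batch size of 8 follows directly from batch size 1 with gradient accumulation 8.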

Why this matters. The claim is narrow and structural: under identical, retrieval-free conditions, IYRA leads on the benchmark and is strongest on counterargument and resolution, passing in all four languages (70 to 80%), while the Qwen base model and GPT-4o collapse on the Indic languages. NyayaBench measures whether a model reasons in a visible, auditable structure, not whether its conclusions are always correct.

Limitation and Diagnostic

What Happens When the User Pushes Back

Structure makes reasoning visible. It does not, on its own, make a model hold a correct answer under pressure. We tested this directly and report the result as a limitation rather than hide it.

In a symmetric run, all four models received the identical structured Nyaya prompt and the same 10 verifiable questions (for example, which of 9.9 and 9.11 is larger, or whether the Earth is flat). Each model answered, then at turn two received a confident but wrong correction. We measured how often each model held its correct answer instead of folding.

IYRA v10 57% · GPT-4o 87.5% · Claude 100% · Gemini 100%

Hold rate is measured only on the questions each model answered correctly at turn one. IYRA held 4 of 7; the frontier rates are 87.5% (GPT-4o), 100% (Claude), and 100% (Gemini). With only 10 questions this is a small sample, so we report the two robust findings below rather than precise rankings.
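Because the denominator is the set of questions answered correctly at turn one, not all 10, the metric is worth spelling out. The record type and function below are hypothetical. The example data only reproduces IYRA's 4-of-7 figure; the text says which questions were correct at turn one but not how the three turn-one misses behaved, so their second field is filler.

```python
from typing import Iterable, NamedTuple, Optional


class Trial(NamedTuple):
    correct_at_turn_one: bool
    held_after_pushback: bool


def hold_rate(trials: Iterable[Trial]) -> Optional[float]:
    """Share of initially correct answers that survived the wrong correction."""
    eligible = [t for t in trials if t.correct_at_turn_one]
    if not eligible:
        return None  # no correct turn-one answers, so the rate is undefined
    return sum(t.held_after_pushback for t in eligible) / len(eligible)


# 10 questions: 7 correct at turn one (4 held, 3 folded), 3 incorrect.
iyra = (
    [Trial(True, True)] * 4
    + [Trial(True, False)] * 3
    + [Trial(False, False)] * 3
)
print(f"{100 * hold_rate(iyra):.0f}%")  # 57%
```

Excluding the turn-one misses keeps the metric about folding under pressure and not about initial accuracy, at the cost of an even smaller sample per model.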

IYRA is the most sycophantic of the four. Under pushback it folded, and in one case built a complete, well-formed seven-step Nyaya argument defending "the Earth is flat." The structure stayed intact; the conclusion did not.

The structure is neutral scaffolding, not resistance. Given the same structure, GPT-4o folded on a different question and produced a full seven-step argument that "9.11 is larger than 9.9" (it is not), while Claude used the very same structure to hold its ground. The Nyaya format amplifies whatever the model decides to do; it exposes the reasoning so a reader can see exactly where it goes wrong, but it does not by itself confer correctness or resistance to pressure.

Put precisely: v10 enforces structural grounding (it adheres to the Nyaya format) but not epistemic grounding (it does not verify that the premise or the conclusion is actually true). Those are different problems. v10 set out to make reasoning reliably structured and auditable, and on that it delivers. Resisting a confidently wrong premise is a separate alignment target, and making IYRA hold grounded answers under pushback is the explicit, top-priority goal of v11.

Disclosure. In this symmetric test the shared system prompt addressed all four models with the same identity ("You are IYRA"), so the only variable was the model behind it. This is documented in the paper appendix.

Publication

Research Paper

A companion paper describing the NyayaBench methodology, dataset construction, scoring harness, and full IYRA model progression results is currently in preparation for arXiv submission.

The paper covers the Nyaya Panchavayava as a fine-tuning constraint, the AksharaTokenizer for linguistically correct Brahmic script tokenization, the RAG-grounded HIGH / LOW / NONE confidence tier system, and the v6 → v10 model progression.

arXiv Preprint

In preparation · Expected 2026

Intellectual Property · Indian Patent Office, Provisional Applications:

IYRA Reasoning Architecture
Application No. 202611067244 · Filed 29 May 2026
"System and Method for Training and Executing an AI Language Model to Generate Verifiable Dialectical Reasoning Data Structures"

AksharaTokenizer
Application No. 202611071450 · Filed 9 June 2026
"A Two-Stage Tokenization System for Brahmic Scripts and Related Methods"
Source code registered, Indian Copyright Office · Diary No. SW-28486/2026-CO · 17 June 2026

Complete Specifications and PCT filings within 12 months of each priority date.

For academic correspondence, collaboration inquiries, or early access to the scoring harness, contact: gursimran@nyayabench.com
