IYRA
Benchmark · v1.0

Epistemic Reasoning
for Multilingual LLMs

NyayaBench evaluates structured reasoning in large language models across four languages using the classical Indian Nyaya Panchavayava, a 7-step epistemic framework with no equivalent in existing multilingual benchmarks.

Built to measure whether reasoning survives challenge — not whether output follows a template.

2 Indian Patents Filed · 52 Questions · 4 Languages · 5 Dimensions · arXiv: Coming Soon

Three-Component System

01

IYRA — Reasoning Engine

Qwen3-14B fine-tuned with the Nyāya Panchavayava 7-step structure. Confidence-graded (HIGH / LOW / NONE). RAG over 8 classical Indian text corpora. Refuses ungrounded output.

02

AksharaTokenizer

Two-stage tokenizer for Brahmic scripts: a Unicode finite-state machine that segments at akshara boundaries, followed by a SentencePiece Unigram model trained on 1 million akshara-segmented sentences. 7,294 base akshara units · 64,000 final vocabulary. Covers 99.995% of Brahmic text — 14× more vocabulary-efficient than BPE for these scripts. (Not yet integrated into IYRA training or inference.)
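The first stage described above (segmenting at akshara boundaries) can be illustrated with a minimal sketch. This is not the AksharaTokenizer implementation: it is a simplified regular expression (equivalent to a small finite-state machine) that covers only the core Devanagari block, and the names `AKSHARA` and `segment_aksharas` are hypothetical. A full segmenter would handle every Brahmic script plus joiners and rarer signs.

```python
import re

# Simplified Devanagari character classes (assumption: illustration only;
# the real FSM is stated to cover Brahmic scripts in general).
CONSONANT = r"[\u0915-\u0939\u0958-\u095F]\u093C?"  # consonant + optional nukta
VIRAMA = r"\u094D"                                  # halant, joins consonants
VOWEL_SIGN = r"[\u093E-\u094C]"                     # dependent vowel signs (matras)
MODIFIER = r"[\u0900-\u0903]"                       # candrabindu, anusvara, visarga
INDEPENDENT_VOWEL = r"[\u0904-\u0914]"

AKSHARA = re.compile(
    # conjunct cluster: (C + virama)* C, then a matra or a final virama, then modifiers
    rf"(?:{CONSONANT}{VIRAMA})*{CONSONANT}(?:{VIRAMA}|{VOWEL_SIGN})?{MODIFIER}*"
    # independent vowel with optional modifiers
    rf"|{INDEPENDENT_VOWEL}{MODIFIER}*"
    # anything else (spaces, digits, Latin, punctuation) is its own unit
    rf"|.",
    re.DOTALL,
)


def segment_aksharas(text: str) -> list[str]:
    """Split text into akshara-like units; concatenating them restores the input."""
    return AKSHARA.findall(text)


# "namaste" in Devanagari: na, ma, sa + virama + ta + e-sign
word = "\u0928\u092E\u0938\u094D\u0924\u0947"
units = segment_aksharas(word)
print(len(units), units)
```

On this word the sketch yields three units, with the conjunct and its vowel sign kept together as one akshara. A byte- or codepoint-level tokenizer would see six separate codepoints instead.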

03

NyayaBench

Open evaluation benchmark — 52 questions across English, Hindi, Punjabi, Tamil. Five reasoning dimensions: D1 Format · D2 Pūrvapakṣa · D3 Confidence · D4 Language · D5 Siddhānta. Total 0–9, pass threshold ≥ 6.

52 Evaluation Questions
4 Languages
9 Points Maximum
67% Pass Threshold
English · 22 questions
Hindi हिन्दी · 10 questions
Punjabi ਪੰਜਾਬੀ · 10 questions
Tamil தமிழ் · 10 questions

Nyaya Panchavayava

The Nyaya Panchavayava is a classical Indian system of inference and argumentation, originating in the Nyaya Sutras of Aksapada Gautama (~2nd century BCE). As used here, it structures valid reasoning into seven obligatory steps: the five members of the classical syllogism (Pratijña through Nigamana), followed by Purvapaksha and Siddhanta. This makes it uniquely suited as a fine-tuning and evaluation constraint for language models that are required to reason rather than merely retrieve.

NyayaBench operationalises all seven steps as scoreable, language-agnostic dimensions. A model that cannot produce a structured Purvapaksha (steelman) and a genuine Siddhanta (conclusion with back-reference) fails the benchmark regardless of factual accuracy.

01

Pratijña - Proposition

The claim or thesis to be established. Must open with a [CONFIDENCE: HIGH/LOW/NONE] signal calibrated to retrieval distance.

02

Hetu - Reason

The logical ground or justification for the proposition. Must be distinct from the claim itself.

03

Udaharana - Example

A concrete illustrative example or analogy that supports the stated reason.

04

Upanaya - Application

Application of the general example back to the specific case under examination.

05

Nigamana - Conclusion

Restatement of the proposition as now established by the chain of reasoning above.

06

Purvapaksha - Prior View

A genuine steelman of the strongest opposing position, citing a named philosopher or tradition where applicable. Scored on three sub-dimensions.

07

Siddhanta - Established Doctrine

The final settled conclusion that explicitly resolves the Purvapaksha. Must open with a back-reference to the prior view and offer genuine resolution, not repetition.
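The seven steps listed above lend themselves to a simple presence check of the kind D1 Format measures. The following is a minimal sketch, not the published harness: it assumes each step starts a line with its Latin-transliterated label, that labels must appear in order, and that diacritics are optional. The names `STEP_PATTERNS`, `strip_diacritics`, and `d1_format` are hypothetical.

```python
import re
import unicodedata

# One pattern per step, in the order given above. "purvapaksh?a" accepts both
# "Purvapaksha" and the diacritic-stripped form of "Pūrvapakṣa".
STEP_PATTERNS = [
    re.compile(rf"^\s*{label}\b", re.MULTILINE)
    for label in (
        "pratijna", "hetu", "udaharana", "upanaya",
        "nigamana", "purvapaksh?a", "siddhanta",
    )
]


def strip_diacritics(text: str) -> str:
    """Drop combining marks so 'Pratijña' and 'Pratijna' compare equal.

    Intended for Latin-script labels only; it would also strip marks such as
    the virama from Indic text.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def d1_format(response: str) -> int:
    """Return 1 if all seven labels appear in order at line starts, else 0."""
    plain = strip_diacritics(response).lower()
    pos = 0
    for pattern in STEP_PATTERNS:
        match = pattern.search(plain, pos)
        if match is None:
            return 0
        pos = match.end()
    return 1


sample = "\n".join([
    "Pratijña: [CONFIDENCE: HIGH] ...",
    "Hetu: ...",
    "Udaharana: ...",
    "Upanaya: ...",
    "Nigamana: ...",
    "Pūrvapakṣa: ...",
    "Siddhānta: ...",
])
print(d1_format(sample))
```

A check restricted to Latin labels would miss step labels written in Indic scripts, so a multilingual harness would need per-script label lists as well.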

Five Scoring Dimensions

Each of the 52 questions is scored across five independently measurable dimensions. The maximum is 9 points; the pass threshold is 6 points (67%). Scoring is automated via regex pattern matching, Unicode-aware philosopher name detection, and character-set-based language identification.

| Dimension | Max | What it measures |
| --- | --- | --- |
| D1 · Format | 1 pt | All 7 Nyaya steps present and correctly labelled in the response. |
| D2 · Purvapaksha | 3 pts | Named philosopher cited (+1), steelman argument present (+1), source text identified (+1). |
| D3 · Confidence | 2 pts | Confidence signal [HIGH / LOW / NONE] present (+1) and appropriately calibrated to RAG retrieval distance (+1). |
| D4 · Language | 1 pt | Response language matches prompt language throughout, no code-switching. |
| D5 · Siddhanta | 2 pts | Back-reference opener present (+1), genuine resolution rather than repetition of the initial claim (+1). |
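How the five dimensions combine into a per-question total and pass/fail decision can be sketched as follows. The maxima (1, 3, 2, 1, 2) and the threshold of 6 come from the text above; the class and field names are hypothetical and are not taken from the scoring harness.

```python
from dataclasses import dataclass

# Per-dimension maxima as listed above; they sum to the 9-point maximum.
MAX_POINTS = {"d1": 1, "d2": 3, "d3": 2, "d4": 1, "d5": 2}
PASS_THRESHOLD = 6  # 6 of 9 points, roughly 67%


@dataclass(frozen=True)
class QuestionScore:
    d1: int  # Format
    d2: int  # Purvapaksha
    d3: int  # Confidence
    d4: int  # Language
    d5: int  # Siddhanta

    def __post_init__(self) -> None:
        # Reject scores outside each dimension's allowed range.
        for name, maximum in MAX_POINTS.items():
            value = getattr(self, name)
            if not 0 <= value <= maximum:
                raise ValueError(f"{name}={value} is outside 0..{maximum}")

    @property
    def total(self) -> int:
        return self.d1 + self.d2 + self.d3 + self.d4 + self.d5

    @property
    def passed(self) -> bool:
        return self.total >= PASS_THRESHOLD


score = QuestionScore(d1=1, d2=2, d3=1, d4=1, d5=1)
print(score.total, score.passed)
```

Because D2 alone carries a third of the points, a response with perfect format, confidence, and language but no Purvapaksha and no Siddhanta resolution reaches at most 4 points and fails.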

Note on D3 Evaluator (two fixes documented): The initial evaluator accepted only HIGH | MEDIUM | LOW confidence signals, omitting NONE. All v9 responses that correctly opened with [CONFIDENCE: NONE] scored D3 = 0 in the first run. The corrected evaluator (commit e643db1) adds NONE to the pattern. For v10, a second fix extended the evaluator to detect Siddhanta and Purvapaksha step labels written in Gurmukhi, Devanagari, and Tamil scripts — not just Latin transliteration — correcting Indic-language scoring that had previously undercounted D3 and D5 in Hindi, Punjabi, and Tamil responses. Both raw and corrected scores are reported in the companion paper for full transparency.
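The first fix described in the note can be illustrated with a small before/after comparison. The exact patterns in the harness are not published here, so both regular expressions below are assumptions reconstructed from the description (the original accepting HIGH | MEDIUM | LOW, the corrected one adding NONE); `confidence_signal` is a hypothetical helper.

```python
import re
from typing import Optional

# Assumed shape of the original pattern: NONE is missing.
ORIGINAL_PATTERN = re.compile(r"\[CONFIDENCE:\s*(HIGH|MEDIUM|LOW)\]")
# Assumed shape of the corrected pattern: NONE added.
CORRECTED_PATTERN = re.compile(r"\[CONFIDENCE:\s*(HIGH|MEDIUM|LOW|NONE)\]")


def confidence_signal(response: str, pattern: re.Pattern) -> Optional[str]:
    """Return the confidence tier found in the response, or None if absent."""
    match = pattern.search(response)
    return match.group(1) if match else None


response = "Pratijña: [CONFIDENCE: NONE] No grounded source supports this claim."
print(confidence_signal(response, ORIGINAL_PATTERN))   # signal is missed
print(confidence_signal(response, CORRECTED_PATTERN))  # signal is detected
```

Under the assumed original pattern a response that correctly declines with NONE earns no D3 presence point, which matches the D3 = 0 behaviour the note describes for the first run.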

IYRA Baseline Results

The results in this section were scored with NyayaBench evaluator v1 during the historical training runs. Current scores under the corrected v2 evaluator are shown in the comparison below.

All results are from IYRA (Indic Yukti Reasoning Architecture): a QLoRA fine-tune of Qwen3-14B on the Nyaya Panchavayava framework. Training data spans English, Hindi, Punjabi, and Tamil. Results show the per-version trajectory from first training run through v10 (current release, 67.3%).

v10 language breakdown (pass rate per language): English 82% · Punjabi 70% · Hindi 30% · Tamil 20%.

NyayaBench is designed to be model-agnostic. Any LLM can be evaluated against these 52 questions. The scoring harness will be published alongside the arXiv paper.

Model D1 D2 D3 D4 D5 Score Pass Rate

How IYRA Compares

Full baseline · NyayaBench evaluator v2, corrected for Indic-script philosopher names and Pūrvapakṣa substance. All 6 models were scored with the identical prompt and harness.

Fifty-two questions across four languages, three local models and three frontier APIs, all evaluated with the same prompt and scoring harness. IYRA v10 is the only model to clear both NyayaBench thresholds: pass rate ≥ 70% and bench score ≥ 67%.

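| Model | Pass % | Bench % | D1 | D2 | D3 | D4 | D5 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| IYRA v10 ✦ | 84.6 (PASSES) | 76.1 | 0.98 | 2.13 | 2.00 | 0.96 | 0.77 |
| Qwen3.6-27B (base) | 67.3 | 66.0 | 0.94 | 2.25 | 1.52 | 1.00 | 0.23 |
| Qwen3-14B (base) | 59.6 | 60.0 | 0.98 | 1.92 | 1.60 | 0.86 | 0.04 |
| GPT-4o | 40.4 | 52.4 | 0.73 | 1.21 | 1.61 | 1.00 | 0.15 |
| Claude Sonnet 4.6 | 38.5 | 53.4 | 0.77 | 1.83 | 0.71 | 1.00 | 0.50 |
| Gemini 2.5 Pro | 3.8 | 20.7 | 0.00 | 0.77 | 0.02 | 1.00 | 0.08 |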
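The Bench % column appears to be the sum of the five mean dimension scores divided by the 9-point maximum. That relationship is an inference from the numbers in the table, not a documented formula. The sketch below recomputes it and applies the two thresholds stated above (pass rate ≥ 70%, bench score ≥ 67%); small differences from the listed values are expected because the dimension means are rounded to two decimals. All names are hypothetical.

```python
MAX_TOTAL = 9          # maximum points per question
PASS_RATE_MIN = 70.0   # threshold stated in the text
BENCH_MIN = 67.0       # threshold stated in the text

# model -> (pass %, listed bench %, [D1, D2, D3, D4, D5] means), from the table
RESULTS = {
    "IYRA v10":           (84.6, 76.1, [0.98, 2.13, 2.00, 0.96, 0.77]),
    "Qwen3.6-27B (base)": (67.3, 66.0, [0.94, 2.25, 1.52, 1.00, 0.23]),
    "Qwen3-14B (base)":   (59.6, 60.0, [0.98, 1.92, 1.60, 0.86, 0.04]),
    "GPT-4o":             (40.4, 52.4, [0.73, 1.21, 1.61, 1.00, 0.15]),
    "Claude Sonnet 4.6":  (38.5, 53.4, [0.77, 1.83, 0.71, 1.00, 0.50]),
    "Gemini 2.5 Pro":     (3.8,  20.7, [0.00, 0.77, 0.02, 1.00, 0.08]),
}


def bench_percent(dimension_means: list[float]) -> float:
    """Mean total score as a percentage of the 9-point maximum (assumed formula)."""
    return 100.0 * sum(dimension_means) / MAX_TOTAL


def clears_thresholds(pass_pct: float, bench_pct: float) -> bool:
    """True only if both model-level thresholds are met."""
    return pass_pct >= PASS_RATE_MIN and bench_pct >= BENCH_MIN


for model, (pass_pct, listed, means) in RESULTS.items():
    recomputed = bench_percent(means)
    print(f"{model:20s} listed={listed:5.1f} recomputed={recomputed:5.1f} "
          f"clears={clears_thresholds(pass_pct, listed)}")
```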

IYRA v10 is the only model to pass NyayaBench v1.0, clearing both the 70% pass-rate threshold and the 67% bench-score threshold across all 52 questions and four languages. The frontier models (GPT-4o, Claude Sonnet 4.6, Gemini 2.5 Pro) all score 1.00 on D4 Language, so language consistency is not where they lose points. Their deficits relative to IYRA v10 are in D1 Format, D2 Purvapaksha, D3 Confidence, and D5 Siddhanta, with Gemini 2.5 Pro scoring 0.00 on D1 and 0.02 on D3. Scoring criteria are documented in the scoring methodology (full note to accompany the arXiv preprint).

IYRA v10 pass rate by language (evaluator v2): English 90.9% · Punjabi 80.0% · Hindi 70.0% · Tamil 90.0%
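The four per-language figures above appear to be IYRA v10's pass rates under evaluator v2. That reading is an assumption, and it can be checked arithmetically: weighting each rate by its language's question count (22, 10, 10, 10) should recover the overall pass rate. The names below are hypothetical.

```python
# Question counts per language, from the benchmark summary.
QUESTIONS = {"English": 22, "Hindi": 10, "Punjabi": 10, "Tamil": 10}
# Per-language pass rates in percent, from the figures above.
PASS_RATE = {"English": 90.9, "Punjabi": 80.0, "Hindi": 70.0, "Tamil": 90.0}


def passed_count(language: str) -> int:
    """Number of questions passed, recovered from the rounded percentage."""
    return round(QUESTIONS[language] * PASS_RATE[language] / 100)


def overall_pass_rate() -> float:
    """Pass rate over all questions, weighting each language by its size."""
    total_passed = sum(passed_count(lang) for lang in QUESTIONS)
    return 100.0 * total_passed / sum(QUESTIONS.values())


for lang in QUESTIONS:
    print(f"{lang:8s} {passed_count(lang):2d} / {QUESTIONS[lang]}")
print(f"overall  {overall_pass_rate():.1f}%")
```

The weighted result is 44 of 52 questions, which rounds to the 84.6% pass rate listed for IYRA v10 in the comparison table. A plain average of the four percentages would give a different figure because English has more than twice as many questions as each other language.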

Research Paper

A companion paper describing the NyayaBench methodology, dataset construction, scoring harness, and full IYRA model progression results is currently in preparation for arXiv submission.

The paper covers the Nyaya Panchavayava as a fine-tuning constraint, the AksharaTokenizer for linguistically correct Brahmic script tokenization, the RAG-grounded HIGH / LOW / NONE confidence tier system, and the v6 → v10 model progression.

arXiv Preprint

In preparation · Expected 2026

Intellectual Property — Indian Patent Office, Provisional Applications:

IYRA Reasoning Architecture
Application No. 202611067244 · Filed 29 May 2026
"System and Method for Training and Executing an AI Language Model to Generate Verifiable Dialectical Reasoning Data Structures"

AksharaTokenizer
Application No. 202611071450 · Filed 9 June 2026
"A Two-Stage Tokenization System for Brahmic Scripts and Related Methods"

Complete Specifications and PCT filings are to follow within 12 months of each priority date.

For academic correspondence, collaboration inquiries, or early access to the scoring harness, contact: gursimran@nyayabench.com