Grammar-to-RL: Training Language Models on Historical Texts via Compositional Rewards


Novel Methodology: Zero-to-One Language Model

This project introduces a novel approach to low-resource language learning by transforming a single historical textbook into a complete, self-contained training ecosystem.

Why This Is Now Working

The fundamental insight driving this approach: grammar rules can be reward functions, and rewards should decompose into linguistic primitives. When you do this, syntax emerges naturally without an external verifier as judge. This is particularly powerful for RL on non-coding tasks where compositional structure matters.

Grammar AS Reward Function

Everyone else treats grammar as either:

  • Preprocessing constraints (rule-based systems)
  • Post-hoc evaluation (check grammar after generation)

This project turns grammar rules into dense reward signals through compositional rewards. Each rule contributes to the policy gradient via the group-relative advantage, rather than serving only as a binary post-hoc check.

Dakota Grammar - Historical Text Detail

The key advantage: interpretability. You can actually see where in the latent space each linguistic level is being encoded. This makes debugging possible: "Oh, the model is failing on ć preservation because the character embedding gradient is being overwhelmed by the semantic gradient."

Mathematical Formulation

The efficiency of this method comes from treating the historical text not just as data, but as a computable specification.

1. The Transformation Function (Book to Environment)

Let $\mathcal{T}$ be the historical textbook source. We define an extraction function $\mathcal{E}$ (the VLM) that maps the raw text into a structured grammar space $\mathcal{G}$:

$$ \mathcal{G} = \{g_1, g_2, \dots, g_N\} = \mathcal{E}(\mathcal{T}) $$

Where each grammar rule $g_k$ acts as a constraint function on the generated token sequence $y$:

$$ g_k: \Sigma^* \rightarrow \{0, 1\} $$
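
As a concrete illustration, each rule $g_k$ can be implemented as a predicate over the generated string. The rule, regex, and example words below are hypothetical, chosen only to show the shape of such a constraint, and are not taken from the repository's rule set:

import re

# A single extracted rule realized as a binary constraint g_k over generated text.
# The rule content ("mark the first-person plural with the prefix uŋ-") and the
# example strings are illustrative only.
def g_k(generation: str) -> int:
    """Return 1 if the generation satisfies the rule, 0 otherwise."""
    return int(bool(re.search(r"\buŋ\w+", generation)))

print(g_k("uŋyaŋpi"))  # 1: prefix present
print(g_k("yaŋpi"))    # 0: prefix missing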

2. The Compositional Reward Function

Standard RLHF or GRPO typically uses a single scalar reward model $R(y)$. The grammar-as-signal approach decomposes $R$ into a weighted sum of linguistic primitives.

Let $y_i$ be the $i$-th generation in a group of size $G$. The reward $r_i$ for generation $y_i$ given prompt $x$ is:

$$ r(y_i, x) = \lambda_{diff}(x) \cdot \left[ \alpha \cdot R_{char}(y_i, x) + \beta \cdot R_{morph}(y_i, \mathcal{G}) + \gamma \cdot R_{sem}(y_i, y^*) \right] $$

Where:

  • $R_{char}$ (Orthography): The Intersection-over-Union (or Recall) of required special unicode characters $\mathcal{C}_{spec}$ (e.g., ŋ, š, ć).

    $R_{char} = \frac{|\mathrm{chars}(y_i) \cap \mathrm{chars}(x) \cap \mathcal{C}_{spec}|}{|\mathrm{chars}(x) \cap \mathcal{C}_{spec}|}$

  • $R_{morph}$ (Syntax): A binary or scalar check against specific grammar rules $g_k \in \mathcal{G}$ (e.g., affix presence regex).

    $R_{morph} = \frac{1}{|A|}\sum_{a \in A} \mathbb{I}(a \subset y_i) \quad \text{where } A \text{ are required affixes}$

  • $R_{sem}$ (Semantics): Semantic similarity to ground truth (or Dictionary lookup).

  • Weights: $(\alpha, \beta, \gamma) = (0.4, 0.4, 0.2)$ per the config.

  • $\lambda_{diff}$: The curriculum difficulty multiplier ($1.0 \dots 2.0$).
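
A minimal sketch of the composite reward above, using the weights and difficulty multiplier from the config; the function names and the externally supplied semantic score are placeholders rather than the repository's DakotaGrammarRubric API:

SPECIAL_CHARS = set("ćšŋḣṡáéíóú")  # C_spec: Dakota special characters (subset shown)

def r_char(y: str, x: str) -> float:
    """Recall of the prompt's required special characters in the generation."""
    required = set(x) & SPECIAL_CHARS
    return len(set(y) & required) / len(required) if required else 1.0

def r_morph(y: str, affixes: list) -> float:
    """Fraction of required affixes present in the generation."""
    return sum(a in y for a in affixes) / len(affixes) if affixes else 1.0

def reward(y: str, x: str, affixes: list, r_sem: float, difficulty: float = 1.0) -> float:
    """r(y, x) with (alpha, beta, gamma) = (0.4, 0.4, 0.2) and lambda_diff in [1.0, 2.0]."""
    return difficulty * (0.4 * r_char(y, x) + 0.4 * r_morph(y, affixes) + 0.2 * r_sem)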

3. The Modified GRPO Objective

In standard GRPO, we compute the advantage $A_i$ by normalizing rewards within the group. By injecting the new compositional reward, the gradient ascent objective becomes:

$$ \mathcal{L}_{G-GRPO}(\theta) = \mathbb{E}_{x \sim \mathcal{D}} \left[ \frac{1}{G} \sum_{i=1}^G \underbrace{\left( \frac{r(y_i, x) - \bar{r}}{\sigma_r} \right)}_{\text{Grammar-Verified Advantage}} \cdot \underbrace{\min \left( \frac{\pi_\theta(y_i|x)}{\pi_{old}(y_i|x)}, 1+\epsilon \right)}_{\text{Clipped Policy Ratio}} - \beta_{KL} \mathbb{D}_{KL}(\pi_\theta || \pi_{ref}) \right] $$

Where $\bar{r}$ is the mean reward of the group of $G$ rollouts, and $\sigma_r$ is the standard deviation.
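
The grammar-verified advantage is simply a per-group z-score of these composite rewards. A NumPy sketch of that normalization (an illustration, not the prime-rl implementation):

import numpy as np

def group_advantages(rewards, eps=1e-6):
    """A_i = (r_i - mean(r)) / std(r), computed within one group of G rollouts."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: one prompt, G = 4 rollouts scored by the compositional reward
print(group_advantages([0.12, 0.35, 0.28, 0.05]))
# Rollouts above the group mean receive positive advantage; the clipped policy
# ratio and the KL penalty in the objective are applied on top of this signal.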

4. The "Wow! Signal": Gradient Signal Density

The mathematical reason for the 160-step convergence and 97.9% morphology accuracy is the density of the gradient signal.

In standard Language Modeling (Next Token Prediction), the loss is:

$$ \mathcal{L}_{LM} = -\log P(y_{target} | x) $$

This gives feedback only on exact token matches.

In the Grammar-GRPO, the feedback signal $\nabla J$ allows the model to perform gradient ascent on the structure of the language directly:

$$ \nabla J(\theta) \propto \sum_{components} w_c \nabla R_c(y) $$

Because $R_{char}$ and $R_{morph}$ are deterministic and verifiable (unlike subjective human preference), the variance of the reward $\sigma^2_r$ is significantly lower than standard RLHF.

Mathematically, this reduces the noise in the reward signal, allowing the policy to traverse the optimization landscape directly toward the "Grammar Valley" (the manifold where valid Dakota syntax exists).

Training Results: RL Performance Visualizations

This section presents comprehensive visualizations from our Reinforcement Learning training runs, demonstrating the effectiveness of the grammar-to-RL methodology on the Dakota language. Three runs are documented below: a 1,000-step run (final), a 400-step run (initial), and a 199-step 30B LoRA run on Tinker.

Run 1: 1000-Step Training (Final)

Training Run Details

  • Project: dakota-rl-grammar
  • Entity: christian-cooper-us
  • Trainer Run: 7nikv4vp - dakota-0.6b-rl-trainer
  • Orchestrator Run: 29hn8w98 - dakota-0.6b-rl-orchestrator
  • Model: Qwen3-0.6B-Dakota-Grammar-RL
  • Training Steps: 1,000 steps (998 completed)
  • Total Samples: 256,000 samples processed
  • Training Duration: 1.54 hours (5,537 seconds)

Key Achievements

  • 190% improvement in overall reward (0.120 → 0.349)
  • 97.9% morphological accuracy - exceptional performance in affix application
  • 53.5% character preservation - significant improvement for complex orthography
  • 90% of improvement achieved in first 160 steps (16% of training) - demonstrating rapid learning
  • Stable training with controlled KL divergence throughout

Comprehensive Dashboard

The comprehensive dashboard provides an at-a-glance view of all training metrics, combining reward progression, component performance, loss dynamics, entropy, KL divergence, and throughput metrics into a single visualization.

Comprehensive Dashboard

What this shows: This multi-panel dashboard synthesizes all key training signals. The top panel shows reward progression with milestone markers indicating when 25%, 50%, 75%, and 90% of total improvement was achieved. The component comparison bar chart (middle-left) reveals the differential performance: morphological accuracy reached 97.9% while character preservation achieved 53.5%, reflecting the challenge of preserving Dakota's complex orthography (ć, š, ŋ, ḣ, ṡ, á, é, í, ó, ú) with a 0.6B parameter model. The loss and entropy panels demonstrate stable optimization, while the KL divergence metrics show controlled policy adaptation without catastrophic forgetting.

View full run: Trainer Run | Orchestrator Run

Reward Progression

The reward progression visualization demonstrates the learning trajectory over 1,000 training steps, showing both overall composite reward and individual component breakdown.

Reward Progression

What this shows: The top panel tracks overall reward progression from 0.120 (step 0) to 0.349 (step 999), representing a 190.1% improvement. Milestone markers highlight key learning efficiency points: 25% improvement at step 49 (4.9% of training), 50% at step 71 (7.1%), 75% at step 109 (10.9%), and 90% at step 160 (16%). The rapid initial learning validates the methodology's efficiency - grammar-based tasks provide dense learning signals compared to general language modeling. The bottom panel shows the component breakdown: Morphological Accuracy (green) achieved near-perfect performance (0.979), Character Preservation (orange) showed substantial improvement from 0.038 to 0.535 (14x increase), while the Overall Composite (blue) reflects the weighted combination including semantic components.

Interpretation: The divergence between component performances demonstrates that the model learned morphological patterns more effectively than orthographic preservation. This suggests potential areas for future improvement through specialized character-focused training or larger model capacity. The semantic component (20% weight) likely contributes to the composite score being lower than individual components, indicating multi-objective optimization challenges.

View full run: Orchestrator Run

Training Metrics

This visualization tracks the core training dynamics: policy loss, model entropy (confidence), KL divergence (policy adaptation), and inference probabilities.

Training Metrics

What this shows:

  • Policy Loss (top-left): Values ranged from approximately 1e-5 to 1e-3, typical of GRPO training with conservative learning rates. The log-scale visualization shows consistent small magnitudes indicating stable gradient-based optimization. The shaded region represents ±1 standard deviation, showing controlled variance throughout training.
  • Model Entropy (top-right): Decreased from 0.93 to 0.21, indicating the model became significantly more confident in its predictions. Low final entropy (0.21) suggests the model is highly confident, which aligns with the high morphological accuracy achieved.
  • KL Divergence (bottom-left): Three metrics track policy adaptation:
    • Masked KL: Increased from near-zero to 9.32, indicating substantial policy adaptation for Dakota-specific masked tokens
    • Overall KL: Moderate increase from 0.001 to 3.83, suggesting controlled policy adaptation
    • Unmasked KL: Remained extremely low (mean: 0.070, final: 0.042), confirming the model preserved general language capabilities while learning Dakota-specific patterns
  • Inference Probabilities (bottom-right): Increased from 0.63 to 0.86, showing the model became more certain in its predictions over time.

Interpretation: The increasing KL divergence trends indicate active learning and policy adaptation, while the relatively moderate values (especially for unmasked tokens) suggest training remained stable. The model successfully specialized for Dakota grammar while preserving general language understanding, validating the training approach.

View full run: Trainer Run

Performance Metrics

Performance metrics track computational efficiency: training throughput (tokens per second) and GPU utilization (Model FLOPS Utilization).

Performance Metrics

What this shows:

  • Training Throughput (left): Average throughput of 8,178 tokens/sec with consistent performance throughout training. The red dashed line indicates the average, showing stable training execution without significant throughput variations.
  • GPU Efficiency - MFU (right): Average Model FLOPS Utilization of 2.68%. While this may seem low, it is typical for small models (0.6B parameters), where memory bandwidth rather than compute is often the bottleneck. The consistent MFU suggests stable training without memory pressure or compute bottlenecks.

Interpretation: The consistent performance metrics validate stable training execution. The throughput remained stable throughout 1,000 steps, processing 256,000 total samples with an average of 256 samples per step. Peak memory usage was 11.5 GiB, well within reasonable bounds for the model size.

View full run: Trainer Run

Run 2: 400-Step Training (Initial Breakthrough)

Training Run Details

  • Project: dakota-rl-grammar
  • Entity: christian-cooper-us
  • Trainer Run: yut26kcm - dakota-0.6b-ledger-test-400-trainer
  • Orchestrator Run: 1y33h9zr - dakota-0.6b-ledger-test-400-orchestrator
  • Model: Qwen3-0.6B-Dakota-Grammar-RL-400
  • Training Steps: 400 steps (all completed)
  • Total Samples: 102,400 samples processed
  • Base Model: Qwen/Qwen3-0.6B (small instruct model optimized for RL)

Key Achievements

  • 150.3% improvement in overall reward (0.128 → 0.321, peak: 0.345)
  • Rapid learning: 90% of improvement achieved in first 85 steps (21.25% of training)
  • Sample efficiency: 0.000483 improvement per step - demonstrating dense learning signals
  • Stable training: Controlled KL divergence with unmasked KL remaining low (mean: 0.094, final: 0.092)
  • Policy confidence: Entropy decreased from 0.93 to 0.28, showing increased model certainty

Reward Progression (400 Steps)

Interpretation: The rapid learning trajectory suggests that compositional reward functions can support efficient learning on qualitative linguistic tasks. The milestone markers show consistent acceleration, with each 25% improvement requiring progressively fewer steps, indicating the model is learning to learn more effectively. These early signals hint that GRPO could extend beyond coding/math domains when rewards are decomposed into linguistic primitives, but this still needs more confirmation.

View full run: Orchestrator Run

Run 3: 30B Tinker (Thinking Machines Final)

  • Project / Run ID: dakota-rl-grammar / i55d4x26
  • Base Model: Qwen/Qwen3-30B-A3B-Instruct-2507 (LoRA rank 32)
  • Steps: 199 (0–198) on Tinker; checkpoint at tinker://da1ef918-d67a-5080-b500-dd1256db9ca7:train:0/sampler_weights/final
  • Composite reward: 0.105 → 0.442 peak (step 116) → 0.317 final
  • Character preservation: 0.265 → 0.699 peak (step 185) → 0.619 final
  • Affix accuracy: 0.957 → 1.000 (stayed perfect in later stages)
  • Exact match: 0.001 → 0.337 peak (step 116) → 0.100 final; length penalty rose to 1.0 by step 52

Comprehensive Dashboard (30B Tinker)

GRPO for Qualitative Tasks: Early Signals

This work suggests that GRPO (Group Relative Policy Optimization) can achieve strong learning on qualitative linguistic tasks when rewards are properly decomposed into interpretable components, though this conclusion remains contingent on further results. This is significant because:

Why This Matters

GRPO has been successfully applied to quantitative domains (code generation, mathematical reasoning) where correctness is verifiable and rewards are clear. However, qualitative tasks like language learning, translation, and grammar have traditionally been considered unsuitable for RL because:

  1. Subjective evaluation: "Is this translation good?" lacks clear criteria
  2. Multi-dimensional quality: A translation can be semantically correct but orthographically wrong
  3. Nuanced feedback: Binary correct/incorrect fails to capture partial correctness

Our Solution: Compositional Rewards

By decomposing rewards into linguistic primitives (character preservation, morphological accuracy, semantic correctness), we transform qualitative tasks into quantitatively optimizable objectives:

  • Character preservation (40%): Verifiable Unicode-level correctness
  • Morphological accuracy (40%): Pattern-matching against grammar rules
  • Semantic correctness (20%): Meaning preservation metrics

This decomposition enables GRPO to work effectively because:

  • Each component is independently verifiable (no human judgment needed)
  • Gradients flow through each component (model learns what to prioritize)
  • Multi-dimensional feedback (model knows exactly what it got wrong)

Key Results Demonstrating Significance

  1. 150.3% improvement in 400 steps - Comparable to GRPO performance on coding tasks
  2. 90% improvement in 21% of training - Demonstrates dense learning signals from compositional rewards
  3. Low unmasked KL (0.092) - Model specializes without catastrophic forgetting
  4. Stable training dynamics - No reward hacking or instability issues

Implications

This work hints that GRPO may not be limited to quantitative domains. When qualitative tasks are decomposed into verifiable components, they may become more learnable, opening possibilities for:

  • Low-resource language learning (this work)
  • Style transfer (decompose into syntax, semantics, register)
  • Dialogue systems (decompose into coherence, relevance, appropriateness)
  • Creative tasks (decompose into structure, originality, coherence)

Methodology Validation

These results provide early evidence for the core methodological idea: transforming grammar rules from a 130-year-old historical textbook into verifiable RL environments. The strong morphological accuracy (97.9%) suggests that:

  1. Rule Extraction Quality: The VLM-based extraction successfully captured testable grammar patterns from historical text, preserving morphological rules in a format suitable for RL training.

  2. Task Generation Effectiveness: The conversion of 1,036 grammar rules into 5,657 RL tasks created sufficient training signal for the model to learn morphological patterns. The high accuracy suggests the task generation process successfully encoded the grammar rules as verifiable constraints.

  3. Compositional Reward Structure: The decomposition into character, morphology, and semantic components enabled fine-grained learning. The model learned morphological patterns more effectively than orthographic preservation, suggesting potential areas for future improvement.

  4. Sample Efficiency: The rapid initial learning (90% improvement in 16% of training) demonstrates the methodology's efficiency for low-resource language scenarios. With only 256,000 samples processed over 1,000 steps, the model achieved substantial improvement.

Visualization Generation

These visualizations were generated using scripts/create_rl_visualizations.py, which loads data from Weights & Biases and creates publication-quality plots. To regenerate with updated data:

# For the 1000 step run (default):
python scripts/create_rl_visualizations.py \
    --trainer-id 7nikv4vp \
    --orchestrator-id 29hn8w98 \
    --project dakota-rl-grammar \
    --entity christian-cooper-us

All visualization code and data are available in the repository for reproducibility and further analysis.

Reward Ledger: Transparent Reward Decomposition

We log detailed reward component breakdowns to provide full transparency into how rewards are computed. The reward ledger exposes all internal components, weights, penalties, and multipliers used in reward calculation.

Reward Ledger Reconciliation

What this shows: The reward ledger visualization demonstrates how individual components combine to form the final reward. Each line represents a different component or intermediate computation:

  • Weighted components: Individual reward components (exact match, character overlap, pattern matching, affix accuracy) multiplied by their respective weights
  • Composite pre: Weighted sum of components before applying penalties and multipliers
  • Length penalty multiplier: Applied to penalize overly long responses (1.0 = no penalty, <1.0 = penalty)
  • Difficulty multiplier: Applied based on task difficulty (1.0 for easy, up to 2.0 for expert)
  • Composite predicted: Final computed reward from components
  • Reward scalar: Actual reward used in training (should match composite_predicted)

The visualization validates that the logged components correctly reconstruct the final reward, with composite_diff tracking any numerical discrepancies.

Ledger Fields: The ledger logs raw and normalized component scores, weights, penalties, difficulty multipliers, and intermediate composites. All data is logged to both Weights & Biases (under ledger/* namespace) and CSV (wandb_analysis/reward_ledger.csv) for offline analysis.

Head/tail of reward_ledger.csv
<!-- Auto-generated by scripts/analysis/make_ledger_snippet.py -->
<!-- Run: python scripts/analysis/make_ledger_snippet.py -->

Integration: The ledger is automatically computed by DakotaGrammarRubric.score() and can be accessed via environment.get_reward_ledger(). See dakota_rl_training/utils/LEDGER_INTEGRATION.md for integration instructions.
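
A sketch of an offline reconciliation check against the CSV ledger. The column names below (composite_pre, length_penalty, difficulty_multiplier, reward_scalar) follow the field descriptions above but are assumptions; the actual ledger schema may differ:

import pandas as pd

ledger = pd.read_csv("wandb_analysis/reward_ledger.csv")

# Recompute the final reward from the logged intermediates and compare it to the
# reward actually used in training. The difference should stay near zero.
recomputed = ledger["composite_pre"] * ledger["length_penalty"] * ledger["difficulty_multiplier"]
print((recomputed - ledger["reward_scalar"]).abs().describe())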


Dakota Prepositions - High Detail Scan

Exceptional level of detail preserved from the 1890 source material — every character, accent, and linguistic nuance captured with precision

All images, extractions, and datasets are freely available in this repository: data/processed_images

Dakota Dictionary Sample

Overview

/data now preserves every artifact of the Dakota Grammar preliminary RL build, from VLM extraction outputs and audit traces to the structured rule corpora and PrimeIntellect-ready task sets, making the full pipeline from 1890 scans to verifiable environments reproducible and inspectable. This snapshot captures the linguistic provenance, reasoning logs, and RL curriculum that power the repository's closed-loop grammar gym.

Ask DeepWiki How This Works

https://deepwiki.com/HarleyCoops/Dakota1890

The Process So Far

We take an 1890 grammar textbook and split it into two complementary components that feed into each other:

Dakota Grammar Gym Methodology

Why This Is Novel

1. Single-Source Completeness

  • One textbook provides BOTH grammar rules AND vocabulary
  • No need for separate corpora or parallel datasets
  • Self-contained: grammar validates what dictionary teaches

2. Grammar as Verifiable RL Environment

  • Grammar rules → RL reward functions
  • Each rule becomes a testable constraint
  • PrimeIntellect TOPLOC verifies special character preservation
  • Compositional rewards: characters + affixes + semantics

3. Bi-Directional Synthesis

  • Forward: {dakota:english} from dictionary
  • Reverse: {english:dakota} synthetic generation
  • Stoney Nakoda Q&A methodology adapted for full sentences
  • Grammar gym validates synthetic outputs

4. Closed-Loop Training

  • Dictionary words → Synthetic sentences
  • Synthetic sentences → Grammar gym (RL verification)
  • RL feedback → Improved generation
  • Grammar rules from same source ensure consistency

5. Historical Text → Modern RL

  • 1890s textbook → 2025 distributed RL training
  • VLM extraction bridges 135-year gap
  • No OCR training required
  • Preserves original orthography perfectly

Key Differentiators from Existing Literature

Traditional Approaches:

  • Require parallel corpora (expensive, rare for low-resource languages)
  • Separate grammar documentation and training data
  • Rule-based systems OR data-driven (not both)
  • No verifiable reward functions for morphology

Our Approach:

  • Single textbook = complete ecosystem
  • Grammar rules ARE the reward function
  • Synthetic data validated by grammar from same source
  • Verifiable character-level preservation (TOPLOC)
  • Fully reproducible (just needs scanned textbook + VLM API)

Research Contributions

  1. VLM-Based Historical Text Extraction

    • 92-95% accuracy on 130-year-old text
    • No OCR training required
    • Preserves complex orthography (ć, š, ŋ, ḣ)
  2. Grammar-to-RL Pipeline

    • Linguistic rules → Compositional reward functions
    • Multi-turn environments for morphology
    • Curriculum learning (easy → medium → hard)
  3. Closed-Loop Synthesis

    • Dictionary pairs → Grammar-validated sentences
    • Bi-directional generation (dakota ↔ english)
    • Self-consistency through shared source
  4. Distributed Verification

    • TOPLOC for Unicode preservation
    • Verifiable special character handling
    • Prevents corruption in untrusted workers

Potential Impact: This methodology could be applied to ANY low-resource language with historical grammar documentation, democratizing language preservation through modern AI.


Overview

This project uses modern Vision-Language Models (VLMs) to extract and preserve the Dakota language from historical 1890s grammar texts and dictionaries. Our goal is to create high-quality structured datasets that enable Reinforcement Learning-based fine-tuning of open-source language models on Dakota, contributing to Indigenous language revitalization efforts.

Key Innovation: We've proven that VLMs can extract complex orthographic features (special characters like ć, š, ŋ) from 130-year-old texts without requiring traditional OCR training, achieving 92-95% accuracy through prompt engineering alone. We then transform these extractions into verifiable RL training tasks with compositional reward functions that preserve linguistic structure.

Universal Framework: While demonstrated on Dakota, this pipeline is language-agnostic and can be applied to any low-resource language with historical documentation. The only requirements are: (1) scanned images of source texts, (2) a VLM API key, and (3) definition of the language's special characters. The extraction and RL training infrastructure adapts automatically.


The Language: Dakota

Dakota is a Siouan language historically spoken by the Dakota people across the Great Plains. The language uses a rich orthographic system with special characters to represent sounds unique to Dakota phonology:

  • Special consonants: ć (c-acute), š (s-caron), ŋ (eng), ḣ (h-dot)
  • Pitch accents: á, é, í, ó, ú
  • Long vowels: Represented by doubled letters (aa, ii, oo)
  • Syllable structure: Marked with hyphens (e.g., é-iŋ-hiŋ-tku)

Example Dakota text:

Wićašta waŋ ćiŋhiŋtku noŋpa : uŋkaŋ hakakta  kiŋ he   atkuku     kiŋ hećiya
Man     a   son-his   two   : and   youngest the that father-his the said-to-him
"A man had two sons: and the youngest said to his father"

Complete Pipeline: From 1890 Textbook to RL Training

Step 1: Source Material Acquisition

Input: Stephen Return Riggs' Dakota Grammar and Dictionary (1890)

  • 665 pages from Internet Archive
  • JP2 format, 2000x3000px scans
  • Pages 31-92: Grammar rules (62 pages)
  • Pages 93-440: Dictionary entries (~10,000 words)

Step 2: Image Conversion

Script: scripts/extraction/convert_all_images.py

python scripts/extraction/convert_all_images.py

Output: 440 JPEG images in data/processed_images/

Step 3A: Grammar Extraction (RL Gym Foundation)

Script: scripts/extraction/extract_grammar_pages.py

python scripts/extraction/extract_grammar_pages.py --pages 31-92 --yes

Process:

  • Claude Sonnet 4.5 extracts linguistic rules
  • Specialized prompt for Dakota orthography
  • Interlinear text parsing
  • Confidence scoring

Output: data/grammar_extracted/

  • 62 JSON files (1 per page)
  • 1,036 grammar rules extracted
  • 6 categories: morphology, syntax, phonology, conjugation, particles, translation

Key Innovation: Each grammar rule becomes a verifiable constraint
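
For orientation, one extracted rule might look roughly like the record below. Every field name and value here is hypothetical, not the actual JSON schema in data/grammar_extracted/:

# Hypothetical shape of a single extracted grammar rule (illustrative only)
rule = {
    "rule_id": "morphology_0042",
    "category": "morphology",
    "source_page": 47,
    "statement": "The plural is marked by the suffix -pi.",
    "examples": [{"dakota": "ćeya", "english": "he cries", "plural": "ćeyapi"}],
    "verification_pattern": r"pi\b",   # regex used as a verifiable constraint
    "confidence": 0.97,
}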

Step 3B: Dictionary Extraction (Vocabulary Source)

Script: scripts/extraction/extract_dakota_dictionary_v2.py

python scripts/extraction/extract_dakota_dictionary_v2.py --pages 93-440

Output: data/dictionary_extracted/

  • ~10,000 {dakota:english} pairs
  • Etymology and usage notes
  • Part of speech tags
Dakota Dictionary - Historical Text Detail
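
Similarly, a dictionary entry might be stored roughly as below; the fields are hypothetical and only meant to show the kind of information listed above:

# Hypothetical shape of one extracted dictionary entry (illustrative only)
entry = {
    "dakota": "šuŋka",
    "english": "dog",
    "part_of_speech": "n.",
    "notes": "",              # etymology / usage notes when present in the source
    "source_page": 103,
    "confidence": 0.94,
}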

Step 4: Grammar → RL Environment Conversion

Script: scripts/rl/organize_grammar_for_rl.py

python scripts/rl/organize_grammar_for_rl.py --input data/grammar_extracted/

Process:

  • Rules → RL task format
  • Positive/negative example generation
  • Difficulty estimation (easy/medium/hard)
  • Affix and special character tagging

Output: data/rl_training_rules/

  • 1,036 rules organized by category
  • Verification patterns defined
  • Source page tracking

Step 5: RL Task Generation

Script: scripts/conversion/convert_rules_to_primeintellect.py

python scripts/conversion/convert_rules_to_primeintellect.py

Process:

  • 1 rule → 5.5 tasks average
  • Multiple task types per rule:
    • Morphology application
    • Translation (dakota → english)
    • Reverse translation (english → dakota)
    • Syntax analysis
    • Pattern identification

Output: dakota_rl_training/datasets/

  • 5,657 total tasks from 1,036 rules
  • Curriculum-ready:
    • Easy: 1,998 tasks
    • Medium: 2,155 tasks
    • Hard: 398 tasks
    • Advanced: 1,106 tasks
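
A sketch of the fan-out from one rule to several task variants, reusing the hypothetical rule record from Step 3A; the real task schema produced by convert_rules_to_primeintellect.py may differ:

def tasks_from_rule(rule: dict) -> list:
    """Expand one grammar rule into multiple RL task variants (illustrative)."""
    example = rule["examples"][0]
    base = {"rule_id": rule["rule_id"], "category": rule["category"]}
    return [
        {**base, "type": "morphology_application",
         "prompt": f"Apply this rule to '{example['dakota']}': {rule['statement']}"},
        {**base, "type": "translation_dakota_to_english",
         "prompt": f"Translate to English: {example['dakota']}"},
        {**base, "type": "translation_english_to_dakota",
         "prompt": f"Translate to Dakota: {example['english']}"},
        {**base, "type": "pattern_identification",
         "prompt": f"Which affix does this rule require? {rule['statement']}"},
    ]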

Step 6: Synthetic Dataset Generation

Script: scripts/conversion/generate_synthetic_dakota.py (Stoney Nakoda methodology)

python scripts/conversion/generate_synthetic_dakota.py --dictionary data/dictionary_extracted/

Process:

  1. Load dictionary pairs: {dakota:english}
  2. Reverse pairs: {english:dakota}
  3. Generate Q&A variations:
    • "How do you say X in Dakota?" → dakota_word
    • "Translate X to English" → english_word
    • "Use X in a sentence" → full_sentence
  4. Validate sentences through Grammar Gym:
    • Check special characters
    • Verify affix usage
    • Confirm grammatical structure

Output: data/synthetic_dataset/

  • Sentence-level Dakota examples
  • Grammar-validated
  • Q&A format for fine-tuning
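
A sketch of the Q&A expansion for a single dictionary pair, following the variations listed above; the prompt templates are illustrative and the actual templates in generate_synthetic_dakota.py may differ:

def qa_variations(dakota: str, english: str) -> list:
    """Turn one {dakota: english} pair into Q&A-style training examples (illustrative)."""
    return [
        {"question": f"How do you say '{english}' in Dakota?", "answer": dakota},
        {"question": f"Translate '{dakota}' to English.", "answer": english},
        # Sentence-level items are generated by the model and then grammar-validated
        {"question": f"Use '{dakota}' in a sentence.", "answer": None},
    ]

print(qa_variations("šuŋka", "dog"))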

Step 7: RL Environment Setup

Script: scripts/rl/create_grammar_rl_environment.py

python scripts/rl/create_grammar_rl_environment.py --rules-dir data/rl_training_rules/

Creates:

  • DakotaGrammarEnv: Multi-turn verification
  • DakotaGrammarRubric: Compositional rewards
  • Curriculum learning stages
  • TOPLOC verification enabled

Output: data/rl_environment/environment_config.json

Step 8: Training on PrimeIntellect

Script: dakota_rl_training/train.py

# Install PrimeIntellect
pip install git+https://github.com/PrimeIntellect-ai/verifiers.git
pip install git+https://github.com/PrimeIntellect-ai/prime-rl.git

# Launch distributed training
cd dakota_rl_training
prime-rl train \
    --config configs/training_config.yaml \
    --num-workers 4 \
    --use-toploc \
    --wandb-project dakota-rl-grammar

Training Loop:

  1. Model generates Dakota text
  2. DakotaGrammarEnv verifies against grammar rules
  3. DakotaGrammarRubric calculates reward:
    • Character preservation: 0-1.0
    • Affix accuracy: 0-1.0
    • Semantic accuracy: 0-1.0
    • Composite with difficulty multiplier
  4. GRPO updates policy
  5. Curriculum advances (easy → medium → hard)

Synthetic Data Integration:

  • Synthetic sentences used as training examples
  • Grammar gym validates each generation
  • Closed loop: dictionary → sentences → grammar validation

Results & Statistics

Extraction Quality

  • Grammar extraction: 97% average confidence
  • Dictionary extraction: 92-95% accuracy
  • Special character preservation: 100% (ć, š, ŋ, ḣ preserved exactly)

Dataset Size

  • Grammar rules: 1,036 rules across 6 categories
  • RL training tasks: 5,657 tasks
  • Dictionary entries: ~10,000 word pairs
  • Synthetic sentences: Generated and validated

Training Metrics (Expected)

  • Character accuracy: >90% for all special chars
  • Affix accuracy: >85% for common morphology
  • Translation accuracy: 85-90% (easy), 65-70% (hard)
  • Training time: 8-12 hours on distributed workers

Cost Analysis

  • Grammar extraction: $15.50
  • Dictionary extraction: ~$87.50
  • RL training: Distributed (free via PrimeIntellect)
  • Total: ~$103 for complete system

Key Files & Documentation

Core Scripts

Extraction Scripts (scripts/extraction/):

  1. convert_all_images.py - JP2 → JPEG conversion
  2. extract_grammar_pages.py - Grammar rule extraction
  3. extract_dakota_dictionary_v2.py - Dictionary extraction

Conversion Scripts (scripts/conversion/):

  4. convert_rules_to_primeintellect.py - Generate RL tasks
  5. generate_synthetic_dakota.py - Synthetic data generation
  6. convert_extracted_to_chat.py - Convert to chat format

RL Scripts (scripts/rl/):

  7. organize_grammar_for_rl.py - Rules → RL format
  8. create_grammar_rl_environment.py - RL environment setup
  9. publish_dakota_environment.py - Publish to PrimeIntellect
  10. run_complete_grammar_pipeline.py - Run full pipeline

Training:

  11. dakota_rl_training/train.py - Launch training

Documentation

  • docs/guides/GRAMMAR_RL_PIPELINE.md - Complete grammar extraction guide
  • docs/status/PRIMEINTELLECT_INTEGRATION_COMPLETE.md - RL integration details
  • docs/status/GRAMMAR_EXTRACTION_COMPLETE.md - Final extraction results
  • docs/root/CLAUDE.md - Project instructions for AI assistants
  • docs/root/ - Quick reference documentation previously at root

Output Directories

  • data/processed_images/ - Converted JPEG images (440 files)
  • data/grammar_extracted/ - Raw grammar rules (62 pages)
  • data/rl_training_rules/ - Organized RL rules (1,036 rules)
  • data/dictionary_extracted/ - Dictionary entries (~10,000)
  • data/synthetic_dataset/ - Generated sentences
  • dakota_rl_training/datasets/ - RL training tasks (5,657)
  • dakota_rl_training/checkpoints/ - Model checkpoints

Technical Architecture

VLM Extraction Layer

  • Primary: Claude Sonnet 4.5 (claude-sonnet-4-5-20250929)
  • Alternative: Qwen3-VL-235B-A22B (with reasoning budget)
  • Prompt Engineering: Specialized Dakota orthography preservation
  • Output: Structured JSON with confidence scores

RL Training System

  • Base Model: Qwen/Qwen2.5-7B-Instruct
  • Method: LoRA fine-tuning (rank 64)
  • Algorithm: GRPO (Group Relative Policy Optimization)
  • Framework: PrimeIntellect prime-rl
  • Verification: TOPLOC for distributed Unicode validation

Reward Function

Multi-component compositional rewards:

reward = (
    0.4 * character_preservation +  # Dakota special chars
    0.4 * affix_accuracy +          # Morphological correctness
    0.2 * semantic_accuracy         # Translation quality
) * difficulty_multiplier           # 1.0x - 2.0x

Curriculum Learning

Progressive difficulty over 3 stages:

  1. Easy tasks (1,998) → 80% target accuracy
  2. Medium tasks (2,155) → 75% target accuracy
  3. Hard tasks (398) → 70% target accuracy
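
A sketch of the staged advancement implied by these targets: move to the next stage once rolling accuracy on the current stage meets its threshold. Task counts and thresholds come from the list above; the advancement logic itself is an assumption:

CURRICULUM = [
    {"stage": "easy",   "tasks": 1998, "target_accuracy": 0.80},
    {"stage": "medium", "tasks": 2155, "target_accuracy": 0.75},
    {"stage": "hard",   "tasks": 398,  "target_accuracy": 0.70},
]

def next_stage(current: int, rolling_accuracy: float) -> int:
    """Advance one stage when the current target accuracy is reached (illustrative)."""
    done = rolling_accuracy >= CURRICULUM[current]["target_accuracy"]
    return min(current + 1, len(CURRICULUM) - 1) if done else current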

Research Significance

Novel Contributions

  1. VLM-Based Historical Text Processing

    • First demonstration of VLM special character extraction from 19th-century texts
    • No OCR training required
    • 92-95% accuracy on complex orthography
  2. Grammar-to-RL Pipeline

    • Linguistic rules → Verifiable reward functions
    • Compositional rewards for morphology
    • Multi-turn environments for complex tasks
  3. Closed-Loop Synthesis Method

    • Single textbook → Complete training ecosystem
    • Dictionary + Grammar = Self-validating system
    • Bi-directional generation with verification
  4. Distributed Verification (TOPLOC)

    • First application to Unicode preservation
    • Prevents character corruption in untrusted workers
    • Critical for low-resource language training

Potential Applications

Immediate:

  • Dakota language revitalization
  • Educational tools for language learners
  • Grammar-aware translation systems

Broader Impact:

  • Template for other Siouan languages (Lakota, Nakota, Stoney)
  • Methodology for any historical grammar textbook
  • Framework for low-resource language preservation
  • Verifiable distributed training for linguistic tasks

Generalized RL Use Cases

The same closed-loop, rule-to-reward pattern extends beyond low-resource language learning:

  • Legal & Compliance: Turn the US Tax Code (text) into a reward function to train models that avoid non-compliant financial advice.
  • Biology: Turn protein folding heuristics papers into reward constraints for generative chemistry.
  • Code Migration: Turn a 1970 COBOL manual into a verification environment to train a Python-to-COBOL translation agent.
  • Industrial Safety (1950s Rail): Build an RL environment that mirrors a postwar railroad engineer's operating rules using the procedures documented in Public/1959RailRoadCodeRL.pdf, enforcing safety, dispatch, and signaling constraints in generation.

Project Expansion: Railroad Engineer 1959

Building on the success of the Dakota grammar pipeline, we are applying the same Grammar-to-RL methodology to industrial safety.

  • Objective: Train RL agents to operate trains strictly according to the 1959 Consolidated Code of Operating Rules.
  • Methodology:
    1. Extraction: VLM extracts rules from Public/1959RailRoadCodeRL.pdf.
    2. Environment: RailroadEnv verifies adherence to safety protocols, signaling rules, and terminology.
    3. Reward: Compositional reward function prioritizing Safety (critical), Procedure (sequence), and Terminology (exact phrasing).
  • Status: In active development (see RailroadEngineer1959/).

Future Research Directions

  1. Multi-Language Extension

    • Apply to related languages (Lakota, Stoney Nakoda)
    • Cross-lingual transfer learning
    • Comparative Siouan linguistics
  2. Enhanced Synthesis

    • Context-aware sentence generation
    • Dialogue system integration
    • Cultural knowledge embedding
  3. Evaluation Framework

    • Linguistic acceptability metrics
    • Native speaker validation
    • Grammaticality benchmarks
  4. Morphological Productivity

    • Test generalization to unseen words
    • Measure affix composition ability
    • Analyze systematic errors

Requirements

Software Dependencies

# Core extraction
pip install anthropic pillow python-dotenv

# RL training
pip install git+https://github.com/PrimeIntellect-ai/verifiers.git
pip install git+https://github.com/PrimeIntellect-ai/prime-rl.git

# Optional: Alternative VLM
pip install openai  # For OpenRouter/Qwen3-VL

API Keys

  • ANTHROPIC_API_KEY - Claude Sonnet 4.5 (required)
  • OPENROUTER_API_KEY - Qwen3-VL (optional)

System Requirements

  • Extraction: Python 3.8+, 8GB RAM
  • Training: Distributed via PrimeIntellect (or local GPU)
  • Storage: ~5GB for images + extractions

Quick Start

1. Setup

git clone https://github.com/HarleyCoops/Dakota1890.git
cd Dakota1890
python -m pip install -r requirements.txt

# Add API keys to .env
echo "ANTHROPIC_API_KEY=your_key_here" > .env

2. Convert Images

python scripts/extraction/convert_all_images.py
# Output: 440 images in data/processed_images/

3. Extract Grammar

python scripts/extraction/extract_grammar_pages.py --pages 31-92 --yes
# Output: 1,036 rules in data/grammar_extracted/

4. Generate RL Tasks

python scripts/rl/organize_grammar_for_rl.py --input data/grammar_extracted/
python scripts/conversion/convert_rules_to_primeintellect.py
# Output: 5,657 tasks in dakota_rl_training/datasets/

5. Launch Training

cd dakota_rl_training
python train.py --config configs/training_config.yaml
# Follow instructions to install PrimeIntellect and launch

Curriculum Section

The repo ships with two trainer presets:

  • dakota_rl_training/configs/train_30b.toml – the lightweight config we use in the quick start above. It has no curriculum logic; every RL step samples uniformly from whatever dataset the environment serves.
  • dakota_rl_training/configs/train.toml (and the matching training_config.yaml used by python train.py) – these contain an explicit [curriculum] block with staged datasets (easy → medium → hard) plus target step ranges.

To activate the curriculum when launching through the Prime RL CLI, point each component at the staged configs:

uv run rl \
  --trainer @ ~/dakota-rl-training/configs/train.toml \
  --orchestrator @ ~/dakota-rl-training/configs/orch.toml \
  --inference @ ~/dakota-rl-training/configs/infer.toml \
  ... # other flags (gpu ids, wandb, etc.)

The orchestrator/inference presets that accompany train.toml live in the same folder (orch.toml, infer.toml) and expect the staged datasets described in training_config.yaml. If you stick with the *_30b.toml files, no curriculum will trigger. In short: choose the config trio you pass to uv run rl based on whether you want uniform sampling (*_30b.toml) or curriculum-driven sampling (train.toml + orch.toml + infer.toml).


Citation

If you use this work, please cite:

@software{dakota_rl_grammar_2025,
  title = {Dakota Language Preservation Through VLM-Based Grammar Extraction and RL Training},
  author = {Cooper, Christian H.},
  year = {2025},
  url = {https://github.com/[repo]},
  note = {Novel closed-loop methodology for low-resource language training}
}

Acknowledgments

  • Stephen Return Riggs: Original Dakota grammar documentation (1890)
  • Internet Archive: Historical text preservation
  • PrimeIntellect: Distributed RL training infrastructure
  • Anthropic: Claude Sonnet 4.5 VLM capabilities
  • Dakota Language Community: Ongoing language revitalization efforts

License

Code is licensed under Apache-2.0 (see LICENSE). The included historical Dakota text (Riggs, 1890) is in the public domain; see DATA_LICENSE.md for details and scope.


Contact & Contributions

This methodology is designed to be reproducible and extensible. Contributions welcome for:

  • Additional language applications
  • Improved synthesis methods
  • Enhanced verification functions
  • Evaluation frameworks

See docs/CONTRIBUTING.md for guidelines.
