This project introduces a novel approach to low-resource language learning by transforming a single historical textbook into a complete, self-contained training ecosystem.
The fundamental insight driving this approach: grammar rules can be reward functions, and rewards should decompose into linguistic primitives. When you do this, syntax emerges naturally without an external verifier as judge. This is particularly powerful for RL on non-coding tasks where compositional structure matters.
Everyone else treats grammar as either:
- Preprocessing constraints (rule-based systems)
- Post-hoc evaluation (check grammar after generation)
This project makes grammar rules act as dense training signals through compositional rewards: each rule contributes a graded reward that shapes the policy gradient, rather than serving as a binary pass/fail check.
The key advantage: interpretability. You can actually see where in the latent space each linguistic level is being encoded. This makes debugging possible: "Oh, the model is failing on ć preservation because the character embedding gradient is being overwhelmed by the semantic gradient."
The efficiency of this method comes from treating the historical text not just as data, but as a computable specification.
Let $x$ be a prompt derived from the textbook and $y_i$ a candidate completion, and let $\mathcal{G} = \{g_1, \dots, g_K\}$ be the set of extracted grammar rules, where each grammar rule $g_k$ defines a verifiable constraint on the output.

Standard RLHF or GRPO typically uses a singular reward model $R_\phi(x, y)$. Here the reward is instead decomposed into linguistic primitives:

$$R(y_i \mid x) = \lambda_{diff} \cdot \big(\alpha R_{char} + \beta R_{morph} + \gamma R_{sem}\big)$$

Where:

- $R_{char}$ (Orthography): the Intersection-over-Union (or recall) of required special Unicode characters $\mathcal{C}_{spec}$ (e.g., ŋ, š, ć): $R_{char} = \frac{|chars(y_i) \cap chars(x)|}{|chars(x) \cap \mathcal{C}_{spec}|}$
- $R_{morph}$ (Syntax): a binary or scalar check against specific grammar rules $g_k \in \mathcal{G}$ (e.g., affix-presence regex): $R_{morph} = \frac{1}{|A|}\sum_{a \in A} \mathbb{I}(a \subset y_i)$, where $A$ is the set of required affixes
- $R_{sem}$ (Semantics): semantic similarity to the ground truth (or dictionary lookup)
- Weights: $(\alpha, \beta, \gamma) = (0.4, 0.4, 0.2)$ per the config
- $\lambda_{diff}$: the curriculum difficulty multiplier ($1.0 \dots 2.0$)
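To make the decomposition concrete, here is a minimal sketch of such a compositional reward in Python. The weights and the special-character set mirror the config above; the function names and the exact recall computation are illustrative, not the repository's actual rubric implementation.

```python
import re

SPECIAL_CHARS = set("ćšŋḣṡáéíóú")  # Dakota special characters (C_spec)

def r_char(candidate: str, reference: str) -> float:
    """Recall of the reference's required special characters in the candidate."""
    required = set(reference) & SPECIAL_CHARS
    if not required:
        return 1.0  # nothing special to preserve
    return len(set(candidate) & required) / len(required)

def r_morph(candidate: str, required_affixes: list[str]) -> float:
    """Fraction of required affixes present in the candidate (regex check)."""
    if not required_affixes:
        return 1.0
    hits = sum(bool(re.search(re.escape(a), candidate)) for a in required_affixes)
    return hits / len(required_affixes)

def composite_reward(candidate, reference, required_affixes, r_sem, difficulty=1.0,
                     alpha=0.4, beta=0.4, gamma=0.2):
    """R(y|x) = lambda_diff * (alpha*R_char + beta*R_morph + gamma*R_sem)."""
    return difficulty * (alpha * r_char(candidate, reference)
                         + beta * r_morph(candidate, required_affixes)
                         + gamma * r_sem)
```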
In standard GRPO, we compute the advantage of each completion by normalizing its reward against the group sampled for the same prompt:

$$\hat{A}_i = \frac{R(y_i \mid x) - \mathrm{mean}\big(\{R(y_j \mid x)\}_{j=1}^{G}\big)}{\mathrm{std}\big(\{R(y_j \mid x)\}_{j=1}^{G}\big)}$$

Where $G$ is the number of completions sampled per prompt. Because the decomposed reward is dense and verifiable, the group statistics are informative even early in training.
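A minimal sketch of this group-relative normalization, assuming the rewards come from the compositional function above; the epsilon guard is an implementation detail, not taken from the repository.

```python
import statistics

def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each sample's reward against its group (GRPO-style)."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: 4 completions for one prompt, scored by the compositional reward
print(group_advantages([0.12, 0.35, 0.28, 0.31]))
```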
The mathematical reason for the 160-step convergence and 97.9% morphology accuracy is the density of the gradient signal.
In standard language modeling (next-token prediction), the loss is the token-level cross-entropy:

$$\mathcal{L}_{LM} = -\sum_{t} \log p_\theta(y_t \mid y_{<t}, x)$$

This gives feedback only on exact token matches.

In Grammar-GRPO, the feedback signal scores every completion along the character, affix, and semantic axes simultaneously, so partially correct outputs still earn partial credit. Because each component is independently verifiable and continuous, a completion that preserves the right special characters but misses an affix receives a lower, but still informative, reward rather than a flat zero.

Mathematically, this reduces the noise in the reward signal, allowing the policy to traverse the optimization landscape directly toward the "Grammar Valley" (the manifold where valid Dakota syntax exists).
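As a hypothetical worked example (numbers chosen for illustration, not taken from a run): a completion at difficulty 1.0 that preserves 3 of 4 required special characters, applies both required affixes, and scores 0.5 on semantics receives

$$R = 1.0 \cdot (0.4 \cdot 0.75 + 0.4 \cdot 1.0 + 0.2 \cdot 0.5) = 0.80,$$

whereas exact-match token feedback would simply mark the same completion wrong. That partial credit is what keeps the learning signal dense.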
This section presents comprehensive visualizations from our successful Reinforcement Learning training runs, demonstrating the effectiveness of the grammar-to-RL methodology on the Dakota language. There have been two successful runs: one at 1000 steps (final) and one at 400 steps (initial).
- Project: dakota-rl-grammar
- Entity: christian-cooper-us
- Trainer Run: 7nikv4vp-dakota-0.6b-rl-trainer
- Orchestrator Run: 29hn8w98-dakota-0.6b-rl-orchestrator
- Model: Qwen3-0.6B-Dakota-Grammar-RL
- Training Steps: 1,000 steps (998 completed)
- Total Samples: 256,000 samples processed
- Training Duration: 1.54 hours (5,537 seconds)
- 190% improvement in overall reward (0.120 → 0.349)
- 97.9% morphological accuracy - exceptional performance in affix application
- 53.5% character preservation - significant improvement for complex orthography
- 90% of improvement achieved in first 160 steps (16% of training) - demonstrating rapid learning
- Stable training with controlled KL divergence throughout
The comprehensive dashboard provides an at-a-glance view of all training metrics, combining reward progression, component performance, loss dynamics, entropy, KL divergence, and throughput metrics into a single visualization.
What this shows: This multi-panel dashboard synthesizes all key training signals. The top panel shows reward progression with milestone markers indicating when 25%, 50%, 75%, and 90% of total improvement was achieved. The component comparison bar chart (middle-left) reveals the differential performance: morphological accuracy reached 97.9% while character preservation achieved 53.5%, reflecting the challenge of preserving Dakota's complex orthography (ć, š, ŋ, ḣ, ṡ, á, é, í, ó, ú) with a 0.6B parameter model. The loss and entropy panels demonstrate stable optimization, while the KL divergence metrics show controlled policy adaptation without catastrophic forgetting.
View full run: Trainer Run | Orchestrator Run
The reward progression visualization demonstrates the learning trajectory over 1,000 training steps, showing both overall composite reward and individual component breakdown.
What this shows: The top panel tracks overall reward progression from 0.120 (step 0) to 0.349 (step 999), representing a 190.1% improvement. Milestone markers highlight key learning efficiency points: 25% improvement at step 49 (4.9% of training), 50% at step 71 (7.1%), 75% at step 109 (10.9%), and 90% at step 160 (16%). The rapid initial learning validates the methodology's efficiency - grammar-based tasks provide dense learning signals compared to general language modeling. The bottom panel shows the component breakdown: Morphological Accuracy (green) achieved near-perfect performance (0.979), Character Preservation (orange) showed substantial improvement from 0.038 to 0.535 (14x increase), while the Overall Composite (blue) reflects the weighted combination including semantic components.
Interpretation: The divergence between component performances demonstrates that the model learned morphological patterns more effectively than orthographic preservation. This suggests potential areas for future improvement through specialized character-focused training or larger model capacity. The semantic component (20% weight) likely contributes to the composite score being lower than individual components, indicating multi-objective optimization challenges.
View full run: Orchestrator Run
This visualization tracks the core training dynamics: policy loss, model entropy (confidence), KL divergence (policy adaptation), and inference probabilities.
What this shows:
- Policy Loss (top-left): Values ranged from approximately 1e-5 to 1e-3, typical of GRPO training with conservative learning rates. The log-scale visualization shows consistent small magnitudes indicating stable gradient-based optimization. The shaded region represents ±1 standard deviation, showing controlled variance throughout training.
- Model Entropy (top-right): Decreased from 0.93 to 0.21, indicating the model became significantly more confident in its predictions. Low final entropy (0.21) suggests the model is highly confident, which aligns with the high morphological accuracy achieved.
- KL Divergence (bottom-left): Three metrics track policy adaptation:
- Masked KL: Increased from near-zero to 9.32, indicating substantial policy adaptation for Dakota-specific masked tokens
- Overall KL: Moderate increase from 0.001 to 3.83, suggesting controlled policy adaptation
- Unmasked KL: Remained extremely low (mean: 0.070, final: 0.042), confirming the model preserved general language capabilities while learning Dakota-specific patterns
- Inference Probabilities (bottom-right): Increased from 0.63 to 0.86, showing the model became more certain in its predictions over time.
Interpretation: The increasing KL divergence trends indicate active learning and policy adaptation, while the relatively moderate values (especially for unmasked tokens) suggest training remained stable. The model successfully specialized for Dakota grammar while preserving general language understanding, validating the training approach.
View full run: Trainer Run
Performance metrics track computational efficiency: training throughput (tokens per second) and GPU utilization (Model FLOPS Utilization).
What this shows:
- Training Throughput (left): Average throughput of 8,178 tokens/sec with consistent performance throughout training. The red dashed line indicates the average, showing stable training execution without significant throughput variations.
- GPU Efficiency - MFU (right): Average Model FLOPS Utilization of 2.68%. While this may seem low, it is typical for small models (0.6B parameters), where memory bandwidth rather than compute is often the bottleneck. The consistent MFU suggests stable training without memory pressure or compute bottlenecks.
Interpretation: The consistent performance metrics validate stable training execution. The throughput remained stable throughout 1,000 steps, processing 256,000 total samples with an average of 256 samples per step. Peak memory usage was 11.5 GiB, well within reasonable bounds for the model size.
View full run: Trainer Run
- Project: dakota-rl-grammar
- Entity: christian-cooper-us
- Trainer Run: yut26kcm-dakota-0.6b-ledger-test-400-trainer
- Orchestrator Run: 1y33h9zr-dakota-0.6b-ledger-test-400-orchestrator
- Model: Qwen3-0.6B-Dakota-Grammar-RL-400
- Training Steps: 400 steps (all completed)
- Total Samples: 102,400 samples processed
- Base Model: Qwen/Qwen3-0.6B (small instruct model optimized for RL)
- 150.3% improvement in overall reward (0.128 → 0.321, peak: 0.345)
- Rapid learning: 90% of improvement achieved in first 85 steps (21.25% of training)
- Sample efficiency: 0.000483 improvement per step - demonstrating dense learning signals
- Stable training: Controlled KL divergence with unmasked KL remaining low (mean: 0.094, final: 0.092)
- Policy confidence: Entropy decreased from 0.93 to 0.28, showing increased model certainty
Interpretation: The rapid learning trajectory suggests that compositional reward functions can support efficient learning on qualitative linguistic tasks. The milestone markers show consistent acceleration, with each 25% improvement requiring progressively fewer steps, indicating the model is learning to learn more effectively. These early signals hint that GRPO could extend beyond coding/math domains when rewards are decomposed into linguistic primitives, but this still needs more confirmation.
View full run: Orchestrator Run
- Project / Run ID: dakota-rl-grammar/i55d4x26
- Base Model: Qwen/Qwen3-30B-A3B-Instruct-2507 (LoRA rank 32)
- Steps: 199 (0–198) on Tinker; checkpoint at tinker://da1ef918-d67a-5080-b500-dd1256db9ca7:train:0/sampler_weights/final
- Composite reward: 0.105 → 0.442 peak (step 116) → 0.317 final
- Character preservation: 0.265 → 0.699 peak (step 185) → 0.619 final
- Affix accuracy: 0.957 → 1.000 (stayed perfect in later stages)
- Exact match: 0.001 → 0.337 peak (step 116) → 0.100 final; length penalty rose to 1.0 by step 52
This work suggests that GRPO (Group Relative Policy Optimization) can achieve strong learning on qualitative linguistic tasks when rewards are properly decomposed into interpretable components, though further runs are needed to confirm the trend. This is significant because:
GRPO has been successfully applied to quantitative domains (code generation, mathematical reasoning) where correctness is verifiable and rewards are clear. However, qualitative tasks like language learning, translation, and grammar have traditionally been considered unsuitable for RL because:
- Subjective evaluation: "Is this translation good?" lacks clear criteria
- Multi-dimensional quality: A translation can be semantically correct but orthographically wrong
- Nuanced feedback: Binary correct/incorrect fails to capture partial correctness
By decomposing rewards into linguistic primitives (character preservation, morphological accuracy, semantic correctness), we transform qualitative tasks into quantitatively optimizable objectives:
- Character preservation (40%): Verifiable Unicode-level correctness
- Morphological accuracy (40%): Pattern-matching against grammar rules
- Semantic correctness (20%): Meaning preservation metrics
This decomposition enables GRPO to work effectively because:
- Each component is independently verifiable (no human judgment needed)
- Each component shapes the policy gradient (the model learns what to prioritize)
- Multi-dimensional feedback (model knows exactly what it got wrong)
- 150.3% improvement in 400 steps - Comparable to GRPO performance on coding tasks
- 90% improvement in 21% of training - Demonstrates dense learning signals from compositional rewards
- Low unmasked KL (0.092) - Model specializes without catastrophic forgetting
- Stable training dynamics - No reward hacking or instability issues
This work hints that GRPO may not be limited to quantitative domains. When qualitative tasks are decomposed into verifiable components, they may become more learnable, opening possibilities for:
- Low-resource language learning (this work)
- Style transfer (decompose into syntax, semantics, register)
- Dialogue systems (decompose into coherence, relevance, appropriateness)
- Creative tasks (decompose into structure, originality, coherence)
These results provide early evidence for the core methodological idea: transforming grammar rules from a 130-year-old historical textbook into verifiable RL environments. The strong morphological accuracy (97.9%) suggests that:
1. Rule Extraction Quality: The VLM-based extraction successfully captured testable grammar patterns from historical text, preserving morphological rules in a format suitable for RL training.
2. Task Generation Effectiveness: The conversion of 1,036 grammar rules into 5,657 RL tasks created sufficient training signal for the model to learn morphological patterns. The high accuracy suggests the task generation process successfully encoded the grammar rules as verifiable constraints.
3. Compositional Reward Structure: The decomposition into character, morphology, and semantic components enabled fine-grained learning. The model learned morphological patterns more effectively than orthographic preservation, suggesting potential areas for future improvement.
4. Sample Efficiency: The rapid initial learning (90% improvement in 16% of training) demonstrates the methodology's efficiency for low-resource language scenarios. With only 256,000 samples processed over 1,000 steps, the model achieved substantial improvement.
These visualizations were generated using scripts/create_rl_visualizations.py, which loads data from Weights & Biases and creates publication-quality plots. To regenerate with updated data:
```bash
# For the 1000 step run (default):
python scripts/create_rl_visualizations.py \
  --trainer-id 7nikv4vp \
  --orchestrator-id 29hn8w98 \
  --project dakota-rl-grammar \
  --entity christian-cooper-us
```

All visualization code and data are available in the repository for reproducibility and further analysis.
We log detailed reward component breakdowns to provide full transparency into how rewards are computed. The reward ledger exposes all internal components, weights, penalties, and multipliers used in reward calculation.
What this shows: The reward ledger visualization demonstrates how individual components combine to form the final reward. Each line represents a different component or intermediate computation:
- Weighted components: Individual reward components (exact match, character overlap, pattern matching, affix accuracy) multiplied by their respective weights
- Composite pre: Weighted sum of components before applying penalties and multipliers
- Length penalty multiplier: Applied to penalize overly long responses (1.0 = no penalty, <1.0 = penalty)
- Difficulty multiplier: Applied based on task difficulty (1.0 for easy, up to 2.0 for expert)
- Composite predicted: Final computed reward from components
- Reward scalar: Actual reward used in training (should match composite_predicted)
The visualization validates that the logged components correctly reconstruct the final reward, with composite_diff tracking any numerical discrepancies.
Ledger Fields: The ledger logs raw and normalized component scores, weights, penalties, difficulty multipliers, and intermediate composites. All data is logged to both Weights & Biases (under ledger/* namespace) and CSV (wandb_analysis/reward_ledger.csv) for offline analysis.
Head/tail of reward_ledger.csv
<!-- Auto-generated by scripts/analysis/make_ledger_snippet.py -->
<!-- Run: python scripts/analysis/make_ledger_snippet.py -->
Integration: The ledger is automatically computed by DakotaGrammarRubric.score() and can be accessed via environment.get_reward_ledger(). See dakota_rl_training/utils/LEDGER_INTEGRATION.md for integration instructions.
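As a sanity check on the ledger, here is a hypothetical sketch of reconstructing the composite reward offline from the CSV export. The column names (weighted component columns, length_penalty, difficulty_multiplier, reward_scalar) are assumptions based on the description above, not a guaranteed schema.

```python
import pandas as pd

# Hypothetical column names; adjust to the actual header of reward_ledger.csv
ledger = pd.read_csv("wandb_analysis/reward_ledger.csv")

weighted_cols = [c for c in ledger.columns if c.startswith("weighted_")]
recomputed = (ledger[weighted_cols].sum(axis=1)
              * ledger["length_penalty"]
              * ledger["difficulty_multiplier"])

# The difference should be ~0 if the logged components reconstruct the reward
composite_diff = (recomputed - ledger["reward_scalar"]).abs()
print(composite_diff.describe())
```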
Exceptional level of detail preserved from the 1890 source material — every character, accent, and linguistic nuance captured with precision
All images, extractions, and datasets are freely available in this repository: data/processed_images
/data now preserves every artifact of the Dakota Grammar preliminary RL build, from VLM extraction outputs and audit traces to the structured rule corpora and PrimeIntellect-ready task sets, making the full pipeline from 1890 scans to verifiable environments reproducible and inspectable. This snapshot captures the linguistic provenance, reasoning logs, and RL curriculum that power the repository's closed-loop grammar gym.
We take an 1890 grammar textbook and split it into two complementary components that feed into each other:
1. Single-Source Completeness
- One textbook provides BOTH grammar rules AND vocabulary
- No need for separate corpora or parallel datasets
- Self-contained: grammar validates what dictionary teaches
2. Grammar as Verifiable RL Environment
- Grammar rules → RL reward functions
- Each rule becomes a testable constraint
- PrimeIntellect TOPLOC verifies special character preservation
- Compositional rewards: characters + affixes + semantics
3. Bi-Directional Synthesis
- Forward: {dakota:english} from dictionary
- Reverse: {english:dakota} synthetic generation
- Stoney Nakoda Q&A methodology adapted for full sentences
- Grammar gym validates synthetic outputs
4. Closed-Loop Training
- Dictionary words → Synthetic sentences
- Synthetic sentences → Grammar gym (RL verification)
- RL feedback → Improved generation
- Grammar rules from same source ensure consistency
5. Historical Text → Modern RL
- 1890s textbook → 2025 distributed RL training
- VLM extraction bridges 135-year gap
- No OCR training required
- Preserves original orthography perfectly
Traditional Approaches:
- Require parallel corpora (expensive, rare for low-resource languages)
- Separate grammar documentation and training data
- Rule-based systems OR data-driven (not both)
- No verifiable reward functions for morphology
Our Approach:
- Single textbook = complete ecosystem
- Grammar rules ARE the reward function
- Synthetic data validated by grammar from same source
- Verifiable character-level preservation (TOPLOC)
- Fully reproducible (just needs scanned textbook + VLM API)
1. VLM-Based Historical Text Extraction
   - 92-95% accuracy on 130-year-old text
   - No OCR training required
   - Preserves complex orthography (ć, š, ŋ, ḣ)
2. Grammar-to-RL Pipeline
   - Linguistic rules → Compositional reward functions
   - Multi-turn environments for morphology
   - Curriculum learning (easy → medium → hard)
3. Closed-Loop Synthesis
   - Dictionary pairs → Grammar-validated sentences
   - Bi-directional generation (dakota ↔ english)
   - Self-consistency through shared source
4. Distributed Verification
   - TOPLOC for Unicode preservation
   - Verifiable special character handling
   - Prevents corruption in untrusted workers
Potential Impact: This methodology could be applied to ANY low-resource language with historical grammar documentation, democratizing language preservation through modern AI.
This project uses modern Vision-Language Models (VLMs) to extract and preserve the Dakota language from historical 1890s grammar texts and dictionaries. Our goal is to create high-quality structured datasets that enable Reinforcement Learning-based fine-tuning of open-source language models on Dakota, contributing to Indigenous language revitalization efforts.
Key Innovation: We've proven that VLMs can extract complex orthographic features (special characters like ć, š, ŋ) from 130-year-old texts without requiring traditional OCR training, achieving 92-95% accuracy through prompt engineering alone. We then transform these extractions into verifiable RL training tasks with compositional reward functions that preserve linguistic structure.
Universal Framework: While demonstrated on Dakota, this pipeline is language-agnostic and can be applied to any low-resource language with historical documentation. The only requirements are: (1) scanned images of source texts, (2) a VLM API key, and (3) definition of the language's special characters. The extraction and RL training infrastructure adapts automatically.
Dakota is a Siouan language historically spoken by the Dakota people across the Great Plains. The language uses a rich orthographic system with special characters to represent sounds unique to Dakota phonology:
- Special consonants: ć (c-acute), š (s-caron), ŋ (eng), ḣ (h-dot)
- Pitch accents: á, é, í, ó, ú
- Long vowels: Represented by doubled letters (aa, ii, oo)
- Syllable structure: Marked with hyphens (e.g., é-iŋ-hiŋ-tku)
Example Dakota text:
Wićašta wańŋ éińhińtku nonpa : unkań hakakata kiń he atkuku kiń heéiya
Man a son-his two : and youngest the that father-his the said-to-him
"A man had two sons: and the youngest said to his father"
Input: Stephen Return Riggs' Dakota Grammar and Dictionary (1890)
- 665 pages from Internet Archive
- JP2 format, 2000x3000px scans
- Pages 31-92: Grammar rules (62 pages)
- Pages 93-440: Dictionary entries (~10,000 words)
Script: scripts/extraction/convert_all_images.py

```bash
python scripts/extraction/convert_all_images.py
```

Output: 440 JPEG images in data/processed_images/
Script: scripts/extraction/extract_grammar_pages.py

```bash
python scripts/extraction/extract_grammar_pages.py --pages 31-92 --yes
```

Process:
- Claude Sonnet 4.5 extracts linguistic rules
- Specialized prompt for Dakota orthography
- Interlinear text parsing
- Confidence scoring
Output: data/grammar_extracted/
- 62 JSON files (1 per page)
- 1,036 grammar rules extracted
- 6 categories: morphology, syntax, phonology, conjugation, particles, translation
Key Innovation: Each grammar rule becomes a verifiable constraint
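To illustrate what "rule as verifiable constraint" can look like, here is a hypothetical rule record and checker. The field names and values are illustrative stand-ins, not the exact schema stored in data/grammar_extracted/.

```python
import re

# Hypothetical structure for one extracted rule
rule = {
    "rule_id": "morphology_plural_pi",
    "category": "morphology",
    "description": "Plural animate subjects take the -pi suffix on the verb.",
    "verification_pattern": r"\w+pi\b",   # regex the output must satisfy
    "required_chars": ["ŋ"],              # special characters that must survive
    "source_page": 47,                    # illustrative page reference
    "difficulty": "easy",
}

def satisfies(rule: dict, output: str) -> bool:
    """Check a model output against one rule's pattern and required characters."""
    pattern_ok = re.search(rule["verification_pattern"], output) is not None
    chars_ok = all(c in output for c in rule["required_chars"])
    return pattern_ok and chars_ok
```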
Script: scripts/extraction/extract_dakota_dictionary_v2.py

```bash
python scripts/extraction/extract_dakota_dictionary_v2.py --pages 93-440
```

Output: data/dictionary_extracted/
- ~10,000 {dakota:english} pairs
- Etymology and usage notes
- Part of speech tags
Script: scripts/rl/organize_grammar_for_rl.py

```bash
python scripts/rl/organize_grammar_for_rl.py --input data/grammar_extracted/
```

Process:
- Rules → RL task format
- Positive/negative example generation
- Difficulty estimation (easy/medium/hard)
- Affix and special character tagging
Output: data/rl_training_rules/
- 1,036 rules organized by category
- Verification patterns defined
- Source page tracking
Script: scripts/conversion/convert_rules_to_primeintellect.py

```bash
python scripts/conversion/convert_rules_to_primeintellect.py
```

Process:
- 1 rule → 5.5 tasks average
- Multiple task types per rule:
  - Morphology application
  - Translation (dakota → english)
  - Reverse translation (english → dakota)
  - Syntax analysis
  - Pattern identification
Output: dakota_rl_training/datasets/
- 5,657 total tasks from 1,036 rules
- Curriculum-ready:
  - Easy: 1,998 tasks
  - Medium: 2,155 tasks
  - Hard: 398 tasks
  - Advanced: 1,106 tasks
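A hedged sketch of how a single rule might fan out into multiple task types, reusing the hypothetical rule record from the earlier sketch; the prompt template and dict keys are illustrative stand-ins for whatever convert_rules_to_primeintellect.py actually emits.

```python
TASK_TYPES = [
    "morphology_application",
    "translation_dakota_to_english",
    "translation_english_to_dakota",
    "syntax_analysis",
    "pattern_identification",
]

def rule_to_tasks(rule: dict) -> list[dict]:
    """Expand one extracted grammar rule into several RL tasks (~5.5 per rule on average)."""
    tasks = []
    for task_type in TASK_TYPES:
        tasks.append({
            "task_type": task_type,
            "rule_id": rule["rule_id"],
            "difficulty": rule["difficulty"],
            "prompt": f"[{task_type}] Apply the rule: {rule['description']}",
            "verification_pattern": rule["verification_pattern"],
        })
    return tasks
```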
Script: scripts/conversion/generate_synthetic_dakota.py (Stoney Nakoda methodology)

```bash
python scripts/conversion/generate_synthetic_dakota.py --dictionary data/dictionary_extracted/
```

Process:
- Load dictionary pairs: {dakota:english}
- Reverse pairs: {english:dakota}
- Generate Q&A variations:
  - "How do you say X in Dakota?" → dakota_word
  - "Translate X to English" → english_word
  - "Use X in a sentence" → full_sentence
- Validate sentences through Grammar Gym:
  - Check special characters
  - Verify affix usage
  - Confirm grammatical structure
Output: data/synthetic_dataset/
- Sentence-level Dakota examples
- Grammar-validated
- Q&A format for fine-tuning
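A minimal sketch of the bi-directional Q&A expansion described above, assuming simple {dakota: english} dictionary pairs; the templates are illustrative, and the sentence answer is left as a placeholder to be filled by the generator and then grammar-validated.

```python
def make_qa_pairs(dakota: str, english: str) -> list[dict]:
    """Generate forward and reverse Q&A items from one dictionary pair."""
    return [
        {"question": f"How do you say '{english}' in Dakota?", "answer": dakota},
        {"question": f"Translate '{dakota}' to English.", "answer": english},
        {"question": f"Use '{dakota}' in a Dakota sentence.", "answer": None},  # filled by the generator, then grammar-validated
    ]

dictionary = {"šuŋka": "dog", "wakpa": "river"}
qa_items = [item for d, e in dictionary.items() for item in make_qa_pairs(d, e)]
```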
Script: scripts/rl/create_grammar_rl_environment.py

```bash
python scripts/rl/create_grammar_rl_environment.py --rules-dir data/rl_training_rules/
```

Creates:
- DakotaGrammarEnv: Multi-turn verification
- DakotaGrammarRubric: Compositional rewards
- Curriculum learning stages
- TOPLOC verification enabled
Output: data/rl_environment/environment_config.json
Script: dakota_rl_training/train.py

```bash
# Install PrimeIntellect
pip install git+https://github.com/PrimeIntellect-ai/verifiers.git
pip install git+https://github.com/PrimeIntellect-ai/prime-rl.git

# Launch distributed training
cd dakota_rl_training
prime-rl train \
  --config configs/training_config.yaml \
  --num-workers 4 \
  --use-toploc \
  --wandb-project dakota-rl-grammar
```

Training Loop:
1. Model generates Dakota text
2. DakotaGrammarEnv verifies against grammar rules
3. DakotaGrammarRubric calculates reward:
   - Character preservation: 0-1.0
   - Affix accuracy: 0-1.0
   - Semantic accuracy: 0-1.0
   - Composite with difficulty multiplier
4. GRPO updates policy
5. Curriculum advances (easy → medium → hard)
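A schematic sketch of the loop above: DakotaGrammarEnv and DakotaGrammarRubric are named in the repository, but the method signatures used here (reset, generate, score, update) are assumptions for illustration only.

```python
def train_step(policy, env, rubric, group_size: int = 8):
    """One GRPO-style step: sample a group, score with compositional rewards, update."""
    task = env.reset()                                        # assumed API: draw a task from the current curriculum stage
    completions = [policy.generate(task["prompt"]) for _ in range(group_size)]
    rewards = [rubric.score(task, y) for y in completions]    # character + affix + semantic components
    advantages = group_advantages(rewards)                    # group normalization, as in the earlier sketch
    policy.update(task["prompt"], completions, advantages)    # GRPO policy update
```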
Synthetic Data Integration:
- Synthetic sentences used as training examples
- Grammar gym validates each generation
- Closed loop: dictionary → sentences → grammar validation
- Grammar extraction: 97% average confidence
- Dictionary extraction: 92-95% accuracy
- Special character preservation: 100% (ć, š, ŋ, ḣ preserved exactly)
- Grammar rules: 1,036 rules across 6 categories
- RL training tasks: 5,657 tasks
- Dictionary entries: ~10,000 word pairs
- Synthetic sentences: Generated and validated
- Character accuracy: >90% for all special chars
- Affix accuracy: >85% for common morphology
- Translation accuracy: 85-90% (easy), 65-70% (hard)
- Training time: 8-12 hours on distributed workers
- Grammar extraction: $15.50
- Dictionary extraction: ~$87.50
- RL training: Distributed (free via PrimeIntellect)
- Total: ~$103 for complete system
Extraction Scripts (scripts/extraction/):
1. convert_all_images.py - JP2 → JPEG conversion
2. extract_grammar_pages.py - Grammar rule extraction
3. extract_dakota_dictionary_v2.py - Dictionary extraction
Conversion Scripts (scripts/conversion/):
4. convert_rules_to_primeintellect.py - Generate RL tasks
5. generate_synthetic_dakota.py - Synthetic data generation
6. convert_extracted_to_chat.py - Convert to chat format
RL Scripts (scripts/rl/):
7. organize_grammar_for_rl.py - Rules → RL format
8. create_grammar_rl_environment.py - RL environment setup
9. publish_dakota_environment.py - Publish to PrimeIntellect
10. run_complete_grammar_pipeline.py - Run full pipeline
Training:
11. dakota_rl_training/train.py - Launch training
- docs/guides/GRAMMAR_RL_PIPELINE.md - Complete grammar extraction guide
- docs/status/PRIMEINTELLECT_INTEGRATION_COMPLETE.md - RL integration details
- docs/status/GRAMMAR_EXTRACTION_COMPLETE.md - Final extraction results
- docs/root/CLAUDE.md - Project instructions for AI assistants
- docs/root/ - Quick reference documentation previously at root
- data/processed_images/ - Converted JPEG images (440 files)
- data/grammar_extracted/ - Raw grammar rules (62 pages)
- data/rl_training_rules/ - Organized RL rules (1,036 rules)
- data/dictionary_extracted/ - Dictionary entries (~10,000)
- data/synthetic_dataset/ - Generated sentences
- dakota_rl_training/datasets/ - RL training tasks (5,657)
- dakota_rl_training/checkpoints/ - Model checkpoints
- Primary: Claude Sonnet 4.5 (claude-sonnet-4-5-20250929)
- Alternative: Qwen3-VL-235B-A22B (with reasoning budget)
- Prompt Engineering: Specialized Dakota orthography preservation
- Output: Structured JSON with confidence scores
- Base Model: Qwen/Qwen2.5-7B-Instruct
- Method: LoRA fine-tuning (rank 64)
- Algorithm: GRPO (Group Relative Policy Optimization)
- Framework: PrimeIntellect prime-rl
- Verification: TOPLOC for distributed Unicode validation
Multi-component compositional rewards:
```python
reward = (
    0.4 * character_preservation +  # Dakota special chars
    0.4 * affix_accuracy +          # Morphological correctness
    0.2 * semantic_accuracy         # Translation quality
) * difficulty_multiplier           # 1.0x - 2.0x
```

Progressive difficulty over 3 stages:
- Easy tasks (1,998) → 80% target accuracy
- Medium tasks (2,155) → 75% target accuracy
- Hard tasks (398) → 70% target accuracy
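A hedged sketch of how stage advancement might be gated on the target accuracies above; the task counts and thresholds come from the list, while the function and field names are illustrative.

```python
CURRICULUM = [
    {"stage": "easy",   "num_tasks": 1998, "target_accuracy": 0.80},
    {"stage": "medium", "num_tasks": 2155, "target_accuracy": 0.75},
    {"stage": "hard",   "num_tasks": 398,  "target_accuracy": 0.70},
]

def maybe_advance(stage_idx: int, recent_accuracy: float) -> int:
    """Move to the next stage once the current stage's target accuracy is met."""
    if (recent_accuracy >= CURRICULUM[stage_idx]["target_accuracy"]
            and stage_idx + 1 < len(CURRICULUM)):
        return stage_idx + 1
    return stage_idx
```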
1. VLM-Based Historical Text Processing
   - First demonstration of VLM special character extraction from 19th-century texts
   - No OCR training required
   - 92-95% accuracy on complex orthography
2. Grammar-to-RL Pipeline
   - Linguistic rules → Verifiable reward functions
   - Compositional rewards for morphology
   - Multi-turn environments for complex tasks
3. Closed-Loop Synthesis Method
   - Single textbook → Complete training ecosystem
   - Dictionary + Grammar = Self-validating system
   - Bi-directional generation with verification
4. Distributed Verification (TOPLOC)
   - First application to Unicode preservation
   - Prevents character corruption in untrusted workers
   - Critical for low-resource language training
Immediate:
- Dakota language revitalization
- Educational tools for language learners
- Grammar-aware translation systems
Broader Impact:
- Template for other Siouan languages (Lakota, Nakota, Stoney)
- Methodology for any historical grammar textbook
- Framework for low-resource language preservation
- Verifiable distributed training for linguistic tasks
The same closed-loop, rule-to-reward pattern extends beyond low-resource language learning:
- Legal & Compliance: Turn the US Tax Code (text) into a reward function to train models that avoid non-compliant financial advice.
- Biology: Turn protein folding heuristics papers into reward constraints for generative chemistry.
- Code Migration: Turn a 1970 COBOL manual into a verification environment to train a Python-to-COBOL translation agent.
- Industrial Safety (1950s Rail): Build an RL environment that mirrors a postwar railroad engineer's operating rules using the procedures documented in Public/1959RailRoadCodeRL.pdf, enforcing safety, dispatch, and signaling constraints in generation.
Building on the success of the Dakota grammar pipeline, we are applying the same Grammar-to-RL methodology to industrial safety.
- Objective: Train RL agents to operate trains strictly according to the 1959 Consolidated Code of Operating Rules.
- Methodology:
  - Extraction: VLM extracts rules from Public/1959RailRoadCodeRL.pdf.
  - Environment: RailroadEnv verifies adherence to safety protocols, signaling rules, and terminology.
  - Reward: Compositional reward function prioritizing Safety (critical), Procedure (sequence), and Terminology (exact phrasing).
- Status: In active development (see RailroadEngineer1959/).
1. Multi-Language Extension
   - Apply to related languages (Lakota, Stoney Nakoda)
   - Cross-lingual transfer learning
   - Comparative Siouan linguistics
2. Enhanced Synthesis
   - Context-aware sentence generation
   - Dialogue system integration
   - Cultural knowledge embedding
3. Evaluation Framework
   - Linguistic acceptability metrics
   - Native speaker validation
   - Grammaticality benchmarks
4. Morphological Productivity
   - Test generalization to unseen words
   - Measure affix composition ability
   - Analyze systematic errors
```bash
# Core extraction
pip install anthropic pillow python-dotenv

# RL training
pip install git+https://github.com/PrimeIntellect-ai/verifiers.git
pip install git+https://github.com/PrimeIntellect-ai/prime-rl.git

# Optional: Alternative VLM
pip install openai  # For OpenRouter/Qwen3-VL
```

API keys (set in .env):
- ANTHROPIC_API_KEY - Claude Sonnet 4.5 (required)
- OPENROUTER_API_KEY - Qwen3-VL (optional)
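A small sketch of loading these keys with python-dotenv before constructing the Anthropic client, assuming a .env file in the project root:

```python
import os
from dotenv import load_dotenv
from anthropic import Anthropic

load_dotenv()  # reads ANTHROPIC_API_KEY (and optionally OPENROUTER_API_KEY) from .env
client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
```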
- Extraction: Python 3.8+, 8GB RAM
- Training: Distributed via PrimeIntellect (or local GPU)
- Storage: ~5GB for images + extractions
```bash
git clone https://github.com/HarleyCoops/Dakota1890.git
cd Dakota1890
python -m pip install -r requirements.txt

# Add API keys to .env
echo "ANTHROPIC_API_KEY=your_key_here" > .env
```

```bash
python scripts/extraction/convert_all_images.py
# Output: 440 images in data/processed_images/
```

```bash
python scripts/extraction/extract_grammar_pages.py --pages 31-92 --yes
# Output: 1,036 rules in data/grammar_extracted/
```

```bash
python scripts/rl/organize_grammar_for_rl.py --input data/grammar_extracted/
python scripts/conversion/convert_rules_to_primeintellect.py
# Output: 5,657 tasks in dakota_rl_training/datasets/
```

```bash
cd dakota_rl_training
python train.py --config configs/training_config.yaml
# Follow instructions to install PrimeIntellect and launch
```

The repo ships with two trainer presets:
- dakota_rl_training/configs/train_30b.toml – the lightweight config we use in the quick start above. It has no curriculum logic; every RL step samples uniformly from whatever dataset the environment serves.
- dakota_rl_training/configs/train.toml (and the matching training_config.yaml used by python train.py) – these contain an explicit [curriculum] block with staged datasets (easy → medium → hard) plus target step ranges.
To activate the curriculum when launching through the Prime RL CLI, point each component at the staged configs:
```bash
uv run rl \
  --trainer @ ~/dakota-rl-training/configs/train.toml \
  --orchestrator @ ~/dakota-rl-training/configs/orch.toml \
  --inference @ ~/dakota-rl-training/configs/infer.toml \
  ...  # other flags (gpu ids, wandb, etc.)
```

The orchestrator/inference presets that accompany train.toml live in the same folder (orch.toml, infer.toml) and expect the staged datasets described in training_config.yaml. If you stick with the *_30b.toml files, no curriculum will trigger. In short: choose the config trio you pass to uv run rl based on whether you want uniform sampling (*_30b.toml) or curriculum-driven sampling (train.toml + orch.toml + infer.toml).
If you use this work, please cite:
```bibtex
@software{dakota_rl_grammar_2025,
  title = {Dakota Language Preservation Through VLM-Based Grammar Extraction and RL Training},
  author = {Cooper, Christian H.},
  year = {2025},
  url = {https://github.com/[repo]},
  note = {Novel closed-loop methodology for low-resource language training}
}
```

- Stephen Return Riggs: Original Dakota grammar documentation (1890)
- Internet Archive: Historical text preservation
- PrimeIntellect: Distributed RL training infrastructure
- Anthropic: Claude Sonnet 4.5 VLM capabilities
- Dakota Language Community: Ongoing language revitalization efforts
Code is licensed under Apache-2.0 (see LICENSE). The included historical Dakota text (Riggs, 1890) is in the public domain; see DATA_LICENSE.md for details and scope.
This methodology is designed to be reproducible and extensible. Contributions welcome for:
- Additional language applications
- Improved synthesis methods
- Enhanced verification functions
- Evaluation frameworks
See docs/CONTRIBUTING.md for guidelines.