Cyborg Translator (EN → RU)
Overview
Cyborg Translator is a custom-trained English → Russian neural machine translation model. The project focuses on data quality, tokenizer design, and bidirectional translation robustness rather than scale.
This model was trained end-to-end on a self-curated parallel corpus derived from cleaned literary and technical texts.
Model Details
- Architecture: Transformer (GPT-style causal LM adapted for translation)
- Parameters: ~300M
- Precision: FP32
- Tokenizer: Custom BPE (32k vocab; see the training sketch below)
- Framework: PyTorch / Hugging Face Transformers
- Training Style: Supervised bilingual sequence modeling
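A tokenizer of this shape can be reproduced with the Hugging Face tokenizers library. The sketch below is an assumption about the recipe, not the project's exact script: the corpus file names, special tokens, and byte-level pre-tokenization are illustrative.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers
# Build a byte-level BPE tokenizer with a 32k vocabulary (illustrative recipe).
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
trainer = trainers.BpeTrainer(
    vocab_size=32_000,
    special_tokens=["<unk>", "<pad>", "<bos>", "<eos>"],  # assumed special tokens
)
# corpus.en / corpus.ru stand in for the aligned English and Russian sides.
tokenizer.train(["corpus.en", "corpus.ru"], trainer)
tokenizer.save("cyborg-bpe-32k.json")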
Training Data
- ~40 public-domain English books
- Extensive text normalization and deduplication
- Sentence-level alignment
- Russian translations generated and filtered programmatically
- Multiple cleaning passes (length, language ID, punctuation, encoding)
Emphasis was placed on corpus hygiene and alignment fidelity; the sketch below illustrates the kind of per-pair filtering involved.
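A minimal sketch of the length, language-ID, punctuation, and encoding filters listed above, assuming a langdetect-based language check; the thresholds and library choice are illustrative, not the project's exact pipeline.
import unicodedata
from langdetect import detect  # assumed language-ID dependency
def keep_pair(en: str, ru: str) -> bool:
    # Length filter: drop empty, very long, or length-mismatched pairs.
    if not (1 <= len(en.split()) <= 200 and 1 <= len(ru.split()) <= 200):
        return False
    if not 0.4 <= len(en) / max(len(ru), 1) <= 2.5:
        return False
    # Language-ID filter: each side must detect as the expected language.
    try:
        if detect(en) != "en" or detect(ru) != "ru":
            return False
    except Exception:
        return False
    # Punctuation filter: brackets should be balanced on both sides.
    for open_c, close_c in [("(", ")"), ("[", "]")]:
        if en.count(open_c) != en.count(close_c) or ru.count(open_c) != ru.count(close_c):
            return False
    # Encoding filter: reject replacement and control characters.
    for text in (en, ru):
        if "\ufffd" in text:
            return False
        if any(unicodedata.category(c) == "Cc" and c not in "\n\t" for c in text):
            return False
    return True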
Intended Use
- English → Russian translation research
- Studying effects of tokenizer choice on MT quality (see the fertility probe below)
- Low-resource MT experimentation
- Educational purposes
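For the tokenizer-effects use case, one quick probe is subword fertility, the average number of tokens per whitespace word in each language. The helper below is an illustrative sketch using the published checkpoint; the sample sentences are made up.
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("LoganResearch/cyborg-translator-en-ru")
def fertility(text: str) -> float:
    # Average number of subword tokens per whitespace-separated word.
    return len(tok.tokenize(text)) / len(text.split())
print(fertility("The quick brown fox jumps over the lazy dog."))
print(fertility("Быстрая рыжая лиса перепрыгивает через ленивую собаку."))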
Limitations
- Not instruction-tuned
- May hallucinate on ambiguous input
- No safety fine-tuning
- Not suitable for production or legal/medical use
Reproducibility
Inference example:
from transformers import AutoTokenizer, AutoModelForCausalLM
# Load the tokenizer and model from the Hub.
tok = AutoTokenizer.from_pretrained("LoganResearch/cyborg-translator-en-ru")
model = AutoModelForCausalLM.from_pretrained("LoganResearch/cyborg-translator-en-ru")
# Tokenize the English source, generate, and decode the Russian output.
inputs = tok("Hello world", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=100)
print(tok.decode(out[0], skip_special_tokens=True))
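For translation, beam search often reads better than the greedy default used above; the decoding parameters below are illustrative assumptions, not tuned values from the project.
# Beam-search variant of the generate call above.
out = model.generate(**inputs, max_new_tokens=100, num_beams=4, early_stopping=True)
print(tok.decode(out[0], skip_special_tokens=True))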