Cyborg Translator (EN → RU)
Overview
Cyborg Translator is a custom-trained English → Russian neural machine translation model. The project focuses on data quality, tokenizer design, and bidirectional translation robustness rather than scale.
This model was trained end-to-end on a self-curated parallel corpus derived from cleaned literary and technical texts.
Model Details
- Architecture: Transformer (GPT-style causal LM adapted for translation)
- Parameters: ~300M
- Precision: FP32
- Tokenizer: Custom BPE (32k vocab; see the training sketch below)
- Framework: PyTorch / Hugging Face Transformers
- Training Style: Supervised bilingual sequence modeling
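A tokenizer of this shape can be reproduced with the Hugging Face tokenizers library. The sketch below is an assumption about the recipe, not the project's exact script: the corpus file names, special tokens, and byte-level pre-tokenization are illustrative.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers
# Build a byte-level BPE tokenizer with a 32k vocabulary (illustrative recipe).
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
trainer = trainers.BpeTrainer(
    vocab_size=32_000,
    special_tokens=["<unk>", "<pad>", "<bos>", "<eos>"],  # assumed special tokens
)
# corpus.en / corpus.ru stand in for the aligned English and Russian sides.
tokenizer.train(["corpus.en", "corpus.ru"], trainer)
tokenizer.save("cyborg-bpe-32k.json")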
Training Data
- ~40 public-domain English books
- Extensive text normalization and deduplication
- Sentence-level alignment
- Russian translations generated and filtered programmatically
- Multiple cleaning passes (length, language ID, punctuation, encoding)
Emphasis was placed on corpus hygiene and alignment fidelity; the sketch below illustrates the kind of per-pair filtering involved.
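A minimal sketch of the length, language-ID, punctuation, and encoding filters listed above, assuming a langdetect-based language check; the thresholds and library choice are illustrative, not the project's exact pipeline.
import unicodedata
from langdetect import detect  # assumed language-ID dependency
def keep_pair(en: str, ru: str) -> bool:
    # Length filter: drop empty, very long, or length-mismatched pairs.
    if not (1 <= len(en.split()) <= 200 and 1 <= len(ru.split()) <= 200):
        return False
    if not 0.4 <= len(en) / max(len(ru), 1) <= 2.5:
        return False
    # Language-ID filter: each side must detect as the expected language.
    try:
        if detect(en) != "en" or detect(ru) != "ru":
            return False
    except Exception:
        return False
    # Punctuation filter: brackets should be balanced on both sides.
    for open_c, close_c in [("(", ")"), ("[", "]")]:
        if en.count(open_c) != en.count(close_c) or ru.count(open_c) != ru.count(close_c):
            return False
    # Encoding filter: reject replacement and control characters.
    for text in (en, ru):
        if "\ufffd" in text:
            return False
        if any(unicodedata.category(c) == "Cc" and c not in "\n\t" for c in text):
            return False
    return True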
Intended Use
- English → Russian translation research
- Studying effects of tokenizer choice on MT quality (see the fertility probe below)
- Low-resource MT experimentation
- Educational purposes
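For the tokenizer-effects use case, one quick probe is subword fertility, the average number of tokens per whitespace word in each language. The helper below is an illustrative sketch using the published checkpoint; the sample sentences are made up.
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("LoganResearch/cyborg-translator-en-ru")
def fertility(text: str) -> float:
    # Average number of subword tokens per whitespace-separated word.
    return len(tok.tokenize(text)) / len(text.split())
print(fertility("The quick brown fox jumps over the lazy dog."))
print(fertility("Быстрая рыжая лиса перепрыгивает через ленивую собаку."))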
Limitations
- Not instruction-tuned
- May hallucinate on ambiguous input
- No safety fine-tuning
- Not suitable for production or legal/medical use
Reproducibility
Inference example:
from transformers import AutoTokenizer, AutoModelForCausalLM
# Load the tokenizer and model from the Hub.
tok = AutoTokenizer.from_pretrained("LoganResearch/cyborg-translator-en-ru")
model = AutoModelForCausalLM.from_pretrained("LoganResearch/cyborg-translator-en-ru")
# Tokenize the English source, generate, and decode the Russian output.
inputs = tok("Hello world", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=100)
print(tok.decode(out[0], skip_special_tokens=True))
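For translation, beam search often reads better than the greedy default used above; the decoding parameters below are illustrative assumptions, not tuned values from the project.
# Beam-search variant of the generate call above.
out = model.generate(**inputs, max_new_tokens=100, num_beams=4, early_stopping=True)
print(tok.decode(out[0], skip_special_tokens=True))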