Cyborg Translator (EN ↔ RU)

Overview

Cyborg Translator is a custom-trained English ↔ Russian neural machine translation model. The project focuses on data quality, tokenizer design, and bidirectional translation robustness rather than scale.

This model was trained end-to-end on a self-curated parallel corpus derived from cleaned literary and technical texts.

Model Details

  • Architecture: Transformer (GPT-style causal LM adapted for translation)
  • Parameters: ~300M
  • Precision: FP32
  • Tokenizer: Custom BPE (32k vocab; see the training sketch after this list)
  • Framework: PyTorch / Hugging Face Transformers
  • Training Style: Supervised bilingual sequence modeling
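
The tokenizer training recipe itself is not published here. The following is a minimal sketch of how a shared 32k BPE vocabulary could be trained with the Hugging Face tokenizers library; the file names and special tokens are assumptions, not the model's actual configuration.

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Assumed setup: one shared EN+RU BPE vocabulary with placeholder special tokens
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=32000, special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]"])

# corpus.en.txt / corpus.ru.txt are hypothetical names for the cleaned corpus files
tokenizer.train(files=["corpus.en.txt", "corpus.ru.txt"], trainer=trainer)
tokenizer.save("cyborg-bpe-32k.json")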

Training Data

  • ~40 public-domain English books
  • Extensive text normalization and deduplication
  • Sentence-level alignment
  • Russian translations generated and filtered programmatically
  • Multiple cleaning passes (length, language ID, punctuation, encoding)

Emphasis was placed on corpus hygiene and alignment fidelity.
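
The cleaning passes above are described only at a high level. As a rough illustration, a pair-level filter covering the length, language-ID, and encoding passes might look like the sketch below (the punctuation pass is omitted). langid is an assumed choice of language-ID tool, and the thresholds are illustrative, not the project's actual values.

import langid  # pip install langid; assumed language-ID tool

def keep_pair(en: str, ru: str, min_tokens=3, max_tokens=200, max_ratio=2.0) -> bool:
    """Illustrative filter: length, length ratio, language ID, and encoding."""
    en_len, ru_len = len(en.split()), len(ru.split())
    # Length filter: drop pairs that are too short or too long
    if not (min_tokens <= en_len <= max_tokens and min_tokens <= ru_len <= max_tokens):
        return False
    # Length-ratio filter: badly unbalanced pairs are likely misalignments
    if max(en_len, ru_len) / min(en_len, ru_len) > max_ratio:
        return False
    # Language-ID filter: each side must be detected as the expected language
    if langid.classify(en)[0] != "en" or langid.classify(ru)[0] != "ru":
        return False
    # Encoding filter: drop pairs containing the Unicode replacement character
    if "\ufffd" in en or "\ufffd" in ru:
        return False
    return True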

Intended Use

  • English ↔ Russian translation research
  • Studying the effects of tokenizer choice on MT quality (see the evaluation sketch after this list)
  • Low-resource MT experimentation
  • Educational purposes
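
For the tokenizer and MT-quality studies above, corpus-level BLEU via sacrebleu is a standard yardstick. A minimal sketch follows; the sentence pairs are placeholders, not outputs of this model.

import sacrebleu  # pip install sacrebleu

# Hypothetical system outputs and aligned references (placeholders only)
hypotheses = ["Привет, мир.", "Хорошо, спасибо."]
references = [["Привет, мир!", "Хорошо, спасибо."]]  # one reference per hypothesis

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}")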

Limitations

  • Not instruction-tuned
  • May hallucinate under ambiguous input
  • No safety fine-tuning
  • Not suitable for production or legal/medical use

Reproducibility

Inference example:

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the custom 32k BPE tokenizer and the ~300M-parameter model
tok = AutoTokenizer.from_pretrained("LoganResearch/cyborg-translator-en-ru")
model = AutoModelForCausalLM.from_pretrained("LoganResearch/cyborg-translator-en-ru")

# Tokenize the English source and generate the Russian translation
inputs = tok("Hello world", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=100)
print(tok.decode(out[0], skip_special_tokens=True))
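
Greedy decoding is shown above for brevity. For translation, beam search usually produces more adequate output; num_beams and early_stopping are standard arguments to generate:

out = model.generate(**inputs, max_new_tokens=100, num_beams=4, early_stopping=True)
print(tok.decode(out[0], skip_special_tokens=True))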