Background & Goal
I’m working with a small model (Qwen3-0.6B, <1B parameters) due to resource constraints, aiming to achieve:
1. High accuracy on domain-specific knowledge (mechanical engineering/CAD, text format)
2. Retained general conversational ability
3. Reasoning capability for MCP tool selection
Current Setup
· Model: Qwen3-0.6B
· Platform: LLaMA-Factory
· Method: Fine-tuning only
Training Experiments & Results
Experiment 1: Domain Knowledge Only
Dataset:
· Chinese mechanical-engineering QA (mostly structured, with some unstructured data)
· Format: Alpaca
o self-instruct/evol-instruct did not produce good results due to closed-domain QA constraints
· Size: 2300 samples
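For readers unfamiliar with the format, a single Alpaca-style record has three fields: `instruction`, optional `input`, and `output`. The QA content below is an illustrative placeholder, not a sample from the actual dataset:

```python
import json

# One Alpaca-format training record (placeholder content).
sample = {
    "instruction": "What is the typical pressure angle of a standard involute spur gear?",
    "input": "",
    "output": "Standard involute spur gears most commonly use a 20-degree pressure angle.",
}

print(json.dumps(sample, ensure_ascii=False, indent=2))
```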
Training config:
· Method: LoRA (rank=192; lower ranks gave lower domain accuracy)
· Cutoff length: 1024
· Epochs: 1 (kept low to avoid catastrophic forgetting of general ability)
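For reference, this setup roughly corresponds to a LLaMA-Factory YAML config like the sketch below. The dataset name, batch size, learning rate, and output path are placeholders I'm assuming, not values stated above:

```yaml
### model
model_name_or_path: Qwen/Qwen3-0.6B

### method
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 192
lora_target: all

### dataset
dataset: mech_qa_alpaca        # placeholder name registered in dataset_info.json
template: qwen
cutoff_len: 1024

### train
num_train_epochs: 1.0
per_device_train_batch_size: 2   # placeholder
learning_rate: 1.0e-4            # placeholder
output_dir: saves/qwen3-0.6b/lora/sft
```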
Results:
· High accuracy on single-turn domain QA
· Limited ability in 2-4-turn multi-turn domain conversations
· Limited general conversation ability – the model sometimes answers general questions with domain knowledge
Experiment 2: Domain + Reasoning (1:1 ratio)
Motivation:
· Qwen3-0.6B can select MCP tools with prompting (without fine-tuning)
· After domain fine-tuning, the model lost its reasoning/thinking process
· Need to restore reasoning capability
Dataset:
· Domain QA: 2300 samples
· Reasoning dataset: 2300 samples from twinkle-ai/tw-reasoning-instruct-50k
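One way to build the 1:1 mix is to truncate both sets to equal size and shuffle them into a single training file, so each batch stays mixed. This is a sketch with synthetic records (LLaMA-Factory can also mix multiple registered datasets directly, so a manual merge is just one option):

```python
import random

def mix_datasets(domain, reasoning, seed=42):
    """Combine two Alpaca-format lists at a 1:1 ratio and shuffle."""
    n = min(len(domain), len(reasoning))  # enforce the 1:1 ratio
    mixed = domain[:n] + reasoning[:n]
    random.Random(seed).shuffle(mixed)    # interleave so batches stay mixed
    return mixed

# Synthetic stand-ins for the two 2300-sample datasets.
domain = [{"instruction": f"domain q{i}", "input": "", "output": "a"} for i in range(2300)]
reasoning = [{"instruction": f"reasoning q{i}", "input": "", "output": "a"} for i in range(2300)]

mixed = mix_datasets(domain, reasoning)
print(len(mixed))  # 4600
```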
Training config:
· Method: Full fine-tuning (switched from LoRA because even rank=512 did not match full fine-tuning once data diversity and volume increased)
· Epochs: 1
Results:
· Domain knowledge accuracy dropped significantly
· General conversation improved
· Reasoning ability restored on reasoning-style questions
· Reasonable MCP tool selection accuracy
· Cannot maintain both strong domain knowledge AND reasoning ability
Experiment 3: Training on All the Domain Data
Dataset:
· Domain QA: 7,000 samples
· Reasoning: 7,000 samples
· Result: Domain knowledge accuracy degraded even further, and MCP tool-calling ability decreased
Experiment 4: Overfitting Attempt
· Lengthened each domain QA sample to reduce the sample count (to 1,000) and reduced the reasoning data to 1,000 samples to keep the 1:1 ratio
· Trained on both datasets to the point of overfitting (3-5 epochs)
· Result: High domain accuracy, some reasoning ability, no MCP tool-calling ability
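One way to implement "lengthening samples to reduce sample count" is to pack several short QA pairs into one multi-turn conversation, which preserves all content while cutting the record count. A sketch assuming ShareGPT-style records (a format LLaMA-Factory supports); the pack size `k` is a free parameter:

```python
def pack_qa(samples, k=3):
    """Pack k short Alpaca QA pairs into one ShareGPT-style multi-turn record."""
    packed = []
    for i in range(0, len(samples), k):
        turns = []
        for s in samples[i:i + k]:
            turns.append({"from": "human", "value": s["instruction"]})
            turns.append({"from": "gpt", "value": s["output"]})
        packed.append({"conversations": turns})
    return packed

# 9 short QA pairs become 3 multi-turn records.
qa = [{"instruction": f"q{i}", "output": f"a{i}"} for i in range(9)]
print(len(pack_qa(qa)))  # 3
```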
Key Questions
1. Training Strategy: Is this an inherent limitation of fine-tuning small models (<1B) on multiple datasets at these data volumes, or is there room for optimization?
2. MCP Tool Selection: Should MCP tool selection get its own dedicated training dataset in my scenario?
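If a dedicated tool-selection set does turn out to be needed, one common shape is a ShareGPT-with-tools record using the roles "human", "function_call", "observation", and "gpt", as in LLaMA-Factory's tool-call datasets. The tool name, schema, and part number below are entirely invented for illustration:

```python
import json

# Hypothetical MCP tool-selection training record (invented tool and content).
tools = [{
    "name": "cad_lookup",
    "description": "Look up a CAD part specification by part number.",
    "parameters": {
        "type": "object",
        "properties": {"part_number": {"type": "string"}},
        "required": ["part_number"],
    },
}]

record = {
    "conversations": [
        {"from": "human", "value": "What are the dimensions of part GB-T-5782-M8?"},
        {"from": "function_call",
         "value": json.dumps({"name": "cad_lookup",
                              "arguments": {"part_number": "GB-T-5782-M8"}})},
        {"from": "observation", "value": json.dumps({"length_mm": 40, "thread": "M8"})},
        {"from": "gpt", "value": "Part GB-T-5782-M8 is an M8 bolt with a 40 mm length."},
    ],
    "tools": json.dumps(tools),
}

print(record["conversations"][1]["from"])  # function_call
```

A few hundred such records mixed into the main training set may be enough to anchor tool-calling behavior without crowding out the domain data.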
Any insights on balancing multiple capabilities in resource-constrained scenarios would be greatly appreciated!