Fine-tuning Small Model (Qwen3-0.6B) for Domain Knowledge + Reasoning: Seeking Optimization Advice

Background & Goal

I’m working with a small model (Qwen3-0.6B, <1B parameters) due to resource constraints, aiming to:

1. Achieve high accuracy on domain-specific knowledge (mechanical engineering/CAD, text format)

2. Maintain general conversational ability

3. Enable reasoning capability for MCP tool selection

Current Setup

· Model: Qwen3-0.6B

· Platform: LLaMA-Factory

· Method: Fine-tuning only

Training Experiments & Results

Experiment 1: Domain Knowledge Only

Dataset:

· Chinese mechanical-engineering QA (mostly structured, plus some unstructured text)

· Format: Alpaca (an example record is sketched after this list)

o Self-Instruct/Evol-Instruct augmentation did not yield good results due to the closed-domain QA constraints

· Size: 2,300 samples
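
For reference, a single record in that Alpaca format looks roughly like this (the content below is an invented placeholder, not an actual sample from my dataset):

```python
# Hypothetical Alpaca-format record; the field names follow the standard
# Alpaca schema that LLaMA-Factory expects, but the content is invented.
sample = {
    "instruction": "What fit is typically specified for a press-fit bearing seat?",
    "input": "",  # left empty for plain QA pairs
    "output": "A press-fit bearing seat is usually specified with an interference fit such as H7/p6 ...",
}
```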

Training config:

· Method: LoRA (rank=192; lower ranks gave lower domain accuracy; a PEFT-style sketch follows this list)

· Cutoff length: 1024

· Epochs: 1 (kept low to avoid catastrophic forgetting of general ability)
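
In PEFT terms (which LLaMA-Factory wraps), that LoRA setup corresponds roughly to the config below; only the rank is something I set explicitly, so alpha, dropout, and target modules are assumptions:

```python
from peft import LoraConfig

# Rough PEFT equivalent of the LoRA settings above. r=192 matches my run;
# lora_alpha, lora_dropout, and target_modules are assumed defaults.
lora_config = LoraConfig(
    r=192,
    lora_alpha=384,        # assumption: 2x rank, a common heuristic
    lora_dropout=0.05,     # assumption
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
```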

Results:

· High accuracy on single-turn domain QA

· Limited ability in 2-4 turn multi-turn conversations within the domain

· Limited general conversation ability; the model sometimes answers general questions with domain knowledge

Experiment 2: Domain + Reasoning (1:1 ratio)

Motivation:

· Qwen3-0.6B can select MCP tools with prompting (without fine-tuning)

· After domain fine-tuning, the model lost its reasoning/thinking process

· Need to restore reasoning capability

Dataset:

· Domain QA: 2,300 samples

· Reasoning: 2,300 samples from twinkle-ai/tw-reasoning-instruct-50k (mixed 1:1 with the domain set, as sketched after this list)
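
The 1:1 mix itself was nothing fancy; conceptually it is just concatenating equal-sized sets and shuffling, e.g. with the `datasets` library (the file names below are placeholders):

```python
from datasets import load_dataset, concatenate_datasets

# Placeholder file names; both files hold Alpaca-format records.
domain = load_dataset("json", data_files="domain_qa.json", split="train")
reasoning = load_dataset("json", data_files="reasoning_2300.json", split="train")

# 1:1 mix: equal counts, concatenated, then shuffled together.
mixed = concatenate_datasets([domain, reasoning]).shuffle(seed=42)
mixed.to_json("mixed_train.json")
```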

Training config:

· Method: Full fine-tuning (switched from LoRA because even rank=512 did not outperform full fine-tuning once data diversity and volume increased; rough hyperparameters are sketched after this list)

· Epochs: 1
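
The full fine-tuning hyperparameters map onto something like the `transformers` sketch below; only the epoch count is taken from my run, everything else is an assumption standing in for LLaMA-Factory defaults on limited hardware:

```python
from transformers import TrainingArguments

# num_train_epochs=1 matches my run; the remaining values are assumptions.
args = TrainingArguments(
    output_dir="qwen3-0.6b-domain-reasoning",
    num_train_epochs=1,
    per_device_train_batch_size=2,   # assumed, memory-constrained
    gradient_accumulation_steps=8,   # assumed
    learning_rate=1e-5,              # assumed, typical for full FT
    bf16=True,
    logging_steps=10,
)
```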

Results:

· Domain knowledge accuracy dropped significantly

· General conversation improved

· Recovered reasoning ability on reasoning-style questions

· Reasonable MCP tool selection accuracy

· Could not maintain both strong domain knowledge AND reasoning ability

Experiment 3: Training on All the Domain Data

Dataset:

· Domain QA: 7,000 samples

· Reasoning: 7,000 samples

· Result: Domain knowledge accuracy degraded even further, and MCP tool-calling ability decreased

Experiment 4: Overfitting Attempt

· Extended the length of each domain QA sample to reduce the sample count (to 1,000 samples), and reduced the reasoning data to 1,000 samples to keep the 1:1 ratio (see the packing sketch after this list)

· Trained on both datasets to the point of overfitting (3-5 epochs)

· Result: High domain accuracy, some reasoning ability, no MCP tool-calling ability

Key Questions

1. Training Strategy: Is this an inherent limitation of fine-tuning small models (<1B) on multiple datasets at these data volumes, or is there room for optimization?

2. MCP Tool Selection: Does MCP tool selection require its own dedicated training dataset in my scenario? (A sketch of the kind of sample I mean follows below.)
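
For question 2, this is the kind of dedicated sample I have in mind; the tool name, schema, and dialogue are all invented for illustration, and only the overall layout mirrors the ShareGPT function-calling format that LLaMA-Factory documents:

```python
# Hypothetical tool-selection training sample. Every message and the
# "tools" string are made up; only the structure follows the ShareGPT
# function-calling layout (human / function_call / observation / gpt).
tool_sample = {
    "conversations": [
        {"from": "human", "value": "Find the CAD file for part number BRK-2041."},
        {"from": "function_call",
         "value": '{"name": "search_cad_files", "arguments": {"part_number": "BRK-2041"}}'},
        {"from": "observation", "value": '{"results": ["BRK-2041_rev3.step"]}'},
        {"from": "gpt", "value": "I found one matching file: BRK-2041_rev3.step."},
    ],
    "tools": '[{"name": "search_cad_files", "description": "Search the CAD vault by part number", "parameters": {"type": "object", "properties": {"part_number": {"type": "string"}}}}]',
}
```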

Any insights on balancing multiple capabilities in resource-constrained scenarios would be greatly appreciated!

Improvements seem possible. Given size constraints, it’s unclear how much can be resolved…

Bro, that model is too small. Even if you perfect the fine-tuning, you won’t achieve your goal. Maybe you should try RAG.

True… when there are no particular constraints, the RAG mechanism allows domain-specific knowledge to be used more accurately.
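
A rough sketch of what retrieval could look like here (the embedding model and the two-line toy corpus are placeholders):

```python
from sentence_transformers import SentenceTransformer, util

# Placeholder embedding model and a toy domain corpus.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
corpus = [
    "Press-fit bearing seats are typically toleranced with an interference fit.",
    "The standard coarse thread pitch for an M8 bolt is 1.25 mm.",
]
corpus_emb = embedder.encode(corpus, convert_to_tensor=True)

query = "What tolerance is used for bearing seats?"
query_emb = embedder.encode(query, convert_to_tensor=True)

# Retrieve the best-matching snippet and prepend it to the prompt,
# so the small model answers from context instead of from its weights.
hit = util.semantic_search(query_emb, corpus_emb, top_k=1)[0][0]
prompt = f"Context: {corpus[hit['corpus_id']]}\n\nQuestion: {query}"
print(prompt)
```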

Hey, if you’re still working with that model or if you want to experiment with larger ones, I have some unused A100s/V100s I can let you use for a bit. Email me at jack.lee - @ - rice.edu
