Project Overview
I am building a Natural Language → T-SQL system for Microsoft SQL Server.
Expected behavior:
If a user asks a natural-language question (e.g., “How many users used smartphones last month?”), the system should generate a valid and logically correct T-SQL query.
Constraints
- Maximum GPU memory: 40 GB
- Deployment: Local GPU inference only
- No internet access after training (fully offline deployment)
- This restricts model size and external API usage
Current Architecture
- LLM: defog/sqlcoder-7b-2
- Fine-tuning: ~2,500 complex SQL queries
  - Multi-table JOINs
  - Aggregations
  - Date logic
- Schema Handling (RAG):
  - Table and column descriptions stored separately
  - Embedded using MiniLM
  - Retrieved via cosine similarity
- Generation Flow:
  - User NL query
  - Retrieve relevant schema context
  - Inject schema into prompt
  - Generate T-SQL
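For concreteness, the generation flow above can be sketched roughly as follows. This is a minimal illustration, not my actual code: the `SCHEMA_DOCS` entries, the prompt layout, and the function names are made up for the example, and `embed` is a toy bag-of-words stand-in for the MiniLM encoder so the snippet runs without any model.

```python
import math
import re
from collections import Counter

# Hypothetical schema descriptions; a real system would load these from its schema store.
SCHEMA_DOCS = {
    "Users": "Users table: UserId, SignupDate, Country. One row per registered user.",
    "Devices": "Devices table: DeviceId, UserId, DeviceType such as smartphone or tablet.",
    "Invoices": "Invoices table: InvoiceId, CustomerId, Amount, IssuedAt billing records.",
}

def embed(text):
    """Toy bag-of-words vector; a stand-in for the MiniLM sentence encoder."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question, k=2):
    """Return the names of the k schema entries most similar to the question."""
    q = embed(question)
    ranked = sorted(SCHEMA_DOCS, key=lambda name: cosine(q, embed(SCHEMA_DOCS[name])), reverse=True)
    return ranked[:k]

def build_prompt(question, k=2):
    """Inject the retrieved schema descriptions into the prompt sent to the LLM."""
    schema_block = "\n".join(SCHEMA_DOCS[t] for t in retrieve(question, k))
    return f"### Schema\n{schema_block}\n\n### Question\n{question}\n\n### T-SQL\n"

print(build_prompt("How many users used a smartphone last month?"))
```

Note that with top-k retrieval a bridging table that is needed only for the JOIN, and is never mentioned in the question, can easily fall outside the top k.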
What Works
- Simple queries
- Single-table queries
- WHERE / GROUP BY / HAVING
- Basic aggregations
Issue
For complex queries involving:
- Multiple JOINs
- SQL Server date functions (DATEADD, DATEDIFF, CONVERT)
- Cross-table business logic
the model often:
- Chooses incorrect JOIN paths
- Misses required tables
- Hallucinates columns
- Produces SQL Server–invalid date syntax
- Generates logically incorrect queries
This happens despite fine-tuning and schema grounding.
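Two of these failure modes (hallucinated columns and non-T-SQL date syntax) look mechanically detectable after generation. The sketch below is a rough regex-based check, not a real SQL parser: the `SCHEMA` dictionary and the pattern list are illustrative assumptions, and it assumes queries without schema prefixes such as `dbo.` and without dots inside string literals.

```python
import re

# Hypothetical schema: table name -> set of column names.
SCHEMA = {
    "Users": {"UserId", "SignupDate"},
    "Devices": {"DeviceId", "UserId", "DeviceType", "LastSeenAt"},
}

# Constructs from other dialects (MySQL/PostgreSQL) that SQL Server rejects.
NON_TSQL = {
    r"\bNOW\s*\(": "NOW() -> use GETDATE() or SYSDATETIME()",
    r"\bCURDATE\s*\(": "CURDATE() -> use CAST(GETDATE() AS date)",
    r"\bDATE_SUB\s*\(": "DATE_SUB() -> use DATEADD() with a negative number",
    r"\bINTERVAL\b": "INTERVAL literal -> use DATEADD()",
    r"\bLIMIT\s+\d+": "LIMIT n -> use TOP (n) or OFFSET ... FETCH",
}

# Table name plus optional alias after FROM/JOIN; keywords are not treated as aliases.
FROM_JOIN = (
    r"\b(?:FROM|JOIN)\s+(\w+)"
    r"(?:\s+(?:AS\s+)?(?!(?:ON|WHERE|JOIN|INNER|LEFT|RIGHT|FULL|CROSS|GROUP|ORDER|HAVING)\b)(\w+))?"
)

def check_sql(sql):
    """Return a list of human-readable problems found in a generated query."""
    problems = []
    for pattern, hint in NON_TSQL.items():
        if re.search(pattern, sql, flags=re.IGNORECASE):
            problems.append(f"non-T-SQL syntax: {hint}")

    # Map table names and aliases (case-insensitively) to known tables.
    tables_by_lower = {t.lower(): t for t in SCHEMA}
    aliases = {}
    for table, alias in re.findall(FROM_JOIN, sql, flags=re.IGNORECASE):
        canonical = tables_by_lower.get(table.lower())
        if canonical is None:
            problems.append(f"unknown table: {table}")
            continue
        aliases[table.lower()] = canonical
        if alias:
            aliases[alias.lower()] = canonical

    # Check every qualifier.column reference against the schema.
    for qualifier, column in re.findall(r"\b([A-Za-z_]\w*)\.([A-Za-z_]\w*)\b", sql):
        table = aliases.get(qualifier.lower())
        if table is None:
            problems.append(f"unknown table or alias: {qualifier}")
        elif column.lower() not in {c.lower() for c in SCHEMA[table]}:
            problems.append(f"unknown column: {table}.{column}")
    return problems

bad = ("SELECT COUNT(DISTINCT u.UserId) FROM Users u JOIN Devices d ON d.UserId = u.UserId "
       "WHERE d.Type = 'smartphone' AND d.LastSeenAt >= DATE_SUB(NOW(), INTERVAL 1 MONTH)")
for problem in check_sql(bad):
    print(problem)
```

A check like this could feed its findings back into a retry prompt, but it cannot catch logically wrong queries that are syntactically valid and use real columns.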
Questions
- Is this mainly a limitation of a 7B model on queries this complex?
- Would explicitly injecting foreign-key relationships / join graphs into the prompt help?
- Is a query-planning stage (join planning → filters → final SQL) recommended?
- Any best practices for T-SQL–specific correctness?
- Given the offline and 40 GB GPU constraints, would any of the following be more reliable?
  - Larger quantized models
  - Multi-stage planners
  - Rule-based join resolution + LLM
- Are there any open-source or production-grade natural-language-to-SQL architectures that handle complex joins reliably under similar constraints?
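To make the "rule-based join resolution + LLM" option concrete, this is roughly what I have in mind: compute the join path deterministically from the foreign-key graph with a breadth-first search, then let the model fill in only the SELECT list and filters. The `FOREIGN_KEYS` entries and function names are hypothetical, and I have not measured whether this actually improves accuracy.

```python
from collections import deque

# Hypothetical foreign keys: (child_table, child_column, parent_table, parent_column).
FOREIGN_KEYS = [
    ("Devices", "UserId", "Users", "UserId"),
    ("Sessions", "DeviceId", "Devices", "DeviceId"),
    ("Orders", "UserId", "Users", "UserId"),
]

def build_graph(fks):
    """Undirected adjacency map: table -> list of (neighbor, join_condition)."""
    graph = {}
    for child, ccol, parent, pcol in fks:
        cond = f"{child}.{ccol} = {parent}.{pcol}"
        graph.setdefault(child, []).append((parent, cond))
        graph.setdefault(parent, []).append((child, cond))
    return graph

def join_path(graph, start, goal):
    """Shortest chain of (table, condition) joins from start to goal, or None if unreachable."""
    queue = deque([start])
    came_from = {start: None}
    while queue:
        table = queue.popleft()
        if table == goal:
            break
        for neighbor, cond in graph.get(table, []):
            if neighbor not in came_from:
                came_from[neighbor] = (table, cond)
                queue.append(neighbor)
    if goal not in came_from:
        return None
    # Walk back from the goal to the start, then reverse into join order.
    steps = []
    node = goal
    while came_from[node] is not None:
        prev, cond = came_from[node]
        steps.append((node, cond))
        node = prev
    return steps[::-1]

def from_clause(graph, start, goal):
    """Render the resolved path as a FROM/JOIN skeleton to place in the prompt."""
    path = join_path(graph, start, goal)
    if path is None:
        raise ValueError(f"no foreign-key path from {start} to {goal}")
    lines = [f"FROM {start}"]
    lines += [f"JOIN {table} ON {cond}" for table, cond in path]
    return "\n".join(lines)

graph = build_graph(FOREIGN_KEYS)
print(from_clause(graph, "Users", "Sessions"))
```

A shortest path is only a heuristic: when two tables are connected by more than one foreign-key route, the shortest one is not necessarily the one the business question implies.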
Goal
To generate correct, production-ready T-SQL for complex NL queries under offline and 40 GB GPU constraints.
Thanks in advance for any guidance or references!