Looking for best LLMs to search SMS messages

Hi all.
I’ve been using BERT models against a dataset of SMS messages, and the results vary wildly depending on what I use as my search criteria. I think models that are more textspeak-specific would generate better results, since they would take into account the abbreviations, acronyms, poor grammar, etc.
As a follow-up, I’m looking for the best way to check accuracy and compare different models beyond manually checking results.
Any advice would be massively appreciated :)
Thanks

I ran into something similar building a search tool for archived messages from a SIP trunking provider. I got decent results using a distilled BERT model with simple embeddings and cosine similarity. It wasn’t perfect, but it was good enough for grouping similar message types. I also added a basic keyword filter before the model call to cut down noise, which sped things up and kept results relevant.
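For illustration, the filter-then-rank flow described above might look like the sketch below. This is not the poster's actual code: `toy_embed` is a hypothetical letter-frequency stand-in for a real model's encode call, and the function names, the `keywords` parameter, and `top_k` are all assumptions made for the example.

```python
import math
import re
from typing import Callable, List, Sequence, Set, Tuple

LETTERS = "abcdefghijklmnopqrstuvwxyz"


def toy_embed(text: str) -> List[float]:
    """Hypothetical stand-in for a real embedding model: per-letter counts."""
    lowered = text.lower()
    return [float(lowered.count(ch)) for ch in LETTERS]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def words(text: str) -> Set[str]:
    """Lowercased word set, ignoring punctuation."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def filtered_search(
    query: str,
    messages: Sequence[str],
    keywords: Sequence[str],
    embed: Callable[[str], Sequence[float]] = toy_embed,
    top_k: int = 3,
) -> List[Tuple[float, str]]:
    # Cheap keyword filter first, so the (expensive) embedder only
    # sees messages that share at least one keyword.
    wanted = {k.lower() for k in keywords}
    candidates = [m for m in messages if wanted & words(m)]
    # Rank the surviving candidates by cosine similarity to the query.
    q_vec = embed(query)
    scored = [(cosine(q_vec, embed(m)), m) for m in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


messages = [
    "your parcel is out for delivery",
    "delivery failed, pls reschedule",
    "lunch at 1?",
]
hits = filtered_search("delivery failed", messages, keywords=["delivery"])
for score, text in hits:
    print(f"{score:.3f}  {text}")
```

To try this with a real model, pass its encoding function as `embed`; the keyword filter and the ranking step stay the same.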


I’ve been working on something similar and found that smaller, quantized models like MiniLM or BGE work well if you’re tight on resources. I also tested Sentence-Transformers with cosine similarity for basic ranking. In one case, I pulled archived chats from sms.to to create a small vector store using FAISS, which gave faster lookups when testing different prompts. GPT4 embeddings were solid, but the costs add up fast.


Plan (pseudocode)

  • Config
    • alpha (BM25 vs. dense weight), top_k, rerank_k.
  • Preprocess
    • Lowercase, strip headers/signatures.
    • Normalize phone numbers/URLs/dates.
    • Expand common SMS abbreviations (e.g., u→you, brb→be right back).
    • Preserve original text for display.
  • Load data
    • CSV → SMSRecord(id, text, …).
    • Skip empty/duplicate rows.
  • Index
    • Tokenize → build BM25 (rank_bm25 or fallback TF-IDF).
    • Encode with Sentence-Transformers → dense matrix.
    • Build ANN index (FAISS if available, else brute-force).
    • Persist artifacts (JSON meta + .npy embeddings).
  • Search(query)
    • Preprocess query.
    • BM25 scores.
    • Dense similarity (cosine) via ANN.
    • Normalize and combine: score = alpha*bm25_norm + (1-alpha)*dense_norm.
    • Merge, dedupe, take top_k.
    • Optional cross-encoder re-rank top rerank_k.
  • Evaluate
    • CSV: query,positive_ids (comma-separated ids).
    • Compute Recall@k, MRR@k, nDCG@k, MAP.
  • CLI
    • index --data sms.csv --out ./index_dir
    • search --index ./index_dir --q "where r u"
    • eval --index ./index_dir --qrels qrels.csv
    • Optional serve (FastAPI) for quick HTTP demo.
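
The score-combination and evaluation steps in the plan above can be sketched in isolation, without any model or index. This is a minimal sketch, not the code from the file below: the function names are hypothetical, min-max scaling is assumed as the normalization (the plan does not say which one it uses), and a document missing from one result list is assumed to score 0 on that side.

```python
from typing import Dict, List, Sequence, Set


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; if all scores are equal, map them to 0.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def fuse(
    bm25: Dict[str, float],
    dense: Dict[str, float],
    alpha: float = 0.5,
    top_k: int = 10,
) -> List[str]:
    """score = alpha*bm25_norm + (1-alpha)*dense_norm, best first."""
    b, d = min_max(bm25), min_max(dense)
    combined = {
        doc_id: alpha * b.get(doc_id, 0.0) + (1 - alpha) * d.get(doc_id, 0.0)
        for doc_id in set(b) | set(d)  # union also dedupes the two lists
    }
    # Sort by score descending; break ties by id for a stable order.
    return sorted(combined, key=lambda i: (-combined[i], i))[:top_k]


def recall_at_k(ranked: Sequence[str], positives: Set[str], k: int) -> float:
    """Fraction of the relevant ids that appear in the top k."""
    if not positives:
        return 0.0
    return len(set(ranked[:k]) & positives) / len(positives)


def mrr_at_k(ranked: Sequence[str], positives: Set[str], k: int) -> float:
    """Reciprocal rank of the first relevant id in the top k, else 0.0."""
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in positives:
            return 1.0 / rank
    return 0.0


# Toy scores for one query; ids and values are made up for the example.
bm25_scores = {"a": 12.0, "b": 3.0, "c": 0.0}
dense_scores = {"a": 0.2, "b": 0.9, "d": 0.6}
ranked = fuse(bm25_scores, dense_scores, alpha=0.5)
relevant = {"a", "c"}
print(ranked, recall_at_k(ranked, relevant, 2), mrr_at_k(ranked, relevant, 3))
```

Averaging these per-query numbers over a labelled set of queries gives one figure per model, which is a way to compare models without checking results by hand.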

file: tools/sms_search/main.py

"""
Hybrid SMS search with evaluation.

  - BM25 + dense embeddings + (optional) cross-encoder re-ranking
  - SMS-aware text normalization and abbreviation expansion
  - CLI: index, search, eval, serve

Why: Practical, fast, and robust across noisy SMS text.
"""

from __future__ import annotations
import argparse
import csv
import dataclasses
import json
import math
import os
import re
import sys
import time
from collections import defaultdict
from pathlib import Path
from typing import Dict, Iterable, List, Optional, Sequence, Tuple

import numpy as np

# Optional deps with graceful fallbacks
try:
    from rank_bm25 import BM25Okapi  # preferred for BM25
except Exception:  # pragma: no cover
    BM25Okapi = None

try:
    import faiss  # optional ANN
except Exception:  # pragma: no cover
    faiss = None  # type: ignore

# Sentence-Transformers and optional cross-encoder
try:
    from sentence_transformers import SentenceTransformer, CrossEncoder, util as st_util
except Exception:  # pragma: no cover
    SentenceTransformer = None  # type: ignore
    CrossEncoder = None  # type: ignore
    st_util = None  # type: ignore

# ----------------------------
# Data structures
# ----------------------------

@dataclasses.dataclass(frozen=True)
class SMSRecord:
    id: str
    text: str
    timestamp: Optional[str] = None
    thread_id: Optional[str] = None
    sender: Optional[str] = None

# ----------------------------
# SMS-specific normalization
# ----------------------------

ABBR = {
    # Common chat/SMS slang (extend as needed)
    "u": "you",
    "ur": "your",
    "r": "are",
    "brb": "be right back",
    "idk": "i do not know",
    "imo": "in my opinion",
    "imho": "in my humble opinion",
    "lmk": "let me know",
    "omw": "on my way",
    "ttyl": "talk to you later",
    "btw": "by the way",
    "thx": "thanks",
    "pls": "please",
    "plz": "please",
    "asap": "as soon as possible",
    "gr8": "great",
    "b4": "before",
    "bc": "because",
    "afaik": "as far as i know",
    "fyi": "for your information",
}

RE_URL = re.compile(r"https?://\S+|www\.\S+")
RE_PHONE = re.compile(r"\+?\d[\d\-\s]{6,}\d")
RE_EMAIL = re.compile(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+")
RE_DIGITS = re.compile(r"\b\d{2,}\b")
RE_WHITESPACE = re.compile(r"\s+")


def expand_abbr(token: str) -> str:
    return ABBR.get(token, token)


def normalize_sms(text: str) -> str:
    """
    Normalize noisy SMS text.
    Why: Reduces variance from URLs/phones/slang, which tends to improve recall.
    """
    t = text.strip().lower()
    t = RE_URL.sub(" ", t)
    t = RE_EMAIL.sub(" ", t)
    t = RE_PHONE.sub(" ", t)
    # Drop remaining multi-digit numbers (order ids, codes)
    t = RE_DIGITS.sub(" ", t)
    # Tokenize on runs of word characters, then expand common abbreviations
    tokens = re.findall(r"[a-zA-Z0-9<>']+", t)
    tokens = [expand_abbr(tok) for tok in tokens]
    t = " ".join(tokens)
    t = RE_WHITESPACE.sub(" ", t).strip()
    return t


def tokenize(text: str) -> List[str]:
    return normalize_sms(text).split()

# ----------------------------
# Index container
# ----------------------------

class SMSIndex:
    """
    In-memory index with BM25 + dense embeddings.
    Artifacts saved to a directory for reuse.
    """

    def __init__(
        self,
        bm25: Optional[BM25Okapi],
        dense_embeddings: Optional[np.ndarray],
        ids: List[str],
        texts: List[str],
        norm_texts: List[str],
        model_name: Optional[str],
        faiss_index: Optional[object] = None,
        tfidf_fallback: Optional[Tuple[np.ndarray, np.ndarray]] = None,
    ) -> None:
        self.bm25 = bm25
        self.dense = dense_embeddings
        self.ids = ids
        self.texts = texts
        self.norm_texts = norm_texts
        self.model_name = model_name
        self.faiss_index = faiss_index
        self.tfidf = tfidf_fallback  # (X_tfidf, idf) when rank_bm25 unavailable

    # ---------- Persistence ----------

    @staticmethod
    def save(index: "SMSIndex", out_dir: Path) -> None:
        out_dir.mkdir(parents=True, exist_ok=True)
        # Write UTF-8 explicitly: ensure_ascii=False keeps non-ASCII text as-is
        (out_dir / "ids.json").write_text(json.dumps(index.ids, ensure_ascii=False), encoding="utf-8")
        (out_dir / "texts.json").write_text(json.dumps(index.texts, ensure_ascii=False), encoding="utf-8")
        (out_dir / "norm_texts.json").write_text(json.dumps(index.norm_texts, ensure_ascii=False), encoding="utf-8")
        meta = {
            "model_name": index.model_name,
            "has_bm25": index.bm_

Response generated by TD Ai.
