## 9. Improving Prompts

With our LLM prompt showing such strong results, you might be content to leave it as it is. But there are always ways to improve, and you might come across a circumstance where the model's performance is less than ideal.

Earlier in the lesson, we showed how you can feed the LLM examples of inputs and outputs prior to your request as part of a "few-shot" prompt. An added benefit of coding a supervised sample for testing is that you can also use the training slice of the set to prime the LLM with this technique. If you've already done the work of labeling your data, you might as well use it to improve your model too.

Converting the training set you held to the side into a few-shot prompt is a simple matter of formatting it to fit your LLM's expected input. Here's how you might do it in our case.

In [1]:
import json
import time
import os
from retry import retry
from rich.progress import track
from huggingface_hub import InferenceClient
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
import pandas as pd

api_key = os.getenv("HF_TOKEN")
client = InferenceClient(
    token=api_key,
)

sample_df = pd.read_csv("https://huggingface.co/spaces/JournalistsonHF/first-llm-classifier/resolve/main/notebooks/gradio-app/sample.csv")

Here again is our `get_batch_list` function from earlier, followed by the same train-test split:

In [2]:
def get_batch_list(li, n=10):
    """Split the provided list into batches of size `n`."""
    batch_list = []
    for i in range(0, len(li), n):
        batch_list.append(li[i : i + n])
    return batch_list

training_input, test_input, training_output, test_output = train_test_split(
    sample_df[['payee']],
    sample_df['category'],
    test_size=0.33,
    random_state=42
)

In [3]:
def get_fewshots(training_input, training_output, batch_size=10):
    """Convert the training input and output from sklearn's train_test_split into a few-shot prompt"""
    # Batch up the training input into groups of `batch_size`
    input_batches = get_batch_list(list(training_input.payee), n=batch_size)

    # Do the same for the output
    output_batches = get_batch_list(list(training_output), n=batch_size)

    # Create a list to hold the formatted few-shot examples
    fewshot_list = []

    # Loop through the batches
    for i, input_list in enumerate(input_batches):
        fewshot_list.extend([
            # Create a "user" message for the LLM formatted the same way as our prompt, with newlines
            {
                "role": "user",
                "content": "\n".join(input_list),
            },
            # Create the expected "assistant" response as the JSON formatted output we expect
            {
                "role": "assistant",
                "content": json.dumps(output_batches[i])
            }
        ])

    # Return the list of few-shot examples, one for each batch
    return fewshot_list

Pass in your training data.

In [4]:
fewshot_list = get_fewshots(training_input, training_output)

Take a peek at the first pair to see if it's what we expect.

In [5]:
fewshot_list[:2]

[{'role': 'user',
  'content': 'UFW OF AMERICA - AFL-CIO\nRE-ELECT FIONA MA\nELLA DINNING ROOM\nMICHAEL EMERY PHOTOGRAPHY\nLAKELAND  VILLAGE\nTHE IVY RESTAURANT\nMOORLACH FOR SENATE 2016\nBROWN PALACE HOTEL\nAPPLE STORE FARMERS MARKET\nCABLETIME TV'},
 {'role': 'assistant',
  'content': '["Other", "Other", "Other", "Other", "Other", "Restaurant", "Other", "Hotel", "Other", "Other"]'}]
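
Beyond eyeballing the first pair, a quick programmatic check can confirm that every pair is well formed. The sketch below is illustrative only: `check_fewshots` is a hypothetical helper, not part of the lesson's code, and it assumes the message format shown above (alternating user and assistant messages, with a JSON list in each assistant message). It runs here on a tiny hand-made sample rather than the real training data.

```python
import json


def check_fewshots(fewshot_list):
    """Raise ValueError if any user/assistant pair is malformed (hypothetical helper)."""
    if len(fewshot_list) % 2 != 0:
        raise ValueError("Expected an even number of messages")
    # Walk the list two messages at a time
    for i in range(0, len(fewshot_list), 2):
        user, assistant = fewshot_list[i], fewshot_list[i + 1]
        if user["role"] != "user" or assistant["role"] != "assistant":
            raise ValueError(f"Messages {i} and {i + 1} are not a user/assistant pair")
        # Each name in the user message needs exactly one label in the reply
        names = user["content"].split("\n")
        labels = json.loads(assistant["content"])
        if len(names) != len(labels):
            raise ValueError(
                f"Pair starting at message {i}: {len(names)} names but {len(labels)} labels"
            )
    return True


# A tiny hand-made sample in the same shape as the output above
sample_fewshots = [
    {"role": "user", "content": "THE IVY RESTAURANT\nBROWN PALACE HOTEL"},
    {"role": "assistant", "content": json.dumps(["Restaurant", "Hotel"])},
]
fewshots_ok = check_fewshots(sample_fewshots)
print(fewshots_ok)
```

In the notebook, the same check could be pointed at the real `fewshot_list` before sending it to the model.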

Now, we can add those examples to our prompt's `messages`.

In [6]:
@retry(ValueError, tries=2, delay=2)
def classify_payees(name_list):
    prompt = """You are an AI model trained to categorize businesses based on their names.

You will be given a list of business names, each separated by a new line.

Your task is to analyze each name and classify it into one of the following categories: Restaurant, Bar, Hotel, or Other.

It is extremely critical that there is a corresponding category output for each business name provided as an input.

If a business does not clearly fall into Restaurant, Bar, or Hotel categories, you should classify it as "Other".

Even if the type of business is not immediately clear from the name, it is essential that you provide your best guess based on the information available to you. If you can't make a good guess, classify it as Other.

For example, if given the following input:

"Intercontinental Hotel\nPizza Hut\nCheers\nWelsh's Family Restaurant\nKTLA\nDirect Mailing"

Your output should be a JSON list in the following format:

["Hotel", "Restaurant", "Bar", "Restaurant", "Other", "Other"]

This means that you have classified "Intercontinental Hotel" as a Hotel, "Pizza Hut" as a Restaurant, "Cheers" as a Bar, "Welsh's Family Restaurant" as a Restaurant, and both "KTLA" and "Direct Mailing" as Other.

Ensure that the number of classifications in your output matches the number of business names in the input. It is very important that the length of JSON list you return is exactly the same as the number of business names you receive.
"""
    response = client.chat.completions.create(
        messages=[
            ### <-- NEW 
            {
                "role": "system",
                "content": prompt,
            },
            *fewshot_list,
            {
                "role": "user",
                "content": "\n".join(name_list),
            }
            ### -->
        ],
        model="meta-llama/Llama-3.3-70B-Instruct",
        temperature=0,
    )

    answer_str = response.choices[0].message.content
    answer_list = json.loads(answer_str)

    acceptable_answers = [
        "Restaurant",
        "Bar",
        "Hotel",
        "Other",
    ]
    for answer in answer_list:
        if answer not in acceptable_answers:
            raise ValueError(f"{answer} not in list of acceptable answers")

    # Make sure the model returned exactly one category per input name
    if len(name_list) != len(answer_list):
        raise ValueError(
            f"Number of outputs ({len(answer_list)}) does not equal the number of inputs ({len(name_list)})"
        )

    return dict(zip(name_list, answer_list))

Here again is our previous `classify_batches` function:

In [7]:
def classify_batches(name_list, batch_size=10, wait=2):
    # Store the results
    all_results = {}

    # Batch up the list
    batch_list = get_batch_list(name_list, n=batch_size)

    # Loop through the list in batches
    for batch in track(batch_list):
        # Classify it
        batch_results = classify_payees(batch)

        # Add it to the results
        all_results.update(batch_results)

        # Tap the brakes
        time.sleep(wait)

    # Return the results
    return pd.DataFrame(
        all_results.items(),
        columns=["payee", "category"]
    )

And all you need to do is run it again.

In [8]:
llm_df = classify_batches(list(test_input.payee))


And see if your results are any better.

In [9]:
print(classification_report(
    test_output,
    llm_df.category,
))

              precision    recall  f1-score   support

         Bar       1.00      1.00      1.00         2
       Hotel       1.00      1.00      1.00         9
       Other       1.00      0.98      0.99        57
  Restaurant       0.94      1.00      0.97        15

    accuracy                           0.99        83
   macro avg       0.98      1.00      0.99        83
weighted avg       0.99      0.99      0.99        83
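
The figures above are consistent with a single error out of 83: one payee that the human labeled Other and the LLM labeled Restaurant. That reading is an inference from the rounded numbers, not something the report states directly. The sketch below rebuilds that scenario with made-up label lists and computes precision and recall by hand, to show where a precision of 0.94 and a recall of 0.98 could come from.

```python
def precision_recall(y_true, y_pred, label):
    """Compute precision and recall for one label from two parallel lists.

    Assumes the label appears at least once in both lists.
    """
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    predicted = sum(1 for p in y_pred if p == label)
    actual = sum(1 for t in y_true if t == label)
    return true_pos / predicted, true_pos / actual


# Assumed scenario: same class counts as the report's support column
y_true = ["Bar"] * 2 + ["Hotel"] * 9 + ["Other"] * 57 + ["Restaurant"] * 15
y_pred = list(y_true)
# Flip one human-labeled "Other" to a predicted "Restaurant"
y_pred[y_true.index("Other")] = "Restaurant"

restaurant_precision, restaurant_recall = precision_recall(y_true, y_pred, "Restaurant")
other_precision, other_recall = precision_recall(y_true, y_pred, "Other")
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(f"Restaurant: precision={restaurant_precision:.2f} recall={restaurant_recall:.2f}")
print(f"Other: precision={other_precision:.2f} recall={other_recall:.2f}")
print(f"accuracy={accuracy:.2f}")
```

In this scenario, 15 of the 16 Restaurant predictions are correct (15/16), and 56 of the 57 true Other payees are found (56/57).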



Another common tactic is to examine the misclassifications and tweak your prompt to address any patterns they reveal.

One simple way to do this is to merge the LLM's predictions with the human-labeled data and filter for discrepancies.

In [10]:
comparison_df = llm_df.merge(
    sample_df,
    on="payee",
    how="inner",
    suffixes=["_llm", "_human"]
)

And filter to cases where the LLM and human labels don't match.

In [11]:
comparison_df[comparison_df.category_llm != comparison_df.category_human]

| | payee | category_llm | category_human |
|---|---|---|---|
| 16 | SOTTOVOCE MADERO | Restaurant | Other |


Looking at the misclassifications, you might notice that the LLM is struggling with a particular type of business name. You can then adjust your prompt to address that specific issue.
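
When there are more than a handful of discrepancies, tallying them by label pair can make such patterns easier to spot. The sketch below is a minimal illustration in plain Python. Most of the rows are made up for the example (only SOTTOVOCE MADERO and MIDTOWN FRAMING appear in the lesson's data), and the keys simply mirror the columns of `comparison_df`.

```python
from collections import Counter

# Mostly made-up rows for illustration; keys mirror the comparison_df columns
rows = [
    {"payee": "SOTTOVOCE MADERO", "category_llm": "Restaurant", "category_human": "Other"},
    {"payee": "EXAMPLE BAR & RESTAURANT", "category_llm": "Bar", "category_human": "Restaurant"},
    {"payee": "SAMPLE RESTAURANT AND BAR", "category_llm": "Bar", "category_human": "Restaurant"},
    {"payee": "MIDTOWN FRAMING", "category_llm": "Other", "category_human": "Other"},
]

# Count each (human label, LLM label) pair among the mismatches only
mismatch_counts = Counter(
    (row["category_human"], row["category_llm"])
    for row in rows
    if row["category_human"] != row["category_llm"]
)

# Print the most frequent confusions first
for (human, llm), count in mismatch_counts.most_common():
    print(f"human={human} llm={llm}: {count}")
```

On the real DataFrame, a similar tally could be produced by filtering to the mismatched rows and then calling `groupby(["category_human", "category_llm"]).size()`.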

In [12]:
comparison_df.head()

| | payee | category_llm | category_human |
|---|---|---|---|
| 0 | MIDTOWN FRAMING | Other | Other |
| 1 | ALBERGO HILTON ROME AIRPO FIUMICINO | Hotel | Hotel |
| 2 | ISTOCK PHOTOS | Other | Other |
| 3 | DORIAN B. GARCIA | Other | Other |
| 4 | KEELER ADVERTISING | Other | Other |


In this case, I observed that the LLM was struggling with businesses that had both the word bar and the word restaurant in their name. A simple fix would be to add a new line to your prompt that instructs the LLM what to do in that case:

`If a business name contains both the word "Restaurant" and the word "Bar", you should classify it as a Restaurant.`

In [13]:
prompt = """You are an AI model trained to categorize businesses based on their names.

You will be given a list of business names, each separated by a new line.

Your task is to analyze each name and classify it into one of the following categories: Restaurant, Bar, Hotel, or Other.

It is extremely critical that there is a corresponding category output for each business name provided as an input.

If a business does not clearly fall into Restaurant, Bar, or Hotel categories, you should classify it as "Other".

Even if the type of business is not immediately clear from the name, it is essential that you provide your best guess based on the information available to you. If you can't make a good guess, classify it as Other.

For example, if given the following input:

"Intercontinental Hotel\nPizza Hut\nCheers\nWelsh's Family Restaurant\nKTLA\nDirect Mailing"

Your output should be a JSON list in the following format:

["Hotel", "Restaurant", "Bar", "Restaurant", "Other", "Other"]

This means that you have classified "Intercontinental Hotel" as a Hotel, "Pizza Hut" as a Restaurant, "Cheers" as a Bar, "Welsh's Family Restaurant" as a Restaurant, and both "KTLA" and "Direct Mailing" as Other.

If a business name contains both the word "Restaurant" and the word "Bar", you should classify it as a Restaurant.

Ensure that the number of classifications in your output matches the number of business names in the input. It is very important that the length of JSON list you return is exactly the same as the number of business names you receive.
"""

Repeating this cycle of prompt refinement, testing and review can gradually improve your prompt until it returns even better results.
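
One way to keep that cycle honest is to score each prompt version against the same labeled test set and compare the numbers. The sketch below is a rough illustration under stated assumptions: `accuracy`, `classify_v1` and `classify_v2` are hypothetical names, the two classifiers are simple keyword stand-ins rather than real LLM calls, and the test names and labels are made up (apart from KEELER ADVERTISING, which appears in the sample data).

```python
def accuracy(classify_fn, names, expected):
    """Share of names for which classify_fn returns the expected label."""
    predictions = classify_fn(names)
    correct = sum(1 for name, label in zip(names, expected) if predictions[name] == label)
    return correct / len(names)


# Stand-ins for two prompt versions; a real run would call the LLM instead
def classify_v1(names):
    # Hypothetical baseline that labels anything containing "BAR" as a Bar
    return {n: "Bar" if "BAR" in n else "Other" for n in names}


def classify_v2(names):
    # Hypothetical revision that prefers Restaurant when both words appear
    results = {}
    for n in names:
        if "RESTAURANT" in n:
            results[n] = "Restaurant"
        elif "BAR" in n:
            results[n] = "Bar"
        else:
            results[n] = "Other"
    return results


# Made-up test names and human labels
names = ["JOE'S BAR", "EXAMPLE RESTAURANT AND BAR", "KEELER ADVERTISING"]
expected = ["Bar", "Restaurant", "Other"]

# Score both versions on the same data and keep the better one
scores = {
    "v1": accuracy(classify_v1, names, expected),
    "v2": accuracy(classify_v2, names, expected),
}
best_version = max(scores, key=scores.get)
print(scores, best_version)
```

In the notebook, each stand-in would be replaced by a version of `classify_payees` built around a different prompt, with the held-out test set supplying the names and labels.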

In [None]:
%pip install gradio jupyter-server-proxy

In [20]:
import gradio as gr
import json

# -- Gradio interface function --
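def classify_business_names(input_text):
    """Classify the names entered in the textbox, one per line."""
    # Drop blank lines and surrounding whitespace
    name_list = [line.strip() for line in input_text.splitlines() if line.strip()]
    try:
        # Return the dict directly so the JSON output component can render it
        return classify_payees(name_list)
    except Exception as e:
        # Keep the error JSON-serializable as well
        return {"error": str(e)}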

# -- Launch the demo --
demo = gr.Interface(
    fn=classify_business_names,
    inputs=gr.Textbox(lines=10, placeholder="Enter business names, one per line"),
    outputs="json",
    title="Business Category Classifier",
    description="Enter business names and get a classification: Restaurant, Bar, Hotel, or Other."
)

demo.launch(server_name="0.0.0.0", server_port=7873, root_path="/proxy/7873/", quiet=True)



**[10. Sharing your classifier →](ch10-sharing-with-gradio.ipynb)**