Generate and save an embeddings CSV file

This section explains the Search and Ask approach described here (and adapts it for our use case).

We first create an embedding for each FAQ entry, where the text of an entry is the concatenation of its question and answer.

import os

import openai  # for calling the OpenAI API
import pandas as pd  # for storing text and embeddings data
from dotenv import load_dotenv

# read OPENAI_API_KEY from a local .env file into the environment
load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")
# models
EMBEDDING_MODEL = "text-embedding-ada-002"

df_faq: pd.DataFrame = pd.read_csv('hn_faq.csv', encoding='utf-8', dtype=object,
                                   index_col=False)

text_batch = []
for _, row in df_faq.iterrows():
    curr_text = f'''

    Q: {row['prompt']}

    A: {row['completion']}

    '''
    text_batch.append(curr_text)

# openai.Embedding.create is the interface of openai package versions before 1.0
response = openai.Embedding.create(model=EMBEDDING_MODEL, input=text_batch)
for i, be in enumerate(response["data"]):
    assert i == be["index"]  # double check embeddings are in same order as input
batch_embeddings = [e["embedding"] for e in response["data"]]

df = pd.DataFrame({"text": text_batch, "embedding": batch_embeddings})
df.to_csv('embeddings.csv', index=False)

The code above reads the FAQ CSV file (hn_faq.csv), builds one text per FAQ by concatenating its question and answer, and passes all of the texts to a single openai.Embedding.create call so that the embeddings for every FAQ are created in one batch.
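The code above sends every FAQ text in one request. If the FAQ file grows large, the list may need to be split into smaller batches before calling the API. The following is a minimal sketch of that splitting step only. The helper name iter_batches and the batch size of 1000 are assumptions made for illustration, not part of the original code; check the API documentation for the actual per-request limit.

```python
from typing import Iterator, List


def iter_batches(texts: List[str], batch_size: int = 1000) -> Iterator[List[str]]:
    """Yield consecutive slices of `texts`, each at most `batch_size` long.

    `batch_size=1000` is an illustrative default, not a documented limit.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


# Small demonstration with placeholder texts instead of real FAQ entries.
sample_texts = [f"FAQ {i}" for i in range(5)]
batches = list(iter_batches(sample_texts, batch_size=2))
print(batches)
```

Each batch could then be passed as the input argument of the embedding call, and the resulting embeddings appended to one list in order, so that the final list still lines up with the original texts.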

We then construct a new DataFrame whose first column is the text and whose second column is the embedding itself, and save it to a new CSV file called embeddings.csv.
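One detail to keep in mind when reading embeddings.csv back: a CSV cell can only hold text, so each embedding list is stored as a string such as "[0.1, -0.2, 0.3]" and has to be parsed back into a list. The sketch below shows one way to do that with ast.literal_eval. The function name load_embeddings and the tiny made-up vectors are assumptions for illustration, and an in-memory buffer stands in for the real file.

```python
import ast
import io

import pandas as pd


def load_embeddings(csv_source) -> pd.DataFrame:
    """Read a text/embedding CSV and parse the embedding column into lists."""
    df = pd.read_csv(csv_source)
    # Each cell holds the string form of a Python list of floats.
    df["embedding"] = df["embedding"].apply(ast.literal_eval)
    return df


# Round trip with a tiny made-up frame (real embeddings are much longer).
original = pd.DataFrame(
    {
        "text": ["Q: example question\n\nA: example answer"],
        "embedding": [[0.1, -0.2, 0.3]],
    }
)
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)

restored = load_embeddings(buffer)
print(type(restored["embedding"][0]))
```

With the real file, the call would be load_embeddings('embeddings.csv').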

Note that we use the text-embedding-ada-002 model to generate the embeddings.