How to build an AI that can answer questions about your website


해피해커 2023. 3. 29. 00:00

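This tutorial walks through a simple example of crawling a website (the OpenAI website in this example), turning the crawled pages into embeddings with the Embeddings API, and then building a basic search feature that lets a user ask questions about the embedded information. It is meant as a starting point for more sophisticated applications that make use of a custom knowledge base.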

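Getting started
To follow this tutorial you need basic knowledge of Python and GitHub. Before starting, set up an OpenAI API key and work through the quickstart tutorial. This gives you a good intuition for how to use the API to its full potential.

Python is the main programming language, used together with OpenAI, Pandas, transformers, NumPy, and other popular packages. If you run into problems while working through the tutorial, ask a question on the OpenAI Community Forum.

To start with the code, clone the full code for this tutorial from GitHub. Alternatively, copy each section into a Jupyter notebook and run the code step by step, or just read along. A good way to avoid problems is to set up a new virtual environment and install the required packages by running the following commands.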

python -m venv env

source env/bin/activate

pip install -r requirements.txt

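Setting up a web crawler
The main focus of this tutorial is the OpenAI API, so if you prefer you can skip the background on how to build the web crawler and just download the source code. Otherwise, expand the section below to work through the implementation of the scraping mechanism.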
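The crawler itself is not reproduced here. As a rough illustration of one step it needs, the sketch below extracts the links on a page and keeps only those on the crawled domain. It uses only the Python standard library; the names `LinkParser` and `same_domain_links` and the sample HTML are hypothetical and are not taken from the tutorial's source code.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_domain_links(html, page_url, domain):
    """Return absolute links found in `html` whose host is exactly `domain`."""
    parser = LinkParser()
    parser.feed(html)
    links = []
    for href in parser.hrefs:
        # Skip links that do not lead to another page
        if href.startswith(("#", "mailto:", "tel:", "javascript:")):
            continue
        absolute = urljoin(page_url, href)
        parsed = urlparse(absolute)
        if parsed.scheme in ("http", "https") and parsed.netloc == domain:
            # Drop the fragment and trailing slash so duplicates collapse
            clean = absolute.split("#")[0].rstrip("/")
            if clean not in links:
                links.append(clean)
    return links


# Hypothetical page content, used only to exercise the function
sample_html = """
<a href="/blog/">Blog</a>
<a href="https://openai.com/research#top">Research</a>
<a href="https://platform.openai.com/docs">Docs</a>
<a href="mailto:hello@example.com">Mail</a>
"""
print(same_domain_links(sample_html, "https://openai.com/", "openai.com"))
```

Because the host must match the domain exactly, the link to the `platform.openai.com` subdomain is dropped. A real crawler would additionally fetch each page, keep a queue of unvisited links, and save the page text to a file.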

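Building an embeddings index

CSV is a common format for storing embeddings. You can use this format from Python by converting the raw text files in the text directory into Pandas data frames. Pandas is a popular open source library for working with tabular data (data stored in rows and columns).

Blank lines can clutter the text files and make them harder to process. A simple function can remove those lines and tidy up the files.

Converting the text to CSV requires looping through the text files in the text directory created earlier. After opening each file, strip the extra whitespace and append the modified text to a list. Then add the text with the newlines removed to an empty Pandas data frame and write the data frame to a CSV file.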

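import os

import pandas as pd

# Assumption: `domain` is the domain crawled earlier (it is defined in the crawler step)
domain = "openai.com"

# Minimal sketch of the newline-removal helper described above
def remove_newlines(serie):
    serie = serie.str.replace('\n', ' ', regex=False)
    serie = serie.str.replace('\\n', ' ', regex=False)
    serie = serie.str.replace('  ', ' ', regex=False)
    serie = serie.str.replace('  ', ' ', regex=False)
    return serie

# Create a list to store the text files
texts=[]

# Get all the text files in the text directory
for file in os.listdir("text/" + domain + "/"):

    # Open the file and read the text
    with open("text/" + domain + "/" + file, "r", encoding="UTF-8") as f:
        text = f.read()

        # Drop the first 11 and last 4 characters of the file name (for this domain, the
        # "openai.com_" prefix and ".txt"), replace - and _ with spaces, and remove #update.
        texts.append((file[11:-4].replace('-',' ').replace('_', ' ').replace('#update',''), text))

# Create a dataframe from the list of texts
df = pd.DataFrame(texts, columns = ['fname', 'text'])

# Set the text column to be the raw text with the newlines removed
df['text'] = df.fname + ". " + remove_newlines(df.text)

# Make sure the output directory exists before writing the CSV
os.makedirs('processed', exist_ok=True)
df.to_csv('processed/scraped.csv')
df.head()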

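Tokenization is the next step after saving the raw text to a CSV file. This process splits the input text into tokens by breaking down its sentences and words. For a visual demonstration, see the Tokenizer in the documentation.

The API limits the maximum number of input tokens for embeddings. To stay below this limit, the text in the CSV file needs to be broken into multiple rows. The existing length of each row is recorded first to identify which rows need to be split.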

import tiktoken

# Load the cl100k_base tokenizer which is designed to work with the ada-002 model
tokenizer = tiktoken.get_encoding("cl100k_base")

df = pd.read_csv('processed/scraped.csv', index_col=0)
df.columns = ['title', 'text']

# Tokenize the text and save the number of tokens to a new column
df['n_tokens'] = df.text.apply(lambda x: len(tokenizer.encode(x)))

# Visualize the distribution of the number of tokens per row using a histogram
df.n_tokens.hist()

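The newest embeddings model can handle inputs of up to 8191 tokens, so most rows do not need any chunking. That may not hold for every scraped subpage, however, so the next code chunk splits the longer rows into smaller chunks.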

max_tokens = 500

# Function to split the text into chunks of a maximum number of tokens
def split_into_many(text, max_tokens = max_tokens):

    # Split the text into sentences
    sentences = text.split('. ')

    # Get the number of tokens for each sentence
    n_tokens = [len(tokenizer.encode(" " + sentence)) for sentence in sentences]

    chunks = []
    tokens_so_far = 0
    chunk = []

    # Loop through the sentences and tokens joined together in a tuple
    for sentence, token in zip(sentences, n_tokens):

        # If adding the current sentence would exceed the max number of tokens,
        # close the current chunk (unless it is empty) and start a new one
        if chunk and tokens_so_far + token > max_tokens:
            chunks.append(". ".join(chunk) + ".")
            chunk = []
            tokens_so_far = 0

        # A single sentence longer than the max number of tokens cannot fit in
        # any chunk, so it is skipped (its text is dropped)
        if token > max_tokens:
            continue

        # Otherwise, add the sentence to the chunk and add the number of tokens to the total
        chunk.append(sentence)
        tokens_so_far += token + 1

    # Add the last, partially filled chunk, which would otherwise be lost
    if chunk:
        last = ". ".join(chunk)
        chunks.append(last if last.endswith(".") else last + ".")

    return chunks


shortened = []

# Loop through the dataframe
for row in df.iterrows():

    # If the text is missing (None, or NaN after reading the CSV), go to the next row
    if pd.isna(row[1]['text']):
        continue

    # If the number of tokens is greater than the max number of tokens, split the text into chunks
    if row[1]['n_tokens'] > max_tokens:
        shortened += split_into_many(row[1]['text'])

    # Otherwise, add the text to the list of shortened texts
    else:
        shortened.append( row[1]['text'] )
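To see the sentence-packing idea in isolation, here is a small sketch that can be run without an API key or tokenizer. It is an illustration under stated assumptions, not the tutorial's code: `count_tokens` is a hypothetical stand-in that counts whitespace-separated words instead of calling `tokenizer.encode`, and `pack_sentences` is a simplified variant of the splitting function.

```python
def count_tokens(text):
    # Stand-in for len(tokenizer.encode(text)): counts whitespace-separated words
    return len(text.split())


def pack_sentences(text, max_tokens):
    """Greedily pack sentences into chunks of at most `max_tokens` tokens."""
    chunks, current, used = [], [], 0
    for sentence in text.split(". "):
        sentence = sentence.strip().rstrip(".")
        n = count_tokens(sentence)
        if n == 0 or n > max_tokens:
            # Empty fragments and sentences that could never fit are skipped
            continue
        if current and used + n > max_tokens:
            # The current chunk is full: close it and start a new one
            chunks.append(". ".join(current) + ".")
            current, used = [], 0
        current.append(sentence)
        used += n
    if current:
        # Flush the last, partially filled chunk
        chunks.append(". ".join(current) + ".")
    return chunks


text = "One two three. Four five. Six seven eight nine. Ten."
for chunk in pack_sentences(text, max_tokens=5):
    print(count_tokens(chunk), repr(chunk))
```

With a budget of five words, the four sentences end up in two chunks of five words each. Swapping `count_tokens` for a real tokenizer changes the counts but not the packing logic.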

Visualizing the updated histogram again helps confirm that the rows were successfully split into shorter sections.

df = pd.DataFrame(shortened, columns = ['text'])
df['n_tokens'] = df.text.apply(lambda x: len(tokenizer.encode(x)))
df.n_tokens.hist()

The content is now broken down into smaller chunks, and a simple request can be sent to the OpenAI API specifying the new text-embedding-ada-002 model to create the embeddings:

import openai

df['embeddings'] = df.text.apply(lambda x: openai.Embedding.create(input=x, engine='text-embedding-ada-002')['data'][0]['embedding'])

df.to_csv('processed/embeddings.csv')
df.head()

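This should take about 3-5 minutes, after which the embeddings are ready to use!

Building a question answering system with your embeddings

With the embeddings ready, the final step of this process is to create a simple question and answer system. It takes a user's question, creates an embedding of it, and compares it with the existing embeddings to retrieve the most relevant text from the scraped website. The text-davinci-003 model then generates a natural sounding answer based on the retrieved text.

Turning the embeddings into a NumPy array is the first step. This gives more flexibility in how to use them, given the many functions available that operate on NumPy arrays. It also flattens the dimension to 1-D, which is the required format for many subsequent operations.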

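import ast

import numpy as np
import pandas as pd
from openai.embeddings_utils import distances_from_embeddings

df=pd.read_csv('processed/embeddings.csv', index_col=0)
# Each embedding is stored as a string; ast.literal_eval parses it without running arbitrary code
df['embeddings'] = df['embeddings'].apply(ast.literal_eval).apply(np.array)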

df.head()

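Now that the data is ready, the question needs to be converted to an embedding with a simple function. This matters because search with embeddings compares vectors of numbers (the converted form of the raw text) using cosine distance. Vectors that are close in cosine distance are likely to be related, so the text behind them may contain the answer to the question. The OpenAI Python package has a built-in distances_from_embeddings function that is useful here.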
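To make the comparison concrete, the following minimal sketch computes cosine distance by hand and ranks a few toy vectors against a question vector. It is an illustration only: the two-dimensional vectors and their labels are made up, and `cosine_distance` is a hypothetical helper rather than the library function named above.

```python
import math


def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity (0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy 2-D "embeddings"; real embeddings have many more dimensions
question = [1.0, 0.0]
chunks = {
    "same direction": [2.0, 0.0],
    "orthogonal": [0.0, 3.0],
    "opposite": [-1.0, 0.0],
}

# Smaller distance means more closely related, so sort in ascending order
ranked = sorted(chunks, key=lambda name: cosine_distance(question, chunks[name]))
for name in ranked:
    print(name, round(cosine_distance(question, chunks[name]), 3))
```

Note that the vector's length does not matter, only its direction: `[2.0, 0.0]` is at distance 0 from `[1.0, 0.0]`.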

def create_context(
    question, df, max_len=1800, size="ada"
):
    """
    Create a context for a question by finding the most similar context from the dataframe
    """

    # Get the embeddings for the question
    q_embeddings = openai.Embedding.create(input=question, engine='text-embedding-ada-002')['data'][0]['embedding']

    # Get the distances from the embeddings
    df['distances'] = distances_from_embeddings(q_embeddings, df['embeddings'].values, distance_metric='cosine')


    returns = []
    cur_len = 0

    # Sort by distance and add the text to the context until the context is too long
    for i, row in df.sort_values('distances', ascending=True).iterrows():
        
        # Add the length of the text to the current length
        cur_len += row['n_tokens'] + 4
        
        # If the context is too long, break
        if cur_len > max_len:
            break
        
        # Else add it to the text that is being returned
        returns.append(row["text"])

    # Return the context
    return "\n\n###\n\n".join(returns)

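Because the text was broken into smaller sets of tokens, looping through the rows in ascending order of distance and continuing to add text is a critical step for getting a complete answer. If more content is returned than desired, max_len can also be lowered.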
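The effect of max_len is easiest to see on a small example. The sketch below mirrors the selection loop on plain tuples instead of a data frame; `build_context` and the three sample rows are hypothetical, and the `+ 4` per chunk follows the loop in `create_context`.

```python
def build_context(rows, max_len=1800, separator="\n\n###\n\n"):
    """Join the closest texts until the token budget `max_len` is used up.

    `rows` is a list of (distance, n_tokens, text) tuples.
    """
    selected = []
    cur_len = 0
    # Visit the rows from the smallest distance to the largest
    for distance, n_tokens, text in sorted(rows, key=lambda r: r[0]):
        # Count the chunk's tokens plus 4 for the separator
        cur_len += n_tokens + 4
        if cur_len > max_len:
            break
        selected.append(text)
    return separator.join(selected)


# Made-up rows: (cosine distance, token count, text)
rows = [
    (0.30, 40, "C: loosely related"),
    (0.10, 50, "A: closest match"),
    (0.20, 60, "B: second closest"),
]

print(build_context(rows, max_len=120))
print("=====")
print(build_context(rows, max_len=60))
```

With a budget of 120 the two closest chunks fit (54 + 64 = 118 tokens), while a budget of 60 leaves room for only the closest one.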

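The previous step only retrieved chunks of text that are semantically related to the question, so they might contain the answer, but there is no guarantee. Returning the top 5 most likely results further increases the chance of finding an answer.

The answering prompt then tries to extract the relevant facts from the retrieved context and formulate a coherent answer. If there is no relevant answer, the prompt returns "I don't know".

A realistic sounding answer to the question can be created with the completion endpoint using text-davinci-003.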

def answer_question(
    df,
    model="text-davinci-003",
    question="Am I allowed to publish model outputs to Twitter, without a human review?",
    max_len=1800,
    size="ada",
    debug=False,
    max_tokens=150,
    stop_sequence=None
):
    """
    Answer a question based on the most similar context from the dataframe texts
    """
    context = create_context(
        question,
        df,
        max_len=max_len,
        size=size,
    )
    # If debug, print the retrieved context that will be sent to the model
    if debug:
        print("Context:\n" + context)
        print("\n\n")

    try:
        # Create a completions using the question and context
        response = openai.Completion.create(
            prompt=f"Answer the question based on the context below, and if the question can't be answered based on the context, say \"I don't know\"\n\nContext: {context}\n\n---\n\nQuestion: {question}\nAnswer:",
            temperature=0,
            max_tokens=max_tokens,
            top_p=1,
            frequency_penalty=0,
            presence_penalty=0,
            stop=stop_sequence,
            model=model,
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""
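When an answer looks wrong, it can help to inspect the exact prompt without calling the API. The sketch below pulls the prompt template from the function above into a small helper; `build_prompt` is a hypothetical name, and the sample context is only a placeholder sentence.

```python
def build_prompt(context, question):
    """Assemble the same prompt text that is passed to the completion endpoint."""
    return (
        "Answer the question based on the context below, and if the question "
        "can't be answered based on the context, say \"I don't know\"\n\n"
        f"Context: {context}\n\n---\n\nQuestion: {question}\nAnswer:"
    )


# Placeholder context; in the real flow this comes from create_context()
prompt = build_prompt(
    "ChatGPT is a model trained to interact in a conversational way.",
    "What is ChatGPT?",
)
print(prompt)
```

The prompt ends with "Answer:" so that the model continues with the answer itself, and the instruction to say "I don't know" is what produces that reply when the context is not relevant.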

It is done! A working Q/A system with the knowledge of the OpenAI website is now ready. A few quick tests show the quality of the output:

answer_question(df, question="What day is it?", debug=False)

answer_question(df, question="What is our newest embeddings model?")

answer_question(df, question="What is ChatGPT?")

The responses will look something like this:

"I don't know."

'The newest embeddings model is text-embedding-ada-002.'

'ChatGPT is a model trained to interact in a conversational way. It is able to answer followup questions, admit its mistakes, challenge incorrect premises, and reject inappropriate requests.'

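If the system cannot answer a question that it is expected to answer, it is worth searching the raw text files to check whether the expected information actually ended up being embedded. The crawling process done at the start was set up to skip sites outside the original domain that was provided, so the system may be missing that knowledge if it lives on a subdomain.

Currently, the data frame is passed in each time to answer a question. For more production-oriented workflows, a vector database solution should be used instead of storing the embeddings in a CSV file, but the current approach is a great option for prototyping.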

Attribution, NonCommercial, NoDerivatives