Dec 16th, 2024: [EN] ChatGPT Summary with ESS as your Private Datastore

Summary

This tutorial shows how to set up the Elasticsearch web crawler to index a website into Elasticsearch, and then leverage ChatGPT to answer and summarize questions using our private data.

GitHub repo for the Python scripts: https://github.com/elastic/support/tree/master/chatgpt_demo

Objectives:
Learn how to use Elasticsearch to act as a private datastore for ChatGPT.

Process

1. Create your ESS Deployment

To start this tutorial, we will begin by creating an ESS deployment.
Create an Elasticsearch cluster on version 8.17 or later on ESS. Make sure it includes at least one machine learning node.

Recommended minimum layout:
2 x 8GB hot nodes
1 x 2GB machine learning node (make sure machine learning autoscaling is not enabled; this helps you see the impact these machine learning processes can have on your resources). When you deploy the model, the more allocations it has, the more memory it requires.

Please take note of the machine learning node's usage during this process. Depending on the website you decide to target, the machine learning node will likely be your bottleneck: documents must first be indexed, and then the machine learning node applies the embedding model (referenced in the next step) to each document as it is ingested.

Example layout:

Your Elasticsearch cluster needs to be reachable by the remote AI source. This does not require ESS, but ESS is the easiest way to implement it quickly. Generally speaking, an on-premise deployment would require firewall rules, etc. to allow the remote AI to connect to the on-premise Elasticsearch cluster.

2. Set up the embedding model

Before crawling our website to create a private index for ChatGPT to interact with, we need to load the embedding model into Elasticsearch.

For this example we will use the all-distilroberta-v1 model (sentence-transformers/all-distilroberta-v1 · Hugging Face), trained by SentenceTransformers and hosted on the Hugging Face model hub.

This particular model isn't required for this setup to work. It is good for general use because it was trained on very large data sets covering a wide range of topics. However, for vector search use cases, a model fine-tuned to your particular data set will usually provide the best results; this model, for example, might not be the best choice if you are searching scientific research papers.

To do this, we will use the Eland Python library (GitHub - elastic/eland: Python Client and Toolkit for DataFrames, Big Data, Machine Learning and ETL in Elasticsearch), created by Elastic. The library provides a wide range of data science functions, but here we use it as a bridge to load the model from the Hugging Face model hub into Elasticsearch, where it can be deployed on machine learning nodes for inference.

Eland can be run from the command line, in a Docker container, or as part of a Python script.

Installing from the command line (Example using Ubuntu 20.04)
Additional Docker Instructions here: Import the trained model and vocabulary | Machine Learning in the Elastic Stack [8.17] | Elastic

Step 1 -- Install eland onto your machine or use Docker (see the above link for Docker Instructions):
python3 -m pip install 'eland[pytorch]'

Step 2 -- Copy your ESS deployment URL:
Go to https://cloud.elastic.co and log in.
Select Manage next to your target deployment.
Click "Copy Endpoint" next to Elasticsearch.

Step 3 -- Load model into Elasticsearch:
Using the URL you copied from the Cloud Control Panel along with the elastic username and password, complete the command below and execute it.
The command below will load the hub model onto your machine learning node and should automatically start it.

eland_import_hub_model --url https://test-f22762.es.us-central1.gcp.cloud.es.io:9243 -u elastic -p YOURPASSWORD --hub-model-id sentence-transformers/all-distilroberta-v1 --task-type text_embedding
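As noted above, Eland can also run as part of a Python script. The sketch below is a hedged example based on Eland's documented PyTorch model import workflow; class and argument names can vary between eland versions, so treat it as an illustration rather than the canonical invocation:

import tempfile
from pathlib import Path

from elasticsearch import Elasticsearch
from eland.ml.pytorch import PyTorchModel
from eland.ml.pytorch.transformers import TransformerModel

# Connect with the endpoint URL copied in Step 2.
es = Elasticsearch(
    "https://YOUR_DEPLOYMENT_URL:9243",
    basic_auth=("elastic", "YOURPASSWORD"),
)

# Download the model from the Hugging Face hub and trace it for Elasticsearch.
tm = TransformerModel(
    model_id="sentence-transformers/all-distilroberta-v1",
    task_type="text_embedding",
)
tmp_dir = Path(tempfile.mkdtemp())
model_path, config, vocab_path = tm.save(tmp_dir)

# Upload the traced model; start it afterwards from Kibana (Step 4) or the
# start-deployment API.
ptm = PyTorchModel(es, tm.elasticsearch_model_id())
ptm.import_model(
    model_path=model_path,
    config_path=None,
    vocab_path=vocab_path,
    config=config,
)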

Step 4 -- Log into Kibana and verify the model has started:
Log into Kibana.
Navigate to Machine Learning.
Under Model Management, click on "Trained Models".
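If the model appears in the Trained Models list but has not started, you can start the deployment yourself. A minimal sketch using the elasticsearch Python client's ML API (the deployment URL and password are placeholders):

from elasticsearch import Elasticsearch

es = Elasticsearch(
    "https://YOUR_DEPLOYMENT_URL:9243",  # endpoint copied in Step 2
    basic_auth=("elastic", "YOURPASSWORD"),
)

# Start (deploy) the uploaded model on the machine learning node.
es.ml.start_trained_model_deployment(
    model_id="sentence-transformers__all-distilroberta-v1",
)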

3. Crawl your data
Step 1. Identify the website you wish to crawl. In this example we are going to crawl the NFL's Hall of Fame.
-- We will not be setting up exclusions, etc., but they should be configured in production.

Step 2. Log into Kibana. Under Search select Elasticsearch.

Step 3. Click on Web Crawlers

Step 4. Click on New Web Crawler

Step 5. Type in the name of your index. It will be prefixed with "search-".
Example: nfl-hof will become the index search-nfl-hof

Step 6. Click Create Index

Step 7. Go to Manage Domains and add the domain you wish to crawl. Click on Validate Domain, then click on Add Domain.
Example: the Players section of the Pro Football Hall of Fame website.

Step 8. Go to Pipelines
Click on Copy and Customize under "Unlock your custom pipelines"

Next look under "Machine Learning Inference Pipelines" -- click on "Add Inference Pipleine"

Step 9. Select the Dense Vector Text Embedding model sentence-transformers__all-distilroberta-v1, then click Continue.


Step 10. Under Select Field Mappings select "Title" and then click "Add"


Step 11. Click Continue until you see "Create Pipeline" and then click Create Pipeline.

Step 12. Click Crawl

Step 13. Review the web crawler indices and ensure documents are being populated into the search index.

Step 14. Run a test query to ensure the documents you are interested in sending to ChatGPT have been ingested into Elasticsearch.
Example query to confirm Walter Payton documents have been ingested into the index:

GET search-nfl-hof/_search
{
  "query": {
    "match": {
      "title": "Walter Payton"
    }
  }
}
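The same sanity check can be run from Python with the official elasticsearch client (pip install elasticsearch). This is a minimal sketch, not part of the tutorial's scripts; it reads the same environment variables you will export in step 7:

import os
from elasticsearch import Elasticsearch

# Connect using the deployment's Cloud ID and the elastic superuser credentials.
es = Elasticsearch(
    cloud_id=os.environ["cloud_id"],
    basic_auth=(os.environ["cloud_user"], os.environ["cloud_pass"]),
)

# Confirm at least one Walter Payton document made it into the crawl index.
resp = es.search(
    index="search-nfl-hof",
    query={"match": {"title": "Walter Payton"}},
)
print(resp["hits"]["total"]["value"], "matching documents")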

4. Install Streamlit -- Streamlit is required to execute the Python script referenced in step 5 and provides the user interface.

pip install streamlit

Verify that Streamlit was successfully installed by issuing the following command from your console:
streamlit hello
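If hello works, a few lines of Python show the pattern the demo scripts build on (a hedged sketch, not the demo code); save it as app.py and run it with streamlit run app.py:

import streamlit as st

# Streamlit re-runs this script top to bottom on each user interaction.
st.title("ESS + ChatGPT demo")
question = st.text_input("Ask a question")
if question:
    st.write(f"You asked: {question}")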


5. Download the Python script.
Link: https://github.com/elastic/support/tree/master/chatgpt_demo

There are two options:

A. Connect ESS to OpenAI:
hof_es_gpt_noBing.py
B. Connect ESS to OpenAI and connect Bing to OpenAI:
hof_es_gpt_withBing.py

Within the Python script, edit the index name to match the name of the index you created when you set up your web crawler:

line 70 in hof_es_gpt_noBing.py
index = 'search-nfl-hofx'

6. Set up your external resources:
Set up OpenAI's ChatGPT:

To connect to ChatGPT, you will need an OpenAI API account and key. If you don’t already have an account, you can create a free account and you will be given an initial amount of free credits.

Go to https://platform.openai.com and click on Sign up.

Once your account is created, you will need to create an API key:

Click on API Keys.
Click Create new secret key.
Copy the new key and save it someplace safe as you won’t be able to view the key again.
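Before wiring the key into the app, you can sanity-check it with a short script. This is a hedged sketch using the openai Python package's v1-style client; the demo scripts may target an older version of the library, and the model name here is only an example:

import os
from openai import OpenAI

# Reads the key from the openai_api variable exported in step 7.
client = OpenAI(api_key=os.environ["openai_api"])

resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Say hello"}],
)
print(resp.choices[0].message.content)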

Optional: Set up the Bing API -- this allows the application to search the ESS datastore first, and then, if the data isn't found there, search the internet via Bing.

7. Set up the script
Step 1. Launch the console
Step 2. Navigate to the directory where the Python script is saved
Step 3. Define the script variables within the console:

Variables required:
OpenAI API Key
ESS Cloud ID
ESS username
ESS password

Optional (depending on the script):
Bing API Key
Bing Endpoint

Prior to running the Python script, we need to define some variables in the console. If you aren't using the Bing script, you can skip the Bing variables. Change the API keys to match your own; the keys below have been revoked.

export openai_api="sk-IduoyWxSoVRtGpJUiyY99QaGO70v8sf1agXvuuFrVUT3BlbkFJrUBcDHY-ervQgdun2Z7IkJa6YYXqCczk1NnrEzPSEA"
export cloud_id="814-test:dXMtY2VudHJhbDEuZ2NwLmNsb3VkLmVzLmlvOjQ0MyQ5ZDJiZDRlODc3YmM0YmQ0YWFhM2I4MjBlMzk2ZDhiYSQwYWM5YjRmMWEzZTg0ODdlOTlmZGM3OTVkZjg4YTUxNQ=="
export cloud_pass="aCu1A6hkhQAw6cso359o8IH5"
export cloud_user="elastic"
export bing_subkey="94b11de338384967a4ddb61b611d3c97"
export bing_endpoint="https://api.bing.microsoft.com/"

After defining the above variables, execute the script with Streamlit:
streamlit run hof_es_gpt_noBing.py
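Under the hood, the app follows a retrieve-then-generate pattern: embed the user's question, pull the best-matching document from Elasticsearch, and have ChatGPT answer from that context. The sketch below illustrates the pattern only; the field names, model name, and prompt wording are assumptions, not the actual hof_es_gpt_noBing.py code:

import os
from elasticsearch import Elasticsearch
from openai import OpenAI

es = Elasticsearch(
    cloud_id=os.environ["cloud_id"],
    basic_auth=(os.environ["cloud_user"], os.environ["cloud_pass"]),
)
client = OpenAI(api_key=os.environ["openai_api"])

question = "Tell me about Walter Payton"

# Retrieve: kNN search against the vectors the inference pipeline wrote.
hits = es.search(
    index="search-nfl-hof",
    size=1,
    knn={
        "field": "ml.inference.title.predicted_value",
        "k": 1,
        "num_candidates": 20,
        "query_vector_builder": {
            "text_embedding": {
                "model_id": "sentence-transformers__all-distilroberta-v1",
                "model_text": question,
            }
        },
    },
)["hits"]["hits"]

# Generate: ask ChatGPT to answer only from the retrieved page content.
context = hits[0]["_source"].get("body_content", "") if hits else ""
resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{
        "role": "user",
        "content": f"Answer using only this context:\n{context}\n\nQuestion: {question}",
    }],
)
print(resp.choices[0].message.content)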

Examples

Examples of what the final product should look like:

Testing:
How to test the model we uploaded:

POST _ml/trained_models/sentence-transformers__all-distilroberta-v1/_infer
{
  "docs": [
    {
      "text_field": "Halo is a military science fiction media franchise, originally developed and created by Bungie and currently managed and developed by 343 Industries, part of Microsofts Xbox Game Studios. The series launched in 2001 with the first-person shooter video game Halo: Combat Evolved and its tie-in novel, The Fall of Reach. The latest main game, Halo Infinite, was released in late 2021. Combat Evolved started life as a real-time strategy game for personal computers, turning into a first-person shooter exclusive to Microsoft's Xbox video game console after Bungie was acquired by the company. Bungie regained its independence in 2007, releasing additional Halo games through 2010 before moving on from the franchise. Microsoft established 343 Industries to oversee Halo going forward, producing games itself and in partnership with other studios."
    },
    {
      "text_field": "Sonic the Hedgehog[c] is a 1991 platform game developed by Sonic Team and published by Sega for the Genesis/Mega Drive. It was released in North America on June 23 and in PAL regions and Japan the following month. Players control Sonic the Hedgehog, who can run at near supersonic speeds; Sonic sets out on a quest to defeat Dr. Robotnik, a scientist who has imprisoned animals in robots and seeks the powerful Chaos Emeralds. The gameplay involves collecting rings as a form of health, and a simple control scheme, with jumping and attacking controlled by a single button. Development began in 1990 when Sega ordered its developers to create a game featuring a mascot for the company. The developers chose a blue hedgehog designed by Naoto Ohshima after he won an internal character design contest, and named themselves Sonic Team to match their character. It uses a novel technique that allows Sonic's sprite to roll along curved scenery which was based on a concept by Oshima from 1989.[2] Sonic the Hedgehog, designed for fast gameplay, was influenced by games by Super Mario series creator Shigeru Miyamoto. The music was composed by Masato Nakamura, bassist of the J-pop band Dreams Come True."
    }
  ],
  "inference_config": {
    "text_embedding": {
     
    }
  }
}

Get Stats about the model we uploaded

GET _ml/trained_models/sentence-transformers__all-distilroberta-v1/_stats

Test queries:

Hybrid query (a traditional match combined with kNN):

POST search-nfl-hof/_search
{
  "size": 1,
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "title": {
              "query": "Walter Payton",
              "boost": 1
            }
          }
        },
        {
          "knn": {
            "field": "ml.inference.title.predicted_value",
            "num_candidates": 20,
            "query_vector_builder": {
              "text_embedding": {
                "model_id": "sentence-transformers__all-distilroberta-v1",
                "model_text": "Walter Payton"
              }
            },
            "boost": 24
          }
        }
      ],
      "filter": [
        {
          "exists": {
            "field": "ml.inference.title.predicted_value"
          }
        }
      ]
    }
  }
}

KNN Query:

POST search-nfl-hof/_search
{
  "size": 1,
  "query": {
    "bool": {
      "must": {
        "knn": {
          "field": "ml.inference.title.predicted_value",
          "num_candidates": 20,
          "query_vector_builder": {
            "text_embedding": {
              "model_id": "sentence-transformers__all-distilroberta-v1",
              "model_text": "Tell me about Tom Brady"
            }
          },
          "boost": 24
        }
      }
    }
  }
}