Dec 13th, 2024: [EN] Semantic, Vector, and Hybrid Search all in Kibana Console

Want to combine your regular Elasticsearch queries with new AI search capabilities? Won't you need to connect to some third party's Large Language Model (LLM)? No. You can improve your existing searches by using Elastic's ELSER model to create semantic vectors, and combine those with regular searches to get more relevant search results!

Introduction
When it comes to using AI and applying vector search to data in Elastic, I see a lot of tutorials and write-ups that can get pretty complicated. I wanted to show that you can take existing searches you may be running today and augment them with semantic and vector search. You do not have to connect to any outside LLM or pay any of the big-boys to process your data. Within just Elasticsearch and Kibana you can download Elastic's semantic model, ELSER, process your data to add descriptive vectors, and augment your searches to explore the improvements. You don't need LangChain, Python, ChatGPT, or any other external tools.

PLATFORM
The searches below were run on version 8.13.4 of Elasticsearch and Kibana. You can run them on any platform you like, whether on-prem or in the cloud.

DATA
The data we will search comes from Kaggle's collection of open-license datasets and is called "recipes." The original dataset can be downloaded from here (it is discussed on page 173 of Vector Search for Practitioners with Elastic). I downloaded it and named the file "allrecipes.csv."

Alternatively, you can download the dataset from my GitHub. Keep that link handy; you will download other files from it in the upcoming instructions.

Deduplication
There are entries that repeat in the original dataset, so I built a small Python script to deduplicate them. The script is in the GitHub project folder and is called dedupecsv.py. The deduplicated result file, allrecipes_dedupe.csv, is also in the project folder. If you want to run dedupecsv.py yourself, the Python library pandas needs to be installed (e.g. pip install pandas). But you do NOT have to run the deduplication yourself, because it has already been done and the result is allrecipes_dedupe.csv. Download that file.

Indexing Data
I was able to use both Filebeat and (even easier) the File Uploader in Kibana to ingest the data into an Elasticsearch index, most recently on version 8.13.4 but on older versions as well. If you use Filebeat to ingest allrecipes_dedupe.csv into Elasticsearch, the configuration file is filebeat.yml and a copy is in the project folder. I followed the "quick install" documentation to install and run Filebeat on a "self-managed" system. If you are using a remote system (like Cloud or Strigo), scp both filebeat.yml and allrecipes_dedupe.csv to your instance.

Even easier, with the File Uploader you just download allrecipes_dedupe.csv and drag-and-drop it into Kibana.

Using either method, create an Elasticsearch index called "recipes." Filebeat will create it for you; with the File Uploader you enter that name yourself.

KIBANA

Step 1: Let's verify that the recipes index was successfully ingested into Elasticsearch. In the Kibana Console (Kibana Main Menu -> Management -> Dev Tools) run this command:

GET _cat/indices?v&s=index

The _cat API returns output as rows and columns. The characters after the question mark (?v&s=index) are options to show header labels (v) and to sort on the "index" column (s=index).
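If the output is noisy, you can also pick out just the columns you care about with the h option (a quick sketch; these are standard _cat/indices column names):

GET _cat/indices/recipes?v&h=index,health,docs.count,store.size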

Another way to verify the recipes index is to use the _count API.

GET recipes/_count

It should show something like 4808 documents.

Step 2: Download ELSER

The Kibana UI lets us download models in the Machine Learning area. Go to Main Menu -> Machine Learning. Find and click on Trained Models. You should see a short list of available models. For a non-Intel environment, download .elser_model_2. If your platform runs on an Intel x86 chip, download the specially optimized .elser_model_2_linux-x86_64 instead.
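You can also confirm which models are installed from the Console (a quick sketch; the filter_path option just trims the response down to the model IDs):

GET _ml/trained_models?filter_path=trained_model_configs.model_id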

Step 3: Start the model

Start a deployment of that model for inference. Use one allocation and four threads. In most cases an allocation consumes a core, and the threads run on that core. The API calls below give the deployment the name "elser_model."

If you are running with an Intel x86 chip:

POST _ml/trained_models/.elser_model_2_linux-x86_64/deployment/_start?deployment_id=elser_model&number_of_allocations=1&threads_per_allocation=4

otherwise:

POST _ml/trained_models/.elser_model_2/deployment/_start?deployment_id=elser_model&number_of_allocations=1&threads_per_allocation=4

If you need to start and stop the model you can do:
POST _ml/trained_models/elser_model/deployment/_stop

If a pipeline is calling the model (which we will later):
POST _ml/trained_models/elser_model/deployment/_stop?force=true
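If you later need more throughput, the number of allocations can usually be changed on a running deployment without stopping it (a sketch; pick a number that fits your hardware):

POST _ml/trained_models/elser_model/deployment/_update
{
  "number_of_allocations": 2
}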

Make sure the model is in a started state:

GET _ml/trained_models/_stats/
GET _ml/trained_models/_stats?filter_path=trained_model_stats.deployment_stats.state

Step 4: Explore the recipes index

Run these commands to examine the content and attributes of the recipes index.

To examine some documents in the index:

GET recipes/_search

To see how many documents there are total:

GET recipes/_count

To see the fields and their data types:

GET recipes/_mapping

To see all the attributes of the index (mappings, settings, aliases), it's actually easiest to just run:

GET recipes

Step 5: Improve the Mapping

If we run this:

GET recipes/_search?filter_path=hits.hits._source.id

Notice that all the IDs are small integers. However, if we go back and examine the mapping, we will see that the datatype for id is likely either text/keyword or a long (depending on which tool we used to index recipes).
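Rather than scanning the whole mapping, you can ask for a single field's mapping directly (a quick sketch using the get field mapping API):

GET recipes/_mapping/field/id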

Notice also that the summary field is datatype text and defaults to the standard analyzer. It would be useful for our searches if english-analyzer results were available (the english analyzer stems words). That way, if we search summary for "stewed" tomatoes, say, results come back for stewed, stew, stews, stewing, etc., which can make it easier to find relevant recipes.
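You can see the stemming for yourself with the _analyze API (a quick sketch; no index required):

GET _analyze
{
  "analyzer": "english",
  "text": "stewed tomatoes"
}

The tokens come back as "stew" and "tomato," so all the variants of a word land on the same stemmed term.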

Additionally, we will need to create a field to hold the results of vectorizing the data. Later we are going to configure an ingest pipeline with an inference processor that directs the model to write its results to a field called ml.tokens. As explained here, the ELSER results should be stored in a field of datatype sparse_vector to hold our vector embeddings.
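For context, ELSER's output is a map of expanded terms to weights, which is why sparse_vector is the right datatype. Here is an illustrative (entirely made-up) example of the shape of the data that will end up in ml.tokens; the real output typically contains many more tokens than this:

{
  "ml": {
    "tokens": {
      "shrimp": 2.11,
      "prawn": 1.37,
      "seafood": 0.82
    }
  }
}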

With these alterations in mind, let's create a new index with all the improved mapping attributes.

PUT recipes_embeddings
{
  "mappings": {
    "properties": {
      "id": {
        "type": "short"
      },
      "group": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "ingredient": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "n_rater": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "n_reviewer": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "name": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "process": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "rating": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "summary": {
        "type": "text",
        "analyzer": "english",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "ml.tokens": {
        "type": "sparse_vector"
      }
    }
  }
}
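After creating it, it's worth a quick sanity check that the new field landed with the right datatype (a quick sketch):

GET recipes_embeddings/_mapping/field/ml.tokens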

Step 6: Other document fixes

Let's again examine some of the documents in recipes.

GET recipes/_search
{
  "size": 2000,
  "_source": ["ingredient"]
}

If you Ctrl+F (find) on the right-hand side of the Console, you might notice there are some documents with double or maybe even triple quotes.

Here we create an ingest pipeline with a gsub processor that strips the stray double quotes from the ingredient field (the pattern matches each " character, and the empty replacement removes it).
We will call our pipeline "doublequotes."

PUT _ingest/pipeline/doublequotes
{
  "processors": [
    {
      "gsub": {
        "field": "ingredient",
        "pattern": "\"(.*?)",
        "replacement": ""
      }
    }
  ]
}

To test our pipeline we can use the _simulate API.

POST _ingest/pipeline/doublequotes/_simulate
{
  "docs": [
    {
      "_index": "recipes",
      "_id": "id",
      "_source": {
        "ingredient": """shredded cheese, and even some cilantro for a great-tasting breakfast burrito that will keep your appetite curbed all day long.","prep: 15 mins,cook: 5 mins,total: 20 mins,Servings: 2,Yield: 2 burritos","2 (10 inch) flour tortillas + 1 tablespoon butter + 4 medium eggs + 1 cup shredded mild Cheddar cheese + 1 Hass avocado - peeled, pitted, and sliced + 1 small tomato, chopped + 1 small bunch fresh cilantro, chopped, or to taste (Optional) + 1 pinch salt and ground black pepper to taste + 1 dash hot sauce, or to taste (Optional)""",
        "tags": 2342
      }
    }
  ]
}

Step 7: Pipeline with Inference

Below we create a pipeline that both cleans the quotes and applies the inference processor. The inference processor is what we use to run our model (ELSER, in this case) against the ingredient field in the index. Recall that the name of our deployment is elser_model, which is what we pass here as the model_id. Note that each processor must be its own object in the processors array. We will call this pipeline elser_clean_recipes.

PUT _ingest/pipeline/elser_clean_recipes
{
  "processors": [
    {
      "pipeline": {
        "name": "doublequotes"
      }
    },
    {
      "inference": {
        "model_id": "elser_model",
        "target_field": "ml",
        "field_map": {
          "ingredient": "text_field"
        },
        "inference_config": {
          "text_expansion": {
            "results_field": "tokens"
          }
        }
      }
    }
  ]
}

Notice that the model defaults to vectorizing a field called "text_field." The "field_map" setting is where we configure the inference processor to use a different field (ingredient, in this case).
Be sure to test.

POST _ingest/pipeline/elser_clean_recipes/_simulate
{
  "docs": [
    {
      "_index": "recipes",
      "_id": "id",
      "_source": {
        "ingredient": """prep: 20 mins,cook: 20 mins,total: 40 mins,Servings: 4,Yield: 4 servings","½ small onion, chopped + ½ tomato, chopped + 1 jalapeno pepper, seeded and minced + 1 sprig fresh cilantro, chopped + 6 eggs, beaten + 4 (10 inch) flour tortillas + 2 cups shredded Cheddar cheese + ¼ cup sour cream, for topping + ¼ cup guacamole, for topping""",
        "tags": 2342
      }
    },
    {
      "_index": "recipes",
      "_id": "id",
      "_source": {
        "ingredient": """"shredded cheese, and even some cilantro for a great-tasting breakfast burrito that will keep your appetite curbed all day long.","prep: 15 mins,cook: 5 mins,total: 20 mins,Servings: 2,Yield: 2 burritos","2 (10 inch) flour tortillas + 1 tablespoon butter + 4 medium eggs + 1 cup shredded mild Cheddar cheese + 1 Hass avocado - peeled, pitted, and sliced + 1 small tomato, chopped + 1 small bunch fresh cilantro, chopped, or to taste (Optional) + 1 pinch salt and ground black pepper to taste + 1 dash hot sauce, or to taste (Optional)"""
      }
    },
    {
      "_index": "recipes",
      "_id": "id",
      "_source": {
        "ingredient": """shredded cheese, and even some cilantro for a great-tasting breakfast burrito that will keep your appetite curbed all day long.","prep: 15 mins,cook: 5 mins,total: 20 mins,Servings: 2,Yield: 2 burritos","2 (10 inch) flour tortillas + 1 tablespoon butter + 4 medium eggs + 1 cup shredded mild Cheddar cheese + 1 Hass avocado - peeled, pitted, and sliced + 1 small tomato, chopped + 1 small bunch fresh cilantro, chopped, or to taste (Optional) + 1 pinch salt and ground black pepper to taste + 1 dash hot sauce, or to taste (Optional)"""
      }
    }
  ]
}

Step 8: Process Data through the Pipeline

Now that we have a pipeline, it is time to process our index. We are going to use the _reindex API to send data from the recipes index to recipes_embeddings. On the way, the data will pass through our pipeline to create the embeddings.

Reindexing can take a lot of time, so here we run with the option

wait_for_completion=false

so the command returns immediately with a task ID that we can use to check on progress. Be sure to copy and paste that ID somewhere.

Additionally, the reindex uses the options...

requests_per_second=-1&timeout=60m

...to run as fast as it can and not to time out too soon, respectively.

WARNING: this can take approximately 15-30 minutes.

POST _reindex?wait_for_completion=false&requests_per_second=-1&timeout=60m
{
  "conflicts": "proceed", 
  "source": {
    "index": "recipes",
    "size": 500
  },
  "dest": {
    "index": "recipes_embeddings",
    "pipeline": "elser_clean_recipes"
  }
}

Copy and use the task ID to track the progress of the reindex.

GET _tasks/< paste task number here >

For example, I copy & pasted this ID:
LAV3l8oZTmaR9p8VUVqO3g:373447

So I would run this command to check on progress:

GET _tasks/LAV3l8oZTmaR9p8VUVqO3g:373447
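If the reindex misbehaves or you simply want to start over, the running task can be cancelled (a sketch, reusing the example task ID from above):

POST _tasks/LAV3l8oZTmaR9p8VUVqO3g:373447/_cancel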

If needed, you can remove the recipes_embeddings index like this.
DELETE recipes_embeddings

Step 9: Examine processed documents

After at least one batch of documents completes, examine the results.

GET _cat/indices?v&s=i

GET recipes_embeddings/_search

The lengthy ml.tokens field makes the output awkward to read. You can suppress it like this.

GET recipes_embeddings/_search
{ "_source": { "excludes": "ml"} }

GET recipes_embeddings/_count

Once the reindex has finished, this should again show 4808 documents.

We can run this aggregation to find all the different values in "group" that we can query.

GET recipes_embeddings/_search?size=0
{
  "aggs": {
    "all the groups": {
      "terms": {
        "field": "group.keyword",
        "size": 200
      }
    }
  }
}

How many buckets are there?

GET recipes_embeddings/_search
{
  "size": 0, 
  "aggs": {
    "how many buckets": {
      "cardinality": {
        "field": "group.keyword",
        "precision_threshold": 200
      }
    }
  }
}

I got 174 different groups.

To isolate the group field in the output so we can see it document by document, you can run this:

GET recipes_embeddings/_search?size=1000&_source=group



Searches


Finally, the data is prepared and we can perform searches on the recipes_embeddings index. Let's compare running first without the ELSER results and then with ELSER.

Search 1: Old Fashioned

First we will search our recipes for a cocktail called the Old Fashioned. Here's how to make one.

-- Old Fashioned - bourbon cocktail --
1. Add one or two cherries to an Old Fashioned glass and lightly mash with a muddler.
2. Rub an orange peel around the inside of the glass rim, then add the peel to the cherries.
3. Add ice cubes, rye whiskey, simple syrup, and bitters.
4. Stir to combine and serve.

First without ELSER

GET recipes_embeddings/_search
{
  "_source": {
    "excludes": "ml",
    "includes": ["name", "group", "summary", "ingredient"]
  },
  "query": {
    "bool": {
      "should": [
        {
          "wildcard": {
            "group": {
              "value": "drinks*"
            }
          }
        },
        {
          "multi_match": {
            "type": "phrase",
            "query": "old fashion",
            "fields": [
              "summary",
              "name"
            ]
          }
        },
        {
          "match": {
            "summary": "delicious sensational"
          }
        }
      ]
    }
  }
}

Run that in your console and you will see some pretty bad results among the top hits.
For me, none of them were even drinks (even though we searched on the drinks* group).
Many of my results had phrases like "good old fashion meals and dishes...", which is not what we are after.

Now let's search with ELSER... and see the excellent results!

GET recipes_embeddings/_search
{
  "_source": {
    "excludes": "ml",
    "includes": ["name", "group", "summary", "ingredient"]
  },
  "sub_searches": [
    {
      "query": {
        "bool": {
          "should": [
            {
              "wildcard": {
                "group": {
                  "value": "drinks*"
                }
              }
            },
            {
              "multi_match": {
                "query": "old fashioned",
                "type": "phrase",
                "fields": [
                  "summary",
                  "name"
                ]
              }
            }
          ]
        }
      }
    },
    {
      "query": {
        "text_expansion": {
          "ml.tokens": {
            "model_id": "elser_model",
            "model_text": "old fashioned bourbon whiskey whisky drink"
          }
        }
      }
    }
  ],
  "rank": {
    "rrf": {
      "window_size": 500,
      "rank_constant": 60
    }
  }
}

Notice that all the top hits are adult drinks, and a couple even start with "Old Fashion" in the name.
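A quick note on the rank section, since every hybrid search below reuses it: Reciprocal Rank Fusion (RRF) merges the result lists of the sub_searches by scoring each document as the sum of 1 / (rank_constant + rank) over the lists it appears in. For example, a recipe ranked 1st by the lexical sub-search and 3rd by the text_expansion sub-search scores 1/(60+1) + 1/(60+3) ≈ 0.0164 + 0.0159 = 0.0323. The window_size controls how deep into each sub-search's results RRF looks, and a larger rank_constant flattens the differences between neighboring ranks.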


Search 2: Shrimp dishes

Without ELSER

GET recipes_embeddings/_search
{
  "_source": {
    "excludes": "ml"
  },
  "query": {
    "bool": {
      "should": [
        {
          "wildcard": {
            "group": {
              "value": "main*"
            }
          }
        },
        {
          "multi_match": {
            "type": "phrase", 
            "query": "tempura shrimp",
            "fields": [
              "ingredient",
              "name^2"
            ]
          }
        },
        {
          "match": {
            "summary": "tasty delightful"
          }
        }
      ]
    }
  }
}

Pretty bad results in the top 5: I got recipes like pork tenderloin, Spanish sauce, carrot salad, .... Nothing with shrimp at all.

With ELSER

GET recipes_embeddings/_search
{
  "_source": {
    "excludes": "ml"
  },
  "sub_searches": [
    {
      "query": {
        "bool": {
          "should": [
            {
              "wildcard": {
                "group": {
                  "value": "main*"
                }
              }
            },
            {
              "multi_match": {
                "query": "tempura shrimp",
                "type":"phrase",
                "fields": [
                  "ingredient",
                  "name^2"
                ]
              }
            }/*,
            {
              "match": {
                "summary": "tasty delightful"
              }
            }*/
          ]
        }
      }
    },
    {
      "query": {
        "text_expansion": {
          "ml.tokens": {
            "model_id": "elser_model",
            "model_text": "tempura shrimp"
          }
        }
      }
    }
  ],
  "rank": {
    "rrf": {
      "window_size": 500,
      "rank_constant": 60
    }
  }
}

Much better in the top 5: Shrimp with Pasta, Shrimp Scampi, Penne with Shrimp, Grilled Scampi, Shrimp Quiche.
Notice how, with semantic search, Elasticsearch works out that terms like scampi and prawns are also relevant to shrimp.


Search 3: Spaghetti dishes

Without ELSER

GET recipes_embeddings/_search
{
  "_source": {
    "excludes": "ml"
  },
  "query": {
    "bool": {
      "should": [
        {
          "wildcard": {
            "group": {
              "value": "main*"
            }
          }
        },
        {
          "multi_match": {
            "type": "phrase", 
            "query": "Spaghetti Bolognese",
            "fields": [
              "ingredient",
              "name^2"
            ]
          }
        },
        {
          "match": {
            "summary": "tasty delightful"
          }
        }
      ]
    }
  }
}

Bad again in the top 5: pork tenderloin, Med. sauce, carrot salad, lime chicken, ...
No spaghetti meals at all for me.

With ELSER

GET recipes_embeddings/_search?filter_path=hits.hits._source
{
  "_source": {
    "excludes": "ml"
  },
  "sub_searches": [
    {
      "query": {
        "bool": {
          "should": [
            {
              "wildcard": {
                "group": {
                  "value": "main*"
                }
              }
            },
            {
              "multi_match": {
                "query": "Spaghetti Bolognese",
                "type":"phrase",
                "fields": [
                  "ingredient",
                  "name^2"
                ]
              }
            }/*,
            {
              "match": {
                "summary": "tasty delightful"
              }
            }*/
          ]
        }
      }
    },
    {
      "query": {
        "text_expansion": {
          "ml.tokens": {
            "model_id": "elser_model",
            "model_text": "main Spaghetti Bolognese"
          }
        }
      }
    }
  ],
  "rank": {
    "rrf": {
      "window_size": 500,
      "rank_constant": 60
    }
  }
}

Much improved: many of the top hits have spaghetti or penne among the ingredients.
Again within the top results: pennes, pastas, spaghettis.


Search 4: Chocolate

Without ELSER

GET recipes_embeddings/_search
{
  "_source": {"excludes": "ml"},
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "name": "dessert"
          }
        },
        {
          "match": {
            "ingredient": "chocolate"
          }
        },
        {
          "match": {
            "summary": "tasty delightful"
          }
        }
      ]
    }
  }
}

A shake and crepes...?
I could not find a single ingredient with chocolate.

With ELSER

GET recipes_embeddings/_search
{
  "_source": {"excludes": "ml"}, 
  "sub_searches": [
    {
      "query": {
        "bool": {
          "should": [
             {
              "match": {
                "name": "dessert"
              }
            },
            {
              "match": {
                "ingredient": "chocolate"
              }
            },
            {
              "match": {
                "summary": "tasty delightful"
              }
            }
          ]
        }
      }
    },
    {
      "query": {
        "text_expansion": {
          "ml.tokens": {
            "model_id": "elser_model",
            "model_text": "dessert chocolate"
          }
        }
      }
    }
  ],
  "rank": {
    "rrf": {
      "window_size": 50,
      "rank_constant": 20
    }
  }
}

Peppermint bark is a chocolate confection, even though "chocolate" is not in its name.
Nanaimo bars are a chocolate-topped cookie bar. I also see hot chocolate, chocolate muffins, chocolate cake, Oreo truffles, and cake balls where the ingredient list includes chocolate.
Now I see a lot of ingredients with chocolate.


Let's look again at the "groups" we can query.

GET recipes_embeddings/_search?size=0
{
  "aggs": {
    "all the groups": {
      "terms": {
        "field": "group.keyword",
        "size": 200
      }
    }
  }
}

How many recipes are in everyday-cooking?

GET recipes_embeddings/_count
{
  "query": {
    "wildcard": {
      "group.keyword": {
        "value": "everyday-cooking*"
      }
    }
  }
}

I got 310.

Search 5: Fish Sandwich

Let's look for "Fish Sandwich" in everyday-cooking.

Without ELSER

GET recipes_embeddings/_search
{
  "_source": {
    "excludes": "ml"
  },
  "query": {
    "bool": {
      "should": [
        {
          "wildcard": {
            "group": {
              "value": "everyday-cooking*"
            }
          }
        },
        {
          "multi_match": {
            "type": "phrase", 
            "query": "fish sandwich",
            "fields": [
              "ingredient",
              "name^2"
            ]
          }
        }
      ]
    }
  }
}

None!
Wow, no fish sandwiches at all...?

Add embedding search with ELSER.

GET recipes_embeddings/_search
{
  "_source": {
    "excludes": "ml"
  },
  "sub_searches": [
    {
      "query": {
        "bool": {
          "should": [
            {
              "wildcard": {
                "group": {
                  "value": "everyday-cooking*"
                }
              }
            },
            {
              "multi_match": {
                "query": "fish sandwich",
                "type":"phrase",
                "fields": [
                  "ingredient",
                  "name^2"
                ]
              }
            }
          ]
        }
      }
    },
    {
      "query": {
        "text_expansion": {
          "ml.tokens": {
            "model_id": "elser_model",
            "model_text": "fish sandwich"
          }
        }
      }
    }
  ],
  "rank": {
    "rrf": {
      "window_size": 500,
      "rank_constant": 60
    }
  }
}

Do a "find" on the right-hand-side for "sandwich".
I see tuna patties, tuna salads - lots of fish and many are sandwiches.


Let's check how many recipes are in the "main" group. (A match query analyzes its input, so the * below is simply stripped and we match the token "main".)

GET recipes_embeddings/_search
{
  "_source": {"excludes": "ml"},
  "query": {
    "bool": {
      "must": [
        {"match": {"group": "main*"}}
      ]
    }
  }
}

I got 458.


Search 6: Tenderloin Steak

Note that in the USA we say "tenderloin" or "tenderloin steak," but in France the cut is called Chateaubriand. Also note that "Chateaubriand" does not appear in any recipe. A multi_match query can search many fields at once, like this:

GET recipes_embeddings/_search
{
  "query": {
    "multi_match": {
      "query": "Chateaubriand",
      "fields": ["ingredient","summary","name"]
    }
  }
}

zero results

Without ELSER

GET recipes_embeddings/_search
{
  "_source": {
    "excludes": "ml",
    "includes": ["name", "group", "summary", "ingredient"]
  },
  "query": {
    "bool": {
      "should": [
        {
          "wildcard": {
            "group": {
              "value": "main*"
            }
          }
        },
        {
          "multi_match": {
            //"type": "phrase",
            "query": "tenderloin steak Chateaubriand beef",
            "fields": [
              "ingredient",
              "name^2"
            ]
          }
        },
        {
          "match": {
            "summary": "delicious Chateaubriand"
          }
        },
        {
          "match": {
            "ingredient": "beef"
          }
        }
      ]
    }
  }
}

Pretty bad results in the top 10: salt and pepper fries, salt bread, pork tenderloin again, and the like.

With ELSER

GET recipes_embeddings/_search
{
  "_source": {
    "excludes": "ml",
    "includes": ["name", "group", "summary", "ingredient"]
  },
  "sub_searches": [
    {
      "query": {
        "bool": {
          "should": [
            {
              "wildcard": {
                "group": {
                  "value": "main*"
                }
              }
            },
            {
              "multi_match": {
                "query": "tenderloin steak Chateaubriand beef",
                //"type": "phrase",
                "fields": [
                  "ingredient",
                  "name^2"
                ]
              }
            },
            {
              "match": {
                "ingredient": {
                  "query": "beef"
                }
              }
            }
          ]
        }
      }
    },
    {
      "query": {
        "text_expansion": {
          "ml.tokens": {
            "model_id": "elser_model",
            "model_text": "tenderloin steak Chateaubriand beef"
          }
        }
      }
    }
  ],
  "rank": {
    "rrf": {
      "window_size": 500,
      "rank_constant": 60
    }
  }
}

A lot more actual steaks.

Congratulations! We have worked through a number of examples where vectorizing our data with ELSER enabled us to run semantic searches. We combined our regular, perhaps preexisting, term-based searches with vector search and found that hybrid search results are often more relevant than term searches alone.
