Dec 7th, 2023: [EN] Find a book about Christmas without searching for “Christmas”

As we’re getting closer to the holiday season, I’m looking forward to getting cozy, picking up a new book and having a relaxing time.

But book discovery online using a search bar is not as easy as it seems. Most retail search engines rely solely on keyword matching, which is fine when we know exactly which title we're looking for, but it becomes much harder when we only have a vague idea of the theme.

So, for this quick post, I decided to explore how I could leverage Elasticsearch’s support for semantic search to help people who want to find a book about Christmas… without using the word “Christmas”.

For our example, we’ll be using a dataset containing book summaries. To follow along, you will need an Elasticsearch cluster up and running with the ELSER model downloaded and deployed.
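The dataset already ships with each synopsis pre-chunked into passages (the `synopsis_passages` field), which keeps each passage within ELSER's token limit. As a rough illustration only, here is a minimal sketch of how such chunking might look; the helper name and the 330-word window are my own arbitrary choices, not the ones used to build the dataset:

```python
# Hypothetical helper: split a long synopsis into word-bounded passages
# so each one stays small enough for the ELSER model. Illustrative only;
# the dataset used below is already chunked.
def chunk_synopsis(synopsis: str, max_words: int = 330) -> list[dict]:
    words = synopsis.split()
    return [
        {"text": " ".join(words[i : i + max_words])}
        for i in range(0, len(words), max_words)
    ]
```

Each resulting `{"text": ...}` object matches the shape the ingest pipeline below iterates over.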

First, let’s configure an ingest pipeline to generate the sparse vectors for each book synopsis.

# Initialize the Elasticsearch connection
from elasticsearch import Elasticsearch

es = Elasticsearch(
    cloud_id=ELASTIC_CLOUD_ID,
    api_key=ELASTIC_API_KEY,
    request_timeout=600,
)


# ingest pipeline definition
PIPELINE_ID="vectorize_books_elser"


es.ingest.put_pipeline(
    id=PIPELINE_ID,
    processors=[
        {
            "foreach": {
                "field": "synopsis_passages",
                "processor": {
                    "inference": {
                        "field_map": {"_ingest._value.text": "text_field"},
                        "model_id": ".elser_model_2_linux-x86_64",
                        "target_field": "_ingest._value.vector",
                        "on_failure": [
                            {
                                "append": {
                                    "field": "_source._ingest.inference_errors",
                                    "value": [
                                        {
                                            "message": "Processor 'inference' in pipeline 'vectorize_books_elser' failed with message '{{ _ingest.on_failure_message }}'",
                                            "pipeline": "vectorize_books_elser",
                                            "timestamp": "{{{ _ingest.timestamp }}}"
                                        }
                                    ]
                                }
                            }
                        ]
                    }
                }
            }
        }
    ],
)
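Before indexing anything, we can sanity-check the pipeline with the simulate API. This is a quick sketch; the sample document simply mirrors the passage structure the foreach processor expects:

```
POST _ingest/pipeline/vectorize_books_elser/_simulate
{
  "docs": [
    {
      "_source": {
        "synopsis_passages": [
          { "text": "A short sample passage." }
        ]
      }
    }
  ]
}
```

If the model is deployed, each passage in the response should come back with a vector field containing the predicted sparse vector.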

Then let’s create the books index where we will store our documents.

# Define the mapping
mappings = {
    "properties": {
        "title": {"type": "text"},
        "published_date": {"type": "text"},
        "synopsis": {"type": "text"},
        "synopsis_passages": {
            "type": "nested",
            "properties": {
                "vector": {
                    "properties": {
                        "is_truncated": {"type": "boolean"},
                        "model_id": {
                            "type": "text",
                            "fields": {
                                "keyword": {"type": "keyword", "ignore_above": 256}
                            }
                        },
                        "predicted_value": {"type": "sparse_vector"}
                    }
                }
            }
        }
    }
}

# Create the index (deleting any previously existing index)
es.indices.delete(index="books", ignore_unavailable=True)
es.indices.create(index="books", mappings=mappings)

Now we can use the bulk API to ingest our documents. Note that we pass the pipeline name created previously to enrich documents using our ELSER ML model.

from urllib.request import urlopen
import json

from elasticsearch.helpers import streaming_bulk

# Download the pre-chunked book summaries dataset
url = "https://raw.githubusercontent.com/elastic/elasticsearch-labs/main/datasets/book_summaries_1000_chunked.json"
response = urlopen(url)
books = json.loads(response.read())


def generate_actions(books):
    for book in books:
        yield {
            "_index": "books",
            "pipeline": "vectorize_books_elser",
            "_source": book,
        }


for ok, info in streaming_bulk(
    client=es,
    index="books",
    actions=generate_actions(books),
    chunk_size=50,
):
    if not ok:
        print(f"Unable to index {info['index']['_id']}: {info['index']['error']}")

We’re now ready for the interesting part: testing some queries and comparing the results. One great thing here is that Elasticsearch supports keyword search and semantic search on the same index, as long as the data has been indexed appropriately, which is the case here: we indexed the synopsis both as text and as an array of sparse vectors.

Here we will try to find books about Christmas using the following queries:

  • “Story with Santa Claus”
  • “Xmas stories”
  • “Gift receiving and festive season”

The query to search using keyword search (BM25) is the following:

POST books/_search
{
  "_source": ["title"], 
  "query": {
    "match": {
      "synopsis": "Xmas stories"
    }
  }
}
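To run the same lexical query from the Python client, a small helper can build the query body; this is a sketch reusing the `es` client from earlier (the helper name is my own):

```python
# Build the BM25 (lexical) query body shown above, so the same
# helper can be reused for each of the three test phrases.
def lexical_query(phrase: str) -> dict:
    return {"match": {"synopsis": phrase}}

# Against a live cluster, this would run as:
#   es.search(index="books", query=lexical_query("Xmas stories"),
#             source=["title"])
```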

And the query to search using semantic search is this one:

POST books/_search
{
  "_source": [
    "title"
  ],
  "query": {
    "nested": {
      "path": "synopsis_passages",
      "query": {
        "text_expansion": {
          "synopsis_passages.vector.predicted_value": {
            "model_id": ".elser_model_2_linux-x86_64",
            "model_text": "Xmas stories"
          }
        }
      }
    }
  }
}
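As with the lexical case, the semantic query body can be built from Python. A sketch (helper name is my own; the field path and model ID come from the mapping and pipeline above):

```python
# Build the nested text_expansion query body shown above.
def semantic_query(phrase: str,
                   model_id: str = ".elser_model_2_linux-x86_64") -> dict:
    return {
        "nested": {
            "path": "synopsis_passages",
            "query": {
                "text_expansion": {
                    "synopsis_passages.vector.predicted_value": {
                        "model_id": model_id,
                        "model_text": phrase,
                    }
                }
            },
        }
    }

# Against a live cluster:
#   es.search(index="books", query=semantic_query("Xmas stories"),
#             source=["title"])
```

Looping over the three test phrases and calling both helpers makes it easy to reproduce the side-by-side comparison below.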

Because none of the queries use the keyword “Christmas”, semantic search outperforms lexical search in this instance.

Looking at the results for the first query, “Story with Santa Claus”, the semantic search results are clearly more relevant.

For the other two test queries, we're getting the following results:

  • “Xmas stories”

    • Lexical search:

      1. Naked Lunch
      2. Lost Girls
      3. Gilgamesh the King
    • Semantic search:

      1. A Visit from St. Nicholas
      2. Light in August
      3. A Christmas Carol
  • “Gift receiving and festive season”

    • Lexical search:

      1. Smith of Wootton Major
      2. A Canticle for Leibowitz
      3. A Gift Upon the Shore
    • Semantic search:

      1. Smith of Wootton Major
      2. A Visit from St. Nicholas
      3. A Gift Upon the Shore

I’ll let you read the books to see which ones are the most related to the Christmas celebrations 🙂

Have a great holiday!
