Dec 18th, 2023: [EN] The most magical time of the year: Using semantic search to find the most festive Harry Potter moments

Christmas at Hogwarts, anyone?

I don't know about you, but for me, Christmas usually means starting (yet another) Harry Potter marathon. While I'm a fan of the Wizarding World year-round, there is something extra festive about Christmas at Hogwarts.

Do you ever wish you had a way to find the most merry, cheery, present-filled moments of the series but you just don't have the time to comb through all 7 books and 8 movies? Enter Elastic Semantic Search!

Let's go on a magical journey of turning the Harry Potter books into an NLP index, and using the Elastic Python clients and vector search capabilities to run some really cool, very festive searches.

A magical search experience

This works by combining a few key concepts:

  • First, we do what Elastic does best: you know, search.
  • To take it a step further, we can use Eland to import supported models from Hugging Face, such as a sentiment analysis model, to get more nuance out of our text.
  • Finally, we can use ELSER to search by meaning rather than the usual keyword / filter system. In simple terms, we search with a more human understanding of language. (In fancy terms, we find the smallest distance between embedding space representations of sentences as vectors, using a similarity score, but let's not overcomplicate Christmas.)
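
For the curious, the "fancy terms" can be sketched in a few lines of Python. This is only a toy illustration of scoring sentences by vector similarity: the three-dimensional vectors and the `cosine_similarity` helper are made up for this example, and ELSER itself works with learned sparse token weights rather than tiny dense vectors like these.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, invented purely for illustration.
query = [0.9, 0.1, 0.3]             # "Harry getting a present"
gift_sentence = [0.8, 0.2, 0.4]     # a sentence about receiving a gift
teacher_sentence = [0.1, 0.9, 0.2]  # a sentence about a teacher being present

scores = {
    "gift": cosine_similarity(query, gift_sentence),
    "teacher": cosine_similarity(query, teacher_sentence),
}

# The sentence whose vector points in the most similar direction wins.
best = max(scores, key=scores.get)
print(best, scores)
```

The point is simply that "present" the gift and "present" the state of being somewhere can end up far apart once sentences are compared by meaning instead of by shared words.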

First things first: I got the books from a Kaggle dataset and did some extra processing in a Python notebook to get them into just the right shape for our index. We're focusing on the first three books for these examples.
We take each sentence and wrap it in a document dictionary, which gives us our ready-to-search database. It looks something like this:

[{'text_field': 'Mr and Mrs Dursley of number four Privet Drive were proud to say that they were perfectly normal thank you very much '},
 {'text_field': 'They were the last people youd expect to be involved in anything strange or mysterious because they just didnt hold with such nonsense '},
 {'text_field': 'Mr Dursley was the director of a firm called Grunnings which made drills '}]
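
The notebook itself isn't shown here, so the following is only a rough sketch of how raw text could be turned into documents of that shape. The `text_to_documents` helper and its naive sentence splitting are assumptions for illustration; the actual preprocessing (which also strips punctuation) lives in the project repo.

```python
import re


def text_to_documents(raw_text):
    """Split raw text into sentences and wrap each one in a document dict."""
    # Naive split: break after ., ! or ? when followed by whitespace.
    # Real prose (abbreviations, dialogue) needs a proper sentence splitter.
    sentences = re.split(r"(?<=[.!?])\s+", raw_text.strip())
    return [{"text_field": sentence} for sentence in sentences if sentence]


# A tiny made-up sample built from lines quoted in this post.
sample = "See you at Christmas. Merry Christmas said George! Come on Hermione its Christmas."
docs = text_to_documents(sample)
print(docs)
```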

We create a simple index on our Elastic cluster and batch upload each sentence/document to it with the Python client. You can check out the repo for the full project code here.
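
If you don't want to open the repo right away, here is a minimal sketch of what that upload step could look like. It assumes an already authenticated `client` from the `elasticsearch` Python package and an index called `hp_books` (the name used as the reindex source later in this post); the `bulk_actions` and `upload` helpers and the mapping are illustrative choices, not the project's actual code.

```python
def bulk_actions(docs, index_name):
    """Yield one bulk 'index' action per sentence document."""
    for doc in docs:
        yield {"_index": index_name, "_source": doc}


def upload(client, docs, index_name="hp_books"):
    """Create the index if it is missing, then bulk upload the documents."""
    # Imported here so the sketch can be read and loaded without the package installed.
    from elasticsearch import helpers

    if not client.indices.exists(index=index_name):
        client.indices.create(
            index=index_name,
            mappings={"properties": {"text_field": {"type": "text"}}},
        )
    # helpers.bulk returns (number of successful actions, list of errors)
    return helpers.bulk(client, bulk_actions(docs, index_name))


docs = [{"text_field": "Merry Christmas said George "}]
actions = list(bulk_actions(docs, "hp_books"))
print(actions)
```

With a live cluster, `upload(client, docs)` would send everything in batches instead of one request per sentence.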

Already we can do some simple searches like finding all mentions of "Christmas" in the books:

The status quo - keyword search

# client is an authenticated Elasticsearch client, index is the name of the index created above
response = client.search(index=index, query={
    "match": {
        "text_field": "christmas"
    }
})
We get back 203 results, here are the top ones:
meeting before Christmas 
Merry Christmas said George 
See you at Christmas 
Come on Hermione its Christmas 
So Ive come for Christmas 
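
Both the count and the sentences come straight out of the search response. A small helper along these lines could pull them out; `summarize` is a hypothetical name, and `fake_response` below is a hand-written stand-in shaped like a real search response so the sketch runs without a cluster.

```python
def summarize(response, n=5):
    """Return the reported hit count and the text of the top n hits."""
    total = response["hits"]["total"]["value"]
    sentences = [
        hit["_source"]["text_field"].strip()
        for hit in response["hits"]["hits"][:n]
    ]
    return total, sentences


# Stand-in for the object returned by client.search(...), trimmed to the parts we read.
fake_response = {
    "hits": {
        "total": {"value": 203, "relation": "eq"},
        "hits": [
            {"_source": {"text_field": "meeting before Christmas "}},
            {"_source": {"text_field": "Merry Christmas said George "}},
        ],
    }
}

total, sentences = summarize(fake_response)
print(f"We get back {total} results, here are the top ones:")
for sentence in sentences:
    print(sentence)
```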

This is fine when we search for a specific word, but it won't be as helpful when we are trying to remember a specific scene in the books... or was it the movies... and what was the specific line... argh... Something like "Harry getting a present". Will that work?
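
The query for this isn't shown, but presumably it is the same `match` search with a longer phrase. As a sketch (the `keyword_query` helper is just an illustrative wrapper): a `match` query ORs the analyzed terms by default, so any sentence containing "harry", "getting", "a" or "present" counts as a hit.

```python
def keyword_query(text, field="text_field"):
    """Build a plain keyword (match) query for the given text."""
    return {"match": {field: text}}


query = keyword_query("Harry getting a present")
print(query)

# With a live cluster this would be sent as:
# response = client.search(index=index, query=query)
```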

We get back 10,000 results (that is the default cap on the reported hit count, so the real number is at least that), here are the top ones:
Have a present 
Ive got yeh a present 
Wow Harry He had just opened Harrys present a Chudley Cannon hat 
Particularly in present company ! ‘Present company ?repeated Snape sardonically 
Early Christmas present for you Harry he said 
Id be happier if a teacher were present 

Okay, so keyword search doesn't get that "being present / living in the present" isn't quite the festive vibe I'm looking for. Let's add some magic to our search.

Enriching data with Sentiment Analysis and Semantic Search

With Eland, we can import various ML models and deploy them into the Elastic environment, allowing us to make inference calls to them either with new data or to enrich existing indices. There are loads of compatible models, but in this case we will use a sentiment analysis text classifier from here.

Any supported model can be imported through Docker, which makes deployment easier and ensures all dependencies and versions are satisfied.

docker run -it --rm elastic/eland \
    eland_import_hub_model \
      --cloud-id $CLOUD_ID \
      -u $USER -p $PASSWORD \
      --hub-model-id distilbert-base-uncased-finetuned-sst-2-english \
      --task-type text_classification \
      --start 

We can also import embedding models that allow us to search using text similarity to account for synonyms or related words. This can be done manually with another imported model, or we can use ELSER, which is Elastic's out-of-the-box semantic search model.

Once the models have been added to your Elastic cluster - you can use them at any time to run inference on text data. For example, a simple call would look like this:

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
doc_test = {"text_field": "Harry Potter is the best Christmas movie in the world"}
# docs takes a list of documents; one inference result comes back per document
result = client.ml.infer_trained_model(model_id=model_id, docs=[doc_test])
print(result["inference_results"])
[{'predicted_value': 'POSITIVE', 'prediction_probability': 0.9998637678432463}]

Awesome! We can now take sentiment into account in our searches, after all, we are looking for the absolute merry-est parts of the series.

Let's speed up the process: we can create a pipeline that evaluates all of our documents at once with both the sentiment and vector embedding models, and thus enrich our entire index. Each sentence in the books will get a POSITIVE or NEGATIVE sentiment label, as well as a vector representation to help us use semantic search later.

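# Ingest pipeline that runs both models on every document
client.ingest.put_pipeline(
    id="sentiment_and_elser",
    processors=[
        {
            # Sentiment classifier: writes its prediction under "sentiment"
            "inference": {
                "model_id": "distilbert-base-uncased-finetuned-sst-2-english",
                "target_field": "sentiment",
                # Maps the document field (left) to the model's input field (right)
                "field_map": {
                    "text_field": "text_field"
                }
            }
        },
        {
            # ELSER: writes the expanded tokens under "ml.tokens"
            "inference": {
                "model_id": ".elser_model_1",
                "target_field": "ml",
                "field_map": {
                    "text_field": "text_field"
                },
                "inference_config": {
                    "text_expansion": {
                        "results_field": "tokens"
                    }
                }
            }
        }
    ]
)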
# Reindex through the pipeline; with wait_for_completion=False this runs as a background task
client.reindex(
    source={"index": "hp_books"},
    dest={"index": "hp_books_enriched", "pipeline": "sentiment_and_elser"},
    wait_for_completion=False,
)
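
Because the reindex runs in the background, the enriched index is not ready the moment the call returns. One way to wait for it, sketched under the assumption that you keep the task ID from the reindex response (`response["task"]`), is to poll the tasks API. The `wait_for_task` helper and its defaults are hypothetical, not part of the original project.

```python
import time


def wait_for_task(client, task_id, poll_seconds=5.0, max_polls=120):
    """Poll the tasks API until the given task reports completed."""
    for _ in range(max_polls):
        status = client.tasks.get(task_id=task_id)
        if status["completed"]:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"Task {task_id} did not finish after {max_polls} polls")


# Usage with a live cluster (not run here):
# response = client.reindex(source=..., dest=..., wait_for_completion=False)
# wait_for_task(client, response["task"])
```

Running two models over every sentence takes a while, so a generous polling interval is kinder to the cluster than a tight loop.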

Now we have a complex index, with various ML fields we can use for our magical search.
Let's try it out with a few festive searches.

Most Magical Time of the Year

We can now run the same search with the added dimension of semantic information.

result = client.search(
    index='hp_books_enriched', 
    size=5,
    query={
        "text_expansion": {
            "ml.tokens": {
                "model_id":".elser_model_1",
                "model_text":"christmas"
            }
        }
    },
    request_timeout=30
)

Our new results look much more interesting than the previous search - which just focused on finding the word Christmas in the books rather than capturing some more context for the search.

Most unfortunate that it should happen on Christmas Day 
Christmas morning dawned cold and white 
What a jolly holiday its going to be 
Td invite you for Christmas but 
We can do all our Christmas shopping there !said Hermione 

How about the present example? Hopefully there is no more grammar-induced ambiguity in the search results.

result = client.search(
    index='hp_books_enriched', 
    size=5,
    query={
        "text_expansion": {
            "ml.tokens": {
                "model_id":".elser_model_1",
                "model_text":"Harry getting a present"
            }
        }
    },
    request_timeout=30
)

And we now get:

but how else was I supposed to get Harrys present to him ?Stick it back in the trunk Harry advised as the Sneakoscope whistled piercingly 
Oy !Presents !Harry reached for his glasses and put them on squinting through the semidarkness to the foot of his bed where a small heap of parcels had appeared 
Harry opened the last present to find a new handknitted sweater from Mrs Weasley and a large plum cake 
There you go Ron yelled happily stuffing a fistful of gold coins into Harrys hand for the Omnioculars !Now youve got to buy me a Christmas present ha !T 
Early Christmas present for you Harry he said 

Much better! Loads more festive!

Let's add some sentiment into the mix. Most of the results already look quite positive, so let's give it a challenge: can we find any negative present-giving instances?

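result = client.search(
    index='hp_books_enriched',
    size=5,
    query={
        "bool": {
            # "should" only influences ranking: ELSER scores how well each sentence matches
            "should": [{
                "text_expansion": {
                    "ml.tokens": {
                        "model_id": ".elser_model_1",
                        "model_text": "Harry getting a present"
                    }
                }
            }],
            # "must" restricts the results to sentences labelled NEGATIVE
            "must": [{
                "match": {
                    "sentiment.predicted_value": "NEGATIVE"
                }
            }]
        }
    }
)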
Yet another unusual thing about Harry was how little he looked forward to his birthdays 

Ouch. Okay we're not doing that again. Let's leave on a positive note.

What are THE most positive passages in the series?
(This also shows that we can use just one of the models in a search, in this case only sentiment.)

query = {
    "match": {
        "sentiment.predicted_value": "POSITIVE"
    }
}

response = client.search(index="hp_books_enriched", query=query, sort="sentiment.prediction_probability:desc")
The most positive sentences in the series:

A really really happy memory 
that means ‘great happiness 
it was incredible 
Dinner that night was a very enjoyable affair 
Relief warm sweeping glorious relief swept over Harry 
It was all delicious 
A powerful one 
Excellent !said Harry happily 
Brilliant mind 
Tonight will be an excellent time to do it 

Ah - now I'm in a great mood to watch the movies again.

Happy holidays everyone, and happy searching!
