Christmas at Hogwarts, anyone?
I don't know about you, but for me, Christmas usually means starting (yet another) Harry Potter marathon. While I'm a fan of the Wizarding World year-round, there is something extra festive about Christmas at Hogwarts.
Do you ever wish you had a way to find the most merry, cheery, present-filled moments of the series but you just don't have the time to comb through all 7 books and 8 movies? Enter Elastic Semantic Search!
Let's go on a magical journey of turning the Harry Potter books into an NLP index, and using the Elastic Python clients and vector search capabilities to do some really cool, very festive searches.
A magical search experience
This works by combining a few key concepts:
- First, we do what Elastic does best: search.
- To take it a step further, we can use Eland to import supported models from Hugging Face, such as a sentiment analysis model, to get more nuance out of our text.
- Finally, we can use ELSER to search by meaning rather than the usual keyword / filter system. In simple terms, we search with a more human understanding of language. (In fancy terms, we find the smallest distance between embedding space representations of sentences as vectors, using a similarity score, but let's not overcomplicate Christmas.)
First things first: I got the books from a Kaggle dataset and did some extra processing in a Python notebook to get them in just the right shape for our index. We're focusing on the first three books for these examples.
We're taking each sentence and creating a document dictionary as our ready-to-search database. It looks something like this:
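To make the "fancy terms" a little more concrete, here is a toy sketch of ranking sentences by a similarity score between vectors. The three-dimensional vectors are made up purely for illustration; real embeddings come from a model and have far more dimensions, and ELSER itself scores weighted tokens rather than dense vectors like these.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-picked toy vectors (an assumption for illustration, not model output)
toy_vectors = {
    "Early Christmas present for you Harry he said": [0.9, 0.8, 0.1],
    "Id be happier if a teacher were present": [0.1, 0.2, 0.9],
}
query_vector = [0.8, 0.9, 0.2]  # pretend this encodes "Harry getting a present"

# Highest similarity first
ranked = sorted(
    toy_vectors,
    key=lambda s: cosine_similarity(query_vector, toy_vectors[s]),
    reverse=True,
)
for sentence in ranked:
    print(round(cosine_similarity(query_vector, toy_vectors[sentence]), 3), sentence)
```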
[{'text_field': 'Mr and Mrs Dursley of number four Privet Drive were proud to say that they were perfectly normal thank you very much '},
{'text_field': 'They were the last people youd expect to be involved in anything strange or mysterious because they just didnt hold with such nonsense '},
{'text_field': 'Mr Dursley was the director of a firm called Grunnings which made drills '}]
We create a simple index on our Elastic cluster and batch upload each sentence/document to it with the Python client. You can check out the repo for the full project code here.
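The full upload code lives in the repo, but as a rough sketch, a batch upload with the Python client's bulk helper could look like the following. The function names are hypothetical, and the index name `hp_books` is assumed from the reindex call later in this post.

```python
def sentence_actions(sentences, index_name="hp_books"):
    """Yield one bulk action per sentence, shaped like the documents above."""
    for sentence in sentences:
        yield {"_index": index_name, "_source": {"text_field": sentence}}


def upload_sentences(client, sentences, index_name="hp_books"):
    """Send all sentences in batches; `client` is an Elasticsearch instance."""
    # Imported here so the generator above can be tried without the package installed
    from elasticsearch import helpers

    return helpers.bulk(client, sentence_actions(sentences, index_name))


sample = [
    "Mr Dursley was the director of a firm called Grunnings which made drills ",
]
actions = list(sentence_actions(sample))
print(actions[0])
```

Using a generator keeps memory use flat, since the helper consumes the sentences in chunks instead of needing the whole list of actions up front.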
Already we can do some simple searches like finding all mentions of "Christmas" in the books:
The status quo - key word search
response = client.search(index=index, query={
    "match": {
        "text_field": "christmas"
    }
})
We get back 203 results, here are the top ones:
meeting before Christmas
Merry Christmas said George
See you at Christmas
Come on Hermione its Christmas
So Ive come for Christmas
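For reference, here is a small sketch of how the count and the top sentences can be pulled out of a search response. The helper name is hypothetical, and the response below is a hand-written stand-in with the same shape as a real one, so the snippet runs without a cluster.

```python
def summarize_hits(response, field="text_field"):
    """Return the reported total and the text of each returned hit."""
    total = response["hits"]["total"]["value"]
    sentences = [hit["_source"][field] for hit in response["hits"]["hits"]]
    return total, sentences


# Stand-in for what client.search(...) returns (illustrative values only)
fake_response = {
    "hits": {
        "total": {"value": 2, "relation": "eq"},
        "hits": [
            {"_score": 1.0, "_source": {"text_field": "Merry Christmas said George"}},
            {"_score": 0.9, "_source": {"text_field": "See you at Christmas"}},
        ],
    }
}

total, sentences = summarize_hits(fake_response)
print(f"We get back {total} results, here are the top ones:")
for sentence in sentences:
    print(sentence)
```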
This is fine when we search for a specific word, but it won't be as helpful when we are trying to remember a specific scene in the books... or was it the movies... and what was the specific line... argh... something like "Harry getting a present". Will that work?
We get back 10000 results, here are the top ones:
Have a present
Ive got yeh a present
Wow Harry He had just opened Harrys present a Chudley Cannon hat
Particularly in present company ! ‘Present company ?repeated Snape sardonically
Early Christmas present for you Harry he said
Id be happier if a teacher were present
Okay, so keyword search doesn't get that "being present / living in the present" isn't quite the festive vibe I'm looking for. Let's add some magic to our search.
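Part of what happens here is that a match query, by default, treats the search text as separate terms and returns documents containing any of them, so every sentence with "Harry", "a", or "present" qualifies. A toy sketch of that behaviour in plain Python follows; the whitespace tokenizer is a simplification standing in for Elasticsearch's analyzer, not a reproduction of it.

```python
def tokens(text):
    """Lowercase and split on whitespace (a stand-in for a real analyzer)."""
    return set(text.lower().split())


def matches_any(query, sentence):
    """True if the sentence shares at least one term with the query (OR)."""
    return bool(tokens(query) & tokens(sentence))


def matches_all(query, sentence):
    """True only if every query term appears in the sentence (AND)."""
    return tokens(query) <= tokens(sentence)


query = "Harry getting a present"
sentences = [
    "Early Christmas present for you Harry he said",
    "Id be happier if a teacher were present",
    "Christmas morning dawned cold and white",
]

any_matches = [s for s in sentences if matches_any(query, s)]
all_matches = [s for s in sentences if matches_all(query, s)]
print(len(any_matches), "sentences match any term;", len(all_matches), "match all terms")
```

A count of exactly 10000 also deserves a second look: by default Elasticsearch stops counting at 10,000 hits, so the real number of matching sentences may well be higher.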
Enriching data with Sentiment Analysis and Semantic Search
With Eland, we can import various ML models and deploy them into the Elastic environment, allowing us to make inference calls to them either with new data, or to enrich existing indices. There are loads of compatible models but in this case we will use a Sentiment Analysis text classifier from here.
Any supported model can be imported through Docker for an easier deployment experience, and to ensure all dependencies and versions are satisfied.
docker run -it --rm elastic/eland \
eland_import_hub_model \
--cloud-id $CLOUD_ID \
-u $USER -p $PASSWORD \
--hub-model-id distilbert-base-uncased-finetuned-sst-2-english \
--task-type text_classification \
--start
We can also import embedding models that allow us to search using text similarity to account for synonyms or related words. This can be done manually with another imported model, or we can use ELSER, which is Elastic's out-of-the-box semantic search model.
Once the models have been added to your Elastic cluster - you can use them at any time to run inference on text data. For example, a simple call would look like this:
model_id = "distilbert-base-uncased-finetuned-sst-2-english"
doc_test = {"text_field": "Harry Potter is the best Christmas movie in the world"}
# docs takes a list of documents, so the single test document is wrapped in one
result = client.ml.infer_trained_model(model_id=model_id, docs=[doc_test])
print(result["inference_results"])
[{'predicted_value': 'POSITIVE', 'prediction_probability': 0.9998637678432463}]
Awesome! We can now take sentiment into account in our searches, after all, we are looking for the absolute merry-est parts of the series.
Let's speed up the process - we can create a pipeline to evaluate all of our documents at once with both the sentiment and vector embedding models and thus enrich our entire index. So each sentence in the book will get a Positive or Negative sentiment label, as well as a vector representation to help us use semantic search later.
client.ingest.put_pipeline(
    id="sentiment_and_elser",
    processors=[
        {
            "inference": {
                "model_id": "distilbert-base-uncased-finetuned-sst-2-english",
                "target_field": "sentiment",
                "field_map": {
                    "Sentence": "text_field"
                }
            }
        },
        {
            "inference": {
                "model_id": ".elser_model_1",
                "target_field": "ml",
                "field_map": {
                    "Sentence": "text_field"
                },
                "inference_config": {
                    "text_expansion": {
                        "results_field": "tokens"
                    }
                }
            }
        }
    ]
)
client.reindex(
    source={"index": "hp_books"},
    dest={"index": "hp_books_enriched", "pipeline": "sentiment_and_elser"},
    wait_for_completion=False
)
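One detail the snippet above leaves implicit: for ELSER's output to be searchable with text expansion, the Elastic documentation has the token field mapped as `rank_features` in the destination index before any documents are ingested. A minimal sketch of such a mapping, assuming the field names used in the pipeline above (the helper names are hypothetical and not taken from the project repo), to be run before the reindex:

```python
def enriched_mappings():
    """Mappings for the enriched index: raw text plus ELSER's weighted tokens."""
    return {
        "properties": {
            "text_field": {"type": "text"},
            # target_field "ml" + results_field "tokens" in the pipeline => "ml.tokens"
            "ml.tokens": {"type": "rank_features"},
        }
    }


def create_enriched_index(client, index_name="hp_books_enriched"):
    """Create the destination index; `client` is an Elasticsearch instance."""
    return client.indices.create(index=index_name, mappings=enriched_mappings())


mappings = enriched_mappings()
print(mappings["properties"]["ml.tokens"])
```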
Now we have a complex index, with various ML fields we can use for our magical search.
Let's try it out with a few festive searches.
Most Magical Time of the Year
We can now run the same search with the added dimension of the Semantic information.
result = client.search(
    index='hp_books_enriched',
    size=5,
    query={
        "text_expansion": {
            "ml.tokens": {
                "model_id": ".elser_model_1",
                "model_text": "christmas"
            }
        }
    },
    request_timeout=30
)
Our new results look much more interesting than the previous search - which just focused on finding the word Christmas in the books rather than capturing some more context for the search.
Most unfortunate that it should happen on Christmas Day
Christmas morning dawned cold and white
What a jolly holiday its going to be
Td invite you for Christmas but
We can do all our Christmas shopping there !said Hermione
How about the present example? Hopefully there is no more grammar-induced ambiguity in the search results.
result = client.search(
    index='hp_books_enriched',
    size=5,
    query={
        "text_expansion": {
            "ml.tokens": {
                "model_id": ".elser_model_1",
                "model_text": "Harry getting a present"
            }
        }
    },
    request_timeout=30
)
And we now get:
but how else was I supposed to get Harrys present to him ?Stick it back in the trunk Harry advised as the Sneakoscope whistled piercingly
Oy !Presents !Harry reached for his glasses and put them on squinting through the semidarkness to the foot of his bed where a small heap of parcels had appeared
Harry opened the last present to find a new handknitted sweater from Mrs Weasley and a large plum cake
There you go Ron yelled happily stuffing a fistful of gold coins into Harrys hand for the Omnioculars !Now youve got to buy me a Christmas present ha !T
Early Christmas present for you Harry he said
Much better! Loads more festive!
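Since both semantic searches share the same shape and differ only in the search text, a small helper can cut down on the repetition. This is just a convenience sketch: the function names are hypothetical, and the defaults are copied from the queries above.

```python
def elser_query(text, model_id=".elser_model_1", tokens_field="ml.tokens"):
    """Build a text_expansion query body for the given search text."""
    return {
        "text_expansion": {
            tokens_field: {
                "model_id": model_id,
                "model_text": text,
            }
        }
    }


def semantic_search(client, text, index="hp_books_enriched", size=5):
    """Run the query; `client` is an Elasticsearch instance."""
    return client.search(index=index, size=size, query=elser_query(text))


query = elser_query("Harry getting a present")
print(query)
```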
Let's add some sentiment into the mix. Most of the results already look quite positive, so let's give it a challenge: can we find any negative present-giving instances?
result = client.search(
    index='hp_books_enriched',
    size=5,
    query={
        "bool": {
            "should": [{
                "text_expansion": {
                    "ml.tokens": {
                        "model_id": ".elser_model_1",
                        "model_text": "Harry getting a present"
                    }
                },
            }],
            "must": [{
                "match": {
                    "sentiment.predicted_value": "NEGATIVE"
                }
            }]
        }
    }
)
Yet another unusual thing about Harry was how little he looked forward to his birthdays
Ouch. Okay we're not doing that again. Let's leave on a positive note.
What are THE most positive passages in the series?
(This shows that we can also use just one of the models in a search; in this case, only sentiment.)
query = {
    "match": {
        "sentiment.predicted_value": "POSITIVE"
    }
}
response = client.search(index="hp_books_enriched", query=query, sort="sentiment.prediction_probability:desc")
The most positive sentences in the series:
A really really happy memory
that means ‘great happiness
it was incredible
Dinner that night was a very enjoyable affair
Relief warm sweeping glorious relief swept over Harry
It was all delicious
A powerful one
Excellent !said Harry happily
Brilliant mind
Tonight will be an excellent time to do it
Ah - now I'm in a great mood to watch the movies again.
Happy holidays everyone, and happy searching!