How to apply English analyzer to a set of documents already indexed


(Riccardo Spampinato) #1

Hi everyone,
I'm new to Elasticsearch and I'm trying to do some work with email. In particular, I took the Enron email dataset and indexed it into Elasticsearch (just one folder of the whole dataset, to keep the computation manageable). This is the mapping I used:

https://pastebin.com/RyZ3ez3V

Now I'd like to run some processing steps (tokenizing, stemming, and removing stopwords) on the body of each email. I had a look at the language analyzers, but I don't know how to apply one of them to my case. I was wondering whether it's possible to use Python and elasticsearch-dsl to write a script that takes the body of each email and performs these kinds of actions. Thank you, guys.


(Zachary Tong) #2

Tokenizing/stemming/stopword removal/etc. is called "analysis" in Elasticsearch, and it's a process that's executed when the document is indexed. That is, when you index a document, Elasticsearch takes the field values from the document and runs them through the configured analysis pipeline. The result of that analysis is then stored in ES for querying.
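
To see what analysis does before changing any mapping, the _analyze API can run a piece of text through a named analyzer. Here is a minimal sketch in Python. It only builds the request bodies; sending them to POST /_analyze is left to whatever client you use. The helper name and the sample sentence are made up for illustration.

```python
import json


def build_analyze_body(text, analyzer):
    """Build a request body for Elasticsearch's POST /_analyze endpoint."""
    return {"analyzer": analyzer, "text": text}


# Hypothetical sample text; any email body would do.
sample = "The traders were meeting about the pipelines"

# One body per analyzer, so the returned tokens can be compared side by side.
for name in ("standard", "english"):
    print("POST /_analyze")
    print(json.dumps(build_analyze_body(sample, name), indent=2))
```

Comparing the tokens returned for the two analyzers shows what stemming and stopword removal would change for your text.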

So you'll need to configure an analyzer for the data you are indexing. You can read more about how analysis and mapping work here:

Because this is an index-time operation, you'll need to re-index those documents to get them properly stemmed, stopworded, etc. Right now they are analyzed with the default analyzer (standard), which does only basic processing (lowercasing, splitting on spaces/special characters, etc.).
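
As a rough sketch of what that can look like: create a new index whose text field uses the english analyzer, then copy the documents across with the _reindex API. This assumes Elasticsearch 7 or later (typeless mappings; on 6.x the "properties" block sits under a mapping type name). The index names enron and enron_english are hypothetical, and message_body stands in for whatever your email body field is called. The code only builds the JSON bodies, so sending them is up to your client.

```python
import json


def build_index_body(field, analyzer="english"):
    """Body for PUT /<new_index>: map `field` as text with the given analyzer."""
    return {
        "mappings": {
            "properties": {
                field: {"type": "text", "analyzer": analyzer},
            }
        }
    }


def build_reindex_body(source_index, dest_index):
    """Body for POST /_reindex: copy every document from source to dest."""
    return {
        "source": {"index": source_index},
        "dest": {"index": dest_index},
    }


# Hypothetical names, used only for illustration.
print("PUT /enron_english")
print(json.dumps(build_index_body("message_body"), indent=2))
print("POST /_reindex")
print(json.dumps(build_reindex_body("enron", "enron_english"), indent=2))
```

The new index has to exist with its mapping before any documents go into it; otherwise the field is mapped dynamically and falls back to the default analyzer.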


(Riccardo Spampinato) #3

@polyfractal
Thank you for your help. You understood my problem completely. I tried to follow your guides, but it seems that no analyzer other than the standard one is being applied. Here is the function I'm using to put the mapping:

https://pastebin.com/ReLXcKft

In the message_body field I added "analyzer": "english", and I re-ran the script, which deletes any existing index with the same name and indexes the set of documents again.

Where am I going wrong?
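
One way to narrow this down is to check which analyzer the index actually ended up with, by fetching GET /<index>/_mapping and looking up the field. Below is a minimal sketch that inspects such a response once it has been parsed into a Python dict. The helper name and the sample response are hypothetical, and the search is written loosely so that it works whether or not the mapping is nested under a type name.

```python
def find_field_analyzer(mapping_response, field):
    """Return the analyzer set on `field` in a parsed GET /<index>/_mapping response.

    Returns None when the field is mapped without an explicit analyzer
    (the index default applies); raises KeyError when the field is not mapped.
    """
    stack = [mapping_response]
    while stack:
        node = stack.pop()
        if not isinstance(node, dict):
            continue
        props = node.get("properties")
        if isinstance(props, dict) and field in props:
            return props[field].get("analyzer")
        # Keep descending: index name, "mappings", optional type name, nested fields.
        stack.extend(node.values())
    raise KeyError(field)


# Hypothetical response, shaped like a typeless (7.x+) mapping.
sample_response = {
    "enron": {
        "mappings": {
            "properties": {
                "message_body": {"type": "text", "analyzer": "english"},
            }
        }
    }
}
print(find_field_analyzer(sample_response, "message_body"))
```

If this comes back as None, or the field is missing, the mapping from the script was probably not applied to that field before the documents were indexed.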


(Riccardo Spampinato) #4

Can anyone help me? @polyfractal

