How to apply English analyzer to a set of documents already indexed

Riccardo_Spampinato · May 26, 2017, 12:11pm

Hi everyone,
I'm a newbie in the Elasticsearch world and I'm trying to do something with email. In particular, I took the Enron email dataset and indexed it into Elasticsearch (just one folder of the whole dataset, to keep the computation manageable). This is the mapping I used:

https://pastebin.com/RyZ3ez3V

Now I'd like to do some processing (tokenizing, stemming, and removing stopwords) on the body of each email. I had a look at the language analyzers, but I do not know how to apply one of them to my case. I was wondering whether it's possible to use Python and elasticsearch-dsl to create a script that takes the body of each email and performs these kinds of actions. Thank you guys

polyfractal · May 26, 2017, 2:53pm

Tokenizing, stemming, stopword removal, etc. is called "analysis" in Elasticsearch, and it's a process that's executed when the document is indexed. That is, when you index a document, Elasticsearch takes the values from the document and runs them through the configured analysis pipeline. The result of that analysis is then stored in ES for querying.
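To see what a given analyzer does to a piece of text without indexing anything, the `_analyze` API can be handy. Here is a minimal sketch in Python that builds the request bodies for the `standard` and `english` analyzers. The helper name `analyze_request` and the sample sentence are made up for illustration, and the commented-out client call assumes the official `elasticsearch` Python package and a cluster on localhost; the exact keyword arguments can differ between client versions.

```python
import json


def analyze_request(analyzer, text):
    """Build the JSON body for the _analyze API (hypothetical helper)."""
    return {"analyzer": analyzer, "text": text}


# Sample text, made up for illustration.
sample = "The quick foxes were jumping over the lazy dogs"

standard_body = analyze_request("standard", sample)
english_body = analyze_request("english", sample)

# Show the request body that would be sent to POST /_analyze.
print(json.dumps(english_body, indent=2))

# With a running cluster, the bodies could be sent roughly like this
# (assumption: official Python client, default localhost connection):
#
#   from elasticsearch import Elasticsearch
#   es = Elasticsearch()
#   for body in (standard_body, english_body):
#       tokens = es.indices.analyze(body=body)["tokens"]
#       print(body["analyzer"], [t["token"] for t in tokens])
```

Comparing the two token lists should make the difference between the default analysis and the English one visible.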

So you'll need to configure an analyzer for the data you are indexing. You can read more about how analysis and mapping work here:

Because this is an index-time operation, you'll need to re-index those documents to get them properly stemmed, stopworded, etc. Right now they have been analyzed with the default analyzer (standard), which does only basic processing (lowercasing, splitting on spaces/special characters, etc.).
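As a rough sketch of what that could look like from Python: create a new index whose mapping sets the `english` analyzer on the text field, then copy the documents across so they get re-analyzed. The function names and the `doc_type` parameter are assumptions made for illustration; `message_body` is used as the default field name. Whether the mapping needs a type name depends on the Elasticsearch version (older versions expect one, newer ones do not), and the `text` field type assumes 5.x or later.

```python
import json


def english_index_body(field="message_body", doc_type=None):
    """Index-creation body that analyzes one text field with the
    built-in english analyzer (hypothetical helper)."""
    mappings = {
        "properties": {
            # "text" is the field type in 5.x and later; 2.x used "string".
            field: {"type": "text", "analyzer": "english"}
        }
    }
    if doc_type is not None:
        # Older Elasticsearch versions nest properties under a type name.
        mappings = {doc_type: mappings}
    return {"mappings": mappings}


def rebuild_with_english(es, source_index, target_index, doc_type=None):
    """Create target_index with the english mapping, then copy documents
    from source_index so they are analyzed again at index time.
    Assumes `es` is a client from the official `elasticsearch` package;
    keyword arguments may differ between client versions."""
    from elasticsearch import helpers

    es.indices.create(index=target_index,
                      body=english_index_body(doc_type=doc_type))
    helpers.reindex(es, source_index, target_index)


# Print the body for a typed mapping; "email" is a placeholder type name.
print(json.dumps(english_index_body(doc_type="email"), indent=2))
```

The important part is that the mapping exists on the index before any document is indexed into it; changing the analyzer afterwards does not re-analyze documents that are already there.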

Riccardo_Spampinato · May 27, 2017, 9:49am

@polyfractal
Thank you for your help. You completely understood my problem. I tried to follow your guides, but it seems that no analyzer other than the standard one is being applied. Here is the function I'm using to put the mapping:

https://pastebin.com/ReLXcKft

In the message_body field I added "analyzer": "english", then re-ran the script, which deletes any existing index with the same name and indexes the set of documents again.

Where am I going wrong?

Riccardo_Spampinato · May 29, 2017, 3:44pm

Can anyone help me? @polyfractal

system · June 26, 2017, 3:44pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
