Hi everyone,
I'm a newbie in the Elasticsearch world and I'm trying to do something with email. In particular, I took the Enron email dataset and indexed it into Elasticsearch (I took just one folder of the whole dataset to keep the computation manageable). This is the mapping I put:
Now I'd like to do some processing (tokenizing, stemming and removing stopwords) on the body of each email. I had a look at language analyzers, but I don't know how to apply one of them to my case. I was wondering whether it's possible to use Python and elasticsearch-dsl to write a script that takes the body of each email and performs these kinds of operations. Thank you guys
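A minimal sketch of what such a mapping could look like with elasticsearch-dsl (the field names and index name here are assumptions, since the original mapping isn't shown):

```python
from elasticsearch_dsl import Document, Date, Keyword, Text, connections

# connect to the local cluster (adjust the host if needed)
connections.create_connection(hosts=["localhost"])

class Email(Document):
    # field names are assumptions, not the actual mapping from the post
    sender = Keyword()
    recipients = Keyword(multi=True)
    date = Date()
    subject = Text()
    message_body = Text()  # no analyzer set, so the standard analyzer is used

    class Index:
        name = "enron"

Email.init()  # creates the index and puts this mapping
```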
Tokenizing/stemming/stopword removal/etc. is called "analysis" in Elasticsearch, and it's a process that's executed when the document is indexed. That is, when you index a document, Elasticsearch takes the values from the document and runs them through the configured analysis pipeline. The result of that analysis is what's stored in ES for querying.
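A quick way to see what analysis produces is the _analyze API; for example, with the Python client (the sample text is arbitrary, and the request is passed as a body dict, which older elasticsearch-py versions accept):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# run a sample sentence through the built-in english analyzer
resp = es.indices.analyze(body={
    "analyzer": "english",
    "text": "The quick brown foxes jumped over the lazy dogs",
})
print([t["token"] for t in resp["tokens"]])
# roughly: ['quick', 'brown', 'fox', 'jump', 'over', 'lazi', 'dog']
```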
So you'll need to configure an analyzer for the data you are indexing. You can read more about how analysis and mappings work here:
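With elasticsearch-dsl you can attach a built-in language analyzer to a field, or define a custom one; a minimal sketch (the analyzer name, filter choices and index name are just one possible setup):

```python
from elasticsearch_dsl import Document, Text, analyzer

# custom analyzer: standard tokenizer + lowercasing + stopword removal + stemming
email_body_analyzer = analyzer(
    "email_body_analyzer",
    tokenizer="standard",
    filter=["lowercase", "stop", "porter_stem"],
)

class Email(Document):
    # attach the custom analyzer to the field...
    message_body = Text(analyzer=email_body_analyzer)
    # ...or just use the built-in one: Text(analyzer="english")

    class Index:
        name = "enron"
```

Calling Email.init() on a fresh index should create the index with both the analyzer settings and the mapping in one go.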
Because this is an index-time operation, you'll need to re-index those documents to get them properly stemmed/stopworded, etc. Right now they were analyzed with the default analyzer (standard), which only does basic processing (lowercasing, splitting on spaces/special characters, etc).
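Re-indexing from Python can be as simple as copying everything from the old index into a new one created with the new mapping; a sketch using the scan/bulk helpers from elasticsearch-py (the index names are placeholders):

```python
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan, bulk

es = Elasticsearch()

# stream every document out of the old index and bulk-index it into the new
# one, where the english analyzer will be applied at index time
actions = (
    {"_index": "enron_v2", "_id": hit["_id"], "_source": hit["_source"]}
    for hit in scan(es, index="enron", query={"query": {"match_all": {}}})
)
bulk(es, actions)
```

On recent Elasticsearch versions the server-side _reindex API does the same copy without pulling the documents through the client.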
@polyfractal
Thank you for your help; you understood my problem exactly. I tried to follow your guides, but it seems that no analyzer other than the standard one is being applied. I'm pasting here the function I'm using to put the mapping:
In the message_body field I added "analyzer": "english" and re-ran the script, which deletes any existing index with the same name and indexes the set of documents again.
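One way to check whether the english analyzer really ended up in the mapping, and what it produces for that field, is a small verification script like this sketch (the index name and sample text are placeholders):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# confirm the analyzer is present in the live mapping
print(es.indices.get_mapping(index="enron"))

# run a sample through whatever analyzer message_body is actually mapped to
resp = es.indices.analyze(index="enron", body={
    "field": "message_body",
    "text": "The meetings were rescheduled",
})
print([t["token"] for t in resp["tokens"]])
```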