Trying to remove stopwords from ES-index JAVA API

fluehmann · July 22, 2015, 3:13pm

Hey guys

I'm all new to ES.

I'm indexing website content with a crawler into ES.
So now I'm trying to apply an analyzer to my index (website content) through the Java API to remove all stopwords from the website content. To do that, I load the settings from a .json settings file with a FileInputStream when indexing.

When looking into the metadata in the head plugin the analyzer is shown properly.
I can even access the analyzer via the sense plugin.

But what I actually wanted to achieve was to remove stopwords from the index data that I'm indexing.
I expected the content to be cleaned from stopwords. But all stopwords remain there.

What am I doing wrong?
Can anyone help me?

My indexing code:

public boolean index(String index, String type, String id, HashMap<String, String> fields) throws ElasticsearchException, IOException {
    // Create the index with custom settings the first time it is used
    if (!checkIfIndexExists(index)) {
        Settings indexSettings = ImmutableSettings.settingsBuilder()
                .put("number_of_shards", 5)
                .put("number_of_replicas", 1)
                .build();

        CreateIndexRequest indexRequest = new CreateIndexRequest(index, indexSettings);
        // Load the analysis settings from the .json settings file
        indexRequest.settings(getSettingsJsonString());

        client.admin().indices().create(indexRequest).actionGet();
    }

My settings file:

{
    "analysis": {
        "analyzer": {
            "de_std": {
                "type":      "standard",
                "stopwords": "_german_"
            }
        }
    }
}

Cheers Simon

colings86 · July 22, 2015, 3:18pm

You will need to set the analyzer field in your mappings for the fields you want to use this analyzer. See here for more information on the settings for String fields including the analyzer parameter: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-core-types.html#string

If you want to set all String fields to use that analyzer you should look at dynamic templates: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-root-object-type.html#_dynamic_templates
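A minimal sketch of what such a dynamic template might look like for this thread's setup, written as a Java helper that returns the mapping JSON. The type name "website" and the analyzer name "de_std" are taken from the posts here; the template name "strings_de" and the class and method names are made up for illustration, so treat this as an assumption-laden example rather than a tested configuration.

```java
// Sketch only: builds the JSON for a mapping with one dynamic template.
// Nothing here talks to Elasticsearch; it just produces a string.
final class DynamicTemplateSketch {

    // "website" and "de_std" come from the thread; the template name
    // "strings_de" is a made-up label.
    static String mappingJson() {
        return String.join("\n",
            "{",
            "  \"website\": {",
            "    \"dynamic_templates\": [",
            "      {",
            "        \"strings_de\": {",
            "          \"match_mapping_type\": \"string\",",
            "          \"mapping\": {",
            "            \"type\": \"string\",",
            "            \"analyzer\": \"de_std\"",
            "          }",
            "        }",
            "      }",
            "    ]",
            "  }",
            "}");
    }

    public static void main(String[] args) {
        // Print the JSON so it can be saved as a mapping file
        System.out.println(mappingJson());
    }
}
```

With a template like this, every dynamically added string field would get the de_std analyzer, so the fields would not need to be listed one by one.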

Hope that helps

fluehmann · July 23, 2015, 7:28am

Hey colings86

Thank you very much for your reply - I appreciate it very much!
The use of mapping makes a lot of sense to me.

My mapping file:

{
    "website": {
        "properties": {
            "content": {
                "type": "string",
                "analyzer": "de_std"
            },
            "timestamp": {
                "type": "string"
            },
            "url": {
                "type": "string"
            }
        }
    }
}

Now I see the following in the metadata - and still no effect.

Do I have to overwrite the default Mapping somehow?

I'm sorry for my noob questions...

Cheers Simon

colings86 · July 23, 2015, 10:28am

When you applied your new mappings, did you create a new index with the new mappings and then re-index your data into it? You cannot change the analyzer of an existing field on an existing index, so if you didn't re-index, you will need to before the analyzer in your mappings is used.

No worries, these are good questions.

fluehmann · July 23, 2015, 11:47am

Hey

Thanks for your reply.

Yes, I always delete the index and reindex it if I want to apply the Settings/Mapping:

[2015-07-23 13:40:53,661][INFO ][cluster.metadata         ] [Chemistro] [emmental.ch] deleting index
[2015-07-23 13:41:09,559][INFO ][cluster.metadata         ] [Chemistro] [emmental.ch] creating index, cause [api], templates [], shards [5]/[1], mappings [{
         website": {
            "properties": {
               "content": {
                  "type": "string"
                  "analyzer": "de_std"
               }
               "timestamp": {
                  "type": "string"
               }
               "url": {
                  "type": "string"
               }
            }
         }
     } ]
[2015-07-23 13:41:09,631][INFO ][cluster.metadata         ] [Chemistro] [emmental.ch] update_mapping [website] (dynamic)

This is how I load the mapping:

CreateIndexRequest indexRequest = new CreateIndexRequest(index, indexSettings);
indexRequest.settings(getSettingsJsonString());
indexRequest.mapping(getMappingsJsonString());
client.admin().indices().create(indexRequest).actionGet();

In the metadata the settings are displayed correctly, but the mapping isn't (see the screenshot in my previous post).

Do you see any mistakes?

colings86 · July 23, 2015, 12:03pm

Good, you should always do this.

I don't see any mistakes in what you've put here.

Maybe @Clinton_Gormley has some ideas on what could be going wrong?

Clinton_Gormley · July 23, 2015, 1:05pm

@fluehmann It's not very clear what you mean by:

I expected the content to be cleaned from stopwords. But all stopwords remain there.

If you're expecting the stopwords to be removed from the _source field, that doesn't happen. The _source field is never changed. The analyzer only removes stopwords from the terms that are indexed for search.
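To make the distinction concrete, here is a toy sketch in plain Java. It is not Elasticsearch code: the stopword list is a tiny hypothetical subset (the real _german_ list is much longer), and the class and method names are invented. It only illustrates that analysis produces a separate list of terms while the original text stays as it was.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy model of analysis; NOT Elasticsearch code, only an illustration.
final class StopwordSketch {

    // Tiny hypothetical stopword list, far shorter than the real _german_ list
    static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("der", "die", "das", "und", "ist", "in"));

    // Lowercase the text, split on non-letters, and drop stopwords.
    // This roughly mimics what a stopword-aware analyzer emits as terms.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!token.isEmpty() && !STOPWORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        String source = "Die Stadt ist in der Schweiz";
        List<String> indexedTerms = analyze(source);

        // The original text is untouched; only the term list is filtered
        System.out.println("_source (unchanged): " + source);
        System.out.println("indexed terms:       " + indexedTerms);
    }
}
```

To see what the real analyzer emits, the _analyze API can be called against the index with analyzer=de_std and some sample text, for example from the sense plugin; the returned tokens should then no longer contain the stopwords.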

fluehmann · July 24, 2015, 10:12am

Hey guys

Thank you for your help. I had to do some thinking...

Now it's clear to me that the _source field is never changed and that the stopwords are only omitted from the terms indexed for search.

I think it works now.
I had to change the mapping to fit the constructor:

indexRequest.mapping("website", getMappingsJsonString());

That's why the mapping was first interpreted as a single string (the type name) instead of as the JSON mapping definition.
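A plain-Java sketch of why the one-argument call can compile and still do the wrong thing. The class below is a hypothetical stand-in, not the real CreateIndexRequest; it assumes the request class offers a two-string overload and a varargs overload of mapping, which is what the behaviour described in this thread suggests.

```java
import java.util.Arrays;

// Hypothetical stand-in that mimics two overload shapes; it is NOT the real
// Elasticsearch CreateIndexRequest class.
final class MappingOverloadSketch {

    // Two-argument form: type name plus the JSON mapping source
    static String mapping(String type, String source) {
        return "type=" + type + ", source=" + source;
    }

    // Varargs form: type name plus zero or more extra values
    static String mapping(String type, Object... source) {
        return "type=" + type + ", source=" + Arrays.toString(source);
    }

    public static void main(String[] args) {
        String json = "{\"website\":{\"properties\":{}}}";

        // One argument: only the varargs overload applies, so the JSON
        // lands in the "type" parameter and the source is empty
        System.out.println(mapping(json));

        // Two arguments: the JSON is passed as the mapping source
        System.out.println(mapping("website", json));
    }
}
```

This would also fit the log output earlier in the thread, where the whole JSON text shows up inside the mappings [...] list.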

Thank you guys very much for your help and patience!
I guess I'll be back soon with some more questions...

Cheers
Simon
