Trying to remove stopwords from an ES index via the Java API


(Fluehmann) #1

Hey guys

I'm all new to ES.

I'm indexing website content with a crawler into ES.
So now I'm trying to apply an analyzer to my index (website content) through the Java API to remove all stopwords from the website content. To do that, I load the settings from a .json settings file with a FileInputStream when indexing.

When looking into the metadata in the head plugin the analyzer is shown properly.
I can even access the analyzer via the sense plugin.

But what I actually wanted to achieve was to remove stopwords from the index data that I'm indexing.
I expected the content to be cleaned from stopwords. But all stopwords remain there.

What am I doing wrong?
Can anyone help me?

My indexing code:

public boolean index(String index, String type, String id, HashMap<String, String> fields) throws ElasticsearchException, IOException {
    if (!checkIfIndexExists(index)) {
        Settings indexSettings = ImmutableSettings.settingsBuilder()
            .put("number_of_shards", 5)
            .put("number_of_replicas", 1)
            .build();

        CreateIndexRequest indexRequest = new CreateIndexRequest(index, indexSettings);
        indexRequest.settings(getSettingsJsonString());

        client.admin().indices().create(indexRequest).actionGet();
    }

My settings file:

{
    "analysis": {
        "analyzer": {
            "de_std": {
                "type":      "standard",
                "stopwords": "_german_"
            }
        }
    }
}

Cheers Simon


(Colin Goodheart-Smithe) #2

You will need to set the analyzer parameter in your mappings for each field that should use this analyzer. See here for more information on the settings for String fields, including the analyzer parameter: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-core-types.html#string

If you want to set all String fields to use that analyzer you should look at dynamic templates: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-root-object-type.html#_dynamic_templates

Hope that helps


(Fluehmann) #3

Hey colings86

Thank you very much for your reply - I appreciate it very much!
The use of mapping makes a lot of sense to me.

My mapping file:

{
    "website": {
        "properties": {
            "content": {
                "type": "string",
                "analyzer": "de_std"
            },
            "timestamp": {
                "type": "string"
            },
            "url": {
                "type": "string"
            }
        }
    }
}

Now I see the following in the metadata - and still no effect.

Do I have to overwrite the default Mapping somehow?

I'm sorry for my noob questions...

Cheers Simon


(Colin Goodheart-Smithe) #4

When you applied your new mappings, did you create a new index with the new mappings and then re-index your data into the new index? You cannot change the analyzer of an existing field on an existing index, so if you didn't re-index, you will need to do so for your analyzer to be used.

No worries these are good questions :slight_smile:


(Fluehmann) #5

Hey

Thanks for your reply.

Yes, I always delete the index and reindex it if I want to apply the Settings/Mapping:

[2015-07-23 13:40:53,661][INFO ][cluster.metadata         ] [Chemistro] [emmental.ch] deleting index
[2015-07-23 13:41:09,559][INFO ][cluster.metadata         ] [Chemistro] [emmental.ch] creating index, cause [api], templates [], shards [5]/[1], mappings [{
         website": {
            "properties": {
               "content": {
                  "type": "string"
                  "analyzer": "de_std"
               }
               "timestamp": {
                  "type": "string"
               }
               "url": {
                  "type": "string"
               }
            }
         }
     } ]
[2015-07-23 13:41:09,631][INFO ][cluster.metadata         ] [Chemistro] [emmental.ch] update_mapping [website] (dynamic)

This is how I load the mapping:

CreateIndexRequest indexRequest = new CreateIndexRequest(index, indexSettings);
indexRequest.settings(getSettingsJsonString());
indexRequest.mapping(getMappingsJsonString());
client.admin().indices().create(indexRequest).actionGet();

In the metadata the settings are displayed correctly, but the mapping isn't (see the screenshot in my previous post).

Do you see any mistakes?


(Colin Goodheart-Smithe) #6

Good, you should always do this :smile:

I don't see any mistakes in what you've put here.

Maybe @Clinton_Gormley has some ideas on what could be going wrong?


(Clinton Gormley) #7

@fluehmann It's not very clear what you mean by:

"I expected the content to be cleaned from stopwords. But all stopwords remain there."

If you're expecting the stopwords to be removed from the _source field, that doesn't happen. The _source field is never changed. The analyzer just removes stopwords from the field's text before the terms are indexed for search.
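
To make that distinction concrete, here is a toy sketch in plain Java. It is not Elasticsearch code: the class StopwordIndexSketch, its five-word stopword list and the sample sentence are all made up for illustration, and the real standard analyzer with the _german_ list does considerably more than this.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy model only (hypothetical class, NOT part of Elasticsearch).
class StopwordIndexSketch {
    // Tiny illustrative stopword list; the real _german_ list is much longer.
    static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("der", "die", "das", "und", "ist"));

    // Plays the role of _source: the original text, stored untouched.
    final Map<String, String> source = new HashMap<>();
    // Plays the role of the searchable index: term -> ids of documents containing it.
    final Map<String, Set<String>> invertedIndex = new TreeMap<>();

    // Lowercase the text, split on non-letters, and drop stopwords.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!token.isEmpty() && !STOPWORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    // Store the raw text as-is, but index only the analyzed terms.
    void index(String id, String content) {
        source.put(id, content);
        for (String term : analyze(content)) {
            invertedIndex.computeIfAbsent(term, t -> new TreeSet<>()).add(id);
        }
    }

    public static void main(String[] args) {
        StopwordIndexSketch idx = new StopwordIndexSketch();
        idx.index("1", "Das Emmental ist gross und der Kaese ist gut");
        System.out.println("stored text:   " + idx.source.get("1"));
        System.out.println("indexed terms: " + idx.invertedIndex.keySet());
    }
}
```

In this model, looking up "ist" in the term map finds nothing, while the stored text still contains it. That is the same split as between searching a field and reading back _source.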


(Fluehmann) #8

Hey guys

Thank you for your help. I had to do some thinking... :smile:

Now it's clear to me that the _source field is never changed and that the stopwords are only omitted from the terms that get indexed for search.

I think it works now.
I had to change the mapping to fit the constructor:

indexRequest.mapping("website", getMappingsJsonString());

That's why the mapping was first interpreted as a single string instead of as a JSON document.
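
One plausible explanation for why the single-argument call compiled but misbehaved is Java overload resolution. The sketch below uses a hypothetical stand-in class, not the real CreateIndexRequest. It assumes the request offers both a mapping(String type, Object... source) and a mapping(String type, String source) overload, which is how the 1.x Java API appears to be declared; the bodies of both methods are invented for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a request class; NOT the Elasticsearch class itself.
class MappingOverloadSketch {
    // type name -> mapping source, in insertion order
    final Map<String, String> mappings = new LinkedHashMap<>();

    // Assumed varargs overload: first argument is the type name,
    // the remaining arguments are simplified field definitions.
    MappingOverloadSketch mapping(String type, Object... source) {
        mappings.put(type, "(built from " + source.length + " field arguments)");
        return this;
    }

    // Assumed overload taking the type name and the mapping as a JSON string.
    MappingOverloadSketch mapping(String type, String source) {
        mappings.put(type, source);
        return this;
    }

    public static void main(String[] args) {
        String json = "{\"website\":{\"properties\":{\"content\":{\"type\":\"string\"}}}}";

        // One argument: only the varargs overload applies, so the JSON
        // string ends up being used as the TYPE NAME.
        MappingOverloadSketch wrong = new MappingOverloadSketch().mapping(json);
        System.out.println("one argument  -> type names: " + wrong.mappings.keySet());

        // Two arguments: the (String, String) overload is chosen, so
        // "website" is the type name and the JSON is the mapping source.
        MappingOverloadSketch right = new MappingOverloadSketch().mapping("website", json);
        System.out.println("two arguments -> type names: " + right.mappings.keySet());
    }
}
```

If the real overloads behave like this, it would also match the log output in the earlier post, where the whole JSON text shows up in the place where a mapping type name would normally appear.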

Thank you guys very much for your help and patience!
I guess I'll be back soon with some more questions... :wink:

Cheers
Simon

