Elasticsearch index size using different analyzers

Hi,

The goal is to reduce the size of an index.
I have several string fields with an nGram analyzer configured for grams of 1 to 45 characters. The overall index size reached 65GB. I then removed the nGram analyzer from 4 of the fields (they now use the standard analyzer) and repopulated the index, and it grew to 70GB (even bigger).

I tried again with index: not_analyzed for those 4 fields, and the index now reached 75.4GB.

I delete and recreate the index from scratch each time.

Does anyone have a clue why?
Is there a way to know the size of the individual fields?

Thanks,

Ori

There is a default assumption that you will need extra "doc values" files created on disk for your mapping choices - see https://www.elastic.co/guide/en/elasticsearch/reference/current/doc-values.html

These files support sorting and aggregation use cases but you should turn them off if these use cases do not apply to your fields.
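As a rough sketch of what turning doc values off could look like, the Python snippet below builds a mapping fragment with `doc_values` set to `false` on selected fields. The field definitions are illustrative and the helper `disable_doc_values` is hypothetical; which fields can safely lose doc values depends on whether you ever sort or aggregate on them.

```python
import copy
import json


def disable_doc_values(properties, field_names):
    """Return a copy of a mapping 'properties' dict with doc_values off for the given fields."""
    updated = copy.deepcopy(properties)
    for name in field_names:
        updated[name]["doc_values"] = False
    return updated


# Illustrative field definitions (assumption: account_id is never sorted or aggregated on).
properties = {
    "account_id": {"type": "integer", "include_in_all": False},
    "event_last_visit_date": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss",
        "include_in_all": False,
    },
}

mapping = disable_doc_values(properties, ["account_id"])
print(json.dumps(mapping, indent=2))
```

The resulting dictionary can be pasted into the "properties" section of the index creation request. A field that is still needed for sorting should keep its doc values.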

Hi Marks,

Thanks a lot!!!

I would like those fields to be analyzed.
According to the docs, analyzed string fields do not keep doc values.
I have a total of 9 string fields; 5 of them must use the nGram analyzer with gram sizes from 1 to 45.
The last 4 are the ones in question:
A) Using nGram with them, the index size was 65GB
B) Not using nGram, but leaving the Standard analyzer (as default) the index size was nearly 71GB

As far as I understand, the nGram analyzer is supposed to consume more space than the standard analyzer, but here it is the other way around.

Then I tested the fields as not_analyzed and got an index size of 75GB.

Aggregations and sorts are not needed (except one sort of the search results), so I might be able to remove doc_values from the other fields, which are not of type string.

I do need to understand why the standard analyzer is consuming more space than the nGram analyzer.

Thanks,

Ori

That bit doesn't make any sense to me. Ngrams 1 to 45 sounds like a recipe for a massive index. Can you share your mappings?

{
"settings": {
"index": { "number_of_shards" : 2 },
"analysis": {
"filter": {
"nGram_filter": {
"type": "nGram",
"min_gram": 1,
"max_gram": 45,
"token_chars": [
"letter",
"digit",
"punctuation",
"symbol"
]
}
},
"analyzer": {
"nGram_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"asciifolding",
"nGram_filter"
]
},
"whitespace_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"asciifolding"
]
},
"list_analyzer" : {
"type" : "custom",
"tokenizer" : "pattern"
},
"urls_links_emails" : {
"type" : "custom",
"tokenizer" : "uax_url_email",
"filter": [
"lowercase"
]
}
}
}
},
"mappings" : {
"usersdata" : {
"properties" : {
"account_id" : {
"type" : "integer",
"include_in_all" : false
},
"account" : {
"type" : "string",
"analyzer" : "nGram_analyzer",
"search_analyzer" : "whitespace_analyzer",
"include_in_all" : false,
"fields": {
"raw": {
"type": "string",
"include_in_all" : false
}
}
},
"first_name" : {
"type" : "string",
"analyzer" : "nGram_analyzer",
"search_analyzer" : "whitespace_analyzer",
"include_in_all" : false,
"fields": {
"raw": {
"type": "string",
"include_in_all" : false
}
}
},
"last_name" : {
"type" : "string",
"analyzer" : "nGram_analyzer",
"search_analyzer" : "whitespace_analyzer",
"include_in_all" : false,
"fields": {
"raw": {
"type": "string",
"include_in_all" : false
}
}
},
"email" : {
"type" : "string",
"analyzer" : "nGram_analyzer",
"search_analyzer" : "whitespace_analyzer",
"include_in_all" : false,
"fields": {
"raw": {
"type": "string",
"include_in_all" : false,
"analyzer" : "urls_links_emails"
}
}
},
"company" : {
"type" : "string",
"analyzer" : "nGram_analyzer",
"search_analyzer" : "whitespace_analyzer",
"include_in_all" : false,
"fields": {
"raw": {
"type": "string",
"include_in_all" : false
}
}
},
"country" : {
"type" : "string",
"analyzer" : "nGram_analyzer",
"search_analyzer" : "whitespace_analyzer",
"include_in_all" : false,
"fields": {
"raw": {
"type": "string",
"include_in_all" : false
}
}
},
"city" : {
"type" : "string",
"analyzer" : "nGram_analyzer",
"search_analyzer" : "whitespace_analyzer",
"include_in_all" : false,
"fields": {
"raw": {
"type": "string",
"include_in_all" : false
}
}
},
"state" : {
"type" : "string",
"analyzer" : "nGram_analyzer",
"search_analyzer" : "whitespace_analyzer",
"include_in_all" : false,
"fields": {
"raw": {
"type": "string",
"include_in_all" : false
}
}
},
"province" : {
"type" : "string",
"analyzer" : "nGram_analyzer",
"search_analyzer" : "whitespace_analyzer",
"include_in_all" : false,
"fields": {
"raw": {
"type": "string",
"include_in_all" : false
}
}
},
"event_id_list" : {
"type" : "string",
"analyzer" : "list_analyzer",
"include_in_all" : false
},
"venue_id_list" : {
"type" : "string",
"analyzer" : "list_analyzer",
"include_in_all" : false
},
"event_last_visit_date" : {
"type" : "date",
"format" : "yyyy-MM-dd HH:mm:ss",
"include_in_all" : false
}
}
}
}
}

The fields country, city, state and province are the ones I am switching between the nGram analyzer and the standard analyzer.

With the standard analyzer they are mapped as:

      "country" : {
        "type" : "string",
        "include_in_all" : false
      }

For each of them.

Ori

What version of Elasticsearch are you running?

Elasticsearch 2.4.1
Including plugins: readonlyrest, siren, delete-by-query

That is (hopefully) a fixed vocabulary so may only account for a comparatively small number of n-grams.
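To make the vocabulary point concrete, here is a small Python sketch. It is a simplification of the nGram token filter (it only enumerates substrings of a single lowercase token), and the field values are made up. It compares how many n-grams are emitted in total with how many distinct n-grams a small, repeated vocabulary produces.

```python
def ngrams(token, min_gram=1, max_gram=45):
    """All substrings of `token` whose length is between min_gram and max_gram."""
    grams = []
    for size in range(min_gram, min(max_gram, len(token)) + 1):
        for start in range(len(token) - size + 1):
            grams.append(token[start:start + size])
    return grams


# Hypothetical field values: a small vocabulary repeated across many documents.
values = ["germany", "france", "germany", "france"] * 1000

# Total n-grams emitted while indexing all values.
emitted = sum(len(ngrams(v)) for v in values)

# Distinct n-grams, i.e. the terms that end up in the term dictionary.
distinct = {g for v in set(values) for g in ngrams(v)}

print("emitted:", emitted)
print("distinct:", len(distinct))
```

The number of emitted n-grams grows with the number of documents, but the set of distinct terms stays bounded by the vocabulary, which is one reason n-grams on fields like country or state may cost less than expected.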
What you might be seeing here is a potentially temporary difference in index sizes caused by differences in background merge operations. Index content is organised into many segments, each of which is a mini-index in its own right. As more content is added, more segments are created, and a background process merges them together as a form of compaction. Ideally you should compare the sizes of fully compacted indices to get a fair comparison. You can do this using this API [1], which is expensive to run (it will effectively rewrite your index) and is normally only used on an index that will receive no more updates.

[1] Force Merge | Elasticsearch Guide [2.4] | Elastic
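For reference, a minimal Python sketch of calling the force merge endpoint with a single target segment. The host `http://localhost:9200` and the index name `users_test` are assumptions, and the helper names are hypothetical. The snippet only builds the request; `send` is defined but not called.

```python
import urllib.request


def build_force_merge_request(base_url, index, max_num_segments=1):
    """Build (but do not send) a POST request to the force merge endpoint."""
    url = "{}/{}/_forcemerge?max_num_segments={}".format(
        base_url.rstrip("/"), index, max_num_segments
    )
    return urllib.request.Request(url, method="POST")


def send(request):
    """Send the request and return the response body as text."""
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


# Hypothetical host and index name.
request = build_force_merge_request("http://localhost:9200", "users_test")
print(request.get_method(), request.full_url)
```

Running `send(request)` against each variant of the index and comparing sizes only afterwards should give a fairer comparison between the mappings.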

Funny,

I am currently trying optimize with max_num_segments=1.
Since it is a test environment, I can do it without interfering with anyone.

How do I know how many segments the index currently holds? And how many should there be?
According to the docs, 1 segment is best, but getting there rebuilds the entire index and will take a lot of time.

Ori

There is no correct answer - hit "pause" at any random point in this video and this is an indication of how many segments you'll have : Changing Bits: Visualizing Lucene's segment merges

"Force_merge" is a deliberate renaming of the old "optimize" function to make it sound more scary and prevent people calling this very expensive function. People would run it with little thought (who wouldn't want an "optimal" index?) and cripple their cluster when there are perfectly good background tasks continually merging segments as new content is added. That's why we reiterate: only use this for indexes that won't receive any more updates or, as in your case, when you are benchmarking the effects of mapping changes.
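On the question of how many segments an index currently holds: the indices segments API (`GET /<index>/_segments`) reports them per shard. The Python sketch below counts segments from such a response. The sample is hand-written and heavily trimmed to the parts the function reads, and the index name `users_test` is an assumption.

```python
def segments_per_shard(response, index):
    """Map each shard id to a list with the segment count of every copy of that shard."""
    shards = response["indices"][index]["shards"]
    return {
        shard_id: [len(shard_copy["segments"]) for shard_copy in copies]
        for shard_id, copies in shards.items()
    }


# Hand-written, trimmed sample in the shape of a _segments response (not real output).
sample = {
    "indices": {
        "users_test": {
            "shards": {
                "0": [{"segments": {"_0": {"num_docs": 120}, "_1": {"num_docs": 30}}}],
                "1": [{"segments": {"_0": {"num_docs": 150}}}],
            }
        }
    }
}

counts = segments_per_shard(sample, "users_test")
total = sum(sum(per_copy) for per_copy in counts.values())
print(counts, total)
```

After a force merge down to one segment, each shard copy would be expected to report a single segment.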

Should _forcemerge block access only to the index(es) being merged?
Or does it block anything else in Elasticsearch?
I understand that it is an I/O-intensive operation, but apart from access to the indexes being merged, nothing else should be blocked, am I right?

Ori

It will not "block access". Old segments are not deleted until they have been fully merged into the new structure so can be accessed while this reorganisation goes on.

But inserts/updates will not be accepted, right?

They will be accepted. They just create new segments that fall outside of the list of segments you chose to re-org when you initiated the force_merge request. Though generally it does not make sense to call force_merge on an index undergoing changes as I mentioned previously.
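A toy Python model of that behaviour, purely for illustration (this is not how Lucene is implemented, and all names are made up): the merge works on the segments that existed when it started, and anything written in the meantime lands in new segments that are left alone.

```python
def force_merge(segments_at_start, segments_now):
    """Toy model: merge only the segments that existed when the merge began."""
    merged = {
        "name": "merged",
        "docs": sum(segment["docs"] for segment in segments_at_start),
    }
    start_names = {segment["name"] for segment in segments_at_start}
    # Segments created by writes that arrived during the merge are kept as they are.
    newer = [s for s in segments_now if s["name"] not in start_names]
    return [merged] + newer


at_start = [{"name": "a", "docs": 100}, {"name": "b", "docs": 50}]

# A write arrives while the merge is running and creates segment "c".
during = at_start + [{"name": "c", "docs": 10}]

result = force_merge(at_start, during)
print(result)
```

In this model the index ends up with two segments rather than one, which is why force merging an index that is still receiving writes rarely gives the result you were after.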

OK, Thanks!!!
