Does Text Analysis increase CPU utilization

I want to add an analyzer that tokenizes text fields in a specific way. Will it have an impact on ES cluster resource utilization (CPU, ...)? If yes, how significant?



That depends on your text analysis chain. If you use something like ngrams and split the text into many tokens, this will naturally take more time (and also more space on disk, as more data is written) compared to using a keyword field with a single term.
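To get a feel for how much the token count can differ between analysis chains, here is a rough sketch in plain Python. It only approximates the behaviour (it is not Lucene's implementation), the function names are made up for illustration, and the ngram sizes assume the ngram tokenizer defaults of `min_gram` 1 and `max_gram` 2.

```python
def keyword_tokens(text):
    # A keyword field indexes the whole value as a single term.
    return [text]


def whitespace_tokens(text):
    # Very rough stand-in for a word-level tokenizer: split on whitespace.
    return text.split()


def ngram_tokens(text, min_gram=1, max_gram=2):
    # Emit every substring whose length is between min_gram and max_gram.
    tokens = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(text):
                tokens.append(text[start:start + size])
    return tokens


sample = "quick brown fox"
for name, tokenize in [
    ("keyword", keyword_tokens),
    ("whitespace", whitespace_tokens),
    ("ngram", ngram_tokens),
]:
    print(name, len(tokenize(sample)))
```

For this 15-character sample the three approaches produce 1, 3, and 29 tokens respectively, which is why an ngram-heavy chain costs more indexing time and disk space than a word-level one.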

I rarely worry about this, because the more important question is usually whether the mapping configuration lets you write queries that answer your question with relevant results.

Is there a specific question hidden behind this one that we could help with? :)


Thank you for your response, Alex.
I'm going to replace the standard tokenizer with a custom one:

    "tokenizer": {
      "custom_tokenizer" : {
        "type" : "char_group",
        "tokenize_on_chars" : ["whitespace", "\n", "punctuation", "symbol"]
      }
    }

and set this tokenizer for some specific fields.
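For context, a tokenizer definition like this normally sits inside the index's `analysis` settings and is referenced by an analyzer, which is then assigned to fields in the mapping. The sketch below builds such a create-index body in Python. The analyzer name `custom_analyzer`, the field name `message`, and the `lowercase` filter are assumptions added for illustration and are not part of the original post.

```python
import json

# Hypothetical create-index body; only the tokenizer block comes from the post.
index_body = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "custom_tokenizer": {
                    "type": "char_group",
                    "tokenize_on_chars": ["whitespace", "\n", "punctuation", "symbol"],
                }
            },
            "analyzer": {
                # Assumed analyzer that wires the tokenizer in.
                "custom_analyzer": {
                    "type": "custom",
                    "tokenizer": "custom_tokenizer",
                    # Assumption: lowercase tokens, as the standard analyzer does.
                    "filter": ["lowercase"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            # "message" is a placeholder field name.
            "message": {"type": "text", "analyzer": "custom_analyzer"}
        }
    },
}

# Print the JSON that would be sent as the body of a create-index request.
print(json.dumps(index_body, indent=2))
```

Fields that should keep the default behaviour simply omit the `analyzer` setting, so only the chosen fields use the custom tokenizer.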

Is it possible to predict in some way how much this will increase disk usage?

I don't think you will notice a huge difference compared to the standard analyzer (gut feeling, nothing I have tested).

That said, your best bet is to take a sample of your data (say 10k, 100k, or 500k documents), index it once into an index using the standard tokenizer and once into an index using the custom one, and compare the sizes. Note that the two indices can end up with different segment counts, so the raw numbers are not fully comparable and only give you a rough hint. If you run a force merge down to one segment on both indices first, the sizes become comparable.
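The comparison described above could be scripted roughly as follows. This is a sketch under assumptions: both indices already exist and contain the same sample documents, the cluster is reachable without authentication at whatever `base_url` you pass in, and the index names are placeholders. It relies on the force merge API (`_forcemerge?max_num_segments=1`) and the index stats API (`_stats/store`).

```python
import json
import urllib.request


def call(base_url, method, path):
    # Minimal REST helper; assumes an unsecured cluster (no auth, no TLS setup).
    request = urllib.request.Request(base_url + path, method=method)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def primary_store_bytes(stats, index):
    # Pull the primary shard store size out of an index stats response.
    return stats["indices"][index]["primaries"]["store"]["size_in_bytes"]


def relative_difference(baseline_bytes, candidate_bytes):
    # Size change of the candidate index relative to the baseline, in percent.
    return (candidate_bytes - baseline_bytes) / baseline_bytes * 100.0


def compare_indices(base_url, baseline_index, candidate_index):
    sizes = {}
    for index in (baseline_index, candidate_index):
        # Merge down to one segment so the two sizes are comparable.
        call(base_url, "POST", f"/{index}/_forcemerge?max_num_segments=1")
        stats = call(base_url, "GET", f"/{index}/_stats/store")
        sizes[index] = primary_store_bytes(stats, index)
    return relative_difference(sizes[baseline_index], sizes[candidate_index])
```

A call such as `compare_indices("http://localhost:9200", "sample-standard", "sample-custom")` (both names hypothetical) would return the percentage by which the custom-tokenizer index is larger or smaller than the standard one.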

Thank you, that's exactly what I'm going to do :)

Just to summarize: in my case the difference was 1%, which is within statistical error.
