Does Text Analysis increase CPU utilization?

Hi,
I want to add an analyzer to tokenize text fields in a specific way. Will it have an impact on ES cluster resource utilization (CPU, ...)? If yes, how significant?

thanks

Hey,

That depends on your text analysis chain. If you use something like ngrams and split the text into many tokens, this will naturally take more time (and also more space on disk, as it writes more data) compared to using the keyword tokenizer, which emits a single term.
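For example (just a rough sketch to illustrate the point, the sample text and ngram settings are made up), you can compare the token output directly with the `_analyze` API:

    GET _analyze
    {
      "tokenizer": { "type": "ngram", "min_gram": 2, "max_gram": 3 },
      "text": "elasticsearch"
    }

    GET _analyze
    {
      "tokenizer": "keyword",
      "text": "elasticsearch"
    }

The first request produces a long list of overlapping tokens ("el", "ela", "la", ...), each of which has to be indexed, while the second produces just the single token "elasticsearch".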

I rarely worry about this, as the more important question usually is whether the mapping configuration allows you to write queries that answer your question in a relevant fashion.

Is there a specific question hidden behind that one that we could help with? :slight_smile:

--Alex

Thank you for your response, Alex.
I'm going to replace the standard tokenizer with my custom one:

    "tokenizer": {
      "custom_tokenizer" : {
        "type" : "char_group",
        "tokenize_on_chars" : ["whitespace", "\n", "punctuation", "symbol"]
      }
    }

and apply this tokenizer to some specific fields.
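Roughly like this, just as a sketch (the index name, analyzer name, and field name are made up; only the tokenizer definition is the one above):

    PUT my-index
    {
      "settings": {
        "analysis": {
          "tokenizer": {
            "custom_tokenizer": {
              "type": "char_group",
              "tokenize_on_chars": ["whitespace", "\n", "punctuation", "symbol"]
            }
          },
          "analyzer": {
            "custom_analyzer": {
              "type": "custom",
              "tokenizer": "custom_tokenizer"
            }
          }
        }
      },
      "mappings": {
        "properties": {
          "message": {
            "type": "text",
            "analyzer": "custom_analyzer"
          }
        }
      }
    }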

Is it possible to predict in some way how much this will increase disk usage?

I don't think you will notice a huge difference compared to the standard analyzer (gut feeling, nothing I have tested).

That said, your best bet is to take a sample set of your data (say 10k, 100k, or 500k documents), index it into one index with the standard tokenizer and into another index with the custom one, and compare the sizes. Note that you can still end up with different segment counts, so the sizes are not fully comparable, but it may give you a hint. If you force merge both indices down to a single segment first, the comparison becomes fair.
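If you go that route, something along these lines should work (a sketch, the index names are placeholders):

    POST test-standard,test-custom/_forcemerge?max_num_segments=1

    GET _cat/indices/test-standard,test-custom?v&h=index,docs.count,pri.store.size

The force merge collapses each index down to a single segment, and the `_cat/indices` output then lets you compare the primary store size of the two indices directly.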

Thank you, that's actually what I'm going to do :)

Just to summarize: in my case the difference was 1%, which is within the statistical error.
