We have an Elasticsearch cluster that contains over half a billion documents, each of which has a url field that stores a URL.
The url field mapping currently has the settings:
{
  "index": "not_analyzed",
  "doc_values": true,
  ...
}
We want our users to be able to search URLs, or portions of URLs, without having to use wildcards.
For example, taking the URL: https://www.domain.com/part1/user@site/part2/part3.ext
They should be able to bring back a matching document by searching:
part3.ext
user@site
part1
part2/part3.ext
The way I see it, we have two options:
Option 1: Implement an analysed version of this field (which can no longer have doc_values: true) and do match querying instead of wildcards. This would also require a custom analyser that leverages the pattern tokeniser so the extracted terms come out right (the standard tokeniser would split user@site into user and site). A rough sketch of what this might look like follows after option 2.
Option 2: Go through our database and, for each document, create a new field that is a list of URL parts. This field could still have doc_values: true, so it would be stored off-heap, and we could do term querying on exact field values instead of wildcards.
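For reference, here is a minimal, untested sketch of what option 1 might look like, assuming that splitting on / (after stripping the protocol) is enough for our URLs; the analyser, tokeniser and char filter names are just illustrative:

{
  "settings": {
    "analysis": {
      "char_filter": {
        "strip_protocol": {
          "type": "pattern_replace",
          "pattern": "^[a-zA-Z][a-zA-Z0-9+.-]*://",
          "replacement": ""
        }
      },
      "tokenizer": {
        "url_segments": {
          "type": "pattern",
          "pattern": "/"
        }
      },
      "analyzer": {
        "url_parts": {
          "type": "custom",
          "char_filter": [ "strip_protocol" ],
          "tokenizer": "url_segments"
        }
      }
    }
  }
}

With url_parts as the field's analyser, the example URL is indexed as the terms www.domain.com, part1, user@site, part2 and part3.ext, and a match query for part2/part3.ext is analysed into that same kind of term, so no wildcards are needed.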
My question is this:
Which is better for performance: a variable-length list field with doc_values enabled (option 2), or an analysed field (option 1)? Or is there an option 3 that would be even better yet?!
Will you really need to aggregate over parts of URLs? I suspect not. You could just index it twice: once with type: text and once with type: keyword and index: false. The first field will be useful for searching over URL parts and the second field will be useful for aggregating entire URLs.
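In mapping terms that suggestion is roughly the sketch below, placed inside the properties of your mapping (field and sub-field names are illustrative; text/keyword assume 5.x or later, on older versions the equivalent would be an analyzed string plus a not_analyzed string):

"url": {
  "type": "text",
  "fields": {
    "raw": {
      "type": "keyword",
      "index": false,
      "doc_values": true
    }
  }
}

Note that with the default analyser the text field would still split user@site into user and site, so a custom pattern-based analyser like the one sketched under option 1 could be plugged into it if that matters.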
So I tried it myself against 10 million documents.
I did one test with a filename field that was created by copy_to from two other URL-type fields (also set store: false), and used a custom analyser with a pattern tokeniser. This one I called Filename Analysed.
I did another test where I ran an enrichment over all 10 million documents to create a list field containing every contiguous combination of segments for both URL-type fields I wanted to include. This one I called Filename Array.
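To make the comparison concrete, the Filename Array side boiled down to something like this hypothetical filename_parts field (the real field names are different), which on 2.x would be a not_analyzed string and on 5.x+ a keyword field:

"filename_parts": {
  "type": "keyword",
  "doc_values": true
}

Each document then carries an array of values for that field (Elasticsearch has no dedicated list type; any field can hold multiple values), and searches are plain term queries on exact values, for example:

{
  "query": {
    "term": {
      "filename_parts": "part2/part3.ext"
    }
  }
}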
Then I ran Gatling tests against each index to see how they compared, while watching CPU/memory stats in netdata (https://github.com/firehol/netdata).
According to netdata, the Filename Array index performed much better: it hardly registered any CPU usage at all compared with the Filename Analysed index.
Both solutions solve the problem above, but Filename Array also lets me do exact substring matching (without the need for wildcards) whereas Filename Analysed does not.
There is a problem with Filename Array, though, in that it nearly doubles the index size (bloats it by about 80%).
So now I'm wondering: can I write a custom tokenizer that creates the same tokens I was creating in my Filename Array enrichment?
i.e. taking the URL above, https://www.domain.com/part1/user@site/part2/part3.ext, I want the following tokens:
www.domain.com
part1
user@site
part2
part3.ext
www.domain.com/part1
www.domain.com/part1/user@site
www.domain.com/part1/user@site/part2
www.domain.com/part1/user@site/part2/part3.ext
part1/user@site/part2/part3.ext
user@site/part2/part3.ext
part2/part3.ext
part1/user@site/part2
part1/user@site
user@site/part2
My regex capabilities don't live up to this requirement! Is it possible to write a Java plugin that can be used as a custom tokenizer? Or a Groovy script? Or can someone suggest a regular expression that might work?!
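One avenue that might avoid both the regex gymnastics and a Java plugin is the built-in shingle token filter: split on / with a pattern tokeniser, then shingle the segments back together with / as the token separator, so the emitted terms are exactly the contiguous segment combinations listed above. A rough, untested sketch (all names illustrative), reusing the strip_protocol char filter and url_segments tokeniser idea from the earlier sketch:

{
  "settings": {
    "analysis": {
      "char_filter": {
        "strip_protocol": {
          "type": "pattern_replace",
          "pattern": "^[a-zA-Z][a-zA-Z0-9+.-]*://",
          "replacement": ""
        }
      },
      "tokenizer": {
        "url_segments": {
          "type": "pattern",
          "pattern": "/"
        }
      },
      "filter": {
        "url_shingles": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 5,
          "output_unigrams": true,
          "token_separator": "/"
        }
      },
      "analyzer": {
        "url_path_combinations": {
          "type": "custom",
          "char_filter": [ "strip_protocol" ],
          "tokenizer": "url_segments",
          "filter": [ "url_shingles" ]
        }
      }
    }
  }
}

For the example URL this emits the fifteen tokens listed above. Deeper paths need a larger max_shingle_size (and, on recent versions, a higher index.max_shingle_diff setting), and shingling will grow the index much like the enrichment does, so it trades the offline enrichment for index-time work rather than removing the size cost. At search time it would also make sense to use a non-shingling search_analyzer (or plain term queries) so the query text isn't shingled as well.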