Performance of doc_values field vs analysed field

Hi!

We have an Elasticsearch cluster that contains over half a billion documents, each of which has a url field that stores a URL.

The url field mapping currently has the settings:

{
    "index": "not_analyzed",
    "doc_values": true,
    ...
}

We want our users to be able to search URLs, or portions of URLs without having to use wildcards.
For example, taking the URL: https://www.domain.com/part1/user@site/part2/part3.ext

They should be able to bring back a matching document by searching:

  • part3.ext
  • user@site
  • part1
  • part2/part3.ext
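
To illustrate why, with the current not_analyzed mapping the only way to match a fragment like part2/part3.ext in the middle of a URL is a leading-wildcard query along these lines, which is exactly what we want to avoid because it has to scan an enormous number of terms:

{
    "query": {
        "wildcard": {
            "url": "*part2/part3.ext*"
        }
    }
}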

The way I see it, we have two options:

  1. Implement an analysed version of this field (which can no longer have doc_values: true) and do match querying instead of wildcards. This would also require a custom analyser that leverages the pattern tokeniser so the extracted terms come out right (the standard tokeniser would split user@site into user and site). A rough sketch of this follows after the list.
  2. Go through our database and, for each document, create a new field that is a list of URL parts. This field could still have doc_values: true, so it would be stored off-heap, and we could do term querying on exact field values instead of wildcards.
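
For option 1, I'm imagining something along these lines; the url_parts / url_part_tokeniser names and the split pattern are just placeholders I haven't tested:

{
    "settings": {
        "analysis": {
            "tokenizer": {
                "url_part_tokeniser": {
                    "type": "pattern",
                    "pattern": "[/:]+"
                }
            },
            "analyzer": {
                "url_parts": {
                    "type": "custom",
                    "tokenizer": "url_part_tokeniser"
                }
            }
        }
    }
}

Splitting only on slashes and colons should keep user@site and part3.ext intact as single terms, and a match_phrase query for part2/part3.ext would then require those two segments to sit next to each other in the URL.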

My question is this:
Which is better for performance: having a list field of variable length with doc_values enabled (option 2), or having an analysed field (option 1)? Or is there an option 3 that would be even better?!

Thanks for your help!

Will you really need to aggregate over parts of URLs? I suspect not. You could just index it twice: once with type: text and once with type: keyword and index: false. The first field will be useful for searching over URL parts and the second will be useful for aggregating entire URLs.
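
For example, something like this (5.x syntax; names are just for illustration, and the text field would still want whatever analyzer splits URLs the way you need):

{
    "mappings": {
        "my_type": {
            "properties": {
                "url": {
                    "type": "text",
                    "fields": {
                        "raw": {
                            "type": "keyword",
                            "index": false
                        }
                    }
                }
            }
        }
    }
}

You would search on url and aggregate on url.raw, which keeps doc_values without paying for a second inverted index.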

So I tried it myself against 10 million documents.

I did one test with a filename field that was created by copy_to from two other URL-type fields (also set store: false), and used a custom analyser with a pattern tokeniser. This one I called Filename Analysed.
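
The mapping for that test looked roughly like this. The field names are illustrative rather than the real ones (referrer_url stands in for the second URL field), and filename_analyser is the custom pattern-based analyser defined in the index settings:

{
    "mappings": {
        "doc": {
            "properties": {
                "url": {
                    "type": "string",
                    "index": "not_analyzed",
                    "doc_values": true,
                    "copy_to": "filename_analysed"
                },
                "referrer_url": {
                    "type": "string",
                    "index": "not_analyzed",
                    "doc_values": true,
                    "copy_to": "filename_analysed"
                },
                "filename_analysed": {
                    "type": "string",
                    "analyzer": "filename_analyser",
                    "store": false
                }
            }
        }
    }
}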

I did another test where I ran an enrichment over all 10 million documents to create a list field that contained all permutations of segments for both URL-type fields that I wanted to include. This one I called Filename Array.
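
The enriched documents in that index carry something like this (filename_array is an illustrative name, and the real list also holds the longer segment combinations):

{
    "url": "https://www.domain.com/part1/user@site/part2/part3.ext",
    "filename_array": [
        "part1",
        "user@site",
        "part2/part3.ext",
        "www.domain.com/part1"
    ]
}

The field is mapped as a plain not_analyzed string with doc_values: true, so lookups are exact term queries:

{
    "query": {
        "term": {
            "filename_array": "part2/part3.ext"
        }
    }
}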

Then I ran Gatling load tests against each index to see how they compared, while watching CPU/memory stats in netdata (https://github.com/firehol/netdata).

The results were interesting.

Compare the percentiles from the Gatling reports:

In terms of the netdata stats, the Filename Array index performed much better: it hardly registered any CPU usage at all compared to the Filename Analysed index.

Both solutions solve the problem above, but Filename Array also lets me do exact substring matching (without the need for wildcards) whereas Filename Analysed does not.

There is a problem with Filename Array, though, in that it nearly doubles the index size (bloating it by about 80%).

So now I'm wondering: can I write a custom tokenizer that creates the same tokens I was creating in my Filename Array enrichment?

i.e. taking the URL above, https://www.domain.com/part1/user@site/part2/part3.ext, I want the following tokens:

  • www.domain.com
  • part1
  • user@site
  • part2
  • part3.ext
  • www.domain.com/part1
  • www.domain.com/part1/user@site
  • www.domain.com/part1/user@site/part2
  • www.domain.com/part1/user@site/part2/part3.ext
  • part1/user@site/part2/part3.ext
  • user@site/part2/part3.ext
  • part2/part3.ext
  • part1/user@site/part2
  • part1/user@site
  • user@site/part2

My regex capabilities don't live up to this requirement! Is it possible to write a Java plugin that can be used as a custom tokenizer? Or a Groovy script? Or can someone suggest a regular expression that might work?
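
Whichever way I end up generating the tokens, I plan to sanity-check the output by sending a request like this to the index's _analyze endpoint (assuming the analyser ends up being called url_parts; newer versions accept the JSON body, older ones take analyzer and text as query-string parameters):

{
    "analyzer": "url_parts",
    "text": "https://www.domain.com/part1/user@site/part2/part3.ext"
}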

Thanks for any help!

OK, I've discovered that it should be possible to write a plugin that contains a custom tokenizer.
The process isn't documented anywhere, though, and I'm hitting problems; see: Building a custom tokenizer: "Could not find suitable constructor"
