We have an Elasticsearch cluster that contains over half a billion documents, each of which has a url field that stores a URL.
The url field mapping currently has the settings:
{
  "index": "not_analyzed",
  "doc_values": true,
  ...
}
We want our users to be able to search URLs, or portions of URLs, without having to use wildcards.
For example, taking the URL: https://www.domain.com/part1/user@site/part2/part3.ext
They should be able to bring back a matching document by searching:
part3.ext
user@site
part1
part2/part3.ext
The way I see it, we have two options:
Option 1: Implement an analysed version of this field (which can no longer have doc_values: true) and do match querying instead of wildcards. This would also require a custom analyser that leverages the pattern tokeniser so the extracted terms come out right (the standard tokeniser would split user@site into user and site). A rough sketch of what this might look like follows after option 2.
Option 2: Go through our database and, for each document, create a new field that is a list of URL parts. This field could still have doc_values: true, so it would be stored off-heap, and we could do term querying on exact field values instead of wildcards.
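For reference, here is a minimal, untested sketch of what option 1 might look like, assuming that splitting on / (after stripping the protocol) is enough for our URLs; the analyser, tokeniser and char filter names are just illustrative:

{
  "settings": {
    "analysis": {
      "char_filter": {
        "strip_protocol": {
          "type": "pattern_replace",
          "pattern": "^[a-zA-Z][a-zA-Z0-9+.-]*://",
          "replacement": ""
        }
      },
      "tokenizer": {
        "url_segments": {
          "type": "pattern",
          "pattern": "/"
        }
      },
      "analyzer": {
        "url_parts": {
          "type": "custom",
          "char_filter": [ "strip_protocol" ],
          "tokenizer": "url_segments"
        }
      }
    }
  }
}

With url_parts as the field's analyser, the example URL is indexed as the terms www.domain.com, part1, user@site, part2 and part3.ext, and a match query for part2/part3.ext is analysed into that same kind of term, so no wildcards are needed.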
My question is this:
Which is better for performance: a variable-length list field with doc_values enabled (option 2), or an analysed field (option 1)? Or is there an option 3 that would be even better yet?!
Will you really need to aggregate over parts of URLs? I suspect not. You could just index it twice: once with type: text and once with type: keyword and index: false. The first field will be useful for searching over URL parts and the second field will be useful for aggregating entire URLs.
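In mapping terms that suggestion is roughly the sketch below, placed inside the properties of your mapping (field and sub-field names are illustrative; text/keyword assume 5.x or later, on older versions the equivalent would be an analyzed string plus a not_analyzed string):

"url": {
  "type": "text",
  "fields": {
    "raw": {
      "type": "keyword",
      "index": false,
      "doc_values": true
    }
  }
}

Note that with the default analyser the text field would still split user@site into user and site, so a custom pattern-based analyser like the one sketched under option 1 could be plugged into it if that matters.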
So I tried it myself against 10 million documents.
I did one test with a filename field that was created by copy_to from two other URL-type fields (also set store: false), and used a custom analyser with a pattern tokeniser. This one I called Filename Analysed.
I did another test where I ran an enrichment over all 10 million documents to create a list field containing every contiguous combination of segments for both URL-type fields I wanted to include. This one I called Filename Array.
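To make the comparison concrete, the Filename Array side boiled down to something like this hypothetical filename_parts field (the real field names are different), which on 2.x would be a not_analyzed string and on 5.x+ a keyword field:

"filename_parts": {
  "type": "keyword",
  "doc_values": true
}

Each document then carries an array of values for that field (Elasticsearch has no dedicated list type; any field can hold multiple values), and searches are plain term queries on exact values, for example:

{
  "query": {
    "term": {
      "filename_parts": "part2/part3.ext"
    }
  }
}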
Then I ran Gatling tests against each index to see how they compared, while watching CPU/memory stats in netdata (https://github.com/firehol/netdata).
According to netdata, the Filename Array index performed much better: it hardly registered any CPU usage at all compared with the Filename Analysed index.
Both solutions solve the problem above, but Filename Array also lets me do exact substring matching (without the need for wildcards) whereas Filename Analysed does not.
There is a problem with Filename Array, though, in that it nearly doubles the index size (bloats it by about 80%).
So now I'm wondering: can I write a custom tokenizer that creates the same tokens I was creating in my Filename Array enrichment?
i.e. taking the URL above, https://www.domain.com/part1/user@site/part2/part3.ext, I want the following tokens:
www.domain.com
part1
user@site
part2
part3.ext
www.domain.com/part1
www.domain.com/part1/user@site
www.domain.com/part1/user@site/part2
www.domain.com/part1/user@site/part2/part3.ext
part1/user@site/part2/part3.ext
user@site/part2/part3.ext
part2/part3.ext
part1/user@site/part2
part1/user@site
user@site/part2
My regex capabilities don't live up to this requirement! Is it possible to write a Java plugin that can be used as a custom tokenizer? Or a Groovy script? Or can someone suggest a regular expression that might work?!
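One avenue that might avoid both the regex gymnastics and a Java plugin is the built-in shingle token filter: split on / with a pattern tokeniser, then shingle the segments back together with / as the token separator, so the emitted terms are exactly the contiguous segment combinations listed above. A rough, untested sketch (all names illustrative), reusing the strip_protocol char filter and url_segments tokeniser idea from the earlier sketch:

{
  "settings": {
    "analysis": {
      "char_filter": {
        "strip_protocol": {
          "type": "pattern_replace",
          "pattern": "^[a-zA-Z][a-zA-Z0-9+.-]*://",
          "replacement": ""
        }
      },
      "tokenizer": {
        "url_segments": {
          "type": "pattern",
          "pattern": "/"
        }
      },
      "filter": {
        "url_shingles": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 5,
          "output_unigrams": true,
          "token_separator": "/"
        }
      },
      "analyzer": {
        "url_path_combinations": {
          "type": "custom",
          "char_filter": [ "strip_protocol" ],
          "tokenizer": "url_segments",
          "filter": [ "url_shingles" ]
        }
      }
    }
  }
}

For the example URL this emits the fifteen tokens listed above. Deeper paths need a larger max_shingle_size (and, on recent versions, a higher index.max_shingle_diff setting), and shingling will grow the index much like the enrichment does, so it trades the offline enrichment for index-time work rather than removing the size cost. At search time it would also make sense to use a non-shingling search_analyzer (or plain term queries) so the query text isn't shingled as well.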