Suggestion for wildcard field type

willemdh · August 5, 2021, 3:51pm

Hello,

We need to do a lot of (front) wildcard queries on certain fields. One such field is url.original. Currently this is mapped in ECS as keyword. I'm wondering if this could be changed to wildcard field type?

For example, we need to exclude all firewall logs where url.original matches *.cedexis-test.com/. I've never used the wildcard field type, but would it improve performance if we updated the mapping to wildcard in such a case?

Willem

Mark_Harwood · August 5, 2021, 4:11pm

Roughly speaking, leading-wildcard and infix queries (a leading wildcard or .*foo.* regexes) have costs as follows:

keyword fields are linear with the number of unique terms
wildcard fields are linear with the number of docs that use a term

So if your URLs (e.g. the site home page) appear in many docs, these are likely to be slow to match on a wildcard field.
The flowchart at the end of this blog has a good decision chart.

willemdh · August 5, 2021, 4:35pm

Thanks for the link and the flowchart suggestion, that helps! Just wondering

Does "unique values" above mean total unique values for url.original, or total possible unique values for all possible results when using *.cedexis-test.com/?

I think there are around 1000 unique values matching *.cedexis-test.com/ in my data, so in that case a keyword would be my best option. But I have 100K+ unique url.original values, so then a wildcard would be the best option.

Mark_Harwood · August 5, 2021, 4:44pm

total unique values for url.original, or total possible unique values for all possible results when using *.cedexis-test.com/

The former.

The thing about leading-wildcard/infix queries and keyword fields is that the index is of limited use to you.
Unlike an exact-match query or a prefix query (trailing wildcard), the alphabetic sorting of the list of unique terms can't be used to quickly seek to the relevant part. Leading-wildcard/infix queries have to scan the full list of all unique terms.
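
To illustrate the point about the sorted term list, here is a toy sketch in Python. It is not how Elasticsearch is implemented; the function names and sample URLs are made up for illustration. A pattern with a known start can seek into the sorted list, while a pattern with a leading wildcard has to test every unique term.

```python
from bisect import bisect_left


def match_known_start(sorted_terms, prefix):
    """Seek to the first term >= prefix, then read until terms stop matching.

    Returns (matches, number_of_terms_examined).
    """
    start = bisect_left(sorted_terms, prefix)
    matches, examined = [], 0
    for term in sorted_terms[start:]:
        examined += 1
        if not term.startswith(prefix):
            break  # sorted order guarantees no later term can match
        matches.append(term)
    return matches, examined


def match_leading_wildcard(sorted_terms, suffix):
    """Equivalent of '*<suffix>': sorting does not help, so scan every term.

    Returns (matches, number_of_terms_examined).
    """
    matches = [t for t in sorted_terms if t.endswith(suffix)]
    return matches, len(sorted_terms)


# Hypothetical sample of unique url.original terms.
terms = sorted([
    "a.cedexis-test.com/",
    "b.cedexis-test.com/",
    "example.com/",
    "example.com/login",
    "example.org/",
])

seek_matches, seek_examined = match_known_start(terms, "example.com")
scan_matches, scan_examined = match_leading_wildcard(terms, ".cedexis-test.com/")
print(seek_matches, seek_examined)
print(scan_matches, scan_examined)
```

With 100K+ unique values, the second function's cost grows with the whole term list no matter how few terms actually match.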

With URLs I imagine that while there are a lot of unique values, they're not evenly distributed. They're likely quite "Zipfy", e.g. a handful of URLs account for 90% of all mentions. These very popular URLs will be slow to query with a wildcard field because it has to verify each use of a term in a doc, whereas a keyword field need only find the term in the index and can then be certain that all docs listed with that term genuinely do have that term.
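
A very rough way to get a feel for this trade-off before benchmarking is to count, on a sample of your own data, the two quantities mentioned earlier in the thread: the number of unique terms (what a keyword field scans) and the number of docs holding a matching value (what a wildcard field verifies). The sketch below is only a toy cost model under that assumption, with made-up data; it ignores everything else Elasticsearch does.

```python
from collections import Counter


def rough_costs(doc_urls, suffix):
    """Return (keyword_cost, wildcard_cost) for a '*<suffix>' query.

    keyword_cost:  number of unique terms, each scanned once.
    wildcard_cost: number of docs whose value matches, each verified once.
    """
    counts = Counter(doc_urls)
    keyword_cost = len(counts)
    wildcard_cost = sum(n for url, n in counts.items() if url.endswith(suffix))
    return keyword_cost, wildcard_cost


# Hypothetical "Zipfy" sample: two very popular URLs and a tail of rare ones.
docs = (
    ["a.cedexis-test.com/"] * 900
    + ["b.cedexis-test.com/"] * 90
    + ["example.com/page%d" % i for i in range(10)]
)

keyword_cost, wildcard_cost = rough_costs(docs, ".cedexis-test.com/")
print(keyword_cost, wildcard_cost)
```

In this skewed sample the keyword scan touches 12 unique terms while the wildcard side has 990 doc values to verify; with many unique but rarely repeated URLs the balance tips the other way.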

Benchmarking is the only way to know for sure which is the best approach as so much depends on your data.

willemdh · August 5, 2021, 5:00pm

Thanks @Mark_Harwood. I think I'll duplicate the url.original field to another field, e.g. url.originalwildcard, and test both.
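
For anyone wanting to try the same comparison, a minimal sketch of what the test setup could look like, written as Python dicts that can be sent as request bodies. It assumes copy_to is used to duplicate the value (reindexing or an ingest pipeline would work too); the exclude_query helper is hypothetical, and no client calls are made here.

```python
import json

KEYWORD_FIELD = "url.original"
WILDCARD_FIELD = "url.originalwildcard"  # field name suggested in the post above

# Index mapping body: keep url.original as keyword and copy each value
# into a sibling field mapped with the wildcard field type.
mapping = {
    "mappings": {
        "properties": {
            "url": {
                "properties": {
                    "original": {"type": "keyword", "copy_to": WILDCARD_FIELD},
                    "originalwildcard": {"type": "wildcard"},
                }
            }
        }
    }
}


def exclude_query(field, pattern):
    """Build a search body that excludes docs whose field matches the pattern."""
    return {
        "query": {
            "bool": {
                "must_not": [{"wildcard": {field: {"value": pattern}}}]
            }
        }
    }


# Same pattern against both fields, so the timings can be compared.
keyword_query = exclude_query(KEYWORD_FIELD, "*.cedexis-test.com/")
wildcard_query = exclude_query(WILDCARD_FIELD, "*.cedexis-test.com/")

print(json.dumps(mapping, indent=2))
print(json.dumps(wildcard_query, indent=2))
```

Running both queries against the same index and comparing the reported "took" times over several runs would give a first indication of which mapping suits the data.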

Mark_Harwood · August 5, 2021, 6:14pm

I'd be interested to hear about your results if you're able to share them.

system · September 2, 2021, 6:14pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
