Searching requested files in Kibana which end in a number plus the file extension

I bumped into a strange issue. When I tried to look for the most requested documents (PDF) on one of the analyzed sites, I saw that some docs were definitely missing. First I used "request:*.pdf", then I checked "request:pdf". That's when I noticed that the wildcard query missed ALL documents whose request ended in a number before '.pdf', such as 'calendar_2017.pdf'. This is odd, because it is a string field, so I don't understand how a number can cause this issue. Is there something I can do without reindexing the data?

@YvorL would you mind posting the mapping for the specific field that you're having issues searching on?

Is this what you're looking for?
"request": {
"type": "text",
"norms": false,
"fields": {
"keyword": {
"type": "keyword"

@YvorL if you're using request:*.pdf in the Kibana query bar, it's translated into a query_string query against the analyzed text field.

The standard analyzer splits these values into the following tokens:

  1. calendar_2017.pdf -> calendar_2017, pdf
  2. 01302017.pdf -> 01302017, pdf
  3. something.pdf -> something.pdf

The standard analyzer is generally meant for full-text fields, and this tokenization explains why *.pdf doesn't return anything for examples 1 and 2 above.
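The difference comes down to the Unicode word-break rules (UAX #29) that the standard tokenizer follows: a '.' flanked by letters on both sides (or digits on both sides) stays inside the token, but a '.' sitting on a digit-letter boundary is a break. Here is a rough Python sketch of just that rule, to illustrate the behavior on the three examples above; the function name is made up and this is not the actual Lucene implementation:

```python
import re

def standard_like_tokens(text):
    """Very rough approximation of the standard analyzer's handling
    of '.' for these examples: a dot between two letters (or two
    digits) stays inside the token, but a dot on a digit-letter
    boundary acts as a word break."""
    parts = re.split(r'(?<=\d)\.(?=[A-Za-z])|(?<=[A-Za-z])\.(?=\d)', text)
    return [p for p in parts if p]

print(standard_like_tokens("calendar_2017.pdf"))  # ['calendar_2017', 'pdf']
print(standard_like_tokens("01302017.pdf"))       # ['01302017', 'pdf']
print(standard_like_tokens("something.pdf"))      # ['something.pdf']
```

Since 'pdf' alone is indexed as a token for the first two, request:pdf finds them, while the wildcard *.pdf has no single token ending in '.pdf' to match.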

You should be able to use request.keyword:*.pdf, which executes the query against the unanalyzed keyword sub-field and should return what you're looking for.

If you are able to reindex your data, pulling the extension out into its own field, either with a pattern analyzer or some other mechanism during ingest, would be much more performant.
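As one possible sketch of the ingest route, a pipeline with a grok processor could copy the extension into its own field at index time; the pipeline name and target field names below are hypothetical:

```
PUT _ingest/pipeline/extract_extension
{
  "processors": [
    {
      "grok": {
        "field": "request",
        "patterns": ["%{GREEDYDATA:request_path}\\.%{WORD:request_extension}"],
        "ignore_missing": true
      }
    }
  ]
}
```

Querying request_extension:pdf would then be an exact keyword-style match with no leading wildcard.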

I still don't see why the first two are split into separate tokens while the last one isn't. My understanding was that in continuous text (in this case, the URI) a dot or an underscore doesn't act as a token separator. It's a text field, so it should treat digits like any other character.
Regardless, this seems to be the intended behavior. I was also avoiding searching an unanalyzed field with a leading asterisk, because it won't be a one-time query. That leaves me with reindexing the data.

Thank you for taking the time!

@YvorL unfortunately, the details of the tokenizer are outside my expertise, but I'll move this to the Elasticsearch forum and hopefully they can enlighten us both :slight_smile:


This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.