"Best" way for (partial) search in URIs (e.g. requests/referrer)?

Hi there,

I'm dumping an nginx access log into Elasticsearch. At the moment I just use the standard analyzer for all fields (something I only learned afterwards; the defaults in Elasticsearch are just too good to bother you with this kind of topic early on :grin:).

I struggle a bit with looking up data. It usually works, but I have some cases where the standard analyzer interferes (I think).

I have a few different request patterns. For example:

/public/a_16432_d0FqR/file/data/36541_536277.png
/access/access/logout?sid=f5a4875f7ee771174f1df1
/file/File/getThumbnail/835/64/64?sid=f5a4875f7ee771174f1df1
/page/setDelete/905?sid=f5a4875f7ee771174f1df1
/dashboard/dashboard/execute?sid=f5a4875f7ee771174f1df1&a_u=16432_79958

The standard analyzer lets me look up parts (e.g. setDelete but not delete), which is OK-ish for me (the pain is not big enough to justify additional changes here). But looking up specific GET parameters does not work when I search for only part of the value. E.g. a_u returns results, and a_u=16432_79958 can be looked up as well, but what I need is a_u=16432, which does not work (probably because of the _).

My template currently looks like this:

      ...
      "request": {
        "dynamic": true,
        "properties": {
          "keyword": {
            "type": "keyword"
          },
          "raw": {
            "type": "text"
          }
        }
      }
     ...

My "requirements" are:

  • allow search for fragments (e.g. page/setDelete; just delete would be the icing on the cake)
  • allow search for the complete URL
  • allow search for (parts of) specific GET parameters (e.g. a_u=12345 when the complete parameter is a_u=12345_6789)

Can someone please give me some advice on what to adjust to solve my problem? (Or point me to the relevant part of the docs?) Is the analyzer wrong? (If so, which one is best for this kind of data?)

Thanks in advance! :slight_smile:

You're right - the standard analyzer is wrong for your use case. You can see what the standard analyzer does on your URLs by using the _analyze API:

GET my_index/_analyze
{
  "analyzer": "standard",
  "text": [
    "/page/setDelete/905?sid=f5a4875f7ee771174f1df1",
    "/dashboard/dashboard/execute?sid=f5a4875f7ee771174f1df1&a_u=16432_79958"
  ]
}

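The output shows that setDelete is kept as a single token (lowercased to setdelete), and that a_u=16432_79958 is only split at the =, giving the tokens a_u and 16432_79958. Because the standard analyzer does not split on the underscore, 16432 never exists as a token of its own, which is why a search for a_u=16432 finds nothing.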
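
If you want to see the effect without a cluster at hand, here is a minimal Python sketch that imitates the standard analyzer for plain ASCII URLs like yours. It is only an approximation (the real tokenizer follows Unicode word-boundary rules), and standard_like_tokens is a made-up helper name, not an Elasticsearch API:

```python
import re


def standard_like_tokens(text):
    """Rough ASCII stand-in for the standard analyzer.

    Keeps runs of letters, digits and underscores together and lowercases them.
    This is an assumption-laden approximation, not the real tokenizer.
    """
    return [token.lower() for token in re.findall(r"[A-Za-z0-9_]+", text)]


indexed = standard_like_tokens(
    "/dashboard/dashboard/execute?sid=f5a4875f7ee771174f1df1&a_u=16432_79958"
)
query = standard_like_tokens("a_u=16432")

print(indexed)  # ..., 'a_u', '16432_79958'
print(query)    # ['a_u', '16432']

# 'a_u' is an indexed token, '16432' is not, so the partial value cannot match.
print([token in indexed for token in query])  # [True, False]
```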

You'll need to create a custom analyzer to support your use case. The documentation has an example of a custom "camelcase" analyzer with a tokenizer that uses regular expressions to break up strings. That could be a good starting point for yours. Maybe something like this?

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "pattern",
          "pattern": "([^\\p{L}\\d=]+)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])"
        }
      }
    }
  }
}

GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": [
    "/page/setDelete/905?sid=f5a4875f7ee771174f1df1",
    "/dashboard/dashboard/execute?sid=f5a4875f7ee771174f1df1&a_u=16432_79958"
  ]
}
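
To get a feel for what this pattern produces before trying it in Elasticsearch, here is a rough Python sketch. Python's re module has no \p{L} classes or character class intersection, so the pattern below is an ASCII-only rewrite of the one above. The helper names pattern_like_tokens and contains_phrase are made up for illustration, and the phrase check assumes you would query with something like match_phrase:

```python
import re

# ASCII-only approximation of the Java pattern used in my_analyzer.
SPLIT = re.compile(
    r"[^A-Za-z0-9=]+"             # runs of anything except letters, digits and '='
    r"|(?<=[a-z])(?=[A-Z])"       # lower-to-upper boundary, e.g. set|Delete
    r"|(?<=[A-Z])(?=[A-Z][a-z])"  # end of an uppercase run, e.g. URL|Path
)


def pattern_like_tokens(text):
    """Split on the pattern, drop empty pieces, lowercase (the pattern analyzer lowercases by default)."""
    return [piece.lower() for piece in SPLIT.split(text) if piece]


def contains_phrase(doc_tokens, query_tokens):
    """True if query_tokens occur as a contiguous run in doc_tokens."""
    n = len(query_tokens)
    return any(doc_tokens[i:i + n] == query_tokens
               for i in range(len(doc_tokens) - n + 1))


doc = pattern_like_tokens(
    "/dashboard/dashboard/execute?sid=f5a4875f7ee771174f1df1&a_u=16432_79958"
)
print(doc)  # ..., 'a', 'u=16432', '79958'
print(contains_phrase(doc, pattern_like_tokens("a_u=16432")))  # True

page = pattern_like_tokens("/page/setDelete/905?sid=f5a4875f7ee771174f1df1")
print(contains_phrase(page, pattern_like_tokens("page/setDelete")))  # True
print(contains_phrase(page, pattern_like_tokens("delete")))          # True
```

Note that the = stays inside the token (u=16432), so the parameter name and the first part of its value are matched together, while the underscore now acts as a separator.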

Keep in mind that you cannot change the analyzer for existing fields in your mapping. You'll have to either add a new multi-field to your mapping or create a new index with a mapping that uses your new custom analyzer.
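
As a sketch of the "new index" route, the Python snippet below just builds and prints a request body you could send with PUT to a new index. It assumes a typeless mapping (Elasticsearch 7.x or later), and the sub-field name parts is purely hypothetical. The top-level keyword field is meant to cover the "complete URL" requirement, while request.parts would use the custom analyzer:

```python
import json

# Same pattern as in my_analyzer above; json.dumps adds the JSON escaping.
URL_PATTERN = (
    r"([^\p{L}\d=]+)"
    r"|(?<=[\p{L}&&[^\p{Lu}]])(?=\p{Lu})"
    r"|(?<=\p{Lu})(?=\p{Lu}[\p{L}&&[^\p{Lu}]])"
)

index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer": {"type": "pattern", "pattern": URL_PATTERN}
            }
        }
    },
    "mappings": {
        "properties": {
            "request": {
                "type": "keyword",  # exact match on the complete URL
                "fields": {
                    # hypothetical sub-field name, analyzed with the custom analyzer
                    "parts": {"type": "text", "analyzer": "my_analyzer"}
                },
            }
        }
    },
}

# Paste the printed JSON as the body of a PUT request for the new index.
print(json.dumps(index_body, indent=2))
```

You would then search request for whole URLs and request.parts for fragments.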

Thank you @abdon! I will play around with this :slight_smile: