Aggregating URLs with slight variations in an Elasticsearch query

I am new to Elasticsearch and am learning how the tooling works. I have an "audit" database containing records of HTTP requests to different endpoints in my application, along with the time each request was executed.

You can imagine this fictional example:

18 jan 2018 18:06:00: POST /user/1/books
18 jan 2018 18:07:00: POST /user/3/books
18 jan 2018 18:06:03: GET /books/search?title=Hello
19 jan 2018 17:04:01: GET /books/search?title=AnotherBook&pagesMoreThan=300

In my example, the user IDs (1 and 3) and the query parameters are the variable parts.

I am wondering what the best way would be to build my documents to allow answering the following questions:

  • How many times did someone call the endpoint to get books from users in a given timeframe (any user)?
  • How many times did someone search for books (disregarding parameters)?

To do this, I would need to ignore the variable parts of each URL. For example, I would need to get a count of /user/.?/books or /books/search.

What is the recommended way of doing this in Elasticsearch?

One thing I can think of is that this is not the responsibility of Elasticsearch itself, and that I should preprocess each URL when I write the documents. So maybe I could store it as

{
    "url": "/user/?/books",
    "path_parameters": [1]
},
{
    "url": "/books/search",
    "parameters": ["title=AnotherBook", "pagesMoreThan=300"]
}

Even in that case, determining which parts of a URL are variable is not easy, so it may not be possible without manually specifying every URL pattern that can occur.
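To make the preprocessing idea concrete, here is a rough sketch of what I have in mind. It rests on one big assumption: that every purely numeric path segment is a variable ID. That holds for my fictional example but would miss things like UUIDs or slugs. The function name normalize_url is just something I made up for illustration.

```python
from urllib.parse import urlsplit, parse_qsl


def normalize_url(raw_url):
    """Split a request URL into a template plus its variable parts."""
    parts = urlsplit(raw_url)
    template_segments = []
    path_parameters = []
    for segment in parts.path.split("/"):
        if segment.isdecimal():
            # Assumption: a purely numeric segment is a variable ID.
            template_segments.append("?")
            path_parameters.append(int(segment))
        else:
            template_segments.append(segment)
    return {
        "url": "/".join(template_segments),
        "path_parameters": path_parameters,
        # Query parameters are kept separately as "key=value" strings.
        "parameters": [f"{key}={value}" for key, value in parse_qsl(parts.query)],
    }


print(normalize_url("/user/1/books"))
print(normalize_url("/books/search?title=AnotherBook&pagesMoreThan=300"))
```

With this, /user/1/books and /user/3/books would both be stored with the same "url" value, which is what I would then want to count on.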

I also noticed that Elasticsearch has aggregation functions, but I'm not sure whether they are flexible enough to support what I need.
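From what I have read so far, if the normalized URL were stored in a field, a terms aggregation combined with a range filter might be enough. The sketch below only builds the request body as a Python dict. The field names "url" and "timestamp" and the helper build_count_query are my own assumptions, and I am assuming "url" would be mapped as a keyword field and "timestamp" as a date field.

```python
import json


def build_count_query(start, end):
    """Build a search body that counts requests per URL template in a timeframe."""
    return {
        # No individual hits are needed, only the aggregation buckets.
        "size": 0,
        "query": {
            # Assumed date field holding the request time.
            "range": {"timestamp": {"gte": start, "lt": end}}
        },
        "aggs": {
            "calls_per_endpoint": {
                # Assumed keyword field holding the normalized URL.
                "terms": {"field": "url"}
            }
        },
    }


body = build_count_query("2018-01-18T00:00:00", "2018-01-19T00:00:00")
print(json.dumps(body, indent=2))
```

If I understand correctly, each bucket in the response would then carry a URL template such as /user/?/books together with its document count.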

Any recommendations?
