Parsing url in log

thedraketaylor · April 11, 2020, 10:41pm

I'm storing URLs in the format "http://domain.com/showthread.php?10357-thread-title-and-such/page22" as "text". However, when I try to match the page number, it fails to find anything. If I try to match anything before the last "/" (such as "thread" or "title"), it works perfectly.

Can someone point me to a way to match "page"?

Thanks!

Luca_Belluccini · April 11, 2020, 11:46pm

Hello @thedraketaylor

The default analyzer for text fields is the standard one.

To see which tokens the standard analyzer generates, you can use:

POST _analyze
{
  "analyzer": "standard", 
  "text": "http://domain.com/showthread.php?10357-thread-title-and-such/page22"
}
# Result
{
  "tokens" : [
    {
      "token" : "http",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "domain.com",
      "start_offset" : 7,
      "end_offset" : 17,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "showthread.php",
      "start_offset" : 18,
      "end_offset" : 32,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "10357",
      "start_offset" : 33,
      "end_offset" : 38,
      "type" : "<NUM>",
      "position" : 3
    },
    {
      "token" : "thread",
      "start_offset" : 39,
      "end_offset" : 45,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "title",
      "start_offset" : 46,
      "end_offset" : 51,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "and",
      "start_offset" : 52,
      "end_offset" : 55,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "such",
      "start_offset" : 56,
      "end_offset" : 60,
      "type" : "<ALPHANUM>",
      "position" : 7
    },
    {
      "token" : "page22",
      "start_offset" : 61,
      "end_offset" : 67,
      "type" : "<ALPHANUM>",
      "position" : 8
    }
  ]
}
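
Since the last token is "page22", a match query has to hit that exact token. As a rough sketch, assuming a hypothetical index called forum-logs with the URL stored in a text field called url (both names are made up for illustration), the first search below would find the document and the second would not:

```console
# Hypothetical index (forum-logs) and field (url), analyzed with standard

# Matches: "page22" is one of the indexed tokens
GET forum-logs/_search
{
  "query": { "match": { "url": "page22" } }
}

# No hits: there is no standalone "page" token in the index
GET forum-logs/_search
{
  "query": { "match": { "url": "page" } }
}
```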

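Unlike standard, which keeps "page22" whole, the simple analyzer splits on every non-letter character, so it emits "page" as its own token but drops digits such as "10357" and "22". You can test it with: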

POST _analyze
{
  "analyzer": "simple", 
  "text": "http://domain.com/showthread.php?10357-thread-title-and-such/page22"
}
# Result
{
  "tokens" : [
    {
      "token" : "http",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "domain",
      "start_offset" : 7,
      "end_offset" : 13,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "com",
      "start_offset" : 14,
      "end_offset" : 17,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "showthread",
      "start_offset" : 18,
      "end_offset" : 28,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "php",
      "start_offset" : 29,
      "end_offset" : 32,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "thread",
      "start_offset" : 39,
      "end_offset" : 45,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "title",
      "start_offset" : 46,
      "end_offset" : 51,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "and",
      "start_offset" : 52,
      "end_offset" : 55,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "such",
      "start_offset" : 56,
      "end_offset" : 60,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "page",
      "start_offset" : 61,
      "end_offset" : 65,
      "type" : "word",
      "position" : 9
    }
  ]
}

If you're working with well-known URLs (ones whose typical structure you know), you might use the pattern analyzer.
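
As a sketch of that idea, you can try a pattern tokenizer inline in the _analyze API before touching any index. The regular expression here is my own assumption, not something from your data: it splits on runs of non-alphanumeric characters and also at the boundary between a letter and a digit, with the aim of turning "page22" into "page" and "22" while keeping the numbers.

```console
# The pattern is an illustrative assumption; adjust it to your URL structure
POST _analyze
{
  "tokenizer": {
    "type": "pattern",
    "pattern": "[^A-Za-z0-9]+|(?<=[A-Za-z])(?=[0-9])"
  },
  "filter": [ "lowercase" ],
  "text": "http://domain.com/showthread.php?10357-thread-title-and-such/page22"
}
```

Run it and check the token list: if "page" and "22" come out as separate tokens, the pattern does what you need.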

More information about this subject can be found in our documentation.

It is also possible to create a custom analyzer and use it in your index.
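
A minimal sketch of what that could look like, assuming a new index named forum-logs and a field named url (both hypothetical), with a custom analyzer built from a pattern tokenizer and the lowercase filter. The names url_parts and url_analyzer and the regular expression are placeholders of mine as well.

```console
# Index, field, tokenizer and analyzer names are all hypothetical
PUT forum-logs
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "url_parts": {
          "type": "pattern",
          "pattern": "[^A-Za-z0-9]+|(?<=[A-Za-z])(?=[0-9])"
        }
      },
      "analyzer": {
        "url_analyzer": {
          "type": "custom",
          "tokenizer": "url_parts",
          "filter": [ "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "url": { "type": "text", "analyzer": "url_analyzer" }
    }
  }
}

# Check the tokens the field will actually index
GET forum-logs/_analyze
{
  "field": "url",
  "text": "http://domain.com/showthread.php?10357-thread-title-and-such/page22"
}
```

Keep in mind that the analyzer of an existing field cannot be changed in place, so data already indexed with the standard analyzer would need to be reindexed into an index with the new mapping.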

thedraketaylor · April 12, 2020, 12:20am

Thank you so much! That's given me a LOT to learn. I appreciate it!
