Parsing URL in log

I'm storing URLs in the format "http://domain.com/showthread.php?10357-thread-title-and-such/page22" as "text". However, when I try to match the page number, it fails to find anything. If I try to match anything before the last "/" (such as "thread" or "title"), it works perfectly.

Can someone point me to a way of being able to match "page"?

Thanks!

Hello @thedraketaylor

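The default analyzer for text fields is the standard analyzer. It keeps "page22" together as a single token, so a query for "page" on its own has nothing to match.

To see which tokens the standard analyzer generates, you can use the _analyze API: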

POST _analyze
{
  "analyzer": "standard", 
  "text": "http://domain.com/showthread.php?10357-thread-title-and-such/page22"
}
# Result
{
  "tokens" : [
    {
      "token" : "http",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "domain.com",
      "start_offset" : 7,
      "end_offset" : 17,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "showthread.php",
      "start_offset" : 18,
      "end_offset" : 32,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "10357",
      "start_offset" : 33,
      "end_offset" : 38,
      "type" : "<NUM>",
      "position" : 3
    },
    {
      "token" : "thread",
      "start_offset" : 39,
      "end_offset" : 45,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "title",
      "start_offset" : 46,
      "end_offset" : 51,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "and",
      "start_offset" : 52,
      "end_offset" : 55,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "such",
      "start_offset" : 56,
      "end_offset" : 60,
      "type" : "<ALPHANUM>",
      "position" : 7
    },
    {
      "token" : "page22",
      "start_offset" : 61,
      "end_offset" : 67,
      "type" : "<ALPHANUM>",
      "position" : 8
    }
  ]
}

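You can test the simple analyzer the same way. It splits on every non-letter character, so "page" becomes its own token, but the numbers ("10357" and "22") are discarded: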

POST _analyze
{
  "analyzer": "simple", 
  "text": "http://domain.com/showthread.php?10357-thread-title-and-such/page22"
}
# Result
{
  "tokens" : [
    {
      "token" : "http",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "domain",
      "start_offset" : 7,
      "end_offset" : 13,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "com",
      "start_offset" : 14,
      "end_offset" : 17,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "showthread",
      "start_offset" : 18,
      "end_offset" : 28,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "php",
      "start_offset" : 29,
      "end_offset" : 32,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "thread",
      "start_offset" : 39,
      "end_offset" : 45,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "title",
      "start_offset" : 46,
      "end_offset" : 51,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "and",
      "start_offset" : 52,
      "end_offset" : 55,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "such",
      "start_offset" : 56,
      "end_offset" : 60,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "page",
      "start_offset" : 61,
      "end_offset" : 65,
      "type" : "word",
      "position" : 9
    }
  ]
}

If your URLs follow a well-known, predictable structure, you might use the pattern analyzer, which splits text on a regular expression.
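
As a rough sketch of the pattern analyzer idea: the index name "forum-urls", the analyzer name "url_pattern", and the field name "url" below are made up for illustration, and the regular expression is only one possible choice. It splits on runs of characters that are neither letters nor digits, and also at letter/digit boundaries, so "page22" should come out as "page" and "22".

```console
# Hypothetical index, analyzer, and field names; adjust to your own mapping.
PUT forum-urls
{
  "settings": {
    "analysis": {
      "analyzer": {
        "url_pattern": {
          "type": "pattern",
          "pattern": "[^\\p{L}\\d]+|(?<=\\p{L})(?=\\d)|(?<=\\d)(?=\\p{L})",
          "lowercase": true
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "url": {
        "type": "text",
        "analyzer": "url_pattern"
      }
    }
  }
}

# Check the tokens this analyzer produces before indexing real data.
POST forum-urls/_analyze
{
  "analyzer": "url_pattern",
  "text": "http://domain.com/showthread.php?10357-thread-title-and-such/page22"
}
```

If the tokens look right, a match query for "page" on that field should then be able to find the document.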

More information about this subject can be found in our documentation.

It is also possible to create a custom analyzer and use it in your index.
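
A minimal sketch of such a custom analyzer, assuming a hypothetical index "forum-urls-custom" with a "url" field: it keeps the whole URL as one token with the whitespace tokenizer, then lets the word_delimiter_graph filter split it at punctuation and at letter/number transitions. This is one option among several, not the only way to do it, so it is worth checking the output with _analyze on your own data.

```console
# Hypothetical index, analyzer, and field names.
PUT forum-urls-custom
{
  "settings": {
    "analysis": {
      "analyzer": {
        "url_custom": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["word_delimiter_graph", "lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "url": {
        "type": "text",
        "analyzer": "url_custom"
      }
    }
  }
}

# Inspect the tokens, using the analyzer configured on the "url" field.
POST forum-urls-custom/_analyze
{
  "field": "url",
  "text": "http://domain.com/showthread.php?10357-thread-title-and-such/page22"
}

# Once documents are indexed, search for the "page" part on its own.
GET forum-urls-custom/_search
{
  "query": {
    "match": { "url": "page" }
  }
}
```

Note that an existing field's analyzer cannot simply be swapped in place; you would normally create a new index with the new mapping and reindex into it.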

Thank you so much! That's given me a LOT to learn. I appreciate it!
