Problem with token delimiter and regular expression

Mory_Kaba · July 2, 2014, 4:20pm

Hi,

I am pretty new to Elasticsearch and I'm facing a problem I can't figure
out.
I'm using Logstash to store log files in Elasticsearch following a specific
format. Each log line includes a URL and some other elements that are
translated into fields in the Elasticsearch index.
The storing process seems to work well and I am able to browse the
data as I want.
The problem is related to the way some fields are parsed when I try
to analyze the data, and more particularly to the delimiters that
are used to split the tokens.

One of the fields I want to analyze (named 'category') is composed of
several parts separated by special characters such as '|', and the actual
tokens sometimes contain '-' characters. Example: "category1|cat-egory2".
The pipe should remain a delimiter, but the dash is a problem because it is
part of some of the category names.

I've read some documentation about token delimiter (
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-word-delimiter-tokenfilter.html
and
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-pattern-analyzer.html)
and tried to apply the instructions. So, before creating any index, I sent
a request to Elasticsearch to change the delimiter pattern to my own
regular expression ( "pattern":"|\\s+" ), modeled on the whitespace
example in the documentation. It is not very different from that example,
so I'm pretty sure the pattern is correct.
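
One thing worth checking in the pattern quoted above: in most regular expression dialects an unescaped '|' is the alternation operator, not a literal pipe. The sketch below uses Python's re module as a stand-in (Elasticsearch itself uses Java regular expressions, which treat alternation the same way) to show how the pattern as written behaves compared with an escaped pipe. The variable names are illustrative only.

```python
import re

sample = "category1|cat-egory2"

# Unescaped, "|" is the alternation operator, so this pattern reads as
# "the empty string OR one or more whitespace characters".
unescaped = re.compile(r"|\s+")

# Escaped, "\|" matches a literal pipe character.
escaped = re.compile(r"\|")

# The unescaped pattern accepts the empty string...
matches_empty = unescaped.fullmatch("") is not None
# ...but it does not match a pipe character.
matches_pipe = unescaped.fullmatch("|") is not None

# Splitting on the escaped pipe keeps the dash inside the second token.
tokens = escaped.split(sample)

print(matches_empty)  # True
print(matches_pipe)   # False
print(tokens)         # ['category1', 'cat-egory2']
```

Inside a JSON settings body the backslash itself has to be escaped, so the escaped pipe would be written as "\\|".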

Here is the kind of request I am performing after the PUT request was made:

{
  "query": {
    "match_all": {}
  },
  "facets": {
    "category name": {
      "terms": {
        "field": "category"
      }
    }
  }
}

The response reports the number of occurrences of each 'category' value, but
the values are split into separate tokens, and the split does not
follow the pattern I configured.
Instead I get statistics that do not reflect the actual data because of
Elasticsearch's default behavior.
I would like to know what I'm doing wrong, which is why I'm asking for
your help.

Regards

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/8c16bc1b-89ff-4057-91f1-1d3cb4edeaf6%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Ivan · July 3, 2014, 7:20pm

Did you apply your mapping with the new analyzer before indexing documents?
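
For illustration, here is a minimal sketch of what an index creation body with a custom analyzer and a mapping that uses it might look like, built as a Python dict and printed as JSON. The analyzer name "pipe_analyzer" and the type name "logs" are hypothetical. The "string" field type and the typed mapping follow the 1.x releases current at the time of this thread; newer versions use "text" and drop the type name. This is an assumption about your setup, not your actual configuration.

```python
import json

# Hypothetical index body: "pipe_analyzer" and "logs" are illustrative names.
# "category" is the field discussed in this thread.
index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "pipe_analyzer": {
                    "type": "pattern",    # built-in pattern analyzer
                    "pattern": "\\|",     # regex: a literal pipe character
                    "lowercase": False,   # keep category names unchanged
                }
            }
        }
    },
    "mappings": {
        "logs": {  # 1.x-style type name (assumed)
            "properties": {
                "category": {"type": "string", "analyzer": "pipe_analyzer"}
            }
        }
    },
}

# This body would have to be sent when the index is created, before any
# documents are indexed, for the analyzer to apply to the field.
index_json = json.dumps(index_body, indent=2)
print(index_json)
```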

First, you should verify that your mapping is correct by using the mapping
API. Do not just look at your templates; use the API in case there is a
problem in the templates.

Second, if the mapping looks correct, use the analyze API to test your
analyzer. Use the index and field names so that the actual analyzer defined
in the mapping is used.
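
The two checks above can be sketched as follows. This only builds the requests and does not send them. The host and index name are assumptions, and the analyze call uses the query-string form (field and text as URL parameters) from the 1.x releases; newer versions expect a JSON body with the same two keys instead.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request

BASE_URL = "http://localhost:9200"  # assumed local node
INDEX = "logstash-2014.07.02"       # hypothetical index name


def mapping_request(base_url, index):
    """Build a GET request for the mapping the index actually uses."""
    return Request(f"{base_url}/{quote(index)}/_mapping", method="GET")


def analyze_request(base_url, index, field, text):
    """Build an analyze request that runs the analyzer mapped to `field`."""
    params = urlencode({"field": field, "text": text})
    return Request(f"{base_url}/{quote(index)}/_analyze?{params}", method="GET")


mapping_req = mapping_request(BASE_URL, INDEX)
analyze_req = analyze_request(BASE_URL, INDEX, "category", "category1|cat-egory2")

print(mapping_req.get_method(), mapping_req.full_url)
print(analyze_req.get_method(), analyze_req.full_url)

# To run against a live cluster, pass either request to
# urllib.request.urlopen() and inspect the JSON response.
```

If the analyze response lists "category1" and "cat-egory2" as the only tokens, the analyzer is applied as intended. If the value is split on the dash as well, the field is still using the default analyzer.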

My guess is that something was skipped in your configuration.

Cheers,

Ivan

