Using a char_filter in combination with a lowercase filter


(Matthias Hogerheijde) #1

Hi,

We're using Elasticsearch with an analyzer that maps the y character to
ij (a char_filter named "char_mapper"), since in Dutch these two are
"somewhat" interchangeable. We're also using a lowercase filter.

This is the configuration:

{
  "analysis": {
    "analyzer": {
      "index": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": [
          "lowercase",
          "synonym_twoway",
          "standard",
          "asciifolding"
        ],
        "char_filter": [
          "char_mapper"
        ]
      },
      "index_prefix": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": [
          "lowercase",
          "synonym_twoway",
          "standard",
          "asciifolding",
          "prefixes"
        ],
        "char_filter": [
          "char_mapper"
        ]
      },
      "search": {
        "alias": [
          "default"
        ],
        "type": "custom",
        "tokenizer": "standard",
        "filter": [
          "lowercase",
          "synonym",
          "synonym_twoway",
          "standard",
          "asciifolding"
        ],
        "char_filter": [
          "char_mapper"
        ]
      },
      "postal_code": {
        "tokenizer": "keyword",
        "filter": [
          "lowercase"
        ]
      }
    },
    "tokenizer": {
      "standard": {
        "stopwords": []
      }
    },
    "filter": {
      "synonym": {
        "type": "synonym",
        "synonyms": [
          "st => sint",
          "jp => jan pieterszoon",
          "mh => maarten harpertszoon"
        ]
      },
      "synonym_twoway": {
        "type": "synonym",
        "synonyms": [
          "den haag, s gravenhage",
          "den bosch, s hertogenbosch"
        ]
      },
      "prefixes": {
        "type": "edgeNGram",
        "side": "front",
        "min_gram": 1,
        "max_gram": 30
      }
    },
    "char_filter": {
      "char_mapper": {
        "type": "mapping",
        "mappings": [
          "y => ij"
        ]
      }
    }
  }
}

When indexing cities, we're using this mapping:

{
  "properties": {
    "city": {
      "type": "multi_field",
      "fields": {
        "city": {
          "type": "string"
        },
        "prefix": {
          "type": "string",
          "boost": 0.5,
          "index_analyzer": "index_prefix"
        }
      }
    },
    "province_code": {
      "type": "string"
    },
    "unique_name": {
      "type": "boolean"
    },
    "point": {
      "type": "geo_point"
    },
    "search_terms": {
      "type": "multi_field",
      "fields": {
        "search_terms": {
          "type": "string"
        },
        "prefix": {
          "boost": 0.5,
          "index_analyzer": "index_prefix",
          "type": "string"
        }
      }
    }
  },
  "search_analyzer": "search",
  "index_analyzer": "index"
}

When we index all the (Dutch) cities from our data source, there are cities
starting with both IJ and Y (for example, these city names exist:
IJssel, IJsselstein, Yerseke and Ysselsteyn). It seems that these
characters are not lowercased before the char_mapper mapping is applied.

Querying the index gives these results:

/top/city/_search?q=ijsselstein -> works, returns the document for IJsselstein
/top/city/_search?q=Ijsselstein -> works, returns the document for IJsselstein
/top/city/_search?q=yerseke -> *doesn't* work, returns nothing
/top/city/_search?q=Yerseke -> *does* work, returns the document for Yerseke
/top/city/_search?q=YsselsteYn -> *doesn't* work, returns nothing
/top/city/_search?q=Ysselsteyn -> *does* work, returns the document for Ysselsteyn

Changing the case of any other letter doesn't affect the results.
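
One way to see what each analyzer produces for these inputs is the _analyze endpoint. The sketch below only builds the request URLs; the host and port (localhost:9200) are assumptions, the index name "top" comes from the query paths above, and passing analyzer and text as query parameters assumes an older (1.x-era) cluster like the one this configuration appears to target. Newer versions expect a JSON body instead.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:9200"  # assumption: a local single-node cluster
INDEX = "top"                       # index name taken from the query paths in this thread


def analyze_url(analyzer, text):
    """Build an _analyze URL that shows the tokens an analyzer emits for text."""
    query = urlencode({"analyzer": analyzer, "text": text})
    return f"{BASE_URL}/{INDEX}/_analyze?{query}"


# Compare what the search analyzer does to each query variant ...
for text in ["yerseke", "Yerseke", "YsselsteYn", "Ysselsteyn"]:
    print(analyze_url("search", text))

# ... with what the index analyzer stored for the city name.
print(analyze_url("index", "Yerseke"))
```

Fetching these URLs (for example with curl) and comparing the returned tokens should show whether the query term and the indexed term differ.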

I've worked around this issue by adding the mapping "Y => ij", i.e.:

"char_filter": {
  "char_mapper": {
    "type": "mapping",
    "mappings": [
      "y => ij",
      "Y => ij"
    ]
  }
}

This solves the problem, but I'd rather have the lowercase filter applied
before the mapping, or be able to make the order explicit. Is there any
stance on this issue? Or is this intended behaviour?

Regards,
Matthias Hogerheijde

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/c60de452-2a3f-42f7-a677-956f81ecec17%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Ivan Brusic) #2

Char filters are applied before the text is tokenized, and therefore they
are applied before the "normal" filters are used, which is why they are a
separate class of filter. With Lucene, the order is:

char filters -> tokenizer -> filters
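
A toy model of that order may help explain the results reported above. This is only a sketch: the analyze function, the regex word split standing in for the standard tokenizer, and the omission of the synonym and asciifolding filters are all simplifying assumptions, not how Lucene is implemented.

```python
import re


def analyze(text, char_map):
    """Toy analysis chain: char filter -> tokenizer -> lowercase token filter."""
    # 1. Char filter: runs on the raw text, so it still sees the original case.
    filtered = "".join(char_map.get(ch, ch) for ch in text)
    # 2. Tokenizer: a plain word split stands in for the standard tokenizer.
    tokens = re.findall(r"\w+", filtered)
    # 3. Token filter: lowercasing happens last, after the mapping has run.
    return [token.lower() for token in tokens]


original = {"y": "ij"}               # mapping from the first post
workaround = {"y": "ij", "Y": "ij"}  # mapping with the uppercase rule added

# With the original mapping, the capital Y in the indexed name is never mapped,
# but a lowercase y in the query is, so the two terms differ.
print(analyze("Yerseke", original))  # ['yerseke']
print(analyze("yerseke", original))  # ['ijerseke']

# With the workaround, both spellings end up as the same term.
print(analyze("Yerseke", workaround))  # ['ijerseke']
print(analyze("yerseke", workaround))  # ['ijerseke']
```

The same model reproduces the Ysselsteyn case: the indexed term becomes "ysselsteijn", which the query "Ysselsteyn" matches and "YsselsteYn" does not.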

Have you looked into the ICU analyzer?
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-icu-plugin.html

I have no idea how well it works with Dutch.

Cheers,

Ivan



(Matthias Hogerheijde) #3

Thanks for your reply. I hadn't fully understood that char filters are
run first, which makes it logical to special-case the upper- and lowercase
letters. I was originally thrown off the scent because searching with an
uppercase 'Y' worked, so I thought the lowercase filter was not being applied
to the 'Y'. Now I see that searching for a 'y' causes the mapper to
search for 'ij' instead.

I don't understand the full extent of the ICU analyzers, but it seems to me
that our case is semantically different, since we regard 'Y' and
'IJ' as different letters? (Note that we actually regard 'ij' as a
single character.) It's not like removing the accents from 'ä', or
transcribing a Cyrillic number into its Roman equivalent, or am I wrong in
that regard?

Regards,
Matthias



(Ivan Brusic) #4

The plugin uses collation to identify characters which are equivalent. It
does far more than simple replacement/folding, so sometimes the sort order
matters.


http://userguide.icu-project.org/transforms/normalization
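
As a rough illustration of what Unicode normalization does and does not cover, here is a sketch using Python's standard unicodedata module as a stand-in for ICU. That substitution is an assumption; the plugin's filters may be configured differently and do more. The helper names are hypothetical.

```python
import unicodedata


def fold_compat(text):
    """Apply NFKC, which replaces compatibility characters such as ligatures."""
    return unicodedata.normalize("NFKC", text)


def strip_accents(text):
    """Decompose with NFD, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# The single-character Dutch ligatures U+0132 / U+0133 fold to two letters.
print(fold_compat("\u0132sselstein"))  # IJsselstein
print(fold_compat("\u0133"))           # ij

# Accent removal is a decomposition plus filtering.
print(strip_accents("\u00e4"))         # a

# A plain 'y' is already normalized: nothing here relates it to 'ij'.
print(fold_compat("y"))                # y
```

So normalization of this kind handles the ij ligature and accents, but treating y and ij as interchangeable would still need an explicit rule such as the mapping char filter.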

Take a look at the plugin's tests to figure out how it is used. I only work
with English/Mandarin, so I do not know how useful it is with Dutch.

https://github.com/elasticsearch/elasticsearch-analysis-icu/tree/master/src/test/java/org/elasticsearch/index/analysis

Cheers,

Ivan



(system) #5