Analyzer selection on multi-field


(Jeremy McLain) #1

I'm trying to index a single field three different ways depending on the
language of the text. I'm using a field to determine the analyzer for the
primary sub-field, via the "_analyzer" mapping. I set this analyzer in the
document based on the language of the document. The "secondary" sub-field
uses the "simple" analyzer. The third sub-field is called "bigram". I want
the system to use the custom "word_bigram" analyzer if the language of
the document uses whitespace between words; otherwise (e.g., Chinese) I want
it to use the custom "char_bigram" analyzer.

I can't figure out how to specify the analyzer for this third field when
the document is added. My only idea right now is to break the bigram
sub-field out of the multi-field into two separate fields. Only one of them
would be included in the document depending on the language. Depending on
the answer to my other question
(https://groups.google.com/forum/#!topic/elasticsearch/UHpw50LLndY),
I'm not crazy about this idea because it may require me to store the
contents of this field 3 to 4 times.

"body_word_bigram": {
  "type": "string",
  "store": true,
  "analyzer": "word_bigram",
  "boost": 2.0
},
"body_char_bigram": {
  "type": "string",
  "store": true,
  "analyzer": "char_bigram",
  "boost": 2.0
}
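
For illustration, the indexing-side logic for this two-field workaround could look roughly like the sketch below. It only builds the document body; it does not talk to Elasticsearch. The language codes, the language-to-analyzer table, and the build_document helper are hypothetical names invented for the example. Only the field names "body", "analyzer", "body_word_bigram" and "body_char_bigram" come from the mapping fragments in this post.

```python
from typing import Dict

# Assumption: languages written without whitespace between words.
# Extend this set to match the languages actually being indexed.
NON_WHITESPACE_LANGUAGES = {"zh", "ja", "th"}

# Assumption: which analyzer name to write into the "analyzer" field,
# which the "_analyzer" mapping reads to analyze the primary "body" field.
PRIMARY_ANALYZERS = {"en": "english", "zh": "cjk", "ja": "cjk"}


def build_document(body: str, language: str) -> Dict[str, str]:
    """Build the document to index, adding exactly one bigram field."""
    doc = {
        "body": body,
        # Fall back to "standard" for languages not listed above.
        "analyzer": PRIMARY_ANALYZERS.get(language, "standard"),
    }
    if language in NON_WHITESPACE_LANGUAGES:
        # No whitespace between words: use the character bigram field.
        doc["body_char_bigram"] = body
    else:
        # Whitespace-delimited language: use the word bigram field.
        doc["body_word_bigram"] = body
    return doc


if __name__ == "__main__":
    print(build_document("the quick brown fox", "en"))
    print(build_document("中文文本", "zh"))
```

The downside described above still applies: the text is sent twice per document, once as "body" and once in whichever bigram field matches the language.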

I've included the relevant portions of my schema below.

Settings:
{
  "text_document": {
    "analysis": {
      "analyzer": {
        "word_bigram": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "truncate_5", "word_bigram"]
        },
        "char_bigram": {
          "type": "custom",
          "tokenizer": "pattern",
          "filter": ["lowercase", "char_bigram"]
        }
      },
      "filter": {
        "truncate_5": {
          "type": "truncate",
          "length": 5
        },
        "word_bigram": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 2,
          "output_unigrams": false
        },
        "char_bigram": {
          "type": "nGram",
          "min_gram": 2,
          "max_gram": 2
        }
      }
    }
  }
}

Mapping:
{
  "text_document": {
    "_analyzer": {
      "path": "analyzer"
    },
    "properties": {
      "body": {
        "type": "string",
        "store": true,
        "fields": {
          "secondary": {
            "type": "string",
            "analyzer": "simple"
          },
          "bigram": {
            "type": "string",
            "analyzer": "?",
            "boost": 2.0
          }
        }
      },
      "analyzer": {
        "type": "string",
        "store": "true",
        "index": false
      }
    }
  }
}

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/825ae036-597a-484d-a94e-95a5daeaad51%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Jeremy McLain) #2

Ideas anyone?



(system) #3