I tried creating an index with a mapping, specifying a tokenizer and
analyzers.
XContentBuilder settings = jsonBuilder()
    .startObject()
        .startObject("analysis")
            .startObject("analyzer")
                .startObject("response_search_analyzer")
                    .field("tokenizer", "responseTokenizer")
                    .field("filter", "lowercase")
                .endObject()
                .startObject("response_index_analyzer")
                    .field("tokenizer", "responseTokenizer")
                    .field("filter", "lowercase", "nGram")
                .endObject()
            .endObject()
            .startObject("tokenizer")
                .startObject("responseTokenizer")
                    .field("type", "whitespace")
                .endObject()
            .endObject()
            .startObject("filter")
                .startObject("nGram")
                    .field("type", "nGram")
                    .field("min_ngram", 3)
                    .field("max_ngram", 6)
                .endObject()
            .endObject()
        .endObject()
    .endObject();
XContentBuilder mapping = jsonBuilder()
    .startObject()
        .startObject("question")
            .startObject("properties")
                .startObject("responseDescription")
                    .field("type", "string")
                    .field("search_analyzer", "response_search_analyzer")
                    .field("index_analyzer", "response_index_analyzer")
                .endObject()
            .endObject()
        .endObject()
    .endObject();
Then I create the index this way:
CreateIndexResponse response = client.admin().indices()
    .prepareCreate("faq-ze")
    .setSettings(createSettings())
    .addMapping("question", createMapping())
    .execute().actionGet();
The resulting settings are:
curl -XGET 'http://192.168.6.159:9202/faq-ze/_settings?pretty=1'
{
  "faq-ze" : {
    "settings" : {
      "index.analysis.analyzer.response_index_analyzer.filter.0" : "lowercase",
      "index.analysis.analyzer.response_index_analyzer.filter.1" : "nGram",
      "index.analysis.tokenizer.responseTokenizer.type" : "whitespace",
      "index.analysis.analyzer.response_index_analyzer.tokenizer" : "responseTokenizer",
      "index.analysis.analyzer.response_search_analyzer.filter" : "lowercase",
      "index.analysis.filter.nGram.min_ngram" : "3",
      "index.analysis.filter.nGram.type" : "nGram",
      "index.analysis.filter.nGram.max_ngram" : "6",
      "index.analysis.analyzer.response_search_analyzer.tokenizer" : "responseTokenizer",
      "index.number_of_shards" : "2",
      "index.number_of_replicas" : "1",
      "index.version.created" : "190499"
    }
  }
}
And the mapping:
curl -XGET 'http://192.168.6.159:9202/faq-ze/_mapping?pretty=1'
{
  "faq-ze" : {
    "category" : {
      "properties" : {
        "id" : {
          "type" : "long"
        },
        "name" : {
          "type" : "string"
        }
      }
    },
    "question" : {
      "properties" : {
        "categoryTitle" : {
          "type" : "string"
        },
        "id" : {
          "type" : "long"
        },
        "questionDisplay" : {
          "type" : "string"
        },
        "questionPopularity" : {
          "type" : "long"
        },
        "questionTitle" : {
          "type" : "string"
        },
        "responseDescription" : {
          "type" : "string",
          "index_analyzer" : "index_analyzer",
          "search_analyzer" : "search_analyzer"
        },
        "responseMedia" : {
          "type" : "string"
        },
        "responseMediaGlimpse" : {
          "type" : "string"
        },
        "responsePdf" : {
          "type" : "string"
        },
        "responsePlusLabel" : {
          "type" : "string"
        },
        "responsePlusUrl" : {
          "type" : "string"
        },
        "responseTitle" : {
          "type" : "string"
        }
      }
    }
  }
}
But an attempt to analyze some text returns nothing:
curl -XGET 'http://192.168.6.159:9202/faq-ze/_analyze?pretty=1&text="In the future, the HCCI diesel engine (Homogenous Charge Compression Ignition) and CAI gasoline engine "&analyzer=response_index_analyzer'
curl: (52) Empty reply from server
[1] 9614 exit 52 curl -XGET
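One possible cause of the empty reply (this is only an assumption, nothing above confirms it) is that the text parameter is sent with raw spaces, quotes and parentheses in the URL. A minimal sketch of percent-encoding the query parameters before calling _analyze follows. The class and method names (AnalyzeUrlSketch, analyzeUrl) are made up for illustration; only java.net.URLEncoder from the JDK is used.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class AnalyzeUrlSketch {

    // Builds an _analyze URL with the analyzer name and the text percent-encoded.
    // Hypothetical helper for illustration, not part of any Elasticsearch client.
    public static String analyzeUrl(String base, String index, String analyzer, String text) {
        try {
            return base + "/" + index + "/_analyze?pretty=1"
                    + "&analyzer=" + URLEncoder.encode(analyzer, "UTF-8")
                    + "&text=" + URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a JVM, so this should never happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String url = analyzeUrl(
                "http://192.168.6.159:9202",
                "faq-ze",
                "response_index_analyzer",
                "In the future, the HCCI diesel engine (Homogenous Charge Compression Ignition) and CAI gasoline engine");
        // The printed URL can be passed to curl inside single quotes.
        System.out.println(url);
    }
}
```

URLEncoder turns spaces into "+" and escapes commas and parentheses, so the whole request fits on one line with no characters the HTTP layer might choke on.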
And of course, searching returns nothing:
{
  "query" : {
    "field" : {
      "responseDescription" : "ren"
    }
  }
}
This gives:
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : []
  }
}
(given that the responseDescription field of my documents contains 'Renault'
in the middle of a text).
Any ideas?
On Friday, June 1, 2012 2:56:53 PM UTC+2, Frederic Esnault wrote:
Well, I think I get the point. I just need to get up to speed with
ngrams and with how to implement an analyzer.
I guess the best source to look at for this is Lucene, right?
On Friday, June 1, 2012 12:45:39 PM UTC+2, David G Ortega wrote:
"...are NGrams useful for this usage? And if yes, how ?..."
ngrams are the way I have gone for autocompletion, and probably the
way almost everyone here does it.
That's why I was asking how something that IMHO is a beast was
performing.
Wildcard queries over a large index are not the best idea unless you
don't mind the response time or have sharded across lots of machines.
To autocomplete with ngrams, just create a custom analyzer that does what
you want plus ngram. That makes the index bigger, but
you can then search by ngrams, which are essentially
parts of a word.
Does this make sense to you?
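
To make "parts of a word" concrete, here is a minimal plain-Java sketch of what an n-gram filter with sizes 3 to 6 would emit for one lowercased token. It does not use Lucene or Elasticsearch; NGramSketch and ngrams are hypothetical names, and the order of the grams is not meant to match what Lucene produces.

```java
import java.util.ArrayList;
import java.util.List;

public class NGramSketch {

    // Returns every substring of token whose length is between minGram and maxGram (inclusive).
    public static List<String> ngrams(String token, int minGram, int maxGram) {
        List<String> grams = new ArrayList<String>();
        for (int start = 0; start < token.length(); start++) {
            for (int len = minGram; len <= maxGram && start + len <= token.length(); len++) {
                grams.add(token.substring(start, start + len));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // With sizes 3..6 the index would hold "ren", "rena", "renau", ... for "renault".
        List<String> grams = ngrams("renault", 3, 6);
        System.out.println(grams);
        System.out.println(grams.contains("ren"));

        // With sizes 1..2 no three-letter gram exists, so a search for "ren" cannot match.
        System.out.println(ngrams("renault", 1, 2).contains("ren"));
    }
}
```

The second call matters for the settings earlier in the thread: the nGram token filter is documented with the keys min_gram and max_gram, so if min_ngram and max_ngram are not picked up, the filter would presumably fall back to its defaults and never index a gram as long as "ren". That is worth checking with _analyze.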