matchPhraseQuery cannot retrieve documents with a trailing “’s” even with a word delimiter token filter set when the index was created

Hi there,

I want to use ES to query for documents that mention certain people by name.

For example, when I run a matchPhraseQuery for Bill Gates, I only get
the documents that mention the exact name "Bill Gates".
However, many documents mention Bill Gates indirectly; for example, they may
mention "Bill Gates's company".

How should I construct a query to retrieve these documents? Thanks.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

You probably want a stemming analyzer (for example, snowball). In that
case, for example, Debbie's, Debby's, Debby, and Debbie all match each
other.

Of course, the documents will need to be reloaded to cause the proper
stemmed terms to be indexed.

Cheers!
Brian
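
To make the stemming suggestion above concrete, here is a toy model of why analyzing both the indexed text and the query phrase the same way lets "Bill Gates" match "Bill Gates's company". It is only a sketch: the normalize method is a deliberately naive stand-in (lowercase, drop a possessive ending, drop a trailing plural "s") and is not the snowball algorithm, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class PossessivePhraseDemo {

    // Naive stand-in for a stemmer (an assumption for illustration only):
    // lowercases, drops a trailing possessive ('s or '), then a trailing "s".
    static String normalize(String word) {
        String w = word.toLowerCase(Locale.ROOT).replace('\u2019', '\'');
        if (w.endsWith("'s")) {
            w = w.substring(0, w.length() - 2);
        } else if (w.endsWith("'")) {
            w = w.substring(0, w.length() - 1);
        }
        if (w.length() > 3 && w.endsWith("s")) {
            w = w.substring(0, w.length() - 1);
        }
        return w;
    }

    // Splits on whitespace and normalizes each word, keeping word order.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            if (!raw.isEmpty()) {
                tokens.add(normalize(raw));
            }
        }
        return tokens;
    }

    // A phrase matches when its tokens appear consecutively in the document.
    static boolean phraseMatches(String document, String phrase) {
        List<String> doc = analyze(document);
        List<String> query = analyze(phrase);
        if (query.isEmpty()) {
            return false;
        }
        for (int start = 0; start + query.size() <= doc.size(); start++) {
            if (doc.subList(start, start + query.size()).equals(query)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(analyze("gates gate's gates' gates's"));
        System.out.println(phraseMatches("Bill Gates's company", "Bill Gates"));
    }
}
```

The point of the sketch is that the match only works when the document and the query go through the same normalization; if the indexed tokens were produced by a different analyzer than the query tokens, the consecutive-token comparison would fail.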

On Monday, May 13, 2013 5:20:14 AM UTC-4, Jingang Wang wrote:

Hi there,

I want to use ES to query some documents mentioning some person names.

For example, when I use Bill Gates to conduct a matchPhraseQuery, I can
just get the documents exactly mention the name "Bill Gates".
While a lot of documents mention Bill Gates indirectly, say, they may
mention "Bill Gates's company".

How should I construct a query to retrieve these documents? Thanks.


Hi Jingang,

Here is a full example with the index settings and mappings and a curl
command to show how various forms of Gates may be indexed so that they
match each other without any additional work on the part of your query:

Analyzing the string using the "cn" field in the "test" index:

$ curl 'http://localhost:9200/test/_analyze?field=cn&pretty=true' -d "gates gate's gates' gates's" && echo
{
"tokens" : [ {
"token" : "gate",
"start_offset" : 0,
"end_offset" : 5,
"type" : "",
"position" : 1
}, {
"token" : "gate",
"start_offset" : 6,
"end_offset" : 12,
"type" : "",
"position" : 2
}, {
"token" : "gate",
"start_offset" : 13,
"end_offset" : 18,
"type" : "",
"position" : 3
}, {
"token" : "gate",
"start_offset" : 20,
"end_offset" : 27,
"type" : "",
"position" : 4
} ]
}

And here is a subset of the settings and mappings. Note that I needed to
construct my own filter and analyzers in order to specify details such as
the language to use.

{
  "settings" : {
    "index" : {
      "number_of_shards" : 1,
      "refresh_interval" : "2s",
      "number_of_replicas" : 0,
      "analysis" : {
        "filter" : {
          "english_snowball_filter" : {
            "type" : "snowball",
            "language" : "English"
          }
        },
        "analyzer" : {
          "english_stemming_analyzer" : {
            "type" : "custom",
            "tokenizer" : "standard",
            "filter" : [ "standard", "lowercase", "asciifolding", "english_snowball_filter" ]
          },
          "english_standard_analyzer" : {
            "type" : "custom",
            "tokenizer" : "standard",
            "filter" : [ "standard", "lowercase", "asciifolding" ]
          }
        }
      }
    }
  },
  "mappings" : {
    "person" : {
      "_all" : {
        "enabled" : false
      },
      "properties" : {
        "uid" : {
          "type" : "long"
        },
        "cn" : {
          "type" : "multi_field",
          "fields" : {
            "cn" : {
              "type" : "string",
              "analyzer" : "english_stemming_analyzer"
            },
            "raw" : {
              "type" : "string",
              "index" : "not_analyzed"
            }
          }
        },
        "location" : {
          "type" : "geo_point",
          "lat_lon" : true
        },
        "telno" : {
          "type" : "multi_field",
          "fields" : {
            "telno" : {
              "type" : "string",
              "analyzer" : "english_standard_analyzer"
            },
            "num" : {
              "type" : "long"
            }
          }
        },
        "date" : {
          "type" : "date",
          "format" : "dateOptionalTime"
        }
      }
    }
  }
}

Hope this helps!

Regards,
Brian


And here's another example of analyzing the same string as in my previous
message, but this time using the cn.raw field. It's an example of a multi-field
mapping, in which a field may be indexed (or not) in two or more ways while its
source value only needs to be stored once. Really awesome!!!

$ curl 'http://localhost:9200/sgen/_analyze?field=cn.raw&pretty=true' -d "gates gate's gates' gates's" && echo
{
"tokens" : [ {
"token" : "gates gate's gates' gates's",
"start_offset" : 0,
"end_offset" : 27,
"type" : "word",
"position" : 1
} ]
}


Dear Brian,

Thank you so much for the elaborate examples and explanation.
I have changed my settings to follow your example, but I still cannot match
the phrase in the documents.

Here are my index settings:

{
  "20120103": {
    "settings": {
      "index.analysis.filter.my_delimiter.generate_word_parts": "true",
      "index.analysis.filter.my_delimiter.stem_english_possessive": "true",
      "index.analysis.filter.my_delimiter.preserve_original": "true",
      "index.analysis.analyzer.my_analyzer.tokenizer": "standard",
      "index.analysis.filter.english_stemming_filter.language": "English",
      "index.analysis.filter.my_delimiter.split_on_numerics": "true",
      "index.analysis.filter.my_delimiter.catenate_all": "true",
      "index.number_of_shards": "10",
      "index.analysis.filter.my_delimiter.catenate_numbers": "true",
      "index.analysis.analyzer.my_analyzer.type": "custom",
      "index.analysis.filter.english_stemming_filter.type": "snowball",
      "index.analysis.filter.my_delimiter.type": "word_delimiter",
      "index.analysis.filter.my_delimiter.catenate_words": "true",
      "index.number_of_replicas": "0",
      "index.analysis.analyzer.my_analyzer.filter.2": "my_delimiter",
      "index.analysis.analyzer.my_analyzer.filter.1": "lowercase",
      "index.analysis.analyzer.my_analyzer.filter.0": "standard",
      "index.analysis.filter.my_delimiter.split_on_case_change": "true",
      "index.analysis.analyzer.my_analyzer.filter.5": "stop",
      "index.analysis.analyzer.my_analyzer.filter.4": "english_stemming_filter",
      "index.analysis.analyzer.my_analyzer.filter.3": "asciifolding",
      "index.version.created": "900001"
    }
  }
}

And my query is:

QueryBuilder qb = QueryBuilders
    .boolQuery()
    .must(matchPhraseQuery("body_cleansed", "Aharon Barak").analyzer("my_analyzer"));

There is a document that contains “Aharon Barak's policy”, but the query
does not retrieve it.
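
The flattened settings above list the filters of my_analyzer out of order (filter.2, filter.1, filter.0, filter.5, ...), which makes the actual chain hard to read. As a reading aid, the following sketch sorts the numbered keys to recover the order in which the filters are declared. It uses only the Java standard library; the class and method names are hypothetical, and it assumes the keys follow the index.analysis.analyzer.NAME.filter.N pattern shown in the settings.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class FilterChainOrder {

    // Collects the "...filter.N" entries of one analyzer and returns the
    // filter names ordered by their numeric suffix N.
    static List<String> filterChain(Map<String, String> settings, String analyzer) {
        String prefix = "index.analysis.analyzer." + analyzer + ".filter.";
        TreeMap<Integer, String> ordered = new TreeMap<>();
        for (Map.Entry<String, String> entry : settings.entrySet()) {
            String key = entry.getKey();
            if (key.startsWith(prefix)) {
                int position = Integer.parseInt(key.substring(prefix.length()));
                ordered.put(position, entry.getValue());
            }
        }
        return new ArrayList<>(ordered.values());
    }

    public static void main(String[] args) {
        // Same entries, in the same order, as in the settings dump.
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put("index.analysis.analyzer.my_analyzer.filter.2", "my_delimiter");
        settings.put("index.analysis.analyzer.my_analyzer.filter.1", "lowercase");
        settings.put("index.analysis.analyzer.my_analyzer.filter.0", "standard");
        settings.put("index.analysis.analyzer.my_analyzer.filter.5", "stop");
        settings.put("index.analysis.analyzer.my_analyzer.filter.4", "english_stemming_filter");
        settings.put("index.analysis.analyzer.my_analyzer.filter.3", "asciifolding");
        System.out.println(filterChain(settings, "my_analyzer"));
    }
}
```

Read this way, the declared chain is standard, lowercase, my_delimiter, asciifolding, english_stemming_filter, stop.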

--
Wang Jingang(王金刚)
Ph.D. Candidate at
Lab of High Volume Language Information Processing & Cloud Computing
School of Computer Science
Beijing Institute of Technology
Beijing 100081
P.R China


My mapping looks as follows:

XContentBuilder mapping = jsonBuilder()
    .startObject()
        .startObject("kba")
            .startObject("_all").field("enable", true).field("index_analyzer", "my_analyzer").field("search_analyzer", "my_analyzer").endObject()
            .startObject("properties")
                .startObject("_source").field("compress", "true").endObject()
                .startObject("stream_id").field("type", "string").field("index", "not_analyzed").endObject()
                .startObject("source").field("type", "string").field("index", "not_analyzed").endObject()
                .startObject("epoch_ticks").field("type", "double").field("index", "not_analyzed").endObject()
                .startObject("zulu_timestamp").field("type", "string").field("index", "not_analyzed").endObject()
                .startObject("title_cleansed").field("type", "string").field("index", "analyzed").field("index_analyzer", "my_analyzer").field("search_analyzer", "my_analyzer").endObject()
                .startObject("body_cleansed").field("type", "string").field("index", "analyzed").field("index_analyzer", "my_analyzer").field("search_analyzer", "my_analyzer").endObject()
            .endObject()
        .endObject()
    .endObject();
PutMappingRequest mappingRequest =
    Requests.putMappingRequest(indexName).type("kba").source(mapping);


Hi Jingang,

The most important debugging tool is to use the _analyze command I showed
you.

For example, if you analyze "Foo Bar" and you get two tokens, "Foo" and
"Bar", then you will immediately know that Foo and Bar can match that
string but neither foo nor bar can.

The _analyze function is a very useful tool that helped me get through my
query issues related to analyzers and mappings.

Also, it's an excellent idea to extract the _mapping for the specified
index and type. I found in a lot of my early efforts that I had what looked
to me like a valid mapping but, because of a few JSON mistakes, ElasticSearch
didn't recognize it. So it's not my eye that I trust to see if the mappings
I intend are actually there; I always ask ElasticSearch what it thinks I
gave it. Because in the end, ES's opinion is the only one that counts!

Regards,
Brian
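
As a rough sketch of the two checks described above, the following program asks a node what tokens an analyzer produces for a sample string and what mapping it actually holds for a type. It assumes a node on localhost:9200 and the REST endpoint shapes of the Elasticsearch version used in this thread (an index-level _analyze with an analyzer parameter, and a per-type _mapping); the index name 20120103, the type kba and the analyzer my_analyzer come from the posts above, while the class and method names are hypothetical. Index, type and analyzer names are assumed to be URL-safe.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class AnalyzerDebug {

    // URL that asks the index to analyze the request body with a named analyzer.
    static String analyzeUrl(String host, String index, String analyzer) {
        return host + "/" + index + "/_analyze?analyzer=" + analyzer + "&pretty=true";
    }

    // URL that returns the mapping the node actually holds for a type.
    static String mappingUrl(String host, String index, String type) {
        return host + "/" + index + "/" + type + "/_mapping?pretty=true";
    }

    // Sends a GET (body == null) or a POST with the given body; returns the response text.
    static String request(String url, String body) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        if (body != null) {
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            try (OutputStream out = conn.getOutputStream()) {
                out.write(body.getBytes(StandardCharsets.UTF_8));
            }
        }
        try (InputStream in = conn.getInputStream();
             Scanner scanner = new Scanner(in, "UTF-8").useDelimiter("\\A")) {
            return scanner.hasNext() ? scanner.next() : "";
        }
    }

    public static void main(String[] args) {
        String host = "http://localhost:9200"; // assumed local node
        try {
            // 1. Which tokens does the analyzer produce for the document text?
            System.out.println(request(analyzeUrl(host, "20120103", "my_analyzer"),
                    "Aharon Barak's policy"));
            // 2. Which mapping did the node really store for the type?
            System.out.println(request(mappingUrl(host, "20120103", "kba"), null));
        } catch (IOException e) {
            System.out.println("Could not reach Elasticsearch: " + e.getMessage());
        }
    }
}
```

If the mapping response does not show the intended analyzer on the field being queried, the put-mapping call did not take effect, whatever the client code appeared to send.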


Hi Brian,

Your suggestions were very useful to me.

I resolved the problem by using the _mapping function to check the
mapping information of the specific index and type.

The problem was caused by a mapping operation that had not succeeded.

Thanks again for your help.

On Tue, May 14, 2013 at 11:27 PM, InquiringMind brian.from.fl@gmail.comwrote:

Hi Jingang,

The most important debugging tool is to use the _analyze command I showed
you.

For example, if you analyze "Foo Bar" and you get two tokens, "Foo" and
"Bar", then you will immediately know that Foo and Bar can match that
string but neither foo nor bar can.

The _analyze function a very useful tool that helped me get through my
query issues related to analyzers and mappings.

Also, it's an excellent idea to extract the _mapping for the specified
index and type. I found in a lot of my early efforts that I had what looked
to me like a valid mapping, but with a few JSON mistakes ElasticSearch
didn't recognize it. So it's not my eye that I trust to see if the mappings
I intend are actually there; I always ask ElasticSearch what it thinks I
gave it. Because in the end, ES's opinion is the only one that counts!

Regards,
Brian

On Monday, May 13, 2013 10:53:31 PM UTC-4, Jingang Wang wrote:

Dear
​ Brian,

Thanks you so much for elaborate examples and explanation.
​I have changed my settings as your example, while I still can not match
the phrase in the documents.

Here is my settings of index:

{
"20120103": {
"settings": {
"index.analysis.filter.my_delimiter.generate_word_parts":
"true",
"index.analysis.filter.my_**delimiter.stem_english_**possessive":
"true",
"index.analysis.filter.my_**delimiter.preserve_original": "true",
"index.analysis.analyzer.my_**analyzer.tokenizer": "standard",
"index.analysis.filter.**english_stemming_filter.**language":
"English",
"index.analysis.filter.my_**delimiter.split_on_numerics": "true",
"index.analysis.filter.my_**delimiter.catenate_all": "true",
"index.number_of_shards": "10",
"index.analysis.filter.my_**delimiter.catenate_numbers": "true",
"index.analysis.analyzer.my_**analyzer.type": "custom",
"index.analysis.filter.**english_stemming_filter.type": "snowball",
"index.analysis.filter.my_**delimiter.type": "word_delimiter",
"index.analysis.filter.my_**delimiter.catenate_words": "true",
"index.number_of_replicas": "0",
"index.analysis.analyzer.my_**analyzer.filter.2": "my_delimiter",
"index.analysis.analyzer.my_**analyzer.filter.1": "lowercase",
"index.analysis.analyzer.my_**analyzer.filter.0": "standard",
"index.analysis.filter.my_**delimiter.split_on_case_**change":
"true",
"index.analysis.analyzer.my_**analyzer.filter.5": "stop",
"index.analysis.analyzer.my_**analyzer.filter.4":
"english_stemming_filter",
"index.analysis.analyzer.my_**analyzer.filter.3": "asciifolding",
"index.version.created": "900001"
}
}
}

And my query is:

QueryBuilder qb = QueryBuilders
.boolQuery()
.must(matchPhraseQuery("body_cleansed", "Aharon
Barak").analyzer("my_analyzer"
));

There is a documents which contains “Aharon Barak's policy”, but the
query could not retrieve it.

On Tue, May 14, 2013 at 6:59 AM, InquiringMind brian....@gmail.comwrote:

And here's another example of analyzing the same string as below, but
this time using the cn.raw field. It's an example of multi-field
mapping in which a field may be indexed (or not) in two or more ways but
its source values only need to be stored once. Really awesome!!!

$ curl 'http://localhost:9200/sgen/_analyze?field=cn.raw&pretty=true' -d "gates gate's gates' gates's" && echo
{
"tokens" : [ {
"token" : "gates gate's gates' gates's",
"start_offset" : 0,
"end_offset" : 27,
"type" : "word",
"position" : 1
} ]
}
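
To make the multi-field idea concrete, here is a minimal plain-Java sketch with no Elasticsearch dependency. The class MultiFieldDemo and the helper indexTerms are hypothetical names made up for illustration, and the "analysis" is only lowercasing plus whitespace splitting, not the real stemming analyzer. It only shows that one source value can yield one term list per sub-field.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Illustration only: mimics how one value feeds two differently indexed sub-fields.
final class MultiFieldDemo {

    // Hypothetical helper: one source value in, one term list per sub-field out.
    static Map<String, List<String>> indexTerms(String value) {
        Map<String, List<String>> terms = new LinkedHashMap<>();
        // "cn.raw" is not_analyzed: the whole value is kept as a single term.
        terms.put("cn.raw", Collections.singletonList(value));
        // "cn" is analyzed: here only lowercased and split on whitespace,
        // a simplified stand-in for the real stemming analyzer.
        terms.put("cn", Arrays.asList(value.toLowerCase(Locale.ROOT).trim().split("\\s+")));
        return terms;
    }

    public static void main(String[] args) {
        Map<String, List<String>> terms = indexTerms("gates gate's gates' gates's");
        System.out.println(terms.get("cn.raw").size()); // prints 1
        System.out.println(terms.get("cn").size());     // prints 4
    }
}
```

An exact-value lookup would go against the single raw term, while a full-text match would go against the analyzed terms.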

On Monday, May 13, 2013 6:41:29 PM UTC-4, InquiringMind wrote:

Hi Jingang,

Here is a full example with the index settings and mappings and a curl
command to show how various forms of Gates may be indexed so that they
match each other without any additional work on the part of your query:

Analyzing the string using the "cn" field in the "test" index:

$ curl 'http://localhost:9200/test/_analyze?field=cn&pretty=true' -d "gates gate's gates' gates's" && echo
{
"tokens" : [ {
"token" : "gate",
"start_offset" : 0,
"end_offset" : 5,
"type" : "",
"position" : 1
}, {
"token" : "gate",
"start_offset" : 6,
"end_offset" : 12,
"type" : "",
"position" : 2
}, {
"token" : "gate",
"start_offset" : 13,
"end_offset" : 18,
"type" : "",
"position" : 3
}, {
"token" : "gate",
"start_offset" : 20,
"end_offset" : 27,
"type" : "",
"position" : 4
} ]
}
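
As a rough illustration of why this helps a phrase query, here is a small self-contained Java sketch. It is a toy, not Elasticsearch's snowball stemmer: the class PossessivePhraseDemo and its methods analyze and containsPhrase are hypothetical, and the "stemming" only strips possessive endings and a plain plural "s". The point is that once the indexed text and the query are reduced to the same tokens, the phrase tokens line up consecutively.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Toy stand-in for an analyzer chain; NOT Elasticsearch's snowball stemmer.
final class PossessivePhraseDemo {

    // Hypothetical helper: lowercase, split on whitespace and, if asked,
    // strip possessive endings and a plain plural "s".
    static List<String> analyze(String text, boolean stem) {
        List<String> tokens = new ArrayList<>();
        String normalized = text.toLowerCase(Locale.ROOT).replace('\u2019', '\'');
        for (String raw : normalized.trim().split("\\s+")) {
            String t = raw.replaceAll("[^a-z0-9']", ""); // ASCII-only on purpose
            if (stem) {
                if (t.endsWith("'s")) {
                    t = t.substring(0, t.length() - 2);
                } else if (t.endsWith("'")) {
                    t = t.substring(0, t.length() - 1);
                }
                if (t.length() > 3 && t.endsWith("s") && !t.endsWith("ss")) {
                    t = t.substring(0, t.length() - 1);
                }
            }
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // A phrase matches when its tokens occur consecutively in the document.
    static boolean containsPhrase(String doc, String phrase, boolean stem) {
        return Collections.indexOfSubList(analyze(doc, stem), analyze(phrase, stem)) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(analyze("gates gate's gates' gates's", true));               // [gate, gate, gate, gate]
        System.out.println(containsPhrase("Bill Gates's company", "Bill Gates", true));  // true
        System.out.println(containsPhrase("Bill Gates's company", "Bill Gates", false)); // false
    }
}
```

The same reduction has to be applied to both the indexed text and the query text; stemming only one side leaves the tokens different.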

And here is a subset of the settings and mappings. Note that I needed
to fully construct my own filter and analyzers in order to specify
things such as the language to use.

{
  "settings" : {
    "index" : {
      "number_of_shards" : 1,
      "refresh_interval" : "2s",
      "number_of_replicas" : 0,
      "analysis" : {
        "filter" : {
          "english_snowball_filter" : {
            "type" : "snowball",
            "language" : "English"
          }
        },
        "analyzer" : {
          "english_stemming_analyzer" : {
            "type" : "custom",
            "tokenizer" : "standard",
            "filter" : [ "standard", "lowercase", "asciifolding", "english_snowball_filter" ]
          },
          "english_standard_analyzer" : {
            "type" : "custom",
            "tokenizer" : "standard",
            "filter" : [ "standard", "lowercase", "asciifolding" ]
          }
        }
      }
    }
  },
  "mappings" : {
    "person" : {
      "_all" : {
        "enabled" : false
      },
      "properties" : {
        "uid" : {
          "type" : "long"
        },
        "cn" : {
          "type" : "multi_field",
          "fields" : {
            "cn" : {
              "type" : "string",
              "analyzer" : "english_stemming_analyzer"
            },
            "raw" : {
              "type" : "string",
              "index" : "not_analyzed"
            }
          }
        },
        "location" : {
          "type" : "geo_point",
          "lat_lon" : true
        },
        "telno" : {
          "type" : "multi_field",
          "fields" : {
            "telno" : {
              "type" : "string",
              "analyzer" : "english_standard_analyzer"
            },
            "num" : {
              "type" : "long"
            }
          }
        },
        "date" : {
          "type" : "date",
          "format" : "dateOptionalTime"
        }
      }
    }
  }
}

Hope this helps!

Regards,
Brian

--
Wang Jingang(王金刚)
Ph.D. Candidate at
Lab of High Volume Language Information Processing & Cloud Computing
School of Computer Science
Beijing Institute of Technology
Beijing 100081
P.R China
