I am tired of continuously trying to override the default analyzer and tokenizer settings


(noor) #1

I know I must be doing something wrong, but frankly, because of the
inadequate documentation, I have had to give up on my idea of using
elasticsearch.
I am unable to find even a single consolidated document describing the
technicalities involved in getting text-based search to work the way
it should. You can find good books about Apache Solr, but I don't
want to use Apache Solr because I could not find anything relevant
about multi-tenant support in it.

I humbly ask whether anyone could point me to a good source, not only
for getting a good start with elasticsearch but also for understanding
basic problems like the one I am facing at the moment. Please don't
refer me to the elasticsearch guide and tutorial; I have already read
them.


(Benjamin Devèze) #2

Could you please be more specific about the problems, issues you are dealing
with so that we can help you properly?


(James Cook) #3

....and the award for the most patient, understanding and compassionate man
in the world goes to...Benjamin Devèze! You set the bar much higher than I
can reach. :slight_smile:


(David Pilato) #4

LOL! Benjamin does not like it when users leave the best search engine to go... nowhere... :wink:

David :wink:



(noor) #5

Here is some background:

I am using the Java API to interact with elasticsearch. I have had a
problem right from the start: whenever I pass the keyword "was" in my
query string, I get zero hits, even though the terms exist in the
document I am searching for. After I posted this issue in another
thread, somebody pointed out that I am using the default mapping,
analyzer and tokenizer.

I checked the guide and found that the "Standard Tokenizer" provides
grammar-based tokenization. I therefore decided to use the whitespace
tokenizer in order to get rid of the problem.
For this I am using the following code snippet, but it seems that
either I am unable to override the default behavior of elasticsearch
or I am using the wrong analyzer, because I still get zero search hits.
If you think the problem might be related to stopwords, then please
tell me how to override them.

public static void main(String[] args) throws Exception
{
    connect();

    try {
        // Mapping for the "tweet" type: "message" uses the whitespace analyzer.
        String mapping = "{\"tweet\" : {" +
            "\"properties\" : {" +
                "\"user\" : {\"type\" : \"string\", \"index\" : \"not_analyzed\"}," +
                "\"message\" : {\"type\" : \"string\", \"null_value\" : \"na\", \"index\" : \"analyzed\", \"analyzer\" : \"whitespace\"}," +
                "\"postDate\" : {\"type\" : \"date\"}" +
            "}}}";

        CreateIndexRequest indexRequest = new CreateIndexRequest("my_twitter1");
        indexRequest.mapping("tweet", mapping);

        client.admin().indices().create(indexRequest).actionGet();

        // Index one tweet. A new document only becomes searchable after the
        // next index refresh, so a search issued immediately may not see it.
        client.prepareIndex("my_twitter1", "tweet", "1")
            .setSource(XContentFactory.jsonBuilder()
                .startObject()
                    .field("user", "kimchy")
                    .field("postDate", new Date())
                    .field("message", "I was trying to use elastic search")
                .endObject()
            )
            .execute()
            .actionGet();

        // Query string search for "was", forcing the whitespace analyzer.
        SearchResponse response = client.prepareSearch("my_twitter1")
            .setSearchType(SearchType.DFS_QUERY_THEN_FETCH)
            .setTypes("tweet")
            .setQuery(QueryBuilders.queryString("was").analyzer("whitespace"))
            .setFrom(0).setSize(60).setExplain(false)
            .execute()
            .actionGet();

        // Print every field of every hit.
        Iterator<SearchHit> iterator = response.getHits().iterator();
        while (iterator.hasNext())
        {
            SearchHit searchHit = iterator.next();
            System.out.println("=============" + searchHit.getType() + "=============");
            Set<String> keys = searchHit.getSource().keySet();
            for (String key : keys)
            {
                System.out.print(searchHit.getSource().get(key) + " ");
            }
            System.out.println();
            System.out.println("------------------------------------");
        }
    } catch (Exception e)
    {
        e.printStackTrace();
    }
    finally
    {
        disconnect();
    }
}

Thanks,
Noor


(James Cook) #6

Thanks for the detailed question and the steps to reproduce it. I changed your Java example to use the REST style calls as I find it easier to use when the question isn't API-specific. I also empathize with your frustration over the lack of "structured" documentation. Shay produces an enormous amount of code for a single human being, and unlike many developers he also accompanies each feature with pretty good documentation. There are some gaps that need to be filled, and none as glaring as a step-by-step instruction that can only be truly represented in a good book format. I will certainly be amongst the first people to order that book when it hits the stores.

I also have the double whammy of not only having to learn the codebase from the source and website, but also of not having a background in Lucene. Any good book on ES will have to include both of these topics. If you reflect on what Shay has created, it is quite incredible IMHO: an auto-clustering, distributed Lucene environment that abstracts away all of Lucene's complexity while providing a dead-simple startup/embedding scheme with REST and Java APIs. I even replaced MongoDB with elasticsearch because I found its sharding and clustering capability (at the time) easier to use and its disk usage more efficient.

Anyway, we need a good book. Now, on to your problem.

Disclaimer: This is a long post and I don't really solve your problem. I wrote this out this way because I was trying to document a process that I thought may be helpful to others. I think Shay or someone knowledgeable will have to weigh in to give you an answer. My uninformed conclusion is that there is a bug because I do not see the whitespace filter applied to field queries. I put this up front because I don't want anyone reading this very long post expecting to see a solution.

Create an index with your mapping. The key here is the declaration of a whitespace analyzer.

curl -XPUT 'http://localhost:9311/my_twitter1/' -d '
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0
},
"mappings": {
"tweet" : {
"properties" : {
"user" : {"type" : "string", "index" : "not_analyzed"},
"message" : {"type" : "string", "null_value" : "na", "index" : "analyzed", "analyzer" : "whitespace"},
"postDate" : {"type" : "date"}
}
}
}
}'

{"ok":true,"acknowledged":true}

Index a tweet.

curl -XPUT 'http://localhost:9311/my_twitter1/tweet/1' -d '{
"user" : "kimchy",
"post_date" : "2011-09-20T16:20:00",
"message" : "I was trying to use elastic search"
}'

{"ok":true,"_index":"my_twitter1","_type":"tweet","_id":"1","_version":1}

I think the following query faithfully represents your Java API call, but does not force a specific analyzer as you did. The documentation in the mapping section (http://www.elasticsearch.org/guide/reference/mapping/core-types.html) states:

The analyzer [property is] used to analyze the text contents when analyzed during indexing and when searching using a query string. Defaults to the globally configured analyzer.

So, we shouldn't have to force the whitespace analyzer while searching, since it is specified in the mapping step.
curl -X GET 'http://localhost:9311/my_twitter1/tweet/_search?search_type=dfs_query_and_fetch' -d '{
"query": {
"query_string": {
"default_field": "_all",
"query": "was"
}
},
"from": 0,
"size": 60,
"explain": false
}'

{
"took":34,
"timed_out":false,
"_shards":{
"total":5,"successful":5,"failed":0
},
"hits":{
"total":0,"max_score":null,"hits":[]
}
}

As we see, this query returns no hits as you are seeing in your Java code. So, now what? There are a few possibilities:

  1. Bug?
  2. Set some breakpoints in the source code to see what is going on under the hood.
  3. Do we really understand the vast number of query subtypes? I don't yet :slight_smile:
  4. Whitespace analyzer is not applied during the analysis of the query string "was". Is it removed because it is a stop word?
  5. The whitespace analyzer isn't configured correctly in the mapping file, and the stop words were removed when the tweet was indexed.
  6. Maybe I don't really understand the whitespace analyzer? Does it actually remove stop words?

Let's eliminate the most likely cases first, especially the ones we can easily check.

*Does the whitespace analyzer remove stop words?*

The best way to sanity check this is to use the analyze feature:
curl -X GET "http://localhost:9311/my_twitter1/_analyze?analyzer=whitespace" -d "I was trying to use elastic search"
{
"tokens" : [ {
"token" : "I", ...
}, {
"token" : "was", ...
}, {
"token" : "trying", ...
}, {
"token" : "to", ...
}, {
"token" : "use", ...
}, {
"token" : "elastic", ...
}, {
"token" : "search", ...
} ]
}

So, our understanding of the whitespace tokenizer was correct: it just breaks on whitespace. Even the case of the terms remains unchanged, and of course our stop words are present.

In fact, a few more cases can give those of us without a Lucene background some insight into the different types of analyzers.

The quick brown fox is jumping over the lazy dog.

*whitespace*:

    [The] [quick] [brown] [fox] [is] [jumping] [over] [the] [lazy] [dog.]

*simple*:

    [the] [quick] [brown] [fox] [is] [jumping] [over] [the] [lazy] [dog]

*stop*:

    [quick] [brown] [fox] [jumping] [over] [lazy] [dog]

*standard*:

    [quick] [brown] [fox] [jumping] [over] [lazy] [dog]

*keyword*:

    [The quick brown fox is jumping over the lazy dog.]

*snowball*:

    [quick] [brown] [fox] [jump] [over] [lazi] [dog]

I'll let you be the judge of whether the period after "dog" in the whitespace example is a bug or not. The documentation states that punctuation is part of the term when it is not followed by whitespace. I'm sure people smarter than I am see the wisdom in this choice; I do not.
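
For readers without a Lucene background, the differences listed above can be approximated in a few lines of plain Java. This is only a sketch: the three methods are simplified stand-ins, not Lucene's analyzers, the class and method names are made up for illustration, and the stop word list is assumed to match Lucene's default English set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Simplified stand-ins for three analyzers; NOT the Lucene implementations.
class AnalyzerSketch {

    // Assumption: this mirrors Lucene's default English stop word list.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
        "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
        "in", "into", "is", "it", "no", "not", "of", "on", "or", "such",
        "that", "the", "their", "then", "there", "these", "they", "this",
        "to", "was", "will", "with"));

    // whitespace: split on runs of whitespace; case and punctuation are kept.
    static List<String> whitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // simple: split on anything that is not a letter, then lowercase.
    static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}]+")) {
            if (!t.isEmpty()) tokens.add(t.toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    // stop: the simple analyzer followed by stop word removal.
    static List<String> stop(String text) {
        List<String> tokens = new ArrayList<>(simple(text));
        tokens.removeAll(STOP_WORDS);
        return tokens;
    }

    public static void main(String[] args) {
        String tweet = "I was trying to use elastic search";
        System.out.println("whitespace: " + whitespace(tweet));
        System.out.println("simple:     " + simple(tweet));
        System.out.println("stop:       " + stop(tweet));
    }
}
```

Run on the tweet, the stop-word-removing version drops both "was" and "to", while the whitespace version keeps every word exactly as typed.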

*Is the whitespace analyzer configured correctly in the mapping file?*

Well, it was accepted by ES when we created it. Shay validates all of the JSON passed in, so if there was an error in the mapping file our index creation should have been rejected. I see a lot of postings in this group about properties being put at the wrong level in the JSON structure, so we can dump the mapping ES thinks it is using and compare it carefully against the documentation.
curl -X GET "http://localhost:9311/my_twitter1/tweet/_mapping?pretty=true"
{
"tweet" : {
"_id" : {
"index" : "not_analyzed"
},
"properties" : {
"message" : {
"null_value" : "na",
"analyzer" : "whitespace",
"type" : "string"
},
"_analyzer" : {
"type" : "string"
},
"postDate" : {
"format" : "dateOptionalTime",
"type" : "date"
},
"user" : {
"index" : "not_analyzed",
"type" : "string"
},
"post_date" : {
"format" : "dateOptionalTime",
"type" : "date"
}
}
}
}

That looks correct, and the analyzer seems to be in the right location according to the documentation.

*Do we really understand the vast number of query subtypes?*

If we change our original search to include a non-stop word, what happens? (I'm going to simplify the query_string syntax for this example.)
curl -X GET "http://localhost:9311/my_twitter1/tweet/_search?q=message:search&pretty=true"

{
"took" : 17,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 0.11506981,
"hits" : [ {
"_index" : "my_twitter1",
"_type" : "tweet",
"_id" : "1",
"_score" : 0.11506981, "_source" : {
"user" : "kimchy",
"post_date" : "2011-09-20T16:20:00",
"message" : "I was trying to use elastic search"
}
} ]
}
}

That's interesting. Searching on "search" returned a hit, while searching on "was" did not. If we are pretty sure the whitespace analyzer is getting applied correctly, then maybe the problem is the "way" we are searching? On the Query DSL page http://www.elasticsearch.org/guide/reference/query-dsl/ there are more than 15 different types of search functions, and I am not going to purport to be an expert on any of them. (That's why we have Clinton in the Google Group.)

But a few of these stand out to me as some kind of textual search and we have to do our fair share of educating ourselves to their differences. Let's look at "query_string", "text", and "term". Please take the time to go and read about each of them. It won't take long.

Well, that should be perfectly clear, right? :wink: If you have a good understanding of Lucene, perhaps you are all straight now. If you don't, then you are probably confused. There was a decent description of the difference between query_string and text at the bottom of the text page, but the difference between term queries and field queries (query_string) is still a bit fuzzy to me. I do recall that a term query does no analysis on the searched text, while a field query performs the analysis step and builds a set of terms out of the resulting tokens.

*So, the next step is to debug the code to see if it might be a bug.*

I'm not going to go into any details on how to debug ES, but since it is a Java program it is relatively easy to set up remote debugging, especially if you use one of the many quality IDEs.

My initial hunch at this point is that the whitespace analyzer is not being used to analyze the search query. So, I put a breakpoint on the method in the WhitespaceTokenizer class that checks each character to determine whether it is whitespace. The codebase is so abstracted that, in the time I had, it wasn't clear to me where the whitespace analyzer should have been selected during the parsing of the search query.

What I saw was the whitespace tokenizer was indeed hot when a tweet is indexed. This matches my earlier assumption that the analyzer was tokenizing properly during index. So, what about when I performed a field query? No hits on the breakpoint! So, at this point I think we have boiled it down to one of two different possibilities:

  1. It is a bug, or
  2. I don't have any idea how analyzers and queries work.

Either one is highly likely. :slight_smile:

I'm sorry that you had to read this far to realize that I didn't have a nugget of clarity to help you. (I will put a warning at the top of the post.) Based on what I have read, I would expect your query to work, since the whitespace analyzer should be applied to the search term "was" and the term should remain, since stop words do not apply. I have also tried to force the whitespace analyzer as you did in your Java API, but I did not receive any hits either.

For what it is worth, a query for the word "was" will succeed if you use a term query. I believe this also proves that the analyzer is properly applied to the indexed document.
curl -X GET "http://localhost:9311/my_twitter1/tweet/_search" -d '{
"query": {
"term": {
"message": "was"
}
}}'

{
"took" : 54,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 0.11506981,
"hits" : [ {
"_index" : "my_twitter1",
"_type" : "tweet",
"_id" : "1",
"_score" : 0.11506981,
"_source" : {
"user" : "kimchy",
"post_date" : "2011-09-20T16:20:00",
"message" : "I was trying to use elastic search"
}
} ]
}
}


(ppearcy) #7

James, great reply, I think I can solve the final piece of the puzzle.
I like to use facets to debug how things are really behaving under the
hood. Using facets to inspect the _all field, we see that it does not
contain the "was" term:

curl -XGET 'http://localhost:9200/my_twitter1/_search?pretty=true' -d '
{"query": {"match_all": {}}, "facets": {"tag": {"terms": {"field": "_all", "size": 10000}}}, "size": 0}
'

{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 1.0,
"hits" : [ ]
},
"facets" : {
"tag" : {
"_type" : "terms",
"missing" : 0,
"total" : 11,
"other" : 0,
"terms" : [ {
"term" : "use",
"count" : 1
}, {
"term" : "trying",
"count" : 1
}, {
"term" : "search",
"count" : 1
}, {
"term" : "kimchy",
"count" : 1
}, {
"term" : "i",
"count" : 1
}, {
"term" : "elastic",
"count" : 1
}, {
"term" : "20t16",
"count" : 1
}, {
"term" : "2011",
"count" : 1
}, {
"term" : "20",
"count" : 1
}, {
"term" : "09",
"count" : 1
}, {
"term" : "00",
"count" : 1
} ]
}
}
}

So, the stop words are getting applied to the _all field. This is
occurring because no mappings are set for the _all field, so the
default is used. If we delete and recreate the index with mappings for
the _all field, things are better, sort of...

curl -XDELETE 'http://localhost:9200/my_twitter1/'

curl -XPUT 'http://localhost:9200/my_twitter1/' -d '
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0
},
"mappings": {
"tweet" : {
"properties" : {
"user" : {"type" : "string", "index" : "not_analyzed"},
"message" : {"type" : "string", "null_value" : "na", "index" : "analyzed", "analyzer" : "whitespace"},
"_all" : {"type" : "string", "null_value" : "na", "index" : "analyzed", "analyzer" : "whitespace"},
"postDate" : {"type" : "date"}
}
}
}
}'

curl -XPUT 'http://localhost:9200/my_twitter1/tweet/1' -d '{
"user" : "kimchy",
"post_date" : "2011-09-20T16:20:00",
"message" : "I was trying to use elastic search"
}'

Now, facet on the _all field will show us that "was" is indexed:

curl -XGET 'http://localhost:9200/my_twitter1/_search?pretty=true' -d '
{"query": {"match_all": {}}, "facets": {"tag": {"terms": {"field": "_all", "size": 10000}}}, "size": 0}
'

{
"took" : 18,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 1.0,
"hits" : [ ]
},
"facets" : {
"tag" : {
"_type" : "terms",
"missing" : 0,
"total" : 9,
"other" : 0,
"terms" : [ {
"term" : "was",
"count" : 1
}, {
"term" : "use",
"count" : 1
}, {
"term" : "trying",
"count" : 1
}, {
"term" : "to",
"count" : 1
}, {
"term" : "search",
"count" : 1
}, {
"term" : "kimchy",
"count" : 1
}, {
"term" : "elastic",
"count" : 1
}, {
"term" : "I",
"count" : 1
}, {
"term" : "2011-09-20T16:20:00",
"count" : 1
} ]
}
}
}

So, the original query should now work, right? Nope! It appears that
the wrong analyzer is getting applied to the _all field, so to address
that, it must be set in the search:

curl -X GET 'http://localhost:9200/my_twitter1/tweet/_search?search_type=dfs_query_and_fetch' -d '{
"query": {
"query_string": {
"default_field": "_all",
"query": "was",
"analyzer": "whitespace"
}
},
"from": 0,
"size": 60,
"explain": false
}'

I believe the last part is a bug, but the rest is working as
intended.
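
To make the behaviour in this post concrete, here is a toy model in plain Java. It is a sketch under stated assumptions, not how elasticsearch works internally: the "index" is just a map from field name to a set of terms, the class and method names are hypothetical, and `defaultWithStopWords` is a rough stand-in for the default analyzer that drops English stop words such as "was".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Toy model: each field has its own set of indexed terms.
class FieldAnalysisSketch {

    // Assumption: a subset-free copy of Lucene's default English stop words.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
        "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
        "in", "into", "is", "it", "no", "not", "of", "on", "or", "such",
        "that", "the", "their", "then", "there", "these", "they", "this",
        "to", "was", "will", "with"));

    // Splits on whitespace only; keeps case, punctuation and stop words.
    static List<String> whitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Stand-in for the default analyzer: letters only, lowercased, stop words removed.
    static List<String> defaultWithStopWords(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}]+")) {
            String lower = t.toLowerCase(Locale.ROOT);
            if (!lower.isEmpty() && !STOP_WORDS.contains(lower)) tokens.add(lower);
        }
        return tokens;
    }

    // field name -> terms stored for that field
    final Map<String, Set<String>> index = new HashMap<>();

    void indexField(String field, String text, Function<String, List<String>> analyzer) {
        index.computeIfAbsent(field, f -> new HashSet<>()).addAll(analyzer.apply(text));
    }

    // Term query: exact lookup, the query text is not analyzed.
    boolean termQuery(String field, String term) {
        return index.getOrDefault(field, Collections.emptySet()).contains(term);
    }

    // Analyzed query: analyze the query text, then match if any token is indexed.
    boolean analyzedQuery(String field, String queryText, Function<String, List<String>> analyzer) {
        for (String token : analyzer.apply(queryText)) {
            if (termQuery(field, token)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        String tweet = "I was trying to use elastic search";
        FieldAnalysisSketch es = new FieldAnalysisSketch();
        es.indexField("message", tweet, FieldAnalysisSketch::whitespace);
        es.indexField("_all", tweet, FieldAnalysisSketch::defaultWithStopWords);

        System.out.println("term message:was       -> " + es.termQuery("message", "was"));
        System.out.println("_all:was (whitespace)  -> "
            + es.analyzedQuery("_all", "was", FieldAnalysisSketch::whitespace));
    }
}
```

In this model the term query on "message" matches, but no query against "_all" can find "was" until "_all" itself is indexed with an analyzer that keeps it, and even then the query text has to be analyzed with that same analyzer.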

Best Regards,
Paul



(noor) #8

Thanks a lot, James Cook, for giving my issue your precious time, for
your effort in diagnosing it and for presenting it so well. Anyway,
there should be some way to search for the term "was" in a query
string, or maybe it is meaningless to search for "was" in your
documents.

Thanks again,
Noor


(Benjamin Devèze) #9

It is with a lot of emotion that I accept the award for the most patient,
understanding and compassionate man in the world. I would like to thank my
family, friends and all the people who were supportive during all this
year... But wait, now that I have seen this nice reply, I think that I must
give it back to someone who deserves it more, and it should go to you, James.
:slight_smile:

More seriously, that was a nice answer.


(James Cook) #10

Thanks, I am just learning along with everyone else. :slight_smile:

And thank you Paul for that technique of using facets to inspect the terms
generated by indexing a document. That will be very valuable.


(ppearcy) #11

Glad to help :slight_smile:

FYI, I raised an issue for what I think is a bug:
https://github.com/elasticsearch/elasticsearch/issues/1353

Best Regards,
Paul



(noor) #12

Thanks again all. I learned a lot about ES after reading above posts.

@Benjamin: James really deserves the reward :-).

Best Regards,

Noor


(Shay Banon) #13

Hi,

Moments like this make me proud of our community! Thanks for all the
efforts. Regarding the docs, they still need much more work. The "basics"
are there, which means the hard core docs, parameters, settings and so on
are there. The "concepts", or vertical docs (how analysis works, how to
configure it, and how it ties to search, for example) are still missing...
No need for a book (yet) to fill those.

I commented on the issue opened, the problem in the mapping provided
using the index creation is that the _all field mapping is not placed at the
correct place.

Regarding using the whitespace analyzer, you can use it. One can also use
the standard analyzer (the default one), and just have it not use any
stopwords. Here is a sample: https://gist.github.com/1234581.
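
As a hedged illustration of the placement remark above (this is not the content of the linked gist, which is not reproduced in this thread), the Java sketch below builds a mapping string in which _all sits at the type level, as a sibling of "properties" rather than inside it. That this is the "correct place" is an assumption on my part; the class and method names are hypothetical.

```java
// Sketch only: builds a "tweet" mapping with _all next to "properties".
class AllFieldMappingSketch {

    // allAnalyzer is the analyzer name to request for the _all field.
    static String tweetMapping(String allAnalyzer) {
        return "{\"tweet\" : {"
            // Assumption: _all is configured here, at the type level.
            + "\"_all\" : {\"analyzer\" : \"" + allAnalyzer + "\"},"
            + "\"properties\" : {"
            + "\"user\" : {\"type\" : \"string\", \"index\" : \"not_analyzed\"},"
            + "\"message\" : {\"type\" : \"string\", \"analyzer\" : \"whitespace\"},"
            + "\"postDate\" : {\"type\" : \"date\"}"
            + "}}}";
    }

    public static void main(String[] args) {
        // The resulting string could be passed to indexRequest.mapping("tweet", ...)
        // in the Java example from the first post.
        System.out.println(tweetMapping("whitespace"));
    }
}
```

Dumping the mapping afterwards with the _mapping call shown earlier in the thread is a reasonable way to check what was actually accepted.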



(davrob) #14

Hi Shay,

I have to reiterate my personal thanks for everything you have done
with ElasticSearch and Compass before that.

But in some ways you are being far too modest when you say there is no
need for a book. There is so much "best practise" that has been
tacitly baked in to ElasticSearch, covering numerous areas:
clustering, search, NoSQL, low latency, as well as a lot of the
hardware/platform issues related to Amazon EC2, garbage collection,
optimization etc. The difficulty would be fitting it all in to one
book, and I guess the big issue is that you're so busy coding that
there is no time for a book.

Maybe one day we will get a book, then maybe certification, training,
expos, consultancy etc.

Also, many thanks to James for his massive contribution above (I wish I
could have got so far in my understanding at this stage), and of course
to Clinton and Lukáš, without whom we all would have been so much the
worse off in the last year.

-David.

On Sep 22, 12:34 pm, Shay Banon kim...@gmail.com wrote:

Hi,

Moments like this make me proud of our community! Thanks for all the
efforts. Regarding the docs, they still need much more work. The "basics"
are there, which means the hard core docs, parameters, settings and so on
are there. The "concepts", or vertical docs (how analysis works, how to
configure it, and how it ties to search, for example) are still missing.
No need for a book (yet) to fill those.



(Kevin Lawrence) #15

Sorry to resurrect an old thread, but this (James' & ppearcy's excellent
posts) gets to the heart of a problem I am having debugging this issue:

I have a custom analyzer that is not used when I think it should be and I
don't know enough to debug the problem.

I am interested to understand how you knew to investigate the analyzer used
for the _all field versus the specific field that was causing the problem. I
suspect that my problem will be amenable to similar debugging but I don't
know where to start. Any pointers to a page with debugging tips?

In my case the correct analyzer is used in filters and facets but it stops
working after I restart the server. I've tried eyeballing (and diffing) the
_mapping and _settings before and after the restart but everything is
identical. I can't reproduce my problem in a simpler repro case and I don't
know what to look at next. I feel like, if I had some debugging tips (like
the ones in this thread) to try, I could track down the problem more
quickly.

Any tips for me?

Thanks in advance,

Kevin

On Tuesday, September 20, 2011 10:25:14 PM UTC-7, ppearcy wrote:

James, great reply, I think I can solve the final piece of the puzzle.
I like to use facets to debug how things are really behaving under the
hood. Using facets to inspect the _all field, we see that it does not
contain the "was" term:

curl -XGET http://localhost:9200/my_twitter1/_search?pretty=true -d '
{"query": {"match_all": {}}, "facets": {"tag": {"terms": {"field":
"_all", "size": 10000}}}, "size": 0}
'
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 1.0,
"hits" : [ ]
},
"facets" : {
"tag" : {
"_type" : "terms",
"missing" : 0,
"total" : 11,
"other" : 0,
"terms" : [ {
"term" : "use",
"count" : 1
}, {
"term" : "trying",
"count" : 1
}, {
"term" : "search",
"count" : 1
}, {
"term" : "kimchy",
"count" : 1
}, {
"term" : "i",
"count" : 1
}, {
"term" : "elastic",
"count" : 1
}, {
"term" : "20t16",
"count" : 1
}, {
"term" : "2011",
"count" : 1
}, {
"term" : "20",
"count" : 1
}, {
"term" : "09",
"count" : 1
}, {
"term" : "00",
"count" : 1
} ]
}
}
}

So, stop-word removal is being applied to the _all field. This is
occurring because no mapping is set for the _all field, so the default
analyzer is used. If we delete and recreate the index with a mapping for
the _all field, things are better, sort of...

curl -XDELETE 'http://localhost:9200/my_twitter1/'

curl -XPUT 'http://localhost:9200/my_twitter1/' -d '
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0
},
"mappings": {
"tweet" : {
"properties" : {
"user" : {"type" : "string", "index" :
"not_analyzed"},
"message" : {"type" : "string", "null_value" : "na",
"index" : "analyzed", "analyzer" : "whitespace"},
"_all" : {"type" : "string", "null_value" : "na",
"index" : "analyzed", "analyzer" : "whitespace"},
"postDate" : {"type" : "date"}
}
}
}
}'

curl -XPUT 'http://localhost:9200/my_twitter1/tweet/1' -d '{
"user" : "kimchy",
"post_date" : "2011-09-20T16:20:00",
"message" : "I was trying to use elastic search"
}'

Now, facet on the _all field will show us that "was" is indexed:

curl -XGET http://localhost:9200/my_twitter1/_search?pretty=true -d '
{"query": {"match_all": {}}, "facets": {"tag": {"terms": {"field":
"_all", "size": 10000}}}, "size": 0}
'

{
"took" : 18,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 1.0,
"hits" : [ ]
},
"facets" : {
"tag" : {
"_type" : "terms",
"missing" : 0,
"total" : 9,
"other" : 0,
"terms" : [ {
"term" : "was",
"count" : 1
}, {
"term" : "use",
"count" : 1
}, {
"term" : "trying",
"count" : 1
}, {
"term" : "to",
"count" : 1
}, {
"term" : "search",
"count" : 1
}, {
"term" : "kimchy",
"count" : 1
}, {
"term" : "elastic",
"count" : 1
}, {
"term" : "I",
"count" : 1
}, {
"term" : "2011-09-20T16:20:00",
"count" : 1
} ]
}
}
}
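
The comparison between the two facet responses above can also be done programmatically. The sketch below is only an illustration: the helper names are hypothetical, the responses are trimmed to the part that matters, and the term lists are copied from the two outputs above. Facets are the API used in this thread; later Elasticsearch versions replaced them with aggregations.

```python
def facet_request(field, size=10000, facet_name="tag"):
    """Build the terms-facet request body used in the curl calls above."""
    return {
        "query": {"match_all": {}},
        "facets": {facet_name: {"terms": {"field": field, "size": size}}},
        "size": 0,
    }


def facet_terms(response, facet_name="tag"):
    """Return the set of indexed terms reported by a terms facet."""
    return {entry["term"] for entry in response["facets"][facet_name]["terms"]}


def _response(terms):
    # Mimics only the part of a search response that facet_terms reads.
    return {"facets": {"tag": {"terms": [{"term": t, "count": 1} for t in terms]}}}


# Terms from the first response (no mapping for _all).
before = _response(["use", "trying", "search", "kimchy", "i", "elastic",
                    "20t16", "2011", "20", "09", "00"])
# Terms from the second response (_all mapped to the whitespace analyzer).
after = _response(["was", "use", "trying", "to", "search", "kimchy",
                   "elastic", "I", "2011-09-20T16:20:00"])

expected = {"was", "to"}
print(sorted(expected - facet_terms(before)))  # terms missing before the change
print(sorted(expected - facet_terms(after)))   # terms missing after the change
```

The same check works for any field: facet on it, collect the terms, and look for the ones the query is expected to match.
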

So, the original query should now work, right? Nope! It appears that
the wrong analyzer is getting applied to the _all field, so to address
that, it must be set in the search:

curl -X GET 'http://localhost:9200/my_twitter1/tweet/_search?search_type=dfs_query_and_fetch' -d '{
"query": {
"query_string": {
"default_field": "_all",
"query": "was",
"analyzer": "whitespace"
}
},
"from": 0,
"size": 60,
"explain": false
}'

I believe the last part is a bug, but the rest is working as
intended.

Best Regards,
Paul

On Sep 20, 9:20 pm, James Cook jc...@tracermedia.com wrote:

Thanks for the detailed question and the steps to reproduce it. I
changed your Java example to use the REST style calls as I find it easier
to use when the question isn't API-specific. I also empathize with your
frustration over the lack of "structured" documentation. Shay produces an
enormous amount of code for a single human being, and unlike many
developers he also accompanies each feature with pretty good documentation.
There are some gaps that need to be filled, and none as glaring as a
step-by-step instruction that can only be truly represented in a good book
format. I will certainly be amongst the first people to order that book
when it hits the stores.

I also have the double whammy of not only having to learn the codebase
from the source and website, but also not having a background in Lucene.
Any good book on ES will have to include both of these topics. If you
reflect on what Shay has created, it is quite incredible IMHO: an
auto-clustering, distributed Lucene environment that abstracts away all of
Lucene's complexity while providing a dead-simple startup/embedding scheme
with REST and Java APIs. I even replaced MongoDB with elasticsearch because
I found its sharding and clustering capability (at the time) easier to use
and its disk usage more efficient.

Anyway, we need a good book. Now, on to your problem.

Disclaimer: This is a long post and I don't really solve your problem.
I wrote this out this way because I was trying to document a process that I
thought may be helpful to others. I think Shay or someone knowledgeable
will have to weigh in to give you an answer. My uninformed conclusion is
that there is a bug because I do not see the whitespace filter applied to
field queries. I put this up front because I don't want anyone reading this
very long post expecting to see a solution.

Create an index with your mapping. The key here is the declaration of a
whitespace analyzer.

curl -XPUT 'http://localhost:9311/my_twitter1/' -d '
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0
},
"mappings": {
"tweet" : {
"properties" : {
"user" : {"type" : "string", "index" : "not_analyzed"},
"message" : {"type" : "string", "null_value" : "na",
"index" : "analyzed", "analyzer" : "whitespace"},
"postDate" : {"type" : "date"}
}
}
}}'

{"ok":true,"acknowledged":true}

Index a tweet.

curl -XPUT 'http://localhost:9311/my_twitter1/tweet/1' -d '{
"user" : "kimchy",
"post_date" : "2011-09-20T16:20:00",
"message" : "I was trying to use elastic search"}'

{"ok":true,"_index":"my_twitter1","_type":"tweet","_id":"1","_version":1}

I think the following query faithfully represents your Java API call, but
does not force a specific analyzer as you were doing. The documentation in
the mapping section
(http://www.elasticsearch.org/guide/reference/mapping/core-types.html)
states:

The analyzer [property is] used to analyze the text contents when
analyzed during indexing and when searching using a query string. Defaults
to the globally configured analyzer.

So, we shouldn't have to force the whitespace analyzer while searching,
since it is specified in the mapping step.

curl -X GET http://localhost:9311/my_twitter1/tweet/_search?search_type=dfs_query... -d '{
"query": {
"query_string": {
"default_field": "_all",
"query": "was"
}
},
"from": 0,
"size": 60,
"explain": false}'

{
"took":34,
"timed_out":false,
"_shards":{
    "total":5,"successful":5,"failed":0
},
"hits":{
    "total":0,"max_score":null,"hits":[]
}}

As we see, this query returns no hits as you are seeing in your Java
code. So, now what? There are a few possibilities:

  1. Bug?
  2. Set some breakpoints in the source code to see what is going on
    under the hood.
  3. Do we really understand the vast number of query subtypes? I don't
    yet :slight_smile:
  4. Whitespace analyzer is not applied during the analysis of the
    query string "was". Is it removed because it is a stop word?
  5. The whitespace analyzer isn't configured correctly in the mapping
    file, and the stop words were removed when the tweet was indexed.
  6. Maybe I don't really understand the whitespace analyzer? Does it
    actually remove stop words?

Let's eliminate the most likely cases first, especially the ones we can
easily check.

*Does the whitespace analyzer remove stop words?*

The best way to sanity check this is to use the analyze feature:

curl -X GET "http://localhost:9311/my_twitter1/_analyze?analyzer=whitespace" -d "I was trying to use elastic search"

{
"tokens" : [ {
"token" : "I", ...
}, {
"token" : "was", ...
}, {
"token" : "trying", ...
}, {
"token" : "to", ...
}, {
"token" : "use", ...
}, {
"token" : "elastic", ...
}, {
"token" : "search", ...
} ]}

So, we were correct in our understanding of the whitespace tokenizer. It
just breaks on whitespace. Even the case of the terms remains unchanged,
and of course our stop words are present.

In fact, a few more cases can give those of us without a Lucene
background some insight into the different types of analyzers.

The quick brown fox is jumping over the lazy dog.

*whitespace*:

    [The] [quick] [brown] [fox] [is] [jumping] [over] [the] [lazy] [dog.]

*simple*:

    [the] [quick] [brown] [fox] [is] [jumping] [over] [the] [lazy] [dog]

*stop*:

    [quick] [brown] [fox] [jumping] [over] [lazy] [dog]

*standard*:

    [quick] [brown] [fox] [jumping] [over] [lazy] [dog]

*keyword*:

    [The quick brown fox is jumping over the lazy dog.]

*snowball*:

    [quick] [brown] [fox] [jump] [over] [lazi] [dog]

I'll let you be the judge whether the period after dog in the whitespace
example is a bug or not. The documentation states that punctuation is part
of the term when it is not followed by whitespace. I'm sure smarter people
than I see the wisdom in this choice as I do not.
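
For readers without a Lucene background, the first three rows of the comparison above can be approximated in a few lines. This is a rough sketch, not Lucene's implementation: the function names are hypothetical, and the stop-word set is assumed to be Lucene's classic default English list.

```python
import re

# Assumption: Lucene's classic default English stop-word set.
ENGLISH_STOP_WORDS = frozenset("""
a an and are as at be but by for if in into is it no not of on or such
that the their then there these they this to was will with
""".split())


def whitespace_analyze(text):
    """Split on whitespace only; case and punctuation are left untouched."""
    return text.split()


def simple_analyze(text):
    """Split on anything that is not a letter, then lowercase each token."""
    return [token.lower() for token in re.findall(r"[^\W\d_]+", text)]


def stop_analyze(text):
    """Like simple_analyze, but also drop stop words."""
    return [t for t in simple_analyze(text) if t not in ENGLISH_STOP_WORDS]


sentence = "The quick brown fox is jumping over the lazy dog."
print(whitespace_analyze(sentence))
print(simple_analyze(sentence))
print(stop_analyze(sentence))
print(stop_analyze("I was trying to use elastic search"))
```

The last call shows why "was" cannot be found once a stop-word-removing analyzer has processed the tweet: the term is dropped before it ever reaches the index.
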

*Is the whitespace analyzer configured correctly in the mapping file?*

Well, it was accepted by ES when we created it. Shay validates all of
the JSON passed in, so if there were an error in the mapping file our index
creation should have been rejected. I see a lot of postings in this group
about a property being put at the wrong level in the JSON structure, so we
can dump the mapping file ES thinks it is using and compare it carefully
against the documentation.

curl -X GET "http://localhost:9311/my_twitter1/tweet/_mapping?pretty=true"

{
"tweet" : {
"_id" : {
"index" : "not_analyzed"
},
"properties" : {
"message" : {
"null_value" : "na",
"analyzer" : "whitespace",
"type" : "string"
},
"_analyzer" : {
"type" : "string"
},
"postDate" : {
"format" : "dateOptionalTime",
"type" : "date"
},
"user" : {
"index" : "not_analyzed",
"type" : "string"
},
"post_date" : {
"format" : "dateOptionalTime",
"type" : "date"
}
}
}}

That looks correct, and the analyzer seems to be in the right location
according to the documentation.
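
Eyeballing a mapping is error-prone, so a small script can list which analyzer each analyzed field declares. The sketch below is an assumption-laden illustration: the helper name is hypothetical, the mapping is copied from the output above, and `_all` is looked up next to `properties`, which is where type-level settings live in this mapping format.

```python
def field_analyzers(type_mapping, default="<default analyzer>"):
    """Map each analyzed string field (plus _all) to its declared analyzer."""
    report = {}
    for name, spec in type_mapping.get("properties", {}).items():
        if spec.get("type") == "string" and spec.get("index") != "not_analyzed":
            report[name] = spec.get("analyzer", default)
    # _all is configured at the type level, next to "properties".
    report["_all"] = type_mapping.get("_all", {}).get("analyzer", default)
    return report


mapping = {
    "tweet": {
        "_id": {"index": "not_analyzed"},
        "properties": {
            "message": {"null_value": "na", "analyzer": "whitespace", "type": "string"},
            "_analyzer": {"type": "string"},
            "postDate": {"format": "dateOptionalTime", "type": "date"},
            "user": {"index": "not_analyzed", "type": "string"},
            "post_date": {"format": "dateOptionalTime", "type": "date"},
        },
    }
}

for field, analyzer in sorted(field_analyzers(mapping["tweet"]).items()):
    print(field, "->", analyzer)
```

Run against this mapping, the report shows `message` using whitespace while `_all` falls back to the default, which is the field the failing query actually searches.
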

*Do we really understand the vast number of query subtypes?*

If we change our original search to include a non-stop word, what
happens? (I'm going to simplify the query_string syntax for this example.)

curl -X GET "http://localhost:9311/my_twitter1/tweet/_search?q=message:search&pret..."

{
"took" : 17,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 0.11506981,
"hits" : [ {
"_index" : "my_twitter1",
"_type" : "tweet",
"_id" : "1",
"_score" : 0.11506981, "_source" : {
"user" : "kimchy",
"post_date" : "2011-09-20T16:20:00",
"message" : "I was trying to use elastic search"}

That's interesting. Searching on "search" returned a hit, while
searching on "was" did not. If we are pretty sure the whitespace analyzer
is getting applied correctly, then maybe the problem is the "way" we are
searching? On the Query DSL page
<http://www.elasticsearch.org/guide/reference/query-dsl/> there are more
than 15 different types of search functions, and I am not going to purport
to be an expert on any of them. (That's why we have Clinton in the Google
Group.)

But a few of these stand out to me as some kind of textual search and we
have to do our fair share of educating ourselves to their differences.
Let's look at "query_string", "text", and "term". Please take the time to
go and read about each of them. It won't take long.

Well, that should be perfectly clear, right? :wink: If you have a good
understanding of Lucene, perhaps you are all straight now. If you don't
then you are probably confused. There was a decent description of the
difference between query_string and text at the bottom of the text page,
but the difference between term queries and field queries (query_string)
is still a bit fuzzy to me. I do recall that a term query does no analysis
on the searched text, while a field query performs the analysis step and
builds a set of terms out of the resulting tokens.
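
The distinction just described can be shown with a toy model. This is a sketch under loud assumptions, not how Elasticsearch is implemented: the "index" is just a set of terms, the stop-word list is abbreviated, and every function name here is hypothetical.

```python
STOP_WORDS = frozenset(["a", "an", "the", "is", "to", "was"])  # abbreviated


def whitespace_analyze(text):
    """Split on whitespace only."""
    return text.split()


def stop_analyze(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]


def term_query(index_terms, term):
    """No analysis: the term must equal an indexed term exactly."""
    return term in index_terms


def analyzed_query(index_terms, text, analyzer):
    """Analyze the query text first, then match any resulting term."""
    return any(term in index_terms for term in analyzer(text))


# Index the tweet with the whitespace analyzer, as the mapping requests.
index_terms = set(whitespace_analyze("I was trying to use elastic search"))

print(term_query(index_terms, "was"))
print(analyzed_query(index_terms, "was", whitespace_analyze))
print(analyzed_query(index_terms, "was", stop_analyze))
```

In this model the query only fails when the query text goes through a stop-word-removing analyzer: "was" is discarded before the lookup, so nothing is left to match even though the term is in the index.
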

*So, the next step is to debug the code to see if it might be a bug.*

I'm not going to go into any details on how to debug ES, but since it is
a Java program it is relatively easy to set up remote debugging, especially
if you use one of the many quality IDEs.

My initial hunch at this point is that the whitespace analyzer is not being
used to analyze the search query. So, I put a breakpoint on the method in
the WhitespaceTokenizer class that checks each character to determine
whether it is whitespace. The codebase is so abstracted that it wasn't
clear to me, in the time I had, where the whitespace analyzer should have
been selected...



(Clinton Gormley) #16

Hi Kevin

On Tue, 2012-08-21 at 11:56 -0700, Kevin Lawrence wrote:

Sorry to resurrect an old thread, but this (James' & ppearcy's
excellent posts) gets to the heart of a problem I am having debugging
this issue:

I have a custom analyzer that is not used when I think it should be
and I don't know enough to debug the problem.

Most important is to look at the index settings and the mapping that
elasticsearch stores.

A server restart should not change these values.

The easiest way to get help is to gist a full curl recreation of the
problem, so that we can see what it is you are doing, and offer advice.

See http://www.elasticsearch.org/help for guidance on doing this

clint

I am interested to understand how you knew to investigate the analyzer
used for the_all field versus the specific field that was causing the
problem. I suspect that my problem will be amenable to similar
debugging but I don't know where to start. Any pointers to a page with
debugging tips?

In my case the correct analyzer is used in filters and facets but it
stops working after I restart the server. I've tried eyeballing (and
diffing) the _mapping and _settings before and after the restart but
everything is identical. I can't reproduce my problem in a simpler
repro case and I don't know what to look at next. I feel like, if I
had some debugging tips (like the ones in this thread) to try, I could
track down the problem more quickly.

Any tips for me?

Thanks in advance,

Kevin

On Tuesday, September 20, 2011 10:25:14 PM UTC-7, ppearcy wrote:
James, great reply, I think I can solve the final piece of the
puzzle.
I like to use facets to debug how things are really behaving
under the
hood. Using facets to inspect the _all field, we see that it
does not
contain the "was" term:

    curl -XGET
    http://localhost:9200/my_twitter1/_search?pretty=true -d ' 
    {"query": {"match_all": {}}, "facets": {"tag": {"terms":
    {"field": 
    "_all", "size": 10000}}}, "size": 0} 
    ' 
    { 
      "took" : 0, 
      "timed_out" : false, 
      "_shards" : { 
        "total" : 1, 
        "successful" : 1, 
        "failed" : 0 
      }, 
      "hits" : { 
        "total" : 1, 
        "max_score" : 1.0, 
        "hits" : [ ] 
      }, 
      "facets" : { 
        "tag" : { 
          "_type" : "terms", 
          "missing" : 0, 
          "total" : 11, 
          "other" : 0, 
          "terms" : [ { 
            "term" : "use", 
            "count" : 1 
          }, { 
            "term" : "trying", 
            "count" : 1 
          }, { 
            "term" : "search", 
            "count" : 1 
          }, { 
            "term" : "kimchy", 
            "count" : 1 
          }, { 
            "term" : "i", 
            "count" : 1 
          }, { 
            "term" : "elastic", 
            "count" : 1 
          }, { 
            "term" : "20t16", 
            "count" : 1 
          }, { 
            "term" : "2011", 
            "count" : 1 
          }, { 
            "term" : "20", 
            "count" : 1 
          }, { 
            "term" : "09", 
            "count" : 1 
          }, { 
            "term" : "00", 
            "count" : 1 
          } ] 
        } 
      } 
    } 
    
    
    So, the stop words are getting applied to the all field. This
    is 
    occurring because no mappings are set for the _all field, so
    the 
    default is used. If we delete and recreate the index with
    mappings for 
    the all field things are better, sort of... 
    
    curl -XDELETE 'http://localhost:9200/my_twitter1/' 
    
    curl -XPUT 'http://localhost:9200/my_twitter1/' -d ' 
    { 
        "settings": { 
            "number_of_shards": 1, 
            "number_of_replicas": 0 
        }, 
        "mappings": { 
            "tweet" : { 
                "properties" : { 
                    "user" : {"type" : "string", "index" : 
    "not_analyzed"}, 
                    "message" : {"type" : "string", "null_value" :
    "na", 
    "index" : "analyzed", "analyzer" : "whitespace"}, 
                    "_all" : {"type" : "string", "null_value" :
    "na", 
    "index" : "analyzed", "analyzer" : "whitespace"}, 
                    "postDate" : {"type" : "date"} 
                } 
            } 
        } 
    }' 
    
    
    curl -XPUT 'http://localhost:9200/my_twitter1/tweet/1' -d '{ 
        "user" : "kimchy", 
        "post_date" : "2011-09-20T16:20:00", 
        "message" : "I was trying to use elastic search" 
    }' 
    
    
    Now, facet on the _all field will show us that "was" is
    indexed: 
    
    curl -XGET
    http://localhost:9200/my_twitter1/_search?pretty=true -d ' 
    {"query": {"match_all": {}}, "facets": {"tag": {"terms":
    {"field": 
    "_all", "size": 10000}}}, "size": 0} 
    ' 
    
    { 
      "took" : 18, 
      "timed_out" : false, 
      "_shards" : { 
        "total" : 1, 
        "successful" : 1, 
        "failed" : 0 
      }, 
      "hits" : { 
        "total" : 1, 
        "max_score" : 1.0, 
        "hits" : [ ] 
      }, 
      "facets" : { 
        "tag" : { 
          "_type" : "terms", 
          "missing" : 0, 
          "total" : 9, 
          "other" : 0, 
          "terms" : [ { 
            "term" : "was", 
            "count" : 1 
          }, { 
            "term" : "use", 
            "count" : 1 
          }, { 
            "term" : "trying", 
            "count" : 1 
          }, { 
            "term" : "to", 
            "count" : 1 
          }, { 
            "term" : "search", 
            "count" : 1 
          }, { 
            "term" : "kimchy", 
            "count" : 1 
          }, { 
            "term" : "elastic", 
            "count" : 1 
          }, { 
            "term" : "I", 
            "count" : 1 
          }, { 
            "term" : "2011-09-20T16:20:00", 
            "count" : 1 
          } ] 
        } 
      } 
    } 
    
    So, the original query should now work, right? Nope! It
    appears that 
    the wrong analyzer is getting applied to the _all field, so to
    address 
    that, it must be set in the search: 
    
    curl -X GET
    http://localhost:9200/my_twitter1/tweet/_search?search_type=dfs_query_and_fetch 
    -d '{ 
        "query": { 
            "query_string": { 
                "default_field": "_all", 
                "query": "was", 
                "analyzer": "whitespace" 
            } 
        }, 
        "from": 0, 
        "size": 60, 
        "explain": false 
    }' 
    
    
    I believe the last part is a bug, but the rest is working as 
    intended. 
    
    Best Regards, 
    Paul 
    
    
    
    On Sep 20, 9:20 pm, James Cook <jc...@tracermedia.com> wrote: 
    > Thanks for the detailed question and the steps to reproduce
    it. I changed your Java example to use the REST style calls as
    I find it easier to use when the question isn't API-specific.
    I also empathize with your frustration over the lack of
    "structured" documentation. Shay produces an enormous amount
    of code for a single human being, and unlike many developers
    he also accompanies each feature with pretty good
    documentation. There are some gaps that need to be filled, and
    none as glaring as a step-by-step instruction that can only be
    truly represented in a good book format. I will certainly be
    amongst the first people to order that book when it hits the
    stores. 
    > 
    > I also have the double whammy of not only having to learn
    the codebase from the source and website, but I don't have a
    background in Lucene. Any good book on ES will have to include
    both of these topics. If you reflect on what Shay has created,
    it is quite incredibly IMHO. An auto-clustering, distributed
    Lucene environment that abstracts away all of Lucene's
    complexity while providing a dead-simple startup/embedding
    scheme with REST and Java APIs. I even replaced MongoDB with
    elasticsearch because I found its sharding and clustering
    capability (at the time) easier to use and its disk usage more
    efficient. 
    > 
    > Anyway, we need a good book. Now, on to your problem. 
    > 
    > *Disclaimer: This is a long post and I don't really solve
    your problem. I wrote this out this way because I was trying
    to document a process that I thought may be helpful to others.
    I think Shay or someone knowledgeable will have to weigh in to
    give you an answer. My uninformed conclusion is that there is
    a bug because I do not see the whitespace filter applied to
    field queries. I put this up front because I don't want anyone
    reading this very long post expecting to see  a solution.* 
    > 
    > Create an index with your mapping. The key here is the
    declaration of a whitespace analyzer. 
    > 
    > *curl -XPUT 'http://localhost:9311/my_twitter1/'-d ' 
    > { 
    >     "settings": { 
    >         "number_of_shards": 1, 
    >         "number_of_replicas": 0 
    >     }, 
    >     "mappings": { 
    >         "tweet" : { 
    >             "properties" : { 
    >                 "user" : {"type" : "string", "index" :
    "not_analyzed"}, 
    >                 "message" : {"type" : "string",
    "null_value" : "na", "index" : "analyzed", "analyzer" :
    "whitespace"}, 
    >                 "postDate" : {"type" : "date"} 
    >             } 
    >         } 
    >     }}'* 
    > 
    > *{"ok":true,"acknowledged":true}* 
    > 
    > Index a tweet. 
    > 
    > *curl -XPUT 'http://localhost:9311/my_twitter1/tweet/1'-d
    '{ 
    >     "user" : "kimchy", 
    >     "post_date" : "2011-09-20T16:20:00", 
    >     "message" : "I was trying to use elastic search"}'* 
    > 
    >
    *{"ok":true,"_index":"my_twitter1","_type":"tweet","_id":"1","_version":1}* 
    > 
    > I think the following query faithfully represents your Java
    API, but does not force a specific analyzer as you were using.
    The documentation n the mapping section
    (http://www.elasticsearch.org/guide/reference/mapping/core-types.html) states: 
    > 
    > The analyzer [property is] used to analyze the text contents
    when analyzed during indexing and when searching using a query
    string. Defaults to the globally configured analyzer. 
    > 
    > So, we shouldn't have to force the whitespace analyzer while
    searching, since it is specified in the mapping step. 
    > *curl -X
    GEThttp://localhost:9311/my_twitter1/tweet/_search?search_type=dfs_query...-d '{ 
    >     "query": { 
    >         "query_string": { 
    >             "default_field": "_all", 
    >             "query": "was" 
    >         } 
    >     }, 
    >     "from": 0, 
    >     "size": 60, 
    >     "explain": false}'**{ 
    > 
    >     "took":34, 
    >     "timed_out":false, 
    >     "_shards":{ 
    >         "total":5,"successful":5,"failed":0 
    >     }, 
    >     "hits":{ 
    >         "total":0,"max_score":null,"hits":[] 
    >     }}* 
    > 
    > As we see, this query returns no hits as you are seeing in
    your Java code. So, now what? There are a few possibilities: 
    > 
    >    1. Bug? 
    >    2. Set some breakpoints in the source code to see what is
    going on under the hood. 
    >    3. Do we really understand the vast number of query
    subtypes? I don't yet :) 
    >    4. Whitespace analyzer is not applied during the analysis
    of the query string "was". Is it removed because it is a stop
    word? 
    >    5. The whitespace analyzer isn't configured correctly in
    the mapping file, and the stop words were removed when the
    tweet was indexed. 
    >    6. Maybe I don't really understand the whitespace
    analyzer? Does it actually remove stop words? 
    > 
    > Let's eliminate the most likely cases first, especially the
    ones we can easily check. 
    > *Does the whitespace analyzer remove stop words? * 
    > 
    > * 
    > * 
    > 
    > The best way to sanity check this is to use the analyze
    feature: 
    > *curl -X GET
    "http://localhost:9311/my_twitter1/_analyze?analyzer=whitespace" -d "I was trying to use elastic search"* 
    > *{ 
    >   "tokens" : [ { 
    >     "token" : "I", ... 
    >   }, { 
    >     "token" : "was", ... 
    >   }, { 
    >     "token" : "trying", ... 
    >   }, { 
    >     "token" : "to", ... 
    >   }, { 
    >     "token" : "use", ... 
    >   }, { 
    >     "token" : "elastic", ... 
    >   }, { 
    >     "token" : "search", ... 
    >   } ]}* 
    > 
    > So, we were correct about our understanding of the
    whitespace tokenizer. Just breaks on whitespace. Even the case
    of the terms remains unchanged, and of course our stop words
    are present. 
    > 
    > In fact, a few more cases can give those of us without
    Lucene understanding, some insight into the different types of
    analyzers. 
    > 
    > *The quick brown fox is jumping over the lazy dog.* 
    > 
    >     *whitespace*: 
    > 
    >         [The] [quick] [brown] [fox] [is] [jumping] [over]
    [the] [lazy] [dog.] 
    > 
    >     *simple*: 
    > 
    >         [the] [quick] [brown] [fox] [is] [jumping] [over]
    [the] [lazy] [dog] 
    > 
    >     *stop*: 
    > 
    >         [quick] [brown] [fox] [jumping] [over] [lazy] [dog] 
    > 
    >     *standard*: 
    > 
    >         [quick] [brown] [fox] [jumping] [over] [lazy] [dog] 
    > 
    >     *keyword*: 
    > 
    >         [The quick brown fox is jumping over the lazy dog.] 
    > 
    >     *snowball*: 
    > 
    >         [quick] [brown] [fox] [jump] [over] [lazi] [dog] 
    > 
    > I'll let you be the judge whether the period after dog in
    the whitespace example is a bug or not. The documentation
    states that punctuation is part of the term when it is not
    followed by whitespace. I'm sure smarter people than I see the
    wisdom in this choice as I do not. 
    > *Is the whitespace analyzer configured correctly in the
    mapping file?* 
    > 
    > Well, it was accepted by ES when we created it. Shay
    validates all of the JSON passed in, so if there was an error
    in the mapping file our index creation should of been
    rejected. I see a lot of postings in this group related to
    sometimes putting a property at the wrong level in the JSON
    structure, so we can dump the mapping file ES _thinks_ it is
    using and compare it carefully against the documentation. 
    > *curl -X GET
    "http://localhost:9311/my_twitter1/tweet/_mapping?pretty=true"* 
    > *{ 
    >   "tweet" : { 
    >     "_id" : { 
    >       "index" : "not_analyzed" 
    >     }, 
    >     "properties" : { 
    >       "message" : { 
    >         "null_value" : "na", 
    >         "analyzer" : "whitespace", 
    >         "type" : "string" 
    >       }, 
    >       "_analyzer" : { 
    >         "type" : "string" 
    >       }, 
    >       "postDate" : { 
    >         "format" : "dateOptionalTime", 
    >         "type" : "date" 
    >       }, 
    >       "user" : { 
    >         "index" : "not_analyzed", 
    >         "type" : "string" 
    >       }, 
    >       "post_date" : { 
    >         "format" : "dateOptionalTime", 
    >         "type" : "date" 
    >       } 
    >     } 
    >   }}* 
    > 
    > That looks correct, and the analyzer seems to be in the
    right location according to the documentation. 
    > *Do we really understand the vast number of query subtypes?
    * 
    > 
    > If we change our original search to include a non-stop word,
    what happens? (I'm going to simplify the query_string syntax
    for this example.) 
    > *curl -X GET
    "http://localhost:9311/my_twitter1/tweet/_search?q=message:search&pret..."* 
    > 
    > *{ 
    >   "took" : 17, 
    >   "timed_out" : false, 
    >   "_shards" : { 
    >     "total" : 1, 
    >     "successful" : 1, 
    >     "failed" : 0 
    >   }, 
    >   "hits" : { 
    >     "total" : 1, 
    >     "max_score" : 0.11506981, 
    >     "hits" : [ { 
    >       "_index" : "my_twitter1", 
    >       "_type" : "tweet", 
    >       "_id" : "1", 
    >       "_score" : 0.11506981, "_source" : { 
    >     "user" : "kimchy", 
    >     "post_date" : "2011-09-20T16:20:00", 
    >     "message" : "I was trying to use elastic search"}* 
    > 
    > That's interesting. Searching on "search" returned a hit,
    while searching on "was" did not. If we are pretty sure the
    whitespace analyzer is getting applied correctly, then maybe
    the problem is the "way" we are searching? On the Query DSL
    page <http://www.elasticsearch.org/guide/reference/query-dsl/>
    there are more than 15 different types of search functions,
    and I am not going to purport to be an expert on any of them.
    (That's why we have Clinton in the Google Group.) 
    > 
    > But a few of these stand out to me as some kind of textual
    search and we have to do our fair share of educating ourselves
    to their differences. Let's look at "query_string", "text",
    and "term". Please take the time to go and read about each of
    them. It won't take long. 
    > 
    > Well, that should be perfectly clear, right? ;) If you have
    a good understanding of Lucene, perhaps you are all straight
    now. If you don't then you are probably confused. There was a
    decent description of the difference between query_string and
    text at the bottom of the text page, but the difference
    between term queries and field queries (query_string) is
    still a bit fuzzy to me. I do recall that a term query does no
    analysis on the searched text, while a field query performs
    the analysis step and builds a set of terms out of the
    resulting tokens. 
    > *So, the next step is to debug the code to see if it might
    be a bug.* 
    > 
    > I'm not going to go into any details on how to debug ES, but
    since it is a Java program it is relatively easy to set up
    remote debugging, especially if you use one of the many quality
    IDEs. 
    > 
    > My initial hunch at this point is the whitespace analyzer is
    not being used to analyze the search query. So, I put a
    breakpoint on the method in the WhitespaceTokenizer class that
    checks each character to determine whether it is whitespace.
    The codebase is so abstracted, it wasn't clear to me in the
    time I had to determine where the whitespace analyzer should
    have been selected...
    > 


(Kevin Lawrence) #17

Thanks Clint,

I now have a gist that reproduces my problem and I'll start a new thread to
post the details.

Kevin

On Thursday, August 23, 2012 5:31:21 AM UTC-7, Clinton Gormley wrote:

Hi Kevin

On Tue, 2012-08-21 at 11:56 -0700, Kevin Lawrence wrote:

Sorry to resurrect an old thread, but this (James' & ppearcy's
excellent posts) gets to the heart of a problem I am having debugging
this issue:

I have a custom analyzer that is not used when I think it should be
and I don't know enough to debug the problem.

Most important is to look at the index settings and the mapping that
elasticsearch stores.

A server restart should not change these values.

The easiest way to get help is to gist a full curl recreation of the
problem, so that we can see what it is you are doing, and offer advice.

See http://www.elasticsearch.org/help for guidance on doing this

clint

I am interested to understand how you knew to investigate the analyzer
used for the _all field versus the specific field that was causing the
problem. I suspect that my problem will be amenable to similar
debugging but I don't know where to start. Any pointers to a page with
debugging tips?

In my case the correct analyzer is used in filters and facets but it
stops working after I restart the server. I've tried eyeballing (and
diffing) the _mapping and _settings before and after the restart but
everything is identical. I can't reproduce my problem in a simpler
repro case and I don't know what to look at next. I feel like, if I
had some debugging tips (like the ones in this thread) to try, I could
track down the problem more quickly.
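
One way to make the before/after comparison repeatable is to script it. The following is only a minimal sketch, not an official tool: the base URL, the index name `my_index`, and the helper names `snapshot` and `compare_snapshots` are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical base URL and index name; adjust to the cluster being debugged.
ES_BASE='http://localhost:9200'
INDEX='my_index'

# Save the stored mapping and settings under a label such as "before" or "after".
# Only touches the network when it is called.
snapshot() {
    curl -s "$ES_BASE/$INDEX/_mapping?pretty=true"  > "$1.mapping.json"
    curl -s "$ES_BASE/$INDEX/_settings?pretty=true" > "$1.settings.json"
}

# Compare two labelled snapshots; prints the diff and returns non-zero
# if either the mapping or the settings changed.
compare_snapshots() {
    status=0
    for kind in mapping settings; do
        if diff "$1.$kind.json" "$2.$kind.json"; then
            echo "$kind: identical"
        else
            echo "$kind: DIFFERENT"
            status=1
        fi
    done
    return "$status"
}

# Usage (requires a running node):
#   snapshot before
#   ...restart the server...
#   snapshot after
#   compare_snapshots before after
```

If both files come back identical, as described above, the stored configuration is probably not what changed, and the next thing to check would be how the text is actually analyzed at index and search time.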

Any tips for me?

Thanks in advance,

Kevin

On Tuesday, September 20, 2011 10:25:14 PM UTC-7, ppearcy wrote:
James, great reply, I think I can solve the final piece of the puzzle.
I like to use facets to debug how things are really behaving under the
hood. Using facets to inspect the _all field, we see that it does not
contain the "was" term:

    curl -XGET 'http://localhost:9200/my_twitter1/_search?pretty=true' -d '
    {"query": {"match_all": {}}, "facets": {"tag": {"terms": {"field": "_all", "size": 10000}}}, "size": 0}
    '
    { 
      "took" : 0, 
      "timed_out" : false, 
      "_shards" : { 
        "total" : 1, 
        "successful" : 1, 
        "failed" : 0 
      }, 
      "hits" : { 
        "total" : 1, 
        "max_score" : 1.0, 
        "hits" : [ ] 
      }, 
      "facets" : { 
        "tag" : { 
          "_type" : "terms", 
          "missing" : 0, 
          "total" : 11, 
          "other" : 0, 
          "terms" : [ { 
            "term" : "use", 
            "count" : 1 
          }, { 
            "term" : "trying", 
            "count" : 1 
          }, { 
            "term" : "search", 
            "count" : 1 
          }, { 
            "term" : "kimchy", 
            "count" : 1 
          }, { 
            "term" : "i", 
            "count" : 1 
          }, { 
            "term" : "elastic", 
            "count" : 1 
          }, { 
            "term" : "20t16", 
            "count" : 1 
          }, { 
            "term" : "2011", 
            "count" : 1 
          }, { 
            "term" : "20", 
            "count" : 1 
          }, { 
            "term" : "09", 
            "count" : 1 
          }, { 
            "term" : "00", 
            "count" : 1 
          } ] 
        } 
      } 
    } 
    
    
    So, the stop words are getting applied to the _all field. This is
    occurring because no mappings are set for the _all field, so the
    default is used. If we delete and recreate the index with mappings for
    the _all field things are better, sort of...
    
    curl -XDELETE 'http://localhost:9200/my_twitter1/' 
    
    curl -XPUT 'http://localhost:9200/my_twitter1/' -d ' 
    { 
        "settings": { 
            "number_of_shards": 1, 
            "number_of_replicas": 0 
        },
        "mappings": {
            "tweet" : {
                "properties" : {
                    "user" : {"type" : "string", "index" : "not_analyzed"},
                    "message" : {"type" : "string", "null_value" : "na", "index" : "analyzed", "analyzer" : "whitespace"},
                    "_all" : {"type" : "string", "null_value" : "na", "index" : "analyzed", "analyzer" : "whitespace"},
                    "postDate" : {"type" : "date"}
                } 
            } 
        } 
    }' 
    
    
    curl -XPUT 'http://localhost:9200/my_twitter1/tweet/1' -d '{ 
        "user" : "kimchy", 
        "post_date" : "2011-09-20T16:20:00", 
        "message" : "I was trying to use elastic search" 
    }' 
    
    
    Now, facet on the _all field will show us that "was" is indexed:

    curl -XGET 'http://localhost:9200/my_twitter1/_search?pretty=true' -d '
    {"query": {"match_all": {}}, "facets": {"tag": {"terms": {"field": "_all", "size": 10000}}}, "size": 0}
    '
    
    { 
      "took" : 18, 
      "timed_out" : false, 
      "_shards" : { 
        "total" : 1, 
        "successful" : 1, 
        "failed" : 0 
      }, 
      "hits" : { 
        "total" : 1, 
        "max_score" : 1.0, 
        "hits" : [ ] 
      }, 
      "facets" : { 
        "tag" : { 
          "_type" : "terms", 
          "missing" : 0, 
          "total" : 9, 
          "other" : 0, 
          "terms" : [ { 
            "term" : "was", 
            "count" : 1 
          }, { 
            "term" : "use", 
            "count" : 1 
          }, { 
            "term" : "trying", 
            "count" : 1 
          }, { 
            "term" : "to", 
            "count" : 1 
          }, { 
            "term" : "search", 
            "count" : 1 
          }, { 
            "term" : "kimchy", 
            "count" : 1 
          }, { 
            "term" : "elastic", 
            "count" : 1 
          }, { 
            "term" : "I", 
            "count" : 1 
          }, { 
            "term" : "2011-09-20T16:20:00", 
            "count" : 1 
          } ] 
        } 
      } 
    } 
    
    So, the original query should now work, right? Nope! It appears that
    the wrong analyzer is getting applied to the _all field, so to address
    that, it must be set in the search:

    curl -X GET 'http://localhost:9200/my_twitter1/tweet/_search?search_type=dfs_query_and_fetch' -d '{
        "query": { 
            "query_string": { 
                "default_field": "_all", 
                "query": "was", 
                "analyzer": "whitespace" 
            } 
        }, 
        "from": 0, 
        "size": 60, 
        "explain": false 
    }' 
    
    
    I believe the last part is a bug, but the rest is working as 
    intended. 
    
    Best Regards, 
    Paul 
    
    
    
    On Sep 20, 9:20 pm, James Cook <jc...@tracermedia.com> wrote: 
    > Thanks for the detailed question and the steps to reproduce 
    it. I changed your Java example to use the REST style calls as 
    I find it easier to use when the question isn't API-specific. 
    I also empathize with your frustration over the lack of 
    "structured" documentation. Shay produces an enormous amount 
    of code for a single human being, and unlike many developers 
    he also accompanies each feature with pretty good 
    documentation. There are some gaps that need to be filled, and 
    none as glaring as a step-by-step instruction that can only be 
    truly represented in a good book format. I will certainly be 
    amongst the first people to order that book when it hits the 
    stores. 
    > 
    > I also have the double whammy of not only having to learn 
    the codebase from the source and website, but I don't have a 
    background in Lucene. Any good book on ES will have to include 
    both of these topics. If you reflect on what Shay has created, 
    it is quite incredible, IMHO. An auto-clustering, distributed
    Lucene environment that abstracts away all of Lucene's 
    complexity while providing a dead-simple startup/embedding 
    scheme with REST and Java APIs. I even replaced MongoDB with 
    elasticsearch because I found its sharding and clustering 
    capability (at the time) easier to use and its disk usage more 
    efficient. 
    > 
    > Anyway, we need a good book. Now, on to your problem. 
    > 
    > *Disclaimer: This is a long post and I don't really solve 
    your problem. I wrote this out this way because I was trying 
    to document a process that I thought may be helpful to others. 
    I think Shay or someone knowledgeable will have to weigh in to 
    give you an answer. My uninformed conclusion is that there is 
    a bug because I do not see the whitespace filter applied to 
    field queries. I put this up front because I don't want anyone 
    reading this very long post expecting to see  a solution.* 
    > 
    > Create an index with your mapping. The key here is the 
    declaration of a whitespace analyzer. 
    > 
    > *curl -XPUT 'http://localhost:9311/my_twitter1/' -d '
    > { 
    >     "settings": { 
    >         "number_of_shards": 1, 
    >         "number_of_replicas": 0 
    >     }, 
    >     "mappings": { 
    >         "tweet" : { 
    >             "properties" : { 
    >                 "user" : {"type" : "string", "index" : "not_analyzed"},
    >                 "message" : {"type" : "string", "null_value" : "na", "index" : "analyzed", "analyzer" : "whitespace"},
    >                 "postDate" : {"type" : "date"} 
    >             } 
    >         } 
    >     }}'* 
    > 
    > *{"ok":true,"acknowledged":true}* 
    > 
    > Index a tweet. 
    > 
    > *curl -XPUT 'http://localhost:9311/my_twitter1/tweet/1' -d '{
    >     "user" : "kimchy", 
    >     "post_date" : "2011-09-20T16:20:00", 
    >     "message" : "I was trying to use elastic search"}'* 
    > 
    > {"ok":true,"_index":"my_twitter1","_type":"tweet","_id":"1","_version":1}
    >
    > I think the following query faithfully represents your Java
    API, but does not force a specific analyzer as you were using.
    The documentation in the mapping section
    (http://www.elasticsearch.org/guide/reference/mapping/core-types.html)
    states:
    >
    > The analyzer [property is] used to analyze the text contents 
    when analyzed during indexing and when searching using a query 
    string. Defaults to the globally configured analyzer. 
    > 
    > So, we shouldn't have to force the whitespace analyzer while 
    searching, since it is specified in the mapping step. 
    > *curl -X GET http://localhost:9311/my_twitter1/tweet/_search?search_type=dfs_query... -d '{
    >     "query": { 
    >         "query_string": { 
    >             "default_field": "_all", 
    >             "query": "was" 
    >         } 
    >     }, 
    >     "from": 0, 
    >     "size": 60, 
    >     "explain": false}'*
    >
    > *{
    >     "took":34, 
    >     "timed_out":false, 
    >     "_shards":{ 
    >         "total":5,"successful":5,"failed":0 
    >     }, 
    >     "hits":{ 
    >         "total":0,"max_score":null,"hits":[] 
    >     }}* 
    > 
    > As we see, this query returns no hits as you are seeing in 
    your Java code. So, now what? There are a few possibilities: 
    > 
    >    1. Bug? 
    >    2. Set some breakpoints in the source code to see what is 
    going on under the hood. 
    >    3. Do we really understand the vast number of query 
    subtypes? I don't yet :) 
    >    4. Whitespace analyzer is not applied during the analysis 
    of the query string "was". Is it removed because it is a stop 
    word? 
    >    5. The whitespace analyzer isn't configured correctly in 
    the mapping file, and the stop words were removed when the 
    tweet was indexed. 
    >    6. Maybe I don't really understand the whitespace 
    analyzer? Does it actually remove stop words? 
    > 
    > Let's eliminate the most likely cases first, especially the 
    ones we can easily check. 
    > *Does the whitespace analyzer remove stop words?*
    >
    > The best way to sanity check this is to use the analyze 
    feature: 
    > *curl -X GET "http://localhost:9311/my_twitter1/_analyze?analyzer=whitespace" -d "I was trying to use elastic search"*
    >
    > *{ 
    >   "tokens" : [ { 
    >     "token" : "I", ... 
    >   }, { 
    >     "token" : "was", ... 
    >   }, { 
    >     "token" : "trying", ... 
    >   }, { 
    >     "token" : "to", ... 
    >   }, { 
    >     "token" : "use", ... 
    >   }, { 
    >     "token" : "elastic", ... 
    >   }, { 
    >     "token" : "search", ... 
    >   } ]}* 
    > 
    > So, we were correct about our understanding of the
    whitespace tokenizer: it just breaks on whitespace. Even the case
    of the terms remains unchanged, and of course our stop words
    are present.
    >
    > In fact, a few more cases can give those of us without a
    Lucene background some insight into the different types of
    analyzers.
    > 
    > *The quick brown fox is jumping over the lazy dog.* 
    > 
    >     *whitespace*: 
    > 
    >         [The] [quick] [brown] [fox] [is] [jumping] [over] 
    [the] [lazy] [dog.] 
    > 
    >     *simple*: 
    > 
    >         [the] [quick] [brown] [fox] [is] [jumping] [over] 
    [the] [lazy] [dog] 
    > 
    >     *stop*: 
    > 
    >         [quick] [brown] [fox] [jumping] [over] [lazy] [dog] 
    > 
    >     *standard*: 
    > 
    >         [quick] [brown] [fox] [jumping] [over] [lazy] [dog] 
    > 
    >     *keyword*: 
    > 
    >         [The quick brown fox is jumping over the lazy dog.] 
    > 
    >     *snowball*: 
    > 
    >         [quick] [brown] [fox] [jump] [over] [lazi] [dog] 
    > 
    > I'll let you be the judge whether the period after dog in 
    the whitespace example is a bug or not. The documentation 
    states that punctuation is part of the term when it is not 
    followed by whitespace. I'm sure smarter people than I see the 
    wisdom in this choice as I do not. 
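
For readers without a running node, the first three rows of the listing above can be approximated with standard Unix tools. This is only a sketch under stated assumptions: it is not the Lucene implementation, it handles plain ASCII only, the function names are made up for illustration, and the stop list is a small hand-picked subset rather than the analyzer's real default list.

```shell
#!/bin/sh
# Rough stand-ins for three analyzers, built from tr/sed/grep.
# They only mimic the visible behaviour on plain ASCII text.

text='The quick brown fox is jumping over the lazy dog.'

# whitespace: split on spaces and tabs only; case and punctuation are kept.
whitespace_tokens() {
    printf '%s\n' "$1" | tr -s ' \t' '\n\n'
}

# simple: split on anything that is not a letter, then lowercase.
simple_tokens() {
    printf '%s\n' "$1" | tr -cs 'A-Za-z' '\n' | tr 'A-Z' 'a-z' | sed '/^$/d'
}

# stop: same as simple, then drop stop words.
# The list here is an illustrative subset (an assumption), not the full default set.
stop_tokens() {
    simple_tokens "$1" | grep -v -x -E 'a|an|and|is|the|to|was'
}

echo "whitespace: $(whitespace_tokens "$text" | paste -sd ' ' -)"
echo "simple:     $(simple_tokens "$text" | paste -sd ' ' -)"
echo "stop:       $(stop_tokens "$text" | paste -sd ' ' -)"
```

Feeding the tweet text "I was trying to use elastic search" through `stop_tokens` drops "was", which is the behaviour the thread is chasing: if a stop-word-removing analyzer runs at either index time or query time, a search for "was" has nothing left to match.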
    > *Is the whitespace analyzer configured correctly in the 
    mapping file?* 
    > 
    > Well, it was accepted by ES when we created it. Shay 
    validates all of the JSON passed in, so if there was an error 
    in the mapping file our index creation should have been
    rejected. I see a lot of postings in this group related to 
    sometimes putting a property at the wrong level in the JSON 
    structure, so we can dump the mapping file ES _thinks_ it is 
    using and compare it carefully against the documentation. 
    > *curl -X GET 
    "http://localhost:9311/my_twitter1/tweet/_mapping?pretty=true"* 
    > *{ 
    >   "tweet" : { 
    >     "_id" : { 
    >       "index" : "not_analyzed" 
    >     }, 
    >     "properties" : { 
    >       "message" : { 
    >         "null_value" : "na", 
    >         "analyzer" : "whitespace", 
    >         "type" : "string" 
    >       }, 
    >       "_analyzer" : { 
    >         "type" : "string" 
    >       }, 
    >       "postDate" : { 
    >         "format" : "dateOptionalTime", 
    >         "type" : "date" 
    >       }, 
    >       "user" : { 
    >         "index" : "not_analyzed", 
    >         "type" : "string" 
    >       }, 
    >       "post_date" : { 
    >         "format" : "dateOptionalTime", 
    >         "type" : "date" 
    >       } 
    >     } 
    >   }}* 
    > 
    > That looks correct, and the analyzer seems to be in the 
    right location according to the documentation. 
    > *Do we really understand the vast number of query subtypes? 
    * 
    > 
    > If we change our original search to include a non-stop word, 
    what happens? (I'm going to simplify the query_string syntax 
    for this example.) 
    > *curl -X GET "http://localhost:9311/my_twitter1/tweet/_search?q=message:search&pret..."*
    > 
    > *{ 
    >   "took" : 17, 
    >   "timed_out" : false, 
    >   "_shards" : { 
    >     "total" : 1, 
    >     "successful" : 1, 
    >     "failed" : 0 
    >   }, 
    >   "hits" : { 
    >     "total" : 1, 
    >     "max_score" : 0.11506981, 
    >     "hits" : [ { 
    >       "_index" : "my_twitter1", 
    >       "_type" : "tweet", 
    >       "_id" : "1", 
    >       "_score" : 0.11506981, "_source" : { 
    >     "user" : "kimchy", 
    >     "post_date" : "2011-09-20T16:20:00", 
    >     "message" : "I was trying to use elastic search"}* 
    > 
    > That's interesting. Searching on "search" returned a hit, 
    while searching on "was" did not. If we are pretty sure the 
    whitespace analyzer is getting applied correctly, then maybe 
    the problem is the "way" we are searching? On the Query DSL 
    page <http://www.elasticsearch.org/guide/reference/query-dsl/> 
    there are more than 15 different types of search functions, 
    and I am not going to purport to be an expert on any of them. 
    (That's why we have Clinton in the Google Group.) 
    > 
    > But a few of these stand out to me as some kind of textual 
    search and we have to do our fair share of educating ourselves 
    to their differences. Let's look at "query_string", "text", 
    and "term". Please take the time to go and read about each of 
    them. It won't take long. 
    > 
    > Well, that should be perfectly clear, right? ;) If you have 
    a good understanding of Lucene, perhaps you are all straight 
    now. If you don't then you are probably confused. There was a 
    decent description of the difference between query_string and 
    text at the bottom of the text page, but the difference 
    between term queries and field queries (query_string) is
    still a bit fuzzy to me. I do recall that a term query does no 
    analysis on the searched text, while a field query performs 
    the analysis step and builds a set of terms out of the 
    resulting tokens. 
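
The distinction described above can be made concrete by sending the same word through both query types. This is a hedged sketch rather than a definitive recipe: the host, port, index and type are taken from the examples in this thread, while the variable names and the `run_search` helper are assumptions for illustration.

```shell
#!/bin/sh
# Endpoint assembled from the thread's examples (host/port may differ locally).
ES_URL='http://localhost:9311/my_twitter1/tweet/_search?pretty=true'

# term query: the text "was" is looked up as-is, with no analysis step.
term_body='{"query": {"term": {"message": "was"}}}'

# query_string query: the text is analyzed first, then the resulting
# tokens are looked up.
query_string_body='{"query": {"query_string": {"default_field": "message", "query": "was"}}}'

# Send one request body to the search endpoint.
# Only touches the network when it is called.
run_search() {
    curl -s -XGET "$ES_URL" -d "$1"
}

# Show the two bodies side by side.
printf '%s\n%s\n' "$term_body" "$query_string_body"

# Usage (requires a running node):
#   run_search "$term_body"
#   run_search "$query_string_body"
```

Comparing the hit counts of the two requests may help tell whether the term is missing from the index or is being dropped while the query text is analyzed.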
    > *So, the next step is to debug the code to see if it might 
    be a bug.* 
    > 
    > I'm not going to go into any details on how to debug ES, but 
    since it is a Java program it is relatively easy to set up 
    remote debugging, especially if you use one of the many quality
    IDEs. 
    > 
    > My initial hunch at this point is the whitespace analyzer is 
    not being used to analyze the search query. So, I put a 
    breakpoint on the method in the WhitespaceTokenizer class that 
    checks each character to determine whether it is whitespace. 
    The codebase is so abstracted, it wasn't clear to me in the 
    time I had to determine where the whitespace analyzer should
    have been selected...
    > 


(system) #18