James, great reply, I think I can solve the final piece of the puzzle.
I like to use facets to debug how things are really behaving under the
hood. Using facets to inspect the _all field, we see that it does not
contain the "was" term:
curl -XGET 'http://localhost:9200/my_twitter1/_search?pretty=true' -d '
{"query": {"match_all": {}}, "facets": {"tag": {"terms": {"field": "_all", "size": 10000}}}, "size": 0}
'
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 1.0,
"hits" :
},
"facets" : {
"tag" : {
"_type" : "terms",
"missing" : 0,
"total" : 11,
"other" : 0,
"terms" : [ {
"term" : "use",
"count" : 1
}, {
"term" : "trying",
"count" : 1
}, {
"term" : "search",
"count" : 1
}, {
"term" : "kimchy",
"count" : 1
}, {
"term" : "i",
"count" : 1
}, {
"term" : "elastic",
"count" : 1
}, {
"term" : "20t16",
"count" : 1
}, {
"term" : "2011",
"count" : 1
}, {
"term" : "20",
"count" : 1
}, {
"term" : "09",
"count" : 1
}, {
"term" : "00",
"count" : 1
} ]
}
}
}
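As a quick cross-check, we can run the same sentence through the analyze API with no explicit analyzer; assuming the index is using the defaults, this exercises the standard analyzer:

curl -XGET 'http://localhost:9200/my_twitter1/_analyze?pretty=true' -d 'I was trying to use Elasticsearch'

If the defaults are in play, the tokens should come back lowercased with "was" and "to" missing, while "i" survives (it is not in Lucene's default English stop set), which lines up with the facet output above.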
So, the stop words are getting applied to the _all field. This is
occurring because no mapping is set for the _all field, so the
default (standard) analyzer is used. If we delete and recreate the index
with a mapping for the _all field, things are better, sort of...
curl -XDELETE 'http://localhost:9200/my_twitter1/'
curl -XPUT 'http://localhost:9200/my_twitter1/' -d '
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0
},
"mappings": {
"tweet" : {
"properties" : {
"user" : {"type" : "string", "index" :
"not_analyzed"},
"message" : {"type" : "string", "null_value" : "na",
"index" : "analyzed", "analyzer" : "whitespace"},
"_all" : {"type" : "string", "null_value" : "na",
"index" : "analyzed", "analyzer" : "whitespace"},
"postDate" : {"type" : "date"}
}
}
}
}'
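(Before indexing anything, it doesn't hurt to dump the mapping back out and confirm the _all settings were accepted where we expect them; this is the same technique as your _mapping check below:)

curl -XGET 'http://localhost:9200/my_twitter1/tweet/_mapping?pretty=true'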
curl -XPUT 'http://localhost:9200/my_twitter1/tweet/1' -d '{
"user" : "kimchy",
"post_date" : "2011-09-20T16:20:00",
"message" : "I was trying to use Elasticsearch"
}'
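One caveat: a freshly indexed document only becomes visible to search after a refresh (by default within about a second), so if the facet below ever comes back empty, an explicit refresh rules that out:

curl -XPOST 'http://localhost:9200/my_twitter1/_refresh'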
Now, a facet on the _all field shows us that "was" is indexed:
curl -XGET 'http://localhost:9200/my_twitter1/_search?pretty=true' -d '
{"query": {"match_all": {}}, "facets": {"tag": {"terms": {"field": "_all", "size": 10000}}}, "size": 0}
'
{
"took" : 18,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 1.0,
"hits" :
},
"facets" : {
"tag" : {
"_type" : "terms",
"missing" : 0,
"total" : 9,
"other" : 0,
"terms" : [ {
"term" : "was",
"count" : 1
}, {
"term" : "use",
"count" : 1
}, {
"term" : "trying",
"count" : 1
}, {
"term" : "to",
"count" : 1
}, {
"term" : "search",
"count" : 1
}, {
"term" : "kimchy",
"count" : 1
}, {
"term" : "elastic",
"count" : 1
}, {
"term" : "I",
"count" : 1
}, {
"term" : "2011-09-20T16:20:00",
"count" : 1
} ]
}
}
}
So, the original query should now work, right? Nope! It appears that
the wrong analyzer is still getting applied to the _all field at query
time, so to work around that, the analyzer must be set explicitly in the search:
curl -X GET 'http://localhost:9200/my_twitter1/tweet/_search?search_type=dfs_query_and_fetch' -d '{
"query": {
"query_string": {
"default_field": "_all",
"query": "was",
"analyzer": "whitespace"
}
},
"from": 0,
"size": 60,
"explain": false
}'
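(Equivalently, the analyzer request parameter should do the same thing for a URI search; a hedged variant I'd expect to work:)

curl -X GET 'http://localhost:9200/my_twitter1/tweet/_search?q=was&analyzer=whitespace&pretty=true'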
I believe the last part is a bug, but the rest is working as
intended.
Best Regards,
Paul
On Sep 20, 9:20 pm, James Cook jc...@tracermedia.com wrote:
Thanks for the detailed question and the steps to reproduce it. I changed your Java example to use REST-style calls, as I find them easier to use when the question isn't API-specific. I also empathize with your frustration over the lack of "structured" documentation. Shay produces an enormous amount of code for a single human being, and unlike many developers he also accompanies each feature with pretty good documentation. There are some gaps that need to be filled, and none as glaring as the step-by-step instruction that can only truly be delivered in a good book format. I will certainly be amongst the first people to order that book when it hits the stores.
I also have the double whammy of not only having to learn the codebase from the source and website, but I don't have a background in Lucene. Any good book on ES will have to include both of these topics. If you reflect on what Shay has created, it is quite incredible, IMHO: an auto-clustering, distributed Lucene environment that abstracts away all of Lucene's complexity while providing a dead-simple startup/embedding scheme with REST and Java APIs. I even replaced MongoDB with elasticsearch because I found its sharding and clustering capability (at the time) easier to use and its disk usage more efficient.
Anyway, we need a good book. Now, on to your problem.
Disclaimer: This is a long post and I don't really solve your problem. I wrote it out this way because I was trying to document a process that I thought might be helpful to others. I think Shay or someone knowledgeable will have to weigh in to give you an answer. My uninformed conclusion is that there is a bug, because I do not see the whitespace filter applied to field queries. I put this up front because I don't want anyone reading this very long post expecting to see a solution.
Create an index with your mapping. The key here is the declaration of a whitespace analyzer.
curl -XPUT 'http://localhost:9311/my_twitter1/' -d '
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0
},
"mappings": {
"tweet" : {
"properties" : {
"user" : {"type" : "string", "index" : "not_analyzed"},
"message" : {"type" : "string", "null_value" : "na", "index" : "analyzed", "analyzer" : "whitespace"},
"postDate" : {"type" : "date"}
}
}
}}'
{"ok":true,"acknowledged":true}
Index a tweet.
curl -XPUT 'http://localhost:9311/my_twitter1/tweet/1' -d '{
"user" : "kimchy",
"post_date" : "2011-09-20T16:20:00",
"message" : "I was trying to use Elasticsearch"}'
{"ok":true,"_index":"my_twitter1","_type":"tweet","_id":"1","_version":1}
I think the following query faithfully represents your Java API call, but does not force a specific analyzer as you were doing. The documentation in the mapping section states:
The analyzer [property is] used to analyze the text contents when analyzed during indexing and when searching using a query string. Defaults to the globally configured analyzer.
So, we shouldn't have to force the whitespace analyzer while searching, since it is specified in the mapping step.
curl -X GET 'http://localhost:9311/my_twitter1/tweet/_search?search_type=dfs_query_and_fetch' -d '{
"query": {
"query_string": {
"default_field": "_all",
"query": "was"
}
},
"from": 0,
"size": 60,
"explain": false
}'

{
"took":34,
"timed_out":false,
"_shards":{
"total":5,"successful":5,"failed":0
},
"hits":{
"total":0,"max_score":null,"hits":[]
}}
As we see, this query returns no hits as you are seeing in your Java code. So, now what? There are a few possibilities:
- Bug?
- Set some breakpoints in the source code to see what is going on under the hood.
- Do we really understand the vast number of query subtypes? I don't yet.
- The whitespace analyzer is not applied during the analysis of the query string "was". Is it removed because it is a stop word?
- The whitespace analyzer isn't configured correctly in the mapping file, and the stop words were removed when the tweet was indexed.
- Maybe I don't really understand the whitespace analyzer? Does it actually remove stop words?
Let's eliminate the most likely cases first, especially the ones we can easily check.
*Does the whitespace analyzer remove stop words?*
The best way to sanity check this is to use the analyze feature:
curl -X GET "http://localhost:9311/my_twitter1/_analyze?analyzer=whitespace" -d "I was trying to use Elasticsearch"
{
"tokens" : [ {
"token" : "I", ...
}, {
"token" : "was", ...
}, {
"token" : "trying", ...
}, {
"token" : "to", ...
}, {
"token" : "use", ...
}, {
"token" : "elastic", ...
}, {
"token" : "search", ...
} ]}
So, we were correct in our understanding of the whitespace tokenizer: it just breaks on whitespace. Even the case of the terms remains unchanged, and of course our stop words are present.
In fact, a few more cases can give those of us without a Lucene background some insight into the different types of analyzers.
The quick brown fox is jumping over the lazy dog.
*whitespace*:
[The] [quick] [brown] [fox] [is] [jumping] [over] [the] [lazy] [dog.]
*simple*:
[the] [quick] [brown] [fox] [is] [jumping] [over] [the] [lazy] [dog]
*stop*:
[quick] [brown] [fox] [jumping] [over] [lazy] [dog]
*standard*:
[quick] [brown] [fox] [jumping] [over] [lazy] [dog]
*keyword*:
[The quick brown fox is jumping over the lazy dog.]
*snowball*:
[quick] [brown] [fox] [jump] [over] [lazi] [dog]
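(For reference, each of these token streams can be reproduced with the analyze API; a quick sketch, assuming the snowball analyzer is available on your build:)

for a in whitespace simple stop standard keyword snowball; do
  echo "== $a =="
  curl -s -XGET "http://localhost:9311/my_twitter1/_analyze?analyzer=$a" \
    -d "The quick brown fox is jumping over the lazy dog."
  echo
done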
I'll let you be the judge of whether the period after "dog" in the whitespace example is a bug or not. The documentation states that punctuation is part of the term when it is not followed by whitespace. I'm sure smarter people than I see the wisdom in this choice; I do not.
*Is the whitespace analyzer configured correctly in the mapping file?*
Well, it was accepted by ES when we created it. Shay validates all of the JSON passed in, so if there were an error in the mapping file, our index creation should have been rejected. I see a lot of postings in this group about putting a property at the wrong level in the JSON structure, so we can dump the mapping ES thinks it is using and compare it carefully against the documentation.
curl -X GET "http://localhost:9311/my_twitter1/tweet/_mapping?pretty=true"
{
"tweet" : {
"_id" : {
"index" : "not_analyzed"
},
"properties" : {
"message" : {
"null_value" : "na",
"analyzer" : "whitespace",
"type" : "string"
},
"_analyzer" : {
"type" : "string"
},
"postDate" : {
"format" : "dateOptionalTime",
"type" : "date"
},
"user" : {
"index" : "not_analyzed",
"type" : "string"
},
"post_date" : {
"format" : "dateOptionalTime",
"type" : "date"
}
}
}}
That looks correct, and the analyzer seems to be in the right location according to the documentation.
*Do we really understand the vast number of query subtypes?*
If we change our original search to include a non-stop word, what happens? (I'm going to simplify the query_string syntax for this example.)
curl -X GET "http://localhost:9311/my_twitter1/tweet/_search?q=message:search&pret..."
{
"took" : 17,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 0.11506981,
"hits" : [ {
"_index" : "my_twitter1",
"_type" : "tweet",
"_id" : "1",
"_score" : 0.11506981, "_source" : {
"user" : "kimchy",
"post_date" : "2011-09-20T16:20:00",
"message" : "I was trying to use Elasticsearch"}
That's interesting. Searching on "search" returned a hit, while searching on "was" did not. If we are pretty sure the whitespace analyzer is getting applied correctly, then maybe the problem is the "way" we are searching? On the Query DSL page http://www.elasticsearch.org/guide/reference/query-dsl/ there are more than 15 different types of search functions, and I am not going to purport to be an expert on any of them. (That's why we have Clinton in the Google Group.)
But a few of these stand out to me as some kind of textual search and we have to do our fair share of educating ourselves to their differences. Let's look at "query_string", "text", and "term". Please take the time to go and read about each of them. It won't take long.
Well, that should be perfectly clear, right? If you have a good understanding of Lucene, perhaps you are all straight now. If you don't, then you are probably confused. There was a decent description of the difference between query_string and text at the bottom of the text page, but the difference between term queries and field queries (query_string) is still a bit fuzzy to me. I do recall that a term query does no analysis on the searched text, while a field query performs the analysis step and builds a set of terms out of the resulting tokens.
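To make that concrete, here is a hedged pair of queries against the message field (a sketch; I haven't run these against this exact index). The term query looks up the literal term, while the query_string query analyzes the text first:

curl -X GET 'http://localhost:9311/my_twitter1/tweet/_search?pretty=true' -d '
{"query": {"term": {"message": "was"}}}'

curl -X GET 'http://localhost:9311/my_twitter1/tweet/_search?pretty=true' -d '
{"query": {"query_string": {"default_field": "message", "query": "was"}}}'

Since the message field was indexed with the whitespace analyzer, "was" should be sitting in the index verbatim, so the term query ought to match; whether the query_string version matches depends on which analyzer gets applied at query time, which is exactly the question here.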
So, the next step is to debug the code to see if it might be a bug.
I'm not going to go into any details on how to debug ES, but since it is a Java program it is relatively easy to set up remote debugging, especially if you use one of the many quality IDEs.
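(For anyone following along: the usual recipe is to start the node with the standard JDWP flags and attach the IDE to that port. The exact variable name depends on your ES version and startup script, so treat this as a sketch:

JAVA_OPTS="-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=8000" ./bin/elasticsearch -f

Then attach a remote debugger to localhost:8000.)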
My initial hunch at this point is that the whitespace analyzer is not being used to analyze the search query. So, I put a breakpoint on the method in the WhitespaceTokenizer class that checks each character to determine whether it is whitespace. The codebase is so abstracted that it wasn't clear to me, in the time I had, where the whitespace analyzer should have been selected...