I'm very new to ElasticSearch and I'm still trying to understand how it
works. At the moment I'm experimenting with a clean instance, and I'm
trying to figure out what would be the best approach to tackle the search
problem for a small application that behaves like a forum. To get started,
I created one index called "Threads", where all the posts are stored. Since
I don't yet understand the differences between the various
analyzers, even in their default configurations, I used the following logic
to choose one:
- Post titles and bodies are free text in a human language.
- Posts may eventually be in multiple languages (although everything
will be in English, at the beginning).
That led me to choose the lang analyzers
(http://www.elasticsearch.org/guide/reference/index-modules/analysis/lang-analyzer/),
since they seem to cover the above. Standard analyzer also seems to cover
English, but I was thinking of planning for the other languages already, so
I went for "lang". I created the index and the mapping and added some documents,
but the results have been odd, so I have a few questions:
- I defined the Threads index as follows:
{
  "analysis": {
    "analyzer": {
      "indexAnalyzer": {
        "type": "english"
      },
      "searchAnalyzer": {
        "type": "english"
      }
    }
  }
}
Question: Is that the correct way to choose the lang analyzer for
English? The documentation is not very clear, but since it says "the
following types are available" and then lists languages, I assumed
that the language name is the value of "type" in the configuration.
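
For context, here is a rough sketch of how the analysis settings above might sit inside a complete index-creation body, with the two named analyzers referenced from the field mappings. It is not taken from my actual setup: the type name "thread" is a placeholder, and the "index_analyzer", "search_analyzer" and "include_in_all" keys are what I understand the string mapping to accept in this generation of ElasticSearch, so treat them as assumptions to check against the documentation.

```python
import json


def build_index_body():
    """Build a hypothetical index-creation body (settings plus mappings)."""
    # The analysis block is the one from the post, unchanged.
    analysis = {
        "analyzer": {
            "indexAnalyzer": {"type": "english"},
            "searchAnalyzer": {"type": "english"},
        }
    }
    # Assumption: string fields pick up named analyzers through these keys.
    text_field = {
        "type": "string",
        "index_analyzer": "indexAnalyzer",
        "search_analyzer": "searchAnalyzer",
        "include_in_all": True,
    }
    return {
        "settings": {"analysis": analysis},
        "mappings": {
            # "thread" is a placeholder type name, not from the post.
            "thread": {
                "properties": {
                    "Title": dict(text_field),
                    "Body": dict(text_field),
                }
            }
        },
    }


if __name__ == "__main__":
    # Print the JSON that would be sent when creating the index.
    print(json.dumps(build_index_body(), indent=2))
```

The printed JSON could then be sent with curl when creating the index. Whether the analyzers need to be referenced this way, or can be picked up otherwise, is part of what I am unsure about.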
- I added one document to the index (the mapping is correct, declaring the
Title and Body fields as strings and including them both in the _all field),
containing the following information:
Title: This has nothing to do with the rest
Body: It should have something to do with it, though.
I deliberately entered some silly text with common words, to see how
the index would behave. I checked the index content, and I saw that the
document was indexed correctly. I then performed some searches using curl,
and this is where I got unexpected results:
- Searching for have, though and do returned the document.
- Searching for has, this, something and nothing returned
no results.
At the beginning I thought that this could be due to stop words, but then I
started wondering why have is ok, while has is not. I got even more
perplexed by the fact that something and nothing also returned no
results, as I don't think they are stop words.
Question: what is causing such behaviour? I'm fully conscious that my
knowledge of ElasticSearch is next to zero, but I don't see a clear logic
for the above to happen.
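
To test the stop-word theory on its own, a small sketch like the one below can split the search terms into stop words and everything else. The word list is what I believe to be Lucene's default English stop set; that is an assumption, and the list actually used by the "english" analyzer in a given ElasticSearch version should be checked in its documentation.

```python
# Assumption: Lucene's default English stop-word set, as far as I know it.
ASSUMED_ENGLISH_STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
}


def split_by_stop_words(terms):
    """Return (stop_words, other_terms), preserving the input order."""
    lowered = [term.lower() for term in terms]
    stops = [t for t in lowered if t in ASSUMED_ENGLISH_STOP_WORDS]
    others = [t for t in lowered if t not in ASSUMED_ENGLISH_STOP_WORDS]
    return stops, others


# Terms that returned the document.
print(split_by_stop_words(["have", "though", "do"]))
# -> ([], ['have', 'though', 'do'])

# Terms that returned no results.
print(split_by_stop_words(["has", "this", "something", "nothing"]))
# -> (['this'], ['has', 'something', 'nothing'])
```

Under that assumption only "this" would be dropped as a stop word, which leaves has, something and nothing unexplained.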
- As I wrote, I chose a "lang" analyzer because it seemed the most
logical to me. However, in the case of English, the Standard
analyzer should also work. Other analyzers are more obscure (with Snowball
at the top of the list).
Question: how does one choose which analyzer to use, both at index and
search time? I read in many places suggestions to "try and see", but I
can't really find the differences without a significant amount of data,
and, if I had such an amount of data, I would probably not have the time to
"figure out" what changes. I know that the choice depends on many factors,
therefore I'm not expecting a step by step guide, but I would be happy to
have some links to resources that explain what to look for and what to
evaluate when choosing how to configure an index. In my specific case, the
question would be "what analyzer(s) should I use for an English forum where
people chat about (almost) anything?"
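
One inexpensive way to "try and see" without a large data set might be to compare the tokens each analyzer produces for the same sample sentence through ElasticSearch's analyze API. The sketch below only builds the request URLs. The host, the lowercase index name "threads" and the use of "analyzer" and "text" as query parameters are assumptions on my part and may differ between ElasticSearch versions.

```python
from urllib.parse import urlencode

# Assumption: a local node on the default HTTP port.
BASE_URL = "http://localhost:9200"

# The body of the test document from the post.
SAMPLE = "It should have something to do with it, though."


def analyze_url(index, analyzer, text):
    """Build a URL for the _analyze endpoint of the given index."""
    # Assumption: the endpoint accepts "analyzer" and "text" in the query string.
    query = urlencode({"analyzer": analyzer, "text": text})
    return f"{BASE_URL}/{index}/_analyze?{query}"


if __name__ == "__main__":
    # One URL per analyzer to compare; each can be fetched with curl.
    for name in ("standard", "english", "snowball"):
        print(analyze_url("threads", name, SAMPLE))
```

Fetching each URL and comparing the returned tokens side by side should show how the analyzers differ on the same input, even with a single document in the index.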
I also have further questions regarding the indexing of "non-discussion"
data, such as user names, to provide an autocomplete feature when looking
for a User, but I think I can save them for another time.
Thanks in advance for the answers.
Diego