How do I use "lang" analyzers? Actually, should I use them?


(Diego) #1

I'm very new to ElasticSearch and I'm still trying to understand how it
works. At the moment I'm experimenting with a clean instance, and I'm
trying to figure out the best approach to the search problem for a small
application that behaves like a forum. To get started, I created one index
called "Threads", where all the posts are stored. Since I don't yet
understand the differences between the various analyzers, even in their
default configurations, I used the following logic to choose one:

  • Post titles and bodies are free text in human language.
  • Posts may eventually be in multiple languages (although everything
    will be in English, at the beginning).

That led me to choose the lang analyzers
(http://www.elasticsearch.org/guide/reference/index-modules/analysis/lang-analyzer/),
since they seem to cover the above. The Standard analyzer also seems to cover
English, but I wanted to plan for the other languages already, so I went for
"lang". I created the index and the mapping and added some documents, but
the results have been odd, so I have a few questions:

  • I defined the Threads index as follows:
    {
      "analysis": {
        "analyzer": {
          "indexAnalyzer": {
            "type": "english"
          },
          "searchAnalyzer": {
            "type": "english"
          }
        }
      }
    }

Question: Is that the correct way to choose the lang analyzer for
English? The documentation is not very clear, but, since it says "the
following types are available", then it lists languages, I thought
that language = type in configuration.
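
For reference, here is a minimal sketch of one layout such settings can take, written as a Python dict so it can be checked and serialised. It rests on two assumptions that are not confirmed anywhere in this thread: that the "analysis" block sits under "settings" in the index-creation body, and that an analyzer registered under the name "default" is applied to fields that do not name an analyzer themselves. Under that reading, analyzers with arbitrary names such as "indexAnalyzer" would only take effect once a field mapping refers to them. The helper name build_index_body is hypothetical.

```python
import json


def build_index_body(language="english"):
    """Return an index-creation body that makes a language analyzer the default.

    Assumption: an analyzer registered under the name "default" is applied
    to any field that does not name its own analyzer.
    """
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    # "type" selects one of the built-in language analyzers.
                    "default": {"type": language},
                }
            }
        }
    }


# Serialise to the JSON text that would be sent when creating the index.
index_body_json = json.dumps(build_index_body(), indent=2)
print(index_body_json)
```

Swapping "english" for another language name would be the only change needed when other languages are added, assuming one index per language.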

  • I added one document to the index (the mapping is correct, declaring the
    Title and Body fields as strings and adding them both to the _all
    field), containing the following information:
    Title: This has nothing to do with the rest
    Body: It should have something to do with it, though.

I deliberately entered some silly text with common words, to see how the
index would behave. I checked the index content and saw that the document
was indexed correctly. I then performed some searches using curl, and this
is where I got unexpected results:
- Searching for "have", "though" and "do" returned the document.
- Searching for "has", "this", "something" and "nothing" returned nothing.

At first I thought this could be due to stop words, but then I started
wondering why "have" is OK while "has" is not. I was even more perplexed
that "something" and "nothing" also returned no results, as I don't think
they are stop words.

Question: what is causing such behaviour? I'm fully conscious that my
knowledge of ElasticSearch is next to zero, but I don't see a clear logic
for the above to happen.
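
One possible explanation, offered as a hypothesis rather than a confirmed diagnosis, is that the document was analyzed with stop-word removal and stemming at index time while the query terms were not stemmed at search time. The sketch below simulates that situation in plain Python. The stop list is assumed to be the default English set used by Lucene, and toy_stem is a deliberately crude, hypothetical stand-in for a real stemmer, so the exact terms a real index holds should be checked on the server.

```python
import re

# Assumed default English stop word set (Lucene's standard list).
ENGLISH_STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
}


def toy_stem(token):
    """Crude stand-in for a real stemmer: strips a final "ing" or "s"."""
    if token.endswith("ing") and len(token) > 5:
        return token[:-3]
    if token.endswith("s") and not token.endswith("ss") and len(token) > 2:
        return token[:-1]
    return token


def analyze_english_like(text):
    """Lowercase, split on non-letters, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [toy_stem(t) for t in tokens if t not in ENGLISH_STOP_WORDS]


def analyze_plain(text):
    """Lowercase and split only: no stop words, no stemming."""
    return re.findall(r"[a-z]+", text.lower())


document = (
    "This has nothing to do with the rest. "
    "It should have something to do with it, though."
)
indexed_terms = set(analyze_english_like(document))

for word in ["have", "though", "do", "has", "this", "something", "nothing"]:
    # Index-time terms are stemmed, but the query term is left as typed.
    hit = any(term in indexed_terms for term in analyze_plain(word))
    print(f"{word:<10} {'match' if hit else 'no match'}")
```

In this simulation "have", "though" and "do" survive analysis unchanged and therefore match, while "has", "something" and "nothing" are stored in a shortened form and "this" is dropped as a stop word, which is the same pattern as the search results described above.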

  • As I wrote, I chose a "lang" analyzer because it seemed the most
    logical to me. However, for English, the Standard analyzer should also
    work. Other analyzers are more obscure (with Snowball at the top of the
    list).
    Question: how does one choose which analyzer to use, both at index and
    search time? I read in many places suggestions to "try and see", but I
    can't really see the differences without a significant amount of data,
    and, if I had that much data, I would probably not have the time to
    "figure out" what changes. I know that the choice depends on many factors,
    so I'm not expecting a step-by-step guide, but I would be happy to
    have some links to resources that explain what to look for and what to
    evaluate when choosing how to configure an index. In my specific case, the
    question would be "what analyzer(s) should I use for an English forum where
    people chat about (almost) anything?"
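
One way to "try and see" without a large data set is to ask the server which tokens each analyzer produces for a single sentence. The sketch below prepares such requests for the _analyze endpoint. The JSON request body and the "tokens" response field follow the shape used by recent Elasticsearch releases; older releases, such as the ones current when this thread was written, took the analyzer name in the query string and the text as the raw body, so the request may need adjusting. The host http://localhost:9200 and the function names are assumptions for illustration.

```python
import json
import urllib.request


def build_analyze_request(base_url, analyzer, text):
    """Build (without sending) a request to the _analyze endpoint."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_tokens(request):
    """Send the request and return the token strings from the response."""
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return [entry["token"] for entry in payload.get("tokens", [])]


def compare_analyzers(base_url, text, analyzers=("standard", "english")):
    """Return {analyzer name: tokens} so the outputs can be read side by side."""
    return {
        name: fetch_tokens(build_analyze_request(base_url, name, text))
        for name in analyzers
    }
```

Calling compare_analyzers("http://localhost:9200", "It should have something to do with it, though.") against a running node would show directly which words each analyzer drops or rewrites.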

I also have further questions regarding the indexing of "non-discussion"
data, such as user names, to provide an autocomplete feature when looking
for a User, but I think I can save them for another time.

Thanks in advance for the answers.

Diego

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Diego) #2

Update
I discovered something interesting (for me, at least, since I'm a total
beginner). I connected to my ES Server via browser, and entered the
following:

  • Queried the type, with
    myserver/discussions/discussion/_search?pretty=true&q=something.
    Result: one document. Correct.
  • Queried the index, with
    myserver/discussions/_search?pretty=true&q=something.
    Result: zero documents. Incorrect.

I didn't know that both the index and the document type could be queried,
but now that I found a way to retrieve the result I expected, I have more
questions:

  • When should I query the index, and when should I query the type?
  • Why does the index return a different result?

Sorry if these are all basic questions, but such behaviour is odd to me,
probably because I lack the knowledge to fully understand ES logic. Thanks
again for all the answers.

Diego

(Follow-up to the original post of Saturday, September 7, 2013 8:38:17 PM UTC+1.)


(Dmitry Gorbunov) #3

"When should I query the index, and when should I query the type?"

Query the index when you have types that share some fields and you want to
search all of them. For example, you have types "post" and "comment" sharing
the field "author" and you want to fetch both posts and comments by the same
author. If no types share fields, querying the index is rarely useful, but
it is still possible, e.g. you can fetch anything that has the word "hello"
in any field. You can also list the types you want to include in the
query: /index/type1,type2/_search.
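
The three URL shapes described above (whole index, one type, several types) can be sketched with a small helper. The function name search_url is hypothetical, and "myserver" is the placeholder host used earlier in the thread. Note that mapping types were removed in later Elasticsearch versions, so the type segment only applies to releases from the era of this discussion.

```python
from urllib.parse import quote, urlencode


def search_url(base_url, index, types=(), query=None):
    """Build a URI-search URL for an index, optionally limited to some types."""
    parts = [base_url.rstrip("/"), quote(index)]
    if types:
        # Several types are joined with commas: /index/type1,type2/_search
        parts.append(",".join(quote(t) for t in types))
    parts.append("_search")
    url = "/".join(parts)
    if query is not None:
        url += "?" + urlencode({"q": query})
    return url


print(search_url("http://myserver", "discussions", query="something"))
print(search_url("http://myserver", "discussions", ["discussion"], "something"))
print(search_url("http://myserver", "forum", ["post", "comment"], "hello"))
```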

(In reply to Diego's update of Sunday, September 8, 2013 6:11:32 AM UTC+9.)


(Diego-2) #4

Thanks Dmitry,
Very clear explanation. So, searching an index means searching all the
types in it. What I don't understand is why, in my case, searching the index
and searching the type return different results. If I search the index, I
get no results. If I search the type, I get the result I expected.
What should I check to find the reason for this behaviour?

Thanks again for the help.
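
A few requests that might help narrow this down are sketched below, assuming the "discussions" index and "discussion" type from the earlier update. The idea, which is a guess and not something confirmed in this thread, is that the type-level search picks up analyzers declared in the type mapping while the index-level search falls back to a different default. Comparing the mapping and settings, and repeating the index-level search with the analyzer query-string parameter set explicitly, would test that guess. The function name diagnostic_urls is hypothetical.

```python
from urllib.parse import urlencode


def diagnostic_urls(base_url, index, doc_type, term, analyzer="english"):
    """Return URLs to inspect configuration and compare the two searches."""
    base = base_url.rstrip("/")
    query = urlencode({"q": term, "pretty": "true"})
    # Same search, but forcing the analyzer applied to the query string.
    forced = urlencode({"q": term, "analyzer": analyzer, "pretty": "true"})
    return {
        "mapping": f"{base}/{index}/_mapping?pretty=true",
        "settings": f"{base}/{index}/_settings?pretty=true",
        "type_search": f"{base}/{index}/{doc_type}/_search?{query}",
        "index_search": f"{base}/{index}/_search?{query}",
        "index_search_forced_analyzer": f"{base}/{index}/_search?{forced}",
    }


for name, url in diagnostic_urls(
    "http://myserver", "discussions", "discussion", "something"
).items():
    print(f"{name}: {url}")
```

If the index-level search starts returning the document once the analyzer is forced, that would point to an analyzer mismatch between indexing and searching.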
(In reply to Dmitry Gorbunov's message of 9 Sep 2013 11:09.)

