Help with analyzer and mapping


(sujoysett) #1

Hi,

In an index I have a field "text", analyzed with the standard analyzer. I now
want to return the documents that have the keyword "at&t" occurring in the
field "text".

However, "at" is probably in the stopword list of the stop token filter, and
"&" is probably a character the tokenizer splits on. (I don't have much
clarity on the exact logic here.)
I have tried using match, text, and query_string in my query, and all of them
return quite a lot of junk documents in addition to the required
documents.

I was thinking of a custom analyzer here, but I want to use "at" and "&" as
is for other search functions on this same field, and a custom analyzer
might upset that.

Am I missing something simpler here? How do I search for text that probably
includes stopwords and tokenizer characters?
Is something like an exact search irrespective of tokens (it might be a
time-consuming search, I accept) possible in Elasticsearch?

Thanks in advance,
-- Sujoy.

--


(Tanguy) #2

Hi,

You can use the _analyze API to understand the logic behind analyzers:
http://localhost:9200/_analyze?pretty=true&analyzer=standard&text=The+at%26t+company
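
A side note on the URL above: the "&" in "at&t" has to be percent-encoded as
%26, otherwise it would be read as a separator between query parameters. As a
minimal sketch, assuming a node on localhost:9200 as in the URL above, the
same request URL can be built with the Python standard library. The snippet
only constructs the string and does not contact any server.

```python
from urllib.parse import urlencode

# Parameters taken from the _analyze URL quoted above.
params = {
    "pretty": "true",
    "analyzer": "standard",
    "text": "The at&t company",
}

# urlencode() turns spaces into "+" and "&" into "%26".
analyze_url = "http://localhost:9200/_analyze?" + urlencode(params)
print(analyze_url)
```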

If you index "The at&t company" with the standard analyzer, the tokens that
are actually indexed are "t" and "company": "The" and "at" are removed as
stopwords, and the tokenizer splits on "&". The same logic applies when
searching with match and query_string queries, and that explains the results
you have.
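
To make the effect concrete, here is a rough imitation of that analysis chain
in plain Python. It is only a sketch and not the real Lucene implementation:
the function name is made up, the stopword set is a small assumed subset of
the English defaults, and the tokenizer is approximated by splitting on every
non-alphanumeric character.

```python
import re

# Assumption: a small subset of the default English stopword list.
STOPWORDS = {"a", "an", "and", "at", "in", "of", "the", "to"}

def rough_standard_analyze(text):
    """Rough imitation: split on non-alphanumerics, lowercase, drop stopwords."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    lowered = [token.lower() for token in tokens]
    return [token for token in lowered if token not in STOPWORDS]

print(rough_standard_analyze("The at&t company"))
print(rough_standard_analyze("at&t"))
```

Under these assumptions the query text "at&t" is reduced to the single token
"t", so it matches any document whose analyzed text contains a "t" token,
which would account for the junk results.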

There are many ways to get the expected results when searching for "at&t".
Some suggestions:

  • use a custom analyzer for the "text" field in mapping
  • declare "text" as multi_field and search for exact matches

Hope this helps,

-- Tanguy
Twitter: @tlrx
https://github.com/tlrx

In reply to Sujoy Sett's message of Tuesday, October 16, 2012 11:53:28 UTC+2.

--


(sujoysett) #3

Thanks Tanguy,

I will surely try the multi-field.

-- Sujoy.

In reply to Tanguy's message of Tuesday, October 16, 2012 3:39:16 PM UTC+5:30.

--


(sujoysett) #4

Hi All,

Can anyone explain what "type" means for a token?

http://localhost:9200/[index]/_analyze?pretty=true&analyzer=custom-whitespace-lowercase&text=ipad
gives response

{
  "tokens": [
    {
      "token": "ipad",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 1
    }
  ]
}

whereas,

http://localhost:9200/[index]/_analyze?pretty=true&analyzer=standard&text=ipad
gives response

{
  "tokens": [
    {
      "token": "ipad",
      "start_offset": 0,
      "end_offset": 4,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}

custom-whitespace-lowercase is a custom analyzer defined with the whitespace
tokenizer and the lowercase filter.
The purpose of defining this analyzer was to avoid the stop-word filter
that comes by default in the standard analyzer.

But this analyzer is creating a different problem: it does not identify the
term "ipad" while querying.
Also, as a bit of extra information (I don't know whether it is relevant or
not), my mapping is as follows:
"properties": {
  "text": {
    "type": "multi_field",
    "fields": {
      "text": {
        "type": "string"
      },
      "text_custom_1": {
        "include_in_all": false,
        "analyzer": "custom-whitespace-lowercase",
        "type": "string"
      }
    }
  }
}

Thanks,
-- Sujoy.

In reply to Sujoy Sett's message of Tuesday, October 16, 2012 7:30:15 PM UTC+5:30.

--


(simonw-2) #5

Hey, the type is set by the tokenizer or token filter. The default type is
"word". StandardTokenizer might set it to "alphanum", "url", "email", etc.
Other token filters like ShingleFilter set it to "shingle" to indicate
what this token is. If you want to use the standard analyzer but without
stopwords, you can just compose it out of the standard tokenizer and the
lowercase filter.
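
As a sketch of what such a composition might look like in the index settings,
the snippet below builds two analyzer definitions as a Python dictionary and
prints them as JSON. "custom-whitespace-lowercase" is the name used earlier in
this thread; "standard-no-stopwords" is a hypothetical name chosen for this
example. The poster's real settings were not shown, so treat the layout as an
assumption based on the usual "analysis" settings structure.

```python
import json

# Two analyzer definitions side by side. "custom-whitespace-lowercase" is the
# name used earlier in this thread; "standard-no-stopwords" is a hypothetical
# name chosen for this sketch.
analysis_settings = {
    "analysis": {
        "analyzer": {
            "custom-whitespace-lowercase": {
                "type": "custom",
                "tokenizer": "whitespace",
                "filter": ["lowercase"],
            },
            "standard-no-stopwords": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase"],
            },
        }
    }
}

# Print the JSON that would go under "settings" when creating an index.
print(json.dumps(analysis_settings, indent=2))
```

Neither definition lists a stop filter, so words such as "at" would be kept
by both; they differ only in how the text is split into tokens.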

simon

In reply to Sujoy Sett's message of Wednesday, October 24, 2012 12:21:05 PM UTC+2.

--


(sujoysett) #6

Thanks Simon.

The standard tokenizer + lowercase filter will probably not serve my purpose,
as I want words like AT&T as a single token, whereas the standard tokenizer
breaks text down on special characters like '&'.
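
The difference can be illustrated with a small plain-Python comparison. This
is a simplified model and not the actual tokenizers: the function names and
the sample sentence are invented, and the standard tokenizer is approximated
by treating every non-alphanumeric character as a separator.

```python
import re

def whitespace_lowercase(text):
    """Whitespace tokenizer + lowercase filter: split on whitespace only."""
    return [token.lower() for token in text.split()]

def letters_digits_lowercase(text):
    """Rough stand-in for the standard tokenizer + lowercase filter:
    every non-alphanumeric character, including '&', acts as a separator."""
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]

sample = "AT&T sells the iPad."
print(whitespace_lowercase(sample))
print(letters_digits_lowercase(sample))
```

In this model the whitespace variant keeps "at&t" whole, but it also keeps
the trailing period attached to "ipad.", which is worth keeping in mind when
matching exact terms against such a field.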

But that is a different issue. What I am concerned with right now is that
querying for the term 'ipad' on a multi-field analyzed with this custom
analyzer is not fetching me proper results. I am using a term query
on 'ipad' within a boolean must_not, and I am still finding results with the
term 'ipad' in them. Querying with the standard analyzer, however, fetches
results as expected. Any hint as to the cause of this behavior?

Thanks,
-- Sujoy.

In reply to simonw's message of Wednesday, October 24, 2012 8:15:07 PM UTC+5:30.

--


(simonw-2) #7

In reply to Sujoy Sett's message of Wednesday, October 24, 2012 7:58:57 PM UTC+2:

Do the documents that are returned contain "I Pad" or "ipad"? I mean, are
you sure they are analyzed correctly?

simon


--


(Chris Male) #8

Are you able to provide the query you're using? Just so we can see which
fields you're querying and what not.


--


(sujoysett) #9

Hi,

I tried to create a gist with a small set of docs, but was not able to
recreate the problem.
Apparently, some docs were indexed without the multi-field mapping, and they
were the reason behind the faulty responses from the search query being used:
{
  "size": 100,
  "query": {
    "bool": {
      "must_not": [
        {
          "match": {
            "text.text_custom_1": {
              "query": "ipad",
              "type": "phrase"
            }
          }
        }
      ]
    }
  }
}

Applying a proper filter with this query removed the faulty docs. It was
really a silly mistake. Thanks very much for all your help.
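
One way to picture what presumably happened: a must_not clause only excludes
documents that match the inner query. A document indexed without the
multi-field mapping has no terms in text.text_custom_1, so it can never match
there and is never excluded, even if its original text mentions "ipad". The
following is a toy in-memory model, not Elasticsearch code; the document ids,
token lists, and function name are invented for illustration.

```python
# Toy model: each document maps field names to the tokens indexed for it.
# Document 3 was indexed without the multi-field mapping, so the
# "text.text_custom_1" sub-field has no tokens for it.
docs = {
    1: {"text.text_custom_1": ["new", "ipad", "released"]},
    2: {"text.text_custom_1": ["android", "tablet"]},
    3: {},  # source text mentions "ipad", but the sub-field was never indexed
}

def must_not_term(documents, field, term):
    """Return ids of documents that do NOT contain `term` in `field`."""
    return sorted(
        doc_id
        for doc_id, fields in documents.items()
        if term not in fields.get(field, [])
    )

print(must_not_term(docs, "text.text_custom_1", "ipad"))
```

Document 3 slips through in this model because the excluded term was never
indexed in the sub-field being queried.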

Thanks
-- Sujoy.

In reply to Chris Male's message of Thursday, October 25, 2012 9:01:02 AM UTC+5:30.

--

