Question on Analyzer Not Working


(Scott Decker) #1

Hey All,

have a quick analyzer question.

We have an index analyzer setup like this:
index:
analysis:
analyzer:
pipeDelim:
tokenizer: pattern
pattern: |
filter: [lowercase]

And we then index a document using this analyzer, like this field:

entities: {

analyzer: pipeDelim
type: string

}

when we query, I would expect this query to work:
query: {

bool: {
    must: [
        {
            term: {
                entities: kobe bryant
            }
        }
        {
            match_all: { }
        }
    ]
}

}

but it does not

if I do this query, it works though:
query: {

bool: {
    must: [
        {
            query_string: {
                default_field: entities
                query: kobe bryant
            }
        }
        {
            match_all: { }
        }
    ]
}

}

Is something wrong in the index setup perhaps? Seems like this should
match on a term query, since when the document was written, our field
may look like
Kobe Bryant|Kobe|Bryant
and Kobe Bryant is one of the terms, if the analyzer for the pipe
delimiter worked correctly

Any help would be much appreciated.
Thanks,
Scott


(Karussell) #2

How does your indexing example looks like? Normally the tokens are
splitted and a term query will match only exact terms ...

Peter.

On 17 Jan., 21:39, Scott Decker sc...@publishthis.com wrote:

Hey All,

have a quick analyzer question.

We have an index analyzer setup like this:
index:
analysis:
analyzer:
pipeDelim:
tokenizer: pattern
pattern: |
filter: [lowercase]

And we then index a document using this analyzer, like this field:

entities: {

analyzer: pipeDelim
type: string

}

when we query, I would expect this query to work:
query: {

bool: {
    must: [
        {
            term: {
                entities: kobe bryant
            }
        }
        {
            match_all: { }
        }
    ]
}

}

but it does not

if I do this query, it works though:
query: {

bool: {
    must: [
        {
            query_string: {
                default_field: entities
                query: kobe bryant
            }
        }
        {
            match_all: { }
        }
    ]
}

}

Is something wrong in the index setup perhaps? Seems like this should
match on a term query, since when the document was written, our field
may look like
Kobe Bryant|Kobe|Bryant
and Kobe Bryant is one of the terms, if the analyzer for the pipe
delimiter worked correctly

Any help would be much appreciated.
Thanks,
Scott


(ppearcy) #3

Hey Scott,
We use a similar, if not identical analyzer. Your issue is that the
analyzer needs to be a valid regex and | is a special regex character.
Here is what I have in my yaml file:

  pipe_tokenizer :
    type: pattern
    lowercase: true
    pattern: '\|'
    stopwords: _none_
    flags: DOTALL

Best Regards,
Paul

On Jan 18, 1:55 am, Karussell tableyourt...@googlemail.com wrote:

How does your indexing example looks like? Normally the tokens are
splitted and a term query will match only exact terms ...

Peter.

On 17 Jan., 21:39, Scott Decker sc...@publishthis.com wrote:

Hey All,

have a quick analyzer question.

We have an index analyzer setup like this:
index:
analysis:
analyzer:
pipeDelim:
tokenizer: pattern
pattern: |
filter: [lowercase]

And we then index a document using this analyzer, like this field:

entities: {

analyzer: pipeDelim
type: string

}

when we query, I would expect this query to work:
query: {

bool: {
    must: [
        {
            term: {
                entities: kobe bryant
            }
        }
        {
            match_all: { }
        }
    ]
}

}

but it does not

if I do this query, it works though:
query: {

bool: {
    must: [
        {
            query_string: {
                default_field: entities
                query: kobe bryant
            }
        }
        {
            match_all: { }
        }
    ]
}

}

Is something wrong in the index setup perhaps? Seems like this should
match on a term query, since when the document was written, our field
may look like
Kobe Bryant|Kobe|Bryant
and Kobe Bryant is one of the terms, if the analyzer for the pipe
delimiter worked correctly

Any help would be much appreciated.
Thanks,
Scott


(ppearcy) #4

Also, I recommend running a facet query after indexing a doc, to
confirm the terms are getting generated correctly.

On Jan 18, 9:56 am, ppearcy ppea...@gmail.com wrote:

Hey Scott,
We use a similar, if not identical analyzer. Your issue is that the
analyzer needs to be a valid regex and | is a special regex character.
Here is what I have in my yaml file:

  pipe_tokenizer :
    type: pattern
    lowercase: true
    pattern: '\|'
    stopwords: _none_
    flags: DOTALL

Best Regards,
Paul

On Jan 18, 1:55 am, Karussell tableyourt...@googlemail.com wrote:

How does your indexing example looks like? Normally the tokens are
splitted and a term query will match only exact terms ...

Peter.

On 17 Jan., 21:39, Scott Decker sc...@publishthis.com wrote:

Hey All,

have a quick analyzer question.

We have an index analyzer setup like this:
index:
analysis:
analyzer:
pipeDelim:
tokenizer: pattern
pattern: |
filter: [lowercase]

And we then index a document using this analyzer, like this field:

entities: {

analyzer: pipeDelim
type: string

}

when we query, I would expect this query to work:
query: {

bool: {
    must: [
        {
            term: {
                entities: kobe bryant
            }
        }
        {
            match_all: { }
        }
    ]
}

}

but it does not

if I do this query, it works though:
query: {

bool: {
    must: [
        {
            query_string: {
                default_field: entities
                query: kobe bryant
            }
        }
        {
            match_all: { }
        }
    ]
}

}

Is something wrong in the index setup perhaps? Seems like this should
match on a term query, since when the document was written, our field
may look like
Kobe Bryant|Kobe|Bryant
and Kobe Bryant is one of the terms, if the analyzer for the pipe
delimiter worked correctly

Any help would be much appreciated.
Thanks,
Scott


(Scott Decker) #5

Thanks Paul.

I tried this out, but there seemed to be two errors.
One was, if I took my analyzer setup in elasticsearch.yml
and applied your format to it:

index:
analysis:
analyzer:
pipeDelim:
type: pattern
lowercase: true
pattern: '|'
stopwords: none
flags: DOTALL

elastic search would not start, saying there was some error in
the .yml file.
speaking of - is there any nice place to see what the actual issue is
there? I just get some random thing saying "expecting blockquote" or
something

if I just had this setup
index:
analysis:
analyzer:
pipeDelim:
tokenizer: pattern
pattern: '|'
filter: [lowercase]

That allowed elastic search to load, we reindexed the documents, but
still no love on the search for "kobe bryant" as a term query. So,
still not working correctly.

We are running version 17.7, so, maybe something changed in the index
mapping setup between that version and latest? It seems like trying to
put your format in yaml, the es startup doesn't like it.

On Jan 18, 8:57 am, ppearcy ppea...@gmail.com wrote:

Also, I recommend running a facet query after indexing a doc, to
confirm the terms are getting generated correctly.

On Jan 18, 9:56 am, ppearcy ppea...@gmail.com wrote:

Hey Scott,
We use a similar, if not identicalanalyzer. Your issue is that the
analyzerneeds to be a valid regex and | is a special regex character.
Here is what I have in my yaml file:

  pipe_tokenizer :
    type: pattern
    lowercase: true
    pattern: '\|'
    stopwords: _none_
    flags: DOTALL

Best Regards,
Paul

On Jan 18, 1:55 am, Karussell tableyourt...@googlemail.com wrote:

How does your indexing example looks like? Normally the tokens are
splitted and a term query will match only exact terms ...

Peter.

On 17 Jan., 21:39, Scott Decker sc...@publishthis.com wrote:

Hey All,

have a quickanalyzerquestion.

We have an indexanalyzersetup like this:
index:
analysis:
analyzer:
pipeDelim:
tokenizer: pattern
pattern: |
filter: [lowercase]

And we then index a document using thisanalyzer, like this field:

entities: {

analyzer: pipeDelim
type: string

}

when we query, I would expect this query to work:
query: {

bool: {
    must: [
        {
            term: {
                entities: kobe bryant
            }
        }
        {
            match_all: { }
        }
    ]
}

}

but it does not

if I do this query, it works though:
query: {

bool: {
    must: [
        {
            query_string: {
                default_field: entities
                query: kobe bryant
            }
        }
        {
            match_all: { }
        }
    ]
}

}

Is something wrong in the index setup perhaps? Seems like this should
match on a term query, since when the document was written, our field
may look like
Kobe Bryant|Kobe|Bryant
and Kobe Bryant is one of the terms, if theanalyzerfor the pipe
delimiter worked correctly

Any help would be much appreciated.
Thanks,
Scott


(Clinton Gormley) #6

index:
analysis:
analyzer:
pipeDelim:
type: pattern
lowercase: true
pattern: '|'
stopwords: none
flags: DOTALL

elastic search would not start, saying there was some error in
the .yml file.
speaking of - is there any nice place to see what the actual issue is
there? I just get some random thing saying "expecting blockquote" or
something

That's a YAML parser issue. I copied and pasted what you have above, and
it seems legal to me, so I assume it has lost some formatting in the mail.
Perhaps a tab or something?

Try this (changing 'foo' to the index of your choice):

curl -XPUT 'http://127.0.0.1:9200/foo/?pretty=1' -d '
{
"settings" : {
"analysis" : {
"analyzer" : {
"pipeDelim" : {
"stopwords" : "none",
"pattern" : "\|",
"flags" : "DOTALL",
"lowercase" : true,
"type" : "pattern"
}
}
}
}
}
'

clint


(Scott Decker) #7

All right, switched to JSON instead of YML and things started to work.
Still not sure why there were differences, as the yml file validated
fine. Oh well, yey for JSON commas and quotes!

Thanks for your help on the analyzer config options!

Scott

On Jan 18, 12:44 pm, Clinton Gormley cl...@traveljury.com wrote:

index:
analysis:
analyzer:
pipeDelim:
type: pattern
lowercase: true
pattern: '|'
stopwords: none
flags: DOTALL

elastic search would not start, saying there was some error in
the .yml file.
speaking of - is there any nice place to see what the actual issue is
there? I just get some random thing saying "expecting blockquote" or
something

That's a YAML parser issue. I copied and pasted what you have above, and
it seems legal to me, so I assume it has lost some formatting in the mail.
Perhaps a tab or something?

Try this (changing 'foo' to the index of your choice):

curl -XPUT 'http://127.0.0.1:9200/foo/?pretty=1' -d '
{
"settings" : {
"analysis" : {
"analyzer" : {
"pipeDelim" : {
"stopwords" : "none",
"pattern" : "\|",
"flags" : "DOTALL",
"lowercase" : true,
"type" : "pattern"
}
}
}
}}

'

clint


(system) #8