Accent issue on search - ES 5.0


(evert) #1

Problem: I cannot find the correct settings (ES 5) to query a term such as "prodigo" and get back all occurrences of both "prodigo" and "pródigo", the accented form of the word.

Environment:

I am upgrading from ES 2.4 to ES 5.0. In my current scenario this is my mapping:

'properties'    => [
    'file'      => [
        'type'      => 'attachment',
        'fields'    => [
            'content'   => [
                'type'          => 'string',
                'term_vector'   => 'with_positions_offsets',
                'store'         => true,
                'analyzer'      => 'brazilian'
            ]
        ]
    ],
    'book_name' => [
        'type' => 'string',
        'analyzer' => 'brazilian'
    ],
    'book_author' => [
        'type' => 'string',
        'analyzer' => 'brazilian'
    ],
    'book_editor' => [
        'type'  => 'string'
    ],
    'url' => [
        'type'  => 'string'
    ]
]

These settings do the trick for me today: when I search for a term without accents, it also matches all the Portuguese words that carry accents.

So, in my new settings for ES 5.0 I have:

'properties'    => [
    'name' => [
        'type' => 'text',
        'analyzer' => 'brazilian'
    ],
    'author' => [
        'type' => 'text',
        'analyzer' => 'brazilian'
    ],
    'editor' => [
        'type'  => 'text',
        'analyzer' => 'brazilian'
    ],
    'url' => [
        'type'  => 'text'
    ],
    'content' => [
        'type'  => 'text',
        'analyzer' => 'brazilian',
        'term_vector'   => 'with_positions_offsets',
        'store'         => true
    ]
]

But it's not doing the trick anymore... I have read a lot of the docs, and unfortunately the page where I found the previous solution has not been updated yet.
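A quick way to see what is happening is the `_analyze` API: running the accented and unaccented forms through the `brazilian` analyzer shows whether they end up as the same token (this is a debugging sketch, not something from my original setup):

```json
GET /_analyze
{
  "analyzer": "brazilian",
  "text": "pródigo prodigo"
}
```

If the two forms produce different tokens, the analyzer is the place to fix; if they produce the same token, the problem is elsewhere (for example, the field the query targets).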

So, I have tried this:

'analyzer' => [
    'brazilian' => [
        'tokenizer' => 'standard',
        'filter' => [
            'standard',
            'lowercase',
            'asciifolding'
        ]
    ]
]
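Note that a custom analyzer like this only takes effect if it is registered under the index `analysis` settings when the index is created (and existing documents are reindexed). A minimal sketch of what that registration might look like; the index name `my_index` and the analyzer name `brazilian_folded` are placeholders, and whether `asciifolding` should run before or after the stemmer is worth checking with `_analyze`:

```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "brazilian_stemmer": {
          "type": "stemmer",
          "language": "brazilian"
        }
      },
      "analyzer": {
        "brazilian_folded": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "brazilian_stemmer",
            "asciifolding"
          ]
        }
      }
    }
  },
  "mappings": {
    "docs": {
      "properties": {
        "content": {
          "type": "text",
          "analyzer": "brazilian_folded"
        }
      }
    }
  }
}
```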

With some variations as well... a lot of variations... and I still could not make it work. I also tried the language analyzers for Brazilian Portuguese and still did not get it solved.

The ingest pipeline I am using to extract the content of my PDF files is:

{
    "description": "Extract attachment information",
    "processors": [
        {
            "attachment": {
                "field": "content",
                "indexed_chars": -1
            }
        }
    ]
}
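For reference, the pipeline body above is registered with a PUT to the ingest API (the pipeline id `attachment` is a placeholder). By default the attachment processor reads the base64 source from the configured field and writes the extracted text to `attachment.content`:

```json
PUT /_ingest/pipeline/attachment
{
  "description": "Extract attachment information",
  "processors": [
    {
      "attachment": {
        "field": "content",
        "indexed_chars": -1
      }
    }
  ]
}
```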

My query is like this:

'query'     => [
    'match_phrase' => [
        'content' => [
            'query' => '(MY_SEARCH_STRING - EX. prodigo)',
            'slop' => 15
        ]
    ]
]
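The same query in raw JSON, for anyone not using the PHP client (the index name `my_index` and the search term are placeholders):

```json
GET /my_index/_search
{
  "query": {
    "match_phrase": {
      "content": {
        "query": "prodigo",
        "slop": 15
      }
    }
  }
}
```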

Any help will be appreciated.

P.S.: All code is shown as PHP arrays because I am using the PHP client.


(evert) #2

When using the ingest attachment processor, the field that should get the analyzer is the field the processor writes to, in this case attachment.content, as clarified by @dadoonet here.

So my mapping should have the following for the extracted content:

{
    "mappings": {
        "docs": {
            "properties": {
                "attachment.content": {
                    "type": "text",
                    "analyzer": "brazilian"
                }
            }
        }
    }
}
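With that mapping in place, the field-level `_analyze` API can confirm that attachment.content really uses the brazilian analyzer (the index name `my_index` is a placeholder); if the accented and unaccented forms both come back as the same token, the accent-insensitive matching works again:

```json
GET /my_index/_analyze
{
  "field": "attachment.content",
  "text": "pródigo"
}
```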

Sorry, my mistake!


(system) #3

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.