Analyzer, mapping and apostrophe


(Sébastien Vitry) #1

Hello,

I'm currently getting up to speed on Elasticsearch.
I'm now looking at analyzers and mappings in order to build a relevant search engine.
On this topic, I've read quite a few articles in addition to the documentation; I'll list them further down, as they may be useful to others.

Unfortunately, I'm still running into problems with apostrophes and accents.
To explain: a search for 'etoile' does not return my documents containing the token 'étoile'.
As for apostrophes, the text is indexed as "l'étoile" instead of just 'étoile'.
I'm attaching my configuration in case someone can point me in the right direction.
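
To make the expected behaviour concrete, here is a rough pure-Python approximation of what I would like the analysis chain to do to a token. It is not Elasticsearch code: `elide`, `fold` and `analyze` are hypothetical helper names, tokenization is simplified to splitting on whitespace, and accent folding is approximated with Unicode decomposition rather than the real asciifolding filter.

```python
import unicodedata

# Same article list as in the "elision" filter of the settings.
ARTICLES = {"l", "m", "t", "qu", "n", "s", "j", "d", "c",
            "jusqu", "quoiqu", "lorsqu", "puisqu"}


def elide(token):
    """Drop a leading article followed by an apostrophe (straight or curly)."""
    for apostrophe in ("'", "\u2019"):
        head, sep, tail = token.partition(apostrophe)
        if sep and head in ARTICLES:
            return tail
    return token


def fold(token):
    """Approximate accent folding: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze(text):
    """Whitespace split, lowercase, elision, then accent folding."""
    return [fold(elide(tok.lower())) for tok in text.split()]


print(analyze("l'étoile"))
print(analyze("etoile"))
```

If the real analyzer behaved like this sketch at both index time and search time, "l'étoile" and "etoile" would end up as the same token.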

The settings:

✘ svi@spawn  ~  http localhost:9200/user/_settings
HTTP/1.1 200 OK
Content-Length: 1039
Content-Type: application/json; charset=UTF-8
{
    "user": {
        "settings": {
            "index": {
                "analysis": {
                    "analyzer": {
                        "custom_french_analyzer": {
                            "filter": [
                                "stopwords", 
                                "asciifolding", 
                                "lowercase", 
                                "snowball", 
                                "elision", 
                                "worddelimiter", 
                                "french_stemmer"
                            ], 
                            "tokenizer": "nGram", 
                            "type": "custom"
                        }, 
                        "custom_search_analyzer": {
                            "filter": [
                                "stopwords", 
                                "asciifolding", 
                                "lowercase", 
                                "snowball", 
                                "elision", 
                                "worddelimiter", 
                                "french_stemmer"
                            ], 
                            "tokenizer": "standard", 
                            "type": "custom"
                        }, 
                        "tag_analyzer": {
                            "filter": [
                                "asciifolding", 
                                "lowercase"
                            ], 
                            "tokenizer": "keyword"
                        }
                    }, 
                    "filter": {
                        "elision": {
                            "articles": [
                                "l", 
                                "m", 
                                "t", 
                                "qu", 
                                "n", 
                                "s", 
                                "j", 
                                "d", 
                                "c", 
                                "jusqu", 
                                "quoiqu", 
                                "lorsqu", 
                                "puisqu"
                            ], 
                            "type": "elision"
                        }, 
                        "french_stemmer": {
                            "language": "light_french", 
                            "type": "stemmer"
                        }, 
                        "snowball": {
                            "language": "French", 
                            "type": "snowball"
                        }, 
                        "stopwords": {
                            "ignore_case": "true", 
                            "stopwords": "_french_", 
                            "type": "stop"
                        }, 
                        "worddelimiter": {
                            "type": "word_delimiter"
                        }
                    }, 
                    "tokenizer": {
                        "nGram": {
                            "max_gram": "20", 
                            "min_gram": "2", 
                            "type": "nGram"
                        }
                    }
                }, 
                "creation_date": "1436370441302", 
                "number_of_replicas": "1", 
                "number_of_shards": "5", 
                "uuid": "Jm75RQEuQ5-UvkeMmE-XYw", 
                "version": {
                    "created": "1050299"
                }
            }
        }
    }
}

The mapping:

svi@spawn  ~  http localhost:9200/user/_mapping 
HTTP/1.1 200 OK
Content-Length: 550
Content-Type: application/json; charset=UTF-8

{
    "user": {
        "mappings": {
            "heroes": {
                "_all": {
                    "auto_boost": true
                }, 
                "dynamic_templates": [
                    {
                        "string_fields": {
                            "mapping": {
                                "fields": {
                                    "raw": {
                                        "ignore_above": 256, 
                                        "index": "not_analyzed", 
                                        "type": "string"
                                    }
                                }, 
                                "index": "analyzed", 
                                "index_analyzer": "custom_french_analyzer", 
                                "omit_norms": true, 
                                "search_analyzer": "custom_search_analyzer", 
                                "type": "string"
                            }, 
                            "match": "*", 
                            "match_mapping_type": "string"
                        }
                    }
                ], 
                "properties": {
                    "about": {
                        "type": "string"
                    }, 
                    "age": {
                        "type": "long"
                    }, 
                    "first_name": {
                        "boost": 2.0, 
                        "type": "string"
                    }, 
                    "interests": {
                        "type": "string"
                    }, 
                    "last_name": {
                        "boost": 3.0, 
                        "type": "string"
                    }
                }
            }
        }
    }
}

Finally, here is an example of an indexed document (it's only a test dataset ;))

{
  "last_name": "Batman",
  "first_name": "Bruce Wayne",
  "age": 41,
  "about": "Batman est un personnage de fiction appartenant à l'univers de DC Comics. Créé par le dessinateur Bob Kane et le scénariste Bill Finger, il apparaît pour la première fois dans le comic book Detective Comics no 27 (date de couverture : mai 1939 mais la date réelle de parution est le 30 mars 1939) avec le nom de The Bat-Man. Bien que ce soit le succès de Superman qui a amené sa création, il se détache de ce modèle puisqu'il n'a aucun pouvoir surhumain. Il n'est qu'un humain nommé Bruce Wayne décidé à lutter contre le crime après avoir vu ses parents abattus par un voleur dans une ruelle de Gotham City, la ville où se déroulent la plupart de ses aventures. Malgré sa réputation de héros solitaire, il sait s'entourer d'alliés, comme Robin, son maître d'hôtel Alfred Pennyworth ou encore le commissaire de police James Gordon.",
  "interests": [
    "Batmobile",
    "Batarang",
    "Batgrappin"
  ]
}

And here are my sources:

Thanks in advance for your help.


(David Pilato) #2

Use the _analyze API to understand what is going on.

I'm not sure I would use an ngram tokenizer. Rather a token filter.
Also, having a different analyzer at index time and at search time is harder to reason about, though it is valid in some cases (typically autocomplete).

In short, try to write a complete script that creates an analyzer and then uses the _analyze API. I think that will make it easier to help you.
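
As a starting point, such a script could look roughly like the sketch below. It only builds and prints the two requests (index creation, then `_analyze`) without sending them, so it runs without a cluster. The index name `analyzer_test` and the analyzer name `repro_analyzer` are made up for the example, and the shortened article list is just an illustration; the host is the one used in this thread.

```python
import json
from urllib.parse import urlencode

BASE_URL = "http://localhost:9200"  # same host as in the thread
INDEX = "analyzer_test"             # hypothetical throwaway index name
ANALYZER = "repro_analyzer"         # hypothetical analyzer name


def index_settings():
    """Body for creating an index with one small custom analyzer."""
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # Shortened article list, for illustration only.
                    "elision": {"type": "elision", "articles": ["l", "d", "qu"]}
                },
                "analyzer": {
                    ANALYZER: {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["elision", "lowercase", "asciifolding"],
                    }
                },
            }
        }
    }


def analyze_url(analyzer):
    """URL of the _analyze endpoint for the given analyzer name."""
    query = urlencode({"pretty": "true", "analyzer": analyzer})
    return f"{BASE_URL}/{INDEX}/_analyze?{query}"


# Print the requests instead of sending them.
print("PUT", f"{BASE_URL}/{INDEX}")
print(json.dumps(index_settings(), indent=2))
print("GET", analyze_url(ANALYZER), "with the text to analyze as the request body")
```

Starting from one tokenizer and a couple of filters, then adding the others one at a time, makes it easier to see which step changes the tokens.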


(Sébastien Vitry) #3

Thanks for your reply.

Indeed, I had used the _analyze API to try to understand, but without success.

svi@spawn  ~  echo "bat'mobile" | http localhost:9200/user/_analyze\?pretty\=true                                                                                
HTTP/1.1 200 OK
Content-Length: 148
Content-Type: application/json; charset=UTF-8

{
    "tokens": [
        {
            "end_offset": 10, 
            "position": 1, 
            "start_offset": 0, 
            "token": "bat'mobile", 
            "type": "<ALPHANUM>"
        }
    ]
}

However, I've just checked the configuration of one of my fields.
It confirms what I suspected: the dynamic template I set up is not being applied.

svi@spawn  ~  http localhost:9200/user/_mapping/heroes/field/last_name                 
HTTP/1.1 200 OK
Content-Length: 126
Content-Type: application/json; charset=UTF-8

{
    "user": {
        "mappings": {
            "heroes": {
                "last_name": {
                    "full_name": "last_name", 
                    "mapping": {
                        "last_name": {
                            "boost": 3.0, 
                            "type": "string"
                        }
                    }
                }
            }
        }
    }
}

Would you have any idea about this?
I'm going to try setting the configuration on each field to see whether that changes anything.
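
Rather than copying the analyzer settings by hand, the explicit mapping could be generated. The sketch below is only an illustration: `with_explicit_analyzers` is a hypothetical helper, and the two dictionaries are copied from the mapping posted above. It adds the template's settings to every explicitly declared string field without overriding what the field already defines (such as `boost`).

```python
import copy

# Settings taken from the "string_fields" dynamic template above.
TEMPLATE_MAPPING = {
    "type": "string",
    "index": "analyzed",
    "index_analyzer": "custom_french_analyzer",
    "search_analyzer": "custom_search_analyzer",
}

# Explicit properties as currently declared in the mapping.
PROPERTIES = {
    "about": {"type": "string"},
    "age": {"type": "long"},
    "first_name": {"boost": 2.0, "type": "string"},
    "interests": {"type": "string"},
    "last_name": {"boost": 3.0, "type": "string"},
}


def with_explicit_analyzers(properties, template):
    """Return a copy of `properties` where string fields carry the template settings."""
    result = {}
    for name, field in properties.items():
        merged = copy.deepcopy(field)
        if field.get("type") == "string":
            for key, value in template.items():
                # Keep anything the field already sets explicitly.
                merged.setdefault(key, value)
        result[name] = merged
    return result


explicit = with_explicit_analyzers(PROPERTIES, TEMPLATE_MAPPING)
print(explicit["last_name"])
```

As far as I know, the analyzer of an already existing field cannot be changed in place, so the generated mapping would have to go into a newly created index.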


(Sébastien Vitry) #4

The analyzer itself, however, seems to work correctly.

svi@spawn  ~  echo "bat'mobile" | http localhost:9200/user/_analyze\?pretty\=true\&analyzer\=custom_french_analyzer
    HTTP/1.1 200 OK
    Content-Length: 7627
    Content-Type: application/json; charset=UTF-8
    {
        "tokens": [
            {
                "end_offset": 2, 
                "position": 1, 
                "start_offset": 0, 
                "token": "ba", 
                "type": "word"
            }, 
            {
                "end_offset": 3, 
                "position": 2, 
                "start_offset": 0, 
                "token": "bat", 
                "type": "word"
            }, 
            {
                "end_offset": 3, 
                "position": 3, 
                "start_offset": 0, 
                "token": "bat", 
                "type": "word"
            }, 
            {
                "end_offset": 3, 
                "position": 4, 
                "start_offset": 0, 
                "token": "bat", 
                "type": "word"
            }, 
            {
                "end_offset": 5, 
                "position": 5, 
                "start_offset": 4, 
                "token": "m", 
                "type": "word"
            }, 
            {
                "end_offset": 3, 
                "position": 6, 
                "start_offset": 0, 
                "token": "bat", 
                "type": "word"
            }, 
            {
                "end_offset": 6, 
                "position": 7, 
                "start_offset": 4, 
                "token": "mo", 
                "type": "word"
            }, 
            {
                "end_offset": 3, 
                "position": 8, 
                "start_offset": 0, 
                "token": "bat", 
                "type": "word"
            }, 
            {
                "end_offset": 7, 
                "position": 9, 
                "start_offset": 4, 
                "token": "mob", 
                "type": "word"
            }, 
            {
                "end_offset": 8, 
                "position": 10, 
                "start_offset": 0, 
                "token": "bat", 
                "type": "word"
            }, 
            {
                "end_offset": 8, 
                "position": 11, 
                "start_offset": 0, 
                "token": "mob", 
                "type": "word"
            }, 
            {
                "end_offset": 3, 
                "position": 12, 
                "start_offset": 0, 
                "token": "bat", 
                "type": "word"
            }, 
            {
                "end_offset": 9, 
                "position": 13, 
                "start_offset": 4, 
                "token": "mobil", 
                "type": "word"
            }, 
            {
                "end_offset": 10, 
                "position": 14, 
                "start_offset": 0, 
                "token": "bat", 
                "type": "word"
            }, 
            {
                "end_offset": 10, 
                "position": 15, 
                "start_offset": 0, 
                "token": "mobil", 
                "type": "word"
            }, 
            {
                "end_offset": 3, 
                "position": 16, 
                "start_offset": 0, 
                "token": "bat", 
                "type": "word"
            }, 
            {
                "end_offset": 10, 
                "position": 17, 
                "start_offset": 4, 
                "token": "mobil", 
                "type": "word"
            }, 
            {
                "end_offset": 3, 
                "position": 18, 
                "start_offset": 1, 
                "token": "at", 
                "type": "word"
            }, 
            {
                "end_offset": 3, 
                "position": 19, 
                "start_offset": 1, 
                "token": "at", 
                "type": "word"
            }, 
            {
                "end_offset": 3, 
                "position": 20, 
                "start_offset": 1, 
                "token": "at", 
                "type": "word"
            }, 
            {
                "end_offset": 5, 
                "position": 21, 
                "start_offset": 4, 
                "token": "m", 
                "type": "word"
            }, 
            {
                "end_offset": 3, 
                "position": 22, 
                "start_offset": 1, 
                "token": "at", 
                "type": "word"
            }, 
            {
                "end_offset": 6, 
                "position": 23, 
                "start_offset": 4, 
                "token": "mo", 
                "type": "word"
            }, 
            {
                "end_offset": 3, 
                "position": 24, 
                "start_offset": 1, 
                "token": "at", 
                "type": "word"
            }, 
            {
                "end_offset": 7, 
                "position": 25, 
                "start_offset": 4, 
                "token": "mob", 
                "type": "word"
            }, 
            {
                "end_offset": 8, 
                "position": 26, 
                "start_offset": 1, 
                "token": "at", 
                "type": "word"
            }, 
            {
                "end_offset": 8, 
                "position": 27, 
                "start_offset": 1, 
                "token": "mob", 
                "type": "word"
            }, 
            {
                "end_offset": 3, 
                "position": 28, 
                "start_offset": 1, 
                "token": "at", 
                "type": "word"
            }, 
            {
                "end_offset": 9, 
                "position": 29, 
                "start_offset": 4, 
                "token": "mobil", 
                "type": "word"
            }, 
            {
                "end_offset": 10, 
                "position": 30, 
                "start_offset": 1, 
                "token": "at", 
                "type": "word"
            }, 
            {
                "end_offset": 10, 
                "position": 31, 
                "start_offset": 1, 
                "token": "mobil", 
                "type": "word"
            }, 
            {
                "end_offset": 3, 
                "position": 32, 
                "start_offset": 1, 
                "token": "at", 
                "type": "word"
            }, 
            {
                "end_offset": 10, 
                "position": 33, 
                "start_offset": 4, 
                "token": "mobil", 
                "type": "word"
            }, 
            {
                "end_offset": 5, 
                "position": 34, 
                "start_offset": 2, 
                "token": "m", 
                "type": "word"
            }, 
            {
                "end_offset": 6, 
                "position": 35, 
                "start_offset": 2, 
                "token": "mo", 
                "type": "word"
            }, 
            {
                "end_offset": 7, 
                "position": 36, 
                "start_offset": 2, 
                "token": "mob", 
                "type": "word"
            }, 
            {
                "end_offset": 8, 
                "position": 37, 
                "start_offset": 2, 
                "token": "mob", 
                "type": "word"
            }, 
            {
                "end_offset": 9, 
                "position": 38, 
                "start_offset": 2, 
                "token": "mobil", 
                "type": "word"
            }, 
            {
                "end_offset": 10, 
                "position": 39, 
                "start_offset": 2, 
                "token": "mobil", 
                "type": "word"
            }, 
            {
                "end_offset": 11, 
                "position": 40, 
                "start_offset": 2, 
                "token": "mobil", 
                "type": "word"
            }, 
            {
                "end_offset": 5, 
                "position": 41, 
                "start_offset": 4, 
                "token": "m", 
                "type": "word"
            }, 
            {
                "end_offset": 6, 
                "position": 42, 
                "start_offset": 4, 
                "token": "mo", 
                "type": "word"
            }, 
            {
                "end_offset": 7, 
                "position": 43, 
                "start_offset": 4, 
                "token": "mob", 
                "type": "word"
            }, 
            {
                "end_offset": 8, 
                "position": 44, 
                "start_offset": 4, 
                "token": "mob", 
                "type": "word"
            }, 
            {
                "end_offset": 9, 
                "position": 45, 
                "start_offset": 4, 
                "token": "mobil", 
                "type": "word"
            }, 
            {
                "end_offset": 10, 
                "position": 46, 
                "start_offset": 4, 
                "token": "mobil", 
                "type": "word"
            }, 
            {
                "end_offset": 10, 
                "position": 47, 
                "start_offset": 4, 
                "token": "mobil", 
                "type": "word"
            }, 
            {
                "end_offset": 6, 
                "position": 48, 
                "start_offset": 4, 
                "token": "mo", 
                "type": "word"
            }, 
            {
                "end_offset": 7, 
                "position": 49, 
                "start_offset": 4, 
                "token": "mob", 
                "type": "word"
            }, 
            {
                "end_offset": 8, 
                "position": 50, 
                "start_offset": 4, 
                "token": "mob", 
                "type": "word"
            }, 
            {
                "end_offset": 9, 
                "position": 51, 
                "start_offset": 4, 
                "token": "mobil", 
                "type": "word"
            }, 
            {
                "end_offset": 10, 
                "position": 52, 
                "start_offset": 4, 
                "token": "mobil", 
                "type": "word"
            }, 
            {
                "end_offset": 10, 
                "position": 53, 
                "start_offset": 4, 
                "token": "mobil", 
                "type": "word"
            }, 
            {
                "end_offset": 7, 
                "position": 54, 
                "start_offset": 5, 
                "token": "ob", 
                "type": "word"
            }, 
            {
                "end_offset": 8, 
                "position": 55, 
                "start_offset": 5, 
                "token": "obi", 
                "type": "word"
            }, 
            {
                "end_offset": 9, 
                "position": 56, 
                "start_offset": 5, 
                "token": "obil", 
                "type": "word"
            }, 
            {
                "end_offset": 10, 
                "position": 57, 
                "start_offset": 5, 
                "token": "obil", 
                "type": "word"
            }, 
            {
                "end_offset": 10, 
                "position": 58, 
                "start_offset": 5, 
                "token": "obil", 
                "type": "word"
            }, 
            {
                "end_offset": 8, 
                "position": 59, 
                "start_offset": 6, 
                "token": "bi", 
                "type": "word"
            }, 
            {
                "end_offset": 9, 
                "position": 60, 
                "start_offset": 6, 
                "token": "bil", 
                "type": "word"
            }, 
            {
                "end_offset": 10, 
                "position": 61, 
                "start_offset": 6, 
                "token": "bil", 
                "type": "word"
            }, 
            {
                "end_offset": 10, 
                "position": 62, 
                "start_offset": 6, 
                "token": "bile", 
                "type": "word"
            }, 
            {
                "end_offset": 10, 
                "position": 64, 
                "start_offset": 7, 
                "token": "ile", 
                "type": "word"
            }, 
            {
                "end_offset": 10, 
                "position": 65, 
                "start_offset": 7, 
                "token": "ile", 
                "type": "word"
            }, 
            {
                "end_offset": 10, 
                "position": 67, 
                "start_offset": 8, 
                "token": "le", 
                "type": "word"
            }, 
            {
                "end_offset": 10, 
                "position": 68, 
                "start_offset": 9, 
                "token": "e", 
                "type": "word"
            }
        ]
    }

(Sébastien Vitry) #5

I've found where my "mistake" was.

I had assumed that "dynamic_templates" was applied by inheritance to fields that are defined manually.
Unfortunately, that's not the case. And for my own use case, I find that a pity ;).

The question I'm now asking myself: does this kind of inheritance exist?


(David Pilato) #6

Not as far as I know.

Except within the templates themselves. When a new field is created, we check whether you have defined it in the mapping. If so, we apply what you specified. If not, we go through all the templates to see whether they apply. It's more composition than inheritance.
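
The lookup order described above can be sketched in a few lines of Python. This is a simplified model and not the actual Elasticsearch implementation: `resolve_mapping` is a hypothetical function, and only the `match` and `match_mapping_type` conditions of a template are modelled.

```python
from fnmatch import fnmatch


def resolve_mapping(field_name, detected_type, explicit_properties, dynamic_templates):
    """Pick the mapping for a new field: explicit definition first, templates second."""
    # 1. A field defined in the mapping is used exactly as written.
    if field_name in explicit_properties:
        return explicit_properties[field_name]
    # 2. Otherwise the first matching dynamic template provides the mapping.
    for template in dynamic_templates:
        if (fnmatch(field_name, template["match"])
                and template["match_mapping_type"] == detected_type):
            return template["mapping"]
    # 3. No template matched: fall back to the detected type alone.
    return {"type": detected_type}


explicit_properties = {"last_name": {"boost": 3.0, "type": "string"}}
dynamic_templates = [{
    "match": "*",
    "match_mapping_type": "string",
    "mapping": {"type": "string", "index_analyzer": "custom_french_analyzer"},
}]

# Explicit field: the template is never consulted, so no analyzer is set.
print(resolve_mapping("last_name", "string", explicit_properties, dynamic_templates))
# Undeclared string field ("nickname" is a made-up name): the template applies.
print(resolve_mapping("nickname", "string", explicit_properties, dynamic_templates))
```

In this model the explicit definition and the template are never merged, which matches what the field mapping of `last_name` showed earlier in the thread.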

