Issue with nGram Analyzer on ECE

Francisco_Noguera · November 21, 2019, 7:18pm

Hi, I'm creating an index on ECE with an nGram analyzer, but the query never works. This is my index:

{
	"settings": {
		"analysis": {
			"analyzer": {
				"ailabs_analyzer": {
					"type": "stop",
					"stopwords": ["a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the", "their", "then", "there", "these", "they", "this", "to", "was", "will", "with", "any", "than", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t"]
				},
				"ngram_analyzer": {
					"type": "custom",
					"tokenizer": "ngram_tokenizer",
					"filter": [
						"lowercase"
					]
				}
			},
			"tokenizer": {
				"ngram_tokenizer": {
					"type": "nGram",
					"min_gram": 3,
					"max_gram": 4
				}
			}
		}
	},
	"mappings": {
		"properties": {
			"documentId": {
				"type": "keyword"
			},
			"documentName": {
				"type": "text",
				"fielddata": true,
				"fields": {
					"keyword": {
						"type": "keyword"
					}
				}
			},
			"documentType": {
				"type": "text",
				"fields": {
					"sortable": {
						"type": "keyword"
					}
				}
			},
			"dateCreated": {
				"type": "keyword"
			},
			"userCreated": {
				"type": "text",
				"fields": {
					"sortable": {
						"type": "keyword"
					}
				}
			},
			"language": {
				"type": "text",
				"fields": {
					"sortable": {
						"type": "keyword"
					}
				}
			},
			"unitOfAnalysis": {
				"properties": {
					"uoaId": {
						"type": "text"
					},
					"unitOfAnalysis": {
						"type": "text",
						"analyzer": "ngram_analyzer"
					},
					"page": {
						"type": "integer"
					},
					"index": {
						"type": "integer"
					},
					"percent": {
						"type": "text",
						"analyzer": "ailabs_analyzer",
						"fielddata": true,
						"fields": {
							"keyword": {
								"type": "keyword"
							}
						}
					},
					"duration": {
						"type": "text",
						"analyzer": "ailabs_analyzer",
						"fielddata": true,
						"fields": {
							"keyword": {
								"type": "keyword"
							}
						}
					},
					"org": {
						"type": "text",
						"analyzer": "ailabs_analyzer",
						"fielddata": true,
						"fields": {
							"keyword": {
								"type": "keyword"
							}
						}
					},
					"date": {
						"type": "text",
						"analyzer": "ailabs_analyzer",
						"fielddata": true,
						"fields": {
							"keyword": {
								"type": "keyword"
							}
						}
					},
					"cardinal": {
						"type": "text",
						"analyzer": "ailabs_analyzer",
						"fielddata": true,
						"fields": {
							"keyword": {
								"type": "keyword"
							}
						}
					},
					"ordinal": {
						"type": "text",
						"analyzer": "ailabs_analyzer",
						"fielddata": true,
						"fields": {
							"keyword": {
								"type": "keyword"
							}
						}
					},
					"gpe": {
						"type": "text",
						"analyzer": "ailabs_analyzer",
						"fielddata": true,
						"fields": {
							"keyword": {
								"type": "keyword"
							}
						}
					},
					"person": {
						"type": "text",
						"analyzer": "ailabs_analyzer",
						"fielddata": true,
						"fields": {
							"keyword": {
								"type": "keyword"
							}
						}
					},
					"work_of_art": {
						"type": "text",
						"analyzer": "ailabs_analyzer",
						"fielddata": true,
						"fields": {
							"keyword": {
								"type": "keyword"
							}
						}
					},
					"time": {
						"type": "text",
						"analyzer": "ailabs_analyzer",
						"fielddata": true,
						"fields": {
							"keyword": {
								"type": "keyword"
							}
						}
					},
					"law": {
						"type": "text",
						"analyzer": "ailabs_analyzer",
						"fielddata": true,
						"fields": {
							"keyword": {
								"type": "keyword"
							}
						}
					},
					"money": {
						"type": "text",
						"analyzer": "ailabs_analyzer",
						"fielddata": true,
						"fields": {
							"keyword": {
								"type": "keyword"
							}
						}
					},
					"loc": {
						"type": "text",
						"analyzer": "ailabs_analyzer",
						"fielddata": true,
						"fields": {
							"keyword": {
								"type": "keyword"
							}
						}
					},
					"frequency": {
						"type": "text",
						"analyzer": "ailabs_analyzer",
						"fielddata": true,
						"fields": {
							"keyword": {
								"type": "keyword"
							}
						}
					},
					"fac": {
						"type": "text",
						"analyzer": "ailabs_analyzer",
						"fielddata": true,
						"fields": {
							"keyword": {
								"type": "keyword"
							}
						}
					},
					"norp": {
						"type": "text",
						"analyzer": "ailabs_analyzer",
						"fielddata": true,
						"fields": {
							"keyword": {
								"type": "keyword"
							}
						}
					},
					"quantity": {
						"type": "text",
						"analyzer": "ailabs_analyzer",
						"fielddata": true,
						"fields": {
							"keyword": {
								"type": "keyword"
							}
						}
					},
					"language": {
						"type": "text",
						"analyzer": "ailabs_analyzer",
						"fielddata": true,
						"fields": {
							"keyword": {
								"type": "keyword"
							}
						}
					}
				}
			}
		}
	}
}

This is my query:

{
	"query": {
		"query_string": {
			"query": "unitOfAnalysis.unitOfAnalisys:memb",
			"fields": []
		}
	}
}

If I search for 'memb' I get no results; if I search for 'member' I get 12 results; and if I search for 'members' I get 17 results. Any ideas about what I'm doing wrong?

Christian_Dahlqvist · November 22, 2019, 6:17am

What does the document you are expecting to match with your query look like? There also seems to be a typo: in your schema you have specified the field as unitOfAnalysis, while in the query it is unitOfAnalisys.

Francisco_Noguera · November 22, 2019, 5:38pm

I have now fixed the query, but I don't get any results with either of these:

{
	"query": {
		"query_string": {
			"query": "unitOfAnalysis.unitOfAnalysis:memb",
			"fields": []
		}
	}
}

{
	"query": {
		"query_string": {
			"query": "memb",
			"fields": ["unitOfAnalysis.unitOfAnalysis"]
		}
	}
}

Christian_Dahlqvist · November 22, 2019, 6:21pm

What does the document you expect to match look like? Please provide a minimal example that reproduces the issue.

Francisco_Noguera · November 22, 2019, 6:30pm

For example, if I search without specifying fields, I get the following response:

{
	"query": {
		"query_string": {
			"query": "member",
			"fields": []
		}
	}
}

Result:

{
  "took": 995,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 12,
      "relation": "eq"
    },
    "max_score": 8.96235,
    "hits": [
      {
        "_index": "document",
        "_type": "_doc",
        "_id": "5dcc82289f6c34f13f75275a",
        "_score": 8.96235,
        "_source": {
          "documentId": "35",
          "documentName": "GB020_Hull _MARINA_COURT_CASTLE_STREET_Nov 2012.PDF",
          "documentType": "real estate",
          "unitOfAnalysis": {
            "uoaId": "5dcc82289f6c34f13f75275a",
            "unitOfAnalisys": "Designated Member 49 g:\\031793-140534\\01752511.doc",
            "page": 53,
            "index": 3,
            "percent": null,
            "duration": null,
            "org": [
              "g:\\031793"
            ],
            "date": null,
            "cardinal": [
              "49"
            ],
            "ordinal": null,
            "gpe": null,
            "person": null,
            "workOfArt": null,
            "time": null,
            "law": null,
            "money": null,
            "loc": null,
            "frequency": null,
            "fac": null,
            "norp": null,
            "quantity": null,
            "language": null
          },
          "entities": null,
          "language": "en",
          "dateCreated": 1573665516381,
          "userCreated": "",
          "file": null
        }
      },
      {
        "_index": "document",
        "_type": "_doc",
        "_id": "5dcc5c700c03bfac18f27d63",
        "_score": 7.921951,
        "_source": {
          "documentId": "33",
          "documentName": "GB_Telford_FullerHouse_Headlease_12.011.pdf",
          "documentType": "real estate",
          "unitOfAnalysis": {
            "uoaId": "5dcc5c700c03bfac18f27d63",
            "unitOfAnalisys": "3.10.9 The foregoing provisions of this clause 3.10 shall not apply to any parting with possession or occupation or the sharing of occupation or sub-division of the Demised Premises to or with any member of a group of companies of which the Tenant is itself a member upon the conditions that:-",
            "page": 14,
            "index": 2,
            "percent": null,
            "duration": null,
            "org": [
              "the Demised Premises",
              "Tenant"
            ],
            "date": null,
            "cardinal": [
              "3.10"
            ],
            "ordinal": null,
            "gpe": null,
            "person": null,
            "workOfArt": null,
            "time": null,
            "law": null,
            "money": null,
            "loc": null,
            "frequency": null,
            "fac": null,
            "norp": null,
            "quantity": null,
            "language": null
          },
          "entities": null,
          "language": "en",
          "dateCreated": 1573655586490,
          "userCreated": "",
          "file": null
        }
      },
      {
        "_index": "document",
        "_type": "_doc",
        "_id": "5dcc82279f6c34f13f752759",
        "_score": 7.749855,
        "_source": {
          "documentId": "35",
          "documentName": "GB020_Hull _MARINA_COURT_CASTLE_STREET_Nov 2012.PDF",
          "documentType": "real estate",
          "unitOfAnalysis": {
            "uoaId": "5dcc82279f6c34f13f752759",
            "unitOfAnalisys": "Signed as a deed by XYZ LLP ) acting by two designated members and ) delivered at the date hereof: ) ) Designated Member",
            "page": 53,
            "index": 2,
            "percent": null,
            "duration": null,
            "org": [
              "XYZ LLP"
            ],
            "date": [
              "the date hereof"
            ],
            "cardinal": [
              "two"
            ],
            "ordinal": null,
            "gpe": null,
            "person": null,
            "workOfArt": null,
            "time": null,
            "law": null,
            "money": null,
            "loc": null,
            "frequency": null,
            "fac": null,
            "norp": null,
            "quantity": null,
            "language": null
          },
          "entities": null,
          "language": "en",
          "dateCreated": 1573665516381,
          "userCreated": "",
          "file": null
        }
      }, ...

Since I tokenize into 3- and 4-character n-grams, I expect to get results when querying the field that has the analyzer, for example with this query:

{
	"query": {
		"query_string": {
			"query": "memb",
			"fields": ["unitOfAnalysis.unitOfAnalysis"]
		}
	}
}

I also expect more results than for "member" alone, because words such as "members" or "membership" should be matched too, since they contain tokens related to the query.
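To illustrate that expectation, here is a minimal Python sketch of character n-gram tokenization with min_gram 3 and max_gram 4, followed by lowercasing. It is only an approximation of what the ngram tokenizer does, not Elasticsearch code; the function name `ngrams` is made up for the example, and it assumes the tokenizer keeps all characters (no token_chars configured, as in the settings above).

```python
def ngrams(text, min_gram=3, max_gram=4):
    """Return lowercased character n-grams of text, in positional order."""
    text = text.lower()
    grams = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            # Only emit grams that fit entirely inside the text.
            if start + size <= len(text):
                grams.append(text[start:start + size])
    return grams


print(ngrams("member"))
# The partial term "memb" is one of the grams of longer words too.
print("memb" in ngrams("Designated Member"))
print("memb" in ngrams("membership"))
```

If the field were really indexed this way, "memb" would exist as a token for "member", "members" and "membership" alike, which is why a partial search would be expected to match.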

Christian_Dahlqvist · November 22, 2019, 6:37pm

The field in your document has the same spelling issue and does not match the field you specified the ngram analyzer for in your mapping. If you look at the mappings for the index, I believe you will find that the field you are querying has the default mapping, which means it is a standard text field without ngrams. This explains why the full word matches but partials do not.
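The mismatch can be seen offline with a small Python sketch that flattens a trimmed copy of the mapping from the first post into full dotted field paths and looks up both spellings. The helper `flatten_mapping` is hypothetical and written only for this illustration; on a live cluster, the get mapping API (GET /document/_mapping) would be the place to check what was actually created.

```python
def flatten_mapping(properties, prefix=""):
    """Map each full dotted field path to its analyzer (None if unset)."""
    paths = {}
    for name, spec in properties.items():
        path = prefix + name
        if "properties" in spec:
            # Object field: recurse into its sub-fields.
            paths.update(flatten_mapping(spec["properties"], path + "."))
        else:
            paths[path] = spec.get("analyzer")
    return paths


# Trimmed copy of the explicit mapping from the original post.
mapping = {
    "documentId": {"type": "keyword"},
    "unitOfAnalysis": {
        "properties": {
            "uoaId": {"type": "text"},
            "unitOfAnalysis": {"type": "text", "analyzer": "ngram_analyzer"},
            "page": {"type": "integer"},
        }
    },
}

fields = flatten_mapping(mapping)
mapped = "unitOfAnalysis.unitOfAnalysis"   # spelling in the mapping
indexed = "unitOfAnalysis.unitOfAnalisys"  # spelling in the documents

print(fields.get(mapped))   # analyzer configured for the mapped field
print(indexed in fields)    # whether the indexed field was explicitly mapped
```

The field with the document's spelling is absent from the explicit mapping, so it would not pick up ngram_analyzer; the text actually lives in a different field than the one the analyzer was configured for.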

I am also not convinced your mappings are valid, as the same field name is defined both as an object and as a string, although at different points in the hierarchy. Which version of Elasticsearch are you on?

system · December 20, 2019, 6:37pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
