Search using ngram max_ngram


(sripri) #1

What's the best practice for pattern searches? I have implemented this using the ngram tokenizer, but it is limited by the min_gram and max_gram settings. For instance:

Setting min_gram to 3 and max_gram to 4 limits searchable terms to a minimum of 3 and a maximum of 4 characters. For example:

abc will produce hits (if any), and abcd will produce hits (if any).
If I type abcde, I won't get any hits, because the tokenizer only creates tokens that are 3 to 4 characters long.
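
To make the limitation concrete, here is a minimal Python sketch of a sliding-window n-gram generator. It is not Elasticsearch code; the function name `ngrams` is made up for illustration, and it only assumes that the tokenizer emits every substring whose length is between min_gram and max_gram.

```python
def ngrams(text, min_gram, max_gram):
    """Return every substring of `text` with length in [min_gram, max_gram].

    Hypothetical helper for illustration only; it mimics a sliding-window
    n-gram tokenizer and is not part of any Elasticsearch client.
    """
    tokens = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            end = start + size
            if end > len(text):
                break  # window would run past the end of the text
            tokens.append(text[start:end])
    return tokens


indexed = ngrams("abcde", min_gram=3, max_gram=4)
print(indexed)

# A 5-character query term is longer than max_gram, so it can never
# equal one of the indexed tokens.
for query in ("abc", "abcd", "abcde"):
    print(query, query in indexed)
```

With these settings, no indexed token is longer than 4 characters, so a query that is looked up as the single term abcde has nothing to match against.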

The use case I am trying to solve is a scalable pattern search that can accommodate a search string of any length.

What's a solution for this? I came across suggestions to create multiple indices and "alias" them, but I'm not sure what that does to performance or how it can scale to cover all of the use cases. Can someone help with a best-practice implementation?


(sripri) #2

I am more puzzled by the search results now. I ran a test against the following pattern (for example, AA-PP-001-007):

  • search for 'aa-' is successful
  • search for 'aa-p' is not successful
  • search for anything longer ('aa-p...'), up until I type in the full string, is not successful
  • search for 'aa-pp-001-007' is successful

My tokenizer uses min_gram = 3 and max_gram = 10, and it tokenizes all characters, including special characters:

{"tokens":[
{"token":"aa-","start_offset":0,"end_offset":3,"type":"word","position":0},
{"token":"aa-p","start_offset":0,"end_offset":4,"type":"word","position":1},
{"token":"aa-pp","start_offset":0,"end_offset":5,"type":"word","position":2},
{"token":"aa-pp-","start_offset":0,"end_offset":6,"type":"word","position":3},
{"token":"aa-pp-0","start_offset":0,"end_offset":7,"type":"word","position":4},
{"token":"aa-pp-00","start_offset":0,"end_offset":8,"type":"word","position":5},
{"token":"aa-pp-001","start_offset":0,"end_offset":9,"type":"word","position":6},
{"token":"aa-pp-001-","start_offset":0,"end_offset":10,"type":"word","position":7},
{"token":"a-p","start_offset":1,"end_offset":4,"type":"word","position":8},
{"token":"a-pp","start_offset":1,"end_offset":5,"type":"word","position":9},
{"token":"a-pp-","start_offset":1,"end_offset":6,"type":"word","position":10},
{"token":"a-pp-0","start_offset":1,"end_offset":7,"type":"word","position":11},
{"token":"a-pp-00","start_offset":1,"end_offset":8,"type":"word","position":12},
{"token":"a-pp-001","start_offset":1,"end_offset":9,"type":"word","position":13},
{"token":"a-pp-001-","start_offset":1,"end_offset":10,"type":"word","position":14},
{"token":"a-pp-001-0","start_offset":1,"end_offset":11,"type":"word","position":15},
{"token":"-pp","start_offset":2,"end_offset":5,"type":"word","position":16},
{"token":"-pp-","start_offset":2,"end_offset":6,"type":"word","position":17},
{"token":"-pp-0","start_offset":2,"end_offset":7,"type":"word","position":18},
{"token":"-pp-00","start_offset":2,"end_offset":8,"type":"word","position":19},
{"token":"-pp-001","start_offset":2,"end_offset":9,"type":"word","position":20},
{"token":"-pp-001-","start_offset":2,"end_offset":10,"type":"word","position":21},
{"token":"-pp-001-0","start_offset":2,"end_offset":11,"type":"word","position":22},
{"token":"-pp-001-00","start_offset":2,"end_offset":12,"type":"word","position":23},
{"token":"pp-","start_offset":3,"end_offset":6,"type":"word","position":24},
{"token":"pp-0","start_offset":3,"end_offset":7,"type":"word","position":25},
{"token":"pp-00","start_offset":3,"end_offset":8,"type":"word","position":26},
{"token":"pp-001","start_offset":3,"end_offset":9,"type":"word","position":27},
{"token":"pp-001-","start_offset":3,"end_offset":10,"type":"word","position":28},
{"token":"pp-001-0","start_offset":3,"end_offset":11,"type":"word","position":29},
{"token":"pp-001-00","start_offset":3,"end_offset":12,"type":"word","position":30},
{"token":"pp-001-007","start_offset":3,"end_offset":13,"type":"word","position":31},
{"token":"p-0","start_offset":4,"end_offset":7,"type":"word","position":32},
{"token":"p-00","start_offset":4,"end_offset":8,"type":"word","position":33},
{"token":"p-001","start_offset":4,"end_offset":9,"type":"word","position":34},
{"token":"p-001-","start_offset":4,"end_offset":10,"type":"word","position":35},
{"token":"p-001-0","start_offset":4,"end_offset":11,"type":"word","position":36},
{"token":"p-001-00","start_offset":4,"end_offset":12,"type":"word","position":37},
{"token":"p-001-007","start_offset":4,"end_offset":13,"type":"word","position":38},
{"token":"-00","start_offset":5,"end_offset":8,"type":"word","position":39},
{"token":"-001","start_offset":5,"end_offset":9,"type":"word","position":40},
{"token":"-001-","start_offset":5,"end_offset":10,"type":"word","position":41},
{"token":"-001-0","start_offset":5,"end_offset":11,"type":"word","position":42},
{"token":"-001-00","start_offset":5,"end_offset":12,"type":"word","position":43},
{"token":"-001-007","start_offset":5,"end_offset":13,"type":"word","position":44},
{"token":"001","start_offset":6,"end_offset":9,"type":"word","position":45},
{"token":"001-","start_offset":6,"end_offset":10,"type":"word","position":46},
{"token":"001-0","start_offset":6,"end_offset":11,"type":"word","position":47},
{"token":"001-00","start_offset":6,"end_offset":12,"type":"word","position":48},
{"token":"001-007","start_offset":6,"end_offset":13,"type":"word","position":49},
{"token":"01-","start_offset":7,"end_offset":10,"type":"word","position":50},
{"token":"01-0","start_offset":7,"end_offset":11,"type":"word","position":51},
{"token":"01-00","start_offset":7,"end_offset":12,"type":"word","position":52},
{"token":"01-007","start_offset":7,"end_offset":13,"type":"word","position":53},
{"token":"1-0","start_offset":8,"end_offset":11,"type":"word","position":54},
{"token":"1-00","start_offset":8,"end_offset":12,"type":"word","position":55},
{"token":"1-007","start_offset":8,"end_offset":13,"type":"word","position":56},
{"token":"-00","start_offset":9,"end_offset":12,"type":"word","position":57},
{"token":"-007","start_offset":9,"end_offset":13,"type":"word","position":58},
{"token":"007","start_offset":10,"end_offset":13,"type":"word","position":59}
]}
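
The token list above can be cross-checked with a small Python sketch. This is only an approximation of what the tokenizer appears to do (a sliding window over the lowercased string), and the helper name `ngram_tokens` is hypothetical. It shows which of the tested search strings exist as indexed tokens; it does not model how the query string itself is analyzed at search time, which also affects whether a search matches.

```python
def ngram_tokens(text, min_gram, max_gram):
    """Sliding-window n-grams with offsets, shaped like the output above.

    Hypothetical helper for illustration; not an Elasticsearch API.
    """
    out = []
    for start in range(len(text)):
        last_end = min(start + max_gram, len(text))
        for end in range(start + min_gram, last_end + 1):
            out.append({
                "token": text[start:end],
                "start_offset": start,
                "end_offset": end,
                "position": len(out),
            })
    return out


tokens = ngram_tokens("aa-pp-001-007", min_gram=3, max_gram=10)
token_set = {t["token"] for t in tokens}

print(len(tokens))  # number of tokens generated

# Which of the tested search strings exist verbatim as indexed tokens?
for query in ("aa-", "aa-p", "aa-pp-001-007"):
    print(repr(query), query in token_set)
```

Under these assumptions, 'aa-p' is present as an indexed token, while the full 13-character string is not (it is longer than max_gram = 10). So the indexed tokens alone do not account for the results I am seeing.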
