Partial Search of a sentence not working as expected

GuruPrasath_Ramesh · July 31, 2019, 8:09am

I use Translate API to get the native elastic search queries from SQL queries.

At first try, we use the RLIKE query to match a partial word in a sentence.

POST /_xpack/sql/translate
{
   "query":
"SELECT * FROM c1s_may_20 
WHERE ((clause_number.keyword RLIKE '.*emisión.*' AND level = 'CLAUSE'))
              AND data_source IN ('CFR', 'PANAMA', 'FINRA')
ORDER BY clause_number.keyword DESC
LIMIT 10"
}

The translated native query for the above-translated API is as follows:

         GET c1s_may_20/_search
            {
              "size" : 10,
              "query" : {
                "bool" : {
                  "must" : [
                    {
                      "bool" : {
                        "must" : [
                          {
                            "query_string" : {
                              "query" : "/.*emisión.*/",
                              "fields" : [
                                "clause_number.keyword^1.0"
                              ],
                              "type" : "best_fields",
                              "default_operator" : "or",
                              "max_determinized_states" : 10000,
                              "enable_position_increments" : true,
                              "fuzziness" : "AUTO",
                              "fuzzy_prefix_length" : 0,
                              "fuzzy_max_expansions" : 50,
                              "phrase_slop" : 0,
                              "escape" : false,
                              "auto_generate_synonyms_phrase_query" : true,
                              "fuzzy_transpositions" : true,
                              "boost" : 1.0
                            }
                          },
                          {
                            "term" : {
                              "level.keyword" : {
                                "value" : "CLAUSE",
                                "boost" : 1.0
                              }
                            }
                          }
                        ],
                        "adjust_pure_negative" : true,
                        "boost" : 1.0
                      }
                    },
                    {
                      "terms" : {
                        "data_source.keyword" : [
                          "CFR",
                          "PANAMA",
                          "FINRA"
                        ],
                        "boost" : 1.0
                      }
                    }
                  ],
                  "adjust_pure_negative" : true,
                  "boost" : 1.0
                }
              },
              "_source" : {
                "includes" : [
                  "amendment",
                  "amendment_number",
                  "appendix",
                  "appendix_name",
                  "appendix_number",
                  "chapter_name",
                  "chapter_number",
                  "clause_number",
                  "considerations",
                  "data_source",
                  ....(continues)
                ],
                "excludes" : [\*]
              },
              "docvalue_fields" : [
                {
                  "field" : "appendix_id",
                  "format" : "use_field_mapping"
                },
                {
                  "field" : "chapter_id",
                  "format" : "use_field_mapping"
                },
                {
                  "field" : "clause_id",
                  "format" : "use_field_mapping"
                },
                {
                  "field" : "division_id",
                  "format" : "use_field_mapping"
                },
               ...(continues)
              ],
              "sort" : [
                {
                  "clause_number.keyword" : {
                    "order" : "desc",
                    "missing" : "_first",
                    "unmapped_type" : "keyword"
                  }
                }
              ]
            }

Here, it fails in case insensitive accented search.

So, we decided to take on another idea using the multi-match query with phrase_prefix.

The translate API is as follows.

POST /_xpack/sql/translate
{
  "query": "
SELECT * FROM c1s_may_20
WHERE (level = 'CLAUSE' AND MATCH('clause_number', 'emisión','type=phrase_prefix'))"
}

It produces the following native elastic search query:

GET c1s_may_20/_search
{
  "size" : 1000,
  "query" : {
    "bool" : {
      "must" : [
        {
          "term" : {
            "level.keyword" : {
              "value" : "CLAUSE",
              "boost" : 1.0
            }
          }
        },
        {
          "multi_match" : {
            "query" : "emisión",
            "fields" : [
              "clause_number^1.0"
            ],
            "type" : "phrase_prefix",
            "operator" : "OR",
            "slop" : 0,
            "prefix_length" : 0,
            "max_expansions" : 50,
            "zero_terms_query" : "NONE",
            "auto_generate_synonyms_phrase_query" : true,
            "fuzzy_transpositions" : true,
            "boost" : 1.0
          }
        }
      ],
      "adjust_pure_negative" : true,
      "boost" : 1.0
    }
  },
  "_source" : {
    "includes" : [
      "amendment",
      "amendment_number",
      "appendix",
      "appendix_name",
      "appendix_number",
      "chapter_name",
      "chapter_number",
      "clause_number",
      "considerations",
      "data_source",
      ...(continues)
    ],
    "excludes" : [ ]
  },
  "docvalue_fields" : [
    {
      "field" : "appendix_id",
      "format" : "use_field_mapping"
    },
    {
      "field" : "chapter_id",
      "format" : "use_field_mapping"
    },
    {
      "field" : "clause_id",
      "format" : "use_field_mapping"
    },
    {
      "field" : "division_id",
      "format" : "use_field_mapping"
    },
   ...(continues)
  ],
  "sort" : [
    {
      "_doc" : {
        "order" : "asc"
      }
    }
  ]
}

The above query works for case insensitive accented search. But, there is another scenario.

For example,
A document of _id='0' has a field "clause_number":"El numeral 7, literal d del artículo 2 del Acuerdo No. 5-2011, queda así:"
Another document of _id='1' has a field "clause_number" : "El artículo 11 del Acuerdo No.005-2011 queda así:"

While searching the above query with the keyword "5-2011", it should return both documents, but it fails to return document of _id='1' as it contains "005-2011" and returns the only document of "_id":'0' as we used phrase_prefix here.

Is there any alternative way to solve this problem which satisfies both the scenarios?

Andrei_Stefan · July 31, 2019, 8:52am

@GuruPrasath_Ramesh this post is so difficult to follow, the queries are unreadable and there is no formatting. I strongly suggest to format it properly next time you post. This time I'll re-format it.

GuruPrasath_Ramesh · July 31, 2019, 9:03am

Thanks for the formatting @Andrei_Stefan. Do you have any ideas to resolve it?

Andrei_Stefan · July 31, 2019, 9:10am

Have you tried the "SQL way"? SELECT * FROM test WHERE clause_number LIKE '%5-2011%'

I don't know what didn't work in your first scenario because you didn't provide any sample documents that should have matched (and didn't) or did match (and shouldn't). But the query above works for me for the two sample documents you provided.

pmusa · July 31, 2019, 9:22am

You should take a look into text analysis and how full-text search works. You could have full control of how you index and query the text.

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis.html

GuruPrasath_Ramesh · July 31, 2019, 9:43am

Hi @Andrei_Stefan @pmusa

I have attached sample documents for your reference.

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
	"total" : 5,
	"successful" : 5,
	"skipped" : 0,
	"failed" : 0
  },
  "hits" : {
	"total" : 3,
	"max_score" : 1.0,
	"hits" : [
	  {
	    "_index" : "test_local",
	    "_type" : "_doc",
	    "_id" : "0",
	    "_score" : 1.0,
	    "_source" : {
	      "subpart_number" : "Capítulo XV II",
	      "section_number" : "145",
	      "part_name" : "DEL RÉGIMEN BANCARIO",
	      "subchapter_name" : null,
	      "subsection_number" : null,
	      "is_subtitle" : 0,
	      "clause_number" : "3.   Autorizar la emisión de nuevas acciones del banco, así como su venta a terceros, al precio que el  reorganizador o la junta de reorganización determine",
	      "clause_id" : 576989,
	      "subpart_id" : 8914
	    }
	  },
	  {
	    "_index" : "test_local",
	    "_type" : "_doc",
	    "_id" : "2",
	    "_score" : 1.0,
	    "_source" : {
	      "subpart_number" : null,
	      "section_number" : "1",
	      "part_name" : "Dada en la ciudad de Panamá, a los veintidós (22) días mes de julio de dos mil catorce (2014). <br> COMUNÍQUESE, PUBLÍQUESE Y CÚMPLASE.",
	      "subchapter_name" : null,
	      "subsection_number" : null,
	      "is_subtitle" : 0,
	      "clause_number" : "El numeral 7, literal d del artículo 2 del Acuerdo No. 5-2011, queda así:",
	      "clause_id" : 581142,
	      "subpart_id" : null
	    }
	  },
	  {
	    "_index" : "test_local",
	    "_type" : "_doc",
	    "_id" : "1",
	    "_score" : 1.0,
	    "_source" : {
	      "subpart_number" : null,
	      "section_number" : "1",
	      "part_name" : "Dado en la ciudad de Panamá, a los diecinueve (19) días del mes de junio de dos mil doce (2012). <br> COMUNÍQUESE, PUBLÍQUESE Y CÚMPLASE.",
	      "subchapter_name" : null,
	      "subsection_number" : null,
	      "is_subtitle" : 0,
	      "clause_number" : "El artículo 11 del Acuerdo No.005-2011 queda así:",
	      "clause_id" : 580778,
	      "subpart_id" : null
	    }
	  }
	]
  }
}

The LIKE query works for '5-2011' scenario correctly.
Another scenario is,

POST /_xpack/sql/translate
{
"query":"SELECT * FROM test_local WHERE ((clause_number.keyword LIKE '%emisión%' AND level = 'CLAUSE')) AND data_source IN ('CFR', 'PANAMA', 'FINRA') ORDER BY clause_number.keyword DESC LIMIT 10"
}

The output of the query will be,

{
"took" : 12,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : null,
"hits" : [
  {
	"_index" : "test_local",
	"_type" : "_doc",
	"_id" : "0",
	"_score" : null,
	"_source" : {
	  "subpart_number" : "Capítulo XV II",
	  "chapter_name" : "Panama Superintendency of Banks - Panama Banking Law",
	  "clause_number" : "3.   Autorizar la *emisión* de nuevas acciones del banco, así como su venta a terceros, al precio que el  reorganizador o la junta de reorganización determine",

	  "subsection_number" : null,
	  "subchapter_name" : null,
	  "part_name" : "DEL RÉGIMEN BANCARIO",
	  "section_number" : "145"
	},
	"fields" : {
	  "title_number" : [
	    52
	  ],
	  "is_subtitle" : [
	    0
	  ],
	  "subpart_id" : [
	    8914
	  ],
	  "section_id" : [
	    88336
	  ],
	  "clause_id" : [
	    576989
	  ],
	  "is_subpart" : [
	    1
	  ],
	  "title_id" : [
	    2
	  ],
	  "is_subsection" : [
	    0
	  ],
	  "chapter_id" : [
	    26
	  ],
	  "part_id" : [
	    6309
	  ],
	  "is_division" : [
	    0
	  ],
	  "is_subchapter" : [
	    0
	  ]
	},
	"sort" : [
	  "3.   Autorizar la emisión de nuevas acciones del banco, así como su venta a terceros, al precio que el  reorganizador o la junta de reorganización determine"
	]
  }
]
}
}

The query string is highlighted in "clause_number"

But while searching with the keyword "Emisión", it doesn't work.

POST /_xpack/sql/translate
{
"query":"SELECT * FROM test_local WHERE ((clause_number.keyword LIKE '%Emisión%' AND level = 'CLAUSE')) AND data_source IN ('CFR', 'PANAMA', 'FINRA') ORDER BY clause_number.keyword DESC LIMIT 10"
}

The output of the above query is:

{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
	"total" : 5,
	"successful" : 5,
	"skipped" : 0,
	"failed" : 0
  },
  "hits" : {
	"total" : 0,
	"max_score" : null,
	"hits" : [ ]
  }
}

Now, could you able to understand the problem?

Andrei_Stefan · July 31, 2019, 9:57am

@GuruPrasath_Ramesh if you expect Emisión (uppercase E) to match emisión (lowercase e) then LIKE won't work, because this is terms matching (no analysis of the passed in text). Maybe you should index your documents differently and use sub-fields and various analyzers, then build a more complex query that should use LIKE, MATCH and QUERY at the same time on various fields/sub-fields.

GuruPrasath_Ramesh · July 31, 2019, 10:07am

@Andrei_Stefan I have used lowercase and ASCII folding analyzers in the mapping.

{
  "test_local" : {
	"settings" : {
	  "index" : {
	    "number_of_shards" : "5",
	    "provided_name" : "test_local",
	    "creation_date" : "1564564952829",
	    "analysis" : {
	      "filter" : {
	        "my_ascii_folding" : {
	          "type" : "asciifolding",
	          "preserve_original" : "true"
	        }
	      },
	      "analyzer" : {
	        "folding" : {
	          "filter" : [
	            "lowercase",
	            "my_ascii_folding"
	          ],
	          "tokenizer" : "standard"
	        }
	      }
	    },
	    "number_of_replicas" : "1",
	    "uuid" : "oOrOrvQVQNu612PaLmJHiw",
	    "version" : {
	      "created" : "6060299"
	    }
	  }
	}
  }
}

The problem is that I have searched with clause_number.keyword for which the analyzer won't work. At the same time, I cannot search that without .keyword.

And also I can't use a normalizer for this keyword, as it emits more than one token.

That's why I came up here for a solution.

GuruPrasath_Ramesh · July 31, 2019, 10:22am

@Andrei_Stefan As you said, I have tried the multi-match query with 'type=phrase_prefix' with the query_string:'5-2011'

POST /_xpack/sql/translate
{
  "query": "SELECT * FROM test_local WHERE (level = 'CLAUSE' AND MATCH('clause_number', '5-2011','type=phrase_prefix'))"
}

The output of the query is:

{
  "took" : 37,
  "timed_out" : false,
  "_shards" : {
	"total" : 5,
	"successful" : 5,
	"skipped" : 0,
	"failed" : 0
  },
  "hits" : {
	"total" : 1,
	"max_score" : null,
	"hits" : [
	  {
	    "_index" : "test_local",
	    "_type" : "_doc",
	    "_id" : "2",
	    "_score" : null,
	    "_source" : {
	      "subpart_number" : null,
	      "level" : "CLAUSE",
	      "section_name" : "ARTICULO 1",
	      "chapter_number" : "00",
	      "chapter_name" : "Panama Superintendency of Banks - Panama Banking Regulations",
	      "clause_number" : "El numeral 7, literal d del artículo 2 del Acuerdo No. 5-2011, queda así:",
	      "subpart_name" : null,
	      "data_source" : "PANAMA",
	      "section_number" : "1"
	    },
	    "fields" : {
	      "title_number" : [
	        1
	      ],
	      "is_subtitle" : [
	        0
	      ],
	      "section_id" : [
	        89589
	      ],
	      "clause_id" : [
	        581142
	      ],
	      "is_subpart" : [
	        0
	      ],
	      "title_id" : [
	        3
	      ],
	      "is_subsection" : [
	        0
	      ],
	      "chapter_id" : [
	        110
	      ],
	      "part_id" : [
	        6443
	      ],
	      "is_division" : [
	        0
	      ],
	      "is_subchapter" : [
	        0
	      ]
	    },
	    "sort" : [
	      0
	    ]
	  }
	]
  }
}

It does not satisfy the 5-2011 scenario as it should return '005-2011'

Are there any ways to use the LIKE and MATCH query both in the same SQL Translate API?

Andrei_Stefan · July 31, 2019, 10:37am

SELECT * FROM test WHERE clause_number LIKE '%5-2011%' OR MATCH('clause_number', '%5-2011%', 'type=phrase_prefix') OR QUERY('clause_number:*5-2011*')

GuruPrasath_Ramesh · July 31, 2019, 11:12am

@Andrei_Stefan Thanks a lot. It works!

GuruPrasath_Ramesh · August 6, 2019, 9:11am

There arises another problem here.

POST /_xpack/sql/translate
{
"query":"SELECT * FROM vedant WHERE  ( (title_name.keyword RLIKE '.*emision.*' AND level = 'TITLE') OR (level = 'TITLE' AND MATCH('title_name','emision','type=phrase_prefix')) OR (title_number.keyword RLIKE '.*emision.*' AND level = 'TITLE') OR (level = 'TITLE' AND MATCH('title_number','emision','type=phrase_prefix')) OR (subtitle_number.keyword RLIKE '.*emision.*' AND level = 'SUBTITLE') OR (level = 'SUBTITLE' AND MATCH('subtitle_number','emision','type=phrase_prefix')) OR (subtitle_name.keyword RLIKE '.*emision.*' AND level = 'SUBTITLE') OR (level = 'SUBTITLE' AND MATCH('subtitle_name','emision','type=phrase_prefix')) OR (chapter_name.keyword RLIKE '.*emision.*' AND level = 'CHAPTER') OR (level = 'CHAPTER' AND MATCH('chapter_name','emision','type=phrase_prefix')) OR (chapter_number.keyword RLIKE '.*emision.*' AND level = 'CHAPTER') OR (level = 'CHAPTER' AND MATCH('chapter_number','emision','type=phrase_prefix')) OR (subchapter_name.keyword RLIKE '.*emision.*' AND level = 'SUBCHAPTER') OR (level = 'SUBCHAPTER' AND MATCH('subchapter_name','emision','type=phrase_prefix')) OR (subchapter_number.keyword RLIKE '.*emision.*' AND level = 'SUBCHAPTER') OR (level = 'SUBCHAPTER' AND MATCH('subchapter_number','emision','type=phrase_prefix')) OR (part_name.keyword RLIKE '.*emision.*' AND level = 'PART') OR (level = 'PART' AND MATCH('part_name','emision','type=phrase_prefix')) OR (considerations.keyword RLIKE '.*emision.*' AND level = 'PART') OR (level = 'PART' AND MATCH('considerations','emision','type=phrase_prefix')) OR (part_number.keyword RLIKE '.*emision.*' AND level = 'PART') OR (level = 'PART' AND MATCH('part_number','emision','type=phrase_prefix')) OR (subpart_number.keyword RLIKE '.*emision.*' AND level = 'SUBPART') OR (level = 'SUBPART' AND MATCH('subpart_number','emision','type=phrase_prefix')) OR (subpart_name.keyword RLIKE '.*emision.*' AND level = 'SUBPART') OR (level = 'SUBPART' AND MATCH('subpart_name','emision','type=phrase_prefix')) OR (division_number.keyword RLIKE '.*emision.*' AND level = 'DIVISION') OR (level = 'DIVISION' AND MATCH('division_number','emision','type=phrase_prefix')) OR (division_name.keyword RLIKE '.*emision.*' AND level = 'DIVISION') OR (level = 'DIVISION' AND MATCH('division_name','emision','type=phrase_prefix')) OR (section_number.keyword RLIKE '.*emision.*' AND level = 'SECTION') OR (level = 'SECTION' AND MATCH('section_number','emision','type=phrase_prefix')) OR (section_name.keyword RLIKE '.*emision.*' AND level = 'SECTION') OR (level = 'SECTION' AND MATCH('section_name','emision','type=phrase_prefix')) OR (subsection_number.keyword RLIKE '.*emision.*' AND level = 'SUBSECTION') OR (level = 'SUBSECTION' AND MATCH('subsection_number','emision','type=phrase_prefix')) OR (subsection_name.keyword RLIKE '.*emision.*' AND level = 'SUBSECTION') OR (level = 'SUBSECTION' AND MATCH('subsection_name','emision','type=phrase_prefix')) OR (interpretation_number.keyword RLIKE '.*emision.*' AND level = 'INTERPRETATION') OR (level = 'INTERPRETATION' AND MATCH('interpretation_number','emision','type=phrase_prefix')) OR (interpretation_name.keyword RLIKE '.*emision.*' AND level = 'INTERPRETATION') OR (level = 'INTERPRETATION' AND MATCH('interpretation_name','emision','type=phrase_prefix')) OR (interpretation.keyword RLIKE '.*emision.*' AND level = 'INTERPRETATION') OR (level = 'INTERPRETATION' AND MATCH('interpretation','emision','type=phrase_prefix')) OR (clause_number.keyword RLIKE '.*emision.*' AND level = 'CLAUSE') OR (level = 'CLAUSE' AND MATCH('clause_number','emision','type=phrase_prefix')) OR (footnote.keyword RLIKE '.*emision.*' AND level = 'FOOTNOTE') OR (level = 'FOOTNOTE' AND MATCH('footnote','emision','type=phrase_prefix')) OR (footnote_number.keyword RLIKE '.*emision.*' AND level = 'FOOTNOTE') OR (level = 'FOOTNOTE' AND MATCH('footnote_number','emision','type=phrase_prefix')) OR (appendix_number.keyword RLIKE '.*emision.*' AND level = 'APPENDIX') OR (level = 'APPENDIX' AND MATCH('appendix_number','emision','type=phrase_prefix')) OR (appendix_name.keyword RLIKE '.*emision.*' AND level = 'APPENDIX') OR (level = 'APPENDIX' AND MATCH('appendix_name','emision','type=phrase_prefix')) OR (appendix.keyword RLIKE '.*emision.*' AND level = 'APPENDIX') OR (level = 'APPENDIX' AND MATCH('appendix','emision','type=phrase_prefix')) OR (amendment.keyword RLIKE '.*emision.*' AND level = 'AMENDMENT') OR (level = 'AMENDMENT' AND MATCH('amendment','emision','type=phrase_prefix')) OR (amendment_number.keyword RLIKE '.*emision.*' AND level = 'AMENDMENT') OR (level = 'AMENDMENT' AND MATCH('amendment_number','emision','type=phrase_prefix')) ) AND data_source IN ('CFR', 'PANAMA', 'FINRA') ORDER BY part_id DESC  LIMIT 10"
}

The output for the above translate API is,

{
"error": {
  "root_cause": [
    {
      "type": "parsing_exception",
      "reason": "line 1:4363: SQL statement too large; halt parsing to prevent memory errors (stopped at depth 200)"
    }
  ],
  "type": "parsing_exception",
  "reason": "line 1:4363: SQL statement too large; halt parsing to prevent memory errors (stopped at depth 200)"
},
"status": 400
}

Elastic Search version:

{
"name" : "Guruprasath",
"cluster_name" : "elasticsearch",
"cluster_uuid" : "j3e2NlceTn2LqnVtZzxDRg",
"version" : {
  "number" : "7.0.1",
  "build_flavor" : "default",
  "build_type" : "tar",
  "build_hash" : "e4efcb5",
  "build_date" : "2019-04-29T12:56:03.145736Z",
  "build_snapshot" : false,
  "lucene_version" : "8.0.0",
  "minimum_wire_compatibility_version" : "6.7.0",
  "minimum_index_compatibility_version" : "6.0.0-beta1"
},
"tagline" : "You Know, for Search"
}

Are there any ways to overcome the following problem?

Andrei_Stefan · August 7, 2019, 9:17am

In case you were not aware: https://www.elastic.co/guide/en/elasticsearch/reference/current/sql-limitations.html#large-parsing-trees

I see a pattern with your conditions and I think you could simplify 4 conditions for the same level like (title_name.keyword RLIKE '.*emision.*' AND level = 'TITLE') OR (level = 'TITLE' AND MATCH('title_name','emision','type=phrase_prefix')) OR (title_number.keyword RLIKE '.*emision.*' AND level = 'TITLE') OR (level = 'TITLE' AND MATCH('title_number','emision','type=phrase_prefix'))

to something like

((title_name.keyword RLIKE '.*emision.*' OR MATCH('title_name','emision','type=phrase_prefix') OR (title_number.keyword RLIKE '.*emision.*') OR (MATCH('title_number','emision','type=phrase_prefix'))) AND level = 'TITLE')

Basically (A AND X) OR (B AND X) OR (C AND X) OR (D AND X) is equivalent with (A OR B OR C OR D) AND X.

system · September 4, 2019, 9:18am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Partial phrase or exact phrase matching Elasticsearch	10	7367	August 20, 2020
SQL Translate does not work well Elasticsearch	2	511	March 25, 2020
Elasticsearch query like not working when search number and string in 1 words using wildcardquery and match_phrase Elasticsearch	1	587	September 20, 2019
Ctrl+F search behavior in elastic Elasticsearch	17	22	November 26, 2024
Need Help for ES query Elasticsearch	2	322	August 23, 2018

Partial Search of a sentence not working as expected

Related topics