Partial Search of a sentence not working as expected

I use Translate API to get the native elastic search queries from SQL queries.

At first try, we use the RLIKE query to match a partial word in a sentence.

POST /_xpack/sql/translate
{
   "query":
"SELECT * FROM c1s_may_20 
WHERE ((clause_number.keyword RLIKE '.*emisión.*' AND level = 'CLAUSE'))
              AND data_source IN ('CFR', 'PANAMA', 'FINRA')
ORDER BY clause_number.keyword DESC
LIMIT 10"
}

The translated native query for the above-translated API is as follows:

         GET c1s_may_20/_search
            {
              "size" : 10,
              "query" : {
                "bool" : {
                  "must" : [
                    {
                      "bool" : {
                        "must" : [
                          {
                            "query_string" : {
                              "query" : "/.*emisión.*/",
                              "fields" : [
                                "clause_number.keyword^1.0"
                              ],
                              "type" : "best_fields",
                              "default_operator" : "or",
                              "max_determinized_states" : 10000,
                              "enable_position_increments" : true,
                              "fuzziness" : "AUTO",
                              "fuzzy_prefix_length" : 0,
                              "fuzzy_max_expansions" : 50,
                              "phrase_slop" : 0,
                              "escape" : false,
                              "auto_generate_synonyms_phrase_query" : true,
                              "fuzzy_transpositions" : true,
                              "boost" : 1.0
                            }
                          },
                          {
                            "term" : {
                              "level.keyword" : {
                                "value" : "CLAUSE",
                                "boost" : 1.0
                              }
                            }
                          }
                        ],
                        "adjust_pure_negative" : true,
                        "boost" : 1.0
                      }
                    },
                    {
                      "terms" : {
                        "data_source.keyword" : [
                          "CFR",
                          "PANAMA",
                          "FINRA"
                        ],
                        "boost" : 1.0
                      }
                    }
                  ],
                  "adjust_pure_negative" : true,
                  "boost" : 1.0
                }
              },
              "_source" : {
                "includes" : [
                  "amendment",
                  "amendment_number",
                  "appendix",
                  "appendix_name",
                  "appendix_number",
                  "chapter_name",
                  "chapter_number",
                  "clause_number",
                  "considerations",
                  "data_source",
                  ....(continues)
                ],
                "excludes" : [\*]
              },
              "docvalue_fields" : [
                {
                  "field" : "appendix_id",
                  "format" : "use_field_mapping"
                },
                {
                  "field" : "chapter_id",
                  "format" : "use_field_mapping"
                },
                {
                  "field" : "clause_id",
                  "format" : "use_field_mapping"
                },
                {
                  "field" : "division_id",
                  "format" : "use_field_mapping"
                },
               ...(continues)
              ],
              "sort" : [
                {
                  "clause_number.keyword" : {
                    "order" : "desc",
                    "missing" : "_first",
                    "unmapped_type" : "keyword"
                  }
                }
              ]
            }

Here, it fails in case insensitive accented search.

So, we decided to take on another idea using the multi-match query with phrase_prefix.

The translate API is as follows.

POST /_xpack/sql/translate
{
  "query": "
SELECT * FROM c1s_may_20
WHERE (level = 'CLAUSE' AND MATCH('clause_number', 'emisión','type=phrase_prefix'))"
}

It produces the following native elastic search query:

GET c1s_may_20/_search
{
  "size" : 1000,
  "query" : {
    "bool" : {
      "must" : [
        {
          "term" : {
            "level.keyword" : {
              "value" : "CLAUSE",
              "boost" : 1.0
            }
          }
        },
        {
          "multi_match" : {
            "query" : "emisión",
            "fields" : [
              "clause_number^1.0"
            ],
            "type" : "phrase_prefix",
            "operator" : "OR",
            "slop" : 0,
            "prefix_length" : 0,
            "max_expansions" : 50,
            "zero_terms_query" : "NONE",
            "auto_generate_synonyms_phrase_query" : true,
            "fuzzy_transpositions" : true,
            "boost" : 1.0
          }
        }
      ],
      "adjust_pure_negative" : true,
      "boost" : 1.0
    }
  },
  "_source" : {
    "includes" : [
      "amendment",
      "amendment_number",
      "appendix",
      "appendix_name",
      "appendix_number",
      "chapter_name",
      "chapter_number",
      "clause_number",
      "considerations",
      "data_source",
      ...(continues)
    ],
    "excludes" : [ ]
  },
  "docvalue_fields" : [
    {
      "field" : "appendix_id",
      "format" : "use_field_mapping"
    },
    {
      "field" : "chapter_id",
      "format" : "use_field_mapping"
    },
    {
      "field" : "clause_id",
      "format" : "use_field_mapping"
    },
    {
      "field" : "division_id",
      "format" : "use_field_mapping"
    },
   ...(continues)
  ],
  "sort" : [
    {
      "_doc" : {
        "order" : "asc"
      }
    }
  ]
}

The above query works for case insensitive accented search. But, there is another scenario.

For example,
A document of _id='0' has a field "clause_number":"El numeral 7, literal d del artículo 2 del Acuerdo No. 5-2011, queda así:"
Another document of _id='1' has a field "clause_number" : "El artículo 11 del Acuerdo No.005-2011 queda así:"

While searching the above query with the keyword "5-2011", it should return both documents, but it fails to return document of _id='1' as it contains "005-2011" and returns the only document of "_id":'0' as we used phrase_prefix here.

Is there any alternative way to solve this problem which satisfies both the scenarios?

@GuruPrasath_Ramesh this post is so difficult to follow, the queries are unreadable and there is no formatting. I strongly suggest to format it properly next time you post. This time I'll re-format it.

Thanks for the formatting @Andrei_Stefan. Do you have any ideas to resolve it?

Have you tried the "SQL way"? SELECT * FROM test WHERE clause_number LIKE '%5-2011%'

I don't know what didn't work in your first scenario because you didn't provide any sample documents that should have matched (and didn't) or did match (and shouldn't). But the query above works for me for the two sample documents you provided.

1 Like

You should take a look into text analysis and how full-text search works. You could have full control of how you index and query the text.

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis.html

Hi @Andrei_Stefan @pmusa

I have attached sample documents for your reference.

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
	"total" : 5,
	"successful" : 5,
	"skipped" : 0,
	"failed" : 0
  },
  "hits" : {
	"total" : 3,
	"max_score" : 1.0,
	"hits" : [
	  {
	    "_index" : "test_local",
	    "_type" : "_doc",
	    "_id" : "0",
	    "_score" : 1.0,
	    "_source" : {
	      "subpart_number" : "Capítulo XV II",
	      "section_number" : "145",
	      "part_name" : "DEL RÉGIMEN BANCARIO",
	      "subchapter_name" : null,
	      "subsection_number" : null,
	      "is_subtitle" : 0,
	      "clause_number" : "3.   Autorizar la emisión de nuevas acciones del banco, así como su venta a terceros, al precio que el  reorganizador o la junta de reorganización determine",
	      "clause_id" : 576989,
	      "subpart_id" : 8914
	    }
	  },
	  {
	    "_index" : "test_local",
	    "_type" : "_doc",
	    "_id" : "2",
	    "_score" : 1.0,
	    "_source" : {
	      "subpart_number" : null,
	      "section_number" : "1",
	      "part_name" : "Dada en la ciudad de Panamá, a los veintidós (22) días mes de julio de dos mil catorce (2014). <br> COMUNÍQUESE, PUBLÍQUESE Y CÚMPLASE.",
	      "subchapter_name" : null,
	      "subsection_number" : null,
	      "is_subtitle" : 0,
	      "clause_number" : "El numeral 7, literal d del artículo 2 del Acuerdo No. 5-2011, queda así:",
	      "clause_id" : 581142,
	      "subpart_id" : null
	    }
	  },
	  {
	    "_index" : "test_local",
	    "_type" : "_doc",
	    "_id" : "1",
	    "_score" : 1.0,
	    "_source" : {
	      "subpart_number" : null,
	      "section_number" : "1",
	      "part_name" : "Dado en la ciudad de Panamá, a los diecinueve (19) días del mes de junio de dos mil doce (2012). <br> COMUNÍQUESE, PUBLÍQUESE Y CÚMPLASE.",
	      "subchapter_name" : null,
	      "subsection_number" : null,
	      "is_subtitle" : 0,
	      "clause_number" : "El artículo 11 del Acuerdo No.005-2011 queda así:",
	      "clause_id" : 580778,
	      "subpart_id" : null
	    }
	  }
	]
  }
}

The LIKE query works for '5-2011' scenario correctly.
Another scenario is,

POST /_xpack/sql/translate
{
"query":"SELECT * FROM test_local WHERE ((clause_number.keyword LIKE '%emisión%' AND level = 'CLAUSE')) AND data_source IN ('CFR', 'PANAMA', 'FINRA') ORDER BY clause_number.keyword DESC LIMIT 10"
}

The output of the query will be,

{
"took" : 12,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : null,
"hits" : [
  {
	"_index" : "test_local",
	"_type" : "_doc",
	"_id" : "0",
	"_score" : null,
	"_source" : {
	  "subpart_number" : "Capítulo XV II",
	  "chapter_name" : "Panama Superintendency of Banks - Panama Banking Law",
	  "clause_number" : "3.   Autorizar la *emisión* de nuevas acciones del banco, así como su venta a terceros, al precio que el  reorganizador o la junta de reorganización determine",

	  "subsection_number" : null,
	  "subchapter_name" : null,
	  "part_name" : "DEL RÉGIMEN BANCARIO",
	  "section_number" : "145"
	},
	"fields" : {
	  "title_number" : [
	    52
	  ],
	  "is_subtitle" : [
	    0
	  ],
	  "subpart_id" : [
	    8914
	  ],
	  "section_id" : [
	    88336
	  ],
	  "clause_id" : [
	    576989
	  ],
	  "is_subpart" : [
	    1
	  ],
	  "title_id" : [
	    2
	  ],
	  "is_subsection" : [
	    0
	  ],
	  "chapter_id" : [
	    26
	  ],
	  "part_id" : [
	    6309
	  ],
	  "is_division" : [
	    0
	  ],
	  "is_subchapter" : [
	    0
	  ]
	},
	"sort" : [
	  "3.   Autorizar la emisión de nuevas acciones del banco, así como su venta a terceros, al precio que el  reorganizador o la junta de reorganización determine"
	]
  }
]
}
}

The query string is highlighted in "clause_number"

But while searching with the keyword "Emisión", it doesn't work.

POST /_xpack/sql/translate
{
"query":"SELECT * FROM test_local WHERE ((clause_number.keyword LIKE '%Emisión%' AND level = 'CLAUSE')) AND data_source IN ('CFR', 'PANAMA', 'FINRA') ORDER BY clause_number.keyword DESC LIMIT 10"
}

The output of the above query is:

{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
	"total" : 5,
	"successful" : 5,
	"skipped" : 0,
	"failed" : 0
  },
  "hits" : {
	"total" : 0,
	"max_score" : null,
	"hits" : [ ]
  }
}

Now, could you able to understand the problem?

@GuruPrasath_Ramesh if you expect Emisión (uppercase E) to match emisión (lowercase e) then LIKE won't work, because this is terms matching (no analysis of the passed in text). Maybe you should index your documents differently and use sub-fields and various analyzers, then build a more complex query that should use LIKE, MATCH and QUERY at the same time on various fields/sub-fields.

1 Like

@Andrei_Stefan I have used lowercase and ASCII folding analyzers in the mapping.

{
  "test_local" : {
	"settings" : {
	  "index" : {
	    "number_of_shards" : "5",
	    "provided_name" : "test_local",
	    "creation_date" : "1564564952829",
	    "analysis" : {
	      "filter" : {
	        "my_ascii_folding" : {
	          "type" : "asciifolding",
	          "preserve_original" : "true"
	        }
	      },
	      "analyzer" : {
	        "folding" : {
	          "filter" : [
	            "lowercase",
	            "my_ascii_folding"
	          ],
	          "tokenizer" : "standard"
	        }
	      }
	    },
	    "number_of_replicas" : "1",
	    "uuid" : "oOrOrvQVQNu612PaLmJHiw",
	    "version" : {
	      "created" : "6060299"
	    }
	  }
	}
  }
} 

The problem is that I have searched with clause_number.keyword for which the analyzer won't work. At the same time, I cannot search that without .keyword.

And also I can't use a normalizer for this keyword, as it emits more than one token.

That's why I came up here for a solution.

@Andrei_Stefan As you said, I have tried the multi-match query with 'type=phrase_prefix' with the query_string:'5-2011'

POST /_xpack/sql/translate
{
  "query": "SELECT * FROM test_local WHERE (level = 'CLAUSE' AND MATCH('clause_number', '5-2011','type=phrase_prefix'))"
}

The output of the query is:

{
  "took" : 37,
  "timed_out" : false,
  "_shards" : {
	"total" : 5,
	"successful" : 5,
	"skipped" : 0,
	"failed" : 0
  },
  "hits" : {
	"total" : 1,
	"max_score" : null,
	"hits" : [
	  {
	    "_index" : "test_local",
	    "_type" : "_doc",
	    "_id" : "2",
	    "_score" : null,
	    "_source" : {
	      "subpart_number" : null,
	      "level" : "CLAUSE",
	      "section_name" : "ARTICULO 1",
	      "chapter_number" : "00",
	      "chapter_name" : "Panama Superintendency of Banks - Panama Banking Regulations",
	      "clause_number" : "El numeral 7, literal d del artículo 2 del Acuerdo No. 5-2011, queda así:",
	      "subpart_name" : null,
	      "data_source" : "PANAMA",
	      "section_number" : "1"
	    },
	    "fields" : {
	      "title_number" : [
	        1
	      ],
	      "is_subtitle" : [
	        0
	      ],
	      "section_id" : [
	        89589
	      ],
	      "clause_id" : [
	        581142
	      ],
	      "is_subpart" : [
	        0
	      ],
	      "title_id" : [
	        3
	      ],
	      "is_subsection" : [
	        0
	      ],
	      "chapter_id" : [
	        110
	      ],
	      "part_id" : [
	        6443
	      ],
	      "is_division" : [
	        0
	      ],
	      "is_subchapter" : [
	        0
	      ]
	    },
	    "sort" : [
	      0
	    ]
	  }
	]
  }
}

It does not satisfy the 5-2011 scenario as it should return '005-2011'

Are there any ways to use the LIKE and MATCH query both in the same SQL Translate API?

SELECT * FROM test WHERE clause_number LIKE '%5-2011%' OR MATCH('clause_number', '%5-2011%', 'type=phrase_prefix') OR QUERY('clause_number:*5-2011*')

1 Like

@Andrei_Stefan Thanks a lot. It works!

There arises another problem here.

POST /_xpack/sql/translate
{
"query":"SELECT * FROM vedant WHERE  ( (title_name.keyword RLIKE '.*emision.*' AND level = 'TITLE') OR (level = 'TITLE' AND MATCH('title_name','emision','type=phrase_prefix')) OR (title_number.keyword RLIKE '.*emision.*' AND level = 'TITLE') OR (level = 'TITLE' AND MATCH('title_number','emision','type=phrase_prefix')) OR (subtitle_number.keyword RLIKE '.*emision.*' AND level = 'SUBTITLE') OR (level = 'SUBTITLE' AND MATCH('subtitle_number','emision','type=phrase_prefix')) OR (subtitle_name.keyword RLIKE '.*emision.*' AND level = 'SUBTITLE') OR (level = 'SUBTITLE' AND MATCH('subtitle_name','emision','type=phrase_prefix')) OR (chapter_name.keyword RLIKE '.*emision.*' AND level = 'CHAPTER') OR (level = 'CHAPTER' AND MATCH('chapter_name','emision','type=phrase_prefix')) OR (chapter_number.keyword RLIKE '.*emision.*' AND level = 'CHAPTER') OR (level = 'CHAPTER' AND MATCH('chapter_number','emision','type=phrase_prefix')) OR (subchapter_name.keyword RLIKE '.*emision.*' AND level = 'SUBCHAPTER') OR (level = 'SUBCHAPTER' AND MATCH('subchapter_name','emision','type=phrase_prefix')) OR (subchapter_number.keyword RLIKE '.*emision.*' AND level = 'SUBCHAPTER') OR (level = 'SUBCHAPTER' AND MATCH('subchapter_number','emision','type=phrase_prefix')) OR (part_name.keyword RLIKE '.*emision.*' AND level = 'PART') OR (level = 'PART' AND MATCH('part_name','emision','type=phrase_prefix')) OR (considerations.keyword RLIKE '.*emision.*' AND level = 'PART') OR (level = 'PART' AND MATCH('considerations','emision','type=phrase_prefix')) OR (part_number.keyword RLIKE '.*emision.*' AND level = 'PART') OR (level = 'PART' AND MATCH('part_number','emision','type=phrase_prefix')) OR (subpart_number.keyword RLIKE '.*emision.*' AND level = 'SUBPART') OR (level = 'SUBPART' AND MATCH('subpart_number','emision','type=phrase_prefix')) OR (subpart_name.keyword RLIKE '.*emision.*' AND level = 'SUBPART') OR (level = 'SUBPART' AND MATCH('subpart_name','emision','type=phrase_prefix')) OR (division_number.keyword RLIKE '.*emision.*' AND level = 'DIVISION') OR (level = 'DIVISION' AND MATCH('division_number','emision','type=phrase_prefix')) OR (division_name.keyword RLIKE '.*emision.*' AND level = 'DIVISION') OR (level = 'DIVISION' AND MATCH('division_name','emision','type=phrase_prefix')) OR (section_number.keyword RLIKE '.*emision.*' AND level = 'SECTION') OR (level = 'SECTION' AND MATCH('section_number','emision','type=phrase_prefix')) OR (section_name.keyword RLIKE '.*emision.*' AND level = 'SECTION') OR (level = 'SECTION' AND MATCH('section_name','emision','type=phrase_prefix')) OR (subsection_number.keyword RLIKE '.*emision.*' AND level = 'SUBSECTION') OR (level = 'SUBSECTION' AND MATCH('subsection_number','emision','type=phrase_prefix')) OR (subsection_name.keyword RLIKE '.*emision.*' AND level = 'SUBSECTION') OR (level = 'SUBSECTION' AND MATCH('subsection_name','emision','type=phrase_prefix')) OR (interpretation_number.keyword RLIKE '.*emision.*' AND level = 'INTERPRETATION') OR (level = 'INTERPRETATION' AND MATCH('interpretation_number','emision','type=phrase_prefix')) OR (interpretation_name.keyword RLIKE '.*emision.*' AND level = 'INTERPRETATION') OR (level = 'INTERPRETATION' AND MATCH('interpretation_name','emision','type=phrase_prefix')) OR (interpretation.keyword RLIKE '.*emision.*' AND level = 'INTERPRETATION') OR (level = 'INTERPRETATION' AND MATCH('interpretation','emision','type=phrase_prefix')) OR (clause_number.keyword RLIKE '.*emision.*' AND level = 'CLAUSE') OR (level = 'CLAUSE' AND MATCH('clause_number','emision','type=phrase_prefix')) OR (footnote.keyword RLIKE '.*emision.*' AND level = 'FOOTNOTE') OR (level = 'FOOTNOTE' AND MATCH('footnote','emision','type=phrase_prefix')) OR (footnote_number.keyword RLIKE '.*emision.*' AND level = 'FOOTNOTE') OR (level = 'FOOTNOTE' AND MATCH('footnote_number','emision','type=phrase_prefix')) OR (appendix_number.keyword RLIKE '.*emision.*' AND level = 'APPENDIX') OR (level = 'APPENDIX' AND MATCH('appendix_number','emision','type=phrase_prefix')) OR (appendix_name.keyword RLIKE '.*emision.*' AND level = 'APPENDIX') OR (level = 'APPENDIX' AND MATCH('appendix_name','emision','type=phrase_prefix')) OR (appendix.keyword RLIKE '.*emision.*' AND level = 'APPENDIX') OR (level = 'APPENDIX' AND MATCH('appendix','emision','type=phrase_prefix')) OR (amendment.keyword RLIKE '.*emision.*' AND level = 'AMENDMENT') OR (level = 'AMENDMENT' AND MATCH('amendment','emision','type=phrase_prefix')) OR (amendment_number.keyword RLIKE '.*emision.*' AND level = 'AMENDMENT') OR (level = 'AMENDMENT' AND MATCH('amendment_number','emision','type=phrase_prefix')) ) AND data_source IN ('CFR', 'PANAMA', 'FINRA') ORDER BY part_id DESC  LIMIT 10"
}

The output for the above translate API is,

{
"error": {
  "root_cause": [
    {
      "type": "parsing_exception",
      "reason": "line 1:4363: SQL statement too large; halt parsing to prevent memory errors (stopped at depth 200)"
    }
  ],
  "type": "parsing_exception",
  "reason": "line 1:4363: SQL statement too large; halt parsing to prevent memory errors (stopped at depth 200)"
},
"status": 400
}

Elastic Search version:

{
"name" : "Guruprasath",
"cluster_name" : "elasticsearch",
"cluster_uuid" : "j3e2NlceTn2LqnVtZzxDRg",
"version" : {
  "number" : "7.0.1",
  "build_flavor" : "default",
  "build_type" : "tar",
  "build_hash" : "e4efcb5",
  "build_date" : "2019-04-29T12:56:03.145736Z",
  "build_snapshot" : false,
  "lucene_version" : "8.0.0",
  "minimum_wire_compatibility_version" : "6.7.0",
  "minimum_index_compatibility_version" : "6.0.0-beta1"
},
"tagline" : "You Know, for Search"
}

Are there any ways to overcome the following problem?

In case you were not aware: https://www.elastic.co/guide/en/elasticsearch/reference/current/sql-limitations.html#large-parsing-trees

I see a pattern with your conditions and I think you could simplify 4 conditions for the same level like (title_name.keyword RLIKE '.*emision.*' AND level = 'TITLE') OR (level = 'TITLE' AND MATCH('title_name','emision','type=phrase_prefix')) OR (title_number.keyword RLIKE '.*emision.*' AND level = 'TITLE') OR (level = 'TITLE' AND MATCH('title_number','emision','type=phrase_prefix'))

to something like

((title_name.keyword RLIKE '.*emision.*' OR MATCH('title_name','emision','type=phrase_prefix') OR (title_number.keyword RLIKE '.*emision.*') OR (MATCH('title_number','emision','type=phrase_prefix'))) AND level = 'TITLE')

Basically (A AND X) OR (B AND X) OR (C AND X) OR (D AND X) is equivalent with (A OR B OR C OR D) AND X.