Is it possible to "conditionally" analyze the same field differently? [synonyms]

Hi there,

The index contains 3 business units (BUs), and the goal is to provide a different set of synonyms for each BU.
The field is unstructured text (a PDF rendition), occupying 95% of the overall index storage, and it is currently analyzed twice (once for normal search and once for exact match).

How would you approach this?

  1. write 3 index analyzers (as multi-fields)
    PROS: fast querying
    CONS: wastes storage - the field is stored 3 times (in our case > 5 TB instead of 1.7 TB)

  2. use 3 search analyzers - BU-specific, applied at query time (see the sketch after this list)
    PROS: saves storage
    CONS: slower queries

  3. BU-specific indices (having 3 indices instead of one)
    PROS: saves storage, fast queries
    CONS: harder index management (incremental crawl, re-indexing, etc.)

  4. any "conditional analyzer" able to analyze one subset of data with analyzer 1, another subset of data with analyzer 2 etc... - all against same single field?

  5. could routing be leveraged here?

  6. any other approach?
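
To illustrate option 2 (the index and analyzer names below are just placeholders; the field is our PDF text field): the BU-specific analyzer would live in the index settings but only be referenced at query time, something like:

GET my_index/_search
{
  "query": {
    "match": {
      "content.DOC_TEXT": {
        "query": "some BU-specific phrase",
        "analyzer": "ire_regular_bu1"
      }
    }
  }
}

Nothing BU-specific is indexed; the analyzer choice is made per request.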

What approach would you select/recommend and why?

Thanks in advance
Dominik

From my perspective, the choice between options 1 & 2 probably depends more on the needs of the BUs than on their pros/cons. If the BUs need something fast, then they'll have to accept the extra storage cost; if the BUs can live with something slower, then option 2 will do.

Another option entirely:

Your stated use case is:

provide a different set of synonyms for each BU

Is there any reason why you can't just have one index analyzer with the synonyms for all BUs? I'm assuming that the BUs most likely share a majority of synonyms and have only a few that are BU-specific. If that is the case, a single index analyzer with all synonyms for all BUs probably gives the best overall outcome (rough sketch below).
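
Very roughly, that would mean one shared synonym filter fed by every BU's list (the filter name and terms below are placeholders, not your actual synonyms):

"filter": {
  "all_bu_synonyms": {
    "type": "synonym",
    "synonyms": [
      "shared_term_a, shared_term_b",
      "bu1_only_term, shared_term_a",
      "bu2_only_term, shared_term_b"
    ]
  }
}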

Yes, that is how we have it so far (all synonyms together for the whole corpus), but for legal/approval reasons they have to be separated.

I have explored option #2 (search analyzers) technically, and it looks like it doesn't work well for multi-token synonyms.

Here's my case:

First, defining the analyzers and synonyms (multi-token synonyms handled with the contraction approach).
Multi-token synonyms are defined only for the "ire_exact" analyzer (quoted search).
SETTINGS:

"analysis": {
          "filter": {
            "lh_synonym_regular": {
              "type": "synonym",
              "synonyms": [
                "applicable_law,governing_law",
                "effective_date,commencement_date,inception_date,indemnity_period"
 ]
            },
            "lh_synonym_contraction": {
              "type": "synonym",
              "synonyms": [
                "effective date => effective_date,effective date",
                "commencement date => commencement_date,commencement date",
                "inception date => inception_date,inception date",
...
 "analyzer": {
           "ire_regular": {
              "filter": [
                "ascii_folding",
                "lowercase",
                "lh_synonym_single",
                "stop",
                "irregular_stems",
                "prevent_stems",
                "english_stemmer"
              ],
              "char_filter": [
                "OCR_filter",
                "special_chars_regular"
              ],
              "tokenizer": "standard"
            },
            "ire_exact": {
              "filter": [
                "lowercase",
                "lh_synonym_contraction",
                "lh_synonym_regular"
              ],
              "char_filter": [
                "OCR_filter",
                "special_chars_exact"
              ],
              "tokenizer": "whitespace"
            }
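
As a sanity check of these analyzers (assuming the settings above are applied to the treaty_scenario2a index used below), the _analyze API shows which tokens "ire_exact" emits for a quoted phrase:

GET treaty_scenario2a/_analyze
{
  "analyzer": "ire_exact",
  "text": "inception date"
}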

MAPPINGS:

      "properties": {
        "content": {
          "properties": {
            "DOC_TEXT": {
              "type": "text",
              "fields": {
                "exact": {
                  "type": "text",
                  "term_vector": "with_positions_offsets"
                }
              },
              "term_vector": "with_positions_offsets"
            },

Inserting test data:

PUT treaty_scenario2a/_doc/testing1
{
  "content":{
    "DOC_TEXT":"document with inception date mentioned"
  }
}

PUT treaty_scenario2a/_doc/testing2
{
  "content":{
    "DOC_TEXT":"an effective date document"
  }
}

QUERY:

  • querying "inception date" did not match testing2 ("an effective date document"):
    GET treaty_scenario2a/_explain/testing2?q=content.DOC_TEXT.exact:"inception date"&analyzer=ire_exact
  • querying "effective date" did not match testing1 ("document with inception date mentioned"):
    GET treaty_scenario2a/_explain/testing1?q=content.DOC_TEXT.exact:"effective date"&analyzer=ire_exact


I was aiming to get it to work this way (providing BU-specific ire_regular and ire_exact analyzers to query_string):

GET treaty_scenario2a/_search?q=_id:testing2
{
  "explain": true, 
  "_source": ["content.DOC_TEXT"], 
  "query": {
    "query_string": {
      "default_field": "content.DOC_TEXT",
      "query": "\"inception date\"",
      "analyzer": "ire_regular",

      "quote_field_suffix": ".exact",
      "quote_analyzer": "ire_exact"
    }
  },
  "highlight": {
    "fields": {
      "content.DOC_TEXT": {
        "type": "fvh",
        "matched_fields": [
          "content.DOC_TEXT",
          "content.DOC_TEXT.exact"
        ],
        "fragment_size": 1000,
        "no_match_size": 0,
        "number_of_fragments": 1,
        "boundary_scanner": "chars",
        "fragmenter": "span"
      },
      "content.DOC_TEXT_PROX": {
        "highlight_query": {
          "query_string": {
            "fields": [
              "content.DOC_TEXT.exact"
            ],
            "analyzer": "ire_exact",
            "query": "\"inception date\""
          }
        },
        "type": "unified",
        "boundary_scanner": "sentence",
        "fragment_size": 1000,
        "number_of_fragments": 1,
        "no_match_size": 1000,
        "fragmenter": "span"
      }
    }
  }
}

But it seems it cannot match/highlight multi-word synonyms at SEARCH time, since the synonym tokens are not in the index. If you have a solution for that case, I would be happy to hear about it.
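
For reference, the cause can be seen with _analyze pointed at the sub-field itself (sketch only, output not pasted here): the mapping above declares no index-time analyzer for content.DOC_TEXT.exact, so it falls back to the index default (standard, unless overridden), and only the plain word tokens end up in the index - the contracted synonym tokens produced by ire_exact at query time (e.g. effective_date) have nothing to match.

# shows the index-time tokens of the sub-field for a test phrase
GET treaty_scenario2a/_analyze
{
  "field": "content.DOC_TEXT.exact",
  "text": "an effective date document"
}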

Thanks

It looks like "synonym_graph" is the solution to this multi-token synonym problem (see the synonym_graph token filter documentation). It allows getting rid of the contraction approach (as far as I was able to test).
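
For completeness, here is a minimal standalone sketch of that direction (index, filter and analyzer names are made up for this test and kept separate from the real mapping above): the multi-token synonyms go into a synonym_graph filter that is applied only at query time, so no contraction rules are needed and nothing extra is indexed.

PUT synonym_graph_test
{
  "settings": {
    "analysis": {
      "filter": {
        "lh_synonym_graph": {
          "type": "synonym_graph",
          "synonyms": [
            "effective date, commencement date, inception date"
          ]
        }
      },
      "analyzer": {
        "ire_exact_graph": {
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "lh_synonym_graph"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "DOC_TEXT": {
        "type": "text"
      }
    }
  }
}

# indexed with the default standard analyzer only
PUT synonym_graph_test/_doc/1
{
  "DOC_TEXT": "an effective date document"
}

# the multi-token synonym is expanded purely at query time; this phrase should
# now also find the "effective date" document
GET synonym_graph_test/_search
{
  "query": {
    "match_phrase": {
      "DOC_TEXT": {
        "query": "inception date",
        "analyzer": "ire_exact_graph"
      }
    }
  }
}

In the real setup, a BU-specific analyzer built this way would be passed per query as analyzer/quote_analyzer in query_string, as in the earlier request.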
