Is it possible to "conditionally" analyze the same field differently? [synonyms]

Hi there,

The index contains 3 business units (BUs), and the goal is to provide a different set of synonyms for each BU.
The field is unstructured text (a PDF rendition), occupying 95% of the overall index storage, and it is currently analyzed twice (once for normal search and once for exact match).

How would you approach this?

  1. write 3 index analyzers (as multi-fields)
    PROS: fast querying
    CONS: wastes storage - the field is stored 3 times (in our case > 5 TB instead of 1.7 TB)

  2. use 3 search analyzers - BU-specific, applied at query time (see the sketch after this list)
    PROS: saves storage
    CONS: slower queries

  3. BU-specific indices (having 3 indices instead of one)
    PROS: saves storage, fast queries
    CONS: harder index management (incremental crawl, re-indexing, etc.)

  4. any "conditional analyzer" able to analyze one subset of data with analyzer 1, another subset of data with analyzer 2 etc... - all against same single field?

  5. could routing be leveraged here?

  6. any other approach?
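
To illustrate option 2 (the index and analyzer names below are just placeholders; the field is our PDF text field): the BU-specific analyzer would live in the index settings but only be referenced at query time, something like:

GET my_index/_search
{
  "query": {
    "match": {
      "content.DOC_TEXT": {
        "query": "some BU-specific phrase",
        "analyzer": "ire_regular_bu1"
      }
    }
  }
}

Nothing BU-specific is indexed; the analyzer choice is made per request.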

What approach would you select/recommend and why?

Thanks in advance
Dominik

From my perspective, the choice between options 1 & 2 probably depends more on the needs of the BUs than on their pros/cons. If the BUs need something fast, then they'll have to accept the extra storage cost; if the BUs can live with something slower, then option 2 will do.

Another option entirely:

Your stated use case is:

provide a different set of synonyms for each BU

Is there any reason why you can't just have one index analyzer with the synonyms for all BUs? I'm assuming that the BUs most likely share a majority of synonyms and have only a few that are BU-specific. If that is the case, a single index analyzer with all synonyms for all BUs probably gives the best overall outcome (rough sketch below).
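
Very roughly, that would mean one shared synonym filter fed by every BU's list (the filter name and terms below are placeholders, not your actual synonyms):

"filter": {
  "all_bu_synonyms": {
    "type": "synonym",
    "synonyms": [
      "shared_term_a, shared_term_b",
      "bu1_only_term, shared_term_a",
      "bu2_only_term, shared_term_b"
    ]
  }
}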

Yes, that is how we have it so far (all synonyms together for the whole corpus), but for legal/approval reasons they have to be separated.

I have explored option #2 (search analyzers) technically, and it looks like it doesn't work well for multi-token synonyms.

Here's my case:

First, defining the analyzers and synonyms (multi-token synonyms handled with the contraction approach).
Multi-token synonyms are defined only for the "ire_exact" analyzer (quoted search).
SETTINGS:

"analysis": {
          "filter": {
            "lh_synonym_regular": {
              "type": "synonym",
              "synonyms": [
                "applicable_law,governing_law",
                "effective_date,commencement_date,inception_date,indemnity_period"
 ]
            },
            "lh_synonym_contraction": {
              "type": "synonym",
              "synonyms": [
                "effective date => effective_date,effective date",
                "commencement date => commencement_date,commencement date",
                "inception date => inception_date,inception date",
...
 "analyzer": {
           "ire_regular": {
              "filter": [
                "ascii_folding",
                "lowercase",
                "lh_synonym_single",
                "stop",
                "irregular_stems",
                "prevent_stems",
                "english_stemmer"
              ],
              "char_filter": [
                "OCR_filter",
                "special_chars_regular"
              ],
              "tokenizer": "standard"
            },
            "ire_exact": {
              "filter": [
                "lowercase",
                "lh_synonym_contraction",
                "lh_synonym_regular"
              ],
              "char_filter": [
                "OCR_filter",
                "special_chars_exact"
              ],
              "tokenizer": "whitespace"
            }
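
As a sanity check of these analyzers (assuming the settings above are applied to the treaty_scenario2a index used below), the _analyze API shows which tokens "ire_exact" emits for a quoted phrase:

GET treaty_scenario2a/_analyze
{
  "analyzer": "ire_exact",
  "text": "inception date"
}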

MAPPINGS:

      "properties": {
        "content": {
          "properties": {
            "DOC_TEXT": {
              "type": "text",
              "fields": {
                "exact": {
                  "type": "text",
                  "term_vector": "with_positions_offsets"
                }
              },
              "term_vector": "with_positions_offsets"
            },

Inserting test data:

PUT treaty_scenario2a/_doc/testing1
{
  "content":{
    "DOC_TEXT":"document with inception date mentioned"
  }
}

PUT treaty_scenario2a/_doc/testing2
{
  "content":{
    "DOC_TEXT":"an effective date document"
  }
}

QUERY:

  • querying "inception date" did not match testing2 ("an effective date document"):
    GET treaty_scenario2a/_explain/testing2?q=content.DOC_TEXT.exact:"inception date"&analyzer=ire_exact
  • querying "effective date" did not match testing1 ("document with inception date mentioned"):
    GET treaty_scenario2a/_explain/testing1?q=content.DOC_TEXT.exact:"effective date"&analyzer=ire_exact


I was aiming to get it to work this way (providing BU-specific ire_regular and ire_exact analyzers to query_string):

GET treaty_scenario2a/_search?q=_id:testing2
{
  "explain": true, 
  "_source": ["content.DOC_TEXT"], 
  "query": {
    "query_string": {
      "default_field": "content.DOC_TEXT",
      "query": "\"inception date\"",
      "analyzer": "ire_regular",

      "quote_field_suffix": ".exact",
      "quote_analyzer": "ire_exact"
    }
  },
  "highlight": {
    "fields": {
      "content.DOC_TEXT": {
        "type": "fvh",
        "matched_fields": [
          "content.DOC_TEXT",
          "content.DOC_TEXT.exact"
        ],
        "fragment_size": 1000,
        "no_match_size": 0,
        "number_of_fragments": 1,
        "boundary_scanner": "chars",
        "fragmenter": "span"
      },
      "content.DOC_TEXT_PROX": {
        "highlight_query": {
          "query_string": {
            "fields": [
              "content.DOC_TEXT.exact"
            ],
            "analyzer": "ire_exact",
            "query": "\"inception date\""
          }
        },
        "type": "unified",
        "boundary_scanner": "sentence",
        "fragment_size": 1000,
        "number_of_fragments": 1,
        "no_match_size": 1000,
        "fragmenter": "span"
      }
    }
  }
}

But it seems it cannot match/highlight multi-word synonyms at SEARCH time, since the synonym tokens are not in the index. If you have a solution for that case, I would be happy to hear about it.
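
For reference, the cause can be seen with _analyze pointed at the sub-field itself (sketch only, output not pasted here): the mapping above declares no index-time analyzer for content.DOC_TEXT.exact, so it falls back to the index default (standard, unless overridden), and only the plain word tokens end up in the index - the contracted synonym tokens produced by ire_exact at query time (e.g. effective_date) have nothing to match.

# shows the index-time tokens of the sub-field for a test phrase
GET treaty_scenario2a/_analyze
{
  "field": "content.DOC_TEXT.exact",
  "text": "an effective date document"
}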

Thanks

It looks like "synonym_graph" is the solution to this multi-token synonym problem (see the synonym_graph token filter documentation). It allows getting rid of the contraction approach (as far as I was able to test).
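
For completeness, here is a minimal standalone sketch of that direction (index, filter and analyzer names are made up for this test and kept separate from the real mapping above): the multi-token synonyms go into a synonym_graph filter that is applied only at query time, so no contraction rules are needed and nothing extra is indexed.

PUT synonym_graph_test
{
  "settings": {
    "analysis": {
      "filter": {
        "lh_synonym_graph": {
          "type": "synonym_graph",
          "synonyms": [
            "effective date, commencement date, inception date"
          ]
        }
      },
      "analyzer": {
        "ire_exact_graph": {
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "lh_synonym_graph"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "DOC_TEXT": {
        "type": "text"
      }
    }
  }
}

# indexed with the default standard analyzer only
PUT synonym_graph_test/_doc/1
{
  "DOC_TEXT": "an effective date document"
}

# the multi-token synonym is expanded purely at query time; this phrase should
# now also find the "effective date" document
GET synonym_graph_test/_search
{
  "query": {
    "match_phrase": {
      "DOC_TEXT": {
        "query": "inception date",
        "analyzer": "ire_exact_graph"
      }
    }
  }
}

In the real setup, a BU-specific analyzer built this way would be passed per query as analyzer/quote_analyzer in query_string, as in the earlier request.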
