Return text with duplicate tokens highlighted

I would like to view a document's text (the 'notes' field) with the duplicated tokens highlighted (only the repeats, not the first occurrence of each token). The check should be per document: I don't care if two documents share a token, only whether a token is repeated within a single document. I know that's an odd thing to ask of ES.

So far, I have tokenized the text into the correct tokens (using a pattern tokenizer), and I am able to remove duplicated tokens via the unique token filter, the remove_duplicates filter, or min_hash. But I am having trouble creating a field that will let me return the _source.notes text with the duplicates highlighted.

The analyzer below produces tokens at the position of the first occurrence (the original) of each token. So I think that if I could highlight everything that is not in this list of originals, it would do the job.
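To make the intended result concrete, here is a rough client-side sketch in Python of "highlight everything that is not an original". It is not an Elasticsearch feature and not a known solution to this question. It only reuses the same split pattern as the tokenizer, applied to the text after fetching _source.notes. The function names and the `<em>` tags are my own assumptions, and unlike the analyzer it compares segments with surrounding whitespace stripped.

```python
import re

# Same split pattern as the mimic_tokenizer in the index settings.
SPLIT_PATTERN = re.compile(r"(\.\s|\n+)")


def _mark(segment, seen, pre, post):
    """Return the segment, wrapped in pre/post if it was already seen."""
    key = segment.strip()  # assumption: ignore leading/trailing whitespace
    if not key:
        return segment  # empty or whitespace-only pieces are never highlighted
    if key in seen:
        return pre + segment + post
    seen.add(key)
    return segment


def highlight_duplicates(text, pre="<em>", post="</em>"):
    """Wrap each repeated segment (not its first occurrence) in pre/post tags.

    Delimiters matched by SPLIT_PATTERN are copied through unchanged, so the
    output is the original text plus the highlight tags.
    """
    seen = set()
    out = []
    pos = 0
    for match in SPLIT_PATTERN.finditer(text):
        out.append(_mark(text[pos:match.start()], seen, pre, post))
        out.append(match.group(0))
        pos = match.end()
    out.append(_mark(text[pos:], seen, pre, post))
    return "".join(out)


if __name__ == "__main__":
    sample = "Review of systems: SR.\nO2: sats.\n\nReview of systems: SR.\n"
    print(highlight_duplicates(sample))
```

The `seen` set is created per call, so duplicates are only detected within one document, never across documents.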

Any input on this matter is greatly appreciated. Thank you.

// PUT mimic_dat
    {
      "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0,
        "analysis": {
          "tokenizer": {
            "mimic_tokenizer": {
              "type": "pattern",
              "pattern": """(\.\s|\n+)""",
              "group": -1
            }
          },
          "filter": {
            "unique_mimic": {
              "type": "unique",
              "only_on_same_position": false
            }
          },
          "analyzer": {
            "mimic_hash_analyzer": {
              "type": "custom",
              "tokenizer": "mimic_tokenizer",
              "filter": [
                "unique_mimic"
              ]
            }
          }
        }
      },
      "mappings": {
        "mimic_type": {
          "properties": {
            "subject_id": {
              "type": "keyword"
            },
            "notes": {
              "type": "text",
              "fielddata": true,
              "fields": {
                "my_hash": {
                  "type": "text",
                  "analyzer": "mimic_hash_analyzer",
                  "fielddata": true,
                  "term_vector": "with_positions_offsets",
                  "store": true
                }
              }
            }
          }
        }
      }
    }

// PUT mimic_dat/mimic_type/4
{
  "notes": """
Past History: Chronic xx which lead to; Ca.
  
Review of systems:    Cardiac,   SR.
O2: sats on room air 100%.  

ID:  No active issues, temp 99.3 PO.

Review of systems:    Cardiac,   SR.

ID:  No active issues, temp 99.3 PO. 
"""
}
