Return text with duplicate tokens highlighted

1fmusic · February 2, 2019, 7:06pm

I would like to view a document's text (the 'notes' field) with the duplicated tokens highlighted: not the original occurrence of a token, only the repeats. This should be done per document, i.e. I don't care if two documents share a token, only whether tokens are repeated within a single document. (I know that's an odd thing to ask of ES.)

So far, I have tokenized the text into the correct tokens (using a pattern tokenizer), and I am able to remove duplicated tokens with the unique token filter, the remove_duplicates filter, or min_hash. But I am having trouble creating a field that lets me print the _source.notes text with the duplicates highlighted.

The analyzer below produces tokens only at the position of the first (original) occurrence of each token. So I think that if I could highlight everything that is not included in this list of originals, it would do the job.

Any input on this matter is greatly appreciated. Thank you.

// PUT mimic_dat
    {
      "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0,
        "analysis": {
          "tokenizer": {
            "mimic_tokenizer": {
              "type": "pattern",
              "pattern": """(\.\s|\n+)""",
              "group": -1
            }
          },
          "filter": {
            "unique_mimic": {
              "type": "unique",
              "only_on_same_position": false
            }
          },
          "analyzer": {
            "mimic_hash_analyzer": {
              "type": "custom",
              "tokenizer": "mimic_tokenizer",
              "filter": [
                "unique_mimic"
              ]
            }
          }
        }
      },
      "mappings": {
        "mimic_type": {
          "properties": {
            "subject_id": {
              "type": "keyword"
            },
            "notes": {
              "type": "text",
              "fielddata": true,
              "fields": {
                "my_hash": {
                  "type": "text",
                  "analyzer": "mimic_hash_analyzer",
                  "fielddata": true,
                  "term_vector": "with_positions_offsets",
                  "store": true
                }
              }
            }
          }
        }
      }
    }
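
Because the sub-field stores term vectors with positions and offsets, one possible direction is to do the highlighting client-side from those offsets. The sketch below is not a built-in Elasticsearch feature. It assumes the sub-field is analyzed without the unique filter, so that every occurrence of a token is listed, and it assumes a response shaped like the output of the term vectors API (`term_vectors` → field → `terms` → `tokens` with `start_offset` and `end_offset`). The function name `highlight_repeats` and the handwritten sample response are hypothetical. Offsets are applied as Python string indices, which may drift if the text contains characters outside the Basic Multilingual Plane.

```python
def highlight_repeats(text, response, field, pre="<em>", post="</em>"):
    """Wrap every occurrence of a term except its first one.

    `response` is assumed to look like a term vectors response for one
    document; `field` is the analyzed field name (e.g. "notes.my_hash").
    """
    terms = response["term_vectors"][field]["terms"]

    # Collect the character spans of all repeated occurrences.
    spans = []
    for info in terms.values():
        tokens = sorted(info["tokens"], key=lambda t: t["start_offset"])
        for tok in tokens[1:]:  # skip the first (original) occurrence
            spans.append((tok["start_offset"], tok["end_offset"]))

    # Insert markers from the end of the text backwards so that earlier
    # offsets stay valid while the string grows.
    out = text
    for start, end in sorted(spans, reverse=True):
        out = out[:start] + pre + out[start:end] + post + out[end:]
    return out


# Handwritten sample (hypothetical data, not real cluster output).
sample_text = "A b.\nC d.\nA b.\nE"
sample_response = {
    "term_vectors": {
        "notes.my_hash": {
            "terms": {
                "A b": {"tokens": [
                    {"position": 0, "start_offset": 0, "end_offset": 3},
                    {"position": 2, "start_offset": 10, "end_offset": 13},
                ]},
                "C d": {"tokens": [
                    {"position": 1, "start_offset": 5, "end_offset": 8},
                ]},
                "E": {"tokens": [
                    {"position": 3, "start_offset": 15, "end_offset": 16},
                ]},
            }
        }
    }
}
print(highlight_repeats(sample_text, sample_response, "notes.my_hash"))
```

The trade-off of this approach is that the duplicates are found in the application rather than in the query, so the unique filter would not be needed on this sub-field.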

// PUT mimic_dat/mimic_type/4
{
  "notes": """
Past History: Chronic xx which lead to; Ca.
  
Review of systems:    Cardiac,   SR.
O2: sats on room air 100%.  

ID:  No active issues, temp 99.3 PO.

Review of systems:    Cardiac,   SR.

ID:  No active issues, temp 99.3 PO. 
"""
}
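
For comparison, the same per-document behaviour can be sketched entirely outside Elasticsearch by splitting the text on the tokenizer's pattern and remembering which tokens have already been seen. This is only an illustration of the desired output, not an ES feature. The function name `highlight_duplicates` is hypothetical, and unlike the pattern tokenizer it deliberately ignores whitespace-only pieces rather than treating them as tokens.

```python
import re

# Same separator pattern as the mimic_tokenizer above; the capturing
# group makes re.split keep the separators in the result.
SPLIT_PATTERN = re.compile(r"(\.\s|\n+)")


def highlight_duplicates(text, pre="<em>", post="</em>"):
    """Return `text` with every repeated token wrapped in pre/post."""
    parts = SPLIT_PATTERN.split(text)
    seen = set()
    out = []
    for i, part in enumerate(parts):
        is_separator = i % 2 == 1  # odd indexes hold the matched separators
        if is_separator or not part.strip():
            out.append(part)  # keep separators and blank pieces untouched
        elif part in seen:
            out.append(pre + part + post)  # repeat: highlight it
        else:
            seen.add(part)  # first occurrence: leave as-is
            out.append(part)
    return "".join(out)


# Shortened version of the sample document above.
notes = (
    "Review of systems:    Cardiac,   SR.\n"
    "O2: sats on room air 100%.\n"
    "ID:  No active issues, temp 99.3 PO.\n"
    "\n"
    "Review of systems:    Cardiac,   SR.\n"
    "\n"
    "ID:  No active issues, temp 99.3 PO.\n"
)
print(highlight_duplicates(notes))
```

Because the separators are written back unchanged, removing the markers from the result gives back the original text exactly.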

system · March 2, 2019, 7:06pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
