I would like to view a document's text (the 'notes' field) with duplicated tokens highlighted: only the repeats, not the first occurrence of each token. This should be done per document; I don't care if two documents share a token, only if a token is repeated within a single document. (I know that's an odd thing to ask of ES.)
So far, I have split the text into the correct tokens (using a pattern tokenizer), and I can remove the duplicated tokens with the unique token filter, or alternatively with the remove_duplicates or min_hash filters. But I am having trouble creating a field that lets me print the _source.notes text with the duplicates highlighted.
The analyzer below emits each distinct token once, at the position of its first (original) occurrence. So I think that if I could highlight everything that is not in this list of originals, it would do the job.
Any input on this matter is greatly appreciated. Thank you.
// PUT mimic_dat
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "analysis": {
      "tokenizer": {
        "mimic_tokenizer": {
          "type": "pattern",
          "pattern": """(\.\s|\n+)""",
          "group": -1
        }
      },
      "filter": {
        "unique_mimic": {
          "type": "unique",
          "only_on_same_position": false
        }
      },
      "analyzer": {
        "mimic_hash_analyzer": {
          "type": "custom",
          "tokenizer": "mimic_tokenizer",
          "filter": [
            "unique_mimic"
          ]
        }
      }
    }
  },
  "mappings": {
    "mimic_type": {
      "properties": {
        "subject_id": {
          "type": "keyword"
        },
        "notes": {
          "type": "text",
          "fielddata": true,
          "fields": {
            "my_hash": {
              "type": "text",
              "analyzer": "mimic_hash_analyzer",
              "fielddata": true,
              "term_vector": "with_positions_offsets",
              "store": true
            }
          }
        }
      }
    }
  }
}
// PUT mimic_dat/mimic_type/4
{
"notes": """
Past History: Chronic xx which lead to; Ca.
Review of systems: Cardiac, SR.
O2: sats on room air 100%.
ID: No active issues, temp 99.3 PO.
Review of systems: Cardiac, SR.
ID: No active issues, temp 99.3 PO.
"""
}
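
To make the goal concrete, here is a minimal sketch in Python of the output I am after. It runs on the client, outside Elasticsearch, so it is not an ES feature or a proposed solution, only an illustration. It assumes the same split pattern as mimic_tokenizer, that empty pieces between separators are skipped, and that <em>…</em> is an acceptable highlight marker; the names segments and highlight_duplicates are made up for this example.

```python
import re

# Same split pattern as mimic_tokenizer: a period followed by whitespace,
# or one or more newlines. (Assumption: empty pieces are skipped.)
SEPARATOR = re.compile(r"(?:\.\s|\n+)")

# The 'notes' value of the sample document.
NOTES = """
Past History: Chronic xx which lead to; Ca.
Review of systems: Cardiac, SR.
O2: sats on room air 100%.
ID: No active issues, temp 99.3 PO.
Review of systems: Cardiac, SR.
ID: No active issues, temp 99.3 PO.
"""


def segments(text):
    """Yield (start, end, token) for each non-empty piece between separators."""
    pos = 0
    for match in SEPARATOR.finditer(text):
        if match.start() > pos:
            yield pos, match.start(), text[pos:match.start()]
        pos = match.end()
    if pos < len(text):
        yield pos, len(text), text[pos:]


def highlight_duplicates(text, pre="<em>", post="</em>"):
    """Wrap every repeated token in pre/post tags; leave first occurrences alone."""
    seen = set()
    out = []
    last = 0  # end offset of the text already copied to the output
    for start, end, token in segments(text):
        if token in seen:
            out.append(text[last:start])   # untouched text before the duplicate
            out.append(pre + token + post)  # the duplicate itself, highlighted
            last = end
        else:
            seen.add(token)
    out.append(text[last:])  # trailing text after the last duplicate
    return "".join(out)


highlighted = highlight_duplicates(NOTES)
print(highlighted)
```

For the sample document, the second "Review of systems: Cardiac, SR" and the second "ID: No active issues, temp 99.3 PO" come back wrapped in <em> tags, and everything else is returned unchanged. That is the result I would like to get from a highlighted field in ES.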