Searching HL7 messages with Elasticsearch


(Sheel Shah) #1

We are using Elasticsearch to search HL7 messages stored in MongoDB via GridFS. With the mongodb-river plugin we can already search by keyword. The following setup works.

Creating the river, which connects MongoDB to Elasticsearch and feeds the index:

curl -XPUT 'http://localhost:9200/_river/demoindex/_meta' -d '
{
	"type": "mongodb",
	"mongodb": { 
		"servers": [ 
			{ "host": "localhost", "port": "27017" } 
		],
		"options": { "secondary_read_preference": "true" },
		"db": "demodb",
		"gridfs":"true",
		"collection": "democollection"
	},
	"index": { 
		"name": "demoindex",
		"type": "files"
	}
}' 

HL7 message example:

MSH|^~\&|LAB|767543|ADT|767543|199003141304-0500||ACK^^ACK|XX3657|P|2.4
MSA|AR|ZZ9380|059805^^^MCH^MR~000000339016^^^MCH^EE~508625465^^^MCH^SS
ERR|PID^1^16^103&Table value not found&HL70357

The standard analyzer/tokenizer produces these terms:

msh lab 767543 adt 767543 199003141304 0500 ack ack xx3657 p 2.4 msa ar zz9380 059805 mch mr 000000339016 mch ee 508625465 mch ss err pid 1 16 103 table value not found hl70357

With this we can easily search for keywords in messages. But my actual requirement is segment-wise search: for example, the user searches for '141304' in MSH-7, or for '508625' in MSA-4. This is not possible with the standard tokenizer. I also tried a regexp filter, but without success, because the whole message is broken into tokens. So we decided to tokenize in stages: split first on '\r' (carriage return), then on '|' (pipe), then on '^' (caret), then on '~' (tilde), and finally apply the standard tokenizer. I tried a basic regex (pattern) tokenizer, but when I save a message to MongoDB it is still tokenized by the standard tokenizer.
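To make the segment-wise requirement concrete, here is a minimal shell sketch of the addressing described above, run outside Elasticsearch. The helper name hl7_field is hypothetical and the sample is the message from this post. It assumes standard HL7 v2 numbering, where the field separator itself counts as MSH-1; under that numbering the ID list in the sample sits at MSA-3.

```shell
# Print field N of every SEG segment in an HL7 v2 message read from stdin.
# hl7_field is a hypothetical helper, not part of Elasticsearch or the river.
hl7_field() {
  seg="$1"; n="$2"
  # Segments end with a carriage return; turn them into lines for awk.
  tr '\r' '\n' | awk -F'|' -v seg="$seg" -v n="$n" '
    $1 == seg {
      # In MSH the field separator itself counts as MSH-1, so MSH-N is
      # column N; in every other segment SEG-N is column N+1.
      col = (seg == "MSH") ? n : n + 1
      print $col
    }'
}

# The sample message from this thread, segments joined by carriage returns.
sample=$(printf '%s\r%s\r%s' \
  'MSH|^~\&|LAB|767543|ADT|767543|199003141304-0500||ACK^^ACK|XX3657|P|2.4' \
  'MSA|AR|ZZ9380|059805^^^MCH^MR~000000339016^^^MCH^EE~508625465^^^MCH^SS' \
  'ERR|PID^1^16^103&Table value not found&HL70357')

# Extract MSH-7, then count the lines containing the substring 141304.
printf '%s' "$sample" | hl7_field MSH 7 | grep -c '141304'
```

The point of the sketch is that a hit only counts when it falls inside the addressed field, which a flat token stream cannot express because the field position is lost once the message is tokenized.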

New index settings

curl -XPUT localhost:9200/demoindex/ -d '
{
	"settings": {
		"analysis": {
			"analyzer": {
				"my_pattern": {
					"type": "pattern",
					"lowercase": true,
					"pattern": "[\\d ]+"
				}
			},
			"tokenizer": {
				"my_tokens": {
					"type": "pattern",
					"pattern": "[\\d ]+",
					"flags": "",
					"group": -1
				}
			}
		}
	}
}'

Output of http://localhost:9200/demoindex/_settings

{
	"demoindex":{
		"settings":{
			"index":{
				"creation_date":"1440586817485",
				"uuid":"MiooEsf0T1qFdH_9DqbdoA",
				"analysis":{
					"analyzer":{
						"my_pattern":{
							"type":"pattern",
							"pattern":"[\\d ]+",
							"lowercase":"true"
						}
					},
					"tokenizer":{
						"my_tokens":{
							"flags":"",
							"pattern":"[\\d ]+",
							"group":"-1",
							"type":"pattern"
						}
					}
				},
				"number_of_replicas":"1",
				"number_of_shards":"5",
				"version":{
					"created":"1040299"
				}
			}
		}
	}
}

After applying this, the messages are still tokenized in the standard manner. Where are we going wrong? Or is there a better approach to achieve this?
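One way to narrow this down might be to ask Elasticsearch directly what the analyzer produces and which analyzer each field actually uses. The sketch below assumes the 1.x form of the _analyze API (the "created" value above suggests a 1.4.x index) and the host and index names from this post. The sample text is only a fragment of the message.

```shell
# Tokens produced by the analyzer defined in the index settings.
curl -XGET 'http://localhost:9200/demoindex/_analyze?analyzer=my_pattern&pretty' -d 'MSH|^~\&|LAB|767543|ADT'

# Field mappings of the index, showing which analyzer (if any) each field names.
curl -XGET 'http://localhost:9200/demoindex/_mapping?pretty'
```

If the first call returns the expected tokens but no field in the mapping names my_pattern, the analyzer is defined but never applied, and documents will keep going through the default (standard) analyzer.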


(Sarwar Bhuiyan) #2

This might sound pretty basic, but did you delete your index/data after changing the mapping and then reindex?


(Sheel Shah) #3

@Sarwar Yes, we did. That's why this is a bit surprising.


(Sarwar Bhuiyan) #4

Sorry, I just looked at your settings again, and the analyzer section seems to be wrong. It needs to be something like this: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-custom-analyzer.html

In your analyzer, you are defining something called my_pattern, which seems to be just another filter. It needs to have "type" set to "custom" and then values for the keys "tokenizer", "filter", and "char_filter".
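A minimal sketch of what such a custom analyzer could look like for this case, assuming the index is recreated from scratch. The names hl7_delimiters and hl7_fields are made up for illustration. Two things are worth noting about the original settings: the my_tokens tokenizer is defined but no analyzer references it, and for a pattern tokenizer with group -1 the pattern describes the separators, so "[\\d ]+" splits on digits and spaces rather than keeping them.

```shell
# Drop the old index first; analysis settings are supplied at creation time here.
curl -XDELETE 'http://localhost:9200/demoindex/'

# Recreate it with a custom analyzer that splits on the HL7 delimiters
# (carriage return, newline, pipe, caret, tilde, ampersand) and lowercases.
# hl7_delimiters and hl7_fields are illustrative names, not built-ins.
curl -XPUT 'http://localhost:9200/demoindex/' -d '
{
	"settings": {
		"analysis": {
			"tokenizer": {
				"hl7_delimiters": {
					"type": "pattern",
					"pattern": "[\\r\\n|^~&]+"
				}
			},
			"analyzer": {
				"hl7_fields": {
					"type": "custom",
					"tokenizer": "hl7_delimiters",
					"filter": ["lowercase"]
				}
			}
		}
	}
}'
```

Defining the analyzer is not enough on its own: the mapping of the field that holds the message has to name it, and how that interacts with the GridFS attachment field created by the river is not shown in this thread. Also, this yields whole components such as 199003141304-0500 as tokens, so a partial match like '141304' would still need a wildcard query or an ngram filter, and the field position (MSH-7) is not preserved in the tokens.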

