We are using Elasticsearch to search HL7 messages stored in MongoDB via GridFS. Using the mongodb-river plugin we are able to search by keyword. The following is the working scenario.
Creating the index and connecting MongoDB to Elasticsearch:
curl -XPUT 'http://localhost:9200/_river/demoindex/_meta' -d '
{
  "type": "mongodb",
  "mongodb": {
    "servers": [
      { "host": "localhost", "port": "27017" }
    ],
    "options": { "secondary_read_preference": "true" },
    "db": "demodb",
    "gridfs": "true",
    "collection": "democollection"
  },
  "index": {
    "name": "demoindex",
    "type": "files"
  }
}'
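For reference, a keyword search like the following already returns matching messages (a minimal sketch; the index and type names are the ones from the river definition above, and query_string searches the _all field by default):

# Sketch: plain keyword search against the river-created index
curl -XGET 'http://localhost:9200/demoindex/files/_search?pretty' -d '
{
  "query": {
    "query_string": { "query": "767543" }
  }
}'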
An example HL7 message:
MSH|^~\&|LAB|767543|ADT|767543|199003141304-0500||ACK^^ACK|XX3657|P|2.4
MSA|AR|ZZ9380|059805^^^MCH^MR~000000339016^^^MCH^EE~508625465^^^MCH^SS
ERR|PID^1^16^103&Table value not found&HL70357
The standard analyzer/tokenizer produces these terms/tokens:
msh lab 767543 adt 767543 199003141304 0500 ack ack xx3657 p 2.4 msa ar zz9380 059805 mch mr 000000339016 mch ee 508625465 mch ss err pid 1 16 103 table value not found hl70357
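These tokens can be reproduced with the _analyze API, for example for the MSH segment:

# Check how the standard analyzer tokenizes one segment
curl -XGET 'http://localhost:9200/_analyze?analyzer=standard&pretty' -d 'MSH|^~\&|LAB|767543|ADT|767543|199003141304-0500||ACK^^ACK|XX3657|P|2.4'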
With this we can easily search the messages by keyword. But the actual requirement is to search segment-wise: for example, a user should be able to search '141304' within MSH-7, or '508625' within MSA-4 (see the query sketch below). This is not possible with the standard tokenizer, and we also tried a regexp filter with no luck, because the whole message is broken into tokens. So we decided to create tokens by splitting first on '\r' (carriage return), then on '|' (pipe), then on '^' (caret), then on '~' (tilde), and finally applying the standard tokenizer. I tried this with a basic pattern (regex) tokenizer, but when I save a message into MongoDB it is still tokenized by the standard tokenizer.
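To illustrate the goal, this is roughly the kind of segment-scoped query we want to be able to run; the field name MSH-7 is hypothetical and does not exist in the current mapping:

# Hypothetical segment-wise search; "MSH-7" only illustrates the goal
curl -XGET 'http://localhost:9200/demoindex/files/_search?pretty' -d '
{
  "query": {
    "match": { "MSH-7": "141304" }
  }
}'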
New index settings
curl -XPUT 'http://localhost:9200/demoindex/' -d '
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_pattern": {
          "type": "pattern",
          "lowercase": true,
          "pattern": "[\\d ]+"
        }
      },
      "tokenizer": {
        "my_tokens": {
          "type": "pattern",
          "pattern": "[\\d ]+",
          "flags": "",
          "group": -1
        }
      }
    }
  }
}'
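One way to check what this analyzer actually produces, independently of the river, is the index-level _analyze API (a diagnostic sketch run against a sample segment, using the my_pattern analyzer registered above):

# Run the custom analyzer registered on the index against a sample segment
curl -XGET 'http://localhost:9200/demoindex/_analyze?analyzer=my_pattern&pretty' -d 'MSH|^~\&|LAB|767543|ADT|767543|199003141304-0500||ACK^^ACK|XX3657|P|2.4'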
Output of http://localhost:9200/demoindex/_settings:
{
  "demoindex": {
    "settings": {
      "index": {
        "creation_date": "1440586817485",
        "uuid": "MiooEsf0T1qFdH_9DqbdoA",
        "analysis": {
          "analyzer": {
            "my_pattern": {
              "type": "pattern",
              "pattern": "[\\d ]+",
              "lowercase": "true"
            }
          },
          "tokenizer": {
            "my_tokens": {
              "flags": "",
              "pattern": "[\\d ]+",
              "group": "-1",
              "type": "pattern"
            }
          }
        },
        "number_of_replicas": "1",
        "number_of_shards": "5",
        "version": {
          "created": "1040299"
        }
      }
    }
  }
}
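For completeness, the type mapping can also be inspected, to see whether any field is actually configured to use the custom analyzer (the files type comes from the river definition above):

# Inspect the mappings of the index created above
curl -XGET 'http://localhost:9200/demoindex/_mapping?pretty'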
After applying this, messages still get tokenized in the standard manner. Where are we going wrong? Or is there a better approach to achieve this?