We have FIX protocol log messages that look like "8=FIX.4.4|9=67|35=5|49=OMNIMKT001|56=GEMINIMKT|34=357821|52=20181109-23:00:59.800|10=056|". We would like to search for tag=value strings such as "35=5". To do this, I created a custom analyzer in my proof of concept:
PUT /tokenizer-test-2018.11.14/
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "\\|"
        }
      }
    }
  }
}
I populated the index with a couple of messages:
POST /tokenizer-test-2018.11.14/_doc
{
  "@timestamp": "2018-11-04T11:26:45-08:00",
  "message": "8=FIX.4.4|9=67|35=5|49=OMNIMKT001|56=GEMINIMKT|34=357821|52=20181109-23:00:59.800|10=056|"
}

POST /tokenizer-test-2018.11.14/_doc
{
  "@timestamp": "2018-11-04T11:26:45-08:00",
  "message": "8=FIX.4.4|9=67|35=0|49=OMNIMKT001|56=GEMINIMKT|34=357821|52=20181109-23:00:59.800|10=056|"
}
I used the _analyze API to verify that tokenization occurs as expected, and 35=5 does come out as a single token.
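For reference, the _analyze request I ran was along these lines (the text shown is just one of the sample messages):

GET /tokenizer-test-2018.11.14/_analyze
{
  "analyzer": "my_analyzer",
  "text": "8=FIX.4.4|9=67|35=5|49=OMNIMKT001|56=GEMINIMKT|34=357821|52=20181109-23:00:59.800|10=056|"
}

The response lists "35=5" as one of the tokens, so the pattern tokenizer itself appears to be splitting on the pipes correctly.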
However, when I search:
GET /tokenizer-test-2018.11.14/_search
{
  "query": {
    "match": {
      "message": {
        "query": "35=5",
        "analyzer": "my_analyzer"
      }
    }
  }
}
I get 0 hits. When I remove the analyzer from the search, I get both documents: the one with 35=0 (which we don't want) and the one with 35=5 (which we do want).