I have an Elastic index with billions of transactions.
An example transaction (tx):
{
"id": "54bfa9af-009a-437d-bd21-caaf651f7218",
"amount": 100.0,
"currency": "EUR",
"type": "expense",
...
"note": "CARD PAYMENT TO AZAMON.COM 100.0 EUR, RATE 0.86/GBP ON 05-05-2022"
}
I have a few million merchants (companies, etc.) in an RDBMS table:
id | Name
123 | Azamon Ltd
456 | Alple Inc.
789 | Goooogle
...
I can easily ingest them into another Elastic index.
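For context, a minimal sketch of how the merchant rows above could be turned into a request body for the Elasticsearch `_bulk` API. The index name `merchants`, the field names `id` and `name`, and the helper `merchants_to_bulk_ndjson` are assumptions made for illustration, not an existing setup.

```python
import json


def merchants_to_bulk_ndjson(rows, index_name="merchants"):
    """Turn (id, name) rows into an NDJSON body for the _bulk API.

    `index_name` and the document field names are hypothetical.
    """
    lines = []
    for merchant_id, name in rows:
        # Action line: index the document under the merchant's RDBMS id.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": str(merchant_id)}}))
        # Source line: the merchant document itself.
        lines.append(json.dumps({"id": merchant_id, "name": name}))
    # The _bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


rows = [(123, "Azamon Ltd"), (456, "Alple Inc."), (789, "Goooogle")]
bulk_body = merchants_to_bulk_ndjson(rows)
```

The resulting string would be sent as the body of a `POST _bulk` request; for a few million rows it would need to be sent in batches.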
Now, both the transaction's note and the merchant's name are analyzed fields.
I would like to enrich every newly indexed transaction with a merchant ID and name. It doesn't have to be perfect, though: a score threshold could be fine-tuned once the solution works for most of the matching tx.
E.g., for the tx above, I would like to obtain "123, Azamon Ltd" as the search result.
Should I just create a custom tokenizer and analyzer for that, and run a query against a "merchants" index using the tx note as the search term? What would be a good pipeline structure for single-language tx/merchant matching?
Or is there a more efficient out-of-the-box solution for this problem? I'm reading about NER, document similarity, and so on, but I can't figure out what the best approach is in my (simple) case.
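To make the query-based idea concrete, here is a minimal sketch, not a tested solution. It builds a `match` query that uses the tx note as the search term against the merchant `name` field, then keeps the best hit only if its score clears a threshold. The helper names, the `name` field, the threshold value, and the sample response (including its scores) are made up for illustration. Whether a token such as `AZAMON.COM` matches `Azamon` at all depends on the analyzer, which is exactly the part I am unsure about.

```python
def build_merchant_query(note, size=3):
    """Search body for a hypothetical "merchants" index: match the note against `name`."""
    return {
        "size": size,
        "query": {
            "match": {
                "name": {
                    "query": note,
                    # Tolerate small spelling differences between note and name.
                    "fuzziness": "AUTO",
                }
            }
        },
    }


def pick_merchant(response, min_score):
    """Return (merchant_id, name) of the top hit, or None if it scores below min_score."""
    hits = response.get("hits", {}).get("hits", [])
    if not hits:
        return None
    best = max(hits, key=lambda hit: hit["_score"])
    if best["_score"] < min_score:
        return None
    return best["_id"], best["_source"]["name"]


note = "CARD PAYMENT TO AZAMON.COM 100.0 EUR, RATE 0.86/GBP ON 05-05-2022"
query_body = build_merchant_query(note)

# Hand-written response in the shape of a search result; the scores are invented.
sample_response = {
    "hits": {
        "hits": [
            {"_id": "123", "_score": 7.4, "_source": {"name": "Azamon Ltd"}},
            {"_id": "456", "_score": 1.1, "_source": {"name": "Alple Inc."}},
        ]
    }
}

match = pick_merchant(sample_response, min_score=2.0)
```

With the invented scores above, `match` would be `("123", "Azamon Ltd")`, and raising `min_score` above the top score would yield `None`, i.e. the tx is left unenriched.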
Pointers to relevant, proven documentation pages would be considered an acceptable answer. Thank you.