I've used Elasticsearch and Kibana a little bit for basic log searching and queries, but I have an application use case that I think Elasticsearch can help with and I wanted to know a) if what I want is even really possible with Elasticsearch, and b) any pointers on crafting queries / workflow that can help.
I have data with the following identifiable fields:
- name - most commonly "Firstname Surname", but can also be "Surname Firstname". There are also a few cases (but most of these aren't too important), with "name[some other junk]", where the junk is uninteresting and short but hard to filter out.
- extracted_id - this is an id that (99% of the time, there are a few mis-assigned ids) uniquely ties to a person - which usually means has the same name (within the scope of the variations as mentioned above), but there are some cases e.g. where people have changed their name but still have the same id. Approximately 60% of the records have this id field populated, otherwise it is blank.
- event_id - each record is tied to an event. 99% of the time, a person will only have 1 record per event, so this could potentially be used to exclude false positives if multiple matches are found with the same event_id but different scores from the other two fields.
I wish to be able to search with a name (and possibly an extracted id, but that would be optional) and return every record that is tied to this person. Of course there is no way to get a perfect set because of the data quality, but I'd like to see how accurate I can get and I felt like Elasticsearch might be a good tool for this, I'm just not sure what the best way to construct a query/ies would be.
For example, I can search for a name (without knowing the extracted id), and (approximately) 60% of the returned results with have the correct extracted id in them. Searching by exact name match will fail since it wouldn't include the inverse (surname firstname), and may also miss some where someone's name has changed but the record has the extracted id. So I need a way to match records by a variation of names (some quite close, but also some could be quite different in cases of marriage etc), incorporating the extracted_id (which I pretty much want to return every case that matches), and have event_id to potentially help break ties.
One method I could think of using is to search on the name and take the most common extracted_id and then search again on that, but I wondered if I could utilise the 'document similarity' idea of Elasticsearch to build a relevancy score in one go (i.e by searching for a name we find lots of documents with the same extracted_id, so we also want to include them even if the name doesn't match quite as well).
I don't expect a full solution but any pointers on how I can best formulate a query for this would be much appreciated. Thanks!