Hi ya,
I'm moving data from Postgres to Elasticsearch to use primarily for filtering(faceted), text search over a few fields, sorting, and pagination. Unfortunately, the data is very relational and I'm having trouble converting it to a flat schema.
Note: Here is part of my dataset (the actual number of fields is much larger, but the problems are all the same where I have ids and values and I'd like to map them without nested fields).
I have a list of statuses linked to ids like so, for org_id = '1' and submission_id='43'
:
form_id: 1, status: "complete"
form_id: 3, status: "in-progress"
form_id: 7, status: "complete"
Multiple forms can exist in one document and there are millions of ids (so I can't map like 1: "Complete"
). Form's data is not shared across orgs.
A common query would be in english:
Give me all submission_ids for org_id = 1 where form_id = 1 and form_1_status = "completed" and form_id=3 and form_3_status = "in-progress
Here is what I have right now for mapping:
'org_id': { 'type': 'keyword' },
'submission_id': { 'type': 'keyword' },
'forms' : {
'type' : 'nested',
'properties' : {
'form_id' : { 'type' : 'keyword' },
'status' : { 'type' : 'keyword' }
}
}
}
Anyone have solutions to flatten this data structure? I know having millions of keys is a bad/infeasible idea, but I also don't want to lose performance from nesting documents. I've come up with two potential options:
Option 1:
I've thought of keeping a separate mapping that keeps track of es_key -> postgres_id mapping. So the separate mapping document data would look like this:
'org_id': '1',
'form_status_a': '1',
'form_status_b': '3',
'form_status_c': '7'
}
Then the actual data would be in a separate index like this:
'org_id': '1',
'submission_id': 43
'form_status_a': 'complete',
'form_status_b': 'in-progress',
'form_status_c': 'complete'
So I'd query the mapping first, then once I know the names of the es fields, I could query them. This reduces the dimenionality of the Elasticsearch keys so I don't blow it up.
This feels very complicated and lots could go wrong in pipeline building this up. Especially since you can delete and update fields quite a bit so this isn't just handling inserts.
Option 2:
I've thought of combining my keys and values together like so:
org_id: '1',
submission_id: '43',
forms: [ 'form_id1_completed', 'form_id3_in_progress', 'form_id7_completed' ]
This also feels pretty pretty hacky, but it works without having to maintain a second mapping.
The problem here is it can't support numeric fields. Like review_score_id: 1, score: 82.124
, where I'd like to either aggregate or compare with something like greater than.
Questions:
However, it feels like I'm trying to fit a square peg into a round hole...is there a better option? I know Elasticsearch can do this(especially with nested documents it isn't too difficult), but is Elasticsearch the right tool for this job?
Thanks so much,
Patrick