Avoiding Nested Types

Hi ya,

I'm moving data from Postgres to Elasticsearch to use primarily for filtering(faceted), text search over a few fields, sorting, and pagination. Unfortunately, the data is very relational and I'm having trouble converting it to a flat schema.

Note: Here is part of my dataset (the actual number of fields is much larger, but the problems are all the same where I have ids and values and I'd like to map them without nested fields).

I have a list of statuses linked to ids like so, for org_id = '1' and submission_id='43':

form_id: 1, status: "complete"
form_id: 3, status: "in-progress"
form_id: 7, status: "complete"

Multiple forms can exist in one document and there are millions of ids (so I can't map like 1: "Complete"). Form's data is not shared across orgs.

A common query would be in english:

  • Give me all submission_ids for org_id = 1 where form_id = 1 and form_1_status = "completed" and form_id=3 and form_3_status = "in-progress

Here is what I have right now for mapping:

'org_id': { 'type': 'keyword' },
'submission_id': { 'type': 'keyword' },
'forms' : {
    'type' : 'nested',
    'properties' : {
        'form_id' : { 'type' : 'keyword' },
        'status' : { 'type' : 'keyword' }
        }
    }
}

Anyone have solutions to flatten this data structure? I know having millions of keys is a bad/infeasible idea, but I also don't want to lose performance from nesting documents. I've come up with two potential options:

Option 1:
I've thought of keeping a separate mapping that keeps track of es_key -> postgres_id mapping. So the separate mapping document data would look like this:

'org_id': '1',
'form_status_a': '1',
'form_status_b': '3',
'form_status_c': '7'
}

Then the actual data would be in a separate index like this:

'org_id': '1',
'submission_id': 43
'form_status_a': 'complete',
'form_status_b': 'in-progress',
'form_status_c': 'complete'

So I'd query the mapping first, then once I know the names of the es fields, I could query them. This reduces the dimenionality of the Elasticsearch keys so I don't blow it up.

This feels very complicated and lots could go wrong in pipeline building this up. Especially since you can delete and update fields quite a bit so this isn't just handling inserts.

Option 2:
I've thought of combining my keys and values together like so:

org_id: '1',
submission_id: '43',
forms: [ 'form_id1_completed', 'form_id3_in_progress', 'form_id7_completed' ]

This also feels pretty pretty hacky, but it works without having to maintain a second mapping.

The problem here is it can't support numeric fields. Like review_score_id: 1, score: 82.124, where I'd like to either aggregate or compare with something like greater than.

Questions:
However, it feels like I'm trying to fit a square peg into a round hole...is there a better option? I know Elasticsearch can do this(especially with nested documents it isn't too difficult), but is Elasticsearch the right tool for this job?

Thanks so much,
Patrick

I am not sure if this is correct or not.
but I have flatten my data quite a bit. I am querying five table and putting output in flat design.

I do not understand your data hence I can't comment about it. but once you put it elk your search will be super fast. take small case and explain we might be able to help.

will you have many many sub field in this one?

'forms' : {
    'type' : 'nested',
    'properties' : {
        'form_id' : { 'type' : 'keyword' },
        'status' : { 'type' : 'keyword' }
        }
    }

if not then just define forms as object and it will be called like forms.properties.form_id, forms.properties.status

is one org_id has multiple forms?

then you might want to create duplicated document.
1st org_id -> form1
1st org_id -> form2 etc....

and then you can do the aggregation. Again I am just assuming how they will be.

Unfortunately, I need to query(filter) across an entire submission, which has multiple form statuses, at once. Like the query:

  • Give me all submission_ids for org_id = 1 where (form_id = 1 and form_1_status = "completed") and (form_id=3 and form_3_status = "in-progress)

I think that means both forms need to be in the same document? Unless I wanted to return everything then filter in memory which isn't feasible.

But you are right, for aggregations separating them would work great assuming I don't need to filter across multiple forms.

Thanks,
Patrick

if you have multiple record you should be able to do it
{ org_id:1, form_id:1}
{org_id:1, form_id:2}
{org_id:1, form_id:3 }
{org_id:2, form_id:50}

here if you just search for org_id=1 you will get three record back. and so forth.

run test with small set of data

Sorry, I'm not being clear. Imagine this dataset:

{
  'org_id': '1',
  'submission_id': 43,
  'forms': [
      { 'form_id': '1', 'status': 'complete'},
      { 'form_id': '2', 'status': 'in-progress'}
  ]
},
{
  'org_id': '1',
  'submission_id': 42
  'forms': [
      { 'form_id': '1', 'status': 'complete'},
      { 'form_id': '2', 'status': 'rejected'}
  ]
},
{
  'org_id': '1',
  'submission_id': 41
  'forms': [
      { 'form_id': '1', 'status': 'complete'},
      { 'form_id': '2', 'status': 'in-progress'}
  ]
}

Now imagine this query:

  • Give me all submission_ids for org_id = 1 where (form_id = 1 and status = "completed") and (form_id=2 and status = "in-progress )

This would return submission ids 43 and 41. Notice both status queries are ANDed together in one query. However, this requires a nested document.

Thanks,
Patrick

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.