Index with millions of nested documents

I have a problem designing an Elasticsearch mapping.

I have two tables in PostgreSQL: a Person table with 1,000 records and a Variant table with 50M records. Each person has 20 million variants and each variant belongs to at least 700 persons. The two tables are joined through a Person_Variant mapping table (an n-to-n relationship).
I want to move the data into Elasticsearch for faster text search, but I also want to aggregate and filter in a web UI like this: when filtering on a property of Person or Variant, the UI updates the number of persons that meet the filter and the number of variants belonging to the matched persons, and it also updates some aggregated values over Person and Variant properties. The data does not need to be updated; if it changes, I can simply reindex.

I used a nested document design: each record in the Person index has a list of nested Variant documents containing only the filter fields, and each record in the Variant index has a list of nested Person documents containing only the filter fields. It is still too large and not efficient.
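
Roughly, the Person index mapping looks like the sketch below (simplified, using the 8.x Python client; `age` and `chromosome` are just stand-ins for the real filter fields, and the Variant index mirrors it with a nested list of persons):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Person index: person fields plus a nested list of variant filter fields.
# Field names are simplified placeholders for the real schema.
es.indices.create(
    index="person",
    mappings={
        "properties": {
            "person_id": {"type": "keyword"},
            "age": {"type": "integer"},
            "variants": {
                "type": "nested",
                "properties": {
                    "variant_id": {"type": "keyword"},
                    "chromosome": {"type": "keyword"},
                },
            },
        }
    },
)
```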

Have you run into a problem like this? Can you give me some advice?
Thank you.

Have you considered denormalizing and storing a document per person/variant combination? This should allow you to apply filtering and use cardinality aggregations to calculate the number of persons with a certain variant, etc.
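
For example, with one document per person/variant pair, a filter plus cardinality aggregations could look something like this sketch (index and field names are placeholders, shown with the Python client):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Filter on a variant property, then count distinct persons and variants
# among the matching person/variant documents.
resp = es.search(
    index="person_variant",
    size=0,
    query={"term": {"chromosome": "chr1"}},
    aggs={
        "distinct_persons": {"cardinality": {"field": "person_id"}},
        "distinct_variants": {"cardinality": {"field": "variant_id"}},
    },
)
print(resp["aggregations"]["distinct_persons"]["value"])
print(resp["aggregations"]["distinct_variants"]["value"])
```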

It may help if you could provide some additional information about the average cardinality between the entities and the different ways you need to query them. Information about how frequently entities are changed or added would also be useful.

Thanks @Christian_Dahlqvist,

I have updated my question. Each person has 5 million variants and each variant belongs to at least 700 persons.
The data probably does not need to be updated. It is indexed only once; after that I only search on it.

I would recommend trying the flat approach and seeing how it works. This is basically one document per entry in your mapping table. As far as I can tell, this looks like a very bad fit for nested documents.
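
A rough sketch of what that indexing could look like with the Python bulk helper (all index, field, and variable names here are placeholders; in practice you would stream the joined Person_Variant rows out of PostgreSQL instead of using the inline sample):

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

# Inline sample standing in for the joined Person_Variant rows from PostgreSQL,
# each carrying the person and variant filter fields needed by the UI.
rows = [
    {"person_id": "p1", "variant_id": "v1", "age": 42, "chromosome": "chr1"},
    {"person_id": "p1", "variant_id": "v2", "age": 42, "chromosome": "chr2"},
]

def actions(rows):
    # One document per person/variant pair, i.e. per mapping-table entry.
    for row in rows:
        yield {
            "_index": "person_variant",
            "_id": f"{row['person_id']}-{row['variant_id']}",
            "_source": row,
        }

# Stream documents in batches rather than holding tens of millions in memory.
helpers.bulk(es, actions(rows), chunk_size=5000)
```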

Thanks @Christian_Dahlqvist,

I have tested the flat approach, but it is not good enough because there are too many records.

And is the Logstash option of parsing and potentially modifying the data structure to fit your needs out of the picture?

How did the flat approach compare to the nested one? It would help if you could quantify the difference in query performance. How many documents did it result in? How much data on disk did you end up with? How many shards did you use?
