Index with millions nested documents

Babytut · December 14, 2019, 3:10am

I have a problem when design Elasticsearch mapping.

I have 2 table in Postgresql: Person table with 1000 records and Variant table with 50M records. Each person has 20 millions variants and each variant belongs to at least 700 person. Two tables join by a mapping table Person_Variant with n-n relationship.
I want to transform it to Elasticsearch for faster query text. But I also want to some aggregate and filter on Web UI like this: when filter one property of Person or Variant, Web UI update number of Person meet the filter and number of variant from matched Person; and Web UI update some aggregate value of properties of Person, Variant. Data does not need to update, if it change I can index again.

I used nested document design: Each record in Person Index has list of nested Variant which contains only filter fields, and each record in Variant Index has list of nested Person which contains only filter fields. It still to large and not efficient.

Do you have meet any problem like this? Or Can you give some advice?
Thank you.

Christian_Dahlqvist · December 14, 2019, 7:59am

Have you considered denormalizing and storing a document per person variant combination? This should allow you to apply filtering and use cardinality aggregations to calculate number of persons with a certain variant etc.

I may help if you could provide some additional information about the average cardinality between the entities and the different ways you need to query them. Information about how frequently entities are changing or being added would also be useful.

Babytut · December 14, 2019, 8:19am

thanks @Christian_Dahlqvist,

I have updated my question. Each person have 5 millions variants and each variant belongs to at least 700 persons.
Data maybe not need to update. It's indexed only one after that I only search on it.

Christian_Dahlqvist · December 14, 2019, 8:38am

I would recommend trying the flat approach and see how it works. This is basically one document per entry in your mapping table. As far as I can tell this looks like a very bad fit for nested documents.

Babytut · December 16, 2019, 3:32am

Thank @Christian_Dahlqvist,

I have test with flat approach But it's not good enough because too many records.

VietCong · December 16, 2019, 4:21am

And logstash option to parse and potentially modify the data structure to fit your needs is out of the picture?

Christian_Dahlqvist · December 16, 2019, 6:09am

How did this approach compare to the other one? It would help if you could quantify how query performance compared to the other approach. How many documents did it result in? How much data on disk you end up with? How many shards did you use?

system · January 13, 2020, 6:16am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Mapping Explosion in Nested Datatype Elasticsearch	3	998	July 24, 2017
Nested docs vs many fields Elasticsearch	1	504	March 2, 2017
Paginating nested documents Elasticsearch	1	433	July 5, 2017
Elasticsearch Performance Issue Elasticsearch	7	572	September 4, 2020
I have about 500,000 children for each parent Elasticsearch	5	377	March 13, 2021

Index with millions nested documents

Related topics