I have a problem when design Elasticsearch mapping.
I have 2 table in Postgresql: Person table with 1000 records and Variant table with 50M records. Each person has 20 millions variants and each variant belongs to at least 700 person. Two tables join by a mapping table Person_Variant with n-n relationship.
I want to transform it to Elasticsearch for faster query text. But I also want to some aggregate and filter on Web UI like this: when filter one property of Person or Variant, Web UI update number of Person meet the filter and number of variant from matched Person; and Web UI update some aggregate value of properties of Person, Variant. Data does not need to update, if it change I can index again.
I used nested document design: Each record in Person Index has list of nested Variant which contains only filter fields, and each record in Variant Index has list of nested Person which contains only filter fields. It still to large and not efficient.
Do you have meet any problem like this? Or Can you give some advice?
Thank you.
Have you considered denormalizing and storing a document per person variant combination? This should allow you to apply filtering and use cardinality aggregations to calculate number of persons with a certain variant etc.
I may help if you could provide some additional information about the average cardinality between the entities and the different ways you need to query them. Information about how frequently entities are changing or being added would also be useful.
I have updated my question. Each person have 5 millions variants and each variant belongs to at least 700 persons.
Data maybe not need to update. It's indexed only one after that I only search on it.
I would recommend trying the flat approach and see how it works. This is basically one document per entry in your mapping table. As far as I can tell this looks like a very bad fit for nested documents.
How did this approach compare to the other one? It would help if you could quantify how query performance compared to the other approach. How many documents did it result in? How much data on disk you end up with? How many shards did you use?
Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant
logo are trademarks of the
Apache Software Foundation
in the United States and/or other countries.