Mixing shared and private data in one query

We have a seemingly simple data model consisting of one Shared database (about 1.4 million documents) and one Private database, in which a field indicates which user the private data belongs to. However, getting Elasticsearch to work with this has been harder than expected.

A user starts with no documents in Private. If the user wants to change any information on a document in Shared, the changed information is stored in Private, and the old information should no longer be searchable for that specific user.

Every night the information in Shared is updated (about 10,000 changes/night).

EXAMPLE

User 102 wants to change the name of document 1 from A to New. The result in the database is:

       [Shared]
| ID | Name | Number | City |
|  1 | A    | 3      | AA   |
|  2 | B    | 4      | BB   |

       [Private]
| Name | Number | City | User | Origin |
| New  |        |      | 102  | 1      |

  1. When user 102 searches for New, he should get an object displaying |New|3|AA| back.
  2. When another user searches for New, there should be no hits.
  3. When user 102 searches for AA, he should get an object displaying |New|3|AA| back.
  4. When another user searches for AA, he should get an object displaying |A|3|AA| back.
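
To make the four requirements concrete, here is a small in-memory sketch in plain Python. It does not use Elasticsearch; it only models the expected behaviour so that a candidate solution can be checked against it. The names `SHARED`, `PRIVATE`, `effective_view` and `search` are made up for this illustration, and matching is simplified to exact equality on a field value.

```python
# Reference model of the desired search semantics (no Elasticsearch involved).
SHARED = {
    1: {"Name": "A", "Number": "3", "City": "AA"},
    2: {"Name": "B", "Number": "4", "City": "BB"},
}

# Private rows hold only the changed fields, plus the owner and the Shared ID.
PRIVATE = [
    {"User": 102, "Origin": 1, "Name": "New"},
]

FIELDS = ("Name", "Number", "City")


def effective_view(user):
    """Return Shared as this user sees it, with the user's overrides applied."""
    view = {doc_id: dict(doc) for doc_id, doc in SHARED.items()}
    for row in PRIVATE:
        if row["User"] != user:
            continue
        for field in FIELDS:
            if row.get(field):  # empty or missing means "not overridden"
                view[row["Origin"]][field] = row[field]
    return view


def search(user, term):
    """Exact-match search over the user's effective view."""
    return [
        tuple(doc[f] for f in FIELDS)
        for doc in effective_view(user).values()
        if term in doc.values()
    ]
```

With this model, `search(102, "New")` and `search(102, "AA")` both return the merged document, while any other user gets no hit for New and the original document for AA. Searching for the old name A as user 102 returns nothing, which is the "old information should not be searchable" rule.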

SUGGESTED SOLUTIONS

Here are some solutions we tried. Feel free to suggest new ones or tweaks to these.

Application side-joins
If the user makes one change, copy all information on that document to Private. Search both Private and Shared independently and remove (in the application, outside of Elasticsearch) all hits in Shared that also exist in Private. If we only get a hit in Shared but there is a related document in Private, that hit should not be displayed.
Problems

  1. We usually just want the top 5, but the result is sometimes several thousand hits long. To ensure that our filtering is correct we need to fetch all of the data.
  2. Since we copy all the document information from Shared to Private there is redundant data in Private that needs to be updated if something changes in Shared. This takes a long time.
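
One possible tweak for problem 1, sketched below in plain Python: if the application first fetches the set of Shared IDs that the user has overridden (a filter on User in Private, which should be small), it can read the Shared hits lazily in score order and stop as soon as it has enough survivors, instead of fetching everything. The function names, the hit layout (`id`, `score`) and the `overridden_ids` set are assumptions for this illustration. The sketch also assumes that scores from Shared and Private can be compared directly, which may not hold because the two are scored separately.

```python
import heapq
from itertools import islice


def surviving_shared_hits(shared_hits, overridden_ids):
    """Lazily drop Shared hits that this user has overridden in Private."""
    for hit in shared_hits:
        if hit["id"] not in overridden_ids:
            yield hit


def top_k(private_hits, shared_hits, overridden_ids, k=5):
    """Merge both result lists and keep the k best-scoring hits.

    shared_hits must be ordered by descending score. It may be a generator
    that fetches further result pages on demand, so only as many Shared hits
    are consumed as are needed to find k survivors.
    """
    shared_top = list(islice(surviving_shared_hits(shared_hits, overridden_ids), k))
    return heapq.nlargest(k, private_hits + shared_top, key=lambda h: h["score"])
```

Because the top k of the merged list can only contain hits from the top k of each side, taking the first k surviving Shared hits is enough. This does nothing for problem 2; the redundant copies in Private still have to be refreshed when Shared changes.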

Parent child
Private is a child of Shared. When a field is changed, the change is stored in Private as a child of the document where the change was made.
Problems
The ranking does not work. If we change “city”, it should be as if the city in the parent does not exist. It also seems like the “city” in the parent is still included in the tf-idf count.

Nested objects
The info in Private is stored in the document in Shared as a nested object.
Problems
When a nested object is hit, all its siblings are also returned. This means a lot of irrelevant data from other users.
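
A possible tweak, untested against our data: restrict the nested query to the current user's objects, ask for `inner_hits` so that the matching nested objects are reported separately, and exclude the nested field from `_source` so the siblings are left out of the returned document. The sketch below only builds the request body as a Python dict. The nested field name `overrides` and the field names inside it are assumptions for this illustration, and the query only covers searching the overridden values, not the unchanged fields in Shared.

```python
def nested_override_query(user, text, field="Name"):
    """Build a search body that matches only this user's nested overrides."""
    return {
        # Leave the full list of nested objects (all users) out of the hit.
        "_source": {"excludes": ["overrides"]},
        "query": {
            "nested": {
                "path": "overrides",
                "query": {
                    "bool": {
                        # Only nested objects belonging to this user.
                        "filter": [{"term": {"overrides.User": user}}],
                        "must": [{"match": {"overrides." + field: text}}],
                    }
                },
                # Report the nested objects that matched, per hit.
                "inner_hits": {},
            }
        },
    }
```

For example, `nested_override_query(102, "New")` gives a body that could be sent to the search endpoint of the Shared index.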

Data denormalization
All data in Shared is copied to Private when Private is created and only Private is used in search.
Problems

  1. All data in Shared is about 4 GB, and if every user gets a copy that will take too much memory.
  2. Since we copy all the document information from Shared to Private there is redundant data in Private that needs to be updated if something changes in Shared. This takes a long time.

Optimized denormalization (Mark the document as invisible)
When a user makes a change, that document is copied to Private, and a field on the document in Shared records that the document is unsearchable in that specific app.
Problems
Since we copy all the document information from Shared to Private there is redundant data in Private that needs to be updated if something changes in Shared. This takes a long time.
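
For reference, this is roughly the query shape we have in mind for this variant, written as a Python function that builds the request body. It is a sketch under assumptions: the field `hidden_for` on Shared documents (holding the users the document is hidden from) is a made-up name, both databases are assumed to be searched in one request, and only the text fields Name and City are matched. Scores from the two sources may not be directly comparable.

```python
def visible_docs_query(user, text, size=5):
    """Build a search body returning only documents visible to this user."""
    text_match = {"multi_match": {"query": text, "fields": ["Name", "City"]}}

    # Shared documents: must match, must not be hidden for this user.
    # "User" only exists on Private documents, so excluding it keeps this
    # branch from matching other users' private copies.
    shared_branch = {
        "bool": {
            "must": [text_match],
            "must_not": [
                {"term": {"hidden_for": user}},
                {"exists": {"field": "User"}},
            ],
        }
    }

    # Private documents: must match and must belong to this user.
    private_branch = {
        "bool": {
            "must": [text_match],
            "filter": [{"term": {"User": user}}],
        }
    }

    return {
        "size": size,
        "query": {
            "bool": {
                "should": [shared_branch, private_branch],
                "minimum_should_match": 1,
            }
        },
    }
```

With this shape the top 5 can be requested directly, so the paging problem of the application-side join goes away, but the copied documents in Private still have to be updated when Shared changes.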
