Merging search results by key

I've got a tough one, which I think can only be solved by client code, but here goes:

My index has 4 documents:

{"_id": "user1__field1", "_source": {"object_id": "user1", "field_name": "field1", "value_int": 123}}
{"_id": "user1__field2", "_source": {"object_id": "user1", "field_name": "field2", "value_str": "hello"}}
{"_id": "user2__field1", "_source": {"object_id": "user2", "field_name": "field1", "value_int": 456}}
{"_id": "user2__field2", "_source": {"object_id": "user2", "field_name": "field2", "value_str": "world"}}

For many reasons, every user/field pair is a separate document, unlike most use cases, where all fields for a user are in one document.

I'd like to perform a search for all of user1's fields, but get all results in one document, instead of 2. When I search for {"query": {"term": {"object_id": "user1"}}}, I obviously get 2 results.

I'd like to run a search that will return something like

    "object_id": "user1",
    "field1": 123,
    "field2": "hello"

In other words, the search would take the value_* from each document and use that document's "field_name" as the key for the value. Crazy, I know, but is it possible?
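For reference, the client-side merge mentioned at the top could look roughly like the sketch below. It assumes `hits` is the list found under `hits.hits` in the search response and that each document carries exactly one `value_*` key; the function name `merge_hits_by_object` is made up for illustration.

```python
def merge_hits_by_object(hits):
    """Collapse one-document-per-field hits into one dict per object_id."""
    merged = {}
    for hit in hits:
        src = hit["_source"]
        # Create the merged document for this object the first time we see it.
        doc = merged.setdefault(src["object_id"], {"object_id": src["object_id"]})
        # Assumption: the value lives under a key starting with "value_".
        for key, value in src.items():
            if key.startswith("value_"):
                doc[src["field_name"]] = value
    return list(merged.values())


# The two hits returned by the term query on object_id "user1".
hits = [
    {"_id": "user1__field1",
     "_source": {"object_id": "user1", "field_name": "field1", "value_int": 123}},
    {"_id": "user1__field2",
     "_source": {"object_id": "user1", "field_name": "field2", "value_str": "hello"}},
]

result = merge_hits_by_object(hits)
print(result)
```

Because the merge is keyed on `object_id`, the same function also works when the query matches several users at once; it then returns one merged dict per user.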

You need to solve that problem at index time IMO.

Not sure what you mean. Because I have many sources of data updating many fields, I can't have all fields in one document, so I split every field into its own document.

This does not really make sense to me. Could you please explain the rationale behind this? Have you considered using scripted updates?

I have. It doesn't work that well for me. The details are:

I need to store a bunch of fields for each user, about 100 of them. I have many data science models that run at different times (some of them run at the same time), each updating a different field for the same set of users.

If I store all fields for each user in one document, I'd be updating (reindexing) that document quite often, which to the best of my understanding is discouraged.

On top of that, I'm also using external versioning to deal with out-of-order updates to specific fields: the version of each document is the time at which the field's value was generated. If two scripts generate different values for the same field (currently the same document), the latest value is guaranteed to be the one that ends up indexed.
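To make the out-of-order guarantee concrete, here is a small in-memory simulation of the external-versioning rule: a write is accepted only if its version is strictly greater than the stored one, otherwise it is rejected as a version conflict. This is only a sketch of the semantics, not Elasticsearch client code; `FieldStore`, `VersionConflict`, and the timestamps are hypothetical.

```python
class VersionConflict(Exception):
    """Raised when a write carries a version that is not newer than the stored one."""


class FieldStore:
    """In-memory stand-in for an index that uses external versioning."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, source)

    def index(self, doc_id, source, version):
        current = self._docs.get(doc_id)
        # External versioning: accept only strictly greater versions.
        if current is not None and version <= current[0]:
            raise VersionConflict(f"{doc_id}: {version} <= {current[0]}")
        self._docs[doc_id] = (version, source)

    def get(self, doc_id):
        return self._docs[doc_id][1]


store = FieldStore()
# The newer value (generated at t=200) happens to arrive first.
store.index("user1__field1", {"value_int": 123}, version=200)

# The older value (generated at t=100) arrives late and is rejected.
rejected = False
try:
    store.index("user1__field1", {"value_int": 99}, version=100)
except VersionConflict:
    rejected = True

print(store.get("user1__field1"), rejected)
```

With one document per field, a conflict on `user1__field1` never blocks a concurrent write to `user1__field2`, which is the property the split layout is meant to provide.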

When I tried using scripted partial updates, it only worked if I didn't update my document that often. More than 5 updates per second led to 400s and 500s from ES.

Does that even make sense? Does splitting the fields into different documents make sense? (due to frequent field updates)
