Atomic insert/delete/update on multiple indices

Hi

I struggle with large documents in an index.
When we want to extract something via _source, it is slow, since the entire source has to be parsed.
We could use stored fields, but this seems like a bad solution when we would have to make many fields stored, and use tricks to preserve object relations.

An idea is therefore to have two indices: one with the full documents, and one with only part of each document.
That way we can search in the full index, and use the IDs to extract the documents from the partial index.
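A minimal sketch of that two-index idea (index, field, and ID names are all hypothetical here): search the full index returning only hit IDs, then fetch the slimmed-down documents from the partial index by ID.

```python
# Hypothetical sketch, not a definitive implementation. Request bodies
# are plain dicts of the kind you would send to the search and mget APIs.

def build_search_request(query_text):
    """Search request against the full index; we only need the hit IDs."""
    return {
        "query": {"match": {"body": query_text}},  # "body" is a made-up field
        "_source": False,  # skip the large _source entirely on this index
        "size": 10,
    }

def build_fetch_request(ids):
    """Multi-get style request against the partial index for those IDs."""
    return {"ids": ids}

search_req = build_search_request("example query")
fetch_req = build_fetch_request(["doc-1", "doc-2"])
```

The catch, as discussed below, is that nothing keeps the two indices in sync atomically.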

This, however, creates a problem with the atomicity of an insert/delete/update.

I have not been able to find any documentation on this problem, so maybe someone here knows a solution to either

  • Large documents
  • Atomicity on multiple indices

Best regards
Jens

There is no atomicity across indices, or even across two documents within the same index.

You would either have to solve the indexing/atomicity issue within the client, or go with the big-document approach. Out of curiosity: what document size are we talking about here?

The index contains ~40,000 documents and is ~70GB.
So the documents are on average ~1.5-2MB.

The query here usually takes ~200-300ms.
The same query on an index with ~1.1 million documents at 230GB (~0.2MB per document) only takes ~20ms.

The query uses fetch_source and returns 10 documents.
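As a quick sanity check on those averages:

```python
# Average document sizes implied by the numbers above (MB per document).
avg_large_mb = 70 * 1024 / 40_000       # 70GB over 40,000 docs
avg_small_mb = 230 * 1024 / 1_100_000   # 230GB over 1.1 million docs

print(round(avg_large_mb, 2))  # roughly 1.8 MB, in the ~1.5-2MB range
print(round(avg_small_mb, 2))  # roughly 0.2 MB
```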

The first thing I'd reach for is source filtering. It isn't nearly as efficient a solution as the second index, but it might be good enough and then you don't have to think about atomicity stuff.
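For reference, source filtering is just an include list on `_source` in the search body (field names below are made up). Elasticsearch still decompresses and parses the whole `_source` of each hit, but only the listed fields come back over the wire.

```python
# Minimal source-filtering request body (hypothetical field names).
search_body = {
    "query": {"match_all": {}},
    "_source": ["title", "summary"],  # return only these fields per hit
    "size": 10,
}
```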

Another option, similar to source filtering, is to turn off _source fetching entirely and use doc value fields. That is only really going to be viable if you only have a few fields. Or maybe not: fetching 10 documents isn't too many. But it only works if you want things like numbers and keywords; text fields won't work.
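A sketch of that variant, again with made-up field names: `_source` is disabled and the values are read from doc values instead, which only works for field types that have doc values (numbers, keyword, date, and so on, but not `text`).

```python
# Disable _source and read a couple of fields from doc values instead.
search_body = {
    "query": {"match_all": {}},
    "_source": False,                       # skip decompressing _source
    "docvalue_fields": ["price", "category"],  # must have doc values
    "size": 10,
}
```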

If those don't work you could try marking some fields as stored and just pulling them back. That'll work, but it writes the data twice.
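A rough sketch of the stored-fields route (field names hypothetical): mark the fields as stored in the mapping, then request them with `stored_fields` at search time instead of `_source`. The data is then written twice, once into `_source` and once as the stored field.

```python
# Mapping with a couple of explicitly stored fields.
mapping = {
    "mappings": {
        "properties": {
            "title":   {"type": "text", "store": True},
            "summary": {"type": "text", "store": True},
            "body":    {"type": "text"},  # indexed, but not stored separately
        }
    }
}

# Search request pulling back only the stored fields, skipping _source.
search_body = {
    "query": {"match_all": {}},
    "stored_fields": ["title", "summary"],
}
```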

Beyond that, I think @spinscale's point about checking atomicity stuff on the client side is possible, but hard to think about. I think in the general case it is terrifying, but your use case might be simpler.

Maybe I misunderstand, but .setFetchSource uses source filtering, right?

In that case, that is what we are doing now, and it gives us problems with slow queries, as the documents have to be parsed to find the fields that should be returned.

I have tried making the fields stored, but this seems like a bad solution when we have to make many fields stored and use tricks to preserve object relations, since these are lost in stored fields.
In addition, it is slow to extract from multiple stored fields, compared to a source field in a smaller index.

I have not tested doc_values - maybe this could be a solution.

In this discussion Martijn Van Groningen explains that stored fields should be used for returning field values, while doc_values are for fast retrieval during scripts and queries, which is also how I understood it from the documentation.

Is there a reason you suggest using doc_values as a solution, before using stored fields?
And why is it a problem to use doc_values if we want to retrieve multiple fields?

Doc values are a terrible solution for retrieving values because they are stored field by field rather than doc by doc. Stored fields for a single doc are all stored together which is great if you want them all. But you don't want them all.

So, yeah, doc values are wrong for this, but it looks like stored fields aren't working for you so it is worth a shot, I think. It won't work if you want to retrieve a bunch of fields, but a couple are probably ok, especially if you already use them for things like script scoring - then they'll already be hot.

The big problem with the doc values thing is that it'll be fast when your test data set is small, but it could get drastically slower with more data: because of the page cache, and because it'll put more stress on the disks. And disks don't respond as well to stress as CPUs do.

Okay, so if I want to retrieve multiple documents with few fields, then doc_values is a possibility,
while if I want to retrieve few documents with multiple fields, then stored_fields is a possibility.

I'm not sure I understand the
"Stored fields for a single doc are all stored together which is great if you want them all. But you don't want them all".
Does this mean that if you have 15 fields marked "store=true", it is slower to retrieve one of those fields than if you only had 3 fields marked "store=true" and wished to retrieve the same field?

Probably. All of the stored fields for a document are saved together. Actually the stored fields for many documents are bundled together, then compressed, and then saved. Stored fields will absolutely have fewer disk seeks than doc values, but you have to decompress all of the other stored fields.

And if you want _source then you have to decompress that. And if you have source filtering then you need to parse it as well.

Another thing: _source is itself a stored field. So a big _source means you have to decompress it anyway.
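A toy illustration of the point above (this is not Lucene's actual on-disk format, just the shape of the cost): stored fields for many documents are bundled into a block and compressed together, so reading a single document means decompressing the whole block first, and with source filtering, parsing on top of that.

```python
import json
import zlib

# Toy "stored fields block": many documents concatenated, then compressed
# together, mimicking how stored fields are block-compressed.
docs = [{"id": i, "body": "x" * 1000} for i in range(16)]
block = zlib.compress(json.dumps(docs).encode())

def read_one(block, doc_id):
    # Even to read one document we decompress the entire block...
    all_docs = json.loads(zlib.decompress(block))
    # ...and then parse/scan to pick out the one document we wanted.
    return next(d for d in all_docs if d["id"] == doc_id)

doc = read_one(block, 3)
```

The bigger the other documents (or the other stored fields) in the block, the more wasted decompression work per retrieved document.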

Okay, my understanding of this area of Elasticsearch is getting much better, thanks!
Before I go back and try to use this to come up with new solutions, I have one last clarifying question.

Is it correct that each field is compressed by itself, and that afterwards they are all compressed together?
So no matter which field you retrieve through "fields", it requires two decompressions and one parse?
