Hi.
We have the following requirements:
- store some product-related data (for millions of products)
- data is downloaded from different sources (e.g. Google Analytics, ...)
- we need to run "cross-source" searches - something like "get sales per category" (sales come from Google Analytics, category from an XML feed), ...
- some sources provide data only once per day, some more often
- we also need to keep the history indefinitely, but this data is "warm"
- we need the latest 30 days of history (hot)
Our current solution:
- index `product`
  - latest data from all sources
  - simplified 30-day history of some sources (only selected fields - this allows us to quickly calculate sales for the latest 30 days) - stored in a nested field
  - document id is set manually (to correlate with an external system)
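To make the `product` index layout above concrete, here is a minimal sketch of what one such document might look like. All field names (`latest_source_ga`, `history_30d`, `sales_30d`, ...) are assumptions for illustration, not our actual mapping:

```python
# Hypothetical shape of one document in the `product` index.
product_doc = {
    "_id": "prod-123",                  # id set manually, matches external system
    "latest_source_ga": {               # latest data from Google Analytics
        "date": "2023-05-01",
        "sales": 17,
    },
    "latest_source_feed": {             # latest data from the XML feed
        "date": "2023-05-01",
        "category": "shoes",
    },
    "history_30d": [                    # nested field: simplified 30-day history
        {"date": "2023-04-30", "sales": 12},
        {"date": "2023-05-01", "sales": 17},
    ],
    "sales_30d": 29,                    # precomputed sum over history_30d
}

# the precomputed sum must stay consistent with the nested history
assert product_doc["sales_30d"] == sum(e["sales"] for e in product_doc["history_30d"])
```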
- indices `sources`
  - one index holds data for one month
  - there is one document per product per day (document id = product_id + date)
  - similar mapping to the `product` index, except that sources which produce more than one record per product per day are stored in `latest_source_xxx` (latest value of the source) and `sources_xxx` (nested field with the complete history of the source for that day)
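A quick sketch of the id and index-naming scheme described above (the exact index name pattern is an assumption; only "id = product_id + date" and "one index per month" come from our setup):

```python
def source_doc_id(product_id: str, day: str) -> str:
    """Compose the per-day document id (document id = product_id + date)."""
    return f"{product_id}_{day}"

def monthly_index(day: str) -> str:
    # one index holds data for one month; naming like "sources-2023-05" is assumed
    return "sources-" + day[:7]

print(source_doc_id("prod-123", "2023-05-01"))  # prod-123_2023-05-01
print(monthly_index("2023-05-01"))              # sources-2023-05
```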
- we use update scripts (Painless) when inserting data into Elasticsearch (everything goes through the bulk endpoint)
- these scripts have multiple parts:
  - check that `source_xxx.date` is newer than the currently stored one (we can manually re-download historical data for this source, but we need to keep only the latest value there)
  - process the simplified 30-day history (append new items / remove too-old ones)
  - calculate sums, averages, ... over this history
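For reference, the per-source update logic our Painless scripts perform can be sketched in plain Python like this. Field names (`latest_source_ga`, `history_30d`, `sales_30d_sum`, ...) are illustrative, not the real mapping:

```python
from datetime import date, timedelta

def apply_update(doc: dict, source: str, new_entry: dict, today: date) -> dict:
    """Pure-Python sketch of what the Painless update script does for one source.

    `doc` is the stored product document; `new_entry` is assumed to carry at
    least a "date" (ISO string) and a "sales" value.
    """
    latest_key = f"latest_{source}"

    # 1) keep only the newest value per source (manually re-downloaded
    #    historical data must not overwrite newer values)
    current = doc.get(latest_key)
    if current is None or new_entry["date"] > current["date"]:
        doc[latest_key] = new_entry

    # 2) maintain the simplified 30-day history: append new, drop too-old items
    history = doc.setdefault("history_30d", [])
    history.append({"date": new_entry["date"], "sales": new_entry["sales"]})
    cutoff = (today - timedelta(days=30)).isoformat()
    doc["history_30d"] = [e for e in history if e["date"] >= cutoff]

    # 3) recompute the precomputed aggregates over the window
    sales = [e["sales"] for e in doc["history_30d"]]
    doc["sales_30d_sum"] = sum(sales)
    doc["sales_30d_avg"] = sum(sales) / len(sales) if sales else 0.0
    return doc
```

In the real cluster this runs inside a scripted update per bulk item, which is exactly the expensive part.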
The problem is that write requests take too long (somewhere between 2-30s). I think our problem needs a different point of view than mine, so I will appreciate every opinion.
I think that if we could get rid of the update scripts, everything would be much faster. But they are currently too important (we can't get the same functionality without them). I thought that ingest pipelines could partially solve this problem (run a dedicated ingest-only node with a lot of CPU), but it looks like pipelines only apply to indexing, not to updating documents.