How to store product data

Hi.
We have the following requirements:

  • store some product-related data (for millions of products)
  • data is downloaded from different sources (like Google Analytics, ...)
  • we need to make some "cross-source" searches - something like "get sales per category" (sales come from Google Analytics, category from an XML feed), ...
  • some sources provide data only once per day, some more often
  • we also need to store the history indefinitely, but this data is "warm"
  • we need the latest 30 days of history (hot)

Our current solution

  • index 'product'
    • latest data from all sources
    • a simplified 30-day history of some sources (only selected fields - this allows us to quickly calculate sales for the latest 30 days) - stored in a nested field
    • the document ID is set manually (to correlate with an external system)
  • indices 'sources'
    • one index holds data for one month
    • there is one document per product per day (document ID = product_id + date)
    • similar mapping to the product index, except that sources which produce more than one record per product per day are stored in latest_source_xxx (the latest value of the source) and sources_xxx (a nested field with the complete history of the source for that day)
  • we use update scripts (Painless) when inserting data into Elasticsearch (everything goes through the bulk endpoint)
  • these scripts have multiple parts (a simplified sketch follows this list):
    • check that source_xxx.date is newer than the currently stored value (we can manually download historical data for a source, but we must keep only the latest value there)
    • process the simplified 30-day history (append new items / remove items that are too old)
    • calculate sums, averages, etc. for this history
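
For illustration, here is a stripped-down sketch of one such update call (the document ID, parameter names, and cutoff date are made up, the URL format depends on the Elasticsearch version, and in reality everything goes through the bulk endpoint):

POST product/_update/12345
{
  "script": {
    "lang": "painless",
    "params": {
      "doc": {
        "source_analytics": {
          "date": "2019-06-01",
          "clicks": 7,
          "orders": 1,
          "sales": 24.9,
          "conversion": 0.14
        }
      },
      "cutoff": "2019-05-02"
    },
    "source": """
      // 1) keep only the newest source_analytics value
      //    (dates are yyyy-MM-dd strings, so string comparison works)
      if (ctx._source.source_analytics == null
          || params.doc.source_analytics.date.compareTo(ctx._source.source_analytics.date) > 0) {
        ctx._source.source_analytics = params.doc.source_analytics;
      }
      // 2) maintain the simplified 30-day history: replace any entry for
      //    the same day, append the new one, drop entries past the cutoff
      if (ctx._source.history_analytics == null) {
        ctx._source.history_analytics = [];
      }
      ctx._source.history_analytics.removeIf(e -> e.date == params.doc.source_analytics.date);
      ctx._source.history_analytics.add([
        'date': params.doc.source_analytics.date,
        'clicks': params.doc.source_analytics.clicks,
        'orders': params.doc.source_analytics.orders,
        'sales': params.doc.source_analytics.sales
      ]);
      ctx._source.history_analytics.removeIf(e -> e.date.compareTo(params.cutoff) < 0);
      // 3) recalculate the precomputed sums over the retained window
      long clicks = 0; long orders = 0; double sales = 0;
      for (def e : ctx._source.history_analytics) {
        clicks += e.clicks; orders += e.orders; sales += e.sales;
      }
      ctx._source.history_analytics_sum_clicks = clicks;
      ctx._source.history_analytics_sum_orders = orders;
      ctx._source.history_analytics_sum_sales = sales;
    """
  }
}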

The problem is that write requests take too long (anywhere between 2 and 30 seconds). I think our problem needs a different point of view than mine, so I will appreciate every opinion.

I think that if we could get rid of the update scripts, everything would be much faster. But they are currently too important (we can't get the same functionality without them). I thought that using ingest pipelines could partially solve this problem (run some ingest-only node with a lot of CPU), but it looks like pipelines are only for indexing, not for updating documents.
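
For reference, an ingest pipeline with a script processor would look roughly like the sketch below (the pipeline name and the conversion formula are my assumptions). Its ctx is only the incoming document - the currently stored document is not accessible - so the date check and history merge from the update scripts cannot be moved into a pipeline:

PUT _ingest/pipeline/product-updates
{
  "description": "hypothetical pipeline - runs on index operations only, not on updates",
  "processors": [
    {
      "script": {
        "lang": "painless",
        "source": """
          // ctx is only the incoming document; the stored document is not
          // visible here, so merge logic against existing data is impossible
          if (ctx.source_analytics != null && ctx.source_analytics.clicks != null
              && ctx.source_analytics.clicks != 0) {
            // assumed definition: conversion = orders / clicks
            ctx.source_analytics.conversion =
              ctx.source_analytics.orders / (double) ctx.source_analytics.clicks;
          }
        """
      }
    }
  ]
}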

You should look at splitting the product data from the transactions of those products. They are related, but the use cases are different, and storing transactions like this is costly (as you can see).

I can't split them, because I need to make queries with conditions on both sets of data. Maybe I just described the problem wrong. We don't save every transaction separately. We save the data only once per day (we download it once per day from Google Analytics), but there are some types of sources (like source_feed in the following example) which are updated a few times per day.
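
To make the source_feed case concrete: in the 'sources' indices, a document for one product and one day (with an ID like "12345_2019-06-01") would look roughly like this (values made up; as described above, latest_source_feed holds the newest value of the day and sources_feed the complete intraday history):

{
  "source_analytics": {
    "date": "2019-06-01",
    "clicks": 7,
    "orders": 1,
    "sales": 24.9,
    "conversion": 0.14
  },
  "latest_source_feed": {
    "categorytext": "Shoes",
    "date": "2019-06-01 14:00:00",
    "name": "Running shoe X",
    "price_vat": 24.9
  },
  "sources_feed": [
    {
      "categorytext": "Shoes",
      "date": "2019-06-01 06:00:00",
      "name": "Running shoe X",
      "price_vat": 26.9
    },
    {
      "categorytext": "Shoes",
      "date": "2019-06-01 14:00:00",
      "name": "Running shoe X",
      "price_vat": 24.9
    }
  ]
}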

Here is an example mapping:

{
  "mapping": {
    "product": {
      "dynamic": "false",
      "properties": {
        "engine_product_url": {
          "type": "keyword"
        },
        "updated_at": {
          "type": "date",
          "format": "yyyy-MM-dd HH:mm:ss||epoch_millis"
        },
        "history_analytics": {
          "type": "nested",
          "properties": {
            "clicks": {
              "type": "integer"
            },
            "date": {
              "type": "date",
              "format": "yyyy-MM-dd||epoch_millis"
            },
            "orders": {
              "type": "integer"
            },
            "sales": {
              "type": "double"
            }
          }
        },
        "history_analytics_sum_clicks": {
          "type": "integer"
        },
        "history_analytics_sum_orders": {
          "type": "integer"
        },
        "history_analytics_sum_sales": {
          "type": "double"
        },
        "source_analytics": {
          "properties": {
            "clicks": {
              "type": "integer"
            },
            "conversion": {
              "type": "double"
            },
            "date": {
              "type": "date",
              "format": "yyyy-MM-dd||epoch_millis"
            },
            "orders": {
              "type": "integer"
            },
            "sales": {
              "type": "double"
            }
          }
        },
        "source_feed": {
          "properties": {
            "categorytext": {
              "type": "keyword"
            },
            "date": {
              "type": "date",
              "format": "yyyy-MM-dd HH:mm:ss||epoch_millis"
            },
            "name": {
              "type": "keyword",
              "fields": {
                "fulltext": {
                  "type": "text"
                },
                "fulltext-cz": {
                  "type": "text",
                  "analyzer": "czech"
                }
              }
            },
            "price_vat": {
              "type": "double"
            }
          }
        }
      }
    }
  }
}
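
For illustration, a product document under this mapping could look like the following (all values made up; note that the sum fields equal the totals over history_analytics):

{
  "engine_product_url": "https://example.com/product/12345",
  "updated_at": "2019-06-01 14:00:00",
  "history_analytics": [
    { "date": "2019-05-31", "clicks": 12, "orders": 2, "sales": 59.8 },
    { "date": "2019-06-01", "clicks": 7, "orders": 1, "sales": 24.9 }
  ],
  "history_analytics_sum_clicks": 19,
  "history_analytics_sum_orders": 3,
  "history_analytics_sum_sales": 84.7,
  "source_analytics": {
    "date": "2019-06-01",
    "clicks": 7,
    "conversion": 0.14,
    "orders": 1,
    "sales": 24.9
  },
  "source_feed": {
    "categorytext": "Shoes",
    "date": "2019-06-01 14:00:00",
    "name": "Running shoe X",
    "price_vat": 24.9
  }
}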

And then we make queries like these (simplified)...
Yesterday's sales per category:

POST product/_search
{
  "size": 0, 
  "aggs": {
    "categories": {
      "terms": {
        "field": "source_feed.categorytext",
        "size": 10
      },
      "aggs": {
        "sales": {
          "sum": {
            "field": "source_analytics.sales"
          }
        }
      }
    }
  }
}

Or sales for the latest 90 days per category:

POST sources*/_search
{
  "size": 0,
  "query": {
    "range": {
      "source_analytics.date": {
        "gte": "now-90d"
      }
    }
  }, 
  "aggs": {
    "categories": {
      "terms": {
        "field": "latest_source_feed.categorytext",
        "size": 10
      },
      "aggs": {
        "sales": {
          "sum": {
            "field": "source_analytics.sales"
          }
        }
      }
    }
  }
}

So I think it is not possible to split the data... it must be in one document (the real aggregations are much more complex). I also need really fast access to sales/clicks/... sums for the latest 30 days, which is the reason why I store the simplified history in the history_analytics... fields and calculate the sums at write (update) time. It is too slow to fetch the required data from the product index and the sums from other indices ('sources' in this example).
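
To make that last point concrete: thanks to the precomputed sums, a query like "top products by sales in the latest 30 days" is a plain sort on a stored field instead of a nested aggregation over history_analytics:

POST product/_search
{
  "size": 10,
  "_source": ["engine_product_url", "history_analytics_sum_sales"],
  "sort": [
    { "history_analytics_sum_sales": "desc" }
  ]
}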
