I know the subject seems a little bit clumsy. Let me try to explain it with
the exact use case. My team and I are evaluating Elasticsearch for
an eCommerce project. We want our customers to search for the products they
want to buy from the search portal. For this, we want to show our customers
products in some form of grouping in the search results, where each group
represents multiple similar products (e.g. one search result record can be
"iPhone", representing the whole iPhone series).
We know this cannot be achieved with Elasticsearch
(https://github.com/elasticsearch/elasticsearch/issues/256). So we decided
to store the grouped representation of the products in Elasticsearch
instead of the actual products, using a simple criterion: when adding a
product, search for the representative product group based on the product's
attributes; if we find a group, associate the product with it, otherwise
create a new group from the newly added product.
The problem we are facing is that a group that has just been added to
Elasticsearch cannot yet be found by search when the next product is added,
which results in the creation of another, similar group.
Note: we are aware of Elasticsearch's realtime get and faceted search
features, and both seem irrelevant to our problem space.
My question is: can we achieve realtime indexing without using the refresh
API (as it results in a complete refresh of the index), or is it the only
way? And if it is, what will be the performance impact of using the refresh
API?
Elasticsearch is a near-real-time system. A refresh is needed for new
documents to become searchable. To my knowledge, there is no other way
around this for now. On the other hand, a refresh happens every 1 second by
default, and you can change the interval (also on the fly).
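
As a rough sketch of what changing the refresh interval on the fly could
look like: the snippet below only builds the HTTP request for the update
index settings endpoint and does not send it. The host
`http://localhost:9200`, the index name `products`, and the helper name
`refresh_interval_request` are assumptions made for illustration.

```python
import json
from urllib import request


def refresh_interval_request(base_url, index, interval):
    """Build (but do not send) a PUT request that updates refresh_interval.

    base_url and index are placeholders; point them at your own cluster.
    """
    body = json.dumps({"index": {"refresh_interval": interval}}).encode("utf-8")
    return request.Request(
        url=f"{base_url}/{index}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical host and index name, used only for illustration.
req = refresh_interval_request("http://localhost:9200", "products", "5s")

# Sending it requires a running cluster:
# request.urlopen(req)
```

A longer interval reduces refresh overhead during heavy indexing, at the
cost of new documents taking longer to show up in search.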
Maybe it would help if you described in more detail why you need real-time
search in your case. To me it is surprising that you need true real-time
product search on an eCommerce site (what is the rate of index/update
operations, and can your users really tell that the results they see are a
few seconds behind?).
If facets do not apply to your case (btw, why?), you might look at the
nested type or, maybe better, at parent/child documents.
After reading your email several times, it sounds to me as if you might
already be using parent/child, creating parent documents on the fly while
indexing children, and the problem could be that you are not able to check
whether a particular parent already exists because of the refresh issue.
Is that the case?
We ran into exactly the same issue and use the excellent patch for
supporting result grouping by Martijn van Groningen (it has been running
in production for over a year). The only limitation is that you need
to use one shard (so unless you have millions of products, this should
not pose a problem). We did not group our products by series but
rather by identical images. This means that products such as clothes
that share the same color but differ in size are grouped into one
search result (a blue shirt in S/L/XL is grouped; the same shirt in red
is not put into that group, but all red shirts in all sizes are
grouped). You can see the patch in action at http://www.lusini.de
We have implemented a streaming JSON river, which updates the products
every 30 seconds (and we do not need any realtime indexing features).
Thanks, guys, for giving your insight into our problem.
@Lukas
Yes, you understood our problem correctly. The reason faceted search is
not helpful is that it gives you counts of the tags you are interested in,
like counts of different demographics or occupations across different user
profiles, as described in the Elasticsearch example for faceted search. It
is not meant for grouping products in the way Alexander mentioned in his
reply.
The reason we need real-time indexing is that we want to create the
parent on the fly if it does not exist and then associate the child
*right after* the creation of the parent.
@Alexander
I had also seen the Elasticsearch grouping patch, but like you said, it is
limited to one shard.
We are actually building a cross-store solution, in which each individual
store posts its products to our solution and customers then search for
products across stores and are (somehow) taken to a specific store.
Different stores can submit similar products, which is why we want our
search results to show similar products represented by a single
representative group, e.g. one iPhone 4S will represent all the iPhone 4S
products submitted from different stores.
We had planned to use a scheduler-based approach to avoid the real-time
indexing problem. But we got caught in the same loop again, because we want
the representative product to be created at the time a new product is
added if no group of a similar nature is found.
If your product category ID is predictable/deterministic (for example,
you know that iPhone5 would go to the "iphone" category), then I think you
can still do it, because you could use realtime get to check whether the
product group exists, and then you should be fine creating groups on
the fly. Have you tried this approach?
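
To make the idea concrete, here is a minimal sketch of the "deterministic
ID plus get" approach. Everything in it is an assumption for illustration:
the `brand`/`model`/`sku` fields, the hashing scheme, and `InMemoryStore`,
which is only a stand-in for a get-by-ID lookup and not an Elasticsearch
client.

```python
import hashlib


def group_id(brand, model):
    """Derive a deterministic group ID from normalized product attributes."""
    key = "|".join(part.strip().lower() for part in (brand, model))
    return hashlib.sha1(key.encode("utf-8")).hexdigest()


class InMemoryStore:
    """Hypothetical stand-in for an index with get-by-ID semantics."""

    def __init__(self):
        self._docs = {}

    def __len__(self):
        return len(self._docs)

    def get(self, doc_id):
        # Returns the stored document, or None if the ID is unknown.
        return self._docs.get(doc_id)

    def create(self, doc_id, doc):
        # Create-only write: refuses to overwrite an existing document.
        if doc_id in self._docs:
            return False
        self._docs[doc_id] = doc
        return True


def add_product(store, product):
    """Attach a product to its group, creating the group if it is missing."""
    gid = group_id(product["brand"], product["model"])
    if store.get(gid) is None:
        store.create(gid, {"brand": product["brand"],
                           "model": product["model"],
                           "products": []})
    store.get(gid)["products"].append(product["sku"])
    return gid


store = InMemoryStore()
first = add_product(store, {"brand": "Apple", "model": "iPhone 4S",
                            "sku": "store1-001"})
second = add_product(store, {"brand": " apple ", "model": "iphone 4s",
                             "sku": "store2-042"})
```

Both products resolve to the same group ID, so no search (and therefore no
refresh) is needed to find the existing group. Against a real cluster, the
existence check would be a get by ID, and a create-only write (for example
the index API's `op_type=create`) could guard against two writers creating
the same group at once.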
But if you only need to find out whether a document exists given its ID,
that is different from searching: it uses the get interface.
"By default, the get API is realtime, and is not affected by the refresh
rate of the index (when data will become visible for search)."
To complete the thread called "writing parents and children docs" from 2
weeks ago: I implemented this and it seems to work fine. In fact, it was
Lukáš who pointed out to me that get works differently from search and
claims to be real time.