I have per-customer static indices of up to 20M documents and 200GB in size, which form the basis for our search engine. Queries are always executed against a single index. Cross-index search is not used.
Now I would like to keep track of some user-generated data, such as manual tags and automatic timestamps when a document was last viewed in our application. I would like to be able to query on this data as well, making my storage options rather limited. A typical index might have up to 1000 users using the system viewing documents and making tags.
I can imagine that ES does not handle a constant stream of updates very well, considering it (re)indexes the entire document whenever an update is sent, let alone the version conflicts caused by concurrent users viewing the same document.
What would be a good way to approach this problem? Focusing on last-viewed first, I could represent it as a nested field of [user_id, timestamp]. But it would still re-index the entire document on update, right? Even if it is just a nested field which gets updated ( = separate Lucene document).
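For reference, the nested layout described above could look roughly like the sketch below. The field name `last_viewed`, its sub-field names and the helper `viewed_by_since` are assumptions made for illustration. The dicts are plain request bodies and are not sent anywhere.

```python
# Hypothetical mapping fragment: one nested entry per user who viewed the document.
NESTED_MAPPING = {
    "properties": {
        "last_viewed": {
            "type": "nested",
            "properties": {
                "user_id": {"type": "keyword"},
                "timestamp": {"type": "date"},
            },
        }
    }
}


def viewed_by_since(user_id: str, since: str) -> dict:
    """Build a query body matching documents that `user_id` viewed at or after `since`."""
    return {
        "query": {
            "nested": {
                "path": "last_viewed",
                "query": {
                    "bool": {
                        # Both conditions must hold on the same nested entry.
                        "filter": [
                            {"term": {"last_viewed.user_id": user_id}},
                            {"range": {"last_viewed.timestamp": {"gte": since}}},
                        ]
                    }
                },
            }
        }
    }
```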
Storing this data in a separate database would also be an option, although it would limit the query capabilities: results from the two stores would have to be compared and pruned at the application level. This would also reduce performance.
It may not be much of an issue to incur the cost of updating the full document, but that depends on how frequent your updates are. If it takes a user about 10 seconds to view each document and all 1000 users are active at once, you're talking on the order of 100 updates/second, which isn't crazy. Even 1000 updates/second isn't crazy. If simplicity of the data model is your primary driver, you may want to benchmark with something like Rally to see whether the simplest data model is performant enough for you.
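The estimate above is simple arithmetic, sketched here so other numbers can be plugged in. The function name is hypothetical, and the calculation assumes every user is viewing documents at the same time, which makes it an upper bound rather than a measured rate.

```python
def updates_per_second(active_users: int, seconds_per_view: float) -> float:
    """Upper-bound update rate if every active user triggers one update per view."""
    return active_users / seconds_per_view


# 1000 concurrent users, each spending about 10 seconds per document.
rate = updates_per_second(1000, 10)
print(rate)
```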
Generally, storing this sort of click-stream information in a separate, append-only index and doing client-side logic will often be more performant and may actually buy you more than you'd get by embedding the data inside the documents. For example, adding "last viewed" to the model gives you a single view date, whereas storing the entire view history lets you answer questions like "has the user viewed the document more than 3 times in the past 24 hours", which may have some additional business benefit for some users.
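As a rough sketch of that kind of question against an append-only view index: the request body below assumes one small document per view event with hypothetical fields `user_id`, `doc_id` and `viewed_at`. The helper name and the bucket `size` of 100 are likewise assumptions.

```python
def frequent_views_request(user_id: str, min_views: int = 4, since: str = "now-24h") -> dict:
    """Build a search body listing documents a user viewed at least `min_views` times."""
    return {
        "size": 0,  # only the aggregation is needed, not the individual view events
        "query": {
            "bool": {
                "filter": [
                    {"term": {"user_id": user_id}},
                    {"range": {"viewed_at": {"gte": since}}},
                ]
            }
        },
        "aggs": {
            "viewed_docs": {
                # One bucket per document; buckets with fewer views are dropped.
                "terms": {"field": "doc_id", "min_doc_count": min_views, "size": 100}
            }
        },
    }
```

"More than 3 times" becomes `min_doc_count: 4`. The returned bucket keys would then be used client-side, for example as an ids filter against the main index.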
Our index mappings are often quite complex, sometimes running into actual limitations (max 50 nested fields, for example). This affects our indexing speed greatly, although it is not a major impediment, since updates are generally rare. Since views will be many, I am afraid that this will become problematic pretty fast, running into all kinds of performance issues or version conflicts.
I am quite confident I don't need the entire view history for our business case. Just having the last view date per user is good enough.
Could this be done using the join datatype, which has replaced the parent-child relation? Then we could have many small documents representing the views and other metadata, which can be maintained independently of the bigger parent documents.
Your other idea of having this in a separate index or database might make it very hard to query on this data later. If we need to perform some client side filtering we might get into this loop of [retrieve from ES] > [prune using last-viewed DB] > [retrieve more from ES to fill page size] > [prune those] > .. etc.
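The loop described above can be sketched as follows. `fetch_batch` stands in for a paged ES query and `keep` for a lookup in the last-viewed store; both are hypothetical callables, and the in-memory data at the bottom only demonstrates the control flow.

```python
from typing import Callable, List, Tuple


def fill_page(
    fetch_batch: Callable[[int, int], List[int]],
    keep: Callable[[int], bool],
    page_size: int,
    batch_size: int = 50,
    max_rounds: int = 20,
) -> Tuple[List[int], int]:
    """Fetch, prune, and fetch again until the page is full or hits run out."""
    page: List[int] = []
    offset = 0
    rounds = 0
    while rounds < max_rounds:
        batch = fetch_batch(offset, batch_size)  # [retrieve from ES]
        if not batch:
            break
        rounds += 1
        offset += len(batch)
        for doc_id in batch:
            if keep(doc_id):  # [prune using last-viewed DB]
                page.append(doc_id)
                if len(page) == page_size:
                    return page, rounds
    return page, rounds


# Stand-in data: 100 hits, of which only the odd ids survive pruning.
hits = list(range(100))
page, rounds = fill_page(
    lambda offset, size: hits[offset:offset + size],
    lambda doc_id: doc_id % 2 == 1,
    page_size=10,
    batch_size=10,
)
```

The number of round trips grows as the pruning gets more selective, which is the cost being worried about here.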
I will have a look into Rally to do some performance analysis of our current data structures to see where our bottleneck is.
Yeah, the join datatype allows for independent updates of parent and child documents. You just need to be aware that with the join type you pay the cost at query time, so if you run lots of joining queries, you may want to consider whether that cost is acceptable. Your queries, indexing load and business case are all going to be somewhat unique to you, so I'd just benchmark each of these options before making a decision. I would personally lean towards storing the "full" logs (even if they're not used right now) if this is a system whose use cases you expect to expand over time. It's hard to get data back once you choose not to store it.
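A minimal sketch of what the join-based layout might look like, expressed as plain request bodies. The join field name `relation`, the relation names `document` and `view`, the child fields and the helper functions are all assumptions for illustration.

```python
# Hypothetical mapping fragment: big "document" parents, small "view" children.
JOIN_MAPPING = {
    "properties": {
        "relation": {"type": "join", "relations": {"document": "view"}},
        "user_id": {"type": "keyword"},
        "viewed_at": {"type": "date"},
    }
}


def view_child(parent_id: str, user_id: str, viewed_at: str) -> dict:
    """Source of a small child document recording one user's view of a parent."""
    return {
        "user_id": user_id,
        "viewed_at": viewed_at,
        "relation": {"name": "view", "parent": parent_id},
    }


def parents_viewed_by(user_id: str, since: str) -> dict:
    """Query body: parent documents having a matching view child (cost paid at query time)."""
    return {
        "query": {
            "has_child": {
                "type": "view",
                "query": {
                    "bool": {
                        "filter": [
                            {"term": {"user_id": user_id}},
                            {"range": {"viewed_at": {"gte": since}}},
                        ]
                    }
                },
            }
        }
    }
```

Child documents have to be indexed with routing set to the parent's id so that they land on the same shard as the parent.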
True about query time cost, although you'd only pay that if you actually query or aggregate on these join fields, right? I think that is acceptable for our use case.
How does the newer version of ES / Lucene handle sparsity? I've read that it is better to the point of acceptable for these use cases. This is important, since we'd only store 1 or 2 fields for this user metadata while the parent document could have hundreds of complex field mappings.
You're right that in this setup it is trivial to store all and not just the last 'viewed' metadata for potential later usage. I'll consider that!