Great, thanks.
I will definitely keep that in mind.
Thanks,
Ashwin Sathya
Date: Fri, 6 Sep 2013 14:18:31 -0700
From: zacharyjtong@gmail.com
To: elasticsearch@googlegroups.com
CC: ashwin.sathya@outlook.com
Subject: Re: Design guidance for multi-tenant multi-source indexing
No problem. Let me know if you have any more questions. =)
The only real constant to consider when thinking about scaling (no matter how you organize your data) is the number of primary shards, since this cannot be changed once the index is created. Everything else is very flexible. And even the primary shard situation can be changed if you re-index your data into a new index that has been provisioned with more shards.
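For example, something like this (the index name and counts here are just placeholders, not a recommendation):

# Provision the index with an explicit primary shard count up front,
# since that is the one setting you can't change later:
curl -XPUT localhost:9200/my_new_index -d '{
  "settings": {
    "number_of_shards": 10,
    "number_of_replicas": 1
  }
}'

# Replicas, on the other hand, can be adjusted at any time:
curl -XPUT localhost:9200/my_new_index/_settings -d '{
  "index": { "number_of_replicas": 2 }
}'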
-Zach
On Friday, September 6, 2013 4:57:51 PM UTC-4, R Ashwin Sathya wrote:
Great, thanks for the detailed explanation, Zachary.
As I said, I am not looking into splitting the data across shards for my experimental purposes.
However, going by the ES docs and what I have read, scaling up and down based on the load each user generates (both read and write) seems to be a well-understood problem, and I will think about how I need to model my system. I will definitely watch the talk before I proceed with such serious thinking.
Thanks again for the help.
Thanks,
Ashwin Sathya
Date: Fri, 6 Sep 2013 13:46:54 -0700
From: zachar...@gmail.com
To: elasti...@googlegroups.com
CC: ashwin...@outlook.com
Subject: Re: Design guidance for multi-tenant multi-source indexing
Yep, you're correct - if you want to back up a particular user, you'll have to implement a selective backup process that pulls their data out of the index. You could do this fairly easily with a Scan/Scroll API call and a filtered query. Restoring data will be a little more painful, since you will probably have to perform a Delete-By-Query and then reindex the data.
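As a rough sketch, assuming the tenant is identified by a field called "user" (the index, field, and values here are just illustrative):

# Start a scan over one user's documents only:
curl -XGET 'localhost:9200/my_index/_search?search_type=scan&scroll=5m&size=100' -d '{
  "query": {
    "filtered": {
      "query": { "match_all": {} },
      "filter": { "term": { "user": "user_a" } }
    }
  }
}'

# Then keep pulling batches with the _scroll_id returned by each
# response, until no more hits come back:
curl -XGET 'localhost:9200/_search/scroll?scroll=5m' -d '<scroll_id from previous response>'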
You can also do an index-per-user, especially if you know you will have at most 30 users (or some other relatively small number, e.g. at most hundreds, not thousands). If each user has their own index, you can internally partition the data however you like. Searching between a bunch of types (Type1_Date1, etc) is equivalent to searching one type and applying a filter on a date field. Internally types are managed by filters on "special" fields, so the process/performance is basically identical.
A perk of doing index-per-user is that you can scale individual indexes to meet the needs of individual users. So if one user is very large and requires a lot of capacity, you can provision their index with 10 shards, while another user only needs 2 shards for their index. A downside is that removing old data will be more expensive, since deleting documents individually is much slower than dropping an entire index. Another disadvantage is that you are somewhat limited in the number of users you can add - at some point adding more indexes becomes too much overhead.
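To make the deletion tradeoff concrete (the index names and timestamp field are illustrative):

# With time-based indexes, expiring a day of data is one cheap operation:
curl -XDELETE localhost:9200/data_08_07_2013

# With index-per-user, removing old documents means a Delete-By-Query
# inside each user's index, which is much more expensive:
curl -XDELETE 'localhost:9200/user_a/_query' -d '{
  "range": { "timestamp": { "lt": "2013-08-07" } }
}'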
Shay has a very good talk describing two "data flows" - user data flow and time-based data flow - which you may find helpful: http://vimeo.com/44716955
-Zach
On Friday, September 6, 2013 4:29:11 PM UTC-4, R Ashwin Sathya wrote:
Thanks Zachary,
Timed indexes would fit perfectly for my scenario, particularly for modelling costs (say, indexes older than N days get deleted, and things like that).
The only downside I see is that it won't be as easy to recover indexes on a per-user basis. For example, if I isolate the user data into separate indexes, I can configure backups for particular users and restore them at will. In the shared case, I will have to back up selectively?
I am also having to test against another search technology in parallel, purely for experimental purposes.
The setup there is as follows.
1 Shard -> Mapped over 1 master and 2 replica nodes
30 users -> Each user has an index
2 Types -> Two types of document data (Type1, Type2)
30 days' worth of data -> I basically have to accommodate them in the same index, and I am quite unsure how to achieve this parity. In my other search system, we have a concept of clear tables, so I have named my tables Type1_Date1, Type2_Date1, ... and so on. An equivalent here would be to create the table names as types, right?
From what I am learning about ES, I understand that the above severely underutilizes ES's true power to scale. But as I mentioned, it is for benchmarking and other purposes.
Thanks,
Ashwin Sathya
Date: Fri, 6 Sep 2013 13:20:41 -0700
From: zachar...@gmail.com
To: elasti...@googlegroups.com
CC: ashwin...@outlook.com
Subject: Re: Design guidance for multi-tenant multi-source indexing
I would probably build time-based indexes. For example, an index per day (or week, or hour...whatever unit of time seems appropriate for your setup). Your documents would then contain both a timestamp field and a source field.
When a user searches a time range, you can search only the range of indexes that match the requested time. E.g. if you store daily indexes and your user requests all values in the last two days, you perform a search on just those two indexes. It is very easy to search over multiple indexes at the same time; you simply concatenate them together in the URI with a comma:
curl -XGET localhost:9200/data_09_06_2013,data_09_05_2013/_search -d '{}'
Now, since you have multiple users sharing the same index, you perform a filtered query so that results are filtered by the source field. If you need finer control on time ranges (e.g. a particular hour in a particular day), you can just include a Range filter along with the Term filter on the source field.
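For example, something like this (field names and values are just for illustration):

# Match one source, restricted to a single hour of the day:
curl -XGET localhost:9200/data_09_06_2013/_search -d '{
  "query": {
    "filtered": {
      "query": { "match_all": {} },
      "filter": {
        "and": [
          { "term": { "source": "source_a" } },
          { "range": { "timestamp": {
              "gte": "2013-09-06T10:00:00",
              "lt":  "2013-09-06T11:00:00"
          } } }
        ]
      }
    }
  }
}'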
Make sense?
-Zach
On Friday, September 6, 2013 11:02:48 AM UTC-4, R Ashwin Sathya wrote:
Hi,
I am detailing my scenario here.
I need to support a number of users who have data from multiple sources. The nature of the data is chronological: it is both real-time and timestamped.
The search capability that I need to provide the user is to search over a particular source and over a particular time range (say, a few hours to a few days).
I am not able to grasp how to map the index concepts to the design of my data layer. Any suggestions/guidance?
Thanks,
Ashwin Sathya