We have a PoC that has been running successfully in the .Net environment with a SQL Server back-end. The system manages several websites, each with its own configuration data set, all currently managed under one tenant. The configuration settings are read by a data spider that aggregates filtered RSS data into a data mart.
In the Windows/.Net world, we have built a Windows service that reads the configs, crawls the sources and writes the data to SQL Server. The data is displayed through a website that supports free-text searches and predefined, link-based searches.
Seeing the benefits of ElasticSearch, I am now looking at how to build a similar system under the ES umbrella and would appreciate some guidance on this, with the ultimate goal of using ElasticSearch for Hadoop on the data.
As I am new to this forum, please let me know if I should distribute the following questions into multiple topics:
In my research I have found that some people will still use SQL Server (or another database) for their configuration parameters, user accounts, tenant management, etc., then use ES as the data-store. If we are using ES, I would prefer to stick to one database. However, as I am new to this, I would appreciate the advice of those with experience.
The new platform will be a multi-tenant platform, with a tenant being able to create multiple connected or disconnected data-sets. In relation to indices, would each tenant get their own index, or should each data-set be in its own index?
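To make the two options concrete, here is a rough sketch of how the index names might look under each scheme. The `tenant-` prefix and the helper names are my own assumptions, not an ES convention; the only hard rule I am relying on is that index names must be lowercase.

```python
import re


def _slug(value):
    """Lowercase the value and collapse anything outside a-z/0-9 into '-'."""
    return re.sub(r"[^a-z0-9]+", "-", value.lower()).strip("-")


def index_per_tenant(tenant):
    """Option A: one index per tenant; data-sets would be told apart by a field."""
    return "tenant-" + _slug(tenant)


def index_per_dataset(tenant, dataset):
    """Option B: one index per (tenant, data-set) pair."""
    return "tenant-{}-{}".format(_slug(tenant), _slug(dataset))


print(index_per_tenant("Acme Corp"))
print(index_per_dataset("Acme Corp", "RSS Feeds #1"))
```

With option B, a tenant's connected data-sets could presumably still be searched together through a wildcard such as `tenant-acme-corp-*`.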
Our crawling service currently reads configuration parameters and performs scheduled crawls, consuming RSS data and depositing it into the SQL Server database. As the initial architecture was in .Net, I have considered modifying the current service to use one of ElasticSearch's .Net clients to populate the ES database. However, after doing more research, it appears that Nutch is the standard for crawling and getting data into ES.
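For reference, this is roughly what I imagine the modified service would send to ES, whichever client ends up doing it. It is only a sketch of a bulk API request body (one action line followed by one document line, newline-delimited). The item fields (`link`, `title`) and the index name are placeholders, and older ES versions may also expect a `_type` in the action line.

```python
import hashlib
import json


def bulk_body(index, items):
    """Build a newline-delimited bulk request body for a list of RSS items.

    The document id is a hash of the item's link, so re-crawling the same
    item overwrites the existing document instead of creating a duplicate.
    """
    lines = []
    for item in items:
        doc_id = hashlib.sha1(item["link"].encode("utf-8")).hexdigest()
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(item))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n" if lines else ""


items = [
    {"link": "http://example.com/a", "title": "First item"},
    {"link": "http://example.com/b", "title": "Second item"},
]
print(bulk_body("tenant-acme-corp", items))
```

The resulting string would be POSTed to the `_bulk` endpoint by the crawling service.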
The issue with crawling is that your IP address can get blocked, and I am curious to know how large-scale crawlers get around this. Suggestions?
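The only mitigation I can think of on our side is to throttle requests per host so we never hammer a single source. A minimal sketch of that idea follows; the class name and the 10-second interval are made up for illustration, and this is not how any particular crawler implements it.

```python
from urllib.parse import urlsplit


class HostThrottle:
    """Tracks, per host, the earliest time the next request may be sent."""

    def __init__(self, min_interval_seconds):
        self.min_interval = min_interval_seconds
        self._next_allowed = {}

    def wait_time(self, url, now):
        """Return how many seconds to wait before fetching `url`.

        `now` is the current time in seconds (e.g. time.monotonic()).
        Calling this also reserves the slot for the request.
        """
        host = urlsplit(url).netloc.lower()
        ready_at = self._next_allowed.get(host, now)
        delay = max(0.0, ready_at - now)
        self._next_allowed[host] = max(ready_at, now) + self.min_interval
        return delay


throttle = HostThrottle(min_interval_seconds=10)
print(throttle.wait_time("http://a.example/feed", now=100.0))   # first hit on this host
print(throttle.wait_time("http://a.example/other", now=103.0))  # same host, too soon
print(throttle.wait_time("http://b.example/feed", now=103.0))   # different host
```

The crawler would sleep for the returned delay before each fetch, so feeds on different hosts are not slowed down by each other.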
Accessing Data / Visualization
At a high level, I was wondering whether it is possible to expose Kibana to various tenants, giving each of them access to only their own data, both real-time and historical. Is this possible?
If we implement ES with a .Net wrapper, we can maintain our entire ES cluster in a secured IP environment on our VPC with API access only from our specific web servers.
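To illustrate what I mean by a wrapper: the web servers would never pass a raw query to ES, but would always wrap the user's search in a tenant filter. The sketch below builds such a query body. The `tenant_id` field is an assumption about how we would tag documents, and the `bool`/`filter` form is the one used by recent ES versions (older versions used a different filtered-query syntax).

```python
import json


def tenant_search_body(tenant_id, free_text):
    """Wrap a user's free-text search so it can only match one tenant's documents."""
    return {
        "query": {
            "bool": {
                # The user's search terms, scored as usual.
                "must": [{"simple_query_string": {"query": free_text}}],
                # Non-scoring restriction that the caller cannot override.
                "filter": [{"term": {"tenant_id": tenant_id}}],
            }
        }
    }


print(json.dumps(tenant_search_body("acme", "solar energy"), indent=2))
```

Because the tenant id comes from the authenticated session on our web servers rather than from the request, a tenant should not be able to search outside their own data as long as nothing else can reach the cluster.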
I realize that Shield offers some additional security and role-based benefits. However, is it necessary from the get-go?
I appreciate any assistance/discussions that follow.