Looking for architectural direction in porting an existing table-based PoC to ES/Hadoop production

baden0x1 · September 5, 2016, 1:42pm

We have a PoC that has been running successfully in the .Net environment with a SQL Server back-end. The system manages several websites, each with its own configuration data set, all currently managed under one tenant. A data spider reads these configuration settings and aggregates filtered RSS data into a data mart.

In the Windows/.Net world, we have built a Windows service that reads the configs, crawls the sources and places the data into SQL Server. The data is displayed through a website that supports free-text searches and pre-defined, link-based searches.

Seeing the benefits of ElasticSearch, I am now looking at how to build a similar system under the ES umbrella and would like some guidance on this, with the ultimate goal of using ElasticSearch for Hadoop on the data.

As I am new to this forum, please let me know if I should distribute the following questions into multiple topics:

Configuration/User info
In my research I have found that some people will still use SQL Server (or another database) for their configuration parameters, user accounts, tenant management, etc., then use ES as the data-store. If we are using ES, I would prefer to stick to one database. However, as I am new to this, I would appreciate the advice of those with experience.

Indices
The new platform will be a multi-tenant platform, with a tenant being able to create multiple connected or disconnected data-sets. In relation to indices, would each tenant get their own index, or should each data-set be in its own index?
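To make the two options concrete, here is a minimal sketch of how the index names could be derived in each case. The function names and the `tenant_dataset` naming convention are my own assumptions for illustration, not an ES requirement; the only ES rule relied on is that index names must be lowercase and avoid spaces and special characters.

```python
import re


def slug(value: str) -> str:
    """Lowercase the value and collapse anything outside a-z/0-9 into a hyphen."""
    cleaned = re.sub(r"[^a-z0-9]+", "-", value.lower()).strip("-")
    if not cleaned:
        raise ValueError(f"cannot build an index name from {value!r}")
    return cleaned


def index_per_tenant(tenant: str) -> str:
    """Option 1 (hypothetical convention): one index holds all of a tenant's data-sets."""
    return slug(tenant)


def index_per_dataset(tenant: str, dataset: str) -> str:
    """Option 2 (hypothetical convention): one index per tenant/data-set pair.

    The underscore separator never appears in a slug, so a wildcard on one
    tenant's prefix cannot accidentally match another tenant's indices.
    """
    return f"{slug(tenant)}_{slug(dataset)}"


def tenant_pattern(tenant: str) -> str:
    """Wildcard pattern covering every data-set index of one tenant (option 2)."""
    return f"{slug(tenant)}_*"
```

With option 2, a tenant's connected data-sets could be searched together through the wildcard pattern, while disconnected ones are queried by their individual index names. Whether the resulting number of indices is manageable is exactly the part I am unsure about.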

Crawling
Our crawling service currently reads configuration parameters and performs scheduled crawls, consuming RSS data and depositing it into the SQL Server database. As the initial architecture was in .Net, I have considered modifying the current service to use one of ElasticSearch's .Net clients to populate the ES database. However, after more research, it appears that Nutch is the standard for crawling and getting data into ES.
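For reference, this is roughly what the crawl-to-ES step would have to do whichever client or crawler ends up being used. It is only a sketch in Python using the standard library: it turns the items of an RSS 2.0 feed into the newline-delimited body that the ES bulk API expects (an action line followed by a document line). The function name, the document field names, and the choice of hashing the link to get a document id are assumptions on my part; older ES versions also expect a `_type`, so that is left as an optional parameter.

```python
import hashlib
import json
import xml.etree.ElementTree as ET
from typing import Optional


def rss_to_bulk(rss_xml: str, index: str, doc_type: Optional[str] = None) -> str:
    """Convert the <item> elements of an RSS 2.0 feed into a bulk request body."""
    root = ET.fromstring(rss_xml)
    lines = []
    for item in root.iter("item"):
        # Field names are hypothetical; missing elements become empty strings.
        doc = {
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "description": item.findtext("description", default=""),
            "published": item.findtext("pubDate", default=""),
        }
        # Assumption: the link identifies an item, so hashing it gives a stable
        # id and a repeated crawl overwrites the document instead of duplicating it.
        doc_id = hashlib.sha1(doc["link"].encode("utf-8")).hexdigest()
        action = {"index": {"_index": index, "_id": doc_id}}
        if doc_type:
            action["index"]["_type"] = doc_type
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    # The bulk body must end with a newline; an empty feed yields an empty body.
    return "\n".join(lines) + "\n" if lines else ""
```

The returned string would then be sent to the cluster's `_bulk` endpoint by the service. If Nutch handles this better out of the box, I would rather not maintain it myself.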

The issue with crawling is that your IP address can get blocked, and I am curious to know how large-scale crawlers get around this. Suggestions?

Accessing Data / Visualization
At a high level, I was wondering whether it is possible to expose Kibana to the various tenants while giving each tenant access to only their own data, both real-time and historical. Is this possible?

Security
If we implement ES with a .Net wrapper, we can maintain our entire ES cluster in a secured IP environment on our VPC with API access only from our specific web servers.

I realize that Shield offers some additional security/role-based benefits. However, is this necessary from the get-go?

I appreciate any assistance/discussions that follow.
