Is Elasticsearch the correct solution to index >300 websites?

I'm new to Elastic and think Meta Elastic is the correct place to ask architectural questions.

TLDR: Should I use ES 7.0 and Nutch 1.15 to index a few hundred industry-specific sites to build a faceted search?

I have spent the last few weeks investigating the use of Elasticsearch to build a search engine for an industry sector. The idea is that we index a few hundred domains (maybe up to one thousand) within a specific industry sector. Once indexed, each domain would appear in the search results for the industry sub-sector terms relevant to it (or for other attributes: geolocation, products, etc.). We also have an industry hierarchy of terms (or sub-sectors); these could be considered aggregates (or facets). It would be good to have the results filtered by these aggregates. It certainly sounds relatively straightforward.
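To make the faceting idea concrete, here is a minimal sketch of a search request body that matches on page text, optionally filters by one sub-sector, and asks for facet counts through `terms` aggregations. The field names `content`, `sub_sector`, and `domain` are assumptions for illustration (the latter two are assumed to be mapped as `keyword` fields); nothing in this thread defines an actual schema.

```python
import json


def build_faceted_query(text, sub_sector=None, size=10):
    """Build an Elasticsearch search body with facet counts.

    Field names (content, sub_sector, domain) are hypothetical.
    """
    filters = []
    if sub_sector is not None:
        # A term filter narrows results to one facet value without affecting scoring.
        filters.append({"term": {"sub_sector": sub_sector}})

    return {
        "size": size,
        "query": {
            "bool": {
                "must": [{"match": {"content": text}}],
                "filter": filters,
            }
        },
        # Each terms aggregation returns the top values and their document counts.
        "aggs": {
            "by_sub_sector": {"terms": {"field": "sub_sector", "size": 20}},
            "by_domain": {"terms": {"field": "domain", "size": 20}},
        },
    }


if __name__ == "__main__":
    # Print the JSON body that would be sent to the _search endpoint.
    print(json.dumps(build_faceted_query("valves", sub_sector="pumps"), indent=2))
```

The resulting JSON could be sent to an index's `_search` endpoint; the bucket counts in the response would drive the facet sidebar.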

So my effort was to get the ELK stack running on an Ubuntu 18.04 (Bionic) instance on my laptop and move up from there. Things have gone well. I got Elasticsearch up and running and have Kibana running (the latest versions available via apt-get on Ubuntu). I worked through some tutorials and was feeling pretty good. I then started looking for a tool to crawl all the sites and push the results into ES. I looked at the Elastic Site Search solution (the price became prohibitive), so I went down the Nutch path. I got Nutch to crawl, link, and segment a short list of the industry domains (40 domains); the challenges started when I wanted to index the Nutch results into Elasticsearch. I've been reading through log files to troubleshoot and googling my errors. After a week of reading tech notes and discussion forums, making changes, etc., I'm still struggling to get Nutch results into Elasticsearch. The tech notes and discussions all seem to be a couple of years old. So...
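For reference, whatever does the crawling, the indexing step ultimately comes down to sending documents to Elasticsearch, typically through the `_bulk` API, which takes newline-delimited JSON: one action line followed by one document line. The sketch below is not Nutch's indexer; it only illustrates that payload format. The page dictionaries, the `industry-sites` index name, and the use of the URL as the document ID are all assumptions.

```python
import json
from urllib.parse import urlparse


def to_bulk_ndjson(pages, index_name="industry-sites"):
    """Format crawled pages as a newline-delimited _bulk request body.

    `pages` is a hypothetical list of dicts with "url", "title", "content".
    """
    lines = []
    for page in pages:
        # Action line: index the document, using the URL as a stable ID
        # so that re-crawling a page overwrites the earlier copy.
        action = {"index": {"_index": index_name, "_id": page["url"]}}
        doc = {
            "url": page["url"],
            "domain": urlparse(page["url"]).netloc,
            "title": page.get("title", ""),
            "content": page.get("content", ""),
        }
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    if not lines:
        return ""
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = [{"url": "https://example.com/a", "title": "A", "content": "text"}]
    print(to_bulk_ndjson(sample), end="")
```

A body like this would be POSTed to `_bulk` with the `Content-Type: application/x-ndjson` header. Note that no `_type` appears in the action line, since mapping types are deprecated in 7.x.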

I am now asking myself: am I going down the correct path? This shouldn't be so hard. ES seems more big data, event driven, SIEM, enterprise, and single site (with loads of products) driven; there are all these Logstash plugins and nothing for indexing a collection of websites. So I have a few questions for the Elasticsearch community:

  1. Is the ELK stack what I am looking for?
  2. Is Nutch a good solution for my project? And if yes, which versions should I be using? (I'm using ES 7.x and Nutch 1.15.)
  3. Is there a better way for me to index a few hundred domains and build a faceted search?


Elastic have developed their Site Search product on top of Elasticsearch, which seems to offer functionality similar to what you are describing, so it is certainly possible to do what you want on top of Elasticsearch. As you point out, there are open-source components that can be used to build such a solution, but as far as I know there is no out-of-the-box solution, so what you are looking to do will require some work if you are to build it yourself.

I am not very familiar with the Site Search product, but that might be an option that saves a lot of development if you can use it the way you want.

Thanks Christian,

Yeah, I went down the Site Search path. It looks to be a great product, but wanting to index hundreds of sites took us to a place that was price prohibitive. So I am pursuing a build-it-ourselves route. I'm going to exhaust Nutch as the crawl-and-index option before I build it completely myself.

Thanks for the reply,

Hi @prawsthorne, I've raised your question with our Site Search people and they've asked me to put you in touch with them. I will send you a private message with more details.
