TLDR: Should I use ES 7.x and Nutch 1.15 to index a few hundred industry-specific sites and build a faceted search?
I have spent the last few weeks investigating Elasticsearch as the basis for a search engine covering one industry sector. The idea is to index a few hundred domains (possibly up to one thousand) within that sector. Once indexed, each domain would appear in the search results for the terms relevant to its industry sub-sector (or another attribute: geolocation, products, etc.). We also have an industry hierarchy of terms (sub-sectors), which could serve as aggregations (facets), and it would be good to filter the results by them. It certainly sounds relatively straightforward.
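To make the goal concrete, here is a rough sketch of the kind of search request I have in mind, written as a Python function that only builds the request body. The field names `content`, `sub_sector` and `domain` are my own placeholders (nothing Nutch or Elasticsearch defines for me), and I am assuming `sub_sector` and `domain` would be mapped as `keyword` fields so that `terms` aggregations work on them.

```python
import json


def build_faceted_query(text, sub_sector=None, size=10):
    """Build an Elasticsearch search body that returns hits plus facet counts.

    All field names here are hypothetical placeholders for this project.
    """
    # Full-text match against the crawled page text.
    must = [{"match": {"content": text}}]

    # Optional facet selection, applied as a non-scoring filter.
    filters = []
    if sub_sector is not None:
        filters.append({"term": {"sub_sector": sub_sector}})

    return {
        "size": size,
        "query": {"bool": {"must": must, "filter": filters}},
        # One terms aggregation per facet shown next to the results.
        "aggs": {
            "by_sub_sector": {"terms": {"field": "sub_sector", "size": 20}},
            "by_domain": {"terms": {"field": "domain", "size": 20}},
        },
    }


body = build_faceted_query("industrial pumps", sub_sector="hydraulics")
print(json.dumps(body, indent=2))
```

A body like this would be sent to the index's `_search` endpoint. Because the sub-sector filter sits inside the query, the facet counts are computed only over the filtered hits, which may or may not be the behaviour wanted for the facet that is currently selected.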
So my plan was to get the ELK stack running on an Ubuntu 18.04 (bionic) instance on my laptop and move up from there. Things went well at first: I got Elasticsearch and Kibana up and running (the latest versions available via apt-get on Ubuntu), worked through some tutorials, and was feeling pretty good. I then started looking for a tool to crawl all the sites and push the results into ES. I looked at the Elastic Site Search solution, but the price became prohibitive, so I went down the Nutch path. I got Nutch to crawl, link, and segment a short list of the industry domains (40 domains). The challenges started when I wanted to index the Nutch results into Elasticsearch. I've been reading through log files to troubleshoot and googling my errors. After a week of reading tech notes and discussion forums, making changes, etc., I am still struggling to get the Nutch results into Elasticsearch. The tech notes and discussions I find all seem to be a couple of years old... So...
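For context, independent of Nutch, what I need at the end of the crawl is simply one ES document per crawled page. Below is a minimal sketch of how such documents could be turned into a payload for the Elasticsearch bulk API (`POST /_bulk`, newline-delimited JSON). The index name `industry-pages`, the document fields, and the example URL are all assumptions of mine for illustration, not something Nutch produces in this shape.

```python
import json


def to_bulk_ndjson(pages, index_name="industry-pages"):
    """Serialise page dicts into a newline-delimited bulk request body.

    The index name and page fields are hypothetical placeholders.
    """
    lines = []
    for page in pages:
        # Action line: use the URL as the document id so a re-crawl
        # overwrites the earlier copy of the same page.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": page["url"]}}))
        # Source line: the document itself.
        lines.append(json.dumps(page))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


pages = [
    {
        "url": "https://supplier.example/products/pumps",
        "domain": "supplier.example",
        "sub_sector": "hydraulics",
        "content": "Industrial pumps for hydraulic systems.",
    }
]
print(to_bulk_ndjson(pages))
```

The resulting string would be posted with the `Content-Type: application/x-ndjson` header. Whatever crawler ends up producing the pages, this is roughly the shape of data I am trying to get into the index.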
I am now asking myself whether I am going down the correct path. This shouldn't be so hard... ES seems geared more towards big data, event-driven, SIEM, enterprise, and single-site (with loads of products) use cases; there are all these Logstash plugins and nothing for indexing a collection of websites. So I have a few questions for the Elasticsearch community:
- Is the ELK stack what I am looking for?
- Is Nutch a good solution for my project? And if yes, which versions should I be using? (I'm using ES 7.x and Nutch 1.15.)
- Is there a better way for me to index a few hundred domains and build a faceted search?
Thank you.