Elasticsearch Crawling

Hello guys,

I searched Google but couldn't find much information about this, so any help would be greatly appreciated.
I worked with the Google Search Appliance before, but it was decommissioned, and Elastic seems to be the best option on the market right now.

I need to:

  • Crawl & index a number of websites (around 50)
  • Serve the results as XML to a web application
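
For the second requirement, a rough sketch of turning an Elasticsearch search response into XML could look like the following. It assumes the response has already been parsed into a Python dict with the usual `hits.hits[]._source` layout. The element names `results` and `result` and the helper name `hits_to_xml` are made up for illustration, and it assumes every field name in `_source` is also a valid XML tag name.

```python
import xml.etree.ElementTree as ET


def hits_to_xml(response):
    """Convert a parsed Elasticsearch search response (a dict) to an XML string.

    Hypothetical helper: the <results>/<result> element names are an assumption,
    not something Elasticsearch defines.
    """
    root = ET.Element("results")
    for hit in response.get("hits", {}).get("hits", []):
        # One <result> element per hit, carrying the document id as an attribute.
        item = ET.SubElement(root, "result", id=str(hit.get("_id", "")))
        # One child element per top-level field of the stored document.
        for field, value in hit.get("_source", {}).items():
            ET.SubElement(item, field).text = str(value)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = {
        "hits": {
            "hits": [
                {"_id": "1", "_source": {"title": "A & B", "url": "https://example.com/"}}
            ]
        }
    }
    print(hits_to_xml(sample))
```

ElementTree escapes special characters such as `&` on output, so the web application receives well-formed XML without extra work.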

Does the current version of Elasticsearch not include a crawler? Do I need to install something else?

I am using Elastic Cloud on AWS.

Thank you!

*We will only crawl public websites, such as Twitter accounts.

https://swiftype.com/ can do most of that!

Thanks Mark!
Swiftype is OK, but unfortunately we need more or less real-time crawling (every 5 minutes), and their solution (the cheap one, under $100/month) offers only one crawl every 3 days or so.

I am now checking some other solutions, such as 80legs.com, and if I find something I will post it here.

Can Elasticsearch be used for this type of live crawling, indexing & serving?

Elasticsearch can be the backend for storing the data collected from crawlers, but it has no crawling capabilities.
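
To make the "backend for crawlers" idea concrete, here is a minimal sketch of the hand-off step. It assumes some external crawler has already fetched the HTML, and it only builds the newline-delimited JSON payload that the `_bulk` endpoint expects (an action line followed by a source line per document). The names `PageText`, `page_to_doc`, and `bulk_body`, the index name `pages`, and the choice of the URL as the document `_id` are all assumptions for illustration.

```python
import json
from html.parser import HTMLParser


class PageText(HTMLParser):
    """Collects the <title> and the visible text of one HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.chunks = []
        self._in_title = False
        self._skip = 0  # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._in_title:
            self.title += text
        elif not self._skip and text:
            self.chunks.append(text)


def page_to_doc(url, html):
    """Turn one fetched page into a flat document ready for indexing."""
    parser = PageText()
    parser.feed(html)
    parser.close()
    return {"url": url, "title": parser.title, "body": " ".join(parser.chunks)}


def bulk_body(index, docs):
    """Build the NDJSON payload for _bulk: an action line, then a source line."""
    lines = []
    for doc in docs:
        # Assumption: the URL is used as _id so a re-crawl overwrites the old copy.
        lines.append(json.dumps({"index": {"_index": index, "_id": doc["url"]}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # _bulk requires a trailing newline


if __name__ == "__main__":
    html = "<html><head><title>Hi</title></head><body><p>Hello</p></body></html>"
    print(bulk_body("pages", [page_to_doc("https://example.com/", html)]))
```

The resulting string would be sent as the body of a `POST /_bulk` request with `Content-Type: application/x-ndjson`. The fetching, scheduling, and politeness rules (robots.txt, rate limits) still have to come from the crawler itself.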

Maybe you could have a look at https://github.com/DigitalPebble/storm-crawler; it has some integration with ES. (I haven't used it myself, though.)

Regards,
Lukáš

@warkolm Yes, I was a bit shocked to see that there is no official crawler (at least in the Cloud version).
It's like selling only the engine and some other parts of a car :smile:, and you have to find the wheels on your own.

I guess I was too accustomed to the Google Search Appliance.

I will keep researching and post here whatever solution I find!

@lukas_vlcek Thanks! I will test it.
