Total newb question

slinuxuzer · August 13, 2018, 5:51am

Hi all, my name is Steven, and yes, this is my first post. I'm more of an IT infrastructure guy, but I'm working on a personal project and I'm not familiar with the Elasticsearch product suite or its technologies at all. I'm looking for a solution that can index a large number of external URLs, on the order of 200 million or so (could be fewer). The URLs point to a large document repository, and if possible I'd rather index the files directly from the site than download them.

The solution could be cloud-based or hosted on-prem, and it can be Elasticsearch or another product suite, preferably turnkey. I'm just looking for someone to point me in the right direction. It's entirely possible I'm looking for something infeasible or that doesn't exist.

Thanks in advance, and please try not to laugh too hard.

Peter_Steenbergen · August 13, 2018, 7:17am

You can do that with Elasticsearch, no problem. I think you mean a scraper that can store and search the HTML from all the URLs?

You can look at Scrapy (https://scrapy.org/) for a simple scraper and store the results in Elasticsearch.
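As a rough illustration of the "store the results" half, here is a minimal Python sketch that turns fetched HTML into a small document and formats a batch of them as a request body for Elasticsearch's `_bulk` endpoint. It uses only the standard library and leaves the fetching to whatever scraper you pick. The index name `pages`, the field names (`url`, `title`, `body`, `scraped_at`), and the helper names are assumptions made up for this example, and the action lines assume the typeless bulk format of Elasticsearch 7 or later. Only the extracted text is kept, not the original file.

```python
import hashlib
import json
from datetime import datetime, timezone
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the <title> and visible text, skipping script/style content."""

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.text_parts = []
        self._skip_depth = 0
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._skip_depth:
            return
        if self._in_title:
            self.title_parts.append(data)
        else:
            self.text_parts.append(data)


def page_to_doc(url, html, scraped_at=None):
    """Build a small searchable document (hypothetical field names)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    scraped_at = scraped_at or datetime.now(timezone.utc)
    return {
        "url": url,
        "title": " ".join("".join(parser.title_parts).split()),
        "body": " ".join(" ".join(parser.text_parts).split()),
        "scraped_at": scraped_at.isoformat(),  # lets old entries be expired later
    }


def bulk_payload(docs, index="pages"):
    """Format docs as newline-delimited JSON for POST /_bulk."""
    lines = []
    for doc in docs:
        # Hash of the URL as the document ID, so a re-crawl overwrites the old copy.
        doc_id = hashlib.sha1(doc["url"].encode("utf-8")).hexdigest()
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline
```

The resulting string would be sent to `/_bulk` with the `Content-Type: application/x-ndjson` header. At the scale of 200 million URLs, batching a few thousand documents per request is generally far cheaper than indexing them one at a time.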

slinuxuzer · August 13, 2018, 5:15pm

If possible, I'd like to read the URLs and build an index from them, but not store the actual documents long term.

Peter_Steenbergen · August 13, 2018, 8:13pm

That can be done. I'm not sure exactly what you're asking for at this point, though.
You can easily save the data in Elasticsearch and delete it after some period X, as long as you store a timestamp for when each page was scraped.
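One way to do the "delete after period X" part is Elasticsearch's `_delete_by_query` API with a range query on the timestamp field. The sketch below only builds the HTTP request and does not send it. The index name `pages`, the field name `scraped_at`, the local URL, and the helper name are assumptions for illustration, and the field is assumed to be mapped as a `date` so that date math like `now-30d` works.

```python
import json
import urllib.request


def build_expiry_request(es_url="http://localhost:9200", index="pages", max_age="30d"):
    """Build a POST request that deletes documents scraped more than max_age ago."""
    # Range query with date math: matches documents whose timestamp is older than now - max_age.
    query = {"query": {"range": {"scraped_at": {"lt": f"now-{max_age}"}}}}
    return urllib.request.Request(
        url=f"{es_url}/{index}/_delete_by_query",
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen(...)` from a scheduled job would run the cleanup. If the data is written to time-based indices instead (one per day or week, say), dropping a whole old index is usually cheaper than deleting documents one query at a time.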

system · September 10, 2018, 8:13pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
