Total newb question

Hi all, my name is Steven and yes, this is my first post. I'm more of an IT infrastructure guy, but I'm working on a personal project and I'm not familiar with the Elasticsearch product suite or its technologies at all. I'm looking for a solution that can index a large number of external URLs, on the order of 200 million or so (could be less). They point to a large document repository, and I'd rather not download the files if possible; I'd prefer to index them directly from the site.

The solution could be cloud-based or hosted on-prem, and it could be Elasticsearch or another product suite, preferably something turnkey. I'm just looking for someone to point me in the right direction. It's entirely possible I'm looking for something infeasible or that doesn't exist.

Thanks in advance, please try not to laugh too hard :smiley:

You can do that with Elasticsearch, no problem. I think what you mean is a scraper that can fetch the pages so you can store and search the HTML from all the URLs?

You can look at Scrapy (https://scrapy.org/) for a simple scraper and store the results in ES.
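As a rough illustration (not a finished setup), here's a minimal sketch of a Scrapy spider that pushes each fetched page straight into Elasticsearch. The index name `pages`, the cluster address `http://localhost:9200`, the placeholder URLs, and the 8.x Python client are all assumptions for the example:

```python
# Minimal sketch: a Scrapy spider that indexes fetched pages into Elasticsearch.
# Assumed: Elasticsearch 8.x Python client, a cluster at http://localhost:9200,
# and an index called "pages" -- adjust all of these to your environment.
from datetime import datetime, timezone

import scrapy
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")


class PageSpider(scrapy.Spider):
    name = "pages"
    # Placeholder URLs; in practice you'd feed in your real list of URLs.
    start_urls = ["https://example.com/doc1", "https://example.com/doc2"]

    def parse(self, response):
        # Index the URL, the page content, and a scrape timestamp;
        # nothing is kept on disk locally.
        es.index(
            index="pages",
            document={
                "url": response.url,
                "content": response.text,
                "scraped_at": datetime.now(timezone.utc).isoformat(),
            },
        )
```

You'd run something like this with `scrapy runspider page_spider.py`; for 200 million URLs you'd want Scrapy's item pipelines and bulk indexing rather than one request per document, but the shape is the same.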

If possible, I'd like to read the URLs and build an index from them, but not store the actual documents long term. Is that doable?

That can be done, although I'm not sure exactly what you're after at this point.
You can save the data into Elasticsearch easily, and delete it after some period X if you store a timestamp of when each page was scraped.
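For the cleanup step, a rough sketch assuming the `pages` index and `scraped_at` field from the spider example above (the 30-day window is just a placeholder for "period X"):

```python
# Minimal sketch: delete documents whose scrape timestamp is older than the
# retention window. Assumes the "pages" index and "scraped_at" date field
# from the earlier example; 30 days stands in for "period X".
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.delete_by_query(
    index="pages",
    query={"range": {"scraped_at": {"lt": "now-30d"}}},
)
```

In practice, time-based indices aged out with an index lifecycle management (ILM) policy are the more common way to expire data at that scale, since deleting whole indices is much cheaper than delete-by-query.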
