Currently I am using Scrapy to parse an XML file from an FTP server into Elasticsearch. It works, but it seems quite a heavyweight solution and it uses a lot of memory too.
I am wondering if I am better off writing a plugin for ES instead.
I have some questions:
A) It seems that writing it in Python (since I'm a Python guy) as a push plugin rather than a pull river makes sense, unless anyone has a reason why pull is better?
B) For simple importing (and slight modification such as trimming, language checks, etc.), is an ES plugin likely to be a better solution for importing fairly large XML files, or should I just leave Scrapy to do it as it does at the moment?
Any help and advice would be appreciated as I start on this journey.
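For context, the memory pressure usually comes from holding the whole XML tree at once. A streaming parse that clears each element after use, yielding Elasticsearch-style bulk actions one at a time, keeps memory roughly flat. This is only a sketch of the idea, not my actual Scrapy pipeline; the `<item>`/`<title>` element names and the `products` index are made up for illustration:

```python
import io
import xml.etree.ElementTree as ET

def iter_bulk_actions(xml_stream, index="products"):
    """Stream <item> elements and yield ES-style bulk actions one at a time."""
    for event, elem in ET.iterparse(xml_stream, events=("end",)):
        if elem.tag == "item":
            yield {
                "_index": index,
                "_id": elem.get("id"),
                "_source": {"title": (elem.findtext("title") or "").strip()},
            }
            elem.clear()  # free the parsed subtree so memory stays flat

# Tiny inline document standing in for the real FTP download
xml = (b"<feed>"
       b"<item id='1'><title> Widget </title></item>"
       b"<item id='2'><title>Gadget</title></item>"
       b"</feed>")
actions = list(iter_bulk_actions(io.BytesIO(xml)))
print(actions[0]["_source"]["title"])  # -> Widget
```

With the official `elasticsearch` Python client, a generator like this can be passed straight to `elasticsearch.helpers.bulk`, which accepts any iterable of actions, so the whole file never needs to sit in memory.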
Thank you for the reply. I've seen that mentioned, but does it have the capability to modify the XML content before it is imported? For example, adding the ability to do language detection and trimming via custom scripts?
Thank you for the reply. I do need to do work on the data before importing, such as language detection and geocoding using third-party libraries, and while I feel Logstash may be great for getting me some of the way, it won't be able to get me all the way.
A custom plugin may be my only option in that regard, but is it really going to provide me any benefits over something like Scrapy? Any feedback would be appreciated.
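As a rough sketch of what that pre-import step could look like outside ES entirely: a plain transform function that trims string fields and attaches language and location via injected callables. The `detect_lang` and `geocode` arguments are placeholders for whatever third-party libraries end up being used (e.g. a language detector, a geocoding API client); the stand-in lambdas below exist only so the demo runs:

```python
def enrich(doc, detect_lang, geocode):
    """Trim string fields, then attach language and coordinates.

    detect_lang and geocode are injected so any third-party
    library can be swapped in without touching this function.
    """
    clean = {k: v.strip() if isinstance(v, str) else v for k, v in doc.items()}
    clean["lang"] = detect_lang(clean.get("description", ""))
    clean["location"] = geocode(clean.get("city", ""))
    return clean

# Stand-in detectors for the demo; real ones would call external libraries
doc = enrich(
    {"description": "  Ein kleines Beispiel  ", "city": " Berlin "},
    detect_lang=lambda text: "de" if "Ein" in text else "en",
    geocode=lambda city: {"lat": 52.52, "lon": 13.405} if city == "Berlin" else None,
)
print(doc["lang"], doc["description"])  # -> de Ein kleines Beispiel
```

Keeping the enrichment as an ordinary function like this is what makes it easy to run from Scrapy, a cron script, or anywhere else, rather than tying it to an ES plugin.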
Language detection can't be done in Logstash that I know of, but you can definitely trim things.
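For the trimming part, a minimal Logstash filter along these lines would do it; the `title` field name here is just an example (the `mutate` filter's `strip` option removes leading and trailing whitespace from a field):

```
filter {
  mutate {
    strip => ["title"]   # trim leading/trailing whitespace from the field
  }
}
```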
On 3 April 2015 at 13:16, Employ mail@employ.com wrote:
Hi,
My gut feeling is: don't add this to the ES setup itself. Horses for courses - have your script (Python +1) running somewhere taking care of the processing, dealing with issues on the FTP side, etc. Let ES do its thing... especially if the XML parsing will take so much memory and you need external services.
The script can run / be managed / be designed in many ways, from a simple cronjob to a Celery task, a service under Chronos/Mesos, or a consumer getting messages and publishing to ES (though if you have one large XML file to process once a day, I wouldn't go with a consumer...)
Good luck,
B
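As one concrete (and entirely hypothetical) wiring for the simple-cronjob option above: a crontab entry that runs the import script nightly and appends its output to a log. The paths and schedule are placeholders, not anything from the original setup:

```
# m h  dom mon dow  command
0 2 *   *   *   /usr/bin/python3 /opt/importer/import_xml.py >> /var/log/xml_import.log 2>&1
```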
On 03/04/2015 3:47 pm, "Employ" mail@employ.com wrote:
Ok, I think that's a fair comment and is likely the route I will take. Thank you both for your help.
On 4 Apr 2015, at 20:33, Norberto Meijome numard@gmail.com wrote: