Getting XML into ES efficiently

Hi,

Currently I am using Scrapy to parse an XML file from an FTP server into Elasticsearch. It works, but it seems quite a heavyweight solution and it uses a lot of memory too.

I am wondering if I am better off writing a plugin for ES instead.

I have some questions:

A) Writing it in Python (since I'm a Python guy) as a push plugin rather than a pull river seems to make sense, unless anyone has a reason why pull is better?

B) For simple importing (with slight modifications such as trimming, a language check, etc.), is an ES plugin likely to be a better solution for importing fairly large XML files, or should I just leave Scrapy to do it as it does at the moment?

Any help and advice would be appreciated as I start on this journey.

James

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/610e7f9b-3d23-44a9-b8f3-07deb262dd54%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Logstash can handle XML; it has a filter specifically for it, the xml filter plugin described in the Logstash Reference.

Thank you for the reply. I've seen that mentioned, but does it have the capability to modify the XML content before it is imported? For example, could I add language detection and trimming via custom scripts?

You can do data transformation on the fly, yes.

As far as I know, language detection can't be done in Logstash, but you can definitely trim things.

Thank you for the reply. I do need to do work on the data before importing, such as language detection and geocoding using third-party libraries, and I feel that while Logstash may be great for getting me some of the way, it won't be able to get me all the way.

A custom plugin may be my only option in that regard, but is it really going to provide me any benefits over something like Scrapy? Any feedback would be appreciated.
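
A minimal sketch of the kind of pre-import step described above, assuming each record is a flat dict. The field names (`description`, `city`) and the two stand-in functions are hypothetical; in practice they would be replaced by whichever third-party language detection and geocoding libraries are chosen.

```python
def enrich(record, detect_language, geocode):
    """Return a copy of record with trimmed strings plus derived fields."""
    # Trim every string value; leave non-string values untouched.
    doc = {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}
    # The detectors are passed in, so any library can be plugged in later.
    doc["language"] = detect_language(doc.get("description", ""))
    doc["location"] = geocode(doc.get("city", ""))
    return doc


# Stand-ins for third-party libraries (hypothetical, for illustration only).
def fake_detect_language(text):
    return "fr" if " le " in f" {text.lower()} " else "en"


def fake_geocode(city):
    known = {"Leeds": {"lat": 53.8, "lon": -1.55}}
    return known.get(city)


doc = enrich(
    {"description": "  Python developer wanted ", "city": "Leeds "},
    fake_detect_language,
    fake_geocode,
)
```

Keeping the enrichment in a plain function like this means it does not matter much whether Scrapy, a cron job, or something else drives it.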

Hi,
My gut feel is: don't add this to the ES setup itself. Horses for courses: have your script (Python +1) running somewhere, taking care of the processing, dealing with issues on the FTP side, etc. Let ES do its thing, especially if the XML parsing takes so much memory and you need external services.
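
On the memory side, a standalone script does not have to load the whole XML file at once. A minimal sketch using the standard library's `xml.etree.ElementTree.iterparse`, assuming the feed is a flat list of repeated elements; the `item` tag and the sample data are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET


def iter_records(source, tag="item"):
    """Yield one dict per <tag> element without loading the whole file."""
    # iterparse reads the input incrementally and reports each element
    # when it opens ("start") and when it is complete ("end").
    context = iter(ET.iterparse(source, events=("start", "end")))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            # Flatten direct children into {child_tag: trimmed_text}.
            yield {child.tag: (child.text or "").strip() for child in elem}
            # Drop already-processed elements so memory use stays roughly flat.
            root.clear()


sample = io.BytesIO(
    b"<feed>"
    b"<item><title>  First job </title><city>Leeds</city></item>"
    b"<item><title>Second job</title><city> York </city></item>"
    b"</feed>"
)
records = list(iter_records(sample))
```

`source` can be a filename or any binary file-like object, so the same generator should work on a file downloaded from the FTP server.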

The script can be run, managed, and designed in many ways: from a simple cron job to a Celery task, a service under Chronos/Mesos, or a consumer getting messages and publishing to ES (though if you have one large XML file to process once a day, I wouldn't go with a consumer).
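
Whichever way the script is scheduled, the publishing step can stay small: group documents into batches and send each batch as one Bulk API request. A minimal sketch that only builds the newline-delimited request bodies, assuming an index called `jobs` (hypothetical). Sending them is left out, and older ES versions may also expect a `_type` in the action line.

```python
import json
from itertools import islice


def batched(iterable, size):
    """Yield lists of up to `size` items from any iterable."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def bulk_body(docs, index):
    """Build an NDJSON payload: one action line, then one source line, per doc."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    # The Bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


docs = ({"title": f"job {n}"} for n in range(5))
payloads = [bulk_body(chunk, index="jobs") for chunk in batched(docs, 2)]
```

Each payload would then be POSTed to the `_bulk` endpoint, either directly over HTTP or through a client library.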

Good luck,
B

OK. I think that's a fair comment and is likely the route I will take. Thank you both for your help.
