Indexing of HTML Column in an MS SQL Server 2014 database

I need to index a HTML column (nvarchar(MAX)) in a MS SQL Server database.
I have set up a JDBC river https://github.com/jprante/elasticsearch-river-jdbc
and the database is indexed.

Using

"settings":{

"analysis":{

  "analyzer":{

    "default":{

      "type":"custom",

      "tokenizer":"standard",

      "filter":[ "standard", "lowercase" ], 

      "char_filter" : ["html_strip"]

    }

  }

}

}

is good for searching but not for the highlighter as that returns sometimes
trimmed unpaired html tags.

I have played with the Mapper Attachments with HTML attachments and then
the highlighter works well - all original html tags are gone - but I am
unable to get the river push the column directly to the Mapper Attachments.

Questions:

  1. what is the best practice for indexing HTML columns? I am aware of the
    possibility of a manual removal of HTML tags using Agility Pack but do not
    like that as it's too much extra maintenance.

  2. is there any better highlighter for html data which doesn't cut off any
    original html tags?

  3. How to plug in the JDBC river to Mapper Attachments?

  4. Any better ideas how to achieve my goals?

Thanks!

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/f175734b-0889-40a9-96d1-d46702e56666%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Can you give some information about the mapper attachment setup you used
successfully?

There is no good reason why this should not be possible with JDBC river.

Jörg

On Sun, Feb 22, 2015 at 5:20 PM, Jiri Pik jiri.pik@googlemail.com wrote:

I need to index a HTML column (nvarchar(MAX)) in a MS SQL Server database.
I have set up a JDBC river
https://github.com/jprante/elasticsearch-river-jdbc and the database is
indexed.

Using

"settings":{

"analysis":{

  "analyzer":{

    "default":{

      "type":"custom",

      "tokenizer":"standard",

      "filter":[ "standard", "lowercase" ],

      "char_filter" : ["html_strip"]

    }

  }

}

}

is good for searching but not for the highlighter as that returns
sometimes trimmed unpaired html tags.

I have played with the Mapper Attachments with HTML attachments and then
the highlighter works well - all original html tags are gone - but I am
unable to get the river push the column directly to the Mapper Attachments.

Questions:

  1. what is the best practice for indexing HTML columns? I am aware of the
    possibility of a manual removal of HTML tags using Agility Pack but do not
    like that as it's too much extra maintenance.

  2. is there any better highlighter for html data which doesn't cut off any
    original html tags?

  3. How to plug in the JDBC river to Mapper Attachments?

  4. Any better ideas how to achieve my goals?

Thanks!

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/f175734b-0889-40a9-96d1-d46702e56666%40googlegroups.com
https://groups.google.com/d/msgid/elasticsearch/f175734b-0889-40a9-96d1-d46702e56666%40googlegroups.com?utm_medium=email&utm_source=footer
.
For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAKdsXoH6Ei%2B23bRKrL0Z7WkQALengfhaZeJRBq5gK1F22yxJfg%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Hi Jörg

A bit out of topic: I wonder if you are indexing blobs as base64 encoded fields in JDBC river?
(I did not look at the doc)

--
David :wink:
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

Le 22 févr. 2015 à 18:11, "joergprante@gmail.com" joergprante@gmail.com a écrit :

Can you give some information about the mapper attachment setup you used successfully?

There is no good reason why this should not be possible with JDBC river.

Jörg

On Sun, Feb 22, 2015 at 5:20 PM, Jiri Pik jiri.pik@googlemail.com wrote:
I need to index a HTML column (nvarchar(MAX)) in a MS SQL Server database. I have set up a JDBC river https://github.com/jprante/elasticsearch-river-jdbc and the database is indexed.

Using

"settings":{

"analysis":{

  "analyzer":{

    "default":{

      "type":"custom",

      "tokenizer":"standard",

      "filter":[ "standard", "lowercase" ], 

      "char_filter" : ["html_strip"]

    }

  }

}

}

is good for searching but not for the highlighter as that returns sometimes trimmed unpaired html tags.

I have played with the Mapper Attachments with HTML attachments and then the highlighter works well - all original html tags are gone - but I am unable to get the river push the column directly to the Mapper Attachments.

Questions:

  1. what is the best practice for indexing HTML columns? I am aware of the possibility of a manual removal of HTML tags using Agility Pack but do not like that as it's too much extra maintenance.

  2. is there any better highlighter for html data which doesn't cut off any original html tags?

  3. How to plug in the JDBC river to Mapper Attachments?

  4. Any better ideas how to achieve my goals?

Thanks!

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/f175734b-0889-40a9-96d1-d46702e56666%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAKdsXoH6Ei%2B23bRKrL0Z7WkQALengfhaZeJRBq5gK1F22yxJfg%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/09317C08-E397-4044-91F2-072A5FA4A3DF%40pilato.fr.
For more options, visit https://groups.google.com/d/optout.

For java.sql.Types.BLOB, I use the builder.value(Object object) method in
XContentBuilder, with a byte array as parameter.

For java.sql.Types.CLOB/NCLOB, I use just a string as returned by JDBC in
Clob.getSubString

There are DBs which store blobs as java.sql.Types.BINARY, and this can be
passed as string or byte array to XContentBuilder (default is byte array).

Here, it is a NVARCHAR column of MS SQL, which is alway returned by JDBC as
string by the getNString() method.

Jörg

On Sun, Feb 22, 2015 at 6:14 PM, David Pilato david@pilato.fr wrote:

Hi Jörg

A bit out of topic: I wonder if you are indexing blobs as base64 encoded
fields in JDBC river?
(I did not look at the doc)

--
David :wink:
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

Le 22 févr. 2015 à 18:11, "joergprante@gmail.com" joergprante@gmail.com
a écrit :

Can you give some information about the mapper attachment setup you used
successfully?

There is no good reason why this should not be possible with JDBC river.

Jörg

On Sun, Feb 22, 2015 at 5:20 PM, Jiri Pik jiri.pik@googlemail.com wrote:

I need to index a HTML column (nvarchar(MAX)) in a MS SQL Server
database. I have set up a JDBC river
https://github.com/jprante/elasticsearch-river-jdbc and the database is
indexed.

Using

"settings":{

"analysis":{

  "analyzer":{

    "default":{

      "type":"custom",

      "tokenizer":"standard",

      "filter":[ "standard", "lowercase" ],

      "char_filter" : ["html_strip"]

    }

  }

}

}

is good for searching but not for the highlighter as that returns
sometimes trimmed unpaired html tags.

I have played with the Mapper Attachments with HTML attachments and then
the highlighter works well - all original html tags are gone - but I am
unable to get the river push the column directly to the Mapper Attachments.

Questions:

  1. what is the best practice for indexing HTML columns? I am aware of the
    possibility of a manual removal of HTML tags using Agility Pack but do not
    like that as it's too much extra maintenance.

  2. is there any better highlighter for html data which doesn't cut off
    any original html tags?

  3. How to plug in the JDBC river to Mapper Attachments?

  4. Any better ideas how to achieve my goals?

Thanks!

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/f175734b-0889-40a9-96d1-d46702e56666%40googlegroups.com
https://groups.google.com/d/msgid/elasticsearch/f175734b-0889-40a9-96d1-d46702e56666%40googlegroups.com?utm_medium=email&utm_source=footer
.
For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/CAKdsXoH6Ei%2B23bRKrL0Z7WkQALengfhaZeJRBq5gK1F22yxJfg%40mail.gmail.com
https://groups.google.com/d/msgid/elasticsearch/CAKdsXoH6Ei%2B23bRKrL0Z7WkQALengfhaZeJRBq5gK1F22yxJfg%40mail.gmail.com?utm_medium=email&utm_source=footer
.
For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/09317C08-E397-4044-91F2-072A5FA4A3DF%40pilato.fr
https://groups.google.com/d/msgid/elasticsearch/09317C08-E397-4044-91F2-072A5FA4A3DF%40pilato.fr?utm_medium=email&utm_source=footer
.

For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAKdsXoEJ2NTzZThKsHmjKbw%2BLu0HWEw0UrUPsqE3wJtCMMGNpQ%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

David:

David: Do I need to use copy_to a new dummy column in order the highlighting to work???

From: elasticsearch@googlegroups.com [mailto:elasticsearch@googlegroups.com] On Behalf Of David Pilato
Sent: Sunday, February 22, 2015 6:15 PM
To: elasticsearch@googlegroups.com
Subject: Re: Indexing of HTML Column in an MS SQL Server 2014 database

Hi Jörg

A bit out of topic: I wonder if you are indexing blobs as base64 encoded fields in JDBC river?

(I did not look at the doc)

--

David :wink:

Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

Le 22 févr. 2015 à 18:11, "joergprante@gmail.com mailto:joergprante@gmail.com " <joergprante@gmail.com mailto:joergprante@gmail.com > a écrit :

Can you give some information about the mapper attachment setup you used successfully?

There is no good reason why this should not be possible with JDBC river.

Jörg

On Sun, Feb 22, 2015 at 5:20 PM, Jiri Pik <jiri.pik@googlemail.com mailto:jiri.pik@googlemail.com > wrote:

I need to index a HTML column (nvarchar(MAX)) in a MS SQL Server database. I have set up a JDBC river https://github.com/jprante/elasticsearch-river-jdbc and the database is indexed.

Using

"settings":{

"analysis":{

  "analyzer":{

    "default":{

      "type":"custom",

      "tokenizer":"standard",

      "filter":[ "standard", "lowercase" ], 

      "char_filter" : ["html_strip"]

    }

  }

}

}

is good for searching but not for the highlighter as that returns sometimes trimmed unpaired html tags.

I have played with the Mapper Attachments with HTML attachments and then the highlighter works well - all original html tags are gone - but I am unable to get the river push the column directly to the Mapper Attachments.

Questions:

  1. what is the best practice for indexing HTML columns? I am aware of the possibility of a manual removal of HTML tags using Agility Pack but do not like that as it's too much extra maintenance.

  2. is there any better highlighter for html data which doesn't cut off any original html tags?

  3. How to plug in the JDBC river to Mapper Attachments?

  4. Any better ideas how to achieve my goals?

Thanks!

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com mailto:elasticsearch+unsubscribe@googlegroups.com .
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/f175734b-0889-40a9-96d1-d46702e56666%40googlegroups.com https://groups.google.com/d/msgid/elasticsearch/f175734b-0889-40a9-96d1-d46702e56666%40googlegroups.com?utm_medium=email&utm_source=footer .
For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com mailto:elasticsearch+unsubscribe@googlegroups.com .
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAKdsXoH6Ei%2B23bRKrL0Z7WkQALengfhaZeJRBq5gK1F22yxJfg%40mail.gmail.com https://groups.google.com/d/msgid/elasticsearch/CAKdsXoH6Ei%2B23bRKrL0Z7WkQALengfhaZeJRBq5gK1F22yxJfg%40mail.gmail.com?utm_medium=email&utm_source=footer .
For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com mailto:elasticsearch+unsubscribe@googlegroups.com .
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/09317C08-E397-4044-91F2-072A5FA4A3DF%40pilato.fr https://groups.google.com/d/msgid/elasticsearch/09317C08-E397-4044-91F2-072A5FA4A3DF%40pilato.fr?utm_medium=email&utm_source=footer .
For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/eeeabbac7ce6425abc9edc47698d3413%40Ex13DAG10-N1.dataoncloud.net.
For more options, visit https://groups.google.com/d/optout.

And David:

Would it be possible to index text/html given as text rather than Base64?

From: elasticsearch@googlegroups.com [mailto:elasticsearch@googlegroups.com] On Behalf Of David Pilato
Sent: Sunday, February 22, 2015 6:15 PM
To: elasticsearch@googlegroups.com
Subject: Re: Indexing of HTML Column in an MS SQL Server 2014 database

Hi Jörg

A bit out of topic: I wonder if you are indexing blobs as base64 encoded fields in JDBC river?

(I did not look at the doc)

--

David :wink:

Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

Le 22 févr. 2015 à 18:11, "joergprante@gmail.com mailto:joergprante@gmail.com " <joergprante@gmail.com mailto:joergprante@gmail.com > a écrit :

Can you give some information about the mapper attachment setup you used successfully?

There is no good reason why this should not be possible with JDBC river.

Jörg

On Sun, Feb 22, 2015 at 5:20 PM, Jiri Pik <jiri.pik@googlemail.com mailto:jiri.pik@googlemail.com > wrote:

I need to index a HTML column (nvarchar(MAX)) in a MS SQL Server database. I have set up a JDBC river https://github.com/jprante/elasticsearch-river-jdbc and the database is indexed.

Using

"settings":{

"analysis":{

  "analyzer":{

    "default":{

      "type":"custom",

      "tokenizer":"standard",

      "filter":[ "standard", "lowercase" ], 

      "char_filter" : ["html_strip"]

    }

  }

}

}

is good for searching but not for the highlighter as that returns sometimes trimmed unpaired html tags.

I have played with the Mapper Attachments with HTML attachments and then the highlighter works well - all original html tags are gone - but I am unable to get the river push the column directly to the Mapper Attachments.

Questions:

  1. what is the best practice for indexing HTML columns? I am aware of the possibility of a manual removal of HTML tags using Agility Pack but do not like that as it's too much extra maintenance.

  2. is there any better highlighter for html data which doesn't cut off any original html tags?

  3. How to plug in the JDBC river to Mapper Attachments?

  4. Any better ideas how to achieve my goals?

Thanks!

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com mailto:elasticsearch+unsubscribe@googlegroups.com .
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/f175734b-0889-40a9-96d1-d46702e56666%40googlegroups.com https://groups.google.com/d/msgid/elasticsearch/f175734b-0889-40a9-96d1-d46702e56666%40googlegroups.com?utm_medium=email&utm_source=footer .
For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com mailto:elasticsearch+unsubscribe@googlegroups.com .
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAKdsXoH6Ei%2B23bRKrL0Z7WkQALengfhaZeJRBq5gK1F22yxJfg%40mail.gmail.com https://groups.google.com/d/msgid/elasticsearch/CAKdsXoH6Ei%2B23bRKrL0Z7WkQALengfhaZeJRBq5gK1F22yxJfg%40mail.gmail.com?utm_medium=email&utm_source=footer .
For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com mailto:elasticsearch+unsubscribe@googlegroups.com .
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/09317C08-E397-4044-91F2-072A5FA4A3DF%40pilato.fr https://groups.google.com/d/msgid/elasticsearch/09317C08-E397-4044-91F2-072A5FA4A3DF%40pilato.fr?utm_medium=email&utm_source=footer .
For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/d7d8ce2eca5f4f8fbe77b91aa8875ffc%40Ex13DAG10-N1.dataoncloud.net.
For more options, visit https://groups.google.com/d/optout.

Thank you very much for your kind answer. If I encode the html file into Base64, and use the enclosed script, then all works just fine.

So, Joerg:

  1.   Is there a way for the JDBC river to transform the nvarchar(MAX) into Base64 by itself? 
    
  2.   If not, do you recommend nvarbinary(MAX) or some other MS SQL Server type? And then the SELECT * from XXX would just work?
    

What are your thoughts?

BTW I have been able to convert the nvarchar to base64 using this query

select ID, cast(N'' as xml).value ('xs:base64Binary(xs:hexBinary(sql:column("k.Content")))', 'varchar(max)') as Content from (SELECT ID , cast( cast(Content as varchar(MAX )) as varbinary( MAX)) Content from KBArticles) k;

The usual river and mapper attachment work just fine but the initial indexing takes substantially longer. Why?

  1.   Is there any performance settings I could tweak?
    

From: elasticsearch@googlegroups.com [mailto:elasticsearch@googlegroups.com] On Behalf Of joergprante@gmail.com
Sent: Sunday, February 22, 2015 6:12 PM
To: elasticsearch@googlegroups.com
Subject: Re: Indexing of HTML Column in an MS SQL Server 2014 database

Can you give some information about the mapper attachment setup you used successfully?

There is no good reason why this should not be possible with JDBC river.

Jörg

On Sun, Feb 22, 2015 at 5:20 PM, Jiri Pik <jiri.pik@googlemail.com mailto:jiri.pik@googlemail.com > wrote:

I need to index a HTML column (nvarchar(MAX)) in a MS SQL Server database. I have set up a JDBC river https://github.com/jprante/elasticsearch-river-jdbc and the database is indexed.

Using

"settings":{

"analysis":{

  "analyzer":{

    "default":{

      "type":"custom",

      "tokenizer":"standard",

      "filter":[ "standard", "lowercase" ], 

      "char_filter" : ["html_strip"]

    }

  }

}

}

is good for searching but not for the highlighter as that returns sometimes trimmed unpaired html tags.

I have played with the Mapper Attachments with HTML attachments and then the highlighter works well - all original html tags are gone - but I am unable to get the river push the column directly to the Mapper Attachments.

Questions:

  1. what is the best practice for indexing HTML columns? I am aware of the possibility of a manual removal of HTML tags using Agility Pack but do not like that as it's too much extra maintenance.

  2. is there any better highlighter for html data which doesn't cut off any original html tags?

  3. How to plug in the JDBC river to Mapper Attachments?

  4. Any better ideas how to achieve my goals?

Thanks!

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com mailto:elasticsearch+unsubscribe@googlegroups.com .
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/f175734b-0889-40a9-96d1-d46702e56666%40googlegroups.com https://groups.google.com/d/msgid/elasticsearch/f175734b-0889-40a9-96d1-d46702e56666%40googlegroups.com?utm_medium=email&utm_source=footer .
For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com mailto:elasticsearch+unsubscribe@googlegroups.com .
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAKdsXoH6Ei%2B23bRKrL0Z7WkQALengfhaZeJRBq5gK1F22yxJfg%40mail.gmail.com https://groups.google.com/d/msgid/elasticsearch/CAKdsXoH6Ei%2B23bRKrL0Z7WkQALengfhaZeJRBq5gK1F22yxJfg%40mail.gmail.com?utm_medium=email&utm_source=footer .
For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/cbdf7a2f7f0d40f299ec9e51d8f1a4b5%40Ex13DAG10-N1.dataoncloud.net.
For more options, visit https://groups.google.com/d/optout.

Apologies for everyone for sending these emails with digital signature which may have caused some issues:

Summary for Joerg:

  1.   Is there a way for the JDBC river to transform the nvarchar(MAX) into Base64 by itself? I can do on SQL server – see below (1) for David – but it’s substantially slower
    
  2.   If not, do you recommend nvarbinary(MAX) or some other MS SQL Server type? And then the SELECT * from XXX would just work?
    

Summary for David:

  1.   If I convert the HTML column using select ID, cast(N'' as xml).value ('xs:base64Binary(xs:hexBinary(sql:column("k.Content")))', 'varchar(max)') as Content from (SELECT ID ,  cast( cast(Content as varchar(MAX )) as varbinary( MAX)) Content from KBArticles) k; the indexing just works but takes longer than usual – is there any performance setting I could use?
    
  2.   Would it be possible for the attachment mapper to index pure txt file without base64?
    

From: Jiri Pik
Sent: Monday, February 23, 2015 6:08 AM
To: elasticsearch@googlegroups.com
Subject: RE: Indexing of HTML Column in an MS SQL Server 2014 database

Thank you very much for your kind answer. If I encode the html file into Base64, and use the enclosed script, then all works just fine.

So, Joerg:

  1.   Is there a way for the JDBC river to transform the nvarchar(MAX) into Base64 by itself?
    
  2.   If not, do you recommend nvarbinary(MAX) or some other MS SQL Server type? And then the SELECT * from XXX would just work?
    

What are your thoughts?

BTW I have been able to convert the nvarchar to base64 using this query

select ID, cast(N'' as xml).value ('xs:base64Binary(xs:hexBinary(sql:column("k.Content")))', 'varchar(max)') as Content from (SELECT ID , cast( cast(Content as varchar(MAX )) as varbinary( MAX)) Content from KBArticles) k;

The usual river and mapper attachment work just fine but the initial indexing takes substantially longer. Why?

  1.   Is there any performance settings I could tweak?
    

From: elasticsearch@googlegroups.commailto:elasticsearch@googlegroups.com [mailto:elasticsearch@googlegroups.com] On Behalf Of joergprante@gmail.commailto:joergprante@gmail.com
Sent: Sunday, February 22, 2015 6:12 PM
To: elasticsearch@googlegroups.commailto:elasticsearch@googlegroups.com
Subject: Re: Indexing of HTML Column in an MS SQL Server 2014 database

Can you give some information about the mapper attachment setup you used successfully?

There is no good reason why this should not be possible with JDBC river.

Jörg

On Sun, Feb 22, 2015 at 5:20 PM, Jiri Pik <jiri.pik@googlemail.commailto:jiri.pik@googlemail.com> wrote:

I need to index a HTML column (nvarchar(MAX)) in a MS SQL Server database. I have set up a JDBC river https://github.com/jprante/elasticsearch-river-jdbc and the database is indexed.

Using

 "settings":{

   "analysis":{

     "analyzer":{

       "default":{

         "type":"custom",

         "tokenizer":"standard",

         "filter":[ "standard", "lowercase" ],

         "char_filter" : ["html_strip"]

       }

     }

   }

 }

is good for searching but not for the highlighter as that returns sometimes trimmed unpaired html tags.

I have played with the Mapper Attachments with HTML attachments and then the highlighter works well - all original html tags are gone - but I am unable to get the river push the column directly to the Mapper Attachments.

Questions:

  1. what is the best practice for indexing HTML columns? I am aware of the possibility of a manual removal of HTML tags using Agility Pack but do not like that as it's too much extra maintenance.

  2. is there any better highlighter for html data which doesn't cut off any original html tags?

  3. How to plug in the JDBC river to Mapper Attachments?

  4. Any better ideas how to achieve my goals?

Thanks!

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.commailto:elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/f175734b-0889-40a9-96d1-d46702e56666%40googlegroups.comhttps://groups.google.com/d/msgid/elasticsearch/f175734b-0889-40a9-96d1-d46702e56666%40googlegroups.com?utm_medium=email&utm_source=footer.
For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.commailto:elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAKdsXoH6Ei%2B23bRKrL0Z7WkQALengfhaZeJRBq5gK1F22yxJfg%40mail.gmail.comhttps://groups.google.com/d/msgid/elasticsearch/CAKdsXoH6Ei%2B23bRKrL0Z7WkQALengfhaZeJRBq5gK1F22yxJfg%40mail.gmail.com?utm_medium=email&utm_source=footer.
For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/a5258a9fb35548b186333e442238331c%40Ex13DAG10-N1.dataoncloud.net.
For more options, visit https://groups.google.com/d/optout.

  1. I opened an issue for adding optional base64 encoding on columns:
    https://github.com/jprante/elasticsearch-river-jdbc/issues/472

  2. What is "initial indexing"? What do you mean by "slower"?

  3. Yes, you can change the documented bulk index settings.

Jörg

On Mon, Feb 23, 2015 at 6:12 AM, Jiri Pik jiri.pik@jiripik.com wrote:

Apologies for everyone for sending these emails with digital signature
which may have caused some issues:

Summary for Joerg:

  1.   Is there a way for the JDBC river to transform the nvarchar(MAX)
    

into Base64 by itself? I can do on SQL server – see below (1) for David –
but it’s substantially slower

  1.   If not, do you recommend nvarbinary(MAX) or some other MS SQL
    

Server type? And then the SELECT * from XXX would just work?

Summary for David:

  1.   If I convert the HTML column using select ID, cast(N'' as xml).
    

value ('xs:base64Binary(xs:hexBinary(sql:column("k.Content")))',
'varchar(max)') as Content from (SELECT ID , cast( cast(Content as
varchar(MAX )) as varbinary( MAX)) Content from KBArticles) k; the
indexing just works but takes longer than usual – is there any performance
setting I could use?

  1.   Would it be possible for the attachment mapper to index pure
    

txt file without base64?

From: Jiri Pik
Sent: Monday, February 23, 2015 6:08 AM
To: elasticsearch@googlegroups.com
Subject: RE: Indexing of HTML Column in an MS SQL Server 2014 database

Thank you very much for your kind answer. If I encode the html file into
Base64, and use the enclosed script, then all works just fine.

So, Joerg:

  1.   Is there a way for the JDBC river to transform the nvarchar(MAX)
    

into Base64 by itself?

  1.   If not, do you recommend nvarbinary(MAX) or some other MS SQL
    

Server type? And then the SELECT * from XXX would just work?

What are your thoughts?

BTW I have been able to convert the nvarchar to base64 using this query

select ID, cast(N'' as xml).value (
'xs:base64Binary(xs:hexBinary(sql:column("k.Content")))', 'varchar(max)')
as Content from (SELECT ID , cast( cast(Content as varchar(MAX )) as
varbinary( MAX)) Content from KBArticles) k;

The usual river and mapper attachment work just fine but the initial
indexing takes substantially longer. Why?

  1.   Is there any performance settings I could tweak?
    

From: elasticsearch@googlegroups.com [
mailto:elasticsearch@googlegroups.com elasticsearch@googlegroups.com] *On
Behalf Of *joergprante@gmail.com
Sent: Sunday, February 22, 2015 6:12 PM
To: elasticsearch@googlegroups.com
Subject: Re: Indexing of HTML Column in an MS SQL Server 2014 database

Can you give some information about the mapper attachment setup you used
successfully?

There is no good reason why this should not be possible with JDBC river.

Jörg

On Sun, Feb 22, 2015 at 5:20 PM, Jiri Pik jiri.pik@googlemail.com wrote:

I need to index a HTML column (nvarchar(MAX)) in a MS SQL Server
database. I have set up a JDBC river
https://github.com/jprante/elasticsearch-river-jdbc and the database is
indexed.

Using

"settings":{

"analysis":{

  "analyzer":{

    "default":{

      "type":"custom",

      "tokenizer":"standard",

      "filter":[ "standard", "lowercase" ],

      "char_filter" : ["html_strip"]

    }

  }

}

}

is good for searching but not for the highlighter as that returns
sometimes trimmed unpaired html tags.

I have played with the Mapper Attachments with HTML attachments and then
the highlighter works well - all original html tags are gone - but I am
unable to get the river push the column directly to the Mapper Attachments.

Questions:

  1. what is the best practice for indexing HTML columns? I am aware of the
    possibility of a manual removal of HTML tags using Agility Pack but do not
    like that as it's too much extra maintenance.

  2. is there any better highlighter for html data which doesn't cut off any
    original html tags?

  3. How to plug in the JDBC river to Mapper Attachments?

  4. Any better ideas how to achieve my goals?

Thanks!

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/f175734b-0889-40a9-96d1-d46702e56666%40googlegroups.com
https://groups.google.com/d/msgid/elasticsearch/f175734b-0889-40a9-96d1-d46702e56666%40googlegroups.com?utm_medium=email&utm_source=footer
.
For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/CAKdsXoH6Ei%2B23bRKrL0Z7WkQALengfhaZeJRBq5gK1F22yxJfg%40mail.gmail.com
https://groups.google.com/d/msgid/elasticsearch/CAKdsXoH6Ei%2B23bRKrL0Z7WkQALengfhaZeJRBq5gK1F22yxJfg%40mail.gmail.com?utm_medium=email&utm_source=footer
.
For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/a5258a9fb35548b186333e442238331c%40Ex13DAG10-N1.dataoncloud.net
https://groups.google.com/d/msgid/elasticsearch/a5258a9fb35548b186333e442238331c%40Ex13DAG10-N1.dataoncloud.net?utm_medium=email&utm_source=footer
.

For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAKdsXoFaJKN9Q5Rsu8XqLpEWafyPK_YBA7rGvMX7R-9T4Odiuw%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Thank you for opening of the issue.

If I indexed the column as varchar and used the default ES indexing, the entire table is indexed within 5 seconds. If I use the Mapper Attachments, it takes up to 2 minutes. I am not sure whether it’s because of the extra work SQL Server is doing, or the extra volume the jdbc is taking care, but I assume it may be because of the way the Mapper Attachments works?

From: elasticsearch@googlegroups.com [mailto:elasticsearch@googlegroups.com] On Behalf Of joergprante@gmail.com
Sent: Monday, February 23, 2015 9:26 AM
To: elasticsearch@googlegroups.com
Subject: Re: FW: Indexing of HTML Column in an MS SQL Server 2014 database

  1. I opened an issue for adding optional base64 encoding on columns: https://github.com/jprante/elasticsearch-river-jdbc/issues/472

  2. What is "initial indexing"? What do you mean by "slower"?

  3. Yes, you can change the documented bulk index settings.

Jörg

On Mon, Feb 23, 2015 at 6:12 AM, Jiri Pik <jiri.pik@jiripik.commailto:jiri.pik@jiripik.com> wrote:

Apologies for everyone for sending these emails with digital signature which may have caused some issues:

Summary for Joerg:

  1.   Is there a way for the JDBC river to transform the nvarchar(MAX) into Base64 by itself? I can do on SQL server – see below (1) for David – but it’s substantially slower
    
  2.   If not, do you recommend nvarbinary(MAX) or some other MS SQL Server type? And then the SELECT * from XXX would just work?
    

Summary for David:

  1.   If I convert the HTML column using select ID, cast(N'' as xml).value ('xs:base64Binary(xs:hexBinary(sql:column("k.Content")))', 'varchar(max)') as Content from (SELECT ID ,  cast( cast(Content as varchar(MAX )) as varbinary( MAX)) Content from KBArticles) k; the indexing just works but takes longer than usual – is there any performance setting I could use?
    
  2.   Would it be possible for the attachment mapper to index pure txt file without base64?
    

From: Jiri Pik
Sent: Monday, February 23, 2015 6:08 AM
To: elasticsearch@googlegroups.commailto:elasticsearch@googlegroups.com
Subject: RE: Indexing of HTML Column in an MS SQL Server 2014 database

Thank you very much for your kind answer. If I encode the html file into Base64, and use the enclosed script, then all works just fine.

So, Joerg:

  1.   Is there a way for the JDBC river to transform the nvarchar(MAX) into Base64 by itself?
    
  2.   If not, do you recommend nvarbinary(MAX) or some other MS SQL Server type? And then the SELECT * from XXX would just work?
    

What are your thoughts?

BTW I have been able to convert the nvarchar to base64 using this query

select ID, cast(N'' as xml).value ('xs:base64Binary(xs:hexBinary(sql:column("k.Content")))', 'varchar(max)') as Content from (SELECT ID , cast( cast(Content as varchar(MAX )) as varbinary( MAX)) Content from KBArticles) k;

The usual river and mapper attachment work just fine but the initial indexing takes substantially longer. Why?

  1.   Is there any performance settings I could tweak?
    

From: elasticsearch@googlegroups.commailto:elasticsearch@googlegroups.com [mailto:elasticsearch@googlegroups.com] On Behalf Of joergprante@gmail.commailto:joergprante@gmail.com
Sent: Sunday, February 22, 2015 6:12 PM
To: elasticsearch@googlegroups.commailto:elasticsearch@googlegroups.com
Subject: Re: Indexing of HTML Column in an MS SQL Server 2014 database

Can you give some information about the mapper attachment setup you used successfully?

There is no good reason why this should not be possible with JDBC river.

Jörg

On Sun, Feb 22, 2015 at 5:20 PM, Jiri Pik <jiri.pik@googlemail.commailto:jiri.pik@googlemail.com> wrote:

  I need to index a HTML column (nvarchar(MAX)) in a MS SQL Server database. I have set up a JDBC river https://github.com/jprante/elasticsearch-river-jdbc and the database is indexed.

  Using

    "settings":{

      "analysis":{

        "analyzer":{

          "default":{

            "type":"custom",

            "tokenizer":"standard",

            "filter":[ "standard", "lowercase" ],

            "char_filter" : ["html_strip"]

          }

        }

      }

    }

  is good for searching but not for the highlighter as that returns sometimes trimmed unpaired html tags.

  I have played with the Mapper Attachments with HTML attachments and then the highlighter works well - all original html tags are gone - but I am unable to get the river push the column directly to the Mapper Attachments.

  Questions:

  1. what is the best practice for indexing HTML columns? I am aware of the possibility of a manual removal of HTML tags using Agility Pack but do not like that as it's too much extra maintenance.

  2. is there any better highlighter for html data which doesn't cut off any original html tags?

  3. How to plug in the JDBC river to Mapper Attachments?

  4. Any better ideas how to achieve my goals?



  Thanks!

  --
  You received this message because you are subscribed to the Google Groups "elasticsearch" group.
  To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com<mailto:elasticsearch+unsubscribe@googlegroups.com>.
  To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/f175734b-0889-40a9-96d1-d46702e56666%40googlegroups.com<https://groups.google.com/d/msgid/elasticsearch/f175734b-0889-40a9-96d1-d46702e56666%40googlegroups.com?utm_medium=email&utm_source=footer>.
  For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.commailto:elasticsearch+unsubscribe@googlegroups.com.

To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAKdsXoH6Ei%2B23bRKrL0Z7WkQALengfhaZeJRBq5gK1F22yxJfg%40mail.gmail.comhttps://groups.google.com/d/msgid/elasticsearch/CAKdsXoH6Ei%2B23bRKrL0Z7WkQALengfhaZeJRBq5gK1F22yxJfg%40mail.gmail.com?utm_medium=email&utm_source=footer.
For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.commailto:elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/a5258a9fb35548b186333e442238331c%40Ex13DAG10-N1.dataoncloud.nethttps://groups.google.com/d/msgid/elasticsearch/a5258a9fb35548b186333e442238331c%40Ex13DAG10-N1.dataoncloud.net?utm_medium=email&utm_source=footer.

For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.commailto:elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAKdsXoFaJKN9Q5Rsu8XqLpEWafyPK_YBA7rGvMX7R-9T4Odiuw%40mail.gmail.comhttps://groups.google.com/d/msgid/elasticsearch/CAKdsXoFaJKN9Q5Rsu8XqLpEWafyPK_YBA7rGvMX7R-9T4Odiuw%40mail.gmail.com?utm_medium=email&utm_source=footer.
For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/a9c9114b28384485b3f4d6290d5a2da0%40Ex13DAG10-N1.dataoncloud.net.
For more options, visit https://groups.google.com/d/optout.

How big is the entire table you index?

You can use monitor tools like BigDesk to verify the resources ES is using.

It is close to impossible that just base64 encoding takes 20x longer while
indexing, maybe mapper attachment is doing other extra work.

Jörg

On Mon, Feb 23, 2015 at 9:50 AM, Jiri Pik jiri.pik@jiripik.com wrote:

Thank you for opening of the issue.

If I indexed the column as varchar and used the default ES indexing, the
entire table is indexed within 5 seconds. If I use the Mapper Attachments,
it takes up to 2 minutes. I am not sure whether it’s because of the extra
work SQL Server is doing, or the extra volume the jdbc is taking care, but
I assume it may be because of the way the Mapper Attachments works?

From: elasticsearch@googlegroups.com [mailto:
elasticsearch@googlegroups.com] *On Behalf Of *joergprante@gmail.com
Sent: Monday, February 23, 2015 9:26 AM
To: elasticsearch@googlegroups.com
Subject: Re: FW: Indexing of HTML Column in an MS SQL Server 2014
database

  1. I opened an issue for adding optional base64 encoding on columns:
    https://github.com/jprante/elasticsearch-river-jdbc/issues/472

  2. What is "initial indexing"? What do you mean by "slower"?

  3. Yes, you can change the documented bulk index settings.

Jörg

On Mon, Feb 23, 2015 at 6:12 AM, Jiri Pik jiri.pik@jiripik.com wrote:

Apologies for everyone for sending these emails with digital signature
which may have caused some issues:

Summary for Joerg:

  1.   Is there a way for the JDBC river to transform the nvarchar(MAX)
    

into Base64 by itself? I can do on SQL server – see below (1) for David –
but it’s substantially slower

  1.   If not, do you recommend nvarbinary(MAX) or some other MS SQL
    

Server type? And then the SELECT * from XXX would just work?

Summary for David:

  1.   If I convert the HTML column using select ID, cast(N'' as xml).
    

value ('xs:base64Binary(xs:hexBinary(sql:column("k.Content")))',
'varchar(max)') as Content from (SELECT ID , cast( cast(Content as
varchar(MAX )) as varbinary( MAX)) Content from KBArticles) k; the
indexing just works but takes longer than usual – is there any performance
setting I could use?

  1.   Would it be possible for the attachment mapper to index pure txt
    

file without base64?

From: Jiri Pik
Sent: Monday, February 23, 2015 6:08 AM
To: elasticsearch@googlegroups.com
Subject: RE: Indexing of HTML Column in an MS SQL Server 2014 database

Thank you very much for your kind answer. If I encode the html file into
Base64, and use the enclosed script, then all works just fine.

So, Joerg:

  1.   Is there a way for the JDBC river to transform the nvarchar(MAX)
    

into Base64 by itself?

  1.   If not, do you recommend nvarbinary(MAX) or some other MS SQL
    

Server type? And then the SELECT * from XXX would just work?

What are your thoughts?

BTW I have been able to convert the nvarchar to base64 using this query

select ID, cast(N'' as xml).value (
'xs:base64Binary(xs:hexBinary(sql:column("k.Content")))', 'varchar(max)')
as Content from (SELECT ID , cast( cast(Content as varchar(MAX )) as
varbinary( MAX)) Content from KBArticles) k;

The usual river and mapper attachment work just fine but the initial
indexing takes substantially longer. Why?

  1.   Is there any performance settings I could tweak?
    

From: elasticsearch@googlegroups.com [
mailto:elasticsearch@googlegroups.com elasticsearch@googlegroups.com] *On
Behalf Of *joergprante@gmail.com
Sent: Sunday, February 22, 2015 6:12 PM
To: elasticsearch@googlegroups.com
Subject: Re: Indexing of HTML Column in an MS SQL Server 2014 database

Can you give some information about the mapper attachment setup you used
successfully?

There is no good reason why this should not be possible with JDBC river.

Jörg

On Sun, Feb 22, 2015 at 5:20 PM, Jiri Pik jiri.pik@googlemail.com wrote:

I need to index a HTML column (nvarchar(MAX)) in a MS SQL Server
database. I have set up a JDBC river
https://github.com/jprante/elasticsearch-river-jdbc and the database is
indexed.

Using

"settings":{

"analysis":{

  "analyzer":{

    "default":{

      "type":"custom",

      "tokenizer":"standard",

      "filter":[ "standard", "lowercase" ],

      "char_filter" : ["html_strip"]

    }

  }

}

}

is good for searching but not for the highlighter as that returns
sometimes trimmed unpaired html tags.

I have played with the Mapper Attachments with HTML attachments and then
the highlighter works well - all original html tags are gone - but I am
unable to get the river push the column directly to the Mapper Attachments.

Questions:

  1. what is the best practice for indexing HTML columns? I am aware of the
    possibility of a manual removal of HTML tags using Agility Pack but do not
    like that as it's too much extra maintenance.

  2. is there any better highlighter for html data which doesn't cut off any
    original html tags?

  3. How to plug in the JDBC river to Mapper Attachments?

  4. Any better ideas how to achieve my goals?

Thanks!

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/f175734b-0889-40a9-96d1-d46702e56666%40googlegroups.com
https://groups.google.com/d/msgid/elasticsearch/f175734b-0889-40a9-96d1-d46702e56666%40googlegroups.com?utm_medium=email&utm_source=footer
.
For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearch+unsubscribe@googlegroups.com.

To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/CAKdsXoH6Ei%2B23bRKrL0Z7WkQALengfhaZeJRBq5gK1F22yxJfg%40mail.gmail.com
https://groups.google.com/d/msgid/elasticsearch/CAKdsXoH6Ei%2B23bRKrL0Z7WkQALengfhaZeJRBq5gK1F22yxJfg%40mail.gmail.com?utm_medium=email&utm_source=footer
.
For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/a5258a9fb35548b186333e442238331c%40Ex13DAG10-N1.dataoncloud.net
https://groups.google.com/d/msgid/elasticsearch/a5258a9fb35548b186333e442238331c%40Ex13DAG10-N1.dataoncloud.net?utm_medium=email&utm_source=footer
.

For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/CAKdsXoFaJKN9Q5Rsu8XqLpEWafyPK_YBA7rGvMX7R-9T4Odiuw%40mail.gmail.com
https://groups.google.com/d/msgid/elasticsearch/CAKdsXoFaJKN9Q5Rsu8XqLpEWafyPK_YBA7rGvMX7R-9T4Odiuw%40mail.gmail.com?utm_medium=email&utm_source=footer
.
For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/a9c9114b28384485b3f4d6290d5a2da0%40Ex13DAG10-N1.dataoncloud.net
https://groups.google.com/d/msgid/elasticsearch/a9c9114b28384485b3f4d6290d5a2da0%40Ex13DAG10-N1.dataoncloud.net?utm_medium=email&utm_source=footer
.

For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAKdsXoHw3oba_%3DAGAnYofoeHY%3Dx5JDwdSPmRhEcPdmMkHUEQwQ%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

The table has 3000 rows, the index is defined as below

{

{

"index": {

"primary_size_in_bytes": 296341451,

"size_in_bytes": 296341451

},

"translog": {

"operations": 0

},

"docs": {

"num_docs": 3000,

"max_doc": 3000,

"deleted_docs": 0

},},

I believe it’s the mapper attachment who is causing this delay.

David – is there any way to speed this up?

From: elasticsearch@googlegroups.com [mailto:elasticsearch@googlegroups.com] On Behalf Of joergprante@gmail.com
Sent: Monday, February 23, 2015 10:15 AM
To: elasticsearch@googlegroups.com
Subject: Re: FW: Indexing of HTML Column in an MS SQL Server 2014 database

How big is the entire table you index?

You can use monitor tools like BigDesk to verify the resources ES is using.

It is close to impossible that just base64 encoding takes 20x longer while indexing, maybe mapper attachment is doing other extra work.

Jörg

On Mon, Feb 23, 2015 at 9:50 AM, Jiri Pik <jiri.pik@jiripik.com mailto:jiri.pik@jiripik.com > wrote:

Thank you for opening of the issue.

If I indexed the column as varchar and used the default ES indexing, the entire table is indexed within 5 seconds. If I use the Mapper Attachments, it takes up to 2 minutes. I am not sure whether it’s because of the extra work SQL Server is doing, or the extra volume the jdbc is taking care, but I assume it may be because of the way the Mapper Attachments works?

From: elasticsearch@googlegroups.com mailto:elasticsearch@googlegroups.com [mailto:elasticsearch@googlegroups.com mailto:elasticsearch@googlegroups.com ] On Behalf Of joergprante@gmail.com mailto:joergprante@gmail.com
Sent: Monday, February 23, 2015 9:26 AM
To: elasticsearch@googlegroups.com mailto:elasticsearch@googlegroups.com
Subject: Re: FW: Indexing of HTML Column in an MS SQL Server 2014 database

  1. I opened an issue for adding optional base64 encoding on columns: https://github.com/jprante/elasticsearch-river-jdbc/issues/472

  2. What is "initial indexing"? What do you mean by "slower"?

  3. Yes, you can change the documented bulk index settings.

Jörg

On Mon, Feb 23, 2015 at 6:12 AM, Jiri Pik <jiri.pik@jiripik.com mailto:jiri.pik@jiripik.com > wrote:

Apologies for everyone for sending these emails with digital signature which may have caused some issues:

Summary for Joerg:

  1.   Is there a way for the JDBC river to transform the nvarchar(MAX) into Base64 by itself? I can do on SQL server – see below (1) for David – but it’s substantially slower
    
  2.   If not, do you recommend nvarbinary(MAX) or some other MS SQL Server type? And then the SELECT * from XXX would just work?
    

Summary for David:

  1.   If I convert the HTML column using select ID, cast(N'' as xml).value ('xs:base64Binary(xs:hexBinary(sql:column("k.Content")))', 'varchar(max)') as Content from (SELECT ID ,  cast( cast(Content as varchar(MAX )) as varbinary( MAX)) Content from KBArticles) k; the indexing just works but takes longer than usual – is there any performance setting I could use?
    
  2.   Would it be possible for the attachment mapper to index pure txt file without base64?
    

From: Jiri Pik
Sent: Monday, February 23, 2015 6:08 AM
To: elasticsearch@googlegroups.com mailto:elasticsearch@googlegroups.com
Subject: RE: Indexing of HTML Column in an MS SQL Server 2014 database

Thank you very much for your kind answer. If I encode the html file into Base64, and use the enclosed script, then all works just fine.

So, Joerg:

  1.   Is there a way for the JDBC river to transform the nvarchar(MAX) into Base64 by itself? 
    
  2.   If not, do you recommend nvarbinary(MAX) or some other MS SQL Server type? And then the SELECT * from XXX would just work?
    

What are your thoughts?

BTW I have been able to convert the nvarchar to base64 using this query

select ID, cast(N'' as xml).value ('xs:base64Binary(xs:hexBinary(sql:column("k.Content")))', 'varchar(max)') as Content from (SELECT ID , cast( cast(Content as varchar(MAX )) as varbinary( MAX)) Content from KBArticles) k;

The usual river and mapper attachment work just fine but the initial indexing takes substantially longer. Why?

  1.   Is there any performance settings I could tweak?
    

From: elasticsearch@googlegroups.com mailto:elasticsearch@googlegroups.com [mailto:elasticsearch@googlegroups.com] On Behalf Of joergprante@gmail.com mailto:joergprante@gmail.com
Sent: Sunday, February 22, 2015 6:12 PM
To: elasticsearch@googlegroups.com mailto:elasticsearch@googlegroups.com
Subject: Re: Indexing of HTML Column in an MS SQL Server 2014 database

Can you give some information about the mapper attachment setup you used successfully?

There is no good reason why this should not be possible with JDBC river.

Jörg

On Sun, Feb 22, 2015 at 5:20 PM, Jiri Pik <jiri.pik@googlemail.com mailto:jiri.pik@googlemail.com > wrote:

I need to index a HTML column (nvarchar(MAX)) in a MS SQL Server database. I have set up a JDBC river https://github.com/jprante/elasticsearch-river-jdbc and the database is indexed.

Using

"settings":{

"analysis":{

  "analyzer":{

    "default":{

      "type":"custom",

      "tokenizer":"standard",

      "filter":[ "standard", "lowercase" ], 

      "char_filter" : ["html_strip"]

    }

  }

}

}

is good for searching but not for the highlighter as that returns sometimes trimmed unpaired html tags.

I have played with the Mapper Attachments with HTML attachments and then the highlighter works well - all original html tags are gone - but I am unable to get the river push the column directly to the Mapper Attachments.

Questions:

  1. what is the best practice for indexing HTML columns? I am aware of the possibility of a manual removal of HTML tags using Agility Pack but do not like that as it's too much extra maintenance.

  2. is there any better highlighter for html data which doesn't cut off any original html tags?

  3. How to plug in the JDBC river to Mapper Attachments?

  4. Any better ideas how to achieve my goals?

Thanks!

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com mailto:elasticsearch+unsubscribe@googlegroups.com .
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/f175734b-0889-40a9-96d1-d46702e56666%40googlegroups.com https://groups.google.com/d/msgid/elasticsearch/f175734b-0889-40a9-96d1-d46702e56666%40googlegroups.com?utm_medium=email&utm_source=footer .
For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com mailto:elasticsearch+unsubscribe@googlegroups.com .

To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAKdsXoH6Ei%2B23bRKrL0Z7WkQALengfhaZeJRBq5gK1F22yxJfg%40mail.gmail.com https://groups.google.com/d/msgid/elasticsearch/CAKdsXoH6Ei%2B23bRKrL0Z7WkQALengfhaZeJRBq5gK1F22yxJfg%40mail.gmail.com?utm_medium=email&utm_source=footer .
For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com mailto:elasticsearch+unsubscribe@googlegroups.com .
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/a5258a9fb35548b186333e442238331c%40Ex13DAG10-N1.dataoncloud.net https://groups.google.com/d/msgid/elasticsearch/a5258a9fb35548b186333e442238331c%40Ex13DAG10-N1.dataoncloud.net?utm_medium=email&utm_source=footer .

For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com mailto:elasticsearch+unsubscribe@googlegroups.com .

To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAKdsXoFaJKN9Q5Rsu8XqLpEWafyPK_YBA7rGvMX7R-9T4Odiuw%40mail.gmail.com https://groups.google.com/d/msgid/elasticsearch/CAKdsXoFaJKN9Q5Rsu8XqLpEWafyPK_YBA7rGvMX7R-9T4Odiuw%40mail.gmail.com?utm_medium=email&utm_source=footer .
For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com mailto:elasticsearch+unsubscribe@googlegroups.com .
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/a9c9114b28384485b3f4d6290d5a2da0%40Ex13DAG10-N1.dataoncloud.net https://groups.google.com/d/msgid/elasticsearch/a9c9114b28384485b3f4d6290d5a2da0%40Ex13DAG10-N1.dataoncloud.net?utm_medium=email&utm_source=footer .

For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com mailto:elasticsearch+unsubscribe@googlegroups.com .
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAKdsXoHw3oba_%3DAGAnYofoeHY%3Dx5JDwdSPmRhEcPdmMkHUEQwQ%40mail.gmail.com https://groups.google.com/d/msgid/elasticsearch/CAKdsXoHw3oba_%3DAGAnYofoeHY%3Dx5JDwdSPmRhEcPdmMkHUEQwQ%40mail.gmail.com?utm_medium=email&utm_source=footer .
For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/eb5ebcabc33a4e82a726a936733fdd28%40Ex13DAG10-N1.dataoncloud.net.
For more options, visit https://groups.google.com/d/optout.

I am not sure but it looks like mapper attachment is doing some extra
processing, for example Tika, which is very expensive. Maybe there is some
configuration option, I did not check.

Jörg

On Mon, Feb 23, 2015 at 2:13 PM, Jiri Pik jiri.pik@jiripik.com wrote:

The table has 3000 rows, the index is defined as below

{

{

"index": {

"primary_size_in_bytes": 296341451,

"size_in_bytes": 296341451

},

"translog": {

"operations": 0

},

"docs": {

"num_docs": 3000,

"max_doc": 3000,

"deleted_docs": 0

},},

I believe it’s the mapper attachment who is causing this delay.

David – is there any way to speed this up?

From: elasticsearch@googlegroups.com [mailto:
elasticsearch@googlegroups.com] *On Behalf Of *joergprante@gmail.com
Sent: Monday, February 23, 2015 10:15 AM

To: elasticsearch@googlegroups.com
Subject: Re: FW: Indexing of HTML Column in an MS SQL Server 2014
database

How big is the entire table you index?

You can use monitor tools like BigDesk to verify the resources ES is using.

It is close to impossible that just base64 encoding takes 20x longer while
indexing, maybe mapper attachment is doing other extra work.

Jörg

On Mon, Feb 23, 2015 at 9:50 AM, Jiri Pik jiri.pik@jiripik.com wrote:

Thank you for opening of the issue.

If I indexed the column as varchar and used the default ES indexing, the
entire table is indexed within 5 seconds. If I use the Mapper Attachments,
it takes up to 2 minutes. I am not sure whether it’s because of the extra
work SQL Server is doing, or the extra volume the jdbc is taking care, but
I assume it may be because of the way the Mapper Attachments works?

From: elasticsearch@googlegroups.com [mailto:
elasticsearch@googlegroups.com] *On Behalf Of *joergprante@gmail.com
Sent: Monday, February 23, 2015 9:26 AM
To: elasticsearch@googlegroups.com
Subject: Re: FW: Indexing of HTML Column in an MS SQL Server 2014
database

  1. I opened an issue for adding optional base64 encoding on columns:
    https://github.com/jprante/elasticsearch-river-jdbc/issues/472

  2. What is "initial indexing"? What do you mean by "slower"?

  3. Yes, you can change the documented bulk index settings.

Jörg

On Mon, Feb 23, 2015 at 6:12 AM, Jiri Pik jiri.pik@jiripik.com wrote:

Apologies for everyone for sending these emails with digital signature
which may have caused some issues:

Summary for Joerg:

  1.   Is there a way for the JDBC river to transform the nvarchar(MAX)
    

into Base64 by itself? I can do on SQL server – see below (1) for David –
but it’s substantially slower

  1.   If not, do you recommend nvarbinary(MAX) or some other MS SQL
    

Server type? And then the SELECT * from XXX would just work?

Summary for David:

  1.   If I convert the HTML column using select ID, cast(N'' as xml).
    

value ('xs:base64Binary(xs:hexBinary(sql:column("k.Content")))',
'varchar(max)') as Content from (SELECT ID , cast( cast(Content as
varchar(MAX )) as varbinary( MAX)) Content from KBArticles) k; the
indexing just works but takes longer than usual – is there any performance
setting I could use?

  1.   Would it be possible for the attachment mapper to index pure txt
    

file without base64?

From: Jiri Pik
Sent: Monday, February 23, 2015 6:08 AM
To: elasticsearch@googlegroups.com
Subject: RE: Indexing of HTML Column in an MS SQL Server 2014 database

Thank you very much for your kind answer. If I encode the html file into
Base64, and use the enclosed script, then all works just fine.

So, Joerg:

  1.   Is there a way for the JDBC river to transform the nvarchar(MAX)
    

into Base64 by itself?

  1.   If not, do you recommend nvarbinary(MAX) or some other MS SQL
    

Server type? And then the SELECT * from XXX would just work?

What are your thoughts?

BTW I have been able to convert the nvarchar to base64 using this query

select ID, cast(N'' as xml).value (
'xs:base64Binary(xs:hexBinary(sql:column("k.Content")))', 'varchar(max)')
as Content from (SELECT ID , cast( cast(Content as varchar(MAX )) as
varbinary( MAX)) Content from KBArticles) k;

The usual river and mapper attachment work just fine but the initial
indexing takes substantially longer. Why?

  1.   Is there any performance settings I could tweak?
    

From: elasticsearch@googlegroups.com [
mailto:elasticsearch@googlegroups.com elasticsearch@googlegroups.com] *On
Behalf Of *joergprante@gmail.com
Sent: Sunday, February 22, 2015 6:12 PM
To: elasticsearch@googlegroups.com
Subject: Re: Indexing of HTML Column in an MS SQL Server 2014 database

Can you give some information about the mapper attachment setup you used
successfully?

There is no good reason why this should not be possible with JDBC river.

Jörg

On Sun, Feb 22, 2015 at 5:20 PM, Jiri Pik jiri.pik@googlemail.com wrote:

I need to index a HTML column (nvarchar(MAX)) in a MS SQL Server database.
I have set up a JDBC river
https://github.com/jprante/elasticsearch-river-jdbc and the database is
indexed.

Using

"settings":{

"analysis":{

  "analyzer":{

    "default":{

      "type":"custom",

      "tokenizer":"standard",

      "filter":[ "standard", "lowercase" ],

      "char_filter" : ["html_strip"]

    }

  }

}

}

is good for searching but not for the highlighter as that returns
sometimes trimmed unpaired html tags.

I have played with the Mapper Attachments with HTML attachments and then
the highlighter works well - all original html tags are gone - but I am
unable to get the river push the column directly to the Mapper Attachments.

Questions:

  1. what is the best practice for indexing HTML columns? I am aware of the
    possibility of a manual removal of HTML tags using Agility Pack but do not
    like that as it's too much extra maintenance.

  2. is there any better highlighter for html data which doesn't cut off any
    original html tags?

  3. How to plug in the JDBC river to Mapper Attachments?

  4. Any better ideas how to achieve my goals?

Thanks!

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/f175734b-0889-40a9-96d1-d46702e56666%40googlegroups.com
https://groups.google.com/d/msgid/elasticsearch/f175734b-0889-40a9-96d1-d46702e56666%40googlegroups.com?utm_medium=email&utm_source=footer
.
For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearch+unsubscribe@googlegroups.com.

To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/CAKdsXoH6Ei%2B23bRKrL0Z7WkQALengfhaZeJRBq5gK1F22yxJfg%40mail.gmail.com
https://groups.google.com/d/msgid/elasticsearch/CAKdsXoH6Ei%2B23bRKrL0Z7WkQALengfhaZeJRBq5gK1F22yxJfg%40mail.gmail.com?utm_medium=email&utm_source=footer
.
For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/a5258a9fb35548b186333e442238331c%40Ex13DAG10-N1.dataoncloud.net
https://groups.google.com/d/msgid/elasticsearch/a5258a9fb35548b186333e442238331c%40Ex13DAG10-N1.dataoncloud.net?utm_medium=email&utm_source=footer
.

For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearch+unsubscribe@googlegroups.com.

To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/CAKdsXoFaJKN9Q5Rsu8XqLpEWafyPK_YBA7rGvMX7R-9T4Odiuw%40mail.gmail.com
https://groups.google.com/d/msgid/elasticsearch/CAKdsXoFaJKN9Q5Rsu8XqLpEWafyPK_YBA7rGvMX7R-9T4Odiuw%40mail.gmail.com?utm_medium=email&utm_source=footer
.
For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/a9c9114b28384485b3f4d6290d5a2da0%40Ex13DAG10-N1.dataoncloud.net
https://groups.google.com/d/msgid/elasticsearch/a9c9114b28384485b3f4d6290d5a2da0%40Ex13DAG10-N1.dataoncloud.net?utm_medium=email&utm_source=footer
.

For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/CAKdsXoHw3oba_%3DAGAnYofoeHY%3Dx5JDwdSPmRhEcPdmMkHUEQwQ%40mail.gmail.com
https://groups.google.com/d/msgid/elasticsearch/CAKdsXoHw3oba_%3DAGAnYofoeHY%3Dx5JDwdSPmRhEcPdmMkHUEQwQ%40mail.gmail.com?utm_medium=email&utm_source=footer
.
For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/eb5ebcabc33a4e82a726a936733fdd28%40Ex13DAG10-N1.dataoncloud.net
https://groups.google.com/d/msgid/elasticsearch/eb5ebcabc33a4e82a726a936733fdd28%40Ex13DAG10-N1.dataoncloud.net?utm_medium=email&utm_source=footer
.

For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAKdsXoGMyR-nMK%3Dymrec%2B3KF1nySUNJw7%2BSR0KcxHsxq7CrKdQ%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

David – can you advise if there is anything which can be done to speed up indexing? Are there any config parameters I could use to tweak the performance?

Joerg – the entire indexing looks completely differently. If I do not use the mapper attachment, the size of the index and the number of indexed documents grow at the same time. With mapper attachments, however, the size of the index grows with the number of indexed documents staying at 0 until the entire index is being built.

From: elasticsearch@googlegroups.com [mailto:elasticsearch@googlegroups.com] On Behalf Of joergprante@gmail.com
Sent: Monday, February 23, 2015 5:13 PM
To: elasticsearch@googlegroups.com
Subject: Re: FW: Indexing of HTML Column in an MS SQL Server 2014 database

I am not sure but it looks like mapper attachment is doing some extra processing, for example Tika, which is very expensive. Maybe there is some configuration option, I did not check.

Jörg

On Mon, Feb 23, 2015 at 2:13 PM, Jiri Pik <jiri.pik@jiripik.commailto:jiri.pik@jiripik.com> wrote:

The table has 3000 rows, the index is defined as below

{

{

"index": {

"primary_size_in_bytes": 296341451,

"size_in_bytes": 296341451

},

"translog": {

"operations": 0

},

"docs": {

"num_docs": 3000,

"max_doc": 3000,

"deleted_docs": 0

},},

I believe it’s the mapper attachment who is causing this delay.

David – is there any way to speed this up?

From: elasticsearch@googlegroups.commailto:elasticsearch@googlegroups.com [mailto:elasticsearch@googlegroups.commailto:elasticsearch@googlegroups.com] On Behalf Of joergprante@gmail.commailto:joergprante@gmail.com
Sent: Monday, February 23, 2015 10:15 AM

To: elasticsearch@googlegroups.commailto:elasticsearch@googlegroups.com
Subject: Re: FW: Indexing of HTML Column in an MS SQL Server 2014 database

How big is the entire table you index?

You can use monitor tools like BigDesk to verify the resources ES is using.

It is close to impossible that just base64 encoding takes 20x longer while indexing, maybe mapper attachment is doing other extra work.

Jörg

On Mon, Feb 23, 2015 at 9:50 AM, Jiri Pik <jiri.pik@jiripik.commailto:jiri.pik@jiripik.com> wrote:

  Thank you for opening of the issue.



  If I indexed the column as varchar and used the default ES indexing, the entire table is indexed within 5 seconds. If I use the Mapper Attachments, it takes up to 2 minutes. I am not sure whether it’s because of the extra work SQL Server is doing, or the extra volume the jdbc is taking care, but I assume it may be because of the way the Mapper Attachments works?







  From: elasticsearch@googlegroups.com<mailto:elasticsearch@googlegroups.com> [mailto:elasticsearch@googlegroups.com<mailto:elasticsearch@googlegroups.com>] On Behalf Of joergprante@gmail.com<mailto:joergprante@gmail.com>
  Sent: Monday, February 23, 2015 9:26 AM
  To: elasticsearch@googlegroups.com<mailto:elasticsearch@googlegroups.com>
  Subject: Re: FW: Indexing of HTML Column in an MS SQL Server 2014 database



  1. I opened an issue for adding optional base64 encoding on columns: https://github.com/jprante/elasticsearch-river-jdbc/issues/472



  2. What is "initial indexing"? What do you mean by "slower"?



  3. Yes, you can change the documented bulk index settings.



  Jörg





  On Mon, Feb 23, 2015 at 6:12 AM, Jiri Pik <jiri.pik@jiripik.com<mailto:jiri.pik@jiripik.com>> wrote:

     Apologies for everyone for sending these emails with digital signature which may have caused some issues:



     Summary for Joerg:



     1.       Is there a way for the JDBC river to transform the nvarchar(MAX) into Base64 by itself? I can do on SQL server – see below (1) for David – but it’s substantially slower

     2.       If not, do you recommend nvarbinary(MAX) or some other MS SQL Server type? And then the SELECT * from XXX would just work?



     Summary for David:

     1.       If I convert the HTML column using select ID, cast(N'' as xml).value ('xs:base64Binary(xs:hexBinary(sql:column("k.Content")))', 'varchar(max)') as Content from (SELECT ID ,  cast( cast(Content as varchar(MAX )) as varbinary( MAX)) Content from KBArticles) k; the indexing just works but takes longer than usual – is there any performance setting I could use?

     2.       Would it be possible for the attachment mapper to index pure txt file without base64?













     From: Jiri Pik
     Sent: Monday, February 23, 2015 6:08 AM
     To: elasticsearch@googlegroups.com<mailto:elasticsearch@googlegroups.com>
     Subject: RE: Indexing of HTML Column in an MS SQL Server 2014 database



     Thank you very much for your kind answer. If I encode the html file into Base64, and use the enclosed script, then all works just fine.



     So, Joerg:



     1.       Is there a way for the JDBC river to transform the nvarchar(MAX) into Base64 by itself?

     2.       If not, do you recommend nvarbinary(MAX) or some other MS SQL Server type? And then the SELECT * from XXX would just work?



     What are your thoughts?



     BTW I have been able to convert the nvarchar to base64 using this query

     select ID, cast(N'' as xml).value ('xs:base64Binary(xs:hexBinary(sql:column("k.Content")))', 'varchar(max)') as Content from (SELECT ID ,  cast( cast(Content as varchar(MAX )) as varbinary( MAX)) Content from KBArticles) k;





     The usual river and mapper attachment work just fine but the initial indexing takes substantially longer. Why?



     3.       Is there any performance settings I could tweak?



     From: elasticsearch@googlegroups.com<mailto:elasticsearch@googlegroups.com> [mailto:elasticsearch@googlegroups.com] On Behalf Of joergprante@gmail.com<mailto:joergprante@gmail.com>
     Sent: Sunday, February 22, 2015 6:12 PM
     To: elasticsearch@googlegroups.com<mailto:elasticsearch@googlegroups.com>
     Subject: Re: Indexing of HTML Column in an MS SQL Server 2014 database



     Can you give some information about the mapper attachment setup you used successfully?



     There is no good reason why this should not be possible with JDBC river.



     Jörg



     On Sun, Feb 22, 2015 at 5:20 PM, Jiri Pik <jiri.pik@googlemail.com<mailto:jiri.pik@googlemail.com>> wrote:

        I need to index a HTML column (nvarchar(MAX)) in a MS SQL Server database. I have set up a JDBC river https://github.com/jprante/elasticsearch-river-jdbc and the database is indexed.

        Using

          "settings":{

            "analysis":{

              "analyzer":{

                "default":{

                  "type":"custom",

                  "tokenizer":"standard",

                  "filter":[ "standard", "lowercase" ],

                  "char_filter" : ["html_strip"]

                }

              }

            }

          }

        is good for searching but not for the highlighter as that returns sometimes trimmed unpaired html tags.

        I have played with the Mapper Attachments with HTML attachments and then the highlighter works well - all original html tags are gone - but I am unable to get the river push the column directly to the Mapper Attachments.

        Questions:

        1. what is the best practice for indexing HTML columns? I am aware of the possibility of a manual removal of HTML tags using Agility Pack but do not like that as it's too much extra maintenance.

        2. is there any better highlighter for html data which doesn't cut off any original html tags?

        3. How to plug in the JDBC river to Mapper Attachments?

        4. Any better ideas how to achieve my goals?



        Thanks!

        --
        You received this message because you are subscribed to the Google Groups "elasticsearch" group.
        To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com<mailto:elasticsearch+unsubscribe@googlegroups.com>.
        To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/f175734b-0889-40a9-96d1-d46702e56666%40googlegroups.com<https://groups.google.com/d/msgid/elasticsearch/f175734b-0889-40a9-96d1-d46702e56666%40googlegroups.com?utm_medium=email&utm_source=footer>.
        For more options, visit https://groups.google.com/d/optout.



     --
     You received this message because you are subscribed to the Google Groups "elasticsearch" group.
     To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com<mailto:elasticsearch+unsubscribe@googlegroups.com>.

     To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAKdsXoH6Ei%2B23bRKrL0Z7WkQALengfhaZeJRBq5gK1F22yxJfg%40mail.gmail.com<https://groups.google.com/d/msgid/elasticsearch/CAKdsXoH6Ei%2B23bRKrL0Z7WkQALengfhaZeJRBq5gK1F22yxJfg%40mail.gmail.com?utm_medium=email&utm_source=footer>.
     For more options, visit https://groups.google.com/d/optout.

     --
     You received this message because you are subscribed to the Google Groups "elasticsearch" group.
     To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com<mailto:elasticsearch+unsubscribe@googlegroups.com>.
     To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/a5258a9fb35548b186333e442238331c%40Ex13DAG10-N1.dataoncloud.net<https://groups.google.com/d/msgid/elasticsearch/a5258a9fb35548b186333e442238331c%40Ex13DAG10-N1.dataoncloud.net?utm_medium=email&utm_source=footer>.


     For more options, visit https://groups.google.com/d/optout.



  --
  You received this message because you are subscribed to the Google Groups "elasticsearch" group.
  To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com<mailto:elasticsearch+unsubscribe@googlegroups.com>.

  To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAKdsXoFaJKN9Q5Rsu8XqLpEWafyPK_YBA7rGvMX7R-9T4Odiuw%40mail.gmail.com<https://groups.google.com/d/msgid/elasticsearch/CAKdsXoFaJKN9Q5Rsu8XqLpEWafyPK_YBA7rGvMX7R-9T4Odiuw%40mail.gmail.com?utm_medium=email&utm_source=footer>.
  For more options, visit https://groups.google.com/d/optout.

  --
  You received this message because you are subscribed to the Google Groups "elasticsearch" group.
  To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com<mailto:elasticsearch+unsubscribe@googlegroups.com>.
  To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/a9c9114b28384485b3f4d6290d5a2da0%40Ex13DAG10-N1.dataoncloud.net<https://groups.google.com/d/msgid/elasticsearch/a9c9114b28384485b3f4d6290d5a2da0%40Ex13DAG10-N1.dataoncloud.net?utm_medium=email&utm_source=footer>.


  For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.commailto:elasticsearch+unsubscribe@googlegroups.com.

To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAKdsXoHw3oba_%3DAGAnYofoeHY%3Dx5JDwdSPmRhEcPdmMkHUEQwQ%40mail.gmail.comhttps://groups.google.com/d/msgid/elasticsearch/CAKdsXoHw3oba_%3DAGAnYofoeHY%3Dx5JDwdSPmRhEcPdmMkHUEQwQ%40mail.gmail.com?utm_medium=email&utm_source=footer.
For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.commailto:elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/eb5ebcabc33a4e82a726a936733fdd28%40Ex13DAG10-N1.dataoncloud.nethttps://groups.google.com/d/msgid/elasticsearch/eb5ebcabc33a4e82a726a936733fdd28%40Ex13DAG10-N1.dataoncloud.net?utm_medium=email&utm_source=footer.

For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.commailto:elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAKdsXoGMyR-nMK%3Dymrec%2B3KF1nySUNJw7%2BSR0KcxHsxq7CrKdQ%40mail.gmail.comhttps://groups.google.com/d/msgid/elasticsearch/CAKdsXoGMyR-nMK%3Dymrec%2B3KF1nySUNJw7%2BSR0KcxHsxq7CrKdQ%40mail.gmail.com?utm_medium=email&utm_source=footer.
For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/33a3cd982b1743e49520143e773600db%40Ex13DAG10-N1.dataoncloud.net.
For more options, visit https://groups.google.com/d/optout.