Indexing of HTML Column in an MS SQL Server 2014 database

Jiri_Pik · February 22, 2015, 4:20pm

I need to index an HTML column (nvarchar(MAX)) in an MS SQL Server database.
I have set up the JDBC river (https://github.com/jprante/elasticsearch-river-jdbc),
and the database is indexed.

Using

"settings": {
  "analysis": {
    "analyzer": {
      "default": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": [ "standard", "lowercase" ],
        "char_filter": [ "html_strip" ]
      }
    }
  }
}

is good for searching but not for highlighting, because the highlighter
sometimes returns fragments containing truncated, unpaired HTML tags.

I have also experimented with Mapper Attachments, indexing the HTML as
attachments. The highlighter then works well, since all of the original HTML
tags are gone, but I am unable to get the river to push the column directly
to Mapper Attachments.

Questions:

1. What is the best practice for indexing HTML columns? I am aware that I
   could remove the HTML tags manually using Agility Pack, but I do not like
   that option because it is too much extra maintenance.
2. Is there a better highlighter for HTML data, one that does not cut off
   the original HTML tags?
3. How can I plug the JDBC river into Mapper Attachments?
4. Are there any better ideas for achieving my goals?
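
To make the manual tag-removal option from the first question concrete, here is a minimal sketch of stripping markup before the text reaches the index. It is only an illustration: Agility Pack is a .NET library, while this uses Python's standard html.parser as a stand-in, and the names TextExtractor and html_to_text are hypothetical. It shows the general idea of keeping text nodes and dropping tags, not a recommended production pipeline.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: keeps text nodes, drops tags, scripts and styles."""

    SKIPPED = {"script", "style"}
    # Tags that should act as word boundaries in the extracted text.
    BLOCK = {"p", "div", "br", "li", "tr", "td", "th",
             "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode &amp; and friends
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1
        if tag in self.BLOCK:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1
        if tag in self.BLOCK:
            self.parts.append(" ")

    def handle_data(self, data):
        # Ignore the contents of <script> and <style> elements.
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML fragment with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


if __name__ == "__main__":
    print(html_to_text("<p>Hello <b>wor</b>ld</p><p>two</p>"))
```

Text cleaned this way could go into a separate plain-text field used only for highlighting, which is exactly the extra maintenance the question hopes to avoid.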

Thanks!

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/f175734b-0889-40a9-96d1-d46702e56666%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

jprante · February 22, 2015, 5:11pm

Can you give some information about the mapper attachment setup you used
successfully?

There is no good reason why this should not be possible with JDBC river.

Jörg

dadoonet · February 22, 2015, 5:14pm

Hi Jörg

A bit off topic: are you indexing blobs as base64-encoded fields in the JDBC river?
(I have not looked at the docs.)

--
David
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

jprante · February 22, 2015, 6:56pm

For java.sql.Types.BLOB, I use the builder.value(Object object) method in
XContentBuilder, with a byte array as parameter.

For java.sql.Types.CLOB/NCLOB, I just use the string returned by JDBC from
Clob.getSubString.

There are DBs which store blobs as java.sql.Types.BINARY, and this can be
passed as string or byte array to XContentBuilder (default is byte array).

Here, it is an NVARCHAR column of MS SQL, which is always returned by JDBC as
a string by the getNString() method.
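
A rough sketch of what this means for the resulting JSON document, written in Python rather than the river's Java and not taken from its source. It assumes that a byte array ends up as base64 text in the JSON and that a string is written unchanged; to_json_value is a hypothetical name.

```python
import base64
import json


def to_json_value(value):
    """Hypothetical mapping: binary becomes base64 text, strings pass through."""
    if isinstance(value, (bytes, bytearray)):
        return base64.b64encode(bytes(value)).decode("ascii")
    return value


# An NVARCHAR column arrives as a string, so the HTML is indexed as plain text.
text_row = {"ID": 1, "Content": "<p>Hello</p>"}
# A binary column arrives as bytes, so the same HTML would be indexed as base64.
binary_row = {"ID": 1, "Content": b"<p>Hello</p>"}

text_doc = json.dumps({k: to_json_value(v) for k, v in text_row.items()})
binary_doc = json.dumps({k: to_json_value(v) for k, v in binary_row.items()})

if __name__ == "__main__":
    print(text_doc)
    print(binary_doc)
```

Under that assumption, an NVARCHAR column reaches Elasticsearch as raw HTML text rather than base64.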

Jörg

Jiri_Pik_2 · February 23, 2015, 4:19am

David: Do I need to use copy_to to copy the content into a new dummy field in order for the highlighting to work?

Jiri_Pik_2 · February 23, 2015, 4:30am

And David:

Would it be possible to index text/html given as text rather than Base64?

Jiri_Pik_2 · February 23, 2015, 5:08am

Thank you very much for your kind answer. If I encode the html file into Base64, and use the enclosed script, then all works just fine.

So, Joerg:

  Is there a way for the JDBC river to convert the nvarchar(MAX) column to Base64 by itself?

  If not, would you recommend varbinary(MAX) or some other MS SQL Server type, so that a plain SELECT * FROM XXX would just work?

What are your thoughts?

BTW, I have been able to convert the nvarchar column to Base64 using this query:

SELECT ID,
       CAST(N'' AS xml).value('xs:base64Binary(xs:hexBinary(sql:column("k.Content")))', 'varchar(max)') AS Content
FROM (SELECT ID, CAST(CAST(Content AS varchar(MAX)) AS varbinary(MAX)) AS Content
      FROM KBArticles) k;

The usual river and mapper attachment setup works just fine with it, but the initial indexing takes substantially longer. Why?

  Are there any performance settings I could tweak?
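
For comparison, the same Base64 step done on the client side might look like the sketch below. It is only an illustration: the function names are hypothetical, and UTF-8 as the byte encoding is an assumption, not something the river or the attachment mapper is confirmed to require. It also shows two side effects worth keeping in mind: Base64 makes the payload about a third larger, and the inner cast to varchar in the query above goes through a single-byte code page, which can replace characters that code page cannot represent (approximated here with cp1252).

```python
import base64


def encode_for_attachment(html, encoding="utf-8"):
    """Hypothetical helper: HTML text -> Base64 string for an attachment field."""
    return base64.b64encode(html.encode(encoding)).decode("ascii")


def decode_attachment(payload, encoding="utf-8"):
    """Inverse of encode_for_attachment, handy for checking the round trip."""
    return base64.b64decode(payload).decode(encoding)


html = "<p>Žluťoučký kůň</p>"

# UTF-8 keeps every character intact through the Base64 round trip.
payload = encode_for_attachment(html)
restored = decode_attachment(payload)

# Rough stand-in for casting nvarchar to varchar: characters missing from
# the single-byte code page are replaced with '?'.
lossy = html.encode("cp1252", errors="replace").decode("cp1252")

# Base64 emits 4 output characters for every 3 input bytes.
growth = len(payload) / len(html.encode("utf-8"))

if __name__ == "__main__":
    print(restored == html, lossy, round(growth, 2))
```

The larger payload may be one contributing factor to slower indexing, though this sketch does not measure where the time actually goes.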

Jiri_Pik_2 · February 23, 2015, 5:12am

Apologies to everyone for sending these emails with a digital signature, which may have caused some issues.

Summary for Joerg:

  Is there a way for the JDBC river to convert the nvarchar(MAX) column to Base64 by itself? I can do it on the SQL Server side (see item 1 for David below), but it is substantially slower.

  If not, would you recommend varbinary(MAX) or some other MS SQL Server type, so that a plain SELECT * FROM XXX would just work?

Summary for David:

  If I convert the HTML column using select ID, cast(N'' as xml).value('xs:base64Binary(xs:hexBinary(sql:column("k.Content")))', 'varchar(max)') as Content from (SELECT ID, cast(cast(Content as varchar(MAX)) as varbinary(MAX)) Content from KBArticles) k; the indexing just works but takes longer than usual. Is there any performance setting I could use?

  Would it be possible for the attachment mapper to index a plain text file without Base64?

jprante · February 23, 2015, 8:26am

I opened an issue for adding optional base64 encoding on columns:
Base64 encoding · Issue #472 · jprante/elasticsearch-jdbc · GitHub
What is "initial indexing"? What do you mean by "slower"?
Yes, you can change the documented bulk index settings.

Jörg

Jiri_Pik_2 · February 23, 2015, 8:50am

Thank you for opening the issue.

If I index the column as varchar and use the default ES indexing, the entire table is indexed within 5 seconds. If I use the Mapper Attachments, it takes up to 2 minutes. I am not sure whether that is because of the extra work SQL Server is doing or the extra volume the JDBC river has to handle, but I assume it may be because of the way the Mapper Attachments works?
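
As a rough sanity check on the numbers above, the following Python sketch compares the reported slowdown with the size increase that base64 itself introduces. The two timings are taken from the post; the 3000-byte payload is an arbitrary sample chosen only to measure the encoding overhead.

```python
import base64

plain_seconds = 5          # reported: varchar column, default indexing
attachment_seconds = 120   # reported: "up to 2 minutes" with Mapper Attachments

# How much slower the attachment run is than the plain run.
slowdown = attachment_seconds / plain_seconds

# Base64 turns every 3 input bytes into 4 output characters.
payload = b"x" * 3000      # arbitrary sample, length divisible by 3
inflation = len(base64.b64encode(payload)) / len(payload)

print(f"observed slowdown: {slowdown:.0f}x")
print(f"base64 size inflation: {inflation:.2f}x")
```

The extra volume from base64 is about one third, so the larger payload alone seems unlikely to explain a slowdown of this size.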

jprante · February 23, 2015, 9:15am

How big is the entire table you index?

You can use monitoring tools like BigDesk to verify the resources ES is using.

It is close to impossible that base64 encoding alone takes 20x longer while
indexing; maybe the mapper attachment is doing other extra work.

Jörg
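
One way to check the claim about base64 cost is to time the encoding in isolation. This is only a sketch: the helper `time_base64`, the document count, and the document size are assumptions for illustration, and the synthetic HTML does not come from the table discussed here.

```python
import base64
import time


def time_base64(doc_count: int, doc_bytes: int) -> float:
    """Seconds spent base64-encoding doc_count synthetic documents."""
    # Build one synthetic HTML document of exactly doc_bytes bytes.
    snippet = b"<p>lorem ipsum</p>"
    doc = (snippet * (doc_bytes // len(snippet) + 1))[:doc_bytes]

    start = time.perf_counter()
    for _ in range(doc_count):
        base64.b64encode(doc)
    return time.perf_counter() - start


# Assumed workload: 1000 documents of 50,000 bytes each.
elapsed = time_base64(doc_count=1000, doc_bytes=50_000)
print(f"encoded 1000 documents in {elapsed:.3f} s")
```

If the measured time is small compared with the indexing time, the encoding step can be ruled out and attention can move to the query and the attachment processing.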

Jiri_Pik_2 · February 23, 2015, 1:13pm

The table has 3000 rows, and the index stats are as follows:

{
  "index": {
    "primary_size_in_bytes": 296341451,
    "size_in_bytes": 296341451
  },
  "translog": {
    "operations": 0
  },
  "docs": {
    "num_docs": 3000,
    "max_doc": 3000,
    "deleted_docs": 0
  }
}

I believe it is the mapper attachment that is causing this delay.

David – is there any way to speed this up?
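
For scale, the stats above work out to roughly 99 KB of index per document. The short Python sketch below does that arithmetic. Note that this is the on-disk index size, which is not the same thing as the size of the source HTML, so it is only an indication.

```python
# Values copied from the index stats posted above.
stats = {
    "index": {"size_in_bytes": 296341451},
    "docs": {"num_docs": 3000},
}

size_bytes = stats["index"]["size_in_bytes"]
num_docs = stats["docs"]["num_docs"]

# Average on-disk index size per document.
avg_bytes = size_bytes / num_docs

print(f"total index size: {size_bytes / 1024 ** 2:.1f} MiB")
print(f"average per document: {avg_bytes / 1024:.1f} KiB")
```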

jprante · February 23, 2015, 4:12pm

I am not sure, but it looks like the mapper attachment is doing some extra
processing, for example running Tika, which is very expensive. Maybe there
is some configuration option; I did not check.

Jörg
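
If the Tika step does turn out to be the expensive part, one alternative to the attachment mapper is to strip the tags before the text reaches Elasticsearch and index the plain text in its own field, so the highlighter never sees markup. The Python sketch below uses only the standard library; the class `TextExtractor` and the function `strip_html` are hypothetical names, and this is not how the mapper attachment or the JDBC river works internally.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of script and style."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Join chunks with a space and collapse runs of whitespace.
        # Caveat: a word split by inline tags ("<b>H</b>ello") gains a space.
        return " ".join(" ".join(self._chunks).split())


def strip_html(html: str) -> str:
    """Return the plain text of an HTML fragment."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


print(strip_html("<p>Hello <b>world</b></p><script>x()</script>"))
```

This trades the maintenance cost mentioned in the original question for control over exactly what text gets indexed and highlighted.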

On Mon, Feb 23, 2015 at 2:13 PM, Jiri Pik jiri.pik@jiripik.com wrote:

The table has 3000 rows, the index is defined as below

{

{

"index": {

"primary_size_in_bytes": 296341451,

"size_in_bytes": 296341451

},

"translog": {

"operations": 0

},

"docs": {

"num_docs": 3000,

"max_doc": 3000,

"deleted_docs": 0

},},

I believe it’s the mapper attachment who is causing this delay.

David – is there any way to speed this up?


Jiri_Pik_2 · February 23, 2015, 4:19pm

David – can you advise if there is anything which can be done to speed up indexing? Are there any config parameters I could use to tweak the performance?

Joerg – the indexing behaves completely differently. Without the mapper attachment, the index size and the number of indexed documents grow together. With mapper attachments, however, the index size grows while the number of indexed documents stays at 0 until the entire index has been built.
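
One way to avoid the attachment mapper's cost, assuming the only goal is tag-free text for the highlighter as in the original question, is to strip the HTML before the text reaches Elasticsearch. The sketch below uses only the Python standard library; `TextExtractor` and `html_to_text` are hypothetical names chosen for illustration, and this is not something the JDBC river or the mapper attachment does itself.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and skips the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            # Treat every tag as a word boundary so "<p>a</p><p>b</p>" gives "a b".
            # Trade-off: a tag placed inside a word would split that word.
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED:
            if self.skip_depth > 0:
                self.skip_depth -= 1
        elif self.skip_depth == 0:
            self.parts.append(" ")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML fragment with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return " ".join("".join(parser.parts).split())


print(html_to_text("<p>Hello <b>world</b> &amp; more</p>"))
```

The plain text could then be indexed as an ordinary string field next to the original HTML, so highlighting runs on the stripped field while the original markup stays available for display.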

