David – can you advise if there is anything which can be done to speed up indexing? Are there any config parameters I could use to tweak the performance?
Joerg – the entire indexing looks completely differently. If I do not use the mapper attachment, the size of the index and the number of indexed documents grow at the same time. With mapper attachments, however, the size of the index grows with the number of indexed documents staying at 0 until the entire index is being built.
From: elasticsearch@googlegroups.com [mailto:elasticsearch@googlegroups.com] On Behalf Of joergprante@gmail.com
Sent: Monday, February 23, 2015 5:13 PM
To: elasticsearch@googlegroups.com
Subject: Re: FW: Indexing of HTML Column in an MS SQL Server 2014 database
I am not sure but it looks like mapper attachment is doing some extra processing, for example Tika, which is very expensive. Maybe there is some configuration option, I did not check.
Jörg
On Mon, Feb 23, 2015 at 2:13 PM, Jiri Pik <jiri.pik@jiripik.commailto:jiri.pik@jiripik.com> wrote:
The table has 3000 rows, the index is defined as below
{
{
"index": {
"primary_size_in_bytes": 296341451,
"size_in_bytes": 296341451
},
"translog": {
"operations": 0
},
"docs": {
"num_docs": 3000,
"max_doc": 3000,
"deleted_docs": 0
},},
I believe it’s the mapper attachment who is causing this delay.
David – is there any way to speed this up?
From: elasticsearch@googlegroups.commailto:elasticsearch@googlegroups.com [mailto:elasticsearch@googlegroups.commailto:elasticsearch@googlegroups.com] On Behalf Of joergprante@gmail.commailto:joergprante@gmail.com
Sent: Monday, February 23, 2015 10:15 AM
To: elasticsearch@googlegroups.commailto:elasticsearch@googlegroups.com
Subject: Re: FW: Indexing of HTML Column in an MS SQL Server 2014 database
How big is the entire table you index?
You can use monitor tools like BigDesk to verify the resources ES is using.
It is close to impossible that just base64 encoding takes 20x longer while indexing, maybe mapper attachment is doing other extra work.
Jörg
On Mon, Feb 23, 2015 at 9:50 AM, Jiri Pik <jiri.pik@jiripik.commailto:jiri.pik@jiripik.com> wrote:
Thank you for opening of the issue.
If I indexed the column as varchar and used the default ES indexing, the entire table is indexed within 5 seconds. If I use the Mapper Attachments, it takes up to 2 minutes. I am not sure whether it’s because of the extra work SQL Server is doing, or the extra volume the jdbc is taking care, but I assume it may be because of the way the Mapper Attachments works?
From: elasticsearch@googlegroups.com<mailto:elasticsearch@googlegroups.com> [mailto:elasticsearch@googlegroups.com<mailto:elasticsearch@googlegroups.com>] On Behalf Of joergprante@gmail.com<mailto:joergprante@gmail.com>
Sent: Monday, February 23, 2015 9:26 AM
To: elasticsearch@googlegroups.com<mailto:elasticsearch@googlegroups.com>
Subject: Re: FW: Indexing of HTML Column in an MS SQL Server 2014 database
1. I opened an issue for adding optional base64 encoding on columns: https://github.com/jprante/elasticsearch-river-jdbc/issues/472
2. What is "initial indexing"? What do you mean by "slower"?
3. Yes, you can change the documented bulk index settings.
Jörg
On Mon, Feb 23, 2015 at 6:12 AM, Jiri Pik <jiri.pik@jiripik.com<mailto:jiri.pik@jiripik.com>> wrote:
Apologies for everyone for sending these emails with digital signature which may have caused some issues:
Summary for Joerg:
1. Is there a way for the JDBC river to transform the nvarchar(MAX) into Base64 by itself? I can do on SQL server – see below (1) for David – but it’s substantially slower
2. If not, do you recommend nvarbinary(MAX) or some other MS SQL Server type? And then the SELECT * from XXX would just work?
Summary for David:
1. If I convert the HTML column using select ID, cast(N'' as xml).value ('xs:base64Binary(xs:hexBinary(sql:column("k.Content")))', 'varchar(max)') as Content from (SELECT ID , cast( cast(Content as varchar(MAX )) as varbinary( MAX)) Content from KBArticles) k; the indexing just works but takes longer than usual – is there any performance setting I could use?
2. Would it be possible for the attachment mapper to index pure txt file without base64?
From: Jiri Pik
Sent: Monday, February 23, 2015 6:08 AM
To: elasticsearch@googlegroups.com<mailto:elasticsearch@googlegroups.com>
Subject: RE: Indexing of HTML Column in an MS SQL Server 2014 database
Thank you very much for your kind answer. If I encode the html file into Base64, and use the enclosed script, then all works just fine.
So, Joerg:
1. Is there a way for the JDBC river to transform the nvarchar(MAX) into Base64 by itself?
2. If not, do you recommend nvarbinary(MAX) or some other MS SQL Server type? And then the SELECT * from XXX would just work?
What are your thoughts?
BTW I have been able to convert the nvarchar to base64 using this query
select ID, cast(N'' as xml).value ('xs:base64Binary(xs:hexBinary(sql:column("k.Content")))', 'varchar(max)') as Content from (SELECT ID , cast( cast(Content as varchar(MAX )) as varbinary( MAX)) Content from KBArticles) k;
The usual river and mapper attachment work just fine but the initial indexing takes substantially longer. Why?
3. Is there any performance settings I could tweak?
From: elasticsearch@googlegroups.com<mailto:elasticsearch@googlegroups.com> [mailto:elasticsearch@googlegroups.com] On Behalf Of joergprante@gmail.com<mailto:joergprante@gmail.com>
Sent: Sunday, February 22, 2015 6:12 PM
To: elasticsearch@googlegroups.com<mailto:elasticsearch@googlegroups.com>
Subject: Re: Indexing of HTML Column in an MS SQL Server 2014 database
Can you give some information about the mapper attachment setup you used successfully?
There is no good reason why this should not be possible with JDBC river.
Jörg
On Sun, Feb 22, 2015 at 5:20 PM, Jiri Pik <jiri.pik@googlemail.com<mailto:jiri.pik@googlemail.com>> wrote:
I need to index a HTML column (nvarchar(MAX)) in a MS SQL Server database. I have set up a JDBC river https://github.com/jprante/elasticsearch-river-jdbc and the database is indexed.
Using
"settings":{
"analysis":{
"analyzer":{
"default":{
"type":"custom",
"tokenizer":"standard",
"filter":[ "standard", "lowercase" ],
"char_filter" : ["html_strip"]
}
}
}
}
is good for searching but not for the highlighter as that returns sometimes trimmed unpaired html tags.
I have played with the Mapper Attachments with HTML attachments and then the highlighter works well - all original html tags are gone - but I am unable to get the river push the column directly to the Mapper Attachments.
Questions:
1. what is the best practice for indexing HTML columns? I am aware of the possibility of a manual removal of HTML tags using Agility Pack but do not like that as it's too much extra maintenance.
2. is there any better highlighter for html data which doesn't cut off any original html tags?
3. How to plug in the JDBC river to Mapper Attachments?
4. Any better ideas how to achieve my goals?
Thanks!
--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com<mailto:elasticsearch+unsubscribe@googlegroups.com>.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/f175734b-0889-40a9-96d1-d46702e56666%40googlegroups.com<https://groups.google.com/d/msgid/elasticsearch/f175734b-0889-40a9-96d1-d46702e56666%40googlegroups.com?utm_medium=email&utm_source=footer>.
For more options, visit https://groups.google.com/d/optout.
--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com<mailto:elasticsearch+unsubscribe@googlegroups.com>.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAKdsXoH6Ei%2B23bRKrL0Z7WkQALengfhaZeJRBq5gK1F22yxJfg%40mail.gmail.com<https://groups.google.com/d/msgid/elasticsearch/CAKdsXoH6Ei%2B23bRKrL0Z7WkQALengfhaZeJRBq5gK1F22yxJfg%40mail.gmail.com?utm_medium=email&utm_source=footer>.
For more options, visit https://groups.google.com/d/optout.
--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com<mailto:elasticsearch+unsubscribe@googlegroups.com>.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/a5258a9fb35548b186333e442238331c%40Ex13DAG10-N1.dataoncloud.net<https://groups.google.com/d/msgid/elasticsearch/a5258a9fb35548b186333e442238331c%40Ex13DAG10-N1.dataoncloud.net?utm_medium=email&utm_source=footer>.
For more options, visit https://groups.google.com/d/optout.
--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com<mailto:elasticsearch+unsubscribe@googlegroups.com>.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAKdsXoFaJKN9Q5Rsu8XqLpEWafyPK_YBA7rGvMX7R-9T4Odiuw%40mail.gmail.com<https://groups.google.com/d/msgid/elasticsearch/CAKdsXoFaJKN9Q5Rsu8XqLpEWafyPK_YBA7rGvMX7R-9T4Odiuw%40mail.gmail.com?utm_medium=email&utm_source=footer>.
For more options, visit https://groups.google.com/d/optout.
--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com<mailto:elasticsearch+unsubscribe@googlegroups.com>.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/a9c9114b28384485b3f4d6290d5a2da0%40Ex13DAG10-N1.dataoncloud.net<https://groups.google.com/d/msgid/elasticsearch/a9c9114b28384485b3f4d6290d5a2da0%40Ex13DAG10-N1.dataoncloud.net?utm_medium=email&utm_source=footer>.
For more options, visit https://groups.google.com/d/optout.
--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.commailto:elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAKdsXoHw3oba_%3DAGAnYofoeHY%3Dx5JDwdSPmRhEcPdmMkHUEQwQ%40mail.gmail.comhttps://groups.google.com/d/msgid/elasticsearch/CAKdsXoHw3oba_%3DAGAnYofoeHY%3Dx5JDwdSPmRhEcPdmMkHUEQwQ%40mail.gmail.com?utm_medium=email&utm_source=footer.
For more options, visit https://groups.google.com/d/optout.
--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.commailto:elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/eb5ebcabc33a4e82a726a936733fdd28%40Ex13DAG10-N1.dataoncloud.nethttps://groups.google.com/d/msgid/elasticsearch/eb5ebcabc33a4e82a726a936733fdd28%40Ex13DAG10-N1.dataoncloud.net?utm_medium=email&utm_source=footer.
For more options, visit https://groups.google.com/d/optout.
--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.commailto:elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAKdsXoGMyR-nMK%3Dymrec%2B3KF1nySUNJw7%2BSR0KcxHsxq7CrKdQ%40mail.gmail.comhttps://groups.google.com/d/msgid/elasticsearch/CAKdsXoGMyR-nMK%3Dymrec%2B3KF1nySUNJw7%2BSR0KcxHsxq7CrKdQ%40mail.gmail.com?utm_medium=email&utm_source=footer.
For more options, visit https://groups.google.com/d/optout.
--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/33a3cd982b1743e49520143e773600db%40Ex13DAG10-N1.dataoncloud.net.
For more options, visit https://groups.google.com/d/optout.