Hi Mike,
new_ES_config.sh (defines the templates and disables refresh/flush):
curl -XPOST localhost:9200/doc -d '{
  "mappings" : {
    "type" : {
      "_source" : { "enabled" : false },
      "dynamic_templates" : [
        {"t1": {
          "match" : "*_ss",
          "mapping" : {
            "type" : "string",
            "store" : false,
            "norms" : {"enabled" : false}
          }
        }},
        {"t2": {
          "match" : "*_dt",
          "mapping" : {
            "type" : "date",
            "store" : false
          }
        }},
        {"t3": {
          "match" : "*_i",
          "mapping" : {
            "type" : "integer",
            "store" : false
          }
        }}
      ]
    }
  }
}'
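(For reference, any document whose field names end in those suffixes picks up the matching template automatically the first time it is indexed; the field names below are made up for illustration:)
curl -XPOST localhost:9200/doc/type/1 -d '{
  "title_ss" : "hello world",
  "created_dt" : "2014-06-13T14:40:00",
  "count_i" : 42
}'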
curl -XPUT localhost:9200/doc/_settings -d '{
"index.refresh_interval" : "-1"
}'
curl -XPUT localhost:9200/doc/_settings -d '{
"index.translog.disable_flush" : true
}'
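(After the run, both settings can be reverted; a minimal sketch, assuming the stock 1.x defaults of a 1s refresh interval and automatic translog flushing:)
curl -XPUT localhost:9200/doc/_settings -d '{
  "index.refresh_interval" : "1s",
  "index.translog.disable_flush" : false
}'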
new_ES_ingest_threads.pl (spawns 10 threads that ingest the docs via curl, plus one thread that flushes/optimizes periodically):
my $num_args = $#ARGV + 1;
if ($num_args < 1 || $num_args > 2) {
    print "\nusage: $0 [src_dir] [thread_count]\n";
    exit;
}
my $INST_HOME = "/scratch/aime/elasticsearch-1.2.1";
my $pid = qx(jps | sed -e '/Elasticsearch/p' -n | sed 's/ .*//');
chomp($pid);
if ("$pid" eq "") {
    print "Instance is not up\n";
    exit;
}
my $dir = $ARGV[0];
my $td_count = 10;
$td_count = $ARGV[1] if ($num_args == 2);
my $lf = "ingest_$dir.log";    # log file path (name assumed)
open(FH, ">$lf") or die "cannot open $lf: $!";
print FH "source dir: $dir\nthread_count: $td_count\n";
print FH localtime() . "\n";
use threads;
use threads::shared;
my $flush_intv = 10;
my $no :shared = 0;
my $total = 10000;
my $intv = 1000;
my $tstr :shared = "";
my $ltime :shared = time;
sub commit {
    # runs in its own thread: flush/optimize every $flush_intv secs until done
    $SIG{'KILL'} = sub {
        qx(curl -XPOST 'http://localhost:9200/doc/_flush');
        print "forced commit done on " . localtime() . "\n";
        threads->exit();
    };
    while ($no < $total) {
        qx(curl -XPOST 'http://localhost:9200/doc/_flush');
        qx(curl -XPOST 'http://localhost:9200/doc/_optimize');
        print "commit on " . localtime() . "\n";
        sleep($flush_intv);
    }
    qx(curl -XPOST 'http://localhost:9200/doc/_flush');
    print "commit done on " . localtime() . "\n";
}
sub ingest {    # renamed from 'do', which is a Perl keyword and cannot name a sub
    my $c = -1;
    while (1) {
        {
            # grab the next doc id under lock
            lock($no);
            $c = $no;
            $no++;
        }
        last if ($c >= $total);
        qx(curl -XPOST -s localhost:9200/doc/type/$c --data-binary \@$dir/$c.json);
        if (($c + 1) % $intv == 0) {
            # record seconds taken for the last $intv docs
            lock($ltime);
            my $curtime = time;
            $tstr .= ($curtime - $ltime) . " ";
            $ltime = $curtime;
        }
    }
}
# start the monitor processes
my $sarId = qx(sar -A 5 100000 -o sar5sec_$dir.out > /dev/null &\necho \$!);    # \$! so the shell, not Perl, expands the PID
my $jgcId = qx(jstat -gc $pid 2s > jmem_$dir.out &\necho \$!);
my $ct = threads->create(\&commit);
my $start = time;
my @ts = ();
for my $i (1 .. $td_count) {
    my $t = threads->create(\&ingest);
    push(@ts, $t);
}
for my $t (@ts) {
    $t->join();
}
$ct->kill('KILL');
my $fin = time;
qx(kill -9 $sarId\nkill -9 $jgcId);
print FH localtime() . "\n";
print FH "elapsed: " . ($fin - $start) . " secs\n";
print FH "time(secs) for each 1k docs: $tstr\n";
$ct->join();
print FH qx(curl 'http://localhost:9200/doc/type/_count?q=*');
close(FH);
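(The script is then invoked along these lines; the path is illustrative:)
perl new_ES_ingest_threads.pl /scratch/docs 10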
new_Solr_ingest_threads.pl is similar to new_ES_ingest_threads.pl, just with different parameters for the curl commands. Only the differences are posted here:
sub commit {
    while ($no < $total) {
        qx(curl 'http://localhost:8983/solr/collection2/update?commit=true');
        qx(curl 'http://localhost:8983/solr/collection2/update?optimize=true');
        print "commit on " . localtime() . "\n";
        sleep(10);
    }
    qx(curl 'http://localhost:8983/solr/collection2/update?commit=true');
    print "commit done on " . localtime() . "\n";
}
sub ingest {    # renamed from 'do' as in the ES script
    my $c = -1;
    while (1) {
        {
            lock($no);
            $c = $no;
            $no++;
        }
        last if ($c >= $total);
        qx(curl -s 'http://localhost:8983/solr/collection2/update/json' --data-binary \@$dir/$c.json -H 'Content-type:application/json');
        if (($c + 1) % $intv == 0) {
            lock($ltime);
            my $curtime = time;
            $tstr .= ($curtime - $ltime) . " ";
            $ltime = $curtime;
        }
    }
}
B&R
Maco
On Wednesday, June 18, 2014 4:44:35 AM UTC+8, Michael McCandless wrote:
Hi,
Could you post the scripts you linked to (new_ES_config.sh,
new_ES_ingest_threads.pl, new_Solr_ingest_threads.pl) inlined? I can't
download them from where you linked.
Optimizing every 10 seconds or 10 minutes is really not a good idea in
general, but I guess if you're doing the same with ES and Solr then the
comparison is at least "fair".
It's odd you see such a slowdown with ES...
Mike
On Fri, Jun 13, 2014 at 2:40 PM, Cindy Hsin <cindy...@gmail.com> wrote:
Hi, Mark:
We are doing single-document ingestion. We did a performance comparison
between Solr and Elasticsearch (ES).
ES performance degrades dramatically as we increase the number of metadata
fields, whereas Solr performance stays the same.
The benchmark uses a very small data set (i.e. 10k documents; the index
size is only 75 MB) on a high-spec machine with 48 GB of memory.
You can see ES performance drop 50% even though the machine has plenty of
memory, and ES consumes all of the machine's memory once the metadata
fields increase to 100k.
This behavior seems abnormal given how tiny the data is.
We also tried larger data sets (i.e. 100k and 1 Mil documents); ES threw
OOM errors in scenario 2 with the 1 Mil doc set.
We want to know whether this is a bug in ES and/or whether there is a
workaround (config step) we can use to eliminate the performance
degradation.
Currently ES performance does not meet the customer requirement, so we
want to see if there is any way we can bring ES performance to the same
level as Solr.
Below are the configuration settings and benchmark results for the 10k
document set.
scenario 0 means there are 1000 different metadata fields in the system.
scenario 1 means there are 10k different metadata fields in the system.
scenario 2 means there are 100k different metadata fields in the system.
scenario 3 means there are 1M different metadata fields in the system.
- disable hard commit & soft commit + use a client to do a commit (ES
& Solr) every 10 seconds
- ES: flush and refresh are disabled
- Solr: autoSoftCommit is disabled
- monitor load on the system (cpu, memory, etc.) and how the ingestion
speed changes over time (a heap-sampling sketch follows this list)
- monitor the ingestion speed (is there any degradation over time?)
- new ES config: new_ES_config.sh; new ingestion:
new_ES_ingest_threads.pl
- new Solr ingestion: new_Solr_ingest_threads.pl
- flush interval: 10s
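The CPU/heap numbers below come from the sar and jstat monitors started by the script; heap can also be sampled from ES itself via the node-stats API. A minimal sketch, assuming the default host/port:
curl -s 'localhost:9200/_nodes/stats/jvm?pretty' | grep heap_used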
Results (ES vs Solr, by number of different metadata fields):

Scenario 0: 1000 fields
  ES:   12 secs -> 833 docs/sec
        CPU: 30.24%, Heap: 1.08G, iowait: 0.02%, index size: 36M
        time(secs) for each 1k docs: 3 1 1 1 1 1 0 1 2 1
  Solr: 13 secs -> 769 docs/sec
        CPU: 28.85%, Heap: 9.39G
        time(secs) for each 1k docs: 2 1 1 1 1 1 1 1 2 2

Scenario 1: 10k fields
  ES:   29 secs -> 345 docs/sec
        CPU: 40.83%, Heap: 5.74G, iowait: 0.02%, index size: 36M
        time(secs) for each 1k docs: 14 2 2 2 1 2 2 1 2 1
  Solr: 12 secs -> 833 docs/sec
        CPU: 28.62%, Heap: 9.88G
        time(secs) for each 1k docs: 1 1 1 1 2 1 1 1 1 2

Scenario 2: 100k fields
  ES:   17 mins 44 secs -> 9.4 docs/sec
        CPU: 54.73%, Heap: 47.99G, iowait: 0.02%, index size: 75M
        time(secs) for each 1k docs: 97 183 196 147 109 89 87 49 66 40
  Solr: 13 secs -> 769 docs/sec
        CPU: 29.43%, Heap: 9.84G
        time(secs) for each 1k docs: 2 1 1 1 1 1 1 1 2 2

Scenario 3: 1M fields
  ES:   183 mins 8 secs -> 0.9 docs/sec
        CPU: 40.47%, Heap: 47.99G
        time(secs) for each 1k docs: 133 422 701 958 989 1322 1622 1615 1630 1594
  Solr: 15 secs -> 666.7 docs/sec
        CPU: 45.10%, Heap: 9.64G
        time(secs) for each 1k docs: 2 1 1 1 1 2 1 1 3 2
Thanks!
Cindy