Scrolling performance

I've been trying to dump an index so I can reindex it with different
sharding settings, but it turns out to be almost impossible with a large
dataset (50m documents, 7gb in my case).

I scroll through the index in batches of 5000 items per iteration, and each
batch takes longer and longer to fetch. Like this:

first request: 5000
fetched: 10000 , time ms: 150
fetched: 15000 , time ms: 154
fetched: 20000 , time ms: 134
fetched: 25000 , time ms: 145
fetched: 30000 , time ms: 146
fetched: 35000 , time ms: 177
fetched: 40000 , time ms: 223
fetched: 45000 , time ms: 199
fetched: 50000 , time ms: 204
fetched: 55000 , time ms: 216
fetched: 60000 , time ms: 223
fetched: 65000 , time ms: 229
fetched: 70000 , time ms: 189
fetched: 75000 , time ms: 242
fetched: 80000 , time ms: 198
fetched: 85000 , time ms: 264
fetched: 90000 , time ms: 260
fetched: 95000 , time ms: 217
fetched: 100000 , time ms: 285
fetched: 105000 , time ms: 289
fetched: 110000 , time ms: 334
fetched: 115000 , time ms: 299
fetched: 120000 , time ms: 311
fetched: 125000 , time ms: 253
fetched: 130000 , time ms: 345
fetched: 135000 , time ms: 337
fetched: 140000 , time ms: 345
fetched: 145000 , time ms: 352
fetched: 150000 , time ms: 371
fetched: 155000 , time ms: 416
fetched: 160000 , time ms: 398
fetched: 165000 , time ms: 452
fetched: 170000 , time ms: 405
fetched: 175000 , time ms: 404
fetched: 180000 , time ms: 421
fetched: 185000 , time ms: 426
fetched: 190000 , time ms: 474
fetched: 195000 , time ms: 446
fetched: 200000 , time ms: 360
fetched: 205000 , time ms: 457
fetched: 210000 , time ms: 470
fetched: 215000 , time ms: 477
fetched: 220000 , time ms: 522
fetched: 225000 , time ms: 397
fetched: 230000 , time ms: 508
fetched: 235000 , time ms: 532
fetched: 240000 , time ms: 523
fetched: 245000 , time ms: 567
fetched: 250000 , time ms: 570
fetched: 255000 , time ms: 552
fetched: 260000 , time ms: 562
fetched: 265000 , time ms: 593
fetched: 270000 , time ms: 610
fetched: 275000 , time ms: 585
fetched: 280000 , time ms: 529
fetched: 285000 , time ms: 604
fetched: 290000 , time ms: 614
fetched: 295000 , time ms: 663
fetched: 300000 , time ms: 634
fetched: 305000 , time ms: 646
fetched: 310000 , time ms: 689
fetched: 315000 , time ms: 580
fetched: 320000 , time ms: 674
fetched: 325000 , time ms: 682
fetched: 330000 , time ms: 694
fetched: 335000 , time ms: 734
fetched: 340000 , time ms: 717
fetched: 345000 , time ms: 725
fetched: 350000 , time ms: 726
fetched: 355000 , time ms: 700
fetched: 360000 , time ms: 701
fetched: 365000 , time ms: 610
fetched: 370000 , time ms: 767
fetched: 375000 , time ms: 810
fetched: 380000 , time ms: 784
fetched: 385000 , time ms: 801
fetched: 390000 , time ms: 851
fetched: 395000 , time ms: 805
fetched: 400000 , time ms: 655
fetched: 405000 , time ms: 869
fetched: 410000 , time ms: 838
fetched: 415000 , time ms: 1119
fetched: 420000 , time ms: 1167
fetched: 425000 , time ms: 1770
fetched: 430000 , time ms: 1727
fetched: 435000 , time ms: 885
fetched: 440000 , time ms: 898
fetched: 445000 , time ms: 905
fetched: 450000 , time ms: 731
fetched: 455000 , time ms: 795
fetched: 460000 , time ms: 956
fetched: 465000 , time ms: 918
fetched: 470000 , time ms: 905
fetched: 475000 , time ms: 972
fetched: 480000 , time ms: 988
fetched: 485000 , time ms: 887
fetched: 490000 , time ms: 991
fetched: 495000 , time ms: 1047
fetched: 500000 , time ms: 812
fetched: 505000 , time ms: 1066
fetched: 510000 , time ms: 1091
fetched: 515000 , time ms: 1046
fetched: 520000 , time ms: 1042
fetched: 525000 , time ms: 1247
fetched: 530000 , time ms: 1060
fetched: 535000 , time ms: 1048
fetched: 540000 , time ms: 1099
fetched: 545000 , time ms: 1093
fetched: 550000 , time ms: 1157
fetched: 555000 , time ms: 1114
fetched: 560000 , time ms: 1085
fetched: 565000 , time ms: 1129
fetched: 570000 , time ms: 1193
fetched: 575000 , time ms: 1149
fetched: 580000 , time ms: 1175
fetched: 585000 , time ms: 1262
fetched: 590000 , time ms: 1180
fetched: 595000 , time ms: 1228
fetched: 600000 , time ms: 1199
fetched: 605000 , time ms: 1314
fetched: 610000 , time ms: 1179
fetched: 615000 , time ms: 1290
fetched: 620000 , time ms: 994
fetched: 625000 , time ms: 1259
fetched: 630000 , time ms: 1417
fetched: 635000 , time ms: 1282
fetched: 640000 , time ms: 1383
fetched: 645000 , time ms: 1297
fetched: 650000 , time ms: 1441
fetched: 655000 , time ms: 1452
fetched: 660000 , time ms: 1398
fetched: 665000 , time ms: 1550
fetched: 670000 , time ms: 1642
fetched: 675000 , time ms: 1489
fetched: 680000 , time ms: 1361
fetched: 685000 , time ms: 1495
fetched: 690000 , time ms: 1376
fetched: 695000 , time ms: 1230
fetched: 700000 , time ms: 1492
fetched: 705000 , time ms: 1509
fetched: 710000 , time ms: 1425
fetched: 715000 , time ms: 1549
fetched: 720000 , time ms: 1572
fetched: 725000 , time ms: 1526
fetched: 730000 , time ms: 1587
fetched: 735000 , time ms: 1484
fetched: 740000 , time ms: 1652
fetched: 745000 , time ms: 1508
fetched: 750000 , time ms: 1630
fetched: 755000 , time ms: 1242
fetched: 760000 , time ms: 1621
fetched: 765000 , time ms: 1619
fetched: 770000 , time ms: 1542
fetched: 775000 , time ms: 1727
fetched: 780000 , time ms: 1252
fetched: 785000 , time ms: 1498
fetched: 790000 , time ms: 1766
fetched: 795000 , time ms: 1692
fetched: 800000 , time ms: 1589
fetched: 805000 , time ms: 1382
fetched: 810000 , time ms: 1404
fetched: 815000 , time ms: 1633
fetched: 820000 , time ms: 1788
fetched: 825000 , time ms: 1734
fetched: 830000 , time ms: 1799
fetched: 835000 , time ms: 1696
fetched: 840000 , time ms: 1763
fetched: 845000 , time ms: 1697
fetched: 850000 , time ms: 1915
fetched: 855000 , time ms: 1709
fetched: 860000 , time ms: 1531
fetched: 865000 , time ms: 1811
fetched: 870000 , time ms: 1849
fetched: 875000 , time ms: 2028
fetched: 880000 , time ms: 1858
fetched: 885000 , time ms: 1780
fetched: 890000 , time ms: 1762
fetched: 895000 , time ms: 1586
fetched: 900000 , time ms: 1803
fetched: 905000 , time ms: 1791
fetched: 910000 , time ms: 1821
fetched: 915000 , time ms: 1624
fetched: 920000 , time ms: 1709
fetched: 925000 , time ms: 1765
fetched: 930000 , time ms: 1892
fetched: 935000 , time ms: 3646
fetched: 940000 , time ms: 2576
fetched: 945000 , time ms: 2070
fetched: 950000 , time ms: 2401
fetched: 955000 , time ms: 1998
fetched: 960000 , time ms: 1924
fetched: 965000 , time ms: 2113
fetched: 970000 , time ms: 2037
fetched: 975000 , time ms: 1804
fetched: 980000 , time ms: 2143
fetched: 985000 , time ms: 1776
fetched: 990000 , time ms: 2092
fetched: 995000 , time ms: 2232
fetched: 1000000 , time ms: 1986

With batches of 20000 items:

fetched: 40000, time ms:275
fetched: 60000, time ms:296
fetched: 80000, time ms:297
fetched: 100000, time ms:358
fetched: 120000, time ms:386
fetched: 140000, time ms:413
fetched: 160000, time ms:443
fetched: 180000, time ms:420
fetched: 200000, time ms:548
fetched: 220000, time ms:487
fetched: 240000, time ms:557
fetched: 260000, time ms:609
fetched: 280000, time ms:686
fetched: 300000, time ms:655
fetched: 320000, time ms:697
fetched: 340000, time ms:717
fetched: 360000, time ms:675
fetched: 380000, time ms:787
fetched: 400000, time ms:808
fetched: 420000, time ms:916
fetched: 440000, time ms:878
fetched: 460000, time ms:940
fetched: 480000, time ms:1035
fetched: 500000, time ms:975
fetched: 520000, time ms:1000
fetched: 540000, time ms:1116
fetched: 560000, time ms:1074
fetched: 580000, time ms:990
fetched: 600000, time ms:1151
fetched: 620000, time ms:1178
fetched: 640000, time ms:1232
fetched: 660000, time ms:1255
fetched: 680000, time ms:1275
fetched: 700000, time ms:1419
fetched: 720000, time ms:1396
fetched: 740000, time ms:1528
fetched: 760000, time ms:1424
fetched: 780000, time ms:1449
fetched: 800000, time ms:1486
fetched: 820000, time ms:1346
fetched: 840000, time ms:1677
fetched: 860000, time ms:1574
fetched: 880000, time ms:1447
fetched: 900000, time ms:1464
fetched: 920000, time ms:1641
fetched: 940000, time ms:1537
fetched: 960000, time ms:1920
fetched: 980000, time ms:1763
fetched: 1000000, time ms:1900

After a restart, scrolling is fast at the beginning again. I am using
elasticsearch 0.90.5.

Is this expected behaviour or a known bug?
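
(The dump program itself is only linked further down the thread; purely for orientation, here is a minimal sketch of the kind of scan/scroll loop being timed, assuming a local 0.90.x node, a hypothetical index named "events" and a match_all query - the real code may differ.)

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// The handful of response fields the dump loop needs.
type scrollResp struct {
	ScrollID string `json:"_scroll_id"`
	Took     int    `json:"took"`
	Hits     struct {
		Total int               `json:"total"`
		Hits  []json.RawMessage `json:"hits"`
	} `json:"hits"`
}

func decode(resp *http.Response) (r scrollResp, err error) {
	defer resp.Body.Close()
	err = json.NewDecoder(resp.Body).Decode(&r)
	return r, err
}

func main() {
	es := "http://localhost:9200" // hypothetical node address
	index := "events"             // hypothetical index name

	// Open a scan: the first response carries only a scroll id, no hits.
	// Note that with search_type=scan, size is applied per shard.
	q := bytes.NewBufferString(`{"query":{"match_all":{}}}`)
	resp, err := http.Post(es+"/"+index+"/_search?search_type=scan&scroll=30m&size=5000", "application/json", q)
	if err != nil {
		panic(err)
	}
	first, err := decode(resp)
	if err != nil {
		panic(err)
	}
	fmt.Println("first request:", 5000)

	fetched := 0
	scrollID := first.ScrollID
	for {
		// Ask for the next batch; the scroll id is sent as the raw request body.
		resp, err := http.Post(es+"/_search/scroll?scroll=30m", "text/plain", bytes.NewBufferString(scrollID))
		if err != nil {
			panic(err)
		}
		r, err := decode(resp)
		if err != nil {
			panic(err)
		}
		if len(r.Hits.Hits) == 0 {
			break // scroll exhausted
		}
		scrollID = r.ScrollID // continue from the scroll id of the latest response
		fetched += len(r.Hits.Hits)
		// Print the server-side "took", which is what the numbers above show.
		fmt.Println("fetched:", fetched, ", time ms:", r.Took)
	}
}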

I consider ES to be an index over some data, rather than the data itself.
Why not just reindex from the source? I do this for over 1m docs on every deployment, and it is very fast: it takes about 20 minutes on a single node, using the CouchDB river.

James

These 1m items give an index size of about 4gb, for comparison.

20 minutes isn't fast, and it's certainly not "very fast". We deploy up to
50 times per day, so 20 minutes per deployment sounds very slow to me.

I also cannot reindex from a source, because elasticsearch is the source and
the only storage for this data: it is statistical data, events that happened
in the past.

On 26 November 2013 02:59, James Richardson james.time4tea@gmail.com wrote:

These 1m items give an index size of about 4gb, for comparison.

--
Regards, Ian Babrou
http://bobrik.name http://twitter.com/ibobrik skype:i.babrou

About injection speed:

Yeah, it depends on the read rate, document size...
For my demos, I inject 1m docs in less than 1 minute.

But that's directly from a Java program which generates random documents (15 fields).

--
David :wink:
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs
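
(David's Java generator isn't shown in the thread; purely as an illustration of what pushing documents through the _bulk endpoint can look like, here is a rough Go sketch with a hypothetical local node, index name and document fields - not his actual code.)

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"math/rand"
	"net/http"
)

func main() {
	es := "http://localhost:9200" // hypothetical node address
	batch := 5000                 // documents per bulk request

	var buf bytes.Buffer
	for i := 0; i < batch; i++ {
		// Action line; index and type are taken from the URL below.
		buf.WriteString(`{"index":{}}` + "\n")
		// A small random document, standing in for a real generator.
		doc, _ := json.Marshal(map[string]interface{}{
			"user":  fmt.Sprintf("user-%d", rand.Intn(100000)),
			"value": rand.Float64(),
		})
		buf.Write(doc)
		buf.WriteByte('\n')
	}

	// One round trip indexes the whole batch; repeat until everything is in.
	resp, err := http.Post(es+"/demo/doc/_bulk", "application/json", &buf)
	if err != nil {
		panic(err)
	}
	resp.Body.Close()
	fmt.Println("bulk status:", resp.Status)
}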

It would help to get an impression of the source code you are using.

Jörg

Is it possible for you to use a scan (see
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-search-type.html,
but scroll down the page) with your scroll? It's recommended if you can
live with point-in-time results - you'll miss docs added to or removed from
the index - and a reduced search feature set - for example, no sorting.
In return, you get approximately the same low latency for every fetch as you
iterate over the result set.

Or are you experiencing this performance when using scan already?

I'm using scan already. Is there any other way when there's no unique
sortable id in every document?

My code looks like this: http://pastie.org/8510769 (go version)

On 26 November 2013 23:32, Oli McCormack oli@climate.com wrote:

Or are you experiencing this performance when using scan already?

--
Regards, Ian Babrou
http://bobrik.name http://twitter.com/ibobrik skype:i.babrou

Why do you want to keep search results for 30 minutes on each node?

This is the cause of your program getting slower. Over 30 minutes, you can
easily thrash the system memory of the ES nodes, so they are bound to get
slower and slower.

Just use a scan/scroll retain time of a few seconds or so - just as long as
you need to dump the result - but never use minutes (unless you have a good
reason).

Where did you get this example from?

Jörg

Okay, I changed 30m to 5s. Memory usage was within bounds, I checked it.
Your explanation does not account for why restarting the scroll makes it
fast again - I don't think ES could guess that the previous scroll is over.

I didn't get this example from anywhere, I wrote it myself. The first
version was in PHP, but it was very slow, so I decided to practice golang
and wrote this one. It is as slow as the previous one.

Btw, results for 5s (no surprise here):

first request: 5000
fetched: 10000 , time ms: 190
fetched: 15000 , time ms: 255
fetched: 20000 , time ms: 197
fetched: 25000 , time ms: 282
fetched: 30000 , time ms: 211
fetched: 35000 , time ms: 213
fetched: 40000 , time ms: 219
fetched: 45000 , time ms: 223
fetched: 50000 , time ms: 231
fetched: 55000 , time ms: 235
fetched: 60000 , time ms: 336
fetched: 65000 , time ms: 344
fetched: 70000 , time ms: 347
fetched: 75000 , time ms: 256
fetched: 80000 , time ms: 357
fetched: 85000 , time ms: 369
fetched: 90000 , time ms: 289
fetched: 95000 , time ms: 305
fetched: 100000 , time ms: 306
fetched: 105000 , time ms: 402
fetched: 110000 , time ms: 296
fetched: 115000 , time ms: 342
fetched: 120000 , time ms: 432
fetched: 125000 , time ms: 430
fetched: 130000 , time ms: 324
fetched: 135000 , time ms: 460
fetched: 140000 , time ms: 352
fetched: 145000 , time ms: 464
fetched: 150000 , time ms: 508
fetched: 155000 , time ms: 523
fetched: 160000 , time ms: 359
fetched: 165000 , time ms: 366
fetched: 170000 , time ms: 409
fetched: 175000 , time ms: 375
fetched: 180000 , time ms: 536
fetched: 185000 , time ms: 392
fetched: 190000 , time ms: 418
fetched: 195000 , time ms: 561
fetched: 200000 , time ms: 418
fetched: 205000 , time ms: 569
fetched: 210000 , time ms: 589
fetched: 215000 , time ms: 460
fetched: 220000 , time ms: 591
fetched: 225000 , time ms: 454
fetched: 230000 , time ms: 486
fetched: 235000 , time ms: 453
fetched: 240000 , time ms: 633
fetched: 245000 , time ms: 656
fetched: 250000 , time ms: 656
fetched: 255000 , time ms: 664
fetched: 260000 , time ms: 488
fetched: 265000 , time ms: 693
fetched: 270000 , time ms: 530
fetched: 275000 , time ms: 510
fetched: 280000 , time ms: 715
fetched: 285000 , time ms: 526
fetched: 290000 , time ms: 578
fetched: 295000 , time ms: 539
fetched: 300000 , time ms: 745
fetched: 305000 , time ms: 757
fetched: 310000 , time ms: 582
fetched: 315000 , time ms: 767
fetched: 320000 , time ms: 628
fetched: 325000 , time ms: 810
fetched: 330000 , time ms: 599
fetched: 335000 , time ms: 615
fetched: 340000 , time ms: 815
fetched: 345000 , time ms: 599
fetched: 350000 , time ms: 660
fetched: 355000 , time ms: 612
fetched: 360000 , time ms: 661
fetched: 365000 , time ms: 894
fetched: 370000 , time ms: 1058
fetched: 375000 , time ms: 1112
fetched: 380000 , time ms: 1128
fetched: 385000 , time ms: 1127
fetched: 390000 , time ms: 1153
fetched: 395000 , time ms: 981
fetched: 400000 , time ms: 912
fetched: 405000 , time ms: 958
fetched: 410000 , time ms: 969
fetched: 415000 , time ms: 982
fetched: 420000 , time ms: 727
fetched: 425000 , time ms: 974
fetched: 430000 , time ms: 734
fetched: 435000 , time ms: 843
fetched: 440000 , time ms: 730
fetched: 445000 , time ms: 1051
fetched: 450000 , time ms: 778
fetched: 455000 , time ms: 773
fetched: 460000 , time ms: 849
fetched: 465000 , time ms: 1088
fetched: 470000 , time ms: 875
fetched: 475000 , time ms: 807
fetched: 480000 , time ms: 832
fetched: 485000 , time ms: 1120
fetched: 490000 , time ms: 813
fetched: 495000 , time ms: 1154
fetched: 500000 , time ms: 806
fetched: 505000 , time ms: 930
fetched: 510000 , time ms: 861
fetched: 515000 , time ms: 1144
fetched: 520000 , time ms: 885
fetched: 525000 , time ms: 1197
fetched: 530000 , time ms: 1205
fetched: 535000 , time ms: 890
fetched: 540000 , time ms: 1199
fetched: 545000 , time ms: 1265
fetched: 550000 , time ms: 983
fetched: 555000 , time ms: 1184
fetched: 560000 , time ms: 954
fetched: 565000 , time ms: 976
fetched: 570000 , time ms: 1363
fetched: 575000 , time ms: 1310
fetched: 580000 , time ms: 943
fetched: 585000 , time ms: 1283
fetched: 590000 , time ms: 1401
fetched: 595000 , time ms: 985
fetched: 600000 , time ms: 973
fetched: 605000 , time ms: 993
fetched: 610000 , time ms: 1054
fetched: 615000 , time ms: 1004
fetched: 620000 , time ms: 994
fetched: 625000 , time ms: 1021
fetched: 630000 , time ms: 1406
fetched: 635000 , time ms: 1044
fetched: 640000 , time ms: 1051
fetched: 645000 , time ms: 1123
fetched: 650000 , time ms: 1080
fetched: 655000 , time ms: 1070
fetched: 660000 , time ms: 1037
fetched: 665000 , time ms: 1191
fetched: 670000 , time ms: 1466
fetched: 675000 , time ms: 1167

And after an immediate ctrl+c and a restart of the dump:

first request: 5000
fetched: 10000 , time ms: 229
fetched: 15000 , time ms: 221
fetched: 20000 , time ms: 225
fetched: 25000 , time ms: 232
fetched: 30000 , time ms: 304
fetched: 35000 , time ms: 331
fetched: 40000 , time ms: 248
fetched: 45000 , time ms: 255
fetched: 50000 , time ms: 264
fetched: 55000 , time ms: 314
fetched: 60000 , time ms: 329
fetched: 65000 , time ms: 304
fetched: 70000 , time ms: 296
fetched: 75000 , time ms: 286
fetched: 80000 , time ms: 409
fetched: 85000 , time ms: 411
fetched: 90000 , time ms: 304
fetched: 95000 , time ms: 427
fetched: 100000 , time ms: 310
fetched: 105000 , time ms: 315
fetched: 110000 , time ms: 406

--
Regards, Ian Babrou
http://bobrik.name http://twitter.com/ibobrik skype:i.babrou

Are all your documents the same size?

And can you confirm that the network interface bandwidth is not exhausted?

Jörg

Exhausting a gigabit network is hard enough, and we have small documents. If
you look at the code, you may notice that I print the timing reported by ES
itself, so even a 56k modem to the client would make no difference.

--
Regards, Ian Babrou
http://bobrik.name http://twitter.com/ibobrik skype:i.babrou

On my test 0.90.5 ES node with a modified knapsack plugin, I observe
constant "took" numbers in scan/scroll.

See gist: https://gist.github.com/jprante/7668933

Each line fetches 1000 docs with an average doc size of ~1k.

Jörg

How many items do you have in your index? Which java version do you use?

--
Regards, Ian Babrou
http://bobrik.name http://twitter.com/ibobrik skype:i.babrou

I executed a scan/scroll over 60 million docs; the size of the indices (the
'data' folder) is 87G.

java version "1.7.0_25"
Java(TM) SE Runtime Environment (build 1.7.0_25-b15)
Java HotSpot(TM) 64-Bit Server VM (build 23.25-b01, mixed mode)

Heap is 2G

Red Hat Enterprise Linux Server release 6.4 (Santiago)

Jörg

Hi Jörg,

It looks like you know something about scan/scroll I haven't found
documented elsewhere -- how to scan 60 million docs with 1000 documents per
fetch, at constant time per fetch. Other comments I've seen indicate that
the deeper you get into the results, the slower each fetch gets. I'm looking
at alternatives for implementing a feature which will require scan/scroll on
a similar scale, and knowing that what you've done is possible is critical
for my planning. Could you please share the key parts of your setup/retrieve
code, in addition to the configuration / version information you've already
shared?

Thanks in advance,

-Mark

Mark, my scrolling performance is pretty constant now too. My problem was
actually incorrect code. You can check out the corrected version here:
https://github.com/bobrik/esreindexer/blob/master/reindexer.go

--
Regards, Ian Babrou
http://bobrik.name http://twitter.com/ibobrik skype:i.babrou

Hi Jörg, Thanks for sharing!

What kind of elapsed time (in milliseconds?), roughly, does a 1000-document
fetch take for you? Can you tell me the key factors you think affect this,
e.g. how many shards, etc.?

-Mark

The absolute time taken depends on the cluster resources, of course. On my
laptop, for 1000 docs of ~1k size on average, the scroll response 'took'
field usually shows ~200-500ms. It takes additional time to process the
response hits.

I am not sure if the number of shards is relevant. There are more important
factors: the number of shards per node, shard size, buffers and heap memory,
network compression, network speed, node workload...

If you are interested in a Java scan/scroll example, you can peek into the
knapsack plugin source

https://github.com/jprante/elasticsearch-knapsack/blob/master/src/main/java/org/xbib/elasticsearch/action/RestExportAction.java#L310

Critical for a scalable scan/scroll is a reasonable timeout. In the
knapsack plugin, I use a default of 30 seconds.

In the ES docs, a timeout of 10 minutes is used

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-search-type.html

which seems not very helpful, as this will pressure your heap in almost all
cases of long-lasting scan/scroll...

Jörg
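
(One detail worth spelling out alongside the advice above: the keep-alive is passed again on every scroll call and the context is renewed each time, so it only needs to outlive the handling of a single batch, not the whole export. A minimal sketch of just that call, with illustrative values and a placeholder scroll id:)

package main

import (
	"bytes"
	"net/http"
)

// nextBatch asks for the next scroll batch and, as a side effect, renews the
// scroll context for another keepAlive. The keep-alive therefore only has to
// exceed the time spent processing one batch between calls - not the duration
// of the whole export.
func nextBatch(es, scrollID, keepAlive string) (*http.Response, error) {
	return http.Post(es+"/_search/scroll?scroll="+keepAlive, "text/plain",
		bytes.NewBufferString(scrollID))
}

func main() {
	// "30s" in the spirit of the knapsack default, rather than the "10m"
	// used in the docs example criticised above; the scroll id is a placeholder.
	nextBatch("http://localhost:9200", "SCROLL_ID_FROM_LAST_RESPONSE", "30s")
}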
