ElasticSearch and GeoIpSearch


(Ted Karmel) #1

Oh... hey there ElasticSearchers :slight_smile:

Ok, so I'm thinking about using ES for looking up geo info about IP
addresses. And I'm just wondering if someone has had some experience
they would like to share.

Or, if you have some insights on how this would be achieved, that
would be cool too.

The way I see it, there are three broad dimensions:

  1. Converting CSV to JSON
  • most of the GeoIP databases provide the data in csv
  • how to bulk convert
  1. Massive amount of documents
  • some of these databases can contain 6Million+ files/documents
  • include all in one index or spread across several indices for
    better performance
  1. Query Formulation
  • many of the IP addresses in these documents are expressed in ranges
  • e.g. 182.242.522.100/99
  • how to formulate query/filter so that it interprets this as a range
    and returns true if IP address within that range

Suggestions on any of these aspects are all welcome.

Cheers,
Ted


(Clinton Gormley) #2

hiya ted

The way I see it, there are three broad dimensions:

  1. Converting CSV to JSON
  • most of the GeoIP databases provide the data in csv

My only experience with mapping IPs to locations is with MaxMind's
GeoIP, but they provide a very compact and fast index for you - I'm not
sure what benefit there would be in putting that into ElasticSearch
( but I may not have understood your intent)

  • how to bulk convert

here's a quick Perl script which will index your CSV docs:

#!/usr/bin/perl

use strict;
use warnings;

use Text::CSV_XS();
use ElasticSearch();
use Data::Dumper;

my $Index     = 'my_index';
my $Type      = 'my_type';
my @Col_Names = qw(id name ip_addr foo bar);

my $es = ElasticSearch->new( servers => '127.0.0.1:9200' );

my $csv = Text_CSV::XS->new( { binary => 1 } );
$csv->column_names(@Col_Names);

for my $csv_file (@ARGV) {
    print "Processing file '$csv_file'\n";
    open my $csv_fh, '<:encoding(utf8)', $csv_file
        or die "Couldn't open $csv_file: $!";

    my @rows;
    my $i = 0;

    while ( my $row = $csv->getline_hr($csv_fh) ) {
        push @rows, {
            index => {
                index => $Index,
                type  => $Type,
                id    => $row->{id},    # assuming there is an ID
                data  => $row
            }
        };
        if ( $i++ == 1000 ) {
            index_rows( \@rows );
            @rows = ();
        }
    }
    index_rows( \@_ );
}

sub index_rows {
    my $rows   = shift;
    my $result = $es->bulk($rows);
    die "Error while indexing: " . $result->{errors}
        if $result->{errors};
}

clint


(Ted Karmel) #3

hey clint,

thanks for the reply. and the perl script.

funny you should mention maxmind. i've given their premium database a
try and they are quite inaccurate for many addresses. there are many
more databases that are much more precise and more comprehensive. so
the performance advantage provided from their C index does not really
compensate.

and my intent with using elasticsearch for geoip is to subsequently be
able to conduct searches with the geo filter.

this setup may suffer slightly from speed due to the required http
calls but from my standpoint is worth it. i'm just looking for
reasonably acceptable latency.

On Tue, Oct 26, 2010 at 5:35 PM, Clinton Gormley
clinton@iannounce.co.uk wrote:

hiya ted

The way I see it, there are three broad dimensions:

  1. Converting CSV to JSON
  • most of the GeoIP databases provide the data in csv

My only experience with mapping IPs to locations is with MaxMind's
GeoIP, but they provide a very compact and fast index for you - I'm not
sure what benefit there would be in putting that into ElasticSearch
( but I may not have understood your intent)

  • how to bulk convert

here's a quick Perl script which will index your CSV docs:

#!/usr/bin/perl

use strict;
use warnings;

use Text::CSV_XS();
use ElasticSearch();
use Data::Dumper;

my $Index = 'my_index';
my $Type = 'my_type';
my @Col_Names = qw(id name ip_addr foo bar);

my $es = ElasticSearch->new( servers => '127.0.0.1:9200' );

my $csv = Text_CSV::XS->new( { binary => 1 } );
$csv->column_names(@Col_Names);

for my $csv_file (@ARGV) {
print "Processing file '$csv_file'\n";
open my $csv_fh, '<:encoding(utf8)', $csv_file
or die "Couldn't open $csv_file: $!";

   my @rows;
   my $i = 0;

   while ( my $row = $csv->getline_hr($csv_fh) ) {
       push @rows, {
           index => {
               index => $Index,
               type  => $Type,
               id    => $row->{id},    # assuming there is an ID
               data  => $row
           }
       };
       if ( $i++ == 1000 ) {
           index_rows( \@rows );
           @rows = ();
       }
   }
   index_rows( \@_ );

}

sub index_rows {
my $rows = shift;
my $result = $es->bulk($rows);
die "Error while indexing: " . $result->{errors}
if $result->{errors};
}

clint


(Shay Banon) #4

Hi Ted,

The most interesting part here is actually what the query scenario is. Are
you going to get an IP address, and then try and find information related to
it, while what you index/have are actually ranges? Thats the "reverse" way
of querying and might require some tricks... . Also, when they do provide
ranges, is it only on the last byte (for ipv4)?

-shay.banon

On Tue, Oct 26, 2010 at 7:03 PM, Ted Karmel ted.karmel@gmail.com wrote:

hey clint,

thanks for the reply. and the perl script.

funny you should mention maxmind. i've given their premium database a
try and they are quite inaccurate for many addresses. there are many
more databases that are much more precise and more comprehensive. so
the performance advantage provided from their C index does not really
compensate.

and my intent with using elasticsearch for geoip is to subsequently be
able to conduct searches with the geo filter.

this setup may suffer slightly from speed due to the required http
calls but from my standpoint is worth it. i'm just looking for
reasonably acceptable latency.

On Tue, Oct 26, 2010 at 5:35 PM, Clinton Gormley
clinton@iannounce.co.uk wrote:

hiya ted

The way I see it, there are three broad dimensions:

  1. Converting CSV to JSON
  • most of the GeoIP databases provide the data in csv

My only experience with mapping IPs to locations is with MaxMind's
GeoIP, but they provide a very compact and fast index for you - I'm not
sure what benefit there would be in putting that into ElasticSearch
( but I may not have understood your intent)

  • how to bulk convert

here's a quick Perl script which will index your CSV docs:

#!/usr/bin/perl

use strict;
use warnings;

use Text::CSV_XS();
use ElasticSearch();
use Data::Dumper;

my $Index = 'my_index';
my $Type = 'my_type';
my @Col_Names = qw(id name ip_addr foo bar);

my $es = ElasticSearch->new( servers => '127.0.0.1:9200' );

my $csv = Text_CSV::XS->new( { binary => 1 } );
$csv->column_names(@Col_Names);

for my $csv_file (@ARGV) {
print "Processing file '$csv_file'\n";
open my $csv_fh, '<:encoding(utf8)', $csv_file
or die "Couldn't open $csv_file: $!";

   my @rows;
   my $i = 0;

   while ( my $row = $csv->getline_hr($csv_fh) ) {
       push @rows, {
           index => {
               index => $Index,
               type  => $Type,
               id    => $row->{id},    # assuming there is an ID
               data  => $row
           }
       };
       if ( $i++ == 1000 ) {
           index_rows( \@rows );
           @rows = ();
       }
   }
   index_rows( \@_ );

}

sub index_rows {
my $rows = shift;
my $result = $es->bulk($rows);
die "Error while indexing: " . $result->{errors}
if $result->{errors};
}

clint


(Ted Karmel) #5

Hi Shay,

I concur :slight_smile: The query part is the most interesting one. The other
parts can be tackled feasibly.

Yes, the intent is to get the IP address from the HTTP headers and
send it in a query against the geoip index that includes addresses
expressed in ranges. The full addresses of each end of the range are
expressed (not just the last byte). They are ipv4 but usually are
expressed in the IP address long format. I presume this format
facilitates searches by having one single integer to match against
instead of integers separated by periods which might throw up
exceptions.

I've uploaded a sample csv file to give you a concrete example:
http://s3.amazonaws.com/elasticgeoipsearch/ipligence-lite.csv

The data I'm considering at the moment is from Ipligence. They have
fairly precise returns with their sample API. There is some info here
on setting up with MySQL : http://ipligence.com/faq . Time for a
NoSQL solution :slight_smile: and maybe ElasticSearch is the one that could do
it...

ted

On Tue, Oct 26, 2010 at 8:36 PM, Shay Banon
shay.banon@elasticsearch.com wrote:

Hi Ted,
The most interesting part here is actually what the query scenario is. Are
you going to get an IP address, and then try and find information related to
it, while what you index/have are actually ranges? Thats the "reverse" way
of querying and might require some tricks... . Also, when they do provide
ranges, is it only on the last byte (for ipv4)?
-shay.banon

On Tue, Oct 26, 2010 at 7:03 PM, Ted Karmel ted.karmel@gmail.com wrote:

hey clint,

thanks for the reply. and the perl script.

funny you should mention maxmind. i've given their premium database a
try and they are quite inaccurate for many addresses. there are many
more databases that are much more precise and more comprehensive. so
the performance advantage provided from their C index does not really
compensate.

and my intent with using elasticsearch for geoip is to subsequently be
able to conduct searches with the geo filter.

this setup may suffer slightly from speed due to the required http
calls but from my standpoint is worth it. i'm just looking for
reasonably acceptable latency.

On Tue, Oct 26, 2010 at 5:35 PM, Clinton Gormley
clinton@iannounce.co.uk wrote:

hiya ted

The way I see it, there are three broad dimensions:

  1. Converting CSV to JSON
  • most of the GeoIP databases provide the data in csv

My only experience with mapping IPs to locations is with MaxMind's
GeoIP, but they provide a very compact and fast index for you - I'm not
sure what benefit there would be in putting that into ElasticSearch
( but I may not have understood your intent)

  • how to bulk convert

here's a quick Perl script which will index your CSV docs:

#!/usr/bin/perl

use strict;
use warnings;

use Text::CSV_XS();
use ElasticSearch();
use Data::Dumper;

my $Index = 'my_index';
my $Type = 'my_type';
my @Col_Names = qw(id name ip_addr foo bar);

my $es = ElasticSearch->new( servers => '127.0.0.1:9200' );

my $csv = Text_CSV::XS->new( { binary => 1 } );
$csv->column_names(@Col_Names);

for my $csv_file (@ARGV) {
print "Processing file '$csv_file'\n";
open my $csv_fh, '<:encoding(utf8)', $csv_file
or die "Couldn't open $csv_file: $!";

   my @rows;
   my $i = 0;

   while ( my $row = $csv->getline_hr($csv_fh) ) {
       push @rows, {
           index => {
               index => $Index,
               type  => $Type,
               id    => $row->{id},    # assuming there is an ID
               data  => $row
           }
       };
       if ( $i++ == 1000 ) {
           index_rows( \@rows );
           @rows = ();
       }
   }
   index_rows( \@_ );

}

sub index_rows {
my $rows = shift;
my $result = $es->bulk($rows);
die "Error while indexing: " . $result->{errors}
if $result->{errors};
}

clint


(Shay Banon) #6

Hey,

Yea, thats what I though. You can actually do something similar to the way
mysql is built. Have a doc with from and to addresses represented as numeric
values, and then, when searching, using two range queries wrapped in a bool
query within its must clause. My initial feeling is that filters will not
make much sense in this case, even the numeric_range filter, if that will be
the only part of the query.

You can open a feature request to have an ip address type (ipv4) built
into elasticsearch, so it will do the conversion for you. ipv6 is a bit more
complex as sadly we don't yet have 128bit numbers :), so range is more
interesting there..., though it can be represented as a string. But, it does
not look like they have ipv6, right?

-shay.banon

On Wed, Oct 27, 2010 at 12:46 AM, Ted Karmel ted.karmel@gmail.com wrote:

Hi Shay,

I concur :slight_smile: The query part is the most interesting one. The other
parts can be tackled feasibly.

Yes, the intent is to get the IP address from the HTTP headers and
send it in a query against the geoip index that includes addresses
expressed in ranges. The full addresses of each end of the range are
expressed (not just the last byte). They are ipv4 but usually are
expressed in the IP address long format. I presume this format
facilitates searches by having one single integer to match against
instead of integers separated by periods which might throw up
exceptions.

I've uploaded a sample csv file to give you a concrete example:
http://s3.amazonaws.com/elasticgeoipsearch/ipligence-lite.csv

The data I'm considering at the moment is from Ipligence. They have
fairly precise returns with their sample API. There is some info here
on setting up with MySQL : http://ipligence.com/faq . Time for a
NoSQL solution :slight_smile: and maybe ElasticSearch is the one that could do
it...

ted

On Tue, Oct 26, 2010 at 8:36 PM, Shay Banon
shay.banon@elasticsearch.com wrote:

Hi Ted,
The most interesting part here is actually what the query scenario is.
Are
you going to get an IP address, and then try and find information related
to
it, while what you index/have are actually ranges? Thats the "reverse"
way
of querying and might require some tricks... . Also, when they do provide
ranges, is it only on the last byte (for ipv4)?
-shay.banon

On Tue, Oct 26, 2010 at 7:03 PM, Ted Karmel ted.karmel@gmail.com
wrote:

hey clint,

thanks for the reply. and the perl script.

funny you should mention maxmind. i've given their premium database a
try and they are quite inaccurate for many addresses. there are many
more databases that are much more precise and more comprehensive. so
the performance advantage provided from their C index does not really
compensate.

and my intent with using elasticsearch for geoip is to subsequently be
able to conduct searches with the geo filter.

this setup may suffer slightly from speed due to the required http
calls but from my standpoint is worth it. i'm just looking for
reasonably acceptable latency.

On Tue, Oct 26, 2010 at 5:35 PM, Clinton Gormley
clinton@iannounce.co.uk wrote:

hiya ted

The way I see it, there are three broad dimensions:

  1. Converting CSV to JSON
  • most of the GeoIP databases provide the data in csv

My only experience with mapping IPs to locations is with MaxMind's
GeoIP, but they provide a very compact and fast index for you - I'm
not

sure what benefit there would be in putting that into ElasticSearch
( but I may not have understood your intent)

  • how to bulk convert

here's a quick Perl script which will index your CSV docs:

#!/usr/bin/perl

use strict;
use warnings;

use Text::CSV_XS();
use ElasticSearch();
use Data::Dumper;

my $Index = 'my_index';
my $Type = 'my_type';
my @Col_Names = qw(id name ip_addr foo bar);

my $es = ElasticSearch->new( servers => '127.0.0.1:9200' );

my $csv = Text_CSV::XS->new( { binary => 1 } );
$csv->column_names(@Col_Names);

for my $csv_file (@ARGV) {
print "Processing file '$csv_file'\n";
open my $csv_fh, '<:encoding(utf8)', $csv_file
or die "Couldn't open $csv_file: $!";

   my @rows;
   my $i = 0;

   while ( my $row = $csv->getline_hr($csv_fh) ) {
       push @rows, {
           index => {
               index => $Index,
               type  => $Type,
               id    => $row->{id},    # assuming there is an ID
               data  => $row
           }
       };
       if ( $i++ == 1000 ) {
           index_rows( \@rows );
           @rows = ();
       }
   }
   index_rows( \@_ );

}

sub index_rows {
my $rows = shift;
my $result = $es->bulk($rows);
die "Error while indexing: " . $result->{errors}
if $result->{errors};
}

clint


(anahap) #7

Hi Ted would you care to share with geoip adresses are better than maxmind?

Thanks

hey clint,

thanks for the reply. and the perl script.

funny you should mention maxmind. i've given their premium database a
try and they are quite inaccurate for many addresses. there are many
more databases that are much more precise and more comprehensive. so
the performance advantage provided from their C index does not really
compensate.

and my intent with using elasticsearch for geoip is to subsequently be
able to conduct searches with the geo filter.

this setup may suffer slightly from speed due to the required http
calls but from my standpoint is worth it. i'm just looking for
reasonably acceptable latency.

On Tue, Oct 26, 2010 at 5:35 PM, Clinton Gormley
<clinton@iannounce.co.uk> wrote:

hiya ted

The way I see it, there are three broad dimensions:

  1. Converting CSV to JSON
  • most of the GeoIP databases provide the data in csv

My only experience with mapping IPs to locations is with MaxMind's
GeoIP, but they provide a very compact and fast index for you - I'm not
sure what benefit there would be in putting that into ElasticSearch
( but I may not have understood your intent)

  • how to bulk convert

here's a quick Perl script which will index your CSV docs:

#!/usr/bin/perl

use strict;
use warnings;

use Text::CSV_XS();
use ElasticSearch();
use Data::Dumper;

my $Index = 'my_index';
my $Type = 'my_type';
my @Col_Names = qw(id name ip_addr foo bar);

my $es = ElasticSearch->new( servers => '127.0.0.1:9200' );

my $csv = Text_CSV::XS->new( { binary => 1 } );
$csv->column_names(@Col_Names);

for my $csv_file (@ARGV) {
print "Processing file '$csv_file'\n";
open my $csv_fh, '<:encoding(utf8)', $csv_file
or die "Couldn't open $csv_file: $!";

   my @rows;
   my $i = 0;

   while ( my $row = $csv->getline_hr($csv_fh) ) {
       push @rows, {
           index => {
               index => $Index,
               type  => $Type,
               id    => $row->{id},    # assuming there is an ID
               data  => $row
           }
       };
       if ( $i++ == 1000 ) {
           index_rows( \@rows );
           @rows = ();
       }
   }
   index_rows( \@_ );

}

sub index_rows {
my $rows = shift;
my $result = $es->bulk($rows);
die "Error while indexing: " . $result->{errors}
if $result->{errors};
}

clint


(system) #8