Remove Latitude and Longitude from a cell in a CSV for geo_point visualization

I'm trying to parse data from a CSV cell so that I can use the latitude and longitude in it for a geo_point visualization. The column is titled Location, and each cell contains a school's address followed by its latitude and longitude in parentheses.

I've been reading through the filters page to see what my best option might be, but between grok, dissect, split, and extract numbers I'm not entirely sure which filter is the most efficient way to pull out the lat and long.

I've read through a previous post (this one's a good start), so I'm working off of that info to try and create a visualization. The data I'm working with is from NYS School Lead Testing, and my overall goal is to learn how to import the data and create visualizations from it. I've got the import part down and some of the visualizations set up, but it's not perfect, so I'm trying to figure out how to manipulate the data into what we need. For this project I've stripped the whole thing down to just the Location column to work out how to parse it correctly.

Learning how to set up the mapping for this will obviously be a step that's needed at the end.


CSV column title and data example

Location

7 CLEVELAND DR
ADDISON, NY 14801
(42.102851749, -77.231278965)

Current Logstash config called removeLatLong

input {
  file {
    path => "/tmp/removeLatLong.csv"
    start_position => "beginning"
    sincedb_path => "/dev/null"
    tags => "removeLatLong"
  }
}

filter {
  csv {
    separator => ","
    columns => [ "Location" ]
  }
  mutate {
    convert => { "Location" => "string" }
    # Latitude and Longitude do not exist yet at this point; they still need
    # to be extracted from Location before they can be converted or renamed
    convert => { "Latitude" => "float" }
    convert => { "Longitude" => "float" }
  }
  mutate {
    rename => {
      "Latitude" => "[location][lat]"
      "Longitude" => "[location][lon]"
    }
  }
  #date {
  #  match => [ "Latest Record Update", "ISO8601" ]
  #  target => "Latest Record Update"
  #}
}

output {
  elasticsearch {
    hosts => ["https://abc.ny.gov:9200", "https://def.ny.gov:9200", "https://ghi.ny.gov:9200"]
    manage_template => false
    # Elasticsearch index names must be lowercase
    index => "removelatlong"
    user => "logstash_internal"
    password => "#####"
    ssl => true
    #ssl_certificate_verification => false
    cacert => "/etc/logstash/globalcert/ca/ca.crt"
  }
}
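
A note on testing: while iterating on the filters, the elasticsearch output can be swapped for a stdout one so each parsed event prints to the console:

output {
  # temporary output for debugging; prints each event in rubydebug format
  stdout { codec => rubydebug }
}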

I would expect you to go after County Location rather than Location

 grok { match => { "County Location" => "\(%{NUMBER:[location][lat]:float}, %{NUMBER:[location][lon]:float}\)" } }

That's what I was looking at originally, but County Location is just a generic lat/long for each of the 62 counties in NYS, while Location is the specific lat/long of a school in NYS. The grok code does give me good insight into what I need to be working towards and how to implement it, which is appreciated. Also, I would normally just use Excel to split the info, but some of the data we're looking at is beyond our control (i.e. other people update it), and trying to tell a bunch of other agencies how to specifically fill out their spreadsheets to help us is a pipe dream to say the least.

Strangely, if you export the data as a CSV then Location is a copy of School Street!

I might be missing something here; do you mean that I can already use Location as a geo_point with the school street address? If so, that would be useful as well, if it's easier to implement.

No, I mean the exported CSV does not have lat/lon for the school. It is overwritten with the street address.

Did you expand the column width and height when you looked at the Location column after exporting it? I've got lat and long in mine, but when I first looked at it I thought the street address had overwritten it as well. This is what I can see when I open up the CSV in Excel: the bottom right has a lat and long. There are a few blanks in Location for some schools, but the vast majority are there. That's the lat and long I'm trying to pull out and display.

Finally we are on the same page :smiley: OK, so that grok still works

    csv { autodetect_column_names => true }
    grok { match => { "Location" => "\(%{NUMBER:[location][lat]:float}, %{NUMBER:[location][lon]:float}\)" } }

will get you

                              "Location" => "1051 DIX AVENUE\r HUDSONFALLS, NY 12839\r (43.319290299000045, -73.59206215299997)",
                              "location" => {
    "lat" => 43.319290299000045,
    "lon" => -73.59206215299997
},

If there is no lat/lon inside parentheses in the Location field you get a _grokparsefailure instead.
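
If you would rather keep those rows out of the index entirely, one option (a sketch, assuming events without coordinates can simply be discarded) is to drop anything carrying the failure tag:

if "_grokparsefailure" in [tags] {
  drop { }   # or route these events to a separate index in the output instead
}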

Your next problem is to update the template for your index to tell it that the location field is a geo_point. There are two examples of how you could do that in previous releases (dynamic and non-dynamic) in this thread. However, templates are changing. If those examples do not work, then ask a new question in the elasticsearch forum about how to configure a template that makes location a geo_point.
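
For reference, a minimal sketch of what such a template could look like with the legacy template API (the template name and index pattern are placeholders chosen to match the config above; on 6.x the properties have to sit under a mapping type such as _doc, so check the docs for your version):

PUT _template/removelatlong
{
  "index_patterns": ["removelatlong*"],
  "mappings": {
    "properties": {
      "location": { "type": "geo_point" }
    }
  }
}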


Ok, I understand the _grokparsefailure part; that makes sense because it has nothing to parse. But I'm also getting some _csvparsefailure errors as well. This is repeated a lot when I run the config:

[2019-04-05T15:27:10,195][WARN ][logstash.filters.csv     ] Error parsing csv {:field=>"message", :source=>"\"OLD OLEAN RD", :exception=>#<CSV::MalformedCSVError: Unclosed quoted field on line 1.>}
[2019-04-05T15:27:10,196][WARN ][logstash.filters.csv     ] Error parsing csv {:field=>"message", :source=>"(42.526670542, -78.472396965)\",\r", :exception=>#<CSV::MalformedCSVError: Illegal quoting in line 1.>}

If I'm understanding this correctly, it may be due to the fact that I need another part within my grok pattern to capture the address that's before the (lat, long). I've been trying the websites listed on the Grok filter page to match the address part of the cell, with no luck.

Update: As usual, right after I post I stumble across this: https://github.com/logstash-plugins/logstash-patterns-core/blob/master/patterns/grok-patterns

No, parsing a row of the CSV is failing because of the quoting. The following is not valid CSV:

foo,"bar"baz,woot

If a column contains quotes then it has to start and end with quotes. It cannot be partly quoted.
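
For comparison, a fully quoted column parses fine (an illustrative line, not taken from the dataset):

foo,"bar baz",woot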

Update:
So after a little more research on CSVs, the logs, and what you wrote, Badger, I think I see what you're saying now. The CSV cell is being broken down into "2285 Broad St", "Yorktown Heights, NY 10598", and "(41.287400247, -73.780487444)", which is breaking up the matching pattern that's being searched for.

Badger,

Sorry for the delay; I've been working on this off and on as a side project to understand Logstash a bit better. I looked at the last 5 errors that were printed when I ran the config and matched them up with the specific cells within the CSV that they correlate to. After copying and pasting the cells into Notepad, the quotes seem to be correctly placed.

One thing that may make a difference here: when I copy and paste the cell into Notepad, it prints out on a single line with no space between the street and the city. This seems to be where each error message is starting. For example:

"2285 BROAD ST
YORKTOWN HEIGHTS, NY 10598
(41.287400247, -73.780487444)"

Actually goes into Notepad as:

"2285 BROAD STYORKTOWN HEIGHTS, NY 10598(41.287400247, -73.780487444)"

Last 5 errors and the cells that they relate to:

[2019-04-11T09:13:21,949][WARN ][logstash.filters.csv     ] Error parsing csv {:field=>"message", :source=>"\"2285 BROAD ST", :exception=>#<CSV::MalformedCSVError: Unclosed quoted field on line 1.>}
[2019-04-11T09:13:21,950][WARN ][logstash.filters.csv     ] Error parsing csv {:field=>"message", :source=>"(41.287400247, -73.780487444)\",\r", :exception=>#<CSV::MalformedCSVError: Illegal quoting in line 1.>}
[2019-04-11T09:13:21,950][WARN ][logstash.filters.csv     ] Error parsing csv {:field=>"message", :source=>"\"2701 CROMPOND RD", :exception=>#<CSV::MalformedCSVError: Unclosed quoted field on line 1.>}
[2019-04-11T09:13:21,951][WARN ][logstash.filters.csv     ] Error parsing csv {:field=>"message", :source=>"(41.294789765, -73.801976404)\",\r", :exception=>#<CSV::MalformedCSVError: Illegal quoting in line 1.>}
[2019-04-11T09:13:21,951][WARN ][logstash.filters.csv     ] Error parsing csv {:field=>"message", :source=>"\"2727 CROMPOND RD", :exception=>#<CSV::MalformedCSVError: Unclosed quoted field on line 1.>}
[2019-04-11T09:13:21,952][WARN ][logstash.filters.csv     ] Error parsing csv {:field=>"message", :source=>"(41.294746646, -73.802799843)\",\r", :exception=>#<CSV::MalformedCSVError: Illegal quoting in line 1.>}
[2019-04-11T09:13:21,953][WARN ][logstash.filters.csv     ] Error parsing csv {:field=>"message", :source=>"\"2051 BALDWIN ROAD", :exception=>#<CSV::MalformedCSVError: Unclosed quoted field on line 1.>}
[2019-04-11T09:13:21,954][WARN ][logstash.filters.csv     ] Error parsing csv {:field=>"message", :source=>"(41.278067198, -73.789960734)\",\r", :exception=>#<CSV::MalformedCSVError: Illegal quoting in line 1.>}
[2019-04-11T09:13:21,954][WARN ][logstash.filters.csv     ] Error parsing csv {:field=>"message", :source=>"\"2901 MANOR ST", :exception=>#<CSV::MalformedCSVError: Unclosed quoted field on line 1.>}
[2019-04-11T09:13:21,955][WARN ][logstash.filters.csv     ] Error parsing csv {:field=>"message", :source=>"(41.304480032, -73.81375574)\",\r", :exception=>#<CSV::MalformedCSVError: Illegal quoting in line 1.>}

Cell quoting format for the above errors:

"2285 BROAD ST
YORKTOWN HEIGHTS, NY 10598
(41.287400247, -73.780487444)"
"2701 CROMPOND RD
YORKTOWN HEIGHTS, NY 10598
(41.294789765, -73.801976404)"
"2727 CROMPOND RD
YORKTOWN HEIGHTS, NY 10598
(41.294746646, -73.802799843)"
"2051 BALDWIN ROAD
YORKTOWN HEIGHTS, NY 10598
(41.278067198, -73.789960734)"
"2901 MANOR ST
YORKTOWN HEIGHTS, NY 10598
(41.304480032, -73.81375574)"

Unfortunately I run logstash on UNIX, so I cannot test using a multi-line quoted string on Windows.
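
One direction that might handle the embedded line breaks, sketched here untested: have the file input reassemble the rows with a multiline codec before the csv filter runs. The pattern below assumes the stripped-down single-column file, where every new record starts with an opening quote; the header line and any unquoted cells would need extra handling.

input {
  file {
    path => "/tmp/removeLatLong.csv"
    start_position => "beginning"
    sincedb_path => "/dev/null"
    codec => multiline {
      # join any line that does not start with a double quote onto the
      # previous line, so a quoted multi-line cell becomes one event
      pattern => '^"'
      negate => true
      what => "previous"
      # flush the last buffered row even if no further lines arrive
      auto_flush_interval => 2
    }
  }
}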
