Loading astronomical data with Logstash


(Marcel-Jan Krijgsman) #1

Hi,
I've only just done a course on Elasticsearch, Kibana and Logstash. I have a Ubuntu 64-bit in VirtualBox with Elasticsearch 6.2.3 and Logstash And I thought it would be a good idea to do a little project to see what I can do with the ELK stack now. So I decided to load the newly released second dataset of the Gaia telescope (https://gea.esac.esa.int/archive/).

To start off, I just used one 40 MB csv file with 14,000+ stars (http://cdn.gea.esac.esa.int/Gaia/gdr2/gaia_source/csv/GaiaSource_1000172165251650944_1000424567594791808.csv.gz) instead of the 1.6 billion available. Just to see how things go. And I created this conf file:

input {
    file {
        path => ["/home/marcel-jan/gaia/GaiaSource_1000172165251650944_1000424567594791808.csv"]
        start_position => "beginning"
        sincedb_path => "/null"
        type => "data"
    }
}

filter {
    csv {
        separator => ","
        columns => [
            "solution_id",
            "designation",
            "source_id",
            "random_index",
            <etc...>
        ]
	}
}

output {
    stdout { codec => rubydebug }
    elasticsearch {
        action => "index"
        hosts => ["127.0.0.1:9200"]
        index => "gaiadr2"
    }
}

And this works. So hurray. But unfortunately all the columns are of the string type and that's not so useful. So I added a convert clause in the filter:
filter {
csv {
separator => ","
columns => [
"solution_id",
"designation",
"source_id",
"random_index",
<etc...>
]
}
mutate {
convert => {
"solution_id" => "integer"
"designation" => "integer"
"source_id" => "integer"
"ref_epoch" => "float"
"ra" => "float"
"ra_error" => "float"
<etc..>
}
}
}

Now this works when I try only the integer ones. Some of the data are, according to the documentation (https://gea.esac.esa.int/archive/documentation/GDR2/Gaia_archive/chap_datamodel/sec_dm_main_tables/ssec_dm_gaia_source.html) of the dataset of the double type. But I have already found out Logstash doesn't support the double data type. I've tried float instead. (Here are some examples of the double data: 103.4475289523685, 0.04109941963375859, 56.02202543042615. It seems it should fit in a float.)

And when I do that, Logstash simply hangs silently, not importing any data.

Where am I going wrong? Is it the data type? Is there a way around this?


(Attila Boncok) #2

I recommend defining mapping in Elasticsearch rather than Logstash.
You can define much more types in Elasticsearch. See eg. numeric datatypes.


(Marcel-Jan Krijgsman) #3

Thanks @atira.

I've made a mapping. Tried running Logstash without the mutate and.. it hangs silently. It didn't read the CSV file. From stdout I can't see why. No errors. Is there any place where I can find out why Logstash isn't importing the data?

The mapping:
curlh -XPUT 127.0.0.1:9200/gaiadr2 -d '
{
"mappings": {
"gaiadr2": {
"properties": {
"solution_id": { "type": "long" },
"designation": { "type": "text" },
"source_id": { "type": "long" },
"random_index": { "type": "long" },
"ref_epoch": { "type": "double" },
"ra": { "type": "double" },
"ra_error": { "type": "double" },
"dec": { "type": "double" },
"dec_error": { "type": "double" },
"parallax": { "type": "double" },
"parallax_error": { "type": "double" },
"parallax_over_error": { "type": "float" },
"pmra": { "type": "double" },
"pmra_error": { "type": "double" },
"pmdec": { "type": "double" },
"pmdec_error": { "type": "double" },
"ra_dec_corr": { "type": "float" },
"ra_parallax_corr": { "type": "float" },
"ra_pmra_corr": { "type": "float" },
"ra_pmdec_corr": { "type": "float" },
"dec_parallax_corr": { "type": "float" },
"dec_pmra_corr": { "type": "float" },
"dec_pmdec_corr": { "type": "float" },
"parallax_pmra_corr": { "type": "float" },
"parallax_pmdec_corr": { "type": "float" },
"pmra_pmdec_corr": { "type": "float" },
"astrometric_n_obs_al": { "type": "integer" },
"astrometric_n_obs_ac": { "type": "integer" },
"astrometric_n_good_obs_al": { "type": "integer" },
"astrometric_n_bad_obs_al": { "type": "integer" },
"astrometric_gof_al": { "type": "float" },
"astrometric_chi2_al": { "type": "float" },
"astrometric_excess_noise": { "type": "double" },
"astrometric_excess_noise_sig": { "type": "double" },
"astrometric_params_solved": { "type": "byte" },
"astrometric_primary_flag": { "type": "boolean" },
"astrometric_weight_al": { "type": "float" },
"astrometric_pseudo_colour": { "type": "double" },
"astrometric_pseudo_colour_error": { "type": "double" },
"mean_varpi_factor_al": { "type": "float" },
"astrometric_matched_observations": { "type": "short" },
"visibility_periods_used": { "type": "short" },
"astrometric_sigma5d_max": { "type": "float" },
"frame_rotator_object_type": { "type": "integer" },
"matched_observations": { "type": "short" },
"duplicated_source": { "type": "boolean" },
"phot_g_n_obs": { "type": "integer" },
"phot_g_mean_flux": { "type": "double" },
"phot_g_mean_flux_error": { "type": "double" },
"phot_g_mean_flux_over_error": { "type": "double" },
"phot_g_mean_mag": { "type": "float" },
"phot_bp_n_obs": { "type": "integer" },
"phot_bp_mean_flux": { "type": "double" },
"phot_bp_mean_flux_error": { "type": "double" },
"phot_bp_mean_flux_over_error": { "type": "float" },
"phot_bp_mean_mag": { "type": "float" },
"phot_rp_n_obs": { "type": "integer" },
"phot_rp_mean_flux": { "type": "double" },
"phot_rp_mean_flux_error": { "type": "double" },
"phot_rp_mean_flux_over_error": { "type": "float" },
"phot_rp_mean_mag": { "type": "float" },
"phot_bp_rp_excess_factor": { "type": "float" },
"phot_proc_mode": { "type": "byte" },
"bp_rp": { "type": "float" },
"bp_g": { "type": "float" },
"g_rp": { "type": "float" },
"radial_velocity": { "type": "double" },
"radial_velocity_error": { "type": "double" },
"rv_nb_transits": { "type": "integer" },
"rv_template_teff": { "type": "float" },
"rv_template_logg": { "type": "float" },
"rv_template_fe_h": { "type": "float" },
"phot_variable_flag": { "type": "text" },
"l": { "type": "double" },
"b": { "type": "double" },
"ecl_lon": { "type": "double" },
"ecl_lat": { "type": "double" },
"priam_flags": { "type": "long" },
"teff_val": { "type": "float" },
"teff_percentile_lower": { "type": "float" },
"teff_percentile_upper": { "type": "float" },
"a_g_val": { "type": "float" },
"a_g_percentile_lower": { "type": "float" },
"a_g_percentile_upper": { "type": "float" },
"e_bp_min_rp_val": { "type": "float" },
"e_bp_min_rp_percentile_lower": { "type": "float" },
"e_bp_min_rp_percentile_upper": { "type": "float" },
"flame_flags": { "type": "long" },
"radius_val": { "type": "float" },
"radius_percentile_lower": { "type": "float" },
"radius_percentile_upper": { "type": "float" },
"lum_val": { "type": "float" },
"lum_percentile_lower": { "type": "float" },
"lum_percentile_upper": { "type": "float" }
}
}
}
}'


(Attila Boncok) #4

ES would throw away documents where there are mismatching mapping types. But it would log it.
You could go around it with the ignore_malformed option.

It's weird that there are no log entries about that. Are you sure you're not getting anything in either logs? Did you try lowering the log level?

Anyway, I find this one disturbing:

sincedb_path => "/null"

Is that similar to /dev/null? Logstash needs sincedb to keep track which files/lines it already read.


(Guy Boertje) #5

I agree with @atira.

When you specified sincedb_path => "/null" Logstash actually created a file called null in the hdd device root directory and its using it to track the position the file was read up to.

When you use "/dev/null" the position tracking option is disabled.


(Marcel-Jan Krijgsman) #6

Thanks for the quick responses, @atira and @guybroertje

I have changed the mapping, adding ignore_malformed: false.

I have gotten errors now. This looks like something I can troubleshoot.
[WARN ] 2018-04-30 14:29:37.374 [Ruby-0-Thread-9@[main]>worker0: /usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:384] elasticsearch - Could not index event to Elasticsearch. {:status=>400, :action=>["index", {:_id=>nil, :_index=>"gaiadr2", :_type=>"doc", :_routing=>nil}, #<LogStash::Event:0x55b1fe04>], :response=>{"index"=>{"_index"=>"gaiadr2", "_type"=>"doc", "_id"=>"7hmHFmMBa2YHi_IUn7af", "status"=>400, "error"=>{"type"=>"illegal_argument_exception", "reason"=>"Rejecting mapping update to [gaiadr2] as the final mapping would have more than 1 type: [doc, gaiadr2]"}}}}

Is that problem something that has to do with this part?

curlh -XPUT 127.0.0.1:9200/gaiadr2 -d '
{
  "settings": {
    "index.mapping.ignore_malformed": false 
  },
  "mappings": {
    "gaiadr2": {                                           <-----
      "properties": {
        "solution_id":    { "type": "long"  },

(Marcel-Jan Krijgsman) #7

Just changed that type to "doc" and all the data has loaded.

And I can now see the data in Kibana with the right data types. I can make graphs of the data. This is pretty cool.

Thanks for the help! In a while I'm planning to do a short Youtube video on my first experiences with Elasticsearch


(Mark Walkom) #8

I used this to grab all 5200 or so files to process, pity they didn't offer a torrent of them!

for i in `curl http://cdn.gea.esac.esa.int/Gaia/gdr1/gaia_source/csv/ |grep csv.gz|cut -d\" -f2`; do wget http://cdn.gea.esac.esa.int/Gaia/gdr1/gaia_source/csv/$i; done

(system) #9

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.