Graph relation extraction on lastfm data


(Inancarin) #1

Hi guys,

I have been using Elasticsearch for a while, but I am new to the Graph API. I just indexed lastfm data into ES, and a sample document from my index looks like this:

{
  "_index": "lastfm",
  "_type": "song",
  "_id": "AVg-oFWHtNjcLv7Y8nnT",
  "_score": 1,
  "_source": {
    "timestamp": "2009-02-03T16:54:25Z",
    "userid": "user_000001",
    "artist-name": "Ken Ishii",
    "track-name": "Frame Out",
    "musicbrainz-track-id": "8f28cbe6-3e46-4f96-816d-304620f64b41",
    "musicbrainz-artist-id": "6d4c4759-8a16-4b9f-83e2-4c225307fc85",
    "user": {
      "gender": "m",
      "signup": "Aug 13, 2006",
      "country": "Japan"
    }
  }
}

What I want to do here is find relations between artists and other artists (that is, people who listen to one artist also listen to another), between countries and artists, between users and artists, and so on.

I can visualize charts on the artist name correctly in Kibana.

However, when I try to find relations with the Graph API, I find vertices but no relations among them, and I cannot expand the selected vertices (with the "artist-name.keyword" field selected).

When I select "artist-name" field, it gives me the following error: "Error 400 Bad Request: Fielddata is disabled on text fields by default. Set fielddata=true on [artist-name] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory."

It is okay, I managed to solve it with the following:

PUT lastfm/_mapping/song
{
  "properties": {
    "artist-name": { 
      "type":     "text",
      "fielddata": true
    }
  }
}

Now I am able to display vertices and their relations. However, "damien" and "rice" end up as separate vertices when they should be a single vertex, "damien rice".

Any help on this will make me very happy.

Thanks,


(Mark Harwood) #2

So the main thing to be aware of here is that Graph draws on co-occurrence of tokens in the same document. That means for an artist->artist graph you need

  1. Documents that contain more than one artist-name and
  2. Indexed artist-name tokens that haven't split the string damien rice into damien and rice. (So untokenized "keyword" strings)
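To make the two conditions above concrete, here is a rough Python sketch, not anything Graph actually runs. It assumes a crude whitespace tokenizer as a stand-in for an analyzed text field and uses plain pair counts, whereas Graph works with significance rather than raw counts. The function names and sample data are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def tokenize_text(value):
    """Crude stand-in for an analyzed text field: lowercase, split on whitespace."""
    return value.lower().split()


def cooccurrences(docs):
    """Count how often two distinct terms appear together in the same document."""
    counts = Counter()
    for terms in docs:
        for pair in combinations(sorted(set(terms)), 2):
            counts[pair] += 1
    return counts


# One document per play: every document holds exactly one artist,
# so there is nothing for an artist to co-occur with.
per_play_docs = [["Damien Rice"], ["Ken Ishii"], ["Damien Rice"]]

# One document per user: every document holds all artists that user played.
per_user_docs = [
    ["Damien Rice", "Ken Ishii"],
    ["Damien Rice", "Ken Ishii", "Fugazi"],
]

print(cooccurrences(per_play_docs))  # empty: no artist-to-artist links
print(cooccurrences(per_user_docs))  # pairs of artists sharing a document
print(tokenize_text("Damien Rice"))  # the name is split into two tokens
```

With one artist per document there are no pairs at all, which matches vertices showing up without any connections. The tokenized name also shows why "damien" and "rice" become separate vertices on an analyzed text field.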

To do this kind of analysis we create one document per user with an array of the band names they like e.g.

 { "name": "Mark", "likedArtists": ["Fugazi", "Polica", "Team sleep", "Mastodon" ..] }

.. and use the appropriate mapping definition.

Here's a script to do exactly all of this with the LastFm data using version 5+ of elasticsearch: https://gist.github.com/markharwood/f67a8532f0acba8dcc3fba07541b0933
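Separately from the linked script (this is not its contents), a minimal Python sketch of the reshaping step could look like the following. It assumes play records carrying the "userid" and "artist-name" fields from the sample document in the question. The "likedArtists" field name follows the example above, while build_user_docs and USER_MAPPING are hypothetical names. The mapping fragment only illustrates the idea of using the "keyword" type so that names stay untokenized.

```python
import json
from collections import defaultdict

# Hypothetical mapping fragment: "keyword" keeps each value as a single
# untokenized term, so "Damien Rice" stays one term instead of two.
USER_MAPPING = {
    "properties": {
        "userid": {"type": "keyword"},
        "likedArtists": {"type": "keyword"},
    }
}


def build_user_docs(plays):
    """Group per-play records into one document per user with an artist array."""
    artists_by_user = defaultdict(set)
    for play in plays:
        artists_by_user[play["userid"]].add(play["artist-name"])
    return [
        {"userid": userid, "likedArtists": sorted(artists)}
        for userid, artists in sorted(artists_by_user.items())
    ]


# Made-up sample plays in the shape of the per-song documents.
plays = [
    {"userid": "user_000001", "artist-name": "Ken Ishii"},
    {"userid": "user_000001", "artist-name": "Damien Rice"},
    {"userid": "user_000001", "artist-name": "Ken Ishii"},
    {"userid": "user_000002", "artist-name": "Damien Rice"},
]

user_docs = build_user_docs(plays)
print(json.dumps(USER_MAPPING, indent=2))
print(json.dumps(user_docs, indent=2))
```

Repeated plays of the same artist collapse into one array entry, so each user document lists every artist once.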

Cheers
Mark


(Inancarin) #3

Hi Mark,

First of all, thanks for your answer, and sorry for the late reply. It is working now. (By the way, I realised there are two different lastfm datasets and we were using different ones :) )

I have a question. Suppose I have streaming data, where users are continuously listening to, liking, or rating new songs and artists. Wouldn't keeping the data the way you do be costly? When a user likes a new artist, for example, you first have to find the user, then check whether the artist already exists in the array in the artists field, and update the array if it does not. What do you think about this kind of streaming data?


(Mark Harwood) #4

It doesn't have to be. I don't think it is vital for the benefit of others' recommendations that my latest song-play is updated immediately. That single action won't swing their recommendations, but it is important to continually apply updates to keep abreast of new trends. This can be done in mini-batches, where perhaps a day's worth of listening habits is consolidated into a single update to a user profile. Some example scripts and a discussion are in this talk on "entity centric indexing": https://www.youtube.com/watch?v=yBf7oeJKH2Y
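As a rough illustration of the mini-batch idea (a sketch only, not the scripts from the talk), the Python below collapses a batch of listen events into one set of artists per user and then merges each set into that user's profile in a single step. The in-memory profiles dict stands in for the user index, and consolidate_batch and apply_batch are hypothetical names.

```python
def consolidate_batch(events):
    """Collapse a batch of listen events into one set of artists per user."""
    batch = {}
    for event in events:
        batch.setdefault(event["userid"], set()).add(event["artist-name"])
    return batch


def apply_batch(profiles, batch):
    """Merge batched artists into each profile; return how many profiles changed."""
    changed = 0
    for userid, artists in batch.items():
        profile = profiles.setdefault(
            userid, {"userid": userid, "likedArtists": []}
        )
        merged = sorted(set(profile["likedArtists"]) | artists)
        if merged != profile["likedArtists"]:
            profile["likedArtists"] = merged
            changed += 1
    return changed


# Existing user profiles (stand-in for the per-user index).
profiles = {
    "user_000001": {"userid": "user_000001", "likedArtists": ["Ken Ishii"]},
}

# A made-up day's worth of events, including repeats.
events = [
    {"userid": "user_000001", "artist-name": "Ken Ishii"},
    {"userid": "user_000001", "artist-name": "Damien Rice"},
    {"userid": "user_000001", "artist-name": "Damien Rice"},
    {"userid": "user_000002", "artist-name": "Fugazi"},
]

updated = apply_batch(profiles, consolidate_batch(events))
print(updated, profiles)
```

However many events a user generates during the batch window, their profile gets at most one write, and profiles whose artist list did not change are skipped.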


(Mark Harwood) #5

(BTW, shifting this to the Graph forum)


(Inancarin) #6

Ohh I see, you are right about immediate updates. Thanks for your comments, they are very helpful.

Inanc

