I have an index of movies. The movies are sourced from different websites and there are overlaps; for example, Superman III is sourced from both imdb and hulu. For each movie I have the title and the name of the site it came from. Some of them also have a director and a list of actors. The titles may vary a little depending on which source they came from.
I would like to group the movies together so that one group contains all the instances of the same movie. For example, one group could be Superman III from imdb, hulu and Netflix.
Is that a good use case for the graph, and how would you go about doing that?
This is really a question about data preparation (de-duplication, entity resolution).
Once the data is normalised and linked, Graph should be good for exploring the connections, but data prep is often a big part of processing most real-world data sources.
There are various techniques you can use to link data. Normalization is a big one - e.g. do you turn Superman III into Superman 3 using rules to remove Roman numerals at the end of film titles? Do you remove accents from certain characters?
Do you combine information e.g. film title and year to ensure you get the right Cape Fear?
Do you combine actor name and movie to avoid one James Stewart being linked with a different James Stewart? Much of this is data-dependent, so without knowing more about the data in question it is hard to prescribe an answer that is guaranteed to work.
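To make the normalization and key-combining ideas above concrete, here is a minimal sketch in plain Python. It is not a recommended rule set: the `ROMAN` table, the `normalize_title` and `group_movies` helpers, and the `title`/`year`/`source` field names are all assumptions made for illustration, and the right rules depend on your data.

```python
import re
import unicodedata
from collections import defaultdict

# Hypothetical rule set: trailing Roman numerals (1-10) mapped to digits.
ROMAN = {"i": 1, "ii": 2, "iii": 3, "iv": 4, "v": 5,
         "vi": 6, "vii": 7, "viii": 8, "ix": 9, "x": 10}


def normalize_title(title):
    """Lowercase, strip accents and punctuation, convert a trailing Roman numeral."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", title)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Anything that is not a letter or digit becomes a word separator.
    words = re.sub(r"[^a-z0-9]+", " ", no_accents.lower()).split()
    # Only the last word is treated as a possible sequel number.
    if len(words) > 1 and words[-1] in ROMAN:
        words[-1] = str(ROMAN[words[-1]])
    return " ".join(words)


def group_movies(records):
    """Group records by (normalized title, year); field names are assumed."""
    groups = defaultdict(list)
    for record in records:
        key = (normalize_title(record["title"]), record.get("year"))
        groups[key].append(record["source"])
    return dict(groups)


movies = [
    {"title": "Superman III", "year": 1983, "source": "imdb"},
    {"title": "Superman 3", "year": 1983, "source": "hulu"},
    {"title": "Cape Fear", "year": 1962, "source": "imdb"},
    {"title": "Cape Fear", "year": 1991, "source": "netflix"},
]
groups = group_movies(movies)
```

With this input, the two Superman records share one key while the two Cape Fear films stay apart because the year is part of the key. The Roman numeral rule also shows why this is data-dependent: it would turn a title such as "Malcolm X" into "malcolm 10", so rules like this need checking against the real titles.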
Thank you Mark. This is helpful. I was exploring the possibility of using Elasticsearch to figure out the linking: since it knows the title, director and actors, it would be able to do a fuzzy match based on its statistical models.
I will try to create a new text field, dump the movie name, director and list of actors into it, and link on that.
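As a rough illustration of the combined-field idea, the sketch below builds one text string per record and clusters records whose strings are similar. It uses Python's standard `difflib.SequenceMatcher` as a simple stand-in for fuzzy matching; it is not how Elasticsearch scores documents. The field names, the `combined_text` and `cluster` helpers, and the 0.8 threshold are all assumptions for illustration.

```python
from difflib import SequenceMatcher


def combined_text(record):
    """Join title, director and sorted actor names into one lowercase string."""
    parts = [record["title"], record.get("director", "")]
    parts.extend(sorted(record.get("actors", [])))
    return " ".join(p.lower() for p in parts if p)


def similarity(a, b):
    """Character-level similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a, b).ratio()


def cluster(records, threshold=0.8):
    """Greedy clustering: compare each record to the first member of each cluster."""
    clusters = []
    for record in records:
        text = combined_text(record)
        for members in clusters:
            if similarity(text, combined_text(members[0])) >= threshold:
                members.append(record)
                break
        else:
            # No existing cluster was close enough, so start a new one.
            clusters.append([record])
    return clusters


movies = [
    {"title": "Superman III", "source": "imdb", "director": "Richard Lester",
     "actors": ["Christopher Reeve", "Richard Pryor"]},
    {"title": "Superman 3", "source": "hulu", "director": "Richard Lester",
     "actors": ["Richard Pryor", "Christopher Reeve"]},
    {"title": "Cape Fear", "source": "netflix", "director": "Martin Scorsese",
     "actors": ["Robert De Niro", "Nick Nolte"]},
]
clusters = cluster(movies)
```

One caveat: since only some records have a director and actors, a record with just a title will produce a much shorter string and may fall below the threshold even when it is the same movie, so the threshold and the choice of fields would need tuning against the real data.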