NiFi Revisited : Aggregate Movie Ratings Data To Find Top 10 Movies


This post is a sample of data aggregation in NiFi.

If you have just started learning NiFi, check this blog post, which is a much more detailed sample than this one.

Our goals are:

  1. Fetch the movie ratings data
  2. Calculate average rating per movie
  3. Find the top 10 rated movies
  4. Export the top 10 list in both CSV and AVRO formats.

Download Sample Dataset

The MovieLens dataset is available on the GroupLens website.
 
In this challenge, we'll use the MovieLens 100K Dataset. Download the zip file and extract the "u.data" file.
u.data is a tab-delimited file that holds the ratings and contains four columns:

user_id (int), movie_id (int), rating (int), time (int)

Keep this file until we test our NiFi flow. 
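
Before building the flow, it can help to see what the four columns look like once parsed. The snippet below is only a sketch: the sample rows are made up (they are not taken from the real u.data), and the Rating type and parse_ratings function are names chosen here for illustration.

```python
import csv
import io
from typing import NamedTuple


class Rating(NamedTuple):
    """One line of a u.data-style file (hypothetical helper type)."""
    user_id: int
    movie_id: int
    rating: int
    time: int


def parse_ratings(text: str) -> list[Rating]:
    """Parse tab-delimited rating lines into Rating tuples, skipping blank lines."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return [Rating(*map(int, row)) for row in reader if row]


# Made-up rows in the u.data layout: user_id, movie_id, rating, time.
SAMPLE = "1\t10\t5\t880000000\n2\t10\t3\t880000100\n"

if __name__ == "__main__":
    for r in parse_ratings(SAMPLE):
        print(r)
```

To try it on the real file, read u.data into a string and pass it to parse_ratings.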

GetFile

Create a GetFile processor, and point it to a local folder you created, to fetch input files.



Input Directory: /home/oguz/Documents/Olric/File_Source/



UpdateAttribute to alter the file name


Add an UpdateAttribute processor with the filename attribute set to "Top10Movies.csv".


Connect the GetFile processor to UpdateAttribute.


Source & Target Schemas


We'll now create a Controller Service of type AvroSchemaRegistry, and create two schemas inside this registry :


InputSchema
{
     "type": "record",
     "namespace": "movies",
     "name": "movie",
     "fields": [
       { "name": "user_id", "type": ["int", "null"] },
       { "name": "movie_id", "type": ["int", "null"] },
       { "name": "rating", "type": ["float", "null"] },
       { "name": "timestamp", "type": ["int", "null"] }
     ]
}


OutputSchema
{
     "type": "record",
     "namespace": "movies",
     "name": "movie",
     "fields": [
       { "name": "movie_id", "type": ["int", "null"] },
       { "name": "rating", "type": ["float", "null"] }
     ]
}

Although rating is actually an integer in the source file, we read it as a float, since our average calculation will result in float values.
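
A missing brace in a schema is easy to overlook, so a quick check outside NiFi may save some debugging. The sketch below only verifies that the two schema texts are well-formed JSON and lists their field names. It does not validate them against the Avro specification, and field_names is a helper name made up for this example.

```python
import json

INPUT_SCHEMA = """
{
  "type": "record", "namespace": "movies", "name": "movie",
  "fields": [
    { "name": "user_id", "type": ["int", "null"] },
    { "name": "movie_id", "type": ["int", "null"] },
    { "name": "rating", "type": ["float", "null"] },
    { "name": "timestamp", "type": ["int", "null"] }
  ]
}
"""

OUTPUT_SCHEMA = """
{
  "type": "record", "namespace": "movies", "name": "movie",
  "fields": [
    { "name": "movie_id", "type": ["int", "null"] },
    { "name": "rating", "type": ["float", "null"] }
  ]
}
"""


def field_names(schema_text: str) -> list[str]:
    """Parse the schema as JSON (raises ValueError if malformed) and return its field names."""
    schema = json.loads(schema_text)
    return [field["name"] for field in schema["fields"]]


if __name__ == "__main__":
    print(field_names(INPUT_SCHEMA))
    print(field_names(OUTPUT_SCHEMA))
```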

CSVReader


Add a controller service of type CSVReader. Configuration shall be as follows :




Schema Access Strategy: Use Schema Name Property
Schema Registry: AvroSchemaRegistry
Schema Name: InputSchema
CSV Format: Tab Delimited
Treat First Line as Header: True
Ignore CSV Header Column Names: True


CSVRecordSetWriter

Create a CSVRecordSetWriter as well and configure it as seen below.



Schema Write Strategy: Set schema.name Attribute
Schema Access Strategy: Use Schema Name Property
Schema Registry: AvroSchemaRegistry
Schema Name: OutputSchema
CSV Format: Custom Format

QueryRecord to aggregate flowfiles


QueryRecord allows us to query incoming data with SQL, as if we were working on a relational database. Add a QueryRecord processor and configure it as follows.


Record Reader: CSVReader
Record Writer: CSVRecordSetWriter
summary:

select movie_id, rating from
   (select movie_id, avg(rating) as rating, count(user_id) as cnt
      FROM FLOWFILE GROUP BY movie_id
   )
where cnt >= 100
order by
   rating desc limit 10
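
To get a feel for what this query returns, we can run the same SQL shape on a handful of rows outside NiFi. The sketch below uses Python's built-in sqlite3 module as a stand-in for QueryRecord's SQL engine, so small dialect differences are possible. The sample rows are made up, and top_movies with its min_ratings and limit parameters are names invented for this illustration.

```python
import sqlite3


def top_movies(rows, min_ratings=100, limit=10):
    """Return (movie_id, average rating) pairs for movies with enough ratings,
    ordered by average rating, highest first."""
    conn = sqlite3.connect(":memory:")
    try:
        # The table stands in for the FLOWFILE table that QueryRecord exposes.
        conn.execute(
            "CREATE TABLE flowfile "
            "(user_id INTEGER, movie_id INTEGER, rating REAL, timestamp INTEGER)"
        )
        conn.executemany("INSERT INTO flowfile VALUES (?, ?, ?, ?)", rows)
        query = """
            SELECT movie_id, rating FROM
               (SELECT movie_id, AVG(rating) AS rating, COUNT(user_id) AS cnt
                  FROM flowfile GROUP BY movie_id)
            WHERE cnt >= ?
            ORDER BY rating DESC LIMIT ?
        """
        return conn.execute(query, (min_ratings, limit)).fetchall()
    finally:
        conn.close()


# Made-up ratings: movie 1 averages 4.5, movie 2 averages 3.0,
# and movie 3 has a single rating only.
SAMPLE_ROWS = [
    (1, 1, 5, 0), (2, 1, 4, 0),
    (1, 2, 3, 0), (2, 2, 3, 0), (3, 2, 3, 0),
    (1, 3, 5, 0),
]

if __name__ == "__main__":
    # A threshold of 2 replaces the 100 used in the flow, since the sample is tiny.
    print(top_movies(SAMPLE_ROWS, min_ratings=2))
```

The count threshold is what keeps movie 3 out of the result, even though its single rating is a 5. That is the reason for the cnt >= 100 filter in the flow: without it, movies rated by only a few users could dominate the top 10.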

Connect UpdateAttribute processor to QueryRecord processor.


Also, navigate to the Settings tab of QueryRecord processor, and choose Original and Failure relationships in the section "Automatically Terminate Relationships". Otherwise, these will remain as unhandled relationships.

PutHDFS to upload target files into Hadoop


Now add a PutHDFS processor to upload our results into HDFS. The PutHDFS processor is configured as below.


Hadoop Configuration Resources: /etc/hadoop/3.1.4.0-315/0/hdfs-site.xml,/etc/hadoop/3.1.4.0-315/0/core-site.xml
Directory: /user/oguz/Top10Movies

Hadoop Configuration Resources must list the full paths to the hdfs-site.xml and core-site.xml files, which exist on any HDP node.

Directory is the target directory in HDFS. Ensure that the user running NiFi has permission to write to this folder. If the directory does not exist, NiFi will create it automatically.


Connect the QueryRecord processor to the PutHDFS processor, using the summary relationship.




This flow will produce the result in CSV format and upload to HDFS.

But we are not done yet, since we also need the same result in AVRO format.

InferAvroSchema to populate schema from flowfiles


InferAvroSchema makes it easy to populate an AVRO schema, using the incoming files.

Add an InferAvroSchema processor. All properties will be left at their defaults, except two:


Choose flowfile-attribute from the Schema Output Destination drop-down.

Also set the Avro Record Name to "avrorec".

Terminate the failure, original, and unsupported content relationships automatically in the InferAvroSchema processor, under the Settings tab.

Connect the QueryRecord processor to the InferAvroSchema processor, using the summary relationship.
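
If you are curious what "inferring" a schema from CSV means, the idea can be sketched in a few lines. This is a deliberately simplified illustration and not NiFi's actual inference logic, which may pick different types or wrap them in nullable unions. The functions infer_type and infer_schema, and the sample CSV text, are invented for this example.

```python
import csv
import io
import json


def infer_type(values):
    """Return the narrowest of "int", "double" or "string" that fits every value."""
    for avro_type, cast in (("int", int), ("double", float)):
        try:
            for value in values:
                cast(value)
            return avro_type
        except ValueError:
            continue
    return "string"


def infer_schema(csv_text, record_name="avrorec"):
    """Build an Avro-style record schema (as a dict) from CSV text with a header line."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    columns = list(zip(*data))  # transpose the data rows into columns
    fields = [
        {"name": name, "type": infer_type(column)}
        for name, column in zip(header, columns)
    ]
    return {"type": "record", "name": record_name, "fields": fields}


# Made-up output in the shape our QueryRecord result would have.
SAMPLE_CSV = "movie_id,rating\n1,4.5\n2,3.0\n"

if __name__ == "__main__":
    print(json.dumps(infer_schema(SAMPLE_CSV), indent=2))
```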



ConvertCSVToAvro


Since we now have a populated Avro schema in hand, we can use it in a ConvertCSVToAvro processor.



We shall set the value of Record Schema as

${inferred.avro.schema}

Also under Settings tab, we have to choose relationships of type failure and incompatible for termination.


Connect InferAvroSchema to ConvertCSVToAvro, using the success relationship.

Our last connection will be from ConvertCSVToAvro to the PutHDFS processor, also using the success relationship.

And we're done.


Test the Flow


Now that we're ready, we can copy u.data file to the folder where GetFile is listening.

And see what happens.


I see the two files above under Files view of Ambari. So it worked for me.


I can see that the most popular movie's id is 408.

Let's check which title that is. We can look it up in the u.item file, which is another file in the MovieLens dataset.
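
One way to do the lookup is a small script. As far as I know, u.item is pipe-delimited with the movie id and title as its first two fields, and it is encoded in ISO-8859-1, so treat both as assumptions to verify against the dataset's README. The sample lines and titles below are placeholders, not real entries, and load_titles and titles_from_file are names made up for this sketch.

```python
def load_titles(lines):
    """Map movie id -> title from pipe-delimited lines (id first, title second)."""
    titles = {}
    for line in lines:
        parts = line.rstrip("\n").split("|")
        if len(parts) >= 2 and parts[0].isdigit():
            titles[int(parts[0])] = parts[1]
    return titles


def titles_from_file(path):
    """Read a u.item-style file; latin-1 is assumed here for the encoding."""
    with open(path, encoding="latin-1") as handle:
        return load_titles(handle)


# Placeholder lines in the assumed u.item layout (not real dataset entries).
SAMPLE_LINES = [
    "1|Example Movie One (1995)|01-Jan-1995||http://example.invalid/one|0|1\n",
    "2|Example Movie Two (1996)|01-Jan-1996||http://example.invalid/two|1|0\n",
]

if __name__ == "__main__":
    titles = load_titles(SAMPLE_LINES)
    print(titles.get(1))
```

With the real dataset, titles_from_file("u.item").get(408) would return the title for movie id 408.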

So, here it is :


Were you also expecting something else ?
