Posts

One challenge with 10 solutions

The technologies we use for data analytics have evolved a lot recently. Good old relational database systems become less popular every day. Now we have to find our way through several new technologies that can handle big (and streaming) data, preferably in distributed environments. Python is all the rage now, but of course there are lots of alternatives as well. SQL will always shine, and some other oldies-but-goldies, which we should never underestimate, are still out there. So there is really a wide range of alternatives. Let's ramble through some of them, shall we?

I'll define a simple challenge in this post and provide ten solutions written in ten different technologies: Awk, Perl, Bash, SQL, Python, MapReduce, Pig, Hive, Scala, and MongoDB. Together they represent the last 30+ years! Using these technologies, we'll list the 10 most favorite movies, using the two CSV datasets provided by the Grouplens website.

The dataset

We'll use Mov

NiFi Revisited: Aggregate Movie Ratings Data To Find Top 10 Movies

This post is a sample of data aggregation in NiFi. If you have just started learning NiFi, check this blog post, which is a much more detailed sample than this one.

Our goal is to: fetch the movie ratings data; calculate the average rating per movie; find the top 10 rated movies; and export the top 10 list in both CSV and Avro formats.

Download Sample Dataset

The Movielens dataset is available on the Grouplens website. In this challenge, we'll use the MovieLens 100K Dataset. Download the zip file and extract the "u.data" file. u.data is a tab-delimited file that keeps the ratings and contains four columns: user_id (int), movie_id (int), rating (int), time (int). Keep this file until we test our NiFi flow.

GetFile

Create a GetFile processor and point it to a local folder you created, to fetch input files.

Input Directory: /home/oguz/Documents/Olric/File_Source/

UpdateAttribute to alter the file name

Add an UpdateAttribute processor with filena
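Outside NiFi, the aggregation goal described in this excerpt (average rating per movie, then the top 10) can be sketched in a few lines of Python. This is only an illustration of the logic, not the NiFi flow from the post: the function name `top_rated_movies` and the tiny sample rows are made up for the example, and the only thing taken from the post is the four-column, tab-delimited layout of u.data.

```python
import csv
from collections import defaultdict
from io import StringIO


def top_rated_movies(rows, n=10):
    """Return up to n (movie_id, average_rating) pairs, highest average first."""
    totals = defaultdict(lambda: [0, 0])  # movie_id -> [sum of ratings, count]
    for _user_id, movie_id, rating, _time in rows:
        entry = totals[int(movie_id)]
        entry[0] += int(rating)
        entry[1] += 1
    averages = [(movie_id, s / c) for movie_id, (s, c) in totals.items()]
    # Highest average first; ties are broken by the lower movie_id.
    averages.sort(key=lambda pair: (-pair[1], pair[0]))
    return averages[:n]


# Made-up rows in the u.data layout: user_id, movie_id, rating, time.
SAMPLE = (
    "1\t10\t5\t881250949\n"
    "2\t10\t3\t891717742\n"
    "1\t20\t4\t878887116\n"
    "3\t30\t2\t880606923\n"
)

rows = csv.reader(StringIO(SAMPLE), delimiter="\t")
top = top_rated_movies(rows, n=2)
print(top)
```

To run it against the real file, the `StringIO(SAMPLE)` source could be swapped for `open("u.data", newline="")`, assuming the file sits in the working directory.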

Get Started with NiFi: Partitioning CSV files based on column value

This tutorial demonstrates how an incoming data file can be divided into multiple files based on a column value, using Apache NiFi.

The NiFi flow will: fetch files from a local folder; divide the content rows into several files using the PartitionRecord processor; modify the file name to include the column value used for partitioning; and upload the output files to HDFS.

Here we go.

Download Sample Dataset

The Movielens dataset is available on the Grouplens website. In this challenge, we'll use the MovieLens 100K Dataset. Download the zip file and extract the "u.data" file. u.data is a tab-delimited file that keeps the ratings and contains four columns: user_id (int), movie_id (int), rating (int), time (int). Keep this file until we test our NiFi flow.

Create a Process Group

Click "New Process Group" to create a group named "grp_Partition_Sample", or whatever you like, actually. Before developing the flow, we need t
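As a rough picture of what partitioning by column value means, here is a minimal Python sketch that groups rows by one column and derives an output file name per group. It is not how PartitionRecord is implemented. The helper names `partition_by_column` and `partition_file_name`, the choice of the rating column, and the `name_value.ext` naming scheme are all assumptions made for illustration; the excerpt does not say which column or file name pattern the post uses.

```python
import os
from collections import defaultdict


def partition_by_column(lines, column_index, delimiter="\t"):
    """Group delimited lines by the value found at column_index."""
    groups = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        key = line.split(delimiter)[column_index]
        groups[key].append(line)
    return dict(groups)


def partition_file_name(base_name, key):
    """Hypothetical naming scheme: insert the column value before the extension."""
    stem, ext = os.path.splitext(base_name)
    return f"{stem}_{key}{ext}"


# Made-up rows in the u.data layout: user_id, movie_id, rating, time.
sample_lines = [
    "1\t10\t5\t881250949\n",
    "2\t10\t3\t891717742\n",
    "1\t20\t5\t878887116\n",
]

# Assumption: partition on the rating column (index 2).
groups = partition_by_column(sample_lines, column_index=2)
for key, rows in sorted(groups.items()):
    print(partition_file_name("u.data", key), len(rows))
```

Each group would then be written to its own file; the sketch stops at grouping and naming so that it has no side effects on disk.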