NiFi Revisited : Aggregate Movie Ratings Data To Find Top 10 Movies


This post walks through a sample data-aggregation flow in NiFi.

If you've just started learning NiFi, check this blog post, which is a much more detailed walkthrough than this one.

Our goal is :

  1. Fetch the movie ratings data
  2. Calculate average rating per movie
  3. Find the top 10 rated movies
  4. Export the top 10 list in both CSV and AVRO formats.

Download Sample Dataset

The MovieLens dataset is available on the GroupLens website.
 
In this challenge, we'll use the MovieLens 100K dataset. Download the zip file and extract the "u.data" file.
u.data is a tab-delimited file that holds the ratings and contains four columns :

user_id (int), movie_id (int), rating (int), time (int)

Keep this file until we test our NiFi flow. 
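If you want a quick look at the file before building the flow, a couple of standard shell commands are enough. The exact values will differ per row; the layout simply follows the four columns listed above.

head -3 u.data     # first three lines: user_id<TAB>movie_id<TAB>rating<TAB>timestamp
wc -l u.data       # the 100K dataset should contain 100,000 rating rows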

GetFile

Create a GetFile processor and point it to a local folder you created; it will fetch the input files from there.



Input Directory /home/oguz/Documents/Olric/File_Source/



UpdateAttribute to alter the file name


Add an UpdateAttribute processor with filename attribute set to "Top10Movies.csv"


Connect GetFile processor to UpdateAttribute.


Source & Target Schemas


We'll now create a Controller Service of type AvroSchemaRegistry, and create two schemas inside this registry :


InputSchema
{     "type": "record",
     "namespace": "movies",
     "name": "movie",
     "fields": [

       { "name": "user_id", "type": ["int", "null"] },
       { "name": "movie_id", "type": ["int", "null"] },
       { "name": "rating", "type": ["float", "null"] },
       { "name": "timestamp", "type": ["int", "null"] }
     ]


OutputSchema
{     "type": "record",
     "namespace": "movies",
     "name": "movie",
     "fields": [

       { "name": "movie_id", "type": ["int", "null"] },
       { "name": "rating", "type": ["float", "null"] }
     ]

Although rating is actually an integer in the source files, we read it as a float, since our average calculation will produce float values.

CSVReader


Add a controller service of type CSVReader and configure it as follows :




Schema Access Strategy Use Schema Name Property
Schema Registry AvroSchemaRegistry
Schema Name InputSchema
CSV Format Tab Delimited
Treat First Line as Header True
Ignore CSV Header Column Names True


CSVRecordSetWriter

Create a CSVRecordSetWriter as well and configure it as seen below.



Schema Write Strategy Set schema.name Attribute
Schema Access Strategy Use Schema Name Property
Schema Registry AvroSchemaRegistry
Schema Name OutputSchema
CSV Format Custom Format

QueryRecord to aggregate flowfiles


QueryRecord allows us to query incoming data with SQL as if we were working on a relational database. Add a QueryRecord processor and configure it as follows.


Record Reader CSVReader
Record Writer CSVRecordSetWriter
summary
select movie_id, rating
from (
    select movie_id, avg(rating) as rating, count(user_id) as cnt
    from FLOWFILE
    group by movie_id
)
where cnt >= 100
order by rating desc
limit 10

The dynamic property named "summary" defines a new relationship with the same name; its value is the SQL that runs against each incoming flowfile. The inner query computes the average rating and the rating count per movie, the cnt >= 100 filter keeps only movies with at least 100 ratings, and the outer query returns the ten highest-rated ones.

Connect UpdateAttribute processor to QueryRecord processor.


Also, navigate to the Settings tab of QueryRecord processor, and choose Original and Failure relationships in the section "Automatically Terminate Relationships". Otherwise, these will remain as unhandled relationships.
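If you want a sanity check outside NiFi, the same ranking can be computed directly from u.data with standard command-line tools. This is only a rough sketch mirroring the query above (average rating per movie, at least 100 ratings, top 10); the output may differ slightly from the NiFi flow because of header handling and formatting.

awk -F'\t' '{sum[$2]+=$3; cnt[$2]++} END {for (m in sum) if (cnt[m]>=100) printf "%s\t%.4f\n", m, sum[m]/cnt[m]}' u.data | sort -t$'\t' -k2,2nr | head -n 10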

PutHDFS to upload target files into Hadoop


Now add a PutHDFS processor to upload our results into HDFS. The PutHDFS processor is configured as below.


Hadoop Configuration Resources /etc/hadoop/3.1.4.0-315/0/hdfs-site.xml,/etc/hadoop/3.1.4.0-315/0/core-site.xml
Directory /user/oguz/Top10Movies

Hadoop Configuration Resources shall include full paths to the hdfs-site and core-site XML files, which exist on any HDP node.

Directory is the target directory in HDFS. Ensure that the user you run NiFi as has permission to write to this folder. If the directory does not exist, NiFi will create it automatically.
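If you prefer to create the target directory and set permissions up front instead of relying on auto-creation, something along these lines should do; adjust the user and path to your environment, and run the chown as the hdfs superuser if needed.

hdfs dfs -mkdir -p /user/oguz/Top10Movies
hdfs dfs -chown oguz:hdfs /user/oguz/Top10Movies   # make sure the NiFi user can write here
hdfs dfs -ls /user/oguz                            # verify the directory exists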


Connect the QueryRecord processor to the PutHDFS processor, using the summary relationship.




This flow will produce the result in CSV format and upload it to HDFS.

But we are not done yet, since we also need the same result in AVRO format.

InferAvroSchema to populate schema from flowfiles


InferAvroSchema makes it easy to populate an AVRO schema, using the incoming files.

Add an InferAvroSchema processor. Most properties will be left at their defaults; we'll make just two changes :

Choose flowfile-attribute from the Schema Output Destination drop-down.

Also, set the Avro Record Name to "avrorec".

Terminate the failure, original, and unsupported content relationships automatically in the InferAvroSchema processor, under the Settings tab.

Connect the QueryRecord processor to the InferAvroSchema processor, again using the summary relationship.



ConvertCSVToAvro


Since we now have a populated Avro schema in hand, we can use it in a ConvertCSVToAvro processor.



Set the value of Record Schema to

${inferred.avro.schema}

Also, under the Settings tab, terminate the failure and incompatible relationships.


Connect InferAvroSchema to ConvertCSVToAvro using the success relationship.

Our last connection will be from ConvertCSVToAvro to the PutHDFS processor, also using the success relationship.

And we're done.


Test the Flow


Now that we're ready, we can copy u.data file to the folder where GetFile is listening.

And see what happens.


I see the two files above under Files view of Ambari. So it worked for me.


I can see that the top-rated movie's id is 408.

Let's find out which title that is. We can look it up in u.item, another file in the MovieLens dataset.
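u.item is pipe-delimited, with the movie id in the first field and the title in the second (field positions as described in the MovieLens 100K README), so a one-liner should reveal the title :

awk -F'|' '$1 == 408 {print $2}' u.item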

So, here it is :


Were you also expecting something else ?

Get Started with NiFi : Partitioning CSV files based on column value


This tutorial demonstrates how an incoming data file can be divided into multiple files based on a column value, using Apache NiFi.



The NiFi flow will :
  • Fetch files from a local folder
  • Divide the content rows into several files using PartitionRecord processor
  • Modify the file name to include the column value which is used for partitioning
  • And upload the output files to HDFS


Here we go.

Download Sample Dataset

The MovieLens dataset is available on the GroupLens website.
 
In this challenge, we'll use the MovieLens 100K dataset. Download the zip file and extract the "u.data" file.
u.data is a tab-delimited file that holds the ratings and contains four columns :

user_id (int), movie_id (int), rating (int), time (int)

Keep this file until we test our NiFi flow.

Create a Process Group


Click "New Process Group" to create a group named "grp_Partition_Sample", or, whatever you like, actually.


Before developing the flow, we need to create a local folder named "File_Source". I created mine at the following path :
/home/oguz/Documents/Olric/File_Source/

Now let's double-click the process group. We have a clean canvas to work on.

GetFile


To add our first processor, we'll drag the processor icon from the toolbar and drop it somewhere in the canvas.



A pop-up window will be shown. All we have to do is find the GetFile processor from the list and click Apply to add it.


Double click the GetFile processor to edit its attributes. The only thing we'll change here is the "Input Directory" attribute. We'll put the full path of our File_Source folder here :
/home/oguz/Documents/Olric/File_Source/


Now we have a processor that will fetch files from the File_Source folder. It checks this folder continuously and fetches every file it finds, to be processed in the NiFi flow. Since the "Keep Source File" property is set to "false" by default, it will *NOT* copy the files; the folder will be emptied. Unless we process the contents and put them somewhere else, the file is gone.

Of course, our processor is not started, so it's not active yet. We'll start the processor later. Keep in mind that a running processor can't be reconfigured; we have to stop processors to configure them.


Create CSV Schema For Source Files


Our source files are CSV files, and NiFi needs to know more about them : which columns are included, what their data types are, which column separator is used, and so on. All of these attributes are defined as a schema in NiFi.

We do this by creating a controller service of type AvroSchemaRegistry

A controller service is a shared configuration, which you can use in multiple processors.

A controller service of type AvroSchemaRegistry is used to hold schema definitions for CSV or Avro files.

So let's go ahead and create it. Check the Operate panel on the left side. If no processors are selected, it should show the name of the process group. If you see a processor's name there, click on an empty area so that the process group is shown instead. Then click the "Configuration" button in the panel.


The pop-up window has two tabs. Click the plus sign under "Controller Services" tab and add an "AvroSchemaRegistry" service. Click the configure button of the newly added service.


In the pop-up window, navigate to Properties tab, and click the plus sign to add a new schema.
This controller service can contain multiple schemas, but we'll create one schema for now.
Name it as "InputSchema", and paste the following text block as the value.

{     "type": "record",
     "namespace": "movies",
     "name": "movie",
     "fields": [

       { "name": "user_id", "type": ["int", "null"] },
       { "name": "movie_id", "type": ["int", "null"] },
       { "name": "rating", "type": ["int", "null"] },
       { "name": "timestamp", "type": ["int", "null"] }
     ]
}

If needed, you can enlarge the pop-up editor pane from its bottom-right corner. Note that hitting Enter closes the editor pane, so use Shift + Enter for new lines while editing (and Shift + Tab instead of Tab).

Create CSV Schema for Target Files

As mentioned above, the same "AvroSchemaRegistry" controller service can host multiple schemas. Let's define our target schema as well.

Click the configure button of the service once more. Add another schema and name it as OutputSchema. Let's assume that we don't need the last column, timestamp, in the output. Then our output schema will be :

{     "type": "record",
     "namespace": "movies",
     "name": "movie",
     "fields": [

       { "name": "user_id", "type": ["int", "null"] },
       { "name": "movie_id", "type": ["int", "null"] },
       { "name": "rating", "type": ["int", "null"] }
     ]
}

Once we've added both schemas, we can click the Enable icon to enable our AvroSchemaRegistry.

Controller services, just like processors, can only be configured when disabled.


Here is how our AvroSchemaRegistry should look :


CSV Reader and CSV Writer

Our NiFi flow will split the incoming flowfile into multiple flowfiles, based on movie_id column. This is done with a PartitionRecord processor.

To do that, it needs two controller services : a CSVReader and a CSVRecordSetWriter. Let's add both.

Once added, configure the CSV Reader as follows :


Schema Access Strategy Use Schema Name Property
Schema Registry AvroSchemaRegistry
Schema Name InputSchema
CSV Format Tab Delimited
Treat First Line as Header True
Ignore CSV Header Column Names True

So we declared that our source files are tab-delimited CSV files containing a header line, but we'll ignore the field names in that header line and use the column names from InputSchema instead.
And here's how we configure our CSVRecordSetWriter.


Schema Write Strategy Set schema.name Attribute
Schema Access Strategy Use Schema Name Property
Schema Registry AvroSchemaRegistry
Schema Name OutputSchema
CSV Format Custom Format


So, our output CSV will be a comma-delimited file, and it will not contain the timestamp column of the tab-delimited source files.

Make sure you enable all three controller services before you close the configuration window.


PartitionRecord to Split Files

Now it's time to divide our flowfiles, based on movie_id column.

Add a PartitionRecord processor. Configure it as shown below.


Record Reader CSVReader
Record Writer CSVRecordSetWriter
movie_id /movie_id


The next thing we'll do is build a connection between these two processors. Hover over the GetFile processor; an arrow will appear on top of it. Drag this arrow icon and drop it on the PartitionRecord processor.

A pop-up window will show up.



This connection is built for the "success" relationship type by default, which is OK.

This means only successful flowfiles will follow this path. Failures can also be handled, but that is out of scope for this post.




Now we have a flow with two steps. The GetFile processor has a red square sign, which means the processor has no issues but is not active either.
The PartitionRecord processor has a yellow warning sign. We can see the warnings if we hover over this sign.

It says relationship "success" is invalid because it's not terminated yet. Fair enough; we haven't finalized our flow yet.

Let's take a look at the connection as well. It is a "success" connection, and there are no flowfiles in its queue.

If a processor is not active while the previous processors are, we can see flowfiles queued up in the connection. You can right-click on the connection and list the files in the queue. You can even view their contents, which is really useful when you need to debug your flow.

UpdateAttribute to alter file names


Add a third processor of type UpdateAttribute. We'll add one property, filename, and set it to :

MovieRatings_${movie_id}.csv


This will ensure that we have movie_id value added to target file names.

Connect the UpdateAttribute processor to the end of our flow, using the success relationship.


PutHDFS to upload target files into Hadoop


Our fourth and last processor will be of type PutHDFS. You may prefer a PutFile processor instead if you don't have access to an HDFS environment right now.

Our PutHDFS processor is configured as below.


Hadoop Configuration Resources /etc/hadoop/3.1.4.0-315/0/hdfs-site.xml,/etc/hadoop/3.1.4.0-315/0/core-site.xml
Directory /user/oguz/Movie_Rating_Details

Hadoop Configuration Resources shall include full paths to the hdfs-site and core-site XML files, which exist on any HDP node.

Directory is the target directory in HDFS. Ensure that the user you run NiFi as has permission to write to this folder. If the directory does not exist, NiFi will create it automatically.

Also, navigate to the Settings tab of PutHDFS processor, and choose Success and Failure relationships in the section "Automatically Terminate Relationships". Otherwise, these will remain as unhandled relationships.

Test the Flow


Click on any empty place on the canvas and choose Start. This will start our NiFi flow. Here's what it should look like :


Now copy the u.data file to the source folder we created (/home/oguz/Documents/Olric/File_Source/).
It will disappear in a few seconds. Then you'll -hopefully- see the output files uploaded to HDFS.
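If you prefer the command line over the Files View, a quick listing confirms the partitioned output. The paths below match the PutHDFS configuration above; the exact file names depend on the movie ids in your data.

hdfs dfs -ls /user/oguz/Movie_Rating_Details | head      # one MovieRatings_<movie_id>.csv per movie
hdfs dfs -ls /user/oguz/Movie_Rating_Details | wc -l     # roughly one file per distinct movie_id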


Debug the Flow


If things go wrong, I recommend stopping all processors and starting them one by one. This gives you the chance to take a look at the flowfiles waiting in the queues.

The right-click menu of a connection gives you the option to review the files in its queue.

Also, you can empty the queue to reset the flow.

Good luck!

Advanced SQL Challenge


The Challenge

Here is a challenge for SQL enthusiasts.
I'll solve it here using PostgreSQL 10.10 on Kubuntu 18.04; but feel free to give it a try in your favorite RDBMS.
We have a simple table with two columns : column dt is a date, and column rt is an exchange rate, holding real USD/TRY values from July/August 2018. That's a period where this rate hit a peak.


The table contains data between 16 July and 17 August 2018, but it's far from complete : it actually contains only the Tuesdays and Thursdays, plus the first day (16 July) and the last day (17 August).
Our goal is to calculate substitute values for each of the missing days. The result dataset shall contain all dates between 16 July and 17 August 2018.




usd_tr_rates table contents 



dt rt
07/16/18 4.8516
07/17/18 4.8525
07/19/18 4.8389
07/24/18 4.7933
07/26/18 4.8342
07/31/18 4.9161
08/02/18 5.0671
08/07/18 5.2808
08/09/18 5.4167
08/14/18 6.5681
08/16/18 5.8174
08/17/18 6.0142



Load Source Table Data

Below you can find the DDL/DML statements to prepare the source data in PostgreSQL. The date literals are in ISO format, so at least the insert statements should work for other databases as well.


CREATE DATABASE olric;
\c olric;
CREATE TABLE usd_tr_rates (dt date, rt decimal(18,8));
insert into usd_tr_rates (dt, rt) values ('2018-07-16', 4.8516);
insert into usd_tr_rates (dt, rt) values ('2018-07-17', 4.8525);
insert into usd_tr_rates (dt, rt) values ('2018-07-19', 4.8389);
insert into usd_tr_rates (dt, rt) values ('2018-07-24', 4.7933);
insert into usd_tr_rates (dt, rt) values ('2018-07-26', 4.8342);
insert into usd_tr_rates (dt, rt) values ('2018-07-31', 4.9161);
insert into usd_tr_rates (dt, rt) values ('2018-08-02', 5.0671);
insert into usd_tr_rates (dt, rt) values ('2018-08-07', 5.2808);
insert into usd_tr_rates (dt, rt) values ('2018-08-09', 5.4167);
insert into usd_tr_rates (dt, rt) values ('2018-08-14', 6.5681);
insert into usd_tr_rates (dt, rt) values ('2018-08-16', 5.8174);
insert into usd_tr_rates (dt, rt) values ('2018-08-17', 6.0142);


Required Output

Your SQL shall generate the following output.


dt_new      rt_new
2018-07-16  4.85160000000000000000
2018-07-17  4.85250000000000000000
2018-07-18  4.84570000000000000000
2018-07-19  4.83890000000000000000
2018-07-20  4.82978000000000000000
2018-07-21  4.82066000000000000000
2018-07-22  4.81154000000000000000
2018-07-23  4.80242000000000000000
2018-07-24  4.79330000000000000000
2018-07-25  4.81375000000000000000
2018-07-26  4.83420000000000000000
2018-07-27  4.85058000000000000000
2018-07-28  4.86696000000000000000
2018-07-29  4.88334000000000000000
2018-07-30  4.89972000000000000000
2018-07-31  4.91610000000000000000
2018-08-01  4.99160000000000000000
2018-08-02  5.06710000000000000000
2018-08-03  5.10984000000000000000
2018-08-04  5.15258000000000000000
2018-08-05  5.19532000000000000000
2018-08-06  5.23806000000000000000
2018-08-07  5.28080000000000000000
2018-08-08  5.34875000000000000000
2018-08-09  5.41670000000000000000
2018-08-10  5.64698000000000000000
2018-08-11  5.87726000000000000000
2018-08-12  6.10754000000000000000
2018-08-13  6.33782000000000000000
2018-08-14  6.56810000000000000000
2018-08-15  6.19275000000000000000
2018-08-16  5.81740000000000000000



Solution (PostgreSQL)

My solution in PostgreSQL is as follows. The idea is linear interpolation : rates1 uses a window function to look ahead to the next known date and rate for each row, rates2 computes the day gap between them, and generate_series spreads the rate difference evenly across that gap. For example, 18 July sits one day into the two-day gap between 17 July (4.8525) and 19 July (4.8389), so it gets 4.8525 + 1 * (4.8389 - 4.8525) / 2 = 4.8457, which matches the required output.
with rates1 as
    (
    select dt, rt,
        max(dt) over (partition by 1 order by dt
            rows between 1 following and 1 following) next_dt,
        max(rt) over (partition by 1 order by dt
            rows between 1 following and 1 following) next_rt
    from usd_tr_rates
    ),
    rates2 as
    (
    select dt, rt, next_dt, next_rt,
        next_dt - dt multip
    from rates1
    ),
    series as
    (
    select * from generate_series(1,10)
    )
select
    rates2.dt+series.generate_series-1 dt_new,
    rates2.rt+((series.generate_series-1) * ((next_rt - rt)/multip)) rt_new
from
    rates2
join
    series
on
    series.generate_series<=rates2.multip
order by
    dt, generate_series;
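To run everything end to end, one option is to save the statements into two files and feed them to psql. The file names below are just placeholders, not part of the challenge.

psql -d postgres -f create_and_load.sql      # the CREATE DATABASE / CREATE TABLE / INSERT block above
psql -d olric -f interpolate_rates.sql       # the solution query; compare its output with the required output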

Install a single-node Hortonworks Data Platform (HDP) version 3.1.4 on Kubuntu 18.04






This post will lead you through the setup process of a single-node Hortonworks Data Platform, working on a Kubuntu 18.04 workstation.

Of course you can easily download a Hortonworks Sandbox image here.
Having said that, one can also prefer to get their hands dirty and build it from scratch.
So here we go.

Scope

  • Ambari 2.7.3
  • HDP 3.1.4
    • HDFS 3.1.1
    • YARN 3.1.1
    • MapReduce2 3.1.1
    • HBase 2.0.2
    • ZooKeeper 3.4.6
    • Ambari Metrics 0.1.0
    • SmartSense 1.5.1.2.7.3.0-139
We'll first install Ambari Server. 
Then we'll register an Ambari Agent, using the Ambari console.
During this registration, we'll choose the services above to be installed. This is a subset - for now we'll leave some important services out of scope, such as Hive, Pig, Spark, and Kafka. 
But we'll come back to them soon.

Decide : Check the support matrix

The support matrix for Hortonworks Data Platform is here.
Choose your OS version under Operating systems. See which Ambari and HDP versions are supported for your OS.



I have Ubuntu 18.04, and I will install HDP 3.1.4 using Ambari 2.7.3.
Don't assume a newer version of your favorite OS will be OK. Stick to the versions listed in the support matrix. The author of this blog lost some time trying to install HDP on Ubuntu 19, which is too new to be supported.

Prepare 1/6 : Maximum Open Files Requirements

The upper limit for open file descriptors on your system should not be less than 10,000.

First let's check the current settings. -Sn shows soft limits, whereas -Hn shows hard limits.



ulimit -Sn

ulimit -Hn



And here's how we change them.

ulimit -n 10000

The hard limit was probably already higher than 10,000, but we don't have to worry about that. These changes will be lost with the next reboot, but we only need them during the setup process anyway.

Check here for some nice details.






Prepare 2/6 : Setup password-less SSH

As you may already know, Ambari helps us administer all the nodes in a Hadoop cluster. Each of these nodes needs an Ambari agent installed.
The Ambari server can install these agents on the nodes - but only if an SSH connection can be established.
Here we have a single node, which is also where the Ambari server is installed. But we still need this SSH connection.
Now let's register our laptop as a trusted ssh connection for ... our laptop!
  • Generate SSH keys. Leave the passphrase empty.

oguz@dikanka:~$ sudo -i
root@dikanka:~# ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):
Created directory '/root/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
...
  • Add the public key to authorized keys file. 
root@dikanka:~# cd .ssh
root@dikanka:~/.ssh# cat id_rsa.pub >> authorized_keys
  • Change permissions of .ssh folder and authorized keys file.
root@dikanka:~/.ssh# cd ..
root@dikanka:~# chmod 700 .ssh  
root@dikanka:~# chmod 600 .ssh/authorized_keys 
  •  Check the results
root@dikanka:~# ssh dikanka  
ssh: connect to host dikanka port 22: Connection refused
  •  This happens when :
    •  openssh is not installed. Install it as follows :
sudo apt-get update
sudo apt-get install openssh-server
    •  port 22 is blocked by the firewall. Allow this port as below :
sudo ufw allow 22
Rules updated
Rules updated (v6)
  • Check the results. When asked, type “yes” and press enter to confirm adding the server to known hosts.
root@dikanka:~# ssh dikanka
The authenticity of host 'dikanka (127.0.1.1)' can't be established.
ECDSA key fingerprint is ...
Are you sure you want to continue connecting (yes/no)? yes

Warning: Permanently added 'dikanka' (ECDSA) to the list of known hosts.
Welcome to Ubuntu 18.04.3 LTS (GNU/Linux 5.0.0-25-generic x86_64)
...
  • All seems OK. Type “exit” to close the ssh connection.
root@dikanka:~# exit
logout
Connection to dikanka closed.

Prepare 3/6 : Enable NTP on the Cluster and on the Browser Host

The clocks of all the nodes in your cluster and the machine that runs the browser through which you access the Ambari Web interface must be able to synchronize with each other.
First let's check if ntp is running :
oguz@dikanka:~$ ntpstat
Unable to talk to NTP daemon. Is it running?
It's not. The following command will start ntp service.
sudo service ntp start

And the following command will ensure that it gets automatically started during boot.
sudo update-rc.d ntp defaults

 Now it should be up and running :
oguz@dikanka:~$ ntpstat
synchronised to NTP server (213.136.0.252) at stratum 2
    time correct to within 943 ms
    polling server every 64 s


Prepare 4/6 : Configuring iptables

For Ambari to communicate during setup with the hosts it deploys to and manages, certain ports must be open and available. The easiest way to do this is to temporarily disable iptables, as follows:


sudo -i
ufw disable
iptables -F
iptables -X
iptables -t nat -F
iptables -t nat -X
iptables -t mangle -F
iptables -t mangle -X
iptables -P INPUT ACCEPT
iptables -P FORWARD ACCEPT
iptables -P OUTPUT ACCEPT


Prepare 5/6 : Umask

The umask command sets the default permissions of newly created files and folders.
umask 022
A umask of 022 gives default permissions of 755 for new directories and 644 for new regular files.
If this numeric code does not ring a bell for you, I'd strongly suggest learning more about Linux file permissions. You can try Wikipedia.
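A quick way to see the effect yourself, using throwaway names :

umask 022
mkdir demo_dir      # directories get 777 minus 022 = 755 (drwxr-xr-x)
touch demo_file     # regular files get 666 minus 022 = 644 (-rw-r--r--)
ls -ld demo_dir demo_file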


Prepare 6/6 : Repository Connection

Now we'll connect to the Hortonworks software repository in order to install the Ambari server.


oguz@dikanka:~$sudo -i
root@dikanka:~#wget -O /etc/apt/sources.list.d/ambari.list http://public-repo-1.hortonworks.com/ambari/ubuntu18/2.x/updates/2.7.3.0/ambari.list
--2019-08-26 15:01:44--  http://public-repo-1.hortonworks.com/ambari/ubuntu18/2.x/updates/2.7.3.0/ambari.list
Resolving public-repo-1.hortonworks.com (public-repo-1.hortonworks.com)... 13.224.132.59, 13.224.132.44, 13.224.132.74, . ...

 2019-08-26 15:01:44 (10,8 MB/s) - ‘/etc/apt/sources.list.d/ambari.list’ saved [187/187] 


root@dikanka:~#
apt-key adv --recv-keys --keyserver keyserver.ubuntu.com B9733A7A07513CAD 

Executing: /tmp/apt-key-gpghome.DmGQnBiPeG/gpg.1.sh --recv-keys --keyserver keyserver.ubuntu.com B9733A7A07513CAD
gpg: key B9733A7A07513CAD: public key "Jenkins (HDP Builds) <jenkin@hortonworks.com>" imported
gpg: Total number processed: 1
gpg:               imported: 1


Let's update our software repository and see if we can locate ambari packages.


apt-get update

apt-cache showpkg ambari-server
apt-cache showpkg ambari-agent
apt-cache showpkg ambari-metrics-assembly




Ambari 1/3: Install Ambari Server

We'll now install the Ambari server, then configure and start it. It should be easier than you think.
apt-get install ambari-server

Success. You can now start the database server using:

   /usr/lib/postgresql/10/bin/pg_ctl -D /var/lib/postgresql/10/main -l logfile start

Ver Cluster Port Status Owner    Data directory              Log file
10  main    5432 down   postgres /var/lib/postgresql/10/main /var/log/postgresql/postgresql-10-main.log
update-alternatives: using /usr/share/postgresql/10/man/man1/postmaster.1.gz to provide /usr/share/man/man1/postmaster
.1.gz (postmaster.1.gz) in auto mode
Setting up postgresql (10+190) ...
Setting up ambari-server (2.7.3.0-139) ...
Processing triggers for ureadahead (0.100.0-21) ...
Processing triggers for systemd (237-3ubuntu10.24) ...

Ambari 2/3: Setup Ambari Server

 ambari-server setup

  • When asked, type 1 to download and install Oracle JDK
  • Type y to confirm the Oracle license agreement
  • Type n (or leave empty) when asked to download LZO packages
  • Type y when asked to enter advanced database configuration. (The default is the embedded PostgreSQL. We won't actually change the defaults, but this way we'll be shown the db name, user name, password, etc.)


Using python  /usr/bin/python
Setup ambari-server
Checking SELinux...
WARNING: Could not run /usr/sbin/sestatus: OK
Customize user account for ambari-server daemon [y/n] (n)?
Adjusting ambari-server permissions and ownership...
Checking firewall status...
Checking JDK...
[1] Oracle JDK 1.8 + Java Cryptography Extension (JCE) Policy Files 8
[2] Custom JDK
==============================================================================
Enter choice (1): 1
To download the Oracle JDK and the Java Cryptography Extension (JCE) Policy Files you must accept the license terms fo
und at http://www.oracle.com/technetwork/java/javase/terms/license/index.html and not accepting will cancel the Ambari
Server setup and you must install the JDK and JCE files manually.
Do you accept the Oracle Binary Code License Agreement [y/n] (y)? y

Downloading JDK from http://public-repo-1.hortonworks.com/ARTIFACTS/jdk-8u112-linux-x64.tar.gz to /var/lib/ambari-serv
er/resources/jdk-8u112-linux-x64.tar.gz
jdk-8u112-linux-x64.tar.gz... 100% (174.7 MB of 174.7 MB)
Successfully downloaded JDK distribution to /var/lib/ambari-server/resources/jdk-8u112-linux-x64.tar.gz
Installing JDK to /usr/jdk64/
Successfully installed JDK to /usr/jdk64/
Downloading JCE Policy archive from http://public-repo-1.hortonworks.com/ARTIFACTS/jce_policy-8.zip to /var/lib/ambari
-server/resources/jce_policy-8.zip

Successfully downloaded JCE Policy archive to /var/lib/ambari-server/resources/jce_policy-8.zip
Installing JCE policy...
Check JDK version for Ambari Server...
JDK version found: 8
Minimum JDK version is 8 for Ambari. Skipping to setup different JDK for Ambari Server.
Checking GPL software agreement...
GPL License for LZO: https://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
Enable Ambari Server to download and install GPL Licensed LZO packages [y/n] (n)? n
Completing setup...
Configuring database...
Enter advanced database configuration [y/n] (n)? y
Configuring database...
==============================================================================
Choose one of the following options:
[1] - PostgreSQL (Embedded)
[2] - Oracle
[3] - MySQL / MariaDB
[4] - PostgreSQL
[5] - Microsoft SQL Server (Tech Preview)
[6] - SQL Anywhere
[7] - BDB
==============================================================================
Enter choice (1):
Database admin user (postgres):
Database name (ambari):
Postgres schema (ambari):
Username (ambari):
Enter Database Password (bigdata):
Default properties detected. Using built-in database.
Configuring ambari database...
Checking PostgreSQL...
Configuring local database...
Configuring PostgreSQL...
Restarting PostgreSQL
Creating schema and user...
done.
Creating tables...
done.
Extracting system views...
....ambari-admin-2.7.3.0.139.jar

Ambari repo file contains latest json url http://public-repo-1.hortonworks.com/HDP/hdp_urlinfo.json, updating stacks r
epoinfos with it...
Adjusting ambari-server permissions and ownership...
Ambari Server 'setup' completed successfully.

The Ambari server will use a PostgreSQL database for its repository. Note down the default database name and user credentials configured for this database.


Ambari 3/3: Start Ambari Server

ambari-server start 
Using python  /usr/bin/python
Starting ambari-server
Ambari Server running with administrator privileges.
Organizing resource files at /var/lib/ambari-server/resources...
Ambari database consistency check started...
Server PID at: /var/run/ambari-server/ambari-server.pid
Server out at: /var/log/ambari-server/ambari-server.out
Server log at: /var/log/ambari-server/ambari-server.log
Waiting for server start...................
Server started listening on 8080

DB configs consistency check: no errors and warnings were found.
Ambari Server 'start' completed successfully.

Register a cluster



Log in to Ambari using a browser that has root access. That's needed because we'll upload the SSH key from root's home directory during this registration.
 
oguz@dikanka~$ sudo chromium-browser --no-sandbox


Navigate to address http://localhost:8080

user: admin
password: admin

Click "Launch Install Wizard" to register our first and only cluster.


 
Name your cluster, click next.

 
HDP 3.1 is selected by default. “Use Public Repository” is also selected. Click Next.


Add your computer name in the list of target hosts.
Click “choose file”. Locate the file “id_rsa” under the path “/root/.ssh” (to be able to access the hidden path “.ssh”, you may need to right-click and choose the option “Show hidden files”).
Click “Register and confirm”.
Ignore the warning about the computer name not being a valid FQDN.



You may need to do some problem solving here. Remember that your OS must be listed in the support matrix, the client must be reachable through passwordless SSH, and there should be no connectivity issues such as a firewall block or busy ports.
If everything goes fine, you should see the screen below.


Time to install some services.
The following services will be installed. Deselect the others and click Next.

  • YARN + MapReduce2
  • Hbase
  • ZooKeeper
  • Ambari Metrics
  • SmartSense


Ignore the warnings about limited functionality due to skipping Apache Ranger and Apache Atlas.

All master components will be assigned to the same host. Click Next. 



Only one slave is available. Click Next



All passwords are set as “admin123”



None of the selected services need database settings, so the database settings are disabled. We left the directories at their default values. See them listed below.


HDFS
DataNode directories
/hadoop/hdfs/data
NameNode directories
/hadoop/hdfs/namenode
SecondaryNameNode Checkpoint directories
/hadoop/hdfs/namesecondary
NFSGateway dump directory
/tmp/.hdfs-nfs
NameNode Backup directory
/tmp/upgrades
JournalNode Edits directory
/hadoop/hdfs/journalnode
NameNode Checkpoint Edits directory
${dfs.namenode.checkpoint.dir}
Hadoop Log Dir Prefix
/var/log/hadoop
Hadoop PID Dir Prefix
/var/run/hadoop



YARN
YARN NodeManager Local directories
/hadoop/yarn/local
YARN Timeline Service Entity Group FS Store Active directory
/ats/active/
YARN Node Labels FS Store Root directory
/system/yarn/node-labels
YARN NodeManager Recovery directory
{{yarn_log_dir_prefix}}/nodemanager/recovery-state
YARN Timeline Service Entity Group FS Store Done directory
/ats/done/
YARN NodeManager Log directories
/hadoop/yarn/log
YARN NodeManager Remote App Log directory
/app-logs
YARN Log Dir Prefix
/var/log/hadoop-yarn
YARN PID Dir Prefix
/var/run/hadoop-yarn


MAPREDUCE2
Mapreduce JobHistory Done directory
/mr-history/done
Mapreduce JobHistory Intermediate Done directory
/mr-history/tmp
YARN App Mapreduce AM Staging directory
/user
Mapreduce Log Dir Prefix
/var/log/hadoop-mapreduce
Mapreduce PID Dir Prefix
/var/run/hadoop-mapreduce


HBASE
HBase Java IO Tmpdir
/tmp
HBase Bulkload Staging directory
/apps/hbase/staging
HBase Local directory
${hbase.tmp.dir}/local
HBase root directory
/apps/hbase/data
HBase tmp directory
/tmp/hbase-${user.name}
ZooKeeper Znode Parent
/hbase-unsecure
HBase Log Dir Prefix
/var/log/hbase
HBase PID Dir
/var/run/hbase


ZOOKEEPER
ZooKeeper directory
/hadoop/zookeeper
ZooKeeper Log Dir
/var/log/zookeeper
ZooKeeper PID Dir
/var/run/zookeeper


AMBARI METRICS
Aggregator checkpoint directory
/var/lib/ambari-metrics-collector/checkpoint
Metrics Grafana data dir
/var/lib/ambari-metrics-grafana
HBase Local directory
${hbase.tmp.dir}/local
HBase root directory
file:///var/lib/ambari-metrics-collector/hbase
HBase tmp directory
/var/lib/ambari-metrics-collector/hbase-tmp
HBase ZooKeeper Property DataDir
${hbase.tmp.dir}/zookeeper
Phoenix Spool directory
${hbase.tmp.dir}/phoenix-spool
Phoenix Spool directory
/tmp
Metrics Collector log dir
/var/log/ambari-metrics-collector
Metrics Monitor log dir
/var/log/ambari-metrics-monitor
Metrics Grafana log dir
/var/log/ambari-metrics-grafana
HBase Log Dir Prefix
/var/log/ambari-metrics-collector
Metrics Collector pid dir
/var/run/ambari-metrics-collector
Metrics Monitor pid dir
/var/run/ambari-metrics-monitor
Metrics Grafana pid dir
/var/run/ambari-metrics-grafana
HBase PID Dir
/var/run/ambari-metrics-collector/



Accounts are left at their default values. See them listed below.




Smoke User
ambari-qa
Mapreduce User
mapred
Hadoop Group
hadoop
Oozie User
oozie
Ambari Metrics User
ams
Yarn ATS User
yarn-ats
HBase User
hbase
Yarn User
yarn
HDFS User
hdfs
ZooKeeper User
zookeeper
Proxy User Group
users




 

All settings in the Advanced configuration tab are left as defaults, but the SSL client password setting under “HDFS / Advanced” might raise an error.





It’s a password setting issue. Type “admin123” in both password fields to fix the issue.





Click Deploy and pray to your preferred God.



Port issue for YARN

Yeah, it failed.
Luckily, all the services were installed without issues; the failure happened while starting them. YARN fails with the following error :


java.net.BindException: Problem binding to [dikanka:53] java.net.BindException: Address already in use; For more details see: http://wiki.apache.org/hadoop/BindException
This happens because port 53 is not available. The solution is simple :

Under YARN / Configs / Advanced, locate the setting named “RegistryDNS Bind Port” and change it from 53 to 553.
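If you're curious what was holding port 53 in the first place, you can check before changing the setting. On a desktop Ubuntu/Kubuntu install it is typically systemd-resolved.

sudo ss -tulpn | grep ':53 '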



Start Services

Since the services failed to start, we have to start them one by one. To start the services, we'll choose “Restart All” under the “Actions” menu for each service.




Let’s start the services in the following order :
  • Zookeeper
  • HDFS
  • Hbase
  • YARN
  • MapReduce2

Check Files View

If all services are up and running, it's time to check what we have in hand. Click the “Views” menu and choose “Files View”.



This might result in the following error message :

Unauthorized connection for super-user: root from IP 127.0.0.1



In this case, apply the following steps to solve this issue :

  • In Ambari Web, browse to Services > HDFS > Configs.
  • Under the Advanced tab, navigate to the Custom core-site section.
  • Change the values of the following parameters to *

hadoop.proxyuser.root.groups=*
hadoop.proxyuser.root.hosts=*

After these values are altered, you will need to restart all services. Then retry opening Files View and confirm that it looks like the screenshot below :

 


And this marks the end of the scope for this post. Soon we'll continue with other services like Pig, Tez and Hive.
Hope this was helpful for some of you.