Basics of HDFS and installation steps on macOS
HDFS stands for Hadoop Distributed File System. In today’s world, where enormous amounts of data are churned out every day, we need to build systems that scale with that data and can achieve higher levels of computation.
Storing data on a single file system works very well for small amounts of data, but when you are dealing with terabytes or petabytes, having it on a distributed file system across multiple machines gives you the advantage of parallel computation.
Whenever HDFS loads a file into its file system, it partitions the file into multiple blocks, each 128 MB in size by default. For example, a 300 MB file would be split into three blocks of 128 MB, 128 MB, and 44 MB. These blocks are stored on multiple DataNodes, and which block is located where is managed by the NameNode.
An HDFS cluster consists of a NameNode and DataNodes. DataNodes are where the data is stored, whereas the NameNode stores the metadata about the whereabouts of that data-
In this article, we will set up Hadoop-3.2.2 on our local system as a single-node cluster-
First, create a directory in your local file system to keep the Hadoop installation-
$ mkdir hadoop
Download and extract the Hadoop installer from the link and place it in the hadoop folder. I’ve used hadoop-3.2.2 here.
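If you prefer the command line, the same release can be fetched and extracted with curl and tar (assuming the standard Apache archive URL for 3.2.2)-
$ cd hadoop
$ curl -O https://archive.apache.org/dist/hadoop/common/hadoop-3.2.2/hadoop-3.2.2.tar.gz
$ tar -xzf hadoop-3.2.2.tar.gz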
After extraction, the folder structure will look like the following-
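Roughly, the relevant parts of the layout are (a sketch; the full distribution contains more directories)-
hadoop/
  hadoop-3.2.2/
    bin/           (hdfs, hadoop and other client commands)
    etc/hadoop/    (configuration files: core-site.xml, hdfs-site.xml, hadoop-env.sh, ...)
    sbin/          (start/stop scripts such as start-all.sh)
    share/         (jars and documentation)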
Before you start working on the Hadoop installation, allow remote login on your computer in ‘System Preferences’ (under ‘Sharing’). You also need Java installed on your local machine-
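You can quickly verify the Java installation from a terminal-
$ java -version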
Now, go to the etc/hadoop folder inside the installation directory and open core-site.xml, then add the below to the configuration section. The fs.defaultFS property sets the default file system URI, pointing all clients at the NameNode running on localhost:9000-
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
Now, open the file hdfs-site.xml and make the following changes. A replication factor of 1 is sufficient here, since a single-node cluster has only one DataNode to store replicas on-
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
Now, open hadoop-env.sh and add JAVA_HOME to the file-
export JAVA_HOME=/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home
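If you are unsure of the correct path on macOS, the built-in java_home helper prints it for a given version (Java 8 in this example)-
$ /usr/libexec/java_home -v 1.8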
Now, set up passphraseless SSH so the Hadoop scripts can log into localhost without prompting-
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
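You can confirm the setup by SSHing into localhost; it should log you in without asking for a passphrase-
$ ssh localhost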
From the home directory of your Hadoop installation, enter the following command to format the NameNode (the older bin/hadoop namenode -format still works, but is deprecated in Hadoop 3)-
$ bin/hdfs namenode -format
Now, start all the daemons with the command below (start-all.sh runs start-dfs.sh and start-yarn.sh, bringing up both HDFS and YARN)-
$ sbin/start-all.sh
Now, execute the jps command to verify the deployment-
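If everything came up correctly, jps should list the Hadoop daemons, something like this (process IDs will differ)-
$ jps
11201 NameNode
11302 DataNode
11410 SecondaryNameNode
11515 ResourceManager
11618 NodeManager
11724 Jps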
To check the health of your cluster, go to http://localhost:9870/dfshealth.html#tab-overview
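The same overview is also available from the command line via the dfsadmin report, which shows capacity and the live DataNodes-
$ bin/hdfs dfsadmin -report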
To create a folder in HDFS, use the following command-
$ bin/hdfs dfs -mkdir -p /user/WordCount
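You can verify that the directory was created by listing its parent-
$ bin/hdfs dfs -ls /user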
To put a file in the directory, use the following command. Note that HDFS paths are case-sensitive, so the path must match the directory created above-
$ bin/hdfs dfs -put ./ec.txt /user/WordCount/input
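To confirm the upload, list the directory or print the file’s contents (ec.txt is stored under the name input here)-
$ bin/hdfs dfs -ls /user/WordCount
$ bin/hdfs dfs -cat /user/WordCount/input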
You can now see the same file in the web view as well-