Data in HDFS is stored in blocks. The default block size is 128 MB in Hadoop 2 and later (64 MB in Hadoop 1) and is controlled by the dfs.blocksize property. Files that you store in HDFS are split into blocks and distributed across the cluster. The dfs.datanode.data.dir setting in hdfs-site.xml on each DataNode specifies the local directories where blocks are stored. The dfs.replication value specifies how many copies of each block are kept in the cluster. By default each block is replicated three times.
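To make the block arithmetic concrete, here is a minimal shell sketch. The function names blocks_needed and total_replicas and the 200 MB file are made up for illustration, and the 128 MB block size and replication factor of 3 are assumptions; substitute the values configured on your cluster.

```shell
# Number of HDFS blocks needed for a file (ceiling division).
# Arguments: file size in MB, block size in MB.
blocks_needed() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

# Total block replicas stored across the cluster.
# Arguments: file size in MB, block size in MB, replication factor.
total_replicas() {
  echo $(( $(blocks_needed "$1" "$2") * $3 ))
}

blocks_needed 200 128      # prints 2
total_replicas 200 128 3   # prints 6
```

Note that the last block of a file only occupies as much space as the remaining data, so the 200 MB file in this example is stored as one 128 MB block and one 72 MB block.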
The HDFS permission model for files and directories is similar to POSIX: each file and directory has an owner and a group. Permissions for the owner are separate from those for members of the group and for all other users. To read a file you need the r permission, and to write or append to a file you need the w permission. For directories, r is required to list the contents, w is required to create or delete files and subdirectories, and x is required to access a child of the directory. The superuser is the user who started the NameNode process (often hdfs). This superuser is unrelated to the superuser of the host operating system and can be different for each cluster. Permission checks are skipped for the superuser, so they never fail.
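Permissions are often written as three octal digits, one each for owner, group and other users. As a rough illustration, the hypothetical helper mode_to_rwx below (not part of Hadoop) translates such a mode into the rwx string that a directory listing shows.

```shell
# Translate a three-digit octal mode into the rwx string shown in listings.
# Digits are, in order: owner, group, other.
mode_to_rwx() {
  out=""
  for digit in $(echo "$1" | sed 's/./& /g'); do
    case "$digit" in
      0) out="${out}---" ;;
      1) out="${out}--x" ;;
      2) out="${out}-w-" ;;
      3) out="${out}-wx" ;;
      4) out="${out}r--" ;;
      5) out="${out}r-x" ;;
      6) out="${out}rw-" ;;
      7) out="${out}rwx" ;;
      *) echo "invalid octal digit: $digit" >&2; return 1 ;;
    esac
  done
  echo "$out"
}

mode_to_rwx 750   # prints rwxr-x---
mode_to_rwx 644   # prints rw-r--r--
```

With mode 750 the owner can read, write and traverse, members of the group can read and traverse, and other users have no access.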
To manage access in HDFS you use users and groups. Separate users are needed because different operations are performed by different people and services, and keeping them apart improves security. It is not advisable to run jobs as the superuser, so we create a group named hadoop and add users to that group. In a default installation Hadoop does not authenticate users; it trusts the user name reported by the client's operating system. To enable authentication you use the Kerberos protocol, which is used throughout the cluster to verify that users making requests are really who they claim to be. File permissions are then used to implement authorization.
To begin, create an operating system group named hadoop using the command sudo groupadd hadoop. Then add a user eduonix_tutorial to that group and set its password (@filetutorial1 in this tutorial) using the commands below. The passwd command prompts for the password.
sudo useradd -g hadoop eduonix_tutorial
sudo passwd eduonix_tutorial
After the user has been created, change the permissions of the directory specified by hadoop.tmp.dir in core-site.xml to 1777. This gives read, write and execute permissions to all users and sets the sticky bit, so that ordinary users can delete or rename only their own files. This directory is the base for Hadoop's local and HDFS temporary directories.
sudo chmod -R 1777 /app/hadoop/tmp
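The effect of mode 1777 can be checked on a throwaway local directory before applying it to the Hadoop temporary directory. This is only a small sketch: the variable names are arbitrary and the directory comes from mktemp, not from any Hadoop setting.

```shell
# Create a scratch directory, apply mode 1777, and capture the permission string.
demo_dir=$(mktemp -d)
chmod 1777 "$demo_dir"

# The first ten characters of ls -ld are the file type plus the permission bits.
perms=$(ls -ld "$demo_dir" | cut -c1-10)
echo "$perms"   # on Linux this shows drwxrwxrwt; the trailing t is the sticky bit

# Clean up the scratch directory.
rm -rf "$demo_dir"
```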
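Create a home directory in HDFS for the eduonix_tutorial user and assign its ownership to eduonix_tutorial. Run these commands as the HDFS superuser, because only the superuser can change the owner of a file or directory.
hdfs dfs -mkdir /user/eduonix_tutorial/
hdfs dfs -chown -R eduonix_tutorial:hadoop /user/eduonix_tutorial/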
Make the NameNode aware of the new user by refreshing the user and group mappings with hdfs dfsadmin -refreshUserToGroupsMappings. You can then execute jobs as eduonix_tutorial and save output in the /user/eduonix_tutorial/ directory. Using this approach you can add users and give each of them only the permissions they need.
Run hdfs dfs -ls /user to check that the user's directory has been created with the correct owner and group.
Before issuing any commands that interact with HDFS, make sure the cluster has been started by running start-dfs.sh. To create a directory in HDFS you use the -mkdir command and specify the path of the directory. For example, to create the directory /usr/local/cardata/ you use the command hdfs dfs -mkdir /usr/local/cardata/
If the parent directories do not exist, the -mkdir command fails. To resolve this, use the -p argument, which instructs HDFS to create any missing parent directories. For example, to create the directory /home/sammy/cardata/ use the command hdfs dfs -mkdir -p /home/sammy/cardata
To upload data from the local file system to HDFS you use the put command. You specify the location of the file you would like to upload and the destination directory on HDFS. If we have a file cars.csv in the ~/Downloads directory, we can upload it to the directory we created above using the command below
hdfs dfs -put ~/Downloads/cars.csv /home/sammy/cardata
You can also use the copyFromLocal command to copy files from the local file system to HDFS. Its use is similar to the put command: you specify the file to be copied and the target destination. We will use this command to copy a file called vehicles.csv from the Downloads directory to the /home/sammy/cardata/ directory. Replace the file name with that of a file that resides on your local system. There is nothing special about cars.csv and vehicles.csv; they are used only for demonstration.
hdfs dfs -copyFromLocal ~/Downloads/vehicles.csv /home/sammy/cardata/
To list the contents of a directory you use the ls command. The r permission on the directory is required to do this.
hdfs dfs -ls /home/sammy/cardata
To copy files from HDFS to the local file system you use the get command. You specify the file to be copied and the destination on the local file system. To copy cars.csv from HDFS to the Documents directory we use the command below.
hdfs dfs -get /home/sammy/cardata/cars.csv ~/Documents
When you run the command above you may get a permission error if your local user cannot write to the target directory. One way to work around this is to run the command with sudo; another is to choose a local directory that you own.
Run ls ~/Documents to list the contents of the Documents directory and check that the file has been downloaded.
To remove a file from HDFS you use the rm command and specify its path. To remove the cars.csv file we uploaded earlier to HDFS we use the command below
hdfs dfs -rm /home/sammy/cardata/cars.csv
To remove a directory together with its child directories and files, add the -r option to rm. The older rmr command does the same thing but is deprecated in favor of rm -r. For example, to remove the sammy directory and everything under it, use the command shown below.
hdfs dfs -rm -r /home/sammy
Permissions on files and directories are assigned using the chmod command. The optional -R argument applies the change recursively through the directory structure. chmod does not change the ownership of a directory or file. To demonstrate this we create a directory /usr/local/demo/ and give read, write and execute permissions to everyone, with the sticky bit set
hdfs dfs -mkdir -p /usr/local/demo
hdfs dfs -chmod 1777 /usr/local/demo
Associating files with a user group is done using the chgrp command. The optional -R argument applies the group change recursively through the directory structure. To demonstrate its use we assign the directory we created above to the hadoop group.
hdfs dfs -chgrp hadoop /usr/local/demo
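To check whether a path exists you use the test command with an option that selects the check: -e for any path, -d for a directory and -f for a file. The command prints nothing; it returns exit status 0 if the check succeeds and 1 otherwise. For example, hdfs dfs -test -d /usr/local/demo returns 0 if /usr/local/demo is a directory.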
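Because hdfs dfs -test reports its result only through the exit status, with 0 meaning the check succeeded as is usual for shell commands, it is normally used inside an if statement. The sketch below uses a hypothetical helper, path_status, that runs any check command and prints the outcome in words. The runnable lines use the shell's local test command as a stand-in, and the HDFS call is left as a comment because it needs a running cluster.

```shell
# Run a check command and report its exit status in words.
# Exit status 0 means the check succeeded; anything else means it did not.
path_status() {
  if "$@" >/dev/null 2>&1; then
    echo "exists"
  else
    echo "missing"
  fi
}

# Local analogue using the shell's own test command:
path_status test -d /                   # prints exists
path_status test -d /no/such/dir/here   # prints missing

# On a running cluster the same helper can wrap the HDFS check, for example:
# path_status hdfs dfs -test -d /usr/local/demo
```

One caveat of this simple design is that any failure, including the hdfs command not being installed, is reported as missing.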
This tutorial has demonstrated the commonly used commands for managing directories and files in HDFS. Creating directories, moving data to and from HDFS, assigning ownership, and setting read and write permissions have been explained. Securing a Hadoop cluster with the Kerberos protocol was briefly mentioned.