This notes make use of the CLoudera Training VM, The VM image is can be found here
Once we turn on the VM, the following steps are done
- Player>Removable Devices>Network Adapter> Settings
- Options> Shares Folders, select Always enables
- On folders, add your machines local folders path
- The files can be found on this path of the VM Training/File System/mnt/hgfs
Setting it up
Open virtual machine > select the .vmx file/d/1pizgWHk6VAgCgpiMXeFbGqQ4KmY41NJ3/view
Hadoop is an open-source framework designed for storing and processing large datasets;
across multiple machines in a distributed environment. It's particularly useful for
big data tasks because it allows large datasets to be stored, processed, and analyzed
in parallel, which makes it much faster and scalable compared to traditional data processing.
HDFS (Hadoop Distributed File System):
- file system that splits large files into smaller blocks and distributes them across multiple machines, creating a "distributed" storage.
- HDFS allows for fault tolerance by replicating these blocks on different machines, so data isn’t lost if one machine fails.
hadoop fs -put shakespeare /user/training/shakespeare
hadoop fs -ls /user/training/
hadoop fs -rm /user/training/shakespeare
hadoop fs -cat /user/training/shakespeare
- Nano
- Type: Command-line-based text editor
- Interface: Runs directly in the terminal
- User-Friendliness: Beginner-friendly with a simpler interface
- Shortcut Guidance: Shows helpful shortcuts at the bottom of the screen (e.g., ^X for Exit, where ^ means "Ctrl")
- Usage: Ideal for quick edits, especially when working over SSH or without a graphical interface
- Sample usage:
nano myfile.txt
- Gedit
- Type: Graphical text editor
- Interface: GUI-based with menus, buttons, and a visual interface
- User-Friendliness: Very beginner-friendly for users used to traditional text editors like Notepad on Windows
- Syntax Highlighting: Supports syntax highlighting for coding, making it ideal for editing code
- Usage: Good for writing code or editing files in a graphical Linux environment
- Sample usage:
gedit myfile.txt &
Run script to configure environment
~/scripts/developer/training_setup_dev.sh
What this does:
The ~ symbol represents the home directory.
The script path suggests it is located in a folder called scripts/developer/.
Running this script sets up necessary configurations or installations that the labs will require.
Understanding Hadoop
Hadoop is already installed, configured, and running on the virtual machine
To receive a help message that outlines how to use the command and its various options
hadoop
- Change to local directory
$ cd ~/training_materials/developer/data
- Run
ls
, see the two files: shakespeare.tar.gz and shakespeare-stream.tar.gz. - unzipping it by using
$ tar zxvf shakespeare.tar.gz
- Upload to HDFS,
$ hadoop fs -put shakespeare /user/training/shakespeare
- hadoop fs: This command interacts with the Hadoop file system.
- -put: This option is used to upload files or directories from the local filesystem to HDFS.
- list contents in HDFS
$ hadoop fs -ls /user/training
- create new dir
$ hadoop fs -mkdir weblog
$ gunzip -c access_log.gz | hadoop fs -put - weblog/access_log
- gunzip -c access_log.gz: This command decompresses the access_log.gz file.
- The -c option tells gunzip to write the uncompressed output to standard output (stdout) instead of creating a file on disk.
- '|' This is a pipe operator, takes the output of the command on the left and sends it as input to the command on the right.
- verify upload
$ hadoop fs -ls weblog
- Extract the first 5000 lines from the original log file and upload it to HDFS
$ gunzip -c access_log.gz | head -n 5000 | hadoop fs -put - weblog/access_log_small
- gunzip -c access_log.gz: Decompresses the file, as before.
- | head -n 5000: This takes the output from gunzip and extracts the first 5000 lines. The head command is used to display the beginning of a file.
- | hadoop fs -put - weblog/access_log_small: This uploads the first 5000 lines to HDFS, naming the file access_log_small in the weblog directory.
$ hadoop fs -cat shakespeare/histories | tail -n 50
Viewing the End of a Large File: This command is particularly useful when dealing with large output files from MapReduce
programs. Often, the results generated can be quite large, making it impractical to view the entire content at once.
Quick Insights: By using tail, we can quickly check the end of the output for errors, results, or any specific
information without scrolling through a massive amount of text.
Debugging and Monitoring:
This can be helpful for monitoring logs or outputs in real time or checking for specific patterns or results
generated by a MapReduce job.
$ hadoop fs -get shakespeare/poems ~/shakepoems.txt
hadoop fs -get: This command is used to copy files from HDFS to the local filesystem.
shakespeare/poems: This is the source path in HDFS from which we want to copy the file.
In this case, it’s the poems file located in the shakespeare directory.
~/shakepoems.txt: This is the destination path on the local filesystem where we want to save the downloaded file.
The ~ symbol refers to our home directory, and shakepoems.txt is the name we want to give to the downloaded file.
Action: This command will copy the poems file from HDFS and save it as shakepoems.txt in home directory on the
local filesystem.
$ less ~/shakepoems.txt
This command is a file viewer that allows to view the contents of files in a scrollable format. We can navigate through the file using the arrow keys, and or exit the viewer by pressing q.
cd ~/workspace/
- Week 2: Map Reduce
- Week 3: MRUnit Testing
- Week 4: Toolrunner
- Week 4: Combiner
- Week 5: Toolrunner
- Week 5: Logging
- Week 5: Counters
- Week 6: Writables
- Week 6: createsequencefile
- Week 6: Practicioner
- Week 7: Inverted Index
- Week 7: Word Co Occurance
- Week 8: Sqoop, MySQL and Oozie workflow
- Week 9: ETL process
- Week 10&11: Pig and Hive