Mastering HDFS: Installation and Essential Shell Commands

In the realm of big data, the Hadoop Distributed File System (HDFS) is the foundational storage layer, providing robust, fault-tolerant, and highly scalable storage for massive datasets. Understanding its installation and mastering its command-line interface are pivotal skills for anyone working with Hadoop. This guide walks through setting up a Hadoop cluster, explains the key configuration files and environment variables, and covers the essential HDFS shell commands for managing distributed data efficiently.

Establishing the Hadoop Ecosystem: A Step-by-Step Installation Guide

Deploying a functional Hadoop cluster, whether it spans a modest pair of nodes or tens of thousands of machines, begins with a methodical installation. The first step is to install Hadoop on a single machine, which serves as your development or initial configuration environment. A fundamental prerequisite is a Java Development Kit (JDK) on the system, so verifying its installation is a crucial first step.
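
As a quick illustrative sketch (the release version and install path below are assumptions; substitute whatever matches your environment), the single-node setup amounts to verifying the JDK and unpacking a Hadoop release:

Bash

# Confirm a JDK is installed and on the PATH
java -version

# Unpack a downloaded Hadoop release (the version shown is only an example)
tar -xzf hadoop-3.3.6.tar.gz -C /opt
cd /opt/hadoop-3.3.6

# Sanity-check the unpacked distribution
bin/hadoop version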

Running Hadoop across an entire cluster requires the necessary software to be present on every machine included in the cluster’s topology. By convention, one dedicated machine assumes the role of the NameNode, the central arbiter of HDFS metadata, while another distinct machine is designated as the ResourceManager, responsible for resource allocation and job scheduling within the YARN (Yet Another Resource Negotiator) framework.

An important note for further exploration into Hadoop’s broader architecture: we recommend reviewing comprehensive Hadoop tutorials that delve into the overall framework before proceeding with the specific HDFS commands. These resources provide crucial context for understanding the interplay of HDFS within the larger Hadoop ecosystem.

Ancillary services, such as the MapReduce Job History Server and the Web Application Proxy Server, can be deployed on dedicated machines or on shared resources, depending on operational requirements and anticipated workload. All other nodes in the cluster take on a dual role, concurrently running a NodeManager (managing application containers and resource utilization) and a DataNode (storing the actual data blocks for HDFS). Because of these shared responsibilities, they are commonly referred to as slave (worker) nodes.

Configuring Hadoop for Non-Secure Operation

To initiate Hadoop in a non-secure operational mode, a meticulous configuration of its Java-based settings is indispensable. Hadoop’s configuration framework is structured around two primary categories of XML files:

  • Read-Only Default Configurations: These files provide the inherent, unalterable default settings for various Hadoop components. They include core-default.xml, which defines core Hadoop properties; hdfs-default.xml, specifying default HDFS parameters; yarn-default.xml, outlining YARN’s default behaviors; and mapred-default.xml, for MapReduce service defaults. These files should generally not be modified directly.
  • Site-Specific Configurations: These files hold custom overrides that let administrators tailor Hadoop’s behavior to a particular deployment. They are typically located within the etc/hadoop/ directory and include the following (a minimal example of such an override appears after this list):
    • etc/hadoop/core-site.xml: For fundamental Hadoop properties that apply cluster-wide.
    • etc/hadoop/hdfs-site.xml: For HDFS-specific configurations, such as replication factors and NameNode directories.
    • etc/hadoop/yarn-site.xml: For YARN resource manager and node manager settings.
    • etc/hadoop/mapred-site.xml: For custom MapReduce job configurations.
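
For a concrete, minimal sketch of such an override (the NameNode host name and port below are placeholder assumptions; 9000 is merely a commonly chosen port), core-site.xml could be written from the shell as follows. The fs.defaultFS property names the default file system URI that clients and daemons will use:

Bash

# Write a minimal etc/hadoop/core-site.xml; host and port are examples only
cat > etc/hadoop/core-site.xml <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode-host:9000</value>
  </property>
</configuration>
EOF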

Furthermore, administrators can fine-tune Hadoop scripts located in the bin/ directory of the distribution by setting site-specific environment variables within the etc/hadoop/hadoop-env.sh and etc/hadoop/yarn-env.sh shell scripts. These scripts are crucial for defining environment-specific settings like Java paths and memory allocations.

For comprehensive Hadoop cluster configuration, the initial step involves establishing a robust execution environment for the Hadoop daemons and defining all necessary configuration parameters. This ensures that the various components of the distributed system can communicate effectively and operate harmoniously.

The various daemons that constitute the operational backbone of the Hadoop Distributed File System and its associated components include:

  • NodeManager: A YARN component running on slave nodes, responsible for managing application containers and monitoring resource usage.
  • ResourceManager: The central authority in YARN, responsible for allocating resources (CPU, memory) to applications in the cluster.
  • WebAppProxy: A YARN component that provides proxy access to application master web user interfaces.
  • NameNode: The central metadata server for HDFS, maintaining the directory tree of all files and their locations.
  • SecondaryNameNode: A helper daemon for the NameNode that periodically merges the edit log with the fsimage to keep the edit log from growing too large.
  • DataNode: The primary data storage unit in HDFS, running on slave nodes and responsible for storing actual data blocks.
  • YARN daemons: A collective term for ResourceManager and NodeManager, the core components of Hadoop’s resource management layer.

Customizing the Operational Blueprint for Hadoop Components

To ensure that the various Hadoop daemons operate in harmony with the specific requirements and environment of a given deployment, administrators must modify three pivotal environment configuration scripts, conventionally located in the etc/hadoop/ directory: hadoop-env.sh, mapred-env.sh, and yarn-env.sh. Within these scripts, the JAVA_HOME environment variable is of paramount importance and must be specified precisely. Defining it correctly is not merely advisable but imperative on every node in the cluster, so that every Hadoop component can locate and use the Java runtime it needs. Failure to configure JAVA_HOME properly can lead to a cascade of errors, from daemon startup failures to intermittent operational inconsistencies, undermining the stability and reliability of the entire distributed system. This setting is the bedrock of the Java-based Hadoop ecosystem, ensuring that the underlying Java Virtual Machine (JVM) is correctly initialized and available for all processing tasks.
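
As a minimal sketch, assuming an OpenJDK installation at a typical Linux location (adjust the path for your distribution and JDK version), the setting in etc/hadoop/hadoop-env.sh looks like this:

Bash

# etc/hadoop/hadoop-env.sh
# The JDK path below is an example; point it at your actual installation.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64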

Fine-Tuning Individual Daemon Performance Parameters

Each Hadoop daemon in the distributed architecture can have its operational behavior and resource allocation refined through a dedicated environment variable. These variables allow administrators to pass JVM options to each daemon individually, tuning its performance and resource consumption within the broader cluster. The environment variable associated with each daemon is as follows:

Orchestrating NameNode Operations

The pivotal NameNode, the central arbiter of the Hadoop Distributed File System (HDFS), has its intricate operational parameters, including crucial memory allocation and other fundamental Java Virtual Machine (JVM) arguments, meticulously governed by the HADOOP_NAMENODE_OPTS variable. This particular variable provides administrators with granular control over how the NameNode process utilizes system memory, which is paramount for its role in maintaining the filesystem’s metadata. Insufficient memory can lead to performance bottlenecks or even NameNode failures, emphasizing the critical importance of judiciously configuring this variable based on the scale and workload of the HDFS. Proper tuning ensures the NameNode can efficiently manage metadata for billions of files and directories, responding promptly to client requests and maintaining the integrity of the distributed file system.
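
For illustration, a hadoop-env.sh entry along the following lines is one way to raise the NameNode heap; the 4 GB figure is purely an assumed starting point and must be sized to your metadata volume:

Bash

# etc/hadoop/hadoop-env.sh -- heap value is illustrative, not a recommendation
export HADOOP_NAMENODE_OPTS="-Xmx4g ${HADOOP_NAMENODE_OPTS}"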

Sculpting DataNode Resource Utilization

The DataNode, responsible for the actual storage of data blocks within HDFS, is precisely configured through the HADOOP_DATANODE_OPTS variable. This variable serves as the primary mechanism for managing the DataNode’s memory footprint and optimizing its performance settings. It dictates how the DataNode processes data reads and writes, ensuring efficient utilization of system resources while maintaining high throughput for data operations. Misconfiguration here could lead to slow data access, garbage collection pauses, or even DataNode instability, directly impacting the overall performance and reliability of the HDFS data plane. Administrators must balance memory allocation with available physical memory to prevent swapping and ensure consistent data accessibility.

Adjusting Secondary NameNode Settings

The operational nuances of the Secondary NameNode, a critical component for checkpointing the NameNode’s metadata, are meticulously managed via the HADOOP_SECONDARYNAMENODE_OPTS variable. This variable offers a similar degree of control over its memory allocation and JVM parameters as its primary counterpart, the NameNode. Given its role in safeguarding the HDFS metadata and facilitating NameNode recovery, ensuring the Secondary NameNode has adequate resources is vital for the robustness of the entire HDFS infrastructure. Proper configuration helps in preventing long checkpointing times and ensures that the backup of the filesystem metadata is consistent and readily available.

Regulating ResourceManager Parameters

The ResourceManager, the central authority for resource management and job scheduling within Yet Another Resource Negotiator (YARN), has its intricate operational parameters and crucial memory allocation meticulously controlled by the YARN_RESOURCEMANAGER_OPTS variable. This variable is instrumental in defining the ResourceManager’s capacity to handle concurrent applications and manage the allocation of resources to various jobs and services. Adequate memory is essential for the ResourceManager to maintain the state of all running applications, track available cluster resources, and efficiently schedule containers. Incorrect settings can lead to scheduling delays, application failures, or an inability to utilize the cluster’s full capacity, making its proper configuration paramount for YARN’s overall efficiency.
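
As a hedged sketch, the corresponding entry normally lives in etc/hadoop/yarn-env.sh; the 2 GB heap below is an arbitrary example value:

Bash

# etc/hadoop/yarn-env.sh -- size the heap to the number of applications and nodes
export YARN_RESOURCEMANAGER_OPTS="-Xmx2g ${YARN_RESOURCEMANAGER_OPTS}"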

Configuring NodeManager Behavior

The NodeManager, operating on each slave node within the YARN cluster, is precisely configured using the YARN_NODEMANAGER_OPTS variable. This variable plays a decisive role in dictating the NodeManager’s resource consumption and overall behavior on individual worker machines. It controls how the NodeManager launches, monitors, and manages application containers, ensuring they adhere to the allocated resource limits. Proper tuning of this variable is vital for preventing resource exhaustion on slave nodes, maintaining stability, and ensuring that application containers run efficiently without interfering with other processes. It directly impacts the effective utilization of CPU, memory, and disk resources across the entire YARN cluster.

Influencing WebAppProxy Performance

The integral settings governing the WebAppProxy, which facilitates secure access to the web user interfaces of YARN applications, are meticulously managed by the YARN_PROXYSERVER_OPTS variable. This configuration variable directly influences the performance and responsiveness of the proxy server. Given its role as an intermediary for accessing critical application UIs, ensuring its optimal operation is important for user experience and administrative monitoring. Proper tuning can prevent bottlenecks and ensure that web interfaces load quickly and reliably, even under heavy load, thereby improving the overall usability and observability of applications running on YARN.

Optimizing MapReduce Job History Server

The MapReduce Job History Server, a vital component for retaining historical information about completed MapReduce jobs, is directly controlled by the HADOOP_JOB_HISTORYSERVER_OPTS variable. This variable provides administrators with the necessary levers for optimizing the performance and resource utilization of this crucial service. Efficient configuration is paramount for ensuring that job logs are promptly stored and readily accessible for debugging, performance analysis, and auditing purposes. Poor optimization can lead to slow UI responsiveness or even a failure to store job history, severely impeding the ability to understand and troubleshoot past MapReduce executions.
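
A minimal sketch, assuming the variable is set in etc/hadoop/mapred-env.sh (the usual home for MapReduce daemon settings) with an arbitrary 1 GB heap:

Bash

# etc/hadoop/mapred-env.sh -- illustrative heap size only
export HADOOP_JOB_HISTORYSERVER_OPTS="-Xmx1g ${HADOOP_JOB_HISTORYSERVER_OPTS}"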

Holistic Parameters for Hadoop Daemon Health

Beyond the daemon-specific configuration variables that allow for granular control over individual processes, several other pivotal environment parameters hold immense significance for the overarching health, stability, and operational efficiency of the entire Hadoop daemon ecosystem. These parameters offer broader control over fundamental aspects of daemon management and diagnostics, acting as essential levers for system administrators to maintain a robust and well-performing Hadoop cluster. Understanding and appropriately configuring these global parameters is just as critical as tuning individual daemon options, as they collectively contribute to the seamless functioning and long-term viability of the distributed system.

Designating the Process ID (PID) File Directory

The HADOOP_PID_DIR environment variable designates the directory where the process ID (PID) files of the running daemons are stored. Each PID file records the numerical process ID of one daemon, and keeping these files in a well-defined, accessible location is indispensable for managing and monitoring the daemon processes. They allow administrators to quickly check a daemon’s status, send signals for a graceful shutdown or restart, and avoid accidentally spawning multiple instances of the same daemon, which would cause resource contention and operational anomalies. Setting this variable properly is a small but crucial detail that underpins robust process management in a complex distributed environment.
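
For example (the directory below is only an assumed choice; any writable, persistent path works), the variable can be set in hadoop-env.sh and the resulting PID files inspected directly:

Bash

# etc/hadoop/hadoop-env.sh -- example path only
export HADOOP_PID_DIR=/var/run/hadoop

# After the daemons start, each leaves a PID file behind; the exact file name
# depends on the user running the daemon.
cat /var/run/hadoop/hadoop-hdfs-namenode.pid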

Specifying the Daemon Log File Location

The HADOOP_LOG_DIR variable specifies the directory where the log files generated by the various Hadoop daemons are archived. These logs are far more than digital clutter: they are the first line of defense when troubleshooting, providing a chronological record of daemon behavior, errors encountered, and events processed. They are equally important for performance monitoring, helping administrators discern trends in resource utilization, identify bottlenecks, and fine-tune system parameters for optimal throughput. They also serve as an audit trail of the cluster’s operational behavior, which is vital for security compliance and post-mortem analysis of system incidents. Ensuring sufficient disk space and appropriate access permissions for this directory is therefore a fundamental aspect of maintaining a healthy, observable Hadoop deployment.
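
As a brief illustration (the path is again an assumption), the log directory is set the same way, after which daemon logs can be followed with ordinary shell tools:

Bash

# etc/hadoop/hadoop-env.sh -- example path only
export HADOOP_LOG_DIR=/var/log/hadoop

# Follow the NameNode log; the exact file name varies with the user and host name.
tail -f /var/log/hadoop/hadoop-*-namenode-*.log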

Allocating Heap Memory for Hadoop and YARN Processes

The HADOOP_HEAPSIZE and YARN_HEAPSIZE variables control the Java heap allocation for Hadoop and YARN processes, respectively. Their values are specified in megabytes (MB) and directly determine how much RAM the Java Virtual Machine (JVM) reserves for its operations. For example, if HADOOP_HEAPSIZE is set to 1000, the corresponding daemon’s Java heap is provisioned with 1000 MB (roughly 1 GB) of memory.
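
Expressed as a sketch in hadoop-env.sh and yarn-env.sh (the numbers repeat the illustrative value above and are not recommendations):

Bash

# etc/hadoop/hadoop-env.sh
export HADOOP_HEAPSIZE=1000   # megabytes, i.e. roughly 1 GB

# etc/hadoop/yarn-env.sh
export YARN_HEAPSIZE=1000     # megabytes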

By default, this allocation is often 1000 MB (about 1 GB), which may suffice for small-scale deployments. In practice, however, the default frequently needs adjustment and tuning, driven primarily by the cluster’s workload characteristics and the physical memory available on the underlying hardware nodes.

Inadequate heap allocation can cause a range of problems, most notably the dreaded “out-of-memory” (OOM) errors, which lead to daemon crashes and service interruptions. Conversely, allocating excessive heap memory wastes physical RAM that other critical system processes or application containers could otherwise use. The strategic configuration of HADOOP_HEAPSIZE and YARN_HEAPSIZE therefore aims for a careful balance: giving the daemons enough memory to operate robustly and efficiently, preventing performance degradation, keeping garbage collection pauses (which introduce latency) to a minimum, and optimizing the overall throughput and stability of the Hadoop and YARN ecosystem. This kind of deliberate memory management is a cornerstone of scalable, resilient big data processing.

Advanced Considerations for Heap Sizing and JVM Tuning

Beyond the basic HADOOP_HEAPSIZE and YARN_HEAPSIZE variables, advanced administrators often delve deeper into JVM tuning for optimal performance. These _OPTS variables (e.g., HADOOP_NAMENODE_OPTS) allow for the specification of more granular JVM arguments. For instance, -Xms sets the initial heap size, and -Xmx sets the maximum heap size. While _HEAPSIZE provides a shorthand for -Xmx, directly using -Xms and -Xmx within the daemon-specific _OPTS variables offers more precise control. For example, setting -Xms and -Xmx to the same value can prevent the JVM from dynamically resizing the heap, which can sometimes introduce performance pauses.
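
For instance, a hedged hadoop-env.sh entry that pins the NameNode heap at an assumed 8 GB on both ends would look like this:

Bash

# etc/hadoop/hadoop-env.sh -- the 8 GB figure is an assumption for illustration
export HADOOP_NAMENODE_OPTS="-Xms8g -Xmx8g ${HADOOP_NAMENODE_OPTS}"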

Another crucial aspect of JVM tuning within these _OPTS variables involves Garbage Collection (GC). Different garbage collectors (e.g., G1GC, ParallelGC, CMS) offer varying trade-offs between throughput, latency, and memory footprint. For instance, specifying -XX:+UseG1GC can be beneficial for large heaps to reduce GC pause times, which is critical for NameNode and ResourceManager stability. Other GC-related parameters like -XX:MaxGCPauseMillis (to set a target maximum pause time) or -XX:NewRatio (to control the size of the young generation) allow for highly specialized optimization depending on the daemon’s role and expected workload.
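
As one hedged illustration combining these flags (the values are arbitrary and should be validated against the daemon’s observed GC behavior):

Bash

# etc/hadoop/hadoop-env.sh -- illustrative GC settings only
export HADOOP_NAMENODE_OPTS="-XX:+UseG1GC -XX:MaxGCPauseMillis=200 ${HADOOP_NAMENODE_OPTS}"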

Furthermore, JVM logging via parameters like -Xlog:gc* can be invaluable for diagnosing memory-related issues and understanding GC behavior. Redirecting these verbose logs to separate files can help in post-mortem analysis and performance tuning efforts. For instance, if the NameNode is experiencing frequent and long GC pauses, analyzing its GC logs can pinpoint memory leaks or inefficient data structures that contribute to excessive memory churn.
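
A sketch of such logging for the NameNode on JDK 9 or later (the log path is an assumption; JDK 8 uses the older -verbose:gc style of flags instead):

Bash

# etc/hadoop/hadoop-env.sh -- unified JVM logging syntax (JDK 9+); path is an example
export HADOOP_NAMENODE_OPTS="-Xlog:gc*:file=/var/log/hadoop/namenode-gc.log ${HADOOP_NAMENODE_OPTS}"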

The selection of the appropriate Java Development Kit (JDK) version specified by JAVA_HOME is also paramount. Different JDK versions come with different JVM implementations and optimizations. Newer JDKs often offer improved garbage collectors and better performance characteristics. Therefore, aligning the chosen JDK version with the recommendations for the specific Hadoop distribution version is crucial for realizing optimal performance and stability.

Finally, while these environment variables are the primary mechanism for daemon configuration, administrators must also consider the interplay with other Hadoop configuration files, such as hdfs-site.xml, yarn-site.xml, and mapred-site.xml. These XML files contain parameters that govern the functional aspects of Hadoop services, while the _env.sh scripts manage the environmental settings for the daemon processes themselves. A holistic approach to configuration, understanding how environment variables interact with XML-based settings, is essential for truly robust and highly performant Hadoop clusters. This comprehensive approach ensures that every layer of the Hadoop ecosystem is meticulously tuned for its specific role and the overall demands of the big data workload.

The Significance of Configuration in a Distributed Ecosystem

The act of meticulously tailoring the Hadoop daemon configuration environment transcends mere technical compliance; it is a foundational pillar supporting the entire edifice of a stable, performant, and scalable distributed computing ecosystem. In a system as inherently complex and resource-intensive as Hadoop, where hundreds or thousands of interconnected processes collaborate to manage and process colossal datasets, even minor misconfigurations can cascade into significant operational inefficiencies, system instability, or outright failures.

Firstly, precise configuration ensures resource optimization. Incorrect memory allocations (e.g., insufficient heap size for the NameNode or ResourceManager) can lead to constant garbage collection, excessive swapping to disk, or even daemon crashes, effectively crippling the cluster’s ability to process data. Conversely, over-allocating resources can waste valuable server memory and CPU cycles that could otherwise be utilized by other critical processes or application containers. Through careful tuning of _OPTS variables and heap sizes, administrators can strike a delicate balance, maximizing throughput and minimizing latency without overtaxing the underlying hardware. This meticulous resource management is especially critical in large-scale deployments where operational costs and efficiency are paramount.

Secondly, proper configuration is intrinsically linked to system reliability and fault tolerance. The ability to correctly locate PID files via HADOOP_PID_DIR facilitates robust process management, allowing for automated monitoring and graceful daemon shutdowns or restarts. Well-configured log directories (HADOOP_LOG_DIR) ensure that vital diagnostic information is captured, which is indispensable for proactive issue identification, rapid troubleshooting, and post-mortem analysis of system anomalies. Without comprehensive logs, diagnosing elusive errors in a distributed environment becomes a daunting, if not impossible, task, severely impacting the mean time to recovery (MTTR) during outages. The configuration of daemons also plays a role in their ability to recover from failures, for instance, by having sufficient memory to rebuild internal state after a restart.

Thirdly, environmental configuration is vital for security and compliance. While core security settings are typically found in XML configuration files, the environment scripts can influence how daemons interact with system-level security mechanisms or pathing to security credentials. For example, ensuring JAVA_HOME points to a securely managed Java installation is a basic security hygiene practice. The logging configurations can also feed into centralized logging systems, critical for auditing and meeting compliance requirements.

Fourthly, effective configuration underpins scalability and flexibility. As data volumes grow and computational demands intensify, administrators must be able to adjust daemon parameters to accommodate increased loads without bringing the entire cluster down. The modular nature of these environment scripts allows for isolated tuning of specific components (e.g., boosting NameNode memory independently of DataNode memory), providing the flexibility needed to scale different parts of the cluster proportionally to their bottlenecks. This adaptability is key to evolving a Hadoop cluster from a small proof-of-concept to a massive, production-grade big data platform.

Finally, the discipline of configuration management through these scripts fosters operational best practices. Standardizing the approach to daemon configuration across all nodes and all deployments (e.g., development, staging, production environments) reduces human error, simplifies deployment procedures, and ensures consistent behavior. This standardization is facilitated by version controlling these environment scripts alongside other configuration files, making changes traceable and reversible. In essence, tailoring the Hadoop daemon configuration environment is not a one-time task but an ongoing, iterative process fundamental to maintaining a high-performing, resilient, and manageable big data infrastructure.

Navigating HDFS: Essential Shell Commands for File Management

Once your Hadoop cluster is established and configured, interacting with the Hadoop Distributed File System (HDFS) is primarily accomplished through its command-line interface. These shell commands empower users to perform fundamental file system operations, mirroring many familiar Unix/Linux commands, but specifically tailored for the distributed nature of HDFS.

Directory Creation in HDFS

To create a new directory within HDFS at a specified path, the following command is utilized:

Bash

hadoop fs -mkdir <paths>

Example: To create two directories, /user/saurzcode/dir1 and /user/saurzcode/dir2, simultaneously:

Bash

hadoop fs -mkdir /user/saurzcode/dir1 /user/saurzcode/dir2

Listing Directory Contents

To view the contents of a directory in HDFS, similar to the ls command in Unix, use:

Bash

hadoop fs -ls <args>

Example: To list the contents of the /user/saurzcode directory:

Bash

hadoop fs -ls /user/saurzcode

HDFS File Upload and Download Operations

Efficiently moving files between your local file system and HDFS is a common operation.

Uploading Files to HDFS (hadoop fs -put)

This command facilitates copying single or multiple source files from your local file system to a specified destination path within HDFS:

Bash

hadoop fs -put <localsrc> … <HDFS_dest_Path>

Example: To upload Samplefile.txt from /home/saurzcode/ on your local system to /user/saurzcode/dir3/ in HDFS:

Bash

hadoop fs -put /home/saurzcode/Samplefile.txt /user/saurzcode/dir3/

Downloading Files from HDFS (hadoop fs -get)

This command enables copying or downloading files from HDFS to your local file system:

Bash

hadoop fs -get <hdfs_src> <localdst>

Example: To download Samplefile.txt from /user/saurzcode/dir3/ in HDFS to /home/ on your local system:

Bash

hadoop fs -get /user/saurzcode/dir3/Samplefile.txt /home/

Viewing File Content in HDFS (hadoop fs -cat)

To display the entire content of a file stored in HDFS, analogous to the Unix cat command:

Bash

hadoop fs -cat <path[filename]>

Example: To view the content of abc.txt located at /user/saurzcode/dir1/ in HDFS:

Bash

hadoop fs -cat /user/saurzcode/dir1/abc.txt

Copying Files within HDFS (hadoop fs -cp)

To copy a file from one location to another within the HDFS namespace:

Bash

hadoop fs -cp <source> <dest>

Example: To copy abc.txt from /user/saurzcode/dir1/ to /user/saurzcode/dir2/ within HDFS:

Bash

hadoop fs -cp /user/saurzcode/dir1/abc.txt /user/saurzcode/dir2

Copying Files Between Local and HDFS (Dedicated Commands)

While put and get are commonly used, dedicated commands also exist for clarity:

copyFromLocal

Copies a file from the local file system to HDFS. This is functionally identical to hadoop fs -put.

Bash

hadoop fs -copyFromLocal <localsrc> URI

Example: To copy abc.txt from /home/saurzcode/ to /user/saurzcode/ in HDFS:

Bash

hadoop fs -copyFromLocal /home/saurzcode/abc.txt /user/saurzcode/abc.txt

copyToLocal

Copies a file from HDFS to the local file system. This is functionally identical to hadoop fs -get.

Bash

hadoop fs -copyToLocal [-ignorecrc] [-crc] URI <localdst>

The [-ignorecrc] and [-crc] options relate to CRC (Cyclic Redundancy Check) integrity verification during file transfer.
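
Example (reusing the illustrative paths from earlier): to copy abc.txt from /user/saurzcode/ in HDFS to /home/saurzcode/ on the local file system:

Bash

hadoop fs -copyToLocal /user/saurzcode/abc.txt /home/saurzcode/abc.txt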

Moving Files within HDFS (hadoop fs -mv)

To move a file or directory from a source path to a destination path within HDFS:

Bash

hadoop fs -mv <src> <dest>

Important Note: It is crucial to remember that hadoop fs -mv only operates within the same file system (HDFS). You cannot move files across different file systems (e.g., from HDFS to the local file system) with this command. For such operations, use hadoop fs -get followed by hadoop fs -rm (or, for an upload, hadoop fs -put followed by deleting the local copy), as sketched below.
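
As a sketch of that workaround, again reusing the illustrative paths from earlier examples:

Bash

# Download the file, then remove the HDFS copy -- the net effect of a move to local
hadoop fs -get /user/saurzcode/dir1/abc.txt /home/saurzcode/
hadoop fs -rm /user/saurzcode/dir1/abc.txt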

Example: To move abc.txt from /user/saurzcode/dir1/ to /user/saurzcode/dir2/ in HDFS:

Bash

hadoop fs -mv /user/saurzcode/dir1/abc.txt /user/saurzcode/dir2

Deleting Files and Directories in HDFS (hadoop fs -rm and hadoop fs -rmr)

To remove files or empty directories from HDFS:

Bash

hadoop fs -rm <arg>

Example: To remove the file abc.txt from /user/saurzcode/dir1/:

Bash

hadoop fs -rm /user/saurzcode/dir1/abc.txt

To recursively delete a directory and its contents (similar to rm -r in Unix):

Bash

hadoop fs -rmr <arg>

Important Note: The -rmr command (deprecated in newer Hadoop releases in favor of hadoop fs -rm -r) should be used with extreme caution, as it deletes directories and all of their contents without prompting for confirmation.

Example: To recursively delete the directory /user/saurzcode/ and all its contents:

Bash

hadoop fs -rmr /user/saurzcode/

Displaying the End of a File (hadoop fs -tail)

To show the last few lines of a file stored in HDFS, similar to the Unix tail command:

Bash

hadoop fs -tail <path[filename]>

Example: To display the end of abc.txt located at /user/saurzcode/dir1/:

Bash

hadoop fs -tail /user/saurzcode/dir1/abc.txt

Showing Aggregate File Length/Disk Usage (hadoop fs -du)

To display the aggregate length of files or the disk usage of directories and files in HDFS:

Bash

hadoop fs -du <path>

Example: To show the disk usage for abc.txt within /user/saurzcode/dir1/:

Bash

hadoop fs -du /user/saurzcode/dir1/abc.txt
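
In recent Hadoop releases the -h flag (if available in your version) prints the sizes in a human-readable form, for example:

Bash

hadoop fs -du -h /user/saurzcode/dir1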

This comprehensive set of HDFS shell commands forms the bedrock for effective data management within your Hadoop cluster. Proficiency in these commands is vital for any professional working with big data stored in HDFS.