Various Filesystems in Hadoop

Hadoop is an open-source software framework, written primarily in Java (with some components in C and shell scripts), that enables computation over massive volumes of data across a cluster of machines. Designed for batch (offline) processing, Hadoop runs on commodity hardware, that is, inexpensive machines that each provide local storage and computing power, to achieve distributed storage and processing.

While HDFS (Hadoop Distributed File System) is the most commonly associated storage layer in Hadoop, it is not the only filesystem supported. Hadoop provides a flexible architecture that allows it to interact with multiple filesystems. At its core, the Java abstract class:

org.apache.hadoop.fs.FileSystem

represents a filesystem in Hadoop. Different implementations extend this class, enabling Hadoop to work with both local and distributed storage systems.
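
The relationship between the abstract class and its implementations can be pictured with a small, self-contained sketch. The class names below (SketchFileSystem and its two subclasses) are hypothetical stand-ins invented for illustration; they are not the real Hadoop classes and do not reproduce their methods.

```java
import java.util.List;

class FileSystemHierarchySketch {

    // Hypothetical stand-in for the abstract class org.apache.hadoop.fs.FileSystem.
    abstract static class SketchFileSystem {
        abstract String scheme();
        abstract boolean isDistributed();

        // Shared behaviour lives in the base class; subclasses supply the details.
        String describe() {
            return scheme() + ":// (" + (isDistributed() ? "distributed" : "local") + ")";
        }
    }

    // Plays the role of a local-disk implementation.
    static class SketchLocalFileSystem extends SketchFileSystem {
        String scheme() { return "file"; }
        boolean isDistributed() { return false; }
    }

    // Plays the role of a distributed implementation such as HDFS.
    static class SketchDistributedFileSystem extends SketchFileSystem {
        String scheme() { return "hdfs"; }
        boolean isDistributed() { return true; }
    }

    public static void main(String[] args) {
        // Calling code only sees the abstract type, whatever the backing store is.
        List<SketchFileSystem> filesystems =
                List.of(new SketchLocalFileSystem(), new SketchDistributedFileSystem());
        for (SketchFileSystem fs : filesystems) {
            System.out.println(fs.describe());
        }
    }
}
```

Because client code is written against the abstract type, the same program can be pointed at a different storage system without changes to its logic.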

Major Filesystems in Hadoop

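Below is a list of the filesystems that ship with Hadoop. Several of them (HFTP, HSFTP, KFS, and the s3 and s3n connectors) belong to older releases and were removed in later versions: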

Filesystem	URI scheme	Java implementation (all under org.apache.hadoop)	Description
Local	file	fs.LocalFileSystem	A filesystem for a locally connected disk with client-side checksums. Use RawLocalFileSystem for a local filesystem without checksums.
HDFS	hdfs	hdfs.DistributedFileSystem	The Hadoop Distributed File System, designed to work efficiently with MapReduce.
HFTP	hftp	hdfs.HftpFileSystem	Provides read-only access to HDFS over HTTP. Despite its name, HFTP has no connection with FTP. It is commonly used with distcp to copy data between HDFS clusters running different versions.
HSFTP	hsftp	hdfs.HsftpFileSystem	Provides read-only access to HDFS over HTTPS. Like HFTP, it has no connection with FTP.
HAR	har	fs.HarFileSystem	A filesystem layered on another filesystem for archiving files. Hadoop Archives are mainly used to archive files in HDFS and so reduce the NameNode's memory usage.
KFS (CloudStore)	kfs	fs.kfs.KosmosFileSystem	CloudStore, formerly the Kosmos filesystem (KFS), is a distributed filesystem written in C++, similar to HDFS and GFS (Google File System).
FTP	ftp	fs.ftp.FTPFileSystem	A filesystem backed by an FTP server.
S3 (native)	s3n	fs.s3native.NativeS3FileSystem	A filesystem backed by Amazon S3.
S3 (block-based)	s3	fs.s3.S3FileSystem	A filesystem backed by Amazon S3 that stores files in blocks (much as HDFS does) to overcome S3's 5 GB file size limit.
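
To make the "client-side checksums" entry for the Local filesystem more concrete, here is a minimal sketch of the general idea: the writer records one CRC-32 value per fixed-size chunk, and the reader recomputes the values to detect corruption. It uses only the JDK's java.util.zip.CRC32. The class name, method names, and the 512-byte chunk size are assumptions chosen for illustration, and the sketch does not reproduce how LocalFileSystem actually stores or verifies its checksums.

```java
import java.util.Arrays;
import java.util.zip.CRC32;

class ChunkChecksumSketch {

    // Returns one CRC-32 value per chunk of at most bytesPerChecksum bytes.
    static long[] checksums(byte[] data, int bytesPerChecksum) {
        if (bytesPerChecksum <= 0) {
            throw new IllegalArgumentException("bytesPerChecksum must be positive");
        }
        int chunks = (data.length + bytesPerChecksum - 1) / bytesPerChecksum;
        long[] sums = new long[chunks];
        for (int i = 0; i < chunks; i++) {
            int offset = i * bytesPerChecksum;
            int length = Math.min(bytesPerChecksum, data.length - offset);
            CRC32 crc = new CRC32();
            crc.update(data, offset, length);
            sums[i] = crc.getValue();
        }
        return sums;
    }

    // True when the data still matches the checksums recorded at write time.
    static boolean verify(byte[] data, int bytesPerChecksum, long[] expected) {
        return Arrays.equals(checksums(data, bytesPerChecksum), expected);
    }

    public static void main(String[] args) {
        byte[] data = new byte[1300];           // three chunks: 512 + 512 + 276 bytes
        Arrays.fill(data, (byte) 'a');
        long[] recorded = checksums(data, 512); // 512 is an illustrative chunk size
        System.out.println("chunks: " + recorded.length);
        System.out.println("intact: " + verify(data, 512, recorded));
        data[600] ^= 0x01;                      // simulate a single flipped bit on disk
        System.out.println("after corruption: " + verify(data, 512, recorded));
    }
}
```

Checksumming per chunk rather than per file means a mismatch also narrows down where the damage is, at the cost of storing a few extra bytes for every chunk.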

Use Cases of Filesystems

Hadoop chooses the appropriate filesystem based on the URI scheme provided.
Examples:

hdfs://namenode:9000/path/to/file
file:///local/path/to/file
s3n://bucket-name/path/to/file
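
The scheme-based selection can be sketched with nothing more than java.net.URI and a lookup table. In Hadoop itself this resolution happens behind calls such as FileSystem.get(uri, conf); the version below is only an illustration. The class SchemeLookupSketch and its method are hypothetical, and the map simply copies the scheme-to-class pairs from the table above rather than reading Hadoop's configuration.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Map;

class SchemeLookupSketch {

    // Scheme-to-class pairs taken from the table above; a plain map, not Hadoop's registry.
    static final Map<String, String> IMPLEMENTATIONS = Map.of(
            "file", "org.apache.hadoop.fs.LocalFileSystem",
            "hdfs", "org.apache.hadoop.hdfs.DistributedFileSystem",
            "hftp", "org.apache.hadoop.hdfs.HftpFileSystem",
            "hsftp", "org.apache.hadoop.hdfs.HsftpFileSystem",
            "har", "org.apache.hadoop.fs.HarFileSystem",
            "kfs", "org.apache.hadoop.fs.kfs.KosmosFileSystem",
            "ftp", "org.apache.hadoop.fs.ftp.FTPFileSystem",
            "s3n", "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
            "s3", "org.apache.hadoop.fs.s3.S3FileSystem");

    // Extracts the URI scheme and returns the matching implementation class name.
    static String implementationFor(String location) {
        String scheme = URI.create(location).getScheme();
        if (scheme == null) {
            throw new IllegalArgumentException("No scheme in: " + location);
        }
        String impl = IMPLEMENTATIONS.get(scheme.toLowerCase(Locale.ROOT));
        if (impl == null) {
            throw new IllegalArgumentException("Unsupported scheme: " + scheme);
        }
        return impl;
    }

    public static void main(String[] args) {
        String[] locations = {
            "hdfs://namenode:9000/path/to/file",
            "file:///local/path/to/file",
            "s3n://bucket-name/path/to/file"
        };
        for (String location : locations) {
            System.out.println(location + " -> " + implementationFor(location));
        }
    }
}
```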

Although Hadoop can work with any filesystem implementation, distributed filesystems that offer data locality (such as HDFS and KFS) are generally preferred for processing large volumes of data: computation can be scheduled on the nodes that already hold the data, which reduces network traffic and improves performance.
HDFS remains the default choice for most Hadoop deployments due to its tight integration with the Hadoop ecosystem, replication strategy, and proven scalability.

Advantages of Supporting Multiple Filesystems

Flexibility: Developers can integrate Hadoop with cloud storage (Amazon S3), legacy systems (FTP), or on-premise storage.
Interoperability: Tools like distcp make it easy to copy and share data across different clusters or storage systems.
Cost Optimization: Organizations can combine local storage, cloud services, and distributed filesystems for efficient cost management.
Scalability: Systems like HDFS and KFS scale out by adding more nodes to the cluster.
Security & Compatibility: HSFTP (HTTPS-based) protects data in transit, while HAR archives reduce NameNode memory usage.
