Choose S3n over S3 when sharing file with HDFS

AWS S3 supports both s3 and s3n file system when communicates with HDFS.

s3n, the s3 native file system, allows files to be stored in the original format. s3 is the s3 block file system, which is block based storage, an equivalent of HDFS in AWS implementation. Other s3 tools would not be able to recognise the original file format, but to see a bunch of block files.

However s3n imposes a file size limit of 5G per file. s3 does not prevent users from storing large file bigger than 5G though.

Besides, s3 puts block files directly into a S3 bucket and occupies the whole bucket without differentiating folders. Whilst s3n puts files in original shape into a folder under S3 bucket. Hence s3n is more flexible in this sense.

Since my test files are mostly small files smaller than 1G, hadoop fs -cp outperforms hadoop distcp in my test. Besides, s3n boosts faster transmission than s3.

# copy files from hdfs to s3n
hadoop fs -cp hdfs:// s3n://awsid:awskey@bucket/folder

One thought on “Choose S3n over S3 when sharing file with HDFS

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s