Choose s3n over s3 when sharing files with HDFS

Hadoop supports two file systems for exchanging data between HDFS and AWS S3: s3 and s3n.

s3n, the S3 native file system, stores files in their original format. s3 is the S3 block file system: block-based storage, the equivalent of HDFS implemented on top of S3. Other S3 tools cannot read files written through s3; they only see a collection of block files.

However, s3n imposes a limit of 5 GB per file. s3 has no such limit, so it can store files larger than 5 GB.
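
Before copying with s3n it can help to check whether any file exceeds that limit. The snippet below is a minimal sketch, not a tested production tool: oversized_for_s3n is a hypothetical helper name, the limit is assumed to be 5 * 1024^3 bytes, and the input is assumed to follow the usual `hadoop fs -ls -R` layout (size in column 5, path in column 8, no spaces in paths).

```shell
# Hypothetical helper: read an `hadoop fs -ls -R` style listing on stdin
# and print the paths of files larger than the assumed 5 GB s3n limit.
oversized_for_s3n() {
  awk -v limit=$((5 * 1024 * 1024 * 1024)) '
    # skip directories (permission string starts with "d");
    # column 5 is the size in bytes, column 8 is the path
    $1 !~ /^d/ && $5 + 0 > limit { print $8 }
  '
}

# usage (needs a Hadoop client; /data is a placeholder path):
# hadoop fs -ls -R /data | oversized_for_s3n
```

Any path it prints would have to be split, or stored through s3 instead.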

Besides, s3 puts its block files directly into an S3 bucket and takes over the whole bucket, with no notion of folders, whereas s3n stores files in their original form in a folder under the bucket. In this sense s3n is more flexible.

Since my test files are mostly small (under 1 GB), hadoop fs -cp outperformed hadoop distcp in my tests. s3n also transferred data faster than s3.

# copy files from hdfs to s3n
hadoop fs -cp hdfs:// s3n://awsid:awskey@bucket/folder
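
The command above embeds the keys in the URI. As an alternative sketch, the keys can be passed as Hadoop properties (fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey), which keeps them out of the destination path. The wrapper name copy_to_s3n, the DRY_RUN switch and the use of the AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY environment variables are my own assumptions for illustration, not part of Hadoop.

```shell
# Hypothetical wrapper: copy from HDFS to s3n, passing the AWS keys as
# Hadoop properties instead of embedding them in the s3n URI.
# Set DRY_RUN=1 to print the command instead of running it.
copy_to_s3n() {
  src=$1
  dest=$2
  # build the full command line as the positional parameters
  set -- hadoop fs \
    -D "fs.s3n.awsAccessKeyId=${AWS_ACCESS_KEY_ID}" \
    -D "fs.s3n.awsSecretAccessKey=${AWS_SECRET_ACCESS_KEY}" \
    -cp "$src" "$dest"
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "$*"
  else
    "$@"
  fi
}

# usage (source and bucket are placeholders):
# copy_to_s3n hdfs:///some/path s3n://bucket/folder
```

The same properties can also be set once in the Hadoop configuration, in which case a plain s3n://bucket/folder destination is enough.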