Good Practice: Git Commit Message, PR, Versioning and Code Review

How to Write a Git Commit Message
A good Git commit message serves as a log that tells WHY and WHEN we made a change, whilst the diff shows only WHAT changed. A useful message consists of:
A concise (at most 50 characters) imperative subject with a reference to the issue/change number
One blank line between subject and body
A body that explains WHAT, WHY and HOW, wrapped at 72 characters

[KERNEL-003] Update Network Module to Support Wifi

Nowadays people use Wifi. For this reason, it is unacceptable for the
kernel not to support Wifi.

Add a new subcomponent to the network module that provides the
following features:
 - listen to Wifi driver
 - save logs

A one-line message is acceptable if the change is simple and straightforward.
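
The three rules above can be checked mechanically. The following is a minimal sketch, not an official Git tool; the function name `check_commit_message` and its parameters are made up for illustration, and the limits default to the 50/72 convention described above.

```python
def check_commit_message(message, subject_limit=50, body_limit=72):
    """Return a list of problems found in a commit message (empty if none)."""
    problems = []
    lines = message.rstrip("\n").split("\n")

    # Rule 1: the subject (first line) must be concise.
    subject = lines[0]
    if len(subject) > subject_limit:
        problems.append(f"subject is {len(subject)} chars, limit is {subject_limit}")

    for number, line in enumerate(lines[1:], start=2):
        # Rule 2: the second line must be blank to separate subject and body.
        if number == 2 and line.strip():
            problems.append("missing blank line between subject and body")
        # Rule 3: every body line must fit within the wrap width.
        if len(line) > body_limit:
            problems.append(f"line {number} is {len(line)} chars, limit is {body_limit}")
    return problems
```

A check like this could be wired into a commit-msg hook, although reviewers still have to judge whether the body actually explains the WHY.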

Pull Request
In the shared repository model, opening a PR triggers an automatic build and test run on CI, as well as a review request. The guidelines for writing a good commit message also apply to writing a good PR title and body.

Semantic Versioning
Bump Major when the API changes in an incompatible way
Bump Minor when adding new functionality without breaking the API
Bump Patch for backward-compatible bug fixes
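
As a small illustration of the three rules, here is a sketch of a version bump helper. The function name `bump` and the plain `MAJOR.MINOR.PATCH` input format are assumptions for this example; pre-release and build metadata suffixes are not handled. Resetting the lower components to zero follows the Semantic Versioning specification.

```python
def bump(version, change):
    """Return the next version string for a 'major', 'minor' or 'patch' change."""
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "major":
        # Incompatible API change: minor and patch reset to 0.
        return f"{major + 1}.0.0"
    if change == "minor":
        # New backward-compatible functionality: patch resets to 0.
        return f"{major}.{minor + 1}.0"
    if change == "patch":
        # Backward-compatible bug fix.
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change!r}")
```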

Code review
Review commit message as well as the actual code.
Don’t forget to praise
If in doubt, question rather than judge
Look at the whole design and the code surrounding the change, not just the change itself
There are many ways to get things done – respect the author

Ensemble Learning

Bagging (bootstrap aggregating)
A simple and straightforward way of ensembling models by combining the results of multiple models. Each model is trained on a bootstrap sample, i.e. a sample of the data drawn with replacement. Each model votes with equal weight: averaging for regression and majority vote for classification.
E.g. random forests
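
To make the idea concrete, here is a minimal sketch of bagging for classification in plain Python. The base learner (a one-feature threshold rule, `fit_stump`), the function names and the `sample_fraction` parameter are all assumptions chosen to keep the example short; a random forest uses decision trees and also samples features.

```python
import random
from collections import Counter


def fit_stump(points):
    """Fit a threshold rule on (x, label) pairs with labels 0 or 1."""
    best = None
    for threshold in sorted({x for x, _ in points}):
        for left, right in ((0, 1), (1, 0)):
            errors = sum((right if x > threshold else left) != y for x, y in points)
            if best is None or errors < best[0]:
                best = (errors, threshold, left, right)
    _, threshold, left, right = best
    return lambda x: right if x > threshold else left


def bagging_fit(points, n_models=25, sample_fraction=1.0, seed=0):
    """Train n_models stumps, each on its own bootstrap sample."""
    rng = random.Random(seed)
    size = max(1, int(len(points) * sample_fraction))
    models = []
    for _ in range(n_models):
        # Bootstrap: draw with replacement, so some points repeat and some are left out.
        sample = [rng.choice(points) for _ in range(size)]
        models.append(fit_stump(sample))
    return models


def bagging_predict(models, x):
    """Majority vote with equal weight per model."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]
```

For regression the same structure applies, with the majority vote replaced by the mean of the individual predictions.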

Boosting
Train models sequentially. Start with equally weighted data.
Increase the weights on misclassified data for the next model, and so on for each subsequent model.
E.g. AdaBoost
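
The reweighting loop described above can be sketched as follows. This is a simplified AdaBoost for one-feature data with labels in {-1, +1}; the helper names and the threshold-rule base learner are assumptions for illustration, not a production implementation.

```python
import math


def fit_weighted_stump(points, weights):
    """Return (weighted_error, predict) for the best threshold rule."""
    best = None
    for threshold in sorted({x for x, _ in points}):
        for sign in (1, -1):
            error = sum(w for (x, y), w in zip(points, weights)
                        if (sign if x > threshold else -sign) != y)
            if best is None or error < best[0]:
                best = (error, threshold, sign)
    error, threshold, sign = best
    return error, (lambda x: sign if x > threshold else -sign)


def adaboost_fit(points, n_rounds=10):
    n = len(points)
    weights = [1.0 / n] * n  # start with equally weighted data
    ensemble = []
    for _ in range(n_rounds):
        error, stump = fit_weighted_stump(points, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - error) / error)  # more accurate models get a bigger say
        ensemble.append((alpha, stump))
        # Misclassified points (y * prediction == -1) gain weight, the rest lose weight.
        weights = [w * math.exp(-alpha * y * stump(x))
                   for (x, y), w in zip(points, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def adaboost_predict(ensemble, x):
    """Weighted vote of all models."""
    score = sum(alpha * stump(x) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1
```

Unlike bagging, the models here do not vote equally: each one is weighted by how well it did on the reweighted data it was trained on.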

Stacking
Train a model (the meta-model) that takes the outputs of multiple models as its input.
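
A minimal sketch of this idea follows. It assumes the base models are already trained and are plain callables; the meta-model here is a simple lookup table that remembers the most common label for each combination of base outputs. Both choices, and all the names, are assumptions made to keep the example short. In practice the meta-model is usually trained on predictions for data the base models have not seen, to avoid leakage.

```python
from collections import Counter, defaultdict


def fit_meta_table(rows, labels):
    """Meta-model: map each tuple of base-model outputs to its most common label."""
    counts = defaultdict(Counter)
    for row, label in zip(rows, labels):
        counts[row][label] += 1
    table = {row: c.most_common(1)[0][0] for row, c in counts.items()}
    default = Counter(labels).most_common(1)[0][0]  # fallback for unseen combinations
    return lambda row: table.get(row, default)


def fit_stacking(base_models, features, labels):
    """Train the meta-model on the outputs of the base models."""
    rows = [tuple(model(x) for model in base_models) for x in features]
    meta = fit_meta_table(rows, labels)
    return lambda x: meta(tuple(model(x) for model in base_models))


# Two hypothetical base models, each looking at one feature only.
first_is_large = lambda x: int(x[0] > 5)
second_is_large = lambda x: int(x[1] > 5)

features = [(1, 1), (1, 9), (9, 1), (9, 9)]
labels = [0, 1, 1, 0]  # neither base model can predict this alone
stacked = fit_stacking([first_is_large, second_is_large], features, labels)
```

Here the meta-model learns how to combine the base outputs, which a fixed vote or average cannot do.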

Choose S3n over S3 when sharing files with HDFS

AWS S3 can be accessed from Hadoop through either the s3 or the s3n file system when exchanging files with HDFS.

s3n, the S3 native file system, stores files in their original format. s3 is the S3 block file system: block-based storage, the equivalent of HDFS implemented on top of S3. Other S3 tools cannot recognise the original format of files written through s3; they see only a bunch of block files.

However, s3n imposes a size limit of 5 GB per file, whereas s3 does not prevent users from storing files larger than 5 GB.

Besides, s3 puts block files directly into an S3 bucket and occupies the whole bucket without differentiating folders, whilst s3n puts files in their original form into a folder under the S3 bucket. Hence s3n is more flexible in this sense.

Since my test files are mostly smaller than 1 GB, hadoop fs -cp outperforms hadoop distcp in my test. Besides, s3n boasts faster transfer than s3.

# copy files from hdfs to s3n
hadoop fs -cp hdfs:// s3n://awsid:awskey@bucket/folder

A Post About Nothing in Scala

A wonderful post about Nothings (Null, null, Nil, Nothing, None, and Unit) in Scala

Matt Malone's Old-Fashioned Software Development Blog

One of the main complaints you hear about the Scala language is that it’s too complicated compared to Java. The average developer will never be able to achieve a sufficient understanding of the type system, the functional programming idioms, etc. That’s the argument. To support this position, you’ll often hear it pointed out that Scala includes several notions of nothingness (Null, null, Nil, Nothing, None, and Unit) and that you have to know which one to use in each situation. I’ve read an argument like this more than once.

It’s not as bad as all that. Yes, each of those things is part of Scala, and yes, you have to use the right one in the right situation. But the situations are so wildly different it’s not hard to figure out once you know what each of these things mean.

Null and null

First, let’s tackle Null and null. Null…
