Add metafeatures to feature extraction step
Issue #3
new
In addition to regular-expression-based feature extraction, we need meta features such as the length of a tweet. See the blog post "Predicting HN upvotes using headlines" at https://www.dataquest.io/blog/predicting-upvotes/ . It has code like:
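import re
import numpy

# Our list of functions to apply.
transform_functions = [
    lambda x: len(x),                                # total characters
    lambda x: x.count(" "),                          # spaces
    lambda x: x.count("."),                          # periods
    lambda x: x.count("!"),                          # exclamation marks
    lambda x: x.count("?"),                          # question marks
    lambda x: len(x) / (x.count(" ") + 1),           # characters per word (approx.)
    lambda x: x.count(" ") / (x.count(".") + 1),     # spaces per sentence (approx.)
    lambda x: len(re.findall(r"\d", x)),             # digits
    lambda x: len(re.findall(r"[A-Z]", x)),          # uppercase ASCII letters
]
# Apply each function and put the results into a list.
# `submissions` comes from the blog post (presumably a pandas DataFrame with a "headline" column).
columns = []
for func in transform_functions:
    columns.append(submissions["headline"].apply(func))
# Convert the meta features to a numpy array (one row per headline).
meta = numpy.asarray(columns).T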
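
For our case, the same idea could be applied to tweets instead of headlines. The following is only a rough sketch using the standard library: the names `META_FEATURES` and `extract_meta_features` are hypothetical, and it assumes the tweets are available as a plain list of strings, which may not match how the extraction step currently stores them.

```python
import re

# Hypothetical (name, function) pairs, mirroring the functions in the blog post.
META_FEATURES = [
    ("length", len),
    ("spaces", lambda t: t.count(" ")),
    ("periods", lambda t: t.count(".")),
    ("exclamations", lambda t: t.count("!")),
    ("questions", lambda t: t.count("?")),
    ("chars_per_word", lambda t: len(t) / (t.count(" ") + 1)),
    ("spaces_per_sentence", lambda t: t.count(" ") / (t.count(".") + 1)),
    ("digits", lambda t: len(re.findall(r"\d", t))),
    ("capitals", lambda t: len(re.findall(r"[A-Z]", t))),
]


def extract_meta_features(tweets):
    """Return one row of meta feature values per tweet (assumed: list of str)."""
    return [[func(tweet) for _, func in META_FEATURES] for tweet in tweets]


# Example usage with two made-up tweets.
rows = extract_meta_features(["Hello World! 42", "is this on?"])
for tweet_row in rows:
    print(dict(zip((name for name, _ in META_FEATURES), tweet_row)))
```

Keeping a name next to each function makes it easier to tell the columns apart later; the resulting rows could then be joined with the regex-based features before training.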