Make a grid search for Random Forest

Issue #160 closed
Thang Hanam created an issue

Can we add the code to perform a grid search for RandomForestClassifier? I added some code myself, but it doesn't seem to work.

Comments (18)

  1. Andreas Janz

    Hi @Thang Hanam, sorry for the very late response. Yes, this would be possible. Which parameters do you want to tune?

  2. Agustin Lobo

    In my opinion, the most important (and urgent) one is max_features, which despite its name is defined as “The number of features to consider when looking for the best split” (and not the maximum number of features, as the name would suggest) (see https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)

    Currently, EnMAP-Box uses the default, which is “auto” and thus sqrt(p) (as recommended in Elements of Statistical Learning). But note the comment of notmatthancock: “I don't think there's a silver bullet default value that will work the best across all regression problems. Even if one were to perform a large scale study across every regression problem, we'd only find the default value that works best on average. Users would still need to conduct tuning for their particular problem.” (https://github.com/scikit-learn/scikit-learn/issues/7254). Thus, while sqrt(p) probably gives reasonable results in most cases, it is not guaranteed to be the best.
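
    As a quick illustration of what that default means (the band count p = 177 here is made up):

    import math
    p = 177                       # hypothetical number of input features (bands)
    print(max(1, math.isqrt(p)))  # 'auto'/'sqrt' considers ~13 features per split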

  3. Andreas Janz

    We could add tuning of max_features to the default code snippet, but that would slow down every RF fit. I would rather suggest that users who want to tune adapt their code themselves. And yes, I am aware that this might be difficult for some/most users.

  4. Agustin Lobo

    It would only slow it down if that option is selected. I’m not suggesting making it mandatory.

  5. Andreas Janz

    We could achieve this by preparing the code for tuning, but commenting it out by default. E.g. for SVC:

    from sklearn.svm import SVC
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    svc = SVC(probability=False)
    # grid prepared for tuning; uncomment the parameters to enable the search
    param_grid = {
        'kernel': ['rbf'],
        # 'gamma': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
        # 'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]
    }
    tunedSVC = GridSearchCV(cv=3, estimator=svc, scoring='f1_macro', param_grid=param_grid)
    estimator = make_pipeline(StandardScaler(), tunedSVC)

  6. Agustin Lobo

    The code is not currently commented out in SVC and I did not find it slow.

    I think the search is done on the training set and does not require the image: it is part of the fitting step, not of the prediction step. As I understand it, the evaluation to select the best parameter(s) is done on parts of the training set via k-fold CV (see the sketch after this comment).

    The commented code would be fine for me, but obviously having these options in the GUI would be much better for reaching out to more users.

    Note though that we are discussing RF in this ticket. What would the code for searching max_features in RF look like?

    Also, I would include the code line for n_estimators.
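
    A minimal self-contained sketch of this point (synthetic data; the names are illustrative, not EnMAP-Box code):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    # tuning happens entirely at fit time, on the training samples
    X_train, y_train = make_classification(n_samples=300, n_features=10, random_state=0)
    search = GridSearchCV(
        estimator=RandomForestClassifier(random_state=0),
        param_grid={'max_features': [2, 3, 4]},
        cv=3,  # 3-fold CV inside the training set selects the best value
    )
    search.fit(X_train, y_train)  # no image involved here
    # prediction later simply uses the refit best model, e.g.:
    # search.predict(X_image_pixels)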

  7. Andreas Janz

    I just used SVC above as an example of how “commenting out” would look. And yes, this only applies to fitting. For RFC, the param_grid would change to:

    param_grid = {'max_features': [1, 2, 3, 4, 5, …]}

    Depending on the number of features, this list could be large, which would make it slow.
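
    For completeness, such a full grid could be built programmatically (a sketch; the feature count is made up):

    n_features = 50  # hypothetical number of features
    param_grid = {'max_features': list(range(1, n_features + 1))}
    # 50 candidates x 3 folds = 150 fits, hence the slowdown mentioned above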

  8. Andreas Janz

    Besides max_features, other users might want to tune one of the other parameters. RFC has quite a lot:

    sklearn.ensemble.RandomForestClassifier(n_estimators=100, *, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None, ccp_alpha=0.0, max_samples=None)

    I’m still not sure how to deal with that in general.

  9. Agustin Lobo

    Searching requires a strategy that should be explained in the tutorial, as having more options requires more knowledge. In this case, the search has to be done around sqrt(p): initially the candidate values are spaced by steps >> 1, and once a first estimate is obtained, the search is refined around that value (a sketch follows below). This search strategy should not be done in the Classification workflow but in the Fit step.
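
    A sketch of that coarse-to-fine strategy (synthetic data; the step sizes are illustrative):

    import math
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    X_train, y_train = make_classification(n_samples=300, n_features=100, random_state=0)
    center = max(1, math.isqrt(100))  # start the search around sqrt(p)

    # stage 1: coarse grid with steps >> 1
    coarse = GridSearchCV(RandomForestClassifier(random_state=0), cv=3,
                          param_grid={'max_features': [max(1, center - 5), center, center + 5]})
    coarse.fit(X_train, y_train)
    best = coarse.best_params_['max_features']

    # stage 2: refine with step 1 around the first estimate
    fine = GridSearchCV(RandomForestClassifier(random_state=0), cv=3,
                        param_grid={'max_features': list(range(max(1, best - 2), best + 3))})
    fine.fit(X_train, y_train)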

  10. Agustin Lobo

    Regarding the rest of the parameters, I think you can stick to those pointed out in most references (max_features and n_estimators) and wait for additional suggestions from other users.

  11. Agustin Lobo

    Would this then be correct for RF?

    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import GridSearchCV
    from sklearn.preprocessing import StandardScaler
    from sklearn.ensemble import RandomForestClassifier

    RFC = RandomForestClassifier(oob_score=True)
    param_grid = {
        'max_features': [2, 3, 4],
        'n_estimators': [300, 350, 400, 450, 500, 1000]
    }
    tunedRFC = GridSearchCV(cv=3, estimator=RFC, scoring='f1_macro', param_grid=param_grid)
    estimator = make_pipeline(StandardScaler(), tunedRFC)
    

    How can we save or at least print the actual selected values of max_features and n_estimators?

  12. Andreas Janz

    Code block looks good, just try it out.

    All the detailed results can be found inside the final model, e.g.
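
    A minimal sketch (tunedRFC and estimator are the objects from the code block above; the printed values are made up):

    estimator.fit(X_train, y_train)  # fits the scaler and runs the grid search
    print(tunedRFC.best_params_)     # e.g. {'max_features': 3, 'n_estimators': 400}
    print(tunedRFC.best_score_)      # mean cross-validated f1_macro of that grid point
    print(tunedRFC.cv_results_)      # full per-candidate details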

  13. Agustin Lobo

    Thanks, I was not aware of this. Anyway, it would be good to have the values of the finally selected parameters in the *acass.html file.

  14. Andreas Janz

    Scikit-learn estimators can be arbitrarily complex, so reporting specific model parameters is not a suitable option. I would rather create a separate HTML file reporting all the details of an estimator, like shown above.
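
    One possible way to generate such a report, assuming a recent scikit-learn (estimator_html_repr exists since version 0.23; the output path is hypothetical):

    from sklearn.utils import estimator_html_repr

    with open('estimator_report.html', 'w') as f:
        f.write(estimator_html_repr(estimator))  # renders the pipeline structure as HTML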
