Add a grid search for Random Forest
Can we add code to perform a grid search for RandomForestClassifier? I added some code, but it does not seem to work.
Comments (18)
-
-
- changed status to on hold
-
In my opinion, the most important (and urgent) one is max_features, which, despite its name, is defined as “The number of features to consider when looking for the best split” (and not the maximum number of features, as the name would suggest); see https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
Currently, enmapbox uses the default, which is “auto”, i.e. sqrt(p) (as recommended in Elements of Statistical Learning), but note the comment of notmatthancock: “I don't think there's a silver bullet default value that will work the best across all regression problems. Even if one were to perform a large scale study across every regression problem, we'd only find the default value that works best on average. Users would still need to conduct tuning for their particular problem.” (https://github.com/scikit-learn/scikit-learn/issues/7254). Thus, while sqrt(p) probably gives reasonable results in most cases, it is not guaranteed to be the best.
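To illustrate the point above, here is a minimal sketch of setting max_features explicitly instead of relying on the default (the dataset and all numbers are illustrative, not from this ticket; note that newer scikit-learn versions spell the default 'sqrt' rather than 'auto', but both mean sqrt(p)):

```python
import math
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data with 16 features, so sqrt(p) = 4
X, y = make_classification(n_samples=100, n_features=16, random_state=0)

# Default: sqrt(p) features are considered at each split
rf_default = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Explicit alternative: consider all p features at each split
rf_all = RandomForestClassifier(n_estimators=50, max_features=None,
                                random_state=0).fit(X, y)

print(math.isqrt(X.shape[1]))  # 4 features per split under the sqrt default
```

Whether the default or another value wins depends on the data, which is exactly why tuning is worthwhile.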
-
We could add tuning of max_features to the default code snippet, but that would slow down every RF fit. I would rather suggest that users who want to tune need to adapt the code themselves. And yes, I am aware that this might be difficult for some/most users.
-
It would slow it down if and only if that option is selected. I’m not suggesting making it mandatory.
-
We could achieve this by preparing the code for tuning, but commenting it out by default. E.g. for SVC:
svc = SVC(probability=False)
param_grid = {
    'kernel': ['rbf'],
    # 'gamma': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
    # 'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]
}
tunedSVC = GridSearchCV(cv=3, estimator=svc, scoring='f1_macro', param_grid=param_grid)
estimator = make_pipeline(StandardScaler(), tunedSVC)
-
The code is not currently commented out in SVC, and I did not find it slow.
I think the search is done on the training set and does not require the image (it is part of the fitting step, not of the prediction step: as I understand it, the evaluation to select the best parameter(s) is performed on part of the training set by k-fold CV).
The commented code would be fine for me, but obviously having these options in the GUI would be much better for reaching more users.
Note, though, that we are discussing RF in this ticket. What would the code for searching max_features in RF look like?
Also, I would include the code line for n_estimators.
-
I just used SVC above as an example of how “commenting out” would look. And yes, this only applies to fitting. For RFC, the param_grid would change to:
param_grid = {'max_features': [1, 2, 3, 4, 5, …]}
Depending on the number of features, this list could be large, which would make it slow.
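One way to keep the list short is to build it programmatically around sqrt(p) instead of enumerating every value from 1 to p. A sketch, where n_bands is a hypothetical feature count (adjust to your data):

```python
import math

# Hypothetical number of input features (e.g. spectral bands)
n_bands = 30

# Centre the candidate list on sqrt(p) instead of trying every value from 1 to p
center = round(math.sqrt(n_bands))  # sqrt(30) ≈ 5.48 -> 5
candidates = [f for f in range(center - 2, center + 3) if 1 <= f <= n_bands]

param_grid = {'max_features': candidates}
print(param_grid)  # {'max_features': [3, 4, 5, 6, 7]}
```

This keeps the grid at five fits per fold regardless of how many features the image has.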
-
Besides max_features, other users might want to tune one of the other parameters. RFC has quite a lot:
sklearn.ensemble.RandomForestClassifier
(n_estimators=100, *, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None, ccp_alpha=0.0, max_samples=None)
I’m still not sure how to deal with that in general.
-
Searching requires a strategy that should be explained in the tutorial, as having more options requires more knowledge. In this case, the search should be done around sqrt(p): initially the candidate values are spaced with steps >> 1, and once a first estimate is obtained, the search is refined around that value. This search strategy should not be part of the Classification workflow but of the Fit.
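The coarse-then-refine strategy described above could be sketched as two consecutive grid searches (all dataset sizes and grid values here are illustrative assumptions, not from the ticket):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Toy data with 20 features, so sqrt(p) is roughly 4-5
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
rfc = RandomForestClassifier(n_estimators=100, random_state=0)

# Stage 1: coarse grid with steps >> 1
coarse = GridSearchCV(rfc, {'max_features': [2, 5, 10, 20]},
                      cv=3, scoring='f1_macro')
coarse.fit(X, y)
best = coarse.best_params_['max_features']

# Stage 2: refine around the coarse winner with step 1
fine_grid = [f for f in range(best - 2, best + 3) if 1 <= f <= X.shape[1]]
fine = GridSearchCV(rfc, {'max_features': fine_grid},
                    cv=3, scoring='f1_macro')
fine.fit(X, y)
print(fine.best_params_)
```

In the Fit dialog, the user would run something like stage 1 once, then edit the grid for stage 2, rather than the tool doing both automatically.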
-
Regarding the rest of the parameters, I think you can stick to those pointed out in most references (max_features and n_estimators) and wait for additional suggestions from other users.
-
Would this then be correct for RF?
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

RFC = RandomForestClassifier(oob_score=True)
param_grid = {
    'max_features': [2, 3, 4],
    'n_estimators': [300, 350, 400, 450, 500, 1000]
}
tunedRFC = GridSearchCV(cv=3, estimator=RFC, scoring='f1_macro', param_grid=param_grid)
estimator = make_pipeline(StandardScaler(), tunedRFC)
How can we save or at least print the actual selected values of max_features and n_estimators?
-
Code block looks good, just try it out.
All the detailed results can be found inside the final model, e.g.
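For example, with the pipeline structure from the snippet above, the selected values can be read back from the fitted GridSearchCV step. A sketch (smaller grids and toy data for speed; the step name 'gridsearchcv' is what make_pipeline derives from the lowercased class name):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=150, n_features=6, random_state=0)

rfc = RandomForestClassifier(oob_score=True, random_state=0)
param_grid = {'max_features': [2, 3, 4], 'n_estimators': [100, 200]}
tunedRFC = GridSearchCV(cv=3, estimator=rfc, scoring='f1_macro',
                        param_grid=param_grid)
estimator = make_pipeline(StandardScaler(), tunedRFC)
estimator.fit(X, y)

# make_pipeline names each step after the lowercased class name
search = estimator.named_steps['gridsearchcv']
print(search.best_params_)  # the selected max_features and n_estimators
print(search.best_score_)   # mean cross-validated f1_macro of the best combo
# Full per-combination results live in search.cv_results_
```

So everything needed for reporting is already stored on the fitted object; it just has to be surfaced.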
-
Thanks, I was not aware of this. Still, it would be good to have the final searched parameter values in the *acass.html file.
-
Scikit-learn estimators can be arbitrarily complex, so reporting special model parameters is not a suitable option. I would rather create a separate HTML file reporting all the details of an estimator, as shown in the screenshot above.
-
v3.8 will output a human-readable JSON file next to the *.pkl model file
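The actual v3.8 format is not shown in this ticket, but persisting the searched parameters could look roughly like this (the filename and dictionary contents are hypothetical):

```python
import json

# Hypothetical values, e.g. taken from search.best_params_ after fitting
best_params = {'max_features': 3, 'n_estimators': 400}

# Write a human-readable JSON file next to the pickled model
with open('model.pkl.json', 'w') as f:
    json.dump(best_params, f, indent=2)

print(open('model.pkl.json').read())
```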
-
- changed status to closed
-
- removed version
Removing version: 0.3 (automated comment)
Hi @Thang Hanam, sorry for the very late response. Yes, this would be possible. Which parameters do you want to tune?