Make a grid search for Random Forest

Issue #160 closed
Thang Hanam created an issue

Can we add the code to perform a grid search for RandomForestClassifier? I added some code myself, but it doesn't seem to work.

Comments (18)

  1. Andreas Janz

    Hi @Thang Hanam, sorry for the very late response. Yes, this would be possible. Which parameters do you want to tune?

  2. Agustin Lobo

    In my opinion, the most important (and urgent) one is max_features, which despite its name is defined as “The number of features to consider when looking for the best split” (and not the maximum number of features, as the name would suggest) (see https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)

    Currently, EnMAP-Box uses the default, which is “auto” and thus sqrt(p) (as recommended in Elements of Statistical Learning). But note the comment of notmatthancock: “I don't think there's a silver bullet default value that will work the best across all regression problems. Even if one were to perform a large scale study across every regression problem, we'd only find the default value that works best on average. Users would still need to conduct tuning for their particular problem.” (https://github.com/scikit-learn/scikit-learn/issues/7254). Thus, while sqrt(p) probably gives reasonable results in most cases, it is not guaranteed to be the best.
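
    As a quick illustration of what that default means (the band count p = 177 here is made up):

    import math
    p = 177                       # hypothetical number of input features (bands)
    print(max(1, math.isqrt(p)))  # 'auto'/'sqrt' considers ~13 features per split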

  3. Andreas Janz

    We could add tuning of max_features to the default code snippet, but that would slow down every RF fit. I would rather suggest that users who want to tune adapt their code themselves. And yes, I am aware that this might be difficult for some/most users.

  4. Agustin Lobo

    It would only slow it down if that option is selected. I’m not suggesting making it mandatory.

  5. Andreas Janz

    We could achieve this by preparing the code for tuning, but commenting it out by default. E.g. for SVC:

    from sklearn.svm import SVC
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    svc = SVC(probability=False)
    # grid prepared for tuning; uncomment the parameters to enable the search
    param_grid = {
        'kernel': ['rbf'],
        # 'gamma': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
        # 'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]
    }
    tunedSVC = GridSearchCV(cv=3, estimator=svc, scoring='f1_macro', param_grid=param_grid)
    estimator = make_pipeline(StandardScaler(), tunedSVC)

  6. Agustin Lobo

    The code is not currently commented out in SVC and I did not find it slow.

    I think the search is done on the training set and does not require the image: it is part of the fitting step, not of the prediction step. As I understand it, the evaluation to select the best parameter(s) is done on parts of the training set via k-fold CV (see the sketch after this comment).

    The commented code would be fine for me, but obviously having these options in the GUI would be much better for reaching out to more users.

    Note though that we are discussing RF in this ticket. What would the code for searching max_features in RF look like?

    Also, I would include the code line for n_estimators.
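
    A minimal self-contained sketch of this point (synthetic data; the names are illustrative, not EnMAP-Box code):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    # tuning happens entirely at fit time, on the training samples
    X_train, y_train = make_classification(n_samples=300, n_features=10, random_state=0)
    search = GridSearchCV(
        estimator=RandomForestClassifier(random_state=0),
        param_grid={'max_features': [2, 3, 4]},
        cv=3,  # 3-fold CV inside the training set selects the best value
    )
    search.fit(X_train, y_train)  # no image involved here
    # prediction later simply uses the refit best model, e.g.:
    # search.predict(X_image_pixels)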

  7. Andreas Janz

    I just used SVC above as an example of how “commenting out” would look. And yes, this only applies to fitting. For RFC, the param_grid would change to:

    param_grid = {'max_features': [1, 2, 3, 4, 5, …]}

    Depending on the number of features, this list could be large, which would make it slow.
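
    For completeness, such a full grid could be built programmatically (a sketch; the feature count is made up):

    n_features = 50  # hypothetical number of features
    param_grid = {'max_features': list(range(1, n_features + 1))}
    # 50 candidates x 3 folds = 150 fits, hence the slowdown mentioned above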

  8. Andreas Janz

    Besides max_features, other users might want to tune one of the other parameters. RFC has quite a lot:

    sklearn.ensemble.RandomForestClassifier(n_estimators=100, *, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None, ccp_alpha=0.0, max_samples=None)

    I’m still not sure how to deal with that in general.

  9. Agustin Lobo

    Searching requires a strategy that should be explained in the tutorial, as having more options requires more knowledge. In this case, the search has to be done around sqrt(p): initially the candidate values are spaced by steps >> 1, and once a first estimate is obtained, the search is refined around that value (a sketch follows below). This search strategy should not be done in the Classification workflow but in the Fit step.
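
    A sketch of that coarse-to-fine strategy (synthetic data; the step sizes are illustrative):

    import math
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    X_train, y_train = make_classification(n_samples=300, n_features=100, random_state=0)
    center = max(1, math.isqrt(100))  # start the search around sqrt(p)

    # stage 1: coarse grid with steps >> 1
    coarse = GridSearchCV(RandomForestClassifier(random_state=0), cv=3,
                          param_grid={'max_features': [max(1, center - 5), center, center + 5]})
    coarse.fit(X_train, y_train)
    best = coarse.best_params_['max_features']

    # stage 2: refine with step 1 around the first estimate
    fine = GridSearchCV(RandomForestClassifier(random_state=0), cv=3,
                        param_grid={'max_features': list(range(max(1, best - 2), best + 3))})
    fine.fit(X_train, y_train)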

  10. Agustin Lobo

    Regarding the rest of the parameters, I think you can stick to those pointed out in most references (max_features and n_estimators) and wait for additional suggestions from other users.

  11. Agustin Lobo

    Would this then be correct for RF?

    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import GridSearchCV
    from sklearn.preprocessing import StandardScaler
    from sklearn.ensemble import RandomForestClassifier

    RFC = RandomForestClassifier(oob_score=True)
    param_grid = {
        'max_features': [2, 3, 4],
        'n_estimators': [300, 350, 400, 450, 500, 1000]
    }
    tunedRFC = GridSearchCV(cv=3, estimator=RFC, scoring='f1_macro', param_grid=param_grid)
    estimator = make_pipeline(StandardScaler(), tunedRFC)
    

    How can we save or at least print the actual selected values of max_features and n_estimators?

  12. Andreas Janz

    Code block looks good, just try it out.

    All the detailed results can be found inside the final model, e.g.
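
    A minimal sketch (tunedRFC and estimator are the objects from the code block above; the printed values are made up):

    estimator.fit(X_train, y_train)  # fits the scaler and runs the grid search
    print(tunedRFC.best_params_)     # e.g. {'max_features': 3, 'n_estimators': 400}
    print(tunedRFC.best_score_)      # mean cross-validated f1_macro of that grid point
    print(tunedRFC.cv_results_)      # full per-candidate details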

  13. Agustin Lobo

    Thanks, I was not aware of this. Anyway, it would be good to have the values of the finally selected parameters in the *acass.html file.

  14. Andreas Janz

    Scikit-learn estimators can be arbitrarily complex, so reporting specific model parameters is not a suitable option. I would rather create a separate HTML file reporting all the details of an estimator, like shown above.
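
    One possible way to generate such a report, assuming a recent scikit-learn (estimator_html_repr exists since version 0.23; the output path is hypothetical):

    from sklearn.utils import estimator_html_repr

    with open('estimator_report.html', 'w') as f:
        f.write(estimator_html_repr(estimator))  # renders the pipeline structure as HTML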
