I am interested in creating a search engine analyzer for Cebuano.
You have already provided a very capable algorithm for normalizing and stemming. The missing piece of the puzzle is a list of stop words that should be ignored when performing a general search. I have made a first attempt here. My approach was to translate a list of English stop words and use the results.
I am not a speaker of Cebuano and actually have no idea whether this is a reasonable list of stop words. My guess is that most languages have roughly the same set of stop words, but I cannot verify that. I am also unsure whether the list should be curated by hand. Alternatively, we could take a large Cebuano corpus and simply select the 100 most frequently used words, or something along those lines.
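To make the corpus idea concrete, here is a minimal sketch in Python of what I have in mind. The function name `top_words`, the tokenizing regex (which keeps words joined by an internal hyphen or apostrophe as one token), and the sample sentence are my own assumptions for illustration. A real run would read a large corpus and would probably apply your normalizer first.

```python
import re
from collections import Counter

# Assumption: a token is a run of letters, optionally joined by an
# internal hyphen or apostrophe. This is a guess, not a verified
# tokenization rule for Cebuano.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")


def top_words(text, n=100):
    """Return the n most frequent lowercase tokens in text."""
    tokens = TOKEN_RE.findall(text.lower())
    return [word for word, _count in Counter(tokens).most_common(n)]


# Made-up sample sentence, standing in for a real corpus.
sample = "ang iro ug ang iring sa balay ug ang tawo"
print(top_words(sample, 2))  # ['ang', 'ug']
```

The resulting list would still need a manual review, since a purely frequency-based cut can pick up common content words along with function words.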
Do you have any thoughts on this, or any inclination to help? For instance, do you have any suggestions for finding a representative corpus?