Website Indexing is not working (front-end check)

Issue #271 new
Enrico Piccini created an issue

No description provided.

Comments (17)

  1. Enrico Piccini reporter

    @dwmarrs Yes, the search engine indexing.... Right now we have only 1 page indexed :( The things to check are:

    * check whether we have set up a robots.txt that prevents indexing
    * check the indexing meta tags on the HTML pages
    * create a patch for the "subscription" flow so that, in addition to the cookie, it also checks the user agent using `userAgent.indexOf('....')`. If the user agent is "facebookexternalhit", "Twitterbot", "LinkedInBot", "Googlebot", "Bingbot" or "YandexBot", we don't redirect to /welcome
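    For example, the bot check could look roughly like this (a minimal sketch of the idea; `isCrawlerUserAgent` and `hasSubscriptionCookie` are illustrative names, not existing code in the repo):

    ```javascript
    // Substrings identifying crawlers that should NOT be redirected to
    // /welcome (social preview bots + search engine bots, per the list above).
    var CRAWLER_AGENTS = [
      'facebookexternalhit', 'Twitterbot', 'LinkedInBot',
      'Googlebot', 'Bingbot', 'YandexBot'
    ];

    // True if the user-agent string contains any known crawler substring.
    function isCrawlerUserAgent(userAgent) {
      return CRAWLER_AGENTS.some(function (bot) {
        return userAgent.indexOf(bot) !== -1;
      });
    }

    // In the subscription flow (browser only): redirect real visitors,
    // but let crawlers through. hasSubscriptionCookie() stands in for
    // whatever the existing cookie check is called.
    if (typeof navigator !== 'undefined') {
      if (!hasSubscriptionCookie() && !isCrawlerUserAgent(navigator.userAgent)) {
        window.location.href = '/welcome';
      }
    }
    ```

    Note that `indexOf` matching is case-sensitive, so the substrings need to match the exact casing each crawler actually sends in its user-agent header.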

    Do you want to work on this issue, Dave? If so, assign it to your account.

  2. David Marrs

    @Brotrob Can you please test the tour functionality changes on http://test.udemi.org/? That will help with this problem.

    Thanks

  3. udemi repo owner

    Hey @dwmarrs - so I tested the tour functionality on test.udemi.org once on Chrome and it is looking good :)

  4. Enrico Piccini reporter

    @Brotrob @dwmarrs So, is it possible to deploy the new subscription flow to PROD? Is it stable?

    @dwmarrs Any news about the indexing problem?

  5. David Marrs

    I have merged the check to prevent redirect from the Issues map to /welcome into the develop branch just now.

    I tested it just by creating some fake user agents for the bots in Chrome dev tools. It seems OK to me; it doesn't look like it will interfere with normal visitors being redirected to /welcome.

    If we get it through test and out to prod, then we can test it with the likes of Webmaster Tools (I'm assuming we didn't want to point Webmaster Tools at our test and dev sites?). The only problem I can see is if we have the wrong strings to check against the user agents (I just googled to get the user-agent strings to check).

    HOWEVER, a thought has occurred to me as I write this: even if crawlers were redirected to /welcome, I assume they should, after that redirect, still be able to move on from there to pages like /about, and it doesn't look like /about has been indexed. So I looked in Google Webmaster Tools (Search Console) and ran Crawl > Fetch as Google. There is an issue with robots.txt blocking the static folder, which contains the JavaScript that loads the page contents in, so I have created a pull request to allow it to be crawled, assuming Enrico thinks that's safe: https://bitbucket.org/udemi/udemi-ui/pull-requests/10/allow-crawlers-to-access-javascript-in/diff

    The only other problem I can think of (once the JavaScript can be loaded) is that crawlers will be shown the top issues/policies/actions/users around where they are based (based on their IP address), so if there are no issues around there then no issues will be indexed, etc. Do we need some sort of site map for that? I imagine it could be huge. Something to think about...

  6. Enrico Piccini reporter

    Hi @dwmarrs, it's OK to index the static folder, but exclude files like "env.js" or files that contain private information. You can decide which files must be private.
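    A sketch of how that could look in robots.txt (a hedged example; the exact paths and file names are assumptions taken from this thread, not from the repo):

    ```
    # Let crawlers fetch the JavaScript needed to render pages,
    # but keep env.js (private configuration) out of the index.
    User-agent: *
    Disallow: /static/env.js
    Allow: /static/
    ```

    `Allow` is not part of the original robots.txt standard, but the major search engines (Google, Bing, Yandex) support it, and they give the more specific `Disallow: /static/env.js` rule precedence over the broader `Allow: /static/`.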

  7. David Marrs

    Hmmm.... thinking about this, I might need to remove the line

        Disallow: /views/

    from robots.txt as well. Let's deploy to live and run Webmaster Tools on it now anyway. We can always remove the line later.

    I think crawlers will need to be able to access env.js for the page to render anyway. Perhaps I could lock that access down to just Googlebot, Bingbot and Yandexbot later (if we need to allow access to /views/ anyway).
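    If we did want to lock env.js down to specific crawlers later, the robots.txt could look something like this (purely hypothetical, not in the repo):

    ```
    # Named search bots may fetch env.js so pages render for them.
    User-agent: Googlebot
    User-agent: Bingbot
    User-agent: YandexBot
    Allow: /static/env.js

    # Everyone else is kept away from it.
    User-agent: *
    Disallow: /static/env.js
    ```

    Each crawler obeys only the most specific `User-agent` group that matches it, so the named bots would ignore the `*` rules entirely.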

  8. udemi repo owner

    Very much looking forward to this feature - the more content we have, the more we should rise in the search engine rankings.

    This is a very strong value proposition for politicians, who want their web presence high up the search results :)

  9. Enrico Piccini reporter

    @dwmarrs Hey Dave, I can't tell whether you have completed this task or not. Do you only need to check that it works using Webmaster Tools on test.udemi.org, and then it's done?

  10. Enrico Piccini reporter

    Hi @dwmarrs, I think you should disallow /static and /views in the robots.txt because we don't want those files indexed....

    About dev.udemi.org and test.udemi.org: right now we have the same robots.txt as prod, so it is possible that Google indexes them if someone submits them manually to the index (through Webmaster Tools) or if other domains link to them... If not, in theory they will never be indexed. I have an idea we can implement to be sure that dev.udemi.org and test.udemi.org will never be indexed: from JavaScript code, detect the hostname (e.g. dev.udemi.org) and, if it matches, set the meta tag `<meta name="robots" content="noindex,nofollow"/>`
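    A minimal sketch of that idea (hedged; `shouldNoindex` is an illustrative name, and the hostname list is an assumption from this thread):

    ```javascript
    // Hostnames that search engines must never index.
    var NOINDEX_HOSTS = ['dev.udemi.org', 'test.udemi.org'];

    // True when the given hostname is a dev/test environment.
    function shouldNoindex(hostname) {
      return NOINDEX_HOSTS.indexOf(hostname) !== -1;
    }

    // In the browser: inject the robots meta tag into <head> as early
    // as possible so crawlers see it before any content loads.
    if (typeof document !== 'undefined' && shouldNoindex(window.location.hostname)) {
      var meta = document.createElement('meta');
      meta.name = 'robots';
      meta.content = 'noindex,nofollow';
      document.head.appendChild(meta);
    }
    ```

    One caveat: because the tag is added from JavaScript, only crawlers that execute JS (e.g. Googlebot) will see it; a belt-and-braces option would be to also serve the tag, or an `X-Robots-Tag` response header, from the server for those hostnames.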

    I think this is the last thing left to do, plus the robots.txt changes.

  11. David Marrs

    Hi @enricosoft, I have updated the pull request with some JS to change the robots meta tag as you suggested. Can this be merged and deployed now?
