By Todd Rowan


2019-03-14 21:28:14

In May of last year we retired a domain and redirected it to a new one. This was done to consolidate two sites from a company merger into one site.

We shut down the CMS for site1.com and pointed its DNS at site2.com. We built a redirect engine with over a thousand rules to send any request for site1.com to the appropriate page on site2.com, so it is not possible to get a 200 result from site1.com: you always get a 301, which resolves to a 200 or a 404 at site2.com.

We are now ten months past the site's retirement, and I still get 40K+ daily requests from bots on site1.com (80% from Bing, but everyone else is in there too). There are no links anywhere on site2.com that reference site1.com, and all sitemaps reference site2.com.

If you search on our primary keywords that were on site1.com before the migration, we still rank on the first page with site2.com URLs, so SEO there is not a problem.

I have other site consolidation projects on the way and I do not want to have to spin up additional resources just to handle redirects for bots.

We currently 301 redirect site1.com/robots.txt to site2.com/robots.txt. Should I configure my server to serve a global Disallow on site1.com/robots.txt instead? That shouldn't affect site2 crawling, nor should it affect SEO, correct?
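
Something like this is what I have in mind, assuming Apache (the local file name is a placeholder):

    # Serve a local robots.txt for site1.com instead of redirecting it
    RewriteEngine On
    RewriteCond %{HTTP_HOST} ^(www\.)?site1\.com$ [NC]
    RewriteRule ^/?robots\.txt$ /site1-robots.txt [L]

where /site1-robots.txt just contains a global Disallow:

    User-agent: *
    Disallow: /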

In short, how can I get the bots to stop crawling site1?


@Chris Rutherfurd 2019-03-15 08:19:34

You are never going to be able to stop bots completely from trying to access pages on the old domain. You have done the right thing by 301 redirecting the old pages to the relevant new ones, as this generally passes the old pages' ranking on when the ranking of the new site is calculated.

The hard part is that, with a site this large, there are almost certainly external third-party links pointing back to the old site, and when a crawler follows one of those links, the URL gets added to the index again for re-crawling. There is no way to stop this entirely, because retired domains and pages quite often reactivate in the future, whether under the original site owner or under a new owner with a new subject matter.

As long as you maintain the old domain name and 301 redirect requests for pages that are still valid but have moved, you are doing all you really can. Over time the old links will organically disappear as the pages linking to yours get deleted or archived, or as webmasters update their links to point to the new content.

Additionally, you mention that there are some pages for which you return a 404 error. If you don't plan on restoring those pages, you would be better off returning a 410 Gone status code: it tells crawlers that the page is permanently gone and that you have no intention of bringing it back. That often gets the page removed from the index (and from Google in general), and when crawlers encounter the link again later and receive another 410, the page won't be re-added to the index at all.
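
For example, assuming Apache, mod_alias can return a 410 directly (a sketch; the paths here are hypothetical):

    # Return 410 Gone for pages that are permanently removed
    Redirect gone /discontinued-page.html
    RedirectMatch gone ^/old-section/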

@Todd Rowan 2019-03-15 15:07:52

Thanks for this. I guess I'll just ride it out. RE: 404s, I didn't mean to say that we had pages that were removed, just that requests to the old domain that would have 404'd there now 404 on the new domain.

@Stephen Ostermiller 2019-04-14 09:51:03

You shouldn't be redirecting the entire domain if you don't have replacements for all the old pages. It is appropriate for users and bots to see 404 errors for pages on the old domain that have been removed, and when a page has no equivalent on the new domain, it is less confusing for users to get a 404 at the old domain URL with a custom message.

As other answers have said, bots will never stop wanting to crawl the old domain. You could serve a separate robots.txt, but that isn't really going to address the real problem: only some of the old domain is actually replaced by the new site. I'd take the approach of redirecting just the specific pages on the old site for which you have exact replacements.

On the technical side, I'd host the old domain in its own virtual host and log its 404s to their own log file, so you can process them separately from your main site's logs and ignore them.

<VirtualHost *:80>
    # Give the retired domain its own vhost so its traffic is isolated
    ServerName www.old-site.example
    ServerAlias old-site.example
    DocumentRoot "/var/www/old-site"
    # Separate logs keep old-domain noise out of the main site's reports
    CustomLog /var/log/apache/old-site.example-access.log combined
    ErrorLog /var/log/apache/old-site.example-error.log
</VirtualHost>

Then add specific redirect rules either directly in that virtual host or in the .htaccess file in the old site's document root (for bulk mappings, see the RewriteMap sketch after this list).

  • Redirect only the home page:

    RedirectMatch 301 ^/$ https://newsite.example/
    
  • Redirect specific pages for which you have equivalents:

    Redirect 301 /old-page.html https://newsite.example/similar-page.html
    Redirect 301 /old-content.html https://newsite.example/replacement-content.html
    ...
    

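With over a thousand rules (as in the question), a RewriteMap lookup file may be easier to maintain than individual Redirect lines. A sketch, assuming mod_rewrite and hypothetical file paths (note that RewriteMap must be declared in the server or virtual host config, not in .htaccess):

    RewriteEngine On
    # One "old-path new-URL" pair per line in the map file
    RewriteMap oldnew txt:/etc/apache2/old-site-redirects.map
    # Redirect only when the requested path has an entry in the map
    RewriteCond ${oldnew:%{REQUEST_URI}} !=""
    RewriteRule ^ ${oldnew:%{REQUEST_URI}} [R=301,L]

where /etc/apache2/old-site-redirects.map contains lines such as:

    /old-page.html https://newsite.example/similar-page.html
    /old-content.html https://newsite.example/replacement-content.html
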
Use a custom 404 handler for the old site. It should explain that the old site is shut down and that it was bought by the new site, but that not everything that was available on the old site is still available on the new site.

ErrorDocument 404 /old-site-gone.html

It could even use a meta refresh tag to redirect to the home page of the new site after some number of seconds:

<meta http-equiv="refresh" content="10; url=https://newsite.example/">
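
Putting the pieces together, old-site-gone.html could be as simple as this (a minimal sketch; the wording and the 10-second delay are placeholders):

    <!DOCTYPE html>
    <html>
    <head>
      <title>This site has been retired</title>
      <!-- Optionally send visitors on to the new home page after 10 seconds -->
      <meta http-equiv="refresh" content="10; url=https://newsite.example/">
    </head>
    <body>
      <h1>This site has been shut down</h1>
      <p>Its content is now part of <a href="https://newsite.example/">newsite.example</a>.
      The page you requested was not carried over, but you may find similar content
      on the new site. You will be taken to its home page in a few seconds.</p>
    </body>
    </html>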

@Todd Rowan 2019-04-16 21:57:56

Just to be clear, we do have replacements for all content. No valid URL from the old domain results in a 404 on redirect. URLs that would have 404'd at the old domain just 404 at the new one, but that number is small. The main problem is that the bots haven't grasped the fact that all valid URLs at the old site now 301 to valid URLs on the new site. It's been 11 months and they just keep hammering away getting the same result day after day (I'm talking to you, bing.com!). Machine learning is definitely not in use here.

@Stephen Ostermiller 2019-04-16 22:29:30

Why are those hits a problem? Redirects are generally very "cheap" to serve. They don't take many server resources. I thought that the problem was that you were seeing tons of hits in your error reporting. If that isn't the case, I'm not sure why you care that bots are interested in the old domain.
