Multiple Robots.Txt for Subdomains in Rails

Actually, you probably want to register a MIME type in config/initializers/mime_types.rb and serve the file from a respond_to block, so it isn't returned as 'text/html':

Mime::Type.register "text/plain", :txt

Then, your routes would look like this:

map.robots '/robots.txt', :controller => 'robots', :action => 'robots'

For Rails 3:

match '/robots.txt' => 'robots#robots'

and the controller would look something like this (put the file(s) wherever you like):

class RobotsController < ApplicationController
  def robots
    # Get the subdomain from the request and escape it, so it can't be
    # abused for path traversal.
    subdomain = request.subdomains.first.to_s.gsub(/[^a-z0-9\-]/i, '')
    robots = File.read(Rails.root.join("config", "robots.#{subdomain}.txt"))
    respond_to do |format|
      format.txt { render :text => robots, :layout => false }
    end
  end
end

At the risk of overengineering it, I might even be tempted to cache the file read operation...
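A minimal sketch of that caching, assuming a process-local hash is acceptable (the ROBOTS_CACHE constant is made up for illustration):

class RobotsController < ApplicationController
  # Hypothetical process-level cache; entries live until the process restarts.
  ROBOTS_CACHE = {}

  def robots
    subdomain = request.subdomains.first.to_s.gsub(/[^a-z0-9\-]/i, '')
    # Read each subdomain's file once, then serve the memoized copy.
    robots = ROBOTS_CACHE[subdomain] ||=
      File.read(Rails.root.join("config", "robots.#{subdomain}.txt"))
    respond_to do |format|
      format.txt { render :text => robots, :layout => false }
    end
  end
end

Note the cache is per-process and never expires, so edits to the files require a restart; something like Rails.cache would be the next step up.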

Oh, yeah, you'll almost certainly have to remove/move the existing 'public/robots.txt' file, since static files in public/ are served before your routes are ever consulted.

Astute readers will notice that you can easily substitute RAILS_ENV for the subdomain...
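For example, a one-line sketch of that variant (using Rails.env, the newer spelling of RAILS_ENV):

# Reads config/robots.production.txt, config/robots.development.txt, etc.
robots = File.read(Rails.root.join("config", "robots.#{Rails.env}.txt"))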

Googlebot substitutes the links of a Rails app with the subdomain

Robots.txt will block it just fine; it's just important to put it in place BEFORE you publish the site, because Google crawls quickly. Some search engines ignore robots.txt, though, so the best thing to do is not to have subdomains that don't really fit your situation. I recommend reading up on the true purpose of subdomains: you should not be serving the same site on different domains. Either use a 301 redirect or put different content on the different (sub)domains. Unless stats.abc.com contains different material, it shouldn't be a subdomain. What exactly do you need so many subdomains for?

You could also detect the user agent and, if it's a bot, return a 404.
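A rough sketch of that, assuming a Rails app and a before_filter (the bot pattern is illustrative, not an exhaustive list):

class ApplicationController < ActionController::Base
  # Illustrative pattern only; real crawler detection needs a fuller list.
  BOT_PATTERN = /googlebot|bingbot|slurp|baiduspider/i

  before_filter :block_bots

  private

  def block_bots
    if request.user_agent =~ BOT_PATTERN
      render :nothing => true, :status => 404
    end
  end
end

Keep in mind that user agents are trivially spoofed, so this only deters well-behaved crawlers.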

Robots.txt and sub-folders

This robots.txt would be sufficient; you don’t have to list anything that comes after /_sub/:

User-agent: *
Disallow: /_sub/

This would disallow bots (those that honor the robots.txt) from crawling any URL whose path starts with /_sub/. But that doesn’t necessarily stop these bots from indexing the URLs themselves (e.g., listing them in their search results).

Ideally you would redirect from http://example.com/_sub/ex1/ to http://example1.com/ with HTTP status code 301. How that works depends on your server (for Apache, you could use a .htaccess file). Then everyone ends up on the canonical URL for your site.
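A minimal .htaccess sketch of that redirect, assuming Apache with mod_alias (the paths and domains are the examples from above):

# Permanently move everything under /_sub/ex1/ to the standalone domain.
RedirectMatch 301 ^/_sub/ex1/(.*)$ http://example1.com/$1

mod_rewrite would work just as well; the important part is the 301 status, so search engines transfer the listing to the canonical URL.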

How to remove a subdomain from the Google index when it links to the main domain

You can use a dynamic robots.txt for this purpose. Something like this...

httpd.conf (.htaccess):

RewriteEngine On
RewriteRule ^/robots\.txt$ /var/www/myweb/robots.php [L]

robots.php:

<?php
header('Content-type: text/plain');

// Serve a blocking robots.txt on the CDN subdomain only.
if ($_SERVER['HTTP_HOST'] == 'cdn.myweb.com') {
    echo "User-agent: *\n";
    echo "Disallow: /\n";
} else {
    // Fall through to the regular robots.txt for the main domain.
    include("./robots.txt");
}

Can a relative sitemap URL be used in a robots.txt?

According to the official documentation on sitemaps.org, it needs to be a full URL:

You can specify the location of the Sitemap using a robots.txt file. To do this, simply add the following line including the full URL to the sitemap:

Sitemap: http://www.example.com/sitemap.xml

