Multiple robots.txt files for subdomains in Rails
Actually, you probably want to register a MIME type in mime_types.rb and serve the file in a respond_to block so it isn't returned as 'text/html':
Mime::Type.register "text/plain", :txt
Then, your routes would look like this:
map.robots '/robots.txt', :controller => 'robots', :action => 'robots'
For Rails 3:
match '/robots.txt' => 'robots#robots'
and the controller would look something like this (put the file(s) wherever you like):
class RobotsController < ApplicationController
  def robots
    subdomain = request.subdomain # escape/sanitize before using in a file path
    robots = File.read(RAILS_ROOT + "/config/robots.#{subdomain}.txt")
    respond_to do |format|
      format.txt { render :text => robots, :layout => false }
    end
  end
end
At the risk of overengineering it, I might even be tempted to cache the file read operation.

Oh, and you'll almost certainly have to remove/move the existing 'public/robots.txt' file.

Astute readers will notice that you can easily substitute RAILS_ENV for subdomain...
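The caching idea above can be sketched as a simple per-process memoization; the helper name and cache structure here are illustrative, not part of the original answer:

```ruby
# Hypothetical sketch: memoize robots.txt contents per subdomain so each
# file is read from disk at most once per process.
ROBOTS_CACHE = {}

def robots_for(subdomain, root = ".")
  # File.basename guards against path traversal in the subdomain value
  key = File.basename(subdomain.to_s)
  ROBOTS_CACHE[key] ||= File.read(File.join(root, "config", "robots.#{key}.txt"))
end
```

In the controller you would then call `robots_for(request.subdomain, RAILS_ROOT)` instead of reading the file directly; note that a process restart is needed to pick up edits to the files.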
Googlebot substitutes the links of Rails app with subdomain
Robots.txt will block it just fine. It's just important to remember to do it BEFORE you publish a site - Google is pretty fast. Some search engines ignore robots.txt. The best thing to do is not to have subdomains that don't really fit your situation. I recommend reading up on the true purpose of subdomains. You should not be serving the same site on different domains. You should use a 301 redirect or have different content on different (sub)domains... Unless stats.abc.com contains different material, it shouldn't be a subdomain. What exactly do you need so many subdomains for?
You could also detect the user-agent and, if it's a bot, return a 404.
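A minimal sketch of such a check; the pattern list is an illustrative guess (real crawler user-agents vary, and the header can be spoofed), not an exhaustive one:

```ruby
# Crude bot detection by User-Agent substring match.
BOT_PATTERN = /bot|crawl|spider|slurp/i

def bot?(user_agent)
  !!(user_agent.to_s =~ BOT_PATTERN)
end

# In a Rails controller you might then do something like:
#   head :not_found if bot?(request.user_agent)
```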
Robots.txt and sub-folders
This robots.txt would be sufficient; you don’t have to list anything that comes after /_sub/:
User-agent: *
Disallow: /_sub/
This would disallow bots (who honor the robots.txt) from crawling any URL whose path starts with /_sub/. But that doesn’t necessarily stop these bots from indexing the URL itself (e.g., listing it in their search results).
Ideally you would redirect from http://example.com/_sub/ex1/ to http://example1.com/ with HTTP status code 301. How that works depends on your server (for Apache, you could use a .htaccess file). Then everyone ends up on the canonical URL for your site.
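Assuming Apache with mod_rewrite enabled, a .htaccess sketch for such a 301 redirect might look like this (the domain names are the placeholders from the example above):

```apache
# 301-redirect everything under /_sub/ex1/ to the root of example1.com,
# preserving the rest of the path.
RewriteEngine On
RewriteRule ^_sub/ex1/(.*)$ http://example1.com/$1 [R=301,L]
```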
How to remove subdomain from google index, which links to the main domain
You can use dynamic robots.txt for this purpose.
Something like this...
httpd.conf (.htaccess):
RewriteRule /robots\.txt$ /var/www/myweb/robots.php
robots.php:
<?php
header('Content-type: text/plain');

if ($_SERVER['HTTP_HOST'] == 'cdn.myweb.com') {
    echo "User-agent: *\n";
    echo "Disallow: /\n";
} else {
    include("./robots.txt");
}
Can a relative sitemap url be used in a robots.txt?
According to the official documentation on sitemaps.org it needs to be a full URL:
You can specify the location of the Sitemap using a robots.txt file. To do this, simply add the following line including the full URL to the sitemap:
Sitemap: http://www.example.com/sitemap.xml