Is the Net::HTTP Ruby gem ignoring the Content-type header in my HTTP responses?
Ruby does not set the encoding of the response body automatically (see ticket); it always tags the body as ASCII-8BIT. That is a slightly misleading name, since it actually means "arbitrary binary data". This is why you need to use force_encoding to set the correct encoding before you can transcode to other encodings.
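A minimal sketch of the relabel-then-transcode step, using a hand-built byte string in place of a real response body (the behavior is the same for what Net::HTTP returns):

```ruby
# Simulate what Net::HTTP hands back: raw bytes tagged as ASCII-8BIT,
# even though the server declared charset=utf-8.
body = "caf\xC3\xA9".force_encoding('ASCII-8BIT')
puts body.encoding.name   # ASCII-8BIT ("arbitrary binary data")

# force_encoding relabels the bytes in place without copying or converting them...
body.force_encoding('UTF-8')
puts body                 # café

# ...and only then can you transcode to other encodings.
latin1 = body.encode('ISO-8859-1')
puts latin1.bytesize      # 4 ("café" is four bytes in Latin-1)
```

Note the difference: force_encoding changes only the label on the bytes, while encode actually converts them.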
How to control encoding when POSTing through Net::HTTP?
Consider using both the enctype and accept-charset attributes in your form definition. More information is available here: http://www.w3.org/TR/html401/interact/forms.html#h-17.3
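If you are building the POST in Ruby rather than from an HTML form, you can declare the charset explicitly on the request. A sketch, assuming a hypothetical endpoint (example.com/submit):

```ruby
require 'net/http'
require 'uri'

# Hypothetical endpoint, for illustration only.
uri = URI.parse('http://example.com/submit')

req = Net::HTTP::Post.new(uri.path)
req.set_form_data('name' => 'café')   # body is URL-encoded UTF-8 bytes
# Declare the charset explicitly so the server doesn't have to guess.
req['Content-Type'] = 'application/x-www-form-urlencoded; charset=UTF-8'

puts req.body             # name=caf%C3%A9
puts req['Content-Type']

# To actually send it:
# res = Net::HTTP.start(uri.host, uri.port) { |http| http.request(req) }
```

Setting the header after set_form_data matters, because set_form_data also sets a default Content-Type without a charset parameter.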
Html wrongly encoded fetched by Nokogiri
That is covered in the 'Encoding' section of the README: http://nokogiri.org/
Strings are always stored as UTF-8 internally. Methods that return
text values will always return UTF-8 encoded strings. Methods that
return XML (like to_xml, to_html and inner_html) will return a string
encoded like the source document.
So you should convert the inner_html string manually if you want it as a UTF-8 string:
puts link.inner_html.encode('utf-8') # for 1.9.x
Ruby parsing HTTPResponse with Nokogiri
First thing: your fetch method returns a Net::HTTPResponse object, not just the body. You should pass the body to Nokogiri.
response = fetch("http://www.somewebsite.com/hahaha/")
puts response.body
noko = Nokogiri::HTML(response.body)
I've updated your script so it's runnable (below). A couple of things were undefined.
require 'nokogiri'
require 'net/http'

def fetch(uri_str, limit = 10)
  # You should choose a better exception.
  raise ArgumentError, 'HTTP redirect too deep' if limit == 0

  url = URI.parse(URI.encode(uri_str.strip))
  puts url

  # Build the GET request for the path
  headers = {}
  req = Net::HTTP::Get.new(url.path, headers)

  # Open the TCP/IP connection and send the request
  response = Net::HTTP.start(url.host, url.port) { |http|
    http.request(req)
  }

  case response
  when Net::HTTPSuccess
    # print the final location to a file
    puts "this is the location #{uri_str}"
    puts "this is the host #{url.host}"
    puts "this is the path #{url.path}"
    return response
  when Net::HTTPRedirection
    # if you get a 3xx response, follow the redirect
    puts "this is a redirect to #{response['location']}"
    return fetch(response['location'], limit - 1)
  else
    response.error!
  end
end

response = fetch("http://www.google.com/")
puts response
noko = Nokogiri::HTML(response.body)
puts noko
The script gives no error and prints the content. You may be getting a Nokogiri error due to the content you're receiving. One common problem I've encountered with Nokogiri is character encoding. Without the exact error it's impossible to tell what's going on.
I'd recommend looking at the following Stack Overflow questions:
ruby 1.9: invalid byte sequence in UTF-8 (specifically this answer)
How to convert a Net::HTTP response to a certain encoding in Ruby 1.9.1?
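One common remedy from those threads, for the classic "invalid byte sequence in UTF-8" error on Ruby 1.9: round-trip the string through UTF-16, replacing bytes that aren't valid UTF-8. A sketch with a deliberately broken string:

```ruby
# A string labeled UTF-8 that contains a byte (\xFF) which is not valid UTF-8.
raw = "valid \xFF invalid".force_encoding('UTF-8')
puts raw.valid_encoding?   # false

# Transcoding to another encoding lets the :invalid/:replace options kick in;
# encoding UTF-8 to UTF-8 is a no-op on 1.9, so we go via UTF-16 and back.
clean = raw.encode('UTF-16', invalid: :replace, replace: '?').encode('UTF-8')
puts clean                 # valid ? invalid
puts clean.valid_encoding? # true
```

On newer Rubies (2.1+), String#scrub does the same job more directly.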
How to URL encode a string in Ruby
require 'cgi'

str = "\x12\x34\x56\x78\x9a\xbc\xde\xf1\x23\x45\x67\x89\xab\xcd\xef\x12\x34\x56\x78\x9a".force_encoding('ASCII-8BIT')
puts CGI.escape(str)
# => "%124Vx%9A%BC%DE%F1%23Eg%89%AB%CD%EF%124Vx%9A"
What does Content-type: application/json; charset=utf-8 really mean?
The header denotes the media type of the content and the character encoding it uses. It is not necessarily possible to deduce either from the content itself, i.e. you can't necessarily just look at the content and know what to do with it. That's what HTTP headers are for: they tell the recipient what kind of content they're (supposedly) dealing with.
Content-type: application/json; charset=utf-8
designates the content to be in JSON format, encoded in the UTF-8 character encoding. Designating the encoding is somewhat redundant for JSON, since the default (only?) encoding for JSON is UTF-8. So in this case the receiving server apparently is happy knowing that it's dealing with JSON and assumes that the encoding is UTF-8 by default, that's why it works with or without the header.
Does this encoding limit the characters that can be in the message body?
No. You can send anything you want in the header and the body. But, if the two don't match, you may get wrong results. If you specify in the header that the content is UTF-8 encoded but you're actually sending Latin1 encoded content, the receiver may produce garbage data, trying to interpret Latin1 encoded data as UTF-8. If of course you specify that you're sending Latin1 encoded data and you're actually doing so, then yes, you're limited to the 256 characters you can encode in Latin1.
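A sketch of making the header and body agree when POSTing JSON, assuming a hypothetical endpoint (example.com/api):

```ruby
require 'net/http'
require 'json'
require 'uri'

# Hypothetical endpoint, for illustration only.
uri = URI.parse('http://example.com/api')

req = Net::HTTP::Post.new(uri.path)
req['Content-Type'] = 'application/json; charset=utf-8'
# The body really is UTF-8 bytes, so it matches what the header claims.
req.body = JSON.generate('city' => 'Málaga')

puts req.body   # {"city":"Málaga"}

# To actually send it:
# res = Net::HTTP.start(uri.host, uri.port) { |http| http.request(req) }
```

If the body were Latin1-encoded bytes instead, the header above would be lying, and the receiver would likely produce garbage, as described.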