Word Document.Saveas Ignores Encoding, When Calling Through Ole, from Ruby or Vbs

Word Document.SaveAs ignores encoding, when calling through OLE, from Ruby or VBS

My solution was to open the HTML file using the same character set, as Word used to save it.
I also added a whitelist filter (Sanitize), to clean up the HTML. Further cleaning is done using Nokogiri, which Sanitize also rely on.

require 'sanitize'

# ... add some code converting a Word file to HTML.

# Post export cleanup.
html_file = File.open(html_file_name, "r:windows-1252:utf-8")
html = '<!DOCTYPE html>' + html_file.read()
html_document = Nokogiri::HTML::Document.parse(html)
Sanitize.new(Sanitize::Config::RESTRICTED).clean_node!(html_document)
html_document.css('html').first['lang'] = 'en-US'
html_document.css('meta[name="Generator"]').first.remove()

# ... add more cleaning up of Words HTML noise.

sanitized_html = html_document.to_html({:encoding => 'utf-8', :indent => 0})
# writing output to (new) file
sanitized_html_file_name = word_file_name.sub(/(.*)\..*$/, '\1.html')
File.open(sanitized_html_file_name, 'w:UTF-8') do |f|
f.write sanitized_html
end

HTML Sanitizer: https://github.com/rgrove/sanitize/

HTML parser and modifier: http://nokogiri.org/

In Word 2010 there is a new method, SaveAs2: http://msdn.microsoft.com/en-us/library/ff836084(v=office.14).aspx

I haven't tested SaveAs2, since I don't have Word 2010.

VBS Application.Outlook concurrent usage

Outlook uses an STA model, so your calls will be marshaled by the dispatcher to the main thread. There is nothing suspicious there, but I'd recommend creating any scheduling mechanism because the target application will not be able to handle your calls simultaneously.

Be aware, Microsoft does not currently recommend, and does not support, Automation of Microsoft Office applications from any unattended, non-interactive client application or component (including ASP, ASP.NET, DCOM, and NT Services), because Office may exhibit unstable behavior and/or deadlock when Office is run in this environment.

If you are building a solution that runs in a server-side context, you should try to use components that have been made safe for unattended execution. Or, you should try to find alternatives that allow at least part of the code to run client-side. If you use an Office application from a server-side solution, the application will lack many of the necessary capabilities to run successfully. Additionally, you will be taking risks with the stability of your overall solution.

Read more about that in the Considerations for server-side Automation of Office article.

Set CD Title - burning a CD with VB Script using IMAPI2

You can take a look in this thread:

FSI.VolumeName = "Stackoverflow"

Maybe using volume label or volume name as search keywords could help you to get better results.

What's the difference between UTF-8 and UTF-8 with BOM?

The UTF-8 BOM is a sequence of bytes at the start of a text stream (0xEF, 0xBB, 0xBF) that allows the reader to more reliably guess a file as being encoded in UTF-8.

Normally, the BOM is used to signal the endianness of an encoding, but since endianness is irrelevant to UTF-8, the BOM is unnecessary.

According to the Unicode standard, the BOM for UTF-8 files is not recommended:

2.6 Encoding Schemes


... Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature. See the “Byte Order Mark” subsection in Section 16.8, Specials, for more information.

Which encoding opens CSV files correctly with Excel on both Mac and Windows?

The lowdown is: There is no solution. Excel 2011/Mac cannot correctly interpret a CSV file containing umlauts and diacritical marks no matter what encoding or hoop jumping you do. I'd be glad to hear someone tell me different!



Related Topics



Leave a reply



Submit