I Need to Go from HTML to Markdown, Any Suggestion

I need to go from HTML to markdown, any suggestion?

Google found a ruby script called reverse_markdown. It seems to do what you are looking for.

How to switch web app from HTML to Markdown?

Apparently converting the old HTML to markdown is completely unnecessary. I applied the django.contrib.markup.markdown filter to my templates, and the HTML in the legacy database records was passed straight through. Markdown in non-legacy records was also rendered correctly.

Granted, my web application does not allow users to be modifying these fields, so it is ok to let HTML pass straight through. If this was a user-editable field, like comments or a wiki, this solution would not suffice. You would have to pass the safe parameter to the markdown template filter, which would strip out all the HTML, and some HTML to markdown conversion would be necessary for legacy posts written in HTML.

Another solution, in that case, would be to write a new template filter that wrapped the markdown filter. It would allow old HTML posts to pass through, but apply the safe markdown filter to non-legacy posts.

Is there javascript to convert HTML to markdown?

I've started a project to do this:

https://github.com/domchristie/turndown

It's still in its early stages, so has not been heavily tested, but it's a start.

Feedback/contributions welcome.

How to transform HTML to Markdown using Nokogiri?

As @Konrads suggested, there is a duplicate entry here. However, the link which is given in the answer on that entry is broken.

I've googled a bit more and found that some forks survived: https://github.com/tomkrush/reverse-markdown, https://gist.github.com/788039.

If these forks become dead, just google for "reverse markdown". You will definitely find something that is alive and working.

Convert Html or RTF to Markdown or Wiki Compatible syntax?

Here's the code I used to wrap pandoc. I haven't seen any other decent methods so far unfortunately.

public string Convert(string source)
{
string processName = @"C:\Program Files\Pandoc\bin\pandoc.exe";
string args = String.Format(@"-r html -t mediawiki");

ProcessStartInfo psi = new ProcessStartInfo(processName, args);

psi.RedirectStandardOutput = true;
psi.RedirectStandardInput = true;

Process p = new Process();
p.StartInfo = psi;
psi.UseShellExecute = false;
p.Start();

string outputString = "";
byte[] inputBuffer = Encoding.UTF8.GetBytes(source);
p.StandardInput.BaseStream.Write(inputBuffer, 0, inputBuffer.Length);
p.StandardInput.Close();

p.WaitForExit(2000);
using (System.IO.StreamReader sr = new System.IO.StreamReader(
p.StandardOutput.BaseStream))
{

outputString = sr.ReadToEnd();
}

return outputString;
}

how to convert a jupyter notebook containing html tags to HTML?

OPTION A: Kludge workaround

I found a way to get the result I think you want using your test notebook code supplied as a starting point. I confirmed that it works whether you are saving the notebook as HTML using the 'File' menu > 'Save and Export Notebook as...' or use the '!jupyter nbconvert' command you had in your notebook.

First editing inside the notebook view of your supplied notebook I took your markdown cells that contained HTML and made them code cells. Then I added the cell magic code %%html as the first line of each of those new code cells above the HTML contents. Running this results in the same view you had although you now have the input code. Optionally, you can collapse that in the notebook rendering if you want; however, it is moot if your goal is only the HTML rendered result anyway.

Sample Image

Now you convert the modified notebook to HTML using your way or the 'FIle' menu. If you view the HTML file produced in a browser or inside JupyterLAb, you'll see it has the cells now vertical again. Except for the addition of the code cells producing the markdown, you have what you want now. Luckily, it is easy to remove the code cells from the HTML. For example, from the top example in your code cell, you remove this block of code in the HTML file:

<div class="jp-InputPrompt jp-InputArea-prompt">In [2]:</div>
<div class="jp-CodeMirrorEditor jp-Editor jp-InputArea-editor" data-type="inline">
<div class="CodeMirror cm-s-jupyter">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">%%html</span>
<span class="p"><</span><span class="nt">div</span> <span class="na">style</span><span class="o">=</span><span class="s">"background-color: #1A079B; width:100%"</span> <span class="p">></span>
<span class="p"><</span><span class="nt">h2</span> <span class="na">style</span><span class="o">=</span><span class="s">"margin: auto; padding: 20px; color:#fff; "</span><span class="p">></span>1 - Test <span class="p"></</span><span class="nt">h2</span><span class="p">></span>
<span class="p"></</span><span class="nt">div</span><span class="p">></span>
</pre></div>

</div>

In other words, the part that corresponds to the starting <div class="jp-InputPrompt jp-InputArea-prompt"> and ends at the indented </div> that matches with the div class="CodeMirror cm-s-jupyter"> div.

Sample Image

The use of the '%%html' as a way to find the section to delete in the produced HTML is nice.

You do that for both and you get what you want.

Sample Image

I know that seems inconvenient; however, I suspect this process could be scripted if you use of the HTML in the initial markdown code is consistent enough. (I'd use nbformat to do the step of moving the HTML code from the markdown cell to a code cell with the %%html tag and then I suspect iterating find and delete using a regular expression would work on the produced HTML file.) Your test notebook is very consistent as the content of the markdown cells all begin <div style=. Is that the case in all the notebooks you are working with? I didn't want to try scripting any automation until confirmed.



OPTION B: Change to using 'div class alert'

While the div style combo you are using triggers the horizontal rendering in the HTML, in the markdown cells you can use alert divs inside and they won't trigger the weird rendering in the HTML later.
This has the advantage that rendering in notebook and HTML is all consistent.

So in your markdown cells, this could be example use:

<div class="alert alert-block alert-warning">
<h2 style="margin: auto; padding: 20px; color:#fff; ">1 - Test </h2>
</div>

-Or-

<div class="alert alert-block alert-danger">
<h2 style="margin: auto; padding: 20px; color:#444; ">1 - Test </h2>
</div>

-Or-

<div class="alert alert-block alert-info">
<h2 style="margin: auto; padding: 20px; color:#222; ">1 - Test </h2>
</div>

So in the notebook it renders, like this:
Sample Image

And in the HTML it renders, like this:
Sample Image

(NOTE THAT THE VERSION OF THE ORIGNAL HTML SUPPLIED RENDERS CORRECTLY IN THE HTML IF YOU CHANGE IT TO RAW! That's the blue box with '2 - test' text. See below.)

OPTION C: Switch cells with HTML in them to 'Raw' cells before Conversion step

This works on the supplied notebook without changing anything except the type of cell where the original HTML styling code is harbored. You change the cell containing the HTML code from being designated as markdown type to 'Raw' type.

It's not perfect in regards to matching the cell width, and it doesn't work directly in the notebook rendering. But it is the 'path of least resistance' and results in an HTML version of the notebook that doesn't go horizontal. You can get the best of both worlds if you are careful. You can have the version of the notebook saved with the HTML as markdown and then when ready to convert, you just change those cells to 'Raw' and run the conversion.

Example starting from the original supplied notebook, change the two markdown cells containing HTML to the 'Raw' type:
Sample Image

And then the HTML looks close to the desired result:
Sample Image

The conversion to 'Raw' could be automated by using nbformat if the cells containing markdown and HTML are consistent enough to easily distinguish.



Related Topics



Leave a reply



Submit