Getting the Source HTML of the Current Page from Chrome Extension

Getting the source HTML of the current page from chrome extension

Inject a script into the page you want to get the source from, and have it message the source back to the popup.

manifest.json

{
  "name": "Get pages source",
  "version": "1.0",
  "manifest_version": 2,
  "description": "Get pages source from a popup",
  "browser_action": {
    "default_icon": "icon.png",
    "default_popup": "popup.html"
  },
  "permissions": ["tabs", "<all_urls>"]
}

popup.html

<!DOCTYPE html>
<html style=''>
<head>
  <script src='popup.js'></script>
</head>
<body style="width:400px;">
  <div id='message'>Injecting Script....</div>
</body>
</html>

popup.js

chrome.runtime.onMessage.addListener(function(request, sender) {
  if (request.action == "getSource") {
    // Look the element up here: the variable declared in onWindowLoad is not in scope
    document.querySelector('#message').innerText = request.source;
  }
});

function onWindowLoad() {
  var message = document.querySelector('#message');

  chrome.tabs.executeScript(null, {
    file: "getPagesSource.js"
  }, function() {
    // Injecting into an extension page, the Web Store, or the new tab page fails with an error
    if (chrome.runtime.lastError) {
      message.innerText = 'There was an error injecting script : \n' + chrome.runtime.lastError.message;
    }
  });
}

window.onload = onWindowLoad;

getPagesSource.js

// @author Rob W <http://stackoverflow.com/users/938089/rob-w>
// Demo: var serialized_html = DOMtoString(document);

function DOMtoString(document_root) {
  var html = '',
      node = document_root.firstChild;
  while (node) {
    switch (node.nodeType) {
      case Node.ELEMENT_NODE:
        html += node.outerHTML;
        break;
      case Node.TEXT_NODE:
        html += node.nodeValue;
        break;
      case Node.CDATA_SECTION_NODE:
        html += '<![CDATA[' + node.nodeValue + ']]>';
        break;
      case Node.COMMENT_NODE:
        html += '<!--' + node.nodeValue + '-->';
        break;
      case Node.DOCUMENT_TYPE_NODE:
        // (X)HTML documents are identified by public identifiers
        html += "<!DOCTYPE " + node.name + (node.publicId ? ' PUBLIC "' + node.publicId + '"' : '') + (!node.publicId && node.systemId ? ' SYSTEM' : '') + (node.systemId ? ' "' + node.systemId + '"' : '') + '>\n';
        break;
    }
    node = node.nextSibling;
  }
  return html;
}

chrome.runtime.sendMessage({
  action: "getSource",
  source: DOMtoString(document)
});
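
The manifest above targets Manifest V2, where chrome.tabs.executeScript is available. Under Manifest V3 that call is replaced by chrome.scripting.executeScript, which needs the "scripting" permission. Below is a minimal sketch of the same idea, assuming Manifest V3. The helper name getPageSource and its chromeApi parameter are illustrative and not part of the original answer; the API object is passed in so the function can be exercised outside a browser.

```javascript
// Sketch only: assumes Manifest V3 with the "scripting" permission and
// access to the tab (for example via "activeTab").
// getPageSource and chromeApi are hypothetical names used for illustration.
function getPageSource(chromeApi, tabId, onDone) {
  chromeApi.scripting.executeScript(
    {
      target: { tabId: tabId },
      // This function is serialized and run inside the page, so it must
      // not reference any variables from the extension's own scripts.
      func: function () {
        return document.documentElement.outerHTML;
      }
    },
    function (results) {
      if (chromeApi.runtime.lastError) {
        onDone(new Error(chromeApi.runtime.lastError.message), null);
        return;
      }
      // One result per injected frame; the main frame comes first.
      onDone(null, results[0].result);
    }
  );
}
```

From a popup this could be called as getPageSource(chrome, tab.id, callback) once the active tab is known. Note that outerHTML does not include the doctype, which is what DOMtoString above adds.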

chrome extension: Getting the source HTML of the current page on page load

It is about the permissions.

Your example is a bit incomplete, but as far as I can see you are using the "activeTab" permission.

According to the activeTab docs, the extension gets access to the current tab (e.g. to its source) after any of these actions is performed:

  • Executing a browser action
  • Executing a page action
  • Executing a context menu item
  • Executing a keyboard shortcut from the commands API
  • Accepting a suggestion from the omnibox API

That's why you can get sources after opening the popup.

In order to get access to tabs without those actions, you need to ask for the following permissions:

  • tabs
  • <all_urls>

Note that this allows you to run content scripts on every tab, not only the active one.

Here is the simplest example:

manifest.json

{
  "name": "Getting Started Example",
  "version": "1.0",
  "description": "Build an Extension!",
  "permissions": ["tabs", "<all_urls>"],
  "background": {
    "scripts": ["background.js"],
    "persistent": false
  },
  "manifest_version": 2
}

background.js

chrome.tabs.onUpdated.addListener(function (tabId, info) {
  if (info.status === 'complete') {
    // Pass tabId so the script runs in the tab that finished loading,
    // not in whichever tab happens to be active
    chrome.tabs.executeScript(tabId, {
      code: "document.documentElement.innerHTML" // or 'file: "getPagesSource.js"'
    }, function(result) {
      if (chrome.runtime.lastError) {
        console.error(chrome.runtime.lastError.message);
      } else {
        console.log(result);
      }
    });
  }
});

chrome extension get source html not currentPage

  • Use XMLHttpRequest to download whatever the server responds with when the url is accessed. On some sites this could be a minimal page with a script loader that would only render the full page when loaded normally in the browser.

  • To get a fully rendered source or DOM tree of an arbitrary url you'll have to load it in a tab first. To make the process less distracting for the user load it in a pinned tab:

    chrome.tabs.create({url: "https://google.com", pinned: true}, function(tab) {
    .... wait for the tab to load, get the source
    });

    (The simplest form of waiting that doesn't require any additional permissions is to periodically check tab.status == "complete" from the callback above. Otherwise use, for example, webNavigation.onCompleted, or inject a content script with the usual "DOMContentLoaded" or "load" event handlers.)

  • Or load the page in an IFRAME, but some sites forbid the browser from doing so.
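
The periodic status check mentioned for the pinned-tab approach might look like the sketch below. chrome.tabs.get is a real API; the helper name whenTabComplete, its parameters, and the 250 ms default interval are assumptions made for illustration. The tabs object and the scheduler are passed in so the logic can be tried outside a browser.

```javascript
// Sketch: poll a tab until its status is "complete", then call onComplete(tab).
// whenTabComplete, intervalMs and schedule are hypothetical names, not Chrome APIs.
function whenTabComplete(tabsApi, tabId, onComplete, intervalMs, schedule) {
  intervalMs = intervalMs || 250;
  schedule = schedule || setTimeout;

  function check() {
    tabsApi.get(tabId, function (tab) {
      if (!tab) {
        return; // the tab no longer exists, so stop polling
      }
      if (tab.status === 'complete') {
        onComplete(tab);
      } else {
        schedule(check, intervalMs);
      }
    });
  }

  check();
}

// Possible use inside an extension (not runnable outside Chrome):
// chrome.tabs.create({url: "https://google.com", pinned: true}, function (tab) {
//   whenTabComplete(chrome.tabs, tab.id, function (loadedTab) {
//     // inject the script that reads the source here
//   });
// });
```

Polling keeps the permission list short at the cost of a small delay; an event such as webNavigation.onCompleted avoids the delay but needs its own permission.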

Chrome extension : get source code of active tab

Your manifest has both "content_scripts" (which run in the context of the page on document_idle) and "browser_action" scripts (which run in an isolated context when the extensions menu button is clicked).

In popup.html you reference popup.js, so in popup.js when you call document.documentElement.outerHTML you're getting the content of popup.html, not the active tab.

You reference both popup.js and popup1.js, which is confusing. You're currently running the same code in both the popup and the page context, which is almost guaranteed to break in one or the other. By convention use content.js in "content_scripts" and reference popup.js in the action popup.html.

"content_scripts" run in every page, whether users click on the extension or not. Your current manifest is adding ["popup1.js","jquery-1.10.2.js","jquery-ui.js","bootstrap.min.js"] to every page, which is needlessly slow.

Avoid using jQuery in Chrome extensions. It's fairly large and a browser standardisation library doesn't add much when you know for absolute certain that all your users are on Chrome. If you can't code without it then try to restrict it to just your popup or load it in dynamically.

You set a "scripts": [ "background.js"], which runs constantly in the background and isn't needed at all in your current code. If you need to do things outside of the action button consider using event pages instead.

Use the Chrome API to get from the context of the popup to the page. You need to query chrome.tabs to get the active tab, and then call chrome.tabs.executeScript to execute script in the context of that tab.

Google's API uses callbacks, but in this example I'm going to use chrome-extension-async to allow use of promises (there are other libraries that do this too).

In popup.html (assuming you use bower install chrome-extension-async):

<!doctype html>
<html>
<head>
  <script type="text/javascript" src="bower_components/chrome-extension-async/chrome-extension-async.js"></script>
  <script type="text/javascript" src="popup.js"></script>
</head>

<body style="width: 600px; height: 300px;">
  <button id="check-1">Test</button>
</body>
</html>

In popup.js (discard popup1.js):

function scrapeThePage() {
  // Keep this function isolated - it can only call methods you set up in content scripts
  var htmlCode = document.documentElement.outerHTML;
  return htmlCode;
}

document.addEventListener('DOMContentLoaded', () => {
  // Hook up #check-1 button in popup.html
  const fbshare = document.querySelector('#check-1');
  fbshare.addEventListener('click', async () => {
    // Get the active tab
    const tabs = await chrome.tabs.query({ active: true, currentWindow: true });
    const tab = tabs[0];

    // We have to convert the function to a string
    const scriptToExec = `(${scrapeThePage})()`;

    // Run the script in the context of the tab
    const scraped = await chrome.tabs.executeScript(tab.id, { code: scriptToExec });

    // Result will be an array of values from the execution
    // For testing this will be the same as the console output if you ran scriptToExec in the console
    alert(scraped[0]);
  });
});

If you do it this way you don't need any "content_scripts" in manifest.json. You don't need jQuery or jQuery UI or Bootstrap either.

How to get actual HTML elements in Chrome extension, not original source code

Any elements that you see in the JavaScript inspector, but not in the HTML source code, are either (a) automatically added by the browser to supply missing elements (e.g. no <body> tag) or correct invalid structure (e.g. an unclosed <p> tag) so that the document is valid, or (b) added by JavaScript.

Any technique you use to inspect the document from your Chrome extension will automatically see the document as you see it in the document inspector. You don't need to do anything specific. All of the elements that have been created by the browser or by JavaScript will be there. For example, you could use document.querySelectorAll('*') to get an array-like object containing all of them, or document.body.outerHTML to get the HTML code.
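
As a small illustration of this point, the sketch below reads the live DOM, so it also sees elements inserted by the parser or by scripts. The function name describeLiveDom is an assumption for illustration; the document is passed in as a parameter so the function can run in a content script (pass document) or against a stand-in object.

```javascript
// Sketch: summarize the DOM as it exists right now, not the original response body.
// describeLiveDom is a hypothetical helper name.
function describeLiveDom(doc) {
  return {
    // Every element currently in the tree, including ones added after load
    elementCount: doc.querySelectorAll('*').length,
    // Serialized markup of the current <body>, after browser corrections
    bodyHtml: doc.body.outerHTML
  };
}
```

In a content script, describeLiveDom(document).bodyHtml would return the corrected, script-modified markup rather than what the server originally sent.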

The harder task would actually be if you wanted to get the original, uncorrected source code.
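
One way to approach that harder case, assuming the page can simply be requested again, is to download the response body instead of reading the DOM. The sketch below uses fetch; the helper name fetchRawSource and the injectable fetchImpl parameter are assumptions for illustration. The result can differ from what the tab originally received, for example for pages that change on every request.

```javascript
// Sketch: re-request a URL and return the unparsed response text.
// fetchRawSource and fetchImpl are hypothetical names used for illustration.
function fetchRawSource(url, fetchImpl) {
  fetchImpl = fetchImpl || fetch;
  return fetchImpl(url).then(function (response) {
    if (!response.ok) {
      throw new Error('HTTP ' + response.status + ' for ' + url);
    }
    // text() yields the body as sent by the server, before any DOM correction
    return response.text();
  });
}
```

Inside an extension the request is subject to the extension's host permissions, so the manifest would need to cover the target URL.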

Storing source code of current browser in chrome extension

If you want to show HTML code inside an HTML element you may need to set the text content (innerText), not the HTML content (innerHTML):

bth1.onclick = function scrapeThePage() {
  // Keep this function isolated - it can only call methods you set up in content scripts
  var htmlCode = document.documentElement.outerHTML;
  var btn = document.getElementById("mybtn1");
  btn.innerText = htmlCode;
}

http://jsfiddle.net/1jk94r50/

Google Chrome Extensions: Get Current Page HTML (incl. Ajax or updated HTML)

I followed the exact solution here, and this gave me the Page Source HTML:

Getting the source HTML of the current page from chrome extension

The solution is to inject the HTML into the Popup.


