Puppeteer - Scroll Down Until You Can't Anymore

Give this a shot:

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch({
        headless: false
    });
    const page = await browser.newPage();
    await page.goto('https://www.yoursite.com');
    await page.setViewport({
        width: 1200,
        height: 800
    });

    // Scroll to the bottom first so lazily loaded content is rendered.
    await autoScroll(page);

    await page.screenshot({
        path: 'yoursite.png',
        fullPage: true
    });

    await browser.close();
})();

async function autoScroll(page){
    await page.evaluate(async () => {
        await new Promise((resolve) => {
            let totalHeight = 0;
            const distance = 100;
            const timer = setInterval(() => {
                // Re-read on every tick: the page may grow as content loads.
                const scrollHeight = document.body.scrollHeight;
                window.scrollBy(0, distance);
                totalHeight += distance;

                if(totalHeight >= scrollHeight - window.innerHeight){
                    clearInterval(timer);
                    resolve();
                }
            }, 100);
        });
    });
}

Source: https://github.com/chenxiaochun/blog/issues/38

EDIT

Added window.innerHeight to the calculation because the available scrolling distance is the body height minus the viewport height, not the entire body height.
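
To make that arithmetic concrete, here is a small sketch. maxScrollOffset and reachedBottom are hypothetical helper names (not part of Puppeteer), and the pixel values are made up for illustration:

```javascript
// Hypothetical helpers illustrating the stop condition used in autoScroll.
// The furthest the window can scroll is the content height minus the
// part that is already visible in the viewport.
function maxScrollOffset(scrollHeight, viewportHeight) {
  return Math.max(0, scrollHeight - viewportHeight);
}

// True once the accumulated scroll distance covers the scrollable range.
function reachedBottom(totalScrolled, scrollHeight, viewportHeight) {
  return totalScrolled >= maxScrollOffset(scrollHeight, viewportHeight);
}

// A 3000px-tall page in an 800px-tall viewport can scroll 2200px at most.
console.log(maxScrollOffset(3000, 800));     // 2200
console.log(reachedBottom(2100, 3000, 800)); // false
console.log(reachedBottom(2200, 3000, 800)); // true
```

Comparing against the full body height instead would keep the timer running for one extra viewport's worth of ticks after the page has already stopped moving.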

Puppeteer - simulate scroll down

page.mouse.down() simulates pressing a mouse button, not scrolling. That's why it doesn't do what you want.

Take a look at window.scrollTo or window.scrollBy, called inside the page.evaluate(...) scope (where the window variable is available). With those you can scroll the page by a given distance. See the following topic, where this has already been answered: Puppeteer - scroll down until you can't anymore
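
As a rough sketch of that suggestion, the helper below scrolls by a fixed distance from inside page.evaluate. scrollPageBy is a hypothetical name, and page is assumed to be a Puppeteer Page you have already created and navigated:

```javascript
// Sketch only: `page` is assumed to be a Puppeteer Page created elsewhere.
// scrollPageBy is a hypothetical helper name.
async function scrollPageBy(page, distance) {
  // The callback runs in the browser context, where `window` exists.
  // `distance` is passed as an argument because the callback cannot see
  // variables from the Node.js scope.
  return page.evaluate((dy) => {
    window.scrollBy(0, dy);
    return window.scrollY; // report the new vertical offset
  }, distance);
}

// Usage (inside an async function, after page.goto):
//   const offset = await scrollPageBy(page, 500);
```

Returning window.scrollY is optional, but it gives the Node.js side a simple way to notice when the offset stops changing.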

scroll to bottom not applying to site puppeteer

On YouTube the height of body is 0, which is why your function is not working. If you inspect YouTube in devtools, the whole content lives in the ytd-app element.

So we should use document.querySelector('ytd-app').scrollHeight instead of document.body.scrollHeight to scroll down to the bottom.

Working code:

import type { Page } from "puppeteer";

const scrapeInfiniteScrollItems = async (page: Page) => {
  while (true) {
    // Height of the scroll container before this round of scrolling.
    const previousHeight = await page.evaluate(
      "document.querySelector('ytd-app').scrollHeight"
    );
    await page.evaluate(() => {
      const app = document.querySelector("ytd-app");
      if (app) {
        window.scrollTo(0, app.scrollHeight);
      }
    });
    try {
      // Wait for the container to grow; a timeout means nothing more loaded.
      await page.waitForFunction(
        `document.querySelector('ytd-app')?.scrollHeight > ${previousHeight}`,
        { timeout: 5000 }
      );
    } catch {
      console.log("done");
      break;
    }
    // Short pause so newly loaded items can render before the next round.
    await new Promise((resolve) => setTimeout(resolve, 1000));
  }
};
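
The same idea can be sketched for other sites by passing the container selector in and capping the number of rounds. This is an illustration under assumptions, not a drop-in replacement: scrollUntilStable, maxRounds and the fallback to document.body are hypothetical choices, and page is assumed to be a Puppeteer Page created elsewhere.

```javascript
// Sketch only: `page` is assumed to be a Puppeteer Page created elsewhere.
// scrollUntilStable, maxRounds and timeout are hypothetical names.
async function scrollUntilStable(page, selector, { maxRounds = 50, timeout = 5000 } = {}) {
  // Runs in the browser: height of the chosen element, or of body as a fallback.
  const readHeight = (sel) =>
    (document.querySelector(sel) || document.body).scrollHeight;

  let rounds = 0;
  while (rounds < maxRounds) {
    const previousHeight = await page.evaluate(readHeight, selector);
    await page.evaluate((sel) => {
      const el = document.querySelector(sel) || document.body;
      window.scrollTo(0, el.scrollHeight);
    }, selector);
    try {
      // Resolves once the element has grown, i.e. more content was loaded.
      await page.waitForFunction(
        (sel, prev) =>
          (document.querySelector(sel) || document.body).scrollHeight > prev,
        { timeout },
        selector,
        previousHeight
      );
    } catch {
      break; // no growth within the timeout: treat as the end of the feed
    }
    rounds += 1;
  }
  return rounds; // number of rounds in which new content appeared
}

// Usage (inside an async function):
//   const rounds = await scrollUntilStable(page, 'ytd-app', { maxRounds: 20 });
```

The cap keeps a truly endless feed from looping forever; the return value tells the caller how many times the container actually grew.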

Scrolling to the bottom of a div in puppeteer not working

As you mention in your question, when you run page.$$, you get back an array of ElementHandle. From Puppeteer's documentation:

ElementHandle represents an in-page DOM element. ElementHandles can be created with the page.$ method.

This means you can iterate over them, but you also have to run evaluate() or $eval() over each element to access the DOM element.
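
A minimal sketch of both options follows. readTexts and readTextsAtOnce are hypothetical helper names, and page is assumed to be a Puppeteer Page created elsewhere:

```javascript
// Sketch only: `page` is assumed to be a Puppeteer Page created elsewhere.
// readTexts and readTextsAtOnce are hypothetical helper names.
async function readTexts(page, selector) {
  const handles = await page.$$(selector); // array of ElementHandle
  const texts = [];
  for (const handle of handles) {
    // The callback runs in the browser and receives the real DOM node.
    texts.push(await handle.evaluate((node) => node.textContent.trim()));
  }
  return texts;
}

// Same result in a single round trip: $$eval passes the matched nodes
// to the callback as an array.
async function readTextsAtOnce(page, selector) {
  return page.$$eval(selector, (nodes) =>
    nodes.map((node) => node.textContent.trim())
  );
}

// Usage (inside an async function):
//   const items = await readTexts(page, 'li');
```

Keeping the handles around is useful when you need to act on an element later (scroll it, click it); $$eval is the shorter route when you only need data out of the page.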

I see from your snippet that you are trying to access the parent div that handles the list's scroll event. The problem is that this page seems to use auto-generated classes and ids, which can make your code brittle or stop it from working. It would be better to access the ul, li, and div elements directly.

I've created this snippet, which collects ITEMS concerts from the site:

const puppeteer = require('puppeteer')

/**
 * Constants
 */
// Environment variables are strings, so convert ITEMS to a number.
const ITEMS = Number(process.env.ITEMS) || 50
const URL = process.env.URL || "https://www.google.com/search?q=concerts+near+poughkeepsie&client=safari&rls=en&uact=5&ibp=htl;events&rciv=evn&sa=X&fpstate=tldetail"

/**
 * Main
 */
main()
  .then(() => console.log("Done"))
  .catch((err) => console.error(err))

/**
 * Functions
 */
async function main() {
  const browser = await puppeteer.launch({ args: ["--no-sandbox"] })
  const page = await browser.newPage()

  await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36')
  await page.goto(URL)

  const results = await getResults(page)
  console.log(results)

  await browser.close()
}

async function getResults(page) {
  await page.waitForSelector("ul")
  const ul = (await page.$$("ul"))[0]
  const div = (await ul.$x("../../.."))[0]
  const results = []

  // Note: if the page offers fewer than ITEMS entries, this never
  // terminates; add a retry limit before relying on it.
  const recurse = async () => {
    // Recurse exit clause
    if (ITEMS <= results.length) {
      return
    }

    const $lis = await page.$$("li")
    // Slicing this way will avoid duplicating the result. It also has
    // the benefit of not having to handle the refresh interval until
    // new concerts are available.
    const lis = $lis.slice(results.length)
    for (let li of lis) {
      const result = await li.evaluate(node => node.innerText)
      results.push(result)
    }
    // Move the scroll of the parent-parent-parent div to the bottom
    await div.evaluate(node => node.scrollTo(0, node.scrollHeight))
    await recurse()
  }
  // Start the recursive function
  await recurse()

  return results
}

By studying the page structure, we see that the ul for the list is nested three divs deep inside the div that handles the scroll. We also know that there are only two uls on the page, and the first is the one we want. That is what we do on these lines:

const ul = (await page.$$("ul"))[0]
const div = (await ul.$x("../../.."))[0]

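The $x function evaluates the XPath expression relative to the ElementHandle as its context node*, so "../../.." climbs three levels up from the ul to the div that we need. We then run a recursive function until we have the number of items that we want. Newer Puppeteer releases deprecate $x in favor of $$ with an xpath/ prefix, so check the API reference for the version you are running.

* Taken from the docs.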
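
If you would rather not depend on XPath, the same climb can be sketched with parentElement inside evaluateHandle. ancestorHandle is a hypothetical helper name, and handle is assumed to be an ElementHandle such as the ul above:

```javascript
// Sketch only: `handle` is assumed to be a Puppeteer ElementHandle.
// ancestorHandle is a hypothetical helper name.
async function ancestorHandle(handle, levels) {
  // evaluateHandle returns a handle to whatever the callback returns,
  // here the ancestor reached after `levels` steps up the tree.
  return handle.evaluateHandle((node, n) => {
    let current = node;
    for (let i = 0; i < n && current; i += 1) {
      current = current.parentElement;
    }
    return current;
  }, levels);
}

// Usage, mirroring ul.$x("../../.."):
//   const div = await ancestorHandle(ul, 3);
//   await div.evaluate((node) => node.scrollTo(0, node.scrollHeight));
```

Either way the selector still encodes the page's current nesting depth, so it will need updating if the markup changes.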

