There is a growing need to access Puppeteer's raw page object both before and after requests.
My implementation of the `newpage` event was a mistake for the following three reasons (sketched below):

- You cannot pass values retrieved from the `page` object to the crawling results.
- You cannot access the `page` object after requests in order to get cookie values, console logs, etc.
- You cannot return a `Promise`, so you have to deal with race conditions.
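To make those limitations concrete, here is a simplified sketch of how the `newpage` event is typically used today (illustrative only; details of the handler are trimmed):

```js
const HCCrawler = require('headless-chrome-crawler');

(async () => {
  const crawler = await HCCrawler.launch({
    onSuccess: result => {
      // Nothing collected in the newpage handler can be read here,
      // because the handler has no way to attach values to `result`.
      console.log(`Requested ${result.options.url}.`);
    },
  });
  crawler.on('newpage', page => {
    // The raw page is only available before the request, the handler's
    // return value is ignored (no Promise support), and anything gathered
    // here (cookies, console logs, etc.) is lost.
    page.on('console', msg => console.log(msg.text()));
  });
  await crawler.queue('https://example.com/');
  await crawler.onIdle();
  await crawler.close();
})();
```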
Thus, I'd like to introduce a new `customCrawl` option, hoping to eventually replace the `newpage` event with it.
It goes like this:
```js
const HCCrawler = require('headless-chrome-crawler');

(async () => {
  const crawler = await HCCrawler.launch({
    customCrawl: async (page, crawl) => {
      // You can access the page object before requests
      await page.setRequestInterception(true);
      page.on('request', request => {
        if (request.url().endsWith('/')) {
          request.continue();
        } else {
          request.abort();
        }
      });
      // The result contains options, links, cookies, etc.
      const result = await crawl();
      // You can access the page object after requests
      result.content = await page.content();
      // You need to extend and return the crawled result
      return result;
    },
    onSuccess: result => {
      console.log(`Got ${result.content} for ${result.options.url}.`);
    },
  });
  await crawler.queue('https://example.com/');
  await crawler.onIdle();
  await crawler.close();
})();
```
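For example, reason 2 above (reading console logs and cookie values after the request) becomes straightforward. Here is a sketch along the same lines, where `consoleLogs` is just an illustrative field name added to the result, and `result.cookies` is the cookies array that `crawl()` already returns per the example above:

```js
const HCCrawler = require('headless-chrome-crawler');

(async () => {
  const crawler = await HCCrawler.launch({
    customCrawl: async (page, crawl) => {
      const consoleLogs = [];
      // Collect console messages emitted while the page is being crawled
      page.on('console', msg => consoleLogs.push(msg.text()));
      const result = await crawl();
      // After the request, read whatever else is needed from the page
      // and attach it to the crawled result
      result.consoleLogs = consoleLogs;
      return result;
    },
    onSuccess: result => {
      // result.cookies comes from crawl() itself; result.consoleLogs was added above
      console.log(`${result.options.url}: ${result.cookies.length} cookies, ${result.consoleLogs.length} console messages.`);
    },
  });
  await crawler.queue('https://example.com/');
  await crawler.onIdle();
  await crawler.close();
})();
```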
Fixes: https://github.com/yujiosaka/headless-chrome-crawler/issues/254 https://github.com/yujiosaka/headless-chrome-crawler/issues/256 https://github.com/yujiosaka/headless-chrome-crawler/pull/233