Jul 11, 2022 by Roman Landenband
To stand out and sell more "ACME Widgets™", modern web pages are sprinkled with JavaScript that makes things appear, disappear or transform (I use "mutate" loosely to cover all of these).
This is fine... unless you are trying to track that page for changes. Sampling the page a fraction of a second earlier or later than on the previous visit can then trigger a false positive.
As it happens, mutating content is the root cause of the top complaints from our users, who want to be notified only when content has actually changed. So we set out to research more reliable change tracking.
What's needed to capture a modern web page? An overview
To conserve the resources needed to load a page, some parts of it will not load until they are brought into the browser viewport. "Capturing" a page is therefore not strictly the same as "loading" it.
Roughly, in order for us to "see" the fully loaded web page, we need to handle:
- the body "onload" event
- image loading (responsive and scroll-discovered), if we want a visual capture of the page
- dynamic content loading
- lazy content loading (activated only if inside viewport)
This mimics a web user visiting a page, waiting for it to load, scrolling through it, etc.
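The scroll-through step can be sketched as follows. This is a minimal illustration and not the code we run in production: `scrollOffsets` and `scrollThrough` are hypothetical helper names, and the 500 ms pause per step is an arbitrary assumption.

```javascript
// Compute the vertical offsets to visit so that every part of the page
// passes through the viewport at least once.
function scrollOffsets(pageHeight, viewportHeight) {
  if (viewportHeight <= 0) throw new RangeError("viewportHeight must be positive");
  const lastOffset = Math.max(0, pageHeight - viewportHeight);
  const offsets = [];
  for (let y = 0; y < lastOffset; y += viewportHeight) {
    offsets.push(y);
  }
  offsets.push(lastOffset); // always finish at the bottom of the page
  return offsets;
}

// Browser-only: scroll one viewport at a time, pausing so lazy content can load.
// pauseMs is an assumed value; tune it per site.
async function scrollThrough(pauseMs = 500) {
  const offsets = scrollOffsets(document.documentElement.scrollHeight, window.innerHeight);
  for (const y of offsets) {
    window.scrollTo(0, y);
    await new Promise((resolve) => setTimeout(resolve, pauseMs));
  }
}
```

In a dev console, `await scrollThrough()` walks the page once. The sketch reads the page height only once, so pages that grow while you scroll (endless feeds, for example) would need the height re-read after each step.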
Detecting site changes: what are they?
While different people may need to track different changes (graphic designers, analysts, product managers, marketers, etc.), we will focus on a specific type of change here: content being added, removed or modified.
Ideally, therefore, every time we visit a page we want to know if and how the content has changed since the last visit, and possibly how it has changed over time.
Alas, because content is manipulated via JavaScript (for the purpose of creating an appealing web page to sell more Widgets™), a naïve diff between two snapshots is unreliable. All the mutating parts need to be eliminated to reduce false positives.
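To make the problem concrete, here is a small sketch with made-up snapshot text of a page that shows a clock. `changedLines` is a hypothetical helper doing a naïve line-by-line comparison, not a real diff algorithm.

```javascript
// Naive comparison: report every line index whose text differs between two snapshots.
function changedLines(before, after) {
  const a = before.split("\n");
  const b = after.split("\n");
  const changed = [];
  for (let i = 0; i < Math.max(a.length, b.length); i++) {
    if (a[i] !== b[i]) changed.push(i);
  }
  return changed;
}

// Made-up snapshots of the same page, taken two seconds apart: only the clock differs.
const visit1 = "ACME Widgets\nPrice: $10\nTime now: 12:00:01";
const visit2 = "ACME Widgets\nPrice: $10\nTime now: 12:00:03";

console.log(changedLines(visit1, visit2)); // [ 2 ]: a "change" nobody cares about

// Dropping the line known to mutate leaves nothing to report.
const withoutClock = (text) =>
  text.split("\n").filter((line) => !line.startsWith("Time now:")).join("\n");
console.log(changedLines(withoutClock(visit1), withoutClock(visit2))); // []
```

Here the mutating line is hard-coded for illustration. The hard part on a real page is discovering which parts mutate without knowing them in advance.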
Outlining our approach
- wait for the page to load
- observe mutations
- keep track of all the places where content has been mutating
- remove mutating content for the sake of reliable change detection
POC code with comments below. Copy and paste it into the dev console of your browser of choice to see the full cycle.
"use strict";
/*jshint esversion: 8 */
/*jshint browser: true */
const myScript = document.createElement('script');
myScript.type = "module";
// language=JavaScript
myScript.innerHTML = `
// we use the excellent "finder" library by "antonmedv"
// https://github.com/antonmedv/finder
// to resolve CSS selectors from DOM elements
import {finder} from 'https://medv.io/finder/finder.js'
// list of DOM selectors where mutations have been observed
const mutationSelectors = new Set();
// assign to window object to retrieve via debug console or another script
window._mutationSelectors = mutationSelectors;
// once maxIdleTicks consecutive idle ticks have been counted, we are done
const maxIdleTicks = 3;
// how often (in milliseconds) to check whether new mutating selectors appeared
const intervalMS = 2000;
// MutationObserver
const observer = new MutationObserver(function callback(mutationList, observer) {
  mutationList.forEach((mutation) => {
    try {
      // we never want the body element itself to end up in the list of selectors to remove
      if (mutation.target !== document.body) {
        // "finder" gives us a CSS selector for the mutating element;
        // for characterData the target is a text node, so we use its parent element
        const selector = finder(mutation.type === 'characterData' ? mutation.target.parentElement : mutation.target)
        mutationSelectors.add(selector);
      }
    } catch (e) {
      console.log("err", mutation, e);
    }
  })
})
// we only care about content changes (characterData) and added/removed nodes (childList),
// anywhere under the body (subtree)
const observerOptions = {
  childList: true,
  attributes: false,
  subtree: true,
  characterData: true
}
const initialize = () => {
  observer.observe(document.body, observerOptions);
  // the size of mutating selectors when checked last
  let lastSeenSize = 0;
  // counting towards maxIdleTicks
  let idleTickCount = 0;
  const ref = window.setInterval(() => {
    if (window._mutationSelectors.size !== lastSeenSize) {
      // new mutating selectors showed up: remember the size and restart the idle count
      lastSeenSize = window._mutationSelectors.size;
      idleTickCount = 0;
    } else {
      idleTickCount++;
      // once maxIdleTicks consecutive idle ticks are reached, we are done
      if (idleTickCount >= maxIdleTicks) {
        clearInterval(ref)
        // stop observing so our own removals in wrapUp are not recorded as mutations
        observer.disconnect();
        window.wrapUp();
      }
    }
  }, intervalMS);
}
// we normally are interested in changes that happen once the body is loaded
if (document.readyState === "complete") {
  initialize()
} else {
  window.addEventListener("load", initialize);
}
window.wrapUp = () => {
  [...window._mutationSelectors].forEach(msItem => {
    document.querySelectorAll(msItem).forEach(el => {
      // hold on to element's parent
      const parentEl = el.parentElement;
      // remove element from DOM
      el.remove();
      // remove parent from DOM if it has no child elements left (but never the body itself)
      if (parentEl && parentEl !== document.body && parentEl.childElementCount === 0) {
        parentEl.remove();
      }
    });
  });
  console.log("all done, dump body text", document.body.innerText);
};
`;
document.head.appendChild(myScript);
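Once the mutating elements are gone, the remaining body text can be reduced to a short fingerprint and compared between visits. The sketch below shows one way to do that, under stated assumptions: `normalizeText` and `fingerprint` are hypothetical helpers that are not part of the POC, and the FNV-1a-style hash (applied to code points rather than bytes) is simply a compact non-cryptographic choice.

```javascript
// Collapse whitespace so layout-only differences do not count as changes.
function normalizeText(text) {
  return text.replace(/\s+/g, " ").trim();
}

// 32-bit FNV-1a-style hash of the normalized text, returned as 8 hex digits.
function fingerprint(text) {
  let hash = 0x811c9dc5; // FNV offset basis
  for (const ch of normalizeText(text)) {
    hash ^= ch.codePointAt(0);
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by the FNV prime, keep unsigned
  }
  return hash.toString(16).padStart(8, "0");
}
```

In the browser, `fingerprint(document.body.innerText)` could be called after the POC logs "all done" and the result stored. A different fingerprint on the next visit then suggests that the stable content changed.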
CueTap users?
You can start using the new feature in the page tracking options.