Integrity attribute stripping #374

almightyju · 2019-09-04T15:18:21Z

So I'm scraping a site and generating integrity attributes but after returning the parsed body the integrity attributes are being striped.

I've found at resource-handler/html/index.js:55 is the following section

then((updatedText) => {
	el.setData(updatedText);
	el.removeIntegrityCheck();
});

Which is the issue, it looks like you check a bunch of child elements to find any other resources to load but in the process remove the integrity (which makes sense if you change the content of the child resource), but my use case no child resource has anything that ever changes as all links are relative/external.

I applied the following patch locally to test:
almightyju@ecc8aac

But I'm not convinced its the right approach since it doesn't work without a modified getReference action as this line in resource-handler/index.js:61

const { reference } = await self.getReference({parentResource, resource: respondedResource, originalReference: childPath});

sets reference as 'relative/url' when the original reference was '/relative/url' which means the content is different in the html resource handler and strips the integrity attribute. I've changed the getReference function via a plugin for myself but I'm not sure how badly it will break other cases if this was in your library and without it my patch is rather obscure to use.

Hoping there might be an easy way to resolve the leading / being stripped from the originalReference :)

The text was updated successfully, but these errors were encountered:

s0ph1e · 2019-09-09T07:47:01Z

Hey @almightyju 👋

Sorry for late response

So here we have 2 things - removing integrity check and getReference. I'll try to explain why it's implemented in this way

As for integrity check - it is removed for each resource because even if path (ex. ./path/style.css) was not updated (it was the same on original website and on copied), resource content may be changed (for example, some background image from css file was downloaded to different file than on original website). If we leave integrity check - resource will not be loaded on copied site

As for getReference - it generates relative link to make copied website work locally from directory with most simple setup. All you need to do - is srape website and copy scraped website directory somewhere and it works. With absolute link it requires additional setup.

So I think if having absolute links is suitable for your case - you can use getReference to overwrite default behavior. As for removing integrity checks - I'm not sure that checking if path was updated is enough. Maybe need also check if resource content was updated. What do you think?

almightyju · 2019-09-09T15:25:46Z

No rush, I've worked around it in my exact use case by using the saveResource action to add the integrity stuff back on.

It makes sense why you strip the integrity if you change the content that's scraped. For me personally an option of "preserver integrity attributes" which simply leaves the attribute alone would work, at least that way it's more obvious via the documentation by standard the attribute is removed (I was super confused how the attribute just went missing).

The better solution would be to check if any content has changed and simply leave the integrity if it hasn't or if the content has changed and the integrity attribute was originally there generate a new hash for it (crypto.createHash('sha256').update(content).digest('base64') gives you the hash).

But like I said, for my case right now I'm actually generating the integrity for resources on the page so a simple preserve option would do :)

s0ph1e · 2019-09-10T17:10:36Z

Nice that you found way to add integrity attributes 👍

I'm not going to add property like "preserve integrity attribute" now, I guess in 90% cases content will be changed and integrity should be removed/updated. Probably I should mention about this somewhere in readme or FAQ.

Leaving this issue open for now to not forget about it. I'll try to find good way to add integrity checks to downloaded resources. Maybe it makes sense to add more common getHtmlAttributes action to add not only integrity but also other attributes.

stale · 2019-11-09T17:49:41Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

alexivkin · 2020-06-11T08:18:55Z

It's rather odd to do on scripts and CSS's tags that come from CDNs and will never change.
Please disable this behavior at least on the links that are filtered out by the urlFilter.

s0ph1e · 2020-06-11T08:29:09Z

@alexivkin it's not called for filtered out resources (by url filter, by max depth, etc.). Only for resources that were actually downloaded - their content was changed so integrity attribute is removed.

alexivkin · 2020-06-11T17:18:45Z

What I mean is it's stripping the integrity attributes off tags that are in the downloaded file, but pointing to the skipped / filtered out asset. For example the original index.html has <link rel="stylesheet" href="https://stackpath.bootstrapcdn... with the appropriate integrity attribute. The scraped/saved index.html has that attribute removed from the <link rel=".... but that <link is still pointing to the remote CSS pm the CDN (as it should). Scraper should keep the integrity attribute on the <link in that case.

s0ph1e · 2020-06-15T18:24:38Z

@alexivkin could you provide reproducible example of such behavior when link is pointing to remote cdn and integrity attribute is removed? Then I can take closer look on it.

We have test which checks that integrity attribute is removed for downloaded elements only

node-website-scraper/test/unit/resource-handler/html.test.js

Lines 239 to 269 in c565e14

    
           	it('should remove SRI check for loaded resources', () => { 
        
           		const sources = [ 
        
           			{ selector: 'script', attr: 'src'} 
        
           		]; 
        
           		htmlHandler = new HtmlHandler({sources}, {downloadChildrenPaths}); 
        
           		const html = ` 
        
           			<html> 
        
           			<head> 
        
           				<link href="http://examlpe.com/style.css" integrity="sha256-gaWb8m2IHSkoZnT23u/necREOC//MiCFtQukVUYMyuU=" rel="stylesheet"> 
        
           			</head> 
        
           			<body> 
        
           				<script integrity="sha256-X+Q/xqnlEgxCczSjjpp2AUGGgqM5gcBzhRQ0p+EAUEk=" src="http://example.com/script.js"></script> 
        
           			</body> 
        
           			</html> 
        
           		`; 
        
           		const resource = new Resource('http://example.com', 'index.html'); 
        
           		resource.setText(html); 
        
           		// before handle should contain both integrity checks 
        
           		resource.getText().should.containEql('integrity="sha256-gaWb8m2IHSkoZnT23u/necREOC//MiCFtQukVUYMyuU="'); 
        
           		resource.getText().should.containEql('integrity="sha256-X+Q/xqnlEgxCczSjjpp2AUGGgqM5gcBzhRQ0p+EAUEk="'); 
        
           		return htmlHandler.handle(resource).then(() => { 
        
           			// after handle should contain integrity check for styles 
        
           			// but not contain integrity check for script because it was loaded 
        
           			resource.getText().should.containEql('integrity="sha256-gaWb8m2IHSkoZnT23u/necREOC//MiCFtQukVUYMyuU="'); 
        
           			resource.getText().should.not.containEql('integrity="sha256-X+Q/xqnlEgxCczSjjpp2AUGGgqM5gcBzhRQ0p+EAUEk="'); 
        
           		}); 
        
           	});

Maybe there is some bug in your case

alexivkin · 2020-07-22T02:00:05Z

Is this intentional in the test? - <link href=http: //examlpe.com/style.css"

Anyhow, here is where it is happening for me. Source html:

<head>
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/css/bootstrap.min.css" integrity="sha384-ggOyR0iXCbMQv3Xipma34MD+dH/1fQ784/j6cY/iJTQUOhcWr7x9JvoRxT2MZw1T" crossorigin="anonymous">
<script src="https://cdnjs.cloudflare.com/ajax/libs/popper.js/1.14.7/umd/popper.min.js" integrity="sha384-UO2eT0CpHqdSJQ6hJty5KVphtPhzWj9WO1clHTMGa3JDZwrnQq4sF86dIHNDz0W1" crossorigin="anonymous"></script>
</head>

The integrity attribute on the last two lines disappears in the downloaded version.

It looks like the tags are erroneously get processed by loadResourcesForRules in the html handler. el.removeIntegrityCheck(); in then promise of downloadChildrenPaths strips the integrity attribute.
Could there be something with the way cheerio parses my html? I've noticed it also removes closing slashes - <meta name... /> becomes <meta name=... > although I am not too concerned about that part.

stale bot added the wontfix label Nov 9, 2019

s0ph1e added maybe-later and removed wontfix labels Nov 10, 2019

aivus added the help wanted label Aug 21, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Integrity attribute stripping #374

Integrity attribute stripping #374

almightyju commented Sep 4, 2019 •

edited

Loading

s0ph1e commented Sep 9, 2019

almightyju commented Sep 9, 2019 •

edited

Loading

s0ph1e commented Sep 10, 2019

stale bot commented Nov 9, 2019

alexivkin commented Jun 11, 2020

s0ph1e commented Jun 11, 2020

alexivkin commented Jun 11, 2020 •

edited

Loading

s0ph1e commented Jun 15, 2020

alexivkin commented Jul 22, 2020 •

edited

Loading

Integrity attribute stripping #374

Integrity attribute stripping #374

Comments

almightyju commented Sep 4, 2019 • edited Loading

s0ph1e commented Sep 9, 2019

almightyju commented Sep 9, 2019 • edited Loading

s0ph1e commented Sep 10, 2019

stale bot commented Nov 9, 2019

alexivkin commented Jun 11, 2020

s0ph1e commented Jun 11, 2020

alexivkin commented Jun 11, 2020 • edited Loading

s0ph1e commented Jun 15, 2020

alexivkin commented Jul 22, 2020 • edited Loading

almightyju commented Sep 4, 2019 •

edited

Loading

almightyju commented Sep 9, 2019 •

edited

Loading

alexivkin commented Jun 11, 2020 •

edited

Loading

alexivkin commented Jul 22, 2020 •

edited

Loading