Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integrity attribute stripping #374

Open
almightyju opened this issue Sep 4, 2019 · 9 comments
Open

Integrity attribute stripping #374

almightyju opened this issue Sep 4, 2019 · 9 comments

Comments

@almightyju
Copy link
Contributor

almightyju commented Sep 4, 2019

So I'm scraping a site and generating integrity attributes but after returning the parsed body the integrity attributes are being striped.

I've found at resource-handler/html/index.js:55 is the following section

then((updatedText) => {
	el.setData(updatedText);
	el.removeIntegrityCheck();
});

Which is the issue, it looks like you check a bunch of child elements to find any other resources to load but in the process remove the integrity (which makes sense if you change the content of the child resource), but my use case no child resource has anything that ever changes as all links are relative/external.

I applied the following patch locally to test:
almightyju@ecc8aac

But I'm not convinced its the right approach since it doesn't work without a modified getReference action as this line in resource-handler/index.js:61

const { reference } = await self.getReference({parentResource, resource: respondedResource, originalReference: childPath});

sets reference as 'relative/url' when the original reference was '/relative/url' which means the content is different in the html resource handler and strips the integrity attribute. I've changed the getReference function via a plugin for myself but I'm not sure how badly it will break other cases if this was in your library and without it my patch is rather obscure to use.

Hoping there might be an easy way to resolve the leading / being stripped from the originalReference :)

@s0ph1e
Copy link
Member

s0ph1e commented Sep 9, 2019

Hey @almightyju 👋

Sorry for late response

So here we have 2 things - removing integrity check and getReference. I'll try to explain why it's implemented in this way

As for integrity check - it is removed for each resource because even if path (ex. ./path/style.css) was not updated (it was the same on original website and on copied), resource content may be changed (for example, some background image from css file was downloaded to different file than on original website). If we leave integrity check - resource will not be loaded on copied site

As for getReference - it generates relative link to make copied website work locally from directory with most simple setup. All you need to do - is srape website and copy scraped website directory somewhere and it works. With absolute link it requires additional setup.

So I think if having absolute links is suitable for your case - you can use getReference to overwrite default behavior. As for removing integrity checks - I'm not sure that checking if path was updated is enough. Maybe need also check if resource content was updated. What do you think?

@almightyju
Copy link
Contributor Author

almightyju commented Sep 9, 2019

No rush, I've worked around it in my exact use case by using the saveResource action to add the integrity stuff back on.

It makes sense why you strip the integrity if you change the content that's scraped. For me personally an option of "preserver integrity attributes" which simply leaves the attribute alone would work, at least that way it's more obvious via the documentation by standard the attribute is removed (I was super confused how the attribute just went missing).

The better solution would be to check if any content has changed and simply leave the integrity if it hasn't or if the content has changed and the integrity attribute was originally there generate a new hash for it (crypto.createHash('sha256').update(content).digest('base64') gives you the hash).

But like I said, for my case right now I'm actually generating the integrity for resources on the page so a simple preserve option would do :)

@s0ph1e
Copy link
Member

s0ph1e commented Sep 10, 2019

Nice that you found way to add integrity attributes 👍

I'm not going to add property like "preserve integrity attribute" now, I guess in 90% cases content will be changed and integrity should be removed/updated. Probably I should mention about this somewhere in readme or FAQ.

Leaving this issue open for now to not forget about it. I'll try to find good way to add integrity checks to downloaded resources. Maybe it makes sense to add more common getHtmlAttributes action to add not only integrity but also other attributes.

@stale
Copy link

stale bot commented Nov 9, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Nov 9, 2019
@alexivkin
Copy link

It's rather odd to do on scripts and CSS's tags that come from CDNs and will never change.
Please disable this behavior at least on the links that are filtered out by the urlFilter.

@s0ph1e
Copy link
Member

s0ph1e commented Jun 11, 2020

@alexivkin it's not called for filtered out resources (by url filter, by max depth, etc.). Only for resources that were actually downloaded - their content was changed so integrity attribute is removed.

@alexivkin
Copy link

alexivkin commented Jun 11, 2020

What I mean is it's stripping the integrity attributes off tags that are in the downloaded file, but pointing to the skipped / filtered out asset. For example the original index.html has <link rel="stylesheet" href="https://stackpath.bootstrapcdn... with the appropriate integrity attribute. The scraped/saved index.html has that attribute removed from the <link rel=".... but that <link is still pointing to the remote CSS pm the CDN (as it should). Scraper should keep the integrity attribute on the <link in that case.

@s0ph1e
Copy link
Member

s0ph1e commented Jun 15, 2020

@alexivkin could you provide reproducible example of such behavior when link is pointing to remote cdn and integrity attribute is removed? Then I can take closer look on it.

We have test which checks that integrity attribute is removed for downloaded elements only

it('should remove SRI check for loaded resources', () => {
const sources = [
{ selector: 'script', attr: 'src'}
];
htmlHandler = new HtmlHandler({sources}, {downloadChildrenPaths});
const html = `
<html>
<head>
<link href="http://examlpe.com/style.css" integrity="sha256-gaWb8m2IHSkoZnT23u/necREOC//MiCFtQukVUYMyuU=" rel="stylesheet">
</head>
<body>
<script integrity="sha256-X+Q/xqnlEgxCczSjjpp2AUGGgqM5gcBzhRQ0p+EAUEk=" src="http://example.com/script.js"></script>
</body>
</html>
`;
const resource = new Resource('http://example.com', 'index.html');
resource.setText(html);
// before handle should contain both integrity checks
resource.getText().should.containEql('integrity="sha256-gaWb8m2IHSkoZnT23u/necREOC//MiCFtQukVUYMyuU="');
resource.getText().should.containEql('integrity="sha256-X+Q/xqnlEgxCczSjjpp2AUGGgqM5gcBzhRQ0p+EAUEk="');
return htmlHandler.handle(resource).then(() => {
// after handle should contain integrity check for styles
// but not contain integrity check for script because it was loaded
resource.getText().should.containEql('integrity="sha256-gaWb8m2IHSkoZnT23u/necREOC//MiCFtQukVUYMyuU="');
resource.getText().should.not.containEql('integrity="sha256-X+Q/xqnlEgxCczSjjpp2AUGGgqM5gcBzhRQ0p+EAUEk="');
});
});

Maybe there is some bug in your case

@alexivkin
Copy link

alexivkin commented Jul 22, 2020

Is this intentional in the test? - <link href=http: //examlpe.com/style.css"

Anyhow, here is where it is happening for me. Source html:

<head>
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/css/bootstrap.min.css" integrity="sha384-ggOyR0iXCbMQv3Xipma34MD+dH/1fQ784/j6cY/iJTQUOhcWr7x9JvoRxT2MZw1T" crossorigin="anonymous">
<script src="https://cdnjs.cloudflare.com/ajax/libs/popper.js/1.14.7/umd/popper.min.js" integrity="sha384-UO2eT0CpHqdSJQ6hJty5KVphtPhzWj9WO1clHTMGa3JDZwrnQq4sF86dIHNDz0W1" crossorigin="anonymous"></script>
</head>

The integrity attribute on the last two lines disappears in the downloaded version.

It looks like the tags are erroneously get processed by loadResourcesForRules in the html handler. el.removeIntegrityCheck(); in then promise of downloadChildrenPaths strips the integrity attribute.
Could there be something with the way cheerio parses my html? I've noticed it also removes closing slashes - <meta name... /> becomes <meta name=... > although I am not too concerned about that part.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants