Integrity attribute stripping #374
Hey @almightyju 👋 Sorry for the late response. There are two things here: removing the integrity check and getReference. I'll try to explain why it's implemented this way. As for the integrity check, it is removed for each resource because even if path (ex. As for getReference, it generates a relative link so that the copied website works locally from a directory with the simplest possible setup. All you need to do is scrape the website and copy the scraped website directory somewhere, and it works. Absolute links require additional setup. So I think if having absolute links is suitable for your case, you can use
No rush, I've worked around it in my exact use case by using the saveResource action to add the integrity attributes back on. It makes sense that you strip the integrity attribute if you change the content that's scraped. For me personally, a "preserve integrity attributes" option that simply leaves the attribute alone would work. At least that way the documentation would make it obvious that the attribute is removed by default (I was super confused about how the attribute just went missing). The better solution would be to check whether any content has changed: if it hasn't, leave the integrity attribute alone; if it has and the integrity attribute was originally there, generate a new hash for it ( But like I said, for my case right now I'm actually generating the integrity for resources on the page, so a simple preserve option would do :)
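The "keep it if unchanged, regenerate it if changed" idea above could be sketched roughly as follows. This is only an illustration using Node's built-in `crypto` module, not code from website-scraper. The helper names `computeIntegrity` and `resolveIntegrity` are hypothetical, and the sketch assumes the resource content is available as a string both before and after the scraper rewrites it.

```javascript
const crypto = require('node:crypto');

// Build a Subresource Integrity value of the form "<algorithm>-<base64 digest>".
function computeIntegrity(content, algorithm = 'sha384') {
  const digest = crypto.createHash(algorithm).update(content).digest('base64');
  return `${algorithm}-${digest}`;
}

// Hypothetical helper: decide which integrity value a saved resource should carry.
// Assumes originalContent and savedContent are strings.
function resolveIntegrity(originalIntegrity, originalContent, savedContent) {
  // The element never had an integrity attribute, so there is nothing to preserve.
  if (!originalIntegrity) return undefined;

  // Content is untouched, so the original hash is still valid.
  if (originalContent === savedContent) return originalIntegrity;

  // Content was rewritten: re-hash with the algorithm of the first listed hash.
  const algorithm = originalIntegrity.trim().split(/\s+/)[0].split('-')[0];
  return computeIntegrity(savedContent, algorithm);
}
```

A caller would then set the attribute only when `resolveIntegrity` returns a value. Note that `crypto.createHash` throws for an algorithm name it does not recognise, so a real implementation would probably want to validate the prefix first.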
Nice that you found a way to add integrity attributes 👍 I'm not going to add a property like "preserve integrity attribute" now; I guess in 90% of cases the content will be changed and the integrity attribute should be removed or updated. I should probably mention this somewhere in the readme or FAQ. Leaving this issue open for now so I don't forget about it. I'll try to find a good way to add integrity checks to downloaded resources. Maybe it makes sense to add more common
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
It's rather odd to do this on script and CSS tags that come from CDNs and will never change.
@alexivkin it's not called for resources that were filtered out (by URL filter, by max depth, etc.), only for resources that were actually downloaded. Their content was changed, so the integrity attribute is removed.
What I mean is it's stripping the integrity attributes off tags that are in the downloaded file, but pointing to the skipped / filtered out asset. For example the original |
@alexivkin could you provide a reproducible example of this behavior, where the link points to a remote CDN and the integrity attribute is removed? Then I can take a closer look at it. We have a test which checks that the integrity attribute is removed for downloaded elements only: node-website-scraper/test/unit/resource-handler/html.test.js, lines 239 to 269 in c565e14.
Maybe there is some bug in your case
Is this intentional in the test? - <link href=http: //examlpe.com/style.css" Anyhow, here is where it is happening for me. Source html:
The integrity attribute on the last two lines disappears in the downloaded version. It looks like the tags are erroneously getting processed by
So I'm scraping a site and generating integrity attributes, but after returning the parsed body the integrity attributes are being stripped.
I've found at resource-handler/html/index.js:55 is the following section
This is the issue: it looks like you check a number of child elements to find other resources to load, but in the process remove the integrity attribute. That makes sense if you change the content of the child resource, but in my use case no child resource ever changes, as all links are relative/external.
I applied the following patch locally to test:
almightyju@ecc8aac
But I'm not convinced it's the right approach, since it doesn't work without a modified getReference action, as this line in resource-handler/index.js:61
sets reference as 'relative/url' when the original reference was '/relative/url', which means the content is different in the html resource handler and it strips the integrity attribute. I've changed the getReference function via a plugin for myself, but I'm not sure how badly it would break other cases if this were in your library, and without it my patch is rather obscure to use.
Hoping there might be an easy way to resolve the leading / being stripped from the originalReference :)
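For readers wondering what a getReference override via a plugin might look like, here is a minimal sketch. It assumes the plugin shape as I understand it from the website-scraper README: a class with an `apply(registerAction)` method, and a `getReference` action that receives `originalReference` and returns `{ reference }`. Please check this against the version you have installed. The class name `KeepOriginalReferencePlugin` is hypothetical and this is not the patch linked above.

```javascript
// Hypothetical plugin: keeps every reference exactly as it appeared in the
// source HTML, so '/relative/url' is not rewritten to 'relative/url'.
class KeepOriginalReferencePlugin {
  apply(registerAction) {
    // Assumed action signature: ({ resource, parentResource, originalReference }) => ({ reference })
    registerAction('getReference', ({ originalReference }) => ({
      reference: originalReference,
    }));
  }
}
```

It would be passed to the scraper through the `plugins` option, for example `plugins: [new KeepOriginalReferencePlugin()]`. The trade-off is the one described in the maintainer's reply: with the original references kept, links in the saved pages no longer point at the local copies, so the scraped directory will not work offline without extra setup.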