Firewall issues when crawling some websites #83

Open
crarugal opened this issue Aug 2, 2022 · 0 comments

crarugal commented Aug 2, 2022

Here are a few examples of websites where Heritrix has been blocked by a firewall or by CAPTCHAs:

| Target | Website | Example instance or latest instance | Comment |
| --- | --- | --- | --- |
| https://www.webarchive.org.uk/act/targets/128627 | https://www.signatureaviation.com/ | https://www.webarchive.org.uk/act/wayback/archive/20210824094106/https://www.signatureaviation.com/ | seems ok now |
| https://www.webarchive.org.uk/act/targets/3706 | http://www.crawleyobserver.co.uk/ | https://www.webarchive.org.uk/act/wayback/archive/20210402101105/http://www.crawleyobserver.co.uk/ | seems ok now |
| https://www.webarchive.org.uk/act/targets/136007 | https://www.teachwire.net/ | https://www.webarchive.org.uk/act/wayback/archive/20220106111103/https://www.teachwire.net/ | seems ok now |
| https://www.webarchive.org.uk/act/targets/147300 | https://www.schuh.co.uk/ | https://www.webarchive.org.uk/act/wayback/archive/20220721100444/https://www.schuh.co.uk/ | still not crawling |
| https://www.webarchive.org.uk/act/targets/155587#crawlpolicy | https://cilexjournal.org.uk/ | https://www.webarchive.org.uk/act/wayback/archive/20220220090651/https://cilexjournal.org.uk/ | still not crawling |
| https://www.webarchive.org.uk/act/targets/149261 | https://teamnnuh.co.uk/ | | no captures, no info in logs |
| https://www.webarchive.org.uk/act/targets/156010 | https://hospicefoundation.ie/ | https://www.webarchive.org.uk/act/wayback/archive/20220715091752/https://hospicefoundation.ie/ | still not crawling |
| https://www.webarchive.org.uk/act/targets/156865 | https://www.odeon.co.uk/ | https://www.webarchive.org.uk/act/wayback/archive/20220730103903/https://www.odeon.co.uk/ | still an issue, Cloudflare |
| https://www.webarchive.org.uk/act/targets/157334 | https://muslimcharity.org.uk/ | https://www.webarchive.org.uk/act/wayback/archive/20220723100028/https://muslimcharity.org.uk/ | still an issue, Cloudflare |
| https://www.webarchive.org.uk/act/targets/159206 | https://www.greencoat-renewables.com/ | https://www.webarchive.org.uk/act/wayback/archive/20220722094820/https://www.greencoat-renewables.com/ | still an issue |
| https://www.webarchive.org.uk/act/targets/158590 | https://www.diehardia.com/ | | no captures, no info in logs |
| https://www.webarchive.org.uk/act/targets/157211 | https://www.poferries.com/ | | not crawling since March 2022, -5000, -5002 |
| https://www.webarchive.org.uk/act/targets/3851 | https://www.thetimes.co.uk/ | https://www.webarchive.org.uk/act/wayback/archive/20220801103716/https://www.thetimes.co.uk/ | still an issue, CloudFront |
| https://www.webarchive.org.uk/act/targets/160154 | https://www.techagainstterrorism.org/ | https://www.webarchive.org.uk/act/wayback/archive/20220728094007/https://www.techagainstterrorism.org/ | still an issue, Cloudflare |
| https://www.webarchive.org.uk/act/targets/160474 | https://www.riverstonellc.com/ | | not crawling since May 2022, -5002 |
| https://www.webarchive.org.uk/act/targets/161338 | https://www.missguided.co.uk/ | https://www.webarchive.org.uk/act/wayback/archive/20220621091058/https://www.missguided.co.uk/ | still an issue, CAPTCHA |
| https://www.webarchive.org.uk/act/targets/10645 | https://www.fortnumandmason.com/ | https://www.webarchive.org.uk/act/wayback/archive/20220627094005/https://www.fortnumandmason.com/ | still an issue, Cloudflare |
| https://www.webarchive.org.uk/act/targets/161938 | https://www.amnh.org/ | https://www.webarchive.org.uk/act/wayback/archive/20220705102408/https://www.amnh.org/research/darwin-manuscripts/?__cf_chl_rt_tk=AXI6j2vIje19Hv9U7uS5sSTpF9t2GyuzsVvhaLnUVDU-1657016648-0-gaNycGzNCKU | still an issue, Cloudflare |
| https://www.webarchive.org.uk/act/targets/131772 | https://cumbriacrack.com/ | https://www.webarchive.org.uk/act/wayback/archive/20220730101655/https://cumbriacrack.com/ | still an issue, Cloudflare |
| https://www.webarchive.org.uk/act/targets/149065 | https://ort.org/ | https://www.webarchive.org.uk/act/wayback/archive/20220517094742/https://ort.org/ | still an issue, Cloudflare |
| https://www.webarchive.org.uk/act/targets/164270 | https://www.vistrygroup.co.uk/ | https://www.webarchive.org.uk/act/wayback/archive/20220716100156/https://www.vistrygroup.co.uk/ | not crawling, -5002 |
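
One possible way to spot more captures like the amnh.org one above, where the archived URL itself carries a Cloudflare challenge token, is to check the query string of the captured URL. This is only a rough sketch: the function names are hypothetical, and the assumption that challenge redirects always use query parameters starting with `__cf_chl_` is based solely on the single `__cf_chl_rt_tk` example in this list. It would not catch challenge pages served at the original URL.

```python
from urllib.parse import parse_qs, urlsplit

# Assumption: challenge redirects add query parameters whose names start with
# this prefix, as seen in the amnh.org capture (__cf_chl_rt_tk).
CHALLENGE_PREFIX = "__cf_chl_"


def original_url(wayback_url: str) -> str:
    """Return the archived URL embedded in a Wayback-style link.

    Expects the form ".../archive/<timestamp>/<original URL>" and returns an
    empty string if that pattern is not present.
    """
    _, _, rest = wayback_url.partition("/archive/")
    # rest looks like "20220705102408/https://www.amnh.org/..."
    _, _, url = rest.partition("/")
    return url


def looks_like_cloudflare_challenge(wayback_url: str) -> bool:
    """True if the captured URL carries a challenge-style query parameter."""
    query = urlsplit(original_url(wayback_url)).query
    params = parse_qs(query, keep_blank_values=True)
    return any(name.startswith(CHALLENGE_PREFIX) for name in params)


if __name__ == "__main__":
    captures = [
        "https://www.webarchive.org.uk/act/wayback/archive/20220705102408/"
        "https://www.amnh.org/research/darwin-manuscripts/"
        "?__cf_chl_rt_tk=AXI6j2vIje19Hv9U7uS5sSTpF9t2GyuzsVvhaLnUVDU-1657016648-0-gaNycGzNCKU",
        "https://www.webarchive.org.uk/act/wayback/archive/20220730103903/"
        "https://www.odeon.co.uk/",
    ]
    for capture in captures:
        print(looks_like_cloudflare_challenge(capture), original_url(capture))
```

Running this over a list of capture URLs would only flag cases where the crawler followed the challenge redirect; sites that return -5002 or have no captures at all would still need to be found from the crawl logs.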