Questions about mirroring + caching #504
Comments
There are two mirroring modes: full mirror and proxy. In proxy mode, we forward to the upstream repodata.json, but download the requested packages and cache them on the server. Lines 1639 to 1648 in 0c91da2
Streaming repodata.json is almost always a bad idea: it's much bigger than the gzip-compressed one. |
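To make the proxy-mode behaviour above concrete, here is a minimal sketch of "download on first request, then serve from the server-side cache". The class `ProxyCache`, the `fetch_upstream` callable, and the in-memory `store` are hypothetical stand-ins for illustration only; they are not Quetz's actual classes or pkgstore API.

```python
from typing import Callable, Dict


class ProxyCache:
    """Hypothetical proxy-mode cache; not Quetz's real implementation."""

    def __init__(self, fetch_upstream: Callable[[str], bytes]) -> None:
        # fetch_upstream stands in for an HTTP GET against the upstream channel.
        self.fetch_upstream = fetch_upstream
        # A plain dict stands in for the pkgstore (local disk, S3, GCS, ...).
        self.store: Dict[str, bytes] = {}
        self.upstream_hits = 0

    def get_package(self, path: str) -> bytes:
        # Cache miss: download from upstream once and keep a copy.
        if path not in self.store:
            self.store[path] = self.fetch_upstream(path)
            self.upstream_hits += 1
        # Cache hit (or freshly cached): serve from the server-side copy.
        return self.store[path]
```

With this shape, only the first request for a given package pays the upstream download; later requests are answered from the cache.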
This is where the .gz and .bz2 files are created: Lines 35 to 51 in 0c91da2
|
I don't see the point? What would be the point? Do you think for very small files that the redirect is the bottleneck? |
Yes. Will do some benchmarking on this. |
Just FYI, the pre-authentication is usually computed entirely on the Quetz side (typically some encryption of the request metadata with the authentication token). I am interested to see a benchmark. In general, I'd like to avoid, as much as possible, having Python "touch" static files. Static files should be routed through nginx or S3 / GCS etc. |
Here are a bunch of timings with GCS pkgstore and from a WiFi home internet connection (each fastest of ~10 tries):
So, GCS overhead seems to be ~ 250 ms, and the overhead from the roundtrip seems to be another 110 ms. (Not sure how curl needs 110 ms to process the redirect?!) So there is a "budget" of 380 ms for Quetz to serve packages directly without redirect (270 ms if you ignore the 110 ms spent in curl). Also, the way
Edit: The GCS bucket was in the US while I'm in the EU. I did another test with GCS in the EU, which removes ~ 150 ms. I also tried adding Cloud CDN in front, which shaves off another 50 ms. So the budget shrinks to 230/180 ms (or 120/70 ms with the curl overhead removed). |
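The budget arithmetic can be spelled out as a small calculation. This sketch takes the 380 ms starting budget, the 110 ms curl redirect overhead, and the 150 ms and 50 ms savings exactly as quoted above; the variable and function names are only for illustration.

```python
from typing import Tuple

START_BUDGET_MS = 380   # quoted budget with the US bucket
CURL_REDIRECT_MS = 110  # quoted time curl spends on the redirect
EU_BUCKET_SAVING_MS = 150
CLOUD_CDN_SAVING_MS = 50


def budget(total_ms: int) -> Tuple[int, int]:
    # Returns (budget including curl overhead, budget with curl overhead removed).
    return total_ms, total_ms - CURL_REDIRECT_MS


us_bucket = budget(START_BUDGET_MS)                        # (380, 270)
eu_bucket = budget(START_BUDGET_MS - EU_BUCKET_SAVING_MS)  # (230, 120)
eu_bucket_cdn = budget(
    START_BUDGET_MS - EU_BUCKET_SAVING_MS - CLOUD_CDN_SAVING_MS
)                                                          # (180, 70)
```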
Never mind, it's with |
It's quite possible that there are issues with the proxy mode... I would have to look into it more deeply. Interesting findings re. timing. Tbh I am less concerned about small files than about large files, and that's where I think they should really not be served through Python. If you have 5 (or more!) parallel downloads, this should only give a tiny hit on the overall picture ... |
For S3 we already have support in powerloader btw (to natively pre-sign URLs on the client side): https://github.com/mamba-org/powerloader/blob/effe2b7e1f555616e4e4c877648658d1e6c89ded/src/mirrors/s3.cpp#L239-L244 For GCS the algorithm looks extremely similar so we could also add support for https://cloud.google.com/storage/docs/access-control/signing-urls-manually That would remove the initial redirect roundtrip -- but force you to distribute S3 / GCS credentials to users. |
See #506. |
Likely. Will do some testing on this as well :) |
IIUC the mirroring code correctly, downloading and caching works as follows (assuming you're using a GCP backend):
Special case repodata.json:
- […] repodata.json.gz from pkgstore and stream that to client.
- […] .gz exists, stream repodata.json from pkgstore to client.
Questions:
- Where are the repodata.json.gz files coming from?
- […] repodata.json{,.gz}?
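One reading of the repodata.json special case is "prefer the .gz variant when the pkgstore has it, otherwise fall back to the uncompressed file". The sketch below shows only that selection step; `pick_repodata`, the flat list of available paths, and the `linux-64` default are hypothetical and do not reflect Quetz's actual pkgstore interface.

```python
from typing import Iterable


def pick_repodata(available: Iterable[str], subdir: str = "linux-64") -> str:
    """Return the pkgstore path to stream for a repodata.json request."""
    files = set(available)
    gz_path = f"{subdir}/repodata.json.gz"
    plain_path = f"{subdir}/repodata.json"
    # Prefer the compressed variant, since it is much smaller on the wire.
    if gz_path in files:
        return gz_path
    # Otherwise fall back to the uncompressed file.
    if plain_path in files:
        return plain_path
    raise FileNotFoundError(plain_path)
```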