feat(ranking): add popularAlternativeNames, detect security packages (#941) #965
Conversation
Note that I focused on the ranking aspect here and didn't add any skipping logic for the security packages - can add that too if you want.
Interesting approach!
I'd like to see what impact that has on the index, but I'm not sure how to test it without doing a complete index. Do you have any idea?
Maybe what we could do is, instead of removing alternative names completely, add a new key called popularAlternativeNames in the _searchInternal, add the query rule to that instead of alternativeNames, and make sure the original name is still in popularAlternativeNames (maybe?). That way, after a reindex, we can compare the quality between popularAlternativeNames and alternativeNames.
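If it helps to picture it, here is a minimal sketch of that idea, assuming a downloads-based popularity cutoff; every field name except popularAlternativeNames and _searchInternal is an illustrative assumption, not the project's actual schema:

```ts
// Hypothetical record shape; fields other than popularAlternativeNames
// and _searchInternal are assumptions for illustration.
interface SearchRecord {
  name: string;
  alternativeNames: string[];
  downloadsLast30Days: number;
  _searchInternal: {
    popularAlternativeNames: string[];
  };
}

// Assumed popularity cutoff; a real threshold would need tuning.
const POPULAR_DOWNLOADS_THRESHOLD = 1_000_000;

function withPopularAlternativeNames(record: SearchRecord): SearchRecord {
  const isPopular = record.downloadsLast30Days >= POPULAR_DOWNLOADS_THRESHOLD;
  return {
    ...record,
    _searchInternal: {
      ...record._searchInternal,
      // Keep the original name in the list (as suggested above) so a
      // query rule targeting this attribute still matches exact names.
      popularAlternativeNames: isPopular
        ? [record.name, ...record.alternativeNames]
        : [record.name],
    },
  };
}
```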
Agree, that's a good idea.
Co-authored-by: Haroen Viaene <[email protected]>
LGTM
This will require a full reindex unfortunately. I need to finish the work on incremental full reindex that was started in #819.
@bodinsamuel unrelated, but do you know what's currently the bottleneck in a full reindex? E.g. would it help to run it on a larger VM with higher concurrency?
Also, in issues like #930, I was wondering if throwing more resources at it would help, as that's something we might be able to do.
@MartinKolarik currently the limitations are not physical.

Registry:

Design:

The way the system was designed (before I even joined Algolia) was simple and effective: receive a job, process it. The volume was lower and expectations were also lower. To help on everything, I would love to:
That's interesting, as our services query npm a lot too and we don't have this issue. Is that the registry or some other service (maybe the downloads API)? We previously had a similar problem with GitHub, and the limit was too low-level for them to do anything about it; we solved it by assigning multiple IPs to our VMs and distributing the outgoing requests across those (but that's probably not possible on a PaaS).
I checked the code and it indeed looks problematic, but I think there might be some not-too-hard solutions: process N packages at once, and track which are being processed. When a second event for the same package arrives, either put it in an intermediate queue (which has some small size limit) or simply pause until the previous event for the package completes. Of course, if there are issues with rate limits, it's a question how much this would help.
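A rough sketch of that idea, assuming events keyed by package name and a hypothetical processPackage function (not the project's actual API): process up to N packages concurrently, and serialize events for the same package by chaining them onto the in-flight promise.

```ts
// Hypothetical event shape; processPackage is an assumed worker function.
type PackageEvent = { name: string; seq: number };

declare function processPackage(event: PackageEvent): Promise<void>;

const MAX_CONCURRENCY = 10;
let active = 0;
// Tracks in-flight (or chained) work per package name.
const inFlight = new Map<string, Promise<void>>();
const pending: PackageEvent[] = [];

function enqueue(event: PackageEvent): void {
  pending.push(event);
  drain();
}

function drain(): void {
  while (active < MAX_CONCURRENCY && pending.length > 0) {
    const event = pending.shift()!;
    active += 1;
    // If an event for the same package is already running, chain onto it,
    // so updates for a single package are processed one at a time.
    const previous = inFlight.get(event.name) ?? Promise.resolve();
    const current = previous
      .then(() => processPackage(event))
      .catch((err) => console.error(`failed seq ${event.seq}`, err))
      .finally(() => {
        active -= 1;
        if (inFlight.get(event.name) === current) inFlight.delete(event.name);
        drain();
      });
    inFlight.set(event.name, current);
  }
}
```

Note that a chained event still occupies a concurrency slot while it waits; a fuller version would park it in the intermediate queue mentioned above instead.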
I considered this a couple of times for jsDelivr as well, but it seems like something that'll turn out to be more complex than you expect and bring a brand new bunch of issues.
Yes, probably. But since there is the follower pattern built into the DB, I would expect it to at least be easy to try, but I could not reach that state.
To be more precise, it's https://replicate.npmjs.com that causes us issues, not the downloads endpoint; so, the feed to receive updates. I think it boils down to the fact that we query updates 1:1, but we receive only the
Yep, that would be the main solution. Where it's complicated: to do this, I don't see any other solution, to my knowledge, than to have a pub/sub, or Rabbit, or a DB with a persistent disk. It's a bit awful how simple it seems at first, and then realising that it actually requires more complexity to be able to process things correctly.
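For reference, a minimal sketch of polling that feed, assuming https://replicate.npmjs.com behaves like a standard CouchDB _changes endpoint (the since parameter being the sequence checkpoint discussed here):

```ts
// Minimal sketch, assuming a standard CouchDB _changes endpoint.
async function pollChanges(since: string | number): Promise<void> {
  const url = new URL('https://replicate.npmjs.com/_changes');
  url.searchParams.set('since', String(since));
  url.searchParams.set('limit', '100');

  const res = await fetch(url);
  const body = (await res.json()) as {
    results: { id: string; seq: number; deleted?: boolean }[];
    last_seq: number;
  };

  for (const change of body.results) {
    // The feed only carries the package id and sequence number; the
    // document itself still has to be fetched 1:1, which is the
    // bottleneck described above.
    console.log(change.seq, change.id, change.deleted ? '(deleted)' : '');
  }
}
```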
I may be missing something, but I'm not sure how the concurrency is relevant here - the inserts/updates here should be idempotent, so I would just ignore the fact that B1 was processed and say "I'm at whatever was before A1" - which makes this part the same as for serial processing.
I'm not surprised at all 😆
Yeah, my explanation was incomplete, let me try again. With this sequence of updates:

sequence 1: A1 (update A)
sequence 2: B1 (create B)
sequence 3: A2 (update A)
sequence 4: B2 (delete B)

We start at sequence 1. As you mention, updates are now "idempotent" (idempotent in the sense that one update will always produce the same output; not really idempotent, because a previous update will produce a previous output), so it's fine to reprocess, until you realise that if you already processed B2 (the delete), reprocessing B1 will recreate a package that should be gone.

Note: that's why I would favor having a DB or a queue, to store all updates and redo only the ones that were not processed.
This seems acceptable because the window for this error is only as big as the concurrency level, so very soon you delete B again. You could even say that from an outside observer's perspective, you wouldn't get to an incorrect final state. Anyway, an external queue definitely makes sense here; just saying that if an easier solution was more desirable, it could probably still work.
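To make the trade-off concrete, here is a minimal sketch of the queue-based variant favored above: persist every update with its sequence number, mark each one done individually, and on restart redo only the unprocessed ones. The store interface is hypothetical; a real implementation would back it with a DB or a broker.

```ts
// Hypothetical persistent store; could be backed by Postgres, Redis,
// RabbitMQ, etc.
interface UpdateStore {
  append(update: { seq: number; pkg: string; change: string }): Promise<void>;
  markDone(seq: number): Promise<void>;
  listUnprocessed(): Promise<{ seq: number; pkg: string; change: string }[]>;
}

declare function applyChange(update: {
  seq: number;
  pkg: string;
  change: string;
}): Promise<void>;

// On restart: instead of resuming from "whatever was before A1" and
// replaying already-applied updates (which can briefly resurrect a
// deleted package, as in the B1/B2 example), redo only what never
// completed, in sequence order.
async function recover(store: UpdateStore): Promise<void> {
  const unprocessed = await store.listUnprocessed();
  for (const update of unprocessed.sort((a, b) => a.seq - b.seq)) {
    await applyChange(update);
    await store.markDone(update.seq);
  }
}
```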
Changed to popularAlternativeNames.
Added now.
Then I think it's all done here.
@bodinsamuel is on holiday now, and apparently I don't have access to merge when the CI can't run; will get back to you May 30th :)
Well, I don't have admin rights either, and we do have the "run fork pr" option in CircleCI; it should run 🤔
Merged in #975.
This adds the isSecurityHeld flag as discussed in Ignore some packages? #657. I went with a property at the same level as isDeprecated (and not in _searchInternal), since it may be interesting for the end users too.
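A minimal sketch of the resulting record shape under that choice; all fields besides isSecurityHeld, isDeprecated, and _searchInternal are illustrative assumptions:

```ts
// Illustrative record shape; only the placement of isSecurityHeld next to
// isDeprecated (rather than inside _searchInternal) reflects the PR.
interface NpmSearchRecord {
  name: string;
  isDeprecated: boolean;
  // True when npm has replaced the package with a security holding package.
  isSecurityHeld: boolean;
  _searchInternal: {
    popularAlternativeNames: string[];
  };
}
```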