
feat(ranking): add popularAlternativeNames, detect security packages (#941) #965

Closed
wants to merge 11 commits

Conversation

MartinKolarik (Collaborator)

  1. Adds a new isSecurityHeld flag, as discussed in Ignore some packages ? #657. I went with a property at the same level as isDeprecated (and not in _searchInternal), since it may be interesting for end users too (a rough illustration of how such a flag could be derived follows this list).
  2. The property is then used for Improve ranking #941, along with some other checks. Popular packages are excluded from the checks, since I think those should get the alternative names even if they are deprecated (this applies to e.g. request).
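
A minimal sketch of the kind of heuristic that could back such a flag (the npm security-holder conventions used here are assumptions, and none of this is necessarily what the PR implements):

```ts
// Hypothetical illustration only, not the PR's actual code.
interface RawPkg {
  name: string;
  version: string;
  description?: string;
}

function isSecurityHeld(pkg: RawPkg): boolean {
  // npm's security placeholder packages are typically published with a
  // "-security" version suffix and a "security holding package" description;
  // treating that as the detection signal is an assumption here.
  return (
    pkg.version.endsWith('-security') ||
    /security holding package/i.test(pkg.description ?? '')
  );
}
```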

@MartinKolarik (Collaborator, Author)

Note that I focused on the ranking aspect here and didn't add any skipping logic for the security packages - can add that too if you want.

@Haroenv (Collaborator) left a comment

interesting approach!

src/config.ts (review thread resolved)
src/saveDocs.ts (review thread outdated, resolved)
@Haroenv (Collaborator) commented May 17, 2022

I'd like to see what impact that has on the index, but I'm not sure how to test it without doing a complete reindex. Do you have any idea?

@Haroenv (Collaborator) commented May 17, 2022

Maybe what we could do, instead of removing alternative names completely, is add a new key called popularAlternativeNames in _searchInternal, add the query rule to that instead of alternativeNames, and make sure the original name is still in popularAlternativeNames (maybe?). That way, after a reindex, we can compare the quality between popularAlternativeNames and alternativeNames.
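
Roughly, the record shape being proposed might look like this (a sketch based on the comment only; the field placement inside _searchInternal is assumed, not the final schema):

```ts
// Sketch of the proposed record shape; not the actual index schema.
interface PkgRecord {
  name: string;
  isDeprecated: boolean;
  isSecurityHeld: boolean;
  _searchInternal: {
    // kept untouched so existing search behaviour does not change
    alternativeNames: string[];
    // new: only populated for popular, non-deprecated, non-security-held
    // packages, and always including the original name, so the two lists
    // can be compared after a reindex
    popularAlternativeNames: string[];
  };
}
```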

@MartinKolarik (Collaborator, Author)

Maybe what we could do, instead of removing alternative names completely, is add a new key called popularAlternativeNames in _searchInternal, add the query rule to that instead of alternativeNames, and make sure the original name is still in popularAlternativeNames (maybe?). That way, after a reindex, we can compare the quality between popularAlternativeNames and alternativeNames.

Agree, that's a good idea.

Co-authored-by: Haroen Viaene <[email protected]>
@bodinsamuel (Contributor) left a comment

LGTM
This will require a full reindex, unfortunately. I need to finish the work on incremental full reindex that was started in #819.

src/@types/pkg.ts (review thread resolved)
@MartinKolarik (Collaborator, Author)

@bodinsamuel unrelated, but do you know what's currently the bottleneck in a full reindex? E.g., would it help to run it on a larger VM with higher concurrency?

@MartinKolarik (Collaborator, Author)

Also, in issues like #930, I was wondering if throwing more resources at it would help, as that's something we might be able to do.

@bodinsamuel (Contributor) commented May 17, 2022

@MartinKolarik currently the limitations are not physical:

Registry:

  • Rate limit (429) from the NPM registry
    We should get in touch with them to see what can be done.
  • Registry is slow
    It has happened a lot lately that requests could take more than 20 seconds.

Design:

  • We can barely parallelise work

The way the system was designed (before I even joined Algolia) was simple and effective: receive a job, process it. The volume was lower and expectations were also lower.
Right now the volume is much bigger and we added more dependencies to each job (jsDelivr, types, npm hits, etc.).
To counter this, there was an attempt to parallelise updates. However, there is a big catch: updates are ordered, so you can receive a package publish and then a package delete event; if you process them out of order, it will result in a corrupted index.
What could be done is parallelising per package, which is harder to achieve but doable.


To help with all of this, I would love to:

  • Host a read replica of the registry. I could not find a proper tutorial for it, but it should be doable to spawn a CouchDB in replica mode, since that is the goal of the database itself. That would allow us to never query the NPM registry again and speed up all requests by a huge factor (a rough sketch of the idea follows this list).
  • Host a queue or a DB, or whatever could keep track of pending jobs and jobs that have failed. Right now we either skip jobs or put them aside, and we can barely know which ones failed. I have recently started logging failed packages into a dedicated Algolia index, but Algolia is really not the best place for this.
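
For the read-replica idea, a minimal sketch of what starting continuous CouchDB replication could look like (the local CouchDB URL and credentials are placeholders, and the exact registry source endpoint would need to be confirmed):

```ts
// Sketch: register a continuous replication job in a self-hosted CouchDB's
// _replicator database. URLs and credentials below are placeholders.
const LOCAL_COUCH = 'http://admin:password@localhost:5984';

async function startRegistryReplication(): Promise<void> {
  const res = await fetch(`${LOCAL_COUCH}/_replicator/npm-registry`, {
    method: 'PUT',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      source: 'https://replicate.npmjs.com/registry', // assumed source endpoint
      target: `${LOCAL_COUCH}/npm-registry`,
      continuous: true,    // keep following the registry's changes feed
      create_target: true, // create the local database if it does not exist
    }),
  });
  if (!res.ok) {
    throw new Error(`Replication setup failed: ${res.status} ${await res.text()}`);
  }
}
```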

@MartinKolarik (Collaborator, Author) commented May 17, 2022

Rate limit (429) from NPM registry
We should get in touch with them to see what can be done

That's interesting, as our services query npm a lot too and we don't have this issue. Is that the registry or some other service (maybe the downloads API)? We previously had a similar problem with GitHub, and the limit was too low-level for them to do anything about it; we solved it by assigning multiple IPs to our VMs and distributing the outgoing requests across those (but that's probably not possible on a PaaS).

We can barely parallelise work

I checked the code and it indeed looks problematic, but I think there might be some not-too-hard solutions. Process N packages at once, and track which are being processed. When a second event for the same package arrives, either put it in an intermediate queue (which has some small size limit) or simply pause until the previous event for the package completes (a rough sketch of this follows below).

Of course, if there are issues with rate limits, it's a question of how much this would help.
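
As a rough sketch of that idea, with made-up names and not tied to the actual code:

```ts
// Sketch: process up to `slots` packages concurrently, but keep events for
// the same package strictly ordered. All names here are illustrative.
class Semaphore {
  private waiters: Array<() => void> = [];
  private free: number;
  constructor(slots: number) { this.free = slots; }
  async acquire(): Promise<void> {
    if (this.free > 0) { this.free -= 1; return; }
    await new Promise<void>((resolve) => this.waiters.push(resolve));
  }
  release(): void {
    const next = this.waiters.shift();
    if (next) next(); else this.free += 1;
  }
}

const tails = new Map<string, Promise<void>>(); // last pending event per package
const slots = new Semaphore(20);                // arbitrary concurrency level

function enqueue(pkgName: string, handler: () => Promise<void>): Promise<void> {
  // Chain onto the previous event for this package so per-package order is kept.
  const previous = tails.get(pkgName) ?? Promise.resolve();
  const next = previous.then(async () => {
    await slots.acquire();
    try {
      await handler();
    } finally {
      slots.release();
    }
  });
  // Store an error-swallowing tail so one failure doesn't block later events;
  // cleanup of finished entries is omitted for brevity.
  tails.set(pkgName, next.catch(() => {}));
  return next;
}
```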

@MartinKolarik (Collaborator, Author)

Host a read replica of the registry. I could not find a proper tutorial for it, but it should be doable to spawn a CouchDB in replica mode, since that is the goal of the database itself. That would allow us to never query the NPM registry again and speed up all requests by a huge factor.

I considered this a couple of times for jsDelivr as well, but it seems like something that will turn out to be more complex than you expect and bring a whole new set of issues.

@bodinsamuel (Contributor)

I considered this a couple of times for jsDelivr as well, but it seems like something that will turn out to be more complex than you expect and bring a whole new set of issues.

Yes, probably. But since the follower pattern is built into the DB, I would expect it to at least be easy to try, though I could not reach that state.

That's interesting as our services query npm a lot too and we don't have this issue.

To be more precise, it's https://replicate.npmjs.com that causes us issues, not the downloads endpoint. So it's the feed we use to receive updates. I think it boils down to the fact that we query updates 1:1: we receive only the rev part, then with this rev id we query the full-fledged JSON with all the info.

Process N packages at once, and track which are being processed.

Yep, that would be the main solution. Where it gets complicated is that the sequence id is no longer processed monotonically, so you cannot really keep track of where you stopped and start again from that point in time.
E.g.: for package A and package B, if I receive A1, B1, A2 and for whatever reason I process B1 first, I cannot say "the process is at B1". If the process crashes during this time, I need to be able to say "I processed B1 but not A1 and A2".

To do this, I don't see any other solution, to my knowledge, than to have a pub/sub, Rabbit, or a DB with a persistent disk.


It's a bit awful how simple it seems at first, before realising that it actually requires more complexity to process things correctly.

@MartinKolarik (Collaborator, Author)

E.g.: for package A and package B, if I receive A1, B1, A2 and for whatever reason I process B1 first, I cannot say "the process is at B1". If the process crashes during this time, I need to be able to say "I processed B1 but not A1 and A2".

I may be missing something, but I'm not sure how the concurrency is relevant here: the inserts/updates should be idempotent, so I would just ignore the fact that B1 was processed and say "I'm at whatever was before A1", which makes this part the same as for serial processing.

It's a bit awful how simple it seems at first and then realising that it actually requires more complexity to be able to process things correctly.

I'm not surprised at all 😆

@bodinsamuel (Contributor) commented May 17, 2022

should be idempotent so I would just ignore the fact that B1 was processed and say "I'm at whatever was before A1"

Yeah, my explanation was incomplete; let me try again. With this sequence of updates:

sequence 1: A1 (update A)
sequence 2: B1 (create B)
sequence 3: A2 (update A)
sequence 4: B2 (delete B)

We start at sequence 0; we process B1 and B2 correctly, but A1 is blocking for whatever reason. The process needs to store its progress through the sequence somewhere, to be able to start where it left off in case of a crash or restart.

  • If I store sequence 4, I skipped 2 updates
  • If I store sequence 0, I redo 4 updates

As you mention, updates are now "idempotent" (idempotent in the sense that one update will always produce the same output, but not truly idempotent, because replaying an older update will produce an older state). So it's fine to reprocess, until you realise that if you had already processed sequence 4, you deleted package B. So by reprocessing all the sequences, you end up actually recreating package B (or, in another scenario, you virtually roll back to a previous version).

Note:

  • it's more than 4 sequences; it could be hundreds or thousands, so it could impact a lot of packages
  • we could say we don't care and that it's an acceptable downside
  • we could add one more check that would get the last update date for a package and skip all updates where sequence.updateAt < current.updateAt, but it needs to be stored somewhere, and Algolia can't be relied upon for this because search is only eventually consistent

That's why I would favor having a DB or a queue, to store all updates and redo only the ones that were not processed (a rough sketch of what I mean is below).
I hope the explanation is a bit clearer :p
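
A very rough sketch of that approach, with made-up table and function names:

```ts
// Sketch of a persistent job log: every change from the feed is written down
// first, then marked as done once processed; on restart, only the jobs that
// were never marked done are replayed. All names here are hypothetical.
interface ChangeJob {
  seq: number;                 // registry sequence id
  pkgName: string;
  type: 'publish' | 'delete';
  processedAt: Date | null;
}

interface JobStore {
  insert(job: ChangeJob): Promise<void>;
  markProcessed(seq: number): Promise<void>;
  listUnprocessed(): Promise<ChangeJob[]>;
}

declare function processPackage(job: ChangeJob): Promise<void>; // hypothetical worker

async function onChange(store: JobStore, job: ChangeJob): Promise<void> {
  await store.insert(job);            // persist before doing any work
  await processPackage(job);
  await store.markProcessed(job.seq);
}

async function recoverAfterCrash(store: JobStore): Promise<void> {
  // Replay only what was written down but never finished.
  for (const job of await store.listUnprocessed()) {
    await processPackage(job);
    await store.markProcessed(job.seq);
  }
}
```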

@MartinKolarik (Collaborator, Author)

So by reprocessing all the sequences, you end up actually recreating package B (or, in another scenario, you virtually roll back to a previous version).

This seems acceptable, because the window for this error is only as big as the concurrency level, so very soon you delete B again. You could even say that from an outside observer's perspective, you wouldn't have gotten to seq 4 before the crash in the first place with serial processing, so the final correct state may still be achieved faster even if it's B1 -> B2 -> B1 -> B2 instead of B1 -> B2.

Anyway, an external queue definitely makes sense here; I'm just saying that if an easier solution were more desirable, it could probably still work.

@MartinKolarik (Collaborator, Author)

Changed to popularAlternativeNames; alternativeNames is kept as it was.

@MartinKolarik changed the title from "feat(ranking): remove alternative names from non-popular packages (#941)" to "feat(ranking): add popularAlternativeNames, detect security packages (#941)" on May 18, 2022
@MartinKolarik (Collaborator, Author)

Note that I focused on the ranking aspect here and didn't add any skipping logic for the security packages - can add that too if you want.

Added now.

src/saveDocs.ts (two review threads, outdated, resolved)
@MartinKolarik (Collaborator, Author)

Then I think it's all done here.

src/config.ts (review thread outdated, resolved)
@Haroenv (Collaborator) commented May 23, 2022

@bodinsamuel is on holiday now, and apparently I don't have access to merge when the CI can't run; will get back to you on May 30th :)

@bodinsamuel (Contributor)

Well, I don't have admin rights either, and we do have the "run fork PR" option in CircleCI, so it should run 🤔

@Haroenv (Collaborator) commented May 30, 2022

merged in #975

@Haroenv closed this on May 30, 2022