
feat(ranking): add popularAlternativeNames, detect security packages (#941) #965

Closed
wants to merge 11 commits

Conversation

MartinKolarik (Collaborator)

  1. Adds a new isSecurityHeld flag, as discussed in Ignore some packages ? #657. I went with a property at the same level as isDeprecated (and not in _searchInternal), since it may be interesting for end users too (a rough illustration of how such a flag could be derived follows this list).
  2. The property is then used for Improve ranking #941, along with some other checks. Popular packages are excluded from the checks, since I think those should get the alternative names even if they are deprecated (this applies to e.g. request).
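
A minimal sketch of the kind of heuristic that could back such a flag (the npm security-holder conventions used here are assumptions, and none of this is necessarily what the PR implements):

```ts
// Hypothetical illustration only, not the PR's actual code.
interface RawPkg {
  name: string;
  version: string;
  description?: string;
}

function isSecurityHeld(pkg: RawPkg): boolean {
  // npm's security placeholder packages are typically published with a
  // "-security" version suffix and a "security holding package" description;
  // treating that as the detection signal is an assumption here.
  return (
    pkg.version.endsWith('-security') ||
    /security holding package/i.test(pkg.description ?? '')
  );
}
```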

@MartinKolarik (Collaborator, Author)

Note that I focused on the ranking aspect here and didn't add any skipping logic for the security packages - can add that too if you want.

@Haroenv (Collaborator) left a comment

interesting approach!

src/config.ts (review thread resolved)
src/saveDocs.ts (review thread outdated, resolved)
@Haroenv (Collaborator) commented May 17, 2022

I'd like to see what impact that has on the index, but I'm not sure how to test it without doing a complete reindex. Do you have any idea?

@Haroenv (Collaborator) commented May 17, 2022

Maybe what we could do, instead of removing alternative names completely, is add a new key called popularAlternativeNames in _searchInternal, add the query rule to that instead of alternativeNames, and make sure the original name is still in popularAlternativeNames (maybe?). That way, after a reindex, we can compare the quality between popularAlternativeNames and alternativeNames.
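
Roughly, the record shape being proposed might look like this (a sketch based on the comment only; the field placement inside _searchInternal is assumed, not the final schema):

```ts
// Sketch of the proposed record shape; not the actual index schema.
interface PkgRecord {
  name: string;
  isDeprecated: boolean;
  isSecurityHeld: boolean;
  _searchInternal: {
    // kept untouched so existing search behaviour does not change
    alternativeNames: string[];
    // new: only populated for popular, non-deprecated, non-security-held
    // packages, and always including the original name, so the two lists
    // can be compared after a reindex
    popularAlternativeNames: string[];
  };
}
```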

@MartinKolarik (Collaborator, Author)

Maybe what we could do, instead of removing alternative names completely, is add a new key called popularAlternativeNames in _searchInternal, add the query rule to that instead of alternativeNames, and make sure the original name is still in popularAlternativeNames (maybe?). That way, after a reindex, we can compare the quality between popularAlternativeNames and alternativeNames.

Agree, that's a good idea.

Co-authored-by: Haroen Viaene <[email protected]>
@bodinsamuel (Contributor) left a comment

LGTM
This will require a full reindex, unfortunately. I need to finish the work on incremental full reindex that was started in #819.

src/@types/pkg.ts (review thread resolved)
@MartinKolarik (Collaborator, Author)

@bodinsamuel unrelated, but do you know what's currently the bottleneck in a full reindex? E.g., would it help to run it on a larger VM with higher concurrency?

@MartinKolarik (Collaborator, Author)

Also, in issues like #930, I was wondering if throwing more resources at it would help, as that's something we might be able to do.

@bodinsamuel (Contributor) commented May 17, 2022

@MartinKolarik currently the limitations are not physical:

Registry:

  • Rate limit (429) from the NPM registry
    We should get in touch with them to see what can be done.
  • Registry is slow
    It has happened a lot lately that requests could take more than 20 seconds.

Design:

  • We can barely parallelise work

The way the system was designed (before I even joined Algolia) was simple and effective: receive a job, process it. The volume was lower and expectations were also lower.
Right now the volume is much bigger and we added more dependencies to each job (jsDelivr, types, npm hits, etc.).
To counter this, there was an attempt to parallelise updates. However, there is a big catch: updates are ordered, so you can receive a package publish and then a package delete event; if you process them out of order, it will result in a corrupted index.
What could be done is parallelising per package, which is harder to achieve but doable.


To help with all of this, I would love to:

  • Host a read replica of the registry. I could not find a proper tutorial for it, but it should be doable to spawn a CouchDB in replica mode, since that is the goal of the database itself. That would allow us to never query the NPM registry again and speed up all requests by a huge factor (a rough sketch of the idea follows this list).
  • Host a queue or a DB, or whatever could keep track of pending jobs and jobs that have failed. Right now we either skip jobs or put them aside, and we can barely know which ones failed. I have recently started logging failed packages into a dedicated Algolia index, but Algolia is really not the best place for this.
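
For the read-replica idea, a minimal sketch of what starting continuous CouchDB replication could look like (the local CouchDB URL and credentials are placeholders, and the exact registry source endpoint would need to be confirmed):

```ts
// Sketch: register a continuous replication job in a self-hosted CouchDB's
// _replicator database. URLs and credentials below are placeholders.
const LOCAL_COUCH = 'http://admin:password@localhost:5984';

async function startRegistryReplication(): Promise<void> {
  const res = await fetch(`${LOCAL_COUCH}/_replicator/npm-registry`, {
    method: 'PUT',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      source: 'https://replicate.npmjs.com/registry', // assumed source endpoint
      target: `${LOCAL_COUCH}/npm-registry`,
      continuous: true,    // keep following the registry's changes feed
      create_target: true, // create the local database if it does not exist
    }),
  });
  if (!res.ok) {
    throw new Error(`Replication setup failed: ${res.status} ${await res.text()}`);
  }
}
```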

@MartinKolarik (Collaborator, Author) commented May 17, 2022

Rate limit (429) from NPM registry
We should get in touch with them to see what can be done

That's interesting, as our services query npm a lot too and we don't have this issue. Is that the registry or some other service (maybe the downloads API)? We previously had a similar problem with GitHub, and the limit was too low-level for them to do anything about it; we solved it by assigning multiple IPs to our VMs and distributing the outgoing requests across those (but that's probably not possible on a PaaS).

We can barely parallelise work

I checked the code and it indeed looks problematic, but I think there might be some not-too-hard solutions. Process N packages at once, and track which are being processed. When a second event for the same package arrives, either put it in an intermediate queue (which has some small size limit) or simply pause until the previous event for the package completes (a rough sketch of this follows below).

Of course, if there are issues with rate limits, it's a question of how much this would help.
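
As a rough sketch of that idea, with made-up names and not tied to the actual code:

```ts
// Sketch: process up to `slots` packages concurrently, but keep events for
// the same package strictly ordered. All names here are illustrative.
class Semaphore {
  private waiters: Array<() => void> = [];
  private free: number;
  constructor(slots: number) { this.free = slots; }
  async acquire(): Promise<void> {
    if (this.free > 0) { this.free -= 1; return; }
    await new Promise<void>((resolve) => this.waiters.push(resolve));
  }
  release(): void {
    const next = this.waiters.shift();
    if (next) next(); else this.free += 1;
  }
}

const tails = new Map<string, Promise<void>>(); // last pending event per package
const slots = new Semaphore(20);                // arbitrary concurrency level

function enqueue(pkgName: string, handler: () => Promise<void>): Promise<void> {
  // Chain onto the previous event for this package so per-package order is kept.
  const previous = tails.get(pkgName) ?? Promise.resolve();
  const next = previous.then(async () => {
    await slots.acquire();
    try {
      await handler();
    } finally {
      slots.release();
    }
  });
  // Store an error-swallowing tail so one failure doesn't block later events;
  // cleanup of finished entries is omitted for brevity.
  tails.set(pkgName, next.catch(() => {}));
  return next;
}
```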

@MartinKolarik (Collaborator, Author)

Host a read replica of the registry. I could not find a proper tutorial for it, but it should be doable to spawn a CouchDB in replica mode, since that is the goal of the database itself. That would allow us to never query the NPM registry again and speed up all requests by a huge factor.

I considered this a couple of times for jsDelivr as well, but it seems like something that will turn out to be more complex than you expect and bring a whole new set of issues.

@bodinsamuel (Contributor)

I considered this a couple of times for jsDelivr as well, but it seems like something that will turn out to be more complex than you expect and bring a whole new set of issues.

Yes, probably. But since the follower pattern is built into the DB, I would expect it to at least be easy to try, though I could not reach that state.

That's interesting as our services query npm a lot too and we don't have this issue.

To be more precise, it's https://replicate.npmjs.com that causes us issues, not the downloads endpoint. So it's the feed we use to receive updates. I think it boils down to the fact that we query updates 1:1: we receive only the rev part, then with this rev id we query the full-fledged JSON with all the info.

Process N packages at once, and track which are being processed.

Yep, that would be the main solution. Where it gets complicated is that the sequence id is no longer processed monotonically, so you cannot really keep track of where you stopped and start again from that point in time.
E.g.: for package A and package B, if I receive A1, B1, A2 and for whatever reason I process B1 first, I cannot say "the process is at B1". If the process crashes during this time, I need to be able to say "I processed B1 but not A1 and A2".

To do this, I don't see any other solution, to my knowledge, than to have a pub/sub, Rabbit, or a DB with a persistent disk.


It's a bit awful how simple it seems at first, before realising that it actually requires more complexity to process things correctly.

@MartinKolarik (Collaborator, Author)

E.g.: for package A and package B, if I receive A1, B1, A2 and for whatever reason I process B1 first, I cannot say "the process is at B1". If the process crashes during this time, I need to be able to say "I processed B1 but not A1 and A2".

I may be missing something, but I'm not sure how the concurrency is relevant here: the inserts/updates should be idempotent, so I would just ignore the fact that B1 was processed and say "I'm at whatever was before A1", which makes this part the same as for serial processing.

It's a bit awful how simple it seems at first and then realising that it actually requires more complexity to be able to process things correctly.

I'm not surprised at all 😆

@bodinsamuel (Contributor) commented May 17, 2022

should be idempotent so I would just ignore the fact that B1 was processed and say "I'm at whatever was before A1"

Yeah, my explanation was incomplete; let me try again. With this sequence of updates:

sequence 1: A1 (update A)
sequence 2: B1 (create B)
sequence 3: A2 (update A)
sequence 4: B2 (delete B)

We start at sequence 0; we process B1 and B2 correctly, but A1 is blocking for whatever reason. The process needs to store its progress through the sequence somewhere, to be able to start where it left off in case of a crash or restart.

  • If I store sequence 4, I skipped 2 updates
  • If I store sequence 0, I redo 4 updates

As you mention, updates are now "idempotent" (idempotent in the sense that one update will always produce the same output, but not truly idempotent, because replaying an older update will produce an older state). So it's fine to reprocess, until you realise that if you had already processed sequence 4, you deleted package B. So by reprocessing all the sequences, you end up actually recreating package B (or, in another scenario, you virtually roll back to a previous version).

Note:

  • it's more than 4 sequences; it could be hundreds or thousands, so it could impact a lot of packages
  • we could say we don't care and that it's an acceptable downside
  • we could add one more check that would get the last update date for a package and skip all updates where sequence.updateAt < current.updateAt, but it needs to be stored somewhere, and Algolia can't be relied upon for this because search is only eventually consistent

That's why I would favor having a DB or a queue, to store all updates and redo only the ones that were not processed (a rough sketch of what I mean is below).
I hope the explanation is a bit clearer :p
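
A very rough sketch of that approach, with made-up table and function names:

```ts
// Sketch of a persistent job log: every change from the feed is written down
// first, then marked as done once processed; on restart, only the jobs that
// were never marked done are replayed. All names here are hypothetical.
interface ChangeJob {
  seq: number;                 // registry sequence id
  pkgName: string;
  type: 'publish' | 'delete';
  processedAt: Date | null;
}

interface JobStore {
  insert(job: ChangeJob): Promise<void>;
  markProcessed(seq: number): Promise<void>;
  listUnprocessed(): Promise<ChangeJob[]>;
}

declare function processPackage(job: ChangeJob): Promise<void>; // hypothetical worker

async function onChange(store: JobStore, job: ChangeJob): Promise<void> {
  await store.insert(job);            // persist before doing any work
  await processPackage(job);
  await store.markProcessed(job.seq);
}

async function recoverAfterCrash(store: JobStore): Promise<void> {
  // Replay only what was written down but never finished.
  for (const job of await store.listUnprocessed()) {
    await processPackage(job);
    await store.markProcessed(job.seq);
  }
}
```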

@MartinKolarik (Collaborator, Author)

So by reprocessing all the sequences, you end up actually recreating package B (or, in another scenario, you virtually roll back to a previous version).

This seems acceptable, because the window for this error is only as big as the concurrency level, so very soon you delete B again. You could even say that from an outside observer's perspective, you wouldn't have gotten to seq 4 before the crash in the first place with serial processing, so the final correct state may still be achieved faster even if it's B1 -> B2 -> B1 -> B2 instead of B1 -> B2.

Anyway, an external queue definitely makes sense here; I'm just saying that if an easier solution were more desirable, it could probably still work.

@MartinKolarik (Collaborator, Author)

Changed to popularAlternativeNames; alternativeNames is kept as it was.

@MartinKolarik changed the title from "feat(ranking): remove alternative names from non-popular packages (#941)" to "feat(ranking): add popularAlternativeNames, detect security packages (#941)" on May 18, 2022
@MartinKolarik (Collaborator, Author)

Note that I focused on the ranking aspect here and didn't add any skipping logic for the security packages - can add that too if you want.

Added now.

src/saveDocs.ts (two review threads, outdated, resolved)
@MartinKolarik (Collaborator, Author)

Then I think it's all done here.

src/config.ts (review thread outdated, resolved)
@Haroenv (Collaborator) commented May 23, 2022

@bodinsamuel is on holiday now, and apparently I don't have access to merge when the CI can't run; will get back to you on May 30th :)

@bodinsamuel (Contributor)

Well, I don't have admin rights either, and we do have the "run fork PR" option in CircleCI, so it should run 🤔

@Haroenv (Collaborator) commented May 30, 2022

merged in #975

@Haroenv closed this on May 30, 2022