
Add crates.io enrichment option for rust audit binary, json schema and spdx license updates. #3554

Open · wants to merge 14 commits into base: main
Conversation

@jimmystewpot commented Dec 31, 2024

Description

This pull request adds support for remotely enriching Rust auditable binaries using crates.io. When enabled, it adds the license, supplier, originator, description, and other fields to the manifest.

This information is unavailable in the Cargo lock file and the binary itself; if approved, I will add this capability to the other Rust cataloger.

Type of change

  • New feature (non-breaking change which adds functionality)

I've also updated the SPDX license list, as that was failing the make test, and updated the JSON schema version to support the new crates-enriched metadata. There are still some missing unit tests, specifically the mocks for the crates.io lookup and caching functionality. I wanted to submit a PR early, seek guidance, and ensure this would benefit the community before investing more time in standardising it across the Rust catalogers.

The new feature adds a rust key to the configuration that allows the feature to be turned on or off and some settings to be tuned for site-specific needs.
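Based on the option names visible in the CatalogerConfig struct later in this review, the new rust section might look roughly like this (the exact key set is an assumption; the CratesBaseURL field implies an additional base-URL key whose config name isn't shown in this excerpt):

```yaml
rust:
  use-crates-enrichment: true       # turn remote crates.io enrichment on/off
  insecure-skip-tls-verify: false
  proxy: ""                         # optional; see the review discussion about http_proxy
```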

Checklist:

  • I have added unit tests that cover changed behaviour
  • I have tested my code in common scenarios and confirmed there are no regressions
  • I have added comments to my code, particularly in hard-to-understand sections

@github-actions bot added the json-schema (Changes the json schema) label on Dec 31, 2024
@kzantow (Contributor) left a comment:

Hi @jimmystewpot, thanks for this enhancement; overall this looks great, and it's much appreciated that you've followed conventions really well. I left a few specific comments, but the biggest takeaways, I think, are:

  • "duplicate" the configuration struct (but it won't be completely duplicated -- for the multilevel configuration, you'll use *bool whereas the rust.CatalogerConfig would have a bool, for example)
  • we probably don't want to choose between one metadata type or the other, but rather add a way to keep both (though the suggestions I have are only suggestions and I'd like to run these by the team when we start to introduce new patterns for things). maybe the best thing is just to add the fields to the existing structs and not worry about having to support multiple metadata types just yet
  • you'll need to Sign-off your commit(s) see contributing.md

cmd/syft/internal/options/rust.go (outdated, resolved)

```diff
 // NewCargoLockCataloger returns a new Rust Cargo lock file cataloger object.
-func NewCargoLockCataloger() pkg.Cataloger {
+func NewCargoLockCataloger(opts CatalogerConfig) pkg.Cataloger {
 	return generic.NewCataloger("rust-cargo-lock-cataloger").
```
Contributor:
it looks like this opts is not used?

Author:

That is correct, and I wanted to get this pull request up for review before I invested the time in making it available to both catalogers. Once the details are refined, do you want me to add them both to the same pull request, or can I submit a second pull request with the second cataloger?

```go
type CatalogerConfig struct {
	InsecureSkipTLSVerify bool   `yaml:"insecure-skip-tls-verify" json:"insecure-skip-tls-verify" mapstructure:"insecure-skip-tls-verify"`
	UseCratesEnrichment   bool   `json:"use-crates-enrichment" yaml:"use-crates-enrichment" mapstructure:"use-crates-enrichment"`
	Proxy                 string `yaml:"proxy,omitempty" json:"proxy,omitempty" mapstructure:"proxy"`
```
Contributor:
Is this equivalent to the standard http_proxy environment variable that Go already honors? I don't think we would need a special config for this; rather, just advise users to use the environment variable so that it's applied to all HTTP calls instead of needing to configure each individually. If there's really some reason we need configuration beyond the environment variable, we should figure out how to set this globally for all HTTP requests.

Author:

I've touched on this in the larger comment regarding a single HTTP helper or keeping it package/cataloger local.

syft/pkg/cataloger/rust/cataloger.go (outdated, resolved)
syft/pkg/cataloger/rust/cataloger.go (outdated, resolved)
```go
func newCratesResolver(name string, opts CatalogerConfig) *rustCratesResolver {
	base, err := url.Parse(opts.CratesBaseURL)
	if err != nil {
		panic(err)
```
Contributor:

We should never panic but instead use error returns

Author:

I couldn't find any examples of the cataloger configuration handling errors. Is there an example that I could use to better handle this?
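One common shape, sketched here with stand-in types (the real CatalogerConfig and resolver live in the PR; these stubs only mirror the fields shown in the excerpt above), is to return the error and let the caller decide:

```go
package main

import (
	"fmt"
	"net/url"
)

// Stand-ins for the types in the PR; names are illustrative.
type CatalogerConfig struct {
	CratesBaseURL string
}

type rustCratesResolver struct {
	name string
	base *url.URL
}

// newCratesResolver returns an error instead of panicking on a bad base URL.
func newCratesResolver(name string, opts CatalogerConfig) (*rustCratesResolver, error) {
	base, err := url.Parse(opts.CratesBaseURL)
	if err != nil {
		return nil, fmt.Errorf("invalid crates base URL %q: %w", opts.CratesBaseURL, err)
	}
	return &rustCratesResolver{name: name, base: base}, nil
}

func main() {
	r, err := newCratesResolver("crates", CatalogerConfig{CratesBaseURL: "https://crates.io"})
	if err != nil {
		panic(err)
	}
	fmt.Println(r.base.Host) // prints "crates.io"
}
```

Whether the cataloger constructor can propagate that error depends on syft's constructor signatures, which is presumably the guidance being asked for here.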

Comment on lines 55 to 79
```go
if r := recover(); r != nil {
	fmt.Fprintf(os.Stderr, "recovered from panic while resolving license at: \n%s", string(debug.Stack()))
}
```
Contributor:

there shouldn't be a need for panic recovery here -- what is panicking?

Author (@jimmystewpot, Jan 9, 2025):

If the crates.io service is unavailable, the HTTP request may panic. This allows processing to continue without the additional metadata from the remote service.

This might be legacy behaviour in Go; traditionally, a failed request might result in a panic, so I've been in the habit of putting a recovery in place. If this is not the case now, or it's handled elsewhere in the code, I can remove it.

syft/pkg/cataloger/rust/crates_resolver.go (outdated, resolved)
syft/pkg/cataloger/rust/parse_audit_binary.go (outdated, resolved)
```go
		p = newPackageFromAudit(&dep, location.WithAnnotation(pkg.EvidenceAnnotationKey, pkg.PrimaryEvidenceAnnotation))
		continue
	}
	p = newPackageWithEnrichment(&dep, cratesEnrichment, location.WithAnnotation(pkg.EvidenceAnnotationKey, pkg.PrimaryEvidenceAnnotation))
```
Contributor:

This is probably the thing we need to discuss a bit as a team: when enrichment is enabled, no pkg.RustBinaryAuditEntry metadata is created; instead a richer, but different, metadata struct type is populated. I've long been a proponent of allowing multiple metadata types, but we don't really have a standard way of doing this yet. I don't think we should have less data when enriching, but we could end up in a situation where something checks for the pkg.RustBinaryAuditEntry type and doesn't find it in this case.

I've talked with @wagoodman about this, but I don't think we came to a concrete solution, although since metadata types are arbitrary we could easily add a []any or something similar, and maybe have a helper function to find and return metadata. I don't know if we need this yet, but it definitely looks like some of the fields are being read when outputting different formats from the new enriched data.

If it were me, given the restrictions we have today, I might add a helper function in the syft/pkg package, something like:

```go
func GetMetadata[T any](p *Package) *T {
	if t, ok := p.Metadata.(T); ok {
		return &t
	}
	if t, ok := p.Metadata.(*T); ok {
		return t
	}
	if metadatas, ok := p.Metadata.([]any); ok {
		for _, m := range metadatas {
			if t, ok := m.(T); ok {
				return &t
			}
			if t, ok := m.(*T); ok {
				return t
			}
		}
	}
	return nil
}
```

... or something of the sort. which would let us use it fairly simply where we need it, like:

```go
if m := pkg.GetMetadata[pkg.RustBinaryAuditEntry](p); m != nil {
	// do something with the metadata
}
```

... and we then could set metadata to []any{ RustBinaryAuditEntry{...}, RustCargoMetadata{...} }. And, though it's not directly applicable here, if we migrated usage of the metadata types to this function instead of the direct type assertions we have, we could then also support merging packages more completely without losing certain metadata, etc..

Sorry for the long-winded comment here, just noting this for discussion along with some background.

Author:

I opted for the new metadata type because I wasn't comfortable with how teams integrate syft as a library and if it would break those integrations. If teams expect a particular struct to be passed back and it now has new fields, depending on how that is handled, it could break.

I directly contradict myself by making the existing structs from the catalogers uniform/consistent, which would be a potentially breaking change. This could be a complete outlier. I'll consider your recommendations and see if I can develop a plan.

I'll add more to this in a comment on the main thread rather than in the code so that it's easier to follow.

@wagoodman wagoodman added the ecosystem:rust relating to the rust ecosystem label Jan 9, 2025
@wagoodman changed the title from "[rust] add crates.io enrichment option for rust audit binary, json schema and spdx license updates." to "Add crates.io enrichment option for rust audit binary, json schema and spdx license updates." on Jan 9, 2025
@jimmystewpot jimmystewpot force-pushed the rust-license-checking branch 2 times, most recently from 09e9ea0 to b5b08ad Compare January 9, 2025 20:48
@jimmystewpot (Author) commented Jan 9, 2025

@kzantow Thanks for your prompt response; I was on leave for a week and didn't have time to get back to you with updates until today.

Some points are better addressed in a response here on the main thread rather than in the code review. Several other package catalogers also have enrichment opportunities from remote sources, a pattern I have added in this pull request for early feedback. Once we arrive at something everyone is happy with, I'll continue contributing to them. The following are responses and comments on your feedback on this pull request (and thank you for being so detailed; I appreciate it, and I know it takes time and is often a thankless task).

Some of the challenges with this type of feature addition are that it's hard to know how different teams/companies/integrations use syft, and anticipating them is difficult for me. The use case I am working to solve is external enrichment for a mono repo containing five primary languages/packaging types. This will introduce the need to be a good internet citizen; caching is part of that; the other part is having a client rate limiter so that the local mirrors or external sites will not be flooded with requests. This introduces a few questions that arise from your comments, and it was better to bring them up in a more direct conversation.

Should the HTTP client be an internal helper library that can be reused between different catalogers? I think so. However, I want to work within the architecture you and your team desire. The benefits of this approach mean that if teams have internal artifactory mirrors (as an example), the rate limiter can be reused across the different package catalogers. It also means common proxies, user agents, and other configurations can be set globally. Adding a user-agent allows more traceability in who or what requests the remote service, aiding fast diagnostics. I've removed the user-agent from this pull request. However, having a standard or settable pattern is a worthy goal.

Caching in this scenario becomes more interesting if the remote service used for enrichment supports more than a single package type. Then, a single cache for all package types should be possible. However, many organisations wouldn't mirror all their packages and rely on remote third-party sites, usually independently. In this scenario, would you imagine having a single cache or one per remote site/package type?

Are there any other areas that align with adding remote enrichment that I should consider?

@jimmystewpot jimmystewpot force-pushed the rust-license-checking branch from dc11cd4 to b3aad03 Compare January 9, 2025 22:36
Labels
ecosystem:rust relating to the rust ecosystem json-schema Changes the json schema