WARC-Conversion-Software and WARC-Conversion-Command fields #52
Comments
Thanks for starting this!
For
I guess I'd lean towards 1) as it seems more flexible overall, but 2) also makes sense, especially if most of the conversions are a single command. But then the question arises of how extensive the command can be. What if a conversion requires a custom command-line script or multiple programs? To what degree should the conversion process be specified if it requires more than one command,
Unfortunately the format of the warcinfo software field is not specified, except by the example 'heritrix/1.12.0'. That example might be intended to indicate a format similar to User-Agent or it might not. Or perhaps that's what you meant, that the structure should be entirely unspecified? It could well be I'm overthinking it, but I think it's helpful to specify some limited structure to enable applications like searching for records that a particular tool was involved in creating. I can see us, for example, tokenising it into a Solr field we can use to pull back a list of everything created with libvorbis (video or audio). It does mean more effort for writers though, and one might need to put some effort into mapping "Microsoft Excel 2017" into a suitable identifier.
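To make the search use case concrete, here is a minimal Python sketch of the tokenising step. It assumes the field value follows the User-Agent-like shape of product identifiers and parenthesised comments; the record IDs, field values, and version numbers are made up for illustration, and nested comments are not handled.

```python
import re

# RFC 7231 "token" characters, used for product names and versions.
_TOKEN = r"[!#$%&'*+\-.^_`|~0-9A-Za-z]+"
_PRODUCT = re.compile(rf"({_TOKEN})(?:/({_TOKEN}))?")
_COMMENT = re.compile(r"\([^()]*\)")  # simplification: no nested comments


def product_names(value: str) -> set[str]:
    """Return the lowercase product names found in a User-Agent-like value."""
    without_comments = _COMMENT.sub(" ", value)
    return {m.group(1).lower() for m in _PRODUCT.finditer(without_comments)}


# Hypothetical field values, invented for this example.
records = {
    "rec-1": "ffmpeg/4.1 libvorbis/1.3.6",
    "rec-2": "imagemagick/7.0.8 (Q16) libjpeg/9c",
}
matches = [rid for rid, value in records.items()
           if "libvorbis" in product_names(value)]
print(matches)  # ['rec-1']
```

An indexer could store the resulting names as a multi-valued field, which is roughly what the Solr idea above would need.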
This is a fair criticism and I had the same concern myself, but I wasn't entirely comfortable with any of the solutions I could think of. That might well be a sign I need to revisit the premise of what I was trying to do. While JSON and CLI options are two ways conversion software could be configured, there could be more. On the other hand, one of the biggest complaints I have about WARC (and many other library standards) is that they under-specify things and don't give enough guidance. I'd frankly rather not have a field standardised at all than have it specified so flexibly it's effectively useless. So maybe I'm even being overly hasty in wanting to generalize 'command' to 'options' in the first place, and we should start with just WARC-Conversion-Command and worry about more general options once we actually have some more concrete examples.
I agree. I don't think full reproducibility should be a goal for these headers. I think one or two commands or a one-liner pipeline is a helpful indicator but if it gets to the point of a 20+ line script containing a bunch of branches then that should be stored elsewhere and just referenced rather than included in full in every record header. That's what I was trying to get at with this paragraph:
I think it should be a short set of options, one or two lines at most. I would frown upon the inclusion of several kilobytes of XML or JSON or a base64 encoded ICC profile. Some institutions may well want that level of detail but I think that's well beyond the scope of this proposal. I think we should keep this focused on diagnostics and being able to find records that were converted in a particular way. Chances are the institutions that do want that level of detail have some sort of specialised digital preservation system for recording it anyway.
Yes, that's what I meant, that the structure should not be specified. Unfortunately, ffmpeg does not print things out in a clean format like that. Instead,
Now, it'd be easy to encode this in a JSON string as JSON also provides a nice way to encode multiline strings. Converting it to a different format would take more effort. For practical reasons, you'd probably want to compare the output from
Agree that perhaps additional use cases can help inform if we need more options.
Ah right, I missed that paragraph somehow :) That all makes sense.
Though, I like the idea of one command-line field and one field for other miscellaneous info. Likely, there will be some conversion command that can be expressed as a single line, but there may also be additional metadata about the conversion, such as the version and perhaps other properties. With that in mind, perhaps it should be:
The But, the
This would make sense if the same conversion software/script is used for multiple conversions. A metadata record ffmpeg might then look as follows:
Of course, this is getting a bit more complicated. Are there other fields besides software version that would be important to store at this point?
I've updated the proposal replacing WARC-Conversion-Options with the more specific WARC-Conversion-Command field.
Yeah, that's one of several reasons why recording the raw output of
If someone did want to store Also note: I'm not proposing that every tool that writes conversion records has to record down to the detail of libopus. It's just what I'd like to do with our archive and if other people want to do that too it'd be good to have a standard format for recording it. In the case of a generic tool like warcit that doesn't know anything in particular about the external conversion command, it would probably be best to make it optional and let the user supply it if they want to.
While it would be commonly present, I don't think we should make it mandatory in case the conversion was created by something that's not a command-line tool or is not a simple case of one input, one output.
Trying to think of metadata people might want to record about a conversion:
I don't have an immediate need for any of that though. Although that does indicate that WARC-Conversion-Software may not be the best name for the version field. Maybe WARC-Converter-Version. That would leave open things like WARC-Converter-URI to reference an actual tarball of the conversion software.
When converting content in an archive it is useful for diagnostic purposes to record the versions of major software components used and important conversion options. Another common use case is to identify records that later need to be reconverted with newer software in order to improve conversion quality or fix records misconverted due to a bug or incorrect option.
Background
The options field proposed for standardisation below is based on the "command" JSON field that @ikreymer uses in warcit. We would also like some way of recording this information for the Australian Web Archive.
WARC-Conversion-Software field
The WARC-Conversion-Software field indicates the version of software components used in the
conversion of the record's content. The field value has the same format as an HTTP User-Agent field
(see RFC7231 section 5.5.3) and consists of
a list of one or more product identifiers and zero or more comments.
For example:
Multiple product identifiers may be used to indicate the version of important
subcomponents such as codec libraries used when encoding a video.
When product identifiers represent multiple steps in a processing pipeline they
should be listed in processing order and otherwise in decreasing order of
significance for identifying the software. For example a TIFF image decoded with
an unknown version of libtiff and then re-encoded with libjpeg version 9c
could be recorded as:
Software components unimportant to the conversion process, such as other codecs that
a video transcoder happens to support but did not use, should not be listed.
The WARC-Conversion-Software field may be used in ‘conversion’ type records and shall not be used
for other record types.
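As a rough illustration of the ordering and formatting rules above, the following Python sketch joins (name, version) pairs into a User-Agent-style value. The function name `format_software` is hypothetical, and writing a bare product name when the version is unknown is an assumption based on the RFC 7231 product grammar, where the version part is optional.

```python
from typing import Optional


def format_software(components: list[tuple[str, Optional[str]]]) -> str:
    """Join (name, version) pairs into a User-Agent-style product list.

    Components are expected in processing order; a version of None means
    the version is unknown, so only the bare product name is written.
    """
    parts = []
    for name, version in components:
        parts.append(f"{name}/{version}" if version else name)
    return " ".join(parts)


# The libtiff/libjpeg case described above: unknown libtiff, libjpeg 9c.
value = format_software([("libtiff", None), ("libjpeg", "9c")])
print(f"WARC-Conversion-Software: {value}")
# WARC-Conversion-Software: libtiff libjpeg/9c
```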
WARC-Conversion-Command field
The WARC-Conversion-Command field records command-line options used when converting the content.
When the conversion software is configured through command-line options a full
command-line should be included with the tokens {input} and {output}
representing the input and output file respectively.
A conversion involving multiple steps may be indicated using a shell pipeline
or multiple sequential commands separated by semi-colons:
The WARC-Conversion-Command field may be used in ‘conversion’ type records and shall not be used
for other record types.
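A minimal Python sketch of how a writer might use the {input} and {output} tokens: the template with the tokens left in place goes into the header, and the tokens are substituted only when actually running the conversion. The helper `expand_command`, the file names, and the particular ffmpeg invocation are assumptions chosen for illustration, not part of the proposal.

```python
import shlex


def expand_command(template: str, input_path: str, output_path: str) -> str:
    """Substitute the {input} and {output} tokens with shell-quoted paths."""
    return (template
            .replace("{input}", shlex.quote(input_path))
            .replace("{output}", shlex.quote(output_path)))


# Hypothetical template; the ffmpeg options are only an illustration.
template = "ffmpeg -i {input} -c:v libx264 {output}"

# The header records the template, not the machine-specific paths.
header_line = f"WARC-Conversion-Command: {template}"

# The expanded form is what would actually be executed.
command = expand_command(template, "in file.avi", "out.mp4")
print(header_line)
print(command)  # ffmpeg -i 'in file.avi' -c:v libx264 out.mp4
```

Keeping the tokens in the recorded value means records converted the same way carry identical header values, which suits the goal of finding records that were converted in a particular way.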
Alternative approaches
Separate metadata records
An obvious candidate would be a separate metadata record in some other specific-purpose format like PREMIS XML. For the Australian Web Archive we'd like to use these conversion fields not so much for highly detailed provenance records or the ability to exactly reproduce a conversion, but rather as a human-readable diagnostic and to quickly locate records that were converted in a particular way. While using a separate metadata record allows for more detail and flexibility, it makes this identification-at-a-glance use case much harder.
WARC-JSON-Metadata
@ikreymer also suggested the command field could be included as a property on a JSON object stored in the WARC-JSON-Metadata field proposed in #27. If we were to go back in time and redesign WARC from scratch, I think making WARC headers JSON-based rather than HTTP/1-based could be quite compelling. I would argue though that unless there's a serious show-stopper, new standard fields should work within the existing framework and be added as new top-level header fields. I think it's fine for individual tools to use something like WARC-JSON-Metadata to store implementation-specific data in their own native format though.