Use xml catalog for loading schemas #18

Open
rlskoeser opened this issue Dec 18, 2015 · 1 comment

@rlskoeser
Contributor

We should update the schema loading in eulxml so that it no longer depends on external resources that may be unavailable at times.

lxml supports xml catalogs via libxml2; see http://lxml.de/resolvers.html and the instructions it references for setting up an xml catalog: http://xmlsoft.org/catalog.html

I've already tested this with eulxml in proof-of-concept spike code, and it works great. Here's what I suggest we do:

  • remove this custom url resolver (it bypasses the resolver that loads from the catalog): https://github.com/emory-libraries/eulxml/blob/master/eulxml/xmlmap/core.py#L573-L585
  • remove code/documentation/warnings about needing http proxies for loading schemas (this includes some test code, e.g. validation tests that are currently skipped if schemas are not loaded)
  • add new code to generate an xml catalog with local copies of, and references for, all schemas referenced in eulxml predefined models. Ideally this should also cover any schemas imported or included by the main schemas, although those may be hard to track down; maybe we can automate pulling in related schemas, or otherwise add the ones we missed as we discover them
  • when eulxml is installed, generate and save a new xml catalog file with fresh copies of all the referenced schemas (this would probably mean some custom logic in the setup.py)
  • add a little bit of logic so that the path to the generated xml catalog gets added to the XML_CATALOG_FILES environment variable whenever eulxml is loaded
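
The catalog-generation and environment-variable steps above could be sketched roughly as follows. This is only an illustration, not the spike code: `build_catalog` and `register_catalog` are hypothetical names, downloading the schemas is left out, and it assumes libxml2 reads `XML_CATALOG_FILES` as a space-separated list, so the variable has to be set before libxml2 first initializes its catalogs (setting it before importing lxml is the safe option).

```python
import os
import xml.etree.ElementTree as ET
from pathlib import Path

CATALOG_NS = "urn:oasis:names:tc:entity:xmlns:xml:catalog"


def build_catalog(schema_files, catalog_path):
    """Write an xml catalog mapping each schema URL to a local copy.

    schema_files: dict of {schema URL: path to the local copy}
    (hypothetical input; fetching the copies is not shown here).
    """
    ET.register_namespace("", CATALOG_NS)
    root = ET.Element("{%s}catalog" % CATALOG_NS)
    for url, local_path in sorted(schema_files.items()):
        ET.SubElement(root, "{%s}uri" % CATALOG_NS, {
            "name": url,
            # file:// URI for the local copy; spaces get percent-encoded
            "uri": Path(local_path).resolve().as_uri(),
        })
    ET.ElementTree(root).write(str(catalog_path), encoding="utf-8",
                               xml_declaration=True)
    return catalog_path


def register_catalog(catalog_path, environ=None):
    """Append the catalog to XML_CATALOG_FILES if it is not already listed."""
    if environ is None:
        environ = os.environ
    uri = Path(catalog_path).resolve().as_uri()
    catalogs = environ.get("XML_CATALOG_FILES", "").split()
    if uri not in catalogs:
        catalogs.append(uri)
    environ["XML_CATALOG_FILES"] = " ".join(catalogs)
    return environ["XML_CATALOG_FILES"]
```

Appending rather than overwriting keeps any catalogs the user has already configured. Note that the sketch omits the DOCTYPE from the generated catalog; I have not checked whether that matters to libxml2.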

As a way of testing that the resolver is working properly, you can modify the local schema files, and then load them through eulxml and confirm that your modification is present. I suppose you might also be able to test by validating local xml without network connectivity.
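
Before testing the resolver by hand, a quick offline sanity check might help: confirm that every entry in the catalog points at a local file that actually exists. The sketch below uses only the standard library and never touches the network; `missing_catalog_targets` is a hypothetical helper, and it only inspects `uri` entries like the ones in my sample catalog. It does not prove that libxml2 is using the catalog, so the modify-and-reload test is still worth doing.

```python
import os
import xml.etree.ElementTree as ET
from urllib.parse import urlparse
from urllib.request import url2pathname

CATALOG_URI_TAG = "{urn:oasis:names:tc:entity:xmlns:xml:catalog}uri"


def missing_catalog_targets(catalog_path):
    """Return (name, uri) pairs whose target is not an existing local file."""
    missing = []
    for entry in ET.parse(catalog_path).getroot().iter(CATALOG_URI_TAG):
        name, uri = entry.get("name"), entry.get("uri", "")
        target = urlparse(uri)
        # Anything that is not a file: URI would still need the network.
        if target.scheme != "file":
            missing.append((name, uri))
            continue
        if not os.path.isfile(url2pathname(target.path)):
            missing.append((name, uri))
    return missing
```

An empty list means every catalogued schema has a local copy on disk.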

It would probably also be a good idea to automatically add a comment to the copies of the schemas that we download and save when we generate the catalog, e.g. "downloaded by eulxml on [date]".
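
A minimal sketch of that stamping step, assuming the downloaded schema is available as a decoded string without a byte-order mark (`stamp_schema` is a hypothetical name). The comment has to go after the XML declaration when there is one, since the declaration must stay at the very start of the file.

```python
import datetime
import re

# An XML declaration is only valid at the very start of the document.
XML_DECL = re.compile(r"^<\?xml\b[^>]*\?>")


def stamp_schema(xml_text, today=None):
    """Return xml_text with a 'downloaded by eulxml on <date>' comment added."""
    today = today or datetime.date.today()
    stamp = "<!-- downloaded by eulxml on %s -->" % today.isoformat()
    match = XML_DECL.match(xml_text)
    if match:
        # Keep the declaration first and put the comment on the next line.
        return xml_text[:match.end()] + "\n" + stamp + xml_text[match.end():]
    return stamp + "\n" + xml_text
```

The same comment doubles as the "modification" for the resolver test: if the stamp shows up in a schema loaded through eulxml, the local copy was used.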

Here's a sample xml catalog file that I created in my testing, in case it's useful:

<?xml version="1.0"?>
<!DOCTYPE catalog PUBLIC "-//OASIS//DTD Entity Resolution XML Catalog V1.0//EN" "http://www.oasis-open.org/committees/entity/release/1.0/catalog.dtd">
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
<uri
  name="http://www.loc.gov/standards/mods/v3/mods-3-4.xsd"
  uri="file:///tmp/mods-3-4.xsd.xml" />
<uri
  name="http://www.loc.gov/standards/mods/mods.xsd"
  uri="file:///tmp/mods-3-4.xsd.xml" />
<uri
  name="http://www.loc.gov/standards/xlink/xlink.xsd"
  uri="file:///tmp/xlink.xsd" />
</catalog>

alexBLR self-assigned this Apr 4, 2016

@alexBLR
Contributor

alexBLR commented Apr 4, 2016

We opened a new feature branch to address this issue: https://github.com/emory-libraries/eulxml/tree/feature/loading_schemas
