Analytics: Extract Law metadata #13
Comments
Proposed changes to metadata.xml v1 (see the attached v2 sample for comment suggestions). Files were renamed to .txt to allow upload: country_codes.json.txt
@yash1802 we also need to determine the language of the document, but it's not stated in the metadata. The metadata does have an abstract, which always seems to be in the language of the document, so we can make a best guess from that. I have used both optimaize and langdetect before for detecting the language of text fragments; I remember optimaize being slightly better, but langdetect may be more suitable here: https://github.com/Mimino666/langdetect So you need to detect the language of the abstract and apply it to the document metadata, something like this (if the document is French): <language code="fra" /> There is a JSON file with language codes; we use the 'alpha-3b' syntax: https://raw.githubusercontent.com/gawati/gawati-portal-ui/dev/src/configs/languageCodes.json
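The abstract-based guess described in this comment could be sketched roughly as follows. This is only an illustration: `language_element`, `detect_with_langdetect` and the small `ALPHA2_TO_ALPHA3` table are hypothetical names, and the table is a hand-written stand-in for a lookup that would really be built from languageCodes.json (whose schema is not shown here). The detector is passed in as a parameter so that langdetect is only imported when it is actually used.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-letter -> three-letter lookup (assumption). In practice this
# would be built from languageCodes.json, whose exact schema is not shown here.
ALPHA2_TO_ALPHA3 = {"en": "eng", "pt": "por", "ar": "ara"}


def detect_with_langdetect(text):
    """Detect a language code using langdetect (pip install langdetect)."""
    from langdetect import DetectorFactory, detect

    DetectorFactory.seed = 0  # langdetect results can vary unless seeded
    return detect(text)


def language_element(abstract, detector=detect_with_langdetect,
                     lookup=ALPHA2_TO_ALPHA3):
    """Return a <language code="..." /> string, or None if no guess is possible."""
    if not abstract or not abstract.strip():
        return None
    try:
        alpha2 = detector(abstract)
    except Exception:
        # Detection failed (e.g. no usable text); leave the language unset.
        return None
    code = lookup.get(alpha2)
    if code is None:
        return None
    element = ET.Element("language", {"code": code})
    return ET.tostring(element, encoding="unicode")
```

Returning `None` instead of a default language keeps a failed guess visible, so those documents can be reviewed by hand.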
@yash1802 This should give you a list of African countries from the country codes JSON:
The URLs on the natlex (http://www.ilo.org/dyn/natlex/natlex4.byCountry?p_lang=en) page have the following structure:
and we have the alpha-3 code for each country in the country codes json. So you can iterate the |
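A minimal sketch of building the per-country URLs from alpha-3 codes, assuming every country page follows the same pattern as the Zimbabwe example in the issue description (`natlex4.countrySubjects?p_lang=en&p_country=ZWE`). The function names are hypothetical, and the list of codes is assumed to have been read from the country codes JSON already.

```python
from urllib.parse import urlencode

NATLEX_BASE = "http://www.ilo.org/dyn/natlex/"


def country_subjects_url(alpha3, lang="en"):
    """Build the subject-listing URL for one country (alpha-3 code)."""
    query = urlencode({"p_lang": lang, "p_country": alpha3.upper()})
    return f"{NATLEX_BASE}natlex4.countrySubjects?{query}"


def country_urls(alpha3_codes):
    """Map each alpha-3 code to its subject-listing URL."""
    return {code: country_subjects_url(code) for code in alpha3_codes}


if __name__ == "__main__":
    for code, url in country_urls(["ZWE"]).items():
        print(code, url)
```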
@yash1802 See this document for info on extracting the bibliography and a few other things: https://docs.google.com/document/d/1HsM7zGoulr3_dkxGQEwY9FONcrCfSJ3mtq7bKV4XUxw/edit?usp=sharing
A large amount of African legislation is available online on the ILO NATLEX portal. NATLEX is a UN agency website that has curated African legislation by subject and by keyword.
For the Gawati Project (https://www.gawati.org) we are trying to build a repository of African legislation that can be searched in one place but is curated from different sources. ILO NATLEX is one such source.
ILO NATLEX site: http://www.ilo.org/dyn/natlex/natlex4.home?p_lang=en
We are interested only in African countries, and these can be found under country profiles:
http://www.ilo.org/dyn/natlex/natlex4.byCountry?p_lang=en
For example, here is the page for Zimbabwe:
http://www.ilo.org/dyn/natlex/natlex4.countrySubjects?p_lang=en&p_country=ZWE
Each of the items here leads to a document citation:
E.g. clicking "General provisions":
http://www.ilo.org/dyn/natlex/natlex4.listResults?p_lang=en&p_country=ZWE&p_count=395&p_classification=01&p_classcount=87
If I click "Special Economic Zones Act":
http://www.ilo.org/dyn/natlex/natlex4.detail?p_lang=en&p_isn=104410&p_country=ZWE&p_count=395&p_classification=01&p_classcount=87
This refers to the country's “Special Economic Zones Act”; the link to the PDF is highlighted below in yellow.
There is important metadata here about the document:
Name, Country, Type, Official Date (Adopted On), ISN number, citation text, and the PDF itself.
We need to extract this information into the official Akoma Ntoso XML format used by Gawati.
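Before conversion, the metadata fields listed above could be held in a small record type. The sketch below is an assumption about how one might organise it, not a prescribed format: `NatlexRecord` and `ids_from_detail_url` are hypothetical names. The helper reads the ISN and country straight from a detail-page URL such as the Special Economic Zones Act one above.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import parse_qs, urlparse


@dataclass
class NatlexRecord:
    """One NATLEX citation, mirroring the fields listed in the issue."""
    name: str
    country: str          # alpha-3 code, e.g. "ZWE"
    doc_type: str
    adopted_on: str       # official date, kept as text until it is normalised
    isn: str
    citation: str
    pdf_url: Optional[str] = None


def ids_from_detail_url(url):
    """Return (isn, country) from a natlex4.detail URL's query string."""
    params = parse_qs(urlparse(url).query)
    return params["p_isn"][0], params["p_country"][0]
```

The ISN is unique per citation on the detail pages, so it is a convenient key for matching metadata to the downloaded PDF.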
Steps to Take
1 Akoma Ntoso XML
Here are sample documents in Akoma Ntoso XML format: (a) has the XML documents, and (b) has the PDF documents described by (a).
https://github.com/gawati/gawati-data-xml/releases/download/1.2/akn_xml_sample-1.2.zip
https://github.com/gawati/gawati-data-xml/releases/download/1.2/akn_pdf_sample-1.2.zip
2 Downloading Source Documents
For the ILO site, you need to first gather raw data.
So, for each African country (let's start with one country, Zimbabwe), download the source metadata (as shown earlier) and the associated PDF document.
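A rough sketch of the download step, using only the standard library. The folder layout (`<root>/<country>/<isn>/`) and the file names are assumptions made for illustration, and `target_paths` and `download` are hypothetical helpers. The `opener` parameter exists so the function can be exercised without network access.

```python
import shutil
from pathlib import Path
from urllib.request import urlopen


def target_paths(root, country, isn):
    """Return (metadata_path, pdf_path) for one citation (assumed layout)."""
    folder = Path(root) / country / isn
    return folder / "metadata.html", folder / "document.pdf"


def download(url, dest, opener=urlopen):
    """Stream the resource at `url` into `dest`, creating folders as needed."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    with opener(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest
```

Keeping the raw metadata page next to the PDF means the processing step can be rerun later without fetching anything from the site again.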
3 Processing Downloaded Data
The next step is to process the downloaded raw metadata and convert it to Akoma Ntoso format. The PDF file need not be converted, but it needs to be associated with the corresponding Akoma Ntoso document, as shown in “1 Akoma Ntoso XML” above.
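As a very rough illustration of the conversion, the sketch below writes a few metadata values into a skeletal Akoma Ntoso document and points at the PDF by file name. The element layout is an assumption and is not schema-validated: in particular, storing the ISN in `FRBRnumber`, the `#natlex` source and the `componentRef` link to the PDF are guesses. The sample files linked in step 1 are the authority on what Gawati actually expects.

```python
import xml.etree.ElementTree as ET

# Akoma Ntoso 3.0 namespace; check the sample files for the one Gawati uses.
AKN_NS = "http://docs.oasis-open.org/legaldocml/ns/akn/3.0"
ET.register_namespace("an", AKN_NS)


def _q(tag):
    """Qualify a tag name with the Akoma Ntoso namespace."""
    return f"{{{AKN_NS}}}{tag}"


def build_akn_stub(country, adopted_on, isn, language, pdf_name):
    """Return a skeletal AKN document as a string (layout is an assumption)."""
    root = ET.Element(_q("akomaNtoso"))
    act = ET.SubElement(root, _q("act"), {"name": "act"})
    meta = ET.SubElement(act, _q("meta"))
    ident = ET.SubElement(meta, _q("identification"), {"source": "#natlex"})
    work = ET.SubElement(ident, _q("FRBRWork"))
    ET.SubElement(work, _q("FRBRcountry"), {"value": country})
    ET.SubElement(work, _q("FRBRdate"), {"date": adopted_on, "name": "adopted"})
    ET.SubElement(work, _q("FRBRnumber"), {"value": isn})
    expr = ET.SubElement(ident, _q("FRBRExpression"))
    ET.SubElement(expr, _q("FRBRlanguage"), {"language": language})
    # Associate the unconverted PDF with this document by reference.
    body = ET.SubElement(act, _q("body"))
    ET.SubElement(body, _q("componentRef"), {"src": pdf_name, "showAs": pdf_name})
    return ET.tostring(root, encoding="unicode")
```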