Analytics: Extract Law metadata #13

Open
kohsah opened this issue Mar 13, 2018 · 4 comments
@kohsah
Contributor

kohsah commented Mar 13, 2018

There is a large amount of African legislation online on the ILO NATLEX portal. This is a UN agency website that has curated legislation from different African countries by subject and tagged it with keywords.

For the Gawati Project (https://www.gawati.org) we are trying to build a repository of African Legislation which can be searched in one place, but has been curated from different sources. ILO NATLEX is one such source.

ILO NATLEX site : http://www.ilo.org/dyn/natlex/natlex4.home?p_lang=en

We are interested only in African countries, and these can be found under country profiles:

http://www.ilo.org/dyn/natlex/natlex4.byCountry?p_lang=en

For example, here is the page for Zimbabwe:
http://www.ilo.org/dyn/natlex/natlex4.countrySubjects?p_lang=en&p_country=ZWE

Each of the items here leads to a document citation:

image

E.g., clicking "General provisions":
http://www.ilo.org/dyn/natlex/natlex4.listResults?p_lang=en&p_country=ZWE&p_count=395&p_classification=01&p_classcount=87

image

If I click "Special Economic Zones Act":

http://www.ilo.org/dyn/natlex/natlex4.detail?p_lang=en&p_isn=104410&p_country=ZWE&p_count=395&p_classification=01&p_classcount=87

This refers to the “Special Economic Zones Act” of the country; the link to the PDF is highlighted below in yellow.

image

There is important metadata here about the document:

Name, Country, Type, Official Date (Adopted On), ISN Number, citation text, and the PDF itself.

We need to extract this information into the official Akoma Ntoso XML format used by Gawati.
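
The fields listed above could be held in a small record type before any conversion to XML. The following is only a sketch: `LawMetadata`, its attribute names, and `isn_from_url` are hypothetical and not taken from the Gawati schema. It assumes the ISN can be read from the `p_isn` query parameter seen in the detail URL above.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import parse_qs, urlparse


@dataclass
class LawMetadata:
    """One NATLEX record. Field names are illustrative, not Gawati's."""
    name: str
    country: str        # alpha-3 code as used in NATLEX URLs, e.g. "ZWE"
    doc_type: str
    adopted_on: str     # official date, kept as the raw page text
    isn: str
    citation: str
    pdf_url: Optional[str] = None   # filled in once the PDF link is found


def isn_from_url(detail_url: str) -> str:
    """Read the ISN from the p_isn query parameter of a detail URL."""
    query = parse_qs(urlparse(detail_url).query)
    return query["p_isn"][0]
```

Keeping the raw values in one place like this makes the later conversion step independent of how the pages were scraped.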

Steps to Take

1 Akoma Ntoso XML

Here are sample documents in Akoma Ntoso XML format. Archive (a) has the XML documents, and archive (b) has the PDF documents described by (a).
(a) https://github.com/gawati/gawati-data-xml/releases/download/1.2/akn_xml_sample-1.2.zip
(b) https://github.com/gawati/gawati-data-xml/releases/download/1.2/akn_pdf_sample-1.2.zip

2 Downloading Source Documents

For the ILO site, you first need to gather the raw data.
Working country by country (let's start with one country, Zimbabwe), download the source metadata (as shown earlier) and the associated PDF document.
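
A rough sketch of the download step, using only the Python standard library. The helper names `detail_url` and `download` are hypothetical, and it is an assumption that the detail page can be requested with only `p_lang`, `p_isn` and `p_country` (the sample URL above carries additional parameters).

```python
import urllib.request
from pathlib import Path
from urllib.parse import urlencode

BASE = "http://www.ilo.org/dyn/natlex/"


def detail_url(isn: str, country: str, lang: str = "en") -> str:
    """Build a natlex4.detail URL from an ISN and an alpha-3 country code."""
    # Assumption: the remaining query parameters are optional.
    query = urlencode({"p_lang": lang, "p_isn": isn, "p_country": country})
    return f"{BASE}natlex4.detail?{query}"


def download(url: str, target: Path) -> Path:
    """Save the response body of url to target, creating parent folders."""
    target.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response:
        target.write_bytes(response.read())
    return target
```

For example, `download(detail_url("104410", "ZWE"), Path("raw/ZWE/104410.html"))` would store the raw page for later processing; the `raw/ZWE/` layout is just one possible choice.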

3 Processing Downloaded Data

The next step is to process the downloaded raw metadata and convert it to Akoma Ntoso format. The PDF file need not be converted, but it needs to be associated with the corresponding Akoma Ntoso document, as shown in “1 Akoma Ntoso XML” above.

@kohsah
Contributor Author

kohsah commented Mar 13, 2018

Proposed changes to metadata.xml v1 (see attached v2 sample for comment suggestions)
Also see the attached country code documents.
You can use these to associate the document with a country code.
The content of both is the same; the JSON file is perhaps easier to load and use from a Python script.
You need to use the "alpha-2" code for a country.

(files renamed as .txt to allow upload)

country_codes.json.txt
metadata_v2.xml.txt
metadata.xml.txt
country_codes.xml.txt

@kohsah
Contributor Author

kohsah commented Mar 13, 2018

@yash1802 We also need to determine the language of the document, but it's not stated in the metadata. However, the metadata has an abstract, which always seems to be in the language of the document, so we can make a best guess.

I have used both of these libraries before for determining languages of fragments of text, I remember optimaize being slightly better, but langdetect may be more suitable here.

https://github.com/Mimino666/langdetect
https://github.com/optimaize/language-detector

So you need to detect the language of the abstract and apply it to the document metadata, something like:

(if the document is French)

<language code="fra" />

There is a JSON file with language codes; we use the 'alpha-3b' syntax:

https://raw.githubusercontent.com/gawati/gawati-portal-ui/dev/src/configs/languageCodes.json
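
A possible way to wire this together with langdetect, as a sketch only. `detect` and `DetectorFactory.seed` are part of langdetect; everything else is an assumption: `SAMPLE_CODES` is a tiny stand-in for languageCodes.json, and its key names (`alpha2`, `alpha3b`) are guesses that should be checked against the real file. The fallback value `und` is the ISO 639-2 code for "undetermined".

```python
# Tiny stand-in for languageCodes.json; the key names are assumptions.
SAMPLE_CODES = [
    {"alpha2": "en", "alpha3b": "eng"},
    {"alpha2": "pt", "alpha3b": "por"},
]


def to_alpha3b(alpha2, codes=SAMPLE_CODES, default="und"):
    """Map a two-letter code to its alpha-3b code, or default if unknown."""
    for entry in codes:
        if entry.get("alpha2") == alpha2:
            return entry["alpha3b"]
    return default


def guess_language(abstract, codes=SAMPLE_CODES):
    """Best-guess alpha-3b language of an abstract, using langdetect."""
    # Imported here so the mapping above works without langdetect installed.
    from langdetect import DetectorFactory, detect

    DetectorFactory.seed = 0  # make detection results repeatable
    return to_alpha3b(detect(abstract), codes)


def language_element(alpha3b):
    """Render the language element for the document metadata."""
    return f'<language code="{alpha3b}" />'
```

langdetect returns two-letter codes for most languages, plus a few regional variants such as `zh-cn`; any value missing from the table falls back to `und`, so those records can be reviewed by hand.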

@kohsah
Contributor Author

kohsah commented Mar 14, 2018

@yash1802 This should give you a list of African countries from the country codes JSON:

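import json

# country_codes.json is the attached country_codes.json.txt, renamed back
with open('country_codes.json') as f:
    data = json.load(f)

# region-code '002' is Africa
african_countries = [
    country
    for country in data['countries']['country']
    if country['region-code'] == '002'
]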

The URLs on the NATLEX byCountry page (http://www.ilo.org/dyn/natlex/natlex4.byCountry?p_lang=en) have the following structure:

http://..../natlex4.countrySubjects?p_lang=en&p_country=BDI
                                                        ^^^ <== alpha-3 country code

We have the alpha-3 code for each country in the country codes JSON, so you can iterate over the african_countries list to build the starting-point URLs you need to fetch.
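
A short sketch of that iteration. The helper `start_urls` is hypothetical, and the key name `alpha-3` is an assumption modelled on the `region-code` key used above; check it against the actual JSON file. The sample records stand in for the filtered country list.

```python
from urllib.parse import urlencode

NATLEX = "http://www.ilo.org/dyn/natlex/natlex4.countrySubjects"


def start_urls(countries, lang="en"):
    """Return one countrySubjects URL per country record."""
    urls = []
    for country in countries:
        # Assumption: each record stores its alpha-3 code under "alpha-3".
        query = urlencode({"p_lang": lang, "p_country": country["alpha-3"]})
        urls.append(f"{NATLEX}?{query}")
    return urls


# Stand-in records; in practice pass the filtered list of African countries.
sample = [{"alpha-3": "BDI"}, {"alpha-3": "ZWE"}]
for url in start_urls(sample):
    print(url)
```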

@kohsah
Contributor Author

kohsah commented Mar 18, 2018

@yash1802 See this document for info on extracting bibliography and a few other things: https://docs.google.com/document/d/1HsM7zGoulr3_dkxGQEwY9FONcrCfSJ3mtq7bKV4XUxw/edit?usp=sharing

@gawati gawati deleted a comment from arjunaoverall Mar 18, 2018
@gawati gawati deleted a comment from ccsmart Mar 20, 2018