Analytics: Extract Law metadata #13

kohsah · 2018-03-13T05:30:42Z

The ILO NATLEX portal hosts a large amount of African legislation online. It is a UN agency website that has curated African legislation by subject and with keywords.

For the Gawati Project (https://www.gawati.org) we are trying to build a repository of African Legislation which can be searched in one place, but has been curated from different sources. ILO NATLEX is one such source.

ILO NATLEX site : http://www.ilo.org/dyn/natlex/natlex4.home?p_lang=en

We are interested only in African countries, and these can be found under country profiles:

http://www.ilo.org/dyn/natlex/natlex4.byCountry?p_lang=en

For example, here is Zimbabwe:
http://www.ilo.org/dyn/natlex/natlex4.countrySubjects?p_lang=en&p_country=ZWE

Each of the items here leads to a document citation:

E.g., clicking "General provisions":
http://www.ilo.org/dyn/natlex/natlex4.listResults?p_lang=en&p_country=ZWE&p_count=395&p_classification=01&p_classcount=87

If I click "Special Economic Zones Act":

http://www.ilo.org/dyn/natlex/natlex4.detail?p_lang=en&p_isn=104410&p_country=ZWE&p_count=395&p_classification=01&p_classcount=87

This refers to the “Special Economic Zones Act” of the country; the link to the PDF is highlighted below in yellow.

There is important metadata here about the document:

Name, Country, Type, Official Date (Adopted On), ISN number, citation text, and the PDF itself.

We need to extract this information into the official Akoma Ntoso XML format used by Gawati.
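As a rough sketch of what one extracted record could hold before conversion: the ISN and country code are already present in the detail URL's query string, so they can be read without parsing the page. The class name `NatlexRecord`, its field names, and the helper `record_from_detail_url` are hypothetical and only for illustration; the remaining fields would be filled in from the detail page itself.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import parse_qs, urlparse


@dataclass
class NatlexRecord:
    """Hypothetical container for the metadata listed above."""
    isn: str                          # ISN number (p_isn in the URL)
    country_alpha3: str               # alpha-3 country code (p_country in the URL)
    name: Optional[str] = None        # to be filled from the detail page
    doc_type: Optional[str] = None
    adopted_on: Optional[str] = None  # Official Date (Adopted On)
    citation: Optional[str] = None
    pdf_url: Optional[str] = None


def record_from_detail_url(url: str) -> NatlexRecord:
    """Read the ISN and country code from a natlex4.detail URL."""
    query = parse_qs(urlparse(url).query)
    return NatlexRecord(isn=query["p_isn"][0], country_alpha3=query["p_country"][0])


detail_url = (
    "http://www.ilo.org/dyn/natlex/natlex4.detail?p_lang=en&p_isn=104410"
    "&p_country=ZWE&p_count=395&p_classification=01&p_classcount=87"
)
record = record_from_detail_url(detail_url)
print(record.isn, record.country_alpha3)
```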

Steps to Take

1 Akoma Ntoso XML

Here are sample documents in Akoma Ntoso XML format: (a) has the XML documents, and (b) has the PDF documents described by (a).
(a) https://github.com/gawati/gawati-data-xml/releases/download/1.2/akn_xml_sample-1.2.zip
(b) https://github.com/gawati/gawati-data-xml/releases/download/1.2/akn_pdf_sample-1.2.zip

2 Downloading Source Documents

For the ILO site, you first need to gather raw data.
For each African country (let's start with one country, Zimbabwe), download the source metadata (as shown earlier) and the associated PDF document.
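One possible way to keep the raw downloads organised is sketched below. The directory layout (`<root>/<country>/<isn>/`), the file names, and the helper names `fetch` and `store_raw` are assumptions made for illustration, not something the project prescribes.

```python
import urllib.request
from pathlib import Path
from typing import Optional


def fetch(url: str, timeout: float = 30.0) -> bytes:
    """Download one URL and return the raw response body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def store_raw(root: Path, country: str, isn: str,
              detail_html: bytes, pdf: Optional[bytes]) -> Path:
    """Save the detail page and (if present) its PDF under root/country/isn/."""
    target = Path(root) / country / isn
    target.mkdir(parents=True, exist_ok=True)
    (target / "detail.html").write_bytes(detail_html)
    if pdf is not None:
        # Not every record necessarily links to a PDF, so this is optional.
        (target / "document.pdf").write_bytes(pdf)
    return target
```

Keeping the untouched HTML next to the PDF means the conversion step can be re-run later without hitting the ILO site again.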

3 Processing Downloaded Data

The next step is to process the downloaded raw metadata and convert it to Akoma Ntoso format. The PDF file need not be converted, but it needs to be associated with the corresponding Akoma Ntoso document, as shown in “1 Akoma Ntoso XML” above.

kohsah · 2018-03-13T05:33:00Z

Proposed changes to metadata.xml v1 (see attached v2 sample for comment suggestions)
Also see the attached country code documents.
You can use these to associate each document with a country code.
The content is the same in both; the JSON file is perhaps easier to load and use from a Python script.
You need to use the "alpha-2" code for a country.
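Since NATLEX URLs carry alpha-3 codes while the document needs the alpha-2 code, a small lookup table helps. The sketch below assumes each country entry has `alpha-2` and `alpha-3` keys and that entries sit under `data['countries']['country']`; the inline `sample` stands in for the real file, and the helper name is hypothetical.

```python
def alpha3_to_alpha2(countries):
    """Build a lookup from alpha-3 to alpha-2 country codes."""
    return {c["alpha-3"]: c["alpha-2"] for c in countries}


# Inline stand-in for the contents of country_codes.json (assumed layout).
sample = {
    "countries": {
        "country": [
            {"alpha-2": "ZW", "alpha-3": "ZWE", "region-code": "002"},
            {"alpha-2": "BI", "alpha-3": "BDI", "region-code": "002"},
        ]
    }
}

lookup = alpha3_to_alpha2(sample["countries"]["country"])
print(lookup["ZWE"])
```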

(files renamed as .txt to allow upload)

country_codes.json.txt
metadata_v2.xml.txt
metadata.xml.txt
country_codes.xml.txt

kohsah · 2018-03-13T05:45:00Z

@yash1802 we also need to determine the language of the document, but it's not stated in the metadata. However, the metadata has an abstract which always seems to be in the language of the document, so we can make a best guess.

I have used both of these libraries before for determining the language of fragments of text. I remember optimaize being slightly better, but langdetect may be more suitable here.

https://github.com/Mimino666/langdetect
https://github.com/optimaize/language-detector

So you need to detect the language of the abstract and record it in the document metadata, something like:

(if the document is in French)

<language code="fra" />

There is a JSON file with language codes; we use the 'alpha-3b' syntax:

https://raw.githubusercontent.com/gawati/gawati-portal-ui/dev/src/configs/languageCodes.json
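A minimal sketch of the detect-then-map step follows. The function name `language_element` and the tiny mapping are illustrative only; in practice the two-letter to three-letter mapping would be built from the JSON file above, whose key names are not assumed here. Note that ISO 639-2 has two three-letter codes for French, `fre` (bibliographic) and `fra` (terminologic), so the value to emit should be checked against that file.

```python
from typing import Callable, Mapping, Optional


def language_element(abstract: str,
                     detect: Callable[[str], str],
                     alpha2_to_alpha3: Mapping[str, str]) -> Optional[str]:
    """Return a <language/> element for the abstract, or None if undetermined."""
    if not abstract.strip():
        return None  # nothing to detect on
    alpha2 = detect(abstract)              # e.g. "en" or "fr"
    alpha3 = alpha2_to_alpha3.get(alpha2)  # map to the three-letter code
    if alpha3 is None:
        return None
    return f'<language code="{alpha3}" />'


# Stand-in detector and mapping so the sketch runs without extra packages.
def fake_detect(text: str) -> str:
    return "en"


print(language_element("An Act to provide for ...", fake_detect, {"en": "eng"}))
```

With langdetect installed, `from langdetect import detect` provides a callable that can be passed in place of `fake_detect`; it returns two-letter codes such as `fr`.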

kohsah · 2018-03-14T03:53:23Z

@yash1802 This should give you a list of African countries from the country codes JSON:

import json

# Load the attached country codes file (uploaded as country_codes.json.txt).
with open('country_codes.json') as f:
    data = json.load(f)

# UN M49 region code '002' is Africa.
african_countries = [
    country
    for country in data['countries']['country']
    if country['region-code'] == '002'
]

The URLs on the NATLEX country page (http://www.ilo.org/dyn/natlex/natlex4.byCountry?p_lang=en) have the following structure:

http://..../natlex4.countrySubjects?p_lang=en&p_country=BDI
                                                        ^^^ <== alpha-3 country code

The country codes JSON has the alpha-3 code for each country, so you can iterate over the african_countries list to build the starting-point URLs you need to fetch.
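Building those starting-point URLs could look roughly like this. It assumes each entry in the list has an `alpha-3` key; the function name `start_urls` and the inline sample entries are only for illustration.

```python
from urllib.parse import urlencode

BASE = "http://www.ilo.org/dyn/natlex/natlex4.countrySubjects"


def start_urls(countries):
    """Return one countrySubjects URL per country entry."""
    return [
        f"{BASE}?{urlencode({'p_lang': 'en', 'p_country': c['alpha-3']})}"
        for c in countries
    ]


# Stand-in for the african_countries list built from country_codes.json.
sample_countries = [{"alpha-3": "BDI"}, {"alpha-3": "ZWE"}]
for url in start_urls(sample_countries):
    print(url)
```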

kohsah · 2018-03-18T12:46:25Z

@yash1802 See this document for info on extracting bibliography and a few other things: https://docs.google.com/document/d/1HsM7zGoulr3_dkxGQEwY9FONcrCfSJ3mtq7bKV4XUxw/edit?usp=sharing

kohsah assigned kohsah and yash1802 Mar 13, 2018

gawati deleted a comment from arjunaoverall Mar 18, 2018

gawati deleted a comment from ccsmart Mar 20, 2018
