modeseven-os-climate/osc-transformer-based-extractor

 
 


OSC Transformer Based Extractor

An OS-Climate Project Join OS-Climate on Slack Source code on GitHub PyPI package Build Status Built using PDM Project generated with PyScaffold

OS-Climate Data Extraction Tool

This project provides a CLI tool and Python scripts to train Transformer models (via Hugging Face) for two primary tasks:

  1. Relevance Detection: Determines whether a question-context pair is relevant.
  2. KPI Detection: Fine-tunes models to extract key performance indicators (KPIs) from datasets such as annual reports, and runs inference with them.

Quick Start

To install the tool, use pip:

$ pip install osc-transformer-based-extractor

After installation, you can access the CLI tool with:

$ osc-transformer-based-extractor

Run on its own, this command lists the available commands and their help text. The CLI is built with Typer.

Commands and Workflow

1. Relevance Detection

Fine-tuning the Model:

Assume your project structure looks like this:

project/
│
├── kpi_mapping.csv
├── training_data.csv
├── data/
│   └── (JSON files for inference)
├── model/
│   └── (Model-related files)
├── saved__model/
│   └── (Output from training)
├── output/
│   └── (Results from inference)
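
The layout above can be created up front so that the input and output folders exist before training or inference. This is a minimal sketch using only the standard library. The function name `create_project_layout` is illustrative and not part of the tool, and whether the CLI creates missing folders on its own is not stated here.

```python
from pathlib import Path

# Folder names copied from the layout above.
PROJECT_DIRS = ("data", "model", "saved__model", "output")


def create_project_layout(root):
    """Create the working folders under `root` and return their paths."""
    base = Path(root)
    created = []
    for name in PROJECT_DIRS:
        folder = base / name
        # exist_ok=True makes the call safe to repeat on an existing project.
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created
```

Calling `create_project_layout("project")` from the directory that should hold the project creates the four folders; `kpi_mapping.csv` and `training_data.csv` still have to be copied in by hand.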

Use the following command to fine-tune the model:

$ osc-transformer-based-extractor relevance-detector fine-tune \
  --data_path "project/training_data.csv" \
  --model_name "bert-base-uncased" \
  --num_labels 2 \
  --max_length 128 \
  --epochs 3 \
  --batch_size 16 \
  --output_dir "project/saved__model/" \
  --save_steps 500

Running Inference:

$ osc-transformer-based-extractor relevance-detector perform-inference \
  --folder_path "project/data/" \
  --kpi_mapping_path "project/kpi_mapping.csv" \
  --output_path "project/output/" \
  --model_path "project/model/" \
  --tokenizer_path "project/model/" \
  --threshold 0.5
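
The `--threshold` option suggests that a relevance probability is compared against a cutoff. The sketch below illustrates that idea only; it is not the tool's actual implementation. It assumes a two-label classifier whose raw scores are turned into probabilities with softmax, that label index 1 means "relevant", and that a probability equal to the threshold counts as relevant. The function names are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def is_relevant(logits, threshold=0.5):
    """Return True when the 'relevant' probability reaches the threshold.

    Assumption: index 1 is the 'relevant' label of a two-label model.
    """
    return softmax(logits)[1] >= threshold
```

Under these assumptions, raising the threshold trades recall for precision: fewer question-context pairs are kept, but the ones that remain carry higher model confidence.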

2. KPI Detection

The KPI detection functionality includes fine-tuning and inference.

Fine-tuning the KPI Model:

Assume your project structure looks like this:

project/
│
├── kpi_mapping.csv
├── training_data.csv
│
├── model/
│   └── (model-related files, e.g., tokenizer, config, checkpoints)
│
├── saved__model/
│   └── (Folder to store output from fine-tuning)
│
├── output/
│   └── (output files, e.g., inference_results.xlsx)

Use the following command to fine-tune the KPI model:

$ osc-transformer-based-extractor kpi-detection fine-tune \
    --data_path "project/training_data.csv" \
    --model_name "bert-base-uncased" \
    --max_length 128 \
    --epochs 3 \
    --batch_size 16 \
    --learning_rate 5e-5 \
    --output_dir "project/saved__model/" \
    --save_steps 500
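
For scripted or repeated runs, the same command can be assembled and launched from Python. The sketch below assumes the CLI is installed and on the PATH. The helper names `build_command` and `run_kpi_fine_tune` are illustrative and not part of the package; the flags and values are copied from the command above.

```python
import subprocess

# Option names and values copied from the fine-tune command above.
KPI_FINE_TUNE_OPTIONS = {
    "data_path": "project/training_data.csv",
    "model_name": "bert-base-uncased",
    "max_length": 128,
    "epochs": 3,
    "batch_size": 16,
    "learning_rate": "5e-5",  # kept as a string to preserve the notation
    "output_dir": "project/saved__model/",
    "save_steps": 500,
}


def build_command(group, action, options):
    """Turn an options dict into an argument list for the CLI."""
    argv = ["osc-transformer-based-extractor", group, action]
    for flag, value in options.items():
        argv += [f"--{flag}", str(value)]
    return argv


def run_kpi_fine_tune():
    """Launch fine-tuning; raises CalledProcessError on a non-zero exit code."""
    argv = build_command("kpi-detection", "fine-tune", KPI_FINE_TUNE_OPTIONS)
    return subprocess.run(argv, check=True)
```

Passing the arguments as a list avoids shell quoting issues with paths that contain spaces.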

Performing Inference:

$ osc-transformer-based-extractor kpi-detection inference \
    --data_file_path "project/data/input_dataset.csv" \
    --output_path "project/output/inference_results.xlsx" \
    --model_path "project/model/"

Training Data Requirements

  1. Relevance Detection Training File:

The training file should have the following columns:

- Question
- Context
- Label

Example:

Training Data Example

| Question | Context | Label |
|---|---|---|
| What is the company name? | The Company is exposed to a risk... | 0 |

  2. KPI Detection Training File:

For KPI detection, the dataset needs the Question, Context, and Label columns plus these additional columns: Company, Source File, KPI ID, Year, Answer, and Data Type.

KPI Detection Training Example

| Question | Context | Label | Company | Source File | KPI ID | Year | Answer | Data Type |
|---|---|---|---|---|---|---|---|---|
| What is the company name? | ... | 0 | NOVATEK | 04_NOVATEK_AR_2016_ENG_11.pdf | 0 | 2016 | PAO NOVATEK | TEXT |

  3. KPI Mapping File:

KPI Mapping File Example

| kpi_id | question | sectors | add_year | kpi_category |
|---|---|---|---|---|
| 1 | In which year was the annual report... | OG, CM, CU | FALSE | TEXT |
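
Before starting a long training run, it can help to check that the input files carry the expected columns. The sketch below uses only the standard library and rests on several assumptions not confirmed here: the files are comma-separated with a header row, header names match the ones listed above exactly, `kpi_id` is an integer, and `sectors` is a comma-separated list inside a single quoted cell. The function names are hypothetical.

```python
import csv
import io

RELEVANCE_COLUMNS = ["Question", "Context", "Label"]
KPI_COLUMNS = RELEVANCE_COLUMNS + [
    "Company", "Source File", "KPI ID", "Year", "Answer", "Data Type",
]


def missing_columns(csv_text, required):
    """Return the required column names absent from the CSV header row."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]


def load_kpi_mapping(csv_text):
    """Parse KPI mapping rows into dicts with typed fields."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        rows.append({
            "kpi_id": int(row["kpi_id"]),
            "question": row["question"],
            # Assumed format: "OG, CM, CU" stored in one cell.
            "sectors": [s.strip() for s in row["sectors"].split(",")],
            "add_year": row["add_year"].strip().upper() == "TRUE",
            "kpi_category": row["kpi_category"],
        })
    return rows
```

To check a file on disk, read it first, for example `missing_columns(open("project/training_data.csv", encoding="utf-8").read(), KPI_COLUMNS)`; an empty list means every expected column is present.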

Developer Notes

Local Development

Clone the repository:

$ git clone https://github.com/os-climate/osc-transformer-based-extractor/

We use pdm for package management and tox for testing.

  1. Install pdm:

    $ pip install pdm
  2. Sync dependencies:

    $ pdm sync
  3. Add new packages (e.g., numpy):

    $ pdm add numpy
  4. Run tox for linting and testing:

    $ pip install tox
    $ tox -e lint
    $ tox -e test

Contributing

We welcome contributions! Please fork the repository and submit a pull request. Sign off each commit under the Developer Certificate of Origin (DCO), for example with git commit -s. Read more: http://developercertificate.org/.

Governance Transition

On June 26, 2024, the Linux Foundation announced the merger of FINOS with OS-Climate. Projects are now transitioning to the FINOS governance framework: https://community.finos.org/docs/governance
