---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# grafzahl <img src="man/figures/grafzahl_logo.png" align="right" height="139" />
<!-- badges: start -->
[![CRAN status](https://www.r-pkg.org/badges/version/grafzahl)](https://CRAN.R-project.org/package=grafzahl)
<!-- badges: end -->
The goal of grafzahl (**G**racious **R** **A**nalytical **F**ramework for **Z**appy **A**nalysis of **H**uman **L**anguages [^1]) is to duct tape the [quanteda](https://github.com/quanteda/quanteda) ecosystem to modern [Transformer-based text classification models](https://simpletransformers.ai/), e.g. BERT, RoBERTa, etc. The model object looks and feels like the textmodel S3 object from the package [quanteda.textmodels](https://github.com/quanteda/quanteda.textmodels).
If you don't know what I am talking about, don't worry: this package is gracious. You don't need to know a lot about Transformers to use it. See the examples below.
Please cite this software as:
Chan, C. (2023). [grafzahl: fine-tuning Transformers for text data from within R](paper/grafzahl_sp.pdf). *Computational Communication Research* 5(1): 76-84. [https://doi.org/10.5117/CCR2023.1.003.CHAN](https://doi.org/10.5117/CCR2023.1.003.CHAN)
## Installation: Local environment
Install the CRAN version
```r
install.packages("grafzahl")
```
After that, you need to set up your conda environment
```r
require(grafzahl)
setup_grafzahl(cuda = TRUE) ## if you have GPU(s)
```
## On remote environments, e.g. Google Colab
On Google Colab, you need to enable non-Conda mode
```r
install.packages("grafzahl")
require(grafzahl)
use_nonconda()
```
Please refer to the vignette.
## Usage
Suppose you have a bunch of tweets in the quanteda corpus format, and the corpus has exactly one docvar that denotes the labels you want to predict. The data is from [this repository](https://github.com/pablobarbera/incivility-sage-open) (Theocharis et al., 2020).
```{r, echo = FALSE, message = FALSE}
devtools::load_all()
```
```{r}
unciviltweets
```
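If your own data is not yet in this shape, the following is a minimal sketch of how a corpus with exactly one docvar could be built with quanteda. The texts, the object name `toy_corpus`, and the column name `uncivil` are made up for illustration and are not taken from the dataset above.

```r
library(quanteda)

# Toy texts and labels, invented for illustration only
texts <- c(
  doc1 = "Thank you for your thoughtful reply.",
  doc2 = "What a stupid idea.",
  doc3 = "Looking forward to the town hall tonight."
)
labels <- data.frame(uncivil = c(0, 1, 0))

# Build a corpus whose only docvar is the label to predict
toy_corpus <- corpus(texts, docvars = labels)

# Inspect the docvars and confirm there is exactly one column
docvars(toy_corpus)
ncol(docvars(toy_corpus))
```

A corpus built this way has the same basic shape as the one described above: one document per text and a single docvar holding the label.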
In order to train a Transformer model, please select the `model_name` from [Hugging Face's list](https://huggingface.co/models). The table below lists some common choices. Most of the time, providing `model_name` is sufficient; there is no need to provide `model_type`.
Suppose you want to train a Transformer model using "bertweet" (Nguyen et al., 2020) because it matches your domain of usage. By default, the model is saved in the `output` directory under the current directory. You can change this with the `output_dir` parameter.
```r
model <- grafzahl(unciviltweets, model_type = "bertweet", model_name = "vinai/bertweet-base")
### If you are a hardcore quanteda user:
## model <- textmodel_transformer(unciviltweets,
## model_type = "bertweet", model_name = "vinai/bertweet-base")
```
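As a rough sketch of the two points above (supplying only `model_name`, and choosing where the model is saved via `output_dir`), the call could be wrapped like this. The helper `train_bertweet` and the directory name `bertweet_uncivil` are hypothetical and not part of grafzahl. Whether `model_type` can be omitted for a given model is an assumption based on the note that `model_name` is usually sufficient.

```r
# Hypothetical helper (not part of grafzahl): fine-tune BERTweet and
# save the model under a caller-chosen directory instead of `output`.
train_bertweet <- function(corpus, output_dir) {
  grafzahl::grafzahl(
    corpus,
    model_name = "vinai/bertweet-base", # model_type is not supplied here
    output_dir = output_dir
  )
}

# Usage (needs a working grafzahl setup; training can take a while):
# model <- train_bertweet(unciviltweets, output_dir = "bertweet_uncivil")
```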
Make predictions
```r
predict(model)
```
That is it.
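If you want a quick check of the predictions against the known labels, a minimal sketch in base R could look like the following. The two vectors are placeholders invented for illustration. With a real model, one would presumably use something like `predicted <- predict(model)` and `truth <- quanteda::docvars(unciviltweets)[[1]]`, assuming `predict()` returns one label per document.

```r
# Placeholder vectors standing in for real labels and predictions
truth <- c(1, 0, 1, 1, 0)
predicted <- c(1, 0, 0, 1, 0)

# Cross-tabulate true labels against predicted labels
confusion <- table(truth = truth, predicted = predicted)
confusion

# Share of documents whose predicted label matches the true label
accuracy <- mean(predicted == truth)
accuracy
```

Note that agreement on the training data itself is an optimistic measure; the extended examples below are the better place to look for evaluation on held-out data.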
## Extended examples
Several extended examples are also available.
| Examples | file |
|-------------------------------------------------|------------------------------------------------|
| van Atteveldt et al. (2021) | [paper/vanatteveldt.md](paper/vanatteveldt.md) |
| Dobbrick et al. (2021) | [paper/dobbrick.md](paper/dobbrick.md) |
| Theocharis et al. (2020) | [paper/theocharis.md](paper/theocharis.md) |
| OffensEval-TR (2020) | [paper/coltekin.md](paper/coltekin.md) |
| Amharic News Text classification Dataset (2021) | [paper/azime.md](paper/azime.md) |
## Some common choices of `model_name`
| Your data | model_type | model_name |
|-------------------|------------|------------------------------------|
| English tweets | bertweet | vinai/bertweet-base |
| Lightweight | mobilebert | google/mobilebert-uncased |
| | distilbert | distilbert-base-uncased |
| Long Text | longformer | allenai/longformer-base-4096 |
| | bigbird | google/bigbird-roberta-base |
| English (General) | bert | bert-base-uncased |
| | bert | bert-base-cased |
| | electra | google/electra-small-discriminator |
| | roberta | roberta-base |
| Multilingual | xlm | xlm-mlm-17-1280 |
|                   | xlm        | xlm-mlm-100-1280                   |
| | bert | bert-base-multilingual-cased |
| | xlmroberta | xlm-roberta-base |
| | xlmroberta | xlm-roberta-large |
## References
1. Theocharis, Y., Barberá, P., Fazekas, Z., & Popa, S. A. (2020). The dynamics of political incivility on Twitter. Sage Open, 10(2), 2158244020919447.
2. Nguyen, D. Q., Vu, T., & Nguyen, A. T. (2020). BERTweet: A pre-trained language model for English Tweets. arXiv preprint arXiv:2005.10200.
---
[^1]: Yes, I totally made up the meaningless long name. Actually, it is the German name of the *Sesame Street* character [Count von Count](https://de.wikipedia.org/wiki/Sesamstra%C3%9Fe#Graf_Zahl), meaning "Count (the noble title) Number". And it seems to be compulsory to name absolutely everything related to Transformers after Sesame Street characters.