QQP: sentences appear in both train and dev/test splits #29

Open
tomhosking opened this issue Nov 30, 2022 · 0 comments
Hi,

The splits released for QQP seem to be somewhat leaky - a large number of sentences appear in both train and dev/test:

import os
import urllib.request
import zipfile

# Download and unpack the QQP splits, then delete the archive
data_file = "qqp.zip"
urllib.request.urlretrieve("https://dl.fbaipublicfiles.com/glue/data/QQP-clean.zip", data_file)
with zipfile.ZipFile(data_file) as zip_ref:
  zip_ref.extractall(".")
os.remove(data_file)

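# Unique questions seen in each split
train_sents = set()
dev_sents = set()
test_sents = set()

# train.tsv: skip the header row; the two questions are read from columns 3 and 4
with open('./QQP/train.tsv') as f:
  train = [x.strip().split('\t') for x in f.readlines()][1:]
for row in train:
  train_sents.add(row[3])
  train_sents.add(row[4])

# dev.tsv: same column layout as train.tsv
with open('./QQP/dev.tsv') as f:
  dev = [x.strip().split('\t') for x in f.readlines()][1:]
for row in dev:
  dev_sents.add(row[3])
  dev_sents.add(row[4])

# test.tsv: the two questions are read from columns 1 and 2
with open('./QQP/test.tsv') as f:
  test = [x.strip().split('\t') for x in f.readlines()][1:]
for row in test:
  test_sents.add(row[1])
  test_sents.add(row[2])

# Number of unique questions shared between train and dev, then train and test
print(len(train_sents & dev_sents))
print(len(train_sents & test_sents))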

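This prints the number of unique sentences shared between train and dev, then between train and test:

29852
104698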
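
To put counts like these in context, it might help to also report the overlap as a fraction of the unique sentences in the dev or test split. A minimal sketch, assuming a hypothetical helper `overlap_stats` (not part of the script above) and toy sets standing in for the real QQP data:

```python
def overlap_stats(train_sents, eval_sents):
    """Return (shared count, fraction of eval sentences that also appear in train).

    Hypothetical helper for illustration; both arguments are sets of strings.
    """
    shared = train_sents & eval_sents
    # Guard against an empty eval split to avoid dividing by zero
    frac = len(shared) / len(eval_sents) if eval_sents else 0.0
    return len(shared), frac


# Toy data only; these are not real QQP sentences
toy_train = {"a", "b", "c"}
toy_dev = {"b", "c", "d", "e"}

count, frac = overlap_stats(toy_train, toy_dev)
print(count, frac)
```

With the real sets, the same call would be `overlap_stats(train_sents, dev_sents)` and `overlap_stats(train_sents, test_sents)`.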

Is this intentional?
