The splits released for QQP seem to be somewhat leaky - a large number of sentences appear in both train and dev/test:
import os
import sys
import zipfile

# Pick the module that provides urlretrieve on Python 2 vs. Python 3.
if sys.version_info >= (3, 0):
    import urllib.request
    URLLIB = urllib.request
else:
    import urllib
    URLLIB = urllib

# Download and unpack the QQP archive, then delete the zip.
data_file = "qqp.zip"
URLLIB.urlretrieve("https://dl.fbaipublicfiles.com/glue/data/QQP-clean.zip", data_file)
with zipfile.ZipFile(data_file) as zip_ref:
    zip_ref.extractall(".")
os.remove(data_file)

train_sents = set()
dev_sents = set()
test_sents = set()

# train.tsv and dev.tsv: the two questions are in columns 3 and 4.
with open('./QQP/train.tsv') as f:
    train = [x.strip().split('\t') for x in f.readlines()][1:]  # skip header
    for row in train:
        train_sents.add(row[3])
        train_sents.add(row[4])

with open('./QQP/dev.tsv') as f:
    dev = [x.strip().split('\t') for x in f.readlines()][1:]  # skip header
    for row in dev:
        dev_sents.add(row[3])
        dev_sents.add(row[4])

# test.tsv: the two questions are in columns 1 and 2.
with open('./QQP/test.tsv') as f:
    test = [x.strip().split('\t') for x in f.readlines()][1:]  # skip header
    for row in test:
        test_sents.add(row[1])
        test_sents.add(row[2])

# Number of distinct questions shared between train and dev, then train and test.
print(len(train_sents & dev_sents))
print(len(train_sents & test_sents))
29852
104698
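To put those raw counts in context, it may also help to look at how many dev/test pairs contain at least one question already seen in train. Below is a minimal sketch of that check. It is not part of the script above: `question_overlap` and the toy rows are hypothetical names and data made up for illustration, and real input would be the `(question1, question2)` columns read from the TSV files.

```python
def question_overlap(train_pairs, eval_pairs):
    """Compare two lists of (question1, question2) pairs.

    Returns a tuple of:
      - the number of distinct questions present in both splits,
      - the number of eval pairs with at least one question seen in train,
      - that number as a fraction of all eval pairs.
    """
    train_questions = {q for pair in train_pairs for q in pair}
    eval_questions = {q for pair in eval_pairs for q in pair}
    shared = train_questions & eval_questions

    # An eval pair counts as "touched" if either of its questions is in train.
    touched = sum(
        1 for q1, q2 in eval_pairs
        if q1 in train_questions or q2 in train_questions
    )
    fraction = touched / float(len(eval_pairs)) if eval_pairs else 0.0
    return len(shared), touched, fraction


# Toy rows (hypothetical, not taken from QQP).
toy_train = [
    ("How do I learn Python?", "What is the best way to learn Python?"),
    ("What is GLUE?", "What is the GLUE benchmark?"),
]
toy_dev = [
    ("How do I learn Python?", "How can I learn Java?"),
    ("Who wrote Hamlet?", "Who is the author of Hamlet?"),
]

shared, touched, fraction = question_overlap(toy_train, toy_dev)
print(shared, touched, fraction)
```

On the toy rows, one question is shared and one of the two dev pairs contains it. The same function could be called with `[(row[3], row[4]) for row in train]` and the matching columns from dev or test.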
Is this intentional?