Speed up creation of SampleData object #555
-
Thank you for creating this useful library. We have a rather large number of samples (about 2000), and many mutations are present in only a small number of these samples. However, since we want to take into account every position that has a mutation in at least one sample, we end up with a very large number of positions (approx. 15 million in one chromosome alone). Iterating over all positions and adding the sites one by one is very slow. Is there any smarter way to construct the SampleData object (e.g., exploiting the very sparse nature of the genotype lists)?
-
Hi @floklimm! 👋 2000 samples and 15M sites should be easily handled; we've created much larger SampleData files than this. Can you paste in a little bit of your code here, maybe with an example (ideally with some simple fake data so we can reproduce it)?
-
Yeah, this should be easily doable. If it helps, there's some code to do this in parallel over sites from a VCF file, which I guess should be easily portable to pandas: #277 (comment)
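The general pattern looks roughly like the following (a minimal sketch of the idea, not the code from #277; the `(position, sample_indices)` input layout, the `make_genotypes` worker, and the fixed `["A", "T"]` alleles are illustrative assumptions):

```python
import multiprocessing

import numpy as np
import tsinfer


def make_genotypes(args):
    # Worker: expand one site's sparse list of mutated sample indices
    # into a dense genotype array.
    position, sample_indices, num_samples = args
    genotypes = np.zeros(num_samples, dtype=np.int8)
    genotypes[sample_indices] = 1
    return position, genotypes


def build_sample_data(sites, num_samples, sequence_length):
    # sites: iterable of (position, [sample_index, ...]) pairs.
    tasks = [(pos, idx, num_samples) for pos, idx in sites]
    # Sites must be added in increasing order of position, so sort the
    # work items first; pool.imap then yields results in that order.
    tasks.sort(key=lambda t: t[0])
    with tsinfer.SampleData(sequence_length=sequence_length) as sample_data:
        for _ in range(num_samples):
            sample_data.add_individual(ploidy=1)
        with multiprocessing.Pool() as pool:
            # Genotype arrays are built in parallel; add_site itself
            # runs serially in the main process.
            for position, genotypes in pool.imap(make_genotypes, tasks):
                sample_data.add_site(position, genotypes, ["A", "T"])
    return sample_data
```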
-
The problem here is with the pandas code @floklimm - I think you were doing full table scans at each position, and this was very slow. I've rewritten it using simpler Python structures (hopefully getting the logic right!), and it runs in less than a second now.

```python
import collections

import numpy as np
import pandas as pd
import tsinfer


def prepareDataForInfer(snps):
    # Group the mutated sample IDs by position in a single pass over the
    # table, instead of scanning the whole table at each position.
    mutated_samples = collections.defaultdict(list)
    for _, row in snps.iterrows():
        mutated_samples[row["Pos"]].append(row["GSM"])

    # Map each sample name to a column index in the genotype arrays.
    GSM = snps["GSM"].unique().tolist()
    nGSM = len(GSM)
    sample_index_map = {sample: index for index, sample in enumerate(GSM)}

    positions = sorted(mutated_samples.keys())
    with tsinfer.SampleData(sequence_length=positions[-1]) as sample_data:
        for u in GSM:
            sample_data.add_individual(ploidy=1, metadata={"name": u})
        # Sites must be added in increasing order of position.
        for p in positions:
            genotypes = np.zeros(nGSM, dtype=np.int8)
            for sample in mutated_samples[p]:
                genotypes[sample_index_map[sample]] = 1
            sample_data.add_site(
                p - 1,
                genotypes,
                ["A", "T"],  # FIXME!
            )
    return sample_data


snps_1M = pd.read_csv("snpsExample.csv")
sampleData = prepareDataForInfer(snps_1M)
print(sampleData)
```
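For very sparse data like this, the grouping step itself can also be vectorized so that no per-row Python loop is needed at all. Below is a rough sketch under the same assumptions as above (`Pos`/`GSM` columns, haploid samples, placeholder `["A", "T"]` alleles); only the mutated (position, sample) pairs are ever materialized:

```python
import numpy as np
import pandas as pd
import tsinfer


def prepare_data_vectorized(snps):
    # Vectorized index maps: one pass over the columns instead of
    # per-row Python loops.
    sample_codes, sample_names = pd.factorize(snps["GSM"])
    positions = np.sort(snps["Pos"].unique())
    pos_codes = np.searchsorted(positions, snps["Pos"].to_numpy())

    # Sort the records by site so each site's sample indices are
    # contiguous; this is the sparse representation.
    order = np.argsort(pos_codes, kind="stable")
    sorted_codes = pos_codes[order]
    sorted_samples = sample_codes[order]
    # boundaries[i]:boundaries[i + 1] slices out site i's sample indices.
    boundaries = np.searchsorted(sorted_codes, np.arange(len(positions) + 1))

    with tsinfer.SampleData(sequence_length=positions[-1]) as sample_data:
        for name in sample_names:
            sample_data.add_individual(ploidy=1, metadata={"name": str(name)})
        for i, p in enumerate(positions):
            genotypes = np.zeros(len(sample_names), dtype=np.int8)
            genotypes[sorted_samples[boundaries[i]:boundaries[i + 1]]] = 1
            sample_data.add_site(p - 1, genotypes, ["A", "T"])  # FIXME!
    return sample_data
```

The dense per-site genotype arrays are still needed because that is what add_site takes, but everything up to that point stays proportional to the number of mutations rather than positions times samples.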