Speed up creation of SampleData object #555
-
Thank you for creating this useful library. We have a rather large number of samples (about 2000), and many mutations are present in only a small number of these samples. However, since we want to take into account every position that has a mutation in at least one sample, we end up with a very large number of positions (approx. 15 million in one chromosome alone). Iterating over all positions and adding the sites one by one is very slow. Is there any smarter way to construct the SampleData object (e.g., exploiting the very sparse nature of the genotype lists)?
-
Hi @floklimm! 👋 2000 samples and 15M sites should be easily handled; we've created much larger SampleData files than this. Can you paste in a little bit of your code here, maybe with an example (ideally with some simple fake data so we can reproduce it)?
-
Yeah, this should be easily doable. If it helps, there's some code to do this in parallel over sites from a VCF file, which I guess should be easily portable to pandas: #277 (comment)
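The general pattern looks roughly like the following (a minimal sketch of the idea, not the code from #277; the `(position, sample_indices)` input layout, the `make_genotypes` worker, and the fixed `["A", "T"]` alleles are illustrative assumptions):

```python
import multiprocessing

import numpy as np
import tsinfer


def make_genotypes(args):
    # Worker: expand one site's sparse list of mutated sample indices
    # into a dense genotype array.
    position, sample_indices, num_samples = args
    genotypes = np.zeros(num_samples, dtype=np.int8)
    genotypes[sample_indices] = 1
    return position, genotypes


def build_sample_data(sites, num_samples, sequence_length):
    # sites: iterable of (position, [sample_index, ...]) pairs.
    tasks = [(pos, idx, num_samples) for pos, idx in sites]
    # Sites must be added in increasing order of position, so sort the
    # work items first; pool.imap then yields results in that order.
    tasks.sort(key=lambda t: t[0])
    with tsinfer.SampleData(sequence_length=sequence_length) as sample_data:
        for _ in range(num_samples):
            sample_data.add_individual(ploidy=1)
        with multiprocessing.Pool() as pool:
            # Genotype arrays are built in parallel; add_site itself
            # runs serially in the main process.
            for position, genotypes in pool.imap(make_genotypes, tasks):
                sample_data.add_site(position, genotypes, ["A", "T"])
    return sample_data
```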
-
The problem here is with the pandas code @floklimm - I think you were doing full table scans at each position, and this was very slow. I've rewritten it using simpler Python structures (hopefully getting the logic right!), and it runs in less than a second now.

```python
import collections

import numpy as np
import pandas as pd
import tsinfer


def prepareDataForInfer(snps):
    # Group the mutated sample IDs by position in a single pass over the
    # table, instead of scanning the whole table at each position.
    mutated_samples = collections.defaultdict(list)
    for _, row in snps.iterrows():
        mutated_samples[row["Pos"]].append(row["GSM"])

    # Map each sample name to a column index in the genotype arrays.
    GSM = snps["GSM"].unique().tolist()
    nGSM = len(GSM)
    sample_index_map = {sample: index for index, sample in enumerate(GSM)}

    positions = sorted(mutated_samples.keys())
    with tsinfer.SampleData(sequence_length=positions[-1]) as sample_data:
        for u in GSM:
            sample_data.add_individual(ploidy=1, metadata={"name": u})
        # Sites must be added in increasing order of position.
        for p in positions:
            genotypes = np.zeros(nGSM, dtype=np.int8)
            for sample in mutated_samples[p]:
                genotypes[sample_index_map[sample]] = 1
            sample_data.add_site(
                p - 1,
                genotypes,
                ["A", "T"],  # FIXME!
            )
    return sample_data


snps_1M = pd.read_csv("snpsExample.csv")
sampleData = prepareDataForInfer(snps_1M)
print(sampleData)
```
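For very sparse data like this, the grouping step itself can also be vectorized so that no per-row Python loop is needed at all. Below is a rough sketch under the same assumptions as above (`Pos`/`GSM` columns, haploid samples, placeholder `["A", "T"]` alleles); only the mutated (position, sample) pairs are ever materialized:

```python
import numpy as np
import pandas as pd
import tsinfer


def prepare_data_vectorized(snps):
    # Vectorized index maps: one pass over the columns instead of
    # per-row Python loops.
    sample_codes, sample_names = pd.factorize(snps["GSM"])
    positions = np.sort(snps["Pos"].unique())
    pos_codes = np.searchsorted(positions, snps["Pos"].to_numpy())

    # Sort the records by site so each site's sample indices are
    # contiguous; this is the sparse representation.
    order = np.argsort(pos_codes, kind="stable")
    sorted_codes = pos_codes[order]
    sorted_samples = sample_codes[order]
    # boundaries[i]:boundaries[i + 1] slices out site i's sample indices.
    boundaries = np.searchsorted(sorted_codes, np.arange(len(positions) + 1))

    with tsinfer.SampleData(sequence_length=positions[-1]) as sample_data:
        for name in sample_names:
            sample_data.add_individual(ploidy=1, metadata={"name": str(name)})
        for i, p in enumerate(positions):
            genotypes = np.zeros(len(sample_names), dtype=np.int8)
            genotypes[sorted_samples[boundaries[i]:boundaries[i + 1]]] = 1
            sample_data.add_site(p - 1, genotypes, ["A", "T"])  # FIXME!
    return sample_data
```

The dense per-site genotype arrays are still needed because that is what add_site takes, but everything up to that point stays proportional to the number of mutations rather than positions times samples.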