Speed up creation of SampleData object #555

Answered by jeromekelleher
floklimm asked this question in Q&A

The problem here is with the pandas code, @floklimm. I think you were doing a full table scan at each position, which was very slow. I've rewritten it using simpler Python structures (hopefully getting the logic right!), and it now runs in less than a second.

import numpy as np
import collections
import pandas as pd
import tsinfer
import progressbar


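def prepareDataForInfer(snps):
    # Single pass over the table: collect the samples (GSM) carrying a
    # mutation at each position, instead of rescanning the table per position.
    mutated_samples = collections.defaultdict(list)
    for _, row in snps.iterrows():
        mutated_samples[row["Pos"]].append(row["GSM"])

    # Map each unique sample ID to a contiguous integer index.
    GSM = snps["GSM"].unique().tolist()
    nGSM = len(GSM)
    sample_index_map = {sample: index for index, sample in enumerate(GSM)}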
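
To make the difference between the two approaches concrete, here is a small stdlib-only sketch. It is not the code from this thread: the `rows` list is made-up toy data standing in for the `Pos` and `GSM` columns of `snps`, and `group_by_scan`, `group_single_pass` and `genotype_rows` are hypothetical helper names. In particular, `genotype_rows` is only a guess at how the two maps might be turned into 0/1 genotype vectors per site; the rest of the original function is not shown above.

```python
import collections

# Toy stand-in for the (Pos, GSM) columns of the snps table; values are made up.
rows = [(100, "GSM1"), (250, "GSM2"), (100, "GSM3"), (250, "GSM1"), (400, "GSM2")]


def group_by_scan(rows):
    """Slow pattern: rescan every row once per distinct position."""
    positions = sorted({pos for pos, _ in rows})
    return {p: [s for pos, s in rows if pos == p] for p in positions}


def group_single_pass(rows):
    """Fast pattern: one pass, appending each sample to its position's list."""
    grouped = collections.defaultdict(list)
    for pos, sample in rows:
        grouped[pos].append(sample)
    return dict(grouped)


def genotype_rows(grouped, sample_index_map):
    """Hypothetical helper: one 0/1 list per position, 1 where a sample is mutated."""
    genotypes = {}
    for pos in sorted(grouped):
        g = [0] * len(sample_index_map)
        for sample in grouped[pos]:
            g[sample_index_map[sample]] = 1
        genotypes[pos] = g
    return genotypes


# Unique samples in first-seen order, as pandas' unique() would return them.
samples = list(dict.fromkeys(s for _, s in rows))
sample_index_map = {s: i for i, s in enumerate(samples)}

by_scan = group_by_scan(rows)
by_pass = group_single_pass(rows)
genotypes = genotype_rows(by_pass, sample_index_map)
```

Both grouping functions produce the same mapping, but the scanning version does work proportional to (number of positions) × (number of rows), while the single pass touches each row once. That is roughly why the rewrite above gets so much faster on a large table.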

Answer selected by benjeffery