[Feature Request] Creating gene/exon entries automatically via GFF3/GTF annotation files in nucleotide-based models #377

Open
conJUSTover opened this issue Jun 19, 2023 · 5 comments

@conJUSTover

When using the nucleotide-based models on real genomes (i.e., those imported from FASTA files), it is difficult to add genes and exons manually, but this information is easily accessible in the General Feature Format (GFF3) or Gene Transfer Format (GTF) annotation files that are published alongside genomes. These files have clear designations for UTRs, exons, introns, etc.

Can a feature be added to SLiM to facilitate automatic annotation of SLiM genome objects using these GFF3 or GTF files?

@bhaller
Contributor

bhaller commented Jun 19, 2023

I can certainly consider it. I don't know anything about GFF3 or GTF. Can you perhaps provide a couple of links to useful pages about them, and tell me why I might want to choose one or the other of those formats, and that sort of thing? A bit of background would be helpful. I'm not sure what form the support in SLiM would take, since SLiM doesn't have any built-in knowledge of things like exons, introns, etc.; that is all up to the user to define as needed. Maybe SLiM could assist in some way, though – providing support for scanning through such a file and returning start/end pairs for a given type of element (like exons), for example?
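
For illustration, here is a minimal Python sketch of the kind of scan described above: read an annotation file and return start/end pairs for one feature type. It is not SLiM functionality. It assumes the nine tab-separated columns shared by GFF3 and GTF (seqid, source, type, start, end, score, strand, phase, attributes), with 1-based inclusive coordinates. The function name `feature_intervals` and the sample records are hypothetical.

```python
from typing import Iterable, List, Tuple


def feature_intervals(
    lines: Iterable[str], feature_type: str = "exon"
) -> List[Tuple[str, int, int, str]]:
    """Return (seqid, start, end, strand) for each record of the given type.

    Coordinates are returned exactly as written in the file
    (1-based, inclusive), with no conversion applied.
    """
    intervals = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # GFF3 files may embed sequence data after this directive
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments, and other directives
        fields = line.split("\t")
        if len(fields) != 9:
            continue  # not a well-formed feature record
        seqid, _source, ftype, start, end, _score, strand, _phase, _attrs = fields
        if ftype == feature_type:
            intervals.append((seqid, int(start), int(end), strand))
    return intervals


# Hypothetical records, invented for this example only.
example = [
    "##gff-version 3",
    "chr1\texample\tgene\t100\t900\t.\t+\t.\tID=gene1",
    "chr1\texample\texon\t100\t250\t.\t+\t.\tParent=gene1",
    "chr1\texample\texon\t600\t900\t.\t+\t.\tParent=gene1",
]
exons = feature_intervals(example, "exon")
```

With a real file, the same function could be called as `feature_intervals(open(path), "exon")`, where `path` is whatever annotation file is at hand.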

@petrelharp
Collaborator

FWIW, we've got code in stdpopsim that takes a GFF and gives you an annotation, which can then be used to define DFEs (i.e., to apply collections of mutation types and mutation rates to the annotated regions). That machinery is not public (since we're providing pre-packaged models, not a framework for constructing models), but it would at least be worthwhile to look at the API we've settled on there if you're thinking about going this way. "Supporting GFF files" would be great; however, figuring out where it makes sense to include the GFFs in the API (in a sufficiently sensible and flexible way) takes some work.

@conJUSTover
Author

GTF and GFF3 files are largely identical to each other (GTF is based on version 2 of GFF, and GFF3 is the newer revision), and here's a good link describing them with links to some example files. Most of the information is probably not necessary for SLiM (gene names, for example), but it does document exon start/stop sites and +/- strand.
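
One place the two formats visibly differ is the ninth (attributes) column, where gene names and similar metadata live: GFF3 writes `key=value` pairs, while GTF writes `key "value";` pairs. A rough Python sketch of a parser that accepts either style is below. The function name `parse_attributes` is hypothetical, and the sketch deliberately ignores GFF3 percent-encoding and values that contain semicolons.

```python
import re
from typing import Dict

# GTF-style pair: a key, whitespace, then a double-quoted value.
GTF_PAIR = re.compile(r'^(\S+)\s+"(.*)"$')


def parse_attributes(column: str) -> Dict[str, str]:
    """Parse a GFF3 or GTF attributes column into a dict of strings."""
    attrs = {}
    for part in column.strip().split(";"):
        part = part.strip()
        if not part:
            continue  # trailing semicolon or empty field
        match = GTF_PAIR.match(part)
        if match:
            key, value = match.group(1), match.group(2)  # GTF: key "value"
        elif "=" in part:
            key, value = part.split("=", 1)              # GFF3: key=value
        else:
            key, _, value = part.partition(" ")          # GTF, unquoted value
        attrs[key.strip()] = value.strip()
    return attrs


gff3_attrs = parse_attributes("ID=exon1;Parent=gene1")
gtf_attrs = parse_attributes('gene_id "g1"; transcript_id "t1";')
```

For the purpose discussed here (exon start/stop sites and strand), this column can probably be skipped entirely; it matters mainly when exons need to be grouped by gene or transcript.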

@bhaller
Contributor

bhaller commented Jun 20, 2023

OK, thanks. Pondering this, I'm not sure how much value-add SLiM can provide here. If you want to get a list of start/stop positions for exons out of a GTF file, presumably there are existing open-source tools that can do that and dump the results to a text file, which you could then read into SLiM with, e.g., readCSV() and use in your script (just a for loop over the rows, defining new genomic elements with the given start/stop positions, for example). I'm not sure that duplicating that functionality inside SLiM would add much. Do you have an idea that goes beyond that, @conJUSTover? What exactly would you want SLiM to facilitate here?
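
A sketch of the external preprocessing step suggested above, in Python: take exon start/end pairs already extracted from an annotation file and write a small CSV that a SLiM script could read with readCSV(). Two assumptions are built in. GFF3/GTF coordinates are 1-based and inclusive while SLiM positions are 0-based, so both ends are shifted by one; and overlapping exons (e.g., from alternative transcripts) are merged, since genomic elements in SLiM should not overlap. The function names and the `start`/`end` column names are hypothetical, and the sketch handles a single chromosome.

```python
import csv
import io
from typing import Iterable, List, TextIO, Tuple


def merge_intervals(intervals: Iterable[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Merge overlapping inclusive intervals; the result is sorted by start."""
    merged: List[List[int]] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend it if needed.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def write_slim_csv(intervals: Iterable[Tuple[int, int]], out: TextIO) -> None:
    """Write merged intervals as CSV, shifted from 1-based to 0-based."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["start", "end"])  # hypothetical column names
    for start, end in merge_intervals(intervals):
        writer.writerow([start - 1, end - 1])


# Hypothetical 1-based exon coordinates; the first two overlap.
exons = [(600, 900), (100, 250), (200, 300)]
buffer = io.StringIO()
write_slim_csv(exons, buffer)
csv_text = buffer.getvalue()
```

In practice `out` would be a file opened for writing rather than a `StringIO` buffer. The loop over rows on the SLiM side, defining one genomic element per row, is left to the user's script as described in the comment above.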

@bhaller
Contributor

bhaller commented Jul 24, 2023

I'm going to mark this long-term, because it's not clear how to proceed. Feedback/guidance is needed.
