Snippets
Created by
Brian Ward
last modified
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 | #!/usr/bin/env python
# -*- coding: utf-8 -*-
'''
@author: Brian Ward
My first python script! I think I did okay - maybe not great
This script does something pretty specific - it takes in a fasta file with
flanking sequences surrounding a particular SNP. The SNP name is the ID of each
fasta record. It then reads in a dataframe containing the SNP alleles for each
SNP corresponding to a fasta record.
The script must find the intersection of SNPs between these two files, and
should give a warning if there are any mismatches. Some formatting of the SNP
names is required for both files (the script isn't especially portable in this
sense)
This is a little trickier than expected; my initial thought was to read a smallish
fasta file into a dict. However, dicts are inherently unordered in python, so
they aren't ideal when the order of the sequences matters. Therefore, I think
that reading the fasta into a dataframe is a good way to go
The user inputs the expected length of the flanking sequence. This is just to
ensure that each sequence is the expected length - which will be:
(2 * flank_length) + 1
The presence of one or more SNPs without the expected length will not stop
execution - but the total lengths (read from the fasta) are placed in the output
.csv
The final output is a dataframe with a column for the SNP name, the LENGTH,
and the sequence with SNP inserted ("SEQ_VAR"), in the form (for example):
ATACG[G/A]TTGCA
'''
#### User-defined variables ####
## Location of input data in relation to home directory - needs leading '/' !
data_dir = '/Desktop/tagseq_test'
## fasta and SNPs .csv files; name of cols containing SNP names and
## alleles in the .csv file
fasta = 'TagSeq_201_Apr2017.fasta'
snps_tab = 'SS8412_table_SNP_POS.csv'
name_alleles = ['Name', 'VARIANT']
out_name = 'Neal_Carpenter_KASP_SNPs.csv'
## Expected length of flanking sequence on either side of SNP
flank_len = 100
#### Executable ####
import os
import sys
from Bio.SeqIO.FastaIO import SimpleFastaParser
import pandas as pd
## Set working directory
os.chdir(os.path.normpath(os.path.expanduser('~') + data_dir))
## Calculate expected total seq length
tot_len = 2 * flank_len + 1
## Parse fasta file into lists for IDs, sequences, sequence lengths;
## exit if a sequence with unexpected length is present
with open(fasta) as fasta_file:
ids = []
seqs = []
lengths = []
for title, sequence in SimpleFastaParser(fasta_file):
if len(sequence) == tot_len:
ids.append(title)
seqs.append(sequence)
lengths.append(len(sequence))
else:
print('Error: sequence {} had an unexpected length ({}bp)'.format(title, len(sequence)))
sys.exit(1)
## Clean up ID strings
ids = [i.replace('chr', '') for i in ids]
ids = [i.replace(':101:201', '') for i in ids]
## Combine lists into dataframe
seq_var = pd.DataFrame({
'SNP': ids,
'LENGTH': lengths,
'SEQ': seqs
})
## read in and format SNP info
snps = pd.read_csv(snps_tab, usecols = name_alleles)
snps.columns = ['SNP', 'ALLELES']
snps['SNP'] = snps['SNP'].str.replace('1_', '')
## merge the dataframes by SNP name
seq_var = pd.merge(seq_var, snps, how = 'inner', on = 'SNP')
## Insert SNP alleles into each sequence
## As an R user, always need to be careful with those string indices
seq_var['SEQ_VAR'] = (seq_var['SEQ'].str[:flank_len]
+ '['
+ seq_var['ALLELES']
+ ']'
+ seq_var['SEQ'].str[(flank_len + 1):])
## drop some columns, reorder remaining, then write out .csv
seq_var = seq_var.drop(['SEQ', 'ALLELES'], axis = 1)
seq_var = seq_var[['SNP', 'LENGTH', 'SEQ_VAR']]
seq_var.to_csv(out_name, index = False)
|
Comments (0)
You can clone a snippet to your computer for local editing. Learn more.