Snippets

Brian Ward py3_KASPar_SNP_flanking_sequence_variant_formatting

Created by Brian Ward last modified
#!/usr/bin/env python
# -*- coding: utf-8 -*-

''' 
@author: Brian Ward

My first python script! I think I did okay - maybe not great

This script does something pretty specific - it takes in a fasta file with
flanking sequences surrounding a particular SNP. The SNP name is the ID of each
fasta record. It then reads in a dataframe containing the SNP alleles for each
SNP corresponding to a fasta record.

The script must find the intersection of SNPs between these two files, and
should give a warning if there are any mismatches. Some formatting of the SNP
names is required for both files (the script isn't especially portable in this
sense)

This is a little trickier than expected; my initial thought was to read a smallish
fasta file into a dict. However, dicts are inherently unordered in python, so
they aren't ideal when the order of the sequences matters. Therefore, I think
that reading the fasta into a dataframe is a good way to go

The user inputs the expected length of the flanking sequence. This is just to
ensure that each sequence is the expected length - which will be:
    
    (2 * flank_length) + 1
    
The presence of one or more SNPs without the expected length will not stop
execution - but the total lengths (read from the fasta) are placed in the output
.csv

The final output is a dataframe with a column for the SNP name, the LENGTH, 
and the sequence with SNP inserted ("SEQ_VAR"), in the form (for example):
    
    ATACG[G/A]TTGCA
'''

#### User-defined variables ####

## Location of input data in relation to home directory - needs leading '/' !
data_dir = '/Desktop/tagseq_test'

## fasta and SNPs .csv files; name of cols containing SNP names and
## alleles in the .csv file
fasta = 'TagSeq_201_Apr2017.fasta'
snps_tab = 'SS8412_table_SNP_POS.csv'
name_alleles = ['Name', 'VARIANT']

out_name = 'Neal_Carpenter_KASP_SNPs.csv'

## Expected length of flanking sequence on either side of SNP
flank_len = 100


#### Executable ####

import os
import sys
from Bio.SeqIO.FastaIO import SimpleFastaParser
import pandas as pd

## Set working directory
os.chdir(os.path.normpath(os.path.expanduser('~') + data_dir))

## Calculate expected total seq length
tot_len = 2 * flank_len + 1

## Parse fasta file into lists for IDs, sequences, sequence lengths;
## exit if a sequence with unexpected length is present
with open(fasta) as fasta_file:
    ids = []
    seqs = []
    lengths = []
    for title, sequence in SimpleFastaParser(fasta_file):
        if len(sequence) == tot_len:
            ids.append(title)
            seqs.append(sequence)
            lengths.append(len(sequence))
        else:
            print('Error: sequence {} had an unexpected length ({}bp)'.format(title, len(sequence)))
            sys.exit(1)

## Clean up ID strings
ids = [i.replace('chr', '') for i in ids]
ids = [i.replace(':101:201', '') for i in ids]

## Combine lists into dataframe
seq_var = pd.DataFrame({
    'SNP': ids,
    'LENGTH': lengths,
    'SEQ': seqs
})

## read in and format SNP info
snps = pd.read_csv(snps_tab, usecols = name_alleles)
snps.columns = ['SNP', 'ALLELES']
snps['SNP'] = snps['SNP'].str.replace('1_', '')

## merge the dataframes by SNP name
seq_var = pd.merge(seq_var, snps, how = 'inner', on = 'SNP')

## Insert SNP alleles into each sequence
## As an R user, always need to be careful with those string indices
seq_var['SEQ_VAR'] = (seq_var['SEQ'].str[:flank_len]
    + '['
    + seq_var['ALLELES']
    + ']'
    + seq_var['SEQ'].str[(flank_len + 1):])

## drop some columns, reorder remaining, then write out .csv
seq_var = seq_var.drop(['SEQ', 'ALLELES'], axis = 1)
seq_var = seq_var[['SNP', 'LENGTH', 'SEQ_VAR']]
seq_var.to_csv(out_name, index = False)

Comments (0)

HTTPS SSH

You can clone a snippet to your computer for local editing. Learn more.