How to extract PDB file

Issue #223 resolved

Ashfaq Ahmad created an issue 2015-04-24

Hi guys, My question is is from a tutorial "Beginning with trajectory analysis" and i am dealing with MD trajectory .dcd file. Once I plot all my conformers on PC planes, the next step is to cluster them in to 2 groups using

hc <- hclust(dist(pc$z[,1:2]))
grps <- cutree(hc, k=2)

Here I got two groups with different colors, now I want to extract a PDB file from group 1 and group 2 individually. Is that possible? Sorry if my question is not that good.

Comments (11)

Lars Skjærven
- edited description
- 2015-04-24T18:29:25+00:00
Lars Skjærven
Your grps variable is a vector of the same length as the trajectory containing 1's and 2' assigning each frame to either group 1 or 2. All indices for frames in group 1 can be found with inds <- which(grps == 1). Use this to get the coordinates from your trajectory with xyz[inds[1], ]. You can then write these to a pdb object:
```
inds <- which(grps == 1)
crd <- xyz[inds[1], ]
write.pdb(xyz=crd, file="test.pdb")
```
Also note that write.pdb takes plenty of input arguments, e.g. to assign residue names and numbers. See ?write.pdb for more information. Hope it helps !
- 2015-04-24T18:39:59+00:00
Lars Skjærven
- changed component to Q&A
- changed version to v2.1
- marked as minor
- 2015-04-24T18:41:29+00:00
Ashfaq Ahmad reporter
Thanks @Lars, I tried it and got a big pdb file, probably containing the Ca-atoms for all the conformers from the grp 1. I opened it in VMD but looks wired with a scattered atoms. Yup i read write.pdb that require too many argument to be accurate. I was looking to obtain a single pdb file to represent the whole cluster.
- 2015-04-24T19:56:38+00:00
Barry Grant
You will be writing C-alpha only atoms here perhaps so try using the "Tube" representation in VMD for example.

If you want a singe output structure that in some way represents a given cluster you could determine the closest to the mean conformation for that cluster. The functions you might consider using for this include mean() via the apply() function, rmsd(), which.min() and of course write.pdb().

Let us know if you would like help with this - but I do encourage you to try it yourself also ;-)
- 2015-04-24T20:54:40+00:00
Ashfaq Ahmad reporter
@bjgrant Sir, I have tried the above options but couldn't implement them well. I am working with bio3d since last one year, during this time i was only dependent on straight forward commands mentioned in tutorials. Although they really worked out well for our two separate data enough good to be submitted in to Nature and Plos quality journals. Being a pure biological structure candidate at times I feel almost helpless even to understand the mathematical languages and their applications for such programs. I will try my best to dig these functions you suggested.
- 2015-04-25T09:48:20+00:00

Lars Skjærven

Hi Asfaq, Learning scripting / programming might be a bit time consuming, but once you get the hold of it the opportunities in Bio3D certainly gets larger. There are plenty of online resources to learn a bit more about R programming. Try e.g. http://tryr.codeschool.com/ or the following book http://www.nostarch.com/artofr.htm

If my example above gave you multiple frames you did something wrong. Did you remember to access only the first index (inds[1]) ? Also, does your trajectory contain only CA atoms, or do you have all atoms in the xyz matrix? if so, use atom.select and trim.xyz to work only with the CA atoms.

Here is a complete running example on clustering to obtain representative structures:

# read all atom trajectory
xyz <- read.dcd("mydcdfile")

# read pdb file containing the same number (and ordering) as the dcd file
pdb <- read.pdb("mypdbfile")

ca.inds <- atom.select(pdb, "calpha")

# trim PDB and XYZ to CA atoms only
pdb <- trim(pdb, ca.inds)
xyz <- trim(xyz, col.inds=ca.inds)

# clustering on RMSD
rd <- rmsd(xyz, fit=TRUE)
hc <- hclust(as.dist(rd))
grps <- cutree(hc, k=3)

# find representatives
unq <- unique(grps)
all.inds <- seq(1, length(grps))
rep.inds <- rep(NA, length(unq))

for(i in 1:length(unq)) {
  grp <- unq[i]
  inds <- which(grps==grp)

  if(length(inds) > 1) {
    tmp <- rd[inds, inds]

    m <- rowMeans(tmp)
    rep.ind <- which.min(m)
    rep.ind <- all.inds[inds][rep.ind]
  }
 else {
    rep.ind <- inds[1]
 }

  rep.inds[i] <- rep.ind
}

# indices of cluser representatives
rep.inds

write.pdb(pdb, xyz=xyz[rep.inds[1], ], file="rep_clust1.pdb")
write.pdb(pdb, xyz=xyz[rep.inds[2], ], file="rep_clust2.pdb")
write.pdb(pdb, xyz=xyz[rep.inds[3], ], file="rep_clust3.pdb")

2015-04-25T10:30:27+00:00

Ashfaq Ahmad reporter

@Lars. I am trying to trim it but my bio3d package can not recognize this function. You can see below for details;

#!> pdb <- trim(pdb, ca.inds)
Error: could not find function "trim"
> xyz <- trim(xyz, col.inds=ca.inds)
Error: could not find function "trim"
> trim.pdb
function (pdb, inds = NULL, sse = TRUE) 
{
    if (class(pdb) != "pdb") 
        stop("input 'pdb' must be a list object as returned from 'read.pdb'")
    if (is.null(inds)) 
        stop("no selection indices provided")
    if (!is.list(inds)) 
        stop("selection indices must be provided i.e. from 'atom.select'")
    if (is.null(inds$atom) || is.null(inds$xyz)) 
        stop("selection indices must be provided i.e. from 'atom.select'")
    atom <- pdb$atom[inds$atom, ]
    xyz <- pdb$xyz[inds$xyz]
    calpha <- as.logical(atom[, "elety"] == "CA")
    if (!is.null(pdb$xyz.models)) 
        xyz.models <- pdb$xyz.models[, inds$xyz]
    else xyz.models <- NULL
    sse.unbound <- function(sse) {
        ex <- NULL
        ch <- NULL
        ty <- NULL
        for (i in 1:length(sse$start)) {
            tmp <- sse$start[i]:sse$end[i]
            ex <- c(ex, tmp)
            ch <- c(ch, rep(sse$chain[i], length(tmp)))
            ty <- c(ty, rep(sse[[4]][i], length(tmp)))
        }
        ub <- cbind(ex, ch, ty)
        ub <- cbind(ub, apply(ub, 1, function(x) paste(x[1:2], 
            collapse = "_")))
        colnames(ub) <- c("nums", "chains", "type", "strid")
        return(ub)
    }
    bounds.inds <- function(nums) {
        if (length(nums) == 1) {
            return(1)
        }
        bounds.inds <- 1
        nums.start <- nums[1]
        diff.i <- 1
        for (i in 2:length(nums)) {
            if ((nums[i] - diff.i) != nums.start) {
                bounds.inds <- c(bounds.inds, i)
                nums.start <- nums[i]
                diff.i <- 1
            }
            else {
                diff.i <- diff.i + 1
            }
        }
        return(bounds.inds)
    }
    helix <- NULL
    sheet <- NULL
    if (sse) {
        trimmed <- atom[calpha, c("resno", "chain")]
        if (!is.matrix(trimmed)) 
            trimmed <- t(as.matrix(trimmed))
        strid <- apply(trimmed, 1, function(x) paste(x, collapse = "_"))
        trimmed <- cbind(trimmed, strid)
        colnames(trimmed) <- c("nums", "chains", "strid")
        if (length(pdb$helix$start) > 0) {
            tmp <- sse.unbound(pdb$helix)
            ht <- tmp[tmp[, "strid"] %in% trimmed[, "strid"], 
                1:3]
            new.helix <- matrix(ht, ncol = 3, byrow = FALSE)
            colnames(new.helix) <- c("nums", "chains", "type")
            if (nrow(new.helix) > 0) {
                nums <- as.numeric(new.helix[, "nums"])
                bnds <- bounds(nums, dup.inds = FALSE, pre.sort = FALSE)
                tmp.inds <- bounds.inds(nums)
                chain <- new.helix[tmp.inds, "chains"]
                type <- new.helix[tmp.inds, "type"]
                bnds <- cbind(bnds, chain, type)
                helix$start = as.numeric(bnds[, "start"])
                helix$end = as.numeric(bnds[, "end"])
                helix$chain = as.vector(bnds[, "chain"])
                helix$type = as.vector(bnds[, "type"])
            }
        }
        if (length(pdb$sheet$start) > 0) {
            tmp <- sse.unbound(pdb$sheet)
            ht <- tmp[tmp[, "strid"] %in% trimmed[, "strid"], 
                1:3]
            new.sheet <- matrix(ht, ncol = 3, byrow = FALSE)
            colnames(new.sheet) <- c("nums", "chains", "type")
            if (nrow(new.sheet) > 0) {
                nums <- as.numeric(new.sheet[, "nums"])
                bnds <- bounds(nums, dup.inds = FALSE, pre.sort = FALSE)
                tmp.inds <- bounds.inds(nums)
                chain <- new.sheet[tmp.inds, "chains"]
                type <- new.sheet[tmp.inds, "type"]
                bnds <- cbind(bnds, chain, type)
                sheet$start = as.numeric(bnds[, "start"])
                sheet$end = as.numeric(bnds[, "end"])
                sheet$chain = as.vector(bnds[, "chain"])
                sheet$sense = as.vector(bnds[, "type"])
            }
        }
    }
    output <- list(atom = atom, het = pdb$het, helix = helix, 
        sheet = sheet, seqres = pdb$seqres, xyz = xyz, xyz.models = xyz.models, 
        calpha = calpha)
    class(output) <- "pdb"
    return(output)
}
<environment: namespace:bio3d>
> help(trim.pdb)
starting httpd help server ... done
> pdb <- trim.pdb(pdb, ca.inds)
> xyz <- trim(xyz, col.inds=ca.inds)
Error: could not find function "trim"
> xyz <- trim.xyz(xyz, col.inds=ca.inds)
Error: could not find function "trim.xyz"
> xyz <- trim.pdb1(xyz, col.inds=ca.inds)
Error: could not find function "trim.pdb1"
> xyz <- trim.pdb(xyz, col.inds=ca.inds)
Error in trim.pdb(xyz, col.inds = ca.inds) : 
  unused argument (col.inds = ca.inds)
> trim(pdb$xyz, col.inds=sele$xyz)
Error: could not find function "trim"

2015-04-25T13:08:14+00:00

Lars Skjærven
Update your Bio3D version to 2.2. Looks like you have 2.1 or 2.0.
- 2015-04-25T13:11:55+00:00
Ashfaq Ahmad reporter
Got it. Thanks for your kind support and time. For sure I will look to those links you provided me to learn some basics of programming. Thank you all
- 2015-04-25T13:37:51+00:00
Lars Skjærven
- changed status to resolved
- 2015-06-16T12:55:50+00:00
Log in to comment

Assignee: Barry Grant

Type: task

Priority: minor

Status: resolved

Component: Q&A

Version: v2.1

Votes: 0

Watchers: 3