Macs in Chemistry

Insanely great science

 

Interacting with the RCSB Protein Data Bank

The RCSB Protein Data Bank is an invaluable resource: an archive of information about the 3D shapes of proteins, nucleic acids, and complex assemblies that helps scientists understand all aspects of biomedicine and agriculture, from protein synthesis to health and disease. Currently the PDB contains over 134,000 data files with structural information on 42,547 distinct protein sequences, of which 37,600 are human sequences. The RCSB also provides a series of tools to search, view and analyse the data.

The RCSB PDB RESTful Web Service interface

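These web services provide programmatic access to the data. There are two types of service in the RESTful interface; the two used here are the search service, which returns the IDs matching a query, and the describeMol service, which returns a description of a given entry.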

Sometimes I have a list of Uniprot accession IDs and I want to find out whether there is any structural information in the PDB. I could search for each Uniprot ID individually using the PDB search tools, but with more than a couple to look up it is better to use a script. I use Vortex as a flexible desktop tool to search and store information from a variety of sources, and its scripting interface is very powerful. The PDB search web service exposes the RCSB PDB advanced search interface as an XML web service. To use this service, we need to POST an XML representation of an advanced search to:

http://www.rcsb.org/pdb/rest/search
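As a rough illustration of what "POST an XML representation" means outside Vortex, here is a minimal Python 3 sketch. The helper names build_query and build_request are my own, not part of any RCSB library, and the sketch assumes the service accepts the same query format used in the Vortex script further down. It only constructs the request; nothing is sent.

```python
import urllib.request
from xml.sax.saxutils import escape

SEARCH_URL = "http://www.rcsb.org/pdb/rest/search"

# Same query structure as the Vortex script in this article.
QUERY_TEMPLATE = """<orgPdbQuery>
<queryType>org.pdb.query.simple.UpAccessionIdQuery</queryType>
<description>Simple query for a list of UniprotKB Accession IDs: {acc}</description>
<accessionIdList>{acc}</accessionIdList>
</orgPdbQuery>"""


def build_query(accession):
    """Return the XML search query for one Uniprot accession ID (hypothetical helper)."""
    return QUERY_TEMPLATE.format(acc=escape(accession.strip()))


def build_request(accession, url=SEARCH_URL):
    """Wrap the query in a request object; supplying data makes it a POST."""
    body = build_query(accession).encode("utf-8")
    return urllib.request.Request(url, data=body)


request = build_request("P50225")
print(request.get_method())
print(request.data.decode("utf-8"))
# To actually run the search: urllib.request.urlopen(request).read()
```

Escaping the accession before inserting it into the XML is a small safety measure in case a table cell contains characters such as & or <.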

We need a list of uniprot codes.

P50225
Q70CQ3
A0A024QYR8
P00533
A0A023T6R1
P00519

First read the Uniprot codes into Vortex.

[Image: the Uniprot codes loaded into a Vortex table]

The script first opens a dialog box asking the user to select the column that contains the Uniprot IDs. It then creates two new columns: one to hold the number of PDB entity IDs associated with the Uniprot ID, the other to hold a list of all the PDB entity IDs that are returned. It is important to note that there is not a 1:1 correspondence between Uniprot and PDB IDs. A single Uniprot ID may be associated with multiple crystal structures; these might be the same structure at different resolutions, structures containing different ligands, or the protein without a ligand. In addition, a single PDB file can be associated with multiple Uniprot IDs if it contains several different protein chains.

The next part of the script works through the table row by row, reading the Uniprot ID, creating the XML search query and POSTing it to the web service. The number of entries in the returned string is counted and the two columns are filled in.
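The counting step can be sketched in plain Python 3 as follows. The sample reply is illustrative only and assumes the service returns one entity ID per line in the form PDBID:entity (the IDs are taken from the comment in the script below, the :1 suffixes are assumed). The function name parse_search_result is hypothetical.

```python
def parse_search_result(text):
    """Split the plain-text reply into a list of PDB entity IDs, skipping blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


# Illustrative reply, assumed format: one "PDBID:entity" per line.
sample_reply = "1LS6:1\n1Z28:1\n2D06:1\n"

entity_ids = parse_search_result(sample_reply)
num_pdb = len(entity_ids)          # value for the 'Num PDB' column
pdb_list = ",".join(entity_ids)    # value for the 'PDB' column

print(num_pdb)
print(pdb_list)
```

The Vortex script counts the ':' characters instead; under the assumed format both approaches give the same number, but counting non-empty lines also copes with an empty reply.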

The result is the table shown below. This is a nice summary but not that useful if you want to do further analysis.

[Image: the Vortex table with the Num PDB and PDB columns filled in]

The next part of the script pivots the results to a single PDB entity ID per row, as shown. The advantage of this format is we can now search and store information related to an individual PDB entity.

[Image: the results pivoted to one PDB entity ID per row]
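The pivot itself is a simple nested loop. A minimal Python 3 sketch, assuming the search results have been collected into a dictionary keyed by Uniprot ID (the pivot function and the dictionary layout are my own illustration, and the entity suffixes for P50225 are assumed):

```python
def pivot(mapping):
    """Turn {uniprot_id: [entity_id, ...]} into one [uniprot_id, entity_id] row per entity."""
    rows = []
    for uniprot_id, entity_ids in mapping.items():
        for entity_id in entity_ids:
            rows.append([uniprot_id, entity_id])
    return rows


# Illustrative input; IDs taken from elsewhere in this article.
mapping = {
    "P50225": ["1LS6:1", "1Z28:1", "2D06:1"],
    "P69905": ["4HHB:1"],
}

for row in pivot(mapping):
    print(", ".join(row))
```

A Uniprot ID with an empty list simply produces no rows, so IDs without structures drop out of the pivoted table.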

The UniprotPDBmapping Vortex Script

#Use Uniprot accession id to find PDB structures

# Python imports
import urllib2
import urllib

# Vortex imports
import com.dotmatics.vortex.util.Util as Util
import com.dotmatics.vortex.mol2img.jni.genImage as genImage
import com.dotmatics.vortex.mol2img.Mol2Img as mol2Img
import jarray
import binascii
import string
import os


input_label = swing.JLabel("Uniprot column (for input)")
input_cb = workspace.getColumnComboBox()
panel = swing.JPanel()

layout.fill(panel, input_label, 0, 0)
layout.fill(panel, input_cb,    1, 0)

ret = vortex.showInDialog(panel, "Choose Uniprot column")

if ret == vortex.OK:
    input_idx = input_cb.getSelectedIndex()

    if input_idx == 0:
        vortex.alert("you must choose a column")
    else:
        col = vtable.getColumn(input_idx - 1)

        url = 'http://www.rcsb.org/pdb/rest/search'

        colpdbNo = vtable.findColumnWithName('Num PDB', 0) # Number of PDB structures
        colpdbid = vtable.findColumnWithName('PDB', 1) # csv list of PDB id

        rowdata = []
        rows = vtable.getRealRowCount()
        for r in range(0, int(rows)):
            uniprotid = col.getValueAsString(r)



            queryText = """
            <orgPdbQuery>

            <queryType>org.pdb.query.simple.UpAccessionIdQuery</queryType>

            <description>Simple query for a list of UniprotKB Accession IDs: P50225</description>

            <accessionIdList>%s</accessionIdList>

            </orgPdbQuery>
            """ % uniprotid

            req = urllib2.Request(url, data=queryText)
            f = urllib2.urlopen(req)
            result = f.read()
            f.close()
            Nos = result.count(':')
            nos = str(Nos)  #convert number to string
            colpdbNo = vtable.findColumnWithName('Num PDB', 1)
            colpdbNo.setValueFromString(r, nos)

            #newresult = string.replace(result, '\n', ',') # convert to csv not used
            colpdbid = vtable.findColumnWithName('PDB', 1)
            colpdbid.setValueFromString(r, result)

            #Need to convert P50225 1LS6,1Z28,2D06,3QVU,3QVV,3U3J,3U3K,3U3M,3U3O,3U3R,4GRA to
            #P50225, 1LS6
            #P50225, 1Z28
            #P50225, 2D06
            my_list = result.strip().split("\n") # read csv into list
            for x in range(0, len(my_list)):
                row = [uniprotid] + [my_list[x]]
                rowdata.append(row)

vtable.fireTableStructureChanged()

#create new workspace in Vortex

column_names = ['Uniprotid', 'PDB']

TableName = "PDB Mapping"

arrayToWorkspace(rowdata, column_names, TableName)

Getting More Information from PDB

With a table containing PDB entity IDs we can now mine the PDB for more information. The describeMol web service takes a simple string as input

http://www.rcsb.org/pdb/rest/describeMol?structureId=4hhb

and returns a description of the entry in XML format, detailing each entity in the PDB file.

<molDescription>
  <structureId id="4HHB">
    <polymer entityNr="1" length="141" type="protein" weight="15150.4">
      <chain id="A"/>
      <chain id="C"/>
      <Taxonomy name="Homo sapiens" id="9606"/>
      <macroMolecule name="Hemoglobin subunit alpha">
        <accession id="P69905"/>
      </macroMolecule>
      <polymerDescription description="HEMOGLOBIN (DEOXY) (ALPHA CHAIN)"/>
    </polymer>
    <polymer entityNr="2" length="146" type="protein" weight="15890.2">
      <chain id="B"/>
      <chain id="D"/>
      <Taxonomy name="Homo sapiens" id="9606"/>
      <macroMolecule name="Hemoglobin subunit beta">
        <accession id="P68871"/>
      </macroMolecule>
      <polymerDescription description="HEMOGLOBIN (DEOXY) (BETA CHAIN)"/>
    </polymer>
  </structureId>
</molDescription>
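The request URL is easy to build from a PDB entity ID. A small Python 3 sketch, assuming entity IDs look like 4HHB:1 and that only the part before the colon is wanted (the function name describe_mol_url is hypothetical):

```python
from urllib.parse import urlencode

DESCRIBE_URL = "http://www.rcsb.org/pdb/rest/describeMol"


def describe_mol_url(entity_id):
    """Build the describeMol URL from an entity ID such as '4HHB:1' or a bare '4hhb'."""
    structure_id = entity_id.strip().split(":")[0]  # drop the ':1' entity suffix
    return DESCRIBE_URL + "?" + urlencode({"structureId": structure_id})


print(describe_mol_url("4hhb"))
print(describe_mol_url("4HHB:1"))
```

Splitting on the colon rather than taking the first four characters is a design choice here; it behaves the same for standard four-character PDB IDs.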

The Vortex script first asks the user to select the PDB column, then for each row in the table generates the query string and runs the query. The returned XML is then parsed to extract some of the data, and the appropriate columns are created for each entity in the returned XML. As can be seen in the image below, some entries contain a single protein, others contain multiple proteins. The script pulls out the name of the entity, its type (e.g. protein), the number of amino acids and the Uniprot ID, which could be different from the original query if the PDB entry contains multiple proteins.

[Image: the Vortex table with name, type, number of amino acids and Uniprot ID columns for each entity]

1Z7Q is the crystal structure of the 20S proteasome from yeast in complex with the proteasome activator PA26 from Trypanosoma brucei at 3.2 angstroms resolution; it contains 15 different protein chains. It can be viewed here: http://www.rcsb.org/pdb/ngl/ngl.do?pdbid=1Z7Q&bionumber=1
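The parsing step can be tried outside Vortex with nothing more than the standard library. The sketch below is a minimal Python 3 version run against a trimmed copy of the 4HHB reply shown earlier; the function name describe_polymers and the dictionary keys are my own choices, not part of the Vortex script.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the describeMol reply for 4HHB shown in this article.
SAMPLE_XML = """<molDescription>
  <structureId id="4HHB">
    <polymer entityNr="1" length="141" type="protein" weight="15150.4">
      <chain id="A"/>
      <chain id="C"/>
      <macroMolecule name="Hemoglobin subunit alpha">
        <accession id="P69905"/>
      </macroMolecule>
    </polymer>
    <polymer entityNr="2" length="146" type="protein" weight="15890.2">
      <chain id="B"/>
      <chain id="D"/>
      <macroMolecule name="Hemoglobin subunit beta">
        <accession id="P68871"/>
      </macroMolecule>
    </polymer>
  </structureId>
</molDescription>"""


def describe_polymers(xml_text):
    """Return one dict per <polymer> with its name, type, length and Uniprot ID."""
    root = ET.fromstring(xml_text)
    records = []
    for polymer in root.findall(".//polymer"):
        macro = polymer.find("macroMolecule")
        accession = macro.find("accession") if macro is not None else None
        records.append({
            "name": macro.get("name") if macro is not None else None,
            "type": polymer.get("type"),
            "length": polymer.get("length"),
            "uniprot": accession.get("id") if accession is not None else None,
        })
    return records


for number, record in enumerate(describe_polymers(SAMPLE_XML), start=1):
    print(number, record["name"], record["type"], record["length"], record["uniprot"])
```

Checking for missing elements explicitly, rather than wrapping each lookup in try/except as the Vortex script does, makes it easier to see which field was absent for a given entity.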

The PDBinfo Vortex script

#Use PDB id to find more PDB info


#http://www.rcsb.org/pdb/rest/describeMol?structureId=4hhb

# Python imports
import urllib2
import urllib
import xml.etree.ElementTree as etree

# Vortex imports
import com.dotmatics.vortex.util.Util as Util
import com.dotmatics.vortex.mol2img.jni.genImage as genImage
import com.dotmatics.vortex.mol2img.Mol2Img as mol2Img
import jarray
import binascii
import string
import os

input_label = swing.JLabel("PDB column (for input)")
input_cb = workspace.getColumnComboBox()
panel = swing.JPanel()

layout.fill(panel, input_label, 0, 0)
layout.fill(panel, input_cb,    1, 0)
# Get column containing PDB id
ret = vortex.showInDialog(panel, "Choose PDB column")

if ret == vortex.OK:
    input_idx = input_cb.getSelectedIndex()

    if input_idx == 0:  
        vortex.alert("you must choose a column")
    else:
        col = vtable.getColumn(input_idx - 1)
        #Format of query url 
        #http://www.rcsb.org/pdb/rest/describeMol?structureId=4hhb

        rows = vtable.getRealRowCount()
        for r in range(0, int(rows)):
            pdbid = col.getValueAsString(r)
            if ":" in pdbid: #only search if pdb present
                pdbid = str(pdbid)[:4] #convert to string and remove :1
                mystr = "http://www.rcsb.org/pdb/rest/describeMol?structureId=" + pdbid

                f = urllib2.urlopen(mystr)
                myreturn = f.read()
                f.close()
                # You may want to do more detailed error checking here
                tree = etree.fromstring(myreturn)
                for i, polymer in enumerate(tree.findall('.//polymer')):
                    try:
                        col_name = vtable.findColumnWithName('Name %s' % (i + 1), 1)
                        node = polymer.find('macroMolecule')
                        col_name.setValueFromString(r, node.get('name'))
                    except:
                        pass
                    try:
                        col_id = vtable.findColumnWithName('Type %s' % (i + 1), 1)
                        col_id.setValueFromString(r, polymer.get('type'))
                    except:
                        pass    
                    try:
                        col_length = vtable.findColumnWithName('Num AA %s' % (i + 1), 1)
                        col_length.setValueFromString(r, polymer.get('length'))
                    except:
                        pass
                    try:
                        col_name = vtable.findColumnWithName('UniprotID %s' % (i + 1), 1)
                        node = polymer.find('macroMolecule')
                        subnode = node.find('accession')
                        col_name.setValueFromString(r, subnode.get('id'))
                    except:
                        pass

The scripts can be downloaded here

https://macinchem.org/reviews/vortexscripts/UniprotPDBmapping.vpy.zip
https://macinchem.org/reviews/vortexscripts/PDBinfo.vpy.zip

Update

If you also want to download the PDB files, there are a few scripting options described in Downloading PDB.

Last Updated 9 November 2017