ppx + BioServices

The BioServices Python package provides access to a number of bioinformatics services. In their own words:

“BioServices is a Python package that provides access to many Bioinformatics Web Services (e.g., UniProt) and a framework to easily implement Web Service wrappers (based on WSDL/SOAP or REST protocols).

The primary goal of BioServices is to use Python as a glue language to provide a programmatic access to Biological Web Services. By doing so, elaboration of new applications that combine several Web Services should be fostered.”

BioServices provides access to the PRoteomics IDEntifications (PRIDE) Archive [1] API, which makes ppx + BioServices a powerful combination: BioServices finds the datasets of interest and ppx retrieves their files.

A Simple Example

To illustrate the power of ppx + BioServices, we’ll find all of the PRIDE datasets related to honey bees (Apis mellifera) with runs from Q-Exactive instruments, retrieve a list of files associated with each dataset, and set up to download all of the mass spectrometry data files.

Note

To proceed with this example, the BioServices Python package will need to be installed. See the BioServices package website for details on its installation and usage.

First, we need to import the ppx and BioServices packages. The PRIDE module in the BioServices package will allow us to find datasets that match our query:
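A minimal sketch of the import step, assuming both packages are installed under the names ppx and bioservices. The missing_packages() helper is hypothetical and only reports what still needs installing before the imports are attempted:

```python
from importlib.util import find_spec


def missing_packages(names=("ppx", "bioservices")):
    """Return the names from `names` that cannot be imported in this environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages()
if missing:
    print("Install these packages first:", ", ".join(missing))
else:
    import ppx
    from bioservices import PRIDE
```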

Next, we retrieve the PRIDE projects, and with them the ProteomeXchange identifiers, that match our criteria:
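The query might look like the sketch below. The find_projects() wrapper is a hypothetical helper, and the query, instrumentFilter, and show keyword arguments are assumptions about the get_project_list() signature in the installed BioServices version, so check them against its documentation:

```python
def find_projects(pride, query="Apis mellifera", instrument="Q Exactive", show=100):
    """Ask a PRIDE client for projects matching a species query and an instrument."""
    # Assumption: get_project_list() accepts these keyword arguments and
    # returns a list with one dictionary per project.
    return pride.get_project_list(
        query=query,
        instrumentFilter=instrument,
        show=show,
    )


# Usage (needs network access and an installed bioservices):
#   from bioservices import PRIDE
#   datasets = find_projects(PRIDE())
```

Passing the client in as an argument keeps the helper easy to try out with a stand-in object before querying PRIDE itself.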

Let’s see how many datasets we found:

Note that get_project_list() supports a number of additional filters and returns several fields about each dataset. For example, look at the first element of datasets:
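As an illustration of counting the results and inspecting one record, the sketch below uses a made-up stand-in for datasets. The accessions and the "title" field are placeholders, not real query results, and real records carry more fields:

```python
# Hypothetical stand-in for the list returned by get_project_list().
datasets = [
    {"accession": "PXD0000001", "title": "Example honey bee project A"},
    {"accession": "PXD0000002", "title": "Example honey bee project B"},
]


def summarize(projects):
    """Return the number of projects and the sorted field names of the first one."""
    fields = sorted(projects[0]) if projects else []
    return len(projects), fields


count, fields = summarize(datasets)
print(f"{count} datasets found; the first record has fields: {fields}")
print(datasets[0])
```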

Now we can use the "accession" keys to create a list of PXDataset objects:
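One way to sketch this step is a small helper that maps each record's "accession" value through a constructor. The make_datasets() name is hypothetical, and we assume ppx.PXDataset accepts the accession string as its only required argument:

```python
def make_datasets(projects, factory):
    """Build one dataset object per project from its "accession" value."""
    return [factory(project["accession"]) for project in projects]


# Usage (needs ppx installed; `datasets` is the list returned by the PRIDE query):
#   import ppx
#   pxd = make_datasets(datasets, ppx.PXDataset)
```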

With the PXDataset objects created, we can easily list the files to see which ones we might want to download. In this case, we’ll print the first 5 from each:
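A sketch of the preview, assuming each dataset object exposes a list_files() method that returns a list of file names. That method name is an assumption about the installed ppx version, and preview_files() is a hypothetical helper:

```python
def preview_files(dataset, n=5):
    """Return the first `n` file names reported by a dataset object."""
    # Assumption: list_files() returns a list of file-name strings.
    return dataset.list_files()[:n]


# Usage (with `pxd` being the list of PXDataset objects):
#   for dataset in pxd:
#       print(preview_files(dataset))
```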

Note that we also could have used bioservices.PRIDE.get_file_list() to retrieve a file list. Either way, we’ll download all of the Thermo *.raw files for each dataset. In this case, we could do:
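The selection could be sketched as follows. The helper names are hypothetical, and list_files() and download() are assumed to be the PXDataset methods for listing and fetching files, so verify them against the installed ppx version. The download call is left commented out, so the function only prints the file names:

```python
def select_raw(files):
    """Keep the file names ending in .raw, ignoring case."""
    return [name for name in files if name.lower().endswith(".raw")]


def list_raw_downloads(dataset):
    """Print and return the Thermo raw files of one dataset without downloading them."""
    raw_files = select_raw(dataset.list_files())
    print(raw_files)
    # Assumption: download() accepts a list of file names.
    # dataset.download(raw_files)
    return raw_files


# Usage (with `pxd` being the list of PXDataset objects):
#   for dataset in pxd:
#       list_raw_downloads(dataset)
```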

Caution

You probably don’t actually want to do this, since it would download a lot of large files. I’ve commented out the download command so that it only prints the file names; however, you can uncomment and run it if you so desire.

Alternatively, we could just download all of the README files (this download is much smaller):
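A sketch of the README variant, under the same assumptions about list_files() and download(). Matching any file name that contains "readme" is our own heuristic, not a PRIDE naming rule, and both helpers are hypothetical:

```python
def select_readmes(files):
    """Keep the file names that contain "readme", ignoring case."""
    return [name for name in files if "readme" in name.lower()]


def download_readmes(dataset):
    """Download the README files of one dataset and return their names."""
    readmes = select_readmes(dataset.list_files())
    if readmes:
        # Assumption: download() accepts a list of file names.
        dataset.download(readmes)
    return readmes


# Usage (with `pxd` being the list of PXDataset objects):
#   for dataset in pxd:
#       print(download_readmes(dataset))
```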