Basic Linux Commands

These commands form the foundation for working on the command line. Mastering them is essential for navigating, viewing, and managing files on any Linux-based system, including high-performance clusters.

Navigating the Filesystem

Commands to help you move around and understand your location within the directory structure.

| Command | Usage | Example | Description |
| --- | --- | --- | --- |
| pwd | pwd | pwd | Print Working Directory. Shows your current location. |
| ls | ls [options] [directory] | ls -lh /home/user | List directory contents. Use -l for details and -h for human-readable sizes. |
| cd | cd [directory] | cd Documents/ | Change Directory. cd .. moves up one level. cd alone goes home. |

File & Directory Management

Commands for creating, renaming, copying, moving, and deleting files and folders.

| Command | Usage | Example | Description |
| --- | --- | --- | --- |
| touch | touch [filename] | touch report.txt | Creates a new, empty file. |
| mkdir | mkdir [directory_name] | mkdir simulations | Make a new directory. |
| cp | cp [source] [destination] | cp script.py scripts_backup/ | Copy a file. Use the -r option for directories. |
| mv | mv [source] [destination] | mv data.txt results/ | Move or rename a file or directory. |
| rm | rm [filename] | rm temp_output.log | Remove (delete) a file. Deletion is permanent, so use with caution. |
| rmdir | rmdir [directory_name] | rmdir old_folder | Remove an empty directory. |
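
As a quick illustration, the commands in this table can be chained into a short session. This is only a sketch: it works inside a throwaway directory created with mktemp, and the file and directory names are placeholders.

```shell
# Work in a scratch directory so nothing important is touched.
workdir=$(mktemp -d)
cd "$workdir"

mkdir simulations results        # create two directories
touch report.txt                 # create an empty file
cp report.txt simulations/       # copy the file into a directory
mv report.txt results/final.txt  # move and rename in one step
rm simulations/report.txt        # delete the copy (there is no undo)
rmdir simulations                # succeeds because the directory is now empty

ls results                       # prints: final.txt
```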

Viewing & Reading Files

Commands to inspect the contents of files without opening a full editor.

| Command | Usage | Example | Description |
| --- | --- | --- | --- |
| cat | cat [filename] | cat results.txt | Displays the entire content of a file on the screen. |
| less | less [filename] | less large_log_file.log | Views large files page by page. Press 'q' to quit. |
| head | head [options] [filename] | head -n 20 data.csv | Displays the beginning (head) of a file. Default is 10 lines. |
| tail | tail [options] [filename] | tail -f output.log | Displays the end (tail) of a file. The -f option follows the file as it grows. |
| wc | wc [options] [filename] | wc -l sequences.fasta | Word Count. Counts lines (-l), words (-w), bytes (-c), and characters (-m). |
| file | file [filename] | file my_script.sh | Determines the type of a file (e.g., text, executable, image). |
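
The following sketch shows head, tail, and wc on a generated file. The file name data.txt and its contents are made up for the example.

```shell
workdir=$(mktemp -d)
# Build a 100-line sample file: "line 1" ... "line 100".
seq 1 100 | sed 's/^/line /' > "$workdir/data.txt"

first=$(head -n 3 "$workdir/data.txt")               # first three lines
last=$(tail -n 1 "$workdir/data.txt")                # last line only
lines=$(wc -l < "$workdir/data.txt" | tr -d ' ')     # line count, padding removed

echo "$first"
echo "$last"      # prints: line 100
echo "$lines"     # prints: 100
```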

Getting Help

Commands to learn more about other commands.

| Command | Usage | Example | Description |
| --- | --- | --- | --- |
| man | man [command] | man cp | Displays the manual page for a command, showing all its options and usage details. |
| --help | [command] --help | ls --help | Most commands have a help flag that prints a quick summary of options. |

Advanced Linux Commands

Powerful tools for searching, manipulating text, managing processes, handling permissions, and automating tasks with shell features.

Searching & Text Processing

These commands allow you to find information and process text data with precision.

| Command | Usage | Example | Description |
| --- | --- | --- | --- |
| grep | grep [options] [pattern] [file] | grep -i 'error' app.log | Searches for a pattern in a file. -i for case-insensitive, -r for recursive. |
| find | find [path] [expression] | find . -name "*.pdb" | Searches for files and directories based on name, size, modification time, etc. |
| awk | awk '{program}' [file] | awk '{print $1, $3}' data.txt | A powerful pattern-scanning and text-processing language, excellent for column-based data. |
| sed | sed 's/old/new/g' [file] | sed 's/ATOM/HETATM/g' protein.pdb | Stream EDitor for filtering and transforming text. |
| sort | sort [options] [file] | sort -k2 -n data.txt | Sorts lines of text files. -n for numeric sort, -k to specify a column. |
| uniq | uniq [options] [file] | sort names.txt \| uniq -c | Reports or omits repeated adjacent lines, which is why it is usually run after sort. -c counts occurrences. |
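
These tools are most useful in combination. The sketch below runs them on a small made-up log file (app.log with invented entries) to count error lines per component and to preview a sed substitution.

```shell
workdir=$(mktemp -d)
# A small made-up log: timestamp, level, component.
cat > "$workdir/app.log" <<'EOF'
10:01 ERROR disk
10:02 INFO  net
10:03 error disk
10:04 ERROR net
10:05 INFO  disk
EOF

# Case-insensitive search, keep the third column, then count each distinct value
# and list the most frequent first.
summary=$(grep -i 'error' "$workdir/app.log" | awk '{print $3}' | sort | uniq -c | sort -k1,1nr)
echo "$summary"

# Replace every INFO with DEBUG on the way to standard output (the file itself is not modified).
sed 's/INFO/DEBUG/g' "$workdir/app.log"
```

Here "disk" is reported with a count of 2 and "net" with a count of 1; the exact column padding of uniq -c varies between systems.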

Process Management

Commands for monitoring and controlling running programs and services.

| Command | Usage | Example | Description |
| --- | --- | --- | --- |
| ps | ps [options] | ps aux | Reports a snapshot of the current processes. |
| top | top | top | Displays a real-time, dynamic view of system processes. |
| htop | htop | htop | An interactive and more user-friendly version of top. |
| kill | kill [PID] | kill 12345 | Sends a signal (SIGTERM by default) to terminate a process by its Process ID (PID). |
| killall | killall [process_name] | killall vmd | Kills all processes matching a given name. |
| jobs | jobs | jobs | Lists the background and stopped jobs of the current shell. |
| bg / fg | bg %1 / fg %1 | fg %1 | Sends a job to the background or brings it to the foreground. |

Permissions & Ownership

Commands to control who can read, write, and execute files and directories.

| Command | Usage | Example | Description |
| --- | --- | --- | --- |
| chmod | chmod [permissions] [file] | chmod +x script.sh | Change mode. Modifies the permissions of a file. +x makes it executable. |
| chown | chown [user]:[group] [file] | chown newuser:staff data.txt | Change owner. Changes the user and group ownership of a file. |
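
To see what a permission change does, compare the mode string printed by ls -l before and after. This sketch uses chmod u+x (the owner-only form of +x) so the result does not depend on the umask; script.sh is a placeholder file.

```shell
workdir=$(mktemp -d)
printf '#!/bin/sh\necho "hello"\n' > "$workdir/script.sh"

chmod 644 "$workdir/script.sh"                           # rw for owner, read-only for others
perm_before=$(ls -l "$workdir/script.sh" | cut -c1-10)

chmod u+x "$workdir/script.sh"                           # add execute permission for the owner
perm_after=$(ls -l "$workdir/script.sh" | cut -c1-10)

echo "before: $perm_before"    # prints: before: -rw-r--r--
echo "after:  $perm_after"     # prints: after:  -rwxr--r--
```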

Archiving & Compression

Tools for bundling multiple files into an archive and for compressing data to save space.

| Command | Usage | Example | Description |
| --- | --- | --- | --- |
| tar | tar [options] [archive.tar] [files] | tar -czvf archive.tar.gz results/ | Tape ARchiver. Bundles files. -c create, -x extract, -z gzip, -v verbose, -f file. |
| zip / unzip | zip archive.zip files | unzip data.zip | Creates or extracts .zip archives, common for Windows compatibility. |
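
A round trip is a simple way to get comfortable with the tar options. The sketch below archives a made-up results/ directory, lists the archive, and extracts it into a separate directory for comparison.

```shell
workdir=$(mktemp -d)
cd "$workdir"
mkdir results
echo "energy -1234.5" > results/run1.txt
echo "energy -1240.2" > results/run2.txt

tar -czf archive.tar.gz results/      # create a gzip-compressed archive
tar -tzf archive.tar.gz               # list the contents without extracting

mkdir restore
tar -xzf archive.tar.gz -C restore    # extract into a different directory
diff -r results restore/results && echo "round trip OK"
```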

Shell Features & Scripting

Powerful concepts for combining commands and automating workflows.

| Operator | Usage | Example | Description |
| --- | --- | --- | --- |
| \| (Pipe) | command1 \| command2 | ls -l \| grep ".txt" | Sends the standard output of one command to the standard input of the next. |
| > (Redirect) | command > file | ls > file_list.txt | Redirects output to a file, overwriting the file if it exists. |
| >> (Append) | command >> file | echo "Done" >> log.txt | Appends output to the end of a file without overwriting. |
| && (AND) | cmd1 && cmd2 | ./configure && make | Runs the second command only if the first one succeeds (exit status 0). |
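
The operators combine naturally. In this sketch (placeholder file names, scratch directory), the list is written to a file whose name does not end in .txt, so the output file can never appear in its own listing.

```shell
workdir=$(mktemp -d)
cd "$workdir"
touch a.txt b.txt c.log

ls | grep '\.txt$' > txt_files.list    # pipe into grep, then overwrite txt_files.list
echo "Done" >> txt_files.list          # append one more line to the same file
cat txt_files.list

# && runs the second command only when the first exits with status 0.
mkdir out && echo "created out"
# The second mkdir fails because out already exists, so the && branch is skipped.
mkdir out 2>/dev/null && echo "created out again" || echo "mkdir failed, && branch skipped"
```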

MD Simulation with AMBER

A typical workflow for running a molecular dynamics simulation of a protein-ligand complex using the AmberTools suite. This guide assumes you have a clean protein PDB file (protein.pdb) and a ligand file (ligand.mol2).

Step 1: System Preparation with tleap

The first step is to prepare the system by generating topology (.prmtop) and coordinate (.inpcrd) files. This involves loading force fields, preparing the ligand with `antechamber`, solvating the complex in a water box, and adding counter-ions to neutralize it.

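# 1a. Prepare the ligand file to generate GAFF2 parameters and AM1-BCC charges.
# This creates a new mol2 file with correct atom types and a frcmod file with any missing parameters.
# The atom-type set (-at gaff2, -s gaff2) matches the leaprc.gaff2 file loaded in tleap below.
# For a charged ligand, also pass its net charge with -nc (e.g., -nc -1).
antechamber -i ligand.mol2 -fi mol2 -o ligand_gaff.mol2 -fo mol2 -c bcc -at gaff2 -s 2
parmchk2 -i ligand_gaff.mol2 -f mol2 -o ligand.frcmod -s gaff2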

# 1b. Create a tleap input file (e.g., tleap.in) to build the system.
# This script combines the protein and ligand, solvates them, and adds ions.
# Save the following lines into a file named "tleap.in":
source leaprc.protein.ff14SB
source leaprc.water.tip3p
source leaprc.gaff2
loadamberparams ligand.frcmod
LIG = loadmol2 ligand_gaff.mol2
protein = loadpdb protein.pdb
complex = combine { protein LIG }
solvatebox complex TIP3PBOX 10.0
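# Neutralize the system: only the ion type that offsets the net charge is added.
addions complex Na+ 0
addions complex Cl- 0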
saveamberparm complex complex.prmtop complex.inpcrd
quit

# 1c. Run tleap with the input file to generate the final system files.
tleap -f tleap.in

Step 2: Minimization, Heating & Equilibration

This multi-stage process prepares the solvated system for the production simulation. We first minimize the energy to remove bad steric contacts, then gradually heat the system to the target temperature (e.g., 300 K) while restraining the protein and ligand. Finally, we equilibrate the system's density under constant pressure.

# Stage 2a: Minimization Input (save as min.in)
Minimization of the system
&cntrl
imin=1, maxcyc=5000, ncyc=2500,
ntb=1, ntr=1, restraint_wt=10.0,
restraintmask=':1-286 & !@H=',  ! Restrain heavy atoms of residues 1-286 (adjust the range to your system)
cut=10.0,
/

# Stage 2b: Heating Input (save as heat.in)
Heating from 0K to 300K
&cntrl
imin=0, irest=0, ntx=1, nstlim=25000, dt=0.002,
ntc=2, ntf=2, ntt=3, gamma_ln=2.0, tempi=0.0, temp0=300.0,
ntb=1, ntp=0, ntpr=500, ntwx=500,
ntr=1, restraint_wt=10.0, restraintmask=':1-286 & !@H=',
cut=10.0,
/

# Stage 2c: Equilibration Input (save as equil.in)
NPT Equilibration
&cntrl
imin=0, irest=1, ntx=5, nstlim=50000, dt=0.002,
ntc=2, ntf=2, ntt=3, gamma_ln=2.0, temp0=300.0,
ntb=2, ntp=1, pres0=1.0, taup=2.0,
ntpr=500, ntwx=500,
ntr=0,  ! No restraints
cut=10.0,
/

# Stage 2d: Run the steps sequentially (using GPU version for speed)
pmemd.cuda -O -i min.in -o min.out -p complex.prmtop -c complex.inpcrd -r min.rst -ref complex.inpcrd
pmemd.cuda -O -i heat.in -o heat.out -p complex.prmtop -c min.rst -r heat.rst -x heat.nc -ref min.rst
pmemd.cuda -O -i equil.in -o equil.out -p complex.prmtop -c heat.rst -r equil.rst -x equil.nc

Step 3: Production MD

Run the main simulation for data collection. This step is typically the longest and generates the trajectory data needed for analysis.

# Production MD Input (save as prod.in)
500 ns Production MD
&cntrl
imin=0, irest=1, ntx=5,
nstlim=250000000, dt=0.002,  ! 500 ns = 250,000,000 steps * 2 fs/step
ntc=2, ntf=2,
ntt=3, gamma_ln=2.0, temp0=300.0,
ntb=2, ntp=1, pres0=1.0,
ntpr=5000, ntwx=5000, ntwr=10000,
cut=10.0,
/

# Run production from the equilibrated system
pmemd.cuda -O -i prod.in -o prod.out -p complex.prmtop -c equil.rst -r prod.rst -x prod.nc
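
The step count and the resulting trajectory size follow directly from the values in prod.in. A small sketch of that arithmetic (the variable names are only for illustration):

```shell
# Target length and settings taken from prod.in.
target_ns=500
dt_fs=2            # dt=0.002 ps is 2 fs
ntwx=5000          # a frame is written every ntwx steps

# 1 ns = 1,000,000 fs, so steps = target_ns * 1,000,000 / dt_fs.
nstlim=$(( target_ns * 1000000 / dt_fs ))
frames=$(( nstlim / ntwx ))
ps_per_frame=$(( ntwx * dt_fs / 1000 ))

echo "nstlim=$nstlim frames=$frames ps_per_frame=$ps_per_frame"
# prints: nstlim=250000000 frames=50000 ps_per_frame=10
```

Changing target_ns or ntwx in the sketch shows how quickly the number of stored frames grows, which is worth checking against available disk space before a long run.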

Step 4: Trajectory Analysis with cpptraj

Post-process the trajectory to check for system stability (RMSD) and identify flexible regions (RMSF).

# cpptraj input script (save as analysis.in)
# Load topology
parm complex.prmtop
# Load trajectory, processing every 10th frame to save time
trajin prod.nc 1 last 10
# Center and image molecules to handle periodic boundary conditions
autoimage
# Calculate RMSD of protein backbone relative to the first frame
rms backbone_rmsd :1-286@CA,C,N out rmsd_backbone.dat
# Calculate RMSF of C-alpha atoms, averaged over the trajectory
atomicfluct out rmsf_ca.dat @CA byres
# Calculate radius of gyration
radgyr out radgyr.dat
run
quit

# Run cpptraj
cpptraj -i analysis.in > analysis.log
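
Once the data files exist, summary statistics can be pulled out with awk. The sketch below uses made-up numbers in the two-column layout (frame index, value) with a '#' header line; the real rmsd_backbone.dat will hold your own values.

```shell
workdir=$(mktemp -d)
# Made-up RMSD values in angstroms, for illustration only.
cat > "$workdir/rmsd_backbone.dat" <<'EOF'
#Frame   backbone_rmsd
1        0.0000
2        1.2000
3        1.5000
4        1.3000
EOF

# Skip the header line, then accumulate the mean and maximum of column 2.
stats=$(awk '!/^#/ { sum += $2; n++; if ($2 > max) max = $2 }
             END   { printf "mean=%.2f max=%.2f n=%d", sum / n, max, n }' "$workdir/rmsd_backbone.dat")
echo "$stats"
# prints: mean=1.00 max=1.50 n=4
```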

Step 5: MM/GBSA & MM/PBSA Binding Energy Calculation

Calculate the binding free energy using the `MMPBSA.py` script. This involves first creating separate topology files for the complex, receptor, and ligand.

# 5a. Create dry topologies for the complex, receptor, and ligand from the solvated topology.
# -p is the solvated input (from tleap); -c, -r, and -l are outputs with the -s mask stripped.
# Replace 'LIG' with the 3-letter residue name of your ligand.
ante-MMPBSA.py -p complex.prmtop -c complex_dry.prmtop -r receptor.prmtop -l ligand.prmtop -s ":WAT,Na+,Cl-" -n ":LIG"

# 5b. MMPBSA.py input file (save as mmpbsa.in)
# Frames 1-500 of the trajectory are read, taking every 5th frame (100 frames analysed).
# igb=5 selects the GB-OBC II model; the &pb section enables the Poisson-Boltzmann calculation.
&general
endframe=500, interval=5,
keep_files=0,
/
&gb
igb=5, saltcon=0.150,
/
&pb
istrng=0.150,
/

# 5c. Run the calculation on the production trajectory
# -sp is the solvated topology that matches prod.nc; -cp, -rp, and -lp are the dry topologies from 5a.
MMPBSA.py -O -i mmpbsa.in -o FINAL_RESULTS_MMPBSA.dat -sp complex.prmtop -cp complex_dry.prmtop -rp receptor.prmtop -lp ligand.prmtop -y prod.nc
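
It is easy to misjudge how many frames a given startframe/endframe/interval combination selects. A short sketch of the count for the &general settings used here, assuming startframe keeps its default of 1:

```shell
startframe=1      # MMPBSA.py default when startframe is not set
endframe=500
interval=5

# Frames analysed: startframe, startframe+interval, ... up to endframe.
nframes=$(( (endframe - startframe) / interval + 1 ))
last=$(( startframe + (nframes - 1) * interval ))

echo "frames analysed: $nframes (last frame used: $last)"
# prints: frames analysed: 100 (last frame used: 496)
```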

MD Simulation with GROMACS

A typical workflow for running a molecular dynamics simulation of a protein-ligand complex using GROMACS. This guide assumes you have a clean protein PDB file (protein.pdb) and a ligand file (e.g., ligand.mol2). Ligand parameterization is a critical step and can be done using tools like CGenFF or ACPYPE to generate GROMACS-compatible topologies.

Step 1: System Preparation

Prepare the protein topology, generate ligand parameters (externally), and combine them into a single complex structure.

# 1a. Generate protein topology using a chosen force field (e.g., AMBER99SB-ILDN)
gmx pdb2gmx -f protein.pdb -o protein_processed.gro -water tip3p
# (Select a force field from the interactive menu)

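# 1b. Generate ligand topology (e.g., using ACPYPE for Amber parameters).
# This is a complex external step. Assume you now have 'ligand.itp' and 'ligand.gro'
# (ACPYPE writes its GROMACS files into a ligand.acpype/ directory; copy or rename them as needed).
# acpype -i ligand.mol2 -b ligand

# 1c. Combine the protein and ligand GRO files into complex.gro and include the ligand topology
# in the main topology file. Manually edit topol.top (written by pdb2gmx) to add, directly after
# the force-field #include line:
# ; Include ligand topology
# #include "ligand.itp"
# ... and add 'ligand 1' to the [ molecules ] section at the end. The name must match the
# [ moleculetype ] name defined in ligand.itp.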
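
The topology edit can also be scripted. The sketch below applies it to a small mock topology written for the example (the real file produced by pdb2gmx is much longer). It assumes the force-field include line contains "forcefield.itp" and that the ligand molecule type is named "ligand".

```shell
workdir=$(mktemp -d)
# Mock topology for illustration only.
cat > "$workdir/topol.top" <<'EOF'
; mock topology
#include "amber99sb-ildn.ff/forcefield.itp"

[ system ]
Protein in water

[ molecules ]
Protein_chain_A     1
EOF

# Insert the ligand include directly after the force-field include line.
awk '{ print }
     /forcefield\.itp/ { print ""; print "; Include ligand topology"; print "#include \"ligand.itp\"" }' \
    "$workdir/topol.top" > "$workdir/topol.new"

# Append the ligand to the [ molecules ] section, which is the last section of the file.
printf '%-20s%d\n' ligand 1 >> "$workdir/topol.new"
mv "$workdir/topol.new" "$workdir/topol.top"

cat "$workdir/topol.top"
```

The order of entries in [ molecules ] must match the order of molecules in the combined coordinate file, so the ligand coordinates should follow the protein in complex.gro.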

Step 2: Create Box, Solvate, and Add Ions

Define a simulation box around the complex, fill it with water molecules, and add ions to neutralize the system and mimic physiological salt concentration.

# 2a. Define the simulation box (e.g., a cubic box with 1.0 nm distance from solute)
gmx editconf -f complex.gro -o complex_newbox.gro -c -d 1.0 -bt cubic

# 2b. Solvate the box with water
gmx solvate -cp complex_newbox.gro -cs spc216.gro -o complex_solv.gro -p topol.top

# 2c. Add ions. First, create a .tpr file for genion.
gmx grompp -f ions.mdp -c complex_solv.gro -p topol.top -o ions.tpr

# 2d. Run genion to replace water molecules with ions.
# -neutral neutralizes the net charge; -conc 0.15 adds NaCl up to roughly 0.15 mol/L.
# (Select the 'SOL' group when prompted to embed ions)
gmx genion -s ions.tpr -o complex_solv_ions.gro -p topol.top -pname NA -nname CL -neutral -conc 0.15

Step 3: Energy Minimization

Relax the system to remove any steric clashes or bad contacts before starting dynamics.

# 3a. Create the grompp input file for minimization.
gmx grompp -f minim.mdp -c complex_solv_ions.gro -p topol.top -o em.tpr

# 3b. Run energy minimization.
gmx mdrun -v -deffnm em

Step 4: NVT & NPT Equilibration

Equilibrate the system first at constant volume (NVT) to stabilize the temperature, then at constant pressure (NPT) to stabilize the density.

# 4a. NVT Equilibration (constant volume)
gmx grompp -f nvt.mdp -c em.gro -r em.gro -p topol.top -o nvt.tpr
gmx mdrun -deffnm nvt

# 4b. NPT Equilibration (constant pressure)
gmx grompp -f npt.mdp -c nvt.gro -r nvt.gro -t nvt.cpt -p topol.top -o npt.tpr
gmx mdrun -deffnm npt

Step 5: Production MD

Run the main, long simulation for data collection.

# Create the production .tpr file
gmx grompp -f md.mdp -c npt.gro -t npt.cpt -p topol.top -o md_0_100.tpr

# Run the production simulation (e.g., for 100 ns)
gmx mdrun -deffnm md_0_100

Step 6: Trajectory Analysis

Post-process the trajectory to handle periodic boundary conditions and calculate metrics like RMSD and RMSF.

# 6a. Correct for periodic boundary conditions.
# (Select 'Protein' as the group to center, then 'System' for output)
gmx trjconv -s md_0_100.tpr -f md_0_100.xtc -o md_noPBC.xtc -pbc mol -center

# 6b. Calculate RMSD.
# (Select 'Backbone' for least-squares fit, then 'C-alpha' for RMSD calculation)
gmx rms -s md_0_100.tpr -f md_noPBC.xtc -o rmsd.xvg -tu ns

# 6c. Calculate RMSF.
# (Select 'C-alpha' for calculation)
gmx rmsf -s md_0_100.tpr -f md_noPBC.xtc -o rmsf.xvg -res

Step 7: MM/PBSA Binding Energy Calculation

Calculate the binding free energy using the `gmx_MMPBSA` tool, which is a popular third-party extension for GROMACS.

# Example gmx_MMPBSA input script (mmpbsa.in)
&general
startframe=5000, endframe=10000, interval=10,
/
&gb
igb=5, saltcon=0.150,
/
&pb
istrng=0.150,
/

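# Run the calculation. This requires an index file (index.ndx) that contains groups for the
# receptor and the ligand; -cg takes those two group numbers, in that order (1 and 13 here
# are examples and depend on your index file). -cp passes the GROMACS topology.
gmx_MMPBSA -O -i mmpbsa.in -cs md_0_100.tpr -ci index.ndx -cg 1 13 -ct md_noPBC.xtc -cp topol.top -o FINAL_RESULTS.dat -eo FINAL_RESULTS.csv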

Make Your Own Supercomputer (Clustering)

This guide provides a conceptual overview for building a basic "Beowulf" cluster using commodity computers. This type of cluster consists of a "head node" (or master) that manages a set of "compute nodes" over a private network.

Hardware Requirements

  • Head Node: A decent computer (e.g., modern multi-core CPU, 16GB+ RAM) with two network interface cards (NICs). One NIC connects to the public internet, the other to the private cluster network.
  • Compute Nodes: Two or more computers (can be older or less powerful) that will perform the computations. They only need one NIC each.
  • Network Switch: A dedicated gigabit switch to create a private network connecting all nodes.
  • Cables: Ethernet cables to connect all nodes to the switch.

Step 1: Network Setup & OS Installation

Install a Linux distribution (like Ubuntu Server) on all machines. Configure the network so nodes can communicate.

# On the Head Node, configure network interfaces (e.g., in /etc/netplan/01-netcfg.yaml on Ubuntu)
# Public Interface (e.g., enp0s3) - Connects to Internet (often configured by DHCP)
network:
  version: 2
  ethernets:
    enp0s3:
      dhcp4: true
    # Private Interface (e.g., enp0s8) - Connects to cluster switch
    enp0s8:
      addresses: [192.168.1.1/24]

# On each Compute Node (e.g., node01)
network:
  version: 2
  ethernets:
    enp0s3:
      addresses: [192.168.1.2/24] # Change IP for each node (1.3, 1.4, etc.)
      gateway4: 192.168.1.1 # Head node is the gateway

# Edit /etc/hosts on ALL nodes for easy name resolution
192.168.1.1  headnode
192.168.1.2  node01
192.168.1.3  node02
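
For more than a handful of nodes, the /etc/hosts entries can be generated instead of typed. A sketch assuming a hypothetical cluster of four compute nodes and the same 192.168.1.x addressing as above:

```shell
# Hypothetical cluster size; adjust to the number of compute nodes you have.
n_nodes=4

hosts=$(
  echo "192.168.1.1  headnode"
  i=1
  while [ "$i" -le "$n_nodes" ]; do
    # node01 gets .2, node02 gets .3, and so on.
    printf '192.168.1.%d  node%02d\n' "$(( i + 1 ))" "$i"
    i=$(( i + 1 ))
  done
)

echo "$hosts"
```

Review the output before appending it to /etc/hosts on each machine.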

Step 2: Head Node Configuration (SSH & NFS)

Set up passwordless SSH so the head node can control the compute nodes. Set up NFS (Network File System) to share the home directories, so all nodes have access to the same files.

# On Head Node: Generate SSH keys
ssh-keygen -t rsa

# Copy public key to each compute node
ssh-copy-id user@node01
ssh-copy-id user@node02

# On Head Node: Install NFS Server and configure exports
sudo apt-get install nfs-kernel-server
sudo nano /etc/exports
# Add this line to the file (exports /home to the private cluster network only):
# /home 192.168.1.0/24(rw,sync,no_subtree_check)

sudo exportfs -a
sudo systemctl restart nfs-kernel-server

Step 3: Compute Node Configuration (NFS Client)

Configure the compute nodes to mount the shared home directory from the head node.

# On ALL Compute Nodes: Install NFS Client
sudo apt-get install nfs-common

# Edit /etc/fstab to mount automatically on boot
sudo nano /etc/fstab
# Add this line:
# headnode:/home /home nfs defaults 0 0

# Mount it now without rebooting
sudo mount -a

Step 4: Install MPI & Job Scheduler

Install MPI (Message Passing Interface) for parallel programming and a job scheduler like Slurm to manage resources.

# On ALL nodes: Install MPI
sudo apt-get install mpich # or openmpi-bin

# On Head Node: Install Slurm controller
sudo apt-get install slurm-wlm

# On Compute Nodes: Install Slurm daemon
sudo apt-get install slurmd

# --- Basic slurm.conf on Head Node (/etc/slurm-llnl/slurm.conf; /etc/slurm/slurm.conf on newer releases) ---
# The same file must be present on every compute node.
# ClusterName=my_cluster
# SlurmctldHost=headnode
# NodeName=node01 CPUs=4 State=UNKNOWN
# NodeName=node02 CPUs=4 State=UNKNOWN
# PartitionName=debug Nodes=node[01-02] Default=YES MaxTime=INFINITE State=UP

Step 5: Test the Cluster

Compile and run a simple MPI "Hello World" program to verify that all nodes are communicating and working together.

# Create hello_mpi.c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    /* Start the MPI runtime. */
    MPI_Init(&argc, &argv);

    /* Total number of processes and the rank of this one. */
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Name of the node this process is running on. */
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(processor_name, &name_len);

    printf("Hello from processor %s, rank %d out of %d processors\n",
           processor_name, world_rank, world_size);

    MPI_Finalize();
    return 0;
}

# Compile on the head node (accessible to all nodes via NFS)
mpicc -o hello_mpi hello_mpi.c

# Run the job via Slurm
srun -N 2 --ntasks-per-node=1 ./hello_mpi
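
For anything longer than a quick test, the same job is usually submitted as a batch script. The sketch below only writes such a script to a scratch directory; the job name and output file pattern are placeholders, and the partition name follows the slurm.conf example above.

```shell
workdir=$(mktemp -d)
# Write a minimal batch script equivalent to the srun line above.
cat > "$workdir/hello_mpi.sbatch" <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello_mpi
#SBATCH --partition=debug
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --output=hello_mpi_%j.out

srun ./hello_mpi
EOF

cat "$workdir/hello_mpi.sbatch"
# To submit, copy the script next to the hello_mpi binary and run: sbatch hello_mpi.sbatch
```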

Tutorial: Genomics Basics

Genomics is the study of an organism’s entire DNA, its genome. Unlike genetics, which zooms in on individual genes, genomics looks at the whole system: all genes, how they interact, and how they shape the biology of an organism.

The Central Dogma of Molecular Biology

The foundation of genomics lies in the central dogma, which describes the flow of genetic information from DNA to functional proteins.

Source: NIH
  • DNA: The blueprint of life, storing instructions for building and maintaining an organism.
  • Transcription: A specific segment of DNA (a gene) is copied into a messenger RNA (mRNA) molecule.
  • Translation: Ribosomes read the mRNA sequence and build the corresponding protein.
  • Protein: The workhorses of the cell, carrying out most biological functions.
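
The two steps can be mimicked on a toy sequence. This is a sketch only: the DNA string is made up, it is treated as the coding strand, and the codon table contains just the four codons that appear in it.

```shell
dna="ATGGCCTGGTAA"     # made-up coding-strand sequence

# Transcription: the mRNA matches the coding strand with T replaced by U.
mrna=$(printf '%s' "$dna" | tr 'T' 'U')

# Translation: read three bases at a time and stop at the stop codon.
protein=$(printf '%s\n' "$mrna" | awk '
  BEGIN { aa["AUG"] = "Met"; aa["GCC"] = "Ala"; aa["UGG"] = "Trp"; aa["UAA"] = "Stop" }
  {
    out = ""
    for (i = 1; i + 2 <= length($0); i += 3) {
      codon = substr($0, i, 3)
      name = (codon in aa) ? aa[codon] : "???"
      if (name == "Stop") break
      out = out (out == "" ? "" : "-") name
    }
    print out
  }')

echo "mRNA:    $mrna"       # prints: mRNA:    AUGGCCUGGUAA
echo "protein: $protein"    # prints: protein: Met-Ala-Trp
```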

Key Concepts in Genome Analysis

  • Genome Sequencing: Determining the complete DNA sequence of an organism. With next-generation sequencing (NGS), billions of fragments can be read in parallel.
  • Genome Assembly: The computational process of stitching together the vast number of short DNA fragments produced by sequencing to reconstruct the original chromosomes.
  • Gene Prediction (Annotation): Identifying where genes are located, which regions code for proteins, and marking features like start/stop codons, exons, introns, promoters, and enhancers.
  • Comparative Genomics: Comparing genomes across species to study evolution and identify conserved, functionally important sequences.
  • Functional Genomics: Understanding how genes work and interact, often by measuring genome-wide expression patterns using methods like RNA-Seq.

Common Genomic Data Formats

These are some of the key file formats used in our lab for genomic analysis.

| Format | Extension | Description |
| --- | --- | --- |
| FASTA | .fa, .fna, .fasta | Stores DNA or protein sequences in plain text with a header line. |
| GFF | .gff, .gff3 | Tab-delimited format for describing genomic features like genes and their locations. |
| GTF | .gtf | Based on GFF, with additional conventions specific to gene information. |
| GenBank | .gb, .gbk | Text-based format storing sequence data along with rich annotations. |
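
Because FASTA and GFF are plain text, the command-line tools from the earlier sections are enough for quick checks. The sketch below uses two made-up FASTA records and two made-up GFF3 lines to count records, sum sequence lengths, and count gene features.

```shell
workdir=$(mktemp -d)
# Two made-up FASTA records; a sequence may wrap across several lines.
cat > "$workdir/genes.fasta" <<'EOF'
>gene1 hypothetical protein
ATGGCC
TGGTAA
>gene2 hypothetical protein
ATGAAATAG
EOF

n_records=$(grep -c '^>' "$workdir/genes.fasta")

# Per-record length: start a new tally at each header, add up the sequence lines.
lengths=$(awk '/^>/ { if (id != "") print id, len; id = substr($1, 2); len = 0; next }
               { len += length($0) }
               END { if (id != "") print id, len }' "$workdir/genes.fasta")

# GFF3 is tab-delimited: column 3 is the feature type, columns 4 and 5 are start and end.
printf 'chr1\tdemo\tgene\t100\t500\t.\t+\t.\tID=gene1\n'  > "$workdir/genes.gff3"
printf 'chr1\tdemo\texon\t100\t250\t.\t+\t.\tParent=gene1\n' >> "$workdir/genes.gff3"
n_genes=$(awk -F '\t' '$3 == "gene"' "$workdir/genes.gff3" | wc -l | tr -d ' ')

echo "$n_records records"
echo "$lengths"
echo "$n_genes gene feature(s)"
```

For these inputs the script reports 2 records, lengths of 12 and 9 bases, and 1 gene feature.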

Essential Tools and Databases

A selection of widely used resources in genomics research that we utilize in our work.

| Name | Type | Primary Use |
| --- | --- | --- |
| NCBI | Databases/Tools | The U.S. National Center for Biotechnology Information, an internationally used resource providing access to public databases and analysis software. |
| BLAST | Tool | Finds regions of similarity between DNA or protein sequences. |
| GenBank | Database | A public repository of publicly available nucleotide sequences and their annotations. |
| Ensembl | Database | A genome browser and annotation platform, focused on vertebrate genomes. |
| GENCODE | Database | Provides high-quality reference annotation for human and mouse genes. |
| EPDnew | Database | Offers experimentally validated promoter sequences for precise transcription start site annotation. |
| YeasTSS | Database | An atlas of transcription start sites in various yeast species. |
| EnhancerAtlas | Database | A curated repository of enhancer elements across different species and tissues. |

Note: This is not an exhaustive list. Additional important resources include the UCSC Genome Browser, DDBJ, KEGG, SIB, RCB, NIBMG, EMBL, FANTOM, HUGO, and the Indian Biological Data Centre (IBDC).

Basics of Proteomics

Proteomics is the large-scale study of proteomes. A proteome is the complete set of proteins produced by an organism, system, or biological context. While genomics tells us what a cell *could* do (the blueprint), proteomics tells us what it is *actually doing* at a functional level.

Why is Proteomics Important?

Proteins are the primary functional molecules in a cell, acting as enzymes, structural components, and signaling molecules. Studying them directly provides a clearer picture of cellular activity than studying genes alone. This has massive implications for:

  • Disease Biomarkers: Identifying proteins whose levels or modifications change in disease states can lead to new diagnostic tools.
  • Drug Discovery: Most drugs target proteins. Proteomics helps identify and validate new drug targets and understand how drugs affect the cellular machinery.
  • Understanding Biological Pathways: Mapping protein-protein interactions reveals the complex networks that govern cellular processes.

Source: NIH

Key Concepts in Proteomics

  • Protein Identification & Quantification: The most fundamental tasks. What proteins are present in a sample, and in what amounts?
  • Post-Translational Modifications (PTMs): After a protein is made, it can be chemically modified (e.g., phosphorylation, glycosylation). These PTMs are critical for regulating protein function and are a major focus of proteomics.
  • Protein-Protein Interactions (PPIs): Proteins rarely work alone. Identifying which proteins interact with each other helps to build a functional map of the cell.
  • Structural Proteomics: Determining the 3D structure of proteins on a large scale, which is essential for understanding their function and for structure-based drug design.

Common Experimental Techniques

Modern proteomics relies on a combination of sophisticated experimental and computational methods.

| Technique | Description | Primary Use |
| --- | --- | --- |
| Mass Spectrometry (MS) | A technique that measures the mass-to-charge ratio of ions. In proteomics, proteins are first digested into smaller peptides, which are then analyzed by MS. | The cornerstone of modern proteomics for protein identification and quantification. |
| 2D Gel Electrophoresis (2D-PAGE) | A method to separate a complex mixture of proteins. Proteins are separated by isoelectric point (charge) in the first dimension and by mass in the second, creating a 2D map of spots. | Visualizing differences in protein expression between samples. Often used before Mass Spectrometry. |
| X-ray Crystallography & NMR | High-resolution techniques used to determine the precise 3D atomic structure of individual proteins. | Structural proteomics and structure-based drug design. |
| Yeast Two-Hybrid (Y2H) | A molecular biology technique used to discover protein-protein interactions (PPIs) by testing for physical interactions between two proteins. | Large-scale screening for PPI networks. |

Essential Computational Tools & Databases

Raw experimental data from proteomics is vast and complex. Computational tools are essential to process, analyze, and interpret it.

| Name | Type | Primary Use |
| --- | --- | --- |
| Mascot / MaxQuant | Software | Software that identifies proteins by matching experimental mass spectrometry data against theoretical peptide masses from sequence databases; MaxQuant also performs quantification. |
| Protein Data Bank (PDB) | Database | The primary worldwide repository for 3D structural data of large biological molecules, such as proteins and nucleic acids. |
| UniProt | Database | A comprehensive, high-quality, and freely accessible database of protein sequence and functional information. |
| STRING | Database | A database of known and predicted protein-protein interactions, including direct (physical) and indirect (functional) associations. |
| AlphaFold / RoseTTAFold | Software | Deep learning-based tools for predicting the 3D structure of a protein from its amino acid sequence with high accuracy. |

Basics of Drug Discovery & Design

Computer-Aided Drug Design (CADD) uses computational methods to simulate drug-receptor interactions, accelerating the process of identifying and optimizing new drug candidates. It is a cornerstone of modern pharmaceutical research, significantly reducing the time and cost associated with discovering new medicines.

The Drug Discovery Pipeline

The journey from an idea to a marketable drug is long and complex, typically broken down into several key stages where computational methods play a vital role:

  • Target Identification & Validation: Identifying a specific biological molecule (usually a protein or nucleic acid) that is believed to play a critical role in a disease. Computational genomics and proteomics help in finding and validating these targets.
  • Hit Discovery: Screening large libraries of small molecules to find "hits"—compounds that bind to the target and modulate its activity. Virtual screening is a key CADD technique here.
  • Lead Optimization: Chemically modifying a promising "hit" compound to improve its properties (potency, selectivity, and pharmacokinetic profile) to create a "lead" candidate.
  • Preclinical & Clinical Trials: Rigorous experimental testing of the lead compound in labs, animals, and finally, humans.

Key Computational Approaches

CADD is broadly divided into two main categories, depending on whether the 3D structure of the target is known.

| Approach | Requirement | Description | Common Techniques |
| --- | --- | --- | --- |
| Structure-Based Drug Design (SBDD) | 3D structure of the target protein is known (from X-ray crystallography, NMR, or high-quality homology modeling). | Uses the 3D structure of the target's binding site to design or screen for ligands that fit with high affinity and specificity. | Molecular Docking, De Novo Design, Molecular Dynamics |
| Ligand-Based Drug Design (LBDD) | 3D structure of the target is unknown, but a set of molecules known to interact with it is available. | Uses the properties of these known active molecules to infer a model (a pharmacophore) that describes the necessary features for binding. | QSAR, Pharmacophore Modeling, 3D Shape Similarity |

A Typical SBDD Workflow

The following steps outline a common pipeline for a structure-based drug design project, similar to the approach used by our Sanjeevini software.

  • 1. Target Preparation: The crystal structure of a target protein (obtained from the PDB) is prepared. This involves adding hydrogen atoms, assigning correct protonation states, and repairing any missing residues or atoms.
  • 2. Active Site Identification: The binding pocket or "active site" where a drug molecule is expected to bind is identified. This can be done based on the location of a known co-crystallized ligand or by using cavity detection algorithms.
  • 3. Ligand Library Preparation: A large collection of small molecules (often millions) is prepared for screening. This involves generating 3D coordinates and assigning correct partial charges for each molecule. Our TPACM4 tool is used for this.
  • 4. Virtual Screening (Docking): Each molecule from the library is computationally "docked" into the active site of the target protein. A docking program (like ParDOCK+) systematically samples different orientations and conformations of the ligand within the binding site.
  • 5. Scoring and Ranking: After docking, a scoring function is used to estimate the binding affinity (how tightly the ligand binds) for each pose. The molecules are then ranked based on their scores. Tools like BAPPL+ are designed for this purpose.
  • 6. Post-Processing and Analysis: The top-ranked "hit" molecules are further analyzed using more computationally expensive methods like Molecular Dynamics simulations (using AMBER or GROMACS) and binding free energy calculations (like MM/PBSA) to get a more accurate prediction of their binding affinity and stability.
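
The ranking in step 5 amounts to a numeric sort on the score column. A sketch with hypothetical molecule names and scores (not output from any of the tools named above), assuming more negative scores mean stronger predicted binding:

```shell
workdir=$(mktemp -d)
# Hypothetical docking scores in kcal/mol, for illustration only.
cat > "$workdir/scores.txt" <<'EOF'
mol_A -7.2
mol_B -9.8
mol_C -6.1
mol_D -8.5
EOF

# Sort numerically on column 2 (most negative first) and keep the two best-scoring molecules.
top_hits=$(LC_ALL=C sort -k2,2n "$workdir/scores.txt" | head -n 2)
echo "$top_hits"
```

Here mol_B and mol_D come out on top and would be the candidates passed on to the more expensive analysis in step 6.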

Essential Databases & Tools

A list of some fundamental databases and software used in drug discovery research.

| Name | Type | Primary Use |
| --- | --- | --- |
| Protein Data Bank (PDB) | Database | The primary repository for 3D structures of proteins and other biomolecules, essential for SBDD. |
| PubChem / ZINC | Database | Vast, publicly available databases containing millions of small molecules for virtual screening. |
| BIMP | Database | Database of Phytochemicals of Indian Medicinal Plants. |
| AutoDock Vina / Glide | Software | Widely used academic and commercial molecular docking programs. |
| Sanjeevini | Software Suite | Our in-house, freely accessible suite for target-directed lead molecule discovery. |
| PyMOL / Chimera | Software | Molecular visualization systems used to view and analyze protein-ligand complexes. |