Step-by-step guides for essential computational biology and bioinformatics tasks.
These commands form the foundation for working on the command line. Mastering them is essential for navigating, viewing, and managing files on any Linux-based system, including high-performance clusters.
Commands to help you move around and understand your location within the directory structure.
| Command | Usage | Example | Description |
|---|---|---|---|
| `pwd` | `pwd` | `pwd` | Print Working Directory. Shows your current location. |
| `ls` | `ls [options] [directory]` | `ls -lh /home/user` | List directory contents. Use `-l` for details and `-h` for human-readable sizes. |
| `cd` | `cd [directory]` | `cd Documents/` | Change Directory. `cd ..` moves up one level; `cd` alone goes to your home directory. |
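The three navigation commands can be tried safely in a scratch directory. The sketch below assumes a POSIX shell with `mktemp` available; the `project/data` layout is hypothetical and only serves as something to move through.

```shell
# Work in a throwaway directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir"
mkdir -p project/data      # hypothetical directory layout

cd project/data            # descend two levels with a relative path
here=$(pwd)                # absolute path of the current directory
cd ..                      # move up one level, into project/
parent=$(pwd)
ls -lh                     # long listing of project/, showing data/

echo "was in: $here"
echo "now in: $parent"
```

Capturing `pwd` in a variable is a common way to return to a location later in a script.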
Commands for creating, renaming, copying, moving, and deleting files and folders.
| Command | Usage | Example | Description |
|---|---|---|---|
| `touch` | `touch [filename]` | `touch report.txt` | Creates a new, empty file. |
| `mkdir` | `mkdir [directory_name]` | `mkdir simulations` | Make a new directory. |
| `cp` | `cp [source] [destination]` | `cp script.py scripts_backup/` | Copy a file. Use the `-r` option for directories. |
| `mv` | `mv [source] [destination]` | `mv data.txt results/` | Move or rename a file or directory. |
| `rm` | `rm [filename]` | `rm temp_output.log` | Remove (delete) a file. Deletion is permanent; use with caution. |
| `rmdir` | `rmdir [directory_name]` | `rmdir old_folder` | Remove an empty directory. |
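A short round trip through the file-management commands, run in a scratch directory, might look like the following. The file and directory names are illustrative only.

```shell
workdir=$(mktemp -d)
cd "$workdir"

touch report.txt                            # create an empty file
mkdir simulations backup                    # create two directories
cp report.txt backup/                       # copy the file into backup/
mv report.txt simulations/run1_report.txt   # move and rename in one step
rm backup/report.txt                        # delete the copy (there is no undo)
rmdir backup                                # works only because backup/ is now empty

ls -R                                       # show what is left
```

Note that `rmdir` refuses to delete a directory that still contains files, which makes it a safer default than `rm -r`.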
Commands to inspect the contents of files without opening a full editor.
| Command | Usage | Example | Description |
|---|---|---|---|
| `cat` | `cat [filename]` | `cat results.txt` | Displays the entire content of a file on the screen. |
| `less` | `less [filename]` | `less large_log_file.log` | Views large files page by page. Press `q` to quit. |
| `head` | `head [options] [filename]` | `head -n 20 data.csv` | Displays the beginning (head) of a file. Default is 10 lines. |
| `tail` | `tail [options] [filename]` | `tail -f output.log` | Displays the end (tail) of a file. The `-f` option follows the file as it grows. |
| `wc` | `wc [options] [filename]` | `wc -l sequences.fasta` | Word Count. Counts lines (`-l`), words (`-w`), and bytes (`-c`); use `-m` for characters. |
| `file` | `file [filename]` | `file my_script.sh` | Determines the type of a file (e.g., text, executable, image). |
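As a quick illustration of these viewers on sequence data, the sketch below builds a tiny synthetic FASTA file (the records are made up) and inspects it without opening an editor.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Build a small synthetic FASTA file: three records of two lines each.
printf '>seq1\nACGTACGT\n>seq2\nGGGCCC\n>seq3\nTTAA\n' > sequences.fasta

head -n 2 sequences.fasta                # first record (header + sequence)
tail -n 2 sequences.fasta                # last record
n_lines=$(wc -l < sequences.fasta)       # total number of lines
n_seqs=$(grep -c '^>' sequences.fasta)   # one header line per sequence
echo "lines: $n_lines, sequences: $n_seqs"
```

Counting `>` header lines is usually a more reliable sequence count than `wc -l`, because real FASTA records often wrap over several lines.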
Commands to learn more about other commands.
| Command | Usage | Example | Description |
|---|---|---|---|
| `man` | `man [command]` | `man cp` | Displays the manual page for a command, showing all its options and usage details. |
| `--help` | `[command] --help` | `ls --help` | Most commands have a help flag that prints a quick summary of options. |
Powerful tools for searching, manipulating text, managing processes, handling permissions, and automating tasks with shell features.
These commands allow you to find information and process text data with precision.
| Command | Usage | Example | Description |
|---|---|---|---|
| `grep` | `grep [options] [pattern] [file]` | `grep -i 'error' app.log` | Searches for a pattern in a file. `-i` for case-insensitive, `-r` for recursive. |
| `find` | `find [path] [expression]` | `find . -name "*.pdb"` | Searches for files and directories based on name, size, modification time, etc. |
| `awk` | `awk '{program}' [file]` | `awk '{print $1, $3}' data.txt` | A powerful pattern-scanning and text-processing language, excellent for column-based data. |
| `sed` | `sed 's/old/new/g' [file]` | `sed 's/ATOM/HETATM/g' protein.pdb` | Stream EDitor for filtering and transforming text. |
| `sort` | `sort [options] [file]` | `sort -k2 -n data.txt` | Sorts lines of text files. `-n` for numeric sort, `-k` to specify a column. |
| `uniq` | `uniq [options] [file]` | `sort names.txt \| uniq -c` | Reports or omits repeated lines. It only collapses adjacent duplicates, so sort the input first. `-c` counts occurrences. |
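These tools are most useful in combination. The following sketch runs them on a synthetic three-column table (residue name, chain, B-factor); the data and column meanings are assumptions made for the example.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Synthetic three-column table: residue name, chain, B-factor.
cat > data.txt <<'EOF'
ALA A 12.5
GLY A 30.1
ALA B 18.0
SER A 25.4
ALA A 9.9
EOF

# Count how often each residue name occurs, most frequent first.
counts=$(awk '{print $1}' data.txt | sort | uniq -c | sort -k1,1nr)
echo "$counts"

# Rows on chain A, sorted numerically by the third column.
chain_a=$(grep ' A ' data.txt | sort -k3,3n)
echo "$chain_a"

# Replace ALA with XAA on the fly; data.txt itself is unchanged.
sed 's/ALA/XAA/g' data.txt
```

The `sort | uniq -c | sort -nr` idiom is a compact way to build a frequency table from any column.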
Commands for monitoring and controlling running programs and services.
| Command | Usage | Example | Description |
|---|---|---|---|
| `ps` | `ps [options]` | `ps aux` | Reports a snapshot of the current processes. |
| `top` | `top` | `top` | Displays a real-time, dynamic view of system processes. |
| `htop` | `htop` | `htop` | An interactive and more user-friendly version of top. |
| `kill` | `kill [PID]` | `kill 12345` | Sends a signal (SIGTERM by default) to a process identified by its Process ID (PID). |
| `killall` | `killall [process_name]` | `killall vmd` | Kills all processes matching a given name. |
| `jobs` | `jobs` | `jobs` | Lists the background and stopped jobs of the current shell. |
| `bg` / `fg` | `bg %1` / `fg %1` | `fg %1` | Sends a job to the background or brings it to the foreground. |
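A minimal sketch of starting, checking, and terminating a background process from a script is shown below. It uses `sleep` as a stand-in for a long-running job and `kill -0` (which sends no signal and only tests whether the process exists) as a lightweight liveness check.

```shell
sleep 300 &                 # start a long-running command in the background
pid=$!                      # $! holds the PID of the most recent background job
kill -0 "$pid" && echo "process $pid is running"

kill "$pid"                 # send SIGTERM (the default signal)
status=0
wait "$pid" 2>/dev/null || status=$?   # reap the process and record its exit status
echo "exit status after kill: $status" # 128 + 15 for a process ended by SIGTERM
```

In an interactive shell, `jobs`, `bg %1`, and `fg %1` offer the same control by job number instead of PID.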
Commands to control who can read, write, and execute files and directories.
| Command | Usage | Example | Description |
|---|---|---|---|
| `chmod` | `chmod [permissions] [file]` | `chmod +x script.sh` | Change mode. Modifies the permissions of a file. `+x` makes it executable. |
| `chown` | `chown [user]:[group] [file]` | `chown newuser:staff data.txt` | Change owner. Changes the user and group ownership of a file. |
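The effect of `chmod +x` is easiest to see on a small script. The sketch below writes a hypothetical two-line script, shows that it is not executable at first, and then enables execution; `chown` is left out because it normally requires root.

```shell
workdir=$(mktemp -d)
cd "$workdir"

printf '#!/bin/sh\necho "analysis finished"\n' > script.sh
chmod 644 script.sh        # rw-r--r--: readable, not executable
[ -x script.sh ] || echo "not executable yet"

chmod +x script.sh         # add execute permission
output=$(./script.sh)      # now it can be run directly
echo "$output"
ls -l script.sh            # the mode string now contains x
```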
Tools for bundling multiple files into an archive and for compressing data to save space.
| Command | Usage | Example | Description |
|---|---|---|---|
| `tar` | `tar [options] [archive.tar] [files]` | `tar -czvf archive.tar.gz results/` | Tape ARchive. Bundles files. `-c` create, `-x` extract, `-z` gzip, `-v` verbose, `-f` archive file name. |
| `zip` / `unzip` | `zip archive.zip files` / `unzip archive.zip` | `unzip data.zip` | Creates or extracts .zip archives, common for Windows compatibility. |
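A create, list, and extract cycle with `tar` could be sketched as follows. The `results/` directory and its content are synthetic.

```shell
workdir=$(mktemp -d)
cd "$workdir"
mkdir results
echo "rmsd 1.8" > results/summary.txt

tar -czf archive.tar.gz results/      # create a gzip-compressed archive
tar -tzf archive.tar.gz               # list its contents without extracting

mkdir restored
tar -xzf archive.tar.gz -C restored   # extract into restored/
restored_text=$(cat restored/results/summary.txt)
echo "$restored_text"
```

Listing with `-t` before extracting is a cheap way to check where an unfamiliar archive will place its files.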
Powerful concepts for combining commands and automating workflows.
| Operator | Usage | Example | Description |
|---|---|---|---|
| `\|` (Pipe) | `command1 \| command2` | `ls -l \| grep ".txt"` | Sends the output of one command to the input of another. |
| `>` (Redirect) | `command > file` | `ls > file_list.txt` | Redirects output to a file, overwriting the file if it exists. |
| `>>` (Append) | `command >> file` | `echo "Done" >> log.txt` | Appends output to the end of a file without overwriting. |
| `&&` (AND) | `cmd1 && cmd2` | `./configure && make` | Runs the second command only if the first one succeeds. |
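The four operators can be seen side by side in a scratch directory. File names here are placeholders.

```shell
workdir=$(mktemp -d)
cd "$workdir"
touch a.txt b.txt notes.md

ls > file_list.txt                      # '>' creates or overwrites file_list.txt
echo "Done" >> file_list.txt            # '>>' appends one more line
n_txt=$(ls | grep -c '\.txt$')          # pipe: count names ending in .txt
mkdir out && echo "created" > out/ok    # echo runs only because mkdir succeeded
mkdir out 2>/dev/null && echo "again" > out/ok   # mkdir fails now, so echo is skipped
echo "text files: $n_txt"
```

Because the shell creates `file_list.txt` before `ls` runs, the listing includes the output file itself.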
A typical workflow for running a molecular dynamics simulation of a protein-ligand complex using the AmberTools suite. This guide assumes you have a clean protein PDB file (protein.pdb) and a ligand file (ligand.mol2).
The first step is to prepare the system with `tleap` by generating topology (.prmtop) and coordinate (.inpcrd) files. This involves loading force fields, preparing the ligand with `antechamber`, solvating the complex in a water box, and adding counter-ions to neutralize it.
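# 1a. Prepare the ligand file to generate GAFF2 atom types and AM1-BCC charges.
# This creates a new mol2 file with correct atom types and a frcmod file with any missing parameters.
# -at gaff2 and -s gaff2 keep the ligand consistent with the leaprc.gaff2 force field sourced in tleap.in.
# For a charged ligand, also pass its net charge to antechamber with -nc (e.g., -nc -1).
antechamber -i ligand.mol2 -fi mol2 -o ligand_gaff.mol2 -fo mol2 -c bcc -at gaff2 -s 2
parmchk2 -i ligand_gaff.mol2 -f mol2 -o ligand.frcmod -s gaff2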
# 1b. Create a tleap input file (e.g., tleap.in) to build the system.
# This script combines the protein and ligand, solvates them, and adds ions.
# Save the following lines into a file named "tleap.in":
source leaprc.protein.ff14SB
source leaprc.water.tip3p
source leaprc.gaff2
loadamberparams ligand.frcmod
LIG = loadmol2 ligand_gaff.mol2
protein = loadpdb protein.pdb
complex = combine { protein LIG }
solvatebox complex TIP3PBOX 10.0
addions complex Na+ 0
addions complex Cl- 0
saveamberparm complex complex.prmtop complex.inpcrd
quit
# 1c. Run tleap with the input file to generate the final system files.
tleap -f tleap.in
This multi-stage process prepares the solvated system for the production simulation. We first minimize the energy to remove bad steric contacts, then gradually heat the system to the target temperature (e.g., 300 K) while restraining the protein and ligand. Finally, we equilibrate the system's density under constant pressure.
# Stage 2a: Minimization Input (save as min.in)
Minimization of the system
&cntrl
imin=1, maxcyc=5000, ncyc=2500,
ntb=1, ntr=1, restraint_wt=10.0,
restraintmask=':1-286 & !@H=', ! Restrain all non-hydrogen atoms of the protein (e.g., residues 1-286)
cut=10.0,
/
# Stage 2b: Heating Input (save as heat.in)
Heating from 0K to 300K
&cntrl
imin=0, irest=0, ntx=1, nstlim=25000, dt=0.002,
ntc=2, ntf=2, ntt=3, gamma_ln=2.0, tempi=0.0, temp0=300.0,
ntb=1, ntp=0, ntpr=500, ntwx=500,
ntr=1, restraint_wt=10.0, restraintmask=':1-286 & !@H=',
cut=10.0,
/
# Stage 2c: Equilibration Input (save as equil.in)
NPT Equilibration
&cntrl
imin=0, irest=1, ntx=5, nstlim=50000, dt=0.002,
ntc=2, ntf=2, ntt=3, gamma_ln=2.0, temp0=300.0,
ntb=2, ntp=1, pres0=1.0, taup=2.0,
ntpr=500, ntwx=500,
ntr=0, ! No restraints
cut=10.0,
/
# Stage 2d: Run the steps sequentially (using GPU version for speed)
pmemd.cuda -O -i min.in -o min.out -p complex.prmtop -c complex.inpcrd -r min.rst -ref complex.inpcrd
pmemd.cuda -O -i heat.in -o heat.out -p complex.prmtop -c min.rst -r heat.rst -x heat.nc -ref min.rst
pmemd.cuda -O -i equil.in -o equil.out -p complex.prmtop -c heat.rst -r equil.rst -x equil.nc
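The length of each stage follows from `nstlim * dt`, with `dt` given in picoseconds. As a small check of the values used in heat.in and equil.in, the hypothetical helper `stage_time_ps` below does the multiplication with `awk` (the shell itself has no floating-point arithmetic).

```shell
# Simulated time per stage = nstlim * dt (dt is in picoseconds).
stage_time_ps() {
    awk -v steps="$1" -v dt="$2" 'BEGIN { printf "%.1f", steps * dt }'
}

heat_ps=$(stage_time_ps 25000 0.002)    # nstlim and dt from heat.in
equil_ps=$(stage_time_ps 50000 0.002)   # nstlim and dt from equil.in
echo "heating: ${heat_ps} ps, equilibration: ${equil_ps} ps"
```

Both stages are short compared with production; larger or more flexible systems may need longer equilibration.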
Run the main simulation for data collection. This step is typically the longest and generates the trajectory data needed for analysis.
# Production MD Input (save as prod.in)
500 ns Production MD
&cntrl
imin=0, irest=1, ntx=5,
nstlim=250000000, dt=0.002, ! 500 ns = 250,000,000 steps * 2 fs/step
ntc=2, ntf=2,
ntt=3, gamma_ln=2.0, temp0=300.0,
ntb=2, ntp=1, pres0=1.0,
ntpr=5000, ntwx=5000, ntwr=10000,
cut=10.0,
/
# Run production from the equilibrated system
pmemd.cuda -O -i prod.in -o prod.out -p complex.prmtop -c equil.rst -r prod.rst -x prod.nc
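The step count and the number of saved frames in prod.in can be derived from the target length rather than typed by hand. The sketch below uses integer shell arithmetic; the variable names are illustrative and the values mirror prod.in.

```shell
target_ns=500      # desired simulation length
dt_fs=2            # time step in femtoseconds (dt=0.002 ps)
ntwx=5000          # trajectory write interval from prod.in

nstlim=$(( target_ns * 1000000 / dt_fs ))   # 1 ns = 1,000,000 fs
n_frames=$(( nstlim / ntwx ))               # frames written to prod.nc
ps_per_frame=$(( ntwx * dt_fs / 1000 ))     # time between saved frames
echo "nstlim=$nstlim frames=$n_frames spacing=${ps_per_frame} ps"
```

Knowing the frame count up front helps when choosing frame ranges and strides for later analysis.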
Post-process the trajectory with `cpptraj` to check for system stability (RMSD) and identify flexible regions (RMSF).
# cpptraj input script (save as analysis.in)
# Load topology
parm complex.prmtop
# Load trajectory, processing every 10th frame to save time
trajin prod.nc 1 last 10
# Center and image molecules to handle periodic boundary conditions
autoimage
# Calculate RMSD of protein backbone relative to the first frame
rms backbone_rmsd :1-286@CA,C,N out rmsd_backbone.dat
# Calculate RMSF of C-alpha atoms, averaged over the trajectory
atomicfluct out rmsf_ca.dat @CA byres
# Calculate radius of gyration
radgyr out radgyr.dat
run
quit
# Run cpptraj
cpptraj -i analysis.in > analysis.log
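Once the data files exist, a quick numerical summary is often enough to judge stability before plotting. The sketch below assumes the usual two-column layout (a `#` header line, then frame number and RMSD) and runs on synthetic numbers, not real simulation output.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Synthetic stand-in for rmsd_backbone.dat: a '#' header, then frame and RMSD columns.
cat > rmsd_backbone.dat <<'EOF'
#Frame    backbone_rmsd
       1       0.0000
       2       1.2000
       3       1.8000
       4       2.0000
EOF

# Skip comment lines, then accumulate the mean and the maximum of column 2.
summary=$(awk '!/^#/ { sum += $2; n++; if ($2 > max) max = $2 }
               END   { printf "mean=%.2f max=%.2f n=%d", sum / n, max, n }' rmsd_backbone.dat)
echo "$summary"
```

The same one-liner works for radgyr.dat, since it shares the header-plus-columns layout.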
Calculate the binding free energy using the `MMPBSA.py` script. This involves first creating separate topology files for the complex, receptor, and ligand.
# 5a. Create unsolvated topologies for the complex, receptor, and ligand from the solvated complex.prmtop.
# Replace 'LIG' with the 3-letter residue name of your ligand.
ante-MMPBSA.py -p complex.prmtop -c complex_dry.prmtop -r receptor.prmtop -l ligand.prmtop -s ":WAT,Na+,Cl-" -n ":LIG"
# 5b. MMPBSA.py input file (save as mmpbsa.in)
# Frames 1 to 500 are read, taking every 5th frame (100 frames in total).
# igb=5 selects the GB-OBC II model; the &pb section enables the Poisson-Boltzmann calculation.
&general
endframe=500, interval=5,
keep_files=0,
/
&gb
igb=5, saltcon=0.150,
/
&pb
istrng=0.150,
/
# 5c. Run the calculation on the production trajectory.
# -sp is the solvated topology that matches prod.nc; -cp, -rp, and -lp are the unsolvated topologies from step 5a.
MMPBSA.py -O -i mmpbsa.in -o FINAL_RESULTS_MMPBSA.dat -sp complex.prmtop -cp complex_dry.prmtop -rp receptor.prmtop -lp ligand.prmtop -y prod.nc
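The frame selection in mmpbsa.in can be checked by enumerating the frames it picks. This sketch assumes the default `startframe=1` together with the `endframe=500, interval=5` settings above.

```shell
startframe=1; endframe=500; interval=5      # startframe=1 is the assumed default
frames=$(seq "$startframe" "$interval" "$endframe")
n_frames=$(echo "$frames" | wc -l)
last_frame=$(echo "$frames" | tail -n 1)
echo "frames analysed: $n_frames (last one read: $last_frame)"
```

With a trajectory saved every 10 ps, these 100 frames cover only the first 5 ns, so `startframe` and `endframe` should be adjusted to sample the equilibrated part of the run.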
A typical workflow for running a molecular dynamics simulation of a protein-ligand complex using GROMACS. This guide assumes you have a clean protein PDB file (protein.pdb) and a ligand file (e.g., ligand.mol2). Ligand parameterization is a critical step and can be done using tools like CGenFF or ACPYPE to generate GROMACS-compatible topologies.
Prepare the protein topology, generate ligand parameters (externally), and combine them into a single complex structure.
# 1a. Generate protein topology using a chosen force field (e.g., AMBER99SB-ILDN)
gmx pdb2gmx -f protein.pdb -o protein_processed.gro -water tip3p
# (Select a force field from the interactive menu)
# 1b. Generate ligand topology (e.g., using ACPYPE for Amber parameters).
# This is a complex external step. Assume you now have 'ligand.itp' and 'ligand.gro'.
# acpype -i ligand.mol2 -n ligand
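# 1c. Combine protein and ligand GRO files into complex.gro and include the ligand topology in the main topology file.
# Manually edit topol.top (written by pdb2gmx) to add, after the force field #include line:
# ; Include ligand topology
# #include "ligand.itp"
# ... and add 'ligand 1' to the [ molecules ] section at the end.
# The molecule name must match the [ moleculetype ] name defined in ligand.itp.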
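The manual topology edit can also be scripted, which helps when preparing many ligands. The sketch below works on a heavily shortened, synthetic topol.top; the force field path and the molecule name `ligand` are assumptions, and a real file contains many more sections.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Minimal synthetic topol.top, standing in for the file written by pdb2gmx.
cat > topol.top <<'EOF'
#include "amber99sb-ildn.ff/forcefield.itp"
[ moleculetype ]
Protein_chain_A 3
[ molecules ]
Protein_chain_A 1
EOF

# Insert the ligand include right after the force field include line.
awk '{ print }
     /forcefield\.itp/ { print "; Include ligand topology"; print "#include \"ligand.itp\"" }' \
    topol.top > topol_new.top
mv topol_new.top topol.top

# Append the ligand to [ molecules ]; the name must match ligand.itp.
echo "ligand 1" >> topol.top
cat topol.top
```

Appending with `>>` is safe here only because `[ molecules ]` is the last section of the file.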
Define a simulation box around the complex, fill it with water molecules, and add ions to neutralize the system and mimic physiological salt concentration.
# 2a. Define the simulation box (e.g., a cubic box with 1.0 nm distance from solute)
gmx editconf -f complex.gro -o complex_newbox.gro -c -d 1.0 -bt cubic
# 2b. Solvate the box with water
gmx solvate -cp complex_newbox.gro -cs spc216.gro -o complex_solv.gro -p topol.top
# 2c. Add ions. First, create a .tpr file for genion.
gmx grompp -f ions.mdp -c complex_solv.gro -p topol.top -o ions.tpr
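# 2d. Run genion to replace water molecules with ions.
# (Select the 'SOL' group when prompted to embed ions)
# -neutral adds counter-ions to cancel the net charge; -conc 0.15 additionally adds about 0.15 M NaCl.
gmx genion -s ions.tpr -o complex_solv_ions.gro -p topol.top -pname NA -nname CL -neutral -conc 0.15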
Relax the system to remove any steric clashes or bad contacts before starting dynamics.
# 3a. Create the grompp input file for minimization.
gmx grompp -f minim.mdp -c complex_solv_ions.gro -p topol.top -o em.tpr
# 3b. Run energy minimization.
gmx mdrun -v -deffnm em
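The minim.mdp parameter file is referenced above but not shown. A minimal sketch of one is given below, written from the shell with a here-document; the numerical values are illustrative starting points rather than settings taken from this guide, and should be checked against the chosen force field.

```shell
workdir=$(mktemp -d)
cd "$workdir"

cat > minim.mdp <<'EOF'
; Steepest-descent energy minimization (illustrative values)
integrator    = steep     ; steepest-descent minimizer
emtol         = 1000.0    ; stop when the maximum force < 1000 kJ/mol/nm
emstep        = 0.01      ; initial step size (nm)
nsteps        = 50000     ; maximum number of steps
cutoff-scheme = Verlet
coulombtype   = PME       ; particle-mesh Ewald electrostatics
rcoulomb      = 1.0       ; short-range electrostatic cutoff (nm)
rvdw          = 1.0       ; van der Waals cutoff (nm)
pbc           = xyz       ; periodic in all three dimensions
EOF

n_params=$(grep -c '=' minim.mdp)
echo "wrote minim.mdp with $n_params parameters"
```

A file like this is also commonly reused as ions.mdp, since `grompp` only needs a valid parameter set to build the .tpr for `genion`.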
Equilibrate the system first at constant volume (NVT) to stabilize the temperature, then at constant pressure (NPT) to stabilize the density.
# 4a. NVT Equilibration (constant volume)
gmx grompp -f nvt.mdp -c em.gro -r em.gro -p topol.top -o nvt.tpr
gmx mdrun -deffnm nvt
# 4b. NPT Equilibration (constant pressure)
gmx grompp -f npt.mdp -c nvt.gro -r nvt.gro -t nvt.cpt -p topol.top -o npt.tpr
gmx mdrun -deffnm npt
Run the main, long simulation for data collection.
# Create the production .tpr file
gmx grompp -f md.mdp -c npt.gro -t npt.cpt -p topol.top -o md_0_100.tpr
# Run the production simulation (e.g., for 100 ns)
gmx mdrun -deffnm md_0_100
Post-process the trajectory to handle periodic boundary conditions and calculate metrics like RMSD and RMSF.
# 6a. Correct for periodic boundary conditions.
# (Select 'Protein' for centering, then 'System' for output)
gmx trjconv -s md_0_100.tpr -f md_0_100.xtc -o md_noPBC.xtc -pbc mol -center
# 6b. Calculate RMSD.
# (Select 'Backbone' for least-squares fit, then 'C-alpha' for RMSD calculation)
gmx rms -s md_0_100.tpr -f md_noPBC.xtc -o rmsd.xvg -tu ns
# 6c. Calculate RMSF.
# (Select 'C-alpha' for calculation)
gmx rmsf -s md_0_100.tpr -f md_noPBC.xtc -o rmsf.xvg -res
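The .xvg files written by these tools are plain text, with metadata lines starting with `#` or `@` followed by data columns. A quick summary without a plotting program might be sketched as follows; the numbers are synthetic.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Synthetic stand-in for rmsd.xvg: metadata lines, then time (ns) and RMSD (nm).
cat > rmsd.xvg <<'EOF'
# gmx rms output (synthetic example)
@    title "RMSD"
@    xaxis  label "Time (ns)"
0.0  0.000
0.1  0.110
0.2  0.150
0.3  0.160
EOF

# Drop metadata lines, then average the second half of the run.
grep -v '^[#@]' rmsd.xvg > rmsd_clean.dat
n_rows=$(wc -l < rmsd_clean.dat)
late_mean=$(awk -v n="$n_rows" 'NR > n / 2 { s += $2; k++ } END { printf "%.3f", s / k }' rmsd_clean.dat)
echo "rows: $n_rows, mean RMSD over second half: $late_mean nm"
```

Averaging only the later part of the run avoids biasing the estimate with the initial relaxation.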
Calculate the binding free energy using the `gmx_MMPBSA` tool, which is a popular third-party extension for GROMACS.
# Example gmx_MMPBSA input script (mmpbsa.in)
&general
startframe=5000, endframe=10000, interval=10,
/
&gb
igb=5, saltcon=0.150,
/
&pb
istrng=0.150,
/
# Run the calculation. This requires an index file (index.ndx) with groups
# for the receptor (protein) and the ligand. -cg takes their group numbers in that order
# (1 and 13 here; check the numbering in your own index file).
gmx_MMPBSA -O -i mmpbsa.in -cs md_0_100.tpr -ci index.ndx -cg 1 13 -ct md_noPBC.xtc -o FINAL_RESULTS.dat -eo FINAL_RESULTS.csv
This guide provides a conceptual overview for building a basic "Beowulf" cluster using commodity computers. This type of cluster consists of a "head node" (or master) that manages a set of "compute nodes" over a private network.
Install a Linux distribution (like Ubuntu Server) on all machines. Configure the network so nodes can communicate.
# On the Head Node, configure network interfaces (e.g., in /etc/netplan/01-netcfg.yaml on Ubuntu)
# Public Interface (e.g., enp0s3) - Connects to Internet (often configured by DHCP)
network:
  version: 2
  ethernets:
    enp0s3:
      dhcp4: true
    # Private Interface (e.g., enp0s8) - Connects to cluster switch
    enp0s8:
      addresses: [192.168.1.1/24]
# On each Compute Node (e.g., node01)
network:
  version: 2
  ethernets:
    enp0s3:
      addresses: [192.168.1.2/24] # Change IP for each node (1.3, 1.4, etc.)
      gateway4: 192.168.1.1 # Head node is the gateway
# Edit /etc/hosts on ALL nodes for easy name resolution
192.168.1.1 headnode
192.168.1.2 node01
192.168.1.3 node02
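For more than a few nodes, the host entries can be generated rather than typed. The sketch below follows the addressing scheme above (head node at .1, node01 at .2, and so on); the cluster size `n_nodes` is hypothetical, and the result is only printed, not written to /etc/hosts.

```shell
n_nodes=4                      # hypothetical cluster size
hosts_block=$(
    echo "192.168.1.1 headnode"
    for i in $(seq 1 "$n_nodes"); do
        # node01 gets .2, node02 gets .3, and so on
        printf '192.168.1.%d node%02d\n' $(( i + 1 )) "$i"
    done
)
echo "$hosts_block"
# To apply on a node (requires root): echo "$hosts_block" | sudo tee -a /etc/hosts
```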
Set up passwordless SSH so the head node can control the compute nodes. Set up NFS (Network File System) to share the home directories, so all nodes have access to the same files.
# On Head Node: Generate SSH keys
ssh-keygen -t rsa
# Copy public key to each compute node
ssh-copy-id user@node01
ssh-copy-id user@node02
# On Head Node: Install NFS Server and configure exports
sudo apt-get install nfs-kernel-server
sudo nano /etc/exports
# Add this line to the file (it exports /home to the private cluster network only):
# /home 192.168.1.0/24(rw,sync,no_subtree_check)
sudo exportfs -a
sudo systemctl restart nfs-kernel-server
Configure the compute nodes to mount the shared home directory from the head node.
# On ALL Compute Nodes: Install NFS Client
sudo apt-get install nfs-common
# Edit /etc/fstab to mount automatically on boot
sudo nano /etc/fstab
# Add this line:
# headnode:/home /home nfs defaults 0 0
# Mount it now without rebooting
sudo mount -a
Install MPI (Message Passing Interface) for parallel programming and a job scheduler like Slurm to manage resources.
# On ALL nodes: Install MPI
sudo apt-get install mpich # or openmpi-bin
# On Head Node: Install Slurm controller
sudo apt-get install slurm-wlm
# On Compute Nodes: Install Slurm daemon
sudo apt-get install slurmd
# --- Basic slurm.conf on Head Node (/etc/slurm-llnl/slurm.conf) ---
# ClusterName=my_cluster
# SlurmctldHost=headnode
# NodeName=node01 CPUs=4 State=UNKNOWN
# NodeName=node02 CPUs=4 State=UNKNOWN
# PartitionName=debug Nodes=node[01-02] Default=YES MaxTime=INFINITE State=UP
Compile and run a simple MPI "Hello World" program to verify that all nodes are communicating and working together.
# Create hello_mpi.c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    // Start the MPI runtime
    MPI_Init(&argc, &argv);

    // Total number of processes in the job
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    // Rank (ID) of this process
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // Host name of the node this process runs on
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(processor_name, &name_len);

    printf("Hello from processor %s, rank %d out of %d processors\n",
           processor_name, world_rank, world_size);

    // Shut down the MPI runtime
    MPI_Finalize();
    return 0;
}
# Compile on the head node (accessible to all nodes via NFS)
mpicc -o hello_mpi hello_mpi.c
# Run the job via Slurm
srun -N 2 --ntasks-per-node=1 ./hello_mpi
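For anything longer than a quick test, jobs are usually submitted as batch scripts instead of running `srun` interactively. The sketch below writes a minimal batch script for the same program; the job name, time limit, and output file pattern are illustrative choices, and the syntax check runs locally without needing Slurm.

```shell
workdir=$(mktemp -d)
cd "$workdir"

cat > hello_mpi.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello_mpi
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --time=00:05:00
#SBATCH --output=hello_mpi_%j.out

srun ./hello_mpi
EOF

bash -n hello_mpi.sbatch && echo "syntax OK"
# Submit on the head node with: sbatch hello_mpi.sbatch
```

In the output file name, `%j` is replaced by the job ID, so repeated submissions do not overwrite each other.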
Genomics is the study of an organism’s entire DNA, its genome. Unlike genetics, which zooms in on individual genes, genomics looks at the whole system: all genes, how they interact, and how they shape the biology of an organism.
The foundation of genomics lies in the central dogma, which describes the flow of genetic information from DNA to RNA (transcription) and from RNA to functional proteins (translation).
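The first step of that flow can be mimicked on the command line with simple character substitution. The sequence below is a made-up nine-base example; `rev` and `tr` are assumed to be available, as they are on typical Linux systems.

```shell
dna="ATGGCCTAA"                                   # hypothetical coding-strand sequence
rna=$(echo "$dna" | tr 'T' 'U')                   # transcription: T is replaced by U
revcomp=$(echo "$dna" | rev | tr 'ACGT' 'TGCA')   # reverse complement (template strand)
echo "DNA:  $dna"
echo "mRNA: $rna"
echo "reverse complement: $revcomp"
```

Read in triplets, the mRNA here is AUG (Met), GCC (Ala), UAA (stop), which is the information a ribosome would translate.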
These are some of the key file formats used in our lab for genomic analysis.
| Format | Extension | Description |
|---|---|---|
| FASTA | .fa, .fna, .fasta | Stores DNA or protein sequences in plain text with a header line. |
| GFF | .gff, .gff3 | Tab-delimited format for describing genomic features like genes and their locations. |
| GTF | .gtf | Based on GFF, with additional conventions specific to gene information. |
| GenBank | .gb, .gbk | Text-based format storing sequence data along with rich annotations. |
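Because GFF is tab-delimited plain text, standard command-line tools are enough for quick summaries. The sketch below counts features by type and computes a gene length on a tiny synthetic GFF3 file; the records are invented for illustration.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Synthetic GFF3 file; columns are tab-separated, column 3 is the feature type.
printf '##gff-version 3\n' > genes.gff3
printf 'chr1\tdemo\tgene\t100\t900\t.\t+\t.\tID=gene1\n' >> genes.gff3
printf 'chr1\tdemo\texon\t100\t400\t.\t+\t.\tParent=gene1\n' >> genes.gff3
printf 'chr1\tdemo\texon\t600\t900\t.\t+\t.\tParent=gene1\n' >> genes.gff3

# Count features by type, skipping '#' header lines.
type_counts=$(grep -v '^#' genes.gff3 | cut -f3 | sort | uniq -c | awk '{print $2 "=" $1}')
echo "$type_counts"

# Feature length is end - start + 1 because GFF coordinates are 1-based and inclusive.
gene_len=$(awk -F'\t' '$3 == "gene" { print $5 - $4 + 1 }' genes.gff3)
echo "gene length: $gene_len"
```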
A selection of widely used resources in genomics research that we utilize in our work.
| Name | Type | Primary Use |
|---|---|---|
| NCBI | Databases/Tools | An international resource providing access to public databases and analysis software. |
| BLAST | Tool | Finds regions of similarity between DNA or protein sequences. |
| GenBank | Database | A public repository of publicly available nucleotide sequences and their annotations. |
| Ensembl | Database | A genome browser and annotation platform, focused on vertebrate genomes. |
| GENCODE | Database | Provides high-quality reference annotation for human and mouse genes. |
| EPDnew | Database | Offers experimentally validated promoter sequences for precise transcription start site annotation. |
| YeasTSS | Database | An atlas of transcription start sites in various yeast species. |
| EnhancerAtlas | Database | A curated repository of enhancer elements across different species and tissues. |
Note: This is not an exhaustive list. Additional important resources include the UCSC Genome Browser, DDBJ, KEGG, SIB, RCB, NIBMG, EMBL, FANTOM, HUGO, and the Indian Biological Data Centre (IBDC).
Proteomics is the large-scale study of proteomes. A proteome is the complete set of proteins produced by an organism, system, or biological context. While genomics tells us what a cell *could* do (the blueprint), proteomics tells us what it is *actually doing* at a functional level.
Proteins are the primary functional molecules in a cell, acting as enzymes, structural components, and signaling molecules. Studying them directly provides a clearer picture of cellular activity than studying genes alone. This has massive implications for:
Modern proteomics relies on a combination of sophisticated experimental and computational methods.
| Technique | Description | Primary Use |
|---|---|---|
| Mass Spectrometry (MS) | A technique that measures the mass-to-charge ratio of ions. In proteomics, proteins are first digested into smaller peptides, which are then analyzed by MS. | The cornerstone of modern proteomics for protein identification and quantification. |
| 2D Gel Electrophoresis (2D-PAGE) | A method to separate a complex mixture of proteins. Proteins are separated by isoelectric point (charge) in the first dimension and by molecular mass in the second, creating a 2D map of spots. | Visualizing differences in protein expression between samples. Often used before Mass Spectrometry. |
| X-ray Crystallography & NMR | High-resolution techniques used to determine the precise 3D atomic structure of individual proteins. | Structural proteomics and structure-based drug design. |
| Yeast Two-Hybrid (Y2H) | A molecular biology technique used to discover protein-protein interactions (PPIs) by testing for physical interactions between two proteins. | Large-scale screening for PPI networks. |
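The mass-to-charge ratio measured in MS follows directly from the peptide mass and its charge state: for a peptide of neutral mass M carrying z protons, m/z = (M + z × 1.007276) / z. The sketch below evaluates this with `awk`; the helper name `mz` and the peptide mass are hypothetical.

```shell
# m/z = (M + z * m_proton) / z for a peptide of neutral mass M carrying z protons.
mz() {
    awk -v M="$1" -v z="$2" 'BEGIN { mp = 1.007276; printf "%.4f", (M + z * mp) / z }'
}

peptide_mass=1000.0000     # hypothetical monoisotopic mass in Da
mz1=$(mz "$peptide_mass" 1)
mz2=$(mz "$peptide_mass" 2)
echo "[M+H]+   m/z = $mz1"
echo "[M+2H]2+ m/z = $mz2"
```

The same peptide therefore appears at different positions in the spectrum depending on its charge state, which search engines account for when matching peaks.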
Raw experimental data from proteomics is vast and complex. Computational tools are essential to process, analyze, and interpret it.
| Name | Type | Primary Use |
|---|---|---|
| Mascot / MaxQuant | Software | Search engines that identify proteins by matching experimental mass spectrometry data against theoretical peptide masses from sequence databases. |
| Protein Data Bank (PDB) | Database | The primary worldwide repository for 3D structural data of large biological molecules, such as proteins and nucleic acids. |
| UniProt | Database | A comprehensive, high-quality, and freely accessible database of protein sequence and functional information. |
| STRING | Database | A database of known and predicted protein-protein interactions, including direct (physical) and indirect (functional) associations. |
| AlphaFold / RoseTTAFold | Software | Revolutionary deep learning-based tools for predicting the 3D structure of a protein from its amino acid sequence with high accuracy. |
Computer-Aided Drug Design (CADD) uses computational methods to simulate drug-receptor interactions, accelerating the process of identifying and optimizing new drug candidates. It is a cornerstone of modern pharmaceutical research, significantly reducing the time and cost associated with discovering new medicines.
The journey from an idea to a marketable drug is long and complex, typically broken down into several key stages where computational methods play a vital role:
CADD is broadly divided into two main categories, depending on whether the 3D structure of the target is known.
| Approach | Requirement | Description | Common Techniques |
|---|---|---|---|
| Structure-Based Drug Design (SBDD) | 3D structure of the target protein is known (from X-ray crystallography, NMR, or high-quality homology modeling). | Uses the 3D structure of the target's binding site to design or screen for ligands that fit with high affinity and specificity. | Molecular Docking, De Novo Design, Molecular Dynamics |
| Ligand-Based Drug Design (LBDD) | 3D structure of the target is unknown, but a set of molecules known to interact with it is available. | Uses the properties of these known active molecules to infer a model (a pharmacophore) that describes the necessary features for binding. | QSAR, Pharmacophore Modeling, 3D Shape Similarity |
The following steps outline a common pipeline for a structure-based drug design project, similar to the approach used by our Sanjeevini software.
A list of some fundamental databases and software used in drug discovery research.
| Name | Type | Primary Use |
|---|---|---|
| Protein Data Bank (PDB) | Database | The primary repository for 3D structures of proteins and other biomolecules, essential for SBDD. |
| PubChem / ZINC | Database | Vast, publicly available databases containing millions of small molecules for virtual screening. |
| BIMP | Database | Database of Phytochemicals of Indian Medicinal Plants. |
| AutoDock Vina / Glide | Software | Widely used academic and commercial molecular docking programs. |
| Sanjeevini | Software Suite | Our in-house, freely accessible suite for target-directed lead molecule discovery. |
| PyMOL / Chimera | Software | Molecular visualization systems used to view and analyze protein-ligand complexes. |