======================================================================
Neighborhood Correlation (standalone)
======================================================================

For details of Neighborhood Correlation please refer to
http://www.neighborhoodcorrelation.org/ or the publication:

Sequence Similarity Network Reveals Common Ancestry of Multidomain
Proteins
Song N, Joseph JM, Davis GB, Durand D
PLoS Computational Biology 4(5): e1000063
doi:10.1371/journal.pcbi.1000063

This program is a reference implementation that demonstrates the
algorithms used in the above publication.  We have focused on an
intuitive implementation with readable code.  The program has no
dependencies beyond a basic Python installation.  It has been tested
with Python version 2.5 on a Linux computer, has no OS-specific
dependencies, and should work on any complete Python installation.


======================================================================
Support
======================================================================

If you encounter any difficulties with this program, please contact
Jacob Joseph <jmjoseph@andrew.cmu.edu>.


======================================================================
Licensing
======================================================================

(C) 2008 Jacob Joseph <jmjoseph@andrew.cmu.edu>,
Nan Song, and Carnegie Mellon University
    
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 3 of the License, or (at
your option) any later version.


======================================================================
Program Execution
======================================================================

nc_standalone.py contains the program to be executed, while nc_base.py
contains helper functions called by nc_standalone.py.  These files
should be in the same directory.  Depending upon your system,
nc_standalone.py may be executed as:

./nc_standalone.py
or
python nc_standalone.py


======================================================================
Input and Output
======================================================================

This program accepts as input a list of pairwise sequence
similarities, as BLAST BIT-scores.  Please refer to
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=handbook.section.614 for
discussion of the BIT-score.

Specifically, this program will read a white-space delimited file, in
a three-column format:

    --------------------------------------
    seq_id_0   seq_id_1   bit_score
    seq_id_2   seq_id_3   bit_score
    ....
    --------------------------------------
          
No column heading should be provided.

Output is provided in the same three-column format:

    --------------------------------------
    seq_id_0   seq_id_1   nc_score
    seq_id_2   seq_id_3   nc_score
    ....
    --------------------------------------
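
As an illustration of the three-column format, the following is a
minimal Python sketch of a reader for such files.  It is not taken
from nc_standalone.py; the function name 'parse_scores' is
hypothetical, and skipping blank lines and rejecting malformed lines
are assumptions made for the example.

```python
def parse_scores(lines):
    """Yield (seq_id_a, seq_id_b, score) from white-space delimited lines.

    Hypothetical helper for illustration; not part of nc_standalone.py.
    """
    for line_number, line in enumerate(lines, 1):
        fields = line.split()  # splits on any run of spaces or tabs
        if not fields:
            continue  # assumption: blank lines are ignored
        if len(fields) != 3:
            raise ValueError("line %d: expected 3 columns, got %d"
                             % (line_number, len(fields)))
        seq_a, seq_b, score = fields
        yield seq_a, seq_b, float(score)


# Made-up example rows; real input has one line per BLAST hit.
example = [
    "seq_id_0   seq_id_1   231.5",
    "seq_id_2\tseq_id_3\t48.1",
]
records = list(parse_scores(example))
```

Because the output uses the same layout, a reader of this kind could
also be applied to the score file written by the program.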

======================================================================
Resource Consumption
======================================================================

This implementation can consume a significant amount of system memory.
The input pairwise similarity matrix is stored in memory as a nested
Python dictionary.  Absolute memory consumption is highly dependent
upon the system being used (32 or 64-bit) and number of pairwise
similarities provided.
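
To make the memory behavior concrete, here is a minimal sketch of a
nested dictionary keyed by sequence identifier.  It only illustrates
the general data structure; the function name 'build_score_table' is
hypothetical and the actual layout used in nc_base.py may differ.

```python
from collections import defaultdict


def build_score_table(records):
    """Store pairwise scores as table[seq_a][seq_b] = score.

    Hypothetical helper; shows the nested-dictionary idea only.
    """
    table = defaultdict(dict)
    for seq_a, seq_b, score in records:
        table[seq_a][seq_b] = score
    return table


# Made-up scores for three sequences.
records = [("a", "b", 120.0), ("a", "c", 45.5), ("b", "c", 60.0)]
table = build_score_table(records)

# One inner entry exists per pairwise relation, so memory use grows
# with the number of input lines rather than the number of sequences.
n_pairs = sum(len(inner) for inner in table.values())
```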

As a rough guide, the set of Mouse and Human sequences used in our
analysis included 26,197 sequences.  From this, all-against-all BLAST
yielded approximately 4.8 million pairwise relations.  On a 64-bit
Python 2.5 installation, the implementation here can be expected to
consume approximately 1.5GB of system memory.  32-bit systems can be
expected to consume roughly half this quantity.  Running time for this
dataset is approximately 16 hours on an Intel Pentium D, at 3.2GHz.

Note that the output score file will generally not have the same
number of lines as the input, because a sequence pair that does not
have a significant BLAST score may still have a reasonably high
Neighborhood Correlation score.  The '--nc_thresh' parameter can be
used to omit scores below a given threshold.  Lower thresholds may be
used for greater completeness, at the cost of a larger output file.
If '--nc_thresh' is set to zero, the number of lines in the output
file will be the square of the number of sequences.  This threshold
affects only the printed output and will not greatly affect runtime;
all scores must still be calculated.
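
The effect of the reporting threshold can be sketched as a filter
applied after scoring.  This is an illustration only: the function
name 'report' and the scores below are made up, and keeping values
exactly equal to the threshold is an assumption.

```python
def report(scores, nc_thresh=0.05):
    """Return the (seq_a, seq_b, nc_score) rows that would be printed.

    Hypothetical helper: every score in 'scores' has already been
    calculated, so the threshold only changes what is written out.
    """
    return [(a, b, nc) for (a, b, nc) in scores if nc >= nc_thresh]


# Made-up NC scores, not real results.
scores = [("a", "b", 0.92), ("a", "c", 0.03), ("b", "c", 0.0)]

default_rows = report(scores)   # uses the default threshold of 0.05
all_rows = report(scores, 0.0)  # a zero threshold keeps every pair
```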


======================================================================
Running BLAST
======================================================================

Though comprehensive BLAST usage is beyond the scope of this document,
a few aspects of BLAST execution are critical.  In particular,

 - Expectation value (blastall -e) should be set to 10*N, where N is
   the number of sequences in your dataset.  This ensures that BLAST
   returns similarity scores for all alignments.

 - Effective search space length (blastall -Y) should be set to R^2,
   where R is the number of residues in your dataset.
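
The two values above can be derived from the sequence file itself.
The following is a minimal sketch, assuming the dataset is stored in
FASTA format (header lines beginning with '>'); the function name
'blast_parameters' and the example sequences are hypothetical.

```python
def blast_parameters(fasta_lines):
    """Return suggested values for blastall -e and -Y.

    Hypothetical helper.  Assumes FASTA input: each header line
    starts with '>' and all other non-blank lines hold residues.
    """
    n_sequences = 0
    n_residues = 0
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            n_sequences += 1
        else:
            n_residues += len(line)
    return {
        "e": 10 * n_sequences,  # expectation value, 10*N
        "Y": n_residues ** 2,   # effective search space, R^2
    }


# Made-up two-sequence dataset.
fasta = [">seq_id_0", "MKTAYIAKQR", "QISFVKSHFS", ">seq_id_1", "MSDNE"]
params = blast_parameters(fasta)
```

For a real dataset, the lines of the FASTA file would be passed in
place of the example list.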

Standalone BLAST executables are available from NCBI at
http://www.ncbi.nlm.nih.gov/BLAST/download.shtml.

NCBI BLAST can output much more information than the scores required
as input to Neighborhood Correlation.  If you need robust BLAST
parsing, we recommend the parser provided by the BioPython project
(http://biopython.org).


======================================================================
Example Datasets
======================================================================

The file 'human_mouse_swissprot.dat' contains all-against-all BLAST
results for all Human and Mouse sequences in SwissProt, version 50.9.
Fragments have been excluded.  Note that 1-2 GB of system memory will
be required to execute 'nc_standalone.py' with this dataset (see
Resource Consumption, above).  A runtime of approximately 16 hours
can be expected on an Intel Pentium D, at 3.2GHz.

Execution may be accomplished with:

./nc_standalone.py -f human_mouse_swissprot.dat


======================================================================
Parameter Detail
======================================================================

Detailed explanation of program parameters is also available with: 

./nc_standalone.py -h

Options:

     -f, --flatfile <filename>   (required)
          A white-space delimited file of BIT-scores from BLAST.  Three
          columns are expected, of the format:

          --------------------------------------
          seq_id_0   seq_id_1   bit_score
          seq_id_2   seq_id_3   bit_score
          ....
          --------------------------------------

          No column heading should be provided.  Please refer to
          http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=handbook.section.614
          for an explanation of BIT-score.

     -o, --output <filename>
          Write score output to a file.  If omitted, the score list is
          printed to stdout.

     -h, --help
          Print this help message.

     --num_residues <integer>
          Number of residues in the sequence database.  Used to
          calculate SMIN, the lowest expected bit_score.  If
          unspecified, an estimate of 537 residues per sequence will
          be used, which is the average length of mouse and human
          SwissProt sequences.

     --nc_thresh <0.05>
          NC reporting threshold.  Calculated values below this
          threshold will not be reported.  Conservatively defaults to
          0.05.

     --smin_factor <0.95>
          SMIN factor.  Calculate SMIN from the expected random
          BIT-score scaled by this factor. Defaults to 0.95.
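
The relationship between '--num_residues', '--smin_factor', and SMIN
can be sketched as follows.  This is a hedged illustration, not the
code in nc_base.py.  The formula for the expected random BIT-score is
an assumption: it applies the standard bit-score relation
E = m*n*2^(-S) with the expectation value (10*N) and search space
(R^2) recommended under Running BLAST.  The function name
'estimate_smin' is hypothetical.

```python
import math

# Default from the option description above.
DEFAULT_RESIDUES_PER_SEQUENCE = 537


def estimate_smin(num_sequences, num_residues=None, smin_factor=0.95):
    """Estimate SMIN for a database of 'num_sequences' sequences.

    Hypothetical helper.  Assumes the expected random BIT-score is
    log2(R^2 / (10*N)); the program's actual calculation may differ.
    """
    if num_residues is None:
        # Fall back to the documented per-sequence estimate.
        num_residues = DEFAULT_RESIDUES_PER_SEQUENCE * num_sequences
    search_space = float(num_residues) ** 2     # R^2
    e_value = 10.0 * num_sequences              # 10*N
    expected_random_bits = math.log(search_space / e_value, 2)
    return smin_factor * expected_random_bits


# Sequence count taken from the Resource Consumption section.
smin = estimate_smin(26197)
```

Under these assumptions, a smaller '--smin_factor' lowers SMIN
proportionally, and supplying '--num_residues' replaces the
537-residue estimate with the true database size.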