.. _usage:

-----
Usage
-----

===============
Program options
===============

|Software| takes a number of command line arguments to control the program's behavior. To view a list of arguments, run |Software| without any command line arguments or with a help flag, i.e. ``AlphaImpute2``, ``AlphaImpute2 -h``, ``AlphaImpute2 -help`` or ``AlphaImpute2 --help``.

Users can check the version of the program with ``AlphaImpute2 -version``.
Remember to use the correct version of the documentation for the version of the program you are using.
For example, the link to the documentation for version ``v0.0.3`` is https://alphaimpute2.readthedocs.io/en/v0.0.3/.

There are four primary ways to run |Software|, which differ in whether population or pedigree imputation is run. The default option is to run both population and pedigree imputation in an integrated algorithm. This is the option most users will want if they have access to pedigree data on a majority of individuals. The second option is to run population imputation only, using the ``-pop_only`` flag. This option should be used if no pedigree data is available. The third option is to run only pedigree-based imputation, using the ``-ped_only`` flag. This option is not recommended for general use, but may be applicable if

(1) there are more than five generations of pedigree data, 

(2) imputation is done only on the most recent generations, 

(3) speed is a priority.

The fourth option is to run |Software| with the ``-cluster_only`` flag. This option performs |Software|'s array clustering algorithm and outputs the results of the clustering. This option may be useful for debugging how individuals are clustered.
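
As a rough illustration, the four run modes can be written out as argument lists. The sketch below is in Python; the helper ``build_command`` and the file names are hypothetical, and only the flags themselves are taken from this page.

.. code-block:: python

  def build_command(genotypes, out, pedigree=None, mode=None):
      """Return an argument list for one of the four run modes (hypothetical helper)."""
      mode_flags = {
          None: [],                          # default: integrated population and pedigree imputation
          "pop_only": ["-pop_only"],         # population imputation only
          "ped_only": ["-ped_only"],         # pedigree imputation only
          "cluster_only": ["-cluster_only"], # array clustering only
      }
      if mode not in mode_flags:
          raise ValueError(f"unknown mode: {mode!r}")
      command = ["AlphaImpute2", "-genotypes", genotypes, "-out", out]
      if pedigree is not None:
          command += ["-pedigree", pedigree]
      return command + mode_flags[mode]


  # Placeholder file names; replace them with real paths.
  print(" ".join(build_command("genotypes.txt", "run1", pedigree="pedigree.txt")))
  print(" ".join(build_command("genotypes.txt", "run2", mode="pop_only")))

The returned list can be passed to ``subprocess.run`` if the program should be launched from a script.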

Core Arguments 
--------------

::
  
  Core arguments
    -out prefix              The output file prefix.

The ``-out`` argument gives the prefix for the files in which the outputs of |Software| are stored. By default, |Software| outputs a file with imputed genotypes, ``prefix.genotypes``, and a file with phased haplotypes, ``prefix.phase``. For more information on which files are created, see "Output Arguments", below.

Input Arguments 
----------------

::

    Input arguments:
      -genotypes [GENOTYPES ...]
                            A file in AlphaGenes format.
      -pedigree [PEDIGREE ...]
                            A pedigree file in AlphaGenes format.
      -startsnp STARTSNP    The first marker to consider. The first marker in the file is marker '1'. Default: 1.
      -stopsnp STOPSNP      The last marker to consider. Default: all markers considered.
      -seed SEED            A random seed to use for debugging.

|Software| requires a genotype file, supplied using the ``-genotypes`` option, to run the analysis. The pedigree file is optional.

A pedigree file may be supplied using the ``-pedigree`` option. 

Use the ``-startsnp`` and ``-stopsnp`` arguments to run the analysis on only a subset of markers.
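
The marker range can be pictured with a short Python sketch. It assumes that markers are counted from 1 and that both ``-startsnp`` and ``-stopsnp`` are included in the range, as the help text above suggests. The function ``select_markers`` is a hypothetical helper and not part of |Software|.

.. code-block:: python

  def select_markers(genotypes, startsnp=1, stopsnp=None):
      """Keep markers startsnp..stopsnp, counting from 1 and including both ends."""
      if stopsnp is None:
          stopsnp = len(genotypes)  # default: all markers considered
      return genotypes[startsnp - 1:stopsnp]


  row = [0, 2, 9, 0, 1, 1]
  print(select_markers(row, startsnp=2, stopsnp=4))  # [2, 9, 0]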

Imputation arguments
--------------------

::

    Impute options:
      -maxthreads MAXTHREADS
                            Number of threads to use. Default: 1.
      -binaryoutput         Flag to write out the genotypes as a binary plink
                            output.
      -phase_output         Flag to write out the phase information.
      -seg_output           Flag to write out the segmentation information.
      -pop_only             Flag to run the population based imputation algorithm
                            only.
      -ped_only             Flag to run the pedigree based imputation algorithm
                            only.
      -cluster_only         Flag to just cluster individuals into marker arrays
                            and write out results.
      -length LENGTH        Estimated map length for pedigree and population
                            imputation in Morgans. Default: 1 (100cM).

These options control how imputation is run. The ``-maxthreads`` argument can be used to allow multiple threads to be used for imputation. This argument can be set separately from the ``-iothreads`` argument. The speed gain from using multiple threads is close to linear for population imputation, but is more limited for pedigree-based imputation.

The ``-length`` argument controls the assumed length of the chromosome (in Morgans). We have found that imputation is largely insensitive to this value, so keeping it at its default of 1 should work in many cases. There are additional options to control the assumed recombination used for population-based imputation (below).

The ``-binaryoutput`` flag tells the program to write out files in plink binary format. Writing binary plink files requires the package ``alphaplinkpython``. This can be installed via ``pip`` but is only stable for Linux. A fake map file is generated.

The remaining options control how |Software| is run.

Pedigree imputation options
---------------------------

::

    Pedigree imputation options:
      -cycles CYCLES        Number of peeling cycles. Default: 4
      -final_peeling_threshold FINAL_PEELING_THRESHOLD
                            Genotype calling threshold for final round of peeling.
                            Default: 0.1 (best guess genotypes).

These options control how pedigree imputation is run for either the pedigree-only algorithm or the combined algorithm. ``-cycles`` controls the number of cycles of peeling that are performed. An additional very-high-confidence cycle is always performed in addition to the cycles specified here. We recommend using the default value of 4 cycles. Additional cycles seem to provide limited benefit in most pedigrees.

The ``-final_peeling_threshold`` argument gives the genotype calling threshold for the final round of peeling. This applies to both the pedigree-only and the combined algorithm. We recommend using either best guess genotypes (the default, with a cutoff of 0.1) or high confidence genotypes (with a cutoff of 0.95). Values that cannot be imputed with high enough confidence will be coded as missing.
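
The effect of a calling threshold can be sketched in Python. This is an illustration only, not |Software|'s internal code: it assumes one probability for each of the genotypes 0, 1 and 2, calls the most likely genotype when its probability is strictly above the threshold, and codes everything else as 9 (missing). The function name ``call_genotype`` is hypothetical.

.. code-block:: python

  def call_genotype(probabilities, threshold=0.1):
      """Call the most likely genotype, or 9 (missing) if it is not confident enough.

      probabilities: three values for genotypes 0, 1 and 2 (assumed layout).
      """
      best = max(range(len(probabilities)), key=lambda g: probabilities[g])
      return best if probabilities[best] > threshold else 9


  probs = [0.05, 0.60, 0.35]
  print(call_genotype(probs, threshold=0.1))   # 1
  print(call_genotype(probs, threshold=0.95))  # 9

With a cutoff of 0.1 the most likely genotype is almost always called, which is why this setting corresponds to best guess genotypes.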

Population imputation options 
-----------------------------

::

  Population imputation options:
    -n_phasing_cycles N_PHASING_CYCLES
                          Number of phasing cycles. Default: 5
    -n_phasing_particles N_PHASING_PARTICLES
                          Number of phasing particles. Defualt: 40.
    -n_imputation_particles N_IMPUTATION_PARTICLES
                          Number of imputation particles. Defualt: 100.
    -hd_threshold HD_THRESHOLD
                          Percentage of non-missing markers for an individual be
                          classified as high-density when building the haplotype
                          library. Default: 0.95.
    -min_chip MIN_CHIP    Minimum number of individuals on an inferred low-
                          density chip for it to be considered a low-density
                          chip. Default: 0.05
    -phasing_loci_inclusion_threshold PHASING_LOCI_INCLUSION_THRESHOLD
                          Percentage of non-missing markers per loci for it to
                          be included on a chip for imputation. Default: 0.9.
    -imputation_length_modifier IMPUTATION_LENGTH_MODIFIER
                          Increases the effective map length of the chip for
                          population imputation by this amount. Default: 1.
    -phasing_length_modifier PHASING_LENGTH_MODIFIER
                          Increases the effective map length of the chip for
                          Phasing imputation by this amount. Default: 5.
    -phasing_consensus_window_size PHASING_CONSENSUS_WINDOW_SIZE
                          Number of markers used to evaluate haplotypes when
                          creating a consensus haplotype. Default: 50.

These options control how population imputation is run. This algorithm uses a particle-based imputation approach where a number of particles are used to explore genotype combinations with high posterior probability. Increasing the number of particles can increase accuracy. Use the options ``-n_phasing_particles`` and ``-n_imputation_particles`` to change the number of particles run for phasing and imputation.

|Software| uses a number of rounds of phasing in order to iteratively build a haplotype reference panel from the observed data. The argument ``-n_phasing_cycles`` controls the number of rounds that are used for phasing. In pilot testing we have found that the default value of 5 cycles tends to give good accuracy. Additional accuracy may be possible by slightly increasing this value.

To perform phasing and imputation, |Software| selects high-density individuals to form the haplotype reference panel. ``-hd_threshold`` gives the proportion of non-missing markers an individual needs to carry to be included in the high-density reference panel.
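
As a minimal Python sketch of this criterion, assume that missing genotypes are coded as 9 and that an individual qualifies when its proportion of non-missing markers is at least ``-hd_threshold``. The exact comparison used by |Software| is not stated here, so the ``>=`` is an assumption, and ``is_high_density`` is a hypothetical name.

.. code-block:: python

  def is_high_density(genotypes, hd_threshold=0.95, missing=9):
      """True if the proportion of non-missing markers reaches hd_threshold."""
      if not genotypes:
          return False
      observed = sum(1 for g in genotypes if g != missing)
      return observed / len(genotypes) >= hd_threshold


  print(is_high_density([0, 1, 2, 1] * 5))  # True: all 20 markers observed
  print(is_high_density([0, 9, 9, 1] * 5))  # False: only 10 of 20 observed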

Similar to ``-length``, the ``-imputation_length_modifier`` and ``-phasing_length_modifier`` arguments control the assumed chromosome length for imputation and phasing. These values are applied multiplicatively to the ``-length`` option. We have found that imputation accuracy is not very sensitive to these values and recommend leaving them at their default values.
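
The multiplicative relationship can be written as a one-line calculation. The Python sketch below only restates the arithmetic described above, and ``effective_length`` is a hypothetical name.

.. code-block:: python

  def effective_length(length=1.0, modifier=1.0):
      """Effective map length in Morgans: -length scaled by a length modifier."""
      return length * modifier


  print(effective_length(1.0, modifier=1.0))  # imputation defaults: 1.0
  print(effective_length(1.0, modifier=5.0))  # phasing defaults: 5.0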

When |Software| is run, multiple particles are merged based on each particle's score in a window centered around each marker. ``-phasing_consensus_window_size`` controls the size of the window. Increasing this value can increase imputation accuracy if the low-density panel is very sparse compared to the high-density panel.

Joint imputation options 
------------------------

::

  Joint imputation options:
    -chip_threshold CHIP_THRESHOLD
                          Proportion more high density markers parents need to
                          be used over population imputation. Default: 0.95
    -final_peeling_threshold_for_phasing FINAL_PEELING_THRESHOLD_FOR_PHASING
                          Genotype calling threshold for first round of peeling
                          before phasing. This value should be conservative..
                          Default: 0.9.

These options control how population and pedigree imputation are combined. As part of the combined algorithm, |Software| detects a small number of "pseudo-founders" to impute using the population imputation algorithm. These "pseudo-founders" are selected by finding individuals with higher genotyping densities than their parents. |Software| tries to be conservative in which individuals are selected as "pseudo-founders", and the ``-chip_threshold`` parameter tells the algorithm how many more non-missing markers an individual needs compared to its parents to be considered a "pseudo-founder".

Similar to the ``-final_peeling_threshold`` argument, the ``-final_peeling_threshold_for_phasing`` argument gives the final peeling threshold for the initial round of pedigree imputation in the combined algorithm. 

============
File formats
============

Input file formats
------------------

Genotype file 
=============

Genotype files contain the input genotypes for each individual. The first value in each line is the individual's id. The remaining values are the genotypes of the individual at each locus, either 0, 1, or 2 (or 9 if missing). The following example gives the genotypes for four individuals genotyped on four markers each.

Example: ::

  id1 0 2 9 0 
  id2 1 1 1 1 
  id3 2 0 2 0 
  id4 0 2 1 0
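
A file in this layout can be read with a few lines of Python. The sketch below is a minimal reader based only on the description above. It assumes whitespace-separated values, and ``read_genotypes`` is a hypothetical helper, not the parser used by |Software|.

.. code-block:: python

  def read_genotypes(lines):
      """Parse genotype lines into {id: [genotype, ...]}."""
      genotypes = {}
      for line in lines:
          fields = line.split()
          if not fields:
              continue  # skip blank lines
          values = [int(v) for v in fields[1:]]
          if any(v not in (0, 1, 2, 9) for v in values):
              raise ValueError(f"unexpected genotype code for {fields[0]}")
          genotypes[fields[0]] = values
      return genotypes


  example = ["id1 0 2 9 0", "id2 1 1 1 1", "id3 2 0 2 0", "id4 0 2 1 0"]
  genotypes = read_genotypes(example)
  print(genotypes["id1"])  # [0, 2, 9, 0]

To read from disk, pass an open file object instead of the list, since iterating over a file yields its lines.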

Pedigree file
=============

Each line of a pedigree file has three values, the individual's id, their father's id, and their mother's id. "0" represents an unknown id.

Example: ::

  id1 0 0
  id2 0 0
  id3 id1 id2
  id4 id1 id2
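
The pedigree layout can be read in the same way. The Python sketch below assumes whitespace-separated values and represents an unknown parent ("0") as ``None``. ``read_pedigree`` is a hypothetical helper.

.. code-block:: python

  def read_pedigree(lines):
      """Parse pedigree lines into {id: (father, mother)}; None marks an unknown parent."""
      pedigree = {}
      for line in lines:
          fields = line.split()
          if not fields:
              continue  # skip blank lines
          if len(fields) != 3:
              raise ValueError(f"expected 3 values, got {len(fields)}: {line!r}")
          ind, father, mother = fields
          pedigree[ind] = (None if father == "0" else father,
                           None if mother == "0" else mother)
      return pedigree


  pedigree = read_pedigree(["id1 0 0", "id2 0 0", "id3 id1 id2", "id4 id1 id2"])
  founders = [ind for ind, parents in pedigree.items() if parents == (None, None)]
  print(founders)  # ['id1', 'id2']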

Output file formats
-------------------

Genotype file 
=============

The output genotype file contains the imputed genotypes for each individual. The first value in each line is the individual's id. The remaining values are the genotypes of the individual at each locus, either 0, 1, or 2 (or 9 if missing). The following example gives the genotypes for four individuals genotyped on four markers each.

Example: ::

  id1 0 2 9 0 
  id2 1 1 1 1 
  id3 2 0 2 0 
  id4 0 2 1 0

Phase file
==========

The phase file gives the phased haplotypes (either 0 or 1) for each individual in two lines. For individuals where we can determine the haplotype of origin, the first line will provide information on the paternal haplotype, and the second line will provide information on the maternal haplotype.

Example: ::

  id1 0 1 9 0 # Paternal haplotype
  id1 0 1 9 0 # Maternal haplotype
  id2 1 1 1 0
  id2 0 0 0 1
  id3 1 0 1 0
  id3 1 0 1 0 
  id4 0 1 0 0
  id4 0 1 1 0
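
A phase file can be read by pairing up the two lines for each individual. The Python sketch below is a minimal reader based on the description above. It assumes whitespace-separated values, strips ``#`` annotations such as those in the example, and shows how the two haplotypes add up to a genotype. The names ``read_phase`` and ``to_genotype`` are hypothetical.

.. code-block:: python

  def read_phase(lines):
      """Parse phase lines into {id: (first_haplotype, second_haplotype)}."""
      haplotypes = {}
      for line in lines:
          fields = line.split("#")[0].split()  # drop trailing annotations
          if not fields:
              continue  # skip blank lines
          haplotypes.setdefault(fields[0], []).append([int(v) for v in fields[1:]])
      return {ind: tuple(pair) for ind, pair in haplotypes.items()}


  def to_genotype(first, second, missing=9):
      """Sum two haplotypes into genotypes, keeping 9 where either allele is missing."""
      return [missing if missing in (a, b) else a + b for a, b in zip(first, second)]


  phase = read_phase(["id1 0 1 9 0", "id1 0 1 9 0", "id2 1 1 1 0", "id2 0 0 0 1"])
  print(to_genotype(*phase["id1"]))  # [0, 2, 9, 0]
  print(to_genotype(*phase["id2"]))  # [1, 1, 1, 1]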