.. _usage:

-----
Usage
-----

===============
Program options
===============

|Software| takes in a number of command line arguments to control the program's behavior. To view a list of arguments, run |Software| without any command line arguments, or run ``AlphaImpute2 -h``, ``AlphaImpute2 -help`` or ``AlphaImpute2 --help``. Users can check the version of the program with ``AlphaImpute2 -version``. Remember to use the correct version of the documentation for the version of the program you are using. For example, the link to the documentation for version ``v0.0.3`` is https://alphaimpute2.readthedocs.io/en/v0.0.3/.

There are four primary ways to run |Software|, which differ in whether population or pedigree imputation is run. The default option is to run both population and pedigree imputation in an integrated algorithm. This is the option most users will want if they have access to pedigree data on a majority of individuals. The second option is to run population imputation only, with the ``-pop_only`` flag. This option should be used if no pedigree data is available. The third option is to run only pedigree-based imputation, using the ``-ped_only`` flag. This option is not recommended for general use cases, but may be applicable if (1) there are more than five generations of pedigree data, (2) imputation is done only on the most recent generations, (3) speed is a priority. The fourth option is to run |Software| with the ``-cluster_only`` flag. This option performs |Software|'s array clustering algorithm and outputs the results of the clustering. This option may be useful for debugging how individuals are clustered.

Core Arguments
--------------

::

    Core arguments
      -out prefix           The output file prefix.

The ``-out`` argument gives the output file prefix for where the outputs of |Software| should be stored. By default, |Software| outputs a file with imputed genotypes, ``prefix.genotypes``, and a file with phased haplotypes, ``prefix.phase``.
For more information on which files are created, see "Output file formats", below.

Input Arguments
----------------

::

    Input arguments:
      -genotypes [GENOTYPES ...]
                            A file in AlphaGenes format.
      -pedigree [PEDIGREE ...]
                            A pedigree file in AlphaGenes format.
      -startsnp STARTSNP    The first marker to consider. The first marker in the
                            file is marker '1'. Default: 1.
      -stopsnp STOPSNP      The last marker to consider. Default: all markers
                            considered.
      -seed SEED            A random seed to use for debugging.

|Software| requires a genotype file to run the analysis; a pedigree file is optional and may be supplied using the ``-pedigree`` option. Use the ``-startsnp`` and ``-stopsnp`` arguments to run the analysis only on a subset of markers.

Imputation arguments:
---------------------

::

    Impute options:
      -maxthreads MAXTHREADS
                            Number of threads to use. Default: 1.
      -binaryoutput         Flag to write out the genotypes as a binary plink
                            output.
      -phase_output         Flag to write out the phase information.
      -seg_output           Flag to write out the segmentation information.
      -pop_only             Flag to run the population based imputation algorithm
                            only.
      -ped_only             Flag to run the pedigree based imputation algorithm
                            only.
      -cluster_only         Flag to just cluster individuals into marker arrays
                            and write out results.
      -length LENGTH        Estimated map length for pedigree and population
                            imputation in Morgans. Default: 1 (100cM).

These options control how imputation is run. The ``-maxthreads`` argument allows multiple threads to be used for imputation. This argument can be set separately from the ``-iothreads`` argument (above). The speed gain from using multiple threads is close to linear for population imputation, but is more limited for pedigree-based imputation.

The ``-length`` argument controls the assumed length of the chromosome (in Morgans). We have found that imputation is largely insensitive to this value, so keeping it at its default of 1 should work in many cases.
There are additional options to control the assumed recombination used for population-based imputation (below).

The ``-binaryoutput`` flag tells the program to write out files in plink binary format. Binary plink files require the package ``alphaplinkpython``. This can be installed via ``pip`` but is only stable for Linux. A fake map file is generated.

The remaining options control how |Software| is run.

Pedigree imputation options
---------------------------

::

    Pedigree imputation options:
      -cycles CYCLES        Number of peeling cycles. Default: 4
      -final_peeling_threshold FINAL_PEELING_THRESHOLD
                            Genotype calling threshold for final round of peeling.
                            Default: 0.1 (best guess genotypes).

These options control how pedigree imputation is run for either the pedigree-only algorithm or the combined algorithm. ``-cycles`` controls the number of cycles of peeling that are performed. An additional very-high-confidence cycle is always performed in addition to the cycles specified here. We recommend using the default value of 4 cycles. Additional cycles seem to provide limited benefit in most pedigrees.

The ``-final_peeling_threshold`` argument gives the genotype calling threshold for the final round of peeling. This applies to both the pedigree-only and the combined algorithm. We recommend using either best guess genotypes (the default, with a cutoff of 0.1) or high confidence genotypes (with a cutoff of 0.95). Values that cannot be imputed with high enough confidence will be coded as missing.

Population imputation options
-----------------------------

::

    Population imputation options:
      -n_phasing_cycles N_PHASING_CYCLES
                            Number of phasing cycles. Default: 5
      -n_phasing_particles N_PHASING_PARTICLES
                            Number of phasing particles. Defualt: 40.
      -n_imputation_particles N_IMPUTATION_PARTICLES
                            Number of imputation particles. Defualt: 100.
      -hd_threshold HD_THRESHOLD
                            Percentage of non-missing markers for an individual be
                            classified as high-density when building the haplotype
                            library. Default: 0.95.
      -min_chip MIN_CHIP    Minimum number of individuals on an inferred
                            low-density chip for it to be considered a low-density
                            chip. Default: 0.05
      -phasing_loci_inclusion_threshold PHASING_LOCI_INCLUSION_THRESHOLD
                            Percentage of non-missing markers per loci for it to
                            be included on a chip for imputation. Default: 0.9.
      -imputation_length_modifier IMPUTATION_LENGTH_MODIFIER
                            Increases the effective map length of the chip for
                            population imputation by this amount. Default: 1.
      -phasing_length_modifier PHASING_LENGTH_MODIFIER
                            Increases the effective map length of the chip for
                            Phasing imputation by this amount. Default: 5.
      -phasing_consensus_window_size PHASING_CONSENSUS_WINDOW_SIZE
                            Number of markers used to evaluate haplotypes when
                            creating a consensus haplotype. Default: 50.

These options control how population imputation is run. This algorithm uses a particle-based imputation approach where a number of particles are used to explore genotype combinations with high posterior probability. Increasing the number of particles can increase accuracy. Use the options ``-n_phasing_particles`` and ``-n_imputation_particles`` to change the number of particles run for phasing and imputation.

|Software| uses a number of rounds of phasing to iteratively build a haplotype reference panel from the observed data. The argument ``-n_phasing_cycles`` controls the number of rounds that are used for phasing. In pilot testing we have found that the default value of 5 cycles tends to give good accuracy. Additional accuracy may be possible by slightly increasing this value.

To perform phasing and imputation, |Software| selects high-density individuals to form the haplotype reference panel. ``-hd_threshold`` gives the percentage of non-missing markers an individual needs to carry to be included in the high-density reference panel.

Similar to ``-length``, the ``-imputation_length_modifier`` and ``-phasing_length_modifier`` arguments control the assumed chromosome length for phasing and imputation.
These values are applied multiplicatively to the ``-length`` option. We have found that imputation accuracy is not very sensitive to these values and recommend leaving them at their default values.

When |Software| is run, multiple particles are merged based on each particle's score in a window centered around each marker. ``-phasing_consensus_window_size`` controls the size of the window. Increasing this value can increase imputation accuracy if the low-density panel is very sparse compared to the high-density panel.

Joint imputation options
------------------------

::

    Joint imputation options:
      -chip_threshold CHIP_THRESHOLD
                            Proportion more high density markers parents need to
                            be used over population imputation. Default: 0.95
      -final_peeling_threshold_for_phasing FINAL_PEELING_THRESHOLD_FOR_PHASING
                            Genotype calling threshold for first round of peeling
                            before phasing. This value should be conservative..
                            Default: 0.9.

These options control how population and pedigree imputation are combined. As part of the combined algorithm, |Software| detects a small number of "pseudo-founders" to impute using the population imputation algorithm. These "pseudo-founders" are selected by finding individuals with higher genotyping densities than their parents. |Software| tries to be conservative in which individuals are selected as "pseudo-founders", and the ``-chip_threshold`` parameter tells the algorithm how many more non-missing markers an individual needs compared to their parents to be considered a "pseudo-founder".

Similar to the ``-final_peeling_threshold`` argument, the ``-final_peeling_threshold_for_phasing`` argument gives the final peeling threshold for the initial round of pedigree imputation in the combined algorithm.

============
File formats
============

Input file formats
------------------

Genotype file
=============

Genotype files contain the input genotypes for each individual. The first value in each line is the individual's id.
The remaining values are the genotypes of the individual at each locus, either 0, 1, or 2 (or 9 if missing). The following example gives the genotypes for four individuals genotyped on four markers each.

Example: ::

    id1 0 2 9 0
    id2 1 1 1 1
    id3 2 0 2 0
    id4 0 2 1 0

Pedigree file
=============

Each line of a pedigree file has three values: the individual's id, their father's id, and their mother's id. "0" represents an unknown id.

Example: ::

    id1 0 0
    id2 0 0
    id3 id1 id2
    id4 id1 id2

Output file formats
-------------------

Genotype file
=============

Genotype files contain the imputed genotypes for each individual. The first value in each line is the individual's id. The remaining values are the genotypes of the individual at each locus, either 0, 1, or 2 (or 9 if missing). The following example gives the genotypes for four individuals genotyped on four markers each.

Example: ::

    id1 0 2 9 0
    id2 1 1 1 1
    id3 2 0 2 0
    id4 0 2 1 0

Phase file
==========

The phase file gives the phased haplotypes (either 0 or 1) for each individual in two lines. For individuals where we can determine the haplotype of origin, the first line will provide information on the paternal haplotype, and the second line will provide information on the maternal haplotype.

Example: ::

    id1 0 1 9 0 # Paternal haplotype
    id1 0 1 9 0 # Maternal haplotype
    id2 1 1 1 0
    id2 0 0 0 1
    id3 1 0 1 0
    id3 1 0 1 0
    id4 0 1 0 0
    id4 0 1 1 0
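
The relationship between the phase and genotype formats can be illustrated with a short Python sketch. It is not part of |Software|; the function names ``read_phase`` and ``haplotypes_to_genotypes`` are hypothetical. It assumes whitespace-delimited lines without trailing comments, two consecutive lines per individual, and a missing allele coded as 9 as in the example above. Each genotype is taken as the sum of the two haplotype alleles, and is set to missing if either allele is missing.

```python
MISSING = 9  # missing-value code used in the examples above


def read_phase(lines):
    """Parse phase-file lines into {id: (first_haplotype, second_haplotype)}.

    Assumes each individual appears on exactly two lines.
    """
    haplotypes = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        ind_id = fields[0]
        alleles = [int(value) for value in fields[1:]]
        haplotypes.setdefault(ind_id, []).append(alleles)
    return {ind_id: tuple(haps) for ind_id, haps in haplotypes.items()}


def haplotypes_to_genotypes(first, second):
    """Sum two haplotypes into 0/1/2 genotypes, keeping 9 for missing."""
    return [
        MISSING if MISSING in (a, b) else a + b
        for a, b in zip(first, second)
    ]


# The phase example from this section, with the comments removed.
phase_lines = """\
id1 0 1 9 0
id1 0 1 9 0
id2 1 1 1 0
id2 0 0 0 1
id3 1 0 1 0
id3 1 0 1 0
id4 0 1 0 0
id4 0 1 1 0
""".splitlines()

phased = read_phase(phase_lines)
genotypes = {
    ind_id: haplotypes_to_genotypes(first, second)
    for ind_id, (first, second) in phased.items()
}

for ind_id, values in genotypes.items():
    print(ind_id, *values)
```

Applied to the phase example, this reproduces the rows of the genotype example (``id1 0 2 9 0`` through ``id4 0 2 1 0``), which is a quick way to check that a pair of output files is internally consistent.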