Chooses the correct function to extract variants from each input based on
the class of the object or the file extension. Different types of objects
can be mixed within the list. For example, the list can include VCF files
and maf objects. Certain parameters, such as id and rename, apply only to
VCF objects or files and need to be specified individually for each VCF.
These parameters should therefore be supplied as a vector that is the same
length as the number of inputs. If other types of objects are in the input
list, the values of id and rename are ignored for those items.
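As a minimal sketch of the alignment described above, the rename vector can be built so that it has one entry per input, with NA in the positions of non-VCF inputs. The file names below are hypothetical placeholders, and detecting VCFs by file extension is an assumption made for illustration only.

```r
# Hypothetical input paths; the files do not need to exist for this sketch.
inputs <- c("tumor_a.vcf", "tumor_b.vcf.gz", "cohort.maf")

# Flag VCF-like inputs by file extension (an illustrative assumption).
is_vcf <- grepl("\\.vcf(\\.gz)?$", inputs)

# One entry per input; NA marks inputs where rename would be ignored.
new_names <- rep(NA_character_, length(inputs))
new_names[is_vcf] <- paste0("Sample", seq_len(sum(is_vcf)))

# The vector must be the same length as the inputs.
stopifnot(length(new_names) == length(inputs))

# The aligned vector could then be passed on, for example:
# variants <- extract_variants(inputs = inputs, rename = new_names)
```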
extract_variants(
  inputs,
  id = NULL,
  rename = NULL,
  sample_field = NULL,
  filename_as_id = FALSE,
  strip_extension = c(".vcf", ".vcf.gz", ".gz"),
  filter = TRUE,
  multiallele = c("expand", "exclude"),
  fix_vcf_errors = TRUE,
  extra_fields = NULL,
  chromosome_col = "chr",
  start_col = "start",
  end_col = "end",
  ref_col = "ref",
  alt_col = "alt",
  sample_col = "sample",
  verbose = TRUE
)
Argument | Description |
---|---|
inputs | A vector or list of objects or file names. Objects can be CollapsedVCF, ExpandedVCF, MAF, an object that inherits from |
id | A character vector the same length as |
rename | A character vector the same length as |
sample_field | Some algorithms will save the name of the sample in the ##SAMPLE portion of the header in the VCF. See |
filename_as_id | If set to |
strip_extension | Only used if |
filter | Exclude variants that do not have a |
multiallele | Multialleles are when multiple alternative variants are listed in the same row in the VCF. See |
fix_vcf_errors | Attempt to automatically fix VCF file formatting errors. See |
extra_fields | Optionally extract additional fields from all input objects. Default NULL. |
chromosome_col | The name of the column that contains the chromosome reference for each variant. Only used if the input is a matrix or data.frame. Default "chr". |
start_col | The name of the column that contains the start position for each variant. Only used if the input is a matrix or data.frame. Default "start". |
end_col | The name of the column that contains the end position for each variant. Only used if the input is a matrix or data.frame. Default "end". |
ref_col | The name of the column that contains the reference base(s) for each variant. Only used if the input is a matrix or data.frame. Default "ref". |
alt_col | The name of the column that contains the alternative base(s) for each variant. Only used if the input is a matrix or data.frame. Default "alt". |
sample_col | The name of the column that contains the sample id for each variant. Only used if the input is a matrix or data.frame. Default "sample". |
verbose | Show progress of variant extraction. Default TRUE. |
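The two multiallele options named in the table can be pictured with a small base R sketch. This is a conceptual illustration on a toy data.frame, not the package's implementation. The table contents, and the assumption that multiple alternative alleles appear comma-separated in a single alt entry (as they do in VCF text), are hypothetical.

```r
# Toy variant table; the second row lists two alternative alleles.
variants <- data.frame(
  chr = c("chr1", "chr2"),
  start = c(100L, 200L),
  ref = c("A", "G"),
  alt = c("T", "C,T"),
  stringsAsFactors = FALSE
)

# Rows that list more than one alternative allele.
is_multi <- grepl(",", variants$alt, fixed = TRUE)

# "exclude": drop multiallelic rows entirely.
excluded <- variants[!is_multi, ]

# "expand": emit one row per alternative allele.
alts <- strsplit(variants$alt, ",", fixed = TRUE)
expanded <- variants[rep(seq_len(nrow(variants)), lengths(alts)), ]
expanded$alt <- unlist(alts)
rownames(expanded) <- NULL
```

In this sketch, excluding keeps only the biallelic chr1 row, while expanding turns the chr2 row into two rows, one for each alternative allele.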
Returns a data.table of the variants extracted from the inputs.
# Get locations of the vcf files and a maf file
luad_vcf_file <- system.file("extdata", "public_LUAD_TCGA-97-7938.vcf",
  package = "musicatk")
lusc_maf_file <- system.file("extdata", "public_TCGA.LUSC.maf",
  package = "musicatk")
melanoma_vcfs <- list.files(system.file("extdata", package = "musicatk"),
  pattern = glob2rx("*SKCM*vcf"), full.names = TRUE)

# Read all files in at once
inputs <- c(luad_vcf_file, melanoma_vcfs, lusc_maf_file)
variants <- extract_variants(inputs = inputs)
#>
#> TCGA-97-7938-01A-11D-2167-08 TCGA-EE-A3J5-06A-11D-A20D-08
#>                          121                          123
#> TCGA-ER-A197-06A-32D-A197-08 TCGA-ER-A19O-06A-11D-A197-08
#>                           13                           52
#> TCGA-56-7582-01A-11D-2042-08 TCGA-77-7335-01A-11D-2042-08
#>                          199                          283
#> TCGA-94-7557-01A-11D-2122-08
#>                          120

# Run again but renaming samples in first four vcfs
new_name <- c(paste0("Sample", 1:4), NA)
variants <- extract_variants(inputs = inputs, rename = new_name)
#>
#>                      Sample1                      Sample2
#>                          121                          123
#>                      Sample3                      Sample4
#>                           13                           52
#> TCGA-56-7582-01A-11D-2042-08 TCGA-77-7335-01A-11D-2042-08
#>                          199                          283
#> TCGA-94-7557-01A-11D-2122-08
#>                          120
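The column-name arguments only matter for matrix or data.frame inputs, so a further sketch may help show how they could be used. The data.frame below and its column names are hypothetical. Wrapping it in a list, and the assumption that the call succeeds on such a small table, have not been verified against the package. The call is guarded so the block still runs when musicatk is not installed.

```r
# Hypothetical variants stored under non-default column names.
variant_df <- data.frame(
  Chromosome = c("chr1", "chr1", "chr2"),
  Start_Position = c(1000L, 2000L, 3000L),
  End_Position = c(1000L, 2000L, 3000L),
  Reference_Allele = c("A", "C", "G"),
  Tumor_Seq_Allele2 = c("T", "G", "A"),
  Tumor_Sample_Barcode = c("S1", "S1", "S2"),
  stringsAsFactors = FALSE
)

# Map each *_col argument to the matching column of the data.frame.
if (requireNamespace("musicatk", quietly = TRUE)) {
  variants <- musicatk::extract_variants(
    inputs = list(variant_df),
    chromosome_col = "Chromosome",
    start_col = "Start_Position",
    end_col = "End_Position",
    ref_col = "Reference_Allele",
    alt_col = "Tumor_Seq_Allele2",
    sample_col = "Tumor_Sample_Barcode",
    verbose = FALSE
  )
}
```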