Chooses the correct function to extract variants from each input based on the class of the object or the file extension. Different types of objects can be mixed within the list; for example, the list can include VCF files and maf objects. Certain parameters, such as id and rename, apply only to VCF objects or files and must be specified individually for each VCF. These parameters should therefore be supplied as a vector with the same length as inputs. For any other type of object in the input list, the corresponding values of id and rename are ignored.
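The dispatch described above can be pictured with a minimal base R sketch. This is not musicatk's implementation: `classify_input` is a hypothetical helper, the file names are made up, and the extension patterns are assumptions chosen for illustration only.

```r
# Hypothetical helper: decide how an input would be handled, based on its
# class (for objects) or its file extension (for character paths).
classify_input <- function(x) {
  if (is.character(x)) {
    # Assumed patterns; the package may recognise extensions differently.
    if (grepl("\\.vcf(\\.gz)?$", x, ignore.case = TRUE)) return("vcf_file")
    if (grepl("\\.maf(\\.gz)?$", x, ignore.case = TRUE)) return("maf_file")
    stop("Unrecognised file extension: ", x)
  }
  if (inherits(x, c("CollapsedVCF", "ExpandedVCF"))) return("vcf_object")
  if (inherits(x, "MAF")) return("maf_object")
  if (inherits(x, c("matrix", "data.frame"))) return("table")
  stop("Unsupported input class: ", class(x)[1])
}

# A mixed input list: a VCF path, a MAF path, and a data.frame.
inputs <- list("tumor1.vcf", "cohort.maf", data.frame(chr = "chr1"))
kinds <- vapply(inputs, classify_input, character(1))

# rename lines up with inputs one-to-one; the entries for the non-VCF
# inputs are placeholders that would be ignored.
rename <- c("Sample1", NA, NA)
data.frame(kind = kinds, rename = rename)
```

The point of the sketch is the alignment: position i of id or rename always refers to position i of inputs, whatever its type.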

extract_variants(
  inputs,
  id = NULL,
  rename = NULL,
  sample_field = NULL,
  filename_as_id = FALSE,
  strip_extension = c(".vcf", ".vcf.gz", ".gz"),
  filter = TRUE,
  multiallele = c("expand", "exclude"),
  fix_vcf_errors = TRUE,
  extra_fields = NULL,
  chromosome_col = "chr",
  start_col = "start",
  end_col = "end",
  ref_col = "ref",
  alt_col = "alt",
  sample_col = "sample",
  verbose = TRUE
)

Arguments

inputs

A vector or list of objects or file names. Objects can be CollapsedVCF, ExpandedVCF, MAF, an object that inherits from matrix or data.frame, or character strings that denote the path to a vcf or maf file.

id

A character vector the same length as inputs denoting the sample to extract from a vcf. See extract_variants_from_vcf for more details. Only used if the input is a vcf object or file. Default NULL.

rename

A character vector the same length as inputs denoting what the sample will be renamed to. See extract_variants_from_vcf for more details. Only used if the input is a vcf object or file. Default NULL.

sample_field

Some algorithms save the name of the sample in the ##SAMPLE portion of the VCF header. See extract_variants_from_vcf for more details. Default NULL.

filename_as_id

If set to TRUE, the file name will be used as the sample name. See extract_variants_from_vcf_file for more details. Only used if the input is a vcf file. Default FALSE.

strip_extension

Only used if filename_as_id is set to TRUE. If set to TRUE, the file extension will be stripped from the filename before setting the sample name. See extract_variants_from_vcf_file for more details. Only used if the input is a vcf file. Default c(".vcf", ".vcf.gz", ".gz").
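As a rough illustration of deriving a sample name from a file name, the following base R sketch removes a trailing extension drawn from the default vector. `strip_ext` is a hypothetical helper, the paths are invented, and the "longest extension first" rule is an assumption; see extract_variants_from_vcf_file for the package's actual behavior.

```r
# Hypothetical helper: drop the first matching extension from a file name.
strip_ext <- function(path, extensions = c(".vcf", ".vcf.gz", ".gz")) {
  name <- basename(path)
  # Try longer extensions first so ".vcf.gz" is preferred over ".gz".
  for (ext in extensions[order(nchar(extensions), decreasing = TRUE)]) {
    if (endsWith(name, ext)) {
      return(substr(name, 1, nchar(name) - nchar(ext)))
    }
  }
  # No listed extension matched: return the file name unchanged.
  name
}

id_plain <- strip_ext("/data/public_LUAD_TCGA-97-7938.vcf")
id_gz <- strip_ext("/data/sample_A.vcf.gz")
id_other <- strip_ext("/data/cohort.maf")
```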

filter

Exclude variants that do not have a PASS in the FILTER column of VCF inputs. Default TRUE.

multiallele

Multialleles are when multiple alternative variants are listed in the same row in the vcf. See extract_variants_from_vcf for more details. Only used if the input is a vcf object or file. Default "expand".
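To make the two options concrete, here is a toy base R sketch on a small data.frame. It assumes that "expand" gives each alternative allele its own row and that "exclude" drops multiallelic rows; the table and column names are invented for illustration, and this is not the package's code.

```r
# Toy variant table; the second row lists two alternative alleles.
variants <- data.frame(
  chr = c("chr1", "chr1", "chr2"),
  start = c(100L, 250L, 75L),
  ref = c("A", "G", "T"),
  alt = c("C", "A,T", "G"),
  stringsAsFactors = FALSE
)

# Split the comma-separated alternatives and count them per row.
alt_list <- strsplit(variants$alt, ",", fixed = TRUE)
n_alt <- lengths(alt_list)

# "expand"-style: repeat each row once per alternative allele.
expanded <- variants[rep(seq_len(nrow(variants)), times = n_alt), ]
expanded$alt <- unlist(alt_list)
rownames(expanded) <- NULL

# "exclude"-style: keep only rows with a single alternative allele.
excluded <- variants[n_alt == 1, ]
```

Under these assumptions, expanding keeps every allele at the cost of extra rows, while excluding discards the ambiguous row entirely.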

fix_vcf_errors

Attempt to automatically fix VCF file formatting errors. See extract_variants_from_vcf_file for more details. Only used if the input is a vcf file. Default TRUE.

extra_fields

Optionally extract additional fields from all input objects. Default NULL.

chromosome_col

The name of the column that contains the chromosome reference for each variant. Only used if the input is a matrix or data.frame. Default "chr".

start_col

The name of the column that contains the start position for each variant. Only used if the input is a matrix or data.frame. Default "start".

end_col

The name of the column that contains the end position for each variant. Only used if the input is a matrix or data.frame. Default "end".

ref_col

The name of the column that contains the reference base(s) for each variant. Only used if the input is a matrix or data.frame. Default "ref".

alt_col

The name of the column that contains the alternative base(s) for each variant. Only used if the input is a matrix or data.frame. Default "alt".

sample_col

The name of the column that contains the sample id for each variant. Only used if the input is a matrix or data.frame. Default "sample".
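The column-name arguments matter when a data.frame does not use the default headers. The sketch below builds a small table with MAF-style column names and shows how those names would be passed along. The data values are invented, and the call is only attempted if musicatk is installed; treat it as an untested illustration of the argument mapping rather than verified output.

```r
# Toy table whose headers differ from the defaults in the usage above.
df <- data.frame(
  Chromosome = c("chr1", "chr2"),
  Start_Position = c(100L, 200L),
  End_Position = c(100L, 200L),
  Tumor_Seq_Allele1 = c("A", "G"),
  Tumor_Seq_Allele2 = c("C", "T"),
  Tumor_Sample_Barcode = c("S1", "S1"),
  stringsAsFactors = FALSE
)

# Map each *_col argument to the matching header in df.
col_args <- list(
  chromosome_col = "Chromosome",
  start_col = "Start_Position",
  end_col = "End_Position",
  ref_col = "Tumor_Seq_Allele1",
  alt_col = "Tumor_Seq_Allele2",
  sample_col = "Tumor_Sample_Barcode"
)

# Only run the extraction when the package is available.
if (requireNamespace("musicatk", quietly = TRUE)) {
  variants <- do.call(
    musicatk::extract_variants,
    c(list(inputs = list(df), verbose = FALSE), col_args)
  )
}
```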

verbose

Show progress of variant extraction. Default TRUE.

Value

Returns a data.table of the variants extracted from all inputs.

Examples

# Get locations of two vcf files and a maf file
luad_vcf_file <- system.file("extdata", "public_LUAD_TCGA-97-7938.vcf",
  package = "musicatk")
lusc_maf_file <- system.file("extdata", "public_TCGA.LUSC.maf",
  package = "musicatk")
melanoma_vcfs <- list.files(system.file("extdata", package = "musicatk"),
  pattern = glob2rx("*SKCM*vcf"), full.names = TRUE)

# Read all files in at once
inputs <- c(luad_vcf_file, melanoma_vcfs, lusc_maf_file)
variants <- extract_variants(inputs = inputs)
#> | | | 0% | |============== | 20%
#> Extracted 1 out of 5 inputs: /private/var/folders/8g/zr_0d8wd23762jsqlwm5r_6w0000gn/T/RtmpqEjdyu/temp_libpathfbd337016ff7/musicatk/extdata/public_LUAD_TCGA-97-7938.vcf
#> | |============================ | 40%
#> Extracted 2 out of 5 inputs: /private/var/folders/8g/zr_0d8wd23762jsqlwm5r_6w0000gn/T/RtmpqEjdyu/temp_libpathfbd337016ff7/musicatk/extdata/public_SKCM_TCGA-EE-A3J5-06A-11D-A20D-08.vcf
#> | |========================================== | 60%
#> Extracted 3 out of 5 inputs: /private/var/folders/8g/zr_0d8wd23762jsqlwm5r_6w0000gn/T/RtmpqEjdyu/temp_libpathfbd337016ff7/musicatk/extdata/public_SKCM_TCGA-ER-A197-06A-32D-A197-08.vcf
#> | |======================================================== | 80%
#> Extracted 4 out of 5 inputs: /private/var/folders/8g/zr_0d8wd23762jsqlwm5r_6w0000gn/T/RtmpqEjdyu/temp_libpathfbd337016ff7/musicatk/extdata/public_SKCM_TCGA-ER-A19O-06A-11D-A197-08.vcf
#> | |======================================================================| 100%
#> Extracted 5 out of 5 inputs: /private/var/folders/8g/zr_0d8wd23762jsqlwm5r_6w0000gn/T/RtmpqEjdyu/temp_libpathfbd337016ff7/musicatk/extdata/public_TCGA.LUSC.maf
table(variants$sample)
#>
#> TCGA-97-7938-01A-11D-2167-08 TCGA-EE-A3J5-06A-11D-A20D-08
#>                          121                          123
#> TCGA-ER-A197-06A-32D-A197-08 TCGA-ER-A19O-06A-11D-A197-08
#>                           13                           52
#> TCGA-56-7582-01A-11D-2042-08 TCGA-77-7335-01A-11D-2042-08
#>                          199                          283
#> TCGA-94-7557-01A-11D-2122-08
#>                          120
# Run again but renaming samples in first four vcfs
new_name <- c(paste0("Sample", 1:4), NA)
variants <- extract_variants(inputs = inputs, rename = new_name)
#> | | | 0% | |============== | 20%
#> Extracted 1 out of 5 inputs: /private/var/folders/8g/zr_0d8wd23762jsqlwm5r_6w0000gn/T/RtmpqEjdyu/temp_libpathfbd337016ff7/musicatk/extdata/public_LUAD_TCGA-97-7938.vcf
#> | |============================ | 40%
#> Extracted 2 out of 5 inputs: /private/var/folders/8g/zr_0d8wd23762jsqlwm5r_6w0000gn/T/RtmpqEjdyu/temp_libpathfbd337016ff7/musicatk/extdata/public_SKCM_TCGA-EE-A3J5-06A-11D-A20D-08.vcf
#> | |========================================== | 60%
#> Extracted 3 out of 5 inputs: /private/var/folders/8g/zr_0d8wd23762jsqlwm5r_6w0000gn/T/RtmpqEjdyu/temp_libpathfbd337016ff7/musicatk/extdata/public_SKCM_TCGA-ER-A197-06A-32D-A197-08.vcf
#> | |======================================================== | 80%
#> Extracted 4 out of 5 inputs: /private/var/folders/8g/zr_0d8wd23762jsqlwm5r_6w0000gn/T/RtmpqEjdyu/temp_libpathfbd337016ff7/musicatk/extdata/public_SKCM_TCGA-ER-A19O-06A-11D-A197-08.vcf
#> | |======================================================================| 100%
#> Extracted 5 out of 5 inputs: /private/var/folders/8g/zr_0d8wd23762jsqlwm5r_6w0000gn/T/RtmpqEjdyu/temp_libpathfbd337016ff7/musicatk/extdata/public_TCGA.LUSC.maf
table(variants$sample)
#>
#>                      Sample1                      Sample2
#>                          121                          123
#>                      Sample3                      Sample4
#>                           13                           52
#> TCGA-56-7582-01A-11D-2042-08 TCGA-77-7335-01A-11D-2042-08
#>                          199                          283
#> TCGA-94-7557-01A-11D-2122-08
#>                          120