data.table::tstrsplit has a useful type.convert argument. But it errors out when, after the split, the results for different groups get converted into different classes. See the example:
library(data.table)
x <- fread("CHROM POS REF ALT TYPE AF
chr1 1 A T MISSENSE 0.23
chr2 1 A T,G MISSENSE 0.17,0.09")
In the ALT column we have "T" and "T,G", so the value in the first row gets converted to logical TRUE, while the second row gets split into two pieces that end up as a character vector once combined. As a result we get the error below:
x[, lapply(.SD, function(x) unlist(tstrsplit(x, ",", fixed = TRUE, type.convert = TRUE))),
by = .(CHROM, POS, REF, TYPE)]
# Error in `[.data.table`(x, , lapply(.SD, function(x) unlist(tstrsplit(x, :
# Column 1 of result for group 2 is type 'character' but expecting type
# 'logical'. Column types must be consistent for each group.
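The mismatch can be reproduced outside the grouped call. This is a minimal sketch, assuming (as the tstrsplit documentation describes) that type.convert = TRUE runs utils::type.convert with as.is = TRUE on each piece separately; g1 and g2 are just illustrative names for the two groups' ALT values:

```r
library(data.table)

# Group 1: the single value "T" is interpreted as a logical by type.convert
g1 <- unlist(tstrsplit("T", ",", fixed = TRUE, type.convert = TRUE))
class(g1)
# "logical"

# Group 2: "T,G" yields two pieces; unlist() combines them into a character vector
g2 <- unlist(tstrsplit("T,G", ",", fixed = TRUE, type.convert = TRUE))
class(g2)
# "character"

# Base R applies the same rule: a lone "T" is read as a logical value
class(type.convert("T", as.is = TRUE))
# "logical"
```

So the conversion is decided per group from whatever values that group happens to contain, which is why the column types disagree between groups.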
We could skip the automatic conversion and convert manually afterwards, which works fine:
x[, lapply(.SD, function(x) unlist(tstrsplit(x, ",", fixed = TRUE))),
by = .(CHROM, POS, REF, TYPE)][, .(CHROM, POS, REF, ALT, TYPE, AF = as.numeric(AF))]
# CHROM POS REF ALT TYPE AF
# 1: chr1 1 A T MISSENSE 0.23
# 2: chr2 1 A T MISSENSE 0.17
# 3: chr2 1 A G MISSENSE 0.09
But tidyr::separate_rows doesn't have this issue:
tidyr::separate_rows(x, ALT, AF, convert = TRUE)
# # A tibble: 3 x 6
# CHROM POS REF ALT TYPE AF
# <chr> <int> <chr> <chr> <chr> <dbl>
# 1 chr1 1 A T MISSENSE 0.23
# 2 chr2 1 A T MISSENSE 0.17
# 3 chr2 1 A G MISSENSE 0.09
The question: is there a better data.table way to achieve this? I need type conversion because the AF column has to be numeric, and I want to split all the delimited columns simultaneously. In real data there might be more than 2 columns with delimiters.