Welcome to WuJiGu Developer Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

r - Split delimited string into new rows with type conversion set to TRUE

data.table::tstrsplit has a useful type.convert argument, but it causes an error when, after the split, different rows are converted to different classes. See this example:

library(data.table)

x <- fread("CHROM POS REF ALT TYPE AF
chr1 1 A T MISSENSE 0.23
chr2 1 A T,G MISSENSE 0.17,0.09")

The ALT column contains "T" and "T,G". The first row is converted to logical TRUE, while the second row is split and comes back as character. As a result we get the error below:

x[, lapply(.SD, function(x) unlist(tstrsplit(x, ",", fixed = TRUE, type.convert = TRUE))),
  by = .(CHROM, POS, REF, TYPE)]

# Error in `[.data.table`(x, , lapply(.SD, function(x) unlist(tstrsplit(x,  : 
#   Column 1 of result for group 2 is type 'character' but expecting type
#   'logical'. Column types must be consistent for each group.
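
The class mismatch behind this error can be reproduced with base R alone. This is a minimal sketch, assuming tstrsplit converts each split piece separately with type.convert(as.is = TRUE) before unlist() combines them; the names g1 and g2 are only illustrative.

```r
# Group 1: the single value "T" is read as a logical
g1 <- type.convert("T", as.is = TRUE)

# Group 2: each piece is converted on its own, then the pieces are combined;
# mixing a logical with a character makes unlist() return character
g2 <- unlist(lapply(c("T", "G"), type.convert, as.is = TRUE))

class(g1)
class(g2)
```

The two groups end up with different classes for the same column, which is what data.table refuses to combine.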

We could skip the automatic conversion and convert manually afterwards, which works fine:

x[, lapply(.SD, function(x) unlist(tstrsplit(x, ",", fixed = TRUE))),
  by = .(CHROM, POS, REF, TYPE)][, .(CHROM, POS, REF, ALT, TYPE, AF = as.numeric(AF))]
#    CHROM POS REF ALT     TYPE   AF
# 1:  chr1   1   A   T MISSENSE 0.23
# 2:  chr2   1   A   T MISSENSE 0.17
# 3:  chr2   1   A   G MISSENSE 0.09

But tidyr::separate_rows doesn't have this issue:

tidyr::separate_rows(x, ALT, AF, convert = TRUE)
# # A tibble: 3 x 6
#   CHROM   POS REF   ALT   TYPE        AF
#   <chr> <int> <chr> <chr> <chr>    <dbl>
# 1 chr1      1 A     T     MISSENSE  0.23
# 2 chr2      1 A     T     MISSENSE  0.17
# 3 chr2      1 A     G     MISSENSE  0.09

The question: is there a better data.table way to achieve this? I need type conversion because the AF column has to be numeric, and I want to split all delimited columns simultaneously. In real data there might be more than two columns with delimiters.



1 Answer


This can be done more easily with cSplit from splitstackshape:

library(splitstackshape)
cSplit(x, c("ALT", "AF"), ",", "long")
#   CHROM POS REF ALT     TYPE   AF
#1:  chr1   1   A   T MISSENSE 0.23
#2:  chr2   1   A   T MISSENSE 0.17
#3:  chr2   1   A   G MISSENSE 0.09

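As for a data.table option, one workaround is to prepend a space to "T" and "F" so that type.convert no longer reads them as logical, and then trim the space afterwards: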

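x[, lapply(.SD, function(x) {
  # prepend a space to T/F so type.convert does not read them as logical
  out <- unlist(tstrsplit(gsub("([TF])+", " \\1", x), ",",
    fixed = TRUE, type.convert = TRUE))
  # trim only character results; trimws() would turn the numeric AF into character
  if (is.character(out)) trimws(out) else out
}), by = .(CHROM, POS, REF, TYPE)]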
#   CHROM POS REF     TYPE ALT   AF
#1:  chr1   1   A MISSENSE   T 0.23
#2:  chr2   1   A MISSENSE   T 0.17
#3:  chr2   1   A MISSENSE   G 0.09
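
Another possibility, offered only as a sketch, is to split without any conversion and then run type.convert once over each whole column, so that every value in a column is judged together. The vector name split_cols and the rebuilt example table are assumptions for illustration. The sketch also assumes each delimited column splits into the same number of pieces within a row.

```r
library(data.table)

# Same example data, built directly so the sketch is self-contained
x <- data.table(
  CHROM = c("chr1", "chr2"), POS = c(1L, 1L), REF = c("A", "A"),
  ALT = c("T", "T,G"), TYPE = c("MISSENSE", "MISSENSE"),
  AF = c("0.23", "0.17,0.09")
)

# Hypothetical name: the columns that hold comma-delimited values
split_cols <- c("ALT", "AF")

# Step 1: split to long form, keeping everything as character
res <- x[, lapply(.SD, function(col) unlist(strsplit(col, ",", fixed = TRUE))),
         by = .(CHROM, POS, REF, TYPE), .SDcols = split_cols]

# Step 2: convert each full column once, so the type is decided per column
res[, (split_cols) := lapply(.SD, type.convert, as.is = TRUE),
    .SDcols = split_cols]

sapply(res, class)
```

With more delimited columns, only split_cols would need to change. Note that a column whose values are all "T" or "F" after splitting would still be converted to logical by this approach.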
