How to delete a row by reference in data.table?


linuxpc 2023. 7. 10. 22:07

My question is about assignment by reference versus copying in data.table. I want to know whether rows can be deleted by reference, similar to

DT[ , someCol := NULL]

In other words, I am asking about something like

DT[someRow := NULL, ]

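I guess there is a good reason why this function doesn't exist, so maybe you could just point out a good alternative to the usual copying approach shown below. In particular, going with my favourite from example(data.table),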

DT = data.table(x = rep(c("a", "b", "c"), each = 3), y = c(1, 3, 6), v = 1:9)
#      x y v
# [1,] a 1 1
# [2,] a 3 2
# [3,] a 6 3
# [4,] b 1 4
# [5,] b 3 5
# [6,] b 6 6
# [7,] c 1 7
# [8,] c 3 8
# [9,] c 6 9

Say I want to delete the first row from this data.table. I know I can do this:

DT <- DT[-1, ]

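But often we may want to avoid that, because it copies the object (which requires about 3*N memory, where N is object.size(DT), as pointed out here). Now I have found set(DT, i, j, value). I know how to set specific values (as here: set all values in rows 1 and 2 and columns 2 and 3 to zero).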

set(DT, 1:2, 2:3, 0) 
DT
#      x y v
# [1,] a 0 0
# [2,] a 0 0
# [3,] a 6 3
# [4,] b 1 4
# [5,] b 3 5
# [6,] b 6 6
# [7,] c 1 7
# [8,] c 3 8
# [9,] c 6 9
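To make the copy-versus-reference distinction concrete, here is a minimal sketch. It assumes data.table's exported address() helper is a fair way to tell whether an object was modified in place; the names addr0 and DT2 are just illustrative.

```r
library(data.table)

DT <- data.table(x = rep(c("a", "b", "c"), each = 3), y = c(1, 3, 6), v = 1:9)
addr0 <- address(DT)                      # memory address of the table before any change

set(DT, i = 1:2, j = "y", value = 0)      # by reference: same object, values overwritten
same_object <- identical(address(DT), addr0)

DT2 <- DT[-1]                             # subsetting allocates a new table without row 1
new_object <- !identical(address(DT2), addr0)
```

Here same_object and new_object should both be TRUE: set() edits the existing table, while DT[-1] builds a second one, which is where the extra memory goes.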

But how can I erase, say, the first two rows? Doing

set(DT, 1:2, 1:3, NULL)

sets the entire data.table to NULL.

My SQL knowledge is very limited, so you tell me: given that data.table uses SQL technology, is there an equivalent to the SQL command

DELETE FROM table_name
WHERE some_column=some_value

in data.table?

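Good question. data.table can't delete rows by reference yet.

data.table can add and delete columns by reference because it over-allocates the vector of column pointers. The plan is to do something similar for rows and allow fast insert and delete. A row delete would use memmove in C to shift up the items (in each and every column) after the deleted rows. Deleting a row in the middle of the table would still be quite inefficient compared to a row-store database such as SQL, which is better suited to fast insertion and deletion of rows wherever they are in the table. But it would still be a lot faster than copying a new large object without the deleted rows.

On the other hand, since column vectors would be over-allocated, rows could be inserted (and deleted) at the end instantly, e.g. a growing time series.

It's filed as an issue: Delete rows by reference.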
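The over-allocation of column pointers mentioned above can be observed directly. This is a small sketch, assuming data.table's exported truelength() and address() helpers; the variable names are illustrative and the exact number of spare slots depends on your data.table options.

```r
library(data.table)

DT <- data.table(x = 1:3, y = 4:6)
n_cols  <- length(DT)        # columns currently in use
n_slots <- truelength(DT)    # column-pointer slots actually allocated

addr_before <- address(DT)
DT[, z := 7:9]               # adding a column fills one of the spare slots
DT[, z := NULL]              # deleting it frees the slot again
```

Because n_slots is larger than n_cols, both := calls can work on the existing object, so address(DT) stays equal to addr_before. Nothing comparable exists for rows, which is why row deletion still needs a copy.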

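The approach I have taken to make memory use similar to in-place deletion is to subset one column at a time and delete it from the original. It is not as fast as a proper C memmove solution, but memory use is all I care about here. Something like this: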

DT = data.table(col1 = 1:1e6)
cols = paste0('col', 2:100)
for (col in cols){ DT[, (col) := 1:1e6] }
keep.idxs = sample(1e6, 9e5, FALSE) # keep 90% of entries
DT.subset = data.table(col1 = DT[['col1']][keep.idxs]) # this is the subsetted table
for (col in cols){
  DT.subset[, (col) := DT[[col]][keep.idxs]]
  DT[, (col) := NULL] #delete
}
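As a sanity check, the same column-at-a-time idea can be run on a small table and compared with an ordinary copying subset. This is only a sketch: the table size, the set.seed() call and the name expected are my own assumptions, and keep.idxs is sorted here so that the surviving rows keep their original order (an unsorted sample() would shuffle them).

```r
library(data.table)

set.seed(1)                                   # only to make this sketch reproducible
n <- 1000L
DT <- data.table(col1 = seq_len(n), col2 = rnorm(n), col3 = sample(letters, n, TRUE))
keep.idxs <- sort(sample(n, 900L))            # sorted: original row order is preserved
expected <- DT[keep.idxs]                     # ordinary (copying) subset, for comparison

cols <- names(DT)
DT.subset <- data.table(col1 = DT[["col1"]][keep.idxs])
for (col in cols[-1]) {
  set(DT.subset, j = col, value = DT[[col]][keep.idxs])  # copy one column's kept rows
  set(DT, j = col, value = NULL)                         # then drop that column from DT
}
```

At any moment only one extra column is held in memory, and DT.subset ends up equal to the plain subset.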

Here is a working function based on @vc273's answer and @Frank's feedback.

delete <- function(DT, del.idxs) {             # note: 'del.idxs', not 'keep.idxs'
  keep.idxs <- setdiff(DT[, .I], del.idxs)     # row indexes to keep
  cols <- names(DT)
  DT.subset <- data.table(DT[[1]][keep.idxs])  # start the subsetted table with column 1
  setnames(DT.subset, cols[1])
  for (col in cols[-1]) {                      # cols[-1] also works for a one-column table
    DT.subset[, (col) := DT[[col]][keep.idxs]]
    DT[, (col) := NULL]                        # delete the column from the original
  }
  return(DT.subset)
}

Usage example:

dat <- delete(dat,del.idxs)   ## Pls note 'del.idxs' instead of 'keep.idxs'

where "dat" is a data.table. Removing 14k rows from 1.4M rows takes 0.25 seconds on my laptop.

> dim(dat)
[1] 1419393      25
> system.time(dat <- delete(dat,del.idxs))
   user  system elapsed 
   0.23    0.02    0.25 
> dim(dat)
[1] 1404715      25
>

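PS. Since I am new to SO, I could not add a comment to @vc273's thread :-(

The topic is still interesting to many people (me included).

What about this? It uses assign to replace the variable in globalenv, together with the code described previously. It would be better to capture the original environment, but at least with globalenv it is memory efficient and acts like a change by reference.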

delete <- function(DT, del.idxs)
{
  varname <- deparse(substitute(DT))           # name of the variable passed in

  keep.idxs <- setdiff(DT[, .I], del.idxs)     # row indexes to keep
  cols <- names(DT)
  DT.subset <- data.table(DT[[1]][keep.idxs])
  setnames(DT.subset, cols[1])

  for (col in cols[-1])                        # cols[-1] also works for a one-column table
  {
    DT.subset[, (col) := DT[[col]][keep.idxs]]
    DT[, (col) := NULL]                        # delete the column from the original
  }

  assign(varname, DT.subset, envir = globalenv())  # rebind the name in globalenv
  return(invisible())
}

DT = data.table(x = rep(c("a", "b", "c"), each = 3), y = c(1, 3, 6), v = 1:9)
delete(DT, 3)
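Regarding the remark that it would be better to capture the original environment: one possible way is to use parent.frame() instead of globalenv(). The sketch below is an assumption about how that could look, not part of the answer above; delete_in_caller, f and local_dt are hypothetical names.

```r
library(data.table)

# Hypothetical variant: rebinds the variable in the caller's frame,
# so it also works when the data.table lives inside another function.
delete_in_caller <- function(DT, del.idxs) {
  varname <- deparse(substitute(DT))        # name of the variable passed in
  caller  <- parent.frame()                 # environment the function was called from
  keep.idxs <- setdiff(seq_len(nrow(DT)), del.idxs)
  cols <- names(DT)
  DT.subset <- data.table(DT[[1]][keep.idxs])
  setnames(DT.subset, cols[1])
  for (col in cols[-1]) {
    DT.subset[, (col) := DT[[col]][keep.idxs]]
    DT[, (col) := NULL]                     # free the column in the original
  }
  assign(varname, DT.subset, envir = caller)
  invisible(NULL)
}

f <- function() {
  local_dt <- data.table(x = c("a", "b", "c"), v = 1:3)
  delete_in_caller(local_dt, 2)             # drops row 2 of the local variable
  local_dt
}
res <- f()
```

Like the globalenv version, this only works when the argument is a plain variable name, since deparse(substitute(DT)) has to yield something assign() can bind.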

Instead of trying to set to NULL, try setting to NA (matching the NA type of the first column):

set(DT,1:2, 1:3 ,NA_character_)
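Since each column has its own type, a slightly more careful sketch of the same idea sets a type-matched NA per column and then filters. This assumes the example table from the question; note that the final filter still makes a copy, so this marks rows rather than deleting them in place.

```r
library(data.table)

DT <- data.table(x = rep(c("a", "b", "c"), each = 3), y = c(1, 3, 6), v = 1:9)
rows <- 1:2

# Mark the rows by reference, one column at a time, with an NA of the column's type
set(DT, i = rows, j = "x", value = NA_character_)
set(DT, i = rows, j = "y", value = NA_real_)
set(DT, i = rows, j = "v", value = NA_integer_)

remaining <- DT[!is.na(x)]   # dropping the marked rows is an ordinary (copying) subset
```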

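Here are some strategies I have used. I believe a .ROW function may be coming. None of the approaches below are fast. They go a little beyond subsetting or filtering; I tried to think like a DBA who is just cleaning up data. As noted above, you can select or remove rows in data.table: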

data(iris)
iris <- data.table(iris)

iris[3] # Select row three

iris[-3] # Remove row three

You can also use .SD to select or remove rows:

iris[,.SD[3]] # Select row three

iris[,.SD[3:6],by=.(Species)] # Select row 3 - 6 for each Species

iris[,.SD[-3]] # Remove row three

iris[,.SD[-3:-6],by=.(Species)] # Remove row 3 - 6 for each Species

Note: .SD creates a subset of the original data and allows you to do quite a bit of work in j or in a subsequent data.table. See https://stackoverflow.com/a/47406952/305675. Here I order the irises by Sepal.Length, take a specified Sepal.Length as a minimum, select the top three (by Sepal.Length) of each Species and return all accompanying data:

iris[order(-Sepal.Length)][Sepal.Length > 3,.SD[1:3],by=.(Species)]

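The approaches above all reorder a data.table sequentially when removing rows. You can also transpose a data.table and remove or replace the old rows, which are now transposed columns. When using ':=NULL' to remove a transposed row, the corresponding column name is removed as well: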

m_iris <- data.table(t(iris))[,V3:=NULL] # V3 column removed

d_iris <- data.table(t(iris))[,V3:=V2] # V3 column replaced with V2

When you transpose back to a data.table, you may want to restore the names from the original data.table and, in the case of deletion, restore the class attributes. Applying ":=NULL" to a transposed data.table leaves every column as character.

m_iris <- data.table(t(m_iris))
setnames(m_iris, names(iris))

d_iris <- data.table(t(d_iris))
setnames(d_iris, names(iris))

You may just want to remove duplicate rows, which you can do with or without a Key:

d_iris[,Key:=paste0(Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species)]     

d_iris[!duplicated(Key),]

d_iris[!duplicated(paste0(Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species)),]

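It is also possible to add an incremental counter with '.I'. You can then search for duplicated keys or fields and remove them by removing the record with that counter. This is computationally expensive, but it has some advantages, since you can print the lines to be removed.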

d_iris[,I:=.I,] # add a counter field

d_iris[,Key:=paste0(Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species)]

for(i in d_iris[duplicated(Key),I]) {print(i)} # See lines with duplicated Key or Field

for(i in d_iris[duplicated(Key),I]) {d_iris <- d_iris[!I == i,]} # Remove lines with duplicated Key or any particular field.
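For comparison, data.table ships duplicated() and unique() methods that do the same job without a pasted key or a loop. A minimal sketch on a made-up table (dt and the result names are illustrative):

```r
library(data.table)

dt <- data.table(a = c(1, 1, 2, 2), b = c("x", "x", "y", "z"))

dup_flags <- duplicated(dt)        # TRUE for rows that repeat an earlier row
dropped   <- dt[dup_flags]         # inspect what would be removed
deduped   <- unique(dt)            # keep the first occurrence of each full row
by_a      <- unique(dt, by = "a")  # or deduplicate on selected columns only
```

These still return new tables rather than deleting by reference, but they avoid re-subsetting the table once per duplicate.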

You can also just fill a row with 0s or NAs and then use an i query to delete them:

 X 
   x v foo
1: c 8   4
2: b 7   2

X[1] <- c(0)

X
   x v foo
1: 0 0   0
2: b 7   2

X[2] <- c(NA)
X
    x  v foo
1:  0  0   0
2: NA NA  NA

X <- X[x != 0,]
X <- X[!is.na(x),]

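This is a version inspired by the versions by vc273 and user7114184. When we want to delete "by reference", we do not actually need to create a new DT. If we remove all columns from a data table it becomes a null data table, which places no restriction on the number of rows. So instead of moving the columns to a new data table and continuing with that one, we can move the columns back to the original data table and keep using it.

This gives us two functions. One, data_table_add_rows, lets us add rows "by reference" to a data.table. The other, data_table_remove_rows, removes rows "by reference". The first takes a list of values, while the second evaluates a DT call for filtering, which lets us do some nice things.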
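Before the full functions, the core trick can be sketched in a few lines. This is a simplified illustration under my own assumptions (a tiny table, a fixed filter, and copy(names(...)) to protect the loop from by-reference changes to the names); tmp, keep and alias are illustrative names.

```r
library(data.table)

d <- data.table(x = 1:10, y = letters[1:10])
alias <- d                                # second name for the same object
addr_before <- address(d)

keep <- d[, which(x %% 2 == 0)]           # row numbers that should survive
tmp <- data.table()                       # empty staging table

for (col in copy(names(d))) {             # move the kept rows out, column by column
  set(tmp, j = col, value = d[[col]][keep])
  set(d, j = col, value = NULL)
}
for (col in copy(names(tmp))) {           # move them back into the (now empty) original
  set(d, j = col, value = tmp[[col]])
  set(tmp, j = col, value = NULL)
}
```

After this, d has five rows, it is still the same object as before, and alias sees the change as well, which is the "by reference" behaviour the functions below aim for.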

#' Add rows to a data table in a memory efficient, by-referencesque manner
#'
#' This mimics the by-reference functionality `DT[, new_col := value]`, but
#' for rows instead. The rows in question are assigned at the end of the data
#' table. If the data table is keyed it is automatically reordered after the
#' operation. If not this function will preserve order of existing rows, but
#' will not preserve sortedness.
#'
#' This function will take the rows to add from a list of columns or generally
#' anything that can be named and converted or coerced to data frame.
#' The list may specify less columns than present in the data table. In this
#' case the rest is filled with NA. The list may not specify more columns than
#' present in the data table. Columns are matched by names if the list is named
#' or by position if not. The list may not have names not present in the data
#' table.
#'
#' Note that this operation is memory efficient as it will add the rows for
#' one column at a time, only requiring reallocation of single columns at a
#' time. This function will change the original data table by reference.
#'
#' This function will not affect shallow copies of the data table.
#'
#' @param .dt A data table
#' @param value A list (or a data frame). Must have at most as many elements as
#'        there are columns in \param{.dt}. If unnamed this will be applied to
#'        first columns in \param{.dt}, else it will by applied by name. Must
#'        not have names not present in \param{.dt}.
#' @return \param{.dt} (invisible)
data_table_add_rows <- function(.dt, value) {
  if (length(value) > ncol(.dt)) {
    rlang::abort(glue::glue("Trying to update data table with {ncol(.dt)
      } columns with {length(value)} columns."))
  }
  if (is.null(names(value))) names(value) <- names(.dt)[seq_len(length(value))]
  value <- as.data.frame(value)
  if (any(!(names(value) %in% names(.dt)))) {
    rlang::abort(glue::glue("Trying to update data table with columns {
        paste(setdiff(names(value), names(.dt)), collapse = ', ')
      } not present in original data table."))
  }
  value[setdiff(names(.dt), names(value))] <- NA
  
  k <- data.table::key(.dt)
  
  temp_dt <- data.table::data.table()
  
  for (col in c(names(.dt))) {
    set(temp_dt, j = col,value = c(.dt[[col]], value[[col]]))
    set(.dt, j = col, value = NULL)
  }
  
  for (col in c(names(temp_dt))) {
    set(.dt, j = col, value = temp_dt[[col]])
    set(temp_dt, j = col, value = NULL)
  }
  
  if (!is.null(k)) data.table::setkeyv(.dt, k)
  
  .dt
}

#' Remove rows from a data table in a memory efficient, by-referencesque manner
#'
#' This mimics the by-reference functionality `DT[, new_col := NULL]`, but
#' for rows instead. This operation preserves order. If the data table is keyed
#' it will preserve the key.
#'
#' This function will determine the rows to delete by passing all additional
#' arguments to a data.table filter call of the form
#' \code{DT[, .idx := .I][..., j = .idx]}
#' Thus we can pass a simple index vector or a condition, or even delete by
#' using join syntax \code{data_table_remove_rows(DT1, DT2, on = cols)} (or
#' reversely keep by join using
#' \code{data_table_remove_rows(DT1, !DT2, on = cols)}
#'
#' Note that this operation is memory efficient as it will rebuild one column
#' at a time, only requiring reallocation of single columns at a time. This
#' function will change the original data table by reference.
#'
#' This function will not affect shallow copies of the data table.
#'
#' @param .dt A data table
#' @param ... Any arguments passed to `[` for filtering the data.table. Must not
#'        specify `j`.
#' @return \param{.dt} (invisible)
data_table_remove_rows <- function(.dt, ...) {
  k <- data.table::key(.dt)
  
  env <- parent.frame()
  args <- as.list(sys.call()[-1])
  if (!is.null(names(args)) && ".dt" %in% names(args)) args[".dt"] <- NULL
  else args <- args[-1]
  
  if (!is.null(names(args)) && "j" %in% names(args)) {
    rlang::abort("... must not specify j")
  }
  
  call <- substitute(
    .dt[, .idx := .I][j = .idx],
    env = list(.dt = .dt))
  
  .nc <- names(call)
  
  for (i in seq_along(args)) {
    call[[i + 3]] <- args[[i]]
  }
  
  if (!is.null(names(args))) names(call) <- c(.nc, names(args))
  which <- eval(call, envir = env)
  set(.dt, j = ".idx", value = NULL)
  
  temp_dt <- data.table::data.table()
  
  for (col in c(names(.dt))) {
    set(temp_dt, j = col,value = .dt[[col]][-which])
    set(.dt, j = col, value = NULL)
  }
  
  for (col in c(names(temp_dt))) {
    set(.dt,j = col, value = temp_dt[[col]])
    set(temp_dt, j = col, value = NULL)
  }
  
  if (!is.null(k)) data.table::setattr(.dt, "sorted", k)
  
  .dt
}

This allows us to make quite nice calls. For example, we can do:

library(data.table)

d <- data.table(x = 1:10, y = runif(10))

#>         x          y
#>     <int>      <num>
#>  1:     1 0.77326131
#>  2:     2 0.88699627
#>  3:     3 0.15553784
#>  4:     4 0.71221778
#>  5:     5 0.11964578
#>  6:     6 0.73692709
#>  7:     7 0.05382835
#>  8:     8 0.61129007
#>  9:     9 0.18292229
#> 10:    10 0.22569555

# add some rows (y = NA)
data_table_add_rows(d, list(x=11:13))
# add some rows (y = 0)
data_table_add_rows(d, list(x=14:15, y = 0))

#>         x          y
#>     <int>      <num>
#>  1:     1 0.77326131
#>  2:     2 0.88699627
#>  3:     3 0.15553784
#>  4:     4 0.71221778
#>  5:     5 0.11964578
#>  6:     6 0.73692709
#>  7:     7 0.05382835
#>  8:     8 0.61129007
#>  9:     9 0.18292229
#> 10:    10 0.22569555
#> 11:    11         NA
#> 12:    12         NA
#> 13:    13         NA
#> 14:    14 0.00000000
#> 15:    15 0.00000000

# remove all added rows
data_table_remove_rows(d, is.na(y) | y == 0)

#>         x          y
#>     <int>      <num>
#>  1:     1 0.77326131
#>  2:     2 0.88699627
#>  3:     3 0.15553784
#>  4:     4 0.71221778
#>  5:     5 0.11964578
#>  6:     6 0.73692709
#>  7:     7 0.05382835
#>  8:     8 0.61129007
#>  9:     9 0.18292229
#> 10:    10 0.22569555

# remove by join
e <- data.table(x = 2:5)
data_table_remove_rows(d, e, on = "x")

#>        x          y
#>    <int>      <num>
#> 1:     1 0.77326131
#> 2:     6 0.73692709
#> 3:     7 0.05382835
#> 4:     8 0.61129007
#> 5:     9 0.18292229
#> 6:    10 0.22569555

# add back
data_table_add_rows(d, c(e, list(y = runif(nrow(e)))))

#>         x          y
#>     <int>      <num>
#>  1:     1 0.77326131
#>  2:     6 0.73692709
#>  3:     7 0.05382835
#>  4:     8 0.61129007
#>  5:     9 0.18292229
#>  6:    10 0.22569555
#>  7:     2 0.99372144
#>  8:     3 0.03363720
#>  9:     4 0.69880083
#> 10:     5 0.67863547

# keep by join
data_table_remove_rows(d, !e, on = "x")

#>        x         y
#>    <int>     <num>
#> 1:     2 0.9937214
#> 2:     3 0.0336372
#> 3:     4 0.6988008
#> 4:     5 0.6786355

Edit: Thanks to Matt Summersgill for a slightly better performing version!

Source URL: https://stackoverflow.com/questions/10790204/how-to-delete-a-row-by-reference-in-data-table
