I am working with a dataset, "Final.Export", that looks like this:
LakeID LakeName SourceVariableName SourceVariableDescription SourceFlags
47 390 Moosehead Acolor(PCU) Apparent color <NA>
48 390 Moosehead Acolor(PCU) Apparent color <NA>
49 390 Moosehead Acolor(PCU) Apparent color <NA>
50 390 Moosehead Acolor(PCU) Apparent color <NA>
51 390 Moosehead Acolor(PCU) Apparent color <NA>
52 390 Moosehead Acolor(PCU) Apparent color <NA>
53 390 Moosehead Acolor(PCU) Apparent color <NA>
54 390 Moosehead Acolor(PCU) Apparent color <NA>
55 390 Moosehead Acolor(PCU) Apparent color <NA>
56 390 Moosehead Acolor(PCU) Apparent color <NA>
LagosVariableID LagosVariableName Value Units CensorCode DetectionLimit Date
47 11 Color, apparent 22 PCU NC NA 2003-08-26
48 11 Color, apparent 17 PCU NC NA 2003-08-26
49 11 Color, apparent 16 PCU NC NA 2003-08-26
50 11 Color, apparent 14 PCU NC NA 2003-08-26
51 11 Color, apparent 14 PCU NC NA 2003-08-26
52 11 Color, apparent 17 PCU NC NA 2003-08-26
53 11 Color, apparent 16 PCU NC NA 2003-08-26
54 11 Color, apparent 17 PCU NC NA 2003-08-26
55 11 Color, apparent 14 PCU NC NA 2003-08-26
56 11 Color, apparent 17 PCU NC NA 2003-08-26
LabMethodName LabMethodInfo SampleType SamplePosition SampleDepth MethodInfo
47 <NA> <NA> INTEGRATED SPECIFIED 6 <NA>
48 <NA> <NA> INTEGRATED SPECIFIED 7 <NA>
49 <NA> <NA> INTEGRATED SPECIFIED 6 <NA>
50 <NA> <NA> INTEGRATED SPECIFIED 10 <NA>
51 <NA> <NA> INTEGRATED SPECIFIED 10 <NA>
52 <NA> <NA> INTEGRATED SPECIFIED 9 <NA>
53 <NA> <NA> INTEGRATED SPECIFIED 10 <NA>
54 <NA> <NA> INTEGRATED SPECIFIED 8 <NA>
55 <NA> <NA> INTEGRATED SPECIFIED 10 <NA>
56 <NA> <NA> INTEGRATED SPECIFIED 10 <NA>
BasinType Subprogram Comments Dup
47 UNKNOWN NA NA NA
48 UNKNOWN NA NA NA
49 UNKNOWN NA NA NA
50 UNKNOWN NA NA NA
51 UNKNOWN NA NA NA
52 UNKNOWN NA NA NA
53 UNKNOWN NA NA NA
54 UNKNOWN NA NA NA
55 UNKNOWN NA NA NA
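56 UNKNOWN NA NA NA

I want to flag all duplicated values with 1. A duplicate is defined as a row that has exactly the same value in each of the columns 'LakeID', 'Date', 'LagosVariableID', 'SampleDepth', and 'SamplePosition'.
To do this, I created a new data table, "data1", with the following code: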
library(data.table)
data1=data.table(Final.Export,key=c('LakeID','Date','LagosVariableID','SampleDepth','SamplePosition','Value'))
data1=data1[,Dup:=duplicated(.SD),.SDcols=c('LakeID','Date', 'LagosVariableID', 'SampleDepth', 'SamplePosition','Value')]
data1$Dup[which(data1$Dup==FALSE)]=NA
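data1$Dup[which(data1$Dup==TRUE)]=1

The problem with "data1" is that only the duplicate rows that come after the first unique row (which is flagged NA) get flagged "1", according to my definition of a duplicate. I need both the unique row and its associated duplicate rows to be flagged "1". Do you know how to do this?
If this is confusing, please let me know how I can clarify.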
Posted on 2013-10-05 00:56:02
It's hard to say without a reproducible example, but it looks like you want something like this:
data1[,dup:=duplicated(.SD),
by=list(LakeID, LagosVariableID, Value, Date, SamplePosition, SampleDepth)]

Edit:
After the OP's clarification, it seems they just want this:
data1[,dup:=duplicated(.SD),
.SDcols=c('LakeID', 'Date', 'LagosVariableID', 'SampleDepth', 'SamplePosition')]

Source: https://stackoverflow.com/questions/19186537
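
Note that duplicated() returns FALSE for the first row of each set of identical rows, so on its own it cannot flag the first occurrence, which is what the question asks for. Below is a minimal sketch of one way to flag every row in a duplicated group, assuming a recent data.table version. The table `toy` and the vector `key_cols` are hypothetical names made up for illustration; `toy` reuses a few values from the sample above and is not the real Final.Export.

```r
library(data.table)

# Hypothetical toy data built from a few values in the sample above
toy <- data.table(
  LakeID          = 390L,
  Date            = as.Date("2003-08-26"),
  LagosVariableID = 11L,
  SampleDepth     = c(6, 7, 6, 10, 10, 9),
  SamplePosition  = "SPECIFIED",
  Value           = c(22, 17, 16, 14, 14, 17)
)

# Columns that define a duplicate in the question (Value is deliberately excluded)
key_cols <- c("LakeID", "Date", "LagosVariableID", "SampleDepth", "SamplePosition")

# Option 1: group by the key columns and flag every row of any group with more than one row
toy[, Dup := if (.N > 1L) 1L else NA_integer_, by = key_cols]

# Option 2: combine a forward and a backward duplicated() scan on the same columns
dup_any <- duplicated(toy, by = key_cols) |
  duplicated(toy, by = key_cols, fromLast = TRUE)

print(toy)
```

With this toy data, the rows with SampleDepth 6 and 10 get Dup = 1 (first occurrence included), while the rows with SampleDepth 7 and 9 stay NA. Both options select the same rows; the grouped version avoids scanning the table twice.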