Outlier processing

 

Methods for handling outliers:

Delete records with outliers: directly delete the records that contain outliers

Treat as missing value: treat the outlier as a missing value and handle it with a missing-value processing method

Correct outliers: replace the outlier with the endpoint (boundary) value or with the average of the two adjacent observed values

Label outliers: create new variables that flag the outliers for further analysis or processing

No processing: perform data mining directly on the dataset that contains outliers
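The handling methods listed above can be sketched in a few lines. This is a plain Python illustration rather than SPL, and everything in it is an assumption made for the example: the sample values, the bounds LOWER and UPPER used to decide what counts as an outlier, and the choice of mean filling as the missing-value method.

```python
from statistics import mean

# Hypothetical sample values; 512.33 plays the role of the outlier.
fares = [7.25, 8.05, 13.0, 26.55, 512.33, 10.5]
LOWER, UPPER = 0.0, 100.0  # assumed endpoint values, for illustration only


def is_outlier(x):
    """An observation is an outlier if it falls outside [LOWER, UPPER]."""
    return x < LOWER or x > UPPER


# Delete records with outliers.
deleted = [x for x in fares if not is_outlier(x)]

# Treat as missing value: mark with None, then apply a missing-value
# method (here, filling with the mean of the remaining values).
as_missing = [None if is_outlier(x) else x for x in fares]
fill = mean(x for x in as_missing if x is not None)
filled = [fill if x is None else x for x in as_missing]

# Correct with the endpoint value: clip to the nearest bound.
capped = [min(max(x, LOWER), UPPER) for x in fares]

# Correct with the average of the two adjacent observations.
# Assumes the outlier is neither first nor last and that its
# neighbours are themselves ordinary values.
neighbor_avg = list(fares)
for i in range(1, len(fares) - 1):
    if is_outlier(fares[i]):
        neighbor_avg[i] = (fares[i - 1] + fares[i + 1]) / 2

# Label outliers in a new variable (1 = outlier, 0 = ordinary).
labels = [1 if is_outlier(x) else 0 for x in fares]
```

Deleting shrinks the dataset, while the other options keep every record and change or annotate only the outlying value. The choice depends on whether the outlier is likely an error or a genuine extreme observation.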

Correcting outliers

In SPL, A.sert() and P.sert(cn) can correct outliers automatically. For example, the following corrects the outliers of the variable "Fare" in the Titanic data.


    A
1   =file("D://titanic.csv").import@qtc()
2   =A1.sert@c("Fare")

A2: Corrects the outliers in the Fare variable and returns the corrected result together with the correction record Rec. The @c option indicates that the original data is modified.

Labeling outliers

For example, the following labels the outliers of the "Fare" variable in titanic.csv using thresholds of 3 standard deviations (z=3) and 5 standard deviations (z=5), respectively.


    A
1   =file("D://titanic.csv").import@qtc()
2   =A1.avg(Fare)
3   =sqrt(var@s(A1.(Fare)))
4   =A1.derive((Fare-A2)/A3:Fare_z,if(Fare_z>3,1,if(Fare_z<-3,-1,0)):Fare_z3,if(Fare_z>5,1,if(Fare_z<-5,-1,0)):Fare_z5)

A2: Calculates the mean of Fare.

A3: Calculates the standard deviation of Fare.

A4: Calculates the z-score of Fare, stored as Fare_z, and labels outliers according to it. In Fare_z3, z-scores greater than 3 are marked as 1, z-scores less than -3 are marked as -1, and all others are marked as 0. In Fare_z5, z-scores greater than 5 are marked as 1, z-scores less than -5 are marked as -1, and all others are marked as 0.
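For readers who want to check the labeling logic outside SPL, steps A2 to A4 can be mirrored in a short Python sketch. The function name label_outliers and the sample fares are hypothetical. The sketch uses the sample standard deviation, on the assumption that this is what sqrt(var@s(...)) computes in A3.

```python
from statistics import mean, stdev


def label_outliers(values, z):
    """Return (z_scores, labels) for a list of numbers.

    A label is 1 if the z-score is greater than z, -1 if it is
    less than -z, and 0 otherwise.
    """
    mu = mean(values)       # counterpart of A2
    sigma = stdev(values)   # sample standard deviation, counterpart of A3
    # Note: sigma is 0 when all values are equal, which would raise
    # ZeroDivisionError below; real data should be checked first.
    z_scores = [(v - mu) / sigma for v in values]
    labels = [1 if s > z else -1 if s < -z else 0 for s in z_scores]
    return z_scores, labels


# Hypothetical fares: eleven ordinary values and one extreme value.
fares = [10.0] * 11 + [500.0]
fare_z, fare_z3 = label_outliers(fares, 3)   # counterpart of Fare_z3
_, fare_z5 = label_outliers(fares, 5)        # counterpart of Fare_z5
```

In this small sample the extreme value is flagged at the 3 standard deviation threshold but not at 5. With only twelve observations, no single value can reach a z-score of 5 under the sample standard deviation, so the stricter threshold only becomes meaningful on larger datasets.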
