"Categorical variables are usually in the form of characters, which can not be directly recognize .."

Lina RaqForum 43 No.
260 View • 1 Years ago

Numerical the categorical variables

Categorical variables are usually in the form of characters, which can not be directly recognized and calculated by the algorithm, and must be converted into numerical data. In SPL, functions provided which can automatically handle categorical variables.

For categorical variables with no more than 6 categories, the A.bi()or P.bi(cn) functions can be used to split the categorical variable into multiple binary variables with values of 0,1.

For categorical variables with a large number of categories, the variables can be mapped to integers using A.setenum()or P.setenum(cn)

For example, in the Titanic data, the variable "Pclass" has a class number of 3 and the variable "Cabin" has a class number of 148, which are numeralized as follows:

	A
1	=file("D://titanic.csv").import@qtc()
2	=file("D://titanic_t.csv").import@qtc()
3	=A1.bi("Pclass")
4	=A2.bi@r("Pclass",A3(2))
5	=A1.setenum@c("Ticket")
6	=A2.setenum@rc("Ticket",A5(2))

A1 Import the modeling data

A2 Import the prediction data

A3 In the modeling data, multiple binary variables are derived from "Pclass", A3(1) returns the processing result, and A3(2) returns the processing record Rec

A4 The same variables are derived on the prediction data according to A3's processing record Rec

A5 The classification variable "Ticket" is mapped to integers, and the mapping result and mapping record Rec are returned. @c indicates that the original data is changed to the mapping data.