You can find the full code here.
At first, all of my feature engineering was based on the original data, but my public score rose from 0.812 to 0.813 after I merely changed the data source, so all of the feature engineering is now based on the raw data.
My preprocessing applies np.log1p to the raw data. I also tried other preprocessing methods such as MAGIC and TF-IDF, but they didn't improve my CV score.
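For reference, the log1p step amounts to the following (a sketch on synthetic sparse counts; the actual competition data are loaded from HDF5, which is not shown here):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Synthetic stand-in for the raw count matrix (cells x features).
rng = np.random.default_rng(0)
raw = csr_matrix(rng.poisson(1.0, size=(100, 50)).astype(np.float32))

# log1p maps 0 -> 0, so it can be applied to the nonzero entries of a
# sparse matrix directly without densifying it.
X = raw.copy()
X.data = np.log1p(X.data)
```

Because log1p preserves zeros, the transformed matrix keeps the same sparsity pattern as the raw counts.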
The final inputs to the models consist of six main parts. Three of them are dimensionality-reduction parts: TSVD, UMAP, and Novel's method. The rest are feature-selection parts: name importance, corr importance, and rf importance.
- TSVD: `TruncatedSVD(n_components=128, random_state=42)`
- UMAP: `UMAP(n_neighbors=16, n_components=128, random_state=42, verbose=True)`
- Novel's method: The original method can be found here. At first, I wanted to use this preprocessing method to replace the simple log1p, but after I replaced the TSVD results of log1p with the TSVD results of Novel's method, my CV went down. If I kept both of them, however, the CV score increased a little, so I kept the TSVD results of Novel's method as well.
- name importance: Mainly based on AmbrosM's notebook, but I added additional information from mygene while matching. I will release my complete preprocessing code later; the specific results can be found there.
- corr importance: As the name suggests, I chose the top 3 features correlated with each target. Because of overlap, the number of selected features was about 104.
- rf importance: Since the feature importances of a random forest may transfer to NNs and other models as well, I selected the top 128 features by random-forest importance.
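A rough sketch of how these parts can be assembled (synthetic data; the TSVD settings match the write-up above, but the corr/rf selection code is an illustrative reconstruction, and UMAP is omitted so the snippet runs with scikit-learn alone):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
X = np.log1p(rng.poisson(1.0, size=(300, 500)).astype(np.float32))  # preprocessed inputs
y = rng.normal(size=(300, 5)).astype(np.float32)                    # a few targets

# Dimensionality reduction: TSVD with 128 components, as above.
X_svd = TruncatedSVD(n_components=128, random_state=42).fit_transform(X)

# "corr importance": the top 3 features most correlated with each target,
# de-duplicated across targets (hence the overlap mentioned above).
corr = np.corrcoef(X.T, y.T)[: X.shape[1], X.shape[1]:]   # (n_features, n_targets)
corr_idx = np.unique(np.argsort(-np.abs(corr), axis=0)[:3].ravel())

# "rf importance": top features by random-forest importance (128 in the
# write-up; fewer trees and features here to keep the sketch fast).
rf = RandomForestRegressor(n_estimators=20, random_state=42, n_jobs=-1)
rf.fit(X, y[:, 0])
rf_idx = np.argsort(-rf.feature_importances_)[:32]

# Concatenate reduction and selection parts into the model input.
features = np.hstack([X_svd, X[:, corr_idx], X[:, rf_idx]])
```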
I also tried other methods, including PCA, KernelPCA, LocallyLinearEmbedding, and SpectralEmbedding. PCA gave little help and caused severe overfitting when used together with TSVD. I couldn't finish the manifold methods within 24 hours, so I gave them up.
I implemented a CV strategy that mimics the private test, but it turned out that a strategy mimicking the public test was better, so all of the results are based on GroupKFold on donors. I did three-layer stacking in the competition, and I also ensembled the stacking results with the results of independent models. Here are the models I used; I will release the code later as well.
Method | Stacking | NN | NN_online | CNN | kernel_ridge | LGBM | Catboost |
---|---|---|---|---|---|---|---|
CV | 0.89677 | 0.89596 | 0.89580 | 0.89530 | 0.89326 | 0.89270 | 0.89100 |
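The GroupKFold-on-donors split works like this (a sketch with made-up rows; the donor IDs match the three donors that appear in the CV table):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# With 3 donors and GroupKFold(n_splits=3), each fold holds out one donor
# entirely, mimicking the public-test setup.
donors = np.array([32606, 13176, 31800] * 10)
X = np.random.default_rng(0).normal(size=(30, 4))

gkf = GroupKFold(n_splits=3)
for train_idx, valid_idx in gkf.split(X, groups=donors):
    held_out = set(donors[valid_idx])
    assert len(held_out) == 1  # the validation fold contains exactly one donor
```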
- NN: A personally designed network, trying to do something like a transformer. I used an MLP to replace the dot product in the attention mechanism. This may not be entirely reasonable, and I am aware of the importance of feature vectors and dot products, but I was fascinated by attention. I also tried tabnet and rtdl, but they didn't work very well, while my method seemed to work even better than a simple MLP.
- CNN: Inspired by the tmp method here, with multi-dimensional convolution kernels added, like ResNet.
- NN (online): Mainly based on pourchot's method here; only tiny changes were made.
- Kernel Ridge: Inspired by the best solution of last year's competition. I used Ray Tune to optimize the hyperparameters.
- Catboost: There are several options for catboost here: using MultiOutputRegressor, or using MultiRMSE as the objective. But the first method doesn't allow early stopping to prevent overfitting, and the result of the second was not good enough, so I wrote a MultiOutputCatboostRegressor class myself, using MSE to fit the normalized targets.
- LGBM: I also wrote a MultiOutputLGBMRegressor. Its results seemed better, but training was so slow that I had to give it up in the stacking. However, I still trained an independent LGBM model and used it in the final ensemble.
- stacking: I used KNN, CNN, ridge, rf, catboost, and NN in the first layer; only CNN, catboost, and NN in the second; and just a simple MLP in the last layer. To avoid overfitting, I used KFold and OOF predictions between layers, and every stacking model uses GroupKFold (so there are 3 stacking models here). This may be a little hard to understand, so you may refer to the picture. If you still have questions, please feel free to ask me.
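The per-target wrapper idea behind MultiOutputCatboostRegressor can be sketched like this (an illustrative reconstruction, not the actual class; sklearn's GradientBoostingRegressor stands in for CatBoost so the snippet runs without extra dependencies):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

class MultiOutputBoostRegressor:
    """Fit one single-output boosted model per target column. Unlike the
    MultiRMSE objective, this lets each model use early stopping
    (with CatBoost, by passing an eval_set per target)."""

    def __init__(self, make_estimator):
        self.make_estimator = make_estimator
        self.models_ = []

    def fit(self, X, y):
        self.models_ = []
        for j in range(y.shape[1]):
            m = self.make_estimator()
            m.fit(X, y[:, j])  # with CatBoost, add eval_set=... here
            self.models_.append(m)
        return self

    def predict(self, X):
        return np.column_stack([m.predict(X) for m in self.models_])

# Tiny synthetic demo with two targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = np.column_stack([X[:, 0] + rng.normal(scale=0.1, size=200),
                     X[:, 1] - X[:, 2] + rng.normal(scale=0.1, size=200)])

reg = MultiOutputBoostRegressor(
    lambda: GradientBoostingRegressor(n_estimators=50, random_state=0))
pred = reg.fit(X, y).predict(X)
```

The factory-function design keeps the wrapper independent of any one library, so swapping in `CatBoostRegressor` or `LGBMRegressor` only changes the lambda.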
CV Results | Model Ⅰ (valid 32606) | Model Ⅱ (valid 13176) | Model Ⅲ (valid 31800) |
---|---|---|---|
Fold 1 | 0.8989 | 0.8967 | 0.8947 |
Fold 2 | 0.8995 | 0.8967 | 0.8951 |
Fold 3 | 0.8985 | 0.8959 | 0.8949 |
Fold Mean | 0.89897 | 0.89643 | 0.89490 |
Model Mean | 0.89677 | - | - |
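The OOF predictions between stacking layers can be sketched like this (ridge models stand in for the actual base learners, and all sizes are illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

def oof_predictions(model_factory, X, y, n_splits=3):
    """Out-of-fold predictions: each row is predicted by a model that
    never saw it during training, so the next layer doesn't leak."""
    oof = np.zeros_like(y)
    for tr, va in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(X):
        m = model_factory().fit(X[tr], y[tr])
        oof[va] = m.predict(X[va])
    return oof

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 20))
y = X[:, :3] @ rng.normal(size=(3, 2)) + rng.normal(scale=0.1, size=(150, 2))

# Layer 1: several base models; their OOF outputs become layer-2 features.
layer1 = np.hstack([oof_predictions(lambda: Ridge(alpha=a), X, y)
                    for a in (0.1, 1.0, 10.0)])
# Final layer: a simple model fitted on the stacked OOF features.
final = Ridge(alpha=1.0).fit(layer1, y)
```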
To be honest, I put most of my effort into the cite part, so there is nothing very special here; I will just give a brief introduction.
- TF-IDF normalization
- `np.log1p(data * 1e4)`
- TSVD -> 512
- Normalization -> mean = 0, std = 1
- TSVD -> 1024
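These steps amount to roughly the following (synthetic data, with smaller component counts than the 512/1024 above so the sketch runs quickly; interpreting the TF-IDF step as per-cell total scaling before log1p is an assumption):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
counts = rng.poisson(0.3, size=(200, 400)).astype(np.float64)  # synthetic counts
targets = rng.normal(size=(200, 300))                          # synthetic targets

# 1) Scale each cell to a constant total, then np.log1p(data * 1e4).
row_sums = counts.sum(axis=1, keepdims=True)
row_sums[row_sums == 0] = 1.0
X = np.log1p(counts / row_sums * 1e4)

# 2) TSVD on the inputs (512 components in the write-up; fewer here).
X_svd = TruncatedSVD(n_components=64, random_state=42).fit_transform(X)

# 3) Standardize each component to mean 0, std 1.
X_std = (X_svd - X_svd.mean(axis=0)) / X_svd.std(axis=0)

# 4) TSVD on the targets (1024 components in the write-up; fewer here);
#    the models then fit these compressed targets.
Y_svd = TruncatedSVD(n_components=32, random_state=42).fit_transform(targets)
```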
- NN: A personally designed network, as mentioned above. The output of the model is 1024-dimensional; it is multiplied (dot product) with the constant tsvd.components_ to get the final prediction, then correl_loss is used to calculate the loss and backpropagate the gradients.
- Catboost: The results from an online notebook.
- LGBM: The same as the MultiOutputLGBMRegressor mentioned above, using MSE to fit the TSVD results of the normalized targets.
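The NN's prediction path, a compressed output projected back through the constant tsvd.components_ and scored with a correlation loss, can be sketched as follows (written in NumPy rather than a DL framework; the exact correl_loss implementation is an assumption, taken here as negative mean per-row Pearson correlation, which matches the competition metric):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
targets = rng.normal(size=(100, 200))  # synthetic normalized targets

# Fit TSVD on the targets; components_ stays constant during NN training.
tsvd = TruncatedSVD(n_components=16, random_state=42).fit(targets)

def correl_loss(pred, true):
    """Negative mean per-row Pearson correlation."""
    p = pred - pred.mean(axis=1, keepdims=True)
    t = true - true.mean(axis=1, keepdims=True)
    num = (p * t).sum(axis=1)
    den = np.sqrt((p * p).sum(axis=1) * (t * t).sum(axis=1))
    return -np.mean(num / den)

# Stand-in for the NN output: the true compressed targets plus noise.
nn_out = tsvd.transform(targets) + rng.normal(scale=0.1, size=(100, 16))

# Dot product with the constant components_ matrix -> full target space.
pred = nn_out @ tsvd.components_
loss = correl_loss(pred, targets)
```

Because the projection matrix is constant, gradients flow straight through the dot product back into the 16-dimensional (1024 in the write-up) network output.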
The same notebook as mentioned above. notebook