Every Tuesday, Kaggle opens a lightweight challenge that runs for two weeks. The datasets are based on real-world data but are approachable for ML & AI beginners. I sometimes take part in these competitions, and the experience sharpens my machine learning skills.

Explore Multi-Label Classification with an Enzyme Substrate Dataset

The Season 3, Episode 18 competition is about multi-label classification. Since I’m currently learning the Keras API, this time I tried to solve the problem with a deep learning approach. What follows are the footprints of my journey.

First Attempt (23.07.05)

I basically dropped some skewed features (such as FpDensityMorgan1~3 and Kappa3) along with the ID column, then tried to predict the targets with a 3-layer Keras model. However, my accuracy got stuck around 0.5000. ☹️ Adding more layers didn’t help much. What was wrong with my model? I was confused…
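
For reference, here is a minimal sketch of that first preprocessing step. The feature names come from the competition data, but the file layout and the target columns EC1/EC2 are my assumptions, not shown in the original post:

import pandas as pd

# Load the competition data (standard Kaggle file layout assumed).
train = pd.read_csv('train.csv')

# Drop the skewed features and the id column; keep the rest as inputs.
skewed = ['FpDensityMorgan1', 'FpDensityMorgan2', 'FpDensityMorgan3', 'Kappa3']
X = train.drop(columns=skewed + ['id', 'EC1', 'EC2'])  # EC1/EC2 assumed as targets
y = train[['EC1', 'EC2']]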

The very first attempt gave me a score of 0.53114, which is no better than just guessing 0.5 for every answer. In other words, the current model was useless…

Second Attempt (23.07.06)

I figured out how to solve this problem…

Loss

First, I found that the loss wasn’t decreasing as the epochs went on. There can be several reasons for this, but it turned out that my model was too complex. I made my Keras model simpler, which really helped.

from tensorflow import keras

model = keras.Sequential()
model.add(keras.Input(shape=(17,)))                      # 17 input features
model.add(keras.layers.Dense(10, activation='relu'))     # single small hidden layer
model.add(keras.layers.Dropout(0.3))                     # dropout for regularization
model.add(keras.layers.Dense(2, activation='softmax'))   # two targets
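
The post doesn’t show the training step, so here is a minimal sketch of how this model could be compiled and fit; the optimizer, loss, epoch count, and the X/y variables are my assumptions:

# Assumed training setup: Adam + binary cross-entropy over the two targets.
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
history = model.fit(X, y, validation_split=0.2, epochs=50, batch_size=256)

(As a design note: for independent multi-label targets, a sigmoid output paired with binary cross-entropy is more conventional than softmax, which forces the two outputs to sum to one.)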

According to saturncloud.io, the common reasons for loss not decreasing during training are the following:

  1. Model Complexity → Simplify the Model
  2. Learning Rate → Adjust the Learning Rate (see the sketch after this list)
  3. Initialization → Change the Initialization
  4. Dataset → Improve the Dataset
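
For item 2, here is a minimal sketch of adjusting the learning rate in Keras; the specific values are my assumptions, not from the post:

# Start from a smaller learning rate than Keras's default of 1e-3.
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4),
              loss='binary_crossentropy', metrics=['accuracy'])

# Or shrink it on the fly whenever the validation loss stops improving.
reduce_lr = keras.callbacks.ReduceLROnPlateau(monitor='val_loss',
                                              factor=0.5, patience=3)
model.fit(X, y, validation_split=0.2, epochs=50, callbacks=[reduce_lr])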

Result

The new submission scored 0.56205, an improvement over the previous 0.53114. The public best score was 0.66246, which seems hard to reach, but I still thought there was another way to improve my result.

Third Attempt (23.07.06)

Feature Engineering

I realized that the problem was in the preprocessing step. Simply dropping some features isn’t the way to go. Since I don’t have a biochemistry background, I didn’t know how to handle these features. According to one of the Kaggle discussions, the following feature interactions were suggested (a sketch of building them follows the list).

  1. Molecular Complexity: BertzCT * ExactMolWt
  2. Structural Flexibility: Chi1 * Chi2n
  3. Functional Specificity: Chi1n * Chi3v
  4. Chemical Diversity: EState_VSA1 * NumHeteroatoms
  5. Size-Related Descriptors: ExactMolWt * FpDensityMorgan1
  6. Topological Patterns: FpDensityMorgan2 * FpDensityMorgan3
  7. Electronic Structure: HallKierAlpha * MaxAbsEStateIndex
  8. Atom Weight and Charge: HeavyAtomMolWt * MinEStateIndex
  9. Geometrical Shape: Kappa3 * NumHeteroatoms
  10. Molecular Surface Properties: PEOE_VSA10 * PEOE_VSA14
  11. Hydrophobicity and Polarizability: PEOE_VSA6 * PEOE_VSA8
  12. Solvent Accessible Surface Area: SMR_VSA10 * SMR_VSA5
  13. Electronic State and Shape: SlogP_VSA3 * VSA_EState9
  14. Functional Group Occurrence: fr_COO * fr_COO2
  15. Molecular Complexity: BertzCT * Chi1v
  16. Structural Patterns: Chi2n * Chi4n
  17. Molecular Weight and Shape: ExactMolWt * Kappa3
  18. Structural Density: FpDensityMorgan1 * FpDensityMorgan2
  19. Molecular Shape and Flexibility: HallKierAlpha * HeavyAtomMolWt
  20. Chemical Diversity: MinEStateIndex * NumHeteroatoms
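
Here is a small sketch of building these interaction features with pandas; the column names come from the discussion list above, and the naming of the new columns is my own choice:

# Multiply each suggested pair to create a new interaction feature.
pairs = [
    ('BertzCT', 'ExactMolWt'),
    ('Chi1', 'Chi2n'),
    ('Chi1n', 'Chi3v'),
    # ...the remaining pairs from the list follow the same pattern
]
for a, b in pairs:
    train[f'{a}_x_{b}'] = train[a] * train[b]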

Result

The submission after feature engineering scored 0.56547, a slight improvement over 0.56205.

Conclusion

This was my second time joining a Kaggle Playground competition. The dataset is well refined, which encouraged me to jump right into analyzing it. It was also my first time using deep learning methods to solve a problem. Compared to classical ML techniques, deep learning allows more freedom in constructing a model, which makes it all the more intriguing to me.

I got to apply the knowledge I’ve learned, and the experience was far different from what I had imagined. Simply knowing the frameworks isn’t enough. I realized that refining data with domain knowledge is a key part of the deep learning process. We have to remember that deep learning is just a tool, not a deus ex machina.

I’ll keep learning deeply about deep learning techniques, and someday start my own major-related project!