Every Tuesday, Kaggle opens a lightweight challenge that runs for two weeks. The datasets are based on real-world data but are approachable for ML & AI beginners. I sometimes take part in these competitions, and the experience sharpens my machine learning skills.

Explore Multi-Label Classification with an Enzyme Substrate Dataset

The Season 3, Episode 18 competition is about multi-label classification. Since I’m currently learning the Keras API, this time I tried to solve the problem with a deep learning approach. What follows are the footprints of my journey.

First Attempt (23.07.05)

I basically dropped some skewed features (such as FpDensityMorgan1~3 and Kappa3) along with the ID column, then tried to predict the targets with a 3-layer Keras model. However, my accuracy got stuck around 0.5000. ☹️ Adding more layers didn’t help much. What was wrong with my model? I was confused…
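
For reference, here is a minimal sketch of that first preprocessing step. The feature names come from the competition data, but the file layout and the target columns EC1/EC2 are my assumptions, not shown in the original post:

import pandas as pd

# Load the competition data (standard Kaggle file layout assumed).
train = pd.read_csv('train.csv')

# Drop the skewed features and the id column; keep the rest as inputs.
skewed = ['FpDensityMorgan1', 'FpDensityMorgan2', 'FpDensityMorgan3', 'Kappa3']
X = train.drop(columns=skewed + ['id', 'EC1', 'EC2'])  # EC1/EC2 assumed as targets
y = train[['EC1', 'EC2']]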

The very first attempt gave me a score of 0.53114, which is no better than just guessing 0.5 for every answer. In other words, the current model was useless…

Second Attempt (23.07.06)

I figured out how to solve this problem…

Loss

First, I found that the loss wasn’t decreasing as the epochs went on. There can be several reasons for this, but it turned out that my model was too complex. I made my Keras model simpler, which really helped.

from tensorflow import keras

model = keras.Sequential()
model.add(keras.Input(shape=(17,)))                      # 17 input features
model.add(keras.layers.Dense(10, activation='relu'))     # single small hidden layer
model.add(keras.layers.Dropout(0.3))                     # dropout for regularization
model.add(keras.layers.Dense(2, activation='softmax'))   # two targets
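
The post doesn’t show the training step, so here is a minimal sketch of how this model could be compiled and fit; the optimizer, loss, epoch count, and the X/y variables are my assumptions:

# Assumed training setup: Adam + binary cross-entropy over the two targets.
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
history = model.fit(X, y, validation_split=0.2, epochs=50, batch_size=256)

(As a design note: for independent multi-label targets, a sigmoid output paired with binary cross-entropy is more conventional than softmax, which forces the two outputs to sum to one.)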

According to saturncloud.io, the common reasons for loss not decreasing during training are the following:

  1. Model Complexity → Simplify the Model
  2. Learning Rate → Adjust the Learning Rate (see the sketch after this list)
  3. Initialization → Change the Initialization
  4. Dataset → Improve the Dataset
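
For item 2, here is a minimal sketch of adjusting the learning rate in Keras; the specific values are my assumptions, not from the post:

# Start from a smaller learning rate than Keras's default of 1e-3.
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4),
              loss='binary_crossentropy', metrics=['accuracy'])

# Or shrink it on the fly whenever the validation loss stops improving.
reduce_lr = keras.callbacks.ReduceLROnPlateau(monitor='val_loss',
                                              factor=0.5, patience=3)
model.fit(X, y, validation_split=0.2, epochs=50, callbacks=[reduce_lr])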

Result

The new submission scored 0.56205, an improvement over the previous 0.53114. The public best score was 0.66246, which seems hard to reach, but I still thought there was another way to improve my result.

Third Attempt (23.07.06)

Feature Engineering

I realized that the problem was in the preprocessing step. Simply dropping some features isn’t the way to go. Since I don’t have a biochemistry background, I didn’t know how to handle these features. According to one of the Kaggle discussions, the following feature interactions were suggested (a sketch of building them follows the list).

  1. Molecular Complexity: BertzCT * ExactMolWt
  2. Structural Flexibility: Chi1 * Chi2n
  3. Functional Specificity: Chi1n * Chi3v
  4. Chemical Diversity: EState_VSA1 * NumHeteroatoms
  5. Size-Related Descriptors: ExactMolWt * FpDensityMorgan1
  6. Topological Patterns: FpDensityMorgan2 * FpDensityMorgan3
  7. Electronic Structure: HallKierAlpha * MaxAbsEStateIndex
  8. Atom Weight and Charge: HeavyAtomMolWt * MinEStateIndex
  9. Geometrical Shape: Kappa3 * NumHeteroatoms
  10. Molecular Surface Properties: PEOE_VSA10 * PEOE_VSA14
  11. Hydrophobicity and Polarizability: PEOE_VSA6 * PEOE_VSA8
  12. Solvent Accessible Surface Area: SMR_VSA10 * SMR_VSA5
  13. Electronic State and Shape: SlogP_VSA3 * VSA_EState9
  14. Functional Group Occurrence: fr_COO * fr_COO2
  15. Molecular Complexity: BertzCT * Chi1v
  16. Structural Patterns: Chi2n * Chi4n
  17. Molecular Weight and Shape: ExactMolWt * Kappa3
  18. Structural Density: FpDensityMorgan1 * FpDensityMorgan2
  19. Molecular Shape and Flexibility: HallKierAlpha * HeavyAtomMolWt
  20. Chemical Diversity: MinEStateIndex * NumHeteroatoms
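
Here is a small sketch of building these interaction features with pandas; the column names come from the discussion list above, and the naming of the new columns is my own choice:

# Multiply each suggested pair to create a new interaction feature.
pairs = [
    ('BertzCT', 'ExactMolWt'),
    ('Chi1', 'Chi2n'),
    ('Chi1n', 'Chi3v'),
    # ...the remaining pairs from the list follow the same pattern
]
for a, b in pairs:
    train[f'{a}_x_{b}'] = train[a] * train[b]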

Result

The submission after feature engineering scored 0.56547, a slight improvement over 0.56205.

Conclusion

This was my second time joining a Kaggle Playground competition. The dataset is well refined, which encouraged me to jump right into analyzing it. It was also my first time using deep learning methods to solve a problem. Compared to classical ML techniques, deep learning allows more freedom in constructing a model, which makes it all the more intriguing to me.

I got to apply the knowledge I’ve learned, and the experience was far different from what I had imagined. Simply knowing the frameworks isn’t enough. I realized that refining data with domain knowledge is a key part of the deep learning process. We have to remember that deep learning is just a tool, not a deus ex machina.

I’ll keep learning deeply about deep learning techniques, and someday start my own major-related project!