Every Tuesday, Kaggle opens a lightweight challenge that runs for 2 weeks. The datasets are based on real-world data but are easy for ML & AI beginners to work with. I sometimes participate in these competitions, and the experience improves my Machine Learning skills.
Explore Multi-Label Classification with an Enzyme Substrate Dataset
The Season 3, Episode 18 competition is about Multi-Label Classification. Since I’m currently learning the Keras API, this time I tried to solve the problem with a Deep Learning approach. What follows are the footprints of my journey.
First Attempt (23.07.05)
I basically dropped some skewed features (such as FpDensityMorgan1~3 and Kappa3) and the ID column. Then I tried to predict the targets with a 3-layer Keras model. However, my accuracy got stuck around 0.5000. ☹️ Adding more layers didn’t help much. What was wrong with my model? I was confused…
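For reference, a minimal sketch of this preprocessing, assuming the competition’s train.csv, an id column, and EC1/EC2 as the two target columns:
import pandas as pd

train = pd.read_csv('train.csv')
skewed = ['FpDensityMorgan1', 'FpDensityMorgan2', 'FpDensityMorgan3', 'Kappa3']
X = train.drop(columns=['id', 'EC1', 'EC2'] + skewed)  # keep features only
y = train[['EC1', 'EC2']]                              # the two binary targets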
The very first attempt gave me a score of 0.53114, which is no better than just guessing 0.5 for every answer. In other words, the model was useless…
Second Attempt (23.07.06)
I figured out how to solve this problem…
Loss
First, I found that the loss wasn’t decreasing as the epochs increased. There can be several reasons for this, but it turned out that my model was too complex. Making my Keras model simpler really helped.
from tensorflow import keras  # import needed to run this snippet

model = keras.Sequential()
model.add(keras.Input(shape=(17,)))                     # 17 input features
model.add(keras.layers.Dense(10, activation='relu'))    # single small hidden layer
model.add(keras.layers.Dropout(0.3))                    # regularization
model.add(keras.layers.Dense(2, activation='softmax'))  # one unit per target
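The compile and fit calls aren’t shown above; here is a minimal sketch of how this model might be trained, assuming binary cross-entropy over the two binary targets and illustrative hyperparameters (the competition metric is ROC AUC averaged over the targets):
model.compile(optimizer='adam',
              loss='binary_crossentropy',     # one binary loss per target
              metrics=[keras.metrics.AUC()])  # matches the competition metric
model.fit(X, y, epochs=50, batch_size=256, validation_split=0.2)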
According to saturnicloud.io, the reasons for a loss not decreasing during training are the following (a short Keras illustration follows the list).
- Model Complexity → Simplify the Model
- Learning Rate → Adjust the Learning Rate
- Initialization → Change the Initialization
- Dataset → Improve the Dataset
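For reference, the learning-rate and initialization fixes look roughly like this in Keras, reusing the model above; the specific values are illustrative assumptions, not what I actually tuned:
# Change the initialization of the hidden layer:
keras.layers.Dense(10, activation='relu',
                   kernel_initializer=keras.initializers.HeNormal())
# Adjust the learning rate via an explicit optimizer:
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4),
              loss='binary_crossentropy',
              metrics=[keras.metrics.AUC()])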
Result
The new submission scored 0.56205, an improvement over the previous 0.53114. The public best score is 0.66246, so there is still a gap that looks hard to close. Still, I think there is another way to improve this result.
Third Attempt (23.07.06)
Feature Engineering
I realized that the problem was in the preprocessing step. Simply dropping some features isn’t how it works. Since I do not have a biochemical background, I didn’t know how to deal with such features. According to one of the Kaggle Discussions, the following pairwise interactions between features are suggested (a pandas sketch for building them follows the list).
- Molecular Complexity: BertzCT * ExactMolWt
- Structural Flexibility: Chi1 * Chi2n
- Functional Specificity: Chi1n * Chi3v
- Chemical Diversity: EState_VSA1 * NumHeteroatoms
- Size-Related Descriptors: ExactMolWt * FpDensityMorgan1
- Topological Patterns: FpDensityMorgan2 * FpDensityMorgan3
- Electronic Structure: HallKierAlpha * MaxAbsEStateIndex
- Atom Weight and Charge: HeavyAtomMolWt * MinEStateIndex
- Geometrical Shape: Kappa3 * NumHeteroatoms
- Molecular Surface Properties: PEOE_VSA10 * PEOE_VSA14
- Hydrophobicity and Polarizability: PEOE_VSA6 * PEOE_VSA8
- Solvent Accessible Surface Area: SMR_VSA10 * SMR_VSA5
- Electronic State and Shape: SlogP_VSA3 * VSA_EState9
- Functional Group Occurrence: fr_COO * fr_COO2
- Molecular Complexity: BertzCT * Chi1v
- Structural Patterns: Chi2n * Chi4n
- Molecular Weight and Shape: ExactMolWt * Kappa3
- Structural Density: FpDensityMorgan1 * FpDensityMorgan2
- Molecular Shape and Flexibility: HallKierAlpha * HeavyAtomMolWt
- Chemical Diversity: MinEStateIndex * NumHeteroatoms
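These products are straightforward to add with pandas; a minimal sketch, where the source column names come from the dataset and the _x_ naming for the new columns is my own:
# Build the suggested interaction columns as pairwise products.
pairs = [
    ('BertzCT', 'ExactMolWt'),
    ('Chi1', 'Chi2n'),
    ('Chi1n', 'Chi3v'),
    ('EState_VSA1', 'NumHeteroatoms'),
    # ... and so on for the remaining pairs listed above
]
for a, b in pairs:
    train[f'{a}_x_{b}'] = train[a] * train[b]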
Result
The submission after feature engineering scored 0.56547, a slight improvement over 0.56205.
Conclusion
This was my second time joining a Kaggle Playground Competition. The dataset is well refined, which encouraged me to jump right into analyzing it. This was also my first time using Deep Learning methods to solve a problem. Compared to classical ML techniques, Deep Learning offers more freedom in constructing a model, which I find intriguing.
I got to apply the knowledge I have learned, and the experience was far different from what I had imagined. Simply knowing the frameworks isn’t enough. I realized that refining data using domain knowledge is the key part of the Deep Learning process. We have to remember that Deep Learning is just a tool, not a deus ex machina.
I’ll keep learning deeply about Deep Learning techniques, and someday start my own major-related project!