Does AI Need to Learn Protein-Ligand Interactions to Predict Them? (I)
Recently, scientists found that when AI is used to predict protein-ligand interactions, the commonly used training sets (PDBbind and DUD-E) carry serious data bias, which results in artificially inflated model performance. The resulting predictions lack generalization ability and robustness, and they mislead the development and practical application of methods in this field. Based on these findings, the authors put forward views and suggestions on how to evaluate AI models objectively.
How can data bias hurt? There is a classic example. In 1957, in a study supported by the US military, researchers used neural networks to predict whether tanks were hidden in the woods; the training set consisted of pictures with and without tanks. The accuracy was amazing. But it was later found that the pictures with tanks had been taken on cloudy days and the pictures without tanks on sunny days. The trained AI model was not a tank classifier but a weather classifier! Instead of learning the seemingly complicated causal features that indicate a tank, the lazy, slippery AI substituted the simplest weather correlation and fooled its human users. Taming AI first requires a good training set: one that matches the true distribution on the target attributes and is unbiased on non-target attributes, so that the model does not learn implicit data bias.
In recent years, neural-network-based AI models have repeatedly been claimed to achieve "state-of-the-art" performance on the PDBbind and/or DUD-E protein-ligand binding data sets. However, the authors found that an AI model trained on ligand (small-molecule) data alone can achieve the same "unmatched" performance, suggesting that such a model can "predict" protein-ligand binding without learning protein-ligand interactions at all. The reason for this paradox is that PDBbind and DUD-E contain data biases that the AI model exploits.
PDBbind and DUD-E were not constructed as training sets for protein-ligand interaction prediction. Their main role is to serve as independent benchmark test sets for evaluating a model's predictive ability. Keeping the training set separate from the test set is basic practice and an important basis for judging a model's reliability. However, because experimental protein-ligand binding data are scarce, the reported AI models could only be evaluated by cross-validation on PDBbind and DUD-E themselves (the data are divided into k parts; each part in turn serves as the test set while the other k-1 parts form the training set). How reliable is a model trained this way? To answer this question, the authors designed baseline models and cross-validation experiments for PDBbind and DUD-E, and analyzed which implicit data biases the models learn.
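The k-fold procedure described above can be sketched in a few lines of Python. This is only an illustration of the general technique, not the authors' code: the function name `k_fold_splits` is hypothetical, and the sketch assumes contiguous folds without shuffling, whereas real experiments usually shuffle or group the data first.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for each of k folds.

    Assumption: folds are contiguous index ranges and no shuffling is done.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder so fold sizes differ by at most one.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]                     # held-out part
        train = indices[:start] + indices[start + size:]       # other k-1 parts
        yield train, test
        start += size


# Each sample lands in exactly one test fold across the k rounds.
for train_idx, test_idx in k_fold_splits(n_samples=10, k=3):
    print(len(train_idx), len(test_idx))
```

Because every fold is drawn from the same data set, any bias shared across the whole set is present in both the training and the test parts, which is why cross-validation alone cannot reveal it.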
PDBbind collects the protein-ligand complexes in the PDB protein crystal structure database that have experimentally measured binding constants. It is divided into general (11987), refined (3706) and core (195) sets according to data quality (low to high) and quantity (large to small). The authors split each complex into its protein and its ligand to form three PDBbind datasets: the original PDBbind (Binding Complex), PDBbind (Ligand Alone) containing only ligands, and PDBbind (Protein Alone) containing only proteins. Using the Atomic Convolutional Neural Network (ACNN) model developed by the Pande laboratory at Stanford University, they trained a model to predict protein-ligand binding strength on the refined set or the general set, with the core set removed from the training data, and then predicted the complexes in the core set. The results show that using only the protein structure or only the ligand structure as input gives performance similar to, or even better than, using the complex as input. In other words, the AI model can "calculate" the protein-ligand interaction without learning the protein-ligand interaction mode. This counterintuitive result has only one reasonable explanation: the PDBbind data used to train AI models are seriously biased. In short, for AI, the PDBbind data set is both an insufficient diet (the amount of data is not large enough) and a severely unbalanced one (the diversity is not high enough).
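The experimental setup described above (three input views of the same data, with the core set held out) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline or the ACNN code: the `Complex` record, the `make_views` and `hold_out_core` helpers, and the toy entries are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass
class Complex:
    """Hypothetical record for one protein-ligand complex."""
    entry_id: str
    protein_atoms: List[str]
    ligand_atoms: List[str]
    affinity: float  # experimentally measured binding value (toy numbers here)


View = List[Tuple[List[str], List[str], float]]


def make_views(complexes: List[Complex]) -> Dict[str, View]:
    """Build the three input views; labels are identical, only inputs differ."""
    views: Dict[str, View] = {
        "binding_complex": [],
        "ligand_alone": [],
        "protein_alone": [],
    }
    for c in complexes:
        views["binding_complex"].append((c.protein_atoms, c.ligand_atoms, c.affinity))
        views["ligand_alone"].append(([], c.ligand_atoms, c.affinity))    # protein removed
        views["protein_alone"].append((c.protein_atoms, [], c.affinity))  # ligand removed
    return views


def hold_out_core(complexes: List[Complex], core_ids: Set[str]):
    """Split so that core-set entries are never used for training."""
    train = [c for c in complexes if c.entry_id not in core_ids]
    test = [c for c in complexes if c.entry_id in core_ids]
    return train, test


# Toy data, invented purely for illustration.
toy = [
    Complex("toy1", ["N", "C"], ["O"], 5.0),
    Complex("toy2", ["C"], ["C", "N"], 7.5),
]
train, test = hold_out_core(toy, core_ids={"toy2"})
print(make_views(train)["ligand_alone"])
```

If a model trained on the `ligand_alone` or `protein_alone` view scores as well as one trained on `binding_complex`, the score cannot be coming from learned protein-ligand interactions, which is the diagnostic the article describes.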
To be continued in Part Two…
About Protheragen AI
Protheragen AI has proudly developed a unique artificial intelligence drug research and development platform that offers drug development solutions to customers worldwide, including but not limited to Drug R&D, Machine Translation, Intelligent Image Diagnosis, and Medical Therapy and Research System. Through big data analysis and other technical means, its AI platform can quickly and accurately mine and select the appropriate compounds or organisms. Compared with traditional methods, AI can reduce the cost of screening candidates by tens of billions every year. AI technology has been widely used in disease target prediction, high-throughput data analysis and systems biology modeling.