Predicting Structure-Activity Relationships through Data Mining/Predictive Analytics

Jabberwocky · Apr 16, 2018

Hello N&PD,

This might have come up as a topic of discussion before but I haven't seen this addressed in a while. In the last 20 years machine learning has become more mainstream, in part due to advances in the theoretical understanding, computational power and general acceptance of the versatility of these methods for discovering hidden relationships in large data sets. I recently built a neural network (NN) and have been playing a lot with different configurations and am growing pretty excited about their ability to classify data in real-world complex problems where intuition can fail. Some of the recent achievements in this field are hard to ignore.

For example, deep learning techniques have been employed by Google to train computers to learn the rules of chess and subsequently beat the world champion chess program, Stockfish 8, which has been developed over 20 years. (Read more: alphazero google deepmind ai beats champion program teaching itself to play in four hours). Other recent headlines related to deep learning show that facial recognition software is now advanced enough that it can help find a fugitive in a crowd of 60,000 (Read more: chinese facial recognition recognizes wanted man in crowd of 60000/). The question occurred to me as to whether these deep learning techniques could be used to make accurate predictions of structure-activity relationships for new molecules, using a database of existing binding affinities to predict the binding affinities of molecules for which data has not been experimentally determined.

By no means is this a new area of inquiry. Researchers have been looking at these kinds of questions for decades. In the past NNs lost favor with researchers as a viable strategy for data mining, displaced by other techniques like random forests and support vector machines (SVM). Only recently have deep neural networks (with more than one hidden layer of artificial neurons) become viable for these types of research questions, and recent reports have shown that they can actually do a better job than other techniques used in data mining.

For example, a recent publication https://pubs.acs.org/doi/abs/10.1021/ci500747n shows that deep NNs can begin to outperform more established techniques for data mining.

Anyways, I'm relatively new to machine learning and I'm not a medicinal chemist, so I don't have strong views on this, but I've been very impressed with the ability of neural networks to make predictions that defy intuition in complicated data sets. I know the intuition of a skilled medicinal chemist is often hard to beat when proposing candidates for new drugs, but I'm wondering if in the future data mining might dramatically change the way drug discovery is done and what some of the most promising techniques for realizing this might be. I know binding affinities can be calculated ab initio in some cases with good results. This is very different from data mining, however, because with data mining the physics and chemistry of the molecule are not required to make predictions. Rather, the complicated SAR is discovered from the data without need of physical intuition. The results of this data mining might then inform refinements in detailed physical models. How might ab initio results compare with those of NNs or other data mining techniques in terms of the accuracy of predicted binding affinities?
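To make the "no physics required" idea concrete, here is a minimal sketch of the simplest data-driven baseline I can think of. It is a similarity-weighted nearest-neighbour lookup, not a neural network, and it assumes each molecule has already been reduced to a set of "on" fingerprint bits. The fingerprints and affinity values below are made-up toy numbers, and the function names are hypothetical.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two sets of 'on' fingerprint bits."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0


def predict_affinity(query_bits, training_set, k=2):
    """Similarity-weighted mean affinity of the k most similar known molecules.

    training_set is a list of (fingerprint_bits, measured_affinity) pairs.
    Returns None when the query shares no bits with its nearest neighbours.
    """
    scored = sorted(
        ((tanimoto(query_bits, bits), affinity) for bits, affinity in training_set),
        reverse=True,
    )[:k]
    total = sum(similarity for similarity, _ in scored)
    if total == 0:
        return None
    return sum(similarity * affinity for similarity, affinity in scored) / total


# Toy, invented data: (fingerprint bits, affinity on an arbitrary pKi-like scale).
TOY_TRAINING_SET = [
    (frozenset({1, 2, 3}), 8.0),
    (frozenset({1, 2, 4}), 7.0),
    (frozenset({7, 8, 9}), 4.0),
]

if __name__ == "__main__":
    print(predict_affinity(frozenset({1, 2, 5}), TOY_TRAINING_SET))
```

Nothing in the prediction knows any chemistry. It only assumes that similar fingerprints tend to have similar affinities, which is the same assumption a NN would have to learn from the data in a more flexible form.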

I don't want to make this an overly restrictive discussion. I look forward to hearing from people who have insights, opinions, open questions and relevant papers related to data mining techniques in the drug discovery pipeline and how they might shape the future of drug discovery.

clubcard · Apr 16, 2018

In-silico models alone have not got a great record. They are of far more value to old-fashioned rational design. Even recently, people have made some great discoveries (in established fields) using CHARMM (Chemistry at HARvard Macromolecular Mechanics), which is free to students and non-profit organizations. J Mol Model (2011) 17:477–493 is a CLASSIC example. Of course, since then we have made strides into more subtypes, but with the appropriate training sets it's quite good at spotting an active. For less mature fields, I think others will know better than I.

Jabberwocky · Apr 22, 2018

Thanks clubcard. Appreciate the article. While I'm not familiar with the CHARMM program, I know molecular dynamics calculations are being used routinely to calculate molecular properties like binding affinity, but like you said, these calculations can be finicky, and they require lots of expertise to configure and lots of computer processing time to calculate, and their accuracy is typically ball-park from what I understand. Most of the computational effort in molecular dynamics calculations is in updating the force field calculation, which in the case of the CHARMM program uses the CHARMM force fields. That's actually one of the ways neural networks might help. To speed up these calculations the force field potential could be learned with an appropriate training set. Without understanding all the sophisticated physics and chemistry required to calculate the force field from first principles, an NN would simply be trained using a database of pre-computed force field models for known structures.

For example, in the paper "Neural network provides accurate simulations without the cost" they train an NN to make accurate predictions of DFT calculations in a fraction of the time. Training is the most time-consuming step with an NN, but once an NN is adequately trained it can produce estimates in milliseconds, as opposed to the lengthy calculations required for semi-empirical methods. Such a procedure could be adapted to the CHARMM force fields, I'd imagine, and that might really speed things up down the line.
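Here is a rough sketch of the "train once on pre-computed energies, then predict cheaply" idea. Everything in it is an assumption for illustration: a one-dimensional Morse potential stands in for the expensive calculation, the network is a tiny one-hidden-layer tanh model written in plain Python, and none of it reflects how the paper above or CHARMM actually do things.

```python
import math
import random


def reference_energy(r, depth=1.0, width=1.5, r_eq=1.0):
    """Stand-in for the expensive calculation: a Morse potential (assumed)."""
    return depth * (1.0 - math.exp(-width * (r - r_eq))) ** 2 - depth


def make_training_set(n=40, r_min=0.8, r_max=2.5):
    """Pre-compute (distance, energy) pairs on an even grid."""
    step = (r_max - r_min) / (n - 1)
    return [(r_min + i * step, reference_energy(r_min + i * step)) for i in range(n)]


def init_params(hidden, rng):
    """Small random weights for a 1 -> hidden -> 1 network."""
    return {
        "w1": [rng.uniform(-0.5, 0.5) for _ in range(hidden)],
        "b1": [0.0] * hidden,
        "w2": [rng.uniform(-0.5, 0.5) for _ in range(hidden)],
        "b2": 0.0,
    }


def forward(p, r):
    """Return the hidden activations and the predicted energy."""
    hidden = [math.tanh(w * r + b) for w, b in zip(p["w1"], p["b1"])]
    return hidden, sum(w * h for w, h in zip(p["w2"], hidden)) + p["b2"]


def mse(p, data):
    """Mean squared error of the surrogate against the reference energies."""
    return sum((forward(p, r)[1] - e) ** 2 for r, e in data) / len(data)


def train(p, data, lr=0.02, epochs=500):
    """Full-batch gradient descent on half the mean squared error."""
    n, size = len(data), len(p["w1"])
    for _ in range(epochs):
        g_w1, g_b1, g_w2, g_b2 = [0.0] * size, [0.0] * size, [0.0] * size, 0.0
        for r, e in data:
            hidden, out = forward(p, r)
            err = out - e
            g_b2 += err
            for j in range(size):
                g_w2[j] += err * hidden[j]
                back = err * p["w2"][j] * (1.0 - hidden[j] ** 2)  # through tanh
                g_w1[j] += back * r
                g_b1[j] += back
        for j in range(size):
            p["w1"][j] -= lr * g_w1[j] / n
            p["b1"][j] -= lr * g_b1[j] / n
            p["w2"][j] -= lr * g_w2[j] / n
        p["b2"] -= lr * g_b2 / n
    return p


if __name__ == "__main__":
    data = make_training_set()
    params = init_params(8, random.Random(0))
    print("MSE before training:", mse(params, data))
    train(params, data)
    print("MSE after training: ", mse(params, data))
```

The point is only the workflow: the slow reference function is called once per training point, and afterwards `forward` answers new queries without touching it. A real learned potential would take full atomic environments as input rather than a single distance.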

Interesting stuff.

Here's a recent article in C&EN magazine about the promise of NNs in big data as applied to chemical questions: https://cen.acs.org/articles/95/i39/Digitalization-comes-materials-industry.html

Here's a recent article that suggests that these techniques rival the more traditional QSAR models, though I haven't read it yet: "Chemception Matches the Performance of Expert-developed QSAR/QSPR Models". These guys are calling it Chemception, a deep convolutional neural network (CNN) architecture for general-purpose small molecule property prediction. They use Chemception as a general-purpose neural network architecture for predicting toxicity, activity, and solvation properties when trained on a modest database (~600 - 40,000 molecules).

Jabberwocky · Apr 23, 2018

Here's a really interesting paper I found in the literature, unrelated to SAR, but one that in the future is likely to shape the way chemists make decisions about synthetic routes. Once a target molecule is identified, the ability to produce this molecule in a cost-effective way is certainly a part of the calculation in bringing a drug to market. E.J. Corey formalized the concept of retrosynthesis, a problem-solving technique in which target molecules are recursively transformed into increasingly simpler precursors to identify the most promising synthetic routes. Retrosynthetic analysis is the accepted methodology for making an informed choice about a synthetic route.
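The recursive part of retrosynthesis can be sketched in a few lines. This is only a toy: the molecule names, the table of allowed disconnections and the list of purchasable building blocks are all hypothetical placeholders, and a real planner would also have to rank competing routes rather than take the first one that works.

```python
# Hypothetical toy data: which building blocks can be bought, and which
# precursor sets each molecule can be disconnected into.
PURCHASABLE = {"bb1", "bb2", "bb3"}
DISCONNECTIONS = {
    "target": [("int_a", "bb1"), ("int_b",)],
    "int_a": [("bb2", "bb3")],
    "int_b": [("unknown",)],
}


def plan(molecule, depth=0, max_depth=5):
    """Recursively break a molecule down until only purchasable pieces remain.

    Returns the molecule name if it can be bought, a nested
    {molecule: [sub-routes]} dict if a route exists, or None otherwise.
    """
    if molecule in PURCHASABLE:
        return molecule
    if depth >= max_depth:
        return None
    for precursors in DISCONNECTIONS.get(molecule, []):
        subroutes = [plan(p, depth + 1, max_depth) for p in precursors]
        if all(route is not None for route in subroutes):
            return {molecule: subroutes}
    return None


if __name__ == "__main__":
    print(plan("target"))
```

Exhaustive recursion like this blows up quickly once every molecule has thousands of possible disconnections, which is why the search has to be guided by something, whether a chemist's intuition or a learned policy.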

Attempts have been made to automate this process in the past using databases of known chemical reactions. Selecting promising routes is one thing a trained chemist excels in, using a highly developed intuition about what works. According to the authors, "methods of algorithmically extracting transformations from reaction datasets have been criticized for high noise and lack of 'chemical intelligence'." Another way of stating that is that a trained chemist can typically tell the difference between a machine-proposed synthetic pathway and one proposed by a skilled chemist. This is a kind of Turing test for synthetic pathways. Can a trained chemist tell whether a proposed synthetic route is computer generated?

To answer that question, the authors of this paper implement an advanced deep learning methodology to propose the best possible synthetic pathways. The proposed pathways were then carefully analyzed and compared to a synthetic pathway that was proposed by a human: 45 graduate-level organic chemists from two world-leading organic chemistry institutes in China and Germany were asked to choose one of two routes leading to the same molecule on the basis of personal preference and synthetic plausibility. Unlike those lazy chemists, the NN has read pretty much all papers on organic synthesis, is trained on a database of all known chemical transformations and uses advanced decision-making heuristics.

In this work, we combine three different neural networks together with MCTS to perform chemical synthesis planning (3N-MCTS). The first neural network (the expansion policy) guides the search in promising directions by proposing a restricted number of automatically extracted transformations. A second neural network then predicts whether the proposed reactions are actually feasible (in scope). Finally, to estimate the position value, transformations are sampled from a third neural network during the rollout phase. The neural networks were trained on essentially all reactions published in the history of organic chemistry.
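As a rough illustration of how the three pieces quoted above slot into a Monte Carlo tree search, here is a heavily simplified sketch. It is not the authors' implementation: the three "networks" are replaced by hand-written stub functions, the rule table and molecule names are hypothetical, and the reward is simply 1 if a rollout reaches purchasable building blocks and 0 otherwise.

```python
import math
import random

BUILDING_BLOCKS = {"bb1", "bb2", "bb3"}           # hypothetical purchasables
RULES = {                                          # hypothetical (prior, precursors)
    "target": [(0.6, ("int_a", "bb1")), (0.4, ("int_b",))],
    "int_a": [(1.0, ("bb2", "bb3"))],
    "int_b": [(1.0, ("dead_end",))],
}

def expansion_policy(mol, top_k=2):
    """Stub for network 1: propose the top-k transformations by prior."""
    return sorted(RULES.get(mol, []), reverse=True)[:top_k]

def in_scope(mol, precursors):
    """Stub for network 2: a real filter would reject infeasible reactions."""
    return True

def rollout_policy(mol, rng):
    """Stub for network 3: sample one transformation, or None if stuck."""
    options = RULES.get(mol, [])
    return rng.choice(options)[1] if options else None

def apply_step(state, mol, precursors):
    """Replace mol with whichever of its precursors still have to be made."""
    rest = set(state) - {mol}
    rest.update(p for p in precursors if p not in BUILDING_BLOCKS)
    return frozenset(rest)

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = None, 0, 0.0

def expand(node):
    node.children = []
    if node.state:                                 # empty state means solved
        mol = min(node.state)
        for _, precursors in expansion_policy(mol):
            if in_scope(mol, precursors):
                node.children.append(Node(apply_step(node.state, mol, precursors), node))

def rollout(state, rng, max_depth=10):
    for _ in range(max_depth):
        if not state:
            return 1.0
        mol = min(state)
        precursors = rollout_policy(mol, rng)
        if precursors is None:
            return 0.0
        state = apply_step(state, mol, precursors)
    return 1.0 if not state else 0.0

def ucb(child, parent_visits, c=1.4):
    if child.visits == 0:
        return math.inf
    return child.value / child.visits + c * math.sqrt(math.log(parent_visits) / child.visits)

def search(target, iterations=200, seed=0):
    rng = random.Random(seed)
    root = Node(frozenset([target]))
    for _ in range(iterations):
        node = root
        while node.children:                       # selection
            node = max(node.children, key=lambda ch: ucb(ch, node.visits))
        if node.children is None:                  # expansion
            expand(node)
            if node.children:
                node = node.children[0]
        reward = rollout(node.state, rng)          # rollout
        while node is not None:                    # backpropagation
            node.visits += 1
            node.value += reward
            node = node.parent
    return root

def best_route(root):
    """Follow the most-visited child at each level."""
    route, node = [root.state], root
    while node.children:
        node = max(node.children, key=lambda ch: ch.visits)
        route.append(node.state)
    return route

if __name__ == "__main__":
    print(best_route(search("target")))
```

In this toy the search learns to avoid the branch through `int_b` because every rollout from it ends at a molecule with no known transformation. In the real system the interesting work is inside the three networks, which here do nothing clever at all.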

You can read more here: nature25978

The real power of components like this one is that they can be integrated seamlessly with other deep-learning efforts to essentially automate the entire drug discovery process, from proposing active candidates, to selecting the best synthetic routes, to scaling up manufacture and optimizing yield, to evaluating their effectiveness.

It certainly appears that the futuristic-sounding promises of big data are beginning to bear fruit, and the role of the chemist will likely have to adapt to new ways of doing science in the future. The next decade is going to be really interesting.

clubcard · Apr 24, 2018

I'm not convinced, sorry. I don't need to search every possible route... 30 years of experience constantly keeping up with new papers means I spot the highest molar efficiency right away. Of course, that often isn't the cheapest and yes, chemical engineers deal with scale, but in almost all cases, a continuous process scores over a batch process and the choice is based on the cheapest route in practice. For the synthesis of highly complex natural compounds, I would be looking to see if they come up with anything significantly better than RB Woodward managed using the routes available to him. I'm actually prepared to bet that the 'deep learning' will just end up doing it his way. I would also point out that an awful lot of the really cost-cutting routes get patented OR just stay out of the public domain. How does that enter deep-learning? In silico is of great value but it's being over-vaunted as startups seek to convince managers, not chemists, to BUY.

Where I am - I only need 'enough', so not wasting the lab's time and going with a route I KNOW they can do is important. Does deep-learning follow knowing the strengths of a lab-team? Does it recognize the reputation & record of the research group who wrote the papers? Some are poorly characterized, some downright lie. There is still a sufficient amount of human ability required, and I have found out the hard way that while dimethyl cadmium might be used in a paper to replace 3 steps with 1, do you really want to put people in the path of such reagents?

clubcard · Apr 24, 2018

I'm not alone -

http://blogs.sciencemag.org/pipeline/archives/2018/04/23/benevolentai-worth-two-billion

The chances are that it will fail. It IS interesting, but like graphene, it is supposed to be 'jam tomorrow' every day, week, month, year... We were told HTS would speed things up - robots doing much of the work and so on. It just meant we ended up with x20 more compounds we blindly went down the road with even though it was apparent that they were very unlikely to work. So, cost went UP, discovery went DOWN. The preclinical is the rate-limiting step. Candidates... we have more than we can deal with and have to drop the several very good ones and concentrate on just 1. I don't think I would trust in silico over in vitro.

Jabberwocky · Apr 24, 2018

All good points. I spent a summer as an undergrad following a synthetic route that was supposed to give 70% yield or so but I was getting nothing. Eventually, my professor wrote a letter to the professor who published the paper asking about the reaction and he got a sort of apologetic reply saying they extrapolated because they didn’t have enough chemical left to run a test. In other words that reaction probably didn’t work. I lost interest in organic chemistry and decided to focus on physical chemistry. Glad I did because chemical synthesis wasn’t my favorite.

I have a lot of mixed feelings about seemingly blind faith in automation, and despite my excitement reading some of the recent developments, I don’t think this is something . That said, the sophistication of the tools and the patterns they are predicting are hard to ignore. I’m forced to evaluate these tools at this point for work. I’ve been watching all these videos and the advances that have happened in the last decade are really impressive. They are trained to learn human patterns, and some of those can be sloppy because the humans were sloppy. Other decision-making heuristics humans use, I'm sure, are not captured by the deep learning, and these may be important too. That said, what is important to recognize in all this is that NNs have no intelligence of their own; they are advanced pattern recognition and optimization tools. In that sense I find them useful. They can reveal patterns that humans may not be aware of, biases they didn’t know about and ways around compromises that they cannot conceive. Yeah, these tools are here to stay, for better or worse, IMO. It would be a mistake, however, to prematurely favor a deep learning recommendation over a human one at this stage, and it may be a long time before a deep learning approach yields measurably better results than humans. Even then, deep learning will never replace humans, because humans are what NNs need to learn from. No doubt the impact of these technologies will be disruptive when they first start getting integrated into workflows, and that could have negative effects in an industry that is very focused on the bottom line.

clubcard · Apr 25, 2018

GIGO is the most apt term for the problem. I have spent years wasting time on Russian, Indian & Chinese papers (particularly dubious funding leads to dubious results) and on others from almost every country in the world. There are a host of prizes for people finding cheaper routes to compounds that have potential commercial application. I seem to recall nepetalactone was one posted 20 years ago and nobody has claimed it yet. If these in-silico models work, the obvious way to prove their benefit is to find such important answers.

Chemical engineers may benefit but I think it reasonable to say that there is an art in the synthesis of natural products and the number of variables is not limited to those that can be enumerated. VSEPR and derived vibrational frequencies work fine with 100% purity of every reagent, but how often does something work because of an impurity? From FOGBANK in H-bombs to caprolactam in nylon stockings, both need a pinch of cyclohexenone to produce the desired results. The processes were used for 40+ years and it was only in the 90s that someone discovered the catalytic properties.

People confuse knowledge with intelligence with wisdom. I gradually arrived at the third from the second by trial and error of the first.
