The Dark Corners of Our DNA Hold Clues about Disease

(Scientific American)

The so-called “streetlight effect” has often fettered scientists who study complex hereditary diseases. The term refers to an old joke about a drunk searching for his lost keys under a streetlight. A cop asks, “Are you sure this is where you lost them?” The drunk says, “No, I lost them in the park, but the light is better here.”

For researchers who study the genetic roots of human diseases, most of the light has shone down on the 2 percent of the human genome that includes protein-coding DNA sequences.

“That’s fine. Lots of diseases are caused by mutations there, but those mutations are low-hanging fruit,” says University of Toronto (U.T.) professor Brendan Frey, who studies genetic networks. “They’re easy to find because the mutation actually changes one amino acid to another one, and that very much changes the protein.”

The trouble is, many disease-related mutations also happen in noncoding regions of the genome, the parts that do not directly make proteins but that still regulate how genes behave. Scientists have long been aware of how valuable it would be to analyze the other 98 percent, but there has not been a practical way to do it.

Now Frey has developed a “deep-learning” algorithm that effectively shines a light on the entire genome. A paper appearing December 18 in Science describes how this algorithm can identify patterns of mutation across coding and noncoding DNA alike. The algorithm can also predict how likely each variant is to contribute to a given disease. “Our method works very differently from existing methods,” says Frey, the study’s lead author. “GWAS-, QTL- and ENCODE-type approaches can’t figure out causal relationships. They can only correlate. Our system can predict whether or not a mutation will cause a change in RNA splicing that could lead to a disease phenotype.”

RNA splicing is one of the major steps in turning genetic blueprints into living organisms. Splicing determines which bits of DNA code get included in the messenger-RNA strings that build proteins. Different configurations yield different proteins. Misregulated splicing contributes to an estimated 15 to 60 percent of human genetic diseases.
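As a rough illustration of why splicing matters, the toy sketch below shows how including or skipping a single exon changes the resulting protein. The exon sequences and the helper names `splice` and `translate` are invented for this example, and the codon table covers only the six codons used here; it is not drawn from the study.

```python
# Toy codon table: only the codons used below, taken from the standard genetic code.
CODONS = {"AUG": "M", "GCU": "A", "GAA": "E", "UUU": "F", "UGG": "W", "UAA": "*"}

# Hypothetical exons of a made-up gene, written as RNA.
EXONS = ["AUGGCU", "GAAUUU", "UGGUAA"]


def splice(exons, include):
    """Join the exons whose indices are listed in `include`, in order."""
    return "".join(exons[i] for i in include)


def translate(mrna):
    """Read the mRNA three bases at a time until a stop codon ('*')."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODONS[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


full_isoform = translate(splice(EXONS, [0, 1, 2]))     # all three exons kept
skipped_isoform = translate(splice(EXONS, [0, 2]))     # middle exon skipped

print(full_isoform, skipped_isoform)
```

The two calls start from the same underlying sequence yet yield different proteins, which is the sense in which a mutation that disturbs splicing can change a gene's product without touching the protein-coding letters themselves.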

Frey, a computer engineer who has a cross appointment in the university’s Department of Medical Research, trained his algorithm using millions of data points: DNA sequences, genetic variations and RNA splicing patterns. The algorithm was then able to extrapolate how likely it was that any of tens of thousands of mutations could cause a splicing error associated with a particular disease.

The research team tested the method on spinal muscular atrophy as well as nonpolyposis colorectal cancer. Frey says the team’s “most ambitious case” was its study of autism spectrum disorder; about 100 genes are known to be associated with it. In fact, many researchers think it is likely that autism comprises many disorders, each arising from unique mutations but all producing common symptoms.

Working with U.T. autism researcher Stephen Scherer, Frey compared mutations in autism patients’ genomes with those of controls. Nothing unusual popped up. But when Frey and Scherer tested the genomes against the mutations flagged by Frey’s algorithm, they “saw patterns emerge.” According to Frey, “Kids with autism are more likely to have these ‘high-scoring’ mutations that change the meaning of the genome, and that are thought to be involved with brain functions and developmental functions.”
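The kind of comparison described here can be pictured with a small sketch: instead of comparing all mutations, count how many individuals in each group carry at least one mutation that a scoring method flags as high impact. Everything below is hypothetical, including the scores, the 0.7 cutoff and the function name `carrier_fraction`; it is not the study's data or its statistical procedure.

```python
def carrier_fraction(genomes, threshold):
    """Fraction of individuals with at least one variant scoring >= threshold."""
    carriers = sum(
        1 for scores in genomes if any(score >= threshold for score in scores)
    )
    return carriers / len(genomes)


# Made-up per-individual variant scores (one inner list per person).
cases = [[0.1, 0.9], [0.2, 0.3], [0.8], [0.95, 0.4]]
controls = [[0.1, 0.2], [0.85], [0.3], [0.05, 0.4]]

THRESHOLD = 0.7  # assumed cutoff for a "high-scoring" variant

case_frac = carrier_fraction(cases, THRESHOLD)
control_frac = carrier_fraction(controls, THRESHOLD)

print(f"cases: {case_frac:.2f}, controls: {control_frac:.2f}")
```

A real analysis would involve far larger cohorts and a formal statistical test; the sketch only shows why filtering by predicted impact can reveal a difference that a raw comparison of all mutations hides.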

Not only did the algorithm’s analysis jibe with existing knowledge about autism genetics, it also identified 17 new disease-causing gene candidates. For each of the three diseases addressed in the study, the algorithm made predictions that were consistent with existing data and pointed toward additional regions of the genome where researchers might search next.

The combination of whole-genome analysis and predictive models for RNA splicing makes Frey’s method a major contribution to the field, according to Stephan Sanders, an assistant professor at the University of California, San Francisco, School of Medicine. “I’m looking forward to using this tool in larger data sets and really getting a sense of how important splicing is,” he says. Sanders, who researches the genetic causes of diseases, notes that Frey’s approach complements, rather than replaces, other methods of genetic analysis. “I think any genomist [sic] would agree that noncoding [areas of the genome] are hugely important. This method is a really novel way of getting at that,” he says.

Although other experts, along with the study’s authors, caution that it is a long journey from this type of research to new treatments, they also agree that Frey’s method reveals an important path toward that goal. “Whole-genome sequencing is absolutely essential,” says Robert Ring, chief science officer at Autism Speaks and the former head of Pfizer’s autism unit. “But making sense of it is really where the rubber hits the road. These guys are bringing machine learning to whole-genome data and developing new ways of finding where genetic changes might be clinically significant. This is where the likelihood of new treatments and diagnoses are.”