1 Introduction
1.1 Uncharacterized Protein Segments Are a Source of Functional Novelty
Over the past decade, we have observed
a massive increase in the amount of information describing protein
sequences from a variety of organisms.
1,2
While this
may reflect the diversity in sequence space, and possibly also in
function space,
3
a large proportion of
the sequences lacks any useful function annotation.
4,5
Often
these sequences are annotated as putative or hypothetical proteins,
and for the majority their functions still remain unknown.
6,7
Suggestions about potential protein function, primarily molecular
function, often come from computational analysis of their sequences.
For instance, homology detection allows for the transfer of information
from well-characterized protein segments to those with similar sequences
that lack annotation of molecular function.
8−10
Other aspects
of function, such as the biological processes proteins participate
in, may come from genetic- and disease-association studies, expression
and interaction network data, and comparative genomics approaches
that investigate genomic context.
11−17
Characterization of unannotated and uncharacterized protein segments
is expected to lead to the discovery of novel functions as well as
provide important insights into existing biological processes. In
addition, it is likely to shed new light on molecular mechanisms of
diseases that are not yet fully understood. Thus, uncharacterized
protein segments are likely to be a large source of functional novelty
relevant for discovering new biology.
1.2 Structure–Function Paradigm Enhances Function Prediction
Traditionally, protein function has been
viewed as critically dependent on the well-defined and folded three-dimensional
structure of the polypeptide chain. This classical structure–function
paradigm (Figure 1; left panel) has mainly
been based on concepts explaining the specificity of enzymes, and
on structures of folded proteins that have been determined primarily
using X-ray diffraction on protein crystals. The classical concept
implies that protein sequence defines structure, which in turn determines
function; that is, function can be inferred from the sequence and
its structure. Even when protein sequences diverge during evolution,
for example, after gene duplication, the overall fold of their structures
remains roughly the same. Therefore, structural similarity between
proteins can reveal distant evolutionary relationships that are not
easily detectable using sequence-based methods.
18,19
Structural genomics efforts such as the Protein Structure Initiative
(PSI) have been set up to enlarge the space of known protein folds
and their functions, thereby complementing sequence-based methods
in an attempt to fill the gap of sequences for which there is no function
annotation.
20,21
Specifically, phase two of the
PSI aimed to structurally characterize proteins and protein domains
of unknown function, often providing the first hypothesis about their
function and serving as a starting point for their further characterization.
1.3 Classification Further Facilitates Function Prediction
Classification schemes provide a guideline for
systematic function assignment to proteins. Generally, proteins are
made up of a single or multiple domains that can have distinct molecular
functions. These domains, which are referred to as structured domains,
often fold independently, make precise tertiary contacts, and adopt
a specific three-dimensional structure to carry out their function.
The sequences that compose structured domains can be organized into
families of homologous sequences, whose members are likely to share
a common evolutionary relationship and molecular function. The Pfam
database classifies known protein sequences and contains almost 15 000
such families, for most of which there is some understanding about
the function.
22
Nevertheless, Pfam also
contains more than 3000 families annotated as domains of unknown function,
or DUFs.
23
These families are largely made
up of hypothetical proteins and await function annotation. Another
powerful example of a protein classification scheme is the Structural
Classification of Proteins (SCOP), which provides a means of grouping
proteins with known structure together, based on their structural
and evolutionary relationships.
24,25
SCOP utilizes a hierarchical
classification consisting of four levels, (i) family, (ii) superfamily,
(iii) fold, and (iv) class, with each level corresponding to different
degrees of structural similarity and evolutionary relatedness between
members. Using this scheme, function of newly solved structures or
sequences can be inferred from their similarity with existing protein
classes through structure or sequence comparisons, for instance, as available via
the SUPERFAMILY database.
10
In this direction, another major initiative
is Genome3D, which is a collaborative project to annotate genomic
sequences with predicted 3D structures based on CATH
26
(Class, Architecture, Topology, Homology) and SCOP
24,25
domains to infer protein function.
27
1.4 Intrinsically Disordered Regions and Proteins
While many proteins need to adopt a well-defined structure to carry
out their function, a large fraction of the proteome of any organism
consists of polypeptide segments that are not likely to form a defined
three-dimensional structure, but are nevertheless functional.
28−42
These protein segments are referred to as intrinsically disordered
regions (IDRs; Figure 1; right panel).
43
Because IDRs
generally lack bulky hydrophobic amino acids, they are unable to form
the well-organized hydrophobic core that makes up a structured domain
31,44
and hence their functionality arises in a different manner as compared to the classical
structure–function
view of globular, structured proteins. In this framework, protein
sequences in a genome can be viewed as modular because they are made
up of combinations of structured and disordered regions (Figure 1; bottom panel).
Proteins without IDRs are called
structured proteins, and proteins with entirely disordered sequences
that do not adopt any tertiary structure are referred to as intrinsically
disordered proteins (IDPs). The majority of eukaryotic proteins are
made up of both structured and disordered regions, and both are important
for the repertoire of functions that a protein can have in a variety
of cellular contexts.
43
Traditionally,
IDRs were considered to be passive segments in protein sequences that
“linked” structured domains. However, it is now well
established that IDRs actively participate in diverse functions mediated
by proteins. For instance, disordered regions are frequently subjected
to post-translational modifications (PTMs) that increase the number of
functional states in which a protein can exist in the cell.
45,46
In addition, they expose short linear peptide motifs of about 3–10
amino acids that permit interaction with structured domains in other
proteins.
47,48
These two features in isolation or in combination
permit the interaction and recruitment of diverse proteins in space
and time, thereby facilitating regulation of virtually all cellular
processes.
47
The prevalence of IDRs in
any genome (see, for example, the D2P2 database,
49
Box 1) in combination
with their unique characteristics means that these regions extend
the classical view of the structure–function paradigm and hence
that of protein function. Thus, functional regions in proteins can
either be structured or disordered, and these need to be considered
as two fundamental classes of functional building blocks of proteins.
50
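As an illustration of how short linear motifs of roughly 3–10 residues can be located in a sequence, the following minimal Python sketch scans a protein sequence with regular expressions. The two patterns are deliberately simplified consensus forms (a cyclin-dependent kinase phosphorylation site written as [ST]P.[KR], and a bare KEN box), the function name `scan_motifs` and the toy sequence are hypothetical, and none of this is taken from the resources cited in this Review. Curated motif definitions are considerably more detailed, and matches of this kind are only candidates that require further evidence.

```python
import re

# Simplified, illustrative motif patterns (assumptions, not curated definitions).
MOTIFS = {
    "CDK_phosphosite": r"[ST]P.[KR]",  # S/T-P-x-K/R consensus
    "KEN_box": r"KEN",                 # minimal KEN degron core
}


def scan_motifs(sequence, motifs=MOTIFS):
    """Return (motif name, 0-based start, matched peptide) for every
    non-overlapping regex match, sorted by start position."""
    hits = []
    for name, pattern in motifs.items():
        for match in re.finditer(pattern, sequence):
            hits.append((name, match.start(), match.group()))
    return sorted(hits, key=lambda hit: hit[1])


if __name__ == "__main__":
    toy_sequence = "MSTPAKRAAKENLLSPEKV"  # hypothetical sequence
    for hit in scan_motifs(toy_sequence):
        print(hit)
```

Because such short patterns occur frequently by chance, a practical analysis would typically restrict the scan to predicted disordered regions and weigh matches by conservation or other context.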
Figure 1
Structured domains and intrinsically disordered regions
(IDRs)
are two fundamental classes of functional building blocks of proteins.
The synergy between disordered regions and structured domains increases
the functional versatility of proteins. Adapted with permission from
ref (50). Copyright
2012 American Association for the Advancement of Science.
1.5 The Need for Classification of Intrinsically Disordered Regions and Proteins
IDRs and IDPs are prevalent
in eukaryotic genomes. For instance, 44% of human protein-coding genes
contain disordered segments of >30 amino acids in length
49
(similar data shown in Figure 2A). In the human genome, 6.4% of all protein-coding
genes
do not have any function annotation in their description in Ensembl
1
(Figure 2B). Further investigation
using the D2P2 database of disorder in genomes
49
revealed that most of these genes with no function
annotation encode at least some disorder (Figure 2B) and that genes with no annotation
contain proportionally
more IDRs (Figure 2C). Given the absence of
structural constraints, IDRs tend to evolve more rapidly than protein
domains that adopt defined structures.
51−56
As a result, identifying homologous regions is harder for IDRs and
IDPs than it is for structured domains. This complicates the transfer
of information about function between homologues and thus the prediction
of function of IDRs and IDPs. Furthermore, much of protein annotation
is based on information on sequence families and structured domains.
However, less than one-half of all residues in the human proteome
fall within such domains (Figure 3). Not only
do most residues of human proteins fall outside domains, a large fraction
of these residues are also disordered (Figure 3A and B, right bars). Moreover, although
it is expected that SUPERFAMILY
domains based on known protein structures have very little disorder
(Figure 3A, left bar), Pfam domains based on
sequence clustering do not contain much more (Figure 3B, left bar). These observations
suggest that there is a large
pool of protein segments that are not considered by conventional protein
annotation methods, because the sequences of disordered regions are
difficult to align, or because the methods do not explicitly consider
disordered and nondomain regions of the protein sequence. Taken together,
these considerations raise the need to devise a classification scheme
specifically for disordered regions in proteins that may enhance the
function prediction and annotation for this important class of protein
segments.
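The quantities used above (the percentage of disordered residues in a protein and the presence of disordered segments of >30 amino acids) can be computed from a per-residue disorder assignment. The following is a minimal Python sketch under stated assumptions: the input is a list of booleans with one entry per residue, and the function names `percent_disordered` and `disordered_segments` are hypothetical helpers, not part of any database or tool cited here.

```python
from itertools import groupby


def percent_disordered(mask):
    """Percentage of residues flagged as disordered (0.0 for an empty protein)."""
    return 100.0 * sum(mask) / len(mask) if mask else 0.0


def disordered_segments(mask, min_length=31):
    """Return half-open (start, end) index pairs of continuous disordered
    runs with at least min_length residues (default 31, i.e., >30)."""
    segments = []
    position = 0
    for is_disordered, run in groupby(mask):
        length = sum(1 for _ in run)
        if is_disordered and length >= min_length:
            segments.append((position, position + length))
        position += length
    return segments


if __name__ == "__main__":
    # Hypothetical 100-residue protein: one long and one short disordered run.
    mask = [False] * 10 + [True] * 35 + [False] * 5 + [True] * 20 + [False] * 30
    print(percent_disordered(mask))
    print(disordered_segments(mask))
```

Indices are 0-based and half-open, so a reported pair (10, 45) covers 35 residues; the length cutoff is a parameter because different studies use different thresholds.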
Figure 2
The number of protein-coding genes in the human genome with various
amounts of disorder. Histograms of the numbers of human genes with
annotation (A) and without annotation (B), grouped by the percentage
of disordered residues. (C) A comparison of the fraction of annotated
and unannotated human genes with different amounts of disorder. Residues
in each protein are defined as disordered when there is a consensus
between >75% of the predictors in the D2P2 database
49
at that position. The set
of human genes was
taken from Ensembl release 63,
1
and the
representative protein coded for by the longest transcript was used
in each case. The annotation was taken from the description field
with “open reading frame”, “hypothetical”,
“uncharacterized”, and “putative protein”
treated as no annotation.
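The two operational definitions in this caption (a residue counts as disordered when >75% of predictors agree, and a gene counts as unannotated when its description contains one of four phrases) can be sketched as follows. This is a minimal illustration, not the pipeline used to produce the figure: the input layout (one boolean list per predictor), the case-insensitive substring matching, and the function names are assumptions.

```python
# Phrases treated as "no annotation" in the caption above.
NO_ANNOTATION_TERMS = (
    "open reading frame",
    "hypothetical",
    "uncharacterized",
    "putative protein",
)


def consensus_disorder(predictions, threshold=0.75):
    """Per-residue consensus: True where strictly more than `threshold`
    of the predictors call the residue disordered.

    predictions: list of equal-length boolean lists, one per predictor.
    """
    n_predictors = len(predictions)
    return [sum(column) / n_predictors > threshold for column in zip(*predictions)]


def is_unannotated(description):
    """True if the description contains any of the no-annotation phrases
    (case-insensitive substring match; an assumption of this sketch)."""
    text = description.lower()
    return any(term in text for term in NO_ANNOTATION_TERMS)


if __name__ == "__main__":
    # Four hypothetical predictors, three residues.
    predictions = [
        [True, True, False],
        [True, True, False],
        [True, True, True],
        [True, False, False],
    ]
    print(consensus_disorder(predictions))
    print(is_unannotated("Uncharacterized protein C1orf50"))
```

Note that with four predictors, agreement of three (exactly 75%) does not satisfy a strict >75% cutoff; how empty description fields should be handled is not specified in the caption and is left out of the sketch.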
Figure 3
The fraction of disordered residues located in domains in human
protein-coding genes: (A) residues inside (left) and outside (right)
of SCOP domains,
24
and (B) residues inside
(left) and outside (right) of Pfam domains (only curated Pfam domains
were considered, i.e., Pfam-A).
22
The SCOP
domains in human proteins are defined by the SUPERFAMILY database.
10
Disordered residues were taken from the D2P2 database
49
(when
there is a consensus between >75% of the disorder predictors).
The
set of human genes was taken from Ensembl release 63.
1
In this Review, we synthesize
and provide an overview of the various
classifications of intrinsically disordered regions and proteins that
have been put forward in the literature since the start of systematic
studies into their function some 15 years ago. We discuss approaches
based on function, functional elements, structure, sequence, protein
interactions, evolution, regulation, and biophysical properties (Table 1). Finally,
we discuss resources that are currently
available for gaining insight into IDR function (Table 2), we suggest areas where
increased efforts are likely to
advance our understanding of the functions of protein disorder, and
we speculate how combinations of multiple existing classification
schemes could achieve high quality function prediction for IDRs, which
should ultimately lead to improved function coverage and a deeper
understanding of protein function.
Table 1
Classifications of Intrinsically Disordered Regions and Proteins
| basis for classification | classes | description | examples |
| --- | --- | --- | --- |
| function | | | |
| (33,39,57,58) | entropic chains | IDRs carrying out functions that benefit directly from their conformational disorder, e.g., flexible linkers and spacers | MAP2 projection domain, titin PEVK domain, RPA70, MDA5 |
| | display sites | flexibility of IDRs facilitates exposure of motifs and easy access for proteins that introduce and read PTMs | p53, histone tails, p27, CREB kinase-inducible domain |
| | chaperones | their binding properties (many different partners, rapid association/disassociation, and folding upon binding) make IDPs suitable for chaperone functions | hnRNP A1, GroEL, α-crystallin, Hsp33 |
| | effectors | folding upon binding mechanics allow effectors to modify the activity of their partner proteins | p21, p27, calpastatin, WASP GTPase-binding domain |
| | assemblers | assembling IDRs have large binding interfaces that scaffold multiple binding partners and promote the formation of higher-order protein complexes | ribosomal proteins L5, L7, L12, L20, Tcf 3/4, CREB transactivator domain, Axin |
| | scavengers | disordered scavengers store and neutralize small ligands | chromogranin A, Pro-rich glycoproteins, caseins and other SCPPs |
| functional features | | | |
| linear motifs (47,125) | structural modification | sites of conformational alteration of a peptide backbone | peptidylprolyl cis–trans isomerase Pin1 sites |
| | proteolytic cleavage | sites of post-translational processing events or proteolytic cleavage scission sites | Caspase-3/-7, separase, taspase1 scission sites |
| | PTM removal/addition | specific binding sequences that recruit enzymes catalyzing PTM moiety addition or removal | cyclin-dependent kinase phosphorylation site, SUMOylation site, N-glycosylation site |
| | complex promoting | motifs that mediate protein–protein interactions important for complex formation; often associated with signal transduction | proline-rich SH3-binding motif, cyclin box, pY SH2-binding motif, PDZ-binding motif, TRAF-binding motifs in MAVS |
| | docking | motifs that increase the specificity and efficiency of modification events by providing an additional binding surface | KEN box degron, MAPK docking sites |
| | targeting or trafficking | signal sites that localize proteins within particular subcellular organelles or act to traffic proteins | nuclear localization signal, clathrin box motif, endocytosis adaptor trafficking motifs |
| molecular recognition features (MoRFs) (121) | alpha | disordered motifs that form α-helices upon target binding | p53 ∼ Mdm2, p53 ∼ RPA70, p53 ∼ S100B(ββ), RNase E ∼ enolase, inhibitor IA3 ∼ proteinase A |
| | beta | disordered motifs that form β-strands upon target binding | RNase E ∼ polynucleotide phosphorylase, Grim ∼ DIAP1, pVIc ∼ adenovirus 2 proteinase |
| | iota | disordered motifs that form irregular secondary structure upon target binding | p53 ∼ Cdk2-cyclin A, amphiphysin ∼ α-adaptin C |
| | complex | disordered motifs that contain combinations of different types of secondary structure upon target binding | amyloid β A4 ∼ X11, WASP ∼ Cdc42 |
| intrinsically disordered domains (IDDs) (158,159) | | some protein domains identified using sequence-based approaches are fully or largely disordered | WH2, RPEL, BH3, KID domains |
| co-occurrence of protein domains with disordered regions (161,162) | | particular disordered regions frequently co-occur in the same sequence with specific protein domains | |
| structure | | | |
| structural continuum (37) | | proteins function within a continuum of differently disordered conformations, extending from fully structured to completely disordered, with everything in between and no strict boundaries between the states | |
| protein quartet (32,34,166) | intrinsic coil | flexible regions of extended conformation with hardly any secondary structure; high net charge differentiates these from disordered globules | ribosomal proteins L22, L27, 30S, S19, prothymosin α |
| | pre-molten globule | disordered protein regions with residual secondary structure, often poised for folding upon binding events; lower net charge makes them more compact than coils | Max, ribosomal proteins S12, S18, L23, L32, calsequestrin |
| | molten globule | globally collapsed conformation with regions of fluctuating secondary structure | nuclear coactivator binding domain of CREB binding protein |
| | folded | structured proteins with a defined three-dimensional structure | most enzymes, transmembrane domains, hemoglobin, actin |
| sequence | | | |
| sequence–structural ensemble relationships (166,204) | polar tracts | sequence stretches enriched in polar amino acids often form globules that are generally devoid of significant secondary structure preferences | Asn- and Gly-rich sequences, Gln-rich linkers in transcription factors and RNA-binding proteins |
| | polyelectrolytes | amino acid compositions biased toward charged residues of one type; strong polyelectrolytes (high net charge) form expanded coils | Arg-rich protamines, Glu/Asp-rich prothymosin α |
| | polyampholytes | sequences with roughly equal numbers of positive and negative charges; conformations of polyampholytes are governed by the linear distribution of oppositely charged residues, with segregation of opposite charges leading to globules, while well-mixed charged sequences adopt random-coil or globular conformations, depending on the total charge | RNA chaperones, splicing factors, titin PEVK domain, yeast prion Sup35 |
| prediction flavors (205) | V | predicted best by the VL-2V predictor, for which the hydrophobic amino acids are the most influential attributes | E. coli ribosomal proteins |
| | C | VL-2C is the best predictor for flavor C, which has more histidine, methionine, and alanine residues than the other flavors | poly- and oligosaccharide binding domains |
| | S | flavor with less histidine than the others, best predicted by predictor VL-2S, which has a measure of sequence complexity as the most important attribute | proteins that facilitate binding and interaction |
| disorder–sequence complexity (206) | | IDPs from different functional classes show distinct disorder–sequence complexity distributions | proteins with disordered linkers between structured domains populate compact and disordered DC regions |
| overall degree of disorder (35,51,68,161,208,209) | fraction | categorization of proteins based on the fraction of residues predicted to be disordered | 0–10/10–30/30–100% disorder |
| | overall score | overall disorder scores for the whole protein | minimum average disorder score depending on the predictor |
| | continuous stretches | presence or absence of continuous stretches of disordered residues | typically >30 residues |
| length of disordered regions (211) | >500 residues | proteins that contain disordered regions of different lengths are enriched for different types of functions | transcription |
| | 300–500 residues | | kinase and phosphatase functions |
| | <50 residues | | (metal) ion binding, ion channels, GTPase regulatory activity |
| position of disordered regions (211) | N-terminal | proteins that contain disordered regions at different locations in the sequence are enriched for different types of functions | DNA-binding, ion channel |
| | internal | | transcription regulator, DNA-binding |
| | C-terminal | | transcription repressor/activator, ion channel |
| tandem repeats (217,218) | Q/N | glutamine- and asparagine-rich protein regions are both important for normal cellular function and prone to cause harmful aggregation | huntingtin, Sup35p, Ure2p, Ccr4, Pop2 |
| | S/R | tandem repeats composed of arginine and serine residues are phosphorylated and disordered, and play a role in spliceosome assembly | ASF/SF2, SRp75, SRSF1 |
| | K/A/P | tandem repeats composed of lysine, alanine, and proline function in binding nucleosome linker DNA | histone H1 |
| | F/G | disordered domains with phenylalanine-glycine repeats influence NPC gating behavior | nucleoporins |
| | P/T/S | extensively glycosylated regions rich in proline, threonine, and serine residues are involved in mucus formation | mucins |
| | others | | |
| protein interactions | | | |
| fuzzy complexes by topology (242) | polymorphic | a form of static disorder, with alternative bound conformations serving distinct functions by having different effects on the binding partner | β-catenin ∼ Tcf4, NLS ∼ importin-α, actin ∼ WH2 domain |
| | clamp | complex formation through folding upon binding of two disordered protein segments, connected by a linker that remains disordered | Ste5 ∼ Fus3, myosin VI ∼ actin filament, Oct-1 ∼ DNA |
| | flanking | complex formation through folding upon binding of a central disordered protein segment, flanked by two regions that remain disordered | SF1 splicing factor ∼ U2AF, proline-rich peptides ∼ SH3 domains, p27Kip1 ∼ cyclin-Cdk2 |
| | random | disordered regions that remain highly dynamic even in the bound state | elastin self-assembly, Sic1 ∼ Cdc4 |
| fuzzy complexes by mechanism (176,251) | conformational selection | the fuzzy region facilitates the formation of the binding-competent form by shifting the conformational equilibrium | Max ∼ DNA, MeCP2 ∼ DNA |
| | flexibility modulation | the fuzzy region modulates the flexibility of the binding interface and changes binding entropy | Ets-1 ∼ DNA, SSB ∼ DNA |
| | competitive binding | the fuzzy region serves as an intramolecular competitive partner for the binding surface | HMGB1 ∼ DNA, RNase1 ∼ RNase inhibitor |
| | tethering | the fuzzy region increases the local concentration of a weak-affinity binding domain near the target, or anchors it via transient interactions | RPA ∼ DNA, UPF1 ∼ UPF2, PC4 ∼ VP16 |
| binding plasticity (257) | static | mono-/polyvalent complexes, chameleons, penetrators, huggers | for examples, see Figure 12 |
| | coiled-coil based | intertwined strings, long cylindrical containers, connectors, armature, tweezers and forceps, grabbers, tentacles, pullers, stackers | |
| | dynamic | cloud contacts and protein interaction ensembles | |
| evolution | | | |
| sequence conservation (54) | flexible | regions that require the property of disorder for functionality regardless of the exact sequence | signaling and regulatory proteins (Sky1, Bur1) |
| | constrained | regions of conserved disorder that also have highly conserved amino acid sequences | ribosomal proteins (Rpl5), protein chaperones (Hsp90) |
| | nonconserved | no conservation of the disorder, nor of the underlying sequence; no clear functional hallmarks | yeast Ty1 retrotransposon domains A and B |
| conservation of amino acid composition (260) | HR | IDRs with high residue conservation | transcription regulation and DNA binding |
| | LRHT | IDRs with low residue conservation but high conservation of the amino acid composition of the region | ATPase and nuclease activities |
| | LRLT | IDRs with neither conservation of sequence nor conservation of amino acid composition | (metal) ion binding proteins |
| lineage and species specificity (159) | prokaryotes | species from different kingdoms of life seem to use disorder for different types of functions | longer lasting interactions involved in complex formation |
| | eukaryotes and viruses | | transient interactions in signaling and regulation |
| evolutionary history and mechanism of repeat expansion (61) | Type I | repeats that showed no function diversification after expansion | titin PEVK domain, salivary proline-rich proteins |
| | Type II | repeats that acquired diverse functions through mutation or differential location within the sequence | RNA polymerase II (CTD) |
| | Type III | repeats that gained new functions as a consequence of their expansion | prion protein octarepeats |
| regulation | | | |
| expression patterns (208) | constitutive | IDPs encoded by constitutively highly expressed transcripts are almost entirely disordered and often ribosomal proteins | ribosomal L proteins |
| | high | IDP-encoding transcripts showing high expression levels in most tissues and little tissue specificity | protease inhibitors, splicing factors, complex assemblers |
| | medium | these IDP-encoding transcripts are expressed at medium levels, with some tissue-specificity | DNA binding, transcription regulation |
| | tissue-specific | IDP-encoding transcripts with highly tissue-specific expression | cell organization regulators, complex disassemblers |
| | low or transient | IDP-encoding transcripts that are present in undetectable amounts; more than one-half of analyzed IDPs | variety of functions |
| alternative splicing (304,305,309,312,313) | | regulation and evolutionary patterns of inclusion and exclusion of IDR-encoding exons can provide insights into whether the encoded IDR functions in protein regulation and interactions | a tissue-specific region with a phosphosite in the TJP1 protein in mouse, a mammalian-specific region in the PTB1 splicing regulator |
| degradation kinetics (315,316,318,320,321) | degradation accelerators | IDRs that can influence and accelerate proteasomal degradation of the protein containing it | |
| | others | IDRs that have no influence on protein half-life or increase it, e.g., because of sequence compositions that impede proteasome processivity | low complexity sequences such as glycine-alanine repeats and polyglutamine repeats |
| post-translational processing and secretion (337,340) | | secreted proteins are depleted for IDPs, but structural disorder is important in, e.g., prohormones, the extracellular matrix, and biomineralization | pre-pro-opiomelanocortin, elastic fiber proteins, SIBLINGs, mucins |
| biophysical properties | | | |
| solubility (209) | | the sequence features of IDPs are generally associated with aqueous solubility, although some IDPs are thermostable, while others are not; this is likely modulated by sequence–structural ensemble relationships, such as the degree of compaction | 4E-BP1, calpastatin, CREB, p21, p27, Sp1, stathmin, WASP |
| phase transition (137,353) | | certain IDRs (such as those that contain specific low-complexity regions or interaction motifs) can undergo phase transitions like the formation of protein-based droplets or hydrogels | multivalent SH3-binding motifs in phase separation, granule-like assemblies of RNA-binding proteins containing low-complexity IDRs, mucins |
| biomineralization (117,341) | | structural disorder is common in proteins with roles in biomineralization, such as the formation of bone and teeth | caseins, osteopontin, bone sialoprotein 2, dentin sialophosphoprotein |
Table 2
Current Methods for Function Prediction of Intrinsically Disordered Regions and Proteins
| basis for method | description | method | Web site |
| --- | --- | --- | --- |
| linear motifs | annotation of well-characterized linear motifs, which can be mapped onto other protein sequences | ELM (125) | http://elm.eu.org/ |
| | | MiniMotif (126) | http://mnm.engr.uconn.edu/ |
| | identification of putative uncharacterized motifs in protein sequences | SLiMPrints (372) | http://bioware.ucd.ie/slimprints.html |
| | | phylo-HMM (373) | http://www.moseslab.csb.utoronto.ca/phylo_HMM/ |
| | | DiliMot (374) | http://dilimot.russelllab.org/ |
| | | SLiMFinder (375) | http://bioware.ucd.ie/slimfinder.html |
| PTM sites | resources of experimentally verified PTM sites, mostly phosphorylation | Phospho.ELM (268) | http://phospho.elm.eu.org/ |
| | | PhosphoSite (376) | http://www.phosphosite.org/ |
| | | PHOSIDA (377) | http://www.phosida.com/ |
| | identification and collection of peptide motifs that direct post-translational modifications | ScanSite (380) | http://scansite.mit.edu/ |
| | | NetPhorest (381) | http://netphorest.info/ |
| | | NetworKIN (382) | http://networkin.info/ |
| | | PhosphoNET (383) | http://www.phosphonet.ca/ |
| molecular recognition features | collection of verified sequence elements that undergo coupled folding and binding | IDEAL (388) | http://www.ideal.force.cs.is.nagoya-u.ac.jp/IDEAL/ |
| | prediction of sequences that undergo disorder-to-order transitions | MoRFpred (385) | http://biomine.ece.ualberta.ca/MoRFpred/ |
| | | ANCHOR (386) | http://anchor.enzim.hu/ |
| intrinsically disordered domains | annotation of disordered protein domains, which can be detected by sequence profiles | Pfam (22) | http://pfam.sanger.ac.uk/ |
| other | prediction of gene ontology functions using protein sequence features such as intrinsic disorder | FFPred (391) | http://bioinf.cs.ucl.ac.uk/psipred/ |
| | function annotation of experimentally verified disordered protein regions | DisProt (203) | http://www.disprot.org/ |
| | predictions of disordered regions combined with information on MoRFs, PTM sites, and domains | D2P2 (49) | http://d2p2.pro/ |
2 Function
Dunker and co-workers
57
distinguished
28 separate functions for disordered regions, based on literature
analysis of 150 proteins containing disordered regions of 30 residues
or longer. These functionalities can be summarized as molecular recognition,
molecular assembly, protein modification, and entropic chains. Further
development of this scheme resulted in one comprising six different
functional classes of disordered protein regions: entropic chains,
display sites, chaperones, effectors, assemblers, and scavengers (Figure 4).
33,58
In another classification scheme,
Gsponer and Babu classified IDR function into three broad functional
categories: (i) facilitated regulation via diverse post-translational
modifications, (ii) scaffolding and recruitment of different binding
partners, and (iii) conformational variability and adaptability (Figure 5).
39
A single protein
may consist of several disordered regions that belong to different
functional classes.
59
The following sections
address and exemplify the six functional classes of disordered regions.
Figure 4
Functional classification scheme of IDRs.
The function of disordered
regions can stem directly from their highly flexible nature, when
they fulfill entropic chain functions (such as linkers and spacers,
indicated in dark-tone red), or from their ability to bind to partner
molecules (proteins, other macromolecules, or small molecules). In
the latter case, they bind either transiently as display sites of
post-translational modifications or as chaperones (indicated in green),
or they bind permanently as effectors, assemblers, or scavengers (indicated
in dark-tone blue). More extensive descriptions and examples are found
in the main text. Adapted with permission from ref (58). Copyright 2005 Elsevier.
Figure 5
Functional classification of IDRs according
to their interaction
features. (A) The flexibility of IDRs facilitates access to enzymes
that catalyze post-translational modifications and effectors that
bind these PTMs. This permits combinatorial regulation and reuse of
the same components in multiple biological processes. (B) The availability
of molecular recognition features and linear motifs within the IDRs
enables the fishing for (“fly casting”) and gathering
of different partners. (C) Conformational variability enables a nearly
perfect molding to fit the binding interfaces of very diverse interaction
partners. Context-dependent folding of an IDR can activate signaling
processes in one case or inhibit them in another, resulting in completely
different outcomes. Adapted with permission from ref (39). Copyright 2009 Elsevier.
2.1 Entropic Chains
Entropic chains carry
out functions that benefit directly from their conformational disorder;
that is, they function without ever becoming structured. Examples
of entropic chains include flexible linkers, which allow movement
of domains positioned on either end of the linker relative to each
other, and spacers that regulate the distances between domains. Evidence
that flexibility is a functional characteristic that needs to be maintained
came from studies on a family of flexible linkers in the 70 kDa subunit
of replication protein A (RPA70), which display conserved dynamic
behavior in the face of negligible sequence conservation.
60
The microtubule-associated protein 2 (MAP2)
projection domain exemplifies spacer behavior as it repels molecules
that approach microtubules, thereby providing spacing in the cytoskeleton.
Entropic springs form another subcategory of entropic chains, such
as those present in the titin protein, which contains repeat regions
rich in PEVK amino acids that generate force upon overstretching to
help restore muscle cells to their relaxed length.
61,62
2.2 Display Sites
Post-translational
modifications (PTMs) affect the stability, turnover, interaction potential,
and localization of proteins within the cell.
63
These aspects of PTMs are particularly relevant for proteins involved
in regulation and signaling, as are many IDPs.
35,37,39,64,65
The conformational flexibility of disordered protein
regions as display sites provides advantages over structured regions.
(i) Flexibility facilitates the deposition of PTMs by enabling transient
but specific interaction with catalytic sites of modifying enzymes.
47,66
This is because, upon binding, a flexible, disordered region loses
more conformational freedom (i.e., entropy) than a folded region does. This
entropic cost makes the overall free energy of binding less favorable, leading
to weaker and more transient binding than for a folded protein region that makes
equally strong contacts (i.e., has the same binding enthalpy and hence equal specificity).
28,30,37
(ii) The flexibility of IDRs
also allows for easy access and recognition of the PTMs within the
IDR by effector proteins that mediate downstream outcomes upon binding.
47,66
Indeed, experimental and computational approaches have shown that
disordered regions are enriched for sites that can be phosphorylated,
45,46,67
and suggest that IDPs are likely
to be substrates of a large number of kinases and other modifying
enzymes as they are heavily post-translationally modified.
46,68,69
Furthermore, PTM sites are often
located within short peptide motifs, modification of which influences
the affinity for interaction with diverse binding partners (see section 3.1).
70,71
In turn, disordered protein regions
are strongly enriched for these motifs,
47,72−74
underlining the importance of intrinsic disorder as PTM display
sites. Well-characterized examples of IDPs in which PTMs are key to
function and regulation include, among others, histones, p53, and
the cyclin-dependent kinase regulator p27.
75−77
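The thermodynamic argument in point (i) of this section can be made explicit with the standard relation between binding free energy, enthalpy, and entropy. The superscripts and the 1 M standard concentration are notation introduced here for illustration and are not taken from the cited studies; the comparison assumes that the two regions differ only in the entropy they lose on binding.

```latex
\begin{align}
\Delta G_{\mathrm{bind}} &= \Delta H_{\mathrm{bind}} - T\,\Delta S_{\mathrm{bind}}, \\
K_{\mathrm{d}} &= c^{\circ}\exp\!\left(\frac{\Delta G_{\mathrm{bind}}}{RT}\right),
  \qquad c^{\circ} = 1\ \mathrm{M}, \\
\Delta H^{\mathrm{IDR}}_{\mathrm{bind}} = \Delta H^{\mathrm{folded}}_{\mathrm{bind}}
  \ \text{and}\
  \Delta S^{\mathrm{IDR}}_{\mathrm{bind}} < \Delta S^{\mathrm{folded}}_{\mathrm{bind}}
  &\;\Longrightarrow\;
  \Delta G^{\mathrm{IDR}}_{\mathrm{bind}} > \Delta G^{\mathrm{folded}}_{\mathrm{bind}}
  \;\Longrightarrow\;
  K^{\mathrm{IDR}}_{\mathrm{d}} > K^{\mathrm{folded}}_{\mathrm{d}}.
\end{align}
```

With the same enthalpy of binding, the larger entropy loss of the disordered region gives a less negative free energy and therefore a larger dissociation constant, which corresponds to the weaker and more transient binding described above.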
2.3
Chaperones
Chaperones are proteins
that assist RNA and protein molecules to reach their functionally
folded states.
78,79
Disordered regions make up over
one-half of the sequences of RNA chaperones and over one-third of
the sequences of protein chaperones.
80,81
The versatility
of disordered segments seems well suited for chaperone function, although
mechanistic evidence is still scarce.
82
First, their capacity to structurally adapt to many different binding
partners matches the need for chaperones to bind a wide range of proteins.
Second, disordered segments enable fast macromolecular interactions.
This is because the highly dynamic nature of IDRs prolongs the lifetime
of the encounter complex of the binding event due to rapid sampling
of many different conformations, thereby increasing the number of
nonspecific interactions as compared to an encounter of a structured
protein. In turn, this results in a higher probability to sample the
specific conformation that results in the stable interaction complex
and increases the association rate of the interaction.
83,84
The quick binding of misfolded proteins by disordered chaperones
could, for example, prevent the formation of toxic aggregates by providing
a solubilizing effect (see section 9.1). Finally,
the binding thermodynamics of disordered regions are well suited for
the cycles of repeated chaperone binding and release that enable substrate
folding. It has been proposed that transient binding of disordered
chaperone regions to misfolded substrates induces local folding of
the disordered chaperone, and promotes unfolding of the substrate,
thereby providing the substrate with a chance to refold correctly.
80
This reversible exchange of entropy represents
a distinct type of chaperone function that relies on disordered regions
and does not require ATP. Loss of flexibility of disordered regions
upon substrate binding has been demonstrated for the chaperones GroEL
85
and α-crystallin.
86,87
This mechanism can even be switched on and off at need by regulated
transitions between folded and disordered states,
88
as reported in the case of the redox-regulated chaperone
Hsp33.
89,90
2.4
Effectors
Another
functional class
of disordered regions is that of the effectors, which interact with
other proteins and modify their activity. Upon binding their interaction
partners, IDRs often undergo a disorder-to-order transition, also
known as coupled folding and binding.
91,92
Examples of
two effectors that fold upon binding are p21 and p27, which regulate
different cyclin-dependent kinases (Cdk) that are responsible for
the control of cell-cycle progression in mammals.
66
p21 and p27 exhibit functional diversity by achieving opposite
effects on different Cdk–cyclin complexes, promoting the assembly
and catalytic activity of some (e.g., Cdk4 paired with D-type cyclins),
and inhibiting others (e.g., Cdk2 paired with A- and E-type cyclins).
66
Another effector IDP is calpastatin, which undergoes
significant folding upon binding calpain, thereby achieving specific
and reversible inhibition.
93
IDRs
can also affect the activity of other parts within the same protein,
either through competitive interactions or through allosteric modulation.
The intrinsically disordered GTPase-binding domain (GBD) of the Wiskott–Aldrich
syndrome protein (WASP) illustrates competitive binding that controls
autoinhibition.
94
Binding of the GBD to
the Cdc42 protein promotes the interaction of WASP with the actin
cytoskeleton regulatory machinery. However, the GBD adopts a different
structure when it folds back on other parts of WASP to inhibit actin
interaction. Indeed, autoinhibitory regions are generally enriched
for intrinsic disorder and often have different structures in the
inhibitory and functionally active states of the protein.
95
A striking example of allosteric coupling in
a disordered protein was revealed between different binding sites
in the adenovirus E1A oncoprotein.
96
Complexes
of E1A with the TAZ2 domain of CREB-binding protein (CBP) and the
retinoblastoma protein (pRb) can have either positive or negative
cooperativity, depending on the available E1A interaction sites (i.e.,
binding of one partner, pRb or CBP, to E1A can either increase or decrease
the probability that the other one will also bind). These findings
support earlier studies that suggest allosteric coupling does not
always require a well-defined structural route to propagate through
the protein, but can also be determined by the stabilities of individual
conformations of the protein that change upon binding their interaction
partners.
97−99
Such a mechanism could be one explanation for how
the availability of different binding partners regulates the outcomes
of multiple binding events involving disordered proteins in a cellular
context.
96
2.5
Assemblers
Disordered assemblers
bring together multiple binding partners to promote the formation
of higher-order protein complexes,
100,101
such as the
ribosome (many ribosomal proteins are disordered
102
), activated T-cell receptor complexes,
58
the RIP1/RIP3 necrosome,
103
and
the transcription preinitiation complex.
104
The presence of different functional regions within the disordered
segments, such as molecular recognition features (MoRFs) and short
linear peptide motifs (SLiMs), enables binding and can bring together
different partners (see sections 3.1 and 3.2). Indeed, larger complexes are assembled
from
proteins that tend to be more disordered,
105
and intrinsic disorder is a common feature of hubs in protein interaction
networks.
106,107
The open structure of disordered
assemblers is largely preserved upon scaffolding their partner proteins,
resulting in a large binding interface that enables multiple proteins
to be bound by a single IDR.
108,109
Furthermore, disordered
regions largely avoid the steric hindrance that prevents the formation
of comparably large complexes from structured proteins.
Assembler
function can be imagined in two ways. (i) The first is structural
mortar, which helps to bring together proteins by stabilizing the
complexes they form. A well-studied example of this behavior is the
assembly of the ribosome, which relies on a sequence of cooperative
binding steps of protein and RNA.
110
Although
the initial stages of rRNA folding are probably driven by the RNA
itself,
111
ribosomal proteins subsequently
fold upon binding the rRNAs,
112,113
which induces structural
changes in both the RNA and the protein, and guides the complex toward
its native state.
110
(ii) The second is
scaffolds that serve as backbones for the spatiotemporally regulated
assembly of different signaling partners. An example of this mechanism
is the Axin scaffold protein, which colocalizes β-catenin, casein
kinase Iα, and glycogen synthase kinase 3β by their
binding to Axin’s long intrinsically disordered region, thereby
effectively yielding a complex of structured domains with flexible
linkers.
114
The assembly of all four proteins
accelerates interactions between them by raising their local concentrations
and leads to the efficient phosphorylation and subsequent destruction
of β-catenin. Scaffolding regions have one of the highest degrees
of disorder of all functional categories.
109,115
2.6
Scavengers
The final distinct functional
class of IDRs and IDPs comprises scavengers, which store and neutralize
small ligands. Chromogranin A, one of the earliest examples
of an IDP, functions as a scavenger by storing ATP and adrenaline
in the medulla of the adrenal gland.
116
NMR studies showed that chromogranin is a random coil in both the
isolated form and in its cellular environment in the intact adrenal
gland.
116
Caseins and other secretory calcium-binding
phosphoproteins (SCPPs) are highly disordered proteins that solubilize
clusters of calcium phosphate in milk and other biofluids (see section 9.3).
117
Finally, salivary
proline-rich glycoproteins are scavenger IDPs that bind tannin molecules
in the digestive tract.
33
3
Functional Features
Different types of functional regions
in intrinsically disordered
proteins have been uncovered by investigations aimed both directly
at increasing the understanding of IDRs and indirectly by linking
previously studied functionality of proteins to disordered regions.
First, the majority of linear motifs (such as the SH2 domain interaction
motif) have been found to be enriched in IDRs.
48,72,118
Second, the development of disorder prediction
methods (Box 3) has led to the identification of segments that promote disorder-to-order
transitions called molecular recognition features (MoRFs),
119−123
which have been verified using known crystal structures. Third,
some interaction domains identified using crystallography, by sequence
analysis, and by other techniques, turn out to be intrinsically disordered
in solution (e.g., the BH3 domain
124
).
The following section discusses these three interaction features separately
and points out the underlying connections between them.
3.1
Linear Motifs
A common functional
module within IDRs is the linear motif,
47,48,72
also known as LMs, short linear motifs (SLiMs),
125
or MiniMotifs.
126
By regulating low-affinity interactions, these short sequence motifs
(annotated instances are usually 3–10 amino acids long
48
) can target proteins to a particular subcellular
location, recruit enzymes that alter the chemical state of the motif
by post-translational modifications (PTMs), control the stability
of a protein, and promote recruitment of binding factors to facilitate
complex formation.
47,48
Linear motifs, helped by the
flexible nature of the disordered regions that surround them,
71
primarily bind onto the surfaces of globular
domains,
127,128
and their compact binding surface
allows them to occur multiple times within one protein.
47,48
Moreover, the short nature of many linear motifs means they have
a high propensity to convergently evolve and emerge in unrelated proteins.
47,48
A consequence of these properties is that pathogenic viruses and
bacteria have evolved to mimic these linear motifs, allowing them
to manipulate regulation of cellular processes.
129,130
Linear motifs can be broadly divided into two major families:
those that act as modification sites and those that act as ligands,
with each having numerous subgroups (Figure 6).
131
The first major family, the enzyme
binding or modification motifs, can be divided into three groups.
(i) The first is post-translational processing events or proteolytic
cleavage. A well-known example is the motif recognized by Caspase-3
and -7, which has an [ED]xxD[AGS] consensus sequence. Caspases are
a family of proteases that promote apoptosis and inflammation by cleaving
such motifs in their substrate proteins.
132
Hundreds of proteins have convergently evolved the Caspase-3/-7
motif, and thereby have come under the regulation of the apoptotic
pathway.
133
(ii) The second is PTM moiety
removal and addition. Many enzymes that catalyze post-translational
modifications recognize a specific binding sequence on the substrate.
For example, the cyclin-dependent kinase recognition motif [ST]Px[KR]
is present in many mitotic proteins, and its phosphorylation is key
for regulating cell cycle progression.
134
(iii) The third is structural modifications. This group of motifs
is involved in the catalyzed conformational alteration of a peptide
backbone. The classic example is the peptidylprolyl cis–trans
isomerase (PPIase) Pin1, which binds [ST]P motifs in a phosphorylation-dependent
manner to catalyze the cis–trans isomerization of
the proline peptide bond. This modification can regulate the recognition
of phosphorylated [ST]P sites by phosphatases.
135
Figure 6
Functional classification of linear motifs. Linear motifs can be
divided into two major families, which each have three further subgroups.
The modification class motifs all act as recognition sites for enzyme
active sites, whereas the ligand class motifs are always recognized
by the binding surface of a protein partner. More detailed classification
beyond the graph shown here is possible. For example, an important
subgroup of docking motifs are the degrons, which regulate protein
stability by recruiting members of the ubiquitin–proteasome
system. In the regular expressions, x corresponds to any amino acid,
while other letters represent single letter codes of amino acids;
letters within square brackets mean that any one of the listed residues is allowed in that
position.
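The motif notation described above maps directly onto ordinary regular expressions, so scanning a sequence for candidate motif instances can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure used by any motif database: the helper names `motif_to_regex` and `find_motif` and the toy sequence are hypothetical, and a sequence match alone does not show that a site is functional.

```python
import re
from typing import List, Tuple

def motif_to_regex(motif: str) -> str:
    """Translate the motif notation (lowercase x = any residue) into a Python regex."""
    # Uppercase one-letter codes and [..] groups are already valid regex syntax.
    return motif.replace("x", ".")

def find_motif(motif: str, sequence: str) -> List[Tuple[int, str]]:
    """Return (1-based start, matched peptide) for every instance, overlapping ones included."""
    # The lookahead lets the scan report matches that overlap each other.
    pattern = re.compile(f"(?=({motif_to_regex(motif)}))")
    return [(m.start() + 1, m.group(1)) for m in pattern.finditer(sequence)]

# Hypothetical toy sequence, not a real protein.
toy = "MSTPEKAADEVDGLLSPARQ"

print(find_motif("[ED]xxD[AGS]", toy))  # [(9, 'DEVDG')]
print(find_motif("[ST]Px[KR]", toy))    # [(3, 'TPEK'), (16, 'SPAR')]
```

Because such short patterns match by chance in many sequences, hits of this kind are best treated as candidates to be filtered by context, for example by whether they fall in a disordered region.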
The second major family of motifs
comprises ligand motifs, which
can also be divided into three main groups (Figure 6). (i) Complex promoting motifs
are the most well-known class
of motifs and include the phosphorylated tyrosine motif recognized
by SH2 (Src homology 2) domains, the C-terminal motifs that bind PDZ
domains, and the proline-rich PxxP motifs that interact with SH3 (Src
homology 3) domains.
136
These motifs often
function in protein scaffolding, and their multivalency (tendency
to occur multiple times in one sequence) can increase the avidity
of interactions and promote phase transition (see section 9.2).
137
(ii) Docking
motifs increase the specificity and efficiency of modification events
(e.g., addition or removal of PTMs, see above) by providing additional
binding surface. These docking motifs are distinct from the modification
sites, but are usually in the same protein. Examples are the KEN box
and D box degrons, which act as recognition surfaces for ubiquitin
ligases that ubiquitinate the protein on a different position, leading
to degradation of the protein by the 26S proteasome.
138,139
The KEN box motif occurs in several key mitotic kinases to ensure
their degradation or deactivation at mitotic exit.
139
In some cases, the docking site is present in a protein
different from that which contains the modification site, as exemplified
by the F box motif. Another part of F box proteins recognizes post-translationally
modified degradation motifs of substrates, while the F box itself
docks onto the Skp1 component of SCF (Skp, Cullin, F box) E3 ligase complexes.
140
(iii) Targeting motifs can localize proteins
toward subcellular organelles. For example, importin proteins involved
in nuclear transport recognize the nuclear localization signal (NLS),
usually a motif containing a short cluster of lysines and arginines,
and translocate NLS-containing proteins into the nucleus.
141
Targeting motifs can also act to traffic proteins,
as in the case of endocytic motifs. These are recognized by adaptor
proteins at different stages of endocytosis to ensure that cargo proteins
are packaged into vesicles and trafficked to the right location.
142,143
An important feature of linear motifs is their propensity
to act
as molecular switches. This is for two major reasons. (i) Linear motif-mediated
interactions are generally low affinity due to the limited binding
surface. This means that large, bulky post-translational modifications
have a big impact on their binding properties.
71
(ii) Their small footprint (i.e., size) allows motifs to
occur multiple times in the same protein, thereby promoting high avidity
interactions and the recruitment of multiple factors (e.g., the LAT
complex in T-cell receptor signaling
144
).
99
This also means two different motifs
can overlap, resulting in mutually exclusive binding of interaction
partners.
73
The ability of a motif to rapidly
switch between binding partners and create multivalent complexes is
crucial for the creation of dynamic signaling networks.
71
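The two switch-related properties discussed above, repeated occurrence of a motif and overlap between different motifs, can both be read off from motif positions in a sequence. The sketch below is illustrative only: the function names and the toy sequence are hypothetical, the PxxP and [ST]Px[KR] patterns are written directly as regular expressions with `.` for any residue, and overlap in sequence is used here as a simple proxy for potentially mutually exclusive binding.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int]  # half-open (start, end), 0-based

def motif_spans(pattern: str, sequence: str) -> List[Span]:
    """Spans of every instance of a regex motif, overlapping instances included."""
    lookahead = re.compile(f"(?=({pattern}))")
    return [(m.start(1), m.end(1)) for m in lookahead.finditer(sequence)]

def overlapping_pairs(spans_a: List[Span], spans_b: List[Span]) -> List[Tuple[Span, Span]]:
    """Pairs of instances, one from each motif, that share at least one residue."""
    return [(a, b) for a in spans_a for b in spans_b if a[0] < b[1] and b[0] < a[1]]

# Hypothetical toy sequence, not a real protein.
toy = "MAPSLPSPTKPAAGPQQPLE"

pxxp = motif_spans("P..P", toy)       # SH3-type ligand motif
cdk = motif_spans("[ST]P.[KR]", toy)  # Cdk-type modification motif

print(len(pxxp))                     # 3 copies of the motif in one sequence
print(overlapping_pairs(pxxp, cdk))  # [((7, 11), (6, 10))]
```

In this toy case the second PxxP instance shares residues with the single [ST]Px[KR] site, so a partner bound to one would be expected to occlude the other.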
3.2
Molecular Recognition Features
Disordered
segments can also contain another type of peptide motif (10–70
amino acids) that promotes specific protein–protein interactions.
These functional elements are called preformed structural elements
(PSEs),
119
molecular recognition features
(MoRFs) or elements (MoREs),
120−122
or prestructured motifs (PreSMos).
123
Importantly, MoRFs undergo disorder-to-order
transitions upon binding their interaction partners (i.e., folding
upon binding),
38,121,123
and often the unbound form of these preformed elements is biased
toward the conformation that they adopt in the complex.
119
Preformed structural elements and MoRFs may
serve as initial contact points for interaction events, which have
different kinetic and thermodynamic properties than interactions between
structured protein regions as discussed before. Binding of preformed
elements is one version of conformational selection (see section 6), suggested long
ago for interactions with flexible
ligands.
145
At the other extreme is induced
folding, in which structure formation and binding occur concomitantly
after the formation of the initial encounter complex. Given the complexity
of many complexes involving intrinsically disordered regions, interactions
involving both conformational selection of preformed elements and
induced folding likely occur.
92,146
MoRFs occurring
in the Protein Data Bank
147
can be classified
into subtypes according to the structures they adopt in the bound
state: α-MoRFs, β-MoRFs, and ι-MoRFs (Figure 7A–C),
121
which
form α-helices, β-strands, and irregular (but rigid) secondary
structure when bound, respectively. MoRFs that contain combinations
of different types of secondary structure are called complex (Figure 7D).
121
The p53 protein
contains multiple MoRFs that are disordered in the absence of their
interactors (Figure 7E).
120,121
The first p53 MoRF is located near the N-terminus and undergoes
a transition from a disordered to an α-helical state upon interaction
with the Mdm2 protein. In fact, this region of p53 exemplifies the
high potential of IDRs for multiple partner binding as it is known
to bind more than 40 different partners. However, for most of these
complexes, the 3D structures are not determined, and therefore the
MoRF type is not always known. The region between p53 residues 40
and 60 features an α-MoRF that functions as a secondary binding
site for Mdm2 as well as a primary binding site for RPA70.
148
In the absence of any binding partner, this
region shows evidence of minimal helical secondary structure,
149
whereas when bound to either Mdm2
150
or RPA70,
151
a stronger
helical structure is observed. The C-terminal region of p53 also contains
a MoRF that interacts with multiple partners, giving rise to different
bound structures. For example, the S100B(ββ) protein induces
a helical structure, while interaction with the Cdk2–cyclin
A complex leads to an irregular ι-MoRF. An example of the role
of MoRFs in scaffolding proteins is RNase E, which assembles the RNA
degradosome.
152
The flexible C-terminal
end of RNase E contains several recognition motifs that are central
to its scaffolding function and serve as binding sites for other members
of the degradosome.
153
For example, an
α-MoRF interacts with enolase,
154
and a β-MoRF binds polynucleotide phosphorylase.
155
The recognition features are connected by disordered
segments that accommodate assembly of the multiprotein complex by
providing the required space and flexibility. Lee and co-workers
123
have annotated the secondary structure propensities
of many other regions that display transient structural elements and
undergo disorder-to-order transitions, all of which have been experimentally
confirmed by NMR spectroscopy.
Figure 7
Classification of molecular recognition
features (MoRFs) based
on the secondary structure of the bound state. MoRFs (red ribbons)
undergo disorder-to-order transition upon binding their partners (blue
surfaces). (A) α-MoRF. BH3 domain of BAD (MoRF) bound to Bcl-xL
(partner) (PDB ID: 1G5J). (B) β-MoRF. Inhibitor of apoptosis protein DIAP1 (partner)
bound to N-terminus of cell death protein GRIM (MoRF) (PDB ID: 1JD5). (C) ι-MoRF.
AP-2 (partner) bound to the recognition motif of amphiphysin (MoRF)
(PDB ID: 1KY7). (D) Complex-MoRF. Phosphotyrosine-binding domain (PTB) of the
X11 protein (partner) bound to amyloid β A4 protein (MoRF) (PDB
ID: 1X11). Note
that the PTB domain of X11 actually binds unphosphorylated peptides
and is a PTB by sequence similarity. Panels A–D reprinted with
permission from ref (122). Copyright 2007 American Chemical Society. (E) Promiscuity
of disorder-controlled
interactions illustrated by the p53 interaction network. A structure
versus disorder prediction on the p53 amino acid sequence is shown
in the center of the figure (up = disorder, down = order) along with
the structures of various regions of p53 bound to 14 different partners.
The predictions for a central region of structure, and the disordered
amino and carboxyl termini have been confirmed experimentally for
p53. The various regions of p53 are color coded to show their structures
in the complex and to map the binding segments to the amino acid sequence.
Starting with the p53–DNA complex (top, left, magenta protein,
blue DNA), and moving in a clockwise direction, the Protein Data Bank
147
IDs and partner names are given as follows
for the 14 complexes: (1tsr – DNA), (1gzh – 53BP1), (1q2d – gcn5),
(3sak –
p53 (tetramerization domain)), (1xqh – set9), (1h26 – cyclin
A), (1ma3 –
sirtuin), (1jsp – CBP bromo domain), (1dt7 – s100ββ), (2h1l – sv40 Large
T antigen), (1ycs – 53BP2), (2gs0 – PH), (1ycr – MDM2), and (2b3g – RPA70). Reprinted
with permission from ref (40). Copyright 2010 Elsevier.
Sequence context can play an active
role in modulating the degree
of structural preorganization of a MoRF. An example pertains to the
study of DNA binding motifs in the basic regions (bRs) of basic region
leucine zipper transcription factors.
156
The bRs are 28–30 residue long regions predicted to be highly
disordered and include a strongly conserved 10-residue DNA binding
motif (DBM). The α-helicity (i.e., preference for α-helical
conformation) of the DBM in the unbound form is modulated by the sequence
of the N-terminal segment that is directly in cis to the DBM.
156
For example, the N-terminal sequence contexts
of Gcn4 and Cys3 DBMs contribute to a higher level of helicity of
the DBM than the same region in c-Fos and Fra1 (whose DBMs have a
low helicity). Essentially, the N-terminal sequence contexts are helix
caps, and these can be used in different ways to ensure different
levels of structural preorganization within an α-MoRF, thereby
suggesting that investigating sequence contexts can provide useful
clues when classifying MoRFs and linear motifs.
157
3.3
Intrinsically Disordered
Domains
Most protein domains that are identified using sequence-based
approaches
are structured, but some can be fully or largely disordered
158
or contain conserved disordered regions,
159
known as intrinsically disordered domains (IDDs).
For instance, about 14% of Pfam domains have more than 50% of their
residues in predicted disordered regions. Many well-known domains,
such as the kinase-inhibitory domain (KID) of Cdk inhibitors (e.g.,
p27
66
) and the Wiskott–Aldrich syndrome
protein (WASP)-homology domain 2 (WH2) of actin-binding proteins,
158
have been shown experimentally to be fully
disordered in isolation and solution. Protein domains with conserved
disordered regions have a variety of functions, but are most commonly
involved in DNA, RNA, and protein binding.
159
Furthermore, domains that were gained during evolution by the extension
of existing exons contain the highest degree of disordered regions.
160
This suggests that exonization of previously
noncoding regions could be an important mechanism for the addition
of disordered segments to proteins.
Interestingly, it has also
been observed that particular disordered regions frequently co-occur
in the same sequence with specific protein domains.
161,162
Some domain families appear only to require the presence of disorder
in their neighborhood for functioning, while others seem to rely on
the occurrence of disordered regions in specific locations relative
to the start or end of the protein domain.
161
For example, particular combinations of domains, involved mainly
in regulatory, binding, receptor, and ion-channel roles, only occur
with a disordered region inserted between them, while others only
occur without a disordered region between them. These observations
imply that short disordered regions in the vicinity of protein domains
complement the function of a structured domain, and in some cases
may comprise separate functional modules in their own right. Thus,
the co-occurrence of IDRs and structured domains in the same protein
might be useful to gain insight into unannotated disordered regions.
3.4
Continuum of Functional Features
A measure
that is often used to distinguish the different types of
disordered binding modules is length; however, this is likely to stem
primarily from the different methodology used for their detection.
Protein domain detection relies on hidden Markov models,
22
which are not the best approach for identifying
short sequences, and therefore domain annotation tends to focus on
larger sequence regions. In contrast, linear motifs in the ELM database
are biased toward short binding modules (∼3–10 amino
acids
48,125
) as these are more straightforward to annotate.
Finally, the tendency of MoRFs and preformed elements to undergo disorder-to-order
transitions and the statistics used for their detection means that
these features tend to be slightly longer than annotated linear motifs.
Thus, although there are differences in the definitions of linear
motifs and MoRFs, they share many common features
72,163
including a tendency to undergo disorder-to-order transition (all
MoRFs by definition and ∼60% of LMs
48
), an enrichment in IDRs (MoRFs by definition and ∼80% of
LMs are in IDRs
48,72
), and a tendency to promote complex
formation.
48,100,122
Intrinsically disordered domains (IDDs) can also have significant
overlap with MoRFs and linear motifs. For example, the WH2 domain
is considered an IDD
158
and is also defined
as a motif in the ELM database.
125
One
feature that is probably more common in IDDs is that some are not
only capable of binding to well-folded, structured domains (a mechanism
shared with motifs and MoRFs), but can also bind each other in a process
of mutually induced folding. For example, the nuclear coactivator
binding domain (NCBD) of CREB-binding protein (CBP) and the activator
for thyroid hormone and retinoid receptors (ACTR) domain of p160 are
both disordered on their own but upon interaction form a complex by
mutual synergistic folding.
164
The overlap
between linear motifs and MoRFs especially, but also IDDs, suggests
that these functional features are different states in the same continuum
of binding mechanisms involving disordered regions.
4
Structure
Intrinsically disordered regions and proteins
show a wide variety
of structural subtypes. These different types of disorder can be characterized
using an array of experimental techniques (Box 2), and several resources collect computationally
identified
and experimentally verified disordered regions (Box 1). The following section discusses
classification schemes
that are based on structural features of disordered proteins.
4.1
Structural Continuum
Proteins have
been proposed to function within a conformational continuum, ranging
from fully structured to completely disordered.
37
The spectrum covers tightly folded domains that display
either no disorder or only local disorder in loops or tails, multidomain
proteins linked by disordered regions, compact molten globules containing
extensive secondary structure, collapsed globules formed by polar
sequence tracts, unfolded states that transiently populate local elements
of secondary structure, and highly extended states that resemble statistical
coils (Figure 8). In this model, there are
no boundaries between the described states and native proteins could
appear anywhere within the continuous landscape. IDRs are highly dynamic
and fluctuate rapidly over an ensemble of heterogeneous conformations
(see section 4.2).
165
Thus, an IDR may fluctuate stochastically between several different
states, transiently sampling coil-like states, localized secondary
structure, and more compact globular states. Transient localized elements
of secondary structure (most often helices) are common in amphipathic
regions of the sequence and potentially play a role in binding processes.
92
The structural characteristics and populations
of the individual states in the conformational ensemble and the degree
of compaction of the polypeptide chain are determined by the nature
of the amino acids and their distribution in the IDR sequence (see
section 5.1).
166−168
For example, low and
high average charges typically lead to disordered globules and swollen
coils, respectively.
166,167
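The average charge referred to above can be estimated from sequence alone. The following sketch computes a net charge per residue under simplifying assumptions that are made here and not taken from the cited work: Lys and Arg count as +1, Asp and Glu as −1, and histidine, the termini, and any post-translational modifications are ignored. The function name and the toy sequences are hypothetical, and no threshold separating globules from coils is implied.

```python
def net_charge_per_residue(sequence: str) -> float:
    """Signed net charge divided by sequence length (K, R = +1; D, E = -1)."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    positive = sum(sequence.count(residue) for residue in "KR")
    negative = sum(sequence.count(residue) for residue in "DE")
    return (positive - negative) / len(sequence)

# Hypothetical toy sequences, not real proteins.
polar_tract = "GSQNGSQNGSQN"    # uncharged polar residues
charged_tract = "EEDEGSEEDESG"  # strongly acidic

print(net_charge_per_residue(polar_tract))         # 0.0
print(abs(net_charge_per_residue(charged_tract)))  # magnitude of the average charge
```

Only the magnitude matters for the compaction argument, since a strongly acidic and a strongly basic tract are both expected to expand through like-charge repulsion.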
Figure 8
Schematic representation of the continuum
model of protein structure.
The color gradient represents a continuum of conformational states
ranging from highly dynamic, expanded conformational ensembles (red)
to compact, dynamically restricted, fully folded globular states (blue).
Dynamically disordered states are represented by heavy lines, stably
folded structures as cartoons. A characteristic of IDPs is that they
rapidly interconvert between multiple states in the dynamic conformational
ensemble. In the continuum model, the proteome would populate the
entire spectrum of dynamics, disorder, and folded structure depicted.
4.2
Conformational
Ensembles
Disordered
regions in the native unbound state exist as dynamic ensembles of
rapidly interconverting conformations,
165,169,170
which can be described by relatively flat energy
landscapes.
99,171,172
Conditions, post-translational modifications, and binding events
(see section 6) change the relative free energies
of individual conformations as well as the energy differences between
conformations.
99,173−176
As a result, the populations of individual conformations within
the ensemble change under different conditions. These individual states
are often important for function. Thus, the dynamic nature of IDPs
is best modeled by statistical approaches that describe the probabilities
of individual conformations in the ensemble,
172,177,178
and is best measured by experimental
techniques that prevent conformational averaging (Box 2).
179−182
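The link between free energies and populations described above can be illustrated with Boltzmann weights. The following is a minimal sketch, not a method taken from the cited studies: the three-state ensemble, its free energies, and the size of the stabilizing shift are hypothetical values chosen only for illustration.

```python
import math

R_KCAL = 0.0019872  # gas constant in kcal/(mol*K)


def populations(free_energies, temperature=298.0):
    """Return Boltzmann populations for conformations whose free
    energies are given in kcal/mol."""
    rt = R_KCAL * temperature
    g_min = min(free_energies)  # shift energies for numerical stability
    weights = [math.exp(-(g - g_min) / rt) for g in free_energies]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical, nearly flat landscape: three states within 0.4 kcal/mol.
unbound = populations([0.0, 0.2, 0.4])

# Hypothetical modification that stabilizes the third state by 1.5 kcal/mol.
modified = populations([0.0, 0.2, -1.1])

print([round(p, 2) for p in unbound])
print([round(p, 2) for p in modified])
```

On a flat landscape all states are appreciably populated, so a modest change in the free energy of one state is enough to make it the dominant member of the ensemble.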
4.3
Protein Quartet
The protein quartet
model proposes that protein function can arise from four types of
conformational states and the transitions between them: random coil,
pre-molten globule, molten globule, and folded (Figure 9).
32,34
In this model, unbound disordered
regions could fall into all categories except for “folded”.
Proteins in the pre-molten globule state are less compact than molten
globules, but still show some residual secondary structure. In contrast,
proteins in the random coil state show little or no secondary structure.
The pre-molten globule state has a high propensity to participate
in folding upon binding events,
183
which
would make this structural state suitable for disordered regions acting
as effectors and scaffolds. On the basis of the notion that IDPs and
IDRs possess great structural and sequence heterogeneity, proteins
may also be considered as modular assemblies of foldons (independently
foldable regions), inducible foldons (foldable regions that can gain
structure as a result of interaction with specific partners), semifoldons
(regions that are always partially folded), and nonfoldons (regions
that never fold).
184
The four distinct
conformational states of the quartet model are a subset of the continuous
spectrum of differently disordered states (see section 4.1),
37
which extends
from fully ordered to completely structure-less proteins, with everything
in between. A single description of structure (such as the quartet
states) may be suitable for the conformational average of a protein,
while a structural continuum is a better description of an ensemble
of different conformations (see section 4.2).
Figure 9
The protein quartet model of protein conformational states. In
accordance with this model, protein function arises from four types
of conformations of the polypeptide chain (ordered forms, molten globules,
pre-molten globules, and random coils) and transitions between any
of these states.
FG nucleoporins are an
example of the functional significance that
different disordered conformations can have. These nucleoporins make up the
central part of nuclear pore complexes (NPCs) and regulate nucleocytoplasmic
transport.
185
Intrinsically disordered
regions with multiple phenylalanine-glycine (FG) motifs make up large
parts of the NPC gates. FG regions adopt various disordered conformations
with specific functions.
186
Some regions
have the low charge characteristics of collapsed coils, while others
are characterized by a high content of charged amino acids, giving
rise to relaxed and extended coil structures. Molecular dynamics simulations
have shown that extended coils are more dynamic than collapsed coils,
suggesting distinct functionalities for the two structural groups.
Interestingly, some FG nucleoporins feature both types of disorder
along their polypeptide chain. Combinations of disorder subtypes in
nucleoporin domains are likely to contribute to NPC gating behavior
by creating “traffic” zones with distinct physicochemical
properties that influence the dynamics of substrate translocation
through the nuclear envelope.
186−189
4.4
Supertertiary Structure
IDRs allow
for complex regulatory phenomena, as witnessed in the case of multidomain
proteins in signaling and regulation.
39,66,70,71,136,190
Because of the presence of structural
disorder, functional domains, and short motifs, multidomain proteins
are characterized by a dynamic ensemble of tertiary conformations.
Some conformations are dominated by intramolecular domain–domain
and domain–motif interactions and are closed and structured
in nature, while other conformations are more open and disordered.
This state of conformational variability within a protein lies between
the tertiary structure of proteins and the quaternary structure of
multiprotein assemblies, and has been termed supertertiary structure.
191
Complex regulatory function stems from transitions
in the ensemble of these structures, as demonstrated by several well-characterized
proteins, such as the Wiskott–Aldrich syndrome protein (WASP,
see section 2.4),
94
the Src-family tyrosine kinase Hck,
192
and the E3 ubiquitin ligase Smurf2.
193
5
Sequence
The sequences of IDPs and
IDRs have distinct compositional biases.
They are enriched in charged and polar amino acids and depleted in
bulky hydrophobic groups.
31,44,194,195
These biases have led to the
inference that disorder is a natural consequence of weakening the
hydrophobic effects that drive folding of polypeptides into compact
tertiary structures. Although disordered regions generally lack the
ability to fold independently due to these biases in amino acid composition,
distinct subsets of sequences that have different structural and functional
characteristics can be identified within IDRs. The special sequence
properties of disordered regions are the basis for many disorder prediction
methods (Box 3). The following sections cover
sequence-based classification schemes of IDRs.
5.1
Sequence–Structural Ensemble Relationships
Systematic efforts combining experiments and computations have
addressed the relationship between information encoded in amino acid
sequences and the ensemble of conformations (see section 4.2) these sequences can
sample in different conditions.
These studies have focused on three major archetype sequences: polar
tracts, polyelectrolytes, and polyampholytes.
196
Polar tracts are sequence stretches enriched in polar amino
acids such as glutamine, asparagine, serine, glycine, and proline,
and deficient in charged as well as hydrophobic residues. These polar
tracts (especially glutamine, asparagine, and glycine-rich sequences)
form globules that are generally devoid of significant secondary structure
preferences
170,197−199
and can be as compact as well-folded domains.
196
Collapse of polar tracts arises from the preference for
self-solvation over solvation by the aqueous milieu. In this case,
disorder derives from a lack of specificity for a single compact conformation;
instead, heterogeneous ensembles of conformations with similar stabilities
and compactness are formed. The free energy landscape of polar tracts
is weakly funneled and resembles an “egg carton”.
200
Interestingly, the drive to collapse, which
implies a drive to minimize the interface between the IDR and the
surrounding solvent, can also give rise to significant aggregation
and solubility problems,
201
as is the case
with several glutamine, asparagine, and glycine-rich sequences that
are implicated in amyloid formation and phase separation.
202
At the other end of the compositional spectrum
are polyelectrolytes. Their amino acid compositions are biased toward
charged residues of one type such as the arginine-rich protamines
166
or the Glu/Asp-rich prothymosin α.
167
Experiments and simulations have shown that
the tendency of polypeptide backbones to form ensembles of collapsed
structures can be reversed by increasing the net charge per residue
past a certain threshold (Figure 10A). The
transition between globules and expanded coils is sharp, suggesting
that small changes to the net charge per residue through post-translational
modifications such as serine or threonine phosphorylation or lysine
acetylation could cause reversible globule-to-coil transitions. These
transitions might control the accessibility of SLiMs and MoRFs or
even modulate the conformations of these elements.
Figure 10
Original
166
and modified
204
diagram-of-states
to classify predicted conformational
properties of IDPs (and IDRs modeled as IDPs). (A) The original diagram
predicts that sequences with a net charge per residue above 0.25 will
be swollen coils. The three axes denote the fraction of positively
charged residues, f+, the fraction of negatively charged residues, f–,
and the hydropathy. All three parameters are calculated from the
amino acid composition. Green dots correspond to 364 curated disordered
sequences extracted from the DisProt database.
203
These sequences have hydropathy values that designate them
as being disordered; that is, they lie in the bottom portion of the
pyramid by definition. Additional filters were used for chain length
(more than 30 residues) and the fraction of proline residues (fpro < 0.3).
About 95% of the sequences used in this
annotation have a net charge per residue of less than 0.26 and are
thus predicted to be globule formers.
204
Adapted from ref (166). Copyright 2010 National Academy of Sciences of the United
States of America. (B) Modified diagram-of-states from panel (A) with a focus
only on the bottom portion of the pyramid (i.e., stipulating that
the hydropathy is low enough to be ignored).
204
The polyampholytic contribution expands the space encompassed by
nonglobule-formers by subdividing the disordered globules space in
panel (A) into three distinct regions of which sequences in regions
2 and 3 actually may not form globules. In these polyampholytic regions,
one has to account for the total charge, in terms of the fraction
of charged residues (FCR), as well as the net charge per residue (NCPR)
as opposed to NCPR alone. Conformations in regions 2 and 3 are expected
to be random-coil-like if oppositely charged residues are well mixed
in the linear sequence. Otherwise, one can expect compact or semicompact
conformations. The classification scheme uses only the amino acid
sequence as input. Reprinted with permission from ref (204). Copyright 2013 National
Academy of Sciences of the United States of America.
The impact of the net charge per residue on the conformational
properties of IDRs can be summarized in a diagram-of-states (Figure 10A),
166
which generalizes
the original charge-hydropathy plot.
31
The
diagram classifies IDRs on the basis of their amino acid compositions.
Annotation using curated disordered sequences from the DisProt database
203
(Box 1) initially
suggests that a vast majority (∼95%) of IDPs have amino acid
compositions that predispose them to be globule formers (Figure 10A).
204
However, most
of these predicted globule formers are actually polyampholytes in
that they are enriched in charged residues but have roughly equal
numbers of positive and negative charges.
204
Although such sequences are classified as globule formers on the
basis of their low net charge per residue, in reality the conformational
properties of polyampholytes are governed by the linear sequence distribution
of oppositely charged residues. If the oppositely charged residues
are segregated in the linear sequence, then electrostatic attractions
between oppositely charged blocks cause chain collapse and result
in hairpin or globular conformations. In sequences with well-mixed
oppositely charged residues, the effects of electrostatic repulsions
and attractions counterbalance. These mixed sequences adopt random-coil
or globular conformations, depending on the total charge (in terms
of the fraction of charged residues) (Figure 10B). Many IDPs are strong polyampholytes
with well-mixed linear patterns
of oppositely charged residues.
204
Thus,
IDPs are actually enriched in different classes of random coils that
form swollen, loosely packed conformations (Figure 10B). Such random-coil sequences
are likely
to help improve the solubility profiles of connected structured domains
(see section 9.1) and to promote the flexibility
that is required for functions such as entropic tethers, which promote
high local concentrations of connected protein parts, or entropic
bristles, which occupy large volumes by rapid exploration of conformations.
These biophysical principles of sequence–structural ensemble
relationships enable the use of de novo sequence design as a tool
for modulating these properties and assessing their impact on functions
associated with IDPs and IDRs.
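The composition-based parameters used in the diagram-of-states can be computed directly from a sequence. The sketch below is illustrative rather than a reimplementation of the published classification. It assumes that lysine and arginine count as positive and aspartate and glutamate as negative (histidine is treated as neutral), takes the net charge per residue as the absolute difference between the two fractions, and ignores hydropathy, as panel B does. The 0.25 cutoff on net charge per residue comes from the caption of Figure 10A; the cutoff on the fraction of charged residues (`fcr_cutoff`) is a hypothetical placeholder, and linear charge patterning is not evaluated.

```python
POSITIVE = "KR"  # assumption: histidine is treated as neutral
NEGATIVE = "DE"


def charge_parameters(sequence):
    """Return f+, f-, FCR and NCPR computed from amino acid composition."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    n = len(seq)
    f_plus = sum(seq.count(aa) for aa in POSITIVE) / n
    f_minus = sum(seq.count(aa) for aa in NEGATIVE) / n
    return {
        "f_plus": f_plus,
        "f_minus": f_minus,
        "fcr": f_plus + f_minus,        # fraction of charged residues
        "ncpr": abs(f_plus - f_minus),  # net charge per residue
    }


def classify(sequence, ncpr_cutoff=0.25, fcr_cutoff=0.25):
    """Coarse label from composition alone; fcr_cutoff is illustrative."""
    p = charge_parameters(sequence)
    if p["ncpr"] > ncpr_cutoff:
        return "swollen coil"
    if p["fcr"] >= fcr_cutoff:
        # Conformation depends on how opposite charges are patterned.
        return "polyampholyte"
    return "globule former"


print(classify("KKKKKKKKAA"))  # high net charge
print(classify("KEKEKEKEKE"))  # many charges, no net charge
print(classify("QQQQNNNNGG"))  # polar tract without charges
```

Because a sequence such as `KEKEKEKEKE` has a net charge per residue of zero, a classification based on net charge alone would label it a globule former; reporting the fraction of charged residues separately exposes it as a polyampholyte.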
5.2
Prediction Flavors
Methods for predicting
disordered regions have generally been successful (Box 3), but their prediction
accuracies vary for different types of disordered regions.
205
Some predictors accurately predict certain
disordered regions but have lower accuracy predicting others, whereas
other predictors give opposite results. Vucetic and co-workers
205
classified protein disorder into three different
“flavors” based on competition between disorder predictors.
These V, C, and S disorder flavors (corresponding to the names of
the disorder predictors that best predict them: VL-2V, VL-2C, and
VL-2S) show differences in sequence composition, and combinations
of flavors could be associated with different protein functions. For
example, disordered regions that bind to other proteins are enriched
for flavor S, while disordered ribosomal proteins predominantly belong
to flavor V. Flavor C gave strong disorder predictions for sugar binding
domains.
5.3
Disorder–Sequence Complexity Space
The relationship between sequence complexity and disorder propensity
provides further insight into the structural and functional variations
of IDRs.
206
Different functional classes
of proteins often show a different disorder–sequence complexity
(DC) space distribution. A frequently observed DC-distribution is
composed of a compact structured part and a section extending out
into the low-complexity and high-disorder space before looping back
into the structured region. This pattern describes a disordered linker
region between structured domains. An example is the bacterial translation
initiation factor, which contains a sequence that locates to the low-complexity,
high-disorder region of DC space. This loop connects the N- and C-terminal
domains, which are high-structure and high-complexity.
206,207
Functionally related proteins have similar disorder–sequence
complexity distributions, suggesting that these distributions might
be useful for predicting the function of a disordered region.
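A trajectory through disorder–sequence complexity space can be sketched by pairing a windowed complexity measure with windowed disorder scores. The text above does not specify how complexity is measured, so the example below assumes Shannon entropy of the residue composition in a sliding window; the per-residue disorder scores are assumed to come from an external predictor and are simply passed in. The function names and the window size are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Shannon entropy (bits) of the residue composition of a segment."""
    n = len(segment)
    counts = Counter(segment)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def dc_trajectory(sequence, disorder_scores, window=20):
    """Return (complexity, mean disorder) for each window along the chain."""
    if len(sequence) != len(disorder_scores):
        raise ValueError("one disorder score per residue is required")
    if not 0 < window <= len(sequence):
        raise ValueError("window must be between 1 and the sequence length")
    points = []
    for start in range(len(sequence) - window + 1):
        end = start + window
        mean_disorder = sum(disorder_scores[start:end]) / window
        points.append((shannon_entropy(sequence[start:end]), mean_disorder))
    return points


# Toy input: a low-complexity, disordered stretch followed by a
# higher-complexity, ordered stretch (scores are made up).
points = dc_trajectory("AAAAACDEFG", [1.0] * 5 + [0.0] * 5, window=5)
for complexity, disorder in points:
    print(round(complexity, 2), round(disorder, 2))
```

Plotting these points in order along the chain would trace the kind of excursion into low-complexity, high-disorder space that the text associates with linkers between structured domains.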
5.4
Overall Degree of Disorder
Large-scale
studies into IDP function often group the proteins on the basis of
some measure of disorder. For example, protein sequences have been
categorized on the basis of the overall degree of disorder (i.e.,
the fraction of residues that is shown or predicted to be disordered),
68,208
resulting in groups of structured proteins (0–10% disorder),
moderately disordered proteins (10–30% disorder), and highly
disordered proteins (30–100% disorder). For 24% of human protein-coding
genes, at least 30% of residues are predicted to be disordered (Figure 2A). Other
studies classified proteins on the basis
of an overall score of disorder for the whole protein,
209
and the presence or absence of continuous stretches
of disordered residues with a specific length.
35,51,161,208
Largely structured
proteins are enriched for metabolic functions, while highly disordered
proteins function predominantly in regulation. Hence, classification
of disordered proteins based on the level of disorder provides clues
about what types of functions are likely.
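The three groups defined above translate into a simple binning of the fraction of disordered residues. The sketch below assumes a per-residue list of booleans from an experiment or a predictor. Because the stated ranges share their end points, assigning exactly 10% and exactly 30% to the higher group is an assumption made here, not a rule taken from the cited studies.

```python
def disorder_class(disordered_flags):
    """Return (fraction disordered, group label) for one protein."""
    n = len(disordered_flags)
    if n == 0:
        raise ValueError("at least one residue is required")
    fraction = sum(bool(flag) for flag in disordered_flags) / n
    if fraction < 0.10:
        label = "structured"
    elif fraction < 0.30:
        label = "moderately disordered"
    else:
        label = "highly disordered"
    return fraction, label


# Toy proteins of 100 residues with 5, 20 and 50 disordered residues.
for n_disordered in (5, 20, 50):
    flags = [True] * n_disordered + [False] * (100 - n_disordered)
    print(disorder_class(flags))
```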
5.5
Length of Disordered Regions
The length of IDRs in humans follows a power law distribution: there are
large numbers of short disordered regions and increasingly smaller
numbers of longer ones.
210
Other eukaryotic
and prokaryotic proteomes show similar disorder length profiles. 44%
of human protein-coding genes contain substantial disordered segments
of >30 amino acids in length
49
(similar
data shown in Figure 2A). Short IDRs may function
as linkers and contain individual linear motifs or MoRFs, whereas
longer disordered regions might be entropic chains or contain combinations
of motifs or domains functioning in recognition. Very long disordered
regions (more than 500 residues) are typically over-represented in
transcription-related functions,
211
whereas
proteins containing IDRs of 300–500 residues in length are
enriched for kinase and phosphatase functions. Shorter IDRs (less
than 50 residues) tend to be linked to metal ion binding, ion channels,
and GTPase regulatory functions. Thus, the length of a disordered
region can also provide a useful indication about the functional nature
of the protein containing it.
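Length-based analyses start from the contiguous disordered segments in a protein. The following sketch extracts those segments from a per-residue list of booleans (assumed to come from a predictor or an experiment) and tests for a segment longer than 30 residues, the threshold quoted above. The function names are hypothetical.

```python
from itertools import groupby


def disordered_segments(disordered_flags):
    """Return (start, end) index pairs, end exclusive, for each
    contiguous run of disordered residues."""
    segments = []
    position = 0
    for is_disordered, run in groupby(disordered_flags, key=bool):
        length = len(list(run))
        if is_disordered:
            segments.append((position, position + length))
        position += length
    return segments


def has_long_idr(disordered_flags, longer_than=30):
    """True if any disordered segment exceeds `longer_than` residues."""
    return any(end - start > longer_than
               for start, end in disordered_segments(disordered_flags))


# Toy protein: a 35-residue IDR and a 3-residue disordered tail.
flags = [False] * 5 + [True] * 35 + [False] * 10 + [True] * 3
segments = disordered_segments(flags)
print(segments)
print([end - start for start, end in segments])
print(has_long_idr(flags))
```

Collecting the segment lengths over a whole proteome gives the length distribution discussed in this section.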
5.6
Position of Disordered Regions
Almost all human proteins have some disordered residues within their terminal
regions.
59
For example, 97% of proteins
have predicted disorder in the first or last five residues.
161
Disordered N-terminal tails are common in DNA-binding
proteins, and have been shown to contribute to efficient DNA scanning.
212
Furthermore, proteins that are relatively rich
in disordered residues at the C-terminus are often associated with
transcription factor repressor and activator activities as compared
to proteins rich in internal or N-terminal disorder.
211
Membrane proteins, depending on their topology of insertion,
also contain disordered regions in the N- or C-terminus, but their
sequence composition is different as compared to disordered regions
in cytosolic proteins.
213
Ion channel proteins
are enriched for disordered residues at the N-terminus, and the same
is true to a lesser extent for C-terminal disorder.
211
These terminal disordered regions are often functionally
relevant, as illustrated by their role in the inactivation of voltage-gated
potassium channels.
214
Similarly, many
G-protein-coupled receptors (GPCRs) have large disordered regions
in their C-terminus, and often in the intracellular loops.
215
Several of them harbor peptide motifs that
link ligand binding in the transmembrane region of the receptor to
intracellular effectors, or contain PTM sites or linear motifs that
govern their stability.
216
Finally, proteins
that are relatively rich in internal disordered regions are weakly
enriched for transcription regulator and DNA binding activity.
211
Thus, the relative position of a disordered
region in a sequence provides clues about the function of the protein
containing it.
5.7
Tandem Repeats
Short tandem repeats
are common in IDRs and IDPs.
61,217−220
For instance, as many as 96% of polyglutamate and polyserine stretches
lie within disordered regions.
219
Similarly,
large fractions were found for proline, glycine, glutamine, lysine,
aspartate, arginine, histidine, and threonine repeats. In contrast,
polyleucine stretches occur predominantly within structured regions.
These observations agree with the compositional bias of disordered
regions (see section 5.1); the most common
tandem repeats in IDRs are made up of disorder-promoting residues
44,194
and of sequence patterns that are typically associated with disorder.
195
Moreover, a distinction between perfect and
imperfect tandem repeats suggests that as the repeat perfection increases,
so does the disorder content.
219
Repeats of different composition have been linked to specific functions.
218,221
Consequently, the presence of particular types of repeats is likely
to contribute to IDR functioning. Descriptions and examples of different
classes of disordered tandem repeats and their structural characteristics
have been reviewed previously.
218
For instance,
polyproline and polyglutamine stretches are associated with protein
and nucleic acid binding and transcription factor activity.
222,223
Protein segments enriched for glutamine and asparagine often occur
in disordered regions
224
and are abundant
in eukaryotic proteomes,
225
despite their
propensity to aggregate or form coiled-coil structures.
226
The aggregation propensity of the Q/N-enriched
segments is exploited in the formation of physiologically relevant
assemblies such as P-bodies (e.g., Ccr4 and Pop2), stress granules,
and processing bodies.
227
However, expanded
polyglutamine repeats are also associated with neurodegenerative disorders,
the best known being Huntington’s disease.
228
Moreover, several prion-like yeast proteins
(e.g., Sup35p and Ure2p) contain intrinsically disordered Q/N-rich
protein segments that have been implicated in the switch between a
soluble and an insoluble, aggregated form.
225,229
Another example of functional disordered repeats occurs in the SR
protein family of splicing factors (e.g., ASF/SF2 and SRp75).
230,231
SR proteins mediate the assembly of spliceosome components. They
consist of an N-terminal RNA-recognition motif and a disordered C-terminus
with tandem repeats of arginine and serine residues (RS domain). Phosphorylation
switches the RS domain of the serine/arginine-rich splicing factor
1 (SRSF1) from a fully disordered state to a more rigid structure.
232
Other disordered repeats associated with a
specific function include sequences enriched in lysine, alanine, and
proline in the histone H1 C-terminal domain, which are involved in
the formation of 30 nm chromatin fiber by binding linker DNA between
the nucleosomes.
233,234
A final example is dentin sialophosphoprotein
(DSPP), which contains extensively phosphorylated repeats of aspartic
acid and serine involved in calcium phosphate binding (see section 9.3).
235
Some repeat-containing
regions are also prone to undergo phase transitions from a soluble
monomeric state to an insoluble large assembly form, as demonstrated
for regions rich in proline, threonine, and serine residues in mucins
(see section 9.2).
236
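The simplest tandem repeats discussed in this section, single amino acid (homopolymer) stretches, can be located with a regular expression. The sketch below is illustrative only: the minimum run length of five residues is a hypothetical threshold, imperfect repeats and repeats of longer units are not handled, and the disorder annotation is assumed to be supplied as a per-residue list of booleans.

```python
import re


def homopolymer_runs(sequence, min_length=5):
    """Return (residue, start, end) for each run of one amino acid
    that is at least `min_length` long (end is exclusive)."""
    if min_length < 2:
        raise ValueError("min_length must be at least 2")
    pattern = re.compile(r"([A-Z])\1{%d,}" % (min_length - 1))
    return [(m.group(1), m.start(), m.end())
            for m in pattern.finditer(sequence.upper())]


def run_is_disordered(run, disordered_flags):
    """True if every residue of the run is flagged as disordered."""
    _, start, end = run
    return all(disordered_flags[start:end])


sequence = "MAQQQQQQQQSLLLK"
flags = [True] * 11 + [False] * 4  # made-up disorder annotation
runs = homopolymer_runs(sequence)
print(runs)
print([run_is_disordered(run, flags) for run in runs])
```

Counting how many runs of each residue type fall within disordered regions across a proteome would give per-residue fractions of the kind quoted at the start of this section.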
6
Protein Interactions
Disordered region-mediated
molecular interactions have been proposed
to work using a combination of conformational selection and induced
folding.
92,146,237
These mechanisms
of binding are two extreme possibilities and are not mutually exclusive.
Both play a role in the interaction between two proteins, the dominant
mechanism depending, for example, on the concentrations of the individual
proteins
238
and the association rate constants.
84
In conformational selection, addition of binding
partners can result in a population shift in the conformational ensemble
of a disordered protein (see section 4.2) toward
the conformation that is most favorable for binding.
119,145,173,175
This mechanism has been observed in both protein–protein
and protein–nucleic acid interactions.
173
Evidence for the role of conformational selection in IDP
binding comes, for example, from the interaction between PDEγ
and the α-subunit of transducin,
239
which is important in phototransduction. The dynamic ensemble of
unbound PDEγ includes a loosely folded state that resembles
its structure when bound to transducin. In induced folding, a protein
undergoes a disorder-to-order transition upon association with its
binding partner.
92,146,240
Evidence for this mechanism in IDP binding comes, for example, from
a study investigating the disordered pKID region of CREB and the KIX
domain of CREB-binding protein. Upon binding of pKID to the KIX domain,
an ensemble of transient encounter complexes forms, which appear to
be stabilized primarily by hydrophobic contacts and evolve to form
the fully bound state via an intermediate state without dissociation
of the two domains.
91,241
6.1
Fuzzy Complexes
Although disordered protein regions frequently fold upon interacting with other proteins,
complexes with IDPs often retain significant conformational freedom
and can only be described as structural ensembles.
242
The conformations that disordered proteins adopt in the
bound state cover a continuum, similar to the structural spectrum
of free, unbound IDPs,
243
and range from
static to dynamic, and from full to segmental disorder.
242
In static disordered complexes, disordered
regions can adopt multiple well-defined conformations in the complex,
whereas in dynamic disorder they fluctuate between various states
of an ensemble in the bound state.
Disorder in the bound state
can be classified into four molecular modes of action, each of which
is associated with specific molecular functions (Figure 11A–D).
176,242
(i) The polymorphic
model is a form of static disorder, with alternative bound conformations
serving distinct functions by having different effects on the binding
partner. Examples are the Tcf4 β-catenin binding domain
244
and the WH2 binding domains of thymosin β4
or ciboulot,
245
which have been shown to
adopt several distinct conformations upon β-catenin and actin
binding, respectively. Different actin–WH2 domain complexes
have alternative interaction interfaces and result in actin polymers
with different topologies.
245
The (ii)
clamp and (iii) flanking models represent forms of dynamic disorder
in which complex formation either involves folding upon binding of
two disordered segments that are connected by a linker that remains
disordered, or the reverse situation, respectively. The cyclin-dependent
kinase (Cdk) inhibitor p21, for example, acts as a clamp. It contains
a dynamic helical subdomain that serves as an adaptable linker that
connects two binding domains and enables these to specifically bind
distinct cyclin and Cdk complex combinations.
246
In both the clamp and the flanking models, disordered regions
near the interacting protein segments (often short peptide motifs)
contribute to binding by influencing affinity and specificity.
242,247
This phenomenon relates to the importance of the sequence context
in modulating disordered binding elements (see section 3). Finally, (iv) the random
model is an extreme version of
dynamic disorder in protein complexes, which occurs when the IDR remains
largely disordered even in the bound state. In this case, interaction
is achieved via linear motifs that do not get fixed upon binding.
An example is the self-assembly of elastin, where solid-state NMR
has provided evidence for dynamic disorder within elastin fibers,
which exhibit random-coil like chemical shift values.
248
Another case is the complex between the Cdk
inhibitor Sic1 and the SCF ubiquitin ligase subunit Cdc4, which is
formed in a phosphorylation-dependent manner.
249
At any given time, only one out of nine Sic1 phosphorylation
sites interacts with the core Cdc4 binding site, while the others contribute
to the binding energy via a secondary binding site or via long-range
electrostatic interactions (Figure 12N). Hence,
binding interchanges dynamically within the Sic1–Cdc4 complex
to provide ultrafine tuning of the affinity.
249,250
Figure 11
Classification of fuzzy complexes by topology (upper panel) and
by mechanism (lower panel). Blue arrows indicate interactions between
fuzzy disordered regions and structured molecules. Protein Data Bank
147
identifiers for the structures are given in
parentheses. Topological categories: (A) Polymorphic. The WH2 domain
of ciboulot interacts with actin in alternative locations: via an
18-residue segment (3u9z) or via only three residues (2ff3). The flanking regions
remain dynamically
disordered. (B) Clamp. The Oct-1 transcription factor has a bipartite
DNA recognition motif. The two globular binding domains are connected
by a 23-residue-long disordered linker (1hf0), shortening of which reduces binding
affinity. (C) Flanking. The p27Kip1 cell-cycle kinase inhibitor
binds to the cyclin–Cdk2 complex (1jsu). The kinase binding site is flanked
by a ∼100-residue-long disordered linker, which enables T187
at the C-terminus to be phosphorylated. (D) Random. UmuD2 is a dimer
that is produced from UmuD by RecA-facilitated self-cleavage (1i4v). The resulting
proteins exhibit a random coil signal in circular dichroism experiments
at physiologically relevant concentrations. Mechanistic categories:
(E) Conformational selection. The fuzzy N-terminal acidic tail of
the Max transcription factor (1nkp) facilitates formation of the DNA binding
helix (dark red) of the leucine zipper basic helix–loop–helix
(bHLH) motif. (F) Flexibility modulation. The disordered serine/arginine-rich
region of the Ets-1 transcription factor (1mdm) changes DNA binding affinity by 100–1000-fold
by modulating the flexibility of the binding segment via transient
interactions. (G) Competitive binding. The acidic fuzzy C-terminal
tail of high-mobility group protein B1 (2gzk) competes with DNA for the positively
charged binding surfaces. (H) Tethering. The binding of the virion
protein 16 activation domain to the human transcriptional coactivator
positive cofactor 4 (2phe) is facilitated by acidic disordered regions,
which anchor the binding segments.
Bound disordered regions can impact the interaction affinity
and
specificity of the complex and tune interactions of folded regions
176
with proteins or DNA.
251
Four different mechanisms have been proposed for the formation of
fuzzy complexes (Figure 11E–H). (i)
The first is conformational selection, when the disordered region
shifts the conformational equilibrium of the binding interface toward
the bound form. The fuzzy N-terminal tail of the Max transcription
factor, for example, reduces electrostatic repulsion in the basic
helix–loop–helix (bHLH) domain and thereby facilitates
formation of the DNA recognition helices, which increases binding
affinity by 10–100-fold.
252
(ii)
In the second mechanism, the disordered region(s) modulate flexibility
of the binding interface. The serine- and arginine-rich region of
the Ets-1 transcription factor exemplifies this mechanism: it reduces
DNA binding affinity by 100–1000-fold.
253
(iii) The third mechanism is competitive binding of the
disordered region. Here, the IDR acts as a competitive inhibitor of
other regions in the same protein for binding to a partner. The acidic
fuzzy C-terminal tail of high-mobility group protein B1 (HMGB1) negatively
regulates interaction of the HMG DNA binding domains by occluding
the basic DNA-binding surfaces.
254
(iv)
In the fourth mechanism, the disordered region serves to tether a
weak-affinity binding region to increase its local concentration.
For example, a fuzzy N-terminal domain anchors the human positive
cofactor 4 (PC4) to several transactivation domains including the
herpes simplex virion protein 16 (VP16).
255
All mechanisms of disordered complex formation affect binding to
different degrees and can be further tuned by post-translational modifications.
176,251
PTMs in the disordered region may act as affinity tuners by modulating
the charge available for biomolecular interactions.
256
6.2
Binding Plasticity
Structural analysis
of a large number of intrinsic disorder-based protein complexes resulted
in another categorization of IDRs based on their binding plasticity
(Figure 12).
257
Examples of relatively
static IDR-based complexes are (i) mono- and polyvalent complexes,
which typically consist of interactions between disordered segments
and one or multiple spatially distant binding sites on their binding
partners, respectively, (ii) chameleons, such as p53, that have different
structures when binding to different proteins, (iii) penetrators that
bury significant parts of the protein inside their binding partners,
and (iv) huggers, which function in protein oligomerization, for example,
by coupled folding and binding of disordered monomers. In addition
to these relatively static complexes involving IDRs, one can identify
coiled-coil-based complexes. Regions that make up coiled coils are
typically highly disordered in the monomeric state and gain helical structure
upon coiled-coil formation, giving rise to several distinguishable
types of complexes, such as intertwined strings, connectors, armatures,
and tentacles.
Figure 12
A portrait gallery of disorder-based complexes. Illustrative
examples
of various interaction modes of intrinsically disordered proteins
are shown. Protein Data Bank
147
identifiers
for the structures are given in parentheses. (A) MoRFs. Aa, α-MoRF,
a complex between the botulinum neurotoxin (red helix) and its receptor
(a blue cloud) (2NM1); Ab, ι-MoRF, a complex between an 18-mer cognate peptide
derived from the α1 subunit of the nicotinic acetylcholine receptor
from Torpedo californica (red helix)
and α-cobratoxin (a blue cloud) (1LXH). (B) Wrappers. Ba, rat PP1 (blue cloud)
complexed with mouse inhibitor-2 (red helices) (2O8A); Bb, a complex
between the paired domain from the Drosophila paired (prd) protein and DNA (1PDN).
(C) Penetrator. Ribosomal protein s12
embedded into the rRNA (1N34). (D) Huggers. Da, E. coli
trp repressor dimer (1ZT9); Db, tetramerization domain of p53 (1PES); Dc, tetramerization
domain of p73 (2WQI). (E) Intertwined strings. Ea, dimeric coiled coil, a basic coiled-coil
protein from Eubacterium eligens ATCC
27750 (3HNW);
Eb, trimeric coiled coil, Salmonella trimeric autotransporter adhesin,
SadA (2WPQ);
Ec, tetrameric coiled coil, the virion-associated protein P3 from
Caulimovirus (2O1J). (F) Long cylindrical containers. Fa, pentameric coiled coil,
side
and top views of the assembly domain of cartilage oligomeric matrix
protein (1FBM); Fb, side and top views of the seven-helix coiled coil, engineered
version of the GCN4 leucine zipper (2HY6). (G) Connectors. Ga, human heat shock
factor binding protein 1 (3CI9); Gb, the bacterial cell division protein ZapA from
Pseudomonas aeruginosa (1W2E). (H) Armature. Ha, side and top views
of the envelope glycoprotein GP2 from Ebola virus (2EBO); Hb, side and top
views of a complex between the N- and C-terminal peptides derived
from the membrane fusion protein of the Visna virus (1JEK). (I) Tweezers or
forceps. A complex between c-Jun, c-Fos, and DNA. Proteins are shown
as red helices, whereas DNA is shown as a blue cloud (1FOS). (J) Grabbers.
Structure of the complex between βPIX coiled coil (red helices)
and Shank PDZ (blue cloud) (3L4F). (K) Tentacles. Structure of the hexameric molecular
chaperone prefoldin from the archaeon Methanobacterium
thermoautotrophicum (1FXK). (L) Pullers. Structure of the ClpB
chaperone from Thermus thermophilus (1QVR). (M)
Chameleons. The C-terminal fragment of p53 gains different types of
secondary structure in complexes with four different binding partners,
cyclin A (1H26), sirtuin (1MA3), CBP bromo domain (1JSP), and S100ββ (1DT7). Panels
A–M reprinted with permission
from ref (257). Copyright
2011 The Royal Society of Chemistry. (N) Dynamic complexes. Schematic
representation of the polyelectrostatic model of the Sic1–Cdc4
interaction. An IDP (ribbon) interacts with a folded receptor (gray
shape) through several distinct binding motifs and an ensemble of
conformations (indicated by four representations of the interaction).
The intrinsically disordered protein possesses positive and negative
charges (depicted as blue and red circles, respectively) giving rise
to a net charge ql, while the binding
site in the receptor (light blue) has a charge qr. The effective distance ⟨r⟩
is between the binding site and the center of mass of the intrinsically
disordered protein. Panel N was reprinted with permission from ref (243). Copyright
2010 John
Wiley & Sons, Inc.
7
Evolution
Disordered regions typically
evolve faster than structured domains.
51−56,107
This behavior largely stems
from a lack of constraints on maintaining packing interactions, which
drives purifying selection in structured sequences.
258
However, disordered residues do display a wide range of
evolutionary rates (Box 2). The following
section discusses the evolutionary classifications of disordered protein
regions. IDRs with similar functions and properties tend to have similar
evolutionary characteristics.
7.1
Sequence Conservation
While the amino
acid sequence of disordered regions evolves at different rates, the
property of disorder is usually conserved for functional sequences.
54,159
Sequence conservation of IDRs varies according to their specific
functions and provides another means for their classification.
54,259,260
Three biologically distinct
classes of IDRs with specific function were identified using a combination
of disorder prediction and multiple sequence alignment of orthologous
groups across 23 species in the yeast clade (Figure 13): (i) flexible disorder describes
regions where disorder is conserved
but that have quickly evolving amino acid sequences (i.e., there is
a requirement to be disordered, regardless of the exact sequence),
(ii) constrained disorder describes regions of conserved disorder
that also have highly conserved amino acid sequences, and (iii) nonconserved
disorder, where not even the property of being disordered is conserved
in closely related species. For flexible disorder, low sequence conservation
is expected if the property of disorder itself, as opposed to disorder
in combination with specific sequence, is the only requirement for
function. Examples of functions that mainly require the biophysical
flexibility of disordered regions are entropic springs, spacers, and
flexible linkers between well-folded protein domains.
37,39,57,58
The linker in RPA70 is an example where the dynamic behavior is
conserved even when the sequence conservation is low.
60
Flexible disorder is the most common of the three evolutionary
classes, accounting for just over one-half of disordered residues in yeast. It
appears to account not just for the “flexibility” functions
mentioned above, but also for many of the characteristics traditionally
associated with disordered regions, such as strong association with
signaling and regulation processes,
35,50,104,190,261,262
rapid sequence evolution,
51−56,107
the presence of short linear
motifs (which are themselves conserved, see below),
47,72
and tight regulation (see section 8).
68,263
By contrast, constrained disorder (about a third of disordered residues
in yeast) is associated with different properties and functions, such
as chaperone activity and RNA-binding ribosomal proteins.
54
Many proteins that contain the evolutionarily
constrained type of disorder can adopt a fixed conformation, suggesting
that these regions might undergo folding upon binding to their targets.
This structural transition might impose a high degree of local structural
constraints, which results in constraints on the protein sequence
alongside requirements to be flexible.
54
Constrained disordered residues also occur more often in annotated
protein sequence families (domains) than flexible disorder, but both
types are strongly depleted in domains compared to structured regions.
In human, both flexible and constrained disorder are enriched in proteins
functioning in differentiation and development,
264
which reflects the importance of IDPs in these processes.
Finally, nonconserved disorder accounts for around 17% of disordered
residues in yeast and appears to be largely nonfunctional.
Figure 13
Classification
of disordered regions according to their evolutionary
conservation (constrained, flexible, and nonconserved disorder). (A)
Schematic of computing disorder conservation and amino acid sequence
conservation. The alignments are used to calculate the percentage
of sequences in which a residue is disordered and the percentage of
sequences in which the amino acid itself is conserved. A residue is
considered to be conserved disordered if the property of disorder
is conserved in at least one-half of the species. Similarly, the amino
acid type of a residue is considered conserved if it is present in
at least one-half of the species. Disordered residues in which both
sequence and disorder are conserved are referred to as constrained
disorder. Disordered residues in which disorder is conserved but not
the amino acid sequence are referred to as flexible disorder. Residues
that are disordered in S. cerevisiae but are not cases of conserved disorder are referred
to as nonconserved
disorder. (B) Disorder splits into three distinct phenomena. Functional
enrichment maps of proteins enriched in flexible disorder versus constrained
disorder. The area of each rectangle is proportional to the occurrence
of that type of disorder in the alignments. Related gene ontology
terms are grouped based on gene overlap. Reprinted with permission
from ref (54). Copyright
2011 Springer Science + Business Media.
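The decision rule described in the Figure 13 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in ref 54: the function name classify_column, the treatment of gaps ("-") as neither disordered nor conserved, and the use of all aligned species as the denominator are choices made here for the sake of the example. Only the one-half cutoffs and the three class names come from the text.

```python
from typing import Optional, Sequence


def classify_column(
    residues: Sequence[str],
    disordered: Sequence[bool],
    ref: int = 0,
    min_fraction: float = 0.5,
) -> Optional[str]:
    """Classify one alignment column relative to a reference species.

    residues     amino acid (or "-" for a gap) per species in this column
    disordered   predicted disorder state per species in this column
    ref          index of the reference species (e.g., S. cerevisiae)
    min_fraction cutoff for calling a property conserved (one-half in the text)

    Returns "constrained", "flexible", "nonconserved", or None when the
    reference residue is a gap or is not disordered.
    """
    if not residues or len(residues) != len(disordered):
        raise ValueError("residues and disordered must be non-empty and equal in length")
    if residues[ref] == "-" or not disordered[ref]:
        return None

    n_species = len(residues)
    # Assumption: gapped positions never count as disordered.
    frac_disordered = sum(
        1 for aa, dis in zip(residues, disordered) if dis and aa != "-"
    ) / n_species
    # Assumption: conservation means identity with the reference residue.
    frac_same_aa = sum(1 for aa in residues if aa == residues[ref]) / n_species

    if frac_disordered < min_fraction:
        return "nonconserved"
    if frac_same_aa >= min_fraction:
        return "constrained"
    return "flexible"
```

Applied column by column to an alignment of orthologs, such a function would assign each disordered reference residue to one of the three classes. A real analysis would also need to choose a disorder predictor and decide how to handle gaps and missing orthologs.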
Short linear motifs (see section 3.1)
48,125
constitute a special case. Even though SLiMs
almost exclusively
lie within disordered regions, their own amino acid sequence tends
to be conserved.
48
These properties, together
with the difficulty of aligning rapidly evolving disordered sequences,
cause the motifs to appear to move around when their positions are compared
across different sequences. In fact, not only do motifs move around (due
to insertions and deletions of amino acids around the motif in the
sequence
67,265
), they can also permute their positions
with respect to other structural and functional modules. For example,
SUMO modification sites in p53 are seen after and before the oligomerization
domain in human and fly, respectively.
266
Such behavior could emerge by convergent evolution and loss of the
motif in the original site, as only a few amino acids need to mutate
to make a new motif elsewhere in the sequence. As long as the position
of the motif with respect to the other modules does not affect function,
such permutations will not affect fitness and hence may emerge relatively
easily during evolution. These are indeed confounding issues when
aligning disordered regions among orthologous proteins to identify
functional motifs.
In many ways, the disordered regions that
contain SLiMs constitute
flexible disorder as defined by the above classification, because their main role
is to provide flexibility to enable access to the linear motif for
proteins that will bind them as ligands
267
or introduce post-translational modifications.
47,48
Phosphorylation sites are closely related to short linear motifs
that function in binding, but are often too short and weakly conserved
to recognize via computational means.
268
More than 90% of sites phosphorylated by the yeast Cdk1 are in predicted
disordered regions,
67
consistent with
previous studies highlighting the importance of IDRs as display sites
for phosphorylation and other PTMs (see sections 2.2 and 3.1).
45,46
Comparison of the phosphorylation sites in orthologues of the Cdk1
substrates revealed that the precise position of most phosphorylation
sites is not conserved. Instead, clusters of sites move around in
the alignment of rapidly evolving disordered regions.
69,250,269
Another example of the role
of flexible disorder in signaling and regulation is the yeast serine-arginine
protein kinase Sky1, which regulates proteins involved in mRNA metabolism
and cation homeostasis. The Sky1 C-terminal loop is intrinsically
disordered and contains phosphosites that are important for regulating
its kinase activity.
270
Conservation analysis
has shown that the loop is conserved for disorder but not for sequence.
54
The combination of sequence conservation
of IDRs and conservation
of their amino acid composition between human and seven other eukaryotes
(chimp, dog, rat, mouse, fly, worm, and yeast) also identifies functional
preferences.
260
IDRs with high residue
conservation (HR) are enriched in proteins involved in transcription
regulation and DNA binding. Low residue conservation in combination
with high conservation of the amino acid type composition (LRHT) of
the IDR (i.e., high similarity of overall amino acid composition between
the human IDR and its orthologs) is often associated with ATPase and
nuclease activities. Finally, IDRs that show neither conservation
of sequence nor conservation of amino acid composition (LRLT) are
abundant in (metal) ion binding proteins.
7.2
Lineage and Species Specificity
Increasingly
complex organisms have higher abundances of disorder in their proteomes.
35,271
An average of 2% of archaeal, 4% of bacterial, and 33% of eukaryotic
proteins have been predicted to contain regions of disorder over 30
residues in length,
35
although there is
much variation within kingdoms.
272,273
In human,
31% of proteins are more than 35% unstructured,
68
and 44% contain stretches of disorder longer than 30 residues
49,161,208
(similar data shown in Figure 2A). Human IDPs are spread relatively uniformly across
the chromosomes, with percentages ranging from 38% (for genes encoding
IDPs on chromosome 21) to 50% on chromosomes 12 and X.
161
A computational analysis of disorder in prokaryotes
has corroborated the higher abundance of disorder in Bacteria as compared
to Archaea.
274
Moreover, in agreement with
the low abundance of disorder in prokaryotes, none of the 13 mitochondrial-encoded
proteins are disordered.
161
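The proteome-level figures quoted above (the share of proteins with a disordered stretch longer than 30 residues, and the share that are more than 35% unstructured) are simple summaries of per-residue disorder predictions. The sketch below shows one way such summaries could be computed. The function names and the input format (a mapping from protein identifier to a list of per-residue Boolean disorder calls) are assumptions made for this example. Only the two cutoffs are taken from the text.

```python
from typing import Dict, Sequence


def longest_disordered_stretch(flags: Sequence[bool]) -> int:
    """Length of the longest run of consecutive disordered residues."""
    longest = current = 0
    for is_disordered in flags:
        current = current + 1 if is_disordered else 0
        longest = max(longest, current)
    return longest


def disordered_fraction(flags: Sequence[bool]) -> float:
    """Fraction of residues predicted to be disordered (0.0 for an empty protein)."""
    return sum(flags) / len(flags) if flags else 0.0


def proteome_summary(
    proteome: Dict[str, Sequence[bool]],
    min_stretch: int = 30,
    min_fraction: float = 0.35,
) -> Dict[str, float]:
    """Share of proteins exceeding each disorder cutoff.

    long_stretch   proteins with a disordered run longer than min_stretch
    high_fraction  proteins with more than min_fraction disordered residues
    """
    if not proteome:
        raise ValueError("proteome must contain at least one protein")
    n = len(proteome)
    long_stretch = sum(
        1 for flags in proteome.values()
        if longest_disordered_stretch(flags) > min_stretch
    )
    high_fraction = sum(
        1 for flags in proteome.values()
        if disordered_fraction(flags) > min_fraction
    )
    return {"long_stretch": long_stretch / n, "high_fraction": high_fraction / n}
```

The two measures capture different things: a long protein can contain a disordered stretch of more than 30 residues while remaining mostly structured, which is why the two percentages reported for human differ.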
Systematic
analysis of IDP occurrence in 53 archaeal species showed that disorder
content is highly species-dependent.
275
For example, Thermoproteales and Halobacteria proteomes have 14% and 34% disordered
residues, respectively. Harsh environmental conditions seem to favor
higher disorder contents, suggesting that some of the archaeal IDPs
evolved to help accommodate hostile habitats.
276
Structural disorder is more common in viruses than
in prokaryotes.
277
The characteristics
of IDRs seem especially well suited to small RNA viruses with extremely
compact genomes.
278,279
For example, disordered regions
could buffer the deleterious effects of mutations introduced by low-fidelity
virus polymerases better than would structured domains.
277
The flexibility of IDRs to interact with many
different proteins, such as proteins of the host immune system, is
another useful feature for compact viruses because it maximizes the
amount of functionality they encode while minimizing the required
genetic information.
280
At the same time,
several human innate immunity proteins have predicted disordered regions
that could be important for their pathogen defense function.
281
For example, the RIG-I-like receptors (RLRs)
RIG-I and MDA5 recognize different types of viral double-stranded
RNA (dsRNA).
282
This functional divergence
is partly achieved by differential flexibility of a loop that is rigid
in RIG-I, but disordered in MDA5, resulting in different RNA binding
preferences.
283
Furthermore, the disordered
linker between the RNA-binding domains and the two N-terminal CARD
(caspase activation and recruitment) domains of MDA5 helps facilitate
oligomerization of the CARD domains, which initiates downstream signaling.
283
Activated RIG-I and MDA5 promote the formation
of prion-like aggregates of the CARD domains of MAVS (mitochondrial
antiviral-signaling protein).
284
MAVS has a highly
disordered central region that contains multiple phosphorylation sites
and interacts with several proteins, such as TRAF2 and TRAF6, through
their respective consensus binding motifs (PxQx[TS] and PxExx[FYWHDE],
respectively).
285
These interactions are
part of a signaling pathway that activates the transcription factors
IRF3/7 and NF-κB, leading to the expression of proinflammatory
cytokines such as IFN-α/β and various proteins with direct
antiviral activity.
282
For example, to
counteract viral infection, protein kinase R (PKR) phosphorylates
the translation initiation factor eIF2α in the presence of dsRNA,
which reduces global protein synthesis in the cell.
286
PKR contains a long disordered interdomain region that
may become ordered upon RNA binding and could affect PKR dimerization.
287,288
Interestingly, viruses counteract PKR action by mimicking eIF2α
and competing for PKR binding, as has been shown in the case of the
poxvirus protein K3L.
289
PKR is under intense
positive selection to keep recognizing eIF2α while minimizing
interaction with viral antagonists.
289
Many
of the changing sites in PKR are in a dynamic loop near the interaction
interface with both eIF2α and K3L.
290
Similarly, recognition of retrovirus capsids by the restriction
factor TRIM5α is mediated by disordered regions in the SPRY
domain, which bear many positively selected residues that are essential
for the antiviral activity.
291
The SPRY
domain exists as an ensemble of disordered conformations that determine
the specificity and affinity of the interaction between TRIM5α
and the viral capsid.
292−294
In this way, the evolutionary flexibility
of disordered regions (see section 7.1) provides
opportunities for proteins of the host immune system to compete with
rapidly changing pathogens while maintaining their functionality.
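The TRAF2 and TRAF6 consensus binding motifs quoted above for MAVS, PxQx[TS] and PxExx[FYWHDE], can be read as regular expressions in which x stands for any residue. The following sketch scans a sequence for both patterns. It is an illustration only: the function name find_motifs is hypothetical, the toy sequence in the usage note is invented and is not the MAVS sequence, and a pattern match by itself does not show that a site is functional.

```python
import re
from typing import Dict, List, Tuple

# Consensus motifs from the text, with "x" (any residue) written as "." in regex syntax.
TRAF_MOTIFS: Dict[str, str] = {
    "TRAF2": "P.Q.[TS]",
    "TRAF6": "P.E..[FYWHDE]",
}


def find_motifs(
    sequence: str, motifs: Dict[str, str] = TRAF_MOTIFS
) -> List[Tuple[str, int, str]]:
    """Return (motif name, 1-based start position, matched residues) for every hit.

    A zero-width lookahead is used so that overlapping matches are all reported.
    """
    hits = []
    for name, pattern in motifs.items():
        for match in re.finditer(f"(?=({pattern}))", sequence):
            hits.append((name, match.start() + 1, match.group(1)))
    return sorted(hits, key=lambda hit: hit[1])
```

For the invented sequence "AAPVQETGGPGENSEAA", the function reports a TRAF2-type match (PVQET) starting at position 3 and a TRAF6-type match (PGENSE) starting at position 10. In practice, candidate matches would typically be filtered further, for example by requiring that they fall within predicted disordered regions.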
In addition to the variation in prevalence of disordered regions
between species, different kingdoms of life seem to use conserved
IDRs for different functions: eukaryotic and viral proteins use disorder
mainly for mediating transient protein–protein interactions
in signaling and regulation, while prokaryotes use disorder mainly
for longer lasting interactions involved in complex formation.
159
Thus, knowledge of the lineage, species, and
origin of a disordered region could help in predicting its likely
function.
7.3
Evolutionary History and Mechanism of Repeat Expansion
Tandem repeats are enriched for intrinsic disorder
(see section 5.7), and IDRs are increasingly
abundant in increasingly complex organisms (see section 7.2). The genetic instability
of repetitive genomic
regions in combination with the structurally permissive nature of
IDRs might have driven the increase in the amount of disorder during
evolution. Disordered repeat regions have been shown to fall into
three categories, based on their evolutionary history and acquired
functional properties (Figure 14):
61
type I regions have not undergone functional
diversification after repeat expansion (e.g., the titin PEVK domain),
type II repeats have acquired diverse functions due to mutation or
differential location within the sequence (e.g., the C-terminal domain
of eukaryotic RNA polymerase II), and type III regions have gained
new functions as a consequence of their expansion per se (e.g., the
prion protein octarepeat region).
Figure 14
Repeat expansion
creates IDRs. IDRs are abundant in repeating sequence
elements, which suggests that repeat expansion is an important mechanism
by which genetic material encoding structural disorder is generated.
The expanding repeats may fall into three classes (types) in terms
of their functional diversification following expansion. Individual
repeats may remain functionally equivalent (type I), or diversify
(type II), or collectively acquire a completely new function (type
III). Dark-tone red indicates structural disorder of the repeat, which
may undergo full (dark-tone blue) or partial (green) induced folding
upon binding to a partner. Adapted with permission from ref (61). Copyright 2003 John
Wiley
& Sons, Inc.
8
Regulation
Altered availability of IDPs is associated with diseases such as
cancer and neurodegeneration.
190,263,295−299
Indeed, genes that are harmful when overexpressed (i.e., dosage-sensitive
genes) often encode proteins with disordered segments.
300
Multiple mechanisms at different stages during
gene expression (from transcript synthesis to protein degradation)
control the availability of IDPs.
68
Their
tight regulation ensures that IDPs are available in appropriate levels
and for the right amount of time, thereby minimizing the likelihood
of ectopic interactions. Disease-causing altered availability of IDPs
may result in imbalances in signaling pathways by sequestering proteins
through nonfunctional interactions involving disordered segments (i.e.,
molecular titration
263
). The following
section discusses possible functional roles of proteins with IDRs
based on their cellular regulatory properties such as transcript abundance,
alternative splicing, degradation kinetics, and post-translational
processing.
8.1
Expression Patterns
Five different
expression patterns were identified for transcripts encoding highly
disordered proteins by investigating the mRNA levels from over 70
different human tissues and comparing the number of tissues in which
IDP transcripts are expressed against the level of expression (Figure 15).
208
The expression
classes are associated with specific functions. (i) The first subgroup
(Figure 15, light blue markers) shows constitutive
high expression in all tissues and consists exclusively of large ribosomal
subunit proteins, which are almost entirely disordered. (ii) The second
group (blue-green) represents transcripts that show high expression
levels in the majority of tissues. These often function as protease
inhibitors, splicing factors, and complex assemblers. (iii) Moderately
expressed transcripts (green) typically encode disordered proteins
involved in DNA binding and transcription regulation. (iv) IDPs that
are expressed in a tissue-specific manner (yellow) are enriched for
cell organization regulators, transcription cofactors, and factors
that promote complex disassembly. Finally, (v) the remaining transcripts
form a group (gray) not detected to be abundant in any of the tissues
studied. This low and transient expression group contains more than
one-half of the IDP transcripts analyzed and has a variety of functions.
Figure 15
A summary of expression–function trends for human
transcripts
encoding highly disordered proteins. The x-axis represents
the log10 number of tissues in which the transcript is
expressed; the y-axis represents the log10 average magnitude of expression within
the tissues. From the data,
five distinct functional classes of highly disordered human proteins
become apparent. Adapted with permission from ref (208). Copyright 2009 Springer
Science + Business Media.
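The two coordinates plotted in Figure 15 can be derived from a vector of per-tissue expression values for a transcript. The sketch below is a minimal illustration, not the procedure of ref 208: the function name expression_coordinates and the detection_threshold parameter are assumptions, as is the interpretation that the average is taken over only those tissues in which the transcript is detected.

```python
import math
from typing import Optional, Sequence, Tuple


def expression_coordinates(
    levels: Sequence[float], detection_threshold: float = 1.0
) -> Optional[Tuple[float, float]]:
    """Return (log10 number of expressing tissues, log10 mean expression in them).

    levels               expression value of one transcript in each tissue
    detection_threshold  hypothetical cutoff above which a transcript counts
                         as expressed in a tissue

    Returns None when the transcript is not detected in any tissue, because
    the logarithms are undefined in that case.
    """
    expressed = [value for value in levels if value > detection_threshold]
    if not expressed:
        return None
    mean_level = sum(expressed) / len(expressed)
    return math.log10(len(expressed)), math.log10(mean_level)
```

In these coordinates, a constitutively and highly expressed transcript lies toward the upper right, and a tissue-specific transcript lies toward the left. Transcripts not detected in any tissue have no defined position, which the function signals by returning None.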
8.2
Alternative Splicing
Trends in transcriptional
regulation (alternative promoter and polyadenylation site usage) and
post-transcriptional regulation (alternative splicing by inclusion
or exclusion of exons) can also be informative of the role that specific
disordered protein regions play in the cell (Figure 16). Alternatively spliced exons
are overall more likely to
encode intrinsically disordered rather than structured protein segments.
161,301−303
This tendency is even more pronounced in
alternative exons whose inclusion or exclusion is regulated in a tissue-specific
manner.
304
IDRs that are encoded by these
tissue-specific alternative exons frequently influence the choice
of protein interaction partners and can be instrumental in protein
regulation
304,305
by embedding binding motifs
and residues that can be post-translationally modified.
304
However, simple alteration of the length of
a disordered region
306
can also modulate
the overall protein function (Figure 16). Changes
in IDR length can be an effective mechanism for modifying the affinity
of interactions that a protein makes, particularly in instances where
a disordered region is responsible for the positioning of protein
binding motifs or domains.
307,308
Among the alternative
exons, those that exhibit conserved splicing patterns across different
species are particularly likely to have important regulatory roles.
For example, tissue-specific exons that are alternatively spliced
in multiple mammals remarkably often contain IDRs with
embedded phosphosites.
309
Disordered regions
encoded by these exons are hence likely to act as modulators of protein
function depending on the tissue where they are expressed.
309
While tissue-specific exons that are alternatively
spliced in a conserved fashion often code for phosphosites, the emergence
of novel exons in a gene, although at first likely detrimental,
310
is a possible template for the evolution of
short interaction motifs.
311
Furthermore,
changes in exon regulation can also be important for the emergence
of novel adaptive functions. Accordingly, protein segments encoded
by exons that are alternatively spliced either in a single species
or in a whole evolutionary lineage are enriched in short binding
motifs, and alternative inclusion of disordered regions encoded by
these exons is conceivably a source of evolutionary novelty.
312
Figure 16
Transcriptional and post-transcriptional gene
regulation can be
informative of IDR function. How inclusion of exons that code for
IDRs is regulated during gene transcription and alternative splicing
can give insights into the functional roles of the encoded disordered
regions. For example, tissue- or developmental-specific regulation
of alternative splicing or alternative promoter and polyadenylation
site usage can be associated with important roles of the encoded IDRs
in protein regulation and cellular interactions through, for example,
the presence of binding motifs and phosphosites. Additionally, information
on the conservation of patterns of exon inclusion (i.e., events shared
among different evolutionary lineages versus species-specific events)
can aid in better characterization of the encoded IDRs. The figure
illustrates a hypothetical example where an exon (largest red box)
that is included in a tissue-specific manner both in human and in
mouse encodes an IDR that embeds a phosphosite (P) and is involved
in protein regulation. The human gene depicted in the figure has an
additional exon (smallest red box), which encodes an IDR with a short
interaction motif and which is also included in a tissue-specific
manner in humans. Gene structures, mature mRNAs, and corresponding
protein isoforms are shown for human and mouse brain and heart tissues.
On the right, possible functional roles of the IDRs encoded by the
brain isoforms are illustrated. The examples illustrate how protein
functional space can increase due to alternative splicing of exons
that encode IDRs. Adapted with permission from ref (304). Copyright 2012 Elsevier.
In addition to the tendency of
cassette alternative exons to frequently
encode IDRs, exons adjacent to the alternatively spliced ones are
also likely to code for disordered regions around the insertion point
for the alternatively spliced segment.
264,302
These disordered
regions not only provide the structural flexibility that tolerates
both presence and absence of the alternatively spliced segment, but
they can also contain interaction motifs themselves.
264
Furthermore, on the transcriptional level, diversity in
protein isoforms can be created through both alternative splicing
and usage of alternative promoters and polyadenylation sites. Protein
segments that are encoded by the two latter mechanisms can contain
disordered regions with motifs that define protein localization and
stability.
313
Taken together, these examples
illustrate how better understanding of gene regulation and knowledge
of evolutionarily conserved and novel isoforms can provide insights
into possible functional roles of whole proteins and specific protein
regions.
8.3
Degradation Kinetics
Another emerging
functionality of disordered regions is their role in protein degradation.
314−321
Protein half-life generally correlates inversely with the fraction of disordered
residues,
68,317
and proteins that get ubiquitinated
specifically upon heat shock stress are typically disordered.
322
Although ubiquitination by E3 ligases has a
dominant role in recruiting proteins to the proteasome for degradation,
323,324
some IDRs of sufficient length allow for efficient initiation of
degradation by the proteasome independent of the ubiquitination status.
This idea is supported by in vitro experiments showing that degradation
of tightly folded proteins is accelerated when a disordered region
is attached to model substrates.
315,321
Efficient
degradation only occurs when the disordered terminal region is of
a certain minimal length,
321
and degradation
may be initiated by IDRs either at the protein terminus or internally.
314−321
Proteins that contain IDRs of sufficient length may therefore have
increased turnover, although the exact length requirements will depend
on the substrate. At the same time, not all IDRs influence protein
half-life. For example, disordered polypeptides with specific amino
acid compositions such as glycine-alanine and polyglutamine repeats
can attenuate rather than accelerate degradation by the proteasome.
325−327
The formation of protein complexes or transient interactions with
other proteins may also protect IDPs from degradation. Thus, we can
distinguish a novel functional class of IDRs: those that influence
protein degradation (degradation accelerators) versus those that do
not. These properties might be associated with specific protein function.
For example, proteins that contain IDRs of sufficient length are probably
more susceptible to degradation, possibly linking them to functions
of IDPs with low expression.
Some highly disordered proteins
(e.g., p53, p73, IκBα, BimEL) can, at least in vitro,
be degraded by the 20S proteasome independent of ubiquitination.
328−333
Specialized proteins termed “nannies” have been shown
to bind to and protect IDPs from ubiquitin-independent 20S proteasomal
degradation.
334
A free IDP, such as newly
synthesized p53, might be degraded by the 20S proteasome, which leads
to fast degradation kinetics. After a nanny binds the IDP (Hdmx in
the case of p53), slower, ubiquitin-dependent degradation by the 26S
proteasome takes place. This biphasic decay has been proposed as a
way to distinguish structured proteins from IDPs and the proteins
that protect them from degradation.
334
8.4
Post-translational Processing and Secretion
The majority of secretory proteins are targeted to the endoplasmic
reticulum (ER) via an N-terminal signal peptide, which helps to initiate
translocation of nascent chains into the ER.
335,336
Bioinformatic analysis of proteins containing N-terminal ER signal
peptides has identified only 10% of these proteins as IDPs (>70%
disordered),
suggesting that IDPs are under-represented in the secretome.
337
The fact that secreted proteins are rarely
IDPs might be partially explained by the requirement for largely disordered
proteins to contain an α-helical prodomain for correct import
into the ER lumen,
338
as demonstrated for
intrinsically disordered prohormones.
337
IDPs lacking this structured, α-helical domain were subjected
to ER-associated degradation (ERAD) despite the presence of a signal
peptide.
338
Despite the relative
depletion of IDPs in the secretome, a number of important IDPs are
processed within the ER, including many prohormones,
337,339
components of the extracellular matrix,
340
and proteins involved in biomineralization (see section 9.3).
117,341,342
Pre-pro-opiomelanocortin (pre-POMC) is a disordered 285 amino acid
protein whose signal peptide is removed during translation to create
the 241-residue pro-opiomelanocortin (POMC). This prohormone has at
least eight putative basic-rich cleavage sites and is able to yield
as many as 10 biologically active peptides including adrenocorticotropic
hormone (ACTH) and β-endorphin. The processing of POMC is tissue-specific
and depends on the type of convertase enzyme expressed.
343
Other prominent examples of disordered extracellular
proteins are elastin and other components of elastic fibers,
344
small integrin-binding ligand N-linked glycoproteins
(SIBLINGs) (see section 9.3),
340−342,345
and mucins (see section 9.2).
236
Thus, although
secreted proteins are not particularly enriched for structural disorder
overall, some IDPs are essential for biomineralization, tissue organization,
and hormonal signaling. In line with the features of intracellular
IDPs, extracellular disordered proteins are heavily post-translationally
modified and involved in extensive interactions that organize large
molecular assemblies while binding multiple interaction partners.
117,341,342
9
Biophysical Properties
A large range of biophysical work has been carried
out on structural
disorder in proteins using a variety of experimental techniques (Box 2).
346
Previous sections
have touched on several aspects. Disordered regions rapidly shift
within a continuum of variably extended or globular conformations
and are best described as dynamic ensembles (see section 4). The amino acid sequence
of a disordered region
determines which conformations it can sample, depending for example
on the charge properties (see section 5.1).
Disordered proteins frequently fold upon binding, and their binding
thermodynamics allow for fast, transient, but highly specific interactions
(see sections 2, 3,
and 6). The following section discusses three
other physical properties that are essential for the biology of some
IDRs and IDPs: solubility, the ability to undergo phase transitions,
and the role in biomineralization.
9.1
Solubility
The
solubility of a protein
depends upon the favorability of its interactions with water. Globular
proteins bury hydrophobic amino acids within their solvent-excluded
cores, while their surfaces are generally enriched in polar and charged
amino acids that interact favorably with water, leading to aqueous
solubility.
347,348
The presence of hydrophobic
surface residues, for example, binding sites for other proteins, and
the denaturation of otherwise folded proteins lead to the exposure
of hydrophobic residues to water and reduce solubility, sometimes
leading to aggregation and precipitation. Disordered proteins do not
spontaneously fold into globular structures because their sequences
are depleted in hydrophobic amino acids that, in globular proteins,
drive folding (see section 5).
31,44
The accompanying enrichment in polar and charged amino acids, as
a general rule, causes disordered proteins to be soluble in aqueous
solutions. In addition, IDPs are generally resistant to heat-induced
aggregation and precipitation, because disordered proteins, in isolation,
lack extensive secondary and tertiary structure that in folded, globular
proteins is subject to thermal denaturation. Heat stability was observed
for some of the earliest examples of IDPs. For example, the highly
disordered cyclin-dependent kinase (Cdk) inhibitor p21 remains soluble
and structurally unaltered from 5 to 90 °C.
28
In fact, the related Cdk inhibitor p27 was purified by
boiling, although at that time it was not known to be a disordered
protein.
349
In that study, boiling was
used as a means to release p27 from its highly stable complexes with
Cdks and cyclins, which, because they are folded proteins, underwent
thermal denaturation and precipitated while heat-stable p27 remained
soluble. This heat-treated preparation of p27 was subsequently demonstrated
to potently inhibit Cdk2-cyclin A.
349
Sequence analysis algorithms have predicted a high prevalence of
IDRs and IDPs in sequenced genomes (see section 7.2).
35,271
To experimentally address the
issue of the disordered protein content of a proteome, Galea and co-workers
209
treated the soluble extract of mouse embryo
fibroblast cells with heat to precipitate folded proteins and then
used large-scale liquid chromatography and mass spectrometry methods
to identify ∼1300 proteins that remained soluble. Disorder
predictions showed that more than two-thirds of these thermostable
proteins are substantially disordered. This demonstrates that disordered
proteins, as a structural class, are more heat stable and soluble
than their folded counterparts, consistent with their sequence features
and the principles of amino acid solubility. However, disordered proteins
exhibit varying degrees of compaction, which is influenced by the
presence and patterning of charged residues within the polypeptide
chain (see section 5.1).
166−168,196
While the influence of compaction
on disordered protein solubility has not been addressed, it is reasonable
to expect that the extent of compaction will influence the exposure
of solubility-promoting amino acids for interactions with water and
therefore aqueous protein solubility.
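As a rough illustration of the sequence features discussed above, the following sketch computes the fraction of charged residues, the net charge per residue, and the mean hydropathy of a sequence. The function name, the use of the Kyte-Doolittle hydropathy scale, the treatment of histidine as neutral, and the toy sequences in the usage note are assumptions made for illustration; they are not the measures used in the cited studies, and these values alone do not predict solubility.

```python
# Kyte-Doolittle hydropathy scale (higher value = more hydrophobic).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def composition_summary(seq):
    """Summarize charge and hydropathy of a protein sequence.

    Assumption: only K/R count as positive and D/E as negative;
    histidine is treated as neutral for simplicity.
    """
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("empty sequence")
    positive = sum(seq.count(aa) for aa in "KR")
    negative = sum(seq.count(aa) for aa in "DE")
    return {
        "fraction_charged": (positive + negative) / n,
        "net_charge_per_residue": (positive - negative) / n,
        "mean_hydropathy": sum(KD[aa] for aa in seq) / n,
    }
```

For example, composition_summary("EEKKSSEEKK") describes a hypothetical charge-rich, hydrophilic segment, whereas composition_summary("LLVVIIAALL") describes a hypothetical hydrophobic one; comparing the two dictionaries shows the contrast in charge content and hydropathy that the text associates with disordered and buried globular sequence, respectively.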
It is possible that solubility
has influenced the evolution of
disordered protein sequences, with low abundance disordered proteins
involved in signaling and regulation being less dependent on high
solubility than other disordered proteins that are highly abundant
in certain cell types (e.g., titin in muscle cells). Several extracellular
IDPs use their solubility to great effect in the sequestration of
inorganic molecules in the extracellular environment (see section 9.3). Apart from
evolutionary considerations, there
are practical applications of the high solubility associated with
some disordered protein sequences. For example, proteins with higher
degrees of disorder have an increased success rate of expression in
a cell-free protein synthesis system.
350
Furthermore, Dunker and co-workers demonstrated that fusion of a
variety of disordered polypeptide tags containing repetitive, highly
negatively charged sequences (termed “entropic bristles”)
enhanced the aqueous solubility of many proteins previously shown
to be poorly soluble upon expression in E. coli.
351
Whether the solubilizing effect of
these disordered tags is simply due to an increase in the fraction
of solubility-promoting amino acids or to other effects, such as a
potential molecular chaperone function, has not been determined. Clearly,
however, disordered regions within multidomain proteins that also
contain folded domains are likely to influence overall protein solubility.
9.2
Phase Transition
The involvement
of IDRs in phase transitions provides another biophysical angle to
the characterization of proteins that harbor disordered regions.
99
Li and co-workers
137
observed that interactions between recombinant proteins that contain
multiple copies of an SH3 domain and IDRs with multiple instances
of the proline-rich SH3 interaction motif (see section 3.1) produced sharp liquid–liquid-demixing
(phase separations) that resulted in micrometer-sized liquid protein-based
droplets (Figure 17A). The concentrations needed
for the phase transition depend on the valency (i.e., number of repeating
units) of the interacting elements. Importantly, experiments with
the natural NCK–nephrin–N-WASP (neuronal Wiskott–Aldrich
syndrome protein) complex, which contains multiple copies of the same
SH3 interaction partners, showed the formation of similar dynamic
droplets, which lead to a significant increase in the activity of
the actin nucleation factor Arp2/3.
137
The
formation of the droplets is controlled by the degree of phosphorylation
of one of the interaction partners, which potentially explains how
the phase transitions may be regulated in the cell.
Figure 17
Involvement of IDRs
in phase transitions. (A) Interactions between
proteins that contain multiple copies of a specific domain (an SH3
domain in the figure) and IDRs with multiple instances of its interaction
motif (proline-rich SH3 motif here) can, at appropriate concentrations,
produce sharp liquid–liquid-demixing phase separations. This
phase transition is likely to increase local “active”
protein concentrations exploitable for signaling switches. (B) High
concentrations of low-complexity IDRs found in certain RNA binding
domains lead to a reversible phase transition with the formation of
highly dynamic hydrogels. These RNA granule-like assemblies consist
of heteromeric protein aggregates and allow localization and storage
of functionally related but nonidentical RNA molecules. Adapted from
ref (100). Copyright
2013 the Biochemical Society.
A related phenomenon occurs with RNA-binding proteins that
contain
IDRs of low sequence complexity. Such regions have been associated
with the regulated formation of cellular RNA granules.
352
Various types of RNA granules are used to modulate
the fate of specific mRNAs, but their assembly mechanism has remained
unclear. Kato and co-workers
353
reconstituted
granule-like RNA assemblies in vitro by exploiting low complexity
IDRs. They demonstrated that the low-complexity IDRs of certain RNA-binding
proteins were necessary for the formation of granule-like assemblies
and that high concentrations of these regions lead to a reversible
phase transition with a highly dynamic hydrogel state (Figure 17B). Interestingly,
hydrogels formed by the low-complexity
IDR of one purified member of the granules are capable of binding
IDRs of other members and thereby enable the assembly of heterogeneous
macromolecular structures.
353
Many IDRs
that can form such functional aggregates have been shown to be under
tight regulation to modulate their availability in the cell.
224
Regulation of IDR abundance can shift the equilibrium
between the monomeric and oligomeric/aggregate form, thereby preventing
formation of undesirable aggregates and keeping functional assemblies
under control.
224
Together, these findings
indicate that the biophysical properties of certain IDRs (such as
those that contain specific low-complexity regions or linear motifs)
enable phase transitions that are likely to be exploited in various
macromolecular assemblies and could function to bridge the length
scale of proteins with that of organelles.
354
Disorder-mediated phase transitions also occur extracellularly,
as exemplified by the mucin family of proteins. These proteins rely
on structural disorder for the formation of gel-like networks of mucus,
which function in the protection of epithelial surfaces such as those
in the airway and the gut.
355,356
Extensive glycosylation
of very large disordered regions that are rich in proline, threonine,
and serine residues contributes to the formation of these structures.
357
Mucin-1 can contain up to 120 such repeats,
depending on the genetic variant an individual carries.
358
Regulated order-to-disorder transitions of
Mucin-2 are important in the formation of colon mucus aggregates.
88,236,359
Mucin-2 trimers are compact
structures under the conditions of the secretory pathway, where the
pH is low and calcium is present, but these structures partially unfold
and greatly expand in more basic environments, such as in the colon,
triggering a phase transition into a mucus polymer gel.
88,236,359
9.3
Biomineralization
Most animals are
able to produce hard tissues for various physiological purposes by
mineralization of the extracellular matrix.
360,361
Bone and teeth, for example, consist of collagen and other proteins
in conjunction with inorganic calcium phosphate in the form of hydroxyapatite
(HA).
360,362
Proteins involved in hard tissue mineralization
are predicted to have very high levels of disorder,
340−342
and disordered proteins are important in mineral homeostasis in
general,
117
indicating an important role
for IDRs in these processes. For example, unfolded phosphoproteins
sequester calcium phosphate by forming stable complexes in which the
phosphorylated side-chains of the proteins occupy the phosphate positions
on the surfaces of calcium phosphate nanoclusters.
117
The disordered nature of these proteins allows them to
readily adjust their shapes to surround and solubilize clusters of
calcium phosphate. In this manner, proteins such as the milk caseins
achieve high concentrations of calcium and phosphate while preventing
the precipitation of the corresponding salts (i.e., calcification).
117
Caseins belong to the highly disordered secretory
calcium-binding phosphoprotein (SCPP) gene family,
341
which includes bone, tooth, milk, and salivary proteins.
363
Humans encode five small integrin-binding
ligand N-linked glycoproteins (SIBLINGs), which are a subset of SCPPs
involved specifically in regulating bone and teeth formation by bringing
together hydroxyapatite, cell-surface integrins, and collagens.
345,360
These are osteopontin (OPN, or bone sialoprotein 1), bone sialoprotein
2 (IBSP), dentin matrix acidic phosphoprotein 1 (DMP1), matrix extracellular
phosphoglycoprotein (MEPE), and dentin sialophosphoprotein (DSPP).
235
SIBLINGs are highly disordered
340−342,345
and undergo extensive phosphorylation
in the Golgi before they are secreted, as demonstrated in the case
of DSPP, which has approximately 200 phosphoserines.
235
DSPP has a particularly extreme serine and aspartic acid
content, and its maturation product dentin phosphoprotein (DPP, or
phosphophoryn) is likely to be one of the most acidic natural proteins
known.
10
Discussion
It is likely that many of the
functionally uncharacterized proteins
will be similar to already characterized ones.
8−10
This notion
forms the basis for computational methods that aim to improve annotation
coverage by predicting the function of novel and undefined proteins
based on information from better-studied proteins. Databases such
as Pfam
22
and SCOP
24
attest to the success of these approaches. However, existing methods
are focused primarily on sequences that give rise to well-folded protein
structures and domains. As a result, it is much harder to gain insight
into the function of intrinsically disordered regions (IDRs) and proteins
(IDPs), despite the increasing evidence of their prevalence and importance
for protein functionality (Figure 1).
50
Many important disease proteins such as p53,
Myc, α-synuclein, and BRCA1 are highly disordered, underscoring
the importance of disordered regions for understanding the molecular
basis of human diseases.
263,295,299
In this Review, we have assembled an overview of the major
approaches
used to classify and categorize IDRs and IDPs (Table 1). These classification schemes
help us understand how disordered
protein functionality is defined and could be used to enhance function
prediction for disordered protein regions in general. In these final
sections, we discuss the resources that are currently available for
gaining insight into IDR function (Table 2),
we address potential areas for improvement of the current approaches,
and we propose that combinations of multiple existing classification
schemes could achieve higher-quality function prediction for IDRs.
Finally, we suggest areas where increased efforts are likely to advance
our understanding of the functions of structural disorder in proteins.
10.1
Current Methods for Function Prediction of
IDRs and IDPs
Which methods and resources can a researcher
use to gain insight into the functions of the disordered regions in
a protein? Current approaches (Table 2) are mainly based on the presence of functional
features
such as short linear motifs (SLiMs), post-translational modification
(PTM) sites, molecular recognition features (MoRFs), and intrinsically
disordered domains (IDDs) (see section 3).
These aspects have the potential to shed light on which interaction
partners an IDR may have and how many, as well as the mode of binding.
10.1.1
Linear Motif-Based Approaches
Mapping of well-characterized
linear motifs onto other protein sequences
holds particular promise for discovering novel functionality. For
example, proteomic characterization of the motif (RxxPDG) that recruits
Tankyrase ADP-ribose polymerases has led to the identification of
novel Tankyrase substrates and explains the basis for mutations causing
cherubism disease.
364
Similarly, proteome-wide
searches for the SxIP motif have resulted in the identification of
previously uncharacterized microtubule plus-end tracking proteins.
365
However, these types of individual studies
require considerable resources.
MiniMotif
126
and ELM
125
are two major efforts
aimed at the annotation of known instances of linear motifs, which
are primarily found in IDRs, and their binding partners. The MiniMotif
and ELM databases aim to categorize linear motifs of all functions
based on in-depth manual annotation of experimentally validated instances
from the literature. Similar approaches have also been taken specifically
for PTM site motifs (see section 10.1.2). Although
these resources are excellent repositories of the functional sites
that occur in IDRs, they do have certain shortcomings. For example,
the annotations from MiniMotif are not publicly available. Although
the ELM database is the most comprehensive database of functional
features within IDRs, at present it does not have the resources to
annotate all motifs in the literature; ELM contains ∼200 classes
of linear motifs with over 2400 instances, but more than 250 classes
await annotation with this number constantly increasing.
125
As a result, ELM is limited to annotating
(a fraction of) the shorter motif classes and does not explicitly
consider the longer binding modules in disordered regions.
Complementary
to the annotation efforts, the linear motif resources
employ prediction methods that map functionality onto regions of proteins
with unknown function (i.e., unannotated regions). For example, MiniMotif
and ELM use regular expressions derived from experimentally validated
and curated motif instances to search protein sequences. These searches
bring up functional descriptions of sequence instances that match
the regular expressions. A major problem in the computational detection
of short motifs in particular is the high false positive rate, which
means that it is very difficult for users to identify the instances
that are most likely to be functional from the large total of mostly
nonfunctional motif instances that result from these searches. To
overcome this issue, both databases have developed additional methods
to improve prediction accuracy that rely on the use of additional
context information, such as accessibility (using structural models
366
and predictions of intrinsic disorder
72
), evolutionary conservation,
367,368
cell compartment (based on annotation),
126,369
and protein–protein interactions.
128,370,371
These efforts will need to be
combined in the future with a clearer user interface so researchers
can more easily identify the most relevant instances.
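The regular-expression search and the use of context information described above can be sketched as follows. The patterns are the simplified motif descriptions quoted in this Review (RxxPDG, SxIP, and PxxPx[KR]), not the curated expressions stored in ELM or MiniMotif. The function names, the toy sequence, the per-residue disorder scores, and the 0.5 cutoff are all hypothetical.

```python
import re

# Simplified motif patterns as written in the text; lowercase "x" = any residue.
PATTERNS = ["RxxPDG", "SxIP", "PxxPx[KR]"]


def to_regex(pattern):
    """Convert a motif pattern to a regex; the lookahead allows overlapping hits."""
    return re.compile("(?=(" + pattern.replace("x", ".") + "))")


def scan(seq, patterns=PATTERNS):
    """Return (pattern, 1-based start, matched peptide) for every hit."""
    hits = []
    for pattern in patterns:
        for m in to_regex(pattern).finditer(seq):
            hits.append((pattern, m.start(1) + 1, m.group(1)))
    return sorted(hits, key=lambda h: h[1])


def filter_by_disorder(hits, disorder, cutoff=0.5):
    """Keep hits whose mean per-residue disorder score reaches the cutoff.

    `disorder` is a hypothetical list of scores, one per residue.
    """
    kept = []
    for pattern, start, peptide in hits:
        window = disorder[start - 1:start - 1 + len(peptide)]
        if sum(window) / len(window) >= cutoff:
            kept.append((pattern, start, peptide))
    return kept
```

Calling scan("MARSEPDGAAASKIPTTPLLPAKQ") on this made-up sequence reports one match per pattern. Passing the hits to filter_by_disorder with low scores over the first ten residues discards the match in that region, which mirrors how accessibility information can reduce the number of likely nonfunctional instances.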
De novo
predictors make up the final category of motif resources.
These predictors computationally identify putative uncharacterized
motifs in protein sequences. There are two broad types: predictors
that identify clusters of amino acids that are more conserved than
surrounding residues (e.g., SLiMPrints
372
and phylo-HMM
373
) or those that find
short peptide patterns that are over-represented in a set of sequences
(e.g., DiliMot
374
and SLiMFinder
375
). Although both approaches have been combined
with the gene ontology terms of the identified proteins, further development
is required to define potential functionality.
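The over-representation idea behind the second type of predictor can be illustrated with a deliberately minimal sketch. It counts exact, fixed-length peptides and compares how often they occur in a set of sequences of interest versus a background set. This is not the algorithm of DiliMot or SLiMFinder, which are considerably more sophisticated; the function names, the pseudocount, the support threshold, and the toy sequences are assumptions.

```python
from collections import Counter


def kmer_support(seqs, k):
    """Count, for each k-mer, the number of sequences that contain it."""
    support = Counter()
    for s in seqs:
        # A set ensures each k-mer is counted once per sequence.
        support.update({s[i:i + k] for i in range(len(s) - k + 1)})
    return support


def enriched_kmers(foreground, background, k=4, min_support=2):
    """Rank k-mers by the ratio of foreground to background frequency.

    A pseudocount of one sequence is added to the background so that
    k-mers absent from it do not cause a division by zero.
    """
    fg = kmer_support(foreground, k)
    bg = kmer_support(background, k)
    results = []
    for kmer, n_fg in fg.items():
        if n_fg < min_support:
            continue
        f_fg = n_fg / len(foreground)
        f_bg = (bg[kmer] + 1) / (len(background) + 1)
        results.append((kmer, n_fg, f_fg / f_bg))
    return sorted(results, key=lambda r: (-r[2], r[0]))
```

With the hypothetical inputs foreground = ["AASKIPGG", "TTSKIPLL", "MMSKIPQQ"] and background = ["AAGGTTLL", "MMQQAAGG", "TTLLMMQQ"], the only peptide shared by at least two foreground sequences is SKIP, which is therefore the single candidate returned.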
10.1.2
PTM Site-Based Approaches
In terms
of PTM sites within disordered regions, resources such as Phospho.ELM,
268
PhosphoSite,
376
and
PHOSIDA
377
curate experimentally verified
phosphorylation sites and sometimes other types of modifications from
the literature and genome-scale studies. Integration of such information
with data on SNPs that are seen in natural populations or in cancer
genomes can provide important insights into the functionality of a
PTM site.
378,379
Important progress has been
made in identifying and cataloging peptide motifs that direct post-translational
modifications. ScanSite primarily identifies linear motifs that are
likely to be phosphorylated and play key roles in signaling, such
as the SH2 and 14-3-3 motifs.
380
Annotation of these sequence motifs is based on results from binding
experiments with peptide libraries and phage display experiments.
380
NetPhorest contains consensus sequence motifs
of 179 kinases and 104 phosphorylation-dependent binding domains.
381
In addition, approaches such as NetworKIN
370
systematically integrate experimentally derived
PTM sites with evolutionary information, and define motifs around
the PTM sites that may be recognized by the kinase. In this manner,
site-specific interactions between 123 kinases and specific PTM sites
(often in disordered regions) in 5515 phosphoproteins are predicted.
382
Another resource, PhosphoNET, provides predictions
of potential kinases for over 650 000 putative phosphosites.
383
Extending these approaches to other post-translational
modifications is an area of intense research, and a number of such
PTM site prediction programs currently exist,
384
although linking the PTM sites to the modifying enzymes
remains to be addressed for the other types of modifications.
10.1.3
Molecular Recognition Feature-Based Approaches
Two
important methods exist for identifying novel binding modules
in IDRs based on the concept of molecular recognition features (MoRFs).
MoRFpred predicts all types of MoRFs (α, β, coil, and complex), that is,
sequences that undergo disorder-to-order transitions, using
a combination of sequence alignment and machine learning predictions
based on amino acid properties, predicted disorder, B-factors, and solvent accessibility.
385
ANCHOR also predicts parts of disordered regions that are likely
to fold upon binding with their interactors, but does so by identifying
segments that cannot form enough favorable intrachain interactions
to fold on their own and are likely to gain stabilizing energy by
interacting with a globular partner protein.
386,387
An important shortcoming of the MoRF predictions is the difficulty
in identifying which of the binding sites are relevant and what their
functionality might be. This is primarily because the results are
not linked to known MoRF instances with annotated functions, as is
the case for linear motifs, and no clues are provided regarding the
potential role of a binding site or its interacting partners. The
IDEAL database
388
collects verified elements
in disordered regions that undergo coupled folding and binding upon
interaction (Box 1). The careful annotation
of well-described MoRFs in terms of their sequence propensities or
interaction interfaces as well as their known binding partners, and
integration of these annotations with MoRF predictions, would likely
improve the use of these predictions for gaining insight into IDR
functionality.
10.1.4
Intrinsically Disordered
Domain-Based Approaches
Few attempts have been made to systematically
annotate protein
domains that are largely made up of intrinsic disorder. Pfam
22
models are able to predict several intrinsically
disordered domains (e.g., KID, WH2, RPEL, and BH3 domains). However,
this seems to be a simple consequence of the fact that these disordered
domains can be described and detected by sequence profiles, rather
than an effort directed at annotating long IDRs. ELM
125
has also annotated a small number of long disordered domains,
such as the WH2 motif; however, the main focus of the database remains
on short motifs. Finally, some of the IDRs that are present in annotated
domains are in fact MoRFs or linear motifs, and linear motifs also
frequently fold upon binding like MoRFs, underscoring the underlying
connections between linear motifs, MoRFs, and IDDs as functional elements
(see section 3.4).
10.1.5
Other
Approaches
Only a few IDR
classifications that are not based on linear motifs, MoRFs, or IDDs
have so far been exploited for function prediction. FFPred is a correlation-based
approach that uses the length and position of IDRs along a sequence
(see sections 5.5 and 5.6), among other general protein features, to predict the function
of the protein in terms of gene ontology categories (molecular activities
and biological processes).
211,389−391
The DisProt database of protein disorder
203
(Box 1) lists functions of individual disordered
regions, when known from experiments, the major limitation here being
the small number of regions for which exact function has been characterized.
The Database of Disordered Protein Prediction (D2P2)
49
(Box 1) stores predictions of IDRs in whole genomes, which together with
information on MoRFs, PTM sites, and domains can be used to obtain
insight into the possible function of the IDR and the protein containing
it.
10.2
Requirement for Annotation
Future
effort in the classification of IDRs and IDPs must be directed at
annotation. Substantiating classes with more examples will lead to
refinement of their function descriptions and will likely reveal inaccuracies
in existing classification schemes. For example, there are only a
limited number of well-characterized examples of proteins that contain
the evolutionarily flexible (e.g., RPA70 and Sky1) or constrained
types of disorder (Rpl5 and Hsp90). The same is true for the different
classes of dynamic disorder in protein complexes, although efforts
are ongoing there.
176
In terms of the functional
features of IDRs, there is a need for annotating MoRFs and longer
disordered binding regions as described in the previous section. Efforts
directed at short linear motifs have been very successful, but only
a small fraction of the potentially thousands of motifs
392
have been annotated. Pfam contains almost 15 000
curated protein families,
22
while ELM contains
fewer than 200 motif classes,
125
suggesting
that significant numbers of functional features are still to be identified
and further annotation is required. High-quality resources that collect
all of the experimentally validated functional regions of intrinsically
disordered regions will provide a strong basis to map functional features
onto novel proteins of unknown function.
10.3
Integration
of Methods for Finding IDR and
IDP Function
The current methods for finding and classifying
IDR and IDP function have been successful in their respective areas of focus.
However, not all functional characteristics of disordered regions
have been fully exploited, and neither is there a resource that brings
all of these aspects together. The combination of multiple categorizations
and features of IDRs is likely to provide a better understanding of
the functionalities encoded in these regions.
A comprehensive
IDR function resource should have several aspects. It starts with
a reliable consensus disorder prediction for the protein sequence
of interest (Box 3), such as that available in the D2P2 database (Box 1).
49
Functional features,
such as SLiMs (see section 3.1), MoRFs (see
section 3.2), and disordered domains (see section 3.3), can then be mapped on every
disordered part
of the protein. The disorder profile allows for the identification
of individual IDRs in the protein, as well as the calculation of disorder
properties of the whole protein, such as which disorder predictors
support which IDRs (see section 5.2), the overall
degree of disorder (see section 5.4), the length
of the individual disordered regions (see section 5.5), or the amount of disorder
at the termini (see section 5.6). These can be used to assign general function
to the proteins, such as gene ontology terms that correlate with these
properties. Patterns in amino acid sequence could reveal additional
function. For example, the presence of tandem repeats or enrichment
in certain amino acids (see sections 5.7 and 7.3) may point toward involvement in
certain processes.
The overall sequence composition and the distribution of charges (see
section 5.1) could indicate the solubility
of a polypeptide chain (see section 9.1) and
conformational properties such as the degree of compaction (see section 4). The combination
of sequence complexity and disorder
propensity could suggest function as well (see section 5.3).
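The disorder properties listed above (individual IDRs, overall degree of disorder, IDR lengths, and disorder at the termini) can be derived from a per-residue disorder profile, as in the sketch below. The score scale, the 0.5 cutoff, the minimum IDR length, and the size of the terminal window are hypothetical parameters chosen for illustration, and the functions are not part of any existing resource.

```python
def find_idrs(scores, cutoff=0.5, min_len=3):
    """Return 1-based inclusive (start, end) runs of residues at or above cutoff."""
    idrs, start = [], None
    # A sentinel below any cutoff closes a run that reaches the C-terminus.
    for i, s in enumerate(list(scores) + [float("-inf")]):
        if s >= cutoff:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_len:
                idrs.append((start + 1, i))
            start = None
    return idrs


def disorder_summary(scores, cutoff=0.5, min_len=3, term=5):
    """Summarize a disorder profile for one protein."""
    if not scores:
        raise ValueError("empty disorder profile")
    flags = [s >= cutoff for s in scores]
    idrs = find_idrs(scores, cutoff, min_len)
    n_term, c_term = flags[:term], flags[-term:]
    return {
        "fraction_disordered": sum(flags) / len(flags),
        "idrs": idrs,
        "idr_lengths": [end - start + 1 for start, end in idrs],
        "n_term_fraction": sum(n_term) / len(n_term),
        "c_term_fraction": sum(c_term) / len(c_term),
    }
```

Applied to a made-up 20-residue profile such as [0.9]*4 + [0.1]*6 + [0.8]*2 + [0.2]*3 + [0.7]*5, the summary reports two IDRs (the two-residue run is below the minimum length) together with the overall and terminal disorder fractions that could then be correlated with function terms.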
Integration of other types of information
will determine what classifications
can additionally be used. Addition of domain information, such as
Pfam, can provide insight into the role of disordered segments that
are commonly associated with specific structured domains (see section 3.3). Protein–protein
interactions and structures
of protein complexes could indicate interacting partners of IDR binding
elements and the mode of interaction (see section 6). Information about sequence conservation
(see section 7.1) is another important aspect and could provide
clues about evolutionarily constrained or flexible types of disorder,
which are implicated in different types of functions. Knowledge of
the origin of a disordered region in evolution or the species containing
the protein sequence of interest suggests possible functions as well
(see section 7.2). Furthermore, data describing
regulatory properties such as gene expression levels (see section 8.1), alternative
splicing (see section 8.2), and degradation kinetics (see section 8.3) could implicate
IDRs in regulating protein availability
and may support or argue against roles as interaction hubs, for example.
Finally, biophysical properties of the protein, such as the potential
of multivalent elements to undergo phase transitions (see section 9.2) and occurrence
inside or outside the cell (see
sections 8.4 and 9.3), may suggest involvement in the spatiotemporal organization
of
(extra)cellular assemblies.
The hypothetical resource might
be able to suggest function for
some of the following examples, although it is clear that in other
cases the biology will be too complicated and the outlook of function
prediction as described here will be unrealistic. Therefore, the following
examples should at this point be considered as speculative. A long
(more than 30 residues) IDR that shows signs of evolutionarily flexible
disorder and contains no short motifs or other predicted binding regions
could be a flexible linker between domains or an entropic chain. A
region containing a PxxPx[KR] motif flanked by evolutionarily flexible
disorder that is likely to retain an open conformation in the unbound
form (based on the primary structure) probably binds a class II SH3
domain, and might be involved in transcription processes if the IDR
constitutes the C-terminus of a protein with an otherwise small degree
of disorder. Long IDRs that are encoded by alternatively spliced exons
and have several nonoverlapping functional motifs and MoRFs might
be part of signaling hubs or assemble multiprotein complexes, the
type of which might be inferred from the combination of binding sites
present. A constitutively expressed, largely disordered IDP with an
amino acid composition promoting intrinsic coil conformations and
conservation of both primary and disorder sequence is likely to be
a ribosomal protein or part of another rigid multisubunit complex.
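To make the logic of these speculative examples concrete, the first three can be written as simple rules over a feature record. This is only a sketch of how such a resource might combine classifications: the field names, the labels "flexible" and "constrained", and the threshold of three binding sites standing in for "several" are assumptions, and the rules inherit all of the caveats stated above.

```python
from dataclasses import dataclass, field


@dataclass
class IDRFeatures:
    """Hypothetical feature record for one disordered region."""
    length: int
    conservation: str                      # assumed labels: "flexible" or "constrained"
    motifs: list = field(default_factory=list)
    n_morfs: int = 0
    alternatively_spliced: bool = False
    open_conformation: bool = False        # predicted to stay extended when unbound
    c_terminal: bool = False
    protein_low_disorder: bool = False     # rest of the protein is mostly ordered


def suggest_roles(f):
    """Return speculative role suggestions following the examples in the text."""
    roles = []
    n_sites = len(f.motifs) + f.n_morfs
    # Long, evolutionarily flexible, and without predicted binding regions.
    if f.length > 30 and f.conservation == "flexible" and n_sites == 0:
        roles.append("flexible linker or entropic chain")
    # PxxPx[KR] motif in flexible disorder that stays open when unbound.
    if ("PxxPx[KR]" in f.motifs and f.conservation == "flexible"
            and f.open_conformation):
        roles.append("class II SH3 domain binding")
        if f.c_terminal and f.protein_low_disorder:
            roles.append("possible involvement in transcription")
    # Long, alternatively spliced, with several binding sites (assumed: >= 3).
    if f.length > 30 and f.alternatively_spliced and n_sites >= 3:
        roles.append("signaling hub or multiprotein complex assembly")
    return roles
```

For instance, suggest_roles(IDRFeatures(length=45, conservation="flexible")) returns only the linker suggestion, whereas a region that matches none of the rules yields an empty list, reflecting cases in which no function can be proposed.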
It is clear that some classifications will provide more useful
and direct information about function than others. Some classifications
have been proposed to contrast IDPs with structured proteins, which
does not necessarily make them useful for a detailed description of
disorder function per se. Others have limited use for prediction because
they are conceptual only, or because of overlap in the properties
they describe with other schemes. Moreover, not all approaches can
realistically be incorporated in a tool. Binding functionality and
sequence-based predictions will generally be possible, but predictions
based on other types of data may be harder. For example, assignment
of evolutionarily constrained or flexible disorder requires automatic
alignment of amino acid and disorder sequences, while gene expression
subtypes can be derived from the wealth of microarray and RNA sequencing
data. Various types of information are already brought together in
the D2P2 database,
49
which contains information on disordered regions, MoRFs, PTM sites,
and structured domains, and in ELM,
125
which
shows information on linear motifs, disorder, phosphorylation, domains,
protein–protein interactions, and secondary structure. Further
extension of resources like these, with information on both structured
and disordered regions, holds great promise for creating a comprehensive
overview of the functional elements and properties of a protein.
10.4
Future Directions
A major area of
improvement in the description of disordered protein regions pertains
to their dynamic behavior.
172,178
IDRs fluctuate rapidly
over an ensemble of heterogeneous conformations (see section 4.2), the relative free
energies and propensities
of which are determined by the amino acid sequence (see section 5.1). The relationship
between sequence and structural
ensemble is important because it describes what part of the time the
chain is in a compact state, and what part of the time it is more
accessible. Knowledge about these structural subtypes and about how
sequence contexts and chemical modifications of the chain (e.g., by
PTMs) modulate the structural ensemble is vital for the correct description
of IDR behavior and has direct implications for the functional roles
such regions can have in the cell.
157
Classical methods are not optimally designed to take structural dynamics
into account. For example, current disorder prediction technology
is successful at distinguishing sequence stretches that are likely
to be disordered versus those that are likely to be part of autonomously
folded domains, resulting in a binary verdict (disordered versus structured)
within a certain confidence limit (Box 3). Although predicted disordered regions correlate
well with experimentally
determined backbone dynamics,
393
detailed
prediction of conformational subtypes requires a more sophisticated
description of disorder. A recent method for the prediction of protein
backbone dynamics, trained on order parameters estimated from
experimental chemical shifts, is capable of distinguishing
different structural organizations with varying degrees of flexibility,
such as folded domains, disordered linkers, molten globules, and MoRFs.
In addition, the regions that it predicts to be dynamic correspond well
with conventional predictions of IDRs.
394
Furthermore, high-throughput atomistic simulations of sequence ensembles
can provide information about the degree of conformational heterogeneity,
395
which can be quantified by various parameters,
such as an information theory measure
396
or an order parameter-like measure.
397
One could imagine a multiple-component scheme describing structural
and dynamic characteristics that would assign, for example, residues
in a random coil small values for the fractional population of secondary
structure, a large value for spatial fluctuations, a fast interconversion
rate, and large values for structural heterogeneity. Conversely, molten
globule residues would be assigned a relatively large value for the
fractional population of secondary structure, a smaller value for
spatial fluctuations and structural heterogeneity, and a slower interconversion
rate. Progress in the objective description of conformational ensembles
will likely require development of novel structural classifications.
Such efforts will be greatly encouraged by the new pE-DB database
of structural ensembles (Box 1).
398
There is considerable room for growth at the
interface between atomistic simulations, physical theories, machine
learning methods, and experiments, to enable the unmasking of the
connection between disorder dynamics and molecular and system level
functions of IDRs and IDPs.
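The multiple-component scheme imagined above could be sketched as a small per-residue record. The following is only an illustration under stated assumptions: the field names, units, and numerical values are hypothetical, and the normalized Shannon entropy over conformational cluster populations is a simple stand-in for a heterogeneity measure, not the specific measures cited above.

```python
from dataclasses import dataclass
from math import log


@dataclass(frozen=True)
class ResidueDynamics:
    """Hypothetical per-residue descriptor; all fields are illustrative."""
    secondary_structure_fraction: float  # fractional population, 0..1
    spatial_fluctuation: float           # arbitrary units
    interconversion_rate: float          # arbitrary units (e.g., per second)
    heterogeneity: float                 # 0 (single state) .. 1 (uniform)


def normalized_entropy(populations):
    """Shannon entropy of cluster populations, scaled to the range 0..1.

    Stand-in heterogeneity measure (an assumption, not the cited method).
    """
    if len(populations) < 2:
        return 0.0
    total = float(sum(populations))
    probs = [p / total for p in populations if p > 0]
    entropy = -sum(p * log(p) for p in probs)
    return entropy / log(len(populations))


# Illustrative values only: a random-coil residue versus a molten-globule
# residue, following the qualitative trends described in the text.
random_coil = ResidueDynamics(
    secondary_structure_fraction=0.05,
    spatial_fluctuation=8.0,
    interconversion_rate=1e9,
    heterogeneity=normalized_entropy([25, 25, 25, 25]),
)
molten_globule = ResidueDynamics(
    secondary_structure_fraction=0.60,
    spatial_fluctuation=3.0,
    interconversion_rate=1e6,
    heterogeneity=normalized_entropy([85, 5, 5, 5]),
)
```

A record of this kind keeps the four properties separate, so that a residue can be compared along each axis rather than being reduced to a single ordered/disordered label.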
Full understanding of the cellular
functions of IDPs will also
require knowledge of their abundance, their interactions, and their
physical state in the physiological context. Are IDPs always bound
to target proteins, are they chaperoned, or are there pools of unbound
IDPs? Answers to these questions will vary among different IDPs and
will depend on the exact context in the cell. However, the discovery
of features that can help classify and categorize IDRs in terms of
their cellular status will lead to more insights into their function.
For example, entropic chains may mostly be disordered even in the
cell, whereas effectors and assemblers may mostly be associated with
other proteins in folded conformations and exchange binding partners
by competition rather than by dissociation to the free, disordered
state. Scavengers likely populate both disordered and ordered states,
depending on whether or not their ligand is bound. Thus, investigations
of the in-cell status of IDPs
399
will be
crucial for understanding their biological roles.
11
Conclusion
Finally,
we would like to stress that it is not all about intrinsic
disorder. This Review has focused on classifications for intrinsically
disordered regions and proteins, because function annotation for these
regions is lagging behind annotation of structured regions. However,
proteins are modular, and their functional regions can be structured
or disordered, or somewhere in between. The synergy between these
fundamental building blocks of proteins leads to combinatorial diversity
of function. Therefore, understanding how structure and disorder work
together will be crucial for uncovering the full extent of protein
function.
Box 1
Databases of Intrinsically Disordered Regions and Proteins
Several resources exist that collect experimental or computational
information on disordered regions in proteins. The Database of Protein
Disorder (DisProt, http://www.disprot.org/) was developed
to facilitate research on protein disorder by organizing the rapidly
increasing knowledge about the experimental characterization and the
functionalities of IDRs and IDPs.
203,400
The database
includes the location of the experimentally determined disordered
region(s) in a protein and the methods used for disorder characterization.
Additionally, where known, entries list the biological function of
an IDR and how it performs this function. As of the latest release
(6.02, May 24, 2013), DisProt contained 694 IDP entries and 1 539
IDRs.
The IDEAL database (http://www.ideal.force.cs.is.nagoya-u.ac.jp/IDEAL/) also collects
annotations of experimentally verified IDPs.
388
This database focuses on regions that undergo
coupled folding and binding upon interaction with other proteins (regions
for which there is evidence for both a disordered isolated state and
an ordered bound state), such as MoRFs and certain linear motifs (see
section 3). It also suggests putative sequences
for which there is only evidence of an ordered bound state, but that
are thought to undergo induced folding based on, for example, the
presence of a verified folding-upon-binding element in a homologue.
The latest version (30 August 2013) contained 340 proteins with annotated
IDRs of which 148 contain verified or putative elements that undergo
folding upon binding.
MobiDB (http://mobidb.bio.unipd.it/) collects experimental
data on IDRs from DisProt,
203
IDEAL,
388
and the Protein Data Bank
147
(missing residues in crystal structures and structurally
mobile regions in NMR ensembles).
401
It
also stores disorder prediction data from three methods. All of the
disorder information is summarized in a weighted consensus. The
latest version (1.2.1, August 28, 2012) contained 26 933 proteins
for which there is experimental data on the presence or absence of
disorder and disorder predictions for 4 662 776 proteins
from 297 proteomes.
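A weighted consensus of this kind can be illustrated with a short sketch. The weighting scheme actually used by MobiDB is not described here, so the source names, weights, and threshold below are hypothetical; the code only shows the general idea of combining per-residue disorder calls from several sources.

```python
def weighted_consensus(predictions, weights, threshold=0.5):
    """Combine per-residue binary disorder calls into a weighted consensus.

    predictions: dict mapping source name -> list of 0/1 calls per residue
    weights:     dict mapping source name -> non-negative weight
    Returns (scores, calls), where scores are weighted fractions in 0..1
    and calls are True where the score reaches the threshold.
    All names and weights are illustrative assumptions.
    """
    names = list(predictions)
    length = len(predictions[names[0]])
    if any(len(predictions[name]) != length for name in names):
        raise ValueError("all sources must cover the same residues")
    total_weight = sum(weights[name] for name in names)
    scores = [
        sum(weights[name] * predictions[name][i] for name in names) / total_weight
        for i in range(length)
    ]
    calls = [score >= threshold for score in scores]
    return scores, calls


# Hypothetical example: experimental evidence is weighted twice as
# heavily as each of two predictors.
example_predictions = {
    "experimental": [1, 1, 0, 0],
    "predictor_a": [1, 0, 0, 1],
    "predictor_b": [1, 0, 1, 1],
}
example_weights = {"experimental": 2.0, "predictor_a": 1.0, "predictor_b": 1.0}
example_scores, example_calls = weighted_consensus(example_predictions, example_weights)
```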
pE-DB (http://pedb.vib.be/)
is the first database for
the deposition of structural ensembles (see section 4.2) of intrinsically disordered
proteins.
398
Entries contain the primary experimental data (mainly NMR
and SAXS, Box 2), the algorithms used in their calculation, and the coordinates
of the structural ensembles, which are provided as a set of models
in Protein Data Bank
147
format. Development
of pE-DB is intended to support the evolution of new methodologies
for the structural descriptions of the disordered state. pE-DB stored
45 ensembles in 10 entries as of 17 January 2014.
Finally,
the Database of Disordered Protein Prediction (D2P2, http://d2p2.pro/) stores disorder
predictions
(Box 3) made by nine different predictors
for proteins from completely sequenced genomes.
49
Alongside the disorder predictions, it contains information
on MoRFs (ANCHOR
386
), PTM sites (PhosphoSitePlus
402
), and domains (SCOP
24
and Pfam
22
). As of January 2014, D2P2 contained disorder predictions for 10 429 761
sequences in 1 765 genomes from 1 256 distinct species.
Box 2
Experimental Characterization of Intrinsically Disordered Regions
and Proteins
IDPs and IDRs have been studied using a variety
of experimental
techniques, including NMR, SAXS, and smFRET. Nuclear magnetic resonance
(NMR) spectroscopy is the key method to characterize protein disorder,
due to its ability to provide residue-level information on protein
structure and dynamics in solution.
403
Many
aspects of structural disorder can be detected directly using NMR,
including local disorder, folding upon binding, and disorder in complex.
In contrast to NMR methods, detection of disorder using X-ray crystallography
techniques is mainly indirect as it relies on missing electron density.
32
Another powerful method for detecting and characterizing
IDPs is small-angle X-ray scattering (SAXS), which assesses protein
dimensions and shape by measuring the scattered X-ray intensity caused
by a sample. SAXS can be used to determine hydrodynamic parameters
and the degree of globularity of a protein, which are good indicators
of whether a protein is compact or unfolded.
183,404
Single-molecule methods are also emerging for the study of structural
disorder.
179−182
These techniques minimize averaging over the heterogeneous ensembles
of conformations in which disordered proteins naturally exist and
thus are able to measure dynamics of individual molecules. For example,
single-molecule fluorescence resonance energy transfer (smFRET) can
measure dynamics and individual conformations of the unbound ensemble,
intermediates during induced folding, and internal friction in the
folding process.
180−182
Atomic force microscopy (AFM) is also useful
for the characterization of the conformational heterogeneity of single
proteins.
182
High-throughput proteomic
approaches are mainly used to identify IDPs. These techniques enrich
cellular extracts for disordered proteins, and then separate structured
from disordered proteins, followed by identification (e.g., by mass
spectrometry). For example, heat treatment enriches cell extracts
for IDPs and depletes them of proteins containing folded domains (see
section 9.1).
209
IDPs can also be identified on the basis of their susceptibility
to degradation by the 20S proteasome under conditions in which structured
proteins are resistant (see section 8.3).
332
These degradation assays can also be used to identify
binding partners of IDPs that provide protection against degradation.
Finally, computational techniques such as molecular dynamics (MD)
simulations complement experimental approaches and provide important
insights into IDP behavior.
196,405
The DisProt, IDEAL,
MobiDB, and pE-DB databases collect experimentally verified disordered
regions and proteins (Box 1).
Box 3
Prediction
of Intrinsically Disordered Regions and Proteins
Predicting
disordered regions from amino acid sequence allows the
analysis of disordered proteins at a genome-wide scale and provides
initial hypotheses about the presence of structural disorder in individual
proteins.
38,406
A large number of prediction
methods have been developed and are regularly benchmarked as part
of the Critical Assessment of Techniques for Protein Structure Prediction
(CASP).
407,408
Excellent overviews of disorder prediction
methods are given elsewhere,
406,409,410
and nonexhaustive lists of publicly available prediction software
and webservers can be found at http://en.wikipedia.org/wiki/List_of_disorder_prediction_software
and http://www.disprot.org/predictors.php.
Three
general prediction strategies currently exist:
•
Disorder prediction based directly
on sequence properties. For instance, IUPred is a physicochemical
sequence-based method that estimates residue interaction energies.
411
Sequences with lower predicted pairwise interaction
energies are considered more likely to be disordered due to a lack
of stabilizing contacts. Similarly, FoldIndex considers weakly hydrophobic
regions of high net charge. Such regions are likely to be disordered
due to their low energy benefit when adopting a compact conformation.
31,412
•
Machine learning
is used in the
majority of predictors, for example, by using unresolved residues
in X-ray structures as a training set.
410
For instance, DISOPRED2 uses linear support vector machines (SVMs)
trained on PSI-BLAST sequence profiles surrounding unresolved residues.
35
Similarly, PONDR XL1 employs a feed-forward
neural network trained on sequence attributes found associated with
unresolved residues.
271
•
Meta-predictors that combine
several individually successful disorder prediction methods have been
developed more recently, resulting in increases in prediction accuracy.
407
For instance, metaPrDOS
413
and MFDp
414
both apply SVM-based
machine learning to the results of a number of individual prediction
methods to arrive at a final score. Similarly, the MobiDB
401
and D2P2 databases
49
(Box 1) provide a consensus
overview of several independent prediction methods.
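The first strategy, prediction directly from sequence properties, can be illustrated with a minimal charge-hydropathy score in the spirit of FoldIndex. This is a sketch under assumptions that are not stated in this Review and should be checked against the original publications: the Kyte-Doolittle hydropathy scale rescaled to 0..1, a net charge counted from Lys/Arg and Asp/Glu only, and the boundary coefficients 2.785 and 1.151. The function names and the default window length are hypothetical.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
# Simplified charges (assumption): His is treated as neutral.
CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}


def fold_score(segment):
    """Charge-hydropathy score; negative values suggest disorder.

    Raises KeyError for residues outside the 20 standard amino acids.
    """
    seg = segment.upper()
    if not seg:
        raise ValueError("segment must not be empty")
    # Mean hydropathy, rescaled from the range -4.5..4.5 to 0..1.
    mean_hydropathy = sum((KYTE_DOOLITTLE[aa] + 4.5) / 9.0 for aa in seg) / len(seg)
    # Absolute mean net charge per residue.
    mean_net_charge = abs(sum(CHARGE.get(aa, 0) for aa in seg)) / len(seg)
    return 2.785 * mean_hydropathy - mean_net_charge - 1.151


def windowed_scores(sequence, window=11):
    """Score every full-length window; returns [] if the sequence is shorter."""
    return [
        fold_score(sequence[i:i + window])
        for i in range(len(sequence) - window + 1)
    ]
```

Weakly hydrophobic, highly charged stretches receive negative scores, which matches the rationale given above that such regions gain little energy from adopting a compact conformation.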
Curated databases containing experimentally determined
disordered
regions, such as DisProt
203
and IDEAL
388
(Box 1), provide a
gold standard for assessing disorder prediction methods. Overall,
the quality of the predictions appears to have reached a reasonable
plateau of accuracy, with modest recent progress.
407,408
Additional data on biologically relevant long disordered regions
may lead to future improvements in predicting IDRs and IDPs.
408
Box 4
Evolution of Intrinsically
Disordered Regions and Proteins
IDRs generally evolve faster
than their structured counterparts.
51−56,107
However, comparison of the rates
of evolution of structured and disordered regions in 26 protein families
has shown that this is not always the case.
51
To get more insight into the evolution of disordered regions, we
predicted disorder in the human proteome using MULTICOM-REFINE.
415
We integrated the disorder status of the protein
residues with their evolutionary rates across multiple sequence alignments
of homologous proteins from 53 (mostly vertebrate) species in Ensembl
Compara,
1
calculated using the Rate4Site
program.
416
As observed previously,
417
protein residues that are predicted to be disordered
generally evolve more quickly (i.e., have much higher evolutionary
rates) than those in structured regions (Figure Box 4, P value < 10⁻¹⁵, Mann−Whitney
U test). However, the distributions of evolutionary rates
for disordered and structured residues are wide and overlap, which
confirms that some disordered residues are conserved. In line with
this, it has been shown that particular residue types, such as Leu,
Tyr, Trp, and Pro, are more conserved in IDRs than other residue types.
53
Conserved residues and elements in IDRs are
potentially important for function and might be part of protein−protein
interaction interfaces or peptide motifs (see section 7.1). However, sometimes, rapid
divergence of disordered regions
indicates functionality, as in the case of several human antiviral
proteins (see section 7.2).
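The comparison of evolutionary rates between disordered and structured residues relies on the Mann-Whitney U test. As an illustration of that test only, the sketch below computes the U statistic from pooled ranks and a two-sided P value from the normal approximation with a tie correction and no continuity correction. It is not the analysis pipeline used for Figure Box 4, and the function name and input values are hypothetical.

```python
from math import erfc, sqrt


def mann_whitney_u(x, y):
    """Return (U for sample x, two-sided P value by normal approximation).

    Ties receive average ranks and the variance is tie-corrected.
    The approximation is intended for reasonably large samples.
    """
    n1, n2 = len(x), len(y)
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    tie_term = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = average_rank
        group_size = j - i
        tie_term += group_size ** 3 - group_size
        i = j
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2.0
    n = n1 + n2
    mean_u = n1 * n2 / 2.0
    variance_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance_u == 0:
        return u_x, 1.0  # all values identical: no evidence of a difference
    z = (u_x - mean_u) / sqrt(variance_u)
    p_value = erfc(abs(z) / sqrt(2.0))
    return u_x, p_value


# Hypothetical evolutionary rates for structured and disordered residues.
structured_rates = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
disordered_rates = [0.8, 0.9, 1.1, 1.3, 1.6, 2.0]
u_structured, p_value = mann_whitney_u(structured_rates, disordered_rates)
```

Because the test compares ranks rather than raw values, it makes no assumption about the shape of the rate distributions, which is appropriate for the wide, overlapping distributions described above.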
Figure Box 4
Boxplots of the
distributions of evolutionary rates for predicted
structured (blue) and disordered (red) residues across the human proteome.
Residues with a high evolutionary rate are less conserved. Boxes represent
the 50% of data points in the two quartiles above and below the median
(the horizontal bar within each box). Vertical lines (whiskers) connected
to the boxes represent the highest and lowest nonoutlier data points,
with outliers being defined as >1.5 times the interquartile range
from the median. Outliers are not shown for visual clarity.