
      Structural basis for small molecule targeting of Doublecortin Like Kinase 1 with DCLK1-IN-1


          Abstract

          Doublecortin-like kinase 1 (DCLK1) is an understudied bi-functional kinase with a proven role in tumour growth and development. However, the presence of tissue-specific spliced DCLK1 isoforms with distinct biological functions has challenged the development of effective strategies to understand the role of DCLK1 in oncogenesis. Recently, DCLK1-IN-1 was reported as a highly selective DCLK1 inhibitor and a powerful tool with which to dissect DCLK1 biological functions. Here, we report the crystal structures of the DCLK1 kinase domain in complex with DCLK1-IN-1 and its precursors. Combined, our data rationalise the structure-activity relationship that informed the development of DCLK1-IN-1 and provide the basis for its high selectivity, with DCLK1-IN-1 inducing a drastic conformational change of the ATP binding site. We demonstrate that DCLK1-IN-1 binds DCLK1 long isoforms but does not prevent DCLK1's Microtubule-Associated Protein (MAP) function. Together, our work provides an invaluable structural platform to further the design of isoform-specific DCLK1 modulators for therapeutic intervention.

          Summary

          Patel et al. report crystal structures of the kinase domain of Doublecortin-like kinase 1 (DCLK1) in complex with a potent inhibitor and its precursors. Their structural study reveals insights into the mode of action of the inhibitor DCLK1-IN-1 and sets the basis for the design of isoform-specific inhibitors of DCLK1, which plays a role in tumour growth and development.


          Most cited references (56)


          Phaser crystallographic software

          1. Introduction

          Improved crystallographic methods rely on both improved automation and improved algorithms. The software handling one part of structure solution must be automatically linked to software handling parts upstream and downstream of it in the structure solution pathway with (ideally) no user input, and the algorithms implemented in the software must be of high quality, so that the branching or termination of the structure solution pathway is minimized or eliminated. Automation allows all the choices in structure solution to be explored where the patience and job-tracking abilities of users would be exhausted, while good algorithms give solutions for poorer models, poorer data or unfavourable crystal symmetry. Both forms of improvement are essential for the success of high-throughput structural genomics (Burley et al., 1999).

          Macromolecular phasing by either of the two main methods, molecular replacement (MR) and experimental phasing (which includes the technique of single-wavelength anomalous dispersion, SAD), is a key part of the structure solution pathway with potential for improvement in both automation and the underlying algorithms. MR and SAD are good phasing methods for the development of structure solution pipelines because they involve the collection of only a single data set from a single crystal, which has the advantage of minimizing the effects of radiation damage. Phaser aims to facilitate automation of these methods through ease of scripting, and to facilitate the development of improved algorithms for these methods through the use of maximum likelihood and multivariate statistics. Other software shares some of these features. For molecular replacement, AMoRe (Navaza, 1994) and MOLREP (Vagin & Teplyakov, 1997) both implement automation strategies, though they lack likelihood-based scoring functions. Likelihood-based experimental phasing can be carried out using Sharp (La Fortelle & Bricogne, 1997).

          2. Algorithms

          The novel algorithms in Phaser are based on maximum likelihood probability theory and multivariate statistics rather than the traditional least-squares and Patterson methods. Phaser has novel maximum likelihood phasing algorithms for the rotation and translation functions in MR and for the SAD function in experimental phasing, but also implements other non-likelihood algorithms that are critical to success in certain cases. Summaries of the algorithms implemented in Phaser are given below. For completeness and for consistency of notation, some equations given elsewhere are repeated here.

          2.1. Maximum likelihood

          Maximum likelihood is a branch of statistical inference that asserts that the best model on the evidence of the data is the one that explains what has in fact been observed with the highest probability (Fisher, 1922). The model is a set of parameters, including the variances describing the error estimates for the parameters. The introduction of maximum likelihood estimators into the methods of refinement, experimental phasing and, with Phaser, MR has substantially increased success rates for structure solution over the methods that they replaced. A set of thought experiments with dice (McCoy, 2004) demonstrates that likelihood agrees with our intuition and illustrates the key concepts required for understanding likelihood as it is applied to crystallography. The likelihood of the model given the data is defined as the probability of the data given the model.
          Where the data have independent probability distributions, the joint probability of the data given the model is the product of the individual distributions. In crystallography, the data are the individual reflection intensities. These are not strictly independent, and indeed the statistical relationships resulting from positivity and atomicity underlie direct methods for small-molecule structures (reviewed by Giacovazzo, 1998). For macromolecular structures, these direct-methods relationships are weaker than effects exploited by density modification methods (reviewed by Kleywegt & Read, 1997); the presence of solvent means that the molecular transform is over-sampled, and if there is noncrystallographic symmetry then other correlations are also present. However, the assumption of independence is necessary to make the problem tractable and works well in practice.

          To avoid the numerical problems of working with the product of potentially hundreds of thousands of small probabilities (one for each reflection), the log of the likelihood is used. This has a maximum at the same set of parameters as the original function. Maximum likelihood also has the property that if the data are mathematically transformed to another function of the parameters, then the likelihood optimum will occur at the same set of parameters as for the untransformed data. Hence, it is possible to work with either the structure-factor intensities or the structure-factor amplitudes. In the maximum likelihood functions in Phaser, the structure-factor amplitudes (Fs), or normalized structure-factor amplitudes (Es, which are Fs normalized so that their mean-square values are 1), are used.

          The crystallographic phase problem means that the phase of the structure factor is not measured in the experiment. However, it is easiest to derive the probability distributions in terms of the phased structure factors and then to eliminate the unknown phase by integration, a process known as integrating out a nuisance variable (the nuisance variable being the introduced phase of the observed structure factor, or equivalently the phase difference between the observed structure factor and its expected value). The central limit theorem applies to structure factors, which are sums of many small atomic contributions, so the probability distribution for an acentric reflection F_O, given the expected value of F_O (〈F_O〉), is a two-dimensional Gaussian with variance Σ centred on 〈F_O〉. (Note that here and in the following, bold font is used to represent complex or signed structure factors, and italics to represent their amplitudes.) In applications to molecular replacement and structure refinement, 〈F_O〉 is the structure factor calculated from the model (F_C) multiplied by a fraction D (where 0 < D < 1) that accounts for errors in the model.
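          To make these ideas concrete, the sketch below (our illustration, not Phaser's code) evaluates the log-likelihood of a set of acentric amplitudes under the Rice distribution that results from integrating the phase out of the two-dimensional Gaussian described above; the sum of logarithms stands in for the product over reflections, and D and Σ are supplied as fixed numbers rather than refined parameters.

```python
import numpy as np
from scipy.special import i0e  # exponentially scaled Bessel I0, numerically safe

def acentric_log_likelihood(f_obs, d_times_fc, sigma):
    """Sum of log Rice probabilities for acentric amplitudes.

    P(Fo) = (2 Fo / Sigma) exp(-(Fo^2 + (D Fc)^2) / Sigma) I0(2 Fo D Fc / Sigma),
    the result of integrating the phase out of a 2D Gaussian centred on D*Fc.
    Summing logs avoids underflow from multiplying ~10^5 tiny probabilities.
    """
    z = 2.0 * f_obs * d_times_fc / sigma
    log_i0 = np.log(i0e(z)) + z          # log I0(z), stable for large z
    return np.sum(np.log(2.0 * f_obs / sigma)
                  - (f_obs ** 2 + d_times_fc ** 2) / sigma
                  + log_i0)

# toy check: amplitudes simulated around Fc score higher with the better model
rng = np.random.default_rng(0)
fc = rng.rayleigh(scale=10.0, size=100000)
fo = np.abs(fc + rng.normal(scale=2.0, size=fc.size))
print(acentric_log_likelihood(fo, 0.9 * fc, 50.0) >
      acentric_log_likelihood(fo, 0.5 * fc, 50.0))   # True
```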
          2.3. Normal-mode analysis

          Normal-mode analysis is performed on an elastic network model of the protein, in which a Hessian matrix H describes harmonic restoring forces between atoms; for pairs of atoms separated by more than a cutoff radius R, H = 0. The atoms are taken to be of equal mass. The eigenvalues λ and eigenvectors U of H can then be calculated. The eigenvalues are directly proportional to the squares of the vibrational frequencies of the normal modes, the lowest eigenvalues thus giving the lowest normal modes. Six of the eigenvalues will be zero, corresponding to the six degrees of freedom for a rotation and translation of the entire structure. For all but the smallest proteins, eigenvalue decomposition of the all-atom Hessian is not computationally feasible with current computer technology. Various methods have been developed to reduce the size of the eigenvalue problem. Bahar et al. (1997) and Hinsen (1998) have shown that it is possible to find the lowest frequency normal modes of proteins in the elastic network model by considering amino-acid Cα atoms only. However, this merely postpones the computational problem until the proteins are an order of magnitude larger.

          The problem is solved for any size of protein with the rotation–translation block (RTB) approach (Durand et al., 1994; Tama et al., 2000), in which the protein is divided into blocks of atoms and the rotation and translation modes for each block are used to project the full Hessian into a lower dimension. The projection matrix is a block-diagonal matrix of dimensions 3N × 3N. Each of the N_B block matrices P_nb has dimensions 3N_nb × 6, where N_nb is the number of atoms in block nb. For atom j in block nb, displaced from the centre of mass of the block, the 3 × 6 matrix P_nb,j is constructed so that its first three columns contain the infinitesimal translation eigenvectors of the block and its last three columns contain the infinitesimal rotation eigenvectors of the block. The orthogonal basis Q of P_nb is then found by QR decomposition (P_nb = Q_nb R_nb), where Q_nb is a 3N_nb × 6 orthogonal matrix and R_nb is a 6 × 6 upper triangular matrix. H can then be projected into the subspace spanned by the translation/rotation basis vectors of the blocks (H_P = Q^T H Q), and the eigenvalues λ_P and eigenvectors U_P of the projected Hessian found. The RTB method is able to restrict the size of the eigenvalue problem for any size of protein with the inclusion of an appropriately large N_nb for each block. In the implementation of the RTB method in Phaser, N_nb for each block is set for each protein such that the total size of the eigenvalue problem is restricted to a matrix H_P of maximum dimensions 750 × 750. This enables the eigenvalue problem to be solved in a matter of minutes with current computing technology. The eigenvectors of the translation/rotation subspace can then be expanded back to the atomic space. As for the decomposition of the full Hessian H, the eigenvalues are directly proportional to the squares of the vibrational frequencies of the normal modes, the lowest eigenvalues thus giving the lowest normal modes. Although the eigenvalues and eigenvectors generated from decomposition of the full Hessian and using the RTB approach will diverge with increasing frequency, the RTB approach is able to model with good accuracy the lowest frequency normal modes, which are the modes of interest for looking at conformational differences in proteins. The all-atom, Cα-only and RTB normal-mode analysis methods are all implemented in Phaser.

          After normal-mode analysis, n normal modes can be used to generate 2^n − 1 (nonzero) combinations of normal modes. Phaser allows the user to specify the r.m.s. deviation between model and target desired by the perturbation, and the fraction dq of the displacement vector for each mode combination corresponding to each model combination is then used to generate the models. Large r.m.s. deviations will cause the geometry of the model to become distorted. Phaser reports when the model becomes so distorted that there are Cα clashes in the structure.
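          The essentials of the elastic-network eigenproblem can be sketched in a few lines; the following toy example (ours; the cutoff and spring constant are arbitrary, and no RTB projection is performed) shows the six near-zero rigid-body eigenvalues emerging from an all-node Hessian.

```python
import numpy as np

def elastic_network_modes(coords, cutoff=8.0, k_spring=1.0):
    """Normal modes of an elastic network model (all-node Hessian).

    coords: (N, 3) array of Calpha positions. Pairs within `cutoff` are
    connected by harmonic springs; H is zero for more distant pairs.
    Returns eigenvalues (ascending) and eigenvectors of the 3N x 3N Hessian.
    """
    n = len(coords)
    hess = np.zeros((3 * n, 3 * n))
    for i in range(n):
        for j in range(i + 1, n):
            d = coords[j] - coords[i]
            r2 = d @ d
            if r2 > cutoff ** 2:
                continue  # H = 0 beyond the cutoff radius R
            block = -k_spring * np.outer(d, d) / r2  # 3x3 super-element
            hess[3*i:3*i+3, 3*j:3*j+3] += block
            hess[3*j:3*j+3, 3*i:3*i+3] += block
            hess[3*i:3*i+3, 3*i:3*i+3] -= block
            hess[3*j:3*j+3, 3*j:3*j+3] -= block
    return np.linalg.eigh(hess)

# toy chain of 20 "residues"; the six near-zero eigenvalues correspond to
# rigid-body rotation and translation of the entire structure
rng = np.random.default_rng(1)
ca = np.cumsum(rng.normal(0, 1, (20, 3)) + [3.8, 0, 0], axis=0)
vals, vecs = elastic_network_modes(ca)
print(np.round(vals[:8], 6))  # first six ~0, then the lowest true modes
```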
          2.4. Packing function

          The packing of potential solutions in the asymmetric unit is not inherently part of the translation function. It is therefore possible that an arrangement of models has a high log-likelihood gain even though the models overlap and are therefore physically unreasonable. The packing of the solutions is checked using a clash test on a subset of the atoms in the structure: the 'trace' atoms. For proteins, the trace atoms are the Cα positions, spaced at 3.8 Å. For nucleic acids, the phosphate and C atoms in the ribose-phosphate backbone and the N atoms of the bases are selected as trace atoms. These atoms are also spaced at about 3.8 Å, so that the density of trace atoms in nucleic acid is similar to that of proteins, which makes the number of protein–protein, protein–nucleic acid and nucleic acid–nucleic acid clashes comparable where there is a mixed protein–nucleic acid structure.

          For the clash test, the number of trace atoms from another model within a given distance (default 3 Å) is counted. The clash test includes symmetry-related copies of the model under consideration, other components in the asymmetric unit and their symmetry-related copies. If the search model has low sequence identity with the target, or has large flexible loops that could adopt an alternative conformation, the number of clashes may be expected to be nonzero. By default the best packing solutions are carried forward, although a specific number of allowed clashes may also be given as the cut-off for acceptance. However, it is better to edit models before use so that structurally nonconserved surface loops are excluded, as they will only contribute noise to the rotation and translation functions.

          Where an ensemble of structures is used as the model, the highest-homology model is taken as the template for the packing search. Before this model is used, the trace-atom positions are edited to take account of large conformational differences between the models in the ensemble: equivalent trace-atom positions are compared and, if the coordinates deviate by more than 3 Å, the template trace atom is deleted. Thus, use of an ensemble not only improves the signal to noise of the maximum likelihood search functions, it also improves the discrimination of possible solutions by the packing function.
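          A schematic of the clash test might look as follows; this is an illustration under the assumption of precomputed, symmetry-expanded trace-atom coordinates, not Phaser's implementation.

```python
import numpy as np
from scipy.spatial import cKDTree

def count_clashes(trace_a, trace_b, clash_dist=3.0):
    """Count trace atoms of one placed model that clash with another model.

    trace_a, trace_b: (N, 3) arrays of Calpha (or nucleic acid backbone/base)
    trace-atom coordinates, already expanded over the relevant symmetry copies.
    """
    tree = cKDTree(trace_b)
    # one count per atom in trace_a that has at least one close neighbour
    return sum(len(hits) > 0 for hits in tree.query_ball_point(trace_a, clash_dist))

def accept_packing(models, max_clashes=0):
    """Pairwise clash test over all placed models; a nonzero max_clashes
    mimics the optional allowed-clashes cut-off described above."""
    total = sum(count_clashes(a, b)
                for i, a in enumerate(models) for b in models[i + 1:])
    return total <= max_clashes

a = np.array([[0.0, 0.0, 0.0], [3.8, 0.0, 0.0]])
b = np.array([[-1.0, 0.0, 0.0], [20.0, 0.0, 0.0]])
print(count_clashes(a, b))  # 1: only the first trace atom has a close neighbour
```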
          2.5. Minimizer

          Minimization is used in Phaser to optimize the parameters against the appropriate log-likelihood function in the anisotropy correction, in MR (refining the position and orientation of a rigid-body model) and in SAD phasing. The same minimizer code is used for all three applications and has been designed to be easily extensible to other applications. The minimizer for the anisotropy correction uses Newton's method, while MR and SAD use the standard Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm. Both minimization methods in Phaser include a line search. The line search algorithm is a basic iterative method for finding the local minimum of a target function f. Starting at parameters x, the algorithm finds the minimum (within a convergence tolerance) of f(x + γd) by varying γ, where γ is the step distance along a descent direction d. Newton's method and the BFGS algorithm differ in the determination of the descent direction d that is passed to the line search, and thus in their speed of convergence. Within one cycle of the line search (where there is no change in d), the trial step distances γ are chosen using the golden-section method. The golden ratio [(√5 + 1)/2] divides a line so that the ratio of the larger part to the total is the same as the ratio of the smaller part to the larger. The method makes no assumptions about the function's behaviour; in particular, it does not assume that the function is quadratic within the bracketed section. If this assumption were made, the line search could proceed via parabolic interpolation.

          Newton's method uses the Hessian matrix H of second derivatives and the gradient g at the initial set of parameters x_0 to find the values of the parameters at the minimum, x_min. If the function is quadratic in x then Newton's method will find the minimum in one step, but if not, iteration is required. The method requires the inversion of the Hessian matrix, which, for large matrices, consumes a large amount of computational time and memory resources. The eigenvalues of the Hessian need to be positive for the function to be at a minimum, rather than at a maximum or saddle point, since the method converges to any point where the gradient vector is zero. When used with the anisotropy correction, the full Hessian matrix is calculated analytically.

          The BFGS algorithm is one of the most powerful minimization methods when calculation of the full Hessian using analytic or finite-difference methods is very computationally intensive. At every step, the gradient search vector is analysed to build up an approximate Hessian matrix H, in order to make the resulting search vector direction d better than the original gradient vector direction. In the 'pure' form of the BFGS algorithm, the method is started with the matrix H equal to the identity matrix. The off-diagonal elements of the Hessian, the mixed second derivatives (i.e. ∂²LL/∂p_i∂p_j), are thus initially zero. As the BFGS cycle proceeds, the off-diagonal elements become nonzero using information derived from the gradient. However, in Phaser, the matrix H is not the identity but rather is seeded with diagonal elements equal to the second derivatives of the log-likelihood target function (LL) with respect to the parameters p_i (i.e. ∂²LL/∂p_i², or curvatures), the values found in the 'true' Hessian. For the SAD refinement the diagonal elements are calculated analytically, but for the MR refinement the diagonal elements are calculated by finite-difference methods. Seeding the Hessian with the diagonal elements dramatically accelerates convergence when the parameters are on different scales; when an identity matrix is used, the parameters on a larger scale can fail to shift significantly because their gradients tend to be smaller, even though the necessary shifts tend to be larger. In the inverse Hessian, small curvatures for parameters on a large scale translate into large scale factors applied to the corresponding gradient terms. If any of these curvature terms are negative (as may happen when the parameters are far from their optimal values), the matrix is not positive definite. Such a situation is corrected by using problem-specific information on the expected relative scale of the parameters from the 'large-shift' variable, as discussed below in §2.5.1.

          In addition to the basic minimization algorithms, the minimizer incorporates the ability to bound, constrain, restrain and reparameterize variables, as discussed in detail below. Bounds must be applied to prevent parameters becoming nonphysical, constraints effectively reduce the number of parameters, restraints are applied to include prior probability information, and reparameterization of variables makes the parameter space more quadratic and improves the performance of the minimizer.
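          The golden-section step selection within one line-search cycle can be illustrated as follows (a textbook sketch; Phaser's actual line search adds the safeguards described in §2.5.1).

```python
import math

GOLDEN = (math.sqrt(5.0) + 1.0) / 2.0  # the golden ratio

def golden_section_search(f, a, b, tol=1e-6):
    """Minimize f on the bracket [a, b] by golden-section search.

    Interior points divide the interval in the golden ratio, so one point
    (and its function value) can be reused at every iteration. No assumption
    is made that f is quadratic within the bracket.
    """
    c = b - (b - a) / GOLDEN
    d = a + (b - a) / GOLDEN
    fc, fd = f(c), f(d)
    while abs(b - a) > tol:
        if fc < fd:                      # minimum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - (b - a) / GOLDEN
            fc = f(c)
        else:                            # minimum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + (b - a) / GOLDEN
            fd = f(d)
    return 0.5 * (a + b)

# step distance gamma along a fixed descent direction: a 1D slice f(x + gamma*d)
print(golden_section_search(lambda g: (g - 0.7) ** 2 + 1.0, 0.0, 2.0))  # ~0.7
```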
          2.5.1. Problem-specific parameter scaling information

          When a function is defined for minimization in Phaser, information must be provided on the relative scales of the parameters of that function, through a 'large-shifts' variable. As its name implies, the variable defines the size of a parameter shift that would be considered 'large' for each parameter. The ratios of these large-shift values thus specify prior knowledge about the relative scales of the different parameters for each problem. Suitable large-shift values are found by a combination of physical insight (e.g. the size of a coordinate shift considered to be large will be proportional to d_min for the data set) and numerical simulations, studying the behaviour of the likelihood function as parameters are varied systematically in a variety of test cases.

          The large-shifts information is used in two ways. Firstly, it is used to prevent the line search from taking an excessively large step, which can happen if the estimated curvature for a parameter happens to be too small and can lead to the refinement becoming numerically unstable. If the initial step for a line search would change any parameter by more than its large-shift value, the initial step is scaled down. Secondly, it is used to provide relative scale information to correct negative curvature values. Parameters with positive curvatures are used to define the average relationship between the large-shift values and the curvatures, which can then be used to compute appropriate curvature values for the parameters with negative curvatures. This stabilizes the refinement until it is sufficiently close to the minimum that all curvatures become positive.

          2.5.2. Reparameterization

          Second-order minimization algorithms in effect assume that, at least in the region around the minimum, the function can be approximated as a quadratic. Where this assumption holds, the minimizer will converge faster. It is therefore advantageous to minimize functions of the parameters chosen so that the target function is more quadratic in the new parameter space than in the original parameter space (Edwards, 1992). For example, atomic B factors tend to converge slowly to their refined values because the B factor appears in the exponential term in the structure-factor equation. Although any function of the parameters can be used for this purpose, we have found that taking the logarithm of a parameter, x′ = log(x + x_offset), is often the most effective reparameterization operation (not only for the B factors). The offset x_offset is chosen so that the value of x′ does not become undefined for allowed values of x, and to optimize the quadratic nature of the function in x′. For instance, atomic B factors are reparameterized using an offset of 5 Å², which allows the B factors to approach zero and also has the physical interpretation of accounting roughly for the width of the distribution of electrons for a stationary atom.
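          As a small illustration of the logarithmic reparameterization with the 5 Å² offset (our sketch, not Phaser's code):

```python
import math

B_OFFSET = 5.0  # Angstrom^2; keeps the log defined as B approaches zero

def b_to_param(b):
    """Reparameterize an atomic B factor for minimization: x' = log(B + offset)."""
    return math.log(b + B_OFFSET)

def param_to_b(x):
    """Invert the reparameterization after the minimizer has updated x'."""
    return math.exp(x) - B_OFFSET

# the minimizer steps in x', where the target is closer to quadratic; equal
# steps in x' correspond to multiplicative (not additive) changes in B
for step in (0.0, 0.5, 1.0):
    print(round(param_to_b(b_to_param(20.0) + step), 2))  # 20.0, 36.22, 62.96
```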
          2.5.3. Bounds

          Bounds on the minimization are applied by setting upper and/or lower limits for each variable where required (e.g. a minimum of zero for occupancies). If a parameter reaches a limit during a line search, that line search is terminated. In subsequent line searches, the gradient of that parameter is set to zero whenever the search direction would otherwise move the parameter outside of its bounds. Multiplying the gradient by the step size thus does not alter the value of the parameter at its limit. The parameter will remain at its limit unless calculation of the gradient in subsequent cycles of minimization indicates that the parameter should move away from the boundary and into the allowed range of values.

          2.5.4. Constraints

          Space-group-dependent constraints apply to the anisotropic tensor applied to Σ_N in the anisotropic diffraction correction. Atoms on special positions also have constraints on the values of their anisotropic tensor. The anisotropic displacement ellipsoid must remain invariant under the application of each symmetry operator of the space group or site-symmetry group, respectively (Giacovazzo, 1992; Grosse-Kunstleve & Adams, 2002). These constraints reduce the number of parameters by either fixing some values of the anisotropic B factors to zero or setting some sets of B factors to be equal. The derivatives in the gradient and Hessian must also be constrained to reflect the constraints in the parameters.

          2.5.5. Restraints

          Bayes' theorem describes how the probability of the model given the data is related to the likelihood, and gives a justification for the use of restraints on the parameters of the model: P(model|data) ∝ P(data|model) × P(model), where P(model) is called the prior probability when the probability of the data is taken as a constant. When the logarithm of this equation is taken, the prior probability appears as an additive term, so prior information is introduced into the log-likelihood target function by the addition of terms. If parameters of the model are assumed to have independent Gaussian probability distributions, then the Bayesian view of likelihood will lead to the addition of least-squares terms and hence least-squares restraints on those parameters, such as the least-squares restraints applied to bond lengths and bond angles in typical macromolecular structure-refinement programs. In Phaser, least-squares terms are added to restrain the B factors of atoms to the Wilson B factor in SAD refinement, and to restrain the anisotropic B factors to being more isotropic (the 'sphericity' restraint). A similar sphericity restraint is used in SHELXL (Sheldrick, 1995) and in REFMAC5 (Murshudov et al., 1999).
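          In code, adding a Gaussian prior to the target amounts to subtracting least-squares penalty terms from the log-likelihood; the sketch below (ours, with an arbitrary weight) shows the general form of the Wilson-B restraint described above.

```python
def restrained_target(log_likelihood, b_factors, b_wilson, weight=0.01):
    """Add Gaussian-prior restraints to a log-likelihood target.

    A Gaussian prior on each parameter contributes a least-squares term to
    the log posterior; here atomic B factors are restrained to the Wilson B.
    (The 'sphericity' restraint on anisotropic Bs has the same form.)
    """
    penalty = sum(weight * (b - b_wilson) ** 2 for b in b_factors)
    return log_likelihood - penalty

# maximizing the restrained target trades off fit (the log-likelihood)
# against prior expectations about the parameters
print(restrained_target(-1234.5, [18.0, 22.0, 40.0], b_wilson=20.0))  # -1238.58
```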
          3. Automation

          Phaser is designed as a large set of library routines grouped together and made available to users as a series of applications, called modes. The routine groupings in the modes have been selected mainly on historical grounds; they represent traditional steps in the structure solution pipeline. There are 13 such modes in total: 'anisotropy correction', 'cell content analysis', 'normal-mode analysis', 'ensembling', 'fast rotation function', 'brute rotation function', 'fast translation function', 'brute translation function', 'log-likelihood gain', 'rigid-body refinement', 'single-wavelength anomalous dispersion', 'automated molecular replacement' and 'automated experimental phasing'. The 'automated molecular replacement' and 'automated experimental phasing' modes are particularly powerful and aim to fully automate structure solution by MR and SAD, respectively.

          Aspects of the decision making within the modes are under user input control. For example, the 'fast rotation function' mode performs the ensembling calculation, then a fast rotation function calculation, and then rescores the top solutions from the fast search with a brute rotation function. There are three possible fast rotation function algorithms and two possible brute rotation functions to choose from. There are four possible criteria for selecting the peaks in the fast rotation function for rescoring with the brute rotation function, and for selecting the results from the rescoring for output. Alternatively, the rescoring of the fast rotation function with the brute rotation function can be turned off, to produce results from the fast rotation function only. Other modes generally have fewer routines but are designed along the same principles (details are given in the documentation).

          3.1. Automated molecular replacement

          Most structures that can be solved by MR with Phaser can be solved using the 'automated molecular replacement' mode. The flow diagram for this mode is shown in Fig. 1. The search strategy automates four search processes: those for multiple components in the asymmetric unit, for ambiguity in the hand of the space group and/or other space groups in the same point group, for permutations in the search order for components (when there are multiple components), and for finding the best model when there is more than one possible model for a component.

          3.1.1. Multiple components of the asymmetric unit

          Where there are many models to be placed in the asymmetric unit, the signal from the placement of the first model may be buried in noise, and the correct placement of this first model may only be found in the context of all models being placed in the asymmetric unit. One way of tackling this problem has been to use stochastic methods to search the multi-dimensional space (Chang & Lewis, 1997; Kissinger et al., 1999; Glykos & Kokkinidis, 2000). However, we have chosen to use a tree-search-with-pruning approach, in which a list of possible placements of the first (and subsequent) models is kept until the placement of the final model. This strategy can generate very branched searches that would be challenging for users to negotiate by running separate jobs, but becomes trivial with suitable automation. It exploits the strength of the maximum likelihood target functions in using prior information in the search for subsequent components in the asymmetric unit.

          The tree-search-with-pruning strategy is heavily dependent on the criteria used for selecting the peaks that survive to the next round. Four selection criteria are available in Phaser: selection by percentage difference between the top and mean log-likelihood of the search, selection by Z score, selection by number of peaks, and selection of all peaks. The default is selection by percentage, with the default percentage set at 75%. This selection method has the advantage that, if there is one clear peak standing well above the noise, it alone will be passed to the next round, while if there is no clear signal, all peaks high in the list will be passed as potential solutions to the next round. If structure solution fails, it may be possible to rescue the solution by reducing the percentage cut-off used for selection from 75% to, for example, 65%, so that if the correct peak just missed the default cut-off, it is now included in the list passed to the next round.
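          The percentage criterion can be sketched as follows; the exact formula is our assumption of the behaviour described above (a cut-off placed 75% of the way from the mean up to the top peak), not code taken from Phaser.

```python
def select_peaks(llgs, percent=75.0):
    """Select rotation/translation peaks to carry to the next search round.

    Assumed form of the percent criterion: keep peaks whose log-likelihood
    gain is at least mean + (percent/100) * (top - mean). A single dominant
    peak then survives alone, while a flat list passes several candidates.
    """
    top, mean = max(llgs), sum(llgs) / len(llgs)
    cutoff = mean + (percent / 100.0) * (top - mean)
    return [llg for llg in llgs if llg >= cutoff]

print(select_peaks([120.0, 45.0, 44.0, 40.0]))       # clear signal: [120.0]
print(select_peaks([52.0, 51.0, 50.0, 49.0, 30.0]))  # noisy: the near-top cluster passes
```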
          The tree-search-with-pruning strategy is sub-optimal where there are multiple copies of the same search model in the asymmetric unit. In this case the search generates many branches, each of which has a subset of the complete solution, and so there is a combinatorial explosion in the search. The tree search would only converge onto one branch (solution) with the placement of the last component on each of the branches, but in practice the run time often becomes excessive and the job is terminated before this point can be reached. When searching for multiple copies of the same component in the asymmetric unit, several copies should be added at each search step (rather than branching at each search step), but this search strategy must currently be performed semi-manually, as described elsewhere (McCoy, 2007).

          3.1.2. Alternative space groups

          The space group of a structure can often be ambiguous after data collection. Ambiguities of space group within one point group may arise on theoretical grounds (if the space group has an enantiomorph) or on experimental grounds (the data along one or more axes were not collected, so the systematic absences along these axes cannot be determined). Changing the space group of a structure to another in the same point group can be performed without re-indexing, merging or scaling the data. Determination of the space group within a point group is therefore an integral part of structure solution by MR. The translation function will yield the highest log-likelihood gain for a correctly packed solution in the correct space group. Phaser allows the user to make a selection of space groups within the same point group for the first translation function calculation in a search for multiple components in the asymmetric unit. If the signal from the placement of the first component is not significantly above noise, the correct space group may not be chosen by this protocol, and the search for all components in the asymmetric unit should then be completed separately in all alternative space groups.

          3.1.3. Alternative models

          As the database of known structures expands, the number of potential MR models is also rapidly increasing. Each available model can be used as a separate search model, or combined with other aligned structures in an 'ensemble' model. There are also various ways of editing structures before use as MR models (Schwarzenbacher et al., 2004). The number of MR trials that can be performed thus increases combinatorially with the number of potential models, which makes job tracking difficult for the user. In addition, most users stop performing MR trials as soon as any solution is found, rather than continuing the search until the MR solution with the greatest log-likelihood gain is found, and so they fail to optimize the starting point for subsequent steps in the structure solution pipeline.

          The use of alternative models to represent a structure component is also useful where there are multiple copies of one type of component in the asymmetric unit and the different copies have different conformations due to packing differences. The best solution will then have the different copies modelled by different search models; if the conformational change is severe enough, it may not be possible to solve the structure without modelling the differences. A set of alternative search models may be generated from previously observed conformational differences among similar structures or, for example, by normal-mode analysis (see §2.3). Phaser automates searches over multiple models for a component, with each potential model tested in turn until the one with the greatest log-likelihood gain is found.
          The loop over alternative models for a component is implemented only in the rotation functions, as the solutions passed from the rotation function to the translation function step explicitly specify which model to use, as well as the orientation, for the translation function in question.

          3.1.4. Search-order permutation

          When searching for multiple components in the asymmetric unit, the order of the search can be a factor in success. The models with the biggest component of the total structure factor will be the easiest to find: when weaker scattering components are the subject of the initial search, the solution may be buried in noise and not significant enough to survive the selection criteria in the tree-search-with-pruning strategy. Once the strongest scattering components are located, the search for weaker scattering components (in the background of the strong scattering components) is more likely to succeed. Having a high component of the total structure factor correlates with the model representing a high fraction of the total contents of the asymmetric unit, a low r.m.s. deviation between model and target atoms, and low B factors for the target to which the model corresponds. Although the first of these (high completeness) can be determined in advance from the fraction of the total molecular weight represented by the model, the second can only be estimated, from the Chothia & Lesk (1986) formula, and the third is unknown in advance. If structure solution fails with the search performed in order of molecular weight, other permutations of the search order should be tried. In Phaser, this possibility is automated on request: the entire search strategy (except for the initial anisotropic data correction) is performed for all unique permutations of search orders.

          3.2. Automated experimental phasing

          SAD is the simplest type of experimental phasing method to automate, as it involves only one crystal and one data set. SAD is now becoming the experimental phasing method of choice, overtaking multiple-wavelength anomalous dispersion because only a single data set needs to be collected. This can help minimize radiation damage to the crystal, which has a major adverse effect on the success of multi-wavelength experiments. The 'automated experimental phasing' mode in Phaser takes an atomic substructure determined by Patterson, direct or dual-space methods (Karle & Hauptman, 1956; Rossmann, 1961; Mukherjee et al., 1989; Miller et al., 1994; Sheldrick & Gould, 1995; Sheldrick et al., 2001; Grosse-Kunstleve & Adams, 2003) and refines the positions, occupancies, B factors and f″ values of the atoms to optimize the SAD function, then uses log-likelihood gradient maps to complete the atomic substructure. The flow diagram for this mode is shown in Fig. 2. The search strategy automates two search processes: those for ambiguity in the hand of the space group and for completing the atomic substructure from log-likelihood gradient maps.

          A feature of using the SAD function for phasing is that the substructure need not consist only of anomalous scatterers; indeed, it can consist of only real scatterers, since the real scattering of the partial structure is used as part of the phasing function. This allows structures to be completed from initial real-scattering models.
          3.2.1. Enantiomorphic space groups

          Since the SAD phasing mode of Phaser takes as input an atomic substructure model, the space group of the solution has already been determined to within the enantiomorph of the correct space group. Changing the enantiomorph of a SAD refinement involves changing the enantiomorph of the heavy atoms, or in some cases the space group (e.g. the enantiomorphic space group of P41 is P43). In some rare cases (Fdd2, I41, I4122, I41md, I41cd, I-42d, F4132; Koch & Fischer, 1989) the origin of the heavy-atom sites is changed [e.g. the enantiomorphic space group of I41 is I41 with the origin shifted to ( , 0, 0)]. If there is only one type of anomalous scatterer, the refinement need not be repeated in both hands: only the phasing needs to be carried out in the second hand to be considered. However, if there is more than one type of anomalous scatterer, then the refinement and substructure completion need to be repeated, as the substructure will not be enantiomorphically symmetric in the other hand. To facilitate this, Phaser runs the refinement and substructure completion in both hands [as does other experimental phasing software, e.g. Solve (Terwilliger & Berendzen, 1999) and autoSHARP (Vonrhein et al., 2006)]. The correct space group can then be found by inspection of the electron density maps; the density will only be interpretable in the correct space group. In cases with significant contributions from at least two types of anomalous scatterer in the substructure, the correct space group can also be identified from the log-likelihood gain.

          3.2.2. Completing the substructure

          Peaks in log-likelihood gradient maps indicate the coordinates at which new atoms should be added to improve the log-likelihood gain. In the initial maps, the peaks are likely to indicate the positions of the strongest anomalous scatterers that are missing from the model. As the phasing improves, weaker anomalous scatterers, such as intrinsic sulfurs, will appear in the log-likelihood gradient maps and finally, if the phasing is exceptional and the resolution high, non-anomalous scatterers will appear, since the SAD function includes a contribution from the real scattering.

          After refinement, atoms are excluded from the substructure if their occupancy drops below a tenth of the highest occupancy amongst those atoms of the same atom type (and therefore the same f″). Excluded sites are flagged rather than permanently deleted, so that if a peak later appears in the log-likelihood gradient map at this position, the atom can be reinstated and prevented from being deleted again; this prevents oscillation in the addition of new sites between cycles, and therefore a lack of convergence of the substructure completion algorithm.

          New atoms are added automatically after a peak and hole search of the log-likelihood gradient maps. The cut-off for the consideration of a peak as a potential new atom is that its Z score be higher than 6 (by default) and also higher than the depth of the largest hole in the map, i.e. the largest hole is taken as an additional indication of the noise level of the map. The proximity of each potential new site to previous atoms is then calculated. If a peak is more than a cut-off distance (κ Å) from any previous site, the peak is added as a new atom with the average occupancy and B factor of the current set of sites. If the peak is within κ Å of an isotropic atom already present, the old atom is made anisotropic.
          Holes in the log-likelihood gradient map within κ Å of an isotropic atom also cause the atom's B factor to be switched to anisotropic. However, if the peak or hole is within κ Å of an anisotropic atom already present, the peak or hole is ignored. If a peak is within κ Å of a previously excluded site, the excluded site is reinstated and flagged as not for deletion, in order to prevent oscillations as described above. At the end of the cycle of atom addition and isotropic-to-anisotropic B-factor switching, new sites within 2κ Å of an old atom that is now anisotropic are removed, since the peak may be absorbed by refining the anisotropic B factor; if not, it will be accepted as a new site in the next cycle of log-likelihood gradient completion. The distance κ may be input directly by the user, but by default it is the 'optical resolution' of the structure (κ = 0.715 d_min), bounded below by 1 Å and above by 10 Å. If the structure contains more than one significant anomalous scatterer, log-likelihood gradient maps are calculated for each atom type, the maps are compared, and the atom type associated with each significant peak is assigned from the map with the most significant peak at that location.
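          The decision rules of the completion cycle can be summarized in a schematic function; the data structures and names below are our illustration, not Phaser's code.

```python
def dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def classify_peak(peak_xyz, peak_z, atoms, kappa, z_cut=6.0, deepest_hole_z=0.0):
    """Decide what to do with one log-likelihood-gradient map peak.

    `atoms` is a list of dicts with 'xyz', 'aniso' (bool) and 'excluded' (bool);
    kappa is the distance cut-off (by default the optical resolution,
    0.715 * d_min, bounded to [1, 10] Angstroms).
    """
    if peak_z < max(z_cut, deepest_hole_z):
        return "reject"                       # below the noise level of the map
    nearest = min(atoms, key=lambda a: dist(peak_xyz, a["xyz"]), default=None)
    if nearest is None or dist(peak_xyz, nearest["xyz"]) > kappa:
        return "add new atom"                 # gets average occupancy and B factor
    if nearest["excluded"]:
        return "reinstate atom"               # flag as not-for-deletion
    if nearest["aniso"]:
        return "ignore"                       # absorbed by the anisotropic B
    return "switch atom to anisotropic B"

atoms = [{"xyz": (0.0, 0.0, 0.0), "aniso": False, "excluded": False}]
print(classify_peak((0.5, 0.0, 0.0), 8.0, atoms, kappa=2.0))  # switch to aniso B
print(classify_peak((9.0, 0.0, 0.0), 8.0, atoms, kappa=2.0))  # add new atom
```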
          3.2.3. Initial real-scattering model

          One of the reasons for including MR and SAD phasing within one software package is the ability to use MR solutions with the SAD phasing target to improve the phases. Since the SAD phasing target contains a contribution from the real scatterers, it is possible to use a partial MR model with no anomalous scattering as the initial atomic substructure for SAD phasing. This approach is useful where a poor MR solution is combined with a poor anomalous signal in the data. If the poor MR solution means that the structure cannot be phased from this model alone, and the poor anomalous signal means that the anomalous scatterers cannot be located in the data alone, then using the MR solution as the starting model for SAD phasing may provide enough phase information to locate the anomalous scatterers. The combined phase information will be stronger than that from either source alone. To facilitate this method of structure solution, Phaser allows the user to input a partial structure model that will be interpreted in terms of its real scattering only and, following phasing with this substructure, to complete the anomalous-scattering model from log-likelihood gradient maps as described above.

          3.3. Input and output

          The fastest and most efficient way, in terms of development time, to link software together is to use a scripting language, while a compiled language is most efficient for intensive computation. Following the lead of the PHENIX project (Adams et al., 2002, 2004), Phaser uses Python (http://python.org) as the scripting language, C++ as the compiled language, and the Boost.Python library (http://boost.org/libs/python/) for linking C++ and Python. Other packages, notably X-PLOR (Brünger, 1993) and CNS (Brünger et al., 1998), have defined their own scripting languages, but the choice of Python ensures that the scripting language is maintained by an active community. Phaser functionality has mostly been made available to Python at the 'mode' level. However, some low-level SAD refinement routines in Phaser have been made available to Python directly, so that they can be easily incorporated into phenix.refine.

          A long tradition of CCP4 keyword-style input in established macromolecular crystallography software (almost exclusively written in Fortran) means that, for many users, this is the familiar method of calling crystallographic software, and is preferred to a Python interface. The challenge for the development of Phaser was to find a way of satisfying both keyword-style input and Python scripting with minimal increase in development time. Taking advantage of the C++ class structure allowed both to be implemented with very little additional code. Each keyword is managed by its own class. The input to each mode of Phaser is controlled by Input objects, which are derived from the set of keyword classes appropriate to the mode. The keyword classes are in turn derived from a CCP4base class containing the functionality for keyword-style input. Each keyword class has a parse routine that calls the CCP4base class functions to parse the keyword input, stores the input parameters as local variables and then passes these parameters to a keyword-class set function. The set functions check the validity and consistency of the input, throw errors where appropriate and finally set the keyword class's member parameters. Alternatively, the set functions can be called directly from Python. These keyword classes are a standalone part of the Phaser code and have already been used in other software developments (Pointless; Evans, 2006).

          An Output object controls all text output from Phaser sent to standard output and to text files. Switches on the Output object give different output styles: CCP4-style for compatibility with the CCP4 distribution, PHENIX-style for compatibility with the PHENIX interface, CIMR-style for development, XML-style output for developers of automation scripts, and a 'silent running' option to be used when running Phaser from Python. In addition to the text output, where possible Phaser writes results to files in standard formats: coordinates to 'pdb' files and reflection data (e.g. map coefficients) to 'mtz' files. Switches on the Output object control the writing of these files.

          3.3.1. CCP4-style output

          CCP4-style output is a text log file sent to standard output. While this form of output is easily comprehensible to users, it is far from ideal as an output style for automation scripts. However, it is the only output style available from much of the established software that developers wish to use in their automation scripts, and it is common to use Unix tools such as 'grep' to extract key information. For this reason, the log files of Phaser have been designed to help developers who prefer to use this style of output. Phaser prints four levels of log file (summary, log, verbose and debug), as specified by user input. The important output information is present in all four levels of file, but it is most efficient to work with the summary output. Phaser prints 'SUCCESS' and 'FAILURE' at the end of the log file to demarcate the exit state of the program, and also prints the names of any other output files produced by the program to the summary output, amongst other features.

          3.3.2. XML output

          XML is becoming commonly used as a way of communicating between steps in an automation pipeline, because XML output can be added very simply by the program author and relatively simply by others with access to the source code. For this reason, Phaser also outputs an XML file when requested. The XML file encapsulates the mark-up within 〈phaser〉 tags.
          As there is no standard set of XML tags for crystallographic results, Phaser's XML tags are mostly specific to Phaser, but were arrived at after consultation with other developers of XML output for crystallographic software.

          3.3.3. Python interface

          The most elegant and efficient way to run Phaser as part of an automation script is to call the functionality directly from Python. Using Phaser through the Python interface is similar to using Phaser through the keyword interface. Each mode of operation of Phaser described above is controlled by an Input object and its parameter set functions, which have been made available to Python with the Boost.Python library. Phaser is then run with a call to the 'run-job' function, which takes the Input object as a parameter. The 'run-job' function returns a Result object on completion, which can then be queried using its get functions. The Python Result object can be stored as a 'pickled' class structure directly to disk. Text is not sent to standard output in the CCP4 log-file way, but may be redirected to another output stream. All Input and Result objects are fully documented.
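          The pattern described here (an Input object configured through set functions, a 'run-job' call returning a queryable, picklable Result) can be sketched with stand-in classes; all names below are illustrative placeholders rather than the documented Phaser bindings.

```python
import pickle

class InputMR:
    """Stand-in for a mode Input object (illustrative, not Phaser's API)."""
    def __init__(self):
        self.params = {}
    def set_hkl(self, path):   self.params["HKLIN"] = path
    def set_model(self, path): self.params["MODEL"] = path
    def set_mute(self, flag):  self.params["MUTE"] = flag

class Result:
    """Stand-in for a Result object queried through its get functions."""
    def __init__(self, llg):
        self._llg = llg
    def get_top_llg(self):
        return self._llg

def run_job(inp):
    """Stand-in for the 'run-job' entry point described in the text."""
    return Result(llg=123.4)  # a real run would perform the MR calculation here

inp = InputMR()
inp.set_hkl("data.mtz")      # parameter set functions mirror the keyword input
inp.set_model("model.pdb")
inp.set_mute(True)           # 'silent running': no CCP4-style log text
result = run_job(inp)
print(result.get_top_llg())
with open("mr_result.pkl", "wb") as fh:
    pickle.dump(result, fh)  # Result objects can be stored 'pickled' to disk
```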
          4. Future developments

          Phaser will continue to be developed as a platform for implementing novel phasing algorithms and bringing the most effective approaches to the crystallographic community. Much work remains to be done formulating maximum likelihood functions with respect to noncrystallographic symmetry, to account for correlations in the data and to consider non-isomorphism, all with the aim of achieving the best possible initial electron density map.

          After a generation in which Fortran dominated crystallographic software code, C++ and Python have become the new standard. Several developments, including Phaser, PHENIX (Adams et al., 2002, 2004), Clipper (Cowtan, 2002) and mmdb (Krissinel et al., 2004), simultaneously chose C++ as the compiled language at their inception at the turn of the millennium. At about the same time, Python was chosen as a scripting language by PHENIX, ccp4mg (Potterton et al., 2002, 2004) and PyMol (DeLano, 2002), amongst others. Since then, other major software developments have also started in or converted to C++ and Python, for example PyWarp (Cohen et al., 2004), MrBump (Keegan & Winn, 2007) and Pointless (Evans, 2006). The choice of C++ for software development was driven by the availability of free compilers, an ISO standard (International Standardization Organization et al., 1998), sophisticated dynamic memory management and the inherent strengths of using an object-oriented language. Python was equally attractive because of the strong community support, its object-oriented design, and the ability to link C++ and Python through the Boost.Python library or the SWIG library (http://www.swig.org/). Now that a 'critical mass' of developers has taken to using the new languages, C++ and Python are likely to remain the standard for crystallographic software for the current generation of crystallographic software developers.

          Phaser source code has been distributed directly by the authors (see http://www-structmed.cimr.cam.ac.uk/phaser for details) and through the PHENIX and CCP4 (Collaborative Computational Project, Number 4, 1994) software suites. The source code is released for several reasons, including our belief that source code is the most complete form of publication for the algorithms in Phaser. It is hoped that generous licensing conditions and source distribution will encourage the use of Phaser by other developers of crystallographic software and by those writing crystallographic automation scripts. There are no licensing restrictions on the use of Phaser in macromolecular crystallography pipelines by other developers, and the licence conditions even allow developers to alter the source code (although not to redistribute it). We welcome suggestions for improvements to be incorporated into new versions.

          Compilation of Phaser requires the computational crystallography toolbox (cctbx; Grosse-Kunstleve & Adams, 2003), which includes a distribution of the cmtz library (Winn et al., 2002). The Boost libraries (http://boost.org/) are required for access to the functionality from Python. Phaser runs under a wide range of operating systems, including Linux, Irix, OSF1/Tru64, MacOS-X and Windows, and precompiled executables are available for these platforms when only keyword-style access (and not Python access) is required. Graphical user interfaces to Phaser are available for both the PHENIX and the CCP4 suites. User support is available through PHENIX, CCP4 and from the authors (email cimr-phaser@lists.cam.ac.uk).

            Coot: model-building tools for molecular graphics

            Acta Crystallographica Section D Biological Crystallography, 60(12), 2126-2132

              Overview of the CCP4 suite and current developments

              1. Introduction

              CCP4 (Collaborative Computational Project, Number 4, 1994) exists to produce and support a world-leading integrated suite of programs that allows researchers to determine macromolecular structures by X-ray crystallography and other biophysical techniques. CCP4 aims to develop and support the development of cutting-edge approaches to the experimental determination and analysis of protein structure, and to integrate these approaches into the CCP4 software suite. CCP4 is a community-based resource that supports the widest possible researcher community, embracing academic, not-for-profit and for-profit research. CCP4 aims to play a key role in the education and training of scientists in experimental structural biology. It encourages the wide dissemination of new ideas, techniques and practice.

              In this article, we give an overview of the CCP4 project: past, present and future. We begin with a historical perspective on the growth of the software suite, followed by a summary of the current functionality in the suite. We then discuss ongoing plans for the next generation of the suite, which is in development. In this account we focus on the suite as a whole, while other articles in this issue delve deeper into individual programs. We intend that this article could serve as a general literature citation for the use of the CCP4 software suite in structure determination, although we also encourage the citation of individual programs, many of the relevant references for which are included here. While we focus here on the CCP4 software suite, we would emphasize that comparable functionality is available in other software packages such as SHARP/autoSHARP (Vonrhein et al., 2007), SHELX (Sheldrick, 2008), ARP/wARP (Langer et al., 2008), PHENIX (Adams et al., 2010) and many others.

              2. Evolution of the CCP4 software suite

              The CCP4 software suite is a collection of programs implementing specific algorithms concerned with macromolecular structure solution from X-ray diffraction data. Significantly, it is a collection of autonomous and independently developed programs. While some have been commissioned by the academic committees overseeing the CCP4 project, the majority originate from the community to address a perceived gap in current functionality or to implement newly developed algorithms. The result is a collection of around 200 programs, ranging from large programs which are effectively packages in themselves to small 'jiffy' programs. Over the years the suite has grown continuously, with each major release featuring significant new software (see Table 1). Unsurprisingly, there is overlap of functionality, with several programs performing a particular task, albeit often using different approaches. The question then is how to combine these programs into a software suite, both in terms of ensuring communication between the different programs and in helping both naïve and experienced users to navigate through the suite.

              Early in the history of CCP4, it was agreed that all programs would use the same file formats for data files. Formats were specified for diffraction data (the LCF format, later replaced by the MTZ format) and for electron-density maps (the CCP4 map format), while for atomic coordinates the PDB format was adopted. A software library was developed to facilitate reading and writing of these data formats and thereby ensure standardization of the formats.
              Originally supporting only Fortran programs, the library was re-written to support both Fortran and C/C++ as well as scripting languages (Winn et al., 2002). The CCP4 set of libraries has since expanded to cover a wider range of crystallographic tasks, in particular with the addition of the Clipper library (Cowtan, 2003), the MMDB library (Krissinel et al., 2004) and the CCTBX library (Grosse-Kunstleve et al., 2002) from the PHENIX project (Adams et al., 2010).

              Crystallographic tasks were performed by writing or adapting scripts (e.g. Unix shell or VMS scripts) to link together a number of programs (Fig. 1a), and the suite can still be run in this way. The programs communicate solely via the data files which are passed between them. The user sets program options based on the program documentation and the expected results from earlier steps. A major change was introduced in 2000 with the release of the graphical user interface ccp4i (Fig. 1b; Potterton et al., 2003). Task interfaces help the user to prepare run scripts. Details of how to run specific programs are largely hidden, as are the jiffy programs used to perform minor functions such as format conversion. Some limited intelligence in the interface code allows program options to be customized according to properties of the data and/or the desired objective. ccp4i interfaces are now available for all of the commonly used CCP4 programs, as well as for several non-CCP4 programs (e.g. ARP/wARP; Langer et al., 2008). The ccp4i interface also introduced, for the first time, tools for helping the user to organize data. Jobs that have been run are recorded in a 'database' (in reality a directory of files), with tools to access and interpret the files saved there. Jobs are further organized into projects, representing different structure solutions. There are now plans to update the CCP4 GUI (see §4), but the impact of the original ccp4i on the suite should not be underestimated.

              In the last few years, two other modes of accessing the CCP4 suite have emerged. On the one hand, the latest version of the suite contains four complementary automation pipelines, namely xia2 (Winter, 2010), CRANK (Ness et al., 2004), MrBUMP (Keegan & Winn, 2007) and BALBES (Long et al., 2008). These pipelines attempt to perform large sections of the full structure solution (e.g. phasing) without user intervention. This is achieved partly through the use of a large number of trials, trying different protocols and performing parameter scanning. Such an approach can be very powerful, using cheap computer power to make many more attempts than a user would manually. Automation pipelines have been realised in the last few years because of the maturity of the underlying programs and the availability of sufficient computer power to support multiple trials. On the other hand, graphical programs for interactive use have become more powerful. Rather than simply reviewing the results of previously run programs and performing interactive model editing, Coot (Emsley et al., 2010) can launch separate refinement and validation programs (Fig. 1c). Similarly, iMOSFLM can be used to interface the data-processing programs POINTLESS and SCALA. In some ways this is a completely different scenario to the automation pipelines: user interaction is paramount, with crystallography programs acting as tools to be invoked. The user can become familiar with the data and structure and use this to make intelligent decisions.
3. Overview of current functionality

In this section, we give an overview of the current functionality of the CCP4 software suite (corresponding to release series 6.1 at the time of writing). We summarize the automation pipelines and individual programs included in the suite; many more details can be found in the accompanying articles in this issue. We present the functionality in the traditional manner, starting at data processing and ending at validation. However, it is becoming increasingly apparent that these neat categories are breaking down.

3.1. Data processing

The earliest starting point for entry into the CCP4 suite is a set of X-ray diffraction images. The data-reduction program MOSFLM (Leslie, 2006) will take a set of diffraction images, identify spots on each image, index the diffraction pattern and thus identify the Bragg peaks, and integrate the spots. The output is a list of integrated intensities and their standard uncertainties labelled by the h, k, l indices. Associated information includes the batch number of the image from which the intensity was obtained, whether the peak was full or partial and the symmetry operation that relates the particular observation to the chosen asymmetric unit. MOSFLM continues to be improved, with support recently added for Pilatus detectors, automatic backstop masking etc. The most visible change is the replacement of the old X-windows-based interface with the Tcl-based iMOSFLM interface (Fig. 2), which guides the user in a stepwise manner through the stages of data processing.

POINTLESS is a relatively new program whose primary purpose is to identify the Laue group of a crystal from an unmerged data set (Evans, 2006). The program will also attempt to identify the space group from an analysis of systematic absences. A secondary purpose is to test the choice of indexing and re-index a data set if necessary. Given a choice of space group, the program SCALA (Evans, 2006) will refine the parameters of a scaling function for an unmerged data set, apply scales to each observation of a reflection and merge all observations of a reflection to give an average intensity. It will also provide an improved estimate of the standard uncertainty of each intensity. The new program CTRUNCATE (which replaces the older TRUNCATE; Stein, unpublished program) can then convert the intensities to structure-factor amplitudes, although downstream programs increasingly use the mean intensities directly. Perhaps more importantly, CTRUNCATE will analyse a data set for signs of twinning, translational noncrystallographic symmetry (NCS), anisotropy and other notable features, since it is best to identify problems before attempting phasing. The program SFCHECK (Vaguine et al., 1999) will also provide an analysis of a data set, including testing for twinning and translational NCS, estimating the optical resolution and the anisotropy, and plotting the radial and angular completeness.

The previous steps of data processing are automated by the xia2 pipeline (Winter, 2010). From a directory of images, xia2 will identify the type of experiment (multi-wedge, multi-pass, multi-wavelength) and process accordingly. The pipeline will determine the point group, space group and correct indexing. Multiple processing pipelines using alternative underlying programs are supported. At the end, the user should have a set of merged structure-factor amplitudes suitable for input to phasing.
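Run by hand rather than through xia2, the steps after integration might look like the following sketch; file names and column labels are illustrative, and keywords should be checked against each program's documentation:

    #!/bin/sh
    # Determine the Laue group and put the data in a consistent indexing.
    pointless HKLIN unmerged.mtz HKLOUT sorted.mtz

    # Scale and merge all observations of each reflection.
    scala HKLIN sorted.mtz HKLOUT scaled.mtz << eof
    RUN 1 ALL
    ANOMALOUS ON
    END
    eof

    # Convert mean intensities to amplitudes; also checks for twinning etc.
    ctruncate -hklin scaled.mtz -hklout truncated.mtz \
              -colin '/*/*/[IMEAN,SIGIMEAN]'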
3.2. Experimental phasing

CCP4 includes the CRANK pipeline (Ness et al., 2004), which covers experimental phasing and beyond, and interfaces with several CCP4 and non-CCP4 programs. Heavy-atom substructure detection is performed by AFRO/CRUNCH2 (de Graaff et al., 2001) or by SHELXC/D (Sheldrick, 2008) and initial phasing is carried out by BP3 (Pannu et al., 2003; Pannu & Read, 2004) or SHELXE (Sheldrick, 2008). Phase improvement is carried out by SOLOMON (Abrahams & Leslie, 1996), DM (Cowtan et al., 2001) or Pirate (Cowtan, 2000) and automated model building by Buccaneer (Cowtan, 2006, 2008) or ARP/wARP (Langer et al., 2008). CRANK thus supports a range of underlying software, handling the communication of data and allowing the user to trial different combinations.

CCP4 includes a number of additional individual programs, each of which has its own particular strength. The long-standing CCP4 program MLPHARE for phasing still works in straightforward cases and is fast to use. ACORN (Jia-xing et al., 2005; Dodson & Woolfson, 2009) uses ab initio methods for the determination of phases starting from a small fragment, which could be a single heavy atom. The use of ab initio methods usually requires atomic resolution data, since it assumes atomicity of the electron density. However, a variant of the so-called free-lunch algorithm (Jia-xing et al., 2005) allows the temporary generation of phases to atomic resolution, which the ACORN method can utilize. The OASIS program (Wu et al., 2009) also uses ab initio methods to break the phase ambiguity in SAD/SIR phasing.

Phaser (McCoy et al., 2007) can obtain phase estimates starting from known heavy-atom positions and SAD data. Log-likelihood gradient (LLG) maps are used to automatically find additional sites for anomalous scatterers and to detect anisotropy in existing anomalous scatterers. Phaser can also use a partial model, for example from a molecular-replacement solution that is hard to refine, as a source of phase information to help locate weak anomalous scatterers and thus improve the phases. The latter reflects the view of experimental phasing and molecular replacement as just two sources of phase information rather than two separate techniques.

3.3. Molecular replacement

CCP4 includes two pipelines for molecular replacement (MR): MrBUMP (Keegan & Winn, 2007) and BALBES (Long et al., 2008). Both start from processed data and a target sequence and aim to deliver a molecular-replacement solution consisting of positioned and partially refined models. BALBES uses its own database of protein molecules and domains taken from the PDB and customized for MR, while MrBUMP uses public databases and a set of widely available bioinformatics tools to generate possible search models. BALBES is based around the MR program MOLREP (Vagin & Teplyakov, 1997, 2010), while MrBUMP can also use the program Phaser (McCoy et al., 2007). Both MOLREP and Phaser are also available as stand-alone programs in CCP4. As well as providing rotation and translation functions, whereby a search model is positioned in the unit cell to give an initial estimate of the phases, these programs provide additional functionality, including a significant contribution to automated decision-making. For instance, a single run of Phaser can search for several copies each of several components in the structure of a complex, testing different possible search orders and trying different possible choices of space group.
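Such a multi-component run can be driven with a short keyword script along the following lines. This is a sketch only: the ensemble names, file names and sequence-identity values are hypothetical, and the keyword spellings should be checked against the Phaser documentation:

    #!/bin/sh
    phaser << eof
    MODE MR_AUTO
    HKLIN truncated.mtz
    LABIN F=F SIGF=SIGF
    ENSEMBLE chainA PDB model_a.pdb IDENTITY 40
    ENSEMBLE chainB PDB model_b.pdb IDENTITY 30
    COMPOSITION PROTEIN SEQUENCE chain_a.seq NUMBER 2
    COMPOSITION PROTEIN SEQUENCE chain_b.seq NUMBER 2
    SEARCH ENSEMBLE chainA NUMBER 2
    SEARCH ENSEMBLE chainB NUMBER 2
    SGALTERNATIVE SELECT ALL
    ROOT mr
    eof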
The search model for MR may be an ensemble of structures, a set of models from an NMR structure or an electron-density map. Phases for the target may be available, so that the search model is to be fitted into electron density, or there may be density available from an electron-microscopy experiment. The MR step can be followed by rigid-body refinement and the packing of the MR solution can be checked. Much of this functionality is common to Phaser and MOLREP, but there are a number of differences in implementation, so that both may prove useful in certain circumstances.

A crucial component of MR is the selection and preparation of search models. The program CHAINSAW (Stein, 2008) takes as input a sequence alignment which relates residues in the search model to residues in the target protein and uses this information to edit the search model appropriately. The output model is labelled according to the target sequence. MOLREP (Lebedev et al., 2008) can take as input the target sequence and performs its own alignment to the search model in order to edit the search model.

3.4. Phase improvement and automated model building

Having obtained initial phases from experimental phasing, the next step is phase improvement (density modification) to give a map that can be built into. When phases come from molecular replacement, phase improvement may also be useful to reduce model bias. For a long time, the main CCP4 phase-improvement programs were DM (Cowtan et al., 2001) and SOLOMON (Abrahams & Leslie, 1996), which covered the standard techniques of solvent flattening/flipping, histogram matching and NCS averaging (a minimal DM run is sketched at the end of this section). More recently, statistically based methods have been incorporated into the program Pirate (Cowtan, 2000). Pirate can give better results, but has been found to be inconveniently slow. The latest program, Parrot (Cowtan, 2010), achieves similar improvements but is also fast and automated.

Given an electron-density map, automated model building is provided in CCP4 by Buccaneer (Cowtan, 2006, 2008). This finds candidate Cα positions, builds these into chain fragments, joins the fragments together and docks a sequence. NCS can be used to rebuild and complete related chains. Since version 1.4, there is support for model (re)building after molecular replacement and for supplying known structural elements such as heavy atoms. The CCP4 suite includes an interface for alternating cycles of model building with Buccaneer and cycles of model refinement with REFMAC5. The supplementary program Sloop (Cowtan, unpublished program) builds missing loops using fragments taken from the Richardsons' Top500 library of structures (Lovell et al., 2003) to fill gaps in the chain. The chance of finding a good fit falls with increasing size of the gap, but the method may work for loops of up to eight residues in length. RAPPER (Furnham et al., 2006) provides a conformational search algorithm for protein modelling, which can produce an ensemble of models satisfying a wide variety of restraint information. In the context of CCP4, restraints on the modelling are provided by the electron density and/or the locations of the Cα atoms. The ccp4i interface includes modes for loop building or for building the entire structure.
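The DM run referred to above might look like the following sketch; the solvent content and column labels are illustrative, and the keyword details should be checked against the DM documentation:

    #!/bin/sh
    # Solvent flattening and histogram matching with DM.
    dm HKLIN phased.mtz HKLOUT improved.mtz << eof
    SOLC 0.47
    MODE SOLV HIST
    NCYCLE AUTO
    LABIN FP=FP SIGFP=SIGFP PHIO=PHIB FOMO=FOM
    LABOUT PHIDM=PHIDM FOMDM=FOMDM
    END
    eof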
3.5. Refinement and model completion

The aim of macromolecular crystallography is to produce a model of the macromolecule of interest which explains the diffraction images as accurately and completely as possible. Both the form of the model and the parameters of the model need to be defined. Refinement is the process of optimizing the values of the model parameters and in CCP4 is performed by the program REFMAC5 (Murshudov et al., 1997). REFMAC5 will refine atomic coordinates and atomic isotropic or anisotropic displacement parameters (Murshudov et al., 1999), as well as group parameters for rigid-body refinement and TLS refinement (Winn et al., 2001, 2003). It will also refine scaling parameters and a mask-based bulk-solvent correction. When good-quality experimental phases are available, these can be included as additional data (Pannu et al., 1998). More recently, it has become possible to refine directly against anomalous data for the cases of SAD (Skubák et al., 2004) and SIRAS (Skubák et al., 2009) without the need for estimated phases and phase probabilities. REFMAC5 will also now refine against twinned data (Lebedev et al., 2006), automatically recognising the twin laws and estimating the corresponding twin fractions.

The nonprotein contents of the crystal are often of most interest, such as bound ligands, cofactors, metal sites etc. Correct refinement at moderate or low resolution requires a knowledge of the ideal geometry together with associated uncertainties. In REFMAC5 this is handled through a dictionary of possible ligands (Vagin et al., 2004), with details held in mmCIF format. Dictionary files can be created through the tools SKETCHER and JLIGAND.

Refinement goes hand-in-hand with rounds of model building which add/subtract parts of the model and apply large structural changes that are beyond the reach of refinement. In addition to the automated procedures of Buccaneer and RAPPER described above, there are many model-building tools in Coot (Emsley et al., 2010). A ccp4i interface to the popular ARP/wARP model-building package (Langer et al., 2008) has also been available for many years.
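A typical restrained-refinement run with REFMAC5 can be scripted as below; the column labels, cycle count and file names are illustrative:

    #!/bin/sh
    # Ten cycles of restrained refinement with automatic weighting.
    refmac5 HKLIN free.mtz XYZIN model.pdb \
            HKLOUT refined.mtz XYZOUT refined.pdb << eof
    LABIN FP=FP SIGFP=SIGFP FREE=FreeR_flag
    REFI TYPE RESTRAINED
    NCYC 10
    WEIGHT AUTO
    END
    eof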
3.6. Validation, deposition and publication

Validation is the process of ensuring that all aspects of the model are supported by the diffraction data, as well as conforming with known features of protein chemistry. Although validation has traditionally been viewed as something that is performed at the end of structure determination, just before deposition, it is now appreciated that validation is an integral part of the process of structure solution, which should be carried out continually. CCP4 includes a wide variety of validation tools, all of which should be run to gain a complete picture of model quality. Coot (Emsley et al., 2010) has a dedicated drop-down menu of validation tools which can and should be applied as the model is being built. Coot can also extract warnings about particular links or outliers from a REFMAC5 log file. Warnings associated with specific atoms or residues are linked directly to the model as viewed in Coot. The ccp4i ‘Validation and Deposition’ module contains further validation tools. As mentioned above, SFCHECK (Vaguine et al., 1999) provides a number of measures of data quality, but if a model is provided it will also assess the agreement of the model with the data. Sequins (Cowtan, unpublished program) validates the assigned sequence against electron density (generated from experimental phases or from phases calculated from a side-chain omit process) and warns of misplaced side chains or register errors. RAMPAGE (which is part of the RAPPER package; Furnham et al., 2006) provides Ramachandran plots based on updated ϕ–ψ propensities. PROCHECK is also included, although the Ramachandran plots are no longer generated, having been superseded by RAMPAGE. R500 (Henrick, unpublished program) checks the stereochemistry in a given PDB file against expected values and lists outliers in REMARK 500 records.

The quaternary structure of the protein can be analysed with PISA (Krissinel & Henrick, 2007). This considers all possible interfaces in the crystal structure, estimates the free energy of dissociation, taking into account solvation and entropy effects, and predicts which interfaces are likely to be of biological significance.

The CCP4 molecular-graphics program CCP4mg (Potterton et al., 2002, 2004) provides a simple means of generating publication-quality images and movies. As well as displaying coordinates in a wide variety of styles, CCP4mg can display molecular surfaces, electron density, arbitrary vectors and labels. The latest versions are built on the Qt toolkit, giving an enhanced look and feel (Fig. 3). Structures and views can be transferred between CCP4mg and Coot.

3.7. Jiffies and utilities

In addition to the main functionality described above, the CCP4 suite contains a large number of utilities for performing format conversions and various analyses. Reflection data processed in other software packages can be imported with the utilities COMBAT, POINTLESS, SCALEPACK2MTZ, DTREK2SCALA and DTREK2MTZ, while data can be exchanged with other structure-solution packages with CONVERT2MTZ, F2MTZ, CIF2MTZ, MTZ2VARIOUS and MTZ2CIF. There are several useful utilities based on the Clipper library (Cowtan, 2003), such as CPHASEMATCH, which will compare two phase sets and look for changes in origin or hand. There are also many useful utilities for analysing coordinate files. New programs based on the MMDB library (Krissinel et al., 2004) include NCONT for listing atom contacts and PDB_MERGE for combining two PDB files.
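As an example of these coordinate utilities, NCONT might be run as follows to list contacts between two chains. The chain identifiers and distance cut-off are hypothetical, and the selection keywords should be checked against the NCONT documentation:

    #!/bin/sh
    # List contacts from chain A to chain B within 4.0 Angstroms.
    ncont XYZIN refined.pdb << eof
    SOURCE A/*
    TARGET B/*
    MAXDIST 4.0
    END
    eof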
4. Future plans

At the heart of the CCP4 suite is the set of algorithms encoded in individual programs. As always, we include new programs in each major release of the suite and will continue to do so. Since the source of novel software is usually independent developers, the additions to the suite are not centrally planned. Nevertheless, some current themes are clearly recognisable, such as automated model building, in particular for low-resolution data.

CCP4 also aims to enhance its functionality related to the maintenance and use of data on small molecules (ligands). Firstly, a considerably larger library of chemical compounds will be provided with the suite. Extended search functions will be provided to allow the efficient retrieval of known compounds or their close analogues. Secondly, existing functions for generating restraint data for new ligands will be enhanced by the inclusion of relevant software such as PRODRG (Schüttelkopf & van Aalten, 2004) into the suite, as well as by the development of new methods for structure reconstruction on the basis of partial similarity to structures in the library. Functionality will be available through a graphical front-end application, JLIGAND.

In addition to the core programs, the infrastructure of CCP4 continues to evolve to support the latest working practices. The current CCP4 GUI, ccp4i, was a major innovation and has served us well for over ten years (Potterton et al., 2003). While it continues to provide a useful interface to the CCP4 suite, there are increasing demands from automation pipelines and users alike. In particular, there is a requirement to provide help on what to try next, advice which can be useful to both scientists and automated software. This depends on a robust assessment of the experimental data and the results of previous processing, which in turn requires good data management. We aim to address these issues through the development of a next-generation CCP4 interface.

There will also be changes in the way that CCP4 is delivered to the end user. We have all become used to automated updates to the software we use (e.g. Windows Update, Synaptic for Debian-based Linux or application-specific updates such as for Firefox). Some CCP4 programs do alert the users to the availability of newer versions and CCP4mg (Potterton et al., 2002, 2004) will update the version on request. A CCP4-wide update mechanism is more difficult given the heterogeneous nature of the suite, but efforts in this direction are under way. A specific example of a remotely maintained crystallography platform is given by the US-based SBGrid Consortium.

The CCP4 suite is downloaded to a user's machine or a local server before being run. This is in contrast to many biology software tools, which are web-based. Reasons for running CCP4 locally include the wallclock time of jobs, the detailed control required and the size of data files. Nevertheless, there is increasing usage of web servers for crystallographic tasks. A server at York (http://www.ysbl.york.ac.uk/YSBLPrograms/index.jsp) runs a number of CCP4 programs, including BALBES and Buccaneer, while CCP4 programs are included in a number of other services, for example the ARP/wARP server at Hamburg (http://cluster.embl-hamburg.de/ARPwARP/remote-http.html). Plans are under way to make more CCP4 functionality available via the web.

Finally, the coming years will see increasing integration of crystallography with other techniques, both experimental and theoretical. CCP4 aims to contribute towards efforts, such as the European infrastructure project INSTRUCT, to ease the transfer of data to and from these other domains.

                Author and article information

                Contributors
                patel.o@wehi.edu.au
                lucet.i@wehi.edu.au
                Journal
                Commun Biol
                Commun Biol
                Communications Biology
                Nature Publishing Group UK (London )
                2399-3642
                20 September 2021
                20 September 2021
                2021
                Volume: 4
                Article number: 1105
                Affiliations
                [1] The Walter and Eliza Hall Institute of Medical Research, Parkville, VIC, Australia
                [2] Department of Medical Biology, University of Melbourne, Parkville, VIC, Australia
                Author information
                http://orcid.org/0000-0001-6701-7139
                http://orcid.org/0000-0003-0198-9108
                http://orcid.org/0000-0002-8014-8552
                http://orcid.org/0000-0002-8563-8753
                Article
                2631
                10.1038/s42003-021-02631-y
                8452690
                34545159
                6eb74ae2-7c19-4a65-b129-e55e82bf4cf1
                © The Author(s) 2021

                Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

                History
                Received: 28 April 2021
                Accepted: 1 September 2021
                Funding
                Funded by: FundRef https://doi.org/10.13039/501100000923, Department of Education and Training | Australian Research Council (ARC);
                Award ID: FT120100056
                Award Recipient :
                Funded by: FundRef https://doi.org/10.13039/501100000925, Department of Health | National Health and Medical Research Council (NHMRC);
                Award ID: APP1162058
                Award Recipient :
                Funded by: FundRef https://doi.org/10.13039/501100000947, Australian Cancer Research Foundation (ACRF);
                Categories
                Article
                Custom metadata
                © The Author(s) 2021

                structural biology, x-ray crystallography
