Comments on:

JASON Report JSR 97-315, "Human Genome Project" (version circulated Dec. 1997)

Phil Green (phg@u.washington.edu)

Comments are arranged by section. My comments are in formatted text; responses by Dr. Steven Koonin (principal author of the JASON report, koonin@caltech.edu) are given in unformatted text lines starting with ">"; each of these responses is followed by, or interspersed with, my reply to his response, again in formatted text.

section 1.2.2: Need for longer read lengths. I would disagree with the statement that the need for longer read lengths is unique to the genome project -- almost any sequencing application would benefit from having them. In fact nearly all of the improvements in ABI sequencer technology that would result in higher throughput, and that genome sequencers have pushed for, would benefit almost anyone who uses an ABI sequencer. The real reasons why progress in this respect has been slow are (i) that ABI has essentially no competition, and (ii) that throughput improvements (including read length, which affects throughput by reducing the number of reads that are needed) tend to reduce the number of machines they can sell.

Koonin response:
> Most (if not all) of the non-HGP briefers we heard from expressed
> no driving NEED for longer read lengths, while all of the HGP briefers
> agreed that their requirements of accuracy and completeness made longer
> reads VERY important.  Hence, it's a distinction between "nice to have"
> and "needed."  Perhaps not strictly "unique" by a rigorous definition,
> and so we've modified the sentence to read "Thus, the goal of the entire
> sequence implies technology needs for which there are limited medical or
> pharmaceutical drivers."
> 
> We agree with the points (i) and (ii) you raise.

I would still disagree even with this rewording. An increase in read length to over 1 kb would be enormously useful to the EST sequencers in pharmaceutical and genomics companies, because it would mean that in many cases they could get the entire coding region of a gene with a single read. While a longer read length would cut costs for genome sequencers, it is not a "need". One can see this by the fact that most genome centers are not even making serious attempts to maximize the read lengths they can obtain with current machines and chemistries. Our own center, which has put significant emphasis on routinely obtaining long reads, is an exception.

section 1.2.3: Scaleup required. The scaleup required to get to 400Mb/yr is more like a factor of 5 to 10, rather than "30-100", since the current worldwide genome sequencing capacity appears to be approaching 100Mb/year (the combined St. Louis + Sanger Center output is already in that range, although not all of their current effort is in human sequencing). Admittedly even a factor of 5 is intimidating and probably won't be attained for the next few years, which means the peak yearly output will have to be higher than 400Mb in order to finish by 2005.

Koonin response:
>  I think the confusion here is between average and peak capacity. 
> 400Mb/yr is the average capacity needed over the next seven years.  But
> if you assume a geometric growth, then you need about 5 doublings of the
> present 50Mb/yr capacity over the next 7 years to get to 3000 Mb
> (50+100+200+400+800+1600=3150).  This gives a doubling time of about (7
> years)/5 ~ 1.4 years = 17 months and a peak capacity of 1600 Mb in the
> last doubling period  (1000 Mb/year), which is 20 times the present
> capacity.    The estimate in the report was done using a continuous model
> with a present capacity of 40 Mb/yr and a doubling time tau
> [rate=40exp(ln2*t/tau)].  Demanding that the integral of this from t=0 to
> t=7 years gives 3200 Mb results in tau=1.3 years (16 months) and a final
> capacity of exp(ln2*7/1.3)=42 times that of the present. Either way, it's
> a lot more than 5-10.

Of course, this all depends on the numbers one plugs in, and I disagree with yours. There are 8 years (not 7) between now and the end of 2005, and 100 Mb/yr (rather than 40 or 50 Mb/yr) is a better estimate for the current (1998) capacity (in fact 100 Mb was the number you used in your Science perspective). With those values, a geometric growth model with roughly a 36% annual increase in capacity will obtain the 3 Gb sequence by the deadline, and will require an output in the final year of about 870 Mb. This is an 8.7-fold increase in capacity over the current amount, which is indeed consistent with the 5- to 10-fold range I was claiming.
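The arithmetic behind these figures can be checked with a short sketch. It assumes a discrete model in which capacity is constant within each calendar year, 1998 counts as the first of the 8 years at 100 Mb, and output grows by a fixed percentage from one year to the next. The function names are illustrative only.

```python
def total_output(start_mb, growth, years):
    """Total sequence (Mb) produced over `years` years of geometric growth."""
    return sum(start_mb * (1.0 + growth) ** k for k in range(years))


def required_growth(start_mb, target_mb, years, tol=1e-9):
    """Bisect for the annual growth rate whose cumulative output reaches target_mb."""
    lo, hi = 0.0, 10.0  # bracket: 0% to 1000% growth per year
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if total_output(start_mb, mid, years) < target_mb:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Values taken from the text: 100 Mb/yr in 1998, 8 years through 2005, 3 Gb total.
START_MB, TARGET_MB, YEARS = 100.0, 3000.0, 8

growth = required_growth(START_MB, TARGET_MB, YEARS)
final_year = START_MB * (1.0 + growth) ** (YEARS - 1)

print(f"annual growth needed: {growth:.1%}")
print(f"final-year output:    {final_year:.0f} Mb")
print(f"fold increase:        {final_year / START_MB:.1f}")
```

Under these assumptions the required growth rate comes out near 36% per year and the final-year output near 870 Mb, in line with the figures quoted above. Changing START_MB to 40 or 50 and YEARS to 7 shows how strongly the required scaleup depends on the inputs.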

The more important point to be made though is that a geometric growth model is not a very good one to use here. It will be much harder to make a 36% increase in output later in the project than it is now, because certain resources -- particularly scientific and management talent, skilled labor, and laboratory space -- will become increasingly scarce. The true growth pattern is more likely to approximate an S-curve, with an initial period of relatively slow growth (which hopefully is largely completed by now) as sequencers "learn the ropes" of producing high-quality sequence in a cost-effective manner, followed by a period of relatively steep growth as they exploit this knowledge to expand high-throughput operations, followed by a levelling off period as resources become limiting. Although the specifics are hard to predict, my guess would be that the growth will be much more rapid than 36% / yr over the next few years, and then level off at a capacity substantially less than 800 Mb / yr.

section 1.2.5: Project coordination. It was disappointing to me that there was not more discussion of Big Science issues in this report, since this is clearly an area where the JASONs have a significant amount of experience and could provide useful advice if they chose to. The problems of organizing and running a large project, while maintaining not only data quality standards but also the interest and commitment of talented individuals, are ones that directors of sequencing centers are struggling with and would welcome any advice they can get. Nonetheless it should be pointed out that in some respects the genome project is fairly different from other examples of Big Science since it can more easily be (and is being) carried out in distributed fashion.

Koonin response:
>  We have emphasized "bottom up" and "science driven" as important
> foundations for organizing and running the project.  The distributed
> nature of the HGP is certainly acknowledged in our Informatics section
> (particularly the introduction) and in the closing comments of my Science
> Policy Forum.  As an aside, I note that one way to retain talented and
> motivated people is to pay them a lot more than the current NIH postdoc
> scale, which is far lower than in any other science.
> 

I believe there is very little staffing of genome centers by postdocs paid according to the NIH scale, and while I agree the scale should be changed, that doesn't address the problems I was alluding to. These more critical problems have to do with balancing young scientists' need for establishing an independent identity leading to a future career in academia or industry, against the need to be a "team player" involved in a production operation with many uninteresting and unrewarding aspects. Surely this tension occurs frequently in Big Science projects and you must have some relevant insights to offer.

section 1.4.1: Know thy system. While I certainly agree with this maxim, it is much harder to do this than the JASONs seem to realize. I believe none of their subsequent suggestions to use simulations or theoretical analysis for this purpose are likely to give useful information, for reasons given below. Such information is obtainable, but only through experiments requiring new data collection.

Koonin response:
>   By "Know thy system," we certainly mean modeling and analysis based
> upon experiments and experience.  To quote from a paragraph in Section
> 3.3.1 ("A systems approach is required"):
> 
> 
>       "Results from simulations are only as good
> as the models used for introducing and propagating errors.  For this
> reason, the computer models must be developed in close association with
> technical experts in all phases of the process being studied so that they
> best reflect the real world.  This exercise will stimulate new
> experiments aimed at the validation of the error-process models, and thus
> will lead to increased experimental understanding of process errors as
> well."

I think you really have no understanding of the difficulty involved in carrying out such a program. What would be required is a major research effort depending upon, among other things, new biological discoveries of a sort that is impossible to predict. Consider just a few of the steps in the sequencing process that would need to be elucidated before they can be modelled:

Subclone growth. The non-randomness of subclone libraries is well known, and plays an important role in the shotgun sequencing process because it affects the amount of finishing (gap-filling) effort that is required. Your proposal would require obtaining a thorough knowledge of the reasons for this non-randomness to the point that we can quantitatively model it. That would be extremely difficult, because the non-randomness depends upon interactions between features of the cloned sequence, and of the biology of E. coli, which are not understood and almost certainly involve biological phenomena not even suspected at present. Certain clone sequences are apparently toxic to E. coli, or cause problems with replication; we would have to acquire a better understanding both of the clone sequences (how often do sequences of a particular type occur in the human genome) and of relevant details of E. coli biology (why and to what degree are various types of sequences toxic, how does replication efficiency depend on characteristics of the sequence, etc.). Not to mention the fact that there is a dependence as well on a number of laboratory variables -- the precise conditions under which the E. coli are grown and how these affect replication efficiency of different sequences, how shearing, ligation, etc. of the DNA carried out in the process of creating the library affect non-randomness, and so on.

Also relevant to this stage of the sequencing process is getting a better, quantitative understanding of the mutations that occur in the subclones (deletions, point mutations). Again, this requires understanding how the clone sequences interact with the biology of E. coli. It will be extremely difficult to get the quantitative handle on these issues that your approach requires.

Editing. Like the above, this is a process carried out by biological organisms (in this case humans). Under your proposal, we would need to understand, well enough to quantitatively model them, the errors made (or corrected) by a human reviewing the sequence and performing editing. This is obviously extremely complex: it depends not only on specific characteristics of the traces and sequence but on the level of experience, training, and intelligence of the finisher, on variations in ability to process visual information, on variations in understanding of the sequence data collection process and ability to integrate that with the visual information, etc. What types of trace patterns lead to editing errors, at what frequency, and with what dependence on training? And what about assembly errors and the relative ability of finishers to detect and correct these? I have a very hard time thinking we will be able to model these issues quantitatively any time soon.

The many other ways that human intervention affects the sequencing process, leading to various sorts of errors, would also need to be understood well enough to be quantitatively modelled. For example, human variability in reagent preparation, sample labelling, loading gels, running machines, and adhering to protocols all have an impact on data quality and error rates in critical ways. Obviously, we have been working hard to try to eliminate humans from the loop as much as possible, but they are still going to be involved for the foreseeable future.

Even those parts of the process that do not involve biological organisms are far beyond our current modelling capabilities. We do not for example understand the details of sequence specific effects on nucleotide incorporation efficiency by the polymerase, but this is essential if we are to be able to quantitatively model trace data; likewise the nature and effects of impurities in the sequencing reactions, of variation in buffer salts, etc. Even such a purely "physical" process as electrophoresis is still not well understood. The gel matrix, and the characteristics of single-stranded DNA molecules moving through this matrix, are beyond our current capabilities to model accurately. No one is even close to being able to predict irregularities in migration of different fragments as a function of sequence (e.g. compressions), or how even a moderate change in voltage will affect band resolution.

So even just modelling the sequencing process as currently practiced is well beyond our capabilities. However under your proposal we would need much more than that -- we would need to understand all of these phenomena well enough that we could accurately simulate a perturbed version of the process (e.g. to see what would happen if we were to use a different polymerase, different reaction conditions, a different level of training for individuals doing the editing, a different bacterial host for growing subclones or different conditions for the current host, different electrophoresis conditions), in order to optimize the process itself. While our understanding of all aspects of the process is certainly increasing incrementally, in my opinion there is no hope whatsoever that we could get to the point of reliable quantitative modelling on a time frame useful for the genome project, no matter how much money you were to put into investigating the above questions. Their solution will depend on novel research discoveries of a nature we can't anticipate. And by the time you were able to get results, sequencing technology almost certainly will have changed enough to make your results irrelevant.

The most important point though is that there is already a well-understood, simple, direct way to explore potential improvements to the sequencing process. To see how a change in the process affects cost or accuracy, all we have to do is carry out a resequencing experiment -- sequence a set of clones with and without the change, and determine the cost and sequence accuracy under each method. This is simple and much cheaper than the route you propose. It involves no computer simulation whatsoever and is guaranteed to give useful information.

sections 1.4.2 - 1.4.4: As will be apparent later I disagree with several aspects of these recommendations.

section 1.4.3: Phred and phrap produce error probabilities for the base calls and consensus sequence, not "figures of merit". I for one would find it useful to see more details for the claim (made several times in this report) that these measures would benefit from "innovative research" aimed at improving them.

Koonin response:
>  The probability as reported by PHRED/PHRAP is certainly a
> "figure of merit", although not the only one possible.  For example,
> Section 4.2.4 of our report discussed the following possible
> generalizations of a single-base probability:
> 
> "Whatever algorithms are used it is
> important that the called sequence of bases have associated confidence
> values together with an interpretation of what these values are supposed
> to mean.  For example confidence values could be pairs of numbers, the
> first representing the confidence that the base call is correct and the
> second representing the confidence that the base called is the next base.
> One might also consider adding a third coordinate representing the
> confidence that the called base corresponds to one base as opposed to
> more than one. These values should continually be checked for internal
> consistency; every read should be compared to the assembled sequence.
> This comparison involves the alignment of the read against the assembled
> sequence minimizing an adjusted error score."

Part of what is so aggravating about your report is that you make many suggestions that (when they aren't outright wrong) are obvious to people in the field. What is really needed is not a list of all possible avenues of research, but some perspective about which potential avenues are likely to be useful, i.e. have a high likelihood of reducing costs or increasing throughput or accuracy, and which avenues are unlikely to have a major impact. I found such perspective almost totally lacking in your report. The above is a good example of this: It is obvious that, having introduced single base confidence measures in phred, we could extend these to probabilities for the different types of errors in the manner you indicate. Everyone is aware that such an extension is possible. The real question is, is there any reason to think this will offer significant improvements in quality or cost of sequencing? I am very skeptical that it will, and you provide no reason at all to think otherwise.

Incidentally as my original comments pointed out (see below, comment on pp. 51-52), the "internal consistency" checks that you recommend here are already carried out in phrap.

>       "Finally, there are currently several degrees of freedom in sequencing. 
> Two,  that could yield different (and hopefully independent) processes
> are:
> 
> 1. Using dye labeled
> primers versus dye labeled terminators;
> 
> 2. Sequencing the complementary strand."

These are already taken into account by phrap in computing error probabilities for the consensus from those for the individual reads.

> "Correlated errors define
> an upper bound in the accuracy of base calling algorithms that cannot be
> surmounted by repeated sequencing using the same chemistry.  Ideally the
> confidence values assigned to individual base calls would closely
> correspond to these intrinsic errors.  This can (and should) be tested
> experimentally."

section 2.1.1: I tend to disagree that there are clear "software opportunities" for improving gel reading or assembly, and that "specific improvements might have a dramatic impact on the Project". There are certainly some improvements to be made but these will be incremental in nature and will not have a large impact on either cost or quality. Moreover I would argue that software improvements in read length would actually have LESS impact on the genome project than in commercial applications. Many commercial applications, e.g. ESTs or low pass sequencing of disease genes, would benefit from having longer read lengths. However with genome sequencing the accuracy requirement is substantially higher, which effectively means that only the high quality part of the read is useful in deriving the consensus sequence. The basecalling accuracy in that high quality part is already excellent -- the main improvement that is needed is better basecalling in compressions, which appears solvable by taking into account the sequence features that govern hairpin loop formation. Improved read lengths will come from improving accuracy in the part of the read where the data quality itself is intrinsically lower, due to the fact that peaks are poorly resolved and the signal-to-noise ratio is lower. The accuracy here can certainly be improved somewhat, and this will have the effect of improving completeness of assembly following the shotgun stage, which in turn is useful because it makes finishing somewhat more efficient by helping delineate the size of the regions where additional data are required; but I am very skeptical that it can be made accurate enough to permit its use in deriving the consensus sequence, which is what would be required for it to significantly reduce costs by reducing the number of reads needed. Hardware and/or sequencing chemistry improvements could certainly have a major impact in this respect, but I am very dubious that software improvements will.

Koonin response:
>   You seem to be saying:  "The accuracy is good where
> it's good, bad where it ain't, and there's not much we can do about it." 
> I doubt that this is what you mean, so we need to talk about this one. 

No, of course that isn't what I meant. Let me try again. It is important to distinguish between the underlying (raw) data quality on the one hand, and the accuracy of the base calls given that data quality on the other. Software improvements (which is what you are proposing here) will only improve the accuracy of the base calls given the data quality, and will not improve the underlying raw data quality itself. My strong feeling from looking at a lot of data is that (i) in the part of the trace where the underlying data quality is high, the accuracy of the base calls is already about as good as it is going to get -- the only real improvement to be expected is better calling of compressions; and (ii) in the part of the read where data quality is lower, there is likely to be an absolute upper bound to how accurate the basecalls can be made; certainly their accuracy can be improved somewhat, but I would be very surprised if it can be made accurate enough to make that part of the read usable in deriving the consensus. There are just too many inherent ambiguities in the lower quality data due to merged peaks, poor signal to noise, etc.

Basically the cost of shotgun sequencing is determined by the number of reads needed, which in turn is determined by the read length. There are two relevant kinds of "read lengths" -- the read length usable in assembly (i.e. the part of the read usable in making joins), and the read length usable in deriving the consensus. The latter is much smaller than the former because it is determined by the highly accurate part of the read, which I'm arguing is inherently limited to the high-underlying-data-quality part of the read. Because it is smaller, the read length usable in deriving the consensus is the primary determinant of the number of reads needed and thus the overall cost of the project. My argument above is that the read length usable in deriving the consensus is not going to be increased much by software improvements.
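The dependence of read count on usable read length can be made concrete with a minimal sketch. It assumes the simplest possible model, in which the number of reads equals the target coverage depth times the clone length divided by the read length usable in the consensus. The clone size, coverage depth and read lengths below are hypothetical values chosen for illustration, not measurements from any center.

```python
import math


def reads_needed(clone_bp, coverage, usable_bp):
    """Reads required to cover a clone to the given depth, counting only
    the part of each read that is usable in deriving the consensus."""
    return math.ceil(coverage * clone_bp / usable_bp)


# Hypothetical illustration values (not taken from the text).
CLONE_BP = 40_000   # assumed cosmid-sized insert
COVERAGE = 8        # assumed shotgun coverage depth

for usable_bp in (400, 500):
    n = reads_needed(CLONE_BP, COVERAGE, usable_bp)
    print(f"{usable_bp} usable bases/read -> {n} reads")
```

In this model the read count falls only when the consensus-usable length grows. Lengthening the lower-quality tail of each read, which helps with joins but not with the consensus, leaves usable_bp and therefore the read count unchanged.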

I certainly believe further software improvements can improve the read length usable in assembly, and as I indicated in my comment this will be useful because it will help in finishing by assisting gap-closure, reading through repeats etc. It just won't have a large impact on cost, which is what you were claiming.

> We agree that the accuracy can be improved in the "hard" part of the read
> and that will improve completeness of assembly and make finishing
> somewhat more efficient.  Indeed, we thought that this is what we said.

My point was that this was the ONLY improvement to be expected from software improvements to read accuracy and that contrary to your assertion this will not significantly impact cost.

section 2.1.1: Lane tracking. Greater lane density will actually not have a huge impact on cost, because machine costs (which are all that would be reduced) are a minor fraction of the total costs. Moreover to the extent that increased lane density degrades quality (which it inevitably will) it may actually tend to increase costs, because reducing the number of high-quality bases per read increases the number of reads that are needed.

I do agree that better lane-tracking software would be useful, and indeed there are several efforts underway to produce this. By the way it is already possible to get access to the raw trace data directly from ABI's files.

Koonin response:
>  Increased lane density would improve electrophoresis throughput,
> which would be useful, we believe (and were so told by many sequencers,
> who are clamoring for ABI machines with more lanes).  This should not,
> of course, come at the expense of quality.

Although you say that "of course" this should not come at the expense of quality, in fact there will inevitably be a quality consequence -- narrower lanes mean greater probability of loading errors and of lane tracking errors, and poorer signal to noise characteristics for the lane profile. The issue is a quantitative one, and my point was that it is very unlikely that one will get significant cost savings by going to a higher density. The machine costs (which is really all that would be saved) are already a fairly minor component of the per base costs, and it would take very little degradation of quality to entirely wipe out the savings gained by increased throughput per machine.
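A rough break-even calculation illustrates how little quality loss it takes to cancel the savings. The sketch below assumes that cost per finished base is proportional to cost per read divided by usable bases per read, that higher lane density reduces only the machine share of the per-read cost, and that degraded quality shows up as a fractional loss of usable bases per read. The 10% machine share and the doubling of lane density are hypothetical inputs, not figures from the report or from any center.

```python
def relative_cost(machine_fraction, lane_factor, quality_loss):
    """Cost per finished base relative to the current process (1.0 = unchanged).

    machine_fraction: share of per-read cost due to the machine (assumed)
    lane_factor:      multiplicative increase in lanes per gel (assumed)
    quality_loss:     fractional loss of usable bases per read (assumed)
    """
    per_read = machine_fraction / lane_factor + (1.0 - machine_fraction)
    return per_read / (1.0 - quality_loss)


def break_even_quality_loss(machine_fraction, lane_factor):
    """Quality loss at which the denser gel saves nothing."""
    return machine_fraction * (1.0 - 1.0 / lane_factor)


# Hypothetical inputs: machines are 10% of per-read cost, lane density doubles.
F, K = 0.10, 2.0

print(f"best case saving:   {1.0 - relative_cost(F, K, 0.0):.1%}")
print(f"break-even loss:    {break_even_quality_loss(F, K):.1%}")
print(f"cost at 8% loss:    {relative_cost(F, K, 0.08):.3f}")
```

With these assumed inputs the most that can be saved is 5% of the per-base cost, and a 5% loss of usable bases per read wipes it out entirely. Any larger loss makes the denser gel more expensive per finished base than the current one.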

The fact that many sequencers do not understand this point is no reason for you to repeat their error! I would encourage you to indeed try to provide "An Independent Perspective" (as you promised in your Science policy forum) and not simply repeat the complaints of those in the field, which are not always carefully thought through. Furthermore it is a bit frustrating that, although your report stresses the need for quantitative analysis (and I agree with that emphasis, up to a point), here and in other places in your report you make recommendations that are not supported by any quantitative analysis, and that would be contradicted by a quantitative analysis if one were carried out.

>  On the other point you raise,
> more than one person told us you couldn't get at the raw trace data from
> an ABI machine except through considerable "reverse engineering" and
> likely voiding of the warranty.  If you know of a way to get the raw
> trace data easily (has ABI made this a feature now?) please tell us and
> the rest of the community.  Many folks would like to be able to do 
> this.

I think you are confused about the distinction between the raw gel image, and the raw trace. It is true that direct capture of the raw gel data requires modification to the machine. However the raw trace (as well as the processed one) is recorded in the ABI chromatogram file and easily accessible.

section 2.1.2: Basecalling. I don't agree that the methods listed here are likely to be of much use in improving basecalling accuracy. The most important remaining sources of error are those having to do with specific phenomena with a known biological basis -- e.g. compressions, which are due to hairpin loop formation. The way to attack these is to understand the underlying cause better, e.g. the specific rules that govern the structure of stable hairpins, and then build that into the algorithm. Generic methods from an unrelated discipline are unlikely to be of much use.

Koonin response:
>  If you (or someone else you know) has seriously investigated the
> applicability of the methods cited to the basecalling problem, we'd like
> to know about it.  We agree that compressions and such are very important
> to understand.  Indeed, this is one element of the "Know thy system"
> dictum.

I don't recall anything in your report discussing the importance of studying compressions. Obviously a major aspect of "Knowing thy system" is in having a sense of which specific elements of the system are more important and which are less important, and I found your report lacking in that sort of perspective.

section 2.1.2: Assembly. Phrap is misdescribed here -- a greedy algorithm is used for the initial assembly, but the assembly is subsequently revised if problems are encountered. More importantly, I again think it is a mistake to expect that improvements in assembly will have a significant impact on cost, quality, or throughput. Even the pure greedy method in phrap gives the right assembly for greater than 95% of cosmids. The main remaining issues are dealing with the relatively small number of cases where there are large essentially perfect repeats, or many small repeats, and it seems that fairly simple ad hoc methods for dealing with these will work except for the (very rare) "cosmid from hell".

Incidentally (1st para p. 14) the phrap error probabilities already reflect sequence-dependent and trace-effects and these are taken into account in the assembly. The statement "similar techniques can be used to handle assembly in the presence of repeats" is puzzling, since repeats are really the ONLY problem in assembly!

The comments here about "a complete search for the optimum map" are somewhat misplaced, since one does not actually search for an "optimum" in practice. The natural definition of "optimum" in this context would be the maximum likelihood sequence, i.e. the sequence for which the observed data have the highest probability of occurring, and this is not practical to work with (even with the specialized hardware proposed by the JASONs) because the space of possible sequences is impossibly large. Nor has anyone proposed an alternative definition of "optimal" that is both algorithmically manageable and gives an appropriate result in practice. Thus, benchmarking against a "complete-branch-and-bound search" is not at present meaningful.
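The size of that search space is easy to put a number on. The sketch below assumes a cosmid insert of about 40 kb (an illustrative round figure) and counts every string of that length over the four bases, which is what an exhaustive maximum likelihood search would have to score. It works with logarithms because the count itself is far too large to print conveniently.

```python
import math


def log10_candidate_sequences(length_bp, alphabet_size=4):
    """log10 of the number of distinct sequences of the given length."""
    return length_bp * math.log10(alphabet_size)


COSMID_BP = 40_000  # assumed insert length, for illustration only

exponent = log10_candidate_sequences(COSMID_BP)
digits = math.floor(exponent) + 1

print(f"candidate sequences: about 10^{exponent:.0f}")
print(f"decimal digits in that count: {digits}")
```

Even restricting attention to sequences of exactly this one length, the count has tens of thousands of decimal digits, so no increase in hardware speed brings exhaustive enumeration within reach. Any practical method has to prune the space heuristically.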

Koonin response:
>   We oversimplified the description of PHRAP as using a greedy
> algorithm throughout, and have modified the text to read:
> 
> "The PHRAP program uses a greedy algorithm
> where the segments with the closest matches are assembled first and the
> program builds out from this initial start, revising as necessary."
> 
> Even if PHRAP gets it right 95% percent of the time, how much additional
> effort is spent dealing with the special cases? 

There is some effort expended, which is why we continue to improve the assembly algorithms. But I don't think anyone would claim it is a major cost factor as you appear to think.

> We also don't see how
> comparison with a complete search cannot add information in some 
> cases.

Please see what I said about "complete search" in my original comments. It is not easy to make this concept meaningful. The only reasonable definition I can come up with is "a complete search of all possible sequences for the cosmid", taking as the score that needs to be maximized, the probability of the observed read data given that sequence. Then as far as I can see a "complete search" is not even close to being possible even with specialized hardware. One will have to make some heuristic simplifications somewhere.

If you do have some way of implementing such a complete search, or if you have something else in mind for what you mean by "complete search", I and others working on assembly would be very interested in hearing about it.

section 2.1.2: Finishing. The major component of finishing labor costs is actually the additional data collection, not, as the JASONs seem to think, the time spent in editing or assembly which is reasonably efficient with current software. Moreover the JASONs appear to be unaware of the fact that rule-based expert systems to automate parts of the finishing process have already been developed in several centers (e.g. St. Louis, Seattle, UTSW). The St. Louis system has been in use for some time (at least a year).

Koonin response:
>   Our text did not make it clear that there's more than just
> software involved in finishing, and we've modified it accordingly.  It
> now reads:
> 
> "The finishing process involves taking an
> assembled sequence and filling in the gaps through a combination of
> manual editing and time-consuming directed sequencing.  At some
> sequencing centers we saw that finishing accounted for roughly half of
> the entire sequencing effort.  The software available to assist finishing
> was of variable sophistication and in some cases consisted of no more
> than simple sequence editors.  While most of the finishing costs are
> associated with directed sequencing, research into finishing software
> could help to automate this process."
>
> 
> As far as the expert systems developed at several centers, we didn't hear
> about these at any of our site visits (e.g., to Wash U.), although we
> certainly asked about them.  A reference would be appreciated.

We have a paper in press in Genome Research describing such a system and I would be happy to send you a preprint if you are interested (it should appear in the March issue, and is by Gordon, Abajian and Green; there will be two other papers from my group in this issue as well, on the phred basecaller and error probabilities). You should also contact Gabor Marth (the developer of the "Finish" program) or Bob Waterston, both of whom are at the Washington Univ. Genome Sequencing Center. The other centers I know of that have put some effort into this problem include MIT (contact Eric Lander) and UTSW (contact Glen Evans or Skip Garner).

section 2.1.3: This proposed "method to bypass assembly" seems to reflect a basic misunderstanding of how Sanger sequencing works. It is essential that all of the template molecules in the sequencing reaction be identical downstream of the priming site -- otherwise one gets garbage (an uninterpretable superposition of traces from different sequences). The JASONs' "binning" procedure would create populations of templates that are not identical. It could only be made to work if each "bin" in fact consisted of identical molecules of exactly the same size, but that would be extremely difficult if not impossible to attain by any process of which I am aware, particularly in view of the huge range in sizes of the fragments. Furthermore the procedure has a number of other fatal flaws -- for example it requires PCR of very large molecules (approaching the size of a cosmid) which is highly nontrivial, and uses a mononucleotide run as a priming site which ensures that (even if the binning problem were solved) the primer will anneal to multiple different positions, resulting in an uninterpretable mix of signals.

Incidentally it is worth pointing out that the idea of sequencing by using nested deletions has been around for a long time, and Ellson Chen in particular has made effective use of it. To get it to work one needs to clone the different deletion variants in order to get homogeneous populations of templates -- it can't be done using solid supports or in solution in the manner the JASONs propose. And, like other directed strategies that have been proposed, it has not turned out to be cost competitive with shotgun sequencing. There is a long history (of which the JASONs seem unaware, and are thus doomed to repeat) of trying to find clever, directed strategies that reduce the read redundancy and avoid the assembly problem. None of these have proven competitive with shotgun sequencing. The overwhelming advantage of the shotgun approach is that it keeps the basic operations simple. Directed strategies have more steps, with the added steps being more technically demanding, and this means more potential failure points and a higher required skill level. The lesson has been that it is more important to keep the approach simple than to make it maximally efficient (on paper). By the way, it is also worth pointing out that most of the impetus towards these directed methods came when shotgun assembly algorithms were not very reliable. Now that they are quite reliable except in a small percentage of cases, there is considerably less interest in this sort of approach.

Koonin response:
>   We removed the "method to bypass assembly" several revisions
> ago, upon the advice of numerous experts.

Good!

section 2.2.1: Institutional barriers. Although this paragraph has elements of truth I would disagree somewhat with its characterization of the NIH. In fact the NHGRI has actually had a strong interest in technology development for the genome project, has generally been able to get reviewers who are working and knowledgeable in this area (if perhaps somewhat biased towards their specific field of interest), and has funded a fair amount of technology research; they may even be putting more into extramural technology development than the DOE is. The problem in my view has been too much emphasis on novelty, and not enough on incremental improvements to the existing technology that have a real prospect of contributing to the human genome sequence. As a result not much has emerged of utility.

Koonin response:
>  As we've emphasized, one needs both evolutionary and revolutionary
> advances in the technology.  As we note in the report, of the $78M spent
> by DOE in FY97, about $13M goes for technology, and only $1.7M of that was
> for advanced (non-EP based) methods.  We'd like to see the technology
> funding go up substantially (50%).  Francis is quoted in Marshall's
> Science article as saying that the NIH is spending $22M on technology
> development, of which $2.5M "had nothing to do with current technology"
> (presumably non-EP based).  Since the total NIH effort is about twice as
> large as DOE's, there's room for a fair bit more technology support from
> NIH.  The lack of technology push by NIH as a whole is widely
> acknowledged by essentially all biologists I've talked with (where were
> they in support of Lee's development of automated sequencing?) and the
> ultrasound example we cite is consistent with that;   NHGRI shouldn't
> follow suit.

My point was that I don't believe NHGRI is following suit -- they are supporting technology development at a reasonable level. I'm not sure I agree with your statement that "there's room for a fair bit more technology support" at the NHGRI. The amount of "room" depends on the other demands on the budget, and the constraints on the NHGRI are pretty tight because they're committed to fund much more of the actual production sequencing than the DOE is (by a factor significantly exceeding the twofold ratio of their budgets).

section 2.3.1: The history here is a bit screwed up. Actually it was the invention of cloning in the early 1970s, not PCR in the mid 1980s, that made sequencing of all kinds of DNA practical. In fact even now nearly all genomic sequencing is done from clones, not PCR products. Sanger sequencing was invented in the mid 1970s, after cloning, so it was never "limited" to viral genomes -- those were simply the natural first targets.

Koonin response:
>  We've modified the text to read:
> 
> "The invention of cloning technologies
> and, subsequently, PCR made the preparation of pure macroscopic
> quantities of identical molecules routine, allowing gel EP to be applied
> to all kinds of DNA." 

section 2.3.1: The claim here that sequencing cost should be proportional to the machine time required to collect a single read is not correct. Run time affects machine costs, since the more data one gets out of a machine per hour, the lower the machine amortization cost per base. But machine costs are in fact a fairly minor component of the total costs of sequencing (we estimate about $.05 per finished base with ABI sequencers, out of a total of $.50 per base). The bulk of the costs are not in the machine but in the upstream labor and reagent costs, i.e. the template preps and sequencing reactions. These are not affected at all by the machine running time. The idea that labor costs in particular should be proportional to running time is patently ridiculous -- no one sits around waiting for the results of a run once they load and start the machine! In fact very short run times could even have a detrimental effect on labor efficiency, since whoever is tending the machine may not be able to use his time very effectively if the time between machine loads is not long enough to allow productive work on another task. Similarly reagent costs have nothing to do with machine run time, although they may be reduced by improvements in detection sensitivity.

The biggest single determinant of cost is the number of reads that need to be performed, because all of the costs -- reagents, labor, machine -- are affected by this. Faster run times are only of (some) benefit if they do not result in shorter read lengths, because if read lengths become shorter then more reads are needed, pushing up costs. Current capillary electrophoresis machines fail (at present) on this account -- they have much shorter run times but also much shorter read lengths, which makes sequencing with them more expensive, not less expensive, than with the slower ABI machines.
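
The cost argument above can be put into a toy model. This is a minimal sketch, not an accounting of any real center: it takes the $.05 machine and $.50 total per-base figures quoted above, and assumes (as a simplification) that machine cost scales with run time per read while every cost scales with the number of reads, which in turn scales inversely with read length. The function name and the factor-of-two scenarios are hypothetical.

```python
def cost_per_finished_base(machine_cost, upstream_cost,
                           run_time_factor=1.0, read_length_factor=1.0):
    """Toy cost model (illustrative assumptions only).

    machine_cost       -- baseline machine cost per finished base
    upstream_cost      -- baseline labor + reagent cost per finished base
    run_time_factor    -- new run time / baseline run time
    read_length_factor -- new read length / baseline read length
    """
    # Assumption: the number of reads needed is inversely proportional to read length.
    reads_factor = 1.0 / read_length_factor
    machine = machine_cost * run_time_factor * reads_factor
    upstream = upstream_cost * reads_factor  # unaffected by run time
    return machine + upstream


baseline = cost_per_finished_base(0.05, 0.45)
faster = cost_per_finished_base(0.05, 0.45, run_time_factor=0.5)
faster_but_shorter = cost_per_finished_base(
    0.05, 0.45, run_time_factor=0.5, read_length_factor=0.5)

print(f"baseline:               ${baseline:.3f} per base")            # $0.500
print(f"half the run time:      ${faster:.3f} per base")              # $0.475
print(f"half run time and read: ${faster_but_shorter:.3f} per base")  # $0.950
```

Under these assumptions, halving the run time alone saves about 5% of the total, while halving the read length at the same time nearly doubles the cost per base.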

Koonin response:
>  We understand that read time is only a fraction of the story.  In
> fact, we said so:
> 
> "Needless to say, this possibility is very
> far from being demonstrated. The three steps of single-molecule
> sequencing have not yet been integrated into a working process.   And the
> rate of sequencing in a large-scale operation is limited by many factors
> beyond the rates of the elementary process involved.  With either
> single-molecule or gel electrophoresis separation, the production of
> sequence will be slowed by the complicated manipulations required to
> prepare the molecules for sequencing and to assemble the sequences
> afterwards.  Until single-molecule sequencing is developed into a
> complete system, no realistic estimate of its speed and cost can be made.
>   The most that can be claimed is that single-molecule sequencing offers
> a possibility of radically increasing the speed and radically reducing
> the cost."

My original comment was in response to your claim on p. 22 that "When we are concerned with large-scale operations, the number of bases sequenced per dollar will be proportional to the number of bases sequenced per hour." I believe this statement is just wrong and should be taken out of the report. You seem to be saying now that you "understand" that this may not be true after all, but there is no such caveat in the sentence I just quoted, and the argument you gave in the rest of that paragraph does not hold water for the reasons I gave in my original comment. The crucial point is that read time is actually a relatively minor determinant of cost, because it really only pertains to the machine costs which are a minor fraction (< 10%) of the whole. There is no reason whatsoever to expect the proportionality you propose.

> The real point, of course, is that full and commensurate systems
> analyses are needed (including cost of labor, reagents, operations,
> capital) and that the optimum with rapid single-molecule sequencing might
> be very different from that with the present gel-EP.

It might be, but there is no reason to expect this given our current understanding of the economics of sequencing (none of which is reflected in your report).

section 2.3.3: De novo sequencing by hybridization. The analysis here is a bit bizarre -- there is no reason to assume only pairwise probe overlaps (i.e. k > L/2) and in fact all proposals for de novo sequencing by hybridization of which I am aware take k = 1. Thus for sequencing a "random" sequence the numbers are not nearly as unfavorable as the JASONs suggest, and random sequences of a few kb could in principle be determined using a 9-mer array.

The important point though is that calculations for a random sequence are completely irrelevant. Almost any real genomic sequence of significant length (a few kb or more) will have short repeats in it. In effect, sequencing by hybridization even in the best case (i.e. with perfect hybridization data) amounts to shotgun sequencing with a very short read length (equal to the oligomer length). In shotgun assembly, if a subsequence longer than the read length occurs three or more times on one strand, or twice on opposite strands, then there will be a fundamental ambiguity in the assembly (i.e. the correct sequence cannot be determined from the read data alone); and if the read lengths are very short the chance of this happening is quite high. In almost any real genomic sequence of a few kb there will be repeated 9-mers for example.
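
The repeat condition described above is easy to check on any known sequence. The following is a minimal sketch, assuming a plain uppercase ACGT string as input; it reports oligomers that occur three or more times on the given strand, and oligomers whose reverse complement also occurs (i.e. copies on opposite strands). It only counts repeated k-mers and is not an assembler; the function names and the toy sequence are illustrative.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase ACGT string."""
    return seq.translate(_COMPLEMENT)[::-1]


def repeated_kmers(seq: str, k: int) -> dict:
    """Find k-mers that are repeated in a way that can confound assembly."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # k-mers present three or more times on the given strand.
    three_plus = sorted(kmer for kmer, n in counts.items() if n >= 3)
    # k-mers whose reverse complement also occurs; a self-complementary
    # k-mer only counts if it is present at least twice.
    both_strands = sorted(
        kmer for kmer, n in counts.items()
        if reverse_complement(kmer) in counts
        and (reverse_complement(kmer) != kmer or n >= 2)
    )
    return {"three_or_more_same_strand": three_plus,
            "on_both_strands": both_strands}


toy = "GATTACAGGGATTACATTTGATTACA"  # hypothetical toy sequence
print(repeated_kmers(toy, 5)["three_or_more_same_strand"])
# ['ATTAC', 'GATTA', 'TTACA']
```

Running the same function with k = 9 on a real genomic sequence of a few kb is a direct way to test the claim that repeated 9-mers are to be expected.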

Koonin response:
>   The analysis is meant to be "order-of-magnitude."  We so
> state:
> 

The problem is that even the order-of-magnitude you get (for probe density required to obtain a "random" sequence by hybridization) is wrong. And you draw a conclusion from it which, while it happens to be true for other reasons, is not true for the reasons you gave.

> "Some sense of the probe resource
> requirements for de novo sequencing can be understood by the following
> "reverse" strategy applied to an array of Format 2 type." 
> 
> and  do acknowledge that more detailed analyses are needed:
> 
> "Note that these simple limits assume that
> target-probe hybridization and identification at each site are perfect
> and that N is a "typical" random sequence without perverse patterns such
> as multiple repeats (which would present a significant problem). 
> Certainly in practice a number of processes are encountered that
> complicate the interpretation of the hybridization patterns presented by
> arrays (e.g., related to complexity of the thermodynamics of
> hybridization, of patterns from multiple mismatches, etc.) and that are
> currently being addressed in the research literature, with promising
> demonstrations of fidelity.  Clearly in any real application somewhat
> larger arrays than those based upon simple combinatorics will be needed
> for de novo sequencing to maintain accuracy and robustness in the face of
> errors, with an optimum array size lying somewhere between the limits
> discussed above."
> 
> Despite these caveats and the simplicity of analysis, we seem to agree
> that the conclusions are correct.

Yes, but it is important to have a correct argument supporting those conclusions, and you don't!

section 3.1.1: Table of required accuracy levels. This table appears to confuse several different kinds of criteria. For example, in the first application (assembly of long contigs), the error rate referred to is for reads, while in the rest of the table the error rate is that for finished sequence. These are quite different. Also, the error requirement of 0 given for genetic defects is apparently based on the observation that a single error can produce a misleading result, but that is true for the other applications as well.

Also, it is inappropriate to base the gene-finding requirement on the accuracy of EST data as the JASONs are doing here. Obviously one can find some useful information even in sequence with a very high error rate. The discoveries that have been made with ESTs have involved genes with strong similarity to other genes of known function, and strong similarities can indeed be found even if the error rate is substantially higher than the 1% figure given here. With genomic sequence however the issues are entirely different. Many of the genes, one might even argue the most interesting ones, will not have strong similarities to any known gene, and moreover one is faced with the problem of picking out relatively small exons from a background of much larger intronic and intergenic regions. Furthermore it is important to be able to distinguish genes from pseudogenes; even a single frameshift error can cause a gene to appear to be a pseudogene. If the error rate is high every gene will look like a pseudogene (or worse, be undetectable), potentially resulting in a great deal of costly experimental effort to sort out the situation.

Moreover I would strongly question a basic premise underlying this table and in fact most of the JASONs' arguments and suggestions in section 3; namely, the assumption that errors occur randomly, which is required even to talk about a single "error rate" as if that were the only error issue that needs to be specified. This assumption is false; errors have a strong tendency to occur in clusters, and in particular types of sequences (e.g. compression errors in GC rich sequences with a propensity to form certain types of hairpin structures, or miscounting of mononucleotide runs). One cannot assess the impact of an "error rate" on uses of the sequence without knowing the distribution of particular error-prone sequences with respect to features of biological interest, as well as the nature of the error clustering and how it relates to biological features.

Investigating the impact of error rates by simulating errors as the JASONs propose would not yield any useful information in my opinion, because we do not know enough about the distribution of errors in practice to be able to do a realistic simulation. Furthermore we do not even know what features other than genes may still lie undiscovered in the genomic sequence, and so cannot simulate them. Also this proposal ignores the fact that we currently derive more than just the sequence: we get the estimated error probabilities at each position as well, and these have considerable potential to enhance our ability to make use of error-prone sequence. At present there is no software available to make use of them, and what is really needed is improvements in sequence analysis software to take into account the error probabilities.

Koonin response:
>  The table has been included to illustrate the broad range of
> accuracy requirements, and the numbers given are, as stated immediately
> following the table, "only rough order-of-magnitude estimates".  Further,
> we state:
> 

Again, even the order-of-magnitude is wrong for some of them (e.g. gene identification); and as I pointed out there are some conceptual problems as well.

> "More precise estimates for each of these
> uses (and others) can surely be generated by researchers expert in each
> of the various applications.  Beyond qualitative judgment, one useful
> technique would be to run each of the applications with pseudodata in
> which a test sequence is corrupted by artificially generated errors. 
> Variation of the efficacy of each application with the error level would
> determine its error requirement and robustness. Such exercises, carried
> out in software, cost little, yet would go a long way toward setting
> justifiable quality goals.  We recommend that the DOE encourage the
> genomics community to organize such exercises."

The fundamental objections to this approach are (i) we are very far indeed from being able to model errors, and (ii) we don't know all of the important biological features present even in known sequences; as a result we can't simulate them, or even determine how much information is lost when errors are introduced into a known sequence.

> This seems to be fully in accord with the penultimate paragraph in your
> comments on this point.  Further, the point of your last paragraph (that
> we do not yet know enough about the distribution of errors) is again a
> motivator for the experiments under "Know thy system."

I disagree, for reasons that should be clear from my previous comments.

section 3.1.2: Accuracy required for assembly. The observations here are not new; in fact such calculations were made a number of years ago by people doing early work on assembly. However they are no longer taken seriously, because of the two drastic simplifying assumptions that are made: that the sequence is random, and that the errors are randomly distributed. These are now recognized to be so far from being true that no useful conclusions can be drawn from an analysis of this idealized model. Sequence is highly nonrandom, and as I attempted to point out in my presentation to the JASONs, what in fact makes assembly nontrivial is the combination of two factors, repeats and read errors; a high error rate would not be a problem if there were no repeats, and similarly with perfect data repeats would not represent a significant problem, since the great majority of them are highly diverged. Every region of the genome that has been looked at has significant numbers of repeats, so any theoretical calculation or simulation that depends on an assumption of randomness is simply not relevant.

The assumption that errors are randomly distributed is also very far from being the case. Errors are not randomly distributed within reads, error rates being significantly higher at the beginnings and ends of reads than in the middle. Nor are they randomly distributed among reads. Due to variations in sequencing conditions, template preps, and electrophoretic conditions some reads inevitably have much higher error rates throughout their lengths than other reads. Moreover reads through the same sequence region, on the same strand and with the same chemistry, will tend to reproduce the same errors (e.g. compressions). The assumption of randomly distributed errors is violated in almost every respect one can imagine. The other important factor ignored by this analysis is that we have, in addition to the read sequences, error probabilities for each base call. These enormously improve the reliability of assembly.

Koonin response:
>  Our analysis is known to be an oversimplification, and we said
> so:
> 

My point was that it is such an extreme oversimplification that it is not useful, even apart from the fact that it is not new. The impression I got when reading this and several other sections of the report is that some of the JASONs got interested in little mathematical problems suggested by the various technologies, solved them, and then put them into the report as if they provided an important insight. You are pretty far behind the field in this -- these types of arguments were made years ago and are now discredited because they depend on assumptions that aren't even close to being true in practice. Being a mathematician by training I enjoy these little problems as much as you do, but they really don't provide any useful insight regarding the real world issues.

> "The analysis here makes several
> (important) simplifying assumptions.  For example, it assumes that the
> fragments are uniformly distributed across the clone and that the clone
> itself is a random sequence of base pairs. While in some regions of the
> genome the latter may be a good assumption, there are certainly areas
> where it is not.  More importantly, even somewhat limited partial repeats
> within the clone will have a possibly significant impact on the analysis.
>  This can be explored experimentally via computer simulations using known
> stretches of the sequence (Section 3.3.1).
> 
>       Further, with fragments produced using sets of restriction enzymes, the
> fragments may well not be uniformly distributed and we only considered
> pointwise garbling (not insertions or deletions).  However the  intent of
> this analysis is simply to illustrate the relative importance of
> base-calling accuracy and coverage (number of fragments) in the
> sequencing process."

Subclones are actually not generated using restriction enzymes because of the unfavorable nature of restriction site distributions (even in random sequence). Mechanical shearing of the DNA is generally used instead.

section 3.2.1: Restriction enzyme verification of sequence accuracy. MCD mapping is an excellent method for checking assembly accuracy and is routinely used for this purpose in our genome center, but it is not appropriate for checking accuracy at the per-base level as the JASONs propose. The problem is that most errors are sequence-dependent and restriction enzyme sites (which are mostly short palindromes) undersample the main types of error-prone sequences: compressions are found in sequences with hairpin loop propensity, which are not usually palindromic (because the two complementary stretches must be separated by a loop); and the other major error-prone feature, mononucleotide runs, are actually not sampled at all by restriction sites, given that the restriction site would need to span the entire run to determine the number of bases in it.

Koonin response:
>   MCD need not be a perfect verification protocol
> for all sequences to be useful.  If there are known sequences where it
> will fail, then these need to be tested by other means.  Our analysis is
> useful to see how far one can go in a "perfect case".  As we noted, the
> level of verification that can be achieved must be tested by practice:

How are you going to efficiently test the known sequences "by other means", without doing random resequencing as the NHGRI currently does, in which case the MCD mapping tests add nothing useful? Also, as I indicated above I disagree that analysis of the "perfect case" is useful -- we are concerned with the real world here.

> "Note that the estimates above assume both
> perfect enzyme specificity; and sufficient fragment length resolution (1%
> seems to be achievable in practice, but one can imagine site or near-site
> configurations where this would not be good enough, so that a different
> set of restriction enzymes might have to be used).  The extent to which
> these assumptions hinder MCD verification, as well as the ability of the
> method to constrain sequence to e << 10^-4, can best be investigated by
> trials in the laboratory."

section 3.2.2: Hybridization arrays for sequence verification. Hybridization arrays are also not likely to do a good job of detecting errors, because the same regions that are prone to errors in Sanger sequencing also present problems for hybridization. For example compression prone sequences are unlikely to hybridize well because of the tendency of the oligonucleotide to form hairpin structures, and mononucleotide runs tend not to give interpretable hybridization signals because of the fact that one strand can "slide" with respect to the other and still form a stable hybrid.

Another objection to the method proposed by the JASONs is that it would involve a new chip for each cosmid to be checked, which would make it prohibitively expensive (Affymetrix says, probably optimistically, that it costs ~$65,000 to master a new chip, which amounts to ~$1.50 a base for a 40kb cosmid).

Koonin response:
>  We were well aware of the problems you mention.  The text
> reads:

I don't believe you were aware of them at all -- the paragraph you quote below certainly doesn't demonstrate that. My first point was that the kinds of errors that occur in hybridization are actually not likely to be independent of those occurring in Sanger sequencing, which is at odds with your statement below that these "may well be sufficiently different from those arising from the gel-based procedures so as to give an independent standard for accuracy." And my second point was that the method as you propose it is far too expensive to be useful, and I don't see that addressed anywhere in your report.

> 
> "Perhaps the most important motivation for
> suggesting this strategy for verification is that the "mistakes"
> associated with sequence determination from target-probe interactions in
> a massively parallel fashion may well be sufficiently different from
> those arising from the gel-based procedures so as to give an independent
> standard for accuracy.  Of course there are a host of issues to be
> explored related to the particular kinds of errors made by hybridization
> arrays (including the fidelity with which the original array is produced,
> hybridization equivalents, etc.).  For the purpose at hand, attention
> should be focused on those components that most directly  impact the
> accuracy of the comparison."

section 3.3.1: I would strongly disagree with the proposal that one can learn anything useful from "detailed Monte Carlos computer simulation of the complete mapping and sequencing processes." Errors are highly non-random, and we don't know how to simulate them even for our current processes. Moreover any change to the process will have a different, and unpredictable effect on the frequency and nature of errors that occur -- changes in electrophoresis conditions change the error pattern in very different ways from changes in the sequencing chemistry, which in turn are very different from changes in template preps etc.

Moreover I would again stress the fact that error probabilities generated by the basecaller are now an integral part of the data, and they would need to be simulated as well as the read sequences. We don't know how to do that either.

Understanding the error issues is clearly very important, but the only valid way I can see to study them is by resequencing -- i.e. one needs to actually generate new data under a change in conditions, using a clone of known sequence, and see directly how changes in the process affect the accuracy of the final sequence. We simply don't understand the process well enough to simulate sequence errors arising from it.

Koonin response:
>   Monte Carlo simulation has proven to be an extraordinarily
> useful tool to understand and optimize all manner of complex systems;
> sequencing should be no different.  It is not necessary that the errors
> be uniformly random, but only that one understand their distribution,
> correlations, etc.  Acquiring such understanding requires careful and
> targeted experiments, as you note.  We're advocating that the time and
> resources be expended to do these experiments, so that high-fidelity
> simulations can be constructed and exploited.

See my detailed comments regarding prospects (or lack thereof) for Monte Carlo simulation of sequencing, above (p. 8 comments on "Know thy system"). I believe your confidence in the power of this approach comes from its success in studying complex physical systems in which the basic rules governing individual components of the system are fairly well understood quantitatively. That is very far from being the case here.

section 3.3.2: Gold standards for measuring sequence accuracy. I don't think such gold standards would be of much use. The problem is that any valid check of a center's accuracy has to be at least single-blinded, and with a known gold standard that is essentially impossible -- it would be very hard to prevent someone sequencing the gold standard cosmid from consciously or unconsciously being more careful with it than with their routine production sequencing. The procedure that is being used in the current NHGRI quality checking exercise, in which clones are selected at random from among those already submitted to GenBank and resequenced independently by two other groups, who presumably are highly motivated to find as many errors as possible, is a preferable approach.

Koonin response:
>  We noted the problems you cite with gold standards, but point
> out that they can have some utility:
> 
> "Although the cosmid standard is expected
> to have greater utility, the phagemid standard will be used to control
> for variables pertaining to DNA sequencing itself within the overall
> work-up of the cosmid DNA.  It is likely that the sequencing groups will
> be on their "best behavior" when processing a gold standard, resulting in
> enhanced performance compared to what might be typical.  This cannot be
> avoided without resorting to cumbersome procedures such as surprise
> examinations or blinded samples.  Thus it will be important to examine
> not only the output of the sequencing procedures, but also the process by
> which the data is obtained.  The extent to which it is possible to
> operate in a "best behavior" mode will itself be instructive in assessing
> DNA sequencing performance.  At the very least, such trials will
> establish a lower limit to the error rate expected."

I have a hard time believing that you can really think this is even workable, much less preferable to the resequencing approach already implemented by the NHGRI. How would you examine "the process by which the data is obtained" and use it to come to any quantitative conclusions as to how much "best behavior" mode is affecting the results? I see no way of doing this that would produce credible quantitative estimates of what the error rate is in "typical" mode. More to the point, do you have any arguments for why this method would have advantages over the one that is already in place?

By the way, the "lower limit" you claim your method would yield would almost certainly be 0 in all cases, since a center would have to be fairly stupid not to ensure by whatever means necessary that their sequence agreed perfectly with the "gold standard".

> The utility of random, independent resequencing was also recognized:

You might have "recognized" it by indicating in your report that it is already being done!

> "Any verification protocol must require
> significantly less effort than resequencing, and so there will be
> considerable latitude in its implementation.  In one limit, sequencing
> groups might be required to perform and document verification protocols
> for all finished sequence that they wish to deposit in a database. 
> Alternatively, a "verification group" could be established to perform
> "spot" verifications of database entries selected at random.  A third
> possibility is to offer a "bounty" for identifying errors in a database
> entry."

section 3.3.3: One might argue with the statement that the cloning and subcloning steps "are not the most critical with respect to the overall quality of the sequencing process." In fact cloning errors, particularly small deletions, chimeras, and bacterial transposon insertions, are turning out to be a much more frequent problem than people have realized.

Koonin response:
> Our remarks are consistent with the responses of our briefers on
> this point.  If the situation is changing, we'd certainly be interested
> in hearing about it.

section 4.2.4: As mentioned above, phred already produces error probabilities, although for some reason the JASONs have missed the point that the quality values have this interpretation (even though this was discussed both in my presentation and in material I sent later to H. Woodin). So the suggestion to add these is out of date. For some time now, phrap has automatically checked these error probabilities for internal consistency by comparison to the consensus sequence, so that suggestion is also out of date. Likewise, we have already done extensive studies of the influence of position within the read, and of sequencing chemistry, on error rates, and have shown the error probabilities to be robust.
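
To make the error-probability interpretation concrete, here is a minimal sketch in Python. It assumes the standard phred definition, in which a quality value q corresponds to an error probability p = 10^(-q/10). The function names and the input format (pairs of quality value and a flag for disagreement with the consensus) are hypothetical, chosen for illustration; this is not the code used in phred or phrap, only an outline of the kind of predicted-versus-observed comparison described above.

```python
import math
from collections import defaultdict


def phred_to_error_prob(q):
    """Error probability implied by quality value q: p = 10^(-q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Quality value implied by error probability p: q = -10 * log10(p)."""
    return -10 * math.log10(p)


def calibration_table(calls):
    """Compare predicted and observed error rates per quality value.

    `calls` is an iterable of (quality, is_error) pairs, where is_error
    is True when the base call disagrees with the consensus sequence
    (a hypothetical input format, assumed for this sketch).

    Returns {quality: (n_bases, n_errors, predicted_rate, observed_rate)}.
    """
    counts = defaultdict(lambda: [0, 0])  # quality -> [bases, errors]
    for q, is_error in calls:
        counts[q][0] += 1
        counts[q][1] += int(is_error)

    table = {}
    for q in sorted(counts):
        n, errs = counts[q]
        table[q] = (n, errs, phred_to_error_prob(q), errs / n)
    return table
```

If the quality values are well calibrated, the observed rate in each row should be close to the predicted rate, subject to sampling noise when a quality value covers few bases. The same table can be built separately for different read positions or sequencing chemistries to check robustness.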

Koonin response:
> These points have been addressed in our responses above.