The basic algorithm is O(N^{2}). Whilst not reducing this complexity measure, I wrote an intensively optimised version with a more than fifty fold speed up over what was previously thought possible. Using this optimised version complete protein versus database searches take minutes rather than hours on an ordinary single processor PC.
An introduction to the basic unoptimised algorithm is in the related article Dynamic Programming
The 'less than or equal to' sign appears like this: '£' in this document. On the machines I've used it looks fine, but if there's a funny symbol there instead tell me so I can repost this document with '<=' in its place.
Methods for Fast Serial Type III Comparison •Top•Index•Down to 1.2 Introduction• 1.1 Index1. Introduction
1.2 Introduction 1.3 Algorithm
2.2 Rearrangement 2.3 Use of registers 2.4 Deferred assignment 2.5 Maximum value and rogue value 2.6 Loop overheads
3.2 Summary of Optimisations 3.3 Related Links 
•Index•Down to 1.3 Algorithm• 
1.2 IntroductionA number of techniques combine to accelerate the bioinformatics sequence comparison routine which finds the highest scoring similar region of two proteins allowing for insertions and deletions in either sequence.The algorithm is described for scoring only, i.e. separated from production of an alignement of the two similar regions. In the database searching application the vast majority of proteins compared are of low similarity to the probe sequence. The alignments for the low scoring similarities are not of interest. It is therefore more economical not to calculate them. Some presentations of the alignment algorithm use a second matrix to record 'the path taken' for traceback of the alignment. This is not necessary. The score matrix itself provides all the information needed to reconstruct the path. In my own implementation I use an unoptimised routine for path reconstruction, however it calls the optimised routine for score calculation to generate the score matrix. 
1.3 AlgorithmThe Type III algorithm can be described economically as follows:A symmetrical table that scores similarities of pairs of characters from an alphabet is used to generate scores d[i,j] for the similarity of the ith character of one string to the jth character of the second. Positive d[i,j] indicate similarity, negative d[i,j] are counterindicative of similarity. A penalty P, the 'indel' penalty, represents the cost of inserting a gap in either sequence. Using these elementary scores, a score S[i,j] for the best local similarity ending on the ith character of the first sequence and jth character of the second can be computed. The score S[i,j] is expressed recursively as: S[i,j] = Max( S[i1,j1]+d[i,j], S[i,j1]P, S[i1,j]P, 0 )This expression is for 0<i£m and 0<j£n. S[i,0] and S[0,j] are equal to 0 for 0£i£m, 0£j£n respectively. Matching the ith character against the jth corresponds to the first term and increases the score S at [i,j] by d[i,j] over its value at S[i1,j1]. The second and third terms correspond to the cost of skipping a character in either sequence. This is P. The formula for S[i,j] calculates the best similarity for substrings ending at characters i and j as being from matching, skipping one or other character, or if all three give negative results, starting a run of similarity after this position. The largest value of S[i,j] gives the score for the best local region of similarity between the two strings. In a recursive implementation to calculate the S[i,j] there would be wasteful recalculation of previously calculated values. Instead the dynamic programming technique is used and all results 0<i£m, 0<j£n are computed in a suitable order. Computation time is thus O(mn).

•Index•Down to 2.2 Rearrangement•
2.1 ReexpressionThe majority of pairs of strings compared in the Molecular Biology application have low similarity and most of the S[i,j] are less than P. To use this observation to increase efficiency requires some rearrangement to the equations. The intermediate stages of rearrangement result in code which is manifestly less efficient. Taking the original formulation we first increase indices. For 0£i<m and 0£j<n we have:S[i+1,j+1] = Max( d[i+1,j+1]+S[i,j], S[i+1,j]P, S[i,j+1]P, 0)We can introduce extra tests and break the 'Max' into components. S[i+1,j+1] = Max(0,d[i+1,j+1]) if S[i,j] > 0 then S[i+1,j+1] = Max(0,d[i+1,j+1]+S[i,j]) if S[i+1,j] > P then S[i+1,j+1] = Max(S[i+1,j+1],S[i+1,j]P) if S[i,j+1] > P then S[i+1,j+1] = Max(S[i+1,j+1],S[i,j+1]P)The same positive values can be obtained as follows: T[i+1,j+1] = d[i+1,j+1] if T[i,j] > 0 then T[i+1,j+1] = T[i+1,j+1]+T[i,j] if T[i+1,j] > P then T[i+1,j+1] = Max(T[i+1,j+1],T[i+1,j]P) if T[i,j+1] > P then T[i+1,j+1] = Max(T[i+1,j+1],T[i,j+1]P)This change which is removal of a 'max' from the first line modifies the treatment of zero. Unlike S[i,j], T[i,j] may become negative. S would hold a zero in such positions. S and T however agree on their positive values. A negative T[i,j] fails each of the comparison test and behaves just as if it were zero in so far as its influences on other T[i,j] is concerned. That S and T agree on positive values can also be checked by considering the cases leading to T[i+1,j+1]£0 and T[i+1,j+1]>0 separately. Because S and T agree on positive values they find the same similar substrings.

•Index•Down to 2.3 Use of registers•
2.2 RearrangementThis new form permits a rearrangement. Before the main loop starts all the T[i,j] are initialised to d[i,j]. In the rearranged loop instead of collecting values for one array element, values are distributed from an element that has positive score. The operations performed at each location in T are now as follows:Start: if T[i,j]>0 then begin if T[i,j]>P then begin T[i+1,j] = Max(T[i+1,j],T[i,j]P) T[i,j+1] = Max(T[i,j+1],T[i,j]P) end T[i+1,j+1] = T[i+1,j+1]+T[i,j] endTesting for T[i,j]>P is necessary only if the testing for T[i,j]>0 has already succeeded. The testing substantially increases the speed. In a typical comparison of a pair of sequences around three quarters of the T[i,j] are negative. These are skipped. This is where the main saving from the rearrangement arises. Of the remaining values around one half score P or less and are dealt with rapidly. The second comparison requires subtraction of P from T[i,j]. The result of this subtraction is then used twice if the test succeeds. The original formulation in terms of S required two subtractions of P per calculation of S[i,j] so there is an overall saving in calculation even when the second test succeeds.

•Index•Down to 2.4 Deferred assignment•
2.3 Use of registersThe next version of the pseudocode shows minor changes to take account of the possibility of using registers. To achieve greater speed the algorithm was implemented at assembly level. A word length register AX was used to hold the current T[i,j] and the register CX to hold T[i,j]  P. At this point the influence of the target processor, the 80286 may be clear. The choice of processor is primarily a reflection of the ready availability and affordability of IBM PCs and compatibles.A standard technique to reduce storage and indexing calculations for the matrix T was also used. Only two of T's columns are needed during processing. T0[j] takes the place of T[i,j] and T1[j] the place of T[i+1,j]. After processing one column, the values in T0 are no longer required. T1 is used in place of T0 and the old T0 is initialised with the appropriate d[i,j] values and takes the place of T1. The new inner loop is shown below. Start: AX = T0[j] if AX>0 then begin CX = AXP if CX>0 then begin T1[j] = Max(T1[j],CX) T0[j+1] = Max(T0[j+1],CX) end T1[j+1] = T1[j+1]+AX end j = j+1 if j<= n then goto start stop 
•Index•Down to 2.5 Maximum value and rogue value•
2.4 Deferred assignmentA technique compilers use to improve efficiency in code they produce is to defer assignments. A compiler may defer the writing of a value held in a register back out to memory. This can lead to a saving when the same memory location is written to several times in succession. This situation frequently occurs in executing instructions in a loop. A memory location written to in one cycle of the loop may be written to again in the subsequent cycle.This algorithm illustrates potential for a more complex form of deferred assignment, conditional deferred assignment. In the unmodified code writing to memory locations is conditional on values calculated within the loop. Depending on the condition, the body of the loop may or may not write a new value out to memory. To cope with conditional deferred assignment, different versions of the loop were required. These handle different states of deferred assignment from the previous execution of the loop. In this code, the 'code expansion' was used to allow deferral of assignment to T0[j+1] and T1[j+1]. If CX>0 then both deferrals take place, if only AX>0 then only deferral of assignment to T1[j+1] takes place, and if AX£0 then no deferred assignment is required. This required three versions of the loop body. As each version completes a cycle, it enters the appropriate version for the next cycle. Which version it enters will depend on which assignments are being deferred. In fact the value being held to write into T0 need not be written at all. The value is needed only in the next cycle round the loop. For this use it can be taken from the register holding the value that is being deferred. Use in determining maximum score is the only other potential subsequent use for values written into T0. Since the deferred value is derived by subtracting P from a value already in the matrix T, it cannot itself be the maximum score. The result does not need to be written out to memory. We now have: Start1: AX = T0[j] if AX>0 then begin CX = AXP if CX>0 then begin T1[j] = Max(T1[j],CX) goto DeferTwo end goto DeferOne end NoDefer: j = j+1 ;AX<=0 if j<= n then goto Start1 stop Start2: DX = AX+T1[j] AX = T0[j] if AX>0 then goto Morecalc T1[j] = DX goto NoDefer DeferOne: j = j+1 ;AX>0, CX<=0 if j<= n then goto Start2 stop Start3: DX = AX+T1[j] AX = T0[j] AX = Max(AX,CX) ;No need to test for AX>0 here as AX>=CX>0 Morecalc: CX = AXP if CX>0 then begin Maxtest: T1[j] = Max(DX,CX) goto DeferTwo end else begin T1[j] = DX end goto NoDefer DeferTwo: j = j+1 ;AX and CX>0 if j<= n then goto Start3 stopThe number of external memory references has been decreased. The plethora of gotos is not a concern. The unconditional gotos can be removed by rearranging the code and code duplication. Other unconditional gotos originating from expanding 'if then else' and 'Max' can be removed in a similar fashion. Further code expansion is possible. In the line AX = Max(AX,CX) the conditional assignment of CX into AX could be replaced by a conditional branch to equivalent code with the roles of CX and AX reversed. This however was not done. It would nearly double the length of the program for a marginal performance gain.

•Index•Down to 2.6 Loop overheads•
2.5 Maximum value and rogue valueDetermination of the overall maximum value in the array and its position has so far been left out of this discussion. The maximum value in a column could be determined by a second loop that scans every entry in the column.We save on loop overheads and on the cost of retrieval from T0 if instead a having a second loop we place the test in the existing loop. Moreover we only need the test in the version of the code which performs both deferrals, since deferrals correspond to values in T0 greater than P. This reduces the number of times we test for the current maximum being exceeded by a factor of six. In making the test more rarely we have lost the ability to correctly score low scoring pairs. The code will only detect those scores greater than P. For the database searching application this is no loss. Substantially higher scores are needed for evidence of relatedness between compared proteins. An even better position for the maximum test is at the line labelled maxtest, when the value DX has been found to be greater than CX. Any matching of two residues where the score for the substrings before matching the two residues is greater than 2P will reach this point. This gives a slightly higher threshold for guaranteed reporting of scores and an even less frequent execution of the maximum test.

•Index•Down to 3.1 Performance•
2.6 Loop overheadsThe loop overheads, the test for j£n, are now a significant fraction of the algorithm's cost. An improvement is the use of a rogue value or 'sentinel' for determining whether the loop has completed. T0[n+1] is initialised with an impossibly high value. This causes entry into the code which checks to see if the current maximum score has been exceeded. If it has, this code additionally checks for the rogue value. Using a rogue value in this way virtually eliminates the time spent in checking for the end of the loop.The tests for j£n can be replaced with unconditional gotos. The testing of AX>0 in the most commonly executed case, no deferral, is a natural candidate for 'loop unrolling'. This unrolling makes a saving since it reverses the test so that the conditional branches most commonly do not branch. At the start of each cycle it is necessary to initialise the vector T1. This initialisation is dependent on the ith character of the second sequence. This is most rapidly performed if vectors for each of the characters in the alphabet are precomputed. Initialisation is then performed using a rapid memory copy operation.

•Index•Down to 3.2 Summary of techniques used•
3.1 PerformanceSpeed of comparison is measured in PMEs^{1}, Path Matrix Elements per second. Comparison of two sequences of length 300 in one second requires a speed of 90,000 PMEs^{1} or better.The overall result of these optimisations is an algorithm which has a speed of 300,000 PMEs^{1} (this was in 1990 using a 16MHz 286). This compares to speeds implied by Pearson (1990) of 4,000 PMEs^{1} for comparison; and by Mount (1988) of 700 PMEs^{1} for comparison and alignment on a mainframe and a speed of 6000 PMEs^{1} for Pascal code on the same 16 MHz PC (own data). Thus the optimisations described here lead to an approximate 50 fold speed improvement.

•Index•
3.2 SummaryA summary of the techniques used to optimise the routine is given below:

© James Crook, June 1998.