Longest common substring

Tags: ,

Longest Common Substring

Given two string A and B, find longest common substring in them. For example, A = “DataStructureandAlgorithms” and B=“Algorithmsandme”, then longest common substring in A and B is “Algorithms”. Below figure shows longest common substring.

longest common substring

Brute force solution is to find all substrings of one string and check any of these substring are substring of second string, while comparing, keep track of the longest one we found. There can be n2substring for a string with length n and to find if a string is substring of another, it takes another m operations, where m is length of second string. Hence, overall complexity of this method is O(n2m).

Can we do better than that?

Longest common substring : Line of thoughts

We have to find longest common substring in strings of length M and length N. Can we find longest common substring till length M-1 and N-1 and then derive longest common substring for M and N?  Yes, we can find. The length either grows by one if last characters are equal or reset to zero if last characters are not equal. Why so?

First see why we need to reset to zero when characters are different. This because we are looking for common substring which means characters should be consecutive, any different character restart the the entire search because with those two  different characters, there can’t be any common substring.

What if characters are same? In that case we increment by one, because, longest common substring in N-1 and M-1 would be either 0 or some number based on how any consecutive common characters were till N-1 and M-1.

What will be longest common substring when one of the strings is empty? It will be zero.

So, do you see recursion here? So, let’s write recursion relation and then implement it.

LCS(i,j) = 1+LCS(i-1, j-1) if S[i] = T[j] 
         =  0 otherwise

This recursion relation has optimal subproblem property that solution to the problem actually depends on solutions to subproblems. Also, there are subproblems which will be calculated again and again, which is called overlapping subproblems. These two properties are required for dynamic programming. To not to calculate subproblems, we will use memoization, for that  create a two dimensional array called LCS with dimensions as n and m. LCS[i][j] represents the length of longest common substring in A[0..i] and B[0..j]. And since solution for i-1 and and j-1 is required before solution of i and j, this matrix will be filled bottom up.

Longest common substring using dynamic programming

How to fill LCS[i][j]?

1. Check if A[i] is equal to B[j] 
   1.1 If yes, LCS[i][j] = 1 + LCS[i-1][j-1]
( Because new character is added to already common substring, 
     if any, till A[0...i-1] and B[0,,j-1])
   1.2 if both characters are not same, LCS[i][j] = 0,
       ( Because if characters are not same, there cannot be any
         common substring including A[i] and B[j].

Implementation

#include <stdio.h>
#include <string.h>

int max(int a, int b){
	return a>b ? a:b;
}
 int longestCommonSubstring(char * A, char * B){
     int lenA = strlen(A);
     int lenB = strlen(B);
     int LCS[lenA+1][lenB+1];

     for (int i=0; i<= lenA; i++){
         LCS[i][0] = 0;
     }

     for (int j=0; j <= lenB; j++){
         LCS[0][j] = 0;
     }
	
     int maxLength = 0;
     for (int i=1; i<= lenA; i++){
        for (int j=1; j <= lenB; j++){
            if (A[i] == B[j]){
                LCS[i][j] = 1 + LCS[i-1][j-1];		
                maxLength = max( maxLength, LCS[i][j] );
            } 
            else {
               LCS[i][j] = 0;
            }
         }
     }
     return maxLength;
}

int main(void) {
    char *a = "ABCDEFGSE";
    char *b = "EBCDEFGV";
	
    printf("\n Longest common substring : %d",
			longestCommonSubstring(a,b));
    return 0;
}
package com.company;

/**
 * Created by sangar on 5.1.18.
 */
public class LCS {

    public  static int longestCommonSubstring(String A, String B){
        int lenA = A.length();
        int lenB = B.length();

        int [][] LCS = new int[lenA][lenB];

        for (int i=0; i<lenA; i++){
            LCS[i][0] = 0;
        }

        for (int j=0; j<lenB; j++){
            LCS[0][j] = 0;
        }

        int maxLength = 0;
        for (int i=1; i<lenA; i++){
            for (int j=1; j<lenB; j++){
                if (A.charAt(i) == B.charAt(j)){
                    LCS[i][j] = 1 + LCS[i-1][j-1];
                    maxLength = Integer.max(maxLength, LCS[i][j]);
                }
                else {
                    LCS[i][j] = 0;
                }
            }
        }

        for (int i=0; i<lenA; i++){
            System.out.println();
            for (int j=0; j<lenB; j++){
                System.out.print(" " + LCS[i][j]);
            }
        }
        return maxLength;
    }

    public static void main(String[] args) {
	    String a = "ABCDEFGS";
	    String b = "EBCDEFG";

        System.out.println("Longest common substring :" +
                longestCommonSubstring(a,b));
    }
}

Time complexity of dynamic programming approach to find length of longest common substring in two string is O(n*m) and space complexity is O(n*m) where n and m are lengths of two given strings.

longest common substring dynamic programming

In next post, we will discuss suffix tree method to find LCS which is more optimized than DP solution and can be easily be generalized for multiple strings.

This solution is very similar to Longest common subsequence. Difference between two problems is that a subsequence is collection of characters, which may or may not be contiguous in string, where for a substring, characters must be contiguous. Based on this difference, out solution will vary a bit.

Please share if you find something wrong or missing. If you want to contribute to site, please refer contact us. We would be happy to publish your work and in turn will pay you too.