3 March 2020
TU Wien, Austria
Department of Learning Systems
Marko Djukanovic

The Longest Common Subsequence (LCS) problem asks for a longest string that is a subsequence of every string in a given set of input strings. The problem has applications in bioinformatics in particular, where the strings represent DNA or protein sequences. Existing approaches include numerous heuristics but only a few exact methods, and the latter are limited to rather small problem instances. Adopting various aspects of leading heuristics for the LCS, we first propose an exact A* search approach, which performs well on small instances in comparison to earlier exact approaches. On the basis of A* search we then develop two hybrid A*-based algorithms in which classical A* iterations are alternated with beam search and anytime column search, respectively. A key feature for guiding the heuristic search in these approaches is an approximate calculation of the expected length of the LCS of uniform random strings. Even for large problem instances, these anytime A* variants yield reasonable solutions early in the search and improve on them over time. Given enough time and memory, they terminate with proven optimality. Our algorithms obtain new best results for 82 out of 117 instance groups from the literature. In most cases they also provide significantly smaller optimality gaps than other anytime algorithms.
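To make the problem definition concrete, the following minimal Python sketch computes an LCS for the special case of two input strings using the classical textbook dynamic program. It is only an illustration of what a common subsequence is, not the A* search or the hybrid anytime algorithms described above, and the function names `is_subsequence` and `lcs_two` are hypothetical helpers introduced here.

```python
def is_subsequence(candidate, text):
    """Return True if `candidate` can be obtained from `text` by deleting characters."""
    remaining = iter(text)
    # `ch in remaining` consumes the iterator up to the first match,
    # so characters must be found in order.
    return all(ch in remaining for ch in candidate)


def lcs_two(a, b):
    """Return one longest common subsequence of the two strings `a` and `b`.

    Classical O(len(a) * len(b)) dynamic program; an illustrative baseline
    for two strings only, not the general multi-string setting.
    """
    m, n = len(a), len(b)
    # dp[i][j] = length of an LCS of the suffixes a[i:] and b[j:]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = 1 + dp[i + 1][j + 1]
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])

    # Walk the table from the start to reconstruct one optimal solution.
    result = []
    i = j = 0
    while i < m and j < n:
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return "".join(result)


if __name__ == "__main__":
    s = lcs_two("AGGTAB", "GXTXAYB")
    print(s, is_subsequence(s, "AGGTAB"), is_subsequence(s, "GXTXAYB"))
```

The table of this dynamic program grows with the product of the string lengths, so extending it directly to many long input strings quickly becomes impractical; this is the setting in which heuristic and anytime search methods are of interest.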