In Algorithmica, September 2016, 76(1):17-46.
Preliminary version (R. Takhanov, V. Kolmogorov "Inference algorithms for pattern-based CRFs on sequence data") appeared in International Conference on Machine Learning (ICML), June 2013.

Abstract

We consider {\em Conditional Random Fields (CRFs) with pattern-based potentials}
defined on a chain. In this model the energy of a string (labeling) $x_1...x_n$
is the sum of terms over intervals $[i,j]$ where each term is non-zero only if the substring $x_i...x_j$
equals a prespecified pattern $\alpha$. Such CRFs can be naturally applied to many sequence tagging problems.

We present efficient algorithms for the three standard inference tasks in a CRF, namely
computing (i) the partition function, (ii) marginals, and (iii) computing the MAP.
Their complexities are respectively
$O(n L)$, $O(n L \ell_{\max})$ and
$O(n L \min\{|D|,\log (\ell_{\max}+1)\})$
where $L$ is the combined length of input patterns, $\ell_{\max}$ is the maximum length of a pattern,
and $D$ is the input alphabet.
This improves on the previous algorithms of [Ye et al. NIPS 2009] whose complexities are respectively $O(n L |D|)$,
$O\left(n |\Gamma| L^2 \ell_{\max}^2\right)$ and $O(n L |D|)$,
where $|\Gamma|$ is the number of input patterns. In addition, we give an efficient algorithm for sampling,
and revisit the case of MAP with non-positive weights.