Research papers

Ravi Kumar, Yury Lifshits, Andrew TomkinsWSDM, 2010Twosided markets arise when two different types of users may realize gains by interacting with one another through one or more platforms or mediators. We initiate a study of the evolution of such markets. We present an empirical analysis of the value accruing to members of each side of the market, based on the presence of the other side. We codify the range of value curves into a general theoretical model, characterize the equilibrium states of twosided markets in our model, and prove that each platform will converge to one of these equilibria. We give some early experimental results of the stability of twosided markets, and close with a theoretical treatment of the formation of different kinds of coalitions in such markets.

Yury LifshitsSISAP, 2009We present an overview of the combinatorial framework for similarity search. An algorithm is combinatorial if only direct comparisons between two pairwise similarity values are allowed. Namely, the input dataset is represented by a comparison oracle that given any three points x,y,z answers whether y or z is closer to x. We assume that the similarity order of the dataset satisfies the four variations of the following disorder inequality: if x is the a'th most similar object to y and y is the b'th most similar object to z, then x is among the D(a+b) most similar objects to z, where D is a relatively small disorder constant. Combinatorial algorithms for nearest neighbor search have two important advantages: (1) they do not map similarity values to artificial distance values and do not use triangle inequality for the latter, and (2) they work for arbitrarily complicated data representations and similarity functions.
Ranwalk, the first known combinatorial solution for nearest neighbors, is randomized, exact, zeroerror algorithm with query time that is logarithmic in number of objects. But Ranwalk preprocessing time is quadratic. Later on, another solution, called combinatorial nets, was discovered. It is deterministic and exact algorithm with nearlinear time and space complexity of preprocessing, and nearlogarithmic time complexity of search. Combinatorial nets also have a number of side applications. For nearduplicate detection they lead to the first known deterministic algorithm that requires just nearlinear time + time proportional to the size of output. For any dataset with small disorder combinatorial nets can be used to construct a visibility graph: the one in which greedy routing deterministically converges to the nearest neighbor of a target in logarithmic number of steps. The later result is the first known workaround for Navarro's impossibility of generalizing Delaunay graphs. 
Alina Beygelzimer, John Langford, Yury Lifshits, Gregory Sorkin, Alex StrehlUAI, 2009We consider the problem of estimating the conditional probability of a label in time O(log n), where n is the number of possible labels. We analyze a natural reduction of this problem to a set of binary regression problems organized in a tree structure, proving a regret bound that scales with the depth of the tree. Motivated by this analysis, we propose the first online algorithm which provably constructs a logarithmic depth tree on the set of labels to solve this problem. We test the algorithm empirically, showing that it works successfully on a dataset with roughly 10^6 labels.

Yury Lifshits, Shengyu ZhangSODA, 2009We study the so called combinatorial framework for algorithmic problems in similarity spaces. Namely, the input dataset is represented by a comparison oracle that given three points x, y, z answers whether y or z is closer to x. We assume that the similarity order of the dataset satisfies the four variations of the following disorder inequality: if x is the a'th most similar object to y and y is the b'th most similar object to z, then x is among the D(a + b) most similar objects to z.
Though the oracle gives much less information compared to the standard general metric space model where distance values are given, one can still design very efficient algorithms for various fundamental computational tasks. For nearest neighbor search we present deterministic and exact algorithm with almost linear time and space complexity of preprocessing, and nearlogarithmic time complexity of search. Then, for nearduplicate detection we present the first known deterministic algorithm that requires just nearlinear time + time proportional to the size of output. Finally, we show that for any dataset satisfying the disorder inequality a visibility graph can be constructed: all outdegrees are nearlogarithmic and greedy routing deterministically converges to the nearest neighbor of a target in logarithmic number of steps. The later result is the first known workaround for Navarro's impossibility of generalizing Delaunay graphs.
The technical contribution of the paper consists of handling "false positives" in data structures and an algorithmic technique upasidedownfilter. 
Yury Lifshits, Shay Mozes, Oren Weimann, and Michal ZivUkelsonAlgorithmica, 2009We present a method to speed up the dynamic program algorithms used for solving the HMM decoding and training problems for discrete timeindependent HMMs. We discuss the application of our method to Viterbi's decoding and training algorithms [33], as well as to the forwardbackward and BaumWelch [6] algorithms. Our approach is based on identifying repeated substrings in the observed input sequence. Initially, we show how to exploit repetitions of all sufficiently small substrings (this is similar to the Four Russians method). Then, we describe four algorithms based alternatively on run length encoding (RLE), LempelZiv (LZ78) parsing, grammarbased compression (SLP), and byte pair encoding (BPE). Compared to Viterbi's algorithm, we achieve speedups of Theta(log n) using the Four Russians method, Omega(r/log r) using RLE, Omega(log n /k) using LZ78, Omega(r/k) using SLP, and Omega(r) using BPE, where k is the number of hidden states, n is the length of the observed sequence and r is its compression ratio (under each compression scheme). Our experimental results demonstrate that our new algorithms are indeed faster in practice. Furthermore, unlike Viterbi's algorithm, our algorithms are highly parallelizable.

Navin Goyal, Yury Lifshits and Hinrich SchützeWSDM, 2008We say that an algorithm for nearest neighbor search is combinatorial if only direct comparisons between two pairwise similarity values are allowed. Combinatorial algorithms for nearest neighbor search have two important advantages: (1) they do not map similarity values to artificial distance values and do not use triangle inequality for the latter, and (2) they work for arbitrarily complicated data representations and similarity functions.
In this paper we introduce a special property of the similarity function on a set S that leads to efficient combinatorial algorithms for S. Disorder constant D(S) of a set S is defined to ensure the following inequality: if x is the a'th most similar object to z and y is the b'th most similar object to z, then x is among the D(S)(a+b)'th most similar objects to y.
Assuming that disorder is small we present the first two known combinatorial algorithms for nearest neighbors whose query time has logarithmic dependence on the size of S. The first one, called Ranwalk, is a randomized zeroerror algorithm that always returns the exact nearest neighbor. It uses space quadratic in the input size in preprocessing, but is very efficient in query processing. The second algorithm, called Arwalk, uses nearlinear space. It uses random choices in preprocessing, but the query processing is essentially deterministic. For every fixed query, there is only a small probability that the chosen data structure does not support it.
Finally, we show that for Reuters corpus average disorder is, indeed, quite small and that Ranwalk efficiently computes the nearest neighbor in most cases. 
Yury Lifshits and Dmitri PavlovJournal of Mathematical Sciences, Springer, 2007We present an O(mn2^n log Z) deterministic algorithm for solving the mean payoff game problem, m and n being respectively the number of arcs and vertices in the game graph and Z being the maximum weight (we assume that the weights are integer numbers). The theoretical basis for the algorithm is the potential theory for mean payoff games. This theory allows to restate the problem in terms of solving systems of algebraic equations with minima and maxima. Also we use arc reweighting technique to solve the mean payoff game problem by applying simple modifications to the game graph that do not change the set of winning strategies, obtaining at the end a trivial instance of the problem. We show that any game graph can be simplified by n reweightings.

Daniele Beauquier, Marie Duflot, and Yury LifshitsCSR, 2007In this paper, we consider the decidability of two problems related to information flow in a system with respect to some property. A flow occurs in a system if the conditional probability of the property under some partial observation differs from the a priori probability of that property. For systems modeled as finite Markov chains we prove that the two following problems are decidable: does a system has information flow for a given regular property? is it true that the system has no information flow for any (sequential) property?

Yury LifshitsPhD thesis, 2007(it's in Russian) Как хранить данные, чтобы одновременно иметь быстрый доступ к нужной информации и максимально уменьшить их размер? Для решения первой задачи разработано множество структур данных (сбалансированные деревья, суффиксные массивы и т.д.), ориентированных на быструю обработку конкретных видов запросов. Для уменьшения размера данных используются алгоритмы архивирования (LZ77, LZW, кодирование длин серий и т.д.). Можно ли объединить достоинства этих двух подходов? В представляемой работе построены алгоритмы вычисления ряда свойств сжатых текстов без распаковки, получены нижние оценки на трудоемкость некоторых других задач на сжатых текстах, и, наконец, предложен новый подход к архивированию текстов.

Yury LifshitsCPM, 2007What kind of operations can we perform effectively (without full unpacking) with compressed texts? In this paper we consider three fundamental problems: (1) check the equality of two compressed texts, (2) check whether one compressed text is a substring of another compressed text, and (3) compute the number of different symbols (Hamming distance) between two compressed texts of the same length. We present an algorithm that solves the first problem in O(n^3) time and the second problem in O(n^2m) time. Here n is the size of compressed representation (we consider representations by straightline programs) of the text and m is the size of compressed representation of the pattern. Next, we prove that the third problem is actually #Pcomplete. Thus, we indicate a pair of similar problems (equivalence checking, Hamming distance computation) that have radically different complexity on compressed texts. Our algorithmic technique used for problems (1) and (2) helps for computing minimal periods and covers of compressed texts.

Juhani Karhumaki, Yury Lifshits, and Wojciech RytterCPM, 2007We contribute to combinatorics and algorithmics of words by introducing new types of periodicities in words. A tiling period of a word w is partial word u such that w can be decomposed into several disjoint parallel copies of u, e.g. a_b is a tiling period of aabb. We investigate properties of tiling periodicities and design an algorithm working in O(n log(n) log log(n)) time which finds a tiling period of minimal size, the number of such periods and their compact representation. The combinatorics of tiling periods differs significantly from that for classical full periods, for example unlike the classical case the same word can have many different primitive tiling periods. We consider also a related new type of periods called in the paper multiperiods. As a side product of the paper we solve an open problem posted by T. Harju.

Yury Lifshits and Dirk NowotkaCSR, 2007How could one estimate the total number of clicks a new advertisement could potentially receive in the current market? This question, called the click volume estimation problem is investigated in this paper. This constitutes a new research direction for advertising engines. We propose a model of computing an estimation of the click volume. A key component of our solution is the application of linear regression to a large (but sparse) data set. We propose an iterative method in order to achieve a fast approximation of the solution. We prove that our algorithm always converges to optimal parameters of linear regression. To the best of our knowledge, it is the first time when linear regression is considered in such a large scale context.

Benjamin Hoffmann, Yury Lifshits, and Dirk NowotkaCSR, 2007Consider a family of sets and a single set, called query set. How can one quickly find a member of the family which has a maximal intersection with the query set? Strict time constraints on the query and on a possible preprocessing of the set family make this problem challenging. Such maximal intersection queries arise in a wide range of applications, including web search, recommendation systems, and distributing online advertisements. In general, maximal intersection queries are computationally expensive. Therefore, one need to add some assumptions about input in order to get an efficient solution. We investigate two wellmotivated distributions over all families of sets and propose an algorithm for each of them. We show that with very high probability an almost optimal solution is found in time logarithmic in the size of the family. In particular, we point out a threshold phenomenon on the probabilities of intersecting sets in each of our two input models which leads to efficient algorithms mentioned above.

Yury LifshitsProc. of Institute of System Programming of Russian Academy of Science, 2006The goal of the paper is to construct mathematical abstractions of different aspects of real life software protection. We introduce three following notions: program slowdown, generalized encryption scheme and function sharing. These schemes allowed to discover new applications of such known ideas as trapdoor functions and selfcorrecting programs.

Patrick Cegielski, Irene Guessarian, Yury Lifshits and Yuri MatiyasevichCSR, 2006Given two strings (a text t of length n and a pattern p) and a natural number w, window subsequence problems consist in deciding whether p occurs as a subsequence of t and/or finding the number of size (at most) w windows of text t which contain pattern p as a subsequence, i.e. the letters of pattern p occur in the text window, in the same order as in p, but not necessarily consecutively (they may be interleaved with other letters). We are searching for subsequences in a text which is compressed using LempelZivlike compression algorithms, without decompressing the text, and we would like our algorithms to be almost optimal, in the sense that they run in time O(m) where m is the size of the compressed text. The pattern is uncompressed (because the compression algorithms are evolutive: various occurrences of a same pattern look different in the text).

Yury Lifshits and Markus LohreyMFCS, 2006In this work the computational complexity of two simple string problems on compressed input strings is considered: the querying problem (What is the symbol at a given position in a given input string?) and the embedding problem (Can the first input string be embedded into the second input string?). Straightline programs are used for text compression. It is shown that the querying problem becomes Pcomplete for compressed strings, while the embedding problem becomes hard for the complexity class Theta^p_2.

Yury LifshitsDiscrete Mathematics and Applications, VSP, 2005We study partitions of graphs by a system of disconnecting sets. We find a sharp upper bound for the number of resulting parts and analyse the case where the bound is attained. The result due to D. V. Karpov about the number of parts in a partition is proved under weaker constraints imposed on the graph. We also prove a theorem on bounding parts which yields an upper bound for the number of parts of the partition adjacent to a given vertex.

Yury LifshitsDiploma thesis, 2005In this work we consider a wellknown problem of processing of compressed texts. We study the following question (called Embedding): whether one compressed text is a subsequence of another compressed text? In this paper we show that Embedding is NP and coNPhard.

Yury LifshitsInformation Processing Letters, Elsevier, 2003Hromkovic et al. showed how to transform a regular expression of size n into an epsilonfree nondeterministic finite automaton (which defines the same language as the expression) with O(n) states and O(n log2(n)) transitions. They also established a lower bound (n*log(n)) on the number of transitions. We improve the lower bound to ( n*log^2 n/log log n).
Some of my publications are listed at Scholar.Google, DBLP and Computer Science Bibliographies.