A more flexible LCM
Context
For the last two years I have been reading papers about pattern mining. Although this field of data mining has received relatively little attention in the last decades, compared to Machine Learning for example, I found the existing algorithms could bring a lot of value to the table in practice.
One of the main algorithms for general-purpose pattern mining is LCM, short for Linear time Closed itemset Miner.
While reading about this algorithm in the papers presented by Alexandre Termier, there seemed to be practical limitations constraining the existing algorithms.
This paper’s preliminaries section gives a good overview of the principles that are common to almost every (re)-implementation of LCM you can find.
Amongst the core concepts is what is called a reduced dataset. Basically, a reduced dataset is a slice of the original dataset, that is to say a subset of the original transactions.
As an example, imagine we are trying to find the frequent closed patterns in a transactional database. This database can be stored in a data structure like a pandas.Series, like so:
import pandas as pd
db = pd.Series([
    [1, 2, 3, 4, 5, 6],
    [2, 3, 5],
    [2, 5],
    [1, 2, 4, 5, 6],
    [2, 4],
    [1, 4, 6],
    [3, 4, 6],
])
In practice, being able to access a reduced dataset means having access to the original transactional dataset via positional indexing. LCM-like algorithms keep an inner data structure to get access to the part of the dataset containing a specific item.
The corresponding inner structure for the database above is the following:
item_to_tids = {
    1: {0, 3, 5},
    2: {0, 1, 2, 3, 4},
    3: {0, 1, 6},
    4: {0, 3, 4, 5, 6},
    5: {0, 1, 2, 3},
    6: {0, 3, 5, 6},
}
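For reference, this mapping can be built in a single pass over the database. Here is a minimal sketch (the use of collections.defaultdict is my own choice, not mandated by the papers):

from collections import defaultdict

item_to_tids = defaultdict(set)
for tid, transaction in db.items():  # one pass over the transactions
    for item in transaction:
        item_to_tids[item].add(tid)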
We simply keep track of every transaction id for every item. Accessing all transactions containing item 4 then becomes something like:
db.iloc[sorted(item_to_tids[4])]  # .iloc requires a list-like, not a set
0 [1, 2, 3, 4, 5, 6]
3 [1, 2, 4, 5, 6]
4 [2, 4]
5 [1, 4, 6]
6 [3, 4, 6]
Unfortunately, if we implement this algorithm while keeping the dependency on pandas, we will end up with many calls to .iloc.
What the original LCM implementation does is access a reduced dataset, then count the occurrences of each item in this reduced dataset, and this operation is done many times …
For example, if we just extended a length-1 itemset to obtain the length-2 itemset {2, 4}, and we want to check for its closedness (first parent test), we have to do something like:
from collections import Counter

new_tids = item_to_tids[2] & item_to_tids[4]
# the reduced db contains all transactions containing both 2 and 4
reduced_db = db.iloc[sorted(new_tids)]
new_counts = Counter()
for transaction in reduced_db:
    new_counts.update(transaction)
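These counts let us read off the closure directly: an item belongs to the closure of {2, 4} exactly when it appears in every transaction of the reduced dataset. A minimal sketch of that last step (my own illustration, not taken from the papers):

# items present in all transactions of reduced_db form the closure of {2, 4}
closure = {item for item, count in new_counts.items() if count == len(new_tids)}
# here closure == {2, 4}, so {2, 4} is closed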
The state of research
Writing papers implies making assumptions, as a scientific basis on which the rest relies. While these assumptions may not seem really binding at first sight, in practice they can narrow down the possibilities.
The original definition of LCM assumes you can access subsets of the original data very quickly given a set of positions. This may be true if all of it fits in a small pd.Series (as seen above), but in modern applications this is not always easy to ensure, because your data may be:
- Too big to hold in memory. Or even worse, your data may only be accessible for a limited time, as in streaming workflows. Having access to the original dataset during the entire run of the algorithm means extra memory costs which are not needed.
- Distributed (or partitioned), in which case positional indexing is not something easy to ensure. As an example, the dask API deliberately bans positional indexing over rows, as the snippet below illustrates.
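A quick illustration of that restriction (assuming a recent version of dask; the toy DataFrame is my own):

import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(pd.DataFrame({"a": range(10)}), npartitions=2)
ddf.iloc[:, [0]]    # positional indexing over columns is supported
# ddf.iloc[[0, 3]]  # positional indexing over rows raises NotImplementedError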
In addition to this, the pseudo-code above emphasizes how much we rely on Counters. We have to count items every time we dig into a subset, again and again …
Now, the big questions
In the next series of posts I will try to design a new algorithm, based on previous work on LCM.
Considering the pain points I have mentioned, the ideal algorithm should:
- Only need one pass over the original dataset. Transactions could be both inserted into and removed from the inner data structures (mainly the item_to_tids mapping), as in the sketch after this list. This way we could keep track of freshly obtained data, with only the information we need.
- Do the entire mining with only knowledge of the dedicated data structure, thus not requiring awareness of the original dataset.
- Be generic in terms of data type. Most LCM implementations work with numeric types, but actually the only important constraint to ensure is having comparable items (i.e. a lexicographical order). In other words, items can be of any type as long as we are able to keep them in a sorted data structure.
- Of course, keep running times and memory consumption competitive with standard pattern mining algorithms.
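As a rough illustration of the first point, here is a minimal sketch of what such an updatable inner structure could look like (the class and its method names are hypothetical, not part of any existing LCM implementation):

from collections import defaultdict

class ItemToTids:
    """A hypothetical item -> transaction-ids mapping that supports both
    insertion and removal, so the original dataset never has to be kept
    around during the run of the algorithm."""

    def __init__(self):
        self._mapping = defaultdict(set)

    def add(self, tid, transaction):
        # index a freshly arrived transaction
        for item in transaction:
            self._mapping[item].add(tid)

    def remove(self, tid, transaction):
        # drop a transaction once it is no longer needed
        for item in transaction:
            self._mapping[item].discard(tid)

    def tids(self, item):
        return self._mapping[item]

Note that nothing here requires items to be integers: any hashable, comparable type would do, and iterating over the sorted keys of the mapping would still give the lexicographical traversal order the algorithm relies on.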
While all this seems a little ambitious, there is pretty much no doubt it's an interesting challenge to take on.