PyLZJD: An Easy to Use Tool for Machine Learning

As Machine Learning (ML) becomes more widely known and popular, so too does the desire for new users from other backgrounds to apply ML techniques to their own domains. A difficult prerequisite that often confounds new users is the feature creation and engineering process. This is especially true when users attempt to apply ML to domains that have not historically received attention from the ML community (e.g., outside of text, images, and audio). The Lempel Ziv Jaccard Distance (LZJD) is a compression-based technique that can be used for many machine learning tasks. Because of its compression background, users do not need to specify any feature extraction, making it easy to apply to new domains. We introduce PyLZJD, a library that implements LZJD in a manner meant to be easy to use and apply for novice practitioners. We will discuss the intuition and high-level mechanics behind LZJD, followed by examples of how to use it on problems of disparate data types.


Introduction
Machine Learning (ML) has become an increasingly popular tool, with libraries like Scikit-Learn [PVG+11] and others [CG16], [Raf17], [MBY+16], [HFH+09] making ML algorithms available to a wide audience of potential users. However, ML can be daunting for new and amateur users to pick up and use. Before even considering what algorithm should be used for a given problem, feature creation and engineering is a prerequisite step that is not easy to perform, nor is it easy to automate.
In normal use, we as ML practitioners would describe our data as a matrix X that has n rows and d columns. Each of the n rows corresponds to one of our data points (i.e., an example), and each of the d columns corresponds to one of our features. Using cars as an example, we may want to know what color a car is, how old it is, or its odometer mileage, as features. We want to have these features in every one of the n rows of our matrix so that we have the information for every car. Once done, we might train a model m(•) to perform a classification problem (e.g., is the car an SUV or sedan?), or use some distance measure d(•, •) to help us find similar or related examples (e.g., which used car that has been sold is most like my own?).
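As a concrete illustration of this matrix view, the sketch below builds a tiny hand-made feature matrix for three cars and uses Euclidean distance as d(•, •) to find the car most like a query. The feature choices, values, and the helper name most_similar are invented purely for illustration.

```python
import math

# Hypothetical feature matrix X: n = 3 rows (cars), d = 3 columns (features).
# Columns: age in years, odometer in thousands of miles, 1.0 if the car is red.
X = [
    [3.0, 40.0, 1.0],
    [10.0, 150.0, 0.0],
    [4.0, 45.0, 0.0],
]


def most_similar(query, rows):
    """Return the index of the row closest to `query` under Euclidean distance."""
    distances = [math.dist(query, row) for row in rows]
    return distances.index(min(distances))


my_car = [3.0, 42.0, 1.0]
nearest_index = most_similar(my_car, X)
```

Every column here had to be chosen and measured by hand, which is exactly the feature engineering effort that becomes difficult for less familiar kinds of data.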
The question becomes, how do we determine what to use as our features? One could begin enumerating every property a car might have, but that would be time consuming, and not all of the features would be relevant to all tasks. If we had an image of a car, we might use a Neural Network to help us extract information or find similar looking images. But if one does not have prior experience with machine learning, these tasks can be daunting. For some types of complex data, feature engineering can be challenging even for experts.
To help new users avoid this difficult task, we have developed the PyLZJD library. PyLZJD makes it easy to get started with ML algorithms and retrieval tasks without needing any kind of feature specification, selection, or engineering from the user. Instead, a user represents their data as a file (i.e., one file for every data point, for n total files). PyLZJD will automatically process the file and can be used with Scikit-Learn to tackle many common tasks. While PyLZJD will likely not be the best method to use for most problems, it provides an avenue for new users to begin using machine learning with minimal effort and time.

The Lempel Ziv Jaccard Distance
LZJD stands for "Lempel Ziv Jaccard Distance" [RN17a] and is the algorithm implemented in PyLZJD. LZJD takes a byte or character sequence x (i.e., a "string"), converts it to a set of substrings, and then converts the set into a digest. This digest is a fixed-length summary of the input sequence, which requires a total of k integers to represent. We can then measure the similarity of digests using a distance function, and we can trade accuracy for speed and compactness by decreasing k. We can optionally convert this digest into a vector in Euclidean space, allowing greater flexibility to use LZJD with other machine learning algorithms.
The inspiration and high-level understanding of LZJD comes from compression algorithms. Let C(•) represent your favorite compression algorithm (e.g., zip or bz2), which takes an input x and produces a compressed version C(x). Using a decompressor, you can recover the original object or file x from C(x). The purpose of this compression is to reduce the size of the file stored on disk. So if |x| represents how many bytes it takes to represent the file x, the goal is that |C(x)| < |x|.
What if we wanted to compare the similarity of two files, x and y? We can use compression to help us do that. Consider two files x and y with absolutely no shared content. If we concatenated x and y together to make one larger file, x ‖ y, we would expect the compressed concatenation to be about the same size as the two files compressed separately, i.e., |C(x ‖ y)| ≈ |C(x)| + |C(y)|. But what if |C(x ‖ y)| < |C(x)| + |C(y)|? For that to be true, there must be some overlapping content between x and y that our compressor C(•) was able to reuse in order to achieve a smaller output. The more similarity between x and y, the greater the difference in file size we should see. In that case, we could use the ratio of compressed file lengths to tell us how similar the files are. We could call this a "Compression Distance Metric" [KLR04] as shown in Equation 1, where CDM(x, y) returns a smaller value the more similar x and y are, and a larger value if they are different.

$$\mathrm{CDM}(x, y) = \frac{|C(x \| y)|}{|C(x)| + |C(y)|} \tag{1}$$

The CDM distance we just described gives the intuition behind LZJD: we can use compression algorithms to measure the similarity between arbitrary files. CDM has been used to perform time series clustering and classification [KLR04]. A large number of compression-based distance measures have been proposed [SB06] and used for tasks such as DNA clustering [LCL+04], image retrieval [Tra07], and malware classification [Bor15].
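To make Equation 1 concrete, here is a minimal sketch that computes CDM with Python's built-in zlib module standing in for the compressor C(•). The helper names and the toy inputs are assumptions made for illustration; any general-purpose compressor could be substituted.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """|C(data)|: the number of bytes left after zlib compression."""
    return len(zlib.compress(data, 9))


def cdm(x: bytes, y: bytes) -> float:
    """Compression Distance Metric of Equation 1, with zlib playing the role of C."""
    return compressed_size(x + y) / (compressed_size(x) + compressed_size(y))


# Two inputs that share most of their content...
x = b"the quick brown fox jumps over the lazy dog. " * 50
y_similar = x.replace(b"lazy", b"sleepy", 5)

# ...and one with no shared content (seeded pseudo-random bytes).
rng = random.Random(0)
y_different = bytes(rng.randrange(256) for _ in range(len(x)))

cdm_similar = cdm(x, y_similar)
cdm_different = cdm(x, y_different)
```

Here cdm_similar comes out smaller than cdm_different, because the compressor can reuse content from x when compressing the similar input, while the unrelated input yields a ratio close to 1.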

Mechanics of LZJD
While the above strategy has seen much success, it also suffers from drawbacks. Using a compression algorithm for every similarity comparison makes prior methods slow, and the mechanics of standard compression algorithms are not optimized for machine learning tasks. Equation 1 also does not have the properties of a true distance metric¹, which can lead to confusing behavior and prevents using tools that rely on these properties. LZJD rectifies these issues by converting a specific compression algorithm, LZMA, into a dedicated distance metric [RN17a]. LZJD is fast enough to use for larger datasets and maintains the properties of a true distance metric.
LZJD works by first creating the compression dictionary of the Lempel Ziv algorithm [LZ76]. The lzset method shows the Lempel Ziv compression dictionary creation process. Since LZJD cares about similarity as a direct goal, we do not put in the extra work or code normally required to make an effective compressor. Instead, we simply create a Python set of many different sub-strings of the input sequence b. Because the lzset method gives us a set of objects, we use the well-known Jaccard similarity to measure how close the two sets are. This is defined in the sim method, and mathematically in Equation 2.
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} \tag{2}$$

The Jaccard distance, 1 − J(A, B), is a valid metric, and thus provides all the tools necessary to measure the similarity between arbitrary sequences or files. If a and b represent different sequences, their LZJD is computed as 1 − J(lzset(a), lzset(b)).

1. The properties of a true distance metric are symmetry, indiscernibility, and the triangle inequality.
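The lzset and sim methods named in the text can be sketched in a few lines of plain Python. This is a reconstruction from the description, assuming the standard Lempel Ziv dictionary construction; the lzjd helper is a hypothetical name, and note that PyLZJD's own sim operates on digests rather than raw sets.

```python
def lzset(b):
    """Build the set of Lempel Ziv dictionary sub-strings of a str or bytes input."""
    s = set()
    start = 0
    end = 1
    while end <= len(b):
        b_s = b[start:end]
        if b_s not in s:
            # A sub-string we have not seen: record it and start a new one after it.
            s.add(b_s)
            start = end
        end += 1
    return s


def sim(A, B):
    """Jaccard similarity (Equation 2) between two sets."""
    if not A and not B:
        return 1.0  # convention chosen here for two empty inputs
    return len(A & B) / len(A | B)


def lzjd(a, b):
    """Distance between two sequences: one minus the similarity of their sets."""
    return 1.0 - sim(lzset(a), lzset(b))
```

For example, lzset("abab") yields the set {"a", "b", "ab"}: the third character "a" has already been seen, so the window grows to "ab" before a new entry is recorded.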
While the procedure above will implement the LZJD algorithm, it does not include the speedups that have been incorporated into PyLZJD. Following [RN17a] we use Min-Hashing [BCFM98] to convert a set A into a more compact representation A′, which is of a fixed size k (i.e., |A′| = k) but guarantees that J(A, B) ≈ J(A′, B′)². [RN18] reduced computational time and memory use further by mapping every sub-sequence to a hash and performing lzset construction using a rolling hash function to ensure every byte of input was only processed once. To handle class imbalance scenarios, a stochastic variant of LZJD allows over-sampling to improve accuracy [RN17b]. All of these optimizations were implemented with Cython [BBC+11] in order to make PyLZJD as fast as possible.
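The idea behind the Min-Hashing step can be sketched with a bottom-k scheme, in which one hash is applied to every item and only the k smallest hash values are kept. The code below is an illustrative sketch, not PyLZJD's Cython implementation: SHA-1 from hashlib stands in for the hash function, and the helper names and toy sets are assumptions.

```python
import hashlib


def item_hash(item: bytes) -> int:
    """Map an item to a 64-bit integer using the first 8 bytes of its SHA-1."""
    return int.from_bytes(hashlib.sha1(item).digest()[:8], "big")


def bottom_k(items, k):
    """Digest of a set: the k smallest hash values, in sorted order."""
    return sorted({item_hash(x) for x in items})[:k]


def estimate_jaccard(digest_a, digest_b, k):
    """Estimate J(A, B) using only the two fixed-size digests."""
    set_a, set_b = set(digest_a), set(digest_b)
    # The k smallest values of the merged digests are the bottom-k of A union B.
    union_bottom = sorted(set_a | set_b)[:k]
    if not union_bottom:
        return 1.0
    shared = sum(1 for v in union_bottom if v in set_a and v in set_b)
    return shared / len(union_bottom)


# Two sets whose true Jaccard similarity is 1000 / 2000 = 0.5.
A = [f"item-{i}".encode() for i in range(0, 1500)]
B = [f"item-{i}".encode() for i in range(500, 2000)]
k = 256
estimate = estimate_jaccard(bottom_k(A, k), bottom_k(B, k), k)
```

Each digest stores only k integers regardless of how large the original set was, which is what makes comparisons fast; lowering k shrinks the digest further at the cost of a noisier estimate.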

Vectorizing Inputs
The LZJD algorithm as discussed so far provides only a distance metric. This is valuable for search and information retrieval problems, many clustering algorithms, and k-nearest-neighbor style classification, but it does not give us access to all the algorithms available in Scikit-Learn. Prior work proposed one method of vectorizing LZSets [RN17b] based on feature hashing [WDL+09], where every item in the set is mapped to a random position in a large and high dimensional input (they used d = 2^20). For new users, we want to avoid such high dimensional spaces to avoid the curse of dimensionality [Bel57], a phenomenon that makes obtaining meaningful results in higher dimensions difficult.
Working in such high dimensional spaces often requires greater consideration and expertise. To make PyLZJD easier for novices to use, we have developed a different vectorization strategy. To make this possible, we use a new version of Min-Hashing called "SuperMinHash" [Ert17]. The new SuperMinHash is up to 40% slower compared to the prior method, but enables us to use what is known as b-bit minwise hashing to convert sets to a more compact vectorized representation [LK11]. Since k ≤ 1024 in most cases, and b ≤ 8, we arrive at a more modest d = k · b ≤ 8,192. By keeping the dimension smaller, we make PyLZJD easier to use, and a wider selection of algorithms from Scikit-Learn should produce reasonable results.
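The dimension arithmetic d = k · b can be illustrated by keeping only the lowest b bits of each of the k min-hash values and laying those bits out as a flat vector. This is only one plausible layout, assumed here for illustration; the text does not spell out how PyLZJD arranges its vectors, and b_bit_vector is a hypothetical helper.

```python
def b_bit_vector(min_hashes, b=8):
    """Keep the lowest b bits of each min-hash value and flatten them to 0/1 features."""
    mask = (1 << b) - 1
    features = []
    for value in min_hashes:
        low_bits = value & mask
        # Emit the b retained bits, least significant bit first.
        features.extend((low_bits >> bit) & 1 for bit in range(b))
    return features


example_hashes = [0b10110101, 0b00000001, 0xFFFF00]
vec = b_bit_vector(example_hashes, b=8)

# With k = 1024 hash values and b = 8 bits each, the vector has 8,192 entries.
full_size = len(b_bit_vector(range(1024), b=8))
```

Because each position in the vector must refer to the same hash slot across inputs, a scheme like this needs a min-hash variant that produces k aligned values, which is the role SuperMinHash plays in the text.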

Over-Sampling Data
Another feature introduced in [RN17b] is the ability to stochastically over-sample data to create artificially larger datasets. This is particularly useful when working with imbalanced datasets. Given a value false_seen_prob, their approach modifies the inner if statement of lzset to falsely "see" a sub-string that it has not seen before. This is a one line change that looks like the following:

    if b_s not in s and random.random() > false_seen_prob:

By doing so, the set of sub-strings returned is altered. However, the altered set is still true to the data in that every string in the set is a real and valid sub-string from the corpus. This works because the Lempel Ziv dictionary creation is sensitive to small changes in the input, so a few small alterations can propagate forward and cause a number of differences in the entire process. By making the condition random, we can repeat the process several times and get different results each time. This provides additional example diversity that can help train a model. When false_seen_prob = 0, we get the standard LZJD output. To perform over-sampling, we recommend using small values like false_seen_prob ≤ 0.05.

2. The bottom-k approach is used by default, where one hash h(•) is applied to every item in the set, and the bottom-k values according to h(•) are selected.
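Put together with the dictionary construction described earlier, the stochastic variant can be sketched as follows. This is a plain-Python illustration under the assumption that the standard Lempel Ziv loop is used; lzset_stochastic and its rng parameter are hypothetical names, not part of PyLZJD's API.

```python
import random


def lzset_stochastic(b, false_seen_prob=0.0, rng=None):
    """Lempel Ziv set construction that sometimes pretends a new sub-string was seen."""
    rng = rng or random.Random()
    s = set()
    start = 0
    end = 1
    while end <= len(b):
        b_s = b[start:end]
        # With probability false_seen_prob, skip adding a genuinely new sub-string,
        # so the window keeps growing and later entries are shifted.
        if b_s not in s and rng.random() > false_seen_prob:
            s.add(b_s)
            start = end
        end += 1
    return s


text = "abracadabra, abracadabra, alakazam! " * 20
standard = lzset_stochastic(text, false_seen_prob=0.0, rng=random.Random(0))
samples = [lzset_stochastic(text, 0.05, random.Random(seed)) for seed in range(5)]
```

Every element of every sample is still a slice of the original input, so the over-sampled sets remain faithful to the data even though they differ from the standard output.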

Using PyLZJD
Now that we have given the intuition and described how LZJD works, we show three examples of how PyLZJD performs machine learning, without having to specify a feature processing pipeline. PyLZJD, along with complete versions of these examples, can be found at https://github.com/EdwardRaff/pyLZJD.
To use PyLZJD, at most three functions need to be imported, as shown below.

    from pyLZJD import digest, sim, vectorize

These three functions work as follows:
• digest(b, hash_size=1024, mode=None, processes=-1, false_seen_prob=0.0): takes in (1) a string as data to convert to a digest or (2) a path to a file, and converts the file's content to an LZJD digest. If a list is given as input, each element of the list will be processed to return a list of digests.³
• vectorize(b, hash_size=1024, k=8, processes=-1, false_seen_prob=0.0): works the same as digest, but instead of returning a list, returns a numpy array representing a feature vector.
• sim(A, B): takes two LZJD digests and returns the similarity score between the two files, with 1.0 indicating they are exactly similar and 0.0 indicating no similarity.
The above is all that is needed for practitioners to use PyLZJD in their code. Below we will go through three examples of how to use these functions in conjunction with Scikit-Learn to get decent results on these problems. For new users, we recommend considering LZJD as a first-pass, easy-to-use algorithm so long as the length of the input data is 200 bytes/characters or more. This recommendation comes from the fact that LZJD is compression based, and it is difficult to compress very short sequences. A quick test of LZJD's appropriateness is to manually compress your data points (as files) with your favorite compression algorithm. If the files compress well, LZJD may work. If the files do not compress well, LZJD is less likely to work.
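This quick compressibility check can be automated. The sketch below uses zlib as the "favorite compression algorithm" and combines the compression ratio with the 200-byte rule of thumb; the helper names and the 0.9 ratio cutoff are arbitrary assumptions for illustration, not values recommended by the text.

```python
import random
import zlib

MIN_LENGTH = 200  # rule of thumb from the text: at least 200 bytes/characters


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size; lower means more compressible."""
    return len(zlib.compress(data, 9)) / len(data)


def looks_suitable(data: bytes, max_ratio: float = 0.9) -> bool:
    """Heuristic: long enough, and shrinks noticeably under compression."""
    return len(data) >= MIN_LENGTH and compression_ratio(data) <= max_ratio


repetitive = b"GET /index.html HTTP/1.1\r\nHost: example.org\r\n\r\n" * 40
rng = random.Random(0)
noise = bytes(rng.randrange(256) for _ in range(2000))
too_short = b"hello"

checks = [looks_suitable(d) for d in (repetitive, noise, too_short)]
```

In practice, one would read each data point's file as bytes and look at how the ratios are distributed across the dataset rather than relying on a single hard threshold.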

T5 Corpus Example
The first example we use is a dataset called T5, which has historically been used for computer forensics [Rou11]. It contains 4,457 files that are of one of 8 different file types: html, pdf, text, doc, ppt, jpg, xls, or gif. As a simple first step to using PyLZJD, we will attempt to classify a file as one of these 8 file types. Our code starts by collecting the paths to each file into a list X_paths. Creating an LZJD digest for each of these files is simple; call the digest function as shown below:

    X_hashes = digest(X_paths, processes=-1)

The processes argument is optional. By setting it to -1, as many processor cores as are available are used. If set to any positive value n, then n cores will be used. A list of digests will be returned in the same order as the input. The digest function will automatically load every file path from disk and perform the LZJD process outlined above.

3. mode controls which version of min-hashing is used: None for the standard hash, or "SuperHash" to use the approach that is compatible with vectorization.
For this first example, we will stick to using LZJD as a similarity tool and distance metric. When you want to use distance-based algorithms, use the digest and sim functions instead of vectorize, since vectorize is less accurate and slower when computing distances.
To use LZJD's digest with Scikit-Learn, we need to massage the files into a form that it expects. Scikit-Learn needs a distance function between data stored as a list of vectors (i.e., a matrix X). However, our digests are not vectors in the way that Scikit-Learn understands them, so Scikit-Learn needs to be told how to properly measure distances when using LZJD. An easy way to do this⁴, which is compatible with other specialized distances a user may want to leverage, is to create a list of one-element vectors. Each vector will store the index of its digest in the created X_hashes list. Then we create a distance function which uses the index and returns the correct value. While wordy to explain, it takes only a few lines of code:

    # This will be the vector given to Scikit-Learn
    X = [[i] for i in range(len(X_hashes))]

    # sklearn will give us two vectors a and b from 'X'
    def lzjd_dist(a, b):
        # Each has len(a) = 1, so only one value to grab.
        # The stored value tells us which index has 'our' digest.
        # NOTE: the body of this function is missing from the listing; the
        # lines below are a reconstruction (assumption) from the description.
        a_digest = X_hashes[int(a[0])]
        b_digest = X_hashes[int(b[0])]
        return 1.0 - sim(a_digest, b_digest)

This is all we need to use the tools built into Scikit-Learn. For example, we can perform k-nearest-neighbor classification with cross-validation to see how accurately we predict a file's type.

    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.model_selection import cross_val_score

    # Y holds the file-type label of each entry in X
    knn_model = KNeighborsClassifier(n_neighbors=5, algorithm='brute', metric=lzjd_dist)
    scores = cross_val_score(knn_model, X, Y)
    print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

The above code returns a value of 91% accuracy, where a majority-vote baseline returns 25%. This was all done without us having to specify anything about the associated file formats, how to parse them, or any feature engineering work. We can also leverage other distance metric based tools that Scikit-Learn provides. For example, we can use the t-SNE [MH08] algorithm to create a 2D embedding of our data that we can visualize with matplotlib. Using Scikit-Learn, this is only one line of code. The resulting plot is shown in Figure 1. We see that the groups are mostly clustered into separate regions, but that there is a significant collection of points that were difficult to organize with their respective groups. While a tutorial on effective t-SNE use is beyond our scope, LZJD allows us to leverage t-SNE for immediate visual feedback and exploration.

Spam Image Classification
The prior example used files of varying types, which is similar to the problem domain that LZJD was developed for. In this example, we change the type of data and how we approach the problem.
Here, our goal is to predict if an email image attachment is a spam image (i.e., undesirable) or a ham image (i.e., desirable, or at least more desirable than spam). This dataset was collected in 2007 [DGEB07], with 3298 spam and 2021 ham images.
We use the vectorize function to create feature vectors for each data point. Using vectorize instead of digest allows us to build models that avoid the nearest neighbor search, which can be slow and cumbersome to deploy. The trade-off is that we spend more time during the training phase of the algorithm. Doing this with PyLZJD is simple, and the below code snippet handles the work of creating the labels, loading the files, and creating feature vectors, again without us having to specify anything about the input.

    import glob
    from pyLZJD import vectorize

    spam_paths = glob.glob("personal_image_spam/*")
    ham_paths = glob.glob("personal_image_ham/*")
    all_paths = spam_paths + ham_paths
    # Label spam as 1 and ham as 0, in the same order as all_paths
    yBad = [1 for i in range(len(spam_paths))]
    yGood = [0 for i in range(len(ham_paths))]
    y = yBad + yGood
    X = vectorize(all_paths)

Now that we have feature vectors, we can train a Logistic Regression model to predict if a new image is spam or not, and evaluate it by several metrics. This produces an accuracy of about 94.6% and an AUC of 98.7%. When training the model, we included the class_weight parameter to address class imbalance in the data. There are more examples of spam images, which can bias a model toward calling most inputs "spam" by default. Using a 'balanced' class weight reweights the data as if there were an equal number of examples of each class. With PyLZJD, you can perform a special type of over-sampling to help further reduce this impact and improve accuracy.
Here is a simple version of this ability:

    from sklearn.model_selection import train_test_split

    paths_train, paths_test, y_train, y_test = train_test_split(
        all_paths, y, test_size=0.2, random_state=42)
    X_train_clean = vectorize(paths_train)
    # Repeat every training path 10 times; false_seen_prob makes each copy differ
    X_train_aug = vectorize(paths_train * 10, false_seen_prob=0.05)
    X_test = vectorize(paths_test)

In this code, X_train_clean holds the training data constructed in the normal manner. Alternatively, X_train_aug has over-sampled both the spam and ham training data 10 times. Normally, this would create 10 copies of the same vectors and have no impact on the solution learned. But we added the false_seen_prob flag, which alters how the lzset is constructed: this flag turns on the stochastic component, and you get a different result on every call. We get a variety of different (but realistic) examples for each data point. If we train a new logistic regression model on this data, we get improved results (Table 1).
LZJD won't always be effective for images, and convolutional neural networks (CNNs) are a better approach if you need the best possible accuracy. However, this example demonstrates that LZJD can still be useful, and it has been used successfully to find slightly altered images [Fj]. This example also shows how to build a more deployable classifier with PyLZJD and tackle class-imbalance situations.

Fig. 1: Example of t-SNE visualization created using LZJD. Best viewed digitally and in color.

Fig. 2: Example of ham (left) and spam (right) images from the dataset's website.

TABLE 1: Results on training a Logistic Regression model for spam image detection. Over-sampled scores show results when false_seen_prob is used.