Efficient Parameterized Algorithms for Data Packing

Guide to the Artifact

Our code passed the artifact evaluation at POPL 2019 (and we got these two cool badges!). We are very thankful to the AEC for their suggestions to improve this artifact. The code is also available on GitHub at https://github.com/Goharshady/data-packing-popl2019.

Overview

We consider the problem of data packing over a two-level memory system and provide an algorithm that solves it by exploiting the treewidth of the underlying access graphs. Our algorithm is exact and optimal, i.e. it outputs the minimum possible number of cache misses.

We have implemented our algorithm in Java. Our implementation relies on tree decompositions of the underlying access hypergraphs. These are obtained using the LibTW tool.

We use several common programs as benchmarks. These are all implemented in C++. We compare the results of our algorithm (i.e. the optimal number of cache misses) with those of several previous heuristics (also implemented in C++). Moreover, we have implemented a cache simulator to count the number of cache misses caused by each of these heuristics.

Artifact Files

Expectations

Note that data packing is a notoriously difficult problem. It is NP-hard even for a cache of size 1 and hard to approximate for any cache size greater than 4. Our main contribution in this paper is the first positive theoretical result and exact algorithm for data packing. This algorithm only functions when the underlying access graphs of a specific order have constant treewidth. Hence, we would like to note the following points about expectations from the artifact:

  1. We do not expect the artifact to scale to large cache sizes. In the paper, we only report experiments with cache size <= 5.
  2. We do not expect the artifact to work on every possible input access sequence. It only functions if the access hypergraph of order q* has constant treewidth. Note that computing this treewidth using LibTW is the bottleneck of our approach, and the code will simply fail if this prerequisite is not met.
  3. In our benchmarks, we have intentionally included programs that do not satisfy the requirement above (e.g. finding the closest pair of points). This is mentioned in the paper (Figure 15, bottom right). Hence, we expect many instances of failure in the logs.
  4. Given that this is the first exact optimal algorithm for data packing, the main use-case is to analyze the performance of various heuristics (as included in this artifact).
  5. We expect the artifact to always produce correct outputs. If the treewidth is large, the artifact may simply run for a very long time without terminating.
  6. Running the entirety of the main experiment (whose results are reported in the paper) takes a long time and uses a lot of memory and disk space. We executed it using 12 threads, with 64 GB total RAM, and it took roughly 10 hours and created ~10 GB of input/output/intermediary files in total. We have also included the complete set of input/output/intermediary files created by this experiment as a separate zip file so that if you were unable to run the entire experiment, you can randomly check a few instances to make sure everything works as expected.

File Structure, Prerequisites and Compilation

Many of our scripts are written in Bash. We suggest executing them on a 64-bit Linux machine (the provided virtual machine runs 64-bit Ubuntu). Our code is in C++ and Java, so you will need g++ for compilation and a JRE. The figures are visualized using Python3 with numpy and matplotlib. These are all pre-installed on the virtual machine.

File Structure

Compilation

To compile all the C++ programs, open a terminal in the “Experiment” folder and run

./compile.sh

Note that if you are running this outside the virtual machine, you will need to have g++ installed. Please make sure that you compile everything before running any experiments.

Running the Main Experiment

The main experiment (from which all of our experimental results are obtained) runs all possible instances for all benchmarks with 2 <= n <= 100, 1 <= m <= 5 and 2 <= p <= 5. [See the “custom input” section below for the definition of n.] We did not include p = 1, because data packing is trivial in that case (each data element should be in its own singleton block).

Note that this amounts to roughly 550k runs, each with a time limit of 5 minutes. However, many of these runs terminate much faster. On our machine, running with 12 threads, the process finished in ~10 hours.
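As a sanity check on these numbers, the parameter grid can be enumerated with a short dry-run script. This sketch is hypothetical (it is not part of the artifact); each (en, n, m, p) tuple corresponds to one instance, i.e. one call of ./experiment.sh as described in the custom-input section below, and each instance in turn runs our algorithm plus all the heuristics.

```shell
#!/bin/bash
# Dry run: count the instances of the main experiment without executing any.
count=0
for en in $(seq 1 31); do            # 31 benchmark programs
  for n in $(seq 2 100); do          # input size
    for m in $(seq 1 5); do          # cache size
      for p in $(seq 2 5); do        # packing factor (p = 1 is trivial)
        # ./experiment.sh "$m" "$p" "$en" "$n"   # uncomment to actually run
        count=$((count + 1))
      done
    done
  done
done
echo "instances: $count"             # 31 * 99 * 5 * 4 = 61380
```

Each of these 61,380 instances runs several algorithms (ours and the heuristics), which is presumably how the total approaches 550k individual runs.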

Instructions

To run the main experiment, simply execute the following command in the “Experiment” folder:

./main.sh k >> full_log.txt 2>>full_log.txt

where k is the number of threads you want to allow for the execution of the experiment. In our case, k was 12.

Note: As mentioned in the Expectations section above, running the complete experiment takes a lot of time, memory and disk space. Make sure you have configured the virtual machine settings accordingly before attempting to run the main experiment.

Please ensure that the virtual machine has access to at least k CPU cores, 4k GB of RAM (i.e. 4 GB per thread) and 12 GB of disk space. A complete log of the experiment will be written to “full_log.txt”, which you can check to see the progress. [The provided “full_log.txt” already contains the log obtained on our machine.]

Note: You can customize which parts of the experiment you want to run by setting the loop bounds in “main.sh”.

Note: The complete input/output data of the experiment is included; you can download it and check it against smaller experiments that require fewer resources to run.

Note: To save time, if the experiment times out on an input, larger inputs of the same benchmark, with the same parameters, will be skipped.

Obtaining the Report [This file is necessary for all the Figures and Tables]

After the experiment execution concludes, you can obtain a csv file containing a report of all the results by simply running:

./make_csv > data.csv

Note: If you customize the parameter ranges of the experiment, make sure you also apply the same ranges in the loops in the main function of “make_csv.cpp” and compile again.

Note: The final csv file of the experiments is included in the virtual machine.

Obtaining the Figures and Tables

If you ran the entirety of the main experiment yourself, first copy the resulting csv file to the “Visualization” folder under the name “data.csv”. The “data.csv” file obtained on our machine is already there.

How to Obtain Figures 15 and 16

Open a terminal in the “Visualization” folder and run

./makeFigA.sh

This will create and show the panels in Figure 15 and also save them as pdf files in the same folder, indexed by their example number. You can edit makeFigA.sh to create similar graphs for other examples as well.

Similarly, Figure 16 can be obtained by

./makeFigB.sh

These scripts simply visualize the data in the csv file. The output will be written to several png files in the same folder.

How to Obtain Tables in Appendix C.2

To obtain the first table, i.e. total number of cache misses, run

python3 tableA-totalMiss.py

To obtain the other tables, i.e. cache misses by example type, run

python3 tableB-totalMissByType.py

Note: Figure 17 was created manually in Microsoft Excel using these tables.

Testing the Artifact with Custom Input

Generating an Input Access Sequence

The benchmark programs (examples) are numbered from 1 to 31. Each benchmark program takes a size value n as input parameter and runs the respective algorithm on random inputs of size n (the randomization seed is fixed, so as to make the process deterministic and reproducible). It then prints the resulting memory access sequence.

For example, if you open a terminal in the “examples” folder and run

./ex12 3

it prints an access sequence obtained from running bubble-sort on a random array of size 3.

Running a Single Instance

To run a single instance of our experiment, simply execute

./experiment.sh m p en n

Here, m is the cache size (as in the paper), p is the packing factor (as in the paper), en is the example number and n is the size parameter passed to the example. For example, running

./experiment.sh 1 2 12 3

first generates an access sequence obtained from bubble-sorting an array of size 3 (as in the example above) and then runs our algorithm and all the heuristic algorithms on it to obtain data packing schemes for a cache of size 1 with a packing factor of 2. It then simulates the cache and counts the number of misses caused by each method. Finally, it outputs the number of cache misses caused by each method (starting with ours) and the time consumed by each.

Note that in some instances computing tree decompositions and dynamic programming values can take a lot of memory, so make sure that the machine has access to at least 8 GB of RAM.

Detailed Results

After running an instance as above, you can find detailed information about it in the newly created folder “Experiment-m-p-en-n”. For example, after running the experiment above, the folder “Experiment-1-2-12-3” will be created, containing the input access sequence R, the tree decomposition, and logs. The complete input/output data (provided separately) contains these detailed results for all instances of the main experiment.

Changing Time Limits

By default, the time given to each algorithm for producing a data placement scheme is limited to 5 minutes. To change this limit, you can edit lines 6-8 of “run.sh”.
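For illustration only: a common way to impose such per-process limits in Bash is coreutils `timeout`; whether “run.sh” uses this exact mechanism is an assumption, so treat the snippet below as a generic sketch, not the artifact's own code.

```shell
#!/bin/bash
# A command that finishes in time exits with its own status:
timeout 300s true && echo "finished within the limit"

# A command that exceeds its limit is killed; coreutils timeout returns 124:
timeout 1s sleep 5; status=$?
echo "exit status on timeout: $status"   # prints 124
```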

Testing with Your Own Input

If you wish to test our code with your own input, follow these steps:

  1. Create a text file named “ref.txt” and write the access sequence R in it (one data item per line). Each data item should be identified by a single 32-bit integer.
  2. Choose the cache size m and the packing factor p.
  3. Put “ref.txt” in the folder “nice-tree-decomposition” and run
    java -jar MakeNiceTreeDecomposition.jar m p
    This will create a nice tree decomposition for the access hypergraph of order q* and write it in NiceTreeDecomposition.txt
  4. Copy NiceTreeDecomposition.txt and ref.txt to the folder “Algorithm1” and run either
    ./Algorithm1 m p
    if m = 1, or
    ./Algorithm2 m p
    if m > 1. Note that, in the paper, we present two different algorithms for these cases. This is why there are different files here.
  5. The optimal number of cache misses will be output by the algorithm

Note: The program will crash if you change the values of m or p without generating a new tree decomposition, or if the wrong algorithm is run in step 4 above. If something does not work as expected, please write to us and include the input. We will look into it and answer ASAP.
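The steps above can be wrapped in a small script. The sketch below is hypothetical (the sample access sequence is made up, and the commands are only echoed as a dry run, since steps 3 and 4 must be executed inside the artifact's “nice-tree-decomposition” and “Algorithm1” folders); it mainly encodes the m = 1 vs. m > 1 choice that the note above warns about.

```shell
#!/bin/bash
m=1    # cache size
p=2    # packing factor

# Step 1: write the access sequence R, one 32-bit integer per line.
printf '%s\n' 1 2 1 3 2 > ref.txt

# Step 3: build the nice tree decomposition (drop `echo` to really run it).
echo java -jar MakeNiceTreeDecomposition.jar "$m" "$p"

# Step 4: pick the algorithm matching the cache size.
if [ "$m" -eq 1 ]; then algo=./Algorithm1; else algo=./Algorithm2; fi
echo "$algo" "$m" "$p"
```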