Guide to the Artifact
Our code passed the artifact evaluation at POPL 2019 (and we got these two cool badges!). We are very thankful to the AEC for their suggestions to improve this artifact. The code is also available on GitHub at https://github.com/Goharshady/data-packing-popl2019.
We consider the problem of data packing over a two-level memory system and provide an algorithm that solves it by exploiting the treewidth of the underlying access graphs. Our algorithm is exact and optimal, i.e. it outputs the minimum possible number of cache misses.
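To make the objective concrete, here is a small Python sketch (ours, for illustration only; it is not part of the artifact and not our algorithm) that counts the cache misses a given packing incurs on an access sequence, using Belady's offline-optimal replacement. The sequence, packings and parameter values below are made up:

```python
# Count the misses of a given packing on an access sequence, for a cache
# holding m blocks, with Belady's offline-optimal replacement (evict the
# cached block whose next use is furthest in the future).
def count_misses(accesses, packing, m):
    blocks = [packing[a] for a in accesses]  # packing: element -> block id
    cache, misses = set(), 0
    for i, b in enumerate(blocks):
        if b in cache:
            continue
        misses += 1
        if len(cache) == m:
            def next_use(x):
                try:
                    return blocks.index(x, i + 1)
                except ValueError:
                    return float('inf')  # never used again: best to evict
            cache.remove(max(cache, key=next_use))
        cache.add(b)
    return misses

# Four elements, packing factor p = 2 (two elements per block), cache m = 1.
seq = ['a', 'b', 'a', 'b', 'c', 'd']
good = {'a': 0, 'b': 0, 'c': 1, 'd': 1}  # a and b share a block
bad = {'a': 0, 'c': 0, 'b': 1, 'd': 1}   # a and c share a block
print(count_misses(seq, good, 1))  # 2
print(count_misses(seq, bad, 1))   # 6
```

Even in this toy example the point of data packing is visible: grouping elements that are accessed together (the first packing) yields far fewer misses than an arbitrary grouping.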
We have implemented our algorithm in Java. Our implementation relies on tree decompositions of the underlying access hypergraphs. These are obtained using the LibTW tool.
We use several common programs as benchmarks, all implemented in C++. We compare the results of our algorithm (i.e. the optimal number of cache misses) with those of several previous heuristics (also implemented in C++). Moreover, we have implemented a cache simulator to count the number of cache misses caused by each of these heuristics.
Note that data packing is a notoriously difficult problem. It is NP-hard even for a cache of size 1 and hard to approximate for any cache size greater than 4. Our main contribution in this paper is the first positive theoretical result and exact algorithm for data packing. The algorithm only applies when the underlying access graphs of a specific order have constant treewidth. Hence, we would like to make the following remarks about what to expect from the artifact:
Many of our scripts are in Bash. We suggest that you execute them on a 64-bit Linux machine (the provided virtual machine runs 64-bit Ubuntu). Our code is in C++ and Java, so you will need g++ for compilation and a JRE. The figures are visualized using Python 3 with numpy and matplotlib. These are all pre-installed on the virtual machine.
To compile all the C++ programs, open a terminal in the “Experiment” folder and run
Note that if you are running this outside the virtual machine, you will need to have g++ installed. Please make sure that you compile everything before running any experiments.
The main experiment (from which all of our experimental results are obtained) runs all possible instances for all benchmarks with 2 <= n <= 100, 1 <= m <= 5 and 2 <= p <= 5. [See the “custom input” section below for the definition of n.] We did not include p = 1, because data packing is trivial in this case (each data element should be in its own singleton block).
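As a rough sanity check on these numbers, the parameter grid alone gives about 61k instances. The "~550k runs" figure quoted below is consistent with counting each packing method's run on each instance separately; that interpretation, and the count of nine methods, are our guesses and are not stated by the artifact:

```python
# Size of the parameter grid described above (31 benchmarks per the
# "examples" section; ranges 2 <= n <= 100, 1 <= m <= 5, 2 <= p <= 5).
benchmarks = 31
n_count = len(range(2, 101))  # 99 values of n
m_count = len(range(1, 6))    # 5 values of m
p_count = len(range(2, 6))    # 4 values of p
instances = benchmarks * n_count * m_count * p_count
print(instances)       # 61380 instances
print(instances * 9)   # 552420 -- roughly 550k, if each of ~9 methods
                       # per instance counts as one run (our assumption)
```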
Note that this amounts to roughly 550k runs, each with a time limit of 5 minutes. However, many of these runs terminate much faster. On our machine, running with 12 threads, the process finished in ~10 hours.
To run the main experiment, simply execute the following command in the “Experiment” folder:
./main.sh k >> full_log.txt 2>&1
where k is the number of threads you want to allow for the execution of the experiment. In our case, k was 12.
Note: As mentioned in the Expectations section above, running the complete experiment takes a lot of time, memory and disk space. Make sure you have configured the virtual machine settings accordingly before attempting to run the main experiment.
Please ensure that the virtual machine has access to at least k CPU cores, 4k GB of RAM (i.e. 4 GB per thread) and 12 GB of disk space. A complete log of the experiment will be written to “full_log.txt”, which you can check to see the progress. [The provided “full_log.txt” already contains the log obtained on our machine.]
Note: You can customize which parts of the experiment you want to run by setting the loop bounds in “main.sh”.
Note: The complete input/output data of the experiment is included; you can download it and check it against smaller experiments that take fewer resources to run.
Note: To save time, if the experiment times out on an input, larger inputs of the same benchmark, with the same parameters, will be skipped.
After the experiment execution concludes, you can obtain a csv file containing a report of all the results by simply running:
./make_csv > data.csv
Note: If you customize the parameter ranges of the experiment, make sure you apply the same changes to the loops in the main function of “make_csv.cpp” and recompile.
Note: The final csv file of the experiments is included in the virtual machine.
If you ran the main experiment in its entirety, first copy the final csv file obtained from it to the “Visualization” folder under the name “data.csv”. The “data.csv” file obtained on our machine is already there.
Open a terminal in the “Visualization” folder and run
./makeFigA.sh
This will create and show the panels of Figure 15 and also save them as pdf files in the same folder, indexed by their example number. You can change makeFigA.sh to create similar graphs for other examples as well.
Similarly, Figure 16 can be obtained by running the corresponding script in the same folder.
These scripts simply visualize the data in the csv file. The output will be written to several png files in the same folder.
To obtain the first table, i.e. total number of cache misses, run
To obtain the other tables, i.e. cache misses by example type, run
Note: Figure 17 was created manually in Microsoft Excel using these tables.
The benchmark programs (examples) are numbered from 1 to 31. Each benchmark program takes a size value n as an input parameter and runs the respective algorithm on random inputs of size n (the randomization seed is fixed, making the process deterministic and reproducible). It then prints the resulting memory access sequence.
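As a rough Python analogue of how such a benchmark behaves (the real benchmarks are C++ and their exact access model may differ; the index-based sequence and the seed below are our invention), a fixed seed makes the random input, and hence the printed access sequence, fully reproducible:

```python
import random

def bubble_sort_accesses(n, seed=42):
    rng = random.Random(seed)        # fixed seed => same output every run
    a = [rng.randrange(100) for _ in range(n)]
    accesses = []
    for i in range(n):
        for j in range(n - 1 - i):
            accesses += [j, j + 1]   # bubble sort reads a[j] and a[j+1]
            if a[j] > a[j + 1]:
                a[j], a[j + 1] = a[j + 1], a[j]
                accesses += [j, j + 1]  # and writes them back on a swap
    return accesses

print(bubble_sort_accesses(3))
```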
For example, if you open a terminal in the “examples” folder and run
it prints an access sequence obtained from running bubble-sort on a random array of size 3.
To run a single instance of our experiment, simply execute
./experiment.sh m p en n
Here, m is the cache size (as in the paper), p is the packing factor (as in the paper), en is the example number and n is the size parameter passed to the example. For example, running
./experiment.sh 1 2 12 3
first generates an access sequence obtained from bubble-sorting an array of size 3 (as in the example above) and then runs our algorithm and all the heuristic algorithms on it to obtain data packing schemes for a cache of size 1 with a packing factor of 2. It then simulates the cache and counts the number of misses caused by each method. Finally, it outputs the number of cache misses caused by each method (starting with ours) and the time consumed by each.
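For intuition about what a heuristic in this comparison might look like, here is a toy first-touch packing sketch in Python (our illustration only, not one of the artifact's heuristics, whose details may differ): it fills blocks of size p with elements in the order they are first accessed.

```python
# Toy first-touch packing heuristic: assign each element to a block the
# first time it is accessed, opening a new block whenever one fills up.
def first_touch_packing(accesses, p):
    packing, next_block, filled = {}, 0, 0
    for a in accesses:
        if a not in packing:
            packing[a] = next_block
            filled += 1
            if filled == p:      # block full: start a new one
                next_block, filled = next_block + 1, 0
    return packing

print(first_touch_packing(['a', 'b', 'a', 'c', 'd'], 2))
# {'a': 0, 'b': 0, 'c': 1, 'd': 1}
```

By contrast, the artifact's own algorithm computes a packing that provably minimizes the number of misses.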
Note that in some instances, computing tree decompositions and dynamic programming values takes a lot of memory, so make sure that the machine has access to at least 8 GB of RAM.
After running an instance as above, you can find detailed information about it in the newly created folder “Experiment-m-p-en-n”. For example, after running the experiment above, the folder “Experiment-1-2-12-3” will be created which contains the input access sequence R, the tree decomposition, and logs. The complete input/output file contains these detailed results for all the instances of the main experiment.
By default, the time given to each algorithm for producing a data placement scheme is limited to 5 minutes. To change this limit, you can edit lines 6-8 of “run.sh”.
If you wish to test our code with your own input, follow these steps:
Note: The program will crash if you change the values of m and p without generating a new tree decomposition, or if the wrong algorithm is run in step 4 above. If something does not work as expected, please write to us and include the input. We will look into it and answer ASAP.